Author Topic: Find Substring  (Read 3236 times)

Offline munair

  • Jr. Member
  • *
  • Posts: 37
  • Country: nl
  • SharpBASIC compiler developer
    • SharpBASIC
Find Substring
« on: December 03, 2021, 08:08:06 PM »
For the SharpBASIC compiler I've written a function called instr (in-string) with inline asm to find a substring from a specific start position. The function works, but I'm learning ASM as I go, so I wonder if the code is ok or if it could be optimized/shortened. Elsewhere I read that repe cmpsb isn't the fastest way to compare strings. Any suggestions are very welcome.
Code: [Select]
func _sb_instr(src: str, sbs: str, start: int): int
  dim srcl, sbsl: int; ' source and sub-string length
do
  asm do
    push edi
    push esi

    mov     eax, [ebp + 16]               ; source string descriptor
    call    _sb_load_strdesc              ; address in ebx, length in edx

    mov     esi, ebx                      ; string address in esi

    test    edx, edx                      ; non-zero is fixed-length
    jnz     .next1
    call    _sb_strzlen                   ; null-terminated string

    .next1:
    test    edx, edx                      ; anything to search in?
    jz      .out
   
    mov     [ebp - 8], edx                ; save source string length

    mov     eax, [ebp + 12]               ; sub-string descriptor
    call    _sb_load_strdesc              ; address in ebx, length in edx

    test    edx, edx                      ; non-zero is fixed-length
    jnz     .next2
    call    _sb_strzlen                   ; null-terminated string

    .next2:
    test    edx, edx                      ; anything to search for?
    jz      .out
   
    mov     [ebp - 12], edx               ; save sub-string length

    mov     ecx, [ebp + 8]                ; start position
    mov     edx, [ebp - 8]                ; source string length
    sub     edx, ecx                      ; length minus start position
    mov     eax, edx                      ; save length
    sub     eax, [ebp - 12]               ; subtract sub-string length
    cmp     eax, 0
    jl      .out                          ; sub-string is longer

    dec     ecx                           ; 0-based
    push    esi                           ; save start address
    add     esi, ecx                      ; start address from start position

    .repeat:                              ; compare loop
    mov     ecx, [ebp - 12]               ; substring length is loop counter
    mov     edi, ebx                      ; sub-string address in edi
    cld
    mov     eax, esi                      ; save esi
    repe    cmpsb                         ; compare strings (edi with esi)
    jz      .done

    inc     eax                           ; move up one byte
    mov     esi, eax                      ; restore modified esi
    dec     edx                           ; decrement source string length
    and     edx, edx                      ; non-zero means continue search
    jnz     .repeat

    jmp     .out

    .done:
    sub     esi, [ebp-12]                 ; beginning of sub-string
    pop     eax                           ; restore start address
    sub     esi, eax                      ; start position in string
    inc     esi                           ; position 1-based
    mov     [ebp - 4], esi                ; save result

    .out:
    pop esi
    pop edi
  end;
end;
SharpBASIC (www.sharpbasic.com) is a compiler in development that uses NASM as backend.

Offline fredericopissarra

  • Full Member
  • **
  • Posts: 368
  • Country: br

Offline munair

  • Jr. Member
  • *
  • Posts: 37
  • Country: nl
  • SharpBASIC compiler developer
    • SharpBASIC
Re: Find Substring
« Reply #2 on: December 04, 2021, 05:52:06 PM »
Rabin-Karp algorithm
Knuth-Morris-Pratt algorithm
Boyer-Moore algorithm

Examples in the links provided suggest these advanced search algorithms are primarily for higher level languages and larger pieces of texts. The basic InStr function is most often used for smaller sized strings. For this reason I also did not optimize the loop to test the remaining source string length against the sub-string length.
SharpBASIC (www.sharpbasic.com) is a compiler in development that uses NASM as backend.

Offline debs3759

  • Global Moderator
  • Full Member
  • *****
  • Posts: 221
  • Country: gb
    • GPUZoo
Re: Find Substring
« Reply #3 on: December 04, 2021, 10:10:49 PM »
Examples in the links provided suggest these advanced search algorithms are primarily for higher level languages and larger pieces of texts. The basic InStr function is most often used for smaller sized strings. For this reason I also did not optimize the loop to test the remaining source string length against the sub-string length.

The more you learn about asm coding, the easier it is to look at an algorithm as pseudo code and translate it. repsb is the easiest to code, but easy/short code is not always the best. I'm going to take a better look at those algorithms to see how I can adapt them into routines I can use in asm apps. The only place I have done string searches so far in my limited coding is in cpuid code and my boot sector, but I do hope to get back into OS coding in the next year, with small code for AT class systems and faster code for 386 and later (I like working with old hardware).
My graphics card database: www.gpuzoo.com

Offline munair

  • Jr. Member
  • *
  • Posts: 37
  • Country: nl
  • SharpBASIC compiler developer
    • SharpBASIC
Re: Find Substring
« Reply #4 on: December 05, 2021, 09:39:38 AM »
Interesting. I also keep older computers and still use them if possible, such as the server for the website and forum, which is a 16 years old 32 bits system. I also still have a 80-386 SX 25MHz with 16MB RAM. It was my first DOS-Win31 system that I used A LOT for programming, although my first programming experience was on a ZX80 with 4KB RAM connected to a TV ;D.

Back on topic, the algorithms are indeed very interesting, but hashes on small strings IMO is overkill. They're more suited for search functions in browsers or word processors. For memo-stuff up to a few hundred words, I wonder which approach is faster.

SharpBASIC (www.sharpbasic.com) is a compiler in development that uses NASM as backend.