Author Topic: Parse string to segments  (Read 34023 times)

Offline Aurel

  • Jr. Member
  • *
  • Posts: 2
Parse string to segments
« on: September 25, 2010, 03:43:28 PM »
Hi to all...
I'm totally new to the assembler world of coding.
First of all, I'm just a hobby programmer in BASIC.
My BASIC compiler (which I currently use) has an option to add
inline assembly code.
I have written a BASIC-like interpreter in this compiler and I want more speed.
So my question is:
What is a simple way to make a parser, or in other words an extractor?
For example, I have a string like this:
a$ = "string1 string2 string3"
I want to extract these 3 delimited strings into 3 new strings
with NASM code.
Is there maybe an example somewhere of how to do this?
Or maybe one of you has an idea how to do this?

Thanks in advance....
Aurel

Offline Frank Kotler

  • NASM Developer
  • Hero Member
  • *****
  • Posts: 2667
  • Country: us
Re: Parse string to segments
« Reply #1 on: September 26, 2010, 03:09:56 PM »
Well... we don't really know what "a$" looks like. I seem to recall that BASIC uses a byte prefix with the length (or am I thinking of Pascal?)... And by "extract to a new string", I guess you mean copy the string (to newly allocated memory?). Depending on what you're doing, you may not need to copy the delimited strings - just "finding" them might be enough. But ASSuming that you've got an "lstring" - byte prefix with the length - and want to copy the delimited strings to similar, newly allocated, strings...

Code: [Select]
; nasm -f elf32 extract.asm
; ld -o extract extract.o -I/lib/ld-linux.so.2 -lc


global _start
extern malloc

%define DELIMITER ' '

section .text
_start:
bp1: ; just a breakpoint for debugging

    mov esi, basicstring
    lodsb ; get its length
    movzx ebx, al ; transfer it to ebx
    add ebx, esi ; end of string (so we know when we're done)

    mov edx, pointers ; array of pointers to extracted strings
   
.top:

; first, figure out how long our delimited string is
    xor ecx, ecx
.getlen:
    cmp byte [esi + ecx], DELIMITER
    jz .gotlen
    inc ecx
; if we're at the end of string, we won't find another delimiter, so check!
    lea edi, [esi + ecx]
    cmp edi, ebx
    jnz .getlen
.gotlen:

; then, allocate some memory for it
    inc ecx ; we need an extra byte for the length!

    push edx ; save our edx - malloc trashes it!
    push ecx ; both the parameter to malloc, and "save ecx"
    ; push ecx - for stdcall (Windows API) push it again!
    call malloc ; get some memory for our new string
    pop ecx ; restore our length
    pop edx ; restore our edx (pointers)

    ; should check if malloc succeeded - I ASSume it does :(
    mov [edx], eax ; save the address we got in "pointers" array
    add edx, 4 ; and get ready for next one
   
    mov edi, eax ; make our address "destination" for movsb
    dec ecx ; we don't need the "extra" byte anymore

    mov al, cl ; save the length byte
    stosb
    rep movsb ; and copy the string

    inc esi ; we left esi pointed at the delimiter - move past it
    inc dword [stringcount] ; count our delimited strings
    cmp esi, ebx ; are we done?
    jb .top ; no? do more.

; we're finished - print 'em, just to prove it worked :)
; this part is specific to Linux.

    mov esi, pointers
print_next:
    mov ecx, [esi] ; address of our delimited string
    add esi, 4 ; get ready for next one
    movzx edx, byte [ecx] ; Linux wants the length in edx
    inc ecx ; move past the length byte
    mov ebx, 1 ; STDOUT
    mov eax, 4 ; __NR_write
    int 80h ; call kernel

    mov ecx, newline
    mov edx, 1
    mov ebx, 1
    mov eax, 4
    int 80h

    dec dword [stringcount]
    jnz print_next

exit:
    xor ebx, ebx ; exit status 0 (ebx still held 1 from the write calls)
    mov eax, 1 ; __NR_exit
    int 80h
;-----------------------

section .data

basicstring db .end - basicstring - 1
    db "string1 string2 string3"
.end:

newline db 10

;------------------
section .bss
    pointers resd 128
    stringcount resd 1
;----------------------   


That probably isn't what you want - not in Linux, anyway - but maybe it'll give you an idea how to approach it.
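
For anyone who finds C easier to follow than the listing above, here is a rough sketch of the same idea: walk a length-prefixed string, and copy each delimited piece into a newly allocated length-prefixed string. The function name `split_lstring` and the `max_out` bound are made up for this illustration and are not part of the assembly version, which has no bounds check.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not from the thread: splits a length-prefixed
 * string (first byte = length) on `delim`, copying each piece into a
 * newly malloc'ed length-prefixed string. Returns the number of pieces
 * stored in `out`, or -1 if malloc fails. The caller frees the pieces. */
int split_lstring(const unsigned char *src, unsigned char delim,
                  unsigned char **out, int max_out)
{
    assert(src != NULL && out != NULL);
    const unsigned char *p = src + 1;        /* skip the length byte */
    const unsigned char *end = p + src[0];   /* one past the last character */
    int count = 0;

    while (p < end && count < max_out) {
        size_t len = 0;                      /* length of this piece */
        while (p + len < end && p[len] != delim)
            len++;

        unsigned char *piece = malloc(len + 1); /* +1 for the length byte */
        if (piece == NULL)
            return -1;
        piece[0] = (unsigned char)len;       /* store the length prefix */
        memcpy(piece + 1, p, len);           /* then the characters */
        out[count++] = piece;

        p += len + 1;                        /* step past the delimiter */
    }
    return count;
}
```

Called on the 23-byte test string with a space as delimiter, this should fill `out` with three pieces of length 7 each. Unlike the assembly, it returns 0 pieces for an empty input instead of one empty piece.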

Best,
Frank


Offline Aurel

  • Jr. Member
  • *
  • Posts: 2
Re: Parse string to segments
« Reply #2 on: September 26, 2010, 03:39:31 PM »
First of all, thank you very much, Frank. I now understand a little better
how things work.
It's about a BASIC compiler called EBasic, which includes NASM.
This compiler is only for Windows.
One guy helped me too, and here is his code:
BASIC code:
Code: [Select]
DECLARE Split(Inp$:STRING,Deliminator:CHAR,RetArray:POINTER),INT
CONST MaxSplit = 17
DEF A$,src:STRING
def I:INT
'def w:pointer
DEF StrPArray[MaxSplit]:INT
A$ = "This string will be"
OPENCONSOLE
'PRINT A$
src=A$:A$=""
W = Split(src," ",StrPArray)
PRINT "Number of strings:",str$(W)
print

FOR I = 0 TO W-1
PRINT *<STRING>(StrPArray[I])
NEXT I

DO
UNTIL INKEY$<>""
END

and here is assembler code:
Code: [Select]
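_asm
Split:  push ebp
        mov ebp, esp
        push esi
        push edi
        push ebx
        mov edi, [ebp+8]          ; Inp$: string to scan
        mov esi, [ebp+16]         ; RetArray: where the pointers go
        xor ecx, ecx              ; running character count (not used afterwards)
        xor ebx, ebx              ; number of substrings found
        movzx eax, byte [ebp+12]  ; al = delimiter, ah = 0
C01:    mov [esi], edi            ; store the start of the current substring
        inc ebx
C00:    cmp byte [edi], 0         ; end of input?
        jz Exit
        inc ecx
        scasb                     ; compare al with [edi], then advance edi
        jnz C00
        lea esi, [esi+4]          ; next slot in the pointer array
        mov [edi-1], ah           ; overwrite the delimiter with a zero terminator
        jmp C01
Exit:   mov dword [esi+4], 0      ; Can be omitted since we have a return value.
        xchg eax, ebx             ; return the count in eax
        pop ebx
        pop edi
        pop esi
        leave
        ret 0x0C                  ; callee removes the 12 bytes of arguments
_endasm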

It works, but the original string is destroyed; that's not important, though.
Each string is a piece of text inside the quotes, and each new string is put in
an array element, as you can see.
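
To show why the original string gets destroyed, here is a rough C sketch of the same in-place idea: every delimiter is overwritten with a zero byte and the start of each piece is stored in the array. The name `split_in_place` and the `max_out` limit are assumptions made for this illustration; the assembly routine has no such limit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C rendering of the in-place approach: overwrite each
 * delimiter with '\0' and record where every piece starts. The input
 * buffer is modified, so the original string does not survive. */
int split_in_place(char *s, char delim, char **out, int max_out)
{
    assert(s != NULL && out != NULL && max_out > 0);
    int count = 0;
    out[count++] = s;                 /* first piece starts at the buffer */
    for (; *s != '\0'; s++) {
        if (*s == delim) {
            if (count == max_out)
                break;                /* array full: leave the rest unsplit */
            *s = '\0';                /* terminate the current piece */
            out[count++] = s + 1;     /* next piece starts right after it */
        }
    }
    return count;                     /* number of pointers stored */
}
```

With a writable buffer holding "This string will be" and a space as delimiter, this should return 4, and the array entries point into the same buffer. No memory is allocated, which is the main attraction of this variant.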

Your explanation of how assembler works is great; it's the first time I've seen
an explanation like this, and I understand much better.
Of course I will try your code too.
Thanks again...

Aurel

 

Offline munair

  • Jr. Member
  • *
  • Posts: 37
  • Country: nl
  • SharpBASIC compiler developer
    • SharpBASIC
Re: Parse string to segments
« Reply #3 on: November 13, 2021, 10:50:37 AM »
When linking Frank's example, I get:

[frank@frank-pc xmpl]$ ld -o extract extract.o -I/lib/ld-linux.so.2 -lc
ld: i386 architecture of input file `extract.o' is incompatible with i386:x86-64 output

and:

[frank@frank-pc xmpl]$ ld -m elf_i386 -o extract extract.o -I/lib/ld-linux.so.2 -lc
ld: skipping incompatible /usr/lib/libc.so when searching for -lc
ld: skipping incompatible /usr/lib/libc.a when searching for -lc
ld: cannot find -lc
ld: skipping incompatible /usr/lib/libc.so when searching for -lc
SharpBASIC (www.sharpbasic.com) is a compiler in development that uses NASM as backend.

Offline Frank Kotler

  • NASM Developer
  • Hero Member
  • *****
  • Posts: 2667
  • Country: us
Re: Parse string to segments
« Reply #4 on: November 13, 2021, 10:04:14 PM »
Dunno.
Code: [Select]
apt-get install gcc-multilib
perhaps?

Or do it in 64 bits?

Best,
Frank


Offline fredericopissarra

  • Full Member
  • **
  • Posts: 373
  • Country: br
Re: Parse string to segments
« Reply #5 on: November 13, 2021, 11:31:05 PM »
What calling convention is used by your basic compiler? (Which compiler?)
The use of the stack to pass arguments to a function/procedure in x86-64 mode isn't usual...

Offline Frank Kotler

  • NASM Developer
  • Hero Member
  • *****
  • Posts: 2667
  • Country: us
Re: Parse string to segments
« Reply #6 on: November 14, 2021, 02:50:48 AM »
I should have stopped after "Dunno".

The code I posted was 32-bit, probably assembled and linked on a 32-bit system. I have no memory of where those parameters came from.

Where are we now? Dunno.

Probably best to start over: What OS and what do you need to do?

Best,
Frank


Offline munair

  • Jr. Member
  • *
  • Posts: 37
  • Country: nl
  • SharpBASIC compiler developer
    • SharpBASIC
Re: Parse string to segments
« Reply #7 on: November 15, 2021, 06:29:04 AM »
I compile 32-bit code with NASM and LD on Manjaro Linux x64. The code for the 32-bit SharpBASIC compiler I work on (not to be confused with the BASIC referred to by the OP) compiles and runs fine. But Frank's example doesn't. I'm sure it's because of ld-linux.so.2, which I don't use for my code.
« Last Edit: November 15, 2021, 06:37:20 AM by munair »

Offline fredericopissarra

  • Full Member
  • **
  • Posts: 373
  • Country: br
Re: Parse string to segments
« Reply #8 on: November 15, 2021, 02:01:20 PM »
The question is there because of this error:
Code: [Select]
[frank@frank-pc xmpl]$ ld -o extract extract.o -I/lib/ld-linux.so.2 -lc
ld: i386 architecture of input file `extract.o' is incompatible with i386:x86-64 output
You are trying to link an ELF32 object file to an ELF64 one. So, what is it? 64 or 32 bits?
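
If it helps to check what a given file actually is, the class is recorded in the fifth byte of the ELF header (`e_ident[EI_CLASS]`: 1 for ELF32, 2 for ELF64). A minimal C sketch of that check follows; the function name `elf_class_bits` is made up for this example, and it only looks at the first five bytes of whatever buffer it is given.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 32 or 64 for an ELF header, or 0 if the
 * bytes do not start with the ELF magic or the class byte is unknown. */
int elf_class_bits(const unsigned char *hdr, size_t n)
{
    assert(hdr != NULL);
    if (n < 5 || hdr[0] != 0x7f || hdr[1] != 'E' ||
        hdr[2] != 'L' || hdr[3] != 'F')
        return 0;                     /* not an ELF file */
    switch (hdr[4]) {                 /* e_ident[EI_CLASS] */
    case 1:  return 32;               /* ELFCLASS32 */
    case 2:  return 64;               /* ELFCLASS64 */
    default: return 0;
    }
}
```

To use it, read the first five bytes of the object file (for example with `fread`) and pass them in. An object assembled with `nasm -f elf32` should report 32.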

Offline munair

  • Jr. Member
  • *
  • Posts: 37
  • Country: nl
  • SharpBASIC compiler developer
    • SharpBASIC
Re: Parse string to segments
« Reply #9 on: November 15, 2021, 10:12:42 PM »
The only object file is extract.o, which is 32 bits. I finally got it linked with:

Code: [Select]
ld -o extract extract.o -m elf_i386 -L /usr/lib32 -lc -dynamic-linker /lib/ld-linux.so.2

It also links without "-dynamic-linker", but the resulting executable doesn't run: although the file is there, I get "no such file or directory".

« Last Edit: November 15, 2021, 10:23:05 PM by munair »

Offline munair

  • Jr. Member
  • *
  • Posts: 37
  • Country: nl
  • SharpBASIC compiler developer
    • SharpBASIC
Re: Parse string to segments
« Reply #10 on: November 15, 2021, 10:15:55 PM »
I'm happy that 'extract' works. I will use it as an example for string manipulation routines in the SharpBASIC compiler.