Author Topic: some questions about the manual (Read 23518 times)

codeferever · « **on:** December 01, 2010, 12:16:51 PM »

Hello everyone !
I have some questions from reading the manual:
>In a function such as printf which takes a variable number
>of parameters, the pushing of the parameters in reverse order means
>that the function knows where to find its first parameter, which tells
>it the number and type of the remaining ones.
My question is about "its first parameter"; can someone make it clear?

In chapter 9, "Writing 32-bit Code", on page 103:
>There is an alternative calling convention used by Win32 programs for Windows API calls,...
>...in that the callee clears the stack by passing a parameter to the RET instruction. ...
The second line really puzzled me...

Why not :
add esp, 8 ; remove the parameter after call instruction
but:
add esp,byte 8

Bryant Keller · « **Reply #1 on:** December 01, 2010, 01:46:31 PM »

Quote from: codeferever on December 01, 2010, 12:16:51 PM

Hello everyone !
I have some questions from reading the manual:
>In a function such as printf which takes a variable number
>of parameters, the pushing of the parameters in reverse order means
>that the function knows where to find its first parameter, which tells
>it the number and type of the remaining ones.
My question is about "its first parameter"; can someone make it clear?

The first parameter to printf is the format string. In the format string are tokens qualified by the '%' character which the function uses to identify how many and what type of arguments follow that first argument.

Code: [Select]

printf( "%s %d %c %c", "hello", 1, '!', '\n' );
Notice that the number of '%' tokens is equal to the number of arguments following the format string.
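
To make the mechanism concrete, here is a minimal sketch in C of a function that walks its format string to decide how many variadic arguments to fetch and of what type. The name mini_format and its buffer parameters are hypothetical, made up for illustration; it only handles %s, %d, %c and %%, and it is not how any real printf is implemented.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not a library function): formats into out, which has
   room for cap bytes, and returns how many variadic arguments it consumed. */
int mini_format(char *out, size_t cap, const char *fmt, ...)
{
    va_list ap;
    size_t len = 0;
    int consumed = 0;

    if (cap == 0)
        return 0;

    va_start(ap, fmt);
    for (const char *p = fmt; *p != '\0'; ++p) {
        char tmp[32];
        const char *piece = tmp;

        if (*p != '%') {                 /* ordinary character: copy it */
            tmp[0] = *p;
            tmp[1] = '\0';
        } else {
            ++p;                         /* look at the character after '%' */
            if (*p == 's') {             /* next argument is a string pointer */
                piece = va_arg(ap, const char *);
                ++consumed;
            } else if (*p == 'd') {      /* next argument is an int */
                snprintf(tmp, sizeof tmp, "%d", va_arg(ap, int));
                ++consumed;
            } else if (*p == 'c') {      /* a char is passed promoted to int */
                tmp[0] = (char)va_arg(ap, int);
                tmp[1] = '\0';
                ++consumed;
            } else if (*p == '%') {      /* "%%" prints a literal percent sign */
                tmp[0] = '%';
                tmp[1] = '\0';
            } else {
                break;                   /* unsupported token or stray '%' */
            }
        }
        while (*piece != '\0' && len + 1 < cap)
            out[len++] = *piece++;
    }
    va_end(ap);

    out[len] = '\0';
    return consumed;
}
```

Called with the format string from the example above and the same four arguments, it consumes four arguments, one per token. The function has no other way to learn that number, which is why the format string has to be the parameter it can always find.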

Quote from: codeferever on December 01, 2010, 12:16:51 PM

In chapter 9, "Writing 32-bit Code", on page 103:
>There is an alternative calling convention used by Win32 programs for Windows API calls,...
>...in that the callee clears the stack by passing a parameter to the RET instruction. ...
The second line really puzzled me...

Why not :
add esp, 8 ; remove the parameter after call instruction
but:
add esp,byte 8

The Win32 STDCALL convention doesn't require the caller to clean up the stack. The procedure does it for you by using RET <NUM>. STDCALL was created for procedures which don't take variadic arguments (unlike printf). Since printf() has no way of telling, when it is written, how many bytes of arguments each caller will push, the caller has to clean up the stack itself after the call. Other functions, like ExitProcess, take a fixed number of arguments, so the cleanup can be built into the function itself (which saves a few bytes in your program).

HtH,
~ Bryant

Rob Neff · « **Reply #2 on:** December 01, 2010, 01:47:59 PM »

Two separate issues here, although both involve the stack:

Quote from: codeferever on December 01, 2010, 12:16:51 PM

My question is about "its first parameter"; can someone make it clear?

The first parameter to printf() is the format string (e.g. "Hello, %s! This is your %d post on these boards\n").
That string contains the formatting instructions for printf and tells it what to expect on the stack following the address of that string.
In this case the string tells printf that there will be a pointer to another string ( %s ) and an integer ( %d ).
The stack itself would look like the following upon entry into printf:

Code: [Select]

Please note that these are imaginary numbers and have no real significance
except to demonstrate the stack layout.

[ 0x000002A ]      <- a simple integer ( 42 )
[ 0x0402000 ]      <- address of the name string
[ 0x0401000 ]      <- address of the format string
[ 0x0C01234 ]      <- [esp] - return address

Thus that format string is vital to printf/scanf family of functions.

Quote from: codeferever on December 01, 2010, 12:16:51 PM

In chapter 9, "Writing 32-bit Code", on page 103:
>There is an alternative calling convention used by Win32 programs for Windows API calls,...
>...in that the callee clears the stack by passing a parameter to the RET instruction. ...
The second line really puzzled me...

Why not :
add esp, 8 ; remove the parameter after call instruction
but:
add esp,byte 8

What really happens is that the final RET instruction carries the number of additional bytes to pop off the stack before control is transferred back to the caller. In the Win32 calling convention, functions marked as __stdcall have a cdecl-style function prologue, where the parameters are passed on the stack as for a normal C function, but, unlike C, the called function performs the stack cleanup (similar to the Pascal calling convention).

Thus the called function will issue a RET XXX ( where XXX is the number of bytes to add to the stack pointer following the return ).
Doing this helps to slim down code size by not having add esp, XXX after each call throughout your source code.
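
The bookkeeping can be checked with a toy model. The sketch below is plain C that only simulates a 32-bit stack pointer; the Cpu struct and the function names are made up for illustration, and it runs no real calling convention. It assumes two 4-byte arguments, so the cleanup amount is 8 bytes in both cases.

```c
#include <stdint.h>

/* Toy model of the stack pointer only; every name here is illustrative. */
typedef struct { uint32_t esp; } Cpu;

/* PUSH (and the return address pushed by CALL) moves ESP down by 4. */
static void push32(Cpu *c) { c->esp -= 4; }

/* RET n pops the return address (4 bytes), then n more bytes. */
static void ret_n(Cpu *c, uint32_t n) { c->esp += 4 + n; }

/* cdecl: the callee ends with a plain RET, the caller removes the arguments. */
uint32_t esp_after_cdecl_call(uint32_t esp_before)
{
    Cpu c = { esp_before };
    push32(&c);          /* push second argument */
    push32(&c);          /* push first argument */
    push32(&c);          /* call: pushes the return address */
    ret_n(&c, 0);        /* callee: ret */
    c.esp += 8;          /* caller: add esp, 8 */
    return c.esp;
}

/* stdcall: the callee ends with RET 8, the caller does nothing afterwards. */
uint32_t esp_after_stdcall_call(uint32_t esp_before)
{
    Cpu c = { esp_before };
    push32(&c);          /* push second argument */
    push32(&c);          /* push first argument */
    push32(&c);          /* call: pushes the return address */
    ret_n(&c, 8);        /* callee: ret 8 */
    return c.esp;
}
```

Both functions return the value they were given, so the stack ends up balanced either way. The only difference is which side carries the cleanup instruction.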

Hope that helps!

Edit: Hah! Bryant beat me by a minute and a half!

codeferever · « **Reply #3 on:** December 02, 2010, 02:04:45 PM »

Thank you, I see.
Here is my first attempt at this,
but there are still some problems; I hope
someone can help me.

Code: [Select]

; int printf(char *buff,const char *fmt,va_list args)
; [ss:sp] = ?  in 32-bit by default

global _printf

section .text

countw	equ		0	;to count the number of '%s'
countb	equ		0	;to count the number of '%d' or '*c'

_printf:
	push ebp
	mov ebp,esp
	sub esp,0x40	;32 words of local stack space,for some temporary variables
	mov ebx,[ebp+8]	;the first parameter is a string's pointer
	
;Such as "Hello %s, welcome to asm world." we deal with it like this:
;	"Hello ",'%s',"welcome to asm world." three parts.
;	Then replace the %s with the second parameter(ebp+12),
;	and combine the parts as a result.Yes ?
	xor cx,cx
	mov stringtmp,ebx	;How to define a pointer ? And if it's right ?
loop1:
	mov ax,byte [ebx+cx]
	cmp ax,'\0'	;judge if it's the end, but I don't know whether it's right...
	jz	the_end
	inc cx
	cmp ax,'%s'
	jz	fmt_s
	cmp ax,'%c'
	jz	fmt_c
	cmp ax,'%d'
	jz	fmt_d
	; more ...
	jmp loop1
fmt_s:
	mov dx,countw
	inc dx
	mov eax,[ebp+8+countw<<2+countb<<1]
	...
...

Bryant Keller · « **Reply #4 on:** December 03, 2010, 12:27:43 AM »

Quote

Code: [Select]
; int printf(char *buff,const char *fmt,va_list args)
; [ss:sp] = ?  in 32-bit by default

global _printf

section .text

countw	equ		0	;to count the number of '%s'
countb	equ		0	;to count the number of '%d' or '*c'

_printf:
	push ebp
	mov ebp,esp
	sub esp,0x40	;32 words of local stack space,for some temporary variables
	mov ebx,[ebp+8]	;the first parameter is a string's pointer
	
;Such as "Hello %s, welcome to asm world." we deal with it like this:
;	"Hello ",'%s',"welcome to asm world." three parts.
;	Then replace the %s with the second parameter(ebp+12),
;	and combine the parts as a result.Yes ?
	xor cx,cx
	mov stringtmp,ebx	;How to define a pointer ? And if it's right ?
loop1:
	mov ax,byte [ebx+cx]
	cmp ax,'\0'	;judge if it's the end, but I don't know whether it's right...
	jz	the_end
	inc cx
	cmp ax,'%s'
	jz	fmt_s
	cmp ax,'%c'
	jz	fmt_c
	cmp ax,'%d'
	jz	fmt_d
	; more ...
	jmp loop1
fmt_s:
	mov dx,countw
	inc dx
	mov eax,[ebp+8+countw<<2+countb<<1]
	...
...

First thing I notice is you shouldn't be using equ like it's a runtime variable. Equates are resolved at assemble time, and I'm assuming you are trying to use them as incrementing values for tracking arguments, in which case it won't work.

Next thing I notice is that you reserve variable space on the stack (which isn't needed) and then try to access that variable through a symbol which hasn't been defined. If you want to use arguments in that way, either use the NASMX project from http://nasmx.sourceforge.net or make use of the procedure handling directives built-in to NASM.

Code: [Select]

slow_swap:
%push
%stacksize flat
%assign %$localsize 0
%arg src:dword, dst:dword
%local tmp:dword
	enter %$localsize, 0
	mov eax, [src]
	mov [tmp], eax
	mov eax,[tmp]
	mov [dst], eax
	leave
	ret
%pop

The above code is horrible and should never be used. But it's a great example of using arguments and locals with NASM's built-in directives. Please read the manual for more information.

http://www.nasm.us/doc/nasmdoc4.html#section-4.8

Finally, in your loop the comparisons are all wrong. AFAIK the line 'mov ax,byte [ebx+cx]' won't even assemble (or shouldn't) since you are trying to typecast a byte value to a word storage and 'mov' doesn't zero extend on its own (that's what the movzx variant is for). Also, when you compare the values you are assuming that a word has been read, so maybe you actually meant the previous to be 'mov ax, word [ebx+ecx]'. However, even if that was the case you have to expect that it would be 'c%', 's%', etc. Intel processors are little endian, so word value should be reversed. To avoid this type of confusion, I suggest working byte by byte, which will make things a bit easier on you. You aren't really gaining anything by reading word values, since you are still incrementing one byte at a time; the saving would have come from stepping a word at a time. I don't really care for the "state machine" style comparison you have going on (just personal opinion), so I cooked up an example that uses a more if/elsif/else/endif style code layout.

The following code has been tested and is just unoptimized enough to give you a few things to play around with. For example, once you've added the OS dependent stuff, you might think about calculating the size of the string and using dword incrementing to reduce the number of iterations.

Code: [Select]

[BITS 32]
[CPU 386]
[GLOBAL print]
[SECTION .text]
;; --------------------------------------------------
; @brief prints a formatted string.
; @param fmt	- format string
; @param ...	- variadic argument list
; @return	- number of qualifiers handled in format.
;; --------------------------------------------------
print:
	push ebp
	mov ebp, esp

	;; --------------------------------------------------

	xor eax, eax
	xor ecx, ecx
	mov esi, [ebp+8+(ecx*4)]
	inc ecx

run_again:
	movzx eax, byte [esi]	; zero-extend so stale upper bits of EAX cannot defeat the end-of-string test
	inc esi

	or eax, eax
	jz is_done

	cmp al, '%'
	jnz print_char

	mov al, [esi]
	inc esi

	cmp al, 's'
	jne not_string

	;; --------------------------------------------------

	mov edi, [ebp+8+(ecx*4)]
	inc ecx

	;; --------------------------------------------------
	;; PRINT ASCIIZ STRING IN ARGUMENT (IN EDI)
	;; --------------------------------------------------

	jmp run_again

not_string:
	cmp al, 'c'
	jne not_character

	;; --------------------------------------------------
	;; PRINT CHARACTER ARGUMENT (IN EAX)
	;; --------------------------------------------------

	mov eax, [ebp+8+(ecx*4)]
	inc ecx			; consume the character argument
	jmp print_char

not_character:
	cmp al, 'd'
	jne not_decimal

	;; --------------------------------------------------

	mov edi, [ebp+8+(ecx*4)]
	inc ecx

	;; --------------------------------------------------
	;;  CONVERT NUMBER IN EDI TO DECIMAL STRING THEN PRINT
	;; --------------------------------------------------

	jmp run_again

not_decimal:
	cmp al, 'x'
	jne not_hexadecimal

	;; --------------------------------------------------

	mov edi, [ebp+8+(ecx*4)]
	inc ecx

	;; --------------------------------------------------
	;; CONVERT NUMBER IN EDI TO HEX STRING THEN PRINT 
	;; --------------------------------------------------

	jmp run_again

not_hexadecimal:
	cmp al, '%'
	je print_char

	;; --------------------------------------------------
	;; PRINT FORMAT ERROR MESSAGE
	;; --------------------------------------------------

	jmp is_done

print_char:

	;; --------------------------------------------------
	;; PRINT CHARACTER IN AL
	;; --------------------------------------------------

	jmp run_again

is_done:

	;; --------------------------------------------------
	;;  Swap EAX & ECX to return number of handled tokens
	;; --------------------------------------------------

	xor eax, ecx
	xor ecx, eax
	xor eax, ecx
	dec eax

	;; --------------------------------------------------

	leave
	ret

This example 'print' function handles '%%', '%s', '%c', '%d', and '%x' tokens. As I said before, you'll need to write in the OS dependent stuff. This code leaves a lot of room for improvement. The only "optimization" I've done is using the generic XOR swapping algorithm at the end; I point this out because it tends to confuse people. At the end, EAX=n and ECX=0, where 'n' is the number of argument-consuming % tokens you've handled (note that the C printf returns the number of characters written instead). I also left you a place to do error handling and possibly clean up the stack in case an invalid/unsupported token modifier is passed. I tested this code by inserting 'printf()' to handle the real output and letting this parser direct tokenizing. Then I traced the call stack in gdb to make sure it invoked printf the right number of times for the string printf("Hello, %s%c%c", "Bryant", '!', 10).
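
For anyone puzzled by the three XOR instructions at the end, this is the same trick written out in C. It is only a sketch: the name xor_swap is made up for illustration, and the comments map each statement to the corresponding instruction in the listing.

```c
#include <stdint.h>

/* Swaps *a and *b without a temporary, mirroring the three XORs above. */
void xor_swap(uint32_t *a, uint32_t *b)
{
    if (a == b)      /* XORing an object with itself would zero it */
        return;
    *a ^= *b;        /* xor eax, ecx : a now holds a ^ b           */
    *b ^= *a;        /* xor ecx, eax : b now holds the original a  */
    *a ^= *b;        /* xor eax, ecx : a now holds the original b  */
}
```

On x86 a single xchg eax, ecx has the same effect on the two registers, so the XOR form is a curiosity rather than a requirement.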

Regards,
Bryant Keller

Edited for awkward tab-stops.

codeferever · « **Reply #5 on:** December 03, 2010, 06:58:40 AM »

Here is a routine written for not_hex; I don't know if it is right...

Code: [Select]

;parameter in eax, a decimal in it
;return with eax, a hex data in it
decimal_hex:
xor ebx,ebx
decimal_hex_begin:
shr eax
or ebx,eflags & 0x00000001  
shl ebx
decimal_hex_loop:
or eax,eax
jnz decimal_hex_begin
decimal_hex_end:
mov eax,ebx
ret
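
A value in a register is already binary, so for the '%x' branch the conversion that matters is presumably from the number to its hexadecimal digits as text. Here is a minimal sketch of that idea in C, assuming lowercase digits and no leading zeros; the name to_hex and its buffer parameter are hypothetical, and the same shift-and-mask loop carries over to assembly with shr and and.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes value as lowercase hex text into out, which
   must have room for 9 bytes (8 digits plus the terminator), and returns
   the number of digits written. */
size_t to_hex(uint32_t value, char out[9])
{
    static const char digits[] = "0123456789abcdef";
    char tmp[8];
    size_t n = 0;

    do {                              /* peel off 4 bits at a time, lowest first */
        tmp[n++] = digits[value & 0xF];
        value >>= 4;
    } while (value != 0);

    for (size_t i = 0; i < n; ++i)    /* the digits came out in reverse order */
        out[i] = tmp[n - 1 - i];
    out[n] = '\0';
    return n;
}
```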

cm · « **Reply #6 on:** December 03, 2010, 01:21:20 PM »

Quote from: Bryant Keller on December 03, 2010, 12:27:43 AM

[...] or make use of the procedure handling directives built-in to NASM.

I wouldn't recommend these.

Quote

Finally, in your loop the comparisons are all wrong. AFAIK the line 'mov ax,byte [ebx+cx]' won't even assemble (or shouldn't) since you are trying to typecast a byte value to a word storage and 'mov' doesn't zero extend on its own (that's what the movzx variant is for). Also, when you compare the values you are assuming that a word has been read, so maybe you actually meant the previous to be 'mov ax, word [ebx+ecx]'.

I think [ebx+cx] isn't even a valid addressing form.

Quote

However, even if that was the case you have to expect that it would be 'c%', 's%', etc. Intel processors are little endian, so word value should be reversed.

NASM already reverses immediate string values (with more than one character) for you if they are used as word or dword operands. Therefore, the following code is right (and checks for the byte '1' followed by '2'):

Code: [Select]


 mov ax, [somestring]  ; or lodsw
 cmp ax, "12"
 je somewhere
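
The byte order described here can be modelled in a few lines of C. This is only a sketch of the rule as stated in this thread (first character in the lowest byte, zero padding above); the name pack_chars is made up for illustration and is not anything NASM provides.

```c
#include <stddef.h>
#include <stdint.h>

/* Packs up to 4 characters into an integer with the first character in the
   lowest byte, so a little-endian store puts them in memory in source order. */
uint32_t pack_chars(const char *s, size_t n)
{
    uint32_t v = 0;
    for (size_t i = 0; i < n && i < 4; ++i)
        v |= (uint32_t)(unsigned char)s[i] << (8 * i);
    return v;
}
```

With this rule "12" packs to 0x3231, which is the value a little-endian word load returns when the byte '1' is followed in memory by the byte '2'. That is why the comparison above matches.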

Bryant Keller · « **Reply #7 on:** December 04, 2010, 01:08:50 AM »

Quote from: cm on December 03, 2010, 01:21:20 PM

Quote from: Bryant Keller on December 03, 2010, 12:27:43 AM
[...] or make use of the procedure handling directives built-in to NASM.

I wouldn't recommend these.

Any specific reason why? I use them a lot for demo code (not much else) but I've not seen any reason for people to avoid them.

Quote from: cm on December 03, 2010, 01:21:20 PM

Quote
However, even if that was the case you have to expect that it would be 'c%', 's%', etc. Intel processors are little endian, so word value should be reversed.

NASM already reverses immediate string values (with more than one character) for you if they are used as word or dword operands. Therefore, the following code is right (and checks for the byte '1' followed by '2'):

Code: [Select]
 mov ax, [somestring]  ; or lodsw
 cmp ax, "12"
 je somewhere

I was not aware of that. :-\

cm · « **Reply #8 on:** December 04, 2010, 12:35:45 PM »

Quote from: Bryant Keller on December 04, 2010, 01:08:50 AM

Quote from: cm on December 03, 2010, 01:21:20 PM
Quote from: Bryant Keller on December 03, 2010, 12:27:43 AM
[...] or make use of the procedure handling directives built-in to NASM.

I wouldn't recommend these.

Any specific reason why? I use them a lot for demo code (not much else) but I've not seen any reason for people to avoid them.

Eh, last time I looked into the code (of NASM) I think I saw some bugs (and/or features) that would produce different results from what I expected. Then again, they're not important enough to me to fix right now. (Note that the bogus behaviour might well be restricted to 16-bit usage. I don't remember the details.)

Quote from: Bryant Keller on December 04, 2010, 01:08:50 AM

Quote from: cm on December 03, 2010, 01:21:20 PM
NASM already reverses immediate string values (with more than one character) for you if they are used as word or dword operands.

I was not aware of that. :-\

I guess we'll have to ban you from the NASM guru temple then ;-)

Bryant Keller · « **Reply #9 on:** December 04, 2010, 10:25:50 PM »

Quote from: cm on December 04, 2010, 12:35:45 PM

Eh, last time I looked into the code (of NASM) I think I saw some bugs (and/or features) that would produce different results from what I expected. Then again, they're not important enough to me to fix right now. (Note that the bogus behaviour might well be restricted to 16-bit usage. I don't remember the details.)

It's been a while since I've delved into the source. Not surprised I overlooked it if it's restricted to 16-bit usage, I've not coded 16-bit code since DOS was the mainstream OS, and back then I used TASM. lol

Quote from: cm on December 04, 2010, 12:35:45 PM

I guess we'll have to ban you from the NASM guru temple then ;-)

Guru? O_O Don't know about that. I prefer the term "Enthusiast", leaves me room to screw up a lot.

I have a habit of being explicit with operand 'types', and a multi-character string (to me) is left to the data section. I'll use things like cmp al, 'n' but once there are more characters, I break out the ASCII chart and get a hex-word/dword value. Tbh, I held my tongue on that last post. I'm actually very much against NASM automatically converting endianness for you. It just seems to me like it would cause a lot of issues with newcomers learning to debug their code (expecting to see 0x25730000 instead of 0x00007325 representing '%s'). I didn't want to make a federal case about it, though, since I suspect nobody has had a problem with it yet, and since it's a small enough feature that I overlooked it (likely due to my coding style), apparently there isn't any harm.

cm · « **Reply #10 on:** December 05, 2010, 12:48:24 AM »

Quote from: Bryant Keller on December 04, 2010, 10:25:50 PM

[...] back then I used TASM.

Ideal or MASM mode?

Quote from: Bryant Keller on December 04, 2010, 10:25:50 PM

Guru? O_O Don't know about that. I prefer the term "Enthusiast", leaves me room to screw up a lot.

Ah, but I'm just a free software guru. Quoth the 2-clause BSD license, "ANY EXPRESS OR IMPLIED WARRANTIES [...] ARE DISCLAIMED".

Quote from: Bryant Keller on December 04, 2010, 10:25:50 PM

I have a habit of being explicit with operand 'types' and a multi-character string (to me) is left to reserved data section. I'll use things like cmp al, 'n' but once there are more characters, I break out the ASCII chart and get a hex-word/dword value.

Symmetrically, you should then use ASCII charts for encoding other character operands too. Oh, and machine language charts for encoding instructions. Seriously, NASM is an assembler and I use it to (mostly) avoid code charts.

Quote from: Bryant Keller on December 04, 2010, 10:25:50 PM

Tbh, I held my tongue on that last post. I'm actually very much against NASM automatically converting endianness for you. It just seems to me like it would cause a lot of issues with newcomers learning to debug their code (expecting to see 0x25730000 instead of 0x00007325 representing '%s'). I didn't want to make a federal case about it, though, since I suspect nobody has had a problem with it yet, and since it's a small enough feature that I overlooked it (likely due to my coding style), apparently there isn't any harm.

Actually, I don't think assembly language newcomers will have any problem with it. At least, the case where you load a register from a string and compare it for a sequence of characters seems intuitive to me with this "endian-reversal" (not really, hence the quotes). It is equivalent to dw with 2-byte strings as operand; I'd expect dw to write a string in the order I can read it in the source.

Most people opposing this notation do so because the other assemblers interpret the string the "right" (ie wrong) way. But the CPUID instruction returns its vendor string in the order NASM interprets strings.

This is not entirely the same issue, but expecting 0x25730000 is really beyond me. I would expect a 2-byte string used as a 4-byte operand to zero-pad from the top. Just as for numbers. Think about this:

Code: [Select]


 mov eax, 00003338h
 cmp ax, "38"
 ; equal (3338h == 3338h)
 cmp eax, "38"
 ; not equal! (00003338h != 33380000h)

Bryant Keller · « **Reply #11 on:** December 05, 2010, 03:42:13 AM »

Quote from: cm on December 05, 2010, 12:48:24 AM

Ideal or MASM mode?

Ideal mode.. I hated it, it was the reason why I switched to C and left DOS behind in favor of UNIX. Didn't get back into assembly again until a friend passed me a zipped copy of the NASM source.

Quote from: cm on December 05, 2010, 12:48:24 AM

Ah, but I'm just a free software guru. Quoth the 2-clause BSD license, "ANY EXPRESS OR IMPLIED WARRANTIES [...] ARE DISCLAIMED".

Hah! Maybe I should start applying the 2-clause BSD License for when I'm dating. XD

Quote from: cm on December 05, 2010, 12:48:24 AM

Symmetrically, you should then use ASCII charts for encoding other character operands too. Oh, and machine language charts for encoding instructions. Seriously, NASM is an assembler and I use it to (mostly) avoid code charts.

I have the Jegerlehner code table, ASCII character set chart, HTML Color codes chart, and MazeGen's geek32 Intel Instruction set encoding chart hanging on the wall above my desk... I like charts, they are efficient.

Quote from: cm on December 05, 2010, 12:48:24 AM

Actually, I don't think assembly language newcomers will have any problem with it. At least, the case where you load a register from a string and compare it for a sequence of characters seems intuitive to me with this "endian-reversal" (not really, hence the quotes). It is equivalent to dw with 2-byte strings as operand; I'd expect dw to write a string in the order I can read it in the source.

Most people opposing this notation do so because the other assemblers interpret the string the "right" (ie wrong) way. But the CPUID instruction returns its vendor string in the order NASM interprets strings.

This is not entirely the same issue, but expecting 0x25730000 is really beyond me. I would expect a 2-byte string used as a 4-byte operand to zero-pad from the top. Just as for numbers. Think about this:

Code: [Select]
 mov eax, 00003338h
 cmp ax, "38"
 ; equal (3338h == 3338h)
 cmp eax, "38"
 ; not equal! (00003338h != 33380000h)

That was actually my point: you get novices coming from high level languages where they learn "An ASCIIZ string is zero terminated." Then you show them something like cmp eax, "38"; how do you think they are going to assume that "string" is represented? Why speculate: I've asked 3 people who program in high level languages, and who have no assembly language experience, how they interpret the second operand of cmp eax, "38", given only the information that eax is 32 bits long, making the argument the same size. They gave the representation "0x33380000". I know I can be pedantic about things like this sometimes, and I'm not saying it should be changed; I'm just saying I've mentored several people (some with zero previous programming knowledge) and you would be amazed what kind of assumptions people from HL languages can have. I see this as a potential pitfall.

cm · « **Reply #12 on:** December 05, 2010, 06:39:36 PM »

Quote from: Bryant Keller on December 05, 2010, 03:42:13 AM

[...] you would be amazed what kind of assumptions people [...] can have. I see this as a potential pitfall.

Yes. Yes, I fully agree with you on this.

Edit:
You can actually say that the (32-bit) register is a 4-byte memory storage and then say that the string operand is stored as ASCIZ by NASM - but in the right direction. So "38\0\0" becomes 0x00003833 because if you read that little-endian dword as a string of bytes it shows up as 0x33, 0x38, 0x00, 0x00 (the string '3', '8', '\0', '\0'). The zero-padding occurs at the end of the string as you would expect for ASCIZ.

The only difficulty lies in figuring out that NASM "interprets" the register's "memory representation" as little-endian. This is logical because actually storing the register's value in real memory will use the little-endian byte order. This conclusion, of course, requires figuring out the whole byte order business first.
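
The register-as-memory view can be checked with a short C sketch that writes a 32-bit value out byte by byte in little-endian order, independent of the host machine. The name store_le32 is made up for illustration.

```c
#include <stdint.h>

/* Writes value into out in little-endian order: lowest byte first. */
void store_le32(uint32_t value, unsigned char out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = (unsigned char)(value >> (8 * i));
}
```

Storing 0x00003833 this way produces the bytes 0x33, 0x38, 0x00, 0x00, which read back as the zero-terminated string "38".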

An even simpler point for NASM's string operand byte order is probably the instruction's representation (which boils down to the machine's actual byte order too, but let's ignore that):

Code: [Select]

bits 16
mov ax, "38"
; NASM says 0B8h,33h,38h - ie opcode,'3','8'
; Other assemblers say 0B8h,38h,33h - ie opcode,'8','3'
