Author Topic: Print numbers (Windows 64-bit) (Read 17941 times)

jkot · « **on:** August 08, 2014, 12:54:56 PM »

Hi!
I'm trying to learn assembly programming, and this is the most complex thing I've written so far. It's a function (named Write_int) that takes a 32-bit signed integer as a parameter and prints it to the console, using only the Windows API. I build it as a .dll so I can call the function from my other NASM programs. It seems to be working correctly, but I'd like some feedback on what I could improve and on any bugs I might have missed. I also have a few questions:

1. How do functions like printf in C do this at the machine code level? Is it similar to this, and is it more efficient or less efficient?

2. What names should I use for labels that are part of a function?

Here's the code:

Code:

extern GetProcessHeap
extern HeapAlloc
extern HeapFree
extern GetStdHandle
extern WriteConsoleA

export Write_int

section .rdata ;this data can't be modified

const10: dd 10

section .data

numCharsWritten: dd 0
testNum: dd 0

section .text

bits 64

;Write_int(rcx number)
Write_int:
sub rsp, 56 ;32 bytes shadow space + 5th argument slot + 2 local qwords; rsp stays 16-byte aligned at every call

mov [rsp + 40], rcx ;save the number for later use (local slot above the argument area)

;allocate the string that stores the number

call GetProcessHeap ;heap handle now in rax

;LPVOID WINAPI HeapAlloc( _In_  HANDLE hHeap, _In_  DWORD dwFlags, _In_  SIZE_T dwBytes)
mov rcx, rax ;heap handle
mov rdx, 0 ;no flags
mov r8, 15 ;enough memory for string that represents 32-bit signed integer (11 bytes) + string length (4 bytes)
call HeapAlloc ;pointer to allocated memory in rax (rax = *string, rax + 11 = *stringLength)

;convert the number to a string

mov rcx, [rsp + 40] ;number
mov [rsp + 48], rax ;save the pointer to string in a local slot (a pushed copy would sit in the shadow space of the calls below, where the callee may overwrite it)
mov rdx, rax ;*string
add rax, 11 ;because string length stored after string
mov r8, rax ;*stringLength
call IntToString

;write the string to console

;HANDLE WINAPI GetStdHandle( _In_  DWORD nStdHandle )
mov rcx, -11 ; -11 = STD_OUTPUT_HANDLE
call GetStdHandle ;handle is now stored in rax

mov r9, [rsp + 48] ;pointer to string

; BOOL WINAPI WriteConsole(
;       _In_        HANDLE hConsoleOutput,
;       _In_        const VOID *lpBuffer,
;       _In_        DWORD nNumberOfCharsToWrite,
;       _Out_       LPDWORD lpNumberOfCharsWritten,
;       _Reserved_  LPVOID lpReserved ) ;
mov rcx, rax ;this is the handle returned from GetStdHandle
mov rdx, r9 ;*string
add r9, 11 ;because string length stored after string
mov r8d, [r9] ;string length (a DWORD; a 64-bit load would read past the 15-byte allocation)
mov r9, numCharsWritten
mov qword [rsp + 32], 0 ;Reserved is pointer-sized and must be 0
call WriteConsoleA

;deallocate the string

call GetProcessHeap ;heap handle now in rax

;BOOL WINAPI HeapFree(_In_  HANDLE hHeap, _In_  DWORD dwFlags, _In_  LPVOID lpMem)
mov rcx, rax ;heap handle
mov rdx, 0 ;no flags
mov r8, [rsp + 48] ;pointer to string
call HeapFree

add rsp, 56
ret


;IntToString(rcx number, rdx *string, r8 *stringLength)
IntToString:

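mov dword [r8], 0 ;initialize all 4 bytes of stringLength (HeapAlloc without HEAP_ZERO_MEMORY returns uninitialized memory)
test ecx, ecx
js IntToString1 ;if number was < 0
jmp UintToString ;otherwise just do UintToString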

IntToString1:

mov [r8], byte 0
neg ecx ; convert to positive
inc byte [r8] ;increase stringLength because of sign
mov [rdx], byte 45 ;45 is the ascii code for minus-sign
inc rdx
jmp UintToString


;UintToString(rcx number, rdx *string, r8 *stringLength)
UintToString:

mov r9, rdx ;r9 now contains the memory address of the string
xor rdx, rdx ;rdx = 0, rdx will be used to store one digit
mov eax, ecx ;dividend (32-bit integer) in eax
xor rcx, rcx ;rcx = 0, rcx will store the string length
xor r10, r10 ;r10 = 0, r10 will be used as a loop counter

UintToString1:

inc rcx ;string length++
div dword [const10] ;quotient now in eax, remainder in edx
push rdx ;remainder (which is one digit of the number)
xor rdx, rdx
test eax, eax ;test if quotient is zero
jz UintToString2 ;if it is, we are done  
jmp UintToString1

UintToString2:

pop rdx ;get the digit that was saved last (the most significant digit)
add rdx, 48 ;add 48 to get ascii representation of the digit
mov [r9], dl ;save the digit to memory
inc r9 ;r9 now contains memory address of next digit
inc r10
cmp rcx, r10 ;compare rcx and r10 for equality
jne UintToString2 ;if they are not equal (hasn't yet looped for each digit), loop again
add [r8], cl ;save the string length to memory
ret

encryptor256 · « **Reply #1 on:** August 08, 2014, 03:35:06 PM »

Hi!

I think this line would be faster and take fewer machine-code bytes if you zeroed the register with the xor instruction:

Code:

mov rdx, 0 ;no flags

Like:

Code:

xor rdx,rdx

Quote

2. What names should I use for labels that are part of a function?

I use dot in front of a label, something like:

Code:

myfunction:

    ...
    jmp .quit
    ...

    .quit:

    ret

Quote

1. How do functions like printf in C do this at the machine code level? Is it similar to this, and is it more efficient or less efficient?

Well, I think the library routines in C / C++ / other toolchains (GCC, for example) are written for the general case.
Special-case code is usually faster.
* special-case: for example, a function designed to handle only certain arguments.
* general-case: printf can print anything, but first it has to work out what to print, so it might be a bit slower.
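
To make the general-case versus special-case point concrete, here is a minimal C sketch. Both `put_uint` and `mini_format` are hypothetical names invented for illustration; real printf implementations are far larger and differ from one C library to another. The special-case routine converts a value it already knows to be an unsigned integer, while the general-case routine has to scan a format string at run time before it can decide what to emit.

```c
#include <stddef.h>
#include <stdint.h>

/* Special case: the caller already knows the argument is an unsigned integer.
   Writes the digits to out (no NUL terminator) and returns their count. */
static size_t put_uint(char *out, uint32_t value)
{
    char tmp[10];                       /* a 32-bit value has at most 10 digits */
    size_t pos = sizeof tmp, len = 0;
    do {
        tmp[--pos] = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);
    while (pos < sizeof tmp)
        out[len++] = tmp[pos++];
    return len;
}

/* General case (hypothetical, drastically reduced printf): scans fmt at run
   time and only then decides what to emit. Supports "%u" and "%%" only.
   out must be large enough; the result is NUL-terminated. */
static size_t mini_format(char *out, const char *fmt, uint32_t value)
{
    size_t len = 0;
    for (; *fmt != '\0'; ++fmt) {
        if (*fmt != '%') {
            out[len++] = *fmt;          /* ordinary character */
        } else if (fmt[1] == 'u') {
            len += put_uint(out + len, value);
            ++fmt;                      /* skip the 'u' */
        } else if (fmt[1] == '%') {
            out[len++] = '%';
            ++fmt;                      /* skip the second '%' */
        } else {
            out[len++] = '%';           /* unknown or trailing '%': copy it */
        }
    }
    out[len] = '\0';
    return len;
}
```

The extra work in `mini_format` is the per-character inspection of the format string; the digit conversion itself is the same routine in both paths.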

Byte,
Encryptor256!

jkot · « **Reply #2 on:** August 08, 2014, 06:01:40 PM »

Quote

I think this line would be faster and take fewer machine-code bytes if you zeroed the register with the xor instruction:

Yes, I try to remember to do that every time I need to zero a register.

Quote

Well, I think the library routines in C / C++ / other toolchains (GCC, for example) are written for the general case.
Special-case code is usually faster.
* special-case: for example, a function designed to handle only certain arguments.
* general-case: printf can print anything, but first it has to work out what to print, so it might be a bit slower.

Ok, I understand that. But I'm wondering whether the algorithm for converting an integer to a string is the same, or whether it is somehow more optimized?

Any other ideas for improvement?

Frank Kotler · « **Reply #3 on:** August 13, 2014, 06:56:31 AM »

I would think it depends on the implementation of printf (etc) that you've got. They may not all be the same. If source code is available, you can find out if you care enough...

"div" is a horribly slow instruction. You can do better with repeated subtraction, actually. The best way, AFAIK, is "Terje's method" - devised by Terje Mathisen, a clever asm programmer from Norway. The AMD optimization manual has it - not credited to him... perhaps they came up with it independently. It involves multiplying by the reciprocal of 10 instead of dividing. This requires some "fixed point" math, and I think you have to "back multiply" to find out if you have to adjust the result by 1. It doesn't "look" like it would be faster, but it is. It is not suitable to show to beginners - I haven't figured it out myself. I've also got a method Wolfgang Kern showed me (haven't figured that one out either). There's a method by "Brethren" (if memory serves) in the "Examples" section, I think. See what Agner Fog has to say on the subject. So yeah, there are more optimal ways of doing it.
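
As a rough illustration of multiplying by the reciprocal of 10, here is a C sketch of the multiply-and-shift form that compilers commonly emit for unsigned 32-bit division by 10. It is not claimed to be Terje's exact method; the names `div10` and `mod10` are made up for this example. The constant 0xCCCCCCCD is ceil(2^35 / 10), and with a 64-bit intermediate product the quotient comes out exact for every 32-bit input, so in this particular form the back-multiplication is only needed to recover the remainder.

```c
#include <stdint.h>

/* Quotient of n / 10 without a divide instruction:
   multiply by ceil(2^35 / 10) = 0xCCCCCCCD, then shift right by 35.
   The 64-bit product cannot overflow for a 32-bit n. */
static uint32_t div10(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}

/* Remainder recovered by multiplying the quotient back by 10. */
static uint32_t mod10(uint32_t n)
{
    return n - div10(n) * 10u;
}
```

In assembly this corresponds to one `mul` by the constant followed by a shift of the high half, in place of `div dword [const10]`.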

"push" and "pop" aren't that fast, either. There may be better ways of getting the remainders in the "right order". You can start at the right end of the buffer and work leftward. You may not get to the beginning of the buffer before running out of digits. If you're going to print it right away, this isn't an issue. You can space pad to the beginning of the buffer, giving a right justified number - looks better for printing columns of numbers (one of the many things printf will do for you).
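
The right-to-left idea can be sketched in C as follows. The function name `uint_to_right_aligned` and its parameters are hypothetical, and the sketch assumes the buffer is at least 10 characters wide, which is enough for any 32-bit unsigned value. Each digit is stored exactly once at its final position, so no push/pop pass is needed to reverse the order, and the unused left part of the buffer is padded with spaces to give a right-justified number.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes the decimal digits of value so that the last digit lands in
   buf[width - 1], fills everything to the left with spaces, and returns
   a pointer to the first digit. buf is NOT NUL-terminated.
   Assumes width >= 10 (an assumption of this sketch, not checked). */
static char *uint_to_right_aligned(char *buf, size_t width, uint32_t value)
{
    char *p = buf + width;              /* one past the last character */
    do {
        *--p = (char)('0' + value % 10); /* least significant digit first */
        value /= 10;
    } while (value != 0);

    char *first = p;                    /* most significant digit */
    while (p != buf)
        *--p = ' ';                     /* space-pad back to the start */
    return first;
}
```

To print without padding, write from the returned pointer to the end of the buffer; to print a column of numbers, write the whole buffer.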

If you're going to throw away the string after printing it, grabbing some space on the stack would surely be faster than calling the OS to allocate it and then again to free it. If you want to keep the string around for later use, this isn't going to work...
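
A C sketch of the stack-buffer variant, under a few assumptions: `int_to_chars` and `write_int` are hypothetical names, and `fwrite` to `stdout` stands in for the `WriteConsoleA` call in the original routine. The 11-byte buffer (sign plus 10 digits) lives in automatic storage, so there is no allocate/free pair and the buffer disappears when the function returns.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Formats n into out (at least 11 bytes: sign plus 10 digits) and returns
   the number of characters written. No NUL terminator is stored. */
static size_t int_to_chars(int32_t n, char *out)
{
    char tmp[10];
    size_t len = 0, pos = sizeof tmp;

    /* Negate in unsigned arithmetic so INT32_MIN does not overflow. */
    uint32_t u = (n < 0) ? 0u - (uint32_t)n : (uint32_t)n;

    do {
        tmp[--pos] = (char)('0' + u % 10);
        u /= 10;
    } while (u != 0);

    if (n < 0)
        out[len++] = '-';
    while (pos < sizeof tmp)
        out[len++] = tmp[pos++];
    return len;
}

/* Prints n using a buffer on the stack instead of the heap. */
void write_int(int32_t n)
{
    char buf[11];                       /* automatic storage, freed on return */
    size_t len = int_to_chars(n, buf);
    fwrite(buf, 1, len, stdout);
}
```

As noted above, this only works when the string is not needed after the call returns.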

So your routine could probably be "improved" if it's worth it. How many numbers you gonna print?

Best,
Frank

jkot · « **Reply #4 on:** August 13, 2014, 06:18:03 PM »

Quote

"div" is a horribly slow instruction. You can do better with repeated subtraction, actually. The best way, AFAIK, is "Terje's method" - devised by Terje Mathisen, a clever asm programmer from Norway. The AMD optimization manual has it - not credited to him... perhaps they came up with it independently. It involves multiplying by the reciprocal of 10 instead of dividing. This requires some "fixed point" math, and I think you have to "back multiply" to find out if you have to adjust the result by 1. It doesn't "look" like it would be faster, but it is. It is not suitable to show to beginners - I haven't figured it out myself. I've also got a method Wolfgang Kern showed me (haven't figured that one out either). There's a method by "Brethren" (if memory serves) in the "Examples" section, I think. See what Agner Fog has to say on the subject. So yeah, there are more optimal ways of doing it.

It's interesting to know about those methods; I'll take a look at them. For now, I think I'll keep using div because it's much simpler. I actually downloaded Agner Fog's documents a few days ago, and there is a lot of good information in them.

Quote

"push" and "pop" aren't that fast, either. There may be better ways of getting the remainders in the "right order". You can start at the right end of the buffer and work leftward

I'm not sure how to do this. If I didn't use push/pop, I'd still have to do mov + add/sub to get the remainders at the right place in the buffer, right? I thought that's the same cost as push/pop?

Quote

If you're going to throw away the string after printing it, grabbing some space on the stack would surely be faster than calling the OS to allocate it and then again to free it. If you want to keep the string around for later use, this isn't going to work...

Yeah, I'm not even sure why I allocated it on the heap. I think I had some problems trying to use the stack and somehow thought it needed to be on the heap. But now I've changed it to use the stack, and it works correctly.

Quote

So your routine could probably be "improved" if it's worth it. How many numbers you gonna print?

Well, I don't really need to have a super-optimized printing function. I'm just interested to see what kind of optimizations are possible in assembly.

Thanks for the tips!
