NASM - The Netwide Assembler

NASM Forum => Programming with NASM => Topic started by: jkot on August 08, 2014, 12:54:56 PM

Title: Print numbers (Windows 64-bit)
Post by: jkot on August 08, 2014, 12:54:56 PM: Hi!
I'm trying to learn assembly programming, and this is the most complex thing I've written so far. It's a function (named Write_int) that takes a 32-bit signed integer as a parameter and prints it to the console (using only the Windows API). I build it as a .dll so I can call the function from my other NASM programs. It seems to work correctly, but I'd like some feedback on what I could improve, whether there are any bugs I've missed, and so on. I also have a few questions:

1. How do functions like printf in C do this at the machine code level? Is it similar to this, and is it more efficient or less efficient?

2. What names should I use for labels that are part of a function?

Here's the code:
Code:
extern GetProcessHeap
extern HeapAlloc
extern HeapFree
extern GetStdHandle
extern WriteConsoleA

export Write_int

section .rdata ;this data can't be modified
const10: dd 10

section .data
numCharsWritten: dd 0
testNum: dd 0

section .text
bits 64

;Write_int(rcx number)
Write_int:
    sub rsp, 48
    mov [rsp + 32], rcx ;save the number for later use

    ;allocate the string that stores the number
    call GetProcessHeap ;heap handle now in rax
    ;LPVOID WINAPI HeapAlloc( _In_ HANDLE hHeap, _In_ DWORD dwFlags, _In_ SIZE_T dwBytes)
    mov rcx, rax ;heap handle
    mov rdx, 0 ;no flags
    mov r8, 15 ;enough memory for string that represents 32-bit signed integer (11bytes)+ string length (4bytes)
    call HeapAlloc ;pointer to allocated memory in rax (rax = *string, rax + 11 = *stringLength)

    ;convert the number to a string
    mov rcx, [rsp + 32] ;number
    push rax ;save the pointer to string
    push rax ;we need it 2 times
    mov rdx, rax ;*string
    add rax, 11 ;because string length stored after string
    mov r8, rax ;*stringLength
    call IntToString

    ;write the string to console
    ;HANDLE WINAPI GetStdHandle( _In_ DWORD nStdHandle )
    mov rcx, -11 ; -11 = STD_OUTPUT_HANDLE
    call GetStdHandle ;handle is now stored in rax
    pop r9 ;pointer to string
    ; BOOL WINAPI WriteConsole(
    ; _In_ HANDLE hConsoleOutput,
    ; _In_ const VOID *lpBuffer,
    ; _In_ DWORD nNumberOfCharsToWrite,
    ; _Out_ LPDWORD lpNumberOfCharsWritten,
    ; _Reserved_ LPVOID lpReserved )
    ;
    mov rcx, rax ;this is the handle returned from GetStdHandle
    mov rdx, r9 ;*string
    add r9, 11 ;because string length stored after string
    mov r8, [r9] ;string length
    mov r9, numCharsWritten
    mov [rsp + 32], dword 0 ;Reserved must be 0
    call WriteConsoleA

    ;deallocate the string
    call GetProcessHeap ;heap handle now in rax
    ;BOOL WINAPI HeapFree(_In_ HANDLE hHeap, _In_ DWORD dwFlags, _In_ LPVOID lpMem)
    mov rcx, rax ;heap handle
    mov rdx, 0 ;no flags
    pop r8 ;pointer to string
    call HeapFree

    add rsp, 48
    ret

;IntToString(rcx number, rdx *string, r8 *stringLength)
IntToString:
    test ecx, ecx
    js IntToString1 ;if number was < 0
    jmp UintToString ;otherwise just do UintToString
IntToString1:
    mov [r8], byte 0
    neg ecx ; convert to positive
    inc byte [r8] ;increase stringLength because of sign
    mov [rdx], byte 45 ;45 is the ascii code for minus-sign
    inc rdx
    jmp UintToString

;UintToString(rcx number, rdx *string, r8 *stringLength)
UintToString:
    mov r9, rdx ;r9 now contains the memory address of the string
    xor rdx, rdx ;rdx = 0, rdx will be used to store one digit
    mov eax, ecx ;dividend (32-bit integer) in eax
    xor rcx, rcx ;rcx = 0, rcx will store the string length
    xor r10, r10 ;r10 = 0, r10 will be used as a loop counter
UintToString1:
    inc rcx ;string length++
    div dword [const10] ;quotient now in eax, remainder in edx
    push rdx ;remainder (which is one digit of the number)
    xor rdx, rdx
    test eax, eax ;test if quotient is zero
    jz UintToString2 ;if it is, we are done
    jmp UintToString1
UintToString2:
    pop rdx ;get the digit that was saved last (the most significant digit)
    add rdx, 48 ;add 48 to get ascii representation of the digit
    mov [r9], dl ;save the digit to memory
    inc r9 ;r9 now contains memory address of next digit
    inc r10
    cmp rcx, r10 ;compare rcx and r10 for equality
    jne UintToString2 ;if they are not equal (hasn't yet looped for each digit), loop again
    add [r8], cl ;save the string length to memory
    ret
Title: Re: Print numbers (Windows 64-bit)
Post by: encryptor256 on August 08, 2014, 03:35:06 PM: Hi!

I think this line would be faster and take fewer machine-code bytes if you used the xor instruction to zero the register:
Code:
mov rdx, 0 ;no flags
Like:
Code:
xor rdx,rdx
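
To make the size difference concrete, here is a small sketch of the three ways of zeroing the register. The label name zero_examples is made up for illustration, and the exact length of the mov form depends on how the assembler chooses to encode it, so no byte count is given for it.

```nasm
bits 64
section .text

zero_examples:              ; hypothetical label, only here to hold the examples
    mov   rdx, 0            ; move of an immediate: the longest of the three forms
    xor   rdx, rdx          ; 3 bytes: 48 31 D2 (REX.W prefix, opcode, ModRM)
    xor   edx, edx          ; 2 bytes: 31 D2; writing edx also clears the upper half of rdx
    ret
```

Since dwFlags is a DWORD, the 32-bit form is enough there. One difference to keep in mind is that xor changes the flags and mov does not, so mov is still the one to use between a compare and the jump that depends on it.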

Quote
2. What names should I use for labels that are part of a function?
I put a dot in front of the label (NASM treats these as local labels), something like:
Code:
myfunction:
    ...
    jmp .quit
    ...
.quit:
    ret
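
A slightly fuller sketch of why the dot helps: a local label is attached to the most recent non-local label, so two functions can reuse the same local name without clashing. The function names AbsValue and ClampToZero are invented for this example.

```nasm
bits 64
section .text

; Both functions use a local label called .done.
; NASM resolves them as AbsValue.done and ClampToZero.done.

AbsValue:                   ; eax = absolute value of ecx
    mov   eax, ecx
    test  eax, eax
    jns   .done             ; already non-negative
    neg   eax
.done:
    ret

ClampToZero:                ; eax = ecx if ecx >= 0, otherwise 0
    mov   eax, ecx
    test  eax, eax
    jns   .done             ; refers to ClampToZero.done, not AbsValue.done
    xor   eax, eax
.done:
    ret
```

Applied to the code in the first post, IntToString1 and UintToString2 could become .negative and .store_digit under their own function labels.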
Quote
1. How do functions like printf in C do this at the machine code level? Is it similar to this, and is it more efficient or less efficient?
Well, I think C / C++ / other (GCC) code is more general-case.
Special-case code is usually faster.
* special-case: for example, a function designed to handle only certain arguments.
* general-case: something like printf, which can print anything; it first has to work out what to print, so it might be a bit slower.

Byte,
Encryptor256!
Title: Re: Print numbers (Windows 64-bit)
Post by: jkot on August 08, 2014, 06:01:40 PM: Quote
I think this line would be faster and take fewer machine-code bytes if you used the xor instruction to zero the register:

Yes, I try to remember to do that every time I need to zero a register.

Quote
Well, I think C / C++ / other (GCC) code is more general-case.
Special-case code is usually faster.
* special-case: for example, a function designed to handle only certain arguments.
* general-case: something like printf, which can print anything; it first has to work out what to print, so it might be a bit slower.

Ok, I understand that. But I'm just wondering if the algorithm for converting integer to string is same or is it somehow more optimized?

Any other ideas for improvement?
Title: Re: Print numbers (Windows 64-bit)
Post by: Frank Kotler on August 13, 2014, 06:56:31 AM: I would think it depends on the implementation of printf (etc) that you've got. They may not all be the same. If source code is available, you can find out if you care enough...

"div" is a horribly slow instruction. You can do better with repeated subtraction, actually. The best way, AFAIK, is "Terje's method" - devised by Terje Mathisen, a clever asm programmer from Norway. The AMD optimization manual has it - not credited to him... perhaps they came up with it independently. It involves multiplying by the reciprocal of 10 instead of dividing. This requires some "fixed point" math, and I think you have to "back multiply" to find out if you have to adjust the result by 1. It doesn't "look" like it would be faster, but it is. It is not suitable to show to beginners - I haven't figured it out myself. I've also got a method Wolfgang Kern showed me (haven't figured that one out either). There's a method by "Brethren" (if memory serves) in the "Examples" section, I think. See what Agner Fog has to say on the subject. So yeah, there are more optimal ways of doing it.
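
For the divide-by-10 case, the reciprocal-multiplication idea can be sketched as below. This is only a minimal sketch: it uses the constant that compilers commonly emit for unsigned 32-bit division by 10, it is not claimed to be Terje's exact method, and the name udiv10 and its register interface are made up for illustration. With this particular constant and shift, the quotient is exact for every 32-bit input, so no adjustment step is needed here.

```nasm
bits 64
section .text
global udiv10

; udiv10 (hypothetical helper)
;   in:  ecx = unsigned 32-bit number
;   out: eax = ecx / 10, edx = ecx % 10
udiv10:
    mov   eax, ecx              ; writing eax zero-extends the value into rax
    mov   edx, 0xCCCCCCCD       ; ceil(2^35 / 10), zero-extended into rdx
    imul  rax, rdx              ; product is below 2^64, so the low 64 bits are exact
    shr   rax, 35               ; rax = number / 10
    lea   edx, [rax + rax*4]    ; edx = quotient * 5
    add   edx, edx              ; edx = quotient * 10
    sub   ecx, edx              ; ecx = number - quotient * 10
    mov   edx, ecx              ; remainder, always in 0..9
    ret
```

The multiply back by 10 is only there to recover the remainder, which is the digit a number-printing loop needs.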

"push" and "pop" aren't that fast, either. There may be better ways of getting the remainders in the "right order". You can start at the right end of the buffer and work leftward. You may not get to the beginning of the buffer before running out of digits. If you're going to print it right away, this isn't an issue. You can space pad to the beginning of the buffer, giving a right justified number - looks better for printing columns of numbers (one of the many things printf will do for you).
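
Filling the buffer from the right could look roughly like this. It is a sketch under assumptions: the name UintToStringRtl and its interface are invented, the caller passes a pointer one past the end of a buffer with room for at least 10 digits, and div is kept for simplicity.

```nasm
bits 64
section .text
global UintToStringRtl

; UintToStringRtl (hypothetical helper)
;   in:  ecx = unsigned 32-bit number
;        rdx = address one past the end of the buffer
;   out: rax = address of the first digit
;        rdx = number of digits written
UintToStringRtl:
    mov   r8, rdx               ; r8 = write cursor, starts just past the end
    mov   r9, rdx               ; keep the end so the length can be computed
    mov   eax, ecx              ; dividend
    mov   ecx, 10               ; divisor kept in a register
.next_digit:
    xor   edx, edx              ; clear the high half of the edx:eax dividend
    div   ecx                   ; eax = quotient, edx = remainder (0..9)
    add   dl, '0'               ; remainder to ASCII
    dec   r8                    ; step left
    mov   [r8], dl              ; least significant digit lands at the far right
    test  eax, eax
    jnz   .next_digit           ; more digits remain
    mov   rax, r8               ; start of the string
    mov   rdx, r9
    sub   rdx, r8               ; length = end - start
    ret
```

Each digit is stored once, directly in its final position, so the second loop that pops the digits back off the stack disappears.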

If you're going to throw away the string after printing it, grabbing some space on the stack would surely be faster than calling the OS to allocate it and then again to free it. If you want to keep the string around for later use, this isn't going to work...
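
A stack-only version might be laid out as follows. This is a sketch, not the original poster's revised code: the name Write_int_stack and the frame layout are assumptions, and it relies on the Microsoft x64 calling convention (32 bytes of shadow space, rsp 16-byte aligned at each call).

```nasm
bits 64
default rel

extern GetStdHandle
extern WriteConsoleA
global Write_int_stack

section .text

; Write_int_stack(ecx = signed 32-bit number)   hypothetical name
; Frame (72 bytes, which keeps rsp 16-byte aligned at the calls):
;   [rsp+ 0..31]  shadow space for the callees
;   [rsp+32..39]  5th argument (lpReserved)
;   [rsp+40..43]  lpNumberOfCharsWritten target
;   [rsp+48..63]  text buffer, filled from the right
;   [rsp+64..71]  saved pointer to the first character
Write_int_stack:
    sub   rsp, 72
    mov   r10d, ecx             ; remember the sign
    mov   eax, ecx
    test  ecx, ecx
    jns   .convert
    neg   eax                   ; magnitude; INT_MIN stays 0x80000000, correct as unsigned
.convert:
    lea   r8, [rsp + 64]        ; one past the end of the buffer
    mov   ecx, 10
.next_digit:
    xor   edx, edx
    div   ecx                   ; eax = quotient, edx = remainder
    add   dl, '0'
    dec   r8
    mov   [r8], dl
    test  eax, eax
    jnz   .next_digit
    test  r10d, r10d
    jns   .print
    dec   r8
    mov   byte [r8], '-'        ; sign goes in front of the digits
.print:
    mov   [rsp + 64], r8        ; r8 is volatile, so save the start pointer
    mov   ecx, -11              ; STD_OUTPUT_HANDLE
    call  GetStdHandle
    mov   rcx, rax              ; hConsoleOutput
    mov   rdx, [rsp + 64]       ; lpBuffer
    lea   r8, [rsp + 64]
    sub   r8, rdx               ; nNumberOfCharsToWrite = end - start
    lea   r9, [rsp + 40]        ; lpNumberOfCharsWritten
    mov   qword [rsp + 32], 0   ; lpReserved must be 0
    call  WriteConsoleA
    add   rsp, 72
    ret
```

Compared with the heap version, this drops the GetProcessHeap, HeapAlloc and HeapFree calls, and nothing needs to be freed because the buffer disappears when the function returns.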

So your routine could probably be "improved" if it's worth it. How many numbers you gonna print? :)

Best,
Frank
Title: Re: Print numbers (Windows 64-bit)
Post by: jkot on August 13, 2014, 06:18:03 PM: Quote
"div" is a horribly slow instruction. You can do better with repeated subtraction, actually. The best way, AFAIK, is "Terje's method" - devised by Terje Mathisen, a clever asm programmer from Norway. The AMD optimization manual has it - not credited to him... perhaps they came up with it independently. It involves multiplying by the reciprocal of 10 instead of dividing. This requires some "fixed point" math, and I think you have to "back multiply" to find out if you have to adjust the result by 1. It doesn't "look" like it would be faster, but it is. It is not suitable to show to beginners - I haven't figured it out myself. I've also got a method Wolfgang Kern showed me (haven't figured that one out either). There's a method by "Brethren" (if memory serves) in the "Examples" section, I think. See what Agner Fog has to say on the subject. So yeah, there are more optimal ways of doing it.

It's interesting to hear about those methods; I'll take a look at them. For now, I think I'll keep using div because it's much simpler. I actually downloaded Agner Fog's documents a few days ago, and there's a lot of good information in them.

Quote
"push" and "pop" aren't that fast, either. There may be better ways of getting the remainders in the "right order". You can start at the right end of the buffer and work leftward

I'm not sure how to do this. If I didn't use push/pop, I'd still have to do mov + add/sub to get the remainders at the right place in the buffer, right? I thought that's the same cost as push/pop?

Quote
If you're going to throw away the string after printing it, grabbing some space on the stack would surely be faster than calling the OS to allocate it and then again to free it. If you want to keep the string around for later use, this isn't going to work...

Yeah, I'm not even sure why I allocated it on the heap. I think I had some problems trying to use the stack and somehow thought it needed to be on the heap. I've now changed it to use the stack, and it works correctly.

Quote
So your routine could probably be "improved" if it's worth it. How many numbers you gonna print? :)

Well, I don't really need to have a super-optimized printing function. I'm just interested to see what kind of optimizations are possible in assembly. ;)

Thanks for the tips!