NASM - The Netwide Assembler
NASM Forum => Programming with NASM => Topic started by: jkot on August 08, 2014, 12:54:56 PM
-
Hi!
I'm trying to learn assembly programming, and this is so far the most complex thing I've written. It's a function (named Write_int) that takes a 32-bit signed integer as a parameter and prints it to the console (using only the Windows API). I build it as a .dll so I can call the function from my other NASM programs. It seems to work correctly, but I'd like some feedback on what I could improve, whether there are any bugs I've missed, and so on. I also have a few questions:
1. How do functions like printf in C do this at the machine code level? Is it similar to this, and is it more efficient or less efficient?
2. What names should I use for labels that are part of a function?
Here's the code:
extern GetProcessHeap
extern HeapAlloc
extern HeapFree
extern GetStdHandle
extern WriteConsoleA
export Write_int
section .rdata ;this data can't be modified
const10: dd 10
section .data
numCharsWritten: dd 0
testNum: dd 0
section .text
bits 64
;Write_int(ecx number)
Write_int:
sub rsp, 56 ;32 bytes shadow space + 5th argument slot + 2 local slots; keeps rsp 16-byte aligned at every call
mov [rsp + 40], rcx ;save the number for later use
;allocate the string that stores the number
call GetProcessHeap ;heap handle now in rax
;LPVOID WINAPI HeapAlloc( _In_ HANDLE hHeap, _In_ DWORD dwFlags, _In_ SIZE_T dwBytes)
mov rcx, rax ;heap handle
mov rdx, 0 ;no flags
mov r8, 15 ;enough memory for string that represents 32-bit signed integer (11bytes)+ string length (4bytes)
call HeapAlloc ;pointer to allocated memory in rax (rax = *string, rax + 11 = *stringLength)
mov [rsp + 48], rax ;save the pointer to string in a local slot (a pushed copy would sit in the shadow space, which callees may overwrite)
;convert the number to a string
mov rcx, [rsp + 40] ;number
mov rdx, rax ;*string
add rax, 11 ;because string length stored after string
mov r8, rax ;*stringLength
call IntToString
;write the string to console
;HANDLE WINAPI GetStdHandle( _In_ DWORD nStdHandle )
mov rcx, -11 ; -11 = STD_OUTPUT_HANDLE
call GetStdHandle ;handle is now stored in rax
mov r9, [rsp + 48] ;pointer to string
; BOOL WINAPI WriteConsole(
; _In_ HANDLE hConsoleOutput,
; _In_ const VOID *lpBuffer,
; _In_ DWORD nNumberOfCharsToWrite,
; _Out_ LPDWORD lpNumberOfCharsWritten,
; _Reserved_ LPVOID lpReserved ) ;
mov rcx, rax ;this is the handle returned from GetStdHandle
mov rdx, r9 ;*string
add r9, 11 ;because string length stored after string
mov r8d, [r9] ;string length (a DWORD, so load only 32 bits)
mov r9, numCharsWritten
mov qword [rsp + 32], 0 ;Reserved must be 0 (pointer-sized, so clear all 8 bytes)
call WriteConsoleA
;deallocate the string
call GetProcessHeap ;heap handle now in rax
;BOOL WINAPI HeapFree(_In_ HANDLE hHeap, _In_ DWORD dwFlags, _In_ LPVOID lpMem)
mov rcx, rax ;heap handle
mov rdx, 0 ;no flags
mov r8, [rsp + 48] ;pointer to string
call HeapFree
add rsp, 56
ret
;IntToString(ecx number, rdx *string, r8 *stringLength)
IntToString:
mov dword [r8], 0 ;clear the whole 32-bit length first; UintToString adds the digit count to it
test ecx, ecx
jns UintToString ;if number was >= 0, just do UintToString
neg ecx ;convert to positive (-2147483648 stays 0x80000000, which is the right value when read as unsigned)
inc byte [r8] ;increase stringLength because of sign
mov [rdx], byte 45 ;45 is the ascii code for minus-sign
inc rdx
;fall through into UintToString
;UintToString(rcx number, rdx *string, r8 *stringLength)
UintToString:
mov r9, rdx ;r9 now contains the memory address of the string
xor rdx, rdx ;rdx = 0, rdx will be used to store one digit
mov eax, ecx ;dividend (32-bit integer) in eax
xor rcx, rcx ;rcx = 0, rcx will store the string length
xor r10, r10 ;r10 = 0, r10 will be used as a loop counter
UintToString1:
inc rcx ;string length++
div dword [const10] ;quotient now in eax, remainder in edx
push rdx ;remainder (which is one digit of the number)
xor rdx, rdx
test eax, eax ;test if quotient is zero
jnz UintToString1 ;if it isn't, there are more digits to extract
UintToString2:
pop rdx ;get the digit that was saved last (the most significant digit)
add rdx, 48 ;add 48 to get ascii representation of the digit
mov [r9], dl ;save the digit to memory
inc r9 ;r9 now contains memory address of next digit
inc r10
cmp rcx, r10 ;compare rcx and r10 for equality
jne UintToString2 ;if they are not equal (hasn't yet looped for each digit), loop again
add [r8], cl ;save the string length to memory
ret
-
Hi!
I think this line would be faster and take fewer machine-code bytes if you used the xor instruction to zero the register:
mov rdx, 0 ;no flags
Like:
xor rdx,rdx
2. What names should I use for labels that are part of a function?
I put a dot in front of the label, which makes it local to the preceding non-local label in NASM. Something like:
myfunction:
...
jmp .quit
...
.quit:
ret
1. How do functions like printf in C do this at the machine code level? Is it similar to this, and is it more efficient or less efficient?
Well, I think the C / C++ library routines (GCC's and others') are written for the general case.
Special-case code is generally faster.
* special-case: for example, a function designed to handle only certain arguments.
* general-case: something like printf can print anything, so it first has to work out what to print, which may make it a bit slower.
Byte,
Encryptor256!
-
I think this line would be faster and take fewer machine-code bytes if you used the xor instruction to zero the register:
Yes, I try to remember to do that every time I need to zero a register.
Well, I think the C / C++ library routines (GCC's and others') are written for the general case.
Special-case code is generally faster.
* special-case: for example, a function designed to handle only certain arguments.
* general-case: something like printf can print anything, so it first has to work out what to print, which may make it a bit slower.
OK, I understand that. But I'm wondering whether the algorithm for converting an integer to a string is the same, or is it somehow more optimized?
Any other ideas for improvement?
-
I would think it depends on the implementation of printf (etc) that you've got. They may not all be the same. If source code is available, you can find out if you care enough...
"div" is a horribly slow instruction. You can do better with repeated subtraction, actually. The best way, AFAIK, is "Terje's method" - devised by Terje Mathisen, a clever asm programmer from Norway. The AMD optimization manual has it - not credited to him... perhaps they came up with it independently. It involves multiplying by the reciprocal of 10 instead of dividing. This requires some "fixed point" math, and I think you have to "back multiply" to find out if you have to adjust the result by 1. It doesn't "look" like it would be faster, but it is. It is not suitable to show to beginners - I haven't figured it out myself. I've also got a method Wolfgang Kern showed me (haven't figured that one out either). There's a method by "Brethren" (if memory serves) in the "Examples" section, I think. See what Agner Fog has to say on the subject. So yeah, there are more optimal ways of doing it.
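For readers who want to see the "multiply by the reciprocal" idea in isolation, here is a minimal sketch in C. It is not Terje Mathisen's full algorithm, only the fixed-point constant that is commonly used for unsigned 32-bit division by 10. The function names div10 and mod10 are made up for this illustration. With this particular constant the quotient is exact for every 32-bit input, so the multiply-back step is only needed to obtain the remainder.

```c
#include <stdint.h>

/* Divide a 32-bit unsigned value by 10 without a div instruction.
   0xCCCCCCCD is ceil(2^35 / 10). Taking the 64-bit product and shifting
   it right by 35 yields the exact quotient for all 32-bit inputs. */
uint32_t div10(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}

/* The remainder (one decimal digit) comes from multiplying the quotient
   back by 10 and subtracting it from the original value. */
uint32_t mod10(uint32_t n)
{
    return n - div10(n) * 10u;
}
```

For example, div10(4294967295) returns 429496729 and mod10(4294967295) returns 5. In assembly the same thing is a mul by the constant followed by a shift of the high half of the product.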
"push" and "pop" aren't that fast, either. There may be better ways of getting the remainders in the "right order". You can start at the right end of the buffer and work leftward. You may not get to the beginning of the buffer before running out of digits. If you're going to print it right away, this isn't an issue. You can space pad to the beginning of the buffer, giving a right justified number - looks better for printing columns of numbers (one of the many things printf will do for you).
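The right-to-left idea can be sketched in C as follows. This is only an illustration under assumed names (int_to_field and FIELD_WIDTH are hypothetical), not code from the thread. Each remainder is stored directly at its final position, so no push/pop pass is needed to reverse the digits. The unused left part of the buffer is filled with spaces, so the caller can print the whole field for a right-justified number or start at the returned pointer for an unpadded one.

```c
#include <stdint.h>

enum { FIELD_WIDTH = 11 };   /* "-2147483648" is 11 characters */

/* Write value into field from right to left, pad the rest with spaces,
   and return a pointer to the first significant character.
   No terminating NUL is written. */
char *int_to_field(int32_t value, char field[FIELD_WIDTH])
{
    char *p = field + FIELD_WIDTH;   /* one past the last slot */
    /* Work in unsigned so that INT32_MIN negates without overflow. */
    uint32_t u = value < 0 ? 0u - (uint32_t)value : (uint32_t)value;

    do {
        *--p = (char)('0' + u % 10); /* least significant digit first */
        u /= 10;
    } while (u != 0);

    if (value < 0)
        *--p = '-';

    char *first = p;                 /* where the number itself starts */
    while (p > field)
        *--p = ' ';                  /* space-pad back to the beginning */
    return first;
}
```

The number of characters in the unpadded string is (field + FIELD_WIDTH) - first, which is the length a call such as WriteConsoleA would need.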
If you're going to throw away the string after printing it, grabbing some space on the stack would surely be faster than calling the OS to allocate it and then again to free it. If you want to keep the string around for later use, this isn't going to work...
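As a rough C analogue of the stack-buffer approach (the name write_int and the use of fwrite are assumptions for this sketch, the thread itself uses WriteConsoleA), the scratch buffer below is a local array. It costs nothing more than an adjustment of the stack pointer, and it disappears when the function returns, which is exactly why it cannot be handed back to the caller.

```c
#include <stdint.h>
#include <stdio.h>

/* Print value to out using a scratch buffer in this function's own stack
   frame: no allocator call before the conversion and no free after it.
   Returns the number of characters written. */
size_t write_int(int32_t value, FILE *out)
{
    char buf[11];                      /* "-2147483648" is 11 characters */
    char *end = buf + sizeof buf;
    char *p = end;
    uint32_t u = value < 0 ? 0u - (uint32_t)value : (uint32_t)value;

    do {
        *--p = (char)('0' + u % 10);
        u /= 10;
    } while (u != 0);
    if (value < 0)
        *--p = '-';

    return fwrite(p, 1, (size_t)(end - p), out);
}
```

Calling write_int(-42, stdout) writes the three characters "-42" and returns 3 when the write succeeds.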
So your routine could probably be "improved" if it's worth it. How many numbers you gonna print? :)
Best,
Frank
-
"div" is a horribly slow instruction. You can do better with repeated subtraction, actually. The best way, AFAIK, is "Terje's method" - devised by Terje Mathisen, a clever asm programmer from Norway. The AMD optimization manual has it - not credited to him... perhaps they came up with it independently. It involves multiplying by the reciprocal of 10 instead of dividing. This requires some "fixed point" math, and I think you have to "back multiply" to find out if you have to adjust the result by 1. It doesn't "look" like it would be faster, but it is. It is not suitable to show to beginners - I haven't figured it out myself. I've also got a method Wolfgang Kern showed me (haven't figured that one out either). There's a method by "Brethren" (if memory serves) in the "Examples" section, I think. See what Agner Fog has to say on the subject. So yeah, there are more optimal ways of doing it.
It's interesting to know about those methods, I'll take a look at them. For now, I think I'll keep using div because it's much simpler. I actually downloaded Agner Fog's documents a few days ago, and there's really a lot of good information in them.
"push" and "pop" aren't that fast, either. There may be better ways of getting the remainders in the "right order". You can start at the right end of the buffer and work leftward
I'm not sure how to do this. If I didn't use push/pop, I'd still have to do a mov plus an add/sub to get each remainder into the right place in the buffer, right? I thought that would cost the same as push/pop?
If you're going to throw away the string after printing it, grabbing some space on the stack would surely be faster than calling the OS to allocate it and then again to free it. If you want to keep the string around for later use, this isn't going to work...
Yeah, I'm not even sure why I allocated it on the heap. I think I had some problems trying to use the stack and somehow thought it needs to be on heap. But now I changed it to use stack and it works correctly.
So your routine could probably be "improved" if it's worth it. How many numbers you gonna print? :)
Well, I don't really need to have a super-optimized printing function. I'm just interested to see what kind of optimizations are possible in assembly. ;)
Thanks for the tips!