NASM - The Netwide Assembler
NASM Forum => Programming with NASM => Topic started by: manler on December 23, 2012, 09:00:33 PM
-
Hello everybody!
This is my first real attempt at using NASM to create a function that I link into a C++ program.
The function adds two vectors of integers and returns the total sum of all elements. So it isn't the world's most useful function, but I wrote it to get the hang of NASM. It uses SSE instructions up to SSE4.1 (lddqu is SSE3, phaddd is SSSE3, pextrd is SSE4.1).
Am I doing the right things here? I mean things like:
register use?
how to set up & tear down a function? enter/leave?
reading args passed to the function?
Please give me some feedback.
My main goal with using NASM is to create functions that link into C++ programs.
In C++ the function looks like this:
int addvectors_c(int *vec1, int *vec2, int elements)
{
    int sum = 0;
    for(int i = 0; i<elements; ++i)
    {
        sum += vec1[i] + vec2[i];
    }
    return sum;
}
In NASM the function looks like this:
segment .data
segment .bss
segment .text
global _addvectors
;
; int addvectors(int *vec1, int *vec2, int elements);
;
_addvectors:
    enter 0,0
    pusha
    mov esi, [ebp+8] ; first parameter vec1
    mov ecx, [ebp+12] ; second parameter vec2
    mov edx, [ebp+16] ; third parameter elements
    shl edx, 2 ; convert to number of bytes since integer is 4 bytes.
    add edx, esi ; calculate end of vec1
    mov edi, esi
    sub edi, 16 ; subtract 16 (one lddqu read) to find last index before we start reading 4 bytes at a time
    pxor xmm0, xmm0
    pxor xmm7, xmm7
    ; while(esi <= edi)
.more_bytes:
    cmp edi, esi
    jl .remaining_bytes
    lddqu xmm1, [esi] ; Using unaligned load (also tried movdqu)
    lddqu xmm2, [ecx]
    paddd xmm0, xmm1 ; Add to xmm0
    paddd xmm0, xmm2 ; Add to xmm0
    add esi, 16
    add ecx, 16
    jmp .more_bytes
.remaining_bytes:
    cmp edx, esi
    jle .calc_sum
    movd xmm6, [esi]
    movd xmm7, [ecx]
    paddd xmm0, xmm6
    paddd xmm0, xmm7
    add esi, 4
    jmp .remaining_bytes
.calc_sum:
    popa
    phaddd xmm0, xmm7 ; Do horizontal add of xmm0
    phaddd xmm0, xmm7 ; Horizontal add finished
    pextrd eax, xmm0, 0 ; Extract sum and put into eax for return value
    leave
    ret
Just as a side note, the asm version only seems to be faster than the C++ version when the vectors are larger than 1024*800 bytes. Otherwise the C++ version is faster, and much faster when the functions are called many times with small vectors. I'm guessing that MSVC 2012 inlines the code and removes the actual function call?
Thank you for any tips and help!
/Mathias
-
I would eliminate the enter/leave and the pusha/popa instructions. Under the C calling convention (which C++ also uses here), the only registers used by your function which must be non-volatile (appear unchanged to the calling function) are ESI and EDI. Since no frame pointer is set up without enter, the arguments are read relative to ESP: after the two pushes the return address sits at [esp+8], so the first argument is at [esp+12]. Thus the framework for function entry and exit could be as follows:
_addvectors:
    push edi
    push esi
    mov esi, [esp+12] ; first parameter vec1
    mov ecx, [esp+16] ; second parameter vec2
    mov edx, [esp+20] ; third parameter elements
    .
    . ; implementation
    .
    pextrd eax, xmm0, 0 ; Extract sum and put into eax for return value
    pop esi
    pop edi
    ret
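On the C++ side, the routine needs a declaration with C linkage, otherwise the compiler mangles the name and the linker will not find _addvectors. Here is a minimal sketch, assuming a 32-bit Windows target (where C symbols get a leading underscore) and an object file assembled with something like nasm -f win32. The stand-in definition below is not part of the thread; it only exists so the sketch links on its own, and it would be deleted once the assembled object is linked in.

```cpp
// Declaration the C++ caller would use. extern "C" turns off C++ name
// mangling, so the linker looks for the plain C symbol. On 32-bit Windows
// that symbol is _addvectors, which matches "global _addvectors" in the asm.
extern "C" int addvectors(int *vec1, int *vec2, int elements);

// ASSUMPTION: stand-in body so this sketch builds without the asm object.
// Remove this definition when linking against the assembled routine.
extern "C" int addvectors(int *vec1, int *vec2, int elements)
{
    int sum = 0;
    for (int i = 0; i < elements; ++i)
    {
        sum += vec1[i] + vec2[i];
    }
    return sum;
}
```

A caller then simply writes addvectors(a, b, n), exactly as it would for a C function.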
In the .more_bytes: loop this doesn't look right to me:
add esi, 16
add ecx, 16
Are you sure you don't mean:
add esi, 4 ; point to next vec1 int
add ecx, 4 ; point to next vec2 int
Have you verified that your vector addition function is arithmetically correct?
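One way to answer that question is to compare the routine against the plain C++ loop for every length up to some bound, so that lengths which are not a multiple of 4 (the tail loop) get exercised too. The following is only a sketch: the names addvectors_ref, AddVectorsFn and first_mismatch are made up for illustration, and the test data is arbitrary.

```cpp
#include <vector>

// Reference implementation, same logic as addvectors_c from the first post.
int addvectors_ref(int *vec1, int *vec2, int elements)
{
    int sum = 0;
    for (int i = 0; i < elements; ++i)
    {
        sum += vec1[i] + vec2[i];
    }
    return sum;
}

// Hypothetical alias for any function with the addvectors signature,
// for example the extern "C" declaration of the assembly routine.
using AddVectorsFn = int (*)(int *, int *, int);

// Returns the first element count (0..max_elements) for which the candidate
// disagrees with the reference, or -1 if they agree for every length.
// Assumes max_elements >= 0.
int first_mismatch(AddVectorsFn candidate, int max_elements)
{
    std::vector<int> v1(max_elements), v2(max_elements);
    for (int i = 0; i < max_elements; ++i)
    {
        v1[i] = i + 1;      // small values, so the sum cannot overflow
        v2[i] = 2 * i - 7;  // mix of negative and positive values
    }
    for (int n = 0; n <= max_elements; ++n)
    {
        if (candidate(v1.data(), v2.data(), n) !=
            addvectors_ref(v1.data(), v2.data(), n))
        {
            return n;
        }
    }
    return -1;
}
```

Calling first_mismatch(addvectors, 64) with the assembly routine declared as extern "C" would then report the shortest input on which it goes wrong, if any.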
-
Regarding the validity: the initialization
add edx, esi ; calculate end of vec1
mov edi, esi
sub edi, 16 ; subtract 16 (one lddqu read) to find last index before we start reading 4 bytes at a time
should have been
add edx, esi ; calculate end of vec1
mov edi, edx
sub edi, 16 ; subtract 16 (one lddqu read) to find last index before we start reading 4 bytes at a time
We also need to increment ecx in the .remaining_bytes loop:
paddd xmm0, xmm6
paddd xmm0, xmm7
add esi, 4
add ecx, 4
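With both fixes applied (edi derived from the end pointer in edx, and ecx advanced in the tail loop), the control flow of the routine can be modeled in plain C++ roughly as follows. This is a scalar sketch of the structure, not SIMD code: the four-element lanes array stands in for xmm0, and the name addvectors_lanes is made up for illustration.

```cpp
// Scalar model of the corrected assembly routine.
int addvectors_lanes(const int *vec1, const int *vec2, int elements)
{
    // Stands in for xmm0: four independent 32-bit accumulators.
    // unsigned is used because paddd wraps around on overflow.
    unsigned lanes[4] = {0, 0, 0, 0};
    int i = 0;

    // .more_bytes: runs while a full 16-byte (4 int) read still fits,
    // i.e. while esi <= end - 16 in the assembly version.
    for (; elements - i >= 4; i += 4)
    {
        for (int k = 0; k < 4; ++k)
        {
            lanes[k] += static_cast<unsigned>(vec1[i + k]);  // paddd xmm0, xmm1
            lanes[k] += static_cast<unsigned>(vec2[i + k]);  // paddd xmm0, xmm2
        }
    }

    // .remaining_bytes: one int at a time into lane 0 (movd + paddd).
    // Both vectors advance together, which is the "add ecx, 4" fix.
    for (; i < elements; ++i)
    {
        lanes[0] += static_cast<unsigned>(vec1[i]);
        lanes[0] += static_cast<unsigned>(vec2[i]);
    }

    // .calc_sum: the two phaddd steps collapse the four lanes into one value.
    unsigned total = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    return static_cast<int>(total);  // wraps like the 32-bit register would
}
```

For 5 elements, the first four go through the main loop and the fifth through the tail loop, which is the case the original code got wrong.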
Rob, you wrote:
"In the .more_bytes: loop this doesn't look right to me:"
add esi, 16
add ecx, 16
I think it is correct. The idea here is to add 4 integers at a time using paddd, so each pointer has to advance by 16 bytes per iteration.
Obviously the C compiler didn't use xmm registers.
But I guess there is still scope for improving performance in the asm routine.
All the Best!
Regards,
Mathi.
-
"I'm guessing that msvc2012 inlines the code and removes the actual function call?"
I guess the VC compiler won't inline unless we ask it to.
I used VS2005, though.
BTW, I was only able to view the MM0 - MM7 registers. I was not able to view xmm0 to xmm7. Any ideas, anyone?
Thanks,
Mathi.