Hello everybody!
This is my first real attempt at using NASM to create a function that I link into a C++ program.
The function adds two vectors of integers and returns the sum of all the elements. It isn't the world's most useful function, but I wrote it to get the hang of NASM. It uses SSE instructions (lddqu is SSE3, phaddd is SSSE3 and pextrd is SSE4.1).
Am I doing the right things here? I mean things like:
register use?
how to set up and tear down a function (enter/leave)?
reading the args passed to the function?
Please give me some feedback.
My main goal with NASM is to create functions that link into C++ programs.
In C++ the function looks like this:
int addvectors_c(int *vec1, int *vec2, int elements)
{
    int sum = 0;
    for (int i = 0; i < elements; ++i)
    {
        sum += vec1[i] + vec2[i];
    }
    return sum;
}
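A minimal sketch of how such a function can be declared and called from the C++ side, assuming a 32-bit MSVC build with the default cdecl convention. The `extern "C"` declaration stops the C++ compiler from mangling the name, and under that convention the symbol `addvectors` is looked up as `_addvectors`, which is why the asm label has a leading underscore. The stand-in body and the `run_example` helper are hypothetical and only there so the sketch compiles on its own; in a real build the body is dropped and the NASM object file is linked instead.

```cpp
#include <cassert>

// Declaration the C++ side needs. extern "C" disables C++ name mangling,
// so the linker looks for the plain C symbol (decorated as _addvectors by
// 32-bit MSVC with the default cdecl convention).
extern "C" int addvectors(int *vec1, int *vec2, int elements);

// Stand-in definition (assumption, for illustration only) so this sketch
// links without the NASM object file. Remove it when linking the real one.
extern "C" int addvectors(int *vec1, int *vec2, int elements)
{
    int sum = 0;
    for (int i = 0; i < elements; ++i)
    {
        sum += vec1[i] + vec2[i];
    }
    return sum;
}

// Hypothetical helper showing a call with two small vectors.
int run_example()
{
    int a[5] = {1, 2, 3, 4, 5};
    int b[5] = {10, 20, 30, 40, 50};
    return addvectors(a, b, 5); // (1+2+3+4+5) + (10+20+30+40+50) = 165
}
```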
In NASM the function looks like this:
segment .data
segment .bss
segment .text
global _addvectors
;
; int addvectors(int *vec1, int *vec2, int elements);
;
_addvectors:
    enter 0,0
    pusha
    mov esi, [ebp+8]        ; first parameter vec1
    mov ecx, [ebp+12]       ; second parameter vec2
    mov edx, [ebp+16]       ; third parameter elements
    shl edx, 2              ; convert to number of bytes since an integer is 4 bytes
    add edx, esi            ; calculate end of vec1
    mov edi, edx
    sub edi, 16             ; end - 16: last address where a full 16-byte lddqu read still fits
    pxor xmm0, xmm0         ; xmm0 accumulates four partial sums
    pxor xmm7, xmm7
    ; while(esi <= edi)
.more_bytes:
    cmp edi, esi
    jl .remaining_bytes
    lddqu xmm1, [esi]       ; Using unaligned load (also tried movdqu)
    lddqu xmm2, [ecx]
    paddd xmm0, xmm1        ; Add to xmm0
    paddd xmm0, xmm2        ; Add to xmm0
    add esi, 16
    add ecx, 16
    jmp .more_bytes
    ; while(esi < edx), reading 4 bytes at a time
.remaining_bytes:
    cmp edx, esi
    jle .calc_sum
    movd xmm6, [esi]
    movd xmm7, [ecx]
    paddd xmm0, xmm6
    paddd xmm0, xmm7
    add esi, 4
    add ecx, 4              ; advance vec2 together with vec1
    jmp .remaining_bytes
.calc_sum:
    popa
    phaddd xmm0, xmm7       ; Do horizontal add of xmm0
    phaddd xmm0, xmm7       ; Horizontal add finished, total is in the low dword
    pextrd eax, xmm0, 0     ; Extract sum and put into eax for return value
    leave
    ret
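For comparison, the structure the asm routine is aiming for (a main loop that handles 16 bytes from each vector per iteration, a tail loop for the leftover integers, then a horizontal add) can be sketched with intrinsics in C++. This is only an illustration under assumptions: the name `addvectors_sse2` is hypothetical, it sticks to SSE2 (`_mm_loadu_si128` instead of lddqu, shuffles instead of phaddd), and it falls back to the plain loop when SSE2 is not available at compile time.

```cpp
#include <cassert>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h> // SSE2 intrinsics
#define HAVE_SSE2 1
#endif

// Hypothetical name; mirrors the structure of the asm routine.
int addvectors_sse2(const int *vec1, const int *vec2, int elements)
{
    int i = 0;
    int sum = 0;
#ifdef HAVE_SSE2
    __m128i acc = _mm_setzero_si128();              // like pxor xmm0, xmm0
    for (; i + 4 <= elements; i += 4)               // main loop: 4 ints (16 bytes) per vector
    {
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(vec1 + i));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(vec2 + i));
        acc = _mm_add_epi32(acc, a);                // like paddd xmm0, xmm1
        acc = _mm_add_epi32(acc, b);                // like paddd xmm0, xmm2
    }
    // Horizontal add of the four lanes using shuffles instead of phaddd.
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(1, 0, 3, 2)));
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(2, 3, 0, 1)));
    sum = _mm_cvtsi128_si32(acc);                   // lane 0 now holds the total
#endif
    for (; i < elements; ++i)                       // tail: one int at a time
    {
        sum += vec1[i] + vec2[i];
    }
    return sum;
}
```

Comparing the compiler's output for a function like this with the hand-written asm is one way to see what the compiler does with the prologue, epilogue and registers.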
Just as a side note, the asm version seems to be faster than the C++ version when the vectors are larger than 1024*800 bytes. Otherwise the C++ version is faster, especially when the functions are called many times with small vectors. I'm guessing that MSVC 2012 inlines the C++ code and removes the actual function call?
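A rough sketch of the kind of timing harness that can be used to compare the two versions, assuming C++11 `<chrono>` is available. The names `AddFn`, `TimingResult` and `time_calls` are hypothetical. The function under test is passed as a pointer and its result is written to a volatile variable so that it is not simply discarded; the compiler may still inline or hoist the C++ version, which is exactly the effect in question.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

typedef int (*AddFn)(int *, int *, int);

// Reference C++ version from above.
int addvectors_c(int *vec1, int *vec2, int elements)
{
    int sum = 0;
    for (int i = 0; i < elements; ++i)
    {
        sum += vec1[i] + vec2[i];
    }
    return sum;
}

struct TimingResult
{
    int checksum;   // result of the last call, to check correctness
    double seconds; // wall-clock time for all calls together
};

// Calls fn `repeats` times on two vectors of `elements` ints and times the loop.
TimingResult time_calls(AddFn fn, int elements, int repeats)
{
    std::vector<int> a(elements, 1);
    std::vector<int> b(elements, 2);
    volatile int sink = 0; // keeps the result of each call observable

    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
    {
        sink = fn(a.data(), b.data(), elements);
    }
    auto stop = std::chrono::steady_clock::now();

    TimingResult result;
    result.checksum = sink;
    result.seconds = std::chrono::duration<double>(stop - start).count();
    return result;
}
```

Passing the asm function (declared with `extern "C"`) and `addvectors_c` to the same helper, once with a few large vectors and once with many small ones, separates the per-call overhead from the per-element cost.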
Thank you for any tips and help!
/Mathias