Author Topic: My first attempt using nasm and adding two vectors! (Read 15983 times)

manler · « **on:** December 23, 2012, 09:00:33 PM »

Hello everybody!

This is my first real attempt at using NASM to create a function that I link into a C++ program.
The function adds up the elements of two integer vectors and returns the total. It isn't the world's most useful function, but I wrote it to get the hang of NASM. It uses SSE instructions: lddqu is SSE3, while phaddd and pextrd need SSSE3 and SSE4.1.

Am I doing the right things here? I mean things like:
Register use?
How to set up and tear down a function (enter/leave)?
Reading the args passed to the function?

Please give me some feedback.
My main goal with using NASM is to create functions that link into C++ programs.

In C++ the function looks like this:

Code: [Select]

int addvectors_c(int *vec1, int *vec2, int elements)
{
	int sum = 0;
	for (int i = 0; i < elements; ++i)
	{
		sum += vec1[i] + vec2[i];
	}
	return sum;
}

In NASM the function looks like this:

Code: [Select]

segment .data
segment .bss
segment .text
	global	_addvectors
;
; int addvectors(int *vec1, int *vec2, int elements);
;
_addvectors:
        enter   0, 0
        pusha
        mov     esi, [ebp+8]        ; first parameter vec1
        mov     ecx, [ebp+12]       ; second parameter vec2
        mov     edx, [ebp+16]       ; third parameter elements
        shl     edx, 2              ; convert to number of bytes since an integer is 4 bytes
        add     edx, esi            ; calculate end of vec1
        mov     edi, esi
        sub     edi, 16             ; subtract 16 (one lddqu read) to find last index before we start reading 4 bytes at a time
        pxor    xmm0, xmm0
        pxor    xmm7, xmm7

; while (esi <= edi)
.more_bytes:
        cmp     edi, esi
        jl      .remaining_bytes
        lddqu   xmm1, [esi]         ; using unaligned load (also tried movdqu)
        lddqu   xmm2, [ecx]
        paddd   xmm0, xmm1          ; add to xmm0
        paddd   xmm0, xmm2          ; add to xmm0
        add     esi, 16
        add     ecx, 16
        jmp     .more_bytes

.remaining_bytes:
        cmp     edx, esi
        jle     .calc_sum
        movd    xmm6, [esi]
        movd    xmm7, [ecx]
        paddd   xmm0, xmm6
        paddd   xmm0, xmm7
        add     esi, 4
        jmp     .remaining_bytes

.calc_sum:
        popa
        phaddd  xmm0, xmm7          ; do horizontal add of xmm0
        phaddd  xmm0, xmm7          ; horizontal add finished
        pextrd  eax, xmm0, 0        ; extract sum and put into eax for return value
        leave
        ret
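
For reference, here is a minimal sketch of how the C++ side might declare and call the routine. It assumes a 32-bit Windows cdecl build, where the C symbol addvectors is emitted as _addvectors. The stand-in body, the HAVE_NASM_ADDVECTORS macro and call_addvectors_demo are hypothetical names, there only so the snippet links without the assembled object file.

```cpp
// Declared with C linkage so the C++ compiler does not mangle the name.
// On 32-bit Windows the cdecl symbol becomes _addvectors, matching the
// "global _addvectors" line in the NASM source.
extern "C" int addvectors(int *vec1, int *vec2, int elements);

#ifndef HAVE_NASM_ADDVECTORS
// Stand-in (assumption): plain C++ body so this sketch links on its own.
// Define HAVE_NASM_ADDVECTORS and link the assembled object file instead
// to call the real routine.
extern "C" int addvectors(int *vec1, int *vec2, int elements)
{
    int sum = 0;
    for (int i = 0; i < elements; ++i)
        sum += vec1[i] + vec2[i];
    return sum;
}
#endif

// Calls the routine on two small arrays and returns the total.
int call_addvectors_demo()
{
    int a[5] = {1, 2, 3, 4, 5};
    int b[5] = {10, 20, 30, 40, 50};
    return addvectors(a, b, 5);
}
```

An odd length like 5 is handy here because it exercises both the 16-byte loop and the 4-byte tail once the real routine is linked in.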

Just as a side note, the asm version only seems to be faster when the vectors are larger than 1024*800 bytes. Otherwise the C++ version is faster, especially when the functions are called many times with small vectors. I'm guessing that MSVC 2012 inlines the code and removes the actual function call?

Thank you for any tips and help!

/Mathias

Rob Neff · « **Reply #1 on:** December 24, 2012, 05:00:23 PM »

I would eliminate the enter/leave and the pusha/popa instructions. Under the C calling convention (which C++ also uses), the only registers used by your function that must be non-volatile (appear unchanged to the calling function) are ESI and EDI. The framework for function entry and exit could therefore be as follows:

Code: [Select]

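_addvectors:
    push    edi
    push    esi
    ; No frame pointer is set up, so the arguments are addressed from esp:
    ; 8 bytes of saved registers + 4 bytes of return address = offset 12.
    mov     esi, [esp+12]   ; first parameter vec1
    mov     ecx, [esp+16]   ; second parameter vec2
    mov     edx, [esp+20]   ; third parameter elements
    .
    .  ; implementation
    .
    pextrd  eax, xmm0, 0    ; extract sum and put into eax for return value
    pop     esi
    pop     edi
    ret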

In the .more_bytes: loop this doesn't look right to me:

Code: [Select]

    add		esi, 16
    add		ecx, 16

Are you sure you don't mean:

Code: [Select]

    add		esi, 4    ; point to next vec1 int
    add		ecx, 4   ; point to next vec2 int

Have you verified that your vector addition function is arithmetically correct?
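
One way to check this, sketched here in C++: run the routine and the plain reference loop over every length from 0 up to some limit and report the first length where they disagree. The names addvectors_fn, addvectors_ref and first_mismatch are made up for illustration. The asm routine would be passed in as the candidate once it is declared with the same signature.

```cpp
#include <vector>

// Signature shared by the reference and the routine under test.
typedef int (*addvectors_fn)(int *vec1, int *vec2, int elements);

// Reference implementation, same loop as the C++ version in the first post.
int addvectors_ref(int *vec1, int *vec2, int elements)
{
    int sum = 0;
    for (int i = 0; i < elements; ++i)
        sum += vec1[i] + vec2[i];
    return sum;
}

// Returns the first length in 0..max_elements at which `candidate`
// disagrees with the reference, or -1 if every length agrees.
int first_mismatch(addvectors_fn candidate, int max_elements)
{
    // One spare element so the buffers are never empty.
    std::vector<int> a(max_elements + 1), b(max_elements + 1);
    for (int i = 0; i <= max_elements; ++i) {
        a[i] = i + 1;           // 1, 2, 3, ...
        b[i] = 100 * (i + 1);   // 100, 200, 300, ... (distinct per index)
    }
    for (int n = 0; n <= max_elements; ++n) {
        if (candidate(a.data(), b.data(), n) != addvectors_ref(a.data(), b.data(), n))
            return n;
    }
    return -1;
}
```

Giving every index a distinct value in both arrays matters: with constant test data, a routine that reads the wrong element would still produce the right total.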

Mathi · « **Reply #2 on:** December 25, 2012, 03:59:13 AM »

Regarding the validity, the initialization

Code: [Select]

add     edx, esi        ; calculate end of vec1
mov     edi, esi
sub     edi, 16         ; subtract 16 (one lddqu read) to find last index before we start reading 4 bytes at a time

should have been

Code: [Select]

add     edx, esi        ; calculate end of vec1
mov     edi, edx        ; start from the end of vec1, not its beginning
sub     edi, 16         ; subtract 16 (one lddqu read) to find last index before we start reading 4 bytes at a time

We also need to increment ecx in the .remaining_bytes loop:

Code: [Select]

paddd   xmm0, xmm6
paddd   xmm0, xmm7
add     esi, 4
add     ecx, 4
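
To make the effect of those two problems concrete, here is a rough C++ model of the pointer logic as originally posted (no SIMD; the function name is made up). Since edi starts 16 bytes before vec1, the 16-byte loop is skipped entirely, and since ecx never advances, every iteration of the 4-byte loop adds vec2[0] again.

```cpp
// Plain C++ model of the pointer logic in the routine as posted.
// p1 and p2 play the roles of esi and ecx; the names are illustrative.
int model_as_posted(const int *vec1, const int *vec2, int elements)
{
    int sum = 0;
    const int *p1 = vec1;
    const int *p2 = vec2;
    const int *end = vec1 + elements;   // edx after "add edx, esi"

    // edi = esi - 16, so "cmp edi, esi / jl .remaining_bytes" always jumps:
    // the 16-byte loop never runs and all work falls to the 4-byte loop.
    while (p1 < end) {
        sum += *p1 + *p2;   // movd + paddd for one int from each vector
        ++p1;               // add esi, 4
                            // ecx is not advanced, so *p2 stays vec2[0]
    }
    return sum;
}
```

With vec1 = {1, 2, 3} and vec2 = {10, 20, 30} this model gives 36 rather than the expected 66, which is the sum of vec1 plus three times vec2[0].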

Rob,

Quote

In the .more_bytes: loop this doesn't look right to me:
Code: [Select]

add esi, 16
add ecx, 16

I think it is correct. The idea here is to add 4 integers at a time using paddd, so both pointers have to advance by 16 bytes (4 integers of 4 bytes each) per iteration.
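
A small C++ sketch of that idea, with a four-element array standing in for xmm0 and the corrected loop bounds. This is only a model of the control flow under those assumptions, not the asm itself, and addvectors_lanes is a made-up name.

```cpp
// Four-lane model of the paddd-based routine with the corrected bounds.
int addvectors_lanes(const int *vec1, const int *vec2, int elements)
{
    int lanes[4] = {0, 0, 0, 0};    // stands in for the accumulator xmm0
    int i = 0;

    // Block loop: runs while a full group of 4 ints (16 bytes) remains.
    for (; i + 4 <= elements; i += 4) {             // add esi, 16 / add ecx, 16
        for (int k = 0; k < 4; ++k)
            lanes[k] += vec1[i + k] + vec2[i + k];  // the two paddd instructions
    }

    // Tail loop: movd loads a single int, which lands in lane 0.
    for (; i < elements; ++i)                       // add esi, 4 / add ecx, 4
        lanes[0] += vec1[i] + vec2[i];

    // Horizontal add: the two phaddd instructions collapse the lanes.
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Each lane only ever sees every fourth element, which is why the horizontal add at the end is needed to get the single total.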

Obviously the C compiler didn't use the xmm registers.
But I guess there is still scope for improving performance in the asm routine.
All the best!

Regards,
Mathi.

Mathi · « **Reply #3 on:** December 25, 2012, 05:12:10 AM »

Quote

I'm guessing that msvc2012 inlines the code and removes the actual function call?

I guess the VC compiler won't inline unless we ask it to.
I used VS2005, though.

BTW, I was only able to view the MM0 - MM7 registers. I was not able to view XMM0 to XMM7. Any ideas, anyone?

Thanks,
Mathi.
