Hello,
This is a Win64 SSE Vector2D speed test (Assembly, C, C Assembly, C Assembly Macros).
Notes:
The test was performed in a MINGW64 -> GCC -> C environment.
The test is based on twelve vector functions.
There are four sets of these functions.
Each set is defined in a different way.
The test is time-based, measured in milliseconds.
Vector2D base functions:
typedef struct tagVector2D
{
double X;
double Y;
}Vector2D;
extern char Vector2DAddD(Vector2D*,double,double);
extern char Vector2DSubD(Vector2D*,double,double);
extern char Vector2DMulD(Vector2D*,double,double);
extern char Vector2DDivD(Vector2D*,double,double);
extern char Vector2DAddV(Vector2D*,Vector2D*);
extern char Vector2DSubV(Vector2D*,Vector2D*);
extern char Vector2DMulV(Vector2D*,Vector2D*);
extern char Vector2DDivV(Vector2D*,Vector2D*);
extern char Vector2DMagnitude(Vector2D*,double*);
extern char Vector2DNormalize(Vector2D*);
extern char Vector2DDotProduct(Vector2D*,Vector2D*,double*);
extern char Vector2DNormal(Vector2D*);
First set (No prefix):
The functions are written in NASM and assembled into an obj file, which is then linked into the test project.
Located in: "vector2d.asm".
Assemble: "nasm.exe -f win64 -o vector2d.obj vector2d.asm".
Note: The code is brand new and is used as the base for the other function sets.
Preview of the first function: Vector2DAddD.
align 16
Vector2DAddD:
test rcx,rcx
setz al
jnz .proceed
ret
.proceed:
movapd xmm0,[rcx]
movlhps xmm1,xmm2
addpd xmm0,xmm1
movapd [rcx],xmm0
ret
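A side note on the listing above: movapd requires its memory operand to be 16-byte aligned, while a struct of two doubles is only guaranteed 8-byte alignment in C. Below is a minimal sketch of how the caller could request that alignment, assuming a C11 compiler. The helper is_aligned16 is a hypothetical name, not part of the test project.

```c
#include <stdalign.h>
#include <stdint.h>

typedef struct tagVector2D
{
    double X;
    double Y;
} Vector2D;

/* Hypothetical helper: returns 1 when p sits on a 16-byte boundary,
   which is what movapd expects for [rcx], and 0 otherwise. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

In a test loop the vectors would then be declared as "alignas(16) Vector2D v0,v1;" before being passed to Vector2DAddD.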
Second set (Prefix Z):
Defined as C style functions.
Located in: "ZFunctions.c".
Note: The code closely follows the first set.
Preview of the first function: ZVector2DAddD.
char ZVector2DAddD(Vector2D * v0,double pX,double pY)
{
if(v0==NULL) return 1;
v0->X+=pX;
v0->Y+=pY;
return 0;
};
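Because ZVector2DAddD lives in its own source file (ZFunctions.c) and the build command does not enable link-time optimization, GCC cannot inline it into the loop in main.c. Below is a minimal sketch of a header-style variant that the compiler is free to inline. The name IVector2DAddD is hypothetical, and this variant was not part of the timed test.

```c
#include <stddef.h>

typedef struct tagVector2D
{
    double X;
    double Y;
} Vector2D;

/* Hypothetical inlinable variant of ZVector2DAddD.
   "static inline" in a header makes the body visible at every call site,
   so the optimizer can remove the call overhead. */
static inline char IVector2DAddD(Vector2D *v0, double pX, double pY)
{
    if (v0 == NULL) return 1;   /* same error convention as the Z set */
    v0->X += pX;
    v0->Y += pY;
    return 0;
}
```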
Third set (Prefix P):
Defined as C Assembly functions.
Located in: "main.c".
Note: The code is based on the first set.
Preview of the first function: PVector2DAddD.
asm(" \n\
PVector2DAddD: \n\
test %rcx,%rcx \n\
setz %al \n\
jnz .proceed0 \n\
ret \n\
.proceed0: \n\
movapd 0x0(%rcx),%xmm0 \n\
movlhps %xmm2,%xmm1 \n\
addpd %xmm1,%xmm0 \n\
movapd %xmm0,0x0(%rcx) \n\
ret \n\
");
Fourth set (Prefix M):
Defined as C Assembly macros, closer to direct inline code.
Located in: "main.c".
Note: The code is based on the third set.
Preview of one macro: MVector2DAddV.
#define MVector2DAddV(v0,v1) asm("movapd %1,%%xmm0; movapd %2,%%xmm1; addpd %%xmm1,%%xmm0; movapd %%xmm0,%0;" :"=m" (v0) :"m" (v0), "m" (v1) :"%xmm0", "%xmm1");
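The same packed add can also be written with SSE2 compiler intrinsics, which leaves register allocation to GCC instead of naming xmm0 and xmm1 by hand. Below is a minimal sketch that was not part of the timed test. The names IVector2DAddV and VECTOR2D_HAVE_SSE2 are hypothetical. The unaligned load/store intrinsics are used so that no 16-byte alignment is assumed, and a plain C fallback is included for targets without SSE2.

```c
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>          /* SSE2 intrinsics */
#define VECTOR2D_HAVE_SSE2 1
#else
#define VECTOR2D_HAVE_SSE2 0
#endif

typedef struct tagVector2D
{
    double X;
    double Y;
} Vector2D;

/* Hypothetical intrinsic counterpart of MVector2DAddV: v0 = v0 + v1. */
static inline void IVector2DAddV(Vector2D *v0, const Vector2D *v1)
{
#if VECTOR2D_HAVE_SSE2
    __m128d a = _mm_loadu_pd(&v0->X);          /* load X,Y of v0 */
    __m128d b = _mm_loadu_pd(&v1->X);          /* load X,Y of v1 */
    _mm_storeu_pd(&v0->X, _mm_add_pd(a, b));   /* packed add, store back */
#else
    v0->X += v1->X;                            /* scalar fallback */
    v0->Y += v1->Y;
#endif
}
```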
Each function set has its own test loop function.
Located in: "main.c".
void noprefixLoop(void);
void ZprefixLoop(void);
void PprefixLoop(void);
void MprefixLoop(void);
Base test loop (No prefix):
void noprefixLoop(void)
{
unsigned long counter=0xfffffff;
clock_t timeStart;
double vardouble;
Vector2D v0,v1;
clock_t time;
char result;
v1 = (Vector2D){2000.0,3000.00};
v0 = (Vector2D){100000.00,150000.0};
timeStart=clock();
printf("\r\n TEST: noprefixLoop (vector2d.obj)");
while(counter>0)
{
//
Vector2DAddD(&v0,v1.X,v1.Y);
Vector2DSubD(&v0,v1.X,v1.Y);
Vector2DMulD(&v0,v1.X,v1.Y);
Vector2DDivD(&v0,v1.X,v1.Y);
//
Vector2DAddV(&v0,&v1);
Vector2DSubV(&v0,&v1);
Vector2DMulV(&v0,&v1);
Vector2DDivV(&v0,&v1);
//
v0 = (Vector2D){10.0,15.0};
Vector2DMagnitude(&v0,&vardouble);
v0 = (Vector2D){10.0,15.0};
Vector2DNormalize(&v0);
v0 = (Vector2D){10.0,15.0};
v1 = (Vector2D){2.0,3.0};
Vector2DDotProduct(&v0,&v1,&vardouble);
v0 = (Vector2D){10.0,15.0};
Vector2DNormal(&v0);
//
counter=counter-1;
};
time = clock() - timeStart;
printf("\r\nTime spent: %ld",(long)time);
};
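The timing pattern used by the loop above can be factored into a small helper. Below is a minimal sketch, assuming only the standard clock() API. The names ticks_to_ms, time_loop_ms and dummyLoop are hypothetical. clock() returns ticks, so dividing by CLOCKS_PER_SEC is what turns the value into milliseconds on every platform. On Windows CLOCKS_PER_SEC is 1000, so the raw tick count already reads as milliseconds.

```c
#include <time.h>

/* Convert a clock() tick difference to milliseconds. */
static double ticks_to_ms(clock_t ticks)
{
    return (double)ticks * 1000.0 / (double)CLOCKS_PER_SEC;
}

/* Run one test loop function and return the elapsed time in milliseconds. */
static double time_loop_ms(void (*loop)(void))
{
    clock_t start = clock();
    loop();
    return ticks_to_ms(clock() - start);
}

/* Hypothetical stand-in for noprefixLoop and the other loop functions.
   The counter is volatile so the compiler cannot delete the loop. */
static void dummyLoop(void)
{
    volatile unsigned long counter = 1000;
    while (counter > 0)
        counter = counter - 1;
}
```

A caller would use it as "double ms = time_loop_ms(dummyLoop);", passing any of the four loop functions in place of dummyLoop.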
Test Project files:
- vector2d.asm - obj
- main.h
- ZFunctions.c
- main.c
Compile:
"gcc main.c vector2d.obj ZFunctions.c -Ofast"
!!! Test results !!!:
TEST: noprefixLoop (vector2d.obj)
Time spent: 48467
TEST: ZprefixLoop (C Style functions)
Time spent: 37187
TEST: PprefixLoop (Defined as Plain Assembly Functions)
Time spent: 48342
TEST: MprefixLoop (Defined as Inline C Assembly Macros)
Time spent: 3656
END
After calling the twelve vector functions 0xfffffff times:
The fastest are the "M prefix functions", defined as C Assembly macros, at 3656 milliseconds,
because the code is almost raw and inline.
This looks more like an environment issue.
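The relative differences follow directly from the numbers above. Below is a minimal sketch of the arithmetic, using only the measured times and the loop count from the test. The names speedup, ns_per_iteration and TEST_ITERATIONS are hypothetical.

```c
/* Loop count used by every test loop (0xfffffff iterations). */
#define TEST_ITERATIONS 0xfffffffUL

/* How many times faster the run taking fast_ms is than the run taking slow_ms. */
static double speedup(double slow_ms, double fast_ms)
{
    return slow_ms / fast_ms;
}

/* Average time of one loop iteration (all twelve operations) in nanoseconds. */
static double ns_per_iteration(double total_ms)
{
    return total_ms * 1.0e6 / (double)TEST_ITERATIONS;
}
```

With the results above, speedup(48467, 3656) is about 13.3 and speedup(37187, 3656) is about 10.2. ns_per_iteration(48467) is about 180.6 ns and ns_per_iteration(3656) is about 13.6 ns.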
Added attachment.
I tested on:
Processors Information
-------------------------------------------------------------------------
Processor 1                 ID = 0
Number of cores             2 (max 2)
Number of threads           2 (max 2)
Name                        Intel Core 2 Duo E6400
Codename                    Conroe
Specification               Intel(R) Core(TM)2 CPU 6400 @ 2.13GHz
Package (platform ID)       Socket 775 LGA (0x0)
CPUID                       6.F.6
Extended CPUID              6.F
Core Stepping               B2
Technology                  65 nm
Core Speed                  2133.1 MHz
Multiplier x Bus Speed      8.0 x 266.6 MHz
Rated Bus speed             1066.6 MHz
Stock frequency             2133 MHz
Instructions sets           MMX, SSE, SSE2, SSE3, SSSE3, EM64T, VT-x
L1 Data cache               2 x 32 KBytes, 8-way set associative, 64-byte line size
L1 Instruction cache        2 x 32 KBytes, 8-way set associative, 64-byte line size
L2 cache                    2048 KBytes, 8-way set associative, 64-byte line size
FID/VID Control             yes
FID range                   6.0x - 8.0x
Max VID                     1.325 V
After doing this test and getting these results,
I don't want to deal with functions anymore.
This test revealed that inline code is far faster than calling functions, especially when time matters more than code size.
But GCC doesn't stand a chance against NASM.
In NASM we can use raw brain power.
Bye!