Author Topic: Win64 SSE Vector2D Speed Test (GCC, NASM)  (Read 13518 times)

Offline encryptor256

« on: March 17, 2014, 06:49:08 AM »
Hello,
This is a Win64 SSE Vector2D speed test (Assembly, C, C Assembly, C Assembly Macros).

Notes:
The test is performed in a MINGW64 -> GCC -> C environment.
The test is based on twelve vector functions.
There are four sets of these functions.
Each set is defined in a different way.
The test is time-based, measured in milliseconds.

Vector2D base functions:
Code: [Select]
typedef struct tagVector2D
{
    double X;
    double Y;
} Vector2D;

extern char Vector2DAddD(Vector2D*,double,double);
extern char Vector2DSubD(Vector2D*,double,double);
extern char Vector2DMulD(Vector2D*,double,double);
extern char Vector2DDivD(Vector2D*,double,double);
extern char Vector2DAddV(Vector2D*,Vector2D*);
extern char Vector2DSubV(Vector2D*,Vector2D*);
extern char Vector2DMulV(Vector2D*,Vector2D*);
extern char Vector2DDivV(Vector2D*,Vector2D*);
extern char Vector2DMagnitude(Vector2D*,double*);
extern char Vector2DNormalize(Vector2D*);
extern char Vector2DDotProduct(Vector2D*,Vector2D*,double*);
extern char Vector2DNormal(Vector2D*);

First set (No prefix):
Functions are written in NASM, assembled into an obj file, then linked into the test project at link time.
Located in: "vector2d.asm".
Compile: "nasm.exe -f win64 -o vector2d.obj vector2d.asm".
Note: Code is brand new and is used as the base for the other function sets.
Preview of first function: Vector2DAddD.
Code: [Select]
align 16
Vector2DAddD:
test rcx,rcx
setz al
jnz .proceed
ret
.proceed:
movapd xmm0,[rcx]
movlhps xmm1,xmm2
addpd xmm0,xmm1
movapd [rcx],xmm0
ret
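
For readers who do not read NASM: under the Win64 calling convention the pointer arrives in rcx and the two doubles in xmm1 and xmm2, so movlhps packs them into one register before the packed add. A rough C equivalent using SSE2 intrinsics could look like the sketch below. The name IVector2DAddD is hypothetical and is not one of the four tested sets. The sketch also assumes unaligned loads (movupd), because movapd faults on memory that is not 16-byte aligned and a struct of two doubles is only guaranteed 8-byte alignment.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

typedef struct tagVector2D
{
    double X;
    double Y;
} Vector2D;

/* Hypothetical name: an intrinsics sketch of the NASM routine above,
   not part of the original test project. */
char IVector2DAddD(Vector2D *v0, double pX, double pY)
{
    if (v0 == NULL) return 1;                 /* same result as "setz al" on a null pointer */
    __m128d a = _mm_loadu_pd(&v0->X);         /* load X (low) and Y (high), no alignment needed */
    __m128d b = _mm_set_pd(pY, pX);           /* low = pX, high = pY, like movlhps xmm1,xmm2 */
    _mm_storeu_pd(&v0->X, _mm_add_pd(a, b));  /* packed add, then store back */
    return 0;
}
```

If the vectors are declared with 16-byte alignment, the aligned forms (_mm_load_pd / _mm_store_pd) would match the original movapd more closely.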

Second set (Prefix Z):
Defined as C Style functions.
Located in: "ZFunctions.c".
Note: Code is closely based on the first set.
Preview of first function: ZVector2DAddD.
Code: [Select]
char ZVector2DAddD(Vector2D * v0,double pX,double pY)
{
    if(v0==NULL) return 1;
    v0->X+=pX;
    v0->Y+=pY;
    return 0;
}
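
The other C style functions are not shown in this post. Following the same pattern (null check, return 1 on error, 0 on success), a dot product matching the Vector2DDotProduct prototype might look like this. The body and the name SketchVector2DDotProduct are assumptions for illustration, not the code from ZFunctions.c.

```c
#include <stddef.h>

typedef struct tagVector2D
{
    double X;
    double Y;
} Vector2D;

/* Hypothetical sketch: the real ZFunctions.c body is not shown in the post. */
char SketchVector2DDotProduct(Vector2D *v0, Vector2D *v1, double *out)
{
    if (v0 == NULL || v1 == NULL || out == NULL) return 1;  /* 1 = bad pointer */
    *out = v0->X * v1->X + v0->Y * v1->Y;                   /* x0*x1 + y0*y1 */
    return 0;
}
```

With the values used in the test loop, (10,15) and (2,3), this gives 10*2 + 15*3 = 65.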

Third set (Prefix P):
Defined as C Assembly functions.
Located in: "main.c".
Note: Code is based on the first set.
Preview of first function: PVector2DAddD.
Code: [Select]
asm(" \n\
PVector2DAddD: \n\
test %rcx,%rcx \n\
setz %al \n\
jnz .proceed0 \n\
ret \n\
.proceed0: \n\
movapd 0x0(%rcx),%xmm0 \n\
movlhps %xmm2,%xmm1 \n\
addpd %xmm1,%xmm0 \n\
movapd %xmm0,0x0(%rcx) \n\
ret \n\
");

Fourth set (Prefix M):
Defined as C Assembly macros, closer to direct inline code.
Located in: "main.c".
Note: Code is based on the third set.
Preview of one macro: MVector2DAddV.
Code: [Select]
#define MVector2DAddV(v0,v1) asm("movapd %1,%%xmm0; movapd %2,%%xmm1; addpd %%xmm1,%%xmm0; movapd %%xmm0,%0;" :"=m" (v0) :"m" (v0), "m" (v1) :"%xmm0", "%xmm1");
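
A note on the macro above: movapd with a memory operand needs that memory to be 16-byte aligned, and a plain Vector2D is only guaranteed 8-byte alignment. Below is a hedged variant, assuming GCC or Clang on x86-64. The name SketchVector2DAddV is hypothetical. It uses movupd so alignment does not matter, a single "+m" read-write operand instead of listing v0 twice, and no trailing semicolon so it can be used like a normal statement.

```c
typedef struct tagVector2D
{
    double X;
    double Y;
} Vector2D;

/* Hypothetical variant of MVector2DAddV (GCC/Clang extended asm, x86-64 only).
   %0 is v0 (read and written), %1 is v1 (read only). */
#define SketchVector2DAddV(v0, v1)      \
    __asm__ volatile(                   \
        "movupd %0, %%xmm0\n\t"         \
        "movupd %1, %%xmm1\n\t"         \
        "addpd  %%xmm1, %%xmm0\n\t"     \
        "movupd %%xmm0, %0"             \
        : "+m"(v0)                      \
        : "m"(v1)                       \
        : "xmm0", "xmm1")
```

Usage would be SketchVector2DAddV(a, b); with a and b being Vector2D variables (not pointers), same as the original macro.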






Each function set has its own test loop function.
Located in: "main.c".
Code: [Select]
void noprefixLoop(void);
void ZprefixLoop(void);
void PprefixLoop(void);
void MprefixLoop(void);

Base test loop (No prefix):
Code: [Select]
void noprefixLoop(void)
{
    unsigned long counter=0xfffffff;
    clock_t timeStart;
    double vardouble;
    Vector2D v0,v1;
    clock_t time;

    v1 = (Vector2D){2000.0,3000.00};
    v0 = (Vector2D){100000.00,150000.0};

    timeStart=clock();

    printf("\r\n TEST: noprefixLoop (vector2d.obj)");

    while(counter>0)
    {
        // Vector and two doubles
        Vector2DAddD(&v0,v1.X,v1.Y);
        Vector2DSubD(&v0,v1.X,v1.Y);
        Vector2DMulD(&v0,v1.X,v1.Y);
        Vector2DDivD(&v0,v1.X,v1.Y);
        // Vector and vector
        Vector2DAddV(&v0,&v1);
        Vector2DSubV(&v0,&v1);
        Vector2DMulV(&v0,&v1);
        Vector2DDivV(&v0,&v1);
        // Magnitude, normalize, dot product, normal
        v0 = (Vector2D){10.0,15.0};
        Vector2DMagnitude(&v0,&vardouble);
        v0 = (Vector2D){10.0,15.0};
        Vector2DNormalize(&v0);
        v0 = (Vector2D){10.0,15.0};
        v1 = (Vector2D){2.0,3.0};
        Vector2DDotProduct(&v0,&v1,&vardouble);
        v0 = (Vector2D){10.0,15.0};
        Vector2DNormal(&v0);
        //
        counter=counter-1;
    }

    // clock_t is not an int, so cast it before printing
    time = clock() - timeStart;
    printf("\r\nTime spent: %ld",(long)time);
}
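
One detail about the timing: clock() returns ticks, not milliseconds. On MinGW CLOCKS_PER_SEC is 1000, so the two happen to be the same here, but on other systems they are not. A small conversion helper, as a sketch (ticks_to_ms is a hypothetical name, not part of the test project):

```c
#include <time.h>

/* Hypothetical helper: converts a clock() tick difference to milliseconds.
   Intended usage inside a test loop function:
       printf("\r\nTime spent: %.0f ms", ticks_to_ms(clock() - timeStart)); */
double ticks_to_ms(clock_t ticks)
{
    return (double)ticks * 1000.0 / (double)CLOCKS_PER_SEC;
}
```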

Test Project files:
  • vector2d.asm - obj
  • main.h
  • ZFunctions.c
  • main.c

Compile:
"gcc main.c vector2d.obj ZFunctions.c -Ofast"

!!! Test results !!!:
Code: [Select]

 TEST: noprefixLoop (vector2d.obj)
Time spent: 48467
 TEST: ZprefixLoop (C Style functions)
Time spent: 37187
 TEST: PprefixLoop (Defined as Plain Assembly Functions)
Time spent: 48342
 TEST: MprefixLoop (Defined as Inline C Assembly Macros)
Time spent: 3656
END

After calling the twelve vector functions 0xfffffff times:
The fastest are the "M prefix functions", defined as C Assembly Macros, at 3656 milliseconds,
because the code is almost raw and inline.
This looks more like an environment issue (function call overhead) than a difference in the vector code itself.
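
To put the numbers above side by side, here is a small sketch that turns the reported times into ratios. The helper name speedup is hypothetical. The times are the ones from the results listing.

```c
#include <stdio.h>

/* Hypothetical helper: how many times faster "compared" is than "baseline". */
double speedup(double baseline_ms, double compared_ms)
{
    return baseline_ms / compared_ms;
}

/* Prints the ratios for the times reported in this post. */
void print_speedups(void)
{
    const double nasm_ms = 48467.0;   /* noprefixLoop */
    const double c_ms    = 37187.0;   /* ZprefixLoop  */
    const double pasm_ms = 48342.0;   /* PprefixLoop  */
    const double macro_ms = 3656.0;   /* MprefixLoop  */

    printf("Macros vs NASM obj:    %.2fx\n", speedup(nasm_ms, macro_ms));
    printf("Macros vs C functions: %.2fx\n", speedup(c_ms, macro_ms));
    printf("Macros vs C Assembly:  %.2fx\n", speedup(pasm_ms, macro_ms));
    printf("C functions vs NASM:   %.2fx\n", speedup(nasm_ms, c_ms));
}
```

So the macro set is roughly 13 times faster than both assembly function sets and roughly 10 times faster than the C functions, and the C functions are about 1.3 times faster than the NASM obj functions.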

Added attachment.



I tested on:
Code: [Select]
Processors Information
-------------------------------------------------------------------------

Processor 1                 ID = 0
Number of cores             2 (max 2)
Number of threads           2 (max 2)
Name                        Intel Core 2 Duo E6400
Codename                    Conroe
Specification               Intel(R) Core(TM)2 CPU          6400  @ 2.13GHz
Package (platform ID)       Socket 775 LGA (0x0)
CPUID                       6.F.6
Extended CPUID              6.F
Core Stepping               B2
Technology                  65 nm
Core Speed                  2133.1 MHz
Multiplier x Bus Speed      8.0 x 266.6 MHz
Rated Bus speed             1066.6 MHz
Stock frequency             2133 MHz
Instructions sets           MMX, SSE, SSE2, SSE3, SSSE3, EM64T, VT-x
L1 Data cache               2 x 32 KBytes, 8-way set associative, 64-byte line size
L1 Instruction cache        2 x 32 KBytes, 8-way set associative, 64-byte line size
L2 cache                    2048 KBytes, 8-way set associative, 64-byte line size
FID/VID Control             yes
FID range                   6.0x - 8.0x
Max VID                     1.325 V



After doing this test and getting the results,
I don't want to deal with functions anymore. :D
This test showed that inline code is much faster than calling functions, especially when speed matters more than code size.
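
A related option that was not measured in this test: the C style functions live in ZFunctions.c, a separate file from the loop in main.c, and the compile line does not use link-time optimization, so GCC cannot inline them into the loop. Defining them as static inline in a shared header gives the compiler that chance without writing assembly. A sketch, with the hypothetical name HVector2DAddD:

```c
#include <stddef.h>

typedef struct tagVector2D
{
    double X;
    double Y;
} Vector2D;

/* Hypothetical header-style variant of ZVector2DAddD.
   "static inline" lets every file that includes the header inline the body. */
static inline char HVector2DAddD(Vector2D *v0, double pX, double pY)
{
    if (v0 == NULL) return 1;
    v0->X += pX;
    v0->Y += pY;
    return 0;
}
```

Whether this would get close to the macro set on this machine is an open question; it would need its own test loop to find out.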

But GCC doesn't stand a chance against NASM.
In NASM we can do - raw brain power. :D

Bye!
Encryptor256's Investigation \ Research Department.