A small measurement of both methods. glibc for x86-64 doesn't use the fsin instruction, since SSE/SSE2 is the default way to do floating point there:

```
; testfp.asm
bits 64
default rel

section .text

; SSE has no 'transcendental' instructions, so we use the x87 fsin.
; The argument arrives in XMM0 and the result is returned in XMM0 as well,
; so the value is bounced through memory, using the red zone here.

global sin_
sin_:
  movsd [rsp-8],xmm0
  fld   qword [rsp-8]
  fsin
  fstp  qword [rsp-8]
  movsd xmm0,[rsp-8]
  ret
```
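For comparison, the same fsin wrapper can be sketched in C with GCC/Clang extended inline assembly instead of a separate NASM file. This is only an illustration, not what the test in this answer uses: the name `sin_x87` is made up here, and the `"t"` constraint (top of the x87 stack) is specific to GCC-compatible compilers on x86. The compiler generates the XMM0-to-memory-to-x87 round trip itself.

```c
/* Hypothetical C equivalent of sin_(): x is loaded into st(0),
 * fsin replaces it with its sine, and the result is moved back
 * to wherever the compiler keeps doubles (XMM0 on x86-64). */
static inline double sin_x87(double x)
{
    __asm__("fsin" : "+t"(x));
    return x;
}
```

Because this version can be inlined, it also avoids the call overhead, so its timings would not be directly comparable with those of the external `sin_`.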

And the test code:

```
// test.c
#include <stdio.h>
#include <inttypes.h>
#include <math.h>
#include "cycle_counting.h"

extern double sin_( double );

int main( void )
{
  double a, s1, s2;
  counter_T c1, c2;
  unsigned int i;

  i = 0;
  c1 = c2 = 0;

  // One full turn in 1-degree steps (converted to radians).
  for ( a = 0.0; a < 2.0 * M_PI; a += M_PI / 180.0, i++ )
  {
    counter_T ctmp;

    ctmp = BEGIN_TSC();
    s1 = sin( a );        // glibc sin() function.
    ctmp = END_TSC( ctmp );
    c1 += ctmp;

    ctmp = BEGIN_TSC();
    s2 = sin_( a );       // our function.
    ctmp = END_TSC( ctmp );
    c2 += ctmp;

    // Without this printf the compiler would drop the sin() call:
    // it is an 'intrinsic' and its result would otherwise be unused.
    printf( "%g, %g\n", s1, s2 );
  }

  // Average cycles per call.
  c1 /= i;
  c2 /= i;

  printf( "glibc sin(): %" PRIu64 " cycles.\n"
          "sin_(): %" PRIu64 " cycles.\n",
          c1, c2 );
}
```

BEGIN_TSC() and END_TSC() read the timestamp counter, serializing the processor around each read.
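The contents of `cycle_counting.h` aren't shown here, so the following is only a guess at what such a header might look like, inferred from how the test uses it: `counter_T` is printed with `PRIu64`, `BEGIN_TSC()` returns a start value, and `END_TSC()` takes that value and returns the elapsed cycles. The helper `serialize_` and the choice of CPUID plus RDTSC/RDTSCP are assumptions, and the inline assembly needs a GCC-compatible compiler on x86-64.

```c
#include <stdint.h>

typedef uint64_t counter_T;   /* assumed: the test prints it with PRIu64 */

/* CPUID is a serializing instruction: all earlier instructions
 * complete before it does. (Hypothetical helper.) */
static inline void serialize_(void)
{
    uint32_t a = 0, b, c = 0, d;
    __asm__ __volatile__("cpuid"
                         : "+a"(a), "=b"(b), "+c"(c), "=d"(d)
                         : : "memory");
}

/* Serialize, then read the timestamp counter. */
static inline counter_T BEGIN_TSC(void)
{
    uint32_t lo, hi;
    serialize_();
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi) : : "memory");
    return ((counter_T)hi << 32) | lo;
}

/* Read the counter once the timed code has executed (RDTSCP waits
 * for earlier instructions), serialize, and return elapsed cycles. */
static inline counter_T END_TSC(counter_T start)
{
    uint32_t lo, hi, aux;
    __asm__ __volatile__("rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux) : : "memory");
    serialize_();
    (void)aux;
    return (((counter_T)hi << 32) | lo) - start;
}
```

Note that the reported cycle counts include the overhead of the measurement itself, which is the same for both functions being compared.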

Sometimes sin_() is faster, but not always. Here are two runs:

```
$ nasm -felf64 -o testfp.o testfp.asm
$ cc -O2 -ffast-math -c -o test.o test.c
$ cc -s -o test test.o testfp.o -lm
$ ./test
...
glibc sin(): 145 cycles.
sin_(): 102 cycles.
...
$ ./test
...
glibc sin(): 174 cycles.
sin_(): 210 cycles.
```

In the first run, sin_() takes about 29.7% fewer cycles than sin() (102 vs. 145). In the second, sin() takes about 17% fewer cycles than sin_() (174 vs. 210).
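These percentages are the cycles saved relative to the slower routine in each run. A minimal sketch of the arithmetic, with a helper name (`percent_fewer`) invented for illustration:

```c
/* Cycles saved by the faster routine, as a percentage of the
 * slower routine's cycle count. */
static double percent_fewer(double slow, double fast)
{
    return 100.0 * (slow - fast) / slow;
}

/* percent_fewer(145, 102) is about 29.7  (first run,  sin_() ahead)
 * percent_fewer(210, 174) is about 17.1  (second run, sin()  ahead) */
```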

So using x87 instructions is no guarantee of better performance, even though they provide hardware-assisted versions of these complex functions.