A small measurement of both methods. glibc (for x86-64) doesn't use the fsin instruction, since SSE/SSE2 is the default way to do floating point there, so here is a routine that does:
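; testfp.asm
  bits 64
  default rel

  section .text

; Since SSE doesn't have any 'transcendental' instructions we can use
; the x87 fsin, but the argument comes in XMM0 and the result goes back
; in XMM0 as well, so both have to pass through memory.
; Uses the red zone (System V ABI), so no stack adjustment is needed.
  global sin_
sin_:
  movsd [rsp-8],xmm0      ; spill the argument to the red zone
  fld   qword [rsp-8]     ; st(0) = argument
  fsin                    ; st(0) = sin(st(0))
  fstp  qword [rsp-8]     ; store the result and pop the x87 stack
  movsd xmm0,[rsp-8]      ; return value goes in xmm0
  ret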
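For comparison, the same routine can be sketched in C with extended inline assembly instead of a separate NASM file. This is only a minimal sketch, assuming GCC or Clang on x86/x86-64; the name sin_x87 is hypothetical and not part of the code above. The "+t" constraint asks the compiler to keep the operand on top of the x87 stack, so it generates the store/fld/fstp/load sequence that was written by hand above.

```c
#include <assert.h>

/* Hypothetical C counterpart of sin_ (GCC/Clang extended asm, x86 only). */
static inline double sin_x87(double x)
{
    /* fsin only handles |x| < 2^63 (about 9.22e18); outside that range it
       leaves st(0) unchanged. This sketch simply rejects such inputs
       (NaN and infinities fail the comparisons too). */
    assert(x > -9.2e18 && x < 9.2e18);

    /* "+t": x is loaded into st(0), modified by fsin, and read back. */
    __asm__ ("fsin" : "+t" (x));
    return x;
}
```

Called as sin_x87( a ), it could replace sin_( a ) in a test program. Because the compiler can inline it, the timing may differ from the NASM version, which always pays for a call.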
And the test code:
// test.c
#include <stdio.h>
#include <inttypes.h>
#include <math.h>
#include "cycle_counting.h"

extern double sin_( double );

int main( void )
{
  double a, s1, s2;
  counter_T c1, c2;
  unsigned int i;

  i = 0;
  c1 = c2 = 0;

  for ( a = 0.0; a < 2.0 * M_PI; a += M_PI / 180.0, i++ ) // steps of 1 degree, in radians.
  {
    counter_T ctmp;

    ctmp = BEGIN_TSC();
    s1 = sin( a );    // glibc sin() function.
    ctmp = END_TSC( ctmp );
    c1 += ctmp;

    ctmp = BEGIN_TSC();
    s2 = sin_( a );   // our function.
    ctmp = END_TSC( ctmp );
    c2 += ctmp;

    // The printf is here because, without it, the compiler would get rid of
    // the sin() call: it is an 'intrinsic' and its result would be unused.
    printf( "%g, %g\n", s1, s2 );
  }

  // Average cycles per call.
  c1 /= i;
  c2 /= i;

  printf( "glibc sin(): %" PRIu64 " cycles.\n"
          "sin_(): %" PRIu64 " cycles.\n",
          c1, c2 );
}
BEGIN_TSC() and END_TSC() read the timestamp counter, serializing the processor.
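The header cycle_counting.h isn't shown here, so the following is only a guess at what such helpers could look like, assuming GCC or Clang on x86-64 and that counter_T is a 64-bit unsigned integer (the test prints it with PRIu64). The names begin_tsc_sketch, end_tsc_sketch and average_cycles are hypothetical. This sketch orders the reads with lfence and rdtscp rather than a fully serializing instruction such as cpuid; the real header may well do it differently.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, __rdtscp, _mm_lfence (GCC/Clang) */

typedef uint64_t counter_T;   /* assumption: 64-bit cycle counter */

/* Hypothetical stand-in for BEGIN_TSC(): let earlier instructions
   finish, then read the timestamp counter. */
static inline counter_T begin_tsc_sketch(void)
{
    _mm_lfence();
    return __rdtsc();
}

/* Hypothetical stand-in for END_TSC(start): read the counter after the
   measured code has executed and return the elapsed cycles. */
static inline counter_T end_tsc_sketch(counter_T start)
{
    unsigned int aux;                 /* receives IA32_TSC_AUX, unused here */
    counter_T now = __rdtscp(&aux);   /* waits for prior instructions */
    _mm_lfence();                     /* keeps later ones from starting early */
    return now - start;
}

/* Average cycles per call, as done by c1 /= i in the test. */
static inline counter_T average_cycles(counter_T total, unsigned int n)
{
    assert(n > 0);
    return total / n;
}
```

Note that the counter ticks at a constant reference rate on current processors, so the "cycles" are reference cycles, not core clock cycles, and the numbers shift with frequency scaling.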
Sometimes sin_() is faster, but not always. Here are two results:
$ nasm -felf64 -o testfp.o testfp.asm
$ cc -O2 -ffast-math -c -o test.o test.c
$ cc -s -o test test.o testfp.o -lm
$ ./test
...
glibc sin(): 145 cycles.
sin_(): 102 cycles.
...
$ ./test
...
glibc sin(): 174 cycles.
sin_(): 210 cycles.
In the first case sin_() takes about 29.7% fewer cycles than sin(). In the second case sin() takes about 17% fewer cycles than sin_().
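The percentages are just the cycles saved by the faster routine divided by the cycles of the slower one. A small sketch of that arithmetic, using the averages printed above (the function names are made up for illustration):

```c
#include <assert.h>
#include <stdio.h>

/* Cycles saved by the faster routine, as a percentage of the slower one. */
static double percent_fewer_cycles(double slower, double faster)
{
    assert(slower > 0.0 && faster <= slower);
    return 100.0 * (slower - faster) / slower;
}

/* Prints both comparisons from the two runs shown above. */
static void print_comparison(void)
{
    /* Run 1: glibc sin() = 145 cycles, sin_() = 102 cycles. */
    printf("run 1: sin_() takes %.1f%% fewer cycles than sin()\n",
           percent_fewer_cycles(145.0, 102.0));

    /* Run 2: glibc sin() = 174 cycles, sin_() = 210 cycles. */
    printf("run 2: sin() takes %.1f%% fewer cycles than sin_()\n",
           percent_fewer_cycles(210.0, 174.0));
}
```

Calling print_comparison() gives 29.7% for the first run and 17.1% for the second.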
So using the x87 instructions, with their hardware-assisted complex functions, is no guarantee of better performance.