Author Topic: What are the SIMD equivalents to x87 transcendental functions? (Read 59103 times)

ben321 · « **on:** December 13, 2023, 03:22:55 PM »

For example, with x87 floating point registers, I can use fpsin to calculate the sine of an angle. Now with 64bit CPUs, the recommendation is to stop using the x87 instructions, and use the SIMD instructions instead, which operate on the XMM registers. But I have a problem. What are the SIMD instructions for transcendental functions like sine? Do they even exist? I can't find any proof of their existence. One thing is certain though, is that even if they exist, I can't find anything on the internet about them.

fredericopissarra · « **Reply #1 on:** December 13, 2023, 06:29:49 PM »

There are no math instructions other then the four basic arithmetic functions (add, substraction, multiplication and division) with SSE (or almost all SIMD in other processors as well).

This is not a bad thing, since fsin and fcos are very slow (50-120 cycles).

ben321 · « **Reply #2 on:** December 16, 2023, 08:19:46 AM »

Quote from: fredericopissarra on December 13, 2023, 06:29:49 PM

There are no math instructions other then the four basic arithmetic functions (add, substraction, multiplication and division) with SSE (or almost all SIMD in other processors as well).

This is not a bad thing, since fsin and fcos are very slow (50-120 cycles).

Then how does Intel expect you to run any useful algorithms (which often require transcendental functions like sine or cosine) with the XMM registers (what they want you to use when in 64bit mode)? You can't do sine and cosine simply by using a combination of add, subtract, multiply, and divide.

Michael Mehlich · « **Reply #3 on:** December 16, 2023, 04:12:48 PM »

Transcendental functions actually can be computed (i.e. approximated up to an arbitrary desired precision) using only those basic operations; efficient algorithms to do so are not particularly simple though, and are best left to be implemented by experts in that area. As such, you should probably just use a library that has implementations for these. A library written (in C) to compute these using SIMD operations can be found at https://sleef.org/.

ben321 · « **Reply #4 on:** December 16, 2023, 07:43:24 PM »

Quote from: Michael Mehlich on December 16, 2023, 04:12:48 PM

Transcendental functions actually can be computed (i.e. approximated up to an arbitrary desired precision) using only those basic operations; efficient algorithms to do so are not particularly simple though, and are best left to be implemented by experts in that area. As such, you should probably just use a library that has implementations for these. A library written (in C) to compute these using SIMD operations can be found at https://sleef.org/.

If you need to do tons of arithmetic to get it to work though, is it actually any faster than the native x87 fsin and fcos instructions.

Michael Mehlich · « **Reply #5 on:** December 16, 2023, 11:23:24 PM »

The transcendental operations in the x87 are very slow operations; they actually result in the CPU to execute on the order of 100 "micro" instructions, doing the same kind of approximation as one would do with SIMD instructions. There is only a small inherent performance advantage in having a single "macro" instruction for the sequence of "micro" instructions.
There are some discussions on stackoverflow that suggest that modern SIMD based implementations can indeed be faster (see e.g. https://stackoverflow.com/questions/2683588/what-is-the-fastest-way-to-compute-sin-and-cos-together).
(As always, perform your own measurements...)
Note however, that actual performance of various implementations (whether in hardware or in software) is often hard to compare as they provide different precision, and different ranges on which they work properly. In general, supporting less precision over a smaller range will mean larger performance.

ben321 · « **Reply #6 on:** December 17, 2023, 01:49:27 AM »

Interesting. I need large precision for doing things like highly detailed mandlebrot fractal generation. The higher the precision the better. This involves complex multiplication and addition.

One complex add or subtract uses 2 fp adds or subtracts.
One complex multiply uses 4 fp multiplies, 1 fp add, and 1 fp subtract.
One complex squared performs a complex multiply with both inputs being the same complex value.
Magnitude of a complex number uses 2 fp multiplies, 1 fp add, and 1 fp square root.

These functions are all used, over many times over (due to the iterative nature of the mandelbrot equation), for calculating each pixel of a mandelbrot image.

Michael Mehlich · « **Reply #7 on:** December 17, 2023, 06:11:45 PM »

If you want to do deep zoom levels for mandelbrot, you'll eventually need more precision than any hardware implementation provides. In that case, you'll need some arbitrary precision floating point support. You can write your own (with a lot of effort), or get an existing package, e.g. gnu mpfr that provides such functionality.
There is going to be a big performance penalty for doing so; but you won't have a choice if you need the extra precision.
A good implementation with zoom capabilities will probably start using ordinary float/double on low zoom levels, and only when larger precision is needed, switch over to some higher precision implementation, e.g. first to using quad floats (which AFAIK are available in sleef), and then ultimately to an arbitrary precision package like mpfr, selecting a suitable precision for the zoom level of interest.

fredericopissarra · « **Reply #8 on:** December 19, 2023, 12:56:46 PM »

A small measure of both methods: glibc (for x86-64) don't use fsin instruction (since SSE/SSE2 is the default means to use floating point):

Code: [Select]

; testfp.asm
  bits  64
  default rel

  section .text

  ; Since SSE don't have any 'transcendental' instructions we can use
  ; fp87 fsin, but the argument comes from XMM0 and the result is XMM0 as well.
  ; Using red zone here.

  global sin_
sin_:
  movsd   [rsp-8],xmm0

  fld     qword [rsp-8]
  fsin
  fstp    qword [rsp-8]

  movsd   xmm0,[rsp-8]
  ret

And the test code:

Code: [Select]

// test.c
#include <stdio.h>
#include <inttypes.h>
#include <math.h>
#include "cycle_counting.h"

extern double sin_( double );

int main( void )
{
  double a, s1, s2;
  counter_T c1, c2;
  unsigned int i;

  i = 0;
  c1 = c2 = 0;
  for ( a = 0.0; a < 2.0 * M_PI; a += M_PI / 180.0, i++ ) // dregrees to radians.
  {
    counter_T ctmp;

    ctmp = BEGIN_TSC();
    s1 = sin( a );          // glibc sin() function.
    ctmp = END_TSC( ctmp );
    c1 += ctmp;

    ctmp = BEGIN_TSC();
    s2 = sin_( a );         // our function.
    ctmp = END_TSC( ctmp );
    c2 += ctmp;

    // this is here 'cause without this the compiler will get rid of sin() call
    // since it is an 'intrinsic' and the result isn't used, otherwise.
    printf( "%g, %g\n", s1, s2 );
  }

  c1 /= i;
  c2 /= i;

  printf( "glibc sin(): %" PRIu64 " cycles.\n"
          "sin_():      %" PRIu64 " cycles.\n",
    c1, c2 );
}

BEGIN_TSC() and END_TSC() gets the timestamp counter, serializing the processor.
Sometimes sin_() is faster, but not always. Here's two results:

Code: [Select]

$ nasm -felf64 -o testfp.o testfp.asm
$ cc -O2 -ffast-math -c -o test.o test.c
$ cc -s -o test test.o testfp.o -lm
$ ./test
...
glibc sin(): 145 cycles.
sin_():      102 cycles.
...
$ ./test
...
glibc sin(): 174 cycles.
sin_():      210 cycles.

First case, sin_() is 29,6% faster than sin(). Second case, sin() is 17% faster than sin_().

So, to use fp87 instructions isn't a garantee of performance from hardware assisted complex functions.

fredericopissarra · « **Reply #9 on:** December 19, 2023, 12:59:23 PM »

If you are curious, here's the implementation of sin() from glibc 2.31 source code: /sysdeps/ieee754/dbl-64/s_sin.c

NASM - The Netwide Assembler

News:

Author Topic: What are the SIMD equivalents to x87 transcendental functions? (Read 59103 times)

ben321

What are the SIMD equivalents to x87 transcendental functions?

fredericopissarra

Re: What are the SIMD equivalents to x87 transcendental functions?

ben321

Re: What are the SIMD equivalents to x87 transcendental functions?

Michael Mehlich

Re: What are the SIMD equivalents to x87 transcendental functions?

ben321

Re: What are the SIMD equivalents to x87 transcendental functions?

Michael Mehlich

Re: What are the SIMD equivalents to x87 transcendental functions?

ben321

Re: What are the SIMD equivalents to x87 transcendental functions?

Michael Mehlich

Re: What are the SIMD equivalents to x87 transcendental functions?

fredericopissarra

Re: What are the SIMD equivalents to x87 transcendental functions?

fredericopissarra

Re: What are the SIMD equivalents to x87 transcendental functions?