Author Topic: 64 bit Windows command line program with SSE2 and AVX instructions (Read 37949 times)

Gerhard · « **on:** March 03, 2013, 06:45:14 PM »

I've attached an archive to this message: float.zip. Please read the readme.txt file first (it's included in the archive). The applications should run under 64-bit Windows 7 with SP1 (native or in a VM). I couldn't test them under Windows 8, but they should work there, too.

The program checks at runtime which instruction sets the underlying machine supports. If your CPU doesn't support AVX, the application won't crash; in that case only the last procedure is skipped and the program terminates correctly.
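For readers who want to try such a runtime check themselves, here is a minimal sketch in C. It is not the code from float.zip: the names `cpu_has_sse2`, `cpu_has_avx` and `read_xcr0_low` are made up for illustration, and the sketch assumes GCC or Clang (for `<cpuid.h>` and inline assembly). AVX needs two things: the CPU must report it via CPUID, and the operating system must save the YMM registers, which is what the OSXSAVE bit and XGETBV reveal.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang wrapper for the CPUID instruction */

/* Read the low 32 bits of XCR0 with XGETBV (hypothetical helper).
   Only call this after CPUID has reported OSXSAVE. */
static unsigned int read_xcr0_low(void)
{
    unsigned int eax, edx;
    __asm__ volatile ("xgetbv" : "=a"(eax), "=d"(edx) : "c"(0));
    (void)edx;       /* upper half is not needed here */
    return eax;
}

/* Returns 1 if the CPU reports SSE2 (CPUID leaf 1, EDX bit 26). */
int cpu_has_sse2(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (edx >> 26) & 1u;
}

/* Returns 1 only if both the CPU and the OS support AVX. */
int cpu_has_avx(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    if (!(ecx & (1u << 27)))   /* OSXSAVE: OS has enabled XSAVE/XGETBV */
        return 0;
    if (!(ecx & (1u << 28)))   /* AVX supported by the CPU */
        return 0;
    /* XCR0 bits 1 and 2: OS saves XMM and YMM register state */
    return (read_xcr0_low() & 0x6u) == 0x6u;
}
#else
/* Assumption: on non-x86 targets neither instruction set exists. */
int cpu_has_sse2(void) { return 0; }
int cpu_has_avx(void)  { return 0; }
#endif
```

A caller would run the AVX routine only if `cpu_has_avx()` returns 1 and skip it otherwise, which matches the behavior described above.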

The program floatsum.exe sums up an array of float (REAL4) numbers in C and in assembly language (with SSE2 instructions and the new AVX instructions). The differences in run time are tremendous. Here is the application's output on my machine, an Intel Core i7-3770, 3.4 GHz, with Windows 7 (64-bit) SP1:

Code: [Select]

Supported by Processor and installed Operating System:
------------------------------------------------------

     MMX, CMOV and FCOMI, SSE, SSE2, SSE3, SSSE3, SSE4.1,
     POPCNT, SSE4.2, AVX, PCLMUL and AES

Calculating the sum of a float array with different methods.
That'll take a little while. Please be patient ...

Simple C implementation:
------------------------
sum1              = 8390656.00
Elapsed Time      = 15.96 Seconds

FPU code with 4 accumulators:
-----------------------------
sum2              = 8390656.00
Elapsed Time      = 7.10 Seconds
Performance Boost = 225%

C implementation with 4 accumulators:
-------------------------------------
sum3              = 8390656.00
Elapsed Time      = 5.38 Seconds
Performance Boost = 297%

SSE2 code with 4 accumulators:
------------------------------
sum4              = 8390656.00
Elapsed Time      = 1.36 Seconds
Performance Boost = 1175%

AVX code with 4 accumulators:
-----------------------------
sum5              = 8390656.00
Elapsed Time      = 0.69 Seconds
Performance Boost = 2326%

For the C sources I used gcc 4.7.2 for Windows. With some minimal changes (especially to the data alignment) they should work with VC or Pelles C, too, but that's not tested. The assembly language sources are assembled with nasm 2.10.07 for Windows.
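As an illustration of the alignment point, here is a sketch of how a 32-byte aligned array could be declared so that the same source builds with gcc and VC. The macro name `ALIGN32`, the array `V`, its size and the helper `is_aligned32` are assumptions for this example and are not taken from the archive; the last branch assumes that any other compiler accepts the C11 `_Alignas` keyword. Aligned AVX loads need 32-byte alignment, aligned SSE loads need 16 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Pick the compiler-specific spelling of "align to 32 bytes". */
#if defined(_MSC_VER)
#  define ALIGN32 __declspec(align(32))          /* Microsoft VC */
#elif defined(__GNUC__)
#  define ALIGN32 __attribute__((aligned(32)))   /* gcc, clang */
#else
#  define ALIGN32 _Alignas(32)                   /* assumed: C11 compiler */
#endif

#define N_FLOATS 4096u   /* hypothetical array size */

/* 32-byte alignment is sufficient for both SSE (16) and AVX (32). */
static ALIGN32 float V[N_FLOATS];

/* Returns 1 if p is aligned to a 32-byte boundary. */
int is_aligned32(const void *p)
{
    return ((uintptr_t)p % 32u) == 0;
}
```

Checking `is_aligned32(V)` once at startup is a cheap way to catch a compiler that silently ignores the alignment request.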

The software isn't in its final state yet. Hints and proposals for improvements are welcome, as is any other feedback. The Linux version is coming soon.

Gerhard

Rob Neff · « **Reply #1 on:** March 04, 2013, 06:01:21 PM »

I added a function in which I unrolled the simple single-accumulator function a bit and achieved a rather large improvement: 372% of the baseline speed, i.e. about 3.7 times faster. Tested on an AMD Phenom II 965 at 3.4 GHz using MSVS 2005 (yes, it's an ancient compiler, but it's legit!) with just /O2 (max speed) optimization for all C files:

Code: [Select]

float Sum1AccuC_Unrolled(float V[], unsigned int m)
{
    register float sum1 = 0.0f;             // zero out sums
    register unsigned int i;
    // Note: m must be a multiple of 4; otherwise the last pass
    // reads past the end of V.
    for (i = 0; i < m; i+=4) {
        sum1 += (V[i] + V[i+1] + V[i+2] + V[i+3]);
    }
    return sum1;
}

My results:

Code: [Select]

Supported by Processor and installed Operating System:
------------------------------------------------------

     MMX, CMOV and FCOMI, SSE, SSE2, SSE3

Calculating the sum of a float array with different methods.
That'll take a little while. Please be patient ...

Simple C implementation:
------------------------
sum1              = 8390656.00
Elapsed Time      = 24.11 Seconds

Simple C implementation (Unrolled):
------------------------
sum2              = 8390656.00
Elapsed Time      = 6.48 Seconds
Performance Boost = 372%

FPU code with 4 accumulators:
-----------------------------
sum3              = 8390656.00
Elapsed Time      = 8.90 Seconds
Performance Boost = 271%

C implementation with 4 accumulators:
-------------------------------------
sum4              = 8390656.00
Elapsed Time      = 6.06 Seconds
Performance Boost = 398%

SSE2 code with 4 accumulators:
------------------------------
sum5              = 8390656.00
Elapsed Time      = 1.57 Seconds
Performance Boost = 1537%
Your current CPU doesn't support the AVX instruction set.
You'll need the Sandy Bridge or Ivy Bridge architecture.

The application terminates now.

I'm sure that using a newer compiler and playing around with additional command line switches would provide even more speed. However, as you can see, just by making a simple change to the source you can help the compiler to easily improve performance. Of course, no compiler can match professional hand-optimized assembly code - but they do seem to be getting closer as they mature...

Gerhard · « **Reply #2 on:** March 04, 2013, 06:54:22 PM »

Hi Rob,

Thank you for your response.

Quote from: Rob Neff on March 04, 2013, 06:01:21 PM

I added a function in which I unrolled the simple single-accumulator function a bit and achieved a rather large improvement: 372% of the baseline speed, i.e. about 3.7 times faster. Tested on an AMD Phenom II 965 at 3.4 GHz using MSVS 2005 (yes, it's an ancient compiler, but it's legit!) with just /O2 (max speed) optimization for all C files:

Code: [Select]

float Sum1AccuC_Unrolled(float V[], unsigned int m)
{
    register float sum1 = 0.0;              // zero out sums
    register unsigned int i;
    for (i = 0; i < m; i+=4) {
        sum1 += (V[i] + V[i+1] + V[i+2] + V[i+3]);
    }
    return sum1;
}

I'm sure that using a newer compiler and playing around with additional command line switches would provide even more speed. However, as you can see, just by making a simple change to the source you can help the compiler to easily improve performance. Of course, no compiler can match professional hand-optimized assembly code - but they do seem to be getting closer as they mature...

You're right: your idea is similar to my C function that uses 4 accumulators. Both are forms of partial loop unrolling, which gives a performance boost.
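For comparison, a four-accumulator sum in C could look roughly like the sketch below. This is not the function from float.zip; the name `sum4_accu_c` and the tail loop for lengths that are not a multiple of 4 are assumptions made for this example. The idea is that the four partial sums do not depend on each other, so the CPU can overlap the additions instead of waiting for one long dependency chain.

```c
#include <stddef.h>

/* Sum n floats using four independent accumulators (hypothetical name). */
float sum4_accu_c(const float *v, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    float sum;
    size_t i = 0;

    /* Main loop: four elements per pass, one per accumulator. */
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }

    /* Combine the partial sums. */
    sum = (s0 + s1) + (s2 + s3);

    /* Tail loop: handle the remaining 0 to 3 elements. */
    for (; i < n; ++i)
        sum += v[i];

    return sum;
}
```

Because floating-point addition is not associative, the result can differ in the last bits from a strictly sequential sum; for the test data in this thread all variants print the same value.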

My goal was to show the advantage of SSE code and the new AVX features.
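To give an impression of where the SSE advantage comes from, here is a sketch that uses C intrinsics instead of the NASM sources from the archive. The function name `sum4_accu_sse` is hypothetical, unaligned loads are used so that the example works with any pointer, and a scalar fallback is assumed for targets without SSE. Each 128-bit register holds 4 floats, so four register accumulators process 16 floats per pass; an AVX version would use 256-bit registers with 8 floats each.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */

/* Sum n floats with four 128-bit accumulators (hypothetical name). */
float sum4_accu_sse(const float *v, size_t n)
{
    __m128 a0 = _mm_setzero_ps();
    __m128 a1 = _mm_setzero_ps();
    __m128 a2 = _mm_setzero_ps();
    __m128 a3 = _mm_setzero_ps();
    __m128 acc;
    float lanes[4];
    float sum;
    size_t i = 0;

    /* Main loop: 16 floats per pass, 4 per accumulator register. */
    for (; i + 16 <= n; i += 16) {
        a0 = _mm_add_ps(a0, _mm_loadu_ps(v + i));
        a1 = _mm_add_ps(a1, _mm_loadu_ps(v + i + 4));
        a2 = _mm_add_ps(a2, _mm_loadu_ps(v + i + 8));
        a3 = _mm_add_ps(a3, _mm_loadu_ps(v + i + 12));
    }

    /* Reduce the four registers to one, then the four lanes to a scalar. */
    acc = _mm_add_ps(_mm_add_ps(a0, a1), _mm_add_ps(a2, a3));
    _mm_storeu_ps(lanes, acc);
    sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);

    /* Tail loop: handle the remaining 0 to 15 elements. */
    for (; i < n; ++i)
        sum += v[i];

    return sum;
}
#else
/* Assumed fallback for targets without SSE: plain scalar loop. */
float sum4_accu_sse(const float *v, size_t n)
{
    float sum = 0.0f;
    size_t i;
    for (i = 0; i < n; ++i)
        sum += v[i];
    return sum;
}
#endif
```

With aligned data (see the alignment remark in the first post) the loads could be replaced by `_mm_load_ps`, which requires 16-byte alignment.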

Gerhard
