Author Topic: SSE & SSE2 Question

Zolaerla

  • Guest
SSE & SSE2 Question
« on: January 29, 2008, 05:13:14 PM »
I've gone through all of the documentation from Intel and AMD, and none of it answers one really simple question that I've had:

What are the differences between ANDPD, ANDPS, and PAND xmm, xmm/m128 (66 0F DB /r)? As far as I can see, they are all simply 128-bit ANDs between two operands, and their descriptions are identical. The same goes not only for the other bitwise operations, but also for the 128-bit MOV* operations such as MOVAPD vs MOVAPS vs MOVDQA...
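
To make the question concrete, here is a minimal C sketch that checks whether the three forms produce the same bits. It assumes an x86 compiler that provides <emmintrin.h>. The helper names (and_via_ps, and_via_pd, and_via_int, and_variants_agree, and_variants_self_check) are made up for this illustration. The intrinsics nominally correspond to ANDPS, ANDPD and PAND, but the compiler is free to emit whichever encoding it prefers, so this only tests the results, not which opcode ends up in the binary.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics (pulls in the SSE header too) */
#include <stdint.h>
#include <string.h>

/* AND through the packed-single form (nominally ANDPS). */
static __m128i and_via_ps(__m128i a, __m128i b) {
    return _mm_castps_si128(_mm_and_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b)));
}

/* AND through the packed-double form (nominally ANDPD). */
static __m128i and_via_pd(__m128i a, __m128i b) {
    return _mm_castpd_si128(_mm_and_pd(_mm_castsi128_pd(a), _mm_castsi128_pd(b)));
}

/* AND through the integer form (nominally PAND). */
static __m128i and_via_int(__m128i a, __m128i b) {
    return _mm_and_si128(a, b);
}

/* Returns 1 if all three forms match a plain byte-wise AND of a and b. */
int and_variants_agree(const uint8_t a[16], const uint8_t b[16]) {
    uint8_t expected[16], got[16];
    for (int i = 0; i < 16; i++) {
        expected[i] = (uint8_t)(a[i] & b[i]);
    }

    /* Unaligned loads, so the caller's buffers need no special alignment. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i results[3] = { and_via_ps(va, vb), and_via_pd(va, vb), and_via_int(va, vb) };

    for (int k = 0; k < 3; k++) {
        _mm_storeu_si128((__m128i *)got, results[k]);
        if (memcmp(got, expected, 16) != 0) {
            return 0;
        }
    }
    return 1;
}

/* Hypothetical sample data; aborts via assert if the forms ever disagree. */
void and_variants_self_check(void) {
    uint8_t a[16], b[16];
    for (int i = 0; i < 16; i++) {
        a[i] = (uint8_t)(i * 17 + 3);
        b[i] = (uint8_t)(0xA5 ^ (i << 2));
    }
    assert(and_variants_agree(a, b));
}
```

Calling and_variants_self_check() from a test program is enough to exercise it. Agreement here only says the results are bit-identical. It says nothing about timing.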

Debbie Wiles

  • Guest
Re: SSE & SSE2 Question
« Reply #1 on: January 29, 2008, 06:19:58 PM »
The 66 prefix is used by the SSE2 instruction but is illegal in SSE. I haven't tested it, but my guess is:

1) When SSE sees the 66 prefix, it ignores it (it is illegal for SSE, so is not guaranteed to do anything, which means it may not work - I haven't tested it to find out).
2) When SSE2 sees the prefix, it knows that it will be working on 64-bit numbers, and uses a faster algorithm
3) SSE2 pairs better when 64-bit operands are processed 64 bits at a time.

The above is only a guess, and if I am wildly wrong I'd love to know as well - I'm not sure I ever really took much notice of it before :)

Zolaerla

  • Guest
Re: SSE & SSE2 Question
« Reply #2 on: January 30, 2008, 06:08:41 PM »
I've been making thorough documentation on every instruction for the x86, and haven't found anything anywhere that tells me these instructions have any real differences.


Seeing as how the messages seem to use a fixed-width font, let's see if this table comes through. Since all bitwise ops have the same timing, I'm just listing PAND vs ANDP(S/D).

This is a list of the timing of the instructions on P4, Core2 and AMD64 based on Agner Fog's optimization charts (http://www.agner.org/optimize/):

Core2 and AMD64 timing listed as "latency/reciprocal throughput"
P4 timing is listed as "latency/additional latency/reciprocal throughput"
Latency of memory moves is inaccurate for the P4.

Instruction             P4      Core2     AMD64
PAND       xmm, xmm     2/1/1   1/0.33    2/1
PAND       xmm, m128    2/1/2   ?/1       2/1
ANDP(S/D)  xmm, xmm     2/1/2   1/0.33    2/2
ANDP(S/D)  xmm, m128    2/1/2   ?/1       2/2

MOVAP(S/D) xmm, xmm     6/0/1   1/0.33    2/1
MOVAP(S/D) xmm, m128    ~7/0/1  2/1       ?/2
MOVAP(S/D) m128, xmm    ~7/0/2  3/1       ?/2
MOVDQA     xmm, xmm     6/0/1   1/0.33    2/1
MOVDQA     xmm, m128    ~8/0/1  2/1       ?/2
MOVDQA     m128, xmm    ~8/0/2  3/1       ?/2

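From this we can see that the latencies are the same across the variants. PAND has a better reciprocal throughput than ANDP(S/D) on the P4 (register form only) and on AMD64, and the two are identical on Core2. MOVAP(S/D) and MOVDQA differ only in the approximate P4 memory-move latencies...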
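
As a rough way to read the "latency/reciprocal throughput" numbers, the sketch below turns them into cycle estimates. It is only a back-of-the-envelope model under simple assumptions: a chain of dependent instructions is bounded by latency, and a stream of independent ones by reciprocal throughput. It ignores everything else the pipeline does. The function names are made up for this illustration, and the sample figures are the AMD64 register-form entries from the table above (PAND 2/1, ANDP(S/D) 2/2).

```c
#include <assert.h>

/* Reciprocal throughput is cycles per instruction for independent work,
   so its inverse is instructions per cycle. */
double instructions_per_cycle(double reciprocal_throughput) {
    return 1.0 / reciprocal_throughput;
}

/* Estimate for n independent instructions (throughput bound). */
double cycles_independent(unsigned n, double reciprocal_throughput) {
    return (double)n * reciprocal_throughput;
}

/* Estimate for a chain of n instructions, each needing the previous
   result (latency bound). */
double cycles_dependent(unsigned n, double latency) {
    return (double)n * latency;
}

/* AMD64 register-form figures from the table: PAND 2/1, ANDP(S/D) 2/2. */
void check_amd64_example(void) {
    /* Independent ANDs: PAND needs about half the cycles of ANDP(S/D). */
    assert(cycles_independent(100, 1.0) == 100.0);
    assert(cycles_independent(100, 2.0) == 200.0);
    /* Dependent chain: both have latency 2, so the estimates match. */
    assert(cycles_dependent(100, 2.0) == 200.0);
}
```

Under this model the PAND advantage only shows up when the ANDs are independent of each other. In a dependency chain the equal latencies dominate.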

Zolaerla

  • Guest
Re: SSE & SSE2 Question
« Reply #3 on: January 30, 2008, 06:11:11 PM »
Well, since the table didn't come out right, I put it online in a text file at:
http://anyplatform.net/media/text/sse2table.txt

Sorry for not being more familiar with SourceForge's forums...