Author Topic: SSE & SSE2 Question (Read 31543 times)

Zolaerla · « **on:** January 29, 2008, 05:13:14 PM »

None of the documentation from Intel or AMD answers one really simple question that I've had:

What are the differences between ANDPD, ANDPS, and PAND xmm, xmm/m128 (66 0F DB /r)? As far as I can see, they're all simply 128-bit ANDs between two operands, and their descriptions are identical. The same goes not only for the other bitwise operations, but also for 128-bit MOV* operations such as MOVAPD vs MOVAPS vs MOVDQA...

Debbie Wiles · « **Reply #1 on:** January 29, 2008, 06:19:58 PM »

The 66 prefix is used by the SSE2 instruction but is illegal in SSE. I haven't tested it, but my guess is:

1) When SSE sees the 66 prefix, it ignores it (it is illegal for SSE, so is not guaranteed to do anything, which means it may not work - I haven't tested it to find out).
2) When SSE2 sees the prefix, it knows that it will be working on 64-bit numbers, and uses a faster algorithm.
3) SSE2 pairs better when 64-bit operands are processed 64 bits at a time.

The above is only a guess, and if I am wildly wrong I'd love to know as well - I'm not sure I ever really took much notice of it before :)
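
For reference, here is a rough sketch of how the six instructions from the question encode in their register-to-register form. The opcode bytes are the ones given in the Intel and AMD manuals; the table name `OPCODES` and the helper `encode_rr` are made up for this illustration, and the sketch only covers xmm0..xmm7 (no REX prefix, no memory operands).

```python
# Opcode bytes as listed in the Intel/AMD manuals.
# OPCODES and encode_rr are hypothetical names used only for this sketch.
OPCODES = {
    "ANDPS":  (b"",     0x54),  # SSE:  0F 54 /r
    "ANDPD":  (b"\x66", 0x54),  # SSE2: 66 0F 54 /r
    "PAND":   (b"\x66", 0xDB),  # SSE2: 66 0F DB /r (without 66 it is the MMX form)
    "MOVAPS": (b"",     0x28),  # SSE:  0F 28 /r
    "MOVAPD": (b"\x66", 0x28),  # SSE2: 66 0F 28 /r
    "MOVDQA": (b"\x66", 0x6F),  # SSE2: 66 0F 6F /r
}

def encode_rr(mnemonic, dst, src):
    """Encode `mnemonic xmm<dst>, xmm<src>` for registers xmm0..xmm7."""
    if not (0 <= dst <= 7 and 0 <= src <= 7):
        raise ValueError("this sketch only handles xmm0..xmm7")
    prefix, opcode = OPCODES[mnemonic]
    modrm = 0xC0 | (dst << 3) | src  # mod=11 (register), reg=dst, rm=src
    return prefix + bytes([0x0F, opcode, modrm])

for name in OPCODES:
    encoded = encode_rr(name, 1, 2).hex(" ").upper()
    print(f"{name:<7} xmm1, xmm2  ->  {encoded}")
```

One difference this does show is code size: the SSE forms (ANDPS, MOVAPS) are one byte shorter than their SSE2 counterparts because they carry no 66 prefix.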

Zolaerla · « **Reply #2 on:** January 30, 2008, 06:08:41 PM »

I've been making thorough documentation on every instruction for the x86, and haven't found anything anywhere that tells me these instructions have any real differences.

Seeing as how the messages seem to be in a fixed-width font, let's see if this table comes through. Since all bitwise ops have the same timing, I'm just listing PAND vs ANDP(S/D).

This is a list of the timing of the instructions on P4, Core2 and AMD64 based on Agner Fog's optimization charts (http://www.agner.org/optimize/):

Core2 and AMD64 timing listed as "latency/reciprocal throughput"
P4 timing is listed as "latency/additional latency/reciprocal throughput"
Latency of memory moves is inaccurate for the P4.

| Instruction | P4 | Core2 | AMD64 |
|---|---|---|---|
| PAND xmm, xmm | 2/1/1 | 1/0.33 | 2/1 |
| PAND xmm, m128 | 2/1/2 | ?/1 | 2/1 |
| ANDP(S/D) xmm, xmm | 2/1/2 | 1/0.33 | 2/2 |
| ANDP(S/D) xmm, m128 | 2/1/2 | ?/1 | 2/2 |
| MOVAP(S/D) xmm, xmm | 6/0/1 | 1/0.33 | 2/1 |
| MOVAP(S/D) xmm, m128 | ~7/0/1 | 2/1 | ?/2 |
| MOVAP(S/D) m128, xmm | ~7/0/2 | 3/1 | ?/2 |
| MOVDQA xmm, xmm | 6/0/1 | 1/0.33 | 2/1 |
| MOVDQA xmm, m128 | ~8/0/1 | 2/1 | ?/2 |
| MOVDQA m128, xmm | ~8/0/2 | 3/1 | ?/2 |

From this we can see that, for the most part, the timing is the same for the various instructions, with PAND being slightly faster than ANDP(S/D) in general...
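
To put a number on "slightly faster", here is a small sketch that turns the reciprocal throughput figures for the xmm, xmm forms into a rough lower bound for a run of independent instructions. The values are copied from the table above; the names `RECIP_THROUGHPUT` and `min_cycles` are invented for this example, and the estimate ignores latency chains, decoding limits and everything else a real loop would hit.

```python
# Reciprocal throughput (cycles per instruction) for the xmm, xmm forms,
# copied from the table above. Lower is better.
# RECIP_THROUGHPUT and min_cycles are hypothetical names for this sketch.
RECIP_THROUGHPUT = {
    "PAND":  {"P4": 1.0, "Core2": 0.33, "AMD64": 1.0},
    "ANDPx": {"P4": 2.0, "Core2": 0.33, "AMD64": 2.0},  # ANDPS / ANDPD
}

def min_cycles(instr, cpu, count):
    """Rough lower bound on cycles for `count` independent instructions."""
    return count * RECIP_THROUGHPUT[instr][cpu]

for cpu in ("P4", "Core2", "AMD64"):
    pand = min_cycles("PAND", cpu, 1000)
    andp = min_cycles("ANDPx", cpu, 1000)
    print(f"{cpu:<6} PAND: {pand:7.0f}  ANDP(S/D): {andp:7.0f}  ratio: {andp / pand:.2f}")
```

By these figures the register forms differ by up to a factor of two in throughput on P4 and AMD64 and not at all on Core2, while the listed latencies are the same.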

Zolaerla · « **Reply #3 on:** January 30, 2008, 06:11:11 PM »

Well, since the table didn't come out right, I put it online in a text file at:
http://anyplatform.net/media/text/sse2table.txt

Sorry for not being more familiar with SourceForge's forums...
