I've been making thorough documentation on every instruction for the x86, and haven't found anything anywhere that tells me these instructions have any real differences.
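For what it's worth, the "same result" half of that is easy to check from C. This is only a minimal sketch, assuming an x86 compiler with SSE2 (the default on x86-64, -msse2 on 32-bit GCC/Clang) and the standard <emmintrin.h> intrinsics: _mm_and_si128 for PAND, _mm_and_ps for ANDPS, _mm_and_pd for ANDPD. The names and_forms_agree and self_check are made up for illustration. A compiler is free to emit a different encoding than the intrinsic's name suggests, so this checks the results, not which opcode ends up in the binary.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_and_si128, _mm_and_ps, _mm_and_pd, casts */
#include <stdint.h>
#include <string.h>

/* Returns 1 if the integer, single and double forms of a 128-bit AND
   give bit-identical results for inputs a and b, 0 otherwise. */
static int and_forms_agree(const uint8_t a[16], const uint8_t b[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* The casts only reinterpret the register; they generate no instructions. */
    __m128i ri = _mm_and_si128(va, vb);
    __m128i rs = _mm_castps_si128(
        _mm_and_ps(_mm_castsi128_ps(va), _mm_castsi128_ps(vb)));
    __m128i rd = _mm_castpd_si128(
        _mm_and_pd(_mm_castsi128_pd(va), _mm_castsi128_pd(vb)));

    uint8_t oi[16], os[16], od[16];
    _mm_storeu_si128((__m128i *)oi, ri);
    _mm_storeu_si128((__m128i *)os, rs);
    _mm_storeu_si128((__m128i *)od, rd);

    return memcmp(oi, os, 16) == 0 && memcmp(oi, od, 16) == 0;
}

/* Hypothetical self-check. All-ones is a NaN pattern when viewed as floats,
   which shows the floating-point forms do not alter the bits. */
static void self_check(void)
{
    uint8_t a[16], b[16];
    for (int i = 0; i < 16; i++) {
        a[i] = 0xFF;
        b[i] = (uint8_t)(i * 17 + 3);
    }
    assert(and_forms_agree(a, b));

    memset(b, 0xFF, sizeof b);
    assert(and_forms_agree(a, b));
}
```

Calling self_check() from main should pass silently. That only says the three forms agree on the bits, which leaves timing as the remaining question.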
Seeing as how the messages seem to be in a fixed-width font, let's see if this table comes through. Since all the bitwise ops have the same timing, I'm just listing PAND vs ANDP(S/D).
This is a list of the timings of the instructions on P4, Core2 and AMD64, based on Agner Fog's optimization charts (http://www.agner.org/optimize/):
Core2 and AMD64 timings are listed as "latency/reciprocal throughput", in clock cycles.
P4 timings are listed as "latency/additional latency/reciprocal throughput", in clock cycles.
Latency of memory moves is inaccurate for the P4.
Instruction             P4        Core2     AMD64
PAND xmm, xmm           2/1/1     1/0.33    2/1
PAND xmm, m128          2/1/2     ?/1       2/1
ANDP(S/D) xmm, xmm      2/1/2     1/0.33    2/2
ANDP(S/D) xmm, m128     2/1/2     ?/1       2/2
MOVAP(S/D) xmm, xmm     6/0/1     1/0.33    2/1
MOVAP(S/D) xmm, m128    ~7/0/1    2/1       ?/2
MOVAP(S/D) m128, xmm    ~7/0/2    3/1       ?/2
MOVDQA xmm, xmm         6/0/1     1/0.33    2/1
MOVDQA xmm, m128        ~8/0/1    2/1       ?/2
MOVDQA m128, xmm        ~8/0/2    3/1       ?/2
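From this we can see that, for the most part, the timing is the same for the various instructions. Where they differ, PAND comes out slightly ahead of ANDP(S/D): the latency is identical, but PAND has the lower reciprocal throughput on the P4 (register form) and on AMD64, while the two tie on Core2...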
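To put a rough number on "slightly faster", here is a small sketch of how the latency and reciprocal throughput figures turn into cycle counts. It assumes the usual first-order model: a dependent chain pays the full latency per instruction, while independent instructions can start one every "reciprocal throughput" cycles. The timing struct and both function names are made up for illustration, and the model ignores every other bottleneck (decode, memory, port contention), so treat the output as an estimate only.

```c
#include <assert.h>

/* One table entry: latency and reciprocal throughput, both in clock cycles. */
typedef struct {
    double latency;
    double recip_throughput;
} timing;

/* Dependent chain: each instruction waits for the previous result,
   so the cost is roughly n times the latency. */
static double chain_cycles(timing t, int n)
{
    assert(n >= 0);
    return n * t.latency;
}

/* Independent instructions: a new one can start every recip_throughput
   cycles, and the last one still needs its full latency to finish. */
static double independent_cycles(timing t, int n)
{
    assert(n >= 0);
    if (n == 0)
        return 0.0;
    return (n - 1) * t.recip_throughput + t.latency;
}
```

With the AMD64 register-form figures from the table (PAND 2/1, ANDP(S/D) 2/2), 100 independent ANDs come out at about 101 cycles for PAND versus about 200 for ANDP(S/D) under this model, while a fully dependent chain costs about 200 cycles either way. So the difference only shows up in throughput-bound code.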