I posted the question about call/jmp relative offsets earlier.
I just went through and implemented all the latest SSE instructions in my code, and I'm comparing the machine code I generate against NASM's. For CRC32, NASM's output is not what I expect, and it doesn't seem to follow the Intel manuals.
I think the CRC32 opcode is supposed to be 0x0F 0x38 0xF0 (8-bit source operand) or 0x0F 0x38 0xF1 (16/32/64-bit source operand), with a 0xF2 prefix. For every single documented operand combination, NASM isn't generating what I would expect. Below are a few of the test instructions I'm generating, along with NASM's output and mine (corepy).
crc32 r12, r12
nasm output: f24d0f380166e4
corepy output: f24d0f38f1e4
crc32 r12, qword [r12 + 32]
nasm output: f24d0f380166642420
corepy output: f24d0f38f1642420
crc32 r12, byte [rbp + -8]
nasm output: f24c0f380165f8
corepy output: f24c0f38f065f8
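The expected bytes above can be reproduced by hand with a short sketch, assuming the layout described earlier is right: the 0xF2 prefix, then a REX byte, then 0x0F 0x38 0xF0/0xF1, then ModRM, an optional SIB byte, and an optional 8-bit displacement. The function name and the numeric register arguments are hypothetical and are not part of corepy or NASM. The sketch only covers a 64-bit destination with either a register source or a [base + disp8] memory source.

```python
def encode_crc32_r64(dst, src, src_bits, disp8=None):
    """Hand-assemble crc32 with a 64-bit destination register.

    dst, src  -- register numbers 0-15 (rbp = 5, r12 = 12)
    src_bits  -- width of the source operand (8 or 64 here)
    disp8     -- None for a register source, else the signed 8-bit
                 displacement of a [src + disp8] memory source
    """
    # REX is 0100WRXB: W=1 for the 64-bit form, R extends the ModRM reg
    # field (dst), B extends the ModRM rm / SIB base field (src).
    rex = 0x48 | ((dst >> 3) << 2) | (src >> 3)
    # Assumed opcode split: F0 for an 8-bit source, F1 otherwise.
    opcode = 0xF0 if src_bits == 8 else 0xF1
    out = [0xF2, rex, 0x0F, 0x38, opcode]

    reg_rm = ((dst & 7) << 3) | (src & 7)
    if disp8 is None:
        out.append(0xC0 | reg_rm)      # mod=11: register-direct
    else:
        out.append(0x40 | reg_rm)      # mod=01: [base + disp8]
        if (src & 7) == 4:
            out.append(0x24)           # rsp/r12 as base needs a SIB byte
        out.append(disp8 & 0xFF)       # two's-complement displacement
    return bytes(out).hex()


print(encode_crc32_r64(12, 12, 64))            # f24d0f38f1e4
print(encode_crc32_r64(12, 12, 64, disp8=32))  # f24d0f38f1642420
print(encode_crc32_r64(12, 5, 8, disp8=-8))    # f24c0f38f065f8
```

Under those assumptions the three results match the corepy output listed above, which narrows the disagreement with NASM to the bytes that follow 0x0F 0x38.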
Which is correct? If it is NASM, why, and where is the documentation backing it up?
I'd test this on hardware, but I don't have access to any machines with SSE4.2. I do realize some of the instructions/operands above don't make practical sense; they are just tests used to verify machine code output.
Thanks!