Michael Abrash, in his articles and books, said: "There Is No Such A Thing As The Fastest Code", citing the acronymum from Robert A. Heinlein novels (I believe it is "The Number of The Beast"). Here's an example. Let's say we want to construct a rotine which converts a BYTE to a "binary ascii string", C style. We could do something like this:
bits 64
default rel
section .text
; The buffers pointed by ptr must be 9 chars log.
;--- This is the usual routine:
; void byte2bin( uint8_t byte, char *ptr )
; {
; unsigned int size = 8;
;
; while ( size-- )
; {
; *outp++ = '0' + !!(byte & 0x80);
; byte <<= 1;
; }
; *outp = '\0';
; }
global byte2bin
align 4
byte2bin:
; Almost all of these instructions can run on all execution units
; (0,1,5 or 6), so they will be paralelized. But we hava an 8
; iteractions loop!
lea rcx, [rsi+8] ; Past the end of buffer (units 1 and 5).
mov rax, rsi
; This loop repeats 8 times!
align 4
.loop:
mov edx, edi
shr dl, 7
add dl, '0'
inc rax
add edi, edi
mov BYTE [rax-1], dl ; Only unit 4
cmp rcx, rax
jne .loop ; Only unit 6.
mov BYTE [rcx], 0 ; Puts the NUL char. Only unit 4.
ret
This routine will execute in, more or less, 40 clock cycles due to the 8 iterations loop, above. Can we make it better? Probably. But there are other approach. We can use a technique called SWAR (SIMD Within A Register). Kinda of emulating SIMD taking advantage of the size of the registers. It is more complex, but we can eliminate the loop:
;--- Routine using SWAR.
; void byte2bin_swar( uint8_t byte, char *ptr );
global byte2bin_swar
align 4
byte2bin_swar:
; Every single one of these instructions (except imul and ret)
; can run on all 4 execution units (0,1,5 or 6), so they will
; be paralelized.
movzx edi, dil ; Zero upper 56 bits of RDI.
; MAGIC!! Hehe.
mov rax, 0x0002000800200080
mov rdx, 0x0000040010004001
imul rax, rdi ; These 2 instructions takes 6 cycles.
imul rdi, rdx ; 3 cycles each, and they run only on execution unit 1.
or rdi, rax
; Isolate the lsbs and add '0' to each byte.
mov rax, 0x0101010101010101
and rdi, rax
mov rax, 0x3030303030303030
add rdi, rax
; These 2 only in unit 4.
movbe [rsi], rdi ; move rdi (converting to big endian).
mov BYTE [rsi+8], 0 ; writes the NUL char.
ret
I'll leave the "magic!" for you to study (it is quite simple, if you think about it). But this routine is 4 times faster than the previous one.
But, wait... maybe there is a even faster way to do this... Since we are dealing with a conversion from a BYTE to a string with 9 chars (8 chars '0' or '1' and a NUL char), what if we use a TABLE? 256 entries of 9 bytes will get us only a 2.37 KiB table, isn't it? No big deal:
; FASTER? But expensive!
global byte2bin_tbl
align 4
byte2bin_tbl:
movzx edi,dil
lea rax,[bintbl] ; RIP relative address.
lea rdi,[rdi + rdi*8]
add rax,rdi
ret
section .rodata
; 2.37 KiB table
align 8
bintbl:
db "00000000",0
db "00000001",0
db "00000010",0
...
db "11111110",0
db "11111111",0
Why not use a single instruction as this?
lea rax,[rdi+rdi*8+bintbl]
Because we are using RIP relative addresing. This bintbl offset must be relocated using GOT (SysV ABI). Using the code above we avoid relocations.
This code will cost only about 5 clock cycles, half of byte2bin_swar and 1/8 of byte2bin. But there is a problem...