Author Topic: TINSATATFC  (Read 4509 times)

Offline fredericopissarra

  • Full Member
  • **
  • Posts: 373
  • Country: br
TINSATATFC
« on: October 25, 2022, 05:27:11 PM »
Michael Abrash, in his articles and books, said: "There Is No Such A Thing As The Fastest Code", citing the acronymum from Robert A. Heinlein novels (I believe it is "The Number of The Beast"). Here's an example. Let's say we want to construct a rotine which converts a BYTE to a "binary ascii string", C style. We could do something like this:
Code: [Select]
  bits  64
  default rel

  section .text

; The buffers pointed by ptr must be 9 chars log.

;--- This is the usual routine:
; void byte2bin( uint8_t byte, char *ptr )
; {
;   unsigned int size = 8;
;
;   while ( size-- )
;   {
;     *outp++ = '0' + !!(byte & 0x80);
;     byte <<= 1;
;   }
;   *outp = '\0';
; }

  global byte2bin

  align 4
byte2bin:
  ; Almost all of these instructions can run on all execution units
  ; (0,1,5 or 6), so they will be paralelized. But we hava an 8
  ; iteractions loop!
  lea rcx, [rsi+8]      ; Past the end of buffer (units 1 and 5).
  mov rax, rsi

  ; This loop repeats 8 times!
  align 4
.loop:
  mov edx, edi
  shr dl, 7
  add dl, '0'
  inc rax
  add edi, edi
  mov BYTE [rax-1], dl  ; Only unit 4
  cmp rcx, rax
  jne .loop             ; Only unit 6.

  mov BYTE [rcx], 0     ; Puts the NUL char. Only unit 4.
  ret
This routine will execute in, more or less, 40 clock cycles due to the 8 iterations loop, above. Can we make it better? Probably. But there are other approach. We can use a technique called SWAR (SIMD Within A Register). Kinda of emulating SIMD taking advantage of the size of the registers. It is more complex, but we can eliminate the loop:
Code: [Select]
;--- Routine using SWAR.
; void byte2bin_swar( uint8_t byte, char *ptr );

  global byte2bin_swar

  align 4
byte2bin_swar:
  ; Every single one of these instructions (except imul and ret)
  ; can run on all 4 execution units (0,1,5 or 6), so they will
  ; be paralelized.
  movzx   edi, dil    ; Zero upper 56 bits of RDI.

  ; MAGIC!! Hehe.
  mov     rax, 0x0002000800200080
  mov     rdx, 0x0000040010004001
  imul    rax, rdi    ; These 2 instructions takes 6 cycles.
  imul    rdi, rdx    ; 3 cycles each, and they run only on execution unit 1.
  or      rdi, rax

  ; Isolate the lsbs and add '0' to each byte.
  mov     rax, 0x0101010101010101
  and     rdi, rax
  mov     rax, 0x3030303030303030
  add     rdi, rax

  ; These 2 only in unit 4.
  movbe   [rsi], rdi        ; move rdi (converting to big endian).
  mov     BYTE [rsi+8], 0   ; writes the NUL char.
  ret
I'll leave the "magic!" for you to study (it is quite simple, if you think about it). But this routine is 4 times faster than the previous one.

But, wait... maybe there is a even faster way to do this... Since we are dealing with a conversion from a BYTE to a string with 9 chars (8 chars '0' or '1' and a NUL char), what if we use a TABLE? 256 entries of 9 bytes will get us only a 2.37 KiB table, isn't it? No big deal:
Code: [Select]
  ; FASTER? But expensive!
  global  byte2bin_tbl

  align 4
byte2bin_tbl:
  movzx   edi,dil
  lea     rax,[bintbl]    ; RIP relative address.
  lea     rdi,[rdi + rdi*8]
  add     rax,rdi
  ret

  section .rodata

  ; 2.37 KiB table
  align 8
bintbl:
  db "00000000",0
  db "00000001",0
  db "00000010",0
  ...
  db "11111110",0
  db "11111111",0
Why not use a single instruction as this?
Code: [Select]
lea rax,[rdi+rdi*8+bintbl]Because we are using RIP relative addresing. This bintbl offset must be relocated using GOT (SysV ABI). Using the code above we avoid relocations.

This code will cost only about 5 clock cycles, half of byte2bin_swar and 1/8 of byte2bin. But there is a problem...
« Last Edit: October 25, 2022, 05:53:03 PM by fredericopissarra »

Offline fredericopissarra

  • Full Member
  • **
  • Posts: 373
  • Country: br
Re: TINSATATFC
« Reply #1 on: October 25, 2022, 05:44:59 PM »
The problem with the 3rd routine is the table. To accelerate memory access your processor have 2 or 3 level caches: L1, L2 (and, nowadays, L3). What matters for performance purposes is L1D cache (D from DATA). L1 cache is splited in two 32 KiB caches, divided in "lines" 64 bytes long. Each time you read/write memory the processor tries to load/store data in a line, in the cache, not on physical memory.

With a line 64 bytes long we have 512 lines on cache L1D available to the processor (shared between YOUR code and every other "thread" running on this processor). Notice that what the routine byte2bin_tbl is doing is to calculate the address of a string and returning it. No access to memory is made here, but will be done if you try to access that string. If there is no space available on cache L1 this access will slow things down a lot, because the processor will check if this data is available on L2 and load the lines of L1D... If not, it will try to see if it is available on L3... if not, data will be loaded from physical memory to L3, then to L2, then to L1...

Notice that byte2bin and byte2bin_swar deals only with a 9 bytes buffer. If this buffer isn't present on L1D a single line will be "loaded". The effect seems to be the same, but we're not dealing with a 2.37 KiB table: Multiple calls to the first two functions have to potential do slow things down only for the time to load a single line for the buffer pointed by ptr, once. In this test:
Code: [Select]
#include <stdio.h>
#include <stdint.h>

extern void byte2bin( uint8_t, char * );
extern void byte2bin_swar( uint8_t, char * );
extern const char *byte2bin_tbl( uint8_t );

int main( void )
{
  char buffer[9];

  byte2bin( 0x5a, buffer );
  printf( "%#x: \"%s\"\n", 0x5a, buffer );

  byte2bin_swar( 0x99, buffer );
  printf( "%#x: \"%s\"\n", 0x99, buffer );

  printf( "%#x: \"%s\"\n", 0xa5, byte2bin_tbl( 0xa5 ) );
}
The buffer buffer can be transferred to L1D by byte2bin and it will already be there for byte2bin_swar to use it. For subsequent calls to byte2bin and byte2bin_swar, L1D cache probably still have the buffer on cache already.

So... byte2bin_tbl is faster? Yep, but there is a small chance it can be slower than byte2bin...

Hence, TINSATATFC.

[]s
Fred
« Last Edit: October 25, 2022, 05:48:18 PM by fredericopissarra »