Author Topic: One example using C to build an assembly routine.  (Read 1415 times)

Offline fredericopissarra

  • Full Member
  • **
  • Posts: 368
  • Country: br
One example using C to build an assembly routine.
« on: October 28, 2022, 12:37:05 PM »
I've got more or less 37 years of experience in x86 assembly and C language. The way I see, C is nothing more than "high level" assembly and, to me, it makes perfect sense to develop assembly routines using C.

Nowadays, good C compilers (GCC, Clang, Intel C++, but not MSVC++!), using optimization options, create a very good code indeed, taking advantage of lots of characteristics about your processor, avoiding branch mispredictions, caches mismatches and other exoteric things.

Here's an example of the famous (and not ISO 9899 standard function) itoa() function (modified just to use base 10). I'll show you 4 ways to do it.

First, itoa1.c, writes each algarism to the pointed buffer and, in the end, reverse the string:
Code: [Select]
// itoa1.c
char *itoa10( char *p, int x )
{
  char buffer[12];
  char *q, *r;
  _Bool negative;

  r = p;                    // We'll need this copy later.

  q = buffer;               // Pointer to write individual algarisms in ASCII.
  negative = x < 0;         // Flag: x is negative?

  // Calc absulute value of x.
  // FIXME: There's a problem here.
  if ( negative )
    x = -x;

  // Convert each decimal algarism, forward.
  do
    *q++ = '0' + x % 10;
  while ( x /= 10 );

  // Put an extra '-' if x is negative.
  if ( negative )
    *q++ = '-';

  *q-- = '\0';

  // copy string in reverse to buffer pointed by p.
  while ( q >= buffer )
    *p++ = *q--;
  *p = '\0'; 

  return r;
}
Why 12 chars in the local buffer? Because INT_MIN is -2147483648 (11 chars), plus the '\0' at the end of the string.

Here we got 2 problems. The first I did it on purpose to show we have to be careful when creating any routine. If x is INT_MIN, there is no way to negate it, using 2's complement. This can be fixed making a copy of x to a more precise (bigger type, as in long long int) and, then, negate it, if necessary. The second problem is the string reversal, copying the local buffer to the buffer pointed by the argument. Here we have, still, a third problem: The need for local buffer! It is unecessary, as described below, in a better implementation:
Code: [Select]
// itoa2.c
char *itoa10( char *p, int x )
{
  // Since -INT_MIN cannot be represented on an 'int', we
  // use better precision to hold the value (long long int is 64 bits long).
  long long int n;
  char *q, *r;
  _Bool negative;

  negative = x < 0;

  n = llabs( x );

  // We'll need r later to calculate the string length.
  // 12 is used here because INT_MIN is "-2147483648"
  // (11 chars), plus the extra '\0'.
  // Here the pointers point 1 char after the end of the
  // buffer.
  q = r = p + 12;

  *--q = '\0';

  // Convert each decimal algarism, backwards.
  do
    *--q = '0' + n % 10;
  while ( n /= 10 );

  // Put an extra '-' if x is less then zero.
  if ( negative )
    *--q = '-';

  // q points to the first char we have on buffer.

  // Copy converted string to the beginning of the target buffer.
  // This works because there are, at least, 2 bytes to move.
  memmove( p, q, r - q );

  return p;
}
Here we got rid of the local buffer, using the argument p as the target buffer and we got rid of the string reversal as well. But, yet, we have one loop and one "movement" of bytes in the target buffer. The final movement is needed because your buffer pointed by p must begin with the string converted. The clock cycles wasted by memmove() depends on r - q bytes moved.

This seems to be better then the previous, but we can "improve" this by calculating how many chars will be in the final buffer. We can do this using 10's base logarithm of absolute value of x (if x != 0). This should improve the routine as we get rid of the final copy, but the number of tests to make a modified version of ilog10() to work will waste, more of less, the same number of clock cycles of the final movement. This third routine can be like this:
Code: [Select]
// itoa3.c
// Modified log10 for integers.
// Used to get the number of chars in the buffer.
static int ilog10_( unsigned int x )
{
  static const unsigned int v[] =
  { 1000000000U, 100000000U, 10000000U, 1000000U, 100000U, 10000U, 1000U, 100U, 10U };

  /* ilog10_(0) doesn't exist!
     this is checking in itoa10() routine.  */
  //if ( ! x )
  //  return -1;

  for ( int i = 0; i < sizeof v / sizeof v[0]; i++ )
    if ( x >= v[i] )
      return 9 - i;

  return 0;
}

char *itoa10( char *p, int x )
{
  // Since -INT_MIN cannot be represented on an 'int', we
  // use better precision to hold the value (long long int is 64 bits long).
  long long int n;
  char *q;
  _Bool negative;

  negative = x < 0;

  n = llabs( x );

  // We have, at least, 2 chars ocuppied in the buffer:
  //  '0' and '\0'.
  q = p + 2;

  // ilog10() isn't defined for 0, so the test is necessary.
  if ( x )
    q += ilog10_( n ) + negative;

  *--q = '\0';

  // Convert each decimal algarism backwards.
  do
    *--q = '0' + n % 10;
  while ( n /= 10 );

  // Put an extra '-' if x is less then zero.
  if ( negative )
    *--q = '-';

  return p;
}
We can tweak ilog10_() to improve the timing if, most of the time, we'll convert small values.

But I think the version of itoa2.c is better then this, in terms of performance (must measure!).

Offline fredericopissarra

  • Full Member
  • **
  • Posts: 368
  • Country: br
Re: One example using C to build an assembly routine.
« Reply #1 on: October 28, 2022, 12:40:39 PM »
We can avoid the loops using SWAR, but more work must be made to this routine to work the same way we did before:
Code: [Select]
// itoa_swar.c
// credit: Paul Khuong
static uint64_t encode_ten_thousands(uint64_t hi, uint64_t lo)
{
  uint64_t merged = hi | (lo << 32);
  uint64_t top = ((merged * 10486ULL) >> 20) & ((0x7FULL << 32) | 0x7FULL);
  uint64_t bot = merged - 100ULL * top;
  uint64_t hundreds;
  uint64_t tens;

  hundreds = (bot << 16) + top;
  tens = (hundreds * 103ULL) >> 10;
  tens &= (0xFULL << 48) | (0xFULL << 32) | (0xFULL << 16) | 0xFULL;
  tens += (hundreds - 10ULL * tens) << 8;

  return tens;
}

// credit: Paul Khuong
static void to_string_khuong(uint64_t x, char *out)
{
  uint64_t *p = (uint64_t *)out;
  uint64_t top = x / 100000000;
  uint64_t bottom = x % 100000000;
  uint64_t first =
      0x3030303030303030ULL + encode_ten_thousands(top / 10000, top % 10000);
  uint64_t second =
      0x3030303030303030ULL + encode_ten_thousands(bottom / 10000, bottom % 10000);

  *p++ = first;
  *p = second;
}

__attribute__((noinline))
char *itoa10( char *p, int x )
{
  long long int n;

  n = llabs( x );
 
  if ( x < 0 )
    p++;

  to_string_khuong( n, p );

  if ( x < 0 )
    *--p = '-';

  return p;
}
This is as fast as we can expect, but if you convert something as "-1234" you'll get "-0000000000001234".

Offline fredericopissarra

  • Full Member
  • **
  • Posts: 368
  • Country: br
Re: One example using C to build an assembly routine.
« Reply #2 on: October 28, 2022, 12:53:49 PM »
Now... how can we "convert" one of those C implementations to assembly: Easy, use this command line:

$ cc -O2 -masm=intel -S -march=native \
  -fomit-frame-pointer -fcf-protection=none itoa2.c -o itoa2.s


The generated code (with some instructions reordering made by me and syntax adapted to NASM syntax) is very good but not perfect (from assembly point of view):
Code: [Select]
itoa10:
  mov   ecx, esi

  sub   rsp, 8        ; Here because we're using memmove call.
                      ; So, no red zone!

  mov   r8, rdi
  mov   r9d, esi

  mov   BYTE [rdi+11], 0
  lea   r11, [rdi+12]
  lea   rsi, [rdi+11]

  mov   rdi, -3689348814741910323

  neg   ecx
  cmovs ecx, esi
  mov   ecx, ecx

  align 4
.loop:
  mov   rax, rcx
  mov   r10, rsi
  sub   rsi, 1

  mul   rdi
  shr   rdx, 3
  lea   rax, [rdx+rdx*4]
  add   rax, rax
  sub   rcx, rax
  ; Here RDX = n / 10, RCX = n % 10

  add   ecx, '0'
  mov   [rsi], cl

  mov   rcx, rdx

  test  rdx, rdx    ; quotient is zero?
  jne   .loop       ; no, stay in the loop.

  ; Put '-' if x is negative.
  test  r9d, r9d
  jns   .not_negative
  mov   BYTE [rsi-1], '-'
  lea   rsi, [r10-2]

.not_negative:
  mov   rdx, r11
  mov   rdi, r8
  sub   rdx, rsi
  call  memmove wrt .plt

  add   rsp, 8
  ret
It's not an ideal code, since we can avoid memmove() call and use the red zone. The rest of the code seems pretty good and not so easy to improve.
« Last Edit: October 28, 2022, 12:57:30 PM by fredericopissarra »