NASM - The Netwide Assembler

NASM Forum => Programming with NASM => Topic started by: fredericopissarra on November 02, 2022, 01:11:09 PM

Title: To create routines in assembly is advisable?
Post by: fredericopissarra on November 02, 2022, 01:11:09 PM
Most of the time the answer is NO, especially when dealing with libraries like libc, libm or others made to be used from C. Why not?

Lots of standard (ISO 9899) functions are "intrinsic" (built-in), meaning the compiler knows how to optimize them, often avoiding the function call or replacing it with a cheaper one. One example is printf. This call:
Code: [Select]
printf( "Hello\n" );
Is translated, again, most of the time, to a single call to a simpler output function (GCC, for instance, emits a call to puts for this exact statement), which is faster! Calling printf here is really slow and the compiler knows it.
There are other examples: abs(), for instance, is usually translated to inline code with no call at all, as we can see below:
Code: [Select]
; int f( int x ) { return abs(x); }
f:
  mov eax,edi
  neg eax
  cmovs eax,edi
  ret
So, calling abs() -- present in libc -- is superfluous.

And, as I said before, good C compilers (GCC, Clang, Intel C++) avoid some performance penalties that the average assembly programmer doesn't pay any attention to (branch mispredictions, unnecessary data propagation, cache misses, overuse of the stack...).

When is assembly a good idea, then? Well... when the high-level language compiler doesn't do a good job. This happens sometimes, especially with poorly designed code. I tend to think about assembly only in terms of performance. If your C code can be improved a lot (by more than 100%, as an example), then -- and only then -- assembly can be a good idea.

Another area is where it is difficult to do something in pure C. Let's say we want to set the direction flag and move some block of data backwards. ISO 9899 C gives you no way to ask for this through memcpy() or memmove(). Most modern C/C++ compilers don't offer a way to do it either. Assembly can be the answer.
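As a rough illustration of the backward move described above, here is a minimal sketch using GCC/Clang inline assembly on x86. The function name copy_backward is hypothetical (it is not a library routine), and the sketch assumes the usual ABI rule that the direction flag must be clear again when the function returns.

```c
#include <stddef.h>

/* Hypothetical helper (not a libc function): copies n bytes starting at the
   LAST byte and walking down, using STD + REP MOVSB.
   Assumes x86 and a GCC-compatible compiler. */
static void copy_backward(void *dest, const void *src, size_t n)
{
  if (n == 0)
    return;

  /* With DF = 1 the string instructions decrement RDI/RSI, so both
     pointers must start at the last byte of each block. */
  unsigned char *d = (unsigned char *)dest + n - 1;
  const unsigned char *s = (const unsigned char *)src + n - 1;

  __asm__ __volatile__ (
    "std\n\t"         /* DF = 1: move backwards */
    "rep movsb\n\t"   /* copy RCX bytes from [RSI] to [RDI] */
    "cld"             /* DF = 0 again, as the ABI expects */
    : "+D" (d), "+S" (s), "+c" (n)
    :
    : "memory", "cc"
  );
}
```

Because the copy runs from the end, it is safe when dest overlaps src at a higher address. Whether it beats memmove() is something only a measurement can tell.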

There is also another use for assembly: making your routines shorter (optimization for size -- not performance). This is useful but, again, I think the best use for assembly is always about performance. But beware: most of the time your assembly code is SLOWER than the equivalent routine written in C. There is only one way to be sure about the gains: MEASURE YOUR ROUTINES.

Here's an example: Suppose you want to move a block of data from one buffer to another. We have 2 pointers and a size as arguments. In C, the best way to do it, if the buffers don't overlap, is to use the memcpy() function. Most of the time your compiler will emit a function call and you may think this will slow down your routines a bit, but consider the alternatives:

1 - You can create a simple loop, moving sub-blocks of data individually;
2 - You can use rep/movs (byte, word, dword or qword)

Like:
Code: [Select]
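#include <stddef.h>

void move1( int *dest, int *src, unsigned int elems )
{ while ( elems-- ) *dest++ = *src++; }

// GCC/Clang inline assembly, x86 only. Relies on DF = 0, as the ABI guarantees on entry.
void move2( int *dest, int *src, unsigned int elems )
{
  size_t n = elems;   // REP counts in the full RCX in 64-bit mode.

  // RDI, RSI and RCX are changed by REP MOVS, so they must be in/out ("+") operands.
  // "movsl" is the AT&T spelling of the MOVSD string instruction.
  __asm__ __volatile__ (
    "rep; movsl" : "+D" (dest), "+S" (src), "+c" (n) : : "memory"
  );
}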
Here I'm moving one DWORD at a time. If you MEASURE these 2 routines against memcpy( dest, src, elems * sizeof( int ) ); you'll see the latter is way faster than move1() and move2().
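To make the "MEASURE" advice concrete, here is a minimal timing sketch. It is only one possible setup: the helper time_move and the wrapper move_memcpy are hypothetical names, clock() measures CPU time with coarse resolution, and the empty asm statement is a GCC/Clang compiler barrier that keeps the optimizer from dropping the repeated copies.

```c
#include <string.h>
#include <time.h>

typedef void (*move_fn)(int *dest, const int *src, unsigned int elems);

/* Same loop as move1() in the post. */
static void move1(int *dest, const int *src, unsigned int elems)
{ while (elems--) *dest++ = *src++; }

/* Thin wrapper so memcpy() has the same signature as move1(). */
static void move_memcpy(int *dest, const int *src, unsigned int elems)
{ memcpy(dest, src, elems * sizeof(int)); }

/* Hypothetical helper: CPU seconds spent in `reps` calls of fn. */
static double time_move(move_fn fn, int *dest, const int *src,
                        unsigned int elems, unsigned int reps)
{
  clock_t start = clock();

  for (unsigned int i = 0; i < reps; i++) {
    fn(dest, src, elems);
    /* Compiler barrier: dest's contents are treated as observed. */
    __asm__ __volatile__ ("" : : "r" (dest) : "memory");
  }

  return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Usage (sizes are arbitrary):
     static int a[1 << 20], b[1 << 20];
     double t1 = time_move(move1,       b, a, 1 << 20, 200);
     double t2 = time_move(move_memcpy, b, a, 1 << 20, 200);
*/
```

Run it several times and with the optimization level you actually ship. At higher optimization levels the compiler may turn the move1() loop into vectorized code or a library call, which is part of the point being made here.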

This summarizes my advice, based, of course, on my experience and experiments: When mixing code created by good C compilers with assembly, avoid recreating library functions in assembly on the assumption that your code will be faster than what the high-level compiler produces. In the majority of cases it won't be! Reserve assembly for those cases where the compiler clearly doesn't do a good job (and only after MEASURING the time spent by the routines).

[]s
Fred
Title: Re: To create routines in assembly is advisable?
Post by: vitsoft on November 06, 2022, 09:06:23 PM
When I am forced to use foreign code (library function), I often have to struggle with poor documentation, version-hell, licence restriction, library installation, difficult debugging...

Optimization for performance or size is not necessarily the only criterion. I prefer code written all by myself - optimized to be comprehensible by me.
Title: Re: To create routines in assembly is advisable?
Post by: fredericopissarra on November 06, 2022, 11:06:32 PM
Quote from: vitsoft
"Optimization for performance or size is not necessarily the only criterion. I prefer code written all by myself - optimized to be comprehensible by me."
Well... I think if it's not about performance or size, assembly isn't a good idea and C is a better option.
First, your code will change (in terms of opcodes) from one mode to another. Second, the code is more comprehensible using a high level language.

[]s
Fred
Title: Re: To create routines in assembly is advisable?
Post by: vitsoft on November 08, 2022, 08:58:28 AM
Quote from: fredericopissarra
"the code is more comprehensible using a high level language."
I kind of agree... and that's why I like assemblers which allow high level constructs (macros). Comprehensibility mostly depends on the quality of the documentation and on how concisely identifiers are chosen. When I saw the function name atoi(), it never occurred to me that it could be short for ASCII to integer conversion. Nonintuitive abbreviations, rejection of mixed case, name mangling, multiple underscores as leading characters... that makes me hate C.

Of course the encoded instructions will change between 32 and 64 bit programs, between different OSes, no matter if written in ASM or HLL. This is solvable at macro level, for instance I can write a homonymous macroinstruction ShellSort in three different versions (for 16 (https://euroassembler.eu/maclib/sort16.htm#ShellSort), 32 (https://euroassembler.eu/maclib/sort32.htm#ShellSort), 64 (https://euroassembler.eu/maclib/sort64.htm#ShellSort) bit assembler programs) and use them almost as comfortably as in HLL.

Source is most comprehensible in the language that one has mastered best.
Title: Re: To create routines in assembly is advisable?
Post by: fredericopissarra on November 08, 2022, 10:16:48 AM
Quote from: vitsoft
"I kind of agree... and that's why I like assemblers which allow high level constructs (macros). Comprehensibility mostly depends on the quality of the documentation and on how concisely identifiers are chosen. When I saw the function name atoi(), it never occurred to me that it could be short for ASCII to integer conversion. Nonintuitive abbreviations, rejection of mixed case, name mangling, multiple underscores as leading characters... that makes me hate C."
I wonder how confusing PCLMULQDQ or UNPCKHPS has been... ;)

Here's an example of what I meant: Let's say we are trying to create an itoa() function (for radix 10 only). In C this is very simple (the comments are here only to avoid confusion, because I like to use pointers very much!):
Code: [Select]
#include <stdlib.h>
#include <string.h>

// Requires destp to point to a buffer with at least 12 chars.
char *itoa( char *destp, int x )
{
  char *p, *endp;
  long long n;

  // 12 because INT_MIN has 11 chars + NUL char.
  p = endp = destp + 12;    // Points past the end of the buffer.

  n = llabs( x );           // Get the absolute value of x in higher precision.

  // Convert each digit, from the least significant one backwards.
  *--p = '\0';
  do *--p = '0' + n % 10; while ( n /= 10 );

  // Puts a '-' in front if x is negative.
  if ( x < 0 )
    *--p = '-';

  // Move the string (NUL included) to the beginning of the buffer if it isn't there already.
  if ( p != destp )
    memmove( destp, p, endp - p );

  return destp;
}

Now, compare the assembly code (way more complicated, but a direct translation [made with -S option with GCC - I took the liberty to convert the mnemonics and directives to be compatible with NASM]):
Code: [Select]
; Entry: RDI = destp, ESI = x
itoa:
  mov   ecx, esi
  mov   BYTE [rdi+11], 0
  mov   r8, rdi
  mov   r9d, esi
  neg   ecx
  cmovs ecx, esi
  lea   rsi, [rdi+11]
  mov   rdi, 0xCCCCCCCCCCCCCCCD
  mov   ecx, ecx

  align 4
.loop:
  mov   rax, rcx
  mov   r10, rsi
  sub   rsi, 1
  mul   rdi
  shr   rdx, 3
  lea   rax, [rdx+rdx*4]
  add   rax, rax
  sub   rcx, rax
  add   ecx, '0'
  mov   BYTE [rsi], cl
  mov   rcx, rdx
  test  rdx, rdx
  jne   .loop
  test  r9d, r9d
  jns   .skip
  mov   BYTE [rsi-1], '-'
  lea   rsi, [r10-2]
.skip:
  cmp   rsi, r8
  je    .nomove
  lea   rdx, [r8+12]
  sub   rsp, 8
  mov   rdi, r8
  sub   rdx, rsi
  call  memmove wrt ..plt
  add   rsp, 8
  ret

  align 4
.nomove:
  mov   rax, r8
  ret

I think the unavoidable conclusion is that the high level language has the more comprehensible version of the routine.
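One piece of the listing that is easy to misread is the mul/shr pair, which is how the compiler divides by 10 without a DIV instruction. The sketch below shows the same trick in C for 32-bit values, which is the 32-bit analogue of the 64-bit constant 0xCCCCCCCCCCCCCCCD used above, not a copy of the generated code. The function names are hypothetical.

```c
#include <stdint.h>

/* n / 10 without a divide: 0xCCCCCCCD is ceil(2^35 / 10), so multiplying
   by it and shifting right by 35 gives the exact quotient for every
   32-bit unsigned n. */
static uint32_t div10(uint32_t n)
{
  return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}

/* n % 10, computed the way the listing does it: n - (n / 10) * 10.
   The LEA/ADD pair in the listing is just the multiplication by 10. */
static uint32_t mod10(uint32_t n)
{
  return n - div10(n) * 10u;
}
```

A compiler does this automatically for n / 10 and n % 10, which is one more reason the C source stays readable while the generated code does not.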

Quote from: vitsoft
"Of course the encoded instructions will change between 32 and 64 bit programs, between different OSes, no matter if written in ASM or HLL. This is solvable at macro level, for instance I can write a homonymous macroinstruction ShellSort in three different versions (for 16 (https://euroassembler.eu/maclib/sort16.htm#ShellSort), 32 (https://euroassembler.eu/maclib/sort32.htm#ShellSort), 64 (https://euroassembler.eu/maclib/sort64.htm#ShellSort) bit assembler programs) and use them almost as comfortably as in HLL."
This is not what I was trying to convey. Take the MOVDQA instruction: it exists on the Pentium 4 and later (SSE2), but not on older processors (and there are still old processors around). Assembly is "unportable"; C code isn't (if you obey the specification).
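If a hand-written SSE2 routine is kept anyway, one common way to hedge against older processors is a run-time check with a portable C fallback. The sketch below assumes GCC or Clang (for __builtin_cpu_init and __builtin_cpu_supports). The names cpu_has_sse2, copy16_portable and pick_copy16 are hypothetical, and the SSE2 routine itself is left to the caller.

```c
#include <string.h>

typedef void (*copy16_fn)(void *dest, const void *src);

/* Hypothetical helper: 1 if SSE2 instructions such as MOVDQA are usable. */
static int cpu_has_sse2(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  __builtin_cpu_init();
  return __builtin_cpu_supports("sse2") != 0;
#else
  return 0;   /* Not x86, or a compiler without these builtins. */
#endif
}

/* Portable fallback: plain ISO C, works on any target. */
static void copy16_portable(void *dest, const void *src)
{
  memcpy(dest, src, 16);
}

/* Returns the caller's SSE2 routine only when one was supplied AND the
   processor supports SSE2; otherwise returns the portable version. */
static copy16_fn pick_copy16(copy16_fn sse2_version)
{
  return (sse2_version && cpu_has_sse2()) ? sse2_version : copy16_portable;
}
```

The extra dispatch code is exactly the kind of burden the C version never carries: the compiler picks the instructions for whatever target it is asked to build for.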

[]s
Fred