NASM - The Netwide Assembler
NASM Forum => Programming with NASM => Topic started by: fredericopissarra on November 02, 2022, 01:11:09 PM
-
Most of the time the answer is NO, especially when dealing with libraries like libc, libm, or any other library made to be used from C. Why not?
Lots of standard (ISO 9899) functions are "intrinsic", meaning the compiler knows how to optimize them, avoiding function calls. One example is printf. This call:
printf( "Hello\n" );
Is translated, again most of the time, into a single call to a simpler function such as puts (which is what GCC and Clang usually emit for this exact call) or fputs/fwrite, which are faster! Calling printf here is really slow, and the compiler knows it.
There are other examples: abs(), for instance, usually isn't translated into a call at all, as we can see below:
; int f( int x ) { return abs(x); }
f:
mov eax,edi
neg eax
cmovs eax,edi
ret
So calling abs(), present in libc, is superfluous.
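A minimal C sketch of what that listing computes, assuming a two's-complement target such as x86-64. The name abs_like_compiler is hypothetical and not part of any library. It negates the value, then keeps the original if the negation came out negative, which is what the neg/cmovs pair does:

```c
/* Hypothetical helper mirroring the listing: neg eax / cmovs eax,edi. */
static int abs_like_compiler(int x)
{
    /* Negate through unsigned arithmetic so INT_MIN does not trigger
       signed-overflow undefined behavior in C. Converting back to int
       wraps on GCC and Clang (implementation-defined in ISO C). */
    int negated = (int)(0u - (unsigned int)x);

    /* cmovs: if the negated value is negative, x was positive
       (or INT_MIN), so the original value is the answer. */
    return negated < 0 ? x : negated;
}
```

Like abs() itself, this has no useful answer for INT_MIN; on such a target the sketch simply returns INT_MIN unchanged, as the assembly does.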
And, as I said before, good C compilers (GCC, Clang, Intel C++) avoid some performance penalties that the average assembly programmer doesn't pay any attention to (branch mispredictions, unnecessary data propagation, cache misses, overuse of the stack...).
When is assembly a good idea, then? Well... when the high level language compiler doesn't do a good job. This happens sometimes, especially with poorly designed code. I tend to think about assembly only in terms of performance. If your C code can be improved a lot (by more than 100%, for example), then, and only then, assembly can be a good idea.
Another area is where it is difficult to do something in pure C. Let's say we want to set the direction flag and move some block of data backwards. ISO 9899 C gives no way to do this with memcpy() or memmove(), and most modern C/C++ compilers don't offer one either. Assembly can be the answer.
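As a rough sketch of that case, assuming GCC or Clang inline assembly on x86-64, a backward copy with the direction flag set could look like the code below. The name copy_backward is hypothetical, and the plain loop under #else is only a fallback so the sketch still builds on other targets:

```c
#include <stddef.h>

/* Hypothetical helper: copies n bytes from the last byte to the first. */
static void copy_backward(void *dest, const void *src, size_t n)
{
    if (n == 0)
        return;

    /* With DF set, MOVSB decrements RSI/RDI, so start at the last byte. */
    unsigned char *d = (unsigned char *)dest + n - 1;
    const unsigned char *s = (const unsigned char *)src + n - 1;

#if defined(__GNUC__) && defined(__x86_64__)
    __asm__ __volatile__ (
        "std\n\t"        /* set the direction flag: copy backwards */
        "rep movsb\n\t"
        "cld"            /* the ABI expects DF to be clear again afterwards */
        : "+D" (d), "+S" (s), "+c" (n)
        :
        : "memory", "cc"
    );
#else
    while (n--)
        *d-- = *s--;
#endif
}
```

Copying from the end is what makes an overlapping move towards higher addresses safe, for example copy_backward( buf + 2, buf, 6 ). Whether this beats memmove() is something only measurement can tell.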
There is also another use for assembly: making your routines shorter (optimization for size, not performance). This is useful but, again, I think the best use of assembly is always about performance. But beware: most of the time your assembly code is SLOWER than the equivalent routine written in C. There is only one way to be sure about the gains: MEASURE YOUR ROUTINES.
Here's an example: Suppose you want to move a block of data from one buffer to another. We have 2 pointers and a size as arguments. In C, the best way to do it, if the buffers don't overlap, is to use the memcpy() function. Most of the time your compiler will emit a function call and you may think this will slow down your routines a bit, but consider the alternatives:
1 - You can create a simple loop, moving sub-blocks of data individually;
2 - You can use rep/movs (byte, word, dword or qword)
Like:
void move1( int *dest, int *src, unsigned int elems )
{ while ( elems-- ) *dest++ = *src++; }
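void move2( int *dest, int *src, unsigned int elems )
{
unsigned long n = elems; // REP counts in the full-width counter register (RCX in 64-bit mode).
// AT&T syntax: movsl is the DWORD string move. The instruction modifies
// the pointer and counter registers, so they must be read-write operands.
__asm__ __volatile__ (
"rep movsl" : "+D" (dest), "+S" (src), "+c" (n) : : "memory"
);
}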
Here I'm moving one DWORD at a time. If you MEASURE these 2 routines against memcpy( dest, src, elems * sizeof( int ) ); you'll see that memcpy() is way faster than both move1() and move2().
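One way to do that measurement, as a minimal sketch only: the helper time_move and the wrapper move_memcpy are hypothetical names, clock() is a coarse timer, and the empty asm statement is a GCC/Clang-specific compiler barrier assumed here to keep the copies from being optimized away.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Same loop as move1() in the post. */
static void move1(int *dest, int *src, unsigned int elems)
{ while (elems--) *dest++ = *src++; }

/* Hypothetical wrapper giving memcpy() the same signature. */
static void move_memcpy(int *dest, int *src, unsigned int elems)
{ memcpy(dest, src, (size_t)elems * sizeof(int)); }

typedef void (*move_fn)(int *, int *, unsigned int);

/* Hypothetical helper: CPU seconds spent calling fn `reps` times. */
static double time_move(move_fn fn, int *dest, int *src,
                        unsigned int elems, unsigned int reps)
{
    clock_t start = clock();

    while (reps--) {
        fn(dest, src, elems);
        /* Compiler barrier (GCC/Clang extension): the copied data
           must be treated as observed, so the call cannot be dropped. */
        __asm__ __volatile__ ("" : : "r" (dest) : "memory");
    }

    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_move( move1, ... ) and time_move( move_memcpy, ... ) with the same buffers and a large repetition count gives two numbers to compare. The results depend on the compiler, the optimization flags, the buffer size and the processor, so run it several times before drawing conclusions.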
This summarizes my advice, based, of course, on my experience and experiments: When mixing code created by good C compilers with assembly, avoid recreating the function calls in assembly thinking your assembly code will be faster than what the high level compiler creates. In the majority of cases it isn't! Reserve assembly for those cases where the compiler clearly doesn't do a good job (and only after MEASURING the time spent by the routines).
[]s
Fred
-
When I am forced to use foreign code (library functions), I often have to struggle with poor documentation, version hell, licence restrictions, library installation, difficult debugging...
Optimization for performance or size is not necessarily the only criterion. I prefer code written entirely by myself, optimized to be comprehensible to me.
-
Optimization for performance or size is not necessarily the only criterion. I prefer code written entirely by myself, optimized to be comprehensible to me.
Well... I think if it's not about performance or size, assembly isn't a good idea and C is a better option.
First, your code will change (in terms of opcodes) from one mode to another. Second, the code is more comprehensible using a high level language.
[]s
Fred
-
the code is more comprehensible using a high level language.
I kind of agree... and that's why I like assemblers which allow high level constructs (macros). Comprehensibility mostly depends on the quality of the documentation and on how well the identifiers are chosen. When I saw the function name atoi(), it never occurred to me that it could be short for ASCII to integer conversion. Nonintuitive abbreviations, repudiation of mixed case, name mangling, using multiple underscores as leading characters... that makes me hate C.
Of course the encoded instructions will change between 32 and 64 bit programs, between different OSes, no matter if written in ASM or HLL. This is solvable at macro level, for instance I can write homonymous macroinstruction ShellSort in three different versions (for 16 (https://euroassembler.eu/maclib/sort16.htm#ShellSort), 32 (https://euroassembler.eu/maclib/sort32.htm#ShellSort), 64 (https://euroassembler.eu/maclib/sort64.htm#ShellSort) bit assembler programs) and use them almost as comfortably as in HLL.
Source code is most comprehensible in the language one has mastered best.
-
I kind of agree... and that's why I like assemblers which allow high level constructs (macros). Comprehensibility mostly depends on the quality of the documentation and on how well the identifiers are chosen. When I saw the function name atoi(), it never occurred to me that it could be short for ASCII to integer conversion. Nonintuitive abbreviations, repudiation of mixed case, name mangling, using multiple underscores as leading characters... that makes me hate C.
I wonder how confusing PCLMULQDQ or UNPCKHPS has been... ;)
Here's an example of what I meant: Let's say we are trying to create an itoa() function (for radix 10 only). In C this is very simple (the comments are here only to avoid confusion, because I like to use pointers very much!):
#include <stdlib.h>
#include <string.h>
// Requires destp to point to a buffer with at least 12 chars.
char *itoa( char *destp, int x )
{
char *p, *endp;
long long n;
// 12 because INT_MIN has 11 chars + NUL char.
p = endp = destp + 12; // Points past the end of the buffer.
n = llabs( x ); // Get the absolute value of x in higher precision.
// Convert each digit.
*--p = '\0';
do *--p = '0' + n % 10; while ( n /= 10 );
// Puts a '-' in front if x is negative.
if ( x < 0 )
*--p = '-';
// Move the buffer to beginning if we're not there.
if ( p != destp )
memmove( destp, p, endp - p );
return destp;
}
Now, compare the assembly code (way more complicated, but a direct translation [made with -S option with GCC - I took the liberty to convert the mnemonics and directives to be compatible with NASM]):
; Entry: RDI = destp, ESI = x
itoa:
mov ecx, esi
mov BYTE [rdi+11], 0
mov r8, rdi
mov r9d, esi
neg ecx
cmovs ecx, esi
lea rsi, [rdi+11]
mov rdi, 0xCCCCCCCCCCCCCCCD
mov ecx, ecx
align 4
.loop:
mov rax, rcx
mov r10, rsi
sub rsi, 1
mul rdi
shr rdx, 3
lea rax, [rdx+rdx*4]
add rax, rax
sub rcx, rax
add ecx, '0'
mov BYTE [rsi], cl
mov rcx, rdx
test rdx, rdx
jne .loop
test r9d, r9d
jns .skip
mov BYTE [rsi-1], '-'
lea rsi, [r10-2]
.skip:
cmp rsi, r8
je .nomove
lea rdx, [r8+12]
sub rsp, 8
mov rdi, r8
sub rdx, rsi
call memmove wrt ..plt
add rsp, 8
ret
align 4
.nomove:
mov rax, r8
ret
I think the unavoidable conclusion is that the high level language version of the routine is the more comprehensible one.
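The least obvious part of the listing is probably the constant 0xCCCCCCCCCCCCCCCD: the compiler replaces the division by 10 with a multiplication by a rounded-up 2^67 / 10, then shifts the 128-bit product right by 67 bits (the high half in RDX, then shr rdx, 3). A minimal C sketch of the same trick, assuming the unsigned __int128 extension of GCC and Clang; div10 and mod10 are hypothetical names:

```c
#include <stdint.h>

/* n / 10 without a DIV instruction: multiply by ceil(2^67 / 10)
   and keep the bits above bit 67 of the 128-bit product. */
static uint64_t div10(uint64_t n)
{
    unsigned __int128 product =
        (unsigned __int128)n * 0xCCCCCCCCCCCCCCCDull;

    return (uint64_t)(product >> 67);   /* mul rdi ; shr rdx, 3 */
}

/* n % 10 recovered as n - 10 * (n / 10), as the listing does with
   lea rax, [rdx+rdx*4] ; add rax, rax ; sub rcx, rax. */
static unsigned int mod10(uint64_t n)
{
    uint64_t q = div10(n);

    return (unsigned int)(n - (q * 5) * 2);
}
```

Writing n / 10 and n % 10 in C and letting the compiler pick the constant is exactly what the itoa() source above does.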
Of course the encoded instructions will change between 32 and 64 bit programs, between different OSes, no matter if written in ASM or HLL. This is solvable at macro level, for instance I can write homonymous macroinstruction ShellSort in three different versions (for 16 (https://euroassembler.eu/maclib/sort16.htm#ShellSort), 32 (https://euroassembler.eu/maclib/sort32.htm#ShellSort), 64 (https://euroassembler.eu/maclib/sort64.htm#ShellSort) bit assembler programs) and use them almost as comfortably as in HLL.
This is not what I was trying to convey. Take the MOVDQA instruction: it exists on the Pentium 4 and later processors (SSE2), but not on older ones (and there are still old processors around). Assembly is "unportable"; C code isn't (if you obey the specification).
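To illustrate the point, code that wants to use an instruction like MOVDQA has to check for it first. A minimal sketch, assuming the __builtin_cpu_init() and __builtin_cpu_supports() built-ins that GCC and Clang provide on x86 targets; has_sse2 is a hypothetical name, and the #else branch just reports "not available" on any other compiler or architecture:

```c
/* Hypothetical helper: returns 1 if the running processor reports
   SSE2 (and therefore MOVDQA), 0 otherwise. */
static int has_sse2(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();                      /* fill in the CPU feature data */
    return __builtin_cpu_supports("sse2") ? 1 : 0;
#else
    return 0;                                  /* not an x86 build: no MOVDQA */
#endif
}
```

A hand-written assembly routine would need a check like this plus a fallback path, while the plain C version is simply recompiled for whatever processor is targeted.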
[]s
Fred