Since this forum is directed to NASM and Intel x86 processors, this will be my focus. I'm telling you this because there's no typical "assembly language". It depends on the processor AND the compiler (assembler).
First: Respect the calling convention.
If you are mixing assembly code with C code, keep in mind that x86 processors have three different modes of operation: 16 bits (or "real" mode), 32 bits (or i386 mode) and 64 bits (or x86-64 mode), each with different rules. In real mode, for instance, only BX or BP can be used as the base register (and only SI or DI as the index register) in an "effective address" operand (those with [ and ]). This is an error in real mode:
mov byte [dx],0
You'll have to do something like this:
mov bx,dx
mov byte [bx],0
But since the 80386 you can use E?? registers and break this rule, even in real mode:
mov byte [edx],0
The processor adds EDX to DS*16 to form the address. In real mode the offset must still fit within the 64 KiB segment limit (0xFFFF), otherwise the access raises a general protection fault, so it is prudent to zero the upper 16 bits of EDX beforehand:
movzx edx,dx
mov byte [edx],0
But this is longer than simply using BX.
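The segment arithmetic mentioned above (DS*16 plus the offset) can be modelled in C. This is only a sketch for checking addresses by hand: the function name real_mode_linear is made up for this post, and the offset is deliberately a 16-bit value, which mirrors zeroing the upper half of EDX.

```c
#include <stdint.h>

/* Hypothetical helper: models how real mode forms an address,
   linear = segment * 16 + offset. */
uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    /* Shifting left by 4 is the same as multiplying by 16. */
    return ((uint32_t)segment << 4) + offset;
}
```

For instance, real_mode_linear(0xB800, 0) gives 0xB8000, the usual address of the text mode video memory.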
I am telling you this because, in real mode, BP is used to access the arguments passed to a function (using the cdecl C calling convention) and the local objects on the stack. BP is mandatory here because SP cannot be used as a base register in a 16-bit effective address. This isn't an obligation in i386 or x86-64 mode, since there we can use any register as the base address.
And because BX and BP are the only base registers available in a 16-bit effective address, they are "preserved" across calls (they hold the same value before a call and after the routine returns). This is true for the other modes of operation as well, but only for historical reasons.
If your code is mixed with C functions, you need to know that (E)SI and (E)DI are preserved across calls as well (in real and i386 modes, and under the MS ABI in x86-64 mode). All other registers are free to be changed: (E)AX, (E)CX, (E)DX and (E)FLAGS. (E)SP isn't among them for obvious reasons (it is the stack pointer!).
In the case of BIOS calls (16 bits), the ROM-BIOS code tends to preserve all registers except AX and FLAGS, but this depends on the service. Take service 0x0E (TELETYPE OUTPUT) of INT 0x10. It is documented as preserving all registers, but also as possibly changing BP if a scroll-up occurs on certain BIOSes (google "Ralf Brown Interrupt List" for reference). Now take the service READ DISK SECTORS (0x02) of INT 0x13. It is documented as preserving all registers except FLAGS and AX. In case of error, CF=1 and AH contains an error code; otherwise CF=0 and AH=0.
So, BIOS has its own "calling convention".
Second: When mixing code, avoid using high level functions in your assembly code.
Why? Because different modes have different calling conventions. Your code can work well in i386 mode but not work at all in x86-64 mode. These two are the main modes used today (nowadays x86-64 mode is more common). This tip is valid when moving from real to i386 mode as well.
For example... In cdecl calling convention for i386 all arguments must be pushed to the stack, but in x86-64 mode (SysV ABI) the first 6 "integer" arguments are passed through registers (RDI, RSI, RDX, RCX, R8 and R9), not the stack.
Third: Avoid using R?? registers on x86-64 mode.
Why? Because the default operand size in x86-64 mode is still 32 bits... To use R?? registers, the instruction must carry a prefix (called the REX prefix), which can result in a longer instruction. As an example:
mov rax,0
In its full 64-bit immediate form, this is a 10-byte instruction (0x48 0xB8 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00), where 0x48 is a REX prefix (an assembler may pick a shorter encoding for you). And:
mov eax,0
Which does the same thing, is a 5-byte instruction (0xB8 0x00 0x00 0x00 0x00).
Think of R?? registers as the type `long long int` in C. Most of the time you'll deal with `int` (E??). And x86-64 mode has the property that, when you write to an E?? register, the upper 32 bits of the corresponding R?? register are automatically zeroed. This is valid only for E??, not for the 16-bit or 8-bit registers. This won't zero the upper bits of RAX:
mov ax,1
But this does:
mov eax,1
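To make the difference concrete, here is a small C model of the two writes. It is just a sketch of the rule, not real register access, and the names write_e32 and write_x16 are invented for this example.

```c
#include <stdint.h>

/* Models "mov eax, value": the whole 64-bit register becomes the
   zero-extended 32-bit value, so the old contents are discarded. */
uint64_t write_e32(uint64_t reg, uint32_t value)
{
    (void)reg;               /* previous upper bits are lost */
    return (uint64_t)value;
}

/* Models "mov ax, value": only bits 0..15 change, bits 16..63 are kept. */
uint64_t write_x16(uint64_t reg, uint16_t value)
{
    return (reg & ~UINT64_C(0xFFFF)) | value;
}
```

Starting from a register full of ones, write_e32(reg, 1) yields 1, while write_x16(reg, 1) yields 0xFFFFFFFFFFFF0001.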
Fourth: Syscalls aren't universal.
As I said before: "Respect the calling convention". Or, more properly, the Application Binary Interface. The int 0x80 and syscall instructions are the system call entry points for i386 and x86-64 modes, respectively, but only on SysV ABI systems (Linux, FreeBSD...). Windows has its own calling convention and its own syscalls, which are usually very difficult to use (since each version of Windows has its own set of syscalls, different from the others). On Windows, prefer the Win32 API (the Console API, for instance).
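As an illustration of how OS-specific this is, here is a Linux-only sketch in C that goes through glibc's syscall() wrapper instead of the usual library function. The name raw_getpid is made up for this post; SYS_getpid comes from <sys/syscall.h>, and none of this is expected to build on Windows.

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

/* Hypothetical helper: asks the Linux kernel for the process ID
   through the generic syscall() wrapper, bypassing getpid(). */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

The system call number behind SYS_getpid is different in i386 and x86-64 Linux, which is one more reason not to hard-code these numbers in code that must work in both modes.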
Sixth: Microsoft and Linux are different from each other.
Linux follows the [UNIX] SysV ABI. Microsoft has its own ABI (MS-ABI). In x86-64 mode, as an example, SysV ABI passes the first six "integer" arguments to a function in registers, but MS-ABI uses only 4 (RCX, RDX, R8 and R9, in that order). There are also "floating point" arguments: SysV ABI uses 8 SSE registers (XMM0~XMM7) for the first 8 of them, while MS-ABI uses only 4 (XMM0~XMM3).
SysV ABI expects AL to hold the number of XMM registers used for the arguments (an upper bound is enough) when calling a variadic function. MS-ABI doesn't follow the same rule (as far as I know).
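The AL rule is easy to observe with a small variadic function in C. The function below (sum_doubles is a name invented for this example) just adds its floating point arguments. If you compile a call to it for x86-64 SysV and look at the generated assembly, you should see the compiler loading AL (usually through something like mov eax,3 for three doubles) right before the call.

```c
#include <stdarg.h>

/* Hypothetical example: adds 'count' double arguments. */
double sum_doubles(int count, ...)
{
    va_list ap;
    double total = 0.0;

    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, double);   /* fetch the next double argument */
    va_end(ap);

    return total;
}
```

If you call a function like this from your own assembly code under the SysV ABI, setting AL is your job.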
Seventh: Windows 10+ doesn't deal with 16 bits code anymore
If you are learning assembly from a 16 bits perspective, know that 64-bit Windows (practically every Windows 10 or later installation) cannot run a .COM file or a 16 bits MZ executable. You'll need to use something like DOSBox (not completely equivalent to MS-DOS) or a VM running FreeDOS (or the old MS-DOS). You can also write an MBR (Master Boot Record) and run it with qemu if you like.
Eighth: Focus on x86-64 mode
Since the majority of installed operating systems run in this mode, i386 mode is becoming obsolete. And x86-64 is easier to deal with: the segment selector registers are mostly useless in this mode (only FS and GS still have a role, for things like thread-local storage). And there is an advantage: we have 8 more "integer" general purpose registers available, R8 to R15 (R12 to R15 must be preserved between calls when mixing code with C or syscalls), and 16 XMM registers (i386 mode uses only 8, when the processor has SSE capability).
Ninth: Take a look at what a good C compiler does
Most of the time it is better to start from already optimized code. GCC, with the -O2 option, tends to generate very well optimized code, taking advantage of knowledge about the processor (when -march=native is used) to avoid penalties from branch mispredictions, cache misses, data or instruction misalignment, etc. All of this can easily be overlooked when writing a function directly in assembly.
Take a simple division by 10 example:
int div10( int x ) { return x / 10; }
Here are two implementations, one written by hand (div10h) and the other generated by GCC (div10gcc):
div10h:
mov eax,edi
cdq            ; sign-extend EAX into EDX:EAX (idiv is a signed division)
mov ecx,10
idiv ecx
ret
div10gcc:
movsx rax, edi
sar edi, 31
imul rax, rax, 1717986919
sar rax, 34
sub eax, edi
ret
The first seems to be better (fewer instructions), but it is much slower: idiv typically takes tens of cycles (30~100 on some processors), while imul takes only about 3 or 4.
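What GCC is doing in div10gcc can be written back in C, step by step. This is only a sketch to show the trick: div10_magic is a name made up here, and the code assumes that >> on a negative signed value is an arithmetic shift (true for GCC and Clang, but implementation-defined in the C standard). The constant 1717986919 is 2^34/10 rounded up.

```c
#include <stdint.h>

/* Hypothetical C rendering of div10gcc: x / 10 without a division. */
int32_t div10_magic(int32_t x)
{
    int64_t product  = (int64_t)x * 1717986919;   /* imul rax, rax, 1717986919 */
    int32_t quotient = (int32_t)(product >> 34);  /* sar rax, 34 */
    int32_t sign     = x >> 31;                   /* sar edi, 31: 0 or -1 */

    /* Subtracting -1 for negative x rounds toward zero, as C requires. */
    return quotient - sign;                       /* sub eax, edi */
}
```

For example, div10_magic(99) is 9 and div10_magic(-99) is -9, the same results x / 10 gives.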
I can show you examples with loops, for which GCC generates "strange", but better and faster, code.
Well... if you like this, I'll continue in later texts.
[]s
Fred