NASM - The Netwide Assembler

Author Topic: Tips for new assembly programmers (Read 24969 times)

fredericopissarra

Full Member
Posts: 388

Tips for new assembly programmers

« on: October 21, 2022, 03:29:19 PM »

Since this forum is about NASM and Intel x86 processors, they will be my focus. I'm telling you this because there's no single "assembly language": it depends on the processor AND the assembler.

First: Respect the calling convention.

When mixing assembly code with C code, remember that x86 processors have 3 different modes of operation: 16 bits (or "real" mode), 32 bits (or i386 mode) and 64 bits (or x86-64 mode), each one with different rules. In real mode, for instance, only BX and BP can be used as the base register (and only SI and DI as the index register) in an "effective address" operand (those with [ and ]). This is an error in real mode:

mov byte [dx],0

You'll have to do something like this:

mov bx,dx
mov byte [bx],0

But since the 80386 you can use E?? registers and break this rule, even in real mode:

mov byte [edx],0

What the processor does is add EDX to DS*16. The offset must still fit inside the 64 KiB segment limit (in plain real mode an offset above 0xFFFF raises a fault), so it is prudent to zero the upper 16 bits of EDX beforehand:

movzx edx,dx
mov byte [edx],0

But this is longer than simply using BX.
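
To make the address arithmetic concrete, here is a minimal sketch in C of how a real-mode address is formed from a segment and an offset, and of what movzx edx,dx does to the offset. The function names real_mode_linear and zero_upper16 are only illustrative, they are not part of any API:

```c
#include <stdint.h>

/* Linear address in real mode: segment * 16 + offset.
   Assumes the offset already fits in 16 bits, as plain real mode requires. */
uint32_t real_mode_linear(uint16_t segment, uint32_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

/* Models "movzx edx,dx": keep the low 16 bits, clear the upper 16. */
uint32_t zero_upper16(uint32_t edx)
{
    return edx & 0xFFFFu;
}
```

For instance, real_mode_linear(0x1234, 0x0010) gives 0x12350, and zero_upper16(0xDEAD0010) gives 0x0010.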

I am telling you this because, in real mode, BP is what the cdecl C calling convention uses to access the arguments passed to a function and its local objects on the stack: SP cannot be used as a base register there, so BP is mandatory. This isn't an obligation in i386 or x86-64 mode, since we can use any register as base address.

And because BX and BP must be the base registers used in an effective address, they are "preserved" across calls (they hold the same value before a call and after the routine returns). This is true for the other modes of operation as well, but there only for historical reasons.

If your code is mixed with C functions, you need to know that (E)SI and (E)DI are preserved across calls as well (in real and i386 modes, and under the MS ABI in x86-64 mode). All other registers are free to be changed: (E)AX, (E)CX, (E)DX, (E)FLAGS. But (E)SP isn't among them, for obvious reasons (it is the stack pointer!).

In the case of BIOS calls (16 bits), the ROM-BIOS code tends to preserve all registers except AX and FLAGS. But this depends on the service. Take service 0x0E (TELETYPE OUTPUT) of INT 0x10: the documentation says all registers are preserved, but it also says that, in certain BIOSes, BP can be changed if a scroll up occurs (google "Ralf Brown Interrupt List" for reference). Now take the READ DISK SECTORS service (0x02) of INT 0x13: the documentation says all registers except FLAGS and AX are preserved. In case of error, CF=1 and AH contains an error code; otherwise CF=0 and AH=0.

So, BIOS has its own "calling convention".

Second: When mixing code, avoid using high level functions in your assembly code.

Why? Because different modes have different calling conventions. Your code can work well in i386 mode, but not work at all in x86-64 mode. These two are the main modes used today (nowadays x86-64 mode is more common). This tip is valid for real->i386 modes as well.

For example... In cdecl calling convention for i386 all arguments must be pushed to the stack, but in x86-64 mode (SysV ABI) the first 6 "integer" arguments are passed through registers (RDI, RSI, RDX, RCX, R8 and R9), not the stack.

Third: Avoid using R?? registers on x86-64 mode.

Why? Because the default operand size is still 32 bits, even in 64 bits mode... To use R?? registers the instruction needs a prefix (called the REX prefix), which can result in a longer instruction. As an example:

mov rax,0

Encoded with a full 64 bits immediate, this is a 10 bytes instruction (0x48 0xB8 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00), where 0x48 is a REX prefix. And:

mov eax,0

Which does the same thing, is a 5 bytes instruction (0xB8 0x00 0x00 0x00 0x00).

Think of R?? registers as the type `long long int` in C. Most of the time you'll deal with `int` (E??). And x86-64 mode has the property that, when an instruction writes to an E?? register, the upper 32 bits of the corresponding R?? register are automatically zeroed. This is valid only for E??. This won't zero the upper bits of RAX:

mov ax,1

But this does:

mov eax,1
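
If it helps, the two cases can be modelled in C. This is only a sketch of the rule described above, treating RAX as a plain 64 bits value; write_eax and write_ax are made-up names for illustration:

```c
#include <stdint.h>

/* Models "mov eax,imm32": the result is zero-extended into all 64 bits. */
uint64_t write_eax(uint64_t rax, uint32_t value)
{
    (void)rax;                 /* the old contents are discarded */
    return (uint64_t)value;
}

/* Models "mov ax,imm16": only bits 0..15 change, bits 16..63 are kept. */
uint64_t write_ax(uint64_t rax, uint16_t value)
{
    return (rax & ~(uint64_t)0xFFFF) | value;
}
```

Starting from RAX = 0xFFFFFFFFFFFFFFFF, write_eax(rax, 1) yields 1, while write_ax(rax, 1) yields 0xFFFFFFFFFFFF0001.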

Fourth: Syscalls aren't universal.

As I said before: "Respect the calling convention". Or, more properly, the Application Binary Interface. The int 0x80 and syscall instructions are valid only in i386 and x86-64 modes, respectively, and only on SysV ABI systems (Linux, FreeBSD...). Windows has its own calling convention and syscalls and, usually, they are very difficult to use (since each version of Windows has its own set of syscalls, different from the others). On Windows, prefer to use the Win32 API, like the Console API, for instance.

Sixth: Microsoft and Linux are different from each other.

Linux follows the [UNIX] SysV ABI. Microsoft has its own ABI (MS-ABI). In x86-64 mode, as an example, SysV ABI uses 6 registers for the first six "integer" arguments to a function, but MS-ABI uses only 4 (RCX, RDX, R8 and R9, in that order). There are also "floating point" arguments: SysV ABI uses 8 SSE XMM registers for the first 8 of them. MS-ABI uses only 4 (XMM0~XMM3).

SysV ABI expects AL to be set to the number of floating point arguments passed in XMM registers to a variadic function. MS-ABI doesn't follow the same rule (as far as I know).

Seventh: Windows 10+ doesn't deal with 16 bits code anymore

If you are learning assembly from a 16 bits perspective, know that it is impossible to run a .COM file or a 16 bits MZ executable on a 64 bits Windows 10 or later. You'll need to use something like DOSBox (not completely equivalent to MS-DOS) or a VM running FreeDOS (or the old MS-DOS). You can write an MBR (Master Boot Record) and run it with qemu if you like.

Eighth: Focus on x86-64 mode

Since the majority of installed operating systems run in this mode, i386 mode is becoming obsolete. And x86-64 is easier to deal with: the segment selector registers are mostly useless in this mode. And there is an advantage: we have 8 more "integer" general purpose registers available, R8 to R15 (R12 to R15 must be preserved between calls when mixing code with C or syscalls), and 16 XMM registers (i386 mode uses only 8, when the processor has SSE capability).

Ninth: Take a look at what a good C compiler does

Most of the time it is better to start from already optimized code. GCC, with the -O2 option, tends to generate very well optimized code, taking advantage of knowledge about the processor (when -march=native is used), such as avoiding penalties from branch mispredictions, cache misses, misaligned data or instructions, etc. All of this can easily be overlooked when writing a function directly in assembly.

Take a simple division by 10 example:

int div10( int x ) { return x / 10; }

Here are two implementations, one written by hand (div10h) and the other generated by GCC (div10gcc):

div10h:
  mov   eax,edi
  cdq               ; sign-extend EAX into EDX:EAX (x is signed)
  mov   ecx,10
  idiv  ecx
  ret

div10gcc:
  movsx rax, edi
  sar edi, 31
  imul  rax, rax, 1717986919
  sar rax, 34
  sub eax, edi
  ret

The first seems to be better (fewer instructions), but it is much slower (idiv takes 30~100 cycles; imul, only about 4 cycles).
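
To see why the GCC version gives the same answer, here is a sketch in C that follows div10gcc step by step. The constant 1717986919 is 2^34/10 rounded up. The name div10_mul is mine, and the code assumes a 32 bits int and that >> on a negative value is an arithmetic shift (true for GCC and Clang, but implementation-defined in the C standard):

```c
#include <stdint.h>

/* Same steps as div10gcc: multiply by ceil(2^34 / 10), shift, fix the sign. */
int div10_mul(int x)
{
    int64_t product = (int64_t)x * 1717986919;  /* imul rax,rax,1717986919 */
    int quotient = (int)(product >> 34);        /* sar rax,34 */
    int sign = x >> 31;                         /* sar edi,31: -1 if x < 0, else 0 */
    return quotient - sign;                     /* sub eax,edi */
}
```

The final subtraction adds 1 back for negative inputs, because the arithmetic shift rounds down while C division rounds toward zero.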

I can show you examples using loops, for which GCC creates "strange", but better and faster, code.

Well... if you like this, I'll continue in later texts.

[]s
Fred

« Last Edit: October 21, 2022, 03:37:55 PM by fredericopissarra »

fredericopissarra

« Reply #1 on: October 21, 2022, 03:39:34 PM »

I beg your pardon if this text isn't "colloquial" English or if there are any grammar mistakes. English isn't my native language (I'm from Brazil).

Frank Kotler

NASM Developer
Hero Member
Posts: 2667

« Reply #2 on: October 21, 2022, 05:27:12 PM »

I like it, Fred! The CPU doesn't speak English.

Best,
Frank

debs3759

Global Moderator
Full Member
Posts: 228

« Reply #3 on: October 21, 2022, 05:28:10 PM »

This is easy to read, your English is good.

A more extensive tutorial would be helpful, especially for x86-64 code (there is a lot more info online for 16- and 32-bit).

My graphics card database: www.gpuzoo.com

fredericopissarra

« Reply #4 on: October 22, 2022, 02:02:42 AM »

Thanks!

And the forum isn't allowing me to post "part 2" - it thinks I am a spammer.

Frank Kotler

« Reply #5 on: October 22, 2022, 02:51:43 AM »

Bummer!

I'll see if there is anything I can do about it... Hang in there!

Best,
Frank

fredericopissarra

« Reply #6 on: October 22, 2022, 01:43:19 PM »

Small tips, just not to be seen as a spammer:

In x86-64 mode there are 16 general purpose registers available: RAX, RBX, RCX, RDX, RSI, RDI, RBP, RSP, R8, R9, R10, R11, R12, R13, R14 and R15 (RIP and RFLAGS too, but they aren't made to be "general").

From RAX to RSP, in that list, there are the usual aliases. For example, EAX, AX, AH and AL are aliases of RAX: they are pieces of RAX, the same way it happens in i386 mode. There is a new thing here: RSI, RDI, RBP and RSP now have an extra alias to access the LSB (least significant BYTE): SIL, DIL, BPL and SPL. I can't see a scenario where SPL is useful, but there it is.

From R8 to R15 there are the same aliases, using a suffix D, W or B for DWORD, WORD and BYTE, like in R8D, R9W or R10B.

[]s
Fred

Frank Kotler

« Reply #7 on: October 22, 2022, 01:59:38 PM »

Thanks, Fred!

We're working on the Clean Talk problem...

I just removed a real spam...

Best,
Frank

fredericopissarra

« Reply #8 on: October 22, 2022, 03:32:33 PM »

Some performance tips:

1- Avoid using all three components of an effective address ([base+index*scale+offset]). The processor has a circuit that pre-calculates the address in a single fused multiply-and-add step, and that step handles two components at once: base+index*scale, base+offset or index*scale+offset. With all 3 components the calculation needs 2 steps (the multiply-and-add plus one more add), so the instruction can take 2 extra clock cycles.

Using [base+index*scale+offset] is slower;

2- Avoid using read-modify-write instructions like:

add [rbx],eax

Here [rbx] must be read and kept in a temporary register inside the processor, added to EAX and then written back to [rbx]. This will take 2 extra cycles. It is faster to read the data into a register and do the operation on that register:

mov ecx,[rbx]
add ecx,eax

3- Avoid using XCHG with memory references. Like this:

xchg eax,[rbx]

This works, but XCHG with a memory operand automatically asserts the LOCK# signal of the processor (as if it had a LOCK prefix) to avoid a race condition, which is slower... If you are using the XCHG instruction, prefer to use only registers.

4- Prefer to use 32 bits registers. Of course we have aliases for the WORD and BYTE portions of the GPRs (General Purpose Registers), but instructions using them can be bigger and slower. This is why C compilers prefer the int type.

5- Pay attention to the direction of conditional jumps.

Your processor uses an internal algorithm, called static branch prediction, to guess whether a conditional jump will be taken when it has no history for that jump. Forward jumps are assumed NOT to be taken and backward jumps are assumed TO BE taken. When a forward jump IS taken, the processor has to discard the instructions it already fetched and load the right ones, wasting time.

That's why good C compilers do this:

; while ( x != 0 ) x = f(x);
;--- assuming x in EAX:
  jmp .test
.loop:
  mov  edi,eax
  call  f
.test:
  test eax,eax
  jnz  .loop

Here, jnz is taken all the time, except when EAX==0, and since it is backwards, there's no penalty (except in the last iteration).

The compiler tends to invert the condition for ifs as well, like:

; if ( x == 0 ) f();
  test eax,eax
  jnz  .skip
  call f
.skip:

Your job is to make sure the condition is true more often than it is false. This way jnz isn't taken most of the time.

[]s
Fred

fredericopissarra

« Reply #9 on: October 22, 2022, 03:54:06 PM »

Ahhhh...
Prefer not to use J(E)CXZ, XLAT, LOOP, INC or DEC instructions. They are slower than the equivalent "usual" instructions:

; equivalent to JECXZ .L1:
...
  test ecx,ecx
  jz .L1

; Equivalent of XLATB (AL = [EBX+AL]); unlike XLATB, it clobbers the upper bits of EAX:
...
  movzx eax,al
  mov al,[ebx+eax]

; Equivalent of LOOP:
.loop:
  ...
  sub  ecx,1
  jnz   .loop

INC and DEC are slower than ADD and SUB because they don't affect the CF (_Carry Flag_), so the processor has to read-modify-write RFLAGS when using INC/DEC, which takes an extra cycle. This is not true of all processors: recent ones don't have this penalty (as far as I know).

Another tip: Prefer to use REP/MOVSB or REP/STOSB, instead of REP/MOVSW, REP/MOVSD, REP/MOVSQ or REP/STOSW, REP/STOSD or REP/STOSQ. Recent processors (I believe since the Ivy Bridge microarchitecture) have an optimized REP/MOVSB and REP/STOSB that copies entire cache lines -- which is as fast as it can be.

« Last Edit: October 22, 2022, 04:00:34 PM by fredericopissarra »

fredericopissarra

« Reply #10 on: October 22, 2022, 10:25:41 PM »

The stack
The process stack is used not only to record the return address of called functions, but also as a place to pass arguments to functions and to hold local objects. Each time you call a function (call instruction), the RIP register (which contains the address of the NEXT instruction) is pushed onto the stack. But before that, the caller also uses the stack for arguments: always in real and i386 modes, and in x86-64 mode when there are more than 4 (MS-ABI) or 6 (SysV ABI) integer arguments and/or more than 4 (MS-ABI) or 8 (SysV ABI) floating point arguments.

I've already shown a tip about using structures to manage "stack frames" in real and i386 modes. Let's extend this to the usage of local objects. Let's say we declare a local array of 16 ints, as in:

int f( int x )
{
  int a[16];
  ...
}

In i386 mode x is passed through the stack. Remember that, on entry, ESP points to the return address pushed by CALL, so ESP+4 is the address where x is. The 16 ints array (64 bytes) is allocated on the stack below the return address, from ESP-64 to ESP-1. So it is common to subtract 64 bytes from ESP before using it to reach both the argument and the local objects:

f:
  sub esp,64  ; allocate space on stack for a.

  ; ... ESP now points to the beginning of a[].
  ; ... use ESP+68 to get x (64 bytes of a plus 4 bytes for the return address).

  add esp,64  ; return ESP to its original state.
  ret

This is also easy to do with structures:

struc fstk
.a: resd 16
.localstk:
    resd 1    ; the return address
.x: resd 1   ; x argument on stack
endstruc

f:
  sub  esp,fstk.localstk   ; allocate a[] (fstk.localstk = 64)
  mov  eax,[esp+fstk.x]    ; get x (fstk.x = 68)
  ...
  add  esp,fstk.localstk   ; release a[]
  ret
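
As a cross-check of those offsets, the same layout can be written as a C struct and measured with offsetof. This is just a sketch: the struct mirrors the i386 frame above (4 bytes return address), and the names return_addr, FSTK_LOCALSTK and FSTK_X are made up for the example:

```c
#include <stddef.h>
#include <stdint.h>

/* Mirrors the NASM "struc fstk" layout, lowest address first. */
struct fstk {
    int32_t  a[16];        /* .a: resd 16                           */
    uint32_t return_addr;  /* .localstk: the return address         */
    int32_t  x;            /* .x: the argument pushed by the caller */
};

/* Bytes to subtract from ESP, and where x sits after that. */
enum {
    FSTK_LOCALSTK = offsetof(struct fstk, return_addr),  /* 64 */
    FSTK_X        = offsetof(struct fstk, x)             /* 68 */
};
```

The offsets match what NASM computes for fstk.localstk and fstk.x, because every member is 4 bytes wide and no padding is needed.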

« Last Edit: October 22, 2022, 10:44:02 PM by fredericopissarra »

fredericopissarra

« Reply #11 on: October 22, 2022, 10:33:35 PM »

In i386 mode ESP must be DWORD aligned (a multiple of 4), but in x86-64 mode RSP must be QWORD aligned (a multiple of 8). And there's another advantage for x86-64: there is a thing called "The Red Zone" (nothing like "The Twilight Zone"!).

Beware: ESP (or RSP) must be kept aligned all the time. If we declare an 11 bytes array locally, we must allocate 12 bytes (i386) or 16 bytes (x86-64).
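
The rounding is the usual "align up" trick. A minimal sketch in C (align_up is just a name I picked; the alignment must be a power of two):

```c
#include <stddef.h>

/* Rounds size up to the next multiple of alignment.
   alignment must be a power of two (4, 8, 16, ...). */
size_t align_up(size_t size, size_t alignment)
{
    return (size + alignment - 1) & ~(alignment - 1);
}
```

With this, align_up(11, 4) gives 12 and align_up(11, 8) gives 16, the two sizes mentioned above.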

The red zone is a 128 bytes area just below the current RSP (at lower addresses) that the SysV ABI guarantees will not be disturbed by signal or interrupt handlers. If our local data fits inside this zone, we don't need to adjust RSP the way we did before.

This zone is usable only by functions that don't call other functions (leaf functions). If there are any calls, each CALL pushes its return address over that area, so you MUST allocate space for local objects as shown before.

« Last Edit: October 22, 2022, 10:45:40 PM by fredericopissarra »

fredericopissarra

« Reply #12 on: October 22, 2022, 10:49:43 PM »

There is one more thing about RSP alignment in x86-64 mode. In truth, RSP must be kept aligned by DQWORD (16 bytes), not 8. This is because x86-64 mode uses SSE for scalar floating point, XMM registers are 128 bits long, and aligned SSE loads and stores need DQWORD aligned addresses.

RSP is DQWORD aligned just before a CALL, so the return address pushed by CALL leaves RSP QWORD aligned, but NOT DQWORD aligned, when the function starts. The QWORDs right next to it (at RSP-8 and RSP+8) are guaranteed to be DQWORD aligned.
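
In practice this means a function that calls others has to subtract an odd multiple of 8 from RSP. Here is a small sketch in C of that calculation; frame_bytes is a hypothetical helper, and it assumes RSP % 16 == 8 on entry and that the function pushes nothing else:

```c
#include <stddef.h>

/* Bytes to subtract from RSP at function entry so that RSP is 16-byte
   aligned again and at least `locals` bytes are reserved.
   Assumes RSP % 16 == 8 on entry and that nothing else was pushed. */
size_t frame_bytes(size_t locals)
{
    size_t total = (locals + 8 + 15) & ~(size_t)15;  /* round (locals + 8) up to 16 */
    return total - 8;                                /* the 8 is the return address */
}
```

So a function with no locals still does sub rsp,8 before calling anything, and one with a 64 bytes array would use sub rsp,72.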

fredericopissarra

« Reply #13 on: October 24, 2022, 01:53:22 PM »

Using custom sections

An executable is divided into some "default" sections. Each section is a block of bytes holding code or data, loaded into memory by your operating system's process loader. There are 4 default sections on most operating systems: .text, .data, .rodata and .bss.

.text contains instructions, the actual executable code; .data contains initialized, writable data; .rodata contains read-only data; and .bss contains uninitialized data.

Typically, .text, .data and .rodata are loaded from the executable into memory, while .bss takes no space in the file and is filled with zeros when the program is loaded.

These are the default sections. You can create your own, if necessary (usually the default sections are enough!). Section names beginning with '.' are usually reserved by the ABI (or the executable format), like .text or .bss, so name your own sections without the leading dot; apart from that, you can name them the way you like (there is no limit on the section name... well... no 'practical' limit). But you have to describe your section. Here's an example:

; test.asm - elf64 executable.
;
;   nasm -felf64 -o test.o test.asm
;   ld -o test test.o
;
  bits  64
  default rel

  section .text     ; default 'code' section

  global  _start

  align 4
_start:
  call writestr
  jmp  exit 

  ; Custom 'code' section
  section strrtn progbits alloc exec nowrite align=4

writestr:
  lea   rsi,[msg]
  mov   eax,1
  mov   edi,eax
  mov   edx,msg_len
  syscall
  ret

  section .rodata   ; default 'readonly' data section.

msg:
  db    `Hello\n`,0
msg_len equ $ - msg

  ; Another custom 'code' section
  section system progbits alloc exec nowrite align=4

exit:
  mov   eax,60
  xor   edi,edi
  syscall

Assembling and taking a look at the section headers:

$ nasm -felf64 -o test.o test.asm
$ objdump -h test.o
test.o:     file format elf64-x86-64

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         0000000a  0000000000000000  0000000000000000  000002c0  2**4
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  1 strrtn        00000016  0000000000000000  0000000000000000  000002d0  2**2
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  2 .rodata       00000007  0000000000000000  0000000000000000  000002f0  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  3 system        00000009  0000000000000000  0000000000000000  00000300  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE

And, once linked with ld (see the comment at the top of the source), it works:

$ ./test
Hello

Creating a custom section is useful, mostly, when you are trying to create your own "operating system". You can mix 16 bits code with 32 bits code and with 64 bits code in different sections, if you like.

« Last Edit: October 24, 2022, 01:56:18 PM by fredericopissarra »

fredericopissarra

« Reply #14 on: October 24, 2022, 02:08:20 PM »

PS: Take a look at the source code of Inigo Quilez's Elevated demo, here. This is 32 bits (i386) code, targeting Windows and using DX9. Notice that the NASM code uses a lot of custom sections.
