Tips for new assembly programmers

Frank Kotler:
Bummer!

I'll see if there is anything I can do about it... Hang in there!

Best,
Frank

fredericopissarra:
Small tips, just not to be seen as a spammer:

In x86-64 mode there are 16 general purpose registers available: RAX, RBX, RCX, RDX, RSI, RDI, RBP, RSP, R8, R9, R10, R11, R12, R13, R14 and R15 (RIP and RFLAGS too, but they aren't made to be "general").

From RAX to RSP, in that list, there are the usual aliases. For example, EAX, AX, AH and AL are aliases of RAX: they are pieces of RAX, the same way it happens in i386 mode. There is one new thing: RSI, RDI, RBP and RSP now have an extra alias to access the LSB (least significant byte): SIL, DIL, BPL and SPL. I can't see a scenario where SPL is useful, but there it is.

R8 to R15 have the same kind of aliases, using the suffix D, W or B for DWORD, WORD and BYTE, as in R8D, R9W or R10B. They have no equivalent of AH.
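
A small sketch of how the aliases behave. The values are arbitrary, and the fragment is meant to be read (or single-stepped in a debugger), not used as a complete program:

--- Code: ---; what each write does to the full 64-bit register
  mov  rax,0x1122334455667788
  mov  al,0xFF          ; RAX = 0x11223344556677FF (only the low byte changes)
  mov  ah,0             ; RAX = 0x11223344556600FF (only bits 8..15 change)
  mov  ax,0x1234        ; RAX = 0x1122334455661234 (only the low 16 bits change)
  mov  eax,1            ; RAX = 0x0000000000000001 (a 32-bit write zeroes the upper 32 bits)
  mov  sil,al           ; the new low-byte alias of RSI
  mov  r8d,eax          ; low DWORD of R8 (the upper half of R8 is zeroed as well)
  mov  r9w,ax           ; low WORD of R9
  mov  r10b,al          ; low BYTE of R10
--- End code ---
One thing to watch for: AH, BH, CH and DH cannot be used in the same instruction as SIL, DIL, BPL, SPL or any of R8 to R15, because the REX prefix changes the meaning of their encoding.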

[]s
Fred

Frank Kotler:
Thanks, Fred!

We're working on the Clean Talk problem...

I just removed a real spam...

Best,
Frank

fredericopissarra:
Some performance tips:

1- Avoid using all three components of an effective address ([base+index*scale+offset]). The processor has dedicated address-generation hardware that calculates simple addresses quickly, but on many processors the full three-component form costs extra cycles. On Intel's Sandy Bridge through Skylake, for example, an LEA with two components (base+index*scale, base+offset or index*scale+offset) has a latency of 1 cycle, while base+index*scale+offset takes 3 cycles, as if it were done in 2 steps: a multiply-and-add plus an extra add.
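
A sketch of one way to avoid the three-component form inside a loop. The layout is an assumption made up for the example: RBX points to a structure, a DWORD array starts at offset 16, RCX holds the number of elements (at least 1), and the sum ends up in EAX:

--- Code: ---; instead of "add eax,[rbx+rsi*4+16]" inside the loop:
  lea  rdx,[rbx+16]     ; fold base+offset once, outside the loop
  xor  eax,eax          ; sum = 0
  xor  esi,esi          ; index = 0 (also zeroes the upper half of RSI)
.loop:
  add  eax,[rdx+rsi*4]  ; only two components: base+index*scale
  add  rsi,1
  cmp  rsi,rcx
  jb   .loop
--- End code ---
The extra LEA runs once, while the cheaper address is used on every iteration. Whether it pays off depends on the processor, so measure.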

2- Avoid using read-modify-write instructions like:

--- Code: ---add [rbx],eax
--- End code ---
Here [rbx] must be read and kept in a temporary register inside the processor, added to EAX and then written back to [rbx]. This can take 2 extra cycles. It is faster to read the data into a register and do the operation on that register:

--- Code: ---mov ecx,[rbx]
add ecx,eax
--- End code ---
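
When the same memory location is updated many times, the same idea can be applied to a whole loop. A sketch, assuming (for the example only) that RSI points to an array of DWORDs, EDX holds the element count (at least 1) and [rbx] holds a running total:

--- Code: ---; instead of doing "add [rbx],eax" on every iteration:
  mov  ecx,[rbx]      ; read the total once
.loop:
  add  ecx,[rsi]      ; accumulate in a register (a plain load, no write back)
  add  rsi,4
  sub  edx,1
  jnz  .loop
  mov  [rbx],ecx      ; write the total back once
--- End code ---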

3- Avoid using XCHG with memory references. Like this:

--- Code: ---xchg eax,[rbx]
--- End code ---
This works, but XCHG with a memory operand automatically asserts the processor's LOCK# signal (as if it had a LOCK prefix) to avoid a race condition, which is slower. If you are using the XCHG instruction, prefer to use only registers.
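
If atomicity is not needed, the swap can be sketched with plain MOVs and a scratch register (ECX here is an arbitrary choice):

--- Code: ---; swaps EAX with the DWORD at [rbx]; NOT atomic, unlike xchg eax,[rbx]
  mov  ecx,[rbx]      ; old memory value -> scratch
  mov  [rbx],eax      ; EAX -> memory
  mov  eax,ecx        ; old memory value -> EAX
--- End code ---
Keep XCHG with memory for the cases where the atomic exchange is what you actually want.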

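4- Prefer to use 32-bit registers. Of course we have aliases for the WORD and BYTE portions of the GPRs (General Purpose Registers), but instructions using them can be bigger (16-bit operands need an operand-size prefix) and slower (writing only part of a register can create a dependency on its old value). This is one reason why C compilers prefer the int type.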
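
For example, when loading a byte or a word it is usually better to extend it into a full 32-bit register. A sketch, assuming RBX points to some data (the offsets are made up):

--- Code: ---; extend small values instead of writing partial registers
  movzx eax,byte [rbx]    ; instead of: mov al,[rbx]    (zero-extends into EAX)
  movsx ecx,word [rbx+2]  ; instead of: mov cx,[rbx+2]  (sign-extends into ECX)
  add   eax,ecx
  xor   edx,edx           ; zeroes all of RDX; shorter than xor rdx,rdx (no REX prefix)
--- End code ---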

5- Pay attention to the direction of conditional jumps.
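Your processor predicts whether a conditional jump will be taken or not. When it has no history for a jump, it can fall back on what is called a static branch prediction algorithm: forward jumps are assumed NOT to be taken and backward jumps are assumed TO BE taken. When the prediction is wrong (for example, a forward jump IS taken), the processor has to discard the instructions it had already started on and fetch from the right address, wasting time. Recent processors rely mostly on dynamic prediction, so this matters mainly the first time a jump is seen (as far as I know).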

That's why good C compilers do this:

--- Code: ---; while ( x != 0 ) x = f(x);
;--- assuming x in EAX:
  jmp .test
.loop:
  mov  edi,eax
  call  f
.test:
  test eax,eax
  jnz  .loop
--- End code ---
Here, jnz is taken all the time, except when EAX==0, and since it is backwards, there's no penalty (except in the last iteration).

The compiler tends to invert the condition for ifs as well, like:

--- Code: ---; if ( x == 0 ) f();
  test eax,eax
  jnz  .skip
  call f
.skip:
--- End code ---
Your job is to make sure the condition is true more often than it is false. This way jnz isn't taken most of the time.
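
The same idea applies to error handling: keep the common case on the fall-through path and move the rare case out of line. A sketch, assuming a hypothetical function g that returns a negative value in EAX on error:

--- Code: ---; if ( g() < 0 ) handle the error; the error is expected to be rare
  call g
  test eax,eax
  js   .error        ; forward jump, rarely taken
  ; ... common path continues here ...
  ret
.error:
  ; ... rare path, out of the way ...
  ret
--- End code ---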

[]s
Fred

fredericopissarra:
Ahhhh...
Prefer not to use J(E)CXZ, XLAT, LOOP, INC or DEC instructions. They are slower than the equivalent sequences using "usual" instructions:

--- Code: ---; equivalent to JECXZ .L1:
...
  test ecx,ecx
  jz .L1
--- End code ---

--- Code: ---; Equivalent of XLATB (AL = [EBX+AL]):
...
  movzx eax,al       ; unlike XLATB, this clears the upper bits of EAX
  mov al,[ebx+eax]
--- End code ---

--- Code: ---; Equivalent of LOOP:
.loop:
  ...
  sub  ecx,1
  jnz   .loop
--- End code ---
INC and DEC can be slower than ADD and SUB because they don't affect CF (the _Carry Flag_), so the processor has to merge the new flags with the old CF value (a read-modify-write of RFLAGS), which can take an extra cycle. This does not apply to all processors: recent ones don't have this penalty (as far as I know).

Another tip: prefer REP MOVSB or REP STOSB over REP MOVSW, REP MOVSD, REP MOVSQ or REP STOSW, REP STOSD, REP STOSQ. Recent processors (I believe since the Ivy Bridge microarchitecture) have an optimized REP MOVSB and REP STOSB that copy entire cache lines, which is as fast as it can be.
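
A minimal sketch of a byte copy built on REP MOVSB, assuming the SysV x86-64 calling convention (RDI = destination, RSI = source, RDX = byte count) and buffers that don't overlap. The name copy_bytes is made up for the example:

--- Code: ---; void *copy_bytes(void *dst, const void *src, size_t n);
copy_bytes:
  mov  rax,rdi       ; return the destination pointer
  mov  rcx,rdx       ; REP uses RCX as the counter
  rep  movsb         ; copies RCX bytes from [RSI] to [RDI] (the ABI guarantees DF=0 on entry)
  ret
--- End code ---
For small sizes the startup cost of REP MOVSB can dominate, so measure before relying on it.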
