Author Topic: About obsolete practices... (Read 2958 times)

fredericopissarra · « **on:** June 18, 2023, 01:27:11 PM »

Almost 40 years ago Intel launched the 80386 microprocessor and yet lots of people still think in terms of the old (1979) 8086. Modern operating systems no longer support even the Pentium (1993). So, why bother with these processors and use techniques that were mandatory for them?

Here's an example: prologues/epilogues to manipulate stack frames. They existed because on the pre-386 processors BP was the only base register that addressed the stack segment by default. Yep, you couldn't use SP in an address such as [sp+2]. This:

Code: [Select]

; int f( int x ) { return x + 1; }
f:
  mov ax,[sp+2]   ; invalid: SP can't be a base register in 16-bit addressing
  inc ax
  ret

This doesn't assemble, even today!

The 386 changed this... you CAN use `ESP` there, even in real mode, because in that mode every logical address is still segment*16+offset, resulting in a 20-bit "physical" address. The offset computed from ESP must still fit in 16 bits there (the upper half of ESP is normally zero). This way, prologues/epilogues are obsolete and should be avoided. Why? `push` writes to memory AND decrements ESP. Writing to memory not present in the cache adds a huge penalty to your code... That wasn't a thing on the early 8086~80386 processors, but it is on the Pentium Pro and newer ones. The newbie tends to automatically insert a prologue/epilogue in the function above, like this:

Code: [Select]

f:
  ; Epilogue
  push bp         ; Write to the stack.
  mov  bp,sp      ; Copy SP because we can use only BP as base.

  mov  ax,[bp+4]  ; Argument sits above saved BP (2 bytes) and return address (2 bytes).
  inc  ax

  ; Prologue
  pop  bp         ; Restore BP and SP.

  ret

When it is much simpler to do:

Code: [Select]

f:
  mov  ax,[esp+2] ; Argument sits just above the 2-byte return address.
  inc  ax
  ret

The use of ESP here guarantees SS is used as the segment. And the resulting code is shorter (and faster)... PUSH/MOV waste 3 bytes and take 2~3 clock cycles. POP wastes 1 more byte and 2 more cycles. In the function above, MOV/INC take 3 cycles and that's it. The only inconvenience is that the MOV with an ESP-based address needs a 0x67 (address-size override) prefix.
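To make the size comparison concrete, here is a rough sketch that lists the machine code of both versions as byte arrays and lets the compiler count them. The encodings are hand-assembled for 16-bit code, and the +4/+2 offsets assume a near call with a single int argument; the array and function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hand-assembled 16-bit encodings (assumption: near call, one int argument). */

/* Version with a BP stack frame. */
static const uint8_t f_with_frame[] = {
    0x55,             /* push bp         */
    0x89, 0xE5,       /* mov  bp, sp     */
    0x8B, 0x46, 0x04, /* mov  ax, [bp+4] */
    0x40,             /* inc  ax         */
    0x5D,             /* pop  bp         */
    0xC3              /* ret             */
};

/* Version addressing through ESP (needs the 0x67 prefix in 16-bit code). */
static const uint8_t f_with_esp[] = {
    0x67, 0x8B, 0x44, 0x24, 0x02, /* mov ax, [esp+2] */
    0x40,                         /* inc ax          */
    0xC3                          /* ret             */
};

/* Size in bytes of each version. */
size_t frame_version_size(void) { return sizeof f_with_frame; }
size_t esp_version_size(void)   { return sizeof f_with_esp; }
```

Under these assumptions the framed version is 9 bytes and the ESP version is 7. The ESP-based MOV itself is 2 bytes longer (prefix plus SIB byte), so the saving comes entirely from dropping PUSH/MOV/POP.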

Of course I'm talking about real mode here. In 32-bit protected mode no address-size override is needed.
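For readers who haven't met real-mode addressing, the segment*16+offset rule mentioned above can be sketched in C. The function names are hypothetical, and the 20-bit variant models an 8086 (or a later CPU with the A20 line masked), which is an assumption on my part.

```c
#include <stdint.h>

/* Linear address as computed in real mode: segment * 16 + offset. */
uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

/* Same address truncated to 20 bits, as on an 8086 or with A20 masked. */
uint32_t real_mode_linear_20bit(uint16_t segment, uint16_t offset)
{
    return real_mode_linear(segment, offset) & 0xFFFFFu;
}
```

For example, segment 0x1234 with offset 0x0010 gives 0x12350, while segment 0xFFFF with offset 0x0010 gives 0x100000, which wraps to 0 when only 20 address bits exist.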

Another thing about obsolescence is that Intel is preparing to get rid of the real/i386 modes of operation (see the x86-S specification) in the near future. I don't know if this will happen in the new Core or Core Ultra processors (generation 14), but it will happen. Add to that the fact that not even the Pentium Pro has been sold for a couple of decades, and you'll see why programming in assembly as if we were still in the '80s is ridiculous.

munair · « **Reply #1 on:** July 20, 2023, 07:12:48 AM »

prologue = opening section
epilogue = closing section

Other than that, many compilers today still generate stack frames (in non-optimized code).

fredericopissarra · « **Reply #2 on:** July 20, 2023, 11:29:14 AM »

Quote from: munair on July 20, 2023, 07:12:48 AM

prologue = opening section
epilogue = closing section

Other than that, many compilers today still generate stack frames (in non-optimized code).

No... prologues and epilogues (not sure those words exist in English -- I'm Brazilian, so, sorry if I'm wrong) are an ancient technique to access the stack. They haven't been needed since the 386.

munair · « **Reply #3 on:** July 21, 2023, 06:21:55 AM »

In normal language prologue is a part that comes "before", while epilogue is a part that comes "after". With stack frames this is the same:

Quote

In assembly language programming, the function prologue is a few lines of code at the beginning of a function, which prepare the stack and registers for use within the function. Similarly, the function epilogue appears at the end of the function,

Source: wikipedia

That said, without optimization switches, GCC still generates old-fashioned stack frames. Have a look at the following example:

Code: (C) [Select]

#include <stdio.h>

int main( void )
{
    unsigned int x = 3, y = 1, sum, carry;
    sum = x ^ y; // x XOR y
    carry = x & y; // x AND y
    while (carry != 0)
    {
        carry = carry << 1; // left shift the carry
        x = sum; // initialize x as sum
        y = carry; // initialize y as carry
        sum = x ^ y; // sum is calculated
        carry = x & y; /* carry is calculated, the loop condition is 
                          evaluated and the process is repeated until 
                          carry is equal to 0.
                        */
    }
    printf("%u\n", sum); // the program will print 4
    return 0;
}

On Compiler Explorer, GCC 13.1 generates the following assembly (Intel syntax):

Code: (masm) [Select]

.LC0:
        .string "%u\n"
main:
        push    rbp
        mov     rbp, rsp
        sub     rsp, 16
        mov     DWORD PTR [rbp-12], 3
        mov     DWORD PTR [rbp-16], 1
        mov     eax, DWORD PTR [rbp-12]
        xor     eax, DWORD PTR [rbp-16]
        mov     DWORD PTR [rbp-4], eax
        mov     eax, DWORD PTR [rbp-12]
        and     eax, DWORD PTR [rbp-16]
        mov     DWORD PTR [rbp-8], eax
        jmp     .L2
.L3:
        sal     DWORD PTR [rbp-8]
        mov     eax, DWORD PTR [rbp-4]
        mov     DWORD PTR [rbp-12], eax
        mov     eax, DWORD PTR [rbp-8]
        mov     DWORD PTR [rbp-16], eax
        mov     eax, DWORD PTR [rbp-12]
        xor     eax, DWORD PTR [rbp-16]
        mov     DWORD PTR [rbp-4], eax
        mov     eax, DWORD PTR [rbp-12]
        and     eax, DWORD PTR [rbp-16]
        mov     DWORD PTR [rbp-8], eax
.L2:
        cmp     DWORD PTR [rbp-8], 0
        jne     .L3
        mov     eax, DWORD PTR [rbp-4]
        mov     esi, eax
        mov     edi, OFFSET FLAT:.LC0
        mov     eax, 0
        call    printf
        mov     eax, 0
        leave
        ret

Perhaps one of the reasons is that prologues and epilogues can contain code for buffer overflow protection.
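To see that kind of protection code, a small test function with a local array is enough. This is only a sketch: `copy_bounded` is a made-up name, and I'm assuming GCC on x86-64 Linux, compiled with something like `gcc -O0 -fstack-protector-strong -S -masm=intel`. With that option GCC should load a canary value in the prologue, check it before returning, and call `__stack_chk_fail` on a mismatch.

```c
#include <stddef.h>
#include <string.h>

/* Copies at most 15 characters of `src` into a local buffer and returns
   the number of characters stored. The local array is what makes the
   stack protector instrument this function. */
size_t copy_bounded(const char *src)
{
    char buf[16];
    size_t n = strlen(src);

    if (n > sizeof buf - 1)
        n = sizeof buf - 1;      /* never write past the buffer */
    memcpy(buf, src, n);
    buf[n] = '\0';
    return strlen(buf);
}
```

Comparing the output with and without `-fstack-protector-strong` shows which instructions belong to the protection and which to the plain stack frame.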

fredericopissarra · « **Reply #4 on:** July 21, 2023, 12:15:42 PM »

Quote from: munair on July 21, 2023, 06:21:55 AM

In normal language prologue is a part that comes "before", while epilogue is a part that comes "after".

This is the dictionary definition of the words...

Quote from: munair on July 21, 2023, 06:21:55 AM

Code: (masm) [Select]

.LC0:
        .string "%u\n"
main:
        push    rbp
        mov     rbp, rsp
        sub     rsp, 16
        mov     DWORD PTR [rbp-12], 3
        mov     DWORD PTR [rbp-16], 1
        mov     eax, DWORD PTR [rbp-12]
        xor     eax, DWORD PTR [rbp-16]
        mov     DWORD PTR [rbp-4], eax
        mov     eax, DWORD PTR [rbp-12]
        and     eax, DWORD PTR [rbp-16]
        mov     DWORD PTR [rbp-8], eax
        jmp     .L2
.L3:
        sal     DWORD PTR [rbp-8]
        mov     eax, DWORD PTR [rbp-4]
        mov     DWORD PTR [rbp-12], eax
        mov     eax, DWORD PTR [rbp-8]
        mov     DWORD PTR [rbp-16], eax
        mov     eax, DWORD PTR [rbp-12]
        xor     eax, DWORD PTR [rbp-16]
        mov     DWORD PTR [rbp-4], eax
        mov     eax, DWORD PTR [rbp-12]
        and     eax, DWORD PTR [rbp-16]
        mov     DWORD PTR [rbp-8], eax
.L2:
        cmp     DWORD PTR [rbp-8], 0
        jne     .L3
        mov     eax, DWORD PTR [rbp-4]
        mov     esi, eax
        mov     edi, OFFSET FLAT:.LC0
        mov     eax, 0
        call    printf
        mov     eax, 0
        leave
        ret

Why the prologue/epilogue when, in x86-64 mode, arguments are normally passed in registers? Here, without optimizations, the compiler chooses to use the stack to hold the local objects (unnecessary as well, since there are enough registers to hold them). Notice that not even minimal optimization is done (mov eax,0 is bigger than xor eax,eax and isn't a recognized zeroing idiom).

LEAVE has a throughput of 4 cycles, while POP RBP has 3 (that's WHY the compiler doesn't use the ENTER instruction: 8 cycles, against only 3 for PUSH RBP).
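On the size difference between the two ways of zeroing EAX, a quick sketch with the encodings written out as byte arrays (hand-assembled for 32/64-bit code; the names are made up for illustration):

```c
#include <stddef.h>
#include <stdint.h>

/* Two ways of zeroing EAX, as encoded in 32-bit or 64-bit code. */
static const uint8_t mov_eax_0[]   = { 0xB8, 0x00, 0x00, 0x00, 0x00 }; /* mov eax, 0   */
static const uint8_t xor_eax_eax[] = { 0x31, 0xC0 };                   /* xor eax, eax */

/* Size in bytes of each encoding. */
size_t mov_zero_size(void) { return sizeof mov_eax_0; }
size_t xor_zero_size(void) { return sizeof xor_eax_eax; }
```

The MOV form carries a full 32-bit immediate, so it takes 5 bytes against 2 for the XOR form.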

Without optimizations the compiler will always create inefficient code. Here's an example:

Code: [Select]

; int f( int x ) { return x + 1; }

; -O2 -fomit-frame-pointer    ; No optimizations
f:                            f:
  lea eax, [rdi+1]              push  rbp
  ret                           mov   rbp, rsp

                                mov   [rbp-4], edi  ; write argument on the stack.
                                mov   eax, [rbp-4]  ; read back from stack (why?!)
                                add   eax, 1

                                pop rbp
                                ret

The unoptimized code is the worst code possible: 4 memory accesses (2 potential cache misses) and no instructions can be paired (each one depends on the previous). Not counting call/ret, the optimized version runs in 3 cycles; the unoptimized one, in 12 (at least).

In i386 mode, using the cdecl convention, the compiler passes arguments on the stack, but even then the prologue/epilogue isn't necessary; dropping it gets rid of 3 instructions and 7 cycles. Unoptimized code also adds other artifacts (especially if you are building PIE executables):

Code: [Select]

; optimized             ; not optimized
f:                      f:
  mov eax, [esp+4]        push  ebp
  add eax, 1              mov ebp, esp
  ret
                          call  __x86.get_pc_thunk.ax
                          add eax, _GLOBAL_OFFSET_TABLE_
                          mov eax, [ebp+8]
                          add eax, 1

                          pop ebp
                          ret

                        __x86.get_pc_thunk.ax:
                          mov eax, [esp]
                          ret
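If you want to reproduce listings like the ones above, the C source is just the one-line function. The file name `f.c` is my own choice and the exact output will depend on your GCC version and target, so take this as a sketch rather than a guaranteed result.

```c
/* f.c: the function discussed above, for inspecting compiler output. */
int f(int x)
{
    return x + 1;
}
```

Something like `gcc -O0 -S -masm=intel f.c` should give the framed version and `gcc -O2 -fomit-frame-pointer -S -masm=intel f.c` the short one. Adding `-m32` switches to i386 code, and adding `-fno-pie` there should drop the `__x86.get_pc_thunk.ax` call.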

Again: this is an OLD practice and should be avoided, especially in assembly. The ONLY purpose of prologues/epilogues is to allow access to the stack on the old 8086/80186/80286 processors. Since the 386 this is unnecessary. Without it, RBP is free to use as a general-purpose register instead of as a base pointer into the stack.

munair · « **Reply #5 on:** July 21, 2023, 02:16:17 PM »

Without optimizations, compilers simply output the logical translation of the source code step by step, even if it means unnecessarily reading back from the stack. Compilers are among the most complex pieces of software; each process has to be logical and clear. Therefore, optimization is by necessity a separate step.

In its current state, whereby optimization has not been implemented yet, the SharpBASIC compiler is doing even worse if we take your example:

Code: (SB) [Select]

func f(x:int):int
do
  f = x + 1;
end

which is translated to:

Code: [Select]

_I107:
push    ebp
mov     ebp, esp
sub     esp, 4
mov     dword [ebp - 4], 0
mov     eax, dword [ebp + 8]
push    eax
mov     eax, 1
pop     edx
add     eax, edx
mov     [ebp - 4], eax
._L0:
mov     eax, dword [ebp - 4]
mov     esp, ebp
pop     ebp
ret

The expression parser doesn't care much what is being computed; it simply follows an initial standard logic by pushing and popping lhs and rhs operands. Obviously, there is a LOT of optimization left to do. But the whole process, from translation to executable, is IMO just magical if you think about it.
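The push/pop pattern described above can be sketched as a tiny code generator in C. This is not SharpBASIC's implementation, only an illustration of the general technique; the `Expr` type and the `emit` and `gen_expr` functions are hypothetical names.

```c
#include <stdio.h>
#include <string.h>

typedef enum { EXPR_NUM, EXPR_ARG, EXPR_ADD } ExprKind;

typedef struct Expr {
    ExprKind kind;
    int value;                 /* constant, or EBP-relative offset for EXPR_ARG */
    const struct Expr *lhs;    /* operands, used by EXPR_ADD only */
    const struct Expr *rhs;
} Expr;

/* Appends one line of assembly text to `out` (capacity `cap`). */
static void emit(char *out, size_t cap, const char *line)
{
    size_t used = strlen(out);
    if (used < cap)
        snprintf(out + used, cap - used, "%s\n", line);
}

/* Generates naive code that leaves the value of `e` in EAX.
   `out` must hold a valid (possibly empty) string on entry. */
void gen_expr(const Expr *e, char *out, size_t cap)
{
    char line[64];

    switch (e->kind) {
    case EXPR_NUM:
        snprintf(line, sizeof line, "mov eax, %d", e->value);
        emit(out, cap, line);
        break;
    case EXPR_ARG:
        snprintf(line, sizeof line, "mov eax, dword [ebp + %d]", e->value);
        emit(out, cap, line);
        break;
    case EXPR_ADD:
        gen_expr(e->lhs, out, cap);     /* lhs -> EAX            */
        emit(out, cap, "push eax");     /* save lhs on the stack */
        gen_expr(e->rhs, out, cap);     /* rhs -> EAX            */
        emit(out, cap, "pop edx");      /* lhs -> EDX            */
        emit(out, cap, "add eax, edx"); /* EAX = rhs + lhs       */
        break;
    }
}
```

Feeding it a tree for `x + 1`, with `x` at `[ebp + 8]`, yields the same five-instruction core as the listing above: load, push, load, pop, add. Every subexpression goes through the stack no matter how simple it is, which is exactly what a later optimization pass would clean up.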
