Author Topic: The discordance of 64-bit fastcall conventions  (Read 13596 times)

Offline Rob Neff

  • Forum Moderator
  • Full Member
  • *****
  • Posts: 429
  • Country: us
The discordance of 64-bit fastcall conventions
« on: October 04, 2010, 02:04:06 AM »

Disclaimer:
The following contains "idea code" and as such should not be copy/pasted for use in production systems! :P
It is semi-random yet provides insight into the issues currently being addressed.
Also: Warning - rants ahead!

Imagine, if you will, you are trying to encapsulate 64-bit fastcall calling conventions of Linux and Windows using macros, oh, say, like NASMX.

On Windows - function parameters are passed in using RCX,RDX,R8,R9 for ints/ptrs and xmm0-3 for float types.
On Linux - parameters are passed in RDI, RSI, RDX, RCX, R8, R9 for ints/ptrs and xmm0-7 for float types.

Now, Windows REQUIRES a register storage area ("shadow space") on the stack frame above the function return address, which reserves space for the four registers used as parameters to "spill" into if needed following a function call.  This allows the called function to reference a slot (ie: [rbp+16]) to store(spill) the register into if needed, or to read from it.  When CALLING a function you must provide for this storage area PRIOR to calling (ie: sub rsp,32) after pushing any remaining parameters exceeding the registers allocated, and RSP must be 16-byte aligned at the point of the call.
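
As a rough illustration of the rule above, here is a minimal win64 caller passing five integer arguments. The names caller and callee5 are made up for this sketch, and it stores the fifth argument with mov instead of push so that one sub covers the shadow space, the stack argument and the alignment padding.

```nasm
; Sketch only, for nasm -f win64. caller and callee5 are hypothetical names.
bits 64
section .text
extern callee5
global caller

caller:
    push rbp                    ; RSP is 16-byte aligned after this push
    mov  rbp, rsp
    sub  rsp, 48                ; 32 shadow + 8 for arg 5 + 8 alignment padding
    mov  ecx, 1                 ; arg 1
    mov  edx, 2                 ; arg 2
    mov  r8d, 3                 ; arg 3
    mov  r9d, 4                 ; arg 4
    mov  qword [rsp+32], 5      ; arg 5, directly above the shadow space
    call callee5                ; with a standard prologue the callee finds
                                ; arg 5 at [rbp+48]
    add  rsp, 48
    pop  rbp
    ret
```

The 48 is what makes the arithmetic work out: 40 bytes would hold the shadow space plus one argument, but would leave RSP misaligned at the call.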

One trick is to figure out from within the function the call that has the most parameters and use that count ( ie: sub rsp,count * 8 ).  HOWEVER, having to define the stack frame prologue BEFORE you have encountered all the calls made within the function makes it impossible to use this technique.

proc myproc
;    sub rsp,???  ; myproc can't know this yet: ( func3 has 6 params ( 6 * 8 ) = sub rsp,48 )

    ; must do this instead...arrrg..looks like debug code :(
    invoke func1, qword a, qword b, qword c, qword d
        sub rsp, 32
        ; put args into regs
        call func1
        add rsp, 32
    invoke func2, qword a, qword b, qword c, qword d, qword e
        sub rsp, 40
        ; put args into regs and stack
        call func2
        add rsp, 40
    invoke func3, qword a, qword b, qword c, qword d, qword e, qword f
        sub rsp, 48
        call func3
        add rsp,48

;   add rsp,???   ; would be nice to only have this outer frame!
endproc

Linux has no such caller-provided area.  A function that wants to spill its register parameters does so into its own frame, located below the frame pointer (ie: [rbp-8]) or based from RSP (ie: [rsp+40] if you'd rather make RBP available as a general register).  It doesn't use the Windows register shadow space convention but has its own convention (naturally).

Both systems use different conventions when providing parameters in registers.

Let's use an example:
    int myfunc(char* p, int x, double d, int z, float f);

Assume a standard stack frame prologue:
    push rbp
    mov  rbp, rsp

Windows will store the params as:
    RCX  - p
    RDX  - x
    xmm2 - d
    R9   - z
    [rbp+48] - f

It's easy to set up and define offsets from the frame based on the current arg count:

%assign %$frame_offset 16
%rep %$argcount
    %$argname EQU %$frame_offset ; assign offset
    %assign %$frame_offset 8+%$frame_offset
%endrep

    ; spill first register
    mov [rbp+p], rcx  ; positive offset from rbp
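
Written out by hand for the myfunc prototype above, the Windows spill could look roughly like this. It is only a sketch: the function body is invented for illustration, and a real macro would generate the offsets rather than hard-code them.

```nasm
; Sketch only, for nasm -f win64.
; int myfunc(char* p, int x, double d, int z, float f)
bits 64
section .text
global myfunc

myfunc:
    push  rbp
    mov   rbp, rsp
    mov   [rbp+16], rcx         ; p -> shadow slot 1
    mov   [rbp+24], rdx         ; x -> shadow slot 2
    movsd [rbp+32], xmm2        ; d -> shadow slot 3 (third arg, hence xmm2)
    mov   [rbp+40], r9          ; z -> shadow slot 4
    movss xmm0, [rbp+48]        ; f was placed on the stack by the caller
    mov   eax, [rbp+24]         ; return x, just to read a spilled value back
    pop   rbp
    ret
```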

Linux, correct me if I'm wrong, does this:
    RDI  - p
    RSI  - x
    xmm0 - d
    RDX  - z
    xmm1 - f

Keeping in mind that no args have been pushed to the stack the spill area is defined below the frame pointer:
%assign %$frame_offset 0
%rep %$argcount
    %assign %$frame_offset 8+%$frame_offset
    %$argname EQU %$frame_offset ; assign offset
%endrep

    ; spill first register
    mov [rbp-p], rdi   ; negative offset from rbp
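
The same prototype spilled by hand on Linux could look roughly like the following. Again a sketch with an invented body; the slot order simply follows the 8, 16, 24... offsets the %rep loop above would assign.

```nasm
; Sketch only, for nasm -f elf64.
; int myfunc(char* p, int x, double d, int z, float f)
bits 64
section .text
global myfunc

myfunc:
    push  rbp
    mov   rbp, rsp
    sub   rsp, 48               ; five 8-byte slots, rounded up to 16
    mov   [rbp-8],  rdi         ; p
    mov   [rbp-16], rsi         ; x
    movsd [rbp-24], xmm0        ; d (first float arg)
    mov   [rbp-32], rdx         ; z (third integer arg)
    movss [rbp-40], xmm1        ; f (second float arg)
    mov   eax, [rbp-16]         ; return x, just to read a spilled value back
    mov   rsp, rbp
    pop   rbp
    ret
```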

So, spill areas are different, no big deal.  HOWEVER (you knew it was coming, right?) let's see what the following little prototype deals us:

// btw - don't let me catch you doing this :p
int myfunc( int a, int b, int c, int d, int e, int f, int g, int h );

                  Spill Area
Windows:
    [rbp+72] = h
    [rbp+64] = g
    [rbp+56] = f
    [rbp+48] = e
    R9  = d     ; [rbp+40]
    R8  = c     ; [rbp+32]
    RDX = b     ; [rbp+24]
    RCX = a     ; [rbp+16]

Linux:
    RDI = a     ; [rbp-8]
    RSI = b     ; [rbp-16]
    RDX = c     ; [rbp-24]
    RCX = d     ; [rbp-32]
    R8  = e     ; [rbp-40]
    R9  = f     ; [rbp-48]
    [rbp+16] = g         ; <-- ack!!!
    [rbp+24] = h         ; <-- ack!!!

Yes, that's right, when out of registers THEN the stack frame located ABOVE the frame pointer is used, beginning as if it were a normal cdecl calling convention from param 1 (ie: [rbp+16])!!
Now account for how your users will call your macro:

%ifidni __OUTPUT_FORMAT__,win64
    %define argv(v) rbp+v
%elifidni __OUTPUT_FORMAT__,elf64
    %define argv(v) rbp-v
%endif

    mov rcx, [argv(f)]
    mov rdx, [argv(g)]

Yep, broken.  Best way to fix?  Require your Linux users to not exceed allocated registers? Bleh, but workable.
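
To make the breakage concrete, here is roughly what correct hand-written access looks like on elf64 for the eight-int prototype. It is a sketch with an invented body. Note that a and g sit on opposite sides of the frame pointer, whereas the elf64 argv(v) above always expands to rbp-v.

```nasm
; Sketch only, for nasm -f elf64.
; int myfunc(int a, int b, int c, int d, int e, int f, int g, int h)
bits 64
section .text
global myfunc

myfunc:
    push rbp
    mov  rbp, rsp
    sub  rsp, 48                ; room to spill the six register args
    mov  [rbp-8],  rdi          ; a
    mov  [rbp-16], rsi          ; b
    mov  [rbp-24], rdx          ; c
    mov  [rbp-32], rcx          ; d
    mov  [rbp-40], r8           ; e
    mov  [rbp-48], r9           ; f
    mov  eax, [rbp-8]           ; a: BELOW the frame pointer
    add  eax, [rbp+16]          ; g: ABOVE it, first stack argument
    add  eax, [rbp+24]          ; h: second stack argument
    mov  rsp, rbp
    pop  rbp
    ret
```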

I like Windows' system of maintaining parameter order according to cdecl, but I like Linux's register allocation.
The trials and tribulations of software engineering. What to do, oh what to do...


Offline Bryant Keller

  • Forum Moderator
  • Full Member
  • *****
  • Posts: 360
  • Country: us
    • About Bryant Keller
Re: The discordance of 64-bit fastcall conventions
« Reply #1 on: October 04, 2010, 02:55:16 AM »
I don't program in nasm64 so I can't honestly say I'm going to have a whole lot to add to this thread, however my suggestion would be to do like STRUC/ENDSTRUC and build a "memory template" for the procedure. This is something I do a LOT. I use it to create classes, handle arguments, represent local memory, etc. In fact, just about any application I write that other people don't have to read I'll probably use this type of technique. :p

Code: [Select]
[ABSOLUTE 8]
myproc:
myproc.a resq 1
%define myproc.a_sign -
myproc.b resq 1
%define myproc.b_sign -
myproc.c resq 1
%define myproc.c_sign -
myproc.d resq 1
%define myproc.d_sign -
myproc.e resq 1
%define myproc.e_sign -
myproc.f resq 1
%define myproc.f_sign -
myproc_eoargs equ ($-myproc)
[ABSOLUTE 16]
myproc.g resq 1
%define myproc.g_sign +
myproc.h resq 1
%define myproc.h_sign +

%define stack_sign_ref(_x_) _x_ %+ _sign

%define stack(_x_) rbp %[stack_sign_ref(%{$procname}. %+ _x_)] %{$procname}. %+ _x_

Just like yours, the above is "idea code" and to be taken with a grain of salt. Ideally, this all would more or less be generated by NASMX on the back-end of course. This also assumes that PROC defines a %$procname and that we haven't left its context, so this could cause a problem inside HL constructs, which lose the parent PROC context. You might get around this by constantly redefining a single variable %$CurProc and using that through %define/%undef.

Now just glancing over the code, you might ask why I threw in the 'myproc_eoargs' line when we should kinda KNOW how big that section is. Well, we aren't always going to have the maximum so we can actually use that as our pointer to locate the beginning of where to grow our locals. Locals can then later be created using something like:

Code: [Select]
[ABSOLUTE myproc_eoargs]
myproc.someVar RESD 1
%define myproc.someVar_sign -
myproc.someOtherVar RESD 1
%define myproc.someOtherVar_sign -
myproc_eolocals equ ($-myproc)
...
%ifdef myproc_eolocals
sub rsp, myproc_eolocals
%else
sub rsp, myproc_eoargs
%endif

Nice huh! Do you now see why I used the name stack() for the utility to access arguments earlier? 'Cause now you should be able to access both arguments and locals with stack(someVar) or stack(a) as long as you're in that context. ;) At this point we are mostly talking theory as every bit of this is, as we say in the south (US/GA), "speaking strictly tongue in cheek" but I like where your head is at. The procedure arguments and locals system was something I fought with myself, Keith and I argued over it time and time again. That's why there are actually TWO ways of declaring arguments in NASMX (through ARGD and through PROC), which I know has caused some confusion with some people wondering which should be used, but in the end it was more of a "whatever you feel more comfortable with".

Well, I think I reached my limit on the 64-bit stuff, I tend to just use C on 64-bit platforms. lol Keith was always better at that stuff than I was.

Regards,
Bryant Keller

About Bryant Keller
bkeller@about.me

Offline cm

  • Jr. Member
  • *
  • Posts: 65
Re: The discordance of 64-bit fastcall conventions
« Reply #2 on: October 04, 2010, 02:25:22 PM »
Quote
One trick is to figure out from within the function the call that has the most parameters and use that count ( ie: sub rsp,count * 8 ).  HOWEVER, having to define the stack frame prologue BEFORE you have encountered all the calls made within the function makes it impossible to use this technique.

proc myproc
;    sub rsp,???  ; myproc can't know this yet: ( func3 has 6 params ( 6 * 8 ) = sub rsp,48 )

    ; must do this instead...arrrg..looks like debug code :(
    invoke func1, qword a, qword b, qword c, qword d
        sub rsp, 32
        ; put args into regs
        call func1
        add rsp, 32
    invoke func2, qword a, qword b, qword c, qword d, qword e
        sub rsp, 40
        ; put args into regs and stack
        call func2
        add rsp, 40
    invoke func3, qword a, qword b, qword c, qword d, qword e, qword f
        sub rsp, 48
        call func3
        add rsp,48

;   add rsp,???   ; would be nice to only have this outer frame!
endproc

First, this is entirely an optimization, which should be optional. Users that use this have to be aware of it and must not store things on the stack during the invoke macro call.

Second, you can do this. The only downside is that the initial sub instruction can't be automatically optimized by NASM. The endproc macro has to use the determined necessary size and then define an assembler label with that value, using equ. It isn't a preprocessor-only solution, but it works well enough. I recommend a macro-specific name (prefixed by %%), and manually telling NASM to optimize the sub instruction. (As I understand you, the register spill area is never larger than 32 bytes, so you can assume that the immediate value will fit as a signed byte. (Is "sub rsp, byte 32" a valid instruction that does the right thing?))

BTW, it might not be clear how to use a macro-specific name as name for equ because %% will have a different value in the macro that has the sub (i.e. proc) and the one that contains the equ (i.e. endproc). This can be worked around by specifying that proc's %% will be used, and then using %define to create a context-local smacro that contains the name proc "requests" endproc to use for equ.
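
A rough sketch of the deferred-size idea, for illustration only. The macro names proc2, invoke2 and endproc2 are made up for this example and are not the NASMX macros. It also swaps in a context-local label (%$out_size) for the %%-name handoff described in this post, and it leaves out loading the arguments. The 32-byte minimum and the 16-byte rounding are assumptions taken from the win64 rules.

```nasm
; Sketch only, for nasm -f win64. All macro and function names are hypothetical.
bits 64
section .text
extern func1, func3

%macro proc2 1
    %push proc2_ctx
    %assign %$max_out 0                  ; largest outgoing area seen so far
%1:
    push rbp
    mov  rbp, rsp
    sub  rsp, %$out_size                 ; forward reference, defined by endproc2
%endmacro

%macro invoke2 1-*
    %assign %$need (%0 - 1) * 8          ; one qword slot per argument
    %if %$need < 32
        %assign %$need 32                ; always reserve the four shadow slots
    %endif
    %assign %$need (%$need + 15) & ~15   ; keep RSP 16-byte aligned
    %if %$need > %$max_out
        %assign %$max_out %$need
    %endif
    ; ... load registers and [rsp+32...] from %2 onwards here ...
    call %1
%endmacro

%macro endproc2 0
    %$out_size equ %$max_out             ; the prologue's sub resolves to this
    mov  rsp, rbp
    pop  rbp
    ret
    %pop
%endmacro

; usage: the prologue ends up reserving 48 bytes, the larger of the two calls
proc2 myproc
    invoke2 func1, 1, 2, 3, 4
    invoke2 func3, 1, 2, 3, 4, 5, 6
endproc2
```

Because the size is an assembler symbol and not a preprocessor value, the prologue cannot test it with %if, and depending on the NASM version and optimization level the sub may be encoded with a 32-bit immediate.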

Quote
Code: [Select]
[ABSOLUTE 8]
myproc:
myproc.a resq 1
%define myproc.a_sign -
myproc.b resq 1
%define myproc.b_sign -
myproc.c resq 1
%define myproc.c_sign -
myproc.d resq 1
%define myproc.d_sign -
myproc.e resq 1
%define myproc.e_sign -
myproc.f resq 1
%define myproc.f_sign -
myproc_eoargs equ ($-myproc)
[ABSOLUTE 16]
myproc.g resq 1
%define myproc.g_sign +
myproc.h resq 1
%define myproc.h_sign +

%define stack_sign_ref(_x_) _x_ %+ _sign

%define stack(_x_) rbp %[stack_sign_ref(%{$procname}. %+ _x_)] %{$procname}. %+ _x_

I won't pretend to have understood that (didn't try), but won't it (sometimes?) save you some effort to use negative values as base to ABSOLUTE? This is a known feature of NASM and even documented, though only indirectly. ABSOLUTE's parameter is a (critical) expression too, so I imagine it might be useful for NASMX or other macros.
C. Masloch

Offline Bryant Keller

Re: The discordance of 64-bit fastcall conventions
« Reply #3 on: October 04, 2010, 05:00:20 PM »
Quote
I won't pretend to have understood that (didn't try), but won't it (sometimes?) save you some effort to use negative values as base to ABSOLUTE? This is a known feature of NASM and even documented, though only indirectly. ABSOLUTE's parameter is a (critical) expression too, so I imagine it might be useful for NASMX or other macros.

Not in this case, this code works using the structure of RBP [SIGN] [OFFSET]. OFFSET is a positive value growing from our lowest base up to the variable in question. The second [absolute] acts sorta like a union, resetting the current offset back to 16 to allow for the next values. These values each get a *_sign option which is used to set SIGN so basically what we are generating is RBP - 8 and RBP + 16, note that 8 and 16 are absolute values and this is how we are dealing with them, where - and + sets the sign for our addition to RBP. If you think about it mathematically, we are just doing RBP + (-|8|) and RBP + (+|16|).

I have, however, noticed a little error in the "idea code". I'm kinda surprised nobody has noticed it. I used the procedure name as the start of my memory template, this is horribly wrong as it would create a multiple definition error, it should probably be changed from myproc to myproc_tpl. Heh, I just noticed it when I was looking at cm's repost of my code. That's kinda why I mentioned before I like to stay away from "idea code". ;D


Offline cm

Re: The discordance of 64-bit fastcall conventions
« Reply #4 on: October 04, 2010, 05:55:51 PM »
Quote
OFFSET is a positive value growing from our lowest base up to the variable in question.

Yes, the items would of course need to be "reversed" then, which might be more work. Another point is that you can't use normal labels there, even if you would define a linear structure there - labels must of course be unique, which often is undesirable, so you would instead have to use "%assign label $" every time instead of defining a label the normal way. (This works in absolute's space without using any workaround involving $$.)

Quote
I have, however, noticed a little error in the "idea code". I'm kinda surprised nobody has noticed it. I used the procedure name as the start of my memory template, this is horribly wrong as it would create a multiple definition error, it should probably be changed from myproc to myproc_tpl. Heh, I just noticed it when I was looking at cm's repost of my code. That's kinda why I mentioned before I like to stay away from "idea code". ;D

I think your code doesn't need that label at all, because you define all the labels in this structure as, for example, "myproc.a" - i.e. with an explicit non-local part in front of the label. (Technically it just isn't a local label to NASM, but can be used like one.) This means you don't have to actually define the non-local label "myproc" anywhere at all, or you can define it elsewhere instead of specifically in front of these labels. (Of course, you still would collide with a label ".a" local to myproc then. Which is the reason you have to specify some kind of "name space" for things like these so that they won't collide with anything, whether using specific preprocessor support (like %$, %%) for it or not.)

When posting example or test case code somewhere, I usually run it through NASM first and check whether it does (appear to) work as expected. But that is also mostly how I check ideas I'm unsure about when I write NASM preprocessor, or even Assembler, code so I'm used to this method.


I had the idea that the optimization of the register spill area should probably be included in an existing stack frame, which makes the inability to automatically optimize the sub instruction an actual issue, as you can't just manually optimize the associated instruction any more (or at least, not as well). Another issue is that an unnecessary "sub rsp, 0" instruction might be included in the prologue if the spill area is not required at all. (This happens if invoke isn't used, or is used only in ways where the optimization is of no use.) A possible way to avoid this (mostly) is to make proc by default (or only as option?) not set up the optimization, but give a way to switch it on (or off). Switching it on would have the minor ill effects of producing possibly not optimal code (unnecessary "sub rsp, 0", or less optimal sub because the area didn't need a full 32 bytes) but it would never produce bad code (or abort assembling, for that matter). The non-optimal code could show a warning. (Actually, it might produce bad code, but only if invoke were used and assumed the spill area to be on top of the stack although something had been pushed on the stack. This inherently is a possibility with the whole spill area stack frame optimization though.)
C. Masloch

Offline Rob Neff

Re: The discordance of 64-bit fastcall conventions
« Reply #5 on: October 04, 2010, 10:26:50 PM »

A Proposed Solution
Both Bryant and cm sparked an idea within me (thanks guys) regarding memory layout - simply re-adjust position but include a bias. ;D
Thus, the following shows a rough idea using simplified logic to appropriately handle the fastcall calling convention without resorting to extreme trickery.  It shows a simple function and the subsequent pseudo-implementation of assembler macros in order of execution.  It is heavily commented to explain the proof-of-concept.  It is geared more towards a Linux implementation to show that functions exceeding register allocations can/will be handled properly.  However, the Windows implementation will use the same concept but obviously with a different adjusted frame bias.  Bear in mind that this solution also applies equally to all other calling conventions (cdecl,pascal,stdcall,etc) reducing the overall quantity of code that must be implemented within the NASMX macros themselves.  I think you will find it elegant.

;==============================================================
; Our simple example:
;
; void myfunc(int y)
; {
;    int z;
; }
;
;
;==============================================================
; converting to nasm:
;
[section .code]

; proc myfunc, qword y

    %push CTX_PROC ; give us a context to play in
    %define %$curproc myfunc   ; obtained from %1
    %[%$curproc]:
    %assign %$arg_bias 0
    ; for each param do:
        ; define the function param frame offsets from bias
        %[%$curproc].y equ %[%$arg_bias]
        ; prologue - 1 param * 8 bytes = 8 bytes for Linux spill area
        %assign %$arg_bias 8 + %$arg_bias  ; increase bias
        ; ... we will eventually adjust for exceeding register allocation...
    ; must define outside of context to avert HL construct issues
    %xdefine __NASMX_ARG_BIAS__ %[%$arg_bias]
    ;==================================================

; One major change to existing NASMX syntax required -
; the locals macro is now mandatory due to 64-bit fastcall
;
; locals    ; usage: locals [none | xxx]
            ;    none - allows us to avoid using stackframe if user
            ;           writing a small leaf function or will handle
            ;           function prologue separately
            ;     xxx - number of additional bytes to subtract from
            ;           rsp in prologue definition.  One excellent use
            ;           of this capability is optimizing outer stack
            ;           frame for win64.
            %assign %$locals_bias 0  ; init bias offset

; uses      ; usage: uses rbx, r11
            ;     we must save non-volatile registers prior to adjusting
            ;     stack as this would interfere with win64 fastcall
            ;     Must be called within the framework of locals

; local qword z
        ; for each local variable do:
            ; define the local var as negative frame offset from arg bias
            %assign %$locals_bias %$locals_bias - 8
            %[%$curproc].z equ %$locals_bias

; endlocals
    ; Show requirements of endlocals macro:
    ; the caveat is that this MUST be performed in the endlocals macro
    ; or set up from "locals none" thus requiring AT LEAST one macro
    ; call following the proc macro
    ; define the offset from the frame arg bias
    %xdefine __NASMX_LOCALS_BIAS__ %[%$locals_bias]
    push rbp
    mov  rbp, rsp
    ; the following completes our stack frame setup
    sub  rsp, __NASMX_ARG_BIAS__ - __NASMX_LOCALS_BIAS__ ; subtract negative to realize positive
    ; This completes our knowledge and setup of the proc prologue
    ;============================================================


; also defined in nasmx.inc - both are identical but
; provide backward compatibility to older NASMX style
%define argv(v) rbp - __NASMX_ARG_BIAS__ + v
%define  var(v) rbp - __NASMX_ARG_BIAS__ + v

; examples of how a user would access parameter or local stack vars
mov qword[argv(myfunc.y)], rdi  ; spill register to save area
mov rsi, qword[var(myfunc.z)]  ; read a local var
mov rax, 0  ; return no error!

; endproc
    ; ...restore non-volatile registers...
    ; Finally perform epilogue
    mov rsp, rbp
    pop rbp
    ret ; or ret xxx if callconv requires

    %pop
;===============================================================

The above code assembles properly (albeit using some hand-edited code - examine the listing output for verification) and shows a valid methodology for accessing both parameter and local variables.  The beauty of the scheme is that it handles both Windows and Linux register spills identically, with the kicker being that Linux args that are not placed in registers but on the stack frame can just as easily be accessed using the same argv() macro definition.  The proc macro obviously must account for the appropriate positive bias offset for parameters while the local macro within a prologue context must account for the negative bias offset.
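
As a hand-expanded sketch of how the bias could extend to the eight-int prototype from the first post on Linux: the constants below are what a finished proc macro might emit, under the assumption that stack arguments are assigned their real frame offset plus the bias. ARG_BIAS and myfunc8 are made-up names and the body is invented for illustration.

```nasm
; Sketch only, for nasm -f elf64. Names and layout are assumptions.
; int myfunc8(int a, int b, int c, int d, int e, int f, int g, int h)
bits 64
section .text
global myfunc8

ARG_BIAS   equ 48               ; six register args * 8 bytes of spill area
myfunc8.a  equ 0                ; -> [rbp-48]
myfunc8.b  equ 8                ; -> [rbp-40]
myfunc8.c  equ 16               ; -> [rbp-32]
myfunc8.d  equ 24               ; -> [rbp-24]
myfunc8.e  equ 32               ; -> [rbp-16]
myfunc8.f  equ 40               ; -> [rbp-8]
myfunc8.g  equ ARG_BIAS + 16    ; -> [rbp+16], first stack argument
myfunc8.h  equ ARG_BIAS + 24    ; -> [rbp+24], second stack argument
myfunc8.z  equ -8               ; local -> [rbp-56]

%define argv(v) rbp - ARG_BIAS + v
%define  var(v) rbp - ARG_BIAS + v

myfunc8:
    push rbp
    mov  rbp, rsp
    sub  rsp, 64                ; 48 spill + 8 local, rounded up to 16
    mov  [argv(myfunc8.a)], rdi ; spill a register argument
    mov  [argv(myfunc8.f)], r9
    mov  eax, [argv(myfunc8.g)] ; stack argument through the same macro
    add  eax, [argv(myfunc8.h)]
    mov  [var(myfunc8.z)], rax  ; store to the local
    mov  rsp, rbp
    pop  rbp
    ret
```

With this layout a single argv() definition reaches both sides of the frame pointer, which is the property the elf64 rbp-v version lacked.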

Think of this proposed solution as an RFC where your input has a decisive impact on the future of NASMX.
I'm looking for holes in the theory or comments regarding compatibility or portability.  Thank you.