Author Topic: Opengl/OpenAL game 100% NASM x86_64 Assembly (Read 166564 times)

Rodrigo Robles · « **on:** June 28, 2023, 03:23:34 AM »

When I first saw x86_64 I was amazed. 16 general-purpose 64-bit registers plus 16 128-bit floating-point registers is much more than a guy raised on the 6502 could imagine.

I thought it would be so easy to code in this Assembly that the effort would be close to writing C code. So some time ago I decided to write a little OpenGL/OpenAL game in 100% Assembly to measure the productivity and prove the viability of writing large programs in x86_64. In recent years I made some little retro games for Android in JavaScript, so I could make a comparison.

I chose to make a revamp of the 1982 classic Attack of the Timelord. Here are the sources: https://gitlab.com/RodrigoRobles/trevaskas-2

[Screenshot of the game]

The graphics are quite simple because it is a retro game, but there is no obstacle to making larger games with fancy graphics in pure x86_64 Assembly.

As I expected, the productivity in hours/FP (for code that is not heavily optimized) was close to JavaScript, which proves that "basic" x86_64 is much easier to write than previous 8-bit or x86 architectures. (Of course, optimized modern multithreaded SIMD code costs much more than ordinary x86_64 code.)

It proves Randall Hyde's point of view:
"Software engineers estimate that developers spend only about thirty percent of their time coding a solution to a problem. Even if it took twice as much time to write a program in assembly versus some HLL, there would only be a fifteen percent difference in the total project completion time. In fact, good assembly language programmers do not need twice as much time to implement something in assembly language."

Being happy with the results, I later wrote a paper on the topic of large x86_64 Assembly programs: https://drive.google.com/uc?id=1_fKS97tb0UzWJ0RqZXpfTA8odrkCK5bE&export=download

I also created an itch.io page: https://rodrigo-robles.itch.io/trevaskas-ii

You can see a video of gameplay here: https://youtu.be/GzBffhLwkR4

Deskman243 · « **Reply #1 on:** June 28, 2023, 11:51:54 AM »

If that library is closed source and based on C how is any of that possible?

fredericopissarra · « **Reply #2 on:** June 28, 2023, 12:48:46 PM »

Just a couple of considerations on the source code (and your "paper")...

1. You don't need to align the stack pointer to DQWORD if you are not using the stack. sub rsp,8 and add rsp,8 as prolog and epilog aren't necessary all the time;

2. If you are loading a 32-bit value into a 64-bit register, use E?? instead of R??. The instruction will be smaller and faster (since there's no REX prefix if registers below R8 are used). For example, instead of xor rax,rax, use xor eax,eax.

3. There's no real gain in using assembly for C-like routines, unless you are prepared to optimize the code in ways GCC can't. For example, using SSE4.2 for string routines. GCC does a better job with integer divisions, for example, than simply using div/idiv (especially with literal divisors). I recommend considering freestanding routines written in C.

4. Using -fno-pie goes against the SysV ABI for x86-64; you should consider using RIP-relative effective addressing in your code.
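
As an illustration of point 3, here is a minimal C sketch of the kind of strength reduction a compiler can apply to division by a literal: an unsigned 32-bit division by 10 replaced by one multiplication and one shift. The function name div10_by_multiply is made up for this example, and the exact code GCC emits depends on its version and flags, so take it as a sketch of the idea rather than GCC's actual output.

```c
#include <stdint.h>

/* Hypothetical helper: computes x / 10 for any 32-bit unsigned x
   without using a div instruction. */
uint32_t div10_by_multiply(uint32_t x)
{
    /* 0xCCCCCCCD is ceil(2^35 / 10); the 64-bit product keeps all
       the bits needed before shifting right by 35. */
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```

In assembly this amounts to a mul followed by a shr, which is typically much cheaper than div.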

Overall the code is very good! Just for fun, I'm trying to optimize it my way and will show it to you here, if there is interest in such a thing...

[]s
Fred

fredericopissarra · « **Reply #3 on:** June 28, 2023, 04:28:32 PM »

Another thing... this:

Code: [Select]

  section .data
  ...
width:  dq 1
  ...
  section .text
  ...
  movq xmm0,[width]
  ...

This will not load 1.0 (double) into XMM0, but the QWORD integer 1 (0x0000000000000001). The correct approach is to convert the integer representation to double, as in:

Code: [Select]

  ; Casting necessary because you can use a dword reference as well...
  cvtsi2sd xmm0,qword [width]

The other way around as well:

Code: [Select]

  ; write the double as an integer
  cvtsd2si rax,xmm0  ; destination MUST be a register.
  mov [width],rax
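
To see why the two loads differ, here is a small C sketch (the helper names are made up for this example): copying the 8 bytes of the integer 1 into a double, which is what movq does, gives a tiny denormal number, while a real conversion, which is what cvtsi2sd does, gives 1.0.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reinterpret the 8 bytes of an integer as a double,
   the C analogue of movq xmm0,[width]. No conversion happens. */
double qword_bits_as_double(int64_t value)
{
    double d;
    memcpy(&d, &value, sizeof d);
    return d;
}

/* Hypothetical helper: convert the integer value to double,
   the C analogue of cvtsi2sd xmm0,qword [width]. */
double qword_to_double(int64_t value)
{
    return (double)value;
}

/* Hypothetical helper: raw bit pattern of a double, for inspection. */
uint64_t double_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}
```

With the value 1, qword_to_double gives 1.0, whose bit pattern is 0x3FF0000000000000, while qword_bits_as_double gives the smallest positive denormal, roughly 4.9e-324.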

And... the default for NASM is 32 bits code, it is recommended you tell the compiler your code is 64 bits and using RIP relative addressing, at the beginning:

Code: [Select]

  bits 64
  default rel

And all effective addresses loaded to registers should be done with LEA, like:

Code: [Select]

  mov eax,1
  mov edi,eax
  lea rsi,[msg]  ; this is a rip relative effective address.
  mov edx,msg_size
  syscall
...
msg: db `Hello\n`
msg_size equ ($ - msg)

Rodrigo Robles · « **Reply #4 on:** June 29, 2023, 09:35:44 PM »

Quote from: Deskman243 on June 28, 2023, 11:51:54 AM

If that library is closed source and based on C how is any of that possible?

Are you talking about OpenGL and OpenAL?
They are not part of the project; the game calls these libraries to render graphics and play sound. Theoretically one could call the Linux audio and video drivers directly, but that would be really uncommon.
By the way, most (or all?) Linux distros use open-source libraries for this (libopengl, libopenal, freeglut).

fredericopissarra · « **Reply #5 on:** June 29, 2023, 11:14:40 PM »

Hi, Rodrigo. A question first: Are you Brazilian? (I am!)

Well... about your code, as I said before, there are some improvements you can make. First, avoid using floating point whatsoever, since your projection is orthogonal. Second, use the lea instruction to load registers with effective addresses, using RIP-relative addressing mode. I believe you used -fno-pie when linking because you got tons of relocation errors with something like this:

Code: [Select]

  mov rsi,label ; lea rsi,[label] is better.
  ...
label: dq 0

Third, you are using OpenGL in compatible mode, something pretty outdated nowadays. And fourth: Yep, you are, in essence, writing a C program, but using assembly (and, most of the time, poorly -- sorry).

I'm not saying the game isn't well structured or is "wrong". No... it is good, but it can be way better.

For example, instead of doing some calculations using R?? registers, you could do them using E??, because the coordinates will never be longer than, let's say, 11 bits (2¹¹-1 = 2047). Keeping the majority of the code 32-bit will make it a lot smaller (and faster).

Trying to obey SysV ABI is another point...

And I didn't understand why you created your own mystrcmp when you are using glibc (linked to your code by GCC). Why are you using double precision, since all OpenGL functions dealing with floating point use single precision (float)? AND, take functions like glBindTexture, which takes 2 int arguments, but your textures use a QWORD (via RSI) instead of a DWORD (ESI), as well as the enumeration, just adding a REX prefix to the instructions!

Another thing is zeroing registers... Instead of xor rdx,rdx or, worse, mov rdx,0 you could use xor edx,edx, a 2-byte instruction... And, with floating point, something like xorps xmm0,xmm0 is faster and smaller than movq xmm0,[DQ_ZERO]. At the same time, instead of using movq or movd, assuming (correctly) that a double is a QWORD (and a float is a DWORD), the use of movss or movsd is clearer and carries no penalty. Ahhh... vzeroall is an AVX/AVX2 instruction, not available in all processors supporting x86-64 mode.

Since you are using SSE2, I think some routines should make better use of vectorization as well.

I partially agree with you about "libraries", but, since the idea is to create a game in assembly, I wouldn't use glut or OpenAL, but Xlib (or XCB) to create the fullscreen (or window) and ALSA for sound, leaving only OpenGL to be used for graphics. Since it doesn't depend on glibc, you could create a more "pure" assembly code this way without having to deal with "drivers" directly.

Deskman243 · « **Reply #6 on:** June 30, 2023, 07:55:11 PM »

I really like to provide a measured response whenever we have a specification for review. I think this presentation certainly has an admirable amount of yields from build. In particular I'd like to reflect on a few curious contestions of these.
You have here a certain amount of references to platforms outside of the standard NASM environment. I was intrigued by how there is even a reference to javascript right beside the build tools however I'm not clear on how this relates. Intriguingly this gives a contrast whereby the down turn section there is also a relation to a weaker tool sets however there's a claim of unmodified versions of the communities' source code. Also the subject is referenced only remotely and could be more relevant if you actually posted these type of figures. It may appear difficult for an ordinary investigation however other than that the actual performance figures are in fact the type of details that conserves the status for a good review.

Good Job and Cheers!

alCoPaUL · « **Reply #7 on:** July 01, 2023, 03:33:44 PM »

<wrong thread, lelz>
should be here https://forum.nasm.us/index.php?topic=3741.0

Rodrigo Robles · « **Reply #8 on:** July 01, 2023, 03:41:52 PM »

Quote from: fredericopissarra on June 28, 2023, 12:48:46 PM

1. You don't need to align the stack pointer to DQWORD if you are not using the stack. sub rsp,8 and add rsp,8 as prolog and epilog aren't necessary all the time;

You're right. It's required only for some SIMD or FPU instructions. I'm doing this in all the functions as a defensive strategy to avoid random errors, but it surely can be removed from some functions. In the paper I pointed out that it is not always necessary.

Quote from: fredericopissarra on June 28, 2023, 12:48:46 PM

2. If you are loading a 32-bit value into a 64-bit register, use E?? instead of R??. The instruction will be smaller and faster (since there's no REX prefix if registers below R8 are used). For example, instead of xor rax,rax, use xor eax,eax.

I was afraid of partial register stalls, but after your feedback I did some research and saw that there really is no penalty for accessing 32-bit registers; it happens only when accessing 8-bit or 16-bit registers. Now that I'm aware of this, I will surely use a lot more 32-bit data and code in my next x86_64 programs.

Quote from: fredericopissarra on June 28, 2023, 12:48:46 PM

3. There's no real gain in using assembly for C-like routines, unless you are prepared to optimize the code in ways GCC can't. For example, using SSE4.2 for string routines. GCC does a better job with integer divisions, for example, than simply using div/idiv (especially with literal divisors). I recommend considering freestanding routines written in C.

Yes, from the performance standpoint there's no gain in writing assembly the way a C compiler would. In this particular program my goal was not to reach maximum optimization, but to test the viability of large 100% Assembly programs in terms of technical difficulty and cost. Anyway, I'm taking your feedback seriously and I will make better use of optimizations in the future.

Quote from: fredericopissarra on June 28, 2023, 12:48:46 PM

4. Using -fno-pie goes against the SysV ABI for x86-64; you should consider using RIP-relative effective addressing in your code.

Thanks for this hint. I was not aware of the advantages of RIP-relative addressing and position-independent executables. Be sure I will use these features in my next projects.

Quote from: fredericopissarra on June 28, 2023, 12:48:46 PM

Overall the code is very good! Just for fun, I'm trying to optimize it my way and will show it to you here, if there is interest in such a thing...

Thank you. And of course I'm interested in your feedback about the optimizations.

Rodrigo Robles · « **Reply #9 on:** July 02, 2023, 03:19:57 PM »

Quote from: fredericopissarra on June 28, 2023, 04:28:32 PM

Another thing... this:

Code: [Select]

  section .data
  ...
width:  dq 1
  ...
  section .text
  ...
  movq xmm0,[width]
  ...

This will not load 1.0 (double) into XMM0, but the QWORD integer 1 (0x0000000000000001). The correct approach is to convert the integer representation to double, as in:

Code: [Select]

  ; Casting necessary because you can use a dword reference as well...
  cvtsi2sd xmm0,qword [width]

The other way around as well:

Code: [Select]

  ; write the double as an integer
  cvtsd2si rax,xmm0  ; destination MUST be a register.
  mov [width],rax

I did not find the code above in this program ("width: dq 1" or "movq xmm0,[width]"). The program has a width variable which is uninitialized, and it is used in some functions as an integer and in other functions as a float.

Quote from: fredericopissarra on June 28, 2023, 04:28:32 PM

And... the default for NASM is 32 bits code, it is recommended you tell the compiler your code is 64 bits and using RIP relative addressing, at the beginning:

Code: [Select]

  bits 64
  default rel

And all effective addresses loaded to registers should be done with LEA, like:

Code: [Select]

  mov eax,1
  mov edi,eax
  lea rsi,[msg]  ; this is a rip relative effective address.
  mov edx,msg_size
  syscall
...
msg: db `Hello\n`
msg_size equ ($ - msg)

Looks like -felf64 already sets NASM to 64-bit mode; anyway, it is a good suggestion to use BITS 64 to ensure the mode independently of the command line used.

Now that I'm aware of the advantages of RIP-relative addressing, I will certainly use it in my next projects.

fredericopissarra · « **Reply #10 on:** July 02, 2023, 07:00:39 PM »

Quote from: Rodrigo Robles on July 02, 2023, 03:19:57 PM

Looks like -felf64 already sets NASM to 64-bit mode; anyway, it is a good suggestion to use BITS 64 to ensure the mode independently of the command line used.

Not quite. bits 64 tells NASM that the code is for x86-64 mode. This is important because INC/DEC instructions, for example, have different opcodes in 32 and 64 bits, while -f elf64 only tells NASM that an ELF x86-64 object file will be created.

Rodrigo Robles · « **Reply #11 on:** July 03, 2023, 02:43:09 AM »

Quote from: fredericopissarra on June 29, 2023, 11:14:40 PM

Hi, Rodrigo. A question first: Are you Brazilian? (I am!)

Yes, I'm also Brazilian.

Quote from: fredericopissarra on June 29, 2023, 11:14:40 PM

Well... about your code, as I said before, there are some improvements you can make. First, avoid using floating point whatsoever, since your projection is orthogonal. Second, use the lea instruction to load registers with effective addresses, using RIP-relative addressing mode. I believe you used -fno-pie when linking because you got tons of relocation errors with something like this:

Code: [Select]

  mov rsi,label ; lea rsi,[label] is better.
  ...
label: dq 0

Now that I'm aware that accessing 32-bit registers generates no penalty, I will probably use them a lot more in the future. The same goes for RIP-relative addressing.

Quote from: fredericopissarra on June 29, 2023, 11:14:40 PM

Third, you are using OpenGL in compatible mode, something pretty outdated nowadays. And fourth: Yep, you are, in essence, writing a C program, but using assembly (and, most of the time, poorly -- sorry).

Compatible mode is really very outdated. This was a cheap architecture choice I made. But I really want to move to a more modern OpenGL in the next project.
This "C accent" was unavoidable; I believe it should diminish as I improve my x86_64 skills.

Quote from: fredericopissarra on June 29, 2023, 11:14:40 PM

I'm not saying the game isn't well structured or is "wrong". No... it is good, but it can be way better.

For example, instead of doing some calculations using R?? registers, you could do them using E??, because the coordinates will never be longer than, let's say, 11 bits (2¹¹-1 = 2047). Keeping the majority of the code 32-bit will make it a lot smaller (and faster).

Ok, I'm already convinced of the advantages of 32-bit code.

Quote from: fredericopissarra on June 29, 2023, 11:14:40 PM

Trying to obey SysV ABI is another point...

And I didn't understand why you created your own mystrcmp when you are using glibc (linked to your code by GCC). Why are you using double precision, since all OpenGL functions dealing with floating point use single precision (float)? AND, take functions like glBindTexture, which takes 2 int arguments, but your textures use a QWORD (via RSI) instead of a DWORD (ESI), as well as the enumeration, just adding a REX prefix to the instructions!

I didn't call a single libc function; it's there as a dependency of OpenGL/OpenAL/glut. I'm using OpenGL/OpenAL because it is almost mandatory to access video and sound, but I can successfully avoid any other libraries.
I was trying to use 64-bit in everything I could. I was afraid of getting some penalties for using 32-bit, but in the end I got penalized for using 64-bit, wasting memory and machine code where 32-bit should be used.

Quote from: fredericopissarra on June 29, 2023, 11:14:40 PM

Another thing is zeroing registers... Instead of xor rdx,rdx or, worse, mov rdx,0 you could use xor edx,edx, a 2-byte instruction... And, with floating point, something like xorps xmm0,xmm0 is faster and smaller than movq xmm0,[DQ_ZERO]. At the same time, instead of using movq or movd, assuming (correctly) that a double is a QWORD (and a float is a DWORD), the use of movss or movsd is clearer and carries no penalty. Ahhh... vzeroall is an AVX/AVX2 instruction, not available in all processors supporting x86-64 mode.

Nice optimization hints, I should pay more attention to this.
About vzeroall, I'm considering AVX2 as a minimum requirement for this program.

Quote from: fredericopissarra on June 29, 2023, 11:14:40 PM

Since you are using SSE2, I think some routines should make better use of vectorization as well.

I avoided this level of optimization on purpose to get a faster delivery. But I would like to use more vectorization in the future.

Quote from: fredericopissarra on June 29, 2023, 11:14:40 PM

I partially agree with you about "libraries", but, since the idea is to create a game in assembly, I wouldn't use glut or OpenAL, but Xlib (or XCB) to create the fullscreen (or window) and ALSA for sound, leaving only OpenGL to be used for graphics. Since it doesn't depend on glibc, you could create a more "pure" assembly code this way without having to deal with "drivers" directly.

Looks like it will not be easy to get rid of libc: according to ldd, both libX11.so and libGL.so depend on it. Anyway, I can at least avoid it in my own code. I like the suggestion of using Xlib and ALSA, since they are more low-level than glut and OpenAL.

I want to thank you for all these comments; it's the most valuable feedback I have received about this program so far.

fredericopissarra · « **Reply #12 on:** July 03, 2023, 12:09:38 PM »

Quote from: Rodrigo Robles on July 03, 2023, 02:43:09 AM

Yes, I'm also Brazilian.

I'm from Vitória-ES!

Quote from: Rodrigo Robles on July 03, 2023, 02:43:09 AM

Compatible mode is really very outdated. This was a cheap architecture choice I made. But I really want to move to a more modern OpenGL in the next project.
This "C accent" was unavoidable; I believe it should diminish as I improve my x86_64 skills.

Since you're using an orthogonal projection, the vertex shader will be very simple and you can ditch those matrix manipulation functions...

Quote from: Rodrigo Robles on July 03, 2023, 02:43:09 AM

I didn't call a single libc function; it's there as a dependency of OpenGL/OpenAL/glut. I'm using OpenGL/OpenAL because it is almost mandatory to access video and sound, but I can successfully avoid any other libraries.
I was trying to use 64-bit in everything I could. I was afraid of getting some penalties for using 32-bit, but in the end I got penalized for using 64-bit, wasting memory and machine code where 32-bit should be used.

Yep, some libraries depend on libc, but your code doesn't need to include a libc dependency. libGL.so, libglut.so, etc. will load their own dependencies by themselves...

Quote from: Rodrigo Robles on July 03, 2023, 02:43:09 AM

Nice optimization hints, I should pay more attention to this.
About vzeroall, I'm considering AVX2 as a minimum requirement for this program.

Thanks. Take note that, even in x86-64 mode, your processor is still a 32-bit one; 64-bit mode is an extension.
If you are considering using AVX2 (or SSE greater than 2; the same goes for FMA, BMI, AVX-512, ...) you should test whether the processor supports it. For example, I deal with some virtual machines which support AVX, but not AVX2. In x86-64 mode the only guarantee you have about SIMD is SSE and SSE2.
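
A minimal sketch of such a runtime check, written in C for brevity and assuming GCC or Clang on an x86 target: __builtin_cpu_supports is a compiler builtin, not something taken from the game's sources, and the wrapper names are made up for this example. A pure assembly version would query the CPUID instruction directly instead.

```c
#include <stdbool.h>

/* Hypothetical wrapper: true when the compiler's runtime check
   reports AVX2 as available on the running machine. */
bool cpu_has_avx2(void)
{
    __builtin_cpu_init();   /* GCC/Clang builtin that fills the feature data */
    return __builtin_cpu_supports("avx2") != 0;
}

/* Hypothetical wrapper: SSE2 is part of the x86-64 baseline,
   so this is expected to be true on any x86-64 processor. */
bool cpu_has_sse2(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") != 0;
}
```

A program could call cpu_has_avx2() once at startup and fall back to an SSE2 path (or exit with a message) when it returns false.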

Quote from: Rodrigo Robles on July 03, 2023, 02:43:09 AM

Looks like it will not be easy to get rid of libc: according to ldd, both libX11.so and libGL.so depend on it. Anyway, I can at least avoid it in my own code. I like the suggestion of using Xlib and ALSA, since they are more low-level than glut and OpenAL.

As said before, yep, those libs could depend on libc, but not your program.

Quote from: Rodrigo Robles on July 03, 2023, 02:43:09 AM

I want to thank you for all these comments; it's the most valuable feedback I have received about this program so far.

The pleasure is all mine!

Rodrigo Robles · « **Reply #13 on:** July 06, 2023, 01:03:40 AM »

Quote from: Deskman243 on June 30, 2023, 07:55:11 PM

I really like to provide a measured response whenever we have a specification for review. I think this presentation certainly has an admirable amount of yields from build. In particular I'd like to reflect on a few curious contestions of these.
You have here a certain amount of references to platforms outside of the standard NASM environment. I was intrigued by how there is even a reference to javascript right beside the build tools however I'm not clear on how this relates. Intriguingly this gives a contrast whereby the down turn section there is also a relation to a weaker tool sets however there's a claim of unmodified versions of the communities' source code. Also the subject is referenced only remotely and could be more relevant if you actually posted these type of figures. It may appear difficult for an ordinary investigation however other than that the actual performance figures are in fact the type of details that conserves the status for a good review.

Good Job and Cheers!

Thanks for the feedback!
