NASM - The Netwide Assembler
NASM Forum => Example Code => Topic started by: TightCoderEx on June 17, 2012, 01:34:20 AM
-
This thread is not intended to bash "C" or "C++"; rather, those using higher-level languages often question the logic of low-level programming. This example is functionally equivalent to char *strcpy ( char *Dest, char *Src ).
0: 55 push rbp
1: 48 89 e5 mov rbp,rsp
4: 57 push rdi
5: 56 push rsi
6: 8b 75 18 mov esi,DWORD PTR [Src]
9: 8b 7d 10 mov edi,DWORD PTR [Dest]
c: eb 01 jmp f <L0>
e: aa stos BYTE PTR es:[rdi],al
f: ac lods al,BYTE PTR ds:[rsi]
10: 08 c0 or al,al
12: 75 fa jne e <L0-0x1>
14: 5e pop rsi
15: 5f pop rdi
16: c9 leave
17: 58 pop rax
18: c2 08 00 ret 0x8
= 27 bytes
It would be interesting to see if someone really proficient in "C" or "C++" could rival this function and how they do it.
-
It would be interesting to see if someone really proficient in "C" or "C++" could rival this function and how they do it.
In what respect. Size? Speed? Safety? Portability?
-
Definitely not portability, although once it comes down to APIs even "C" is less portable.
Safety, definitely not, especially the way I code some things, like doing 8- and 16-bit comparisons on memory or a register because I know what the value of the high-order bits will always be.
Size, and as a natural consequence maybe speed. I'm pretty sure there is a combination of coding style, such as using a dereferenced pointer *Dest = 'B' versus an indexed Dest [3] = 'B', and compile options that produces the most efficient code size and/or speed.
-
I read a long time ago somewhere (the Intel manuals?) that stos/lods are relatively slow compared with a simple mov plus inc esi/edi on newer processors. Indeed, debugging a little, I found that my C compiler's output does exactly that.
-
Similarly,
    enter 148, 0
is more weighty (slower) than
    push rbp
    mov rbp, rsp
    sub rsp, 148
and slightly heavier still than the variant using add rsp, -148 in place of the sub.
I have to ask myself, though: what was the rationale for Intel's engineers to design such functionality into the processor? My guess would be that ENTER is 1/3 the size of the conventional method in this example
0: c8 48 14 00 enter 0x1448,0x0
4: 55 push rbp
5: 48 89 e5 mov rbp,rsp
8: 48 81 ec 48 14 00 00 sub rsp,0x1448
and, by that logic, perhaps more efficient speed-wise, at least in this example.
-
I read a long time ago somewhere (the Intel manuals?) that stos/lods are relatively slow compared with a simple mov plus inc esi/edi on newer processors. Indeed, debugging a little, I found that my C compiler's output does exactly that.
The underlying hardware design deviated from the instruction set quite some time ago. Modern x86 is a superscalar architecture. You have to factor in pipelines, microcode, instruction reordering, register renaming, caches, etc.
I am not at all surprised if stos/lods are slower, despite the ability of microcode to even things out. I wouldn't be surprised if compilers, i.e. favoring more generic and RISC-like code, are driving such hardware evolution.
However, while shooting for loop optimization is indeed important, as an assembly language programmer, I'd be more concerned over larger optimization gains that can be had across the entire architecture. Being conscientious of the fact that code and data caches can greatly impact performance seems more relevant. Is something like "rep movsb" the be-all-end-all to data copying? No, but it sure is compact (code cache) and causes predictable (linear and thus easily optimized) data cache access. I am more than content to leave micro optimizations to compilers that care about such wild goose chases ;)
-
I have to ask myself, though: what was the rationale for Intel's engineers to design such functionality into the processor? My guess would be that ENTER is 1/3 the size of the conventional method in this example
...
and, by that logic, perhaps more efficient speed-wise, at least in this example.
At the time that ENTER/LEAVE were conceived, I believe things were measured in Kilobytes and perhaps Megabytes. At the same time, IIRC, ENTER would have taken longer (clock cycles) to perform, technically. Your classic speed-vs-size tradeoff.
Does it still matter now that "clock cycles" are more of an ambiguous and moving target? Well, ENTER has implicit dependencies while the "long" way has explicit dependencies, so I don't see any obvious (simplistic) pipeline optimizations to gain there. However, I would chalk it up akin to my last response, ENTER is another one of those CISC-y instructions that have fallen out of favor in the age of superscalar.
-
Well, this thread and others have steered me onto Agner Fog's material, and I've just read the first 75 of 161 pages of Optimizing Subroutines in Assembly Language. Maybe I'll be able to answer my own question in the near future, but I already have better insight into standards, especially as they apply to calling conventions on the different platforms.
It will be an interesting exercise to see if I can optimise what little I've done already to be more compliant with calling conventions as they apply to 64-bit Linux, while staying size- and speed-sensitive.
-
I really don't know a "vector path" instruction from Adam's off ox, but I'm told that "enter" is slow (compared to discrete instructions) because it's "vector path". Apparently "leave" is "direct path", so is smaller and no slower than discrete instructions. Perhaps if you use that mysterious second parameter to "enter" it catches up again?
I haven't read Agner Fog's optimization guide, but he's got a lot of interesting material. http://www.agner.org - great site!
Best,
Frank
-
I too recommend reading Agner for his excellent discussions on optimization.
-
For AMD64 CPUs, AMD (who'd have guessed) has put out optimization guides:
http://developer.amd.com/Resources/documentation/guides/Pages/default.aspx
They say plain mov's are fastest for memcpy (I read that strcpy is basically copying an array, so memory; but I don't know).
I'd guess Intel has put out some too.