Author Topic: Copy string ASM vs C (Read 23778 times)

TightCoderEx · « **on:** June 17, 2012, 01:34:20 AM »

This thread is not intended to bash "C" or "C++"; rather, those using higher-level languages often question the logic of low-level programming. This example is functionally equivalent to char *strcpy ( char *Dest, char *Src )

Code:

   0:	55                   	push   rbp
   1:	48 89 e5             	mov    rbp,rsp
   
   4:	57                   	push   rdi
   5:	56                   	push   rsi
   6:	8b 75 18             	mov    esi,DWORD PTR [Src]
   9:	8b 7d 10             	mov    edi,DWORD PTR [Dest]
   c:	eb 01                	jmp    f <L0>
   
   e:	aa                   	stos   BYTE PTR es:[rdi],al
   f:	ac                   	lods   al,BYTE PTR ds:[rsi]
  10:	08 c0                	or     al,al
  12:	75 fa                	jne    e <L0-0x1>
  
  14:	5e                   	pop    rsi
  15:	5f                   	pop    rdi
  
  16:	c9                   	leave 
   
  17:	58                   	pop    rax
  18:	c2 08 00             	ret    0x8

= 27 bytes

It would be interesting to see if someone really proficient in "C" or "C++" could rival this function and how they do it.
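
For comparison, here is a minimal C sketch of the loop in the listing above. The names `copy_like_listing` and `copy_string` are hypothetical and not from the thread. As far as I can read the listing, the loop exits as soon as `lods` fetches a zero byte, so `stos` never writes the terminator. The second function adds that store, which standard `strcpy` requires, and returns the destination pointer.

```c
#include <stddef.h>

/* Mirrors the listing's loop: load a byte, stop on zero, otherwise store it.
   The terminating zero is NOT written, matching the lods/or/jne/stos order.
   Returns the number of bytes stored (a convenience added for this sketch). */
size_t copy_like_listing(char *dest, const char *src)
{
    size_t n = 0;
    char c;
    while ((c = *src++) != '\0') {   /* lods al ; or al,al ; jne */
        *dest++ = c;                 /* stos */
        n++;
    }
    return n;
}

/* Standard strcpy semantics: the terminator is copied too and the
   original destination pointer is returned. */
char *copy_string(char *dest, const char *src)
{
    char *ret = dest;
    while ((*dest++ = *src++) != '\0') {
        /* empty body: the copy happens in the loop condition */
    }
    return ret;
}
```

Compiling either function with size optimisation enabled and disassembling the result is probably the fairest way to set it against the 27 bytes quoted above.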

Keith Kanios · « **Reply #1 on:** June 17, 2012, 04:17:30 AM »

Quote from: TightCoderEx on June 17, 2012, 01:34:20 AM

It would be interesting to see if someone really proficient in "C" or "C++" could rival this function and how they do it.

In what respect? Size? Speed? Safety? Portability?

TightCoderEx · « **Reply #2 on:** June 17, 2012, 06:41:45 AM »

Definitely not portability, although once it comes down to APIs even "C" is less portable.

Safety, definitely not, especially the way I code some things, like doing 8- and 16-bit comparisons on memory or registers because I know what the value of the high-order bits will always be.

Size, and as a natural consequence maybe speed. I'm pretty sure there is a combination of coding style (such as writing through a pointer, *Dest = 'B', versus indexing, Dest [3] = 'B') and compile options that produces the most efficient code in size and/or speed.
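
The two coding styles mentioned here can be put side by side. This is a small sketch; `fill_indexed` and `fill_pointer` are made-up names. Whether the two compile to different machine code depends on the compiler and its options, so comparing the generated assembly (for example with `gcc -S -Os`) is the only way to know for a given setup.

```c
#include <stddef.h>

/* Index style: Dest[i] = 'B' */
void fill_indexed(char *dest, size_t len, char value)
{
    for (size_t i = 0; i < len; i++)
        dest[i] = value;
}

/* Pointer style: *Dest = 'B', advancing the pointer itself */
void fill_pointer(char *dest, size_t len, char value)
{
    char *end = dest + len;
    while (dest != end)
        *dest++ = value;
}
```

Both functions write the same `len` bytes, so any difference is purely in the code the compiler emits.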

Arq · « **Reply #3 on:** June 18, 2012, 12:07:43 AM »

I read somewhere a long time ago (the Intel manuals?) that stos/lods are relatively slow compared with plain mov plus inc esi/edi on newer processors. Indeed, after a little debugging I found that my C code does that.

TightCoderEx · « **Reply #4 on:** June 18, 2012, 12:48:57 AM »

Similarly

Code:

enter 148, 0

is more weighty than

Code:

        push    rbp
        mov     rbp, rsp
        sub     rsp, 148

and slightly heavier than

Code:

add rsp, -148

I have to ask myself though, what was the rationale for Intel engineers to design such functionality into the processor. My guess would be ENTER is 1/3 the size of the conventional method in this example:

Code:

   0:	c8 48 14 00          	enter  0x1448,0x0
   
   4:	55                   	push   rbp
   5:	48 89 e5             	mov    rbp,rsp
   8:	48 81 ec 48 14 00 00 	sub    rsp,0x1448

and by that logic more efficient speed-wise, at least in this example.
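
The size comparison can be checked by counting the encoded bytes. This sketch simply copies the opcode bytes from the disassembly above into two arrays; `enter_size` and `discrete_size` are hypothetical helper names. It measures size only and says nothing about how fast either form executes.

```c
#include <stddef.h>

/* Encodings copied from the disassembly above. */
static const unsigned char enter_form[] = {
    0xc8, 0x48, 0x14, 0x00                      /* enter 0x1448,0x0 */
};
static const unsigned char discrete_form[] = {
    0x55,                                       /* push rbp         */
    0x48, 0x89, 0xe5,                           /* mov  rbp,rsp     */
    0x48, 0x81, 0xec, 0x48, 0x14, 0x00, 0x00    /* sub  rsp,0x1448  */
};

/* Number of bytes in each prologue. */
size_t enter_size(void)    { return sizeof enter_form; }
size_t discrete_size(void) { return sizeof discrete_form; }
```

That gives 4 bytes against 11, so ENTER is a little over a third of the size of the three discrete instructions in this example.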

Keith Kanios · « **Reply #5 on:** June 18, 2012, 04:42:14 AM »

Quote from: Arq on June 18, 2012, 12:07:43 AM

I read somewhere a long time ago (the Intel manuals?) that stos/lods are relatively slow compared with plain mov plus inc esi/edi on newer processors. Indeed, after a little debugging I found that my C code does that.

The underlying hardware design deviated from the instruction set quite some time ago. Modern x86 is a superscalar architecture. You have to factor in pipelines, microcode, instruction reordering, register renaming, caches, etc.

I am not at all surprised if stos/lods are slower, despite the ability of microcode to even things out. I wouldn't be surprised if compilers, i.e. favoring more generic and RISC-like code, are driving such hardware evolution.

However, while shooting for loop optimization is indeed important, as an assembly language programmer, I'd be more concerned over larger optimization gains that can be had across the entire architecture. Being conscious of the fact that code and data caches can greatly impact performance seems more relevant. Is something like "rep movsb" the be-all-end-all to data copying? No, but it sure is compact (code cache) and causes predictable (linear and thus easily optimized) data cache access. I am more than content to leave micro optimizations to compilers that care about such wild goose chases.
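
One way to get a compact, linear copy from C is to find the length first and hand the whole block to `memcpy`. This is a sketch only: `copy_block` is a hypothetical name, and whether the library's `memcpy` ends up using `rep movsb`, wide moves, or something else is decided by the C library and the CPU, not by this code.

```c
#include <string.h>

/* Copies src, including its terminator, into dest as one block.
   dest must be large enough and must not overlap src. */
char *copy_block(char *dest, const char *src)
{
    size_t len = strlen(src) + 1;   /* +1 to include the terminator */
    memcpy(dest, src, len);         /* one linear pass over both buffers */
    return dest;
}
```

The trade-off is that the source is read twice, once by `strlen` and once by `memcpy`, in exchange for a copy loop the library is free to tune per processor.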

Keith Kanios · « **Reply #6 on:** June 18, 2012, 04:57:56 AM »

Quote from: TightCoderEx on June 18, 2012, 12:48:57 AM

I have to ask myself though, what was the rationale for Intel engineers to design such functionality into the processor. My guess would be ENTER is 1/3 the size of the conventional method in this example

...

and by that logic more efficient speed wise, at least in this example anyway

At the time that ENTER/LEAVE were conceived, I believe things were measured in Kilobytes and perhaps Megabytes. At the same time, IIRC, ENTER would have taken longer (clock cycles) to perform, technically. Your classic speed-vs-size tradeoff.

Does it still matter now that "clock cycles" are more of an ambiguous and moving target? Well, ENTER has implicit dependencies while the "long" way has explicit dependencies, so I don't see any obvious (simplistic) pipeline optimizations to gain there. However, I would chalk it up akin to my last response, ENTER is another one of those CISC-y instructions that have fallen out of favor in the age of superscalar.

TightCoderEx · « **Reply #7 on:** June 18, 2012, 05:27:14 AM »

Well, this thread and others have steered me onto Agner Fog's material, and I've just read the first 75 of 161 pages of Optimising for Assembly. Maybe I'll be able to answer my own question in the near future, but I already have better insight into standards, especially as they apply to calling conventions on the different platforms.

It will be an interesting exercise to see if I can optimise what little I've done already to comply better with the calling conventions for 64-bit Linux while staying sensitive to size and speed.

Frank Kotler · « **Reply #8 on:** June 18, 2012, 11:22:14 AM »

I really don't know a "vector path" instruction from Adam's off ox, but I'm told that "enter" is slow (compared to discrete instructions) because it's "vector path". Apparently "leave" is "direct path", so is smaller and no slower than discrete instructions. Perhaps if you use that mysterious second parameter to "enter" it catches up again?

I haven't read Agner Fog's optimization guide, but he's got a lot of interesting material. http://www.agner.org - great site!

Best,
Frank

Rob Neff · « **Reply #9 on:** June 19, 2012, 12:09:23 AM »

I too recommend reading Agner for excellent discussions on optimizations.

gens · « **Reply #10 on:** October 15, 2012, 09:45:43 PM »

For AMD64 CPUs, AMD (who'd guess) has put out optimization guides:

http://developer.amd.com/Resources/documentation/guides/Pages/default.aspx

They say that for memcpy plain movs are fastest (I read that strcpy is basically copying an array, so the same should apply to it, but I don't know).

I'd guess Intel has put out some too.
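
The plain-moves approach for a memcpy-style copy can be sketched in C as wide loads and stores in a loop. `copy_words` is a hypothetical name and this is not code from AMD's guide. Optimising compilers usually turn each fixed-size `memcpy` below into a single mov, though that is worth confirming in the generated assembly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies n bytes, eight at a time where possible, then a byte tail.
   The buffers must not overlap. */
void *copy_words(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    while (n >= sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, s, sizeof w);    /* fixed-size copy: an alignment-safe load */
        memcpy(d, &w, sizeof w);    /* and the matching store */
        d += sizeof w;
        s += sizeof w;
        n -= sizeof w;
    }
    while (n > 0) {                 /* remaining 0..7 bytes */
        *d++ = *s++;
        n--;
    }
    return dest;
}
```

A strcpy built on this would still need the string length first, since the loop has to know how many bytes to move.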
