In principle, I agree with you 100% (or more).
However... it has come to my attention that the guys on news:comp.arch differentiate between the "architecture" and the "hardware". The "architecture" is the instruction set that we have access to, and the "hardware" is the "micro-code" (micro-ops, really) which the "risc core" actually executes (and whose details are, AFAIK, "proprietary" - a "trade secret").
From this viewpoint, "push" is of course a "complex instruction". The "hardware" actually has to execute something roughly like:
sub rsp, 8
mov [rsp], rbp
sub rsp, 8
mov [rsp], rbx
One thing that really jumps out at me in the compiler's code is that the "sub" is done last. An "old timer" would never ever touch anything below sp. An interrupt could occur at any time, using the same sp, and could trash anything we put below sp. However, a modern OS runs its interrupt handlers on a "different stack" (the kernel's), and on 64-bit Linux and friends the ABI goes further and reserves a 128-byte "red zone" below rsp that signal handlers won't touch either, so it is "safe" to use that area. (Don't count on this everywhere: 64-bit Windows has no red zone, and neither does the 32-bit ABI, so leave the area under esp alone.) We can even use esp/rsp as a general purpose register if the need is dire (saving/restoring it somewhere - not on the stack, obviously). In terms of "dependencies", if the "sub" were put first the "mov"s would need to wait for the new value of rsp to be computed before they could even begin. We have multiple "execution units" even on a single core, and "out-of-order execution" to contend with!
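To make the dependency point concrete, here is a rough side-by-side in NASM syntax. It's a sketch, not the compiler's actual output: I'm assuming 64-bit Linux (System V rules, so the 128 bytes below rsp are usable), and the file name and the choice of rbp/rbx are just for illustration. Both versions leave the same two values at the same addresses and rsp in the same place.

```nasm
; frame.asm - illustration only; assumes 64-bit Linux and the System V red zone
; build: nasm -f elf64 frame.asm && ld -o frame frame.o
bits 64
global _start
section .text
_start:
    ; Version A: roughly what "push rbp / push rbx" amounts to.
    ; Each store needs the rsp produced by the "sub" just before it.
    sub rsp, 8
    mov [rsp], rbp
    sub rsp, 8
    mov [rsp], rbx
    add rsp, 16             ; undo, so version B starts from the same rsp

    ; Version B: stores first, one "sub" last.
    ; Both stores address memory off the rsp we already have,
    ; so neither has to wait on the "sub".
    mov [rsp-8], rbp        ; lands where version A put rbp
    mov [rsp-16], rbx       ; lands where version A put rbx
    sub rsp, 16
    add rsp, 16             ; undo again before leaving

    mov eax, 60             ; Linux sys_exit
    xor edi, edi            ; exit status 0
    syscall
```

Whether version B actually runs faster depends on the CPU generation, which is rather the point.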
I had occasion to test this - AMD K5 vs K6 I think, but don't hold me to that. I was accused of "pretending to be surprised" at the result. I wasn't surprised that "push" had become slower than "mov", but I was surprised by how much. As I recall, on the older hardware they were about the same - with a slight advantage to "push". On the newer hardware "push" took more than twice as long!
So as much as it pains me to say it, I think the compiler's code may be better than what you or I would (probably) write. There's a tradeoff, though: the "mov" version takes more bytes than the "push" version. If we can get more "business" done per cache-line, cache, or page, we may be able to avoid the very slow operation of reloading any of these. I don't know where (if ever) the "win" occurs.
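For a feel of the size side of that tradeoff, here are the encodings for saving the same two registers both ways. Again just a sketch: the file name is made up, and a listing file (NASM's -l switch) will show you the same bytes.

```nasm
; sizes.asm - illustration only: byte counts for saving rbp and rbx
; build: nasm -f bin -l sizes.lst sizes.asm
bits 64

; "push" version: 2 bytes total
    push rbp                ; 55                1 byte
    push rbx                ; 53                1 byte

; "mov" version: 14 bytes total
    mov [rsp-8], rbp        ; 48 89 6C 24 F8    5 bytes
    mov [rsp-16], rbx       ; 48 89 5C 24 F0    5 bytes
    sub rsp, 16             ; 48 83 EC 10       4 bytes
```

Seven times the bytes for the same job. In one prologue that's nothing, but over a whole program it is the kind of thing that decides how much fits in a cache line.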
I pretty much gave up on optimizing for speed when I realized that they changed the rules with every generation of CPU. I recall reading (but not where) that "push" has been made fast again on very recent hardware. A compiler, in the hands of a competent author/maintainer, can keep up with these changes - and optimize for older hardware instead, if specified. When optimizing for size, it's easy to "keep score" with a simple "ls -l" or equivalent. My attempts to time code have been... inconclusive. I concentrate on "do everything that needs to be done, and nothing that doesn't need to be done". Works for me.
Don't get me wrong, I "like" asm better than any HLL and will probably never abandon it. But you don't get the advantages from it that you did in the "good old days". One thing I've learned is that change happens. Whether you like it or not (I frequently don't), you'd better learn to cope with it. One way to cope is to ignore any change you don't like, when and if possible.
Hope I didn't spoil your rant!
