Topic: Branch prediction prefixes

nobody

  • Guest
Branch prediction prefixes
« on: June 25, 2009, 10:27:52 PM »
Hi,

is there currently any way to specify branch prediction prefixes (2E / 3E) for the Jcc instructions in NASM?

Many thanks in advance

nobody

  • Guest
Re: Branch prediction prefixes
« Reply #1 on: June 26, 2009, 04:30:55 AM »
I recall considerable discussion among assembler authors/maintainers, in an attempt to get all assemblers to do it the same way. I don't think we ever reached agreement.

My opinion is that Intel didn't give us names for 'em because they already have names: "cs" and "ds". You can "%define hint_not_taken cs" and "%define hint_taken ds" (2E is the CS override and hints "not taken"; 3E is the DS override and hints "taken"), or whatever names you like.
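
To make the byte-level picture concrete, here is a minimal sketch in Python of how a hinted short Jcc is laid out: one prefix byte, the opcode 70h plus the condition code, then an 8-bit displacement. The helper name `hinted_short_jcc` is made up for illustration and is not part of NASM; the prefix values follow Intel's documented meaning (2E = not taken, 3E = taken).

```python
# Prefix bytes placed in front of a Jcc:
# 2E (CS override) hints "not taken", 3E (DS override) hints "taken".
HINT_NOT_TAKEN = 0x2E
HINT_TAKEN = 0x3E


def hinted_short_jcc(cond, rel8, taken):
    """Encode a hinted short Jcc (hypothetical helper, for illustration only).

    cond  -- 4-bit condition code (e.g. 0x4 for JE/JZ, 0x5 for JNE/JNZ)
    rel8  -- signed 8-bit displacement
    taken -- True to emit the "taken" hint, False for "not taken"
    """
    if not 0 <= cond <= 0xF:
        raise ValueError("condition code must fit in 4 bits")
    if not -128 <= rel8 <= 127:
        raise ValueError("displacement must fit in a signed byte")
    prefix = HINT_TAKEN if taken else HINT_NOT_TAKEN
    # Short Jcc opcodes are 70h..7Fh; the low nibble is the condition code.
    return bytes([prefix, 0x70 | cond, rel8 & 0xFF])


print(hinted_short_jcc(0x4, 5, taken=True).hex())  # 3e7405
```

Assuming the assembler passes the segment prefix through unchanged, these are the same three bytes a "ds"-prefixed JE with a displacement of 5 would produce.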

I don't think NASM ever added anything "built in" for this... unless I've just forgotten... Maybe it isn't a good idea to make it "too easy" - I've heard that using hints "incorrectly" can screw up the CPU's own branch prediction, giving worse performance than default, so they should only be used by people who know what they're doing (sounds like you do).

Best,
Frank

nobody

  • Guest
Re: Branch prediction prefixes
« Reply #2 on: June 29, 2009, 01:05:12 AM »
Hi Frank,

thanks for the fast response! I totally forgot about the segment prefixes since they are so rarely used on modern architectures...

It's true that using branch prediction prefixes can make things worse than the default. This is typically the result of placing prefixes at conditional branches whose outcome is not statistically independent of the outcome of previous iterations.

Let me give an example. Consider a loop containing a Jcc that jumps forward, and assume the loop runs for 30 iterations: the branch is taken on the first 20 iterations and not taken on the last 10.

Older CPUs built on the NetBurst architecture typically employ a static branch prediction algorithm. This always predicts forward branches as not taken and backward branches as taken (which helps with loops). A P4 would therefore produce 20 failed predictions and 10 successful ones by default. By placing a prefix you could flip this around, producing 20 successful predictions and 10 failed ones. That halves the number of mispredictions, so the average misprediction penalty paid per execution of the Jcc drops by about 50%.
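
The arithmetic for the static case can be checked with a few lines of Python. This is only a counting sketch of the example loop, not a model of any real CPU; the function name is made up for illustration.

```python
# Outcomes of the forward Jcc in the 30-iteration example:
# taken on the first 20 iterations, not taken on the last 10.
outcomes = [True] * 20 + [False] * 10


def count_mispredictions(outcomes, predict_taken):
    """Count iterations where a fixed prediction disagrees with the outcome."""
    return sum(1 for taken in outcomes if taken != predict_taken)


# Static default for a forward branch: predict "not taken".
default_misses = count_mispredictions(outcomes, predict_taken=False)
# With a "taken" hint prefix the fixed prediction flips.
hinted_misses = count_mispredictions(outcomes, predict_taken=True)

print(default_misses, hinted_misses)  # 20 10
```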

On modern CPUs (e.g. Core architecture) the case is different. Core-based CPUs typically employ a dynamic predictor. This works by caching the previous k outcomes of the previous n conditional branches encountered. The first time the CPU sees a Jcc that is not yet in the cache, a static prediction is applied. Once the Jcc is in the cache, the prediction is based on its previous outcomes (e.g. by majority vote or something more sophisticated). A modern CPU will therefore typically produce fewer than 5 failed predictions for this Jcc. If you insert a prefix, you essentially override this dynamic predictor, again producing 10 failed predictions. That more than doubles the number of mispredictions, increasing the average misprediction penalty of the Jcc by over 100%.
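
As a rough illustration of the dynamic case, the sketch below runs the same 30 outcomes through a 2-bit saturating counter. That predictor is an assumption chosen for simplicity (real predictors are more sophisticated and not fully documented), and the starting state of "weakly not taken" stands in for the static prediction of a forward branch.

```python
# Same example loop: taken 20 times, then not taken 10 times.
outcomes = [True] * 20 + [False] * 10


def two_bit_counter_misses(outcomes, state=1):
    """Simulate a 2-bit saturating counter predictor.

    States 0..3; states 2 and 3 predict "taken", states 0 and 1 predict
    "not taken". The default start state 1 is "weakly not taken".
    """
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        # Move towards 3 on a taken branch, towards 0 otherwise.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses


print(two_bit_counter_misses(outcomes))  # 3
```

Under these assumptions the counter mispredicts once at the start and twice when the pattern flips at iteration 21, which is in line with the "fewer than 5" estimate.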

Therefore, if you would like to get optimal performance for this loop on all CPUs, you would need to include two code versions together with code checking the CPU type.
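
The selection logic for such a two-version setup could look roughly like the sketch below. The function and variant names are hypothetical, and it assumes the vendor string and family number have already been read via CPUID; Intel NetBurst parts (Pentium 4) report vendor "GenuineIntel" and family 15.

```python
def pick_loop_variant(vendor, family):
    """Choose which version of the loop to run (hypothetical names).

    vendor -- CPUID vendor string, e.g. "GenuineIntel"
    family -- CPUID display family as an integer
    """
    # The vendor check matters: other vendors also use family 15
    # for unrelated designs.
    if vendor == "GenuineIntel" and family == 15:
        return "loop_with_hint_prefix"
    return "loop_without_prefix"


print(pick_loop_variant("GenuineIntel", 15))  # loop_with_hint_prefix
print(pick_loop_variant("GenuineIntel", 6))   # loop_without_prefix
```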

An even better optimization in this case would be to rearrange the code to use a backward Jcc without a prefix (if possible). This way you would get the best performance in all cases by default.

To summarize this in a few guidelines:

1. Avoid branch prediction prefixes if the outcome of the branch depends statistically on previous outcomes, unless you include architecture-specific code.
2. Always favor rearranging your code to match the 'natural' prediction behavior over placing prefixes (if this doesn't produce other problems).
3. Never insert unnecessary prefixes just because they look nice (even if the prefix doesn't change the prediction, it is not entirely free because it increases code size and might therefore decrease decoder throughput in certain situations).
4. The most important rule about all kinds of low-level optimization: If you're not absolutely sure about the run-time behavior of your code, don't try to explain it to the CPU. Use a profiler to measure it first.

Best regards and thanks again,

Andreas

nobody

  • Guest
Re: Branch prediction prefixes
« Reply #3 on: July 02, 2009, 08:47:24 AM »
Does anything besides Intel P4 even support jump hints? I didn't think so from what little scraps I've read on the 'Net, but I'm not sure. (Bonus points for finding docs confirming for or against.)

rugxulo _AT_ gmail _DOT_ com