Hi Frank,
Thanks for the fast response! I had completely forgotten about the segment prefixes, since they are so rarely used on modern architectures.
It's true that using branch prediction prefixes can make things worse than the default. This is typically the result of placing prefixes at conditional branches whose outcome is not statistically independent of the outcome of previous iterations.
Let me give an example: consider a loop containing a Jcc that jumps forward, and assume the loop runs for 30 iterations. On the first 20 iterations the branch is taken; on the last 10 it is not.
Older CPUs built on the NetBurst architecture typically employ a static branch prediction algorithm: forward branches are always predicted as not taken and backward branches as taken (which helps with loops). A P4, for example, would therefore produce 20 failed and 10 successful predictions by default. By placing a prefix you can flip this around, producing 20 successful and 10 failed predictions. This halves the number of mispredictions and so cuts the average misprediction cost of the Jcc by about 50%.
On modern CPUs (e.g. the Core architecture) the case is different. Core-based CPUs typically employ a dynamic predictor. This works by caching the previous k outcomes of the previous n conditional branches encountered. When the CPU sees a Jcc that is not yet in this cache, a static prediction is applied. If the Jcc is in the cache, the prediction is based on its previous outcomes (e.g. by majority vote or something more sophisticated). A modern CPU will therefore typically produce fewer than 5 failed predictions for this Jcc. If you insert a prefix, you essentially override the dynamic predictor, again producing 10 failed predictions. This at least doubles the number of mispredictions, increasing the average misprediction cost of the Jcc by over 100%.
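To make the numbers in this example easier to check, here is a small Python sketch that counts mispredictions for the 30-iteration loop. It is only a toy model: the dynamic predictor is a textbook 2-bit saturating counter, which I am assuming purely for illustration and which is not the actual predictor of any Core CPU. All function names are made up for this sketch.

```python
# Toy model of the example loop: a forward Jcc that is taken on the
# first 20 iterations and not taken on the last 10.
# ASSUMPTION: the dynamic predictor is a textbook 2-bit saturating
# counter, not the real (undocumented) predictor of any specific CPU.

def branch_outcomes(taken=20, not_taken=10):
    """Return the outcome sequence of the Jcc (True = taken)."""
    return [True] * taken + [False] * not_taken


def static_mispredictions(outcomes, predict_taken):
    """Count misses when the same prediction is used on every iteration."""
    return sum(1 for taken in outcomes if taken != predict_taken)


def two_bit_mispredictions(outcomes, initial_state=1):
    """Count misses of a 2-bit saturating counter.

    States 0..3; the branch is predicted taken when state >= 2.
    initial_state=1 ("weakly not taken") mimics the static
    forward-not-taken guess for a branch seen for the first time.
    """
    state = initial_state
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        # Move the counter towards the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses


outcomes = branch_outcomes()
print("static, forward not taken:", static_mispredictions(outcomes, False))  # 20
print("static, hinted as taken:  ", static_mispredictions(outcomes, True))   # 10
print("2-bit dynamic predictor:  ", two_bit_mispredictions(outcomes))        # 3
```

In this model the dynamic predictor misses only three times (once while warming up, twice when the behavior flips at iteration 21), which is in line with the "fewer than 5" estimate above, while a fixed "taken" hint is stuck at 10 misses.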
Therefore, if you would like to get optimal performance for this loop on all CPUs, you would need to include two code versions together with code checking the CPU type.
An even better optimization in this case would be to rearrange the code to use a backward Jcc without a prefix (if possible). This way you would get the best performance in all cases by default.
To summarize this in a few guidelines:
1. Avoid branch prediction prefixes if the outcome of the branch is not statistically independent of previous outcomes, unless you include architecture-specific code.
2. Always favor rearranging your code to match the 'natural' prediction behavior over placing prefixes (if this doesn't produce other problems).
3. Never insert unnecessary prefixes just because they look nice (even if the prefix doesn't change the prediction, it is not entirely free because it increases code size and might therefore decrease decoder throughput in certain situations).
4. The most important rule about all kinds of low-level optimization: If you're not absolutely sure about the run-time behavior of your code, don't try to explain it to the CPU. Use a profiler to measure it first.
Best regards and thanks again,
Andreas