Robert Finch <robfi680@gmail.com> writes:
Having branches automatically convert into
predicates when they branch forward a short distance <7 instructions.
If-conversion in hardware is a good idea, if done well, because it
involves issues that tend to be unknown to compilers:
* How predictable is the condition? If the condition is very well
predictable, if-conversion is not a good idea, because it turns the
control dependency (which does not cost latency when the prediction
is correct) into a data dependency. Moreover, in this case the
if-conversion increases the resource consumption. Compilers are not
good at predicting the predictability AFAIK.
* Is the condition available before or after the original data
dependencies? And if afterwards, by how many cycles? If it is
afterwards and the branch prediction would be correct, the
if-conversion means that the result of the instruction is available
later, which may reduce IPC.
OTOH, if the branch prediction would
be incorrect, the recovery also depends on when the condition
becomes available,
and the total latency is higher in the case of no
if-conversion. The compiler may do an ok job at predicting whether
a condition is available before or after the original data
dependencies (I don't know a paper that evaluates that), but without
knowing about the prediction accuracy of a specific condition that
does not help much.
So the hardware should take predictability of a condition and the availability of the condition into consideration for if-conversion.
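For concreteness, a minimal C sketch (mine, not from the post) of what if-conversion does to the dependence structure, with a load standing in for a possibly slow condition source:

  /* Branchy form: if the branch on p[i] is predicted correctly, the
     result does not wait for the load of p[i]; the condition is only
     a control dependency. */
  int branchy(const int *p, int i, int x, int y)
  {
      if (p[i] < 0)
          return x + 1;
      return y - 1;
  }

  /* If-converted form (e.g. compiled to a CMOV): the result now
     data-depends on the load of p[i], adding its latency to the
     critical path, but there is nothing to mispredict. */
  int converted(const int *p, int i, int x, int y)
  {
      return (p[i] < 0) ? x + 1 : y - 1;
  }

Whether a compiler actually emits a branch for the first and a conditional move for the second depends on the compiler and the target.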
What about reverse if-conversion in hardware, i.e., converting
predicated instructions and the like (conditional moves, if-then-else instructions and the instructions they control) into branch-predicted
phantom branches and eliminating the data dependency on the condition
from the instruction?
For performance, one might consider reverse if-conversion, because the
same considerations apply; however, there is also a security aspect: programmers have used these instructions instead of branches to
produce constant-time code to avoid timing side channels of code that
deals with secrets; and the discovery of Spectre has shown additional
timing side channels of branches. Because you cannot rule out that a
predicated instruction is there for security reasons, you must not use
reverse if-conversion in hardware.
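To illustrate the kind of code at stake, a hypothetical sketch (mine, not from the post) of a constant-time table lookup; such code touches every entry and relies on the select not being turned back into a predicted branch:

  #include <stdint.h>
  #include <stddef.h>

  /* Reads every table entry and selects the wanted one without a
     branch on secret_idx, so neither the branch predictor state nor
     the execution time depends on the secret index. */
  uint64_t ct_lookup(const uint64_t *table, size_t len, size_t secret_idx)
  {
      uint64_t result = 0;
      for (size_t i = 0; i < len; i++)
          result = (i == secret_idx) ? table[i] : result; /* hoped-for CMOV */
      return result;
  }

Reverse if-conversion of that select into a phantom branch would reintroduce exactly the timing channel the code was written to avoid.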
- anton
On Mon, 21 Apr 2025 6:05:32 +0000, Anton Ertl wrote:
Robert Finch <robfi680@gmail.com> writes:
Having branches automatically convert into
predicates when they branch forward a short distance <7 instructions.
If-conversion in hardware is a good idea, if done well, because it
involves issues that tend to be unknown to compilers:
I had little trouble teaching Brian how to put if-conversion into the
compiler with my PRED instructions, relieving the HW of having to do
anything other than execute PREDicated clauses.
* How predictable is the condition? If the condition is very well
predictable, if-conversion is not a good idea, because it turns the
control dependency (which does not cost latency when the prediction
is correct) into a data dependency. Moreover, in this case the
if-conversion increases the resource consumption. Compilers are not
good at predicting the predictability AFAIK.
Rather than being based on the predictability of the condition, the
choice is based on whether FETCH will pass the join-point before the
condition resolves. On an 8-wide machine this might be "THE next
cycle".
* Is the condition available before or after the original data
dependencies? And if afterwards, by how many cycles? If it is
afterwards and the branch prediction would be correct, the
if-conversion means that the result of the instruction is available
later, which may reduce IPC.
Generally, it only adds latency--if the execution window is not stalled
at either end, this does not harm IPC.
OTOH, if the branch prediction would
be incorrect, the recovery also depends on when the condition
becomes available,
There is no "recovery" from PREDication, just one clause getting
nullified.
and the total latency is higher in the case of no
if-conversion. The compiler may do an ok job at predicting whether
a condition is available before or after the original data
dependencies (I don't know a paper that evaluates that), but without
knowing about the prediction accuracy of a specific condition that
does not help much.
So the hardware should take predictability of a condition and the
availability of the condition into consideration for if-conversion.
My argument is that this is a SW decision (in the compiler) not a
HW decision (other than providing the PREDs).
Since PREDs are not
predicted (unless you think they are predicted BOTH ways) they do
not diminish the performance of the branch predictors.
The compiler chooses PRED because FETCH reaches the join-point prior
to the branch resolving. PRED is almost always faster--and when
it has both then-clause and else-clause, it always saves a branch
instruction (jumping over the else-clause).
mitchalsup@aol.com (MitchAlsup1) writes:
On Mon, 21 Apr 2025 6:05:32 +0000, Anton Ertl wrote:
Robert Finch <robfi680@gmail.com> writes:
Having branches automatically convert into
predicates when they branch forward a short distance <7 instructions.
If-conversion in hardware is a good idea, if done well, because it
involves issues that tend to be unknown to compilers:
I had little trouble teaching Brian how to put if-conversion into the
compiler with my PRED instructions, relieving the HW of having to do
anything other than execute PREDicated clauses.
Compilers certainly can perform if-conversion, and there have been
papers about that for at least 30 years, but compilers do not know
when if-conversion is profitable. E.g., "branchless" (if-converted)
binary search can be faster than a branching one when the searched
array is cached at a close-enough level, but slower when most of the
accesses miss the close caches <https://stackoverflow.com/questions/11360831/about-the-branchless-binary-search>.
mitchalsup@aol.com (MitchAlsup1) writes:
My argument is that this is a SW decision (in the compiler) not a
HW decision (other than providing the PREDs).
That's a position, not an argument. Do you have an argument for your position?
Since PREDs are not
predicted (unless you think they are predicted BOTH ways) they do
not diminish the performance of the branch predictors.
Nor increase it.
But it sounds like you think that the compiler
should choose predication when the condition is not particularly
predictable. How should the compiler know that?
- anton
Anton Ertl wrote:
Compilers certainly can perform if-conversion, and there have been
papers about that for at least 30 years, but compilers do not know
when if-conversion is profitable. E.g., "branchless" (if-converted)
binary search can be faster than a branching one when the searched
array is cached at a close-enough level, but slower when most of the
accesses miss the close caches
<https://stackoverflow.com/questions/11360831/about-the-branchless-binary-search>.
I'm missing something because the first answer by BeeOnRope looks wrong.
Unfortunately they don't show both pieces of code and the asm,
but a branching binary search to me looks just as load data dependent, as
it recalculates middle each iteration and the array index is array[middle].
If branch prediction is always correct, the branching version builds
an instruction queue that is a long chain of scaled indexed loads
where the load index is data dependent on the prior iteration load value.
[*] I want to see the asm because Intel's CMOV always executes the
operand operation, then tosses the result if the predicate is false.
EricP <ThatWouldBeTelling@thevillage.com> writes:
Anton Ertl wrote:
Compilers certainly can perform if-conversion, and there have been
papers about that for at least 30 years, but compilers do not know
when if-conversion is profitable. E.g., "branchless" (if-converted)
binary search can be faster than a branching one when the searched
array is cached at a close-enough level, but slower when most of the
accesses miss the close caches
<https://stackoverflow.com/questions/11360831/about-the-branchless-binary-search>.
I'm missing something because the first answer by BeeOnRope looks wrong.
Unfortunately they don't show both pieces of code and the asm,
but a branching binary search to me looks just as load data dependent, as
it recalculates middle each iteration and the array index is array[middle].
Here's the branchless version:
  while (n > 1) {
    int middle = n/2;
    base += (needle < base[middle]) ? 0 : middle;
    n -= middle;
  }
The branching version would then be:
  while (n > 1) {
    int middle = n/2;
    if (needle >= base[middle])
      base += middle;
    n -= middle;
  }
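For reference, a self-contained guess at the surrounding code (mine, not from the posts; the element type, parameter types, and return convention are assumptions):

  /* Branchless variant: every load address data-depends, via base, on
     the value loaded in the previous iteration. */
  const int *search_branchless(const int *base, int n, int needle)
  {
      while (n > 1) {
          int middle = n/2;
          base += (needle < base[middle]) ? 0 : middle;
          n -= middle;
      }
      return base;
  }

  /* Branching variant: base and n form cheap integer recurrences; the
     load only verifies the branch prediction. */
  const int *search_branchy(const int *base, int n, int needle)
  {
      while (n > 1) {
          int middle = n/2;
          if (needle >= base[middle])
              base += middle;
          n -= middle;
      }
      return base;
  }

Whether the ternary in the first function really becomes a conditional move depends on the compiler and options, which is one reason to look at the generated asm.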
In the branching version we have the following loop recurrences:
1) middle = n/2; n -= middle
2) base += middle;
Recurrence 1 costs a few cycles per iteration, ideally 2. Recurrence
2 costs even less. There is no load in the loop recurrences. The
load is used only for verifying the branch prediction. And in
particular, the load does not data-depend on any previous load.
If branch prediction is always correct, the branching version builds
an instruction queue that is a long chain of scaled indexed loads
where the load index is data dependent on the prior iteration load value.
That's certainly not the case here.
Even if we assume that the branch predictor misses half of the time
(which is realistic), that means that the next load and the load of
all following correctly predicted ifs just have to wait for the load of
the mispredicted branch plus the misprediction penalty. Which means
that on average two loads will happen in parallel, while the
branchless version only performs dependent loads. If the loads cost
more than a branch misprediction (because of cache misses), the
branching version is faster.
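To put rough numbers on that (mine, with assumed latencies): take a 100-cycle miss latency, a 15-cycle misprediction penalty, and 50% prediction accuracy. The branchless chain then costs about 100 cycles per level, since each load must complete before the next address is known. The branchy version pays roughly 100+15 cycles on the mispredicted half of the levels and only the few-cycle integer recurrence on the rest, about 0.5*115 + 0.5*2 = 59 cycles per level, i.e., close to two loads in flight on average.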
One can now improve the performance by using branchless for the first
few levels (which are cached), and adding prefetching and a mixture of branchless and branchy for the rest, but for understanding the
principle the simple variants are best.
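A minimal sketch of just the prefetching part of that idea (mine, not from the post; __builtin_prefetch as in GCC/Clang): both candidate probe addresses for the next level are known before the comparison resolves, so they can be requested early, which breaks the serialization of the loads without giving up the branchless select.

  while (n > 1) {
      int middle = n/2;
      int next = (n - middle)/2;                 /* next iteration's middle */
      __builtin_prefetch(&base[next]);           /* if base stays put */
      __builtin_prefetch(&base[middle + next]);  /* if base advances  */
      base += (needle < base[middle]) ? 0 : middle;
      n -= middle;
  }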
[*] I want to see the asm because Intel's CMOV always executes the
operand operation, then tosses the result if the predicate is false.
That's the usual thing for conditional execution/predication. But
here 0 has no dependencies and middle has only cheap dependencies, the
only expensive part is the load for the condition that turns into a
data dependency in the branchless version.
- anton
I do not see 2 LDDs being performed in parallel unless the execution
width is at least 14-wide. In any event the loop recurrence restricts the
overall retirement to 0.5 LDDs per cycle--it is the recurrence that
feeds the iterations (i.e., retirement).
[*] I want to see the asm because Intel's CMOV always executes the
operand operation, then tosses the result if the predicate is false.
Use a less-stupid ISA
mitchalsup@aol.com (MitchAlsup1) writes:
I do not see 2 LDDs being performed in parallel unless the execution
width is at least 14-wide. In any event the loop recurrence restricts the
overall retirement to 0.5 LDDs per cycle--it is the recurrence that
feeds the iterations (i.e., retirement).
Yes. But with loads that take longer than two cycles (very common in
OoO microarchitectures even for L1 hits), the second load starts
before the first finishes. And in the case where the branchy version
is profitable (when the load latency is longer than the misprediction
penalty due to cache misses), many loads will start before the first
finishes (most of them will be canceled due to misprediction, but even
an average of two useful parallel loads produces a good speedup).
[EricP:]
[*] I want to see the asm because Intel's CMOV always executes the
operand operation, then tosses the result if the predicate is false.
Use a less-stupid ISA
The ISA does not require that. It could just as well be implemented
as waiting for the condition and only then performing the operation.
And with a more sophisticated implementation one could even do that
for operations that are not part of the CMOV instruction, but produce
one of the source operands of the CMOV instruction. However,
apparently such implementations have enough disadvantages (probably in performance) that nobody has gone there AFAIK. AFAIK everyone,
including implementations of different ISAs, implements
CMOV/predication as performing the operation and then conditionally
squashing the result.
- anton
EricP <ThatWouldBeTelling@thevillage.com> writes:
Anton Ertl wrote:
Compilers certainly can perform if-conversion, and there have been
papers about that for at least 30 years, but compilers do not know
when if-conversion is profitable. E.g., "branchless" (if-converted)
binary search can be faster than a branching one when the searched
array is cached at a close-enough level, but slower when most of the
accesses miss the close caches
<https://stackoverflow.com/questions/11360831/about-the-branchless-binary-search>.
I'm missing something because the first answer by BeeOnRope looks wrong.
Unfortunately they don't show both pieces of code and the asm,
but a branching binary search to me looks just as load data dependent, as
it recalculates middle each iteration and the array index is array[middle].
Here's the branchless version:
  while (n > 1) {
    int middle = n/2;
    base += (needle < base[middle]) ? 0 : middle;
    n -= middle;
  }
The branching version would then be:
  while (n > 1) {
    int middle = n/2;
    if (needle >= base[middle])
      base += middle;
    n -= middle;
  }
In the branching version we have the following loop recurrences:
1) middle = n/2; n -= middle
2) base += middle;
Recurrence 1 costs a few cycles per iteration, ideally 2. Recurrence
2 costs even less. There is no load in the loop recurrences. The
load is used only for verifying the branch prediction. And in
particular, the load does not data-depend on any previous load.
If branch prediction is always correct, the branching version builds
an instruction queue that is a long chain of scaled indexed loads
where the load index is data dependent on the prior iteration load value.
That's certainly not the case here.
Even if we assume that the branch predictor misses half of the time
(which is realistic), that means that the next load and the load of
all following correctly predicted ifs just have to wait for the load of
the mispredicted branch plus the misprediction penalty. Which means
that on average two loads will happen in parallel, while the
branchless version only performs dependent loads. If the loads cost
more than a branch misprediction (because of cache misses), the
branching version is faster.
One can now improve the performance by using branchless for the first
few levels (which are cached), and adding prefetching and a mixture of branchless and branchy for the rest, but for understanding the
principle the simple variants are best.
I do not see 2 LDDs being performed in parallel unless the execution
width is at least 14-wide. In any event the loop recurrence restricts the
IIUC you'll get multiple loads in parallel if the loads take a long time because of cache misses. Say each load takes 100 cycles, then there is plenty of time during one load to predict many more iterations of the
loop and hence issue many more loads.
With a branching code, the addresses of those loads depend mostly on the branch predictions, so the branch predictions end up performing a kind
of "value prediction" (where the value that's predicted is the address
of the next lookup).
With predication your load address will conceptually depend on 3 inputs:
the computation of `base + middle`, the computation of `base + 0`, and
the computation of the previous `needle < base[middle]` test to choose between the first two. If the LD of `base[middle]` takes 100 cycles,
that means a delay of 100 cycles before the next LD can be issued.
Of course, nothing prevents a CPU from doing "predicate prediction":
instead of waiting for an answer to `needle < base[middle]`, it could
try and guess whether it will be true or false and thus choose to send
one of the two addresses (or both) to the memory (and later check the prediction and rollback, just like we do with normal branches).
Stefan Monnier <monnier@iro.umontreal.ca> writes:
Of course, nothing prevents a CPU from doing "predicate prediction":
instead of waiting for an answer to `needle < base[middle]`, it could
try and guess whether it will be true or false and thus choose to send
one of the two addresses (or both) to the memory (and later check the
prediction and rollback, just like we do with normal branches).
That would subvert the use of predication for getting constant-time
code for avoiding timing side channels. I.e., not a good idea.
- anton
Anton Ertl wrote:
Stefan Monnier <monnier@iro.umontreal.ca> writes:
Of course, nothing prevents a CPU from doing "predicate prediction":
instead of waiting for an answer to `needle < base[middle]`, it could
try and guess whether it will be true or false and thus choose to send
one of the two addresses (or both) to the memory (and later check the
prediction and rollback, just like we do with normal branches).
That would subvert the use of predication for getting constant-time
code for avoiding timing side channels. I.e., not a good idea.
- anton
This assumes that the predicate's not-executed path is nullified in a constant-time way. That might be OK for simple 1-clock instructions, but
I would want heavier ones, MUL, DIV, all floats, and all LD and ST,
to be at worst nullified to 1-clock each, even better to 0.
OoO nullified (not-executed) instructions may still take 1 clock if all
such instructions that write a register have to copy the old physical
dest register value to the new physical register.
A constant-time guarantee might only be possible if one manually arranges
to unconditionally perform all calculations and then CMOV the result.
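A minimal sketch of that manual arrangement (mine, not EricP's code; the mask trick and types are assumptions about what is meant): both computations always execute, and an arithmetic mask does the final select, so there is nothing left whose nullification time could depend on the secret.

  #include <stdint.h>

  uint64_t ct_compute(uint64_t secret_bit, uint64_t x, uint64_t y)
  {
      uint64_t heavy = x * y;                           /* always executed */
      uint64_t light = x + y;                           /* always executed */
      uint64_t mask  = (uint64_t)0 - (secret_bit & 1);  /* all-ones or zero */
      return (heavy & mask) | (light & ~mask);          /* branch-free select */
  }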