Forum: >>> Magnum BBS <<<

Re: auto predicating branches

From Anton Ertl@21:1/5 to Robert Finch on Mon Apr 21 06:05:32 2025

Robert Finch <robfi680@gmail.com> writes:

Having branches automatically convert into
predicates when they branch forward a short distance <7 instructions.

If-conversion in hardware is a good idea, if done well, because it
involves issues that tend to be unknown to compilers:

* How predictable is the condition? If the condition is very well
predictable, if-conversion is not a good idea, because it turns the
control dependency (which does not cost latency when the prediction
is correct) into a data dependency. Moreover, in this case the
if-conversion increases the resource consumption. Compilers are not
good at predicting the predictability AFAIK.

* Is the condition available before or after the original data
dependencies? And if afterwards, by how many cycles? If it is
afterwards and the branch prediction would be correct, the
if-conversion means that the result of the instruction is available
later, which may reduce IPC. OTOH, if the branch prediction would
be incorrect, the recovery also depends on when the condition
becomes available, and the total latency is higher in the case of no
if-conversion. The compiler may do an ok job at predicting whether
a condition is available before or after the original data
dependencies (I don't know a paper that evaluates that), but without
knowing about the prediction accuracy of a specific condition that
does not help much.

So the hardware should take predictability of a condition and the
availability of the condition into consideration for if-conversion.

What about reverse if-conversion in hardware, i.e., converting
predicated instructions and the like (conditional moves, if-then-else instructions and the instructions they control) into branch-predicted
phantom branches and eliminating the data dependency on the condition
from the instruction.

For performance, one might consider reverse if-conversion, because the
same considerations apply; however, there is also a security aspect: programmers have used these instructions instead of branches to
produce constant-time code to avoid timing side channels of code that
deals with secrets; and the discovery of Spectre has shown additional
timing side channels of branches. Because you cannot be sure that the predicated instruction is there for security reasons, you must not use
reverse if-conversion in hardware.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Anton Ertl on Mon Apr 21 13:39:25 2025

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

So the hardware should take predictability of a condition and the >availability of the condition into consideration for if-conversion.

Another criterion would be whether one of the potentially
if-converted instructions is on the critical data-dependency path. If
none of them are, additional latency (if any) from if-conversion may
be less of an issue. OTOH, it may also mean that the code is
resource-limited rather than dependency-limited, and that one should
avoid if-conversion in order to avoid additional resource consumption
(and of course prediction accuracy also plays into these
considerations).

This made me think how one could determine whether an instruction is
on the critical path. If an instruction X is committed, and the only instructions later in the reorder buffer are data-flow successors of
X, then X is obviously on the critical path. An instruction Y that
delivers the last input that releases X from the scheduler into the
functional unit is also on the critical path; this is transitive, and
several results may arrive in the same cycle.

Marking X as a critical instruction is relatively easy, although one
might also count how many times of all dynamic instances of the
instruction were critical. Marking the critical data-flow
predecessors is harder, especially across several levels (not to
mention all levels); one way is to do it over time: when an
instruction V delivers a dependency that releases an instruction W
from the scheduler that is already marked as critical, V is marked as
critical as well.

The same instruction may be on the critical path in one execution, but
not in another execution, due to cache hits, or differences in control
flow. The approach outlined above only works correctly if W is always
on the critical path. V should probably only marked as critical if
the number of critical executions of W relative to the overall
executions of W is high.

Showing the critical path through performance counters might help
programmers improve the performance of their programs.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Anton Ertl on Mon Apr 21 17:29:15 2025

On Mon, 21 Apr 2025 6:05:32 +0000, Anton Ertl wrote:

Robert Finch <robfi680@gmail.com> writes:

Having branches automatically convert into
predicates when they branch forward a short distance <7 instructions.

If-conversion in hardware is a good idea, if done well, because it
involves issues that tend to be unknown to compilers:

I had little trouble teaching Brian how to put if-conversion into the
compiler with my PRED instructions. Alleviating HW from having to bother
other than being able to execute PREDicated clauses.

* How predictable is the condition? If the condition is very well
predictable, if-conversion is not a good idea, because it turns the
control dependency (which does not cost latency when the prediction
is correct) into a data dependency. Moreover, in this case the
if-conversion increases the resource consumption. Compilers are not
good at predicting the predictability AFAIK.

Rather than base the choice on the predictability of the condition,
It is based on whether FETCH will pass the join-point before the
condition resolves. On an 8-wide machine this might be "THE next
cycle".

* Is the condition available before or after the original data
dependencies? And if afterwards, by how many cycles? If it is
afterwards and the branch prediction would be correct, the
if-conversion means that the result of the instruction is available
later, which may reduce IPC.

Generally, it only adds latency--if the execution window is not staled
at either end this does not harm IPC.

OTOH, if the branch prediction would
be incorrect, the recovery also depends on when the condition
becomes available,

There is no "recovery" from PREDication, just one clause getting
nullified.

and the total latency is higher in the case of no
if-conversion. The compiler may do an ok job at predicting whether
a condition is available before or after the original data
dependencies (I don't know a paper that evaluates that), but without
knowing about the prediction accuracy of a specific condition that
does not help much.

So the hardware should take predictability of a condition and the availability of the condition into consideration for if-conversion.

My argument is that this is a SW decision (in the compiler) not a
HW decision (other than providing the PREDs). Since PREDs are not
predicted (unless you think they are predicted BOTH ways) they do
not diminish the performance of the branch predictors.

What about reverse if-conversion in hardware, i.e., converting
predicated instructions and the like (conditional moves, if-then-else instructions and the instructions they control) into branch-predicted
phantom branches and eliminating the data dependency on the condition
from the instruction.

The compiler choose PRED because FETCH reaches the join-point prior
to the branch resolving. PRED is almost always faster--and when
it has both then-clause and else-clause, it always saves a branch
instruction (jumping over the else-clause).

For performance, one might consider reverse if-conversion, because the
same considerations apply; however, there is also a security aspect: programmers have used these instructions instead of branches to
produce constant-time code to avoid timing side channels of code that
deals with secrets; and the discovery of Spectre has shown additional
timing side channels of branches. Because you cannot be sure that the predicated instruction is there for security reasons, you must not use reverse if-conversion in hardware.

- anton

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to mitchalsup@aol.com on Tue Apr 22 05:10:10 2025

mitchalsup@aol.com (MitchAlsup1) writes:

On Mon, 21 Apr 2025 6:05:32 +0000, Anton Ertl wrote:

Robert Finch <robfi680@gmail.com> writes:

Having branches automatically convert into
predicates when they branch forward a short distance <7 instructions.

If-conversion in hardware is a good idea, if done well, because it
involves issues that tend to be unknown to compilers:

I had little trouble teaching Brian how to put if-conversion into the >compiler with my PRED instructions. Alleviating HW from having to bother >other than being able to execute PREDicated clauses.

Compilers certainly can perform if-conversion, and there have been
papers about that for at least 30 years, but compilers do not know
when if-conversion is profitable. E.g., "branchless" (if-converted)
binary search can be faster than a branching one when the searched
array is cached at a close-enough level, but slower when most of the
accesses miss the close caches <https://stackoverflow.com/questions/11360831/about-the-branchless-binary-search>.

* How predictable is the condition? If the condition is very well
predictable, if-conversion is not a good idea, because it turns the
control dependency (which does not cost latency when the prediction
is correct) into a data dependency. Moreover, in this case the
if-conversion increases the resource consumption. Compilers are not
good at predicting the predictability AFAIK.

Rather than base the choice on the predictability of the condition,
It is based on whether FETCH will pass the join-point before the
condition resolves. On an 8-wide machine this might be "THE next
cycle".

On a CPU with out-of-order execution, the instruction fetcher often
runs many dozens of instructions ahead of the functional units which
produce the conditions, so your criterion could cover pretty big IFs
And, given that you want the compiler to do it, the compiler would
have to know about that. Ok, what decision will you take in what
case, and why?

* Is the condition available before or after the original data
dependencies? And if afterwards, by how many cycles? If it is
afterwards and the branch prediction would be correct, the
if-conversion means that the result of the instruction is available
later, which may reduce IPC.

Generally, it only adds latency--if the execution window is not staled
at either end this does not harm IPC.

If the additional latency is on the critical path and execution is dependency-limited, this reduces IPC. And yes, this will result in
the buffers (especially the schedulers) filling up and stalling the
front end.

OTOH, if the branch prediction would
be incorrect, the recovery also depends on when the condition
becomes available,

There is no "recovery" from PREDication, just one clause getting
nullified.

I apparently wrote that in a misunderstandable way. Here's another
attempt: When comparing the branching variant to the predicated
(if-converted) variant, if the branching variant would be
mispredicted, it is always at a disadvantage wrt. latency compared to
the predicated variant, because the branching variant restarts from
the instruction fetch when the condition becomes available, while the predicated variant is already fetched and decoded and waits in a
scheduler for the condition.

Note that in the binary-search case linked-to above, that's also the
case, but in the branchy version the benefit comes from the correct
predictions and the lack of data-dependencies between the loads: In
those cases the cache-missing load does not depend on the previous cache-missing load (unlike the branchless version), resulting in an
overall shorter latency.

and the total latency is higher in the case of no
if-conversion. The compiler may do an ok job at predicting whether
a condition is available before or after the original data
dependencies (I don't know a paper that evaluates that), but without
knowing about the prediction accuracy of a specific condition that
does not help much.

So the hardware should take predictability of a condition and the
availability of the condition into consideration for if-conversion.

My argument is that this is a SW decision (in the compiler) not a
HW decision (other than providing the PREDs).

That's a position, not an argument. Do you have an argument for your
position?

Since PREDs are not
predicted (unless you think they are predicted BOTH ways) they do
not diminish the performance of the branch predictors.

Nor increase it. But it sounds like you think that the compiler
should choose predication when the condition is not particularly
predictable. How should the compiler know that?

The compiler choose PRED because FETCH reaches the join-point prior
to the branch resolving. PRED is almost always faster--and when
it has both then-clause and else-clause, it always saves a branch
instruction (jumping over the else-clause).

It appears that you have an underpowered front end in mind. Even in
the wide machines of today and without predication, the front end
normally does not have a problem fetching at least as many
instructions as the rest of the machine can handle, as long as the
predictions are correct. Even if the instruction fetcher cannot fetch
the full width in one cycle due to having more taken branches than the instruction fetcher can handle or some other hiccup, this is usually
made up by delivering more instructions in other cycles than the rest
of the CPU can handle. E.g., Skymont <https://old.chipsandcheese.com/2024/06/15/intel-details-skymont/>
fetches from three sequential streams, 3*32bytes/cycle, and decodes
into 3*3 uops/cycle and stores them in 3 32-entry uop queues; the
renamer consumes 8 instructions from these queues, so as long as the predictions are correct and the average fetching and decoding is >8
uops/cycle, the renamer will rarely see fewer than 8 uops available,
even if there is the occasional cycle where the taken branches are so
dense that the instruction fetcher cannot deliver enough relevant
bytes to the instruction decoder.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to Anton Ertl on Tue Apr 22 11:23:57 2025

Anton Ertl wrote:

mitchalsup@aol.com (MitchAlsup1) writes:

On Mon, 21 Apr 2025 6:05:32 +0000, Anton Ertl wrote:

Robert Finch <robfi680@gmail.com> writes:

Having branches automatically convert into
predicates when they branch forward a short distance <7 instructions.

If-conversion in hardware is a good idea, if done well, because it
involves issues that tend to be unknown to compilers:

I had little trouble teaching Brian how to put if-conversion into the
compiler with my PRED instructions. Alleviating HW from having to bother
other than being able to execute PREDicated clauses.

Compilers certainly can perform if-conversion, and there have been
papers about that for at least 30 years, but compilers do not know
when if-conversion is profitable. E.g., "branchless" (if-converted)
binary search can be faster than a branching one when the searched
array is cached at a close-enough level, but slower when most of the
accesses miss the close caches <https://stackoverflow.com/questions/11360831/about-the-branchless-binary-search>.

I'm missing something because the first answer by BeeOnRope looks wrong. Unfortunately they don't show both pieces of code and the asm,
but a branching binary search to me looks just as load data dependent as it recalculates middle each iteration and the array index is array[middle].

Assuming the branching version is: https://en.wikipedia.org/wiki/Binary_search#Procedure

If branch prediction is always correct, the branching version builds
an instruction queue that is a long chain of scaled indexed loads
where the load index is data dependent on the prior iteration load value.
As each load finishes the next index is calculated and next load started,
and the chain collapses serially.

But that is exactly what the branchless version also does.

The difference is that the branching version sometimes DOES mispredict,
so purges and rebuilds the IQ for each mispredict. But what it purges
for the current iteration is just an INC or DEC instruction,
either L = m+1 or R = m-1, PLUS everything prefetched in IQ that followed.

Assuming both versions have the same number of memory accesses[*],
it looks to me that the branchless version should always win or tie.

[*] I want to see the asm because Intel's CMOV always executes the
operand operation, then tosses the result if the predicate is false.
That means that if it does CMOVGT rax,[middle] it always does the
[middle] memory load and then conditionally tosses the value.
It may be that the branchless version is actually doing more memory
accesses which is why it is slower.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Anton Ertl on Tue Apr 22 16:45:20 2025

On Tue, 22 Apr 2025 5:10:10 +0000, Anton Ertl wrote:

mitchalsup@aol.com (MitchAlsup1) writes:

--------------

My argument is that this is a SW decision (in the compiler) not a
HW decision (other than providing the PREDs).

That's a position, not an argument. Do you have an argument for your position?

Architecture defines the primitives, Hardware provides the primitives,
SW decides to use them (or not) on a case-by-case basis.

Since PREDs are not
predicted (unless you think they are predicted BOTH ways) they do
not diminish the performance of the branch predictors.

Nor increase it.

When there is an else-clause, the PRED uses 1 fewer instruction:
the branch that exits the then-clause and jumps over the else-
clause does not exist.

But it sounds like you think that the compiler
should choose predication when the condition is not particularly
predictable. How should the compiler know that?

Compiler is setup to use PRED when the then-clause is 8 or fewer
instructions and when the else clause is 8 or fewer instructions,
otherwise it uses branches.
----------

- anton

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to EricP on Tue Apr 22 17:31:03 2025

EricP <ThatWouldBeTelling@thevillage.com> writes:

Anton Ertl wrote:

Compilers certainly can perform if-conversion, and there have been
papers about that for at least 30 years, but compilers do not know
when if-conversion is profitable. E.g., "branchless" (if-converted)
binary search can be faster than a branching one when the searched
array is cached at a close-enough level, but slower when most of the
accesses miss the close caches
<https://stackoverflow.com/questions/11360831/about-the-branchless-binary-search>.

I'm missing something because the first answer by BeeOnRope looks wrong. >Unfortunately they don't show both pieces of code and the asm,
but a branching binary search to me looks just as load data dependent as it >recalculates middle each iteration and the array index is array[middle].

Here's the branchless version:

while (n > 1) {
int middle = n/2;
base += (needle < *base[middle]) ? 0 : middle;
n -= middle;
}

The branching version would then be:

while (n > 1) {
int middle = n/2;
if (needle >= *base[middle])
base += middle;
n -= middle;
}

In the branching version we have the following loop recurrences:

1) middle = n/2; n -= middle
2) base += middle;

Recurrence 1 costs a few cycles per iteration, ideally 2. Recurrence
2 costs even less. There is no load in the loop recurrences. The
load is used only for verifying the branch prediction. And in
particular, the load does not data-depend on any previous load.

If branch prediction is always correct, the branching version builds
an instruction queue that is a long chain of scaled indexed loads
where the load index is data dependent on the prior iteration load value.

That's certainly not the case here.

Even if we assume that the branch predictor misses half of the time
(which is realistic), that means that the next load and the load of
all followig correctly predicted ifs just have to wait for the load of
the mispredicted branch plus the misprediction penalty. Which means
that on average two loads will happen in parallel, while the
branchless version only performs dependent loads. If the loads cost
more than a branch misprediction (because of cache misses), the
branching version is faster.

One can now improve the performance by using branchless for the first
few levels (which are cached), and adding prefetching and a mixture of branchless and branchy for the rest, but for understanding the
principle the simple variants are best.

[*] I want to see the asm because Intel's CMOV always executes the
operand operation, then tosses the result if the predicate is false.

That's the usual thing for conditional execution/predication. But
here 0 has no dependencies and middle has only cheap dependencies, the
only expensive part is the load for the condition that turns into a
data dependency in the branchless version.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Anton Ertl on Tue Apr 22 22:32:40 2025

On Tue, 22 Apr 2025 17:31:03 +0000, Anton Ertl wrote:

EricP <ThatWouldBeTelling@thevillage.com> writes:

Anton Ertl wrote:

Compilers certainly can perform if-conversion, and there have been
papers about that for at least 30 years, but compilers do not know
when if-conversion is profitable. E.g., "branchless" (if-converted)
binary search can be faster than a branching one when the searched
array is cached at a close-enough level, but slower when most of the
accesses miss the close caches
<https://stackoverflow.com/questions/11360831/about-the-branchless-binary-search>.

I'm missing something because the first answer by BeeOnRope looks wrong. >>Unfortunately they don't show both pieces of code and the asm,
but a branching binary search to me looks just as load data dependent as
it
recalculates middle each iteration and the array index is array[middle].

Here's the branchless version:

while (n > 1) {
int middle = n/2;
base += (needle < *base[middle]) ? 0 : middle;
n -= middle;
}

loop:
SRA Rn,Rn,#1 // Rn in
LDD Rl,[Rb,Rm<<3] // Rb in
CMP R7,Rn,Rl
CMOVLT R7,#0,Rm
ADD Rb,Rb,-Rm // Rb out
ADD Rn,Rn,-Rm // Rn out
BR loop

The branching version would then be:

while (n > 1) {
int middle = n/2;
if (needle >= *base[middle])
base += middle;
n -= middle;
}

loop:
SRA Rn,Rn,#1 // Rn in
LDD Rl,[Rb,Rm<<3] // Rb in
CMP R7,Rn,Rl
PGE R7,T
ADD Rb,Rb,-Rm // Rb conditionally out
ADD Rn,Rn,-Rm // Rn out
BR loop

The ADD or Rn can be freely positioned in the loop, but the recurrence
remains at 2 cycles.

A 7-wide (or wider) machine will be able to insert a whole iteration
of the loop per cycle. The loop has a 2 cycle recurrence, so, the
execution window will fill up rapidly and retire at 1 loop per 2 cycles;
being recurrence-bound. This agrees with the second paragraph below.

In the branching version we have the following loop recurrences:

1) middle = n/2; n -= middle
2) base += middle;

Recurrence 1 costs a few cycles per iteration, ideally 2. Recurrence
2 costs even less. There is no load in the loop recurrences. The
load is used only for verifying the branch prediction. And in
particular, the load does not data-depend on any previous load.

A shifter with a latency of 2 would really hurt this loop.

If branch prediction is always correct, the branching version builds
an instruction queue that is a long chain of scaled indexed loads
where the load index is data dependent on the prior iteration load value.

That's certainly not the case here.

Even if we assume that the branch predictor misses half of the time
(which is realistic), that means that the next load and the load of
all followig correctly predicted ifs just have to wait for the load of
the mispredicted branch plus the misprediction penalty. Which means
that on average two loads will happen in parallel, while the
branchless version only performs dependent loads. If the loads cost
more than a branch misprediction (because of cache misses), the
branching version is faster.

I do not see 2 LDDs being performed parallel unless the execution
width is at least 14-wide. In any event loop recurrence restricts the
overall retirement to 0.5 LDDs per cycle--it is the recurrence that
feeds the iterations (i.e., retirement). The water hose is fundamentally restricted by the recurrence.

One can now improve the performance by using branchless for the first
few levels (which are cached), and adding prefetching and a mixture of branchless and branchy for the rest, but for understanding the
principle the simple variants are best.

Predicating has added latency on the delivery of Rb's recurrence in that
the consumer has to be ready for {didn't happen and now you are free"
along with "here is the data you are looking for". Complicating the
instruction queue "a little".

[*] I want to see the asm because Intel's CMOV always executes the
operand operation, then tosses the result if the predicate is false.

Use a less-stupid ISA

I took a look at extracting the < or >= cases and using it as a mask
(0, middle) but this takes 1 more instruction and does nothing about
the fundamental loop limiter.

That's the usual thing for conditional execution/predication. But
here 0 has no dependencies and middle has only cheap dependencies, the
only expensive part is the load for the condition that turns into a
data dependency in the branchless version.

- anton

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefan Monnier@21:1/5 to All on Tue Apr 22 22:59:10 2025

I do not see 2 LDDs being performed parallel unless the execution
width is at least 14-wide. In any event loop recurrence restricts the

IIUC you'll get multiple loads in parallel if the loads take a long time because of cache misses. Say each load takes 100 cycles, then there is
plenty of time during one load to predict many more iterations of the
loop and hence issue many more loads.

With a branching code, the addresses of those loads depend mostly on the
branch predictions, so the branch predictions end up performing a kind
of "value prediction" (where the value that's predicted is the address
of the next lookup).

With predication your load address will conceptually depend on 3 inputs:
the computation of `base + middle`, the computation of `base + 0`, and
the computation of the previous `needle < *base[middle]` test to choose
between the first two. If the LD of `*base[middle]` takes 100 cycle,
that means a delay of 100 cycles before the next LD can be issued.

Of course, nothing prevents a CPU from doing "predicate prediction":
instead of waiting for an answer to `needle < *base[middle]`, it could
try and guess whether it will be true or false and thus choose to send
one of the two addresses (or both) to the memory (and later check the prediction and rollback, just like we do with normal branches).

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to mitchalsup@aol.com on Wed Apr 23 17:44:56 2025

mitchalsup@aol.com (MitchAlsup1) writes:

I do not see 2 LDDs being performed parallel unless the execution
width is at least 14-wide. In any event loop recurrence restricts the
overall retirement to 0.5 LDDs per cycle--it is the recurrence that
feeds the iterations (i.e., retirement).

Yes. But with loads that take longer than two cycles (very common in
OoO microarchitectures even for L1 hits), the second load starts
before the first finishes. And in the case where the branchy version
is profitable (when the load latency longer than the misprediction
penalty due to cache misses), many loads will start before the first
finishes (most of them will be canceled due to misprediction, but even
an average of two useful parallel loads produces a good speedup).

[EricP:]

[*] I want to see the asm because Intel's CMOV always executes the >>>operand operation, then tosses the result if the predicate is false.

Use a less-stupid ISA

The ISA does not require that. It could just as well be implemented
as waiting for the condition, and only then perform the operation.
And with a more sophisticated implementation one could even do that
for operations that are not part of the CMOV instruction, but produce
one of the source operands of the CMOV instruction. However,
apparently such implementations have enough disadvantages (probably in performance) that nobody has gone there AFAIK. AFAIK everyone,
including implementations of different ISAs implements
CMOV/predication as performing the operation and then conditionally
squashing the result.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Stefan Monnier on Wed Apr 23 18:09:28 2025

Stefan Monnier <monnier@iro.umontreal.ca> writes:

Of course, nothing prevents a CPU from doing "predicate prediction":
instead of waiting for an answer to `needle < *base[middle]`, it could
try and guess whether it will be true or false and thus choose to send
one of the two addresses (or both) to the memory (and later check the >prediction and rollback, just like we do with normal branches).

That would subvert the use of predication for getting constant-time
code for avoiding timing side channels. I.e., not a good idea.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Anton Ertl on Wed Apr 23 21:34:51 2025

On Wed, 23 Apr 2025 17:44:56 +0000, Anton Ertl wrote:

mitchalsup@aol.com (MitchAlsup1) writes:

I do not see 2 LDDs being performed parallel unless the execution
width is at least 14-wide. In any event loop recurrence restricts the >>overall retirement to 0.5 LDDs per cycle--it is the recurrence that
feeds the iterations (i.e., retirement).

Yes. But with loads that take longer than two cycles (very common in
OoO microarchitectures even for L1 hits), the second load starts
before the first finishes. And in the case where the branchy version
is profitable (when the load latency longer than the misprediction
penalty due to cache misses), many loads will start before the first
finishes (most of them will be canceled due to misprediction, but even
an average of two useful parallel loads produces a good speedup).

We still have a 2 cycle loop recurrence, so even if we could perform
1000
LDs per cycle, we are fundamentally SRA and SUB bound around the loop
from iteration to iteration.

[EricP:]

[*] I want to see the asm because Intel's CMOV always executes the >>>>operand operation, then tosses the result if the predicate is false.

Use a less-stupid ISA

The ISA does not require that. It could just as well be implemented
as waiting for the condition, and only then perform the operation.
And with a more sophisticated implementation one could even do that
for operations that are not part of the CMOV instruction, but produce
one of the source operands of the CMOV instruction. However,
apparently such implementations have enough disadvantages (probably in performance) that nobody has gone there AFAIK. AFAIK everyone,
including implementations of different ISAs implements
CMOV/predication as performing the operation and then conditionally
squashing the result.

That is difficult with renaming. In order for the later instructions
to wait on the CMOV renamed result register or the earlier predicted
value, each entry in each station has to be able to wait on one or
the other. In general, it is far easier to make CMOV be able to deliver
either result so nothing downstream of CMOV has any added complexity.

- anton

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to Anton Ertl on Thu Apr 24 08:54:05 2025

Anton Ertl wrote:

EricP <ThatWouldBeTelling@thevillage.com> writes:

Anton Ertl wrote:

Compilers certainly can perform if-conversion, and there have been
papers about that for at least 30 years, but compilers do not know
when if-conversion is profitable. E.g., "branchless" (if-converted)
binary search can be faster than a branching one when the searched
array is cached at a close-enough level, but slower when most of the
accesses miss the close caches
<https://stackoverflow.com/questions/11360831/about-the-branchless-binary-search>.

I'm missing something because the first answer by BeeOnRope looks wrong.
Unfortunately they don't show both pieces of code and the asm,
but a branching binary search to me looks just as load data dependent as it >> recalculates middle each iteration and the array index is array[middle].

Here's the branchless version:

while (n > 1) {
int middle = n/2;
base += (needle < *base[middle]) ? 0 : middle;
n -= middle;
}

The branching version would then be:

while (n > 1) {
int middle = n/2;
if (needle >= *base[middle])
base += middle;
n -= middle;
}

In the branching version we have the following loop recurrences:

1) middle = n/2; n -= middle
2) base += middle;

Recurrence 1 costs a few cycles per iteration, ideally 2. Recurrence
2 costs even less. There is no load in the loop recurrences. The
load is used only for verifying the branch prediction. And in
particular, the load does not data-depend on any previous load.

Yes, I missed that it separates the array index calculation off into
its own compute bound dependency chain that can be calculated ASAP.
That allows it to hit D-cache with multiple speculated loads.

But it is the pipelined cache that allows that to be an advantage.
See below.

If branch prediction is always correct, the branching version builds
an instruction queue that is a long chain of scaled indexed loads
where the load index is data dependent on the prior iteration load value.

That's certainly not the case here.

Even if we assume that the branch predictor misses half of the time
(which is realistic), that means that the next load and the load of
all followig correctly predicted ifs just have to wait for the load of
the mispredicted branch plus the misprediction penalty. Which means
that on average two loads will happen in parallel, while the
branchless version only performs dependent loads. If the loads cost
more than a branch misprediction (because of cache misses), the
branching version is faster.

Yes it is that the cache is pipelined and/or multi-ported that allows it
to take advantage of this. Pipelining allows the code to submit many more
loads and toss away half the values in less time than the serial branchless version could have. Plus having many miss buffers allows a similar advantage.

For example, assuming a cache latency of 3 and throughput of 1,
6 serial loads takes 18 clocks whereas a pipelined cache can do
15 loads in the same time, or fewer loads in a shorter time.
If it has multiple miss buffers then similarly for cache misses.

If the cache is not pipelined, say a latency of 1 throughput of 1
like standard design, or too few (< 4?) miss buffers, then the branchless version should win or tie because it does less loads.

One can now improve the performance by using branchless for the first
few levels (which are cached), and adding prefetching and a mixture of branchless and branchy for the rest, but for understanding the
principle the simple variants are best.

I was wondering whether adding explicit prefetching to the branchless
version would help. It probably would help with cache misses but not for
hits because it still would not make use of the cache pipeline.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to Stefan Monnier on Thu Apr 24 09:47:18 2025

Stefan Monnier wrote:

I do not see 2 LDDs being performed parallel unless the execution
width is at least 14-wide. In any event loop recurrence restricts the

IIUC you'll get multiple loads in parallel if the loads take a long time because of cache misses. Say each load takes 100 cycles, then there is plenty of time during one load to predict many more iterations of the
loop and hence issue many more loads.

With a branching code, the addresses of those loads depend mostly on the branch predictions, so the branch predictions end up performing a kind
of "value prediction" (where the value that's predicted is the address
of the next lookup).

With predication your load address will conceptually depend on 3 inputs:
the computation of `base + middle`, the computation of `base + 0`, and
the computation of the previous `needle < *base[middle]` test to choose between the first two. If the LD of `*base[middle]` takes 100 cycle,
that means a delay of 100 cycles before the next LD can be issued.

Of course, nothing prevents a CPU from doing "predicate prediction":
instead of waiting for an answer to `needle < *base[middle]`, it could
try and guess whether it will be true or false and thus choose to send
one of the two addresses (or both) to the memory (and later check the prediction and rollback, just like we do with normal branches).

There was a flurry of predication research circa 2000 because of Itanium.
One I saw suggested moving the prediction from BRcc to CMP op so it works
for both branch and predicates. Their rational was that if-conversions
affect the branch predictor stats by removing samples from the test sets.

So yes it becomes a value prediction, but its a binary value
and should be the same binary value as before.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to Anton Ertl on Thu Apr 24 10:10:50 2025

Anton Ertl wrote:

Stefan Monnier <monnier@iro.umontreal.ca> writes:

Of course, nothing prevents a CPU from doing "predicate prediction":
instead of waiting for an answer to `needle < *base[middle]`, it could
try and guess whether it will be true or false and thus choose to send
one of the two addresses (or both) to the memory (and later check the
prediction and rollback, just like we do with normal branches).

That would subvert the use of predication for getting constant-time
code for avoiding timing side channels. I.e., not a good idea.

- anton

This assumes that the predicate's not-executed path is nullified in a
constant time way. That might be ok for simple 1-clock instructions but
I would want heavier ones, MUL, DIV, all floats, and all LD and ST,
to be at worst nullified to 1-clock each, even better to 0.

OoO nullified (not-executed) instructions may still take 1 clock if all
such instructions that write a register have to copy the old physical
dest register value to the new physical register.

A constant time guarantee might only be possible if one manually arranges
to unconditionally perform all calculations then CMOV the result.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to Anton Ertl on Fri Apr 25 13:19:28 2025

Anton Ertl wrote:

mitchalsup@aol.com (MitchAlsup1) writes:
[EricP:]

[*] I want to see the asm because Intel's CMOV always executes the
operand operation, then tosses the result if the predicate is false.

Use a less-stupid ISA

The ISA does not require that. It could just as well be implemented
as waiting for the condition, and only then perform the operation.
And with a more sophisticated implementation one could even do that
for operations that are not part of the CMOV instruction, but produce
one of the source operands of the CMOV instruction.

For general OoO predication I call this "pruning".
At least the way I was looking at it, to do pruning for compute operations
(not LD or ST) requires almost doubling the dependency tracking matrix. E.g.

if (pred) ADD r3=r1+r2

requires tracking the original physical rp1,rp2 sources plus pred and
old physical assigned to r3, which I'll call old_rp3.

- if pred has not resolved then hold the uOp in the scheduler
or reservation station.
- If pred resolves to false then it can prune the dependency on rp1,rp2
which if old_rp3 is has resolved may allow it to launch the copy of
old_rp3 to new_rp3 immediately.
- if pred resolves to true then it can prune the dependency on old_rp3
which may allow it to launch ADD immediately if rp1,rp2 are resolved.
- in either case after the ADD finishes it must perform a Writeback
to set the Instruction Queue "Done" flag to True so it can retire.

The above can be optimized further by Dispatch (handoff from front end
to back end). If Dispatch has its own register file write port(s) it can optimize away Load Immediate reg, MOV reg-reg, zero reg by writing
directly to the PRF and eliminate the trip through a Function Unit.
If pred is resolved to false at the time of Dispatch then it can optimize
away the move of old_rp3 to new_rp3 for a zero-cycle execute cost.

The above adds an extra operand value to the uOp plus whatever the pred condition costs. There is extra dependency matrix and extra forwarding
bus loading for old_rp and pred, extra logic in the reservation station.
There is a FU writeback cycle to set the "Done" flag on the IQ entry
whether pred is true or false (unless optimized away by Dispatch).

Predication on LD or ST affects the LSQ because the memory address
dependency detector, which decides whether a younger LD or ST can bypass
an older LD or ST, also interacts with predicate resolution and pruning. However this is not so bad as it already has to deal with unresolved
virtual addresses so unresolved predicates looks like its similar.

Another possible optimization is if pred is unresolved but the source
operands are, and if this is a longer duration operation like MUL, DIV,
or float op, then schedule the uOp launch at a low priority if FU is idle.
This one is a little iffy because launching a long duration low priority
op on an otherwise idle FU might block a high priority that shows up later.

However,
apparently such implementations have enough disadvantages (probably in performance) that nobody has gone there AFAIK. AFAIK everyone,
including implementations of different ISAs implements
CMOV/predication as performing the operation and then conditionally
squashing the result.

- anton

The likely reason that Intel does execute-then-toss variant was because
it only applied to CMOV, they were retrofitting it into the ISA,
and the optimal version logic looks expensive especially as it
interacts with the LSQ.

Arm A32 general conditional execution would have originally been
implemented in an in-order uArch so there was likely no benefit from
pruning nullified instructions. For OoO it takes much more logic to
do the pruning version and given their target market was embedded
little customer demand (who probably want constant timing).

But if an ISA has predication from the start I see no reason to not
at least plan for doing pruning in a high performance OoO implementation.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to EricP on Fri Apr 25 20:51:03 2025

On Thu, 24 Apr 2025 14:10:50 +0000, EricP wrote:

Anton Ertl wrote:

Stefan Monnier <monnier@iro.umontreal.ca> writes:

Of course, nothing prevents a CPU from doing "predicate prediction":
instead of waiting for an answer to `needle < *base[middle]`, it could
try and guess whether it will be true or false and thus choose to send
one of the two addresses (or both) to the memory (and later check the
prediction and rollback, just like we do with normal branches).

That would subvert the use of predication for getting constant-time
code for avoiding timing side channels. I.e., not a good idea.

- anton

This assumes that the predicate's not-executed path is nullified in a constant time way. That might be ok for simple 1-clock instructions but
I would want heavier ones, MUL, DIV, all floats, and all LD and ST,
to be at worst nullified to 1-clock each, even better to 0.

OoO nullified (not-executed) instructions may still take 1 clock if all
such instructions that write a register have to copy the old physical
dest register value to the new physical register.

Or to broadcast that you should NOT be waiting on this operand any
longer.

A constant time guarantee might only be possible if one manually
arranges
to unconditionally perform all calculations then CMOV the result.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Gretchiie
  Mon Sep 15 05:16:29 2025
  from Derry, Nh via Telnet
- Fred Blogs
  Mon Sep 15 00:03:12 2025
  from Uk via SSH
- Plume
  Sun Sep 14 09:34:52 2025
  from Uk via Raw
- Gretchiie
  Sun Sep 14 06:07:30 2025
  from Derry, Nh via Telnet
- Thlc
  Sat Sep 13 17:11:34 2025
  from Rognac, France via Telnet
- Thlc
  Sat Sep 13 17:04:03 2025
  from Rognac, France via Telnet
- Thlc
  Sat Sep 13 16:32:19 2025
  from Rognac, France via SSH
- Thlc
  Sat Sep 13 15:41:11 2025
  from Rognac, France via SSH

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	546
Nodes:	16 (2 / 14)
Uptime:	02:54:28
Calls:	10,387
Calls today:	2
Files:	14,061
Messages:	6,416,755

Re: auto predicating branches

Who's Online

Recent Visitors

System Info