• Re: auto predicating branches

    From Anton Ertl@21:1/5 to Robert Finch on Mon Apr 21 06:05:32 2025
    Robert Finch <robfi680@gmail.com> writes:
    Having branches automatically convert into
    predicates when they branch forward a short distance <7 instructions.

    If-conversion in hardware is a good idea, if done well, because it
    involves issues that tend to be unknown to compilers:

    * How predictable is the condition? If the condition is very well
    predictable, if-conversion is not a good idea, because it turns the
    control dependency (which does not cost latency when the prediction
    is correct) into a data dependency. Moreover, in this case the
    if-conversion increases the resource consumption. Compilers are not
    good at predicting the predictability AFAIK.

    * Is the condition available before or after the original data
    dependencies? And if afterwards, by how many cycles? If it is
    afterwards and the branch prediction would be correct, the
    if-conversion means that the result of the instruction is available
    later, which may reduce IPC. OTOH, if the branch prediction would
    be incorrect, the recovery also depends on when the condition
    becomes available, and the total latency is higher in the case of no
    if-conversion. The compiler may do an ok job at predicting whether
    a condition is available before or after the original data
    dependencies (I don't know a paper that evaluates that), but without
    knowing about the prediction accuracy of a specific condition that
    does not help much.

    So the hardware should take predictability of a condition and the
    availability of the condition into consideration for if-conversion.

    What about reverse if-conversion in hardware, i.e., converting
    predicated instructions and the like (conditional moves, if-then-else instructions and the instructions they control) into branch-predicted
    phantom branches and eliminating the data dependency on the condition
    from the instruction.

    For performance, one might consider reverse if-conversion, because the
    same considerations apply; however, there is also a security aspect: programmers have used these instructions instead of branches to
    produce constant-time code to avoid timing side channels of code that
    deals with secrets; and the discovery of Spectre has shown additional
    timing side channels of branches. Because you cannot be sure that the predicated instruction is there for security reasons, you must not use
    reverse if-conversion in hardware.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Anton Ertl on Mon Apr 21 13:39:25 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    So the hardware should take predictability of a condition and the >availability of the condition into consideration for if-conversion.

    Another criterion would be whether one of the potentially
    if-converted instructions is on the critical data-dependency path. If
    none of them are, additional latency (if any) from if-conversion may
    be less of an issue. OTOH, it may also mean that the code is
    resource-limited rather than dependency-limited, and that one should
    avoid if-conversion in order to avoid additional resource consumption
    (and of course prediction accuracy also plays into these
    considerations).

    This made me think how one could determine whether an instruction is
    on the critical path. If an instruction X is committed, and the only instructions later in the reorder buffer are data-flow successors of
    X, then X is obviously on the critical path. An instruction Y that
    delivers the last input that releases X from the scheduler into the
    functional unit is also on the critical path; this is transitive, and
    several results may arrive in the same cycle.

    Marking X as a critical instruction is relatively easy, although one
    might also count how many times of all dynamic instances of the
    instruction were critical. Marking the critical data-flow
    predecessors is harder, especially across several levels (not to
    mention all levels); one way is to do it over time: when an
    instruction V delivers a dependency that releases an instruction W
    from the scheduler that is already marked as critical, V is marked as
    critical as well.

    The same instruction may be on the critical path in one execution, but
    not in another execution, due to cache hits, or differences in control
    flow. The approach outlined above only works correctly if W is always
    on the critical path. V should probably only marked as critical if
    the number of critical executions of W relative to the overall
    executions of W is high.

    Showing the critical path through performance counters might help
    programmers improve the performance of their programs.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Mon Apr 21 17:29:15 2025
    On Mon, 21 Apr 2025 6:05:32 +0000, Anton Ertl wrote:

    Robert Finch <robfi680@gmail.com> writes:
    Having branches automatically convert into
    predicates when they branch forward a short distance <7 instructions.

    If-conversion in hardware is a good idea, if done well, because it
    involves issues that tend to be unknown to compilers:

    I had little trouble teaching Brian how to put if-conversion into the
    compiler with my PRED instructions. Alleviating HW from having to bother
    other than being able to execute PREDicated clauses.

    * How predictable is the condition? If the condition is very well
    predictable, if-conversion is not a good idea, because it turns the
    control dependency (which does not cost latency when the prediction
    is correct) into a data dependency. Moreover, in this case the
    if-conversion increases the resource consumption. Compilers are not
    good at predicting the predictability AFAIK.

    Rather than base the choice on the predictability of the condition,
    It is based on whether FETCH will pass the join-point before the
    condition resolves. On an 8-wide machine this might be "THE next
    cycle".

    * Is the condition available before or after the original data
    dependencies? And if afterwards, by how many cycles? If it is
    afterwards and the branch prediction would be correct, the
    if-conversion means that the result of the instruction is available
    later, which may reduce IPC.

    Generally, it only adds latency--if the execution window is not staled
    at either end this does not harm IPC.

    OTOH, if the branch prediction would
    be incorrect, the recovery also depends on when the condition
    becomes available,

    There is no "recovery" from PREDication, just one clause getting
    nullified.

    and the total latency is higher in the case of no
    if-conversion. The compiler may do an ok job at predicting whether
    a condition is available before or after the original data
    dependencies (I don't know a paper that evaluates that), but without
    knowing about the prediction accuracy of a specific condition that
    does not help much.

    So the hardware should take predictability of a condition and the availability of the condition into consideration for if-conversion.

    My argument is that this is a SW decision (in the compiler) not a
    HW decision (other than providing the PREDs). Since PREDs are not
    predicted (unless you think they are predicted BOTH ways) they do
    not diminish the performance of the branch predictors.

    What about reverse if-conversion in hardware, i.e., converting
    predicated instructions and the like (conditional moves, if-then-else instructions and the instructions they control) into branch-predicted
    phantom branches and eliminating the data dependency on the condition
    from the instruction.

    The compiler choose PRED because FETCH reaches the join-point prior
    to the branch resolving. PRED is almost always faster--and when
    it has both then-clause and else-clause, it always saves a branch
    instruction (jumping over the else-clause).

    For performance, one might consider reverse if-conversion, because the
    same considerations apply; however, there is also a security aspect: programmers have used these instructions instead of branches to
    produce constant-time code to avoid timing side channels of code that
    deals with secrets; and the discovery of Spectre has shown additional
    timing side channels of branches. Because you cannot be sure that the predicated instruction is there for security reasons, you must not use reverse if-conversion in hardware.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Tue Apr 22 05:10:10 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 21 Apr 2025 6:05:32 +0000, Anton Ertl wrote:

    Robert Finch <robfi680@gmail.com> writes:
    Having branches automatically convert into
    predicates when they branch forward a short distance <7 instructions.

    If-conversion in hardware is a good idea, if done well, because it
    involves issues that tend to be unknown to compilers:

    I had little trouble teaching Brian how to put if-conversion into the >compiler with my PRED instructions. Alleviating HW from having to bother >other than being able to execute PREDicated clauses.

    Compilers certainly can perform if-conversion, and there have been
    papers about that for at least 30 years, but compilers do not know
    when if-conversion is profitable. E.g., "branchless" (if-converted)
    binary search can be faster than a branching one when the searched
    array is cached at a close-enough level, but slower when most of the
    accesses miss the close caches <https://stackoverflow.com/questions/11360831/about-the-branchless-binary-search>.

    * How predictable is the condition? If the condition is very well
    predictable, if-conversion is not a good idea, because it turns the
    control dependency (which does not cost latency when the prediction
    is correct) into a data dependency. Moreover, in this case the
    if-conversion increases the resource consumption. Compilers are not
    good at predicting the predictability AFAIK.

    Rather than base the choice on the predictability of the condition,
    It is based on whether FETCH will pass the join-point before the
    condition resolves. On an 8-wide machine this might be "THE next
    cycle".

    On a CPU with out-of-order execution, the instruction fetcher often
    runs many dozens of instructions ahead of the functional units which
    produce the conditions, so your criterion could cover pretty big IFs
    And, given that you want the compiler to do it, the compiler would
    have to know about that. Ok, what decision will you take in what
    case, and why?

    * Is the condition available before or after the original data
    dependencies? And if afterwards, by how many cycles? If it is
    afterwards and the branch prediction would be correct, the
    if-conversion means that the result of the instruction is available
    later, which may reduce IPC.

    Generally, it only adds latency--if the execution window is not staled
    at either end this does not harm IPC.

    If the additional latency is on the critical path and execution is dependency-limited, this reduces IPC. And yes, this will result in
    the buffers (especially the schedulers) filling up and stalling the
    front end.

    OTOH, if the branch prediction would
    be incorrect, the recovery also depends on when the condition
    becomes available,

    There is no "recovery" from PREDication, just one clause getting
    nullified.

    I apparently wrote that in a misunderstandable way. Here's another
    attempt: When comparing the branching variant to the predicated
    (if-converted) variant, if the branching variant would be
    mispredicted, it is always at a disadvantage wrt. latency compared to
    the predicated variant, because the branching variant restarts from
    the instruction fetch when the condition becomes available, while the predicated variant is already fetched and decoded and waits in a
    scheduler for the condition.

    Note that in the binary-search case linked-to above, that's also the
    case, but in the branchy version the benefit comes from the correct
    predictions and the lack of data-dependencies between the loads: In
    those cases the cache-missing load does not depend on the previous cache-missing load (unlike the branchless version), resulting in an
    overall shorter latency.

    and the total latency is higher in the case of no
    if-conversion. The compiler may do an ok job at predicting whether
    a condition is available before or after the original data
    dependencies (I don't know a paper that evaluates that), but without
    knowing about the prediction accuracy of a specific condition that
    does not help much.

    So the hardware should take predictability of a condition and the
    availability of the condition into consideration for if-conversion.

    My argument is that this is a SW decision (in the compiler) not a
    HW decision (other than providing the PREDs).

    That's a position, not an argument. Do you have an argument for your
    position?

    Since PREDs are not
    predicted (unless you think they are predicted BOTH ways) they do
    not diminish the performance of the branch predictors.

    Nor increase it. But it sounds like you think that the compiler
    should choose predication when the condition is not particularly
    predictable. How should the compiler know that?

    The compiler choose PRED because FETCH reaches the join-point prior
    to the branch resolving. PRED is almost always faster--and when
    it has both then-clause and else-clause, it always saves a branch
    instruction (jumping over the else-clause).

    It appears that you have an underpowered front end in mind. Even in
    the wide machines of today and without predication, the front end
    normally does not have a problem fetching at least as many
    instructions as the rest of the machine can handle, as long as the
    predictions are correct. Even if the instruction fetcher cannot fetch
    the full width in one cycle due to having more taken branches than the instruction fetcher can handle or some other hiccup, this is usually
    made up by delivering more instructions in other cycles than the rest
    of the CPU can handle. E.g., Skymont <https://old.chipsandcheese.com/2024/06/15/intel-details-skymont/>
    fetches from three sequential streams, 3*32bytes/cycle, and decodes
    into 3*3 uops/cycle and stores them in 3 32-entry uop queues; the
    renamer consumes 8 instructions from these queues, so as long as the predictions are correct and the average fetching and decoding is >8
    uops/cycle, the renamer will rarely see fewer than 8 uops available,
    even if there is the occasional cycle where the taken branches are so
    dense that the instruction fetcher cannot deliver enough relevant
    bytes to the instruction decoder.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Tue Apr 22 11:23:57 2025
    Anton Ertl wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 21 Apr 2025 6:05:32 +0000, Anton Ertl wrote:

    Robert Finch <robfi680@gmail.com> writes:
    Having branches automatically convert into
    predicates when they branch forward a short distance <7 instructions.
    If-conversion in hardware is a good idea, if done well, because it
    involves issues that tend to be unknown to compilers:
    I had little trouble teaching Brian how to put if-conversion into the
    compiler with my PRED instructions. Alleviating HW from having to bother
    other than being able to execute PREDicated clauses.

    Compilers certainly can perform if-conversion, and there have been
    papers about that for at least 30 years, but compilers do not know
    when if-conversion is profitable. E.g., "branchless" (if-converted)
    binary search can be faster than a branching one when the searched
    array is cached at a close-enough level, but slower when most of the
    accesses miss the close caches <https://stackoverflow.com/questions/11360831/about-the-branchless-binary-search>.

    I'm missing something because the first answer by BeeOnRope looks wrong. Unfortunately they don't show both pieces of code and the asm,
    but a branching binary search to me looks just as load data dependent as it recalculates middle each iteration and the array index is array[middle].

    Assuming the branching version is: https://en.wikipedia.org/wiki/Binary_search#Procedure

    If branch prediction is always correct, the branching version builds
    an instruction queue that is a long chain of scaled indexed loads
    where the load index is data dependent on the prior iteration load value.
    As each load finishes the next index is calculated and next load started,
    and the chain collapses serially.

    But that is exactly what the branchless version also does.

    The difference is that the branching version sometimes DOES mispredict,
    so purges and rebuilds the IQ for each mispredict. But what it purges
    for the current iteration is just an INC or DEC instruction,
    either L = m+1 or R = m-1, PLUS everything prefetched in IQ that followed.

    Assuming both versions have the same number of memory accesses[*],
    it looks to me that the branchless version should always win or tie.

    [*] I want to see the asm because Intel's CMOV always executes the
    operand operation, then tosses the result if the predicate is false.
    That means that if it does CMOVGT rax,[middle] it always does the
    [middle] memory load and then conditionally tosses the value.
    It may be that the branchless version is actually doing more memory
    accesses which is why it is slower.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Tue Apr 22 16:45:20 2025
    On Tue, 22 Apr 2025 5:10:10 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    --------------

    My argument is that this is a SW decision (in the compiler) not a
    HW decision (other than providing the PREDs).

    That's a position, not an argument. Do you have an argument for your position?

    Architecture defines the primitives, Hardware provides the primitives,
    SW decides to use them (or not) on a case-by-case basis.

    Since PREDs are not
    predicted (unless you think they are predicted BOTH ways) they do
    not diminish the performance of the branch predictors.

    Nor increase it.

    When there is an else-clause, the PRED uses 1 fewer instruction:
    the branch that exits the then-clause and jumps over the else-
    clause does not exist.

    But it sounds like you think that the compiler
    should choose predication when the condition is not particularly
    predictable. How should the compiler know that?

    Compiler is setup to use PRED when the then-clause is 8 or fewer
    instructions and when the else clause is 8 or fewer instructions,
    otherwise it uses branches.
    ----------

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Tue Apr 22 17:31:03 2025
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    Compilers certainly can perform if-conversion, and there have been
    papers about that for at least 30 years, but compilers do not know
    when if-conversion is profitable. E.g., "branchless" (if-converted)
    binary search can be faster than a branching one when the searched
    array is cached at a close-enough level, but slower when most of the
    accesses miss the close caches
    <https://stackoverflow.com/questions/11360831/about-the-branchless-binary-search>.

    I'm missing something because the first answer by BeeOnRope looks wrong. >Unfortunately they don't show both pieces of code and the asm,
    but a branching binary search to me looks just as load data dependent as it >recalculates middle each iteration and the array index is array[middle].

    Here's the branchless version:

    while (n > 1) {
    int middle = n/2;
    base += (needle < *base[middle]) ? 0 : middle;
    n -= middle;
    }

    The branching version would then be:

    while (n > 1) {
    int middle = n/2;
    if (needle >= *base[middle])
    base += middle;
    n -= middle;
    }

    In the branching version we have the following loop recurrences:

    1) middle = n/2; n -= middle
    2) base += middle;

    Recurrence 1 costs a few cycles per iteration, ideally 2. Recurrence
    2 costs even less. There is no load in the loop recurrences. The
    load is used only for verifying the branch prediction. And in
    particular, the load does not data-depend on any previous load.

    If branch prediction is always correct, the branching version builds
    an instruction queue that is a long chain of scaled indexed loads
    where the load index is data dependent on the prior iteration load value.

    That's certainly not the case here.

    Even if we assume that the branch predictor misses half of the time
    (which is realistic), that means that the next load and the load of
    all followig correctly predicted ifs just have to wait for the load of
    the mispredicted branch plus the misprediction penalty. Which means
    that on average two loads will happen in parallel, while the
    branchless version only performs dependent loads. If the loads cost
    more than a branch misprediction (because of cache misses), the
    branching version is faster.

    One can now improve the performance by using branchless for the first
    few levels (which are cached), and adding prefetching and a mixture of branchless and branchy for the rest, but for understanding the
    principle the simple variants are best.

    [*] I want to see the asm because Intel's CMOV always executes the
    operand operation, then tosses the result if the predicate is false.

    That's the usual thing for conditional execution/predication. But
    here 0 has no dependencies and middle has only cheap dependencies, the
    only expensive part is the load for the condition that turns into a
    data dependency in the branchless version.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Tue Apr 22 22:32:40 2025
    On Tue, 22 Apr 2025 17:31:03 +0000, Anton Ertl wrote:

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    Compilers certainly can perform if-conversion, and there have been
    papers about that for at least 30 years, but compilers do not know
    when if-conversion is profitable. E.g., "branchless" (if-converted)
    binary search can be faster than a branching one when the searched
    array is cached at a close-enough level, but slower when most of the
    accesses miss the close caches
    <https://stackoverflow.com/questions/11360831/about-the-branchless-binary-search>.

    I'm missing something because the first answer by BeeOnRope looks wrong. >>Unfortunately they don't show both pieces of code and the asm,
    but a branching binary search to me looks just as load data dependent as
    it
    recalculates middle each iteration and the array index is array[middle].

    Here's the branchless version:

    while (n > 1) {
    int middle = n/2;
    base += (needle < *base[middle]) ? 0 : middle;
    n -= middle;
    }

    loop:
    SRA Rn,Rn,#1 // Rn in
    LDD Rl,[Rb,Rm<<3] // Rb in
    CMP R7,Rn,Rl
    CMOVLT R7,#0,Rm
    ADD Rb,Rb,-Rm // Rb out
    ADD Rn,Rn,-Rm // Rn out
    BR loop

    The branching version would then be:

    while (n > 1) {
    int middle = n/2;
    if (needle >= *base[middle])
    base += middle;
    n -= middle;
    }

    loop:
    SRA Rn,Rn,#1 // Rn in
    LDD Rl,[Rb,Rm<<3] // Rb in
    CMP R7,Rn,Rl
    PGE R7,T
    ADD Rb,Rb,-Rm // Rb conditionally out
    ADD Rn,Rn,-Rm // Rn out
    BR loop

    The ADD or Rn can be freely positioned in the loop, but the recurrence
    remains at 2 cycles.

    A 7-wide (or wider) machine will be able to insert a whole iteration
    of the loop per cycle. The loop has a 2 cycle recurrence, so, the
    execution window will fill up rapidly and retire at 1 loop per 2 cycles;
    being recurrence-bound. This agrees with the second paragraph below.

    In the branching version we have the following loop recurrences:

    1) middle = n/2; n -= middle
    2) base += middle;

    Recurrence 1 costs a few cycles per iteration, ideally 2. Recurrence
    2 costs even less. There is no load in the loop recurrences. The
    load is used only for verifying the branch prediction. And in
    particular, the load does not data-depend on any previous load.

    A shifter with a latency of 2 would really hurt this loop.

    If branch prediction is always correct, the branching version builds
    an instruction queue that is a long chain of scaled indexed loads
    where the load index is data dependent on the prior iteration load value.

    That's certainly not the case here.

    Even if we assume that the branch predictor misses half of the time
    (which is realistic), that means that the next load and the load of
    all followig correctly predicted ifs just have to wait for the load of
    the mispredicted branch plus the misprediction penalty. Which means
    that on average two loads will happen in parallel, while the
    branchless version only performs dependent loads. If the loads cost
    more than a branch misprediction (because of cache misses), the
    branching version is faster.

    I do not see 2 LDDs being performed parallel unless the execution
    width is at least 14-wide. In any event loop recurrence restricts the
    overall retirement to 0.5 LDDs per cycle--it is the recurrence that
    feeds the iterations (i.e., retirement). The water hose is fundamentally restricted by the recurrence.

    One can now improve the performance by using branchless for the first
    few levels (which are cached), and adding prefetching and a mixture of branchless and branchy for the rest, but for understanding the
    principle the simple variants are best.

    Predicating has added latency on the delivery of Rb's recurrence in that
    the consumer has to be ready for {didn't happen and now you are free"
    along with "here is the data you are looking for". Complicating the
    instruction queue "a little".

    [*] I want to see the asm because Intel's CMOV always executes the
    operand operation, then tosses the result if the predicate is false.

    Use a less-stupid ISA

    I took a look at extracting the < or >= cases and using it as a mask
    (0, middle) but this takes 1 more instruction and does nothing about
    the fundamental loop limiter.

    That's the usual thing for conditional execution/predication. But
    here 0 has no dependencies and middle has only cheap dependencies, the
    only expensive part is the load for the condition that turns into a
    data dependency in the branchless version.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Tue Apr 22 22:59:10 2025
    I do not see 2 LDDs being performed parallel unless the execution
    width is at least 14-wide. In any event loop recurrence restricts the

    IIUC you'll get multiple loads in parallel if the loads take a long time because of cache misses. Say each load takes 100 cycles, then there is
    plenty of time during one load to predict many more iterations of the
    loop and hence issue many more loads.

    With a branching code, the addresses of those loads depend mostly on the
    branch predictions, so the branch predictions end up performing a kind
    of "value prediction" (where the value that's predicted is the address
    of the next lookup).

    With predication your load address will conceptually depend on 3 inputs:
    the computation of `base + middle`, the computation of `base + 0`, and
    the computation of the previous `needle < *base[middle]` test to choose
    between the first two. If the LD of `*base[middle]` takes 100 cycle,
    that means a delay of 100 cycles before the next LD can be issued.

    Of course, nothing prevents a CPU from doing "predicate prediction":
    instead of waiting for an answer to `needle < *base[middle]`, it could
    try and guess whether it will be true or false and thus choose to send
    one of the two addresses (or both) to the memory (and later check the prediction and rollback, just like we do with normal branches).


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Wed Apr 23 17:44:56 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    I do not see 2 LDDs being performed parallel unless the execution
    width is at least 14-wide. In any event loop recurrence restricts the
    overall retirement to 0.5 LDDs per cycle--it is the recurrence that
    feeds the iterations (i.e., retirement).

    Yes. But with loads that take longer than two cycles (very common in
    OoO microarchitectures even for L1 hits), the second load starts
    before the first finishes. And in the case where the branchy version
    is profitable (when the load latency longer than the misprediction
    penalty due to cache misses), many loads will start before the first
    finishes (most of them will be canceled due to misprediction, but even
    an average of two useful parallel loads produces a good speedup).

    [EricP:]
    [*] I want to see the asm because Intel's CMOV always executes the >>>operand operation, then tosses the result if the predicate is false.

    Use a less-stupid ISA

    The ISA does not require that. It could just as well be implemented
    as waiting for the condition, and only then perform the operation.
    And with a more sophisticated implementation one could even do that
    for operations that are not part of the CMOV instruction, but produce
    one of the source operands of the CMOV instruction. However,
    apparently such implementations have enough disadvantages (probably in performance) that nobody has gone there AFAIK. AFAIK everyone,
    including implementations of different ISAs implements
    CMOV/predication as performing the operation and then conditionally
    squashing the result.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Stefan Monnier on Wed Apr 23 18:09:28 2025
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Of course, nothing prevents a CPU from doing "predicate prediction":
    instead of waiting for an answer to `needle < *base[middle]`, it could
    try and guess whether it will be true or false and thus choose to send
    one of the two addresses (or both) to the memory (and later check the >prediction and rollback, just like we do with normal branches).

    That would subvert the use of predication for getting constant-time
    code for avoiding timing side channels. I.e., not a good idea.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Wed Apr 23 21:34:51 2025
    On Wed, 23 Apr 2025 17:44:56 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    I do not see 2 LDDs being performed parallel unless the execution
    width is at least 14-wide. In any event loop recurrence restricts the >>overall retirement to 0.5 LDDs per cycle--it is the recurrence that
    feeds the iterations (i.e., retirement).

    Yes. But with loads that take longer than two cycles (very common in
    OoO microarchitectures even for L1 hits), the second load starts
    before the first finishes. And in the case where the branchy version
    is profitable (when the load latency longer than the misprediction
    penalty due to cache misses), many loads will start before the first
    finishes (most of them will be canceled due to misprediction, but even
    an average of two useful parallel loads produces a good speedup).

    We still have a 2 cycle loop recurrence, so even if we could perform
    1000
    LDs per cycle, we are fundamentally SRA and SUB bound around the loop
    from iteration to iteration.

    [EricP:]
    [*] I want to see the asm because Intel's CMOV always executes the >>>>operand operation, then tosses the result if the predicate is false.

    Use a less-stupid ISA

    The ISA does not require that. It could just as well be implemented
    as waiting for the condition, and only then perform the operation.
    And with a more sophisticated implementation one could even do that
    for operations that are not part of the CMOV instruction, but produce
    one of the source operands of the CMOV instruction. However,
    apparently such implementations have enough disadvantages (probably in performance) that nobody has gone there AFAIK. AFAIK everyone,
    including implementations of different ISAs implements
    CMOV/predication as performing the operation and then conditionally
    squashing the result.

    That is difficult with renaming. In order for the later instructions
    to wait on the CMOV renamed result register or the earlier predicted
    value, each entry in each station has to be able to wait on one or
    the other. In general, it is far easier to make CMOV be able to deliver
    either result so nothing downstream of CMOV has any added complexity.


    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Thu Apr 24 08:54:05 2025
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    Compilers certainly can perform if-conversion, and there have been
    papers about that for at least 30 years, but compilers do not know
    when if-conversion is profitable. E.g., "branchless" (if-converted)
    binary search can be faster than a branching one when the searched
    array is cached at a close-enough level, but slower when most of the
    accesses miss the close caches
    <https://stackoverflow.com/questions/11360831/about-the-branchless-binary-search>.
    I'm missing something because the first answer by BeeOnRope looks wrong.
    Unfortunately they don't show both pieces of code and the asm,
    but a branching binary search to me looks just as load data dependent as it >> recalculates middle each iteration and the array index is array[middle].

    Here's the branchless version:

    while (n > 1) {
    int middle = n/2;
    base += (needle < *base[middle]) ? 0 : middle;
    n -= middle;
    }

    The branching version would then be:

    while (n > 1) {
    int middle = n/2;
    if (needle >= *base[middle])
    base += middle;
    n -= middle;
    }

    In the branching version we have the following loop recurrences:

    1) middle = n/2; n -= middle
    2) base += middle;

    Recurrence 1 costs a few cycles per iteration, ideally 2. Recurrence
    2 costs even less. There is no load in the loop recurrences. The
    load is used only for verifying the branch prediction. And in
    particular, the load does not data-depend on any previous load.

    Yes, I missed that it separates the array index calculation off into
    its own compute bound dependency chain that can be calculated ASAP.
    That allows it to hit D-cache with multiple speculated loads.

    But it is the pipelined cache that allows that to be an advantage.
    See below.

    If branch prediction is always correct, the branching version builds
    an instruction queue that is a long chain of scaled indexed loads
    where the load index is data dependent on the prior iteration load value.

    That's certainly not the case here.

    Even if we assume that the branch predictor misses half of the time
    (which is realistic), that means that the next load and the load of
    all followig correctly predicted ifs just have to wait for the load of
    the mispredicted branch plus the misprediction penalty. Which means
    that on average two loads will happen in parallel, while the
    branchless version only performs dependent loads. If the loads cost
    more than a branch misprediction (because of cache misses), the
    branching version is faster.

    Yes it is that the cache is pipelined and/or multi-ported that allows it
    to take advantage of this. Pipelining allows the code to submit many more
    loads and toss away half the values in less time than the serial branchless version could have. Plus having many miss buffers allows a similar advantage.

    For example, assuming a cache latency of 3 and throughput of 1,
    6 serial loads takes 18 clocks whereas a pipelined cache can do
    15 loads in the same time, or fewer loads in a shorter time.
    If it has multiple miss buffers then similarly for cache misses.

    If the cache is not pipelined, say a latency of 1 throughput of 1
    like standard design, or too few (< 4?) miss buffers, then the branchless version should win or tie because it does less loads.

    One can now improve the performance by using branchless for the first
    few levels (which are cached), and adding prefetching and a mixture of branchless and branchy for the rest, but for understanding the
    principle the simple variants are best.

    I was wondering whether adding explicit prefetching to the branchless
    version would help. It probably would help with cache misses but not for
    hits because it still would not make use of the cache pipeline.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Stefan Monnier on Thu Apr 24 09:47:18 2025
    Stefan Monnier wrote:
    I do not see 2 LDDs being performed parallel unless the execution
    width is at least 14-wide. In any event loop recurrence restricts the

    IIUC you'll get multiple loads in parallel if the loads take a long time because of cache misses. Say each load takes 100 cycles, then there is plenty of time during one load to predict many more iterations of the
    loop and hence issue many more loads.

    With a branching code, the addresses of those loads depend mostly on the branch predictions, so the branch predictions end up performing a kind
    of "value prediction" (where the value that's predicted is the address
    of the next lookup).

    With predication your load address will conceptually depend on 3 inputs:
    the computation of `base + middle`, the computation of `base + 0`, and
    the computation of the previous `needle < *base[middle]` test to choose between the first two. If the LD of `*base[middle]` takes 100 cycle,
    that means a delay of 100 cycles before the next LD can be issued.

    Of course, nothing prevents a CPU from doing "predicate prediction":
    instead of waiting for an answer to `needle < *base[middle]`, it could
    try and guess whether it will be true or false and thus choose to send
    one of the two addresses (or both) to the memory (and later check the prediction and rollback, just like we do with normal branches).

    There was a flurry of predication research circa 2000 because of Itanium.
    One I saw suggested moving the prediction from BRcc to CMP op so it works
    for both branch and predicates. Their rational was that if-conversions
    affect the branch predictor stats by removing samples from the test sets.

    So yes it becomes a value prediction, but its a binary value
    and should be the same binary value as before.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Thu Apr 24 10:10:50 2025
    Anton Ertl wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Of course, nothing prevents a CPU from doing "predicate prediction":
    instead of waiting for an answer to `needle < *base[middle]`, it could
    try and guess whether it will be true or false and thus choose to send
    one of the two addresses (or both) to the memory (and later check the
    prediction and rollback, just like we do with normal branches).

    That would subvert the use of predication for getting constant-time
    code for avoiding timing side channels. I.e., not a good idea.

    - anton

    This assumes that the predicate's not-executed path is nullified in a
    constant time way. That might be ok for simple 1-clock instructions but
    I would want heavier ones, MUL, DIV, all floats, and all LD and ST,
    to be at worst nullified to 1-clock each, even better to 0.

    OoO nullified (not-executed) instructions may still take 1 clock if all
    such instructions that write a register have to copy the old physical
    dest register value to the new physical register.

    A constant time guarantee might only be possible if one manually arranges
    to unconditionally perform all calculations then CMOV the result.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Fri Apr 25 13:19:28 2025
    Anton Ertl wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    [EricP:]
    [*] I want to see the asm because Intel's CMOV always executes the
    operand operation, then tosses the result if the predicate is false.
    Use a less-stupid ISA

    The ISA does not require that. It could just as well be implemented
    as waiting for the condition, and only then perform the operation.
    And with a more sophisticated implementation one could even do that
    for operations that are not part of the CMOV instruction, but produce
    one of the source operands of the CMOV instruction.

    For general OoO predication I call this "pruning".
    At least the way I was looking at it, to do pruning for compute operations
    (not LD or ST) requires almost doubling the dependency tracking matrix. E.g.

    if (pred) ADD r3=r1+r2

    requires tracking the original physical rp1,rp2 sources plus pred and
    old physical assigned to r3, which I'll call old_rp3.

    - if pred has not resolved then hold the uOp in the scheduler
    or reservation station.
    - If pred resolves to false then it can prune the dependency on rp1,rp2
    which if old_rp3 is has resolved may allow it to launch the copy of
    old_rp3 to new_rp3 immediately.
    - if pred resolves to true then it can prune the dependency on old_rp3
    which may allow it to launch ADD immediately if rp1,rp2 are resolved.
    - in either case after the ADD finishes it must perform a Writeback
    to set the Instruction Queue "Done" flag to True so it can retire.

    The above can be optimized further by Dispatch (handoff from front end
    to back end). If Dispatch has its own register file write port(s) it can optimize away Load Immediate reg, MOV reg-reg, zero reg by writing
    directly to the PRF and eliminate the trip through a Function Unit.
    If pred is resolved to false at the time of Dispatch then it can optimize
    away the move of old_rp3 to new_rp3 for a zero-cycle execute cost.

    The above adds an extra operand value to the uOp plus whatever the pred condition costs. There is extra dependency matrix and extra forwarding
    bus loading for old_rp and pred, extra logic in the reservation station.
    There is a FU writeback cycle to set the "Done" flag on the IQ entry
    whether pred is true or false (unless optimized away by Dispatch).

    Predication on LD or ST affects the LSQ because the memory address
    dependency detector, which decides whether a younger LD or ST can bypass
    an older LD or ST, also interacts with predicate resolution and pruning. However this is not so bad as it already has to deal with unresolved
    virtual addresses so unresolved predicates looks like its similar.

    Another possible optimization is if pred is unresolved but the source
    operands are, and if this is a longer duration operation like MUL, DIV,
    or float op, then schedule the uOp launch at a low priority if FU is idle.
    This one is a little iffy because launching a long duration low priority
    op on an otherwise idle FU might block a high priority that shows up later.

    However,
    apparently such implementations have enough disadvantages (probably in performance) that nobody has gone there AFAIK. AFAIK everyone,
    including implementations of different ISAs implements
    CMOV/predication as performing the operation and then conditionally
    squashing the result.

    - anton

    The likely reason that Intel does execute-then-toss variant was because
    it only applied to CMOV, they were retrofitting it into the ISA,
    and the optimal version logic looks expensive especially as it
    interacts with the LSQ.

    Arm A32 general conditional execution would have originally been
    implemented in an in-order uArch so there was likely no benefit from
    pruning nullified instructions. For OoO it takes much more logic to
    do the pruning version and given their target market was embedded
    little customer demand (who probably want constant timing).

    But if an ISA has predication from the start I see no reason to not
    at least plan for doing pruning in a high performance OoO implementation.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Fri Apr 25 20:51:03 2025
    On Thu, 24 Apr 2025 14:10:50 +0000, EricP wrote:

    Anton Ertl wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Of course, nothing prevents a CPU from doing "predicate prediction":
    instead of waiting for an answer to `needle < *base[middle]`, it could
    try and guess whether it will be true or false and thus choose to send
    one of the two addresses (or both) to the memory (and later check the
    prediction and rollback, just like we do with normal branches).

    That would subvert the use of predication for getting constant-time
    code for avoiding timing side channels. I.e., not a good idea.

    - anton

    This assumes that the predicate's not-executed path is nullified in a constant time way. That might be ok for simple 1-clock instructions but
    I would want heavier ones, MUL, DIV, all floats, and all LD and ST,
    to be at worst nullified to 1-clock each, even better to 0.

    OoO nullified (not-executed) instructions may still take 1 clock if all
    such instructions that write a register have to copy the old physical
    dest register value to the new physical register.

    Or to broadcast that you should NOT be waiting on this operand any
    longer.

    A constant time guarantee might only be possible if one manually
    arranges
    to unconditionally perform all calculations then CMOV the result.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)