• An execution time puzzle

    From Anton Ertl@21:1/5 to All on Mon Mar 10 07:33:18 2025
    I have the sequence

    1 add $0x8,%rbx
    2 sub $0x8,%r13
    3 mov %rbx,0x0(%r13)
    4 mov %rdx,%rbx
    5 mov (%rbx),%rax
    6 jmp *%rax
    7 mov %r8,%r15
    8 add $0x10,%rbx
    9 mov 0x0(%r13),%rbx
    10 mov -0x10(%r15),%rax
    11 mov %r15,%rdx
    12 add $0x8,%r13
    13 sub $0x8,%rbx
    14 jmp *%rax

    The contents of the registers and memory are such that the first jmp
    continues at the next instruction in the sequence and the second jmp
    continues at the top of the sequence. I measure this sequence with
    perf stat on a Zen4, terminating it with Ctrl-C, and get output like:

    21969657501 cycles
    27996663866 instructions # 1.27 insn per cycle

    I.e., about 11 cycles for the whole sequence of 14 instructions. In
    trying to understand where these 11 cycles come from, I asked
    llvm-mca with

    cat xxx.s|llvm-mca-16 -mcpu=znver4 --iterations=1000

    and it tells me that it thinks that 1000 iterations take 2342 cycles:

    Iterations: 1000
    Instructions: 14000
    Total Cycles: 2342
    Total uOps: 14000

    Dispatch Width: 6
    uOps Per Cycle: 5.98
    IPC: 5.98
    Block RThroughput: 2.3

    So llvm-mca does not predict the actual performance correctly in this
    case and I still have no explanation for the 11 cycles.
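    The arithmetic can be checked directly (a sketch in Python; the
    counter values are the ones from the perf run above):

```python
# Cycles per iteration implied by the perf counters; the measured loop
# executes 14 instructions per iteration.
cycles = 21_969_657_501
instructions = 27_996_663_866
cycles_per_iter = 14 * cycles / instructions
print(round(cycles_per_iter, 2))  # ~10.99, i.e. about 11 cycles/iteration

# llvm-mca's prediction for the same block:
print(2342 / 1000)  # 2.342 cycles/iteration, nearly 5x faster
```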

    Does anybody have an explanation?

    The indirect jumps predict very well (0.03% mispredictions), so that's
    not the reason. So the jumps and all instructions that only produce
    (intermediate) results consumed by the jumps should not contribute to
    the latency: instructions 5, 6, 10, 14.

    Instruction 8 produces a dead result (overwritten by instruction 9)
    and therefore does not contribute to the latency. Instructions 4 and
    (in the previous iteration) 11 produce results that are only used in
    latency-irrelevant instructions. This leaves us with:

    1 add $0x8,%rbx
    2 sub $0x8,%r13
    3 mov %rbx,0x0(%r13)
    7 mov %r8,%r15
    9 mov 0x0(%r13),%rbx
    12 add $0x8,%r13
    13 sub $0x8,%rbx

    One idea is that in this case the hardware alias analysis and 0-cycle
    store-to-load forwarding fail for storing and reloading a value
    to/from 0(%r13) (instructions 3 and 9), but I would expect a latency
    of 6 cycles (1 cycle from instruction 1, 0 from 3, 4 from 9, 1 from
    13) from that, not 11.
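    The expected latency under that hypothesis adds up as follows (a
    sketch; the 4-cycle figure assumes the load pays the full L1
    load-use latency when forwarding fails):

```python
# Per-instruction contributions to the loop-carried dependency chain,
# as estimated in the text above.
chain = {
    "1 add $0x8,%rbx": 1,       # ALU latency
    "3 mov %rbx,0x0(%r13)": 0,  # the store itself adds nothing
    "9 mov 0x0(%r13),%rbx": 4,  # L1 load-use latency (no forwarding)
    "13 sub $0x8,%rbx": 1,      # ALU latency
}
print(sum(chain.values()))  # 6 cycles expected, yet ~11 are observed
```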

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Anton Ertl on Mon Mar 10 08:54:20 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    I have the sequence

    ...

    So llvm-mca does not predict the actual performance correctly in this
    case and I still have no explanation for the 11 cycles.

    Even more puzzling: In order to experiment with removing instructions
    I recreated this in assembly language:

    .text
    .globl main
    main:
    mov $threaded, %rdx
    mov $0, %rbx
    mov $(returnstack+8),%r13
    mov %rdx, %r8
    docol:
    add $0x8,%rbx
    sub $0x8,%r13
    mov %rbx,0x0(%r13)
    mov %rdx,%rbx
    mov (%rbx),%rax
    jmp *%rax
    outout:
    mov %r8,%r15
    add $0x10,%rbx
    mov 0x0(%r13),%rbx
    mov -0x10(%r15),%rax
    mov %r15,%rdx
    add $0x8,%r13
    sub $0x8,%rbx
    jmp *%rax

    .data
    .quad docol
    .quad 0
    threaded:
    .quad outout
    returnstack:
    .zero 16,0

    I assembled and linked this with:

    gcc xxx.s -Wl,-no-pie

    I ran the result with

    perf stat -e cycles -e instructions a.out

    terminated it with Ctrl-C and the result is:

    10764822288 cycles
    64556841216 instructions # 6.00 insn per cycle

    I.e., as predicted by llvm-mca. The main difference AFAICS is that in
    the slow version docol and outout are not adjacent, but far from each
    other, and returnstack is also not close to threaded (and the two
    64-bit words before it that also belong to threaded).

    It looks like I have found a microarchitectural pitfall, but it's not
    clear what it is.

    - anton
  • From Brett@21:1/5 to Anton Ertl on Mon Mar 10 16:09:28 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    ...

    How about giving us the original source code function? My x86 is rusty,
    and it is helpful to plug the source into compiler explorer to see what
    different compilers do.

  • From Anton Ertl@21:1/5 to Brett on Mon Mar 10 16:55:16 2025
    Brett <ggtgp@yahoo.com> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    ...
    How about giving us the original source code function, my x86 is rusty and
    it is helpful to plug source into compiler explorer to see what different
    compilers do.

    The original source code is

    : foo dup execute-exit ;
    \ invoked with
    ' foo foo

    This is Forth source code for Gforth (and the output is from
    gforth-fast). I expect that most c.a readers will find the assembly
    language more approachable :-)

    - anton
  • From Michael S@21:1/5 to Robert Finch on Mon Mar 10 20:02:13 2025
    On Mon, 10 Mar 2025 13:36:03 -0400
    Robert Finch <robfi680@gmail.com> wrote:
    Does it matter which core it is running on? Performance or economy?


    Performance and efficiency (which you misspelled as economy) cores are
    Intel terms; AMD calls its variants Zen 4 and Zen 4c.
    According to all available information, at the CPU-core
    microarchitecture level Zen 4c is identical to Zen 4.

  • From Anton Ertl@21:1/5 to Anton Ertl on Mon Mar 10 17:14:27 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    ...
    I.e., as predicted by llvm-mca. The main difference AFAICS is that in
    the slow version docol and outout are not adjacent, but far from each
    other, and returnstack is also not close to threaded (and the two
    64-bit words before it that also belong to threaded).

    Inserting 4096 bytes before outout and before returnstack did not
    change the performance on Zen4. Another difference is that in the
    slow version outout is in rwx memory while docol is in rx memory. I
    am too weak in assembly language to produce such an rwx section (and
    too lazy to do it by actually dynamically allocating the rwx memory).

    It looks like I have found a microarchitectural pitfall, but it's not
    clear what it is.

    Yes, looks like a microarchitectural pitfall:

    On Zen4, with two different builds of gforth-fast:

    gcc-12 gcc-10
    11 cycles/iteration 8 cycles/iteration
    mov %r8,%r15 mov %r8,%r15
    add $0x10,%rbx add $0x10,%rbx
    mov 0x0(%r13),%rbx mov (%r14),%rbx
    mov -0x10(%r15),%rax mov -0x10(%r15),%rax
    mov %r15,%rdx mov %r15,%rdx
    add $0x8,%r13 add $0x8,%r14
    sub $0x8,%rbx sub $0x8,%rbx
    jmp *%rax jmp *%rax
    add $0x8,%rbx add $0x8,%rbx
    sub $0x8,%r13 sub $0x8,%r14
    mov %rbx,0x0(%r13) mov %rbx,(%r14)
    mov %rdx,%rbx mov %rdx,%rbx
    mov (%rbx),%rax mov (%rbx),%rax
    jmp *%rax jmp *%rax

    Of course, there is also a difference in where the code and data
    pieces are placed.

    And here are measurements with the gcc-10 build on various other
    microarchitectures (IPC = 14/(c/it)); lower c/it numbers are better.

    cyc/it
    gf as
    8 2.3 Zen4
    8 3 Zen3
    4 3 Zen2
    9 9 Zen
    2.4 2.4 Golden Cove
    3 Rocket Lake
    6 3 Gracemont
    10.6 Tremont

    It's interesting that several microarchitectures show a difference
    between the version of the code produced by gforth-fast (gf) and my
    assembly-language variant (as) that executes the same instruction
    sequences.

    - anton
  • From Anton Ertl@21:1/5 to Robert Finch on Tue Mar 11 08:13:15 2025
    Robert Finch <robfi680@gmail.com> writes:
    It looks like LLVM is calculating 6 cycles (14000/2342) same as what you
    would expect.

    Yes, and what I see from the assembly-language variant.

    Could there be something else interfering with the
    performance stat (interrupts?)

    I see no reasons for more than usual interference from interrupts.

    Does it matter which core it is running
    on? Performance or economy?

    There are no efficiency cores on the Ryzen 8700G where my Zen4
    measurements were taken. But certainly the microarchitecture plays a
    role, and on CPUs with cores with several microarchitectures, one
    sees different results: the Golden Cove and Gracemont results in
    <2025Mar10.181427@mips.complang.tuwien.ac.at> were measured on the
    same CPU.

    - anton
  • From Anton Ertl@21:1/5 to Anton Ertl on Tue Mar 11 08:18:17 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    And here are measurements with the gcc-10 build on various other
    microarchitectures (IPC = 14/(c/it)); lower c/it numbers are better.

    cyc/it
    gf as
    8 2.3 Zen4
    8 3 Zen3
    4 3 Zen2
    9 9 Zen
    2.4 2.4 Golden Cove
    3 Rocket Lake
    6 3 Gracemont
    10.6 Tremont

    It's interesting that several microarchitectures show a difference
    between the version of the code produced by gforth-fast (gf) and my
    assembly-language variant (as) that executes the same instruction
    sequences.

    Given that I have trouble reproducing the slowness of gforth-fast in
    assembly language, I took another approach. The Forth source code is:

    : foo dup execute-exit ;

    So I added a primitive for the combination of DUP and EXECUTE-;S.
    This allows exploring the difference between dynamically-generated and
    static native code in Gforth. Here are the different code sequences:

    In all versions, the same static docol sequence is used:

    add $0x8,%rbx
    sub $0x8,%r14
    mov %rbx,(%r14)
    mov %rdx,%rbx
    mov (%rbx),%rax
    jmp *%rax

    For FOO, there are the following different sequences:

    1) dynamic code for "dup execute-exit" (sequence)
    2) dynamic code for "dup-execute-exit" (primitive)
    3) static code for "dup-execute-exit" (primitive)

    dynamic sequence dynamic primitive static primitive
    mov %r8,%r15
    add $0x10,%rbx add $0x8,%rbx
    mov (%r14),%rbx mov (%r14),%rbx mov (%r14),%rbx
    mov -0x10(%r15),%rax mov -0x10(%r8),%rax mov -0x10(%r8),%rax
    mov %r15,%rdx mov %r8,%rdx mov %r8,%rdx
    add $0x8,%r14 add $0x8,%r14 add $0x8,%r14
    sub $0x8,%rbx sub $0x8,%rbx sub $0x8,%rbx
    jmp *%rax jmp *%rax jmp *%rax

    To eliminate the difference between the dynamic and static primitive
    variants, I also measured a variant where I manually arranged the
    dynamic code to not execute the "add" at the start:

    4) static-like dynamic code for "dup-execute-exit" (primitive)

    I measured this on a Zen3, which has a similar difference between the
    Gforth code and the assembly-language code as the Zen4. The results are:

    c/it
    8 1) dynamic sequence
    8 2) dynamic primitive
    2 3) static primitive
    8 4) static-like dynamic primitive
    3 5) 4) with dynamic docol (see below)
    2 6) 5) with aligned dynamic docol (see below)

    So apparently the difference between static code and dynamic code
    causes the slowdown on Zen3 (and probably on Zen4).

    5) One reason could be that the dynamic code is far away in the
    address space from the static code of the docol. E.g., in one
    execution of 4) the code for docol starts at 0x00005558a3b5eac3 and
    the code for the dup-execute-exit starts at 0x00007f937beae764. In
    order to test this theory, I copied the docol code right behind the
    dup-execute-exit code and made the pointer to docol point to it. And
    indeed, the speed increased to 3 cycles/iteration.

    So the distance plays a role in Zen3 and probably others; I guess they
    do not store the full-length target address in the L1 BTB, and such a
    far branch therefore is never promoted to the L1 BTB; the branch
    therefore uses the L2 BTB and takes several cycles.
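    A back-of-the-envelope check of this theory against the measurements
    above (a sketch; attributing the entire difference to the two
    indirect branches is an assumption):

```python
# 4) far docol (L2 BTB, per the theory): 8 cycles/iteration
# 5) nearby docol copy:                  3 cycles/iteration
# The loop contains two indirect branches per iteration.
slow, fast, branches = 8, 3, 2
extra_per_branch = (slow - fast) / branches
print(extra_per_branch)  # 2.5 extra cycles per branch if the theory holds
```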

    6) There is still one cycle/iteration of difference between 3) and 5),
    but I guess this can be explained with the usual sources of
    variations, such as code alignment variations. I tried this theory by
    aligning the copied docol code to a 32-byte boundary. And that indeed
    produced 2 cycles/iteration.

    Another open issue is that the gcc-12 build of gforth-fast (using r13
    instead of r14) is 3 cycles slower than the gcc-10 build. I don't see
    an extension of my BTB theory that would explain this. So either my
    BTB theory is wrong or there is another effect at work.

    Here's how you can reproduce this:

    For adding the primitive, I added

    dup-execute-;s ( xt R:w -- xt ) gforth-internal dup_execute_semis
    SET_IP((Xt *)w);
    SUPER_END;
    VM_JUMP(EXEC1(xt));

    to the file prim in Gforth (commit
    d96c5dba9343e2b331e183b0594b6ee1622808f7) and rebuilt it (with
    gcc-10.2.1).

    The measurements were then done on a Ryzen 5800X with:

    1) perf stat -e cycles -e instructions ./gforth-fast -e "' disasm-gdb is discode : foo dup execute-;s ; ' foo foo"

    2) perf stat -e cycles -e instructions ./gforth-fast -e "' disasm-gdb is discode : foo dup-execute-;s ; ' foo foo"

    3) perf stat -e cycles -e instructions ./gforth-fast --no-dynamic -e "' disasm-gdb is discode : foo dup-execute-;s ; ' foo foo"

    4) perf stat -e cycles -e instructions ./gforth-fast -e "' disasm-gdb is discode : foo dup-execute-;s ; ' foo @ 4 + ' foo ! ' foo foo"

    5) perf stat -e cycles -e instructions ./gforth-fast -e "' disasm-gdb is discode : foo dup-execute-;s ; ' foo @ 4 + ' foo ! ' foo -2 cells + @ ' foo cell+ @ tuck 20 move ' foo -2 cells + ! ' foo foo"

    6) perf stat -e cycles -e instructions ./gforth-fast -e "' disasm-gdb is discode : foo dup-execute-;s ; ' foo @ 4 + ' foo ! ' foo -2 cells + @ ' foo cell+ @ 32 naligned tuck 20 move ' foo -2 cells + ! ' foo foo"

    This code always ends in an endless loop, so I pressed Ctrl-C after a
    second or so, and then computed

    (cycles/instructions)*(instructions/iteration)

    where instructions/iteration is 14 for 1), 13 for 2) and 12 for the others.
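    As a sketch, this computation for measurement 1) with the counter
    values from the first post looks like this:

```python
# Reconstruct cycles/iteration from the two perf counters; only their
# ratio matters, since the run is terminated by Ctrl-C at an arbitrary
# point.
def cycles_per_iteration(cycles, instructions, insns_per_iter):
    return cycles / instructions * insns_per_iter

# gcc-12 build of gforth-fast, 14 instructions per iteration:
print(round(cycles_per_iteration(21_969_657_501, 27_996_663_866, 14), 1))  # 11.0
```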

    - anton
  • From Michael S@21:1/5 to Anton Ertl on Tue Mar 11 13:25:13 2025
    On Tue, 11 Mar 2025 08:18:17 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    Another open issue is that the gcc-12 build of gforth-fast (using r13
    instead of r14) is 3 cycles slower than the gcc-10 build. I don't see
    an extension of my BTB theory that would explain this. So either my
    BTB theory is wrong or there is another effect at work.


    I tried to understand the Indirect Target Predictor paragraph in the
    Opt. Manual, but failed.
    Here is the text of this short paragraph for those who don't like to
    look for things themselves, but have a better chance than me
    of understanding what is going on (i.e. primarily for Mitch Alsup)

    2.8.1.4
    Indirect Target Predictor
    The processor implements a 1024-entry indirect target array used to
    predict the target of some non-RET indirect branches. If a branch has
    had multiple different targets, the indirect target predictor chooses
    among them using global history at L2 BTB correction latency.
    Branches that have so far always had the same target are predicted
    using the static target from the branch's BTB entry. This means the
    prediction latency for correctly predicted indirect branches is
    roughly 5-(3/N), where N is the number of different targets of the
    indirect branch. For these reasons, code should attempt to reduce the
    number of different targets per indirect branch.
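    Taking the formula at face value, the predicted latency for a
    correctly predicted indirect branch works out as (a sketch):

```python
# Prediction latency per the manual: roughly 5 - (3/N) cycles, where N
# is the number of distinct targets the branch has had so far.
def predicted_latency(n_targets):
    return 5 - 3 / n_targets

for n in (1, 2, 4):
    print(n, predicted_latency(n))  # 1 -> 2.0, 2 -> 3.5, 4 -> 4.25
```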

  • From Anton Ertl@21:1/5 to Michael S on Tue Mar 11 18:09:51 2025
    Michael S <already5chosen@yahoo.com> writes:
    Another open issue is that the gcc-12 build of gforth-fast (using r13
    instead of r14) is 3 cycles slower than the gcc-10 build. I don't see
    an extension of my BTB theory that would explain this. So either my
    BTB theory is wrong or there is another effect at work.


    I tried to understand the Indirect Target Predictor paragraph in the
    Opt. Manual, but failed.
    Here is the text of this short paragraph for those who don't like to
    look for things themselves, but have a better chance than me
    of understanding what is going on (i.e. primarily for Mitch Alsup)

    Thanks.

    2.8.1.4
    Indirect Target Predictor
    The processor implements a 1024-entry indirect target array used to
    predict the target of some non-RET indirect branches. If a branch has
    had multiple different targets, the indirect target predictor chooses
    among them using global history at L2 BTB correction latency.
    Branches that have so far always had the same target are predicted
    using the static target from the branch's BTB entry. This means the
    prediction latency for correctly predicted indirect branches is
    roughly 5-(3/N), where N is the number of different targets of the
    indirect branch. For these reasons, code should attempt to reduce the
    number of different targets per indirect branch.

    In the case of this microbenchmark, every indirect branch has only one
    target, and the fact that we see cases where this loop with two
    indirect branches is executed in 2 cycles indicates that such indirect
    branches can be performed in one cycle; that's probably the part about
    the "static target".

    What is written looks pretty clear to me; maybe when you have read
    the indirect-branch sections of several chipsandcheese articles, this
    all looks normal to you (although the formula looks curious to me).
    If you have any questions, I can give you my interpretation of what
    is written here.

    - anton