I have the sequence
1 add $0x8,%rbx
2 sub $0x8,%r13
3 mov %rbx,0x0(%r13)
4 mov %rdx,%rbx
5 mov (%rbx),%rax
6 jmp *%rax
7 mov %r8,%r15
8 add $0x10,%rbx
9 mov 0x0(%r13),%rbx
10 mov -0x10(%r15),%rax
11 mov %r15,%rdx
12 add $0x8,%r13
13 sub $0x8,%rbx
14 jmp *%rax
The contents of the registers and memory are such that the first jmp
continues at the next instruction in the sequence and the second jmp
continues at the top of the sequence. I measure this sequence with
perf stat on a Zen4, terminating it with Ctrl-C, and get output like:
21969657501 cycles
27996663866 instructions # 1.27 insn per cycle
I.e., about 11 cycles for the whole sequence of 14 instructions. In
trying to understand where these 11 cycles come from, I asked
llvm-mca with
cat xxx.s|llvm-mca-16 -mcpu=znver4 --iterations=1000
and it tells me that it thinks that 1000 iterations take 2342 cycles:
Iterations: 1000
Instructions: 14000
Total Cycles: 2342
Total uOps: 14000
Dispatch Width: 6
uOps Per Cycle: 5.98
IPC: 5.98
Block RThroughput: 2.3
So llvm-mca does not predict the actual performance correctly in this
case and I still have no explanation for the 11 cycles.
Even more puzzling: In order to experiment with removing instructions
I recreated this in assembly language:
.text
.globl main
main:
mov $threaded, %rdx
mov $0, %rbx
mov $(returnstack+8),%r13
mov %rdx, %r8
docol:
add $0x8,%rbx
sub $0x8,%r13
mov %rbx,0x0(%r13)
mov %rdx,%rbx
mov (%rbx),%rax
jmp *%rax
outout:
mov %r8,%r15
add $0x10,%rbx
mov 0x0(%r13),%rbx
mov -0x10(%r15),%rax
mov %r15,%rdx
add $0x8,%r13
sub $0x8,%rbx
jmp *%rax
.data
.quad docol
.quad 0
threaded:
.quad outout
returnstack:
.zero 16,0
I assembled and linked this with:
gcc xxx.s -Wl,-no-pie
I ran the result with
perf stat -e cycles -e instructions a.out
terminated it with Ctrl-C and the result is:
10764822288 cycles
64556841216 instructions # 6.00 insn per cycle
I.e., as predicted by llvm-mca. The main difference AFAICS is that in
the slow version docol and outout are not adjacent, but far from each
other, and returnstack is also not close to threaded (and the two
64-bit words before it that also belong to threaded).
It looks like I have found a microarchitectural pitfall, but it's not
clear what it is.
- anton
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:...
How about giving us the original source function? My x86 is rusty,
and it is helpful to plug source into Compiler Explorer to see what
different compilers do.
Does it matter which core it is running on? Performance or economy?
It looks like llvm-mca is calculating about 6 instructions per cycle
(14000/2342), the same as what you would expect. Could there be
something else interfering with the performance stat (interrupts?)
And here are measurements with the gcc-10 build on various other
microarchitectures (IPC=14/(c/it)); lower c/it numbers are better.
         cyc/it
      gf     as
       8    2.3   Zen4
       8    3     Zen3
       4    3     Zen2
       9    9     Zen
     2.4    2.4   Golden Cove
       3          Rocket Lake
       6    3     Gracemont
    10.6          Tremont
It's interesting that several microarchitectures show a difference
between the version of the code produced by gforth-fast (gf) and my
assembly-language variant (as) that executes the same instruction
sequences.
Another open issue is that the gcc-12 build of gforth-fast (using r13
instead of r14) is 3 cycles slower than the gcc-10 build. I don't see
an extension of my BTB theory that would explain this. So either my
BTB theory is wrong or there is another effect at work.
I tried to understand the Indirect Target Predictor paragraph in the
Optimization Manual, but failed. Here is the text of this short
paragraph for those who don't like to look for things themselves, but
have a better chance than me of understanding what is going on
(i.e. primarily for Mitch Alsup):
2.8.1.4
Indirect Target Predictor
The processor implements a 1024-entry indirect target array used to
predict the target of some non-RET indirect branches. If a branch has
had multiple different targets, the indirect target predictor chooses
among them using global history at L2 BTB correction latency.
Branches that have so far always had the same target are predicted
using the static target from the branch's BTB entry. This means the
prediction latency for correctly predicted indirect branches is
roughly 5-(3/N), where N is the number of different targets of the
indirect branch. For these reasons, code should attempt to reduce the
number of different targets per indirect branch.