I have the sequence
1 add $0x8,%rbx
2 sub $0x8,%r13
3 mov %rbx,0x0(%r13)
4 mov %rdx,%rbx
5 mov (%rbx),%rax
6 jmp *%rax
7 mov %r8,%r15
8 add $0x10,%rbx
9 mov 0x0(%r13),%rbx
10 mov -0x10(%r15),%rax
11 mov %r15,%rdx
12 add $0x8,%r13
13 sub $0x8,%rbx
14 jmp *%rax
The contents of the registers and memory are such that the first jmp
continues at the next instruction in the sequence and the second jmp
continues at the top of the sequence. I measure this sequence with
perf stat on a Zen4, terminating it with Ctrl-C, and get output like:
21969657501 cycles
27996663866 instructions # 1.27 insn per cycle
I.e., about 11 cycles for the whole sequence of 14 instructions. In
trying to understand where these 11 cycles come from, I asked
llvm-mca with
cat xxx.s|llvm-mca-16 -mcpu=znver4 --iterations=1000
and it tells me that it thinks that 1000 iterations take 2342 cycles:
Iterations: 1000
Instructions: 14000
Total Cycles: 2342
Total uOps: 14000
Dispatch Width: 6
uOps Per Cycle: 5.98
IPC: 5.98
Block RThroughput: 2.3
So llvm-mca does not predict the actual performance correctly in this
case and I still have no explanation for the 11 cycles.
Even more puzzling: In order to experiment with removing instructions
I recreated this in assembly language:
.text
.globl main
main:
mov $threaded, %rdx
mov $0, %rbx
mov $(returnstack+8),%r13
mov %rdx, %r8
docol:
add $0x8,%rbx
sub $0x8,%r13
mov %rbx,0x0(%r13)
mov %rdx,%rbx
mov (%rbx),%rax
jmp *%rax
outout:
mov %r8,%r15
add $0x10,%rbx
mov 0x0(%r13),%rbx
mov -0x10(%r15),%rax
mov %r15,%rdx
add $0x8,%r13
sub $0x8,%rbx
jmp *%rax
.data
.quad docol
.quad 0
threaded:
.quad outout
returnstack:
.zero 16,0
I assembled and linked this with:
gcc xxx.s -Wl,-no-pie
I ran the result with
perf stat -e cycles -e instructions a.out
terminated it with Ctrl-C and the result is:
10764822288 cycles
64556841216 instructions # 6.00 insn per cycle
I.e., as predicted by llvm-mca. The main difference AFAICS is that in
the slow version docol and outout are not adjacent, but far from each
other, and returnstack is also not close to threaded (and the two
64-bit words before it that also belong to threaded).
It looks like I have found a microarchitectural pitfall, but it's not
clear what it is.
- anton
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:...
How about giving us the original source function? My x86 is rusty,
and it is helpful to plug source into Compiler Explorer to see what
different compilers do.
Does it matter which core it is running on? Performance or economy?
It looks like llvm-mca is calculating about 6 instructions per cycle
(14000/2342), the same as what you would expect. Could there be
something else interfering with the performance stat (interrupts?)
And here are measurements with the gcc-10 build on various other
microarchitectures (IPC=14/(c/it)); lower c/it numbers are better.
         cyc/it
      gf     as
       8    2.3   Zen4
       8    3     Zen3
       4    3     Zen2
       9    9     Zen
     2.4    2.4   Golden Cove
       3          Rocket Lake
       6    3     Gracemont
    10.6          Tremont
It's interesting that several microarchitectures show a difference
between the version of the code produced by gforth-fast (gf) and my
assembly-language variant (as) that executes the same instruction
sequences.
Another open issue is that the gcc-12 build of gforth-fast (using r13
instead of r14) is 3 cycles slower than the gcc-10 build. I don't see
an extension of my BTB theory that would explain this. So either my
BTB theory is wrong or there is another effect at work.
I tried to understand the Indirect Target Predictor paragraph in the
Optimization Manual, but failed. Here is the text of this short
paragraph for those who don't like to look for things themselves, but
have a better chance than me of understanding what is going on
(i.e. primarily for Mitch Alsup):
2.8.1.4
Indirect Target Predictor
The processor implements a 1024-entry indirect target array used to
predict the target of some non-RET indirect branches. If a branch has
had multiple different targets, the indirect target predictor chooses
among them using global history at L2 BTB correction latency.
Branches that have so far always had the same target are predicted
using the static target from the branch's BTB entry. This means the
prediction latency for correctly predicted indirect branches is
roughly 5-(3/N), where N is the number of different targets of the
indirect branch. For these reasons, code should attempt to reduce the
number of different targets per indirect branch.