Michael S wrote:
On Tue, 5 Aug 2025 22:17:00 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Michael S wrote:
On Tue, 5 Aug 2025 17:31:34 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
In this case 'adc edx,edx' is just a slightly shorter encoding
of 'adc edx,0'. The EDX register is zeroed a few lines above.
OK, nice.
BTW, it seems that in your code fragment above you forgot to
zeroize EDX at the beginning of the iteration. Or am I missing
something?
No, you are not. I skipped pretty much all the setup code. :-)
Anyway, the three main ADD RAX,... operations still define the
minimum possible latency, right?
I don't think so.
It seems to me that there is only one chain of data dependencies
between iterations of the loop - a trivial dependency through RCX.
Some modern processors are already capable of eliminating this sort
of dependency in the renamer. Probably not yet when it is coded as
'inc', but when it is coded as 'add' or 'lea'.
The dependency through RDX/RBX does not form a chain. The next
value of [rdi+rcx*8] does depend on the value of rbx from the previous
iteration, but the next value of rbx depends only on [rsi+rcx*8],
[r8+rcx*8] and [r9+rcx*8]. It does not depend on the previous
value of rbx, except for a control dependency that hopefully would
be speculated around.
I believe we are doing a bigint three-way add, so each result word
depends on the three corresponding input words, plus any carries
from the previous round.
This is the carry chain that I don't see any obvious way to
break...
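In plain C terms, that chain looks like this (a sketch with illustrative
names, not code from the thread; 64-bit words assumed) - each iteration's
addition has to wait for the previous iteration's carry:

#include <stddef.h>
#include <stdint.h>

void add3_chained(uint64_t *dst, const uint64_t *a, const uint64_t *b,
                  const uint64_t *c, size_t n)
{
    unsigned carry = 0;                      /* 0..2 for a three-way add */
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 s = (unsigned __int128)a[i] + b[i] + c[i] + carry;
        dst[i] = (uint64_t)s;
        carry = (unsigned)(s >> 64);         /* next iteration waits for this */
    }
}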
You break the chain by *predicting* that
carry[i] = CARRY(a[i]+b[i]+c[i]+carry[i-1]) is equal to CARRY(a[i]+b[i]+c[i]).
If the prediction turns out wrong, then you pay the heavy price of a branch
misprediction. But outside of specially crafted inputs that is extremely rare.
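A rough C sketch of that prediction (illustrative names only, not code from
the thread; 64-bit words assumed): the carry out of word i is computed from
a[i]+b[i]+c[i] alone, and the rare case where adding the incoming carry
overflows as well is handled on a separate, rarely-taken path - the analogue
of the mispredicted branch above:

#include <stddef.h>
#include <stdint.h>

void add3_predicted(uint64_t *dst, const uint64_t *a, const uint64_t *b,
                    const uint64_t *c, size_t n)
{
    uint64_t carry_in = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t sum = a[i] + b[i];
        uint64_t carry = sum < a[i];         /* carry out of a[i]+b[i] */
        sum += c[i];
        carry += sum < c[i];                 /* carry out of (a[i]+b[i])+c[i] */
        uint64_t result = sum + carry_in;
        if (result < sum)                    /* rare: incoming carry overflowed too */
            carry++;
        dst[i] = result;
        carry_in = carry;                    /* independent of the old carry_in,
                                                except through the rare branch above */
    }
}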
Aha!
That's _very_ nice.
Terje
On Wed, 6 Aug 2025 16:19:11 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Michael S wrote:
On Tue, 5 Aug 2025 22:17:00 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Michael S wrote:
On Tue, 5 Aug 2025 17:31:34 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
In this case 'adc edx,edx' is just a slightly shorter encoding
of 'adc edx,0'. The EDX register is zeroed a few lines above.
OK, nice.
BTW, it seems that in your code fragment above you forgot to
zeroize EDX at the beginning of the iteration. Or am I missing
something?
No, you are not. I skipped pretty much all the setup code. :-)
What looks missing to me is not setup code, but the zeroing of RDX in
the body of the loop.
I did a few tests on a few machines: Raptor Cove (i7-14700 P core),
Gracemont (i7-14700 E core), Skylake-C (Xeon E-2176G) and Zen3 (EPYC
7543P).
In order to see the effects more clearly I had to modify Anton's function
to one that operates on pointers, because otherwise too much time was
spent at the call site copying things around, which made the
measurements too noisy.
void add3(uintNN_t *dst, const uintNN_t* a, const uintNN_t* b, const uintNN_t* c) {
*dst = *a + *b + *c;
}
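The definition of uintNN_t is not shown in the post; presumably (an
assumption) it is a C23 bit-precise integer, which current gcc and clang
accept on x86-64:

typedef unsigned _BitInt(65472) uintNN_t;   /* assumed; N = 65472 bits */

With such a type the compiler expands *a + *b + *c into a word-by-word loop
with carry propagation, which is what the hand-written variants below are
measured against.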
After the change I saw a significant speed-up on 3 out of the 4 platforms.
The only platform where the speed-up was insignificant was Skylake,
probably because its rename stage is too narrow to profit from the change.
The widest machine (Raptor Cove) benefited most.
The results appear inconclusive with regard to the question of whether the
dependency between loop iterations is eliminated completely or just
shortened to 1-2 clock cycles per iteration. Even the widest of my
cores is relatively narrow. Considering that my variant of the loop contains
13 x86-64 instructions and 16 uOps, I am afraid that even the likes of Apple
M4 would be too narrow :(
Here are the results in nanoseconds for N=65472:
Platform        RC      GM      SK      Z3
clang          896.1  1476.7  1453.2  1348.0
gcc            879.2  1661.4  1662.9  1655.0
x86            585.8  1489.3   901.5   672.0
Terje's        772.6  1293.2  1012.6  1127.0
My             397.5   803.8   965.3   660.0
ADX            579.1  1650.1   728.9   853.0
x86/u2         581.5  1246.2   679.9   584.0
Terje's/u3     503.7   954.3   630.9   755.0
My/u3          266.6   487.2   486.5   440.0
ADX/u8         350.4   839.3   490.4   451.0
'x86' is a variant that was sketched in one of my posts above. It
calculates the sum in two passes over the arrays (see the sketch below for
one possible reading of the two-pass shape).
'ADX' is a variant that uses ADCX/ADOX instructions as suggested by
Anton, but unlike his suggestion does it in a loop rather than in a long
straight-line code sequence.
/u2, /u3, /u8 indicate unroll factors of the inner loop.
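One possible reading of the two-pass idea (an assumption about the general
shape only, not Michael's actual 'x86' code; illustrative names, 64-bit
words): each pass is a plain two-operand add-with-carry over the whole array.

#include <stddef.h>
#include <stdint.h>

static void add2(uint64_t *dst, const uint64_t *x, const uint64_t *y, size_t n)
{
    unsigned carry = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 s = (unsigned __int128)x[i] + y[i] + carry;
        dst[i] = (uint64_t)s;
        carry = (unsigned)(s >> 64);
    }
}

void add3_two_pass(uint64_t *dst, const uint64_t *a, const uint64_t *b,
                   const uint64_t *c, size_t n)
{
    add2(dst, a, b, n);    /* pass 1: dst = a + b       */
    add2(dst, dst, c, n);  /* pass 2: dst = (a + b) + c */
}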
Frequency:
RC 5.30 GHz (Est)
GM 4.20 GHz (Est)
SK 4.25 GHz
Z3 3.70 GHz
For extremely wide cores, like Apple's M (modulo ISA), AMD Zen5 and
Intel Lion Cove, I'd do the following modification to your inner loop
(back in Intel syntax):
xor ebx,ebx
next:
xor edx, edx
mov rax,[rsi+rcx*8]
add rax,[r8+rcx*8]
adc edx,edx
add rax,[r9+rcx*8]
adc edx,0
add rbx,rax
jc increment_edx
; eliminate data dependency between loop iteration
; replace it by very predictable control dependency
edx_ready:
mov edx, ebx
mov [rdi+rcx*8],rax
inc rcx
cmp rcx,r10
jb next
...
ret
; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never.
increment_edx:
inc edx
jmp edx_ready
The idea is interesting, but I don't understand the code. The
following looks funny to me:
1) You increment edx in increment_edx, then jump back to edx_ready and
immediately overwrite edx with ebx. Then you do nothing with it,
and then you clear edx in the next iteration. So both the "inc
edx" and the "mov edx, ebx" look like dead code to me that can be
optimized away.
2) There is a loop-carried dependency through ebx; a sum accumulates in
   ebx, and the carry check makes no sense with that.
Could it be that you wanted to do "mov ebx, edx" at edx_ready? It all
makes more sense with that. ebx then contains the carry from the last
cycle on entry. The carry dependency chain starts at clearing edx,
then gets to additional carries, then is copied to ebx, transferred
into the next iteration, and is ended there by overwriting ebx. No
dependency cycles (except the loop counter and addresses, which can be
dealt with by hardware or by unrolling), and ebx contains the carry
from the last iteration.
One other problem is that according to Agner Fog's instruction tables,
even the latest and greatest CPUs from AMD and Intel that he measured
(Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
and adc has a latency of 1, so breaking the dependency chain in a
beneficial way should avoid the use of adc. For our three-summand
add, it's not clear if adcx and adox can run in the same cycle, but
looking at your measurements, it is unlikely.
So we would need something other than "adc edx, edx" to set the carry
register. According to Agner Fog, Zen3 can perform 2 cmovc per cycle
(and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
have 1 in edi, and then do, for two-summand addition:
mov edi,1
xor ebx,ebx
next:
xor edx, edx
mov rax,[rsi+rcx*8]
add rax,[r8+rcx*8]
cmovc edx, edi
add rbx,rax
jc increment_edx
; eliminate data dependency between loop iteration
; replace it by very predictable control dependency
edx_ready:
mov edx, ebx
mov [rdi+rcx*8],rax
inc rcx
cmp rcx,r10
jb next
...
ret
; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never.
increment_edx:
inc edx
jmp edx_ready
Forgot to fix the "mov edx, ebx" here. One other thing: I think that
the "add rbx, rax" should be "add rax, rbx". You want to add the
carry to rax before storing the result. So the version with just one iteration would be:
mov edi,1
xor ebx,ebx
next:
xor edx, edx
mov rax,[rsi+rcx*8]
add rax,[r8+rcx*8]
cmovc edx, edi
add rax,rbx
jc increment_edx
; eliminate data dependency between loop iteration
; replace it by very predictable control dependency
edx_ready:
mov ebx, edx
mov [rdi+rcx*8],rax
inc rcx
cmp rcx,r10
jb next
...
ret
; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never.
increment_edx:
inc edx
jmp edx_ready
And the version with the two additional adc-using iterations would be
(with an additional correction):
mov edi,1
xor ebx,ebx
next:
mov rax,[rsi+rcx*8]
add [r8+rcx*8], rax
mov rax,[rsi+rcx*8+8]
adc [r8+rcx*8+8], rax
xor edx, edx
mov rax,[rsi+rcx*8+16]
adc rax,[r8+rcx*8+16]
cmovc edx, edi
add rax,rbx
jc increment_edx
; eliminate data dependency between loop iteration
; replace it by very predictable control dependency
edx_ready:
mov ebx, edx
mov [rdi+rcx*8+16],rax
add rcx,3
cmp rcx,r10
jb next
...
ret
; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never.
increment_edx:
inc edx
jmp edx_ready
- anton
Anton, I like what you and Michael have done, but I'm still not sure
everything is OK:
In your code, I only see two input arrays [rsi] and [r8], instead of
three? (Including [r9])
It would also be possible to use SETC to save the intermediate carries...
On Tue, 19 Aug 2025 05:47:01 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
One other problem is that according to Agner Fog's instruction tables,
even the latest and greatest CPUs from AMD and Intel that he measured
(Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
I didn't measure on either TGL or Zen5, but both Raptor Cove and Zen3
are certainly capable of more than 1 adcx/adox per cycle.
Below are execution times of very heavily unrolled adcx/adox code with the
dependency broken by a trick similar to the one above:
Platform           RC     GM     SK     Z3
add3_my_adx_u17   244.5  471.1  482.4  407.0
Considering that there are 2166 adcx/adox/adc instructions, we have the
following number of adcx/adox/adc instructions per clock (for example, on
RC 244.5 ns at 5.30 GHz is about 1296 clock cycles, and 2166/1296 ≈ 1.67):
Platform   RC    GM    SK    Z3
           1.67  1.10  1.05  1.44
For Gracemont and Skylake there is a possibility of a small measurement
error, but Raptor Cove appears to be capable of at least 2 instructions of
this type per clock, while Zen3 is capable of at least 1.5, but more likely
also 2.
It looks to me like the bottleneck on both RC and Z3 is either the rename
stage or, more likely, L1$ access. It seems that while Golden/Raptor Cove
can occasionally issue 3 loads + 2 stores per clock, it cannot sustain
more than 3 load-or-store accesses per clock.
Code:
.file "add3_my_adx_u17.s"
.text
.p2align 4
.globl add3
.def add3; .scl 2; .type 32; .endef
.seh_proc add3
add3:
pushq %rsi
.seh_pushreg %rsi
pushq %rbx
.seh_pushreg %rbx
.seh_endprologue
# %rcx - dst
# %rdx - a
# %r8 - b
# %r9 - c
sub %rdx, %rcx
mov %rcx, %r10 # r10 = dst - a
sub %rdx, %r8 # r8 = b - a
sub %rdx, %r9 # r9 = c - a
mov %rdx, %r11 # r11 = a
mov $60, %edx
xor %ecx, %ecx
.p2align 4
.loop:
xor %ebx, %ebx # CF <= 0, OF <= 0, EBX <= 0
mov (%r11), %rsi # rsi = a[i]
adcx (%r11,%r8), %rsi # rsi += b[i] + CF (carry chain 1, in CF)
adox (%r11,%r9), %rsi # rsi += c[i] + OF (carry chain 2, in OF)
mov 8(%r11), %rax # rax = a[i+1]
adcx 8(%r11,%r8), %rax # rax += b[i+1] + CF
adox 8(%r11,%r9), %rax # rax += c[i+1] + OF
mov %rax, 8(%r10,%r11) # dst[i+1] = rax
Very impressive Michael!
I particularly like how you are interleaving ADOX and ADCX to gain
two carry bits without having to save them off to an additional
register.
Terje
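For completeness, the same two-independent-chains idea can be written in C
with the ADX intrinsics from <immintrin.h> (a sketch assuming 64-bit words;
whether a given compiler really keeps one chain in CF and one in OF, the way
the hand-written ADCX/ADOX code does, is not guaranteed):

#include <immintrin.h>  /* _addcarryx_u64; build with ADX enabled, e.g. -madx */

/* One word of the three-operand add: chain 1 accumulates the a+b carries,
   chain 2 accumulates the +c carries, and the two chains never serialize
   on each other. Names are illustrative. */
static inline void add3_word(unsigned long long a, unsigned long long b,
                             unsigned long long c,
                             unsigned char *cf, unsigned char *of,
                             unsigned long long *out)
{
    unsigned long long t;
    *cf = _addcarryx_u64(*cf, a, b, &t);   /* intended to map to ADCX (CF chain) */
    *of = _addcarryx_u64(*of, t, c, out);  /* intended to map to ADOX (OF chain) */
}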