• 3-way long addition (was: VAX)

    From Michael S@21:1/5 to Terje Mathisen on Wed Aug 6 20:43:33 2025
    On Wed, 6 Aug 2025 16:19:11 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 22:17:00 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 17:31:34 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    In this case 'adc edx,edx' is just a slightly shorter encoding
    of 'adc edx,0'. The EDX register is zeroized a few lines above.

    OK, nice.

    BTW, it seems that in your code fragment above you forgot to
    zeroize EDX at the beginning of the iteration. Or am I missing
    something?

    No, you are not. I skipped pretty much all the setup code. :-)

    It's not setup code that looks missing to me, but the zeroing of RDX
    in the body of the loop.



    Anyway, the three main ADD RAX,... operations still define the
    minimum possible latency, right?


    I don't think so.
    It seems to me that there is only one chain of data dependencies
    between iterations of the loop - a trivial dependency through RCX.
    Some modern processors are already capable of eliminating this sort
    of dependency in the renamer. Probably not yet when it is coded as
    'inc', but when coded as 'add' or 'lea'.

    The dependency through RDX/RBX does not form a chain. The next
    value of [rdi+rcx*8] does depend on the value of rbx from the previous
    iteration, but the next value of rbx depends only on [rsi+rcx*8],
    [r8+rcx*8] and [r9+rcx*8]. It does not depend on the previous
    value of rbx, except for a control dependency that hopefully would
    be speculated around.

    I believe we are doing a bigint three-way add, so each result word
    depends on the three corresponding input words, plus any carries
    from the previous round.

    This is the carry chain that I don't see any obvious way to
    break...

    You break the chain by *predicting* that
    carry[i] = CARRY(a[i]+b[i]+c[i]+carry[i-1]) is equal to
    CARRY(a[i]+b[i]+c[i]). If the prediction turns out wrong then you
    pay the heavy price of a branch misprediction. But outside of specially
    crafted inputs that is extremely rare.

    Aha!

    That's _very_ nice.

    Terje



    I did a few tests on a few machines: Raptor Cove (i7-14700 P core),
    Gracemont (i7-14700 E core), Skylake-C (Xeon E-2176G) and Zen3 (EPYC
    7543P).
    In order to see the effects more clearly I had to modify Anton's function
    to one that operates on pointers, because otherwise too much time was
    spent at the caller's site copying things around, which made the
    measurements too noisy.

    void add3(uintNN_t *dst, const uintNN_t* a, const uintNN_t* b, const
    uintNN_t* c) {
    *dst = *a + *b + *c;
    }
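
    (uintNN_t is not defined in the quoted posts; for N=65472 bits one
    plausible stand-in, as an assumption rather than the measured definition,
    is a C23 bit-precise integer:

        typedef unsigned _BitInt(65472) uintNN_t;  /* assumed, for illustration */
    )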


    After the change I saw a significant speed-up on 3 out of 4 platforms.
    The only platform where the speed-up was insignificant was Skylake,
    probably because its rename stage is too narrow to profit from the
    change. The widest machine (Raptor Cove) benefited most.
    The results appear inconclusive with regard to the question of whether
    the dependency between loop iterations is eliminated completely or just
    shortened to 1-2 clock cycles per iteration. Even the widest of my
    cores is relatively narrow. Considering that my variant of the loop
    contains 13 x86-64 instructions and 16 uOps, I am afraid that even the
    likes of Apple M4 would be too narrow :(
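
    For reference, the carry-prediction idea described above, written out
    word-at-a-time in plain C (illustrative only; the names are made up and
    this is not the measured loop):

    #include <stdint.h>
    #include <stddef.h>

    static void add3_words(uint64_t *dst, const uint64_t *a,
                           const uint64_t *b, const uint64_t *c, size_t n)
    {
        uint64_t carry = 0;                  /* 0, 1 or 2, from the previous word */
        for (size_t i = 0; i < n; i++) {
            uint64_t next = 0;
            uint64_t s = a[i] + b[i];
            next += (s < a[i]);              /* carry out of a+b */
            s += c[i];
            next += (s < c[i]);              /* carry out of (a+b)+c */
            uint64_t r = s + carry;          /* fold in the incoming carry */
            if (r < s)                       /* rare fixup: the 0..2 add wrapped */
                next += 1;
            dst[i] = r;
            carry = next;                    /* normally independent of the incoming carry */
        }
    }

    The point is that 'next' is computed from a[i], b[i] and c[i] alone,
    except in the rare wrap-around case, so the loop-carried data dependency
    is replaced by an almost-never-taken control dependency.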

    Here are results in nanoseconds for N=65472
    Platform RC GM SK Z3
    clang 896.1 1476.7 1453.2 1348.0
    gcc 879.2 1661.4 1662.9 1655.0
    x86 585.8 1489.3 901.5 672.0
    Terje's 772.6 1293.2 1012.6 1127.0
    My 397.5 803.8 965.3 660.0
    ADX 579.1 1650.1 728.9 853.0
    x86/u2 581.5 1246.2 679.9 584.0
    Terje's/u3 503.7 954.3 630.9 755.0
    My/u3 266.6 487.2 486.5 440.0
    ADX/u8 350.4 839.3 490.4 451.0

    'x86' is a variant that was sketched in one of my above
    posts. It calculates the sum in two passes over the arrays.
    'ADX' is a variant that uses ADCX/ADOX instructions as suggested by
    Anton, but unlike his suggestion does it in a loop rather than in a long
    straight-line code sequence.
    /u2, /u3, /u8 indicate unroll factors of the inner loop.

    Frequency:
    RC 5.30 GHz (Est)
    GM 4.20 GHz (Est)
    SK 4.25 GHz
    Z3 3.70 GHz

  • From Terje Mathisen@21:1/5 to Michael S on Thu Aug 7 15:15:17 2025
    Michael S wrote:
    On Wed, 6 Aug 2025 16:19:11 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 22:17:00 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 17:31:34 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    In this case 'adc edx,edx' is just a slightly shorter encoding
    of 'adc edx,0'. The EDX register is zeroized a few lines above.

    OK, nice.

    BTW, it seems that in your code fragment above you forgot to
    zeroize EDX at the beginning of the iteration. Or am I missing
    something?

    No, you are not. I skipped pretty much all the setup code. :-)

    It's not setup code that looks missing to me, but the zeroing of RDX
    in the body of the loop.

    I don't remember my code exactly, but the intent was that RDX would
    contain any incoming carries (0,1,2) from the previous iteration.

    Using ADCX/ADOX would not be an obvious speedup, at least not obvious to me.

    Terje

    I did a few tests on a few machines: Raptor Cove (i7-14700 P core),
    Gracemont (i7-14700 E core), Skylake-C (Xeon E-2176G) and Zen3 (EPYC
    7543P).
    In order to see the effects more clearly I had to modify Anton's function
    to one that operates on pointers, because otherwise too much time was
    spent at the caller's site copying things around, which made the
    measurements too noisy.

    void add3(uintNN_t *dst, const uintNN_t* a, const uintNN_t* b, const uintNN_t* c) {
    *dst = *a + *b + *c;
    }


    After the change I saw a significant speed-up on 3 out of 4 platforms.
    The only platform where the speed-up was insignificant was Skylake,
    probably because its rename stage is too narrow to profit from the
    change. The widest machine (Raptor Cove) benefited most.
    The results appear inconclusive with regard to the question of whether
    the dependency between loop iterations is eliminated completely or just
    shortened to 1-2 clock cycles per iteration. Even the widest of my
    cores is relatively narrow. Considering that my variant of the loop
    contains 13 x86-64 instructions and 16 uOps, I am afraid that even the
    likes of Apple M4 would be too narrow :(

    Here are results in nanoseconds for N=65472
    Platform RC GM SK Z3
    clang 896.1 1476.7 1453.2 1348.0
    gcc 879.2 1661.4 1662.9 1655.0
    x86 585.8 1489.3 901.5 672.0
    Terje's 772.6 1293.2 1012.6 1127.0
    My 397.5 803.8 965.3 660.0
    ADX 579.1 1650.1 728.9 853.0
    x86/u2 581.5 1246.2 679.9 584.0
    Terje's/u3 503.7 954.3 630.9 755.0
    My/u3 266.6 487.2 486.5 440.0
    ADX/u8 350.4 839.3 490.4 451.0

    'x86' is a variant that was sketched in one of my above
    posts. It calculates the sum in two passes over the arrays.
    'ADX' is a variant that uses ADCX/ADOX instructions as suggested by
    Anton, but unlike his suggestion does it in a loop rather than in a long
    straight-line code sequence.
    /u2, /u3, /u8 indicate unroll factors of the inner loop.

    Frequency:
    RC 5.30 GHz (Est)
    GM 4.20 GHz (Est)
    SK 4.25 GHz
    Z3 3.70 GHz


    Thanks for an interesting set of tests/results!

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Anton Ertl@21:1/5 to Michael S on Tue Aug 19 05:47:01 2025
    Michael S <already5chosen@yahoo.com> writes:
    For extremely wide cores, like Apple's M (modulo ISA), AMD Zen5 and
    Intel Lion Cove, I'd do the following modification to your inner loop
    (back in Intel syntax):

    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    adc edx,edx
    add rax,[r9+rcx*8]
    adc edx,0
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov edx, ebx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret


    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never
    incremen_edx:
    inc edx
    jmp edx_ready

    The idea is interesting, but I don't understand the code. The
    following looks funny to me:

    1) You increment edx in increment_edx, then jump back to edx_ready and
    immediately overwrite edx with ebx. Then you do nothing with it,
    and then you clear edx in the next iteration. So both the "inc
    edx" and the "mov edx, ebx" look like dead code to me that can be
    optimized away.

    2) There is a loop-carried dependency through ebx, and the number
    accumulating in ebx and the carry check makes no sense with that.

    Could it be that you wanted to do "mov ebx, edx" at edx_ready? It all
    makes more sense with that. ebx then contains the carry from the last
    cycle on entry. The carry dependency chain starts at clearing edx,
    then gets to additional carries, then is copied to ebx, transferred
    into the next iteration, and is ended there by overwriting ebx. No
    dependency cycles (except the loop counter and addresses, which can be
    dealt with by hardware or by unrolling), and ebx contains the carry
    from the last iteration

    One other problem is that according to Agner Fog's instruction tables,
    even the latest and greatest CPUs from AMD and Intel that he measured
    (Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
    and adc has a latency of 1, so breaking the dependency chain in a
    beneficial way should avoid the use of adc. For our three-summand
    add, it's not clear if adcx and adox can run in the same cycle, but
    looking at your measurements, it is unlikely.

    So we would need something other than "adc edx, edx" to set the carry
    register. According to Agner Fog Zen3 can perform 2 cmovc per cycle
    (and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
    have 1 in edi, and then do, for two-summand addition:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov edx, ebx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never
    incremen_edx:
    inc edx
    jmp edx_ready

    However, even without the loop overhead (which can be reduced with
    unrolling) that's 8 instructions per iteration, and therefore we will
    have a hard time executing it at less than 1 cycle/iteration on current
    CPUs. What if we mix in some adc-based stuff to bring down the
    instruction count? E.g., with one adc-based and one cmov-based
    iteration:

    mov edi,1
    xor ebx,ebx
    next:
    mov rax,[rsi+rcx*8]
    add [r8+rcx*8], rax
    mov rax,[rsi+rcx*8+8]
    add [r8+rcx*8+8], rax
    xor edx, edx
    mov rax,[rsi+rcx*8+16]
    adc rax,[r8+rcx*8+16]
    cmovc edx, edi
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8+16],rax
    add rcx,3
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never
    incremen_edx:
    inc edx
    jmp edx_ready

    Now we have 15 instructions per unrolled iteration (3 original
    iterations). Executing an unrolled iteration in less than three
    cycles might be in reach for Zen3 and Raptor Cove (I don't remember if
    all the other resource limits are also satisfied; the load/store unit
    may be at its limit, too).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to Anton Ertl on Tue Aug 19 07:09:51 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    The idea is interesting, but I don't understand the code. The
    following looks funny to me:

    1) You increment edx in increment_edx, then jump back to edx_ready and
    immediately overwrite edx with ebx. Then you do nothing with it,
    and then you clear edx in the next iteration. So both the "inc
    edx" and the "mov edx, ebx" look like dead code to me that can be
    optimized away.

    2) There is a loop-carried dependency through ebx, and the number
    accumulating in ebx and the carry check makes no sense with that.

    Could it be that you wanted to do "mov ebx, edx" at edx_ready? It all
    makes more sense with that. ebx then contains the carry from the last
    cycle on entry. The carry dependency chain starts at clearing edx,
    then gets to additional carries, then is copied to ebx, transferred
    into the next iteration, and is ended there by overwriting ebx. No
    dependency cycles (except the loop counter and addresses, which can be
    dealt with by hardware or by unrolling), and ebx contains the carry
    from the last iteration

    One other problem is that according to Agner Fog's instruction tables,
    even the latest and greatest CPUs from AMD and Intel that he measured
    (Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
    and adc has a latency of 1, so breaking the dependency chain in a
    beneficial way should avoid the use of adc. For our three-summand
    add, it's not clear if adcx and adox can run in the same cycle, but
    looking at your measurements, it is unlikely.

    So we would need something other than "adc edx, edx" to set the carry
    register. According to Agner Fog Zen3 can perform 2 cmovc per cycle
    (and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
    have 1 in edi, and then do, for two-summand addition:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov edx, ebx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never
    incremen_edx:
    inc edx
    jmp edx_ready

    Forgot to fix the "mov edx, ebx" here. One other thing: I think that
    the "add rbx, rax" should be "add rax, rbx". You want to add the
    carry to rax before storing the result. So the version with just one
    iteration would be:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never
    incremen_edx:
    inc edx
    jmp edx_ready

    And the version with the two additional adc-using iterations would be
    (with an additional correction):

    mov edi,1
    xor ebx,ebx
    next:
    mov rax,[rsi+rcx*8]
    add [r8+rcx*8], rax
    mov rax,[rsi+rcx*8+8]
    adc [r8+rcx*8+8], rax
    xor edx, edx
    mov rax,[rsi+rcx*8+16]
    adc rax,[r8+rcx*8+16]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8+16],rax
    add rcx,3
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never
    incremen_edx:
    inc edx
    jmp edx_ready

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Terje Mathisen@21:1/5 to Anton Ertl on Tue Aug 19 12:11:56 2025
    Anton, I like what you and Michael have done, but I'm still not sure
    everything is OK:

    In your code, I only see two input arrays [rsi] and [r8], instead of
    three? (Including [r9])

    Re breaking dependency chains (from Michael):

    In each iteration we have four inputs:

    carry_in from the previous iteration, [rsi+rcx*8], [r8+rcx*8] and
    [r9+rcx*8], and we want to generate [rdi+rcx*8] and the carry_out.

    Assuming effectively random inputs, cin+[rsi]+[r8]+[r9] will result in
    random low-order 64 bits in [rdi], and either 0, 1 or 2 as carry_out.

    In order to break the per-iteration dependency (per Michael), it is
    sufficient to branch out IFF adding cin to the 3-sum produces an
    additional carry:

    ; rdx = cin (0,1,2)
    next:
    mov rbx,rdx ; Save CIN
    xor rdx,rdx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    adc rdx,rdx ; RDX = 0 or 1 (50:50)
    add rax,[r9+rcx*8]
    adc rdx,0 ; RDX = 0, 1 or 2 (33:33:33)

    ; At this point RAX has the 3-sum, now do the cin 0..2 add

    add rax,rbx
    jc fixup ; Pretty much never taken

    save:
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret

    fixup:
    inc rdx
    jmp save

    It would also be possible to use SETC to save the intermediate carries...
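
    (A sketch of that SETC variant, with the same memory operands as above;
    the second scratch byte register is purely for illustration, since rbx
    already holds the saved cin in the loop above:)

    xor edx,edx
    xor ebx,ebx          ; illustration only - pick a free register in real code
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    setc dl              ; first intermediate carry
    add rax,[r9+rcx*8]
    setc bl              ; second intermediate carry
    add edx,ebx          ; RDX = 0, 1 or 2, same as the ADC pair above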

    Terje

    Anton Ertl wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    The idea is interesting, but I don't understand the code. The
    following looks funny to me:

    1) You increment edx in increment_edx, then jump back to edx_ready and
    immediately overwrite edx with ebx. Then you do nothing with it,
    and then you clear edx in the next iteration. So both the "inc
    edx" and the "mov edx, ebx" look like dead code to me that can be
    optimized away.

    2) There is a loop-carried dependency through ebx, and the number
    accumulating in ebx and the carry check makes no sense with that.

    Could it be that you wanted to do "mov ebx, edx" at edx_ready? It all
    makes more sense with that. ebx then contains the carry from the last
    cycle on entry. The carry dependency chain starts at clearing edx,
    then gets to additional carries, then is copied to ebx, transferred
    into the next iteration, and is ended there by overwriting ebx. No
    dependency cycles (except the loop counter and addresses, which can be
    dealt with by hardware or by unrolling), and ebx contains the carry
    from the last iteration

    One other problem is that according to Agner Fog's instruction tables,
    even the latest and greatest CPUs from AMD and Intel that he measured
    (Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
    and adc has a latency of 1, so breaking the dependency chain in a
    beneficial way should avoid the use of adc. For our three-summand
    add, it's not clear if adcx and adox can run in the same cycle, but
    looking at your measurements, it is unlikely.

    So we would need something other than "adc edx, edx" to set the carry
    register. According to Agner Fog Zen3 can perform 2 cmovc per cycle
    (and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
    have 1 in edi, and then do, for two-summand addition:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov edx, ebx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never
    incremen_edx:
    inc edx
    jmp edx_ready

    Forgot to fix the "mov edx, ebx" here. One other thing: I think that
    the "add rbx, rax" should be "add rax, rbx". You want to add the
    carry to rax before storing the result. So the version with just one iteration would be:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never
    incremen_edx:
    inc edx
    jmp edx_ready

    And the version with the two additional adc-using iterations would be
    (with an additional correction):

    mov edi,1
    xor ebx,ebx
    next:
    mov rax,[rsi+rcx*8]
    add [r8+rcx*8], rax
    mov rax,[rsi+rcx*8+8]
    adc [r8+rcx*8+8], rax
    xor edx, edx
    mov rax,[rsi+rcx*8+16]
    adc rax,[r8+rcx*8+16]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8+16],rax
    add rcx,3
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never
    incremen_edx:
    inc edx
    jmp edx_ready

    - anton



    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Michael S@21:1/5 to Anton Ertl on Tue Aug 19 17:20:54 2025
    On Tue, 19 Aug 2025 07:09:51 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    The idea is interesting, but I don't understand the code. The
    following looks funny to me:

    1) You increment edx in increment_edx, then jump back to edx_ready
    and
    immediately overwrite edx with ebx. Then you do nothing with it,
    and then you clear edx in the next iteration. So both the "inc
    edx" and the "mov edx, ebx" look like dead code to me that can be
    optimized away.

    2) There is a loop-carried dependency through ebx, and the number
    accumulating in ebx and the carry check makes no sense with that.

    Could it be that you wanted to do "mov ebx, edx" at edx_ready? It
    all makes more sense with that. ebx then contains the carry from
    the last cycle on entry. The carry dependency chain starts at
    clearing edx, then gets to additional carries, then is copied to
    ebx, transferred into the next iteration, and is ended there by
    overwriting ebx. No dependency cycles (except the loop counter and
    addresses, which can be dealt with by hardware or by unrolling), and
    ebx contains the carry from the last iteration

    One other problem is that according to Agner Fog's instruction
    tables, even the latest and greatest CPUs from AMD and Intel that he
    measured (Zen5 and Tiger Lake) can only execute one adc/adcx/adox
    per cycle, and adc has a latency of 1, so breaking the dependency
    chain in a beneficial way should avoid the use of adc. For our
    three-summand add, it's not clear if adcx and adox can run in the
    same cycle, but looking at your measurements, it is unlikely.

    So we would need something other than "adc edx, edx" to set the carry
    register. According to Agner Fog Zen3 can perform 2 cmovc per cycle
    (and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
    have 1 in edi, and then do, for two-summand addition:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov edx, ebx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never
    incremen_edx:
    inc edx
    jmp edx_ready

    Forgot to fix the "mov edx, ebx" here. One other thing: I think that
    the "add rbx, rax" should be "add rax, rbx". You want to add the
    carry to rax before storing the result. So the version with just one iteration would be:

    Too many back and forth mental switches between Intel and AT&T syntax.
    The real code that I measured was for the Windows platform, but in AT&T
    (gnu) syntax.
    Below is the full function with the loop unrolled by 3. The rest I will
    maybe answer later; right now I don't have time.

    .file "add3_my_u3.s"
    .text
    .p2align 4
    .globl add3
    .def add3; .scl 2; .type 32; .endef
    .seh_proc add3
    add3:
    pushq %r13
    .seh_pushreg %r13
    pushq %r12
    .seh_pushreg %r12
    pushq %rbp
    .seh_pushreg %rbp
    pushq %rdi
    .seh_pushreg %rdi
    pushq %rsi
    .seh_pushreg %rsi
    pushq %rbx
    .seh_pushreg %rbx
    .seh_endprologue
    # %rcx - dst
    # %rdx - a
    # %r8 - b
    # %r9 - c
    sub %rcx, %rdx
    sub %rcx, %r8
    sub %rcx, %r9
    mov $341, %ebx
    xor %eax, %eax
    .loop:
    xor %esi, %esi
    mov (%rcx,%rdx), %rdi
    mov 8(%rcx,%rdx), %rbp
    mov 16(%rcx,%rdx), %r10
    add (%rcx,%r8), %rdi
    adc 8(%rcx,%r8), %rbp
    adc 16(%rcx,%r8), %r10
    adc %esi, %esi
    add (%rcx,%r9), %rdi
    adc 8(%rcx,%r9), %rbp
    adc 16(%rcx,%r9), %r10
    adc $0, %esi
    add %rax, %rdi # add carry from the previous iteration
    jc .prop_carry
    .carry_done:
    mov %esi, %eax
    mov %rdi, (%rcx)
    mov %rbp, 8(%rcx)
    mov %r10, 16(%rcx)
    lea 24(%rcx), %rcx
    dec %ebx
    jnz .loop

    sub $(1023*8), %rcx
    mov %rcx, %rax

    popq %rbx
    popq %rsi
    popq %rdi
    popq %rbp
    popq %r12
    popq %r13
    ret

    .prop_carry:
    add $1, %rbp
    adc $0, %r10
    adc $0, %esi
    jmp .carry_done

    .seh_endproc








    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely.For random inputs-approximately
    never incremen_edx:
    inc edx
    jmp edx_ready

    And the version with the two additional adc-using iterations would be
    (with an additional correction):

    mov edi,1
    xor ebx,ebx
    next:
    mov rax,[rsi+rcx*8]
    add [r8+rcx*8], rax
    mov rax,[rsi+rcx*8+8]
    adc [r8+rcx*8+8], rax
    xor edx, edx
    mov rax,[rsi+rcx*8+16]
    adc rax,[r8+rcx*8+16]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8+16],rax
    add rcx,3
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely.For random inputs-approximately
    never incremen_edx:
    inc edx
    jmp edx_ready

    - anton

  • From Michael S@21:1/5 to All on Tue Aug 19 17:24:23 2025
    Above I posted, by mistake, not the most up-to-date variant, sorry.
    Here is the correct code:

    .file "add3_my_u3.s"
    .text
    .p2align 4
    .globl add3
    .def add3; .scl 2; .type 32; .endef
    .seh_proc add3
    add3:
    pushq %rbp
    .seh_pushreg %rbp
    pushq %rdi
    .seh_pushreg %rdi
    pushq %rsi
    .seh_pushreg %rsi
    pushq %rbx
    .seh_pushreg %rbx
    .seh_endprologue
    # %rcx - dst
    # %rdx - a
    # %r8 - b
    # %r9 - c
    sub %rcx, %rdx
    sub %rcx, %r8
    sub %rcx, %r9
    mov $341, %ebx
    xor %eax, %eax
    .loop:
    xor %esi, %esi
    mov (%rcx,%rdx), %rdi
    mov 8(%rcx,%rdx), %r10
    mov 16(%rcx,%rdx), %r11
    add (%rcx,%r8), %rdi
    adc 8(%rcx,%r8), %r10
    adc 16(%rcx,%r8), %r11
    adc %esi, %esi
    add (%rcx,%r9), %rdi
    adc 8(%rcx,%r9), %r10
    adc 16(%rcx,%r9), %r11
    adc $0, %esi
    add %rax, %rdi # add carry from the previous iteration
    jc .prop_carry
    .carry_done:
    mov %esi, %eax
    mov %rdi, (%rcx)
    mov %r10, 8(%rcx)
    mov %r11, 16(%rcx)
    lea 24(%rcx), %rcx
    dec %ebx
    jnz .loop

    sub $(1023*8), %rcx
    mov %rcx, %rax

    popq %rbx
    popq %rsi
    popq %rdi
    popq %rbp
    ret

    .prop_carry:
    add $1, %r10
    adc $0, %r11
    adc $0, %esi
    jmp .carry_done

    .seh_endproc

  • From Anton Ertl@21:1/5 to Terje Mathisen on Tue Aug 19 17:43:25 2025
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Anton, I like what you and Michael have done, but I'm still not sure
    everything is OK:

    In your code, I only see two input arrays [rsi] and [r8], instead of
    three? (Including [r9])

    I implemented a two-summand addition, not three-summand. I wanted the
    minimum of complexity to make it easier to understand, and latency is
    a bigger problem for the two-summand case.

    It would also be possible to use SETC to save the intermediate carries...

    I must have had a bad morning. Instead of xor edx, edx, setc dl (also
    2 per cycle on Zen3), I wrote

    mov edi,1
    ...
    xor edx, edx
    ...
    cmovc edx, edi

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Michael S@21:1/5 to Anton Ertl on Tue Aug 19 23:03:01 2025
    On Tue, 19 Aug 2025 05:47:01 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    One other problem is that according to Agner Fog's instruction tables,
    even the latest and greatest CPUs from AMD and Intel that he measured
    (Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,

    I didn't measure on either TGL or Zen5, but both Raptor Cove and Zen3
    are certainly capable of more than 1 adcx|adox per cycle.

    Below are execution times of a very heavily unrolled adcx/adox code with
    the dependency broken by a trick similar to the above:

    Platform RC GM SK Z3
    add3_my_adx_u17 244.5 471.1 482.4 407.0

    Considering that there are 2166 adcx/adox/adc instructions, we have the
    following number of adcx/adox/adc instructions per clock:
    Platform RC GM SK Z3
    1.67 1.10 1.05 1.44
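
    (For example, for RC: 244.5 ns at the estimated 5.30 GHz is about 1296
    clock cycles, and 2166 / 1296 = ~1.67; the other columns follow from the
    same calculation with the frequencies listed earlier.)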

    For Gracemont and Skylake there exists a possibility of a small
    measurement error, but Raptor Cove appears to be capable of at least 2
    instructions of this type per clock, while Zen3 is capable of at least
    1.5, but more likely also 2.
    It looks to me like the bottleneck on both RC and Z3 is either the rename
    phase or, more likely, L1$ access. It seems that while Golden/Raptor Cove
    can occasionally issue 3 loads + 2 stores per clock, it cannot sustain
    more than 3 load-or-store accesses per clock.


    Code:

    .file "add3_my_adx_u17.s"
    .text
    .p2align 4
    .globl add3
    .def add3; .scl 2; .type 32; .endef
    .seh_proc add3
    add3:
    pushq %rsi
    .seh_pushreg %rsi
    pushq %rbx
    .seh_pushreg %rbx
    .seh_endprologue
    # %rcx - dst
    # %rdx - a
    # %r8 - b
    # %r9 - c
    sub %rdx, %rcx
    mov %rcx, %r10 # r10 = dst - a
    sub %rdx, %r8 # r8 = b - a
    sub %rdx, %r9 # r9 = c - a
    mov %rdx, %r11 # r11 = a
    mov $60, %edx
    xor %ecx, %ecx
    .p2align 4
    .loop:
    xor %ebx, %ebx # CF <= 0, OF <= 0, EBX <= 0
    mov (%r11), %rsi
    adcx (%r11,%r8), %rsi
    adox (%r11,%r9), %rsi

    mov 8(%r11), %rax
    adcx 8(%r11,%r8), %rax
    adox 8(%r11,%r9), %rax
    mov %rax, 8(%r10,%r11)

    mov 16(%r11), %rax
    adcx 16(%r11,%r8), %rax
    adox 16(%r11,%r9), %rax
    mov %rax, 16(%r10,%r11)

    mov 24(%r11), %rax
    adcx 24(%r11,%r8), %rax
    adox 24(%r11,%r9), %rax
    mov %rax, 24(%r10,%r11)

    mov 32(%r11), %rax
    adcx 32(%r11,%r8), %rax
    adox 32(%r11,%r9), %rax
    mov %rax, 32(%r10,%r11)

    mov 40(%r11), %rax
    adcx 40(%r11,%r8), %rax
    adox 40(%r11,%r9), %rax
    mov %rax, 40(%r10,%r11)

    mov 48(%r11), %rax
    adcx 48(%r11,%r8), %rax
    adox 48(%r11,%r9), %rax
    mov %rax, 48(%r10,%r11)

    mov 56(%r11), %rax
    adcx 56(%r11,%r8), %rax
    adox 56(%r11,%r9), %rax
    mov %rax, 56(%r10,%r11)

    mov 64(%r11), %rax
    adcx 64(%r11,%r8), %rax
    adox 64(%r11,%r9), %rax
    mov %rax, 64(%r10,%r11)

    mov 72(%r11), %rax
    adcx 72(%r11,%r8), %rax
    adox 72(%r11,%r9), %rax
    mov %rax, 72(%r10,%r11)

    mov 80(%r11), %rax
    adcx 80(%r11,%r8), %rax
    adox 80(%r11,%r9), %rax
    mov %rax, 80(%r10,%r11)

    mov 88(%r11), %rax
    adcx 88(%r11,%r8), %rax
    adox 88(%r11,%r9), %rax
    mov %rax, 88(%r10,%r11)

    mov 96(%r11), %rax
    adcx 96(%r11,%r8), %rax
    adox 96(%r11,%r9), %rax
    mov %rax, 96(%r10,%r11)

    mov 104(%r11), %rax
    adcx 104(%r11,%r8), %rax
    adox 104(%r11,%r9), %rax
    mov %rax, 104(%r10,%r11)

    mov 112(%r11), %rax
    adcx 112(%r11,%r8), %rax
    adox 112(%r11,%r9), %rax
    mov %rax, 112(%r10,%r11)

    mov 120(%r11), %rax
    adcx 120(%r11,%r8), %rax
    adox 120(%r11,%r9), %rax
    mov %rax, 120(%r10,%r11)

    lea 136(%r11), %r11

    mov -8(%r11), %rax
    adcx -8(%r11,%r8), %rax
    adox -8(%r11,%r9), %rax
    mov %rax, -8(%r10,%r11)

    mov %ebx, %eax # EAX <= 0
    adcx %ebx, %eax # EAX <= CF, CF <= 0
    adox %ebx, %eax # EAX <= EAX + OF, OF <= 0

    add %rcx, %rsi
    jc .prop_carry
    .carry_done:
    mov %rsi, -136(%r10,%r11)
    mov %eax, %ecx
    dec %edx
    jnz .loop

    # last 3
    mov (%r11), %rax
    mov 8(%r11), %rdx
    mov 16(%r11), %rbx
    add (%r11,%r8), %rax
    adc 8(%r11,%r8), %rdx
    adc 16(%r11,%r8), %rbx
    add (%r11,%r9), %rax
    adc 8(%r11,%r9), %rdx
    adc 16(%r11,%r9), %rbx
    add %rcx, %rax
    adc $0, %rdx
    adc $0, %rbx
    mov %rax, (%r10,%r11)
    mov %rdx, 8(%r10,%r11)
    mov %rbx, 16(%r10,%r11)

    lea (-1020*8)(%r10,%r11), %rax
    popq %rbx
    popq %rsi
    ret

    .prop_carry:
    lea -128(%r10,%r11), %rbx
    xor %ecx, %ecx
    addq $1, (%rbx)
    adc %rcx, 8(%rbx)
    adc %rcx, 16(%rbx)
    adc %rcx, 24(%rbx)
    adc %rcx, 32(%rbx)
    adc %rcx, 40(%rbx)
    adc %rcx, 48(%rbx)
    adc %rcx, 56(%rbx)
    adc %rcx, 64(%rbx)
    adc %rcx, 72(%rbx)
    adc %rcx, 80(%rbx)
    adc %rcx, 88(%rbx)
    adc %rcx, 96(%rbx)
    adc %rcx,104(%rbx)
    adc %rcx,112(%rbx)
    adc %rcx,120(%rbx)
    adc %ecx, %eax
    jmp .carry_done
    .seh_endproc

  • From Terje Mathisen@21:1/5 to Michael S on Wed Aug 20 10:50:39 2025
    Michael S wrote:
    On Tue, 19 Aug 2025 05:47:01 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    One other problem is that according to Agner Fog's instruction tables,
    even the latest and greatest CPUs from AMD and Intel that he measured
    (Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,

    I didn't measure on either TGL or Zen5, but both Raptor Cove and Zen3
    are certainly capable of more than 1 adcx|adox per cycle.

    Below are execution times of a very heavily unrolled adcx/adox code with
    the dependency broken by a trick similar to the above:

    Platform RC GM SK Z3
    add3_my_adx_u17 244.5 471.1 482.4 407.0

    Considering that there are 2166 adcx/adox/adc instructions, we have
    following number of adcx/adox/adc instructions per clock:
    Platform RC GM SK Z3
    1.67 1.10 1.05 1.44

    For Gracemont and Skylake there exists a possibility of a small
    measurement error, but Raptor Cove appears to be capable of at least 2
    instructions of this type per clock, while Zen3 is capable of at least
    1.5, but more likely also 2.
    It looks to me like the bottleneck on both RC and Z3 is either the rename
    phase or, more likely, L1$ access. It seems that while Golden/Raptor Cove
    can occasionally issue 3 loads + 2 stores per clock, it cannot sustain
    more than 3 load-or-store accesses per clock.


    Code:

    .file "add3_my_adx_u17.s"
    .text
    .p2align 4
    .globl add3
    .def add3; .scl 2; .type 32; .endef
    .seh_proc add3
    add3:
    pushq %rsi
    .seh_pushreg %rsi
    pushq %rbx
    .seh_pushreg %rbx
    .seh_endprologue
    # %rcx - dst
    # %rdx - a
    # %r8 - b
    # %r9 - c
    sub %rdx, %rcx
    mov %rcx, %r10 # r10 = dst - a
    sub %rdx, %r8 # r8 = b - a
    sub %rdx, %r9 # r9 = c - a
    mov %rdx, %r11 # r11 = a
    mov $60, %edx
    xor %ecx, %ecx
    .p2align 4
    .loop:
    xor %ebx, %ebx # CF <= 0, OF <= 0, EBX <= 0
    mov (%r11), %rsi
    adcx (%r11,%r8), %rsi
    adox (%r11,%r9), %rsi

    mov 8(%r11), %rax
    adcx 8(%r11,%r8), %rax
    adox 8(%r11,%r9), %rax
    mov %rax, 8(%r10,%r11)

    [snipped the rest]


    Very impressive Michael!

    I particularly like how you are interleaving ADOX and ADCX to gain two
    carry bits without having to save them off to an additional register.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Michael S@21:1/5 to Terje Mathisen on Wed Aug 20 14:16:55 2025
    On Wed, 20 Aug 2025 10:50:39 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:



    Very impressive Michael!

    I particularly like how you are interleaving ADOX and ADCX to gain
    two carry bits without having to save them off to an additional
    register.

    Terje


    It is interesting as an exercise in ADX extension programming, but in
    practice it is only 0-10% faster than the much simpler and smaller code
    presented in the other post, which uses no ISA extensions and so runs on
    every AMD64 CPU since K8.
    I suspect that this result is quite representative of the gains that
    can be achieved with ADX. Maybe, if there is a crypto requirement that
    execution time be independent of the inputs, the gain would be somewhat
    bigger, but even there I would be very surprised to find a 1.5x gain.
    Overall, I think that the time spent by Intel engineers on the invention
    of ADX could have been spent much better.


    Going back to the task of 3-way addition, another approach that can
    utilize the same idea of breaking the data dependency is using SIMD.
    In the case of the 4 cores that I tested, SIMD means AVX2.
    Here are the results of an AVX2 implementation that unrolls by two, i.e.
    512 output bits per iteration of the inner loop.

    Platform RC GM SK Z3
    add3_avxq_u2 226.7 823.3 321.1 309.5

    The speed is about equal to the more unrolled ADX variant on RC, faster
    on Z3, much faster on SK and much slower on GM. Unlike ADX, it runs on
    Intel Haswell and on a few pre-Zen AMD CPUs.

    .file "add3_avxq_u2.s"
    .text
    .p2align 4
    .globl add3
    .def add3; .scl 2; .type 32; .endef
    .seh_proc add3
    add3:
    subq $56, %rsp
    .seh_stackalloc 56
    vmovups %xmm6, 32(%rsp)
    .seh_savexmm %xmm6, 32
    .seh_endprologue
    # %rcx - dst
    # %rdx - a
    # %r8 - b
    # %r9 - c
    sub %rcx, %rdx # %rdx - a-dst
    sub %rcx, %r8 # %r8 - b-dst
    sub %rcx, %r9 # %r9 - c-dst
    vpcmpeqq %ymm6, %ymm6, %ymm6
    vpsllq $63, %ymm6, %ymm6 # ymm6[0:3] = msbit = 2**63
    vpxor %xmm5, %xmm5, %xmm5 # ymm5[0] = carry = 0
    mov $127, %eax
    .loop:
    vpxor (%rdx,%rcx), %ymm6, %ymm0
    # ymm0[0:3] = iA[0:3] = a[0:3] - msbit
    vpxor 32(%rdx,%rcx), %ymm6, %ymm1
    # ymm1[0:3] = iA[4:7] = a[4:7] - msbit
    vpaddq (%r8, %rcx), %ymm0, %ymm2
    # ymm2[0:3] = iSum1[0:3] = iA[0:3]+b[0:3]
    vpaddq 32(%r8, %rcx), %ymm1, %ymm3
    # ymm3[0:3] = iSum1[4:7] = iA[4:7] + b[4:7]
    vpcmpgtq %ymm2, %ymm0, %ymm4
    # ymm4[0:3] = c1[0:3] = iA[0:3] > iSum1[0:3]
    vpaddq (%r9, %rcx), %ymm2, %ymm0
    # ymm0[0:3] = iSum2[0:3] = iSum1[0:3]+c[0:3]
    vpcmpgtq %ymm0, %ymm2, %ymm2
    # ymm2[0:3] = c2[0:3] = iSum1[0:3] > iSum2[0:3]
    vpaddq %ymm4, %ymm2, %ymm2
    # ymm2[0:3] = cSum0[0:3] = c1[0:3]+c2[0:3]
    vpcmpgtq %ymm3, %ymm1, %ymm4
    # ymm4[0:3] = c1[4:7] = iA[4:7] > iSum1[4:7]
    vpaddq 32(%r9, %rcx), %ymm3, %ymm1
    # ymm1[0:3] = iSum2[4:7] = iSum1[4:7] + c[4:7]
    vpcmpgtq %ymm1, %ymm3, %ymm3
    # ymm3[0:3] = c2[4:7] = iSum1[4:7] > iSum2[4:7]
    vpaddq %ymm4, %ymm3, %ymm3
    # ymm3[0:3] = cSum0[4:7] = c1[4:7] + c2[4:7]
    vpermq $0x93, %ymm2, %ymm4
    # ymm4[0:3] = cSum0[3,0:2]
    vpblendd $3, %ymm5, %ymm4, %ymm2
    # ymm2[0:3] = cSum[0:3] = { carry[0], cSum0[0,1,2] }
    vpermq $0x93, %ymm3, %ymm5
    # ymm5[0:3] = cSum0[7,4:6] == carry
    vpblendd $3, %ymm4, %ymm5, %ymm3
    # ymm3[0:3] = cSum[4:7] = { cSum0[3], cSum0[4:6] }
    .add_carry:
    vpsubq %ymm2, %ymm0, %ymm2
    # ymm2[0:3] = iSum3[0:3] = iSum2[0:3] - cSum[0:3]
    vpsubq %ymm3, %ymm1, %ymm3
    # ymm3[0:3] = iSum3[4:7] = iSum2[4:7] - cSum[4:7]
    vpcmpgtq %ymm2, %ymm0, %ymm0
    # ymm0[0:3] = c3[0:3] = iSum2[0:3] > iSum3[0:3]
    vpcmpgtq %ymm3, %ymm1, %ymm1
    # ymm1[0:3] = c3[4:7] = iSum2[4:7] > iSum3[4:7]
    vpor %ymm0, %ymm1, %ymm4
    vptest %ymm4, %ymm4
    jne .prop_carry
    vpxor %ymm2, %ymm6, %ymm0
    # ymm0[0:3] = uSum3[0:3] = iSum3[0:3] + msbit
    vpxor %ymm3, %ymm6, %ymm1
    # ymm1[4:7] = uSum3[4:7] = iSum3[4:7] + msbit
    vmovdqu %ymm0, (%rcx)
    vmovdqu %ymm1, 32(%rcx)
    addq $64, %rcx
    dec %eax
    jnz .loop

    # last 7
    vpxor (%rdx,%rcx), %ymm6, %ymm0
    # ymm0[0:3] = iA[0:3] = a[0:3] - msbit
    vpxor 24(%rdx,%rcx), %ymm6, %ymm1
    # ymm1[0:3] = iA[3:6] = a[3:6] - msbit
    vpaddq (%r8, %rcx), %ymm0, %ymm2
    # ymm2[0:3] = iSum1[0:3] = iA[0:3]+b[0:3]
    vpaddq 24(%r8, %rcx), %ymm1, %ymm3
    # ymm3[0:3] = iSum1[3:6] = iA[3:6] + b[3:6]
    vpcmpgtq %ymm2, %ymm0, %ymm4
    # ymm4[0:3] = c1[0:3] = iA[0:3] > iSum1[0:3]
    vpaddq (%r9, %rcx), %ymm2, %ymm0
    # ymm0[0:3] = iSum2[0:3] = iSum1[0:3]+c[0:3]
    vpcmpgtq %ymm0, %ymm2, %ymm2
    # ymm2[0:3] = c2[0:3] = iSum1[0:3] > iSum2[0:3]
    vpaddq %ymm4, %ymm2, %ymm2
    # ymm2[0:3] = cSum0[0:3] = c1[0:3]+c2[0:3]
    vpcmpgtq %ymm3, %ymm1, %ymm4
    # ymm4[0:3] = c1[3:6] = iA[3:6] > iSum1[3:6]
    vpaddq 24(%r9, %rcx), %ymm3, %ymm1
    # ymm1[0:3] = iSum2[3:6] = iSum1[3:6] + c[3:6]
    vpcmpgtq %ymm1, %ymm3, %ymm3
    # ymm3[0:3] = c2[3:6] = iSum1[3:6] > iSum2[3:6]
    vpaddq %ymm4, %ymm3, %ymm3
    # ymm3[0:3] = cSum[4:7] = cSum0[3:6] = c1[3:6] + c2[3:6]
    vpermq $0x93, %ymm2, %ymm4
    # ymm4[0:3] = cSum0[3,0,1,2]
    vpblendd $3, %ymm5, %ymm4, %ymm2
    # ymm2[0:3] = cSum[0:3] = { carry[0], cSum0[0,1,2] }
    vpermq $0xF9, %ymm1, %ymm1
    # ymm1[0:3] = iSum2[4:6,6]
    .add_carry2:
    vpsubq %ymm2, %ymm0, %ymm2
    # ymm2[0:3] = iSum3[0:3] = iSum2[0:3] - cSum[0:3]
    vpsubq %ymm3, %ymm1, %ymm3
    # ymm3[0:3] = iSum3[4:7] = iSum2[4:7] - cSum[4:7]
    vpcmpgtq %ymm2, %ymm0, %ymm0
    # ymm0[0:3] = c3[0:3] = iSum2[0:3] > iSum3[0:3]
    vpcmpgtq %ymm3, %ymm1, %ymm1
    # ymm1[0:3] = c3[4:7] = iSum2[4:7] > iSum3[4:7]
    vptest %ymm0, %ymm0
    jne .prop_carry2
    vptest %ymm1, %ymm1
    jne .prop_carry2
    vpxor %ymm2, %ymm6, %ymm0
    # ymm0[0:3] = uSum3[0:3] = iSum3[0:3] + msbit
    vpxor %ymm3, %ymm6, %ymm1
    # ymm1[4:7] = uSum3[4:7] = iSum3[4:7] + msbit
    vmovdqu %ymm0, (%rcx)
    vmovdqu %xmm1, 32(%rcx)
    vextractf128 $1, %ymm1, %xmm1
    vmovq %xmm1, 48(%rcx)

    lea -(127*64)(%rcx), %rax
    vzeroupper
    vmovups 32(%rsp), %xmm6
    addq $56, %rsp
    ret

    .prop_carry:
    # input:
    # ymm0[0:3] = c3[0:3]
    # ymm1[0:3] = c3[4:7]
    # ymm2[0:3] = iSum3[0:3]
    # ymm3[0:3] = iSum3[4:7]
    # ymm5[0] = carry
    # output:
    # ymm0[0:3] = iSum2[0:3]
    # ymm1[0:3] = iSum2[4:7]
    # ymm2[0:3] = cSum [0:3]
    # ymm3[0:3] = cSum [4:7]
    # ymm5[0] = carry
    # scratch: ymm4
    vpermq $0x93, %ymm0, %ymm4
    # ymm4[0:3] = c3[3,0,1,2]
    vmovdqa %ymm2, %ymm0
    # ymm0[0:3] = iSum2[0:3] = iSum3[0:3]
    vpermq $0x93, %ymm1, %ymm2
    # ymm2[0:3] = c3[7,4,5,6]
    vpaddq %xmm2, %xmm5, %xmm5
    # ymm5[0] = carry += c3[7]
    vmovdqa %ymm3, %ymm1
    # ymm1[0:3] = iSum2[4:7] = iSum3[4:7]
    vpblendd $3, %ymm4, %ymm2, %ymm3
    # ymm3[0:3] = cSum[4:7] = { c3[3], c3[4,5,6] }
    vpxor %xmm2, %xmm2, %xmm2
    # ymm2[0:3] = 0
    vpblendd $3, %ymm2, %ymm4, %ymm2
    # ymm2[0:3] = cSum[0:3] = { 0, c3[0,1,2] }
    jmp .add_carry

    .prop_carry2:
    # input:
    # ymm0[0:3] = c3[0:3]
    # ymm1[0:3] = c3[4:7]
    # ymm2[0:3] = iSum3[0:3]
    # ymm3[0:3] = iSum3[4:7]
    # ymm5[0] = carry
    # output:
    # ymm0[0:3] = iSum2[0:3]
    # ymm1[0:3] = iSum2[4:7]
    # ymm2[0:3] = cSum [0:3]
    # ymm3[0:3] = cSum [4:7]
    # ymm5[0] = carry
    # scratch: ymm4
    vpermq $0x93, %ymm0, %ymm4
    # ymm4[0:3] = c3[3,0,1,2]
    vmovdqa %ymm2, %ymm0
    # ymm0[0:3] = iSum2[0:3] = iSum3[0:3]
    vpermq $0x93, %ymm1, %ymm2
    # ymm2[0:3] = c3[7,4,5,6]
    vmovdqa %ymm3, %ymm1
    # ymm1[0:3] = iSum2[4:7] = iSum3[4:7]
    vpblendd $3, %ymm4, %ymm2, %ymm3
    # ymm3[0:3] = cSum[4:7] = { c3[3], c3[4,5,6] }
    vpxor %xmm2, %xmm2, %xmm2
    # ymm2[0:3] = 0
    vpblendd $3, %ymm2, %ymm4, %ymm2
    # ymm2[0:3] = cSum[0:3] = { 0, c3[0,1,2] }
    jmp .add_carry2

    .seh_endproc

    AVX2 is rather poorly suited for this task - it lacks unsigned
    comparison instructions, so the first input has to be shifted by
    half-range at the beginning and the result shifted back.
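
    (A scalar illustration of that half-range trick, not from the post
    itself: biasing both operands by 2^63 turns the unsigned comparison that
    detects a carry into the signed comparison that vpcmpgtq provides:)

    #include <stdint.h>

    /* a >u b  <=>  (a ^ 2^63) >s (b ^ 2^63) */
    static inline int gt_u64_via_signed(uint64_t a, uint64_t b)
    {
        const uint64_t bias = UINT64_C(1) << 63;
        return (int64_t)(a ^ bias) > (int64_t)(b ^ bias);
    }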

    AVX-512 can be more suitable. But the only AVX-512 capable CPU that I
    have access to is a mini-PC with a cheap and slow Core i3, used by family
    members almost exclusively for viewing movies. It does not even have a
    minimal programming environment installed.
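
    (For comparison, a sketch - untested, an assumption rather than measured
    code - of how the per-lane carry detection could look with AVX-512F,
    where vpcmpuq gives the unsigned compare directly and no bias is needed:)

    #include <immintrin.h>

    /* per-lane carry out of s = a + b, without the 2^63 bias */
    static inline __mmask8 add_carry_mask(__m512i a, __m512i b, __m512i *sum)
    {
        __m512i s = _mm512_add_epi64(a, b);
        *sum = s;
        return _mm512_cmpgt_epu64_mask(a, s);  /* lane carried iff a > a+b (mod 2^64) */
    }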
