• 3-way long addition (was: VAX)

    From Michael S@21:1/5 to Terje Mathisen on Wed Aug 6 20:43:33 2025
    On Wed, 6 Aug 2025 16:19:11 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 22:17:00 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 17:31:34 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    In this case 'adc edx,edx' is just a slightly shorter encoding
    of 'adc edx,0'. The EDX register is zeroized a few lines above.

    OK, nice.

    BTW, it seems that in your code fragment above you forgot to
    zeroize EDX at the beginning of the iteration. Or am I missing
    something?

    No, you are not. I skipped pretty much all the setup code. :-)

    It's not setup code that looks missing to me, but the zeroing of RDX
    in the body of the loop.



    Anyway, the three main ADD RAX,... operations still define the
    minimum possible latency, right?


    I don't think so.
    It seems to me that there is only one chain of data dependencies
    between iterations of the loop - a trivial dependency through RCX.
    Some modern processors are already capable of eliminating this sort
    of dependency in the renamer. Probably not yet when it is coded as
    'inc', but when coded as 'add' or 'lea'.

    The dependency through RDX/RBX does not form a chain. The next
    value of [rdi+rcx*8] does depend on the value of rbx from the previous
    iteration, but the next value of rbx depends only on [rsi+rcx*8],
    [r8+rcx*8] and [r9+rcx*8]. It does not depend on the previous
    value of rbx, except for a control dependency that hopefully would
    be speculated around.

    I believe we are doing a bigint three-way add, so each result word
    depends on the three corresponding input words, plus any carries
    from the previous round.

    This is the carry chain that I don't see any obvious way to
    break...

    You break the chain by *predicting* that
    carry[i] = CARRY(a[i]+b[i]+c[i]+carry[i-1]) is equal to
    CARRY(a[i]+b[i]+c[i]). If the prediction turns out wrong then you
    pay the heavy price of a branch misprediction. But outside of specially
    crafted inputs that is extremely rare.

    Aha!

    That's _very_ nice.

    Terje



    I did a few tests on a few machines: Raptor Cove (i7-14700 P core),
    Gracemont (i7-14700 E core), Skylake-C (Xeon E-2176G) and Zen3 (EPYC
    7543P).
    In order to see the effects more clearly I had to modify Anton's function
    to one that operates on pointers, because otherwise too much time was
    spent at the caller's site copying things around, which made the
    measurements too noisy.

    void add3(uintNN_t *dst, const uintNN_t* a, const uintNN_t* b, const
    uintNN_t* c) {
    *dst = *a + *b + *c;
    }
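
    (uintNN_t is not defined in the quoted posts; for N=65472 bits one
    plausible stand-in, as an assumption rather than the measured definition,
    is a C23 bit-precise integer:

        typedef unsigned _BitInt(65472) uintNN_t;  /* assumed, for illustration */
    )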


    After the change I saw a significant speed-up on 3 out of 4 platforms.
    The only platform where the speed-up was insignificant was Skylake,
    probably because its rename stage is too narrow to profit from the
    change. The widest machine (Raptor Cove) benefited most.
    The results appear inconclusive with regard to the question of whether
    the dependency between loop iterations is eliminated completely or just
    shortened to 1-2 clock cycles per iteration. Even the widest of my
    cores is relatively narrow. Considering that my variant of the loop
    contains 13 x86-64 instructions and 16 uOps, I am afraid that even the
    likes of Apple M4 would be too narrow :(
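
    For reference, the carry-prediction idea described above, written out
    word-at-a-time in plain C (illustrative only; the names are made up and
    this is not the measured loop):

    #include <stdint.h>
    #include <stddef.h>

    static void add3_words(uint64_t *dst, const uint64_t *a,
                           const uint64_t *b, const uint64_t *c, size_t n)
    {
        uint64_t carry = 0;                  /* 0, 1 or 2, from the previous word */
        for (size_t i = 0; i < n; i++) {
            uint64_t next = 0;
            uint64_t s = a[i] + b[i];
            next += (s < a[i]);              /* carry out of a+b */
            s += c[i];
            next += (s < c[i]);              /* carry out of (a+b)+c */
            uint64_t r = s + carry;          /* fold in the incoming carry */
            if (r < s)                       /* rare fixup: the 0..2 add wrapped */
                next += 1;
            dst[i] = r;
            carry = next;                    /* normally independent of the incoming carry */
        }
    }

    The point is that 'next' is computed from a[i], b[i] and c[i] alone,
    except in the rare wrap-around case, so the loop-carried data dependency
    is replaced by an almost-never-taken control dependency.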

    Here are results in nanoseconds for N=65472
    Platform RC GM SK Z3
    clang 896.1 1476.7 1453.2 1348.0
    gcc 879.2 1661.4 1662.9 1655.0
    x86 585.8 1489.3 901.5 672.0
    Terje's 772.6 1293.2 1012.6 1127.0
    My 397.5 803.8 965.3 660.0
    ADX 579.1 1650.1 728.9 853.0
    x86/u2 581.5 1246.2 679.9 584.0
    Terje's/u3 503.7 954.3 630.9 755.0
    My/u3 266.6 487.2 486.5 440.0
    ADX/u8 350.4 839.3 490.4 451.0

    'x86' is a variant that was sketched in one of my above
    posts. It calculates the sum in two passes over the arrays.
    'ADX' is a variant that uses ADCX/ADOX instructions as suggested by
    Anton, but unlike his suggestion does it in a loop rather than in a long
    straight-line code sequence.
    /u2, /u3, /u8 indicate unroll factors of the inner loop.

    Frequency:
    RC 5.30 GHz (Est)
    GM 4.20 GHz (Est)
    SK 4.25 GHz
    Z3 3.70 GHz

  • From Terje Mathisen@21:1/5 to Michael S on Thu Aug 7 15:15:17 2025
    Michael S wrote:
    On Wed, 6 Aug 2025 16:19:11 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 22:17:00 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 17:31:34 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    In this case 'adc edx,edx' is just a slightly shorter encoding
    of 'adc edx,0'. The EDX register is zeroized a few lines above.

    OK, nice.

    BTW, it seems that in your code fragment above you forgot to
    zeroize EDX at the beginning of the iteration. Or am I missing
    something?

    No, you are not. I skipped pretty much all the setup code. :-)

    It's not setup code that looks missing to me, but the zeroing of RDX
    in the body of the loop.

    I don't remember my code exactly, but the intent was that RDX would
    contain any incoming carries (0,1,2) from the previous iteration.

    Using ADCX/ADOX would not be an obvious speedup, at least not obvious to me.

    Terje

    I did a few tests on a few machines: Raptor Cove (i7-14700 P core),
    Gracemont (i7-14700 E core), Skylake-C (Xeon E-2176G) and Zen3 (EPYC
    7543P).
    In order to see the effects more clearly I had to modify Anton's function
    to one that operates on pointers, because otherwise too much time was
    spent at the caller's site copying things around, which made the
    measurements too noisy.

    void add3(uintNN_t *dst, const uintNN_t* a, const uintNN_t* b, const uintNN_t* c) {
    *dst = *a + *b + *c;
    }


    After the change I saw a significant speed-up on 3 out of 4 platforms.
    The only platform where the speed-up was insignificant was Skylake,
    probably because its rename stage is too narrow to profit from the
    change. The widest machine (Raptor Cove) benefited most.
    The results appear inconclusive with regard to the question of whether
    the dependency between loop iterations is eliminated completely or just
    shortened to 1-2 clock cycles per iteration. Even the widest of my
    cores is relatively narrow. Considering that my variant of the loop
    contains 13 x86-64 instructions and 16 uOps, I am afraid that even the
    likes of Apple M4 would be too narrow :(

    Here are results in nanoseconds for N=65472
    Platform RC GM SK Z3
    clang 896.1 1476.7 1453.2 1348.0
    gcc 879.2 1661.4 1662.9 1655.0
    x86 585.8 1489.3 901.5 672.0
    Terje's 772.6 1293.2 1012.6 1127.0
    My 397.5 803.8 965.3 660.0
    ADX 579.1 1650.1 728.9 853.0
    x86/u2 581.5 1246.2 679.9 584.0
    Terje's/u3 503.7 954.3 630.9 755.0
    My/u3 266.6 487.2 486.5 440.0
    ADX/u8 350.4 839.3 490.4 451.0

    'x86' is a variant that was sketched in one of my above
    posts. It calculates the sum in two passes over the arrays.
    'ADX' is a variant that uses ADCX/ADOX instructions as suggested by
    Anton, but unlike his suggestion does it in a loop rather than in a long
    straight-line code sequence.
    /u2, /u3, /u8 indicate unroll factors of the inner loop.

    Frequency:
    RC 5.30 GHz (Est)
    GM 4.20 GHz (Est)
    SK 4.25 GHz
    Z3 3.70 GHz


    Thanks for an interesting set of tests/results!

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Anton Ertl@21:1/5 to Michael S on Tue Aug 19 05:47:01 2025
    Michael S <already5chosen@yahoo.com> writes:
    For extremely wide cores, like Apple's M (modulo ISA), AMD Zen5 and
    Intel Lion Cove, I'd do the following modification to your inner loop
    (back in Intel syntax):

    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    adc edx,edx
    add rax,[r9+rcx*8]
    adc edx,0
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov edx, ebx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret


    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never
    incremen_edx:
    inc edx
    jmp edx_ready

    The idea is interesting, but I don't understand the code. The
    following looks funny to me:

    1) You increment edx in increment_edx, then jump back to edx_ready and
    immediately overwrite edx with ebx. Then you do nothing with it,
    and then you clear edx in the next iteration. So both the "inc
    edx" and the "mov edx, ebx" look like dead code to me that can be
    optimized away.

    2) There is a loop-carried dependency through ebx, and the number
    accumulating in ebx and the carry check makes no sense with that.

    Could it be that you wanted to do "mov ebx, edx" at edx_ready? It all
    makes more sense with that. ebx then contains the carry from the last
    cycle on entry. The carry dependency chain starts at clearing edx,
    then gets to additional carries, then is copied to ebx, transferred
    into the next iteration, and is ended there by overwriting ebx. No
    dependency cycles (except the loop counter and addresses, which can be
    dealt with by hardware or by unrolling), and ebx contains the carry
    from the last iteration

    One other problem is that according to Agner Fog's instruction tables,
    even the latest and greatest CPUs from AMD and Intel that he measured
    (Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
    and adc has a latency of 1, so breaking the dependency chain in a
    beneficial way should avoid the use of adc. For our three-summand
    add, it's not clear if adcx and adox can run in the same cycle, but
    looking at your measurements, it is unlikely.

    So we would need something other than "adc edx, edx" to set the carry
    register. According to Agner Fog Zen3 can perform 2 cmovc per cycle
    (and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
    have 1 in edi, and then do, for two-summand addition:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov edx, ebx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never
    incremen_edx:
    inc edx
    jmp edx_ready

    However, even without the loop overhead (which can be reduced with
    unrolling) that's 8 instructions per iteration, and therefore we will
    have a hard time executing it at less than 1 cycle/iteration on current
    CPUs. What if we mix in some adc-based stuff to bring down the
    instruction count? E.g., with one adc-based and one cmov-based
    iteration:

    mov edi,1
    xor ebx,ebx
    next:
    mov rax,[rsi+rcx*8]
    add [r8+rcx*8], rax
    mov rax,[rsi+rcx*8+8]
    add [r8+rcx*8+8], rax
    xor edx, edx
    mov rax,[rsi+rcx*8+16]
    adc rax,[r8+rcx*8+16]
    cmovc edx, edi
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8+16],rax
    add rcx,3
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never
    incremen_edx:
    inc edx
    jmp edx_ready

    Now we have 15 instructions per unrolled iteration (3 original
    iterations). Executing an unrolled iteration in less than three
    cycles might be in reach for Zen3 and Raptor Cove (I don't remember if
    all the other resource limits are also satisfied; the load/store unit
    may be at its limit, too).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to Anton Ertl on Tue Aug 19 07:09:51 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    The idea is interesting, but I don't understand the code. The
    following looks funny to me:

    1) You increment edx in increment_edx, then jump back to edx_ready and
    immediately overwrite edx with ebx. Then you do nothing with it,
    and then you clear edx in the next iteration. So both the "inc
    edx" and the "mov edx, ebx" look like dead code to me that can be
    optimized away.

    2) There is a loop-carried dependency through ebx, and the number
    accumulating in ebx and the carry check makes no sense with that.

    Could it be that you wanted to do "mov ebx, edx" at edx_ready? It all
    makes more sense with that. ebx then contains the carry from the last
    cycle on entry. The carry dependency chain starts at clearing edx,
    then gets to additional carries, then is copied to ebx, transferred
    into the next iteration, and is ended there by overwriting ebx. No
    dependency cycles (except the loop counter and addresses, which can be
    dealt with by hardware or by unrolling), and ebx contains the carry
    from the last iteration

    One other problem is that according to Agner Fog's instruction tables,
    even the latest and greatest CPUs from AMD and Intel that he measured
    (Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
    and adc has a latency of 1, so breaking the dependency chain in a
    beneficial way should avoid the use of adc. For our three-summand
    add, it's not clear if adcx and adox can run in the same cycle, but
    looking at your measurements, it is unlikely.

    So we would need something other than "adc edx, edx" to set the carry
    register. According to Agner Fog Zen3 can perform 2 cmovc per cycle
    (and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
    have 1 in edi, and then do, for two-summand addition:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov edx, ebx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never
    incremen_edx:
    inc edx
    jmp edx_ready

    Forgot to fix the "mov edx, ebx" here. One other thing: I think that
    the "add rbx, rax" should be "add rax, rbx". You want to add the
    carry to rax before storing the result. So the version with just one
    iteration would be:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never
    incremen_edx:
    inc edx
    jmp edx_ready

    And the version with the two additional adc-using iterations would be
    (with an additional correction):

    mov edi,1
    xor ebx,ebx
    next:
    mov rax,[rsi+rcx*8]
    add [r8+rcx*8], rax
    mov rax,[rsi+rcx*8+8]
    adc [r8+rcx*8+8], rax
    xor edx, edx
    mov rax,[rsi+rcx*8+16]
    adc rax,[r8+rcx*8+16]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8+16],rax
    add rcx,3
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never
    incremen_edx:
    inc edx
    jmp edx_ready

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Terje Mathisen@21:1/5 to Anton Ertl on Tue Aug 19 12:11:56 2025
    Anton, I like what you and Michael have done, but I'm still not sure
    everything is OK:

    In your code, I only see two input arrays [rsi] and [r8], instead of
    three? (Including [r9])

    Re breaking dependency chains (from Michael):

    In each iteration we have four inputs:

    carry_in from the previous iteration, [rsi+rcx*8], [r8+rcx*8] and
    [r9+rcx*8], and we want to generate [rdi+rcx*8] and the carry_out.

    Assuming effectively random inputs, cin+[rsi]+[r8]+[r9] will result in
    random low-order 64 bits in [rdi], and either 0, 1 or 2 as carry_out.

    In order to break the per-iteration dependency (per Michael), it is
    sufficient to branch out IFF adding cin to the 3-sum produces an
    additional carry:

    ; rdx = cin (0,1,2)
    next:
    mov rbx,rdx ; Save CIN
    xor rdx,rdx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    adc rdx,rdx ; RDX = 0 or 1 (50:50)
    add rax,[r9+rcx*8]
    adc rdx,0 ; RDX = 0, 1 or 2 (33:33:33)

    ; At this point RAX has the 3-sum, now do the cin 0..2 add

    add rax,rbx
    jc fixup ; Pretty much never taken

    save:
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret

    fixup:
    inc rdx
    jmp save

    It would also be possible to use SETC to save the intermediate carries...
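
    (A sketch of that SETC variant, with the same memory operands as above;
    the second scratch byte register is purely for illustration, since rbx
    already holds the saved cin in the loop above:)

    xor edx,edx
    xor ebx,ebx          ; illustration only - pick a free register in real code
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    setc dl              ; first intermediate carry
    add rax,[r9+rcx*8]
    setc bl              ; second intermediate carry
    add edx,ebx          ; RDX = 0, 1 or 2, same as the ADC pair above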

    Terje

    Anton Ertl wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    The idea is interesting, but I don't understand the code. The
    following looks funny to me:

    1) You increment edx in increment_edx, then jump back to edx_ready and
    immediately overwrite edx with ebx. Then you do nothing with it,
    and then you clear edx in the next iteration. So both the "inc
    edx" and the "mov edx, ebx" look like dead code to me that can be
    optimized away.

    2) There is a loop-carried dependency through ebx, and the number
    accumulating in ebx and the carry check makes no sense with that.

    Could it be that you wanted to do "mov ebx, edx" at edx_ready? It all
    makes more sense with that. ebx then contains the carry from the last
    cycle on entry. The carry dependency chain starts at clearing edx,
    then gets to additional carries, then is copied to ebx, transferred
    into the next iteration, and is ended there by overwriting ebx. No
    dependency cycles (except the loop counter and addresses, which can be
    dealt with by hardware or by unrolling), and ebx contains the carry
    from the last iteration

    One other problem is that according to Agner Fog's instruction tables,
    even the latest and greatest CPUs from AMD and Intel that he measured
    (Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
    and adc has a latency of 1, so breaking the dependency chain in a
    beneficial way should avoid the use of adc. For our three-summand
    add, it's not clear if adcx and adox can run in the same cycle, but
    looking at your measurements, it is unlikely.

    So we would need something other than "adc edx, edx" to set the carry
    register. According to Agner Fog Zen3 can perform 2 cmovc per cycle
    (and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
    have 1 in edi, and then do, for two-summand addition:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov edx, ebx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never
    incremen_edx:
    inc edx
    jmp edx_ready

    Forgot to fix the "mov edx, ebx" here. One other thing: I think that
    the "add rbx, rax" should be "add rax, rbx". You want to add the
    carry to rax before storing the result. So the version with just one iteration would be:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never
    incremen_edx:
    inc edx
    jmp edx_ready

    And the version with the two additional adc-using iterations would be
    (with an additional correction):

    mov edi,1
    xor ebx,ebx
    next:
    mov rax,[rsi+rcx*8]
    add [r8+rcx*8], rax
    mov rax,[rsi+rcx*8+8]
    adc [r8+rcx*8+8], rax
    xor edx, edx
    mov rax,[rsi+rcx*8+16]
    adc rax,[r8+rcx*8+16]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8+16],rax
    add rcx,3
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never
    incremen_edx:
    inc edx
    jmp edx_ready

    - anton



    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Michael S@21:1/5 to Anton Ertl on Tue Aug 19 17:20:54 2025
    On Tue, 19 Aug 2025 07:09:51 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    The idea is interesting, but I don't understand the code. The
    following looks funny to me:

    1) You increment edx in increment_edx, then jump back to edx_ready
    and
    immediately overwrite edx with ebx. Then you do nothing with it,
    and then you clear edx in the next iteration. So both the "inc
    edx" and the "mov edx, ebx" look like dead code to me that can be
    optimized away.

    2) There is a loop-carried dependency through ebx, and the number
    accumulating in ebx and the carry check makes no sense with that.

    Could it be that you wanted to do "mov ebx, edx" at edx_ready? It
    all makes more sense with that. ebx then contains the carry from
    the last cycle on entry. The carry dependency chain starts at
    clearing edx, then gets to additional carries, then is copied to
    ebx, transferred into the next iteration, and is ended there by
    overwriting ebx. No dependency cycles (except the loop counter and
    addresses, which can be dealt with by hardware or by unrolling), and
    ebx contains the carry from the last iteration

    One other problem is that according to Agner Fog's instruction
    tables, even the latest and greatest CPUs from AMD and Intel that he
    measured (Zen5 and Tiger Lake) can only execute one adc/adcx/adox
    per cycle, and adc has a latency of 1, so breaking the dependency
    chain in a beneficial way should avoid the use of adc. For our
    three-summand add, it's not clear if adcx and adox can run in the
    same cycle, but looking at your measurements, it is unlikely.

    So we would need something other than "adc edx, edx" to set the carry
    register. According to Agner Fog Zen3 can perform 2 cmovc per cycle
    (and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
    have 1 in edi, and then do, for two-summand addition:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov edx, ebx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never
    incremen_edx:
    inc edx
    jmp edx_ready

    Forgot to fix the "mov edx, ebx" here. One other thing: I think that
    the "add rbx, rax" should be "add rax, rbx". You want to add the
    carry to rax before storing the result. So the version with just one iteration would be:

    Too many back and forth mental switches between Intel and AT&T syntax.
    The real code that I measured was for the Windows platform, but in AT&T
    (gnu) syntax.
    Below is the full function with the loop unrolled by 3. The rest I will
    maybe answer later; right now I don't have time.

    .file "add3_my_u3.s"
    .text
    .p2align 4
    .globl add3
    .def add3; .scl 2; .type 32; .endef
    .seh_proc add3
    add3:
    pushq %r13
    .seh_pushreg %r13
    pushq %r12
    .seh_pushreg %r12
    pushq %rbp
    .seh_pushreg %rbp
    pushq %rdi
    .seh_pushreg %rdi
    pushq %rsi
    .seh_pushreg %rsi
    pushq %rbx
    .seh_pushreg %rbx
    .seh_endprologue
    # %rcx - dst
    # %rdx - a
    # %r8 - b
    # %r9 - c
    sub %rcx, %rdx
    sub %rcx, %r8
    sub %rcx, %r9
    mov $341, %ebx
    xor %eax, %eax
    .loop:
    xor %esi, %esi
    mov (%rcx,%rdx), %rdi
    mov 8(%rcx,%rdx), %rbp
    mov 16(%rcx,%rdx), %r10
    add (%rcx,%r8), %rdi
    adc 8(%rcx,%r8), %rbp
    adc 16(%rcx,%r8), %r10
    adc %esi, %esi
    add (%rcx,%r9), %rdi
    adc 8(%rcx,%r9), %rbp
    adc 16(%rcx,%r9), %r10
    adc $0, %esi
    add %rax, %rdi # add carry from the previous iteration
    jc .prop_carry
    .carry_done:
    mov %esi, %eax
    mov %rdi, (%rcx)
    mov %rbp, 8(%rcx)
    mov %r10, 16(%rcx)
    lea 24(%rcx), %rcx
    dec %ebx
    jnz .loop

    sub $(1023*8), %rcx
    mov %rcx, %rax

    popq %rbx
    popq %rsi
    popq %rdi
    popq %rbp
    popq %r12
    popq %r13
    ret

    .prop_carry:
    add $1, %rbp
    adc $0, %r10
    adc $0, %esi
    jmp .carry_done

    .seh_endproc








    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely.For random inputs-approximately
    never incremen_edx:
    inc edx
    jmp edx_ready

    And the version with the two additional adc-using iterations would be
    (with an additional correction):

    mov edi,1
    xor ebx,ebx
    next:
    mov rax,[rsi+rcx*8]
    add [r8+rcx*8], rax
    mov rax,[rsi+rcx*8+8]
    adc [r8+rcx*8+8], rax
    xor edx, edx
    mov rax,[rsi+rcx*8+16]
    adc rax,[r8+rcx*8+16]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8+16],rax
    add rcx,3
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely.For random inputs-approximately
    never incremen_edx:
    inc edx
    jmp edx_ready

    - anton

  • From Michael S@21:1/5 to All on Tue Aug 19 17:24:23 2025
    Above I posted, by mistake, not the most up-to-date variant, sorry.
    Here is the correct code:

    .file "add3_my_u3.s"
    .text
    .p2align 4
    .globl add3
    .def add3; .scl 2; .type 32; .endef
    .seh_proc add3
    add3:
    pushq %rbp
    .seh_pushreg %rbp
    pushq %rdi
    .seh_pushreg %rdi
    pushq %rsi
    .seh_pushreg %rsi
    pushq %rbx
    .seh_pushreg %rbx
    .seh_endprologue
    # %rcx - dst
    # %rdx - a
    # %r8 - b
    # %r9 - c
    sub %rcx, %rdx
    sub %rcx, %r8
    sub %rcx, %r9
    mov $341, %ebx
    xor %eax, %eax
    .loop:
    xor %esi, %esi
    mov (%rcx,%rdx), %rdi
    mov 8(%rcx,%rdx), %r10
    mov 16(%rcx,%rdx), %r11
    add (%rcx,%r8), %rdi
    adc 8(%rcx,%r8), %r10
    adc 16(%rcx,%r8), %r11
    adc %esi, %esi
    add (%rcx,%r9), %rdi
    adc 8(%rcx,%r9), %r10
    adc 16(%rcx,%r9), %r11
    adc $0, %esi
    add %rax, %rdi # add carry from the previous iteration
    jc .prop_carry
    .carry_done:
    mov %esi, %eax
    mov %rdi, (%rcx)
    mov %r10, 8(%rcx)
    mov %r11, 16(%rcx)
    lea 24(%rcx), %rcx
    dec %ebx
    jnz .loop

    sub $(1023*8), %rcx
    mov %rcx, %rax

    popq %rbx
    popq %rsi
    popq %rdi
    popq %rbp
    ret

    .prop_carry:
    add $1, %r10
    adc $0, %r11
    adc $0, %esi
    jmp .carry_done

    .seh_endproc

  • From Anton Ertl@21:1/5 to Terje Mathisen on Tue Aug 19 17:43:25 2025
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Anton, I like what you and Michael have done, but I'm still not sure
    everything is OK:

    In your code, I only see two input arrays [rsi] and [r8], instead of
    three? (Including [r9])

    I implemented a two-summand addition, not three-summand. I wanted the
    minimum of complexity to make it easier to understand, and latency is
    a bigger problem for the two-summand case.

    It would also be possible to use SETC to save the intermediate carries...

    I must have had a bad morning. Instead of xor edx, edx, setc dl (also
    2 per cycle on Zen3), I wrote

    mov edi,1
    ...
    xor edx, edx
    ...
    cmovc edx, edi

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Michael S@21:1/5 to Anton Ertl on Tue Aug 19 23:03:01 2025
    On Tue, 19 Aug 2025 05:47:01 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    One other problem is that according to Agner Fog's instruction tables,
    even the latest and greatest CPUs from AMD and Intel that he measured
    (Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,

    I didn't measure on either TGL or Zen5, but both Raptor Cove and Zen3
    are certainly capable of more than 1 adcx|adox per cycle.

    Below are execution times of a very heavily unrolled adcx/adox code with
    the dependency broken by a trick similar to the above:

    Platform RC GM SK Z3
    add3_my_adx_u17 244.5 471.1 482.4 407.0

    Considering that there are 2166 adcx/adox/adc instructions, we have the
    following number of adcx/adox/adc instructions per clock:
    Platform RC GM SK Z3
    1.67 1.10 1.05 1.44
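
    (For example, for RC: 244.5 ns at the estimated 5.30 GHz is about 1296
    clock cycles, and 2166 / 1296 = ~1.67; the other columns follow from the
    same calculation with the frequencies listed earlier.)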

    For Gracemont and Skylake there exists a possibility of a small
    measurement error, but Raptor Cove appears to be capable of at least 2
    instructions of this type per clock, while Zen3 is capable of at least
    1.5, but more likely also 2.
    It looks to me like the bottleneck on both RC and Z3 is either the rename
    phase or, more likely, L1$ access. It seems that while Golden/Raptor Cove
    can occasionally issue 3 loads + 2 stores per clock, it cannot sustain
    more than 3 load-or-store accesses per clock.


    Code:

    .file "add3_my_adx_u17.s"
    .text
    .p2align 4
    .globl add3
    .def add3; .scl 2; .type 32; .endef
    .seh_proc add3
    add3:
    pushq %rsi
    .seh_pushreg %rsi
    pushq %rbx
    .seh_pushreg %rbx
    .seh_endprologue
    # %rcx - dst
    # %rdx - a
    # %r8 - b
    # %r9 - c
    sub %rdx, %rcx
    mov %rcx, %r10 # r10 = dst - a
    sub %rdx, %r8 # r8 = b - a
    sub %rdx, %r9 # r9 = c - a
    mov %rdx, %r11 # r11 = a
    mov $60, %edx
    xor %ecx, %ecx
    .p2align 4
    .loop:
    xor %ebx, %ebx # CF <= 0, OF <= 0, EBX <= 0
    mov (%r11), %rsi
    adcx (%r11,%r8), %rsi
    adox (%r11,%r9), %rsi

    mov 8(%r11), %rax
    adcx 8(%r11,%r8), %rax
    adox 8(%r11,%r9), %rax
    mov %rax, 8(%r10,%r11)

    mov 16(%r11), %rax
    adcx 16(%r11,%r8), %rax
    adox 16(%r11,%r9), %rax
    mov %rax, 16(%r10,%r11)

    mov 24(%r11), %rax
    adcx 24(%r11,%r8), %rax
    adox 24(%r11,%r9), %rax
    mov %rax, 24(%r10,%r11)

    mov 32(%r11), %rax
    adcx 32(%r11,%r8), %rax
    adox 32(%r11,%r9), %rax
    mov %rax, 32(%r10,%r11)

    mov 40(%r11), %rax
    adcx 40(%r11,%r8), %rax
    adox 40(%r11,%r9), %rax
    mov %rax, 40(%r10,%r11)

    mov 48(%r11), %rax
    adcx 48(%r11,%r8), %rax
    adox 48(%r11,%r9), %rax
    mov %rax, 48(%r10,%r11)

    mov 56(%r11), %rax
    adcx 56(%r11,%r8), %rax
    adox 56(%r11,%r9), %rax
    mov %rax, 56(%r10,%r11)

    mov 64(%r11), %rax
    adcx 64(%r11,%r8), %rax
    adox 64(%r11,%r9), %rax
    mov %rax, 64(%r10,%r11)

    mov 72(%r11), %rax
    adcx 72(%r11,%r8), %rax
    adox 72(%r11,%r9), %rax
    mov %rax, 72(%r10,%r11)

    mov 80(%r11), %rax
    adcx 80(%r11,%r8), %rax
    adox 80(%r11,%r9), %rax
    mov %rax, 80(%r10,%r11)

    mov 88(%r11), %rax
    adcx 88(%r11,%r8), %rax
    adox 88(%r11,%r9), %rax
    mov %rax, 88(%r10,%r11)

    mov 96(%r11), %rax
    adcx 96(%r11,%r8), %rax
    adox 96(%r11,%r9), %rax
    mov %rax, 96(%r10,%r11)

    mov 104(%r11), %rax
    adcx 104(%r11,%r8), %rax
    adox 104(%r11,%r9), %rax
    mov %rax, 104(%r10,%r11)

    mov 112(%r11), %rax
    adcx 112(%r11,%r8), %rax
    adox 112(%r11,%r9), %rax
    mov %rax, 112(%r10,%r11)

    mov 120(%r11), %rax
    adcx 120(%r11,%r8), %rax
    adox 120(%r11,%r9), %rax
    mov %rax, 120(%r10,%r11)

    lea 136(%r11), %r11

    mov -8(%r11), %rax
    adcx -8(%r11,%r8), %rax
    adox -8(%r11,%r9), %rax
    mov %rax, -8(%r10,%r11)

    mov %ebx, %eax # EAX <= 0
    adcx %ebx, %eax # EAX <= CF, CF <= 0
    adox %ebx, %eax # EAX <= EAX + OF, OF <= 0

    add %rcx, %rsi
    jc .prop_carry
    .carry_done:
    mov %rsi, -136(%r10,%r11)
    mov %eax, %ecx
    dec %edx
    jnz .loop

    # last 3
    mov (%r11), %rax
    mov 8(%r11), %rdx
    mov 16(%r11), %rbx
    add (%r11,%r8), %rax
    adc 8(%r11,%r8), %rdx
    adc 16(%r11,%r8), %rbx
    add (%r11,%r9), %rax
    adc 8(%r11,%r9), %rdx
    adc 16(%r11,%r9), %rbx
    add %rcx, %rax
    adc $0, %rdx
    adc $0, %rbx
    mov %rax, (%r10,%r11)
    mov %rdx, 8(%r10,%r11)
    mov %rbx, 16(%r10,%r11)

    lea (-1020*8)(%r10,%r11), %rax
    popq %rbx
    popq %rsi
    ret

    .prop_carry:
    lea -128(%r10,%r11), %rbx
    xor %ecx, %ecx
    addq $1, (%rbx)
    adc %rcx, 8(%rbx)
    adc %rcx, 16(%rbx)
    adc %rcx, 24(%rbx)
    adc %rcx, 32(%rbx)
    adc %rcx, 40(%rbx)
    adc %rcx, 48(%rbx)
    adc %rcx, 56(%rbx)
    adc %rcx, 64(%rbx)
    adc %rcx, 72(%rbx)
    adc %rcx, 80(%rbx)
    adc %rcx, 88(%rbx)
    adc %rcx, 96(%rbx)
    adc %rcx,104(%rbx)
    adc %rcx,112(%rbx)
    adc %rcx,120(%rbx)
    adc %ecx, %eax
    jmp .carry_done
    .seh_endproc

  • From Terje Mathisen@21:1/5 to Michael S on Wed Aug 20 10:50:39 2025
    Michael S wrote:
    On Tue, 19 Aug 2025 05:47:01 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    One other problem is that according to Agner Fog's instruction tables,
    even the latest and greatest CPUs from AMD and Intel that he measured
    (Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,

    I didn't measure on either TGL or Zen5, but both Raptor Cove and Zen3
    are certainly capable of more than 1 adcx|adox per cycle.

    Below are execution times of a very heavily unrolled adcx/adox code with
    the dependency broken by a trick similar to the above:

    Platform RC GM SK Z3
    add3_my_adx_u17 244.5 471.1 482.4 407.0

    Considering that there are 2166 adcx/adox/adc instructions, we have
    following number of adcx/adox/adc instructions per clock:
    Platform RC GM SK Z3
    1.67 1.10 1.05 1.44

    For Gracemont and Skylake there exists a possibility of a small
    measurement error, but Raptor Cove appears to be capable of at least 2
    instructions of this type per clock, while Zen3 is capable of at least
    1.5, but more likely also 2.
    It looks to me like the bottleneck on both RC and Z3 is either the rename
    phase or, more likely, L1$ access. It seems that while Golden/Raptor Cove
    can occasionally issue 3 loads + 2 stores per clock, it cannot sustain
    more than 3 load-or-store accesses per clock.


    Code:

    .file "add3_my_adx_u17.s"
    .text
    .p2align 4
    .globl add3
    .def add3; .scl 2; .type 32; .endef
    .seh_proc add3
    add3:
    pushq %rsi
    .seh_pushreg %rsi
    pushq %rbx
    .seh_pushreg %rbx
    .seh_endprologue
    # %rcx - dst
    # %rdx - a
    # %r8 - b
    # %r9 - c
    sub %rdx, %rcx
    mov %rcx, %r10 # r10 = dst - a
    sub %rdx, %r8 # r8 = b - a
    sub %rdx, %r9 # r9 = c - a
    mov %rdx, %r11 # r11 = a
    mov $60, %edx
    xor %ecx, %ecx
    .p2align 4
    .loop:
    xor %ebx, %ebx # CF <= 0, OF <= 0, EBX <= 0
    mov (%r11), %rsi
    adcx (%r11,%r8), %rsi
    adox (%r11,%r9), %rsi

    mov 8(%r11), %rax
    adcx 8(%r11,%r8), %rax
    adox 8(%r11,%r9), %rax
    mov %rax, 8(%r10,%r11)

    [snipped the rest]


    Very impressive Michael!

    I particularly like how you are interleaving ADOX and ADCX to gain two
    carry bits without having to save them off to an additional register.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Michael S@21:1/5 to Terje Mathisen on Wed Aug 20 14:16:55 2025
    On Wed, 20 Aug 2025 10:50:39 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:



    Very impressive Michael!

    I particularly like how you are interleaving ADOX and ADCX to gain
    two carry bits without having to save them off to an additional
    register.

    Terje


    It is interesting as an exercise in ADX extension programming, but in
    practice it is only 0-10% faster than the much simpler and smaller code
    presented in the other post, which uses no ISA extensions and so runs on
    every AMD64 CPU since K8.
    I suspect that this result is quite representative of the gains that
    can be achieved with ADX. Maybe, if there is a crypto requirement that
    execution time be independent of the inputs, the gain would be somewhat
    bigger, but even there I would be very surprised to find a 1.5x gain.
    Overall, I think that the time spent by Intel engineers on the invention
    of ADX could have been spent much better.


    Going back to the task of 3-way addition, another approach that can
    utilize the same idea of breaking the data dependency is using SIMD.
    In the case of the 4 cores that I tested, SIMD means AVX2.
    Here are the results of an AVX2 implementation that unrolls by two, i.e.
    512 output bits per iteration of the inner loop.

    Platform RC GM SK Z3
    add3_avxq_u2 226.7 823.3 321.1 309.5

    The speed is about equal to the more unrolled ADX variant on RC, faster
    on Z3, much faster on SK and much slower on GM. Unlike ADX, it runs on
    Intel Haswell and on a few pre-Zen AMD CPUs.

    .file "add3_avxq_u2.s"
    .text
    .p2align 4
    .globl add3
    .def add3; .scl 2; .type 32; .endef
    .seh_proc add3
    add3:
    subq $56, %rsp
    .seh_stackalloc 56
    vmovups %xmm6, 32(%rsp)
    .seh_savexmm %xmm6, 32
    .seh_endprologue
    # %rcx - dst
    # %rdx - a
    # %r8 - b
    # %r9 - c
    sub %rcx, %rdx # %rdx - a-dst
    sub %rcx, %r8 # %r8 - b-dst
    sub %rcx, %r9 # %r9 - c-dst
    vpcmpeqq %ymm6, %ymm6, %ymm6
    vpsllq $63, %ymm6, %ymm6 # ymm6[0:3] = msbit = 2**63
    vpxor %xmm5, %xmm5, %xmm5 # ymm5[0] = carry = 0
    mov $127, %eax
    .loop:
    vpxor (%rdx,%rcx), %ymm6, %ymm0
    # ymm0[0:3] = iA[0:3] = a[0:3] - msbit
    vpxor 32(%rdx,%rcx), %ymm6, %ymm1
    # ymm1[0:3] = iA[4:7] = a[4:7] - msbit
    vpaddq (%r8, %rcx), %ymm0, %ymm2
    # ymm2[0:3] = iSum1[0:3] = iA[0:3]+b[0:3]
    vpaddq 32(%r8, %rcx), %ymm1, %ymm3
    # ymm3[0:3] = iSum1[4:7] = iA[4:7] + b[4:7]
    vpcmpgtq %ymm2, %ymm0, %ymm4
    # ymm4[0:3] = c1[0:3] = iA[0:3] > iSum1[0:3]
    vpaddq (%r9, %rcx), %ymm2, %ymm0
    # ymm0[0:3] = iSum2[0:3] = iSum1[0:3]+c[0:3]
    vpcmpgtq %ymm0, %ymm2, %ymm2
    # ymm2[0:3] = c2[0:3] = iSum1[0:3] > iSum2[0:3]
    vpaddq %ymm4, %ymm2, %ymm2
    # ymm2[0:3] = cSum0[0:3] = c1[0:3]+c2[0:3]
    vpcmpgtq %ymm3, %ymm1, %ymm4
    # ymm4[0:3] = c1[4:7] = iA[4:7] > iSum1[4:7]
    vpaddq 32(%r9, %rcx), %ymm3, %ymm1
    # ymm1[0:3] = iSum2[4:7] = iSum1[4:7] + c[4:7]
    vpcmpgtq %ymm1, %ymm3, %ymm3
    # ymm3[0:3] = c2[4:7] = iSum1[4:7] > iSum2[4:7]
    vpaddq %ymm4, %ymm3, %ymm3
    # ymm3[0:3] = cSum0[4:7] = c1[4:7] + c2[4:7]
    vpermq $0x93, %ymm2, %ymm4
    # ymm4[0:3] = cSum0[3,0:2]
    vpblendd $3, %ymm5, %ymm4, %ymm2
    # ymm2[0:3] = cSum[0:3] = { carry[0], cSum0[0,1,2] }
    vpermq $0x93, %ymm3, %ymm5
    # ymm5[0:3] = cSum0[7,4:6] == carry
    vpblendd $3, %ymm4, %ymm5, %ymm3
    # ymm3[0:3] = cSum[4:7] = { cSum0[3], cSum0[4:6] }
    .add_carry:
    vpsubq %ymm2, %ymm0, %ymm2
    # ymm2[0:3] = iSum3[0:3] = iSum2[0:3] - cSum[0:3]
    vpsubq %ymm3, %ymm1, %ymm3
    # ymm3[0:3] = iSum3[4:7] = iSum2[4:7] - cSum[4:7]
    vpcmpgtq %ymm2, %ymm0, %ymm0
    # ymm0[0:3] = c3[0:3] = iSum2[0:3] > iSum3[0:3]
    vpcmpgtq %ymm3, %ymm1, %ymm1
    # ymm1[0:3] = c3[4:7] = iSum2[4:7] > iSum3[4:7]
    vpor %ymm0, %ymm1, %ymm4
    vptest %ymm4, %ymm4
    jne .prop_carry
    vpxor %ymm2, %ymm6, %ymm0
    # ymm0[0:3] = uSum3[0:3] = iSum3[0:3] + msbit
    vpxor %ymm3, %ymm6, %ymm1
    # ymm1[4:7] = uSum3[4:7] = iSum3[4:7] + msbit
    vmovdqu %ymm0, (%rcx)
    vmovdqu %ymm1, 32(%rcx)
    addq $64, %rcx
    dec %eax
    jnz .loop

    # last 7
    vpxor (%rdx,%rcx), %ymm6, %ymm0
    # ymm0[0:3] = iA[0:3] = a[0:3] - msbit
    vpxor 24(%rdx,%rcx), %ymm6, %ymm1
    # ymm1[0:3] = iA[3:6] = a[3:6] - msbit
    vpaddq (%r8, %rcx), %ymm0, %ymm2
    # ymm2[0:3] = iSum1[0:3] = iA[0:3]+b[0:3]
    vpaddq 24(%r8, %rcx), %ymm1, %ymm3
    # ymm3[0:3] = iSum1[3:6] = iA[3:6] + b[3:6]
    vpcmpgtq %ymm2, %ymm0, %ymm4
    # ymm4[0:3] = c1[0:3] = iA[0:3] > iSum1[0:3]
    vpaddq (%r9, %rcx), %ymm2, %ymm0
    # ymm0[0:3] = iSum2[0:3] = iSum1[0:3]+c[0:3]
    vpcmpgtq %ymm0, %ymm2, %ymm2
    # ymm2[0:3] = c2[0:3] = iSum1[0:3] > iSum2[0:3]
    vpaddq %ymm4, %ymm2, %ymm2
    # ymm2[0:3] = cSum0[0:3] = c1[0:3]+c2[0:3]
    vpcmpgtq %ymm3, %ymm1, %ymm4
    # ymm4[0:3] = c1[3:6] = iA[3:6] > iSum1[3:6]
    vpaddq 24(%r9, %rcx), %ymm3, %ymm1
    # ymm1[0:3] = iSum2[3:6] = iSum1[3:6] + c[3:6]
    vpcmpgtq %ymm1, %ymm3, %ymm3
    # ymm3[0:3] = c2[3:6] = iSum1[3:6] > iSum2[3:6]
    vpaddq %ymm4, %ymm3, %ymm3
    # ymm3[0:3] = cSum[4:7] = cSum0[3:6] = c1[3:6] + c2[3:6]
    vpermq $0x93, %ymm2, %ymm4
    # ymm4[0:3] = cSum0[3,0,1,2]
    vpblendd $3, %ymm5, %ymm4, %ymm2
    # ymm2[0:3] = cSum[0:3] = { carry[0], cSum0[0,1,2] }
    vpermq $0xF9, %ymm1, %ymm1
    # ymm1[0:3] = iSum2[4:6,6]
    .add_carry2:
    vpsubq %ymm2, %ymm0, %ymm2
    # ymm2[0:3] = iSum3[0:3] = iSum2[0:3] - cSum[0:3]
    vpsubq %ymm3, %ymm1, %ymm3
    # ymm3[0:3] = iSum3[4:7] = iSum2[4:7] - cSum[4:7]
    vpcmpgtq %ymm2, %ymm0, %ymm0
    # ymm0[0:3] = c3[0:3] = iSum2[0:3] > iSum3[0:3]
    vpcmpgtq %ymm3, %ymm1, %ymm1
    # ymm1[0:3] = c3[4:7] = iSum2[4:7] > iSum3[4:7]
    vptest %ymm0, %ymm0
    jne .prop_carry2
    vptest %ymm1, %ymm1
    jne .prop_carry2
    vpxor %ymm2, %ymm6, %ymm0
    # ymm0[0:3] = uSum3[0:3] = iSum3[0:3] + msbit
    vpxor %ymm3, %ymm6, %ymm1
    # ymm1[4:7] = uSum3[4:7] = iSum3[4:7] + msbit
    vmovdqu %ymm0, (%rcx)
    vmovdqu %xmm1, 32(%rcx)
    vextractf128 $1, %ymm1, %xmm1
    vmovq %xmm1, 48(%rcx)

    lea -(127*64)(%rcx), %rax
    vzeroupper
    vmovups 32(%rsp), %xmm6
    addq $56, %rsp
    ret

    .prop_carry:
    # input:
    # ymm0[0:3] = c3[0:3]
    # ymm1[0:3] = c3[4:7]
    # ymm2[0:3] = iSum3[0:3]
    # ymm3[0:3] = iSum3[4:7]
    # ymm5[0] = carry
    # output:
    # ymm0[0:3] = iSum2[0:3]
    # ymm1[0:3] = iSum2[4:7]
    # ymm2[0:3] = cSum [0:3]
    # ymm3[0:3] = cSum [4:7]
    # ymm5[0] = carry
    # scratch: ymm4
    vpermq $0x93, %ymm0, %ymm4
    # ymm4[0:3] = c3[3,0,1,2]
    vmovdqa %ymm2, %ymm0
    # ymm0[0:3] = iSum2[0:3] = iSum3[0:3]
    vpermq $0x93, %ymm1, %ymm2
    # ymm2[0:3] = c3[7,4,5,6]
    vpaddq %xmm2, %xmm5, %xmm5
    # ymm5[0] = carry += c3[7]
    vmovdqa %ymm3, %ymm1
    # ymm1[0:3] = iSum2[4:7] = iSum3[4:7]
    vpblendd $3, %ymm4, %ymm2, %ymm3
    # ymm3[0:3] = cSum[4:7] = { c3[3], c3[4,5,6] }
    vpxor %xmm2, %xmm2, %xmm2
    # ymm2[0:3] = 0
    vpblendd $3, %ymm2, %ymm4, %ymm2
    # ymm2[0:3] = cSum[0:3] = { 0, c3[0,1,2] }
    jmp .add_carry

    .prop_carry2:
    # input:
    # ymm0[0:3] = c3[0:3]
    # ymm1[0:3] = c3[4:7]
    # ymm2[0:3] = iSum3[0:3]
    # ymm3[0:3] = iSum3[4:7]
    # ymm5[0] = carry
    # output:
    # ymm0[0:3] = iSum2[0:3]
    # ymm1[0:3] = iSum2[4:7]
    # ymm2[0:3] = cSum [0:3]
    # ymm3[0:3] = cSum [4:7]
    # ymm5[0] = carry
    # scratch: ymm4
    vpermq $0x93, %ymm0, %ymm4
    # ymm4[0:3] = c3[3,0,1,2]
    vmovdqa %ymm2, %ymm0
    # ymm0[0:3] = iSum2[0:3] = iSum3[0:3]
    vpermq $0x93, %ymm1, %ymm2
    # ymm2[0:3] = c3[7,4,5,6]
    vmovdqa %ymm3, %ymm1
    # ymm1[0:3] = iSum2[4:7] = iSum3[4:7]
    vpblendd $3, %ymm4, %ymm2, %ymm3
    # ymm3[0:3] = cSum[4:7] = { c3[3], c3[4,5,6] }
    vpxor %xmm2, %xmm2, %xmm2
    # ymm2[0:3] = 0
    vpblendd $3, %ymm2, %ymm4, %ymm2
    # ymm2[0:3] = cSum[0:3] = { 0, c3[0,1,2] }
    jmp .add_carry2

    .seh_endproc

    AVX2 is rather poorly suited for this task - it lacks unsigned
    comparison instructions, so the first input has to be shifted by
    half-range at the beginning and the result shifted back.
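
    (A scalar illustration of that half-range trick, not from the post
    itself: biasing both operands by 2^63 turns the unsigned comparison that
    detects a carry into the signed comparison that vpcmpgtq provides:)

    #include <stdint.h>

    /* a >u b  <=>  (a ^ 2^63) >s (b ^ 2^63) */
    static inline int gt_u64_via_signed(uint64_t a, uint64_t b)
    {
        const uint64_t bias = UINT64_C(1) << 63;
        return (int64_t)(a ^ bias) > (int64_t)(b ^ bias);
    }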

    AVX-512 can be more suitable. But the only AVX-512 capable CPU that I
    have access to is a mini-PC with a cheap and slow Core i3, used by family
    members almost exclusively for viewing movies. It does not even have a
    minimal programming environment installed.
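
    (For comparison, a sketch - untested, an assumption rather than measured
    code - of how the per-lane carry detection could look with AVX-512F,
    where vpcmpuq gives the unsigned compare directly and no bias is needed:)

    #include <immintrin.h>

    /* per-lane carry out of s = a + b, without the 2^63 bias */
    static inline __mmask8 add_carry_mask(__m512i a, __m512i b, __m512i *sum)
    {
        __m512i s = _mm512_add_epi64(a, b);
        *sum = s;
        return _mm512_cmpgt_epu64_mask(a, s);  /* lane carried iff a > a+b (mod 2^64) */
    }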
