• Intel ADX (was: 3-way long addition)

    From Anton Ertl@21:1/5 to Michael S on Wed Aug 20 14:08:34 2025
    Michael S <already5chosen@yahoo.com> writes:
    Overall, I think that time spent by Intel engineers on invention of ADX
    could have been spent much better.

    The whitepapers about ADX are about long multiplication and squaring
    (for cryptographic uses), and they are from 2012, and ADX was first
    implemented in Broadwell (2014), when microarchitectures were quite a
    bit narrower than the recent ones.

    If you implement the classical long multiplication algorithm, but add
    each line to the intermediate sum as you create it, you need an
    operation like

    intermediateresult += multiplicand*multiplicator[i]

    where all parts are multi-precision numbers, but only one word of the multiplicator is used. The whole long multiplication would look
    somewhat like:

    intermediateresult=0;
    for (i=0; i<n; i++) {
    intermediateresult += multiplicand*multiplicator[i];
    shift intermediate result by one word; /* actually, you will access it at */
    /* an offset, but how to express this in this pseudocode? */ }

    The operation for a single line can be implemented as:

    carry=0;
    for (j=0; j<m; j++)
    uint128_t d = intermediateresult[j] +
    multiplicand[j]*(uint128_t)multiplicator[i] +
    (uint128_t)carry;
    intermediateresult[j] = d; /* low word */
    carry = d >> 64;
    }

    The computation of d (both words) can be written on AMD64 as:

    #multuplicator[i] in rax
    mov multiplicator[i], rax
    mulq multiplicand[j]
    addq intermediateresult[j], rax
    adcq $0, rdx
    addq carry, rax
    adcq $0, rdx
    mov rdx, carry

    With ADX and BMI2, this can be coded as:

    #carry is represented as carry1+C+O
    mulx ma, m, carry2
    adcx mb, m
    adox carry1, m
    #carry is represented as carry2+C+O
    #unroll by an even factor, and switch roles for carry1 and carry2

    Does it matter? We can apply the usual blocking techniques to
    perform, say, a 4x4 submultiplication in the registers (probably even
    a little larger, but let's stick with these numbers). That's 16
    mulx/adox/adcx combinations, loads of 4+4 words of inputs and stores
    of 8 words of output. mulx is probably limited to one per cycle, but
    if we want to utilize this on a 4-wide machine line the Broadwell, we
    must have at most 3 additional instructions per mulx; with ADX, one
    additional instruction is adcx, another adox, and the third is either
    a load or a store. Any additional overhead, and the code will be
    limited by resources other than the multiplier.

    On today's CPUs, we can reach the 1 mul/cycle limit with the x86-64-v1
    code shown before the ADX code. But then, they might put a second
    multiplier in, and we would profit from ADX again.

    But Intel seems to have second thoughts on ADX itself. ADX has not
    been included in x86-64-4, despite the fact that every CPU that
    supports the other extensions of x86-64-4 also supports ADX. And the whitepapers vanish from Intel's web pages. Some time ago I still
    found it on https://www.intel.cn/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf
    (i.e., Intel China), but now it's gone there, too. I can still find
    it on <https://raw.githubusercontent.com/wiki/intel/intel-ipsec-mb/doc/ia-large-integer-arithmetic-paper.pdf>

    There is another whitepaper on using ADX for squaring numbers, but I
    cannot find that. Looking around what Erdinç Öztürk (aka Erdinc
    Ozturk) has also written, there's a patent "SIMD integer
    multiply-accumulate instruction for multi-precision arithmetic" from
    2016, so maybe Intel's thoughts are now into doing it with SIMD
    instead of with scalar instructions.

    Still, why deemphasize ADX? Do they want to drop it eventually? Why?
    They have to support separate renaming of C, O, and the other three
    because of instructions that go much farther back. The only way would
    be to provide alternatives to these instructions, and then deemphasize
    them over time, and eventually rename all flags together (and the old instrutions may then perform slowly).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)