Forum: >>> Magnum BBS <<<

Intel ADX (was: 3-way long addition)

From Anton Ertl@21:1/5 to Michael S on Wed Aug 20 14:08:34 2025

Michael S <already5chosen@yahoo.com> writes:

Overall, I think that time spent by Intel engineers on invention of ADX
could have been spent much better.

The whitepapers about ADX are about long multiplication and squaring
(for cryptographic uses), and they are from 2012, and ADX was first
implemented in Broadwell (2014), when microarchitectures were quite a
bit narrower than the recent ones.

If you implement the classical long multiplication algorithm, but add
each line to the intermediate sum as you create it, you need an
operation like

intermediateresult += multiplicand*multiplicator[i]

where all parts are multi-precision numbers, but only one word of the multiplicator is used. The whole long multiplication would look
somewhat like:

intermediateresult=0;
for (i=0; i<n; i++) {
intermediateresult += multiplicand*multiplicator[i];
shift intermediate result by one word; /* actually, you will access it at */
/* an offset, but how to express this in this pseudocode? */ }

The operation for a single line can be implemented as:

carry=0;
for (j=0; j<m; j++)
uint128_t d = intermediateresult[j] +
multiplicand[j]*(uint128_t)multiplicator[i] +
(uint128_t)carry;
intermediateresult[j] = d; /* low word */
carry = d >> 64;
}

The computation of d (both words) can be written on AMD64 as:

#multuplicator[i] in rax
mov multiplicator[i], rax
mulq multiplicand[j]
addq intermediateresult[j], rax
adcq $0, rdx
addq carry, rax
adcq $0, rdx
mov rdx, carry

With ADX and BMI2, this can be coded as:

#carry is represented as carry1+C+O
mulx ma, m, carry2
adcx mb, m
adox carry1, m
#carry is represented as carry2+C+O
#unroll by an even factor, and switch roles for carry1 and carry2

Does it matter? We can apply the usual blocking techniques to
perform, say, a 4x4 submultiplication in the registers (probably even
a little larger, but let's stick with these numbers). That's 16
mulx/adox/adcx combinations, loads of 4+4 words of inputs and stores
of 8 words of output. mulx is probably limited to one per cycle, but
if we want to utilize this on a 4-wide machine line the Broadwell, we
must have at most 3 additional instructions per mulx; with ADX, one
additional instruction is adcx, another adox, and the third is either
a load or a store. Any additional overhead, and the code will be
limited by resources other than the multiplier.

On today's CPUs, we can reach the 1 mul/cycle limit with the x86-64-v1
code shown before the ADX code. But then, they might put a second
multiplier in, and we would profit from ADX again.

But Intel seems to have second thoughts on ADX itself. ADX has not
been included in x86-64-4, despite the fact that every CPU that
supports the other extensions of x86-64-4 also supports ADX. And the whitepapers vanish from Intel's web pages. Some time ago I still
found it on https://www.intel.cn/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf
(i.e., Intel China), but now it's gone there, too. I can still find
it on <https://raw.githubusercontent.com/wiki/intel/intel-ipsec-mb/doc/ia-large-integer-arithmetic-paper.pdf>

There is another whitepaper on using ADX for squaring numbers, but I
cannot find that. Looking around what Erdinç Öztürk (aka Erdinc
Ozturk) has also written, there's a patent "SIMD integer
multiply-accumulate instruction for multi-precision arithmetic" from
2016, so maybe Intel's thoughts are now into doing it with SIMD
instead of with scalar instructions.

Still, why deemphasize ADX? Do they want to drop it eventually? Why?
They have to support separate renaming of C, O, and the other three
because of instructions that go much farther back. The only way would
be to provide alternatives to these instructions, and then deemphasize
them over time, and eventually rename all flags together (and the old instrutions may then perform slowly).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Gretchiie
  Wed Sep 17 08:54:03 2025
  from Derry, Nh via Telnet
- Bob Worm
  Wed Sep 17 08:43:18 2025
  from Wales, Uk via Telnet
- Bob Worm
  Wed Sep 17 08:14:37 2025
  from Wales, Uk via Telnet
- Volatile_Memory
  Wed Sep 17 07:20:57 2025
  from Des Moines, Iowa via SSH
- Volatile_Memory
  Wed Sep 17 07:17:26 2025
  from Des Moines, Iowa via SSH
- Bob Worm
  Tue Sep 16 21:01:27 2025
  from Wales, Uk via Telnet
- Bob Worm
  Tue Sep 16 15:15:42 2025
  from Wales, Uk via Telnet
- Gretchiie
  Tue Sep 16 05:20:21 2025
  from Derry, Nh via Telnet

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	546
Nodes:	16 (2 / 14)
Uptime:	54:57:16
Calls:	10,397
Calls today:	5
Files:	14,067
Messages:	6,417,417
Posted today:	1

Intel ADX (was: 3-way long addition)

Who's Online

Recent Visitors

System Info