Michael S <
already5chosen@yahoo.com> writes:
Overall, I think that time spent by Intel engineers on invention of ADX
could have been spent much better.
The whitepapers about ADX are about long multiplication and squaring
(for cryptographic uses), and they are from 2012, and ADX was first
implemented in Broadwell (2014), when microarchitectures were quite a
bit narrower than the recent ones.
If you implement the classical long multiplication algorithm, but add
each line to the intermediate sum as you create it, you need an
operation like
intermediateresult += multiplicand*multiplicator[i]
where all parts are multi-precision numbers, but only one word of the multiplicator is used. The whole long multiplication would look
somewhat like:
intermediateresult=0;
for (i=0; i<n; i++) {
intermediateresult += multiplicand*multiplicator[i];
shift intermediate result by one word; /* actually, you will access it at */
/* an offset, but how to express this in this pseudocode? */ }
The operation for a single line can be implemented as:
carry=0;
for (j=0; j<m; j++)
uint128_t d = intermediateresult[j] +
multiplicand[j]*(uint128_t)multiplicator[i] +
(uint128_t)carry;
intermediateresult[j] = d; /* low word */
carry = d >> 64;
}
The computation of d (both words) can be written on AMD64 as:
#multuplicator[i] in rax
mov multiplicator[i], rax
mulq multiplicand[j]
addq intermediateresult[j], rax
adcq $0, rdx
addq carry, rax
adcq $0, rdx
mov rdx, carry
With ADX and BMI2, this can be coded as:
#carry is represented as carry1+C+O
mulx ma, m, carry2
adcx mb, m
adox carry1, m
#carry is represented as carry2+C+O
#unroll by an even factor, and switch roles for carry1 and carry2
Does it matter? We can apply the usual blocking techniques to
perform, say, a 4x4 submultiplication in the registers (probably even
a little larger, but let's stick with these numbers). That's 16
mulx/adox/adcx combinations, loads of 4+4 words of inputs and stores
of 8 words of output. mulx is probably limited to one per cycle, but
if we want to utilize this on a 4-wide machine line the Broadwell, we
must have at most 3 additional instructions per mulx; with ADX, one
additional instruction is adcx, another adox, and the third is either
a load or a store. Any additional overhead, and the code will be
limited by resources other than the multiplier.
On today's CPUs, we can reach the 1 mul/cycle limit with the x86-64-v1
code shown before the ADX code. But then, they might put a second
multiplier in, and we would profit from ADX again.
But Intel seems to have second thoughts on ADX itself. ADX has not
been included in x86-64-4, despite the fact that every CPU that
supports the other extensions of x86-64-4 also supports ADX. And the whitepapers vanish from Intel's web pages. Some time ago I still
found it on
https://www.intel.cn/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf
(i.e., Intel China), but now it's gone there, too. I can still find
it on <
https://raw.githubusercontent.com/wiki/intel/intel-ipsec-mb/doc/ia-large-integer-arithmetic-paper.pdf>
There is another whitepaper on using ADX for squaring numbers, but I
cannot find that. Looking around what Erdinç Öztürk (aka Erdinc
Ozturk) has also written, there's a patent "SIMD integer
multiply-accumulate instruction for multi-precision arithmetic" from
2016, so maybe Intel's thoughts are now into doing it with SIMD
instead of with scalar instructions.
Still, why deemphasize ADX? Do they want to drop it eventually? Why?
They have to support separate renaming of C, O, and the other three
because of instructions that go much farther back. The only way would
be to provide alternatives to these instructions, and then deemphasize
them over time, and eventually rename all flags together (and the old instrutions may then perform slowly).
- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <
c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)