is a set of "legacy floating-point" instructions.
These use the _old_ internal format for IEEE 754 compatible floats. They allow arithmetic with 80-bit temporary reals, which use opcodes
analogous to the ones used for the Medium floating-point format, as well
as single and double precision (32-bit and 64-bit) IEEE 754 floats.
On 7/18/2025 8:48 AM, John Savard wrote:
I know this probably seems crazy, but to me it seems to be the only
reasonable way to deal with these new formats within the architecture
I've set out for my computer designs... with the constraints that
for a given floating-point format, all sizes of variables will share
the same internal exponent size, and the floating-point registers
can't be made bigger than 128 bits, as that would be unreasonable...
and any floating-point format that uses a pair of registers would use
a pair of exponents in the old-fashioned way to keep from switching
formats within a single instruction.
My thoughts on the formats:
Binary32: Good for light duty general-use work;
Sometimes insufficient.
Binary64: Good for general use work.
Almost always sufficient.
Binary128: Overkill.
Also too expensive to really do on FPGA in any "fast" form.
Binary256: Serious overkill.
Practically, likely better to leave Binary128 and Binary256 to software emulation; and instead focus on cheaper ways to make these faster (like, 128-bit and extended precision ADD/SUB and Shift).
Things like my newer BITMOV instructions can also help.
While it is possible to do Binary128 FMUL (and FDIV) via a Shift-and-Add unit, this is going to be slower than doing it in software. Though, considered before (but would be more expensive to implement) could be a Radix-4 or Radix-16 Shift-ADD unit.
In theory, a Radix-16 unit could do:
32-bit IDIV in 10 cycles;
32-bit FDIV in 10 cycles;
64-bit IMUL/IDIV in 20 cycles;
64-bit FDIV in 30 cycles;
128-bit FMUL/FDIV in 60 cycles (with a 256-bit unit).
Where, as noted, the Shift-and-ADD unit can be made to do FPU operations
by setting it up with the mantissas as fixed point numbers and running
the unit for more cycles than used for integer ops. I am not sure if
this is a well known strategy, but seemed "nifty" so I did it this way.
Where, 1-bit (Radix-2) Shift/ADD would take ~ 240 cycles to deal with Binary128 (if the unit were internally widened to 256 bit), which, at
least for FMUL, is slower than a pure software option.
A Radix-16 unit is likely the next possible step:
One of the cheaper options available (as a step-up from Radix-2).
But, would be more expensive to implement.
The simple linear comparison turns into a 4*4->8-bit multiply lookup, plus a need for a division lookup (and/or trying to sort it out with combinatorial logic).
The logic would also need to support 4-bit multiply in the adder stage.
Basically, the 4-bit multiply during ADD and the "how many times does X go
into Y" logic are the main costs. One could turn the question into 16
parallel multiply lookups, but there is a possibly cheaper option: it seems
like it should be possible to decompose it internally into Radix-4
operations (which could be cheaper; the Radix-4 operations fit more nicely
into LUTs).
Effectively, one would need a way to find the quotient of an 8-bit
number divided by a 4-bit number within a single clock cycle. Seemingly,
the most viable way to do this would be Radix-4 combinatorial
logic. I don't know whether it would pass timing; I haven't written or
tested the idea yet.
Nor am I sure that a Radix-16 unit would be worth the cost (hand-wavy estimate is that such a thing could likely cost around 5 or 6 kLUT for a
unit that is 128-bit internally); though with a bulk of the added cost
being due to the Radix-16 multiply-and-add, where the normal A+B
becomes, essentially:
A+((M[0]?B:0)+(M[1]?B<<1:0))+((M[2]?B<<2:0)+(M[3]?B<<3:0))
Though, unclear if it would be cheaper to try to implement it directly
as adders, or to implement it as a Radix-4 or 16 multiplier.
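As a sanity check of the formula above, one radix-16 digit step can be written out directly in C (illustrative only, not RTL; the helper name is invented here):

```c
#include <stdint.h>

/* One radix-16 shift-and-add step: computes A + M*B for a 4-bit digit M
   using only conditional shifted adds, exactly as in the formula above. */
static uint64_t radix16_step(uint64_t a, uint64_t b, unsigned m)
{
    uint64_t t0 = (m & 1) ? b      : 0;  /* M[0]?B:0    */
    uint64_t t1 = (m & 2) ? b << 1 : 0;  /* M[1]?B<<1:0 */
    uint64_t t2 = (m & 4) ? b << 2 : 0;  /* M[2]?B<<2:0 */
    uint64_t t3 = (m & 8) ? b << 3 : 0;  /* M[3]?B<<3:0 */
    return a + (t0 + t1) + (t2 + t3);
}
```

In hardware the four conditional terms would feed a small adder tree rather than sequential adds; the C version just demonstrates that the result equals A + M*B.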
While, theoretically, Radix-8 could be an option, it has one big glaring fault: 64 is not a multiple of 3 (and the last few bits are likely to
wreck the viability of Radix-8). Well, maybe it still works if one
simply sign-or-zero extends 64 to 66. Big advantage of Radix-8 being
that 3*3->6 fits in a LUT6. Though (unlike Radix-4) one can't fit a carry-chain into LUTs (so, the biggest "cost issue" of Radix-16 would
not be addressed with Radix-8).
But the latter could still be an attractive option for other use cases.
It is possible this could be done in an FPGA.
The main issue is how to best do the Horizontal-ADD.
However, this approach would allow doing the main part of the horizontal
add as integer addition in mantissa space (vs an FP-ADD for the final result).
On 7/18/2025 8:48 AM, John Savard wrote:
Initially, when I first heard of the new 128-bit and 256-bit
floating-point
formats defined as part of a revised version of the IEEE 754
floating-point
standard, I revised the floating-point formats supported by the
Concertina
II ISA in the following manner:
I increased the size of the floating-point registers in the
architecture to
512 bits, so that a 512-bit register could contain a 256-bit IEEE 754
float
converted to a temporary-real style format without a hidden first bit;
and in addition to defining a 512-bit temporary real type, I also
defined a
1,024-bit real type which involved the use of a pair of registers,
with an
unused exponent field in the second half. (Think 72-bit double
precision on
the IBM 704, or extended precision on the System/360 Model 85.)
I have decided now to instead do the following:
- keep the size of the floating-point registers at 128 bits;
- continue to convert all floating-point numbers to a temporary real
style
format without a hidden first bit when storing them in those registers.
In order to do this, however, I have decided that while I will not fully
support the new 128-bit standard floating-point format, I still do want
to have interoperability with it.
Therefore, I have changed the temporary real format I use to have an
exponent field that is *one bit smaller* than that of the 8087 temporary
real format.
In this way, my 128-bit floats have _the same precision_ as the new
IEEE 754 standard 128-bit floats, and the Concertina II will provide
additional instructions to load such numbers to, and save such numbers
from, the floating-point registers.
This won't support the new 128-bit floating-point standard, as it will
only cover half of its exponent range. But it will allow computations
with numbers within that part of the range without sacrificing
precision, thus giving a degree of interoperability with computers
that fully support the new IEEE 754 standard.
This may seem like a strange choice to make, but it is the result of
my having thought through the implications of the new floating-point
formats, with the desire to maintain the size of the floating-point
registers at 128 bits.
The 256-bit short vector registers, however, because they're divided
into aliquot parts for shorter integer and floating-point variable types,
do *not* convert 32-bit and 64-bit IEEE 754 floats into an internal form,
but keep the hidden first bit.
Therefore, I still *can*, and will, provide full support for the new
128-bit and 256-bit IEEE 754 floating-point variable types... but only
in the short vector registers!
New instructions will be added that treat the halves of the sixteen
short vector registers as 32 special floating-point registers only
used for numbers in the standard IEEE 754 128-bit floating-point
format.
BGB wrote:
While it is possible to do Binary128 FMUL (and FDIV) via a Shift-and-Add
unit, this is going to be slower than doing it in software. Though,
considered before (but would be more expensive to implement) could be a
Radix-4 or Radix-16 Shift-ADD unit.
fp128/fp256 FMUL is easy to emulate when you have multiple 64x64->128
integer multipliers.
Doing the same for FDIV pretty much requires a reciprocal approach, since
this doubles the precision for each added stage, but it is still so
small that more fancy multiplication approaches (like FFT-based) don't
make sense.
Terje
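A sketch of the multiplier-based approach Terje describes, using the GCC/Clang `unsigned __int128` extension for the 64x64->128 partial products (mantissa product only; exponent and rounding handling are omitted, and the names here are invented for illustration):

```c
#include <stdint.h>

/* 128x128->256-bit multiply built from four 64x64->128 partial products,
   the core of an fp128 FMUL emulation. w[0] is the least significant word. */
typedef struct { uint64_t w[4]; } u256;

static u256 mul128(uint64_t ahi, uint64_t alo, uint64_t bhi, uint64_t blo)
{
    unsigned __int128 ll = (unsigned __int128)alo * blo;
    unsigned __int128 lh = (unsigned __int128)alo * bhi;
    unsigned __int128 hl = (unsigned __int128)ahi * blo;
    unsigned __int128 hh = (unsigned __int128)ahi * bhi;
    u256 r;
    r.w[0] = (uint64_t)ll;
    /* middle column: high half of ll plus low halves of the cross terms */
    unsigned __int128 mid = (ll >> 64) + (uint64_t)lh + (uint64_t)hl;
    r.w[1] = (uint64_t)mid;
    /* third column: carries plus high halves of cross terms and low half of hh */
    unsigned __int128 hi = (mid >> 64) + (lh >> 64) + (hl >> 64) + (uint64_t)hh;
    r.w[2] = (uint64_t)hi;
    r.w[3] = (uint64_t)(hh >> 64) + (uint64_t)(hi >> 64);
    return r;
}
```

For fp128 only a 113x113-bit product is needed, so the same four partial products suffice with the operands left-justified.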
What I found on the Mill is that a few helper ops can make SW emulation run in 2-4x hardware latency instead of 5-10x.
Stefan Monnier wrote:
What I found on the Mill is that a few helper ops can make SW emulation run in 2-4x hardware latency instead of 5-10x.
What are those helper ops?
The most complicated one would take two fp values, classify both (Zero/Subnormal/Normal/Inf/NaN) and return them sorted by magnitude.
Next is an unpacker: fp to sign/(unbiased?) exponent/mantissa including hidden bit.
Finally a packer/rounding unit which combines sign/exp/full mantissa
with guard & sticky bits.
At the very end you use the output of the classifier to select either
the regular result or one given by the special inputs. The special input handling runs in parallel with the normal emulation.
The rest is as outlined by Mitch, i.e just a bunch of regular unsigned integer ops.
Terje
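As a rough illustration of what the "unpacker" helper op does in software (a sketch for binary64; the struct and field names are invented here, not Mill operations):

```c
#include <stdint.h>
#include <string.h>

/* Split an IEEE 754 binary64 into sign, unbiased exponent, and mantissa
   with the hidden bit made explicit, as Terje describes. */
typedef struct { int sign; int exp; uint64_t mant; } unpacked64;

static unpacked64 fp_unpack(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);   /* type-pun safely */
    unpacked64 u;
    u.sign = (int)(bits >> 63);
    int biased    = (int)((bits >> 52) & 0x7FF);
    uint64_t frac = bits & ((1ULL << 52) - 1);
    if (biased == 0) {                /* zero or subnormal: no hidden bit */
        u.exp  = 1 - 1023;
        u.mant = frac;
    } else {                          /* normal: insert the hidden bit */
        u.exp  = biased - 1023;
        u.mant = frac | (1ULL << 52);
    }
    return u;
}
```

The matching "packer" would run this in reverse, folding in guard and sticky bits before rounding; as a single hardware op each direction removes most of the bit-fiddling from the emulation path.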
High Precision:
Binary128 (borderline general use)
How does the speed of software emulation of Binary128 compare to the
speed of "double-double" (i.e. a pair of doubles, thus also using 128bit
but with a slightly shorter mantissa (usually) and smaller exponent)?
Stefan
This is exactly where the conversation diverges from FP128 into
Exact FP arithmetics.
I suspect it would surprise NOBODY here that My 66000 has direct access
to Exact FP arithmetics (via CARRY) that even gets the inexact bit set correctly.
{ double hi, double lo } = FADD( double x, double y );
is 1 instruction (2 if you count CARRY as an instruction instead of
an instruction-modifier.)
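For comparison, the same exact-sum result can be had in portable C with Knuth's 2Sum error-free transformation (six FP operations instead of one instruction; assumes round-to-nearest binary64):

```c
/* 2Sum: hi is the rounded sum of x and y, lo the exact rounding error,
   so hi + lo == x + y exactly. Works for any magnitudes of x and y. */
typedef struct { double hi, lo; } dd;

static dd two_sum(double x, double y)
{
    dd r;
    r.hi = x + y;
    double t = r.hi - x;
    r.lo = (x - (r.hi - t)) + (y - t);  /* recover what rounding discarded */
    return r;
}
```

This is the building block that CARRY collapses into a single pipelined operation.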
On Fri, 18 Jul 2025 13:36:00 -0500, BGB wrote:
Practically, likely better to leave Binary128 and Binary256 to software
emulation; and instead focus on cheaper ways to make these faster ...
Is there much point to continuing with fixed-size formats at these precisions?
Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
How does the speed of software emulation of Binary128 compare to the
speed of "double-double" (i.e. a pair of doubles, thus also using 128bit
but with a slightly shorter mantissa (usually) and smaller exponent)?
I'm not a big fan of that format. IBM used it, but is trying to
get rid of it for POWER at least (which has 128-bit IEEE hardware
support, but with rather low performance).
It has several drawbacks. One of them is that you cannot safely
calculate sqrt(a^2 + b^2) in a higher precision without adjusting
exponents or dividing.
Also, the smallest positive number > 1 would then be
1 + 2.2250738585072014e-308, discarding subnormals.
How do you compare two numbers?
You can get around that by fixing the exponent of the smaller
number so it extends the bigger one without a gap in the
mantissa, and forcing it to have the same sign.
Ill-formed
numbers should then be flagged as an error.
What do you do if one of your numbers is NaN, the other
one not? Do you prescribe the same sign for both numbers?
(Probably yes).
I can understand wanting to the precise results, like in Mitch's
architecture (also prescribed in IEEE, I believe). But as a general
number format... I'd rather have 128-bit IEEE in software, but I
would even more prefer a highly-performing 128-bit IEEE in hardware.
SIMD registers are big enough to hold them.
On 7/18/2025 3:16 PM, MitchAlsup1 wrote:
On Fri, 18 Jul 2025 18:36:00 +0000, BGB wrote:
In my ISA, I have 128-bit shift, though not currently a generic funnel
shift, but can be sorta faked with register MOV's. A funnel shift
instruction could be possible, but would need to be a 4R encoding.
<snip>
But the latter could still be an attractive option for other use cases.
It is possible this could be done in an FPGA.
  The main issue is how to best do the Horizontal-ADD.
Baugh-Wooley or Kogge-Stone.
OK.
Looking.
One tradeoff is that hopefully whatever is done, can be efficiently implemented in terms of FPGA primitives.
On Sat, 19 Jul 2025 19:15:39 +0000, Thomas Koenig wrote:
Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
How does the speed of software emulation of Binary128 compare to the
speed of "double-double" (i.e. a pair of doubles, thus also using 128bit
but with a slightly shorter mantissa (usually) and smaller exponent)?
Well double_double is 1 pipelined FP calculation in My 66000 ISA
whereas FP128 QADD would be about 20 mostly data-dependent
instructions, maybe a few more if all rounding modes are supported.
So a bit more than 20× slowdown. On a machine without the proper
support instructions double the previous number.
It has several drawbacks. One of them is that you cannot safely
calculate sqrt(a^2 + b^2) in a higher precision without adjusting
exponents or dividing.
Something like, but a little more complicated than::
{
if( overflows ( a^2+b^2 ) ) expmod = +64;
else if( underflows( a^2+b^2 ) ) expmod = -64;
else expmod = 0;
a = ADDexponent( a, expmod );
b = ADDexponent( b, expmod );
c = ADDexponent( sqrt( a^2+b^2 ), -expmod );
}
The only REAL hard part is performing the overflow and/or underflow subroutines/macros and dealing with the case where one overflows
and the other underflows--doubles the complexity shown above.
Also, the smallest positive number > 1 would then be
1 + 2.2250738585072014e-308, discarding subnormals.
How do you compare two numbers?
comparison compare( double_double n1, double_double n2 )
{
if( n1.hi > n2.hi ) return greater;
if( n1.hi < n2.hi ) return lesser;
If low overlaps hi the number is ill-formed.
If both are not {infinity, zero, or NaN} the number is ill-formed.
All ill-formed numbers are treated as NaN.
But that is rather hard::
The real complexity arises when low underflows but high
does not OR when hi overflows but low does not.
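The "low overlaps hi" rule above has a compact floating-point formulation: in a canonical double-double, lo must be small enough that hi absorbs it under rounding. A minimal sketch (ignoring the infinity/NaN cases listed above, which a real check must also flag):

```c
/* A double-double (hi, lo) is well-formed when hi == fl(hi + lo),
   i.e. adding lo back does not change hi after rounding. */
static int dd_well_formed(double hi, double lo)
{
    return hi + lo == hi;
}
```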
MitchAlsup1 <mitchalsup@aol.com> schrieb:
How do you compare two numbers?
comparison compare( double_double n1, double_double n2 )
{
if( n1.hi > n2.hi ) return greater;
if( n1.hi < n2.hi ) return lesser;
That fails if the high part overlaps with the low part.
There are two main drawbacks of the double-double format: Complexity
and wasted bits. This is the reason, I believe, that the only
company which used it, IBM, is moving away from that format, and
nobody else uses it; everybody else uses 128-bit IEEE (hardware
in POWER 9+, software emulation for everybody else).
My question is specifically about the performance difference for software-only implementations, especially if you take into account the
extra burden of dealing with NaNs and friends (i.e. clean&robust implementations of the library), so as to see if the
difference is small enough to really "bury" the double-double format.
On 7/19/2025 12:32 PM, Stefan Monnier wrote:
Indeed, the hardware can provide specific support for double-double, but
even with regular hardware (mostly FMACC), IIUC you can get decent
performance, hence the question.
Does have the drawback though of (AFAIK) needing single-rounded FMA,
rather than being able to use double-rounded.
In turn, single-rounded has the costs of needing a full-width multiplier
(~ 2x the mantissa width internally) and a mantissa large enough to
deal with around 3x the destination width for the adder stage (to deal
with "cancellation"). In turn, leading to a significantly more expensive
FPU.
Well, vs, say:
FMUL that only generates the high order bits;
  (is not IEEE compliant.)
FADD that allows bits to fall off the bottom;
  (is not 754 compliant.)
So, cancellation scenarios may reveal the bits that no longer exist.
  (Think it through again.)
...
So, alas, this is something that won't really work with my existing FPU
design (and "fixing" this would likely be too expensive).
....
Stefan
MitchAlsup1 <mitchalsup@aol.com> schrieb:
Something like, but a little more complicated than::
{
if( overflows ( a^2+b^2 ) ) expmod = +64;
else if( underflows( a^2+b^2 ) ) expmod = -64;
else expmod = 0;
a = ADDexponent( a, expmod );
b = ADDexponent( b, expmod );
c = ADDexponent( sqrt( a^2+b^2 ), -expmod );
}
That means that a naive programmer will not do it, and since
scientific and engineering programmers tend not to do this
kind of thing, it will very likely not happen in numerical code.
The only REAL hard part is performing the overflow and/or underflow
subroutines/macros and dealing with the case where one overflows
and the other underflows--doubles the complexity shown above.
.... even more so.
How do you compare two numbers?
comparison compare( double_double n1, double_double n2 )
{
if( n1.hi > n2.hi ) return greater;
if( n1.hi < n2.hi ) return lesser;
That fails if the high part overlaps with the low part. Then
to trap, adjust or handle accordingly; adjusting especially
on a binary read...
On Sat, 19 Jul 2025 20:50:33 +0000, BGB wrote:
On 7/18/2025 3:16 PM, MitchAlsup1 wrote:
On Fri, 18 Jul 2025 18:36:00 +0000, BGB wrote:
In my ISA, I have 128-bit shift, though not currently a generic funnel
shift, but can be sorta faked with register MOV's. A funnel shift
instruction could be possible, but would need to be a 4R encoding.
While the shift count is limited to 64, one can do rather
arbitrarily large shifts with My ISA. A 256-bit shift::
    CARRY   R14,{{O}{IO}{IO}{I}}
    SL   R11,R1,Rshift
    SL   R12,R2,Rshift
    SL   R13,R3,Rshift
    SLs   R14,R4,Rshift
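In plain C, the same word-by-word carry chain looks like this (a sketch for shift counts 0..63, matching the per-word limit in the CARRY example above):

```c
#include <stdint.h>

/* 256-bit left shift over four 64-bit words, least significant first.
   Each word takes in the bits shifted out of the word below it. */
static void shl256(uint64_t w[4], unsigned s)   /* 0 <= s <= 63 */
{
    for (int i = 3; i > 0; i--)
        w[i] = (w[i] << s) | (s ? w[i-1] >> (64 - s) : 0);
    w[0] <<= s;
}
```

The `s ? ... : 0` guard avoids the undefined-behavior shift by 64 when s is zero; the CARRY modifier does the equivalent bit-forwarding in hardware.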
Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
[...]
My question is specifically about the performance difference for
software-only implementations, especially if you take into account the
I can give you numbers comparing IEEE with hardware support with
IBM long double on the same machine. For a quick & dirty matmul
benchmark (below) I get, with IBM long double
16
MFlops: 173.6
and with IEEE
16
MFlops: 437.3
For comparison, my home box (totally different) gives me
16
MFlops: 95.9
Thomas Koenig [2025-07-20 16:20:51] wrote:
I can give you numbers comparing IEEE with hardware support with
IBM long double on the same machine. For a quick & dirty matmul
benchmark (below) I get, with IBM long double
16
MFlops: 173.6
and with IEEE
16
MFlops: 437.3
[ Not sure what "16" is about. ]
So, IIUC this shows hardware-supported Binary128 to be about 3x faster
than software-only double-double.
For comparison, my home box (totally different) gives me
16
MFlops: 95.9
Is this hard Binary128, soft Binary128, or soft double-double?