• My Response to the New 128-bit IEEE 754 Floating-Point Format

    From John Savard@21:1/5 to All on Fri Jul 18 13:48:58 2025
When I first heard of the new 128-bit and 256-bit floating-point formats defined as part of a revised version of the IEEE 754 floating-point standard, I revised the floating-point formats supported by the Concertina II ISA in the following manner:

    I increased the size of the floating-point registers in the architecture to
    512 bits, so that a 512-bit register could contain a 256-bit IEEE 754 float converted to a temporary-real style format without a hidden first bit;

    and in addition to defining a 512-bit temporary real type, I also defined a 1,024-bit real type which involved the use of a pair of registers, with an unused exponent field in the second half. (Think 72-bit double precision on
    the IBM 704, or extended precision on the System/360 Model 85.)

    I have decided now to instead do the following:

    - keep the size of the floating-point registers at 128 bits;

    - continue to convert all floating-point numbers to a temporary real style format without a hidden first bit when storing them in those registers.

    In order to do this, however, I have decided that while I will not fully support the new 128-bit standard floating-point format, I still do want
    to have interoperability with it.

    Therefore, I have changed the temporary real format I use to have an
    exponent field that is *one bit smaller* than that of the 8087 temporary
    real format.

    In this way, my 128-bit floats have _the same precision_ as the new
    IEEE 754 standard 128-bit floats, and the Concertina II will provide
    additional instructions to load such numbers to, and save such numbers
    from, the floating-point registers.

    This won't support the new 128-bit floating-point standard, as it will
    only cover half of its exponent range. But it will allow computations
    with numbers within that part of the range without sacrificing
    precision, thus giving a degree of interoperability with computers
    that fully support the new IEEE 754 standard.
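For concreteness, the repacking this implies can be sketched in Python. The field layout is my own inference from the stated constraints (1 sign + 14 exponent + 113 explicit-bit mantissa = 128 bits, so the precision matches binary128's 113 bits while the exponent covers half the range); special values simply raise here instead of being handled.

```python
# Sketch: IEEE 754 binary128 (1 sign + 15 exponent + 112 fraction,
# hidden leading bit) repacked into a 128-bit internal format with
# 1 sign + 14 exponent + 113-bit explicit-bit mantissa.
# Field placement and bias are my assumptions, not a published layout.

IEEE_BIAS = 16383            # binary128 exponent bias
INT_BIAS = 8191              # assumed bias for the 14-bit internal exponent

def ieee128_to_internal(bits: int) -> int:
    sign = (bits >> 127) & 1
    exp = (bits >> 112) & 0x7FFF
    frac = bits & ((1 << 112) - 1)
    if not (0 < exp < 0x7FFF):
        raise ValueError("zero/subnormal/Inf/NaN need special handling")
    e = exp - IEEE_BIAS
    if not (-INT_BIAS < e <= INT_BIAS):
        raise ValueError("outside the supported half of the exponent range")
    mantissa = (1 << 112) | frac          # make the leading bit explicit
    return (sign << 127) | ((e + INT_BIAS) << 113) | mantissa
```

Loading a standard binary128 thus fails only for the outer half of its exponent range, which is exactly the trade-off described above.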

    This may seem like a strange choice to make, but it is the result of
    my having thought through the implications of the new floating-point
    formats, with the desire to maintain the size of the floating-point
    registers at 128 bits.

    The 256-bit short vector registers, however, because they're divided
    into aliquot parts for shorter integer and floating-point variable types,
    do *not* convert 32-bit and 64-bit IEEE 754 floats into an internal form,
    but keep the hidden first bit.

    Therefore, I still *can*, and will, provide full support for the new
    128-bit and 256-bit IEEE 754 floating-point variable types... but only
    in the short vector registers!

New instructions will be added that treat the halves of the sixteen
    short vector registers as 32 special floating-point registers only
    used for numbers in the standard IEEE 754 128-bit floating-point
    format.

    I know this probably seems crazy, but to me it seems to be the only
    reasonable way to deal with these new formats within the architecture
    I've set out for my computer designs... with the constraints that
    for a given floating-point format, all sizes of variables will share
    the same internal exponent size, and the floating-point registers
    can't be made bigger than 128 bits, as that would be unreasonable...
    and any floating-point format that uses a pair of registers would use
    a pair of exponents in the old-fashioned way to keep from switching
    formats within a single instruction.

John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to All on Fri Jul 18 14:33:14 2025
    I have realized that I missed another wrinkle that responding to
    the revised standard might cause.
    As one of the exotic features that I intend to consider supporting
    is the ability to operate with a different width of memory than the
    normal 64-bit and power-of-two multiples thereof typically in current
    use...
    this meant that I had intended for the long vector registers to
    support long vectors of 72-bit floats as well as 64-bit floats.
    Because Univac once had a Cray-style vector attachment to their
    mainframes, and how dare I fail to provide a capability that computers
    once had, but since have lost!
Not that I intend to provide, with Concertina II (*unlike the original Concertina*), the ability to do, say, _sterling_ arithmetic. (Even in
    the original Concertina, this was only provided within the context of
    a more general mixed-radix capability. I am not _completely_ deranged,
    despite appearances. Or, at least, so I claim.)
    Well, if I'm going to support 72-bit floats, I may as well support
    vectors of 80-bit 8087-style temporary reals too while I'm at it.

And so what I clearly need to do is provide a set of "legacy
floating-point" instructions: not in the primary 32-bit instructions,
but only in the operate instructions with larger opcode fields, maybe
in the 32-bit supplementary load and store instructions if there's
room, and otherwise out there in 48-bit instruction land.
    These use the _old_ internal format for IEEE 754 compatible floats.
    They allow arithmetic with 80-bit temporary reals, which use opcodes
    analogous to the ones used for the Medium floating-point format, as
    well as single and double precision (32-bit and 64-bit) IEEE 754
    floats.
    That way, the long vector arithmetic unit, even when it's working with
80-bit floats, can fully interoperate with the scalar arithmetic
    unit, even if involving the *short* vector arithmetic unit in the
    computation (which goes the opposite way by always hiding the first
    bit of the mantissa) is right out when 80-bit temporary reals are being
    used.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to John Savard on Fri Jul 18 17:25:24 2025
    On Fri, 18 Jul 2025 14:33:14 +0000, John Savard wrote:

    is a set of "legacy floating-point" instructions.
    These use the _old_ internal format for IEEE 754 compatible floats. They allow arithmetic with 80-bit temporary reals, which use opcodes
    analogous to the ones used for the Medium floating-point format, as well
    as single and double precision (32-bit and 64-bit) IEEE 754 floats.

    What this glimpse into my bizarre thought processes may suggest is
    the following:

    If there were some kind of electronic component that could be
    easily fabricated onto a microprocessor chip made out of silicon,
    whether it was based on SCR (silicon-controlled rectifier)
    technology, or it was something exotic like a memristor, that was
    analogous to a mechanical relay in that it directed current down
    one of two paths *with no gate delay* at the cost of being a bit
    slow in switching from one path to the other...

    I'd be interested in hearing about it, because there are places
    in future designs of processors for my ISA where such a thing could
    be quite useful.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to John Savard on Fri Jul 18 19:30:34 2025
    On Fri, 18 Jul 2025 14:33:14 +0000, John Savard wrote:

    is a set of "legacy floating-point" instructions.

    I have now realized the error of my ways here.
Instead of shortening the exponent field to allow interoperability
between my own 128-bit format and the standard one, I need to place
more emphasis on compatibility.
    So instead, while my internal 128-bit floating-point format will continue
    to not have a hidden first bit, its exponent field, instead of shrinking by
    one bit, will grow to match that of the standard 256-bit floating-point
    format.
    That way, the standard 128-bit and 256-bit formats can be supported by
    the regular floating-point registers as well, they will just use more
    registers than their lengths would indicate. Given that there are
    32 floating-point registers, it should still be possible to manage.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Fri Jul 18 20:16:19 2025
    On Fri, 18 Jul 2025 18:36:00 +0000, BGB wrote:

    On 7/18/2025 8:48 AM, John Savard wrote:
    -------------

    I know this probably seems crazy, but to me it seems to be the only
    reasonable way to deal with these new formats within the architecture
    I've set out for my computer designs... with the constraints that
    for a given floating-point format, all sizes of variables will share
    the same internal exponent size, and the floating-point registers
    can't be made bigger than 128 bits, as that would be unreasonable...
    and any floating-point format that uses a pair of registers would use
    a pair of exponents in the old-fashioned way to keep from switching
    formats within a single instruction.


My thoughts on the formats:
  Binary32: Good for light duty general-use work;
    Sometimes insufficient.
  Binary64: Good for general use work.
    Almost always sufficient.
  Binary128: Overkill.
    Also too expensive to really do on FPGA in any "fast" form.
  Binary256: Serious overkill.

I wanted My 66000 to be very efficient at 64-bits with reasonable
efficiency for occasional 128-bit stuff.

    Practically, likely better to leave Binary128 and Binary256 to software emulation; and instead focus on cheaper ways to make these faster (like, 128-bit and extended precision ADD/SUB and Shift).
    Things like my newer BITMOV instructions can also help.

    Efficient IMUL 128, 192, and 256 makes these a lot more reasonable to
    emulate. Insert and extract instructions help on the side, along with find-first as prelude to normalization (large 128, 192, 256 shifts)
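The find-first step amounts to a count-leading-zeros followed by a shift; a minimal sketch of that normalization prelude (my illustration, using Python integers to stand in for a wide mantissa):

```python
# After an emulated wide FP add/subtract the raw mantissa may carry
# leading zeros; find-first-set gives the shift needed to renormalize,
# and the exponent is adjusted by the same amount.

def normalize(mant: int, exp: int, width: int = 128):
    """Shift mant left until its top bit (bit width-1) is set."""
    if mant == 0:
        return 0, 0                       # true zero: nothing to normalize
    shift = width - mant.bit_length()     # leading-zero count in `width` bits
    return mant << shift, exp - shift
```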

    While it is possible to do Binary128 FMUL (and FDIV) via a Shift-and-Add unit, this is going to be slower than doing it in software. Though, considered before (but would be more expensive to implement) could be a Radix-4 or Radix-16 Shift-ADD unit.



In theory, a Radix-16 unit could do:
  32-bit IDIV in 10 cycles;
  32-bit FDIV in 10 cycles;
  64-bit IMUL/IDIV in 20 cycles;
  64-bit FDIV in 30 cycles;
  128-bit FMUL/FDIV in 60 cycles (with a 256-bit unit).


    Where, as noted, the Shift-and-ADD unit can be made to do FPU operations
    by setting it up with the mantissas as fixed point numbers and running
    the unit for more cycles than used for integer ops. I am not sure if
    this is a well known strategy, but seemed "nifty" so I did it this way.
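A toy model of that strategy, assuming a plain radix-2 unit (one partial-product add per simulated clock), shows why the cycle count tracks the mantissa width:

```python
# Radix-2 shift-and-add multiplier model.  Feeding it the mantissas of
# two floats as fixed-point numbers (and adding the exponents
# separately) turns the same loop into the core of an FMUL.

def shift_add_mul(a: int, b: int, width: int):
    """Multiply a by the low `width` bits of b; returns (product, cycles)."""
    acc, cycles = 0, 0
    for i in range(width):        # one iteration ~ one clock cycle
        if (b >> i) & 1:
            acc += a << i
        cycles += 1
    return acc, cycles
```

With 113-bit Binary128 mantissas this loop alone is 113 cycles, consistent with radix-2 losing to a software routine built from 64×64 multiplies.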


    Where, 1-bit (Radix-2) Shift/ADD would take ~ 240 cycles to deal with Binary128 (if the unit were internally widened to 256 bit), which, at
    least for FMUL, is slower than a pure software option.

    A pair of radix-2 FUs and a ROM can implement all of CORDIC--just as
    slow as you illustrate.

    A Radix-16 unit is likely the next possible step:
    One of the cheaper options available (as a step-up from Radix-2).

I know of 1 fully pipelined FP32 FDIV unit using radix-4 with 10
cycle latency and 1 cycle throughput. I know of one Radix-8 FDIV
unit (an ECL design). Are there any Radix-16; does that even fit
a 16-gate cycle time ??

    In any event, once you get 10-ish bits, Newton-Raphson or Goldschmidt
    are faster (latency).

    But, would be more expensive to implement.

    Every step in radix doubles the lookahead hardware {1->3->7->15};
so, a radix-16 DIV is 15× as big as a radix-2 (not including the adder)

The simple linear comparison would turn into a 4*4->8 bit multiply lookup, and a need for a division lookup (and/or trying to sort it out with combinatorial logic).
    The logic would also need to support 4-bit multiply in the adder stage.

TI has a bunch of patents in this corner of arithmetic.


    Basically, the 4b multiply during ADD, and "how many times does X go
    into Y" logic being the main costs. One could turn the question into 16 parallel multiply lookups, but there is a possible cheaper option. Seems
    like it should be possible to decompose it internally to Radix-4
    operations (could be cheaper; the Radix-4 operations fit nicer into
    LUTs).

Effectively, one would need a way to find the quotient of an 8-bit
number divided by a 4-bit number within a single clock-cycle. And, seemingly
    the most viable way to try to do this would be Radix-4 combinatorial
    logic. Don't know whether it would pass timing, haven't written or
    tested the idea yet.

    Nor am I sure that a Radix-16 unit would be worth the cost (hand-wavy estimate is that such a thing could likely cost around 5 or 6 kLUT for a
    unit that is 128-bit internally); though with a bulk of the added cost
    being due to the Radix-16 multiply-and-add, where the normal A+B
    becomes, essentially:
    A+((M[0]?B:0)+(M[1]?B<<1:0))+((M[2]?B<<2:0)+(M[3]?B<<3:0))

    Though, unclear if it would be cheaper to try to implement it directly
    as adders, or to implement it as a Radix-4 or 16 multiplier.
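The identity behind that multiply-and-add step can be checked directly (my illustration): the radix-16 stage needs only shifted copies of B, selected by the digit bits and summed into the accumulator.

```python
# A + ((M[0]?B:0)+(M[1]?B<<1:0)) + ((M[2]?B<<2:0)+(M[3]?B<<3:0))
# equals A + M*B for every 4-bit digit M.

def radix16_step(a: int, b: int, m: int) -> int:
    t0 = (b if m & 1 else 0) + ((b << 1) if m & 2 else 0)
    t1 = ((b << 2) if m & 4 else 0) + ((b << 3) if m & 8 else 0)
    return a + t0 + t1
```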

Everybody I know gave up at Radix-8.

    While, theoretically, Radix-8 could be an option, it has one big glaring fault: 64 is not a multiple of 3 (and the last few bits are likely to
    wreck the viability of Radix-8). Well, maybe it still works if one
    simply sign-or-zero extends 64 to 66. Big advantage of Radix-8 being
    that 3*3->6 fits in a LUT6. Though (unlike Radix-4) one can't fit a carry-chain into LUTs (so, the biggest "cost issue" of Radix-16 would
    not be addressed with Radix-8).

    <snip>

But, the latter could still be an attractive option for other use
cases, though.

    It is possible this could be done in an FPGA.
    The main issue is how to best do the Horizontal-ADD.

Baugh-Wooley or Kogge-Stone.

    However, this approach would allow doing the main part of the horizontal
    add as integer addition in mantissa space (vs an FP-ADD for the final result).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to BGB on Fri Jul 18 22:17:30 2025
    BGB wrote:
    On 7/18/2025 8:48 AM, John Savard wrote:
<snip>
    Practically, likely better to leave Binary128 and Binary256 to software emulation; and instead focus on cheaper ways to make these faster (like, 128-bit and extended precision ADD/SUB and Shift).
    Things like my newer BITMOV instructions can also help.

    What I found on the Mill is that a few helper ops can make SW emulation
    run in 2-4x hardware latency instead of 5-10x.


    While it is possible to do Binary128 FMUL (and FDIV) via a Shift-and-Add unit, this is going to be slower than doing it in software. Though, considered before (but would be more expensive to implement) could be a Radix-4 or Radix-16 Shift-ADD unit.

    fp128/fp256 FMUL is easy to emulate when you have multiple 64x64->128
    integer multipliers.

    Doing the same for FDIV pretty much require a reciprocal approach, since
    this doubles the precision for each added stage, but it is still so
    small that more fancy multiplication approaches (like FFT-based) don't
    make sense.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Terje Mathisen on Fri Jul 18 21:47:00 2025
    On Fri, 18 Jul 2025 20:17:30 +0000, Terje Mathisen wrote:

    BGB wrote:
    -----------------
    While it is possible to do Binary128 FMUL (and FDIV) via a Shift-and-Add
    unit, this is going to be slower than doing it in software. Though,
    considered before (but would be more expensive to implement) could be a
    Radix-4 or Radix-16 Shift-ADD unit.

    fp128/fp256 FMUL is easy to emulate when you have multiple 64x64->128
    integer multipliers.

Basically you disassemble FP128 into Sign, Exponent<14:0>, and
create operational fractions {1,Fract<111:0>}

    Result exponent = s1.exponent + s2.exponent - Bias;
    Sign = XOR( s1.sign, s2.sign );

Then do 4 multiplies {64×64 -> 128, 64×48 -> 112, 48×64 -> 112, and
48×48 -> 96}

do 7 64-bit additions, being careful with carry propagation,
    and you have a 224-bit product

    without denorms, you have a potential 1-bit shift

    Round = Choose to increment or not

    And assemble the result.
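The four-multiply decomposition is ordinary schoolbook splitting at a limb boundary; a sketch with Python integers (splitting at 64 bits is my simplification of the 64/48 split above, and Python's big integers hide the explicit carry-careful 64-bit additions a real implementation performs):

```python
# Four partial products reproduce the full wide product of two mantissas.

M64 = (1 << 64) - 1

def mul_by_limbs(a: int, b: int) -> int:
    ahi, alo = a >> 64, a & M64
    bhi, blo = b >> 64, b & M64
    hh = ahi * bhi                # high x high
    hl = ahi * blo                # the two cross terms land in the
    lh = alo * bhi                #   middle, 64-bit-aligned position
    ll = alo * blo                # low x low
    return (hh << 128) + ((hl + lh) << 64) + ll
```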

    Doing the same for FDIV pretty much require a reciprocal approach, since
    this doubles the precision for each added stage, but it is still so
    small that more fancy multiplication approaches (like FFT-based) don't
    make sense.

    In general, convert FP128 into 1/2 <= FP64.fraction < 1.0
Do FDIV64 and use this as the first 52 bits of the FP128 result.
    Then 2 (or is it 3) Newton-Raphson steps in FP128.
    And assemble result
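A numeric sketch of that seed-then-iterate scheme (my illustration, modelling values exactly with `fractions`): the double-precision seed reciprocal carries ~53 correct bits, and each Newton-Raphson step r' = r*(2 - d*r) roughly squares the relative error (53 -> ~106 -> ~212 bits), so two steps clear binary128's 113-bit precision.

```python
from fractions import Fraction

def recip_nr(d: Fraction, steps: int) -> Fraction:
    # FP64-quality seed; assumes d is scaled near 1, as a mantissa would be
    r = Fraction(round((1 / float(d)) * 2**53), 2**53)
    for _ in range(steps):
        r = r * (2 - d * r)       # Newton-Raphson refinement step
    return r
```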


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Fri Jul 18 17:54:30 2025
    What I found on the Mill is that a few helper ops can make SW emulation run in 2-4x hardware latency instead of 5-10x.

    What are those helper ops?


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to John Savard on Fri Jul 18 23:19:07 2025
    John Savard <quadibloc@invalid.invalid> wrote:
    I have decided now to instead do the following:

    - keep the size of the floating-point registers at 128 bits;

    - continue to convert all floating-point numbers to a temporary real style format without a hidden first bit when storing them in those registers.

Why not increase the registers to 129 bits to make room for an explicit leading bit?

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to Terje Mathisen on Sat Jul 19 00:44:44 2025
    On Fri, 18 Jul 2025 22:17:30 +0200, Terje Mathisen wrote:

    but it is still so
    small that more fancy multiplication approaches (like FFT-based) don't
    make sense.

    That's something I can agree with. FFT multiplication is what you
    would use if you were writing a program to calculate the value
    of pi to one million digits - or more. The inherent overhead of
    that technique definitely rules it out for ordinary arithmetic -
    even, say, on 512-bit numbers.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to BGB on Sat Jul 19 04:40:48 2025
    On Fri, 18 Jul 2025 13:36:00 -0500, BGB wrote:

    Practically, likely better to leave Binary128 and Binary256 to software emulation; and instead focus on cheaper ways to make these faster ...

    Is there much point to continuing with fixed-size formats at these
    precisions?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Stefan Monnier on Sat Jul 19 15:24:27 2025
    Stefan Monnier wrote:
What I found on the Mill is that a few helper ops can make SW emulation run in 2-4x hardware latency instead of 5-10x.

    What are those helper ops?

    The most complicated one would take two fp values, classify both (Zero/Subnormal/Normal/Inf/NaN) and return them sorted by magnitude.

    Next is an unpacker: fp to sign/(unbiased?) exponent/mantissa including
    hidden bit.

    Finally a packer/rounding unit which combines sign/exp/full mantissa
    with guard & sticky bits.

    At the very end you use the output of the classifier to select either
    the regular result or one given by the special inputs. The special input handling runs in parallel with the normal emulation.

    The rest is as outlined by Mitch, i.e just a bunch of regular unsigned
    integer ops.
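A Python model of the classify-and-sort helper (names and details are my own; the Mill op is only sketched above):

```python
import math

def classify(x: float) -> str:
    if math.isnan(x):
        return "NaN"
    if math.isinf(x):
        return "Inf"
    if x == 0.0:
        return "Zero"
    # subnormal doubles sit below the smallest normal magnitude
    return "Subnormal" if abs(x) < 2**-1022 else "Normal"

def classify_and_sort(a: float, b: float):
    """Return both inputs tagged with their class, larger magnitude first."""
    pair = sorted([a, b], key=abs, reverse=True)
    return [(v, classify(v)) for v in pair]
```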

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Sat Jul 19 12:52:34 2025
    High Precision:
    Binary128 (borderline general use)

    How does the speed of software emulation of Binary128 compare to the
    speed of "double-double" (i.e. a pair of doubles, thus also using 128bit
    but with a slightly shorter mantissa (usually) and smaller exponent)?


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Terje Mathisen on Sat Jul 19 17:12:04 2025
    On Sat, 19 Jul 2025 13:24:27 +0000, Terje Mathisen wrote:

    Stefan Monnier wrote:
What I found on the Mill is that a few helper ops can make SW emulation
run in 2-4x hardware latency instead of 5-10x.

    What are those helper ops?

    The most complicated one would take two fp values, classify both (Zero/Subnormal/Normal/Inf/NaN) and return them sorted by magnitude.

    Next is an unpacker: fp to sign/(unbiased?) exponent/mantissa including hidden bit.

    Finally a packer/rounding unit which combines sign/exp/full mantissa
    with guard & sticky bits.

    OpenGL defines::

    x = ADDtoExponent( Exponent( x ), Fraction( x ) );

    The argument functions tear the FP apart::
    int Exponent( FloatingPoint x ) is the deBiased exponent
FloatingPoint Fraction( FloatingPoint x ) is 1/2 <= fraction < 1

    The ADDtoExponent function puts them back together again::

    There is also CopySign if you want to take the sign from the fraction.
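These argument functions behave like the classic frexp/ldexp pair, which makes the identity easy to check (my mapping, illustrated in Python):

```python
import math

def exponent(x: float) -> int:
    return math.frexp(x)[1]          # the de-biased exponent

def fraction(x: float) -> float:
    return math.frexp(x)[0]          # 1/2 <= |fraction| < 1

def add_to_exponent(e: int, f: float) -> float:
    return math.ldexp(f, e)          # puts them back together again
```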

    At the very end you use the output of the classifier to select either
    the regular result or one given by the special inputs. The special input handling runs in parallel with the normal emulation.

    The rest is as outlined by Mitch, i.e just a bunch of regular unsigned integer ops.



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stefan Monnier on Sat Jul 19 17:17:03 2025
    On Sat, 19 Jul 2025 16:52:34 +0000, Stefan Monnier wrote:

    High Precision:
    Binary128 (borderline general use)

    How does the speed of software emulation of Binary128 compare to the
    speed of "double-double" (i.e. a pair of doubles, thus also using 128bit
    but with a slightly shorter mantissa (usually) and smaller exponent)?

    This is exactly where the conversation diverges from FP128 into
    Exact FP arithmetics.

    I suspect it would surprise NOBODY here that My 66000 has direct access
    to Exact FP arithmetics (via CARRY) that even gets the inexact bit set correctly.

    { double hi, double lo } = FADD( double x, double y );

    is 1 instruction (2 if you count CARRY as an instruction instead of
    an instruction-modifier.)

    All the bits that did not get into hi can be found in lo.

    There are FMUL, and FDIV variants, too.
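On hardware without CARRY, the same hi/lo pair for addition is obtainable with Knuth's TwoSum, six ordinary FP adds; a CARRY-style instruction collapses this to one operation. A sketch (the standard algorithm, not My 66000 code):

```python
def two_sum(x: float, y: float):
    """Knuth's TwoSum: hi is the rounded sum, lo the rounding error,
    and hi + lo equals x + y exactly (for any argument order)."""
    hi = x + y
    xp = hi - y
    yp = hi - xp
    lo = (x - xp) + (y - yp)
    return hi, lo
```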


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Sat Jul 19 13:32:50 2025
    High Precision:
    Binary128 (borderline general use)

    How does the speed of software emulation of Binary128 compare to the
    speed of "double-double" (i.e. a pair of doubles, thus also using 128bit
    but with a slightly shorter mantissa (usually) and smaller exponent)?

    This is exactly where the conversation diverges from FP128 into
    Exact FP arithmetics.

    I suspect it would surprise NOBODY here that My 66000 has direct access
    to Exact FP arithmetics (via CARRY) that even gets the inexact bit set correctly.

    { double hi, double lo } = FADD( double x, double y );

    is 1 instruction (2 if you count CARRY as an instruction instead of
    an instruction-modifier.)

    Indeed, the hardware can provide specific support for double-double, but
    even with regular hardware (mostly FMACC), IIUC you can get decent
    performance, hence the question.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Terje Mathisen on Sat Jul 19 21:02:46 2025
    On Fri, 18 Jul 2025 22:17:30 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    BGB wrote:
    On 7/18/2025 8:48 AM, John Savard wrote:
<snip>

    Practically, likely better to leave Binary128 and Binary256 to
    software emulation; and instead focus on cheaper ways to make these
    faster (like, 128-bit and extended precision ADD/SUB and Shift).
    Things like my newer BITMOV instructions can also help.
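
    As a rough illustration, the extended-precision ADD such helper
    instructions would accelerate can be sketched in plain C (a
    hypothetical `add128` helper over two 64-bit limbs; the names are
    mine, not from any poster's ISA):

```c
#include <stdint.h>

/* Hypothetical sketch: 128-bit addition built from two 64-bit limbs,
 * the kind of operation a dedicated extended-precision ADD/ADC
 * helper instruction would collapse into one step. */
typedef struct { uint64_t lo, hi; } u128;

static u128 add128(u128 a, u128 b) {
    u128 r;
    r.lo = a.lo + b.lo;
    /* carry out of the low limb: the sum wrapped around modulo 2^64 */
    uint64_t carry = (r.lo < a.lo) ? 1 : 0;
    r.hi = a.hi + b.hi + carry;
    return r;
}
```

    In hardware (or with an add-with-carry helper op) the carry test
    disappears; in portable C it costs a compare and an extra add.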

    What I found on the Mill is that a few helper ops can make SW
    emulation run in 2-4x hardware latency instead of 5-10x.



    I found pretty much the opposite on x86-64.

    I can't think of any easy-to-implement helper op, or pair of ops,
    that could make a non-negligible difference on modern big cores.

    What could make quite a big difference is a better ABI.
    The current Linux x86-64 (and aarch64) __float128 ABIs are very
    bad, both in that parameters to the __float128 arithmetic
    primitives are passed in SIMD registers instead of GPRs, and in
    that flags (= exceptions) are returned in the FP control word.
    I don't quite know where I would prefer them, but certainly
    somewhere else.
    The RISC-V Linux ABI looks more reasonable, but I have only
    looked at it; I never even tried to implement anything on RISC-V,
    much less measured speed.

    And for Windows there is no official __float128 ABI at all.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Stefan Monnier on Sat Jul 19 19:15:39 2025
    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    High Precision:
    Binary128 (borderline general use)

    How does the speed of software emulation of Binary128 compare to the
    speed of "double-double" (i.e. a pair of doubles, thus also using 128bit
    but with a slightly shorter mantissa (usually) and smaller exponent)?

    I'm not a big fan of that format. IBM used it, but is trying to
    get rid of it for POWER at least (which has 128-bit IEEE hardware
    support, but with rather low performance).

    It has several drawbacks. One of them is that you cannot safely
    calculate sqrt(a^2 + b^2) in a higher precision without adjusting
    exponents or dividing.

    Also, the smallest positive number > 1 would then be
    1 + 2.2250738585072014e-308, discarding subnormals.

    How do you compare two numbers?

    You can get around that by fixing the exponent of the smaller
    number so it extends the bigger one without a gap in the
    mantissa, and forcing it to have the same sign. Ill-formed
    numbers should then be flagged as an error.

    What do you do if one of your numbers is NaN, the other
    one not? Do you prescribe the same sign for both numbers?
    (Probably yes).

    I can understand wanting the precise results, like in Mitch's
    architecture (also prescribed in IEEE, I believe). But as a general
    number format... I'd rather have 128-bit IEEE in software, but I
    would even more prefer a highly-performing 128-bit IEEE in hardware.
    SIMD registers are big enough to hold them.

    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Sat Jul 19 19:58:33 2025
    On Sat, 19 Jul 2025 4:40:48 +0000, Lawrence D'Oliveiro wrote:

    On Fri, 18 Jul 2025 13:36:00 -0500, BGB wrote:

    Practically, likely better to leave Binary128 and Binary256 to software
    emulation; and instead focus on cheaper ways to make these faster ...

    Is there much point to continuing with fixed-size formats at these precisions?

    Not if you are emulating them in SW::

    #define containers 3

    typedef struct { ubyte   s;
                     int56_t exp;
                     int64_t fract[containers]; } big_FP;

    By the time this exponent overflows, you are accounting for every
    particle in the universe !!! (not just visible universe).

    Sorry to waste 8 bits on the sign.

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Sat Jul 19 19:54:13 2025
    On Sat, 19 Jul 2025 19:15:39 +0000, Thomas Koenig wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    High Precision:
    Binary128 (borderline general use)

    How does the speed of software emulation of Binary128 compare to the
    speed of "double-double" (i.e. a pair of doubles, thus also using 128bit
    but with a slightly shorter mantissa (usually) and smaller exponent)?

    Well double_double is 1 pipelined FP calculation in My 66000 ISA
    whereas FP128 QADD would be about 20 mostly data-dependent inst-
    ructions, maybe a few more if all rounding modes are supported.

    So a bit more than 20× slowdown. On a machine without the proper
    support instructions double the previous number.

    I'm not a big fan of that format. IBM used it, but is trying to
    get rid of it for POWER at least (which has 128-bit IEEE hardware
    support, but with rather low performance).

    It has several drawbacks. One of them is that you cannot safely
    calculate sqrt(a^2 + b^2) in a higher precision without adjusting
    exponents or dividing.

    Something like, but a little more complicated than::
    {
        if( overflows ( a^2+b^2 ) )       expmod = +64;
        else if( underflows( a^2+b^2 ) )  expmod = -64;
        else                              expmod = 0;

        a = ADDexponent( a, expmod );
        b = ADDexponent( b, expmod );

        c = ADDexponent( sqrt( a^2+b^2 ), -expmod );
    }

    The only REAL hard part is performing the overflow and/or
    underflow subroutines/macros and dealing with the case where one
    overflows and the other underflows--doubles the complexity shown
    above.


    Also, the smallest positive number > 1 would then be
    1 + 2.2250738585072014e-308, discarding subnormals.

    How do you compare two numbers?

    comparison compare( double_double n1, double_double n2 )
    {
        if( n1.hi > n2.hi ) return greater;
        if( n1.hi < n2.hi ) return lesser;
        // high parts are equal
        if( n1.lo > n2.lo ) return greater;
        if( n1.lo < n2.lo ) return lesser;
        // low parts are equal
        return equal;
    }

    You can get around that by fixing the exponent of the smaller
    number so it extends the bigger one without a gap in the
    mantissa, and forcing it to have the same sign.

    In (as Anton called it) double-double, the sign bit actually
    carries a bit of significance: in effect, it tells whether the
    larger value was rounded up (+1) or not. In Kahan-Babuška
    summation the high part is not rounded, only the low part.
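
    For reference, plain Kahan compensated summation (of which
    Kahan-Babuška is a refinement) can be sketched in C; the function
    name is mine:

```c
#include <stddef.h>

/* Kahan compensated summation: c accumulates the low-order bits
 * lost when each addend is folded into the running sum. */
static double kahan_sum(const double *x, size_t n) {
    double sum = 0.0, c = 0.0;
    for (size_t i = 0; i < n; i++) {
        double y = x[i] - c;    /* apply the stored correction */
        double t = sum + y;     /* big + small: low bits of y are lost... */
        c = (t - sum) - y;      /* ...but can be recovered here */
        sum = t;
    }
    return sum;
}
```

    The correction term c is exactly the "low part" being tracked
    separately from the rounded high part.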

    Ill-formed
    numbers should then be flagged as an error.

    If low overlaps hi the number is ill-formed.
    If both are not {infinity, zero, or NaN} the number is ill-formed.
    All ill-formed numbers are treated as NaN.

    But that is rather hard::
    The real complexity arises when low underflows but high
    does not OR when hi overflows but low does not.

    What do you do if one of your numbers is NaN, the other
    one not? Do you prescribe the same sign for both numbers?
    (Probably yes).

    Kahan-Babuška yes, exact FP no.

    I can understand wanting the precise results, like in Mitch's
    architecture (also prescribed in IEEE, I believe). But as a general
    number format... I'd rather have 128-bit IEEE in software, but I
    would even more prefer a highly-performing 128-bit IEEE in hardware.
    SIMD registers are big enough to hold them.

  • From MitchAlsup1@21:1/5 to BGB on Sun Jul 20 00:25:04 2025
    On Sat, 19 Jul 2025 20:50:33 +0000, BGB wrote:

    On 7/18/2025 3:16 PM, MitchAlsup1 wrote:
    On Fri, 18 Jul 2025 18:36:00 +0000, BGB wrote:

    In my ISA, I have 128-bit shift, though not currently a generic funnel
    shift, but can be sorta faked with register MOV's. A funnel shift
    instruction could be possible, but would need to be a 4R encoding.

    While the shift count is limited to 64, one can do rather
    arbitrarily large shifts with My ISA. A 256-bit shift::

        CARRY   R14,{{O}{IO}{IO}{I}}
        SL      R11,R1,Rshift
        SL      R12,R2,Rshift
        SL      R13,R3,Rshift
        SLs     R14,R4,Rshift

    ----------------
    <snip>

    But the latter could still be an attractive option for other use
    cases, though.

    It is possible this could be done in an FPGA.
       The main issue is how to best do the Horizontal-ADD.

    Baugh-Wooley or Kogge-Stone.


    OK.

    Looking.
    One tradeoff is that hopefully whatever is done can be
    efficiently implemented in terms of FPGA primitives.

    BW and KS adders are designed such that the center bits from the
    multiplier tree can arrive last.

  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Sun Jul 20 09:24:44 2025
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Sat, 19 Jul 2025 19:15:39 +0000, Thomas Koenig wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    High Precision:
    Binary128 (borderline general use)

    How does the speed of software emulation of Binary128 compare to the
    speed of "double-double" (i.e. a pair of doubles, thus also using
    128bit but with a slightly shorter mantissa (usually) and smaller
    exponent)?

    Well double_double is 1 pipelined FP calculation in My 66000 ISA
    whereas FP128 QADD would be about 20 mostly data-dependent inst-
    ructions, maybe a few more if all rounding modes are supported.

    So a bit more than 20× slowdown. On a machine without the proper
    support instructions double the previous number.

    I'm not a big fan of that format. IBM used it, but is trying to
    get rid of it for POWER at least (which has 128-bit IEEE hardware
    support, but with rather low performance).

    It has several drawbacks. One of them is that you cannot safely
    calculate sqrt(a^2 + b^2) in a higher precision without adjusting
    exponents or dividing.

    Something like, but a little more complicated than::
    {
        if( overflows ( a^2+b^2 ) )       expmod = +64;
        else if( underflows( a^2+b^2 ) )  expmod = -64;
        else                              expmod = 0;

        a = ADDexponent( a, expmod );
        b = ADDexponent( b, expmod );

        c = ADDexponent( sqrt( a^2+b^2 ), -expmod );
    }

    That means that a naive programmer will not do it, and since
    scientific and engineering programmers tend not to do this
    kind of thing, it will very likely not happen in numerical code.

    The only REAL hard part is performing the overflow and/or
    underflow subroutines/macros and dealing with the case where one
    overflows and the other underflows--doubles the complexity shown
    above.

    ... even more so.



    Also, the smallest positive number > 1 would then be
    1 + 2.2250738585072014e-308, discarding subnormals.

    How do you compare two numbers?

    comparison compare( double_double n1, double_double n2 )
    {
    if( n1.hi > n2.hi ) return greater;
    if( n1.hi < n2.hi ) return lesser;

    That fails if the high part overlaps with the low part. Then
    one has to trap, adjust, or handle it accordingly; adjusting
    especially on a binary read...

    [...]

    If low overlaps hi the number is ill-formed.
    If both are not {infinity, zero, or NaN} the number is ill-formed.
    All ill-formed numbers are treated as NaN.

    But that is rather hard::
    The real complexity arises when low underflows but high
    does not OR when hi overflows but low does not.

    There are two main drawbacks of the double-double format: Complexity
    and wasted bits. This is the reason, I believe, that the only
    company which used it, IBM, is moving away from that format, and
    nobody else uses it; everybody else uses 128-bit IEEE (hardware
    in POWER 9+, software emulation for everybody else).


  • From John Savard@21:1/5 to Stefan Monnier on Sun Jul 20 14:28:26 2025
    On Sat, 19 Jul 2025 12:52:34 -0400, Stefan Monnier wrote:

    How does the speed of software emulation of Binary128 compare to the
    speed of "double-double" (i.e. a pair of doubles, thus also using 128bit
    but with a slightly shorter mantissa (usually) and smaller exponent)?

    That kind of floating-point is only capable of providing an acceleration
    if the underlying architecture helps you.

    Thus, this sort of thing was done _in hardware_ for the IBM System/360
    model 85 for 128-bit extended precision.

    Or, in software, this is how 72-bit double precision was done on the IBM
    704 computer - because its 36-bit single-precision floating-point
    instructions saved the lower precision bits from single precision
    arithmetic in a second register, where they could be used in a software
    routine for double-precision arithmetic.

    John Savard

  • From Stefan Monnier@21:1/5 to All on Sun Jul 20 10:50:33 2025
    Thomas Koenig [2025-07-20 09:24:44] wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Sat, 19 Jul 2025 19:15:39 +0000, Thomas Koenig wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    How does the speed of software emulation of Binary128 compare to the
    speed of "double-double" (i.e. a pair of doubles, thus also using
    128bit but with a slightly shorter mantissa (usually) and smaller
    exponent)?

    Well double_double is 1 pipelined FP calculation in My 66000 ISA
    whereas FP128 QADD would be about 20 mostly data-dependent inst-
    ructions, maybe a few more if all rounding modes are supported.
    So a bit more than 20× slowdown. On a machine without the proper
    support instructions double the previous number.

    That's a kind of worst-case ratio, tho. It's only for ADD, and may
    depend on the exact convention you use to represent/handle NaNs,
    synchronize sign bits, etc...

    How do you compare two numbers?
    comparison compare( double_double n1, double_double n2 )
    {
    if( n1.hi > n2.hi ) return greater;
    if( n1.hi < n2.hi ) return lesser;
    That fails if the high part overlaps with the low part.

    AFAIK in double-double such overlap is not allowed, IOW can happen only
    if you have a bug elsewhere in your library.

    There are two main drawbacks of the double-double format: Complexity
    and wasted bits. This is the reason, I believe, that the only
    company which used it, IBM, is moving away from that format, and
    nobody else uses it; everybody else uses 128-bit IEEE (hardware
    in POWER 9+, software emulation for everybody else).

    If you can provide hardware support, Binary128 is likely a much
    better choice, indeed.

    My question is specifically about the performance difference for
    software-only implementations, especially if you take into
    account the extra burden of dealing with NaNs and friends (i.e.
    clean & robust implementations of the library), so as to see if
    the difference is small enough to really "bury" the
    double-double format.


    Stefan

  • From Thomas Koenig@21:1/5 to Stefan Monnier on Sun Jul 20 16:20:51 2025
    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:

    My question is specifically about the performance difference for
    software-only implementations, especially if you take into
    account the extra burden of dealing with NaNs and friends (i.e.
    clean & robust implementations of the library), so as to see if
    the difference is small enough to really "bury" the
    double-double format.

    It would be possible for POWER, where there are three versions
    of long double: IEEE with software emulation, IEEE with hardware
    support and IBM long double.

    I can give you numbers comparing IEEE with hardware support with
    IBM long double on the same machine. For a quick & dirty matmul
    benchmark (below) I get, with IBM long double

    16
    MFlops: 173.6

    and with IEEE

    16
    MFlops: 437.3

    For comparison, my home box (totally different) gives me

    16
    MFlops: 95.9

    which is slower by a significant factor, so long double might
    actually be faster. But building an IEEE long double toolchain
    is a PITA; I'm not going to do this as a benchmark :-)


    program main
      implicit none
      integer, parameter :: wp = selected_real_kind(30)
      integer, parameter :: n=801, p=801, m=660
      real (kind=wp), allocatable :: c(:,:), a(:,:), b(:,:)
      character(len=80) :: line
      real (kind=wp) :: fl = 2.d0*n*m*p
      integer :: i,j
      real :: t1, t2

      allocate (c(n,p), a(n,m), b(m,p))

      print *,wp

      line = '10 10'
      call random_number(a)
      call random_number(b)
      call cpu_time (t1)
      c = matmul(a,b)
      call cpu_time (t2)
      print '(A,F9.1)',"MFlops: ", fl/(t2-t1)*1e-6
      read (unit=line,fmt=*) i,j
      write (unit=line,fmt=*) c(i,j)
    end program main




  • From MitchAlsup1@21:1/5 to BGB on Sun Jul 20 17:40:14 2025
    On Sun, 20 Jul 2025 1:58:02 +0000, BGB wrote:

    On 7/19/2025 12:32 PM, Stefan Monnier wrote:
    High Precision:
    Binary128 (borderline general use)

    How does the speed of software emulation of Binary128 compare to the
    speed of "double-double" (i.e. a pair of doubles, thus also using
    128bit but with a slightly shorter mantissa (usually) and smaller
    exponent)?

    This is exactly where the conversation diverges from FP128 into
    Exact FP arithmetics.

    I suspect it would surprise NOBODY here that My 66000 has direct access
    to Exact FP arithmetics (via CARRY) that even gets the inexact bit set
    correctly.

    { double hi, double lo } = FADD( double x, double y );

    is 1 instruction (2 if you count CARRY as an instruction instead of
    an instruction-modifier.)
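
    On an ISA without such support, the same hi/lo pair can be
    computed in software with Knuth's TwoSum, at six FP operations
    per exact add; a minimal C sketch (type and function names mine):

```c
/* Knuth's TwoSum: computes hi and lo such that hi + lo == x + y
 * exactly, with hi = fl(x + y) and lo the rounding error.
 * Works for any x, y (no magnitude ordering required). */
typedef struct { double hi, lo; } dd;

static dd two_sum(double x, double y) {
    dd r;
    r.hi = x + y;
    double xp = r.hi - y;          /* reconstructed x */
    double yp = r.hi - xp;         /* reconstructed y */
    r.lo = (x - xp) + (y - yp);    /* what rounding discarded */
    return r;
}
```

    This is the software equivalent of the one-instruction exact FADD
    above: six data-dependent FP ops instead of one.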

    Indeed, the hardware can provide specific support for double-double, but
    even with regular hardware (mostly FMACC), IIUC you can get decent
    performance, hence the question.


    Does have the drawback though of (AFAIK) needing single-rounded FMA,
    rather than being able to use double-rounded.



    In turn, single-rounded has the costs of needing a full-width multiplier
    (~ 2x the mantissa width internally) and a mantissa large enough to
    deal with around 3x the destination width for the adder stage (to deal
    with "cancellation"). In turn, leading to a significantly more expensive
    FPU.

    The double width output of the multiplier is coupled on the least
    significant side by an adder of augend bits shifted down so far
    that they only play a part in rounding, and there is an
    incrementer on the HoB side to deal with an augend significantly
    larger than the product. So, in practice it is 4× as wide
    (minus 3).

    Well, vs, say:
      FMUL that only generates the high order bits;
        is not IEEE compliant.
      FADD that allows bits to fall off the bottom;
        is not 754 compliant.
        So, cancellation scenarios may reveal the bits that no
        longer exist.
        Think it through again.
    ...

    So, alas, this is something that won't really work with my
    existing FPU design (and "fixing" this would likely be too
    expensive).

    ....


    Stefan

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Sun Jul 20 17:44:13 2025
    On Sun, 20 Jul 2025 9:24:44 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Sat, 19 Jul 2025 19:15:39 +0000, Thomas Koenig wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    High Precision:
    Binary128 (borderline general use)

    How does the speed of software emulation of Binary128 compare to the
    speed of "double-double" (i.e. a pair of doubles, thus also using
    128bit but with a slightly shorter mantissa (usually) and smaller
    exponent)?

    Well double_double is 1 pipelined FP calculation in My 66000 ISA
    whereas FP128 QADD would be about 20 mostly data-dependent inst-
    ructions, maybe a few more if all rounding modes are supported.

    So a bit more than 20× slowdown. On a machine without the proper
    support instructions double the previous number.

    I'm not a big fan of that format. IBM used it, but is trying to
    get rid of it for POWER at least (which has 128-bit IEEE hardware
    support, but with rather low performance).

    It has several drawbacks. One of them is that you cannot safely
    calculate sqrt(a^2 + b^2) in a higher precision without adjusting
    exponents or dividing.

    Something like, but a little more complicated than::
    {
        if( overflows ( a^2+b^2 ) )       expmod = +64;
        else if( underflows( a^2+b^2 ) )  expmod = -64;
        else                              expmod = 0;

        a = ADDexponent( a, expmod );
        b = ADDexponent( b, expmod );

        c = ADDexponent( sqrt( a^2+b^2 ), -expmod );
    }

    That means that a naive programmer will not do it, and since
    scientific and engineering programmers tend not to do this
    kind of thing, it will very likely not happen in numerical code.

    The only REAL hard part is performing the overflow and/or underflow
    subroutines/macros and dealing with the case where one overflows
    and the other underflows--doubles the complexity shown above.

    .... even more so.



    Also, the smallest positive number > 1 would then be
    1 + 2.2250738585072014e-308, discarding subnormals.

    How do you compare two numbers?

    comparison compare( double_double n1, double_double n2 )
    {
    if( n1.hi > n2.hi ) return greater;
    if( n1.hi < n2.hi ) return lesser;

    That fails if the high part overlaps with the low part. Then
    one has to trap, adjust, or handle it accordingly; adjusting
    especially on a binary read...

    Kahan describes a process whereby overlapping a and b are
    repartitioned into non-overlapping c and d. He calls it
    distillation.

    In any event, My 66000 does not produce overlapping a and b
    when using CARRY to perform exact arithmetics.

    [...]

    If low overlaps hi the number is ill-formed.
    If both are not {infinity, zero, or NaN} the number is ill-formed.
    All ill-formed numbers are treated as NaN.

    But that is rather hard::
    The real complexity arises when low underflows but high
    does not OR when hi overflows but low does not.

    There are two main drawbacks of the double-double format: Complexity
    and wasted bits. This is the reason, I believe, that the only
    company which used it, IBM, is moving away from that format, and
    nobody else uses it; everybody else uses 128-bit IEEE (hardware
    in POWER 9+, software emulation for everybody else).

    The advantage is that a library of exact FP arithmetics can
    handle (at moderate speed) fractions of any width.

  • From Terje Mathisen@21:1/5 to All on Sun Jul 20 22:38:40 2025
    MitchAlsup1 wrote:
    On Sat, 19 Jul 2025 20:50:33 +0000, BGB wrote:

    On 7/18/2025 3:16 PM, MitchAlsup1 wrote:
    On Fri, 18 Jul 2025 18:36:00 +0000, BGB wrote:

    In my ISA, I have 128-bit shift, though not currently a generic funnel
    shift, but can be sorta faked with register MOV's. A funnel shift
    instruction could be possible, but would need to be a 4R encoding.

    While the shift count is limited to 64, one can do rather
    arbitrarily large shifts with My ISA. A 256-bit shift::

        CARRY    R14,{{O}{IO}{IO}{I}}
        SL    R11,R1,Rshift
        SL    R12,R2,Rshift
        SL    R13,R3,Rshift
        SLs    R14,R4,Rshift

    You still need special code (reg-reg moves) when the Rshift count >= 64, right?

    Possibly easiest to branch around the higher shift counts:

    while Rshift >= 64 {
        r0 = r1;
        r1 = r2;
        r2 = r3;
        r3 = 0;
        Rshift -= 64;
    }

    If higher shift counts are common, then I'd use separate code for each
    block size.
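
    Combining the word-move loop above with per-limb funnel shifts,
    a 256-bit left shift can be sketched in C (limb 0 least
    significant; a hypothetical helper, not any poster's actual
    code):

```c
#include <stdint.h>

/* 256-bit logical left shift over four 64-bit limbs (v[0] = least
 * significant). Whole-limb moves handle counts >= 64, as in the
 * loop above; the remaining 0..63-bit shift is a funnel across
 * adjacent limbs. The count != 0 guard avoids the undefined
 * 64-bit shift by 64. */
static void shl256(uint64_t v[4], unsigned count) {
    while (count >= 64) {
        v[3] = v[2]; v[2] = v[1]; v[1] = v[0]; v[0] = 0;
        count -= 64;
    }
    if (count) {
        for (int i = 3; i > 0; i--)
            v[i] = (v[i] << count) | (v[i-1] >> (64 - count));
        v[0] <<= count;
    }
}
```

    Branching around the word-move loop when the count is known to
    be small matches the "separate code for each block size" idea.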

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Stefan Monnier@21:1/5 to All on Mon Jul 21 10:48:34 2025
    Thomas Koenig [2025-07-20 16:20:51] wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    My question is specifically about the performance difference for
    software-only implementations, especially if you take into account the
    [...]
    I can give you numbers comparing IEEE with hardware support with
    IBM long double on the same machine. For a quick & dirty matmul
    benchmark (below) I get, with IBM long double

    16
    MFlops: 173.6

    and with IEEE

    16
    MFlops: 437.3

    [ Not sure what "16" is about. ]
    So, IIUC this shows hardware-supported Binary128 to be about 3x faster
    than software-only double-double.

    For comparison, my home box (totally different) gives me

    16
    MFlops: 95.9

    Is this hard Binary128, soft Binary128, or soft double-double?

    Thanks!


    Stefan

  • From Thomas Koenig@21:1/5 to Stefan Monnier on Mon Jul 21 16:25:50 2025
    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    Thomas Koenig [2025-07-20 16:20:51] wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    My question is specifically about the performance difference for
    software-only implementations, especially if you take into account the
    [...]
    I can give you numbers comparing IEEE with hardware support with
    IBM long double on the same machine. For a quick & dirty matmul
    benchmark (below) I get, with IBM long double

    16
    MFlops: 173.6

    and with IEEE

    16
    MFlops: 437.3

    [ Not sure what "16" is about. ]

    The KIND number, I could have left it out :-)

    So, IIUC this shows hardware-supported Binary128 to be about 3x faster
    than software-only double-double.

    Yep.


    For comparison, my home box (totally different) gives me

    16
    MFlops: 95.9

    Is this hard Binary128, soft Binary128, or soft double-double?

    This is soft Binary128, on an x86_64 box.

