• My Response to the New 128-bit IEEE 754 Floating-Point Format

    From John Savard@21:1/5 to All on Fri Jul 18 13:48:58 2025
When I first heard of the new 128-bit and 256-bit floating-point formats defined as part of a revised version of the IEEE 754 floating-point standard, I revised the floating-point formats supported by the Concertina II ISA in the following manner:

    I increased the size of the floating-point registers in the architecture to
    512 bits, so that a 512-bit register could contain a 256-bit IEEE 754 float converted to a temporary-real style format without a hidden first bit;

    and in addition to defining a 512-bit temporary real type, I also defined a 1,024-bit real type which involved the use of a pair of registers, with an unused exponent field in the second half. (Think 72-bit double precision on
    the IBM 704, or extended precision on the System/360 Model 85.)

    I have decided now to instead do the following:

    - keep the size of the floating-point registers at 128 bits;

    - continue to convert all floating-point numbers to a temporary real style format without a hidden first bit when storing them in those registers.

    In order to do this, however, I have decided that while I will not fully support the new 128-bit standard floating-point format, I still do want
    to have interoperability with it.

    Therefore, I have changed the temporary real format I use to have an
    exponent field that is *one bit smaller* than that of the 8087 temporary
    real format.

    In this way, my 128-bit floats have _the same precision_ as the new
    IEEE 754 standard 128-bit floats, and the Concertina II will provide
    additional instructions to load such numbers to, and save such numbers
    from, the floating-point registers.

    This won't support the new 128-bit floating-point standard, as it will
    only cover half of its exponent range. But it will allow computations
    with numbers within that part of the range without sacrificing
    precision, thus giving a degree of interoperability with computers
    that fully support the new IEEE 754 standard.
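For concreteness, the repacking this implies can be sketched in Python. The field layout is my own inference from the stated constraints (1 sign + 14 exponent + 113 explicit-bit mantissa = 128 bits, so the precision matches binary128's 113 bits while the exponent covers half the range); special values simply raise here instead of being handled.

```python
# Sketch: IEEE 754 binary128 (1 sign + 15 exponent + 112 fraction,
# hidden leading bit) repacked into a 128-bit internal format with
# 1 sign + 14 exponent + 113-bit explicit-bit mantissa.
# Field placement and bias are my assumptions, not a published layout.

IEEE_BIAS = 16383            # binary128 exponent bias
INT_BIAS = 8191              # assumed bias for the 14-bit internal exponent

def ieee128_to_internal(bits: int) -> int:
    sign = (bits >> 127) & 1
    exp = (bits >> 112) & 0x7FFF
    frac = bits & ((1 << 112) - 1)
    if not (0 < exp < 0x7FFF):
        raise ValueError("zero/subnormal/Inf/NaN need special handling")
    e = exp - IEEE_BIAS
    if not (-INT_BIAS < e <= INT_BIAS):
        raise ValueError("outside the supported half of the exponent range")
    mantissa = (1 << 112) | frac          # make the leading bit explicit
    return (sign << 127) | ((e + INT_BIAS) << 113) | mantissa
```

Loading a standard binary128 thus fails only for the outer half of its exponent range, which is exactly the trade-off described above.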

    This may seem like a strange choice to make, but it is the result of
    my having thought through the implications of the new floating-point
    formats, with the desire to maintain the size of the floating-point
    registers at 128 bits.

    The 256-bit short vector registers, however, because they're divided
    into aliquot parts for shorter integer and floating-point variable types,
    do *not* convert 32-bit and 64-bit IEEE 754 floats into an internal form,
    but keep the hidden first bit.

    Therefore, I still *can*, and will, provide full support for the new
    128-bit and 256-bit IEEE 754 floating-point variable types... but only
    in the short vector registers!

New instructions will be added that treat the halves of the sixteen
    short vector registers as 32 special floating-point registers only
    used for numbers in the standard IEEE 754 128-bit floating-point
    format.

    I know this probably seems crazy, but to me it seems to be the only
    reasonable way to deal with these new formats within the architecture
    I've set out for my computer designs... with the constraints that
    for a given floating-point format, all sizes of variables will share
    the same internal exponent size, and the floating-point registers
    can't be made bigger than 128 bits, as that would be unreasonable...
    and any floating-point format that uses a pair of registers would use
    a pair of exponents in the old-fashioned way to keep from switching
    formats within a single instruction.

John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to All on Fri Jul 18 14:33:14 2025
    I have realized that I missed another wrinkle that responding to
    the revised standard might cause.
    As one of the exotic features that I intend to consider supporting
    is the ability to operate with a different width of memory than the
    normal 64-bit and power-of-two multiples thereof typically in current
    use...
    this meant that I had intended for the long vector registers to
    support long vectors of 72-bit floats as well as 64-bit floats.
    Because Univac once had a Cray-style vector attachment to their
    mainframes, and how dare I fail to provide a capability that computers
    once had, but since have lost!
Not that I intend to provide, with Concertina II (*unlike the original Concertina*), the ability to do, say, _sterling_ arithmetic. (Even in
    the original Concertina, this was only provided within the context of
    a more general mixed-radix capability. I am not _completely_ deranged,
    despite appearances. Or, at least, so I claim.)
    Well, if I'm going to support 72-bit floats, I may as well support
    vectors of 80-bit 8087-style temporary reals too while I'm at it.

And so what I clearly need to do is provide a set of "legacy
floating-point" instructions: not in the primary 32-bit instructions,
but only in the operate instructions with larger opcode fields, maybe
in the 32-bit supplementary load and store instructions if there's
room, and otherwise out there in 48-bit instruction land.
    These use the _old_ internal format for IEEE 754 compatible floats.
    They allow arithmetic with 80-bit temporary reals, which use opcodes
    analogous to the ones used for the Medium floating-point format, as
    well as single and double precision (32-bit and 64-bit) IEEE 754
    floats.
    That way, the long vector arithmetic unit, even when it's working with
80-bit floats, can fully interoperate with the scalar arithmetic
    unit, even if involving the *short* vector arithmetic unit in the
    computation (which goes the opposite way by always hiding the first
    bit of the mantissa) is right out when 80-bit temporary reals are being
    used.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to John Savard on Fri Jul 18 17:25:24 2025
    On Fri, 18 Jul 2025 14:33:14 +0000, John Savard wrote:

    is a set of "legacy floating-point" instructions.
    These use the _old_ internal format for IEEE 754 compatible floats. They allow arithmetic with 80-bit temporary reals, which use opcodes
    analogous to the ones used for the Medium floating-point format, as well
    as single and double precision (32-bit and 64-bit) IEEE 754 floats.

    What this glimpse into my bizarre thought processes may suggest is
    the following:

    If there were some kind of electronic component that could be
    easily fabricated onto a microprocessor chip made out of silicon,
    whether it was based on SCR (silicon-controlled rectifier)
    technology, or it was something exotic like a memristor, that was
    analogous to a mechanical relay in that it directed current down
    one of two paths *with no gate delay* at the cost of being a bit
    slow in switching from one path to the other...

    I'd be interested in hearing about it, because there are places
    in future designs of processors for my ISA where such a thing could
    be quite useful.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to John Savard on Fri Jul 18 19:30:34 2025
    On Fri, 18 Jul 2025 14:33:14 +0000, John Savard wrote:

    is a set of "legacy floating-point" instructions.

    I have now realized the error of my ways here.
Instead of shortening the exponent field to allow interoperability
between my own 128-bit format and the standard one, I need to place
more emphasis on compatibility.
    So instead, while my internal 128-bit floating-point format will continue
    to not have a hidden first bit, its exponent field, instead of shrinking by
    one bit, will grow to match that of the standard 256-bit floating-point
    format.
    That way, the standard 128-bit and 256-bit formats can be supported by
    the regular floating-point registers as well, they will just use more
    registers than their lengths would indicate. Given that there are
    32 floating-point registers, it should still be possible to manage.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Fri Jul 18 20:16:19 2025
    On Fri, 18 Jul 2025 18:36:00 +0000, BGB wrote:

    On 7/18/2025 8:48 AM, John Savard wrote:
    -------------

    I know this probably seems crazy, but to me it seems to be the only
    reasonable way to deal with these new formats within the architecture
    I've set out for my computer designs... with the constraints that
    for a given floating-point format, all sizes of variables will share
    the same internal exponent size, and the floating-point registers
    can't be made bigger than 128 bits, as that would be unreasonable...
    and any floating-point format that uses a pair of registers would use
    a pair of exponents in the old-fashioned way to keep from switching
    formats within a single instruction.


My thoughts on the formats:
  Binary32: Good for light duty general-use work;
    Sometimes insufficient.
  Binary64: Good for general use work.
    Almost always sufficient.
  Binary128: Overkill.
    Also too expensive to really do on FPGA in any "fast" form.
  Binary256: Serious overkill.

I wanted My 66000 to be very efficient at 64-bits with reasonable
efficiency for occasional 128-bit stuff.

    Practically, likely better to leave Binary128 and Binary256 to software emulation; and instead focus on cheaper ways to make these faster (like, 128-bit and extended precision ADD/SUB and Shift).
    Things like my newer BITMOV instructions can also help.

    Efficient IMUL 128, 192, and 256 makes these a lot more reasonable to
    emulate. Insert and extract instructions help on the side, along with find-first as prelude to normalization (large 128, 192, 256 shifts)
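The find-first step amounts to a count-leading-zeros followed by a shift; a minimal sketch of that normalization prelude (my illustration, using Python integers to stand in for a wide mantissa):

```python
# After an emulated wide FP add/subtract the raw mantissa may carry
# leading zeros; find-first-set gives the shift needed to renormalize,
# and the exponent is adjusted by the same amount.

def normalize(mant: int, exp: int, width: int = 128):
    """Shift mant left until its top bit (bit width-1) is set."""
    if mant == 0:
        return 0, 0                       # true zero: nothing to normalize
    shift = width - mant.bit_length()     # leading-zero count in `width` bits
    return mant << shift, exp - shift
```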

    While it is possible to do Binary128 FMUL (and FDIV) via a Shift-and-Add unit, this is going to be slower than doing it in software. Though, considered before (but would be more expensive to implement) could be a Radix-4 or Radix-16 Shift-ADD unit.



In theory, a Radix-16 unit could do:
  32-bit IDIV in 10 cycles;
  32-bit FDIV in 10 cycles;
  64-bit IMUL/IDIV in 20 cycles;
  64-bit FDIV in 30 cycles;
  128-bit FMUL/FDIV in 60 cycles (with a 256-bit unit).


    Where, as noted, the Shift-and-ADD unit can be made to do FPU operations
    by setting it up with the mantissas as fixed point numbers and running
    the unit for more cycles than used for integer ops. I am not sure if
    this is a well known strategy, but seemed "nifty" so I did it this way.
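A toy model of that strategy, assuming a plain radix-2 unit (one partial-product add per simulated clock), shows why the cycle count tracks the mantissa width:

```python
# Radix-2 shift-and-add multiplier model.  Feeding it the mantissas of
# two floats as fixed-point numbers (and adding the exponents
# separately) turns the same loop into the core of an FMUL.

def shift_add_mul(a: int, b: int, width: int):
    """Multiply a by the low `width` bits of b; returns (product, cycles)."""
    acc, cycles = 0, 0
    for i in range(width):        # one iteration ~ one clock cycle
        if (b >> i) & 1:
            acc += a << i
        cycles += 1
    return acc, cycles
```

With 113-bit Binary128 mantissas this loop alone is 113 cycles, consistent with radix-2 losing to a software routine built from 64×64 multiplies.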


    Where, 1-bit (Radix-2) Shift/ADD would take ~ 240 cycles to deal with Binary128 (if the unit were internally widened to 256 bit), which, at
    least for FMUL, is slower than a pure software option.

    A pair of radix-2 FUs and a ROM can implement all of CORDIC--just as
    slow as you illustrate.

    A Radix-16 unit is likely the next possible step:
    One of the cheaper options available (as a step-up from Radix-2).

I know of 1 fully pipelined FP32 FDIV unit using radix-4 with 10
cycle latency and 1 cycle throughput. I know of one Radix-8 FDIV
unit (an ECL design). Are there any Radix-16; does that even fit
a 16-gate cycle time ??

    In any event, once you get 10-ish bits, Newton-Raphson or Goldschmidt
    are faster (latency).

    But, would be more expensive to implement.

    Every step in radix doubles the lookahead hardware {1->3->7->15};
so, a radix-16 DIV is 15× as big as a radix-2 (not including the adder)

The simple linear comparison would turn into a 4*4->8 bit multiply lookup, and a need for a division lookup (and/or trying to sort it out with combinatorial logic).
    The logic would also need to support 4-bit multiply in the adder stage.

TI has a bunch of patents in this corner of arithmetic.


    Basically, the 4b multiply during ADD, and "how many times does X go
    into Y" logic being the main costs. One could turn the question into 16 parallel multiply lookups, but there is a possible cheaper option. Seems
    like it should be possible to decompose it internally to Radix-4
    operations (could be cheaper; the Radix-4 operations fit nicer into
    LUTs).

Effectively, one would need a way to find the quotient of an 8-bit
number divided by a 4-bit number within a single clock-cycle. And, seemingly
    the most viable way to try to do this would be Radix-4 combinatorial
    logic. Don't know whether it would pass timing, haven't written or
    tested the idea yet.

    Nor am I sure that a Radix-16 unit would be worth the cost (hand-wavy estimate is that such a thing could likely cost around 5 or 6 kLUT for a
    unit that is 128-bit internally); though with a bulk of the added cost
    being due to the Radix-16 multiply-and-add, where the normal A+B
    becomes, essentially:
    A+((M[0]?B:0)+(M[1]?B<<1:0))+((M[2]?B<<2:0)+(M[3]?B<<3:0))

    Though, unclear if it would be cheaper to try to implement it directly
    as adders, or to implement it as a Radix-4 or 16 multiplier.
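The identity behind that multiply-and-add step can be checked directly (my illustration): the radix-16 stage needs only shifted copies of B, selected by the digit bits and summed into the accumulator.

```python
# A + ((M[0]?B:0)+(M[1]?B<<1:0)) + ((M[2]?B<<2:0)+(M[3]?B<<3:0))
# equals A + M*B for every 4-bit digit M.

def radix16_step(a: int, b: int, m: int) -> int:
    t0 = (b if m & 1 else 0) + ((b << 1) if m & 2 else 0)
    t1 = ((b << 2) if m & 4 else 0) + ((b << 3) if m & 8 else 0)
    return a + t0 + t1
```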

Everybody I know gave up at Radix-8.

    While, theoretically, Radix-8 could be an option, it has one big glaring fault: 64 is not a multiple of 3 (and the last few bits are likely to
    wreck the viability of Radix-8). Well, maybe it still works if one
    simply sign-or-zero extends 64 to 66. Big advantage of Radix-8 being
    that 3*3->6 fits in a LUT6. Though (unlike Radix-4) one can't fit a carry-chain into LUTs (so, the biggest "cost issue" of Radix-16 would
    not be addressed with Radix-8).

    <snip>

But, the latter could still be an attractive option for other use
cases, though.

    It is possible this could be done in an FPGA.
    The main issue is how to best do the Horizontal-ADD.

Baugh-Wooley or Kogge-Stone.

    However, this approach would allow doing the main part of the horizontal
    add as integer addition in mantissa space (vs an FP-ADD for the final result).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to BGB on Fri Jul 18 22:17:30 2025
    BGB wrote:
    On 7/18/2025 8:48 AM, John Savard wrote:
<snip>
    Practically, likely better to leave Binary128 and Binary256 to software emulation; and instead focus on cheaper ways to make these faster (like, 128-bit and extended precision ADD/SUB and Shift).
    Things like my newer BITMOV instructions can also help.

    What I found on the Mill is that a few helper ops can make SW emulation
    run in 2-4x hardware latency instead of 5-10x.


    While it is possible to do Binary128 FMUL (and FDIV) via a Shift-and-Add unit, this is going to be slower than doing it in software. Though, considered before (but would be more expensive to implement) could be a Radix-4 or Radix-16 Shift-ADD unit.

    fp128/fp256 FMUL is easy to emulate when you have multiple 64x64->128
    integer multipliers.

    Doing the same for FDIV pretty much require a reciprocal approach, since
    this doubles the precision for each added stage, but it is still so
    small that more fancy multiplication approaches (like FFT-based) don't
    make sense.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Terje Mathisen on Fri Jul 18 21:47:00 2025
    On Fri, 18 Jul 2025 20:17:30 +0000, Terje Mathisen wrote:

    BGB wrote:
    -----------------
    While it is possible to do Binary128 FMUL (and FDIV) via a Shift-and-Add
    unit, this is going to be slower than doing it in software. Though,
    considered before (but would be more expensive to implement) could be a
    Radix-4 or Radix-16 Shift-ADD unit.

    fp128/fp256 FMUL is easy to emulate when you have multiple 64x64->128
    integer multipliers.

Basically you disassemble FP128 into Sign, Exponent<14:0>, and
create operational fractions {1,Fract<111:0>}

    Result exponent = s1.exponent + s2.exponent - Bias;
    Sign = XOR( s1.sign, s2.sign );

Then do 4 multiplies {64×64 -> 128, 64×48 -> 112, 48×64 -> 112, and
48×48 -> 96}

do 7 64-bit additions, being careful with carry propagation,
    and you have a 224-bit product

    without denorms, you have a potential 1-bit shift

    Round = Choose to increment or not

    And assemble the result.
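The four-multiply decomposition is ordinary schoolbook splitting at a limb boundary; a sketch with Python integers (splitting at 64 bits is my simplification of the 64/48 split above, and Python's big integers hide the explicit carry-careful 64-bit additions a real implementation performs):

```python
# Four partial products reproduce the full wide product of two mantissas.

M64 = (1 << 64) - 1

def mul_by_limbs(a: int, b: int) -> int:
    ahi, alo = a >> 64, a & M64
    bhi, blo = b >> 64, b & M64
    hh = ahi * bhi                # high x high
    hl = ahi * blo                # the two cross terms land in the
    lh = alo * bhi                #   middle, 64-bit-aligned position
    ll = alo * blo                # low x low
    return (hh << 128) + ((hl + lh) << 64) + ll
```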

    Doing the same for FDIV pretty much require a reciprocal approach, since
    this doubles the precision for each added stage, but it is still so
    small that more fancy multiplication approaches (like FFT-based) don't
    make sense.

    In general, convert FP128 into 1/2 <= FP64.fraction < 1.0
Do FDIV64 and use this as the first 52 bits of the FP128 result.
    Then 2 (or is it 3) Newton-Raphson steps in FP128.
    And assemble result
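A numeric sketch of that seed-then-iterate scheme (my illustration, modelling values exactly with `fractions`): the double-precision seed reciprocal carries ~53 correct bits, and each Newton-Raphson step r' = r*(2 - d*r) roughly squares the relative error (53 -> ~106 -> ~212 bits), so two steps clear binary128's 113-bit precision.

```python
from fractions import Fraction

def recip_nr(d: Fraction, steps: int) -> Fraction:
    # FP64-quality seed; assumes d is scaled near 1, as a mantissa would be
    r = Fraction(round((1 / float(d)) * 2**53), 2**53)
    for _ in range(steps):
        r = r * (2 - d * r)       # Newton-Raphson refinement step
    return r
```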


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Fri Jul 18 17:54:30 2025
    What I found on the Mill is that a few helper ops can make SW emulation run in 2-4x hardware latency instead of 5-10x.

    What are those helper ops?


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to John Savard on Fri Jul 18 23:19:07 2025
    John Savard <quadibloc@invalid.invalid> wrote:
    I have decided now to instead do the following:

    - keep the size of the floating-point registers at 128 bits;

    - continue to convert all floating-point numbers to a temporary real style format without a hidden first bit when storing them in those registers.

Why not increase the registers to 129 bits to make room for an explicit leading bit?

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to Terje Mathisen on Sat Jul 19 00:44:44 2025
    On Fri, 18 Jul 2025 22:17:30 +0200, Terje Mathisen wrote:

    but it is still so
    small that more fancy multiplication approaches (like FFT-based) don't
    make sense.

    That's something I can agree with. FFT multiplication is what you
    would use if you were writing a program to calculate the value
    of pi to one million digits - or more. The inherent overhead of
    that technique definitely rules it out for ordinary arithmetic -
    even, say, on 512-bit numbers.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to BGB on Sat Jul 19 04:40:48 2025
    On Fri, 18 Jul 2025 13:36:00 -0500, BGB wrote:

    Practically, likely better to leave Binary128 and Binary256 to software emulation; and instead focus on cheaper ways to make these faster ...

    Is there much point to continuing with fixed-size formats at these
    precisions?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Stefan Monnier on Sat Jul 19 15:24:27 2025
    Stefan Monnier wrote:
What I found on the Mill is that a few helper ops can make SW emulation run in 2-4x hardware latency instead of 5-10x.

    What are those helper ops?

    The most complicated one would take two fp values, classify both (Zero/Subnormal/Normal/Inf/NaN) and return them sorted by magnitude.

    Next is an unpacker: fp to sign/(unbiased?) exponent/mantissa including
    hidden bit.

    Finally a packer/rounding unit which combines sign/exp/full mantissa
    with guard & sticky bits.

    At the very end you use the output of the classifier to select either
    the regular result or one given by the special inputs. The special input handling runs in parallel with the normal emulation.

    The rest is as outlined by Mitch, i.e just a bunch of regular unsigned
    integer ops.
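A Python model of the classify-and-sort helper (names and details are my own; the Mill op is only sketched above):

```python
import math

def classify(x: float) -> str:
    if math.isnan(x):
        return "NaN"
    if math.isinf(x):
        return "Inf"
    if x == 0.0:
        return "Zero"
    # subnormal doubles sit below the smallest normal magnitude
    return "Subnormal" if abs(x) < 2**-1022 else "Normal"

def classify_and_sort(a: float, b: float):
    """Return both inputs tagged with their class, larger magnitude first."""
    pair = sorted([a, b], key=abs, reverse=True)
    return [(v, classify(v)) for v in pair]
```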

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Sat Jul 19 12:52:34 2025
    High Precision:
    Binary128 (borderline general use)

    How does the speed of software emulation of Binary128 compare to the
    speed of "double-double" (i.e. a pair of doubles, thus also using 128bit
    but with a slightly shorter mantissa (usually) and smaller exponent)?


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Terje Mathisen on Sat Jul 19 17:12:04 2025
    On Sat, 19 Jul 2025 13:24:27 +0000, Terje Mathisen wrote:

    Stefan Monnier wrote:
What I found on the Mill is that a few helper ops can make SW emulation
run in 2-4x hardware latency instead of 5-10x.

    What are those helper ops?

    The most complicated one would take two fp values, classify both (Zero/Subnormal/Normal/Inf/NaN) and return them sorted by magnitude.

    Next is an unpacker: fp to sign/(unbiased?) exponent/mantissa including hidden bit.

    Finally a packer/rounding unit which combines sign/exp/full mantissa
    with guard & sticky bits.

    OpenGL defines::

    x = ADDtoExponent( Exponent( x ), Fraction( x ) );

    The argument functions tear the FP apart::
    int Exponent( FloatingPoint x ) is the deBiased exponent
FloatingPoint Fraction( FloatingPoint x ) is 1/2 <= fraction < 1

    The ADDtoExponent function puts them back together again::

    There is also CopySign if you want to take the sign from the fraction.
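These argument functions behave like the classic frexp/ldexp pair, which makes the identity easy to check (my mapping, illustrated in Python):

```python
import math

def exponent(x: float) -> int:
    return math.frexp(x)[1]          # the de-biased exponent

def fraction(x: float) -> float:
    return math.frexp(x)[0]          # 1/2 <= |fraction| < 1

def add_to_exponent(e: int, f: float) -> float:
    return math.ldexp(f, e)          # puts them back together again
```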

    At the very end you use the output of the classifier to select either
    the regular result or one given by the special inputs. The special input handling runs in parallel with the normal emulation.

    The rest is as outlined by Mitch, i.e just a bunch of regular unsigned integer ops.



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stefan Monnier on Sat Jul 19 17:17:03 2025
    On Sat, 19 Jul 2025 16:52:34 +0000, Stefan Monnier wrote:

    High Precision:
    Binary128 (borderline general use)

    How does the speed of software emulation of Binary128 compare to the
    speed of "double-double" (i.e. a pair of doubles, thus also using 128bit
    but with a slightly shorter mantissa (usually) and smaller exponent)?

    This is exactly where the conversation diverges from FP128 into
    Exact FP arithmetics.

    I suspect it would surprise NOBODY here that My 66000 has direct access
    to Exact FP arithmetics (via CARRY) that even gets the inexact bit set correctly.

    { double hi, double lo } = FADD( double x, double y );

    is 1 instruction (2 if you count CARRY as an instruction instead of
    an instruction-modifier.)

    All the bits that did not get into hi can be found in lo.

    There are FMUL, and FDIV variants, too.
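On hardware without CARRY, the same hi/lo pair for addition is obtainable with Knuth's TwoSum, six ordinary FP adds; a CARRY-style instruction collapses this to one operation. A sketch (the standard algorithm, not My 66000 code):

```python
def two_sum(x: float, y: float):
    """Knuth's TwoSum: hi is the rounded sum, lo the rounding error,
    and hi + lo equals x + y exactly (for any argument order)."""
    hi = x + y
    xp = hi - y
    yp = hi - xp
    lo = (x - xp) + (y - yp)
    return hi, lo
```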


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Sat Jul 19 13:32:50 2025
    High Precision:
    Binary128 (borderline general use)

    How does the speed of software emulation of Binary128 compare to the
    speed of "double-double" (i.e. a pair of doubles, thus also using 128bit
    but with a slightly shorter mantissa (usually) and smaller exponent)?

    This is exactly where the conversation diverges from FP128 into
    Exact FP arithmetics.

    I suspect it would surprise NOBODY here that My 66000 has direct access
    to Exact FP arithmetics (via CARRY) that even gets the inexact bit set correctly.

    { double hi, double lo } = FADD( double x, double y );

    is 1 instruction (2 if you count CARRY as an instruction instead of
    an instruction-modifier.)

    Indeed, the hardware can provide specific support for double-double, but
    even with regular hardware (mostly FMACC), IIUC you can get decent
    performance, hence the question.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Terje Mathisen on Sat Jul 19 21:02:46 2025
    On Fri, 18 Jul 2025 22:17:30 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    BGB wrote:
    On 7/18/2025 8:48 AM, John Savard wrote:
<snip>

    Practically, likely better to leave Binary128 and Binary256 to
    software emulation; and instead focus on cheaper ways to make these
    faster (like, 128-bit and extended precision ADD/SUB and Shift).
    Things like my newer BITMOV instructions can also help.
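
    As a rough illustration, the extended-precision ADD such helper
    instructions would accelerate can be sketched in plain C (a
    hypothetical `add128` helper over two 64-bit limbs; the names are
    mine, not from any poster's ISA):

```c
#include <stdint.h>

/* Hypothetical sketch: 128-bit addition built from two 64-bit limbs,
 * the kind of operation a dedicated extended-precision ADD/ADC
 * helper instruction would collapse into one step. */
typedef struct { uint64_t lo, hi; } u128;

static u128 add128(u128 a, u128 b) {
    u128 r;
    r.lo = a.lo + b.lo;
    /* carry out of the low limb: the sum wrapped around modulo 2^64 */
    uint64_t carry = (r.lo < a.lo) ? 1 : 0;
    r.hi = a.hi + b.hi + carry;
    return r;
}
```

    In hardware (or with an add-with-carry helper op) the carry test
    disappears; in portable C it costs a compare and an extra add.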

    What I found on the Mill is that a few helper ops can make SW
    emulation run in 2-4x hardware latency instead of 5-10x.



    I found pretty much the opposite on x86-64.

    I can't think of any easy-to-implement helper op, or pair of ops,
    that could make a non-negligible difference on modern big cores.

    What could make quite a big difference is a better ABI.
    The current Linux x86-64 (and aarch64) __float128 ABIs are very
    bad, both in that parameters to the __float128 arithmetic
    primitives are passed in SIMD registers instead of GPRs, and in
    that flags (= exceptions) are returned in the FP control word.
    I don't quite know where I would prefer them, but certainly
    somewhere else.
    The RISC-V Linux ABI looks more reasonable, but I have only
    looked at it; I never even tried to implement anything on RISC-V,
    much less measured speed.

    And for Windows there is no official __float128 ABI at all.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Stefan Monnier on Sat Jul 19 19:15:39 2025
    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    High Precision:
    Binary128 (borderline general use)

    How does the speed of software emulation of Binary128 compare to the
    speed of "double-double" (i.e. a pair of doubles, thus also using 128bit
    but with a slightly shorter mantissa (usually) and smaller exponent)?

    I'm not a big fan of that format. IBM used it, but is trying to
    get rid of it for POWER at least (which has 128-bit IEEE hardware
    support, but with rather low performance).

    It has several drawbacks. One of them is that you cannot safely
    calculate sqrt(a^2 + b^2) in a higher precision without adjusting
    exponents or dividing.

    Also, the smallest positive number > 1 would then be
    1 + 2.2250738585072014e-308, discarding subnormals.

    How do you compare two numbers?

    You can get around that by fixing the exponent of the smaller
    number so it extends the bigger one without a gap in the
    mantissa, and forcing it to have the same sign. Ill-formed
    numbers should then be flagged as an error.

    What do you do if one of your numbers is NaN, the other
    one not? Do you prescribe the same sign for both numbers?
    (Probably yes).

    I can understand wanting the precise results, like in Mitch's
    architecture (also prescribed in IEEE, I believe). But as a general
    number format... I'd rather have 128-bit IEEE in software, but I
    would even more prefer a highly-performing 128-bit IEEE in hardware.
    SIMD registers are big enough to hold them.

    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Sat Jul 19 19:58:33 2025
    On Sat, 19 Jul 2025 4:40:48 +0000, Lawrence D'Oliveiro wrote:

    On Fri, 18 Jul 2025 13:36:00 -0500, BGB wrote:

    Practically, likely better to leave Binary128 and Binary256 to software
    emulation; and instead focus on cheaper ways to make these faster ...

    Is there much point to continuing with fixed-size formats at these precisions?

    Not if you are emulating them in SW::

    #define containers 3

    typedef struct { ubyte   s;
                     int56_t exp;
                     int64_t fract[containers]; } big_FP;

    By the time this exponent overflows, you are accounting for every
    particle in the universe !!! (not just visible universe).

    Sorry to waste 8 bits on the sign.

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Sat Jul 19 19:54:13 2025
    On Sat, 19 Jul 2025 19:15:39 +0000, Thomas Koenig wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    High Precision:
    Binary128 (borderline general use)

    How does the speed of software emulation of Binary128 compare to the
    speed of "double-double" (i.e. a pair of doubles, thus also using 128bit
    but with a slightly shorter mantissa (usually) and smaller exponent)?

    Well double_double is 1 pipelined FP calculation in My 66000 ISA
    whereas FP128 QADD would be about 20 mostly data-dependent inst-
    ructions, maybe a few more if all rounding modes are supported.

    So a bit more than 20× slowdown. On a machine without the proper
    support instructions double the previous number.

    I'm not a big fan of that format. IBM used it, but is trying to
    get rid of it for POWER at least (which has 128-bit IEEE hardware
    support, but with rather low performance).

    It has several drawbacks. One of them is that you cannot safely
    calculate sqrt(a^2 + b^2) in a higher precision without adjusting
    exponents or dividing.

    Something like, but a little more complicated than::
    {
        if( overflows ( a^2+b^2 ) )       expmod = +64;
        else if( underflows( a^2+b^2 ) )  expmod = -64;
        else                              expmod = 0;

        a = ADDexponent( a, expmod );
        b = ADDexponent( b, expmod );

        c = ADDexponent( sqrt( a^2+b^2 ), -expmod );
    }

    The only REAL hard part is performing the overflow and/or
    underflow subroutines/macros and dealing with the case where one
    overflows and the other underflows--doubles the complexity shown
    above.


    Also, the smallest positive number > 1 would then be
    1 + 2.2250738585072014e-308, discarding subnormals.

    How do you compare two numbers?

    comparison compare( double_double n1, double_double n2 )
    {
        if( n1.hi > n2.hi ) return greater;
        if( n1.hi < n2.hi ) return lesser;
        // high parts are equal
        if( n1.lo > n2.lo ) return greater;
        if( n1.lo < n2.lo ) return lesser;
        // low parts are equal
        return equal;
    }

    You can get around that by fixing the exponent of the smaller
    number so it extends the bigger one without a gap in the
    mantissa, and forcing it to have the same sign.

    In (as Anton called it) double-double, the sign bit actually
    carries a bit of significance: in effect, it tells whether the
    larger value was rounded up (+1) or not. In Kahan-Babuška
    summation the high part is not rounded, only the low part.
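
    For reference, plain Kahan compensated summation (of which
    Kahan-Babuška is a refinement) can be sketched in C; the function
    name is mine:

```c
#include <stddef.h>

/* Kahan compensated summation: c accumulates the low-order bits
 * lost when each addend is folded into the running sum. */
static double kahan_sum(const double *x, size_t n) {
    double sum = 0.0, c = 0.0;
    for (size_t i = 0; i < n; i++) {
        double y = x[i] - c;    /* apply the stored correction */
        double t = sum + y;     /* big + small: low bits of y are lost... */
        c = (t - sum) - y;      /* ...but can be recovered here */
        sum = t;
    }
    return sum;
}
```

    The correction term c is exactly the "low part" being tracked
    separately from the rounded high part.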

    Ill-formed
    numbers should then be flagged as an error.

    If low overlaps hi the number is ill-formed.
    If both are not {infinity, zero, or NaN} the number is ill-formed.
    All ill-formed numbers are treated as NaN.

    But that is rather hard::
    The real complexity arises when low underflows but high
    does not OR when hi overflows but low does not.

    What do you do if one of your numbers is NaN, the other
    one not? Do you prescribe the same sign for both numbers?
    (Probably yes).

    Kahan-Babuška yes, exact FP no.

    I can understand wanting the precise results, like in Mitch's
    architecture (also prescribed in IEEE, I believe). But as a general
    number format... I'd rather have 128-bit IEEE in software, but I
    would even more prefer a highly-performing 128-bit IEEE in hardware.
    SIMD registers are big enough to hold them.

  • From MitchAlsup1@21:1/5 to BGB on Sun Jul 20 00:25:04 2025
    On Sat, 19 Jul 2025 20:50:33 +0000, BGB wrote:

    On 7/18/2025 3:16 PM, MitchAlsup1 wrote:
    On Fri, 18 Jul 2025 18:36:00 +0000, BGB wrote:

    In my ISA, I have 128-bit shift, though not currently a generic funnel
    shift, but can be sorta faked with register MOV's. A funnel shift
    instruction could be possible, but would need to be a 4R encoding.

    While the shift count is limited to 64, one can do rather
    arbitrarily large shifts with My ISA. A 256-bit shift::

        CARRY   R14,{{O}{IO}{IO}{I}}
        SL      R11,R1,Rshift
        SL      R12,R2,Rshift
        SL      R13,R3,Rshift
        SLs     R14,R4,Rshift

    ----------------
    <snip>

    But the latter could still be an attractive option for other use
    cases, though.

    It is possible this could be done in an FPGA.
       The main issue is how to best do the Horizontal-ADD.

    Baugh-Wooley or Kogge-Stone.


    OK.

    Looking.
    One tradeoff is that hopefully whatever is done can be
    efficiently implemented in terms of FPGA primitives.

    BW and KS adders are designed such that the center bits from the
    multiplier tree can arrive last.

  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Sun Jul 20 09:24:44 2025
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Sat, 19 Jul 2025 19:15:39 +0000, Thomas Koenig wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    High Precision:
    Binary128 (borderline general use)

    How does the speed of software emulation of Binary128 compare to the
    speed of "double-double" (i.e. a pair of doubles, thus also using
    128bit but with a slightly shorter mantissa (usually) and smaller
    exponent)?

    Well double_double is 1 pipelined FP calculation in My 66000 ISA
    whereas FP128 QADD would be about 20 mostly data-dependent inst-
    ructions, maybe a few more if all rounding modes are supported.

    So a bit more than 20× slowdown. On a machine without the proper
    support instructions double the previous number.

    I'm not a big fan of that format. IBM used it, but is trying to
    get rid of it for POWER at least (which has 128-bit IEEE hardware
    support, but with rather low performance).

    It has several drawbacks. One of them is that you cannot safely
    calculate sqrt(a^2 + b^2) in a higher precision without adjusting
    exponents or dividing.

    Something like, but a little more complicated than::
    {
        if( overflows ( a^2+b^2 ) )       expmod = +64;
        else if( underflows( a^2+b^2 ) )  expmod = -64;
        else                              expmod = 0;

        a = ADDexponent( a, expmod );
        b = ADDexponent( b, expmod );

        c = ADDexponent( sqrt( a^2+b^2 ), -expmod );
    }

    That means that a naive programmer will not do it, and since
    scientific and engineering programmers tend not to do this
    kind of thing, it will very likely not happen in numerical code.

    The only REAL hard part is performing the overflow and/or
    underflow subroutines/macros and dealing with the case where one
    overflows and the other underflows--doubles the complexity shown
    above.

    ... even more so.



    Also, the smallest positive number > 1 would then be
    1 + 2.2250738585072014e-308, discarding subnormals.

    How do you compare two numbers?

    comparison compare( double_double n1, double_double n2 )
    {
    if( n1.hi > n2.hi ) return greater;
    if( n1.hi < n2.hi ) return lesser;

    That fails if the high part overlaps with the low part. Then
    one has to trap, adjust, or handle it accordingly; adjusting
    especially on a binary read...

    [...]

    If low overlaps hi the number is ill-formed.
    If both are not {infinity, zero, or NaN} the number is ill-formed.
    All ill-formed numbers are treated as NaN.

    But that is rather hard::
    The real complexity arises when low underflows but high
    does not OR when hi overflows but low does not.

    There are two main drawbacks of the double-double format: Complexity
    and wasted bits. This is the reason, I believe, that the only
    company which used it, IBM, is moving away from that format, and
    nobody else uses it; everybody else uses 128-bit IEEE (hardware
    in POWER 9+, software emulation for everybody else).


  • From John Savard@21:1/5 to Stefan Monnier on Sun Jul 20 14:28:26 2025
    On Sat, 19 Jul 2025 12:52:34 -0400, Stefan Monnier wrote:

    How does the speed of software emulation of Binary128 compare to the
    speed of "double-double" (i.e. a pair of doubles, thus also using 128bit
    but with a slightly shorter mantissa (usually) and smaller exponent)?

    That kind of floating-point is only capable of providing an acceleration
    if the underlying architecture helps you.

    Thus, this sort of thing was done _in hardware_ for the IBM System/360
    model 85 for 128-bit extended precision.

    Or, in software, this is how 72-bit double precision was done on the IBM
    704 computer - because its 36-bit single-precision floating-point
    instructions saved the lower precision bits from single precision
    arithmetic in a second register, where they could be used in a software
    routine for double-precision arithmetic.

    John Savard

  • From Stefan Monnier@21:1/5 to All on Sun Jul 20 10:50:33 2025
    Thomas Koenig [2025-07-20 09:24:44] wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Sat, 19 Jul 2025 19:15:39 +0000, Thomas Koenig wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    How does the speed of software emulation of Binary128 compare to the
    speed of "double-double" (i.e. a pair of doubles, thus also using
    128bit but with a slightly shorter mantissa (usually) and smaller
    exponent)?

    Well double_double is 1 pipelined FP calculation in My 66000 ISA
    whereas FP128 QADD would be about 20 mostly data-dependent inst-
    ructions, maybe a few more if all rounding modes are supported.
    So a bit more than 20× slowdown. On a machine without the proper
    support instructions double the previous number.

    That's a kind of worst-case ratio, tho. It's only for ADD, and may
    depend on the exact convention you use to represent/handle NaNs,
    synchronize sign bits, etc...

    How do you compare two numbers?
    comparison compare( double_double n1, double_double n2 )
    {
    if( n1.hi > n2.hi ) return greater;
    if( n1.hi < n2.hi ) return lesser;
    That fails if the high part overlaps with the low part.

    AFAIK in double-double such overlap is not allowed, IOW can happen only
    if you have a bug elsewhere in your library.

    There are two main drawbacks of the double-double format: Complexity
    and wasted bits. This is the reason, I believe, that the only
    company which used it, IBM, is moving away from that format, and
    nobody else uses it; everybody else uses 128-bit IEEE (hardware
    in POWER 9+, software emulation for everybody else).

    If you can provide hardware support, Binary128 is likely a much
    better choice, indeed.

    My question is specifically about the performance difference for
    software-only implementations, especially if you take into
    account the extra burden of dealing with NaNs and friends (i.e.
    clean & robust implementations of the library), so as to see if
    the difference is small enough to really "bury" the
    double-double format.


    Stefan

  • From Thomas Koenig@21:1/5 to Stefan Monnier on Sun Jul 20 16:20:51 2025
    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:

    My question is specifically about the performance difference for
    software-only implementations, especially if you take into
    account the extra burden of dealing with NaNs and friends (i.e.
    clean & robust implementations of the library), so as to see if
    the difference is small enough to really "bury" the
    double-double format.

    It would be possible for POWER, where there are three versions
    of long double: IEEE with software emulation, IEEE with hardware
    support and IBM long double.

    I can give you numbers comparing IEEE with hardware support with
    IBM long double on the same machine. For a quick & dirty matmul
    benchmark (below) I get, with IBM long double

    16
    MFlops: 173.6

    and with IEEE

    16
    MFlops: 437.3

    For comparison, my home box (totally different) gives me

    16
    MFlops: 95.9

    which is slower by a significant factor, so long double might
    actually be faster. But building an IEEE long double toolchain
    is a PITA; I'm not going to do this as a benchmark :-)


    program main
      implicit none
      integer, parameter :: wp = selected_real_kind(30)
      integer, parameter :: n=801, p=801, m=660
      real (kind=wp), allocatable :: c(:,:), a(:,:), b(:,:)
      character(len=80) :: line
      real (kind=wp) :: fl = 2.d0*n*m*p
      integer :: i,j
      real :: t1, t2

      allocate (c(n,p), a(n,m), b(m,p))

      print *,wp

      line = '10 10'
      call random_number(a)
      call random_number(b)
      call cpu_time (t1)
      c = matmul(a,b)
      call cpu_time (t2)
      print '(A,F9.1)',"MFlops: ", fl/(t2-t1)*1e-6
      read (unit=line,fmt=*) i,j
      write (unit=line,fmt=*) c(i,j)
    end program main




  • From MitchAlsup1@21:1/5 to BGB on Sun Jul 20 17:40:14 2025
    On Sun, 20 Jul 2025 1:58:02 +0000, BGB wrote:

    On 7/19/2025 12:32 PM, Stefan Monnier wrote:
    High Precision:
    Binary128 (borderline general use)

    How does the speed of software emulation of Binary128 compare to the
    speed of "double-double" (i.e. a pair of doubles, thus also using
    128bit but with a slightly shorter mantissa (usually) and smaller
    exponent)?

    This is exactly where the conversation diverges from FP128 into
    Exact FP arithmetics.

    I suspect it would surprise NOBODY here that My 66000 has direct access
    to Exact FP arithmetics (via CARRY) that even gets the inexact bit set
    correctly.

    { double hi, double lo } = FADD( double x, double y );

    is 1 instruction (2 if you count CARRY as an instruction instead of
    an instruction-modifier.)
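
    On an ISA without such support, the same hi/lo pair can be
    computed in software with Knuth's TwoSum, at six FP operations
    per exact add; a minimal C sketch (type and function names mine):

```c
/* Knuth's TwoSum: computes hi and lo such that hi + lo == x + y
 * exactly, with hi = fl(x + y) and lo the rounding error.
 * Works for any x, y (no magnitude ordering required). */
typedef struct { double hi, lo; } dd;

static dd two_sum(double x, double y) {
    dd r;
    r.hi = x + y;
    double xp = r.hi - y;          /* reconstructed x */
    double yp = r.hi - xp;         /* reconstructed y */
    r.lo = (x - xp) + (y - yp);    /* what rounding discarded */
    return r;
}
```

    This is the software equivalent of the one-instruction exact FADD
    above: six data-dependent FP ops instead of one.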

    Indeed, the hardware can provide specific support for double-double, but
    even with regular hardware (mostly FMACC), IIUC you can get decent
    performance, hence the question.


    Does have the drawback though of (AFAIK) needing single-rounded FMA,
    rather than being able to use double-rounded.



    In turn, single-rounded has the costs of needing a full-width multiplier
    (~ 2x the mantissa width internally) and a mantissa large enough to
    deal with around 3x the destination width for the adder stage (to deal
    with "cancellation"). In turn, leading to a significantly more expensive
    FPU.

    The double width output of the multiplier is coupled on the least
    significant side by an adder of augend bits shifted down so far
    that they only play a part in rounding, and there is an
    incrementer on the HoB side to deal with an augend significantly
    larger than the product. So, in practice it is 4× as wide
    (minus 3).

    Well, vs, say:
      FMUL that only generates the high order bits;
        is not IEEE compliant.
      FADD that allows bits to fall off the bottom;
        is not 754 compliant.
        So, cancellation scenarios may reveal the bits that no
        longer exist.
        Think it through again.
    ...

    So, alas, this is something that won't really work with my
    existing FPU design (and "fixing" this would likely be too
    expensive).

    ....


    Stefan

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Sun Jul 20 17:44:13 2025
    On Sun, 20 Jul 2025 9:24:44 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Sat, 19 Jul 2025 19:15:39 +0000, Thomas Koenig wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    High Precision:
    Binary128 (borderline general use)

    How does the speed of software emulation of Binary128 compare to the
    speed of "double-double" (i.e. a pair of doubles, thus also using
    128bit but with a slightly shorter mantissa (usually) and smaller
    exponent)?

    Well double_double is 1 pipelined FP calculation in My 66000 ISA
    whereas FP128 QADD would be about 20 mostly data-dependent inst-
    ructions, maybe a few more if all rounding modes are supported.

    So a bit more than 20× slowdown. On a machine without the proper
    support instructions double the previous number.

    I'm not a big fan of that format. IBM used it, but is trying to
    get rid of it for POWER at least (which has 128-bit IEEE hardware
    support, but with rather low performance).

    It has several drawbacks. One of them is that you cannot safely
    calculate sqrt(a^2 + b^2) in a higher precision without adjusting
    exponents or dividing.

    Something like, but a little more complicated than::
    {
        if( overflows ( a^2+b^2 ) )       expmod = +64;
        else if( underflows( a^2+b^2 ) )  expmod = -64;
        else                              expmod = 0;

        a = ADDexponent( a, expmod );
        b = ADDexponent( b, expmod );

        c = ADDexponent( sqrt( a^2+b^2 ), -expmod );
    }

    That means that a naive programmer will not do it, and since
    scientific and engineering programmers tend not to do this
    kind of thing, it will very likely not happen in numerical code.

    The only REAL hard part is performing the overflow and/or underflow
    subroutines/macros and dealing with the case where one overflows
    and the other underflows--doubles the complexity shown above.

    .... even more so.



    Also, the smallest positive number > 1 would then be
    1 + 2.2250738585072014e-308, discarding subnormals.

    How do you compare two numbers?

    comparison compare( double_double n1, double_double n2 )
    {
    if( n1.hi > n2.hi ) return greater;
    if( n1.hi < n2.hi ) return lesser;

    That fails if the high part overlaps with the low part. Then
    one has to trap, adjust, or handle it accordingly; adjusting
    especially on a binary read...

    Kahan describes a process whereby overlapping a and b are
    repartitioned into non-overlapping c and d. He calls it
    distillation.

    In any event, My 66000 does not produce overlapping a and b
    when using CARRY to perform exact arithmetics.

    [...]

    If low overlaps hi the number is ill-formed.
    If both are not {infinity, zero, or NaN} the number is ill-formed.
    All ill-formed numbers are treated as NaN.

    But that is rather hard::
    The real complexity arises when low underflows but high
    does not OR when hi overflows but low does not.

    There are two main drawbacks of the double-double format: Complexity
    and wasted bits. This is the reason, I believe, that the only
    company which used it, IBM, is moving away from that format, and
    nobody else uses it; everybody else uses 128-bit IEEE (hardware
    in POWER 9+, software emulation for everybody else).

    The advantage is that a library of exact FP arithmetics can
    handle (at moderate speed) fractions of any width.

  • From Terje Mathisen@21:1/5 to All on Sun Jul 20 22:38:40 2025
    MitchAlsup1 wrote:
    On Sat, 19 Jul 2025 20:50:33 +0000, BGB wrote:

    On 7/18/2025 3:16 PM, MitchAlsup1 wrote:
    On Fri, 18 Jul 2025 18:36:00 +0000, BGB wrote:

    In my ISA, I have 128-bit shift, though not currently a generic funnel
    shift, but can be sorta faked with register MOV's. A funnel shift
    instruction could be possible, but would need to be a 4R encoding.

    While the shift count is limited to 64, one can do rather
    arbitrarily large shifts with My ISA. A 256-bit shift::

        CARRY    R14,{{O}{IO}{IO}{I}}
        SL    R11,R1,Rshift
        SL    R12,R2,Rshift
        SL    R13,R3,Rshift
        SLs    R14,R4,Rshift

    You still need special code (reg-reg moves) when the Rshift count >= 64, right?

    Possibly easiest to branch around the higher shift counts:

    while Rshift >= 64 {
        r0 = r1;
        r1 = r2;
        r2 = r3;
        r3 = 0;
        Rshift -= 64;
    }

    If higher shift counts are common, then I'd use separate code for each
    block size.
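
    Combining the word-move loop above with per-limb funnel shifts,
    a 256-bit left shift can be sketched in C (limb 0 least
    significant; a hypothetical helper, not any poster's actual
    code):

```c
#include <stdint.h>

/* 256-bit logical left shift over four 64-bit limbs (v[0] = least
 * significant). Whole-limb moves handle counts >= 64, as in the
 * loop above; the remaining 0..63-bit shift is a funnel across
 * adjacent limbs. The count != 0 guard avoids the undefined
 * 64-bit shift by 64. */
static void shl256(uint64_t v[4], unsigned count) {
    while (count >= 64) {
        v[3] = v[2]; v[2] = v[1]; v[1] = v[0]; v[0] = 0;
        count -= 64;
    }
    if (count) {
        for (int i = 3; i > 0; i--)
            v[i] = (v[i] << count) | (v[i-1] >> (64 - count));
        v[0] <<= count;
    }
}
```

    Branching around the word-move loop when the count is known to
    be small matches the "separate code for each block size" idea.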

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Stefan Monnier@21:1/5 to All on Mon Jul 21 10:48:34 2025
    Thomas Koenig [2025-07-20 16:20:51] wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    My question is specifically about the performance difference for
    software-only implementations, especially if you take into account the
    [...]
    I can give you numbers comparing IEEE with hardware support with
    IBM long double on the same machine. For a quick & dirty matmul
    benchmark (below) I get, with IBM long double

    16
    MFlops: 173.6

    and with IEEE

    16
    MFlops: 437.3

    [ Not sure what "16" is about. ]
    So, IIUC this shows hardware-supported Binary128 to be about 3x faster
    than software-only double-double.

    For comparison, my home box (totally different) gives me

    16
    MFlops: 95.9

    Is this hard Binary128, soft Binary128, or soft double-double?

    Thanks!


    Stefan

  • From Thomas Koenig@21:1/5 to Stefan Monnier on Mon Jul 21 16:25:50 2025
    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    Thomas Koenig [2025-07-20 16:20:51] wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    My question is specifically about the performance difference for
    software-only implementations, especially if you take into account the
    [...]
    I can give you numbers comparing IEEE with hardware support with
    IBM long double on the same machine. For a quick & dirty matmul
    benchmark (below) I get, with IBM long double

    16
    MFlops: 173.6

    and with IEEE

    16
    MFlops: 437.3

    [ Not sure what "16" is about. ]

    The KIND number, I could have left it out :-)

    So, IIUC this shows hardware-supported Binary128 to be about 3x faster
    than software-only double-double.

    Yep.


    For comparison, my home box (totally different) gives me

    16
    MFlops: 95.9

    Is this hard Binary128, soft Binary128, or soft double-double?

    This is soft Binary128, on an x86_64 box.

