• Re: Random: Very Low Precision FP

    From MitchAlsup@21:1/5 to All on Tue Aug 26 21:17:47 2025
    BGB <cr88192@gmail.com> posted:

    Well, idea here is that sometimes one wants to be able to do
    floating-point math where accuracy is a very low priority.

    Say, the sort of stuff people might use FP8 or BF16 or maybe Binary16
    for (though, what I am thinking of here is low-precision even by
    Binary16 standards).

    For 8-bit stuff, just use 5 memory tables [256×256]

    But, will use Binary16 and BF16 as the example formats.

    So, one can note that some ops can be approximated with modified
    integer ADD/SUB (excluding sign-bit handling):
    a*b : A+B-0x3C00 (0x3F80 for BF16)
    a/b : A-B+0x3C00
    sqrt(a): (A>>1)+0x1E00
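
    In C, ignoring signs and special values, the trick looks roughly like
    this (a minimal sketch; the inputs are raw Binary16 bit patterns in a
    uint16_t, assumed positive, finite, and normal, and the helper names
    are mine):

    #include <stdint.h>

    /* Approximate Binary16 ops as integer arithmetic on the raw bits.
       0x3C00 is the Binary16 bit pattern of 1.0 (0x3F80 for BF16). */
    static uint16_t fp16_mul_approx(uint16_t a, uint16_t b)
    {
        return (uint16_t)(a + b - 0x3C00);
    }

    static uint16_t fp16_div_approx(uint16_t a, uint16_t b)
    {
        return (uint16_t)(a - b + 0x3C00);
    }

    static uint16_t fp16_sqrt_approx(uint16_t a)
    {
        return (uint16_t)((a >> 1) + 0x1E00);  /* 0x1E00 = 0x3C00 >> 1 */
    }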

    You are aware that GPUs perform elementary transcendental functions
    (32-bit) in 5 cycles {sin(), cos(), tan(), exp(), ln(), ...}.
    These functions get within 1.5-2 ULP. See the papers by Oberman,
    Piñeiro, and Matula, circa 2000-2005, for relevant data. I took a
    crack at this (patented: Samsung) and got within 0.7-1.2 ULP using a
    three-term polynomial instead of a two-term polynomial.
    Standard GPU FP math (32-bit and 16-bit) takes 4 cycles and is now
    IEEE 754 accurate (except for a couple of outlying cases).

    So, I don't see this suggestion bringing value to the table.

    The harder ones though, are ADD/SUB.

    A partial ADD seems to be:
    a+b: A+((B-A)>>1)+0x0400

    But, this simple case seems not to hold up either when doing
    subtraction, or when A and B are far apart.

    So, it would appear that either there is a 4th term or the bias is
    variable (depending on the B-A term, and differing between ADD and SUB).
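
    A sketch of that partial ADD in the same style, with the same caveats
    (helper name mine, arithmetic shift assumed for b < a):

    /* (B-A)>>1 averages the two bit patterns and 0x0400 is one exponent
       step (a factor of 2).  Exact for a == b, roughly right when A and
       B are close, and wrong when they are far apart, where the true
       sum approaches max(a, b). */
    static uint16_t fp16_add_approx(uint16_t a, uint16_t b)
    {
        return (uint16_t)(a + (((int)b - (int)a) >> 1) + 0x0400);
    }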

    Seems like the high bits (exponent and operator) could be used to drive
    a lookup table, but this is lame. The magic bias appears to have
    non-linear properties, so it isn't as easily represented with basic
    integer operations.

    Then again, probably other people know about all of this and might know
    what I am missing.

    I still recommend getting the right answer over getting a close but
    wrong answer a couple of cycles earlier.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to John Savard on Wed Aug 27 01:39:47 2025
    On Wed, 27 Aug 2025 01:35:08 +0000, John Savard wrote:

    Addition and subtraction required a lookup table - but because the two numbers involved needed to be not too far apart in magnitude for the operations to do anything, the lookup tables required were shorter than
    they would be for numbers represented normally, where it would be multiplication and division that required the lookup tables.

    So to add two numbers, first switch them if necessary, so that the larger
    one is a, and the smaller one is b.

    Calculate b/a by subtraction.

    Then use a short table to find (a+b)/a from b/a. The value found from that table can be added to a to give (the logarithmic representation of) a+b.
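
    In C it might look like this (a toy sketch with a scaling I picked
    for illustration, not the original FOCUS encoding; the short table is
    computed on the fly here, positive values only):

    #include <math.h>

    /* A value x > 0 is stored as L(x) = round(log2(x) * 16). */
    static int lns_add(int la, int lb)
    {
        if (lb > la) { int t = la; la = lb; lb = t; }  /* a is the larger */
        int d = la - lb;      /* "b/a by subtraction": -d/16 = log2(b/a) */
        /* The short table: d -> round(16 * log2(1 + 2^(-d/16))), i.e.
           L((a+b)/a); it falls to 0 once b is too small to matter. */
        int corr = (int)lround(16.0 * log2(1.0 + exp2(-d / 16.0)));
        return la + corr;     /* added to a, giving L(a+b) */
    }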

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to BGB on Wed Aug 27 01:35:08 2025
    On Tue, 26 Aug 2025 13:08:29 -0500, BGB wrote:

    Then again, probably other people know about all of this and might know
    what I am missing.

    A long time ago, a notation called FOCUS was proposed for low-precision
    floats. It represented numbers by their logarithms. Multiplication and
    division were done quickly by addition and subtraction.

    Addition and subtraction required a lookup table - but because the two
    numbers involved needed to be not too far apart in magnitude for the
    operations to do anything, the lookup tables required were shorter than
    they would be for numbers represented normally, where it would be multiplication and division that required the lookup tables.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to MitchAlsup on Wed Aug 27 13:56:53 2025
    MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    Well, idea here is that sometimes one wants to be able to do
    floating-point math where accuracy is a very low priority.

    Say, the sort of stuff people might use FP8 or BF16 or maybe Binary16
    for (though, what I am thinking of here is low-precision even by
    Binary16 standards).

    For 8-bit stuff, just use 5 memory tables [256×256]

    They don't even need to be full 8-bit: With a tiny amount of logic to
    handle the signs you are already down to 128x128, right?

    Then again, probably other people know about all of this and might know
    what I am missing.

    The infamous invsqrt() trick is the canonical example of where all the
    quirks of the IEEE 754 format work just right to get you to 10+ bits
    with a single NR iteration.
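
    For reference, the usual formulation (the widely published Quake III
    version, with its single NR iteration):

    #include <stdint.h>
    #include <string.h>

    static float invsqrt(float x)
    {
        float xhalf = 0.5f * x;
        uint32_t i;
        memcpy(&i, &x, sizeof i);           /* reinterpret the bits */
        i = 0x5f3759df - (i >> 1);          /* magic initial estimate */
        memcpy(&x, &i, sizeof x);
        return x * (1.5f - xhalf * x * x);  /* one NR step: 10+ good bits */
    }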

    Your basic ops examples are a lot more iffy.

    I still recommend getting the right answer over getting a close but
    wrong answer a couple of cycles earlier.

    Exactly.

    I think you showed me the idea of usually getting the correct result in
    N cycles, except that in a small number of cases the trailing bits
    would be too close to a rounding boundary, and those cases would take
    one more NR iteration.

    I just realized that the code I wrote to fix Pentium FDIV could have
    been even more efficient on a proper superscalar OoO CPU:

    Start the FDIV immediately, then at the same time do the divisor
    mantissa inspection to determine if the workaround would be needed (5
    out of 1024 cases), and only if that happens, start the slower path that
    takes up to twice as long.

    The idea is that for 99.5% of all divisors, the only cost would be a
    correctly predicted branch costing close to zero cycles, while the
    remaining 0.5% would require two FDIV operations, so 80 instead of 40
    cycles.

    OTOH, that same Big OoO core can probably predict that the entire
    mantissa inspection part will end up with a "skip the workaround" branch
    and start the FDIV almost at once. I'm assuming that when the mispredict
    turns up, the core can stop a long operation like FDIV more or less immediately and discard the current status.

    (From memory)

    #include <stdint.h>
    #include <string.h>

    /* Bitmap with one bit set for each of the 5-in-1024 at-risk divisor
       mantissa patterns (layout assumed: one bit per 10-bit pattern). */
    extern const uint8_t fdiv_table[128];

    double fdiv(double a, double b)
    {
        uint64_t mant10;
        memcpy(&mant10, &b, sizeof(b));
        mant10 = (mant10 >> 42) & 1023;  /* top 10 explicit mantissa bits */
        if (fdiv_table[mant10 >> 3] & (1 << (mant10 & 7))) {
            /* Set the FPU to extended/long double precision here,
               saving the previous mode... */
            b *= 15.0/16.0; /* Exact operation! */
            a *= 15.0/16.0; /* Exact operation! */
            /* ...and restore the previous precision mode here. */
        }
        return a / b;
    }

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Terje Mathisen on Wed Aug 27 14:43:20 2025
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    They don't even need to be full 8-bit: With a tiny amount of logic to
    handle the signs you are already down to 128x128, right?

    With exponential representation, say with base 2^(1/4) (range
    0.000018-55109 for exponents -63 to +63, and factor 1.19 between
    consecutive numbers), if the absolutely smaller number is smaller by a
    fixed amount in exponential representation (14 for our base 2^(1/4)
    numbers), adding or subtracting it won't make a difference. Which
    means that we need a 14*15/2=105 entry table (with 8-bit results) for
    addition and a table with the same size for subtraction, and a little
    logic for handling the cases where the numbers are too different or 0,
    or, if supported, +/-Inf or NaN (which reduce the range a little).

    If you want such a representation with finer-grained resolution, you
    get a smaller range and need larger tables. E.g., if you want to have
    a granularity as good as the minimum granularity of FP with 3-bit
    mantissa (with hidden 1), i.e., 1.125, you need 2^(1/6), with a
    granularity of 1.122 and a range 0.00069-1448; adding or subtracting a
    number where the representation is 25 smaller makes no difference, so
    the table sizes are 25*26/2=325 entries. Still looks relatively
    cheap.
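
    A sketch of the addition in C (n = 4 for the base 2^(1/4) variant,
    n = 6 for the finer one; the correction is computed on the fly here
    rather than from the table, and clamping to the exponent range is
    omitted):

    #include <math.h>

    /* A magnitude is stored as an integer exponent e, value = 2^(e/n). */
    static int exp_add(int ea, int eb, int n)
    {
        if (eb > ea) { int t = ea; ea = eb; eb = t; } /* ea is the larger */
        int d = ea - eb;
        /* correction = round(n * log2(1 + 2^(-d/n))); it is 0 for
           d >= 14 (n = 4) and d >= 25 (n = 6), matching the cutoffs
           above. */
        int corr = (int)lround(n * log2(1.0 + exp2(-(double)d / n)));
        return ea + corr;
    }

    Subtraction is analogous, with log2(1 - 2^(-d/n)) and special
    handling for d = 0, where the result is exactly zero.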

    Why are people going for something FP-like instead of exponential
    if the number of bits is so small?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to Anton Ertl on Wed Aug 27 15:29:40 2025
    On Wed, 27 Aug 2025 14:43:20 GMT, anton@mips.complang.tuwien.ac.at
    (Anton Ertl) wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    They don't even need to be full 8-bit: With a tiny amount of logic to
    handle the signs you are already down to 128x128, right?

    With exponential representation, say with base 2^(1/4) (range
    0.000018-55109 for exponents -63 to +63, and factor 1.19 between
    consecutive numbers), if the absolutely smaller number is smaller by a
    fixed amount in exponential representation (14 for our base 2^(1/4)
    numbers), adding or subtracting it won't make a difference. Which
    means that we need a 14*15/2=105 entry table (with 8-bit results) for
    addition and a table with the same size for subtraction, and a little
    logic for handling the cases where the numbers are too different or 0,
    or, if supported, +/-Inf or NaN (which reduce the range a little).

    If you want such a representation with finer-grained resolution, you
    get a smaller range and need larger tables. E.g., if you want to have
    a granularity as good as the minimum granularity of FP with 3-bit
    mantissa (with hidden 1), i.e., 1.125, you need 2^(1/6), with a
    granularity of 1.122 and a range 0.00069-1448; adding or subtracting a
    number where the representation is 25 smaller makes no difference, so
    the table sizes are 25*26/2=325 entries. Still looks relatively
    cheap.

    Why are people going for something FP-like instead of exponential
    if the number of bits is so small?

    - anton

    Excellent question. Wish I had an answer.

    Given that the use case almost invariably is NN, the only interesting
    values are [or should be] fractions in the range 0 to 1. Little/no
    need for floating point.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)