• Re: Random: Very Low Precision FP

    From MitchAlsup@21:1/5 to All on Tue Aug 26 21:17:47 2025
    BGB <cr88192@gmail.com> posted:

    Well, idea here is that sometimes one wants to be able to do
    floating-point math where accuracy is a very low priority.

    Say, the sort of stuff people might use FP8 or BF16 or maybe Binary16
    for (though, what I am thinking of here is low-precision even by
    Binary16 standards).

    For 8-bit stuff, just use 5 memory tables [256×256]

    But, will use Binary16 and BF16 as the example formats.

    So, one can note that some ops can be approximated with modified
    integer ADD/SUB (excluding sign-bit handling):
    a*b : A+B-0x3C00 (0x3F80 for BF16)
    a/b : A-B+0x3C00
    sqrt(a): (A>>1)+0x1E00
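
    In C, ignoring signs and special values, the trick looks roughly like
    this (a minimal sketch; the inputs are raw Binary16 bit patterns in a
    uint16_t, assumed positive, finite, and normal, and the helper names
    are mine):

    #include <stdint.h>

    /* Approximate Binary16 ops as integer arithmetic on the raw bits.
       0x3C00 is the Binary16 bit pattern of 1.0 (0x3F80 for BF16). */
    static uint16_t fp16_mul_approx(uint16_t a, uint16_t b)
    {
        return (uint16_t)(a + b - 0x3C00);
    }

    static uint16_t fp16_div_approx(uint16_t a, uint16_t b)
    {
        return (uint16_t)(a - b + 0x3C00);
    }

    static uint16_t fp16_sqrt_approx(uint16_t a)
    {
        return (uint16_t)((a >> 1) + 0x1E00);  /* 0x1E00 = 0x3C00 >> 1 */
    }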

    You are aware that GPUs perform elementary transcendental functions
    (32-bit) in 5 cycles {sin(), cos(), tan(), exp(), ln(), ...}.
    These functions get within 1.5-2 ULP. See the papers by Oberman,
    Piñeiro, and Matula, circa 2000-2005, for relevant data. I took a
    crack at this (patented: Samsung) and got within 0.7-1.2 ULP using a
    three-term polynomial instead of a two-term polynomial.
    Standard GPU FP math (32-bit and 16-bit) takes 4 cycles and is now
    IEEE 754 accurate (except for a couple of outlying cases).

    So, I don't see this suggestion bringing value to the table.

    The harder ones though, are ADD/SUB.

    A partial ADD seems to be:
    a+b: A+((B-A)>>1)+0x0400

    But, this simple case seems not to hold up either when doing
    subtraction, or when A and B are far apart.

    So, it would appear that either there is a 4th term or the bias is
    variable (depending on the B-A term, and differing between ADD and SUB).
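
    A sketch of that partial ADD in the same style, with the same caveats
    (helper name mine, arithmetic shift assumed for b < a):

    /* (B-A)>>1 averages the two bit patterns and 0x0400 is one exponent
       step (a factor of 2).  Exact for a == b, roughly right when A and
       B are close, and wrong when they are far apart, where the true
       sum approaches max(a, b). */
    static uint16_t fp16_add_approx(uint16_t a, uint16_t b)
    {
        return (uint16_t)(a + (((int)b - (int)a) >> 1) + 0x0400);
    }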

    Seems like the high bits (exponent and operator) could be used to drive
    a lookup table, but this is lame. The magic bias appears to have
    non-linear properties, so it isn't as easily represented with basic
    integer operations.

    Then again, probably other people know about all of this and might know
    what I am missing.

    I still recommend getting the right answer over getting a close but
    wrong answer a couple of cycles earlier.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to John Savard on Wed Aug 27 01:39:47 2025
    On Wed, 27 Aug 2025 01:35:08 +0000, John Savard wrote:

    Addition and subtraction required a lookup table - but because the two numbers involved needed to be not too far apart in magnitude for the operations to do anything, the lookup tables required were shorter than
    they would be for numbers represented normally, where it would be multiplication and division that required the lookup tables.

    So to add two numbers, first switch them if necessary, so that the larger
    one is a, and the smaller one is b.

    Calculate b/a by subtraction.

    Then use a short table to find (a+b)/a from b/a. The value found from that table can be added to a to give (the logarithmic representation of) a+b.
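
    In C it might look like this (a toy sketch with a scaling I picked
    for illustration, not the original FOCUS encoding; the short table is
    computed on the fly here, positive values only):

    #include <math.h>

    /* A value x > 0 is stored as L(x) = round(log2(x) * 16). */
    static int lns_add(int la, int lb)
    {
        if (lb > la) { int t = la; la = lb; lb = t; }  /* a is the larger */
        int d = la - lb;      /* "b/a by subtraction": -d/16 = log2(b/a) */
        /* The short table: d -> round(16 * log2(1 + 2^(-d/16))), i.e.
           L((a+b)/a); it falls to 0 once b is too small to matter. */
        int corr = (int)lround(16.0 * log2(1.0 + exp2(-d / 16.0)));
        return la + corr;     /* added to a, giving L(a+b) */
    }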

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to BGB on Wed Aug 27 01:35:08 2025
    On Tue, 26 Aug 2025 13:08:29 -0500, BGB wrote:

    Then again, probably other people know about all of this and might know
    what I am missing.

    A long time ago, a notation called FOCUS was proposed for low-precision
    floats. It represented numbers by their logarithms. Multiplication and
    division were done quickly by addition and subtraction.

    Addition and subtraction required a lookup table - but because the two
    numbers involved needed to be not too far apart in magnitude for the
    operations to do anything, the lookup tables required were shorter than
    they would be for numbers represented normally, where it would be multiplication and division that required the lookup tables.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to MitchAlsup on Wed Aug 27 13:56:53 2025
    MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    Well, idea here is that sometimes one wants to be able to do
    floating-point math where accuracy is a very low priority.

    Say, the sort of stuff people might use FP8 or BF16 or maybe Binary16
    for (though, what I am thinking of here is low-precision even by
    Binary16 standards).

    For 8-bit stuff, just use 5 memory tables [256×256]

    They don't even need to be full 8-bit: With a tiny amount of logic to
    handle the signs you are already down to 128x128, right?

    Then again, probably other people know about all of this and might know
    what I am missing.

    The infamous invsqrt() trick is the canonical example of where all the
    quirks of the IEEE 754 format work just right to get you to 10+ bits
    with a single NR iteration.
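
    For reference, the usual formulation (the widely published Quake III
    version, with its single NR iteration):

    #include <stdint.h>
    #include <string.h>

    static float invsqrt(float x)
    {
        float xhalf = 0.5f * x;
        uint32_t i;
        memcpy(&i, &x, sizeof i);           /* reinterpret the bits */
        i = 0x5f3759df - (i >> 1);          /* magic initial estimate */
        memcpy(&x, &i, sizeof x);
        return x * (1.5f - xhalf * x * x);  /* one NR step: 10+ good bits */
    }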

    Your basic ops examples are a lot more iffy.

    I still recommend getting the right answer over getting a close but
    wrong answer a couple of cycles earlier.

    Exactly.

    I think you showed me the idea of usually getting the correct result in
    N cycles, except that in a small number of cases the trailing bits
    would be too close to a rounding boundary, and those cases would take
    one more NR iteration.

    I just realized that the code I wrote to fix Pentium FDIV could have
    been even more efficient on a proper superscalar OoO CPU:

    Start the FDIV immediately, then at the same time do the divisor
    mantissa inspection to determine if the workaround would be needed (5
    out of 1024 cases), and only if that happens, start the slower path that
    takes up to twice as long.

    The idea is that for 99.5% of all divisors, the only cost would be a
    correctly predicted branch costing close to zero cycles, while the
    remaining 0.5% would require two FDIV operations, so 80 instead of 40
    cycles.

    OTOH, that same Big OoO core can probably predict that the entire
    mantissa inspection part will end up with a "skip the workaround" branch
    and start the FDIV almost at once. I'm assuming that when the mispredict
    turns up, the core can stop a long operation like FDIV more or less immediately and discard the current status.

    (From memory)

    #include <stdint.h>
    #include <string.h>

    /* Bitmap with one bit set for each of the 5-in-1024 at-risk divisor
       mantissa patterns (layout assumed: one bit per 10-bit pattern). */
    extern const uint8_t fdiv_table[128];

    double fdiv(double a, double b)
    {
        uint64_t mant10;
        memcpy(&mant10, &b, sizeof(b));
        mant10 = (mant10 >> 42) & 1023;  /* top 10 explicit mantissa bits */
        if (fdiv_table[mant10 >> 3] & (1 << (mant10 & 7))) {
            /* Set the FPU to extended/long double precision here,
               saving the previous mode... */
            b *= 15.0/16.0; /* Exact operation! */
            a *= 15.0/16.0; /* Exact operation! */
            /* ...and restore the previous precision mode here. */
        }
        return a / b;
    }

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Terje Mathisen on Wed Aug 27 14:43:20 2025
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    They don't even need to be full 8-bit: With a tiny amount of logic to
    handle the signs you are already down to 128x128, right?

    With exponential representation, say with base 2^(1/4) (range
    0.000018-55109 for exponents -63 to +63, and factor 1.19 between
    consecutive numbers), if the absolutely smaller number is smaller by a
    fixed amount in exponential representation (14 for our base 2^(1/4)
    numbers), adding or subtracting it won't make a difference. Which
    means that we need a 14*15/2=105 entry table (with 8-bit results) for
    addition and a table with the same size for subtraction, and a little
    logic for handling the cases where the numbers are too different or 0,
    or, if supported, +/-Inf or NaN (which reduce the range a little).

    If you want such a representation with finer-grained resolution, you
    get a smaller range and need larger tables. E.g., if you want to have
    a granularity as good as the minimum granularity of FP with 3-bit
    mantissa (with hidden 1), i.e., 1.125, you need 2^(1/6), with a
    granularity of 1.122 and a range 0.00069-1448; adding or subtracting a
    number where the representation is 25 smaller makes no difference, so
    the table sizes are 25*26/2=325 entries. Still looks relatively
    cheap.
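
    A sketch of the addition in C (n = 4 for the base 2^(1/4) variant,
    n = 6 for the finer one; the correction is computed on the fly here
    rather than from the table, and clamping to the exponent range is
    omitted):

    #include <math.h>

    /* A magnitude is stored as an integer exponent e, value = 2^(e/n). */
    static int exp_add(int ea, int eb, int n)
    {
        if (eb > ea) { int t = ea; ea = eb; eb = t; } /* ea is the larger */
        int d = ea - eb;
        /* correction = round(n * log2(1 + 2^(-d/n))); it is 0 for
           d >= 14 (n = 4) and d >= 25 (n = 6), matching the cutoffs
           above. */
        int corr = (int)lround(n * log2(1.0 + exp2(-(double)d / n)));
        return ea + corr;
    }

    Subtraction is analogous, with log2(1 - 2^(-d/n)) and special
    handling for d = 0, where the result is exactly zero.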

    Why are people going for something FP-like instead of exponential
    if the number of bits is so small?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to Anton Ertl on Wed Aug 27 15:29:40 2025
    On Wed, 27 Aug 2025 14:43:20 GMT, anton@mips.complang.tuwien.ac.at
    (Anton Ertl) wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    They don't even need to be full 8-bit: With a tiny amount of logic to
    handle the signs you are already down to 128x128, right?

    With exponential representation, say with base 2^(1/4) (range
    0.000018-55109 for exponents -63 to +63, and factor 1.19 between
    consecutive numbers), if the absolutely smaller number is smaller by a
    fixed amount in exponential representation (14 for our base 2^(1/4)
    numbers), adding or subtracting it won't make a difference. Which
    means that we need a 14*15/2=105 entry table (with 8-bit results) for
    addition and a table with the same size for subtraction, and a little
    logic for handling the cases where the numbers are too different or 0,
    or, if supported, +/-Inf or NaN (which reduce the range a little).

    If you want such a representation with finer-grained resolution, you
    get a smaller range and need larger tables. E.g., if you want to have
    a granularity as good as the minimum granularity of FP with 3-bit
    mantissa (with hidden 1), i.e., 1.125, you need 2^(1/6), with a
    granularity of 1.122 and a range 0.00069-1448; adding or subtracting a
    number where the representation is 25 smaller makes no difference, so
    the table sizes are 25*26/2=325 entries. Still looks relatively
    cheap.

    Why are people going for something FP-like instead of exponential
    if the number of bits is so small?

    - anton

    Excellent question. Wish I had an answer.

    Given that the use case almost invariably is NN, the only interesting
    values are [or should be] fractions in the range 0 to 1. Little/no
    need for floating point.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)