Well, the idea here is that sometimes one wants to be able to do
floating-point math where accuracy is a very low priority.
Say, the sort of stuff people might use FP8 or BF16 or maybe Binary16
for (though what I am thinking of here is low precision even by
Binary16 standards).
But I will use Binary16 and BF16 as the example formats.
So, one can note that some ops can be approximated with a modified
integer ADD/SUB on the raw bit patterns A and B of the inputs a and b
(excluding sign-bit handling):
a*b    : A+B-0x3C00 (0x3F80 for BF16)
a/b    : A-B+0x3C00
sqrt(a): (A>>1)+0x1E00
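A minimal C sketch of the above (my own naming; it operates on the raw
Binary16 bit pattern with the sign bit assumed clear, as in the
formulas):

#include <stdint.h>

typedef uint16_t fp16_bits;   /* raw Binary16 encoding, sign bit assumed 0 */

#define F16_ONE 0x3C00u       /* bit pattern of 1.0 (0x3F80 for BF16) */

/* a*b: add the encodings, subtract the 1.0 pattern so the exponent
   bias isn't counted twice */
static fp16_bits approx_mul(fp16_bits a, fp16_bits b)
{
    return (fp16_bits)(a + b - F16_ONE);
}

/* a/b: subtract the encodings, add the 1.0 pattern back */
static fp16_bits approx_div(fp16_bits a, fp16_bits b)
{
    return (fp16_bits)(a - b + F16_ONE);
}

/* sqrt(a): halve the encoding, add back half the 1.0 pattern
   (0x1E00 = 0x3C00>>1; presumably 0x1FC0 for BF16) */
static fp16_bits approx_sqrt(fp16_bits a)
{
    return (fp16_bits)((a >> 1) + 0x1E00u);
}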
The harder ones, though, are ADD/SUB.
A partial ADD seems to be:
a+b: A+((B-A)>>1)+0x0400
But this simple case does not seem to hold up when doing a subtract,
or when A and B are far apart.
So it would appear that either there is a 4th term or the bias is
variable (depending on the B-A term, and on whether it is ADD or SUB).
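In C, the partial ADD would look something like this (again my sketch,
reusing the fp16_bits type from above; as noted, it only behaves when
the two encodings are close, since for far-apart inputs the result
should just degenerate to the larger operand):

static fp16_bits approx_add(fp16_bits a, fp16_bits b)
{
    int32_t d = (int32_t)b - (int32_t)a;   /* signed difference of encodings */
    return (fp16_bits)((int32_t)a + (d >> 1) + 0x0400);
}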
It seems like the high bits (exponent and operator) could be used to
drive a lookup table, but this is lame. The magic bias appears to have
non-linear properties, so it isn't as easily represented with basic
integer operations.
Then again, probably other people know about all of this and might know
what I am missing.
Addition and subtraction required a lookup table. But because the two
numbers involved needed to be not too far apart in magnitude for the
operation to do anything, the required lookup tables were shorter than
the ones that would be needed for normally represented numbers, where
it is multiplication and division that require the lookup tables.
BGB <cr88192@gmail.com> posted:
> Well, idea here is that sometimes one wants to be able to do
> floating-point math where accuracy is a very low priority.
> Say, the sort of stuff people might use FP8 or BF16 or maybe Binary16
> for (though, what I am thinking of here is low-precision even by
> Binary16 standards).
For 8-bit stuff, just use 5 memory tables [256×256]
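A sketch of what that table-driven approach could look like in C (the
table names and the exact set of operations are my guesses; the idea
is just one flat byte table per operation, indexed by the two raw FP8
encodings and filled offline with correctly rounded results):

#include <stdint.h>

/* each table is 256*256 = 64 KiB, filled offline by decoding the two
   FP8 operands, doing the math exactly, and re-encoding the result */
static uint8_t add_tab[256][256];
static uint8_t sub_tab[256][256];
static uint8_t mul_tab[256][256];
static uint8_t div_tab[256][256];

static inline uint8_t fp8_add(uint8_t a, uint8_t b) { return add_tab[a][b]; }
static inline uint8_t fp8_sub(uint8_t a, uint8_t b) { return sub_tab[a][b]; }
static inline uint8_t fp8_mul(uint8_t a, uint8_t b) { return mul_tab[a][b]; }
static inline uint8_t fp8_div(uint8_t a, uint8_t b) { return div_tab[a][b]; }

Stripping the sign bits first, as suggested below, cuts each table to
128x128.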
I still recommend getting the right answer over getting a close but wrong answer a couple cycles earlier.
They don't even need to be full 8-bit: with a tiny amount of logic to
handle the signs you are already down to 128x128, right?
Terje Mathisen <terje.mathisen@tmsw.no> writes:
> They don't even need to be full 8-bit: with a tiny amount of logic
> to handle the signs you are already down to 128x128, right?
With exponential representation, say with base 2^(1/4) (range
0.000018-55109 for exponents -63 to +63, and a factor of 1.19 between
consecutive numbers), if the number that is smaller in absolute value
is smaller by at least a fixed amount in the exponential
representation (14 for our base-2^(1/4) numbers), adding or
subtracting it won't make a difference. Which means that we need a
14*15/2=105 entry table (with 8-bit results) for addition and a table
with the same size for subtraction, and a little logic for handling
the cases where the numbers are too different or 0, or, if supported,
+/-Inf or NaN (which reduce the range a little).
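A C sketch of that mechanism, under my own framing: I index the
correction table by the exponent difference alone (the usual
Gaussian-logarithm trick), which makes it even smaller than the
pair-indexed count above; zero, signs, and Inf/NaN handling are left
out.

#include <math.h>
#include <stdint.h>

#define LNS_STEPS_PER_OCTAVE 4    /* base r = 2^(1/4) */
#define LNS_NO_DIFF          14   /* difference >= 14: smaller operand vanishes */

static int8_t add_corr[LNS_NO_DIFF];  /* round(log_r(1 + r^-d)), d = 0..13 */
static int8_t sub_corr[LNS_NO_DIFF];  /* round(log_r(1 - r^-d)), d = 1..13 */

static void lns_init(void)
{
    for (int d = 0; d < LNS_NO_DIFF; d++) {
        double x = pow(2.0, -(double)d / LNS_STEPS_PER_OCTAVE);
        add_corr[d] = (int8_t)lround(log2(1.0 + x) * LNS_STEPS_PER_OCTAVE);
        if (d > 0)   /* d == 0 would be exact cancellation to zero */
            sub_corr[d] = (int8_t)lround(log2(1.0 - x) * LNS_STEPS_PER_OCTAVE);
    }
}

/* r^ea + r^eb for positive operands; the result is again an exponent */
static int lns_add(int ea, int eb)
{
    int big = ea > eb ? ea : eb;
    int d   = ea > eb ? ea - eb : eb - ea;
    return d >= LNS_NO_DIFF ? big : big + add_corr[d];
}

/* subtraction has the same shape, using sub_corr and special-casing d == 0 */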
If you want such a representation with finer-grained resolution, you
get a smaller range and need larger tables. E.g., if you want to have
a granularity as good as the minimum granularity of FP with 3-bit
mantissa (with hidden 1), i.e., 1.125, you need 2^(1/6), with a
granularity of 1.122 and a range 0.00069-1448; adding or subtracting a
number whose representation is at least 25 smaller makes no difference, so
the table sizes are 25*26/2=325 entries. Still looks relatively
cheap.
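For what it's worth, the cutoffs and entry counts above drop out
mechanically; a tiny throwaway C check (my own) for base 2^(1/n):

#include <math.h>
#include <stdio.h>

/* the smaller operand stops mattering once log_r(1 + r^-d) rounds to 0,
   i.e. once r^-d < r^(1/2) - 1, with r = 2^(1/n) */
static void lns_table_size(int n)
{
    int d = (int)ceil(-n * log2(pow(2.0, 0.5 / n) - 1.0));
    printf("base 2^(1/%d): cutoff %d, %d entries\n", n, d, d * (d + 1) / 2);
}

int main(void)
{
    lns_table_size(4);   /* base 2^(1/4): cutoff 14, 105 entries */
    lns_table_size(6);   /* base 2^(1/6): cutoff 25, 325 entries */
    return 0;
}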
Why are people going for something FP-like instead of exponential
if the number of bits is so small?
- anton