The Concertina II ISA, as it stands now, has a
very limited number of opcodes for load and
store memory-reference operations, but a huge
number of opcodes for register-to-register
operate instructions.
This makes it suitable for an implementation where:
Physical memory is some power-of-two multiple of
96 bits in width, for example, 192 bits or 384 bits
wide;
A simple hardware divide-by-three circuit allows
accessing memory normally with the RAM module width
of 64 bits as the fundamental unit;
Applying no conversion allows 24 bits to be the
fundamental unit;
An even simpler _multiply_ by five circuit allows
accessing memory with a 60-bit word as the
fundamental unit.
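To make the multiply-by-five point concrete, here is a minimal
sketch of one plausible reading of that circuit; the 12-bit granule
and the 96-bit row are inferences from the figures above
(60 = 5*12, 96 = 8*12), not a statement of the actual Concertina II
hardware.

#include <stdint.h>

/* Locate a 60-bit word in 96-bit-wide memory. Multiplying the word
   index by five (a shift plus an add) gives its offset in 12-bit
   granules; since a row holds eight granules, the row number and
   starting granule then fall out with power-of-two shifts and masks. */
void locate_word60(uint64_t index, uint64_t *row, unsigned *granule)
{
    uint64_t g = (index << 2) + index;   /* index * 5, shift-and-add  */
    *row     = g >> 3;                   /* 8 granules per 96-bit row */
    *granule = (unsigned)(g & 7);        /* starting granule in row   */
}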
So it's possible to take an implementation of the
architecture with three-channel memory, and add the
feature of allocating data memory blocks that look
like they're made up of 48-bit words or 60-bit words.
When a memory-reference instruction refers to an
address which is inside such a variant block,
its operation would be modified to handle a data
type of an appropriate width for that kind of memory.
Thus:
In 60-bit memory,
Load Floating would be an invalid operation;
Load Double would load a 60-bit floating-point number;
Load Medium would load a 45-bit floating-point number, as
such a float would still be wide enough to be useful;
Load Byte would be an invalid operation;
Load Halfword would load a 15-bit integer;
Load would load a 30-bit integer;
Load Long would load a 60-bit integer.
In 36-bit memory,
Load Floating would load a 36-bit floating-point number;
Load Double would load a 72-bit floating-point number;
Load Medium would load a 54-bit floating-point number;
Load Byte would load a 9-bit byte;
Load Halfword would load an 18-bit halfword integer;
Load would load a 36-bit integer;
Load Long would load a 72-bit integer into a _pair_
of registers, the first of which would be an even-numbered
register, as the registers are only 64 bits long.
In 24-bit memory,
Load Floating would be an invalid operation;
Load Double would load a 72-bit floating-point number;
Load Medium would load a 48-bit floating-point number;
Load Byte would load a 6-bit character;
Load Halfword would load a 12-bit integer;
Load would load a 24-bit integer;
Load Long would load a 48-bit integer.
Note that the relationship between floating-point
widths and integer widths in instructions is offset
by a factor of two when 24-bit memory is referenced.
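Gathering the three variant tables above into one place, here is a
hedged sketch of the width selection as a lookup table; the MEM64
row is an assumed baseline (the post does not restate the standard
widths), the other rows are taken from the text, and 0 marks the
invalid combinations.

/* Operand width in bits chosen by (memory kind, load opcode). */
enum mem_kind { MEM64, MEM60, MEM36, MEM24 };
enum load_op  { LD_BYTE, LD_HALF, LD_WORD, LD_LONG,
                LD_FLOAT, LD_MEDIUM, LD_DOUBLE };

static const int width_bits[4][7] = {
    /*            byte half word long float medium double */
    /* MEM64 */ {    8,  16,  32,  64,   32,    48,    64 },  /* assumed */
    /* MEM60 */ {    0,  15,  30,  60,    0,    45,    60 },
    /* MEM36 */ {    9,  18,  36,  72,   36,    54,    72 },
    /* MEM24 */ {    6,  12,  24,  48,    0,    48,    72 },
};

The factor-of-two offset just noted shows up in the MEM24 row, where
Load Double (72 bits) no longer matches Load Long (48 bits).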
And there would be separate opcodes for doing arithmetic
on numbers of these different lengths in the registers.
This is all very well. But in program code, which, as
it is so far specified, can _only_ reside in 64-bit memory,
what about pseudo-immediate values?
Since the instruction format that references pseudo-immediate
values is that of an operate instruction, pseudo-immediates
_could_ be allowed, with considerable wasted space in most
cases, in operate instructions for the data types associated
with variant memory widths.
However, I have thought of a way to allow the CPU to treat
the different memory widths with greater equality.
Instead of devising a whole new ISA for each of the other
memory widths, the *existing* ISA, designed around standard
memory built up from the 64-bit unit, could be given
_alternative representations_ wherein programs could be
stored in memory of other widths, using the same ISA with
minimum modifications.
The 36-bit Representation
In the standard 64-bit representation, a block may begin
with 1100, followed by a 2-bit prefix field for each of the
fourteen 16-bit halves of the remaining seven 32-bit
instruction slots in the block.
So, make that portion of the ISA the only one supported in
the 36-bit representation, with a block now being 288 bits
long, made up of eight 36-bit words; each 36-bit word holds
two units consisting of a prefix field followed by 16
instruction bits, so that pattern repeats sixteen times in
the block.
But it must still be possible to have block headers if one is
to have pseudo-immediates.
So the first 36-bit instruction slot of a block, if it is to
be a header, would start with 1011100011, and the prefix field
portion of the second 18 bits of that instruction slot would
contain 11. 1011100011 consists of 10 (32-bit instruction)
and 11100011 (header prefix in 32-bit mode).
So the ISA is modified because in 64-bit memory, only seven
instruction slots, not all eight, can contain 36-bit instructions,
and there are no headers inside 36-bit instructions. But the
modifications are minimal, for the purpose of adapting programs
to the different size of memory, with that size being native
for them.
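As one concrete reading of that layout (the ordering of the two
units within a word is an assumption made for illustration), a short
sketch splitting a 36-bit word into its two prefix-plus-parcel units:

#include <stdint.h>

/* w holds one 36-bit word in its low 36 bits: two units of a 2-bit
   prefix field followed by 16 instruction bits, first unit in the
   high-order position. */
void unpack_word36(uint64_t w, unsigned prefix[2], uint16_t parcel[2])
{
    prefix[0] = (unsigned)((w >> 34) & 0x3);
    parcel[0] = (uint16_t)((w >> 18) & 0xFFFF);
    prefix[1] = (unsigned)((w >> 16) & 0x3);
    parcel[1] = (uint16_t)(w & 0xFFFF);
}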
The 24-bit Representation
Since a 24-bit word contains three 8-bit bytes, here, the
form of instructions isn't modified at all. But blocks shrink
from eight instruction slots that are 32 bits in length to
eight 24-bit words, so a block would be 192 bits long.
The amount of wasted space for data in 24-bit types wouldn't
be changed much in many cases by using this representation
instead of the 32-bit representation for program code,
because the clash between instruction widths and data widths
wouldn't be affected. However, _some_ improvement would be
achieved, because now if there are multiple pseudo-immediate
values used in a single block, the 24-bit data items could
at least be *packed* correctly in the block.
The 60-bit Representation
Here, a block is 480 bits long.
It consists of two 240-bit sub-blocks, in the following form:
One 15-bit halfword containing 15 prefix bits,
and fifteen 15-bit halfwords to which those bits apply,
turning them into 16-bit halfwords, which are used to form
instructions.
The blocks are aligned on 480-bit boundaries. When code
consists of 32-bit instructions, the first sub-block begins
with an instruction, and the second sub-block begins with the
second half of an instruction.
The first whole 32-bit instruction in a sub-block can be a
header, and so one has a header applying to seven subsequent
instruction slots in the first sub-block, and a header applying
to six subsequent instruction slots in the second sub-block.
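A hedged sketch of unpacking one such 240-bit sub-block, presented
here as sixteen 15-bit halves; placing the prefix bit as the top bit
of each resulting 16-bit parcel, and pairing prefix bit 14 with the
first halfword, are assumptions for illustration.

#include <stdint.h>

/* halves[0] carries the 15 prefix bits, halves[1..15] the 15-bit
   halfwords; each prefix bit turns its halfword into a 16-bit parcel. */
void unpack_subblock60(const uint16_t halves[16], uint16_t parcel[15])
{
    for (int i = 0; i < 15; i++) {
        uint16_t p = (halves[0] >> (14 - i)) & 1;
        parcel[i] = (uint16_t)((p << 15) | (halves[i + 1] & 0x7FFF));
    }
}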
John Savard
On Fri, 24 Nov 2023 18:48:53 +0000, MitchAlsup wrote:
I have not found a need to have LD/ST floating point instructions
because I don't have FP register file(s) {or SIMD files}. Why do you
seem to want/need a file dedicated to floating point values ??
This saves a huge chunk of OpCode Space.......
On Fri, 24 Nov 2023 04:15:57 +0000, Quadibloc wrote:
In 24-bit memory,
Load Floating would be an invalid operation Load Double would load a
72-bit floating-point number;
Load Medium would load a 48-bit floating-point number;
Alternatively, it would perhaps be simpler to do it this way:
Load Floating would load a 48-bit floating-point number;
Load Medium would load a 72-bit floating-point number.
This would prioritize structure over function, but only the compiler
would ever know...
An even simpler _multiply_ by five circuit allows accessing memory with
a 60-bit word as the fundamental unit.
Instead of devising a whole new ISA for each of the other memory widths,
the *existing* ISA, designed around standard memory built up from the
64-bit unit, could be given _alternative representations_ wherein
programs could be stored in memory of other widths, using the same ISA
with minimum modifications.
On Fri, 24 Nov 2023 18:48:53 +0000, MitchAlsup wrote:
I have not found a need to have LD/ST floating point instructions
because I don't have FP register file(s) {or SIMD files}. Why do you
seem to want/need a file dedicated to floating point values ??
This saves a huge chunk of OpCode Space.......
Well, for one thing, if the integer functional unit is given one register
file, and the floating-point functional unit its own register file, then
those two functional units are not constrained to be adjacent.
I can stick the Decimal Floating-Point functional unit on the other side
of the floating register file, and the Simple Floating functional unit,
or the Register Packed Decimal functional unit, on the other side of the integer register file.
But of course my main reason is that this is the way everyone (i.e.
the IBM System/360) does it, so of course I should do it that way!
But having an excess of data types is at least a helpful excuse...
John Savard
On Fri, 24 Nov 2023 22:38:32 +0000, Quadibloc wrote:
On Fri, 24 Nov 2023 18:48:53 +0000, MitchAlsup wrote:
I have not found a need to have LD/ST floating point instructions
because I don't have FP register file(s) {or SIMD files}. Why do you
seem to want/need a file dedicated to floating point values ??
This saves a huge chunk of OpCode Space.......
Also, IEEE 754 calls for a hidden first bit and
gradual underflow, and these things make floating-point
arithmetic slightly slower.
Hence, I deal with them at load and store time, by using
the load and store floating point instructions to convert
to and from an *internal representation* of floats. That
in itself precludes using the integer instructions for them.
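As an illustration of that load-time conversion, here is a sketch
that makes the hidden bit explicit and normalizes binary64
subnormals into a wider exponent; the internal format's layout is
invented for the example (infinities and NaNs elided), not
Quadibloc's actual representation.

#include <stdint.h>

/* Hypothetical internal float: explicit leading significand bit and
   a widened exponent, so gradual underflow disappears inside the core. */
typedef struct {
    uint64_t sig;    /* significand, leading bit explicit at bit 52 */
    int32_t  exp;    /* unbiased, widened exponent */
    uint8_t  sign;
} internal_fp;

internal_fp fp_load(uint64_t bits)   /* bits of an IEEE 754 binary64 */
{
    internal_fp r;
    r.sign = (uint8_t)(bits >> 63);
    int32_t  e = (int32_t)((bits >> 52) & 0x7FF); /* e == 0x7FF (Inf/NaN) elided */
    uint64_t m = bits & 0x000FFFFFFFFFFFFFull;
    if (e == 0) {                         /* zero or subnormal */
        r.exp = -1022;
        r.sig = m;
        while (r.sig && !(r.sig >> 52)) { /* normalize a subnormal */
            r.sig <<= 1;
            r.exp--;
        }
    } else {                              /* normal number */
        r.exp = e - 1023;
        r.sig = m | (1ull << 52);         /* make the hidden bit explicit */
    }
    return r;
}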
John Savard
On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:
But Integer Multiply and Divide can share the FPU that does these.
But giving each one its own multiplier means more superscalar
goodness!
But you _do_ have a good point. This kind of "superscalar goodness"
is usually wasted, because most programs don't do an equal amount
of integer and FP arithmetic. So it would be way better to have
twice as many multipliers and dividers that are designed to be used
either way than having units of two kinds.
I'm assuming, though, that one can make the multipliers and dividers
simpler, hence faster, by making them single-use, and that it's
possible to know, for a given use case, what the ratio between integer
and FP operations will be.
After all, integer multiplication discards
the MSBs of the result, and floating-point multiplication discards the
LSBs of the result.
And we have the need to use FP values in deciding to take branches.
What gets used in taking branches is the _condition codes_. They're
conveyed from the integer and floating point functional units, as well
as all the other functional units, to a central place (the program
status word).
John Savard
On Mon, 27 Nov 2023 09:17:59 +0000, Anton Ertl wrote:
That debate has been held, and MIPS has hardware integer divide, Alpha
and IA-64 don't have a hardware integer divide; they both have FP divide
instructions.
I didn't know this about the Itanium. All I remembered hearing was that
the Itanium "didn't have a divide instruction", and so I didn't realize
this applied to fixed-point arithmetic only.
BGB <cr88192@gmail.com> writes:
On 11/26/2023 7:29 PM, Quadibloc wrote:
On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:
But Integer Multiply and Divide can share the FPU that does these.
But giving each one its own multiplier means more superscalar
goodness!
Having two multipliers that serve both purposes means even more
superscalar goodness for similar area cost. However, there is the
issue of latency. The Willamette ships the integers over to the FPU
for multiplication, and the result back, crossing several clock
domains (at one clock loss per domain crossing), resulting in a
10-cycle latency for integer multiplication. I think that these days
every high-performance core with real silicon invests into separate
GPR and FP/SIMD (including integer SIMD) multipliers.
In most code, FPU ops are comparably sparse
In terms of executed ops, that depends very much on the code. GP
cores have acquired SIMD cores primarily for FP ops, as can be seen by
both SSE and supporting only FP at first, and only later adding
integer stuff, because it cost little extra. Plus, we have added GPUs
that are now capable of doing huge amounts of FP ops, with uses in
graphics rendering, HPC and AI.
Otherwise, one can debate whether or not having DIV/MOD in hardware
makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV is
at least "probably" faster than a generic software-only solution).
That debate has been held, and MIPS has hardware integer divide, Alpha
and IA-64 don't have a hardware integer divide; they both have FP
divide instructions.
However, looking at more recent architectures, the RISC-V M extension
(which is part of RV64G and RV32G, i.e., a standard extension) has not
just multiply instructions (MUL, MULH, MULHU, MULHSU, MULW), but also
integer divide instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW,
and REMUW. ARM A64 also has divide instructions (SDIV, UDIV), but
RISC-V seems significant to me because there the philosophy seems to
be to go for minimalism. So the debate has apparently come to the
conclusion that for general-purpose architectures, you include an
integer divide instruction.
For cases like divide-by-constant, also, typically it is possible to
turn it in the compiler into a multiply by reciprocal.
But for that you want a multiply instruction, which in the RISC-V case
means including the M extension, which also includes divide
instructions. Multiplying by the reciprocal may still be faster.
- anton
Quadibloc <quadibloc@servername.invalid> writes:
On Mon, 27 Nov 2023 09:17:59 +0000, Anton Ertl wrote:
That debate has been held, and MIPS has hardware integer divide, Alpha
and IA-64 don't have a hardware integer divide; they both have FP divide
instructions.
I didn't know this about the Itanium. All I remembered hearing was that
the Itanium "didn't have a divide instruction", and so I didn't realize
this applied to fixed-point arithmetic only.
You are right, I was wrong. <http://gec.di.uminho.pt/discip/minf/ac0203/icca03/ia64fpbf1.pdf>
says:
|A number of floating-point operations defined by the IEEE Standard are
|deferred to software by the IA-64 architecture in all its
|implementations
|
|* floating-point divide (integer divide, which is based on the
| floating-point divide operation, is also deferred to software)
The paper goes on to describe that FP division a/b is based on
determining 1/b through Newton-Raphson approximation, using the FMA instruction, and then multiplying with a. It shows a sequence of 13 instructions for double precision, 1 frcpa and 12 fma or fnma
instructions.
Given that integer division is based on FP division, it probably takes
even more instructions.
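For reference, the shape of that FMA-based scheme in C; this is only
a sketch of the refinement, not the IA-64 sequence itself: the seed
r0 stands in for frcpa's table lookup, and a correctly rounded
quotient needs more careful final rounding than shown.

#include <math.h>

/* a/b via Newton-Raphson on 1/b. Each step roughly doubles the
   number of correct bits, so an ~8-bit seed reaches double precision
   in three steps; a final fma-based step tightens the quotient. */
double nr_div(double a, double b, double r0)
{
    double r = r0;                   /* seed reciprocal, ~8 bits */
    for (int i = 0; i < 3; i++) {
        double e = fma(-b, r, 1.0);  /* e = 1 - b*r */
        r = fma(r, e, r);            /* r += r*e */
    }
    double q = a * r;
    double rem = fma(-b, q, a);      /* residual a - b*q */
    return fma(rem, r, q);           /* one correction step */
}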
Meanwhile, a lowly Cortex-A53 with its divide instruction produces an
integer division result with a small quotient in a few cycles. In
other words, if you want to use that method for division, still
provide a division instruction, and then implement it with that
method. This provides the flexibility to switch to some other method
(e.g., the one used by ARM) later.
- anton
Anton Ertl wrote:
<http://gec.di.uminho.pt/discip/minf/ac0203/icca03/ia64fpbf1.pdf>
says:
|A number of floating-point operations defined by the IEEE Standard are
|deferred to software by the IA-64 architecture in all its
|implementations
|
|* floating-point divide (integer divide, which is based on the
| floating-point divide operation, is also deferred to software)
The paper goes on to describe that FP division a/b is based on
determining 1/b through Newton-Raphson approximation, using the FMA
instruction, and then multiplying with a. It shows a sequence of 13
instructions for double precision, 1 frcpa and 12 fma or fnma
instructions.
Given that integer division is based on FP division, it probably takes
even more instructions.
You have to synthesize the bits between 63 and 53 to get 64/64->64.
Several forms of integer DIV at least signed and unsigned.....
BGB wrote:
On 11/26/2023 7:29 PM, Quadibloc wrote:
On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:
But Integer Multiply and Divide can share the FPU that does these.
But giving each one its own multiplier means more superscalar
goodness!
But you _do_ have a good point. This kind of "superscalar goodness"
is usually wasted, because most programs don't do an equal amount
of integer and FP arithmetic. So it would be way better to have
twice as many multipliers and dividers that are designed to be used
either way than having units of two kinds.
I'm assuming, though, that one can make the multipliers and dividers
simpler, hence faster, by making them single-use, and that it's
possible to know, for a given use case, what the ratio between integer
and FP operations will be. After all, integer multiplication discards
the MSBs of the result, and floating-point multiplication discards the
LSBs of the result.
If one wants to have the most superscalar goodness, for general case
code, this means mostly lots of simple ALU ops (particularly,
ADD/SUB/AND/OR/SHIFT), and ability to do *lots* of memory Load/Store ops.
Where, say, Load/Store ops and simple ALU ops tend to dominate over
pretty much everything else in terms of usage frequency.
LD+ST 33%
Int 45%
In most code, FPU ops are comparably sparse,
FP 9%
as is MUL,
MUL 2%
and DIV/MOD is
DIV 0.3%
fairly rare in comparison to either. MUL is common enough that you
want it to be fast when it happens, but not so common to where one is
dealing with piles of them all at the same time (except maybe in edge
cases, like the DCT transform in JPEG/MPEG).
MUL is much more common in FORTRAN than in C-like languages as it forms
the underlying math for multidimensional arrays, and more ×s exist in typical number crunching code (F) than in pointer chasing (C) languages.
If FMUL and MUL can't be co issued, this is unlikely to matter.
Since both are pipelined, this is a small stall {{Unlike MIPS R2000 where IMUL was 12 cycles blocking while FMUL was 3 (or was it 4) pipelined.}}
DIV is uncommon enough that it can be 80 or 100 cycles or similar, and
most code will not notice (except in rasterizers or similar, which
actually need fast DIV, but can cheat in that they don't usually need
exact DIV and the divisors are small, allowing for "shortcuts", such
as using lookup tables of reciprocals or similar).
We know that FDIV can be performed in 17 cycles in the FMUL unit, and
this has been known since 1989......why are you making it so long ??
0.3% IDIV × 100 cycles and you have added 3 cycles to the average instruction !!!
Otherwise, one can debate whether or not having DIV/MOD in hardware
makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV
is at least "probably" faster than a generic software-only solution).
With an 8-bit start, and Newton-Raphson iteration, SW should be able
to do this in mid-30-cycles. {{See ITANIC FDIV SW profile}}
For cases like divide-by-constant, also, typically it is possible to
turn it in the compiler into a multiply by reciprocal.
And you should if it helps.
And we have the need to use FP values in deciding to take branches.
What gets used in taking branches is the _condition codes_. They're
conveyed from the integer and floating point functional units, as well
as all the other functional units, to a central place (the program
status word).
Yeah, better to not have this style of condition codes...
RISC-V, MIPS, Mc 88K, My 66000, have no condition codes......
Like, there are other ways of doing conditional branches...
"Compare to zero and branch" × {Signed, Unsigned, FP16, FP32, FP64}}
for 1 instruction FP branches
Compare and then branch on anything (2 inst::1 int, 1 FP) your heart
desires Branch {NaN, -Infinity, +Infinity, -DeNorm, +Denorm, -Normal, +Normal, Zero, -zero, +zero, < <= > >= == !=, 0 < x < y, 0 <=x < y, 0 <
x <= y, 0 <= x <= y, ... }
Can you even record all these states in a single condition code ??
John Savard
MitchAlsup <mitchalsup@aol.com> wrote:
...
0.3% IDIV × 100 cycles and you have added 3 cycles to the average
instruction !!!
Not really, more likely it would be a 0.33 average CPI bump, assuming
a 1.0 baseline CPI for non-DIV instructions. Still way too much to be ignored...
Still, it seems the incidence of the integer DIV/MOD/etc instructions
being used is somewhat rarer than this estimate in the code I was
measuring.
On Mon, 27 Nov 2023 15:54:03 +0000, Anton Ertl wrote:
Quadibloc <quadibloc@servername.invalid> writes:
On Mon, 27 Nov 2023 09:17:59 +0000, Anton Ertl wrote:
That debate has been held, and MIPS has hardware integer divide, Alpha
and IA-64 don't have a hardware integer divide; they both have FP
divide instructions.
I didn't know this about the Itanium. All I remembered hearing was that
the Itanium "didn't have a divide instruction", and so I didn't realize
this applied to fixed-point arithmetic only.
You are right, I was wrong.
<http://gec.di.uminho.pt/discip/minf/ac0203/icca03/ia64fpbf1.pdf> says:
|A number of floating-point operations defined by the IEEE Standard are
|deferred to software by the IA-64 architecture in all its
|implementations
|
|* floating-point divide (integer divide, which is based on the
| floating-point divide operation, is also deferred to software)
The paper goes on to describe that FP division a/b is based on
determining 1/b through Newton-Raphson approximation, using the FMA
instruction, and then multiplying with a. It shows a sequence of 13
instructions for double precision, 1 frcpa and 12 fma or fnma
instructions.
Given that integer division is based on FP division, it probably takes
even more instructions.
The fastest _hardware_ implementations of FP division are Goldschmidt
division and Newton-Raphson division, and they both make use of
multiplication hardware.
So using Newton-Raphson division to increase the width of an FP division result to perform 64-bit integer division won't be too hard.
Since the Wikipedia article didn't go into detail, I had to look further
to find out that you _were_ right about the DEC Alpha, however. It did
have floating divide but not integer divide. The 21264 added floating
square root.
John Savard
On 11/28/2023 11:00 AM, BGB wrote:
snip
Still, it seems the incidence of the integer DIV/MOD/etc instructions
being used is somewhat rarer than this estimate in the code I was
measuring.
Is this a result of your using mostly a few old games as your
"reference" code? i.e. your code base may be different from what gave
rise to the statistics mentioned by others.
Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
On 11/26/2023 7:29 PM, Quadibloc wrote:
On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:
But Integer Multiply and Divide can share the FPU that does these.
But giving each one its own multiplier means more superscalar
goodness!
Having two multipliers that serve both purposes means even more
superscalar goodness for similar area cost. However, there is the
issue of latency. The Willamette ships the integers over to the FPU
for multiplication, and the result back, crossing several clock
domains (at one clock loss per domain crossing), resulting in a
10-cycle latency for integer multiplication. I think that these days
every high-performance core with real silicon invests into separate
GPR and FP/SIMD (including integer SIMD) multipliers.
In most code, FPU ops are comparably sparse
In terms of executed ops, that depends very much on the code. GP
cores have acquired SIMD cores primarily for FP ops, as can be seen by
both SSE and supporting only FP at first, and only later adding
integer stuff, because it cost little extra. Plus, we have added GPUs
that are now capable of doing huge amounts of FP ops, with uses in
graphics rendering, HPC and AI.
Once SIMD gains Integer operations, the Multiplier has to be built to
do both, might as well use it for more things than just SIMD.
Otherwise, one can debate whether or not having DIV/MOD in hardware
makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV
is at least "probably" faster than a generic software-only solution).
That debate has been held, and MIPS has hardware integer divide, Alpha
and IA-64 don't have a hardware integer divide; they both have FP
divide instructions.
And all of these led fruitful long productive lives before taking over -------Oh wait !!
However, looking at more recent architectures, the RISC-V M extension
(which is part of RV64G and RV32G, i.e., a standard extension) has not
just multiply instructions (MUL, MULH, MULHU, MULHSU, MULW), but also
integer divide instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW,
and REMUW.
All of which are possible in My 66000 using operand sign control, S-bit,
and CARRY when you want 64×64->128 or 128/64->{64 quotient, 64 remainder}
ARM A64 also has divide instructions (SDIV, UDIV), but
RISC-V seems significant to me because there the philosophy seems to
be to go for minimalism. So the debate has apparently come to the
conclusion that for general-purpose architectures, you include an
integer divide instruction.
Several forms of integer DIV at least signed and unsigned.....
Have noted that these old games did have a nifty trick for calculating approximate distance, say:
dx=x0-x1;
dy=y0-y1;
adx=dx^(dx>>31);
ady=dy^(dy>>31);
if(ady>adx)
{ t=adx; adx=ady; ady=t; } //common
// { adx^=ady; ady^=adx; adx^=ady; } //sometimes
d=adx+(ady>>1);
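Wrapped up as a self-contained function for reference (the name is
mine, not from those games): dx^(dx>>31) is a one's-complement
absolute value, off by one for negative inputs, and max + min/2
overestimates the true Euclidean distance by at most roughly 12
percent.

#include <stdint.h>

int32_t approx_dist(int32_t x0, int32_t y0, int32_t x1, int32_t y1)
{
    int32_t dx = x0 - x1, dy = y0 - y1;
    int32_t adx = dx ^ (dx >> 31);   /* cheap |dx|, assumes arithmetic shift */
    int32_t ady = dy ^ (dy >> 31);
    int32_t hi = adx > ady ? adx : ady;
    int32_t lo = adx > ady ? ady : adx;
    return hi + (lo >> 1);           /* max + min/2 */
}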
64-bit multiply is drastically slower than 32-bit
multiply though.
On 2023-11-30 11:25 a.m., MitchAlsup wrote:
BGB wrote:
64-bit multiply is drastically slower than 32-bit
multiply though.
Should only be 1 gate of delay longer......
In FPGA land, using DSPs, a 24x16 multiply is single cycle, and 32x32 could
be single cycle if a lower fmax is acceptable, but 64x64 is best pipelined
over several cycles (6). Thor had a fast 24x16 multiply as it is good enough
for array address calcs in a lot of circumstances.
On 11/30/2023 10:23 AM, MitchAlsup wrote:
BGB wrote:
Have noted that these old games did have a nifty trick for calculating
approximate distance, say:
dx=x0-x1;
dy=y0-y1;
adx=dx^(dx>>31);
ady=dy^(dy>>31);
if(ady>adx)
{ t=adx; adx=ady; ady=t; } //common
// { adx^=ady; ady^=adx; adx^=ady; } //sometimes
d=adx+(ady>>1);
Why not::
dx=x0-x1;
dy=y0-y1;
adx=dx^(dx>>31);
ady=dy^(dy>>31);
if(ady>adx)
d=adx+(ady>>1);
else
d=ady+(adx>>1);
On 11/30/2023 2:27 PM, MitchAlsup wrote:
BGB wrote:
On 11/30/2023 10:23 AM, MitchAlsup wrote:
BGB wrote:
Have noted that these old games did have a nifty trick for
calculating approximate distance, say:
dx=x0-x1;
dy=y0-y1;
adx=dx^(dx>>31);
ady=dy^(dy>>31);
if(ady>adx)
{ t=adx; adx=ady; ady=t; } //common
// { adx^=ady; ady^=adx; adx^=ady; } //sometimes
d=adx+(ady>>1);
Why not::
dx=x0-x1;
dy=y0-y1;
adx=dx^(dx>>31);
ady=dy^(dy>>31);
if(ady>adx)
d=adx+(ady>>1);
else
d=ady+(adx>>1);
As long as ISA has max() and min():: Why not::
dx=x0-x1;
dy=y0-y1;
adx=dx^(dx>>31);
ady=dy^(dy>>31);
d = min( adx, ady ) + (max( adx, ady ) >> 1);
Yeah, I guess that could work as well for 2D...
Though, makes more sense to use the "__int32_min" intrinsic or similar, rather than the generic C library "min()"/"max()" macros.
#define min(a, b) (((a)<(b))?(a):(b))
#define max(a, b) (((a)>(b))?(a):(b))
Though, apparently the existence of these macros in stdlib.h is
controversial (they are apparently sort of a POSIX legacy feature, and
not part of the C standard).
Mostly, due to implementation issues in BGBCC, ?: needs an actual
branch, so is ironically actually worse for performance in this case
than using an explicit if/else...
A case could be made for C11 "_Generic", but would need to implement it.
But, yeah, apart from compiler issues, in theory the min/max part could
be 3 instructions.
If done in ASM, might be something like:
// R4..R7 (x0, y0, x1, y1)
SUB R4, R6, R16 | SUB R5, R7, R17
SHAD R16, -31, R18 | SHAD R17, -31, R19
XOR R16, R18, R16 | XOR R17, R19, R17
CMPGT R17, R16
CSELT R16, R17, R18 | CSELT R17, R16, R19
SHAD R19, -1, R19
ADD R18, R19, R2
I don't have actual MIN/MAX ops though.
Though, it seems RISC-V does have them in the Bitmanip extension.
Could probably add them in-theory (maybe along with FMIN/FMAX).
Could probably add them in-theory (maybe along with FMIN/FMAX).
I use the same FU to perform Int max() and FP max()--remember once you
special case out the problems, IEEE is a linearly ordered set of numbers.
Q+ has min3 / max3 for minimum or maximum of three values.
But only fmin / fmax for float.
On 11/30/2023 5:15 PM, MitchAlsup wrote:
BGB wrote:
On 11/30/2023 2:27 PM, MitchAlsup wrote:
BGB wrote:
On 11/30/2023 10:23 AM, MitchAlsup wrote:
BGB wrote:
Have noted that these old games did have a nifty trick for
calculating approximate distance, say:
dx=x0-x1;
dy=y0-y1;
adx=dx^(dx>>31);
ady=dy^(dy>>31);
if(ady>adx)
{ t=adx; adx=ady; ady=t; } //common
// { adx^=ady; ady^=adx; adx^=ady; } //sometimes
d=adx+(ady>>1);
Why not::
dx=x0-x1;
dy=y0-y1;
adx=dx^(dx>>31);
ady=dy^(dy>>31);
if(ady>adx)
d=adx+(ady>>1);
else
d=ady+(adx>>1);
As long as ISA has max() and min():: Why not::
dx=x0-x1;
dy=y0-y1;
adx=dx^(dx>>31);
ady=dy^(dy>>31);
d = min( adx, ady ) + (max( adx, ady ) >> 1);
Yeah, I guess that could work as well for 2D...
Though, makes more sense to use the "__int32_min" intrinsic or
similar, rather than the generic C library "min()"/"max()" macros.
#define min(a, b) (((a)<(b))?(a):(b))
#define max(a, b) (((a)>(b))?(a):(b))
max() and min() are single instructions in my ISA. On the FP side, they
even get NaNs put in the proper places, too.
Though, apparently the existence of these macros in stdlib.h is
controversial (they are apparently sort of a POSIX legacy feature, and
not part of the C standard).
It took Brian quite some time to recognize that crap and turn it all into
max() and min() instructions.
OK.
Mostly, due to implementation issues in BGBCC, ?: needs an actual
branch, so is ironically actually worse for performance in this case
than using an explicit if/else...
My point exactly--but perhaps you should spend the effort to make your
compiler recognize these as single branch-free instructions.
I guess, it could be possible in theory to use pattern recognition on
the ASTs to turn this into a call to an intrinsic. Might need to look
into this.
For the most part, hadn't really done much of any of this sort of
pattern recognition.
A case could be made for C11 "_Generic", but would need to implement it.
But, yeah, apart from compiler issues, in theory the min/max part
could be 3 instructions.
max() and min() are easy to perform in a single cycle in both int and FP
Theoretically, it can be done in two instructions, which is still faster
than what ?: generates currently.
Say:
z=(x<y)?x:y;
Compiling to something like, say:
CMPGT R9, R8
BF .L0
MOV R9, R10
BRA .L1
.L0:
MOV R8, R10
BRA .L1
.L1:
MOV R10, R11
Whereas, the normal "if()" is smart enough to turn it into predication,
but ?: needs to use a temporary and doesn't have a predication special case.
Well, and also predication needs to be handled partly in the frontend
(in the generated RIL3 bytecode), but is generally omitted if the output
is going to a RIL object (so, if compiling for a static library, it
always falls back to actual branches).
Partly this is in turn because my 3AC backend isn't really smart enough
to pull this off.
Putting effort into it wasn't really a high priority though, as ?: in C
seems to be relatively infrequently used.
But, as-is, one will get a 2-op sequence if they use the intrinsic.
If done in ASM, might be something like:
// R4..R7 (x0, y0, x1, y1)
SUB R4, R6, R16 | SUB R5, R7, R17
SHAD R16, -31, R18 | SHAD R17, -31, R19
XOR R16, R18, R16 | XOR R17, R19, R17
CMPGT R17, R16
CSELT R16, R17, R18 | CSELT R17, R16, R19
SHAD R19, -1, R19
ADD R18, R19, R2
And you are claiming this has an advantage over::
MAX R9,R8,R7
MIN R10,R8,R7
SRL R10,R10,#1
LEA R10,[R9,R10,#-1]
At least in so far as it doesn't depend on features that don't currently exist.
I don't have actual MIN/MAX ops though.
perhaps you should reconsider...
I didn't have them previously as I could express them in a 2-op
sequence, which seemed "good enough".
Though, this was before some more recent changes which (in the name of improving FPGA timing) effectively turned CMPxx and similar into 2 cycle
ops.
Though, MIN/MAX ops could possibly allow:
MIN Imm10u, Rn // Rn=(Rn<Imm)?Rn:Imm
Which could potentially allow implementing "__int32_clamp()" in 2 or 3 instructions, which is an improvement over needing 6 instructions (with pretty much all of these instructions ending up with a 2-cycle latency).
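For what it's worth, the generic branch-free C shape of that clamp,
assuming MIN and MAX each map to one instruction; "__int32_clamp" is
BGB's intrinsic name, and this is not BGBCC's actual codegen.

#include <stdint.h>

static inline int32_t clamp_i32(int32_t x, int32_t lo, int32_t hi)
{
    int32_t t = x < hi ? x : hi;   /* MIN x, hi */
    return t > lo ? t : lo;        /* MAX t, lo */
}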
Though, it seems RISC-V does have them in the Bitmanip extension.
Could probably add them in-theory (maybe along with FMIN/FMAX).
I use the same FU to perform Int max() and FP max()--remember once you
special case out the problems, IEEE is a linearly ordered set of numbers.
Yeah.
I was doing both CMP and FCMP cases via the ALU.
This is why likely both could be added at the same time, since the logic
for adding one would likely apply to the others.
While this is an "improvement", I do realize that this, along with even supporting multiple memory widths, even if I have found ways that it can technically be done, is sheer insanity.
I do have a rationale for going there. It's simple enough. It comes from
the x86.
Since the irresistible force of availability of applications means that
we are doomed to always have just *one* dominant ISA...
then, to minimize the terrible loss that this entails, that one dominant
ISA should be maximally flexible. To approach, as closely as possible,
the "good old days" where the PDP-10 and the CDC 6600 coexisted
alongside the PDP-11 and the IBM System/360.
This is the driving force behind my insane quest to design an ISA for a
CPU which has the ability to do something which no one seems to want.
Q+ has min3 / max3 for minimum or maximum of three values.
But only fmin / fmax for float.
BGB wrote:
Have noted that these old games did have a nifty trick for calculating
approximate distance, say:
dx=x0-x1;
dy=y0-y1;
adx=dx^(dx>>31);
ady=dy^(dy>>31);
if(ady>adx)
{ t=adx; adx=ady; ady=t; } //common
// { adx^=ady; ady^=adx; adx^=ady; } //sometimes
d=adx+(ady>>1);
Why not::
dx=x0-x1;
dy=y0-y1;
adx=dx^(dx>>31);
ady=dy^(dy>>31);
if(ady>adx)
d=adx+(ady>>1);
else d=ady+(adx>>1);
Robert Finch wrote:
For symmetry I would assume/like a median3 as well, so that min3,
Q+ has min3 / max3 for minimum or maximum of three values.
But only fmin / fmax for float.
median3, max3 would return a sorted list?
Terje
MitchAlsup wrote:
BGB wrote:
Have noted that these old games did have a nifty trick for calculating
approximate distance, say:
dx=x0-x1;
dy=y0-y1;
adx=dx^(dx>>31);
ady=dy^(dy>>31);
if(ady>adx)
{ t=adx; adx=ady; ady=t; } //common
// { adx^=ady; ady^=adx; adx^=ady; } //sometimes
d=adx+(ady>>1);
Why not::
dx=x0-x1;
dy=y0-y1;
adx=dx^(dx>>31);
ady=dy^(dy>>31);
if(ady>adx)
d=adx+(ady>>1);
else d=ady+(adx>>1);
Possibly due to wanting to not use more than one branch predictor entry?
Maybe because ady rarely was larger than adx?
Besides, your version has the two terms swapped. :-)
Terje
Terje Mathisen wrote:
Robert Finch wrote:
For symmetry I would assume/like a median3 as well, so that min3,
Q+ has min3 / max3 for minimum or maximum of three values.
But only fmin / fmax for float.
median3, max3 would return a sorted list?
That was my point 8-10 posts ago.
MitchAlsup wrote:
Terje Mathisen wrote:
Robert Finch wrote:
For symmetry I would assume/like a median3 as well, so that min3,
Q+ has min3 / max3 for minimum or maximum of three values.
But only fmin / fmax for float.
median3, max3 would return a sorted list?
That was my point 8-10 posts ago.
I saw that a bit later: I've been on a sailing trip in the Caribbean
with very sporadic network connections, basically just when in reach of
a French island.
Terje
Terje Mathisen wrote:
MitchAlsup wrote:
Terje Mathisen wrote:
Robert Finch wrote:
For symmetry I would assume/like a median3 as well, so that min3,
Q+ has min3 / max3 for minimum or maximum of three values.
But only fmin / fmax for float.
median3, max3 would return a sorted list?
That was my point 8-10 posts ago.
I saw that a bit later: I've been on a sailing trip in the Caribbean
with very sporadic network connections, basically just when in reach
of a French island.
Must be a tough life............
On 11/27/23 8:03 PM, MitchAlsup wrote:
Anton Ertl wrote: [snip]
However, looking at more recent architectures, the RISC-V M
extension (which is part of RV64G and RV32G, i.e., a
standard extension) has not just multiply instructions (MUL,
MULH, MULHU, MULHSU, MULW), but also integer divide
instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW, and
REMUW.
All of which are possible in My 66000 using operand sign
control, S-bit, and CARRY when you want 64×64->128 or
128/64->{64 quotient, 64 remainder}
What about multiply-high-no-carry-in?
There might be cases
where such could be used for division with a reused
(possibly compile-time constant) divisor even without the
carry-in. (Producing the carry-in by a low multiply seems
to have significant overhead in energy, throughput, and/or
latency.) A small immediate operand might be shifted into
the most significant position to facilitate division by a
small constant.
A division-specific instruction (divide-using-reciprocal)
could do more sophisticated manipulation of an immediate and
perhaps provide an instruction a compiler could use with
less knowledge of the dividend. Yet there might be uses for
a simple multiply-high-no-carry-in.
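Concretely, a minimal sketch of the single-register operation being
asked for, with the GCC/Clang 128-bit extension standing in for the
hardware multiplier:

#include <stdint.h>

/* Multiply-high, unsigned, no carry-in: upper 64 bits of the
   64x64->128 product, returned as a single register-sized result. */
static inline uint64_t mulhu64(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}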
On 12/7/2023 7:39 AM, Anton Ertl wrote:
"Paul A. Clayton" <paaronclayton@gmail.com> writes:
What about multiply-high-no-carry-in? There might be cases
where such could be used for division with a reused
(possibly compile-time constant) divisor even without the
carry-in. (Producing the carry-in by a low multiply seems
to have significant overhead in energy, throughput, and/or
latency.)
What's that operation supposed to produce? How does it differ from,
e.g., RISC-V MULH/MULHU/MULHSU? And if there is a difference, what
makes you think that it still works for division?
For division by reciprocal, one may need to be able to adjust the right
shift (say, 32..34 bits), and add a bias to the low-order bits if the
input is negative (say, adding "Rcp-1").
Without the variable shift part, one will not get reliable results for
some divisors (IIRC, 7, 17, 31, etc).
Without a bias, it will round towards negative infinity rather than
towards zero (standard for division is to always round towards zero).
As-is, signed division logic (say, "y=x/7;") would be something like:
MOV 0x49249249, R1
DMULS.L R4, R1, R2
ADD -1, R1
CMPGE 0, R4
ADD?F R2, R1, R2
SHAD.Q R2, -33, R2
Or, if one used jumbo encodings:
DMULS.L R4, 0x49249249, R2
CMPGE 0, R4 // SR.T=(R4>=0)
ADD?F R2, 0x49249248, R2
SHAD.Q R2, -33, R2
So, sadly, simply taking the high result isn't quite sufficient.
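For comparison, a hedged C rendering of the kind of fixup sequence
described; the constants differ from the BJX2 listing above (this
variant uses a 34-bit shift, since 7 is one of the divisors that
needs the wider shift to be exact over the whole int32 range), and
rounding toward zero is obtained by biasing negative dividends
before the arithmetic shift.

#include <stdint.h>

/* y = x / 7, truncating toward zero. 2454267027 = ceil(2^34 / 7);
   the (2^34 - 1) bias makes the arithmetic shift round toward zero
   for negative x instead of toward negative infinity. */
int32_t div7(int32_t x)
{
    int64_t t = (int64_t)x * 2454267027LL;
    if (x < 0)
        t += (1LL << 34) - 1;
    return (int32_t)(t >> 34);
}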
On 12/7/2023 7:39 AM, Anton Ertl wrote:
"Paul A. Clayton" <paaronclayton@gmail.com> writes:
What about multiply-high-no-carry-in? There might be cases
where such could be used for division with a reused
(possibly compile-time constant) divisor even without the
carry-in. (Producing the carry-in by a low multiply seems
to have significant overhead in energy, throughput, and/or
latency.)
What's that operation supposed to produce? How does it differ from,
e.g., RISC-V MULH/MULHU/MULHSU? And if there is a difference, what
makes you think that it still works for division?
For division by reciprocal, one may need to be able to adjust the right
shift (say, 32..34 bits), and add a bias to the low-order bits if the
input is negative (say, adding "Rcp-1").
Without the variable shift part, one will not get reliable results for
some divisors (IIRC, 7, 17, 31, etc).
Without a bias, it will round towards negative infinity rather than
towards zero (standard for division is to always round towards zero).
FAZDIV
On 12/8/2023 9:20 AM, Anton Ertl wrote:...
BGB <cr88192@gmail.com> writes:
Without a bias, it will round towards negative infinity rather than
towards zero (standard for division is to always round towards zero).
Nope. E.g., Forth-83 standardizes floored division; in Forth-94 and
Forth-2012, it's implementation-defined. More generally,
<https://en.wikipedia.org/wiki/Modulo_operation> lists lots of
languages that perform floored or Euclidean modulo, and I would hope
that their division operation agrees with the modulo operation.
OK, how about:
Languages like C, Java, C#, etc, expect division to always round towards
zero.
On 12/7/23 1:55 PM, MitchAlsup wrote:
Paul A. Clayton wrote:
On 11/27/23 8:03 PM, MitchAlsup wrote:
Anton Ertl wrote: [snip]
However, looking at more recent architectures, the RISC-V M
extension (which is part of RV64G and RV32G, i.e., a standard
extension) has not just multiply instructions (MUL,
MULH, MULHU, MULHSU, MULW), but also integer divide
instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW, and
REMUW.
All of which are possible in My 66000 using operand sign
control, S-bit, and CARRY when you want 64×64->128 or
128/64->{64 quotient, 64 remainder}
What about multiply-high-no-carry-in?
CARRY R9,{{O}} // carry applies to next inst
// no carry in yes carry out
MUL R10,Rx,Ry // {R9,R10} contain result
I was thinking of a single register result.
There might be cases
where such could be used for division with a reused
(possibly compile-time constant) divisor even without the
carry-in. (Producing the carry-in by a low multiply seems
to have significant overhead in energy, throughput, and/or
latency.) A small immediate operand might be shifted into
the most significant position to facilitate division by a
small constant.
A division-specific instruction (divide-using-reciprocal)
could do more sophisticated manipulation of an immediate and
perhaps provide an instruction a compiler could use with
less knowledge of the dividend. Yet there might be uses for
a simple multiply-high-no-carry-in.
Paul A. Clayton wrote:
On 12/7/23 1:55 PM, MitchAlsup wrote:
What about multiply-high-no-carry-in?
CARRY R9,{{O}} // carry applies to next inst
// no carry in yes carry out
MUL R10,Rx,Ry // {R9,R10} contain result
I was thinking of a single register result.
How can a single 64-bit register hold a 128-bit result ??
MitchAlsup <mitchalsup@aol.com> schrieb:
#include <stdint.h>
uint32_t div25 (uint32_t n)
{
return n / 25;
}
yields
div25:
lis 9,0x51eb // cycle 1
ori 9,9,0x851f // cycle 2
mulhwu 3,3,9 // cycle 3-6
srwi 3,3,3 // cycle 7
blr // cycle 8
MitchAlsup <mitchalsup@aol.com> schrieb:
Paul A. Clayton wrote:
On 12/7/23 1:55 PM, MitchAlsup wrote:
What about multiply-high-no-carry-in?
CARRY R9,{{O}} // carry applies to next inst
// no carry in yes carry out
MUL R10,Rx,Ry // {R9,R10} contain result
I was thinking of a single register result.
How can a single 64-bit register hold a 128-bit result ??
Sometimes, you don't need the whole result; the upper bits suffice.
Consider dividing a 32-bit unsigned number n on a 32-bit system
by 25 (equally applicable to similar examples with 64-bit systems,
but I happen to have the numbers at hand).
In plain C, you could do this by
uint32_t res = ((uint64_t) n * (uint64_t) 1374389535) >> 35;
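A quick spot-check of that reciprocal (the full 32-bit range can
also be verified exhaustively in seconds; these are just boundary
cases):

#include <assert.h>
#include <stdint.h>

int main(void)
{
    const uint32_t samples[] = { 0u, 24u, 25u, 26u, 624u, 625u, 4294967295u };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        uint32_t res = (uint32_t)(((uint64_t)samples[i] * 1374389535u) >> 35);
        assert(res == samples[i] / 25u);
    }
    return 0;
}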
Thomas Koenig wrote:
MitchAlsup <mitchalsup@aol.com> schrieb:
Paul A. Clayton wrote:
On 12/7/23 1:55 PM, MitchAlsup wrote:
What about multiply-high-no-carry-in?
CARRY R9,{{O}} // carry applies to next inst
// no carry in yes carry out
MUL R10,Rx,Ry // {R9,R10} contain result
I was thinking of a single register result.
How can a single 64-bit register hold a 128-bit result ??
Sometimes, you don't need the whole result; the upper bits suffice.
Consider dividing a 32-bit unsigned number n on a 32-bit system
by 25 (equally applicable to similar examples with 64-bit systems,
but I happen to have the numbers at hand).
In plain C, you could do this by
uint32_t res = ((uint64_t) n * (uint64_t) 1374389535) >> 35;
This only gives you 29 bits of significance.
What if you also need the remainder ??
Thomas Koenig wrote:
MitchAlsup <mitchalsup@aol.com> schrieb:
#include <stdint.h>
uint32_t div25 (uint32_t n)
{
return n / 25;
}
yields
div25:
lis 9,0x51eb // cycle 1
ori 9,9,0x851f // cycle 2
mulhwu 3,3,9 // cycle 3-6
srwi 3,3,3 // cycle 7
blr // cycle 8
div25:
CARRY R8,{{O}} // cycle 1
MUL R1,R1,#0x51eb851f // cycle 1-4
SLL R1,R8,#35 // cycle 5
RET // cycle 6
On 2023-12-12 4:00 a.m., BGB wrote:
On 12/12/2023 12:09 AM, Thomas Koenig wrote:
Q+ is almost the same as MY66000
MitchAlsup <mitchalsup@aol.com> schrieb:
Thomas Koenig wrote:
MitchAlsup <mitchalsup@aol.com> schrieb:
#include <stdint.h>
uint32_t div25 (uint32_t n)
{
return n / 25;
}
yields
div25:
lis 9,0x51eb // cycle 1
ori 9,9,0x851f // cycle 2
mulhwu 3,3,9 // cycle 3-6
srwi 3,3,3 // cycle 7
blr // cycle 8
div25:
CARRY R8,{{O}} // cycle 1
MUL R1,R1,#0x51eb851f // cycle 1-4
SLL R1,R8,#35 // cycle 5
RET // cycle 6
Then let's use a 64-bit example:
#include <stdint.h>
uint64_t div (uint64_t n)
{
return n / 37;
}
which becomes
div:
.LFB0:
.cfi_startproc
movabsq $-2492803253203993461, %rax
mulq %rdi
movq %rdx, %rax
shrq $5, %rax
ret
on x86_64 (so, use the high result only and shift it five bits to
the right) and, with Power10,
div:
pli 10,3714566310
pli 9,232160395
rldimi 9,10,32,0
mulhdu 3,3,9
srdi 3,3,5
blr
(The first three instructions are pasting together the constant, which
of course My 66000 can do much better :-)
A hypothetical My66000 code could be
mulhu r1, r1, #-2492803253203993461
srl r1, r1, #5
ret
And, for comparison, on BJX2 (with the MULQ extension):
MOV -2492803253203993461, R3
MULHU.Q R4, R3, R2
SHLD.Q R2, -5, R2
RTS
...
Though, this will take 24 bytes to encode, and won't be fast.
Also, currently the only instructions that take 64 bit immediate values
are:
MOV Imm64, Rn
ADD Imm64, Rn
PLDCXH Imm64, Xn //4x Binary16 to 4x Binary32
...
muluh r1,r1,#-2492803253203993461
lsr r1,r1,#5
ret
but ret is only two bytes.
Thomas Koenig <tkoenig@netcologne.de> schrieb:
A hypothetical My66000 code could be
mulhu r1, r1, #-2492803253203993461
srl r1, r1, #5
ret
By "hypothetical" I mean that it isn't in the ISA at the moment.
Thomas Koenig wrote:
Thomas Koenig <tkoenig@netcologne.de> schrieb:
A hypothetical My66000 code could be
mulhu r1, r1, #-2492803253203993461
srl r1, r1, #5
ret
By "hypothetical" I mean that it isn't in the ISA at the moment.
If I knew what mulhu does I could illuminate::
a) how many bits get multiplied?
b) how many bits are produced?
But assuming h means high, and the widths are 64-bit operands and
128-bit results::
CARRY R1,{{O}{I}}
MUL R2,R1,#-2492803253203993461
SHR R1,R2,<0,5>
RET
MitchAlsup <mitchalsup@aol.com> schrieb:
Thomas Koenig wrote:
Thomas Koenig <tkoenig@netcologne.de> schrieb:
A hypothetical My66000 code could be
mulhu r1, r1, #-2492803253203993461
srl r1, r1, #5
ret
By "hypothetical" I mean that it isn't in the ISA at the moment.
If I knew what mulhu does I could illuminate::
a) how many bits get multiplied?
b) how many bits are produced?
But assuming h means high, and the widths are 64-bit operands and
128-bit results::
CARRY R1,{{O}{I}}
MUL R2,R1,#-2492803253203993461
SHR R1,R2,<0,5>
RET
That works, but has two disadvantages over the non-Carry
version: It is longer (by one instruction modifier),
and R2 is overwritten with a value which is not needed.
This is why I think that having a "multiply high signed"
and "multiply high unsigned" operation is a good idea even
when My 66000's Carry is implemented.
Thomas Koenig wrote:
MitchAlsup <mitchalsup@aol.com> schrieb:
Thomas Koenig wrote:
Thomas Koenig <tkoenig@netcologne.de> schrieb:
A hypothetical My66000 code could be
mulhu r1, r1, #-2492803253203993461
srl r1, r1, #5
ret
By "hypothetical" I mean that it isn't in the ISA at the moment.
If I knew what mulhu does I could illuminate::
a) how many bits get multiplied?
b) how many bits are produced?
But assuming h means high, and the widths are 64-bit operands and
128-bit results::
CARRY R1,{{O}{I}}
MUL R2,R1,#-2492803253203993461
SHR R1,R2,<0,5>
RET
That works, but has two disadvantages over the non-Carry
version: It is longer (by one instruction modifier),
and R2 is overwritten with a value which is not needed.
This is why I think that having a "multiply high signed"
and "multiply high unsigned" operation is a good idea even
when My 66000's Carry is implemented.
Heck:: if you are worried about length::
DIV R1,R1,#37
RET
2 words instead of 6.
MitchAlsup <mitchalsup@aol.com> schrieb:
Thomas Koenig wrote:
MitchAlsup <mitchalsup@aol.com> schrieb:
Thomas Koenig wrote:
Thomas Koenig <tkoenig@netcologne.de> schrieb:
A hypothetical My66000 code could be
mulhu r1, r1, #-2492803253203993461
srl r1, r1, #5
ret
By "hypothetical" I mean that it isn't in the ISA at the moment.
If I knew what mulhu does I could illuminate::
a) how many bits get multiplied?
b) how many bits are produced?
But assuming h means high, and the widths are 64-bit operands and
128-bit results::
CARRY R1,{{O}{I}}
MUL R2,R1,#-2492803253203993461
SHR R1,R2,<0,5>
RET
That works, but has two disadvantages over the non-Carry
version: It is longer (by one instruction modifier),
and R2 is overwritten with a value which is not needed.
This is why I think that having a "multiply high signed"
and "multiply high unsigned" operation is a good idea even
when My 66000's Carry is implemented.
Heck:: if you are worried about length::
DIV R1,R1,#37
RET
2 words instead of 6.
This is an optimization problem with three objective functions:
Code size, register use and speed. Any solution in which at which
it is no longer possible to improve one of the objective functions
without making the others worse is a valid solution to that problem
(part of the Pareto front, if you want to use that terminology).
The version with the DIV with constant is the smallest, so it
trumps everything else on size. It is, however, likely to be
slower than the other two. It is what I would expect a compiler to emit
if optimizing for size (-Os), and when not optimizing.
The version with the Carry is likely faster than the DIV with
constant version, but is bigger and uses two registers instead
of one. It is what I would expect a compiler to generate in
the absence of multiply high unsigned and any -O option, except
for -Os.
The hypothetical version with "multiply high unsigned" is smaller
than the version with Carry and is better on register use, so
it is better in that sense. If it existed, I would expect a
compiler to always generate that for any optimization level
except -Os. (Small caveat: Maybe hardware has a fiendishly
clever algorithm to divide by 37 in four or fewer cycles; this
would then be known to the compiler, and it could choose accordingly).
Thomas Koenig wrote:
MitchAlsup <mitchalsup@aol.com> schrieb:
Thomas Koenig wrote:
Thomas Koenig <tkoenig@netcologne.de> schrieb:
A hypothetical My66000 code could be
mulhu r1, r1, #-2492803253203993461
srl r1, r1, #5
ret
By "hypothetical" I mean that it isn't in the ISA at the moment.
If I knew what mulhu does I could illuminate::
a) how many bits get multiplied?
b) how many bits are produced?
But assuming h means high, and the widths are 64-bit operands and
128-bit results::
CARRY R1,{{O}{I}}
MUL R2,R1,#-2492803253203993461
SHR R1,R2,<0,5>
RET
That works, but has two disadvantages over the non-Carry
version: It is longer (by one instruction modifier),
and R2 is overwritten with a value which is not needed.
This is why I think that having a "multiply high signed"
and "multiply high unsigned" operation is a good idea even
when My 66000's Carry is implemented.
I disagree:
The R2 overwrite really doesn't matter since the OoO circuitry will
quickly notice that the value isn't used at all, and the extra CARRY
modifier does not make the code any slower: It will still take exactly
the latency of MUL + SHR, so presumably 5-6 total cycles.
Since MULHU is very rarely used in regular code (pretty much only in
constant divisions that got replaced with reciprocal muls), this tiny overhead is a very good deal.
Still, when I see an inefficiency, even minor, I tend to remark
on it :-)