• Re: Alternative Representations of the Concertina II ISA

    From Quadibloc@21:1/5 to Quadibloc on Fri Nov 24 04:42:31 2023
    On Fri, 24 Nov 2023 04:15:57 +0000, Quadibloc wrote:

    In 24-bit memory,

Load Floating would be an invalid operation;
Load Double would load a 72-bit floating-point number;
    Load Medium would load a 48-bit floating-point number;

    Alternatively, it would perhaps be simpler to do it this way:

    Load Floating would load a 48-bit floating-point number;
    Load Medium would load a 72-bit floating-point number.

    This would prioritize structure over function, but only
    the compiler would ever know...

    The 24-bit representation of code, however, has a bigger problem.

    In the 60-bit representation, a 15-bit halfword that can be
    addressed easily in 60-bit memory corresponds to a 16-bit area
    in the original ISA;

In the 36-bit representation, an 18-bit halfword, easily addressable
in 36-bit memory, does the same.

In the 24-bit representation, though, a problem exists: 32-bit
instructions and 24-bit words line up only at every 96 bits, the
least common multiple of the two sizes. The simplest way to deal
with that would be, in code consisting entirely of 32-bit
instructions, to allow only every third instruction to be a branch
target.
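The arithmetic behind "every third instruction" can be checked directly; a small sketch of my own, not part of the ISA:

```python
from math import lcm

# Bit offsets of successive 32-bit instructions in a code stream.
instr_offsets = [32 * i for i in range(12)]

# In 24-bit memory an instruction is easy to address as a branch
# target only if it starts on a 24-bit word boundary.
targets = [i for i, off in enumerate(instr_offsets) if off % 24 == 0]

# Alignment recurs every lcm(32, 24) = 96 bits: every third instruction.
assert lcm(32, 24) == 96
assert targets == [0, 3, 6, 9]
```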

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to All on Fri Nov 24 04:15:57 2023
    The Concertina II ISA, as it stands now, has a
    very limited number of opcodes for load and
    store memory-reference operations, but a huge
    number of opcodes for register-to-register
    operate instructions.

    This makes it suitable for an implementation where:

    Physical memory is some power-of-two multiple of
    96 bits in width, for example, 192 bits or 384 bits
    wide;

    A simple hardware divide-by-three circuit allows
    accessing memory normally with the RAM module width
    of 64 bits as the fundamental unit;

    Applying no conversion allows 24 bits to be the
    fundamental unit;

    An even simpler _multiply_ by five circuit allows
    accessing memory with a 60-bit word as the
    fundamental unit.
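The divide-by-three and multiply-by-five circuits fall out of the address arithmetic: with a 96-bit physical line, word n at width w starts at bit w*n, so the line number is w*n/96. A sketch of that mapping, on my own reading of the scheme:

```python
def line_and_offset(word_index, width, line_bits=96):
    """Map word index n at the given width to (physical line, bit offset),
    for memory built from line_bits-wide lines (here 96 bits)."""
    bit_addr = width * word_index
    return bit_addr // line_bits, bit_addr % line_bits

# 64-bit units: line = 64n/96 = 2n/3, hence a divide-by-three circuit.
assert line_and_offset(3, 64) == (2, 0)
# 60-bit words: line = 60n/96 = 5n/8, a multiply-by-five then a shift.
assert line_and_offset(8, 60) == (5, 0)
# 24-bit words: line = 24n/96 = n/4, a plain shift, no conversion.
assert line_and_offset(4, 24) == (1, 0)
```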

    So it's possible to take an implementation of the
    architecture with three-channel memory, and add the
    feature of allocating data memory blocks that look
    like they're made up of 48-bit words or 60-bit words.

    When a memory-reference instruction refers to an
    address which is inside of such a variant block,
its operation would be modified to handle a data
    type of an appropriate width for that kind of memory.

    Thus:

    In 60-bit memory,

    Load Floating would be an invalid operation;
    Load Double would load a 60-bit floating-point number;
    Load Medium would load a 45-bit floating-point number, as
    such a float would still be wide enough to be useful.

    Load Byte would be an invalid operation.
    Load Halfword would load a 15-bit integer;
    Load would load a 30-bit integer;
    Load Long would load a 60-bit integer.

    In 36-bit memory,

Load Floating would load a 36-bit floating-point number;
Load Double would load a 72-bit floating-point number;
Load Medium would load a 54-bit floating-point number.

Load Byte would load a 9-bit byte;
Load Halfword would load an 18-bit halfword integer;
Load would load a 36-bit integer;
Load Long would load a 72-bit integer into a _pair_
of registers, the first of which would be an even-numbered
register, as the registers are only 64 bits long.

    In 24-bit memory,

Load Floating would be an invalid operation;
    Load Double would load a 72-bit floating-point number;
    Load Medium would load a 48-bit floating-point number;

    Load Byte would load a 6-bit character;
    Load Halfword would load a 12-bit integer;
    Load would load a 24-bit integer;
    Load Long would load a 48-bit integer.

    Note that the relationship between floating-point
    widths and integer widths in instructions is offset
    by a factor of two when 24-bit memory is referenced.
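The three tables above can be written down as data, which makes the factor-of-two offset in 24-bit memory easy to check; the widths below are from the post, the dictionary layout is mine (plain "Load" is listed under "Word"):

```python
# Operand widths in bits for each variant memory width (None = invalid).
float_widths = {
    60: {"Floating": None, "Medium": 45, "Double": 60},
    36: {"Floating": 36,   "Medium": 54, "Double": 72},
    24: {"Floating": None, "Medium": 48, "Double": 72},
}
int_widths = {
    60: {"Byte": None, "Halfword": 15, "Word": 30, "Long": 60},
    36: {"Byte": 9,    "Halfword": 18, "Word": 36, "Long": 72},
    24: {"Byte": 6,    "Halfword": 12, "Word": 24, "Long": 48},
}

# In 36-bit memory, Load Long and Load Double move the same 72 bits...
assert int_widths[36]["Long"] == float_widths[36]["Double"] == 72
# ...but in 24-bit memory the 48-bit float comes from Load Medium,
# not from the Load Long position: the offset by a factor of two.
assert int_widths[24]["Long"] == float_widths[24]["Medium"] == 48
```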

    And there would be separate opcodes for doing arithmetic
    on numbers of these different lengths in the registers.

    This is all very well. But in program code, which, as
    it is so far specified, can _only_ reside in 64-bit memory,
    what about pseudo-immediate values?

    Since the instruction format that references pseudo-immediate
    values is that of an operate instruction, pseudo-immediates
    _could_ be allowed, with considerable wasted space in most
    cases, in operate instructions for the data types associated
    with variant memory widths.

    However, I have thought of a way to allow the CPU to treat
    the different memory widths with greater equality.

    Instead of devising a whole new ISA for each of the other
    memory widths, the *existing* ISA, designed around standard
    memory built up from the 64-bit unit, could be given
    _alternative representations_ wherein programs could be
    stored in memory of other widths, using the same ISA with
    minimum modifications.

    The 36-bit Representation

    In the standard 64-bit representation, a block may begin
    with 1100, followed by a 2-bit prefix field for each of the
    remaining fourteen 16-bit parts of the remaining seven
    32-bit instruction slots in the block.

So, make that portion of the ISA the only one supported in
the 36-bit representation, but with each block now 288 bits
long, made up of eight 36-bit words: a 2-bit prefix field,
followed by 16 instruction bits, repeated sixteen times.

    But it must still be possible to have block headers if one is
    to have pseudo-immediates.

So the first 36-bit instruction slot of a block, if it is to
be a header, would start with 1011100011, and the prefix-field
portion of the second 18 bits of that instruction slot would
contain 11. 1011100011 consists of 10 (32-bit instruction)
and 11100011 (header prefix in 32-bit mode).

So the ISA is modified, because in 64-bit memory only seven
instruction slots, not all eight, can contain 36-bit instructions,
and there are no headers inside 36-bit instructions. But the
modifications are minimal, for the purpose of adapting programs
to the different size of memory, with that size being native
for them.
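A sketch of how the 36-bit packing might be unpacked in software; the 2-bit-prefix-plus-16-bit layout and the header marker come from the text above, while the helper itself is my own illustration:

```python
def split_block(words):
    """Unpack eight 36-bit words into sixteen (2-bit prefix, 16-bit) units;
    each 36-bit word carries two prefix+halfword pairs."""
    assert len(words) == 8 and all(0 <= w < 1 << 36 for w in words)
    units = []
    for w in words:
        for half in ((w >> 18) & 0x3FFFF, w & 0x3FFFF):  # high 18, low 18
            units.append((half >> 16, half & 0xFFFF))    # (prefix, bits)
    return units

# The header marker 1011100011 is the 32-bit-instruction prefix 10
# glued onto the 32-bit-mode header prefix 11100011.
assert 0b1011100011 == (0b10 << 8) | 0b11100011
```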

    The 24-bit representation

    Since a 24-bit word contains three 8-bit bytes, here, the
form of instructions isn't modified at all. But blocks shrink
from eight instruction slots that are 32 bits in length to
eight 24-bit words, so a block would be 192 bits long.

    The amount of wasted space for data in 24-bit types wouldn't
    be changed much in many cases by using this representation
    instead of the 32-bit representation for program code,
    because the clash between instruction widths and data widths
    wouldn't be affected. However, _some_ improvement would be
    achieved, because now if there are multiple pseudo-immediate
    values used in a single block, the 24-bit data items could
    at least be *packed* correctly in the block.

    The 60-bit representation

    Here, a block is 480 bits long.

    It consists of two 240-bit sub-blocks, in the following form:

One 15-bit halfword containing 15 prefix bits,

    and fifteen 15-bit halfwords to which those bits apply,
    turning them into 16-bit halfwords, which are used to form
    instructions.

    The blocks are aligned on 480-bit boundaries. When code
    consists of 32-bit instructions, the first sub-block begins
    with an instruction, and the second sub-block begins with the
    second half of an instruction.

    The first whole 32-bit instruction in a sub-block can be a
    header, and so one has a header applying to seven subsequent
    instruction slots in the first sub-block, and a header applying
    to six subsequent instruction slots in the second sub-block.
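A sketch of the sub-block expansion; the post doesn't say which prefix bit pairs with which halfword, so the assumption that bit i (counted from the left) becomes the top bit of halfword i is mine:

```python
def expand_subblock(prefix15, halves15):
    """Expand a 240-bit sub-block (one 15-bit halfword of prefix bits plus
    fifteen 15-bit halfwords) into fifteen 16-bit halfwords.
    Assumption (mine): prefix bit i becomes the top bit of halfword i."""
    assert 0 <= prefix15 < 1 << 15 and len(halves15) == 15
    out = []
    for i, h in enumerate(halves15):
        bit = (prefix15 >> (14 - i)) & 1   # bit i, counted from the left
        out.append((bit << 15) | h)
    return out

# 15 prefix bits + 15 x 15 halfword bits = 240 bits, as stated above.
halves = expand_subblock(0b100000000000000, [0x7FFF] + [0] * 14)
assert halves[0] == 0xFFFF and halves[1] == 0x0000
```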

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Fri Nov 24 18:48:53 2023
    Quadibloc wrote:

    The Concertina II ISA, as it stands now, has a
    very limited number of opcodes for load and
    store memory-reference operations, but a huge
    number of opcodes for register-to-register
    operate instructions.

    This makes it suitable for an implementation where:

    Physical memory is some power-of-two multiple of
    96 bits in width, for example, 192 bits or 384 bits
    wide;

    A simple hardware divide-by-three circuit allows
    accessing memory normally with the RAM module width
    of 64 bits as the fundamental unit;

    Applying no conversion allows 24 bits to be the
    fundamental unit;

    An even simpler _multiply_ by five circuit allows
    accessing memory with a 60-bit word as the
    fundamental unit.

    So it's possible to take an implementation of the
    architecture with three-channel memory, and add the
    feature of allocating data memory blocks that look
    like they're made up of 48-bit words or 60-bit words.

    When a memory-reference instruction refers to an
    address which is inside of such a variant block,
its operation would be modified to handle a data
    type of an appropriate width for that kind of memory.

    Thus:

    In 60-bit memory,

    Load Floating would be an invalid operation;
    Load Double would load a 60-bit floating-point number;
    Load Medium would load a 45-bit floating-point number, as
    such a float would still be wide enough to be useful.

    I have not found a need to have LD/ST floating point instructions
    because I don't have FP register file(s) {or SIMD files}. Why do
    you seem to want/need a file dedicated to floating point values ??
    This saves a huge chunk of OpCode Space.......

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Fri Nov 24 22:38:32 2023
    On Fri, 24 Nov 2023 18:48:53 +0000, MitchAlsup wrote:

    I have not found a need to have LD/ST floating point instructions
    because I don't have FP register file(s) {or SIMD files}. Why do you
    seem to want/need a file dedicated to floating point values ??
    This saves a huge chunk of OpCode Space.......

    Well, for one thing, by giving the integer functional unit one register
    file, and the floating-point functional unit its own register file, then
    those two functional units are not constrained to be adjacent.

    I can stick the Decimal Floating-Point functional unit on the other side
    of the floating register file, and the Simple Floating functional unit,
    or the Register Packed Decimal functional unit, on the other side of the integer register file.

    But of course my main reason is that this is the way everyone (i.e.
    the IBM System/360) does it, so of course I should do it that way!

    But having an excess of data types is at least a helpful excuse...

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Fri Nov 24 22:44:32 2023
    On Fri, 24 Nov 2023 22:38:32 +0000, Quadibloc wrote:

    On Fri, 24 Nov 2023 18:48:53 +0000, MitchAlsup wrote:

    I have not found a need to have LD/ST floating point instructions
    because I don't have FP register file(s) {or SIMD files}. Why do you
    seem to want/need a file dedicated to floating point values ??
    This saves a huge chunk of OpCode Space.......

Also, IEEE 754 calls for a hidden first bit and gradual
underflow, and these things make floating-point
arithmetic slightly slower.

    Hence, I deal with them at load and store time, by using
    the load and store floating point instructions to convert
    to and from an *internal representation* of floats. That
    in itself precludes using the integer instructions for them.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Fri Nov 24 22:41:03 2023
    On Fri, 24 Nov 2023 04:42:31 +0000, Quadibloc wrote:

    On Fri, 24 Nov 2023 04:15:57 +0000, Quadibloc wrote:

    In 24-bit memory,

Load Floating would be an invalid operation;
Load Double would load a 72-bit floating-point number;
    Load Medium would load a 48-bit floating-point number;

    Alternatively, it would perhaps be simpler to do it this way:

    Load Floating would load a 48-bit floating-point number;
    Load Medium would load a 72-bit floating-point number.

    This would prioritize structure over function, but only the compiler
    would ever know...

    Silly me. How could I have forgotten that 24-bit memory is the
    only place where the "ideal" lengths of floating-point numbers
    can exist, so I should use them, rather than try to concoct
    something that's 24-bit native.

    So

    Load Floating would load a 36-bit floating-point number,
    Load Medium would load a 48-bit floating-point number, and
    Load Double would load a 60-bit floating-point number.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Fri Nov 24 22:53:57 2023
    On Fri, 24 Nov 2023 04:15:57 +0000, Quadibloc wrote:

    An even simpler _multiply_ by five circuit allows accessing memory with
    a 60-bit word as the fundamental unit.

    As well, a multiply by three circuit is used to address the 36-bit memory.

    Instead of devising a whole new ISA for each of the other memory widths,
    the *existing* ISA, designed around standard memory built up from the
    64-bit unit, could be given _alternative representations_ wherein
    programs could be stored in memory of other widths, using the same ISA
    with minimum modifications.

While this is an "improvement", I do realize that this, along
with even supporting multiple memory widths, is sheer insanity,
even if I have found ways that it can technically be done.

    I do have a rationale for going there. It's simple enough. It comes from
    the x86.

    Since the irresistible force of availability of applications means that
    we are doomed to always have just *one* dominant ISA...

then, to minimize the terrible loss that this entails, that one
dominant ISA should be maximally flexible, to approach, as closely
as possible, the "good old days" where the PDP-10 and the CDC 6600
coexisted alongside the PDP-11 and the IBM System/360.

This is the driving force behind my insane quest to design an
ISA for a CPU which has the ability to do something which no one
seems to want.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Sat Nov 25 19:55:59 2023
    Quadibloc wrote:

    On Fri, 24 Nov 2023 18:48:53 +0000, MitchAlsup wrote:

    I have not found a need to have LD/ST floating point instructions
    because I don't have FP register file(s) {or SIMD files}. Why do you
    seem to want/need a file dedicated to floating point values ??
    This saves a huge chunk of OpCode Space.......

    Well, for one thing, by giving the integer functional unit one register
    file, and the floating-point functional unit its own register file, then those two functional units are not constrained to be adjacent.

    But Integer Multiply and Divide can share the FPU that does these.
    And we have the need to use FP values in deciding to take branches.
So there is a lot of cross pollination of types to FUs. And finally
you have to have two function units, one to convert from FP to int
and one to convert int to FP.

Let's keep them apart, and now we need an FU bigger than the FMAC
unit to perform integer multiplies and divides. In addition, you
need a bus from the FPU side to the condition unit that cooperates
with integer branch instructions.

    Does not look like you can keep them "that far apart" and even when you
    do there is lots of cross wiring--some of which are more costly in HW
    than keeping FP and Int in the same RF.

    I can stick the Decimal Floating-Point functional unit on the other side
    of the floating register file, and the Simple Floating functional unit,
    or the Register Packed Decimal functional unit, on the other side of the integer register file.

    But of course my main reason is that this is the way everyone (i.e.
    the IBM System/360) does it, so of course I should do it that way!

    But having an excess of data types is at least a helpful excuse...

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Quadibloc on Sat Nov 25 14:46:33 2023
    On 11/24/2023 4:44 PM, Quadibloc wrote:
    On Fri, 24 Nov 2023 22:38:32 +0000, Quadibloc wrote:

    On Fri, 24 Nov 2023 18:48:53 +0000, MitchAlsup wrote:

    I have not found a need to have LD/ST floating point instructions
    because I don't have FP register file(s) {or SIMD files}. Why do you
    seem to want/need a file dedicated to floating point values ??
    This saves a huge chunk of OpCode Space.......

    Also, because IEEE 754 calls for a hidden first bit, and
    gradual underflow, these things make floating-point
arithmetic slightly slower.


    DAZ/FTZ fixes the gradual underflow issue...
    Not the standard solution, but works.
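What DAZ-style flushing amounts to can be sketched on IEEE-754 single precision: any input whose exponent field is all zeros is treated as (signed) zero. A rough model of mine, not how any particular CPU implements it:

```python
import struct

def daz(x):
    """Treat IEEE-754 single-precision subnormals as (signed) zero,
    in the spirit of the DAZ/FTZ control bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exponent = (bits >> 23) & 0xFF
    if exponent == 0:               # subnormal (or zero) input
        return -0.0 if bits >> 31 else 0.0
    return x

assert daz(1.0e-40) == 0.0          # subnormal: flushed to zero
assert daz(1.5) == 1.5              # normal value: passed through
```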


    One could maybe also make a case for truncate-only rounding:
    Easier to make bit-exact across implementations;
    Cheapest option;
    ...

    Though, truncate does occasionally have visible effects on code.
For example, the player's yaw angle in my Quake port was prone to
drift; I had to manually add a small bias to stop the drifting.
    ...

Say, round-nearest at least has the advantage that looping relative
computations don't have a tendency to drift towards 0. Though, on the
other side, computations which are stable with truncate rounding will
tend to drift away from zero with round-nearest.

    In this case, this applied mostly to 'float' and 'short float' (when set
    to be routed through the low-precision unit via a compiler flag), with
    the main FPU (used for 'double'; and the others with default options)
    still doing proper round-nearest.



I don't imagine a non-hidden-bit representation would be much better here.

Renormalization would still be required regularly to avoid other issues,
and manual re-normalization would negatively affect performance.

    Operations like FADD/FSUB would also still need to be able to
    shift-align operands to be able to do their thing, so non-normalized
    values would not save much here either.


Well, with the partial exception that one could instead re-interpret
the floating-point format as more like variable fixed-point, in which
case the compiler would need to deal with these issues (and probably
the programmer as well, say, to specify how many bits are above or
below the binary point).

    But, in this case, may as well save a few bits and just use "normal" fixed-point (or maybe add a compiler type, like "__fixed(16,16)" or
    similar, then have the compiler deal with the shifts and similar...).
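A sketch of what a compiler might lower the hypothetical __fixed(16,16) type's multiply to, with the shift inserted automatically; the helper names here are mine:

```python
FRAC = 16  # a 16.16 format, as in the hypothetical __fixed(16,16)

def to_fix(x):
    return int(round(x * (1 << FRAC)))

def from_fix(f):
    return f / (1 << FRAC)

def fix_mul(a, b):
    # The compiler supplies the shift: keep the middle bits of the
    # double-width product, discarding the extra fraction bits.
    return (a * b) >> FRAC

assert from_fix(fix_mul(to_fix(1.5), to_fix(2.25))) == 3.375
```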


    Hence, I deal with them at load and store time, by using
    the load and store floating point instructions to convert
    to and from an *internal representation* of floats. That
    in itself precludes using the integer instructions for them.


    Possible.

    I went with integer operations myself...



    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Mon Nov 27 01:29:56 2023
    On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:

    But Integer Multiply and Divide can share the FPU that does these.

    But giving each one its own multiplier means more superscalar
    goodness!

    But you _do_ have a good point. This kind of "superscalar goodness"
    is usually wasted, because most programs don't do an equal amount
    of integer and FP arithmetic. So it would be way better to have
    twice as many multipliers and dividers that are designed to be used
    either way than having units of two kinds.

    I'm assuming, though, that one can make the multipliers and dividers
    simpler, hence faster, by making them single-use, and that it's
    possible to know, for a given use case, what the ratio between integer
    and FP operations will be. After all, integer multiplication discards
    the MSBs of the result, and floating-point multiplication discards the
    LSBs of the result.
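The asymmetry is easy to see numerically: a 32-bit integer multiply keeps the low half of the double-width product, while an FP multiply keeps (roughly) the top 53 bits and rounds the rest away. A small demonstration of mine:

```python
a, b = 0xFFFF_FFFF, 0x1000_0001

full = a * b                        # exact double-width product
low = full & 0xFFFF_FFFF            # integer multiply: keeps the LSBs
fp = float(a) * float(b)            # FP multiply: keeps the MSBs, rounded

assert low != full                  # high bits were discarded
assert fp != full                   # low bits were rounded away
assert abs(fp - full) / full < 2**-52   # but the leading bits survive
```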

    And we have the need to use FP values in deciding to take branches.

    What gets used in taking branches is the _condition codes_. They're
    conveyed from the integer and floating point functional units, as well
    as all the other functional units, to a central place (the program
    status word).

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Mon Nov 27 01:56:01 2023
    Quadibloc wrote:

    On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:

    But Integer Multiply and Divide can share the FPU that does these.

    But giving each one its own multiplier means more superscalar
    goodness!

    But you _do_ have a good point. This kind of "superscalar goodness"
    is usually wasted, because most programs don't do an equal amount
    of integer and FP arithmetic. So it would be way better to have
    twice as many multipliers and dividers that are designed to be used
    either way than having units of two kinds.

    I'm assuming, though, that one can make the multipliers and dividers
    simpler, hence faster, by making them single-use, and that it's
    possible to know, for a given use case, what the ratio between integer
    and FP operations will be.

    In non-scientific code, there are 10 integer mul/divs per 1 FP mul/div
    In Scientific code, there are 2 integer mul/divs per 10 FP mul/divs.

    After all, integer multiplication discards
    the MSBs of the result, and floating-point multiplication discards the
    LSBs of the result.

Means almost nothing, because an FMAC can end up with those very
same bits as are preferred for integer multiplies.

    And we have the need to use FP values in deciding to take branches.

    What gets used in taking branches is the _condition codes_. They're
    conveyed from the integer and floating point functional units, as well
    as all the other functional units, to a central place (the program
    status word).

    LD R4,[address of FP value]
    BFEQ0 R4,some label

    How do you do this on a condition code machine in 2 instructions ??
System/360 had to::

    LD F4,[address of FP value]
    TST F4
    BEQ some label

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Quadibloc on Mon Nov 27 00:18:34 2023
    On 11/26/2023 7:29 PM, Quadibloc wrote:
    On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:

    But Integer Multiply and Divide can share the FPU that does these.

    But giving each one its own multiplier means more superscalar
    goodness!

    But you _do_ have a good point. This kind of "superscalar goodness"
    is usually wasted, because most programs don't do an equal amount
    of integer and FP arithmetic. So it would be way better to have
    twice as many multipliers and dividers that are designed to be used
    either way than having units of two kinds.

    I'm assuming, though, that one can make the multipliers and dividers
    simpler, hence faster, by making them single-use, and that it's
    possible to know, for a given use case, what the ratio between integer
    and FP operations will be. After all, integer multiplication discards
    the MSBs of the result, and floating-point multiplication discards the
    LSBs of the result.


    If one wants to have the most superscalar goodness, for general case
    code, this means mostly lots of simple ALU ops (particularly, ADD/SUB/AND/OR/SHIFT), and ability to do *lots* of memory Load/Store ops.

    Where, say, Load/Store ops and simple ALU ops tend to dominate over
    pretty much everything else in terms of usage frequency.


    In most code, FPU ops are comparably sparse, as is MUL, and DIV/MOD is
    fairly rare in comparison to either. MUL is common enough that you want
    it to be fast when it happens, but not so common to where one is dealing
    with piles of them all at the same time (except maybe in edge cases,
    like the DCT transform in JPEG/MPEG).

If FMUL and MUL can't be co-issued, this is unlikely to matter.


    DIV is uncommon enough that it can be 80 or 100 cycles or similar, and
    most code will not notice (except in rasterizers or similar, which
    actually need fast DIV, but can cheat in that they don't usually need
    exact DIV and the divisors are small, allowing for "shortcuts", such as
    using lookup tables of reciprocals or similar).


    Otherwise, one can debate whether or not having DIV/MOD in hardware
    makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV is
    at least "probably" faster than a generic software-only solution).

For cases like divide-by-constant, also, it is typically possible
for the compiler to turn it into a multiply by the reciprocal.
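The divide-by-constant trick is the classic round-up-reciprocal transformation; a sketch of mine, with the precision choice following the usual magic-number recipe and checked only for the divisors below:

```python
def magic(d, bits=32):
    """Precompute (m, s) with (n * m) >> s == n // d for all n < 2**bits.
    Round-up reciprocal with bits + ceil(log2 d) fraction bits."""
    s = bits + (d - 1).bit_length()
    m = (1 << s) // d + 1            # ceil(2**s / d) for non-power-of-two d
    return m, s

# The divide becomes a widening multiply plus a shift.
for d in (3, 7, 10, 641):
    m, s = magic(d)
    for n in (0, 1, d - 1, d, 12345, 2**32 - 1):
        assert (n * m) >> s == n // d
```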



    And we have the need to use FP values in deciding to take branches.

    What gets used in taking branches is the _condition codes_. They're
    conveyed from the integer and floating point functional units, as well
    as all the other functional units, to a central place (the program
    status word).


    Yeah, better to not have this style of condition codes...

    Like, there are other ways of doing conditional branches...



    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Mon Nov 27 09:17:59 2023
    BGB <cr88192@gmail.com> writes:
    On 11/26/2023 7:29 PM, Quadibloc wrote:
    On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:

    But Integer Multiply and Divide can share the FPU that does these.

    But giving each one its own multiplier means more superscalar
    goodness!

    Having two multipliers that serve both purposes means even more
    superscalar goodness for similar area cost. However, there is the
    issue of latency. The Willamette ships the integers over to the FPU
    for multiplication, and the result back, crossing several clock
    domains (at one clock loss per domain crossing), resulting in a
10-cycle latency for integer multiplication. I think that these days
every high-performance core with real silicon invests in separate
GPR and FP/SIMD (including integer SIMD) multipliers.

    In most code, FPU ops are comparably sparse

    In terms of executed ops, that depends very much on the code. GP
    cores have acquired SIMD cores primarily for FP ops, as can be seen by
    both SSE and supporting only FP at first, and only later adding
    integer stuff, because it cost little extra. Plus, we have added GPUs
    that are now capable of doing huge amounts of FP ops, with uses in
    graphics rendering, HPC and AI.

    Otherwise, one can debate whether or not having DIV/MOD in hardware
    makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV is
    at least "probably" faster than a generic software-only solution).

    That debate has been held, and MIPS has hardware integer divide, Alpha
    and IA-64 don't have a hardware integer divide; they both have FP
    divide instructions.

    However, looking at more recent architectures, the RISC-V M extension
    (which is part of RV64G and RV32G, i.e., a standard extension) has not
    just multiply instructions (MUL, MULH, MULHU, MULHSU, MULW), but also
    integer divide instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW,
    and REMUW. ARM A64 also has divide instructions (SDIV, UDIV), but
    RISC-V seems significant to me because there the philosophy seems to
    be to go for minimalism. So the debate has apparently come to the
    conclusion that for general-purpose architectures, you include an
    integer divide instruction.

    For cases like divide-by-constant, also, typically it is possible to
    turn it in the compiler into a multiply by reciprocal.

    But for that you want a multiply instruction, which in the RISC-V case
    means including the M extension, which also includes divide
    instructions. Multiplying by the reciprocal may still be faster.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Quadibloc@21:1/5 to Anton Ertl on Mon Nov 27 13:27:01 2023
    On Mon, 27 Nov 2023 09:17:59 +0000, Anton Ertl wrote:

    That debate has been held, and MIPS has hardware integer divide, Alpha
    and IA-64 don't have a hardware integer divide; they both have FP divide instructions.

    I didn't know this about the Itanium. All I remembered hearing was that
    the Itanium "didn't have a divide instruction", and so I didn't realize
    this applied to fixed-point arithmetic only.

    John Savard

  • From Anton Ertl@21:1/5 to Quadibloc on Mon Nov 27 15:54:03 2023
    Quadibloc <quadibloc@servername.invalid> writes:
    On Mon, 27 Nov 2023 09:17:59 +0000, Anton Ertl wrote:

    That debate has been held, and MIPS has hardware integer divide, Alpha
    and IA-64 don't have a hardware integer divide; they both have FP divide
    instructions.

    I didn't know this about the Itanium. All I remembered hearing was that
    the Itanium "didn't have a divide instruction", and so I didn't realize
    this applied to fixed-point arithmetic only.

    You are right, I was wrong. <http://gec.di.uminho.pt/discip/minf/ac0203/icca03/ia64fpbf1.pdf>
    says:

    |A number of floating-point operations defined by the IEEE Standard are
    |deferred to software by the IA-64 architecture in all its
    |implementations
    |
    |* floating-point divide (integer divide, which is based on the
    | floating-point divide operation, is also deferred to software)

    The paper goes on to describe that FP division a/b is based on
    determining 1/b through Newton-Raphson approximation, using the FMA instruction, and then multiplying with a. It shows a sequence of 13 instructions for double precision, 1 frcpa and 12 fma or fnma
    instructions.

    Given that integer division is based on FP division, it probably takes
    even more instructions.

    Meanwhile, a lowly Cortex-A53 with its divide instruction produces an
    integer division result with a small quotient in a few cycles. In
    other words, if you want to use that method for division, still
    provide a division instruction, and then implement it with that
    method. This provides the flexibility to switch to some other method
    (e.g., the one used by ARM) later.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From BGB@21:1/5 to Anton Ertl on Mon Nov 27 13:41:12 2023
    On 11/27/2023 3:17 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/26/2023 7:29 PM, Quadibloc wrote:
    On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:

    But Integer Multiply and Divide can share the FPU that does these.

    But giving each one its own multiplier means more superscalar
    goodness!

    Having two multipliers that serve both purposes means even more
    superscalar goodness for similar area cost. However, there is the
    issue of latency. The Willamette ships the integers over to the FPU
    for multiplication, and the result back, crossing several clock
    domains (at one clock loss per domain crossing), resulting in a
    10-cycle latency for integer multiplication. I think that these days
    every high-performance core with real silicon invests into separate
    GPR and FP/SIMD (including integer SIMD) multipliers.


    I ended up with different multipliers mostly because the requirements
    are different...


    In most code, FPU ops are comparably sparse

    In terms of executed ops, that depends very much on the code. GP
    cores have acquired SIMD cores primarily for FP ops, as can be seen by
    both SSE and supporting only FP at first, and only later adding
    integer stuff, because it cost little extra. Plus, we have added GPUs
    that are now capable of doing huge amounts of FP ops, with uses in
    graphics rendering, HPC and AI.


    Yeah, this was not to say there is not FPU dense code, or that FP-SIMD
    is not useful, but rather that in most general code, ALU and LD/ST ops
    tend to dominate by a fair margin.

    Similarly, SIMD ops may still be useful, even if they are a relative
    minority of the executed instructions (even in code sequences which are actively using SIMD ops...).


    Like, typically, for every SIMD op used, there are also things like:
    The loads and stores to get the value from memory and put the results in memory;
    ALU ops to calculate the index into the array or similar that we are
    loading from or storing to;
    ...

    Well, along with other ops, like shuffles and similar to get the SIMD
    elements into the desired order, etc.


    Like, some of this is why it is difficult to get anywhere near the
    theoretical 200 MFLOPs of the SIMD unit, apart from very contrived
    use-cases (such as running neural-net code), and it has involved wonk,
    like operations which combine a SIMD shuffle into the SIMD ADD/SUB/MUL ops.


    For a lot of other use cases, I can just be happy enough that the SIMD
    ops are "not slow".


    Otherwise, one can debate whether or not having DIV/MOD in hardware
    makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV is
    at least "probably" faster than a generic software-only solution).

    That debate has been held, and MIPS has hardware integer divide, Alpha
    and IA-64 don't have a hardware integer divide; they both have FP
    divide instructions.


    Technically, also, BJX2 has ended up having both Integer DIV and FDIV instructions. But, they don't gain all that much, so are still left as
    optional features.

    The integer divide isn't very fast, but it doesn't matter if it isn't
    used all that often.


    The FDIV is effectively a boat anchor (around 122 clock cycles).
    Though, mostly this was based on the observation that with some minor
    tweaks, the integer divide unit could be made to perform floating-point
    divide as well.

    The main merit though (over a software N-R divider) is that it can
    apparently give exact results (my N-R dividers generally can't converge
    past the low order 4 bits).



    However, looking at more recent architectures, the RISC-V M extension
    (which is part of RV64G and RV32G, i.e., a standard extension) has not
    just multiply instructions (MUL, MULH, MULHU, MULHSU, MULW), but also
    integer divide instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW,
    and REMUW. ARM A64 also has divide instructions (SDIV, UDIV), but
    RISC-V seems significant to me because there the philosophy seems to
    be to go for minimalism. So the debate has apparently come to the
    conclusion that for general-purpose architectures, you include an
    integer divide instruction.


    Yeah.

    I mostly ended up adding integer divide so that the RISC-V mode could
    support the 'M' extension, and if I have it for RISC-V, may as well also
    add it in BJX2 as its own optional extension.


    Had also added a "FAZ DIV" special case that made the integer divide
    faster, where over a limited range of input values, the integer divide
    would be turned into a multiply by reciprocal.

    So, say, DIV latency:
    64-bit: 68 cycle
    32-bit: 36 cycle
    32-bit FAZ: 3 cycle.

    As it so happens, FAZ also covers a similar range to that typically used
    for a rasterizer, but is more limited in that it only handles a range of
    values it can calculate exactly. For my hardware rasterizer, a similar
    strategy was used, but extended to support bigger divisors at the
    tradeoff that it becomes inexact with larger divisors.

    However, adding a "Divide two integers quickly, but may give an
    inaccurate result" instruction would be a bit niche (and normal C code
    couldn't use it without an intrinsic or similar).

    Partial observation is that mostly, the actual bit patterns in the
    reciprocals tends to be fairly repetitive, so it is possible to
    synthesize a larger range of reciprocals using lookup tables and shift-adjustments.
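    The lookup-table approach can be sketched like this (illustrative only; the table size and the 16 fractional bits are arbitrary choices here, and as noted the scheme turns inexact once divisors or dividends grow past the range the table precision covers).

```c
#include <stdint.h>

/* Rasterizer-style divide via a table of 16.16 fixed-point reciprocals.
 * Exact only over a limited range; larger operands drift off the true
 * quotient, which rasterizers can typically tolerate. */
static uint32_t rcp_tab[256];

static void rcp_init(void)
{
    for (uint32_t d = 1; d < 256; d++)
        rcp_tab[d] = (65536u + d - 1) / d;      /* ceil(2^16 / d) */
}

static uint32_t fast_divu(uint32_t x, uint32_t d)   /* d in 1..255 */
{
    return (x * rcp_tab[d]) >> 16;
}
```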

    In the hardware rasterizer, I had experimented with a fixed-point 1/Z
    divider for "more accurate" perspective-correct rasterization, but this
    feature isn't cheap (fixed-point reciprocal is more complicated), and
    ended up not using it for now (in favor of merely dividing S and T by Z
    and then multiplying by the interpolated Z again during rasterization,
    as a sort of less accurate "poor man's" version).
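    The difference between proper perspective correction and the "poor man's" version can be stated as a small sketch (illustrative math only; function names are made up): the proper form interpolates s/z and 1/z and divides, while the cheap form multiplies the interpolated s/z by a linearly interpolated z, which only agrees when z varies little across the primitive.

```c
#include <math.h>

/* Interpolating a texture coordinate s along an edge with depths z0, z1:
 *   exact:     lerp(s0/z0, s1/z1, t) / lerp(1/z0, 1/z1, t)
 *   poor man:  lerp(s0/z0, s1/z1, t) * lerp(z0, z1, t)
 * The latter avoids a per-pixel divide but drifts as z0 and z1 diverge,
 * which is why primitives still need subdividing. */
static double lerp(double a, double b, double t) { return a + (b - a) * t; }

static double persp_exact(double s0, double s1, double z0, double z1, double t)
{
    return lerp(s0 / z0, s1 / z1, t) / lerp(1.0 / z0, 1.0 / z1, t);
}

static double persp_poor(double s0, double s1, double z0, double z1, double t)
{
    return lerp(s0 / z0, s1 / z1, t) * lerp(z0, z1, t);
}
```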

    The "poor man's perspective correct" strategy doesn't eliminate the need
    to subdivide primitives, but does allow the primitives to be somewhat
    larger without having as much distortion (mostly relevant as my GL
    renderer is still mostly CPU bound in the transform and subdivision
    stages, *1).


    In theory, proper perspective correct would entirely eliminate the need
    to subdivide primitives, but would require clipping geometry to the
    viewing frustum (otherwise, the texture coordinates freak out for
    primitives crossing outside the frustum or crossing the near plane).

    Approximate FDIV is possible, but I have typically used a different
    strategy for the reciprocal.



    *1: Though, ironically, this does also mean that, via multi-texturing,
    it is semi-viable to also use lightmap lighting again in GLQuake (since
    this doesn't add too much CPU penalty over the non-multitexture case).


    However, dynamic lightmaps still aren't viable, as the CPU-side cost of
    drawing the dynamic light-sources to the lightmaps, and then uploading
    them, is still a bit too much.

    I suspect, probably in any case, GLQuake wasn't likely written to run on
    a 50MHz CPU.


    Might have been a little easier if the games had poly counts and
    draw distances more on par with PlayStation games (say, if the whole
    scene is kept under 500 polys, rather than 1k-2k polys).

    Say, Quake being generally around 1k-2k polys for the scene, and roughly 300-500 triangles per alias model, ... Despite the scenes looking crude
    with large flat surfaces, most of these surfaces had already been hacked
    up a fair bit by the BSP algorithm (though, theoretically, some
    annoyances could be reduced if the Quake1 maps were rebuilt using a
    modified Q3BSP or similar, as the Quake3 tools natively support vertex
    lighting and also don't hack up the geometry quite as much as the
    original Quake1 BSP tool did, but alas...).

    Wouldn't save much, as I still couldn't legally redistribute the data
    files (and my GLQuake port isn't particularly usable absent using a
    process to convert all the textures into DDS files and rebuilding the
    PAK's and similar, so...).

    To have something redistributable, would need to replace all of the
    textures, sound effects, alias models, etc. An old (probably long dead) "OpenQuartz" project had partly done a lot of this, but "got creative"
    with the character models in a way I didn't like (would have preferred something at least "in the same general spirit" as the original models).

    Similar sort of annoyances as with FreeDoom, but at least FreeDoom
    stayed closer to the original in these areas.



    Also, possibly, I may need to rework some things in terms of how TKRA-GL
    works to better match up with more recent developments (all this is
    still a bit hacky; effectively linking the whole GL implementation to
    the binary that uses GL).

    Granted, it is also possible I may need to at some point move away from
    linking the whole TestKern kernel to any programs that use it as well,
    with the tradeoff that programs would no longer be able to launch "bare
    metal" in the emulator (but would always need the kernel to also be
    present).


    For cases like divide-by-constant, also, typically it is possible to
    turn it in the compiler into a multiply by reciprocal.

    But for that you want a multiply instruction, which in the RISC-V case
    means including the M extension, which also includes divide
    instructions. Multiplying by the reciprocal may still be faster.


    Yeah.

    For the relevant cases, widening multiply for 32-bit integers generally
    has a 3 cycle latency in my case.

    Whereas DIV will often have a 36 cycle latency in the general case.
    So, the choice is obvious.

    Granted, this still generally needs a shift instruction following the
    multiply.


    Having a "do a widening 32-bit multiply and then right-shift the result
    32/33/34/35 bits" instruction would be possible in theory, but a little wonky.

    Could possibly make sense if it could also encode a 32-bit immediate:
    DMULSH33S Rm, Imm32u, Rn //Int32: Rn=(Rm*Imm32)>>33
    DMULSH33U Rm, Imm32u, Rn //UInt32: Rn=(Rm*Imm32)>>33

    Probably don't really have the encoding space to pull this off though
    (with a 64-bit encoding; Could be doable in theory with an 96-bit
    FE-FF-OP encoding though).

    Though, as can be noted, this could likely borrow some of the logic from
    the FAZ-DIV mechanism or similar.

    It would allow handling "y=x/C;" cases in 3L/1T, though integer
    division still isn't really all that common (at least, not enough to
    justify going through all this just to save a right-shift op).
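    For what it's worth, the semantics of the proposed fused op are easy to pin down in C (a sketch of the hypothetical instruction, not an existing one; paired here with the standard unsigned /3 magic constant to show the intended use).

```c
#include <stdint.h>

/* Semantics of the proposed DMULSH33U Rm, Imm32u, Rn:
 * widening 32x32 multiply, then a fixed right shift of 33. */
static uint32_t dmulsh33u(uint32_t rm, uint32_t imm32)
{
    return (uint32_t)(((uint64_t)rm * imm32) >> 33);
}
```

    With Imm32 = 0xAAAAAAAB = ceil(2^33/3), dmulsh33u(x, 0xAAAAAAAB) yields x/3, folding the usual multiply-plus-shift pair into a single operation.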


    - anton

  • From MitchAlsup@21:1/5 to Anton Ertl on Tue Nov 28 01:05:21 2023
    Anton Ertl wrote:

    Quadibloc <quadibloc@servername.invalid> writes:
    On Mon, 27 Nov 2023 09:17:59 +0000, Anton Ertl wrote:

    That debate has been held, and MIPS has hardware integer divide, Alpha
    and IA-64 don't have a hardware integer divide; they both have FP divide
    instructions.

    I didn't know this about the Itanium. All I remembered hearing was that
    the Itanium "didn't have a divide instruction", and so I didn't realize
    this applied to fixed-point arithmetic only.

    You are right, I was wrong. <http://gec.di.uminho.pt/discip/minf/ac0203/icca03/ia64fpbf1.pdf>
    says:

    |A number of floating-point operations defined by the IEEE Standard are
    |deferred to software by the IA-64 architecture in all its
    |implementations
    |
    |* floating-point divide (integer divide, which is based on the
    | floating-point divide operation, is also deferred to software)

    The paper goes on to describe that FP division a/b is based on
    determining 1/b through Newton-Raphson approximation, using the FMA instruction, and then multiplying with a. It shows a sequence of 13 instructions for double precision, 1 frcpa and 12 fma or fnma
    instructions.

    Given that integer division is based on FP division, it probably takes
    even more instructions.

    You have to synthesize the bits between 63 and 53 to get 64/64->64.

    Meanwhile, a lowly Cortex-A53 with its divide instruction produces an
    integer division result with a small quotient in a few cycles. In
    other words, if you want to use that method for division, still
    provide a division instruction, and then implement it with that
    method. This provides the flexibility to switch to some other method
    (e.g., the one used by ARM) later.

    - anton

  • From MitchAlsup@21:1/5 to BGB on Tue Nov 28 00:58:32 2023
    BGB wrote:

    On 11/26/2023 7:29 PM, Quadibloc wrote:
    On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:

    But Integer Multiply and Divide can share the FPU that does these.

    But giving each one its own multiplier means more superscalar
    goodness!

    But you _do_ have a good point. This kind of "superscalar goodness"
    is usually wasted, because most programs don't do an equal amount
    of integer and FP arithmetic. So it would be way better to have
    twice as many multipliers and dividers that are designed to be used
    either way than having units of two kinds.

    I'm assuming, though, that one can make the multipliers and dividers
    simpler, hence faster, by making them single-use, and that it's
    possible to know, for a given use case, what the ratio between integer
    and FP operations will be. After all, integer multiplication discards
    the MSBs of the result, and floating-point multiplication discards the
    LSBs of the result.


    If one wants to have the most superscalar goodness, for general case
    code, this means mostly lots of simple ALU ops (particularly, ADD/SUB/AND/OR/SHIFT), and ability to do *lots* of memory Load/Store ops.

    Where, say, Load/Store ops and simple ALU ops tend to dominate over
    pretty much everything else in terms of usage frequency.

    LD+ST 33%
    Int 45%

    In most code, FPU ops are comparably sparse,

    FP 9%
    as is MUL,
    MUL 2%
    and DIV/MOD is
    DIV 0.3%
    fairly rare in comparison to either. MUL is common enough that you want
    it to be fast when it happens, but not so common to where one is dealing
    with piles of them all at the same time (except maybe in edge cases,
    like the DCT transform in JPEG/MPEG).

    MUL is much more common in FORTRAN than in C-like languages as it forms
    the underlying math for multidimensional arrays, and more ×s exist in
    typical number crunching code (F) than in pointer chasing (C) languages.

    If FMUL and MUL can't be co issued, this is unlikely to matter.

    Since both are pipelined, this is a small stall {{Unlike MIPS R2000 where
    IMUL was 12 cycles blocking while FMUL was 3 (or was it 4) pipelined.}}

    DIV is uncommon enough that it can be 80 or 100 cycles or similar, and
    most code will not notice (except in rasterizers or similar, which
    actually need fast DIV, but can cheat in that they don't usually need
    exact DIV and the divisors are small, allowing for "shortcuts", such as
    using lookup tables of reciprocals or similar).

    We know that FDIV can be performed in 17 cycles in the FMUL unit, and
    this has been known since 1989......why are you making it so long ??

    0.3% IDIV × 100 cycles and you have added 0.3 cycles to the average
    instruction !!!

    Otherwise, one can debate whether or not having DIV/MOD in hardware
    makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV is
    at least "probably" faster than a generic software-only solution).

    With an 8-bit start, and Newton-Raphson iteration, SW should be able
    to do this in mid-30-cycles. {{See ITANIC FDIV SW profile}}

    For cases like divide-by-constant, also, typically it is possible to
    turn it in the compiler into a multiply by reciprocal.

    And you should if it helps.

    And we have the need to use FP values in deciding to take branches.

    What gets used in taking branches is the _condition codes_. They're
    conveyed from the integer and floating point functional units, as well
    as all the other functional units, to a central place (the program
    status word).


    Yeah, better to not have this style of condition codes...

    RISC-V, MIPS, Mc 88K, My 66000, have no condition codes......

    Like, there are other ways of doing conditional branches...

    "Compare to zero and branch" × {Signed, Unsigned, FP16, FP32, FP64}
    for 1-instruction FP branches.

    Compare and then branch on anything (2 inst :: 1 int, 1 FP) your heart desires:
    Branch {NaN, -Infinity, +Infinity, -DeNorm, +DeNorm, -Normal, +Normal,
    Zero, -zero, +zero, < <= > >= == !=, 0 < x < y, 0 <= x < y, 0 < x <= y,
    0 <= x <= y, ... }

    Can you even record all these states in a single condition code ??

    John Savard

  • From MitchAlsup@21:1/5 to Anton Ertl on Tue Nov 28 01:03:39 2023
    Anton Ertl wrote:

    BGB <cr88192@gmail.com> writes:
    On 11/26/2023 7:29 PM, Quadibloc wrote:
    On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:

    But Integer Multiply and Divide can share the FPU that does these.

    But giving each one its own multiplier means more superscalar
    goodness!

    Having two multipliers that serve both purposes means even more
    superscalar goodness for similar area cost. However, there is the
    issue of latency. The Willamette ships the integers over to the FPU
    for multiplication, and the result back, crossing several clock
    domains (at one clock loss per domain crossing), resulting in a
    10-cycle latency for integer multiplication. I think that these days
    every high-performance core with real silicon invests into separate
    GPR and FP/SIMD (including integer SIMD) multipliers.

    In most code, FPU ops are comparably sparse

    In terms of executed ops, that depends very much on the code. GP
    cores have acquired SIMD cores primarily for FP ops, as can be seen by
    both SSE and supporting only FP at first, and only later adding
    integer stuff, because it cost little extra. Plus, we have added GPUs
    that are now capable of doing huge amounts of FP ops, with uses in
    graphics rendering, HPC and AI.

    Once SIMD gains Integer operations, the Multiplier has to be built to
    do both, might as well use it for more things than just SIMD.

    Otherwise, one can debate whether or not having DIV/MOD in hardware
    makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV is
    at least "probably" faster than a generic software-only solution).

    That debate has been held, and MIPS has hardware integer divide, Alpha
    and IA-64 don't have a hardware integer divide; they both have FP
    divide instructions.

    And all of these led long, fruitful, productive lives before taking over -------Oh wait !!

    However, looking at more recent architectures, the RISC-V M extension
    (which is part of RV64G and RV32G, i.e., a standard extension) has not
    just multiply instructions (MUL, MULH, MULHU, MULHSU, MULW), but also
    integer divide instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW,
    and REMUW.

    All of which are possible in My 66000 using operand sign control, S-bit,
    and CARRY when you want 64×64->128 or 128/64->{64 quotient, 64 remainder}

    ARM A64 also has divide instructions (SDIV, UDIV), but
    RISC-V seems significant to me because there the philosophy seems to
    be to go for minimalism. So the debate has apparently come to the
    conclusion that for general-purpose architectures, you include an
    integer divide instruction.

    Several forms of integer DIV at least signed and unsigned.....

  • From Quadibloc@21:1/5 to Anton Ertl on Tue Nov 28 04:50:20 2023
    On Mon, 27 Nov 2023 15:54:03 +0000, Anton Ertl wrote:

    Quadibloc <quadibloc@servername.invalid> writes:
    On Mon, 27 Nov 2023 09:17:59 +0000, Anton Ertl wrote:

    That debate has been held, and MIPS has hardware integer divide, Alpha
    and IA-64 don't have a hardware integer divide; they both have FP
    divide instructions.

    I didn't know this about the Itanium. All I remembered hearing was that
    the Itanium "didn't have a divide instruction", and so I didn't realize
    this applied to fixed-point arithmetic only.

    You are right, I was wrong. <http://gec.di.uminho.pt/discip/minf/ac0203/icca03/ia64fpbf1.pdf> says:

    |A number of floating-point operations defined by the IEEE Standard are
    |deferred to software by the IA-64 architecture in all its
    |implementations
    |
    |* floating-point divide (integer divide, which is based on the
    |  floating-point divide operation, is also deferred to software)

    The paper goes on to describe that FP division a/b is based on
    determining 1/b through Newton-Raphson approximation, using the FMA instruction, and then multiplying with a. It shows a sequence of 13 instructions for double precision, 1 frcpa and 12 fma or fnma
    instructions.

    Given that integer division is based on FP division, it probably takes
    even more instructions.

    The fastest _hardware_ implementations of FP division are Goldschmidt
    division and Newton-Raphson division, and they both make use of
    multiplication hardware.

    So using Newton-Raphson division to increase the width of an FP division
    result to perform 64-bit integer division won't be too hard.

    Since the Wikipedia article didn't go into detail, I had to look further
    to find out that you _were_ right about the DEC Alpha, however. It did
    have floating divide but not integer divide. The 21264 added floating
    square root.

    John Savard

  • From Anton Ertl@21:1/5 to MitchAlsup on Tue Nov 28 09:10:37 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Anton Ertl wrote:
    <http://gec.di.uminho.pt/discip/minf/ac0203/icca03/ia64fpbf1.pdf>
    says:

    |A number of floating-point operations defined by the IEEE Standard are
    |deferred to software by the IA-64 architecture in all its
    |implementations
    |
    |* floating-point divide (integer divide, which is based on the
    | floating-point divide operation, is also deferred to software)

    The paper goes on to describe that FP division a/b is based on
    determining 1/b through Newton-Raphson approximation, using the FMA
    instruction, and then multiplying with a. It shows a sequence of 13
    instructions for double precision, 1 frcpa and 12 fma or fnma
    instructions.

    Given that integer division is based on FP division, it probably takes
    even more instructions.

    You have to synthesize the bits between 63 and 53 to get 64/64->64.

    The paper says that they use double-extended division to synthesize
    64/64-bit division; I expect that double-extended uses even more
    Newton-Raphson steps (and FMAs) than double precision; and then there
    are the additional steps for getting an integer.

    The paper only shows signed 16/16 bit division, a 15-instruction
    sequence including 1 frcpa and 3 fma/fnma instructions. So you can
    expect that 64/64 is at least 13+15-4=24 instructions long, probably
    longer (because of double-extended precision).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to MitchAlsup on Tue Nov 28 09:20:44 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Several forms of integer DIV at least signed and unsigned.....

    Gforth has unsigned, floored (signed) and symmetric (signed), in double-cell/single-cell variants (standardized in Forth) as well as in single/single variants.

    Floored rounds the quotient down, symmetric (often called truncated)
    rounds the quotient towards 0. There is also something called
    Euclidean that always produces a remainder >=0, but the practical
    difference from floored is when the divisor is negative, which is
    exceedingly rare.
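    The three conventions differ only in how the quotient rounds, which is easy to pin down in C (a sketch built on C's native truncating division):

```c
/* Truncated (symmetric): C's native behavior, quotient rounds toward 0. */
static int div_trunc(int a, int b) { return a / b; }

/* Floored: quotient rounds toward -infinity; the remainder then takes
 * the sign of the divisor. */
static int div_floor(int a, int b)
{
    int q = a / b;
    if ((a % b != 0) && ((a < 0) != (b < 0)))
        q--;
    return q;
}

/* Euclidean: the remainder is always >= 0; this differs from floored
 * only when the divisor is negative. */
static int div_eucl(int a, int b)
{
    int q = a / b, r = a % b;
    if (r < 0)
        q += (b > 0) ? -1 : 1;
    return q;
}
```

    For a positive divisor the floored and Euclidean results coincide, which is why the distinction rarely matters in practice.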

    Given the rareness of negative divisors, one might wonder about signed/unsigned, but at least implementations of current programming
    languages would probably make little use of such instructions even if
    they were cheaper than the signed/signed ones.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From BGB@21:1/5 to MitchAlsup on Tue Nov 28 04:32:58 2023
    On 11/27/2023 6:58 PM, MitchAlsup wrote:
    BGB wrote:

    On 11/26/2023 7:29 PM, Quadibloc wrote:
    On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:

    But Integer Multiply and Divide can share the FPU that does these.

    But giving each one its own multiplier means more superscalar
    goodness!

    But you _do_ have a good point. This kind of "superscalar goodness"
    is usually wasted, because most programs don't do an equal amount
    of integer and FP arithmetic. So it would be way better to have
    twice as many multipliers and dividers that are designed to be used
    either way than having units of two kinds.

    I'm assuming, though, that one can make the multipliers and dividers
    simpler, hence faster, by making them single-use, and that it's
    possible to know, for a given use case, what the ratio between integer
    and FP operations will be. After all, integer multiplication discards
    the MSBs of the result, and floating-point multiplication discards the
    LSBs of the result.


    If one wants to have the most superscalar goodness, for general case
    code, this means mostly lots of simple ALU ops (particularly,
    ADD/SUB/AND/OR/SHIFT), and ability to do *lots* of memory Load/Store ops.

    Where, say, Load/Store ops and simple ALU ops tend to dominate over
    pretty much everything else in terms of usage frequency.

    LD+ST 33%
    Int   45%

    In most code, FPU ops are comparably sparse,

    FP     9%
                                                 as is MUL,
    MUL    2%
                                                            and DIV/MOD is
    DIV  0.3%

    Yeah, pretty much...


    fairly rare in comparison to either. MUL is common enough that you
    want it to be fast when it happens, but not so common to where one is
    dealing with piles of them all at the same time (except maybe in edge
    cases, like the DCT transform in JPEG/MPEG).

    MUL is much more common in FORTRAN than in C-like languages as it forms
    the underlying math for multidimensional arrays, and more ×s exist in typical number crunching code (F) than in pointer chasing (C) languages.


    Yeah.

    Multidimensional arrays are comparably rarely used in C land, and when
    they are used, they typically have power-of-2 dimensions.


    If FMUL and MUL can't be co-issued, this is unlikely to matter.

    Since both are pipelined, this is a small stall {{Unlike MIPS R2000 where IMUL was 12 cycles blocking while FMUL was 3 (or was it 4) pipelined.}}


    Slow IMUL but fast FMUL: WTF?...
    Unless this was only counting single precision, then it could make more
    sense.


    DIV is uncommon enough that it can be 80 or 100 cycles or similar, and
    most code will not notice (except in rasterizers or similar, which
    actually need fast DIV, but can cheat in that they don't usually need
    exact DIV and the divisors are small, allowing for "shortcuts", such
    as using lookup tables of reciprocals or similar).

    We know that FDIV can be performed in 17 cycles in the FMUL unit, and
    this has been known since 1989......why are you making it so long ??

    0.3% IDIV × 100 cycles and you have added 3 cycles to the average instruction !!!


    Can note that using DIV/MOD seems a lot rarer in my case...

    But, I can note that my compiler turns constant divide into alternatives:
    y=x/C; //~ y=(x*RCP)>>SHR;
    y=x%C; //~ y=x-(((x*RCP)>>SHR)*C);

    Though, with some extra wonk thrown in so negative values round the
    divide towards zero and similar (otherwise, negative values would round
    away from zero, which is not the commonly accepted result).

    y=((x+((x<0)?(C-1):0))*RCP)>>SHR;

    The variable cases are comparably infrequent.
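    As a concrete sketch of the transform being described (the constant
    below is the classic unsigned divide-by-10 magic pair, not necessarily
    what BGBCC actually emits):

```c
#include <stdint.h>

/* x/10 via multiply-by-reciprocal: RCP = ceil(2^35 / 10) = 0xCCCCCCCD,
   SHR = 35.  This particular pair is exact for every 32-bit unsigned x;
   the signed round-toward-zero fixup is the extra wonk described above. */
static uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDull) >> 35);
}
```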



    Otherwise, one can debate whether or not having DIV/MOD in hardware
    makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV
    is at least "probably" faster than a generic software-only solution).

    With an 8-bit start, and Newton-Raphson iteration, SW should be able
    to do this in mid-30-cycles. {{See ITANIC FDIV SW profile}}


    I was thinking of integer divide here (being 68 cycle for DIVS.Q/DIVU.Q
    in my case; 36 for DIVS.L/DIVU.L).

    But, yeah, the FDIV instruction is even slower (120 cycles); and
    software Newton-Raphson will win this race...
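    For illustration, the Newton-Raphson reciprocal iteration looks like
    this in C (the seed constant and iteration count are assumptions for
    the sketch; a real FDIV replacement still needs a correctly rounded
    final step):

```c
#include <stdint.h>
#include <string.h>

/* Reciprocal by Newton-Raphson: r' = r*(2 - d*r) squares the relative
   error each step, so even a crude seed converges in a few multiply-add
   shaped operations. */
static double nr_recip(double d)      /* assumes d positive and normal */
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    u = 0x7FDE000000000000ull - u;    /* exponent-flip seed, ~6-8% error */
    double r;
    memcpy(&r, &u, sizeof r);
    for (int i = 0; i < 5; i++)       /* ~8% -> ~1e-16 relative error */
        r = r * (2.0 - d * r);
    return r;                         /* close, but NOT correctly rounded */
}
```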


    Partly this is the drawback of using a Shift-Add design.


    Say, every clock-cycle, the value is shifted left by 1 bit, and logic determines whether or not to add a second value to the working value.

    For multiply, you can use the bits of one input to mask whether or not
    to add the other input. For divide, the carry of the adder can be used
    to mask whether to add the inverse of the divisor.

    For FDIV, one runs the divider for more clock cycles, and it generates
    bits below the binary point.
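    A software model of that loop (restoring form for clarity; where this
    uses a compare, the hardware uses the adder's carry-out):

```c
#include <stdint.h>
#include <stddef.h>

/* One quotient bit per clock, exactly the shift-subtract loop described:
   shift the remainder left, bring in the next dividend bit, and subtract
   the divisor whenever it fits.  Requires d != 0. */
static uint32_t shift_sub_div(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint64_t r = 0;                       /* working remainder */
    uint32_t q = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1);    /* shift in next dividend bit */
        q <<= 1;
        if (r >= d) {                     /* adder carry-out in hardware */
            r -= d;
            q |= 1;
        }
    }
    if (rem) *rem = (uint32_t)r;
    return q;
}
```

    Running the same loop for extra steps past the integer bits is what
    produces the fraction bits mentioned for FDIV.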



    Though, at least for the merit of an Integer DIV op, it is faster to use
    a 36-cycle integer divide instruction than it is to run a shift-subtract
    loop in software.



    For cases like divide-by-constant, also, typically it is possible to
    turn it in the compiler into a multiply by reciprocal.

    And you should if it helps.


    Yeah, it does.
    Multiply by reciprocal is considerably faster.


    And we have the need to use FP values in deciding to take branches.

    What gets used in taking branches is the _condition codes_. They're
    conveyed from the integer and floating point functional units, as well
    as all the other functional units, to a central place (the program
    status word).


    Yeah, better to not have this style of condition codes...

    RISC-V, MIPS, Mc 88K, My 66000, have no condition codes......


    Yeah.

    In BJX2, there is only SR.T, however, the way it is updated and used is
    much narrower in scope.

    A case could be made for narrowing its scope further though.


    Like, there are other ways of doing conditional branches...

    "Compare to zero and branch" × {Signed, Unsigned, FP16, FP32, FP64}
    for 1 instruction FP branches


    Yeah, I had ended up adding this strategy as well as the prior SR.T
    mechanism.


    Compare and then branch on anything (2 inst :: 1 int, 1 FP) your heart
    desires: Branch {NaN, -Infinity, +Infinity, -DeNorm, +DeNorm, -Normal,
    +Normal, Zero, -Zero, +Zero, < <= > >= == !=, 0 < x < y, 0 <= x < y,
    0 < x <= y, 0 <= x <= y, ... }
    Can you even record all these states in a single condition code ??


    Possible.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marko Zec@21:1/5 to MitchAlsup on Tue Nov 28 13:16:48 2023
    MitchAlsup <mitchalsup@aol.com> wrote:
    ...
    0.3% IDIV × 100 cycles and you have added 3 cycles to the average
    instruction !!!

    Not really, more likely it would be a 0.33 average CPI bump, assuming
    a 1.0 baseline CPI for non-DIV instructions. Still way too much to be ignored...

  • From BGB@21:1/5 to Marko Zec on Tue Nov 28 13:00:11 2023
    On 11/28/2023 7:16 AM, Marko Zec wrote:
    MitchAlsup <mitchalsup@aol.com> wrote:
    ...
    0.3% IDIV × 100 cycles and you have added 3 cycles to the average
    instruction !!!

    Not really, more likely it would be a 0.33 average CPI bump, assuming
    a 1.0 baseline CPI for non-DIV instructions. Still way too much to be ignored...

    I missed that, but yeah.

    Still, it seems the incidence of the integer DIV/MOD/etc instructions
    being used is somewhat rarer than this estimate in the code I was measuring.

    Checking my compiler output for static use count:
    Around 0.05% ...


    It might have been more common though, if my compiler didn't
    aggressively eliminate the divide by constant cases (and my coding
    practices didn't actively avoid using division in general, etc).

    Since, between either a slowish DIV instruction, or a software
    shift-subtract loop, neither was expected to be all that fast.


    On pretty much every system I have used IRL, division has tended to be
    fairly slow, so I had tended to avoid it whenever possible.


    In cases where fast divide was needed, there were often other
    workarounds (such as using a lookup table of reciprocals).


    If inexact results are acceptable, it is possible to stretch the table
    of reciprocals up to pretty much arbitrary divisors (via normalization
    and some shift-related trickery).
    In software, this is helped with a CLZ instruction or similar, but had
    also used a variant of this effectively in my hardware rasterizer module.

    Say:
    z=x/y;
    y:
    0.. 31: Direct multiply by MAX/y (Table A)
    32.. 63: Normalized multiply by MAX/y (Table B)
    64..127: Use (y>>1) as index to Table B, add 1 to final SHR
    128..255: Use (y>>2) as index to Table B, add 2 to final SHR
    ...

    The sizes of Tables A/B can be reduced (via another lookup table) by
    making use of repetitions in the bit patterns (though, this is more
    practical in Verilog than in software).
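    A software sketch of that normalization scheme (the table size and
    16-bit fixed-point width here are illustrative, not the actual
    Verilog):

```c
#include <stdint.h>

/* Approximate x/y with one small reciprocal table: entries hold
   ceil(2^16 / y) for y in [1,63]; larger divisors are shifted down
   into range and the shift is added back to SHR.  Inexact by design. */
static uint32_t approx_div(uint32_t x, uint32_t y)    /* y >= 1 */
{
    static uint32_t rcp[64];
    if (!rcp[1])                           /* lazy one-time table fill */
        for (uint32_t i = 1; i < 64; i++)
            rcp[i] = (0x10000u + i - 1) / i;

    int shr = 16;
    while (y >= 64) { y >>= 1; shr++; }    /* normalize the divisor */
    return (uint32_t)(((uint64_t)x * rcp[y]) >> shr);
}
```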

    ...


    Though, getting acceptable results from fixed-point division is a little
    more of an issue.

  • From Stephen Fuld@21:1/5 to BGB on Tue Nov 28 11:22:48 2023
    On 11/28/2023 11:00 AM, BGB wrote:

    snip

    Still, it seems the incidence of the integer DIV/MOD/etc instructions
    being used is somewhat rarer than this estimate in the code I was
    measuring.

    Is this a result of your using mostly a few old games as your
    "reference" code? i.e. your code base may be different from what gave
    rise to the statistics mentioned by others.

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From MitchAlsup@21:1/5 to Quadibloc on Tue Nov 28 19:37:04 2023
    Quadibloc wrote:

    On Mon, 27 Nov 2023 15:54:03 +0000, Anton Ertl wrote:

    Quadibloc <quadibloc@servername.invalid> writes:
    On Mon, 27 Nov 2023 09:17:59 +0000, Anton Ertl wrote:

    That debate has been held, and MIPS has hardware integer divide, Alpha
    and IA-64 don't have a hardware integer divide; they both have FP
    divide instructions.

    I didn't know this about the Itanium. All I remembered hearing was that
    the Itanium "didn't have a divide instruction", and so I didn't realize
    this applied to fixed-point arithmetic only.

    You are right, I was wrong.
    <http://gec.di.uminho.pt/discip/minf/ac0203/icca03/ia64fpbf1.pdf> says:

    |A number of floating-point operations defined by the IEEE Standard are
    |deferred to software by the IA-64 architecture in all its
    |implementations
    |* floating-point divide (integer divide, which is based on the
    |floating-point divide operation, is also deferred to software)

    The paper goes on to describe that FP division a/b is based on
    determining 1/b through Newton-Raphson approximation, using the FMA
    instruction, and then multiplying with a. It shows a sequence of 13
    instructions for double precision, 1 frcpa and 12 fma or fnma
    instructions.

    Given that integer division is based on FP division, it probably takes
    even more instructions.

    The fastest _hardware_ implementations of FP division are Goldschmidt
    division and Newton-Raphson division, and they both make use of
    multiplication hardware.

    Goldschmidt is about 2× faster than N-R because:: the Goldschmidt inner
    loop has 2 independent multiplies, while N-R has 2 dependent multiplies.
    To meet IEEE 754 accuracy, the last iteration in Goldschmidt is ½ an N-R
    iteration {{you perform the N-R 1st step and then use the sign bit to
    determine +1 or +2, and the +0 can be determined while the multiply is
    transpiring.}}

    So using Newton-Raphson division to increase the width of an FP division result to perform 64-bit integer division won't be too hard.

    Since the Wikipedia article didn't go into detail, I had to look further
    to find out that you _were_ right about the DEC Alpha, however. It did
    have floating divide but not integer divide. The 21264 added floating
    square root.

    John Savard

  • From BGB@21:1/5 to Stephen Fuld on Tue Nov 28 14:09:19 2023
    On 11/28/2023 1:22 PM, Stephen Fuld wrote:
    On 11/28/2023 11:00 AM, BGB wrote:

    snip

    Still, it seems the incidence of the integer DIV/MOD/etc instructions
    being used is somewhat rarer than this estimate in the code I was
    measuring.

    Is this a result of your using mostly a few old games as your
    "reference" code? i.e. your code base may be different from what gave
    rise to the statistics mentioned by others.


    Possibly.

    There is my C library and kernel/runtime code;
    And games like Doom and Quake and similar.

    I suspect probably ID also avoided integer division when possible as
    well, as it is rarely used (and was probably still slow back in
    early/mid 90s PCs).

    Looking up, 486 and Pentium 1 still had ~ 40+ cycle 32-bit DIV/IDIV.
    ...


    Looks like MUL/IMUL had similar latency to DIV/IDIV on the 386 and 486,
    but then the latency dropped down a fair bit on the Pentium.

    However, Doom makes extensive use of integer multiply.



    Given the timings, I suspect it is possible they were also using a
    Shift-Add design for the multiply/divide unit.

    In my case, I am using Shift-Add for DIV and also 64-bit multiply, but a
    faster DSP48 based multiplier for 32-bit multiply.

    If I were to make a guess based on the latency, I would guess the P1 was
    using something like a radix-16 long-multiply.


    Comparably, it seems DIV/IDIV didn't actually get much faster until the
    Ryzen.



    Have noted that these old games did have a nifty trick for calculating approximate distance, say:
    dx=x0-x1;
    dy=y0-y1;
    adx=dx^(dx>>31);
    ady=dy^(dy>>31);
    if(ady>adx)
    { t=adx; adx=ady; ady=t; } //common
    // { adx^=ady; ady^=adx; adx^=ady; } //sometimes
    d=adx+(ady>>1);


    Can be extended to 3D without too much issue, but for 4D, the sorting
    step becomes more of an issue.

    Kinda makes sense though, as finding the square-root of things is also expensive...
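    Packaged as a function, the trick above is the classic "alpha max plus
    beta min" estimate with alpha=1, beta=1/2: it overestimates the true
    Euclidean distance by at most about 11.8%, and is exact on the axes.
    (This sketch uses the fully corrected branch-free abs, and assumes
    arithmetic right shift on signed ints, as essentially all targets
    provide.)

```c
/* d = max(|dx|,|dy|) + min(|dx|,|dy|)/2, all branch-free except the
   final select. */
static int approx_dist(int x0, int y0, int x1, int y1)
{
    int dx = x0 - x1, dy = y0 - y1;
    int sx = dx >> 31, sy = dy >> 31;     /* sign masks: -1 or 0 */
    int adx = (dx ^ sx) - sx;             /* branch-free abs */
    int ady = (dy ^ sy) - sy;
    return (ady > adx) ? ady + (adx >> 1)
                       : adx + (ady >> 1);
}
```

    The quoted game code omits the "- sx" correction, so its abs is off by
    one for negative deltas; for an estimate this coarse, that hardly
    matters.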


    But, not sure how a lot of this compares with other (non-game) coding practices.


    Have noticed though that a lot of these games really do like using lots
    of global variables (a lot more so than I would prefer to use in my
    coding practices).

    Though, this is one case where it is useful to be able to load/store
    global variables in a single instruction (and using something like a GOT
    may have a significant penalty on code that uses lots of global variables).

  • From BGB@21:1/5 to MitchAlsup on Tue Nov 28 14:13:26 2023
    On 11/27/2023 7:03 PM, MitchAlsup wrote:
    Anton Ertl wrote:

    BGB <cr88192@gmail.com> writes:
    On 11/26/2023 7:29 PM, Quadibloc wrote:
    On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:

    But Integer Multiply and Divide can share the FPU that does these.

    But giving each one its own multiplier means more superscalar
    goodness!

    Having two multipliers that serve both purposes means even more
    superscalar goodness for similar area cost.  However, there is the
    issue of latency.  The Willamette ships the integers over to the FPU
    for multiplication, and the result back, crossing several clock
    domains (at one clock loss per domain crossing), resulting in a
    10-cycle latency for integer multiplication.  I think that these days
    every high-performance core with real silicon invests into separate
    GPR and FP/SIMD (including integer SIMD) multipliers.

    In most code, FPU ops are comparably sparse

    In terms of executed ops, that depends very much on the code.  GP
    cores have acquired SIMD cores primarily for FP ops, as can be seen by
    both SSE and supporting only FP at first, and only later adding
    integer stuff, because it cost little extra.  Plus, we have added GPUs
    that are now capable of doing huge amounts of FP ops, with uses in
    graphics rendering, HPC and AI.

    Once SIMD gains Integer operations, the Multiplier has to be built to
    do both, might as well use it for more things than just SIMD.


    Well, and/or just use separate multipliers for each, under the rationale
    that, at least on an FPGA, it already has the DSP48's whether or not one
    uses them (and a lot more DSP48's than I can really make effective use of).

    Well, with the caveat that they only natively do signed 18-bit multiply.



    Otherwise, one can debate whether or not having DIV/MOD in hardware
    makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV
    is at least "probably" faster than a generic software-only solution).

    That debate has been held, and MIPS has hardware integer divide, Alpha
    and IA-64 don't have a hardware integer divide; they both have FP
    divide instructions.

    And all of these led long, fruitful, productive lives before taking over -------Oh wait !!


    MIPS at least did moderately well for a time, might have done better if
    it were more open.

    Both Alpha and IA-64 had a lot of questionable seeming design choices.


    However, looking at more recent architectures, the RISC-V M extension
    (which is part of RV64G and RV32G, i.e., a standard extension) has not
    just multiply instructions (MUL, MULH, MULHU, MULHSU, MULW), but also
    integer divide instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW,
    and REMUW.

    All of which are possible in My 66000 using operand sign control, S-bit,
    and CARRY when you want 64×64->128 or 128/64->{64 quotient, 64 remainder}

    ARM A64 also has divide instructions (SDIV, UDIV), but
    RISC-V seems significant to me because there the philosophy seems to
    be to go for minimalism.  So the debate has apparently come to the
    conclusion that for general-purpose architectures, you include an
    integer divide instruction.

    Several forms of integer DIV at least signed and unsigned.....

    I had left it as optional, but I guess, optional is not the same as
    absent, and my main profiles had ended up including it.


    The ops are generally given names like:
    DIVS.L, DIVU.L, DIVS.Q, DIVU.Q
    MODS.L, MODU.L, MODS.Q, MODU.Q

    One may notice a curious level of overlap with the RISC-V ops here...

    Though, for 32-bit:
    MULS.L, MULU.L: Perform a narrow multiply.
    DMULS.L, DMULU.L: Perform a widening multiply.
    Along with:
    MULS.Q, MULU.Q: 64-bit multiply
    MULHS.Q, MULHU.Q: 64-bit multiply, high 64 bits of result


    DIV, MOD, and 64-bit multiply, is drastically slower than 32-bit
    multiply though.

  • From MitchAlsup@21:1/5 to BGB on Thu Nov 30 16:23:05 2023
    BGB wrote:

    Have noted that these old games did have a nifty trick for calculating approximate distance, say:
    dx=x0-x1;
    dy=y0-y1;
    adx=dx^(dx>>31);
    ady=dy^(dy>>31);
    if(ady>adx)
    { t=adx; adx=ady; ady=t; } //common
    // { adx^=ady; ady^=adx; adx^=ady; } //sometimes
    d=adx+(ady>>1);

    Why not::

    dx=x0-x1;
    dy=y0-y1;
    adx=dx^(dx>>31);
    ady=dy^(dy>>31);
    if(ady>adx)
    d=adx+(ady>>1);
    else
    d=ady+(adx>>1);

  • From MitchAlsup@21:1/5 to BGB on Thu Nov 30 16:25:58 2023
    BGB wrote:

    64-bit multiply, is drastically slower than 32-bit
    multiply though.

    Should only be 1 gate of delay longer......

  • From MitchAlsup@21:1/5 to Robert Finch on Thu Nov 30 17:50:10 2023
    Robert Finch wrote:

    On 2023-11-30 11:25 a.m., MitchAlsup wrote:
    BGB wrote:

                  64-bit multiply, is drastically slower than 32-bit
    multiply though.

    Should only be 1 gate of delay longer......

    In FPGA land using DSPs, a 24x16 multiply is single cycle, and 32x32
    could be single cycle if a lower fmax is acceptable, but 64x64 is best
    pipelined over several cycles (6). Thor had a fast 24x16 multiply as it
    is good enough for array address calcs in a lot of circumstances.

    Nothing stops you from building a multiplier in LUTs with Booth
    recoding, each pair of LUTs being a 3-input XOR and a 3-input Majority
    gate. This also gets rid of BGB's argument on building an incomplete
    multiplier tree {because IEEE needs 53 bits while your DSP only
    supplies 48.}

  • From MitchAlsup@21:1/5 to BGB on Thu Nov 30 20:27:39 2023
    BGB wrote:

    On 11/30/2023 10:23 AM, MitchAlsup wrote:
    BGB wrote:

    Have noted that these old games did have a nifty trick for calculating
    approximate distance, say:
       dx=x0-x1;
       dy=y0-y1;
       adx=dx^(dx>>31);
       ady=dy^(dy>>31);
       if(ady>adx)
         { t=adx; adx=ady; ady=t; }         //common
    //  { adx^=ady; ady^=adx; adx^=ady; }  //sometimes
       d=adx+(ady>>1);

    Why not::

        dx=x0-x1;
        dy=y0-y1;
        adx=dx^(dx>>31);
        ady=dy^(dy>>31);
        if(ady>adx)
            d=adx+(ady>>1);
        else
            d=ady+(adx>>1);

    As long as ISA has max() and min():: Why not::

        dx=x0-x1;
        dy=y0-y1;
        adx=dx^(dx>>31);
        ady=dy^(dy>>31);
    d = min( adx, ady ) + (max( adx, ady ) >> 1);



  • From MitchAlsup@21:1/5 to BGB on Thu Nov 30 23:15:11 2023
    BGB wrote:

    On 11/30/2023 2:27 PM, MitchAlsup wrote:
    BGB wrote:

    On 11/30/2023 10:23 AM, MitchAlsup wrote:
    BGB wrote:

    Have noted that these old games did have a nifty trick for
    calculating approximate distance, say:
       dx=x0-x1;
       dy=y0-y1;
       adx=dx^(dx>>31);
       ady=dy^(dy>>31);
       if(ady>adx)
         { t=adx; adx=ady; ady=t; }         //common
    //  { adx^=ady; ady^=adx; adx^=ady; }  //sometimes
       d=adx+(ady>>1);

    Why not::

         dx=x0-x1;
         dy=y0-y1;
         adx=dx^(dx>>31);
         ady=dy^(dy>>31);
         if(ady>adx)
             d=adx+(ady>>1);
         else
             d=ady+(adx>>1);

    As long as ISA has max() and min():: Why not::

        dx=x0-x1;
        dy=y0-y1;
        adx=dx^(dx>>31);
        ady=dy^(dy>>31);
        d = min( adx, ady ) + (max( adx, ady ) >> 1);


    Yeah, I guess that could work as well for 2D...


    Though, makes more sense to use the "__int32_min" intrinsic or similar, rather than the generic C library "min()"/"max()" macros.
    #define min(a, b) (((a)<(b))?(a):(b))
    #define max(a, b) (((a)>(b))?(a):(b))

    max() and min() are single instructions in my ISA. On the FP side, they
    even get NaNs put in the proper places, too.

    Though, apparently the existence of these macros in stdlib.h is
    controversial (they are apparently sort of a POSIX legacy feature, and
    not part of the C standard).

    It took Brian quite some time to recognize that crap and turn it all into
    max() and min() instructions.

    Mostly, due to implementation issues in BGBCC, ?: needs an actual
    branch, so is ironically actually worse for performance in this case
    than using an explicit if/else...
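    For what it's worth, min/max can also be made branch-free in portable
    C without either predication or a MIN/MAX instruction (a sketch, not
    what BGBCC emits; assumes arithmetic right shift on negatives and that
    a-b does not overflow):

```c
#include <stdint.h>

/* The sign of (a-b) becomes an all-ones/all-zeros mask that selects
   between the two operands without any branch. */
static int32_t bmin(int32_t a, int32_t b)
{
    int32_t m = (a - b) >> 31;       /* -1 if a < b, else 0 */
    return b + ((a - b) & m);        /* a when a < b, else b */
}

static int32_t bmax(int32_t a, int32_t b)
{
    int32_t m = (a - b) >> 31;
    return a - ((a - b) & m);        /* b when a < b, else a */
}
```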

    My point exactly--but perhaps you should spend the effort to make your
    compiler recognize these as single branch-free instructions.

    A case could be made for C11 "_Generic", but would need to implement it.


    But, yeah, apart from compiler issues, in theory the min/max part could
    be 3 instructions.

    max() and min() are easy to perform in a single cycle in both int and FP

    If done in ASM, might be something like:
    // R4..R7 (x0, y0, x1, y1)
    SUB R4, R6, R16 | SUB R5, R7, R17
    SHAD R16, -31, R18 | SHAD R17, -31, R19
    XOR R16, R18, R16 | XOR R17, R19, R17
    CMPGT R17, R16
    CSELT R16, R17, R18 | CSELT R17, R16, R19
    SHAD R19, -1, R19
    ADD R18, R19, R2

    And you are claiming this has an advantage over::

    MAX R9,R8,R7
    MIN R10,R8,R7
    SRL R10,R10,#1
    LEA R10,[R9,R10,#-1]


    I don't have actual MIN/MAX ops though.

    perhaps you should reconsider...

    Though, it seems RISC-V does have them in the Bitmanip extension.

    Could probably add them in-theory (maybe along with FMIN/FMAX).

    I use the same FU to perform Int max() and FP max()--remember once you
    special case out the problems, IEEE is a linearly ordered set of numbers.

  • From MitchAlsup@21:1/5 to Robert Finch on Fri Dec 1 02:46:36 2023
    Robert Finch wrote:

    Could probably add them in-theory (maybe along with FMIN/FMAX).

    I use the same FU to perform Int max() and FP max()--remember once you
    special case out the problems, IEEE is a linearly ordered set of numbers.

    Q+ has min3 / max3 for minimum or maximum of three values.
    But only fmin / fmax for float.

    Interesting, but I did not have enough room in the 3-operand subGroup.

    How do you determine the mid( x, y, z ) ?? allowing::

    mx = max( x, y, z );
    mn = min( x, y, z );
    mi = mid( x, y, z );

    ??
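    One possible answer, sketched in C with plain 2-input compares (the
    sum-minus-extremes form of mid3 assumes no intermediate overflow):

```c
/* min3/max3 by cascaded 2-input ops; mid3 falls out as the residue
   once both extremes are known. */
static int max3(int a, int b, int c)
{
    int m = a > b ? a : b;
    return m > c ? m : c;
}

static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

static int mid3(int a, int b, int c)
{
    return a + b + c - max3(a, b, c) - min3(a, b, c);
}
```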

  • From MitchAlsup@21:1/5 to BGB on Fri Dec 1 18:26:01 2023
    BGB wrote:

    On 11/30/2023 5:15 PM, MitchAlsup wrote:
    BGB wrote:

    On 11/30/2023 2:27 PM, MitchAlsup wrote:
    BGB wrote:

    On 11/30/2023 10:23 AM, MitchAlsup wrote:
    BGB wrote:

    Have noted that these old games did have a nifty trick for
    calculating approximate distance, say:
       dx=x0-x1;
       dy=y0-y1;
       adx=dx^(dx>>31);
       ady=dy^(dy>>31);
       if(ady>adx)
         { t=adx; adx=ady; ady=t; }         //common
    //  { adx^=ady; ady^=adx; adx^=ady; }  //sometimes
       d=adx+(ady>>1);

    Why not::

         dx=x0-x1;
         dy=y0-y1;
         adx=dx^(dx>>31);
         ady=dy^(dy>>31);
         if(ady>adx)
             d=adx+(ady>>1);
         else
             d=ady+(adx>>1);

    As long as ISA has max() and min():: Why not::

         dx=x0-x1;
         dy=y0-y1;
         adx=dx^(dx>>31);
         ady=dy^(dy>>31);
         d = min( adx, ady ) + (max( adx, ady ) >> 1);


    Yeah, I guess that could work as well for 2D...


    Though, makes more sense to use the "__int32_min" intrinsic or
    similar, rather than the generic C library "min()"/"max()" macros.
       #define min(a, b)   (((a)<(b))?(a):(b))
       #define max(a, b)   (((a)>(b))?(a):(b))

    max() and min() are single instructions in my ISA. On the FP side, they
    even get NaNs put in the proper places, too.

    Though, apparently the existence of these macros in stdlib.h is
    controversial (they are apparently sort of a POSIX legacy feature, and
    not part of the C standard).

    It took Brian quite some time to recognize that crap and turn it all into
    max() and min() instructions.


    OK.

    Mostly, due to implementation issues in BGBCC, ?: needs an actual
    branch, so is ironically actually worse for performance in this case
    than using an explicit if/else...

    My point exactly--but perhaps you should spend the effort to make your
    compiler recognize these as single branch-free instructions.


    I guess, it could be possible in theory to use pattern recognition on
    the ASTs to turn this into a call to an intrinsic. Might need to look
    into this.

    For the most part, hadn't really done much of any of this sort of
    pattern recognition.


    A case could be made for C11 "_Generic", but would need to implement it.


    But, yeah, apart from compiler issues, in theory the min/max part
    could be 3 instructions.

    max() and min() are easy to perform in a single cycle in both int and FP


    Theoretically, it can be done in two instructions, which is still faster
    than what ?: generates currently.

    Say:
    z=(x<y)?x:y;

    Compiling to something like, say:
    CMPGT R9, R8
    BF .L0
    MOV R9, R10
    BRA .L1
    .L0:
    MOV R8, R10
    BRA .L1
    .L1:
    MOV R10, R11

    Whereas, the normal "if()" is smart enough to turn it into predication,
    but ?: needs to use a temporary and doesn't have a predication special case.

    Well, and also predication needs to be handled partly in the frontend
    (in the generated RIL3 bytecode), but is generally omitted if the output
    is going to a RIL object (so, if compiling for a static library, it
    always falls back to actual branches).

    Partly this is in turn because my 3AC backend isn't really smart enough
    to pull this off.


    Putting effort into it wasn't really a high priority though, as ?: in C
    seems to be relatively infrequently used.



    But, as-is, one will get a 2-op sequence if they use the intrinsic.


    If done in ASM, might be something like:
       // R4..R7 (x0, y0, x1, y1)
       SUB  R4, R6, R16   |  SUB  R5, R7, R17
       SHAD R16, -31, R18 | SHAD R17, -31, R19
       XOR  R16, R18, R16 | XOR  R17, R19, R17
       CMPGT R17, R16
       CSELT R16, R17, R18 | CSELT R17, R16, R19
       SHAD R19, -1, R19
       ADD  R18, R19, R2

    And you are claiming this has an advantage over::

        MAX  R9,R8,R7
        MIN  R10,R8,R7
        SRL  R10,R10,#1
        LEA  R10,[R9,R10,#-1]


    At least in so far as it doesn't depend on features that don't currently exist.



    I don't have actual MIN/MAX ops though.

    perhaps you should reconsider...


    I didn't have them previously as I could express them in a 2-op
    sequence, which seemed "good enough".

    Though, this was before some more recent changes which (in the name of improving FPGA timing) effectively turned CMPxx and similar into 2 cycle
    ops.


    Though, MIN/MAX ops could possibly allow:
    MIN Imm10u, Rn // Rn=(Rn<Imm)?Rn:Imm

    Which could potentially allow implementing "__int32_clamp()" in 2 or 3 instructions, which is an improvement over needing 6 instructions (with pretty much all of these instructions ending up with a 2-cycle latency).


       Though, it seems RISC-V does have them in the Bitmanip extension.

    Could probably add them in-theory (maybe along with FMIN/FMAX).

    I use the same FU to perform Int max() and FP max()--remember once you
    special case out the problems, IEEE is a linearly ordered set of numbers.

    Yeah.

    I was doing both CMP and FCMP cases via the ALU.
    This is why likely both could be added at the same time, since the logic
    for adding one would likely apply to the others.

    About 93% of ICMP is used in FCMP.

  • From Quadibloc@21:1/5 to Quadibloc on Sat Dec 2 01:28:13 2023
    On Fri, 24 Nov 2023 22:53:57 +0000, Quadibloc wrote:

    While this is an "improvement", I do realize that this, along with supporting multiple memory widths at all, even if I have found ways that it can technically be done, is sheer insanity.

    I do have a rationale for going there. It's simple enough. It comes from
    the x86.

    Since the irresistible force of availability of applications means that
    we are doomed to always have just *one* dominant ISA...

    then, to minimize the terrible loss that this entails, that one dominant
    ISA should be maximally flexible. To approach, as closely as possible,
    the "good old days" where the PDP-10 and the CDC 6600 coexisted
    alongside the PDP-11 and the IBM System/360.

    This is the driving force behind my insane quest to design an ISA for a
    CPU which has the ability to do something which no one seems to want.

    And for some people, the KDF 9, with a 48-bit word, was something to
    sing about - here, to the tune of "The British Grenadiers":

    Some talk of I.B.M.s and some of C.D.C.s,
    Of Honeywells and Burroughses, and such great names as these,
    But of all the world's computers, there's none that is so fine,
    As the English Electric Leo Marconi Kay - Dee - Eff Nine!

    Some talk of thirty-two bit words, and some of twenty-four,
    Of disks and drums and datacells, and megabytes of core,
    But for those who've written usercode there's nothing can outshine,
    The subroutine jump nesting store of the Kay - Dee - Eff Nine!

    It's a good thing I had saved this in a text file on my computer.

    A web search, including a search on Google Groups, was no longer
    able to turn this up for me. Also, the KDF 9 was lauded in another
    posting for reasons more directly related to this scheme of mine:

    (Quoting a brochure for the computer)
    The KDF 9 word has 48 bits...

    It may be used as...

    Eight 6-bit Alphanumeric Characters
    One 48-bit Fixed-Point Number
    Two 24-bit (Half length) Fixed-Point Numbers
    Half of a 96-bit (Double length) Fixed-Point Number
    One 48-Bit Floating-Point Number
    Two 24-bit (Half length) Floating-Point Numbers
    Half of a 96-Bit (Double length) Floating-Point Number
    Three 16-Bit (Fixed-Point) Integers
    Six 8-Bit Instruction Syllables
    (end quote)

    An instruction was 1, 2, or 3 syllables; an address was 15 bits.
    O, memory! We shall not see its like again.

    John Savard

  • From Terje Mathisen@21:1/5 to Robert Finch on Sat Dec 2 14:33:55 2023
    Robert Finch wrote:

    Q+ has min3 / max3 for minimum or maximum of three values.
    But only fmin / fmax for float.

    For symmetry I would assume/like a median3 as well, so that min3,
    median3, max3 would return a sorted list?

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to MitchAlsup on Sat Dec 2 14:24:14 2023
    MitchAlsup wrote:
    BGB wrote:

    Have noted that these old games did have a nifty trick for calculating
    approximate distance, say:
       dx=x0-x1;
       dy=y0-y1;
       adx=dx^(dx>>31);
       ady=dy^(dy>>31);
       if(ady>adx)
         { t=adx; adx=ady; ady=t; }         //common
    //  { adx^=ady; ady^=adx; adx^=ady; }  //sometimes
       d=adx+(ady>>1);

    Why not::

        dx=x0-x1;
        dy=y0-y1;
        adx=dx^(dx>>31);
        ady=dy^(dy>>31);
        if(ady>adx)
            d=adx+(ady>>1);
        else         d=ady+(adx>>1);

    Possibly due to wanting to not use more than one branch predictor entry?

    Maybe because ady rarely was larger than adx?

    Besides, your version has the two terms swapped. :-)
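    The quoted trick is the classic "max + min/2" octagonal distance
    approximation. A small C sketch, using exact absolute values rather than the
    branch-free `dx ^ (dx >> 31)` form (which actually yields abs(x) - 1 for
    negative x, an error the game code tolerates):

```c
#include <assert.h>
#include <stdint.h>

/* max + min/2 distance estimate from the quoted game code.
   Never underestimates, and overestimates sqrt(dx^2 + dy^2)
   by at most about 12% (worst case near a 1:2 slope). */
static int32_t approx_dist(int32_t dx, int32_t dy)
{
    /* Exact abs; the original's adx = dx ^ (dx >> 31) is cheaper
       but off by one for negative inputs. */
    int32_t adx = dx < 0 ? -dx : dx;
    int32_t ady = dy < 0 ? -dy : dy;
    if (ady > adx) { int32_t t = adx; adx = ady; ady = t; }
    return adx + (ady >> 1);
}
```

    For a 3-4-5 triangle the estimate is exact (5); for dx == dy == 10 it
    returns 15 against a true distance of about 14.14.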

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Terje Mathisen on Sat Dec 2 20:41:48 2023
    Terje Mathisen wrote:

    Robert Finch wrote:

    Q+ has min3 / max3 for minimum or maximum of three values.
    But only fmin / fmax for float.

    For symmetry I would assume/like a median3 as well, so that min3,
    median3, max3 would return a sorted list?

    That was my point 8-10 posts ago.

    Terje

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Terje Mathisen on Sat Dec 2 20:42:37 2023
    Terje Mathisen wrote:

    MitchAlsup wrote:
    BGB wrote:

    Have noted that these old games did have a nifty trick for calculating
    approximate distance, say:
       dx=x0-x1;
       dy=y0-y1;
       adx=dx^(dx>>31);
       ady=dy^(dy>>31);
       if(ady>adx)
         { t=adx; adx=ady; ady=t; }         //common
    //  { adx^=ady; ady^=adx; adx^=ady; }  //sometimes
       d=adx+(ady>>1);

    Why not::

        dx=x0-x1;
        dy=y0-y1;
        adx=dx^(dx>>31);
        ady=dy^(dy>>31);
        if(ady>adx)
            d=adx+(ady>>1);
        else         d=ady+(adx>>1);

    Possibly due to wanting to not use more than one branch predictor entry?

    Maybe because ady rarely was larger than adx?

    Besides, your version has the two terms swapped. :-)

    Just mimicking the original.

    Terje

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to MitchAlsup on Sun Dec 3 10:36:39 2023
    MitchAlsup wrote:
    Terje Mathisen wrote:

    Robert Finch wrote:

    Q+ has min3 / max3 for minimum or maximum of three values.
    But only fmin / fmax for float.

    For symmetry I would assume/like a median3 as well, so that min3,
    median3, max3 would return a sorted list?

    That was my point 8-10 posts ago.

    I saw that a bit later: I've been on a sailing trip in the Caribbean
    with very sporadic network connections, basically just when in reach of
    a French island.

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Terje Mathisen on Sun Dec 3 16:59:31 2023
    Terje Mathisen wrote:

    MitchAlsup wrote:
    Terje Mathisen wrote:

    Robert Finch wrote:

    Q+ has min3 / max3 for minimum or maximum of three values.
    But only fmin / fmax for float.

    For symmetry I would assume/like a median3 as well, so that min3,
    median3, max3 would return a sorted list?

    That was my point 8-10 posts ago.

    I saw that a bit later: I've been on a sailing trip in the Caribbean
    with very sporadic network connections, basically just when in reach of
    a French island.

    Must be a tough life............

    Terje

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to MitchAlsup on Tue Dec 5 13:40:08 2023
    MitchAlsup wrote:
    Terje Mathisen wrote:

    MitchAlsup wrote:
    Terje Mathisen wrote:

    Robert Finch wrote:

    Q+ has min3 / max3 for minimum or maximum of three values.
    But only fmin / fmax for float.

    For symmetry I would assume/like a median3 as well, so that min3,
    median3, max3 would return a sorted list?

    That was my point 8-10 posts ago.

    I saw that a bit later: I've been on a sailing trip in the Caribbean
    with very sporadic network connections, basically just when in reach
    of a French island.

    Must be a tough life............

    Indeed.

    Our last such trip was in 2018 (Panama City to Jamaica on the Royal
    Clipper), this trip was supposed to take place in 2020 but then Covid
    happened.

    Back in Oslo now, and I just got my very first positive Covid test, so I
    must have been exposed to the virus during the trip, thankfully I didn't
    really get sick until we got to Oslo Airport in -17C.

    Terje
    PS. These trips are for orienteering, we ran 10-11 races over 16 days.
    I've uploaded headcam videos to youtube from all of them. On those you
    can see both my forward view and a map snippet with moving marker which
    shows where I'm running.

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Paul A. Clayton on Thu Dec 7 13:39:33 2023
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    What about multiply-high-no-carry-in? There might be cases
    where such could be used for division with a reused
    (possibly compile-time constant) divisor even without the
    carry-in. (Producing the carry-in by a low multiply seems
    to have significant overhead in energy, throughput, and/or
    latency.)

    What's that operation supposed to produce? How does it differ from,
    e.g., RISC-V MULH/MULHU/MULHSU? And if there is a difference, what
    makes you think that it still works for division?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Thu Dec 7 18:55:10 2023
    Paul A. Clayton wrote:

    On 11/27/23 8:03 PM, MitchAlsup wrote:
    Anton Ertl wrote:
    [snip]
    However, looking at more recent architectures, the RISC-V M
    extension (which is part of RV64G and RV32G, i.e., a
    standard extension) has not just multiply instructions (MUL,
    MULH, MULHU, MULHSU, MULW), but also integer divide instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW, and
    REMUW.

    All of which are possible in My 66000 using operand sign
    control, S-bit, and CARRY when you want 64×64->128 or
    128/64->{64 quotient, 64 remainder}

    What about multiply-high-no-carry-in?

    CARRY R9,{{O}} // carry applies to next inst
    // no carry in yes carry out
    MUL R10,Rx,Ry // {R9,R10} contain result

    There might be cases
    where such could be used for division with a reused
    (possibly compile-time constant) divisor even without the
    carry-in. (Producing the carry-in by a low multiply seems
    to have significant overhead in energy, throughput, and/or
    latency.) A small immediate operand might be shifted into
    the most significant position to facilitate division by a
    small constant.

    A division-specific instruction (divide-using-reciprocal)
    could do more sophisticated manipulation of an immediate and
    perhaps provide an instruction a compiler could use with
    less knowledge of the dividend. Yet there might be uses for
    a simple multiply-high-no-carry-in.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Fri Dec 8 02:44:03 2023
    BGB wrote:

    On 12/7/2023 7:39 AM, Anton Ertl wrote:
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    What about multiply-high-no-carry-in? There might be cases
    where such could be used for division with a reused
    (possibly compile-time constant) divisor even without the
    carry-in. (Producing the carry-in by a low multiply seems
    to have significant overhead in energy, throughput, and/or
    latency.)

    What's that operation supposed to produce? How does it differ from,
    e.g., RISC-V MULH/MULHU/MULHSU? And if there is a difference, what
    makes you think that it still works for division?


    For division by reciprocal, one may need to be able to adjust the right
    shift (say, 32..34 bits), and add a bias to the low-order bits if the
    input is negative (say, adding "Rcp-1").

    Without the variable shift part, one will not get reliable results for
    some divisors (IIRC, 7, 17, 31, etc).

    Without a bias, it will round towards negative infinity rather than
    towards zero (standard for division is to always round towards zero).

    No:: C99 division is so specified, but there are 3 other (semi-)popular
    definitions.


    As-is, signed division logic (say, "y=x/7;") would be something like:
    MOV 0x49249249, R1
    DMULS.L R4, R1, R2
    ADD -1, R1
    CMPGE 0, R4
    ADD?F R2, R1, R2
    SHAD.Q R2, -33, R2

    Or, if one used jumbo encodings:
    DMULS.L R4, 0x49249249, R2
    CMPGE 0, R4 // SR.T=(R4>=0)
    ADD?F R2, 0x49249248, R2
    SHAD.Q R2, -33, R2

    In 1975 I wrote PDP-11/40 microcode that performed a serial DIV algorithm,
    and had the opportunity in the loop to short-circuit if all the bits of
    the denominator had been consumed. {{CMU had an add-on board with
    student-programmable microcode.}}

    Thus::
    DIV R9,R5,#7

    should not take more than 5 cycles. Nor should
    Thusly::
    DIV R9,R5,#-7

    !!

    So, sadly, simply taking the high result isn't quite sufficient.

    It is more practical to make HW DIV fast than to have everyone {and his
    brother} invent new ways to avoid dividing.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Fri Dec 8 15:20:48 2023
    BGB <cr88192@gmail.com> writes:
    On 12/7/2023 7:39 AM, Anton Ertl wrote:
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    What about multiply-high-no-carry-in? There might be cases
    where such could be used for division with a reused
    (possibly compile-time constant) divisor even without the
    carry-in. (Producing the carry-in by a low multiply seems
    to have significant overhead in energy, throughput, and/or
    latency.)

    What's that operation supposed to produce? How does it differ from,
    e.g., RISC-V MULH/MULHU/MULHSU? And if there is a difference, what
    makes you think that it still works for division?


    For division by reciprocal, one may need to be able to adjust the right
    shift (say, 32..34 bits), and add a bias to the low-order bits if the
    input is negative (say, adding "Rcp-1").

    Without the variable shift part, one will not get reliable results for
    some divisors (IIRC, 7, 17, 31, etc).

    Without a bias, it will round towards negative infinity rather than
    towards zero (standard for division is to always round towards zero).

    Nope. E.g., Forth-83 standardizes floored division; in Forth-94 and Forth-2012, it's implementation-defined. More generally, <https://en.wikipedia.org/wiki/Modulo_operation> lists lots of
    languages that perform floored or Euclidean modulo, and I would hope
    that their division operation agrees with the modulo operation.

    As for division by multiplying with the reciprocal, I have written a
    paper on it [ertl19kps], but the most important sentence of that paper
    is:

    |If you read only one paper about the topic, my recommendation is
    |Robison’s [Rob05].

    My take on this improves the latency of the unsigned case on some CPUs
    by avoiding the shift and bias, but it costs an additional
    multiplication.

    FAZDIV

    What is FAZDIV?

    And you failed to answer what multiply-high-no-carry-in does.

    @InProceedings{ertl19kps,
    author = {M. Anton Ertl},
    title = {Integer Division by Multiplying with the
    Double-Width Reciprocal},
    crossref = {kps19},
    pages = {75--84},
    url = {http://www.complang.tuwien.ac.at/papers/ertl19kps.pdf},
    url-slides = {http://www.complang.tuwien.ac.at/papers/ertl19kps-slides.pdf},
    abstract = {Earlier work on integer division by multiplying with
    the reciprocal has focused on multiplying with a
    single-width reciprocal, combined with a correction
    and followed by a shift. The present work explores
    using a double-width reciprocal to allow getting rid
    of the correction and shift.}
    }

    @Proceedings{kps19,
    title = {20. Kolloquium Programmiersprachen und Grundlagen
    der Programmierung (KPS)},
    booktitle = {20. Kolloquium Programmiersprachen und Grundlagen
    der Programmierung (KPS)},
    year = {2019},
    key = {kps19},
    editor = {Martin Pl\"umicke and Fayez Abu Alia},
    url = {https://www.hb.dhbw-stuttgart.de/kps2019/kps2019_Tagungsband.pdf}
    }

    @InProceedings{robison05,
    author = "Arch D. Robison",
    title = "{$N$}-Bit Unsigned Division Via {$N$}-Bit
    Multiply-Add",
    OPTeditor = "Paolo Montuschi and Eric (Eric Mark) Schwarz",
    booktitle = "{Proceedings of the 17th IEEE Symposium on Computer
    Arithmetic (ARITH-17)}",
    publisher = "IEEE Computer Society Press",
    ISBN = "0-7695-2366-8",
    ISBN-13 = "978-0-7695-2366-8",
    year = "2005",
    bibdate = "Wed Jun 22 07:02:55 2005",
    bibsource = "http://www.math.utah.edu/pub/tex/bib/fparith.bib",
    URL = "http://www.acsel-lab.com/arithmetic/arith17/papers/ARITH17_Robison.pdf",
    abstract = "Integer division on modern processors is expensive
    compared to multiplication. Previous algorithms for
    performing unsigned division by an invariant divisor,
    via reciprocal approximation, suffer in the worst case
    from a common requirement for $ n + 1 $ bit
    multiplication, which typically must be synthesized
    from $n$-bit multiplication and extra arithmetic
    operations. This paper presents, and proves, a hybrid
    of previous algorithms that replaces $ n + 1 $ bit
    multiplication with a single fused multiply-add
    operation on $n$-bit operands, thus reducing any
    $n$-bit unsigned division to the upper $n$ bits of a
    multiply-add, followed by a single right shift. An
    additional benefit is that the prerequisite
    calculations are simple and fast. On the Itanium 2
    processor, the technique is advantageous for as few as
    two quotients that share a common run-time divisor.",
    acknowledgement = "Nelson H. F. Beebe, University of Utah, Department
    of Mathematics, 110 LCB, 155 S 1400 E RM 233, Salt Lake
    City, UT 84112-0090, USA, Tel: +1 801 581 5254, FAX: +1
    801 581 4148, e-mail: \path|beebe@math.utah.edu|,
    \path|beebe@acm.org|, \path|beebe@computer.org|
    (Internet), URL:
    \path|http://www.math.utah.edu/~beebe/|",
    keywords = "ARITH-17",
    pagecount = "9",
    }

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Sat Dec 9 08:26:34 2023
    BGB <cr88192@gmail.com> writes:
    On 12/8/2023 9:20 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    Without a bias, it will round towards negative infinity rather than
    towards zero (standard for division is to always round towards zero).

    Nope. E.g., Forth-83 standardizes floored division; in Forth-94 and
    Forth-2012, it's implementation-defined. More generally,
    <https://en.wikipedia.org/wiki/Modulo_operation> lists lots of
    languages that perform floored or Euclidean modulo, and I would hope
    that their division operation agrees with the modulo operation.
    ...
    OK, how about:
    Languages like C, Java, C#, etc., expect division to always round towards
    zero.

    Yes, there are programming languages that standardize truncated (aka
    symmetric) division. BTW, C89 is not one of them, C99 ff. are. There
    are also programming languages that standardize other kinds of
    division (or at least modulo), or fail to standardize the behaviour
    fully when one or both of the operands is negative.
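    The distinction being drawn here can be made concrete. A minimal C sketch
    showing C99's truncated division alongside a floored variant (as in
    Forth-83 or Python) built on top of it; the helper name is illustrative:

```c
#include <assert.h>

/* C99 `/` truncates toward zero and `%` takes the dividend's sign,
   so (a/b)*b + a%b == a always holds. Floored division instead
   rounds the quotient toward minus infinity. */
static int floordiv(int a, int b)
{
    int q = a / b, r = a % b;
    /* Adjust downward when a remainder exists and its sign
       disagrees with the divisor's. */
    return (r != 0 && ((r < 0) != (b < 0))) ? q - 1 : q;
}
```

    The two agree whenever the operands have the same sign; they differ by
    exactly one whenever the signs differ and the division is inexact.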

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Sun Dec 10 18:29:50 2023
    Paul A. Clayton wrote:

    On 12/7/23 1:55 PM, MitchAlsup wrote:
    Paul A. Clayton wrote:

    On 11/27/23 8:03 PM, MitchAlsup wrote:
    Anton Ertl wrote:
    [snip]
    However, looking at more recent architectures, the RISC-V M
    extension (which is part of RV64G and RV32G, i.e., a standard
    extension) has not just multiply instructions (MUL,
    MULH, MULHU, MULHSU, MULW), but also integer divide
    instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW, and
    REMUW.

    All of which are possible in My 66000 using operand sign
    control, S-bit, and CARRY when you want 64×64->128 or
    128/64->{64 quotient, 64 remainder}

    What about multiply-high-no-carry-in?

          CARRY   R9,{{O}}    // carry applies to next inst
                              // no carry in yes carry out
          MUL     R10,Rx,Ry   // {R9,R10} contain result

    I was thinking of a single register result.

    How can a single 64-bit register hold a 128-bit result ??


                                          There might be cases
    where such could be used for division with a reused
    (possibly compile-time constant) divisor even without the
    carry-in. (Producing the carry-in by a low multiply seems
    to have significant overhead in energy, throughput, and/or
    latency.) A small immediate operand might be shifted into
    the most significant position to facilitate division by a
    small constant.

    A division-specific instruction (divide-using-reciprocal)
    could do more sophisticated manipulation of an immediate and
    perhaps provide an instruction a compiler could use with
    less knowledge of the dividend. Yet there might be uses for
    a simple multiply-high-no-carry-in.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to MitchAlsup on Mon Dec 11 10:43:41 2023
    MitchAlsup <mitchalsup@aol.com> schrieb:
    Paul A. Clayton wrote:

    On 12/7/23 1:55 PM, MitchAlsup wrote:

    What about multiply-high-no-carry-in?

          CARRY   R9,{{O}}    // carry applies to next inst
                              // no carry in yes carry out
          MUL     R10,Rx,Ry   // {R9,R10} contain result

    I was thinking of a single register result.

    How can a single 64-bit register hold a 128-bit result ??

    Sometimes, you don't need the whole result; the upper bits suffice.

    Consider dividing a 32-bit unsigned number n on a 32-bit system
    by 25 (equally applicable to similar examples with 64-bit systems,
    but I happen to have the numbers at hand).

    In plain C, you could do this by

    uint32_t res = ((uint64_t) n * (uint64_t) 1374389535) >> 35;

    where any compiler worth its salt will optimize this to a multiply
    high followed by a shift by three bits. Compilers have done
    this for a long time, for division by a constant. Example, for
    32-bit POWER:

    #include <stdint.h>

    uint32_t div25 (uint32_t n)
    {
    return n / 25;
    }

    yields

    div25:
    lis 9,0x51eb
    ori 9,9,0x851f
    mulhwu 3,3,9
    srwi 3,3,3
    blr

    where lis/ori is POWER's way of synthesizing a 32-bit constant
    (and 0x51eb851f = 1374389535).

    If you can only access the high part via CARRY, this clobbers a
    register that you do not need to do otherwise.
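    The magic constant quoted above is easy to check in portable C: 1374389535
    is ceil(2^35 / 25), and since 25·M − 2^35 = 7 ≤ 2^(35−32), the
    multiply-high-and-shift identity holds for every 32-bit n. A sketch:

```c
#include <assert.h>
#include <stdint.h>

/* n / 25 for 32-bit unsigned n via multiply-high and shift,
   matching the mulhwu + srwi sequence in the quoted POWER code. */
static uint32_t div25(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 1374389535u) >> 35);
}
```

    Recovering the remainder afterwards is a multiply and subtract:
    `r = n - 25 * div25(n);`.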

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Thomas Koenig on Tue Dec 12 00:13:03 2023
    Thomas Koenig wrote:

    MitchAlsup <mitchalsup@aol.com> schrieb:

    #include <stdint.h>

    uint32_t div25 (uint32_t n)
    {
    return n / 25;
    }

    yields

    div25:
    lis 9,0x51eb // cycle 1
    ori 9,9,0x851f // cycle 2
    mulhwu 3,3,9 // cycle 3-6
    srwi 3,3,3 // cycle 7
    blr // cycle 8

    div25:
    CARRY R8,{{O}} // cycle 1
    MUL R1,R1,#0x51eb851f // cycle 1-4
    SLL R1,R8,#35 // cycle 5
    RET // cycle 6

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Thomas Koenig on Tue Dec 12 00:14:44 2023
    Thomas Koenig wrote:

    MitchAlsup <mitchalsup@aol.com> schrieb:
    Paul A. Clayton wrote:

    On 12/7/23 1:55 PM, MitchAlsup wrote:

    What about multiply-high-no-carry-in?

          CARRY   R9,{{O}}    // carry applies to next inst
                              // no carry in yes carry out
          MUL     R10,Rx,Ry   // {R9,R10} contain result

    I was thinking of a single register result.

    How can a single 64-bit register hold a 128-bit result ??

    Sometimes, you don't need the whole result; the upper bits suffice.

    Consider dividing a 32-bit unsigned number n on a 32-bit system
    by 25 (equally applicable to similar examples with 64-bit systems,
    but I happen to have the numbers at hand).

    In plain C, you could do this by

    uint32_t res = ((uint64_t) n * (uint64_t) 1374389535) >> 35;

    This only gives you 29 bits of significance.

    What if you also need the remainder ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to MitchAlsup on Tue Dec 12 05:52:22 2023
    MitchAlsup <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    MitchAlsup <mitchalsup@aol.com> schrieb:
    Paul A. Clayton wrote:

    On 12/7/23 1:55 PM, MitchAlsup wrote:

    What about multiply-high-no-carry-in?

          CARRY   R9,{{O}}    // carry applies to next inst
                              // no carry in yes carry out
          MUL     R10,Rx,Ry   // {R9,R10} contain result

    I was thinking of a single register result.

    How can a single 64-bit register hold a 128-bit result ??

    Sometimes, you don't need the whole result; the upper bits suffice.

    Consider dividing a 32-bit unsigned number n on a 32-bit system
    by 25 (equally applicable to similar examples with 64-bit systems,
    but I happen to have the numbers at hand).

    In plain C, you could do this by

    uint32_t res = ((uint64_t) n * (uint64_t) 1374389535) >> 35;

    This only gives you 29 bits of significance.

    Because that's all it takes, dividing by 25 reduces the number
    of significant bits.

    What if you also need the remainder ??

    Multiply and subtract.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to MitchAlsup on Tue Dec 12 06:09:40 2023
    MitchAlsup <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    MitchAlsup <mitchalsup@aol.com> schrieb:

    #include <stdint.h>

    uint32_t div25 (uint32_t n)
    {
    return n / 25;
    }

    yields

    div25:
    lis 9,0x51eb // cycle 1
    ori 9,9,0x851f // cycle 2
    mulhwu 3,3,9 // cycle 3-6
    srwi 3,3,3 // cycle 7
    blr // cycle 8

    div25:
    CARRY R8,{{O}} // cycle 1
    MUL R1,R1,#0x51eb851f // cycle 1-4
    SLL R1,R8,#35 // cycle 5
    RET // cycle 6

    Then let's use a 64-bit example:

    #include <stdint.h>

    uint64_t div (uint64_t n)
    {
    return n / 37;
    }

    which becomes

    div:
    .LFB0:
    .cfi_startproc
    movabsq $-2492803253203993461, %rax
    mulq %rdi
    movq %rdx, %rax
    shrq $5, %rax
    ret

    on x86_64 (so, use the high result only and shift it five bits to
    the right) and, with Power10,

    div:
    pli 10,3714566310
    pli 9,232160395
    rldimi 9,10,32,0
    mulhdu 3,3,9
    srdi 3,3,5
    blr

    (The first three instructions are pasting together the constant, which
    of course My 66000 can do much better :-)

    A hypothetical My66000 code could be

    mulhu r1, r1, #-2492803253203993461
    srl r1, r1, #5
    ret
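    Interpreted as unsigned, the quoted constant -2492803253203993461 is
    15953940820505558155 = ceil(2^69 / 37), so multiply-high followed by a
    right shift of 5 divides by 37 for all 64-bit n. A sketch using the
    GCC/Clang `__uint128_t` extension to stand in for mulhu:

```c
#include <assert.h>
#include <stdint.h>

/* n / 37 via multiply-high, matching the mulhdu/srdi (POWER) and
   mulq/shrq (x86_64) sequences quoted above. */
static uint64_t div37(uint64_t n)
{
    const uint64_t M = 15953940820505558155u; /* (uint64_t)-2492803253203993461 */
    uint64_t hi = (uint64_t)(((__uint128_t)n * M) >> 64); /* mulhu */
    return hi >> 5;                                       /* srl #5 */
}
```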

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Tue Dec 12 20:43:52 2023
    Robert Finch wrote:

    On 2023-12-12 4:00 a.m., BGB wrote:
    On 12/12/2023 12:09 AM, Thomas Koenig wrote:
    MitchAlsup <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    MitchAlsup <mitchalsup@aol.com> schrieb:

    #include <stdint.h>

    uint32_t div25 (uint32_t n)
    {
       return n / 25;
    }

    yields

    div25:
             lis 9,0x51eb        // cycle 1
             ori 9,9,0x851f    // cycle 2
             mulhwu 3,3,9        // cycle 3-6
             srwi 3,3,3        // cycle 7
             blr            // cycle 8

    div25:
        CARRY R8,{{O}}       // cycle 1
        MUL R1,R1,#0x51eb851f    // cycle 1-4
        SLL R1,R8,#35        // cycle 5
        RET            // cycle 6

    Then let's use a 64-bit example:

    #include <stdint.h>

    uint64_t div (uint64_t n)
    {
       return n / 37;
    }

    which becomes

    div:
    .LFB0:
             .cfi_startproc
             movabsq $-2492803253203993461, %rax
             mulq    %rdi
             movq    %rdx, %rax
             shrq    $5, %rax
             ret

    on x86_64 (so, use the high result only and shift it five bits to
    the right) and, with Power10,

    div:
             pli 10,3714566310
             pli 9,232160395
             rldimi 9,10,32,0
             mulhdu 3,3,9
             srdi 3,3,5
             blr

    (The first three instructions are pasting together the constant, which
    of course My 66000 can do much better :-)

    A hypothetical My66000 code could be

           mulhu    r1, r1, #-2492803253203993461
           srl    r1, r1, #5
           ret


    And, for comparison, on BJX2 (with the MULQ extension):
      MOV -2492803253203993461, R3
      MULHU.Q R4, R3, R2
      SHLD.Q R2, -5, R2
      RTS
    ...

    Though, this will take 24 bytes to encode, and won't be fast.

    Also, currently the only instructions that take 64 bit immediate values
    are:
      MOV Imm64, Rn
      ADD Imm64, Rn
      PLDCXH Imm64, Xn  //4x Binary16 to 4x Binary32

    ...

    Q+ is almost the same as MY66000

    muluh r1,r1,#-2492803253203993461
    lsr r1,r1,#5
    ret

    but ret is only two bytes.

    When subroutines are aligned to cache line boundaries, that adds nothing.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Thomas Koenig on Sat Dec 16 13:51:15 2023
    Thomas Koenig <tkoenig@netcologne.de> schrieb:

    A hypothetical My66000 code could be

    mulhu r1, r1, #-2492803253203993461
    srl r1, r1, #5
    ret

    By "hypothetical" I mean that it isn't in the ISA at the moment.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Thomas Koenig on Sat Dec 16 19:21:43 2023
    Thomas Koenig wrote:

    Thomas Koenig <tkoenig@netcologne.de> schrieb:

    A hypothetical My66000 code could be

    mulhu r1, r1, #-2492803253203993461
    srl r1, r1, #5
    ret

    By "hypothetical" I mean that it isn't in the ISA at the moment.

    If I knew what mulhu does I could illuminate::
    a) how many bits get multiplied?
    b) how many bits are produced?

    But assuming h means high, and the widths are 64-bit operands and
    128-bit results::

    CARRY R1,{{O}{I}}
    MUL R2,R1,#-2492803253203993461
    SHR R1,R2,<0,5>
    RET

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to MitchAlsup on Sat Dec 16 22:26:07 2023
    MitchAlsup <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Thomas Koenig <tkoenig@netcologne.de> schrieb:

    A hypothetical My66000 code could be

    mulhu r1, r1, #-2492803253203993461
    srl r1, r1, #5
    ret

    By "hypothetical" I mean that it isn't in the ISA at the moment.

    If I knew what mulhu does I could illuminate::
    a) how many bits get multiplied?
    b) how many bits are produced?

    But assuming h means high, and the widths are 64-bit operands and
    128-bit results::

    CARRY R1,{{O}{I}}
    MUL R2,R1,#-2492803253203993461
    SHR R1,R2,<0,5>
    RET

    That works, but has two disadvantages over the non-Carry
    version: It is longer (by one instruction modifier),
    and R2 is overwritten with a value which is not needed.
    This is why I think that having a "multiply high signed"
    and "multiply high unsigned" operation is a good idea even
    when My 66000's Carry is implemented.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Thomas Koenig on Sat Dec 16 23:10:27 2023
    Thomas Koenig wrote:

    MitchAlsup <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Thomas Koenig <tkoenig@netcologne.de> schrieb:

    A hypothetical My66000 code could be

    mulhu r1, r1, #-2492803253203993461
    srl r1, r1, #5
    ret

    By "hypothetical" I mean that it isn't in the ISA at the moment.

    If I knew what mulhu does I could illuminate::
    a) how many bits get multiplied?
    b) how many bits are produced?

    But assuming h means high, and the widths are 64-bit operands and
    128-bit results::

    CARRY R1,{{O}{I}}
    MUL R2,R1,#-2492803253203993461
    SHR R1,R2,<0,5>
    RET

    That works, but has two disadvantages over the non-Carry
    version: It is longer (by one instruction modifier),
    and R2 is overwritten with a value which is not needed.
    This is why I think that having a "multiply high signed"
    and "multiply high unsigned" operation is a good idea even
    when My 66000's Carry is implemented.

    Heck:: if you are worried about length::

    DIV R1,R1,#37
    RET

    2 words instead of 6.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to MitchAlsup on Sun Dec 17 07:52:12 2023
    MitchAlsup <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    MitchAlsup <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Thomas Koenig <tkoenig@netcologne.de> schrieb:

    A hypothetical My66000 code could be

    mulhu r1, r1, #-2492803253203993461
    srl r1, r1, #5
    ret

    By "hypothetical" I mean that it isn't in the ISA at the moment.

    If I knew what mulhu does I could illuminate::
    a) how many bits get multiplied?
    b) how many bits are produced?

    But assuming h means high, and the widths are 64-bit operands and
    128-bit results::

    CARRY R1,{{O}{I}}
    MUL R2,R1,#-2492803253203993461
    SHR R1,R2,<0,5>
    RET

    That works, but has two disadvantages over the non-Carry
    version: It is longer (by one instruction modifier),
    and R2 is overwritten with a value which is not needed.
    This is why I think that having a "multiply high signed"
    and "multiply high unsigned" operation is a good idea even
    when My 66000's Carry is implemented.

    Heck:: if you are worried about length::

    DIV R1,R1,#37
    RET

    2 words instead of 6.

    This is an optimization problem with three objective functions:
    code size, register use, and speed. Any solution at which it is
    no longer possible to improve one of the objective functions
    without making the others worse is a valid solution to that
    problem (part of the Pareto front, if you want to use that terminology).

    The version with the DIV with constant is the smallest, so it
    trumps everything else on size. It is, however, likely to be
    slower than the other two. It is what I would expect a compiler
    to emit if optimizing for size (-Os), and when not optimizing.

    The version with the Carry is likely faster than the DIV with
    constant version, but is bigger and uses two registers instead
    of one. It is what I would expect a compiler to generate in
    the absence of multiply high unsigned and any -O option, except
    for -Os.

    The hypothetical version with "multiply high unsigned" is smaller
    than the version with Carry and is better on register use, so
    it is better in that sense. If it existed, I would expect a
    compiler to always generate that for any optimization level
    except -Os. (Small caveat: maybe hardware has a fiendishly
    clever algorithm to divide by 37 in four or fewer cycles; this
    would then be known to the compiler, and it could choose accordingly).
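    Readers can check the mulhu sequence under discussion directly. The
    sketch below assumes a GCC/Clang-style unsigned __int128 for the
    128-bit product (not anything from the thread): the posted constant,
    taken unsigned, equals ceil(2^69/37), so multiply-high followed by a
    right shift of 5 reproduces unsigned division by 37 for every 64-bit
    input:

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* High 64 bits of an unsigned 64x64-bit product -- the proposed MULHU.
       unsigned __int128 is a GCC/Clang extension. */
    static uint64_t mulhu(uint64_t a, uint64_t b) {
        return (uint64_t)(((unsigned __int128)a * b) >> 64);
    }

    int main(void) {
        /* The thread's constant: as an unsigned value it is ceil(2^69 / 37),
           so mulhu plus a right shift by 5 divides by 37. */
        const uint64_t M = (uint64_t)-2492803253203993461LL;
        const uint64_t samples[] = {0, 1, 36, 37, 38, 1000000007ULL, UINT64_MAX};
        for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
            uint64_t x = samples[i];
            uint64_t q = mulhu(x, M) >> 5;   /* the mulhu + srl #5 sequence */
            assert(q == x / 37);             /* matches direct division */
        }
        printf("reciprocal multiply matches x/37 on all samples\n");
        return 0;
    }
    ```

    This is the standard round-up reciprocal: M*37 exceeds 2^69 by only 23,
    which is at most 2^5, so the truncation error never reaches the quotient.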

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Thomas Koenig on Sun Dec 17 21:55:47 2023
    Thomas Koenig wrote:

    MitchAlsup <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    MitchAlsup <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Thomas Koenig <tkoenig@netcologne.de> schrieb:

    A hypothetical My66000 code could be

    mulhu r1, r1, #-2492803253203993461
    srl r1, r1, #5
    ret

    By "hypothetical" I mean that it isn't in the ISA at the moment.

    If I knew what mulhu does I could illuminate::
    a) how many bits get multiplied?
    b) how many bits are produced?

    But assuming h means high, and the widths are 64-bit operands and
    128-bit results::

    CARRY R1,{{O}{I}}
    MUL R2,R1,#-2492803253203993461
    SHR R1,R2,<0,5>
    RET

    That works, but has two disadvantages over the non-Carry
    version: It is longer (by one instruction modifier),
    and R2 is overwritten with a value which is not needed.
    This is why I think that having a "multiply high signed"
    and "multiply high unsigned" operation is a good idea even
    when My 66000's Carry is implemented.

    Heck:: if you are worried about length::

    DIV R1,R1,#27
    RET

    2 words instead of 6.

    This is an optimization problem with three objective functions:
    Code size, register use and speed. Any solution at which
    it is no longer possible to improve one of the objective functions
    without making the others worse is a valid solution to that problem
    (part of the Pareto front, if you want to use that terminology).

    The version with the DIV with constant is the smallest, so it
    trumps everything else on size. It is, however, likely to be
    slower than the other two. It is what I would expect a compiler
    to emit if optimizing for size (-Os), and when not optimizing.

    Since I do IDIV in the FMAC unit, and integers are not normalized::
    FMAC passes both operands through the normalizer's Find First logic
    and knows the significance of the numerator and denominator.

    Thus, for this case, 27 is detected to have only 5 bits of significance,
    so the 1-bit-per-cycle loop* takes only 5 cycles, giving a fall-through
    time of 8 cycles total.

    (*) assuming a 1-bit-at-a-time loop--which I would not do--but I
    still do the normalization steps to make the integers smell like
    FP numbers; so, with 1 Goldschmidt step and 1 Newton-Raphson step,
    you have an 8-cycle IDIV for denominators with fewer than 8 bits of
    significance, 10 cycles for denominators of fewer than 16 bits, ...
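    The "Find First" normalization described above reduces to taking the
    operand's bit width. A minimal sketch (the __builtin_clzll intrinsic is
    a GCC/Clang assumption, not part of the thread):

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Bit width ("significance") of a nonzero 64-bit integer, as a
       find-first-one normalizer would report it. */
    static int significance(uint64_t x) {
        assert(x != 0);                  /* clz is undefined for 0 */
        return 64 - __builtin_clzll(x);
    }

    int main(void) {
        assert(significance(27) == 5);   /* 27 = 11011b: the 5-bit case above */
        assert(significance(37) == 6);
        assert(significance(UINT64_MAX) == 64);
        printf("significance checks pass\n");
        return 0;
    }
    ```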

    The version with the Carry is likely faster than the DIV with
    constant version, but is bigger and uses two registers instead
    of one. It is what I would expect a compiler to generate in
    the absence of multiply high unsigned and any -O option, except
    for -Os.

    What if we solve the problem by making IDIV fast instead !!

    Given a fast IDIV, is there any use for IMULH ??

    The hypothetical version with "multiply high unsigned" is smaller
    than the version with Carry and is better on register use, so

    I should point out that having the constant attached to the
    MUL (as it were) already saves a register compared to
    ISAs without large constants.

    it is better in that sense. If it existed, I would expect a
    compiler to always generate that for any optimization level
    except -Os. (Small caveat: maybe hardware has a fiendishly
    clever algorithm to divide by 37 in four or fewer cycles; this
    would then be known to the compiler, and it could choose accordingly).

    This is one gigantic reason to use FDIV hardware to perform IDIV--and
    another, smaller one not to separate register files. FDIVs are in the
    17-20 cycle range, so why should IDIV not also be in that range ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Thomas Koenig on Mon Dec 18 12:00:54 2023
    Thomas Koenig wrote:
    MitchAlsup <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Thomas Koenig <tkoenig@netcologne.de> schrieb:

    A hypothetical My66000 code could be

    mulhu r1, r1, #-2492803253203993461
    srl r1, r1, #5
    ret

    By "hypothetical" I mean that it isn't in the ISA at the moment.

    If I knew what mulhu does I could illuminate::
    a) how many bits get multiplied?
    b) how many bits are produced?

    But assuming h means high, and the widths are 64-bit operands and
    128-bit results::

    CARRY R1,{{O}{I}}
    MUL R2,R1,#-2492803253203993461
    SHR R1,R2,<0,5>
    RET

    That works, but has two disadvantages over the non-Carry
    version: It is longer (by one instruction modifier),
    and R2 is overwritten with a value which is not needed.
    This is why I think that having a "multiply high signed"
    and "multiply high unsigned" operation is a good idea even
    when My 66000's Carry is implemented.

    I disagree:

    The R2 overwrite really doesn't matter since the OoO circuitry will
    quickly notice that the value isn't used at all, and the extra CARRY
    modifier does not make the code any slower: It will still take exactly
    the latency of MUL + SHR, so presumably 5-6 total cycles.

    Since MULHU is very rarely used in regular code (pretty much only in
    constant divisions that got replaced with reciprocal muls), this tiny
    overhead is a very good deal.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Terje Mathisen on Tue Dec 19 13:57:07 2023
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    MitchAlsup <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Thomas Koenig <tkoenig@netcologne.de> schrieb:

    A hypothetical My66000 code could be

    mulhu r1, r1, #-2492803253203993461
    srl r1, r1, #5
    ret

    By "hypothetical" I mean that it isn't in the ISA at the moment.

    If I knew what mulhu does I could illuminate::
    a) how many bits get multiplied?
    b) how many bits are produced?

    But assuming h means high, and the widths are 64-bit operands and
    128-bit results::

    CARRY R1,{{O}{I}}
    MUL R2,R1,#-2492803253203993461
    SHR R1,R2,<0,5>
    RET

    That works, but has two disadvantages over the non-Carry
    version: It is longer (by one instruction modifier),
    and R2 is overwritten with a value which is not needed.
    This is why I think that having a "multiply high signed"
    and "multiply high unsigned" operation is a good idea even
    when My 66000's Carry is implemented.

    I disagree:

    The R2 overwrite really doesn't matter since the OoO circuitry will
    quickly notice that the value isn't used at all, and the extra CARRY
    modifier does not make the code any slower: It will still take exactly
    the latency of MUL + SHR, so presumably 5-6 total cycles.

    You're correct that the register use will probably be elided in
    an OoO machine, but the register is still overwritten. This does
    not matter for a function that does just that division, but in
    larger functions, this can lead to an additional spill/restore.

    Since MULHU is very rarely used in regular code (pretty much only in
    constant divisions that got replaced with reciprocal muls), this tiny
    overhead is a very good deal.

    My 66000 has very low register usage, especially due to the encoding
    of constants. Another reason is indexed and scaled loads and stores:
    for code like

    do i=1,n
    a(i) = b(i) + c(i)
    end do

    there is no need to run a separate pointer through each of a, b and c,
    which saves registers.
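    The same loop in C makes the point concrete: with scaled-index
    addressing, one shared induction variable serves all three arrays,
    where an ISA without it may advance three separate pointers (the
    addressing comment is illustrative pseudo-assembly, not generated
    code):

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* a(i) = b(i) + c(i): one index i plus three base registers, rather
       than three independently incremented pointers. */
    static void vadd(double *a, const double *b, const double *c, int n) {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i];   /* e.g. load [b + i<<3], load [c + i<<3] */
    }

    int main(void) {
        double b[4] = {1, 2, 3, 4}, c[4] = {10, 20, 30, 40}, a[4];
        vadd(a, b, c, 4);
        for (int i = 0; i < 4; i++)
            assert(a[i] == b[i] + c[i]);
        printf("vadd ok\n");
        return 0;
    }
    ```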

    This is why putting floating point values and integer values into
    the same register set actually works well for My 66000, and would
    not work so well for others.

    Still, when I see an inefficiency, even minor, I tend to remark
    on it :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Thomas Koenig on Tue Dec 19 18:35:44 2023
    Thomas Koenig wrote:

    Still, when I see an inefficiency, even minor, I tend to remark
    on it :-)

    And that is why we love you.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)