• Re: Alternative Representations of the Concertina II ISA

    From Quadibloc@21:1/5 to Quadibloc on Fri Nov 24 04:42:31 2023
    On Fri, 24 Nov 2023 04:15:57 +0000, Quadibloc wrote:

    In 24-bit memory,

Load Floating would be an invalid operation;
Load Double would load a 72-bit floating-point number;
    Load Medium would load a 48-bit floating-point number;

    Alternatively, it would perhaps be simpler to do it this way:

    Load Floating would load a 48-bit floating-point number;
    Load Medium would load a 72-bit floating-point number.

    This would prioritize structure over function, but only
    the compiler would ever know...

    The 24-bit representation of code, however, has a bigger problem.

    In the 60-bit representation, a 15-bit halfword that can be
    addressed easily in 60-bit memory corresponds to a 16-bit area
    in the original ISA;

In the 36-bit representation, an 18-bit halfword, easily addressable
in 36-bit memory, does the same.

In the 24-bit representation, though, a problem exists: 32-bit
instructions and 24-bit words line up only at every 96 bits, the
least common multiple of the two sizes. The simplest way to deal
with that would be, in code consisting entirely of 32-bit
instructions, to allow only every third instruction to be a branch
target.
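The arithmetic behind "every third instruction" can be checked directly; a small sketch of my own, not part of the ISA:

```python
from math import lcm

# Bit offsets of successive 32-bit instructions in a code stream.
instr_offsets = [32 * i for i in range(12)]

# In 24-bit memory an instruction is easy to address as a branch
# target only if it starts on a 24-bit word boundary.
targets = [i for i, off in enumerate(instr_offsets) if off % 24 == 0]

# Alignment recurs every lcm(32, 24) = 96 bits: every third instruction.
assert lcm(32, 24) == 96
assert targets == [0, 3, 6, 9]
```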

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to All on Fri Nov 24 04:15:57 2023
    The Concertina II ISA, as it stands now, has a
    very limited number of opcodes for load and
    store memory-reference operations, but a huge
    number of opcodes for register-to-register
    operate instructions.

    This makes it suitable for an implementation where:

    Physical memory is some power-of-two multiple of
    96 bits in width, for example, 192 bits or 384 bits
    wide;

    A simple hardware divide-by-three circuit allows
    accessing memory normally with the RAM module width
    of 64 bits as the fundamental unit;

    Applying no conversion allows 24 bits to be the
    fundamental unit;

    An even simpler _multiply_ by five circuit allows
    accessing memory with a 60-bit word as the
    fundamental unit.
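The divide-by-three and multiply-by-five circuits fall out of the address arithmetic: with a 96-bit physical line, word n at width w starts at bit w*n, so the line number is w*n/96. A sketch of that mapping, on my own reading of the scheme:

```python
def line_and_offset(word_index, width, line_bits=96):
    """Map word index n at the given width to (physical line, bit offset),
    for memory built from line_bits-wide lines (here 96 bits)."""
    bit_addr = width * word_index
    return bit_addr // line_bits, bit_addr % line_bits

# 64-bit units: line = 64n/96 = 2n/3, hence a divide-by-three circuit.
assert line_and_offset(3, 64) == (2, 0)
# 60-bit words: line = 60n/96 = 5n/8, a multiply-by-five then a shift.
assert line_and_offset(8, 60) == (5, 0)
# 24-bit words: line = 24n/96 = n/4, a plain shift, no conversion.
assert line_and_offset(4, 24) == (1, 0)
```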

    So it's possible to take an implementation of the
    architecture with three-channel memory, and add the
    feature of allocating data memory blocks that look
    like they're made up of 48-bit words or 60-bit words.

    When a memory-reference instruction refers to an
    address which is inside of such a variant block,
its operation would be modified to handle a data
    type of an appropriate width for that kind of memory.

    Thus:

    In 60-bit memory,

    Load Floating would be an invalid operation;
    Load Double would load a 60-bit floating-point number;
    Load Medium would load a 45-bit floating-point number, as
    such a float would still be wide enough to be useful.

    Load Byte would be an invalid operation.
    Load Halfword would load a 15-bit integer;
    Load would load a 30-bit integer;
    Load Long would load a 60-bit integer.

    In 36-bit memory,

Load Floating would load a 36-bit floating-point number;
Load Double would load a 72-bit floating-point number;
Load Medium would load a 54-bit floating-point number.

Load Byte would load a 9-bit byte;
Load Halfword would load an 18-bit halfword integer;
Load would load a 36-bit integer;
Load Long would load a 72-bit integer into a _pair_
of registers, the first of which would be an even-numbered
register, as the registers are only 64 bits long.

    In 24-bit memory,

Load Floating would be an invalid operation;
    Load Double would load a 72-bit floating-point number;
    Load Medium would load a 48-bit floating-point number;

    Load Byte would load a 6-bit character;
    Load Halfword would load a 12-bit integer;
    Load would load a 24-bit integer;
    Load Long would load a 48-bit integer.

    Note that the relationship between floating-point
    widths and integer widths in instructions is offset
    by a factor of two when 24-bit memory is referenced.
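The three tables above can be written down as data, which makes the factor-of-two offset in 24-bit memory easy to check; the widths below are from the post, the dictionary layout is mine (plain "Load" is listed under "Word"):

```python
# Operand widths in bits for each variant memory width (None = invalid).
float_widths = {
    60: {"Floating": None, "Medium": 45, "Double": 60},
    36: {"Floating": 36,   "Medium": 54, "Double": 72},
    24: {"Floating": None, "Medium": 48, "Double": 72},
}
int_widths = {
    60: {"Byte": None, "Halfword": 15, "Word": 30, "Long": 60},
    36: {"Byte": 9,    "Halfword": 18, "Word": 36, "Long": 72},
    24: {"Byte": 6,    "Halfword": 12, "Word": 24, "Long": 48},
}

# In 36-bit memory, Load Long and Load Double move the same 72 bits...
assert int_widths[36]["Long"] == float_widths[36]["Double"] == 72
# ...but in 24-bit memory the 48-bit float comes from Load Medium,
# not from the Load Long position: the offset by a factor of two.
assert int_widths[24]["Long"] == float_widths[24]["Medium"] == 48
```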

    And there would be separate opcodes for doing arithmetic
    on numbers of these different lengths in the registers.

    This is all very well. But in program code, which, as
    it is so far specified, can _only_ reside in 64-bit memory,
    what about pseudo-immediate values?

    Since the instruction format that references pseudo-immediate
    values is that of an operate instruction, pseudo-immediates
    _could_ be allowed, with considerable wasted space in most
    cases, in operate instructions for the data types associated
    with variant memory widths.

    However, I have thought of a way to allow the CPU to treat
    the different memory widths with greater equality.

    Instead of devising a whole new ISA for each of the other
    memory widths, the *existing* ISA, designed around standard
    memory built up from the 64-bit unit, could be given
    _alternative representations_ wherein programs could be
    stored in memory of other widths, using the same ISA with
    minimum modifications.

    The 36-bit Representation

    In the standard 64-bit representation, a block may begin
    with 1100, followed by a 2-bit prefix field for each of the
    remaining fourteen 16-bit parts of the remaining seven
    32-bit instruction slots in the block.

So, make that portion of the ISA the only one supported in
the 36-bit representation, but with each block now 288 bits
long, made up of eight 36-bit words: a 2-bit prefix field,
followed by 16 instruction bits, repeated sixteen times.

    But it must still be possible to have block headers if one is
    to have pseudo-immediates.

So the first 36-bit instruction slot of a block, if it is to
be a header, would start with 1011100011, and the prefix-field
portion of the second 18 bits of that instruction slot would
contain 11. 1011100011 consists of 10 (32-bit instruction)
and 11100011 (header prefix in 32-bit mode).

So the ISA is modified, because in 64-bit memory only seven
instruction slots, not all eight, can contain 36-bit instructions,
and there are no headers inside 36-bit instructions. But the
modifications are minimal, for the purpose of adapting programs
to the different size of memory, with that size being native
for them.
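A sketch of how the 36-bit packing might be unpacked in software; the 2-bit-prefix-plus-16-bit layout and the header marker come from the text above, while the helper itself is my own illustration:

```python
def split_block(words):
    """Unpack eight 36-bit words into sixteen (2-bit prefix, 16-bit) units;
    each 36-bit word carries two prefix+halfword pairs."""
    assert len(words) == 8 and all(0 <= w < 1 << 36 for w in words)
    units = []
    for w in words:
        for half in ((w >> 18) & 0x3FFFF, w & 0x3FFFF):  # high 18, low 18
            units.append((half >> 16, half & 0xFFFF))    # (prefix, bits)
    return units

# The header marker 1011100011 is the 32-bit-instruction prefix 10
# glued onto the 32-bit-mode header prefix 11100011.
assert 0b1011100011 == (0b10 << 8) | 0b11100011
```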

    The 24-bit representation

    Since a 24-bit word contains three 8-bit bytes, here, the
form of instructions isn't modified at all. But blocks shrink
from eight instruction slots that are 32 bits in length to
eight 24-bit words, so a block would be 192 bits long.

    The amount of wasted space for data in 24-bit types wouldn't
    be changed much in many cases by using this representation
    instead of the 32-bit representation for program code,
    because the clash between instruction widths and data widths
    wouldn't be affected. However, _some_ improvement would be
    achieved, because now if there are multiple pseudo-immediate
    values used in a single block, the 24-bit data items could
    at least be *packed* correctly in the block.

    The 60-bit representation

    Here, a block is 480 bits long.

    It consists of two 240-bit sub-blocks, in the following form:

One 15-bit halfword containing 15 prefix bits,

    and fifteen 15-bit halfwords to which those bits apply,
    turning them into 16-bit halfwords, which are used to form
    instructions.

    The blocks are aligned on 480-bit boundaries. When code
    consists of 32-bit instructions, the first sub-block begins
    with an instruction, and the second sub-block begins with the
    second half of an instruction.

    The first whole 32-bit instruction in a sub-block can be a
    header, and so one has a header applying to seven subsequent
    instruction slots in the first sub-block, and a header applying
    to six subsequent instruction slots in the second sub-block.
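A sketch of the sub-block expansion; the post doesn't say which prefix bit pairs with which halfword, so the assumption that bit i (counted from the left) becomes the top bit of halfword i is mine:

```python
def expand_subblock(prefix15, halves15):
    """Expand a 240-bit sub-block (one 15-bit halfword of prefix bits plus
    fifteen 15-bit halfwords) into fifteen 16-bit halfwords.
    Assumption (mine): prefix bit i becomes the top bit of halfword i."""
    assert 0 <= prefix15 < 1 << 15 and len(halves15) == 15
    out = []
    for i, h in enumerate(halves15):
        bit = (prefix15 >> (14 - i)) & 1   # bit i, counted from the left
        out.append((bit << 15) | h)
    return out

# 15 prefix bits + 15 x 15 halfword bits = 240 bits, as stated above.
halves = expand_subblock(0b100000000000000, [0x7FFF] + [0] * 14)
assert halves[0] == 0xFFFF and halves[1] == 0x0000
```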

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Fri Nov 24 18:48:53 2023
    Quadibloc wrote:

    The Concertina II ISA, as it stands now, has a
    very limited number of opcodes for load and
    store memory-reference operations, but a huge
    number of opcodes for register-to-register
    operate instructions.

    This makes it suitable for an implementation where:

    Physical memory is some power-of-two multiple of
    96 bits in width, for example, 192 bits or 384 bits
    wide;

    A simple hardware divide-by-three circuit allows
    accessing memory normally with the RAM module width
    of 64 bits as the fundamental unit;

    Applying no conversion allows 24 bits to be the
    fundamental unit;

    An even simpler _multiply_ by five circuit allows
    accessing memory with a 60-bit word as the
    fundamental unit.

    So it's possible to take an implementation of the
    architecture with three-channel memory, and add the
    feature of allocating data memory blocks that look
    like they're made up of 48-bit words or 60-bit words.

    When a memory-reference instruction refers to an
    address which is inside of such a variant block,
its operation would be modified to handle a data
    type of an appropriate width for that kind of memory.

    Thus:

    In 60-bit memory,

    Load Floating would be an invalid operation;
    Load Double would load a 60-bit floating-point number;
    Load Medium would load a 45-bit floating-point number, as
    such a float would still be wide enough to be useful.

    I have not found a need to have LD/ST floating point instructions
    because I don't have FP register file(s) {or SIMD files}. Why do
    you seem to want/need a file dedicated to floating point values ??
    This saves a huge chunk of OpCode Space.......

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Fri Nov 24 22:38:32 2023
    On Fri, 24 Nov 2023 18:48:53 +0000, MitchAlsup wrote:

    I have not found a need to have LD/ST floating point instructions
    because I don't have FP register file(s) {or SIMD files}. Why do you
    seem to want/need a file dedicated to floating point values ??
    This saves a huge chunk of OpCode Space.......

    Well, for one thing, by giving the integer functional unit one register
    file, and the floating-point functional unit its own register file, then
    those two functional units are not constrained to be adjacent.

    I can stick the Decimal Floating-Point functional unit on the other side
    of the floating register file, and the Simple Floating functional unit,
    or the Register Packed Decimal functional unit, on the other side of the integer register file.

    But of course my main reason is that this is the way everyone (i.e.
    the IBM System/360) does it, so of course I should do it that way!

    But having an excess of data types is at least a helpful excuse...

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Fri Nov 24 22:44:32 2023
    On Fri, 24 Nov 2023 22:38:32 +0000, Quadibloc wrote:

    On Fri, 24 Nov 2023 18:48:53 +0000, MitchAlsup wrote:

    I have not found a need to have LD/ST floating point instructions
    because I don't have FP register file(s) {or SIMD files}. Why do you
    seem to want/need a file dedicated to floating point values ??
    This saves a huge chunk of OpCode Space.......

Also, IEEE 754 calls for a hidden first bit and gradual
underflow, and these things make floating-point
arithmetic slightly slower.

    Hence, I deal with them at load and store time, by using
    the load and store floating point instructions to convert
    to and from an *internal representation* of floats. That
    in itself precludes using the integer instructions for them.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Fri Nov 24 22:41:03 2023
    On Fri, 24 Nov 2023 04:42:31 +0000, Quadibloc wrote:

    On Fri, 24 Nov 2023 04:15:57 +0000, Quadibloc wrote:

    In 24-bit memory,

Load Floating would be an invalid operation;
Load Double would load a 72-bit floating-point number;
    Load Medium would load a 48-bit floating-point number;

    Alternatively, it would perhaps be simpler to do it this way:

    Load Floating would load a 48-bit floating-point number;
    Load Medium would load a 72-bit floating-point number.

    This would prioritize structure over function, but only the compiler
    would ever know...

    Silly me. How could I have forgotten that 24-bit memory is the
    only place where the "ideal" lengths of floating-point numbers
    can exist, so I should use them, rather than try to concoct
    something that's 24-bit native.

    So

    Load Floating would load a 36-bit floating-point number,
    Load Medium would load a 48-bit floating-point number, and
    Load Double would load a 60-bit floating-point number.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Fri Nov 24 22:53:57 2023
    On Fri, 24 Nov 2023 04:15:57 +0000, Quadibloc wrote:

    An even simpler _multiply_ by five circuit allows accessing memory with
    a 60-bit word as the fundamental unit.

    As well, a multiply by three circuit is used to address the 36-bit memory.

    Instead of devising a whole new ISA for each of the other memory widths,
    the *existing* ISA, designed around standard memory built up from the
    64-bit unit, could be given _alternative representations_ wherein
    programs could be stored in memory of other widths, using the same ISA
    with minimum modifications.

While this is an "improvement", I do realize that this, along
with even supporting multiple memory widths, is sheer insanity,
even if I have found ways that it can technically be done.

    I do have a rationale for going there. It's simple enough. It comes from
    the x86.

    Since the irresistible force of availability of applications means that
    we are doomed to always have just *one* dominant ISA...

then, to minimize the terrible loss that this entails, that one
dominant ISA should be maximally flexible, to approach, as closely
as possible, the "good old days" where the PDP-10 and the CDC 6600
coexisted alongside the PDP-11 and the IBM System/360.

This is the driving force behind my insane quest to design an
ISA for a CPU which has the ability to do something which no one
seems to want.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Sat Nov 25 19:55:59 2023
    Quadibloc wrote:

    On Fri, 24 Nov 2023 18:48:53 +0000, MitchAlsup wrote:

    I have not found a need to have LD/ST floating point instructions
    because I don't have FP register file(s) {or SIMD files}. Why do you
    seem to want/need a file dedicated to floating point values ??
    This saves a huge chunk of OpCode Space.......

    Well, for one thing, by giving the integer functional unit one register
    file, and the floating-point functional unit its own register file, then those two functional units are not constrained to be adjacent.

    But Integer Multiply and Divide can share the FPU that does these.
    And we have the need to use FP values in deciding to take branches.
So there is a lot of cross pollination of types to FUs. And finally
you have to have two function units, one to convert from FP to int
and one to convert int to FP.

Let's keep them apart, and now we need an FU bigger than the FMAC
unit to perform integer multiplies and divides. In addition, you
need a bus from the FPU side to the condition unit that cooperates
with integer branch instructions.

    Does not look like you can keep them "that far apart" and even when you
    do there is lots of cross wiring--some of which are more costly in HW
    than keeping FP and Int in the same RF.

    I can stick the Decimal Floating-Point functional unit on the other side
    of the floating register file, and the Simple Floating functional unit,
    or the Register Packed Decimal functional unit, on the other side of the integer register file.

    But of course my main reason is that this is the way everyone (i.e.
    the IBM System/360) does it, so of course I should do it that way!

    But having an excess of data types is at least a helpful excuse...

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Quadibloc on Sat Nov 25 14:46:33 2023
    On 11/24/2023 4:44 PM, Quadibloc wrote:
    On Fri, 24 Nov 2023 22:38:32 +0000, Quadibloc wrote:

    On Fri, 24 Nov 2023 18:48:53 +0000, MitchAlsup wrote:

    I have not found a need to have LD/ST floating point instructions
    because I don't have FP register file(s) {or SIMD files}. Why do you
    seem to want/need a file dedicated to floating point values ??
    This saves a huge chunk of OpCode Space.......

    Also, because IEEE 754 calls for a hidden first bit, and
    gradual underflow, these things make floating-point
arithmetic slightly slower.


    DAZ/FTZ fixes the gradual underflow issue...
    Not the standard solution, but works.
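What DAZ-style flushing amounts to can be sketched on IEEE-754 single precision: any input whose exponent field is all zeros is treated as (signed) zero. A rough model of mine, not how any particular CPU implements it:

```python
import struct

def daz(x):
    """Treat IEEE-754 single-precision subnormals as (signed) zero,
    in the spirit of the DAZ/FTZ control bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exponent = (bits >> 23) & 0xFF
    if exponent == 0:               # subnormal (or zero) input
        return -0.0 if bits >> 31 else 0.0
    return x

assert daz(1.0e-40) == 0.0          # subnormal: flushed to zero
assert daz(1.5) == 1.5              # normal value: passed through
```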


    One could maybe also make a case for truncate-only rounding:
    Easier to make bit-exact across implementations;
    Cheapest option;
    ...

    Though, truncate does occasionally have visible effects on code.
For example, the player's yaw angle in my Quake port was prone to
drift; I had to manually add a small bias to stop the drifting.
    ...

Say, round-nearest at least has the advantage that looping relative
computations don't have a tendency to drift towards 0. Though, on the
other side, computations which are stable with truncate rounding will
tend to drift away from zero with round-nearest.

    In this case, this applied mostly to 'float' and 'short float' (when set
    to be routed through the low-precision unit via a compiler flag), with
    the main FPU (used for 'double'; and the others with default options)
    still doing proper round-nearest.



I don't imagine a non-hidden-bit representation would be much better here.

Renormalization would still be required regularly to avoid other issues,
and manual re-normalization would negatively affect performance.

    Operations like FADD/FSUB would also still need to be able to
    shift-align operands to be able to do their thing, so non-normalized
    values would not save much here either.


Well, with the partial exception that one could instead re-interpret
the floating-point format as more like variable fixed-point, in which
case the compiler would need to deal with these issues (and probably
the programmer as well, say, to specify how many bits are above or
below the binary point).

    But, in this case, may as well save a few bits and just use "normal" fixed-point (or maybe add a compiler type, like "__fixed(16,16)" or
    similar, then have the compiler deal with the shifts and similar...).
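A sketch of what a compiler might lower the hypothetical __fixed(16,16) type's multiply to, with the shift inserted automatically; the helper names here are mine:

```python
FRAC = 16  # a 16.16 format, as in the hypothetical __fixed(16,16)

def to_fix(x):
    return int(round(x * (1 << FRAC)))

def from_fix(f):
    return f / (1 << FRAC)

def fix_mul(a, b):
    # The compiler supplies the shift: keep the middle bits of the
    # double-width product, discarding the extra fraction bits.
    return (a * b) >> FRAC

assert from_fix(fix_mul(to_fix(1.5), to_fix(2.25))) == 3.375
```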


    Hence, I deal with them at load and store time, by using
    the load and store floating point instructions to convert
    to and from an *internal representation* of floats. That
    in itself precludes using the integer instructions for them.


    Possible.

    I went with integer operations myself...



    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to MitchAlsup on Mon Nov 27 01:29:56 2023
    On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:

    But Integer Multiply and Divide can share the FPU that does these.

    But giving each one its own multiplier means more superscalar
    goodness!

    But you _do_ have a good point. This kind of "superscalar goodness"
    is usually wasted, because most programs don't do an equal amount
    of integer and FP arithmetic. So it would be way better to have
    twice as many multipliers and dividers that are designed to be used
    either way than having units of two kinds.

    I'm assuming, though, that one can make the multipliers and dividers
    simpler, hence faster, by making them single-use, and that it's
    possible to know, for a given use case, what the ratio between integer
    and FP operations will be. After all, integer multiplication discards
    the MSBs of the result, and floating-point multiplication discards the
    LSBs of the result.
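The asymmetry is easy to see numerically: a 32-bit integer multiply keeps the low half of the double-width product, while an FP multiply keeps (roughly) the top 53 bits and rounds the rest away. A small demonstration of mine:

```python
a, b = 0xFFFF_FFFF, 0x1000_0001

full = a * b                        # exact double-width product
low = full & 0xFFFF_FFFF            # integer multiply: keeps the LSBs
fp = float(a) * float(b)            # FP multiply: keeps the MSBs, rounded

assert low != full                  # high bits were discarded
assert fp != full                   # low bits were rounded away
assert abs(fp - full) / full < 2**-52   # but the leading bits survive
```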

    And we have the need to use FP values in deciding to take branches.

    What gets used in taking branches is the _condition codes_. They're
    conveyed from the integer and floating point functional units, as well
    as all the other functional units, to a central place (the program
    status word).

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Quadibloc on Mon Nov 27 01:56:01 2023
    Quadibloc wrote:

    On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:

    But Integer Multiply and Divide can share the FPU that does these.

    But giving each one its own multiplier means more superscalar
    goodness!

    But you _do_ have a good point. This kind of "superscalar goodness"
    is usually wasted, because most programs don't do an equal amount
    of integer and FP arithmetic. So it would be way better to have
    twice as many multipliers and dividers that are designed to be used
    either way than having units of two kinds.

    I'm assuming, though, that one can make the multipliers and dividers
    simpler, hence faster, by making them single-use, and that it's
    possible to know, for a given use case, what the ratio between integer
    and FP operations will be.

    In non-scientific code, there are 10 integer mul/divs per 1 FP mul/div
    In Scientific code, there are 2 integer mul/divs per 10 FP mul/divs.

    After all, integer multiplication discards
    the MSBs of the result, and floating-point multiplication discards the
    LSBs of the result.

Means almost nothing, because an FMAC can end up with those very
same bits as are preferred for integer multiplies.

    And we have the need to use FP values in deciding to take branches.

    What gets used in taking branches is the _condition codes_. They're
    conveyed from the integer and floating point functional units, as well
    as all the other functional units, to a central place (the program
    status word).

    LD R4,[address of FP value]
    BFEQ0 R4,some label

    How do you do this on a condition code machine in 2 instructions ??
System/360 had to::

    LD F4,[address of FP value]
    TST F4
    BEQ some label

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Quadibloc on Mon Nov 27 00:18:34 2023
    On 11/26/2023 7:29 PM, Quadibloc wrote:
    On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:

    But Integer Multiply and Divide can share the FPU that does these.

    But giving each one its own multiplier means more superscalar
    goodness!

    But you _do_ have a good point. This kind of "superscalar goodness"
    is usually wasted, because most programs don't do an equal amount
    of integer and FP arithmetic. So it would be way better to have
    twice as many multipliers and dividers that are designed to be used
    either way than having units of two kinds.

    I'm assuming, though, that one can make the multipliers and dividers
    simpler, hence faster, by making them single-use, and that it's
    possible to know, for a given use case, what the ratio between integer
    and FP operations will be. After all, integer multiplication discards
    the MSBs of the result, and floating-point multiplication discards the
    LSBs of the result.


    If one wants to have the most superscalar goodness, for general case
    code, this means mostly lots of simple ALU ops (particularly, ADD/SUB/AND/OR/SHIFT), and ability to do *lots* of memory Load/Store ops.

    Where, say, Load/Store ops and simple ALU ops tend to dominate over
    pretty much everything else in terms of usage frequency.


    In most code, FPU ops are comparably sparse, as is MUL, and DIV/MOD is
    fairly rare in comparison to either. MUL is common enough that you want
    it to be fast when it happens, but not so common to where one is dealing
    with piles of them all at the same time (except maybe in edge cases,
    like the DCT transform in JPEG/MPEG).

If FMUL and MUL can't be co-issued, this is unlikely to matter.


    DIV is uncommon enough that it can be 80 or 100 cycles or similar, and
    most code will not notice (except in rasterizers or similar, which
    actually need fast DIV, but can cheat in that they don't usually need
    exact DIV and the divisors are small, allowing for "shortcuts", such as
    using lookup tables of reciprocals or similar).


    Otherwise, one can debate whether or not having DIV/MOD in hardware
    makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV is
    at least "probably" faster than a generic software-only solution).

For cases like divide-by-constant, also, it is typically possible
for the compiler to turn it into a multiply by the reciprocal.
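The divide-by-constant trick is the classic round-up-reciprocal transformation; a sketch of mine, with the precision choice following the usual magic-number recipe and checked only for the divisors below:

```python
def magic(d, bits=32):
    """Precompute (m, s) with (n * m) >> s == n // d for all n < 2**bits.
    Round-up reciprocal with bits + ceil(log2 d) fraction bits."""
    s = bits + (d - 1).bit_length()
    m = (1 << s) // d + 1            # ceil(2**s / d) for non-power-of-two d
    return m, s

# The divide becomes a widening multiply plus a shift.
for d in (3, 7, 10, 641):
    m, s = magic(d)
    for n in (0, 1, d - 1, d, 12345, 2**32 - 1):
        assert (n * m) >> s == n // d
```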



    And we have the need to use FP values in deciding to take branches.

    What gets used in taking branches is the _condition codes_. They're
    conveyed from the integer and floating point functional units, as well
    as all the other functional units, to a central place (the program
    status word).


    Yeah, better to not have this style of condition codes...

    Like, there are other ways of doing conditional branches...



    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Mon Nov 27 09:17:59 2023
    BGB <cr88192@gmail.com> writes:
    On 11/26/2023 7:29 PM, Quadibloc wrote:
    On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:

    But Integer Multiply and Divide can share the FPU that does these.

    But giving each one its own multiplier means more superscalar
    goodness!

    Having two multipliers that serve both purposes means even more
    superscalar goodness for similar area cost. However, there is the
    issue of latency. The Willamette ships the integers over to the FPU
    for multiplication, and the result back, crossing several clock
    domains (at one clock loss per domain crossing), resulting in a
10-cycle latency for integer multiplication. I think that these days
every high-performance core with real silicon invests in separate
GPR and FP/SIMD (including integer SIMD) multipliers.

    In most code, FPU ops are comparably sparse

    In terms of executed ops, that depends very much on the code. GP
    cores have acquired SIMD cores primarily for FP ops, as can be seen by
    both SSE and supporting only FP at first, and only later adding
    integer stuff, because it cost little extra. Plus, we have added GPUs
    that are now capable of doing huge amounts of FP ops, with uses in
    graphics rendering, HPC and AI.

    Otherwise, one can debate whether or not having DIV/MOD in hardware
    makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV is
    at least "probably" faster than a generic software-only solution).

    That debate has been held, and MIPS has hardware integer divide, Alpha
    and IA-64 don't have a hardware integer divide; they both have FP
    divide instructions.

    However, looking at more recent architectures, the RISC-V M extension
    (which is part of RV64G and RV32G, i.e., a standard extension) has not
    just multiply instructions (MUL, MULH, MULHU, MULHSU, MULW), but also
    integer divide instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW,
    and REMUW. ARM A64 also has divide instructions (SDIV, UDIV), but
    RISC-V seems significant to me because there the philosophy seems to
    be to go for minimalism. So the debate has apparently come to the
    conclusion that for general-purpose architectures, you include an
    integer divide instruction.

    For cases like divide-by-constant, also, typically it is possible to
    turn it in the compiler into a multiply by reciprocal.

    But for that you want a multiply instruction, which in the RISC-V case
    means including the M extension, which also includes divide
    instructions. Multiplying by the reciprocal may still be faster.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Quadibloc@21:1/5 to Anton Ertl on Mon Nov 27 13:27:01 2023
    On Mon, 27 Nov 2023 09:17:59 +0000, Anton Ertl wrote:

    That debate has been held, and MIPS has hardware integer divide, Alpha
    and IA-64 don't have a hardware integer divide; they both have FP divide instructions.

    I didn't know this about the Itanium. All I remembered hearing was that
    the Itanium "didn't have a divide instruction", and so I didn't realize
    this applied to fixed-point arithmetic only.

    John Savard

  • From Anton Ertl@21:1/5 to Quadibloc on Mon Nov 27 15:54:03 2023
    Quadibloc <quadibloc@servername.invalid> writes:
    On Mon, 27 Nov 2023 09:17:59 +0000, Anton Ertl wrote:

    That debate has been held, and MIPS has hardware integer divide, Alpha
    and IA-64 don't have a hardware integer divide; they both have FP divide
    instructions.

    I didn't know this about the Itanium. All I remembered hearing was that
    the Itanium "didn't have a divide instruction", and so I didn't realize
    this applied to fixed-point arithmetic only.

    You are right, I was wrong. <http://gec.di.uminho.pt/discip/minf/ac0203/icca03/ia64fpbf1.pdf>
    says:

    |A number of floating-point operations defined by the IEEE Standard are
    |deferred to software by the IA-64 architecture in all its
    |implementations
    |
    |* floating-point divide (integer divide, which is based on the
    | floating-point divide operation, is also deferred to software)

    The paper goes on to describe that FP division a/b is based on
    determining 1/b through Newton-Raphson approximation, using the FMA instruction, and then multiplying with a. It shows a sequence of 13 instructions for double precision, 1 frcpa and 12 fma or fnma
    instructions.

    Given that integer division is based on FP division, it probably takes
    even more instructions.

    Meanwhile, a lowly Cortex-A53 with its divide instruction produces an
    integer division result with a small quotient in a few cycles. In
    other words, if you want to use that method for division, still
    provide a division instruction, and then implement it with that
    method. This provides the flexibility to switch to some other method
    (e.g., the one used by ARM) later.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From BGB@21:1/5 to Anton Ertl on Mon Nov 27 13:41:12 2023
    On 11/27/2023 3:17 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/26/2023 7:29 PM, Quadibloc wrote:
    On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:

    But Integer Multiply and Divide can share the FPU that does these.

    But giving each one its own multiplier means more superscalar
    goodness!

    Having two multipliers that serve both purposes means even more
    superscalar goodness for similar area cost. However, there is the
    issue of latency. The Willamette ships the integers over to the FPU
    for multiplication, and the result back, crossing several clock
    domains (at one clock loss per domain crossing), resulting in a
    10-cycle latency for integer multiplication. I think that these days
    every high-performance core with real silicon invests into separate
    GPR and FP/SIMD (including integer SIMD) multipliers.


    I ended up with different multipliers mostly because the requirements
    are different...


    In most code, FPU ops are comparably sparse

    In terms of executed ops, that depends very much on the code. GP
    cores have acquired SIMD cores primarily for FP ops, as can be seen by
    both SSE and supporting only FP at first, and only later adding
    integer stuff, because it cost little extra. Plus, we have added GPUs
    that are now capable of doing huge amounts of FP ops, with uses in
    graphics rendering, HPC and AI.


    Yeah, this was not to say there is not FPU dense code, or that FP-SIMD
    is not useful, but rather that in most general code, ALU and LD/ST ops
    tend to dominate by a fair margin.

    Similarly, SIMD ops may still be useful, even if they are a relative
    minority of the executed instructions (even in code sequences which are actively using SIMD ops...).


    Like, typically, for every SIMD op used, there are also things like:
    The loads and stores to get the value from memory and put the results in memory;
    ALU ops to calculate the index into the array or similar that we are
    loading from or storing to;
    ...

    Well, along with other ops, like shuffles and similar to get the SIMD
    elements into the desired order, etc.


    Like, some of this is why it is difficult to get anywhere near the
    theoretical 200 MFLOPs of the SIMD unit, apart from very contrived
    use-cases (such as running neural-net code), and it has involved wonk,
    like operations which combine a SIMD shuffle into the SIMD ADD/SUB/MUL ops.


    For a lot of other use cases, I can just be happy enough that the SIMD
    ops are "not slow".


    Otherwise, one can debate whether or not having DIV/MOD in hardware
    makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV is
    at least "probably" faster than a generic software-only solution).

    That debate has been held, and MIPS has hardware integer divide, Alpha
    and IA-64 don't have a hardware integer divide; they both have FP
    divide instructions.


    Technically, also, BJX2 has ended up having both Integer DIV and FDIV instructions. But, they don't gain all that much, so are still left as
    optional features.

    The integer divide isn't very fast, but it doesn't matter if it isn't
    used all that often.


    The FDIV is effectively a boat anchor (around 122 clock cycles).
    Though, mostly this was based on the observation that with some minor
    tweaks, the integer divide unit could be made to perform floating-point
    divide as well.

    The main merit though (over a software N-R divider) is that it can
    apparently give exact results (my N-R dividers generally can't converge
    past the low order 4 bits).



    However, looking at more recent architectures, the RISC-V M extension
    (which is part of RV64G and RV32G, i.e., a standard extension) has not
    just multiply instructions (MUL, MULH, MULHU, MULHSU, MULW), but also
    integer divide instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW,
    and REMUW. ARM A64 also has divide instructions (SDIV, UDIV), but
    RISC-V seems significant to me because there the philosophy seems to
    be to go for minimalism. So the debate has apparently come to the
    conclusion that for general-purpose architectures, you include an
    integer divide instruction.


    Yeah.

    I mostly ended up adding integer divide so that the RISC-V mode could
    support the 'M' extension, and if I have it for RISC-V, may as well also
    add it in BJX2 as its own optional extension.


    Had also added a "FAZ DIV" special case that made the integer divide
    faster, where over a limited range of input values, the integer divide
    would be turned into a multiply by reciprocal.

    So, say, DIV latency:
    64-bit: 68 cycle
    32-bit: 36 cycle
    32-bit FAZ: 3 cycle.

    As it so happens, FAZ also covers a similar range to that typically used
    for a rasterizer, but is more limited in that it only handles a range of
    values it can calculate exactly. For my hardware rasterizer, a similar
    strategy was used, but extended to support bigger divisors at the
    tradeoff that it becomes inexact with larger divisors.

    However, adding a "Divide two integers quickly, but may give an
    inaccurate result" instruction would be a bit niche (and normal C code
    couldn't use it without an intrinsic or similar).

    Partial observation is that mostly, the actual bit patterns in the
    reciprocals tends to be fairly repetitive, so it is possible to
    synthesize a larger range of reciprocals using lookup tables and shift-adjustments.
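    The lookup-table approach can be sketched like this (illustrative only; the table size and the 16 fractional bits are arbitrary choices here, and as noted the scheme turns inexact once divisors or dividends grow past the range the table precision covers).

```c
#include <stdint.h>

/* Rasterizer-style divide via a table of 16.16 fixed-point reciprocals.
 * Exact only over a limited range; larger operands drift off the true
 * quotient, which rasterizers can typically tolerate. */
static uint32_t rcp_tab[256];

static void rcp_init(void)
{
    for (uint32_t d = 1; d < 256; d++)
        rcp_tab[d] = (65536u + d - 1) / d;      /* ceil(2^16 / d) */
}

static uint32_t fast_divu(uint32_t x, uint32_t d)   /* d in 1..255 */
{
    return (x * rcp_tab[d]) >> 16;
}
```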

    In the hardware rasterizer, I had experimented with a fixed-point 1/Z
    divider for "more accurate" perspective-correct rasterization, but this
    feature isn't cheap (fixed-point reciprocal is more complicated), and
    ended up not using it for now (in favor of merely dividing S and T by Z
    and then multiplying by the interpolated Z again during rasterization,
    as a sort of less accurate "poor man's" version).
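    The difference between proper perspective correction and the "poor man's" version can be stated as a small sketch (illustrative math only; function names are made up): the proper form interpolates s/z and 1/z and divides, while the cheap form multiplies the interpolated s/z by a linearly interpolated z, which only agrees when z varies little across the primitive.

```c
#include <math.h>

/* Interpolating a texture coordinate s along an edge with depths z0, z1:
 *   exact:     lerp(s0/z0, s1/z1, t) / lerp(1/z0, 1/z1, t)
 *   poor man:  lerp(s0/z0, s1/z1, t) * lerp(z0, z1, t)
 * The latter avoids a per-pixel divide but drifts as z0 and z1 diverge,
 * which is why primitives still need subdividing. */
static double lerp(double a, double b, double t) { return a + (b - a) * t; }

static double persp_exact(double s0, double s1, double z0, double z1, double t)
{
    return lerp(s0 / z0, s1 / z1, t) / lerp(1.0 / z0, 1.0 / z1, t);
}

static double persp_poor(double s0, double s1, double z0, double z1, double t)
{
    return lerp(s0 / z0, s1 / z1, t) * lerp(z0, z1, t);
}
```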

    The "poor man's perspective correct" strategy doesn't eliminate the need
    to subdivide primitives, but does allow the primitives to be somewhat
    larger without having as much distortion (mostly relevant as my GL
    renderer is still mostly CPU bound in the transform and subdivision
    stages, *1).


    In theory, proper perspective correct would entirely eliminate the need
    to subdivide primitives, but would require clipping geometry to the
    viewing frustum (otherwise, the texture coordinates freak out for
    primitives crossing outside the frustum or crossing the near plane).

    Approximate FDIV is possible, but I have typically used a different
    strategy for the reciprocal.



    *1: Though, ironically, this does also mean that, via multi-texturing,
    it is semi-viable to also use lightmap lighting again in GLQuake (since
    this doesn't add too much CPU penalty over the non-multitexture case).


    However, dynamic lightmaps still aren't viable, as the CPU-side cost of
    drawing the dynamic light-sources to the lightmaps, and then uploading
    them, is still a bit too much.

    I suspect, probably in any case, GLQuake wasn't likely written to run on
    a 50MHz CPU.


    Might have been a little easier if the games had poly counts and
    draw distances more on par with PlayStation games (say, if the whole
    scene is kept under 500 polys, rather than 1k-2k polys).

    Say, Quake being generally around 1k-2k polys for the scene, and roughly 300-500 triangles per alias model, ... Despite the scenes looking crude
    with large flat surfaces, most of these surfaces had already been hacked
    up a fair bit by the BSP algorithm (though, theoretically, some
    annoyances could be reduced if the Quake1 maps were rebuilt using a
    modified Q3BSP or similar, as the Quake3 tools natively support vertex
    lighting and also don't hack up the geometry quite as much as the
    original Quake1 BSP tool did, but alas...).

    Wouldn't save much, as I still couldn't legally redistribute the data
    files (and my GLQuake port isn't particularly usable absent using a
    process to convert all the textures into DDS files and rebuilding the
    PAK's and similar, so...).

    To have something redistributable, would need to replace all of the
    textures, sound effects, alias models, etc. An old (probably long dead) "OpenQuartz" project had partly done a lot of this, but "got creative"
    with the character models in a way I didn't like (would have preferred something at least "in the same general spirit" as the original models).

    Similar sort of annoyances as with FreeDoom, but at least FreeDoom
    stayed closer to the original in these areas.



    Also, possibly, I may need to rework some things in terms of how TKRA-GL
    works to better match up with more recent developments (all this is
    still a bit hacky; effectively linking the whole GL implementation to
    the binary that uses GL).

    Granted, it is also possible I may need to at some point move away from
    linking the whole TestKern kernel to any programs that use it as well,
    with the tradeoff that programs would no longer be able to launch "bare
    metal" in the emulator (but would always need the kernel to also be
    present).


    For cases like divide-by-constant, also, typically it is possible to
    turn it in the compiler into a multiply by reciprocal.

    But for that you want a multiply instruction, which in the RISC-V case
    means including the M extension, which also includes divide
    instructions. Multiplying by the reciprocal may still be faster.


    Yeah.

    For the relevant cases, widening multiply for 32-bit integers generally
    has a 3 cycle latency in my case.

    Whereas DIV will often have a 36 cycle latency in the general case.
    So, the choice is obvious.

    Granted, this still generally needs a shift instruction following the
    multiply.


    Having a "do a widening 32-bit multiply and then right-shift the result
    32/33/34/35 bits" instruction would be possible in theory, but a little wonky.

    Could possibly make sense if it could also encode a 32-bit immediate:
    DMULSH33S Rm, Imm32u, Rn //Int32: Rn=(Rm*Imm32)>>33
    DMULSH33U Rm, Imm32u, Rn //UInt32: Rn=(Rm*Imm32)>>33

    Probably don't really have the encoding space to pull this off though
    (with a 64-bit encoding; Could be doable in theory with an 96-bit
    FE-FF-OP encoding though).

    Though, as can be noted, this could likely borrow some of the logic from
    the FAZ-DIV mechanism or similar.

    It would allow handling "y=x/C;" cases in 3L/1T, though integer
    division still isn't really all that common (at least, not enough to
    justify going through all this just to save a right-shift op).
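    For what it's worth, the semantics of the proposed fused op are easy to pin down in C (a sketch of the hypothetical instruction, not an existing one; paired here with the standard unsigned /3 magic constant to show the intended use).

```c
#include <stdint.h>

/* Semantics of the proposed DMULSH33U Rm, Imm32u, Rn:
 * widening 32x32 multiply, then a fixed right shift of 33. */
static uint32_t dmulsh33u(uint32_t rm, uint32_t imm32)
{
    return (uint32_t)(((uint64_t)rm * imm32) >> 33);
}
```

    With Imm32 = 0xAAAAAAAB = ceil(2^33/3), dmulsh33u(x, 0xAAAAAAAB) yields x/3, folding the usual multiply-plus-shift pair into a single operation.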


    - anton

  • From MitchAlsup@21:1/5 to Anton Ertl on Tue Nov 28 01:05:21 2023
    Anton Ertl wrote:

    Quadibloc <quadibloc@servername.invalid> writes:
    On Mon, 27 Nov 2023 09:17:59 +0000, Anton Ertl wrote:

    That debate has been held, and MIPS has hardware integer divide, Alpha
    and IA-64 don't have a hardware integer divide; they both have FP divide
    instructions.

    I didn't know this about the Itanium. All I remembered hearing was that
    the Itanium "didn't have a divide instruction", and so I didn't realize
    this applied to fixed-point arithmetic only.

    You are right, I was wrong. <http://gec.di.uminho.pt/discip/minf/ac0203/icca03/ia64fpbf1.pdf>
    says:

    |A number of floating-point operations defined by the IEEE Standard are
    |deferred to software by the IA-64 architecture in all its
    |implementations
    |
    |* floating-point divide (integer divide, which is based on the
    | floating-point divide operation, is also deferred to software)

    The paper goes on to describe that FP division a/b is based on
    determining 1/b through Newton-Raphson approximation, using the FMA instruction, and then multiplying with a. It shows a sequence of 13 instructions for double precision, 1 frcpa and 12 fma or fnma
    instructions.

    Given that integer division is based on FP division, it probably takes
    even more instructions.

    You have to synthesize the bits between 63 and 53 to get 64/64->64.

    Meanwhile, a lowly Cortex-A53 with its divide instruction produces an
    integer division result with a small quotient in a few cycles. In
    other words, if you want to use that method for division, still
    provide a division instruction, and then implement it with that
    method. This provides the flexibility to switch to some other method
    (e.g., the one used by ARM) later.

    - anton

  • From MitchAlsup@21:1/5 to BGB on Tue Nov 28 00:58:32 2023
    BGB wrote:

    On 11/26/2023 7:29 PM, Quadibloc wrote:
    On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:

    But Integer Multiply and Divide can share the FPU that does these.

    But giving each one its own multiplier means more superscalar
    goodness!

    But you _do_ have a good point. This kind of "superscalar goodness"
    is usually wasted, because most programs don't do an equal amount
    of integer and FP arithmetic. So it would be way better to have
    twice as many multipliers and dividers that are designed to be used
    either way than having units of two kinds.

    I'm assuming, though, that one can make the multipliers and dividers
    simpler, hence faster, by making them single-use, and that it's
    possible to know, for a given use case, what the ratio between integer
    and FP operations will be. After all, integer multiplication discards
    the MSBs of the result, and floating-point multiplication discards the
    LSBs of the result.


    If one wants to have the most superscalar goodness, for general case
    code, this means mostly lots of simple ALU ops (particularly, ADD/SUB/AND/OR/SHIFT), and ability to do *lots* of memory Load/Store ops.

    Where, say, Load/Store ops and simple ALU ops tend to dominate over
    pretty much everything else in terms of usage frequency.

    LD+ST 33%
    Int 45%

    In most code, FPU ops are comparably sparse,

    FP 9%
    as is MUL,
    MUL 2%
    and DIV/MOD is
    DIV 0.3%
    fairly rare in comparison to either. MUL is common enough that you want
    it to be fast when it happens, but not so common to where one is dealing
    with piles of them all at the same time (except maybe in edge cases,
    like the DCT transform in JPEG/MPEG).

    MUL is much more common in FORTRAN than in C-like languages as it forms
    the underlying math for multidimensional arrays, and more ×s exist in
    typical number crunching code (F) than in pointer chasing (C) languages.

    If FMUL and MUL can't be co issued, this is unlikely to matter.

    Since both are pipelined, this is a small stall {{Unlike MIPS R2000 where
    IMUL was 12 cycles blocking while FMUL was 3 (or was it 4) pipelined.}}

    DIV is uncommon enough that it can be 80 or 100 cycles or similar, and
    most code will not notice (except in rasterizers or similar, which
    actually need fast DIV, but can cheat in that they don't usually need
    exact DIV and the divisors are small, allowing for "shortcuts", such as
    using lookup tables of reciprocals or similar).

    We know that FDIV can be performed in 17 cycles in the FMUL unit, and
    this has been known since 1989......why are you making it so long ??

    0.3% IDIV × 100 cycles and you have added 0.3 cycles to the average
    instruction !!!

    Otherwise, one can debate whether or not having DIV/MOD in hardware
    makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV is
    at least "probably" faster than a generic software-only solution).

    With an 8-bit start, and Newton-Raphson iteration, SW should be able
    to do this in mid-30-cycles. {{See ITANIC FDIV SW profile}}

    For cases like divide-by-constant, also, typically it is possible to
    turn it in the compiler into a multiply by reciprocal.

    And you should if it helps.

    And we have the need to use FP values in deciding to take branches.

    What gets used in taking branches is the _condition codes_. They're
    conveyed from the integer and floating point functional units, as well
    as all the other functional units, to a central place (the program
    status word).


    Yeah, better to not have this style of condition codes...

    RISC-V, MIPS, Mc 88K, My 66000, have no condition codes......

    Like, there are other ways of doing conditional branches...

    "Compare to zero and branch" × {Signed, Unsigned, FP16, FP32, FP64}
    for 1-instruction FP branches.

    Compare and then branch on anything (2 inst :: 1 int, 1 FP) your heart desires:
    Branch {NaN, -Infinity, +Infinity, -DeNorm, +DeNorm, -Normal, +Normal,
    Zero, -zero, +zero, < <= > >= == !=, 0 < x < y, 0 <= x < y, 0 < x <= y,
    0 <= x <= y, ... }

    Can you even record all these states in a single condition code ??

    John Savard

  • From MitchAlsup@21:1/5 to Anton Ertl on Tue Nov 28 01:03:39 2023
    Anton Ertl wrote:

    BGB <cr88192@gmail.com> writes:
    On 11/26/2023 7:29 PM, Quadibloc wrote:
    On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:

    But Integer Multiply and Divide can share the FPU that does these.

    But giving each one its own multiplier means more superscalar
    goodness!

    Having two multipliers that serve both purposes means even more
    superscalar goodness for similar area cost. However, there is the
    issue of latency. The Willamette ships the integers over to the FPU
    for multiplication, and the result back, crossing several clock
    domains (at one clock loss per domain crossing), resulting in a
    10-cycle latency for integer multiplication. I think that these days
    every high-performance core with real silicon invests into separate
    GPR and FP/SIMD (including integer SIMD) multipliers.

    In most code, FPU ops are comparably sparse

    In terms of executed ops, that depends very much on the code. GP
    cores have acquired SIMD cores primarily for FP ops, as can be seen by
    both SSE and supporting only FP at first, and only later adding
    integer stuff, because it cost little extra. Plus, we have added GPUs
    that are now capable of doing huge amounts of FP ops, with uses in
    graphics rendering, HPC and AI.

    Once SIMD gains Integer operations, the Multiplier has to be built to
    do both, might as well use it for more things than just SIMD.

    Otherwise, one can debate whether or not having DIV/MOD in hardware
    makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV is
    at least "probably" faster than a generic software-only solution).

    That debate has been held, and MIPS has hardware integer divide, Alpha
    and IA-64 don't have a hardware integer divide; they both have FP
    divide instructions.

    And all of these led long, fruitful, productive lives before taking over -------Oh wait !!

    However, looking at more recent architectures, the RISC-V M extension
    (which is part of RV64G and RV32G, i.e., a standard extension) has not
    just multiply instructions (MUL, MULH, MULHU, MULHSU, MULW), but also
    integer divide instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW,
    and REMUW.

    All of which are possible in My 66000 using operand sign control, S-bit,
    and CARRY when you want 64×64->128 or 128/64->{64 quotient, 64 remainder}

    ARM A64 also has divide instructions (SDIV, UDIV), but
    RISC-V seems significant to me because there the philosophy seems to
    be to go for minimalism. So the debate has apparently come to the
    conclusion that for general-purpose architectures, you include an
    integer divide instruction.

    Several forms of integer DIV at least signed and unsigned.....

  • From Quadibloc@21:1/5 to Anton Ertl on Tue Nov 28 04:50:20 2023
    On Mon, 27 Nov 2023 15:54:03 +0000, Anton Ertl wrote:

    Quadibloc <quadibloc@servername.invalid> writes:
    On Mon, 27 Nov 2023 09:17:59 +0000, Anton Ertl wrote:

    That debate has been held, and MIPS has hardware integer divide, Alpha
    and IA-64 don't have a hardware integer divide; they both have FP
    divide instructions.

    I didn't know this about the Itanium. All I remembered hearing was that
    the Itanium "didn't have a divide instruction", and so I didn't realize
    this applied to fixed-point arithmetic only.

    You are right, I was wrong. <http://gec.di.uminho.pt/discip/minf/ac0203/icca03/ia64fpbf1.pdf> says:

    |A number of floating-point operations defined by the IEEE Standard are
    |deferred to software by the IA-64 architecture in all its
    |implementations
    |
    |* floating-point divide (integer divide, which is based on the
    |  floating-point divide operation, is also deferred to software)

    The paper goes on to describe that FP division a/b is based on
    determining 1/b through Newton-Raphson approximation, using the FMA instruction, and then multiplying with a. It shows a sequence of 13 instructions for double precision, 1 frcpa and 12 fma or fnma
    instructions.

    Given that integer division is based on FP division, it probably takes
    even more instructions.

    The fastest _hardware_ implementations of FP division are Goldschmidt
    division and Newton-Raphson division, and they both make use of
    multiplication hardware.

    So using Newton-Raphson division to increase the width of an FP division
    result to perform 64-bit integer division won't be too hard.

    Since the Wikipedia article didn't go into detail, I had to look further
    to find out that you _were_ right about the DEC Alpha, however. It did
    have floating divide but not integer divide. The 21264 added floating
    square root.

    John Savard

  • From Anton Ertl@21:1/5 to MitchAlsup on Tue Nov 28 09:10:37 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Anton Ertl wrote:
    <http://gec.di.uminho.pt/discip/minf/ac0203/icca03/ia64fpbf1.pdf>
    says:

    |A number of floating-point operations defined by the IEEE Standard are
    |deferred to software by the IA-64 architecture in all its
    |implementations
    |
    |* floating-point divide (integer divide, which is based on the
    | floating-point divide operation, is also deferred to software)

    The paper goes on to describe that FP division a/b is based on
    determining 1/b through Newton-Raphson approximation, using the FMA
    instruction, and then multiplying with a. It shows a sequence of 13
    instructions for double precision, 1 frcpa and 12 fma or fnma
    instructions.

    Given that integer division is based on FP division, it probably takes
    even more instructions.

    You have to synthesize the bits between 63 and 53 to get 64/64->64.

    The paper says that they use double-extended division to synthesize
    64/64-bit division; I expect that double-extended uses even more
    Newton-Raphson steps (and FMAs) than double precision; and then there
    are the additional steps for getting an integer.

    The paper only shows signed 16/16 bit division, a 15-instruction
    sequence including 1 frcpa and 3 fma/fnma instructions. So you can
    expect that 64/64 is at least 13+15-4=24 instructions long, probably
    longer (because of double-extended precision).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to MitchAlsup on Tue Nov 28 09:20:44 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Several forms of integer DIV at least signed and unsigned.....

    Gforth has unsigned, floored (signed) and symmetric (signed), in double-cell/single-cell variants (standardized in Forth) as well as in single/single variants.

    Floored rounds the quotient down, symmetric (often called truncated)
    rounds the quotient towards 0. There is also something called
    Euclidean that always produces a remainder >=0, but the practical
    difference from floored is when the divisor is negative, which is
    exceedingly rare.
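    The three conventions differ only in how the quotient rounds, which is easy to pin down in C (a sketch built on C's native truncating division):

```c
/* Truncated (symmetric): C's native behavior, quotient rounds toward 0. */
static int div_trunc(int a, int b) { return a / b; }

/* Floored: quotient rounds toward -infinity; the remainder then takes
 * the sign of the divisor. */
static int div_floor(int a, int b)
{
    int q = a / b;
    if ((a % b != 0) && ((a < 0) != (b < 0)))
        q--;
    return q;
}

/* Euclidean: the remainder is always >= 0; this differs from floored
 * only when the divisor is negative. */
static int div_eucl(int a, int b)
{
    int q = a / b, r = a % b;
    if (r < 0)
        q += (b > 0) ? -1 : 1;
    return q;
}
```

    For a positive divisor the floored and Euclidean results coincide, which is why the distinction rarely matters in practice.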

    Given the rareness of negative divisors, one might wonder about signed/unsigned, but at least implementations of current programming
    languages would probably make little use of such instructions even if
    they were cheaper than the signed/signed ones.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From BGB@21:1/5 to MitchAlsup on Tue Nov 28 04:32:58 2023
    On 11/27/2023 6:58 PM, MitchAlsup wrote:
    BGB wrote:

    On 11/26/2023 7:29 PM, Quadibloc wrote:
    On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:

    But Integer Multiply and Divide can share the FPU that does these.

    But giving each one its own multiplier means more superscalar
    goodness!

    But you _do_ have a good point. This kind of "superscalar goodness"
    is usually wasted, because most programs don't do an equal amount
    of integer and FP arithmetic. So it would be way better to have
    twice as many multipliers and dividers that are designed to be used
    either way than having units of two kinds.

    I'm assuming, though, that one can make the multipliers and dividers
    simpler, hence faster, by making them single-use, and that it's
    possible to know, for a given use case, what the ratio between integer
    and FP operations will be. After all, integer multiplication discards
    the MSBs of the result, and floating-point multiplication discards the
    LSBs of the result.


    If one wants to have the most superscalar goodness, for general case
    code, this means mostly lots of simple ALU ops (particularly,
    ADD/SUB/AND/OR/SHIFT), and ability to do *lots* of memory Load/Store ops.

    Where, say, Load/Store ops and simple ALU ops tend to dominate over
    pretty much everything else in terms of usage frequency.

    LD+ST 33%
    Int   45%

    In most code, FPU ops are comparably sparse,

    FP     9%
                                                 as is MUL,
    MUL    2%
                                                            and DIV/MOD is
    DIV  0.3%

    Yeah, pretty much...


    fairly rare in comparison to either. MUL is common enough that you
    want it to be fast when it happens, but not so common to where one is
    dealing with piles of them all at the same time (except maybe in edge
    cases, like the DCT transform in JPEG/MPEG).

    MUL is much more common in FORTRAN than in C-like languages as it forms
    the underlying math for multidimensional arrays, and more ×s exist in typical number crunching code (F) than in pointer chasing (C) languages.


    Yeah.

    Multidimensional arrays are comparably rarely used in C land, and when
    they are used, they typically have power-of-2 dimensions.


    If FMUL and MUL can't be co-issued, this is unlikely to matter.

    Since both are pipelined, this is a small stall {{Unlike MIPS R2000 where IMUL was 12 cycles blocking while FMUL was 3 (or was it 4) pipelined.}}


    Slow IMUL but fast FMUL: WTF?...
    Unless this was only counting single precision, then it could make more
    sense.


    DIV is uncommon enough that it can be 80 or 100 cycles or similar, and
    most code will not notice (except in rasterizers or similar, which
    actually need fast DIV, but can cheat in that they don't usually need
    exact DIV and the divisors are small, allowing for "shortcuts", such
    as using lookup tables of reciprocals or similar).

    We know that FDIV can be performed in 17 cycles in the FMUL unit, and
    this has been known since 1989......why are you making it so long ??

    0.3% IDIV × 100 cycles and you have added 3 cycles to the average instruction !!!


    Can note that using DIV/MOD seems a lot rarer in my case...

    But, I can note that my compiler turns constant divide into alternatives:
    y=x/C; //~ y=(x*RCP)>>SHR;
    y=x%C; //~ y=x-(((x*RCP)>>SHR)*C);

    Though, with some extra wonk thrown in so negative values round the
    divide towards zero and similar (otherwise, negative values would round
    away from zero, which is not the commonly accepted result).

    y=((x+((x<0)?(C-1):0))*RCP)>>SHR;

    The variable cases are comparably infrequent.
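    As a concrete sketch of the transform being described (the constant
    below is the classic unsigned divide-by-10 magic pair, not necessarily
    what BGBCC actually emits):

```c
#include <stdint.h>

/* x/10 via multiply-by-reciprocal: RCP = ceil(2^35 / 10) = 0xCCCCCCCD,
   SHR = 35.  This particular pair is exact for every 32-bit unsigned x;
   the signed round-toward-zero fixup is the extra wonk described above. */
static uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDull) >> 35);
}
```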



    Otherwise, one can debate whether or not having DIV/MOD in hardware
    makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV
    is at least "probably" faster than a generic software-only solution).

    With an 8-bit start, and Newton-Raphson iteration, SW should be able
    to do this in mid-30-cycles. {{See ITANIC FDIV SW profile}}


    I was thinking of integer divide here (being 68 cycle for DIVS.Q/DIVU.Q
    in my case; 36 for DIVS.L/DIVU.L).

    But, yeah, the FDIV instruction is even slower (120 cycles); and
    software Newton-Raphson will win this race...
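    For illustration, the Newton-Raphson reciprocal iteration looks like
    this in C (the seed constant and iteration count are assumptions for
    the sketch; a real FDIV replacement still needs a correctly rounded
    final step):

```c
#include <stdint.h>
#include <string.h>

/* Reciprocal by Newton-Raphson: r' = r*(2 - d*r) squares the relative
   error each step, so even a crude seed converges in a few multiply-add
   shaped operations. */
static double nr_recip(double d)      /* assumes d positive and normal */
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    u = 0x7FDE000000000000ull - u;    /* exponent-flip seed, ~6-8% error */
    double r;
    memcpy(&r, &u, sizeof r);
    for (int i = 0; i < 5; i++)       /* ~8% -> ~1e-16 relative error */
        r = r * (2.0 - d * r);
    return r;                         /* close, but NOT correctly rounded */
}
```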


    Partly this is the drawback of using a Shift-Add design.


    Say, every clock-cycle, the value is shifted left by 1 bit, and logic determines whether or not to add a second value to the working value.

    For multiply, you can use the bits of one input to mask whether or not
    to add the other input. For divide, the carry of the adder can be used
    to mask whether to add the inverse of the divisor.

    For FDIV, one runs the divider for more clock cycles, and it generates
    bits below the binary point.
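    A software model of that loop (restoring form for clarity; where this
    uses a compare, the hardware uses the adder's carry-out):

```c
#include <stdint.h>
#include <stddef.h>

/* One quotient bit per clock, exactly the shift-subtract loop described:
   shift the remainder left, bring in the next dividend bit, and subtract
   the divisor whenever it fits.  Requires d != 0. */
static uint32_t shift_sub_div(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint64_t r = 0;                       /* working remainder */
    uint32_t q = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1);    /* shift in next dividend bit */
        q <<= 1;
        if (r >= d) {                     /* adder carry-out in hardware */
            r -= d;
            q |= 1;
        }
    }
    if (rem) *rem = (uint32_t)r;
    return q;
}
```

    Running the same loop for extra steps past the integer bits is what
    produces the fraction bits mentioned for FDIV.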



    Though, at least for the merit of an Integer DIV op, it is faster to use
    a 36-cycle integer divide instruction than it is to run a shift-subtract
    loop in software.



    For cases like divide-by-constant, also, typically it is possible to
    turn it in the compiler into a multiply by reciprocal.

    And you should if it helps.


    Yeah, it does.
    Multiply by reciprocal is considerably faster.


    And we have the need to use FP values in deciding to take branches.

    What gets used in taking branches is the _condition codes_. They're
    conveyed from the integer and floating point functional units, as well
    as all the other functional units, to a central place (the program
    status word).


    Yeah, better to not have this style of condition codes...

    RISC-V, MIPS, Mc 88K, My 66000, have no condition codes......


    Yeah.

    In BJX2, there is only SR.T, however, the way it is updated and used is
    much narrower in scope.

    A case could be made for narrowing its scope further though.


    Like, there are other ways of doing conditional branches...

    "Compare to zero and branch" × {Signed, Unsigned, FP16, FP32, FP64}
    for 1 instruction FP branches


    Yeah, I had ended up adding this strategy as well as the prior SR.T
    mechanism.


    Compare and then branch on anything (2 inst :: 1 int, 1 FP) your heart
    desires: Branch {NaN, -Infinity, +Infinity, -DeNorm, +DeNorm, -Normal,
    +Normal, Zero, -Zero, +Zero, < <= > >= == !=, 0 < x < y, 0 <= x < y,
    0 < x <= y, 0 <= x <= y, ... }
    Can you even record all these states in a single condition code ??


    Possible.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marko Zec@21:1/5 to MitchAlsup on Tue Nov 28 13:16:48 2023
    MitchAlsup <mitchalsup@aol.com> wrote:
    ...
    0.3% IDIV × 100 cycles and you have added 3 cycles to the average
    instruction !!!

    Not really, more likely it would be a 0.33 average CPI bump, assuming
    a 1.0 baseline CPI for non-DIV instructions. Still way too much to be ignored...

  • From BGB@21:1/5 to Marko Zec on Tue Nov 28 13:00:11 2023
    On 11/28/2023 7:16 AM, Marko Zec wrote:
    MitchAlsup <mitchalsup@aol.com> wrote:
    ...
    0.3% IDIV × 100 cycles and you have added 3 cycles to the average
    instruction !!!

    Not really, more likely it would be a 0.33 average CPI bump, assuming
    a 1.0 baseline CPI for non-DIV instructions. Still way too much to be ignored...

    I missed that, but yeah.

    Still, it seems the incidence of the integer DIV/MOD/etc instructions
    being used is somewhat rarer than this estimate in the code I was measuring.

    Checking my compiler output for static use count:
    Around 0.05% ...


    It might have been more common though, if my compiler didn't
    aggressively eliminate the divide by constant cases (and my coding
    practices didn't actively avoid using division in general, etc).

    Since, between either a slowish DIV instruction, or a software
    shift-subtract loop, neither was expected to be all that fast.


    On pretty much every system I have used IRL, division has tended to be
    fairly slow, so I had tended to avoid it whenever possible.


    In cases where fast divide was needed, there were often other
    workarounds (such as using a lookup table of reciprocals).


    If inexact results are acceptable, it is possible to stretch the table
    of reciprocals up to pretty much arbitrary divisors (via normalization
    and some shift-related trickery).
    In software, this is helped with a CLZ instruction or similar, but had
    also used a variant of this effectively in my hardware rasterizer module.

    Say:
    z=x/y;
    y:
    0.. 31: Direct multiply by MAX/y (Table A)
    32.. 63: Normalized multiply by MAX/y (Table B)
    64..127: Use (y>>1) as index to Table B, add 1 to final SHR
    128..255: Use (y>>2) as index to Table B, add 2 to final SHR
    ...

    The sizes of Tables A/B can be reduced (via another lookup table) by
    making use of repetitions in the bit patterns (though, this is more
    practical in Verilog than in software).
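    A software sketch of that normalization scheme (the table size and
    16-bit fixed-point width here are illustrative, not the actual
    Verilog):

```c
#include <stdint.h>

/* Approximate x/y with one small reciprocal table: entries hold
   ceil(2^16 / y) for y in [1,63]; larger divisors are shifted down
   into range and the shift is added back to SHR.  Inexact by design. */
static uint32_t approx_div(uint32_t x, uint32_t y)    /* y >= 1 */
{
    static uint32_t rcp[64];
    if (!rcp[1])                           /* lazy one-time table fill */
        for (uint32_t i = 1; i < 64; i++)
            rcp[i] = (0x10000u + i - 1) / i;

    int shr = 16;
    while (y >= 64) { y >>= 1; shr++; }    /* normalize the divisor */
    return (uint32_t)(((uint64_t)x * rcp[y]) >> shr);
}
```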

    ...


    Though, getting acceptable results from fixed-point division is a little
    more of an issue.

  • From Stephen Fuld@21:1/5 to BGB on Tue Nov 28 11:22:48 2023
    On 11/28/2023 11:00 AM, BGB wrote:

    snip

    Still, it seems the incidence of the integer DIV/MOD/etc instructions
    being used is somewhat rarer than this estimate in the code I was
    measuring.

    Is this a result of your using mostly a few old games as your
    "reference" code? i.e. your code base may be different from what gave
    rise to the statistics mentioned by others.

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From MitchAlsup@21:1/5 to Quadibloc on Tue Nov 28 19:37:04 2023
    Quadibloc wrote:

    On Mon, 27 Nov 2023 15:54:03 +0000, Anton Ertl wrote:

    Quadibloc <quadibloc@servername.invalid> writes:
    On Mon, 27 Nov 2023 09:17:59 +0000, Anton Ertl wrote:

    That debate has been held, and MIPS has hardware integer divide, Alpha
    and IA-64 don't have a hardware integer divide; they both have FP
    divide instructions.

    I didn't know this about the Itanium. All I remembered hearing was that
    the Itanium "didn't have a divide instruction", and so I didn't realize
    this applied to fixed-point arithmetic only.

    You are right, I was wrong.
    <http://gec.di.uminho.pt/discip/minf/ac0203/icca03/ia64fpbf1.pdf> says:

    |A number of floating-point operations defined by the IEEE Standard are
    |deferred to software by the IA-64 architecture in all its
    |implementations
    |* floating-point divide (integer divide, which is based on the
    |floating-point divide operation, is also deferred to software)

    The paper goes on to describe that FP division a/b is based on
    determining 1/b through Newton-Raphson approximation, using the FMA
    instruction, and then multiplying with a. It shows a sequence of 13
    instructions for double precision, 1 frcpa and 12 fma or fnma
    instructions.

    Given that integer division is based on FP division, it probably takes
    even more instructions.

    The fastest _hardware_ implementations of FP division are Goldschmidt
    division and Newton-Raphson division, and they both make use of
    multiplication hardware.

    Goldschmidt is about 2× faster than N-R because:: the Goldschmidt inner
    loop has 2 independent multiplies, while N-R has 2 dependent multiplies.
    To meet IEEE 754 accuracy, the last iteration in Goldschmidt is ½ an N-R
    iteration {{you perform the N-R 1st step and then use the sign bit to
    determine +1 or +2, and the +0 can be determined while the multiply is
    transpiring.}}

    So using Newton-Raphson division to increase the width of an FP division result to perform 64-bit integer division won't be too hard.

    Since the Wikipedia article didn't go into detail, I had to look further
    to find out that you _were_ right about the DEC Alpha, however. It did
    have floating divide but not integer divide. The 21264 added floating
    square root.

    John Savard

  • From BGB@21:1/5 to Stephen Fuld on Tue Nov 28 14:09:19 2023
    On 11/28/2023 1:22 PM, Stephen Fuld wrote:
    On 11/28/2023 11:00 AM, BGB wrote:

    snip

    Still, it seems the incidence of the integer DIV/MOD/etc instructions
    being used is somewhat rarer than this estimate in the code I was
    measuring.

    Is this a result of your using mostly a few old games as your
    "reference" code? i.e. your code base may be different from what gave
    rise to the statistics mentioned by others.


    Possibly.

    There is my C library and kernel/runtime code;
    And games like Doom and Quake and similar.

    I suspect probably ID also avoided integer division when possible as
    well, as it is rarely used (and was probably still slow back in
    early/mid 90s PCs).

    Looking up, 486 and Pentium 1 still had ~ 40+ cycle 32-bit DIV/IDIV.
    ...


    Looks like MUL/IMUL had similar latency to DIV/IDIV on the 386 and 486,
    but then the latency dropped down a fair bit on the Pentium.

    However, Doom makes extensive use of integer multiply.



    Given the timings, I suspect it is possible they were also using a
    Shift-Add design for the multiply/divide unit.

    In my case, I am using Shift-Add for DIV and also 64-bit multiply, but a
    faster DSP48 based multiplier for 32-bit multiply.

    If I were to make a guess based on the latency, I would guess the P1 was
    using something like a radix-16 long-multiply.


    Comparably, it seems DIV/IDIV didn't actually get much faster until the
    Ryzen.



    Have noted that these old games did have a nifty trick for calculating approximate distance, say:
    dx=x0-x1;
    dy=y0-y1;
    adx=dx^(dx>>31);
    ady=dy^(dy>>31);
    if(ady>adx)
    { t=adx; adx=ady; ady=t; } //common
    // { adx^=ady; ady^=adx; adx^=ady; } //sometimes
    d=adx+(ady>>1);


    Can be extended to 3D without too much issue, but for 4D, the sorting
    step becomes more of an issue.

    Kinda makes sense though, as finding the square-root of things is also expensive...
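    Packaged as a function, the trick above is the classic "alpha max plus
    beta min" estimate with alpha=1, beta=1/2: it overestimates the true
    Euclidean distance by at most about 11.8%, and is exact on the axes.
    (This sketch uses the fully corrected branch-free abs, and assumes
    arithmetic right shift on signed ints, as essentially all targets
    provide.)

```c
/* d = max(|dx|,|dy|) + min(|dx|,|dy|)/2, all branch-free except the
   final select. */
static int approx_dist(int x0, int y0, int x1, int y1)
{
    int dx = x0 - x1, dy = y0 - y1;
    int sx = dx >> 31, sy = dy >> 31;     /* sign masks: -1 or 0 */
    int adx = (dx ^ sx) - sx;             /* branch-free abs */
    int ady = (dy ^ sy) - sy;
    return (ady > adx) ? ady + (adx >> 1)
                       : adx + (ady >> 1);
}
```

    The quoted game code omits the "- sx" correction, so its abs is off by
    one for negative deltas; for an estimate this coarse, that hardly
    matters.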


    But, not sure how a lot of this compares with other (non-game) coding practices.


    Have noticed though that a lot of these games really do like using lots
    of global variables (a lot more so than I would prefer to use in my
    coding practices).

    Though, this is one case where it is useful to be able to load/store
    global variables in a single instruction (and using something like a GOT
    may have a significant penalty on code that uses lots of global variables).

  • From BGB@21:1/5 to MitchAlsup on Tue Nov 28 14:13:26 2023
    On 11/27/2023 7:03 PM, MitchAlsup wrote:
    Anton Ertl wrote:

    BGB <cr88192@gmail.com> writes:
    On 11/26/2023 7:29 PM, Quadibloc wrote:
    On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:

    But Integer Multiply and Divide can share the FPU that does these.

    But giving each one its own multiplier means more superscalar
    goodness!

    Having two multipliers that serve both purposes means even more
    superscalar goodness for similar area cost.  However, there is the
    issue of latency.  The Willamette ships the integers over to the FPU
    for multiplication, and the result back, crossing several clock
    domains (at one clock loss per domain crossing), resulting in a
    10-cycle latency for integer multiplication.  I think that these days
    every high-performance core with real silicon invests into separate
    GPR and FP/SIMD (including integer SIMD) multipliers.

    In most code, FPU ops are comparably sparse

    In terms of executed ops, that depends very much on the code.  GP
    cores have acquired SIMD cores primarily for FP ops, as can be seen by
    both SSE and supporting only FP at first, and only later adding
    integer stuff, because it cost little extra.  Plus, we have added GPUs
    that are now capable of doing huge amounts of FP ops, with uses in
    graphics rendering, HPC and AI.

    Once SIMD gains Integer operations, the Multiplier has to be built to
    do both, might as well use it for more things than just SIMD.


    Well, and/or just use separate multipliers for each, under the rationale
    that, at least on an FPGA, it already has the DSP48's whether or not one
    uses them (and a lot more DSP48's than I can really make effective use of).

    Well, with the caveat that they only natively do signed 18-bit multiply.



    Otherwise, one can debate whether or not having DIV/MOD in hardware
    makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV
    is at least "probably" faster than a generic software-only solution).

    That debate has been held, and MIPS has hardware integer divide, Alpha
    and IA-64 don't have a hardware integer divide; they both have FP
    divide instructions.

    And all of these led long, fruitful, productive lives before taking over -------Oh wait !!


    MIPS at least did moderately well for a time, might have done better if
    it were more open.

    Both Alpha and IA-64 had a lot of questionable seeming design choices.


    However, looking at more recent architectures, the RISC-V M extension
    (which is part of RV64G and RV32G, i.e., a standard extension) has not
    just multiply instructions (MUL, MULH, MULHU, MULHSU, MULW), but also
    integer divide instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW,
    and REMUW.

    All of which are possible in My 66000 using operand sign control, S-bit,
    and CARRY when you want 64×64->128 or 128/64->{64 quotient, 64 remainder}

    ARM A64 also has divide instructions (SDIV, UDIV), but
    RISC-V seems significant to me because there the philosophy seems to
    be to go for minimalism.  So the debate has apparently come to the
    conclusion that for general-purpose architectures, you include an
    integer divide instruction.

    Several forms of integer DIV at least signed and unsigned.....

    I had left it as optional, but I guess, optional is not the same as
    absent, and my main profiles had ended up including it.


    The ops are generally given names like:
    DIVS.L, DIVU.L, DIVS.Q, DIVU.Q
    MODS.L, MODU.L, MODS.Q, MODU.Q

    One may notice a curious level of overlap with the RISC-V ops here...

    Though, for 32-bit:
    MULS.L, MULU.L: Perform a narrow multiply.
    DMULS.L, DMULU.L: Perform a widening multiply.
    Along with:
    MULS.Q, MULU.Q: 64-bit multiply
    MULHS.Q, MULHU.Q: 64-bit multiply, high 64 bits of result


    DIV, MOD, and 64-bit multiply, is drastically slower than 32-bit
    multiply though.

  • From MitchAlsup@21:1/5 to BGB on Thu Nov 30 16:23:05 2023
    BGB wrote:

    Have noted that these old games did have a nifty trick for calculating approximate distance, say:
    dx=x0-x1;
    dy=y0-y1;
    adx=dx^(dx>>31);
    ady=dy^(dy>>31);
    if(ady>adx)
    { t=adx; adx=ady; ady=t; } //common
    // { adx^=ady; ady^=adx; adx^=ady; } //sometimes
    d=adx+(ady>>1);

    Why not::

    dx=x0-x1;
    dy=y0-y1;
    adx=dx^(dx>>31);
    ady=dy^(dy>>31);
    if(ady>adx)
    d=adx+(ady>>1);
    else
    d=ady+(adx>>1);

  • From MitchAlsup@21:1/5 to BGB on Thu Nov 30 16:25:58 2023
    BGB wrote:

    64-bit multiply, is drastically slower than 32-bit
    multiply though.

    Should only be 1 gate of delay longer......

  • From MitchAlsup@21:1/5 to Robert Finch on Thu Nov 30 17:50:10 2023
    Robert Finch wrote:

    On 2023-11-30 11:25 a.m., MitchAlsup wrote:
    BGB wrote:

                  64-bit multiply, is drastically slower than 32-bit
    multiply though.

    Should only be 1 gate of delay longer......

    In FPGA land using DSPs, a 24x16 multiply is single cycle, and 32x32
    could be single cycle if a lower fmax is acceptable, but 64x64 is best
    pipelined over several cycles (6). Thor had a fast 24x16 multiply as it
    is good enough for array address calcs in a lot of circumstances.

    Nothing stops you from building a multiplier in LUTs with Booth
    recoding, each pair of LUTs being a 3-input XOR and a 3-input Majority
    gate. This also gets rid of BGB's argument on building an incomplete
    multiplier tree {because IEEE needs 53 bits while your DSP only
    supplies 48.}

  • From MitchAlsup@21:1/5 to BGB on Thu Nov 30 20:27:39 2023
    BGB wrote:

    On 11/30/2023 10:23 AM, MitchAlsup wrote:
    BGB wrote:

    Have noted that these old games did have a nifty trick for calculating
    approximate distance, say:
       dx=x0-x1;
       dy=y0-y1;
       adx=dx^(dx>>31);
       ady=dy^(dy>>31);
       if(ady>adx)
         { t=adx; adx=ady; ady=t; }         //common
    //  { adx^=ady; ady^=adx; adx^=ady; }  //sometimes
       d=adx+(ady>>1);

    Why not::

        dx=x0-x1;
        dy=y0-y1;
        adx=dx^(dx>>31);
        ady=dy^(dy>>31);
        if(ady>adx)
            d=adx+(ady>>1);
        else
            d=ady+(adx>>1);

    As long as ISA has max() and min():: Why not::

        dx=x0-x1;
        dy=y0-y1;
        adx=dx^(dx>>31);
        ady=dy^(dy>>31);
    d = min( adx, ady ) + (max( adx, ady ) >> 1);



  • From MitchAlsup@21:1/5 to BGB on Thu Nov 30 23:15:11 2023
    BGB wrote:

    On 11/30/2023 2:27 PM, MitchAlsup wrote:
    BGB wrote:

    On 11/30/2023 10:23 AM, MitchAlsup wrote:
    BGB wrote:

    Have noted that these old games did have a nifty trick for
    calculating approximate distance, say:
       dx=x0-x1;
       dy=y0-y1;
       adx=dx^(dx>>31);
       ady=dy^(dy>>31);
       if(ady>adx)
         { t=adx; adx=ady; ady=t; }         //common
    //  { adx^=ady; ady^=adx; adx^=ady; }  //sometimes
       d=adx+(ady>>1);

    Why not::

         dx=x0-x1;
         dy=y0-y1;
         adx=dx^(dx>>31);
         ady=dy^(dy>>31);
         if(ady>adx)
             d=adx+(ady>>1);
         else
             d=ady+(adx>>1);

    As long as ISA has max() and min():: Why not::

        dx=x0-x1;
        dy=y0-y1;
        adx=dx^(dx>>31);
        ady=dy^(dy>>31);
        d = min( adx, ady ) + (max( adx, ady ) >> 1);


    Yeah, I guess that could work as well for 2D...


    Though, makes more sense to use the "__int32_min" intrinsic or similar, rather than the generic C library "min()"/"max()" macros.
    #define min(a, b) (((a)<(b))?(a):(b))
    #define max(a, b) (((a)>(b))?(a):(b))

    max() and min() are single instructions in my ISA. On the FP side, they
    even get NaNs put in the proper places, too.

    Though, apparently the existence of these macros in stdlib.h is
    controversial (they are apparently sort of a POSIX legacy feature, and
    not part of the C standard).

    It took Brian quite some time to recognize that crap and turn it all into
    max() and min() instructions.

    Mostly, due to implementation issues in BGBCC, ?: needs an actual
    branch, so is ironically actually worse for performance in this case
    than using an explicit if/else...
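    For what it's worth, min/max can also be made branch-free in portable
    C without either predication or a MIN/MAX instruction (a sketch, not
    what BGBCC emits; assumes arithmetic right shift on negatives and that
    a-b does not overflow):

```c
#include <stdint.h>

/* The sign of (a-b) becomes an all-ones/all-zeros mask that selects
   between the two operands without any branch. */
static int32_t bmin(int32_t a, int32_t b)
{
    int32_t m = (a - b) >> 31;       /* -1 if a < b, else 0 */
    return b + ((a - b) & m);        /* a when a < b, else b */
}

static int32_t bmax(int32_t a, int32_t b)
{
    int32_t m = (a - b) >> 31;
    return a - ((a - b) & m);        /* b when a < b, else a */
}
```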

    My point exactly--but perhaps you should spend the effort to make your
    compiler recognize these as single branch-free instructions.

    A case could be made for C11 "_Generic", but would need to implement it.


    But, yeah, apart from compiler issues, in theory the min/max part could
    be 3 instructions.

    max() and min() are easy to perform in a single cycle in both int and FP

    If done in ASM, might be something like:
    // R4..R7 (x0, y0, x1, y1)
    SUB R4, R6, R16 | SUB R5, R7, R17
    SHAD R16, -31, R18 | SHAD R17, -31, R19
    XOR R16, R18, R16 | XOR R17, R19, R17
    CMPGT R17, R16
    CSELT R16, R17, R18 | CSELT R17, R16, R19
    SHAD R19, -1, R19
    ADD R18, R19, R2

    And you are claiming this has an advantage over::

    MAX R9,R8,R7
    MIN R10,R8,R7
    SRL R10,R10,#1
    LEA R10,[R9,R10,#-1]


    I don't have actual MIN/MAX ops though.

    perhaps you should reconsider...

    Though, it seems RISC-V does have them in the Bitmanip extension.

    Could probably add them in-theory (maybe along with FMIN/FMAX).

    I use the same FU to perform Int max() and FP max()--remember once you
    special case out the problems, IEEE is a linearly ordered set of numbers.

  • From MitchAlsup@21:1/5 to Robert Finch on Fri Dec 1 02:46:36 2023
    Robert Finch wrote:

    Could probably add them in-theory (maybe along with FMIN/FMAX).

    I use the same FU to perform Int max() and FP max()--remember once you
    special case out the problems, IEEE is a linearly ordered set of numbers.

    Q+ has min3 / max3 for minimum or maximum of three values.
    But only fmin / fmax for float.

    Interesting, but I did not have enough room in the 3-operand subGroup.

    How do you determine the mid( x, y, z ) ?? allowing::

    mx = max( x, y, z );
    mn = min( x, y, z );
    mi = mid( x, y, z );

    ??
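    One possible answer, sketched in C with plain 2-input compares (the
    sum-minus-extremes form of mid3 assumes no intermediate overflow):

```c
/* min3/max3 by cascaded 2-input ops; mid3 falls out as the residue
   once both extremes are known. */
static int max3(int a, int b, int c)
{
    int m = a > b ? a : b;
    return m > c ? m : c;
}

static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

static int mid3(int a, int b, int c)
{
    return a + b + c - max3(a, b, c) - min3(a, b, c);
}
```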

  • From MitchAlsup@21:1/5 to BGB on Fri Dec 1 18:26:01 2023
    BGB wrote:

    On 11/30/2023 5:15 PM, MitchAlsup wrote:
    BGB wrote:

    On 11/30/2023 2:27 PM, MitchAlsup wrote:
    BGB wrote:

    On 11/30/2023 10:23 AM, MitchAlsup wrote:
    BGB wrote:

    Have noted that these old games did have a nifty trick for
    calculating approximate distance, say:
       dx=x0-x1;
       dy=y0-y1;
       adx=dx^(dx>>31);
       ady=dy^(dy>>31);
       if(ady>adx)
         { t=adx; adx=ady; ady=t; }         //common
    //  { adx^=ady; ady^=adx; adx^=ady; }  //sometimes
       d=adx+(ady>>1);

    Why not::

         dx=x0-x1;
         dy=y0-y1;
         adx=dx^(dx>>31);
         ady=dy^(dy>>31);
         if(ady>adx)
             d=adx+(ady>>1);
         else
             d=ady+(adx>>1);

    As long as ISA has max() and min():: Why not::

         dx=x0-x1;
         dy=y0-y1;
         adx=dx^(dx>>31);
         ady=dy^(dy>>31);
         d = min( adx, ady ) + (max( adx, ady ) >> 1);


    Yeah, I guess that could work as well for 2D...


    Though, makes more sense to use the "__int32_min" intrinsic or
    similar, rather than the generic C library "min()"/"max()" macros.
       #define min(a, b)   (((a)<(b))?(a):(b))
       #define max(a, b)   (((a)>(b))?(a):(b))

    max() and min() are single instructions in my ISA. On the FP side, they
    even get NaNs put in the proper places, too.

    Though, apparently the existence of these macros in stdlib.h is
    controversial (they are apparently sort of a POSIX legacy feature, and
    not part of the C standard).

    It took Brian quite some time to recognize that crap and turn it all into
    max() and min() instructions.


    OK.

    Mostly, due to implementation issues in BGBCC, ?: needs an actual
    branch, so is ironically actually worse for performance in this case
    than using an explicit if/else...

    My point exactly--but perhaps you should spend the effort to make your
    compiler recognize these as single branch-free instructions.


    I guess, it could be possible in theory to use pattern recognition on
    the ASTs to turn this into a call to an intrinsic. Might need to look
    into this.

    For the most part, hadn't really done much of any of this sort of
    pattern recognition.


    A case could be made for C11 "_Generic", but would need to implement it.


    But, yeah, apart from compiler issues, in theory the min/max part
    could be 3 instructions.

    max() and min() are easy to perform in a single cycle in both int and FP


    Theoretically, it can be done in two instructions, which is still faster
    than what ?: generates currently.

    Say:
    z=(x<y)?x:y;

    Compiling to something like, say:
    CMPGT R9, R8
    BF .L0
    MOV R9, R10
    BRA .L1
    .L0:
    MOV R8, R10
    BRA .L1
    .L1:
    MOV R10, R11

    Whereas, the normal "if()" is smart enough to turn it into predication,
    but ?: needs to use a temporary and doesn't have a predication special case.

    Well, and also predication needs to be handled partly in the frontend
    (in the generated RIL3 bytecode), but is generally omitted if the output
    is going to a RIL object (so, if compiling for a static library, it
    always falls back to actual branches).

    Partly this is in turn because my 3AC backend isn't really smart enough
    to pull this off.


    Putting effort into it wasn't really a high priority though, as ?: in C
    seems to be relatively infrequently used.



    But, as-is, one will get a 2-op sequence if they use the intrinsic.


    If done in ASM, might be something like:
       // R4..R7 (x0, y0, x1, y1)
       SUB  R4, R6, R16   |  SUB  R5, R7, R17
       SHAD R16, -31, R18 | SHAD R17, -31, R19
       XOR  R16, R18, R16 | XOR  R17, R19, R17
       CMPGT R17, R16
       CSELT R16, R17, R18 | CSELT R17, R16, R19
       SHAD R19, -1, R19
       ADD  R18, R19, R2

    And you are claiming this has an advantage over::

        MAX  R9,R8,R7
        MIN  R10,R8,R7
        SRL  R10,R10,#1
        LEA  R10,[R9,R10,#-1]


    At least in so far as it doesn't depend on features that don't currently exist.



    I don't have actual MIN/MAX ops though.

    perhaps you should reconsider...


    I didn't have them previously as I could express them in a 2-op
    sequence, which seemed "good enough".

    Though, this was before some more recent changes which (in the name of improving FPGA timing) effectively turned CMPxx and similar into 2 cycle
    ops.


    Though, MIN/MAX ops could possibly allow:
    MIN Imm10u, Rn // Rn=(Rn<Imm)?Rn:Imm

    Which could potentially allow implementing "__int32_clamp()" in 2 or 3 instructions, which is an improvement over needing 6 instructions (with pretty much all of these instructions ending up with a 2-cycle latency).


       Though, it seems RISC-V does have them in the Bitmanip extension.

    Could probably add them in-theory (maybe along with FMIN/FMAX).

    I use the same FU to perform Int max() and FP max()--remember once you
    special case out the problems, IEEE is a linearly ordered set of numbers.

    Yeah.

    I was doing both CMP and FCMP cases via the ALU.
    This is why likely both could be added at the same time, since the logic
    for adding one would likely apply to the others.

    About 93% of ICMP is used in FCMP.

  • From Quadibloc@21:1/5 to Quadibloc on Sat Dec 2 01:28:13 2023
    On Fri, 24 Nov 2023 22:53:57 +0000, Quadibloc wrote:

    While this is an "improvement", I do realize that this, along with supporting multiple memory widths at all, even if I have found ways that it can technically be done, is sheer insanity.

    I do have a rationale for going there. It's simple enough. It comes from
    the x86.

    Since the irresistible force of availability of applications means that
    we are doomed to always have just *one* dominant ISA...

    then, to minimize the terrible loss that this entails, that one dominant
    ISA should be maximally flexible. To approach, as closely as possible,
    the "good old days" where the PDP-10 and the CDC 6600 coexisted
    alongside the PDP-11 and the IBM System/360.

    This is the driving force behind my insane quest to design an ISA for a
    CPU which has the ability to do something which no one seems to want.

    And for some people, the KDF 9, with a 48-bit word, was something to
    sing about - here, to the tune of "The British Grenadiers":

    Some talk of I.B.M.s and some of C.D.C.s,
    Of Honeywells and Burroughses, and such great names as these,
    But of all the world's computers, there's none that is so fine,
    As the English Electric Leo Marconi Kay - Dee - Eff Nine!

    Some talk of thirty-two bit words, and some of twenty-four,
    Of disks and drums and datacells, and megabytes of core,
    But for those who've written usercode there's nothing can outshine,
    The subroutine jump nesting store of the Kay - Dee - Eff Nine!

    It's a good thing I had saved this in a text file on my computer.

    A web search, including a search on Google Groups, was no longer
    able to turn this up for me. Also, the KDF 9 was lauded in another
    posting for reasons more directly related to this scheme of mine:

    (Quoting a brochure for the computer)
    The KDF 9 word has 48 bits...

    It may be used as...

    Eight 6-bit Alphanumeric Characters
    One 48-bit Fixed-Point Number
    Two 24-bit (Half length) Fixed-Point Numbers
    Half of a 96-bit (Double length) Fixed-Point Number
    One 48-Bit Floating-Point Number
    Two 24-bit (Half length) Floating-Point Numbers
    Half of a 96-Bit (Double length) Floating-Point Number
    Three 16-Bit (Fixed-Point) Integers
    Six 8-Bit Instruction Syllables
    (end quote)

    An instruction was 1, 2, or 3 syllables; an address was 15 bits.
    O, memory! We shall not see its like again.

    John Savard

  • From Terje Mathisen@21:1/5 to Robert Finch on Sat Dec 2 14:33:55 2023
    Robert Finch wrote:

    Q+ has min3 / max3 for minimum or maximum of three values.
    But only fmin / fmax for float.

    For symmetry I would assume/like a median3 as well, so that min3,
    median3, max3 would return a sorted list?

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to MitchAlsup on Sat Dec 2 14:24:14 2023
    MitchAlsup wrote:
    BGB wrote:

    Have noted that these old games did have a nifty trick for calculating
    approximate distance, say:
       dx=x0-x1;
       dy=y0-y1;
       adx=dx^(dx>>31);
       ady=dy^(dy>>31);
       if(ady>adx)
         { t=adx; adx=ady; ady=t; }         //common
    //  { adx^=ady; ady^=adx; adx^=ady; }  //sometimes
       d=adx+(ady>>1);

    Why not::

        dx=x0-x1;
        dy=y0-y1;
        adx=dx^(dx>>31);
        ady=dy^(dy>>31);
        if(ady>adx)
            d=adx+(ady>>1);
        else         d=ady+(adx>>1);

    Possibly due to wanting to not use more than one branch predictor entry?

    Maybe because ady rarely was larger than adx?

    Besides, your version has the two terms swapped. :-)
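    The quoted trick is the classic "max + min/2" octagonal distance
    approximation. A small C sketch, using exact absolute values rather than the
    branch-free `dx ^ (dx >> 31)` form (which actually yields abs(x) - 1 for
    negative x, an error the game code tolerates):

```c
#include <assert.h>
#include <stdint.h>

/* max + min/2 distance estimate from the quoted game code.
   Never underestimates, and overestimates sqrt(dx^2 + dy^2)
   by at most about 12% (worst case near a 1:2 slope). */
static int32_t approx_dist(int32_t dx, int32_t dy)
{
    /* Exact abs; the original's adx = dx ^ (dx >> 31) is cheaper
       but off by one for negative inputs. */
    int32_t adx = dx < 0 ? -dx : dx;
    int32_t ady = dy < 0 ? -dy : dy;
    if (ady > adx) { int32_t t = adx; adx = ady; ady = t; }
    return adx + (ady >> 1);
}
```

    For a 3-4-5 triangle the estimate is exact (5); for dx == dy == 10 it
    returns 15 against a true distance of about 14.14.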

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Terje Mathisen on Sat Dec 2 20:41:48 2023
    Terje Mathisen wrote:

    Robert Finch wrote:

    Q+ has min3 / max3 for minimum or maximum of three values.
    But only fmin / fmax for float.

    For symmetry I would assume/like a median3 as well, so that min3,
    median3, max3 would return a sorted list?

    That was my point 8-10 posts ago.

    Terje

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Terje Mathisen on Sat Dec 2 20:42:37 2023
    Terje Mathisen wrote:

    MitchAlsup wrote:
    BGB wrote:

    Have noted that these old games did have a nifty trick for calculating
    approximate distance, say:
       dx=x0-x1;
       dy=y0-y1;
       adx=dx^(dx>>31);
       ady=dy^(dy>>31);
       if(ady>adx)
         { t=adx; adx=ady; ady=t; }         //common
    //  { adx^=ady; ady^=adx; adx^=ady; }  //sometimes
       d=adx+(ady>>1);

    Why not::

        dx=x0-x1;
        dy=y0-y1;
        adx=dx^(dx>>31);
        ady=dy^(dy>>31);
        if(ady>adx)
            d=adx+(ady>>1);
        else         d=ady+(adx>>1);

    Possibly due to wanting to not use more than one branch predictor entry?

    Maybe because ady rarely was larger than adx?

    Besides, your version has the two terms swapped. :-)

    Just mimicking the original.

    Terje

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to MitchAlsup on Sun Dec 3 10:36:39 2023
    MitchAlsup wrote:
    Terje Mathisen wrote:

    Robert Finch wrote:

    Q+ has min3 / max3 for minimum or maximum of three values.
    But only fmin / fmax for float.

    For symmetry I would assume/like a median3 as well, so that min3,
    median3, max3 would return a sorted list?

    That was my point 8-10 posts ago.

    I saw that a bit later: I've been on a sailing trip in the Caribbean
    with very sporadic network connections, basically just when in reach of
    a French island.

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Terje Mathisen on Sun Dec 3 16:59:31 2023
    Terje Mathisen wrote:

    MitchAlsup wrote:
    Terje Mathisen wrote:

    Robert Finch wrote:

    Q+ has min3 / max3 for minimum or maximum of three values.
    But only fmin / fmax for float.

    For symmetry I would assume/like a median3 as well, so that min3,
    median3, max3 would return a sorted list?

    That was my point 8-10 posts ago.

    I saw that a bit later: I've been on a sailing trip in the Caribbean
    with very sporadic network connections, basically just when in reach of
    a French island.

    Must be a tough life............

    Terje

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to MitchAlsup on Tue Dec 5 13:40:08 2023
    MitchAlsup wrote:
    Terje Mathisen wrote:

    MitchAlsup wrote:
    Terje Mathisen wrote:

    Robert Finch wrote:

    Q+ has min3 / max3 for minimum or maximum of three values.
    But only fmin / fmax for float.

    For symmetry I would assume/like a median3 as well, so that min3,
    median3, max3 would return a sorted list?

    That was my point 8-10 posts ago.

    I saw that a bit later: I've been on a sailing trip in the Caribbean
    with very sporadic network connections, basically just when in reach
    of a French island.

    Must be a tough life............

    Indeed.

    Our last such trip was in 2018 (Panama City to Jamaica on the Royal
    Clipper), this trip was supposed to take place in 2020 but then Covid
    happened.

    Back in Oslo now, and I just got my very first positive Covid test, so I
    must have been exposed to the virus during the trip, thankfully I didn't
    really get sick until we got to Oslo Airport in -17C.

    Terje
    PS. These trips are for orienteering, we ran 10-11 races over 16 days.
    I've uploaded headcam videos to youtube from all of them. On those you
    can see both my forward view and a map snippet with moving marker which
    shows where I'm running.

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Paul A. Clayton on Thu Dec 7 13:39:33 2023
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    What about multiply-high-no-carry-in? There might be cases
    where such could be used for division with a reused
    (possibly compile-time constant) divisor even without the
    carry-in. (Producing the carry-in by a low multiply seems
    to have significant overhead in energy, throughput, and/or
    latency.)

    What's that operation supposed to produce? How does it differ from,
    e.g., RISC-V MULH/MULHU/MULHSU? And if there is a difference, what
    makes you think that it still works for division?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Thu Dec 7 18:55:10 2023
    Paul A. Clayton wrote:

    On 11/27/23 8:03 PM, MitchAlsup wrote:
    Anton Ertl wrote:
    [snip]
    However, looking at more recent architectures, the RISC-V M
    extension (which is part of RV64G and RV32G, i.e., a
    standard extension) has not just multiply instructions (MUL,
    MULH, MULHU, MULHSU, MULW), but also integer divide instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW, and
    REMUW.

    All of which are possible in My 66000 using operand sign
    control, S-bit, and CARRY when you want 64×64->128 or
    128/64->{64 quotient, 64 remainder}

    What about multiply-high-no-carry-in?

    CARRY R9,{{O}} // carry applies to next inst
    // no carry in yes carry out
    MUL R10,Rx,Ry // {R9,R10} contain result

    There might be cases
    where such could be used for division with a reused
    (possibly compile-time constant) divisor even without the
    carry-in. (Producing the carry-in by a low multiply seems
    to have significant overhead in energy, throughput, and/or
    latency.) A small immediate operand might be shifted into
    the most significant position to facilitate division by a
    small constant.

    A division-specific instruction (divide-using-reciprocal)
    could do more sophisticated manipulation of an immediate and
    perhaps provide an instruction a compiler could use with
    less knowledge of the dividend. Yet there might be uses for
    a simple multiply-high-no-carry-in.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to BGB on Fri Dec 8 02:44:03 2023
    BGB wrote:

    On 12/7/2023 7:39 AM, Anton Ertl wrote:
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    What about multiply-high-no-carry-in? There might be cases
    where such could be used for division with a reused
    (possibly compile-time constant) divisor even without the
    carry-in. (Producing the carry-in by a low multiply seems
    to have significant overhead in energy, throughput, and/or
    latency.)

    What's that operation supposed to produce? How does it differ from,
    e.g., RISC-V MULH/MULHU/MULHSU? And if there is a difference, what
    makes you think that it still works for division?


    For division by reciprocal, one may need to be able to adjust the right
    shift (say, 32..34 bits), and add a bias to the low-order bits if the
    input is negative (say, adding "Rcp-1").

    Without the variable shift part, one will not get reliable results for
    some divisors (IIRC, 7, 17, 31, etc).

    Without a bias, it will round towards negative infinity rather than
    towards zero (standard for division is to always round towards zero).

    No:: C99 division is so specified, but there are 3 other (semi-)popular
    definitions.


    As-is, signed division logic (say, "y=x/7;") would be something like:
    MOV 0x49249249, R1
    DMULS.L R4, R1, R2
    ADD -1, R1
    CMPGE 0, R4
    ADD?F R2, R1, R2
    SHAD.Q R2, -33, R2

    Or, if one used jumbo encodings:
    DMULS.L R4, 0x49249249, R2
    CMPGE 0, R4 // SR.T=(R4>=0)
    ADD?F R2, 0x49249248, R2
    SHAD.Q R2, -33, R2

    In 1975 I wrote PDP-11/40 microcode that performed a serial DIV algorithm,
    and had the opportunity in the loop to short-circuit if all the bits of
    the denominator had been consumed. {{CMU had an add-on board with
    student-programmable microcode.}}

    Thus::
    DIV R9,R5,#7

    should not take more than 5 cycles. Nor should
    Thusly::
    DIV R9,R5,#-7

    !!

    So, sadly, simply taking the high result isn't quite sufficient.

    It is more practical to make HW DIV fast than to have everyone {and his
    brother} invent new ways to avoid dividing.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Fri Dec 8 15:20:48 2023
    BGB <cr88192@gmail.com> writes:
    On 12/7/2023 7:39 AM, Anton Ertl wrote:
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    What about multiply-high-no-carry-in? There might be cases
    where such could be used for division with a reused
    (possibly compile-time constant) divisor even without the
    carry-in. (Producing the carry-in by a low multiply seems
    to have significant overhead in energy, throughput, and/or
    latency.)

    What's that operation supposed to produce? How does it differ from,
    e.g., RISC-V MULH/MULHU/MULHSU? And if there is a difference, what
    makes you think that it still works for division?


    For division by reciprocal, one may need to be able to adjust the right
    shift (say, 32..34 bits), and add a bias to the low-order bits if the
    input is negative (say, adding "Rcp-1").

    Without the variable shift part, one will not get reliable results for
    some divisors (IIRC, 7, 17, 31, etc).

    Without a bias, it will round towards negative infinity rather than
    towards zero (standard for division is to always round towards zero).

    Nope. E.g., Forth-83 standardizes floored division; in Forth-94 and Forth-2012, it's implementation-defined. More generally, <https://en.wikipedia.org/wiki/Modulo_operation> lists lots of
    languages that perform floored or Euclidean modulo, and I would hope
    that their division operation agrees with the modulo operation.

    As for division by multiplying with the reciprocal, I have written a
    paper on it [ertl19kps], but the most important sentence of that paper
    is:

    |If you read only one paper about the topic, my recommendation is
    |Robison’s [Rob05].

    My take on this improves the latency of the unsigned case on some CPUs
    by avoiding the shift and bias, but it costs an additional
    multiplication.

    FAZDIV

    What is FAZDIV?

    And you failed to answer what multiply-high-no-carry-in does.

    @InProceedings{ertl19kps,
    author = {M. Anton Ertl},
    title = {Integer Division by Multiplying with the
    Double-Width Reciprocal},
    crossref = {kps19},
    pages = {75--84},
    url = {http://www.complang.tuwien.ac.at/papers/ertl19kps.pdf},
    url-slides = {http://www.complang.tuwien.ac.at/papers/ertl19kps-slides.pdf},
    abstract = {Earlier work on integer division by multiplying with
    the reciprocal has focused on multiplying with a
    single-width reciprocal, combined with a correction
    and followed by a shift. The present work explores
    using a double-width reciprocal to allow getting rid
    of the correction and shift.}
    }

    @Proceedings{kps19,
    title = {20. Kolloquium Programmiersprachen und Grundlagen
    der Programmierung (KPS)},
    booktitle = {20. Kolloquium Programmiersprachen und Grundlagen
    der Programmierung (KPS)},
    year = {2019},
    key = {kps19},
    editor = {Martin Pl\"umicke and Fayez Abu Alia},
    url = {https://www.hb.dhbw-stuttgart.de/kps2019/kps2019_Tagungsband.pdf}
    }

    @InProceedings{robison05,
    author = "Arch D. Robison",
    title = "{$N$}-Bit Unsigned Division Via {$N$}-Bit
    Multiply-Add",
    OPTeditor = "Paolo Montuschi and Eric (Eric Mark) Schwarz",
    booktitle = "{Proceedings of the 17th IEEE Symposium on Computer
    Arithmetic (ARITH-17)}",
    publisher = "IEEE Computer Society Press",
    ISBN = "0-7695-2366-8",
    ISBN-13 = "978-0-7695-2366-8",
    year = "2005",
    bibdate = "Wed Jun 22 07:02:55 2005",
    bibsource = "http://www.math.utah.edu/pub/tex/bib/fparith.bib",
    URL = "http://www.acsel-lab.com/arithmetic/arith17/papers/ARITH17_Robison.pdf",
    abstract = "Integer division on modern processors is expensive
    compared to multiplication. Previous algorithms for
    performing unsigned division by an invariant divisor,
    via reciprocal approximation, suffer in the worst case
    from a common requirement for $ n + 1 $ bit
    multiplication, which typically must be synthesized
    from $n$-bit multiplication and extra arithmetic
    operations. This paper presents, and proves, a hybrid
    of previous algorithms that replaces $ n + 1 $ bit
    multiplication with a single fused multiply-add
    operation on $n$-bit operands, thus reducing any
    $n$-bit unsigned division to the upper $n$ bits of a
    multiply-add, followed by a single right shift. An
    additional benefit is that the prerequisite
    calculations are simple and fast. On the Itanium 2
    processor, the technique is advantageous for as few as
    two quotients that share a common run-time divisor.",
    acknowledgement = "Nelson H. F. Beebe, University of Utah, Department
    of Mathematics, 110 LCB, 155 S 1400 E RM 233, Salt Lake
    City, UT 84112-0090, USA, Tel: +1 801 581 5254, FAX: +1
    801 581 4148, e-mail: \path|beebe@math.utah.edu|,
    \path|beebe@acm.org|, \path|beebe@computer.org|
    (Internet), URL:
    \path|http://www.math.utah.edu/~beebe/|",
    keywords = "ARITH-17",
    pagecount = "9",
    }

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Sat Dec 9 08:26:34 2023
    BGB <cr88192@gmail.com> writes:
    On 12/8/2023 9:20 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    Without a bias, it will round towards negative infinity rather than
    towards zero (standard for division is to always round towards zero).

    Nope. E.g., Forth-83 standardizes floored division; in Forth-94 and
    Forth-2012, it's implementation-defined. More generally,
    <https://en.wikipedia.org/wiki/Modulo_operation> lists lots of
    languages that perform floored or Euclidean modulo, and I would hope
    that their division operation agrees with the modulo operation.
    ...
    OK, how about:
    Languages like C, Java, C#, etc., expect division to always round towards
    zero.

    Yes, there are programming languages that standardize truncated (aka
    symmetric) division. BTW, C89 is not one of them, C99 ff. are. There
    are also programming languages that standardize other kinds of
    division (or at least modulo), or fail to standardize the behaviour
    fully when one or both of the operands is negative.
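    The distinction being drawn here can be made concrete. A minimal C sketch
    showing C99's truncated division alongside a floored variant (as in
    Forth-83 or Python) built on top of it; the helper name is illustrative:

```c
#include <assert.h>

/* C99 `/` truncates toward zero and `%` takes the dividend's sign,
   so (a/b)*b + a%b == a always holds. Floored division instead
   rounds the quotient toward minus infinity. */
static int floordiv(int a, int b)
{
    int q = a / b, r = a % b;
    /* Adjust downward when a remainder exists and its sign
       disagrees with the divisor's. */
    return (r != 0 && ((r < 0) != (b < 0))) ? q - 1 : q;
}
```

    The two agree whenever the operands have the same sign; they differ by
    exactly one whenever the signs differ and the division is inexact.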

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Paul A. Clayton on Sun Dec 10 18:29:50 2023
    Paul A. Clayton wrote:

    On 12/7/23 1:55 PM, MitchAlsup wrote:
    Paul A. Clayton wrote:

    On 11/27/23 8:03 PM, MitchAlsup wrote:
    Anton Ertl wrote:
    [snip]
    However, looking at more recent architectures, the RISC-V M
    extension (which is part of RV64G and RV32G, i.e., a standard
    extension) has not just multiply instructions (MUL,
    MULH, MULHU, MULHSU, MULW), but also integer divide
    instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW, and
    REMUW.

    All of which are possible in My 66000 using operand sign
    control, S-bit, and CARRY when you want 64×64->128 or
    128/64->{64 quotient, 64 remainder}

    What about multiply-high-no-carry-in?

          CARRY   R9,{{O}}    // carry applies to next inst
                              // no carry in yes carry out
          MUL     R10,Rx,Ry   // {R9,R10} contain result

    I was thinking of a single register result.

    How can a single 64-bit register hold a 128-bit result ??


                                          There might be cases
    where such could be used for division with a reused
    (possibly compile-time constant) divisor even without the
    carry-in. (Producing the carry-in by a low multiply seems
    to have significant overhead in energy, throughput, and/or
    latency.) A small immediate operand might be shifted into
    the most significant position to facilitate division by a
    small constant.

    A division-specific instruction (divide-using-reciprocal)
    could do more sophisticated manipulation of an immediate and
    perhaps provide an instruction a compiler could use with
    less knowledge of the dividend. Yet there might be uses for
    a simple multiply-high-no-carry-in.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to MitchAlsup on Mon Dec 11 10:43:41 2023
    MitchAlsup <mitchalsup@aol.com> schrieb:
    Paul A. Clayton wrote:

    On 12/7/23 1:55 PM, MitchAlsup wrote:

    What about multiply-high-no-carry-in?

          CARRY   R9,{{O}}    // carry applies to next inst
                              // no carry in yes carry out
          MUL     R10,Rx,Ry   // {R9,R10} contain result

    I was thinking of a single register result.

    How can a single 64-bit register hold a 128-bit result ??

    Sometimes, you don't need the whole result; the upper bits suffice.

    Consider dividing a 32-bit unsigned number n on a 32-bit system
    by 25 (equally applicable to similar examples with 64-bit systems,
    but I happen to have the numbers at hand).

    In plain C, you could do this by

    uint32_t res = ((uint64_t) n * (uint64_t) 1374389535) >> 35;

    where any compiler worth its salt will optimize this to a multiply
    high followed by a shift by three bits. Compilers have done
    this for a long time, for division by a constant. Example, for
    32-bit POWER:

    #include <stdint.h>

    uint32_t div25 (uint32_t n)
    {
    return n / 25;
    }

    yields

    div25:
    lis 9,0x51eb
    ori 9,9,0x851f
    mulhwu 3,3,9
    srwi 3,3,3
    blr

    where lis/ori is POWER's way of synthesizing a 32-bit constant
    (and 0x51eb851f = 1374389535).

    If you can only access the high part via CARRY, this clobbers a
    register that you do not need to do otherwise.
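    The magic constant quoted above is easy to check in portable C: 1374389535
    is ceil(2^35 / 25), and since 25·M − 2^35 = 7 ≤ 2^(35−32), the
    multiply-high-and-shift identity holds for every 32-bit n. A sketch:

```c
#include <assert.h>
#include <stdint.h>

/* n / 25 for 32-bit unsigned n via multiply-high and shift,
   matching the mulhwu + srwi sequence in the quoted POWER code. */
static uint32_t div25(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 1374389535u) >> 35);
}
```

    Recovering the remainder afterwards is a multiply and subtract:
    `r = n - 25 * div25(n);`.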

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Thomas Koenig on Tue Dec 12 00:13:03 2023
    Thomas Koenig wrote:

    MitchAlsup <mitchalsup@aol.com> schrieb:

    #include <stdint.h>

    uint32_t div25 (uint32_t n)
    {
    return n / 25;
    }

    yields

    div25:
    lis 9,0x51eb // cycle 1
    ori 9,9,0x851f // cycle 2
    mulhwu 3,3,9 // cycle 3-6
    srwi 3,3,3 // cycle 7
    blr // cycle 8

    div25:
    CARRY R8,{{O}} // cycle 1
    MUL R1,R1,#0x51eb851f // cycle 1-4
    SLL R1,R8,#35 // cycle 5
    RET // cycle 6

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Thomas Koenig on Tue Dec 12 00:14:44 2023
    Thomas Koenig wrote:

    MitchAlsup <mitchalsup@aol.com> schrieb:
    Paul A. Clayton wrote:

    On 12/7/23 1:55 PM, MitchAlsup wrote:

    What about multiply-high-no-carry-in?

          CARRY   R9,{{O}}    // carry applies to next inst
                              // no carry in yes carry out
          MUL     R10,Rx,Ry   // {R9,R10} contain result

    I was thinking of a single register result.

    How can a single 64-bit register hold a 128-bit result ??

    Sometimes, you don't need the whole result; the upper bits suffice.

    Consider dividing a 32-bit unsigned number n on a 32-bit system
    by 25 (equally applicable to similar examples with 64-bit systems,
    but I happen to have the numbers at hand).

    In plain C, you could do this by

    uint32_t res = ((uint64_t) n * (uint64_t) 1374389535) >> 35;

    This only gives you 29 bits of significance.

    What if you also need the remainder ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to MitchAlsup on Tue Dec 12 05:52:22 2023
    MitchAlsup <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    MitchAlsup <mitchalsup@aol.com> schrieb:
    Paul A. Clayton wrote:

    On 12/7/23 1:55 PM, MitchAlsup wrote:

    What about multiply-high-no-carry-in?

          CARRY   R9,{{O}}    // carry applies to next inst
                              // no carry in yes carry out
          MUL     R10,Rx,Ry   // {R9,R10} contain result

    I was thinking of a single register result.

    How can a single 64-bit register hold a 128-bit result ??

    Sometimes, you don't need the whole result; the upper bits suffice.

    Consider dividing a 32-bit unsigned number n on a 32-bit system
    by 25 (equally applicable to similar examples with 64-bit systems,
    but I happen to have the numbers at hand).

    In plain C, you could do this by

    uint32_t res = ((uint64_t) n * (uint64_t) 1374389535) >> 35;

    This only gives you 29 bits of significance.

    Because that's all it takes, dividing by 25 reduces the number
    of significant bits.

    What if you also need the remainder ??

    Multiply and subtract.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to MitchAlsup on Tue Dec 12 06:09:40 2023
    MitchAlsup <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    MitchAlsup <mitchalsup@aol.com> schrieb:

    #include <stdint.h>

    uint32_t div25 (uint32_t n)
    {
    return n / 25;
    }

    yields

    div25:
    lis 9,0x51eb // cycle 1
    ori 9,9,0x851f // cycle 2
    mulhwu 3,3,9 // cycle 3-6
    srwi 3,3,3 // cycle 7
    blr // cycle 8

    div25:
    CARRY R8,{{O}} // cycle 1
    MUL R1,R1,#0x51eb851f // cycle 1-4
    SLL R1,R8,#35 // cycle 5
    RET // cycle 6

    Then let's use a 64-bit example:

    #include <stdint.h>

    uint64_t div (uint64_t n)
    {
    return n / 37;
    }

    which becomes

    div:
    .LFB0:
    .cfi_startproc
    movabsq $-2492803253203993461, %rax
    mulq %rdi
    movq %rdx, %rax
    shrq $5, %rax
    ret

    on x86_64 (so, use the high result only and shift it five bits to
    the right) and, with Power10,

    div:
    pli 10,3714566310
    pli 9,232160395
    rldimi 9,10,32,0
    mulhdu 3,3,9
    srdi 3,3,5
    blr

    (The first three instructions are pasting together the constant, which
    of course My 66000 can do much better :-)

    A hypothetical My66000 code could be

    mulhu r1, r1, #-2492803253203993461
    srl r1, r1, #5
    ret
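    Interpreted as unsigned, the quoted constant -2492803253203993461 is
    15953940820505558155 = ceil(2^69 / 37), so multiply-high followed by a
    right shift of 5 divides by 37 for all 64-bit n. A sketch using the
    GCC/Clang `__uint128_t` extension to stand in for mulhu:

```c
#include <assert.h>
#include <stdint.h>

/* n / 37 via multiply-high, matching the mulhdu/srdi (POWER) and
   mulq/shrq (x86_64) sequences quoted above. */
static uint64_t div37(uint64_t n)
{
    const uint64_t M = 15953940820505558155u; /* (uint64_t)-2492803253203993461 */
    uint64_t hi = (uint64_t)(((__uint128_t)n * M) >> 64); /* mulhu */
    return hi >> 5;                                       /* srl #5 */
}
```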

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Tue Dec 12 20:43:52 2023
    Robert Finch wrote:

    On 2023-12-12 4:00 a.m., BGB wrote:
    On 12/12/2023 12:09 AM, Thomas Koenig wrote:
    MitchAlsup <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    MitchAlsup <mitchalsup@aol.com> schrieb:

    #include <stdint.h>

    uint32_t div25 (uint32_t n)
    {
       return n / 25;
    }

    yields

    div25:
             lis 9,0x51eb        // cycle 1
             ori 9,9,0x851f    // cycle 2
             mulhwu 3,3,9        // cycle 3-6
             srwi 3,3,3        // cycle 7
             blr            // cycle 8

    div25:
        CARRY R8,{{O}}       // cycle 1
        MUL R1,R1,#0x51eb851f    // cycle 1-4
        SLL R1,R8,#35        // cycle 5
        RET            // cycle 6

    Then let's use a 64-bit example:

    #include <stdint.h>

    uint64_t div (uint64_t n)
    {
       return n / 37;
    }

    which becomes

    div:
    .LFB0:
             .cfi_startproc
             movabsq $-2492803253203993461, %rax
             mulq    %rdi
             movq    %rdx, %rax
             shrq    $5, %rax
             ret

    on x86_64 (so, use the high result only and shift it five bits to
    the right) and, with Power10,

    div:
             pli 10,3714566310
             pli 9,232160395
             rldimi 9,10,32,0
             mulhdu 3,3,9
             srdi 3,3,5
             blr

    (The first three instructions are pasting together the constant, which
    of course My 66000 can do much better :-)

    A hypothetical My66000 code could be

           mulhu    r1, r1, #-2492803253203993461
           srl    r1, r1, #5
           ret


    And, for comparison, on BJX2 (with the MULQ extension):
      MOV -2492803253203993461, R3
      MULHU.Q R4, R3, R2
      SHLD.Q R2, -5, R2
      RTS
    ...

    Though, this will take 24 bytes to encode, and won't be fast.

    Also, currently the only instructions that take 64 bit immediate values
    are:
      MOV Imm64, Rn
      ADD Imm64, Rn
      PLDCXH Imm64, Xn  //4x Binary16 to 4x Binary32

    ...

    Q+ is almost the same as MY66000

    muluh r1,r1,#-2492803253203993461
    lsr r1,r1,#5
    ret

    but ret is only two bytes.

    When subroutines are aligned to cache line boundaries, that adds nothing.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Thomas Koenig on Sat Dec 16 13:51:15 2023
    Thomas Koenig <tkoenig@netcologne.de> schrieb:

    A hypothetical My66000 code could be

    mulhu r1, r1, #-2492803253203993461
    srl r1, r1, #5
    ret

    By "hypothetical" I mean that it isn't in the ISA at the moment.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Thomas Koenig on Sat Dec 16 19:21:43 2023
    Thomas Koenig wrote:

    Thomas Koenig <tkoenig@netcologne.de> schrieb:

    A hypothetical My66000 code could be

    mulhu r1, r1, #-2492803253203993461
    srl r1, r1, #5
    ret

    By "hypothetical" I mean that it isn't in the ISA at the moment.

    If I knew what mulhu does I could illuminate::
    a) how many bits get multiplied?
    b) how many bits are produced?

    But assuming h means high, and the widths are 64-bit operands and
    128-bit results::

    CARRY R1,{{O}{I}}
    MUL R2,R1,#-2492803253203993461
    SHR R1,R2,<0,5>
    RET

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to MitchAlsup on Sat Dec 16 22:26:07 2023
    MitchAlsup <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Thomas Koenig <tkoenig@netcologne.de> schrieb:

    A hypothetical My66000 code could be

    mulhu r1, r1, #-2492803253203993461
    srl r1, r1, #5
    ret

    By "hypothetical" I mean that it isn't in the ISA at the moment.

    If I knew what mulhu does I could illuminate::
    a) how many bits get multiplied?
    b) how many bits are produced?

    But assuming h means high, and the widths are 64-bit operands and
    128-bit results::

    CARRY R1,{{O}{I}}
    MUL R2,R1,#-2492803253203993461
    SHR R1,R2,<0,5>
    RET

    That works, but has two disadvantages over the non-Carry
    version: It is longer (by one instruction modifier),
    and R2 is overwritten with a value which is not needed.
    This is why I think that having a "multiply high signed"
    and "multiply high unsigned" operation is a good idea even
    when My 66000's Carry is implemented.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Thomas Koenig on Sat Dec 16 23:10:27 2023
    Thomas Koenig wrote:

    MitchAlsup <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Thomas Koenig <tkoenig@netcologne.de> schrieb:

    A hypothetical My66000 code could be

    mulhu r1, r1, #-2492803253203993461
    srl r1, r1, #5
    ret

    By "hypothetical" I mean that it isn't in the ISA at the moment.

    If I knew what mulhu does I could illuminate::
    a) how many bits get multiplied?
    b) how many bits are produced?

    But assuming h means high, and the widths are 64-bit operands and
    128-bit results::

    CARRY R1,{{O}{I}}
    MUL R2,R1,#-2492803253203993461
    SHR R1,R2,<0,5>
    RET

    That works, but has two disadvantages over the non-Carry
    version: It is longer (by one instruction modifier),
    and R2 is overwritten with a value which is not needed.
    This is why I think that having a "multiply high signed"
    and "multiply high unsigned" operation is a good idea even
    when My 66000's Carry is implemented.

    Heck:: if you are worried about length::

    DIV R1,R1,#37
    RET

    2 words instead of 6.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to MitchAlsup on Sun Dec 17 07:52:12 2023
    MitchAlsup <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    MitchAlsup <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Thomas Koenig <tkoenig@netcologne.de> schrieb:

    A hypothetical My66000 code could be

    mulhu r1, r1, #-2492803253203993461
    srl r1, r1, #5
    ret

    By "hypothetical" I mean that it isn't in the ISA at the moment.

    If I knew what mulhu does I could illuminate::
    a) how many bits get multiplied?
    b) how many bits are produced?

    But assuming h means high, and the widths are 64-bit operands and
    128-bit results::

    CARRY R1,{{O}{I}}
    MUL R2,R1,#-2492803253203993461
    SHR R1,R2,<0,5>
    RET

    That works, but has two disadvantages over the non-Carry
    version: It is longer (by one instruction modifier),
    and R2 is overwritten with a value which is not needed.
    This is why I think that having a "multiply high signed"
    and "multiply high unsigned" operation is a good idea even
    when My 66000's Carry is implemented.

    Heck:: if you are worried about length::

    DIV R1,R1,#37
    RET

    2 words instead of 6.

    This is an optimization problem with three objective functions:
    code size, register use, and speed. Any solution at which it is
    no longer possible to improve one of the objective functions
    without making the others worse is a valid solution to that
    problem (part of the Pareto front, if you want to use that terminology).

    The version with the DIV with constant is the smallest, so it
    trumps everything else on size. It is, however, likely to be
    slower than the other two. It is what I would expect a compiler
    to emit if optimizing for size (-Os), and when not optimizing.

    The version with the Carry is likely faster than the DIV with
    constant version, but is bigger and uses two registers instead
    of one. It is what I would expect a compiler to generate in
    the absence of multiply high unsigned and any -O option, except
    for -Os.

    The hypothetical version with "multiply high unsigned" is smaller
    than the version with Carry and is better on register use, so
    it is better in that sense. If it existed, I would expect a
    compiler to always generate that for any optimization level
    except -Os. (Small caveat: maybe hardware has a fiendishly
    clever algorithm to divide by 37 in four or fewer cycles; this
    would then be known to the compiler, and it could choose accordingly).
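    Readers can check the mulhu sequence under discussion directly. The
    sketch below assumes a GCC/Clang-style unsigned __int128 for the
    128-bit product (not anything from the thread): the posted constant,
    taken unsigned, equals ceil(2^69/37), so multiply-high followed by a
    right shift of 5 reproduces unsigned division by 37 for every 64-bit
    input:

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* High 64 bits of an unsigned 64x64-bit product -- the proposed MULHU.
       unsigned __int128 is a GCC/Clang extension. */
    static uint64_t mulhu(uint64_t a, uint64_t b) {
        return (uint64_t)(((unsigned __int128)a * b) >> 64);
    }

    int main(void) {
        /* The thread's constant: as an unsigned value it is ceil(2^69 / 37),
           so mulhu plus a right shift by 5 divides by 37. */
        const uint64_t M = (uint64_t)-2492803253203993461LL;
        const uint64_t samples[] = {0, 1, 36, 37, 38, 1000000007ULL, UINT64_MAX};
        for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
            uint64_t x = samples[i];
            uint64_t q = mulhu(x, M) >> 5;   /* the mulhu + srl #5 sequence */
            assert(q == x / 37);             /* matches direct division */
        }
        printf("reciprocal multiply matches x/37 on all samples\n");
        return 0;
    }
    ```

    This is the standard round-up reciprocal: M*37 exceeds 2^69 by only 23,
    which is at most 2^5, so the truncation error never reaches the quotient.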

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Thomas Koenig on Sun Dec 17 21:55:47 2023
    Thomas Koenig wrote:

    MitchAlsup <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    MitchAlsup <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Thomas Koenig <tkoenig@netcologne.de> schrieb:

    A hypothetical My66000 code could be

    mulhu r1, r1, #-2492803253203993461
    srl r1, r1, #5
    ret

    By "hypothetical" I mean that it isn't in the ISA at the moment.

    If I knew what mulhu does I could illuminate::
    a) how many bits get multiplied?
    b) how many bits are produced?

    But assuming h means high, and the widths are 64-bit operands and
    128-bit results::

    CARRY R1,{{O}{I}}
    MUL R2,R1,#-2492803253203993461
    SHR R1,R2,<0,5>
    RET

    That works, but has two disadvantages over the non-Carry
    version: It is longer (by one instruction modifier),
    and R2 is overwritten with a value which is not needed.
    This is why I think that having a "multiply high signed"
    and "multiply high unsigned" operation is a good idea even
    when My 66000's Carry is implemented.

    Heck:: if you are worried about length::

    DIV R1,R1,#27
    RET

    2 words instead of 6.

    This is an optimization problem with three objective functions:
    Code size, register use and speed. Any solution at which
    it is no longer possible to improve one of the objective functions
    without making the others worse is a valid solution to that problem
    (part of the Pareto front, if you want to use that terminology).

    The version with the DIV with constant is the smallest, so it
    trumps everything else on size. It is, however, likely to be
    slower than the other two. It is what I would expect a compiler
    to emit if optimizing for size (-Os), and when not optimizing.

    Since I do IDIV in the FMAC unit, and integers are not normalized::
    FMAC passes both operands through the normalizer's Find First logic
    and knows the significance of the numerator and denominator.

    Thus, for this case, 27 is detected to have only 5 bits of significance,
    so the 1-bit-per-cycle loop* takes only 5 cycles, giving a fall-through
    time of 8 cycles total.

    (*) assuming a 1-bit-at-a-time loop--which I would not do--but I
    still do the normalization steps to make the integers smell like
    FP numbers; so, with 1 Goldschmidt step and 1 Newton-Raphson step,
    you have an 8-cycle IDIV for denominators with fewer than 8 bits of
    significance, 10 cycles for denominators of fewer than 16 bits, ...
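    The "Find First" normalization described above reduces to taking the
    operand's bit width. A minimal sketch (the __builtin_clzll intrinsic is
    a GCC/Clang assumption, not part of the thread):

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Bit width ("significance") of a nonzero 64-bit integer, as a
       find-first-one normalizer would report it. */
    static int significance(uint64_t x) {
        assert(x != 0);                  /* clz is undefined for 0 */
        return 64 - __builtin_clzll(x);
    }

    int main(void) {
        assert(significance(27) == 5);   /* 27 = 11011b: the 5-bit case above */
        assert(significance(37) == 6);
        assert(significance(UINT64_MAX) == 64);
        printf("significance checks pass\n");
        return 0;
    }
    ```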

    The version with the Carry is likely faster than the DIV with
    constant version, but is bigger and uses two registers instead
    of one. It is what I would expect a compiler to generate in
    the absence of multiply high unsigned and any -O option, except
    for -Os.

    What if we solve the problem by making IDIV fast instead !!

    Given a fast IDIV, is there any use for IMULH ??

    The hypothetical version with "multiply high unsigned" is smaller
    than the version with Carry and is better on register use, so

    I should point out that having the constant attached to the
    MUL (as it were) already saves a register compared to
    ISAs without large constants.

    it is better in that sense. If it existed, I would expect a
    compiler to always generate that for any optimization level
    except -Os. (Small caveat: maybe hardware has a fiendishly
    clever algorithm to divide by 37 in four or fewer cycles; this
    would then be known to the compiler, and it could choose accordingly).

    This is one gigantic reason to use FDIV hardware to perform IDIV--and
    another, smaller one not to separate register files. FDIVs are in the
    17-20 cycle range, so why should IDIV not also be in that range ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Thomas Koenig on Mon Dec 18 12:00:54 2023
    Thomas Koenig wrote:
    MitchAlsup <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Thomas Koenig <tkoenig@netcologne.de> schrieb:

    A hypothetical My66000 code could be

    mulhu r1, r1, #-2492803253203993461
    srl r1, r1, #5
    ret

    By "hypothetical" I mean that it isn't in the ISA at the moment.

    If I knew what mulhu does I could illuminate::
    a) how many bits get multiplied?
    b) how many bits are produced?

    But assuming h means high, and the widths are 64-bit operands and
    128-bit results::

    CARRY R1,{{O}{I}}
    MUL R2,R1,#-2492803253203993461
    SHR R1,R2,<0,5>
    RET

    That works, but has two disadvantages over the non-Carry
    version: It is longer (by one instruction modifier),
    and R2 is overwritten with a value which is not needed.
    This is why I think that having a "multiply high signed"
    and "multiply high unsigned" operation is a good idea even
    when My 66000's Carry is implemented.

    I disagree:

    The R2 overwrite really doesn't matter since the OoO circuitry will
    quickly notice that the value isn't used at all, and the extra CARRY
    modifier does not make the code any slower: It will still take exactly
    the latency of MUL + SHR, so presumably 5-6 total cycles.

    Since MULHU is very rarely used in regular code (pretty much only in
    constant divisions that got replaced with reciprocal muls), this tiny
    overhead is a very good deal.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Terje Mathisen on Tue Dec 19 13:57:07 2023
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    MitchAlsup <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Thomas Koenig <tkoenig@netcologne.de> schrieb:

    A hypothetical My66000 code could be

    mulhu r1, r1, #-2492803253203993461
    srl r1, r1, #5
    ret

    By "hypothetical" I mean that it isn't in the ISA at the moment.

    If I knew what mulhu does I could illuminate::
    a) how many bits get multiplied?
    b) how many bits are produced?

    But assuming h means high, and the widths are 64-bit operands and
    128-bit results::

    CARRY R1,{{O}{I}}
    MUL R2,R1,#-2492803253203993461
    SHR R1,R2,<0,5>
    RET

    That works, but has two disadvantages over the non-Carry
    version: It is longer (by one instruction modifier),
    and R2 is overwritten with a value which is not needed.
    This is why I think that having a "multiply high signed"
    and "multiply high unsigned" operation is a good idea even
    when My 66000's Carry is implemented.

    I disagree:

    The R2 overwrite really doesn't matter since the OoO circuitry will
    quickly notice that the value isn't used at all, and the extra CARRY
    modifier does not make the code any slower: It will still take exactly
    the latency of MUL + SHR, so presumably 5-6 total cycles.

    You're correct that the register use will probably be elided in
    an OoO machine, but the register is still overwritten. This does
    not matter for a function that does just that division, but in
    larger functions, this can lead to an additional spill/restore.

    Since MULHU is very rarely used in regular code (pretty much only in
    constant divisions that got replaced with reciprocal muls), this tiny
    overhead is a very good deal.

    My 66000 has very low register usage, especially due to the encoding
    of constants. Another reason is indexed and scaled loads and stores:
    for code like

    do i=1,n
    a(i) = b(i) + c(i)
    end do

    there is no need to run a separate pointer through each of a, b and c,
    which saves registers.
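    The same loop in C makes the point concrete: with scaled-index
    addressing, one shared induction variable serves all three arrays,
    where an ISA without it may advance three separate pointers (the
    addressing comment is illustrative pseudo-assembly, not generated
    code):

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* a(i) = b(i) + c(i): one index i plus three base registers, rather
       than three independently incremented pointers. */
    static void vadd(double *a, const double *b, const double *c, int n) {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i];   /* e.g. load [b + i<<3], load [c + i<<3] */
    }

    int main(void) {
        double b[4] = {1, 2, 3, 4}, c[4] = {10, 20, 30, 40}, a[4];
        vadd(a, b, c, 4);
        for (int i = 0; i < 4; i++)
            assert(a[i] == b[i] + c[i]);
        printf("vadd ok\n");
        return 0;
    }
    ```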

    This is why putting floating point values and integer values into
    the same register set actually works well for My 66000, and would
    not work so well for others.

    Still, when I see an inefficiency, even minor, I tend to remark
    on it :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Thomas Koenig on Tue Dec 19 18:35:44 2023
    Thomas Koenig wrote:

    Still, when I see an inefficiency, even minor, I tend to remark
    on it :-)

    And that is why we love you.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)