• Re: Misc: BGBCC targeting RV64G, initial results...

    From MitchAlsup1@21:1/5 to BGB on Sun Sep 29 19:11:47 2024
    On Sat, 28 Sep 2024 4:30:12 +0000, BGB wrote:

    On 9/27/2024 7:43 PM, MitchAlsup1 wrote:
    On Fri, 27 Sep 2024 23:53:22 +0000, BGB wrote:

    One of the reasons reservation stations came into vogue.


    Possibly, but that is a CPU feature rather than a compiler feature...

    A good compiler should be able to make use of 98% of the instruction
    set.

    ------------

    Saw a video not too long ago where someone was making code faster by
    undoing a lot of loop unrolling, as the code was apparently losing
    more to I$ misses than it was gaining by being unrolled.

    I noticed this in 1991 when we got Mc88120 simulator up and running.
    GBOoO chips are <nearly> best served when there is the smallest number
    of instructions.
    ------------

    In contrast, a jumbo prefix by itself does not make sense; its meaning
    depends on the thing that is being prefixed. Also, the decoder will
    decode a jumbo prefix and its suffix instruction at the same time.

    How many bits does one of these jumbo prefixes consume ?
    -----


    For the jumbo prefix:
    Recognize that it is a jumbo prefix;
    Inform the decoder for the following instruction of this fact
    (via internal flag bits);
    Provide the prefix's data bits to the corresponding decoder.

    Unlike a "real" instruction, a jumbo prefix does not need to provide
    behavior of its own, merely be able to be identified as such and to
    provide payload data bits.
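    A sketch of that decode step in Python. The marker value, field
    positions, and widths here are all hypothetical (the thread only fixes
    the prefix at 32 bits supplying 23 or 27 payload bits); the point is
    that the prefix contributes nothing but payload, and that prefix and
    suffix are consumed in the same decode window:

```python
JUMBO_MAJOR = 0x7F  # hypothetical major-opcode value marking a jumbo prefix

def is_jumbo(word):
    # The only "behavior" a jumbo prefix needs: being identifiable.
    return (word & 0x7F) == JUMBO_MAJOR

def decode_pair(prefix_word, suffix_word):
    # Prefix and suffix sit in the same fetch/decode window, so the
    # pair is consumed atomically -- no interrupt lands "between" them.
    if not is_jumbo(prefix_word):
        raise ValueError("expected a jumbo prefix")
    payload = (prefix_word >> 7) & ((1 << 23) - 1)  # 23 payload bits (assumed)
    imm12 = (suffix_word >> 20) & 0xFFF             # RISC-V-style I-type imm
    return (payload << 12) | imm12                  # widened 35-bit immediate
```

    Here the payload simply widens the suffix's immediate; other prefix
    uses (extra register bits, opcode extension) would route the payload
    to different decoder fields.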


    For now, there are not any encodings larger than 96 bits.
    Partly this is because 128-bit fetch would likely add more cost and
    complexity than it is worth at the moment.

    For your implementation, yes. For all others:: <at best> maybe.




    Likewise, no one seems to be bothering with 64-bit ELF FDPIC for RV64
    (there does seem to be some interest for ELF FDPIC but limited to
    32-bit RISC-V ...). Ironically, ideas for doing FDPIC in RV aren't too
    far off from PBO (namely, using GP for a global section and then
    chaining the sections for each binary).

    How are you going to do dense PIC switch() {...} in RISC-V ??

    Already implemented...
    With pseudo-instructions:
        SUB Rs, $(MIN), R10
        MOV $(MAX-MIN), R11
        BGTU R11, R10, Lbl_Dfl

        MOV   .L0, R6      //AUIPC+ADD
        SHAD  R10, 2, R10  //SLLI
        ADD   R6, R10, R6
        JMP   R6           //JALR X0, X6, 0

        .L0:
        BRA  Lbl_Case0     //JAL X0, Lbl_Case0
        BRA  Lbl_Case1
        ...

    Compared to::
    //      ADD        Rt,Rswitch,#-min
           JTT        Rt,#max
           .jttable   min, ... , max, default
    adder:

    The ADD is not necessary if min == 0

    The JTT instruction compares Rt with 0 on the low side and max
    on the high side. If Rt is out of bounds, default is selected.

    The table displacements come in {B,H,W,D}, selected in the JTT
    (jump through table) instruction. Rt indexes the table; the fetched
    entry's signed value is <<2 and added to an address which happens to
    be the address of the JTT instruction + #(max+1)<<entry. {{The table
    is fetched through the ICache with execute permission}}

    Thus, the table is PIC; and generally 1/4 the size of typical
    switch tables.
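    One reading of that arithmetic, as a small Python model (the exact
    base-address convention is my interpretation of the description, not
    a spec):

```python
def jtt_target(jtt_addr, table, index, max_index, entry_log2):
    # Model of the JTT (jump through table) semantics described above.
    # `table` holds max_index + 2 signed word displacements; the entry
    # at max_index + 1 is the default case, known to be in the table.
    if index < 0 or index > max_index:
        index = max_index + 1          # out of bounds selects default
    disp = table[index]
    # Per the text, the base is the address of the JTT instruction
    # + #(max+1) << entry-size; the fetched entry is scaled by << 2.
    base = jtt_addr + ((max_index + 1) << entry_log2)
    return base + (disp << 2)
```

    Because every target is a displacement relative to the JTT
    instruction, the table itself needs no relocation: that is what makes
    it position-independent.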
    -----

    Potentially it could be more compact.

    Both more compact and just as fast; many times faster.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Paul A. Clayton on Mon Sep 30 05:52:56 2024
    On Sat, 28 Sep 2024 1:44:10 +0000, Paul A. Clayton wrote:

    On 9/27/24 11:52 AM, MitchAlsup1 wrote:
    On Fri, 27 Sep 2024 9:46:01 +0000, BGB wrote:
    [snip]
    RV's selection of 3R compare ops is more limited:
       RV: SLT, SLTU
       BJX2: CMPEQ, CMPNE, CMPGT, CMPGE, CMPHI, CMPHS, TST, NTST
    A lot of these cases require a multi-op sequence to implement
    with just
    SLT and SLTU.

    My 66000 can do::  1 < i && i <= MAX in 1 instruction

    Did you mean "0 < i && i <= MAX" (Fortran-IN comparison result) or
    "1 <= i && i <= MAX" (which is the same, for unsigned)? Or am I
    missing a capability of My 66000?

    Technically it is:: 0 < i && i <= max
    LLVM will convert 1 <= i into 0 < i in an unsigned sense.

    There is a corresponding:: 0 <= i && i < max, too, called
    CIN (C's version of IN)

    Itanium comparison instructions were interesting in that the one
    bit result (which was stored in two condition registers, one
    storing the complement) could be ANDed or ORed with another
    condition register as part of the instruction. This merging
    apparently allowed some complex comparison merging to be done
    in one cycle with multiple compare instructions.

    I have some similar stuff.

    Itanium's method may not be the best way of merging conditions,
    but there may be some benefit from not using sequential branches
    to perform this function (or SHIFT and OR/AND to merge a bit in a
    comparison result). (I suppose a microarchitecture could use one
    BTB entry for both branches, but detecting such seems a fair
    amount of work for an uncommon case.)

    Another weird concept that came to mind would be providing an
    8-bit (e.g.) field that enumerated a set of interesting
    conditions.

    I use a 64-bit container of conditions

    Combined with a Table Transfer instruction (which
    jumps either to an indexed table position for execution or uses
    the indexed table entry as a jump target address) this _might_
    be useful, though I doubt there are many cases where a general
    condition test could set a dense enumeration of cases that are
    of interest for the specific use. Yet perhaps mentioning this
    weirdness might stir someone else's *useful* creativity.

    (For small local switch-like jumps, using 16-bit table entries
    might be practical starting immediately after the instruction.

    We have not found (yet) a subroutine that needed more than 16-bit
    displacement for "within subroutine" branch.

    A 64-bit immediate would provide four targets with the next
    word being a "free" target specification. 8-bit offsets might
    be practical for small switches, especially if accumulated (i.e.,
    as if performing a sequence of short branches); adding four
    8-bit values would be fairly fast. Yes, more weirdness.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Mon Sep 30 06:03:43 2024
    On Mon, 30 Sep 2024 2:19:41 +0000, BGB wrote:

    On 9/29/2024 2:11 PM, MitchAlsup1 wrote:

    I noticed this in 1991 when we got Mc88120 simulator up and running.
    GBOoO chips are <nearly> best served when there is the smallest number
    of instructions.


    Looking it up, seems the CPU in question (MIPS R4300) was:
    16K L1 I$ cache;
    8K L1 D$ cache;
    No L2 cache (but could be supported off-die);
    1-wide scalar, 32 or 64 bit
    Non-pipelined FPU and multiplier

    we had::
    16 KB 4-way banked unified cache
    and a 1024 entry trace cache
    Best case DRAM was 5 cycles with typical case at 9 cycles


    How many bits does one of these jumbo prefixes consume ?

    The prefix itself is 32 bits.
    In the context of XG3, it supplies 23 or 27 bits.

    So it has a cost of 9 or 5 bits
    ---------

    Well, and people obsessing over what happens if an interrupt somehow
    occurs "between" the prefix and the prefixed instruction.

    Which, as I have tended to implement them, is simply not possible, since everything is fetched and decoded at the same time.


    Granted, yes, it does add the drawback of needing to have tag-bits to remember the mode, and maybe the CPU hiding mode bits in the high order
    bits of the link register and similar is not such an elegant idea.

    So, what do you do when the jumbo is in the last word of a line and
    the next line takes "a long time" to arrive ?
    -----

    For your implementation, yes. For all others:: <at best> maybe.

    Maybe.

    I could maybe consider widening fetch/decode to 128-bits if there were a compelling use case.

    Remember I am doing 128-bit fetches on a 1-wide machine
    and 3 128-bit fetches on the 6-wide machine.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Robert Finch on Tue Oct 1 21:29:14 2024
    On Tue, 1 Oct 2024 10:00:49 +0000, Robert Finch wrote:

    On 2024-09-29 10:19 p.m., BGB wrote:
    On 9/29/2024 2:11 PM, MitchAlsup1 wrote:

    Compared to::
    //      ADD        Rt,Rswitch,#-min
            JTT        Rt,#max
            .jttable   min, ... , max, default
    adder:

    The ADD is not necessary if min == 0

    The JTT instruction compares Rt with 0 on the low side and max
    on the high side. If Rt is out of bounds, default is selected.

    The table displacements come in {B,H,W,D} selected in the JTT
    (jump through table) instruction. Rt indexes the table, its
    signed value is <<2 and added to address which happens to be
    address of JTT instruction + #(max+1)<<entry. {{The table is
    fetched through the ICache with execute permission}}

    Thus, the table is PIC; and generally 1/4 the size of typical
    switch tables.

    How well does JTT work with large tables? What if there are several
    hundred table entries?

    Max table size is 2^16 entries. The instruction specifies the entry
    size {B,H,W,D} and scales the index accordingly. If the index is
    negative or > MAX then the max+1 entry is fetched (default) {known
    to be in the table}. The fetched entry is scaled by <<2 for a word
    displacement off the IP, and then added to form the new IP location.
    Fetch proceeds.

    {{The assembler checks (and adjusts) the size after resolving
    the labels of the switch().}}

    Most of the time the fetched entry is already in the instruction
    buffer and can simply be muxed out (1 cycle) instead of paying the
    ICache read latency.

    For Q+ indirect jump the values loaded from the table replace the low
    order bits of the PC instead of being a displacement. Only {W,T,O} are supported. (W=wyde,T=tetra,O=octa). Should add an option for
    displacements. Borrowed the memory indirect jump from the 68k.

    I have memory indirect CALL instructions.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Fri Sep 27 15:52:34 2024
    On Fri, 27 Sep 2024 9:46:01 +0000, BGB wrote:

    Had recently been working on getting BGBCC to target RV64G.

    Array Load/Store:
    M66: 1 instruction
    XG2: 1 instruction
    RV64: 3 instructions

    Global Variable:
    M66: 1 instruction (anywhere in 64-bit memory)
    XG2: 1 instruction (if within 2K of GBR)
    RV64: 1 or 4 instructions

    Constant Load into register (not R5):
    M66: 0 instructions
    XG2: 1 instruction
    RV64: ~ 1-6

    Operator with 32-bit immediate:
    M66: 1 instruction
    BJX2: 1 instruction;
    RV64: 3 instructions.

    Operator with 64-bit immediate:
    M66: 1 instruction
    BJX2: 1 instruction;
    RV64: 4-9 instructions.



    Floating point is still a bit of a hack, as it is currently implemented
    by shuffling values between GPRs and FPRs, but sorta works.

    My 66000 has a common register file.


    RV's selection of 3R compare ops is more limited:
    RV: SLT, SLTU
    BJX2: CMPEQ, CMPNE, CMPGT, CMPGE, CMPHI, CMPHS, TST, NTST
    A lot of these cases require a multi-op sequence to implement with just
    SLT and SLTU.

    My 66000 can do:: 1 < i && i <= MAX in 1 instruction


    ....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Fri Sep 27 19:40:32 2024
    On Fri, 27 Sep 2024 18:26:28 +0000, BGB wrote:

    On 9/27/2024 7:50 AM, Robert Finch wrote:
    On 2024-09-27 5:46 a.m., BGB wrote:
    ---------

    But, BJX2 does not spam the ADD instruction quite so hard, so it is more forgiving of latency. In this case, an optimization that reduces
    common-case ADD to 1 cycle was being used (it only works though in the
    CPU core if the operands are both in signed 32-bit range and no overflow occurs; IIRC optionally using a sign-extended AGU output as a stopgap
    ALU output before the output arrives from the main ALU the next cycle).

    RISC-V group opinion is that "we have done nothing to damage pipeline
    operating frequency". {{Except the moving of register specifier fields
    between 32-bit and 16-bit instructions; except for: AGEN-RAM-CMP-ALIGN
    in 2 cycles, and several others...}}


    Comparably, it appears BGBCC leans more heavily into ADD and SLLI than
    GCC does, with a fair chunk of the total instructions executed being
    these two (more cycles are spent adding and shifting than doing memory
    load or store...).

    That seems to be a bit off. Mem ops are usually around 1/4 of

    Most agree it is closer to 30% than 25% {{Unless you clutter up the ISA
    such that your typical memref needs a support instruction.}}

    instructions. Spending more than 25% on adds and shifts seems like a
    lot. Is it address calcs? Register loads of immediates?


    It is both...


    In BJX2, the dominant instruction tends to be memory Load.
    Typical output from BGBCC for Doom is (at runtime):
    ~ 70% fixed-displacement;
    ~ 30% register-indexed.
    Static output differs slightly:
    ~ 84% fixed-displacement;
    ~ 16% register-indexed.

    RV64G lacks register-indexed addressing, only having fixed displacement.

    If you need to do a register-indexed load in RV64:
    SLLI  X5, Xo, 2    //shift by size of index
    ADD   X5, Xm, X5   //add base and index
    LW    Xn, X5, 0    //do the load

    This case is bad...

    Which makes that 16% (above) into 48% and renormalizing to::
    ~ 63% fixed-displacement;
    ~ 36% register-indexed and support instructions.
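    The renormalization follows from costing each register-indexed access
    at 3 RV64 instructions (SLLI + ADD + LW); a quick check of the
    figures:

```python
# Static mix from above: 84% fixed-displacement, 16% register-indexed.
# On RV64 each register-indexed access expands to 3 instructions,
# so 16 accesses become 48 instructions while 84 stay as 84.
fixed = 84
indexed = 16 * 3                     # 48
total = fixed + indexed              # 132 instruction "units"
pct_fixed = 100 * fixed / total      # ~63.6%
pct_indexed = 100 * indexed / total  # ~36.4%
```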


    Also global variables outside the 2kB window:
    LUI X5, DispHi
    ADDI X5, X5, DispLo
    ADD X5, GP, X5
    LW Xn, X5, 0

    Where, sorting global variables by usage priority gives:
    ~ 35%: in range
    ~ 65%: not in range

    Illustrating the fallacy of 12-bits of displacement.

    Comparably, XG2 has a 16K or 32K reach here (depending on immediate
    size), which hits most of the global variables. The fallback Jumbo
    encoding hits the rest.

    I get ±32K with 16-bit displacements


    Theoretically, could save 1 instruction here, but would need to add two
    more reloc types to allow for:
    LUI, ADD, Lx
    LUI, ADD, Sx
    Because annoyingly Load and Store have different displacement encodings;
    and I still need the base form for other cases.


    A more compact way to load/store global variables would be to use
    absolute 32-bit or PC-relative addressing:
    LUI + Lx/Sx : Abs32
    AUIPC + Lx/Sx : PC-Rel32

    MEM Rd,[IP,,DISP32/64] // IP-rel

    -----

    Likewise, no one seems to be bothering with 64-bit ELF FDPIC for RV64
    (there does seem to be some interest for ELF FDPIC but limited to 32-bit RISC-V ...). Ironically, ideas for doing FDPIC in RV aren't too far off
    from PBO (namely, using GP for a global section and then chaining the sections for each binary).

    How are you going to do dense PIC switch() {...} in RISC-V ??

    Main difference being that FDPIC uses fat
    function pointers and does the GP reload on the caller, vs PBO where I
    use narrow function pointers and do the reload on the callee (with
    load-time fixups for the PBO Offset).


    The result of all this is a whole lot of unnecessary shifts and ADDs.

    Seemingly, even more for BGBCC than for GCC, which already had a lot of shifts and adds.

    BGBCC basically entirely dethrones the Load and Store ops ...


    Possibly more so than GCC, which tended to turn most constant loads into memory loads. It would load a table of constants into a register and
    then pull constants from the table, rather than compose them inline.

    Say, something like:
    AUIPC X18, DispHi
    ADDI  X18, X18, DispLo
    (X18 now holds a table of constants, pointing into .rodata)

    And, when it needs a constant:
    LW Xn, X18, Disp //offset of the constant it wants.
    Or:
    LD Xn, X18, Disp //64-bit constant


    Currently, BGBCC does not use this strategy.
    Though, for 64-bit constants it could be more compact and faster.

    But, better still would be having Jumbo prefixes or similar, or even a
    SHORI instruction.

    Better Still Still is having 32-bit and 64-bit constants available
    from the instruction stream and positioned in either operand position.

    Say, 64-bit constant-load in SH-5 or similar:
    xxxxyyyyzzzzwwww
    MOV ImmX, Rn
    SHORI ImmY, Rn
    SHORI ImmZ, Rn
    SHORI ImmW, Rn
    Where, one loads the constant in 16-bit chunks.
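    The shift-and-insert semantics can be sketched as follows
    (zero-extension assumed on the initial MOV for simplicity; the real
    SH MOV sign-extends its immediate):

```python
MASK64 = (1 << 64) - 1

def mov_imm16(imm16):
    # MOV ImmX, Rn: load the first 16-bit chunk.
    return imm16 & 0xFFFF

def shori(reg, imm16):
    # SHORI ImmY, Rn: shift Rn left by 16 and OR in the next chunk.
    return ((reg << 16) | (imm16 & 0xFFFF)) & MASK64

# Building xxxxyyyyzzzzwwww takes four instructions:
r = mov_imm16(0x1234)
r = shori(r, 0x5678)
r = shori(r, 0x9ABC)
r = shori(r, 0xDEF0)   # r == 0x123456789ABCDEF0
```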

    Yech



    Don't you ever snip anything ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Fri Sep 27 23:21:01 2024
    On Fri, 27 Sep 2024 21:28:41 +0000, BGB wrote:

    On 9/27/2024 10:52 AM, MitchAlsup1 wrote:

    My 66000 can do::  1 < i && i <= MAX in 1 instruction


    BJX2:
    CMPQGT R4, 1, R16
    CMPQLT R4, (MAX+1), R17 //*1
    AND R16, R17, R5

    So, more than 1 instruction, but less than faking it with SLT / SLTI ...

    CMP Rt,Ri,MAX
    BFIN Rt,label // fin = Fortran IN
    -----

    It is better for performance though to be able to flip the output bit in
    the pipeline than to need to use an XOR instruction or similar.

    I do integer negation by flipping all the bits and running a carry
    into the subsequent ALU.

    Thus, if the target calculation unit is logical one gets inversion,
    if integer one gets negation, and FPUs only invert the sign bit.
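    That trick is just the two's-complement identity -x == ~x + 1, with
    the +1 supplied as a carry-in rather than a separate add; a one-liner
    to check it at 64 bits:

```python
MASK = (1 << 64) - 1   # 64-bit register width

def negate(x):
    # Invert all the bits, then run a carry of 1 into the
    # subsequent ALU stage: -x == ~x + 1 (mod 2^64).
    return ((x ^ MASK) + 1) & MASK
```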

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Sat Sep 28 00:43:49 2024
    On Fri, 27 Sep 2024 23:53:22 +0000, BGB wrote:

    On 9/27/2024 2:40 PM, MitchAlsup1 wrote:
    On Fri, 27 Sep 2024 18:26:28 +0000, BGB wrote:

    But, generally this does still impose limits:
    Can't reorder instructions across a label;
    Can't move instructions with an associated reloc;

    I always did code motion prior to assembler. Code motion only has to
    consider:: 1-operand, 2-operand, 3-operand, branch, label, LD, ST.

    Can't reorder memory instructions unless they can be proven to not alias (loads may be freely reordered, but the relative order of loads and
    stores may not unless provably non-aliasing);

    Same base register different displacement.

    The effectiveness of this does depend on how the C code is written
    though (works favorably with larger blocks of mostly-independent expressions).

    One of the reasons reservation stations came into vogue.

    Most agree it is closer to 30% than 25% {{Unless you clutter up the ISA
    such that your typical memref needs a support instruction.}}


    Cough, RV64...
    -----
    Which makes that 16% (above) into 48% and renormalizing to::
          ~ 63% fixed-displacement;
          ~ 36% register-indexed and support instructions.

    Yeah.

    I think there are reasons here why I am generally getting lackluster performance out of RV64...
    -----
    Comparably, XG2 has a 16K or 32K reach here (depending on immediate
    size), which hits most of the global variables. The fallback Jumbo
    encoding hits the rest.

    I get ±32K with 16-bit displacements


    Baseline has special case 32-bit ops:
    MOV.L (GBR, Disp10u), Rn //4K
    MOV.Q (GBR, Disp10u), Rn //8K

    But, in XG2, it gains 2 bits:
    MOV.L (GBR, Disp12u), Rn //16K
    MOV.Q (GBR, Disp12u), Rn //32K

    Jumbo can encode +/- 4GB here (64-bit encoding).
    MOV.L (GBR, Disp33s), Rn //+/- 4GB
    MOV.Q (GBR, Disp33s), Rn //+/- 4GB

    Mostly because GBR displacements are unscaled.
    Plan for XG3 is that all Disp33s encodings would be unscaled.

    The assembler gets to choose based on the memory model::

    MEM Rd,[Rb,Ri<<s,DISP]

    Assembler (or even linker) can choose 32-bit or 64-bit based on a
    variety of things {flags, memory model, size of linked module, ...}

    BJX2 can also do (PC, Disp33s) in a single logical instruction...

    But, RISC-V can't...

    What is your definition of "single logical instruction"? In my parlance,
    a single logical instruction can be::

    ST #64-bit-const,[Rb,Ri<<s,DISP64]

    is 1 instruction occupying 5 words.



    Likewise, no one seems to be bothering with 64-bit ELF FDPIC for RV64
    (there does seem to be some interest for ELF FDPIC but limited to
    32-bit RISC-V ...). Ironically, ideas for doing FDPIC in RV aren't too
    far off from PBO (namely, using GP for a global section and then
    chaining the sections for each binary).

    How are you going to do dense PIC switch() {...} in RISC-V ??

    Already implemented...

    With pseudo-instructions:
        SUB Rs, $(MIN), R10
        MOV $(MAX-MIN), R11
        BGTU R11, R10, Lbl_Dfl

        MOV   .L0, R6      //AUIPC+ADD
        SHAD  R10, 2, R10  //SLLI
        ADD   R6, R10, R6
        JMP   R6           //JALR X0, X6, 0

        .L0:
        BRA  Lbl_Case0     //JAL X0, Lbl_Case0
        BRA  Lbl_Case1
        ...

    Compared to::
    // ADD Rt,Rswitch,#-min
    JTT Rt,#max
    .jttable min, ... , max, default
    adder:

    The ADD is not necessary if min == 0

    The JTT instruction compares Rt with 0 on the low side and max
    on the high side. If Rt is out of bounds, default is selected.

    The table displacements come in {B,H,W,D} selected in the JTT
    (jump through table) instruction. Rt indexes the table, its
    signed value is <<2 and added to address which happens to be
    address of JTT instruction + #(max+1)<<entry. {{The table is
    fetched through the ICache with execute permission}}

    Thus, the table is PIC; and generally 1/4 the size of typical
    switch tables.
    -----
    Currently, BGBCC does not use this strategy.
    Though, for 64-bit constants it could be more compact and faster.

    But, better still would be having Jumbo prefixes or similar, or even a
    SHORI instruction.

    Better Still Still is having 32-bit and 64-bit constants available
    from the instruction stream and positioned in either operand position.


    Granted...


    Say, 64-bit constant-load in SH-5 or similar:
       xxxxyyyyzzzzwwww
       MOV   ImmX, Rn
       SHORI ImmY, Rn
       SHORI ImmZ, Rn
       SHORI ImmW, Rn
    Where, one loads the constant in 16-bit chunks.

    Yech


    But, 4 is still less than 6.

    1 is less than 4, too.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Sat Oct 5 23:10:32 2024
    On Thu, 3 Oct 2024 8:20:46 +0000, BGB wrote:

    On 10/1/2024 5:00 AM, Robert Finch wrote:
    On 2024-09-29 10:19 p.m., BGB wrote:
    The ADD is not necessary if min == 0

    The JTT instruction compares Rt with 0 on the low side and max
    on the high side. If Rt is out of bounds, default is selected.

    The table displacements come in {B,H,W,D} selected in the JTT
    (jump through table) instruction. Rt indexes the table, its
    signed value is <<2 and added to address which happens to be
    address of JTT instruction + #(max+1)<<entry. {{The table is
    fetched through the ICache with execute permission}}

    Thus, the table is PIC; and generally 1/4 the size of typical
    switch tables.

    How well does JTT work with large tables? What if there are several
    hundred table entries?

    Tables can have 2^16-1 (65535) case entries.

    For Q+ indirect jump the values loaded from the table replace the low
    order bits of the PC instead of being a displacement. Only {W,T,O} are
    supported. (W=wyde,T=tetra,O=octa). Should add an option for
    displacements. Borrowed the memory indirect jump from the 68k.

    My 66000 Loads the table entry directly into IP in 1 cycle less
    than LD latency.



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Tue Oct 8 20:48:53 2024
    On Tue, 8 Oct 2024 19:06:34 +0000, BGB wrote:

    On 10/5/2024 6:10 PM, MitchAlsup1 wrote:
    On Thu, 3 Oct 2024 8:20:46 +0000, BGB wrote:

    How well does JTT work with large tables? What if there are several
    hundred table entries?

    Tables can have 2^16-1 (65534) case entries.


    There is no hard limit in my case, but BGBCC has generally broken up
    tables larger than 256 entries.

    IIRC, this was because larger tables were more likely to have "voids"
    which could lead to a large number of branches to default while still
    being over a 75% density threshold; splitting the table apart was more
    likely to expose these voids (and make the binary smaller).

    Yes, one would expect voids in tables would cause the compiler to
    break the switch table into more dense chunks.

    I guess a possible tweak could be: if the density is over 93% or
    so, allow a table-jump regardless of size.

    If the void is greater than the overhead to break the table, then
    the table should be broken.

    Though, for the most part, the programs I am testing with tend not to
    have too many large switch blocks, and in the case of "switch" on a
    byte (the most common of these cases), a limit of 256 works.


    Meanwhile, for some other cases, like a switch full of TWOCC and FOURCC values, one really does not want to use a jump table (but, this is
    avoided as these cases tend to have a very low density).



    For Q+ indirect jump the values loaded from the table replace the low
    order bits of the PC instead of being a displacement. Only {W,T,O} are >>>> supported. (W=wyde,T=tetra,O=octa). Should add an option for
    displacements. Borrowed the memory indirect jump from the 68k.

    My 66000 Loads the table entry directly into IP in 1 cycle less
    than LD latency.


    I guess, specialized Load+Branch could potentially have less latency
    than separate load+branch, or the current strategy of double-branching.

    Think of it as a LD IP,[address]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Oct 9 10:31:09 2024
    I guess it is also possible to make it such that the required density (to avoid a split) is correlated with table size.

    Say (table size, minimum density):
    128, 75%
    256, 80%
    512, 85%
    1024, 90%
    2048, 95%
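    That sketch could be expressed as a small helper (names hypothetical;
    the density formula (100*count)/(max-min) is the one given elsewhere
    in the thread):

```python
# (table size, minimum density %) pairs from the list above.
THRESHOLDS = [(128, 75), (256, 80), (512, 85), (1024, 90), (2048, 95)]

def min_density(span):
    # The smallest listed size covering the table's span decides the
    # cutoff; anything larger uses the strictest listed density.
    for size, density in THRESHOLDS:
        if span <= size:
            return density
    return THRESHOLDS[-1][1]

def should_split(case_count, lo, hi):
    # Split the jump table when its case density falls below the
    # size-dependent minimum.
    density = (100 * case_count) // (hi - lo)
    return density < min_density(hi - lo)
```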

    Density is just a vague indicator. I think it makes more sense to
    define your heuristic based on specific (estimated) costs
    (e.g. code(+table) size, or number of cycles).


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Wed Oct 9 16:24:07 2024
    On Wed, 9 Oct 2024 5:40:02 +0000, BGB wrote:

    On 10/8/2024 3:48 PM, MitchAlsup1 wrote:
    On Tue, 8 Oct 2024 19:06:34 +0000, BGB wrote:

    At present, there isn't anything to detect voids directly, but density percentage is easy to detect:
    (100*count)/(max-min)

    defaults_in_a_row > 3 words.

    ---------------
    To some extent, everything is heuristics.

    I guess it is also possible to make it such that the required density (to avoid a split) is correlated with table size.

    Say (table size, minimum density):
    128, 75%
    256, 80%
    512, 85%
    1024, 90%
    2048, 95%

    defaults_in_a_row > 3 words.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Paul A. Clayton on Wed Oct 16 11:07:08 2024
    On 10/16/2024 8:59 AM, Paul A. Clayton wrote:

    snip

    I do not know of any enumeration of conditions that would be
    commonly useful. Less than, equal to, greater than might be
    somewhat useful for a three-way branch.

    That was the function of the arithmetic if statement in original
    Fortran. If it were more useful, it wouldn't have been taken out of the language long ago.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Stephen Fuld on Wed Oct 16 19:17:17 2024
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 10/16/2024 8:59 AM, Paul A. Clayton wrote:

    snip

    I do not know of any enumeration of conditions that would be
    commonly useful. Less than, equal to, greater than might be
    somewhat useful for a three-way branch.

    That was the function of the arithmetic if statement in original
    Fortran. If it were more useful, it wouldn't have been taken out of the language long ago.

    Not so long ago, actually, it was only dropped in Fortran 2018.
    I actually think that this is a bad idea; compilers will continue
    to support such features, but possible interactions with other
    features will no longer be properly defined.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Paul A. Clayton on Wed Oct 16 19:11:02 2024
    On Wed, 16 Oct 2024 15:59:57 +0000, Paul A. Clayton wrote:

    On 9/30/24 1:52 AM, MitchAlsup1 wrote:
    On Sat, 28 Sep 2024 1:44:10 +0000, Paul A. Clayton wrote:
    [snip]
    Another weird concept that came to mind would be providing an
    8-bit (e.g.) field that enumerated a set of interesting
    conditions.

    I use a 64-bit container of conditions

    An enumeration of conditions is different from a bitmask of
    conditions. An enumeration could support N-way branching in a
    single instruction rather than a tree of single bit-condition
    branches.

    My 66000's compare result has unused space for multiple such
    enumerations.

    One can "do" 3-way branching as is:: CMP-BC1-BC2-other

    I do not know of any enumeration of conditions that would be
    commonly useful. Less than, equal to, greater than might be
    somewhat useful for a three-way branch. Relation to zero as well
    as an explicit comparison value might be useful for some multi-way
    choices.

    3-way branches are out of style:: Fortran disinherited them
    while IEEE 754 made them need to be 4-way (NaN).

    Lack of density is also a problem for multi-way branches; the
    encoding will waste space if multiple enumerated states share a
    target.

    The concept seemed worth mentioning even if I thought it unlikely
    to be practically useful.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Wed Oct 16 22:16:29 2024
    On Wed, 16 Oct 2024 20:23:08 +0000, BGB wrote:


    Ironically, one of the main arguable use-cases for old Fortran-style IF
    statements is implementing the binary dispatch logic in a
    binary-subdivided "switch()", but not enough to justify having a
    dedicated instruction for it.

    Say:
    MOV Imm, Rt //pivot case
    BLT Rt, Rx, .lbl_lo
    BGT Rt, Rx, .lbl_hi
    BRA .lbl_case
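    The pivot test above is one node of the classic binary-search lowering
    of a sparse switch; the full dispatch, sketched in Python (case values
    hypothetical):

```python
def dispatch(x, cases, default):
    # `cases` is a sorted list of (value, target) pairs. Each loop
    # iteration mirrors one pivot node of the MOV/BLT/BGT/BRA sequence.
    lo, hi = 0, len(cases) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        pivot, target = cases[mid]   # MOV Imm, Rt  (pivot case)
        if x < pivot:                # first conditional branch (low half)
            hi = mid - 1
        elif x > pivot:              # second conditional branch (high half)
            lo = mid + 1
        else:                        # BRA .lbl_case
            return target
    return default
```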

    With a 64-bit instruction one could do::

    B3W .lbl_lo,.lbl_zero,.lbl_hi

    rather straightforwardly.....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Fri Oct 18 02:28:45 2024
    On Thu, 17 Oct 2024 0:03:26 +0000, BGB wrote:

    On 10/16/2024 5:16 PM, MitchAlsup1 wrote:
    On Wed, 16 Oct 2024 20:23:08 +0000, BGB wrote:


    Ironically, one of the main arguable use-cases for old Fortran-style IF
    statements is implementing the binary dispatch logic in a
    binary-subdivided "switch()", but not enough to justify having a dedicated
    instruction for it.

    Say:
       MOV  Imm, Rt  //pivot case
       BLT  Rt, Rx, .lbl_lo
       BGT  Rt, Rx, .lbl_hi
       BRA  .lbl_case

    With a 64-bit instruction one could do::

        B3W   .lbl_lo,.lbl_zero,.lbl_hi

    rather straightforwardly.....

    Possibly, but the harder part would be to deal with decoding and feeding
    the instruction through the pipeline.

    Feed the 3×15-bit displacements to the branch unit. When the condition resolves, use one of the 2 selected displacements as the target address.

    Granted, I guess it could be decoded as if it were a normal 3RI op or similar, but then split up the immediate into multiple parts in EX1.

    Why would you want to make it 3×11-bit displacements when you can
    make it 3×16-bit displacements?

    +------+-----+-----+----------------+
    | Bc   |  3W |  Rt |   .lb_lo       |
    +------+-----+-----+----------------+
    |   .lb_zero       |  .lb_hi        |
    +------------------+----------------+

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Mon Oct 21 21:10:20 2024
    On Mon, 21 Oct 2024 20:13:20 +0000, BGB wrote:

    On 10/17/2024 9:28 PM, MitchAlsup1 wrote:

    Granted, I guess it could be decoded as if it were a normal 3RI op or
    similar, but then split up the immediate into multiple parts in EX1.

    Why would you want to make it 3×11-bit displacements when you can
    make it 3×16-bit displacements?

        +------+-----+-----+----------------+
        | Bc   |  3W |  Rt |   .lb_lo       |
        +------+-----+-----+----------------+
        |   .lb_zero       |  .lb_hi        |
        +------------------+----------------+

    Neither BJX2 nor RISC-V have the encoding space to pull this off...
    Even in a clean-slate ISA, it would be a big ask.

    If you remove compressed instructions from RISC-V, you have enough
    room left over to put the entire My 66000 ISA. ... ... ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)