On 9/27/2024 7:43 PM, MitchAlsup1 wrote:
On Fri, 27 Sep 2024 23:53:22 +0000, BGB wrote:
One of the reasons reservation stations became in vogue.
Possibly, but that is a CPU feature rather than a compiler feature...
Saw a video not too long ago where he was making code faster by undoing
a lot of loop unrolling, as the code was apparently losing more to I$
misses than it was gaining by being unrolled.
------------
In contrast, a jumbo prefix by itself does not make sense; its meaning
depends on the thing that is being prefixed. Also, the decoder will
decode a jumbo prefix and its suffixed instruction at the same time.
For the jumbo prefix:
Recognize that it is a jumbo prefix;
Inform the decoder for the following instruction of this fact
(via internal flag bits);
Provide the prefix's data bits to the corresponding decoder.
Unlike a "real" instruction, a jumbo prefix does not need to provide
behavior of its own, merely be able to be identified as such and to
provide payload data bits.
For now, there are not any encodings larger than 96 bits.
Partly this is because 128 bit fetch would likely add more cost and complexity than it is worth at the moment.
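The prefix handling described above might be sketched as follows. This is a toy model, not the actual BJX2 decoder; the opcode value and field widths are made up for illustration:

```python
# Toy model of jumbo-prefix handling: the prefix carries no behavior of
# its own; it only sets a flag and hands payload bits to the next decoder.
# The opcode value (0x7F) and payload layout are invented for this sketch.

JUMBO_OP = 0x7F  # hypothetical major opcode marking a jumbo prefix

def decode_pair(words):
    """Decode a fetch group; fold any jumbo prefix into the next word."""
    jumbo_flag = False
    jumbo_bits = 0
    insts = []
    for w in words:
        if (w & 0x7F) == JUMBO_OP:       # recognize that it is a prefix...
            jumbo_flag = True            # ...inform the following decoder
            jumbo_bits = w >> 7          # ...and save its payload bits
            continue
        insts.append({"word": w, "jumbo": jumbo_flag, "extra": jumbo_bits})
        jumbo_flag, jumbo_bits = False, 0
    return insts

# A prefix word followed by a normal instruction decodes as one instruction
# that sees the prefix's payload.
out = decode_pair([(0x12345 << 7) | 0x7F, 0xDEAD0000])
```

The point the sketch makes is that the prefix never reaches execute as its own instruction; it only widens the decode of its neighbor.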
Likewise, no one seems to be bothering with 64-bit ELF FDPIC for RV64
(there does seem to be some interest for ELF FDPIC, but limited to
32-bit RISC-V ...). Ironically, ideas for doing FDPIC in RV aren't too
far off from PBO (namely, using GP for a global section and then
chaining the sections for each binary).
How are you going to do dense PIC switch() {...} in RISC-V ??
Already implemented...
With pseudo-instructions:
SUB Rs, $(MIN), R10
MOV $(MAX-MIN), R11
BGTU R11, R10, Lbl_Dfl
MOV .L0, R6 //AUIPC+ADD
SHAD R10, 2, R10 //SLLI
ADD R6, R10, R6
JMP R6 //JALR X0, X6, 0
.L0:
BRA Lbl_Case0 //JAL X0, Lbl_Case0
BRA Lbl_Case1
...
Compared to::
// ADD Rt,Rswitch,#-min
JTT Rt,#max
.jttable min, ... , max, default
adder:
The ADD is not necessary if min == 0
The JTT instruction compares Rt with 0 on the low side and max
on the high side. If Rt is out of bounds, default is selected.
The table displacements come in {B,H,W,D}, selected in the JTT
(jump through table) instruction. Rt indexes the table; the selected
entry's signed value is <<2 and added to an address which happens to be
the address of the JTT instruction + #(max+1)<<entry. {{The table is
fetched through the ICache with execute permission}}
Thus, the table is PIC; and generally 1/4 the size of typical
switch tables.
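The JTT address arithmetic can be modeled roughly as below. This is a sketch from the prose, not My 66000's definition; in particular, placing the default entry in slot max+1 is my assumption from the ".jttable min, ..., max, default" layout:

```python
# Toy model of the JTT target calculation. entry_log2 selects {B,H,W,D}
# (0=byte, 1=half, 2=word, 3=double). Default-entry placement (slot max+1)
# is an assumption, not a documented fact.

def jtt_target(jtt_pc, table, rt, max_idx, entry_log2):
    """Target = (selected entry's signed value << 2) added to the address
    just past the table: JTT pc + ((max+1) << entry)."""
    idx = rt if 0 <= rt <= max_idx else max_idx + 1  # out of bounds -> default
    disp = table[idx]                                # signed table entry
    table_end = jtt_pc + ((max_idx + 1) << entry_log2)
    return table_end + (disp << 2)
```

Since every quantity is relative to the JTT instruction's own address, the table needs no relocation: it is PIC by construction.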
-----
Potentially it could be more compact.
On 9/27/24 11:52 AM, MitchAlsup1 wrote:
On Fri, 27 Sep 2024 9:46:01 +0000, BGB wrote:
[snip]
RV's selection of 3R compare ops is more limited:
RV: SLT, SLTU
BJX2: CMPEQ, CMPNE, CMPGT, CMPGE, CMPHI, CMPHS, TST, NTST
A lot of these cases require a multi-op sequence to implement
with just
SLT and SLTU.
My 66000 can do:: 1 < i && i <= MAX in 1 instruction
Did you mean "0 < i && i <= MAX" (Fortran-IN comparison result) or
"1 <= i && i <= MAX" (which is the same, for unsigned)? Or am I
missing a capability of My 66000?
Itanium comparison instructions were interesting in that the one
bit result (which was stored in two condition registers, one
storing the complement) could be ANDed or ORed with another
condition register as part of the instruction. This merging
apparently allowed some complex comparison merging to be done
in one cycle with multiple compare instructions.
Itanium's method may not be the best way of merging conditions,
but there may be some benefit from not using sequential branches
to perform this function (or SHIFT and OR/AND to merge a bit into a
comparison result). (I suppose a microarchitecture could use one
BTB entry for both branches, but detecting such seems a fair
amount of work for an uncommon case.)
Another weird concept that came to mind would be providing an
8-bit (e.g.) field that enumerated a set of interesting
conditions.
Combined with a Table Transfer instruction (which
jumps either to an indexed table position for execution or uses
the indexed table entry as a jump target address) this _might_
be useful, though I doubt there are many cases where a general
condition test could set a dense enumeration of cases that are
of interest for the specific use. Yet perhaps mentioning this
weirdness might stir someone else's *useful* creativity.
(For small local switch-like jumps, using 16-bit table entries
might be practical starting immediately after the instruction.
A 64-bit immediate would provide four targets with the next
word being a "free" target specification. 8-bit offsets might
be practical for small switches, especially if accumulated (i.e.,
as if performing a sequence of short branches); adding four
8-bit values would be fairly fast. Yes, more weirdness.)
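The accumulated-offset "weirdness" above might be modeled like this. Purely illustrative; no ISA described in the thread works this way:

```python
# Toy model of accumulated 8-bit branch offsets: entry i's target is
# wherever a chain of short branches 0..i would have landed, so the
# hardware only needs a small adder tree over at most a few bytes.

def accumulated_target(base, offsets8, idx):
    """Target for case idx, given 8-bit per-entry offsets that accumulate."""
    assert all(0 <= o < 256 for o in offsets8)   # each entry fits in a byte
    return base + sum(offsets8[:idx + 1])
```

The appeal is density: four targets fit in one 32-bit word, at the cost of each target's reach depending on the entries before it.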
On 9/29/2024 2:11 PM, MitchAlsup1 wrote:
I noticed this in 1991 when we got Mc88120 simulator up and running.
GBOoO chips are <nearly> best served when there is the smallest number
of instructions.
Looking it up, seems the CPU in question (MIPS R4300) was:
16K L1 I$ cache;
8K L1 D$ cache;
No L2 cache (but could be supported off-die);
1-wide scalar, 32 or 64 bit
Non-pipelined FPU and multiplier
How many bits does one of these jumbo prefixes consume ?
The prefix itself is 32 bits.
In the context of XG3, it supplies 23 or 27 bits.
Well, and people obsessing over what happens if an interrupt somehow
occurs "between" the prefix and the prefixed instruction.
Which, as I have tended to implement them, is simply not possible, since everything is fetched and decoded at the same time.
Granted, yes, it does add the drawback of needing to have tag-bits to remember the mode, and maybe the CPU hiding mode bits in the high order
bits of the link register and similar is not such an elegant idea.
For your implementation, yes. For all others:: <at best> maybe.
Maybe.
I could maybe consider widening fetch/decode to 128-bits if there were a compelling use case.
On 2024-09-29 10:19 p.m., BGB wrote:
On 9/29/2024 2:11 PM, MitchAlsup1 wrote:
Compared to::
// ADD Rt,Rswitch,#-min
JTT Rt,#max
.jttable min, ... , max, default
adder:
The ADD is not necessary if min == 0
The JTT instruction compares Rt with 0 on the low side and max
on the high side. If Rt is out of bounds, default is selected.
The table displacements come in {B,H,W,D}, selected in the JTT
(jump through table) instruction. Rt indexes the table; the selected
entry's signed value is <<2 and added to an address which happens to be
the address of the JTT instruction + #(max+1)<<entry. {{The table is
fetched through the ICache with execute permission}}
Thus, the table is PIC; and generally 1/4 the size of typical
switch tables.
How well does JTT work with large tables? What if there are several
hundred table entries?
For Q+ indirect jump the values loaded from the table replace the low
order bits of the PC instead of being a displacement. Only {W,T,O} are
supported (W=wyde, T=tetra, O=octa). Should add an option for
displacements. Borrowed the memory indirect jump from the 68k.
Had recently been working on getting BGBCC to target RV64G.
Array Load/Store:
  XG2: 1 instruction
  M66: 1 instruction (anywhere in 64-bit memory)
  RV64: 3 instructions
Global Variable:
  XG2: 1 instruction (if within 2K of GBR)
  M66: 0 instructions
  RV64: 1 or 4 instructions
Constant Load into register (not R5):
  XG2: 1 instruction
  M66: 1 instruction
  RV64: ~ 1-6 instructions
Operator with 32-bit immediate:
  BJX2: 1 instruction
  M66: 1 instruction
  RV64: 3 instructions.
Operator with 64-bit immediate:
BJX2: 1 instruction;
RV64: 4-9 instructions.
Floating point is still a bit of a hack, as it is currently implemented
by shuffling values between GPRs and FPRs, but sorta works.
RV's selection of 3R compare ops is more limited:
RV: SLT, SLTU
BJX2: CMPEQ, CMPNE, CMPGT, CMPGE, CMPHI, CMPHS, TST, NTST
A lot of these cases require a multi-op sequence to implement with just
SLT and SLTU.
....
On 9/27/2024 7:50 AM, Robert Finch wrote:
On 2024-09-27 5:46 a.m., BGB wrote:
---------
But, BJX2 does not spam the ADD instruction quite so hard, so it is more
forgiving of latency. In this case, an optimization that reduces the
common-case ADD to 1 cycle was being used (it only works in the CPU core
if the operands are both in signed 32-bit range and no overflow occurs;
IIRC optionally using a sign-extended AGU output as a stopgap ALU output
before the result arrives from the main ALU the next cycle).
Comparably, it appears BGBCC leans more heavily into ADD and SLLI than
GCC does, with a fair chunk of the total instructions executed being
these two (more cycles are spent adding and shifting than doing memory
load or store...).
That seems to be a bit off. Mem ops are usually around 1/4 of
instructions. Spending more than 25% on adds and shifts seems like a
lot. Is it address calcs? Register loads of immediates?
It is both...
In BJX2, the dominant instruction tends to be memory Load.
Typical output from BGBCC for Doom is (at runtime):
~ 70% fixed-displacement;
~ 30% register-indexed.
Static output differs slightly:
~ 84% fixed-displacement;
~ 16% register-indexed.
RV64G lacks register-indexed addressing, only having fixed displacement.
If you need to do a register-indexed load in RV64:
SLLI X5, Xo, 2 //shift by size of index
ADD X5, Xm, X5 //add base and index
LW Xn, X5, 0 //do the load
This case is bad...
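The lowering difference being described can be sketched as a tiny pseudo-assembly emitter. The mnemonics are illustrative (the BJX2 form in particular is schematic, not exact syntax); the point is the 1-vs-3 instruction count for the same `dst = mem[base + (index << scale)]` access:

```python
# Sketch: lowering a register-indexed load on an ISA with an indexed
# addressing mode vs. RV64G (base+displacement only). Mnemonics are
# illustrative pseudo-assembly, not exact BJX2 or RV assembler syntax.

def lower_indexed_load(dst, base, index, scale_log2, has_indexed):
    """Emit instructions for: dst = mem[base + (index << scale_log2)]."""
    if has_indexed:                               # e.g. BJX2/XG2 style
        return [f"MOV.L ({base}, {index}), {dst}"]
    return [f"SLLI X5, {index}, {scale_log2}",    # shift by size of index
            f"ADD  X5, {base}, X5",               # add base and index
            f"LW   {dst}, X5, 0"]                 # do the load

rv_seq  = lower_indexed_load("Xn", "Xm", "Xo", 2, has_indexed=False)
xg2_seq = lower_indexed_load("Rn", "Rm", "Ro", 2, has_indexed=True)
```

Every array access in a hot loop pays this 3x cost on RV64G unless the compiler can strength-reduce the index into a moving pointer.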
Also global variables outside the 2kB window:
LUI X5, DispHi
ADDI X5, X5, DispLo
ADD X5, GP, X5
LW Xn, X5, 0
Where, sorting global variables by usage priority gives:
~ 35%: in range
~ 65%: not in range
Comparably, XG2 has a 16K or 32K reach here (depending on immediate
size), which hits most of the global variables. The fallback Jumbo
encoding hits the rest.
Theoretically, could save 1 instruction here, but would need to add two
more reloc types to allow for:
LUI, ADD, Lx
LUI, ADD, Sx
Because annoyingly Load and Store have different displacement encodings;
and I still need the base form for other cases.
A more compact way to load/store global variables would be to use
absolute 32-bit or PC-relative addressing:
LUI + Lx/Sx : Abs32
AUIPC + Lx/Sx : PC-Rel32
Likewise, no one seems to be bothering with 64-bit ELF FDPIC for RV64
(there does seem to be some interest for ELF FDPIC but limited to 32-bit RISC-V ...). Ironically, ideas for doing FDPIC in RV aren't too far off
from PBO (namely, using GP for a global section and then chaining the sections for each binary).
Main difference being that FDPIC uses fat
function pointers and does the GP reload on the caller, vs PBO where I
use narrow function pointers and do the reload on the callee (with
load-time fixups for the PBO Offset).
The result of all this is a whole lot of unnecessary shifts and ADDs.
Seemingly, even more for BGBCC than for GCC, which already had a lot of shifts and adds.
BGBCC basically entirely dethrones the Load and Store ops ...
Possibly more so than GCC, which tended to turn most constant loads into memory loads. It would load a table of constants into a register and
then pull constants from the table, rather than compose them inline.
Say, something like:
AUIPC X18, DispHi
ADDI X18, X18, DispLo
(X18 now holds a table of constants, pointing into .rodata)
And, when it needs a constant:
LW Xn, X18, Disp //offset of the constant it wants.
Or:
LD Xn, X18, Disp //64-bit constant
Currently, BGBCC does not use this strategy.
Though, for 64-bit constants it could be more compact and faster.
But, better still would be having Jumbo prefixes or similar, or even a
SHORI instruction.
Say, 64-bit constant-load in SH-5 or similar:
xxxxyyyyzzzzwwww
MOV ImmX, Rn
SHORI ImmY, Rn
SHORI ImmZ, Rn
SHORI ImmW, Rn
Where, one loads the constant in 16-bit chunks.
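The shift-and-OR accumulation that SHORI performs can be modeled directly (a sketch of the mechanism only; the real SH-5 MOV sign-extends its immediate, which this model glosses over):

```python
# Toy model of SH-5 style constant synthesis: MOV loads the top 16-bit
# chunk, then each SHORI shifts the register left 16 and ORs in the next
# chunk. Sign-extension of the initial MOV is ignored for simplicity.

def shori_load(imm64):
    """Rebuild a 64-bit constant in four 16-bit chunks, high to low."""
    chunks = [(imm64 >> s) & 0xFFFF for s in (48, 32, 16, 0)]
    r = chunks[0]                        # MOV   ImmX, Rn
    for c in chunks[1:]:                 # SHORI ImmY/Z/W, Rn
        r = ((r << 16) | c) & (2**64 - 1)
    return r
```

Four dependent 32-bit ops for an arbitrary 64-bit constant; not pretty, but as noted below, 4 is still less than the 6 RV64 can need.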
Don't you ever snip anything ??
On 9/27/2024 10:52 AM, MitchAlsup1 wrote:
My 66000 can do:: 1 < i && i <= MAX in 1 instruction
BJX2:
CMPQGT R4, 1, R16
CMPQLT R4, (MAX+1), R17 //*1
AND R16, R17, R5
So, more than 1 instruction, but less than faking it with SLT / SLTI ...
It is better for performance though to be able to flip the output bit in
the pipeline than to need to use an XOR instruction or similar.
On 9/27/2024 2:40 PM, MitchAlsup1 wrote:
On Fri, 27 Sep 2024 18:26:28 +0000, BGB wrote:
But, generally this does still impose limits:
Can't reorder instructions across a label;
Can't move instructions with an associated reloc;
Can't reorder memory instructions unless they can be proven not to alias
(loads may be freely reordered, but the relative order of loads and
stores may not change unless provably non-aliasing);
The effectiveness of this does depend on how the C code is written
though (works favorably with larger blocks of mostly-independent expressions).
-----
Most agree it is closer to 30% than 25% {{Unless you clutter up the ISA
such that your typical memref needs a support instruction}}.
Cough, RV64...
-----
Which makes that 16% (above) into 48% and renormalizing to::
~ 63% fixed-displacement;
~ 36% register-indexed and support instructions.
Yeah.
I think there are reasons here why I am generally getting lackluster performance out of RV64...
Comparably, XG2 has a 16K or 32K reach here (depending on immediate
size), which hits most of the global variables. The fallback Jumbo
encoding hits the rest.
I get ±32K with 16-bit displacements
Baseline has special case 32-bit ops:
MOV.L (GBR, Disp10u), Rn //4K
MOV.Q (GBR, Disp10u), Rn //8K
But, in XG2, it gains 2 bits:
MOV.L (GBR, Disp12u), Rn //16K
MOV.Q (GBR, Disp12u), Rn //32K
Jumbo can encode +/- 4GB here (64-bit encoding).
MOV.L (GBR, Disp33s), Rn //+/- 4GB
MOV.Q (GBR, Disp33s), Rn //+/- 4GB
Mostly because GBR displacements are unscaled.
Plan for XG3 is that all Disp33s encodings would be unscaled.
BJX2 can also do (PC, Disp33s) in a single logical instruction...
But, RISC-V can't...
Likewise, no one seems to be bothering with 64-bit ELF FDPIC for RV64
(there does seem to be some interest for ELF FDPIC, but limited to
32-bit RISC-V ...). Ironically, ideas for doing FDPIC in RV aren't too
far off from PBO (namely, using GP for a global section and then
chaining the sections for each binary).
How are you going to do dense PIC switch() {...} in RISC-V ??
Already implemented...
With pseudo-instructions:
SUB Rs, $(MIN), R10
MOV $(MAX-MIN), R11
BGTU R11, R10, Lbl_Dfl
MOV .L0, R6 //AUIPC+ADD
SHAD R10, 2, R10 //SLLI
ADD R6, R10, R6
JMP R6 //JALR X0, X6, 0
.L0:
BRA Lbl_Case0 //JAL X0, Lbl_Case0
BRA Lbl_Case1
...
Currently, BGBCC does not use this strategy.
Though, for 64-bit constants it could be more compact and faster.
But, better still would be having Jumbo prefixes or similar, or even a
SHORI instruction.
Better Still Still is having 32-bit and 64-bit constants available
from the instruction stream and positioned in either operand position.
Granted...
Say, 64-bit constant-load in SH-5 or similar:
xxxxyyyyzzzzwwww
MOV ImmX, Rn
SHORI ImmY, Rn
SHORI ImmZ, Rn
SHORI ImmW, Rn
Where, one loads the constant in 16-bit chunks.
Yech
But, 4 is still less than 6.
On 10/1/2024 5:00 AM, Robert Finch wrote:
On 2024-09-29 10:19 p.m., BGB wrote:
The ADD is not necessary if min == 0
The JTT instruction compares Rt with 0 on the low side and max
on the high side. If Rt is out of bounds, default is selected.
The table displacements come in {B,H,W,D}, selected in the JTT
(jump through table) instruction. Rt indexes the table; the selected
entry's signed value is <<2 and added to an address which happens to be
the address of the JTT instruction + #(max+1)<<entry. {{The table is
fetched through the ICache with execute permission}}
Thus, the table is PIC; and generally 1/4 the size of typical
switch tables.
How well does JTT work with large tables? What if there are several
hundred table entries?
For Q+ indirect jump the values loaded from the table replace the low
order bits of the PC instead of being a displacement. Only {W,T,O} are
supported (W=wyde, T=tetra, O=octa). Should add an option for
displacements. Borrowed the memory indirect jump from the 68k.
On 10/5/2024 6:10 PM, MitchAlsup1 wrote:
On Thu, 3 Oct 2024 8:20:46 +0000, BGB wrote:
How well does JTT work with large tables? What if there are several
hundred table entries?
Tables can have 2^16-1 (65534) case entries.
There is no hard limit in my case, but BGBCC has generally broken up
tables larger than 256 entries.
IIRC, this was because larger tables were more likely to have "voids"
which could lead to a large number of branches to default while still
being over a 75% density threshold; splitting the table apart was more
likely to expose these voids (and make the binary smaller).
I guess a possible tweak could be, say, if the density is over 93% or
so, it will allow a table-jump regardless of size.
Though, for the most part, the programs I am testing with tend not to
have too many large switch blocks, and in the case of "switch" on a
byte (the most common large-switch case here), a limit of 256 works.
Meanwhile, for some other cases, like a switch full of TWOCC and FOURCC
values, one really does not want to use a jump table (but this is
avoided, as these cases tend to have a very low density).
For Q+ indirect jump the values loaded from the table replace the low
order bits of the PC instead of being a displacement. Only {W,T,O} are
supported (W=wyde, T=tetra, O=octa). Should add an option for
displacements. Borrowed the memory indirect jump from the 68k.
My 66000 Loads the table entry directly into IP in 1 cycle less
than LD latency.
I guess, specialized Load+Branch could potentially have less latency
than separate load+branch, or the current strategy of double-branching.
I guess also possible is to make it such that the required density (to
avoid a split) is correlated with table size.
Say (table size, minimum density):
128, 75%
256, 80%
512, 85%
1024, 90%
2048, 95%
On 10/8/2024 3:48 PM, MitchAlsup1 wrote:
On Tue, 8 Oct 2024 19:06:34 +0000, BGB wrote:
At present, there isn't anything to detect voids directly, but density
percentage is easy to compute:
(100*count)/(max-min)
To some amount, everything is heuristics.
I guess also possible is to make it such that the required density (to
avoid a split) is correlated with table size.
Say (table size, minimum density):
128, 75%
256, 80%
512, 85%
1024, 90%
2048, 95%
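Putting the density formula and the proposed thresholds together, the heuristic would look something like this. The (size, density) pairs are the example numbers from the thread, and the final 93% catch-all echoes the earlier "over 93%, allow regardless of size" idea; none of this is BGBCC's actual code:

```python
# Sketch of a size-scaled density rule for choosing a jump table over a
# branch tree. Thresholds are the thread's example numbers, not actual
# BGBCC tuning.

THRESHOLDS = ((128, 75), (256, 80), (512, 85), (1024, 90), (2048, 95))

def use_jump_table(case_values):
    """Decide jump table vs. split, from the case labels of a switch."""
    span = max(case_values) - min(case_values)
    if span == 0:
        return True                               # single case: trivial
    density = (100 * len(case_values)) // span    # (100*count)/(max-min)
    for size, min_density in THRESHOLDS:
        if span <= size:
            return density >= min_density
    return density >= 93                          # huge tables: near-full only
```

The rising threshold captures the observation that large sparse tables are the ones most likely to hide voids full of branches to default.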
I do not know of any enumeration of conditions that would be
commonly useful. Less than, equal to, greater than might be
somewhat useful for a three-way branch.
On 10/16/2024 8:59 AM, Paul A. Clayton wrote:
snip
I do not know of any enumeration of conditions that would be
commonly useful. Less than, equal to, greater than might be
somewhat useful for a three-way branch.
That was the function of the arithmetic IF statement in original
Fortran. If it were more useful, it wouldn't have been taken out of the
language long ago.
On 9/30/24 1:52 AM, MitchAlsup1 wrote:
On Sat, 28 Sep 2024 1:44:10 +0000, Paul A. Clayton wrote:[snip]
Another weird concept that came to mind would be providing an
8-bit (e.g.) field that enumerated a set of interesting
conditions.
I use a 64-bit container of conditions
An enumeration of conditions is different from a bitmask of
conditions. An enumeration could support N-way branching in a
single instruction rather than a tree of single bit-condition
branches.
My 66000's compare result has unused space for multiple such
enumerations.
I do not know of any enumeration of conditions that would be
commonly useful. Less than, equal to, greater than might be
somewhat useful for a three-way branch. Relation to zero as well
as an explicit comparison value might be useful for some multi-way
choices.
Lack of density is also a problem for multi-way branches; the
encoding will waste space if multiple enumerated states share a
target.
The concept seemed worth mentioning even if I thought it unlikely
to be practically useful.
Ironically, one of the main arguable use-cases for old Fortran-style
arithmetic IF statements is implementing the dispatch logic in a
binary-subdivided "switch()", but not enough to justify having a
dedicated instruction for it.
Say:
MOV Imm, Rt //pivot case
BLT Rt, Rx, .lbl_lo
BGT Rt, Rx, .lbl_hi
BRA .lbl_case
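The binary-subdivision scheme those four instructions implement per node can be sketched end to end (an illustrative model of the dispatch, where the return value stands in for the case label branched to):

```python
# Model of binary-subdivided switch dispatch: each loop iteration is one
# MOV/BLT/BGT/BRA node as in the sequence above, comparing the switched
# value against a pivot constant.

def dispatch(case_values, x):
    """Return the index of x in sorted case_values, or -1 for default."""
    lo, hi = 0, len(case_values) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        pivot = case_values[mid]   # MOV Imm, Rt  (pivot case)
        if x < pivot:              # branch to the low half  (.lbl_lo)
            hi = mid - 1
        elif x > pivot:            # branch to the high half (.lbl_hi)
            lo = mid + 1
        else:                      # BRA .lbl_case
            return mid
    return -1                      # fell out of the tree: default
```

With three compares-and-branches per level, a sparse switch of N cases costs about 3*log2(N) instructions on the taken path, which is what a fused three-way branch would cut down.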
On 10/16/2024 5:16 PM, MitchAlsup1 wrote:
On Wed, 16 Oct 2024 20:23:08 +0000, BGB wrote:
Ironically, one of the main arguable use-cases for old Fortran-style
arithmetic IF statements is implementing the dispatch logic in a
binary-subdivided "switch()", but not enough to justify having a
dedicated instruction for it.
Say:
MOV Imm, Rt //pivot case
BLT Rt, Rx, .lbl_lo
BGT Rt, Rx, .lbl_hi
BRA .lbl_case
With a 64-bit instruction one could do::
B3W .lbl_lo,.lbl_zero,.lbl_hi
rather straightforwardly.....
Possibly, but the harder part would be to deal with decoding and feeding
the instruction through the pipeline.
Granted, I guess it could be decoded as if it were a normal 3RI op or similar, but then split up the immediate into multiple parts in EX1.
On 10/17/2024 9:28 PM, MitchAlsup1 wrote:
Granted, I guess it could be decoded as if it were a normal 3RI op or
similar, but then split up the immediate into multiple parts in EX1.
Why would you want to make it 3×11-bit displacements when you can
make it 3×16-bit displacements?
+------+-----+-----+----------------+
| Bc | 3W | Rt | .lb_lo |
+------+-----+-----+----------------+
| .lb_zero | .lb_hi |
+------------------+----------------+
Neither BJX2 nor RISC-V have the encoding space to pull this off...
Even in a clean-slate ISA, it would be a big ask.