On 9/27/2024 7:43 PM, MitchAlsup1 wrote:
On Fri, 27 Sep 2024 23:53:22 +0000, BGB wrote:
One of the reasons reservation stations became in vogue.
Possibly, but that is a CPU feature rather than a compiler feature...
Saw a video not too long ago where he was making code faster by undoing
a lot of loop unrolling, as the code was apparently losing more to I$
misses than it was gaining by being unrolled.
------------
In contrast, a jumbo prefix by itself does not make sense; its meaning
depends on the thing that is being prefixed. Also, the decoder will
decode a jumbo prefix and its suffixed instruction at the same time.
For the jumbo prefix:
Recognize that it is a jumbo prefix;
Inform the decoder for the following instruction of this fact
(via internal flag bits);
Provide the prefix's data bits to the corresponding decoder.
Unlike a "real" instruction, a jumbo prefix does not need to provide
behavior of its own, merely be able to be identified as such and to
provide payload data bits.
For now, there are not any encodings larger than 96 bits.
Partly this is because 128 bit fetch would likely add more cost and complexity than it is worth at the moment.
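The prefix handling described above might be sketched as follows. This is a toy model, not the actual BJX2 decoder; the opcode value and field widths are made up for illustration:

```python
# Toy model of jumbo-prefix handling: the prefix carries no behavior of
# its own; it only sets a flag and hands payload bits to the next decoder.
# The opcode value (0x7F) and payload layout are invented for this sketch.

JUMBO_OP = 0x7F  # hypothetical major opcode marking a jumbo prefix

def decode_pair(words):
    """Decode a fetch group; fold any jumbo prefix into the next word."""
    jumbo_flag = False
    jumbo_bits = 0
    insts = []
    for w in words:
        if (w & 0x7F) == JUMBO_OP:       # recognize that it is a prefix...
            jumbo_flag = True            # ...inform the following decoder
            jumbo_bits = w >> 7          # ...and save its payload bits
            continue
        insts.append({"word": w, "jumbo": jumbo_flag, "extra": jumbo_bits})
        jumbo_flag, jumbo_bits = False, 0
    return insts

# A prefix word followed by a normal instruction decodes as one instruction
# that sees the prefix's payload.
out = decode_pair([(0x12345 << 7) | 0x7F, 0xDEAD0000])
```

The point the sketch makes is that the prefix never reaches execute as its own instruction; it only widens the decode of its neighbor.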
Likewise, no one seems to be bothering with 64-bit ELF FDPIC for RV64
(there does seem to be some interest for ELF FDPIC, but limited to
32-bit RISC-V ...). Ironically, ideas for doing FDPIC in RV aren't too
far off from PBO (namely, using GP for a global section and then
chaining the sections for each binary).
How are you going to do dense PIC switch() {...} in RISC-V ??
Already implemented...
With pseudo-instructions:
SUB Rs, $(MIN), R10
MOV $(MAX-MIN), R11
BGTU R11, R10, Lbl_Dfl
MOV .L0, R6 //AUIPC+ADD
SHAD R10, 2, R10 //SLLI
ADD R6, R10, R6
JMP R6 //JALR X0, X6, 0
.L0:
BRA Lbl_Case0 //JAL X0, Lbl_Case0
BRA Lbl_Case1
...
Compared to::
// ADD Rt,Rswitch,#-min
JTT Rt,#max
.jttable min, ... , max, default
adder:
The ADD is not necessary if min == 0
The JTT instruction compares Rt with 0 on the low side and max
on the high side. If Rt is out of bounds, default is selected.
The table displacements come in {B,H,W,D}, selected in the JTT
(jump through table) instruction. Rt indexes the table; the selected
entry's signed value is <<2 and added to an address which happens to be
the address of the JTT instruction + #(max+1)<<entry. {{The table is
fetched through the ICache with execute permission}}
Thus, the table is PIC; and generally 1/4 the size of typical
switch tables.
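The JTT address arithmetic can be modeled roughly as below. This is a sketch from the prose, not My 66000's definition; in particular, placing the default entry in slot max+1 is my assumption from the ".jttable min, ..., max, default" layout:

```python
# Toy model of the JTT target calculation. entry_log2 selects {B,H,W,D}
# (0=byte, 1=half, 2=word, 3=double). Default-entry placement (slot max+1)
# is an assumption, not a documented fact.

def jtt_target(jtt_pc, table, rt, max_idx, entry_log2):
    """Target = (selected entry's signed value << 2) added to the address
    just past the table: JTT pc + ((max+1) << entry)."""
    idx = rt if 0 <= rt <= max_idx else max_idx + 1  # out of bounds -> default
    disp = table[idx]                                # signed table entry
    table_end = jtt_pc + ((max_idx + 1) << entry_log2)
    return table_end + (disp << 2)
```

Since every quantity is relative to the JTT instruction's own address, the table needs no relocation: it is PIC by construction.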
-----
Potentially it could be more compact.
On 9/27/24 11:52 AM, MitchAlsup1 wrote:
On Fri, 27 Sep 2024 9:46:01 +0000, BGB wrote:
[snip]
RV's selection of 3R compare ops is more limited:
RV: SLT, SLTU
BJX2: CMPEQ, CMPNE, CMPGT, CMPGE, CMPHI, CMPHS, TST, NTST
A lot of these cases require a multi-op sequence to implement
with just
SLT and SLTU.
My 66000 can do:: 1 < i && i <= MAX in 1 instruction
Did you mean "0 < i && i <= MAX" (Fortran-IN comparison result) or
"1 <= i && i <= MAX" (which is the same, for unsigned)? Or am I
missing a capability of My 66000?
Itanium comparison instructions were interesting in that the one
bit result (which was stored in two condition registers, one
storing the complement) could be ANDed or ORed with another
condition register as part of the instruction. This merging
apparently allowed some complex comparison merging to be done
in one cycle with multiple compare instructions.
Itanium's method may not be the best way of merging conditions,
but there may be some benefit from not using sequential branches
to perform this function (or SHIFT and OR/AND to merge a bit into a
comparison result). (I suppose a microarchitecture could use one
BTB entry for both branches, but detecting such seems a fair
amount of work for an uncommon case.)
Another weird concept that came to mind would be providing an
8-bit (e.g.) field that enumerated a set of interesting
conditions.
Combined with a Table Transfer instruction (which
jumps either to an indexed table position for execution or uses
the indexed table entry as a jump target address) this _might_
be useful, though I doubt there are many cases where a general
condition test could set a dense enumeration of cases that are
of interest for the specific use. Yet perhaps mentioning this
weirdness might stir someone else's *useful* creativity.
(For small local switch-like jumps, using 16-bit table entries
might be practical starting immediately after the instruction.
A 64-bit immediate would provide four targets with the next
word being a "free" target specification. 8-bit offsets might
be practical for small switches, especially if accumulated (i.e.,
as if performing a sequence of short branches); adding four
8-bit values would be fairly fast. Yes, more weirdness.)
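The accumulated-offset "weirdness" above might be modeled like this. Purely illustrative; no ISA described in the thread works this way:

```python
# Toy model of accumulated 8-bit branch offsets: entry i's target is
# wherever a chain of short branches 0..i would have landed, so the
# hardware only needs a small adder tree over at most a few bytes.

def accumulated_target(base, offsets8, idx):
    """Target for case idx, given 8-bit per-entry offsets that accumulate."""
    assert all(0 <= o < 256 for o in offsets8)   # each entry fits in a byte
    return base + sum(offsets8[:idx + 1])
```

The appeal is density: four targets fit in one 32-bit word, at the cost of each target's reach depending on the entries before it.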
On 9/29/2024 2:11 PM, MitchAlsup1 wrote:
I noticed this in 1991 when we got Mc88120 simulator up and running.
GBOoO chips are <nearly> best served when there is the smallest number
of instructions.
Looking it up, seems the CPU in question (MIPS R4300) was:
16K L1 I$ cache;
8K L1 D$ cache;
No L2 cache (but could be supported off-die);
1-wide scalar, 32 or 64 bit
Non-pipelined FPU and multiplier
How many bits does one of these jumbo prefixes consume ?
The prefix itself is 32 bits.
In the context of XG3, it supplies 23 or 27 bits.
Well, and people obsessing over what happens if an interrupt somehow
occurs "between" the prefix and the prefixed instruction.
Which, as I have tended to implement them, is simply not possible, since everything is fetched and decoded at the same time.
Granted, yes, it does add the drawback of needing to have tag-bits to remember the mode, and maybe the CPU hiding mode bits in the high order
bits of the link register and similar is not such an elegant idea.
For your implementation, yes. For all others:: <at best> maybe.
Maybe.
I could maybe consider widening fetch/decode to 128-bits if there were a compelling use case.
On 2024-09-29 10:19 p.m., BGB wrote:
On 9/29/2024 2:11 PM, MitchAlsup1 wrote:
Compared to::
// ADD Rt,Rswitch,#-min
JTT Rt,#max
.jttable min, ... , max, default
adder:
The ADD is not necessary if min == 0
The JTT instruction compares Rt with 0 on the low side and max
on the high side. If Rt is out of bounds, default is selected.
The table displacements come in {B,H,W,D}, selected in the JTT
(jump through table) instruction. Rt indexes the table; the selected
entry's signed value is <<2 and added to an address which happens to be
the address of the JTT instruction + #(max+1)<<entry. {{The table is
fetched through the ICache with execute permission}}
Thus, the table is PIC; and generally 1/4 the size of typical
switch tables.
How well does JTT work with large tables? What if there are several
hundred table entries?
For Q+ indirect jump the values loaded from the table replace the low
order bits of the PC instead of being a displacement. Only {W,T,O} are
supported (W=wyde, T=tetra, O=octa). Should add an option for
displacements. Borrowed the memory indirect jump from the 68k.
Had recently been working on getting BGBCC to target RV64G.
Array Load/Store:
  XG2: 1 instruction
  M66: 1 instruction (anywhere in 64-bit memory)
  RV64: 3 instructions
Global Variable:
  XG2: 1 instruction (if within 2K of GBR)
  M66: 0 instructions
  RV64: 1 or 4 instructions
Constant Load into register (not R5):
  XG2: 1 instruction
  M66: 1 instruction
  RV64: ~ 1-6 instructions
Operator with 32-bit immediate:
  BJX2: 1 instruction
  M66: 1 instruction
  RV64: 3 instructions.
Operator with 64-bit immediate:
BJX2: 1 instruction;
RV64: 4-9 instructions.
Floating point is still a bit of a hack, as it is currently implemented
by shuffling values between GPRs and FPRs, but sorta works.
RV's selection of 3R compare ops is more limited:
RV: SLT, SLTU
BJX2: CMPEQ, CMPNE, CMPGT, CMPGE, CMPHI, CMPHS, TST, NTST
A lot of these cases require a multi-op sequence to implement with just
SLT and SLTU.
....
On 9/27/2024 7:50 AM, Robert Finch wrote:
On 2024-09-27 5:46 a.m., BGB wrote:
---------
But, BJX2 does not spam the ADD instruction quite so hard, so it is more
forgiving of latency. In this case, an optimization that reduces the
common-case ADD to 1 cycle was being used (it only works in the CPU core
if the operands are both in signed 32-bit range and no overflow occurs;
IIRC optionally using a sign-extended AGU output as a stopgap ALU output
before the result arrives from the main ALU the next cycle).
Comparably, it appears BGBCC leans more heavily into ADD and SLLI than
GCC does, with a fair chunk of the total instructions executed being
these two (more cycles are spent adding and shifting than doing memory
load or store...).
That seems to be a bit off. Mem ops are usually around 1/4 of
instructions. Spending more than 25% on adds and shifts seems like a
lot. Is it address calcs? Register loads of immediates?
It is both...
In BJX2, the dominant instruction tends to be memory Load.
Typical output from BGBCC for Doom is (at runtime):
~ 70% fixed-displacement;
~ 30% register-indexed.
Static output differs slightly:
~ 84% fixed-displacement;
~ 16% register-indexed.
RV64G lacks register-indexed addressing, only having fixed displacement.
If you need to do a register-indexed load in RV64:
SLLI X5, Xo, 2 //shift by size of index
ADD X5, Xm, X5 //add base and index
LW Xn, X5, 0 //do the load
This case is bad...
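The lowering difference being described can be sketched as a tiny pseudo-assembly emitter. The mnemonics are illustrative (the BJX2 form in particular is schematic, not exact syntax); the point is the 1-vs-3 instruction count for the same `dst = mem[base + (index << scale)]` access:

```python
# Sketch: lowering a register-indexed load on an ISA with an indexed
# addressing mode vs. RV64G (base+displacement only). Mnemonics are
# illustrative pseudo-assembly, not exact BJX2 or RV assembler syntax.

def lower_indexed_load(dst, base, index, scale_log2, has_indexed):
    """Emit instructions for: dst = mem[base + (index << scale_log2)]."""
    if has_indexed:                               # e.g. BJX2/XG2 style
        return [f"MOV.L ({base}, {index}), {dst}"]
    return [f"SLLI X5, {index}, {scale_log2}",    # shift by size of index
            f"ADD  X5, {base}, X5",               # add base and index
            f"LW   {dst}, X5, 0"]                 # do the load

rv_seq  = lower_indexed_load("Xn", "Xm", "Xo", 2, has_indexed=False)
xg2_seq = lower_indexed_load("Rn", "Rm", "Ro", 2, has_indexed=True)
```

Every array access in a hot loop pays this 3x cost on RV64G unless the compiler can strength-reduce the index into a moving pointer.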
Also global variables outside the 2kB window:
LUI X5, DispHi
ADDI X5, X5, DispLo
ADD X5, GP, X5
LW Xn, X5, 0
Where, sorting global variables by usage priority gives:
~ 35%: in range
~ 65%: not in range
Comparably, XG2 has a 16K or 32K reach here (depending on immediate
size), which hits most of the global variables. The fallback Jumbo
encoding hits the rest.
Theoretically, could save 1 instruction here, but would need to add two
more reloc types to allow for:
LUI, ADD, Lx
LUI, ADD, Sx
Because annoyingly Load and Store have different displacement encodings;
and I still need the base form for other cases.
A more compact way to load/store global variables would be to use
absolute 32-bit or PC-relative addressing:
LUI + Lx/Sx : Abs32
AUIPC + Lx/Sx : PC-Rel32
Likewise, no one seems to be bothering with 64-bit ELF FDPIC for RV64
(there does seem to be some interest for ELF FDPIC but limited to 32-bit RISC-V ...). Ironically, ideas for doing FDPIC in RV aren't too far off
from PBO (namely, using GP for a global section and then chaining the sections for each binary).
Main difference being that FDPIC uses fat
function pointers and does the GP reload on the caller, vs PBO where I
use narrow function pointers and do the reload on the callee (with
load-time fixups for the PBO Offset).
The result of all this is a whole lot of unnecessary shifts and ADDs.
Seemingly, even more for BGBCC than for GCC, which already had a lot of shifts and adds.
BGBCC basically entirely dethrones the Load and Store ops ...
Possibly more so than GCC, which tended to turn most constant loads into memory loads. It would load a table of constants into a register and
then pull constants from the table, rather than compose them inline.
Say, something like:
AUIPC X18, DispHi
ADDI X18, X18, DispLo
(X18 now holds a table of constants, pointing into .rodata)
And, when it needs a constant:
LW Xn, X18, Disp //offset of the constant it wants.
Or:
LD Xn, X18, Disp //64-bit constant
Currently, BGBCC does not use this strategy.
Though, for 64-bit constants it could be more compact and faster.
But, better still would be having Jumbo prefixes or similar, or even a
SHORI instruction.
Say, 64-bit constant-load in SH-5 or similar:
xxxxyyyyzzzzwwww
MOV ImmX, Rn
SHORI ImmY, Rn
SHORI ImmZ, Rn
SHORI ImmW, Rn
Where, one loads the constant in 16-bit chunks.
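The shift-and-OR accumulation that SHORI performs can be modeled directly (a sketch of the mechanism only; the real SH-5 MOV sign-extends its immediate, which this model glosses over):

```python
# Toy model of SH-5 style constant synthesis: MOV loads the top 16-bit
# chunk, then each SHORI shifts the register left 16 and ORs in the next
# chunk. Sign-extension of the initial MOV is ignored for simplicity.

def shori_load(imm64):
    """Rebuild a 64-bit constant in four 16-bit chunks, high to low."""
    chunks = [(imm64 >> s) & 0xFFFF for s in (48, 32, 16, 0)]
    r = chunks[0]                        # MOV   ImmX, Rn
    for c in chunks[1:]:                 # SHORI ImmY/Z/W, Rn
        r = ((r << 16) | c) & (2**64 - 1)
    return r
```

Four dependent 32-bit ops for an arbitrary 64-bit constant; not pretty, but as noted below, 4 is still less than the 6 RV64 can need.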
Don't you ever snip anything ??
On 9/27/2024 10:52 AM, MitchAlsup1 wrote:
My 66000 can do:: 1 < i && i <= MAX in 1 instruction
BJX2:
CMPQGT R4, 1, R16
CMPQLT R4, (MAX+1), R17 //*1
AND R16, R17, R5
So, more than 1 instruction, but less than faking it with SLT / SLTI ...
It is better for performance though to be able to flip the output bit in
the pipeline than to need to use an XOR instruction or similar.
On 9/27/2024 2:40 PM, MitchAlsup1 wrote:
On Fri, 27 Sep 2024 18:26:28 +0000, BGB wrote:
But, generally this does still impose limits:
Can't reorder instructions across a label;
Can't move instructions with an associated reloc;
Can't reorder memory instructions unless they can be proven not to alias
(loads may be freely reordered, but the relative order of loads and
stores may not change unless provably non-aliasing);
The effectiveness of this does depend on how the C code is written
though (works favorably with larger blocks of mostly-independent expressions).
-----
Most agree it is closer to 30% than 25% {{Unless you clutter up the ISA
such that your typical memref needs a support instruction}}.
Cough, RV64...
-----
Which makes that 16% (above) into 48% and renormalizing to::
~ 63% fixed-displacement;
~ 36% register-indexed and support instructions.
Yeah.
I think there are reasons here why I am generally getting lackluster performance out of RV64...
Comparably, XG2 has a 16K or 32K reach here (depending on immediate
size), which hits most of the global variables. The fallback Jumbo
encoding hits the rest.
I get ±32K with 16-bit displacements
Baseline has special case 32-bit ops:
MOV.L (GBR, Disp10u), Rn //4K
MOV.Q (GBR, Disp10u), Rn //8K
But, in XG2, it gains 2 bits:
MOV.L (GBR, Disp12u), Rn //16K
MOV.Q (GBR, Disp12u), Rn //32K
Jumbo can encode +/- 4GB here (64-bit encoding).
MOV.L (GBR, Disp33s), Rn //+/- 4GB
MOV.Q (GBR, Disp33s), Rn //+/- 4GB
Mostly because GBR displacements are unscaled.
Plan for XG3 is that all Disp33s encodings would be unscaled.
BJX2 can also do (PC, Disp33s) in a single logical instruction...
But, RISC-V can't...
Likewise, no one seems to be bothering with 64-bit ELF FDPIC for RV64
(there does seem to be some interest for ELF FDPIC, but limited to
32-bit RISC-V ...). Ironically, ideas for doing FDPIC in RV aren't too
far off from PBO (namely, using GP for a global section and then
chaining the sections for each binary).
How are you going to do dense PIC switch() {...} in RISC-V ??
Already implemented...
With pseudo-instructions:
SUB Rs, $(MIN), R10
MOV $(MAX-MIN), R11
BGTU R11, R10, Lbl_Dfl
MOV .L0, R6 //AUIPC+ADD
SHAD R10, 2, R10 //SLLI
ADD R6, R10, R6
JMP R6 //JALR X0, X6, 0
.L0:
BRA Lbl_Case0 //JAL X0, Lbl_Case0
BRA Lbl_Case1
...
Currently, BGBCC does not use this strategy.
Though, for 64-bit constants it could be more compact and faster.
But, better still would be having Jumbo prefixes or similar, or even a
SHORI instruction.
Better Still Still is having 32-bit and 64-bit constants available
from the instruction stream and positioned in either operand position.
Granted...
Say, 64-bit constant-load in SH-5 or similar:
xxxxyyyyzzzzwwww
MOV ImmX, Rn
SHORI ImmY, Rn
SHORI ImmZ, Rn
SHORI ImmW, Rn
Where, one loads the constant in 16-bit chunks.
Yech
But, 4 is still less than 6.
On 10/1/2024 5:00 AM, Robert Finch wrote:
On 2024-09-29 10:19 p.m., BGB wrote:
The ADD is not necessary if min == 0
The JTT instruction compares Rt with 0 on the low side and max
on the high side. If Rt is out of bounds, default is selected.
The table displacements come in {B,H,W,D}, selected in the JTT
(jump through table) instruction. Rt indexes the table; the selected
entry's signed value is <<2 and added to an address which happens to be
the address of the JTT instruction + #(max+1)<<entry. {{The table is
fetched through the ICache with execute permission}}
Thus, the table is PIC; and generally 1/4 the size of typical
switch tables.
How well does JTT work with large tables? What if there are several
hundred table entries?
For Q+ indirect jump the values loaded from the table replace the low
order bits of the PC instead of being a displacement. Only {W,T,O} are
supported (W=wyde, T=tetra, O=octa). Should add an option for
displacements. Borrowed the memory indirect jump from the 68k.
On 10/5/2024 6:10 PM, MitchAlsup1 wrote:
On Thu, 3 Oct 2024 8:20:46 +0000, BGB wrote:
How well does JTT work with large tables? What if there are several
hundred table entries?
Tables can have 2^16-1 (65534) case entries.
There is no hard limit in my case, but BGBCC has generally broken up
tables larger than 256 entries.
IIRC, this was because larger tables were more likely to have "voids"
which could lead to a large number of branches to default while still
being over a 75% density threshold; splitting the table apart was more
likely to expose these voids (and make the binary smaller).
I guess a possible tweak could be, say, if the density is over 93% or
so, it will allow a table-jump regardless of size.
Though, for the most part, the programs I am testing with tend not to
have too many large switch blocks, and in the case of "switch" on a
byte (the most common large-switch case here), a limit of 256 works.
Meanwhile, for some other cases, like a switch full of TWOCC and FOURCC
values, one really does not want to use a jump table (but this is
avoided, as these cases tend to have a very low density).
For Q+ indirect jump the values loaded from the table replace the low
order bits of the PC instead of being a displacement. Only {W,T,O} are
supported (W=wyde, T=tetra, O=octa). Should add an option for
displacements. Borrowed the memory indirect jump from the 68k.
My 66000 Loads the table entry directly into IP in 1 cycle less
than LD latency.
I guess, specialized Load+Branch could potentially have less latency
than separate load+branch, or the current strategy of double-branching.
I guess also possible is to make it such that the required density (to
avoid a split) is correlated with table size.
Say (table size, minimum density):
128, 75%
256, 80%
512, 85%
1024, 90%
2048, 95%
On 10/8/2024 3:48 PM, MitchAlsup1 wrote:
On Tue, 8 Oct 2024 19:06:34 +0000, BGB wrote:
At present, there isn't anything to detect voids directly, but density
percentage is easy to compute:
(100*count)/(max-min)
To some amount, everything is heuristics.
I guess also possible is to make it such that the required density (to
avoid a split) is correlated with table size.
Say (table size, minimum density):
128, 75%
256, 80%
512, 85%
1024, 90%
2048, 95%
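Putting the density formula and the proposed thresholds together, the heuristic would look something like this. The (size, density) pairs are the example numbers from the thread, and the final 93% catch-all echoes the earlier "over 93%, allow regardless of size" idea; none of this is BGBCC's actual code:

```python
# Sketch of a size-scaled density rule for choosing a jump table over a
# branch tree. Thresholds are the thread's example numbers, not actual
# BGBCC tuning.

THRESHOLDS = ((128, 75), (256, 80), (512, 85), (1024, 90), (2048, 95))

def use_jump_table(case_values):
    """Decide jump table vs. split, from the case labels of a switch."""
    span = max(case_values) - min(case_values)
    if span == 0:
        return True                               # single case: trivial
    density = (100 * len(case_values)) // span    # (100*count)/(max-min)
    for size, min_density in THRESHOLDS:
        if span <= size:
            return density >= min_density
    return density >= 93                          # huge tables: near-full only
```

The rising threshold captures the observation that large sparse tables are the ones most likely to hide voids full of branches to default.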
I do not know of any enumeration of conditions that would be
commonly useful. Less than, equal to, greater than might be
somewhat useful for a three-way branch.
On 10/16/2024 8:59 AM, Paul A. Clayton wrote:
snip
I do not know of any enumeration of conditions that would be
commonly useful. Less than, equal to, greater than might be
somewhat useful for a three-way branch.
That was the function of the arithmetic IF statement in original
Fortran. If it were more useful, it wouldn't have been taken out of the
language long ago.
On 9/30/24 1:52 AM, MitchAlsup1 wrote:
On Sat, 28 Sep 2024 1:44:10 +0000, Paul A. Clayton wrote:[snip]
Another weird concept that came to mind would be providing an
8-bit (e.g.) field that enumerated a set of interesting
conditions.
I use a 64-bit container of conditions
An enumeration of conditions is different from a bitmask of
conditions. An enumeration could support N-way branching in a
single instruction rather than a tree of single bit-condition
branches.
My 66000's compare result has unused space for multiple such
enumerations.
I do not know of any enumeration of conditions that would be
commonly useful. Less than, equal to, greater than might be
somewhat useful for a three-way branch. Relation to zero as well
as an explicit comparison value might be useful for some multi-way
choices.
Lack of density is also a problem for multi-way branches; the
encoding will waste space if multiple enumerated states share a
target.
The concept seemed worth mentioning even if I thought it unlikely
to be practically useful.
Ironically, one of the main arguable use-cases for old Fortran-style
arithmetic IF statements is implementing the dispatch logic in a
binary-subdivided "switch()", but not enough to justify having a
dedicated instruction for it.
Say:
MOV Imm, Rt //pivot case
BLT Rt, Rx, .lbl_lo
BGT Rt, Rx, .lbl_hi
BRA .lbl_case
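The binary-subdivision scheme those four instructions implement per node can be sketched end to end (an illustrative model of the dispatch, where the return value stands in for the case label branched to):

```python
# Model of binary-subdivided switch dispatch: each loop iteration is one
# MOV/BLT/BGT/BRA node as in the sequence above, comparing the switched
# value against a pivot constant.

def dispatch(case_values, x):
    """Return the index of x in sorted case_values, or -1 for default."""
    lo, hi = 0, len(case_values) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        pivot = case_values[mid]   # MOV Imm, Rt  (pivot case)
        if x < pivot:              # branch to the low half  (.lbl_lo)
            hi = mid - 1
        elif x > pivot:            # branch to the high half (.lbl_hi)
            lo = mid + 1
        else:                      # BRA .lbl_case
            return mid
    return -1                      # fell out of the tree: default
```

With three compares-and-branches per level, a sparse switch of N cases costs about 3*log2(N) instructions on the taken path, which is what a fused three-way branch would cut down.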
On 10/16/2024 5:16 PM, MitchAlsup1 wrote:
On Wed, 16 Oct 2024 20:23:08 +0000, BGB wrote:
Ironically, one of the main arguable use-cases for old Fortran-style
arithmetic IF statements is implementing the dispatch logic in a
binary-subdivided "switch()", but not enough to justify having a
dedicated instruction for it.
Say:
MOV Imm, Rt //pivot case
BLT Rt, Rx, .lbl_lo
BGT Rt, Rx, .lbl_hi
BRA .lbl_case
With a 64-bit instruction one could do::
B3W .lbl_lo,.lbl_zero,.lbl_hi
rather straightforwardly.....
Possibly, but the harder part would be to deal with decoding and feeding
the instruction through the pipeline.
Granted, I guess it could be decoded as if it were a normal 3RI op or similar, but then split up the immediate into multiple parts in EX1.
On 10/17/2024 9:28 PM, MitchAlsup1 wrote:
Granted, I guess it could be decoded as if it were a normal 3RI op or
similar, but then split up the immediate into multiple parts in EX1.
Why would you want to make it 3×11-bit displacements when you can
make it 3×16-bit displacements?
+------+-----+-----+----------------+
| Bc | 3W | Rt | .lb_lo |
+------+-----+-----+----------------+
| .lb_zero | .lb_hi |
+------------------+----------------+
Neither BJX2 nor RISC-V have the encoding space to pull this off...
Even in a clean-slate ISA, it would be a big ask.