At the moment, only a link to Concertina III is present on my
main page; there is no content there yet.
I've gone to 15-bit displacements, in order to avoid compromising
addressing modes, while allowing 16-bit instructions without
switching to an alternate instruction set.
Possibly using only three base registers is also sufficiently
non-violent to the addressing modes that I should have done that
instead, so I will likely give consideration to that option in
the days ahead.
Packing and unpacking DFP numbers does not take a lot of logic, assuming
one of the common DPD packing methods.
The number of registers holding
DFP values could be doubled if they were unpacked and repacked for each operation.
DFP arithmetic has a high latency anyway; in Q+, for example,
the DFP unit unpacks, performs the operation, then repacks the DFP
number. So, registers only need to be 128-bit.
256 bits seems a little narrow for a vector register.
I have seen
several other architectures with vector registers supporting 16+ 32-bit values, or a length of 512 bits. This is also the width of a typical
cache line.
Having the base register implicitly encoded in the instruction is a way
to reduce the number of bits used to represent the base register.
There
seem to be a lot of different base register usages. Won't that make
the compiler more difficult to write?
Does array addressing mode have memory indirect addressing? It seems
like a complex mode to support.
Block headers are tricky to use. They need to follow the output of the instructions in the assembler so that the assembler has time to generate
the appropriate bits for the header. The entire instruction block needs
to be flushed at the end of a function.
So in Concertina II, I had added a new addressing mode which
simply uses the same feature that allows immediate values to
tack a 64-bit absolute address on to an instruction. (Since it
looks like a 64-bit number, the linking loader can relocate it.)
That fancy feature, though, was too much complication for this
stripped-down ISA.
Agreed. Would not be in favor of block-headers or block structuring.
Linear instruction formats are preferable, preferably in 32-bit chunks.
On 1/23/2024 6:06 AM, Robert Finch wrote:
IME, the main address modes are:
(Rm, Disp) // ~ 66% +/- 10%
(Rm, Ro*FixSc) // ~ 33% +/- 10%
Where: FixSc matches the element size.
Pretty much everything else falls into the noise.
RISC-V only has the former, but kinda shoots itself in the foot:
GCC is good at eliminating most SP-relative loads/stores,
which means the nominal percentage of indexed accesses is even higher...
As a result, the code is basically left doing excessive amounts of
shifts and adds, which (vs BJX2) effectively dethrone the memory
load/store ops for top-place.
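The cost of lacking a scaled-index mode can be seen with a trivial array loop; a minimal C sketch (the function name is made up for illustration):

```c
#include <stdint.h>

/* Sum an array of 32-bit elements. On an ISA with an (Rm, Ro*FixSc)
 * mode, each element access can be a single indexed load; with only
 * (Rm, Disp), the compiler must emit a shift (i*4) and an add to form
 * the address before every load, unless it can strength-reduce the
 * loop to pointer increments. */
int64_t sum_i32(const int32_t *arr, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += arr[i];  /* indexed: 1 load; base+disp only: shift+add+load */
    return acc;
}
```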
Likewise, the moment one exceeds 12 bits on much of anything, RISC-V
also shoots itself in the foot. Because, not only has one hit the limits
of the ALU and LD/ST ops, there are no cheap fallbacks for intermediate
range constants.
If my compiler, with its arguably poor optimizer and barely functional register allocation, is beating GCC for performance (when targeting
RISC-V), I don't really consider this a win for some of RISC-V's design choices.
And, if GCC in its great wisdom, is mostly loading constants from memory (having apparently offloaded most of them into the ".data" section),
this is also not a good sign.
Also, needing to use shift-pairs to sign and zero extend things is a bit
weak as well, ...
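For reference, the shift-pair idiom in question, sketched in C for a 16-bit field in a 64-bit register (relying on the usual arithmetic-right-shift behavior of mainstream compilers):

```c
#include <stdint.h>

/* Sign-extend the low 16 bits: shift left so the field's sign bit
 * lands at bit 63, then arithmetic-shift right. Two instructions on
 * an ISA without a dedicated sign-extend op. */
int64_t sext16(uint64_t x)
{
    return (int64_t)(x << 48) >> 48;
}

/* Zero-extend the low 16 bits: two logical shifts (or a single AND
 * when the mask constant is cheap to materialize). */
uint64_t zext16(uint64_t x)
{
    return (x << 48) >> 48;
}
```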
Also, as a random annoyance, RISC-V's instruction layout is very
difficult to decipher from a hexadecimal view. One basically needs to
dump it in binary to make it viable to mentally parse and lookup instructions, which sucks.
The absolute array addresses are
available without block structure.
On 1/23/2024 4:11 PM, Brian G. Lucas wrote:
On 1/23/24 16:10, MitchAlsup1 wrote:
When you benchmark against a strawman, cows get to eat.
Not a farm boy I'll bet. Cows eat hay, but not straw.
https://en.wikipedia.org/wiki/Nord_and_Bert_Couldn%27t_Make_Head_or_Tail_of_It
On 1/23/2024 4:10 PM, MitchAlsup1 wrote:
Likewise, the moment one exceeds 12 bits on much of anything, RISC-V
also shoots itself in the foot. Because, not only has one hit the
limits of the ALU and LD/ST ops, there are no cheap fallbacks for
intermediate range constants.
My 66000 has constants of all sizes for all instructions.
And, if GCC in its great wisdom, is mostly loading constants from
memory (having apparently offloaded most of them into the ".data"
section), this is also not a good sign.
Loading constants:
a) pollutes the data cache
b) wastes energy
c) wastes instructions
Yes.
But, I guess it does improve code density in this case... Because the constants are "somewhere else" and thus don't contribute to the size of '.text'; the program just puts a few kB worth of constants into '.data' instead...
Does make the code density slightly less impressive.
Granted, one can argue the same of prolog/epilog compression in my case:
Save some space on prolog/epilog by calling or branching to prior
versions (since the code to save and restore GPRs is fairly repetitive).
On 1/24/2024 2:23 PM, MitchAlsup1 wrote:
BGB wrote:
Granted, one can argue the same of prolog/epilog compression in my case:
Save some space on prolog/epilog by calling or branching to prior
versions (since the code to save and restore GPRs is fairly repetitive).
ENTER and EXIT eliminate the additional control transfers and can allow
FETCH of the return address to start before the restores are finished.
Possible, but branches are cheaper to implement in hardware, and would
have been implemented already...
Granted, it is a similar thing to the recent addition of a memcpy()
slide for intermediate-sized memcpy.
Where, if one expresses the slide in reverse order, copying any multiple
of N bytes can be expressed as a branch into the slide (with less
overhead than a loop).
But, I guess in theory, the memcpy slide could be implemented in plain C
with a switch.
uint64_t *dst, *src;
uint64_t li0, li1, li2, li3;
... copy final bytes ...
switch(sz>>5)
{
    ...
    case 2:
        li0=src[4]; li1=src[5];
        li2=src[6]; li3=src[7];
        dst[4]=li0; dst[5]=li1;
        dst[6]=li2; dst[7]=li3;
        /* fall through */
    case 1:
        li0=src[0]; li1=src[1];
        li2=src[2]; li3=src[3];
        dst[0]=li0; dst[1]=li1;
        dst[2]=li2; dst[3]=li3;
        /* fall through */
    case 0:
        break;
}
Like, in theory one could have a special hardware feature, but a plain software solution is reasonably effective.
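For completeness, a self-contained sketch of such a switch-based slide (the function name is made up; this version only handles sizes that are multiples of 32 bytes up to 64, where the full version would extend the case ladder upward):

```c
#include <stddef.h>
#include <stdint.h>

/* Copy sz bytes (sz a multiple of 32, 0..64 in this sketch) using a
 * fall-through switch: each case copies one 32-byte block and falls
 * through into the cases below it, so entering at case N copies N
 * blocks in straight-line code with no loop overhead. */
void memcpy_slide(void *dstp, const void *srcp, size_t sz)
{
    uint64_t *dst = dstp;
    const uint64_t *src = srcp;
    uint64_t li0, li1, li2, li3;

    switch (sz >> 5)
    {
    case 2:
        li0 = src[4]; li1 = src[5];
        li2 = src[6]; li3 = src[7];
        dst[4] = li0; dst[5] = li1;
        dst[6] = li2; dst[7] = li3;
        /* fall through */
    case 1:
        li0 = src[0]; li1 = src[1];
        li2 = src[2]; li3 = src[3];
        dst[0] = li0; dst[1] = li1;
        dst[2] = li2; dst[3] = li3;
        /* fall through */
    case 0:
        break;
    }
}
```

Note the blocks are copied highest-first, matching the reverse-order layout described above: the entry point selects how many blocks remain, and everything below it executes unconditionally.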
On 1/25/2024 11:26 AM, MitchAlsup1 wrote:
BGB wrote:
On 1/24/2024 2:23 PM, MitchAlsup1 wrote:
BGB wrote:
Granted, one can argue the same of prolog/epilog compression in my
case:
Save some space on prolog/epilog by calling or branching to prior
versions (since the code to save and restore GPRs is fairly
repetitive).
ENTER and EXIT eliminate the additional control transfers and can allow
FETCH of the return address to start before the restores are finished.
Possible, but branches are cheaper to implement in hardware, and would
have been implemented already...
Are you intentionally misreading what I wrote ??
?? I don't understand.
Epilogue is a sequence of loads leading to a jump to the return address.
Your ISA cannot jump to the return address while performing the loads
so FETCH does not get the return address and can't start fetching
instructions until the jump is performed.
You can put the load for the return address before the other loads.
Then, if the epilog is long enough (so that this load is no longer in
flight once it hits the final jump), the branch-predictor will lead it
to start loading the post-return instructions before the jump is reached.
This is likely a non-issue as I see it.
It is only really an issue if one demands that reloading the return
address be done as one of the final instructions in the epilog, and not
one of the first instructions.
Granted, one would have to do it as one of the final ops, if it were implemented as a slide, but it is not. There are "practical reasons" why
a slide would not be a workable strategy in this case.
So, generally, these parts of the prolog/epilog sequences are emitted
for every combination of saved/restored registers that had been encountered.
Though, granted, when used, it does mean that any such function effectively
needs two sets of stack-pointer adjustments:
One set for the save/restore area (in the reused part);
One set for the function itself (for its data and local/temporary variables
and similar).
Because the entire Epilogue is encapsulated in EXIT, My 66000 can LD
the return address from the stack and fetch the instructions at the
return address while still loading the preserved registers (that were
saved) so that the instructions are ready for execution by the time
the last LD is performed.
In addition, If one is performing an EXIT and fetch runs into a CALL;
it can fetch the Called address and if there is an ENTER instruction
there, it can cancel the remainder of EXIT and cancel some of ENTER
because the preserved registers are already on the stack where they are
supposed to be.
Doing these with STs and LDs cannot save those cycles.
I don't see why not, the branch-predictor can still do its thing
regardless of whether or not LD/ST ops were used.
I have indeed decided that using three base registers for the
basic load-store instructions is much preferable to shortening the
length of the displacement even by one bit.
On 2024-01-25 4:25 p.m., MitchAlsup1 wrote:
Whereas:
funct2:
ENTER R25,R0,stackArea2
...
funct1:
...
EXIT R21,R0,stackArea1
will have registers R0,R25..R30 in the same positions on the stack
guaranteed by ISA definition!!
I like the ENTER / EXIT instructions and safe-stack idea, and have incorporated them into Q+ as ENTER and LEAVE; EXIT makes me think of program exit(). They can improve code density. I gather that the stack
used for ENTER and EXIT is not the same stack as is available for the
rest of the app. This means managing two stack pointers, the regular
stack and the safe stack. Q+ could have the safe stack pointer as a
register that is not even accessible by the app and not part of the GPR
file.
For ENTER/LEAVE, Q+ has the number of registers to save specified as a four-bit number and saves only the saved registers, link register, and
frame pointer according to the ABI. So, “ENTER 3,64” will save s0 to s2, the frame pointer, and link register, and allocate 64 bytes plus the return
block on the stack. The return block contains the frame pointer, link register, and two slots that are zeroed out, intended for exception
handlers. The saved registers are limited to s0 to s9.
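As described above, the return block could be pictured as something like the following struct (the type and field names are just for illustration):

```c
#include <stdint.h>

/* Hypothetical layout of the Q+ return block pushed by ENTER:
 * frame pointer, link register, and two zeroed slots reserved for
 * exception handlers. "ENTER 3,64" would push this block, save
 * s0..s2, and then allocate a further 64 bytes of stack. */
typedef struct {
    uint64_t frame_ptr;  /* caller's frame pointer */
    uint64_t link_reg;   /* return address */
    uint64_t exc_slot0;  /* zeroed, for exception handlers */
    uint64_t exc_slot1;  /* zeroed, for exception handlers */
} qplus_return_block;
```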
Q+ also has a PUSHA / POPA instructions to push or pop all the
registers, meant for interrupt handlers. PUSH and POP instructions by themselves can push or pop up to five registers.
Some thought has been given towards modifying ENTER and LEAVE to support interrupt handlers, rather than have separate PUSHA / POPA instructions. ENTER 15,0 would save all the registers, and LEAVE 15,0 would restore
them all and return using an interrupt return.
On 2024-01-26 11:10 p.m., BGB wrote:
On 1/26/2024 10:58 AM, Robert Finch wrote:
<snip>
Admittedly, it can make sense for an ISA intended for higher-end
hardware, but not necessarily something intended to aim for similar
hardware costs to something like an in-order RISC-V core.
Once there is micro-code or a state machine to handle an instruction
with multiple micro-ops, it is not that costly to add other operations.
The Q+ micro-code costs something like < 1k LUTs. Many early micros used micro-code.
<snip>
Q+ uses a 128-bit system bus; the bus tag is not the same tag as used for
the cache. Q+ burst-loads the cache with 4 128-bit accesses for 512 bits,
and the 64B cache line is tagged with a single tag. The instruction /
data cache controller takes care of adjusting the bus size between the
cache and system.
I think I suggested this before, and the idea got shot down, but I
cannot find the post. It is mystery operations where the opcode comes
from a register value. I was thinking of adding an instruction modifier
to do this. The instruction modifier would supply the opcode bits for
the next instruction from a register value. This would only be applied
to specific classes of instructions. In particular register-register
operate instructions. Many of the register-register functions are not
decoded until execute time. The function code is simply copied to the execution unit. It does not have to run through the decode and rename
stage. I think this field could easily come from a register. Seems like
it would be easy to update the opcode while the instruction is sitting
in the reorder buffer.