Not being satisfied with the current Q+ and the number of rename registers
it requires, I decided to start yet another project, this time a CPU with
only 16 GPRs. I know that fewer registers will spill to memory more
often, so I thought using explicit spill and fill instructions backed
up by appropriate buffers would help.
I found this article, which is related, suggesting ILP may be increased.
http://cva.stanford.edu/classes/ee482a/projects/project_spill.pdf
With only 16 regs, some instructions can be reduced to 24 bits.
On 2/11/2024 3:36 PM, MitchAlsup1 wrote:
> Robert Finch wrote:
>> With only 16 regs, some instructions can be reduced to 24-bits.
>
> I have compiled benchmarks where My 66000 with only 32 registers takes
> no spill/fill instructions where RISC-V takes spill/fill instructions
> even though it has 32 integer and 32 FP registers in its file. In my
> case this is down to efficient use of <FP> constants, not wasting
> instructions to LD them, and not wasting a register to temporarily hold them.

I have still not entirely eliminated spill/fill, even with 64 GPRs.
Though, this is typically more due to compiler limitations than actually running out of free registers...
I then noted in my fiddling that, with superscalar enabled, Dhrystone was faster in RV64G ("GCC -O3") than in BJX2.
Though, with more fiddling, I have noted that re-enabling the Compare+Branch
ops (with 2 input registers) and disabling stack-canary checking
(enabled by default in BGBCC) was enough to put BJX2 back in the lead (though not by a particularly large margin, namely 91k vs 88k).

> In the past I have noted that a 16 register machine with IBM-360-like
> ISA performs as if it had about 22 registers; LD-OPs performing most
> of the heavy lifting; saving registers from holding temporary and
> use-once values.

It is possible I may need to revisit this, since:
- I already have the underlying mechanism, as it is needed for the RV 'A' extension;
- The competition against RV is tighter than I would like;
- Ultimately, my project may be kinda moot if it is only slightly faster
  than RISC-V.
Though, I suspect that performance and code-density are interrelated in
this case (in particular, my compiler is still emitting a number of unnecessary instructions).
Though, I guess I still have my GLQuake port on my side.
And on the RISC-V side, the 'P' extension ironically manages to be both
less useful and also needlessly over-complicated.
Me:
  PADD.W, PSUB.W
'P':
  ADD, SUB, ADDSUB, SUBADD x Wrap/SSat/USat/SHalve/UHalve x Byte/Word
So, where I have 2 instructions, P has 40...
And, it just keeps going on and on like this...
And, it never gets to FPU-SIMD...
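
The count of 40 follows directly from the cross product listed above. A minimal sketch in Python that enumerates it; the strings are just the post's shorthand used as labels for counting, not actual RISC-V 'P' mnemonics, and the function name is hypothetical:

```python
from itertools import product

# Labels copied from the post's shorthand (assumption: they are only
# counting labels here, not real 'P' extension mnemonics).
OPS = ["ADD", "SUB", "ADDSUB", "SUBADD"]
MODES = ["Wrap", "SSat", "USat", "SHalve", "UHalve"]
WIDTHS = ["Byte", "Word"]


def p_style_variants():
    """Enumerate every op x mode x width combination."""
    return [f"{op}.{mode}.{width}"
            for op, mode, width in product(OPS, MODES, WIDTHS)]


variants = p_style_variants()
print(len(variants))  # 4 * 5 * 2 = 40
```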
Robert Finch wrote:
> I know that fewer registers will spill to memory more often, so, I
> thought using explicit spill and fill instructions backed up by
> appropriate buffers would help.

That's going to have the same problems as Sparc register windowing.
The problems happen when there is a memory reference to a register that
software thinks was spilled but is being held in the register window
that is acting as a hidden non-coherent cache.

The problem with SPARC register windows is that it slows down the register
file access because there are at least 4× as many registers in the file
as in typical RISCs. Thus, while MIPS, M88K, HP, ... all got register access
time under ½ cycle, SPARCs got 1 full cycle, slowing the pipeline or the
frequency.
EricP wrote:
> That's going to have the same problems as Sparc register windowing.
> The problems happen when there is a memory reference to a register that
> software thinks was spilled but is being held in the register window
> that is acting as a hidden non-coherent cache.

It is similar to SPARC register windows in that it provides a place to
perform spill/fill, and if that place does not "overflow" then the
STs to memory are not performed and fewer cycles are required. It is
different in how the compiler expresses spill/fill: SPARC is implicit,
that paper is explicit.
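
The "no overflow, no STs" behaviour described above can be sketched as a toy Python model. This is only an illustration under stated assumptions: the class name, the fixed `capacity`, and the oldest-first eviction policy are hypothetical and are not taken from the thread or the linked paper.

```python
from collections import OrderedDict


class SpillBuffer:
    """Toy model of a small hardware buffer that absorbs explicit spills.

    Assumptions (not from the thread): capacity >= 1 and the oldest
    entry is written back to memory when the buffer overflows.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()  # stack address -> value, oldest first
        self.memory = {}            # backing stack memory
        self.stores = 0             # STs that actually reached memory
        self.loads = 0              # LDs that actually went to memory

    def spill(self, addr, value):
        if addr in self.slots:
            self.slots.pop(addr)    # re-spill to the same slot: newest wins
        elif len(self.slots) == self.capacity:
            # Overflow: only now is a real store performed.
            old_addr, old_val = self.slots.popitem(last=False)
            self.memory[old_addr] = old_val
            self.stores += 1
        self.slots[addr] = value

    def fill(self, addr):
        if addr in self.slots:
            return self.slots[addr]  # served from the buffer, no memory access
        self.loads += 1
        return self.memory[addr]


buf = SpillBuffer(capacity=2)
buf.spill(0x100, 11)
buf.spill(0x108, 22)
print(buf.fill(0x100), buf.stores)  # 11 0 : no overflow, so no STs were issued
buf.spill(0x110, 33)                # overflow writes the oldest entry (0x100) back
print(buf.stores, buf.fill(0x100), buf.loads)  # 1 11 1
```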
MitchAlsup1 <mitchalsup@aol.com> wrote:
> It is similar to SPARC register windows in that it provides a place to
> perform spill/fill, and if that place does not "overflow" then the
> STs to memory are not performed and fewer cycles are required. It is
> different in how the compiler expresses spill/fill: SPARC is implicit,
> that paper is explicit.

The spill/restore step would still happen behind the program's
back, so there is at least some potential issue of inconsistent
memory state.

However, a clear ABI which makes sure that only local variables
which have nothing pointing to them can be spilled/restored in
this way could work. Any registers could be reclaimed when
the stack pointer is adjusted, without having to go through
the cache system.

Hmm... anything that could seriously go wrong with this?
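
The ABI rule and the reclaim-on-stack-adjust idea above can be sketched in Python as follows. All names here (`Local`, `may_use_spill_buffer`, `FrameSpillBuffer`, `adjust_sp`) are hypothetical, and the sketch assumes a downward-growing stack in which everything below the new stack pointer is dead.

```python
from dataclasses import dataclass


@dataclass
class Local:
    name: str
    address_taken: bool  # True if any pointer to this variable may exist


def may_use_spill_buffer(var):
    """ABI rule from the post: only locals that nothing points to qualify."""
    return not var.address_taken


class FrameSpillBuffer:
    """Toy buffer whose entries are dropped, never stored, when a frame dies."""

    def __init__(self):
        self.slots = {}  # stack address -> buffered value

    def spill(self, addr, value):
        self.slots[addr] = value

    def adjust_sp(self, new_sp):
        """Reclaim slots below new_sp (assumes the stack grows downward)."""
        dead = [a for a in self.slots if a < new_sp]
        for a in dead:
            del self.slots[a]  # reclaimed without touching the cache system
        return len(dead)


print(may_use_spill_buffer(Local("i", address_taken=False)))    # True
print(may_use_spill_buffer(Local("buf", address_taken=True)))   # False

fb = FrameSpillBuffer()
fb.spill(0x0FF0, 1)          # caller's slot
fb.spill(0x0FE0, 2)          # callee's slot
print(fb.adjust_sp(0x0FF0))  # 1 : the callee's slot is reclaimed on return
```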
Thomas Koenig wrote:
> The spill/restore step would still happen behind the program's
> back, so there is at least some potential issue of inconsistent
> memory state.

How so ?? If the spilled register has not reached memory, the fill
gets the non-SW-visible flip-flop data, and if it has reached memory
it gets the value in that memory. Some 3rd party reading memory
expecting a spill to be there would be problematic, but this would
be a frowned-upon programming practice and would have to be interlocked
with ATOMIC guards.
MitchAlsup1 wrote:
> Thomas Koenig wrote:
>> The spill/restore step would still happen behind the program's
>> back, so there is at least some potential issue of inconsistent
>> memory state.
>
> How so ?? If the spilled register has not reached memory, the fill
> gets the non-SW-visible flip-flop data, and if it has reached memory
> it gets the value in that memory. Some 3rd party reading memory
> expecting a spill to be there would be problematic, but this would
> be frowned upon programming practice and would have to be interlocked
> with ATOMIC guards.

Exactly, it would be problematic for a third party like an IO,
interrupts, DMA, other threads.
Or a setjmp/longjmp.
Or a nested routine that is looking backwards in the stack
(remember, the callee doesn't know if the caller has done this).

It doesn't need an atomic guard, but at a minimum it needs a non-privileged
sync stack (syncstk) instruction that flushes all pending spills
*in the privilege mode active at the time the deferred spill was performed*.
And hardware that can handle flushing deferred user mode stack spills
and associated virtual address translates and page table walks
while in kernel mode.

Then the discussion becomes where and how often does syncstk need to be used,
and are the rules for using it clear enough that it won't leave land mines
in code all over the place.

>> However, a clear ABI which makes sure that only local variables
>> which have nothing pointing to them can be spilled/restored in
>> this way could work. Any registers could be reclaimed when
>> the stack pointer is adjusted, without having to go through
>> the cache system.
>>
>> Hmm... anything that could seriously go wrong with this?

It is a hidden non-coherent cache of unknown and variable size with
manual synchronization controls that must be invoked any time
there *might* be an access by the current execution context
into some unknown prior deferred spill.

For example, every interrupt, exception, or syscall will start with
a syncstk. So the deferred cost of spilling multiple sets of multiple
registers to user mode stack will be paid at the start of every interrupt.
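
The visibility problem and the role of syncstk can be illustrated with a small Python model. This is a sketch under assumptions: `DeferredSpillCore`, `observe`, and `take_interrupt` are hypothetical names, and the cost is counted simply as one store per pending spill.

```python
class DeferredSpillCore:
    """Toy core whose spills sit in a hidden buffer until they are flushed."""

    def __init__(self):
        self.pending = {}  # addr -> value, not yet visible in memory
        self.memory = {}   # what any other observer (DMA, handler, ...) sees

    def spill(self, addr, value):
        self.pending[addr] = value

    def fill(self, addr):
        # The owning context always gets its newest value back.
        return self.pending.get(addr, self.memory.get(addr))

    def observe(self, addr):
        # A third party reads memory directly and bypasses the hidden buffer.
        return self.memory.get(addr)

    def syncstk(self):
        """Flush all pending spills; returns how many stores that cost."""
        cost = len(self.pending)
        self.memory.update(self.pending)
        self.pending.clear()
        return cost


def take_interrupt(core):
    # Hypothetical entry sequence: sync first so the handler sees a coherent stack.
    return core.syncstk()


core = DeferredSpillCore()
core.spill(0x7FF0, 42)
print(core.fill(0x7FF0))     # 42   : the spilling context is fine
print(core.observe(0x7FF0))  # None : a third party sees stale memory
print(take_interrupt(core))  # 1    : the deferred store is paid at interrupt entry
print(core.observe(0x7FF0))  # 42
```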
EricP <ThatWouldBeTelling@thevillage.com> wrote:
> Exactly, it would be problematic for a third party like an IO,
> interrupts, DMA, other threads.

Make the spills backed up by stack storage only.

> Or a setjmp/longjmp.

Not sure what is needed there.

> Or a nested routine that is looking backwards in the stack
> (remember, the callee doesn't know if the caller has done this).

Never pass a pointer to something that has been spilled. If you
do, it's an ABI violation (same as overwriting the stack
via some other pointer).

> It doesn't need an atomic guard, but at a minimum it needs a non-privileged
> sync stack (syncstk) instruction that flushes all pending spills
> *in the privilege mode active at the time the deferred spill was performed*.

Or spill to memory on privilege change.

It could also be possible to have a background task in the processor
which does the syncing (while keeping the backed-up registers).

> For example, every interrupt, exception, or syscall will start with
> a syncstk. So the deferred cost of spilling multiple sets of multiple
> registers to user mode stack will be paid at the start of every interrupt.

That cost will be non-zero, agreed. But depending on the frequency
of interrupts (and if something has already done some of the work
in the background), there might still be a net gain overall.
Thomas Koenig wrote:
After rummaging about for a while I have not been able to find the
papers that outlined all the issues with register windows (RW),
so I'll try to remember some that people have mentioned...
- It requires many more hardware registers but doesn't allow them to be
  accessed directly. Sparc required 120 physical registers but only 29
  were architecturally available to a programmer. This was a far more
  significant issue back in the 1980s when RW was first introduced.
  But still today it could double the number of physical registers.
- Sparc's fixed window size of 8 registers was considered very inefficient.
  The number of window save-sets was intended to be model specific but it
  turned out that too many algorithms wound up depending on the initial
  size of 4, so that's where it stayed.
- Sparc's RW was coupled to CALL and RET so you could not call a routine
  without creating a new 8 register window.
  A better method would support a variable size save-set that is independent
  of CALL & RET so you can call leaf routines which require no registers
  saved.
- Sparc used traps for overflow/underflow management, which made it expensive.
  Also one of the windows had to be reserved for the trap handler,
  so in practice there were only 3 save-sets.
- Kernel transitions have to save and restore some or all of the user
  mode windows so the kernel can use the windows in kernel mode, increasing
  the overhead for interrupts and exceptions.