• Re: Spill and Fill Buffers

    From MitchAlsup1@21:1/5 to Robert Finch on Sun Feb 11 21:36:41 2024
    Robert Finch wrote:

    Not being satisfied with current Q+ and the number of rename registers required I decide to start yet another project, this time a CPU with
    only 16 GPRs. I know that fewer registers will spill to memory more
    often, so, I thought using explicit spill and fill instructions backed
    up by appropriate buffers would help.
    I found this article, which is related, suggesting ILP may be increased.

    http://cva.stanford.edu/classes/ee482a/projects/project_spill.pdf

    Interesting

    With only 16 regs, some instructions can be reduced to 24-bits.

    I have compiled benchmarks where My 66000 with only 32 registers takes
    no spill/fill instructions where RISC-V takes spill/fill instructions
    even though it has 32 integer and 32 FP registers in its file. In my
    case this is down to efficient use of <FP> constants, not wasting
    instructions to LD them, and not wasting a register to temporarily hold
    them.

    In the past I have noted that a 16 register machine with IBM-360-like
    ISA performs as if it had about 22 registers; LD-OPs performing most
    of the heavy lifting; saving registers from holding temporary and use
    once values.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Robert Finch on Mon Feb 12 13:03:37 2024
    Robert Finch wrote:
    Not being satisfied with current Q+ and the number of rename registers required I decide to start yet another project, this time a CPU with
    only 16 GPRs. I know that fewer registers will spill to memory more
    often, so, I thought using explicit spill and fill instructions backed
    up by appropriate buffers would help.
    I found this article, which is related, suggesting ILP may be increased.

    http://cva.stanford.edu/classes/ee482a/projects/project_spill.pdf

    With only 16 regs, some instructions can be reduced to 24-bits.

    That's going to have the same problems as Sparc register windowing.
    The problems happen when there is a memory reference to a register that software thinks was spilled but is being held in the register window
    that is acting as a hidden non-coherent cache.

  • From MitchAlsup1@21:1/5 to BGB on Tue Feb 20 17:56:28 2024
    BGB wrote:

    On 2/11/2024 3:36 PM, MitchAlsup1 wrote:
    Robert Finch wrote:

    With only 16 regs, some instructions can be reduced to 24-bits.

    I have compiled benchmarks where My 66000 with only 32 registers takes
    no spill/fill instructions where RISC-V takes spill/fill instructions
    even though it has 32 integer and 32 FP registers in its file. In my
    case this is down to efficient use of <FP> constants, not wasting
    instructions to LD them, and not wasting a register to temporarily hold them.


    I have still not entirely eliminated spill/fill, even with 64 GPRs.
    Though, this is typically more due to compiler limitations than actually running out of free registers...

    Nobody can completely eliminate spill/fill with a finite number of registers.

    Then noted in my fiddling that, with superscalar enabled, Dhrystone was faster in RV64G ("GCC -O3") than in BJX2.

    Though, more fiddling, I have noted that re-enabling the Compare+Branch
    ops (with 2 input registers), and disabling stack-canary checking
    (enabled by default in BGBCC), was enough to put BJX2 back in the lead (though, not by a particularly large margin, namely 91k vs 88k).


    In the past I have noted that a 16 register machine with IBM-360-like
    ISA performs as if it had about 22 registers; LD-OPs performing most
    of the heavy lifting; saving registers from holding temporary and use
    once values.

    It is possible I may need to revisit this, since:
    I already have the underlying mechanism as it is needed for the RV 'A' extension;
    The competition against RV is tighter than I would like;
    Ultimately, my project may be kinda moot if it is only slightly faster
    than RISC-V.

    This is one of the reasons that one needs a "better" ISA than RISC-V.
    My 66000 only requires 70%-72% of the instruction count of RISC-V.

    Though, I suspect that performance and code-density are interrelated in
    this case (in particular, my compiler is still emitting some amount of unnecessary instructions).

    A balance is required: you need the ISA to be high enough to have a
    better instruction count, but not so high that the number of cycles goes
    way up (VAX).

    Though, I guess I still have my GLQuake port on my side.


    And on the RISC-V side, the 'P' extension ironically manages to be both
    less useful and also needlessly over-complicated.

    Me:
    PADD.W, PSUB.W
    'P':
    ADD, SUB, ADDSUB, SUBADD x Wrap/SSat/USat/SHalve/UHalve x Byte/Word
    So, where I have 2 instructions, P has 40...
    And, it just keeps going on and on like this...
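    The figure of 40 is simply the product of the three lists above. A quick
    sketch of the arithmetic (the tuples are only combinations for counting;
    they are not actual 'P' extension mnemonics):

```python
from itertools import product

# The three axes listed in the post above.
ops = ["ADD", "SUB", "ADDSUB", "SUBADD"]
modes = ["Wrap", "SSat", "USat", "SHalve", "UHalve"]
widths = ["Byte", "Word"]

# Every (operation, mode, width) combination; counting only, not real mnemonics.
variants = list(product(ops, modes, widths))

print(len(ops), "x", len(modes), "x", len(widths), "=", len(variants))
```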

    I get all my SIMD-ness and Vectorization with exactly 2 instructions.

    And, it never gets to FPU-SIMD...

  • From MitchAlsup1@21:1/5 to EricP on Tue Feb 20 18:02:10 2024
    EricP wrote:

    Robert Finch wrote:
    Not being satisfied with current Q+ and the number of rename registers
    required I decide to start yet another project, this time a CPU with
    only 16 GPRs. I know that fewer registers will spill to memory more
    often, so, I thought using explicit spill and fill instructions backed
    up by appropriate buffers would help.
    I found this article, which is related, suggesting ILP may be increased.

    http://cva.stanford.edu/classes/ee482a/projects/project_spill.pdf

    With only 16 regs, some instructions can be reduced to 24-bits.

    That's going to have the same problems as Sparc register windowing.
    The problems happen when there is a memory reference to a register that software thinks was spilled but is being held in the register window
    that is acting as a hidden non-coherent cache.

    It is similar to SPARC register windows in that it provides a place to
    perform spill/fill, and if that place does not "overflow" then the
    STs to memory are not performed and fewer cycles are required. It is
    different in how the compiler expresses spill/fill: SPARC is implicit,
    that paper is explicit.
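
    A minimal sketch of the buffer behaviour described above, assuming a
    small buffer keyed by stack address that evicts its oldest entry on
    overflow. The class name SpillBuffer, the capacity, the eviction policy
    and the release() step are illustrative assumptions, not taken from the
    paper or from any design in this thread.

```python
from collections import OrderedDict


class SpillBuffer:
    """Toy model of an explicit spill/fill buffer in front of stack memory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = OrderedDict()  # stack address -> value not yet stored
        self.memory = {}              # stands in for real stack memory
        self.stores = 0               # STs that actually reached memory

    def spill(self, addr, value):
        """Explicit spill: hold the value in the buffer if there is room."""
        self.pending[addr] = value
        self.pending.move_to_end(addr)
        if len(self.pending) > self.capacity:
            # Overflow: the oldest pending spill is written to memory.
            old_addr, old_value = self.pending.popitem(last=False)
            self.memory[old_addr] = old_value
            self.stores += 1

    def fill(self, addr):
        """Explicit fill: buffered value if still pending, else memory."""
        if addr in self.pending:
            return self.pending[addr]
        return self.memory[addr]

    def release(self, addr):
        """Assumed step: drop a dead spill slot without ever storing it."""
        self.pending.pop(addr, None)
```

    As long as the number of live spills stays within the capacity, stores
    stays at zero; only an overflow turns a spill into a real ST.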

    The problem with SPARC register windows is that it slows down the register
    file access because there are at least 4× as many registers in the file
    as typical RISCs. Thus, while MIPS, M88K, HP, .. all got register access
    time under ½ cycle, SPARCs got 1 full cycle, slowing the pipeline or the frequency.

  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Tue Feb 20 22:03:42 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    The problem with SPARC register windows is that it slows down the register
    file access because there are at least 4× as many registers in the file
    as typical RISCs. Thus, while MIPS, M88K, HP, .. all got register access
    time under ½ cycle, SPARCs got 1 full cycle, slowing the pipeline or the
    frequency.

    SPARC64 X+ was available at frequencies up to 3.7GHz.

    Sparc M8 was (maybe still is) available at frequencies up to 5GHz.

    MIPS, M88K, HP did not produce anything that has even remotely these frequencies.

    These days, Intel is making Raptor Cove cores with 280 physical
    registers, and they run at up to 6GHz.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Tue Feb 20 22:15:35 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    EricP wrote:

    Robert Finch wrote:
    Not being satisfied with current Q+ and the number of rename registers
    required I decide to start yet another project, this time a CPU with
    only 16 GPRs. I know that fewer registers will spill to memory more
    often, so, I thought using explicit spill and fill instructions backed
    up by appropriate buffers would help.
    I found this article, which is related, suggesting ILP may be increased.
    http://cva.stanford.edu/classes/ee482a/projects/project_spill.pdf

    With only 16 regs, some instructions can be reduced to 24-bits.

    That's going to have the same problems as Sparc register windowing.
    The problems happen when there is a memory reference to a register that
    software thinks was spilled but is being held in the register window
    that is acting as a hidden non-coherent cache.

    It is similar to SPARC register windows in that it provides a place to perform spill/fill, and if that place does not "overflow" then the
    STs to memory are not performed and fewer cycles are required. It is different in how the compiler expresses spill/fill: SPARC is implicit,
    that paper is explicit.

    The spill/restore step would still happen behind the program's
    back, so there is at least some potential issue of inconsistent
    memory state.

    However, a clear ABI which makes sure that only local variables
    which have nothing pointing to them can be spilled/restored in
    this way could work. Any registers could be reclaimed when
    the stack pointer is adjusted, without having to go through
    the cache system.

    Hmm... anything that could seriously go wrong with this?

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Tue Feb 20 23:32:23 2024
    Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    EricP wrote:

    Robert Finch wrote:
    Not being satisfied with current Q+ and the number of rename registers
    required I decide to start yet another project, this time a CPU with
    only 16 GPRs. I know that fewer registers will spill to memory more
    often, so, I thought using explicit spill and fill instructions backed
    up by appropriate buffers would help.
    I found this article, which is related, suggesting ILP may be increased.
    http://cva.stanford.edu/classes/ee482a/projects/project_spill.pdf

    With only 16 regs, some instructions can be reduced to 24-bits.

    That's going to have the same problems as Sparc register windowing.
    The problems happen when there is a memory reference to a register that
    software thinks was spilled but is being held in the register window
    that is acting as a hidden non-coherent cache.

    It is similar to SPARC register windows in that it provides a place to
    perform spill/fill, and if that place does not "overflow" then the
    STs to memory are not performed and fewer cycles are required. It is
    different in how the compiler expresses spill/fill: SPARC is implicit,
    that paper is explicit.

    The spill/restore step would still happen behind the program's
    back, so there is at least some potential issue of inconsistent
    memory state.

    How so ?? If the spilled register has not reached memory, the fill
    gets the non-SW-visible flip-flop data, and if it has reached memory
    it gets the value in that memory. Some 3rd party reading memory
    expecting a spill to be there would be problematic, but this would
    be frowned upon programming practice and would have to be interlocked
    with ATOMIC guards.

    However, a clear ABI which makes sure that only local variables
    which have nothing pointing to them can be spilled/restored in
    this way could work. Any registers could be reclaimed when
    the stack pointer is adjusted, without having to go through
    the cache system.

    Hmm... anything that could seriously go wrong with this?

  • From EricP@21:1/5 to All on Wed Feb 21 10:58:20 2024
    MitchAlsup1 wrote:
    Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    EricP wrote:

    Robert Finch wrote:
    Not being satisfied with current Q+ and the number of rename
    registers required I decide to start yet another project, this time
    a CPU with only 16 GPRs. I know that fewer registers will spill to
    memory more often, so, I thought using explicit spill and fill
    instructions backed up by appropriate buffers would help.
    I found this article, which is related, suggesting ILP may be
    increased.

    http://cva.stanford.edu/classes/ee482a/projects/project_spill.pdf

    With only 16 regs, some instructions can be reduced to 24-bits.

    That's going to have the same problems as Sparc register windowing.
    The problems happen when there is a memory reference to a register that
    software thinks was spilled but is being held in the register window
    that is acting as a hidden non-coherent cache.

    It is similar to SPARC register windows in that it provides a place to
    perform spill/fill, and if that place does not "overflow" then the
    STs to memory are not performed and fewer cycles are required. It is
    different in how the compiler expresses spill/fill: SPARC is implicit,
    that paper is explicit.

    The spill/restore step would still happen behind the program's
    back, so there is at least some potential issue of inconsistent
    memory state.

    How so ?? If the spilled register has not reached memory, the fill
    gets the non-SW-visible flip-flop data, and if it has reached memory
    it gets the value in that memory. Some 3rd party reading memory
    expecting a spill to be there would be problematic, but this would
    be frowned upon programming practice and would have to be interlocked
    with ATOMIC guards.

    Exactly, it would be problematic for a third party like an IO,
    interrupts, DMA, other threads.
    Or a setjmp/longjmp.
    Or a nested routine that is looking backwards in the stack
    (remember, the callee doesn't know if the caller has done this).

    It doesn't need an atomic guard, but at a minimum it needs a non-privileged
    sync stack (syncstk) instruction that flushes all pending spills
    *in the privilege mode active at the time the deferred spill was performed*.

    And hardware that can handle flushing deferred user mode stack spills
    and associated virtual address translations and page table walks
    while in kernel mode.

    Then the discussion becomes where and how often does syncstk need to be used, and are the rules for using it clear enough that it won't leave land mines
    in code all over the place.

    However, a clear ABI which makes sure that only local variables
    which have nothing pointing to them can be spilled/restored in
    this way could work. Any registers could be reclaimed when
    the stack pointer is adjusted, without having to go through
    the cache system.

    Hmm... anything that could seriously go wrong with this?

    It is a hidden non-coherent cache of unknown and variable size with
    manual synchronization controls that must be invoked any time
    there *might* be an access by the current execution context
    into some unknown prior deferred spill.

    For example, every interrupt, exception, or syscall will start with
    a syncstk. So the deferred cost of spilling multiple sets of multiple
    registers to user mode stack will be paid at the start of every interrupt.
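
    A toy model of the hazard and of the deferred cost, assuming pending
    spills live in a hidden buffer until a syncstk-style flush. The class
    LazySpillStack and its method names are hypothetical; only the name
    syncstk comes from the text above.

```python
class LazySpillStack:
    """Toy model: deferred spills are invisible to memory until flushed."""

    def __init__(self):
        self.memory = {}   # what a third party (DMA, debugger, ...) can see
        self.pending = {}  # hidden, non-coherent buffer of deferred spills

    def spill(self, addr, value):
        """Deferred spill: the value is buffered, not written to memory."""
        self.pending[addr] = value

    def fill(self, addr):
        """The owning context always sees its own latest spill."""
        if addr in self.pending:
            return self.pending[addr]
        return self.memory.get(addr)

    def third_party_read(self, addr):
        """Any other observer reads memory only and may see stale data."""
        return self.memory.get(addr)

    def syncstk(self):
        """Flush every pending spill; returns how many stores were paid for."""
        flushed = len(self.pending)
        self.memory.update(self.pending)
        self.pending.clear()
        return flushed
```

    In this model the owner's fill() is always correct, while an outside
    reader sees nothing until syncstk() runs, and the return value of
    syncstk() is the store count that would be paid at interrupt entry.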

  • From Thomas Koenig@21:1/5 to EricP on Wed Feb 21 22:12:41 2024
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    MitchAlsup1 wrote:
    Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    EricP wrote:

    Robert Finch wrote:
    Not being satisfied with current Q+ and the number of rename
    registers required I decide to start yet another project, this time
    a CPU with only 16 GPRs. I know that fewer registers will spill to
    memory more often, so, I thought using explicit spill and fill
    instructions backed up by appropriate buffers would help.
    I found this article, which is related, suggesting ILP may be
    increased.

    http://cva.stanford.edu/classes/ee482a/projects/project_spill.pdf

    With only 16 regs, some instructions can be reduced to 24-bits.

    That's going to have the same problems as Sparc register windowing.
    The problems happen when there is a memory reference to a register that
    software thinks was spilled but is being held in the register window
    that is acting as a hidden non-coherent cache.

    It is similar to SPARC register windows in that it provides a place to
    perform spill/fill, and if that place does not "overflow" then the
    STs to memory are not performed and fewer cycles are required. It is
    different in how the compiler expresses spill/fill: SPARC is implicit,
    that paper is explicit.

    The spill/restore step would still happen behind the program's
    back, so there is at least some potential issue of inconsistent
    memory state.

    How so ?? If the spilled register has not reached memory, the fill
    gets the non-SW-visible flip-flop data, and if it has reached memory
    it gets the value in that memory. Some 3rd party reading memory
    expecting a spill to be there would be problematic, but this would
    be frowned upon programming practice and would have to be interlocked
    with ATOMIC guards.

    Exactly, it would be problematic for a third party like an IO,
    interrupts, DMA, other threads.

    Make the spills backed up by stack storage only.

    Or a setjmp/longjmp.

    Not sure what is needed there.

    Or a nested routine that is looking backwards in the stack
    (remember, the callee doesn't know if the caller has done this).

    Never pass a pointer to something that has been spilled. If you
    do, it's an ABI violation (same as overwriting the stack
    via some other pointer).

    It doesn't need an atomic guard, but at a minimum it needs a non-privileged
    sync stack (syncstk) instruction that flushes all pending spills
    *in the privilege mode active at the time the deferred spill was performed*.

    Or spill to memory on privilege change.

    It could also be possible to have a background task in the processor
    which does the syncing (while keeping the backed-up registers).

    And hardware that can handle flushing deferred user mode stack spills
    and associated virtual address translations and page table walks
    while in kernel mode.

    Then the discussion becomes where and how often does syncstk need to be used, and are the rules for using it clear enough that it won't leave land mines
    in code all over the place.

    However, a clear ABI which makes sure that only local variables
    which have nothing pointing to them can be spilled/restored in
    this way could work. Any registers could be reclaimed when
    the stack pointer is adjusted, without having to go through
    the cache system.

    Hmm... anything that could seriously go wrong with this?

    It is a hidden non-coherent cache of unknown and variable size with
    manual synchronization controls that must be invoked any time
    there *might* be an access by the current execution context
    into some unknown prior deferred spill.

    For example, every interrupt, exception, or syscall will start with
    a syncstk. So the deferred cost of spilling multiple sets of multiple registers to user mode stack will be paid at the start of every interrupt.

    That cost will be non-zero, agreed. But depending on the frequency
    of interrupts (and if something has already done some of the work
    in the background), there might still be a net gain overall.

  • From EricP@21:1/5 to Thomas Koenig on Fri Feb 23 13:49:50 2024
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    MitchAlsup1 wrote:
    Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    EricP wrote:

    Robert Finch wrote:
    Not being satisfied with current Q+ and the number of rename
    registers required I decide to start yet another project, this time
    a CPU with only 16 GPRs. I know that fewer registers will spill to
    memory more often, so, I thought using explicit spill and fill
    instructions backed up by appropriate buffers would help.
    I found this article, which is related, suggesting ILP may be
    increased.

    http://cva.stanford.edu/classes/ee482a/projects/project_spill.pdf

    With only 16 regs, some instructions can be reduced to 24-bits.
    That's going to have the same problems as Sparc register windowing.
    The problems happen when there is a memory reference to a register that
    software thinks was spilled but is being held in the register window
    that is acting as a hidden non-coherent cache.
    It is similar to SPARC register windows in that it provides a place to
    perform spill/fill, and if that place does not "overflow" then the
    STs to memory are not performed and fewer cycles are required. It is
    different in how the compiler expresses spill/fill: SPARC is implicit,
    that paper is explicit.
    The spill/restore step would still happen behind the program's
    back, so there is at least some potential issue of inconsistent
    memory state.
    How so ?? If the spilled register has not reached memory, the fill
    gets the non-SW-visible flip-flop data, and if it has reached memory
    it gets the value in that memory. Some 3rd party reading memory
    expecting a spill to be there would be problematic, but this would
    be frowned upon programming practice and would have to be interlocked
    with ATOMIC guards.
    Exactly, it would be problematic for a third party like an IO,
    interrupts, DMA, other threads.

    Make the spills backed up by stack storage only.

    The lazy spills may eventually write to the stack.
    It's just that you don't know if or when it will happen.

    Or a setjmp/longjmp.

    Not sure what is needed there.

    I might be being overly paranoid on this one.
    The function of the lazy spill instructions is incompatible with
    a setjmp or any equivalent register set snapshot function.
    So just don't mix the two.

    On Sparc this was problematic because the register window creation was automatic so there was no way to avoid it. This meant that setjmp had
    to sync the stack which requires flushing all the pending changes.
    This was made even more expensive because Sparc used kernel traps
    for managing register windows.

    Sparc had similar flushing requirements for user mode task switching
    as part of the current task context may be stuck in the window cache.
    But flushing the window cache required a kernel trap, which kinda defeats
    the whole purpose of cheap user mode task switching.

    Or a nested routine that is looking backwards in the stack
    (remember, the callee doesn't know if the caller has done this).

    Never pass a pointer to something that has been spilled. If you
    do, it's an ABI violation (same as overwriting the stack
    via some other pointer).

    My concern is at the hardware level not a language level.
    There is no technical reason that you could not have a subroutine that,
    say, reads the stack and writes it to a file as part of an error logger,
    or a debugger that examines or writes to the stack.

    It doesn't need an atomic guard, but at a minimum it needs a non-privileged
    sync stack (syncstk) instruction that flushes all pending spills
    *in the privilege mode active at the time the deferred spill was performed*.

    Or spill to memory on privilege change.

    Therein lies another problem, because the window is part of the thread
    context but may not be spillable to user mode after switching to kernel
    mode, because the OS is not allowed to page fault in many places like
    interrupts.

    So it would need a second mechanism so that it can save the pending
    register windows in non-paged kernel memory, so that kernel code can make
    calls that create new register windows.

    It could also be possible to have a background task in the processor
    which does the syncing (while keeping the backed-up registers).

    Uhg.

    And hardware that can handle flushing deferred user mode stack spills
    and associated virtual address translations and page table walks
    while in kernel mode.

    Then the discussion becomes where and how often does syncstk need to be used,
    and are the rules for using it clear enough that it won't leave land mines
    in code all over the place.

    However, a clear ABI which makes sure that only local variables
    which have nothing pointing to them can be spilled/restored in
    this way could work. Any registers could be reclaimed when
    the stack pointer is adjusted, without having to go through
    the cache system.
    Hmm... anything that could seriously go wrong with this?
    It is a hidden non-coherent cache of unknown and variable size with
    manual synchronization controls that must be invoked any time
    there *might* be an access by the current execution context
    into some unknown prior deferred spill.

    For example, every interrupt, exception, or syscall will start with
    a syncstk. So the deferred cost of spilling multiple sets of multiple
    registers to user mode stack will be paid at the start of every interrupt.

    That cost will be non-zero, agreed. But depending on the frequency
    of interrupts (and if something has already done some of the work
    in the background), there might still be a net gain overall.

    After rummaging about for a while I have not been able to find the
    papers that outlined all the issues with register windows (RW)
    so I'll try to remember some that people have mentioned...

    - It requires many more hardware registers but doesn't allow them to be
    accessed directly. Sparc required 120 physical registers but only 29
    were architecturally available to a programmer. This was a far more
    significant issue back in the 1980s when RW was first introduced.
    But still today it could double the number of physical registers.

    - Sparc's fixed window size of 8 registers was considered very inefficient.
    The number of window save-sets was intended to be model specific but it
    turned out that too many algorithms wound up depending on the initial
    size of 4 so that's where it stayed.

    - Sparc's RW was coupled to CALL and RET so you could not call a routine
    without creating a new 8 register window.
    A better method would support a variable size save-set that is independent
    of CALL & RET so you can call leaf routines which require no registers saved.

    - Sparc uses traps for overflow/underflow management which made it
    expensive. Also one of the windows had to be reserved for the trap handler
    so in practice there were only 3 save-sets.

    - Kernel transitions have to save and restore some or all of the user
    mode windows so it can use the windows in kernel mode, increasing the
    overhead for interrupts and exceptions.

  • From MitchAlsup1@21:1/5 to EricP on Fri Feb 23 19:48:14 2024
    EricP wrote:

    Thomas Koenig wrote:


    - It requires many more hardware registers but doesn't allow them to be
    accessed directly. Sparc required 120 physical registers but only 29
    were architecturally available to a programmer. This was a far more
    significant issue back in the 1980s when RW was first introduced.
    But still today it could double the number of physical registers.

    - Sparc's fixed window size of 8 registers was considered very inefficient. The number of window save-sets was intended to be model specific but it turned out that too many algorithms wound up depending on the initial
    size of 4 so that's where it stayed.

    Indeed, the median number of registers required to save/restore across a
    subroutine boundary is between 2 and 3 (depending on whether you count
    the return address as one of them).

    - Sparc's RW was coupled to CALL and RET so you could not call a routine
    without creating a new 8 register window.

    Losing out on the leaf level procedure's typical lack of need for any
    but temporary registers.

    A better method would support a variable size save-set that is independent
    of CALL & RET so you can call leaf routines which require no registers saved.

    This requires too much logic in the register file decoder, whereas SPARC
    RW only required a 2-bit numeric adder.

    - Sparc uses traps for overflow/underflow management which made it
    expensive. Also one of the windows had to be reserved for the trap handler
    so in practice there were only 3 save-sets.

    SPARC register windows were "well thought out" only in an academic sense.

    - Kernel transitions have to save and restore some or all of the user
    mode windows so it can use the windows in kernel mode, increasing the overhead for interrupts and exceptions.

    Which is why nobody copied them. I suggest others not copy them either.

  • From Niklas Holsti@21:1/5 to EricP on Fri Feb 23 22:28:43 2024
    On 2024-02-23 20:49, EricP wrote:

    [snip]

    After rummaging about for a while I have not been able to find the
    papers that outlined all the issues with register windows (RW)
    so I'll try to remember some people have mentioned...

    - It requires many more hardware registers but doesn't allow them to be
    accessed directly. Sparc required 120 physical registers but only 29
    were architecturally available to a programmer. This was a far more
    significant issue back in the 1980s when RW was first introduced.
    But still today it could double the number of physical registers.

    - Sparc's fixed window size of 8 registers was considered very inefficient. The number of window save-sets was intended to be model specific but it turned out that too many algorithms wound up depending on the initial
    size of 4 so that's where it stayed.


    All the SPARC processors I have used (ERC32, LEON2) have 8 save-sets
    (windows), of which one is typically reserved for trap handlers
    (including the RW overflow/underflow handler), leaving 7 for the
    application. This may be too few for some current SW that uses lots of
    very small routines in very deep, rapidly see-sawing call-chains.

    What algorithms would depend on the number of save-sets? No application algorithms should. The kernel's RW-handling operations may depend on it,
    but that should not be a problem.


    - Sparc's RW was coupled to CALL and RET so you could not call a routine
    without creating a new 8 register window.


    No, the RW file is rotated by the SAVE and RESTORE instructions, not by
    CALL and RET.
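
    A rough sketch of the window rotation, assuming the SPARC V8 convention
    that SAVE decrements the current window pointer (CWP), RESTORE
    increments it, and a window's outs are the same physical registers as
    the ins of the window SAVE moves to. The physical slot numbering below
    is an arbitrary choice for illustration, not how any particular chip
    lays out its register file.

```python
def physical_index(r, cwp, nwindows):
    """Map visible register r (0..31) in window cwp to a physical slot."""
    if not 0 <= r < 32:
        raise ValueError("visible registers are r0..r31")
    if r < 8:
        return r  # globals: shared by every window
    if r < 16:
        # outs: alias the ins of the window that SAVE would switch to
        return 8 + 16 * ((cwp - 1) % nwindows) + (r - 8)
    if r < 24:
        return 8 + 16 * cwp + 8 + (r - 16)  # locals: private to this window
    return 8 + 16 * cwp + (r - 24)          # ins


def save(cwp, nwindows):
    """SAVE rotates to the next window by decrementing CWP (modulo)."""
    return (cwp - 1) % nwindows


def restore(cwp, nwindows):
    """RESTORE rotates back by incrementing CWP (modulo)."""
    return (cwp + 1) % nwindows


def physical_registers(nwindows):
    """8 globals plus 16 new registers (8 locals + 8 ins) per window."""
    return 8 + 16 * nwindows
```

    Under these assumptions a 7-window file has 8 + 16*7 = 120 physical
    registers, which would be consistent with the 120 quoted above, and an
    8-window file has 136. A leaf routine that skips SAVE/RESTORE simply
    keeps running in the caller's window.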

    One of the gcc ports to SPARC (from Gaisler Research) has an option not
    to use register windows at all, and instead use the SPARC as a "flat", unwindowed 32-register processor. (I haven't used that mode.)


    A better method would support a variable size save-set that is independent
    of CALL & RET so you can call leaf routines which require no registers
    saved.


    Indeed, and a common optimization in SPARC code is not to use
    SAVE/RESTORE for leaf routines.


    - Sparc uses traps for overflow/underflow management which made it expensive.


    The expense of course depends on how the trap is implemented. In the
    SPARC applications I worked on (real-time, bare machine or real-time
    kernel) the overhead to enter and leave the trap handler was minor
    compared to the work to handle the underflow or overflow. With a "real"
    OS the case may be different.


    Also one of the windows had to be reserved for the trap handler
    so in practice there were only 3 save-sets.


    With 8 save-sets, reserving one is not a big problem.


    - Kernel transitions have to save and restore some or all of the user
    mode windows so it can use the windows in kernel mode, increasing the overhead for interrupts and exceptions.


    Maybe so, I'm not sure what a "real" OS kernel would do here. I don't
    think this was a problem in my SPARC applications with real-time kernels
    because kernel services were usually not called via traps, but as normal
    routines. The kernel had to save/restore the register windows of a given
    process/thread only when switching process/thread.

    In my SPARC applications, the major drawback of SPARC register windows
    was that every stack frame (for non-leaf routines) had to have space to
    store a whole register window, some 100 octets. This made the required
    stack size for each thread rather larger than one would expect from the
    source code. This should not be a big problem with today's memory sizes,
    but it might increase data-cache misses.
