• Re: Constant Stack Canaries

    From MitchAlsup1@21:1/5 to BGB on Sun Mar 30 20:14:53 2025
    On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:

    On 3/30/2025 7:16 AM, Robert Finch wrote:
    Just got to thinking about stack canaries. I was going to have a special
    purpose register holding the canary value for testing while the program
    was running. But I just realized today that it may not be needed. Canary
    values could be handled by the program loader as constants, eliminating
    the need for a register. Since the value is not changing while the
    program is running, it could easily be a constant. This may require a
    fixup record handled by the assembler / linker to indicate to the loader
    to place a canary value.

    Prolog code would just store an immediate to the stack. On return a TRAP
    instruction could check for the immediate value and trap if not present.
    But the process seems to require assembler / linker support.


    They are mostly just a normal compiler feature IME:
    Prolog stores the value;
    Epilog loads it and verifies that the value is intact.

    Agreed.

    Using a magic number

    Remove excess words.

Nothing fancy needed in the assembler or link stages.

    They remain blissfully ignorant--at most they generate the magic
    number, possibly at random, possibly per link-module.

    In my case, canary behavior is one of:
    Use them in functions with arrays or similar (default);
    Use them everywhere (optional);
    Disable them entirely (also optional).

    In my case, it is only checking 16-bit magic numbers, but mostly because
    a 16-bit constant is cheaper to load into a register in this case
    (single 32-bit instruction, vs a larger encoding needed for larger
    values).

    ....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Mon Mar 31 09:04:40 2025
    On 3/30/2025 1:14 PM, MitchAlsup1 wrote:
    On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:

    On 3/30/2025 7:16 AM, Robert Finch wrote:
Just got to thinking about stack canaries. I was going to have a special
purpose register holding the canary value for testing while the program
was running. But I just realized today that it may not be needed. Canary
values could be handled by the program loader as constants, eliminating
the need for a register. Since the value is not changing while the
program is running, it could easily be a constant. This may require a
fixup record handled by the assembler / linker to indicate to the loader
to place a canary value.

Prolog code would just store an immediate to the stack. On return a TRAP
instruction could check for the immediate value and trap if not present.
But the process seems to require assembler / linker support.


    They are mostly just a normal compiler feature IME:
       Prolog stores the value;
       Epilog loads it and verifies that the value is intact.

    Agreed.

    I'm glad you, Mitch, chimed in here. When I saw this, it occurred to me
    that this could be done automatically by the hardware (optionally, based
on a bit in a control register). The CALL instruction would store the
magic value, and the RET instruction would test it. If there was not a
    match, an exception would be generated. The value itself could be
    something like the clock value when the program was initiated, thus guaranteeing uniqueness.

    The advantage over the software approach, of course, is the elimination
    of several instructions in each prolog/epilog, reducing footprint, and
    perhaps even time as it might be possible to overlap some of the
    processing with the other things these instructions do. The downside is
    more hardware and perhaps extra overhead.

Does this make sense? What have I missed?
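
As a thought experiment, a rough C model of the proposed CALL/RET behavior
(every name here is invented for illustration; this is not any shipping ISA's
definition):

  #include <stdint.h>
  #include <stdlib.h>

  static uint64_t  canary_csr;     /* e.g. loaded from the clock at program start */
  static int       canary_enable;  /* the control-register enable bit */
  static uint64_t *sp;             /* stack pointer, grows downward */

  static void raise_stack_smash_exception(void) { abort(); }

  static void     push(uint64_t v) { *--sp = v; }
  static uint64_t pop(void)        { return *sp++; }

  /* CALL: push the return address, then the canary, as one instruction. */
  static void do_call(uint64_t ret_addr)
  {
      push(ret_addr);
      if (canary_enable)
          push(canary_csr);
  }

  /* RET: re-check the canary before trusting the saved return address. */
  static uint64_t do_ret(void)
  {
      if (canary_enable && pop() != canary_csr)
          raise_stack_smash_exception();
      return pop();
  }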

    Using a magic number

    Remove excess words.

Nothing fancy needed in the assembler or link stages.

    They remain blissfully ignorant--at most they generate the magic
    number, possibly at random, possibly per link-module.

    In my case, canary behavior is one of:
       Use them in functions with arrays or similar (default);
       Use them everywhere (optional);
       Disable them entirely (also optional).

    In my case, it is only checking 16-bit magic numbers, but mostly because
    a 16-bit constant is cheaper to load into a register in this case
    (single 32-bit instruction, vs a larger encoding needed for larger
    values).

    ....


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to BGB on Mon Mar 31 10:57:35 2025
    On 3/31/2025 10:17 AM, BGB wrote:
    On 3/31/2025 11:04 AM, Stephen Fuld wrote:
    On 3/30/2025 1:14 PM, MitchAlsup1 wrote:
    On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:

    On 3/30/2025 7:16 AM, Robert Finch wrote:
Just got to thinking about stack canaries. I was going to have a special
purpose register holding the canary value for testing while the program
was running. But I just realized today that it may not be needed. Canary
values could be handled by the program loader as constants, eliminating
the need for a register. Since the value is not changing while the
program is running, it could easily be a constant. This may require a
fixup record handled by the assembler / linker to indicate to the loader
to place a canary value.

Prolog code would just store an immediate to the stack. On return a TRAP
instruction could check for the immediate value and trap if not present.
But the process seems to require assembler / linker support.


    They are mostly just a normal compiler feature IME:
       Prolog stores the value;
       Epilog loads it and verifies that the value is intact.

    Agreed.

    I'm glad you, Mitch, chimed in here.  When I saw this, it occurred to
    me that this could be done automatically by the hardware (optionally,
    based on a bit in a control register).   The CALL instruction would
    store magic value, and the RET instruction would test it.  If there
    was not a match, an exception would be generated.  The value itself
    could be something like the clock value when the program was
    initiated, thus guaranteeing uniqueness.

    The advantage over the software approach, of course, is the
    elimination of several instructions in each prolog/epilog, reducing
    footprint, and perhaps even time as it might be possible to overlap
    some of the processing with the other things these instructions do.
    The downside is more hardware and perhaps extra overhead.

    Does this make sense?  What have I missed.


    This would seem to imply an ISA where CALL/RET push onto the stack or similar, rather than the (more common for RISC's) strategy of copying PC
    into a link register...

    Sorry, you're right. I should have said, in the context of Mitch's My
    66000, the ENTER and EXIT instructions.


    Another option being if it could be a feature of a Load/Store Multiple.

    The nice thing about the ENTER/EXIT is that they combine the store
    multiple (ENTER) and the load multiple and return control (EXIT).


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Mon Mar 31 18:07:30 2025
    On Mon, 31 Mar 2025 17:17:38 +0000, BGB wrote:

    On 3/31/2025 11:04 AM, Stephen Fuld wrote:
    On 3/30/2025 1:14 PM, MitchAlsup1 wrote:
    On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:
    -------------

    They are mostly just a normal compiler feature IME:
       Prolog stores the value;
       Epilog loads it and verifies that the value is intact.

    Agreed.

    I'm glad you, Mitch, chimed in here.  When I saw this, it occurred to me
    that this could be done automatically by the hardware (optionally, based
    on a bit in a control register).   The CALL instruction would store
    magic value, and the RET instruction would test it.  If there was not a
    match, an exception would be generated.  The value itself could be
    something like the clock value when the program was initiated, thus
    guaranteeing uniqueness.

    The advantage over the software approach, of course, is the elimination
    of several instructions in each prolog/epilog, reducing footprint, and
    perhaps even time as it might be possible to overlap some of the
    processing with the other things these instructions do.  The downside is
    more hardware and perhaps extra overhead.

    Does this make sense?  What have I missed.


    This would seem to imply an ISA where CALL/RET push onto the stack or similar, rather than the (more common for RISC's) strategy of copying PC
    into a link register...


    Another option being if it could be a feature of a Load/Store Multiple.

    Say, LDM/STM:
6b Hi (Upper bound of registers to save)
    6b Lo (Lower bound of registers to save)
    1b LR (Flag to save Link Register)
    1b GP (Flag to save Global Pointer)
    1b SK (Flag to generate a canary)

    ENTER and EXIT have 2 of those flags--but also note use of SP and CSP
    are implicit.

    Likely (STM):
    Pushes LR first (if bit set);
    Pushes GP second (if bit set);
    Pushes registers in range (if Hi>=Lo);
    Pushes stack canary (if bit set).

EXIT uses its 3rd flag when doing longjump() and THROW(),
so as to pop the call-stack but not actually RET from the stack
walker.

    LDM would check the canary first and fault if it doesn't see the
    expected value.
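
To make the ordering concrete, a small C model of the STM/LDM behavior
sketched above (the Hi/Lo/LR/GP/SK fields are the hypothetical encoding being
discussed; the magic value and the fault hook are placeholders):

  #include <stdint.h>
  #include <stdlib.h>

  #define CANARY_MAGIC 0x5A5Au

  static void canary_fault(void) { abort(); }   /* stand-in for the trap */

  /* STM model: push LR, GP, the register range Hi..Lo, then the canary. */
  static uint64_t *stm(uint64_t *sp, const uint64_t regs[64],
                       unsigned hi, unsigned lo,
                       int lr_flag, int gp_flag, int sk_flag,
                       uint64_t lr, uint64_t gp)
  {
      if (lr_flag) *--sp = lr;
      if (gp_flag) *--sp = gp;
      for (unsigned r = hi + 1; r-- > lo; )     /* regs[hi] down to regs[lo] */
          *--sp = regs[r];
      if (sk_flag) *--sp = CANARY_MAGIC;
      return sp;
  }

  /* LDM model: check the canary first, then restore in the reverse order. */
  static const uint64_t *ldm(const uint64_t *sp, uint64_t regs[64],
                             unsigned hi, unsigned lo,
                             int lr_flag, int gp_flag, int sk_flag,
                             uint64_t *lr, uint64_t *gp)
  {
      if (sk_flag && *sp++ != CANARY_MAGIC)
          canary_fault();
      for (unsigned r = lo; r <= hi; r++)
          regs[r] = *sp++;
      if (gp_flag) *gp = *sp++;
      if (lr_flag) *lr = *sp++;
      return sp;
  }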

    Downside, granted, is needing the relative complexity of an LDM/STM
    style instruction.

    Not conceptually any harder than DIV or FDIV and nobody complains
    about doing multi-cycle math.

    Other ISAs use a flag bit for each register, but this is less viable
    with an ISA with a larger number of registers, well, unless one uses a
    64 or 96 bit LDM/STM encoding (possible). Merit though would be not
    needing multiple LDM's / STM's to deal with a discontinuous register
    range.

    To quote Trevor Smith:: "Why would anyone want to do that" ??

Well, also excluding the possibility where the LDM/STM is essentially
just a function call (say, if beyond a certain number of registers are to
be saved/restored, the compiler generates a call to a save/restore
sequence, which is also generated as-needed). Granted, this is basically
    the strategy used by BGBCC. If multiple functions happen to save/restore
    the same combination of registers, they get to reuse the prior
    function's save/restore sequence (generally folded off to before the
    function in question).

Calling a subroutine to perform epilogues adds to the number of
branches a program executes. Having an instruction like EXIT means
when you know you need to exit, you EXIT, you don't branch to the exit
point. Saving instructions.

    Granted, the folding strategy can still do canary values, but doing so
    in the reused portions would limit the range of unique canary values
    (well, unless the canary magic is XOR'ed with SP or something...).
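
One way to realize the XOR idea in that parenthetical: derive each frame's
canary from a single shared magic and the frame's own SP, so shared
(folded-off) prolog/epilog code still stores and checks a frame-unique value.
A tiny sketch with invented names:

  #include <stdint.h>

  extern uintptr_t canary_magic;   /* one shared magic per image / link module */

  /* Per-frame canary: mixing in SP makes the checked value unique per frame
   * even when the store/check code itself is shared between functions. */
  static inline uintptr_t frame_canary(uintptr_t sp)
  {
      return canary_magic ^ sp;
  }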


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Mon Mar 31 20:52:14 2025
    On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:

    On 3/31/2025 1:07 PM, MitchAlsup1 wrote:
    -------------
    Another option being if it could be a feature of a Load/Store Multiple.

    Say, LDM/STM:
       6b Hi (Upper bound of register to save)
       6b Lo (Lower bound of registers to save)
       1b LR (Flag to save Link Register)
       1b GP (Flag to save Global Pointer)
       1b SK (Flag to generate a canary)

    ENTER and EXIT have 2 of those flags--but also note use of SP and CSP
    are implicit.

    Likely (STM):
       Pushes LR first (if bit set);
       Pushes GP second (if bit set);
       Pushes registers in range (if Hi>=Lo);
       Pushes stack canary (if bit set).

    EXIT uses its 3rd flag used when doing longjump() and THROW()
    so as to pop the call-stack but not actually RET from the stack
    walker.


    OK.

    I guess one could debate whether an LDM could treat the Load-LR as "Load
    LR" or "Load address and Branch", and/or have separate flags (Load LR vs
    Load PC, with Load PC meaning to branch).


    Other ABIs may not have as much reason to save/restore the Global
    Pointer all the time. But, in my case, it is being used as the primary
    way of accessing globals, and each binary image has its own address
    range here.

    I use constants to access globals.
These come in 32-bit and 64-bit flavors.

    PC-Rel not being used as PC-Rel doesn't allow for multiple process
    instances of a given loaded binary within a shared address space.

    As long as the relative distance is the same, it does.

Vs, say, for PIE ELF binaries where it is needed to load a new copy for
each process instance because of this (well, excluding an FDPIC style
ABI, but seemingly no one has bothered adding FDPIC
support in GCC or friends for RV64 based targets, ...).

    Well, granted, because Linux and similar tend to load every new process
    into its own address space and/or use CoW.

    CoW and execl()

    --------------
    Other ISAs use a flag bit for each register, but this is less viable
    with an ISA with a larger number of registers, well, unless one uses a
    64 or 96 bit LDM/STM encoding (possible). Merit though would be not
    needing multiple LDM's / STM's to deal with a discontinuous register
    range.

    To quote Trevor Smith:: "Why would anyone want to do that" ??


    Discontinuous register ranges:
    Because pretty much no ABI's put all of the callee save registers in a contiguous range.

    Granted, I guess if someone were designing an ISA and ABI clean, they
    could make all of the argument registers and callee save registers contiguous.

    Say:
    R0..R3: Special
    R4..R15: Scratch
    R16..R31: Argument
    R32..R63: Callee Save
    ....

    But, invariably, someone will want "compressed" instructions with a
    subset of the registers, and one can't just have these only having
    access to argument registers.

Brian had little trouble using the My 66000 ABI, which does have contiguous
register groupings.

Well, also excluding the possibility where the LDM/STM is essentially
just a function call (say, if beyond a certain number of registers are to
be saved/restored, the compiler generates a call to a save/restore
sequence, which is also generated as-needed). Granted, this is basically
the strategy used by BGBCC. If multiple functions happen to save/restore
the same combination of registers, they get to reuse the prior
function's save/restore sequence (generally folded off to before the
function in question).

    Calling a subroutine to perform epilogues is adding to the number of
    branches a program executes. Having an instruction like EXIT means
    when you know you need to exit, you EXIT you don't branch to the exit
    point. Saving instructions.


    Prolog needs a call, but epilog can just be a branch, since no need to
    return back into the function that is returning.

    Yes, but this means My 66000 executes 3 fewer transfers of control
    per subroutine than you do. And taken branches add latency.

    Needs to have a lower limit though, as it is not worth it to use a call/branch to save/restore 3 or 4 registers...

    But, say, 20 registers, it is more worthwhile.

ENTER saves as few as 1 or as many as 32 and remains that 1 single
instruction. Same for EXIT, which also performs the RET when LDing
R0.


    Granted, the folding strategy can still do canary values, but doing so
    in the reused portions would limit the range of unique canary values
    (well, unless the canary magic is XOR'ed with SP or something...).

Canary values are in addition to ENTER and EXIT, not part of them,
IMHO.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Robert Finch on Tue Apr 1 18:51:30 2025
    On Tue, 1 Apr 2025 4:58:58 +0000, Robert Finch wrote:

    On 2025-03-31 4:52 p.m., MitchAlsup1 wrote:
    On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:
    --------------------
    ENTER saves as few as 1 or as many as 32 and remains that 1 single
    instruction. Same for EXIT and exit also performs the RET when LDing
    R0.


Granted, the folding strategy can still do canary values, but doing so
in the reused portions would limit the range of unique canary values
(well, unless the canary magic is XOR'ed with SP or something...).

    Canary values are in addition to ENTER and EXIT not part of them
    IMHO.

    In Q+3 there are push and pop multiple instructions. I did not want to
    add load and store multiple on top of that. They work great for ISRs,
    but not so great for task switching code. I have the instructions
    pushing or popping up to 17 registers in a group. Groups of registers
    overlap by eight. The instructions can handle all 96 registers in the machine. ENTER and EXIT are also present.

    It is looking like the context switch code for the OS will take about
    3000 clock cycles to run.

    How much of that is figuring out who to switch to and, now that that has
    been decided, make the context switch manifest ??

    Not wanting to disable interrupts for that
    long, I put a spinlock on the system’s task control block array. But I think I have run into an issue. It is the timer ISR that switches tasks. Since it is an ISR it pushes a subset of registers that it uses and
    restores them at exit. But when exiting and switching tasks it spinlocks
on the task control block array. I am not sure this is a good thing, as
the timer IRQ is fairly high priority. If something else locked the TCB
    array it would deadlock. I guess the context switching could be deferred until the app requests some other operating system function. But then
    the issue is what if the app gets stuck in an infinite loop, not calling
    the OS? I suppose I could make an OS heartbeat function call a
    requirement of apps. If the app does not do a heartbeat within a
    reasonable time, it could be terminated.

    Q+3 progresses rapidly. A lot of the stuff in earlier versions was
    removed. The pared down version is a 32-bit machine. Expecting some
    headaches because of the use of condition registers and branch
    registers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Robert Finch on Tue Apr 1 23:24:29 2025
    On Tue, 1 Apr 2025 22:06:10 +0000, Robert Finch wrote:

    On 2025-04-01 2:51 p.m., MitchAlsup1 wrote:
    On Tue, 1 Apr 2025 4:58:58 +0000, Robert Finch wrote:
    ------------------

    It is looking like the context switch code for the OS will take about
    3000 clock cycles to run.

    How much of that is figuring out who to switch to and, now that that has
    been decided, make the context switch manifest ??

That was just for making the switch. I calculated based on the
    number of register loads and stores x2 and then times 13 clocks for
    memory access, plus a little bit of overhead for other instructions.

Why is it not 13 cycles to get started and then each register is 1
cycle?

    Deciding who to switch to may be another good chunk of time. But the
    system is using a hardware ready list, so the choice is just to pop
    (load) the top task id off the ready list. The guts of the switcher is
    only about 30 LOC, but it calls a couple of helper routines.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Tue Apr 1 23:21:24 2025
    On Tue, 1 Apr 2025 19:34:10 +0000, BGB wrote:

    On 3/31/2025 3:52 PM, MitchAlsup1 wrote:
    On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:
    ---------------------
    PC-Rel not being used as PC-Rel doesn't allow for multiple process
    instances of a given loaded binary within a shared address space.

    As long as the relative distance is the same, it does.


    Can't happen within a shared address space.

    Say, if you load a single copy of a binary at 0x24680000.
    Process A and B can't use the same mapping in the same address space,
    with PC-rel globals, as then they would each see the other's globals.

    Say I load a copy of the binary text at 0x24680000 and its data at
    0x35900000 for a distance of 0x11280000 into the address space of
    a process.

Then I load another copy at 0x44680000 and its data at 0x55900000
    into the address space of a different process.

    PC-rel addressing works in both cases--because the distance (-rel)
    remains the same,

    and the MMU can translate the code to the same physical, and map
    each area of data individually.

    Different virtual addresses, same code physical address, different
    data virtual and physical addresses.

    You can't do a duplicate mapping at another address, as this both wastes
    VAS, and also any Abs64 base-relocs or similar would differ.

    A 64-bit VAS is a wasteable address space, whereas a 48-bit VAS is not.

    You also can't CoW the data/bss sections, as this is no longer a shared address space.

    You are trying to "get at" something here, but I can't see it (yet).


    So, alternative is to use GBR to access globals, with the data/bss
    sections allocated independently of the binary.

    This way, multiple processes can share the same mapping at the same
    address for any executable code and constant data, with only the data sections needing to be allocated.


    Does mean though that one needs to save/restore the global pointer, and
    there is a ritual for reloading it.

    EXE's generally assume they are index 0, so:
    MOV.Q (GBR, 0), Rt
    MOV.Q (Rt, 0), GBR
    Or, in RV terms:
    LD X6, 0(X3)
    LD X3, Disp33(X6)
    Or, RV64G:
    LD X6, 0(X3)
    LUI X5, DispHi
ADD X5, X5, X6
    LD X3, DispLo(X5)


    For DLL's, the index is fixed up with a base-reloc (for each loaded
    DLL), so basically the same idea. Typically a Disp33 is used here to
    allow for a potentially large/unknown number of loaded DLL's. Thus far,
    a global numbering scheme is used.

    Where, (GBR+0) gives the address of a table of global pointers for every loaded binary (can be assumed read-only from userland).
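
Restated as a C sketch (names invented, mirroring the asm ritual above): the
word at (GBR+0) points at the shared table, and each image indexes it by its
load-time-assigned number.

  /* GBR points at this image's global area; slot 0 of every global area
   * holds the shared table of global pointers, one entry per loaded image. */
  static inline void *reload_gbr(void *gbr, unsigned image_index)
  {
      void **table = *(void ***)gbr;   /* MOV.Q (GBR, 0), Rt              */
      return table[image_index];       /* MOV.Q (Rt, Disp), GBR; EXEs use 0 */
  }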


    Generally, this is needed if:
    Function may be called from outside of the current binary and:
    Accesses global variables;
    And/or, calls local functions.

I just use 32-bit or 64-bit displacement constants. Does not matter
how control arrived at this subroutine, it accesses its data at the
linker-resolved addresses--without wasting a register.


    Though, still generally lower average-case overhead than the strategy typically used by FDPIC, which would handle this reload process on the
    caller side...
SD X3, Disp(SP)      # spill the caller's global pointer
LD X3, 8(X18)        # load the callee's global pointer
LD X6, 0(X18)        # load the callee's entry address
JALR X1, 0(X6)       # indirect call
LD X3, Disp(SP)      # restore the caller's global pointer

    This is just::

    CALX [IP,,#GOT[funct_num]-.]

    In the 32-bit linking mode this is a 2 word instruction, in the 64-bit
    linking mode it is a 3 word instruction.
    ----------------

    Though, execl() effectively replaces the current process.

    IMHO, a "CreateProcess()" style abstraction makes more sense than
    fork+exec.

    You are 40 years late on that.

    ---------------

    But, invariably, someone will want "compressed" instructions with a
    subset of the registers, and one can't just have these only having
    access to argument registers.

    Brian had little trouble using My 66000 ABI which does have contiguous
    register groupings.


    But, My66000 also isn't like, "Hey, how about 16-bit ops with 3 or 4 bit register numbers".

    Not sure the thinking behind the RV ABI.

If RISC-V removed its 16-bit instructions, there would be room in its ISA
to put my entire ISA along with all the non-compressed RISC-V
instructions.

    ---------------

    Prolog needs a call, but epilog can just be a branch, since no need to
    return back into the function that is returning.

    Yes, but this means My 66000 executes 3 fewer transfers of control
    per subroutine than you do. And taken branches add latency.


    Granted.

    Each predicted branch adds 2 cycles.

So, you lose 6 cycles on just under ½ of all subroutine calls,
    while also executing 2-5 instructions manipulating your global
    pointer.


    Needs to have a lower limit though, as it is not worth it to use a
    call/branch to save/restore 3 or 4 registers...

    But, say, 20 registers, it is more worthwhile.

    ENTER saves as few as 1 or as many as 32 and remains that 1 single
    instruction. Same for EXIT and exit also performs the RET when LDing
    R0.


    Granted.

    My strategy isn't perfect:
    Non-zero branching overheads, when the feature is used;
    Per-function load/store slides in prolog/epilog, when not used.

    Then, the heuristic mostly becomes one of when it is better to use the
    inline strategy (load/store slide), or to fold them off and use calls/branches.

My solution gets rid of the dilemma:
    a) the call code is always smaller
    b) the call code never takes more cycles

    In addition, there is a straightforward way to elide the STs of ENTER
    when the memory unit is still executing the previous EXIT.

    Does technically also work for RISC-V though (though seemingly GCC
    always uses inline save/restore, but also the RV ABI has fewer
    registers).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Robert Finch on Wed Apr 2 01:47:26 2025
    On Wed, 2 Apr 2025 0:07:41 +0000, Robert Finch wrote:

    On 2025-04-01 7:24 p.m., MitchAlsup1 wrote:
    On Tue, 1 Apr 2025 22:06:10 +0000, Robert Finch wrote:

    On 2025-04-01 2:51 p.m., MitchAlsup1 wrote:
    On Tue, 1 Apr 2025 4:58:58 +0000, Robert Finch wrote:
    ------------------

It is looking like the context switch code for the OS will take about
3000 clock cycles to run.

How much of that is figuring out who to switch to and, now that that has
been decided, make the context switch manifest ??

    That was just for the making the switch. I calculated based on the
    number of register loads and stores x2 and then times 13 clocks for
    memory access, plus a little bit of overhead for other instructions.

    Why is it not 13 cycles to get started and then each register is 1 one
    cycle.

    The CPU does not do pipe-lined burst loads. To load the cache line it is
    two independent loads. 256-bits at a time. Stores post to the bus, but
    I seem to remember having to space out the stores so the queue in the
    memory controller did not overflow. Needs more work.

    Stores should be faster, I think they are single cycle. But loads may be quite slow if things are not in the cache. I should really measure it.
It may not be as bad as I think. It is still 300 LOC, about 100 loads and
    stores each way. Lots of move instructions for regs that cannot be
    directly loaded or stored. And with CRs serializing the processor. But
    the processor should eat up all the moves fairly quickly.

    One of the reasons I went with treating the register file and thread-
    state as a write-back cache is that HW can read-up the inbound register
    values before starting to write out the outbound values (rather than
    the other way of having to do the STs first so the LDs have a place
    to land.)

    Deciding who to switch to may be another good chunk of time. But the
    system is using a hardware ready list, so the choice is just to pop
    (load) the top task id off the ready list. The guts of the switcher is
    only about 30 LOC, but it calls a couple of helper routines.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Tue Apr 1 22:55:56 2025
    Say, if you load a single copy of a binary at 0x24680000.
    Process A and B can't use the same mapping in the same address space,
    with PC-rel globals, as then they would each see the other's globals.

    Say I load a copy of the binary text at 0x24680000 and its data at
    0x35900000 for a distance of 0x11280000 into the address space of
    a process.

    Then I load another copy at 0x44680000 and its data at 55900000
    into the address space of a different process.

    But then if thread A (whose state is stored at 0x35900000) sends to
thread B (whose state is at 0x55900000) a closure whose code points
    somewhere inside 0x24680000, it will end up using the state of thread
    A instead of the state of the current thread.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Thu Apr 3 10:09:20 2025
    BGB [2025-04-01 23:19:11] wrote:
    But, yeah, inter-process function pointers aren't really a thing, and should not be a thing.

AFAIK, this point was brought up in the context of a shared address space
    (I assumed it was some kind of SASOS situation, but the same thing
    happens with per-thread data inside a POSIX-style process).
    Function pointers are perfectly normal and common in data (even tho they
    may often be implicit, e.g. within the method table of objects), and the
    whole point of sharing an address space is to be able to exchange data.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Robert Finch on Fri Apr 4 21:07:09 2025
    On Wed, 2 Apr 2025 0:07:41 +0000, Robert Finch wrote:

    On 2025-04-01 7:24 p.m., MitchAlsup1 wrote:
On Tue, 1 Apr 2025 22:06:10 +0000, Robert Finch wrote:
-------------------------
    Why is it not 13 cycles to get started and then each register is 1 one
    cycle.

    The CPU does not do pipe-lined burst loads. To load the cache line it is
    two independent loads. 256-bits at a time. Stores post to the bus, but
    I seem to remember having to space out the stores so the queue in the
    memory controller did not overflow. Needs more work.

    Stores should be faster, I think they are single cycle. But loads may be quite slow if things are not in the cache. I should really measure it.
    It may not be as bad I think. It is still 300 LOC, about 100 loads and
    stores each way. Lots of move instructions for regs that cannot be
    directly loaded or stored. And with CRs serializing the processor. But
    the processor should eat up all the moves fairly quickly.

By placing all the CRs together, and treating thread-state as a write-back
cache, all the storing and loading happens without any serialization,
in cache line quanta, where the LD can begin before the STs begin--giving
the overlap that reduces the cycle count.

For example, once a core has decided to run "this-thread" all it has to
do is to execute a single HR instruction which writes a pointer to
thread-state. Then upon SVR, that thread begins running. Between HE and
SVR, HW can preload the inbound data, and push out the outbound data
after the inbound data has arrived.

But, also note: Due to the way CRs are mapped into MMI/O memory, one
core can write that same HR-available CR on another core and cause a
remote context switch of that other core.

    The main use is more likely to be remote diagnostics of a core that
    has quit responding to the system (crashed hard) so its CRs can be
    read out and examined to see why it quit responding.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Robert Finch on Fri Apr 4 21:13:27 2025
    On Fri, 4 Apr 2025 3:49:31 +0000, Robert Finch wrote:

    On 2025-04-03 1:22 p.m., BGB wrote:
    -------------------

Or, to allow for NOMMU operation, or reduce costs by not having context
switches result in as large a number of TLB misses.

    Also makes the kernel simpler as it doesn't need to deal with each
    process having its own address space.

    Have you seen the MPRV bit in RISCV? Allows memory ops to execute using
    the previous mode / address space. The bit just has to be set, then do
    the memory op, then reset the bit. Makes it easy to access data using
    the process address space.
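
For reference, this is roughly how the RISC-V MPRV mechanism is used from
machine mode: setting mstatus.MPRV makes loads and stores translate and
permission-check as the mode held in mstatus.MPP (the previous mode after a
trap). A minimal sketch, with the helper name invented and interrupt masking
left out:

  #include <stdint.h>

  #define MSTATUS_MPRV (1UL << 17)   /* mstatus "modify privilege" bit */

  /* Load one doubleword through the previous mode's (mstatus.MPP) translation. */
  static inline uint64_t load_as_prev_mode(const uint64_t *va)
  {
      uint64_t v;
      __asm__ volatile ("csrs mstatus, %0" :: "r"(MSTATUS_MPRV) : "memory");
      v = *(volatile const uint64_t *)va;      /* translated/checked as MPP's mode */
      __asm__ volatile ("csrc mstatus, %0" :: "r"(MSTATUS_MPRV) : "memory");
      return v;
  }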

    Let us postulate you are running in RISC-V HyperVisor on core[j]
    and you want to write into GuestOS VAS and into application VAS
    more or less simultaneously.

    Seems to me like you need a MPRV to be more than a single bit
    so it could index which layer of the SW stack's VAS it needs
    to touch.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Robert Finch on Sat Apr 5 16:37:19 2025
    On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:

    On 2025-04-04 5:13 p.m., MitchAlsup1 wrote:
    On Fri, 4 Apr 2025 3:49:31 +0000, Robert Finch wrote:

    On 2025-04-03 1:22 p.m., BGB wrote:
    -------------------

Or, to allow for NOMMU operation, or reduce costs by not having context
switches result in as large a number of TLB misses.

    Also makes the kernel simpler as it doesn't need to deal with each
    process having its own address space.

    Have you seen the MPRV bit in RISCV? Allows memory ops to execute using
    the previous mode / address space. The bit just has to be set, then do
    the memory op, then reset the bit. Makes it easy to access data using
    the process address space.

    Let us postulate you are running in RISC-V HyperVisor on core[j]
    and you want to write into GuestOS VAS and into application VAS
    more or less simultaneously.

    Would not writing to the GuestOs VAS and the application VAS be the
    result of separate system calls? Or does the hypervisor take over for
    the GuestOS?

Application has a 64-bit VAS
GuestOS has a 64-bit VAS
HyperVisor has a 64-bit VAS
and so does
Secure

    So, we are in HV and we need to write to guestOS and to Application
    but we have only 1-bit of distinction.

And this has nothing to do with system calls; it has to do with
accessing (rather simultaneously) any of the 4 VASs.

    Seems to me like you need a MPRV to be more than a single bit
    so it could index which layer of the SW stack's VAS it needs
    to touch.

    So, there is a need to be able to go back two or three levels? I suppose
    it could also be done by manipulating the stack, although adding an
    extra bit may be easier. How often does it happen?

    I have no idea, and I suspect GuestOS people don't either.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Sat Apr 5 18:31:44 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:


    Would not writing to the GuestOs VAS and the application VAS be the
    result of separate system calls? Or does the hypervisor take over for
    the GuestOS?

    Application has a 64-bit VAS
    GusetOS has a 64-bit VAS
    HyprVisor has a 64-bit VAS
    and so does
    Securte has a 64-bit VAS

    So, we are in HV and we need to write to guestOS and to Application
    but we have only 1-bit of distinction.

    On ARM64, when the HV needs to write to guest user VA or guest PA,
    the SMMU provides an interface the processor can use to translate
    the guest VA or Guest PA to the corresponding system physical address.
    Of course, there is a race if the guest OS changes the underlying
    translation tables during the upcall to the hypervisor or secure
    monitor, although that would be a bug in the guest were it so to do,
    since the guest explicitly requested the action from the higher
    privilege level (e.g. HV).

    Arm does have a set of load/store "user" instructions that translate
    addresses using the unprivileged (application) translation tables. There's also a processor state bit (UAO - User Access Override) that can
    be set to force those instructions to use the permissions associated
    with the current processor privilege level.

    Note that there is a push by all vendors to include support
    for guest 'privacy', such that the hypervisor has no direct
access to memory owned by the guest, or where the guest
    memory is encrypted using a key the hypervisor or secure monitor
    don't have access to.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sat Apr 5 23:06:38 2025
    On Sat, 5 Apr 2025 18:31:44 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:


    Would not writing to the GuestOs VAS and the application VAS be the
    result of separate system calls? Or does the hypervisor take over for
    the GuestOS?

    Application has a 64-bit VAS
    GusetOS has a 64-bit VAS
    HyprVisor has a 64-bit VAS
    and so does
    Securte has a 64-bit VAS

    So, we are in HV and we need to write to guestOS and to Application
    but we have only 1-bit of distinction.

    On ARM64, when the HV needs to write to guest user VA or guest PA,
    the SMMU provides an interface the processor can use to translate
    the guest VA or Guest PA to the corresponding system physical address.
    Of course, there is a race if the guest OS changes the underlying
    translation tables during the upcall to the hypervisor or secure
    monitor, although that would be a bug in the guest were it so to do,
    since the guest explicitly requested the action from the higher
    privilege level (e.g. HV).

    Arm does have a set of load/store "user" instructions that translate addresses using the unprivileged (application) translation tables.

When Secure Monitor executes a "user" instruction, which layer
    of the SW stack is accessed:: {HV, SV, User} ??

    Is this 1-layer down the stack, or all layers down the stack ??

    There's
    also a processor state bit (UAO - User Access Override) that can
    be set to force those instructions to use the permissions associated
    with the current processor privilege level.

    That is how My 66000 MMU is defined--higher privilege layers
    have R/W access to the next lower privilege layer--without
    doing anything other than a typical LD or ST instruction.

    I/O MMU has similar issues to solve in that a device can Read
    write-execute only memory and write read-execute only memory.

    Note that there is a push by all vendors to include support
    for guest 'privacy', such that the hypervisor has no direct
    access to memory owned by the guest, or where the the guest
    memory is encrypted using a key the hypervisor or secure monitor
    don't have access to.

I call these "paranoid" applications--generally requiring no
privilege, but they don't want GuestOS or HyperVisor to look
at their data and at the same time, they want GuestOS or HV
to perform I/O to said data--so some devices have an effective
privilege above that of the driver commanding them.

I understand the reasons and rationale.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Robert Finch on Sat Apr 5 23:11:00 2025
    On Sat, 5 Apr 2025 21:57:50 +0000, Robert Finch wrote:

    On 2025-04-05 2:31 p.m., Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:


    Would not writing to the GuestOs VAS and the application VAS be the
    result of separate system calls? Or does the hypervisor take over for
    the GuestOS?

    Application has a 64-bit VAS
    GusetOS has a 64-bit VAS
    HyprVisor has a 64-bit VAS
    and so does
    Securte has a 64-bit VAS

    So, we are in HV and we need to write to guestOS and to Application
    but we have only 1-bit of distinction.

    On ARM64, when the HV needs to write to guest user VA or guest PA,
    the SMMU provides an interface the processor can use to translate
    the guest VA or Guest PA to the corresponding system physical address.
    Of course, there is a race if the guest OS changes the underlying
    translation tables during the upcall to the hypervisor or secure
    monitor, although that would be a bug in the guest were it so to do,
    since the guest explicitly requested the action from the higher
    privilege level (e.g. HV).

    Arm does have a set of load/store "user" instructions that translate
    addresses using the unprivileged (application) translation tables.
    There's
    also a processor state bit (UAO - User Access Override) that can
    be set to force those instructions to use the permissions associated
    with the current processor privilege level.

    Note that there is a push by all vendors to include support
    for guest 'privacy', such that the hypervisor has no direct
    access to memory owned by the guest, or where the the guest
    memory is encrypted using a key the hypervisor or secure monitor
    don't have access to.

    Okay,

    I was interpreting RISCV specs wrong. They have three bits dedicated to
    this. 1 is an on/off and the other two are the mode to use. I am left wondering how it is determined which mode to use. If the hypervisor is
    passed a pointer to a VAS variable in a register, how does it know that
    the pointer is for the supervisor or the user/app?

More interesting is the concept that there are multiple HVs that
    have been virtualized--in this case the sender of the address may
    think it has HV privilege but is currently operating as if it only
    has GuestOS privilege. ...

It's why I assumed it found the mode from the stack. Those two select bits
have to be set somehow. It seems like extra code to access the right
address space.
    I got the thought to use the three bits a bit differently.
    111 = use current mode
    110 = use mode from stack
    100 = debug? mode
    011 = secure (machine) mode
    010 = hypervisor mode
    001 = supervisor mode
    000 = user/app mode
I was just using inline code to select the proper address space. But if
it is necessary to dig around to figure out the mode, it may turn into a
subroutine call.

    All the machines I have used/designed/programmed in the past use 000
    as highest privilege and 111 as lowest.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Sun Apr 6 14:32:43 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sat, 5 Apr 2025 18:31:44 +0000, Scott Lurndal wrote:


    Arm does have a set of load/store "user" instructions that translate
    addresses using the unprivileged (application) translation tables.

    When Secure Monitor executes a "user" instructions which layer
    of the SW stack is accessed:: {HV, SV, User} ?

    The Secure Monitor will never execute a user instruction. If
    it does, it will act as any other load/store executed by the
    secure monitor.

    The "user" instructions are only used by a bare-metal OS
    or a guest OS to access user application address spaces.


    Is this 1-layer down the stack, or all layers down the stack ??

    One layer down, and only the least privileged non-user level.


    There's
    also a processor state bit (UAO - User Access Override) that can
    be set to force those instructions to use the permissions associated
    with the current processor privilege level.

    That is how My 66000 MMU is defined--higher privilege layers
    have R/W access to the next lower privilege layer--without
    doing anything other than a typical LD or ST instruction.

    That presumes a shared address space between the privilege
    levels - which is common for the OS and user-modes. It's
    not common (or particularly useful[*]) at any other privilege level.

    [*] A primary goal must be to avoid privilege level
    upcalls as much as possible.



    I/O MMU has similar issues to solve in that a device can Read
    write-execute only memory and write read-execute only memory.

    By the time the IOMMU translates the inbound address, it is
    a physical machine address, so I don't see any issue here.
    And in the ARM case, the IOMMU translation tables are identical
    to the processor translation tables in format and can actually
    share some or all of the tables between the core(s) and the IOMMU.

    Note that for various reasons, the IOMMU translation tables
    may cover only a portion of the target address space at any particular privilege level.


    Note that there is a push by all vendors to include support
    for guest 'privacy', such that the hypervisor has no direct
    access to memory owned by the guest, or where the the guest
    memory is encrypted using a key the hypervisor or secure monitor
    don't have access to.

    I call these "paranoid" applications--generally requiring no
    privilege, but they don't want GuestOS of HyperVisor to look
    at their data and at the same time, they want GuestOS or HV
    to perform I/O to said data--so some devices have a effective
    privilege above that of the driver commanding them.

    I understand the reasons and rational.

    The primary reason is for encrypted video decoding where
    the decoded video is fed directly to the graphics processor
    and the end-user cannot intercept the decrypted video stream. Closing
    the barn door after the horse has left, but c'est la vie.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Robert Finch on Sun Apr 6 14:21:26 2025
    Robert Finch <robfi680@gmail.com> writes:
    On 2025-04-05 2:31 p.m., Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:


    Would not writing to the GuestOs VAS and the application VAS be the
    result of separate system calls? Or does the hypervisor take over for
    the GuestOS?

    Application has a 64-bit VAS
    GusetOS has a 64-bit VAS
    HyprVisor has a 64-bit VAS
    and so does
    Securte has a 64-bit VAS

    So, we are in HV and we need to write to guestOS and to Application
    but we have only 1-bit of distinction.

    On ARM64, when the HV needs to write to guest user VA or guest PA,
    the SMMU provides an interface the processor can use to translate
    the guest VA or Guest PA to the corresponding system physical address.
    Of course, there is a race if the guest OS changes the underlying
    translation tables during the upcall to the hypervisor or secure
    monitor, although that would be a bug in the guest were it so to do,
    since the guest explicitly requested the action from the higher
    privilege level (e.g. HV).

    Arm does have a set of load/store "user" instructions that translate
addresses using the unprivileged (application) translation tables. There's
also a processor state bit (UAO - User Access Override) that can
    be set to force those instructions to use the permissions associated
    with the current processor privilege level.

    Note that there is a push by all vendors to include support
    for guest 'privacy', such that the hypervisor has no direct
    access to memory owned by the guest, or where the the guest
    memory is encrypted using a key the hypervisor or secure monitor
    don't have access to.

    Okay,

    I was interpreting RISCV specs wrong. They have three bits dedicated to
this. 1 is an on/off and the other two are the mode to use. I am left
wondering how it is determined which mode to use. If the hypervisor is
    passed a pointer to a VAS variable in a register, how does it know that
    the pointer is for the supervisor or the user/app?

    When the exception (in this case an upcall to a more privileged
    regime) occurs, the saved state register/stack word should contain the prior privilege level. The hypervisor will know from that whether
    the upcall was from the guest OS or a guest Application.

    Note that on ARM, there are restrictions on upcalls to
    more privileged regimes - generally a particular regime
    can only upcall the next higher privileged regime, so
    the user app can only upcall the GuestOS, the guest OS can only
    upcall the HV and the HV is the only regime that can
    upcall the secure monitor.

    It's why I assumed it
    found the mode from the stack. Those two select bits have to set
somehow. It seems like extra code to access the right address space.

    I haven't spent much time with RISC-V, but surely the processor
    has a state register that stores the current mode, and which
    must be preserved over exceptions/upcalls, which would require
    that they be recorded in an exception syndrome register for
    restoration when the upcall returns.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Mon Apr 7 00:51:08 2025
    On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
    ----------------
    When the exception (in this case an upcall to a more privileged
    regime) occurs, the saved state register/stack word should contain the
    prior privilege level. The hypervisor will know from that whether
    the upcall was from the guest OS or a guest Application.

    Note that on ARM, there are restrictions on upcalls to
    more privileged regimes - generally a particular regime
    can only upcall the next higher privileged regime, so
    the user app can only upcall the GuestOS, the guest OS can only
    upcall the HV and the HV is the only regime that can
    upcall the secure monitor.

    On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:

    That presumes a shared address space between the privilege
    levels - which is common for the OS and user-modes. It's
    not common (or particularly useful[*]) at any other privilege
    level.

    So, is this dichotomy because::

    a) HVs are good enough at virtualizing raw HW that GuestOS
    does not need a lot of paravirtualization to be efficient ??

    b) GuestOS does not need "that much paravirtualization" to be
    efficient anyway.

    c) the kinds of things GuestOS ask HVs to perform is just not
    enough like the kind of things user asks of GuestOS.

    d) User and GuestOS evolved in a time before virtualization
    and simply prefer to exist as it used to be ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Robert Finch on Mon Apr 7 14:04:37 2025
    Robert Finch <robfi680@gmail.com> writes:
    On 2025-04-06 10:21 a.m., Scott Lurndal wrote:

    Note that on ARM, there are restrictions on upcalls to
    more privileged regimes - generally a particular regime
    can only upcall the next higher privileged regime, so
    the user app can only upcall the GuestOS, the guest OS can only
    upcall the HV and the HV is the only regime that can
    upcall the secure monitor.

Allows two-directional virtualization I think. Q+ has all exceptions and
interrupts going to the secure monitor, which can then delegate them back
to a lower level.

    If that adds latency to the interrupt handler, that will not
    be a positive benefit.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Mon Apr 7 14:09:50 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
    ----------------
    When the exception (in this case an upcall to a more privileged
    regime) occurs, the saved state register/stack word should contain the
    prior privilege level. The hypervisor will know from that whether
    the upcall was from the guest OS or a guest Application.

    Note that on ARM, there are restrictions on upcalls to
    more privileged regimes - generally a particular regime
    can only upcall the next higher privileged regime, so
    the user app can only upcall the GuestOS, the guest OS can only
    upcall the HV and the HV is the only regime that can
    upcall the secure monitor.

    On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:

    That presumes a shared address space between the privilege
    levels - which is common for the OS and user-modes. It's
    not common (or particularly useful[*]) at any other privilege
    level.

    So, is this dichotomy because::

    a) HVs are good enough at virtualizing raw HW that GuestOS
    does not need a lot of paravirtualization to be efficient ??

    Yes. Once AMD added Nested Page Tables to SVM and the PCI-SIG
    proposed the SR-IOV capability, paravirtualization became anathema.


    b) GuestOS does not need "that much paravirtualization" to be
    efficient anyway.

    With modern hardware support, yes.


    c) the kinds of things GuestOS ask HVs to perform is just not
    enough like the kind of things user asks of GuestOS.

    Yes, that's also a truism.


    d) User and GuestOS evolved in a time before virtualization
    and simply prefer to exist as it used to be ??

    Typically an OS doesn't know if it is a guest or bare metal.
    That characteristic means that a given distribution can
    operate as either.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Apr 9 00:23:09 2025
    On Mon, 7 Apr 2025 14:09:50 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
    ----------------
    When the exception (in this case an upcall to a more privileged
    regime) occurs, the saved state register/stack word should contain the
    prior privilege level. The hypervisor will know from that whether
    the upcall was from the guest OS or a guest Application.

    Note that on ARM, there are restrictions on upcalls to
    more privileged regimes - generally a particular regime
    can only upcall the next higher privileged regime, so
    the user app can only upcall the GuestOS, the guest OS can only
    upcall the HV and the HV is the only regime that can
    upcall the secure monitor.

    On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:

    That presumes a shared address space between the privilege
    levels - which is common for the OS and user-modes. It's
    not common (or particularly useful[*]) at any other privilege
    level.

    So, is this dichotomy because::

    a) HVs are good enough at virtualizing raw HW that GuestOS
    does not need a lot of paravirtualization to be efficient ??

    Yes. Once AMD added Nested Page Tables to SVM and the PCI-SIG
    proposed the SR-IOV capability, paravirtualization became anathema.


    b) GuestOS does not need "that much paravirtualization" to be
    efficient anyway.

    With modern hardware support, yes.


    c) the kinds of things GuestOS ask HVs to perform is just not
    enough like the kind of things user asks of GuestOS.

    Yes, that's also a truism.


    d) User and GuestOS evolved in a time before virtualization
    and simply prefer to exist as it used to be ??

    Typically an OS doesn't know if it is a guest or bare metal.
    That characteristic means that a given distribution can
    operate as either.

Thank you for updating a piece of history I apparently did not
live through !!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Apr 15 00:43:43 2025
    On Mon, 7 Apr 2025 14:09:50 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
    ----------------
    When the exception (in this case an upcall to a more privileged
    regime) occurs, the saved state register/stack word should contain the
    prior privilege level. The hypervisor will know from that whether
    the upcall was from the guest OS or a guest Application.

    Note that on ARM, there are restrictions on upcalls to
    more privileged regimes - generally a particular regime
    can only upcall the next higher privileged regime, so
    the user app can only upcall the GuestOS, the guest OS can only
    upcall the HV and the HV is the only regime that can
    upcall the secure monitor.

    On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:

    That presumes a shared address space between the privilege
    levels - which is common for the OS and user-modes. It's
    not common (or particularly useful[*]) at any other privilege
    level.

    So, is this dichotomy because::

    a) HVs are good enough at virtualizing raw HW that GuestOS
    does not need a lot of paravirtualization to be efficient ??

    Yes. Once AMD added Nested Page Tables to SVM and the PCI-SIG
    proposed the SR-IOV capability, paravirtualization became anathema.

    Ok, back to Dan Cross:: (with help from Scott)

    If GuestOS wants to grab and hold onto a lock/mutex for a while
    to do some critical section stuff--does GuestOS "care" that HV
    can still take an interrupt while GuestOS is doing its CS thing ??
    since HV is not going to touch any memory associated with GuestOS.

    In effect, I am asking whether Disable Interrupt is SW-stack-wide or only
    applicable to the current layer of the SW stack ?? One can equally
    use SW-stack-wide to mean core-wide.

    For example:: GuestOS DIs, and HV takes a page fault from GuestOS;
    makes the page resident and accessible, and allows GuestOS to run
    from the point of fault. GuestOS "sees" no interrupt and nothing
    in GuestOS VAS is touched by HV in servicing the page fault.

    Now, sure that lock is held while the page fault is being serviced,
    and the ugly head of priority inversion takes hold. But ... I am in
    need of some edumacation here.


    b) GuestOS does not need "that much paravirtualization" to be
    efficient anyway.

    With modern hardware support, yes.


    c) the kinds of things GuestOS ask HVs to perform is just not
    enough like the kind of things user asks of GuestOS.

    Yes, that's also a truism.


    d) User and GuestOS evolved in a time before virtualization
    and simply prefer to exist as it used to be ??

    Typically an OS doesn't know if it is a guest or bare metal.
    That characteristic means that a given distribution can
    operate as either.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Apr 15 14:02:37 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 7 Apr 2025 14:09:50 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
    ----------------
    When the exception (in this case an upcall to a more privileged
    regime) occurs, the saved state register/stack word should contain the
    prior privilege level. The hypervisor will know from that whether
    the upcall was from the guest OS or a guest Application.

    Note that on ARM, there are restrictions on upcalls to
    more privileged regimes - generally a particular regime
    can only upcall the next higher privileged regime, so
    the user app can only upcall the GuestOS, the guest OS can only
    upcall the HV and the HV is the only regime that can
    upcall the secure monitor.

    On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:

    That presumes a shared address space between the privilege
    levels - which is common for the OS and user-modes. It's
    not common (or particularly useful[*]) at any other privilege
    level.

    So, is this dichotomy because::

    a) HVs are good enough at virtualizing raw HW that GuestOS
    does not need a lot of paravirtualization to be efficient ??

    Yes. Once AMD added Nested Page Tables to SVM and the PCI-SIG
    proposed the SR-IOV capability, paravirtualization became anathema.

    Ok, back to Dan Cross:: (with help from Scott)

    If GuestOS wants to grab and hold onto a lock/mutex for a while
    to do some critical section stuff--does GuestOS "care" that HV
    can still take an interrupt while GuestOS is doing its CS thing ??
    since HV is not going to touch any memory associated with GuestOS.

    Generally, the Guest should execute "as if" it were running on
    Bare Metal. Consider an intel/amd processor running a bare-metal
    operating system that takes an interrupt into SMM mode; from the
    POV of a guest, an HV interrupt is similar to an SMM interrupt.

    If the SMM, Secure Monitor or HV modify guest memory in any way,
    all bets are off.


    In effect, I am asking whether Disable Interrupt is SW-stack-wide or only
    applicable to the current layer of the SW stack ?? One can equally
    use SW-stack-wide to mean core-wide.

    Current layer of the privilege stack. If there is a secure monitor
    at a more privileged level than the HV, it can take interrupts in a
    manner similar to the legacy SMM interrupts. Typically there will
    be independent periodic timer interrupts in the Guest OS, the HV, and the secure monitor.
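    A toy model of that arrangement, in illustrative C rather than any real
    ISA: each layer owns its own enable bit, so a layer masking "its"
    interrupts says nothing about deliverability at the layers above it.

        #include <stdbool.h>

        enum layer { USER, SUPER, HYPER, SECURE };

        /* One interrupt-enable bit per layer of the privilege stack. */
        static bool int_enable[4] = { true, true, true, true };

        /* A guest OS critical section clears only its own bit ...        */
        static void guest_disable_interrupts(void) { int_enable[SUPER] = false; }

        /* ... so an interrupt aimed at the HV or the secure monitor is
           still deliverable while the guest holds its lock.              */
        static bool deliverable(enum layer target) { return int_enable[target]; }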


    For example:: GuestOS DIs, and HV takes a page fault from GuestOS;

    Note that these will be rare, and will occur only if the HV overcommits
    physical memory.

    makes the page resident and accessible, and allows GuestOS to run
    from the point of fault. GuestOS "sees" no interrupt and nothing
    in GuestOS VAS is touched by HV in servicing the page fault.

    The only way that the guest OS or guest OS application can detect
    such an event is if it measures an affected load/store - a covert
    channel. So there may be security considerations.
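    The measurement itself is nothing exotic - bracket a load with the cycle
    counter and look for an outlier. A sketch for x86 using the compiler's
    __rdtsc() intrinsic (the threshold is an arbitrary number, illustration
    only):

        #include <x86intrin.h>   /* __rdtsc() */
        #include <stdio.h>

        /* A load whose page was just made resident by the HV shows a latency
           far beyond any ordinary cache or TLB miss.                        */
        static unsigned long long timed_load(volatile long *p)
        {
            unsigned long long t0 = __rdtsc();
            (void)*p;
            return __rdtsc() - t0;
        }

        int main(void)
        {
            static long buf[512];
            unsigned long long dt = timed_load(&buf[0]);
            if (dt > 100000)   /* arbitrary threshold */
                printf("load took %llu cycles - something intervened\n", dt);
            return 0;
        }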


    Now, sure that lock is held while the page fault is being serviced,
    and the ugly head of priority inversion takes hold. But ... I am in
    need of some edumacation here.

    Priority inversion is only applicable within a privilege level/ring.
    Interrupts to a higher privilege level cannot be masked by an active interrupt at a lower priority level.

    The higher privilege level must not unilaterally modify guest OS or
    application state.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Apr 15 20:46:28 2025
    On Tue, 15 Apr 2025 14:02:37 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 7 Apr 2025 14:09:50 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
    ----------------
    When the exception (in this case an upcall to a more privileged
    regime) occurs, the saved state register/stack word should contain the
    prior privilege level. The hypervisor will know from that whether
    the upcall was from the guest OS or a guest Application.

    Note that on ARM, there are restrictions on upcalls to
    more privileged regimes - generally a particular regime
    can only upcall the next higher privileged regime, so
    the user app can only upcall the GuestOS, the guest OS can only
    upcall the HV and the HV is the only regime that can
    upcall the secure monitor.

    On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:

    That presumes a shared address space between the privilege
    levels - which is common for the OS and user-modes. It's
    not common (or particularly useful[*]) at any other privilege
    level.

    So, is this dichotomy because::

    a) HVs are good enough at virtualizing raw HW that GuestOS
    does not need a lot of paravirtualization to be efficient ??

    Yes. Once AMD added Nested Page Tables to SVM and the PCI-SIG
    proposed the SR-IOV capability, paravirtualization became anathema.

    Ok, back to Dan Cross:: (with help from Scott)

    If GuestOS wants to grab and hold onto a lock/mutex for a while
    to do some critical section stuff--does GuestOS "care" that HV
    can still take an interrupt while GuestOS is doing its CS thing ??
    since HV is not going to touch any memory associated with GuestOS.

    Generally, the Guest should execute "as if" it were running on
    Bare Metal. Consider an intel/amd processor running a bare-metal
    operating system that takes an interrupt into SMM mode; from the
    POV of a guest, an HV interrupt is similar to an SMM interrupt.

    If the SMM, Secure Monitor or HV modify guest memory in any way,
    all bets are off.

    Yes, but we have previously established HV does its virtualization
    without touching GuestOS memory. {Which is why I used page fault as
    the example.}


    In effect, I am asking whether Disable Interrupt is SW-stack-wide or only
    applicable to the current layer of the SW stack ?? One can equally
    use SW-stack-wide to mean core-wide.

    Current layer of the privilege stack. If there is a secure monitor
    at a more privileged level than the HV, it can take interrupts in a
    manner similar to the legacy SMM interrupts. Typically there will
    be independent periodic timer interrupts in the Guest OS, the HV, and
    the secure monitor.

    This agrees with the RISC-V approach where each layer in the stack
    has its own Interrupt Enable configuration. {Which is what led to
    my questions}.

    However, many architectures have only a single control bit for the
    whole core--which is why I am trying to get a complete understanding
    of what is required and what is choice. That some control IS
    required is clear--how many seems to be a matter of choice at this stage.

    Would it be unwise of me to speculate that a control at each layer
    is more optimal, or that the critical section that is delayed due
    to "other stuff needing to be handled" should have taken precedent.

    Anyone know of any literature where this was simulated or measured ??
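    For reference, in RISC-V those per-layer enables live in per-mode status
    CSRs: a supervisor-mode critical section clears only the SIE bit of its
    own status register (sstatus, transparently redirected to vsstatus when
    running as a virtualized guest under the H-extension), leaving
    mstatus.MIE untouched. A minimal sketch in GCC-style inline assembly:

        /* Supervisor-mode critical section on RISC-V: clear/set only SIE,
           bit 1 of sstatus.  M-mode's MIE is separate state and is never
           affected; under the H-extension a guest's access lands on
           vsstatus instead.                                               */
        static inline void s_mode_irq_off(void)
        {
            __asm__ volatile ("csrci sstatus, 0x2" ::: "memory");
        }

        static inline void s_mode_irq_on(void)
        {
            __asm__ volatile ("csrsi sstatus, 0x2" ::: "memory");
        }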


    For example:: GuestOS DIs, and HV takes a page fault from GuestOS;

    Note that these will be rare, and will occur only if the HV overcommits
    physical memory.

    makes the page resident and accessible, and allows GuestOS to run
    from the point of fault. GuestOS "sees" no interrupt and nothing
    in GuestOS VAS is touched by HV in servicing the page fault.

    The only way that the guest OS or guest OS application can detect
    such an event is if it measures an affected load/store - a covert
    channel. So there may be security considerations.

    Damn that high precision clock .....

    Which also leads to the question of should a Virtual Machine have
    its own virtual time ?? {Or VM and VMM share the concept of virtual
    time} ??


    Now, sure that lock is held while the page fault is being serviced,
    and the ugly head of priority inversion takes hold. But ... I am in
    need of some edumacation here.

    Priority inversion is only applicable within a privilege level/ring. Interrupts to a higher privilege level cannot be masked by an active interrupt at a lower priority level.

    So, if core is running HyperVisor at priority 15 and a user interrupt
    arrives at a higher priority but directed at GuestOS (instead of HV)
    does::
    a) HV continue leaving higher priority interrupt waiting.
    b) switch back to GuestOS for higher priority interrupt--in such
    . a way that when GuestOS returns from interrupt HV takes over
    . from whence it left.

    This is really a question of what priority means across the entire
    SW stack--and real-time versus Linux may have different answers on
    this matter.

    The higher privilege level must not unilaterally modify guest OS or application state.

    Given the almost complete lack of shared address spaces in a manner
    where pointers can be passed between, there is almost nothing an HV
    can do to a GuestOS VAS unless GuestOS has asked for an HV service via a
    paravirtualization entry point.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Apr 16 14:07:36 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 15 Apr 2025 14:02:37 +0000, Scott Lurndal wrote:




    Current layer of the privilege stack. If there is a secure monitor
    at a more privileged level than the HV, it can take interrupts in a
    manner similar to the legacy SMM interrupts. Typically there will
    be independent periodic timer interrupts in the Guest OS, the HV, and
    the secure monitor.

    This agrees with the RISC-V approach where each layer in the stack
    has its own Interrupt Enable configuration. {Which is what led to
    my questions}.

    AArch64 also has interrupt enables at each privilege level.


    However, many architectures have only a single control bit for the
    whole core--which is why I am trying to get a complete understanding
    of what is required and what is choice. That some control IS
    required is clear--how many seems to be a matter of choice at this stage.

    I'm not aware of any architecture that supports virtualization that
    doesn't have enables for each privilege level; either there are
    distinct levels in hardware, or the hypervisor needs to handle
    all interrupts and inject them into the guest in some fashion. Best
    to have hardware support for all of this rather than the overhead
    of the HV handling all interrupts and the consequent context switches.

    Would it be unwise of me to speculate that a control at each layer
    is more optimal, or that the critical section that is delayed due
    to "other stuff needing to be handled" should have taken precedent.

    The former is optimal. Assuming the guest is independent of the
    HV, any delay in the critical section (e.g. due to an HV interrupt
    being handled) is inconsequential. The critical section is only
    critical to the privilege layer it occurs on.

    <snip>



    The only way that the guest OS or guest OS application can detect
    such an event is if it measures an affected load/store - a covert
    channel. So there may be security considerations.

    Damn that high precision clock .....

    Which also leads to the question of should a Virtual Machine have
    its own virtual time ?? {Or VM and VMM share the concept of virtual
    time} ??

    Generally, yes. Usually modeled with an offset register in
    the HV that gets applied to the guest view of current time.
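    ARM's generic timer is the concrete example: the guest reads a virtual
    counter that the hardware derives by subtracting an HV-owned offset
    (CNTVOFF) from the physical counter, with no trap on each read. Roughly,
    as an illustrative model rather than real register-access code:

        #include <stdint.h>

        static uint64_t cntpct;    /* stands in for the physical counter     */
        static uint64_t cntvoff;   /* stands in for the HV-programmed offset */

        /* What a guest read of the virtual counter returns: the HV sets the
           offset once, and every guest read sees physical time minus it.    */
        uint64_t guest_read_virtual_counter(void)
        {
            return cntpct - cntvoff;
        }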



    Now, sure that lock is held while the page fault is being serviced,
    and the ugly head of priority inversion takes hold. But ... I am in
    need of some edumacation here.

    Priority inversion is only applicable within a privilege level/ring.
    Interrupts to a higher privilege level cannot be masked by an active
    interrupt at a lower priority level.

    So, if core is running HyperVisor at priority 15 and a user interrupt
    arrives at a higher priority but directed at GuestOS (instead of HV)
    does::
    a) HV continue leaving higher priority interrupt waiting.
    b) switch back to GuestOS for higher priority interrupt--in such
    . a way that when GuestOS returns from interrupt HV takes over
    . from whence it left.

    ARM, for example, splits the per-core interrupt priority range into halves
    - one half is assigned to the secure monitor and the other is assigned to
    the non-secure software running on the core. Early hypervisors would field
    all non-secure interrupts and either handle them themselves or inject them
    into the guest. The first ARM64 cores would field all interrupts in the HV,
    and the interrupt controller had special registers the HV could use to
    inject interrupts into the guest. The overhead was not insignificant, so
    they added a mechanism to allow some interrupts to be fielded directly by
    the guest itself - avoiding the round trip through the HV on every
    interrupt (called virtual LPIs).
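    Conceptually, the round trip being removed looks like the sketch below
    (the helper names are hypothetical, not the GIC programming interface):

        /* Before direct delivery: every device interrupt traps to the HV,
           which injects a virtual interrupt and resumes the guest - two
           world switches per interrupt.  With direct injection (e.g.
           virtual LPIs) the interrupt controller delivers straight to the
           running guest, and the HV is involved only when the target vCPU
           is not resident.                                                 */

        int   map_phys_to_virtual(int physical_intid);      /* hypothetical */
        void  inject_virtual_irq(void *vcpu, int virq);     /* hypothetical */
        void *current_guest_vcpu(void);                     /* hypothetical */

        void hv_irq_entry(int physical_intid)
        {
            int virq = map_phys_to_virtual(physical_intid);
            inject_virtual_irq(current_guest_vcpu(), virq);
            /* ...then world-switch back into the guest. */
        }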


    This is really a question of what priority means across the entire
    SW stack--and real-time versus Linux may have different answers on
    this matter.

    The higher privilege level must not unilaterally modify guest OS or
    application state.

    Given the almost complete lack of shared address spaces in a manner
    where pointers can be passed between, there is almost nothing an HV
    can do to a GuestOS VAS unless GuestOS has asked for an HV service via a
    paravirtualization entry point.

    The HV owns the translation tables for guest to physical address, so
    it can pretty much do anything it wants with that access[*], including
    modifying guest processor and memory state at any time - absent
    potential future features such as hardware guest memory encryption
    or memory access controls at a level higher than the HV (e.g. the
    secure monitor - see AArch64 Realms, for example).

    https://developer.arm.com/documentation/den0126/0101/Overview

    [*] the hypervisor can easily double map a page in both the guest PAS
    and the HV VAS - a technique common in paravirtualized environments.
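    A sketch of that double mapping, with hypothetical page-table helpers;
    the only point is that both mappings resolve to the same host-physical
    page, so guest and HV can share a ring or argument buffer:

        #include <stdint.h>

        typedef uint64_t gpa_t;   /* guest-physical address          */
        typedef uint64_t hpa_t;   /* host (machine) physical address */

        hpa_t stage2_translate(gpa_t gpa);      /* hypothetical: walk stage-2    */
        void *hv_map(hpa_t hpa, uint64_t len);  /* hypothetical: map into HV VAS */

        /* The guest reaches this page through its own stage-1 + stage-2
           tables; the HV reaches the same host-physical page through the
           mapping created here.                                           */
        void *hv_view_of_guest_page(gpa_t gpa)
        {
            return hv_map(stage2_translate(gpa), 4096);
        }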

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Apr 16 21:13:43 2025
    On Wed, 16 Apr 2025 14:07:36 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    ---------snip-----------

    So, if core is running HyperVisor at priority 15 and a user interrupt
    arrives at a higher priority but directed at GuestOS (instead of HV)
    does::
    a) HV continue leaving higher priority interrupt waiting.
    b) switch back to GuestOS for higher priority interrupt--in such
    . a way that when GuestOS returns from interrupt HV takes over
    . from whence it left.

    ARM, for example, splits the per-core interrupt priority range into halves
    - one half is assigned to the secure monitor and the other is assigned to
    the non-secure software running on the core.

    Thus, my predilection for 64-priority levels (rather than ~8 as suggested
    by another participant) allows for this distribution of priorities across
    layers in the SW stack at the discretion of trustable-SW.

    Early hypervisors would field all non-secure interrupts and either handle
    them themselves or inject them into the guest. The first ARM64 cores would
    field all interrupts in the HV, and the interrupt controller had special
    registers the HV could use to inject interrupts into the guest. The
    overhead was not insignificant, so they added a mechanism to allow some
    interrupts to be fielded directly by the guest itself - avoiding the round
    trip through the HV on every interrupt (called virtual LPIs).

    Given 4 layers in the stack {Secure, Hyper, Super, User} and we have
    interrupts targeting {Secure, Hyper, Super}, do we pick up any liability
    or do we gain flexibility by being able to target interrupts directly to
    {user} ?? (the 4th element).

    Roughly: HW maintains 4 copies of state and generally indexes state
    with a 2-bit value, and the "structure" of thread-header is identical
    between layers; thus, indexing down to {user} falls out for free.

    {{But I could be off my rocker...again}}
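    In C terms the "falls out for free" argument is just an array index; a
    sketch with a made-up thread-header layout:

        #include <stdint.h>

        /* Made-up layout; the point is only that all four layers share one
           structure, so a 2-bit layer number is a plain array index.       */
        struct thread_header {
            uint64_t sp;         /* stack pointer for this layer      */
            uint64_t ip;         /* resume address                    */
            uint64_t root;       /* address-space / table root        */
            uint64_t int_state;  /* per-layer interrupt enables, etc. */
        };

        enum layer { USER, SUPER, HYPER, SECURE };

        static struct thread_header hdr[4];   /* one copy per layer */

        /* Delivering an interrupt to layer L just selects hdr[L]; letting
           {user} be a target is index 0, no extra mechanism required.     */
        static struct thread_header *target_state(enum layer l) { return &hdr[l]; }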

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Wed Apr 16 15:26:12 2025
    On 4/16/2025 2:13 PM, MitchAlsup1 wrote:
    On Wed, 16 Apr 2025 14:07:36 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    ---------snip-----------

    So, if core is running HyperVisor at priority 15 and a user interrupt
    arrives at a higher priority but directed at GuestOS (instead of HV)
    does::
    a) HV continue leaving higher priority interrupt waiting.
    b) switch back to GuestOS for higher priority interrupt--in such
    . a way that when GuestOS returns from interrupt HV takes over
    . from whence it left.

    ARM, for example, splits the per-core interrupt priority range into halves
    - one half is assigned to the secure monitor and the other is assigned to
    the non-secure software running on the core.

    Thus, my predilection for 64-priority levels (rather than ~8 as suggested
    by another participant) allows for this distribution of priorities across
    layers in the SW stack at the discretion of trustable-SW.

    Early hypervisors would field all non-secure interrupts and either handle
    them themselves or inject them into the guest. The first ARM64 cores would
    field all interrupts in the HV, and the interrupt controller had special
    registers the HV could use to inject interrupts into the guest. The
    overhead was not insignificant, so they added a mechanism to allow some
    interrupts to be fielded directly by the guest itself - avoiding the round
    trip through the HV on every interrupt (called virtual LPIs).

    Given 4 layers in the stack {Secure, Hyper, Super, User} and we have interrupts targeting {Secure, Hyper, Super}, do we pick up any liability
    or do we gain flexibility by being able to target interrupts directly to {user} ?? (the 4th element).

    I think you could gain a tiny amount of efficiency if the OS (super)
    allowed the user to set up handlers for certain classes of exceptions
    (e.g. divide faults) itself rather than having to go through the super.
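    The nearest thing today still routes through the OS - e.g. a POSIX
    SIGFPE handler; the proposal above would deliver the fault to user code
    without that kernel round trip. For comparison:

        #include <signal.h>
        #include <unistd.h>

        /* Today: the divide fault traps to the OS, which reflects it back to
           the process as SIGFPE.  Direct user-level delivery would skip the
           trip through the supervisor entirely.                              */
        static void fpe_handler(int sig)
        {
            (void)sig;
            write(2, "caught divide fault\n", 20);
            _exit(1);   /* returning from SIGFPE on a faulting divide is UB */
        }

        int main(void)
        {
            signal(SIGFPE, fpe_handler);
            volatile int zero = 0;
            return 1 / zero;   /* provoke the fault */
        }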


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Stephen Fuld on Thu Apr 17 00:57:12 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 4/16/2025 2:13 PM, MitchAlsup1 wrote:
    On Wed, 16 Apr 2025 14:07:36 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    ---------snip-----------

    So, if core is running HyperVisor at priority 15 and a user interrupt
    arrives at a higher priority but directed at GuestOS (instead of HV)
    does::
    a) HV continue leaving higher priority interrupt waiting.
    b) switch back to GuestOS for higher priority interrupt--in such
    . a way that when GuestOS returns from interrupt HV takes over
    . from whence it left.

    ARM, for example, splits the per-core interrupt priority range into halves
    - one half is assigned to the secure monitor and the other is assigned to
    the non-secure software running on the core.

    Thus, my predilection for 64-priority levels (rather than ~8 as suggested
    by another participant) allows for this distribution of priorities across
    layers in the SW stack at the discretion of trustable-SW.

    Early hypervisors would field all non-secure interrupts and either handle
    them themselves or inject them into the guest. The first ARM64 cores would
    field all interrupts in the HV, and the interrupt controller had special
    registers the HV could use to inject interrupts into the guest. The
    overhead was not insignificant, so they added a mechanism to allow some
    interrupts to be fielded directly by the guest itself - avoiding the round
    trip through the HV on every interrupt (called virtual LPIs).

    Given 4 layers in the stack {Secure, Hyper, Super, User} and we have
    interrupts targeting {Secure, Hyper, Super}, do we pick up any liability
    or do we gain flexibility by being able to target interrupts directly to
    {user} ?? (the 4th element).

    I think you could gain a tiny amount of efficiency if the OS (super)
    allowed the user to set up handlers for certain classes of exceptions
    (e.g. divide faults) itself rather than having to go through the super.

    Think carefully about the security implications of user-mode interrupt
    delivery. Particularly with respect to potential impacts on other
    processes running on the system, and to overall system functionality.

    Handling interrupts requires direct access to the hardware from
    user-mode.

    Hardware access is normally done in the context of a 'sandboxed'
    PCI Express SRIOV function which the application can access directly;
    the hardware guarantees that the user process cannot adversely
    affect the hardware or other guests using other virtual functions.

    However, the interrupt controller itself (e.g. the mechanism used
    to acknowledge the interrupt to the interrupt controller after it
    has been serviced - e.g. the LAPIC) isn't virtualized, and direct
    access to that shouldn't be available to user-mode for fairly obvious
    reasons.

    That's why DPDK/ODP require the OS to handle interrupts and notify
    the application via standard OS notification mechanisms even
    when using SR-IOV capable hardware for the actual packet handling.
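    In practice "standard OS notification mechanisms" usually means something
    like an eventfd the kernel driver signals when the interrupt fires; the
    application just blocks on the file descriptor. A minimal Linux sketch,
    with the wiring that hands the fd to the driver elided:

        #include <sys/eventfd.h>
        #include <unistd.h>
        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
            /* In a VFIO/DPDK-style setup this fd is registered with the
               kernel driver, which writes to it from its interrupt handler.
               The user process never touches the interrupt controller.      */
            int efd = eventfd(0, 0);
            if (efd < 0)
                return 1;

            uint64_t count;
            /* Blocks until the kernel-side handler signals the eventfd. */
            if (read(efd, &count, sizeof count) == (ssize_t)sizeof count)
                printf("interrupt notification, count=%llu\n",
                       (unsigned long long)count);
            close(efd);
            return 0;
        }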

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Thu Apr 17 00:47:38 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 16 Apr 2025 14:07:36 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    ---------snip-----------

    So, if core is running HyperVisor at priority 15 and a user interrupt
    arrives at a higher priority but directed at GuestOS (instead of HV)
    does::
    a) HV continue leaving higher priority interrupt waiting.
    b) switch back to GuestOS for higher priority interrupt--in such
    . a way that when GuestOS returns from interrupt HV takes over
    . from whence it left.

    ARM, for example, splits the per-core interrupt priority range into halves
    - one half is assigned to the secure monitor and the other is assigned to
    the non-secure software running on the core.

    Thus, my predilection for 64-priority levels (rather than ~8 as suggested
    by another participant) allows for this distribution of priorities across
    layers in the SW stack at the discretion of trustable-SW.

    Architecturally, the ARM64 interrupt priority can vary from 3 to 8
    bits. Most implementations implement 5 bits, allowing 16 secure
    and 16 non-secure priority levels. They can be grouped using
    a binary point register, if required.
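    As a worked example (numbers only; the real GIC encoding is a little more
    involved): 5 implemented bits give 32 distinct priorities, the secure /
    non-secure split leaves 16 per world, and the binary point divides each
    value into a group priority (which can preempt) and a subpriority (which
    only orders pending interrupts).

        #include <stdio.h>

        int main(void)
        {
            unsigned prio_bits = 5;
            unsigned levels    = 1u << prio_bits;        /* 32 values    */
            unsigned per_world = levels / 2;             /* 16 per world */

            unsigned bpr  = 2;                           /* illustrative split  */
            unsigned prio = 0x13;
            unsigned group = prio >> bpr;                /* preemption-relevant */
            unsigned sub   = prio & ((1u << bpr) - 1);   /* ordering only       */

            printf("%u levels, %u per world, group=%u sub=%u\n",
                   levels, per_world, group, sub);
            return 0;
        }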


    Early hypervisors would field all non-secure interrupts and either handle
    them themselves or inject them into the guest. The first ARM64 cores would
    field all interrupts in the HV, and the interrupt controller had special
    registers the HV could use to inject interrupts into the guest. The
    overhead was not insignificant, so they added a mechanism to allow some
    interrupts to be fielded directly by the guest itself - avoiding the round
    trip through the HV on every interrupt (called virtual LPIs).

    Given 4 layers in the stack {Secure, Hyper, Super, User} and we have
    interrupts targeting {Secure, Hyper, Super}, do we pick up any liability
    or do we gain flexibility by being able to target interrupts directly to
    {user} ?? (the 4th element).

    On ARM there are only two interrupt signals from the interrupt controller
    to each core: FIQ and IRQ.

    Each of the signals can be 'claimed' by one, and only one privilege
    level on that core; if the secure monitor claims FIQ, then it can only be delivered
    to EL3.

    If running bare-metal, the OS (EL1) will claim the IRQ signal (by default if none of the more privileged levels claim it).

    If a hypervisor (EL2) is running, it will claim the IRQ signal and field
    all physical interrupts, except for virtual LPI and IPI interrupts which the hardware can inject directly into the guest (which may result in an
    interrupt to the hypervisor if the guest isn't resident on the target
    CPU).

    In a virtualized environment, one needs to be very careful when
    exposing hardware interrupt signals directly to the guest operating system,
    as that often requires exposing some of the interrupt controller.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)