On 3/30/2025 7:16 AM, Robert Finch wrote:
Just got to thinking about stack canaries. I was going to have a special
purpose register holding the canary value for testing while the program
was running. But I just realized today that it may not be needed. Canary
values could be handled by the program loader as constants, eliminating
the need for a register. Since the value is not changing while the
program is running, it could easily be a constant. This may require a
fixup record handled by the assembler / linker to indicate to the loader
to place a canary value.
Prolog code would just store an immediate to the stack. On return a TRAP
instruction could check for the immediate value and trap if not present.
But the process seems to require assembler / linker support.
They are mostly just a normal compiler feature IME:
Prolog stores the value;
Epilog loads it and verifies that the value is intact.
Using a magic number
Nothing fancy needed in the assembly or link stages.
In my case, canary behavior is one of:
Use them in functions with arrays or similar (default);
Use them everywhere (optional);
Disable them entirely (also optional).
In my case, it is only checking 16-bit magic numbers, but mostly because
a 16-bit constant is cheaper to load into a register in this case
(single 32-bit instruction, vs a larger encoding needed for larger
values).
....
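A minimal C sketch of the store-then-verify pattern described above
(the 16-bit magic follows BGB's choice; the abort() on mismatch is
illustrative). A real compiler places the canary between the locals and
the saved return address, which plain C cannot express, so this only
shows the shape of the check:

#include <stdint.h>
#include <stdlib.h>

#define CANARY_MAGIC 0x5A5Au  /* 16-bit magic: cheap to load as an immediate */

void example(void) {
    volatile uint16_t canary = CANARY_MAGIC;  /* prolog: store the value */
    char buf[32];
    /* ... function body that might overflow buf ... */
    (void)buf;
    if (canary != CANARY_MAGIC)  /* epilog: verify the value is intact */
        abort();                 /* trap if not present */
}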
On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:
On 3/30/2025 7:16 AM, Robert Finch wrote:
Just got to thinking about stack canaries. I was going to have a special
purpose register holding the canary value for testing while the program
was running. But I just realized today that it may not be needed. Canary
values could be handled by the program loader as constants, eliminating
the need for a register. Since the value is not changing while the
program is running, it could easily be a constant. This may require a
fixup record handled by the assembler / linker to indicate to the loader
to place a canary value.
Prolog code would just store an immediate to the stack. On return a TRAP
instruction could check for the immediate value and trap if not present.
But the process seems to require assembler / linker support.
They are mostly just a normal compiler feature IME:
Prolog stores the value;
Epilog loads it and verifies that the value is intact.
Agreed.
Using a magic number
Remove excess words.
Nothing fancy needed in the assembly or link stages.
They remain blissfully ignorant--at most they generate the magic
number, possibly at random, possibly per link-module.
In my case, canary behavior is one of:
Use them in functions with arrays or similar (default);
Use them everywhere (optional);
Disable them entirely (also optional).
In my case, it is only checking 16-bit magic numbers, but mostly because
a 16-bit constant is cheaper to load into a register in this case
(single 32-bit instruction, vs a larger encoding needed for larger
values).
....
On 3/31/2025 11:04 AM, Stephen Fuld wrote:
On 3/30/2025 1:14 PM, MitchAlsup1 wrote:
On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:
-------------
They are mostly just a normal compiler feature IME:
Prolog stores the value;
Epilog loads it and verifies that the value is intact.
Agreed.
I'm glad you, Mitch, chimed in here. When I saw this, it occurred to me
that this could be done automatically by the hardware (optionally, based
on a bit in a control register). The CALL instruction would store the
magic value, and the RET instruction would test it. If there was not a
match, an exception would be generated. The value itself could be
something like the clock value when the program was initiated, thus
guaranteeing uniqueness.
The advantage over the software approach, of course, is the elimination
of several instructions in each prolog/epilog, reducing footprint, and
perhaps even time as it might be possible to overlap some of the
processing with the other things these instructions do. The downside is
more hardware and perhaps extra overhead.
Does this make sense? What have I missed?
This would seem to imply an ISA where CALL/RET push onto the stack or similar, rather than the (more common for RISC's) strategy of copying PC
into a link register...
Another option being if it could be a feature of a Load/Store Multiple.
Say, LDM/STM:
6b Hi (Upper bound of registers to save)
6b Lo (Lower bound of registers to save)
1b LR (Flag to save Link Register)
1b GP (Flag to save Global Pointer)
1b SK (Flag to generate a canary)
Likely (STM):
Pushes LR first (if bit set);
Pushes GP second (if bit set);
Pushes registers in range (if Hi>=Lo);
Pushes stack canary (if bit set).
LDM would check the canary first and fault if it doesn't see the
expected value.
Downside, granted, is needing the relative complexity of an LDM/STM
style instruction.
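A rough C model of the proposed STM encoding and its push order,
assuming a full-descending stack (the Hi/Lo/LR/GP/SK fields are from
the description above; everything else is illustrative):

#include <stdint.h>

struct stm_fields {
    unsigned hi : 6;  /* upper bound of register range to save */
    unsigned lo : 6;  /* lower bound of register range to save */
    unsigned lr : 1;  /* save Link Register first */
    unsigned gp : 1;  /* save Global Pointer second */
    unsigned sk : 1;  /* push a stack canary last */
};

void stm_emulate(struct stm_fields f, const uint64_t *regs,
                 uint64_t lr, uint64_t gp, uint64_t canary, uint64_t **sp) {
    if (f.lr) *--(*sp) = lr;          /* pushes LR first (if bit set) */
    if (f.gp) *--(*sp) = gp;          /* pushes GP second (if bit set) */
    if (f.hi >= f.lo)                 /* pushes registers in range */
        for (unsigned r = f.hi; ; r--) {
            *--(*sp) = regs[r];
            if (r == f.lo) break;     /* avoids wrap when lo == 0 */
        }
    if (f.sk) *--(*sp) = canary;      /* pushes stack canary (if bit set) */
}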
Other ISAs use a flag bit for each register, but this is less viable
with an ISA with a larger number of registers, well, unless one uses a
64 or 96 bit LDM/STM encoding (possible). Merit though would be not
needing multiple LDM's / STM's to deal with a discontinuous register
range.
Well, also excluding the possibility where the LDM/STM is essentially
just a function call (say, if beyond a certain number of registers are
to be saved/restored, the compiler generates a call to a save/restore
sequence, which is also generated as-needed). Granted, this is basically
the strategy used by BGBCC. If multiple functions happen to save/restore
the same combination of registers, they get to reuse the prior
function's save/restore sequence (generally folded off to before the
function in question).
Granted, the folding strategy can still do canary values, but doing so
in the reused portions would limit the range of unique canary values
(well, unless the canary magic is XOR'ed with SP or something...).
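The SP-mixing idea in one line, as a sketch (names illustrative):

#include <stdint.h>

static inline uint64_t frame_canary(uint64_t magic, void *sp) {
    /* mix the shared magic with the frame's SP so reused (folded)
       save/restore sequences still check a per-frame value */
    return magic ^ (uint64_t)(uintptr_t)sp;
}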
On 3/31/2025 1:07 PM, MitchAlsup1 wrote:
-------------
Another option being if it could be a feature of a Load/Store Multiple.
Say, LDM/STM:
6b Hi (Upper bound of register to save)
6b Lo (Lower bound of registers to save)
1b LR (Flag to save Link Register)
1b GP (Flag to save Global Pointer)
1b SK (Flag to generate a canary)
ENTER and EXIT have 2 of those flags--but also note use of SP and CSP
are implicit.
Likely (STM):
Pushes LR first (if bit set);
Pushes GP second (if bit set);
Pushes registers in range (if Hi>=Lo);
Pushes stack canary (if bit set).
EXIT uses its 3rd flag when doing longjump() and THROW(),
so as to pop the call-stack but not actually RET from the stack
walker.
OK.
I guess one could debate whether an LDM could treat the Load-LR as "Load
LR" or "Load address and Branch", and/or have separate flags (Load LR vs
Load PC, with Load PC meaning to branch).
Other ABIs may not have as much reason to save/restore the Global
Pointer all the time. But, in my case, it is being used as the primary
way of accessing globals, and each binary image has its own address
range here.
PC-Rel is not being used, as PC-Rel doesn't allow for multiple process
instances of a given loaded binary within a shared address space.
Vs, say, for PIE ELF binaries where it is needed to load a new copy for
each process instance because of this (well, excluding an FDPIC style
ABI, but seemingly no one has bothered adding FDPIC support in GCC or
friends for RV64 based targets, ...).
Well, granted, because Linux and similar tend to load every new process
into its own address space and/or use CoW.
Other ISAs use a flag bit for each register, but this is less viable
with an ISA with a larger number of registers, well, unless one uses a
64 or 96 bit LDM/STM encoding (possible). Merit though would be not
needing multiple LDM's / STM's to deal with a discontinuous register
range.
To quote Trevor Smith:: "Why would anyone want to do that" ??
Discontinuous register ranges:
Because pretty much no ABI's put all of the callee save registers in a contiguous range.
Granted, I guess if someone were designing an ISA and ABI clean, they
could make all of the argument registers and callee save registers contiguous.
Say:
R0..R3: Special
R4..R15: Scratch
R16..R31: Argument
R32..R63: Callee Save
....
But, invariably, someone will want "compressed" instructions with a
subset of the registers, and one can't just have these only having
access to argument registers.
Well, also excluding the possibility where the LDM/STM is essentially
just a function call (say, if beyond a certain number of registers are
to be saved/restored, the compiler generates a call to a save/restore
sequence, which is also generated as-needed). Granted, this is basically
the strategy used by BGBCC. If multiple functions happen to save/restore
the same combination of registers, they get to reuse the prior
function's save/restore sequence (generally folded off to before the
function in question).
Calling a subroutine to perform epilogues is adding to the number of
branches a program executes. Having an instruction like EXIT means
when you know you need to exit, you EXIT; you don't branch to the exit
point. Saving instructions.
Prolog needs a call, but epilog can just be a branch, since no need to
return back into the function that is returning.
Needs to have a lower limit though, as it is not worth it to use a call/branch to save/restore 3 or 4 registers...
But, say, 20 registers, it is more worthwhile.
Canary values are in addition to ENTER and EXIT, not part of them.
Granted, the folding strategy can still do canary values, but doing so
in the reused portions would limit the range of unique canary values
(well, unless the canary magic is XOR'ed with SP or something...).
On 2025-03-31 4:52 p.m., MitchAlsup1 wrote:
--------------------
On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:
ENTER saves as few as 1 or as many as 32 and remains that 1 single
instruction. Same for EXIT, and EXIT also performs the RET when LDing
R0.
Canary values are in addition to ENTER and EXIT, not part of them,
IMHO.
Granted, the folding strategy can still do canary values, but doing so
in the reused portions would limit the range of unique canary values
(well, unless the canary magic is XOR'ed with SP or something...).
In Q+3 there are push and pop multiple instructions. I did not want to
add load and store multiple on top of that. They work great for ISRs,
but not so great for task switching code. I have the instructions
pushing or popping up to 17 registers in a group. Groups of registers
overlap by eight. The instructions can handle all 96 registers in the
machine. ENTER and EXIT are also present.
It is looking like the context switch code for the OS will take about
3000 clock cycles to run.
Not wanting to disable interrupts for that long, I put a spinlock on
the system's task control block array. But I think I have run into an
issue. It is the timer ISR that switches tasks. Since it is an ISR it
pushes a subset of registers that it uses and restores them at exit.
But when exiting and switching tasks it spinlocks on the task control
block array. I am not sure this is a good thing, as the timer IRQ is
fairly high priority. If something else locked the TCB array it would
deadlock. I guess the context switching could be deferred until the app
requests some other operating system function. But then the issue is
what if the app gets stuck in an infinite loop, not calling the OS? I
suppose I could make an OS heartbeat function call a requirement of
apps. If the app does not do a heartbeat within a reasonable time, it
could be terminated.
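A sketch of that heartbeat scheme, with all of the names invented for
illustration:

#include <stdint.h>

#define HEARTBEAT_LIMIT 1000  /* ticks without a heartbeat before termination */

struct task {
    uint64_t last_beat;
    int alive;
};

extern volatile uint64_t os_ticks;  /* advanced by the timer ISR */

void os_heartbeat(struct task *t) { t->last_beat = os_ticks; }

void timer_check_heartbeats(struct task *tasks, int n) {
    for (int i = 0; i < n; i++)
        if (tasks[i].alive && os_ticks - tasks[i].last_beat > HEARTBEAT_LIMIT)
            tasks[i].alive = 0;  /* mark the stuck app for termination */
}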
Q+3 progresses rapidly. A lot of the stuff in earlier versions was
removed. The pared down version is a 32-bit machine. Expecting some
headaches because of the use of condition registers and branch
registers.
On 2025-04-01 2:51 p.m., MitchAlsup1 wrote:
------------------
On Tue, 1 Apr 2025 4:58:58 +0000, Robert Finch wrote:
It is looking like the context switch code for the OS will take about
3000 clock cycles to run.
How much of that is figuring out who to switch to and, now that that has
been decided, make the context switch manifest ??
That was just for making the switch. I calculated based on the
number of register loads and stores x2 and then times 13 clocks for
memory access, plus a little bit of overhead for other instructions.
Deciding who to switch to may be another good chunk of time. But the
system is using a hardware ready list, so the choice is just to pop
(load) the top task id off the ready list. The guts of the switcher is
only about 30 LOC, but it calls a couple of helper routines.
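(As a rough cross-check of that estimate: with the ~100 loads and stores
each way mentioned below, 100 x 2 directions x 13 clocks is about 2600
cycles, and the remaining few hundred is the overhead.)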
On 3/31/2025 3:52 PM, MitchAlsup1 wrote:
---------------------
On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:
PC-Rel is not being used, as PC-Rel doesn't allow for multiple process
instances of a given loaded binary within a shared address space.
As long as the relative distance is the same, it does.
Can't happen within a shared address space.
Say, if you load a single copy of a binary at 0x24680000.
Process A and B can't use the same mapping in the same address space,
with PC-rel globals, as then they would each see the other's globals.
You can't do a duplicate mapping at another address, as this both wastes
VAS, and also any Abs64 base-relocs or similar would differ.
You also can't CoW the data/bss sections, as this is no longer a shared address space.
So, alternative is to use GBR to access globals, with the data/bss
sections allocated independently of the binary.
This way, multiple processes can share the same mapping at the same
address for any executable code and constant data, with only the data sections needing to be allocated.
Does mean though that one needs to save/restore the global pointer, and
there is a ritual for reloading it.
EXE's generally assume they are index 0, so:
MOV.Q (GBR, 0), Rt
MOV.Q (Rt, 0), GBR
Or, in RV terms:
LD X6, 0(X3)
LD X3, Disp33(X6)
Or, RV64G:
LD X6, 0(X3)
LUI X5, DispHi
ADD X5, X5, X6
LD X3, DispLo(X5)
For DLL's, the index is fixed up with a base-reloc (for each loaded
DLL), so basically the same idea. Typically a Disp33 is used here to
allow for a potentially large/unknown number of loaded DLL's. Thus far,
a global numbering scheme is used.
Where, (GBR+0) gives the address of a table of global pointers for every loaded binary (can be assumed read-only from userland).
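The same ritual as a C sketch ((GBR+0) as the table address and the
EXE-is-index-0 convention are from the description above; the function
name is invented):

static inline void *reload_global_pointer(void **gbr, unsigned module_index) {
    void **table = (void **)gbr[0];  /* (GBR+0): table of global pointers */
    return table[module_index];      /* EXE assumes index 0; each DLL's
                                        index is fixed up at load time */
}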
Generally, this is needed if:
Function may be called from outside of the current binary and:
Accesses global variables;
And/or, calls local functions.
Though, still generally lower average-case overhead than the strategy typically used by FDPIC, which would handle this reload process on the
caller side...
SD X3, Disp(SP)    # save the caller's GP
LD X3, 8(X18)      # load the callee's GP (X18: descriptor pointer, assumed)
LD X6, 0(X18)      # load the callee's entry address
JALR X1, 0(X6)     # call
LD X3, Disp(SP)    # restore the caller's GP
Though, execl() effectively replaces the current process.
IMHO, a "CreateProcess()" style abstraction makes more sense than
fork+exec.
But, invariably, someone will want "compressed" instructions with a
subset of the registers, and one can't just have these only having
access to argument registers.
Brian had little trouble using My 66000 ABI which does have contiguous
register groupings.
But, My66000 also isn't like, "Hey, how about 16-bit ops with 3 or 4 bit register numbers".
Not sure the thinking behind the RV ABI.
Prolog needs a call, but epilog can just be a branch, since no need to
return back into the function that is returning.
Yes, but this means My 66000 executes 3 fewer transfers of control
per subroutine than you do. And taken branches add latency.
Granted.
Each predicted branch adds 2 cycles.
Needs to have a lower limit though, as it is not worth it to use a
call/branch to save/restore 3 or 4 registers...
But, say, 20 registers, it is more worthwhile.
ENTER saves as few as 1 or as many as 32 and remains that 1 single
instruction. Same for EXIT and exit also performs the RET when LDing
R0.
Granted.
My strategy isn't perfect:
Non-zero branching overheads, when the feature is used;
Per-function load/store slides in prolog/epilog, when not used.
Then, the heuristic mostly becomes one of when it is better to use the
inline strategy (load/store slide), or to fold them off and use calls/branches.
Does technically also work for RISC-V though (though seemingly GCC
always uses inline save/restore, but also the RV ABI has fewer
registers).
On 2025-04-01 7:24 p.m., MitchAlsup1 wrote:
On Tue, 1 Apr 2025 22:06:10 +0000, Robert Finch wrote:
On 2025-04-01 2:51 p.m., MitchAlsup1 wrote:
------------------
On Tue, 1 Apr 2025 4:58:58 +0000, Robert Finch wrote:
It is looking like the context switch code for the OS will take about
3000 clock cycles to run.
How much of that is figuring out who to switch to and, now that that
has been decided, make the context switch manifest ??
That was just for making the switch. I calculated based on the
number of register loads and stores x2 and then times 13 clocks for
memory access, plus a little bit of overhead for other instructions.
Why is it not 13 cycles to get started and then each register is 1
cycle.
The CPU does not do pipe-lined burst loads. To load the cache line it
is two independent loads, 256-bits at a time. Stores post to the bus,
but I seem to remember having to space out the stores so the queue in
the memory controller did not overflow. Needs more work.
Stores should be faster, I think they are single cycle. But loads may
be quite slow if things are not in the cache. I should really measure
it.
It may not be as bad as I think. It is still 300 LOC, about 100 loads
and stores each way. Lots of move instructions for regs that cannot be
directly loaded or stored. And with CRs serializing the processor. But
the processor should eat up all the moves fairly quickly.
Deciding who to switch to may be another good chunk of time. But the
system is using a hardware ready list, so the choice is just to pop
(load) the top task id off the ready list. The guts of the switcher is
only about 30 LOC, but it calls a couple of helper routines.
Say, if you load a single copy of a binary at 0x24680000.
Process A and B can't use the same mapping in the same address space,
with PC-rel globals, as then they would each see the other's globals.
Say I load a copy of the binary text at 0x24680000 and its data at
0x35900000 for a distance of 0x11280000 into the address space of
a process.
Then I load another copy at 0x44680000 and its data at 0x55900000
into the address space of a different process.
But, yeah, inter-process function pointers aren't really a thing, and should not be a thing.
On 2025-04-03 1:22 p.m., BGB wrote:
-------------------
Or, to allow for NOMMU operation, or reduce costs by not having context
switches result in as large of numbers of TLB misses.
Also makes the kernel simpler as it doesn't need to deal with each
process having its own address space.
Have you seen the MPRV bit in RISC-V? Allows memory ops to execute
using the previous mode / address space. The bit just has to be set,
then do the memory op, then reset the bit. Makes it easy to access data
using the process address space.
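A sketch of what that looks like from RISC-V M-mode (MPRV is mstatus
bit 17; this assumes mstatus.MPP already holds the target mode and that
interrupts are disabled around the access):

#include <stdint.h>

/* Read one word through the previous mode's address translation. */
static inline uint64_t read_prev_mode_word(const uint64_t *va) {
    uint64_t v;
    __asm__ volatile(
        "csrs mstatus, %[mprv]\n\t"  /* set MPRV: loads/stores translate as MPP mode */
        "ld   %[v], 0(%[a])\n\t"
        "csrc mstatus, %[mprv]"      /* reset the bit */
        : [v] "=&r"(v)
        : [a] "r"(va), [mprv] "r"(1UL << 17)
        : "memory");
    return v;
}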
On 2025-04-04 5:13 p.m., MitchAlsup1 wrote:
On Fri, 4 Apr 2025 3:49:31 +0000, Robert Finch wrote:
On 2025-04-03 1:22 p.m., BGB wrote:
-------------------
Or, to allow for NOMMU operation, or reduce costs by not having context
switches result in as large of numbers of TLB misses.
Also makes the kernel simpler as it doesn't need to deal with each
process having its own address space.
Have you seen the MPRV bit in RISC-V? Allows memory ops to execute
using the previous mode / address space. The bit just has to be set,
then do the memory op, then reset the bit. Makes it easy to access data
using the process address space.
Let us postulate you are running in RISC-V HyperVisor on core[j]
and you want to write into GuestOS VAS and into application VAS
more or less simultaneously.
Would not writing to the GuestOS VAS and the application VAS be the
result of separate system calls? Or does the hypervisor take over for
the GuestOS?
Seems to me like you need a MPRV to be more than a single bit
so it could index which layer of the SW stack's VAS it needs
to touch.
So, there is a need to be able to go back two or three levels? I suppose
it could also be done by manipulating the stack, although adding an
extra bit may be easier. How often does it happen?
On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:
Would not writing to the GuestOS VAS and the application VAS be the
result of separate system calls? Or does the hypervisor take over for
the GuestOS?
Application has a 64-bit VAS
GuestOS has a 64-bit VAS
HyperVisor has a 64-bit VAS
and so does
Secure.
So, we are in HV and we need to write to guestOS and to Application
but we have only 1-bit of distinction.
mitchalsup@aol.com (MitchAlsup1) writes:
On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:
Would not writing to the GuestOS VAS and the application VAS be the
result of separate system calls? Or does the hypervisor take over for
the GuestOS?
Application has a 64-bit VAS
GuestOS has a 64-bit VAS
HyperVisor has a 64-bit VAS
and so does
Secure.
So, we are in HV and we need to write to guestOS and to Application
but we have only 1-bit of distinction.
On ARM64, when the HV needs to write to guest user VA or guest PA,
the SMMU provides an interface the processor can use to translate
the guest VA or Guest PA to the corresponding system physical address.
Of course, there is a race if the guest OS changes the underlying
translation tables during the upcall to the hypervisor or secure
monitor, although that would be a bug in the guest were it so to do,
since the guest explicitly requested the action from the higher
privilege level (e.g. HV).
Arm does have a set of load/store "user" instructions that translate
addresses using the unprivileged (application) translation tables.
There's also a processor state bit (UAO - User Access Override) that
can be set to force those instructions to use the permissions
associated with the current processor privilege level.
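For reference, the instructions described here are the A64 LDTR/STTR
("unprivileged") family; a sketch of wrapping one in C (executed at
EL1, the address translates via the EL0 regime):

#include <stdint.h>

static inline uint64_t load_user_word(const uint64_t *uva) {
    uint64_t v;
    __asm__ volatile("ldtr %0, [%1]"  /* unprivileged load */
                     : "=r"(v) : "r"(uva) : "memory");
    return v;
}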
Note that there is a push by all vendors to include support
for guest 'privacy', such that the hypervisor has no direct
access to memory owned by the guest, or where the guest
memory is encrypted using a key the hypervisor or secure monitor
don't have access to.
On 2025-04-05 2:31 p.m., Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:
Would not writing to the GuestOS VAS and the application VAS be the
result of separate system calls? Or does the hypervisor take over for
the GuestOS?
Application has a 64-bit VAS
GuestOS has a 64-bit VAS
HyperVisor has a 64-bit VAS
and so does
Secure.
So, we are in HV and we need to write to guestOS and to Application
but we have only 1-bit of distinction.
On ARM64, when the HV needs to write to guest user VA or guest PA,
the SMMU provides an interface the processor can use to translate
the guest VA or Guest PA to the corresponding system physical address.
Of course, there is a race if the guest OS changes the underlying
translation tables during the upcall to the hypervisor or secure
monitor, although that would be a bug in the guest were it so to do,
since the guest explicitly requested the action from the higher
privilege level (e.g. HV).
Arm does have a set of load/store "user" instructions that translate
addresses using the unprivileged (application) translation tables.
There's also a processor state bit (UAO - User Access Override) that
can be set to force those instructions to use the permissions
associated with the current processor privilege level.
Note that there is a push by all vendors to include support
for guest 'privacy', such that the hypervisor has no direct
access to memory owned by the guest, or where the guest
memory is encrypted using a key the hypervisor or secure monitor
don't have access to.
Okay, I was interpreting the RISC-V specs wrong. They have three bits
dedicated to this. One is an on/off bit and the other two are the mode
to use. I am left wondering how it is determined which mode to use. If
the hypervisor is passed a pointer to a VAS variable in a register, how
does it know whether the pointer is for the supervisor or the user/app?
It's why I assumed it found the mode from the stack. Those two select
bits have to be set somehow. It seems like extra code to access the
right address space.
I got the thought to use the three bits a bit differently.
111 = use current mode
110 = use mode from stack
100 = debug? mode
011 = secure (machine) mode
010 = hypervisor mode
001 = supervisor mode
000 = user/app mode
I was just using inline code to select the proper address space. But if
it is necessary to dig around to figure out the mode, it may turn into
a subroutine call.
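The 3-bit selector above, rendered as a C enum (encodings from the
post; the names are invented for illustration):

enum vas_select {
    VAS_USER    = 0,  /* 000 = user/app mode */
    VAS_SUPER   = 1,  /* 001 = supervisor mode */
    VAS_HYPER   = 2,  /* 010 = hypervisor mode */
    VAS_SECURE  = 3,  /* 011 = secure (machine) mode */
    VAS_DEBUG   = 4,  /* 100 = debug? mode */
    VAS_STACK   = 6,  /* 110 = use mode from stack */
    VAS_CURRENT = 7   /* 111 = use current mode */
};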
On Sat, 5 Apr 2025 18:31:44 +0000, Scott Lurndal wrote:
Arm does have a set of load/store "user" instructions that translate
addresses using the unprivileged (application) translation tables.
When Secure Monitor executes a "user" instruction, which layer
of the SW stack is accessed:: {HV, SV, User} ?
Is this 1 layer down the stack, or all layers down the stack ??
There's also a processor state bit (UAO - User Access Override) that
can be set to force those instructions to use the permissions
associated with the current processor privilege level.
That is how My 66000 MMU is defined--higher privilege layers
have R/W access to the next lower privilege layer--without
doing anything other than a typical LD or ST instruction.
I/O MMU has similar issues to solve, in that a device can read
write-execute-only memory and write read-execute-only memory.
Note that there is a push by all vendors to include support
for guest 'privacy', such that the hypervisor has no direct
access to memory owned by the guest, or where the the guest
memory is encrypted using a key the hypervisor or secure monitor
don't have access to.
I call these "paranoid" applications--generally requiring no
privilege, but they don't want GuestOS or HyperVisor to look
at their data and at the same time, they want GuestOS or HV
to perform I/O to said data--so some devices have an effective
privilege above that of the driver commanding them.
I understand the reasons and rational.
On 2025-04-05 2:31 p.m., Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:
Would not writing to the GuestOS VAS and the application VAS be the
result of separate system calls? Or does the hypervisor take over for
the GuestOS?
Application has a 64-bit VAS
GuestOS has a 64-bit VAS
HyperVisor has a 64-bit VAS
and so does
Secure.
So, we are in HV and we need to write to guestOS and to Application
but we have only 1-bit of distinction.
On ARM64, when the HV needs to write to guest user VA or guest PA,
the SMMU provides an interface the processor can use to translate
the guest VA or Guest PA to the corresponding system physical address.
Of course, there is a race if the guest OS changes the underlying
translation tables during the upcall to the hypervisor or secure
monitor, although that would be a bug in the guest were it so to do,
since the guest explicitly requested the action from the higher
privilege level (e.g. HV).
Arm does have a set of load/store "user" instructions that translate
addresses using the unprivileged (application) translation tables.
There's also a processor state bit (UAO - User Access Override) that
can be set to force those instructions to use the permissions
associated with the current processor privilege level.
Note that there is a push by all vendors to include support
for guest 'privacy', such that the hypervisor has no direct
access to memory owned by the guest, or where the guest
memory is encrypted using a key the hypervisor or secure monitor
don't have access to.
Okay, I was interpreting the RISC-V specs wrong. They have three bits
dedicated to this. One is an on/off bit and the other two are the mode
to use. I am left wondering how it is determined which mode to use. If
the hypervisor is passed a pointer to a VAS variable in a register, how
does it know whether the pointer is for the supervisor or the user/app?
It's why I assumed it found the mode from the stack. Those two select
bits have to be set somehow. It seems like extra code to access the
right address space.
When the exception (in this case an upcall to a more privileged
regime) occurs, the saved state register/stack word should contain the
prior privilege level. The hypervisor will know from that whether
the upcall was from the guest OS or a guest Application.
Note that on ARM, there are restrictions on upcalls to
more privileged regimes - generally a particular regime
can only upcall the next higher privileged regime, so
the user app can only upcall the GuestOS, the guest OS can only
upcall the HV and the HV is the only regime that can
upcall the secure monitor.
That presumes a shared address space between the privilege
levels - which is common for the OS and user-modes. It's
not common (or particularly useful[*]) at any other privilege
level.
On 2025-04-06 10:21 a.m., Scott Lurndal wrote:
Note that on ARM, there are restrictions on upcalls to
more privileged regimes - generally a particular regime
can only upcall the next higher privileged regime, so
the user app can only upcall the GuestOS, the guest OS can only
upcall the HV and the HV is the only regime that can
upcall the secure monitor.
Allows two-directional virtualization, I think. Q+ has all exceptions
and interrupts going to the secure monitor, which can then delegate it
back to a lower level.
On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
----------------
When the exception (in this case an upcall to a more privileged
regime) occurs, the saved state register/stack word should contain the
prior privilege level. The hypervisor will know from that whether
the upcall was from the guest OS or a guest Application.
Note that on ARM, there are restrictions on upcalls to
more privileged regimes - generally a particular regime
can only upcall the next higher privileged regime, so
the user app can only upcall the GuestOS, the guest OS can only
upcall the HV and the HV is the only regime that can
upcall the secure monitor.
On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:
That presumes a shared address space between the privilege
levels - which is common for the OS and user-modes. It's
not common (or particularly useful[*]) at any other privilege
level.
So, is this dichotomy because::
a) HVs are good enough at virtualizing raw HW that GuestOS
does not need a lot of paravirtualization to be efficient ??
b) GuestOS does not need "that much paravirtualization" to be
efficient anyway.
c) the kinds of things GuestOS ask HVs to perform is just not
enough like the kind of things user asks of GuestOS.
d) User and GuestOS evolved in a time before virtualization
and simply prefer to exist as it used to be ??
mitchalsup@aol.com (MitchAlsup1) writes:
On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
----------------
When the exception (in this case an upcall to a more privileged
regime) occurs, the saved state register/stack word should contain the
prior privilege level. The hypervisor will know from that whether
the upcall was from the guest OS or a guest Application.
Note that on ARM, there are restrictions on upcalls to
more privileged regimes - generally a particular regime
can only upcall the next higher privileged regime, so
the user app can only upcall the GuestOS, the guest OS can only
upcall the HV and the HV is the only regime that can
upcall the secure monitor.
On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:
That presumes a shared address space between the privilege
levels - which is common for the OS and user-modes. It's
not common (or particularly useful[*]) at any other privilege
level.
So, is this dichotomy because::
a) HVs are good enough at virtualizing raw HW that GuestOS
does not need a lot of paravirtualization to be efficient ??
Yes. Once AMD added Nested Page Tables to SVM and the PCI-SIG
proposed the SR-IOV capability, paravirtualization became anathema.
b) GuestOS does not need "that much paravirtualization" to be
efficient anyway.
With modern hardware support, yes.
c) the kinds of things GuestOS ask HVs to perform is just not
enough like the kind of things user asks of GuestOS.
Yes, that's also a truism.
d) User and GuestOS evolved in a time before virtualization
and simply prefer to exist as it used to be ??
Typically an OS doesn't know if it is a guest or bare metal.
That characteristic means that a given distribution can
operate as either.
On Mon, 7 Apr 2025 14:09:50 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
----------------
When the exception (in this case an upcall to a more privileged
regime) occurs, the saved state register/stack word should contain the
prior privilege level. The hypervisor will know from that whether
the upcall was from the guest OS or a guest Application.
Note that on ARM, there are restrictions on upcalls to
more privileged regimes - generally a particular regime
can only upcall the next higher privileged regime, so
the user app can only upcall the GuestOS, the guest OS can only
upcall the HV and the HV is the only regime that can
upcall the secure monitor.
On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:
That presumes a shared address space between the privilege
levels - which is common for the OS and user-modes. It's
not common (or particularly useful[*]) at any other privilege
level.
So, is this dichotomy because::
a) HVs are good enough at virtualizing raw HW that GuestOS
does not need a lot of paravirtualization to be efficient ??
Yes. Once AMD added Nested Page Tables to SVM and the PCI-SIG
proposed the SR-IOV capability, paravirtualization became anathema.
Ok, back to Dan Cross:: (with help from Scott)
If GuestOS wants to grab and hold onto a lock/mutex for a while
to do some critical section stuff--does GuestOS "care" that HV
can still take an interrupt while GuestOS is doing its CS thing ??
since HV is not going to touch any memory associated with GuestOS.
In effect, I am asking: is Disable Interrupt SW-stack-wide or only
applicable to the current layer of the SW stack ?? One can equally
use SW-stack-wide to mean core-wide.
For example:: GuestOS DIs, and HV takes a page fault from GuestOS;
makes the page resident and accessible, and allows GuestOS to run
from the point of fault. GuestOS "sees" no interrupt and nothing
in GuestOS VAS is touched by HV in servicing the page fault.
Now, sure that lock is held while the page fault is being serviced,
and the ugly head of priority inversion takes hold. But ... I am in
need of some edumacation here.
mitchalsup@aol.com (MitchAlsup1) writes:
On Mon, 7 Apr 2025 14:09:50 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
----------------
When the exception (in this case an upcall to a more privileged
regime) occurs, the saved state register/stack word should contain the
prior privilege level. The hypervisor will know from that whether
the upcall was from the guest OS or a guest Application.
Note that on ARM, there are restrictions on upcalls to
more privileged regimes - generally a particular regime
can only upcall the next higher privileged regime, so
the user app can only upcall the GuestOS, the guest OS can only
upcall the HV and the HV is the only regime that can
upcall the secure monitor.
On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:
That presumes a shared address space between the privilege
levels - which is common for the OS and user-modes. It's
not common (or particularly useful[*]) at any other privilege
level.
So, is this dichotomy because::
a) HVs are good enough at virtualizing raw HW that GuestOS
does not need a lot of paravirtualization to be efficient ??
Yes. Once AMD added Nested Page Tables to SVM and the PCI-SIG
proposed the SR-IOV capability, paravirtualization became anathema.
Ok, back to Dan Cross:: (with help from Scott)
If GuestOS wants to grab and hold onto a lock/mutex for a while
to do some critical section stuff--does GuestOS "care" that HV
can still take an interrupt while GuestOS is doing its CS thing ??
since HV is not going to touch any memory associated with GuestOS.
Generally, the Guest should execute "as if" it were running on
Bare Metal. Consider an intel/amd processor running a bare-metal
operating system that takes an interrupt into SMM mode; from the
POV of a guest, an HV interrupt is similar to an SMM interrupt.
If the SMM, Secure Monitor or HV modify guest memory in any way,
all bets are off.
In effect, I am asking: is Disable Interrupt SW-stack-wide or only
applicable to the current layer of the SW stack ?? One can equally
use SW-stack-wide to mean core-wide.
Current layer of the privilege stack. If there is a secure monitor
at a more privileged level than the HV, it can take interrupts in a
manner similar to the legacy SMM interrupts. Typically there will
be independent periodic timer interrupts in the Guest OS, the HV, and
the secure monitor.
For example:: GuestOS DIs, and HV takes a page fault from GuestOS;
Note that these will be rare and only if the HV overcommits physical
memory.
makes the page resident and accessible, and allows GuestOS to run
from the point of fault. GuestOS "sees" no interrupt and nothing
in GuestOS VAS is touched by HV in servicing the page fault.
The only way that the guest OS or guest OS application can detect
such an event is if it measures an affected load/store - a covert
channel. So there may be security considerations.
Now, sure that lock is held while the page fault is being serviced,
and the ugly head of priority inversion takes hold. But ... I am in
need of some edumacation here.
Priority inversion is only applicable within a privilege level/ring.
Interrupts to a higher privilege level cannot be masked by an active
interrupt at a lower priority level.
The higher privilege level must not unilaterally modify guest OS or
application state.
On Tue, 15 Apr 2025 14:02:37 +0000, Scott Lurndal wrote:
Current layer of the privilege stack. If there is a secure monitor
at a more privileged level than the HV, it can take interrupts in a
manner similar to the legacy SMM interupts. Typically there will
be independent periodic timer interrupts in the Guest OS, the HV, and
the secure monitor.
This agrees with the RISC-V approach where each layer in the stack
has its own Interrupt Enable configuration. {Which is what led to
my questions}.
However, many architectures have only a single control bit for the
whole core--which is why I am trying to get a complete understanding
of what is required and what is choice. That there is some control
is (IS) required--how many seems to be choice at this stage.
Would it be unwise of me to speculate that a control at each layer
is more optimal, or that the critical section that is delayed due
to "other stuff needing to be handled" should have taken precedence?
The only way that the guest OS or guest OS application can detect
such an event is if it measures an affected load/store - a covert
channel. So there may be security considerations.
Damn that high precision clock .....
Which also leads to the question of should a Virtual Machine have
its own virtual time ?? {Or VM and VMM share the concept of virtual
time} ??
Now, sure that lock is held while the page fault is being serviced,
and the ugly head of priority inversion takes hold. But ... I am in
need of some edumacation here.
Priority inversion is only applicable within a privilege level/ring.
Interrupts to a higher privilege level cannot be masked by an active
interrupt at a lower priority level.
So, if core is running HyperVisor at priority 15 and a user interrupt
arrives at a higher priority but directed at GuestOS (instead of HV)
does::
a) HV continue leaving higher priority interrupt waiting.
b) switch back to GuestOS for higher priority interrupt--in such
. a way that when GuestOS returns from interrupt HV takes over
. from whence it left.
This is really a question of what priority means across the entire
SW stack--and real-time versus Linux may have different answers on
this matter.
The higher privilege level must not unilaterally modify guest OS or
application state.
Given the almost complete lack of shared address spaces in a manner
where pointers can be passed between, there is almost nothing an HV
can do to a GuestOS VAS unless GuestOS has asked for an HV service via
a paravirtualization entry point.
mitchalsup@aol.com (MitchAlsup1) writes:
---------snip-----------
So, if core is running HyperVisor at priority 15 and a user interrupt
arrives at a higher priority but directed at GuestOS (instead of HV)
does::
a) HV continue leaving higher priority interrupt waiting.
b) switch back to GuestOS for higher priority interrupt--in such
. a way that when GuestOS returns from interrupt HV takes over
. from whence it left.
ARM, for example, splits the per-core interrupt priority range into
halves - one half is assigned to the secure monitor and the other is
assigned to the non-secure software running on the core.
Early hypervisors would field all non-secure interrupts and either
handle them themselves or inject them into the guest. The first ARM64
cores would field all interrupts in the HV and the int controller had
special registers the HV could use to inject interrupts into the guest.
The overhead was not insignificant, so they added a mechanism to allow
some interrupts to be directly fielded by the guest itself - avoiding
the round trip through the HV on every interrupt (called virtual LPIs).
On Wed, 16 Apr 2025 14:07:36 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
---------snip-----------
So, if core is running HyperVisor at priority 15 and a user interrupt
arrives at a higher priority but directed at GuestOS (instead of HV)
does::
a) HV continue leaving higher priority interrupt waiting.
b) switch back to GuestOS for higher priority interrupt--in such
. a way that when GuestOS returns from interrupt HV takes over
. from whence it left.
ARM, for example, splits the per-core interrupt priority range into
halves - one half is assigned to the secure monitor and the other is
assigned to the non-secure software running on the core.
Thus, my predilection for 64 priority levels (rather than ~8 as
suggested by another participant) allows for this distribution of
priorities across layers in the SW stack at the discretion of
trustable-SW.
Early hypervisors would field all non-secure interrupts and either
handle them themselves or inject them into the guest. The first ARM64
cores would field all interrupts in the HV and the int controller had
special registers the HV could use to inject interrupts into the guest.
The overhead was not insignificant, so they added a mechanism to allow
some interrupts to be directly fielded by the guest itself - avoiding
the round trip through the HV on every interrupt (called virtual LPIs).
Given 4 layers in the stack {Secure, Hyper, Super, User} and we have
interrupts targeting {Secure, Hyper, Super}, do we pick up any liability
or do we gain flexibility by being able to target interrupts directly
to {user} ?? (the 4th element).
On 4/16/2025 2:13 PM, MitchAlsup1 wrote:
On Wed, 16 Apr 2025 14:07:36 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:---------snip-----------
So, if core is running HyperVisor at priority 15 and a user interrupt
arrives at a higher priority but directed at GuestOS (instead of HV)
does::
a) HV continue leaving higher priority interrupt waiting.
b) switch back to GuestOS for higher priority interrupt--in such
. a way that when GuestOS returns from interrupt HV takes over
. from whence it left.
ARM, for example, splits the per-core interrupt priority range into
halves - one half is assigned to the secure monitor and the other is
assigned to the non-secure software running on the core.
Thus, my predilection for 64 priority levels (rather than ~8 as
suggested by another participant) allows for this distribution of
priorities across layers in the SW stack at the discretion of
trustable-SW.
Early hypervisors would field all non-secure interrupts and either
handle them themselves or inject them into the guest. The first ARM64
cores would field all interrupts in the HV and the int controller had
special registers the HV could use to inject interrupts into the guest.
The overhead was not insignificant, so they added a mechanism to allow
some interrupts to be directly fielded by the guest itself - avoiding
the round trip through the HV on every interrupt (called virtual LPIs).
Given 4 layers in the stack {Secure, Hyper, Super, User} and we have
interrupts targeting {Secure, Hyper, Super}, do we pick up any liability
or do we gain flexibility by being able to target interrupts directly to
{user} ?? (the 4th element).
I think you could gain a tiny amount of efficiency if the OS (super)
allowed the user to set up handlers for certain classes of exceptions
(e.g., divide faults) itself rather than having to go through the super.
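The existing software analogue, for comparison: POSIX already lets a
process claim divide faults itself via sigaction(SIGFPE), at the cost
of a kernel round trip on every fault:

#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

static void on_fpe(int sig) {
    (void)sig;
    static const char msg[] = "caught divide fault in user mode\n";
    write(2, msg, sizeof msg - 1);  /* async-signal-safe */
    _Exit(1);
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_handler = on_fpe;
    sigaction(SIGFPE, &sa, NULL);  /* user-registered handler */
    volatile int zero = 0;
    return 1 / zero;  /* raises SIGFPE on most hardware (e.g., x86) */
}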