On 3/30/2025 7:16 AM, Robert Finch wrote:
Just got to thinking about stack canaries. I was going to have a special
purpose register holding the canary value for testing while the program
was running. But I just realized today that it may not be needed. Canary
values could be handled by the program loader as constants, eliminating
the need for a register. Since the value is not changing while the
program is running, it could easily be a constant. This may require a
fixup record handled by the assembler / linker to indicate to the loader
to place a canary value.
Prolog code would just store an immediate to the stack. On return a TRAP
instruction could check for the immediate value and trap if not present.
But the process seems to require assembler / linker support.
They are mostly just a normal compiler feature IME:
Prolog stores the value;
Epilog loads it and verifies that the value is intact.
Using a magic number
Nothing fancy needed in the assembly or link stages.
In my case, canary behavior is one of:
Use them in functions with arrays or similar (default);
Use them everywhere (optional);
Disable them entirely (also optional).
In my case, it is only checking 16-bit magic numbers, but mostly because
a 16-bit constant is cheaper to load into a register in this case
(single 32-bit instruction, vs a larger encoding needed for larger
values).
....
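A minimal C sketch of the store-then-verify pattern described above
(the 16-bit magic follows BGB's choice; the abort() on mismatch is
illustrative). A real compiler places the canary between the locals and
the saved return address, which plain C cannot express, so this only
shows the shape of the check:

#include <stdint.h>
#include <stdlib.h>

#define CANARY_MAGIC 0x5A5Au  /* 16-bit magic: cheap to load as an immediate */

void example(void) {
    volatile uint16_t canary = CANARY_MAGIC;  /* prolog: store the value */
    char buf[32];
    /* ... function body that might overflow buf ... */
    (void)buf;
    if (canary != CANARY_MAGIC)  /* epilog: verify the value is intact */
        abort();                 /* trap if not present */
}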
On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:
On 3/30/2025 7:16 AM, Robert Finch wrote:
Just got to thinking about stack canaries. I was going to have a special
purpose register holding the canary value for testing while the program
was running. But I just realized today that it may not be needed. Canary
values could be handled by the program loader as constants, eliminating
the need for a register. Since the value is not changing while the
program is running, it could easily be a constant. This may require a
fixup record handled by the assembler / linker to indicate to the loader
to place a canary value.
Prolog code would just store an immediate to the stack. On return a TRAP
instruction could check for the immediate value and trap if not present.
But the process seems to require assembler / linker support.
They are mostly just a normal compiler feature IME:
Prolog stores the value;
Epilog loads it and verifies that the value is intact.
Agreed.
Using a magic number
Remove excess words.
Nothing fancy needed in the assembly or link stages.
They remain blissfully ignorant--at most they generate the magic
number, possibly at random, possibly per link-module.
In my case, canary behavior is one of:
Use them in functions with arrays or similar (default);
Use them everywhere (optional);
Disable them entirely (also optional).
In my case, it is only checking 16-bit magic numbers, but mostly because
a 16-bit constant is cheaper to load into a register in this case
(single 32-bit instruction, vs a larger encoding needed for larger
values).
....
On 3/31/2025 11:04 AM, Stephen Fuld wrote:
On 3/30/2025 1:14 PM, MitchAlsup1 wrote:
On Sun, 30 Mar 2025 17:47:59 +0000, BGB wrote:
-------------
They are mostly just a normal compiler feature IME:
Prolog stores the value;
Epilog loads it and verifies that the value is intact.
Agreed.
I'm glad you, Mitch, chimed in here. When I saw this, it occurred to me
that this could be done automatically by the hardware (optionally, based
on a bit in a control register). The CALL instruction would store the
magic value, and the RET instruction would test it. If there was not a
match, an exception would be generated. The value itself could be
something like the clock value when the program was initiated, thus
guaranteeing uniqueness.
The advantage over the software approach, of course, is the elimination
of several instructions in each prolog/epilog, reducing footprint, and
perhaps even time as it might be possible to overlap some of the
processing with the other things these instructions do. The downside is
more hardware and perhaps extra overhead.
Does this make sense? What have I missed?
This would seem to imply an ISA where CALL/RET push onto the stack or similar, rather than the (more common for RISC's) strategy of copying PC
into a link register...
Another option being if it could be a feature of a Load/Store Multiple.
Say, LDM/STM:
6b Hi (Upper bound of registers to save)
6b Lo (Lower bound of registers to save)
1b LR (Flag to save Link Register)
1b GP (Flag to save Global Pointer)
1b SK (Flag to generate a canary)
Likely (STM):
Pushes LR first (if bit set);
Pushes GP second (if bit set);
Pushes registers in range (if Hi>=Lo);
Pushes stack canary (if bit set).
LDM would check the canary first and fault if it doesn't see the
expected value.
Downside, granted, is needing the relative complexity of an LDM/STM
style instruction.
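A rough C model of the proposed STM encoding and its push order,
assuming a full-descending stack (the Hi/Lo/LR/GP/SK fields are from
the description above; everything else is illustrative):

#include <stdint.h>

struct stm_fields {
    unsigned hi : 6;  /* upper bound of register range to save */
    unsigned lo : 6;  /* lower bound of register range to save */
    unsigned lr : 1;  /* save Link Register first */
    unsigned gp : 1;  /* save Global Pointer second */
    unsigned sk : 1;  /* push a stack canary last */
};

void stm_emulate(struct stm_fields f, const uint64_t *regs,
                 uint64_t lr, uint64_t gp, uint64_t canary, uint64_t **sp) {
    if (f.lr) *--(*sp) = lr;          /* pushes LR first (if bit set) */
    if (f.gp) *--(*sp) = gp;          /* pushes GP second (if bit set) */
    if (f.hi >= f.lo)                 /* pushes registers in range */
        for (unsigned r = f.hi; ; r--) {
            *--(*sp) = regs[r];
            if (r == f.lo) break;     /* avoids wrap when lo == 0 */
        }
    if (f.sk) *--(*sp) = canary;      /* pushes stack canary (if bit set) */
}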
Other ISAs use a flag bit for each register, but this is less viable
with an ISA with a larger number of registers, well, unless one uses a
64 or 96 bit LDM/STM encoding (possible). Merit though would be not
needing multiple LDM's / STM's to deal with a discontinuous register
range.
Well, also excluding the possibility where the LDM/STM is essentially
just a function call (say, if beyond a certain number of registers are
to be saved/restored, the compiler generates a call to a save/restore
sequence, which is also generated as-needed). Granted, this is basically
the strategy used by BGBCC. If multiple functions happen to save/restore
the same combination of registers, they get to reuse the prior
function's save/restore sequence (generally folded off to before the
function in question).
Granted, the folding strategy can still do canary values, but doing so
in the reused portions would limit the range of unique canary values
(well, unless the canary magic is XOR'ed with SP or something...).
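The SP-mixing idea in one line, as a sketch (names illustrative):

#include <stdint.h>

static inline uint64_t frame_canary(uint64_t magic, void *sp) {
    /* mix the shared magic with the frame's SP so reused (folded)
       save/restore sequences still check a per-frame value */
    return magic ^ (uint64_t)(uintptr_t)sp;
}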
On 3/31/2025 1:07 PM, MitchAlsup1 wrote:
-------------
Another option being if it could be a feature of a Load/Store Multiple.
Say, LDM/STM:
6b Hi (Upper bound of register to save)
6b Lo (Lower bound of registers to save)
1b LR (Flag to save Link Register)
1b GP (Flag to save Global Pointer)
1b SK (Flag to generate a canary)
ENTER and EXIT have 2 of those flags--but also note use of SP and CSP
are implicit.
Likely (STM):
Pushes LR first (if bit set);
Pushes GP second (if bit set);
Pushes registers in range (if Hi>=Lo);
Pushes stack canary (if bit set).
EXIT uses its 3rd flag when doing longjump() and THROW(),
so as to pop the call-stack but not actually RET from the stack
walker.
OK.
I guess one could debate whether an LDM could treat the Load-LR as "Load
LR" or "Load address and Branch", and/or have separate flags (Load LR vs
Load PC, with Load PC meaning to branch).
Other ABIs may not have as much reason to save/restore the Global
Pointer all the time. But, in my case, it is being used as the primary
way of accessing globals, and each binary image has its own address
range here.
PC-Rel is not being used, as PC-Rel doesn't allow for multiple process
instances of a given loaded binary within a shared address space.
Vs, say, for PIE ELF binaries where it is needed to load a new copy for
each process instance because of this (well, excluding an FDPIC style
ABI, but seemingly no one has bothered adding FDPIC support in GCC or
friends for RV64 based targets, ...).
Well, granted, because Linux and similar tend to load every new process
into its own address space and/or use CoW.
Other ISAs use a flag bit for each register, but this is less viable
with an ISA with a larger number of registers, well, unless one uses a
64 or 96 bit LDM/STM encoding (possible). Merit though would be not
needing multiple LDM's / STM's to deal with a discontinuous register
range.
To quote Trevor Smith:: "Why would anyone want to do that" ??
Discontinuous register ranges:
Because pretty much no ABI's put all of the callee save registers in a contiguous range.
Granted, I guess if someone were designing an ISA and ABI clean, they
could make all of the argument registers and callee save registers contiguous.
Say:
R0..R3: Special
R4..R15: Scratch
R16..R31: Argument
R32..R63: Callee Save
....
But, invariably, someone will want "compressed" instructions with a
subset of the registers, and one can't just have these only having
access to argument registers.
Well, also excluding the possibility where the LDM/STM is essentially
just a function call (say, if beyond a certain number of registers are
to be saved/restored, the compiler generates a call to a save/restore
sequence, which is also generated as-needed). Granted, this is basically
the strategy used by BGBCC. If multiple functions happen to save/restore
the same combination of registers, they get to reuse the prior
function's save/restore sequence (generally folded off to before the
function in question).
Calling a subroutine to perform epilogues is adding to the number of
branches a program executes. Having an instruction like EXIT means
when you know you need to exit, you EXIT; you don't branch to the exit
point. Saving instructions.
Prolog needs a call, but epilog can just be a branch, since no need to
return back into the function that is returning.
Needs to have a lower limit though, as it is not worth it to use a call/branch to save/restore 3 or 4 registers...
But, say, 20 registers, it is more worthwhile.
Canary values are in addition to ENTER and EXIT, not part of them.
Granted, the folding strategy can still do canary values, but doing so
in the reused portions would limit the range of unique canary values
(well, unless the canary magic is XOR'ed with SP or something...).
On 2025-03-31 4:52 p.m., MitchAlsup1 wrote:
--------------------
On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:
ENTER saves as few as 1 or as many as 32 and remains that 1 single
instruction. Same for EXIT, and EXIT also performs the RET when LDing
R0.
Canary values are in addition to ENTER and EXIT, not part of them,
IMHO.
Granted, the folding strategy can still do canary values, but doing so
in the reused portions would limit the range of unique canary values
(well, unless the canary magic is XOR'ed with SP or something...).
In Q+3 there are push and pop multiple instructions. I did not want to
add load and store multiple on top of that. They work great for ISRs,
but not so great for task switching code. I have the instructions
pushing or popping up to 17 registers in a group. Groups of registers
overlap by eight. The instructions can handle all 96 registers in the
machine. ENTER and EXIT are also present.
It is looking like the context switch code for the OS will take about
3000 clock cycles to run.
Not wanting to disable interrupts for that long, I put a spinlock on
the system's task control block array. But I think I have run into an
issue. It is the timer ISR that switches tasks. Since it is an ISR it
pushes a subset of registers that it uses and restores them at exit.
But when exiting and switching tasks it spinlocks on the task control
block array. I am not sure this is a good thing, as the timer IRQ is
fairly high priority. If something else locked the TCB array it would
deadlock. I guess the context switching could be deferred until the app
requests some other operating system function. But then the issue is
what if the app gets stuck in an infinite loop, not calling the OS? I
suppose I could make an OS heartbeat function call a requirement of
apps. If the app does not do a heartbeat within a reasonable time, it
could be terminated.
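A sketch of that heartbeat scheme, with all of the names invented for
illustration:

#include <stdint.h>

#define HEARTBEAT_LIMIT 1000  /* ticks without a heartbeat before termination */

struct task {
    uint64_t last_beat;
    int alive;
};

extern volatile uint64_t os_ticks;  /* advanced by the timer ISR */

void os_heartbeat(struct task *t) { t->last_beat = os_ticks; }

void timer_check_heartbeats(struct task *tasks, int n) {
    for (int i = 0; i < n; i++)
        if (tasks[i].alive && os_ticks - tasks[i].last_beat > HEARTBEAT_LIMIT)
            tasks[i].alive = 0;  /* mark the stuck app for termination */
}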
Q+3 progresses rapidly. A lot of the stuff in earlier versions was
removed. The pared down version is a 32-bit machine. Expecting some
headaches because of the use of condition registers and branch
registers.
On 2025-04-01 2:51 p.m., MitchAlsup1 wrote:
------------------
On Tue, 1 Apr 2025 4:58:58 +0000, Robert Finch wrote:
It is looking like the context switch code for the OS will take about
3000 clock cycles to run.
How much of that is figuring out who to switch to and, now that that has
been decided, make the context switch manifest ??
That was just for making the switch. I calculated based on the
number of register loads and stores x2 and then times 13 clocks for
memory access, plus a little bit of overhead for other instructions.
Deciding who to switch to may be another good chunk of time. But the
system is using a hardware ready list, so the choice is just to pop
(load) the top task id off the ready list. The guts of the switcher is
only about 30 LOC, but it calls a couple of helper routines.
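(As a rough cross-check of that estimate: with the ~100 loads and stores
each way mentioned below, 100 x 2 directions x 13 clocks is about 2600
cycles, and the remaining few hundred is the overhead.)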
On 3/31/2025 3:52 PM, MitchAlsup1 wrote:
---------------------
On Mon, 31 Mar 2025 18:56:32 +0000, BGB wrote:
PC-Rel is not being used, as PC-Rel doesn't allow for multiple process
instances of a given loaded binary within a shared address space.
As long as the relative distance is the same, it does.
Can't happen within a shared address space.
Say, if you load a single copy of a binary at 0x24680000.
Process A and B can't use the same mapping in the same address space,
with PC-rel globals, as then they would each see the other's globals.
You can't do a duplicate mapping at another address, as this both wastes
VAS, and also any Abs64 base-relocs or similar would differ.
You also can't CoW the data/bss sections, as this is no longer a shared address space.
So, alternative is to use GBR to access globals, with the data/bss
sections allocated independently of the binary.
This way, multiple processes can share the same mapping at the same
address for any executable code and constant data, with only the data sections needing to be allocated.
Does mean though that one needs to save/restore the global pointer, and
there is a ritual for reloading it.
EXE's generally assume they are index 0, so:
MOV.Q (GBR, 0), Rt
MOV.Q (Rt, 0), GBR
Or, in RV terms:
LD X6, 0(X3)
LD X3, Disp33(X6)
Or, RV64G:
LD X6, 0(X3)
LUI X5, DispHi
ADD X5, X5, X6
LD X3, DispLo(X5)
For DLL's, the index is fixed up with a base-reloc (for each loaded
DLL), so basically the same idea. Typically a Disp33 is used here to
allow for a potentially large/unknown number of loaded DLL's. Thus far,
a global numbering scheme is used.
Where, (GBR+0) gives the address of a table of global pointers for every loaded binary (can be assumed read-only from userland).
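The same ritual as a C sketch ((GBR+0) as the table address and the
EXE-is-index-0 convention are from the description above; the function
name is invented):

static inline void *reload_global_pointer(void **gbr, unsigned module_index) {
    void **table = (void **)gbr[0];  /* (GBR+0): table of global pointers */
    return table[module_index];      /* EXE assumes index 0; each DLL's
                                        index is fixed up at load time */
}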
Generally, this is needed if:
Function may be called from outside of the current binary and:
Accesses global variables;
And/or, calls local functions.
Though, still generally lower average-case overhead than the strategy typically used by FDPIC, which would handle this reload process on the
caller side...
SD X3, Disp(SP)    # save the caller's GP
LD X3, 8(X18)      # load the callee's GP (X18: descriptor pointer, assumed)
LD X6, 0(X18)      # load the callee's entry address
JALR X1, 0(X6)     # call
LD X3, Disp(SP)    # restore the caller's GP
Though, execl() effectively replaces the current process.
IMHO, a "CreateProcess()" style abstraction makes more sense than
fork+exec.
But, invariably, someone will want "compressed" instructions with a
subset of the registers, and one can't just have these only having
access to argument registers.
Brian had little trouble using My 66000 ABI which does have contiguous
register groupings.
But, My66000 also isn't like, "Hey, how about 16-bit ops with 3 or 4 bit register numbers".
Not sure the thinking behind the RV ABI.
Prolog needs a call, but epilog can just be a branch, since no need to
return back into the function that is returning.
Yes, but this means My 66000 executes 3 fewer transfers of control
per subroutine than you do. And taken branches add latency.
Granted.
Each predicted branch adds 2 cycles.
Needs to have a lower limit though, as it is not worth it to use a
call/branch to save/restore 3 or 4 registers...
But, say, 20 registers, it is more worthwhile.
ENTER saves as few as 1 or as many as 32 and remains that 1 single
instruction. Same for EXIT and exit also performs the RET when LDing
R0.
Granted.
My strategy isn't perfect:
Non-zero branching overheads, when the feature is used;
Per-function load/store slides in prolog/epilog, when not used.
Then, the heuristic mostly becomes one of when it is better to use the
inline strategy (load/store slide), or to fold them off and use calls/branches.
Does technically also work for RISC-V though (though seemingly GCC
always uses inline save/restore, but also the RV ABI has fewer
registers).
On 2025-04-01 7:24 p.m., MitchAlsup1 wrote:
On Tue, 1 Apr 2025 22:06:10 +0000, Robert Finch wrote:
On 2025-04-01 2:51 p.m., MitchAlsup1 wrote:
------------------
On Tue, 1 Apr 2025 4:58:58 +0000, Robert Finch wrote:
It is looking like the context switch code for the OS will take about
3000 clock cycles to run.
How much of that is figuring out who to switch to and, now that that
has been decided, make the context switch manifest ??
That was just for making the switch. I calculated based on the
number of register loads and stores x2 and then times 13 clocks for
memory access, plus a little bit of overhead for other instructions.
Why is it not 13 cycles to get started and then each register is 1
cycle.
The CPU does not do pipe-lined burst loads. To load the cache line it
is two independent loads, 256-bits at a time. Stores post to the bus,
but I seem to remember having to space out the stores so the queue in
the memory controller did not overflow. Needs more work.
Stores should be faster, I think they are single cycle. But loads may
be quite slow if things are not in the cache. I should really measure
it.
It may not be as bad as I think. It is still 300 LOC, about 100 loads
and stores each way. Lots of move instructions for regs that cannot be
directly loaded or stored. And with CRs serializing the processor. But
the processor should eat up all the moves fairly quickly.
Deciding who to switch to may be another good chunk of time. But the
system is using a hardware ready list, so the choice is just to pop
(load) the top task id off the ready list. The guts of the switcher is
only about 30 LOC, but it calls a couple of helper routines.
Say, if you load a single copy of a binary at 0x24680000.
Process A and B can't use the same mapping in the same address space,
with PC-rel globals, as then they would each see the other's globals.
Say I load a copy of the binary text at 0x24680000 and its data at
0x35900000 for a distance of 0x11280000 into the address space of
a process.
Then I load another copy at 0x44680000 and its data at 0x55900000
into the address space of a different process.
But, yeah, inter-process function pointers aren't really a thing, and should not be a thing.
On 2025-04-03 1:22 p.m., BGB wrote:
-------------------
Or, to allow for NOMMU operation, or reduce costs by not having context
switches result in as large of numbers of TLB misses.
Also makes the kernel simpler as it doesn't need to deal with each
process having its own address space.
Have you seen the MPRV bit in RISC-V? Allows memory ops to execute
using the previous mode / address space. The bit just has to be set,
then do the memory op, then reset the bit. Makes it easy to access data
using the process address space.
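A sketch of what that looks like from RISC-V M-mode (MPRV is mstatus
bit 17; this assumes mstatus.MPP already holds the target mode and that
interrupts are disabled around the access):

#include <stdint.h>

/* Read one word through the previous mode's address translation. */
static inline uint64_t read_prev_mode_word(const uint64_t *va) {
    uint64_t v;
    __asm__ volatile(
        "csrs mstatus, %[mprv]\n\t"  /* set MPRV: loads/stores translate as MPP mode */
        "ld   %[v], 0(%[a])\n\t"
        "csrc mstatus, %[mprv]"      /* reset the bit */
        : [v] "=&r"(v)
        : [a] "r"(va), [mprv] "r"(1UL << 17)
        : "memory");
    return v;
}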
On 2025-04-04 5:13 p.m., MitchAlsup1 wrote:
On Fri, 4 Apr 2025 3:49:31 +0000, Robert Finch wrote:
On 2025-04-03 1:22 p.m., BGB wrote:
-------------------
Or, to allow for NOMMU operation, or reduce costs by not having context
switches result in as large of numbers of TLB misses.
Also makes the kernel simpler as it doesn't need to deal with each
process having its own address space.
Have you seen the MPRV bit in RISC-V? Allows memory ops to execute
using the previous mode / address space. The bit just has to be set,
then do the memory op, then reset the bit. Makes it easy to access data
using the process address space.
Let us postulate you are running in RISC-V HyperVisor on core[j]
and you want to write into GuestOS VAS and into application VAS
more or less simultaneously.
Would not writing to the GuestOS VAS and the application VAS be the
result of separate system calls? Or does the hypervisor take over for
the GuestOS?
Seems to me like you need a MPRV to be more than a single bit
so it could index which layer of the SW stack's VAS it needs
to touch.
So, there is a need to be able to go back two or three levels? I suppose
it could also be done by manipulating the stack, although adding an
extra bit may be easier. How often does it happen?
On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:
Would not writing to the GuestOS VAS and the application VAS be the
result of separate system calls? Or does the hypervisor take over for
the GuestOS?
Application has a 64-bit VAS
GuestOS has a 64-bit VAS
HyperVisor has a 64-bit VAS
and so does
Secure.
So, we are in HV and we need to write to guestOS and to Application
but we have only 1-bit of distinction.
mitchalsup@aol.com (MitchAlsup1) writes:
On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:
Would not writing to the GuestOS VAS and the application VAS be the
result of separate system calls? Or does the hypervisor take over for
the GuestOS?
Application has a 64-bit VAS
GuestOS has a 64-bit VAS
HyperVisor has a 64-bit VAS
and so does
Secure.
So, we are in HV and we need to write to guestOS and to Application
but we have only 1-bit of distinction.
On ARM64, when the HV needs to write to guest user VA or guest PA,
the SMMU provides an interface the processor can use to translate
the guest VA or Guest PA to the corresponding system physical address.
Of course, there is a race if the guest OS changes the underlying
translation tables during the upcall to the hypervisor or secure
monitor, although that would be a bug in the guest were it so to do,
since the guest explicitly requested the action from the higher
privilege level (e.g. HV).
Arm does have a set of load/store "user" instructions that translate
addresses using the unprivileged (application) translation tables.
There's also a processor state bit (UAO - User Access Override) that
can be set to force those instructions to use the permissions
associated with the current processor privilege level.
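For reference, the instructions described here are the A64 LDTR/STTR
("unprivileged") family; a sketch of wrapping one in C (executed at
EL1, the address translates via the EL0 regime):

#include <stdint.h>

static inline uint64_t load_user_word(const uint64_t *uva) {
    uint64_t v;
    __asm__ volatile("ldtr %0, [%1]"  /* unprivileged load */
                     : "=r"(v) : "r"(uva) : "memory");
    return v;
}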
Note that there is a push by all vendors to include support
for guest 'privacy', such that the hypervisor has no direct
access to memory owned by the guest, or where the guest
memory is encrypted using a key the hypervisor or secure monitor
don't have access to.
On 2025-04-05 2:31 p.m., Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:
Would not writing to the GuestOS VAS and the application VAS be the
result of separate system calls? Or does the hypervisor take over for
the GuestOS?
Application has a 64-bit VAS
GuestOS has a 64-bit VAS
HyperVisor has a 64-bit VAS
and so does
Secure.
So, we are in HV and we need to write to guestOS and to Application
but we have only 1-bit of distinction.
On ARM64, when the HV needs to write to guest user VA or guest PA,
the SMMU provides an interface the processor can use to translate
the guest VA or Guest PA to the corresponding system physical address.
Of course, there is a race if the guest OS changes the underlying
translation tables during the upcall to the hypervisor or secure
monitor, although that would be a bug in the guest were it so to do,
since the guest explicitly requested the action from the higher
privilege level (e.g. HV).
Arm does have a set of load/store "user" instructions that translate
addresses using the unprivileged (application) translation tables.
There's also a processor state bit (UAO - User Access Override) that
can be set to force those instructions to use the permissions
associated with the current processor privilege level.
Note that there is a push by all vendors to include support
for guest 'privacy', such that the hypervisor has no direct
access to memory owned by the guest, or where the guest
memory is encrypted using a key the hypervisor or secure monitor
don't have access to.
Okay, I was interpreting the RISC-V specs wrong. They have three bits
dedicated to this. One is an on/off bit and the other two are the mode
to use. I am left wondering how it is determined which mode to use. If
the hypervisor is passed a pointer to a VAS variable in a register, how
does it know whether the pointer is for the supervisor or the user/app?
It's why I assumed it found the mode from the stack. Those two select
bits have to be set somehow. It seems like extra code to access the
right address space.
I got the thought to use the three bits a bit differently.
111 = use current mode
110 = use mode from stack
100 = debug? mode
011 = secure (machine) mode
010 = hypervisor mode
001 = supervisor mode
000 = user/app mode
I was just using inline code to select the proper address space. But if
it is necessary to dig around to figure out the mode, it may turn into
a subroutine call.
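The 3-bit selector above, rendered as a C enum (encodings from the
post; the names are invented for illustration):

enum vas_select {
    VAS_USER    = 0,  /* 000 = user/app mode */
    VAS_SUPER   = 1,  /* 001 = supervisor mode */
    VAS_HYPER   = 2,  /* 010 = hypervisor mode */
    VAS_SECURE  = 3,  /* 011 = secure (machine) mode */
    VAS_DEBUG   = 4,  /* 100 = debug? mode */
    VAS_STACK   = 6,  /* 110 = use mode from stack */
    VAS_CURRENT = 7   /* 111 = use current mode */
};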
On Sat, 5 Apr 2025 18:31:44 +0000, Scott Lurndal wrote:
Arm does have a set of load/store "user" instructions that translate
addresses using the unprivileged (application) translation tables.
When Secure Monitor executes a "user" instruction, which layer
of the SW stack is accessed:: {HV, SV, User} ?
Is this 1 layer down the stack, or all layers down the stack ??
There's also a processor state bit (UAO - User Access Override) that
can be set to force those instructions to use the permissions
associated with the current processor privilege level.
That is how My 66000 MMU is defined--higher privilege layers
have R/W access to the next lower privilege layer--without
doing anything other than a typical LD or ST instruction.
I/O MMU has similar issues to solve, in that a device can read
write-execute-only memory and write read-execute-only memory.
Note that there is a push by all vendors to include support
for guest 'privacy', such that the hypervisor has no direct
access to memory owned by the guest, or where the the guest
memory is encrypted using a key the hypervisor or secure monitor
don't have access to.
I call these "paranoid" applications--generally requiring no
privilege, but they don't want GuestOS or HyperVisor to look
at their data and at the same time, they want GuestOS or HV
to perform I/O to said data--so some devices have an effective
privilege above that of the driver commanding them.
I understand the reasons and rational.
On 2025-04-05 2:31 p.m., Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Sat, 5 Apr 2025 3:45:51 +0000, Robert Finch wrote:
Would not writing to the GuestOS VAS and the application VAS be the
result of separate system calls? Or does the hypervisor take over for
the GuestOS?
Application has a 64-bit VAS
GuestOS has a 64-bit VAS
HyperVisor has a 64-bit VAS
and so does
Secure.
So, we are in HV and we need to write to guestOS and to Application
but we have only 1-bit of distinction.
On ARM64, when the HV needs to write to guest user VA or guest PA,
the SMMU provides an interface the processor can use to translate
the guest VA or Guest PA to the corresponding system physical address.
Of course, there is a race if the guest OS changes the underlying
translation tables during the upcall to the hypervisor or secure
monitor, although that would be a bug in the guest were it so to do,
since the guest explicitly requested the action from the higher
privilege level (e.g. HV).
Arm does have a set of load/store "user" instructions that translate
addresses using the unprivileged (application) translation tables.
There's also a processor state bit (UAO - User Access Override) that
can be set to force those instructions to use the permissions
associated with the current processor privilege level.
Note that there is a push by all vendors to include support
for guest 'privacy', such that the hypervisor has no direct
access to memory owned by the guest, or where the guest
memory is encrypted using a key the hypervisor or secure monitor
don't have access to.
Okay, I was interpreting the RISC-V specs wrong. They have three bits
dedicated to this. One is an on/off bit and the other two are the mode
to use. I am left wondering how it is determined which mode to use. If
the hypervisor is passed a pointer to a VAS variable in a register, how
does it know whether the pointer is for the supervisor or the user/app?
It's why I assumed it found the mode from the stack. Those two select
bits have to be set somehow. It seems like extra code to access the
right address space.
When the exception (in this case an upcall to a more privileged
regime) occurs, the saved state register/stack word should contain the
prior privilege level. The hypervisor will know from that whether
the upcall was from the guest OS or a guest Application.
Note that on ARM, there are restrictions on upcalls to
more privileged regimes - generally a particular regime
can only upcall the next higher privileged regime, so
the user app can only upcall the GuestOS, the guest OS can only
upcall the HV and the HV is the only regime that can
upcall the secure monitor.
That presumes a shared address space between the privilege
levels - which is common for the OS and user-modes. It's
not common (or particularly useful[*]) at any other privilege
level.
On 2025-04-06 10:21 a.m., Scott Lurndal wrote:
Note that on ARM, there are restrictions on upcalls to
more privileged regimes - generally a particular regime
can only upcall the next higher privileged regime, so
the user app can only upcall the GuestOS, the guest OS can only
upcall the HV and the HV is the only regime that can
upcall the secure monitor.
Allows two-directional virtualization, I think. Q+ has all exceptions
and interrupts going to the secure monitor, which can then delegate it
back to a lower level.
On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
----------------
When the exception (in this case an upcall to a more privileged
regime) occurs, the saved state register/stack word should contain the
prior privilege level. The hypervisor will know from that whether
the upcall was from the guest OS or a guest Application.
Note that on ARM, there are restrictions on upcalls to
more privileged regimes - generally a particular regime
can only upcall the next higher privileged regime, so
the user app can only upcall the GuestOS, the guest OS can only
upcall the HV and the HV is the only regime that can
upcall the secure monitor.
On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:
That presumes a shared address space between the privilege
levels - which is common for the OS and user-modes. It's
not common (or particularly useful[*]) at any other privilege
level.
So, is this dichotomy because::
a) HVs are good enough at virtualizing raw HW that GuestOS
does not need a lot of paravirtualization to be efficient ??
b) GuestOS does not need "that much paravirtualization" to be
efficient anyway.
c) the kinds of things GuestOS ask HVs to perform is just not
enough like the kind of things user asks of GuestOS.
d) User and GuestOS evolved in a time before virtualization
and simply prefer to exist as it used to be ??
mitchalsup@aol.com (MitchAlsup1) writes:
On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
----------------
When the exception (in this case an upcall to a more privileged
regime) occurs, the saved state register/stack word should contain the
prior privilege level. The hypervisor will know from that whether
the upcall was from the guest OS or a guest Application.
Note that on ARM, there are restrictions on upcalls to
more privileged regimes - generally a particular regime
can only upcall the next higher privileged regime, so
the user app can only upcall the GuestOS, the guest OS can only
upcall the HV and the HV is the only regime that can
upcall the secure monitor.
On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:
That presumes a shared address space between the privilege
levels - which is common for the OS and user-modes. It's
not common (or particularly useful[*]) at any other privilege
level.
So, is this dichotomy because::
a) HVs are good enough at virtualizing raw HW that GuestOS
does not need a lot of paravirtualization to be efficient ??
Yes. Once AMD added Nested Page Tables to SVM and the PCI-SIG
proposed the SR-IOV capability, paravirtualization became anathema.
b) GuestOS does not need "that much paravirtualization" to be
efficient anyway.
With modern hardware support, yes.
c) the kinds of things GuestOS ask HVs to perform is just not
enough like the kind of things user asks of GuestOS.
Yes, that's also a truism.
d) User and GuestOS evolved in a time before virtualization
and simply prefer to exist as it used to be ??
Typically an OS doesn't know if it is a guest or bare metal.
That characteristic means that a given distribution can
operate as either.
On Mon, 7 Apr 2025 14:09:50 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
----------------
When the exception (in this case an upcall to a more privileged
regime) occurs, the saved state register/stack word should contain the
prior privilege level. The hypervisor will know from that whether
the upcall was from the guest OS or a guest Application.
Note that on ARM, there are restrictions on upcalls to
more privileged regimes - generally a particular regime
can only upcall the next higher privileged regime, so
the user app can only upcall the GuestOS, the guest OS can only
upcall the HV and the HV is the only regime that can
upcall the secure monitor.
On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:
That presumes a shared address space between the privilege
levels - which is common for the OS and user-modes. It's
not common (or particularly useful[*]) at any other privilege
level.
So, is this dichotomy because::
a) HVs are good enough at virtualizing raw HW that GuestOS
does not need a lot of paravirtualization to be efficient ??
Yes. Once AMD added Nested Page Tables to SVM and the PCI-SIG
proposed the SR-IOV capability, paravirtualization became anathema.
Ok, back to Dan Cross:: (with help from Scott)
If GuestOS wants to grab and hold onto a lock/mutex for a while
to do some critical section stuff--does GuestOS "care" that HV
can still take an interrupt while GuestOS is doing its CS thing ??
since HV is not going to touch any memory associated with GuestOS.
In effect, I am asking: is Disable Interrupt SW-stack-wide or only
applicable to the current layer of the SW stack ?? One can equally
use SW-stack-wide to mean core-wide.
For example:: GuestOS DIs, and HV takes a page fault from GuestOS;
makes the page resident and accessible, and allows GuestOS to run
from the point of fault. GuestOS "sees" no interrupt and nothing
in GuestOS VAS is touched by HV in servicing the page fault.
Now, sure that lock is held while the page fault is being serviced,
and the ugly head of priority inversion takes hold. But ... I am in
need of some edumacation here.
mitchalsup@aol.com (MitchAlsup1) writes:
On Mon, 7 Apr 2025 14:09:50 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Sun, 6 Apr 2025 14:21:26 +0000, Scott Lurndal wrote:
----------------
When the exception (in this case an upcall to a more privileged
regime) occurs, the saved state register/stack word should contain the
prior privilege level. The hypervisor will know from that whether
the upcall was from the guest OS or a guest Application.
Note that on ARM, there are restrictions on upcalls to
more privileged regimes - generally a particular regime
can only upcall the next higher privileged regime, so
the user app can only upcall the GuestOS, the guest OS can only
upcall the HV and the HV is the only regime that can
upcall the secure monitor.
On Sun, 6 Apr 2025 14:32:43 +0000, Scott Lurndal wrote:
That presumes a shared address space between the privilege
levels - which is common for the OS and user-modes. It's
not common (or particularly useful[*]) at any other privilege
level.
So, is this dichotomy because::
a) HVs are good enough at virtualizing raw HW that GuestOS
does not need a lot of paravirtualization to be efficient ??
Yes. Once AMD added Nested Page Tables to SVM and the PCI-SIG
proposed the SR-IOV capability, paravirtualization became anathema.
Ok, back to Dan Cross:: (with help from Scott)
If GuestOS wants to grab and hold onto a lock/mutex for a while
to do some critical section stuff--does GuestOS "care" that HV
can still take an interrupt while GuestOS is doing its CS thing ??
since HV is not going to touch any memory associated with GuestOS.
Generally, the Guest should execute "as if" it were running on
Bare Metal. Consider an intel/amd processor running a bare-metal
operating system that takes an interrupt into SMM mode; from the
POV of a guest, an HV interrupt is similar to an SMM interrupt.
If the SMM, Secure Monitor or HV modify guest memory in any way,
all bets are off.
In effect, I am asking: is Disable Interrupt SW-stack-wide or only
applicable to the current layer of the SW stack ?? One can equally
use SW-stack-wide to mean core-wide.
Current layer of the privilege stack. If there is a secure monitor
at a more privileged level than the HV, it can take interrupts in a
manner similar to the legacy SMM interrupts. Typically there will
be independent periodic timer interrupts in the Guest OS, the HV, and
the secure monitor.
For example:: GuestOS DIs, and HV takes a page fault from GuestOS;
Note that these will be rare and only if the HV overcommits physical
memory.
makes the page resident and accessible, and allows GuestOS to run
from the point of fault. GuestOS "sees" no interrupt and nothing
in GuestOS VAS is touched by HV in servicing the page fault.
The only way that the guest OS or guest OS application can detect
such an event is if it measures an affected load/store - a covert
channel. So there may be security considerations.
Now, sure that lock is held while the page fault is being serviced,
and the ugly head of priority inversion takes hold. But ... I am in
need of some edumacation here.
Priority inversion is only applicable within a privilege level/ring.
Interrupts to a higher privilege level cannot be masked by an active
interrupt at a lower priority level.
The higher privilege level must not unilaterally modify guest OS or
application state.
On Tue, 15 Apr 2025 14:02:37 +0000, Scott Lurndal wrote:
Current layer of the privilege stack. If there is a secure monitor
at a more privileged level than the HV, it can take interrupts in a
manner similar to the legacy SMM interupts. Typically there will
be independent periodic timer interrupts in the Guest OS, the HV, and
the secure monitor.
This agrees with the RISC-V approach where each layer in the stack
has its own Interrupt Enable configuration. {Which is what led to
my questions}.
However, many architectures have only a single control bit for the
whole core--which is why I am trying to get a complete understanding
of what is required and what is choice. That there is some control
is (IS) required--how many seems to be choice at this stage.
Would it be unwise of me to speculate that a control at each layer
is more optimal, or that the critical section that is delayed due
to "other stuff needing to be handled" should have taken precedence?
The only way that the guest OS or guest OS application can detect
such an event is if it measures an affected load/store - a covert
channel. So there may be security considerations.
Damn that high precision clock .....
Which also leads to the question of should a Virtual Machine have
its own virtual time ?? {Or VM and VMM share the concept of virtual
time} ??
Now, sure that lock is held while the page fault is being serviced,
and the ugly head of priority inversion takes hold. But ... I am in
need of some edumacation here.
Priority inversion is only applicable within a privilege level/ring.
Interrupts to a higher privilege level cannot be masked by an active
interrupt at a lower priority level.
So, if core is running HyperVisor at priority 15 and a user interrupt
arrives at a higher priority but directed at GuestOS (instead of HV)
does::
a) HV continue leaving higher priority interrupt waiting.
b) switch back to GuestOS for higher priority interrupt--in such
. a way that when GuestOS returns from interrupt HV takes over
. from whence it left.
This is really a question of what priority means across the entire
SW stack--and real-time versus Linux may have different answers on
this matter.
The higher privilege level must not unilaterally modify guest OS or
application state.
Given the almost complete lack of shared address spaces in a manner
where pointers can be passed between, there is almost nothing an HV
can do to a GuestOS VAS unless GuestOS has asked for an HV service via
a paravirtualization entry point.
mitchalsup@aol.com (MitchAlsup1) writes:
---------snip-----------
So, if core is running HyperVisor at priority 15 and a user interrupt
arrives at a higher priority but directed at GuestOS (instead of HV)
does::
a) HV continue leaving higher priority interrupt waiting.
b) switch back to GuestOS for higher priority interrupt--in such
. a way that when GuestOS returns from interrupt HV takes over
. from whence it left.
ARM, for example, splits the per-core interrupt priority range into
halves - one half is assigned to the secure monitor and the other is
assigned to the non-secure software running on the core.
Early hypervisors would field all non-secure interrupts and either
handle them themselves or inject them into the guest. The first ARM64
cores would field all interrupts in the HV and the int controller had
special registers the HV could use to inject interrupts into the guest.
The overhead was not insignificant, so they added a mechanism to allow
some interrupts to be directly fielded by the guest itself - avoiding
the round trip through the HV on every interrupt (called virtual LPIs).
On Wed, 16 Apr 2025 14:07:36 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
---------snip-----------
So, if core is running HyperVisor at priority 15 and a user interrupt
arrives at a higher priority but directed at GuestOS (instead of HV)
does::
a) HV continue leaving higher priority interrupt waiting.
b) switch back to GuestOS for higher priority interrupt--in such
. a way that when GuestOS returns from interrupt HV takes over
. from whence it left.
ARM, for example, splits the per-core interrupt priority range into
halves - one half is assigned to the secure monitor and the other is
assigned to the non-secure software running on the core.
Thus, my predilection for 64 priority levels (rather than ~8 as
suggested by another participant) allows for this distribution of
priorities across layers in the SW stack at the discretion of
trustable-SW.
Early hypervisors would field all non-secure interrupts and either
handle them themselves or inject them into the guest. The first ARM64
cores would field all interrupts in the HV and the int controller had
special registers the HV could use to inject interrupts into the guest.
The overhead was not insignificant, so they added a mechanism to allow
some interrupts to be directly fielded by the guest itself - avoiding
the round trip through the HV on every interrupt (called virtual LPIs).
Given 4 layers in the stack {Secure, Hyper, Super, User} and we have
interrupts targeting {Secure, Hyper, Super}, do we pick up any liability
or do we gain flexibility by being able to target interrupts directly
to {user} ?? (the 4th element).
On 4/16/2025 2:13 PM, MitchAlsup1 wrote:
On Wed, 16 Apr 2025 14:07:36 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:---------snip-----------
So, if core is running HyperVisor at priority 15 and a user interrupt
arrives at a higher priority but directed at GuestOS (instead of HV)
does::
a) HV continue leaving higher priority interrupt waiting.
b) switch back to GuestOS for higher priority interrupt--in such
. a way that when GuestOS returns from interrupt HV takes over
. from whence it left.
ARM, for example, splits the per-core interrupt priority range into
halves - one half is assigned to the secure monitor and the other is
assigned to the non-secure software running on the core.
Thus, my predilection for 64 priority levels (rather than ~8 as
suggested by another participant) allows for this distribution of
priorities across layers in the SW stack at the discretion of
trustable-SW.
Early hypervisors would field all non-secure interrupts and either
handle them themselves or inject them into the guest. The first ARM64
cores would field all interrupts in the HV and the int controller had
special registers the HV could use to inject interrupts into the guest.
The overhead was not insignificant, so they added a mechanism to allow
some interrupts to be directly fielded by the guest itself - avoiding
the round trip through the HV on every interrupt (called virtual LPIs).
Given 4 layers in the stack {Secure, Hyper, Super, User} and we have
interrupts targeting {Secure, Hyper, Super}, do we pick up any liability
or do we gain flexibility by being able to target interrupts directly to
{user} ?? (the 4th element).
I think you could gain a tiny amount of efficiency if the OS (super)
allowed the user to set up handlers for certain classes of exceptions
(e.g., divide faults) itself rather than having to go through the super.
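The existing software analogue, for comparison: POSIX already lets a
process claim divide faults itself via sigaction(SIGFPE), at the cost
of a kernel round trip on every fault:

#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

static void on_fpe(int sig) {
    (void)sig;
    static const char msg[] = "caught divide fault in user mode\n";
    write(2, msg, sizeof msg - 1);  /* async-signal-safe */
    _Exit(1);
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_handler = on_fpe;
    sigaction(SIGFPE, &sa, NULL);  /* user-registered handler */
    volatile int zero = 0;
    return 1 / zero;  /* raises SIGFPE on most hardware (e.g., x86) */
}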