• Tonight's tradeoff

    From Robert Finch@21:1/5 to All on Sun Nov 12 22:47:12 2023
    Branch miss logic versus clock frequency.

    The branch miss logic for the current OoO version of Thor is quite
    involved. It needs to back out the register source indexes to the last
    valid source before the branch instruction. To do this in a single
    cycle, the logic is about 25+ logic levels deep. I find this somewhat unacceptable.

    I can remove a lot of logic, improving the clock frequency
    substantially, by removing the branch miss logic that resets the
    register source IDs to the last valid source. Instead of stomping on
    the instructions on a miss and flushing them in a single cycle, I
    think the predicate for the instructions can be cleared, which will
    effectively turn them into NOPs. The value of the target register
    will be propagated in the reorder buffer, meaning the register source
    IDs need not be reset. The reorder buffer is only eight entries, so
    on average four entries would be turned into NOPs. The NOPs would
    still propagate through the reorder buffer, so it may take several
    clock cycles for them to be flushed from the buffer, meaning the
    branch latency for mispredicted branches would be quite high.
    However, if the clock frequency can be improved by 20% for all
    instructions, much of the lost performance on branches may be made up.
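
    As a rough illustration of the idea (a sketch only; the module and
    signal names are my assumptions, not Thor's actual design), the miss
    path reduces to computing which ROB entries are younger than the
    branch and clearing their predicate bits so they drain as NOPs:

      // Minimal SystemVerilog sketch: mask of ROB entries to NOPify on a
      // branch miss, for an assumed 8-entry circular ROB with a tail
      // (allocation) pointer. clear_pred[i]=1 => entry i retires as a NOP.
      module rob_squash_mask #(parameter ENTRIES = 8) (
        input  logic [$clog2(ENTRIES)-1:0] branch_idx, // ROB slot of branch
        input  logic [$clog2(ENTRIES)-1:0] tail,       // next slot to allocate
        output logic [ENTRIES-1:0]         clear_pred
      );
        always_comb
          for (int i = 0; i < ENTRIES; i++)
            // entry i is younger than the branch, in circular order
            if (branch_idx < tail)
              clear_pred[i] = (i > branch_idx) && (i < tail);
            else
              clear_pred[i] = (i > branch_idx) || (i < tail);
      endmodule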

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Robert Finch on Mon Nov 13 11:10:19 2023
    Robert Finch wrote:
    Branch miss logic versus clock frequency.

    The branch miss logic for the current OoO version of Thor is quite
    involved. It needs to back out the register source indexes to the last
    valid source before the branch instruction. To do this in a single
    cycle, the logic is about 25+ logic levels deep. I find this somewhat unacceptable.

    I can remove a lot of logic, improving the clock frequency
    substantially, by removing the branch miss logic that resets the
    register source IDs to the last valid source. Instead of stomping on
    the instructions on a miss and flushing them in a single cycle, I
    think the predicate for the instructions can be cleared, which will
    effectively turn them into NOPs. The value of the target register
    will be propagated in the reorder buffer, meaning the register source
    IDs need not be reset. The reorder buffer is only eight entries, so
    on average four entries would be turned into NOPs. The NOPs would
    still propagate through the reorder buffer, so it may take several
    clock cycles for them to be flushed from the buffer, meaning the
    branch latency for mispredicted branches would be quite high.
    However, if the clock frequency can be improved by 20% for all
    instructions, much of the lost performance on branches may be made up.

    Basically it sounds like you want to eliminate the checkpoint and rollback,
    and instead let resources be recovered at Retire. That could work.

    However, you are not restoring the Renamer's future Register Alias
    Table (RAT) to its state at the point of the mispredicted branch
    instruction, which is what the rollback would have done, so its state
    will be whatever it was at the end of the mispredicted sequence. That
    state needs to be re-synced with the program state as of the branch.

    That can be accomplished by stalling the front end, waiting until the
    mispredicted branch reaches Retire, then copying the committed RAT,
    maintained by Retire, to the future RAT at Rename, and restarting the
    front end.
    The list of free physical registers is then all those that are not
    marked as architectural registers.
    This is partly how I handle exceptions.
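
    A minimal sketch of that two-copy RAT arrangement (sizes and port
    names are assumed for illustration): the committed map is written at
    Retire, the future map at Rename, and recovery is a bulk copy of
    committed over future.

      module rat_pair #(parameter AREGS = 32, PREGS = 64) (
        input  logic                     clk,
        input  logic                     rename_we,   // new mapping at Rename
        input  logic [$clog2(AREGS)-1:0] rename_areg,
        input  logic [$clog2(PREGS)-1:0] rename_preg,
        input  logic                     retire_we,   // mapping commits at Retire
        input  logic [$clog2(AREGS)-1:0] retire_areg,
        input  logic [$clog2(PREGS)-1:0] retire_preg,
        input  logic                     recover,     // mispredicted branch retired
        output logic [$clog2(PREGS)-1:0] future_rat   [AREGS],
        output logic [$clog2(PREGS)-1:0] committed_rat[AREGS]
      );
        always_ff @(posedge clk) begin
          if (retire_we)
            committed_rat[retire_areg] <= retire_preg;
          if (recover)
            future_rat <= committed_rat;            // bulk restore, one shot
          else if (rename_we)
            future_rat[rename_areg] <= rename_preg;
        end
      endmodule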

    Also you still need a mechanism to cancel start of execution of the
    subset of pending uOps for the purged set. You don't want to launch
    a LD or DIV from the mispredicted set if it has not already started.
    If you are using a reservation station design then you need some way
    to distribute the cancel request to the various FU's and RS's,
    and wait for them to clean themselves up.

    Note that some things might not be able to cancel immediately,
    like an in-flight MUL in a pipeline or an outstanding LD to the cache.
    So some of this will be asynchronous (send cancel request, wait for ACK).

    There are some other things that might need cleanup.
    A Return Stack Predictor might be manipulated by the mispredicted path.
    Not sure how to handle that without a checkpoint.
    Maybe have two copies like RAT, a future one maintained by Decode and
    a committed one maintained by Retire, and copy the committed to future.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Mon Nov 13 19:47:27 2023
    Robert Finch wrote:

    Branch miss logic versus clock frequency.

    The branch miss logic for the current OoO version of Thor is quite
    involved. It needs to back out the register source indexes to the last
    valid source before the branch instruction. To do this in a single
    cycle, the logic is about 25+ logic levels deep. I find this somewhat unacceptable.
    <
    When you launch a predicted branch into execution (the prelude to
    signaling that recovery is required), while the branch is determining
    whether to back up (or not), have the branch recovery logic set up
    the register indexes such that::
    a) if the branch succeeds, you keep the current map;
    b) if the branch fails, you are 1 multiplexer delay from having the
    state you want.
    <
    That is, move the setup for the repair to the previous clock.
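
    A sketch of that timing trick (the signal names are mine, not Thor's
    or My 66000's): the checkpointed map is staged into a register during
    the cycle the branch resolves, so a miss only has to steer a 2:1 mux.

      module map_select #(parameter W = 6, N = 32) (
        input  logic         clk,
        input  logic [W-1:0] current_map    [N], // map if prediction holds
        input  logic [W-1:0] checkpoint_map [N], // map saved at the branch
        input  logic         branch_resolving,   // branch is in execute
        input  logic         branch_fail,        // resolved: mispredicted
        output logic [W-1:0] next_map       [N]
      );
        logic [W-1:0] staged [N];
        always_ff @(posedge clk)
          if (branch_resolving)
            staged <= checkpoint_map;            // setup, the previous clock
        always_comb
          for (int i = 0; i < N; i++)            // 1 mux delay on a miss
            next_map[i] = branch_fail ? staged[i] : current_map[i];
      endmodule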
    <
    I can remove a lot of logic, improving the clock frequency
    substantially, by removing the branch miss logic that resets the
    register source IDs to the last valid source. Instead of stomping on
    the instructions on a miss and flushing them in a single cycle, I
    think the predicate for the instructions can be cleared, which will
    effectively turn them into NOPs. The value of the target register
    will be propagated in the reorder buffer, meaning the register source
    IDs need not be reset. The reorder buffer is only eight entries, so
    on average four entries would be turned into NOPs. The NOPs would
    still propagate through the reorder buffer, so it may take several
    clock cycles for them to be flushed from the buffer, meaning the
    branch latency for mispredicted branches would be quite high.
    However, if the clock frequency can be improved by 20% for all
    instructions, much of the lost performance on branches may be made up.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Mon Nov 13 20:01:53 2023
    EricP wrote:

    Robert Finch wrote:
    Branch miss logic versus clock frequency.

    The branch miss logic for the current OoO version of Thor is quite
    involved. It needs to back out the register source indexes to the last
    valid source before the branch instruction. To do this in a single
    cycle, the logic is about 25+ logic levels deep. I find this somewhat
    unacceptable.

    I can remove a lot of logic, improving the clock frequency
    substantially, by removing the branch miss logic that resets the
    register source IDs to the last valid source. Instead of stomping on
    the instructions on a miss and flushing them in a single cycle, I
    think the predicate for the instructions can be cleared, which will
    effectively turn them into NOPs. The value of the target register
    will be propagated in the reorder buffer, meaning the register source
    IDs need not be reset. The reorder buffer is only eight entries, so
    on average four entries would be turned into NOPs. The NOPs would
    still propagate through the reorder buffer, so it may take several
    clock cycles for them to be flushed from the buffer, meaning the
    branch latency for mispredicted branches would be quite high.
    However, if the clock frequency can be improved by 20% for all
    instructions, much of the lost performance on branches may be made up.

    Basically it sounds like you want to eliminate the checkpoint and rollback, and instead let resources be recovered at Retire. That could work.

    However, you are not restoring the Renamer's future Register Alias
    Table (RAT) to its state at the point of the mispredicted branch
    instruction, which is what the rollback would have done, so its state
    will be whatever it was at the end of the mispredicted sequence. That
    state needs to be re-synced with the program state as of the branch.
    <
    I, personally, don't use a RAT--I use a CAM-based architectural decoder
    for operand reads and a standard physical equality decoder for writes.
    <
    Every cycle the CAM.valid bits are block-loaded into a history table,
    and if you need to return the CAMs to the checkpointed mappings, you
    take the valid bits from the history table and write the CAM.valid
    bits back into the physical register file. Presto, the map is how it
    used to be.
    <
    Can even be made to be performed in 0 cycles. {yes: 0, not 1 cycle}
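
    A rough sketch of the valid-bit history mechanism (sizes assumed: 64
    physical registers, 16 checkpoints; this is my reading of the scheme,
    not Mitch's actual logic):

      module cam_valid_history #(parameter PREGS = 64, CKPTS = 16) (
        input  logic                     clk,
        input  logic [PREGS-1:0]         cam_valid_in, // bits from rename CAMs
        input  logic [$clog2(CKPTS)-1:0] wr_ckpt,      // checkpoint being recorded
        input  logic                     restore,      // branch mispredicted
        input  logic [$clog2(CKPTS)-1:0] rd_ckpt,      // checkpoint to return to
        output logic [PREGS-1:0]         cam_valid_out
      );
        logic [PREGS-1:0] history [CKPTS];
        always_ff @(posedge clk)
          history[wr_ckpt] <= cam_valid_in;    // block-load, every cycle
        // recovery is just a read of the saved bits back toward the CAMs
        always_comb
          cam_valid_out = restore ? history[rd_ckpt] : cam_valid_in;
      endmodule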
    <
    That can be accomplished by stalling the front end, waiting until the
    mispredicted branch reaches Retire, then copying the committed RAT,
    maintained by Retire, to the future RAT at Rename, and restarting the
    front end.
    The list of free physical registers is then all those that are not
    marked as architectural registers.
    <
    Sounds slow.
    <
    This is partly how I handle exceptions.

    Also you still need a mechanism to cancel start of execution of the
    subset of pending uOps for the purged set. You don't want to launch
    a LD or DIV from the mispredicted set if it has not already started.
    If you are using a reservation station design then you need some way
    to distribute the cancel request to the various FU's and RS's,
    and wait for them to clean themselves up.
    <
    I use the concept of an execution window to do this at both the
    reservation stations and the function units. There is an insert
    pointer and a consistent pointer; an RS is only allowed to launch when
    its instruction is between the two. FUs are only allowed to calculate
    so long as the instruction remains between these 2 pointers. The 2
    pointers (4 bits each) are broadcast around the machine every cycle.
    Each station and unit decides for itself.
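
    The between-check itself is cheap; a sketch with the stated 4-bit
    pointers (port names assumed):

      module in_window (
        input  logic [3:0] tag,        // this instruction's window tag
        input  logic [3:0] consistent, // oldest not-yet-certain instruction
        input  logic [3:0] insert,     // next tag to be allocated
        output logic       allowed     // ok to launch / keep calculating
      );
        // circular compare: distances from 'consistent', modulo 16
        assign allowed = (tag - consistent) < (insert - consistent);
      endmodule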

    Note that some things might not be able to cancel immediately,
    like an in-flight MUL in a pipeline or an outstanding LD to the cache.
    So some of this will be asynchronous (send cancel request, wait for ACK).
    <
    If an instruction that should not have its result delivered is delivered,
    it is delivered to the physical register it was assigned at its issue time.
    But since the value had not been delivered, that register is not in the
    pool of assignable registers, so no dependency has been created.
    <
    There are some other things that might need cleanup.
    A Return Stack Predictor might be manipulated by the mispredicted path.
    <
    Do these with a linked list and you can back up a mispredicted return
    to a mispredicted call.
    <
    Not sure how to handle that without a checkpoint.
    <
    Every (non-exceptional) flow-altering instruction needs a checkpoint.
    Predicated strings of instructions use a lightweight checkpoint;
    predicted branches use a heavyweight version.
    <
    Maybe have two copies like RAT, a future one maintained by Decode and
    a committed one maintained by Retire, and copy the committed to future.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Finch@21:1/5 to All on Wed Nov 15 01:21:16 2023
    Decided to shelve Thor2024 and begin work on Thor2025. While Thor2024 is
    very good there are a few issues with it. The ROB is used to store
    register values and that is effectively a CAM. It is not very resource efficient in an FPGA. I have been researching an x86 OoO implementation (https://www.stuffedcow.net/files/henry-thesis-phd.pdf ) done in an FPGA
    and it turns out to be considerably smaller than Thor. There are more
    efficient implementations for components than what is currently in use.

    Thor2025 will use a PRF approach although using a PRF seems large to me.
    To reduce the size and complexity of the register file, separate
    register files will be used for float and integer operations, along with separate register files for vector mask registers and subroutine link registers. This set of register files limits the GPR file to only 3
    write ports and 18 read ports to support all the functional units.
    Currently the register file is 10r2w.

    The trade-off is block RAM usage instead of LUTs.

    While having separate registers files seems like a step backwards, it
    should ultimately make the hardware more resource efficient. It does
    impact the ISA spec.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Wed Nov 15 19:11:31 2023
    Robert Finch wrote:

    Decided to shelve Thor2024 and begin work on Thor2025. While Thor2024 is
    very good there are a few issues with it. The ROB is used to store
    register values and that is effectively a CAM. It is not very resource efficient in an FPGA. I have been researching an x86 OoO implementation (https://www.stuffedcow.net/files/henry-thesis-phd.pdf ) done in an FPGA
    and it turns out to be considerably smaller than Thor. There are more efficient implementations for components than what is currently in use.

    Thor2025 will use a PRF approach although using a PRF seems large to me.
    <
    I have a PRF design I could show you--way too big for comp.arch and
    with the requisite figures.
    <
    To reduce the size and complexity of the register file, separate
    register files will be used for float and integer operations, along with separate register files for vector mask registers and subroutine link registers. This set of register files limits the GPR file to only 3
    write ports and 18 read ports to support all the functional units.
    Currently the register file is 10r2w.

    The trade-off is block RAM usage instead of LUTs.

    While having separate registers files seems like a step backwards, it
    should ultimately make the hardware more resource efficient. It does
    impact the ISA spec.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Finch@21:1/5 to MitchAlsup on Fri Nov 17 22:39:45 2023
    On 2023-11-15 2:11 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    Decided to shelve Thor2024 and begin work on Thor2025. While Thor2024
    is very good there are a few issues with it. The ROB is used to store
    register values and that is effectively a CAM. It is not very resource
    efficient in an FPGA. I have been researching an x86 OoO
    implementation (https://www.stuffedcow.net/files/henry-thesis-phd.pdf
    ) done in an FPGA and it turns out to be considerably smaller than
    Thor. There are more efficient implementations for components than
    what is currently in use.

    Thor2025 will use a PRF approach although using a PRF seems large to me.
    <
    I have a PRF design I could show you--way too big for comp.arch and
    with the requisite figures.
    <
    To reduce the size and complexity of the register file, separate
    register files will be used for float and integer operations, along
    with separate register files for vector mask registers and subroutine
    link registers. This set of register files limits the GPR file to only
    3 write ports and 18 read ports to support all the functional units.
    Currently the register file is 10r2w.

    The trade-off is block RAM usage instead of LUTs.

    While having separate registers files seems like a step backwards, it
    should ultimately make the hardware more resource efficient. It does
    impact the ISA spec.

    Still digesting the PRF diagram.

    Decided to go with a unified register file, 27r3w so far. Having
    separate register files would not reduce the number of read ports
    required and would add complexity to the processor.

    Loads, FPU operations and flow control (FCU) operations all share the
    third write port of the register file. The other two write ports are
    dedicated to the ALU results. I think this will be okay given <1% of instructions would be FCU updates. Loads are about 25%, and FPU depends
    on the application.

    The ALUs/FPU/Loads have five input operands including the 3 source
    operands, a target operand, and a mask register. Stores do not need a
    target operand. FCU ops are non-masked so do not need a mask register or
    target operand input.

    Not planning to implement the vector register file as it would be immense.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Finch@21:1/5 to Robert Finch on Sat Nov 18 05:58:42 2023
    On 2023-11-17 10:39 p.m., Robert Finch wrote:
    On 2023-11-15 2:11 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    Decided to shelve Thor2024 and begin work on Thor2025. While Thor2024
    is very good there are a few issues with it. The ROB is used to store
    register values and that is effectively a CAM. It is not very
    resource efficient in an FPGA. I have been researching an x86 OoO
    implementation (https://www.stuffedcow.net/files/henry-thesis-phd.pdf
    ) done in an FPGA and it turns out to be considerably smaller than
    Thor. There are more efficient implementations for components than
    what is currently in use.

    Thor2025 will use a PRF approach although using a PRF seems large to me.
    <
    I have a PRF design I could show you--way too big for comp.arch and
    with the requisite figures.
    <
    To reduce the size and complexity of the register file, separate
    register files will be used for float and integer operations, along
    with separate register files for vector mask registers and subroutine
    link registers. This set of register files limits the GPR file to
    only 3 write ports and 18 read ports to support all the functional
    units. Currently the register file is 10r2w.

    The trade-off is block RAM usage instead of LUTs.

    While having separate registers files seems like a step backwards, it
    should ultimately make the hardware more resource efficient. It does
    impact the ISA spec.

    Still digesting the PRF diagram.

    Decided to go with a unified register file, 27r3w so far. Having
    separate register files would not reduce the number of read ports
    required and would add complexity to the processor.

    Loads, FPU operations and flow control (FCU) operations all share the
    third write port of the register file. The other two write ports are dedicated to the ALU results. I think this will be okay given <1% of instructions would be FCU updates. Loads are about 25%, and FPU depends
    on the application.

    The ALUs/FPU/Loads have five input operands including the 3 source
    operands, a target operand, and a mask register. Stores do not need a
    target operand. FCU ops are non-masked so do not need a mask register or target operand input.

    Not planning to implement the vector register file as it would be immense.

    Changed the moniker of my current processor project from Thor to Qupls
    (Q-Plus). I wanted a five-letter name beginning with ‘Q’. For a moment
    I thought of calling it Quake but thought that would be too confusing.
    One must understand the magic behind name choices.

    The current design uses instruction postfixes of 32, 48, 80, and 144
    bits, which provide constants of 23, 39, 64, and 128 bits. Two bits in
    the instruction indicate the postfix size. The 64- and 128-bit
    constants have seven extra unused bits available, the fields available
    being 71 and 135 bits.
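
    A decode sketch of those sizes (the encoding of the two size bits is
    my assumption; the widths are from the text):

      module pfx_decode (
        input  logic [1:0]  pfx_size,             // two bits in the instruction
        output int unsigned pfx_bits, const_bits  // postfix / constant width
      );
        always_comb
          unique case (pfx_size)
            2'd0: begin pfx_bits = 32;  const_bits = 23;  end
            2'd1: begin pfx_bits = 48;  const_bits = 39;  end
            2'd2: begin pfx_bits = 80;  const_bits = 64;  end // 71 bits available
            2'd3: begin pfx_bits = 144; const_bits = 128; end // 135 bits available
          endcase
      endmodule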

    Somewhat ugly, but it is desired to keep instructions a multiple of 16
    bits in size. The shortest instruction is a NOP, which is 16 bits, so
    that it may be used for alignment.

    I almost switched to 96-bit floats, which seem appealing, but once
    again remembered that the progression of 32-, 64-, and 128-bit floats
    works very well for the float approximations.

    Branches are 48-bit, being a combination of a compare and a branch with
    a 24-bit target address field. Other flow control ops like JSR and JMP
    are also 48-bit to keep all flow controls at 48-bit for simplified decoding.

    Most instructions are 32-bits in size.

    Sticking with a 64-register unified register file.

    Removed the vector operations. There is enough play in the ISA to add
    them at a later date if desired.

    Loads and stores support two address modes, d(Rn) and d(Rn+Rm*Sc). The
    scaled-index address mode will likely be a 48-bit op.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Sat Nov 18 17:27:50 2023
    Robert Finch wrote:

    On 2023-11-15 2:11 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    Decided to shelve Thor2024 and begin work on Thor2025. While Thor2024
    is very good there are a few issues with it. The ROB is used to store
    register values and that is effectively a CAM. It is not very resource
    efficient in an FPGA. I have been researching an x86 OoO
    implementation (https://www.stuffedcow.net/files/henry-thesis-phd.pdf
    ) done in an FPGA and it turns out to be considerably smaller than
    Thor. There are more efficient implementations for components than
    what is currently in use.

    Thor2025 will use a PRF approach although using a PRF seems large to me.
    <
    I have a PRF design I could show you--way too big for comp.arch and
    with the requisite figures.
    <
    To reduce the size and complexity of the register file, separate
    register files will be used for float and integer operations, along
    with separate register files for vector mask registers and subroutine
    link registers. This set of register files limits the GPR file to only
    3 write ports and 18 read ports to support all the functional units.
    Currently the register file is 10r2w.

    The trade-off is block RAM usage instead of LUTs.

    While having separate registers files seems like a step backwards, it
    should ultimately make the hardware more resource efficient. It does
    impact the ISA spec.

    Still digesting the PRF diagram.

    The diagram is for a 6R6W PRF with a history table, ARN->PRN translation,
    Free pool pickers, and register ports. The X with a ½ box is a latch
    or flip-flop depending on the clocking that is put around the figure.
    It also includes the renamer {history table and free pool pickers}.

    Decided to go with a unified register file, 27r3w so far. Having
    separate register files would not reduce the number of read ports
    required and would add complexity to the processor.

    9 Reads per 1 write ?!?!?

    Loads, FPU operations and flow control (FCU) operations all share the
    third write port of the register file. The other two write ports are dedicated to the ALU results. I think this will be okay given <1% of instructions would be FCU updates. Loads are about 25%, and FPU depends
    on the application.

    The ALUs/FPU/Loads have five input operands including the 3 source
    operands, a target operand, and a mask register. Stores do not need a
    target operand. FCU ops are non-masked so do not need a mask register or target operand input.

    Not planning to implement the vector register file as it would be immense.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Finch@21:1/5 to MitchAlsup on Sat Nov 18 14:41:14 2023
    On 2023-11-18 12:27 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    On 2023-11-15 2:11 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    Decided to shelve Thor2024 and begin work on Thor2025. While
    Thor2024 is very good there are a few issues with it. The ROB is
    used to store register values and that is effectively a CAM. It is
    not very resource efficient in an FPGA. I have been researching an
    x86 OoO implementation
    (https://www.stuffedcow.net/files/henry-thesis-phd.pdf ) done in an
    FPGA and it turns out to be considerably smaller than Thor. There
    are more efficient implementations for components than what is
    currently in use.

    Thor2025 will use a PRF approach although using a PRF seems large to
    me.
    <
    I have a PRF design I could show you--way too big for comp.arch and
    with the requisite figures.
    <
    To reduce the size and complexity of the register file, separate
    register files will be used for float and integer operations, along
    with separate register files for vector mask registers and
    subroutine link registers. This set of register files limits the GPR
    file to only 3 write ports and 18 read ports to support all the
    functional units. Currently the register file is 10r2w.

    The trade-off is block RAM usage instead of LUTs.

    While having separate registers files seems like a step backwards,
    it should ultimately make the hardware more resource efficient. It
    does impact the ISA spec.

    Still digesting the PRF diagram.

    The diagram is for a 6R6W PRF with a history table, ARN->PRN translation, Free pool pickers, and register ports. The X with a ½ box is a latch
    or flip-flop depending on the clocking that is put around the figure.
    It also includes the renamer {history table and free pool pickers}.

    Decided to go with a unified register file, 27r3w so far. Having
    separate register files would not reduce the number of read ports
    required and would add complexity to the processor.

    9 Reads per 1 write ?!?!?

    Loads, FPU operations and flow control (FCU) operations all share the
    third write port of the register file. The other two write ports are
    dedicated to the ALU results. I think this will be okay given <1% of
    instructions would be FCU updates. Loads are about 25%, and FPU
    depends on the application.

    The ALUs/FPU/Loads have five input operands including the 3 source
    operands, a target operand, and a mask register. Stores do not need a
    target operand. FCU ops are non-masked so do not need a mask register
    or target operand input.

    Not planning to implement the vector register file as it would be
    immense.
    Freelist:

    I just used the find-first/find-last-one trick on a bit-list to pick a
    PR for an AR. It can provide PRs for two ARs per cycle. I have all the
    PRs from the ROB feeding into the list manager so that on a branch
    miss the PRs can be freed up (just the portion of the PRs associated
    with the miss is freed). Three discarded PRs from commit also feed
    into the list manager so they can be freed. It seems like a lot of
    logic translating the PR to a bit, and it seems a bit impractical to
    me to feed all the PRs from the ROB to the list manager. It can be
    done with the smallish 16-entry ROB, but for a larger ROB the free may
    have to be split up or another means found.
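
    A minimal sketch of that picker for an assumed 64-entry PR pool: one
    scan finds the first free bit, the other the last, so two PRs can be
    handed out per cycle.

      module freelist_pick2 #(parameter PREGS = 64) (
        input  logic [PREGS-1:0]         free,  // 1 = physical register free
        output logic [$clog2(PREGS)-1:0] pick0, // find-first-one
        output logic [$clog2(PREGS)-1:0] pick1, // find-last-one
        output logic                     ok0, ok1
      );
        always_comb begin
          pick0 = '0;
          pick1 = '0;
          for (int i = PREGS-1; i >= 0; i--)
            if (free[i]) pick0 = i[$clog2(PREGS)-1:0]; // lowest set bit wins
          for (int i = 0; i < PREGS; i++)
            if (free[i]) pick1 = i[$clog2(PREGS)-1:0]; // highest set bit wins
        end
        assign ok0 = |free;                     // at least one PR free
        assign ok1 = ok0 && (pick0 != pick1);   // at least two PRs free
      endmodule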

    RAT:

    A register alias table is being used to track the mappings of ARs to
    PRs. It uses two maps: speculative and committed. On instruction
    enqueue, speculative mappings are updated; on commit, committed
    mappings are updated; and on a pipeline flush, the committed map is
    copied to the speculative one.

    Register file:

    I’ve reduced the number of read ports by not supporting the vector
    stuff. There are only 18 read ports: six groups of three.

    ROB:
    The ROB acts like a CAM to store both the aRN and pRN for the target
    register. The aRN is needed to know which previous pRN to free on
    commit. For source operands only the pRN is stored.
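
    An assumed shape for such an entry (field widths illustrative: 64 ARs
    and a 128-entry PRF; not the actual Q+ layout):

      typedef struct packed {
        logic [5:0] aRN;     // architectural target; used at commit to find
                             // the previous pRN for that AR and free it
        logic [6:0] pRN;     // physical target allocated at rename
        logic [6:0] pRNsrc1; // sources carry only physical numbers
        logic [6:0] pRNsrc2;
        logic [6:0] pRNsrc3;
        logic       v;       // entry valid
      } rob_entry_t;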

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Finch@21:1/5 to All on Fri Nov 24 19:32:09 2023
    On 2023-11-18 2:41 p.m., Robert Finch wrote:
    Q+ uses 64kB memory pages containing 8192 PTEs to map memory. A single
    64kB page can handle 512MB of mappings. Tonight’s trade-off is how many
    root pointers to support. With a 12-bit ASID, 4096 root pointers are
    required to link to the mapping tables with one root pointer for each
    address space. A 512 MB space is probably sufficient for a large number
    of apps. That means a TLB update needs only a single root pointer
    lookup followed by a lookup of the translation from a single memory
    page.
    Not much for the table walker to do. The 4096 root pointers use two
    block RAMs and require an 8192-byte address space for update assuming a
    32-bit physical address space (a 16-bit root page number).
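
    The arithmetic behind those numbers, written out as an assumed address
    split: a 64kB page holds 8192 PTEs, so each PTE is 8 bytes, and 8192
    PTEs x 64kB per page = 512MB mapped per table page.

      // Illustrative low-order split of a Q+ virtual address (my naming):
      typedef struct packed {
        logic [12:0] pte_index;   // 13 bits: one of 8192 PTEs in the page
        logic [15:0] page_offset; // 16 bits: offset within a 64kB page
      } va_low29_t;               // 13 + 16 = 29 bits = 512MB of mappings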

    An IO mapped area of 64kB is available for root pointer memory. 16 block
    RAMs could be set up in this area, which would allow 8 root pointers for
    each address space. Three bits of the virtual address space could then
    be mapped using root pointers. If the root pointer just points to a
    single level of page tables, then a 4GB (32-bit) space could be mapped.
    I am mulling over whether it is worth it to support the additional root pointers. It is a chunk of block RAM memory that might be better spent elsewhere.

    If I use an 11-bit ASID, all the root pointers could be present in a
    single block RAM. So the design choices are an 11- or 12-bit ASID, and
    1 or 8 root pointers per address space.

    My thought is to have only a single root pointer per space, and to
    organize the root pointer table as if there were 32 bits for the
    pointer. This would allow a 48-bit physical address space in which to
    place the mapping tables. The RAM could be mapped so that the
    high-order bits of the pointer are assumed to be zero. The system
    could get by using a single block RAM if the mapping tables' location
    were restricted to a 16MB address range; eight-bit pointers could be
    used then.

    Given that it is a small system, with only 512MB of DRAM, I think it
    best to keep the page-table-walker simple, and use the minimum amount of
    BRAM (1).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Sat Nov 25 01:00:29 2023
    Robert Finch wrote:

    On 2023-11-18 2:41 p.m., Robert Finch wrote:
    Q+ uses 64kB memory pages containing 8192 PTEs to map memory. A single
    64kB page can handle 512MB of mappings. Tonight’s trade-off is how many root pointers to support. With a 12-bit ASID, 4096 root pointers are
    required to link to the mapping tables with one root pointer for each
    address space.

    So, you associate a single ROOT pointer VALUE with an ASID, and manage
    in SW who gets that ROOT pointer VALUE; using ASID as an index into
    Virtual Address Spaces.

    How is this usefully different than only using the ASID to qualify TLB
    results ?? <Was this TLB entry installed from the same ASID as is
    accessing right now>. And using ASID as an index into any array might
    lead to some conundrum down the road a piece.

    Secondarily, SUN started out with a 12-bit ASID and went to 16 bits
    just about as fast as they could--even before main memories went
    bigger than 4GB.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Finch@21:1/5 to MitchAlsup on Fri Nov 24 21:28:25 2023
    On 2023-11-24 8:00 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    On 2023-11-18 2:41 p.m., Robert Finch wrote:
    Q+ uses 64kB memory pages containing 8192 PTEs to map memory. A single
    64kB page can handle 512MB of mappings. Tonight’s trade-off is how
    many root pointers to support. With a 12-bit ASID, 4096 root pointers
    are required to link to the mapping tables with one root pointer for
    each address space.

    So, you associate a single ROOT pointer VALUE with an ASID, and manage
    in SW who gets that ROOT pointer VALUE; using ASID as an index into
    Virtual Address Spaces.

    How is this usefully different than only using the ASID to qualify TLB
    results ?? <Was this TLB entry installed from the same ASID as is
    accessing right now>. And using ASID as an index into any array might
    lead to some conundrum down the road a piece.

    Secondarily, SUN started out with 12-bit ASID and went to 16-bits just
    about
    as fast as they could--even before main memories went bigger than 4GB.

    I view the address space as an entity in its own right to be managed
    by the MMU. ASIDs and address spaces should be mapped 1:1. The ASID
    that identifies the address space has a life outside of just the TLB.
    I may be increasing the typical scope of an ASID.

    It is the same idea as using the ASID to qualify TLB entries, except
    that it qualifies the root pointer as well. So, the root pointer does
    not need to be switched by software. Once the root pointer is set for
    the AS it simply sits there statically until the AS is reused.

    I am using the ASID like a process ID. So, the root pointer register
    does not need to be reset on a task switch. Address spaces may not be
    mapped 1:1 with processes. An address space may outlive a task if it is
    shared with another task. So, I do not want to use the PID to
    distinguish tables. This assumes the address space will not be freed up
    and reused by another task while there are still tasks using the ASID.

    4096 address spaces is a lot. But if using a 16-bit ASID it would no
    longer be practical to store a root pointer per ASID in a table.
    Instead, the root pointer would have to be managed by software as is
    normally done.

    I am wondering why the 16-bit ASID? 256 address spaces in 256
    processes? I suspect it is just because 16 bits is easier to pass
    around/calculate in a HLL than some other value like 14 bits. Are
    65536 address spaces really needed?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Robert Finch on Fri Nov 24 21:16:43 2023
    On 11/24/2023 8:28 PM, Robert Finch wrote:
    On 2023-11-24 8:00 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    On 2023-11-18 2:41 p.m., Robert Finch wrote:
    Q+ uses 64kB memory pages containing 8192 PTEs to map memory. A
    single 64kB page can handle 512MB of mappings. Tonight’s trade-off is
    how many root pointers to support. With a 12-bit ASID, 4096 root
    pointers are required to link to the mapping tables with one root
    pointer for each address space.

    So, you associate a single ROOT pointer VALUE with an ASID, and manage
    in SW who gets that ROOT pointer VALUE; using ASID as an index into
    Virtual Address Spaces.

    How is this usefully different than only using the ASID to qualify TLB
    results ?? <Was this TLB entry installed from the same ASID as is
    accessing right now>. And using ASID as an index into any array might
    lead to some conundrum down the road a piece.

    Secondarily, SUN started out with 12-bit ASID and went to 16-bits just
    about
    as fast as they could--even before main memories went bigger than 4GB.

    I view the address space as an entity in its own right to be managed
    by the MMU. ASIDs and address spaces should be mapped 1:1. The ASID
    that identifies the address space has a life outside of just the TLB.
    I may be increasing the typical scope of an ASID.

    It is the same idea as using the ASID to qualify TLB entries, except
    that it qualifies the root pointer as well. So, the root pointer does
    not need to be switched by software. Once the root pointer is set for
    the AS it simply sits there statically until the AS is reused.

    I am using the ASID like a process ID. So, the root pointer register
    does not need to be reset on a task switch. Address spaces may not be
    mapped 1:1 with processes. An address space may outlive a task if it is shared with another task. So, I do not want to use the PID to
    distinguish tables. This assumes the address space will not be freed up
    and reused by another task while there are still tasks using the ASID.

    4096 address spaces is a lot. But if using a 16-bit ASID it would no
    longer be practical to store a root pointer per ASID in a table.
    Instead, the root pointer would have to be managed by software as is
    normally done.

    I am wondering why the 16-bit ASID? 256 address spaces in 256
    processes? I suspect it is just because 16 bits is easier to pass
    around/calculate in a HLL than some other value like 14 bits. Are
    65536 address spaces really needed?


    If one assumes one address space per PID, then one is going to hit a
    limit of 4K a lot faster than 64K, and when one hits the limit, there is
    no good way to "reclaim" previously used address spaces short of
    flushing the TLB to be sure that no entries from that space remain in
    the TLB (ASID thrashing is likely to be relatively expensive to deal
    with as a result).



    Well, along with other things, like if/how to allow "Global" pages:
    True global pages are likely a foot gun, as there is no way to exclude
    them from a given process (where there may be a need to do so);
    Disallowing global pages entirely means higher TLB miss rates because no processes can share TLB entries.

    One option seems to be, say, that a few of the high-order bits of the
    ASID could be used as a "page group", with global pages only applying
    within a single page-group (possibly with one of the page groups being designated as "No global pages allowed").

    ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Finch@21:1/5 to BGB on Fri Nov 24 22:48:35 2023
    On 2023-11-24 10:16 p.m., BGB wrote:
    On 11/24/2023 8:28 PM, Robert Finch wrote:
    On 2023-11-24 8:00 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    On 2023-11-18 2:41 p.m., Robert Finch wrote:
    Q+ uses 64kB memory pages containing 8192 PTEs to map memory. A
    single 64kB page can handle 512MB of mappings. Tonight’s trade-off
    is how many root pointers to support. With a 12-bit ASID, 4096 root
    pointers are required to link to the mapping tables with one root
    pointer for each address space.

    So, you associate a single ROOT pointer VALUE with an ASID, and manage
    in SW who gets that ROOT pointer VALUE; using ASID as an index into
    Virtual Address Spaces.

    How is this usefully different than only using the ASID to qualify TLB
    results ?? <Was this TLB entry installed from the same ASID as is
    accessing right now>. And using ASID as an index into any array might
    lead to some conundrum down the road a piece.

    Secondarily, SUN started out with 12-bit ASID and went to 16-bits
    just about
    as fast as they could--even before main memories went bigger than 4GB.

    I view the address space as an entity in its own right to be managed
    by the MMU. ASIDs and address spaces should be mapped 1:1. The ASID
    that identifies the address space has a life outside of just the TLB.
    I may be increasing the typical scope of an ASID.

    It is the same idea as using the ASID to qualify TLB entries, except
    that it qualifies the root pointer as well. So, the root pointer does
    not need to be switched by software. Once the root pointer is set for
    the AS it simply sits there statically until the AS is reused.

    I am using the ASID like a process ID. So, the root pointer register
    does not need to be reset on a task switch. Address spaces may not be
    mapped 1:1 with processes. An address space may outlive a task if it
    is shared with another task. So, I do not want to use the PID to
    distinguish tables. This assumes the address space will not be freed up
    and reused by another task while there are still tasks using the ASID.

    4096 address spaces is a lot. But if using a 16-bit ASID it would no
    longer be practical to store a root pointer per ASID in a table.
    Instead, the root pointer would have to be managed by software as is
    normally done.

    I am wondering why the 16-bit ASID? 256 address spaces in 256
    processes? I suspect it is just because 16 bits is easier to pass
    around/calculate in a HLL than some other value like 14 bits. Are
    65536 address spaces really needed?


    If one assumes one address space per PID, then one is going to hit a
    limit of 4K a lot faster than 64K, and when one hits the limit, there is
    no good way to "reclaim" previously used address spaces short of
    flushing the TLB to be sure that no entries from that space remain in
    the TLB (ASID thrashing is likely to be relatively expensive to deal
    with as a result).

    I see after reading several webpages that the root pointer is used to
    point to only a single table for a process. This is not how I was doing
    things. I have MMU tables for each address space, as opposed to having
    a table for the process. The process may have only a single address
    space, or it may use several address spaces.

    I am wondering why there is only a single table per process.


    Well, along with other things, like if/how to allow "Global" pages:
    True global pages are likely a foot gun, as there is no way to exclude
    them from a given process (where there may be a need to do so);
    Disallowing global pages entirely means higher TLB miss rates because no processes can share TLB entries.

    Global space can be assigned by designating an address space as a
    global space and giving it an ASID. All processes wanting access to
    the global space need only then use the MMU table for that ASID. E.g.,
    use ASID 0 for the global address space.

    One option seems to be, say, that a few of the high-order bits of the
    ASID could be used as a "page group", with global pages only applying
    within a single page-group (possibly with one of the page groups being designated as "No global pages allowed").

    ...


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Sat Nov 25 17:11:09 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Robert Finch wrote:

    On 2023-11-18 2:41 p.m., Robert Finch wrote:
    Q+ uses 64kB memory pages containing 8192 PTEs to map memory. A single
    64kB page can handle 512MB of mappings. Tonight’s trade-off is how many
    root pointers to support. With a 12-bit ASID, 4096 root pointers are
    required to link to the mapping tables with one root pointer for each
    address space.

    So, you associate a single ROOT pointer VALUE with an ASID, and manage
    in SW who gets that ROOT pointer VALUE; using ASID as an index into
    Virtual Address Spaces.

    How is this usefully different than only using the ASID to qualify TLB
    results ?? <Was this TLB entry installed from the same ASID as is
    accessing right now>. And using ASID as an index into any array might
    lead to some conundrum down the road a piece.

    Secondarily, SUN started out with a 12-bit ASID and went to 16 bits
    just about as fast as they could--even before main memories went
    bigger than 4GB.

    Yeah, ARMv8 was originally 8-bit, and added 16 even before the spec was dry.

    I don't see a benefit to tying the ASID (or VMID for that matter) to
    the root of the page table. Especially with the common split
    address spaces (ARMv8 has a root pointer for each half of the VA space,
    for example, where the upper half is shared by all schedulable entities).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Robert Finch on Sat Nov 25 17:20:34 2023
    Robert Finch <robfi680@gmail.com> writes:
    On 2023-11-24 10:16 p.m., BGB wrote:
    On 11/24/2023 8:28 PM, Robert Finch wrote:
    On 2023-11-24 8:00 p.m., MitchAlsup wrote:


    If one assumes one address space per PID, then one is going to hit a
    limit of 4K a lot faster than 64K, and when one hits the limit, there is
    no good way to "reclaim" previously used address spaces short of
    flushing the TLB to be sure that no entries from that space remain in
    the TLB (ASID thrashing is likely to be relatively expensive to deal
    with as a result).

    I see after reading several webpages that the root pointer is used to
    point to only a single table for a process. This is not how I was doing
    things. I have MMU tables for each address space, as opposed to having
    a table for the process. The process may have only a single address
    space, or it may use several address spaces.

    I am wondering why there is only a single table per process.

    There are actually two in most operating systems - the lower half
    of the VA space is owned by the user-mode code in the process and
    the upper half is shared by all processes and used by the
    operating system on behalf of the process. For Intel/AMD, the
    kernel manages both halves; for ARMv8, each half has a completely
    distinct and separate root pointer (at each exception level).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Robert Finch on Sat Nov 25 17:16:55 2023
    Robert Finch <robfi680@gmail.com> writes:
    On 2023-11-24 8:00 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    On 2023-11-18 2:41 p.m., Robert Finch wrote:
    Q+ uses 64kB memory pages containing 8192 PTEs to map memory. A single
    64kB page can handle 512MB of mappings. Tonight’s trade-off is how
    many root pointers to support. With a 12-bit ASID, 4096 root pointers
    are required to link to the mapping tables with one root pointer for
    each address space.

    So, you associate a single ROOT pointer VALUE with an ASID, and manage
    in SW who gets that ROOT pointer VALUE; using ASID as an index into
    Virtual Address Spaces.

    How is this usefully different than only using the ASID to qualify TLB
    results ?? <Was this TLB entry installed from the same ASID as is
    accessing right now>. And using ASID as an index into any array might
    lead to some conundrum down the road a piece.

    Secondarily, SUN started out with 12-bit ASID and went to 16-bits just
    about
    as fast as they could--even before main memories went bigger than 4GB.

    I view the address space as an entity in its own right to be managed
    by the MMU. ASIDs and address spaces should be mapped 1:1. The ASID
    that identifies the address space has a life outside of just the TLB.
    I may be increasing the typical scope of an ASID.

    It is the same idea as using the ASID to qualify TLB entries, except
    that it qualifies the root pointer as well. So, the root pointer does
    not need to be switched by software. Once the root pointer is set for
    the AS it simply sits there statically until the AS is reused.

    I am using the ASID like a process ID. So, the root pointer register
    does not need to be reset on a task switch. Address spaces may not be
    mapped 1:1 with processes. An address space may outlive a task if it
    is shared with another task. So, I do not want to use the PID to
    distinguish tables. This assumes the address space will not be freed up
    and reused by another task while there are still tasks using the ASID.

    4096 address spaces is a lot. But if using a 16-bit ASID it would no
    longer be practical to store a root pointer per ASID in a table.
    Instead, the root pointer would have to be managed by software as is
    normally done.

    I am wondering why the 16-bit ASID? 256 address spaces in 256
    processes? I suspect it is just because 16 bits is easier to pass
    around/calculate in a HLL than some other value like 14 bits. Are
    65536 address spaces really needed?


    256 is far too small.

    $ ps -ef | wc -l
    709

    Every time the ASID overflows, the system must basically flush
    all the caches system-wide. On an 80-processor system, that's a lot of overhead.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Robert Finch on Sat Nov 25 11:59:42 2023
    On 11/24/2023 9:48 PM, Robert Finch wrote:
    On 2023-11-24 10:16 p.m., BGB wrote:
    On 11/24/2023 8:28 PM, Robert Finch wrote:
    On 2023-11-24 8:00 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    On 2023-11-18 2:41 p.m., Robert Finch wrote:
    Q+ uses 64kB memory pages containing 8192 PTEs to map memory. A
    single 64kB page can handle 512MB of mappings. Tonight’s trade-off is
    how many root pointers to support. With a 12-bit ASID, 4096 root
    pointers are required to link to the mapping tables with one root
    pointer for each address space.

    So, you associate a single ROOT pointer VALUE with an ASID, and manage
    in SW who gets that ROOT pointer VALUE; using ASID as an index into
    Virtual Address Spaces.

    How is this usefully different than only using the ASID to qualify TLB
    results ?? <Was this TLB entry installed from the same ASID as is
    accessing right now>. And using ASID as an index into any array might
    lead to some conundrum down the road a piece.

    Secondarily, SUN started out with 12-bit ASID and went to 16-bits
    just about
    as fast as they could--even before main memories went bigger than 4GB.

    I view the address space as an entity in its own right to be managed
    by the MMU. ASIDs and address spaces should be mapped 1:1. The ASID
    that identifies the address space has a life outside of just the TLB.
    I may be increasing the typical scope of an ASID.

    It is the same idea as using the ASID to qualify TLB entries, except
    that it qualifies the root pointer as well. So, the root pointer does
    not need to be switched by software. Once the root pointer is set for
    the AS it simply sits there statically until the AS is reused.

    I am using the ASID like a process ID. So, the root pointer register
    does not need to be reset on a task switch. Address spaces may not be
    mapped 1:1 with processes. An address space may outlive a task if it
    is shared with another task. So, I do not want to use the PID to
    distinguish tables. This assumes the address space will not be freed up
    and reused by another task while there are still tasks using the ASID.

    4096 address spaces is a lot. But if using a 16-bit ASID it would no
    longer be practical to store a root pointer per ASID in a table.
    Instead, the root pointer would have to be managed by software as is
    normally done.

    I am wondering why the 16-bit ASID? 256 address spaces in 256
    processes? I suspect it is just because 16 bits is easier to pass
    around/calculate in a HLL than some other value like 14 bits. Are
    65536 address spaces really needed?


    If one assumes one address space per PID, then one is going to hit a
    limit of 4K a lot faster than 64K, and when one hits the limit, there
    is no good way to "reclaim" previously used address spaces short of
    flushing the TLB to be sure that no entries from that space remain in
    the TLB (ASID thrashing is likely to be relatively expensive to deal
    with as a result).

    I see after reading several webpages that the root pointer is used to
    point to only a single table for a process. This is not how I was doing
    things. I have MMU tables for each address space, as opposed to having
    a table for the process. The process may have only a single address
    space, or it may use several address spaces.

    I am wondering why there is only a single table per process.


    I went the opposite route of one big address space, with the idea of
    allowing memory protection within this address space via the VUGID/ACL mechanism. There is a KRR, or Keyring Register, which holds up to 4 keys
    that may be used for ACL checking, granting an access if it is allowed
    by at least one of the keys; triggering an ISR on miss similar to the
    TLB. In this case, the conceptual model is more similar to that
    typically used in filesystems.

    But, I also have a 16-bit ASID...

    As-is, there is at most one set of page tables per address space, or per-process if processes are given different address spaces.




    Well, along with other things, like if/how to allow "Global" pages:
    True global pages are likely a foot gun, as there is no way to exclude
    them from a given process (where there may be a need to do so);
    Disallowing global pages entirely means higher TLB miss rates because
    no processes can share TLB entries.

    Global space can be assigned by designating an address space as a
    global space and giving it an ASID. All processes wanting access to
    the global space need only then use the MMU table for that ASID. E.g.,
    use ASID 0 for the global address space.


    Had considered this, but there is a problem:
    What if you have a process that you *don't* want to be able to see into
    this global space?...

    Though, this is where the idea of page-grouping can come in, say, the
    ASID becomes:
    gggg-pppp-pppp-pppp

    Where:
    0000 is visible to all of 0zzz
    1000 is visible to all of 1zzz
    ...
    Except:
    Fzzz, this group does not have any global pages (all one-off ASIDs).

    Or, possibly also, a 2/14-bit split.
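
    A sketch of a TLB match under this scheme (field names assumed): hit
    on an exact ASID match, or on a global entry within the same group,
    except in the reserved no-globals group.

      module tlb_asid_match (
        input  logic [15:0] cur_asid,   // gggg-pppp-pppp-pppp
        input  logic [15:0] ent_asid,   // ASID stored in the TLB entry
        input  logic        ent_global, // entry marked global
        output logic        match
      );
        wire same_group = (cur_asid[15:12] == ent_asid[15:12]);
        wire no_globals = (cur_asid[15:12] == 4'hF); // Fzzz: no global pages
        assign match = (cur_asid == ent_asid)
                    || (ent_global && same_group && !no_globals);
      endmodule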


    One option seems to be, say, that a few of the high-order bits of the
    ASID could be used as a "page group", with global pages only applying
    within a single page-group (possibly with one of the page groups being
    designated as "No global pages allowed").

    ...




    Meanwhile:
    I went and bought 128GB of RAM, only to realize my PC doesn't work if
    one tries to install the full 128GB (the BIOS boot-loops a bunch of
    times, and then apparently concludes that there is only 3.5GB ...).

    Does work at least if I install 3x 32GB sticks and 1x 16GB stick, giving
    112GB. This breaks the pairing rules, but seems to be working.

    ...

    Had I known this, could have spent half as much, and only upgraded to 96GB.



    Seemingly MOBO/BIOS/... designers didn't anticipate someone sticking a
    full 128GB in this thing?... (BIOS is dated from 2018).

    Well, either this, or a hardware compatibility issue with one of the
    cards?...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Sat Nov 25 19:31:13 2023
    Robert Finch wrote:

    On 2023-11-24 8:00 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    On 2023-11-18 2:41 p.m., Robert Finch wrote:
    Q+ uses 64kB memory pages containing 8192 PTEs to map memory. A single
    64kB page can handle 512MB of mappings. Tonight’s trade-off is how
    many root pointers to support. With a 12-bit ASID, 4096 root pointers
    are required to link to the mapping tables with one root pointer for
    each address space.

    So, you associate a single ROOT pointer VALUE with an ASID, and manage
    in SW who gets that ROOT pointer VALUE; using ASID as an index into
    Virtual Address Spaces.

    How is this usefully different than only using the ASID to qualify TLB
    results ?? <Was this TLB entry installed from the same ASID as is
    accessing right now>. And using ASID as an index into any array might
    lead to some conundrum down the road a piece.

    Secondarily, SUN started out with 12-bit ASID and went to 16-bits just
    about
    as fast as they could--even before main memories went bigger than 4GB.

    I view the address space as an entity in its own right to be managed
    by the MMU. ASIDs and address spaces should be mapped 1:1. The ASID
    that identifies the address space has a life outside of just the TLB.
    I may be increasing the typical scope of an ASID.

    Consider the case where two different processes MMAP the same area
    of memory.
    Should they both end up using the same ASID ??
    Should they both take extra TLB walks because they use different ASIDs ??
    Should they use their own ASIDs for their own memory but a different
    ASID for the shared memory ?? And how do you expect this to happen ??

    It is the same idea as using the ASID to qualify TLB entries, except
    that it qualifies the root pointer as well. So, the root pointer does
    not need to be switched by software. Once the root pointer is set for
    the AS it simply sits there statically until the AS is reused.

    I am using the ASID like a process ID. So, the root pointer register
    does not need to be reset on a task switch. Address spaces may not be
    mapped 1:1 with processes. An address space may outlive a task if it is shared with another task. So, I do not want to use the PID to
    distinguish tables. This assumes the address space will not be freed up
    and reused by another task, if there are tasks using the ASID.

    4096 address spaces is a lot. But if using a 16-bit ASID it would no
    longer be practical to store a root pointer per ASID in a table.
    Instead, the root pointer would have to be managed by software as is
    normally done.
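
    As a sketch of the intent (names hypothetical; this is what the
    hardware walker would conceptually do, not an actual implementation):

        /* A 12-bit ASID indexes a 4096-entry root-pointer table, so
           software installs a root pointer once per address space and
           never reloads a root-pointer register on a task switch. */
        #include <stdint.h>

        #define NUM_ASIDS 4096                 /* 12-bit ASID */
        static uint64_t root_ptr[NUM_ASIDS];

        /* done once, when the address space is created (or reused) */
        void as_create(uint16_t asid, uint64_t map_table_pa)
        {
            root_ptr[asid & (NUM_ASIDS - 1)] = map_table_pa;
        }

        /* on a TLB miss, the walker picks its root by ASID */
        uint64_t walk_root(uint16_t asid)
        {
            return root_ptr[asid & (NUM_ASIDS - 1)];
        }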

    I am wondering why the 16-bit ASID? 256 address spaces in 256
    processes? I suspect it is just because 16 bits is easier to pass
    around/calculate in an HLL than some other width like 14 bits. Are
    65536 address spaces really needed?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Sat Nov 25 19:44:13 2023
    Scott Lurndal wrote:

    Robert Finch <robfi680@gmail.com> writes:
    On 2023-11-24 10:16 p.m., BGB wrote:
    On 11/24/2023 8:28 PM, Robert Finch wrote:
    On 2023-11-24 8:00 p.m., MitchAlsup wrote:


    If one assumes one address space per PID, then one is going to hit a
    limit of 4K a lot faster than 64K, and when one hits the limit, there is
    no good way to "reclaim" previously used address spaces short of
    flushing the TLB to be sure that no entries from that space remain in
    the TLB (ASID thrashing is likely to be relatively expensive to deal
    with as a result).

    I see after reading several webpages that the root pointer is used to
    point to only a single table for a process. This is not how I was doing
    things. I have MMU tables for each address space as opposed to having
    a table for the process. The process may have only a single address
    space, or it may use several address spaces.

    I am wondering why there is only a single table per process.

    There are actually two in most operating systems - the lower half
    of the VA space is owned by the user-mode code in the process and
    the upper half is shared by all processes and used by the
    operating system on behalf of the process. For Intel/AMD, the
    kernel manages both halves; for ARMv8, each half has a completely
    distinct and separate root pointer (at each exception level).


    My 66000 Architecture has 4 Root Pointers available at all instants
    of time. The above was designed before the rise of HyperVisors and is
    now showing its age problems. All 4 Root Pointers are used based on
    privilege level::

                      HOB=0                HOB=1
    Application ::    Application 2-level  No Access
    Guest OS    ::    Application 2-level  Guest OS 2-level
    Guest HV    ::    Guest HV 1-level     Guest OS 2-level
    Real HV     ::    Guest HV 1-level     Real HV 1-level

    The overhead of Application to Application is no higher than that
    of Guest OS to a different Guest OS--whereas on machines with
    VMENTER and VMEXIT a Guest OS switch takes 10,000 cycles and
    Application to Application is closer to 1,000 cycles. I want this
    down in the 10-100 cycle range.

    The exception <stack> system is designed to allow Guest HV to
    recover a Guest OS that takes page faults while servicing ISRs
    (and the like).

    The interrupt <stack> system is designed to allow the ISR to
    RPC or softIRQ without having to look at the pending stack on
    the way out. RTI looks at the pending stack and services the
    highest pending PRC/softIRQ affinitized to the CPU with control.

    The Interrupt dispatch system allows the CPU to continue running
    instructions until the contending CPUs decide which interrupt
    is claimed by which CPU (1::1) and then context switch do the
    interrupt dispatcher.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Sat Nov 25 20:02:15 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Robert Finch wrote:

    On 2023-11-24 8:00 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    On 2023-11-18 2:41 p.m., Robert Finch wrote:
    Q+ uses 64kB memory pages containing 8192 PTEs to map memory. A single
    64kB page can handle 512MB of mappings. Tonight’s trade-off is how
    many root pointers to support. With a 12-bit ASID, 4096 root pointers
    are required to link to the mapping tables with one root pointer for
    each address space.

    So, you associate a single ROOT pointer VALUE with an ASID, and manage
    in SW who gets that ROOT pointer VALUE; using ASID as an index into
    Virtual Address Spaces.

    How is this usefully different than only using the ASID to qualify TLB
    results ?? <Was this TLB entry installed from the same ASID as is
    accessing right now>. And using ASID as an index into any array might
    lead to some conundrum down the road apiece.

    Secondarily, SUN started out with 12-bit ASID and went to 16-bits just
    about as fast as they could--even before main memories went bigger than
    4GB.

    I view the address space as an entity in its own right to be managed by
    the MMU. ASIDs and address spaces should be mapped 1:1. The ASID that
    identifies the address space has a life outside of just the TLB. I may
    be increasing the typical scope of an ASID.

    Consider the case where two different processes MMAP the same area
    of memory.

    In which case, the area of memory would be mapped to different
    virtual address ranges in each process, and thus naturally
    consume two TLBs.

    FWIW, MAP_FIXED is specified as an optional feature by POSIX
    and may not be supported by the OS at all.

    Given various forms of ASLR being used, it's unlikely even in
    two instances of the same executable that a call to mmap
    with MAP_SHARED without MAP_FIXED would map the region at
    the same virtual address in both processes.

    Should they both end up using the same ASID ??

    They couldn't share an ASID assuming the TLB looks up by VA.

    Should they both take extra TLB walks because they use different ASIDs ??

    Given the above, yes. It's likely they'll each be scheduled
    on different cores anyway in any modern system.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Sat Nov 25 20:40:11 2023
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Robert Finch wrote:

    On 2023-11-24 8:00 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    On 2023-11-18 2:41 p.m., Robert Finch wrote:
    Q+ uses 64kB memory pages containing 8192 PTEs to map memory. A single
    64kB page can handle 512MB of mappings. Tonight’s trade-off is how
    many root pointers to support. With a 12-bit ASID, 4096 root pointers
    are required to link to the mapping tables with one root pointer for
    each address space.

    So, you associate a single ROOT pointer VALUE with an ASID, and manage
    in SW who gets that ROOT pointer VALUE; using ASID as an index into
    Virtual Address Spaces.

    How is this usefully different than only using the ASID to qualify TLB
    results ?? <Was this TLB entry installed from the same ASID as is
    accessing right now>. And using ASID as an index into any array might
    lead to some conundrum down the road apiece.

    Secondarily, SUN started out with 12-bit ASID and went to 16-bits just
    about as fast as they could--even before main memories went bigger than
    4GB.

    I view the address space as an entity in its own right to be managed by
    the MMU. ASIDs and address spaces should be mapped 1:1. The ASID that
    identifies the address space has a life outside of just the TLB. I may
    be increasing the typical scope of an ASID.

    Consider the case where two different processes MMAP the same area
    of memory.

    In which case, the area of memory would be mapped to different
    virtual address ranges in each process, and thus naturally
    consume two TLBs.

    MMAP() first, fork() second. Now we have 2 processes with the
    memory mapped shared memory at the same address.
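
    Concretely, with standard POSIX calls:

        /* mmap() first, fork() second: parent and child then see the
           shared region at the same virtual address. */
        #include <stdio.h>
        #include <sys/mman.h>
        #include <sys/wait.h>
        #include <unistd.h>

        int main(void)
        {
            int *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) { perror("mmap"); return 1; }

            if (fork() == 0) {     /* child: same VA, same memory */
                p[0] = 42;
                _exit(0);
            }
            wait(NULL);
            printf("parent sees %d at %p\n", p[0], (void *)p);
            return 0;
        }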

    FWIW, MAP_FIXED is specified as an optional feature by POSIX
    and may not be supported by the OS at all.

    Given various forms of ASLR being used, it's unlikely even in
    two instances of the same executable that a call to mmap
    with MAP_SHARED without MAP_FIXED would map the region at
    the same virtual address in both processes.

    Should they both end up using the same ASID ??

    They couldn't share an ASID assuming the TLB looks up by VA.

    Should they both take extra TLB walks because they use different ASIDs ??

    Given the above, yes. It's likely they'll each be scheduled
    on different cores anyway in any modern system.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Sat Nov 25 21:55:04 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Robert Finch wrote:

    On 2023-11-24 8:00 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    On 2023-11-18 2:41 p.m., Robert Finch wrote:
    Q+ uses 64kB memory pages containing 8192 PTEs to map memory. A single
    64kB page can handle 512MB of mappings. Tonight’s trade-off is how
    many root pointers to support. With a 12-bit ASID, 4096 root pointers
    are required to link to the mapping tables with one root pointer for
    each address space.

    So, you associate a single ROOT pointer VALUE with an ASID, and manage
    in SW who gets that ROOT pointer VALUE; using ASID as an index into
    Virtual Address Spaces.

    How is this usefully different than only using the ASID to qualify TLB
    results ?? <Was this TLB entry installed from the same ASID as is
    accessing right now>. And using ASID as an index into any array might
    lead to some conundrum down the road apiece.

    Secondarily, SUN started out with 12-bit ASID and went to 16-bits just
    about as fast as they could--even before main memories went bigger than
    4GB.

    I view the address space as an entity in its own right to be managed by
    the MMU. ASIDs and address spaces should be mapped 1:1. The ASID that
    identifies the address space has a life outside of just the TLB. I may
    be increasing the typical scope of an ASID.

    Consider the case where two different processes MMAP the same area
    of memory.

    In which case, the area of memory would be mapped to different
    virtual address ranges in each process, and thus naturally
    consume two TLBs.

    MMAP() first, fork() second. Now we have 2 processes with the
    memory mapped shared memory at the same address.

    Yes, in that case, they'll be mapped at the same VA. All
    the below points still apply so long as TLB's are per core.


    FWIW, MAP_FIXED is specified as an optional feature by POSIX
    and may not be supported by the OS at all.

    Given various forms of ASLR being used, it's unlikely even in
    two instances of the same executable that a call to mmap
    with MAP_SHARED without MAP_FIXED would map the region at
    the same virtual address in both processes.

    Should they both end up using the same ASID ??

    They couldn't share an ASID assuming the TLB looks up by VA.

    Should they both take extra TLB walks because they use different ASIDs ??

    Given the above, yes. It's likely they'll each be scheduled
    on different cores anyway in any modern system.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Finch@21:1/5 to All on Sat Nov 25 19:48:06 2023
    Are top-level page directory pages shared between tasks? Suppose a task
    needs a 32-bit address space. With one level of page maps, 27 bits are
    accommodated; that leaves 5 bits of address translation to be done by
    the page directory. Using a whole page, which can handle 11 address
    bits, would be wasteful. But if root pointers could point into the same
    page directory page, then the space would not be wasted. For instance,
    the root pointer for task #1 could point at the first 32 entries, the
    root pointer for task #2 could point into the next 32 entries, and so
    on.
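
    The arithmetic, as a sketch (names hypothetical; assumes the root
    pointer can hold a full byte address rather than a page number):

        /* One top-level directory page sliced among tasks: each task's
           32-bit space needs only 32 directory entries (5 bits), so a
           root pointer can aim at entry 32*n of a shared directory page
           instead of wasting a whole page per task. */
        #include <stdint.h>

        #define ENTRIES_PER_TASK 32   /* 5 bits of translation */

        uint64_t task_root(uint64_t dir_page_pa, unsigned task_slot,
                           unsigned pte_bytes)
        {
            return dir_page_pa +
                   (uint64_t)task_slot * ENTRIES_PER_TASK * pte_bytes;
        }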

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Sun Nov 26 01:34:53 2023
    Robert Finch wrote:

    Are top-level page directory pages shared between tasks?

    The HyperVisor tables supporting a single Guest OS certainly are.
    The Guest OS tables supporting Guest OS certainly are.

    Suppose a task
    needs a 32-bit address space. With one level of page maps, 27 bits are
    accommodated; that leaves 5 bits of address translation to be done by
    the page directory. Using a whole page, which can handle 11 address
    bits, would be wasteful. But if root pointers could point into the same
    page directory page, then the space would not be wasted. For instance,
    the root pointer for task #1 could point at the first 32 entries, the
    root pointer for task #2 could point into the next 32 entries, and so
    on.


    I should note that My 66000 Root Pointers determine the size of the
    address space they map--anything from 8MB through 8EB--with PTEs
    supporting 8KB through 8EB page sizes; with the kicker that large-page
    entries can restrict themselves:: for example, you can use an 8MB PTE
    and enable only 1..1024 pages under that Virtual sub Address Space.
    Furthermore, levels in the hierarchy can be skipped--all of this to
    minimize table-walk time.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Robert Finch on Sun Nov 26 15:55:04 2023
    Robert Finch <robfi680@gmail.com> writes:
    Are top-level page directory pages shared between tasks?

    The top half of the VA space could support this, for
    the most part (since the top half is generally shared
    by all tasks). The bottom half that's much less likely.


    Suppose a task
    needs a 32-bit address space. With one level of page maps, 27 bits are
    accommodated; that leaves 5 bits of address translation to be done by
    the page directory. Using a whole page, which can handle 11 address
    bits, would be wasteful. But if root pointers could point into the same
    page directory page, then the space would not be wasted. For instance,
    the root pointer for task #1 could point at the first 32 entries, the
    root pointer for task #2 could point into the next 32 entries, and so
    on.

    If the VA space is small enough, on ARMv8, the tables can be configured
    with fewer than the normal four levels by specifying a smaller VA
    size in the TCR_ELx register, so the walk may be only two or three levels
    deep instead of four (or five when the VA gets larger than 52 bits).
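
    For the 4kB granule the arithmetic is: 12 bits of page offset plus 9
    bits resolved per level, so (a sketch):

        /* ARMv8, 4kB granule: levels = ceil((va_bits - 12) / 9).
           48-bit VA -> 4 levels, 39-bit -> 3, 30-bit -> 2. */
        unsigned walk_levels(unsigned va_bits)
        {
            return (va_bits - 12 + 8) / 9;
        }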

    Using intermediate-level blocks (soi-disant 'huge pages') reduces the
    walk overhead as well, but has its issues with allocation (since
    huge pages must be not just physically contiguous, but aligned
    on huge-page-sized boundaries).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Sun Nov 26 15:45:06 2023
    scott@slp53.sl.home (Scott Lurndal) writes:
    mitchalsup@aol.com (MitchAlsup) writes:
    Consider the case where two different processes MMAP the same area
    of memory.

    In which case, the area of memory would be mapped to different
    virtual address ranges in each process,

    Says who? Unless the user process asks for MAP_FIXED or the address
    range is already occupied in the user process, nothing prevents the OS
    from putting the shared area in the same process. If the permissions
    are also the same, the OS can then use one ASID for the shared area.

    This would be especially useful for the read-only sections (e.g, code)
    of common libraries like libc. However, in today's security landscape,
    you don't want one process to know where library code is mapped in
    other processes (i.e., you want ASLR), so we can no longer make use of
    that benefit. And it's doubtful whether other uses are worth the
    complications (and even if they are, there might be security issues,
    too).

    FWIW, MAP_FIXED is specified as an optional feature by POSIX
    and may not be supported by the OS at all.

    As usual, what is specified by a common-subset standard is not
    relevant for what an OS implementor has to do if they want to supply
    more than a practically unusable checkbox feature like the POSIX
    subsystem for Windows. There is a reason why WSL2 includes a full
    Linux kernel.

    Should they both end up using the same ASID ??

    They couldn't share an ASID assuming the TLB looks up by VA.

    Of course the TLB looks up by VA, what else. But if the VA is the
    same and the PA is the same, the same ASID can be used.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Sun Nov 26 12:32:08 2023
    Anton Ertl wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    mitchalsup@aol.com (MitchAlsup) writes:
    Consider the case where two different processes MMAP the same area
    of memory.
    In which case, the area of memory would be mapped to different
    virtual address ranges in each process,

    Says who? Unless the user process asks for MAP_FIXED or the address
    range is already occupied in the user process, nothing prevents the OS
    from putting the shared area in the same process. If the permissions
    are also the same, the OS can then use one ASID for the shared area.

    If the mapping range is being selected dynamically, the chance that a
    range will already be in use goes up with the number of sharers.
    At some point when a new member tries to join the sharing group
    the map request will be denied.

    Software that does not want to have a mapping request fail should assume
    that a shared area will be mapped at a different address in each process.
    That implies one should not assume that virtual addresses can be passed
    but instead use, say, section-relative offsets to build a linked list.
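
    A minimal sketch of that offset-based convention:

        /* Linked list in a shared section using offsets from the section
           base instead of raw pointers, so each process may map the
           section at a different VA. Offset 0 serves as "null". */
        #include <stddef.h>
        #include <stdint.h>

        struct node {
            uint64_t next_off;     /* offset of next node, 0 = end */
            int      payload;
        };

        static struct node *node_at(void *base, uint64_t off)
        {
            return off ? (struct node *)((char *)base + off) : NULL;
        }

        static void push(void *base, uint64_t *head_off, struct node *n)
        {
            n->next_off = *head_off;
            *head_off   = (uint64_t)((char *)n - (char *)base);
        }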

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Sun Nov 26 20:52:23 2023
    EricP wrote:

    Anton Ertl wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    mitchalsup@aol.com (MitchAlsup) writes:
    Consider the case where two different processes MMAP the same area
    of memory.
    In which case, the area of memory would be mapped to different
    virtual address ranges in each process,

    Says who? Unless the user process asks for MAP_FIXED or the address
    range is already occupied in the user process, nothing prevents the OS
    from putting the shared area in the same process. If the permissions
    are also the same, the OS can then use one ASID for the shared area.

    If the mapping range is being selected dynamically, the chance that a
    range will already be in use goes up with the number of sharers.
    At some point when a new member tries to join the sharing group
    the map request will be denied.

    Software that does not want to have a mapping request fail should assume
    that a shared area will be mapped at a different address in each process.
    That implies one should not assume that virtual addresses can be passed
    but instead use, say, section relative offsets to build a linked list.


    Here you are using shared memory like PL/1 uses AREA and OFFSET types.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Sun Nov 26 21:26:23 2023
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    In which case, the area of memory would be mapped to different
    virtual address ranges in each process,

    Says who? Unless the user process asks for MAP_FIXED or the address
    range is already occupied in the user process, nothing prevents the OS
    from putting the shared area in the same process.

    s/process/address range/ for the last word.

    If the permissions
    are also the same, the OS can then use one ASID for the shared area.

    If the mapping range is being selected dynamically, the chance that a
    range will already be in use goes up with the number of sharers.
    At some point when a new member tries to join the sharing group
    the map request will be denied.

    It will map, but with a different address range, and therefore a
    different ASID. Then, for further mapping requests, the chances that
    one of the two address ranges is free are increased. So even with a
    large number of processes mapping the same library, you will need only
    a few ASIDs for this physical memory, so there will be lots of
    sharing. Of course with ASLR this is all no longer relevant.

    Software that does not want to have a mapping request fail should assume
    that a shared area will be mapped at a different address in each process.
    That implies one should not assume that virtual addresses can be passed
    but instead use, say, section-relative offsets to build a linked list.

    Yes. The other option is to use MAP_FIXED early in the process, and
    to have some way of dealing with potential failures. But sharing of
    VAs in user code between processes is not what the sharing of ASIDs we
    have discussed here would be primarily about.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Anton Ertl on Sun Nov 26 15:45:08 2023
    On 11/26/2023 9:45 AM, Anton Ertl wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    mitchalsup@aol.com (MitchAlsup) writes:
    Consider the case where two different processes MMAP the same area
    of memory.

    In which case, the area of memory would be mapped to different
    virtual address ranges in each process,

    Says who? Unless the user process asks for MAP_FIXED or the address
    range is already occupied in the user process, nothing prevents the OS
    from putting the shared area in the same process. If the permissions
    are also the same, the OS can then use one ASID for the shared area.

    This would be especially useful for the read-only sections (e.g, code)
    of common libraries like libc. However, in today's security landscape,
    you don't want one process to know where library code is mapped in
    other processes (i.e., you want ASLR), so we can no longer make use of
    that benefit. And it's doubtful whether other uses are worth the complications (and even if they are, there might be security issues,
    too).


    It seems to me, as long as it is a different place on each system, that
    is probably good enough. Demanding a different location in each process
    would create a lot of additional memory overhead from things like
    base-relocations or similar.


    FWIW, MAP_FIXED is specified as an optional feature by POSIX
    and may not be supported by the OS at all.

    As usual, what is specified by a common-subset standard is not
    relevant for what an OS implementor has to do if they want to supply
    more than a practically unusable checkbox feature like the POSIX
    subsystem for Windows. There is a reason why WSL2 includes a full
    Linux kernel.


    Still using WSL1 here as for whatever reason hardware virtualization has
    thus far refused to work on my PC, and is apparently required for WSL2.

    I can add this to my list of annoyances, like I can install "just short
    of 128GB", but putting in the full 128GB causes my PC to be like "Oh
    Crap, I guess there is 3.5GB ..." (but, apparently "112GB with unmatched
    RAM sticks is fine I guess...").



    But, yeah, the original POSIX is an easier goal to achieve, vs, say, the ability to port over the GNU userland.


    A lot of it is doable, but things like fork+exec are a problem if one
    wants to support NOMMU operation or otherwise run all of the logical
    processes in a shared address space.

    A practical alternative is something more like a CreateProcess style
    call, but this is "not exactly POSIX". In theory though, one could treat "fork()" more like "vfork()" and then turn the exec* call into a
    CreateProcess call and then terminate the current thread. Wouldn't
    really work "in general" though, for programs that expect to be able to "fork()" and then continue running the current program as a sub-process.


    Should they both end up using the same ASID ??

    They couldn't share an ASID assuming the TLB looks up by VA.

    Of course the TLB looks up by VA, what else. But if the VA is the
    same and the PA is the same, the same ASID can be used.


    ?...

    Typically the ASID applies to the whole virtual address space, not to individual memory objects.


    Or, at least, my page-table scheme doesn't have a way to express
    per-page ASIDs (merely if a page is Private/Shared, with the results of
    this partly depending on the current ASID given for the page-table).

    Where, say, I am mostly using 64-bit entries in the page-table, as going
    to a 128-bit page-table format would be a bit steep.

    Say, PTE layout (16K pages):
    (63:48): ACLID
    (47:14): Physical Address.
    (13:12): Address or OS flag.
    (11:10): For use by OS
    ( 9: 0): Base page-access and similar.
    (9): S1 / U1 (Page-Size or OS Flag)
    (8): S0 / U0 (Page-Size or OS Flag)
    (7): No User (Supervisor Only)
    (6): No Execute
    (5): No Write
    (4): No Read
    (3): No Cache
    (2): Dirty (OS, ignored by TLB)
    (1): Private/Shared (MBZ if not Valid)
    (0): Present/Valid

    Where, ACLID serves as an index into the ACL table, or to look up the
    VUGID parameters for the page (well, along with an alternate PTE variant
    that encodes VUGID directly, but reduces the physical address to 36
    bits). It is possible that the original VUGID scheme may be phased out
    in favor of using exclusively ACL checking.
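
    As a sketch, decoding those fields in C (masks derived from the
    layout above):

        /* Field extraction for the 64-bit PTE sketched above. */
        #include <stdint.h>

        #define PTE_ACLID(pte)  (((pte) >> 48) & 0xFFFFu)        /* 63:48 */
        #define PTE_PADDR(pte)  ((pte) & 0x0000FFFFFFFFC000ull)  /* 47:14 */

        #define PTE_NOUSER   (1u << 7)
        #define PTE_NOEXEC   (1u << 6)
        #define PTE_NOWRITE  (1u << 5)
        #define PTE_NOREAD   (1u << 4)
        #define PTE_NOCACHE  (1u << 3)
        #define PTE_DIRTY    (1u << 2)   /* OS only, ignored by TLB */
        #define PTE_PRIVATE  (1u << 1)
        #define PTE_VALID    (1u << 0)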

    Note that the ACL checks don't add new permissions to a page, they add
    further restrictions (with the base-access being the most permissive).

    Some combinations of flags are special, and encode a few edge-case
    modes; such as pages which are Read/Write in Supervisor mode but
    Read-Only in user mode (separate from the possible use of ACL's to mark
    pages as read-only for certain tasks).



    But, FWIW, I ended up adding an extended MAP_GLOBAL flag for "mmap'ed
    space should be visible to all of the processes"; which in turn was used
    as part of the backing memory for the "GlobalAlloc" style calls (it is
    not a global heap, in that each process still manages the memory
    locally, but other intersecting processes can see the address within
    their own address spaces).

    Well, along with a MAP_PHYSICAL flag, for if one needs memory where
    VA==PA (this may fail, with the mmap returning NULL, effectively only
    allowed for "superusermode"; mostly intended for hardware interfaces).
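
    Usage would look something like this (MAP_GLOBAL is the extension
    described here, not POSIX; the flag value is made up for
    illustration):

        #include <stddef.h>
        #include <sys/mman.h>

        #ifndef MAP_GLOBAL
        #define MAP_GLOBAL 0x10000   /* hypothetical value */
        #endif

        /* buffer visible to other processes at the same address */
        void *alloc_shared_buffer(size_t len)
        {
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_ANONYMOUS | MAP_GLOBAL, -1, 0);
            return (p == MAP_FAILED) ? NULL : p;
        }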



    The usual behavior of MAP_SHARED didn't really make sense outside of the context of mapping a file, and didn't really serve the needed purpose
    (say, one wants to hand off a pointer to a bitmap buffer to the GUI
    subsystem to have it drawn into a window).

    It is also being used for things like shared scratch buffers, say, for
    passing BITMAPINFOHEADER and MIDI commands and similar across the
    interprocess calls (the C API style wrapper wraps a lot of this; whereas
    the internal COM-style interfaces will require any pointer-style
    arguments to point to shared memory).

    This is not required for normal syscall handlers, where the usual
    assumption is that normal syscalls will have some means of directly
    accessing the address space of the caller process. I didn't really want
    to require that TKGDI have this same capability.

    It is debatable whether calls like BlitImage and similar should require
    global memory, or merely recommend it (potentially having the call fall
    back to a scratch buffer and internal memcpy if the passed bitmap image
    is not already in global memory).



    I had originally considered a more complex mechanism for object sharing,
    but then ended up going with this for now partly because it was easier
    and lower overhead (well, and also because I wanted something that would
    still work if/when I started to add proper memory protection). May make
    sense to impose a limit on per-process global alloc's though (since it
    is intended specifically for shared buffers and not for general heap allocation; where for heap allocation ANONYMOUS+PRIVATE would be used
    instead).

    Though, looking at stuff, MAP_GLOBAL semantics may have also been
    partially covered by "MAP_ANONYMOUS|MAP_SHARED"?... Though, the
    semantics aren't the same.

    I guess, another alternative would have been to use shm_open+mmap or
    similar.


    Where, say, memory map will look something like:
    00yy_xxxxxxxx: Physical mapped and direct-mapped memory.
    00yy_xxxxxxxx: Start of global virtual memory (*1);
    3FFF_xxxxxxxx: End of global virtual memory;
    4000_xxxxxxxx: Start of private/local virtual memory (possible, *2);
    7FFF_xxxxxxxx: End of private/local virtual memory (possible);
    8000_xxxxxxxx: Start of kernel virtual memory;
    BFFF_xxxxxxxx: End of kernel virtual memory;
    Cxxx_xxxxxxxx: Physical Address Range (Cached);
    Dxxx_xxxxxxxx: Physical Address Range (Volatile/NoCache);
    Exxx_xxxxxxxx: Reserved;
    Fxxx_xxxxxxxx: MMIO and similar.

    *1: The 'yy' division point may move, will depend on things like how
    much RAM exists (currently, 00/01; no current "sane cost" FPGA boards
    having more than 256 or 512 MB of RAM).

    *2: If I go to a scheme of giving processes their own address spaces,
    then private memory will be used. It is likely that executable code may
    remain shared, but the data sections and heap would be put into private
    address ranges.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Sun Nov 26 22:27:37 2023
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    mitchalsup@aol.com (MitchAlsup) writes:
    Consider the case where two different processes MMAP the same area
    of memory.

    In which case, the area of memory would be mapped to different
    virtual address ranges in each process,

    Says who? Unless the user process asks for MAP_FIXED or the address
    range is already occupied in the user process, nothing prevents the OS
    from putting the shared area in the same process. If the permissions
    are also the same, the OS can then use one ASID for the shared area.

    This would be especially useful for the read-only sections (e.g, code)
    of common libraries like libc. However, in today's security landscape,
    you don't want one process to know where library code is mapped in
    other processes (i.e., you want ASLR), so we can no longer make use of
    that benefit. And it's doubtful whether other uses are worth the
    complications (and even if they are, there might be security issues,
    too).

    FWIW, MAP_FIXED is specified as an optional feature by POSIX
    and may not be supported by the OS at all.

    As usual, what is specified by a common-subset standard is not
    relevant for what an OS implementor has to do if they want to supply
    more than a practically unusable checkbox feature like the POSIX
    subsystem for Windows. There is a reason why WSL2 includes a full
    Linux kernel.

    If an implementation claims support for the XSI option of
    POSIX, then it must support MAP_FIXED. There were a couple
    of vendors who claimed not to be able to support MAP_FIXED
    back in the days when it was being discussed in the standards
    committee working groups.

    In addition, the standard notes:

    "Use of MAP_FIXED may result in unspecified behavior in
    further use of malloc() and shmat(). The use of MAP_FIXED is
    discouraged, as it may prevent an implementation from making
    the most effective use of resources."

    Because the semantics of MAP_FIXED are to unmap any
    prior mapping in the range, if the implementation had happened to
    allocate the heap or shared System V region at that address, the heap
    would have become corrupt with dangling references hanging
    around which, if stored into, would subsequently corrupt the mapped region.



    Should they both end up using the same ASID ??

    They couldn't share an ASID assuming the TLB looks up by VA.

    Of course the TLB looks up by VA, what else. But if the VA is the
    same and the PA is the same, the same ASID can be used.

    That sounds like a nightmare scenario. Normally the ASID is
    closely associated with a single process and the scope of
    necessary TLB maintenance operations (e.g. invalidates
    after translation table updates) is usually the process.

    It's certainly not possible to do that on ARMv8 systems. The
    ASID tag in the TLB entry comes from the translation table base
    register and applies to all accesses made to the entire range covered
    by the translation table by all the threads of the process.

    Likewise the VMID tag in the TLB entry comes from the nested
    translation table base address system register at the time
    of entry creation.

    For a subsequent process (child or detached) sharing memory with
    that process, there just isn't any way to tag its TLB entry with
    the ASID of the first process to map the shared region.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to BGB on Sun Nov 26 22:35:19 2023
    BGB <cr88192@gmail.com> writes:
    On 11/26/2023 9:45 AM, Anton Ertl wrote:



    Where, say, memory map will look something like:
    00yy_xxxxxxxx: Physical mapped and direct-mapped memory.
    00yy_xxxxxxxx: Start of global virtual memory (*1);
    3FFF_xxxxxxxx: End of global virtual memory;
    4000_xxxxxxxx: Start of private/local virtual memory (possible, *2);
    7FFF_xxxxxxxx: End of private/local virtual memory (possible);
    8000_xxxxxxxx: Start of kernel virtual memory;
    BFFF_xxxxxxxx: End of kernel virtual memory;
    Cxxx_xxxxxxxx: Physical Address Range (Cached);
    Dxxx_xxxxxxxx: Physical Address Range (Volatile/NoCache);
    Exxx_xxxxxxxx: Reserved;
    Fxxx_xxxxxxxx: MMIO and similar.


    The modern preference is to make the memory map flexible.

    Linux, for example, requires that PCI Base Address Registers
    be programmable by the operating system, and the OS can
    choose any range (subject to host bridge configuration, of
    course) for the device.

    It is notable that even on non-intel systems, one may need
    to map a 32-bit PCI BAR (AHCI is the classic example) which
    requires the address programmed in the bar to be less than
    0x100000000 (4GB). Granted systems can have custom PCI controllers
    that remap that into the larger physical address space with
    a bit of extra hardware, however the kernel people don't
    like that at all since there is no universal standard for
    such remapping and they don't want to support
    dozens of independent implementations, constantly
    changing from generation to generation.

    Many modern SoCs (and ARM SBSA requires this) make their
    on-board devices and coprocessors look like PCI express
    devices to software, and SBSA requires the PCIe ECAM
    region for device discovery. Here again, each of
    these on board devices will have from one to six
    memory region base address registers (or one to
    three for 64-bit bars).

    Encoding memory attributes into the address is common
    in microcontrollers, but in a general purpose processor
    constrains the system to an extent sufficient to make it
    unattractive for general purpose workloads.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Finch@21:1/5 to BGB on Sun Nov 26 18:20:58 2023
    On 2023-11-26 4:45 p.m., BGB wrote:
    On 11/26/2023 9:45 AM, Anton Ertl wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    mitchalsup@aol.com (MitchAlsup) writes:
    Consider the case where two different processes MMAP the same area
    of memory.

    In which case, the area of memory would be mapped to different
    virtual address ranges in each process,

    Says who?  Unless the user process asks for MAP_FIXED or the address
    range is already occupied in the user process, nothing prevents the OS
    from putting the shared area in the same process.  If the permissions
    are also the same, the OS can then use one ASID for the shared area.

    This would be especially useful for the read-only sections (e.g, code)
    of common libraries like libc.  However, in today's security landscape,
    you don't want one process to know where library code is mapped in
    other processes (i.e., you want ASLR), so we can no longer make use of
    that benefit.  And it's doubtful whether other uses are worth the
    complications (and even if they are, there might be security issues,
    too).


    It seems to me, as long as it is a different place on each system, that
    is probably good enough. Demanding a different location in each process
    would create a lot of additional memory overhead from things like
    base-relocations or similar.


    FWIW,  MAP_FIXED is specified as an optional feature by POSIX
    and may not be supported by the OS at all.

    As usual, what is specified by a common-subset standard is not
    relevant for what an OS implementor has to do if they want to supply
    more than a practically unusable checkbox feature like the POSIX
    subsystem for Windows.  There is a reason why WSL2 includes a full
    Linux kernel.


    Still using WSL1 here as for whatever reason hardware virtualization has
    thus far refused to work on my PC, and is apparently required for WSL2.

    I can add this to my list of annoyances, like I can install "just short
    of 128GB", but putting in the full 128GB causes my PC to be like "Oh
    Crap, I guess there is 3.5GB ..." (but, apparently "112GB with unmatched
    RAM sticks is fine I guess...").



    But, yeah, the original POSIX is an easier goal to achieve, vs, say, the ability to port over the GNU userland.


    A lot of it is doable, but things like fork+exec are a problem if one
    wants to support NOMMU operation or otherwise run all of the logical processes in a shared address space.

    A practical alternative is something more like a CreateProcess style
    call, but this is "not exactly POSIX". In theory though, one could treat "fork()" more like "vfork()" and then turn the exec* call into a CreateProcess call and then terminate the current thread. Wouldn't
    really work "in general" though, for programs that expect to be able to "fork()" and then continue running the current program as a sub-process.


    Should they both end up using the same ASID ??

    They couldn't share an ASID assuming the TLB looks up by VA.

    Of course the TLB looks up by VA, what else.  But if the VA is the
    same and the PA is the same, the same ASID can be used.


    ?...

    Typically the ASID applies to the whole virtual address space, not to individual memory objects.


    Or, at least, my page-table scheme doesn't have a way to express
    per-page ASIDs (merely if a page is Private/Shared, with the results of
    this partly depending on the current ASID given for the page-table).

    Where, say, I am mostly using 64-bit entries in the page-table, as going
    to a 128-bit page-table format would be a bit steep.

    Say, PTE layout (16K pages):
      (63:48): ACLID
      (47:14): Physical Address.
      (13:12): Address or OS flag.
      (11:10): For use by OS
      ( 9: 0): Base page-access and similar.
        (9): S1 / U1 (Page-Size or OS Flag)
        (8): S0 / U0 (Page-Size or OS Flag)
        (7): No User (Supervisor Only)
        (6): No Execute
        (5): No Write
        (4): No Read
        (3): No Cache
        (2): Dirty (OS, ignored by TLB)
        (1): Private/Shared (MBZ if not Valid)
        (0): Present/Valid

    Where, ACLID serves as an index into the ACL table, or to look up the
    VUGID parameters for the page (well, along with an alternate PTE variant
    that encodes VUGID directly, but reduces the physical address to 36
    bits). It is possible that the original VUGID scheme may be phased out
    in favor of using exclusively ACL checking.

    Note that the ACL checks don't add new permissions to a page, they add further restrictions (with the base-access being the most permissive).

    Some combinations of flags are special, and encode a few edge-case
    modes; such as pages which are Read/Write in Supervisor mode but
    Read-Only in user mode (separate from the possible use of ACL's to mark
    pages as read-only for certain tasks).


    Q+ has a similar setup, but the ACLID is in a separate table.

    For Q+, two similar MMUs have been designed: one to be used in a large
    system and a second for a small system. The difference between the two
    is in the size of page numbers. The large system uses 64-bit page
    numbers, and the small system uses 32-bit page numbers. The PTE for the
    large system is 96-bits, 32-bits larger than the PTE for the small
    system due to the extra bits for the page number. Pages are 64kB. The
    small system supports a 48-bit address range.

    The PTE has the following fields:
    PPN     64/32  Physical page number
    URWX    3      User read-write-execute override
    SRWX    3      Supervisor read-write-execute override
    HRWX    3      Hypervisor read-write-execute override
    MRWX    3      Machine read-write-execute override
    CACHE   4      Cache-ability bits
    SW      2      OS software usage
    A       1      1=accessed/used
    M       1      1=modified
    V       1      1 if entry is valid, otherwise 0
    S       1      1=shared page
    G       1      1=global, ignore ASID
    T       1      0=page pointer, 1=table pointer
    RGN     3      Region table index
    LVL/BC  5      the page table level of the entry pointed to

    The RWX and CACHE bits are overrides. These values normally come from
    the region table, but may be overridden by values in the PTE.
    The LVL/BC field is five bits to account for a five-bit bounce counter
    for inverted page tables. Only a 3-bit level is in use.
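
    As an illustrative packing of the small-system (64-bit) PTE; the
    real bit positions may differ:

        #include <stdint.h>

        struct qplus_pte_small {   /* fields as listed above */
            uint64_t ppn   : 32;   /* physical page number     */
            uint64_t urwx  : 3;    /* user RWX override        */
            uint64_t srwx  : 3;    /* supervisor RWX override  */
            uint64_t hrwx  : 3;    /* hypervisor RWX override  */
            uint64_t mrwx  : 3;    /* machine RWX override     */
            uint64_t cache : 4;    /* cacheability override    */
            uint64_t sw    : 2;    /* OS software usage        */
            uint64_t a     : 1;    /* accessed/used            */
            uint64_t m     : 1;    /* modified                 */
            uint64_t v     : 1;    /* valid                    */
            uint64_t s     : 1;    /* shared page              */
            uint64_t g     : 1;    /* global, ignore ASID      */
            uint64_t t     : 1;    /* 0=page ptr, 1=table ptr  */
            uint64_t rgn   : 3;    /* region table index       */
            uint64_t lvl   : 5;    /* level / bounce counter   */
        };                         /* 64 bits total */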

    There is a separate table with per-page information that contains a
    reference to an ACL (16 bits), share counts (16 bits), privilege level
    (8 bits), an access key (24 bits), and a couple of other fields for
    compression / encryption.

    I have made the PTBR a full 64-bit address now rather than a page number
    with control bits. So, it may now point into the middle of a page
    directory which is shared between tasks.

    The table walker and region table look like PCI devices to the system.




    But, FWIW, I ended up adding an extended MAP_GLOBAL flag for "mmap'ed
    space should be visible to all of the processes"; which in turn was used
    as part of the backing memory for the "GlobalAlloc" style calls (it is
    not a global heap, in that each process still manages the memory
    locally, but other intersecting processes can see the address within
    their own address spaces).

    Well, along with a MAP_PHYSICAL flag, for if one needs memory where
    VA==PA (this may fail, with the mmap returning NULL, effectively only
    allowed for "superusermode"; mostly intended for hardware interfaces).



    The usual behavior of MAP_SHARED didn't really make sense outside of the context of mapping a file, and didn't really serve the needed purpose
    (say, one wants to hand off a pointer to a bitmap buffer to the GUI
    subsystem to have it drawn into a window).

    It is also being used for things like shared scratch buffers, say, for passing BITMAPINFOHEADER and MIDI commands and similar across the interprocess calls (the C API style wrapper wraps a lot of this; whereas
    the internal COM-style interfaces will require any pointer-style
    arguments to point to shared memory).

    This is not required for normal syscall handlers, where the usual
    assumption is that normal syscalls will have some means of directly
    accessing the address space of the caller process. I didn't really want
    to require that TKGDI have this same capability.

    It is debatable whether calls like BlitImage and similar should require global memory, or merely recommend it (potentially having the call fall
    back to a scratch buffer and internal memcpy if the passed bitmap image
    is not already in global memory).



    I had originally considered a more complex mechanism for object sharing,
    but then ended up going with this for now partly because it was easier
    and lower overhead (well, and also because I wanted something that would still work if/when I started to add proper memory protection). May make
    sense to impose a limit on per-process global alloc's though (since it
    is intended specifically for shared buffers and not for general heap allocation; where for heap allocation ANONYMOUS+PRIVATE would be used instead).

    Though, looking at stuff, MAP_GLOBAL semantics may have also been
    partially covered by "MAP_ANONYMOUS|MAP_SHARED"?... Though, the
    semantics aren't the same.

    I guess, another alternative would have been to use shm_open+mmap or
    similar.


    Where, say, memory map will look something like:
      00yy_xxxxxxxx: Physical mapped and direct-mapped memory.
      00yy_xxxxxxxx: Start of global virtual memory (*1);
      3FFF_xxxxxxxx: End of global virtual memory;
      4000_xxxxxxxx: Start of private/local virtual memory (possible, *2);
      7FFF_xxxxxxxx: End of private/local virtual memory (possible);
      8000_xxxxxxxx: Start of kernel virtual memory;
      BFFF_xxxxxxxx: End of kernel virtual memory;
      Cxxx_xxxxxxxx: Physical Address Range (Cached);
      Dxxx_xxxxxxxx: Physical Address Range (Volatile/NoCache);
      Exxx_xxxxxxxx: Reserved;
      Fxxx_xxxxxxxx: MMIO and similar.

    *1: The 'yy' division point may move, will depend on things like how
    much RAM exists (currently, 00/01; no current "sane cost" FPGA boards
    having more than 256 or 512 MB of RAM).

    *2: If I go to a scheme of giving processes their own address spaces,
    then private memory will be used. It is likely that executable code may remain shared, but the data sections and heap would be put into private address ranges.



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Scott Lurndal on Sun Nov 26 18:40:16 2023
    On 11/26/2023 4:35 PM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/26/2023 9:45 AM, Anton Ertl wrote:



    Where, say, memory map will look something like:
    00yy_xxxxxxxx: Physical mapped and direct-mapped memory.
    00yy_xxxxxxxx: Start of global virtual memory (*1);
    3FFF_xxxxxxxx: End of global virtual memory;
    4000_xxxxxxxx: Start of private/local virtual memory (possible, *2);
    7FFF_xxxxxxxx: End of private/local virtual memory (possible);
    8000_xxxxxxxx: Start of kernel virtual memory;
    BFFF_xxxxxxxx: End of kernel virtual memory;
    Cxxx_xxxxxxxx: Physical Address Range (Cached);
    Dxxx_xxxxxxxx: Physical Address Range (Volatile/NoCache);
    Exxx_xxxxxxxx: Reserved;
    Fxxx_xxxxxxxx: MMIO and similar.


    The modern preference is to make the memory map flexible.

    Linux, for example, requires that PCI Base Address Registers
    be programmable by the operating system, and the OS can
    choose any range (subject to host bridge configuration, of
    course) for the device.



    As for the memory map, actual hardware-relevant part of the map is:
    0000_xxxxxxxx..7FFF_xxxxxxxx: User Mode, virtual
    8000_xxxxxxxx..BFFF_xxxxxxxx: Supervisor Mode, virtual
    Cxxx_xxxxxxxx: Physical Address Range (Cached);
    Dxxx_xxxxxxxx: Physical Address Range (Volatile/NoCache);
    Exxx_xxxxxxxx: Reserved;
    Fxxx_xxxxxxxx: MMIO and similar.

    There is no good way to make this more flexible; some of this stuff
    requires special handling from the L1 cache, and by the time it reaches
    the TLB, it is too late (unless there were additional logic to be like
    "Oh, crap, this was actually meant for MMIO!").

    Though, with the 96-bit VA mode, if GBH(47:0)!=0, then the entire 48-bit
    space is User Mode Virtual (and it is not possible to access MMIO or
    similar at all, short of reloading 0 into GBH, or using XMOV.x
    instructions with a 128-bit pointer, say:
    0000_0000_00000000-tttt_Fxxx_xxxxxxxx).


    Note here that the high 16-bits are ignored for normal pointers
    (typically used for type-tagging or bounds-checking by the runtime).

    For branches and captured Link-Register values:
    If LSB is 0: High 16 bits are ignored;
    The branch will always be within the same CPU Mode.
    If LSB is 1: High 16 bits encode CPU Mode control flags.
    LSB is always set for created LR values.
    CPU will trap if the LSB is Clear in LR during an RTS/RTSU.

    Setting the LSB and putting the mode in the high 16 bits is also often
    used on function pointers so that theoretically Baseline and XG2 code
    can play along together (though, at present, BGBCC does not generate any
    mixed binaries, so this part would mostly apply to DLLs).




    For the time being, there is no PCI or PCIe in my case.
    Nor have I gone up the learning curve for what would be required to
    interface with any PCIe devices.


    Had tried to get USB working, but didn't have much success as it seemed
    I was still missing something (seemed to be sending/receiving bytes, but
    the devices would not respond as expected to any requests or commands).

    Mostly ended up using a PS2 keyboard, and had realized that (IIRC) if
    one pulled the D+ and D- lines high, the mouse would instead
    implement the PS2 protocol (though, this didn't work on the USB
    keyboards I had tried).


    Most devices are mapped to fixed address ranges in the MMIO space:
    F000Cxxx: Rasterizer / Edge-Walker Control Registers
    F000Exxx: Various basic devices
    SDcard, PS2 Keyboard/Mouse, RS232 UART (*), etc
    F008xxxx: FM Synth / Sample Mixer Control / ...
    F009xxxx: PCM Audio Loop/Registers
    F00Axxxx: MMIO VRAM
    F00Bxxxx: MMIO VRAM and Video Control
    At present, VRAM is also RAM-backed.
    VRAM framebuffer base address in RAM is now movable.

    All this existing within:
    FFFF_Fxxxxxxx

    *: RS232 generally connected to a UART interface that feeds back to a
    connected computer via an on-board FTDI chip or similar.


    As for physical memory map, it is sorta like:
    00000000..00007FFF: Boot ROM
    0000C000..0000DFFF: Boot SRAM
    00010000..0001FFFF: ZERO's
    00020000..0002FFFF: BJX2 NOP's
    00030000..0003FFFF: BJX2 BREAK's
    ...
    01000000..1FFFFFFF: Reserved for RAM
    20000000..3FFFFFFF: Reserved for More RAM (And/or repeating)
    40000000..5FFFFFFF: RAM repeats (and/or Reserved)
    60000000..7FFFFFFF: RAM repeats more (and/or Reserved)
    80000000..EFFFFFFF: Reserved
    F0000000..FFFFFFFF: MMIO in 32-bit Mode (*1)

    *1: There used to be an MMIO range at 0000_F0000000, but this has been eliminated in favor of only recognizing this range as MMIO in 32-bit
    mode (where only the low 32-bits of the address are used). Enabling
    48-bit addressing will now require using the proper MMIO address.

    Currently, nothing past the low 4GB is used in the physical memory map.


    It is notable that even on non-intel systems, one may need
    to map a 32-bit PCI BAR (AHCI is the classic example) which
    requires the address programmed in the bar to be less than
    0x100000000 (4GB). Granted systems can have custom PCI controllers
    that remap that into the larger physical address space with
    a bit of extra hardware, however the kernel people don't
    like that at all since there is no universal standard for
    such remapping and they don't want to support
    dozens of independent implementations, constantly
    changing from generation to generation.

    Many modern SoCs (and ARM SBSA requires this) make their
    on-board devices and coprocessors look like PCI express
    devices to software, and SBSA requires the PCIe ECAM
    region for device discovery. Here again, each of
    these on board devices will have from one to six
    memory region base address registers (or one to
    three for 64-bit bars).

    Encoding memory attributes into the address is common
    in microcontrollers, but in a general purpose processor
    constrains the system to an extent sufficient to make it
    unattractive for general purpose workloads.


    Possibly, but making things more flexible here would be a non-trivial
    level of complexity to deal with at the moment (and, it seemed relevant
    at first to design something I could "actually implement").


    At the time I started out on this, even maintaining similar hardware
    interfaces to a minimalist version of the Sega Dreamcast (what the
    BJX1's hardware-interface design was partly based on) was asking a bit
    too much (even after leaving out things like the CD-ROM drive and similar).


    So, I simplified things somewhat, initially taking some design
    inspiration in these areas from the Commodore 64 and MSP430 and similar...

    Say:
    VRAM was reinterpreted as being an 80x25 grid of 8x8 pixel color cells;
    Audio was a simple MMIO-backed PCM loop (with a few registers to adjust
    the sample rate and similar).

    In terms of output signals, the display module drives a VGA output, and
    the audio is generally pulled off by turning an IO pin on and off really
    fast.

    Or, one drives 2 lines for audio, say:
    10: +, 01: -, 11: 0

    Using an H-Bridge driver as an amplifier (turns out one needs to drive
    like 50-100mA to get any decent level of loudness out of headphones;
    which is well beyond the power normal IO pins can deliver). Generally
    PCM needs to get turned into PWM/PDM.
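
    The PCM-to-PDM step is, in essence, a first-order delta-sigma loop;
    a sketch of the idea in C (the real thing would be a few lines of
    Verilog):

        #include <stdint.h>

        static int32_t ds_acc;     /* delta-sigma accumulator */

        /* one 1-bit output per call; clocked well above the sample rate */
        int pdm_step(int16_t pcm)
        {
            ds_acc += pcm;
            if (ds_acc >= 0) {
                ds_acc -= 32767;   /* feed back full-scale high */
                return 1;          /* drive pin high */
            }
            ds_acc += 32767;       /* feed back full-scale low */
            return 0;              /* drive pin low */
        }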

    Driving stereo via a dual H-Bridge driver would get a little wonky
    though, since headphones use Left/Right and a Common, effectively one
    needs to drive the center as a neutral, with L/R channels (and/or, just
    get lazy and drive mono across both the L/R channels using a single
    H-Bridge and ignore the center point, which ironically can get more
    loudness at less current because now one is dealing with 70 ohm rather
    than 35 ohm).

    ...


    Generally, with all of the hardware addresses at fixed locations.
    Doing any kind of dynamic configuration or allowing hardware addresses
    to be movable would have likely made the MMIO devices significantly more expensive (vs hard-coding the address of each device).


    Did generally go with MMIO rather than x86 style IO ports though.
    Partly because IO ports suck, and I wasn't quite *that* limited (say,
    could afford to use a 28-bit space, rather than a 16-bit space).


    ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Mon Nov 27 02:09:52 2023
    Scott Lurndal wrote:

    BGB <cr88192@gmail.com> writes:
    On 11/26/2023 9:45 AM, Anton Ertl wrote:



    Where, say, memory map will look something like:
    00yy_xxxxxxxx: Physical mapped and direct-mapped memory.
    00yy_xxxxxxxx: Start of global virtual memory (*1);
    3FFF_xxxxxxxx: End of global virtual memory;
    4000_xxxxxxxx: Start of private/local virtual memory (possible, *2);
    7FFF_xxxxxxxx: End of private/local virtual memory (possible);
    8000_xxxxxxxx: Start of kernel virtual memory;
    BFFF_xxxxxxxx: End of kernel virtual memory;
    Cxxx_xxxxxxxx: Physical Address Range (Cached);
    Dxxx_xxxxxxxx: Physical Address Range (Volatile/NoCache);
    Exxx_xxxxxxxx: Reserved;
    Fxxx_xxxxxxxx: MMIO and similar.


    The modern preference is to make the memory map flexible.

                    // cacheable, used, modified bits
        CUM            kind of access
        ---            ------------------------------
        000            uncacheable DRAM
        001            MMI/O
        010            config
        011            ROM
        1xx            cacheable DRAM

    Linux, for example, requires that PCI Base Address Registers
    be programmable by the operating system, and the OS can
    choose any range (subject to host bridge configuration, of
    course) for the device.

    Easily done, just create an uncacheable PTE and set UM to 10
    for config space or 01 for MMI/O space.

    It is notable that even on non-intel systems, one may need
    to map a 32-bit PCI BAR (AHCI is the classic example), which
    requires the address programmed in the BAR to be less than
    0x1_0000_0000.

    I/O MMU translates these devices from a 32-bit VAS into the
    64-bit PAS.

    Granted systems can have custom PCI controllers
    that remap that into the larger physical address space with
    a bit of extra hardware, however the kernel people don't
    like that at all since there is no universal standard for
    such remapping and they don't want to support
    dozens of independent implementations, constantly
    changing from generation to generation.

    Though you would figure that, since they are already supporting 4
    incompatible mapping systems {Intel, AMD, ARM, RISC-V}, they would
    have gotten good at these implementations :-)

    Many modern SoCs (and ARM SBSA requires this) make their
    on-board devices and coprocessors look like PCI express
    devices to software,

    I made the CPU/cores in My 66000 have a configuration port
    that is set up during boot and that smells just like a PCIe
    port.

    and SBSA requires the PCIe ECAM
    region for device discovery. Here again, each of
    these on-board devices will have from one to six
    memory region base address registers (or one to
    three for 64-bit BARs).

    Encoding memory attributes into the address is common
    in microcontrollers, but in a general purpose processor
    constrains the system to an extent sufficient to make it
    unattractive for general purpose workloads.

    Agreed.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to MitchAlsup on Mon Nov 27 00:04:06 2023
    On 11/26/2023 8:09 PM, MitchAlsup wrote:
    Scott Lurndal wrote:

    BGB <cr88192@gmail.com> writes:
    On 11/26/2023 9:45 AM, Anton Ertl wrote:



    Where, say, memory map will look something like:
      00yy_xxxxxxxx: Physical mapped and direct-mapped memory.
      00yy_xxxxxxxx: Start of global virtual memory (*1);
      3FFF_xxxxxxxx: End of global virtual memory;
      4000_xxxxxxxx: Start of private/local virtual memory (possible, *2);
      7FFF_xxxxxxxx: End of private/local virtual memory (possible);
      8000_xxxxxxxx: Start of kernel virtual memory;
      BFFF_xxxxxxxx: End of kernel virtual memory;
      Cxxx_xxxxxxxx: Physical Address Range (Cached);
      Dxxx_xxxxxxxx: Physical Address Range (Volatile/NoCache);
      Exxx_xxxxxxxx: Reserved;
      Fxxx_xxxxxxxx: MMIO and similar.

    The modern preference is to make the memory map flexible.


    As noted, some amount of the above would be part of the OS memory map,
    rather than a hardware-imposed memory map.


    Like, say, Windows on x86 typically had:
    00000000..000FFFFF: DOS-like map (9x)
    00100000..7FFFFFFF: Userland stuff
    80000000..BFFFFFFF: Shared stuff
    C0000000..FFFFFFFF: Kernel Stuff

    Did the hardware enforce this? No.
    Did Windows follow such a structure? Yes, generally.

    Linux sorta followed a similar structure, except that some versions
    gave the full 4GB to userland addresses (which was an annoyance when
    trying to use TagRefs, since the OS might actually put memory in the
    part of the address space one would have otherwise used to hold
    fixnums and similar).

    Ironically though, this sort of thing (along with the limits of 32-bit
    tagrefs) gave me incentive to go over to 64-bit tagrefs even on
    32-bit machines, and a generally similar tagref scheme got carried into
    my later projects.


    Say:
    0ttt_xxxx_xxxxxxxx: Pointers
    1ttt_xxxx_xxxxxxxx: Small Value Spaces
    2ttt_xxxx_xxxxxxxx: ...
    3yyy_xxxx_xxxxxxxx: Bounds Checked Pointers
    4iii_iiii_iiiiiiii: Fixnum
    ..
    7iii_iiii_iiiiiiii: Fixnum
    8iii_iiii_iiiiiiii: Flonum
    ..
    Biii_iiii_iiiiiiii: Flonum
    ...

    But, this scheme is more used by the runtime, not so much by the hardware.

    For the most part, C doesn't use pointer tagging.
    However BGBScript/JavaScript and my BASIC variant do make use of
    type-tagging.
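    For illustration, a minimal C sketch of how a runtime might test and
    box values under a layout like the above (top bits of a 64-bit word as
    the type tag). The exact shifts and payload widths here are assumptions,
    not the actual BGBScript encoding.

        #include <stdint.h>

        /* top 2 bits: 01 => fixnum (0x4..0x7), 10 => flonum (0x8..0xB) */
        static int is_fixnum(uint64_t v) { return (v >> 62) == 1; }
        static int is_flonum(uint64_t v) { return (v >> 62) == 2; }

        /* box a small integer: tag 01, 62-bit two's-complement payload */
        static uint64_t make_fixnum(int64_t i)
        {
            return ((uint64_t)1 << 62) |
                   ((uint64_t)i & (((uint64_t)1 << 62) - 1));
        }

        /* unbox: shift the tag out, then sign-extend the 62-bit payload */
        static int64_t fixnum_value(uint64_t v)
        {
            return ((int64_t)(v << 2)) >> 2;
        }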


                    // cacheable, used, modified bits
        CUM            kind of access
        ---            ------------------------------
        000            uncacheable DRAM
        001            MMI/O
        010            config
        011            ROM
        1xx            cacheable DRAM


    Hmm...
    Unfortunate acronyms are inescapable it seems...


    Linux, for example, requires that PCI Base Address Registers
    be programmable by the operating system, and the OS can
    choose any range (subject to host bridge configuration, of
    course) for the device.

    Easily done, just create an uncacheable PTE and set UM to 10
    for config space or 01 for MMI/O space.


    I guess, if PCIe were supported, some scheme could be developed to map
    the PCIe space either into part of the MMIO space, into RAM space, or
    maybe some other space.

    There is a functional difference between MMIO space and RAM space in
    terms of how they are accessed:
    RAM space: Cache does its thing and works with cache-lines;
    MMIO space: A request is sent over the bus, and then it waits for a
    response.

    If the MMIO bridge sees an MMIO request, it puts it onto the MMIO Bus,
    and sees if any device responds (if so, sending the response back to the origin). Otherwise, if no device responds after a certain number of
    clock cycles, an all-zeroes response is sent instead.
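    As a sketch, the bridge behavior described above could be modeled in C
    roughly as below. The device interface and the timeout constant are
    invented for illustration.

        #include <stdint.h>
        #include <stddef.h>

        #define MMIO_TIMEOUT_CYCLES 64  /* assumed bound */

        typedef struct {
            /* returns 1 and fills *data if the device owns 'addr' */
            int (*respond)(uint32_t addr, uint64_t *data);
        } mmio_device;

        uint64_t mmio_bridge_read(uint32_t addr,
                                  mmio_device *devs, size_t ndevs)
        {
            for (int cycle = 0; cycle < MMIO_TIMEOUT_CYCLES; cycle++) {
                for (size_t i = 0; i < ndevs; i++) {
                    uint64_t data;
                    if (devs[i].respond(addr, &data))
                        return data;  /* response goes back to origin */
                }
            }
            return 0;  /* no device responded: all-zeroes response */
        }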


    Currently, no sort of general-purpose bus is routed outside of the FPGA,
    and if one did exist, it is not yet clear what form it would take.

    Would need to limit pin counts though, so probably some sort of serial
    bus in any case.

    PCIe might be sort of tempting in the sense that, apparently, 1 PCIe lane
    can be subdivided among multiple devices, and bridge cards exist that can
    route PCIe over a repurposed USB cable and then connect multiple PCIe,
    PCI, or ISA cards. Albeit, apparently, with mixed results.



    It is notable that even on non-intel systems, one may need
    to map a 32-bit PCI BAR (AHCI is the classic example), which
    requires the address programmed in the BAR to be less than
    0x1_0000_0000.

    I/O MMU translates these devices from a 32-bit VAS into the 64-bit PAS.

                Granted systems can have custom PCI controllers
    that remap that into the larger physical address space with
    a bit of extra hardware, however the kernel people don't
    like that at all since there is no universal standard for
    such remapping and they don't want to support
    dozens of independent implementations, constantly
    changing from generation to generation.

    Though you would figure that, since they are already supporting 4
    incompatible mapping systems {Intel, AMD, ARM, RISC-V}, they would
    have gotten good at these implementations :-)

    Many modern SoCs (and ARM SBSA requires this) make their
    on-board devices and coprocessors look like PCI express
    devices to software,

    I made the CPU/cores in My 66000 have a configuration port
    that is set up during boot and that smells just like a PCIe
    port.

                         and SBSA requires the PCIe ECAM
    region for device discovery.    Here again, each of
    these on-board devices will have from one to six
    memory region base address registers (or one to
    three for 64-bit BARs).

    Encoding memory attributes into the address is common
    in microcontrollers, but in a general purpose processor
    constrains the system to an extent sufficient to make it
    unattractive for general purpose workloads.

    Agreed.


    At least for the userland address ranges, there is less of this going on
    than in SH4, which had basically spent the top 3 bits of the 32-bit
    address as mode.

    Say, IIRC:
    (29): No TLB
    (30): No Cache
    (31): Supervisor

    So, in effect, there was only 512MB of usable address space.
    The SH-4A had then expanded the lower part to 31 bits, so one could have
    2GB of usermode address space.


    But, say, if one can have 47 bits of freely usable virtual address space
    for userland, probably good enough.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Mon Nov 27 07:22:22 2023
    BGB <cr88192@gmail.com> writes:
    On 11/26/2023 9:45 AM, Anton Ertl wrote:
    This would be especially useful for the read-only sections (e.g, code)
    of common libraries like libc. However, in today's security landscape,
    you don't want one process to know where library code is mapped in
    other processes (i.e., you want ASLR), so we can no longer make use of
    that benefit. And it's doubtful whether other uses are worth the
    complications (and even if they are, there might be security issues,
    too).


    It seems to me, as long as it is a different place on each system,
    probably good enough. Demanding a different location in each process
    would create a lot of additional memory overhead due to things like
    base-relocations or similar.

    If the binary is position-independent (the default on Linux on AMD64),
    there is no such overhead.

    I just started the same binary twice and looked at the address of the
    same piece of code:

    Gforth 0.7.3, Copyright (C) 1995-2008 Free Software Foundation, Inc.
    Gforth comes with ABSOLUTELY NO WARRANTY; for details type `license'
    Type `bye' to exit
    see open-file
    Code open-file
    0x000055c2b76d5833 <gforth_engine+6595>: mov %r15,0x1c126(%rip)
    ...

    For the other process the same instruction is:

    Code open-file
    0x000055dd606e4833 <gforth_engine+6595>: mov %r15,0x1c126(%rip)

    Following the calls until I get to glibc, I get, for the two processes:

    0x00007f705c0c3b90 <__libc_open64+0>: push %r12
    0x00007f190aa34b90 <__libc_open64+0>: push %r12

    So not just the binary, but also glibc resides at different virtual
    addresses in the two processes.

    So obviously the Linux and glibc maintainers think that per-system
    ASLR is not good enough. They obviously want ASLR to work as well as
    possible against local attackers.

    Of course the TLB looks up by VA, what else. But if the VA is the
    same and the PA is the same, the same ASID can be used.


    ?...

    Typically the ASID applies to the whole virtual address space, not to
    individual memory objects.

    Yes, one would need more complicated ASID management than setting
    "the" ASID on switching to a process if different VMAs in the process
    have different ASIDs. Another reason not to go there.

    Power (and IIRC HPPA) do something in this direction with their
    "segments", where the VA space was split into 16 equal parts, and
    IIRC the 16 parts each extended the address by 16 bits (minus the 4
    bits of the segment number), so essentially they have 16 16-bit ASIDs.
    The address spaces are somewhat inflexible, but with 64-bit VAs
    (i.e. 60-bit address spaces) that may be good enough for quite a
    while. The cost is that you now have to manage 16 ASID registers.
    And if we ever get to actually making use of more than 60 bits of VA in
    other ways, combining this ASID scheme with that other use of the VAs
    becomes a problem.
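    A rough sketch in C of the lookup such a scheme implies: the top 4 bits
    of the VA select one of 16 per-process segment registers, whose ASID
    extends the TLB match key. Field widths here are assumptions for
    illustration.

        #include <stdint.h>

        typedef struct {
            uint16_t seg_asid[16];  /* per-segment ASIDs, OS-managed */
        } segment_regs;

        /* compute the (ASID, offset) pair the TLB would match on */
        static uint64_t tlb_key(const segment_regs *sr, uint64_t va,
                                uint16_t *asid_out)
        {
            unsigned seg = (unsigned)(va >> 60);   /* top 4 bits */
            *asid_out = sr->seg_asid[seg];
            return va & 0x0FFFFFFFFFFFFFFFull;     /* 60-bit offset */
        }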

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Mon Nov 27 07:57:08 2023
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    FWIW, MAP_FIXED is specified as an optional feature by POSIX
    and may not be supported by the OS at all.

    As usual, what is specified by a common-subset standard is not
    relevant for what an OS implementor has to do if they want to supply
    more than a practically unusable checkbox feature like the POSIX
    subsystem for Windows. There is a reason why WSL2 includes a full
    Linux kernel.
    ...
    Because the semantics of MAP_FIXED are to unmap any
    prior mapping in the range, if the implementation had happened to
    allocate the heap or shared System V region at that address, the heap
    would have become corrupt with dangling references hanging
    around which, if stored into, would subsequently corrupt the mapped region.

    Of course you can provide an address without specifying MAP_FIXED, and
    a high-quality OS will satisfy the request if possible (and return a
    different address if not), while a work-to-rule OS like the POSIX
    subsystem for Windows may then treat that address as if the user had
    passed NULL.

    Interestingly, Linux (since 4.17) also provides MAP_FIXED_NOREPLACE,
    which works like MAP_FIXED except that it returns an error if
    MAP_FIXED would replace part of an existing mapping. Makes me wonder
    if, in the no-conflict case, and given a page-aligned addr, there is any
    difference between MAP_FIXED, MAP_FIXED_NOREPLACE, and just providing
    an address without any of these flags in Linux. In the conflict case,
    the difference between the latter two variants is how you detect that
    it did not work as desired.
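    For illustration, the three variants look like this in C (Linux-specific;
    MAP_FIXED_NOREPLACE needs Linux 4.17+ and _GNU_SOURCE; the address below
    is an arbitrary aligned hint):

        #define _GNU_SOURCE
        #include <sys/mman.h>
        #include <stdio.h>

        int main(void)
        {
            void *want = (void *)0x200000000000ull;
            size_t len = 1 << 20;

            /* hint only: the kernel may place the mapping elsewhere */
            void *a = mmap(want, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            /* hard request: MAP_FAILED (EEXIST) on conflict -- here it
             * conflicts with the mapping just created above */
            void *b = mmap(want, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS |
                           MAP_FIXED_NOREPLACE, -1, 0);

            /* MAP_FIXED would instead silently replace the conflict */
            printf("hint: %p  noreplace: %p\n", a, b);
            return 0;
        }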

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Anton Ertl on Mon Nov 27 03:34:34 2023
    On 11/27/2023 1:22 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/26/2023 9:45 AM, Anton Ertl wrote:
    This would be especially useful for the read-only sections (e.g, code)
    of common libraries like libc. However, in todays security landscape,
    you don't want one process to know where library code is mapped in
    other processes (i.e., you want ASLR), so we can no longer make use of
    that benefit. And it's doubtful whether other uses are worth the
    complications (and even if they are, there might be security issues,
    too).


    It seems to me, as long as it is a different place on each system,
    probably good enough. Demanding a different location in each process
    would create a lot of additional memory overhead due to things like
    base-relocations or similar.

    If the binary is position-independent (the default on Linux on AMD64),
    there is no such overhead.


    OK.

    I was thinking mostly of things like PE/COFF, where often a mix of
    relative and absolute addressing is used, and loading typically involves applying base relocations (so, once loaded, the assumption is that the
    binary will not move further).

    Granted, traditional PE/COFF and ELF manage things like global variables differently (direct vs GOT).

    Though, on x86-64, PC-relative addressing is a thing, so less need for
    absolute addressing. PIC with PE/COFF might not be too much of a stretch.


    I just started the same binary twice and looked at the address of the
    same piece of code:

    Gforth 0.7.3, Copyright (C) 1995-2008 Free Software Foundation, Inc.
    Gforth comes with ABSOLUTELY NO WARRANTY; for details type `license'
    Type `bye' to exit
    see open-file
    Code open-file
    0x000055c2b76d5833 <gforth_engine+6595>: mov %r15,0x1c126(%rip)
    ...

    For the other process the same instruction is:

    Code open-file
    0x000055dd606e4833 <gforth_engine+6595>: mov %r15,0x1c126(%rip)

    Following the calls until I get to glibc, I get, for the two processes:

    0x00007f705c0c3b90 <__libc_open64+0>: push %r12
    0x00007f190aa34b90 <__libc_open64+0>: push %r12

    So not just the binary, but also glibc resides at different virtual
    addresses in the two processes.

    So obviously the Linux and glibc maintainers think that per-system
    ASLR is not good enough. They obviously want ASLR to work as well as possible against local attackers.


    OK.


    Of course the TLB looks up by VA, what else. But if the VA is the
    same and the PA is the same, the same ASID can be used.


    ?...

    Typically the ASID applies to the whole virtual address space, not to
    individual memory objects.

    Yes, one would need more complicated ASID management than setting
    "the" ASID on switching to a process if different VMAs in the process
    have different ASIDs. Another reason not to go there.

    Power (and IIRC HPPA) do something in this direction with their
    "segments", where the VA space was split into 16 equal parts, and
    IIRC the 16 parts each extended the address by 16 bits (minus the 4
    bits of the segment number), so essentially they have 16 16-bit ASIDs.
    The address spaces are somewhat inflexible, but with 64-bit VAs
    (i.e. 60-bit address spaces) that may be good enough for quite a
    while. The cost is that you now have to manage 16 ASID registers.
    And if we ever get to actually making use of more than 60 bits of VA in
    other ways, combining this ASID scheme with that other use of the VAs
    becomes a problem.


    OK.

    That seems a bit odd...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Mon Nov 27 14:59:36 2023
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    FWIW, MAP_FIXED is specified as an optional feature by POSIX
    and may not be supported by the OS at all.

    As usual, what is specified by a common-subset standard is not
    relevant for what an OS implementor has to do if they want to supply
    more than a practically unusable checkbox feature like the POSIX
    subsystem for Windows. There is a reason why WSL2 includes a full
    Linux kernel.
    ...
    Because the semantics of MAP_FIXED are to unmap any
    prior mapping in the range, if the implementation had happened to
    allocate the heap or shared System V region at that address, the heap
    would have become corrupt with dangling references hanging
    around which, if stored into, would subsequently corrupt the mapped region.

    Of course you can provide an address without specifying MAP_FIXED, and
    a high-quality OS will satisfy the request if possible (and return a
    different address if not), while a work-to-rule OS like the POSIX
    subsystem for Windows may then treat that address as if the user had
    passed NULL.

    Interestingly, Linux (since 4.17) also provides MAP_FIXED_NOREPLACE,
    which works like MAP_FIXED except that it returns an error if
    MAP_FIXED would replace part of an existing mapping. Makes me wonder
    if, in the no-conflict case, and given a page-aligned addr, there is any
    difference between MAP_FIXED, MAP_FIXED_NOREPLACE, and just providing
    an address without any of these flags in Linux. In the conflict case,
    the difference between the latter two variants is how you detect that
    it did not work as desired.


    I've never seen a case where using MAP_FIXED was useful, and I've
    been using mmap since the early 90's. I'm sure there must be one,
    probably where someone uses full VAs instead of offsets in data
    structures. Using the full VAs in the region will likely cause
    issues in the long term as the application is moved to updated or
    different POSIX systems, particularly if the data file associated
    with the region is expected to work in all subsequent
    implementations. MAP_FIXED should be avoided, IMO.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Mon Nov 27 16:10:49 2023
    scott@slp53.sl.home (Scott Lurndal) writes:
    I've never seen a case where using MAP_FIXED was useful, and I've
    been using mmap since the early 90's.

    Gforth uses it for putting the image into the dictionary (the memory
    area for Forth definitions, where more definitions can be put during a session): It first allocates the space for the dictionary with an
    anonymous mmap, then puts the image at the start of this area with a
    file mmap with MAP_FIXED.
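    A sketch of that pattern in C (names invented; error handling
    abbreviated). The MAP_FIXED overlay is safe here precisely because it
    can only replace pages inside the reservation made one step earlier:

        #include <sys/mman.h>
        #include <fcntl.h>
        #include <unistd.h>

        void *load_image(const char *path, size_t dict_size,
                         size_t image_size)
        {
            int fd = open(path, O_RDONLY);
            if (fd < 0) return MAP_FAILED;

            /* 1: reserve the whole dictionary area */
            void *dict = mmap(NULL, dict_size, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (dict == MAP_FAILED) { close(fd); return MAP_FAILED; }

            /* 2: overlay the image file at the start of the area */
            void *img = mmap(dict, image_size, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_FIXED, fd, 0);
            close(fd);
            if (img == MAP_FAILED) {
                munmap(dict, dict_size);
                return MAP_FAILED;
            }
            return dict;
        }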

    It also currently uses MAP_FIXED for allocating the memory for
    non-relocatable images, but thinking through it again, it's probably
    better to use MAP_FIXED_NOREPLACE or nothing, and then check the
    address, and report any error. However, we have not received any bug
    reports about that, which probably shows that nobody uses
    non-relocatable images.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Robert Finch on Thu Nov 30 16:59:37 2023
    Robert Finch <robfi680@gmail.com> writes:
    <snip>
    My thought was to support two
    banks of registers, one for the highest operating mode, and the other
    for remaining operating modes.

    How do the operating modes pass data between each other? E.g. for
    a system call, the arguments are generally passed to the next higher
    privilege level/operating mode via registers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Robert Finch on Thu Nov 30 13:35:04 2023
    Robert Finch wrote:
    The Q+ register file is implemented with one block-RAM per read port.
    With a 64-bit width this gives 512 registers in a block RAM. 192
    registers are needed for renaming a 64-entry architectural register
    file. That leaves 320 registers unused. My thought was to support two
    banks of registers, one for the highest operating mode, and the other
    for remaining operating modes. On exceptions the register bank could be
    switched. But to do this there are now 128 registers effectively being
    renamed, which leads to 384 physical registers to manage. This doubles
    the size of the register management code. Unless a pipeline flush
    occurs for exception processing, which I think would allow the renamer to
    reuse the same hardware to manage a new bank of registers. But that
    hinges on all references to registers in the current bank being unused.

    My other thought was that with approximately three times the number of architectural registers required, using 256 physical registers would
    allow 85 architectural registers. Perhaps some of the registers could be banked for different operating modes. Banking four registers per mode
    would use up 16.

    If the 512-register file were divided by three, 170 physical registers
    could be available for renaming. This is less than the ideal 192
    registers but maybe close enough to not impact performance adversely.


    I don't understand the problem.
    You want 64 architecture registers, each of which needs a physical register,
    plus 128 registers for in-flight instructions, so 192 physical registers.

    If you add a second bank of 64 architecture registers for interrupts
    then each needs a physical register. But that doesn't change the number
    of in-flight registers, so that's 256 physical total.
    Plus two sets of rename banks, one for each mode.

    If you drain the pipeline before switching register banks then all
    of the 128 in-flight registers will be free at the time of switch.

    If you can switch to interrupt mode without draining the pipeline then
    some of those 128 will be in-use for the old mode, some for the new mode
    (and the uOps carry a privilege mode flag so you can do things like
    check LD or ST ops against the appropriate PTE mode access control).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Thu Nov 30 20:30:52 2023
    EricP wrote:

    Robert Finch wrote:
    The Q+ register file is implemented with one block-RAM per read port.
    With a 64-bit width this gives 512 registers in a block RAM. 192
    registers are needed for renaming a 64-entry architectural register
    file. That leaves 320 registers unused. My thought was to support two
    banks of registers, one for the highest operating mode, and the other
    for remaining operating modes. On exceptions the register bank could be
    switched. But to do this there are now 128 registers effectively being
    renamed, which leads to 384 physical registers to manage. This doubles
    the size of the register management code. Unless a pipeline flush
    occurs for exception processing which I think would allow the renamer to
    reuse the same hardware to manage a new bank of registers. But that
    hinges on all references to registers in the current bank being unused.

    My other thought was that with approximately three times the number of
    architectural registers required, using 256 physical registers would
    allow 85 architectural registers. Perhaps some of the registers could be
    banked for different operating modes. Banking four registers per mode
    would use up 16.

    If the 512-register file were divided by three, 170 physical registers
    could be available for renaming. This is less than the ideal 192
    registers but maybe close enough to not impact performance adversely.


    I don't understand the problem.
    You want 64 architecture registers, each of which needs a physical register,
    plus 128 registers for in-flight instructions, so 192 physical registers.

    If you add a second bank of 64 architecture registers for interrupts
    then each needs a physical register. But that doesn't change the number
    of in-flight registers, so that's 256 physical total.
    Plus two sets of rename banks, one for each mode.

    If you drain the pipeline before switching register banks then all
    of the 128 in-flight registers will be free at the time of switch.

    A couple of bits of state and you don't need to drain the pipeline,
    you just have to find the youngest instruction with the property
    that all older instructions cannot raise an exception; these can be
    allowed to finish execution while you are fetching instructions for
    the new context.

    If you can switch to interrupt mode without draining the pipeline then
    some of those 128 will be in-use for the old mode, some for the new mode
    (and the uOps carry a privilege mode flag so you can do things like
    check LD or ST ops against the appropriate PTE mode access control).

    And 1 bit of state keeps track of which is which.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Thu Nov 30 23:06:32 2023
    Robert Finch wrote:

    On 2023-11-30 3:30 p.m., MitchAlsup wrote:
    EricP wrote:

    Robert Finch wrote:
    The Q+ register file is implemented with one block-RAM per read port.
    With a 64-bit width this gives 512 registers in a block RAM. 192
    registers are needed for renaming a 64-entry architectural register
    file. That leaves 320 registers unused. My thought was to support two
    banks of registers, one for the highest operating mode, and the other
    for remaining operating modes. On exceptions the register bank could
    be switched. But to do this there are now 128 registers effectively
    being renamed, which leads to 384 physical registers to manage. This
    doubles the size of the register management code. Unless a pipeline
    flush occurs for exception processing which I think would allow the
    renamer to reuse the same hardware to manage a new bank of registers.
    But that hinges on all references to registers in the current bank
    being unused.

    My other thought was that with approximately three times the number
    of architectural registers required, using 256 physical registers
    would allow 85 architectural registers. Perhaps some of the registers
    could be banked for different operating modes. Banking four registers
    per mode would use up 16.

    If the 512-register file were divided by three, 170 physical
    registers could be available for renaming. This is less than the
    ideal 192 registers but maybe close enough to not impact performance
    adversely.


    I don't understand the problem.
    You want 64 architecture registers, each of which needs a physical register,
    plus 128 registers for in-flight instructions, so 192 physical registers.
    If you add a second bank of 64 architecture registers for interrupts
    then each needs a physical register. But that doesn't change the number
    of in-flight registers, so that's 256 physical total.
    Plus two sets of rename banks, one for each mode.

    If you drain the pipeline before switching register banks then all
    of the 128 in-flight registers will be free at the time of switch.

    A couple of bits of state and you don't need to drain the pipeline,
    you just have to find the youngest instruction with the property that
    all older instructions cannot raise an exception; these can be
    allowed to finish execution while you are fetching instructions for
    the new context.

    Not quite comprehending. Will not the registers for the new context be improperly mapped if there are registers in use for the old map?

    All the in-flight destination registers will get written by the in-flight
    instructions. All the instructions of the new context will allocate
    registers from the pool which is not currently in-flight. So, while there
    is mental confusion on how this gets pulled off in HW, it does get pulled
    off just fine. When the new context STs the registers of the old context,
    it obtains the correct register from the old context {{Should HW be doing
    this, the same orchestration applies--and it still works.}}

    I think
    a state bit could be used to pause a fetch of a register still in use in
    the old map, but that is draining the pipeline anyway.

    You are assuming a RAT; I am not using a RAT but a CAM, where I can restore
    to any checkpoint by simply rewriting the valid bit vector.

    When the context swaps, a new set of target registers is always
    established before the registers are used.

    You still have to deal with the transient state and the CAM version works
    with either SW or HW save/restore.

    So incoming references in the
    new context should always map to the new registers?

    Which they will--as illustrated above.


    If you can switch to interrupt mode without draining the pipeline then
    some of those 128 will be in-use for the old mode, some for the new mode
    check LD or ST ops against the appropriate PTE mode access control).

    And 1 bit of state keeps track of which is which.

    Did some experimenting and the RAT turns out to be too large if more
    registers are incorporated. Even as few as 256 regs caused the RAT to
    increase in size substantially. So, I may go the alternate route of
    making registers wider rather than deeper, having 128-bit wide registers
    instead.

    Register ports (or equivalently RAT ports) are one of the things that most
    limit issue width. K9 was to have 22 RAT ports, and was similar in size to
    the {standard decoded Register File}.

    There is an eight-bit sequence number associated with each
    instruction, so the age of an instruction can easily be detected. I

    I assign a 4-bit number (16 checkpoints) to all instructions issued in
    the same clock cycle. This gives a 6-wide machine up to 96 instructions in-flight; and makes backing up (misprediction) simple and fast.

    found a really slick way of detecting instruction age using a matrix
    approach on the web. But I did not fully understand it. So I just use
    eight-bit counters for now.
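    For reference, the matrix approach is roughly the following (a C
    rendering of what is, in hardware, one row of bits per window entry;
    sizes and names here are illustrative):

        #include <stdint.h>

        #define NENT 32            /* in-flight window size (assumed) */

        static uint32_t valid;     /* bit i: entry i holds a live uop */
        static uint32_t age[NENT]; /* bit j of age[i]: i is younger than j */

        void alloc_entry(int i)    /* on dispatch */
        {
            age[i] = valid;        /* everything currently live is older */
            valid |= 1u << i;
        }

        void free_entry(int i)     /* on retire or flush */
        {
            valid &= ~(1u << i);
            for (int j = 0; j < NENT; j++)
                age[j] &= ~(1u << i);  /* i is no longer anyone's elder */
        }

        /* oldest among 'ready': a ready entry none of whose older
         * entries are themselves ready (a wired-AND per row in HW) */
        int oldest_ready(uint32_t ready)
        {
            for (int i = 0; i < NENT; i++)
                if ((ready & (1u << i)) && (age[i] & ready) == 0)
                    return i;
            return -1;
        }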

    There is a two bit privilege mode flag for instructions in the ROB. I
    suppose the ROB entries could be called uOps.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Fri Dec 1 02:43:20 2023
    Robert Finch wrote:

    On 2023-11-30 6:06 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    On 2023-11-30 3:30 p.m., MitchAlsup wrote:
    EricP wrote:

    Robert Finch wrote:
    The Q+ register file is implemented with one block-RAM per read
    port. With a 64-bit width this gives 512 registers in a block RAM.
    192 registers are needed for renaming a 64-entry architectural
    register file. That leaves 320 registers unused. My thought was to
    support two banks of registers, one for the highest operating mode,
    and the other for remaining operating modes. On exceptions the
    register bank could be switched. But to do this there are now
    128 registers effectively being renamed, which leads to 384 physical
    registers to manage. This doubles the size of the register
    management code. Unless a pipeline flush occurs for exception
    processing, which I think would allow the renamer to reuse the same
    hardware to manage a new bank of registers. But that hinges on all
    references to registers in the current bank being unused.

    My other thought was that with approximately three times the number
    of architectural registers required, using 256 physical registers
    would allow 85 architectural registers. Perhaps some of the
    registers could be banked for different operating modes. Banking
    four registers per mode would use up 16.

    If the 512-register file were divided by three, 170 physical
    registers could be available for renaming. This is less than the
    ideal 192 registers but maybe close enough to not impact
    performance adversely.


    I don't understand the problem.
    You want 64 architecture registers, each of which needs a physical
    register, plus 128 registers for in-flight instructions, so 192
    physical registers.

    If you add a second bank of 64 architecture registers for interrupts
    then each needs a physical register. But that doesn't change the number
    of in-flight registers, so that's 256 physical total.
    Plus two sets of rename banks, one for each mode.

    If you drain the pipeline before switching register banks then all
    of the 128 in-flight registers will be free at the time of switch.

    A couple of bits of state and you don't need to drain the pipeline,
    you just have to find the youngest instruction with the property that
    all older instructions cannot raise an exception; these can be
    allowed to finish execution while you are fetching instructions for
    the new context.

    Not quite comprehending. Will not the registers for the new context be
    improperly mapped if there are registers in use for the old map?

    All the in-flight destination registers will get written by the in-flight
    instructions. All the instructions of the new context will allocate
    registers from the pool which is not currently in-flight. So, while there
    is mental confusion on how this gets pulled off in HW, it does get pulled
    off just fine. When the new context STs the registers of the old context,
    it obtains the correct register from the old context {{Should HW be doing
    this, the same orchestration applies--and it still works.}}

    I think a state bit could be used to pause a fetch of a register still
    in use in the old map, but that is draining the pipeline anyway.

    You are assuming a RAT; I am not using a RAT but a CAM, where I can restore
    to any checkpoint by simply rewriting the valid bit vector.

    I think the RAT can be restored to a specific checkpoint as well using
    just an index value. Q+ has a checkpoint RAM of which one of the
    checkpoints is the active RAT. The RAT is really 16 tables. I stored a
    bit vector of the valid registers in the ROB so that the valid
    register set may be reset when a checkpoint is restored.

    When the context swaps, a new set of target registers is always
    established before the registers are used.

    You still have to deal with the transient state and the CAM version works
    with either SW or HW save/restore.

    So incoming references in
    the new context should always map to the new registers?

    Which they will--as illustrated above.


    If you can switch to interrupt mode without draining the pipeline then
    some of those 128 will be in-use for the old mode, some for the new mode
    (and the uOps carry a privilege mode flag so you can do things like
    check LD or ST ops against the appropriate PTE mode access control).

    And 1 bit of state keeps track of which is which.

    Did some experimenting and the RAT turns out to be too large if more
    registers are incorporated. Even as few as 256 regs caused the RAT to
    increase in size substantially. So, I may go the alternate route of
    making registers wider rather than deeper, having 128-bit wide
    registers instead.

    Register ports (or equivalently RAT ports) are one of the things that most
    limit issue width. K9 was to have 22 RAT ports, and was similar in size
    to the {standard decoded Register File}.

    The Q+ RAT has 16 read and 8 write ports. I am trying for a 4-wide
    machine. It is using about as many LUTs as the register file. The RAT is
    implemented with LUT RAM instead of block RAMs. I do not like the size,
    but it adds a lot to the operation of the machine.


    There is an eight-bit sequence number associated with each
    instruction, so the age of an instruction can easily be detected. I

    I assign a 4-bit number (16 checkpoints) to all instructions issued in
    the same clock cycle. This gives a 6-wide machine up to 96 instructions
    in-flight; and makes backing up (misprediction) simple and fast.

    The same thing is done with Q+. It supports 16 checkpoints with a
    four-bit number too. I have read that 16 is almost the same as infinity.

    Branch repair (from misprediction) has to be fast--especially if you are
    going for 0-cycle repair.

    Checkpoint recovery (from interrupt) just has to pick 1 of the checkpoints
    that has achieved the consistent state (no older instructions can raise an exception).

    Exception recovery can back up to the checkpoint containing the instruction
    which raised the exception, and then single-step forward until the exception
    is identified. Thus, you do not need "order" at a granularity smaller than
    a checkpoint.

    One can use pseudo-exceptions to solve difficult timing or sequencing
    problems, saving certain kinds of state transitions in the instruction
    queuing mechanism. For example, one could use a pseudo-exception to regain
    memory order in an ATOMIC event when you detect the order was less than sequentially consistent.

    found a really slick way of detecting instruction age using a matrix
    approach on the web. But I did not fully understand it. So I just use
    eight-bit counters for now.

    There is a two bit privilege mode flag for instructions in the ROB. I
    suppose the ROB entries could be called uOps.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Robert Finch on Sun Dec 3 11:07:48 2023
    Robert Finch wrote:
    Figured it out. Each architectural register in the RAT must refer to N physical registers, where N is the number of banks. Setting N to 4
    results in a RAT that is only about 50% larger than one supporting only
    a single bank. The operating mode is used to select the physical
    register. The first eight registers are shared between all operating
    modes so arguments can be passed to syscalls. It is tempting to have
    eight banks of registers, one for each hardware interrupt level.

    A consequence of multiple architecture register banks is that each extra
    bank keeps a set of mostly unused physical registers attached to it.
    For example, if there are 2 modes, User and Super, and a bank for each,
    since User and Super are mutually exclusive,
    64 of your 256 physical registers will be sitting unused, tied
    to the other mode's bank, so a max of 75% utilization efficiency.

    If you have 8 register banks then only 3/10 of the physical registers
    are available to use, the other 7/10 are sitting idle attached to
    arch registers in other modes consuming power.

    Also you don't have to play overlapped-register-bank games to pass
    args to/from syscalls. You can have specific instructions that reach
    into other banks: Move To User Reg, Move From User Reg.
    Since only syscall passes args into the OS you only need to access
    the user mode bank from the OS kernel bank.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Sun Dec 3 16:49:33 2023
    Robert Finch wrote:

    On 2023-11-30 9:43 p.m., MitchAlsup wrote:
    four-bit number too. I have read that 16 is almost the same as infinity.

    Branch repair (from misprediction) has to be fast--especially if you are
    going for 0-cycle repair.

    I think I am far away from zero-cycle repair. Does getting zero-cycle
    repair mean fetching from both branch directions and then selecting the correct one?

    No, zero cycle means you access the ICache twice per cycle, once on the
    predicted path and once on the alternate path. The alternate-path
    instructions are put in a buffer indexed by branch number. {{This happens
    10-12 cycles before the branch prediction is resolved}}

    When the branch instruction is launched out of its inst queue, the buffer
    is read, and if the branch prediction failed, you have the instructions
    from the mispredicted path ready to decode in the subsequent cycle.
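    A toy C model of that alternate-path buffer, to make the mechanism
    concrete (sizes and names invented; the real structure is indexed by
    branch/checkpoint number exactly as described above):

        #include <stdint.h>
        #include <string.h>

        #define NBRANCH 16   /* outstanding predicted branches */
        #define FETCHW  32   /* bytes fetched per alternate path */

        typedef struct {
            uint64_t alt_pc;          /* start of the alternate path */
            uint8_t  bytes[FETCHW];   /* pre-fetched instructions */
        } alt_path_entry;

        static alt_path_entry alt_buf[NBRANCH];

        /* at predict time: the second ICache access fills the buffer */
        void on_branch_predicted(int bnum, uint64_t alt_pc,
                                 const uint8_t *alt)
        {
            alt_buf[bnum].alt_pc = alt_pc;
            memcpy(alt_buf[bnum].bytes, alt, FETCHW);
        }

        /* at resolve time: on a mispredict, decode straight from the
         * buffer in the next cycle instead of re-accessing the ICache */
        const alt_path_entry *on_branch_mispredict(int bnum)
        {
            return &alt_buf[bnum];
        }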

    I will be happy if I can get branching to work at all. It
    is my first implementation using checkpoints. All the details of
    handling branches are not yet worked out in code for Q+. I think enough
    of the code is in place to get rough timing estimates. Not sure how well
    the BTB will work. A gselect predictor is also being used. Expecting a
    lot of branch mispredictions.

    Checkpoint recovery (from interrupt) just has to pick 1 of the checkpoints
    that has achieved the consistent state (no older instructions can raise an
    exception).

    Sounds straightforward enough.

    Exception recovery can back up to the checkpoint containing the
    instruction which raised the exception, and then single-step forward
    until the exception is identified. Thus, you do not need "order" at a
    granularity smaller than a checkpoint.

    This sounds a little trickier to do. Q+ currently takes an exception
    when things commit. It looks in the exception field of the queue entry
    for a fault code. If there is one it performs almost the same operation
    as a branch except it is occurring at the commit stage.

    One can use pseudo-exceptions to solve difficult timing or sequencing
    problems, saving certain kinds of state transitions in the instruction
    queuing mechanism. For example, one could use a pseudo-exception to regain
    memory order in an ATOMIC event when you detect the order was less than
    sequentially consistent.

    Noted.


    Gone back to using variable length instructions. Had to pipeline the instruction length decode across three clock cycles to get it to meet
    timing.

    Curious:: I got VLE to decode in 4 gates of delay, and I can PARSE up to
    16 instruction boundaries in a single cycle (using a tree of multiplexers).

    DECODE, then, only has to process the 32-bit instructions and route the constants in at Forwarding.

    Now:: I also use 3 cycles after ICache access, but 1 of the cycles includes
    tag comparison and set select, so I consider this a 2½-cycle decode; the ½
    cycle part performs the VLE and instruction-specifier route to decoder[k].

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Sun Dec 3 16:58:38 2023
    EricP wrote:

    Robert Finch wrote:
    Figured it out. Each architectural register in the RAT must refer to N
    physical registers, where N is the number of banks. Setting N to 4
    results in a RAT that is only about 50% larger than one supporting only
    a single bank. The operating mode is used to select the physical
    register. The first eight registers are shared between all operating
    modes so arguments can be passed to syscalls. It is tempting to have
    eight banks of registers, one for each hardware interrupt level.

    A consequence of multiple architecture register banks is that each extra
    bank keeps a set of mostly unused physical registers attached to it.

    A waste.....

    For example, if there are 2 modes User and Super and a bank for each,
    since User and Super are mutually exclusive,
    64 of your 256 physical registers will be sitting unused tied
    to the other mode bank, so max of 75% utilization efficiency.

    If you have 8 register banks then only 3/10 of the physical registers
    are available to use, the other 7/10 are sitting idle attached to
    arch registers in other modes consuming power.

    Also you don't have to play overlapped-register-bank games to pass
    args to/from syscalls. You can have specific instructions that reach
    into other banks: Move To User Reg, Move From User Reg.
    Since only syscall passes args into the OS you only need to access
    the user mode bank from the OS kernel bank.

    Whereas: Exceptions and interrupts save and restore 32 registers::
    A SysCall in My 66000 only saves and restores 24 of the 32 registers.
    So when control arrives, there are 8 argument registers from the
    Caller and 24 registers from Guest OS already loaded. So, SysCall
    handler already has its stack, and a variety of pointers to data
    structures it is interested in.

    On the way back, RET only restores 24 registers so Guest OS can pass
    back as many as 8 result registers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Tue Dec 5 23:47:10 2023
    Robert Finch wrote:


    For the Q+ MPU and SOC the bus system is organized like a tree with the
    root being at the CPU. The system bus operates with asynchronous transactions. The bus then fans out through bus bridges to various
    system components. Responses coming back from devices are buffered and
    merged together onto a more common bus when there are open spaces
    on the bus. I think it is fairly fast (well, at least for homebrew FPGA).
    Bus accesses are single cycle, but they may have a varying amount of
    latency.

    My "bus" is similar, but is, in effect, a 4-wire protocol done with transactions on the buss. Read goes to Mem CTL, when "ordered" Snoops
    go out, Snoop responses go to requesting core, Mem response goes to
    core. When core has SNOOP responses and mem data it sends DONE to
    mem Ctl. The arriving DONE allows the next access to that same cache
    line to begin (that is DONE "orders" successive accesses to the same
    line addresses, while allowing independent accesses to proceed inde-
    pendently.

    The data width of my "bus" is 1 cache line, or ½ cache line at DDR.
    Control is ~90 bits, including a 66-bit address.
    SNOOP responses are packed.


    Writes are “posted” so they are essentially single cycle.

    Writes to DRAM are "posted"
    Writes to config space are strongly ordered
    Writes to MMI/O are sequentially Consistent

    Reads percolate back up the tree to the CPU. The bus operates at the CPU
    clock rate (currently 40MHz) and transfers 128 bits at a time. Maximum peak
    transfer rate would then be 640 MB/s. Copying memory is bound to be much slower due to the read latency. Devices on the bus have a configuration
    block which looks something like a PCI config block, so devices
    addressing may be controlled by the OS.

    Multiple devices access the main DRAM memory via a memory controller.

    I interpose the LLC (L3) between the "bus" and the Mem Ctl. This
    interposition is what eliminates RowHammer. The L3 is not really a cache;
    it is a preview of the state DRAM will eventually achieve or has already
    achieved. It is, in essence, an infinite write buffer between the MC and
    DRC and a near infinite read buffer between DRC and MC.

    Several devices that are bus masters have their own ports to the memory controller and do not use up time on the main system bus tree. The

    Yes, PCIe HostBridge has master access to the "bus"; all "devices" are
    down under HostBridge. With CXL enabled, one can even place DRAM out on
    the PCIe tree,...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Fri Dec 8 17:53:12 2023
    Robert Finch wrote:

    What happens when there is a sequence of numerous branches in a row, such
    that the machine would run out of checkpoints for the branches?

    Stall Insert.

    Suppose you go
    Bra tgt1
    Bra tgt1
    … 30 times
    Bra tgt1

    Unconditional Branches do not need a checkpoint (all by themselves).

    Will the machine still work? Or will it crash?
    I have Q+ stalling until checkpoints are available, but it seems like a
    loss of performance. It is extra hardware to check for the case that
    might be preventable with software. I mean how often would a sequence
    like the above occur?

    Unconditional branches can be dealt with completely in the front end
    {they do not need to be executed--except as they alter IP.}

    On the other hand:: compilers are pretty good at cleaning up branches
    to unconditional branches.

    How will you tell for sure:: Read the ASM your compiler produces (a lot
    of it).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Sun Dec 10 01:02:00 2023
    Robert Finch wrote:

    Getting a bit lazy on the Q+ instruction commit in the interest of
    increasing the fmax. The results are already in the register file, so
    all the commit has to do is:

    1) Update the branch predictor.
    2) Free up physical registers

    By the time you write the physical register into the file, you are in
    a position to free up the now permanently invisible physical register
    it replaced.

    3) Free load/store queue entries associated with the ROB entry.

    Spectré:: write miss buffer data into Cache and TLB.
    This is also where I write ST.data into cache.

    4) Commit oddball instructions.
    5) Process any outstanding exceptions.
    6) Free the ROB entry
    7) Gather performance statistics.

    What needs to be committed is computed in the clock cycle before the
    commit. This pipelined signal adds a cycle of latency to the commit, but
    it only really affects rarely executed oddball instructions, and
    exceptions. Commit also will not commit if the commit pointer is near
    the queue pointer. Commit will also only commit up to the first oddball instruction or exception.

    Decided to axe the branch-to-register feature of conditional branch instructions because the branch target would not be known at enqueue
    time. It would require updating the ROB in two places.

    Question:: How would you handle::

    IDIV R6,R7,R8
    JMP R6

    ??

    Branches can now use a postfix immediate to extend the branch range.
    This allows 32 and 64-bit displacements in addition to the existing
    17-bit one. However, the assembler cannot know which to use in advance,
    so choosing a larger branch displacement size should be an option.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Sun Dec 10 04:06:39 2023
    Robert Finch wrote:

    On 2023-12-09 8:02 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    Getting a bit lazy on the Q+ instruction commit in the interest of
    increasing the fmax. The results are already in the register file, so
    all the commit has to do is:

    1)    Update the branch predictor.
    2)    Free up physical registers

    By the time you write the physical register into the file, you are in
    a position to free up the now permanently invisible physical register
    it replaced.

    Hey thanks, I should have thought of that. There are more physical
    registers available than needed (256, and only about 204 are needed), so
    it would probably run okay, but I think I see a way to reduce multiplexor
    usage by freeing the register when it is written.
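    A minimal sketch of that bookkeeping in C: each ROB entry remembers the
    physical register its destination displaced at rename time, and the
    displaced register is returned to the free list when the new value is
    written. This assumes, as in the exchange above, that the write happens
    once the instruction can no longer be squashed; the names are
    illustrative, not from either design.

        #include <stdint.h>

        #define NPHYS 256

        typedef struct {
            uint8_t dest_phys;  /* physical reg allocated at rename */
            uint8_t prev_phys;  /* mapping this destination displaced */
        } rob_entry;

        static uint8_t free_list[NPHYS];
        static int     free_top;

        /* called when the result is written to the register file */
        void on_writeback(rob_entry *e)
        {
            /* prev_phys is now permanently invisible: no younger
             * instruction can ever name it again */
            free_list[free_top++] = e->prev_phys;
        }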

    You are welcome.

    3)    Free load/store queue entries associated with the ROB entry.

    Spectré:: write miss buffer data into Cache and TLB.
    This is also where I write ST.data into cache.

    Is miss data for a TLB page fault?

    I leave TLB replacements in the miss buffer simply because they are so
    seldom that I don't feel it necessary to build yet another buffer.
    TLB update plus any tablewalk acceleration is deferred until the causing
    instruction retires.

    I have this stored in a register in
    the TLB which must be read by the CPU during exception handling.

    Technically, the TLB is the storage and comparators, while the rest
    of the table walking mechanics {including the TLB} are the MMU.

    Otherwise the TLB has a hidden page walker that updates the TLB.

    If you don't defer TLB update until after the causing instruction retires,
    Spectré-like attacks have a covert channel at their disposal.

    Scratching my head now over writing the store data at commit time.

    My 6-wide machine has a conditional-cache (memory reorder buffer)
    after execution; past that point, calculation instructions can raise no
    exception. This is the commit point. Between commit and retire, the
    conditional cache updates the Data Cache. So there is a period of time the
    pipeline builds up state, and, once it has been determined that
    nothing can prevent the manifestations of those instructions from
    taking place, there is a period of time state gets updated. Once
    all state is updated, the instruction has retired.

    4)    Commit oddball instructions.
    5)    Process any outstanding exceptions.
    6)    Free the ROB entry
    7)    Gather performance statistics.

    What needs to be committed is computed in the clock cycle before the
    commit. This pipelined signal adds a cycle of latency to the commit,
    but it only really affects rarely executed oddball instructions, and
    exceptions. Commit also will not commit if the commit pointer is near
    the queue pointer. Commit will also only commit up to the first
    oddball instruction or exception.

    Decided to axe the branch-to-register feature of conditional branch
    instructions because the branch target would not be known at enqueue
    time. It would require updating the ROB in two places.

    Question:: How would you handle::

        IDIV    R6,R7,R8
        JMP     R6

    ??

    There is a JMP address[Rn] instruction (really JSR Rt,address[Rn]) in
    the instruction set which is always treated as a branch miss when it
    executes. The RTS instruction could also be used; it allows the return
    address register to be specified and is a couple of bytes shorter. It
    was just that conditional branches had the feature removed. It required
    a third register be read for the flow control unit too.

    I have a LD IP,[address] instruction which is used to access GOT[k] for
    calling dynamically linked subroutines. This bypasses the LD-aligner
    to deliver IP to fetch faster.

    But you side-stepped answering my question. My question is what do you
    do when the Jump address will not arrive for another 20 cycles.

    Branches can now use a postfix immediate to extend the branch range.
    This allows 32 and 64-bit displacements in addition to the existing
    17-bit one. However, the assembler cannot know which to use in
    advance, so choosing a larger branch displacement size should be an
    option.

I use GOT[k] to branch farther than the 28-bit unconditional branch
displacement can reach. We have not yet run into a subroutine that
needs branches of more than 18 bits conditionally or 28 bits
unconditionally.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Sun Dec 10 15:11:38 2023
    Robert Finch wrote:

    On 2023-12-09 11:06 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    I have a LD IP,[address] instruction which is used to access GOT[k] for
    calling dynamically linked subroutines. This bypasses the LD-aligner
    to deliver IP to fetch faster.

    But you side-stepped answering my question. My question is what do you
    do when the Jump address will not arrive for another 20 cycles.

    While waiting for the register value, other instructions would continue
    to queue and execute. Then that processing would be dumped because of
    the branch miss. I suppose hardware could be added to suppress
    processing until the register value is known. An option for a larger build.

Branches can now use a postfix immediate to extend the branch range.
This allows 32 and 64-bit displacements in addition to the existing
    17-bit one. However, the assembler cannot know which to use in
    advance, so choosing a larger branch displacement size should be an
    option.

    I use GOT[k] to branch farther than the 28-bit unconditional branch
    displacement can reach. We have not yet run into a subroutine that
    needs branches of more then 18-bits conditionally or 28-bits uncon-
    ditionally.

    I have yet to use GOT addressing.

    There are issues to resolve in the Q+ frontend. The next PC value for
    the BTB is not available for about three clocks. To go backwards in
    time, the next PC needs to be cached, or rather the displacement to the
    next PC to reduce cache size.

    What you need is an index and a set to directly access the cache--all
the other stuff can be done in arrears {AGEN and cache tag check}.

    The first time a next PC is needed it will
    not be available for three clocks. Once cached it would be available
    within a clock. The next PC displacement is the sum of the lengths of
    next four instructions. There is not enough room in the FPGA to add
    another cache and associated logic, however. Next PC = PC + 20 seems a
    whole lot simpler to me.

    Thus, I may go back to using a fixed size instruction or rather
    instructions with fixed alignment. The position of instructions could be
    as if they were fixed length while remaining variable length.

If the first part of an instruction decodes to the length of the
instruction easily (EASILY) and cheaply, you can avoid the header and
build a tree of unary pointers, each such pointer pointing at twice as
many instruction starting points as the previous. Even without headers,
My 66000 can find the instruction boundaries of up to 16 instructions
per cycle without adding "stuff" to the block of instructions.

Instructions would just be aligned at fixed intervals. If I set the
length to five bytes, for instance, most of the instruction set could be
accommodated. Operation with "packed" instructions would be an option for
a larger build. There could be a bit in a control register to allow
execution of packed or unpacked instructions so there is some backwards
compatibility with a smaller build.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Sun Dec 10 22:39:10 2023
    Robert Finch wrote:

    On 2023-12-10 10:11 a.m., MitchAlsup wrote:
    Robert Finch wrote:

    On 2023-12-09 11:06 p.m., MitchAlsup wrote:
    Robert Finch wrote:

I have a LD IP,[address] instruction which is used to access GOT[k] for
calling dynamically linked subroutines. This bypasses the LD-aligner
    to deliver IP to fetch faster.

But you side-stepped answering my question. My question is what do you
do when the Jump address will not arrive for another 20 cycles.

    While waiting for the register value, other instructions would
    continue to queue and execute. Then that processing would be dumped
    because of the branch miss. I suppose hardware could be added to
    suppress processing until the register value is known. An option for a
    larger build.

    Branches can now use a postfix immediate to extend the branch
range. This allows 32 and 64-bit displacements in addition to the
existing 17-bit one. However, the assembler cannot know which to
use in advance, so choosing a larger branch displacement size
    should be an option.

    I use GOT[k] to branch farther than the 28-bit unconditional branch
    displacement can reach. We have not yet run into a subroutine that
    needs branches of more then 18-bits conditionally or 28-bits uncon-
    ditionally.

    I have yet to use GOT addressing.

    There are issues to resolve in the Q+ frontend. The next PC value for
    the BTB is not available for about three clocks. To go backwards in
    time, the next PC needs to be cached, or rather the displacement to
    the next PC to reduce cache size.

    What you need is an index and a set to directly access the cache--all
    the other stuff can be done in arears {AGEN and cache tag check}

                                  The first time a next PC is needed it
    will not be available for three clocks. Once cached it would be
    available within a clock. The next PC displacement is the sum of the
    lengths of next four instructions. There is not enough room in the
    FPGA to add another cache and associated logic, however. Next PC = PC
    + 20 seems a whole lot simpler to me.

    Thus, I may go back to using a fixed size instruction or rather
    instructions with fixed alignment. The position of instructions could
    be as if they were fixed length while remaining variable length.

    If the first part of an instruction decodes to the length of the
    instruction
    easily (EASILY) and cheaply, you can avoid the header and build a tree of
    unary pointers each such pointer pointing at twice as many instruction
    starting points as the previous. Even without headers, My 66000 can find
    the instruction boundaries of up to 16 instructions per cycle without
    adding
    "stuff" the the block of instructions.

    Instructions would just be aligned at fixed intervals. If I set the
    length to five bytes for instance, most the instruction set could be
    accommodated. Operation by “packed” instructions would be an option
    for a larger build. There could be a bit in a control register to
    allow execution by packed or unpacked instructions so there is some
    backwards compatibility to a smaller build.

I cannot get it to work at a decent speed for only six instructions.
With byte-aligned instructions, 64 decoders are in use. (They’re really
small.) Then outputs from the appropriate ones are selected. It is
partially the fullness of the FPGA and routing congestion because of the
design. Routing is taking 90% of the time; logic is only about 10%.

I did some experimenting with block headers and ended up with a block
trailer instead of a header, for the assembler’s benefit, since it needs
to know all the instruction lengths before the trailer can be output.
Only the index of the instruction group is needed, so usually there are
only a couple of indexes used per instruction block. It can likely get
by with a 24-bit trailer containing four indexes plus the assumed one.
Usually only one or two bytes are wasted at the end of a block.
I assembled the boot ROM and there are 4.9 bytes per instruction on
average, including the overhead of block trailers and wasted bytes.
Branches and postfixes are five bytes, and there are a lot of them.

    Code density is a little misleading because branches occupy five bytes
    but do both a compare and branch operation. So they should maybe count
    as two instructions.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Sun Dec 10 22:52:04 2023
    Robert Finch wrote:

    On 2023-12-10 10:11 a.m., MitchAlsup wrote:
    Robert Finch wrote:

    On 2023-12-09 11:06 p.m., MitchAlsup wrote:
    Robert Finch wrote:

I have a LD IP,[address] instruction which is used to access GOT[k] for
calling dynamically linked subroutines. This bypasses the LD-aligner
    to deliver IP to fetch faster.

But you side-stepped answering my question. My question is what do you
do when the Jump address will not arrive for another 20 cycles.

    While waiting for the register value, other instructions would
    continue to queue and execute. Then that processing would be dumped
    because of the branch miss. I suppose hardware could be added to
    suppress processing until the register value is known. An option for a
    larger build.

    Branches can now use a postfix immediate to extend the branch
range. This allows 32 and 64-bit displacements in addition to the
existing 17-bit one. However, the assembler cannot know which to
use in advance, so choosing a larger branch displacement size
    should be an option.

    I use GOT[k] to branch farther than the 28-bit unconditional branch
    displacement can reach. We have not yet run into a subroutine that
    needs branches of more then 18-bits conditionally or 28-bits uncon-
    ditionally.

    I have yet to use GOT addressing.

    There are issues to resolve in the Q+ frontend. The next PC value for
    the BTB is not available for about three clocks. To go backwards in
    time, the next PC needs to be cached, or rather the displacement to
    the next PC to reduce cache size.

    What you need is an index and a set to directly access the cache--all
    the other stuff can be done in arears {AGEN and cache tag check}

                                  The first time a next PC is needed it
    will not be available for three clocks. Once cached it would be
    available within a clock. The next PC displacement is the sum of the
    lengths of next four instructions. There is not enough room in the
    FPGA to add another cache and associated logic, however. Next PC = PC
    + 20 seems a whole lot simpler to me.

    Thus, I may go back to using a fixed size instruction or rather
    instructions with fixed alignment. The position of instructions could
    be as if they were fixed length while remaining variable length.

    If the first part of an instruction decodes to the length of the
    instruction
    easily (EASILY) and cheaply, you can avoid the header and build a tree of
    unary pointers each such pointer pointing at twice as many instruction
    starting points as the previous. Even without headers, My 66000 can find
    the instruction boundaries of up to 16 instructions per cycle without
    adding
    "stuff" the the block of instructions.

    Instructions would just be aligned at fixed intervals. If I set the
    length to five bytes for instance, most the instruction set could be
    accommodated. Operation by “packed” instructions would be an option
    for a larger build. There could be a bit in a control register to
    allow execution by packed or unpacked instructions so there is some
    backwards compatibility to a smaller build.

    I cannot get it to work at a decent speed for only six instructions.
With byte-aligned instructions, 64 decoders are in use. (They’re really
small.) Then outputs from the appropriate ones are selected. It is
partially the fullness of the FPGA and routing congestion because of the
design. Routing is taking 90% of the time; logic is only about 10%.

    That wire:logic ratio is "not that much out of line" for long distance
    bussing of data.

    My word oriented design would cut the decoders down to 16-decoders and
    they have to look at 7-bits to produce 3×5-bit vectors. A tree of
    AND gates takes it from here basically performing FF1.

    I did some experimenting with block headers and ended up with a block
    trailer instead of a header, for the assembler’s benefit which needs to know all the instruction lengths before the trailer can be output. Only
    the index of the instruction group is needed, so usually there are only
    a couple of indexes used per instruction block. It can likely get by
    with a 24-bit trailer containing four indexes plus the assumed one.
    Usually only one or two bytes are wasted at the end of a block.
    I assembled the boot rom and there are 4.9 bytes per instruction
    average, including the overhead of block trailers and wasted bytes.
Branches and postfixes are five bytes, and there are a lot of them.

    Code density is a little misleading because branches occupy five bytes
    but do both a compare and branch operation. So they should maybe count
    as two instructions.

Sooner or later you have to mash everything down to {bits, bytes,
words}. For instructions having VLE and having non-identity units of work
performed, bytes are probably the best representation. My eXcel
spreadsheet stuff uses bits.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Wed Dec 13 19:13:32 2023
    Robert Finch wrote:

    On 2023-12-11 4:57 a.m., BGB wrote:


    I got timing to work at 40+ MHz by using 32-bit instruction parcels
    rather than byte-oriented ones.

    An issue with 32-bit parcels is that float constants do not fit well
    into them because of the opcode present in a postfix. A 32-bit postfix
    has only 25 available bits for a constant. The next size up has 57 bits available. One thought I had was to reduce the floating-point precision
    to correspond. Single precision floats would be 25 bits, double
    precision 57 bits and quad precision 121 bits. All seven bits short of
    the usual.

It is because of issues such as you mention that my approach was
different. The instruction-specifier contains everything the decoder
needs to know about where the operands are, how to route them into
calculation, what to calculate, and where to deliver the result. Should
the instruction want constants for an operand*, they are concatenated
sequentially after the I-S and come in 32-bit and 64-bit quantities.
Should a 32-bit constant be consumed in a 64-bit calculation, it is
widened during route.

    (*) except for the 16-bit immediates and displacements from the
    Major OpCode table.
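
    A rough software model of that decode order (the wants_const*() and
    fetch*() helpers here are invented for illustration, not the actual
    encodings):

        #include <stdint.h>

        extern uint32_t fetch32(uint64_t pc);   /* hypothetical helpers */
        extern uint64_t fetch64(uint64_t pc);
        extern int wants_const32(uint32_t is);
        extern int wants_const64(uint32_t is);

        int64_t decode_one(uint64_t *pc) {
            uint32_t is = fetch32(*pc); *pc += 4;  /* instruction-specifier */
            int64_t imm = 0;
            if (wants_const32(is)) {               /* 32-bit trailing const */
                imm = (int32_t)fetch32(*pc);       /* widened during route  */
                *pc += 4;
            } else if (wants_const64(is)) {        /* 64-bit trailing const */
                imm = (int64_t)fetch64(*pc);
                *pc += 8;
            }
            return imm;
        }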

    I could try and use 40-bit parcels but they would need to be at fixed locations on the cache line for performance, and it would waste bytes.

    In effect I only have 32-bit parcels.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Robert Finch on Fri Dec 22 11:42:51 2023
    Robert Finch wrote:
    Stuck on checkpoint RAM now. Everything was going good until…. I
    realized that while instructions are executing they need to be able to
    update previous checkpoints, not just the current one. Which checkpoint
    gets updated depends on which checkpoint the instruction falls under. It
    is the register valid bit that needs to be updated. I used a “brute force” approach to implement this and it is 40k LUTs. This is about five times too large a solution. If I reduce the number of checkpoints
    supported to four from sixteen, then the component is 20k LUTs. Still
    too large.

    The issue is there are 256 valid bits times 16 checkpoints which means
    4096 registers. Muxing the register inputs and outputs uses a lot of LUTs.

    One thought is to stall until all the instructions with targets in a
    given checkpoint are finished executing before starting a new checkpoint region. It would seriously impact the CPU performance.


    (I don't have a solution, just passing on some info on this particular checkpointing issue.)

    Sounds like you might be using the same free register checkpoint algorithm
    I came up with for my simulator, which I assumed was a custom sram design.

    There is 1 bit for each physical register that is free.
    The checkpoint for a Bcc conditional branch copies the free bit vector,
    in your case 256 bits, to a row in the checkpoint sram.
As each instruction retires, it frees up its old dest physical register
and must mark the register free in *all* checkpoint contexts.

    That requires the ability to set all the free flags for a single register, which means an sram design that can write a whole row, and also set all the bits in one column, in your case set the 16 bits in each checkpoint for one
    of the 256 registers.
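
    In C, the required behavior looks something like this (a model of the
    access pattern, not of the sram cells themselves):

        #include <stdint.h>

        enum { NCKPT = 16, NWORDS = 256 / 32 };

        uint32_t free_now[NWORDS];         /* current free bit vector   */
        uint32_t ckpt[NCKPT][NWORDS];      /* one saved row per branch  */

        /* Checkpoint save: write a whole row at once. */
        void save_ckpt(int c) {
            for (int w = 0; w < NWORDS; w++) ckpt[c][w] = free_now[w];
        }

        /* Retire-time free: set one bit in the current vector AND the
         * same bit in every checkpoint -- the column set. */
        void free_pr(int pr) {
            free_now[pr / 32] |= 1u << (pr % 32);
            for (int c = 0; c < NCKPT; c++)
                ckpt[c][pr / 32] |= 1u << (pr % 32);
        }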

    I was assuming an ASIC design so a small custom sram seemed reasonable.
    But for an FPGA it requires 256*16 flip-flops plus decoders, etc.

    I think the risc-v Berkeley Out-of-Order Machine (BOOM) folks might have independently come up with the same approach on their BOOM-3 SonicBoom.
Their note [5] describes the same problem that my column setting solves.

    https://docs.boom-core.org/en/latest/sections/rename-stage.html

    While their target was 22nm ASIC, they say below that they
    implemented a version of BOOM-3 on an FPGA but don't give details.
    But their project might be open source so maybe the details
    are available online.

Sonicboom: The 3rd generation berkeley out-of-order machine
http://people.eecs.berkeley.edu/~krste/papers/SonicBOOM-CARRV2020.pdf

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Fri Dec 22 17:49:35 2023
    EricP wrote:

    Robert Finch wrote:
    Stuck on checkpoint RAM now. Everything was going good until…. I
    realized that while instructions are executing they need to be able to
    update previous checkpoints, not just the current one. Which checkpoint
    gets updated depends on which checkpoint the instruction falls under. It
    is the register valid bit that needs to be updated. I used a “brute
force” approach to implement this and it is 40k LUTs. This is about five
times too large a solution. If I reduce the number of checkpoints
    supported to four from sixteen, then the component is 20k LUTs. Still
    too large.

    The issue is there are 256 valid bits times 16 checkpoints which means
    4096 registers. Muxing the register inputs and outputs uses a lot of LUTs. >>
    One thought is to stall until all the instructions with targets in a
    given checkpoint are finished executing before starting a new checkpoint
    region. It would seriously impact the CPU performance.


    (I don't have a solution, just passing on some info on this particular checkpointing issue.)

    Sounds like you might be using the same free register checkpoint algorithm
    I came up with for my simulator, which I assumed was a custom sram design.

    There is 1 bit for each physical register that is free.
    The checkpoint for a Bcc conditional branch copies the free bit vector,
    in your case 256 bits, to a row in the checkpoint sram.
As each instruction retires, it frees up its old dest physical register
and must mark the register free in *all* checkpoint contexts.

    That requires the ability to set all the free flags for a single register, which means an sram design that can write a whole row, and also set all the bits in one column, in your case set the 16 bits in each checkpoint for one of the 256 registers.

    Two points::
1) The register that gets freed up, once you know this newly allocated
register will retire, can be determined with a small amount of logic (2
gates) per cell in your 256×16 matrix--no need for the column
write/clear/set. You can use this overwrite across columns to perform
register write elision.

    2) There are going to be allocations where you do not allocate any register
    to a particular instruction because the register is overwritten IN the same issue bundle. Here you can use a different "forwarding" notation so the
    result is captured by the stations and used without ever seeing the file.

I called this matrix the "History Table" in the Mc 88120; it provided valid
bits back to the aRN->pRN CAMs <backup> and valid bits back to the register
pool <successful retire>.

    Back then, we recognized that the architectural registers were a strict
    subset of the physical registers, so that as long as there were exactly
    31 (then: 32 now) valid registers in the pRF, one could always read
    values to be written into reservation station entries. In effect, the
whole thing was a RoB--once the RoB gets big enough, there is no reason
to have both a RoB and an aRF; just let the RoB do everything and change
    its name to Physical Register File. This eliminates the copy to aRF
    at retirement.

    I was assuming an ASIC design so a small custom sram seemed reasonable.
    But for an FPGA it requires 256*16 flip-flops plus decoders, etc.

    I think the risc-v Berkeley Out-of-Order Machine (BOOM) folks might have independently come up with the same approach on their BOOM-3 SonicBoom.
    Their note [5] describes the same problem as my column setting solves.

    https://docs.boom-core.org/en/latest/sections/rename-stage.html

I was doing something very similar in 1991.

    While their target was 22nm ASIC, they say below that they
    implemented a version of BOOM-3 on an FPGA but don't give details.
    But their project might be open source so maybe the details
    are available online.

Sonicboom: The 3rd generation berkeley out-of-order machine
http://people.eecs.berkeley.edu/~krste/papers/SonicBOOM-CARRV2020.pdf

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Robert Finch on Sat Dec 23 13:26:17 2023
    Robert Finch wrote:
    On 2023-12-22 12:49 p.m., MitchAlsup wrote:
    EricP wrote:

    Robert Finch wrote:
    Stuck on checkpoint RAM now. Everything was going good until…. I
    realized that while instructions are executing they need to be able
    to update previous checkpoints, not just the current one. Which
    checkpoint gets updated depends on which checkpoint the instruction
    falls under. It is the register valid bit that needs to be updated.
    I used a “brute force” approach to implement this and it is 40k
    LUTs. This is about five times too large a solution. If I reduce the
    number of checkpoints supported to four from sixteen, then the
    component is 20k LUTs. Still too large.

    The issue is there are 256 valid bits times 16 checkpoints which
    means 4096 registers. Muxing the register inputs and outputs uses a
    lot of LUTs.

    One thought is to stall until all the instructions with targets in a
    given checkpoint are finished executing before starting a new
    checkpoint region. It would seriously impact the CPU performance.

    I think I maybe found a solution using a block RAM and about 8k LUTs.

    (I don't have a solution, just passing on some info on this particular
    checkpointing issue.)

    Sounds like you might be using the same free register checkpoint
    algorithm
    I came up with for my simulator, which I assumed was a custom sram
    design.

    There is 1 bit for each physical register that is free.
    The checkpoint for a Bcc conditional branch copies the free bit vector,
    in your case 256 bits, to a row in the checkpoint sram.
    As each instruction retires and frees up its old dest physical register
    and it must mark the register free in *all* checkpoint contexts.

    That requires the ability to set all the free flags for a single
    register,
    which means an sram design that can write a whole row, and also set
    all the
    bits in one column, in your case set the 16 bits in each checkpoint
    for one
    of the 256 registers.

Not sure about setting bits in all checkpoints. I probably have just not
understood the issue yet. Partially terminology. There are two different
things happening: the register free/available status, which is being managed
with fifos, and the register contents valid bit. At the far end of the
pipeline, registers that were used are made free again by adding them to the
free fifo. This is somewhat inefficient because they could be freed
sooner, but that would require more logic; instead more registers are
used, since they are available from the RAM anyway.
    The register contents valid bit is cleared when a target register is assigned, and set once a value is loaded into the target register. The
    valid bit is also set for instructions that are stomped on as the old
    value is valid. When a checkpoint is restored, it restores the state of
    the valid bit along with the physical register tag. I am not
    understanding why the valid bit would need to be modified in all
    checkpoints. I would think it should reflect the pre-branch state of
    things.

    This has to do with free physical register list checkpointing and
    a particular gotcha that occurs if one tries to use a vanilla sram
    to save the free map bit vector for each checkpoint.
    It sounds like the BOOM people stepped in this gotcha at some point.

    Say a design has a bit vector indicating which physical registers are free. Rename allocates a register by using a priority selector to scan that
    vector and select a free PR to assign as a new dest PR.
    When this instruction retires, the old dest PR is freed and
    the new dest PR becomes the architectural register.
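
    For concreteness, that allocation step is a find-first-set over the
    free vector, reusing the free_now/NWORDS names from the sketch above
    (__builtin_ctz is the gcc/clang count-trailing-zeros builtin):

        /* Allocate a dest PR: priority-select the lowest free bit.
         * Returns -1 when nothing is free (Rename stalls). */
        int alloc_pr(void) {
            for (int w = 0; w < NWORDS; w++) {
                if (free_now[w]) {
                    int pr = w * 32 + __builtin_ctz(free_now[w]);
                    free_now[w] &= free_now[w] - 1;  /* clear: now in use */
                    return pr;
                }
            }
            return -1;
        }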

    When Decode sees a conditional branch Bcc it allocates a
    checkpoint in a circular buffer by incrementing the head counter,
    copies the *current* free bit vector into the new checkpoint row,
    and saves the new checkpoint index # in the Bcc uOp.
    If a branch mispredict occurs then we can restore the state at the
    Bcc by copying various state info from the Bcc checkpoint index #.
    This includes copying back the saved free vector to the current free vector. When the Bcc uOp retires we increment the circular tail counter
    to recover the checkpoint buffer row.

    The problem occurs when an old dest PR is in use so its free bit is clear
    when the checkpoint is saved. Then the instruction retires and marks the
    old dest PR as free in the bit vector. Then Bcc triggers a mispredict
    and restores the free vector that was copied when the checkpoint was saved, including the then not-free state of the PR freed after the checkpoint.
    Result: the PR is lost from the free list. After enough mispredicts you
    run out of free physical registers and hang at Rename waiting to allocate.

It needs some way to edit the checkpointed free bit vector so that,
no matter in what order PR-allocate, retire-PR-free, checkpoint save #X,
and rollback to checkpoint #Y occur, the correct free vector gets restored.
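
    Spelled out as a trace (PR numbers made up), with the column set from
    the earlier sketch as the repair:

        t0: I1 renames r3; its old dest p9 is still live -> free bit clear
        t1: Bcc saves checkpoint X                 (p9 recorded as busy)
        t2: I1 (older than Bcc) retires, frees p9  -> free_now has p9 free
        t3: Bcc mispredicts, free_now = ckpt[X]    -> p9 busy again: leaked

    With the column set in free_pr(), step t2 also sets p9's bit in
    ckpt[X], so the rollback at t3 restores p9 as free.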

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Sat Dec 23 23:19:47 2023
    EricP wrote:

    Robert Finch wrote:
    On 2023-12-22 12:49 p.m., MitchAlsup wrote:
    EricP wrote:

    Robert Finch wrote:
    Stuck on checkpoint RAM now. Everything was going good until…. I
    realized that while instructions are executing they need to be able
    to update previous checkpoints, not just the current one. Which
    checkpoint gets updated depends on which checkpoint the instruction
    falls under. It is the register valid bit that needs to be updated.
    I used a “brute force” approach to implement this and it is 40k
LUTs. This is about five times too large a solution. If I reduce the
number of checkpoints supported to four from sixteen, then the
    component is 20k LUTs. Still too large.

    The issue is there are 256 valid bits times 16 checkpoints which
    means 4096 registers. Muxing the register inputs and outputs uses a
    lot of LUTs.

One thought is to stall until all the instructions with targets in a
given checkpoint are finished executing before starting a new
    checkpoint region. It would seriously impact the CPU performance.

    I think I maybe found a solution using a block RAM and about 8k LUTs.

(I don't have a solution, just passing on some info on this particular
checkpointing issue.)

    Sounds like you might be using the same free register checkpoint
    algorithm
    I came up with for my simulator, which I assumed was a custom sram
    design.

    There is 1 bit for each physical register that is free.
The checkpoint for a Bcc conditional branch copies the free bit vector,
in your case 256 bits, to a row in the checkpoint sram.
As each instruction retires, it frees up its old dest physical register
and must mark the register free in *all* checkpoint contexts.

    That requires the ability to set all the free flags for a single
    register,
    which means an sram design that can write a whole row, and also set
    all the
    bits in one column, in your case set the 16 bits in each checkpoint
    for one
    of the 256 registers.

Not sure about setting bits in all checkpoints. I probably have just not
understood the issue yet. Partially terminology. There are two different
things happening: the register free/available status, which is being managed
    with fifos and the register contents valid bit. At the far end of the
    pipeline, registers that were used are made free again by adding to the
    free fifo. This is somewhat inefficient because they could be freed
    sooner, but it would require more logic, instead more registers are
    used, they are available from the RAM anyway.
    The register contents valid bit is cleared when a target register is
    assigned, and set once a value is loaded into the target register. The
    valid bit is also set for instructions that are stomped on as the old
    value is valid. When a checkpoint is restored, it restores the state of
    the valid bit along with the physical register tag. I am not
    understanding why the valid bit would need to be modified in all
    checkpoints. I would think it should reflect the pre-branch state of
    things.

    This has to do with free physical register list checkpointing and
    a particular gotcha that occurs if one tries to use a vanilla sram
    to save the free map bit vector for each checkpoint.
    It sounds like the BOOM people stepped in this gotcha at some point.

    Say a design has a bit vector indicating which physical registers are free. Rename allocates a register by using a priority selector to scan that
    vector and select a free PR to assign as a new dest PR.
    When this instruction retires, the old dest PR is freed and
    the new dest PR becomes the architectural register.

It is often the case that a logical register is the target of more
than one result in a single checkpoint. When this is the case, no
physical register need be allocated to the now-dead result, so we
invented a way to convey that this result is only captured from the
operand bus and was not even contemplated to be written into the
pRF. This makes the pool of free registers go further--up to 30%
further.......

    When Decode sees a conditional branch Bcc it allocates a
    checkpoint in a circular buffer by incrementing the head counter,
    copies the *current* free bit vector into the new checkpoint row,
    and saves the new checkpoint index # in the Bcc uOp.
    If a branch mispredict occurs then we can restore the state at the
    Bcc by copying various state info from the Bcc checkpoint index #.
    This includes copying back the saved free vector to the current free vector. When the Bcc uOp retires we increment the circular tail counter
    to recover the checkpoint buffer row.

    The problem occurs when an old dest PR is in use so its free bit is clear when the checkpoint is saved. Then the instruction retires and marks the
    old dest PR as free in the bit vector. Then Bcc triggers a mispredict
    and restores the free vector that was copied when the checkpoint was saved, including the then not-free state of the PR freed after the checkpoint. Result: the PR is lost from the free list. After enough mispredicts you
    run out of free physical registers and hang at Rename waiting to allocate.

    Michael Shebanow and I have a patent on that dated around 1992 (filing).
Our design could be retiring one or more checkpoints, backing up a
mispredicted branch, and issuing instructions on the alternate path; all in
    the same clock.

    It needs some way to edit the checkpointed free bit vector so that
    no matter what order of PR-allocate, retire-PR-free, checkpoint save #X,
    and rollback to checkpoint #Y, that the correct free vector gets restored.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Robert Finch on Tue Jan 9 08:23:24 2024
    Robert Finch wrote:
    Predicated logic and the PRED modifier on my mind tonight.

    I think I have discovered an interesting way to handle predicated logic.
    If a predicate is true the instruction is scheduled and executes
    normally. If the predicate is false the instruction is modified to a
    special copy operation and scheduled to execute on an ALU regardless of
    what the original execution unit would be. What makes this efficient is
    that only a single target register read port is required for the ALU
    unit versus having a target register read port for every functional
    unit. The copy mux is present in the ALU only and not in the other
    functional units. For most instructions there is no predication.

Yes, the general case is each uOp has a predicate source and a bool to test.
If the value matches the condition you execute the ON_MATCH part of the uOp;
if it does not match then execute the ON_NO_MATCH part.

    condition = True | False

    (pred == condition) ? ON_MATCH : ON_NO_MATCH;

    The ON_NO_MATCH uOp function is usually some housekeeping.
    On an in-order it might diddle the scoreboard to indicate the register
    write is done. On OoO it might copy the old dest register to new.

    Note that the source register dependencies change between match and no_match.

    if (pred == True) ADD r3 = r2 + r1

    If pred == True then it matches and the uOp is dependent on r2 and r1.
    If pred != True then it no_match and uOp is dependent on the old dest r3
    as a source to copy to the new dest r3.

    Dynamically pruning the unnecessary uOp source register dependencies
    for the alternate part can allow it to launch earlier.
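
    A sketch of that rewrite once the predicate resolves, in C (the uOp
    record layout is invented for illustration):

        enum op { OP_ADD, OP_COPY /* ... */ };

        typedef struct {
            enum op op;
            int     src[2], nsrc;     /* sources the scheduler waits on */
            int     dst_new, dst_old; /* new and previous dest PR       */
        } uop_t;

        /* On a match the uOp keeps its real sources; on a no-match it
         * degenerates to a copy of the old dest, and the r2/r1
         * dependencies are pruned so it can launch as soon as the old
         * r3 is available. */
        void resolve_pred(uop_t *u, int pred, int cond) {
            if (pred == cond) return;   /* ON_MATCH: e.g. ADD r3 = r2 + r1 */
            u->op     = OP_COPY;        /* ON_NO_MATCH: r3.new = r3.old    */
            u->src[0] = u->dst_old;
            u->nsrc   = 1;
        }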

    Also predicated LD and ST have some particular issues to think about.
    For example, under TSO a younger LD cannot bypass an older LD.
    If an older LD has an unresolved predicate then we don't know if it exists
    so we have to block the younger LD until the older predicate resolves.
    The LD's ON_NO_MATCH housekeeping might include diddling the LSQ dependency matrix to wake up any younger LD's in the LSQ that had been blocked.

    (Yes, I'm sure one could get fancier with replay traps.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Tue Jan 9 20:38:41 2024
    EricP wrote:

    Robert Finch wrote:
    Predicated logic and the PRED modifier on my mind tonight.

    I think I have discovered an interesting way to handle predicated logic.
    If a predicate is true the instruction is scheduled and executes
    normally. If the predicate is false the instruction is modified to a
    special copy operation and scheduled to execute on an ALU regardless of
    what the original execution unit would be. What makes this efficient is
    that only a single target register read port is required for the ALU
    unit versus having a target register read port for every functional
    unit. The copy mux is present in the ALU only and not in the other
    functional units. For most instructions there is no predication.

    Yes, the general case is each uOp has predicate source and a bool to test.
    If the value matches the predicate you execute the ON_MATCH part of the uOp, if it does not match then execute the ON_NO_MATCH part.

    condition = True | False

    (pred == condition) ? ON_MATCH : ON_NO_MATCH;

    The ON_NO_MATCH uOp function is usually some housekeeping.
    On an in-order it might diddle the scoreboard to indicate the register
    write is done. On OoO it might copy the old dest register to new.

    A SB handles this situation with greater elegance than a reservation station. The SB can merely clear the dependency without writing to the RF, so the
    now released reader reads the older value. {Thornton SB}

The value capturing reservation station entry has to first capture and then
ignore the delivered result (and so does the RF/RoB). {Tomasulo RS}

The Value-free RS entry is more like the SB than the Tomasulo RS.

A typical SB can be used to hold result delivery on instructions in the
shadow of a PRED to keep the data-flow mechanism from getting unkempt.
Both then-clause and else-clause can be held while the condition is
evaluating,...

    Note that the source register dependencies change between match and no_match.

    if (pred == True) ADD r3 = r2 + r1

    If pred == True then it matches and the uOp is dependent on r2 and r1.
    If pred != True then it no_match and uOp is dependent on the old dest r3
    as a source to copy to the new dest r3.

    Yes, and there can be multiple instructions in the shadow of a PRED.

    Dynamically pruning the unnecessary uOp source register dependencies
    for the alternate part can allow it to launch earlier.

    As illustrated above, no need to stall launch if you can stall result
    delivery. {A key component of the Thornton SB}

    Also predicated LD and ST have some particular issues to think about.
    For example, under TSO a younger LD cannot bypass an older LD.

    Easy:: don't do TSO <most of the time> or SC <outside of ATOMIC stuff>.

    If an older LD has an unresolved predicate then we don't know if it exists
    so we have to block the younger LD until the older predicate resolves.

This is why TSO and SC are slower than causal or weaker. Consider a memory
reorder buffer which allows generated addresses to probe the cache and
determine hit as operand data-flow permits--BUT holds onto the data and
writes (LD) or reads (ST) the RF in program order. This violates TSO
and SC, but mono-threaded codes are immune to this memory ordering problem
{and multi-threaded programs are immune except while performing ATOMIC
things}.

TSO and SC are simply slower when trying to perform memory reference
instructions in both the then-clause and in the else-clause while waiting
for the resolution of the condition--even if no results are written into RF
until after resolution.

    The LD's ON_NO_MATCH housekeeping might include diddling the LSQ dependency matrix to wake up any younger LD's in the LSQ that had been blocked.

    (Yes, I'm sure one could get fancier with replay traps.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to MitchAlsup on Wed Jan 10 23:30:12 2024
    MitchAlsup wrote:
    EricP wrote:

    If an older LD has an unresolved predicate then we don't know if it
    exists
    so we have to block the younger LD until the older predicate resolves.

This is why TSO and SC are slower than causal or weaker. Consider a memory
reorder buffer which allows generated addresses to probe the cache and
determine hit as operand data-flow permits--BUT holds onto the data and
writes (LD) or reads (ST) the RF in program order. This violates TSO
and SC, but mono-threaded codes are immune to this memory ordering problem
{and multi-threaded programs are immune except while performing ATOMIC
things}.

TSO and SC are simply slower when trying to perform memory reference
instructions in both the then-clause and in the else-clause while waiting
for the resolution of the condition--even if no results are written into RF
until after resolution.

BTW in case anyone is interested, I came across a recent paper that
compares the Apple M1 ARM processor's two memory consistency models:
the ARM weak ordering and the total store ordering (TSO) model from x86.

    "Based on various workloads, our findings indicate that TSO is,
    on average, 8.94% slower than ARM’s weaker memory ordering."

TOSTING: Investigating Total Store Ordering on ARM, 2023
https://sra.uni-hannover.de/Publications/2023/tosting-arcs23/wrenger_23_arcs.pdf
    https://www.sra.uni-hannover.de/Publications/2023/tosting-arcs23/wrenger_23_arcs_talk.pdf

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Chris M. Thomasson on Thu Jan 11 01:10:54 2024
    Chris M. Thomasson wrote:
    On 1/10/2024 8:30 PM, EricP wrote:
    MitchAlsup wrote:
    EricP wrote:

    If an older LD has an unresolved predicate then we don't know if it
    exists
    so we have to block the younger LD until the older predicate resolves.

    This is why TSO and SC are slower than causal or weaker. Consider a
    memory
    reorder buffer which allows generated addresses to probe the cache
    and determine hit as operand data-flow permits--BUT holds onto the
    data and
    writes (LD or reads (ST) to) the RF in program order. This violates TSO
    and SC but mono-threaded codes are immune to this memory ordering
    problem
    {and multi-threaded programs are immune except while performing ATIMIC
    things.}

    TSO and SC is simply slower when trying to perform memory reference
    inst-
    ructions in both the then-clause and in the else clause while waiting
    the resolution of the condition--even if no results are written into
    RF until after resolution.

    BTW in case anyone is interested I came across the recent paper that
    compares the Apple M1 ARM processors two memory consistency models:
    the ARM weak ordering and the total store ordering (TSO) model from x86.

    "Based on various workloads, our findings indicate that TSO is,
    on average, 8.94% slower than ARM’s weaker memory ordering."

    TOSTING: Investigating Total Store Ordering on ARM, 2023
    https://sra.uni-hannover.de/Publications/2023/tosting-arcs23/wrenger_23_arcs.pdf

    https://www.sra.uni-hannover.de/Publications/2023/tosting-arcs23/wrenger_23_arcs_talk.pdf

    Will read them. Thanks for the heads up Eric.

Caveat that this compares Apple's ARM weak to Apple's TSO implementation.
Because the Apple M1 has two consistency models,
if TSO is just there as a porting aid for x86 code that depends on it,
they might not have put as many bells and whistles into making it
as fast as someone who has only TSO.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Thu Jan 11 06:47:21 2024
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 1/10/2024 8:30 PM, EricP wrote:
    BTW in case anyone is interested I came across the recent paper that
    compares the Apple M1 ARM processors two memory consistency models:
the ARM weak ordering and the total store ordering (TSO) model from x86.
    "Based on various workloads, our findings indicate that TSO is,
    on average, 8.94% slower than ARM’s weaker memory ordering."

    TOSTING: Investigating Total Store Ordering on ARM, 2023
    https://sra.uni-hannover.de/Publications/2023/tosting-arcs23/wrenger_23_arcs.pdf

    https://www.sra.uni-hannover.de/Publications/2023/tosting-arcs23/wrenger_23_arcs_talk.pdf

    Thanks.

Caveat that this compares Apple's ARM weak to Apple's TSO implementation.
Because the Apple M1 has two consistency models,
    if TSO is just there as a porting aid for x86 code that depends on it,
    they might not have put as many bells and whistles into making it
    as fast as someone who has only TSO.

    Exactly. In particular, my take is that the microarchitecture can
    reorder memory accesses as much as it wants, but has to check other
    cores' memory accesses, and then roll back if the guarantees of the architecture (ideally SC, but you can also make it TSO for the sake of
    the discussion) would not be met. The costs are that the
    microarchitecture may need more buffering (slowdowns only if the
    buffers are full), and maybe (not sure) more coherence traffic, but as
    long as the resources are there, there is no slowdown in the
    non-contended case, not even when accessing shared data (where code
    with explicit or (MY66000) automatically inserted barriers is slow).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to All on Fri Jan 12 02:09:31 2024
So, TSO loses ~10% in performance.

    Sounds about right.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Robert Finch on Mon Mar 4 09:03:52 2024
    Robert Finch wrote:
    Trading off the maximum amount of contiguously addressed memory for a
    smaller PTE and PTP format. The PTE and PTPs are thus only 32-bits in
    size. A PTE has a 17-bit page number, while the PTP uses 30-bits for the
    page number. With 64kB pages this limits the system to 2^46 bytes of
memory, which is probably okay for a small system. The PTE's 17-bit page
number can only work within 8GB of contiguous memory. All the pages the
PTE table covers must be in the same 8GB memory range. The upper bits of
the translated address will come from the PTP's upper bits. This makes
memory look like a tree; groups of leaves are stuck to particular branches.
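
    The reach arithmetic as a quick check (field widths from the post;
    everything else assumed):

        #define PAGE_BITS   16   /* 64kB pages      */
        #define PTE_PN_BITS 17   /* PTE page number */
        #define PTP_PN_BITS 30   /* PTP page number */

        /* 17 + 16 = 33 address bits: a PTE can only reach within 8GB.   */
        _Static_assert(PTE_PN_BITS + PAGE_BITS == 33, "PTE reach: 2^33 = 8GB");
        /* 30 + 16 = 46 address bits: the system tops out at 2^46 bytes. */
        _Static_assert(PTP_PN_BITS + PAGE_BITS == 46, "system max: 2^46 B");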

    If I understand you correctly this means the PTE pages for
    each 8 GB range must be in a PTP located inside that 8 GB range.
    If that is ROM or IO registers in that range then there must be
    RAM in the same 8 GB range in order to map it.

    That would make modularizing components a little difficult as you
    will have to add RAM mapping modules to each 8 GB address range.

    And of course the OS memory manager has to be coded to specially
    handle the RAM for each of these mapping ranges.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Mon Mar 4 18:37:08 2024
    EricP wrote:

    Robert Finch wrote:
    Trading off the maximum amount of contiguously addressed memory for a
    smaller PTE and PTP format. The PTE and PTPs are thus only 32-bits in
    size. A PTE has a 17-bit page number, while the PTP uses 30-bits for the
    page number. With 64kB pages this limits the system to 2^46 bytes of
memory, which is probably okay for a small system. The PTE's 17-bit page
    number can only work with 8GB of contiguous memory. All the pages the
    PTE table covers must be in the same 8GB memory range. The upper bits of
the translated address will come from the PTP's upper bits. This makes
memory look like a tree; groups of leaves are stuck to particular branches.

    If I understand you correctly this means the PTE pages for
    each 8 GB range must be in a PTP located inside that 8 GB range.
    If that is ROM or IO registers in that range then there must be
    RAM in the same 8 GB range in order to map it.

Consider a PTE mapping ROM !! How does it get set ??

    That would make modularizing components a little difficult as you
    will have to add RAM mapping modules to each 8 GB address range.

    You may even have to create a way to map PTE elements into areas
    that have no RAM, creating more overhead and complexity.

    And of course the OS memory manager has to be coded to specially
    handle the RAM for each of these mapping ranges.

    How would you map a DRAM DIMM that contained only 2GB of RAM ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Robert Finch on Tue Mar 5 14:49:53 2024
    Robert Finch <robfi680@gmail.com> writes:
    On 2024-03-04 9:03 a.m., EricP wrote:

    If I understand you correctly this means the PTE pages for
    each 8 GB range must be in a PTP located inside that 8 GB range.
    If that is ROM or IO registers in that range then there must be
    RAM in the same 8 GB range in order to map it.

    That would make modularizing components a little difficult as you
    will have to add RAM mapping modules to each 8 GB address range.

    And of course the OS memory manager has to be coded to specially
    handle the RAM for each of these mapping ranges.



    Yes, the above is what I was thinking.

    There is a scratchpad RAM in the ROM address space, used for
    bootstrapping. RAM access is needed during the boot process before
everything is set up and the DRAM is accessible. So, it is possible to
    map in that manner.

    It's not uncommon to use the LLC as a scratchpad during
    DRAM controller initialization...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Robert Finch on Tue Mar 5 15:37:13 2024
    Robert Finch wrote:

    Bigfoot pages are 1MB in size. That allows an entire 64GB address range
    to be mapped with one page table. With larger memory systems a larger
page size is needed IMO. 64GB is still 65,536 pages when the pages are
1MB in size. There is 32GB of RAM in my machine today. Tomorrow it will be 64GB.

    Above a certain point the added latency of filling/spilling a page to DISK
    that is 64KB in size rather than 4KB in size outweighs the gain in the TLB.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Mar 5 16:20:37 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Robert Finch wrote:

    Bigfoot pages are 1MB in size. That allows an entire 64GB address range
    to be mapped with one page table. With larger memory systems a larger
    page size is needed IMO. 64GB is 65,536 pages still when the pages are
    1MB in size. There is 32GB RAM in my machine today. Tomorrow it will 64GB.

Above a certain point the added latency of filling/spilling a page to DISK
that is 64KB in size rather than 4KB in size outweighs the gain in the TLB.

Although that's not as big a concern today, given NVMe (or Optane).

    Intel, AMD and ARM already support large page sizes (1GB for Intel/AMD
    and 512GB for ARM64).

    Mixing page sizes in a single operating system is tricky, as the
    physical regions backing the large page sizes need to be page-size aligned,
and supporting multiple page sizes leads to checkerboarding
    or the need to reassign pages when making a large page allocation
    (linux can preallocate at boot, or use THP).

Using a single large page size to accommodate a small number of applications
will waste memory for the large number of small memory applications
    (e.g. most unix/linux commands).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Tue Mar 5 08:40:36 2024
    On 3/5/2024 7:37 AM, MitchAlsup1 wrote:
    Robert Finch wrote:

    Bigfoot pages are 1MB in size. That allows an entire 64GB address
    range to be mapped with one page table. With larger memory systems a
    larger page size is needed IMO. 64GB is 65,536 pages still when the
    pages are 1MB in size. There is 32GB RAM in my machine today. Tomorrow
    it will 64GB.

    Above a certain point the added latency of filling/spilling a page to DISK that is 64KB in size rather than 4KB in size outweighs the gain in the TLB.

    Sure. But the 4K size was first used when disk transfer rates were
    about 3 MB/sec. Today, they are many times that, even if you are using
    a single physical hard disk. RAID and SSD can make that even larger. I
    don't know what the "optimal" page size is, but it is certainly larger
    than 4KB.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Tue Mar 5 17:32:16 2024
    Stephen Fuld wrote:

    On 3/5/2024 7:37 AM, MitchAlsup1 wrote:
    Robert Finch wrote:

    Bigfoot pages are 1MB in size. That allows an entire 64GB address
    range to be mapped with one page table. With larger memory systems a
    larger page size is needed IMO. 64GB is 65,536 pages still when the
    pages are 1MB in size. There is 32GB RAM in my machine today. Tomorrow
    it will 64GB.

Above a certain point the added latency of filling/spilling a page to DISK
that is 64KB in size rather than 4KB in size outweighs the gain in the TLB.

    Sure. But the 4K size was first used when disk transfer rates were
    about 3 MB/sec. Today, they are many times that, even if you are using
    a single physical hard disk. RAID and SSD can make that even larger. I don't know what the "optimal" page size is, but it is certainly larger
    than 4KB.

    While the data transfer rates are far higher today, the disk latency has
    not kept pace with CPU performance {SSD is different}.

    The CPU went from 60ns (16 MHz: 360/67) to 200ps (5GHz) 300× !!
    And from 5 cycles per instruction to 3 instruction per cycle 15×
    for a combined gain of 4,500×

    Disk latency may have gone down from 8ms to 3ms (sometimes 2ms for
    smaller disks) 3×-4×

    {{I consider 360/67 as the beginning of paging; although Multics may be
    the beginning.}}


    I, also, believe that 4KB pages are a bit small for a new architecture.
    I chose 8KB for My 66000 more because it took 1 level out of the page
    tables supporting 64-bit VAS, and also made the paging hierarchy more
    easy to remember:: 8K, 8M, 8G, 8T, 8P, 8E is easier to remember than
    4K, 2M, 1G, ½T, ¼P, ⅛E.

    I also believe in the tension between pages that are too small and those
    that are too large. 256B is widely seen as too small (VAX). I think most
people are comfortable in the 4KB range. I think 64KB is too big since
something like cat needs less code, less data, and less stack space than
a single 64KB page, plus another 64KB page to map cat from its VAS. So now,
instead of four* 4KB pages (16KB = {code, data, stack, map}) we need
four 64KB pages (256KB). It is these small applications that drive the
minimum page size down.

    But with desktops coming with 64GB+ of DRAM, perhaps it matters less now.

    (*) 5 if you separate .bss from .data
    6 if you separate .rodata from .bss and .data

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Mar 5 18:04:04 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Stephen Fuld wrote:

    On 3/5/2024 7:37 AM, MitchAlsup1 wrote:
    Robert Finch wrote:

    Bigfoot pages are 1MB in size. That allows an entire 64GB address
    range to be mapped with one page table. With larger memory systems a
    larger page size is needed IMO. 64GB is 65,536 pages still when the
pages are 1MB in size. There is 32GB RAM in my machine today. Tomorrow
it will 64GB.

Above a certain point the added latency of filling/spilling a page to DISK
that is 64KB in size rather than 4KB in size outweighs the gain in the TLB.

    Sure. But the 4K size was first used when disk transfer rates were
    about 3 MB/sec. Today, they are many times that, even if you are using
    a single physical hard disk. RAID and SSD can make that even larger. I
    don't know what the "optimal" page size is, but it is certainly larger
    than 4KB.

    While the data transfer rates are far higher today, the disk latency has
    not kept pace with CPU performance {SSD is different}.

    The CPU went from 60ns (16 MHz: 360/67) to 200ps (5GHz) 300× !!
    And from 5 cycles per instruction to 3 instruction per cycle 15×
    for a combined gain of 4,500×

    Disk latency may have gone down from 8ms to 3ms (sometimes 2ms for
    smaller disks) 3×-4×

    NVMe over a low latency fabric has 10us end-to-end
    latency.


    {{I consider 360/67 as the beginning of paging; although Multics may be
    the beginning.}}


    I, also, believe that 4KB pages are a bit small for a new architecture.
    I chose 8KB for My 66000 more because it took 1 level out of the page
    tables supporting 64-bit VAS, and also made the paging hierarchy more
    easy to remember:: 8K, 8M, 8G, 8T, 8P, 8E is easier to remember than
    4K, 2M, 1G, ½T, ¼P, ⅛E.

    I also believe in the tension between pages that are too small and those
    that are too large. 256B is widely seen as too small (VAX).

    The VAX page size was 512 bytes, which matched the sector size.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Stephen Fuld on Tue Mar 5 18:10:01 2024
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 3/5/2024 7:37 AM, MitchAlsup1 wrote:
    Robert Finch wrote:

    Bigfoot pages are 1MB in size. That allows an entire 64GB address
    range to be mapped with one page table. With larger memory systems a
    larger page size is needed IMO. 64GB is 65,536 pages still when the
    pages are 1MB in size. There is 32GB RAM in my machine today. Tomorrow
    it will be 64GB.

    Above a certain point the added latency of filling/spilling a page to DISK
    that is 64KB in size rather than 4KB in size outweighs the gain in the TLB.

    Sure. But the 4K size was first used when disk transfer rates were
    about 3 MB/sec. Today, they are many times that, even if you are using
    a single physical hard disk. RAID and SSD can make that even larger. I
    don't know what the "optimal" page size is, but it is certainly larger
    than 4KB.

    If paging out and HDDs were still relevant, a good page size would be
    about 3MB (where the transfer time is similar to the seek time). But
    they are not.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Tue Mar 5 18:14:48 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    256B is widely seen as too small (VAX).

    Didn't VAX use 512B?

    I think most
    people are comfortable in the 4KB range. I think 64KB is too big since
    something like cat needs less code, less data, and less stack space than
    1 single 64KB page and another 64KB page to map cat from its VAS. So now,
    instead of four* 4KB pages (16KB = {code, data, stack, map}) we now need
    four 64KB pages (256KB).

    Yes, so what? Who cares if cat takes 16KB or 256KB when we have
    Gigabytes of RAM?

    A while ago <2020Oct9.190337@mips.complang.tuwien.ac.at> I posted
    numbers here about the sizes of VMAs in the processes on several Linux
    systems and how much extra space would be needed from larger pages:

    |  VMAs  unique       used      total    8KB    16KB    32KB     64KB
    |  7552    2333     555964    1033320   6704   22344   56344   125144  desktop
    | 82836   25276    5346060   15707448  76072  223000  514472  1113672  laptop
    | 47017   15425  105490636   60186068  40804  134492  319852   708588  server

    The numbers in the 8KB, 16KB, 32KB, 64KB columns estimate how much
    extra RAM would be needed if the pages were that large. So, 1.1GB
    extra for 64KB pages on the laptop, but 8KB and 16KB pages would be
    relatively cheap.
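
    (I do not know exactly how those columns were computed, but a plausible
    reconstruction is per-VMA round-up waste relative to 4KB pages; a C
    sketch, with made-up VMA lengths:)

        #include <stdio.h>

        static unsigned long round_up(unsigned long n, unsigned long p)
        {
            return (n + p - 1) / p * p;
        }

        /* Extra KB a page size of page_kb costs over 4KB pages,
           summed over all (unique) VMA lengths, given in KB. */
        static unsigned long extra_kb(const unsigned long *vma_kb, int n,
                                      unsigned long page_kb)
        {
            unsigned long extra = 0;
            for (int i = 0; i < n; i++)
                extra += round_up(vma_kb[i], page_kb)
                       - round_up(vma_kb[i], 4);
            return extra;
        }

        int main(void)
        {
            unsigned long vma_kb[4] = { 4, 132, 8, 2048 };  /* invented */
            unsigned long sizes[4] = { 8, 16, 32, 64 };
            for (int s = 0; s < 4; s++)
                printf("%2luKB pages: +%lu KB\n",
                       sizes[s], extra_kb(vma_kb, 4, sizes[s]));
            return 0;
        }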

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Tue Mar 5 19:08:34 2024
    Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    256B is widely seen as too small (VAX).

    Didn't VAX use 512B?

    It has been a long time.

    I think most
    people are comfortable in the 4KB range. I think 64KB is too big since
    something like cat needs less code, less data, and less stack space than
    1 single 64KB page and another 64KB page to map cat from its VAS. So now,
    instead of four* 4KB pages (16KB = {code, data, stack, map}) we now need
    four 64KB pages (256KB).

    Yes, so what? Who cares if cat takes 16KB or 256KB when we have
    Gigabytes of RAM?

    A while ago <2020Oct9.190337@mips.complang.tuwien.ac.at> I posted
    numbers here about the sizes of VMAs in the processes on several Linux systems and how much extra space would be needed from larger pages:

    |  VMAs  unique       used      total    8KB    16KB    32KB     64KB
    |  7552    2333     555964    1033320   6704   22344   56344   125144  desktop
    | 82836   25276    5346060   15707448  76072  223000  514472  1113672  laptop
    | 47017   15425  105490636   60186068  40804  134492  319852   708588  server

    Are the 8K numbers to be compared to unique ? used ? or total ? to estimate waste ??

    The numbers in the 8KB, 16KB, 32KB, 64KB columns estimate how much
    extra RAM would be needed if the pages were that large. So, 1.1GB
    extra for 64KB pages on the laptop, but 8KB and 16KB pages would be relatively cheap.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Tue Mar 5 10:32:03 2024
    On 3/5/2024 9:32 AM, MitchAlsup1 wrote:
    Stephen Fuld wrote:

    On 3/5/2024 7:37 AM, MitchAlsup1 wrote:
    Robert Finch wrote:

    Bigfoot pages are 1MB in size. That allows an entire 64GB address
    range to be mapped with one page table. With larger memory systems a
    larger page size is needed IMO. 64GB is 65,536 pages still when the
    pages are 1MB in size. There is 32GB RAM in my machine today.
    Tomorrow it will be 64GB.

    Above a certain point the added latency of filling/spilling a page to DISK
    that is 64KB in size rather than 4KB in size outweighs the gain in the TLB.

    Sure.  But the 4K size was first used when disk transfer rates were
    about 3 MB/sec.  Today, they are many times that, even if you are
    using a single physical hard disk.  RAID and SSD can make that even
    larger.  I don't know what the "optimal" page size is, but it is
    certainly larger than 4KB.

    While the data transfer rates are far higher today, the disk latency has
    not kept pace with CPU performance {SSD is different}.

    Sure, but the latency is the same no matter what the transfer size is.
    So the fact that latency improvements haven't kept pace is irrelevant to
    the question of optimal page size.


    The CPU went from 60ns (16 MHz: 360/67) to 200ps (5GHz) 300× !!
    And from 5 cycles per instruction to 3 instructions per cycle 15×
    for a combined gain of 4,500×

    Disk latency may have gone down from 8ms to 3ms (sometimes 2ms for
    smaller disks) 3×-4×

    {{I consider 360/67 as the beginning of paging; although Multics may be
    the beginning.}}

    Agreed.


    I, also, believe that 4KB pages are a bit small for a new architecture.
    I chose 8KB for My 66000 more because it took 1 level out of the page
    tables supporting 64-bit VAS, and also made the paging hierarchy more
    easy to remember:: 8K, 8M, 8G, 8T, 8P, 8E is easier to remember than
    4K, 2M, 1G, ½T, ¼P, ⅛E.

    Fair enough. When Unisys implemented paging in the 2200 Series in the
    1990s, they chose 16K (approximately - exactly if you consider 36 equals
    32!).


    I also believe in the tension between pages that are too small and those
    that are too large.

    Naturally.


    256B is widely seen as too small (VAX). I think most
    people are comfortable in the 4KB range.

    While true, how much of that is just what they are used to, as
    opposed to some kind of optimal?


    I think 64KB is too big since
    something like cat needs less code, less data, and less stack space than
    1 single 64KB page and another 64KB page to map cat from its VAS. So now,
    instead of four* 4KB pages (16KB = {code, data, stack, map}) we now need
    four 64KB pages (256KB). It is these small applications that drive the
    minimum page size down.

    Agreed. And large applications tend to drive the optimum page size up.
    And I think that small applications like cat tend to be quick, thus never
    swapped, and only "waste" memory for a short amount of time. On the
    other hand, larger page sizes cause a larger memory "waste" for all applications.

    So the optimum is, at least to some degree, usage dependent. Of course,
    this is all an argument for multiple page sizes.


    But with desktops coming with 64GB+ of DRAM, perhaps it matters less now.

    I think larger main memories argue for larger page sizes, both because
    the "waste" costs less, and larger memories require more pages and thus, perhaps a larger TLB.

    As with most such things, there is a tradeoff, and the optimum
    probably changes as technology changes.



    (*) 5 if you separate .bss from .data
       6 if you separate .rodata from .bss and .data

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Tue Mar 5 19:16:57 2024
    Stephen Fuld wrote:

    On 3/5/2024 9:32 AM, MitchAlsup1 wrote:
    Stephen Fuld wrote:

    On 3/5/2024 7:37 AM, MitchAlsup1 wrote:
    Robert Finch wrote:

    Bigfoot pages are 1MB in size. That allows an entire 64GB address
    range to be mapped with one page table. With larger memory systems a
    larger page size is needed IMO. 64GB is 65,536 pages still when the
    pages are 1MB in size. There is 32GB RAM in my machine today.
    Tomorrow it will be 64GB.

    Above a certain point the added latency of filling/spilling a page to DISK
    that is 64KB in size rather than 4KB in size outweighs the gain in the TLB.

    Sure.  But the 4K size was first used when disk transfer rates were
    about 3 MB/sec.  Today, they are many times that, even if you are
    using a single physical hard disk.  RAID and SSD can make that even
    larger.  I don't know what the "optimal" page size is, but it is
    certainly larger than 4KB.

    While the data transfer rates are far higher today, the disk latency has
    not kept pace with CPU performance {SSD is different}.

    Sure, but the latency is the same no matter what the transfer size is.
    So the fact that latency improvements haven't kept pace is irrelevant to
    the question of optimal page size.


    The CPU went from 60ns (16 MHz: 360/67) to 200ps (5GHz) 300× !!
    And from 5 cycles per instruction to 3 instructions per cycle 15×
    for a combined gain of 4,500×

    Disk latency may have gone down from 8ms to 3ms (sometimes 2ms for
    smaller disks) 3×-4×

    {{I consider 360/67 as the beginning of paging; although Multics may be
    the beginning.}}

    Agreed.


    I, also, believe that 4KB pages are a bit small for a new architecture.
    I chose 8KB for My 66000 more because it took 1 level out of the page
    tables supporting 64-bit VAS, and also made the paging hierarchy more
    easy to remember:: 8K, 8M, 8G, 8T, 8P, 8E is easier to remember than
    4K, 2M, 1G, ½T, ¼P, ⅛E.

    Fair enough. When Unisys implemented paging in the 2200 Series in the
    1990s, they chose 16K (approximately - exactly if you consider 36 equals 32!).


    I also believe in the tension between pages that are too small and those
    that are too large.

    Naturally.


    256B is widely seen as too small (VAX). I think most
    people are comfortable in the 4KB range.

    While true, how much of that is just what they are used to, as opposed to
    some kind of optimal?

    When paging first came about, OS people told us that they really wanted
    ~1M pages (this was 1981-ish). 1M was enough to juggle the then workloads
    at least somewhat efficiently. This corresponded rather well with the then 32-bit address spaces.

    Workloads are now bigger (heck, tabs in Chrome tend to be ~1GB in size).

    I think 64KB is too big since
    something like cat needs less code, less data, and less stack space than
    1 single 64KB page and another 64KB page to map cat from its VAS. So now,
    instead of four* 4KB pages (16KB = {code, data, stack, map}) we now need
    four 64KB pages (256KB). It is these small applications that drive the
    minimum page size down.

    Agreed. And large applications tend to drive the optimum page size up.
    And I think that small applications like cat tend to be quick, thus never
    swapped, and only "waste" memory for a short amount of time. On the
    other hand, larger page sizes cause a larger memory "waste" for all applications.

    So the optimum is, at least to some degree, usage dependent. Of course,
    this is all an argument for multiple page sizes.

    Which My 66000 provides, but it also provides big pages that can have an
    extent [0..extent-1] so you can use an 8GB page to map anything from 8KB
    through 8GB in a single PTE. The extent-bits are exactly the bits not
    being used as PA-bits.
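
    (My reading of the extent scheme, as a C sketch; the struct layout and
    names are invented for illustration, not taken from any My 66000
    document:)

        #include <stdbool.h>
        #include <stdint.h>

        /* For an 8GB-aligned page, PA bits [32:13] are not needed to hold
           the base address, so they can hold an extent in 8KB units. */
        typedef struct {
            uint64_t pa_base;   /* 8GB-aligned physical base            */
            uint32_t extent;    /* valid length in 8KB units; 0 => full */
        } big_pte;

        static bool translate(const big_pte *pte, uint64_t offset,
                              uint64_t *pa)
        {
            uint64_t limit = pte->extent ? (uint64_t)pte->extent << 13
                                         : (uint64_t)1 << 33;  /* 8GB */
            if (offset >= limit)
                return false;          /* beyond [0..extent-1]: fault */
            *pa = pte->pa_base + offset;
            return true;
        }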

    But with desktops coming with 64GB+ of DRAM, perhaps it matters less now.

    I think larger main memories argue for larger page sizes, both because
    the "waste" costs less, and larger memories require more pages and thus, perhaps a larger TLB.

    As with most such things, there is a tradeoff, and the optimum
    probably changes as technology changes.

    Given you want a single page size spanning cell-phones to multi-rack
    servers, finding something "optimal" is difficult at best.

    (*) 5 if you separate .bss from .data
       6 if you separate .rodata from .bss and .data

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Tue Mar 5 19:18:41 2024
    Stephen Fuld wrote:

    Fair enough. When Unisys implemented paging in the 2200 Series in the
    1990s, they chose 16K (approximately - exactly if you consider 36 equals 32!).

    36 base 9 is 33, which is close enough to 32 to be considered equal.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Robert Finch on Tue Mar 5 21:18:48 2024
    Robert Finch <robfi680@gmail.com> writes:
    On 2024-03-05 1:13 p.m., BGB wrote:


    Another possible trick was mapping these to ROM zero page when reloaded,
    and only "reviving" them as actual swap-space pages, when something was
    written to them. Since the page-table was also partly used to track
    pages in the pagefile, there needed to be special handling in the TLB
    miss handler to signal "yeah, this page is really a zeroes-only page".

    Say, page states:
      Invalid / unassigned;
      Valid / assigned;
      Invalid / mapped to pagefile;
        Page is swapped out.
      Valid / mapped to pagefile;
        Page is zeroed.

    Though, potentially, any hardware page-walker would need to be aware of
    the zero-page trick (vs, say, trying to map the page to an invalid
    physical address).
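
    (The state machine above, as a C enum a software TLB-miss or fault
    handler might switch on; all names are illustrative:)

        /* Page states as described above. */
        enum page_state {
            PG_INVALID,   /* invalid / unassigned                       */
            PG_VALID,     /* valid / assigned to a physical frame       */
            PG_SWAPPED,   /* invalid / mapped to pagefile: swapped out  */
            PG_ZERO       /* valid / mapped to pagefile: all-zero page,
                             no frame yet, revived on first write       */
        };

        /* On a write fault to a zero page: give it a real frame and
           promote it, instead of reading anything back from disk. */
        static void on_write_fault(enum page_state *st)
        {
            if (*st == PG_ZERO) {
                /* frame = alloc_zeroed_frame(); install PTE; (not shown) */
                *st = PG_VALID;
            }
        }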

    ...

    What is LLC? (Local Lan controller?)


    Last Level Cache (e.g. L3).


    Intel, AMD and ARM already support large page sizes (1GB for Intel/AMD
    and 512GB for ARM64).

    Do they support using the entire page as a page table?

    All multilevel page tables use an entire page (granule in ARMv8)
    at each level of the page table. To map larger page/block sizes,
    they just stop the table walk at one, two or three levels rather
    than walking all four levels to the leaf page size.
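
    (A minimal walker sketch showing the early-out: a generic 4-level,
    4KB-granule layout with 9 index bits per level, not any particular
    architecture's bit encoding; read_phys() is assumed to be supplied by
    the platform:)

        #include <stdbool.h>
        #include <stdint.h>

        extern uint64_t read_phys(uint64_t pa);   /* platform-provided */

        #define PTE_VALID 1u    /* entry maps something               */
        #define PTE_TABLE 2u    /* entry points at a next-level table */

        /* If an entry is VALID but not TABLE, the walk stops there and
           the entry maps a 4KB, 2MB, 1GB or 512GB block by level. */
        static bool walk(uint64_t root, uint64_t va, uint64_t *pa)
        {
            uint64_t table = root;
            for (int level = 3; level >= 0; level--) {
                int shift = 12 + 9 * level;
                uint64_t pte = read_phys(table + ((va >> shift) & 0x1ff) * 8);
                if (!(pte & PTE_VALID))
                    return false;                     /* translation fault */
                if (!(pte & PTE_TABLE) || level == 0) {   /* leaf: stop   */
                    uint64_t mask = ((uint64_t)1 << shift) - 1;
                    *pa = (pte & ~0xfffull & ~mask) | (va & mask);
                    return true;
                }
                table = pte & ~0xfffull;              /* descend a level  */
            }
            return false;
        }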


    Compression could be useful for something serialized to disk or through
    the network. Transferring the page number and compressed contents.

    Compressing tiered DRAM is looking to be the next big thing.

    https://www.zeropoint-tech.com/news/meta-and-google-unveil-a-revolutionary-hardware-compressed-cxl-memory-tier-specification

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Tue Mar 5 22:40:54 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Anton Ertl wrote:
    A while ago <2020Oct9.190337@mips.complang.tuwien.ac.at> I posted
    numbers here about the sizes of VMAs in the processes on several Linux
    systems and how much extra space would be needed from larger pages:

    |  VMAs  unique       used      total    8KB    16KB    32KB     64KB
    |  7552    2333     555964    1033320   6704   22344   56344   125144  desktop
    | 82836   25276    5346060   15707448  76072  223000  514472  1113672  laptop
    | 47017   15425  105490636   60186068  40804  134492  319852   708588  server

    Are the 8K numbers to be compared to unique ? used ? or total ? to
    estimate waste ??

    "unique" is the number of unique VMAs (so compare "unique" to "VMAs").

    "used" is the number of KB used reported by free. "total" is the
    number of KB in the unique VMAs. "total" can be larger than "used"
    because of copy-on-write (in particular, pages that are allocated and
    have not been written to yet). I don't know why the server workload
    gets a "used" number that's larger than the "total" number.

    The waste numbers (8KB-64KB) should be compared to total memory (16GB
    on the desktop, 8GB for the laptop, 128GB for the server).

    You can also compare it to the "total" numbers; both the waste and the
    "total" numbers are based on the same VMA data.

    You could also compare it to "used", but given that "used" reflects
    actual usage rather than just VMA size, a better approach would be to
    know which pages of each VMA are used, and base the estimate on that. Unfortunately, I don't know how to easily extract such numbers from
    the Linux kernel.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Mar 6 00:48:08 2024
    Scott Lurndal wrote:

    Robert Finch <robfi680@gmail.com> writes:


    Do they support using the entire page as a page table?

    All multilevel page tables use an entire page (granule in ARMv8)
    at each level of the page table. To map larger page/block sizes,
    they just stop the table walk at one, two or three levels rather
    than walking all four levels to the leaf page size.

    My 66000 page structure supports both skipping of levels in the table
    and stopping at any level in the table.


    Compression could be useful for something serialized to disk or through
    the network. Transferring the page number and compressed contents.

    Compressing tiered DRAM is looking to be the next big thing.

    https://www.zeropoint-tech.com/news/meta-and-google-unveil-a-revolutionary-hardware-compressed-cxl-memory-tier-specification

    CXL is just* an offload of DRAM from an on-die memory controller to
    PCIe 5.0-6.0 high speed links. What this means in practice is
    that one can put as many DRAM controllers on as many PCIe links as
    the chip provides. In PCIe 6.0, a link (4-wires; 2 in 2 out) trans-
    mits up to 64 GT/s compared to DDR6 at 22 Gb/s (PCIe is 3× faster),
    but you can trade width of the interface for number of independent
    interfaces. And with external companies making CXL DRAM controllers
    the chip/system designer can dispense with the DRC and spend more
    pins for PCIe BW.

    So, in the past when the on-die DRC only allowed for 2,3,4,6,8 DIMMs;
    with CXL DRC you can have as many DRAM DIMMs as makes sense for your
    target market--more DIMMs for costly servers, fewer pins for lower
    cost devices.

    Compression is icing on the cake.

    (*) it also allows for a lot of other new PCIe functionality.

    An interesting topic between PCIe 5.0 and 6.0 is the change from
    NRZ encoding to PAM4. This comes with a degradation of error rate
    from 10^-12 down to 10^-6, so you are going to need some kind
    of ECC on the PCIe links and layers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Wed Mar 6 03:45:12 2024
    BGB wrote:

    On 3/5/2024 3:07 PM, Robert Finch wrote:
    On 2024-03-05 1:13 p.m., BGB wrote:

    What is LLC? (Local Lan controller?) I used the text mode video display
    RAM in the past.

    Intel, AMD and ARM already support large page sizes (1GB for Intel/AMD
    and 512GB for ARM64).

    Do they support using the entire page as a page table?


    I think it is a case of skipping the last N levels.

    Should be able to skip any level. Root could point at a page containing
    only 8K translations, or it could point to the top of a tree supporting
    a 63-bit VAS. Each PTP (of which Root is one) can point at any further
    layer, skipping any number of intermediate non-needed levels.

    That is how one makes a 1 page "map" for tiny things like cat:: Root
    points at a page containing 8KB translations, of which only 4-6 have
    the valid bit set.

    Where, say, with 4K pages and 64-bit PTEs:
    4K: No Skip
    2MB: 1-level skip
    1GB: 2-level skip.

    One can have a root pointing at a 63-bit VAS, one PTP points at 13-bit
    VAS translations, another pointing at a 43-bit VAS which contains pointers
    to 23-bit VAS sub-spaces. So, you can skip any number of levels at each level.

    Seemingly, Windows had used 64K logical pages, but these were likely
    faked in software.

    MIPS did not allow aliasing at smaller granularities than 64KB due to
    their R2000/R3000 SRAM cache structure.

    Not entirely sure the reason for them doing so.


    Compression could be useful for something serialized to disk or through
    the network. Transferring the page number and compressed contents.

    Funny story: I was hired to look into a person marketing an FFT-based
    compression algorithm for a company running semiconductor testers. I
    looked at his algorithms and looked at the vector data sets. It turned
    out that if one does NRZ encoding vertically over the records, one gets
    98%-99% compression without any "interesting algorithms". It worked so
    well that they started to use it to decrease disk load time of the
    vector set--converting a 1 minute problem into something under 1 second.

    The records (cycles) were as wide as the number of signal pins on the part-under-test (several hundred bits wide) with a 3-state code {high,
    low, high-Z}. So the reset pin was asserted for 10-odd cycles, and then
    changed to 0 and stayed there for 1M-odd cycles. Pretty easy to compress
    stuff like this.

    My bet is that many data structures would encode rather well using this technique--for example pointers all having the same 24-HoBs being 0 (user)
    or 1 (super).
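
    (My reconstruction of the trick, not the original code: store each pin
    column as change-points instead of one sample per cycle. A toy C
    version:)

        #include <stdio.h>

        struct change { unsigned cycle; char value; };

        /* A pin that toggles twice in a million cycles costs two change
           records instead of a million samples. */
        static unsigned compress_pin(const char (*vec)[5], unsigned cycles,
                                     unsigned pin, struct change *out)
        {
            unsigned n = 0;
            char prev = 0;
            for (unsigned c = 0; c < cycles; c++) {
                if (vec[c][pin] != prev) {
                    out[n].cycle = c;
                    out[n].value = prev = vec[c][pin];
                    n++;
                }
            }
            return n;     /* usually << cycles */
        }

        int main(void)
        {
            /* 6 cycles x 4 pins; pin 0 plays the "reset" pin's role */
            const char vec[6][5] =
                { "1000", "1000", "1000", "0010", "0010", "0011" };
            struct change ch[6];
            unsigned n = compress_pin(vec, 6, 0, ch);
            for (unsigned i = 0; i < n; i++)
                printf("cycle %u -> %c\n", ch[i].cycle, ch[i].value);
            return 0;     /* two records for six cycles */
        }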

    ..

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Mar 6 14:25:15 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:

    Robert Finch <robfi680@gmail.com> writes:


    Do they support using the entire page as a page table?

    All multilevel page tables use an entire page (granule in ARMv8)
    at each level of the page table. To map larger page/block sizes,
    they just stop the table walk at one, two or three levels rather
    than walking all four levels to the leaf page size.

    My 66000 page structure supports both skipping of levels in the table
    and stopping at any level in the table.

    That's all well and good. What operating system do you
    have running on the My 66000 processor? When will I be able
    to purchase a system based on that processor?



    Compression could be useful for something serialized to disk or through
    the network. Transferring the page number and compressed contents.

    Compressing tiered DRAM is looking to be the next big thing.

    https://www.zeropoint-tech.com/news/meta-and-google-unveil-a-revolutionary-hardware-compressed-cxl-memory-tier-specification

    CXL is just* an offload of DRAM from an on-die memory controller to
    an on-PCIe 5.0-6.0 high speed links.

    Yes, I'm well aware of that. What you don't mention is that it
    can become part of the processor cache coherency domain.


    An interesting topic between PCIE 5.0 and 6.0 is the change from
    NRZ encoding to PAM 4. This comes with a degradation of error rate
    from 10^-12 goes down to 10^-6, so you are going to need some kind
    of ECC on the PCIe links and layers.

    PAM4 is something with which we have a great deal of expertise,
    along with the associated error correction.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Thu Mar 7 01:28:46 2024
    BGB wrote:

    On 3/6/2024 8:42 AM, Robert Finch wrote:



    In my case, access is figured out on cache-line fetch, and is precooked:
    NR, NW, NX, NU, NC: NoRead/NoWrite/NoExecute/NoUser/NoCache.
    Though, some combinations of these flags are special.

    Is there a reason these flags (other than user) are inverted ??
    {{And even noUser can be changed into Super.}}

    In addition, I think you will want to be able to specify which level of
    cache {L1, L2, LLC} this line is stored at, prefetched to, and pushed out
    to.

    My 66000 is using ASID instead of something like Super/Global because I
    don't want to have to flush the TLB on a hypervisor context switch --
    where one GuestOS Super/Global is not the same as another GuestOS's. When
    a GuestOS is accessing one of its user applications, AGEN automagically
    uses the application ASID instead of the GuestOS ASID. {{Similar for HV
    accessing GuestOS -- while switching from 1-level translation to 2-level.}}

    <snip>

    The L1 cache only hits if the current mode matches the mode that was in effect at the time the cache-line was fetched, and if KRR has not
    changed (as determined by a hash value), ...

    s/mode/ASID/

    For my system the ACL is not part of the PTE, it is part of the software
    managed page information, along with share counts. I do not see the ACL
    for a page being different depending on the page table.


    In my case, ACL handling is done via a combination of keyring register
    (KRR), and a small fully-associative cache (4 entry at present, 6 could
    be better in theory; luckily each entry is comparably small).

    The ACLID is tied to the TLBE, so the intersection of the ACLID and KRR
    entry are used to figure out access in the ACL cache (or,
    ignored/disabled if the low 16 bits of KRR are 0).


    I have dedicated some of the block RAMs for the page management
    information, so they may be read out in parallel with a memory access.
    So I shifted the block RAM usage from the TLB to the PMT. This makes the
    TLB smaller. It also reduces the memory usage. The page management
    information only needs one copy for each page of memory. If the
    information were in the TLBE / PTEs there would be multiple copies of
    the information in the page tables. How do you keep things coherent if
    there are multiple copies in page tables?



    The access ID for pages is kept in sync with the memory address, since
    both are uploaded to the TLB at the same time.

    However, as for ACL checks themselves, these are handled with a separate cache. So, say, changing the access to an ACLID, and flushing the corresponding entry from the ACL cache, will automatically apply to any
    pages previously loaded into the TLB.

    There was also the older VUGID system, which used traditional Unix-style
    permissions. If I were designing it now, I would likely design things
    around using exclusively ACL checking, which (ironically) also needs
    less bits to encode.



    Generally, software TLB miss handling is used in my case.

    There is no automatic way to keep the TLB in sync with the page table
    (if the page table entry is modified).

    My 66000 has a coherent TLB.

    Usual thing is that if the current page table is updated, then one needs
    to forge a special dummy entry, and then upload this entry to the TLB multiple times (via the LDTLB instruction) to knock the prior contents
    out of the TLB (or use the INVTLB instruction, but this currently
    invalidates the entire TLB; which is a bad situation for
    software-managed TLB...).

    See how much easier a coherent TLB is ??

    Generally, the assumption is that all pages in a mapping will have the
    same ACLID (generally corresponding to the "owner" of the mapping).

    An unsupported assumption if one wants to keep TLB flushes minimized.

    If using multiple page tables for context switching, it will be
    necessary to use ASIDs.

    See how much easier it is for HW to perform context switches en masse.

    It is possible to share global pages across "ASID groups", but currently there are not "truly global" pages (and, implicitly, some groups may
    disallow global pages).

    Where, say, the ASID is a 16-bit field:
    (15:10): ASID Group
    ( 9: 0): ASID ID

    At present, for most normal use, the idea is that the ASID and ACL/KRR
    ID's will be aliased to a process's PID.

    Not aliased to but accessed from !

    Say, with Groups 00..1F (in both ASID and ACLID space) being used for
    the PID aliased range (20..37 for special use, and 38..3F for selective one-off entries).

    Although completely under SW control, I am assuming that ASID = 0 is the
    hypervisor, that ASID = {1..255} is Guest HV, and {256..65535} is for
    GuestOS use.

    Currently, threads also eat PID's, but this is likely to change, say:
    TPID (Task ID):
    (31:16): PID
    (15: 0): Thread ID (local to a given PID)

    PIDs are GuestOS defined and used.
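
    (For concreteness, the field carving described above in C; the macros
    and the partition checks are purely illustrative:)

        #include <stdint.h>

        /* 16-bit ASID: (15:10) group, (9:0) id */
        #define ASID_GROUP(a)    (((a) >> 10) & 0x3f)
        #define ASID_ID(a)       ((a) & 0x3ff)
        #define MAKE_ASID(g, i)  ((uint16_t)((((g) & 0x3f) << 10) | \
                                             ((i) & 0x3ff)))

        /* proposed 32-bit TPID: (31:16) PID, (15:0) thread id */
        #define TPID_PID(t)      (((t) >> 16) & 0xffff)
        #define TPID_TID(t)      ((t) & 0xffff)

        /* the SW-defined partition of the ASID space */
        static inline int is_hv(uint16_t a)       { return a == 0; }
        static inline int is_guest_hv(uint16_t a) { return a >= 1 && a <= 255; }
        static inline int is_guest_os(uint16_t a) { return a >= 256; }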

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Robert Finch on Thu Mar 7 19:51:36 2024
    Robert Finch wrote:

    On 2024-03-07 1:39 a.m., BGB wrote:


    Bigfoot uses a whole byte for access rights, with separate
    read-write-execute for user and supervisor, and write protect for
    hypervisor and machine modes. He also uses 4 bits for the cache-ability
    which match the cache-ability bits on the bus.

    Can you think of an example where a user Read-Only page would not be
    writeable from super ?? Or a device on PCIe ??
    Can you think of an example where a user Execute-Only page would not
    be readable from super ?? Or a device on PCIe ??
    Can you think of an example where a user page marked RWE = 000 would
    not be readable and writeable from super ? Or a device on PCIe ??

    Can you think of an example where 2-bits denoting {L1, L2, LLC, DRAM}
    does not describe the cache placement of a line adequately ??


    I am using the value of zero for the ASID to represent the machine
    mode’s ASID. A lot of hardware is initialized to zero at reset, so it’s automatically the machine mode’s. Other than zero the ASID could be anything assigned by the OS.

    I do not rely on control registers being set to zero; instead, part of
    HW context switching ends up reading these out of ROM and into those
    registers--so they can have any reasonable bit pattern SW desires.
    {{This is sort of like how Alpha comes out of reset and streams a ROM
    through the scan path to initialize the internals.}}

    I am also assuming that ASID = 0 is the highest level of privilege;
    but this is purely a SW choice.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Thu Mar 7 11:58:53 2024
    On 3/5/2024 9:32 AM, MitchAlsup1 wrote:

    snip

    I also believe in the tension between pages that are too small and those
    that are too large. 256B is widely seen as too small (VAX). I think most people are comfortable in the 4KB range. I think 64KB is too big since something like cat needs less code, less data, and less stack space than
    1 single 64KB page and another 64KB page to map cat from its VAS. So, now; instead of four* 4KB pages (16KB ={code, data, stack, map} ) we now need
    four 64KB pages 256KB. It is these small applications that drive the
    minimum page size down.

    In thinking about this, an idea occurred to me that may ease this
    tension some. For a large page, you introduce a new protection mode
    such that, for example, the lower half of the addresses in the page are
    execute only, and the upper half are read/write enabled. This would
    allow the code and the data, and perhaps even the stack for such a
    program to share a single page, while still maintaining the required
    access protection. I think the hardware to implement this is pretty
    small. While the benefits of this would be modest, if such "small
    programs" occur often enough it may be worth the modest cost of the
    additional hardware.
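
    (The check such a mode implies is roughly one compare on the page
    offset; a C sketch with invented names:)

        #include <stdbool.h>
        #include <stdint.h>

        enum access { ACC_READ, ACC_WRITE, ACC_EXEC };

        /* Hypothetical "split" page mode: low half execute-only,
           high half read/write. */
        static bool split_page_allows(uint64_t offset, uint64_t page_size,
                                      enum access acc)
        {
            if (offset >= page_size / 2)              /* upper half */
                return acc == ACC_READ || acc == ACC_WRITE;
            return acc == ACC_EXEC;                   /* lower half */
        }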


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Stephen Fuld on Thu Mar 7 20:32:44 2024
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 3/5/2024 9:32 AM, MitchAlsup1 wrote:

    snip

    I also believe in the tension between pages that are too small and those
    that are too large. 256B is widely seen as too small (VAX). I think most
    people are comfortable in the 4KB range. I think 64KB is too big since
    something like cat needs less code, less data, and less stack space than
    1 single 64KB page and another 64KB page to map cat from its VAS. So, now; >> instead of four* 4KB pages (16KB ={code, data, stack, map} ) we now need
    four 64KB pages 256KB. It is these small applications that drive the
    minimum page size down.

    In thinking about this, an idea occurred to me that may ease this
    tension some. For a large page, you introduce a new protection mode
    such that, for example, the lower half of the addresses in the page are
    execute only, and the upper half are read/write enabled. This would
    allow the code and the data, and perhaps even the stack for such a
    program to share a single page, while still maintaining the required
    access protection. I think the hardware to implement this is pretty
    small. While the benefits of this would be modest, if such "small
    programs" occur often enough it may be worth the modest cost of the >additional hardware.

    The biggest problem with variable page sizes isn't the hardware.

    The problem is how to effectively use multiple page sizes without
    serious checkerboarding and subsequent allocation issues. Solved
    in Linux by preallocation at boot time of a range to be used
    only for large pages - which if they aren't used, are not
    available to be used as regular pages. Linux also has THP[*], which will move stuff around to make sufficiently sized (and aligned, which is
    the harder problem) regions that can be changed to a larger
    mapping.

    [*] Transparent Huge Pages.
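
    (From the application side, asking THP for a huge mapping is just an
    madvise() hint; a Linux-specific example, error handling elided:)

        #include <stddef.h>
        #include <sys/mman.h>

        int main(void)
        {
            size_t len = 64ul << 20;   /* 64MB anonymous region */
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                return 1;
            /* Hint only: khugepaged still has to find (and align)
               2MB frames to promote the mapping. */
            madvise(p, len, MADV_HUGEPAGE);
            return 0;
        }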

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Thu Mar 7 20:29:29 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Robert Finch wrote:

    On 2024-03-07 1:39 a.m., BGB wrote:


    Bigfoot uses a whole byte for access rights, with separate
    read-write-execute for user and supervisor, and write protect for
    hypervisor and machine modes. He also uses 4 bits for the cache-ability
    which match the cache-ability bits on the bus.

    Can you think of an example where a user Read-Only page would not be
    writeable from super?

    Yep. The entire user address space should not be accessible
    to privileged code (except when using specialized load and
    store instructions that validate the access using the user
    privileges). Aside from initially loading the code/data
    at 'exec' time. There are exceptions, such as the VDSO
    pages in Linux which are explicitly shared between user-mode
    and privileged software.

    This is a mitigation for kernel compromises to prevent access
    to secrets and/or code injection.

    ?? Or a device on PCIe ??

    That's entirely up to the IOMMU, which generally uses different
    translation tables than the processor(s).

    Can you think of an example where a user Execute-Only page would not
    be readable from super ?? Or a device on PCIe ??

    See above. Minimize the security footprint.


    Can you think of an example where a user page marked RWE = 000 would
    not be readable and writeable from super ? Or a device on PCIe ??

    Can you think of an example where 2-bits denoting {L1, L2, LLC, DRAM}
    does not describe the cache placement of a line adequately ??

    A remote cache in a non-coherent CXL compute expander?


    I am also assuming that ASID = 0 is the highest level of privilege;
    but this is purely a SW choice.

    I'm troubled about the idea of the ASID having anything to do
    with security. There are benefits to having the ASID being
    qualified by a guest (VM) identifier such that the guest
    operating system can use the entire range of ASIDs as if
    it were running on real hardware.

    Reminds me of the 1960s, when the base register == 0 would
    enable access to the privileged instructions. The next
    generation of that system switched to using a control register.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Robert Finch on Thu Mar 7 21:47:32 2024
    Robert Finch wrote:

    On 2024-03-07 2:51 p.m., MitchAlsup1 wrote:
    Robert Finch wrote:

    On 2024-03-07 1:39 a.m., BGB wrote:


    Bigfoot uses a whole byte for access rights, with separate
    read-write-execute for user and supervisor, and write protect for
    hypervisor and machine modes. He also uses 4 bits for the
    cache-ability which match the cache-ability bits on the bus.

    Can you think of an example where a user Read-Only page would not be
    writeable from super ?? Or a device on PCIe ??
    Can you think of an example where a user Execute-Only page would not
    be readable from super ?? Or a device on PCIe ??

    I cannot think of examples. But I had thought the hypervisor / machine
    might want to treat supervisor mode like an alternate user mode. The
    bits can always just be set = 7.

    Or, you can avoid blowing the 3-extra-bits and just assume a higher
    privilege level can access pages the lesser privileged cannot.

    Can you think of an example where a user page marked RWE = 000 would
    not be readable and writeable from super ? Or a device on PCIe ??

    A page marked RWE=000 is an unusable page. Perhaps to signal bad memory.
    Or perhaps as a hidden data page full of comments or remarks. If it's not
    readable, writeable, or executable, what is it? Nothing should be able to
    access it, except maybe the machine/debug operating mode.

    I chose this because I have a mandate on the Safe-Stack area that LDs
    and STs (and prefetches and post pushes) cannot access the data--only
    ENTER and EXIT can access the data so the contract (ABI) between caller
    and callee cannot be violated {{avoiding many attack strategies}}.

    Can you think of an example where 2-bits denoting {L1, L2, LLC, DRAM}
    does not describe the cache placement of a line adequately ??

    The cache-ability bits were not directly describing cache placement.
    They were like the cache-ability bits in the bus. They specified
    cache-policy: bufferable / non-bufferable, write-through, write-back,
    allocate, etc. But now that I have reviewed it, I see I had forgotten
    that I removed these
    bits from the PTE / TLBE.

    I do not like situations where all possible codes are used.

    I would like to apply this argument to integer encodings--integers need
    an encoding to represent "there is no value here" {similar to NaN}.

    So, I would probably use three bits. Could a cache line be located in a Register for instance?

    My cache lines are 64-bytes (512-bits) in size,
    My registers are 64-bits in size.
    Flip-flops can be any number of bits in size.

    4 cache lines ARE my register FILE.
    1 cache line IS my Thread Header
    5 cache lines ARE my Thread State (all of it)

    4 Thread States are accessible at any instant of time making context
    switching easier. SVC goes up the chain to higher privilege, SVR does
    the reverse.

    I cannot envision every usage; although a lot is known today, I thought
    it would be better to err on the side of providing too many bits rather
    than not enough.

    This works only so long as you do not run out of bits. Once you start scrambling to find an encoding of 2 (or 3) fields that would never be
    in use, and use the combination of 2 fields to mean something "special",
    you know you are in trouble.

    Not enough is hard to add later. There are loads of
    bits available in the 128-bit PTE; 96 bits would be enough. But it is
    not a nice power of two.

    Hmmmmmmmmmm,

    I got pairs of 63-bit virtual address spaces into a 64-bit container.
    And since we are only around 48-bits* of address space consumption,
    it will outlast my lifetime.

    (*) in the largest servers, while typical big desktops are down in the
    35-37-bit range.


    I am using the value of zero for the ASID to represent the machine
    mode’s ASID. A lot of hardware is initialized to zero at reset, so
    it’s automatically the machine mode’s. Other than zero the ASID could >>> be anything assigned by the OS.

    I do not rely on control registers being set to zero; instead, part of
    HW context switching ends up reading these out of ROM and into those
    registers--so they can have any reasonable bit pattern SW desires.
    {{This is sort of like how Alpha comes out of reset and streams a ROM
    through the scan path to initialize the internals.}}

    I am also assuming that ASID = 0 is the highest level of privilege;
    but this is purely a SW choice.

    Assuming a zero at reset was more of a default ‘if I forget’ approach. I have machine state being initialized from ROM too.

    In effect, I have 1 "register" that is set at reset {along with clearing
    the control bits of the pipeline.} Everything else is loaded from ROM.

    When control arrives after reset, you have a Thread Header which provides
    {IP, Root, Call Stack Pointer, raised and Enabled exceptions, ASID, Why,
    and a few more}. The privilege level determines which register file gets loaded. So by the time you are fetching instructions, you have a register
    file (filled with data), a data stack pointer, a call stack pointer, and
    30-odd registers containing whatever BOOT programmers thought was appropriate.

    You also have the MMU set up with the TLB enabled (and active). The MMU
    maps L1 and L2 in the allocated state, so you have at least 256KB of memory
    to work in BEFORE DRAM is identified, configured, initialized, and tuned to
    the electrical environment they are plugged into. This, in turn, means all
    (100%) of BOOT can be written in a HLL (C) without access to an assembler.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Scott Lurndal on Thu Mar 7 19:55:11 2024
    On 3/7/2024 12:32 PM, Scott Lurndal wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 3/5/2024 9:32 AM, MitchAlsup1 wrote:

    snip

    I also believe in the tension between pages that are too small and those >>> that are too large. 256B is widely seen as too small (VAX). I think most >>> people are comfortable in the 4KB range. I think 64KB is too big since
    something like cat needs less code, less data, and less stack space than >>> 1 single 64KB page and another 64KB page to map cat from its VAS. So, now; >>> instead of four* 4KB pages (16KB ={code, data, stack, map} ) we now need >>> four 64KB pages 256KB. It is these small applications that drive the
    minimum page size down.

    In thinking about this, an idea occurred to me that may ease this
    tension some. For a large page, you introduce a new protection mode
    such that, for example, the lower half of the addresses in the page are
    execute only, and the upper half are read/write enabled. This would
    allow the code and the data, and perhaps even the stack for such a
    program to share a single page, while still maintaining the required
    access protection. I think the hardware to implement this is pretty
    small. While the benefits of this would be modest, if such "small
    programs" occur often enough it may be worth the modest cost of the
    additional hardware.

    The biggest problem with variable page sizes isn't the hardware.

    What I proposed is not variable page sizes. All pages are the same
    size. This idea is to add a new protection option within the same page.
    The new option will allow "mixing" the code and data for a small
    program within the same page without sacrificing the protection that
    normally requires multiple pages.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David W Schroth@21:1/5 to sfuld@alumni.cmu.edu.invalid on Fri Mar 8 08:24:53 2024
    On Tue, 5 Mar 2024 10:32:03 -0800, Stephen Fuld
    <sfuld@alumni.cmu.edu.invalid> wrote:

    On 3/5/2024 9:32 AM, MitchAlsup1 wrote:
    Stephen Fuld wrote:

    On 3/5/2024 7:37 AM, MitchAlsup1 wrote:
    Robert Finch wrote:

    Bigfoot pages are 1MB in size. That allows an entire 64GB address
    range to be mapped with one page table. With larger memory systems a
    larger page size is needed IMO. 64GB is 65,536 pages still when the
    pages are 1MB in size. There is 32GB RAM in my machine today.
    Tomorrow it will be 64GB.

    Above a certain point the added latency of filling/spilling a page to DISK
    that is 64KB in size rather than 4KB in size outweighs the gain in the TLB.

    Sure. But the 4K size was first used when disk transfer rates were
    about 3 MB/sec. Today, they are many times that, even if you are
    using a single physical hard disk. RAID and SSD can make that even
    larger. I don't know what the "optimal" page size is, but it is
    certainly larger than 4KB.

    While the data transfer rates are far higher today, the disk latency has
    not kept pace with CPU performance {SSD is different}.

    Sure, but the latency is the same no matter what the transfer size is.
    So the fact that latency improvements haven't kept pace is irrelevant to
    the question of optimal page size.


    The CPU went from 60ns (16 MHz: 360/67) to 200ps (5GHz) 300× !!
    And from 5 cycles per instruction to 3 instructions per cycle 15×
    for a combined gain of 4,500×

    Disk latency may have gone down from 8ms to 3ms (sometimes 2ms for
    smaller disks) 3×-4×

    {{I consider 360/67 as the beginning of paging; although Multics may be
    the beginning.}}

    Agreed.


    I, also, believe that 4KB pages are a bit small for a new architecture.
    I chose 8KB for My 66000 more because it took 1 level out of the page
    tables supporting 64-bit VAS, and also made the paging hierarchy more
    easy to remember:: 8K, 8M, 8G, 8T, 8P, 8E is easier to remember than
    4K, 2M, 1G, ½T, ¼P, ⅛E.

    Fair enough. When Unisys implemented paging in the 2200 Series in the
    1990s, they chose 16K (approximately - exactly if you consider 36 equals
    32!).


    Actually, we use 4 KiW (contained in 32 KiB) as the page size. I
    don't remember spill/fill time being an issue.

    <snip>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Fri Mar 8 10:08:31 2024
    MitchAlsup1 wrote:
    Stephen Fuld wrote:

    So the optimum is, at least to some degree, usage dependent. Of
    course, this is all an argument for multiple page sizes.

    Which My 66000 provides, but it also provides big pages that can have an
    extent [0..extent-1] so you can use an 8GB page to map anything from 8KB
    through 8GB in a single PTE. The extent-bits are exactly the bits not
    being used as PA-bits.

    By the way, I borrowed your extent idea (thank you :-) ) as when combined
    with skipping from the root pointer to a lower level, it can be used to
    map the whole OS code and data with just a couple of PTE's.
    This eliminates table walks for most OS virtual addresses and
    could allow a few 'pinned' TLB entries to map the whole OS.

    This achieves a similar result to my Block Address Translate (BAT)
    approach but without requiring a whole separate MMU mechanism.

    The idea is for the OS to be separated into static and dynamic sets
    of code and RW data. The static parts are always resident in the OS
    plus any mandatory drivers. The linker packages all the static code
    and data together into two huge RE and RW memory sections at specific
    high end virtual addresses, aligned to a huge page boundary.

    The extent feature allows the OS static code and data to be loaded
    into just the portion of a huge page that each needs, with any unused
    remainder in the huge pages being returned to the general pool as
    smaller pages (so no wasted space in the 1GB or 8GB pages).

    And voila - two PTE's map the whole static OS code and data
    which can be permanently held in two MMU mapping registers.
    With one more for the graphics memory and the bulk of
    table walks for system space can be eliminated.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to David W Schroth on Fri Mar 8 06:58:11 2024
    On 3/8/2024 6:24 AM, David W Schroth wrote:
    On Tue, 5 Mar 2024 10:32:03 -0800, Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:

    On 3/5/2024 9:32 AM, MitchAlsup1 wrote:
    Stephen Fuld wrote:

    On 3/5/2024 7:37 AM, MitchAlsup1 wrote:
    Robert Finch wrote:

    Bigfoot pages are 1MB in size. That allows an entire 64GB address
    range to be mapped with one page table. With larger memory systems a >>>>>> larger page size is needed IMO. 64GB is 65,536 pages still when the >>>>>> pages are 1MB in size. There is 32GB RAM in my machine today.
    Tomorrow it will be 64GB.

    Above a certain point the added latency of filling/spilling a page to DISK
    that is 64KB in size rather than 4KB in size outweighs the gain in the TLB.

    Sure.  But the 4K size was first used when disk transfer rates were
    about 3 MB/sec.  Today, they are many times that, even if you are
    using a single physical hard disk.  RAID and SSD can make that even
    larger.  I don't know what the "optimal" page size is, but it is
    certainly larger than 4KB.

    While the data transfer rates are far higher today, the disk latency has >>> not kept pace with CPU performance {SSD is different}.

    Sure, but the latency is the same no matter what the transfer size is.
    So the fact that latency improvements haven't kept pace is irrelevant to
    the question of optimal page size.


    The CPU went from 60ns (16 MHz: 360/67) to 200ps (5GHz) 300× !!
    And from 5 cycles per instruction to 3 instructions per cycle 15×
    for a combined gain of 4,500×

    Disk latency may have gone down from 8ms to 3ms (sometimes 2ms for
    smaller disks) 3×-4×

    {{I consider 360/67 as the beginning of paging; although Multics may be
    the beginning.}}

    Agreed.


    I, also, believe that 4KB pages are a bit small for a new architecture.
    I chose 8KB for My 66000 more because it took 1 level out of the page
    tables supporting 64-bit VAS, and also made the paging hierarchy more
    easy to remember:: 8K, 8M, 8G, 8T, 8P, 8E is easier to remember than
    4K, 2M, 1G, ½T, ¼P, ⅛E.

    Fair enough. When Unisys implemented paging in the 2200 Series in the
    1990s, they chose 16K (approximately - exactly if you consider 36 equals
    32!).


    Actually, we use 4 KiW

    Yes. For those too young to remember, on the 2200 series, one word is
    36 bits. Hence my comment about 4KW being 16 KB, if 36 equals 32.

    (contained in 32 KiB) as the page size.

    I presume this is a result of the emulated systems using 64 bits for
    each 36 bit word. I was referring to the original implementation on
    native hardware.

    I
    don't remember spill/fill time being an issue.

    Ahh! That goes along with Anton's comment and contradicts Mitch's
    comments about disk times being a factor. Since you were there and
    involved in the implementation, what were the considerations in choosing
    4KW? Not that I am challenging it - I think it was probably the correct decision. I just want to better understand the reasoning behind it.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to moi on Fri Mar 8 09:52:44 2024
    On 3/8/2024 9:48 AM, moi wrote:

    {{I consider 360/67 as the beginning of paging; although Multics may be
    the beginning.}}

    Agreed.

    It certainly is not.
    IBM bought the right to use the Manchester University / Ferranti paging patent.


    I didn't know that. Thanks.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From moi@21:1/5 to All on Fri Mar 8 17:48:08 2024
    {{I consider 360/67 as the beginning of paging; although Multics may be
    the beginning.}}

    Agreed.

    It certainly is not.
    IBM bought the right to use the Manchester University / Ferranti paging
    patent.

    --
    Bill F.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Fri Mar 8 18:49:30 2024
    EricP wrote:

    MitchAlsup1 wrote:
    Stephen Fuld wrote:

    So the optimum is, at least to some degree, usage dependent. Of
    course, this is all an argument for multiple page sizes.

    Which My 66000 provides, but it also provides big pages that can have an
    extent [0..extent-1] so you can use an 8GB page to map anything from 8KB
    through 8GB in a single PTE. The extent-bits are exactly the bits not
    being used as PA-bits.

    By the way, I borrowed your extent idea (thank you :-) ) as when combined with skipping from the root pointer to a lower level, it can be used to
    map the whole OS code and data with just a couple of PTE's.
    This eliminates table walks for most OS virtual addresses and
    could allow a few 'pinned' TLB entries to map the whole OS.

    This achieves a similar result to my Block Address Translate (BAT)
    approach but without requiring a whole separate MMU mechanism.

    The idea is for the OS to be separated into static and dynamic sets
    of code and RW data. The static parts are always resident in the OS
    plus any mandatory drivers. The linker packages all the static code
    and data together into two huge RE and RW memory sections at specific
    high end virtual addresses, aligned to a huge page boundary.

    The extent feature allows the OS static code and data to be loaded
    into just the portion of a huge page that each needs, with any unused remainder in the huge pages being returned to the general pool as
    smaller pages (so no wasted space in the 1GB or 8GB pages).

    Those Huge page boundaries can be achieved at nominal page boundaries
    with a bit of paravirtualization help by HyperVisor simply because they
    are in GuestOS PaS = HV VaS.

    And voila - two PTE's map the whole static OS code and data
    which can be permanently held in two MMU mapping registers.

    And can be as big or small as GuestOS desires.

    With one more for the graphics memory and the bulk of
    table walks for system space can be eliminated.

    And so much for Kernel excursions wiping out the TLB.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to All on Fri Mar 8 14:41:51 2024
    On Thu, 7 Mar 2024 16:02:45 -0500, Robert Finch <robfi680@gmail.com>
    wrote:

    A page marked RWE=000 is an unusable page. Perhaps to signal bad memory.
    Or perhaps as a hidden data page full of comments or remarks. If it's not
    readable, writeable, or executable, what is it? Nothing should be able to
    access it, except maybe the machine/debug operating mode.

    The ability to change (at least data) pages between "untouchable" and
    RW is required for MMU assisted incremental GC. If the GC also
    handles code, then it must be able to mark pages executable as well.

    If an "untouchable" page can't be manipulated by user software, then
    you've disallowed an entire class of GC systems.
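
    For concreteness, the standard user-space mechanism for this on a POSIX
    system is mprotect() plus a fault handler. A minimal sketch of the
    pattern follows; gc_scan_page is a hypothetical collector hook, not any
    particular GC's API:

      #include <signal.h>
      #include <stddef.h>
      #include <stdint.h>
      #include <sys/mman.h>

      #define PAGE 4096

      extern void gc_scan_page(void *page);  /* hypothetical collector hook */

      static void on_fault(int sig, siginfo_t *si, void *ctx)
      {
          (void)sig; (void)ctx;
          void *page = (void *)((uintptr_t)si->si_addr & ~(uintptr_t)(PAGE - 1));
          gc_scan_page(page);                           /* do the GC work...   */
          mprotect(page, PAGE, PROT_READ | PROT_WRITE); /* ...then reopen page */
      }

      /* Make the heap "untouchable"; the first touch of each page traps
       * into on_fault, which scans that page and then re-enables access. */
      void install_barrier(void *heap, size_t len)
      {
          struct sigaction sa;
          sa.sa_flags = SA_SIGINFO;
          sigemptyset(&sa.sa_mask);
          sa.sa_sigaction = on_fault;
          sigaction(SIGSEGV, &sa, NULL);
          mprotect(heap, len, PROT_NONE);
      }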

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Fri Mar 8 22:34:12 2024
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:

    Robert Finch <robfi680@gmail.com> writes:


    Do they support using the entire page as a page table?

    All multilevel page tables use an entire page (granule in ARMv8)
    at each level of the page table. To map larger page/block sizes,
    they just stop the table walk at one, two or three levels rather
    than walking all four levels to the leaf page size.

    My 66000 page structure supports both skipping of levels in the table
    and stopping at any level in the table.
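
    As a rough sketch of what "stopping at any level" means in a table
    walker (the structure is illustrative, not the actual My 66000 or ARMv8
    format; skipping levels would additionally let the root pointer aim at a
    lower-level table):

      #include <stdint.h>

      #define PAGE_SHIFT 13                 /* 8 KB granule           */
      #define LVL_BITS   10                 /* 1024 entries per table */
      #define TOP_LEVEL  4

      /* A 'leaf' bit lets the walk stop at any level, so one entry can
       * map an 8K/8M/8G/... block. */
      typedef struct {
          uint64_t ppn;                     /* next table, or block   */
          unsigned valid : 1;
          unsigned leaf  : 1;
      } pte_t;

      extern pte_t *phys_to_table(uint64_t ppn);   /* hypothetical helper */

      int walk(pte_t *root, uint64_t va, uint64_t *pa)
      {
          pte_t *table = root;
          for (int lvl = TOP_LEVEL; lvl >= 0; lvl--) {
              unsigned shift = PAGE_SHIFT + lvl * LVL_BITS;
              pte_t e = table[(va >> shift) & ((1u << LVL_BITS) - 1)];
              if (!e.valid)
                  return -1;                        /* translation fault  */
              if (e.leaf || lvl == 0) {             /* stop at this level */
                  *pa = (e.ppn << PAGE_SHIFT) + (va & ((1ULL << shift) - 1));
                  return 0;
              }
              table = phys_to_table(e.ppn);         /* descend one level  */
          }
          return -1;                                /* not reached */
      }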

    That's all well and good. What operating system do you
    have running on the My 66000 processor? When will I be able
    to purchase a system based on that processor?

    Linux.



    Compression could be useful for something serialized to disk or through
    the network, transferring the page number and compressed contents.

    Compressing tiered DRAM is looking to be the next big thing.

    https://www.zeropoint-tech.com/news/meta-and-google-unveil-a-revolutionary-hardware-compressed-cxl-memory-tier-specification

    CXL is just an offload of DRAM from an on-die memory controller to
    high-speed PCIe 5.0-6.0 links.

    Yes, I'm well aware of that. What you don't mention is that it
    can become part of the processor cache coherency domain.

    My 66000 considers DRAM as part of the cache/memory hierarchy and
    considers LLC as the front end to DRAM.

    An interesting topic between PCIe 5.0 and 6.0 is the change from
    NRZ encoding to PAM4. This comes with a degradation of the raw bit
    error rate from 10^-12 down to 10^-6, so you are going to need some
    kind of ECC on the PCIe link and protocol layers.

    PAM4 is something with which we have a great deal of expertise,
    along with the associated error correction.

    I only point this out because it is "different".

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to George Neuner on Fri Mar 8 22:50:43 2024
    George Neuner wrote:

    On Thu, 7 Mar 2024 16:02:45 -0500, Robert Finch <robfi680@gmail.com>
    wrote:

    A page marked RWE=000 is an unusable page.

    Inaccessible, not unusable.

    ENTER and EXIT check that the Safe-Stack is inaccessible to the application
    (RWE = 000). This means the application cannot LD from or ST to the
    Safe-Stack; ENTER and EXIT can! This simple twist of the wrist eliminates
    the ability to overrun data onto the data-stack, yet does not alter the ABI
    guarantee that the callee returns to the caller with all of its preserved
    registers as if unchanged.

    Perhaps to signal bad memory.

    That is eminently possible.

    Or perhaps as a hidden data page full of comments or remarks. If it's not
    readable, writeable, or executable, what is it? Nothing should be able to
    access it, except maybe the machine/debug operating mode.

    A) it is accessible by more privileged levels of the system.
    B) GuestOS can put information in process VaS that the application cannot
    access {say, for example, to avoid keeping it in kernel address space}.
    C) it can still be accessed by devices.
    D) it can be decrypted as touched (GuestOS exception).
    E) a stack that guarantees the ABI in untrusted computing environments.
    ...
    The only one guaranteed no access is the application at the privilege level
    of that application's memory map. All higher-privilege levels retain access.

    The ability to change (at least data) pages between "untouchable" and
    RW is required for MMU assisted incremental GC. If the GC also
    handles code, then it must be able to mark pages executable as well.

    Another use.

    If an "untouchable" page can't be manipulated by user software, then
    you've disallowed an entire class of GC systems.

    I did not know of this technique, but it works in my design too, and
    without alteration.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David W Schroth@21:1/5 to sfuld@alumni.cmu.edu.invalid on Fri Mar 8 19:04:35 2024
    On Fri, 8 Mar 2024 06:58:11 -0800, Stephen Fuld
    <sfuld@alumni.cmu.edu.invalid> wrote:

    On 3/8/2024 6:24 AM, David W Schroth wrote:
    On Tue, 5 Mar 2024 10:32:03 -0800, Stephen Fuld
    <sfuld@alumni.cmu.edu.invalid> wrote:

    On 3/5/2024 9:32 AM, MitchAlsup1 wrote:
    Stephen Fuld wrote:

    On 3/5/2024 7:37 AM, MitchAlsup1 wrote:
    Robert Finch wrote:

    Bigfoot pages are 1MB in size. That allows an entire 64GB address
    range to be mapped with one page table. With larger memory systems a
    larger page size is needed IMO. 64GB is 65,536 pages still when the
    pages are 1MB in size. There is 32GB RAM in my machine today.
    Tomorrow it will be 64GB.

    Above a certain point the added latency of filling/spilling a page to
    DISK that is 64KB in size rather than 4KB in size outweighs the gain in
    the TLB.

    Sure. But the 4K size was first used when disk transfer rates were
    about 3 MB/sec. Today, they are many times that, even if you are
    using a single physical hard disk. RAID and SSD can make that even
    larger. I don't know what the "optimal" page size is, but it is
    certainly larger than 4KB.

    While the data transfer rates are far higher today, the disk latency has
    not kept pace with CPU performance {SSD is different}.

    Sure, but the latency is the same no matter what the transfer size is.
    So the fact that latency improvements haven't kept pace is irrelevant to
    the question of optimal page size.


    The CPU went from 60ns (16 MHz: 360/67) to 200ps (5GHz): a factor of 300!
    And from 5 cycles per instruction to 3 instructions per cycle: a factor
    of 15, for a combined gain of 4,500.

    Disk latency may have gone down from 8ms to 3ms (sometimes 2ms for
    smaller disks): a factor of 3-4.

    {{I consider 360/67 as the beginning of paging; although Multics may be
    the beginning.}}

    Agreed.


    I, also, believe that 4KB pages are a bit small for a new architecture. >>>> I chose 8KB for My 66000 more because it took 1 level out of the page
    tables supporting 64-bit VAS, and also made the paging hierarchy more
    easy to remember:: 8K, 8M, 8G, 8T, 8P, 8E is easier to remember than
    4K, 2M, 1G, T, P, ?E.

    Fair enough. When Unisys implemented paging in the 2200 Series in the
    1990s, they chose 16K (approximately - exactly if you consider 36 equals 32!).


    Actually, we use 4 KiW

    Yes. For those too young to remember, on the 2200 series, one word is
    36 bits. Hence my comment about 4KW being 16 KB, if 36 equals 32.

    (contained in 32 KiB) as the page size.

    I presume this is a result of the emulated systems using 64 bits for
    each 36 bit word. I was referring to the original implementation on
    native hardware.

    I
    don't remember spill/fill time being an issue.

    Ahh! That goes along with Anton's comment and contradicts Mitch's
    comments about disk times being a factor. Since you were there and
    involved in the implementation, what were the considerations in choosing
    4KW? Not that I am challenging it - I think it was probably the correct
    decision. I just want to better understand the reasoning behind it.

    I wasn't there when the paging architecture was defined, so I can't
    actually say. I would speculate that it's because the D
    (displacement) field in the Extended Mode instruction format is 12
    bits wide - 2^12 = 4,096 words, exactly one 4 KiW page.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lynn Wheeler@21:1/5 to mitchalsup@aol.com on Fri Mar 8 14:19:17 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    {{I consider 360/67 as the beginning of paging; although Multics may be
    the beginning.}}

    The Science Center thought IBM could get MIT MULTICS ... but it went to
    GE. Then the IBM mission for virtual memory/paging went to the "new"
    TSS/360 group. The Science Center modified a 360/40 with virtual memory &
    paging pending availability of the 360/67, which came standard with
    virtual memory (and CP40/CMS morphed into CP67/CMS).

    Melinda's history web pages
    http://www.leeandmelindavarian.com/Melinda#VMHist
    from (lots of early history: some CTSS/7094 people went to the 5th floor
    and Multics, and others went to the IBM Science Center on the 4th floor)
    http://www.leeandmelindavarian.com/Melinda/neuvm.pdf
    a footnote from Les Comeau:

    Since the early time-sharing experiments used base and limit registers
    for relocation, they had to roll in and roll out entire programs when
    switching users....Virtual memory, with its paging technique, was
    expected to reduce significantly the time spent waiting for an exchange
    of user programs.

    What was most significant was that the commitment to virtual memory was
    backed with no successful experience. A system of that period that had implemented virtual memory was the Ferranti Atlas computer, and that was
    known not to be working well. What was frightening is that nobody who
    was setting this virtual memory direction at IBM knew why Atlas didn't
    work.

    ... snip ...

    Atlas reference (original gone?), but it lives on at the Wayback Machine:
    https://web.archive.org/web/20121118232455/http://www.ics.uci.edu/~bic/courses/JaverOS/ch8.pdf
    from above:

    Paging can be credited to the designers of the ATLAS computer, who
    employed an associative memory for the address mapping [Kilburn, et
    al., 1962]. For the ATLAS computer, |w| = 9 (resulting in 512 words
    per page), |p| = 11 (resulting in 2048 pages), and f = 5 (resulting in
    32 page frames). Thus a 2^20-word virtual memory was provided for a
    2^14-word machine. But the original ATLAS operating system employed
    paging solely as a means of implementing a large virtual memory;
    multiprogramming of user processes was not attempted initially, and
    thus no process id's had to be recorded in the associative memory. The
    search for a match was performed only on the page number p.

    ... snip ...

    I was an undergraduate at the univ and was hired fulltime, responsible
    for OS/360. The univ had gotten a 360/67 for TSS/360, but was running it
    as a 360/65 (the univ shut down the datacenter on weekends and I had the
    whole place dedicated; 48hrs w/o sleep did make Monday classes hard).

    Then CSC came out to install CP67 (3rd installation, after CSC itself and
    MIT Lincoln Labs) and I mostly played with it in my weekend time. This
    early release had a very rudimentary page replacement algorithm and no
    page thrashing controls. I did a (global LRU) reference-bit scan
    and dynamic adaptive page thrashing controls.
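
    For readers who haven't seen it: a reference-bit scan of this sort is
    essentially the "clock" / second-chance policy. A minimal sketch (mine,
    not CP/67's actual code):

      #include <stdbool.h>

      typedef struct {
          bool in_use;
          bool referenced;   /* set by hardware on any touch of the page */
      } frame_t;

      /* Sweep all frames in a circle: a set reference bit buys the page
       * one more revolution; a clear bit means it has not been touched
       * since the last pass, so evict it. Assumes at least one frame is
       * in use. */
      unsigned pick_victim(frame_t *frames, unsigned nframes, unsigned *hand)
      {
          for (;;) {
              frame_t *f = &frames[*hand];
              unsigned cur = *hand;
              *hand = (*hand + 1) % nframes;
              if (!f->in_use)
                  continue;
              if (f->referenced)
                  f->referenced = false;   /* second chance */
              else
                  return cur;              /* victim frame  */
          }
      }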

    Nearly 15yrs later, at Dec81 ACM SIGOPS, Jim Gray asked if I could help a
    Tandem co-worker get a Stanford PhD .... it involved global LRU similar
    to the work that I had done in the 60s, and there were "local LRU" forces
    from the 60s lobbying hard against awarding a PhD (for work that wasn't
    "local LRU"). I had real live data from a CP/67 with global LRU on a
    768kbyte (104 pageable pages) 360/67 with 80 users that had better
    response and throughput than a CP/67 (with a nearly identical type of
    workload but 35 users) running the 60s "local LRU" implementation on a
    1mbyte 360/67 (155 pageable pages after fixed storage) ... aka half the
    users and 50% more real paging storage.

    A decade ago, I was asked to track down the decision to add virtual
    memory to all 370s ... found somebody involved; archived posts with
    pieces of the email exchange:
    https://www.garlic.com/~lynn//2011d.html#73

    --
    virtualization experience starting Jan1968, online at home since Mar1970

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David W Schroth@21:1/5 to sfuld@alumni.cmu.edu.invalid on Fri Mar 8 19:13:38 2024
    On Thu, 7 Mar 2024 11:58:53 -0800, Stephen Fuld
    <sfuld@alumni.cmu.edu.invalid> wrote:

    On 3/5/2024 9:32 AM, MitchAlsup1 wrote:

    snip

    I also believe in the tension between pages that are too small and those
    that are too large. 256B is widely seen as too small (VAX). I think most
    people are comfortable in the 4KB range. I think 64KB is too big since
    something like cat needs less code, less data, and less stack space than
    1 single 64KB page and another 64KB page to map cat from its VAS. So now,
    instead of four 4KB pages (16KB = {code, data, stack, map}) we now need
    four 64KB pages (256KB). It is these small applications that drive the
    minimum page size down.

    In thinking about this, an idea occurred to me that may ease this
    tension some. For a large page, you introduce a new protection mode
    such that, for example, the lower half of the addresses in the page are
    execute only, and the upper half are read/write enabled. This would
    allow the code and the data, and perhaps even the stack for such a
    program to share a single page, while still maintaining the required
    access protection. I think the hardware to implement this is pretty
    small. While the benefits of this would be modest, if such "small
    programs" occur often enough it may be worth the modest cost of the >additional hardware.

    I would suggest that your proposal would be better done by splitting access/protection from virtual to physical translation (think Mill
    turfs). I suppose OS2200 could use our existing protection to
    implement what you propose, but we haven't (largely because there
    seems to be no need/call for the capability).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to David W Schroth on Sat Mar 9 08:46:54 2024
    On 3/8/2024 5:13 PM, David W Schroth wrote:
    On Thu, 7 Mar 2024 11:58:53 -0800, Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:

    On 3/5/2024 9:32 AM, MitchAlsup1 wrote:

    snip

    I also believe in the tension between pages that are too small and those >>> that are too large. 256B is widely seen as too small (VAX). I think most >>> people are comfortable in the 4KB range. I think 64KB is too big since
    something like cat needs less code, less data, and less stack space than >>> 1 single 64KB page and another 64KB page to map cat from its VAS. So, now; >>> instead of four* 4KB pages (16KB ={code, data, stack, map} ) we now need >>> four 64KB pages 256KB. It is these small applications that drive the
    minimum page size down.

    In thinking about this, an idea occurred to me that may ease this
    tension some. For a large page, you introduce a new protection mode
    such that, for example, the lower half of the addresses in the page are
    execute only, and the upper half are read/write enabled. This would
    allow the code and the data, and perhaps even the stack for such a
    program to share a single page, while still maintaining the required
    access protection. I think the hardware to implement this is pretty
    small. While the benefits of this would be modest, if such "small
    programs" occur often enough it may be worth the modest cost of the
    additional hardware.

    I would suggest that your proposal would be better done by splitting access/protection from virtual to physical translation (think Mill
    turfs).

    I think we are in general agreement here. As you implied, the
    fundamental problem is that most current systems "overload" two pieces
    of functionality (memory management and protection) onto a single
    mechanism (paging). I like Mill's approach as it clearly separates the
    two functions, though it does require more hardware, and I am not sure
    how much easier it is for them than it would be for other systems since
    they use a single address space. My proposal was for a low hardware
    cost and easily implemented mechanism.


    I suppose OS2200 could use our existing protection to
    implement what you propose, but we haven't (largely because there
    seems to be no need/call for the capability).

    Since the 2200 already separates the protection function (banking) from
    the memory management function (paging), I think all that would be
    required is to allow multiple banks, perhaps even from multiple programs
    to use different parts of a page. But I agree that, with ~16 KB pages,
    the savings would be much less than if larger pages were used, so the
    benefits would be modest at best and probably not worth the effort.

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Lynn Wheeler on Sat Mar 9 13:07:44 2024
    Lynn Wheeler wrote:

    Nearly 15yrs later at Dec81 ACM SIGOPS, Jim Gray ask if I could help a
    Tandem co-worker get Stanford Phd .... it involved similar global LRU to
    the work that I had done in the 60s and there were "local LRU" forces
    from the 60s lobbying hard not to award a Phd (that wasn't "local
    LRU"). I had real live data from a CP/67 with global LRU on 768kbyte
    (104 pageable pages) 360/67 with 80users that had better response and throughput compared to a CP/67 (with nearly identical type of workload
    but 35users) that implemented 60s "local LRU" implementation and 1mbyte 360/67 (155 pageable pages after fixed storage) ... aka half the users
    and 50% more real paging storage.

    I assume by "local LRU" you mean local working set management,
    as opposed to global working set e.g. WSClock.

    Are you saying some people tried to block someone from getting a PhD
    because he researched a different working set management than their
    pet approach?

    If so, wow...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Robert Finch on Sat Mar 9 21:34:39 2024
    On 3/7/2024 9:13 PM, Robert Finch wrote:
    On 2024-03-07 2:58 p.m., Stephen Fuld wrote:
    On 3/5/2024 9:32 AM, MitchAlsup1 wrote:

    snip

    I also believe in the tension between pages that are too small and
    those that are too large. 256B is widely seen as too small (VAX). I
    think most
    people are comfortable in the 4KB range. I think 64KB is too big since
    something like cat needs less code, less data, and less stack space than
    1 single 64KB page and another 64KB page to map cat from its VAS. So now,
    instead of four 4KB pages (16KB = {code, data, stack, map}) we now need
    four 64KB pages (256KB). It is these small applications that drive the
    minimum page size down.

    In thinking about this, an idea occurred to me that may ease this
    tension some. For a large page, you introduce a new protection mode
    such that, for example, the lower half of the addresses in the page
    are execute only, and the upper half are read/write enabled. This
    would allow the code and the data, and perhaps even the stack for such
    a program to share a single page, while still maintaining the required
    access protection. I think the hardware to implement this is pretty
    small. While the benefits of this would be modest, if such "small
    programs" occur often enough it may be worth the modest cost of the
    additional hardware.


    I had thoughts along this line too. I have added user access rights for
    each 1/4 of a page. Only a single 64kB page split in four is needed for
    a small app then.

    Yes, similar. I suspect your solution is more general than mine and
    thus handles more cases, but requires more hardware, especially bits in
    the PTE. It's all a tradeoff.
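
    For concreteness, a sketch of the check such per-quarter rights imply.
    The encoding (four 3-bit RWE fields packed into 12 PTE bits) is
    hypothetical, not Robert's actual format:

      #include <stdbool.h>
      #include <stdint.h>

      #define QUARTER_SHIFT 14   /* 64KB page / 4 = 16KB quarters */

      /* rwe4 packs four 3-bit RWE fields, one per quarter of the page;
       * 'need' is the RWE mask the access requires. */
      bool access_ok(uint16_t rwe4, uint64_t va, unsigned need)
      {
          unsigned q      = (unsigned)(va >> QUARTER_SHIFT) & 3;  /* quarter */
          unsigned rights = (rwe4 >> (3 * q)) & 7;                /* its RWE */
          return (rights & need) == need;
      }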


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to All on Sun Mar 10 14:29:52 2024
    On Fri, 8 Mar 2024 20:26:08 -0500, Robert Finch <robfi680@gmail.com>
    wrote:

    I plan on having garbage collection as part of the OS. There is a shared
    hardware-card table involved.

    What kind?
    [Actually "kind" is the wrong word because any non-toy, real world GC
    will need to employ a combination of techniques. So the question
    really should be "in what major class is your GC"?]

    Problem is - whatever you choose - it will be wrong and have bad
    performance for some important class of GC'd applications.


    So, I guess that would disallow user
    garbage collectors using untouchable pages. The MMU could be faked out
    using a VM, so I have read.

    Yes, a VM can emulate MMU operation, but currently that requires using
    a hypervisor - a heavyweight solution that also requires a guest OS to
    run the program.

    There are a number of light(er) weight VMs for running programs in
    managed environments [which include GC] ... but all of them have to
    run under an OS and are at its mercy.

    GCs that use no-access pages are not rare, and they are just one class
    of MMU assisted GC systems. There are a number of ways a collector
    can leverage the MMU to help.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to gneuner2@comcast.net on Sun Mar 10 19:53:12 2024
    On Sun, 10 Mar 2024 14:29:52 -0400, George Neuner
    <gneuner2@comcast.net> wrote:


    Problem is - whatever [GC] you choose - it will be wrong and have bad
    performance for some important class of GC'd applications.

    "bad performance" may mean "slow", but also may mean memory use much
    higher than should be necessary.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to George Neuner on Mon Mar 11 09:29:57 2024
    George Neuner <gneuner2@comcast.net> writes:

    On Fri, 8 Mar 2024 20:26:08 -0500, Robert Finch <robfi680@gmail.com>
    wrote:

    I plan on having garbage collection as part of the OS. There is a shared
    hardware-card table involved.

    What kind?
    [Actually "kind" is the wrong word because any non-toy, real world GC
    will need to employ a combination of techniques. So the question
    really should be "in what major class is your GC"?]

    Problem is - whatever you choose - it will be wrong and have bad
    performance for some important class of GC'd applications.

    I'm curious to know what you consider to be the different kinds,
    or classes, of GC, and the same question for applications.

    Certainly, for any given GC implementation, one can construct an
    application that does poorly with respect to that GC, but that
    doesn't make the constructed application a "class". For the
    statement to have meaningful content there needs to be some kind
    of identification of what are the classes of GCs, and what are
    the classes of applications.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to George Neuner on Mon Mar 11 09:32:40 2024
    George Neuner <gneuner2@comcast.net> writes:

    On Sun, 10 Mar 2024 14:29:52 -0400, George Neuner
    <gneuner2@comcast.net> wrote:


    Problem is - whatever [GC] you choose - it will be wrong and have bad
    performance for some important class of GC'd applications.

    "bad performance" may mean "slow", but also may mean memory use much
    higher than should be necessary.

    Understood. And there are other relevant metrics as well, as
    for example not throughput but worst-case latency.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to tr.17687@z991.linuxsc.com on Tue Mar 12 07:30:08 2024
    On Mon, 11 Mar 2024 09:32:40 -0700, Tim Rentsch
    <tr.17687@z991.linuxsc.com> wrote:

    George Neuner <gneuner2@comcast.net> writes:

    On Sun, 10 Mar 2024 14:29:52 -0400, George Neuner
    <gneuner2@comcast.net> wrote:


    Problem is - whatever [GC] you choose - it will be wrong and have bad
    performance for some important class of GC'd applications.

    "bad performance" may mean "slow", but also may mean memory use much
    higher than should be necessary.

    Understood. And there are other relevant metrics as well, as
    for example not throughput but worst-case latency.

    If latency is the primary concern, then you should use a deterministic
    system such as Baker's Treadmill.

    Treadmill essentially is just a set of linked lists, and collection
    operations like marking and sweeping simply move blocks from one list
    to another. But you pay for that determinism with space - compared to
    other systems, Treadmills have a lot of per-block metadata.

    Note that for allocating and freeing to be deterministic, a Treadmill
    has to work with fixed-size blocks. But you can run multiple
    Treadmills for common block sizes, with a catchall for big blocks.
    Some malloc/free style allocators already work like this, using
    separate lists for some commonly requested block sizes.

    Depending on how you handle the metadata, Treadmills also are amenable
    to coalescing free space and/or compacting the heap / working set. But
    note that these types of operations can't be made deterministic.
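
    A minimal sketch of the structure (one treadmill of one fixed block
    size; the colored segments are spans of a single circular list):

      /* Per-block metadata: every block lives on one circular doubly-
       * linked list; the FREE, NEW, FROM and TO segments are delimited
       * by pointers into it. This overhead per block is the price paid
       * for O(1) allocate, mark, and free. */
      typedef struct block {
          struct block *next, *prev;
          unsigned char payload[64];    /* fixed size for this treadmill */
      } block_t;

      typedef struct {
          block_t *free, *bottom, *top, *scan;   /* segment boundaries */
      } treadmill_t;

      /* O(1): unlink a block from wherever it sits on the ring... */
      static void unlink_block(block_t *b)
      {
          b->prev->next = b->next;
          b->next->prev = b->prev;
      }

      /* ...and O(1): relink it just before pos. "Marking" a live block
       * is exactly unlink + insert into the TO segment: nothing is
       * copied and no address changes, which is what keeps the
       * operations deterministic. */
      static void insert_before(block_t *pos, block_t *b)
      {
          b->next = pos;
          b->prev = pos->prev;
          pos->prev->next = b;
          pos->prev = b;
      }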

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to George Neuner on Tue Mar 12 20:13:53 2024
    George Neuner <gneuner2@comcast.net> writes:

    On Mon, 11 Mar 2024 09:32:40 -0700, Tim Rentsch
    <tr.17687@z991.linuxsc.com> wrote:

    George Neuner <gneuner2@comcast.net> writes:

    On Sun, 10 Mar 2024 14:29:52 -0400, George Neuner
    <gneuner2@comcast.net> wrote:


    Problem is - whatever [GC] you choose - it will be wrong and have bad
    performance for some important class of GC'd applications.

    "bad performance" may mean "slow", but also may mean memory use much
    higher than should be necessary.

    Understood. And there are other relevant metrics as well, as
    for example not throughput but worst-case latency.

    If latency is the primary concern, then you should use a deterministic
    system such as Baker's Treadmill. [...]

    Does this mean you aren't going to answer my other question?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to tr.17687@z991.linuxsc.com on Tue Mar 12 23:57:02 2024
    On Mon, 11 Mar 2024 09:29:57 -0700, Tim Rentsch
    <tr.17687@z991.linuxsc.com> wrote:

    George Neuner <gneuner2@comcast.net> writes:

    On Fri, 8 Mar 2024 20:26:08 -0500, Robert Finch <robfi680@gmail.com>
    wrote:

    I plan on having garbage collection as part of the OS. There is a shared
    hardware-card table involved.

    What kind?
    [Actually "kind" is the wrong word because any non-toy, real world GC
    will need to employ a combination of techniques. So the question
    really should be "in what major class is your GC"?]

    Problem is - whatever you choose - it will be wrong and have bad
    performance for some important class of GC'd applications.

    I'm curious to know what you consider to be the different kinds,
    or classes, of GC, and the same question for applications.

    Certainly, for any given GC implementation, one can construct an
    application that does poorly with respect to that GC, but that
    doesn't make the constructed application a "class". For the
    statement to have meaningful content there needs to be some kind
    of identification of what are the classes of GCs, and what are
    the classes of applications.

    Feeling mathematical, are we?


    Every application contains delineated portions of its overall
    allocation profile which correspond closely to portions of the
    profiles of other applications.

    If a given profile performs poorly under a given GC, it is reasonable
    to infer that other applications having corresponding profiles also
    will perform poorly while those profiles are in force.

    That said ...



    GC systems - including their associated allocator(s) - are categorized
    (better word?) by their behavior. Unfortunately, behavior is
    described by a complex set of implementation choices.

    Understand that real-world GCs typically implement more than one
    algorithm, and often the algorithms are hybridized - derived from and
    relatable to published algorithms, but having a unique mix of function
    that won't be found "as is" in any search of the literature. [In truth,
    GC literature tends to leave a lot as an exercise for the reader.]

    GC behavior often is adaptive, reacting to run time conditions: e.g.,
    based on memory fragmentation it could shift between non-moving
    mark/sweep and moving mark/compact. It may also employ differing
    algorithms simultaneously in different spaces, such as being
    conservative in stacks while being precise in dynamic heaps, or being stop-world in thread private heaps while being concurrent or parallel
    in shared heaps. Etc.


    Concurrent GC (aka incremental) runs as a co-routine with the mutator.
    These systems are distinguished by how they are triggered to run, and
    what bounds may be placed on their execution time. There are
    concurrent systems having completely deterministic operation [their
    actual execution time, of course, may depend on factors beyond the
    GC's control, such as multitasking, caching or paging.]

    Parallel GC may be both prioritized and scheduled. These systems may
    offer some guarantees about the percentage of (application) time given
    to collector vs mutator(s).


    Major choices:

    - precise or conservative?
    - moving or non-moving?
    - tracing (marking)?
    - copying / compacting?
    - stop-world, concurrent, or parallel?

    - single or multiple spaces?
    - semispaces?
    - generational?

    Minor choices:

    - software-only or hardware (MMU) assisted?
    - snapshot at beginning?

    - bump or block allocation?
    - allocation color?

    - free blocks coalesced? {if not compacting}

    - multiple mutators?
    - mutation color?
    - writable shared objects?
    - FROM-space mutation?
    - finalization?


    Note that all of these represent free dimensions in design. As
    mentioned above, any particular system may implement multiple
    collection algorithms each embodying a different set of choices.



    You may wonder why I didn't mention "sweeping" ... essentially it is
    because sequential scan is more an implementation detail than a
    technique. Although "mark/sweep" is a well established technique, it
    is the marking (tracing) that really defines it. Then too, modern
    collectors often are neither mark/sweep nor copying as presented in
    textbooks: e.g., there are systems that mark and copy, systems that
    sweep and copy (without marking), and "copying" systems in which
    copies are logical and nothing actually changes address.

    Aside: all GC can be considered to use [logical] semispaces because
    all have the notion of segregated FROM-space and TO-space during
    collection. True semispaces are a set of (address) contiguous spaces
    - not necessarily equally sized - which are rotated as targets for new allocation. True semispaces do imply physical copying [but see the
    Treadmill for an example of "non-moving copy" using logical
    semispaces].



    So what do I consider to be the "kind" of a GC?

    The choices above pretty much define the extent of the design space
    [but note I did intentionally leave out reference counting]. However,
    the first 8 choices are structural, whereas the rest specify important characteristics but don't affect structure.

    A particular "kind" might be, e.g.,
    "precise, generational, multispace, non-moving, concurrent
    tracer".
    and so on.



    I'm guessing this probably didn't really answer your question, but it
    was fun to write.
    ;-)

    George

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to George Neuner on Sun Apr 28 15:27:41 2024
    George Neuner <gneuner2@comcast.net> writes:

    On Mon, 11 Mar 2024 09:29:57 -0700, Tim Rentsch
    <tr.17687@z991.linuxsc.com> wrote:

    George Neuner <gneuner2@comcast.net> writes:

    On Fri, 8 Mar 2024 20:26:08 -0500, Robert Finch <robfi680@gmail.com>
    wrote:

    I plan on having garbage collection as part of the OS. There
    is a shared hardware-card table involved.

    What kind?
    [Actually "kind" is the wrong word because any non-toy, real
    world GC will need to employ a combination of techniques. So
    the question really should be "in what major class is your GC"?]

    Problem is - whatever you choose - it will be wrong and have bad
    performance for some important class of GC'd applications.

    I'm curious to know what you consider to be the different kinds,
    or classes, of GC, and the same question for applications.

    Certainly, for any given GC implementation, one can construct an
    application that does poorly with respect to that GC, but that
    doesn't make the constructed application a "class". For the
    statement to have meaningful content there needs to be some kind
    of identification of what are the classes of GCs, and what are
    the classes of applications.

    Feeling mathematical, are we?

    If you're trying to say I make an effort to be accurate and
    precise in my writing, I plead guilty as charged.

    Every application contains delineated portions of its overall
    allocation profile which correspond closely to portions of the
    profiles of other applications.

    If a given profile performs poorly under a given GC, it is
    reasonable to infer that other applications having corresponding
    profiles also will perform poorly while those profiles are in
    force.

    An empty, circular observation. Very disappointing.

    That said ...



    GC systems - including their associated allocator(s) - are categorized (better word?) by their behavior. Unfortunately, behavior is
    described by a complex set of implementation choices.

    Understand that real world GC typically implement more than one
    algorithm, and often the algorithms are hybridized - derived from and relatable to published algorithms, but having unique mix of function
    that won't be found "as is" in any search of literature. [In truth,
    GC literature tends to leave a lot as exercise for the reader.]

    GC behavior often is adaptive, reacting to run time conditions: e.g.,
    based on memory fragmentation it could shift between non-moving
    mark/sweep and moving mark/compact. It may also employ differing
    algorithms simultaneously in different spaces, such as being
    conservative in stacks while being precise in dynamic heaps, or being stop-world in thread private heaps while being concurrent or parallel
    in shared heaps. Etc.


    Concurrent GC (aka incremental) runs as a co-routine with the mutator.
    These systems are distinguished by how they are triggered to run, and
    what bounds may be placed on their execution time. There are
    concurrent systems having completely deterministic operation [their
    actual execution time, of course, may depend on factors beyond the
    GC's control, such as multitasking, caching or paging.]

    Parallel GC may be both prioritized and scheduled. These systems may
    offer some guarantees about the percentage of (application) time given
    to collector vs mutator(s).


    Major choices:

    - precise or conservative?
    - moving or non-moving?
    - tracing (marking)?
    - copying / compacting?
    - stop-world, concurrent, or parallel?

    - single or multiple spaces?
    - semispaces?
    - generational?

    Minor choices:

    - software-only or hardware (MMU) assisted?
    - snapshot at beginning?

    - bump or block allocation?
    - allocation color?

    - free blocks coalesced? {if not compacting}

    - multiple mutators?
    - mutation color?
    - writable shared objects?
    - FROM-space mutation?
    - finalization?


    Note that all of these represent free dimensions in design. As
    mentioned above, any particular system may implement multiple
    collection algorithms each embodying a different set of choices.

    I'm familiar with many or perhaps most of the variations
    and techniques used in garbage collection. That isn't
    what I was asking about.

    You may wonder why I didn't mention "sweeping" ... essentially it is
    because sequential scan is more an implementation detail than a
    technique. Although "mark/sweep" is a well established technique, it
    is the marking (tracing) that really defines it. Then too, modern
    collectors often are neither mark/sweep nor copying as presented in textbooks: e.g., there are systems that mark and copy, systems that
    sweep and copy (without marking), and "copying" systems in which
    copies are logical and nothing actually changes address.

    Aside: all GC can be considered to use [logical] semispaces because
    all have the notion of segregated FROM-space and TO-space during
    collection. True semispaces are a set of (address) contiguous spaces
    - not necessarily equally sized - which are rotated as targets for new allocation. True semispaces do imply physical copying [but see the
    Treadmill for an example of "non-moving copy" using logical
    semispaces].



    So what do I consider to be the "kind" of a GC?

    The choices above pretty much define the extent of the design space
    [but note I did intentionally leave out reference counting]. However,
    the first 8 choices are structural, whereas the rest specify important characteristics but don't affect structure.

    A particular "kind" might be, e.g.,
    "precise, generational, multispace, non-moving, concurrent
    tracer".
    and so on.

    In effect what you are saying is that if we list all the possible
    attributes that a GC implementation might have, we can characterize
    what kind of GC it is by giving its value for each attribute. Not
    really a helpful statement.

    I'm guessing this probably didn't really answer your question,

    Your comments didn't address either of my questions, nor as best
    I can tell even make an effort to do so.

    but it was fun to write. ;-)

    I see. Next time I'll know better.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)