Branch miss logic versus clock frequency.
The branch miss logic for the current OoO version of Thor is quite
involved. It needs to back out the register source indexes to the last
valid source before the branch instruction. To do this in a single
cycle, the logic is about 25+ logic levels deep. I find this somewhat
unacceptable.
I can remove a lot of logic, improving the clock frequency
substantially, by removing the branch miss logic that resets the
register's source ID to the last valid source. Instead of stomping on
the instructions on a miss and flushing them in a single cycle, I think
the predicate for the instructions can be cleared, which will
effectively turn them into NOPs. The value of the target register will
be propagated in the reorder buffer, meaning the register's source ID
need not be reset. The reorder buffer is only eight entries, so on
average four entries would be turned into NOPs. The NOPs would still
propagate through the reorder buffer, so it may take several clock
cycles for them to be flushed from the buffer. This means the branch
latency for mispredicted branches would be quite high. However, if the
clock frequency can be improved by 20% for all instructions, much of
the lost performance on branches may be made up.
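
To make the idea concrete, here is a minimal C model of the
predicate-clearing scheme; the structure and field names are
illustrative, not Thor's actual signals:

/* Minimal model: on a branch miss, instead of rolling back register
   source IDs, clear the predicate of every ROB entry younger than the
   branch so it flows through the ROB and retires as a NOP. */
#include <stdbool.h>

#define ROB_ENTRIES 8

struct rob_entry {
    bool valid;
    bool predicate;  /* cleared => entry retires as a NOP */
    int  sn;         /* sequence number; older = smaller (no wrap here) */
};

static void branch_miss(struct rob_entry rob[ROB_ENTRIES], int branch_sn)
{
    /* No single-cycle rollback: the NOPped entries still occupy the
       ROB until they drain, which is where the extra miss latency
       comes from. */
    for (int i = 0; i < ROB_ENTRIES; i++)
        if (rob[i].valid && rob[i].sn > branch_sn)
            rob[i].predicate = false;
}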
Robert Finch wrote:
Basically it sounds like you want to eliminate the checkpoint and
rollback, and instead let resources be recovered at Retire. That could
work.
However you are not restoring the Renamer's future Register Alias
Table (RAT) to its state at the point of the mispredicted branch
instruction, which is what the rollback would have done, so its state
will be whatever it was at the end of the mispredicted sequence. That
needs to be re-sync'ed with the program state as of the branch.
That can be accomplished by stalling the front end, waiting until the
mispredicted branch reaches Retire, then copying the committed RAT,
maintained by Retire, to the future RAT at Rename, and restarting the
front end.
The list of free physical registers is then all those that are not
marked as architectural registers.
This is partly how I handle exceptions.
Also you still need a mechanism to cancel start of execution of the
subset of pending uOps for the purged set. You don't want to launch
a LD or DIV from the mispredicted set if it has not already started.
If you are using a reservation station design then you need some way
to distribute the cancel request to the various FU's and RS's,
and wait for them to clean themselves up.
Note that some things might not be able to cancel immediately,
like an in-flight MUL in a pipeline or an outstanding LD to the cache.
So some of this will be asynchronous (send cancel request, wait for ACK).
There are some other things that might need cleanup.
A Return Stack Predictor might be manipulated by the mispredicted path.
Not sure how to handle that without a checkpoint.
Maybe have two copies like RAT, a future one maintained by Decode and
a committed one maintained by Retire, and copy the committed to future.
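
A sketch of that re-sync and free-list rebuild in C, with assumed
structure names and sizes (32 architectural, 64 physical registers):

/* When the mispredicted branch reaches Retire: copy the committed RAT
   over the future RAT, then rebuild the free list as "everything not
   currently mapped to an architectural register". */
#include <stdbool.h>
#include <string.h>

#define NARCH 32
#define NPHYS 64

struct renamer {
    int  future_rat[NARCH];     /* maintained at Rename */
    int  committed_rat[NARCH];  /* maintained at Retire */
    bool free_reg[NPHYS];
};

static void resync_on_mispredict_retire(struct renamer *r)
{
    memcpy(r->future_rat, r->committed_rat, sizeof r->future_rat);

    /* Free list = all physical registers not marked architectural. */
    for (int p = 0; p < NPHYS; p++)
        r->free_reg[p] = true;
    for (int a = 0; a < NARCH; a++)
        r->free_reg[r->committed_rat[a]] = false;
}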
Decided to shelve Thor2024 and begin work on Thor2025. While Thor2024
is very good, there are a few issues with it. The ROB is used to store
register values, and that is effectively a CAM, which is not very
resource efficient in an FPGA. I have been researching an x86 OoO
implementation (https://www.stuffedcow.net/files/henry-thesis-phd.pdf)
done in an FPGA, and it turns out to be considerably smaller than
Thor. There are more efficient implementations for components than
what is currently in use.
Thor2025 will use a PRF approach, although using a PRF seems large to
me. To reduce the size and complexity of the register file, separate
register files will be used for float and integer operations, along
with separate register files for vector mask registers and subroutine
link registers. This set of register files limits the GPR file to only
3 write ports and 18 read ports to support all the functional units.
Currently the register file is 10r2w.
The trade-off is block RAM usage instead of LUTs.
While having separate register files seems like a step backwards, it
should ultimately make the hardware more resource efficient. It does
impact the ISA spec.
Robert Finch wrote:
Thor2025 will use a PRF approach although using a PRF seems large to me.
I have a PRF design I could show you--way too big for comp.arch and
with the requisite figures.
On 2023-11-15 2:11 p.m., MitchAlsup wrote:
I have a PRF design I could show you--way too big for comp.arch and
with the requisite figures.
Still digesting the PRF diagram.
Decided to go with a unified register file, 27r3w so far. Having
separate register files would not reduce the number of read ports
required and would add complexity to the processor.
Loads, FPU operations and flow control (FCU) operations all share the
third write port of the register file. The other two write ports are dedicated to the ALU results. I think this will be okay given <1% of instructions would be FCU updates. Loads are about 25%, and FPU depends
on the application.
The ALUs/FPU/Loads have five input operands including the 3 source
operands, a target operand, and a mask register. Stores do not need a
target operand. FCU ops are non-masked so do not need a mask register or target operand input.
Not planning to implement the vector register file as it would be immense.
Robert Finch wrote:
Still digesting the PRF diagram.
The diagram is for a 6R6W PRF with a history table, ARN->PRN translation, Free pool pickers, and register ports. The X with a ½ box is a latch
or flip-flop depending on the clocking that is put around the figure.
It also includes the renamer {history table and free pool pickers}.
Decided to go with a unified register file, 27r3w so far. Having
separate register files would not reduce the number of read ports
required and would add complexity to the processor.
9 Reads per 1 write ?!?!?
On 2023-11-18 2:41 p.m., Robert Finch wrote:
Q+ uses 64kB memory pages containing 8192 PTEs to map memory. A single
64kB page can handle 512MB of mappings. Tonight’s trade-off is how many root pointers to support. With a 12-bit ASID, 4096 root pointers are
required to link to the mapping tables with one root pointer for each
address space.
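
As a quick check of those numbers, and of the root-pointer table the
scheme implies (the table layout here is an assumption for
illustration, not Q+'s actual registers): 8192 PTEs in a 64kB page
means 8-byte PTEs, and 8192 x 64kB = 512MB mapped per table page.

#include <stdint.h>

#define PAGE_SIZE    (64u * 1024u)           /* 64kB                 */
#define PTE_SIZE     8u                      /* implied by 8192/page */
#define PTES_PER_PG  (PAGE_SIZE / PTE_SIZE)  /* 8192                 */

#define ASID_BITS    12
#define NUM_ROOTS    (1u << ASID_BITS)       /* 4096 root pointers   */

/* One root pointer per address space, indexed by ASID; a task switch
   just changes the ASID, the table itself stays put. 4096 * 8 bytes
   is a 32kB table. */
static uint64_t root_table[NUM_ROOTS];

static uint64_t root_for(uint16_t asid)
{
    return root_table[asid & (NUM_ROOTS - 1)];
}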
Robert Finch wrote:
With a 12-bit ASID, 4096 root pointers are required to link to the
mapping tables with one root pointer for each address space.
So, you associate a single ROOT pointer VALUE with an ASID, and manage
in SW who gets that ROOT pointer VALUE; using ASID as an index into
Virtual Address Spaces.
How is this usefully different than only using the ASID to qualify TLB
results ?? <Was this TLB entry installed from the same ASID as is
accessing right now>. And using ASID as an index into any array might
lead to some conundrum down the road a piece.
Secondarily, SUN started out with 12-bit ASID and went to 16-bits just
about as fast as they could--even before main memories went bigger
than 4GB.
On 2023-11-24 8:00 p.m., MitchAlsup wrote:
How is this usefully different than only using the ASID to qualify TLB
results ??
I view the address space as an entity in its own right, to be managed
by the MMU. ASIDs and address spaces should be mapped 1:1. The ASID
that identifies the address space has a life outside of just the TLB.
I may be increasing the typical scope of an ASID.
It is the same idea as using the ASID to qualify TLB entries, except
that it qualifies the root pointer as well. So, the root pointer does
not need to be switched by software. Once the root pointer is set for
the AS it simply sits there statically until the AS is reused.
I am using the ASID like a process ID. So, the root pointer register
does not need to be reset on a task switch. Address spaces may not be
mapped 1:1 with processes. An address space may outlive a task if it
is shared with another task. So, I do not want to use the PID to
distinguish tables. This assumes the address space will not be freed
up and reused by another task if there are tasks still using the ASID.
4096 address spaces is a lot. But if using a 16-bit ASID it would no
longer be practical to store a root pointer per ASID in a table.
Instead, the root pointer would have to be managed by software as is
normally done.
I am wondering why the 16-bit ASID? 256 address spaces in 256
processes? I suspect it is just because 16 bits is easier to pass
around/calculate in a HLL than some other value like 14 bits. Are
65536 address spaces really needed?
On 11/24/2023 8:28 PM, Robert Finch wrote:
Are 65536 address spaces really needed?
If one assumes one address space per PID, then one is going to hit a
limit of 4K a lot faster than 64K, and when one hits the limit, there is
no good way to "reclaim" previously used address spaces short of
flushing the TLB to be sure that no entries from that space remain in
the TLB (ASID thrashing is likely to be relatively expensive to deal
with as a result).
Well, along with other things, like if/how to allow "Global" pages:
True global pages are likely a foot gun, as there is no way to exclude
them from a given process (where there may be a need to do so);
Disallowing global pages entirely means higher TLB miss rates because no processes can share TLB entries.
One option seems to be, say, that a few of the high-order bits of the
ASID could be used as a "page group", with global pages only applying
within a single page-group (possibly with one of the page groups being designated as "No global pages allowed").
...
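
One way the page-group idea could look in the TLB match logic,
assuming a 16-bit ASID whose top 4 bits name the group (the widths are
made up for illustration):

#include <stdbool.h>
#include <stdint.h>

#define GROUP_SHIFT 12   /* top 4 bits of a 16-bit ASID = page group */

/* A global entry matches any ASID in the same page group, rather than
   every ASID in the system; a non-global entry needs an exact match. */
static bool tlb_asid_match(uint16_t entry_asid, bool entry_global,
                           uint16_t cur_asid)
{
    if (entry_global)
        return (entry_asid >> GROUP_SHIFT) == (cur_asid >> GROUP_SHIFT);
    return entry_asid == cur_asid;
}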
On 2023-11-24 10:16 p.m., BGB wrote:
I see after reading several webpages that the root pointer is used to
point to only a single table for a process. This is not how I was
doing things. I have MMU tables for each address space as opposed to
having a table for the process. The process may have only a single
address space, or it may use several address spaces. I am wondering
why there is only a single table per process.
Global space can be assigned by designating an address space as a
global space and giving it an ASID. All processes wanting access to
the global space need only then use the MMU table for that ASID. E.g.,
use ASID 0 for the global address space.
Robert Finch <robfi680@gmail.com> writes:
I am wondering why there is only a single table per process.
There are actually two in most operating systems - the lower half
of the VA space is owned by the user-mode code in the process and
the upper half is shared by all processes and used by the
operating system on behalf of the process. For Intel/AMD, the
kernel manages both halves; for ARMv8, each half has a completely
distinct and separate root pointer (at each exception level).
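
A simplified model of that split, roughly how ARMv8's TTBR0/TTBR1
selection behaves (real ARMv8 derives the split from TCR fields rather
than just the top VA bit):

#include <stdint.h>

struct mmu_state {
    uint64_t ttbr0;   /* root for the lower (user) half   */
    uint64_t ttbr1;   /* root for the upper (kernel) half */
};

/* The table walk starts from whichever root the VA's half selects. */
static uint64_t walk_root(const struct mmu_state *m, uint64_t va)
{
    return (va >> 63) ? m->ttbr1 : m->ttbr0;
}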
Robert Finch wrote:
I view the address space as an entity in its own right, to be managed
by the MMU. ASIDs and address spaces should be mapped 1:1.
Consider the case where two different processes MMAP the same area
of memory.
Should they both end up using the same ASID ??
Should they both take extra TLB walks because they use different ASIDs ??
mitchalsup@aol.com (MitchAlsup) writes:
Consider the case where two different processes MMAP the same area
of memory.
In which case, the area of memory would be mapped to different
virtual address ranges in each process, and thus naturally
consume two TLB entries.
FWIW, MAP_FIXED is specified as an optional feature by POSIX
and may not be supported by the OS at all.
Given various forms of ASLR being used, it's unlikely even in
two instances of the same executable that a call to mmap
with MAP_SHARED without MAP_FIXED would map the region at
the same virtual address in both processes.
Should they both end up using the same ASID ??
They couldn't share an ASID assuming the TLB looks up by VA.
Should they both take extra TLB walks because they use different ASIDs ??
Given the above, yes. It's likely they'll each be scheduled
on different cores anyway in any modern system.
Scott Lurndal wrote:
Consider the case where two different processes MMAP the same area
of memory.
In which case, the area of memory would be mapped to different
virtual address ranges in each process, and thus naturally
consume two TLB entries.
MMAP() first, fork() second. Now we have 2 processes with the
memory-mapped shared memory at the same address.
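
That case as a runnable POSIX example: map shared anonymous memory
first, fork second, and both processes see the region at the same
virtual address.

#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int *p = mmap(NULL, sizeof *p, PROT_READ | PROT_WRITE,
                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    if (fork() == 0) {            /* child: same VA, same memory */
        *p = 42;
        printf("child:  %p -> %d\n", (void *)p, *p);
        _exit(0);
    }
    wait(NULL);
    printf("parent: %p -> %d\n", (void *)p, *p);  /* prints 42 */
    return 0;
}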
Are top-level page directory pages shared between tasks? Suppose a
task needs a 32-bit address space. With one level of page maps, 27
bits are accommodated; that leaves 5 bits of address translation to be
done by the page directory. Using a whole page, which can handle 11
address bits, would be wasteful. But if root pointers could point into
the same page directory page then the space would not be wasted. For
instance, the root pointer for task #1 could point to the first 32
entries, the root pointer for task #2 could point into the next 32
entries, and so on.
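
The arithmetic of that sharing, assuming 8-byte directory entries (so
each task's 32-entry slice is 256 bytes of the shared page):

#include <stdint.h>

#define DIR_ENTRY_SIZE   8u
#define ENTRIES_PER_TASK 32u   /* 5 bits of top-level translation */

/* slot 0 -> first 32 entries, slot 1 -> next 32, and so on; many root
   pointers share one directory page instead of wasting a page per
   task. */
static uint64_t task_root(uint64_t dir_page_base, unsigned slot)
{
    return dir_page_base
         + (uint64_t)slot * ENTRIES_PER_TASK * DIR_ENTRY_SIZE;
}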
scott@slp53.sl.home (Scott Lurndal) writes:
mitchalsup@aol.com (MitchAlsup) writes:
Consider the case where two different processes MMAP the same area
of memory.
In which case, the area of memory would be mapped to different
virtual address ranges in each process,
Says who? Unless the user process asks for MAP_FIXED or the address
range is already occupied in the user process, nothing prevents the OS
from putting the shared area at the same address in each process. If
the permissions are also the same, the OS can then use one ASID for
the shared area.
This would be especially useful for the read-only sections (e.g.,
code) of common libraries like libc. However, in today's security
landscape, you don't want one process to know where library code is
mapped in other processes (i.e., you want ASLR), so we can no longer
make use of that benefit. And it's doubtful whether other uses are
worth the complications (and even if they are, there might be security
issues, too).
FWIW, MAP_FIXED is specified as an optional feature by POSIX
and may not be supported by the OS at all.
As usual, what is specified by a common-subset standard is not
relevant for what an OS implementor has to do if they want to supply
more than a practically unusable checkbox feature like the POSIX
subsystem for Windows. There is a reason why WSL2 includes a full
Linux kernel.
Should they both end up using the same ASID ??
They couldn't share an ASID assuming the TLB looks up by VA.
Of course the TLB looks up by VA, what else. But if the VA is the
same and the PA is the same, the same ASID can be used.
Anton Ertl wrote:
Says who? Unless the user process asks for MAP_FIXED or the address
range is already occupied in the user process, nothing prevents the OS
from putting the shared area at the same address in each process.
If the mapping range is being selected dynamically, the chance that a
range will already be in use goes up with the number of sharers.
At some point when a new member tries to join the sharing group
the map request will be denied.
Software that does not want to have a mapping request fail should
assume that a shared area will be mapped at a different address in
each process. That implies one should not assume that virtual
addresses can be passed, but instead use, say, section-relative
offsets to build a linked list.
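
That convention as a small C example: nodes in the shared area link by
offsets from the region base, so each process can use its own mapping
address.

#include <stddef.h>
#include <stdint.h>

struct node {
    uint64_t next_off;   /* offset from region base; 0 = end of list */
    int      payload;
};

static struct node *node_at(void *base, uint64_t off)
{
    return off ? (struct node *)((char *)base + off) : NULL;
}

/* Each process passes its own mapping address as 'base'; the offsets
   stored inside the region are valid for all of them. */
static int sum_list(void *base, uint64_t head_off)
{
    int sum = 0;
    for (struct node *n = node_at(base, head_off); n;
         n = node_at(base, n->next_off))
        sum += n->payload;
    return sum;
}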
On 11/26/2023 9:45 AM, Anton Ertl wrote:
you don't want one process to know where library code is mapped in
other processes (i.e., you want ASLR), so we can no longer make use of
that benefit.
It seems to me, as long as it is a different place on each system,
that is probably good enough. Demanding a different location in each
process would create a lot of additional memory overhead from things
like base relocations or similar.
There is a reason why WSL2 includes a full Linux kernel.
Still using WSL1 here as for whatever reason hardware virtualization has
thus far refused to work on my PC, and is apparently required for WSL2.
I can add this to my list of annoyances, like I can install "just short
of 128GB", but putting in the full 128GB causes my PC to be like "Oh
Crap, I guess there is 3.5GB ..." (but, apparently "112GB with unmatched
RAM sticks is fine I guess...").
But, yeah, the original POSIX is an easier goal to achieve, vs, say, the ability to port over the GNU userland.
A lot of it is doable, but things like fork+exec are a problem if one
wants to support NOMMU operation or otherwise run all of the logical processes in a shared address space.
A practical alternative is something more like a CreateProcess style
call, but this is "not exactly POSIX". In theory though, one could treat "fork()" more like "vfork()" and then turn the exec* call into a CreateProcess call and then terminate the current thread. Wouldn't
really work "in general" though, for programs that expect to be able to "fork()" and then continue running the current program as a sub-process.
Should they both end up using the same ASID ??
They couldn't share an ASID assuming the TLB looks up by VA.
Of course the TLB looks up by VA, what else. But if the VA is the
same and the PA is the same, the same ASID can be used.
?...
Typically the ASID applies to the whole virtual address space, not to individual memory objects.
Or, at least, my page-table scheme doesn't have a way to express
per-page ASIDs (merely if a page is Private/Shared, with the results of
this partly depending on the current ASID given for the page-table).
Where, say, I am mostly using 64-bit entries in the page-table, as going
to a 128-bit page-table format would be a bit steep.
Say, PTE layout (16K pages):
(63:48): ACLID
(47:14): Physical Address.
(13:12): Address or OS flag.
(11:10): For use by OS
( 9: 0): Base page-access and similar.
(9): S1 / U1 (Page-Size or OS Flag)
(8): S0 / U0 (Page-Size or OS Flag)
(7): No User (Supervisor Only)
(6): No Execute
(5): No Write
(4): No Read
(3): No Cache
(2): Dirty (OS, ignored by TLB)
(1): Private/Shared (MBZ if not Valid)
(0): Present/Valid
Where, ACLID serves as an index into the ACL table, or to lookup the
VUGID parameters for the page (well, along with an alternate PTE variant
that encodes VUGID directly, but reduces the physical address to 36
bits). It is possible that the original VUGID scheme may be phased out
in favor of using exclusively ACL checking.
Note that the ACL checks don't add new permissions to a page, they add further restrictions (with the base-access being the most permissive).
Some combinations of flags are special, and encode a few edge-case
modes; such as pages which are Read/Write in Supervisor mode but
Read-Only in user mode (separate from the possible use of ACL's to mark
pages as read-only for certain tasks).
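
The stated layout, transcribed into C field extractors (just a reading
of the list above, not an authoritative header for this core):

#include <stdint.h>

#define PTE_ACLID(p)   ((unsigned)((p) >> 48))           /* 63:48 */
#define PTE_PADDR(p)   ((p) & 0x0000FFFFFFFFC000ull)     /* 47:14 */
#define PTE_OSBITS(p)  ((unsigned)(((p) >> 10) & 0x3u))  /* 11:10 */

#define PTE_NOUSER   (1u << 7)   /* supervisor only        */
#define PTE_NOEXEC   (1u << 6)
#define PTE_NOWRITE  (1u << 5)
#define PTE_NOREAD   (1u << 4)
#define PTE_NOCACHE  (1u << 3)
#define PTE_DIRTY    (1u << 2)   /* OS use, ignored by TLB */
#define PTE_PRIVATE  (1u << 1)   /* MBZ if not valid       */
#define PTE_VALID    (1u << 0)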
But, FWIW, I ended up adding an extended MAP_GLOBAL flag for "mmap'ed
space should be visible to all of the processes"; which in turn was used
as part of the backing memory for the "GlobalAlloc" style calls (it is
not a global heap, in that each process still manages the memory
locally, but other intersecting processes can see the address within
their own address spaces).
Well, along with a MAP_PHYSICAL flag, for if one needs memory where
VA==PA (this may fail, with the mmap returning NULL, effectively only
allowed for "superusermode"; mostly intended for hardware interfaces).
The usual behavior of MAP_SHARED didn't really make sense outside of the context of mapping a file, and didn't really serve the needed purpose
(say, one wants to hand off a pointer to a bitmap buffer to the GUI
subsystem to have it drawn into a window).
It is also being used for things like shared scratch buffers, say, for passing BITMAPINFOHEADER and MIDI commands and similar across the interprocess calls (the C API style wrapper wraps a lot of this; whereas
the internal COM-style interfaces will require any pointer-style
arguments to point to shared memory).
This is not required for normal syscall handlers, where the usual
assumption is that normal syscalls will have some means of directly
accessing the address space of the caller process. I didn't really want
to require that TKGDI have this same capability.
It is debatable whether calls like BlitImage and similar should require global memory, or merely recommend it (potentially having the call fall
back to a scratch buffer and internal memcpy if the passed bitmap image
is not already in global memory).
I had originally considered a more complex mechanism for object sharing,
but then ended up going with this for now partly because it was easier
and lower overhead (well, and also because I wanted something that would still work if/when I started to add proper memory protection). May make
sense to impose a limit on per-process global alloc's though (since it
is intended specifically for shared buffers and not for general heap allocation; where for heap allocation ANONYMOUS+PRIVATE would be used instead).
Though, looking at stuff, MAP_GLOBAL semantics may have also been
partially covered by "MAP_ANONYMOUS|MAP_SHARED"?... Though, the
semantics aren't the same.
I guess, another alternative would have been to use shm_open+mmap or
similar.
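
For reference, the shm_open+mmap route is plain POSIX; a minimal
sketch (the object name "/bitmapbuf" in the usage comment is made up):

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create-or-open a named shared object and map it; each process that
   calls this gets the same memory, possibly at a different address. */
static void *map_shared(const char *name, size_t len)
{
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);   /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : p;
}
/* e.g. map_shared("/bitmapbuf", 1 << 20) in both processes. */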
Where, say, memory map will look something like:
00yy_xxxxxxxx: Physical mapped and direct-mapped memory.
00yy_xxxxxxxx: Start of global virtual memory (*1);
3FFF_xxxxxxxx: End of global virtual memory;
4000_xxxxxxxx: Start of private/local virtual memory (possible, *2);
7FFF_xxxxxxxx: End of private/local virtual memory (possible);
8000_xxxxxxxx: Start of kernel virtual memory;
BFFF_xxxxxxxx: End of kernel virtual memory;
Cxxx_xxxxxxxx: Physical Address Range (Cached);
Dxxx_xxxxxxxx: Physical Address Range (Volatile/NoCache);
Exxx_xxxxxxxx: Reserved;
Fxxx_xxxxxxxx: MMIO and similar.
*1: The 'yy' division point may move, will depend on things like how
much RAM exists (currently, 00/01; no current "sane cost" FPGA boards
having more than 256 or 512 MB of RAM).
*2: If I go to a scheme of giving processes their own address spaces,
then private memory will be used. It is likely that executable code may remain shared, but the data sections and heap would be put into private address ranges.
BGB <cr88192@gmail.com> writes:
Where, say, memory map will look something like:
00yy_xxxxxxxx: Physical mapped and direct-mapped memory.
...
The modern preference is to make the memory map flexible.
Linux, for example, requires that PCI Base Address Registers
be programmable by the operating system, and the OS can
choose any range (subject to host bridge configuration, of
course) for the device.
It is notable that even on non-intel systems, one may need
to map a 32-bit PCI BAR (AHCI is the classic example) which
requires the address programmed in the BAR to be less than
0x100000000. Granted systems can have custom PCI controllers
that remap that into the larger physical address space with
a bit of extra hardware, however the kernel people don't
like that at all since there is no universal standard for
such remapping and they don't want to support
dozens of independent implementations, constantly
changing from generation to generation.
Many modern SoCs (and ARM SBSA requires this) make their
on-board devices and coprocessors look like PCI express
devices to software, and SBSA requires the PCIe ECAM
region for device discovery. Here again, each of
these on board devices will have from one to six
memory region base address registers (or one to
three for 64-bit bars).
Encoding memory attributes into the address is common
in microcontrollers, but in a general purpose processor
constrains the system to an extent sufficient to make it
unattractive for general purpose workloads.
Scott Lurndal wrote:
BGB <cr88192@gmail.com> writes:
On 11/26/2023 9:45 AM, Anton Ertl wrote:
Where, say, memory map will look something like:
00yy_xxxxxxxx: Physical mapped and direct-mapped memory.
00yy_xxxxxxxx: Start of global virtual memory (*1);
3FFF_xxxxxxxx: End of global virtual memory;
4000_xxxxxxxx: Start of private/local virtual memory (possible, *2);
7FFF_xxxxxxxx: End of private/local virtual memory (possible);
8000_xxxxxxxx: Start of kernel virtual memory;
BFFF_xxxxxxxx: End of kernel virtual memory;
Cxxx_xxxxxxxx: Physical Address Range (Cached);
Dxxx_xxxxxxxx: Physical Address Range (Volatile/NoCache);
Exxx_xxxxxxxx: Reserved;
Fxxx_xxxxxxxx: MMIO and similar.
The modern preference is to make the memory map flexible.
// cacheable, used, modified bits
CUM kind of access
--- ------------------------------
000 uncacheable DRAM
001 MMI/O
010 config
011 ROM
1xx cacheable DRAM
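Read as C, that table decodes roughly as follows (my reading of the
field, not code from the actual design):

/* Decode the 3-bit C,U,M field per the table above. */
static const char *cum_kind(unsigned cum)
{
    if (cum & 4) return "cacheable DRAM";    /* 1xx */
    switch (cum & 3) {
    case 0:  return "uncacheable DRAM";      /* 000 */
    case 1:  return "MMI/O";                 /* 001 */
    case 2:  return "config";                /* 010 */
    default: return "ROM";                   /* 011 */
    }
}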
Linux, for example, requires that PCI Base Address Registers
be programmable by the operating system, and the OS can
choose any range (subject to host bridge configuration, of
course) for the device.
Easily done, just create an uncacheable PTE and set UM to 10
for config space or 01 for MMI/O space.
It is notable that even on non-intel systems, one may need
to map a 32-bit PCI BAR (AHCI is the classic example) which
requires the address programmed in the bar to be less than
0x1_0000_0000.
I/O MMU translates these devices from a 32-bit VAS into the 64-bit PAS.
Granted systems can have custom PCI controllers
that remap that into the larger physical address space with
a bit of extra hardware, however the kernel people don't
like that at all since there is no universal standard for
such remapping and they don't want to support
dozens of independent implementations, constantly
changing from generation to generation.
What they figure is that they are already supporting 4 incompatible
mapping systems {Intel, AMD, ARM, RISC-V}; you would have thought
they had gotten good at these implementations :-)
Many modern SoCs (and ARM SBSA requires this) make their
on-board devices and coprocessors look like PCI express
devices to software,
I made the CPU/cores in My 66000 have a configuration port
that is set up during boot and smells just like a PCIe
port.
and SBSA requires the PCIe ECAM
region for device discovery. Here again, each of
these on board devices will have from one to six
memory region base address registers (or one to
three for 64-bit bars).
Encoding memory attributes into the address is common
in microcontrollers, but in a general purpose processor
constrains the system to an extent sufficient to make it
unattractive for general purpose workloads.
Agreed.
On 11/26/2023 9:45 AM, Anton Ertl wrote:
This would be especially useful for the read-only sections (e.g, code)
of common libraries like libc. However, in todays security landscape,
you don't want one process to know where library code is mapped in
other processes (i.e., you want ASLR), so we can no longer make use of
that benefit. And it's doubtful whether other uses are worth the
complications (and even if they are, there might be security issues,
too).
It seems to me, as long as it is a different place on each system, it is
probably good enough. Demanding a different location in each process
would create a lot of additional memory overhead due to things like
base-relocations or similar.
Of course the TLB looks up by VA, what else. But if the VA is the
same and the PA is the same, the same ASID can be used.
?...
Typically the ASID applies to the whole virtual address space, not to
individual memory objects.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:...
scott@slp53.sl.home (Scott Lurndal) writes:
FWIW, MAP_FIXED is specified as an optional feature by POSIX
and may not be supported by the OS at all.
As usual, what is specified by a common-subset standard is not
relevant for what an OS implementor has to do if they want to supply
more than a practically unusable checkbox feature like the POSIX
subsystem for Windows. There is a reason why WSL2 includes a full
Linux kernel.
Because the semantics of MAP_FIXED are to unmap any
prior mapping in the range, if the implementation had happened to
allocate the heap or shared System V region at that address, the heap
would have become corrupt with dangling references hanging
around which, if stored into, would subsequently corrupt the mapped region.
BGB <cr88192@gmail.com> writes:
On 11/26/2023 9:45 AM, Anton Ertl wrote:
This would be especially useful for the read-only sections (e.g, code)
of common libraries like libc. However, in todays security landscape,
you don't want one process to know where library code is mapped in
other processes (i.e., you want ASLR), so we can no longer make use of
that benefit. And it's doubtful whether other uses are worth the
complications (and even if they are, there might be security issues,
too).
It seems to me, as long as it is a different place on each system,
probably good enough. Demanding a different location in each process
would create a lot of additional memory overhead due to things like
base-relocations or similar.
If the binary is position-independent (the default on Linux on AMD64),
there is no such overhead.
I just started the same binary twice and looked at the address of the
same piece of code:
Gforth 0.7.3, Copyright (C) 1995-2008 Free Software Foundation, Inc.
Gforth comes with ABSOLUTELY NO WARRANTY; for details type `license'
Type `bye' to exit
see open-file
Code open-file
0x000055c2b76d5833 <gforth_engine+6595>: mov %r15,0x1c126(%rip)
...
For the other process the same instruction is:
Code open-file
0x000055dd606e4833 <gforth_engine+6595>: mov %r15,0x1c126(%rip)
Following the calls until I get to glibc, I get, for the two processes:
0x00007f705c0c3b90 <__libc_open64+0>: push %r12
0x00007f190aa34b90 <__libc_open64+0>: push %r12
So not just the binary, but also glibc resides at different virtual
addresses in the two processes.
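The same experiment is easy to reproduce without a debugger. A trivial
PIE program printing a code address and a libc symbol address shows
different values on every run under ASLR (&printf may resolve to the
binary's PLT stub rather than glibc proper, but it moves between runs
either way):

#include <stdio.h>

int main(void)
{
    printf("main   = %p\n", (void *)&main);    /* code address */
    printf("printf = %p\n", (void *)&printf);  /* libc (or PLT) */
    return 0;
}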
So obviously the Linux and glibc maintainers think that per-system
ASLR is not good enough. Evidently they want ASLR to work as well as
possible against local attackers, too.
Of course the TLB looks up by VA, what else. But if the VA is the
same and the PA is the same, the same ASID can be used.
?...
Typically the ASID applies to the whole virtual address space, not to
individual memory objects.
Yes, one would need more complicated ASID management than setting
"the" ASID on switching to a process if different VMAs in the process
have different ASIDs. Another reason not to go there.
Power (and IIRC HPPA) do something in this direction with their
"segments", where the VA space was split into 16 equal parts, and
IIRC the 16 parts each extended the address by 16 bits (minus the 4
bits of the segment number), so essentially they have 16 16-bit ASIDs.
The address spaces are somewhat inflexible, but with 64-bit VAs
(i.e. 60-bit address spaces) that may be good enough for quite a
while. The cost is that you now have to manage 16 ASID registers.
And if we ever get to actually making use of more than 60 bits of VA in
other ways, combining this ASID scheme with those other uses of the VAs
becomes a problem.
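For concreteness, the segment scheme described amounts to something
like this sketch, with the field widths and names invented here: the top
4 VA bits select a segment register whose 16-bit ASID extends the TLB
lookup tag.

#include <stdint.h>

struct cpu_segs { uint16_t seg_asid[16]; };

/* The TLB is looked up on {segment ASID, low 60 bits of the VA}. */
static uint64_t tlb_tag(const struct cpu_segs *s, uint64_t va,
                        uint16_t *asid_out)
{
    *asid_out = s->seg_asid[(va >> 60) & 0xF];
    return va & 0x0FFFFFFFFFFFFFFFull;
}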
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:...
scott@slp53.sl.home (Scott Lurndal) writes:
FWIW, MAP_FIXED is specified as an optional feature by POSIX
and may not be supported by the OS at all.
As usual, what is specified by a common-subset standard is not
relevant for what an OS implementor has to do if they want to supply
more than a practically unusable checkbox feature like the POSIX
subsystem for Windows. There is a reason why WSL2 includes a full
Linux kernel.
Because the semantics of MAP_FIXED are to unmap any
prior mapping in the range, if the implementation had happened to
allocate the heap or shared System V region at that address, the heap
would have become corrupt with dangling references hanging
around which, if stored into, would subsequently corrupt the mapped region.
Of course you can provide an address without specifying MAP_FIXED, and
a high-quality OS will satisfy the request if possible (and return a
different address if not), while a work-to-rule OS like the POSIX
subsystem for Windows may then treat that address as if the user had
passed NULL.
Interestingly, Linux (since 4.17) also provides MAP_FIXED_NOREPLACE,
which works like MAP_FIXED except that it returns an error if
MAP_FIXED would replace part of an existing mapping. Makes me wonder
if in the no-conflict case, and given a page-aligned addr, there is any
difference between MAP_FIXED, MAP_FIXED_NOREPLACE and just providing
an address without any of these flags in Linux. In the conflict case,
the difference between the latter two variants is how you detect that
it did not work as desired.
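A sketch of the variants being compared, under Linux semantics
(MAP_FIXED_NOREPLACE needs kernel 4.17+; the fallback policy here is
made up):

#define _GNU_SOURCE
#include <sys/mman.h>

static void *map_at_hint(void *hint, size_t len)
{
    /* Plain hint: the kernel may return a different address. */
    void *p = mmap(hint, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED || p == hint)
        return p;
    munmap(p, len);    /* hint not honored; try the strict variant */
#ifdef MAP_FIXED_NOREPLACE
    /* Fails with EEXIST instead of clobbering a live mapping the
       way MAP_FIXED would. */
    return mmap(hint, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE,
                -1, 0);
#else
    return MAP_FAILED;
#endif
}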
I've never seen a case where using MAP_FIXED was useful, and I've
been using mmap since the early 90's.
The Q+ register file is implemented with one block-RAM per read port.
With a 64-bit width this gives 512 registers in a block RAM. 192
registers are needed for renaming a 64-entry architectural register
file. That leaves 320 registers unused. My thought was to support two
banks of registers, one for the highest operating mode, and the other
for remaining operating modes. On exceptions the register bank could be
switched. But to do this there are now 128 registers effectively being
renamed, which leads to 384 physical registers to manage. This doubles
the size of the register management code. Unless a pipeline flush
occurs for exception processing, which I think would allow the renamer
to reuse the same hardware to manage a new bank of registers. But that
hinges on all references to registers in the current bank being unused.
My other thought was that with approximately three times the number of architectural registers required, using 256 physical registers would
allow 85 architectural registers. Perhaps some of the registers could be banked for different operating modes. Banking four registers per mode
would use up 16.
If the 512-register file were divided by three, 170 physical registers
could be available for renaming. This is less than the ideal 192
registers but maybe close enough to not impact performance adversely.
Robert Finch wrote:
The Q+ register file is implemented with one block-RAM per read port.
With a 64-bit width this gives 512 registers in a block RAM. 192
registers are needed for renaming a 64-entry architectural register
file. That leaves 320 registers unused. My thought was to support two
banks of registers, one for the highest operating mode, and the other
for remaining operating modes. On exceptions the register bank could be
switched. But to do this there are now 128-register effectively being
renamed which leads to 384 physical registers to manage. This doubles
the size of the register management code. Unless, a pipeline flush
occurs for exception processing which I think would allow the renamer to
reuse the same hardware to manage a new bank of registers. But that
hinges on all references to registers in the current bank being unused.
My other thought was that with approximately three times the number of
architectural registers required, using 256 physical registers would
allow 85 architectural registers. Perhaps some of the registers could be
banked for different operating modes. Banking four registers per mode
would use up 16.
If the 512-register file were divided by three, 170 physical registers
could be available for renaming. This is less than the ideal 192
registers but maybe close enough to not impact performance adversely.
I don't understand the problem.
You want 64 architecture registers, each of which needs a physical
register, plus 128 registers for in-flight instructions, so 192
physical registers.
If you add a second bank of 64 architecture registers for interrupts
then each needs a physical register. But that doesn't change the number
of in-flight registers so that's 256 physical total.
Plus two sets of rename banks, one for each mode.
If you drain the pipeline before switching register banks then all
of the 128 in-flight registers will be free at the time of switch.
If you can switch to interrupt mode without draining the pipeline then
some of those 128 will be in-use for the old mode, some for the new mode
(and the uOps carry a privilege mode flag so you can do things like
check LD or ST ops against the appropriate PTE mode access control).
On 2023-11-30 3:30 p.m., MitchAlsup wrote:
EricP wrote:
Robert Finch wrote:
The Q+ register file is implemented with one block-RAM per read port.
With a 64-bit width this gives 512 registers in a block RAM. 192
registers are needed for renaming a 64-entry architectural register
file. That leaves 320 registers unused. My thought was to support two
banks of registers, one for the highest operating mode, and the other
for remaining operating modes. On exceptions the register bank could
be switched. But to do this there are now 128-register effectively
being renamed which leads to 384 physical registers to manage. This
doubles the size of the register management code. Unless, a pipeline
flush occurs for exception processing which I think would allow the
renamer to reuse the same hardware to manage a new bank of registers.
But that hinges on all references to registers in the current bank
being unused.
My other thought was that with approximately three times the number
of architectural registers required, using 256 physical registers
would allow 85 architectural registers. Perhaps some of the registers
could be banked for different operating modes. Banking four registers
per mode would use up 16.
If the 512-register file were divided by three, 170 physical
registers could be available for renaming. This is less than the
ideal 192 registers but maybe close enough to not impact performance
adversely.
I don't understand the problem.
You want 64 architecture registers, each of which needs a physical
register, plus 128 registers for in-flight instructions, so 192
physical registers.
If you add a second bank of 64 architecture registers for interrupts
then each needs a physical register. But that doesn't change the number
of in-flight registers so thats 256 physical total.
Plus two sets of rename banks, one for each mode.
If you drain the pipeline before switching register banks then all
of the 128 in-flight registers will be free at the time of switch.
A couple of bits of state and you don't need to drain the pipeline,
you just have to find the youngest instruction with the property that
all older instructions cannot raise an exception; these can be
allowed to finish execution while you are fetching instructions for
the new context.
Not quite comprehending. Will not the registers for the new context be improperly mapped if there are registers in use for the old map?
I think
a state bit could be used to pause a fetch of a register still in use in
the old map, but that is draining the pipeline anyway.
When the context swaps, a new set of target registers is always
established before the registers are used.
So incoming references in the
new context should always map to the new registers?
If you can switch to interrupt mode without draining the pipeline then
some of those 128 will be in-use for the old mode, some for the new mode
(and the uOps carry a privilege mode flag so you can do things like
check LD or ST ops against the appropriate PTE mode access control).
And 1 bit of state keeps track of which is which.
Did some experimenting and the RAT turns out to be too large if more registers are incorporated. Even as few as 256 regs caused the RAT to increase in size substantially. So, I may go the alternate route of
making registers wider rather than deeper, having 128-bit wide registers
instead.
There is an eight-bit sequence number associated with each instruction,
so the age of an instruction can easily be determined. I found a really
slick way of detecting instruction age using a matrix approach on the
web, but I did not fully understand it, so I just use eight-bit counters
for now.
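For what it's worth, the matrix approach amounts to roughly the
following (a generic sketch, not any particular paper's version): an NxN
bit matrix where allocating entry i snapshots the currently valid
entries as "older than i", and retiring i clears its column everywhere.

#include <stdint.h>

#define N 8                 /* kept small; the ROB would set this */
static uint8_t older[N];    /* bit j of older[i]: j is older than i */
static uint8_t valid;       /* currently allocated entries */

static void alloc_entry(int i)
{
    older[i] = valid;              /* everything live is older */
    valid |= (uint8_t)(1u << i);
}

static void retire_entry(int i)
{
    valid &= (uint8_t)~(1u << i);
    for (int j = 0; j < N; j++)    /* clear column i */
        older[j] &= (uint8_t)~(1u << i);
}

/* Oldest ready entry: a ready entry with no older ready entry. */
static int oldest_ready(uint8_t ready)
{
    for (int i = 0; i < N; i++)
        if ((ready & (1u << i)) && (older[i] & ready) == 0)
            return i;
    return -1;
}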
There is a two bit privilege mode flag for instructions in the ROB. I
suppose the ROB entries could be called uOps.
On 2023-11-30 6:06 p.m., MitchAlsup wrote:
Robert Finch wrote:
On 2023-11-30 3:30 p.m., MitchAlsup wrote:
EricP wrote:
Robert Finch wrote:
The Q+ register file is implemented with one block-RAM per read
port. With a 64-bit width this gives 512 registers in a block RAM.
192 registers are needed for renaming a 64-entry architectural
register file. That leaves 320 registers unused. My thought was to
support two banks of registers, one for the highest operating mode,
and the other for remaining operating modes. On exceptions the
register bank could be switched. But to do this there are now
128 registers effectively being renamed which leads to 384 physical
registers to manage. This doubles the size of the register
management code. Unless a pipeline flush occurs for exception
processing which I think would allow the renamer to reuse the same
hardware to manage a new bank of registers. But that hinges on all
references to registers in the current bank being unused.
My other thought was that with approximately three times the number
of architectural registers required, using 256 physical registers
would allow 85 architectural registers. Perhaps some of the
registers could be banked for different operating modes. Banking
four registers per mode would use up 16.
If the 512-register file were divided by three, 170 physical
registers could be available for renaming. This is less than the
ideal 192 registers but maybe close enough to not impact
performance adversely.
I don't understand the problem.
You want 64 architecture registers, each of which needs a physical
register, plus 128 registers for in-flight instructions, so 192
physical registers.
If you add a second bank of 64 architecture registers for interrupts
then each needs a physical register. But that doesn't change the number
of in-flight registers so that's 256 physical total.
Plus two sets of rename banks, one for each mode.
If you drain the pipeline before switching register banks then all
of the 128 in-flight registers will be free at the time of switch.
A couple of bits of state and you don't need to drain the pipeline,
you just have to find the youngest instruction with the property that
all older instructions cannot raise an exception; these can be
allowed to finish execution while you are fetching instruction for
the new context.
Not quite comprehending. Will not the registers for the new context be
improperly mapped if there are registers in use for the old map?
All the in-flight destination registers will get written by the in-flight
instructions. All the instructions of the new context will allocate
registers from the pool which is not currently in-flight. So, while
there is mental confusion on how this gets pulled off in HW, it does get
pulled off just fine. When the new context STs the registers of the old
context, it obtains the correct register from the old context {{Should
HW be doing this, the same orchestration applies--and it still works.}}
I
think a state bit could be used to pause a fetch of a register still
in use in the old map, but that is draining the pipeline anyway.
You are assuming a RAT, I am not using a RAT but a CAM where I can
restore to any checkpoint by simply rewriting the valid bit vector.
I think the RAT can be restored to a specific checkpoint as well using
just an index value. Q+ has a checkpoint RAM of which one of the
checkpoints is the active RAT. The RAT is really 16 tables. I stored a
bit vector of the valid registers in the ROB so that the valid
register set may be reset when a checkpoint is restored.
When the context swaps, a new set of target registers is always
established before the registers are used.
You still have to deal with the transient state and the CAM version works
with either SW or HW save/restore.
So incoming references in
the new context should always map to the new registers?
Which they will--as illustrated above.
If you can switch to interrupt mode without draining the pipeline then
some of those 128 will be in-use for the old mode, some for the new mode
(and the uOps carry a privilege mode flag so you can do things like
check LD or ST ops against the appropriate PTE mode access control).
And 1 bit of state keeps track of which is which.
Did some experimenting and the RAT turns out to be too large if more
registers are incorporated. Even as few as 256 regs caused the RAT to
increase in size substantially. So, I may go the alternate route of
making registers wider rather than deeper, having 128-bit wide
registers instead.
Register ports (or equivalently RAT ports) are one of the things that
most limit issue width. K9 was to have 22 RAT ports, and was similar in
size to the {standard decoded Register File}.
The Q+ RAT has 16 read and 8 write ports. I am trying for a 4-wide
machine. It is using about as many LUTs as the register file. The RAT is implemented with LUT ram instead of block RAMs. I do not like the size,
but it adds a lot to the operation of the machine.
There is an eight bit sequence number associated with each
instruction, so the age of an instruction can easily be detected. I
found a really slick way of detecting instruction age using a matrix
approach on the web, but I did not fully understand it, so I just use
eight bit counters for now.
I assign a 4-bit number (16 checkpoints) to all instructions issued in
the same clock cycle. This gives a 6-wide machine up to 96 instructions
in-flight; and makes backing up (misprediction) simple and fast.
The same thing is done with Q+. It supports 16 checkpoints with a
four-bit number too. Having read that 16 is almost the same as infinity.
There is a two bit privilege mode flag for instructions in the ROB. I
suppose the ROB entries could be called uOps.
Figured it out. Each architectural register in the RAT must refer to N physical registers, where N is the number of banks. Setting N to 4
results in a RAT that is only about 50% larger than one supporting only
a single bank. The operating mode is used to select the physical
register. The first eight registers are shared between all operating
modes so arguments can be passed to syscalls. It is tempting to have
eight banks of registers, one for each hardware interrupt level.
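A sketch of that banked RAT lookup, with widths taken from the
description (64 architectural registers, 4 banks, r0..r7 shared across
modes so syscall arguments pass through; the names are mine):

#include <stdint.h>

#define AREGS 64
#define BANKS 4
static uint16_t rat[AREGS][BANKS];   /* physical register tags */

static uint16_t rat_lookup(unsigned areg, unsigned mode)
{
    /* r0..r7 always come from bank 0; the rest select by mode. */
    unsigned bank = (areg < 8) ? 0 : (mode & (BANKS - 1));
    return rat[areg][bank];
}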
On 2023-11-30 9:43 p.m., MitchAlsup wrote:
four-bit number too. Having read that 16 is almost the same as infinity.
Branch repair (from misprediction) has to be fast--especially if you are
going for 0-cycle repair.
I think I am far away from zero-cycle repair. Does getting zero-cycle
repair mean fetching from both branch directions and then selecting the correct one?
I will be happy if I can get branching to work at all. It
is my first implementation using checkpoints. All the details of
handling branches are not yet worked out in code for Q+. I think enough
of the code is in place to get rough timing estimates. Not sure how well
the BTB will work. A gselect predictor is also being used. Expecting a
lot of branch mispredictions.
Checkpoint recovery (from interrupt) just has to pick 1 of the
checkpoints that has achieved the consistent state (no older
instructions can raise an exception).
Sounds straight-forward enough.
Exception recovery can back up to the checkpoint containing the
instruction which raised the exception, and then single step forward
until the exception
is identified. Thus, you do not need "order" at a granularity smaller
than a checkpoint.
This sounds a little trickier to do. Q+ currently takes an exception
when things commit. It looks in the exception field of the queue entry
for a fault code. If there is one it performs almost the same operation
as a branch except it is occurring at the commit stage.
Noted.
One can use pseudo-exceptions to solve difficult timing or sequencing
problems, saving certain kinds of state transitions in the instruction
queuing mechanism. For example, one could use a pseudo-exception to regain
memory order in an ATOMIC event when you detect the order was less than
sequentially consistent.
Gone back to using variable length instructions. Had to pipeline the instruction length decode across three clock cycles to get it to meet
timing.
Robert Finch wrote:
Figured it out. Each architectural register in the RAT must refer to N
physical registers, where N is the number of banks. Setting N to 4
results in a RAT that is only about 50% larger than one supporting only
a single bank. The operating mode is used to select the physical
register. The first eight registers are shared between all operating
modes so arguments can be passed to syscalls. It is tempting to have
eight banks of registers, one for each hardware interrupt level.
A consequence of multiple architecture register banks is that each
extra bank keeps a set of mostly unused physical registers attached to
it.
For example, if there are 2 modes User and Super and a bank for each,
since User and Super are mutually exclusive,
64 of your 256 physical registers will be sitting unused tied
to the other mode bank, so max of 75% utilization efficiency.
If you have 8 register banks then only 3/10 of the physical registers
are available to use, the other 7/10 are sitting idle attached to
arch registers in other modes consuming power.
Also you don't have to play overlapped-register-bank games to pass
args to/from syscalls. You can have specific instructions that reach
into other banks: Move To User Reg, Move From User Reg.
Since only syscall passes args into the OS you only need to access
the user mode bank from the OS kernel bank.
For the Q+ MPU and SOC the bus system is organized like a tree with the
root being at the CPU. The system bus operates with asynchronous transactions. The bus then fans out through bus bridges to various
system components. Responses coming back from devices are buffered and
merged together onto a more common bus when there are open spaces
on the bus. I think it is fairly fast (well, at least for homebrew FPGA).
Bus accesses are single cycle, but they may have a varying amount of
latency.
Writes are “posted” so they are essentially single cycle.
Reads percolate back up the tree to the CPU. It operates at the CPU clock rate (currently 40MHz) and transfers 128-bits at a time. Maximum peak
transfer rate would then be 640 MB/s. Copying memory is bound to be much slower due to the read latency. Devices on the bus have a configuration
block which looks something like a PCI config block, so devices
addressing may be controlled by the OS.
Multiple devices access the main DRAM memory via a memory controller.
Several devices that are bus masters have their own ports to the memory controller and do not use up time on the main system bus tree. The
What happens when there is a sequence of numerous branches in a row, such
that the machine would run out of checkpoints for the branches?
Suppose you go
Bra tgt1
Bra tgt1
… 30 times
Bra tgt1
Will the machine still work? Or will it crash?
I have Q+ stalling until checkpoints are available, but it seems like a
loss of performance. It is extra hardware to check for the case that
might be preventable with software. I mean how often would a sequence
like the above occur?
Getting a bit lazy on the Q+ instruction commit in the interest of
increasing the fmax. The results are already in the register file, so
all the commit has to do is:
1) Update the branch predictor.
2) Free up physical registers
3) Free load/store queue entries associated with the ROB entry.
4) Commit oddball instructions.
5) Process any outstanding exceptions.
6) Free the ROB entry
7) Gather performance statistics.
What needs to be committed is computed in the clock cycle before the
commit. This pipelined signal adds a cycle of latency to the commit, but
it only really affects rarely executed oddball instructions and
exceptions. Commit also will not commit if the commit pointer is near
the queue pointer. Commit will also only commit up to the first oddball
instruction or exception.
Decided to axe the branch-to-register feature of conditional branch instructions because the branch target would not be known at enqueue
time. It would require updating the ROB in two places.
Branches can now use a postfix immediate to extend the branch range.
This allows 32 and 64-bit displacements in addition to the existing
17-bit one. However, the assembler cannot know which to use in advance,
so choosing a larger branch displacement size should be an option.
On 2023-12-09 8:02 p.m., MitchAlsup wrote:
Robert Finch wrote:
Getting a bit lazy on the Q+ instruction commit in the interest of
increasing the fmax. The results are already in the register file, so
all the commit has to do is:
1) Update the branch predictor.
2) Free up physical registers
By the time you write the physical register into the file, you are in
a position to free up the now permanently invisible physical register
it replaced.
Hey thanks, I should have thought of that. While there are more
physical registers available than needed (256, and only about 204 are
needed), so it would probably run okay, I think I see a way to reduce
multiplexor usage by freeing the register when it is written.
3) Free load/store queue entries associated with the ROB entry.
Spectré:: write miss buffer data into Cache and TLB.
This is also where I write ST.data into cache.
Is miss data for a TLB page fault? I have this stored in a register in
the TLB which must be read by the CPU during exception handling.
Otherwise the TLB has a hidden page walker that updates the TLB.
Scratching my head now over writing the store data at commit time.
4) Commit oddball instructions.
5) Process any outstanding exceptions.
6) Free the ROB entry
7) Gather performance statistics.
What needs to be committed is computed in the clock cycle before the
commit. This pipelined signal adds a cycle of latency to the commit,
but it only really affects oddball instructions rarely executed, and
exceptions. Commit also will not commit if the commit pointer is near
the queue pointer. Commit will also only commit up to the first
oddball instruction or exception.
Decided to axe the branch-to-register feature of conditional branch
instructions because the branch target would not be known at enqueue
time. It would require updating the ROB in two places.
Question:: How would you handle::
IDIV R6,R7,R8
JMP R6
??
There is a JMP address[Rn] instruction (really JSR Rt,address[Rn]) in
the instruction set which is always treated as a branch miss when it
executes. The RTS instruction could also be used; it allows the return
address register to be specified and it is a couple of bytes shorter. It
was just that conditional branches had the feature removed. It required
a third register be read for the flow control unit too.
Branches can now use a postfix immediate to extend the branch range.
This allows 32 and 64-bit displacements in addition to the existing
17-bit one. However, the assembler cannot know which to use in
advance, so choosing a larger branch displacement size should be an
option.
On 2023-12-09 11:06 p.m., MitchAlsup wrote:
Robert Finch wrote:
I have a LD IP,[address] instruction which is used to access GOT[k] for
calling dynamically linked subroutines. This bypasses the LD-aligner
to deliver IP to fetch faster.
But you side-stepped answering my question. My question is what do you
do when the Jump address will not arrive for another 20 cycles.
While waiting for the register value, other instructions would continue
to queue and execute. Then that processing would be dumped because of
the branch miss. I suppose hardware could be added to suppress
processing until the register value is known. An option for a larger
build.
Branches can now use a postfix immediate to extend the branch range.
This allows 32 and 64-bit displacements in addition to the existing
17-bit one. However, the assembler cannot know which to use in
advance, so choosing a larger branch displacement size should be an
option.
I use GOT[k] to branch farther than the 28-bit unconditional branch
displacement can reach. We have not yet run into a subroutine that
needs branches of more than 18-bits conditionally or 28-bits
unconditionally.
I have yet to use GOT addressing.
There are issues to resolve in the Q+ frontend. The next PC value for
the BTB is not available for about three clocks. To go backwards in
time, the next PC needs to be cached, or rather the displacement to the
next PC to reduce cache size.
The first time a next PC is needed it will
not be available for three clocks. Once cached it would be available
within a clock. The next PC displacement is the sum of the lengths of
the next four instructions. There is not enough room in the FPGA to add
another cache and associated logic, however. Next PC = PC + 20 seems a
whole lot simpler to me.
Thus, I may go back to using a fixed size instruction or rather
instructions with fixed alignment. The position of instructions could be
as if they were fixed length while remaining variable length.
Instructions would just be aligned at fixed intervals. If I set the
length to five bytes for instance, most of the instruction set could be
accommodated. Operation by "packed" instructions would be an option for
a larger build. There could be a bit in a control register to allow
execution by packed or unpacked instructions so there is some backwards compatibility to a smaller build.
On 2023-12-10 10:11 a.m., MitchAlsup wrote:
Robert Finch wrote:
On 2023-12-09 11:06 p.m., MitchAlsup wrote:
Robert Finch wrote:
I have a LD IP,[address] instruction which is used to access GOT[k] for
calling dynamically linked subroutines. This bypasses the LD-aligner
to deliver IP to fetch faster.
But you side-stepped answering my question. My question is what do you
do when the Jump address will not arrive for another 20 cycles.
While waiting for the register value, other instructions would
continue to queue and execute. Then that processing would be dumped
because of the branch miss. I suppose hardware could be added to
suppress processing until the register value is known. An option for a
larger build.
Branches can now use a postfix immediate to extend the branch
range. This allows 32 and 64-bit displacements in addition to the
existing 17-bit one. However, the assembler cannot know which to
use in advance, so choosing a larger branch displacement size
should be an option.
I use GOT[k] to branch farther than the 28-bit unconditional branch
displacement can reach. We have not yet run into a subroutine that
needs branches of more than 18-bits conditionally or 28-bits
unconditionally.
I have yet to use GOT addressing.
There are issues to resolve in the Q+ frontend. The next PC value for
the BTB is not available for about three clocks. To go backwards in
time, the next PC needs to be cached, or rather the displacement to
the next PC to reduce cache size.
What you need is an index and a set to directly access the cache--all
the other stuff can be done in arrears {AGEN and cache tag check}.
The first time a next PC is needed it
will not be available for three clocks. Once cached it would be
available within a clock. The next PC displacement is the sum of the
lengths of next four instructions. There is not enough room in the
FPGA to add another cache and associated logic, however. Next PC = PC
+ 20 seems a whole lot simpler to me.
Thus, I may go back to using a fixed size instruction or rather
instructions with fixed alignment. The position of instructions could
be as if they were fixed length while remaining variable length.
If the first part of an instruction decodes to the length of the
instruction easily (EASILY) and cheaply, you can avoid the header and
build a tree of unary pointers, each such pointer pointing at twice as
many instruction starting points as the previous. Even without headers,
My 66000 can find the instruction boundaries of up to 16 instructions
per cycle without adding "stuff" to the block of instructions.
Instructions would just be aligned at fixed intervals. If I set the
length to five bytes for instance, most the instruction set could be
accommodated. Operation by “packed” instructions would be an option
for a larger build. There could be a bit in a control register to
allow execution by packed or unpacked instructions so there is some
backwards compatibility to a smaller build.
I cannot get it to work at a decent speed for only six instructions.
With byte-aligned instructions 64 decoders are in use (they're really
small). Then the output from the appropriate ones is selected. It is
partially the fullness of the FPGA and routing congestion because of
the design. Routing is taking 90% of the time. Logic is only about 10%.
I did some experimenting with block headers and ended up with a block
trailer instead of a header, for the assembler's benefit: it needs to
know all the instruction lengths before the trailer can be output. Only
the index of the instruction group is needed, so usually there are only
a couple of indexes used per instruction block. It can likely get by
with a 24-bit trailer containing four indexes plus the assumed one.
Usually only one or two bytes are wasted at the end of a block.
I assembled the boot rom and there are 4.9 bytes per instruction
average, including the overhead of block trailers and wasted bytes.
Branches and postfixes are five bytes, and there are a lot of them.
Code density is a little misleading because branches occupy five bytes
but do both a compare and branch operation. So they should maybe count
as two instructions.
On 2023-12-11 4:57 a.m., BGB wrote:
I got timing to work at 40+ MHz by using 32-bit instruction parcels
rather than byte-oriented ones.
An issue with 32-bit parcels is that float constants do not fit well
into them because of the opcode present in a postfix. A 32-bit postfix
has only 25 available bits for a constant. The next size up has 57 bits available. One thought I had was to reduce the floating-point precision
to correspond. Single precision floats would be 25 bits, double
precision 57 bits and quad precision 121 bits. All seven bits short of
the usual.
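The arithmetic: a 32-bit postfix minus a 7-bit opcode leaves 25 bits,
and likewise 57 of 64. Truncating the low mantissa bits is all it takes
to round-trip such an immediate; a sketch for the 57-bit double case:

#include <stdint.h>
#include <string.h>

/* Drop the 7 low mantissa bits; sign and exponent are untouched. */
static uint64_t f64_to_imm57(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits >> 7;
}

static double imm57_to_f64(uint64_t imm)
{
    uint64_t bits = imm << 7;    /* truncated bits read back as 0 */
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}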
I could try and use 40-bit parcels but they would need to be at fixed locations on the cache line for performance, and it would waste bytes.
Stuck on checkpoint RAM now. Everything was going good until…. I
realized that while instructions are executing they need to be able to
update previous checkpoints, not just the current one. Which checkpoint
gets updated depends on which checkpoint the instruction falls under. It
is the register valid bit that needs to be updated. I used a “brute force” approach to implement this and it is 40k LUTs. This is about five times too large a solution. If I reduce the number of checkpoints
supported to four from sixteen, then the component is 20k LUTs. Still
too large.
The issue is there are 256 valid bits times 16 checkpoints which means
4096 registers. Muxing the register inputs and outputs uses a lot of LUTs.
One thought is to stall until all the instructions with targets in a
given checkpoint are finished executing before starting a new checkpoint region. It would seriously impact the CPU performance.
Robert Finch wrote:
Stuck on checkpoint RAM now. Everything was going good until…. I
realized that while instructions are executing they need to be able to
update previous checkpoints, not just the current one. Which checkpoint
gets updated depends on which checkpoint the instruction falls under. It
is the register valid bit that needs to be updated. I used a “brute
force” approach to implement this and it is 40k LUTs. This is about five >> times too large a solution. If I reduce the number of checkpoints
supported to four from sixteen, then the component is 20k LUTs. Still
too large.
The issue is there are 256 valid bits times 16 checkpoints which means
4096 registers. Muxing the register inputs and outputs uses a lot of LUTs. >>
One thought is to stall until all the instructions with targets in a
given checkpoint are finished executing before starting a new checkpoint
region. It would seriously impact the CPU performance.
(I don't have a solution, just passing on some info on this particular checkpointing issue.)
Sounds like you might be using the same free register checkpoint algorithm
I came up with for my simulator, which I assumed was a custom sram design.
There is 1 bit for each physical register that is free.
The checkpoint for a Bcc conditional branch copies the free bit vector,
in your case 256 bits, to a row in the checkpoint sram.
As each instruction retires and frees up its old dest physical register,
it must mark the register free in *all* checkpoint contexts.
That requires the ability to set all the free flags for a single register, which means an sram design that can write a whole row, and also set all the bits in one column, in your case set the 16 bits in each checkpoint for one of the 256 registers.
I was assuming an ASIC design so a small custom sram seemed reasonable.
But for an FPGA it requires 256*16 flip-flops plus decoders, etc.
I think the risc-v Berkeley Out-of-Order Machine (BOOM) folks might have independently come up with the same approach on their BOOM-3 SonicBoom.
Their note [5] describes the same problem as my column setting solves.
https://docs.boom-core.org/en/latest/sections/rename-stage.html
While their target was 22nm ASIC, they say below that they
implemented a version of BOOM-3 on an FPGA but don't give details.
But their project might be open source so maybe the details
are available online.
SonicBOOM: The 3rd generation Berkeley Out-of-Order Machine
http://people.eecs.berkeley.edu/~krste/papers/SonicBOOM-CARRV2020.pdf
On 2023-12-22 12:49 p.m., MitchAlsup wrote:
EricP wrote:
Robert Finch wrote:
Stuck on checkpoint RAM now. Everything was going good until…. I
realized that while instructions are executing they need to be able
to update previous checkpoints, not just the current one. Which
checkpoint gets updated depends on which checkpoint the instruction
falls under. It is the register valid bit that needs to be updated.
I used a “brute force” approach to implement this and it is 40k
LUTs. This is about five times too large a solution. If I reduce the
number of checkpoints supported to four from sixteen, then the
component is 20k LUTs. Still too large.
The issue is there are 256 valid bits times 16 checkpoints which
means 4096 registers. Muxing the register inputs and outputs uses a
lot of LUTs.
One thought is to stall until all the instructions with targets in a
given checkpoint are finished executing before starting a new
checkpoint region. It would seriously impact the CPU performance.
(I don't have a solution, just passing on some info on this particular
checkpointing issue.)
Sounds like you might be using the same free register checkpoint
algorithm I came up with for my simulator, which I assumed was a custom
sram design.
There is 1 bit for each physical register that is free.
The checkpoint for a Bcc conditional branch copies the free bit vector,
in your case 256 bits, to a row in the checkpoint sram.
As each instruction retires and frees up its old dest physical register,
it must mark the register free in *all* checkpoint contexts.
That requires the ability to set all the free flags for a single
register, which means an sram design that can write a whole row, and
also set all the bits in one column, in your case set the 16 bits in
each checkpoint for one of the 256 registers.
I think I maybe found a solution using a block RAM and about 8k LUTs.
Not sure about setting bits in all checkpoints. I probably have not
understood the issue yet. Partially terminology. There are two different
things happening: the register free/available, which is being managed
with fifos, and the register contents valid bit. At the far end of the
pipeline, registers that were used are made free again by adding to the
free fifo. This is somewhat inefficient because they could be freed
sooner, but it would require more logic; instead more registers are
used, since they are available from the RAM anyway.
The register contents valid bit is cleared when a target register is assigned, and set once a value is loaded into the target register. The
valid bit is also set for instructions that are stomped on as the old
value is valid. When a checkpoint is restored, it restores the state of
the valid bit along with the physical register tag. I am not
understanding why the valid bit would need to be modified in all
checkpoints. I would think it should reflect the pre-branch state of
things.
Robert Finch wrote:
On 2023-12-22 12:49 p.m., MitchAlsup wrote:
EricP wrote:
Robert Finch wrote:
Stuck on checkpoint RAM now. Everything was going good until…. I
realized that while instructions are executing they need to be able
to update previous checkpoints, not just the current one. Which
checkpoint gets updated depends on which checkpoint the instruction
falls under. It is the register valid bit that needs to be updated.
I used a “brute force” approach to implement this and it is 40k
LUTs. This is about five times too large a solution. If I reduce the
number of checkpoints supported to four from sixteen, then the
component is 20k LUTs. Still too large.
The issue is there are 256 valid bits times 16 checkpoints which
means 4096 registers. Muxing the register inputs and outputs uses a
lot of LUTs.
One thought is to stall until all the instructions with targets in a
given checkpoint are finished executing before starting a new
checkpoint region. It would seriously impact the CPU performance.
(I don't have a solution, just passing on some info on this particular
checkpointing issue.)
Sounds like you might be using the same free register checkpoint
algorithm I came up with for my simulator, which I assumed was a custom
sram design.
There is 1 bit for each physical register that is free.
The checkpoint for a Bcc conditional branch copies the free bit vector,
in your case 256 bits, to a row in the checkpoint sram.
As each instruction retires and frees up its old dest physical register,
it must mark the register free in *all* checkpoint contexts.
That requires the ability to set all the free flags for a single
register, which means an sram design that can write a whole row, and
also set all the bits in one column, in your case set the 16 bits in
each checkpoint for one of the 256 registers.
I think I maybe found a solution using a block RAM and about 8k LUTs.
Not sure about setting bits in all checkpoints. I probably have not
understood the issue yet. Partially terminology. There are two different
things happening: the register free/available, which is being managed
with fifos, and the register contents valid bit. At the far end of the
pipeline, registers that were used are made free again by adding to the
free fifo. This is somewhat inefficient because they could be freed
sooner, but it would require more logic; instead more registers are
used, since they are available from the RAM anyway.
The register contents valid bit is cleared when a target register is
assigned, and set once a value is loaded into the target register. The
valid bit is also set for instructions that are stomped on as the old
value is valid. When a checkpoint is restored, it restores the state of
the valid bit along with the physical register tag. I am not
understanding why the valid bit would need to be modified in all
checkpoints. I would think it should reflect the pre-branch state of
things.
This has to do with free physical register list checkpointing and
a particular gotcha that occurs if one tries to use a vanilla sram
to save the free map bit vector for each checkpoint.
It sounds like the BOOM people stepped in this gotcha at some point.
Say a design has a bit vector indicating which physical registers are free. Rename allocates a register by using a priority selector to scan that
vector and select a free PR to assign as a new dest PR.
When this instruction retires, the old dest PR is freed and
the new dest PR becomes the architectural register.
When Decode sees a conditional branch Bcc it allocates a
checkpoint in a circular buffer by incrementing the head counter,
copies the *current* free bit vector into the new checkpoint row,
and saves the new checkpoint index # in the Bcc uOp.
If a branch mispredict occurs then we can restore the state at the
Bcc by copying various state info from the Bcc checkpoint index #.
This includes copying back the saved free vector to the current free vector. When the Bcc uOp retires we increment the circular tail counter
to recover the checkpoint buffer row.
The problem occurs when an old dest PR is in use so its free bit is clear when the checkpoint is saved. Then the instruction retires and marks the
old dest PR as free in the bit vector. Then Bcc triggers a mispredict
and restores the free vector that was copied when the checkpoint was saved, including the then not-free state of the PR freed after the checkpoint. Result: the PR is lost from the free list. After enough mispredicts you
run out of free physical registers and hang at Rename waiting to allocate.
It needs some way to edit the checkpointed free bit vector so that
no matter what order of PR-allocate, retire-PR-free, checkpoint save #X,
and rollback to checkpoint #Y occurs, the correct free vector gets
restored.
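In C, the fix described comes out roughly as below (sizes shrunk to 64
physical registers and 16 checkpoints to keep it short; the column set
in retire_free is the part that prevents the lost-register leak):

#include <stdint.h>

#define NCKPT 16
static uint64_t free_now;        /* live free-bit vector */
static uint64_t ckpt[NCKPT];     /* saved free vectors */
static unsigned head, tail;      /* circular checkpoint buffer */

static unsigned save_checkpoint(void)   /* at each Bcc */
{
    unsigned idx = head++ % NCKPT;
    ckpt[idx] = free_now;               /* copy the row */
    return idx;
}

static void retire_free(int preg)       /* old dest PR freed */
{
    uint64_t bit = 1ull << preg;
    free_now |= bit;
    for (unsigned i = 0; i < NCKPT; i++)
        ckpt[i] |= bit;   /* set the column: without this a later
                             rollback would restore the PR as busy
                             and leak it from the free list */
}

static void rollback(unsigned idx)      /* on mispredict */
{
    free_now = ckpt[idx];
}

static void retire_checkpoint(void)     /* when the Bcc retires */
{
    tail++;                             /* row can be reused */
}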
Predicated logic and the PRED modifier on my mind tonight.
I think I have discovered an interesting way to handle predicated logic.
If a predicate is true the instruction is scheduled and executes
normally. If the predicate is false the instruction is modified to a
special copy operation and scheduled to execute on an ALU regardless of
what the original execution unit would be. What makes this efficient is
that only a single target register read port is required for the ALU
unit versus having a target register read port for every functional
unit. The copy mux is present in the ALU only and not in the other
functional units. For most instructions there is no predication.
Robert Finch wrote:
Predicated logic and the PRED modifier on my mind tonight.
I think I have discovered an interesting way to handle predicated logic.
If a predicate is true the instruction is scheduled and executes
normally. If the predicate is false the instruction is modified to a
special copy operation and scheduled to execute on an ALU regardless of
what the original execution unit would be. What makes this efficient is
that only a single target register read port is required for the ALU
unit versus having a target register read port for every functional
unit. The copy mux is present in the ALU only and not in the other
functional units. For most instructions there is no predication.
Yes, the general case is each uOp has a predicate source and a bool to test.
If the value matches the predicate you execute the ON_MATCH part of the
uOp; if it does not match, then execute the ON_NO_MATCH part.
condition = True | False
(pred == condition) ? ON_MATCH : ON_NO_MATCH;
The ON_NO_MATCH uOp function is usually some housekeeping.
On an in-order it might diddle the scoreboard to indicate the register
write is done. On OoO it might copy the old dest register to new.
Note that the source register dependencies change between match and no_match.
if (pred == True) ADD r3 = r2 + r1
If pred == True then it matches and the uOp is dependent on r2 and r1.
If pred != True then it is a no_match and the uOp is dependent on the old dest r3
as a source to copy to the new dest r3.
Dynamically pruning the unnecessary uOp source register dependencies
for the alternate part can allow it to launch earlier.
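A hedged C sketch of that pruning (struct and field names are mine, not
from any real design): once the predicate value is known, the uOp keeps
only the sources of the side that will actually execute.

#include <stdint.h>

typedef struct {
    uint8_t match_srcs[2];   /* e.g. r2, r1 for ADD r3 = r2 + r1   */
    uint8_t nomatch_src;     /* old dest r3, copied to the new r3  */
    uint8_t pred_value;      /* resolved predicate value           */
    uint8_t condition;       /* value the uOp tests against        */
} pred_uop;

/* Returns the number of live sources after pruning. */
static int live_sources(const pred_uop *u, uint8_t out[2])
{
    if (u->pred_value == u->condition) {    /* ON_MATCH path       */
        out[0] = u->match_srcs[0];
        out[1] = u->match_srcs[1];
        return 2;
    }
    out[0] = u->nomatch_src;                /* ON_NO_MATCH copy    */
    return 1;
}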
Also predicated LD and ST have some particular issues to think about.
For example, under TSO a younger LD cannot bypass an older LD.
If an older LD has an unresolved predicate then we don't know if it exists
so we have to block the younger LD until the older predicate resolves.
The LD's ON_NO_MATCH housekeeping might include diddling the LSQ dependency matrix to wake up any younger LD's in the LSQ that had been blocked.
(Yes, I'm sure one could get fancier with replay traps.)
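A minimal illustration of that blocking rule over a toy LSQ (the entry
layout and names are invented for the example): under TSO a load may not
issue while any older load's predicate is still unresolved, because that
older load may or may not exist.

#include <stdint.h>

typedef struct {
    uint8_t is_load;          /* entry is a LD                      */
    uint8_t pred_resolved;    /* predicate known true or false yet? */
} lsq_entry;

/* Entries 0..idx-1 are older than entry idx, in program order. */
static int may_issue_load(const lsq_entry *q, int idx)
{
    for (int i = 0; i < idx; i++)
        if (q[i].is_load && !q[i].pred_resolved)
            return 0;         /* older LD might exist: block */
    return 1;
}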
EricP wrote:
If an older LD has an unresolved predicate then we don't know if it
exists
so we have to block the younger LD until the older predicate resolves.
This is why TSO and SC are slower than causal or weaker. Consider a memory
reorder buffer which allows generated addresses to probe the cache and
determine hit as operand data-flow permits--BUT holds onto the data and
writes (LD) or reads (ST) the RF in program order. This violates TSO
and SC, but mono-threaded codes are immune to this memory ordering problem
{and multi-threaded programs are immune except while performing ATOMIC
things.}
TSO and SC are simply slower when trying to perform memory reference
instructions in both the then-clause and in the else-clause while waiting
on the resolution of the condition--even if no results are written into
the RF until after resolution.
On 1/10/2024 8:30 PM, EricP wrote:
MitchAlsup wrote:
EricP wrote:
If an older LD has an unresolved predicate then we don't know if it
exists
so we have to block the younger LD until the older predicate resolves.
This is why TSO and SC are slower than causal or weaker. Consider a
memory reorder buffer which allows generated addresses to probe the
cache and determine hit as operand data-flow permits--BUT holds onto
the data and writes (LD) or reads (ST) the RF in program order. This
violates TSO and SC, but mono-threaded codes are immune to this memory
ordering problem {and multi-threaded programs are immune except while
performing ATOMIC things.}
TSO and SC are simply slower when trying to perform memory reference
instructions in both the then-clause and in the else-clause while
waiting on the resolution of the condition--even if no results are
written into the RF until after resolution.
BTW in case anyone is interested I came across the recent paper that
compares the Apple M1 ARM processors two memory consistency models:
the ARM weak ordering and the total store ordering (TSO) model from x86.
"Based on various workloads, our findings indicate that TSO is,
on average, 8.94% slower than ARM’s weaker memory ordering."
TOSTING: Investigating Total Store Ordering on ARM, 2023
https://sra.uni-hannover.de/Publications/2023/tosting-arcs23/wrenger_23_arcs.pdf
https://www.sra.uni-hannover.de/Publications/2023/tosting-arcs23/wrenger_23_arcs_talk.pdf
Will read them. Thanks for the heads up Eric.
Chris M. Thomasson wrote:
On 1/10/2024 8:30 PM, EricP wrote:
BTW in case anyone is interested I came across the recent paper that
compares the Apple M1 ARM processors two memory consistency models:
the ARM weak ordering and the total store ordering (TSO) model from x86.
"Based on various workloads, our findings indicate that TSO is,
on average, 8.94% slower than ARM’s weaker memory ordering."
TOSTING: Investigating Total Store Ordering on ARM, 2023
https://sra.uni-hannover.de/Publications/2023/tosting-arcs23/wrenger_23_arcs.pdf
https://www.sra.uni-hannover.de/Publications/2023/tosting-arcs23/wrenger_23_arcs_talk.pdf
Caveat that this compares Apple's ARM weak to Apple's TSO implementation.
Because Apple M1 has two consistency models,
if TSO is just there as a porting aid for x86 code that depends on it,
they might not have put as many bells and whistles into making it
as fast as someone who has only TSO.
Trading off the maximum amount of contiguously addressed memory for a
smaller PTE and PTP format. The PTE and PTPs are thus only 32-bits in
size. A PTE has a 17-bit page number, while the PTP uses 30-bits for the
page number. With 64kB pages this limits the system to 2^46 bytes of
memory, which is probably okay for a small system. The PTE's 17-bit page
number can only work with 8GB of contiguous memory. All the pages the
PTE table covers must be in the same 8GB memory range. The upper bits of
the translated address will come from the PTP's upper bits. This makes
memory look like a tree. Groups of leaves are stuck to particular branches.
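A sketch of how the translation might compose, with illustrative field
positions (my reading of the scheme, not Robert's actual layout): the PTE
supplies the low 17 page-number bits, and the PTP that points at the PTE
table supplies the bits above the 8GB boundary.

#include <stdint.h>

#define PAGE_SHIFT   16     /* 64kB pages                        */
#define PTE_PPN_BITS 17     /* 8GB reach per PTE table           */

/* ptp_ppn is the 30-bit page number from the PTP; its upper 13
   bits select which 8GB range the whole PTE table maps. */
static uint64_t translate(uint32_t ptp_ppn, uint32_t pte_ppn,
                          uint32_t offset)
{
    uint64_t upper = (uint64_t)(ptp_ppn >> PTE_PPN_BITS);
    return (upper << (PAGE_SHIFT + PTE_PPN_BITS))
         | ((uint64_t)(pte_ppn & ((1u << PTE_PPN_BITS) - 1)) << PAGE_SHIFT)
         | (offset & ((1u << PAGE_SHIFT) - 1));
}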
Robert Finch wrote:
Trading off the maximum amount of contiguously addressed memory for a
smaller PTE and PTP format. The PTE and PTPs are thus only 32-bits in
size. A PTE has a 17-bit page number, while the PTP uses 30-bits for the
page number. With 64kB pages this limits the system to 2^46 bytes of
memory, which is probably okay for a small system. The PTE's 17-bit page
number can only work with 8GB of contiguous memory. All the pages the
PTE table covers must be in the same 8GB memory range. The upper bits of
the translated address will come from the PTP's upper bits. This makes
memory look like a tree. Groups of leaves are stuck to particular branches.
If I understand you correctly this means the PTE pages for
each 8 GB range must be in a PTP located inside that 8 GB range.
If that is ROM or IO registers in that range then there must be
RAM in the same 8 GB range in order to map it.
That would make modularizing components a little difficult as you
will have to add RAM mapping modules to each 8 GB address range.
And of course the OS memory manager has to be coded to specially
handle the RAM for each of these mapping ranges.
On 2024-03-04 9:03 a.m., EricP wrote:
If I understand you correctly this means the PTE pages for
each 8 GB range must be in a PTP located inside that 8 GB range.
If that is ROM or IO registers in that range then there must be
RAM in the same 8 GB range in order to map it.
That would make modularizing components a little difficult as you
will have to add RAM mapping modules to each 8 GB address range.
And of course the OS memory manager has to be coded to specially
handle the RAM for each of these mapping ranges.
Yes, the above is what I was thinking.
There is a scratchpad RAM in the ROM address space, used for
bootstrapping. RAM access is needed during the boot process before
everything is set up and the DRAM is accessible. So, it is possible to
map in that manner.
Bigfoot pages are 1MB in size. That allows an entire 64GB address range
to be mapped with one page table. With larger memory systems a larger
page size is needed IMO. 64GB is still 65,536 pages when the pages are
1MB in size. There is 32GB RAM in my machine today. Tomorrow it will be 64GB.
Robert Finch wrote:
Bigfoot pages are 1MB in size. That allows an entire 64GB address range
to be mapped with one page table. With larger memory systems a larger
page size is needed IMO. 64GB is still 65,536 pages when the pages are
1MB in size. There is 32GB RAM in my machine today. Tomorrow it will be 64GB.
Above a certain point the added latency of filling/spilling a page to DISK
that is 64KB in size rather than 4KB in size outweighs the gain in the TLB.
Robert Finch wrote:
Bigfoot pages are 1MB in size. That allows an entire 64GB address
range to be mapped with one page table. With larger memory systems a
larger page size is needed IMO. 64GB is 65,536 pages still when the
pages are 1MB in size. There is 32GB RAM in my machine today. Tomorrow
it will be 64GB.
Above a certain point the added latency of filling/spilling a page to DISK that is 64KB in size rather than 4KB in size outweighs the gain in the TLB.
On 3/5/2024 7:37 AM, MitchAlsup1 wrote:
Robert Finch wrote:
Bigfoot pages are 1MB in size. That allows an entire 64GB address
range to be mapped with one page table. With larger memory systems a
larger page size is needed IMO. 64GB is 65,536 pages still when the
pages are 1MB in size. There is 32GB RAM in my machine today. Tomorrow
it will be 64GB.
Above a certain point the added latency of filling/spilling a page to DISK
that is 64KB in size rather than 4KB in size outweighs the gain in the TLB.
Sure. But the 4K size was first used when disk transfer rates were
about 3 MB/sec. Today, they are many times that, even if you are using
a single physical hard disk. RAID and SSD can make that even larger. I don't know what the "optimal" page size is, but it is certainly larger
than 4KB.
Stephen Fuld wrote:
On 3/5/2024 7:37 AM, MitchAlsup1 wrote:
Robert Finch wrote:
Bigfoot pages are 1MB in size. That allows an entire 64GB address
range to be mapped with one page table. With larger memory systems a
larger page size is needed IMO. 64GB is 65,536 pages still when the
pages are 1MB in size. There is 32GB RAM in my machine today. Tomorrow
it will be 64GB.
Above a certain point the added latency of filling/spilling a page to DISK
that is 64KB in size rather than 4KB in size outweighs the gain in the TLB.
Sure. But the 4K size was first used when disk transfer rates were
about 3 MB/sec. Today, they are many times that, even if you are using
a single physical hard disk. RAID and SSD can make that even larger. I
don't know what the "optimal" page size is, but it is certainly larger
than 4KB.
While the data transfer rates are far higher today, the disk latency has
not kept pace with CPU performance {SSD is different}.
The CPU went from 60ns (16 MHz: 360/67) to 200ps (5GHz) 300× !!
And from 5 cycles per instruction to 3 instruction per cycle 15×
for a combined gain of 4,500×
Disk latency may have gone down from 8ms to 3ms (sometimes 2ms for
smaller disks) 3×-4×
{{I consider 360/67 as the beginning of paging; although Multics may be
the beginning.}}
I, also, believe that 4KB pages are a bit small for a new architecture.
I chose 8KB for My 66000 more because it took 1 level out of the page
tables supporting 64-bit VAS, and also made the paging hierarchy more
easy to remember:: 8K, 8M, 8G, 8T, 8P, 8E is easier to remember than
4K, 2M, 1G, ½T, ¼P, ⅛E.
I also believe in the tension between pages that are too small and those
that are too large. 256B is widely seen as too small (VAX).
On 3/5/2024 7:37 AM, MitchAlsup1 wrote:
Robert Finch wrote:
Bigfoot pages are 1MB in size. That allows an entire 64GB address
range to be mapped with one page table. With larger memory systems a
larger page size is needed IMO. 64GB is 65,536 pages still when the
pages are 1MB in size. There is 32GB RAM in my machine today. Tomorrow
it will 64GB.
Above a certain point the added latency of filling/spilling a page to DISK >> that is 64KB in size rather than 4KB in size outweighs the gain in the TLB.
Sure. But the 4K size was first used when disk transfer rates were
about 3 MB/sec. Today, they are many times that, even if you are using
a single physical hard disk. RAID and SSD can make that even larger. I
don't know what the "optimal" page size is, but it is certainly larger
than 4KB.
256B is widely seen as too small (VAX).
I think most
people are comfortable in the 4KB range. I think 64KB is too big since
something like cat needs less code, less data, and less stack space than
1 single 64KB page and another 64KB page to map cat from its VAS. So, now;
instead of four* 4KB pages (16KB = {code, data, stack, map}) we now need
four 64KB pages, 256KB.
mitchalsup@aol.com (MitchAlsup1) writes:
256B is widely seen as too small (VAX).
Didn't VAX use 512B?
I think most
people are comfortable in the 4KB range. I think 64KB is too big since
something like cat needs less code, less data, and less stack space than
1 single 64KB page and another 64KB page to map cat from its VAS. So, now;
instead of four* 4KB pages (16KB = {code, data, stack, map}) we now need
four 64KB pages, 256KB.
Yes, so what? Who cares if cat takes 16KB or 256KB when we have
Gigabytes of RAM?
A while ago <2020Oct9.190337@mips.complang.tuwien.ac.at> I posted
numbers here about the sizes of VMAs in the processes on several Linux systems and how much extra space would be needed from larger pages:
|VMAs  unique       used     total    8KB    16KB    32KB     64KB
| 7552   2333     555964   1033320   6704   22344   56344   125144 desktop
|82836  25276    5346060  15707448  76072  223000  514472  1113672 laptop
|47017  15425  105490636  60186068  40804  134492  319852   708588 server
The numbers in the 8KB, 16KB, 32KB, 64KB columns estimate how much
extra RAM would be needed if the pages were that large. So, 1.1GB
extra for 64KB pages on the laptop, but 8KB and 16KB pages would be relatively cheap.
- anton
Stephen Fuld wrote:
On 3/5/2024 7:37 AM, MitchAlsup1 wrote:
Robert Finch wrote:
Bigfoot pages are 1MB in size. That allows an entire 64GB address
range to be mapped with one page table. With larger memory systems a
larger page size is needed IMO. 64GB is 65,536 pages still when the
pages are 1MB in size. There is 32GB RAM in my machine today.
Tomorrow it will 64GB.
Above a certain point the added latency of filling/spilling a page to
DISK
that is 64KB in size rather than 4KB in size outweighs the gain in
the TLB.
Sure. But the 4K size was first used when disk transfer rates were
about 3 MB/sec. Today, they are many times that, even if you are
using a single physical hard disk. RAID and SSD can make that even
larger. I don't know what the "optimal" page size is, but it is
certainly larger than 4KB.
While the data transfer rates are far higher today, the disk latency has
not kept pace with CPU performance {SSD is different}.
The CPU went from 60ns (16 MHz: 360/67) to 200ps (5GHz) 300× !!
And from 5 cycles per instruction to 3 instruction per cycle 15×
for a combined gain of 4,500×
Disk latency may have gone down from 8ms to 3ms (sometimes 2ms for
smaller disks) 3×-4×
{{I consider 360/67 as the beginning of paging; although Multics may be
the beginning.}}
I, also, believe that 4KB pages are a bit small for a new architecture.
I chose 8KB for My 66000 more because it took 1 level out of the page
tables supporting 64-bit VAS, and also made the paging hierarchy more
easy to remember:: 8K, 8M, 8G, 8T, 8P, 8E is easier to remember than
4K, 2M, 1G, ½T, ¼P, ⅛E.
I also believe in the tension between pages that are too small and those
that are too large.
256B is widely seen as too small (VAX). I think most
people are comfortable in the 4KB range.
I think 64KB is too big since
something like cat needs less code, less data, and less stack space than
1 single 64KB page and another 64KB page to map cat from its VAS. So, now; instead of four* 4KB pages (16KB ={code, data, stack, map} ) we now need
four 64KB pages 256KB. It is these small applications that drive the
minimum page size down.
But with desktops coming with 64GB+ of DRAM, perhaps it matters less now.
(*) 5 if you separate .bss from .data
6 if you separate .rodata from .bss and .data
On 3/5/2024 9:32 AM, MitchAlsup1 wrote:
Stephen Fuld wrote:
On 3/5/2024 7:37 AM, MitchAlsup1 wrote:
Robert Finch wrote:
Bigfoot pages are 1MB in size. That allows an entire 64GB address
range to be mapped with one page table. With larger memory systems a
larger page size is needed IMO. 64GB is 65,536 pages still when the
pages are 1MB in size. There is 32GB RAM in my machine today.
Tomorrow it will 64GB.
Above a certain point the added latency of filling/spilling a page to
DISK
that is 64KB in size rather than 4KB in size outweighs the gain in
the TLB.
Sure. But the 4K size was first used when disk transfer rates were
about 3 MB/sec. Today, they are many times that, even if you are
using a single physical hard disk. RAID and SSD can make that even
larger. I don't know what the "optimal" page size is, but it is
certainly larger than 4KB.
While the data transfer rates are far higher today, the disk latency has
not kept pace with CPU performance {SSD is different}.
Sure, but the latency is the same no matter what the transfer size is.
So the fact that latency improvements haven't kept pace is irrelevant to
the question of optimal page size.
The CPU went from 60ns (16 MHz: 360/67) to 200ps (5GHz) 300× !!
And from 5 cycles per instruction to 3 instruction per cycle 15×
for a combined gain of 4,500×
Disk latency may have gone down from 8ms to 3ms (sometimes 2ms for
smaller disks) 3×-4×
{{I consider 360/67 as the beginning of paging; although Multics may be
the beginning.}}
Agreed.
I, also, believe that 4KB pages are a bit small for a new architecture.
I chose 8KB for My 66000 more because it took 1 level out of the page
tables supporting 64-bit VAS, and also made the paging hierarchy more
easy to remember:: 8K, 8M, 8G, 8T, 8P, 8E is easier to remember than
4K, 2M, 1G, ½T, ¼P, ⅛E.
Fair enough. When Unisys implemented paging in the 2200 Series in the
1990s, they chose 16K (approximately - exactly if you consider 36 equals 32!).
I also believe in the tension between pages that are too small and those
that are too large.
Naturally.
256B is widely seen as too small (VAX). I think most
people are comfortable in the 4KB range.
While true, how much of that is just what they are used to, as opposed to
some kind of optimal?
I think 64KB is too big since
something like cat needs less code, less data, and less stack space than
1 single 64KB page and another 64KB page to map cat from its VAS. So, now;
instead of four* 4KB pages (16KB = {code, data, stack, map}) we now need
four 64KB pages 256KB. It is these small applications that drive the
minimum page size down.
Agreed. And large applications tend to drive the optimum page size up.
And I think that small applications like cat tend to be quick thus never swapped, and only "waste" memory for a short amount of time. On the
other hand, larger page sizes cause a larger memory "waste" for all applications.
So the optimum is, at least to some degree, usage dependent. Of course,
this is all an argument for multiple page sizes.
But with desktops coming with 64GB+ of DRAM, perhaps it matters less now.
I think larger main memories argue for larger page sizes, both because
the "waste" costs less, and larger memories require more pages and thus, perhaps a larger TLB.
As with most such things, there is a tradeoff, and the optimum
probably changes as technology changes.
(*) 5 if you separate .bss from .data
6 if you separate .rodata from .bss and .data
On 2024-03-05 1:13 p.m., BGB wrote:
What is LLC? (Local Lan controller?)
Another possible trick was mapping these to ROM zero page when reloaded,
and only "reviving" them as actual swap-space pages, when something was
written to them. Since the page-table was also partly used to track
pages in the page-table, there needed to be special handling in the TLB
miss handler to signal "yeah, this page is really a zeroes-only page".
Say, page states:
Invalid / unassigned;
Valid / assigned;
Invalid / mapped to pagefile (page is swapped out);
Valid / mapped to pagefile (page is zeroed).
Though, potentially, any hardware page-walker would need to be aware of
the zero-page trick (vs, say, trying to map the page to an invalid
physical address).
...
Intel, AMD and ARM already support large page sizes (1GB for Intel/AMD
and 512GB for ARM64).
Do they support using the entire page as a page table?
Compression could be useful for something serialized to disk or through
the network. Transferring the page number and compressed contents.
Anton Ertl wrote:
A while ago <2020Oct9.190337@mips.complang.tuwien.ac.at> I posted
numbers here about the sizes of VMAs in the processes on several Linux
systems and how much extra space would be needed from larger pages:
|VMAs unique used total 8KB 16KB 32KB 64KB
| 7552 2333 555964 1033320 6704 22344 56344 125144 desktop
|82836 25276 5346060 15707448 76072 223000 514472 1113672 laptop
|47017 15425 105490636 60186068 40804 134492 319852 708588 server
Are the 8K numbers to be compared to unique? used? or total? to estimate
waste ??
Robert Finch <robfi680@gmail.com> writes:
Do they support using the entire page as a page table?
All multilevel page tables use an entire page (granule in ARMv8)
at each level of the page table. To map larger page/block sizes,
they just stop the table walk at one, two or three levels rather
than walking all four levels to the leaf page size.
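A toy C walker showing the early-stop idea (the leaf flag, bit positions,
and modelling tables as host arrays are all assumptions for illustration,
not any particular architecture's format):

#include <stdint.h>

#define PTE_VALID 1u
#define PTE_LEAF  2u
#define LEVELS    4
#define IDX_BITS  9        /* 512 8-byte entries per 4kB table */

static uint64_t walk(const uint64_t *root, uint64_t va)
{
    const uint64_t *table = root;
    for (int lvl = LEVELS - 1; lvl >= 0; lvl--) {
        unsigned shift = 12 + (unsigned)lvl * IDX_BITS;
        uint64_t pte = table[(va >> shift) & ((1u << IDX_BITS) - 1)];
        if (!(pte & PTE_VALID))
            return 0;                        /* fault                */
        if ((pte & PTE_LEAF) || lvl == 0)    /* stop early: 4K/2M/1G */
            return (pte & ~0xfffull) | (va & ((1ull << shift) - 1));
        table = (const uint64_t *)(uintptr_t)(pte & ~0xfffull);
    }
    return 0;                                /* not reached          */
}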
Compression could be useful for something serialized to disk or through
the network. Transferring the page number and compressed contents.
Compressing tiered DRAM is looking to be the next big thing.
https://www.zeropoint-tech.com/news/meta-and-google-unveil-a-revolutionary-hardware-compressed-cxl-memory-tier-specification
On 3/5/2024 3:07 PM, Robert Finch wrote:
On 2024-03-05 1:13 p.m., BGB wrote:
What is LLC? (Local Lan controller?)
I used the text mode video display RAM in the past.
Intel, AMD and ARM already support large page sizes (1GB for Intel/AMD
and 512GB for ARM64).
Do they support using the entire page as a page table?
I think it is a case of skipping the last N levels.
Where, say, with 4K pages and 64-bit PTEs:
4K: No Skip
2MB: 1-level skip
1GB: 2-level skip.
Seemingly, Windows had used 64K logical pages, but these were likely
faked in software.
Not entirely sure the reason for them doing so.
Compression could be useful for something serialized to disk or through
the network. Transferring the page number and compressed contents.
Scott Lurndal wrote:
Robert Finch <robfi680@gmail.com> writes:
Do they support using the entire page as a page table?
All multilevel page tables use an entire page (granule in ARMv8)
at each level of the page table. To map larger page/block sizes,
they just stop the table walk at one, two or three levels rather
than walking all four levels to the leaf page size.
My 66000 page structure supports both skipping of levels in the table
and stopping at any level in the table.
Compression could be useful for something serialized to disk or through
the network. Transferring the page number and compressed contents.
Compressing tiered DRAM is looking to be the next big thing.
https://www.zeropoint-tech.com/news/meta-and-google-unveil-a-revolutionary-hardware-compressed-cxl-memory-tier-specification
CXL is just* an offload of DRAM from an on-die memory controller onto
PCIe 5.0-6.0 high speed links.
An interesting topic between PCIe 5.0 and 6.0 is the change from
NRZ encoding to PAM4. This comes with a degradation of error rate
from 10^-12 down to 10^-6, so you are going to need some kind
of ECC on the PCIe links and layers.
On 3/6/2024 8:42 AM, Robert Finch wrote:
In my case, access is figured out on cache-line fetch, and is precooked:
NR, NW, NX, NU, NC: NoRead/NoWrite/NoExecute/NoUser/NoCache.
Though, some combinations of these flags are special.
The L1 cache only hits if the current mode matches the mode that was in effect at the time the cache-line was fetched, and if KRR has not
changed (as determined by a hash value), ...
For my system the ACL is not part of the PTE, it is part of the software
managed page information, along with share counts. I do not see the ACL
for a page being different depending on the page table.
In my case, ACL handling is done via a combination of keyring register
(KRR), and a small fully-associative cache (4 entry at present, 6 could
be better in theory; luckily each entry is comparably small).
The ACLID is tied to the TLBE, so the intersection of the ACLID and KRR
entry are used to figure out access in the ACL cache (or,
ignored/disabled if the low 16 bits of KRR are 0).
I have dedicated some of the block RAMs for the page management
information, so they may be read out in parallel with a memory access.
So I shifted the block RAM usage from the TLB to the PMT. This makes the
TLB smaller. It also reduces the memory usage. The page management
information only needs one copy for each page of memory. If the
information were in the TLBE / PTEs there would be multiple copies of
the information in the page tables. How do you keep things coherent if
there are multiple copies in page tables?
The access ID for pages is kept in sync with the memory address, since
both are uploaded to the TLB at the same time.
However, as for ACL checks themselves, these are handled with a separate cache. So, say, changing the access to an ACLID, and flushing the corresponding entry from the ACL cache, will automatically apply to any
pages previously loaded into the TLB.
There was also the older VUGID system, which used traditional Unix-style
permissions. If I were designing it now, I would likely design things
around using exclusively ACL checking, which (ironically) also needs
fewer bits to encode.
Generally, software TLB miss handling is used in my case.
There is no automatic way to keep the TLB in sync with the page table
(if the page table entry is modified).
Usual thing is that if the current page table is updated, then one needs
to forge a special dummy entry, and then upload this entry to the TLB multiple times (via the LDTLB instruction) to knock the prior contents
out of the TLB (or use the INVTLB instruction, but this currently
invalidates the entire TLB; which is a bad situation for
software-managed TLB...).
Generally, the assumption is that all pages in a mapping will have the
same ACLID (generally corresponding to the "owner" of the mapping).
If using multiple page tables for context switching, it will be
necessary to use ASIDs.
It is possible to share global pages across "ASID groups", but currently there are not "truly global" pages (and, implicitly, some groups may
disallow global pages).
Where, say, the ASID is a 16-bit field:
(15:10): ASID Group
( 9: 0): ASID ID
At present, for most normal use, the idea is that the ASID and ACL/KRR
ID's will be aliased to a process's PID.
Say, with Groups 00..1F (in both ASID and ACLID space) being used for
the PID aliased range (20..37 for special use, and 38..3F for selective one-off entries).
Currently, threads also eat PID's, but this is likely to change, say:
TPID (Task ID):
(31:16): PID
(15: 0): Thread ID (local to a given PID)
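Packing helpers for those two layouts, as a hedged C sketch (the helper
names are mine, not BGB's):

#include <stdint.h>

static inline uint16_t make_asid(unsigned group, unsigned id)
{
    return (uint16_t)(((group & 0x3f) << 10) | (id & 0x3ff));
}

static inline uint32_t make_tpid(uint16_t pid, uint16_t tid)
{
    return ((uint32_t)pid << 16) | tid;  /* (31:16) PID, (15:0) thread */
}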
On 2024-03-07 1:39 a.m., BGB wrote:
Bigfoot uses a whole byte for access rights, with separate
read-write-execute for user and supervisor, and write protect for
hypervisor and machine modes. He also uses 4 bits for the cache-ability
which match the cache-ability bits on the bus.
I am using the value of zero for the ASID to represent the machine
mode’s ASID. A lot of hardware is initialized to zero at reset, so it’s automatically the machine mode’s. Other than zero the ASID could be anything assigned by the OS.
On 3/5/2024 9:32 AM, MitchAlsup1 wrote:
snip
I also believe in the tension between pages that are too small and those
that are too large. 256B is widely seen as too small (VAX). I think most
people are comfortable in the 4KB range. I think 64KB is too big since
something like cat needs less code, less data, and less stack space than
1 single 64KB page and another 64KB page to map cat from its VAS. So, now; >> instead of four* 4KB pages (16KB ={code, data, stack, map} ) we now need
four 64KB pages 256KB. It is these small applications that drive the
minimum page size down.
In thinking about this, an idea occurred to me that may ease this
tension some. For a large page, you introduce a new protection mode
such that, for example, the lower half of the addresses in the page are
execute only, and the upper half are read/write enabled. This would
allow the code and the data, and perhaps even the stack for such a
program to share a single page, while still maintaining the required
access protection. I think the hardware to implement this is pretty
small. While the benefits of this would be modest, if such "small
programs" occur often enough it may be worth the modest cost of the
additional hardware.
Robert Finch wrote:
On 2024-03-07 1:39 a.m., BGB wrote:
Bigfoot uses a whole byte for access rights, with separate
read-write-execute for user and supervisor, and write protect for
hypervisor and machine modes. He also uses 4 bits for the cache-ability
which match the cache-ability bits on the bus.
Can you think of an example where a user Read-Only page would not be
writeable from super ?? Or a device on PCIe ??
Can you think of an example where a user Execute-Only page would not
be readable from super ?? Or a device on PCIe ??
Can you think of an example where a user page marked RWE = 000 would
not be readable and writeable from super ? Or a device on PCIe ??
Can you think of an example where 2-bits denoting {L1, L2, LLC, DRAM}
does not describe the cache placement of a line adequately ??
I am also assuming that ASID = 0 is the highest level of privilege;
but this is purely a SW choice.
On 2024-03-07 2:51 p.m., MitchAlsup1 wrote:
Robert Finch wrote:
On 2024-03-07 1:39 a.m., BGB wrote:
Bigfoot uses a whole byte for access rights, with separate
read-write-execute for user and supervisor, and write protect for
hypervisor and machine modes. He also uses 4 bits for the
cache-ability which match the cache-ability bits on the bus.
Can you think of an example where a user Read-Only page would not be
writeable from super ?? Or a device on PCIe ??
Can you think of an example where a user Execute-Only page would not
be readable from super ?? Or a device on PCIe ??
I cannot think of examples. But I had thought the hypervisor / machine
might want to treat supervisor mode like an alternate user mode. The
bits can always just be set = 7.
Can you think of an example where a user page marked RWE = 000 would
not be readable and writeable from super ? Or a device on PCIe ??
A page marked RWE=000 is an unusable page. Perhaps to signal bad memory.
Or perhaps as a hidden data page full of comments or remarks. If it's not
readable, writeable, or executable, what is it? Nothing should be able to
access it, except maybe the machine/debug operating mode.
Can you think of an example where 2-bits denoting {L1, L2, LLC, DRAM}
does not describe the cache placement of a line adequately ??
The cache-ability bits were not directly describing cache placement.
They were like the cache-ability bits on the bus. They specified cache
policy: bufferable / non-bufferable, write-through, write-back, allocate,
etc. But now that I have reviewed it, I see I forgot I had removed these
bits from the PTE / TLBE.
I do not like situations where all possible codes are used.
So, I would probably use three bits. Could a cache line be located in a
register, for instance?
I cannot envision every usage, although a lot is known today, I thought
it would be better to err on the side of providing too many bits rather
than not enough.
Not enough is hard to add later. There are loads of
bits available in the 128-bit PTE, 96 bits would be enough. But it is
not a nice power of two.
I am using the value of zero for the ASID to represent the machine
mode’s ASID. A lot of hardware is initialized to zero at reset, so
it’s automatically the machine mode’s. Other than zero the ASID could >>> be anything assigned by the OS.
I do not rely on control registers being set to zero; instead, part of
HW context switching ends up reading these out of ROM and into those
registers--so they can have any reasonable bit pattern SW desires.
{{This is sort of like how Alpha comes out of reset and streams a ROM
through the scan path to initialize the internals.}}
I am also assuming that ASID = 0 is the highest level of privilege;
but this is purely a SW choice.
Assuming a zero at reset was more of a default ‘if I forget’ approach. I have machine state being initialized from ROM too.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 3/5/2024 9:32 AM, MitchAlsup1 wrote:
snip
I also believe in the tension between pages that are too small and those
that are too large. 256B is widely seen as too small (VAX). I think most
people are comfortable in the 4KB range. I think 64KB is too big since
something like cat needs less code, less data, and less stack space than
1 single 64KB page and another 64KB page to map cat from its VAS. So, now;
instead of four* 4KB pages (16KB = {code, data, stack, map}) we now need
four 64KB pages, 256KB. It is these small applications that drive the
minimum page size down.
In thinking about this, an idea occurred to me that may ease this
tension some. For a large page, you introduce a new protection mode
such that, for example, the lower half of the addresses in the page are
execute only, and the upper half are read/write enabled. This would
allow the code and the data, and perhaps even the stack for such a
program to share a single page, while still maintaining the required
access protection. I think the hardware to implement this is pretty
small. While the benefits of this would be modest, if such "small
programs" occur often enough it may be worth the modest cost of the
additional hardware.
The biggest problem with variable page sizes isn't the hardware.
On 3/5/2024 9:32 AM, MitchAlsup1 wrote:
Stephen Fuld wrote:
On 3/5/2024 7:37 AM, MitchAlsup1 wrote:
Robert Finch wrote:
Bigfoot pages are 1MB in size. That allows an entire 64GB address
range to be mapped with one page table. With larger memory systems a
larger page size is needed IMO. 64GB is 65,536 pages still when the
pages are 1MB in size. There is 32GB RAM in my machine today.
Tomorrow it will 64GB.
Above a certain point the added latency of filling/spilling a page to
DISK
that is 64KB in size rather than 4KB in size outweighs the gain in
the TLB.
Sure. But the 4K size was first used when disk transfer rates were
about 3 MB/sec. Today, they are many times that, even if you are
using a single physical hard disk. RAID and SSD can make that even
larger. I don't know what the "optimal" page size is, but it is
certainly larger than 4KB.
While the data transfer rates are far higher today, the disk latency has
not kept pace with CPU performance {SSD is different}.
Sure, but the latency is the same no matter what the transfer size is.
So the fact that latency improvements haven't kept pace is irrelevant to
the question of optimal page size.
The CPU went from 60ns (16 MHz: 360/67) to 200ps (5GHz) 300× !!
And from 5 cycles per instruction to 3 instruction per cycle 15×
for a combined gain of 4,500×
Disk latency may have gone down from 8ms to 3ms (sometimes 2ms for
smaller disks) 3×-4×
{{I consider 360/67 as the beginning of paging; although Multics may be
the beginning.}}
Agreed.
I, also, believe that 4KB pages are a bit small for a new architecture.
I chose 8KB for My 66000 more because it took 1 level out of the page
tables supporting 64-bit VAS, and also made the paging hierarchy more
easy to remember:: 8K, 8M, 8G, 8T, 8P, 8E is easier to remember than
4K, 2M, 1G, ½T, ¼P, ⅛E.
Fair enough. When Unisys implemented paging in the 2200 Series in the
1990s, they chose 16K (approximately - exactly if you consider 36 equals
32!).
Stephen Fuld wrote:
So the optimum is, at least to some degree, usage dependent. Of
course, this is all an argument for multiple page sizes.
Which My 66000 provides, but it also provides big pages that can have an
extent [0..extent-1], so you can use an 8GB page to map anything from 8KB
through 8GB in a single PTE. The extent-bits are exactly the bits not
being used as PA-bits.
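A minimal sketch of the resulting validity check, assuming the extent
counts 8KB units within the big page (the parameterization is mine):

#include <stdint.h>

/* Offsets in units [0 .. extent-1] are mapped; anything beyond the
   extent faults even though the PTE nominally covers 8GB. */
static int extent_ok(uint64_t page_off, uint64_t extent,
                     unsigned unit_shift /* 13 for 8KB units */)
{
    return (page_off >> unit_shift) < extent;
}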
On Tue, 5 Mar 2024 10:32:03 -0800, Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
On 3/5/2024 9:32 AM, MitchAlsup1 wrote:
Stephen Fuld wrote:
On 3/5/2024 7:37 AM, MitchAlsup1 wrote:
Robert Finch wrote:
Bigfoot pages are 1MB in size. That allows an entire 64GB address
range to be mapped with one page table. With larger memory systems a
larger page size is needed IMO. 64GB is 65,536 pages still when the
pages are 1MB in size. There is 32GB RAM in my machine today.
Tomorrow it will 64GB.
Above a certain point the added latency of filling/spilling a page to
DISK
that is 64KB in size rather than 4KB in size outweighs the gain in
the TLB.
Sure. But the 4K size was first used when disk transfer rates were
about 3 MB/sec. Today, they are many times that, even if you are
using a single physical hard disk. RAID and SSD can make that even
larger. I don't know what the "optimal" page size is, but it is
certainly larger than 4KB.
While the data transfer rates are far higher today, the disk latency has
not kept pace with CPU performance {SSD is different}.
Sure, but the latency is the same no matter what the transfer size is.
So the fact that latency improvements haven't kept pace is irrelevant to
the question of optimal page size.
The CPU went from 60ns (16 MHz: 360/67) to 200ps (5GHz) 300× !!
And from 5 cycles per instruction to 3 instruction per cycle 15×
for a combined gain of 4,500×
Disk latency may have gone down from 8ms to 3ms (sometimes 2ms for
smaller disks) 3×-4×
{{I consider 360/67 as the beginning of paging; although Multics may be
the beginning.}}
Agreed.
I, also, believe that 4KB pages are a bit small for a new architecture.
I chose 8KB for My 66000 more because it took 1 level out of the page
tables supporting 64-bit VAS, and also made the paging hierarchy more
easy to remember:: 8K, 8M, 8G, 8T, 8P, 8E is easier to remember than
4K, 2M, 1G, ½T, ¼P, ⅛E.
Fair enough. When Unisys implemented paging in the 2200 Series in the
1990s, they chose 16K (approximately - exactly if you consider 36 equals
32!).
Actually, we use 4 KiW (contained in 32 KiB) as the page size. I
don't remember spill/fill time being an issue.
{{I consider 360/67 as the beginning of paging; although Multics may be >>>> the beginning.}}
Agreed.
It certainly is not.
IBM bought the right to use the Manchester University / Ferranti paging patent.
MitchAlsup1 wrote:
Stephen Fuld wrote:
So the optimum is, at least to some degree, usage dependent. Of
course, this is all an argument for multiple page sizes.
Which My 66000 provides, but it also provides big pages that can have an
extent [0..extent-1] so you can use an 8GB page to map anything from 8KB
through 8GB in a single PTE. The extent-bits are exactly the bits not
being used as PA-bits.
By the way, I borrowed your extent idea (thank you :-) ): when combined
with skipping from the root pointer to a lower level, it can be used to
map the whole OS code and data with just a couple of PTE's.
This eliminates table walks for most OS virtual addresses and
could allow a few 'pinned' TLB entries to map the whole OS.
This achieves a similar result to my Block Address Translate (BAT)
approach but without requiring a whole separate MMU mechanism.
The idea is for the OS to be separated into static and dynamic sets
of code and RW data. The static parts are always resident in the OS
plus any mandatory drivers. The linker packages all the static code
and data together into two huge RE and RW memory sections at specific
high end virtual addresses, aligned to a huge page boundary.
The extent feature allows the OS static code and data to be loaded
into just the portion of a huge page that each needs, with any unused remainder in the huge pages being returned to the general pool as
smaller pages (so no wasted space in the 1GB or 8GB pages).
And voila - two PTE's map the whole static OS code and data
which can be permanently held in two MMU mapping registers.
With one more for the graphics memory and the bulk of
table walks for system space can be eliminated.
mitchalsup@aol.com (MitchAlsup1) writes:
Scott Lurndal wrote:
Robert Finch <robfi680@gmail.com> writes:
Do they support using the entire page as a page table?
All multilevel page tables use an entire page (granule in ARMv8)
at each level of the page table. To map larger page/block sizes,
they just stop the table walk at one, two or three levels rather
than walking all four levels to the leaf page size.
My 66000 page structure supports both skipping of levels in the table
and stopping at any level in the table.
That's all well and good. What operating system do you
have running on the MY 66000 processor? When will I be able
to purchase a system based on that processor?
Compression could be useful for something serialized to disk or through
the network. Transferring the page number and compressed contents.
Compressing tiered DRAM is looking to be the next big thing.
https://www.zeropoint-tech.com/news/meta-and-google-unveil-a-revolutionary-hardware-compressed-cxl-memory-tier-specification
CXL is just* an offload of DRAM from an on-die memory controller onto
PCIe 5.0-6.0 high speed links.
Yes, I'm well aware of that. What you don't mention is that it
can become part of the processor cache coherency domain.
An interesting topic between PCIe 5.0 and 6.0 is the change from
NRZ encoding to PAM4. This comes with a degradation of error rate
from 10^-12 down to 10^-6, so you are going to need some kind
of ECC on the PCIe links and layers.
PAM4 is something with which we have a great deal of expertise,
along with the associated error correction.
On Thu, 7 Mar 2024 16:02:45 -0500, Robert Finch <robfi680@gmail.com>
wrote:
A page marked RWE=000 is an unusable page.
Perhaps to signal bad memory.
Or perhaps as a hidden data page full of comments or remarks. If it's not
readable, writeable, or executable, what is it? Nothing should be able to
access it, except maybe the machine/debug operating mode.
The ability to change (at least data) pages between "untouchable" and
RW is required for MMU assisted incremental GC. If the GC also
handles code, then it must be able to mark pages executable as well.
If an "untouchable" page can't be manipulated by user software, then
you've disallowed an entire class of GC systems.
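For concreteness, a minimal POSIX sketch of that style of MMU-assisted
barrier (mprotect plus a SIGSEGV handler; real collectors add the
bookkeeping, and calling mprotect in a handler is common GC practice
though not formally async-signal-safe):

#include <signal.h>
#include <string.h>
#include <sys/mman.h>

static void *tracked_page;            /* one 4kB page, for brevity */

static void fault_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    char *a = (char *)si->si_addr, *p = (char *)tracked_page;
    if (a >= p && a < p + 4096) {
        /* remember_dirty(tracked_page);  -- GC bookkeeping here */
        mprotect(tracked_page, 4096, PROT_READ | PROT_WRITE);
    }
}

static void install_barrier(void *page)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = fault_handler;
    sigaction(SIGSEGV, &sa, NULL);
    tracked_page = page;
    mprotect(page, 4096, PROT_NONE);  /* make it "untouchable" */
}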
On 3/8/2024 6:24 AM, David W Schroth wrote:
On Tue, 5 Mar 2024 10:32:03 -0800, Stephen Fuld
<sfuld@alumni.cmu.edu.invalid> wrote:
On 3/5/2024 9:32 AM, MitchAlsup1 wrote:
Stephen Fuld wrote:
On 3/5/2024 7:37 AM, MitchAlsup1 wrote:
Robert Finch wrote:
Bigfoot pages are 1MB in size. That allows an entire 64GB address
range to be mapped with one page table. With larger memory systems a
larger page size is needed IMO. 64GB is 65,536 pages still when the
pages are 1MB in size. There is 32GB RAM in my machine today.
Tomorrow it will be 64GB.
Above a certain point the added latency of filling/spilling a page to
DISK that is 64KB in size rather than 4KB in size outweighs the gain in
the TLB.
Sure. But the 4K size was first used when disk transfer rates were
about 3 MB/sec. Today, they are many times that, even if you are
using a single physical hard disk. RAID and SSD can make that even
larger. I don't know what the "optimal" page size is, but it is
certainly larger than 4KB.
While the data transfer rates are far higher today, the disk latency has
not kept pace with CPU performance {SSD is different}.
Sure, but the latency is the same no matter what the transfer size is.
So the fact that latency improvements haven't kept pace is irrelevant to
the question of optimal page size.
The CPU went from 60ns (16 MHz: 360/67) to 200ps (5GHz) 300× !!
And from 5 cycles per instruction to 3 instruction per cycle 15×
for a combined gain of 4,500×
Disk latency may have gone down from 8ms to 3ms (sometimes 2ms for
smaller disks) 3×-4×
{{I consider 360/67 as the beginning of paging; although Multics may be
the beginning.}}
Agreed.
I, also, believe that 4KB pages are a bit small for a new architecture.
I chose 8KB for My 66000 more because it took 1 level out of the page
tables supporting 64-bit VAS, and also made the paging hierarchy more
easy to remember:: 8K, 8M, 8G, 8T, 8P, 8E is easier to remember than
4K, 2M, 1G, ½T, ¼P, ⅛E.
Fair enough. When Unisys implemented paging in the 2200 Series in the
1990s, they chose 16K (approximately - exactly if you consider 36 equals
32!).
Actually, we use 4 KiW
Yes. For those too young to remember, on the 2200 series, one word is
36 bits. Hence my comment about 4KW being 16 KB, if 36 equals 32.
(contained in 32 KiB) as the page size.
I presume this is a result of the emulated systems using 64 bits for
each 36 bit word. I was referring to the original implementation on
native hardware.
I
don't remember spill/fill time being an issue.
Ahh! That goes along with Anton's comment and contradicts Mitch's
comments about disk times being a factor. Since you were there and
involved in the implementation, what were the considerations in choosing
4KW? Not that I am challenging it - I think it was probably the correct
decision. I just want to better understand the reasoning behind it.
On Thu, 7 Mar 2024 11:58:53 -0800, Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
On 3/5/2024 9:32 AM, MitchAlsup1 wrote:
snip
I also believe in the tension between pages that are too small and those
that are too large. 256B is widely seen as too small (VAX). I think most
people are comfortable in the 4KB range. I think 64KB is too big since
something like cat needs less code, less data, and less stack space than
1 single 64KB page and another 64KB page to map cat from its VAS. So, now;
instead of four* 4KB pages (16KB = {code, data, stack, map}) we now need
four 64KB pages, 256KB. It is these small applications that drive the
minimum page size down.
In thinking about this, an idea occurred to me that may ease this
tension some. For a large page, you introduce a new protection mode
such that, for example, the lower half of the addresses in the page are
execute only, and the upper half are read/write enabled. This would
allow the code and the data, and perhaps even the stack for such a
program to share a single page, while still maintaining the required
access protection. I think the hardware to implement this is pretty
small. While the benefits of this would be modest, if such "small
programs" occur often enough it may be worth the modest cost of the
additional hardware.
I would suggest that your proposal would be better done by splitting access/protection from virtual to physical translation (think Mill
turfs).
I suppose OS2200 could use our existing protection to
implement what you propose, but we haven't (largely because there
seems to be no need/call for the capability).
Nearly 15yrs later at Dec81 ACM SIGOPS, Jim Gray asked if I could help a
Tandem co-worker get a Stanford PhD ... it involved similar global LRU to
the work that I had done in the 60s, and there were "local LRU" forces
from the 60s lobbying hard not to award the PhD (since it wasn't "local
LRU"). I had real live data from a CP/67 with global LRU on a 768kbyte
(104 pageable pages) 360/67 with 80 users that had better response and
throughput compared to a CP/67 (with nearly identical type of workload
but 35 users) that implemented the 60s "local LRU" approach on a 1mbyte
360/67 (155 pageable pages after fixed storage) ... aka half the users
and 50% more real paging storage.
On 2024-03-07 2:58 p.m., Stephen Fuld wrote:
On 3/5/2024 9:32 AM, MitchAlsup1 wrote:
snip
I also believe in the tension between pages that are too small and
those that are too large. 256B is widely seen as too small (VAX). I
think most
people are comfortable in the 4KB range. I think 64KB is too big since
something like cat needs less code, less data, and less stack space than
1 single 64KB page and another 64KB page to map cat from its VAS. So,
now;
instead of four* 4KB pages (16KB = {code, data, stack, map}) we now need
four 64KB pages, 256KB. It is these small applications that drive the
minimum page size down.
In thinking about this, an idea occurred to me that may ease this
tension some. For a large page, you introduce a new protection mode
such that, for example, the lower half of the addresses in the page
are execute only, and the upper half are read/write enabled. This
would allow the code and the data, and perhaps even the stack for such
a program to share a single page, while still maintaining the required
access protection. I think the hardware to implement this is pretty
small. While the benefits of this would be modest, if such "small
programs" occur often enough it may be worth the modest cost of the
additional hardware.
I had thoughts along this line too. I have added user access rights for
each 1/4 of a page. Only a single 64kB page split in four is needed for
a small app then.
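A hedged sketch of the quarter-page check (the 3-bit RWE packing is my
assumption, not Robert's actual encoding): the top two offset bits of a
64kB page select which rights field applies.

#include <stdint.h>

/* rights4x3 holds four 3-bit RWE fields, one per 16kB quarter. */
static unsigned quarter_rights(uint16_t rights4x3, uint32_t off64k)
{
    unsigned q = (off64k >> 14) & 3;      /* which 16kB quarter   */
    return (rights4x3 >> (q * 3)) & 7;    /* RWE for that quarter */
}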
I plan on having garbage collection as part of the OS. There is a shared
hardware-card table involved. So, I guess that would disallow user
garbage collectors using untouchable pages. The MMU could be faked out
using a VM, so I have read.
Problem is - whatever [GC] you choose - it will be wrong and have bad
performance for some important class of GC'd applications.
On Fri, 8 Mar 2024 20:26:08 -0500, Robert Finch <robfi680@gmail.com>
wrote:
I plan on having garbage collection as part of the OS. There is a shared
hardware-card table involved.
What kind?
[Actually "kind" is the wrong word because any non-toy, real world GC
will need to employ a combination of techniques. So the question
really should be "in what major class is your GC"?]
Problem is - whatever you choose - it will be wrong and have bad
performance for some important class of GC'd applications.
On Sun, 10 Mar 2024 14:29:52 -0400, George Neuner
<gneuner2@comcast.net> wrote:
Problem is - whatever [GC] you choose - it will be wrong and have bad
performance for some important class of GC'd applications.
"bad performance" may mean "slow", but also may mean memory use much
higher than should be necessary.
George Neuner <gneuner2@comcast.net> writes:
On Sun, 10 Mar 2024 14:29:52 -0400, George Neuner
<gneuner2@comcast.net> wrote:
Problem is - whatever [GC] you choose - it will be wrong and have bad
performance for some important class of GC'd applications.
"bad performance" may mean "slow", but also may mean memory use much
higher than should be necessary.
Understood. And there are other relevant metrics as well, as
for example not throughput but worst-case latency.
On Mon, 11 Mar 2024 09:32:40 -0700, Tim Rentsch
<tr.17687@z991.linuxsc.com> wrote:
George Neuner <gneuner2@comcast.net> writes:
On Sun, 10 Mar 2024 14:29:52 -0400, George Neuner
<gneuner2@comcast.net> wrote:
Problem is - whatever [GC] you choose - it will be wrong and have bad
performance for some important class of GC'd applications.
"bad performance" may mean "slow", but also may mean memory use much
higher than should be necessary.
Understood. And there are other relevant metrics as well, as
for example not throughput but worst-case latency.
If latency is the primary concern, then you should use a deterministic
system such as Baker's Treadmill. [...]
George Neuner <gneuner2@comcast.net> writes:
On Fri, 8 Mar 2024 20:26:08 -0500, Robert Finch <robfi680@gmail.com>
wrote:
I plan on having garbage collection as part of the OS. There is a shared
hardware-card table involved.
What kind?
[Actually "kind" is the wrong word because any non-toy, real world GC
will need to employ a combination of techniques. So the question
really should be "in what major class is your GC"?]
Problem is - whatever you choose - it will be wrong and have bad
performance for some important class of GC'd applications.
I'm curious to know what you consider to be the different kinds,
or classes, of GC, and the same question for applications.
Certainly, for any given GC implementation, one can construct an
application that does poorly with respect to that GC, but that
doesn't make the constructed application a "class". For the
statement to have meaningful content there needs to be some kind
of identification of what are the classes of GCs, and what are
the classes of applications.
On Mon, 11 Mar 2024 09:29:57 -0700, Tim Rentsch
<tr.17687@z991.linuxsc.com> wrote:
George Neuner <gneuner2@comcast.net> writes:
On Fri, 8 Mar 2024 20:26:08 -0500, Robert Finch <robfi680@gmail.com>
wrote:
I plan on having garbage collection as part of the OS. There
is a shared hardware-card table involved.
What kind?
[Actually "kind" is the wrong word because any non-toy, real
world GC will need to employ a combination of techniques. So
the question really should be "in what major class is your GC"?]
Problem is - whatever you choose - it will be wrong and have bad
performance for some important class of GC'd applications.
I'm curious to know what you consider to be the different kinds,
or classes, of GC, and the same question for applications.
Certainly, for any given GC implementation, one can construct an
application that does poorly with respect to that GC, but that
doesn't make the constructed application a "class". For the
statement to have meaningful content there needs to be some kind
of identification of what are the classes of GCs, and what are
the classes of applications.
Feeling mathematical are we?
Every application contains delineated portions of its overall
allocation profile which correspond closely to portions of the
profiles of other applications.
If a given profile performs poorly under a given GC, it is
reasonable to infer that other applications having corresponding
profiles also will perform poorly while those profiles are in
force.
That said ...
GC systems - including their associated allocator(s) - are categorized (better word?) by their behavior. Unfortunately, behavior is
described by a complex set of implementation choices.
Understand that real-world GCs typically implement more than one
algorithm, and often the algorithms are hybridized - derived from and
relatable to published algorithms, but having a unique mix of function
that won't be found "as is" in any search of literature. [In truth,
GC literature tends to leave a lot as exercise for the reader.]
GC behavior often is adaptive, reacting to run-time conditions: e.g.,
based on memory fragmentation it could shift between non-moving
mark/sweep and moving mark/compact. It may also employ differing
algorithms simultaneously in different spaces, such as being
conservative in stacks while being precise in dynamic heaps, or being
stop-world in thread-private heaps while being concurrent or parallel
in shared heaps. Etc.
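To make the adaptive idea concrete, here is a minimal sketch in C of
the fragmentation-driven policy just described; the 25% threshold and
the largest-free-block metric are invented stand-ins, not taken from
any particular collector:

    #include <stddef.h>

    enum gc_algo { MARK_SWEEP, MARK_COMPACT };

    struct heap_stats {
        size_t free_bytes;          /* total free space after a cycle */
        size_t largest_free_block;  /* biggest contiguous free run */
    };

    /* Pick the next cycle's algorithm: if too little of the free
       space is usable as one contiguous block, switch to the moving
       collector to defragment; otherwise stay with the cheaper
       non-moving one. */
    enum gc_algo choose_next_cycle(const struct heap_stats *s)
    {
        double frag = (s->free_bytes == 0) ? 0.0
            : 1.0 - (double)s->largest_free_block / (double)s->free_bytes;
        return (frag > 0.25) ? MARK_COMPACT : MARK_SWEEP;
    }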
Concurrent GC (aka incremental) runs as a co-routine with the mutator.
These systems are distinguished by how they are triggered to run, and
what bounds may be placed on their execution time. There are
concurrent systems having completely deterministic operation [their
actual execution time, of course, may depend on factors beyond the
GC's control, such as multitasking, caching or paging.]
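As a sketch of the co-routine arrangement: bounded increments of
tracing can be interleaved with allocation. Here trace_one_object()
and allocate_black() are hypothetical helpers and the budget of 32
steps is an arbitrary assumption, but the shape shows where the bound
on each pause comes from:

    #include <stddef.h>

    /* Hypothetical helpers: trace one grey object (returns 0 when no
       grey objects remain) and allocate an object already marked
       live, so it cannot be reclaimed mid-cycle. */
    extern int   trace_one_object(void);
    extern void *allocate_black(size_t n);

    enum { WORK_BUDGET = 32 };   /* max trace steps per increment */

    /* Every allocation pays a bounded "tax" of collector work, so
       collector progress tracks the allocation rate and each pause
       is bounded by WORK_BUDGET steps. */
    void *gc_alloc(size_t n)
    {
        for (int i = 0; i < WORK_BUDGET; i++)
            if (!trace_one_object())
                break;               /* grey set empty this cycle */
        return allocate_black(n);
    }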
Parallel GC may be both prioritized and scheduled. These systems may
offer some guarantees about the percentage of (application) time given
to collector vs mutator(s).
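A sketch of what such a guarantee can look like: cap the collector's
share of CPU time within any scheduling window. The 70% mutator share
and the window bookkeeping below are illustrative assumptions only:

    /* Admit collector work only while the mutators are still getting
       at least MIN_MUTATOR_SHARE of CPU time in the current window. */
    #define MIN_MUTATOR_SHARE 0.70

    struct gc_window {
        double window_ns;    /* length of the sliding window */
        double gc_ns_used;   /* collector time already spent in it */
    };

    int gc_may_run(const struct gc_window *w)
    {
        return w->gc_ns_used < (1.0 - MIN_MUTATOR_SHARE) * w->window_ns;
    }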
Major choices:
- precise or conservative?
- moving or non-moving?
- tracing (marking)?
- copying / compacting?
- stop-world, concurrent, or parallel?
- single or multiple spaces?
- semispaces?
- generational?
Minor choices:
- software-only or hardware (MMU) assisted?
- snapshot at beginning?
- bump or block allocation?
- allocation color?
- free blocks coalesced? {if not compacting}
- multiple mutators?
- mutation color?
- writable shared objects?
- FROM-space mutation?
- finalization?
Note that all of these represent free dimensions in design. As
mentioned above, any particular system may implement multiple
collection algorithms each embodying a different set of choices.
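One way to make the taxonomy concrete is to encode the structural
choices as a descriptor, one per algorithm a collector implements. A
rough C sketch, with invented field names; the minor choices would be
further fields in the same style:

    enum gc_sync { STOP_WORLD, CONCURRENT, PARALLEL };

    struct gc_kind {
        unsigned precise      : 1;   /* vs conservative */
        unsigned moving       : 1;   /* vs non-moving */
        unsigned tracing      : 1;   /* marks reachable objects */
        unsigned copying      : 1;   /* copies / compacts */
        enum gc_sync sync;           /* stop-world / concurrent / parallel */
        unsigned multispace   : 1;   /* vs single space */
        unsigned semispaces   : 1;
        unsigned generational : 1;
    };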
You may wonder why I didn't mention "sweeping" ... essentially it is
because sequential scan is more an implementation detail than a
technique. Although "mark/sweep" is a well established technique, it
is the marking (tracing) that really defines it. Then too, modern
collectors often are neither mark/sweep nor copying as presented in textbooks: e.g., there are systems that mark and copy, systems that
sweep and copy (without marking), and "copying" systems in which
copies are logical and nothing actually changes address.
Aside: all GC can be considered to use [logical] semispaces because
all have the notion of segregated FROM-space and TO-space during
collection. True semispaces are a set of address-contiguous spaces -
not necessarily equally sized - which are rotated as targets for new
allocation. True semispaces do imply physical copying [but see the
Treadmill for an example of "non-moving copy" using logical
semispaces].
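To illustrate "non-moving copy": in the Treadmill every object sits
on one circular doubly-linked list, and "copying" an object into
TO-space is just an O(1) relink, so nothing changes address. A
stripped-down sketch in C (the fixed fan-out of two children, the
names, and the omitted ring initialization are my simplifications):

    enum colour { WHITE, GREY, BLACK };

    struct obj {
        struct obj *prev, *next;    /* links in the treadmill ring */
        enum colour colour;
        struct obj *child[2];       /* assumed fixed fan-out */
    };

    static struct obj *scan;        /* next grey object to trace */

    /* "Copy" a white object to TO-space: unlink it and splice it
       back in just after the scan point.  No data moves. */
    static void make_grey(struct obj *o)
    {
        if (o == NULL || o->colour != WHITE)
            return;
        o->prev->next = o->next;    /* unlink from FROM-space */
        o->next->prev = o->prev;
        o->next = scan->next;       /* relink after the scan point */
        o->prev = scan;
        scan->next->prev = o;
        scan->next = o;
        o->colour = GREY;
    }

    /* One bounded step of tracing: blacken one grey object and grey
       its white children.  When no grey remains, the whites are
       garbage and can be reclaimed in place. */
    static void scan_one(void)
    {
        struct obj *o = scan;
        if (o == NULL || o->colour != GREY)
            return;
        make_grey(o->child[0]);
        make_grey(o->child[1]);
        o->colour = BLACK;
        scan = o->next;
    }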
So what do I consider to be the "kind" of a GC?
The choices above pretty much define the extent of the design space
[but note I did intentionally leave out reference counting]. However,
the eight major choices are structural, whereas the minor ones
specify important characteristics but don't affect structure.
A particular "kind" might be, e.g.,
"precise, generational, multispace, non-moving, concurrent
tracer".
and so on.
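Using the descriptor sketch from above, that example kind would read
(C99 designated initializers):

    static const struct gc_kind example = {
        .precise      = 1,
        .generational = 1,
        .multispace   = 1,
        .moving       = 0,          /* non-moving */
        .tracing      = 1,
        .sync         = CONCURRENT,
    };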
I'm guessing this probably didn't really answer your question,
but it was fun to write. ;-)