• Tonight's tradeoff

    From Robert Finch@21:1/5 to All on Sun Nov 12 22:47:12 2023
    Branch miss logic versus clock frequency.

    The branch miss logic for the current OoO version of Thor is quite
    involved. It needs to back out the register source indexes to the last
    valid source before the branch instruction. To do this in a single
    cycle, the logic is about 25+ logic levels deep. I find this somewhat unacceptable.

    I can remove a lot of logic, improving the clock frequency
    substantially, by removing the branch miss logic that resets the
    register source IDs to the last valid source. Instead of stomping on
    the instructions on a miss and flushing them in a single cycle, I
    think the predicate for the instructions can be cleared, which will
    effectively turn them into NOPs. The value of the target register
    will be propagated in the reorder buffer, meaning the register source
    IDs need not be reset. The reorder buffer is only eight entries, so
    on average four entries would be turned into NOPs. The NOPs would
    still propagate through the reorder buffer, so it may take several
    clock cycles for them to be flushed from the buffer, meaning the
    branch latency for mispredicted branches would be quite high.
    However, if the clock frequency can be improved by 20% for all
    instructions, much of the lost performance on branches may be made up.
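
    As a rough illustration of the idea (a sketch only; the module and
    signal names are my assumptions, not Thor's actual design), the miss
    path reduces to computing which ROB entries are younger than the
    branch and clearing their predicate bits so they drain as NOPs:

      // Minimal SystemVerilog sketch: mask of ROB entries to NOPify on a
      // branch miss, for an assumed 8-entry circular ROB with a tail
      // (allocation) pointer. clear_pred[i]=1 => entry i retires as a NOP.
      module rob_squash_mask #(parameter ENTRIES = 8) (
        input  logic [$clog2(ENTRIES)-1:0] branch_idx, // ROB slot of branch
        input  logic [$clog2(ENTRIES)-1:0] tail,       // next slot to allocate
        output logic [ENTRIES-1:0]         clear_pred
      );
        always_comb
          for (int i = 0; i < ENTRIES; i++)
            // entry i is younger than the branch, in circular order
            if (branch_idx < tail)
              clear_pred[i] = (i > branch_idx) && (i < tail);
            else
              clear_pred[i] = (i > branch_idx) || (i < tail);
      endmodule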

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Robert Finch on Mon Nov 13 11:10:19 2023
    Robert Finch wrote:
    Branch miss logic versus clock frequency.

    The branch miss logic for the current OoO version of Thor is quite
    involved. It needs to back out the register source indexes to the last
    valid source before the branch instruction. To do this in a single
    cycle, the logic is about 25+ logic levels deep. I find this somewhat unacceptable.

    I can remove a lot of logic, improving the clock frequency
    substantially, by removing the branch miss logic that resets the
    register source IDs to the last valid source. Instead of stomping on
    the instructions on a miss and flushing them in a single cycle, I
    think the predicate for the instructions can be cleared, which will
    effectively turn them into NOPs. The value of the target register
    will be propagated in the reorder buffer, meaning the register source
    IDs need not be reset. The reorder buffer is only eight entries, so
    on average four entries would be turned into NOPs. The NOPs would
    still propagate through the reorder buffer, so it may take several
    clock cycles for them to be flushed from the buffer, meaning the
    branch latency for mispredicted branches would be quite high.
    However, if the clock frequency can be improved by 20% for all
    instructions, much of the lost performance on branches may be made up.

    Basically it sounds like you want to eliminate the checkpoint and rollback,
    and instead let resources be recovered at Retire. That could work.

    However, you are not restoring the Renamer's future Register Alias
    Table (RAT) to its state at the point of the mispredicted branch
    instruction, which is what the rollback would have done, so its state
    will be whatever it was at the end of the mispredicted sequence. That
    state needs to be re-synced with the program state as of the branch.

    That can be accomplished by stalling the front end, waiting until the
    mispredicted branch reaches Retire, then copying the committed RAT,
    maintained by Retire, to the future RAT at Rename, and restarting the
    front end.
    The list of free physical registers is then all those that are not
    marked as architectural registers.
    This is partly how I handle exceptions.
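
    A minimal sketch of that two-copy RAT arrangement (sizes and port
    names are assumed for illustration): the committed map is written at
    Retire, the future map at Rename, and recovery is a bulk copy of
    committed over future.

      module rat_pair #(parameter AREGS = 32, PREGS = 64) (
        input  logic                     clk,
        input  logic                     rename_we,   // new mapping at Rename
        input  logic [$clog2(AREGS)-1:0] rename_areg,
        input  logic [$clog2(PREGS)-1:0] rename_preg,
        input  logic                     retire_we,   // mapping commits at Retire
        input  logic [$clog2(AREGS)-1:0] retire_areg,
        input  logic [$clog2(PREGS)-1:0] retire_preg,
        input  logic                     recover,     // mispredicted branch retired
        output logic [$clog2(PREGS)-1:0] future_rat   [AREGS],
        output logic [$clog2(PREGS)-1:0] committed_rat[AREGS]
      );
        always_ff @(posedge clk) begin
          if (retire_we)
            committed_rat[retire_areg] <= retire_preg;
          if (recover)
            future_rat <= committed_rat;            // bulk restore, one shot
          else if (rename_we)
            future_rat[rename_areg] <= rename_preg;
        end
      endmodule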

    Also you still need a mechanism to cancel start of execution of the
    subset of pending uOps for the purged set. You don't want to launch
    a LD or DIV from the mispredicted set if it has not already started.
    If you are using a reservation station design then you need some way
    to distribute the cancel request to the various FU's and RS's,
    and wait for them to clean themselves up.

    Note that some things might not be able to cancel immediately,
    like an in-flight MUL in a pipeline or an outstanding LD to the cache.
    So some of this will be asynchronous (send cancel request, wait for ACK).

    There are some other things that might need cleanup.
    A Return Stack Predictor might be manipulated by the mispredicted path.
    Not sure how to handle that without a checkpoint.
    Maybe have two copies like RAT, a future one maintained by Decode and
    a committed one maintained by Retire, and copy the committed to future.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Mon Nov 13 19:47:27 2023
    Robert Finch wrote:

    Branch miss logic versus clock frequency.

    The branch miss logic for the current OoO version of Thor is quite
    involved. It needs to back out the register source indexes to the last
    valid source before the branch instruction. To do this in a single
    cycle, the logic is about 25+ logic levels deep. I find this somewhat unacceptable.
    <
    When you launch a predicted branch into execution (the prelude to
    signaling that recovery is required), while the branch is determining
    whether to back up (or not), have the branch recovery logic set up
    the register indexes such that::
    a) if the branch succeeds, you keep the current map;
    b) if the branch fails, you are 1 multiplexer delay from having the
    state you want.
    <
    That is, move the setup for the repair to the previous clock.
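
    A sketch of that timing trick (the signal names are mine, not Thor's
    or My 66000's): the checkpointed map is staged into a register during
    the cycle the branch resolves, so a miss only has to steer a 2:1 mux.

      module map_select #(parameter W = 6, N = 32) (
        input  logic         clk,
        input  logic [W-1:0] current_map    [N], // map if prediction holds
        input  logic [W-1:0] checkpoint_map [N], // map saved at the branch
        input  logic         branch_resolving,   // branch is in execute
        input  logic         branch_fail,        // resolved: mispredicted
        output logic [W-1:0] next_map       [N]
      );
        logic [W-1:0] staged [N];
        always_ff @(posedge clk)
          if (branch_resolving)
            staged <= checkpoint_map;            // setup, the previous clock
        always_comb
          for (int i = 0; i < N; i++)            // 1 mux delay on a miss
            next_map[i] = branch_fail ? staged[i] : current_map[i];
      endmodule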
    <
    I can remove a lot of logic, improving the clock frequency
    substantially, by removing the branch miss logic that resets the
    register source IDs to the last valid source. Instead of stomping on
    the instructions on a miss and flushing them in a single cycle, I
    think the predicate for the instructions can be cleared, which will
    effectively turn them into NOPs. The value of the target register
    will be propagated in the reorder buffer, meaning the register source
    IDs need not be reset. The reorder buffer is only eight entries, so
    on average four entries would be turned into NOPs. The NOPs would
    still propagate through the reorder buffer, so it may take several
    clock cycles for them to be flushed from the buffer, meaning the
    branch latency for mispredicted branches would be quite high.
    However, if the clock frequency can be improved by 20% for all
    instructions, much of the lost performance on branches may be made up.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Mon Nov 13 20:01:53 2023
    EricP wrote:

    Robert Finch wrote:
    Branch miss logic versus clock frequency.

    The branch miss logic for the current OoO version of Thor is quite
    involved. It needs to back out the register source indexes to the last
    valid source before the branch instruction. To do this in a single
    cycle, the logic is about 25+ logic levels deep. I find this somewhat
    unacceptable.

    I can remove a lot of logic, improving the clock frequency
    substantially, by removing the branch miss logic that resets the
    register source IDs to the last valid source. Instead of stomping on
    the instructions on a miss and flushing them in a single cycle, I
    think the predicate for the instructions can be cleared, which will
    effectively turn them into NOPs. The value of the target register
    will be propagated in the reorder buffer, meaning the register source
    IDs need not be reset. The reorder buffer is only eight entries, so
    on average four entries would be turned into NOPs. The NOPs would
    still propagate through the reorder buffer, so it may take several
    clock cycles for them to be flushed from the buffer, meaning the
    branch latency for mispredicted branches would be quite high.
    However, if the clock frequency can be improved by 20% for all
    instructions, much of the lost performance on branches may be made up.

    Basically it sounds like you want to eliminate the checkpoint and rollback, and instead let resources be recovered at Retire. That could work.

    However, you are not restoring the Renamer's future Register Alias
    Table (RAT) to its state at the point of the mispredicted branch
    instruction, which is what the rollback would have done, so its state
    will be whatever it was at the end of the mispredicted sequence. That
    state needs to be re-synced with the program state as of the branch.
    <
    I, personally, don't use a RAT--I use a CAM-based architectural decoder
    for operand reads and a standard physical equality decoder for writes.
    <
    Every cycle the CAM.valid bits are block-loaded into a history table,
    and if you need to return the CAMs to the checkpointed mappings, you
    take the valid bits from the history table and write the CAM.valid
    bits back into the physical register file. Presto, the map is how it
    used to be.
    <
    Can even be made to be performed in 0 cycles. {yes: 0, not 1 cycle}
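
    A rough sketch of the valid-bit history mechanism (sizes assumed: 64
    physical registers, 16 checkpoints; this is my reading of the scheme,
    not Mitch's actual logic):

      module cam_valid_history #(parameter PREGS = 64, CKPTS = 16) (
        input  logic                     clk,
        input  logic [PREGS-1:0]         cam_valid_in, // bits from rename CAMs
        input  logic [$clog2(CKPTS)-1:0] wr_ckpt,      // checkpoint being recorded
        input  logic                     restore,      // branch mispredicted
        input  logic [$clog2(CKPTS)-1:0] rd_ckpt,      // checkpoint to return to
        output logic [PREGS-1:0]         cam_valid_out
      );
        logic [PREGS-1:0] history [CKPTS];
        always_ff @(posedge clk)
          history[wr_ckpt] <= cam_valid_in;    // block-load, every cycle
        // recovery is just a read of the saved bits back toward the CAMs
        always_comb
          cam_valid_out = restore ? history[rd_ckpt] : cam_valid_in;
      endmodule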
    <
    That can be accomplished by stalling the front end, waiting until the
    mispredicted branch reaches Retire, then copying the committed RAT,
    maintained by Retire, to the future RAT at Rename, and restarting the
    front end.
    The list of free physical registers is then all those that are not
    marked as architectural registers.
    <
    Sounds slow.
    <
    This is partly how I handle exceptions.

    Also you still need a mechanism to cancel start of execution of the
    subset of pending uOps for the purged set. You don't want to launch
    a LD or DIV from the mispredicted set if it has not already started.
    If you are using a reservation station design then you need some way
    to distribute the cancel request to the various FU's and RS's,
    and wait for them to clean themselves up.
    <
    I use the concept of an execution window to do this at both the
    reservation stations and the function units. There is an insert
    pointer and a consistent pointer; an RS is only allowed to launch when
    its instruction is between the two. FUs are only allowed to calculate
    so long as the instruction remains between these 2 pointers. The 2
    pointers (4 bits each) are broadcast around the machine every cycle.
    Each station and unit decides for itself.
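
    The between-check itself is cheap; a sketch with the stated 4-bit
    pointers (port names assumed):

      module in_window (
        input  logic [3:0] tag,        // this instruction's window tag
        input  logic [3:0] consistent, // oldest not-yet-certain instruction
        input  logic [3:0] insert,     // next tag to be allocated
        output logic       allowed     // ok to launch / keep calculating
      );
        // circular compare: distances from 'consistent', modulo 16
        assign allowed = (tag - consistent) < (insert - consistent);
      endmodule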

    Note that some things might not be able to cancel immediately,
    like an in-flight MUL in a pipeline or an outstanding LD to the cache.
    So some of this will be asynchronous (send cancel request, wait for ACK).
    <
    If an instruction that should not have its result delivered is delivered,
    it is delivered to the physical register it was assigned at its issue time.
    But since the value had not been delivered, that register is not in the
    pool of assignable registers, so no dependency has been created.
    <
    There are some other things that might need cleanup.
    A Return Stack Predictor might be manipulated by the mispredicted path.
    <
    Do these with a linked list and you can back up a mispredicted return
    to a mispredicted call.
    <
    Not sure how to handle that without a checkpoint.
    <
    Every (non-exceptional) flow-altering instruction needs a checkpoint.
    Predicated strings of instructions use a lightweight checkpoint;
    predicted branches use a heavyweight version.
    <
    Maybe have two copies like RAT, a future one maintained by Decode and
    a committed one maintained by Retire, and copy the committed to future.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Finch@21:1/5 to All on Wed Nov 15 01:21:16 2023
    Decided to shelve Thor2024 and begin work on Thor2025. While Thor2024 is
    very good there are a few issues with it. The ROB is used to store
    register values and that is effectively a CAM. It is not very resource efficient in an FPGA. I have been researching an x86 OoO implementation (https://www.stuffedcow.net/files/henry-thesis-phd.pdf ) done in an FPGA
    and it turns out to be considerably smaller than Thor. There are more
    efficient implementations for components than what is currently in use.

    Thor2025 will use a PRF approach although using a PRF seems large to me.
    To reduce the size and complexity of the register file, separate
    register files will be used for float and integer operations, along with separate register files for vector mask registers and subroutine link registers. This set of register files limits the GPR file to only 3
    write ports and 18 read ports to support all the functional units.
    Currently the register file is 10r2w.

    The trade-off is block RAM usage instead of LUTs.

    While having separate registers files seems like a step backwards, it
    should ultimately make the hardware more resource efficient. It does
    impact the ISA spec.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Wed Nov 15 19:11:31 2023
    Robert Finch wrote:

    Decided to shelve Thor2024 and begin work on Thor2025. While Thor2024 is
    very good there are a few issues with it. The ROB is used to store
    register values and that is effectively a CAM. It is not very resource efficient in an FPGA. I have been researching an x86 OoO implementation (https://www.stuffedcow.net/files/henry-thesis-phd.pdf ) done in an FPGA
    and it turns out to be considerably smaller than Thor. There are more efficient implementations for components than what is currently in use.

    Thor2025 will use a PRF approach although using a PRF seems large to me.
    <
    I have a PRF design I could show you--way too big for comp.arch and
    with the requisite figures.
    <
    To reduce the size and complexity of the register file, separate
    register files will be used for float and integer operations, along with separate register files for vector mask registers and subroutine link registers. This set of register files limits the GPR file to only 3
    write ports and 18 read ports to support all the functional units.
    Currently the register file is 10r2w.

    The trade-off is block RAM usage instead of LUTs.

    While having separate registers files seems like a step backwards, it
    should ultimately make the hardware more resource efficient. It does
    impact the ISA spec.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Finch@21:1/5 to MitchAlsup on Fri Nov 17 22:39:45 2023
    On 2023-11-15 2:11 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    Decided to shelve Thor2024 and begin work on Thor2025. While Thor2024
    is very good there are a few issues with it. The ROB is used to store
    register values and that is effectively a CAM. It is not very resource
    efficient in an FPGA. I have been researching an x86 OoO
    implementation (https://www.stuffedcow.net/files/henry-thesis-phd.pdf
    ) done in an FPGA and it turns out to be considerably smaller than
    Thor. There are more efficient implementations for components than
    what is currently in use.

    Thor2025 will use a PRF approach although using a PRF seems large to me.
    <
    I have a PRF design I could show you--way too big for comp.arch and
    with the requisite figures.
    <
    To reduce the size and complexity of the register file, separate
    register files will be used for float and integer operations, along
    with separate register files for vector mask registers and subroutine
    link registers. This set of register files limits the GPR file to only
    3 write ports and 18 read ports to support all the functional units.
    Currently the register file is 10r2w.

    The trade-off is block RAM usage instead of LUTs.

    While having separate registers files seems like a step backwards, it
    should ultimately make the hardware more resource efficient. It does
    impact the ISA spec.

    Still digesting the PRF diagram.

    Decided to go with a unified register file, 27r3w so far. Having
    separate register files would not reduce the number of read ports
    required and would add complexity to the processor.

    Loads, FPU operations and flow control (FCU) operations all share the
    third write port of the register file. The other two write ports are
    dedicated to the ALU results. I think this will be okay given <1% of instructions would be FCU updates. Loads are about 25%, and FPU depends
    on the application.

    The ALUs/FPU/Loads have five input operands including the 3 source
    operands, a target operand, and a mask register. Stores do not need a
    target operand. FCU ops are non-masked so do not need a mask register or
    target operand input.

    Not planning to implement the vector register file as it would be immense.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Finch@21:1/5 to Robert Finch on Sat Nov 18 05:58:42 2023
    On 2023-11-17 10:39 p.m., Robert Finch wrote:
    On 2023-11-15 2:11 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    Decided to shelve Thor2024 and begin work on Thor2025. While Thor2024
    is very good there are a few issues with it. The ROB is used to store
    register values and that is effectively a CAM. It is not very
    resource efficient in an FPGA. I have been researching an x86 OoO
    implementation (https://www.stuffedcow.net/files/henry-thesis-phd.pdf
    ) done in an FPGA and it turns out to be considerably smaller than
    Thor. There are more efficient implementations for components than
    what is currently in use.

    Thor2025 will use a PRF approach although using a PRF seems large to me.
    <
    I have a PRF design I could show you--way too big for comp.arch and
    with the requisite figures.
    <
    To reduce the size and complexity of the register file, separate
    register files will be used for float and integer operations, along
    with separate register files for vector mask registers and subroutine
    link registers. This set of register files limits the GPR file to
    only 3 write ports and 18 read ports to support all the functional
    units. Currently the register file is 10r2w.

    The trade-off is block RAM usage instead of LUTs.

    While having separate registers files seems like a step backwards, it
    should ultimately make the hardware more resource efficient. It does
    impact the ISA spec.

    Still digesting the PRF diagram.

    Decided to go with a unified register file, 27r3w so far. Having
    separate register files would not reduce the number of read ports
    required and would add complexity to the processor.

    Loads, FPU operations and flow control (FCU) operations all share the
    third write port of the register file. The other two write ports are dedicated to the ALU results. I think this will be okay given <1% of instructions would be FCU updates. Loads are about 25%, and FPU depends
    on the application.

    The ALUs/FPU/Loads have five input operands including the 3 source
    operands, a target operand, and a mask register. Stores do not need a
    target operand. FCU ops are non-masked so do not need a mask register or target operand input.

    Not planning to implement the vector register file as it would be immense.

    Changed the moniker of my current processor project from Thor to Qupls
    (Q-Plus). I wanted a five-letter name beginning with ‘Q’. For a moment
    I thought of calling it Quake but thought that would be too confusing.
    One must understand the magic behind name choices.

    The current design uses instruction postfixes of 32, 48, 80, and 144
    bits, which provide constants of 23, 39, 64, and 128 bits. Two bits in
    the instruction indicate the postfix size. The 64- and 128-bit
    constants have seven extra unused bits available, the fields available
    being 71 and 135 bits.
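
    A decode sketch of those sizes (the encoding of the two size bits is
    my assumption; the widths are from the text):

      module pfx_decode (
        input  logic [1:0]  pfx_size,             // two bits in the instruction
        output int unsigned pfx_bits, const_bits  // postfix / constant width
      );
        always_comb
          unique case (pfx_size)
            2'd0: begin pfx_bits = 32;  const_bits = 23;  end
            2'd1: begin pfx_bits = 48;  const_bits = 39;  end
            2'd2: begin pfx_bits = 80;  const_bits = 64;  end // 71 bits available
            2'd3: begin pfx_bits = 144; const_bits = 128; end // 135 bits available
          endcase
      endmodule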

    Somewhat ugly, but it is desired to keep instructions a multiple of 16
    bits in size. The shortest instruction is a NOP, which is 16 bits, so
    that it may be used for alignment.

    I almost switched to 96-bit floats, which seem appealing, but once
    again remembered that the progression of 32-, 64-, and 128-bit floats
    works very well for the float approximations.

    Branches are 48-bit, being a combination of a compare and a branch with
    a 24-bit target address field. Other flow control ops like JSR and JMP
    are also 48-bit to keep all flow controls at 48-bit for simplified decoding.

    Most instructions are 32-bits in size.

    Sticking with a 64-register unified register file.

    Removed the vector operations. There is enough play in the ISA to add
    them at a later date if desired.

    Loads and stores support two address modes, d(Rn) and d(Rn+Rm*Sc). The
    scaled-index address mode will likely be a 48-bit op.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Sat Nov 18 17:27:50 2023
    Robert Finch wrote:

    On 2023-11-15 2:11 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    Decided to shelve Thor2024 and begin work on Thor2025. While Thor2024
    is very good there are a few issues with it. The ROB is used to store
    register values and that is effectively a CAM. It is not very resource
    efficient in an FPGA. I have been researching an x86 OoO
    implementation (https://www.stuffedcow.net/files/henry-thesis-phd.pdf
    ) done in an FPGA and it turns out to be considerably smaller than
    Thor. There are more efficient implementations for components than
    what is currently in use.

    Thor2025 will use a PRF approach although using a PRF seems large to me.
    <
    I have a PRF design I could show you--way too big for comp.arch and
    with the requisite figures.
    <
    To reduce the size and complexity of the register file, separate
    register files will be used for float and integer operations, along
    with separate register files for vector mask registers and subroutine
    link registers. This set of register files limits the GPR file to only
    3 write ports and 18 read ports to support all the functional units.
    Currently the register file is 10r2w.

    The trade-off is block RAM usage instead of LUTs.

    While having separate registers files seems like a step backwards, it
    should ultimately make the hardware more resource efficient. It does
    impact the ISA spec.

    Still digesting the PRF diagram.

    The diagram is for a 6R6W PRF with a history table, ARN->PRN translation,
    Free pool pickers, and register ports. The X with a ½ box is a latch
    or flip-flop depending on the clocking that is put around the figure.
    It also includes the renamer {history table and free pool pickers}.

    Decided to go with a unified register file, 27r3w so far. Having
    separate register files would not reduce the number of read ports
    required and would add complexity to the processor.

    9 Reads per 1 write ?!?!?

    Loads, FPU operations and flow control (FCU) operations all share the
    third write port of the register file. The other two write ports are dedicated to the ALU results. I think this will be okay given <1% of instructions would be FCU updates. Loads are about 25%, and FPU depends
    on the application.

    The ALUs/FPU/Loads have five input operands including the 3 source
    operands, a target operand, and a mask register. Stores do not need a
    target operand. FCU ops are non-masked so do not need a mask register or target operand input.

    Not planning to implement the vector register file as it would be immense.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Finch@21:1/5 to MitchAlsup on Sat Nov 18 14:41:14 2023
    On 2023-11-18 12:27 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    On 2023-11-15 2:11 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    Decided to shelve Thor2024 and begin work on Thor2025. While
    Thor2024 is very good there are a few issues with it. The ROB is
    used to store register values and that is effectively a CAM. It is
    not very resource efficient in an FPGA. I have been researching an
    x86 OoO implementation
    (https://www.stuffedcow.net/files/henry-thesis-phd.pdf ) done in an
    FPGA and it turns out to be considerably smaller than Thor. There
    are more efficient implementations for components than what is
    currently in use.

    Thor2025 will use a PRF approach although using a PRF seems large to
    me.
    <
    I have a PRF design I could show you--way too big for comp.arch and
    with the requisite figures.
    <
    To reduce the size and complexity of the register file, separate
    register files will be used for float and integer operations, along
    with separate register files for vector mask registers and
    subroutine link registers. This set of register files limits the GPR
    file to only 3 write ports and 18 read ports to support all the
    functional units. Currently the register file is 10r2w.

    The trade-off is block RAM usage instead of LUTs.

    While having separate registers files seems like a step backwards,
    it should ultimately make the hardware more resource efficient. It
    does impact the ISA spec.

    Still digesting the PRF diagram.

    The diagram is for a 6R6W PRF with a history table, ARN->PRN translation, Free pool pickers, and register ports. The X with a ½ box is a latch
    or flip-flop depending on the clocking that is put around the figure.
    It also includes the renamer {history table and free pool pickers}.

    Decided to go with a unified register file, 27r3w so far. Having
    separate register files would not reduce the number of read ports
    required and would add complexity to the processor.

    9 Reads per 1 write ?!?!?

    Loads, FPU operations and flow control (FCU) operations all share the
    third write port of the register file. The other two write ports are
    dedicated to the ALU results. I think this will be okay given <1% of
    instructions would be FCU updates. Loads are about 25%, and FPU
    depends on the application.

    The ALUs/FPU/Loads have five input operands including the 3 source
    operands, a target operand, and a mask register. Stores do not need a
    target operand. FCU ops are non-masked so do not need a mask register
    or target operand input.

    Not planning to implement the vector register file as it would be
    immense.
    Freelist:

    I just used the find-first/find-last-one trick on a bit-list to pick a
    PR for an AR. It can provide PRs for two ARs per cycle. I have all the
    PRs from the ROB feeding into the list manager so that on a branch
    miss the PRs can be freed up (just the portion of the PRs associated
    with the miss is freed). Three discarded PRs from commit also feed
    into the list manager so they can be freed. It seems like a lot of
    logic translating the PR to a bit, and it seems a bit impractical to
    me to feed all the PRs from the ROB to the list manager. It can be
    done with the smallish 16-entry ROB, but for a larger ROB the free may
    have to be split up or another means found.
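
    A minimal sketch of that picker for an assumed 64-entry PR pool: one
    scan finds the first free bit, the other the last, so two PRs can be
    handed out per cycle.

      module freelist_pick2 #(parameter PREGS = 64) (
        input  logic [PREGS-1:0]         free,  // 1 = physical register free
        output logic [$clog2(PREGS)-1:0] pick0, // find-first-one
        output logic [$clog2(PREGS)-1:0] pick1, // find-last-one
        output logic                     ok0, ok1
      );
        always_comb begin
          pick0 = '0;
          pick1 = '0;
          for (int i = PREGS-1; i >= 0; i--)
            if (free[i]) pick0 = i[$clog2(PREGS)-1:0]; // lowest set bit wins
          for (int i = 0; i < PREGS; i++)
            if (free[i]) pick1 = i[$clog2(PREGS)-1:0]; // highest set bit wins
        end
        assign ok0 = |free;                     // at least one PR free
        assign ok1 = ok0 && (pick0 != pick1);   // at least two PRs free
      endmodule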

    RAT:

    A register alias table is being used to track the mappings of ARs to
    PRs. It uses two maps: speculative and committed. On instruction
    enqueue, speculative mappings are updated; on commit, committed
    mappings are updated; and on a pipeline flush, the committed map is
    copied to the speculative one.

    Register file:

    I’ve reduced the number of read ports by not supporting the vector
    stuff. There are only 18 read ports: six groups of three.

    ROB:
    The ROB acts like a CAM to store both the aRN and pRN for the target
    register. The aRN is needed to know which previous pRN to free on
    commit. For source operands only the pRN is stored.
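
    An assumed shape for such an entry (field widths illustrative: 64 ARs
    and a 128-entry PRF; not the actual Q+ layout):

      typedef struct packed {
        logic [5:0] aRN;     // architectural target; used at commit to find
                             // the previous pRN for that AR and free it
        logic [6:0] pRN;     // physical target allocated at rename
        logic [6:0] pRNsrc1; // sources carry only physical numbers
        logic [6:0] pRNsrc2;
        logic [6:0] pRNsrc3;
        logic       v;       // entry valid
      } rob_entry_t;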

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Finch@21:1/5 to All on Fri Nov 24 19:32:09 2023
    On 2023-11-18 2:41 p.m., Robert Finch wrote:
    Q+ uses 64kB memory pages containing 8192 PTEs to map memory. A single
    64kB page can handle 512MB of mappings. Tonight’s trade-off is how many
    root pointers to support. With a 12-bit ASID, 4096 root pointers are
    required to link to the mapping tables with one root pointer for each
    address space. A 512 MB space is probably sufficient for a large number
    of apps. That means a TLB update needs only a single root pointer
    lookup followed by a lookup of the translation from a single memory
    page.
    Not much for the table walker to do. The 4096 root pointers use two
    block RAMs and require an 8192-byte address space for update assuming a
    32-bit physical address space (a 16-bit root page number).
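
    The arithmetic behind those numbers, written out as an assumed address
    split: a 64kB page holds 8192 PTEs, so each PTE is 8 bytes, and 8192
    PTEs x 64kB per page = 512MB mapped per table page.

      // Illustrative low-order split of a Q+ virtual address (my naming):
      typedef struct packed {
        logic [12:0] pte_index;   // 13 bits: one of 8192 PTEs in the page
        logic [15:0] page_offset; // 16 bits: offset within a 64kB page
      } va_low29_t;               // 13 + 16 = 29 bits = 512MB of mappings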

    An IO mapped area of 64kB is available for root pointer memory. 16 block
    RAMs could be set up in this area, which would allow 8 root pointers for
    each address space. Three bits of the virtual address space could then
    be mapped using root pointers. If the root pointer just points to a
    single level of page tables, then a 4GB (32-bit) space could be mapped.
    I am mulling over whether it is worth it to support the additional root pointers. It is a chunk of block RAM memory that might be better spent elsewhere.

    If I use an 11-bit ASID, all the root pointers could be present in a
    single block RAM. So the design choices are an 11- or 12-bit ASID, and
    1 or 8 root pointers per address space.

    My thought is to have only a single root pointer per space, and to
    organize the root pointer table as if there were 32 bits for the
    pointer. This would allow a 48-bit physical address space in which to
    place the mapping tables. The RAM could be mapped so that the
    high-order bits of the pointer are assumed to be zero. The system
    could get by using a single block RAM if the mapping tables' location
    were restricted to a 16MB address range; eight-bit pointers could be
    used then.

    Given that it is a small system, with only 512MB of DRAM, I think it
    best to keep the page-table-walker simple, and use the minimum amount of
    BRAM (1).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Sat Nov 25 01:00:29 2023
    Robert Finch wrote:

    On 2023-11-18 2:41 p.m., Robert Finch wrote:
    Q+ uses 64kB memory pages containing 8192 PTEs to map memory. A single
    64kB page can handle 512MB of mappings. Tonight’s trade-off is how many root pointers to support. With a 12-bit ASID, 4096 root pointers are
    required to link to the mapping tables with one root pointer for each
    address space.

    So, you associate a single ROOT pointer VALUE with an ASID, and manage
    in SW who gets that ROOT pointer VALUE; using ASID as an index into
    Virtual Address Spaces.

    How is this usefully different than only using the ASID to qualify TLB
    results ?? <Was this TLB entry installed from the same ASID as is
    accessing right now>. And using ASID as an index into any array might
    lead to some conundrum down the road a piece.

    Secondarily, SUN started out with a 12-bit ASID and went to 16 bits
    just about as fast as they could--even before main memories went
    bigger than 4GB.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Finch@21:1/5 to MitchAlsup on Fri Nov 24 21:28:25 2023
    On 2023-11-24 8:00 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    On 2023-11-18 2:41 p.m., Robert Finch wrote:
    Q+ uses 64kB memory pages containing 8192 PTEs to map memory. A single
    64kB page can handle 512MB of mappings. Tonight’s trade-off is how
    many root pointers to support. With a 12-bit ASID, 4096 root pointers
    are required to link to the mapping tables with one root pointer for
    each address space.

    So, you associate a single ROOT pointer VALUE with an ASID, and manage
    in SW who gets that ROOT pointer VALUE; using ASID as an index into
    Virtual Address Spaces.

    How is this usefully different than only using the ASID to qualify TLB
    results ?? <Was this TLB entry installed from the same ASID as is
    accessing right now>. And using ASID as an index into any array might
    lead to some conundrum down the road a piece.

    Secondarily, SUN started out with 12-bit ASID and went to 16-bits just
    about
    as fast as they could--even before main memories went bigger than 4GB.

    I view the address space as an entity in its own right to be managed
    by the MMU. ASIDs and address spaces should be mapped 1:1. The ASID
    that identifies the address space has a life outside of just the TLB.
    I may be increasing the typical scope of an ASID.

    It is the same idea as using the ASID to qualify TLB entries, except
    that it qualifies the root pointer as well. So, the root pointer does
    not need to be switched by software. Once the root pointer is set for
    the AS it simply sits there statically until the AS is reused.

    I am using the ASID like a process ID. So, the root pointer register
    does not need to be reset on a task switch. Address spaces may not be
    mapped 1:1 with processes. An address space may outlive a task if it is
    shared with another task. So, I do not want to use the PID to
    distinguish tables. This assumes the address space will not be freed up
    and reused by another task while there are still tasks using the ASID.

    4096 address spaces is a lot. But if using a 16-bit ASID it would no
    longer be practical to store a root pointer per ASID in a table.
    Instead, the root pointer would have to be managed by software as is
    normally done.

    I am wondering why the 16-bit ASID? 256 address spaces in 256
    processes? I suspect it is just because 16 bits is easier to pass
    around/calculate in a HLL than some other value like 14 bits. Are
    65536 address spaces really needed?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Robert Finch on Fri Nov 24 21:16:43 2023
    On 11/24/2023 8:28 PM, Robert Finch wrote:
    On 2023-11-24 8:00 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    On 2023-11-18 2:41 p.m., Robert Finch wrote:
    Q+ uses 64kB memory pages containing 8192 PTEs to map memory. A
    single 64kB page can handle 512MB of mappings. Tonight’s trade-off is
    how many root pointers to support. With a 12-bit ASID, 4096 root
    pointers are required to link to the mapping tables with one root
    pointer for each address space.

    So, you associate a single ROOT pointer VALUE with an ASID, and manage
    in SW who gets that ROOT pointer VALUE; using ASID as an index into
    Virtual Address Spaces.

    How is this usefully different than only using the ASID to qualify TLB
    results ?? <Was this TLB entry installed from the same ASID as is
    accessing right now>. And using ASID as an index into any array might
    lead to some conundrum down the road a piece.

    Secondarily, SUN started out with 12-bit ASID and went to 16-bits just
    about
    as fast as they could--even before main memories went bigger than 4GB.

    I view the address space as an entity in its own right to be managed
    by the MMU. ASIDs and address spaces should be mapped 1:1. The ASID
    that identifies the address space has a life outside of just the TLB.
    I may be increasing the typical scope of an ASID.

    It is the same idea as using the ASID to qualify TLB entries, except
    that it qualifies the root pointer as well. So, the root pointer does
    not need to be switched by software. Once the root pointer is set for
    the AS it simply sits there statically until the AS is reused.

    I am using the ASID like a process ID. So, the root pointer register
    does not need to be reset on a task switch. Address spaces may not be
    mapped 1:1 with processes. An address space may outlive a task if it is shared with another task. So, I do not want to use the PID to
    distinguish tables. This assumes the address space will not be freed up
    and reused by another task while there are still tasks using the ASID.

    4096 address spaces is a lot. But if using a 16-bit ASID it would no
    longer be practical to store a root pointer per ASID in a table.
    Instead, the root pointer would have to be managed by software as is
    normally done.

    I am wondering why the 16-bit ASID? 256 address spaces in 256
    processes? I suspect it is just because 16 bits is easier to pass
    around/calculate in a HLL than some other value like 14 bits. Are
    65536 address spaces really needed?


    If one assumes one address space per PID, then one is going to hit a
    limit of 4K a lot faster than 64K, and when one hits the limit, there is
    no good way to "reclaim" previously used address spaces short of
    flushing the TLB to be sure that no entries from that space remain in
    the TLB (ASID thrashing is likely to be relatively expensive to deal
    with as a result).



    Well, along with other things, like if/how to allow "Global" pages:
    True global pages are likely a foot gun, as there is no way to exclude
    them from a given process (where there may be a need to do so);
    Disallowing global pages entirely means higher TLB miss rates because no processes can share TLB entries.

    One option seems to be, say, that a few of the high-order bits of the
    ASID could be used as a "page group", with global pages only applying
    within a single page-group (possibly with one of the page groups being designated as "No global pages allowed").

    ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Finch@21:1/5 to BGB on Fri Nov 24 22:48:35 2023
    On 2023-11-24 10:16 p.m., BGB wrote:
    On 11/24/2023 8:28 PM, Robert Finch wrote:
    On 2023-11-24 8:00 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    On 2023-11-18 2:41 p.m., Robert Finch wrote:
    Q+ uses 64kB memory pages containing 8192 PTEs to map memory. A
    single 64kB page can handle 512MB of mappings. Tonight’s trade-off
    is how many root pointers to support. With a 12-bit ASID, 4096 root
    pointers are required to link to the mapping tables with one root
    pointer for each address space.

    So, you associate a single ROOT pointer VALUE with an ASID, and manage
    in SW who gets that ROOT pointer VALUE; using ASID as an index into
    Virtual Address Spaces.

    How is this usefully different than only using the ASID to qualify TLB
    results ?? <Was this TLB entry installed from the same ASID as is
    accessing right now>. And using ASID as an index into any array might
    lead to some conundrum down the road a piece.

    Secondarily, SUN started out with 12-bit ASID and went to 16-bits
    just about
    as fast as they could--even before main memories went bigger than 4GB.

    I view the address space as an entity in its own right to be managed
    by the MMU. ASIDs and address spaces should be mapped 1:1. The ASID
    that identifies the address space has a life outside of just the TLB.
    I may be increasing the typical scope of an ASID.

    It is the same idea as using the ASID to qualify TLB entries, except
    that it qualifies the root pointer as well. So, the root pointer does
    not need to be switched by software. Once the root pointer is set for
    the AS it simply sits there statically until the AS is reused.

    I am using the ASID like a process ID. So, the root pointer register
    does not need to be reset on a task switch. Address spaces may not be
    mapped 1:1 with processes. An address space may outlive a task if it
    is shared with another task. So, I do not want to use the PID to
    distinguish tables. This assumes the address space will not be freed up
    and reused by another task while there are still tasks using the ASID.

    4096 address spaces is a lot. But if using a 16-bit ASID it would no
    longer be practical to store a root pointer per ASID in a table.
    Instead, the root pointer would have to be managed by software as is
    normally done.

    I am wondering why the 16-bit ASID? 256 address spaces in 256
    processes? I suspect it is just because 16 bits is easier to pass
    around/calculate in a HLL than some other value like 14 bits. Are
    65536 address spaces really needed?


    If one assumes one address space per PID, then one is going to hit a
    limit of 4K a lot faster than 64K, and when one hits the limit, there is
    no good way to "reclaim" previously used address spaces short of
    flushing the TLB to be sure that no entries from that space remain in
    the TLB (ASID thrashing is likely to be relatively expensive to deal
    with as a result).

    I see after reading several webpages that the root pointer is used to
    point to only a single table for a process. This is not how I was doing
    things. I have MMU tables for each address space, as opposed to having
    a table for the process. The process may have only a single address
    space, or it may use several address spaces.

    I am wondering why there is only a single table per process.


    Well, along with other things, like if/how to allow "Global" pages:
    True global pages are likely a foot gun, as there is no way to exclude
    them from a given process (where there may be a need to do so);
    Disallowing global pages entirely means higher TLB miss rates because no processes can share TLB entries.

    Global space can be assigned by designating an address space as a
    global space and giving it an ASID. All processes wanting access to
    the global space need only then use the MMU table for that ASID. E.g.,
    use ASID 0 for the global address space.

    One option seems to be, say, that a few of the high-order bits of the
    ASID could be used as a "page group", with global pages only applying
    within a single page-group (possibly with one of the page groups being designated as "No global pages allowed").

    ...


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Sat Nov 25 17:11:09 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Robert Finch wrote:

    On 2023-11-18 2:41 p.m., Robert Finch wrote:
    Q+ uses 64kB memory pages containing 8192 PTEs to map memory. A single
    64kB page can handle 512MB of mappings. Tonight’s trade-off is how many
    root pointers to support. With a 12-bit ASID, 4096 root pointers are
    required to link to the mapping tables with one root pointer for each
    address space.

    So, you associate a single ROOT pointer VALUE with an ASID, and manage
    in SW who gets that ROOT pointer VALUE; using ASID as an index into
    Virtual Address Spaces.

    How is this usefully different than only using the ASID to qualify TLB
    results ?? <Was this TLB entry installed from the same ASID as is
    accessing right now>. And using ASID as an index into any array might
    lead to some conundrum down the road a piece.

    Secondarily, SUN started out with a 12-bit ASID and went to 16 bits
    just about as fast as they could--even before main memories went
    bigger than 4GB.

    Yeah, ARMv8 was originally 8-bit, and added 16 even before the spec was dry.

    I don't see a benefit to tying the ASID (or VMID for that matter) to
    the root of the page table. Especially with the common split
    address spaces (ARMv8 has a root pointer for each half of the VA space,
    for example, where the upper half is shared by all schedulable entities).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Robert Finch on Sat Nov 25 17:20:34 2023
    Robert Finch <robfi680@gmail.com> writes:
    On 2023-11-24 10:16 p.m., BGB wrote:
    On 11/24/2023 8:28 PM, Robert Finch wrote:
    On 2023-11-24 8:00 p.m., MitchAlsup wrote:


    If one assumes one address space per PID, then one is going to hit a
    limit of 4K a lot faster than 64K, and when one hits the limit, there is
    no good way to "reclaim" previously used address spaces short of
    flushing the TLB to be sure that no entries from that space remain in
    the TLB (ASID thrashing is likely to be relatively expensive to deal
    with as a result).

    I see after reading several webpages that the root pointer is used to
    point to only a single table for a process. This is not how I was doing
    things. I have MMU tables for each address space, as opposed to having
    a table for the process. The process may have only a single address
    space, or it may use several address spaces.

    I am wondering why there is only a single table per process.

    There are actually two in most operating systems - the lower half
    of the VA space is owned by the user-mode code in the process and
    the upper half is shared by all processes and used by the
    operating system on behalf of the process. For Intel/AMD, the
    kernel manages both halves; for ARMv8, each half has a completely
    distinct and separate root pointer (at each exception level).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Robert Finch on Sat Nov 25 17:16:55 2023
    Robert Finch <robfi680@gmail.com> writes:
    On 2023-11-24 8:00 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    On 2023-11-18 2:41 p.m., Robert Finch wrote:
    Q+ uses 64kB memory pages containing 8192 PTEs to map memory. A single
    64kB page can handle 512MB of mappings. Tonight’s trade-off is how
    many root pointers to support. With a 12-bit ASID, 4096 root pointers
    are required to link to the mapping tables with one root pointer for
    each address space.

    So, you associate a single ROOT pointer VALUE with an ASID, and manage
    in SW who gets that ROOT pointer VALUE; using ASID as an index into
    Virtual Address Spaces.

    How is this usefully different than only using the ASID to qualify TLB
    results ?? <Was this TLB entry installed from the same ASID as is
    accessing right now>. And using ASID as an index into any array might
    lead to some conundrum down the road a piece.

    Secondarily, SUN started out with 12-bit ASID and went to 16-bits just
    about
    as fast as they could--even before main memories went bigger than 4GB.

    I view the address space as an entity in its own right to be managed
    by the MMU. ASIDs and address spaces should be mapped 1:1. The ASID
    that identifies the address space has a life outside of just the TLB.
    I may be increasing the typical scope of an ASID.

    It is the same idea as using the ASID to qualify TLB entries, except
    that it qualifies the root pointer as well. So, the root pointer does
    not need to be switched by software. Once the root pointer is set for
    the AS it simply sits there statically until the AS is reused.

    I am using the ASID like a process ID. So, the root pointer register
    does not need to be reset on a task switch. Address spaces may not be
    mapped 1:1 with processes. An address space may outlive a task if it
    is shared with another task. So, I do not want to use the PID to
    distinguish tables. This assumes the address space will not be freed up
    and reused by another task while there are still tasks using the ASID.

    4096 address spaces is a lot. But if using a 16-bit ASID it would no
    longer be practical to store a root pointer per ASID in a table.
    Instead, the root pointer would have to be managed by software as is
    normally done.

    I am wondering why the 16-bit ASID? 256 address spaces in 256
    processes? I suspect it is just because 16 bits is easier to pass
    around/calculate in a HLL than some other value like 14 bits. Are
    65536 address spaces really needed?


    256 is far too small.

    $ ps -ef | wc -l
    709

    Every time the ASID overflows, the system must basically flush
    all the caches system-wide. On an 80-processor system, that's a lot of overhead.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Robert Finch on Sat Nov 25 11:59:42 2023
    On 11/24/2023 9:48 PM, Robert Finch wrote:
    On 2023-11-24 10:16 p.m., BGB wrote:
    On 11/24/2023 8:28 PM, Robert Finch wrote:
    On 2023-11-24 8:00 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    On 2023-11-18 2:41 p.m., Robert Finch wrote:
    Q+ uses 64kB memory pages containing 8192 PTEs to map memory. A
    single 64kB page can handle 512MB of mappings. Tonight’s trade-off is
    how many root pointers to support. With a 12-bit ASID, 4096 root
    pointers are required to link to the mapping tables with one root
    pointer for each address space.

    So, you associate a single ROOT pointer VALUE with an ASID, and manage
    in SW who gets that ROOT pointer VALUE; using ASID as an index into
    Virtual Address Spaces.

    How is this usefully different than only using the ASID to qualify TLB
    results ?? <Was this TLB entry installed from the same ASID as is
    accessing right now>. And using ASID as an index into any array might
    lead to some conundrum down the road a piece.

    Secondarily, SUN started out with 12-bit ASID and went to 16-bits
    just about
    as fast as they could--even before main memories went bigger than 4GB.

    I view the address space as an entity in its own right to be managed
    by the MMU. ASIDs and address spaces should be mapped 1:1. The ASID
    that identifies the address space has a life outside of just the TLB.
    I may be increasing the typical scope of an ASID.

    It is the same idea as using the ASID to qualify TLB entries, except
    that it qualifies the root pointer as well. So, the root pointer does
    not need to be switched by software. Once the root pointer is set for
    the AS it simply sits there statically until the AS is reused.

    I am using the ASID like a process ID. So, the root pointer register
    does not need to be reset on a task switch. Address spaces may not be
    mapped 1:1 with processes. An address space may outlive a task if it
    is shared with another task. So, I do not want to use the PID to
    distinguish tables. This assumes the address space will not be freed up
    and reused by another task while there are still tasks using the ASID.

    4096 address spaces is a lot. But if using a 16-bit ASID it would no
    longer be practical to store a root pointer per ASID in a table.
    Instead, the root pointer would have to be managed by software as is
    normally done.

    I am wondering why the 16-bit ASID? 256 address spaces in 256
    processes? I suspect it is just because 16 bits is easier to pass
    around/calculate in a HLL than some other value like 14 bits. Are
    65536 address spaces really needed?


    If one assumes one address space per PID, then one is going to hit a
    limit of 4K a lot faster than 64K, and when one hits the limit, there
    is no good way to "reclaim" previously used address spaces short of
    flushing the TLB to be sure that no entries from that space remain in
    the TLB (ASID thrashing is likely to be relatively expensive to deal
    with as a result).

    I see after reading several webpages that the root pointer is used to
    point to only a single table for a process. This is not how I was doing
    things. I have MMU tables for each address space, as opposed to having
    a table for the process. The process may have only a single address
    space, or it may use several address spaces.

    I am wondering why there is only a single table per process.


    I went the opposite route of one big address space, with the idea of
    allowing memory protection within this address space via the VUGID/ACL mechanism. There is a KRR, or Keyring Register, which holds up to 4 keys
    that may be used for ACL checking, granting an access if it is allowed
    by at least one of the keys; triggering an ISR on miss similar to the
    TLB. In this case, the conceptual model is more similar to that
    typically used in filesystems.

    But, I also have a 16-bit ASID...

    As-is, there is at most one set of page tables per address space, or per-process if processes are given different address spaces.




    Well, along with other things, like if/how to allow "Global" pages:
    True global pages are likely a foot gun, as there is no way to exclude
    them from a given process (where there may be a need to do so);
    Disallowing global pages entirely means higher TLB miss rates because
    no processes can share TLB entries.

    Global space can be assigned by designating an address space as a
    global space and giving it an ASID. All processes wanting access to
    the global space need only then use the MMU table for that ASID. E.g.,
    use ASID 0 for the global address space.


    Had considered this, but there is a problem:
    What if you have a process that you *don't* want to be able to see into
    this global space?...

    Though, this is where the idea of page-grouping can come in, say, the
    ASID becomes:
    gggg-pppp-pppp-pppp

    Where:
    0000 is visible to all of 0zzz
    1000 is visible to all of 1zzz
    ...
    Except:
    Fzzz, this group does not have any global pages (all one-off ASIDs).

    Or, possibly also, a 2/14-bit split.
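
    A sketch of a TLB match under this scheme (field names assumed): hit
    on an exact ASID match, or on a global entry within the same group,
    except in the reserved no-globals group.

      module tlb_asid_match (
        input  logic [15:0] cur_asid,   // gggg-pppp-pppp-pppp
        input  logic [15:0] ent_asid,   // ASID stored in the TLB entry
        input  logic        ent_global, // entry marked global
        output logic        match
      );
        wire same_group = (cur_asid[15:12] == ent_asid[15:12]);
        wire no_globals = (cur_asid[15:12] == 4'hF); // Fzzz: no global pages
        assign match = (cur_asid == ent_asid)
                    || (ent_global && same_group && !no_globals);
      endmodule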


    One option seems to be, say, that a few of the high-order bits of the
    ASID could be used as a "page group", with global pages only applying
    within a single page-group (possibly with one of the page groups being
    designated as "No global pages allowed").

    ...




    Meanwhile:
    I went and bought 128GB of RAM, only to realize my PC doesn't work if
    one tries to install the full 128GB (the BIOS boot-loops a bunch of
    times, and then apparently concludes that there is only 3.5GB ...).

    Does work at least if I install 3x 32GB sticks and 1x 16GB stick, giving
    112GB. This breaks the pairing rules, but seems to be working.

    ...

    Had I known this, could have spent half as much, and only upgraded to 96GB.



    Seemingly MOBO/BIOS/... designers didn't anticipate someone sticking a
    full 128GB in this thing?... (BIOS is dated from 2018).

    Well, either this, or a hardware compatibility issue with one of the
    cards?...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Sat Nov 25 19:31:13 2023
    Robert Finch wrote:

    On 2023-11-24 8:00 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    On 2023-11-18 2:41 p.m., Robert Finch wrote:
    Q+ uses 64kB memory pages containing 8192 PTEs to map memory. A single
    64kB page can handle 512MB of mappings. Tonight’s trade-off is how
    many root pointers to support. With a 12-bit ASID, 4096 root pointers
    are required to link to the mapping tables with one root pointer for
    each address space.

    So, you associate a single ROOT pointer VALUE with an ASID, and manage
    in SW who gets that ROOT pointer VALUE; using ASID as an index into
    Virtual Address Spaces.

    How is this usefully different than only using the ASID to qualify TLB
    results ?? <Was this TLB entry installed from the same ASID as is
    accessing right now>. And using ASID as an index into any array might
    lead to some conundrum down the road a piece.

    Secondarily, SUN started out with 12-bit ASID and went to 16-bits just
    about
    as fast as they could--even before main memories went bigger than 4GB.

    I view the address space as an entity in its own right to be managed
    by the MMU. ASIDs and address spaces should be mapped 1:1. The ASID
    that identifies the address space has a life outside of just the TLB.
    I may be increasing the typical scope of an ASID.

    Consider the case where two different processes MMAP the same area
    of memory.
    Should they both end up using the same ASID ??
    Should they both take extra TLB walks because they use different ASIDs ??
    Should they use their own ASIDs for their own memory but a different
    ASID for the shared memory ?? And how do you expect this to happen ??

    It is the same idea as using the ASID to qualify TLB entries, except
    that it qualifies the root pointer as well. So, the root pointer does
    not need to be switched by software. Once the root pointer is set for
    the AS it simply sits there statically until the AS is reused.

    I am using the ASID like a process ID. So, the root pointer register
    does not need to be reset on a task switch. Address spaces may not be
    mapped 1:1 with processes. An address space may outlive a task if it is shared with another task. So, I do not want to use the PID to
    distinguish tables. This assumes the address space will not be freed up
    and reused by another task, if there are tasks using the ASID.

    4096 address spaces is a lot. But if using a 16-bit ASID it would no
    longer be practical to store a root pointer per ASID in a table.
    Instead, the root pointer would have to be managed by software as is
    normally done.
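
    As a sketch of the intent (names hypothetical; this is what the
    hardware walker would conceptually do, not an actual implementation):

        /* A 12-bit ASID indexes a 4096-entry root-pointer table, so
           software installs a root pointer once per address space and
           never reloads a root-pointer register on a task switch. */
        #include <stdint.h>

        #define NUM_ASIDS 4096                 /* 12-bit ASID */
        static uint64_t root_ptr[NUM_ASIDS];

        /* done once, when the address space is created (or reused) */
        void as_create(uint16_t asid, uint64_t map_table_pa)
        {
            root_ptr[asid & (NUM_ASIDS - 1)] = map_table_pa;
        }

        /* on a TLB miss, the walker picks its root by ASID */
        uint64_t walk_root(uint16_t asid)
        {
            return root_ptr[asid & (NUM_ASIDS - 1)];
        }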

    I am wondering why the 16-bit ASID? 256 address spaces in 256
    processes? I suspect it is just because 16 bits is easier to pass
    around/calculate in an HLL than some other width like 14 bits. Are
    65536 address spaces really needed?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Sat Nov 25 19:44:13 2023
    Scott Lurndal wrote:

    Robert Finch <robfi680@gmail.com> writes:
    On 2023-11-24 10:16 p.m., BGB wrote:
    On 11/24/2023 8:28 PM, Robert Finch wrote:
    On 2023-11-24 8:00 p.m., MitchAlsup wrote:


    If one assumes one address space per PID, then one is going to hit a
    limit of 4K a lot faster than 64K, and when one hits the limit, there is
    no good way to "reclaim" previously used address spaces short of
    flushing the TLB to be sure that no entries from that space remain in
    the TLB (ASID thrashing is likely to be relatively expensive to deal
    with as a result).

    I see after reading several webpages that the root pointer is used to
    point to only a single table for a process. This is not how I was doing
    things. I have MMU tables for each address space as opposed to having
    a table for the process. The process may have only a single address
    space, or it may use several address spaces.

    I am wondering why there is only a single table per process.

    There are actually two in most operating systems - the lower half
    of the VA space is owned by the user-mode code in the process and
    the upper half is shared by all processes and used by the
    operating system on behalf of the process. For Intel/AMD, the
    kernel manages both halves; for ARMv8, each half has a completely
    distinct and separate root pointer (at each exception level).


    My 66000 Architecture has 4 Root Pointers available at all instants
    of time. The above was designed before the rise of HyperVisors and is
    now showing its age problems. All 4 Root Pointers are used based on
    privilege level::

                      HOB=0                HOB=1
    Application ::    Application 2-level  No Access
    Guest OS    ::    Application 2-level  Guest OS 2-level
    Guest HV    ::    Guest HV 1-level     Guest OS 2-level
    Real HV     ::    Guest HV 1-level     Real HV 1-level

    The overhead of Application to Application is no higher than that
    of Guest OS to a different Guest OS--whereas on machines with
    VMENTER and VMEXIT a Guest OS switch takes 10,000 cycles and
    Application to Application is closer to 1,000 cycles. I want this
    down in the 10-100 cycle range.

    The exception <stack> system is designed to allow Guest HV to
    recover a Guest OS that takes page faults while servicing ISRs
    (and the like).

    The interrupt <stack> system is designed to allow the ISR to
    RPC or softIRQ without having to look at the pending stack on
    the way out. RTI looks at the pending stack and services the
    highest pending PRC/softIRQ affinitized to the CPU with control.

    The Interrupt dispatch system allows the CPU to continue running
    instructions until the contending CPUs decide which interrupt
    is claimed by which CPU (1::1) and then context switch do the
    interrupt dispatcher.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Sat Nov 25 20:02:15 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Robert Finch wrote:

    On 2023-11-24 8:00 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    On 2023-11-18 2:41 p.m., Robert Finch wrote:
    Q+ uses 64kB memory pages containing 8192 PTEs to map memory. A single
    64kB page can handle 512MB of mappings. Tonight’s trade-off is how
    many root pointers to support. With a 12-bit ASID, 4096 root pointers
    are required to link to the mapping tables with one root pointer for
    each address space.

    So, you associate a single ROOT pointer VALUE with an ASID, and manage
    in SW who gets that ROOT pointer VALUE; using ASID as an index into
    Virtual Address Spaces.

    How is this usefully different than only using the ASID to qualify TLB
    results ?? <Was this TLB entry installed from the same ASID as is
    accessing right now>. And using ASID as an index into any array might
    lead to some conundrum down the road apiece.

    Secondarily, SUN started out with 12-bit ASID and went to 16-bits just
    about as fast as they could--even before main memories went bigger than
    4GB.

    I view the address space as an entity in its own right to be managed by
    the MMU. ASIDs and address spaces should be mapped 1:1. The ASID that
    identifies the address space has a life outside of just the TLB. I may
    be increasing the typical scope of an ASID.

    Consider the case where two different processes MMAP the same area
    of memory.

    In which case, the area of memory would be mapped to different
    virtual address ranges in each process, and thus naturally
    consume two TLBs.

    FWIW, MAP_FIXED is specified as an optional feature by POSIX
    and may not be supported by the OS at all.

    Given various forms of ASLR being used, it's unlikely even in
    two instances of the same executable that a call to mmap
    with MAP_SHARED without MAP_FIXED would map the region at
    the same virtual address in both processes.

    Should they both end up using the same ASID ??

    They couldn't share an ASID assuming the TLB looks up by VA.

    Should they both take extra TLB walks because they use different ASIDs ??

    Given the above, yes. It's likely they'll each be scheduled
    on different cores anyway in any modern system.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Sat Nov 25 20:40:11 2023
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Robert Finch wrote:

    On 2023-11-24 8:00 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    On 2023-11-18 2:41 p.m., Robert Finch wrote:
    Q+ uses 64kB memory pages containing 8192 PTEs to map memory. A single
    64kB page can handle 512MB of mappings. Tonight’s trade-off is how
    many root pointers to support. With a 12-bit ASID, 4096 root pointers
    are required to link to the mapping tables with one root pointer for
    each address space.

    So, you associate a single ROOT pointer VALUE with an ASID, and manage
    in SW who gets that ROOT pointer VALUE; using ASID as an index into
    Virtual Address Spaces.

    How is this usefully different than only using the ASID to qualify TLB
    results ?? <Was this TLB entry installed from the same ASID as is
    accessing right now>. And using ASID as an index into any array might
    lead to some conundrum down the road apiece.

    Secondarily, SUN started out with 12-bit ASID and went to 16-bits just
    about as fast as they could--even before main memories went bigger than
    4GB.

    I view the address space as an entity in its own right to be managed by
    the MMU. ASIDs and address spaces should be mapped 1:1. The ASID that
    identifies the address space has a life outside of just the TLB. I may
    be increasing the typical scope of an ASID.

    Consider the case where two different processes MMAP the same area
    of memory.

    In which case, the area of memory would be mapped to different
    virtual address ranges in each process, and thus naturally
    consume two TLBs.

    MMAP() first, fork() second. Now we have 2 processes with the
    memory mapped shared memory at the same address.
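
    Concretely, with standard POSIX calls:

        /* mmap() first, fork() second: parent and child then see the
           shared region at the same virtual address. */
        #include <stdio.h>
        #include <sys/mman.h>
        #include <sys/wait.h>
        #include <unistd.h>

        int main(void)
        {
            int *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) { perror("mmap"); return 1; }

            if (fork() == 0) {     /* child: same VA, same memory */
                p[0] = 42;
                _exit(0);
            }
            wait(NULL);
            printf("parent sees %d at %p\n", p[0], (void *)p);
            return 0;
        }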

    FWIW, MAP_FIXED is specified as an optional feature by POSIX
    and may not be supported by the OS at all.

    Given various forms of ASLR being used, it's unlikely even in
    two instances of the same executable that a call to mmap
    with MAP_SHARED without MAP_FIXED would map the region at
    the same virtual address in both processes.

    Should they both end up using the same ASID ??

    They couldn't share an ASID assuming the TLB looks up by VA.

    Should they both take extra TLB walks because they use different ASIDs ??

    Given the above, yes. It's likely they'll each be scheduled
    on different cores anyway in any modern system.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Sat Nov 25 21:55:04 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Robert Finch wrote:

    On 2023-11-24 8:00 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    On 2023-11-18 2:41 p.m., Robert Finch wrote:
    Q+ uses 64kB memory pages containing 8192 PTEs to map memory. A single
    64kB page can handle 512MB of mappings. Tonight’s trade-off is how
    many root pointers to support. With a 12-bit ASID, 4096 root pointers
    are required to link to the mapping tables with one root pointer for
    each address space.

    So, you associate a single ROOT pointer VALUE with an ASID, and manage
    in SW who gets that ROOT pointer VALUE; using ASID as an index into
    Virtual Address Spaces.

    How is this usefully different than only using the ASID to qualify TLB
    results ?? <Was this TLB entry installed from the same ASID as is
    accessing right now>. And using ASID as an index into any array might
    lead to some conundrum down the road apiece.

    Secondarily, SUN started out with 12-bit ASID and went to 16-bits just
    about as fast as they could--even before main memories went bigger than
    4GB.

    I view the address space as an entity in its own right to be managed by
    the MMU. ASIDs and address spaces should be mapped 1:1. The ASID that
    identifies the address space has a life outside of just the TLB. I may
    be increasing the typical scope of an ASID.

    Consider the case where two different processes MMAP the same area
    of memory.

    In which case, the area of memory would be mapped to different
    virtual address ranges in each process, and thus naturally
    consume two TLBs.

    MMAP() first, fork() second. Now we have 2 processes with the
    memory mapped shared memory at the same address.

    Yes, in that case, they'll be mapped at the same VA. All
    the below points still apply so long as TLB's are per core.


    FWIW, MAP_FIXED is specified as an optional feature by POSIX
    and may not be supported by the OS at all.

    Given various forms of ASLR being used, it's unlikely even in
    two instances of the same executable that a call to mmap
    with MAP_SHARED without MAP_FIXED would map the region at
    the same virtual address in both processes.

    Should they both end up using the same ASID ??

    They couldn't share an ASID assuming the TLB looks up by VA.

    Should they both take extra TLB walks because they use different ASIDs ??

    Given the above, yes. It's likely they'll each be scheduled
    on different cores anyway in any modern system.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Finch@21:1/5 to All on Sat Nov 25 19:48:06 2023
    Are top-level page directory pages shared between tasks? Suppose a task
    needs a 32-bit address space. With one level of page maps, 27 bits are
    accommodated; that leaves 5 bits of address translation to be done by
    the page directory. Using a whole page, which can handle 11 address
    bits, would be wasteful. But if root pointers could point into the same
    page directory page, then the space would not be wasted. For instance,
    the root pointer for task #1 could point at the first 32 entries, the
    root pointer for task #2 could point into the next 32 entries, and so
    on.
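
    The arithmetic, as a sketch (names hypothetical; assumes the root
    pointer can hold a full byte address rather than a page number):

        /* One top-level directory page sliced among tasks: each task's
           32-bit space needs only 32 directory entries (5 bits), so a
           root pointer can aim at entry 32*n of a shared directory page
           instead of wasting a whole page per task. */
        #include <stdint.h>

        #define ENTRIES_PER_TASK 32   /* 5 bits of translation */

        uint64_t task_root(uint64_t dir_page_pa, unsigned task_slot,
                           unsigned pte_bytes)
        {
            return dir_page_pa +
                   (uint64_t)task_slot * ENTRIES_PER_TASK * pte_bytes;
        }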

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Sun Nov 26 01:34:53 2023
    Robert Finch wrote:

    Are top-level page directory pages shared between tasks?

    The HyperVisor tables supporting a single Guest OS certainly are.
    The Guest OS tables supporting Guest OS certainly are.

    Suppose a task
    needs a 32-bit address space. With one level of page maps, 27 bits are
    accommodated; that leaves 5 bits of address translation to be done by
    the page directory. Using a whole page, which can handle 11 address
    bits, would be wasteful. But if root pointers could point into the same
    page directory page, then the space would not be wasted. For instance,
    the root pointer for task #1 could point at the first 32 entries, the
    root pointer for task #2 could point into the next 32 entries, and so
    on.


    I should note that My 66000 Root Pointers determine the size of the
    address space they map--anything from 8MB through 8EB--with PTEs
    supporting 8KB through 8EB page sizes; with the kicker that large-page
    entries can restrict themselves:: for example, you can use an 8MB PTE
    and enable only 1..1024 pages under that Virtual sub Address Space.
    Furthermore, levels in the hierarchy can be skipped--all of this to
    minimize table-walk time.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Robert Finch on Sun Nov 26 15:55:04 2023
    Robert Finch <robfi680@gmail.com> writes:
    Are top-level page directory pages shared between tasks?

    The top half of the VA space could support this, for
    the most part (since the top half is generally shared
    by all tasks). The bottom half that's much less likely.


    Suppose a task
    needs a 32-bit address space. With one level of page maps, 27 bits are
    accommodated; that leaves 5 bits of address translation to be done by
    the page directory. Using a whole page, which can handle 11 address
    bits, would be wasteful. But if root pointers could point into the same
    page directory page, then the space would not be wasted. For instance,
    the root pointer for task #1 could point at the first 32 entries, the
    root pointer for task #2 could point into the next 32 entries, and so
    on.

    If the VA space is small enough, on ARMv8, the tables can be configured
    with fewer than the normal four levels by specifying a smaller VA
    size in the TCR_ELx register, so the walk may be only two or three levels
    deep instead of four (or five when the VA gets larger than 52 bits).
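
    For the 4kB granule the arithmetic is: 12 bits of page offset plus 9
    bits resolved per level, so (a sketch):

        /* ARMv8, 4kB granule: levels = ceil((va_bits - 12) / 9).
           48-bit VA -> 4 levels, 39-bit -> 3, 30-bit -> 2. */
        unsigned walk_levels(unsigned va_bits)
        {
            return (va_bits - 12 + 8) / 9;
        }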

    Using intermediate-level blocks (soi-disant 'huge pages') reduces the
    walk overhead as well, but has its issues with allocation (since
    huge pages must be not just physically contiguous, but aligned
    on huge-page-sized boundaries).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Sun Nov 26 15:45:06 2023
    scott@slp53.sl.home (Scott Lurndal) writes:
    mitchalsup@aol.com (MitchAlsup) writes:
    Consider the case where two different processes MMAP the same area
    of memory.

    In which case, the area of memory would be mapped to different
    virtual address ranges in each process,

    Says who? Unless the user process asks for MAP_FIXED or the address
    range is already occupied in the user process, nothing prevents the OS
    from putting the shared area in the same process. If the permissions
    are also the same, the OS can then use one ASID for the shared area.

    This would be especially useful for the read-only sections (e.g, code)
    of common libraries like libc. However, in today's security landscape,
    you don't want one process to know where library code is mapped in
    other processes (i.e., you want ASLR), so we can no longer make use of
    that benefit. And it's doubtful whether other uses are worth the
    complications (and even if they are, there might be security issues,
    too).

    FWIW, MAP_FIXED is specified as an optional feature by POSIX
    and may not be supported by the OS at all.

    As usual, what is specified by a common-subset standard is not
    relevant for what an OS implementor has to do if they want to supply
    more than a practically unusable checkbox feature like the POSIX
    subsystem for Windows. There is a reason why WSL2 includes a full
    Linux kernel.

    Should they both end up using the same ASID ??

    They couldn't share an ASID assuming the TLB looks up by VA.

    Of course the TLB looks up by VA, what else. But if the VA is the
    same and the PA is the same, the same ASID can be used.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Sun Nov 26 12:32:08 2023
    Anton Ertl wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    mitchalsup@aol.com (MitchAlsup) writes:
    Consider the case where two different processes MMAP the same area
    of memory.
    In which case, the area of memory would be mapped to different
    virtual address ranges in each process,

    Says who? Unless the user process asks for MAP_FIXED or the address
    range is already occupied in the user process, nothing prevents the OS
    from putting the shared area in the same process. If the permissions
    are also the same, the OS can then use one ASID for the shared area.

    If the mapping range is being selected dynamically, the chance that a
    range will already be in use goes up with the number of sharers.
    At some point when a new member tries to join the sharing group
    the map request will be denied.

    Software that does not want to have a mapping request fail should assume
    that a shared area will be mapped at a different address in each process.
    That implies one should not assume that virtual addresses can be passed
    but instead use, say, section-relative offsets to build a linked list.
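
    A minimal sketch of that offset-based convention:

        /* Linked list in a shared section using offsets from the section
           base instead of raw pointers, so each process may map the
           section at a different VA. Offset 0 serves as "null". */
        #include <stddef.h>
        #include <stdint.h>

        struct node {
            uint64_t next_off;     /* offset of next node, 0 = end */
            int      payload;
        };

        static struct node *node_at(void *base, uint64_t off)
        {
            return off ? (struct node *)((char *)base + off) : NULL;
        }

        static void push(void *base, uint64_t *head_off, struct node *n)
        {
            n->next_off = *head_off;
            *head_off   = (uint64_t)((char *)n - (char *)base);
        }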

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Sun Nov 26 20:52:23 2023
    EricP wrote:

    Anton Ertl wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    mitchalsup@aol.com (MitchAlsup) writes:
    Consider the case where two different processes MMAP the same area
    of memory.
    In which case, the area of memory would be mapped to different
    virtual address ranges in each process,

    Says who? Unless the user process asks for MAP_FIXED or the address
    range is already occupied in the user process, nothing prevents the OS
    from putting the shared area in the same process. If the permissions
    are also the same, the OS can then use one ASID for the shared area.

    If the mapping range is being selected dynamically, the chance that a
    range will already be in use goes up with the number of sharers.
    At some point when a new member tries to join the sharing group
    the map request will be denied.

    Software that does not want to have a mapping request fail should assume
    that a shared area will be mapped at a different address in each process.
    That implies one should not assume that virtual addresses can be passed
    but instead use, say, section relative offsets to build a linked list.


    Here you are using shared memory like PL/1 uses AREA and OFFSET types.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Sun Nov 26 21:26:23 2023
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    In which case, the area of memory would be mapped to different
    virtual address ranges in each process,

    Says who? Unless the user process asks for MAP_FIXED or the address
    range is already occupied in the user process, nothing prevents the OS
    from putting the shared area in the same process.

    s/process/address range/ for the last word.

    If the permissions
    are also the same, the OS can then use one ASID for the shared area.

    If the mapping range is being selected dynamically, the chance that a
    range will already be in use goes up with the number of sharers.
    At some point when a new member tries to join the sharing group
    the map request will be denied.

    It will map, but with a different address range, and therefore a
    different ASID. Then, for further mapping requests, the chances that
    one of the two address ranges is free are increased. So even with a
    large number of processes mapping the same library, you will need only
    a few ASIDs for this physical memory, so there will be lots of
    sharing. Of course with ASLR this is all no longer relevant.

    Software that does not want to have a mapping request fail should assume
    that a shared area will be mapped at a different address in each process.
    That implies one should not assume that virtual addresses can be passed
    but instead use, say, section-relative offsets to build a linked list.

    Yes. The other option is to use MAP_FIXED early in the process, and
    to have some way of dealing with potential failures. But sharing of
    VAs in user code between processes is not what the sharing of ASIDs we
    have discussed here would be primarily about.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Anton Ertl on Sun Nov 26 15:45:08 2023
    On 11/26/2023 9:45 AM, Anton Ertl wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    mitchalsup@aol.com (MitchAlsup) writes:
    Consider the case where two different processes MMAP the same area
    of memory.

    In which case, the area of memory would be mapped to different
    virtual address ranges in each process,

    Says who? Unless the user process asks for MAP_FIXED or the address
    range is already occupied in the user process, nothing prevents the OS
    from putting the shared area in the same process. If the permissions
    are also the same, the OS can then use one ASID for the shared area.

    This would be especially useful for the read-only sections (e.g, code)
    of common libraries like libc. However, in today's security landscape,
    you don't want one process to know where library code is mapped in
    other processes (i.e., you want ASLR), so we can no longer make use of
    that benefit. And it's doubtful whether other uses are worth the complications (and even if they are, there might be security issues,
    too).


    It seems to me, as long as it is a different place on each system, that
    is probably good enough. Demanding a different location in each process
    would create a lot of additional memory overhead from things like
    base-relocations or similar.


    FWIW, MAP_FIXED is specified as an optional feature by POSIX
    and may not be supported by the OS at all.

    As usual, what is specified by a common-subset standard is not
    relevant for what an OS implementor has to do if they want to supply
    more than a practically unusable checkbox feature like the POSIX
    subsystem for Windows. There is a reason why WSL2 includes a full
    Linux kernel.


    Still using WSL1 here as for whatever reason hardware virtualization has
    thus far refused to work on my PC, and is apparently required for WSL2.

    I can add this to my list of annoyances, like I can install "just short
    of 128GB", but putting in the full 128GB causes my PC to be like "Oh
    Crap, I guess there is 3.5GB ..." (but, apparently "112GB with unmatched
    RAM sticks is fine I guess...").



    But, yeah, the original POSIX is an easier goal to achieve, vs, say, the ability to port over the GNU userland.


    A lot of it is doable, but things like fork+exec are a problem if one
    wants to support NOMMU operation or otherwise run all of the logical
    processes in a shared address space.

    A practical alternative is something more like a CreateProcess style
    call, but this is "not exactly POSIX". In theory though, one could treat "fork()" more like "vfork()" and then turn the exec* call into a
    CreateProcess call and then terminate the current thread. Wouldn't
    really work "in general" though, for programs that expect to be able to "fork()" and then continue running the current program as a sub-process.


    Should they both end up using the same ASID ??

    They couldn't share an ASID assuming the TLB looks up by VA.

    Of course the TLB looks up by VA, what else. But if the VA is the
    same and the PA is the same, the same ASID can be used.


    ?...

    Typically the ASID applies to the whole virtual address space, not to individual memory objects.


    Or, at least, my page-table scheme doesn't have a way to express
    per-page ASIDs (merely if a page is Private/Shared, with the results of
    this partly depending on the current ASID given for the page-table).

    Where, say, I am mostly using 64-bit entries in the page-table, as going
    to a 128-bit page-table format would be a bit steep.

    Say, PTE layout (16K pages):
    (63:48): ACLID
    (47:14): Physical Address.
    (13:12): Address or OS flag.
    (11:10): For use by OS
    ( 9: 0): Base page-access and similar.
    (9): S1 / U1 (Page-Size or OS Flag)
    (8): S0 / U0 (Page-Size or OS Flag)
    (7): No User (Supervisor Only)
    (6): No Execute
    (5): No Write
    (4): No Read
    (3): No Cache
    (2): Dirty (OS, ignored by TLB)
    (1): Private/Shared (MBZ if not Valid)
    (0): Present/Valid

    Where, ACLID serves as an index into the ACL table, or to look up the
    VUGID parameters for the page (well, along with an alternate PTE variant
    that encodes VUGID directly, but reduces the physical address to 36
    bits). It is possible that the original VUGID scheme may be phased out
    in favor of using exclusively ACL checking.
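
    As a sketch, decoding those fields in C (masks derived from the
    layout above):

        /* Field extraction for the 64-bit PTE sketched above. */
        #include <stdint.h>

        #define PTE_ACLID(pte)  (((pte) >> 48) & 0xFFFFu)        /* 63:48 */
        #define PTE_PADDR(pte)  ((pte) & 0x0000FFFFFFFFC000ull)  /* 47:14 */

        #define PTE_NOUSER   (1u << 7)
        #define PTE_NOEXEC   (1u << 6)
        #define PTE_NOWRITE  (1u << 5)
        #define PTE_NOREAD   (1u << 4)
        #define PTE_NOCACHE  (1u << 3)
        #define PTE_DIRTY    (1u << 2)   /* OS only, ignored by TLB */
        #define PTE_PRIVATE  (1u << 1)
        #define PTE_VALID    (1u << 0)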

    Note that the ACL checks don't add new permissions to a page, they add
    further restrictions (with the base-access being the most permissive).

    Some combinations of flags are special, and encode a few edge-case
    modes; such as pages which are Read/Write in Supervisor mode but
    Read-Only in user mode (separate from the possible use of ACL's to mark
    pages as read-only for certain tasks).



    But, FWIW, I ended up adding an extended MAP_GLOBAL flag for "mmap'ed
    space should be visible to all of the processes"; which in turn was used
    as part of the backing memory for the "GlobalAlloc" style calls (it is
    not a global heap, in that each process still manages the memory
    locally, but other intersecting processes can see the address within
    their own address spaces).

    Well, along with a MAP_PHYSICAL flag, for if one needs memory where
    VA==PA (this may fail, with the mmap returning NULL, effectively only
    allowed for "superusermode"; mostly intended for hardware interfaces).
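
    Usage would look something like this (MAP_GLOBAL is the extension
    described here, not POSIX; the flag value is made up for
    illustration):

        #include <stddef.h>
        #include <sys/mman.h>

        #ifndef MAP_GLOBAL
        #define MAP_GLOBAL 0x10000   /* hypothetical value */
        #endif

        /* buffer visible to other processes at the same address */
        void *alloc_shared_buffer(size_t len)
        {
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_ANONYMOUS | MAP_GLOBAL, -1, 0);
            return (p == MAP_FAILED) ? NULL : p;
        }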



    The usual behavior of MAP_SHARED didn't really make sense outside of the context of mapping a file, and didn't really serve the needed purpose
    (say, one wants to hand off a pointer to a bitmap buffer to the GUI
    subsystem to have it drawn into a window).

    It is also being used for things like shared scratch buffers, say, for
    passing BITMAPINFOHEADER and MIDI commands and similar across the
    interprocess calls (the C API style wrapper wraps a lot of this; whereas
    the internal COM-style interfaces will require any pointer-style
    arguments to point to shared memory).

    This is not required for normal syscall handlers, where the usual
    assumption is that normal syscalls will have some means of directly
    accessing the address space of the caller process. I didn't really want
    to require that TKGDI have this same capability.

    It is debatable whether calls like BlitImage and similar should require
    global memory, or merely recommend it (potentially having the call fall
    back to a scratch buffer and internal memcpy if the passed bitmap image
    is not already in global memory).



    I had originally considered a more complex mechanism for object sharing,
    but then ended up going with this for now partly because it was easier
    and lower overhead (well, and also because I wanted something that would
    still work if/when I started to add proper memory protection). May make
    sense to impose a limit on per-process global alloc's though (since it
    is intended specifically for shared buffers and not for general heap allocation; where for heap allocation ANONYMOUS+PRIVATE would be used
    instead).

    Though, looking at stuff, MAP_GLOBAL semantics may have also been
    partially covered by "MAP_ANONYMOUS|MAP_SHARED"?... Though, the
    semantics aren't the same.

    I guess, another alternative would have been to use shm_open+mmap or
    similar.


    Where, say, memory map will look something like:
    00yy_xxxxxxxx: Physical mapped and direct-mapped memory.
    00yy_xxxxxxxx: Start of global virtual memory (*1);
    3FFF_xxxxxxxx: End of global virtual memory;
    4000_xxxxxxxx: Start of private/local virtual memory (possible, *2);
    7FFF_xxxxxxxx: End of private/local virtual memory (possible);
    8000_xxxxxxxx: Start of kernel virtual memory;
    BFFF_xxxxxxxx: End of kernel virtual memory;
    Cxxx_xxxxxxxx: Physical Address Range (Cached);
    Dxxx_xxxxxxxx: Physical Address Range (Volatile/NoCache);
    Exxx_xxxxxxxx: Reserved;
    Fxxx_xxxxxxxx: MMIO and similar.

    *1: The 'yy' division point may move, will depend on things like how
    much RAM exists (currently, 00/01; no current "sane cost" FPGA boards
    having more than 256 or 512 MB of RAM).

    *2: If I go to a scheme of giving processes their own address spaces,
    then private memory will be used. It is likely that executable code may
    remain shared, but the data sections and heap would be put into private
    address ranges.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Sun Nov 26 22:27:37 2023
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    mitchalsup@aol.com (MitchAlsup) writes:
    Consider the case where two different processes MMAP the same area
    of memory.

    In which case, the area of memory would be mapped to different
    virtual address ranges in each process,

    Says who? Unless the user process asks for MAP_FIXED or the address
    range is already occupied in the user process, nothing prevents the OS
    from putting the shared area in the same process. If the permissions
    are also the same, the OS can then use one ASID for the shared area.

    This would be especially useful for the read-only sections (e.g, code)
    of common libraries like libc. However, in today's security landscape,
    you don't want one process to know where library code is mapped in
    other processes (i.e., you want ASLR), so we can no longer make use of
    that benefit. And it's doubtful whether other uses are worth the
    complications (and even if they are, there might be security issues,
    too).

    FWIW, MAP_FIXED is specified as an optional feature by POSIX
    and may not be supported by the OS at all.

    As usual, what is specified by a common-subset standard is not
    relevant for what an OS implementor has to do if they want to supply
    more than a practically unusable checkbox feature like the POSIX
    subsystem for Windows. There is a reason why WSL2 includes a full
    Linux kernel.

    If an implementation claims support for the XSI option of
    POSIX, then it must support MAP_FIXED. There were a couple
    of vendors who claimed not to be able to support MAP_FIXED
    back in the days when it was being discussed in the standards
    committee working groups.

    In addition, the standard notes:

    "Use of MAP_FIXED may result in unspecified behavior in
    further use of malloc() and shmat(). The use of MAP_FIXED is
    discouraged, as it may prevent an implementation from making
    the most effective use of resources."

    Because the semantics of MAP_FIXED are to unmap any
    prior mapping in the range, if the implementation had happened to
    allocate the heap or shared System V region at that address, the heap
    would have become corrupt with dangling references hanging
    around which, if stored into, would subsequently corrupt the mapped region.



    Should they both end up using the same ASID ??

    They couldn't share an ASID assuming the TLB looks up by VA.

    Of course the TLB looks up by VA, what else. But if the VA is the
    same and the PA is the same, the same ASID can be used.

    That sounds like a nightmare scenario. Normally the ASID is
    closely associated with a single process and the scope of
    necessary TLB maintenance operations (e.g. invalidates
    after translation table updates) is usually the process.

    It's certainly not possible to do that on ARMv8 systems. The
    ASID tag in the TLB entry comes from the translation table base
    register and applies to all accesses made to the entire range covered
    by the translation table by all the threads of the process.

    Likewise the VMID tag in the TLB entry comes from the nested
    translation table base address system register at the time
    of entry creation.

    For a subsequent process (child or detached) sharing memory with
    that process, there just isn't any way to tag its TLB entry with
    the ASID of the first process to map the shared region.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to BGB on Sun Nov 26 22:35:19 2023
    BGB <cr88192@gmail.com> writes:
    On 11/26/2023 9:45 AM, Anton Ertl wrote:



    Where, say, memory map will look something like:
    00yy_xxxxxxxx: Physical mapped and direct-mapped memory.
    00yy_xxxxxxxx: Start of global virtual memory (*1);
    3FFF_xxxxxxxx: End of global virtual memory;
    4000_xxxxxxxx: Start of private/local virtual memory (possible, *2);
    7FFF_xxxxxxxx: End of private/local virtual memory (possible);
    8000_xxxxxxxx: Start of kernel virtual memory;
    BFFF_xxxxxxxx: End of kernel virtual memory;
    Cxxx_xxxxxxxx: Physical Address Range (Cached);
    Dxxx_xxxxxxxx: Physical Address Range (Volatile/NoCache);
    Exxx_xxxxxxxx: Reserved;
    Fxxx_xxxxxxxx: MMIO and similar.


    The modern preference is to make the memory map flexible.

    Linux, for example, requires that PCI Base Address Registers
    be programmable by the operating system, and the OS can
    choose any range (subject to host bridge configuration, of
    course) for the device.

    It is notable that even on non-intel systems, one may need
    to map a 32-bit PCI BAR (AHCI is the classic example) which
    requires the address programmed in the bar to be less than
    0x100000000 (4GB). Granted systems can have custom PCI controllers
    that remap that into the larger physical address space with
    a bit of extra hardware, however the kernel people don't
    like that at all since there is no universal standard for
    such remapping and they don't want to support
    dozens of independent implementations, constantly
    changing from generation to generation.

    Many modern SoCs (and ARM SBSA requires this) make their
    on-board devices and coprocessors look like PCI express
    devices to software, and SBSA requires the PCIe ECAM
    region for device discovery. Here again, each of
    these on board devices will have from one to six
    memory region base address registers (or one to
    three for 64-bit bars).

    Encoding memory attributes into the address is common
    in microcontrollers, but in a general purpose processor
    constrains the system to an extent sufficient to make it
    unattractive for general purpose workloads.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Finch@21:1/5 to BGB on Sun Nov 26 18:20:58 2023
    On 2023-11-26 4:45 p.m., BGB wrote:
    On 11/26/2023 9:45 AM, Anton Ertl wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    mitchalsup@aol.com (MitchAlsup) writes:
    Consider the case where two different processes MMAP the same area
    of memory.

    In which case, the area of memory would be mapped to different
    virtual address ranges in each process,

    Says who?  Unless the user process asks for MAP_FIXED or the address
    range is already occupied in the user process, nothing prevents the OS
    from putting the shared area in the same process.  If the permissions
    are also the same, the OS can then use one ASID for the shared area.

    This would be especially useful for the read-only sections (e.g, code)
    of common libraries like libc.  However, in today's security landscape,
    you don't want one process to know where library code is mapped in
    other processes (i.e., you want ASLR), so we can no longer make use of
    that benefit.  And it's doubtful whether other uses are worth the
    complications (and even if they are, there might be security issues,
    too).


    It seems to me, as long as it is a different place on each system, that
    is probably good enough. Demanding a different location in each process
    would create a lot of additional memory overhead from things like
    base-relocations or similar.


    FWIW,  MAP_FIXED is specified as an optional feature by POSIX
    and may not be supported by the OS at all.

    As usual, what is specified by a common-subset standard is not
    relevant for what an OS implementor has to do if they want to supply
    more than a practically unusable checkbox feature like the POSIX
    subsystem for Windows.  There is a reason why WSL2 includes a full
    Linux kernel.


    Still using WSL1 here as for whatever reason hardware virtualization has
    thus far refused to work on my PC, and is apparently required for WSL2.

    I can add this to my list of annoyances, like I can install "just short
    of 128GB", but putting in the full 128GB causes my PC to be like "Oh
    Crap, I guess there is 3.5GB ..." (but, apparently "112GB with unmatched
    RAM sticks is fine I guess...").



    But, yeah, the original POSIX is an easier goal to achieve, vs, say, the ability to port over the GNU userland.


    A lot of it is doable, but things like fork+exec are a problem if one
    wants to support NOMMU operation or otherwise run all of the logical processes in a shared address space.

    A practical alternative is something more like a CreateProcess style
    call, but this is "not exactly POSIX". In theory though, one could treat "fork()" more like "vfork()" and then turn the exec* call into a CreateProcess call and then terminate the current thread. Wouldn't
    really work "in general" though, for programs that expect to be able to "fork()" and then continue running the current program as a sub-process.


    Should they both end up using the same ASID ??

    They couldn't share an ASID assuming the TLB looks up by VA.

    Of course the TLB looks up by VA, what else.  But if the VA is the
    same and the PA is the same, the same ASID can be used.


    ?...

    Typically the ASID applies to the whole virtual address space, not to individual memory objects.


    Or, at least, my page-table scheme doesn't have a way to express
    per-page ASIDs (merely if a page is Private/Shared, with the results of
    this partly depending on the current ASID given for the page-table).

    Where, say, I am mostly using 64-bit entries in the page-table, as going
    to a 128-bit page-table format would be a bit steep.

    Say, PTE layout (16K pages):
      (63:48): ACLID
      (47:14): Physical Address.
      (13:12): Address or OS flag.
      (11:10): For use by OS
      ( 9: 0): Base page-access and similar.
        (9): S1 / U1 (Page-Size or OS Flag)
        (8): S0 / U0 (Page-Size or OS Flag)
        (7): No User (Supervisor Only)
        (6): No Execute
        (5): No Write
        (4): No Read
        (3): No Cache
        (2): Dirty (OS, ignored by TLB)
        (1): Private/Shared (MBZ if not Valid)
        (0): Present/Valid

    Where, ACLID serves as an index into the ACL table, or to look up the
    VUGID parameters for the page (well, along with an alternate PTE variant
    that encodes VUGID directly, but reduces the physical address to 36
    bits). It is possible that the original VUGID scheme may be phased out
    in favor of using exclusively ACL checking.

    Note that the ACL checks don't add new permissions to a page, they add further restrictions (with the base-access being the most permissive).

    Some combinations of flags are special, and encode a few edge-case
    modes; such as pages which are Read/Write in Supervisor mode but
    Read-Only in user mode (separate from the possible use of ACL's to mark
    pages as read-only for certain tasks).


    Q+ has a similar setup, but the ACLID is in a separate table.

    For Q+, two similar MMUs have been designed: one to be used in a large
    system and a second for a small system. The difference between the two
    is in the size of page numbers. The large system uses 64-bit page
    numbers, and the small system uses 32-bit page numbers. The PTE for the
    large system is 96-bits, 32-bits larger than the PTE for the small
    system due to the extra bits for the page number. Pages are 64kB. The
    small system supports a 48-bit address range.

    The PTE has the following fields:
    PPN     64/32  Physical page number
    URWX    3      User read-write-execute override
    SRWX    3      Supervisor read-write-execute override
    HRWX    3      Hypervisor read-write-execute override
    MRWX    3      Machine read-write-execute override
    CACHE   4      Cache-ability bits
    SW      2      OS software usage
    A       1      1=accessed/used
    M       1      1=modified
    V       1      1 if entry is valid, otherwise 0
    S       1      1=shared page
    G       1      1=global, ignore ASID
    T       1      0=page pointer, 1=table pointer
    RGN     3      Region table index
    LVL/BC  5      the page table level of the entry pointed to

    The RWX and CACHE bits are overrides. These values normally come from
    the region table, but may be overridden by values in the PTE.
    The LVL/BC field is five bits to account for a five-bit bounce counter
    for inverted page tables. Only a 3-bit level is in use.
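
    As an illustrative packing of the small-system (64-bit) PTE; the
    real bit positions may differ:

        #include <stdint.h>

        struct qplus_pte_small {   /* fields as listed above */
            uint64_t ppn   : 32;   /* physical page number     */
            uint64_t urwx  : 3;    /* user RWX override        */
            uint64_t srwx  : 3;    /* supervisor RWX override  */
            uint64_t hrwx  : 3;    /* hypervisor RWX override  */
            uint64_t mrwx  : 3;    /* machine RWX override     */
            uint64_t cache : 4;    /* cacheability override    */
            uint64_t sw    : 2;    /* OS software usage        */
            uint64_t a     : 1;    /* accessed/used            */
            uint64_t m     : 1;    /* modified                 */
            uint64_t v     : 1;    /* valid                    */
            uint64_t s     : 1;    /* shared page              */
            uint64_t g     : 1;    /* global, ignore ASID      */
            uint64_t t     : 1;    /* 0=page ptr, 1=table ptr  */
            uint64_t rgn   : 3;    /* region table index       */
            uint64_t lvl   : 5;    /* level / bounce counter   */
        };                         /* 64 bits total */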

    There is a separate table with per-page information that contains a
    reference to an ACL (16 bits), share counts (16 bits), privilege level
    (8 bits), an access key (24 bits), and a couple of other fields for
    compression / encryption.

    I have made the PTBR a full 64-bit address now rather than a page number
    with control bits. So, it may now point into the middle of a page
    directory which is shared between tasks.

    The table walker and region table look like PCI devices to the system.




    But, FWIW, I ended up adding an extended MAP_GLOBAL flag for "mmap'ed
    space should be visible to all of the processes"; which in turn was used
    as part of the backing memory for the "GlobalAlloc" style calls (it is
    not a global heap, in that each process still manages the memory
    locally, but other intersecting processes can see the address within
    their own address spaces).

    Well, along with a MAP_PHYSICAL flag, for if one needs memory where
    VA==PA (this may fail, with the mmap returning NULL, effectively only
    allowed for "superusermode"; mostly intended for hardware interfaces).



    The usual behavior of MAP_SHARED didn't really make sense outside of the context of mapping a file, and didn't really serve the needed purpose
    (say, one wants to hand off a pointer to a bitmap buffer to the GUI
    subsystem to have it drawn into a window).

    It is also being used for things like shared scratch buffers, say, for passing BITMAPINFOHEADER and MIDI commands and similar across the interprocess calls (the C API style wrapper wraps a lot of this; whereas
    the internal COM-style interfaces will require any pointer-style
    arguments to point to shared memory).

    This is not required for normal syscall handlers, where the usual
    assumption is that normal syscalls will have some means of directly
    accessing the address space of the caller process. I didn't really want
    to require that TKGDI have this same capability.

    It is debatable whether calls like BlitImage and similar should require global memory, or merely recommend it (potentially having the call fall
    back to a scratch buffer and internal memcpy if the passed bitmap image
    is not already in global memory).



    I had originally considered a more complex mechanism for object sharing,
    but then ended up going with this for now partly because it was easier
    and lower overhead (well, and also because I wanted something that would still work if/when I started to add proper memory protection). May make
    sense to impose a limit on per-process global alloc's though (since it
    is intended specifically for shared buffers and not for general heap allocation; where for heap allocation ANONYMOUS+PRIVATE would be used instead).

    Though, looking at stuff, MAP_GLOBAL semantics may have also been
    partially covered by "MAP_ANONYMOUS|MAP_SHARED"?... Though, the
    semantics aren't the same.

    I guess, another alternative would have been to use shm_open+mmap or
    similar.


    Where, say, memory map will look something like:
      00yy_xxxxxxxx: Physical mapped and direct-mapped memory.
      00yy_xxxxxxxx: Start of global virtual memory (*1);
      3FFF_xxxxxxxx: End of global virtual memory;
      4000_xxxxxxxx: Start of private/local virtual memory (possible, *2);
      7FFF_xxxxxxxx: End of private/local virtual memory (possible);
      8000_xxxxxxxx: Start of kernel virtual memory;
      BFFF_xxxxxxxx: End of kernel virtual memory;
      Cxxx_xxxxxxxx: Physical Address Range (Cached);
      Dxxx_xxxxxxxx: Physical Address Range (Volatile/NoCache);
      Exxx_xxxxxxxx: Reserved;
      Fxxx_xxxxxxxx: MMIO and similar.

    *1: The 'yy' division point may move, will depend on things like how
    much RAM exists (currently, 00/01; no current "sane cost" FPGA boards
    having more than 256 or 512 MB of RAM).

    *2: If I go to a scheme of giving processes their own address spaces,
    then private memory will be used. It is likely that executable code may remain shared, but the data sections and heap would be put into private address ranges.



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Scott Lurndal on Sun Nov 26 18:40:16 2023
    On 11/26/2023 4:35 PM, Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/26/2023 9:45 AM, Anton Ertl wrote:



    Where, say, memory map will look something like:
    00yy_xxxxxxxx: Physical mapped and direct-mapped memory.
    00yy_xxxxxxxx: Start of global virtual memory (*1);
    3FFF_xxxxxxxx: End of global virtual memory;
    4000_xxxxxxxx: Start of private/local virtual memory (possible, *2);
    7FFF_xxxxxxxx: End of private/local virtual memory (possible);
    8000_xxxxxxxx: Start of kernel virtual memory;
    BFFF_xxxxxxxx: End of kernel virtual memory;
    Cxxx_xxxxxxxx: Physical Address Range (Cached);
    Dxxx_xxxxxxxx: Physical Address Range (Volatile/NoCache);
    Exxx_xxxxxxxx: Reserved;
    Fxxx_xxxxxxxx: MMIO and similar.


    The modern preference is to make the memory map flexible.

    Linux, for example, requires that PCI Base Address Registers
    be programmable by the operating system, and the OS can
    choose any range (subject to host bridge configuration, of
    course) for the device.



    As for the memory map, actual hardware-relevant part of the map is:
    0000_xxxxxxxx..7FFF_xxxxxxxx: User Mode, virtual
    8000_xxxxxxxx..BFFF_xxxxxxxx: Supervisor Mode, virtual
    Cxxx_xxxxxxxx: Physical Address Range (Cached);
    Dxxx_xxxxxxxx: Physical Address Range (Volatile/NoCache);
    Exxx_xxxxxxxx: Reserved;
    Fxxx_xxxxxxxx: MMIO and similar.

    There is no good way to make this more flexible; some of this stuff
    requires special handling from the L1 cache, and by the time it reaches
    the TLB, it is too late (unless there were additional logic to be like
    "Oh, crap, this was actually meant for MMIO!").

    Though, with the 96-bit VA mode, if GBH(47:0)!=0, then the entire 48-bit
    space is User Mode Virtual (and it is not possible to access MMIO or
    similar at all, short of reloading 0 into GBH, or using XMOV.x
    instructions with a 128-bit pointer, say:
    0000_0000_00000000-tttt_Fxxx_xxxxxxxx).


    Note here that the high 16-bits are ignored for normal pointers
    (typically used for type-tagging or bounds-checking by the runtime).

    For branches and captured Link-Register values:
    If LSB is 0: High 16 bits are ignored;
    The branch will always be within the same CPU Mode.
    If LSB is 1: High 16 bits encode CPU Mode control flags.
    LSB is always set for created LR values.
    CPU will trap if the LSB is Clear in LR during an RTS/RTSU.

    Setting the LSB and putting the mode in the high 16 bits is also often
    used on function pointers so that theoretically Baseline and XG2 code
    can play along together (though, at present, BGBCC does not generate any
    mixed binaries, so this part would mostly apply to DLLs).




    For the time being, there is no PCI or PCIe in my case.
    Nor have I gone up the learning curve for what would be required to
    interface with any PCIe devices.


    Had tried to get USB working, but didn't have much success as it seemed
    I was still missing something (seemed to be sending/receiving bytes, but
    the devices would not respond as expected to any requests or commands).

    Mostly ended up using a PS2 keyboard, and had realized that (IIRC) if
    one pulled the D+ and D- lines high, the mouse would instead
    implement the PS2 protocol (though, this didn't work on the USB
    keyboards I had tried).


    Most devices are mapped to fixed address ranges in the MMIO space:
    F000Cxxx: Rasterizer / Edge-Walker Control Registers
    F000Exxx: Various basic devices
    SDcard, PS2 Keyboard/Mouse, RS232 UART (*), etc
    F008xxxx: FM Synth / Sample Mixer Control / ...
    F009xxxx: PCM Audio Loop/Registers
    F00Axxxx: MMIO VRAM
    F00Bxxxx: MMIO VRAM and Video Control
    At present, VRAM is also RAM-backed.
    VRAM framebuffer base address in RAM is now movable.

    All this existing within:
    FFFF_Fxxxxxxx

    *: RS232 generally connected to a UART interface that feeds back to a
    connected computer via an on-board FTDI chip or similar.


    As for physical memory map, it is sorta like:
    00000000..00007FFF: Boot ROM
    0000C000..0000DFFF: Boot SRAM
    00010000..0001FFFF: ZERO's
    00020000..0002FFFF: BJX2 NOP's
    00030000..0003FFFF: BJX2 BREAK's
    ...
    01000000..1FFFFFFF: Reserved for RAM
    20000000..3FFFFFFF: Reserved for More RAM (And/or repeating)
    40000000..5FFFFFFF: RAM repeats (and/or Reserved)
    60000000..7FFFFFFF: RAM repeats more (and/or Reserved)
    80000000..EFFFFFFF: Reserved
    F0000000..FFFFFFFF: MMIO in 32-bit Mode (*1)

    *1: There used to be an MMIO range at 0000_F0000000, but this has been eliminated in favor of only recognizing this range as MMIO in 32-bit
    mode (where only the low 32-bits of the address are used). Enabling
    48-bit addressing will now require using the proper MMIO address.

    Currently, nothing past the low 4GB is used in the physical memory map.


    It is notable that even on non-intel systems, one may need
    to map a 32-bit PCI BAR (AHCI is the classic example) which
    requires the address programmed in the bar to be less than
    0x100000000 (4GB). Granted systems can have custom PCI controllers
    that remap that into the larger physical address space with
    a bit of extra hardware, however the kernel people don't
    like that at all since there is no universal standard for
    such remapping and they don't want to support
    dozens of independent implementations, constantly
    changing from generation to generation.

    Many modern SoCs (and ARM SBSA requires this) make their
    on-board devices and coprocessors look like PCI express
    devices to software, and SBSA requires the PCIe ECAM
    region for device discovery. Here again, each of
    these on board devices will have from one to six
    memory region base address registers (or one to
    three for 64-bit bars).

    Encoding memory attributes into the address is common
    in microcontrollers, but in a general purpose processor
    constrains the system to an extent sufficient to make it
    unattractive for general purpose workloads.


    Possibly, but making things more flexible here would be a non-trivial
    level of complexity to deal with at the moment (and, it seemed relevant
    at first to design something I could "actually implement").


    At the time I started out on this, even maintaining similar hardware
    interfaces to a minimalist version of the Sega Dreamcast (what the
    BJX1's hardware-interface design was partly based on) was asking a bit
    too much (even after leaving out things like the CD-ROM drive and similar).


    So, I simplified things somewhat, initially taking some design
    inspiration in these areas from the Commodore 64 and MSP430 and similar...

    Say:
    VRAM was reinterpreted as being an 80x25 grid of 8x8 pixel color cells;
    Audio was a simple MMIO-backed PCM loop (with a few registers to adjust
    the sample rate and similar).

    In terms of output signals, the display module drives a VGA output, and
    the audio is generally pulled off by turning an IO pin on and off really
    fast.

    Or, one drives 2 lines for audio, say:
    10: +, 01: -, 11: 0

    Using an H-Bridge driver as an amplifier (turns out one needs to drive
    like 50-100mA to get any decent level of loudness out of headphones;
    which is well beyond the power normal IO pins can deliver). Generally
    PCM needs to get turned into PWM/PDM.
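
    The PCM-to-PDM step is, in essence, a first-order delta-sigma loop;
    a sketch of the idea in C (the real thing would be a few lines of
    Verilog):

        #include <stdint.h>

        static int32_t ds_acc;     /* delta-sigma accumulator */

        /* one 1-bit output per call; clocked well above the sample rate */
        int pdm_step(int16_t pcm)
        {
            ds_acc += pcm;
            if (ds_acc >= 0) {
                ds_acc -= 32767;   /* feed back full-scale high */
                return 1;          /* drive pin high */
            }
            ds_acc += 32767;       /* feed back full-scale low */
            return 0;              /* drive pin low */
        }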

    Driving stereo via a dual H-Bridge driver would get a little wonky
    though, since headphones use Left/Right and a Common, effectively one
    needs to drive the center as a neutral, with L/R channels (and/or, just
    get lazy and drive mono across both the L/R channels using a single
    H-Bridge and ignore the center point, which ironically can get more
    loudness at less current because now one is dealing with 70 ohm rather
    than 35 ohm).

    ...


    Generally, with all of the hardware addresses at fixed locations.
    Doing any kind of dynamic configuration or allowing hardware addresses
    to be movable would have likely made the MMIO devices significantly more expensive (vs hard-coding the address of each device).


    Did generally go with MMIO rather than x86 style IO ports though.
    Partly because IO ports suck, and I wasn't quite *that* limited (say,
    could afford to use a 28-bit space, rather than a 16-bit space).


    ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Scott Lurndal on Mon Nov 27 02:09:52 2023
    Scott Lurndal wrote:

    BGB <cr88192@gmail.com> writes:
    On 11/26/2023 9:45 AM, Anton Ertl wrote:



    Where, say, memory map will look something like:
    00yy_xxxxxxxx: Physical mapped and direct-mapped memory.
    00yy_xxxxxxxx: Start of global virtual memory (*1);
    3FFF_xxxxxxxx: End of global virtual memory;
    4000_xxxxxxxx: Start of private/local virtual memory (possible, *2);
    7FFF_xxxxxxxx: End of private/local virtual memory (possible);
    8000_xxxxxxxx: Start of kernel virtual memory;
    BFFF_xxxxxxxx: End of kernel virtual memory;
    Cxxx_xxxxxxxx: Physical Address Range (Cached);
    Dxxx_xxxxxxxx: Physical Address Range (Volatile/NoCache);
    Exxx_xxxxxxxx: Reserved;
    Fxxx_xxxxxxxx: MMIO and similar.


    The modern preference is to make the memory map flexible.

                    // cacheable, used, modified bits
        CUM            kind of access
        ---            ------------------------------
        000            uncacheable DRAM
        001            MMI/O
        010            config
        011            ROM
        1xx            cacheable DRAM

    Linux, for example, requires that PCI Base Address Registers
    be programmable by the operating system, and the OS can
    choose any range (subject to host bridge configuration, of
    course) for the device.

    Easily done, just create an uncacheable PTE and set UM to 10
    for config space or 01 for MMI/O space.

    It is notable that even on non-intel systems, one may need
    to map a 32-bit PCI BAR (AHCI is the classic example), which
    requires the address programmed in the BAR to be less than
    0x1_0000_0000.

    I/O MMU translates these devices from a 32-bit VAS into the
    64-bit PAS.

    Granted systems can have custom PCI controllers
    that remap that into the larger physical address space with
    a bit of extra hardware, however the kernel people don't
    like that at all since there is no universal standard for
    such remapping and they don't want to support
    dozens of independent implementations, constantly
    changing from generation to generation.

    Though you would figure that, since they are already supporting 4
    incompatible mapping systems {Intel, AMD, ARM, RISC-V}, they would
    have gotten good at these implementations :-)

    Many modern SoCs (and ARM SBSA requires this) make their
    on-board devices and coprocessors look like PCI express
    devices to software,

    I made the CPU/cores in My 66000 have a configuration port
    that is set up during boot and that smells just like a PCIe
    port.

    and SBSA requires the PCIe ECAM
    region for device discovery. Here again, each of
    these on-board devices will have from one to six
    memory region base address registers (or one to
    three for 64-bit BARs).

    Encoding memory attributes into the address is common
    in microcontrollers, but in a general purpose processor
    constrains the system to an extent sufficient to make it
    unattractive for general purpose workloads.

    Agreed.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to MitchAlsup on Mon Nov 27 00:04:06 2023
    On 11/26/2023 8:09 PM, MitchAlsup wrote:
    Scott Lurndal wrote:

    BGB <cr88192@gmail.com> writes:
    On 11/26/2023 9:45 AM, Anton Ertl wrote:



    Where, say, memory map will look something like:
      00yy_xxxxxxxx: Physical mapped and direct-mapped memory.
      00yy_xxxxxxxx: Start of global virtual memory (*1);
      3FFF_xxxxxxxx: End of global virtual memory;
      4000_xxxxxxxx: Start of private/local virtual memory (possible, *2);
      7FFF_xxxxxxxx: End of private/local virtual memory (possible);
      8000_xxxxxxxx: Start of kernel virtual memory;
      BFFF_xxxxxxxx: End of kernel virtual memory;
      Cxxx_xxxxxxxx: Physical Address Range (Cached);
      Dxxx_xxxxxxxx: Physical Address Range (Volatile/NoCache);
      Exxx_xxxxxxxx: Reserved;
      Fxxx_xxxxxxxx: MMIO and similar.

    The modern preference is to make the memory map flexible.


    As noted, some amount of the above would be part of the OS memory map,
    rather than a hardware-imposed memory map.


    Like, say, Windows on x86 typically had:
    00000000..000FFFFF: DOS-like map (9x)
    00100000..7FFFFFFF: Userland stuff
    80000000..BFFFFFFF: Shared stuff
    C0000000..FFFFFFFF: Kernel Stuff

    Did the hardware enforce this? No.
    Did Windows follow such a structure? Yes, generally.

    Linux sorta followed a similar structure, except that some versions
    gave the full 4GB to userland addresses (which was an annoyance when
    trying to use TagRefs, since the OS might actually put memory in the
    part of the address space one would have otherwise used to hold
    fixnums and similar).

    Ironically though, this sort of thing (along with the limits of 32-bit
    tagrefs) gave me incentive to go over to 64-bit tagrefs even on
    32-bit machines, and a generally similar tagref scheme got carried into
    my later projects.


    Say:
    0ttt_xxxx_xxxxxxxx: Pointers
    1ttt_xxxx_xxxxxxxx: Small Value Spaces
    2ttt_xxxx_xxxxxxxx: ...
    3yyy_xxxx_xxxxxxxx: Bounds Checked Pointers
    4iii_iiii_iiiiiiii: Fixnum
    ..
    7iii_iiii_iiiiiiii: Fixnum
    8iii_iiii_iiiiiiii: Flonum
    ..
    Biii_iiii_iiiiiiii: Flonum
    ...

    But, this scheme is more used by the runtime, not so much by the hardware.

    For the most part, C doesn't use pointer tagging.
    However BGBScript/JavaScript and my BASIC variant do make use of
    type-tagging.
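    For illustration, a minimal C sketch of how a runtime might test and
    box values under a layout like the above (top bits of a 64-bit word as
    the type tag). The exact shifts and payload widths here are assumptions,
    not the actual BGBScript encoding.

        #include <stdint.h>

        /* top 2 bits: 01 => fixnum (0x4..0x7), 10 => flonum (0x8..0xB) */
        static int is_fixnum(uint64_t v) { return (v >> 62) == 1; }
        static int is_flonum(uint64_t v) { return (v >> 62) == 2; }

        /* box a small integer: tag 01, 62-bit two's-complement payload */
        static uint64_t make_fixnum(int64_t i)
        {
            return ((uint64_t)1 << 62) |
                   ((uint64_t)i & (((uint64_t)1 << 62) - 1));
        }

        /* unbox: shift the tag out, then sign-extend the 62-bit payload */
        static int64_t fixnum_value(uint64_t v)
        {
            return ((int64_t)(v << 2)) >> 2;
        }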


                    // cacheable, used, modified bits
        CUM            kind of access
        ---            ------------------------------
        000            uncacheable DRAM
        001            MMI/O
        010            config
        011            ROM
        1xx            cacheable DRAM


    Hmm...
    Unfortunate acronyms are inescapable it seems...


    Linux, for example, requires that PCI Base Address Registers
    be programmable by the operating system, and the OS can
    choose any range (subject to host bridge configuration, of
    course) for the device.

    Easily done, just create an uncacheable PTE and set UM to 10
    for config space or 01 for MMI/O space.


    I guess, if PCIe were supported, some scheme could be developed to map
    the PCIe space either into part of the MMIO space, into RAM space, or
    maybe some other space.

    There is a functional difference between MMIO space and RAM space in
    terms of how they are accessed:
    RAM space: Cache does its thing and works with cache-lines;
    MMIO space: A request is sent over the bus, and then it waits for a
    response.

    If the MMIO bridge sees an MMIO request, it puts it onto the MMIO Bus,
    and sees if any device responds (if so, sending the response back to the origin). Otherwise, if no device responds after a certain number of
    clock cycles, an all-zeroes response is sent instead.
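    As a sketch, the bridge behavior described above could be modeled in C
    roughly as below. The device interface and the timeout constant are
    invented for illustration.

        #include <stdint.h>
        #include <stddef.h>

        #define MMIO_TIMEOUT_CYCLES 64  /* assumed bound */

        typedef struct {
            /* returns 1 and fills *data if the device owns 'addr' */
            int (*respond)(uint32_t addr, uint64_t *data);
        } mmio_device;

        uint64_t mmio_bridge_read(uint32_t addr,
                                  mmio_device *devs, size_t ndevs)
        {
            for (int cycle = 0; cycle < MMIO_TIMEOUT_CYCLES; cycle++) {
                for (size_t i = 0; i < ndevs; i++) {
                    uint64_t data;
                    if (devs[i].respond(addr, &data))
                        return data;  /* response goes back to origin */
                }
            }
            return 0;  /* no device responded: all-zeroes response */
        }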


    Currently, no sort of general-purpose bus is routed outside of the FPGA,
    and if one did exist, it is not yet clear what form it would take.

    Would need to limit pin counts though, so probably some sort of serial
    bus in any case.

    PCIe might be sort of tempting in the sense that, apparently, 1 PCIe lane
    can be subdivided among multiple devices, and bridge cards exist that can
    route PCIe over a repurposed USB cable and then connect multiple PCIe,
    PCI, or ISA cards. Albeit, apparently, with mixed results.



    It is notable that even on non-intel systems, one may need
    to map a 32-bit PCI BAR (AHCI is the classic example), which
    requires the address programmed in the BAR to be less than
    0x1_0000_0000.

    I/O MMU translates these devices from a 32-bit VAS into the 64-bit PAS.

                Granted systems can have custom PCI controllers
    that remap that into the larger physical address space with
    a bit of extra hardware, however the kernel people don't
    like that at all since there is no universal standard for
    such remapping and they don't want to support
    dozens of independent implementations, constantly
    changing from generation to generation.

    Though you would figure that, since they are already supporting 4
    incompatible mapping systems {Intel, AMD, ARM, RISC-V}, they would
    have gotten good at these implementations :-)

    Many modern SoCs (and ARM SBSA requires this) make their
    on-board devices and coprocessors look like PCI express
    devices to software,

    I made the CPU/cores in My 66000 have a configuration port
    that is set up during boot and that smells just like a PCIe
    port.

                         and SBSA requires the PCIe ECAM
    region for device discovery.    Here again, each of
    these on-board devices will have from one to six
    memory region base address registers (or one to
    three for 64-bit BARs).

    Encoding memory attributes into the address is common
    in microcontrollers, but in a general purpose processor
    constrains the system to an extent sufficient to make it
    unattractive for general purpose workloads.

    Agreed.


    At least for the userland address ranges, there is less of this going on
    than in SH4, which had basically spent the top 3 bits of the 32-bit
    address as mode.

    Say, IIRC:
    (29): No TLB
    (30): No Cache
    (31): Supervisor

    So, in effect, there was only 512MB of usable address space.
    The SH-4A had then expanded the lower part to 31 bits, so one could have
    2GB of usermode address space.


    But, say, if one can have 47 bits of freely usable virtual address space
    for userland, probably good enough.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Mon Nov 27 07:22:22 2023
    BGB <cr88192@gmail.com> writes:
    On 11/26/2023 9:45 AM, Anton Ertl wrote:
    This would be especially useful for the read-only sections (e.g, code)
    of common libraries like libc. However, in today's security landscape,
    you don't want one process to know where library code is mapped in
    other processes (i.e., you want ASLR), so we can no longer make use of
    that benefit. And it's doubtful whether other uses are worth the
    complications (and even if they are, there might be security issues,
    too).


    It seems to me, as long as it is a different place on each system,
    probably good enough. Demanding a different location in each process
    would create a lot of additional memory overhead due to things like
    base-relocations or similar.

    If the binary is position-independent (the default on Linux on AMD64),
    there is no such overhead.

    I just started the same binary twice and looked at the address of the
    same piece of code:

    Gforth 0.7.3, Copyright (C) 1995-2008 Free Software Foundation, Inc.
    Gforth comes with ABSOLUTELY NO WARRANTY; for details type `license'
    Type `bye' to exit
    see open-file
    Code open-file
    0x000055c2b76d5833 <gforth_engine+6595>: mov %r15,0x1c126(%rip)
    ...

    For the other process the same instruction is:

    Code open-file
    0x000055dd606e4833 <gforth_engine+6595>: mov %r15,0x1c126(%rip)

    Following the calls until I get to glibc, I get, for the two processes:

    0x00007f705c0c3b90 <__libc_open64+0>: push %r12
    0x00007f190aa34b90 <__libc_open64+0>: push %r12

    So not just the binary, but also glibc resides at different virtual
    addresses in the two processes.

    So obviously the Linux and glibc maintainers think that per-system
    ASLR is not good enough. They obviously want ASLR to work as well as
    possible against local attackers.

    Of course the TLB looks up by VA, what else. But if the VA is the
    same and the PA is the same, the same ASID can be used.


    ?...

    Typically the ASID applies to the whole virtual address space, not to
    individual memory objects.

    Yes, one would need more complicated ASID management than setting
    "the" ASID on switching to a process if different VMAs in the process
    have different ASIDs. Another reason not to go there.

    Power (and IIRC HPPA) do something in this direction with their
    "segments", where the VA space was split into 16 equal parts, and
    IIRC the 16 parts each extended the address by 16 bits (minus the 4
    bits of the segment number), so essentially they have 16 16-bit ASIDs.
    The address spaces are somewhat inflexible, but with 64-bit VAs
    (i.e. 60-bit address spaces) that may be good enough for quite a
    while. The cost is that you now have to manage 16 ASID registers.
    And if we ever get to actually making use of more than 60 bits of VA in
    other ways, combining this ASID scheme with that other use of the VAs
    becomes a problem.
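    A rough sketch in C of the lookup such a scheme implies: the top 4 bits
    of the VA select one of 16 per-process segment registers, whose ASID
    extends the TLB match key. Field widths here are assumptions for
    illustration.

        #include <stdint.h>

        typedef struct {
            uint16_t seg_asid[16];  /* per-segment ASIDs, OS-managed */
        } segment_regs;

        /* compute the (ASID, offset) pair the TLB would match on */
        static uint64_t tlb_key(const segment_regs *sr, uint64_t va,
                                uint16_t *asid_out)
        {
            unsigned seg = (unsigned)(va >> 60);   /* top 4 bits */
            *asid_out = sr->seg_asid[seg];
            return va & 0x0FFFFFFFFFFFFFFFull;     /* 60-bit offset */
        }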

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Mon Nov 27 07:57:08 2023
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    FWIW, MAP_FIXED is specified as an optional feature by POSIX
    and may not be supported by the OS at all.

    As usual, what is specified by a common-subset standard is not
    relevant for what an OS implementor has to do if they want to supply
    more than a practically unusable checkbox feature like the POSIX
    subsystem for Windows. There is a reason why WSL2 includes a full
    Linux kernel.
    ...
    Because the semantics of MAP_FIXED are to unmap any
    prior mapping in the range, if the implementation had happened to
    allocate the heap or shared System V region at that address, the heap
    would have become corrupt with dangling references hanging
    around which, if stored into, would subsequently corrupt the mapped region.

    Of course you can provide an address without specifying MAP_FIXED, and
    a high-quality OS will satisfy the request if possible (and return a
    different address if not), while a work-to-rule OS like the POSIX
    subsystem for Windows may then treat that address as if the user had
    passed NULL.

    Interestingly, Linux (since 4.17) also provides MAP_FIXED_NOREPLACE,
    which works like MAP_FIXED except that it returns an error if
    MAP_FIXED would replace part of an existing mapping. Makes me wonder
    if, in the no-conflict case, and given a page-aligned addr, there is any
    difference between MAP_FIXED, MAP_FIXED_NOREPLACE, and just providing
    an address without any of these flags in Linux. In the conflict case,
    the difference between the latter two variants is how you detect that
    it did not work as desired.
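    For illustration, the three variants look like this in C (Linux-specific;
    MAP_FIXED_NOREPLACE needs Linux 4.17+ and _GNU_SOURCE; the address below
    is an arbitrary aligned hint):

        #define _GNU_SOURCE
        #include <sys/mman.h>
        #include <stdio.h>

        int main(void)
        {
            void *want = (void *)0x200000000000ull;
            size_t len = 1 << 20;

            /* hint only: the kernel may place the mapping elsewhere */
            void *a = mmap(want, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            /* hard request: MAP_FAILED (EEXIST) on conflict -- here it
             * conflicts with the mapping just created above */
            void *b = mmap(want, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS |
                           MAP_FIXED_NOREPLACE, -1, 0);

            /* MAP_FIXED would instead silently replace the conflict */
            printf("hint: %p  noreplace: %p\n", a, b);
            return 0;
        }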

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From BGB@21:1/5 to Anton Ertl on Mon Nov 27 03:34:34 2023
    On 11/27/2023 1:22 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 11/26/2023 9:45 AM, Anton Ertl wrote:
    This would be especially useful for the read-only sections (e.g, code)
    of common libraries like libc. However, in todays security landscape,
    you don't want one process to know where library code is mapped in
    other processes (i.e., you want ASLR), so we can no longer make use of
    that benefit. And it's doubtful whether other uses are worth the
    complications (and even if they are, there might be security issues,
    too).


    It seems to me, as long as it is a different place on each system,
    probably good enough. Demanding a different location in each process
    would create a lot of additional memory overhead due to things like
    base-relocations or similar.

    If the binary is position-independent (the default on Linux on AMD64),
    there is no such overhead.


    OK.

    I was thinking mostly of things like PE/COFF, where often a mix of
    relative and absolute addressing is used, and loading typically involves applying base relocations (so, once loaded, the assumption is that the
    binary will not move further).

    Granted, traditional PE/COFF and ELF manage things like global variables differently (direct vs GOT).

    Though, on x86-64, PC-relative addressing is a thing, so less need for
    absolute addressing. PIC with PE/COFF might not be too much of a stretch.


    I just started the same binary twice and looked at the address of the
    same piece of code:

    Gforth 0.7.3, Copyright (C) 1995-2008 Free Software Foundation, Inc.
    Gforth comes with ABSOLUTELY NO WARRANTY; for details type `license'
    Type `bye' to exit
    see open-file
    Code open-file
    0x000055c2b76d5833 <gforth_engine+6595>: mov %r15,0x1c126(%rip)
    ...

    For the other process the same instruction is:

    Code open-file
    0x000055dd606e4833 <gforth_engine+6595>: mov %r15,0x1c126(%rip)

    Following the calls until I get to glibc, I get, for the two processes:

    0x00007f705c0c3b90 <__libc_open64+0>: push %r12
    0x00007f190aa34b90 <__libc_open64+0>: push %r12

    So not just the binary, but also glibc resides at different virtual
    addresses in the two processes.

    So obviously the Linux and glibc maintainers think that per-system
    ASLR is not good enough. They obviously want ASLR to work as well as possible against local attackers.


    OK.


    Of course the TLB looks up by VA, what else. But if the VA is the
    same and the PA is the same, the same ASID can be used.


    ?...

    Typically the ASID applies to the whole virtual address space, not to
    individual memory objects.

    Yes, one would need more complicated ASID management than setting
    "the" ASID on switching to a process if different VMAs in the process
    have different ASIDs. Another reason not to go there.

    Power (and IIRC HPPA) do something in this direction with their
    "segments", where the VA space was split into 16 equal parts, and
    IIRC the 16 parts each extended the address by 16 bits (minus the 4
    bits of the segment number), so essentially they have 16 16-bit ASIDs.
    The address spaces are somewhat inflexible, but with 64-bit VAs
    (i.e. 60-bit address spaces) that may be good enough for quite a
    while. The cost is that you now have to manage 16 ASID registers.
    And if we ever get to actually making use of more than 60 bits of VA in
    other ways, combining this ASID scheme with that other use of the VAs
    becomes a problem.


    OK.

    That seems a bit odd...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Mon Nov 27 14:59:36 2023
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    FWIW, MAP_FIXED is specified as an optional feature by POSIX
    and may not be supported by the OS at all.

    As usual, what is specified by a common-subset standard is not
    relevant for what an OS implementor has to do if they want to supply
    more than a practically unusable checkbox feature like the POSIX
    subsystem for Windows. There is a reason why WSL2 includes a full
    Linux kernel.
    ...
    Because the semantics of MAP_FIXED are to unmap any
    prior mapping in the range, if the implementation had happened to
    allocate the heap or shared System V region at that address, the heap
    would have become corrupt with dangling references hanging
    around which, if stored into, would subsequently corrupt the mapped region.

    Of course you can provide an address without specifying MAP_FIXED, and
    a high-quality OS will satisfy the request if possible (and return a
    different address if not), while a work-to-rule OS like the POSIX
    subsystem for Windows may then treat that address as if the user had
    passed NULL.

    Interestingly, Linux (since 4.17) also provides MAP_FIXED_NOREPLACE,
    which works like MAP_FIXED except that it returns an error if
    MAP_FIXED would replace part of an existing mapping. Makes me wonder
    if, in the no-conflict case, and given a page-aligned addr, there is any
    difference between MAP_FIXED, MAP_FIXED_NOREPLACE, and just providing
    an address without any of these flags in Linux. In the conflict case,
    the difference between the latter two variants is how you detect that
    it did not work as desired.


    I've never seen a case where using MAP_FIXED was useful, and I've
    been using mmap since the early 90's. I'm sure there must be one,
    probably where someone uses full VAs instead of offsets in data
    structures. Using the full VAs in the region will likely cause
    issues in the long term as the application is moved to updated or
    different POSIX systems, particularly if the data file associated
    with the region is expected to work in all subsequent
    implementations. MAP_FIXED should be avoided, IMO.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Mon Nov 27 16:10:49 2023
    scott@slp53.sl.home (Scott Lurndal) writes:
    I've never seen a case where using MAP_FIXED was useful, and I've
    been using mmap since the early 90's.

    Gforth uses it for putting the image into the dictionary (the memory
    area for Forth definitions, where more definitions can be put during a session): It first allocates the space for the dictionary with an
    anonymous mmap, then puts the image at the start of this area with a
    file mmap with MAP_FIXED.
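    A sketch of that pattern in C (names invented; error handling
    abbreviated). The MAP_FIXED overlay is safe here precisely because it
    can only replace pages inside the reservation made one step earlier:

        #include <sys/mman.h>
        #include <fcntl.h>
        #include <unistd.h>

        void *load_image(const char *path, size_t dict_size,
                         size_t image_size)
        {
            int fd = open(path, O_RDONLY);
            if (fd < 0) return MAP_FAILED;

            /* 1: reserve the whole dictionary area */
            void *dict = mmap(NULL, dict_size, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (dict == MAP_FAILED) { close(fd); return MAP_FAILED; }

            /* 2: overlay the image file at the start of the area */
            void *img = mmap(dict, image_size, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_FIXED, fd, 0);
            close(fd);
            if (img == MAP_FAILED) {
                munmap(dict, dict_size);
                return MAP_FAILED;
            }
            return dict;
        }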

    It also currently uses MAP_FIXED for allocating the memory for
    non-relocatable images, but thinking through it again, it's probably
    better to use MAP_FIXED_NOREPLACE or nothing, and then check the
    address, and report any error. However, we have not received any bug
    reports about that, which probably shows that nobody uses
    non-relocatable images.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Robert Finch on Thu Nov 30 16:59:37 2023
    Robert Finch <robfi680@gmail.com> writes:
    <snip>
    My thought was to support two
    banks of registers, one for the highest operating mode, and the other
    for remaining operating modes.

    How do the operating modes pass data between each other? E.g. for
    a system call, the arguments are generally passed to the next higher
    privilege level/operating mode via registers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Robert Finch on Thu Nov 30 13:35:04 2023
    Robert Finch wrote:
    The Q+ register file is implemented with one block-RAM per read port.
    With a 64-bit width this gives 512 registers in a block RAM. 192
    registers are needed for renaming a 64-entry architectural register
    file. That leaves 320 registers unused. My thought was to support two
    banks of registers, one for the highest operating mode, and the other
    for remaining operating modes. On exceptions the register bank could be
    switched. But to do this there are now 128 registers effectively being
    renamed, which leads to 384 physical registers to manage. This doubles
    the size of the register management code. Unless a pipeline flush
    occurs for exception processing, which I think would allow the renamer to
    reuse the same hardware to manage a new bank of registers. But that
    hinges on all references to registers in the current bank being unused.

    My other thought was that with approximately three times the number of architectural registers required, using 256 physical registers would
    allow 85 architectural registers. Perhaps some of the registers could be banked for different operating modes. Banking four registers per mode
    would use up 16.

    If the 512-register file were divided by three, 170 physical registers
    could be available for renaming. This is less than the ideal 192
    registers but maybe close enough to not impact performance adversely.


    I don't understand the problem.
    You want 64 architecture registers, each of which needs a physical register,
    plus 128 registers for in-flight instructions, so 192 physical registers.

    If you add a second bank of 64 architecture registers for interrupts
    then each needs a physical register. But that doesn't change the number
    of in-flight registers, so that's 256 physical total.
    Plus two sets of rename banks, one for each mode.

    If you drain the pipeline before switching register banks then all
    of the 128 in-flight registers will be free at the time of switch.

    If you can switch to interrupt mode without draining the pipeline then
    some of those 128 will be in-use for the old mode, some for the new mode
    (and the uOps carry a privilege mode flag so you can do things like
    check LD or ST ops against the appropriate PTE mode access control).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Thu Nov 30 20:30:52 2023
    EricP wrote:

    Robert Finch wrote:
    The Q+ register file is implemented with one block-RAM per read port.
    With a 64-bit width this gives 512 registers in a block RAM. 192
    registers are needed for renaming a 64-entry architectural register
    file. That leaves 320 registers unused. My thought was to support two
    banks of registers, one for the highest operating mode, and the other
    for remaining operating modes. On exceptions the register bank could be
    switched. But to do this there are now 128 registers effectively being
    renamed, which leads to 384 physical registers to manage. This doubles
    the size of the register management code. Unless a pipeline flush
    occurs for exception processing which I think would allow the renamer to
    reuse the same hardware to manage a new bank of registers. But that
    hinges on all references to registers in the current bank being unused.

    My other thought was that with approximately three times the number of
    architectural registers required, using 256 physical registers would
    allow 85 architectural registers. Perhaps some of the registers could be
    banked for different operating modes. Banking four registers per mode
    would use up 16.

    If the 512-register file were divided by three, 170 physical registers
    could be available for renaming. This is less than the ideal 192
    registers but maybe close enough to not impact performance adversely.


    I don't understand the problem.
    You want 64 architecture registers, each of which needs a physical register,
    plus 128 registers for in-flight instructions, so 192 physical registers.

    If you add a second bank of 64 architecture registers for interrupts
    then each needs a physical register. But that doesn't change the number
    of in-flight registers, so that's 256 physical total.
    Plus two sets of rename banks, one for each mode.

    If you drain the pipeline before switching register banks then all
    of the 128 in-flight registers will be free at the time of switch.

    A couple of bits of state and you don't need to drain the pipeline,
    you just have to find the youngest instruction with the property
    that all older instructions cannot raise an exception; these can be
    allowed to finish execution while you are fetching instructions for
    the new context.

    If you can switch to interrupt mode without draining the pipeline then
    some of those 128 will be in-use for the old mode, some for the new mode
    (and the uOps carry a privilege mode flag so you can do things like
    check LD or ST ops against the appropriate PTE mode access control).

    And 1 bit of state keeps track of which is which.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Thu Nov 30 23:06:32 2023
    Robert Finch wrote:

    On 2023-11-30 3:30 p.m., MitchAlsup wrote:
    EricP wrote:

    Robert Finch wrote:
    The Q+ register file is implemented with one block-RAM per read port.
    With a 64-bit width this gives 512 registers in a block RAM. 192
    registers are needed for renaming a 64-entry architectural register
    file. That leaves 320 registers unused. My thought was to support two
    banks of registers, one for the highest operating mode, and the other
    for remaining operating modes. On exceptions the register bank could
    be switched. But to do this there are now 128 registers effectively
    being renamed, which leads to 384 physical registers to manage. This
    doubles the size of the register management code. Unless a pipeline
    flush occurs for exception processing which I think would allow the
    renamer to reuse the same hardware to manage a new bank of registers.
    But that hinges on all references to registers in the current bank
    being unused.

    My other thought was that with approximately three times the number
    of architectural registers required, using 256 physical registers
    would allow 85 architectural registers. Perhaps some of the registers
    could be banked for different operating modes. Banking four registers
    per mode would use up 16.

    If the 512-register file were divided by three, 170 physical
    registers could be available for renaming. This is less than the
    ideal 192 registers but maybe close enough to not impact performance
    adversely.


    I don't understand the problem.
    You want 64 architecture registers, each of which needs a physical register,
    plus 128 registers for in-flight instructions, so 192 physical registers.
    If you add a second bank of 64 architecture registers for interrupts
    then each needs a physical register. But that doesn't change the number
    of in-flight registers, so that's 256 physical total.
    Plus two sets of rename banks, one for each mode.

    If you drain the pipeline before switching register banks then all
    of the 128 in-flight registers will be free at the time of switch.

    A couple of bits of state and you don't need to drain the pipeline,
    you just have to find the youngest instruction with the property that
    all older instructions cannot raise an exception; these can be
    allowed to finish execution while you are fetching instructions for
    the new context.

    Not quite comprehending. Will not the registers for the new context be improperly mapped if there are registers in use for the old map?

    All the in-flight destination registers will get written by the in-flight
    instructions. All the instructions of the new context will allocate
    registers from the pool which is not currently in-flight. So, while there
    is mental confusion on how this gets pulled off in HW, it does get pulled
    off just fine. When the new context STs the registers of the old context,
    it obtains the correct register from the old context {{Should HW be doing
    this, the same orchestration applies--and it still works.}}

    I think
    a state bit could be used to pause a fetch of a register still in use in
    the old map, but that is draining the pipeline anyway.

    You are assuming a RAT; I am not using a RAT but a CAM, where I can restore
    to any checkpoint by simply rewriting the valid bit vector.

    When the context swaps, a new set of target registers is always
    established before the registers are used.

    You still have to deal with the transient state and the CAM version works
    with either SW or HW save/restore.

    So incoming references in the
    new context should always map to the new registers?

    Which they will--as illustrated above.


    If you can switch to interrupt mode without draining the pipeline then
    some of those 128 will be in-use for the old mode, some for the new mode
    check LD or ST ops against the appropriate PTE mode access control).

    And 1 bit of state keeps track of which is which.

    Did some experimenting and the RAT turns out to be too large if more
    registers are incorporated. Even as few as 256 regs caused the RAT to
    increase in size substantially. So, I may go the alternate route of
    making registers wider rather than deeper, having 128-bit wide registers
    instead.

    Register ports (or equivalently RAT ports) are one of the things that most
    limit issue width. K9 was to have 22 RAT ports, and was similar in size to
    the {standard decoded Register File}.

    There is an eight-bit sequence number associated with each
    instruction, so the age of an instruction can easily be detected. I

    I assign a 4-bit number (16 checkpoints) to all instructions issued in
    the same clock cycle. This gives a 6-wide machine up to 96 instructions in-flight; and makes backing up (misprediction) simple and fast.

    found a really slick way of detecting instruction age using a matrix
    approach on the web. But I did not fully understand it. So I just use
    eight-bit counters for now.
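    For reference, the matrix approach is roughly the following (a C
    rendering of what is, in hardware, one row of bits per window entry;
    sizes and names here are illustrative):

        #include <stdint.h>

        #define NENT 32            /* in-flight window size (assumed) */

        static uint32_t valid;     /* bit i: entry i holds a live uop */
        static uint32_t age[NENT]; /* bit j of age[i]: i is younger than j */

        void alloc_entry(int i)    /* on dispatch */
        {
            age[i] = valid;        /* everything currently live is older */
            valid |= 1u << i;
        }

        void free_entry(int i)     /* on retire or flush */
        {
            valid &= ~(1u << i);
            for (int j = 0; j < NENT; j++)
                age[j] &= ~(1u << i);  /* i is no longer anyone's elder */
        }

        /* oldest among 'ready': a ready entry none of whose older
         * entries are themselves ready (a wired-AND per row in HW) */
        int oldest_ready(uint32_t ready)
        {
            for (int i = 0; i < NENT; i++)
                if ((ready & (1u << i)) && (age[i] & ready) == 0)
                    return i;
            return -1;
        }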

    There is a two bit privilege mode flag for instructions in the ROB. I
    suppose the ROB entries could be called uOps.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Fri Dec 1 02:43:20 2023
    Robert Finch wrote:

    On 2023-11-30 6:06 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    On 2023-11-30 3:30 p.m., MitchAlsup wrote:
    EricP wrote:

    Robert Finch wrote:
    The Q+ register file is implemented with one block-RAM per read
    port. With a 64-bit width this gives 512 registers in a block RAM.
    192 registers are needed for renaming a 64-entry architectural
    register file. That leaves 320 registers unused. My thought was to
    support two banks of registers, one for the highest operating mode,
    and the other for remaining operating modes. On exceptions the
    register bank could be switched. But to do this there are now
    128 registers effectively being renamed, which leads to 384 physical
    registers to manage. This doubles the size of the register
    management code. Unless a pipeline flush occurs for exception
    processing, which I think would allow the renamer to reuse the same
    hardware to manage a new bank of registers. But that hinges on all
    references to registers in the current bank being unused.

    My other thought was that with approximately three times the number
    of architectural registers required, using 256 physical registers
    would allow 85 architectural registers. Perhaps some of the
    registers could be banked for different operating modes. Banking
    four registers per mode would use up 16.

    If the 512-register file were divided by three, 170 physical
    registers could be available for renaming. This is less than the
    ideal 192 registers but maybe close enough to not impact
    performance adversely.


    I don't understand the problem.
    You want 64 architecture registers, each of which needs a physical
    register, plus 128 registers for in-flight instructions, so 192
    physical registers.

    If you add a second bank of 64 architecture registers for interrupts
    then each needs a physical register. But that doesn't change the number
    of in-flight registers, so that's 256 physical total.
    Plus two sets of rename banks, one for each mode.

    If you drain the pipeline before switching register banks then all
    of the 128 in-flight registers will be free at the time of switch.

    A couple of bits of state and you don't need to drain the pipeline,
    you just have to find the youngest instruction with the property that
    all older instructions cannot raise an exception; these can be
    allowed to finish execution while you are fetching instructions for
    the new context.

    Not quite comprehending. Will not the registers for the new context be
    improperly mapped if there are registers in use for the old map?

    All the in-flight destination registers will get written by the in-flight
    instructions. All the instructions of the new context will allocate
    registers from the pool which is not currently in-flight. So, while there
    is mental confusion on how this gets pulled off in HW, it does get pulled
    off just fine. When the new context STs the registers of the old context,
    it obtains the correct register from the old context {{Should HW be doing
    this, the same orchestration applies--and it still works.}}

    I think a state bit could be used to pause a fetch of a register still
    in use in the old map, but that is draining the pipeline anyway.

    You are assuming a RAT; I am not using a RAT but a CAM, where I can restore
    to any checkpoint by simply rewriting the valid bit vector.

    I think the RAT can be restored to a specific checkpoint as well using
    just an index value. Q+ has a checkpoint RAM of which one of the
    checkpoints is the active RAT. The RAT is really 16 tables. I stored a
    bit vector of the valid registers in the ROB so that the valid
    register set may be reset when a checkpoint is restored.

    When the context swaps, a new set of target registers is always
    established before the registers are used.

    You still have to deal with the transient state and the CAM version works
    with either SW or HW save/restore.

    So incoming references in
    the new context should always map to the new registers?

    Which they will--as illustrated above.


    If you can switch to interrupt mode without draining the pipeline then
    some of those 128 will be in-use for the old mode, some for the new mode
    (and the uOps carry a privilege mode flag so you can do things like
    check LD or ST ops against the appropriate PTE mode access control).

    And 1 bit of state keeps track of which is which.

    Did some experimenting and the RAT turns out to be too large if more
    registers are incorporated. Even as few as 256 regs caused the RAT to
    increase in size substantially. So, I may go the alternate route of
    making registers wider rather than deeper, having 128-bit wide
    registers instead.

    Register ports (or equivalently RAT ports) are one of the things that most
    limit issue width. K9 was to have 22 RAT ports, and was similar in size
    to the {standard decoded Register File}.

    The Q+ RAT has 16 read and 8 write ports. I am trying for a 4-wide
    machine. It is using about as many LUTs as the register file. The RAT is
    implemented with LUT RAM instead of block RAMs. I do not like the size,
    but it adds a lot to the operation of the machine.


    There is an eight-bit sequence number associated with each
    instruction, so the age of an instruction can easily be detected. I

    I assign a 4-bit number (16 checkpoints) to all instructions issued in
    the same clock cycle. This gives a 6-wide machine up to 96 instructions
    in-flight; and makes backing up (misprediction) simple and fast.

    The same thing is done with Q+. It supports 16 checkpoints with a
    four-bit number too. I have read that 16 is almost the same as infinity.

    Branch repair (from misprediction) has to be fast--especially if you are
    going for 0-cycle repair.

    Checkpoint recovery (from interrupt) just has to pick 1 of the checkpoints
    that has achieved the consistent state (no older instructions can raise an exception).

    Exception recovery can back up to the checkpoint containing the instruction
    which raised the exception, and then single-step forward until the exception
    is identified. Thus, you do not need "order" at a granularity smaller than
    a checkpoint.

    One can use pseudo-exceptions to solve difficult timing or sequencing
    problems, saving certain kinds of state transitions in the instruction
    queuing mechanism. For example, one could use a pseudo-exception to regain
    memory order in an ATOMIC event when you detect the order was less than sequentially consistent.

    found a really slick way of detecting instruction age using a matrix
    approach on the web. But I did not fully understand it. So I just use
    eight-bit counters for now.

    There is a two bit privilege mode flag for instructions in the ROB. I
    suppose the ROB entries could be called uOps.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Robert Finch on Sun Dec 3 11:07:48 2023
    Robert Finch wrote:
    Figured it out. Each architectural register in the RAT must refer to N physical registers, where N is the number of banks. Setting N to 4
    results in a RAT that is only about 50% larger than one supporting only
    a single bank. The operating mode is used to select the physical
    register. The first eight registers are shared between all operating
    modes so arguments can be passed to syscalls. It is tempting to have
    eight banks of registers, one for each hardware interrupt level.

    A consequence of multiple architecture register banks is that each extra
    bank keeps a set of mostly unused physical registers attached to it.
    For example, if there are 2 modes, User and Super, and a bank for each,
    since User and Super are mutually exclusive,
    64 of your 256 physical registers will be sitting unused, tied
    to the other mode's bank, so a max of 75% utilization efficiency.

    If you have 8 register banks then only 3/10 of the physical registers
    are available to use, the other 7/10 are sitting idle attached to
    arch registers in other modes consuming power.

    Also you don't have to play overlapped-register-bank games to pass
    args to/from syscalls. You can have specific instructions that reach
    into other banks: Move To User Reg, Move From User Reg.
    Since only syscall passes args into the OS you only need to access
    the user mode bank from the OS kernel bank.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Sun Dec 3 16:49:33 2023
    Robert Finch wrote:

    On 2023-11-30 9:43 p.m., MitchAlsup wrote:
    four-bit number too. I have read that 16 is almost the same as infinity.

    Branch repair (from misprediction) has to be fast--especially if you are
    going for 0-cycle repair.

    I think I am far away from zero-cycle repair. Does getting zero-cycle
    repair mean fetching from both branch directions and then selecting the correct one?

    No, zero cycle means you access the ICache twice per cycle, once on the
    predicted path and once on the alternate path. The alternate-path
    instructions are put in a buffer indexed by branch number. {{This happens
    10-12 cycles before the branch prediction is resolved}}

    When the branch instruction is launched out of its inst queue, the buffer
    is read, and if the branch prediction failed, you have the instructions
    from the mispredicted path ready to decode in the subsequent cycle.
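    A toy C model of that alternate-path buffer, to make the mechanism
    concrete (sizes and names invented; the real structure is indexed by
    branch/checkpoint number exactly as described above):

        #include <stdint.h>
        #include <string.h>

        #define NBRANCH 16   /* outstanding predicted branches */
        #define FETCHW  32   /* bytes fetched per alternate path */

        typedef struct {
            uint64_t alt_pc;          /* start of the alternate path */
            uint8_t  bytes[FETCHW];   /* pre-fetched instructions */
        } alt_path_entry;

        static alt_path_entry alt_buf[NBRANCH];

        /* at predict time: the second ICache access fills the buffer */
        void on_branch_predicted(int bnum, uint64_t alt_pc,
                                 const uint8_t *alt)
        {
            alt_buf[bnum].alt_pc = alt_pc;
            memcpy(alt_buf[bnum].bytes, alt, FETCHW);
        }

        /* at resolve time: on a mispredict, decode straight from the
         * buffer in the next cycle instead of re-accessing the ICache */
        const alt_path_entry *on_branch_mispredict(int bnum)
        {
            return &alt_buf[bnum];
        }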

    I will be happy if I can get branching to work at all. It
    is my first implementation using checkpoints. All the details of
    handling branches are not yet worked out in code for Q+. I think enough
    of the code is in place to get rough timing estimates. Not sure how well
    the BTB will work. A gselect predictor is also being used. Expecting a
    lot of branch mispredictions.

    Checkpoint recovery (from interrupt) just has to pick 1 of the checkpoints
    that has achieved the consistent state (no older instructions can raise an
    exception).

    Sounds straightforward enough.

    Exception recovery can back up to the checkpoint containing the
    instruction which raised the exception, and then single-step forward
    until the exception is identified. Thus, you do not need "order" at a
    granularity smaller than a checkpoint.

    This sounds a little trickier to do. Q+ currently takes an exception
    when things commit. It looks in the exception field of the queue entry
    for a fault code. If there is one it performs almost the same operation
    as a branch except it is occurring at the commit stage.

    One can use pseudo-exceptions to solve difficult timing or sequencing
    problems, saving certain kinds of state transitions in the instruction
    queuing mechanism. For example, one could use a pseudo-exception to regain
    memory order in an ATOMIC event when you detect the order was less than
    sequentially consistent.

    Noted.


    Gone back to using variable length instructions. Had to pipeline the instruction length decode across three clock cycles to get it to meet
    timing.

    Curious:: I got VLE to decode in 4 gates of delay, and I can PARSE up to
    16 instruction boundaries in a single cycle (using a tree of multiplexers).

    DECODE, then, only has to process the 32-bit instructions and route the constants in at Forwarding.

    Now:: I also use 3 cycles after ICache access, but 1 of the cycles includes
    tag comparison and set select, so I consider this a 2½-cycle decode; the ½
    cycle part performs the VLE and instruction-specifier route to decoder[k].

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Sun Dec 3 16:58:38 2023
    EricP wrote:

    Robert Finch wrote:
    Figured it out. Each architectural register in the RAT must refer to N
    physical registers, where N is the number of banks. Setting N to 4
    results in a RAT that is only about 50% larger than one supporting only
    a single bank. The operating mode is used to select the physical
    register. The first eight registers are shared between all operating
    modes so arguments can be passed to syscalls. It is tempting to have
    eight banks of registers, one for each hardware interrupt level.

    A consequence of multiple architecture register banks is that each extra
    bank keeps a set of mostly unused physical registers attached to it.

    A waste.....

    For example, if there are 2 modes User and Super and a bank for each,
    since User and Super are mutually exclusive,
    64 of your 256 physical registers will be sitting unused tied
    to the other mode bank, so max of 75% utilization efficiency.

    If you have 8 register banks then only 3/10 of the physical registers
    are available to use, the other 7/10 are sitting idle attached to
    arch registers in other modes consuming power.

    Also you don't have to play overlapped-register-bank games to pass
    args to/from syscalls. You can have specific instructions that reach
    into other banks: Move To User Reg, Move From User Reg.
    Since only syscall passes args into the OS you only need to access
    the user mode bank from the OS kernel bank.

    Whereas: Exceptions and interrupts save and restore 32 registers::
    A SysCall in My 66000 only saves and restores 24 of the 32 registers.
    So when control arrives, there are 8 argument registers from the
    Caller and 24 registers from Guest OS already loaded. So, SysCall
    handler already has its stack, and a variety of pointers to data
    structures it is interested in.

    On the way back, RET only restores 24 registers so Guest OS can pass
    back as many as 8 result registers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Tue Dec 5 23:47:10 2023
    Robert Finch wrote:


    For the Q+ MPU and SOC the bus system is organized like a tree with the
    root being at the CPU. The system bus operates with asynchronous transactions. The bus then fans out through bus bridges to various
    system components. Responses coming back from devices are buffered and
    merged together onto a more common bus when there are open spaces
    on the bus. I think it is fairly fast (well, at least for homebrew FPGA).
    Bus accesses are single cycle, but they may have a varying amount of
    latency.

    My "bus" is similar, but is, in effect, a 4-wire protocol done with transactions on the buss. Read goes to Mem CTL, when "ordered" Snoops
    go out, Snoop responses go to requesting core, Mem response goes to
    core. When core has SNOOP responses and mem data it sends DONE to
    mem Ctl. The arriving DONE allows the next access to that same cache
    line to begin (that is DONE "orders" successive accesses to the same
    line addresses, while allowing independent accesses to proceed inde-
    pendently.

    The data width of my "bus" is 1 cache line, or ½ cache line at DDR.
    Control is ~90 bits, including a 66-bit address.
    SNOOP responses are packed.


    Writes are “posted” so they are essentially single cycle.

    Writes to DRAM are "posted"
    Writes to config space are strongly ordered
    Writes to MMI/O are sequentially Consistent

    Reads percolate back up the tree to the CPU. The bus operates at the CPU
    clock rate (currently 40MHz) and transfers 128 bits at a time. Maximum peak
    transfer rate would then be 640 MB/s. Copying memory is bound to be much slower due to the read latency. Devices on the bus have a configuration
    block which looks something like a PCI config block, so devices
    addressing may be controlled by the OS.

    Multiple devices access the main DRAM memory via a memory controller.

    I interpose the LLC (L3) between the "bus" and the Mem Ctl. This
    interposition is what eliminates RowHammer. The L3 is not really a cache;
    it is a preview of the state DRAM will eventually achieve or has already
    achieved. It is, in essence, an infinite write buffer between the MC and
    DRC and a near infinite read buffer between DRC and MC.

    Several devices that are bus masters have their own ports to the memory controller and do not use up time on the main system bus tree. The

    Yes, PCIe HostBridge has master access to the "bus"; all "devices" are
    down under HostBridge. With CXL enabled, one can even place DRAM out on
    the PCIe tree,...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Fri Dec 8 17:53:12 2023
    Robert Finch wrote:

    What happens when there is a sequence of numerous branches in a row, such
    that the machine would run out of checkpoints for the branches?

    Stall Insert.

    Suppose you go
    Bra tgt1
    Bra tgt1
    … 30 times
    Bra tgt1

    Unconditional Branches do not need a checkpoint (all by themselves).

    Will the machine still work? Or will it crash?
    I have Q+ stalling until checkpoints are available, but it seems like a
    loss of performance. It is extra hardware to check for the case that
    might be preventable with software. I mean how often would a sequence
    like the above occur?

    Unconditional branches can be dealt with completely in the front end
    {they do not need to be executed--except as they alter IP.}

    On the other hand:: compilers are pretty good at cleaning up branches
    to unconditional branches.

    How will you tell for sure:: Read the ASM your compiler produces (a lot
    of it).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Sun Dec 10 01:02:00 2023
    Robert Finch wrote:

    Getting a bit lazy on the Q+ instruction commit in the interest of
    increasing the fmax. The results are already in the register file, so
    all the commit has to do is:

    1) Update the branch predictor.
    2) Free up physical registers

    By the time you write the physical register into the file, you are in
    a position to free up the now permanently invisible physical register
    it replaced.

    3) Free load/store queue entries associated with the ROB entry.

    Spectré:: write miss buffer data into Cache and TLB.
    This is also where I write ST.data into cache.

    4) Commit oddball instructions.
    5) Process any outstanding exceptions.
    6) Free the ROB entry
    7) Gather performance statistics.

    What needs to be committed is computed in the clock cycle before the
    commit. This pipelined signal adds a cycle of latency to the commit, but
    it only really affects rarely executed oddball instructions, and
    exceptions. Commit also will not commit if the commit pointer is near
    the queue pointer. Commit will also only commit up to the first oddball instruction or exception.

    Decided to axe the branch-to-register feature of conditional branch instructions because the branch target would not be known at enqueue
    time. It would require updating the ROB in two places.

    Question:: How would you handle::

    IDIV R6,R7,R8
    JMP R6

    ??

    Branches can now use a postfix immediate to extend the branch range.
    This allows 32 and 64-bit displacements in addition to the existing
    17-bit one. However, the assembler cannot know which to use in advance,
    so choosing a larger branch displacement size should be an option.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Sun Dec 10 04:06:39 2023
    Robert Finch wrote:

    On 2023-12-09 8:02 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    Getting a bit lazy on the Q+ instruction commit in the interest of
    increasing the fmax. The results are already in the register file, so
    all the commit has to do is:

    1)    Update the branch predictor.
    2)    Free up physical registers

    By the time you write the physical register into the file, you are in
    a position to free up the now permanently invisible physical register
    it replaced.

    Hey thanks, I should have thought of that. There are more physical
    registers available than needed (256, and only about 204 are needed), so
    it would probably run okay, but I think I see a way to reduce multiplexor
    usage by freeing the register when it is written.
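    A minimal sketch of that bookkeeping in C: each ROB entry remembers the
    physical register its destination displaced at rename time, and the
    displaced register is returned to the free list when the new value is
    written. This assumes, as in the exchange above, that the write happens
    once the instruction can no longer be squashed; the names are
    illustrative, not from either design.

        #include <stdint.h>

        #define NPHYS 256

        typedef struct {
            uint8_t dest_phys;  /* physical reg allocated at rename */
            uint8_t prev_phys;  /* mapping this destination displaced */
        } rob_entry;

        static uint8_t free_list[NPHYS];
        static int     free_top;

        /* called when the result is written to the register file */
        void on_writeback(rob_entry *e)
        {
            /* prev_phys is now permanently invisible: no younger
             * instruction can ever name it again */
            free_list[free_top++] = e->prev_phys;
        }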

    You are welcome.

    3)    Free load/store queue entries associated with the ROB entry.

    Spectré:: write miss buffer data into Cache and TLB.
    This is also where I write ST.data into cache.

    Is miss data for a TLB page fault?

    I leave TLB replacements in the miss buffer simply because they are so
    seldom that I don't feel it necessary to build yet another buffer.
    TLB update plus any tablewalk acceleration is deferred until the causing
    instruction retires.

    I have this stored in a register in
    the TLB which must be read by the CPU during exception handling.

    Technically, the TLB is the storage and comparators, while the rest
    of the table walking mechanics {including the TLB} are the MMU.

    Otherwise the TLB has a hidden page walker that updates the TLB.

    If you don't defer TLB update until after the causing instruction retires,
    Spectré-like attacks have a covert channel at their disposal.

    Scratching my head now over writing the store data at commit time.

    My 6-wide machine has a conditional-cache (memory reorder buffer)
    after execution; past that point, calculation instructions can raise no
    exception. This is the commit point. Between commit and retire, the
    conditional cache updates the Data Cache. So there is a period of time the
    pipeline builds up state, and, once it has been determined that
    nothing can prevent the manifestations of those instructions from
    taking place, there is a period of time state gets updated. Once
    all state is updated, the instruction has retired.

    4)    Commit oddball instructions.
    5)    Process any outstanding exceptions.
    6)    Free the ROB entry
    7)    Gather performance statistics.

    What needs to be committed is computed in the clock cycle before the
    commit. This pipelined signal adds a cycle of latency to the commit,
    but it only really affects rarely executed oddball instructions, and
    exceptions. Commit also will not commit if the commit pointer is near
    the queue pointer. Commit will also only commit up to the first
    oddball instruction or exception.

    Decided to axe the branch-to-register feature of conditional branch
    instructions because the branch target would not be known at enqueue
    time. It would require updating the ROB in two places.

    Question:: How would you handle::

        IDIV    R6,R7,R8
        JMP     R6

    ??

    There is a JMP address[Rn] instruction (really JSR Rt,address[Rn]) in
    the instruction set which is always treated as a branch miss when it
    executes. The RTS instruction could also be used; it allows the return
    address register to be specified and is a couple of bytes shorter. It
    was just that conditional branches had the feature removed. It required
    a third register be read for the flow control unit too.

    I have a LD IP,[address] instruction which is used to access GOT[k] for
    calling dynamically linked subroutines. This bypasses the LD-aligner
    to deliver IP to fetch faster.

    But you side-stepped answering my question. My question is what do you
    do when the Jump address will not arrive for another 20 cycles.

    Branches can now use a postfix immediate to extend the branch range.
    This allows 32 and 64-bit displacements in addition to the existing
    17-bit one. However, the assembler cannot know which to use in
    advance, so choosing a larger branch displacement size should be an
    option.

I use GOT[k] to branch farther than the 28-bit unconditional branch
displacement can reach. We have not yet run into a subroutine that
needs branches of more than 18 bits conditionally or 28 bits
unconditionally.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Sun Dec 10 15:11:38 2023
    Robert Finch wrote:

    On 2023-12-09 11:06 p.m., MitchAlsup wrote:
    Robert Finch wrote:

    I have a LD IP,[address] instruction which is used to access GOT[k] for
    calling dynamically linked subroutines. This bypasses the LD-aligner
    to deliver IP to fetch faster.

    But you side-stepped answering my question. My question is what do you
    do when the Jump address will not arrive for another 20 cycles.

    While waiting for the register value, other instructions would continue
    to queue and execute. Then that processing would be dumped because of
    the branch miss. I suppose hardware could be added to suppress
    processing until the register value is known. An option for a larger build.

Branches can now use a postfix immediate to extend the branch range.
This allows 32 and 64-bit displacements in addition to the existing
    17-bit one. However, the assembler cannot know which to use in
    advance, so choosing a larger branch displacement size should be an
    option.

    I use GOT[k] to branch farther than the 28-bit unconditional branch
    displacement can reach. We have not yet run into a subroutine that
    needs branches of more then 18-bits conditionally or 28-bits uncon-
    ditionally.

    I have yet to use GOT addressing.

    There are issues to resolve in the Q+ frontend. The next PC value for
    the BTB is not available for about three clocks. To go backwards in
    time, the next PC needs to be cached, or rather the displacement to the
    next PC to reduce cache size.

    What you need is an index and a set to directly access the cache--all
the other stuff can be done in arrears {AGEN and cache tag check}.

    The first time a next PC is needed it will
    not be available for three clocks. Once cached it would be available
    within a clock. The next PC displacement is the sum of the lengths of
    next four instructions. There is not enough room in the FPGA to add
    another cache and associated logic, however. Next PC = PC + 20 seems a
    whole lot simpler to me.

    Thus, I may go back to using a fixed size instruction or rather
    instructions with fixed alignment. The position of instructions could be
    as if they were fixed length while remaining variable length.

If the first part of an instruction decodes to the length of the
instruction easily (EASILY) and cheaply, you can avoid the header and
build a tree of unary pointers, each such pointer pointing at twice as
many instruction starting points as the previous. Even without headers,
My 66000 can find the instruction boundaries of up to 16 instructions
per cycle without adding "stuff" to the block of instructions.

Instructions would just be aligned at fixed intervals. If I set the
length to five bytes, for instance, most of the instruction set could be
accommodated. Operation with "packed" instructions would be an option for
a larger build. There could be a bit in a control register to allow
execution of packed or unpacked instructions so there is some backwards
compatibility with a smaller build.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Sun Dec 10 22:39:10 2023
    Robert Finch wrote:

    On 2023-12-10 10:11 a.m., MitchAlsup wrote:
    Robert Finch wrote:

    On 2023-12-09 11:06 p.m., MitchAlsup wrote:
    Robert Finch wrote:

I have a LD IP,[address] instruction which is used to access GOT[k] for
calling dynamically linked subroutines. This bypasses the LD-aligner
    to deliver IP to fetch faster.

But you side-stepped answering my question. My question is what do you
do when the Jump address will not arrive for another 20 cycles.

    While waiting for the register value, other instructions would
    continue to queue and execute. Then that processing would be dumped
    because of the branch miss. I suppose hardware could be added to
    suppress processing until the register value is known. An option for a
    larger build.

    Branches can now use a postfix immediate to extend the branch
range. This allows 32 and 64-bit displacements in addition to the
existing 17-bit one. However, the assembler cannot know which to
use in advance, so choosing a larger branch displacement size
    should be an option.

    I use GOT[k] to branch farther than the 28-bit unconditional branch
    displacement can reach. We have not yet run into a subroutine that
    needs branches of more then 18-bits conditionally or 28-bits uncon-
    ditionally.

    I have yet to use GOT addressing.

    There are issues to resolve in the Q+ frontend. The next PC value for
    the BTB is not available for about three clocks. To go backwards in
    time, the next PC needs to be cached, or rather the displacement to
    the next PC to reduce cache size.

    What you need is an index and a set to directly access the cache--all
    the other stuff can be done in arears {AGEN and cache tag check}

                                  The first time a next PC is needed it
    will not be available for three clocks. Once cached it would be
    available within a clock. The next PC displacement is the sum of the
    lengths of next four instructions. There is not enough room in the
    FPGA to add another cache and associated logic, however. Next PC = PC
    + 20 seems a whole lot simpler to me.

    Thus, I may go back to using a fixed size instruction or rather
    instructions with fixed alignment. The position of instructions could
    be as if they were fixed length while remaining variable length.

    If the first part of an instruction decodes to the length of the
    instruction
    easily (EASILY) and cheaply, you can avoid the header and build a tree of
    unary pointers each such pointer pointing at twice as many instruction
    starting points as the previous. Even without headers, My 66000 can find
    the instruction boundaries of up to 16 instructions per cycle without
    adding
    "stuff" the the block of instructions.

    Instructions would just be aligned at fixed intervals. If I set the
    length to five bytes for instance, most the instruction set could be
    accommodated. Operation by “packed” instructions would be an option
    for a larger build. There could be a bit in a control register to
    allow execution by packed or unpacked instructions so there is some
    backwards compatibility to a smaller build.

I cannot get it to work at a decent speed for only six instructions.
With byte-aligned instructions, 64 decoders are in use. (They’re really
small.) Then outputs from the appropriate ones are selected. It is
partially the fullness of the FPGA and routing congestion because of the
design. Routing is taking 90% of the time; logic is only about 10%.

I did some experimenting with block headers and ended up with a block
trailer instead of a header, for the assembler’s benefit, since it needs
to know all the instruction lengths before the trailer can be output.
Only the index of the instruction group is needed, so usually there are
only a couple of indexes used per instruction block. It can likely get
by with a 24-bit trailer containing four indexes plus the assumed one.
Usually only one or two bytes are wasted at the end of a block.
I assembled the boot ROM and there are 4.9 bytes per instruction on
average, including the overhead of block trailers and wasted bytes.
Branches and postfixes are five bytes, and there are a lot of them.

    Code density is a little misleading because branches occupy five bytes
    but do both a compare and branch operation. So they should maybe count
    as two instructions.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Sun Dec 10 22:52:04 2023
    Robert Finch wrote:

    On 2023-12-10 10:11 a.m., MitchAlsup wrote:
    Robert Finch wrote:

    On 2023-12-09 11:06 p.m., MitchAlsup wrote:
    Robert Finch wrote:

I have a LD IP,[address] instruction which is used to access GOT[k] for
calling dynamically linked subroutines. This bypasses the LD-aligner
    to deliver IP to fetch faster.

But you side-stepped answering my question. My question is what do you
do when the Jump address will not arrive for another 20 cycles.

    While waiting for the register value, other instructions would
    continue to queue and execute. Then that processing would be dumped
    because of the branch miss. I suppose hardware could be added to
    suppress processing until the register value is known. An option for a
    larger build.

    Branches can now use a postfix immediate to extend the branch
range. This allows 32 and 64-bit displacements in addition to the
existing 17-bit one. However, the assembler cannot know which to
use in advance, so choosing a larger branch displacement size
    should be an option.

    I use GOT[k] to branch farther than the 28-bit unconditional branch
    displacement can reach. We have not yet run into a subroutine that
    needs branches of more then 18-bits conditionally or 28-bits uncon-
    ditionally.

    I have yet to use GOT addressing.

    There are issues to resolve in the Q+ frontend. The next PC value for
    the BTB is not available for about three clocks. To go backwards in
    time, the next PC needs to be cached, or rather the displacement to
    the next PC to reduce cache size.

    What you need is an index and a set to directly access the cache--all
    the other stuff can be done in arears {AGEN and cache tag check}

                                  The first time a next PC is needed it
    will not be available for three clocks. Once cached it would be
    available within a clock. The next PC displacement is the sum of the
    lengths of next four instructions. There is not enough room in the
    FPGA to add another cache and associated logic, however. Next PC = PC
    + 20 seems a whole lot simpler to me.

    Thus, I may go back to using a fixed size instruction or rather
    instructions with fixed alignment. The position of instructions could
    be as if they were fixed length while remaining variable length.

    If the first part of an instruction decodes to the length of the
    instruction
    easily (EASILY) and cheaply, you can avoid the header and build a tree of
    unary pointers each such pointer pointing at twice as many instruction
    starting points as the previous. Even without headers, My 66000 can find
    the instruction boundaries of up to 16 instructions per cycle without
    adding
    "stuff" the the block of instructions.

    Instructions would just be aligned at fixed intervals. If I set the
    length to five bytes for instance, most the instruction set could be
    accommodated. Operation by “packed” instructions would be an option
    for a larger build. There could be a bit in a control register to
    allow execution by packed or unpacked instructions so there is some
    backwards compatibility to a smaller build.

    I cannot get it to work at a decent speed for only six instructions.
With byte-aligned instructions, 64 decoders are in use. (They’re really
small.) Then outputs from the appropriate ones are selected. It is
partially the fullness of the FPGA and routing congestion because of the
design. Routing is taking 90% of the time; logic is only about 10%.

    That wire:logic ratio is "not that much out of line" for long distance
    bussing of data.

    My word oriented design would cut the decoders down to 16-decoders and
    they have to look at 7-bits to produce 3×5-bit vectors. A tree of
    AND gates takes it from here basically performing FF1.

    I did some experimenting with block headers and ended up with a block
    trailer instead of a header, for the assembler’s benefit which needs to know all the instruction lengths before the trailer can be output. Only
    the index of the instruction group is needed, so usually there are only
    a couple of indexes used per instruction block. It can likely get by
    with a 24-bit trailer containing four indexes plus the assumed one.
    Usually only one or two bytes are wasted at the end of a block.
    I assembled the boot rom and there are 4.9 bytes per instruction
    average, including the overhead of block trailers and wasted bytes.
Branches and postfixes are five bytes, and there are a lot of them.

    Code density is a little misleading because branches occupy five bytes
    but do both a compare and branch operation. So they should maybe count
    as two instructions.

Sooner or later you have to mash everything down to {bits, bytes,
words}. For instructions having VLE and having non-identity units of work
performed, bytes are probably the best representation. My eXcel
spreadsheet stuff uses bits.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Robert Finch on Wed Dec 13 19:13:32 2023
    Robert Finch wrote:

    On 2023-12-11 4:57 a.m., BGB wrote:


    I got timing to work at 40+ MHz by using 32-bit instruction parcels
    rather than byte-oriented ones.

    An issue with 32-bit parcels is that float constants do not fit well
    into them because of the opcode present in a postfix. A 32-bit postfix
    has only 25 available bits for a constant. The next size up has 57 bits available. One thought I had was to reduce the floating-point precision
    to correspond. Single precision floats would be 25 bits, double
    precision 57 bits and quad precision 121 bits. All seven bits short of
    the usual.

It is because of issues such as you mention that my approach was
different. The instruction-specifier contains everything the decoder
needs to know about where the operands are, how to route them into
calculation, what to calculate, and where to deliver the result. Should
the instruction want constants for an operand*, they are concatenated
sequentially after the I-S and come in 32-bit and 64-bit quantities.
Should a 32-bit constant be consumed in a 64-bit calculation, it is
widened during route.

    (*) except for the 16-bit immediates and displacements from the
    Major OpCode table.
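
    A rough software model of that decode order (the wants_const*() and
    fetch*() helpers here are invented for illustration, not the actual
    encodings):

        #include <stdint.h>

        extern uint32_t fetch32(uint64_t pc);   /* hypothetical helpers */
        extern uint64_t fetch64(uint64_t pc);
        extern int wants_const32(uint32_t is);
        extern int wants_const64(uint32_t is);

        int64_t decode_one(uint64_t *pc) {
            uint32_t is = fetch32(*pc); *pc += 4;  /* instruction-specifier */
            int64_t imm = 0;
            if (wants_const32(is)) {               /* 32-bit trailing const */
                imm = (int32_t)fetch32(*pc);       /* widened during route  */
                *pc += 4;
            } else if (wants_const64(is)) {        /* 64-bit trailing const */
                imm = (int64_t)fetch64(*pc);
                *pc += 8;
            }
            return imm;
        }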

    I could try and use 40-bit parcels but they would need to be at fixed locations on the cache line for performance, and it would waste bytes.

    In effect I only have 32-bit parcels.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Robert Finch on Fri Dec 22 11:42:51 2023
    Robert Finch wrote:
    Stuck on checkpoint RAM now. Everything was going good until…. I
    realized that while instructions are executing they need to be able to
    update previous checkpoints, not just the current one. Which checkpoint
    gets updated depends on which checkpoint the instruction falls under. It
    is the register valid bit that needs to be updated. I used a “brute force” approach to implement this and it is 40k LUTs. This is about five times too large a solution. If I reduce the number of checkpoints
    supported to four from sixteen, then the component is 20k LUTs. Still
    too large.

    The issue is there are 256 valid bits times 16 checkpoints which means
    4096 registers. Muxing the register inputs and outputs uses a lot of LUTs.

    One thought is to stall until all the instructions with targets in a
    given checkpoint are finished executing before starting a new checkpoint region. It would seriously impact the CPU performance.


    (I don't have a solution, just passing on some info on this particular checkpointing issue.)

    Sounds like you might be using the same free register checkpoint algorithm
    I came up with for my simulator, which I assumed was a custom sram design.

    There is 1 bit for each physical register that is free.
    The checkpoint for a Bcc conditional branch copies the free bit vector,
    in your case 256 bits, to a row in the checkpoint sram.
As each instruction retires, it frees up its old dest physical register
and must mark the register free in *all* checkpoint contexts.

    That requires the ability to set all the free flags for a single register, which means an sram design that can write a whole row, and also set all the bits in one column, in your case set the 16 bits in each checkpoint for one
    of the 256 registers.
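
    In C, the required behavior looks something like this (a model of the
    access pattern, not of the sram cells themselves):

        #include <stdint.h>

        enum { NCKPT = 16, NWORDS = 256 / 32 };

        uint32_t free_now[NWORDS];         /* current free bit vector   */
        uint32_t ckpt[NCKPT][NWORDS];      /* one saved row per branch  */

        /* Checkpoint save: write a whole row at once. */
        void save_ckpt(int c) {
            for (int w = 0; w < NWORDS; w++) ckpt[c][w] = free_now[w];
        }

        /* Retire-time free: set one bit in the current vector AND the
         * same bit in every checkpoint -- the column set. */
        void free_pr(int pr) {
            free_now[pr / 32] |= 1u << (pr % 32);
            for (int c = 0; c < NCKPT; c++)
                ckpt[c][pr / 32] |= 1u << (pr % 32);
        }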

    I was assuming an ASIC design so a small custom sram seemed reasonable.
    But for an FPGA it requires 256*16 flip-flops plus decoders, etc.

    I think the risc-v Berkeley Out-of-Order Machine (BOOM) folks might have independently come up with the same approach on their BOOM-3 SonicBoom.
Their note [5] describes the same problem that my column setting solves.

    https://docs.boom-core.org/en/latest/sections/rename-stage.html

    While their target was 22nm ASIC, they say below that they
    implemented a version of BOOM-3 on an FPGA but don't give details.
    But their project might be open source so maybe the details
    are available online.

Sonicboom: The 3rd generation berkeley out-of-order machine
http://people.eecs.berkeley.edu/~krste/papers/SonicBOOM-CARRV2020.pdf

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Fri Dec 22 17:49:35 2023
    EricP wrote:

    Robert Finch wrote:
    Stuck on checkpoint RAM now. Everything was going good until…. I
    realized that while instructions are executing they need to be able to
    update previous checkpoints, not just the current one. Which checkpoint
    gets updated depends on which checkpoint the instruction falls under. It
    is the register valid bit that needs to be updated. I used a “brute
force” approach to implement this and it is 40k LUTs. This is about five
times too large a solution. If I reduce the number of checkpoints
    supported to four from sixteen, then the component is 20k LUTs. Still
    too large.

    The issue is there are 256 valid bits times 16 checkpoints which means
    4096 registers. Muxing the register inputs and outputs uses a lot of LUTs. >>
    One thought is to stall until all the instructions with targets in a
    given checkpoint are finished executing before starting a new checkpoint
    region. It would seriously impact the CPU performance.


    (I don't have a solution, just passing on some info on this particular checkpointing issue.)

    Sounds like you might be using the same free register checkpoint algorithm
    I came up with for my simulator, which I assumed was a custom sram design.

    There is 1 bit for each physical register that is free.
    The checkpoint for a Bcc conditional branch copies the free bit vector,
    in your case 256 bits, to a row in the checkpoint sram.
As each instruction retires, it frees up its old dest physical register
and must mark the register free in *all* checkpoint contexts.

    That requires the ability to set all the free flags for a single register, which means an sram design that can write a whole row, and also set all the bits in one column, in your case set the 16 bits in each checkpoint for one of the 256 registers.

    Two points::
1) The register that gets freed up, once you know this newly allocated
register will retire, can be determined with a small amount of logic (2
gates) per cell in your 256×16 matrix--no need for the column
write/clear/set. You can use this overwrite across columns to perform
register write elision.

    2) There are going to be allocations where you do not allocate any register
    to a particular instruction because the register is overwritten IN the same issue bundle. Here you can use a different "forwarding" notation so the
    result is captured by the stations and used without ever seeing the file.

I called this matrix the "History Table" in the Mc 88120; it provided valid
bits back to the aRN->pRN CAMs <backup> and valid bits back to the register
pool <successful retire>.

    Back then, we recognized that the architectural registers were a strict
    subset of the physical registers, so that as long as there were exactly
    31 (then: 32 now) valid registers in the pRF, one could always read
    values to be written into reservation station entries. In effect, the
whole thing was a RoB--once the RoB gets big enough, there is no reason
to have both a RoB and an aRF; just let the RoB do everything and change
    its name to Physical Register File. This eliminates the copy to aRF
    at retirement.

    I was assuming an ASIC design so a small custom sram seemed reasonable.
    But for an FPGA it requires 256*16 flip-flops plus decoders, etc.

    I think the risc-v Berkeley Out-of-Order Machine (BOOM) folks might have independently come up with the same approach on their BOOM-3 SonicBoom.
    Their note [5] describes the same problem as my column setting solves.

    https://docs.boom-core.org/en/latest/sections/rename-stage.html

I was doing something very similar in 1991.

    While their target was 22nm ASIC, they say below that they
    implemented a version of BOOM-3 on an FPGA but don't give details.
    But their project might be open source so maybe the details
    are available online.

Sonicboom: The 3rd generation berkeley out-of-order machine
http://people.eecs.berkeley.edu/~krste/papers/SonicBOOM-CARRV2020.pdf

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Robert Finch on Sat Dec 23 13:26:17 2023
    Robert Finch wrote:
    On 2023-12-22 12:49 p.m., MitchAlsup wrote:
    EricP wrote:

    Robert Finch wrote:
    Stuck on checkpoint RAM now. Everything was going good until…. I
    realized that while instructions are executing they need to be able
    to update previous checkpoints, not just the current one. Which
    checkpoint gets updated depends on which checkpoint the instruction
    falls under. It is the register valid bit that needs to be updated.
    I used a “brute force” approach to implement this and it is 40k
    LUTs. This is about five times too large a solution. If I reduce the
    number of checkpoints supported to four from sixteen, then the
    component is 20k LUTs. Still too large.

    The issue is there are 256 valid bits times 16 checkpoints which
    means 4096 registers. Muxing the register inputs and outputs uses a
    lot of LUTs.

    One thought is to stall until all the instructions with targets in a
    given checkpoint are finished executing before starting a new
    checkpoint region. It would seriously impact the CPU performance.

    I think I maybe found a solution using a block RAM and about 8k LUTs.

    (I don't have a solution, just passing on some info on this particular
    checkpointing issue.)

    Sounds like you might be using the same free register checkpoint
    algorithm
    I came up with for my simulator, which I assumed was a custom sram
    design.

    There is 1 bit for each physical register that is free.
    The checkpoint for a Bcc conditional branch copies the free bit vector,
    in your case 256 bits, to a row in the checkpoint sram.
    As each instruction retires and frees up its old dest physical register
    and it must mark the register free in *all* checkpoint contexts.

    That requires the ability to set all the free flags for a single
    register,
    which means an sram design that can write a whole row, and also set
    all the
    bits in one column, in your case set the 16 bits in each checkpoint
    for one
    of the 256 registers.

Not sure about setting bits in all checkpoints. I probably have just not
understood the issue yet. Partially terminology. There are two different
things happening: the register free/available status, which is being managed
with fifos, and the register contents valid bit. At the far end of the
pipeline, registers that were used are made free again by adding them to the
free fifo. This is somewhat inefficient because they could be freed
sooner, but that would require more logic; instead more registers are
used, since they are available from the RAM anyway.
    The register contents valid bit is cleared when a target register is assigned, and set once a value is loaded into the target register. The
    valid bit is also set for instructions that are stomped on as the old
    value is valid. When a checkpoint is restored, it restores the state of
    the valid bit along with the physical register tag. I am not
    understanding why the valid bit would need to be modified in all
    checkpoints. I would think it should reflect the pre-branch state of
    things.

    This has to do with free physical register list checkpointing and
    a particular gotcha that occurs if one tries to use a vanilla sram
    to save the free map bit vector for each checkpoint.
    It sounds like the BOOM people stepped in this gotcha at some point.

    Say a design has a bit vector indicating which physical registers are free. Rename allocates a register by using a priority selector to scan that
    vector and select a free PR to assign as a new dest PR.
    When this instruction retires, the old dest PR is freed and
    the new dest PR becomes the architectural register.
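
    For concreteness, that allocation step is a find-first-set over the
    free vector, reusing the free_now/NWORDS names from the sketch above
    (__builtin_ctz is the gcc/clang count-trailing-zeros builtin):

        /* Allocate a dest PR: priority-select the lowest free bit.
         * Returns -1 when nothing is free (Rename stalls). */
        int alloc_pr(void) {
            for (int w = 0; w < NWORDS; w++) {
                if (free_now[w]) {
                    int pr = w * 32 + __builtin_ctz(free_now[w]);
                    free_now[w] &= free_now[w] - 1;  /* clear: now in use */
                    return pr;
                }
            }
            return -1;
        }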

    When Decode sees a conditional branch Bcc it allocates a
    checkpoint in a circular buffer by incrementing the head counter,
    copies the *current* free bit vector into the new checkpoint row,
    and saves the new checkpoint index # in the Bcc uOp.
    If a branch mispredict occurs then we can restore the state at the
    Bcc by copying various state info from the Bcc checkpoint index #.
    This includes copying back the saved free vector to the current free vector. When the Bcc uOp retires we increment the circular tail counter
    to recover the checkpoint buffer row.

    The problem occurs when an old dest PR is in use so its free bit is clear
    when the checkpoint is saved. Then the instruction retires and marks the
    old dest PR as free in the bit vector. Then Bcc triggers a mispredict
    and restores the free vector that was copied when the checkpoint was saved, including the then not-free state of the PR freed after the checkpoint.
    Result: the PR is lost from the free list. After enough mispredicts you
    run out of free physical registers and hang at Rename waiting to allocate.

It needs some way to edit the checkpointed free bit vector so that,
no matter in what order PR-allocate, retire-PR-free, checkpoint save #X,
and rollback to checkpoint #Y occur, the correct free vector gets restored.
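
    Spelled out as a trace (PR numbers made up), with the column set from
    the earlier sketch as the repair:

        t0: I1 renames r3; its old dest p9 is still live -> free bit clear
        t1: Bcc saves checkpoint X                 (p9 recorded as busy)
        t2: I1 (older than Bcc) retires, frees p9  -> free_now has p9 free
        t3: Bcc mispredicts, free_now = ckpt[X]    -> p9 busy again: leaked

    With the column set in free_pr(), step t2 also sets p9's bit in
    ckpt[X], so the rollback at t3 restores p9 as free.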

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Sat Dec 23 23:19:47 2023
    EricP wrote:

    Robert Finch wrote:
    On 2023-12-22 12:49 p.m., MitchAlsup wrote:
    EricP wrote:

    Robert Finch wrote:
    Stuck on checkpoint RAM now. Everything was going good until…. I
    realized that while instructions are executing they need to be able
    to update previous checkpoints, not just the current one. Which
    checkpoint gets updated depends on which checkpoint the instruction
    falls under. It is the register valid bit that needs to be updated.
    I used a “brute force” approach to implement this and it is 40k
LUTs. This is about five times too large a solution. If I reduce the
number of checkpoints supported to four from sixteen, then the
    component is 20k LUTs. Still too large.

    The issue is there are 256 valid bits times 16 checkpoints which
    means 4096 registers. Muxing the register inputs and outputs uses a
    lot of LUTs.

One thought is to stall until all the instructions with targets in a
given checkpoint are finished executing before starting a new
    checkpoint region. It would seriously impact the CPU performance.

    I think I maybe found a solution using a block RAM and about 8k LUTs.

(I don't have a solution, just passing on some info on this particular
checkpointing issue.)

    Sounds like you might be using the same free register checkpoint
    algorithm
    I came up with for my simulator, which I assumed was a custom sram
    design.

    There is 1 bit for each physical register that is free.
The checkpoint for a Bcc conditional branch copies the free bit vector,
in your case 256 bits, to a row in the checkpoint sram.
As each instruction retires, it frees up its old dest physical register
and must mark the register free in *all* checkpoint contexts.

    That requires the ability to set all the free flags for a single
    register,
    which means an sram design that can write a whole row, and also set
    all the
    bits in one column, in your case set the 16 bits in each checkpoint
    for one
    of the 256 registers.

Not sure about setting bits in all checkpoints. I probably have just not
understood the issue yet. Partially terminology. There are two different
things happening: the register free/available status, which is being managed
    with fifos and the register contents valid bit. At the far end of the
    pipeline, registers that were used are made free again by adding to the
    free fifo. This is somewhat inefficient because they could be freed
    sooner, but it would require more logic, instead more registers are
    used, they are available from the RAM anyway.
    The register contents valid bit is cleared when a target register is
    assigned, and set once a value is loaded into the target register. The
    valid bit is also set for instructions that are stomped on as the old
    value is valid. When a checkpoint is restored, it restores the state of
    the valid bit along with the physical register tag. I am not
    understanding why the valid bit would need to be modified in all
    checkpoints. I would think it should reflect the pre-branch state of
    things.

    This has to do with free physical register list checkpointing and
    a particular gotcha that occurs if one tries to use a vanilla sram
    to save the free map bit vector for each checkpoint.
    It sounds like the BOOM people stepped in this gotcha at some point.

    Say a design has a bit vector indicating which physical registers are free. Rename allocates a register by using a priority selector to scan that
    vector and select a free PR to assign as a new dest PR.
    When this instruction retires, the old dest PR is freed and
    the new dest PR becomes the architectural register.

It is often the case that a logical register is the target of more
than one result in a single checkpoint. When this is the case, no
physical register need be allocated to the now-dead result, so we
invented a way to convey that this result is only captured from the
operand bus and was not even contemplated to be written into the
pRF. This makes the pool of free registers go further--up to 30%
further.......

    When Decode sees a conditional branch Bcc it allocates a
    checkpoint in a circular buffer by incrementing the head counter,
    copies the *current* free bit vector into the new checkpoint row,
    and saves the new checkpoint index # in the Bcc uOp.
    If a branch mispredict occurs then we can restore the state at the
    Bcc by copying various state info from the Bcc checkpoint index #.
    This includes copying back the saved free vector to the current free vector. When the Bcc uOp retires we increment the circular tail counter
    to recover the checkpoint buffer row.

    The problem occurs when an old dest PR is in use so its free bit is clear when the checkpoint is saved. Then the instruction retires and marks the
    old dest PR as free in the bit vector. Then Bcc triggers a mispredict
    and restores the free vector that was copied when the checkpoint was saved, including the then not-free state of the PR freed after the checkpoint. Result: the PR is lost from the free list. After enough mispredicts you
    run out of free physical registers and hang at Rename waiting to allocate.

    Michael Shebanow and I have a patent on that dated around 1992 (filing).
Our design could be retiring one or more checkpoints, backing up a
mispredicted branch, and issuing instructions on the alternate path; all in
    the same clock.

    It needs some way to edit the checkpointed free bit vector so that
    no matter what order of PR-allocate, retire-PR-free, checkpoint save #X,
    and rollback to checkpoint #Y, that the correct free vector gets restored.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Robert Finch on Tue Jan 9 08:23:24 2024
    Robert Finch wrote:
    Predicated logic and the PRED modifier on my mind tonight.

    I think I have discovered an interesting way to handle predicated logic.
    If a predicate is true the instruction is scheduled and executes
    normally. If the predicate is false the instruction is modified to a
    special copy operation and scheduled to execute on an ALU regardless of
    what the original execution unit would be. What makes this efficient is
    that only a single target register read port is required for the ALU
    unit versus having a target register read port for every functional
    unit. The copy mux is present in the ALU only and not in the other
    functional units. For most instructions there is no predication.

Yes, the general case is each uOp has a predicate source and a bool to test.
If the value matches the condition you execute the ON_MATCH part of the uOp;
if it does not match then execute the ON_NO_MATCH part.

    condition = True | False

    (pred == condition) ? ON_MATCH : ON_NO_MATCH;

    The ON_NO_MATCH uOp function is usually some housekeeping.
    On an in-order it might diddle the scoreboard to indicate the register
    write is done. On OoO it might copy the old dest register to new.

    Note that the source register dependencies change between match and no_match.

    if (pred == True) ADD r3 = r2 + r1

    If pred == True then it matches and the uOp is dependent on r2 and r1.
    If pred != True then it no_match and uOp is dependent on the old dest r3
    as a source to copy to the new dest r3.

    Dynamically pruning the unnecessary uOp source register dependencies
    for the alternate part can allow it to launch earlier.
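
    A sketch of that rewrite once the predicate resolves, in C (the uOp
    record layout is invented for illustration):

        enum op { OP_ADD, OP_COPY /* ... */ };

        typedef struct {
            enum op op;
            int     src[2], nsrc;     /* sources the scheduler waits on */
            int     dst_new, dst_old; /* new and previous dest PR       */
        } uop_t;

        /* On a match the uOp keeps its real sources; on a no-match it
         * degenerates to a copy of the old dest, and the r2/r1
         * dependencies are pruned so it can launch as soon as the old
         * r3 is available. */
        void resolve_pred(uop_t *u, int pred, int cond) {
            if (pred == cond) return;   /* ON_MATCH: e.g. ADD r3 = r2 + r1 */
            u->op     = OP_COPY;        /* ON_NO_MATCH: r3.new = r3.old    */
            u->src[0] = u->dst_old;
            u->nsrc   = 1;
        }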

    Also predicated LD and ST have some particular issues to think about.
    For example, under TSO a younger LD cannot bypass an older LD.
    If an older LD has an unresolved predicate then we don't know if it exists
    so we have to block the younger LD until the older predicate resolves.
    The LD's ON_NO_MATCH housekeeping might include diddling the LSQ dependency matrix to wake up any younger LD's in the LSQ that had been blocked.

    (Yes, I'm sure one could get fancier with replay traps.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Tue Jan 9 20:38:41 2024
    EricP wrote:

    Robert Finch wrote:
    Predicated logic and the PRED modifier on my mind tonight.

    I think I have discovered an interesting way to handle predicated logic.
    If a predicate is true the instruction is scheduled and executes
    normally. If the predicate is false the instruction is modified to a
    special copy operation and scheduled to execute on an ALU regardless of
    what the original execution unit would be. What makes this efficient is
    that only a single target register read port is required for the ALU
    unit versus having a target register read port for every functional
    unit. The copy mux is present in the ALU only and not in the other
    functional units. For most instructions there is no predication.

    Yes, the general case is each uOp has predicate source and a bool to test.
    If the value matches the predicate you execute the ON_MATCH part of the uOp, if it does not match then execute the ON_NO_MATCH part.

    condition = True | False

    (pred == condition) ? ON_MATCH : ON_NO_MATCH;

    The ON_NO_MATCH uOp function is usually some housekeeping.
    On an in-order it might diddle the scoreboard to indicate the register
    write is done. On OoO it might copy the old dest register to new.

    A SB handles this situation with greater elegance than a reservation station. The SB can merely clear the dependency without writing to the RF, so the
    now released reader reads the older value. {Thornton SB}

The value capturing reservation station entry has to first capture and then
ignore the delivered result (and so does the RF/RoB). {Tomasulo RS}

The Value-free RS entry is more like the SB than the Tomasulo RS.

A typical SB can be used to hold result delivery on instructions in the
shadow of a PRED to keep the data-flow mechanism from getting unkempt.
Both then-clause and else-clause can be held while the condition is
evaluating,...

    Note that the source register dependencies change between match and no_match.

    if (pred == True) ADD r3 = r2 + r1

    If pred == True then it matches and the uOp is dependent on r2 and r1.
    If pred != True then it no_match and uOp is dependent on the old dest r3
    as a source to copy to the new dest r3.

    Yes, and there can be multiple instructions in the shadow of a PRED.

    Dynamically pruning the unnecessary uOp source register dependencies
    for the alternate part can allow it to launch earlier.

    As illustrated above, no need to stall launch if you can stall result
    delivery. {A key component of the Thornton SB}

    Also predicated LD and ST have some particular issues to think about.
    For example, under TSO a younger LD cannot bypass an older LD.

    Easy:: don't do TSO <most of the time> or SC <outside of ATOMIC stuff>.

    If an older LD has an unresolved predicate then we don't know if it exists
    so we have to block the younger LD until the older predicate resolves.

This is why TSO and SC are slower than causal or weaker. Consider a memory
reorder buffer which allows generated addresses to probe the cache and
determine hit as operand data-flow permits--BUT holds onto the data and
writes (LD) or reads (ST) the RF in program order. This violates TSO
and SC, but mono-threaded codes are immune to this memory ordering problem
{and multi-threaded programs are immune except while performing ATOMIC
things}.

TSO and SC are simply slower when trying to perform memory reference
instructions in both the then-clause and in the else-clause while waiting
for the resolution of the condition--even if no results are written into RF
until after resolution.

    The LD's ON_NO_MATCH housekeeping might include diddling the LSQ dependency matrix to wake up any younger LD's in the LSQ that had been blocked.

    (Yes, I'm sure one could get fancier with replay traps.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to MitchAlsup on Wed Jan 10 23:30:12 2024
    MitchAlsup wrote:
    EricP wrote:

    If an older LD has an unresolved predicate then we don't know if it
    exists
    so we have to block the younger LD until the older predicate resolves.

This is why TSO and SC are slower than causal or weaker. Consider a memory
reorder buffer which allows generated addresses to probe the cache and
determine hit as operand data-flow permits--BUT holds onto the data and
writes (LD) or reads (ST) the RF in program order. This violates TSO
and SC, but mono-threaded codes are immune to this memory ordering problem
{and multi-threaded programs are immune except while performing ATOMIC
things}.

TSO and SC are simply slower when trying to perform memory reference
instructions in both the then-clause and in the else-clause while waiting
for the resolution of the condition--even if no results are written into RF
until after resolution.

BTW in case anyone is interested, I came across a recent paper that
compares the Apple M1 ARM processor's two memory consistency models:
the ARM weak ordering and the total store ordering (TSO) model from x86.

    "Based on various workloads, our findings indicate that TSO is,
    on average, 8.94% slower than ARM’s weaker memory ordering."

TOSTING: Investigating Total Store Ordering on ARM, 2023
https://sra.uni-hannover.de/Publications/2023/tosting-arcs23/wrenger_23_arcs.pdf
    https://www.sra.uni-hannover.de/Publications/2023/tosting-arcs23/wrenger_23_arcs_talk.pdf

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Chris M. Thomasson on Thu Jan 11 01:10:54 2024
    Chris M. Thomasson wrote:
    On 1/10/2024 8:30 PM, EricP wrote:
    MitchAlsup wrote:
    EricP wrote:

    If an older LD has an unresolved predicate then we don't know if it
    exists
    so we have to block the younger LD until the older predicate resolves.

    This is why TSO and SC are slower than causal or weaker. Consider a
    memory
    reorder buffer which allows generated addresses to probe the cache
    and determine hit as operand data-flow permits--BUT holds onto the
    data and
    writes (LD or reads (ST) to) the RF in program order. This violates TSO
    and SC but mono-threaded codes are immune to this memory ordering
    problem
    {and multi-threaded programs are immune except while performing ATIMIC
    things.}

    TSO and SC is simply slower when trying to perform memory reference
    inst-
    ructions in both the then-clause and in the else clause while waiting
    the resolution of the condition--even if no results are written into
    RF until after resolution.

    BTW in case anyone is interested I came across the recent paper that
    compares the Apple M1 ARM processors two memory consistency models:
    the ARM weak ordering and the total store ordering (TSO) model from x86.

    "Based on various workloads, our findings indicate that TSO is,
    on average, 8.94% slower than ARM’s weaker memory ordering."

    TOSTING: Investigating Total Store Ordering on ARM, 2023
    https://sra.uni-hannover.de/Publications/2023/tosting-arcs23/wrenger_23_arcs.pdf

    https://www.sra.uni-hannover.de/Publications/2023/tosting-arcs23/wrenger_23_arcs_talk.pdf

    Will read them. Thanks for the heads up Eric.

Caveat that this compares Apple's ARM weak to Apple's TSO implementation.
Because the Apple M1 has two consistency models,
if TSO is just there as a porting aid for x86 code that depends on it,
they might not have put as many bells and whistles into making it
as fast as someone who has only TSO.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Thu Jan 11 06:47:21 2024
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Chris M. Thomasson wrote:
    On 1/10/2024 8:30 PM, EricP wrote:
    BTW in case anyone is interested I came across the recent paper that
    compares the Apple M1 ARM processors two memory consistency models:
the ARM weak ordering and the total store ordering (TSO) model from x86.
    "Based on various workloads, our findings indicate that TSO is,
    on average, 8.94% slower than ARM’s weaker memory ordering."

    TOSTING: Investigating Total Store Ordering on ARM, 2023
    https://sra.uni-hannover.de/Publications/2023/tosting-arcs23/wrenger_23_arcs.pdf

    https://www.sra.uni-hannover.de/Publications/2023/tosting-arcs23/wrenger_23_arcs_talk.pdf

    Thanks.

Caveat that this compares Apple's ARM weak to Apple's TSO implementation.
Because the Apple M1 has two consistency models,
    if TSO is just there as a porting aid for x86 code that depends on it,
    they might not have put as many bells and whistles into making it
    as fast as someone who has only TSO.

    Exactly. In particular, my take is that the microarchitecture can
    reorder memory accesses as much as it wants, but has to check other
    cores' memory accesses, and then roll back if the guarantees of the architecture (ideally SC, but you can also make it TSO for the sake of
    the discussion) would not be met. The costs are that the
    microarchitecture may need more buffering (slowdowns only if the
    buffers are full), and maybe (not sure) more coherence traffic, but as
    long as the resources are there, there is no slowdown in the
    non-contended case, not even when accessing shared data (where code
    with explicit or (MY66000) automatically inserted barriers is slow).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to All on Fri Jan 12 02:09:31 2024
So, TSO loses ~10% in performance.

    Sounds about right.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Robert Finch on Mon Mar 4 09:03:52 2024
    Robert Finch wrote:
    Trading off the maximum amount of contiguously addressed memory for a
    smaller PTE and PTP format. The PTE and PTPs are thus only 32-bits in
    size. A PTE has a 17-bit page number, while the PTP uses 30-bits for the
    page number. With 64kB pages this limits the system to 2^46 bytes of
memory, which is probably okay for a small system. The PTE's 17-bit page
number can only work within 8GB of contiguous memory. All the pages the
PTE table covers must be in the same 8GB memory range. The upper bits of
the translated address will come from the PTP's upper bits. This makes
memory look like a tree; groups of leaves are stuck to particular branches.
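
    The reach arithmetic as a quick check (field widths from the post;
    everything else assumed):

        #define PAGE_BITS   16   /* 64kB pages      */
        #define PTE_PN_BITS 17   /* PTE page number */
        #define PTP_PN_BITS 30   /* PTP page number */

        /* 17 + 16 = 33 address bits: a PTE can only reach within 8GB.   */
        _Static_assert(PTE_PN_BITS + PAGE_BITS == 33, "PTE reach: 2^33 = 8GB");
        /* 30 + 16 = 46 address bits: the system tops out at 2^46 bytes. */
        _Static_assert(PTP_PN_BITS + PAGE_BITS == 46, "system max: 2^46 B");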

    If I understand you correctly this means the PTE pages for
    each 8 GB range must be in a PTP located inside that 8 GB range.
    If that is ROM or IO registers in that range then there must be
    RAM in the same 8 GB range in order to map it.

    That would make modularizing components a little difficult as you
    will have to add RAM mapping modules to each 8 GB address range.

    And of course the OS memory manager has to be coded to specially
    handle the RAM for each of these mapping ranges.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Mon Mar 4 18:37:08 2024
    EricP wrote:

    Robert Finch wrote:
    Trading off the maximum amount of contiguously addressed memory for a
    smaller PTE and PTP format. The PTE and PTPs are thus only 32-bits in
    size. A PTE has a 17-bit page number, while the PTP uses 30-bits for the
    page number. With 64kB pages this limits the system to 2^46 bytes of
memory, which is probably okay for a small system. The PTE's 17-bit page
    number can only work with 8GB of contiguous memory. All the pages the
    PTE table covers must be in the same 8GB memory range. The upper bits of
the translated address will come from the PTP's upper bits. This makes
memory look like a tree; groups of leaves are stuck to particular branches.

    If I understand you correctly this means the PTE pages for
    each 8 GB range must be in a PTP located inside that 8 GB range.
    If that is ROM or IO registers in that range then there must be
    RAM in the same 8 GB range in order to map it.

Consider a PTE mapping ROM !! How does it get set ??

    That would make modularizing components a little difficult as you
    will have to add RAM mapping modules to each 8 GB address range.

    You may even have to create a way to map PTE elements into areas
    that have no RAM, creating more overhead and complexity.

    And of course the OS memory manager has to be coded to specially
    handle the RAM for each of these mapping ranges.

    How would you map a DRAM DIMM that contained only 2GB of RAM ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Robert Finch on Tue Mar 5 14:49:53 2024
    Robert Finch <robfi680@gmail.com> writes:
    On 2024-03-04 9:03 a.m., EricP wrote:

    If I understand you correctly this means the PTE pages for
    each 8 GB range must be in a PTP located inside that 8 GB range.
    If that is ROM or IO registers in that range then there must be
    RAM in the same 8 GB range in order to map it.

    That would make modularizing components a little difficult as you
    will have to add RAM mapping modules to each 8 GB address range.

    And of course the OS memory manager has to be coded to specially
    handle the RAM for each of these mapping ranges.



    Yes, the above is what I was thinking.

    There is a scratchpad RAM in the ROM address space, used for
    bootstrapping. RAM access is needed during the boot process before
everything is set up and the DRAM is accessible. So, it is possible to
    map in that manner.

    It's not uncommon to use the LLC as a scratchpad during
    DRAM controller initialization...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Robert Finch on Tue Mar 5 15:37:13 2024
    Robert Finch wrote:

    Bigfoot pages are 1MB in size. That allows an entire 64GB address range
    to be mapped with one page table. With larger memory systems a larger
page size is needed IMO. 64GB is still 65,536 pages when the pages are
1MB in size. There is 32GB of RAM in my machine today. Tomorrow it will be 64GB.

    Above a certain point the added latency of filling/spilling a page to DISK
    that is 64KB in size rather than 4KB in size outweighs the gain in the TLB.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Mar 5 16:20:37 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Robert Finch wrote:

    Bigfoot pages are 1MB in size. That allows an entire 64GB address range
    to be mapped with one page table. With larger memory systems a larger
    page size is needed IMO. 64GB is 65,536 pages still when the pages are
    1MB in size. There is 32GB RAM in my machine today. Tomorrow it will 64GB.

Above a certain point the added latency of filling/spilling a page to DISK
that is 64KB in size rather than 4KB in size outweighs the gain in the TLB.

Although that's not as big a concern today, given NVMe (or Optane).

    Intel, AMD and ARM already support large page sizes (1GB for Intel/AMD
    and 512GB for ARM64).

    Mixing page sizes in a single operating system is tricky, as the
    physical regions backing the large page sizes need to be page-size aligned,
and supporting multiple page sizes leads to checkerboarding
    or the need to reassign pages when making a large page allocation
    (linux can preallocate at boot, or use THP).

Using a single large page size to accommodate a small number of applications
will waste memory for the large number of small memory applications
    (e.g. most unix/linux commands).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Tue Mar 5 08:40:36 2024
    On 3/5/2024 7:37 AM, MitchAlsup1 wrote:
    Robert Finch wrote:

    Bigfoot pages are 1MB in size. That allows an entire 64GB address
    range to be mapped with one page table. With larger memory systems a
    larger page size is needed IMO. 64GB is 65,536 pages still when the
    pages are 1MB in size. There is 32GB RAM in my machine today. Tomorrow
    it will 64GB.

    Above a certain point the added latency of filling/spilling a page to DISK that is 64KB in size rather than 4KB in size outweighs the gain in the TLB.

    Sure. But the 4K size was first used when disk transfer rates were
    about 3 MB/sec. Today, they are many times that, even if you are using
    a single physical hard disk. RAID and SSD can make that even larger. I
    don't know what the "optimal" page size is, but it is certainly larger
    than 4KB.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Tue Mar 5 17:32:16 2024
    Stephen Fuld wrote:

    On 3/5/2024 7:37 AM, MitchAlsup1 wrote:
    Robert Finch wrote:

    Bigfoot pages are 1MB in size. That allows an entire 64GB address
    range to be mapped with one page table. With larger memory systems a
    larger page size is needed IMO. 64GB is 65,536 pages still when the
    pages are 1MB in size. There is 32GB RAM in my machine today. Tomorrow
    it will 64GB.

Above a certain point the added latency of filling/spilling a page to DISK
that is 64KB in size rather than 4KB in size outweighs the gain in the TLB.

    Sure. But the 4K size was first used when disk transfer rates were
    about 3 MB/sec. Today, they are many times that, even if you are using
    a single physical hard disk. RAID and SSD can make that even larger. I don't know what the "optimal" page size is, but it is certainly larger
    than 4KB.

    While the data transfer rates are far higher today, the disk latency has
    not kept pace with CPU performance {SSD is different}.

    The CPU went from 60ns (16 MHz: 360/67) to 200ps (5GHz) 300× !!
    And from 5 cycles per instruction to 3 instruction per cycle 15×
    for a combined gain of 4,500×

    Disk latency may have gone down from 8ms to 3ms (sometimes 2ms for
    smaller disks) 3×-4×

    {{I consider 360/67 as the beginning of paging; although Multics may be
    the beginning.}}


    I, also, believe that 4KB pages are a bit small for a new architecture.
    I chose 8KB for My 66000 more because it took 1 level out of the page
    tables supporting 64-bit VAS, and also made the paging hierarchy more
    easy to remember:: 8K, 8M, 8G, 8T, 8P, 8E is easier to remember than
    4K, 2M, 1G, ½T, ¼P, ⅛E.

    I also believe in the tension between pages that are too small and those
    that are too large. 256B is widely seen as too small (VAX). I think most
people are comfortable in the 4KB range. I think 64KB is too big since
something like cat needs less code, less data, and less stack space than
a single 64KB page, plus another 64KB page to map cat from its VAS. So now,
instead of four* 4KB pages (16KB = {code, data, stack, map}) we need
four 64KB pages (256KB). It is these small applications that drive the
minimum page size down.

    But with desktops coming with 64GB+ of DRAM, perhaps it matters less now.

    (*) 5 if you separate .bss from .data
    6 if you separate .rodata from .bss and .data

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Mar 5 18:04:04 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Stephen Fuld wrote:

    On 3/5/2024 7:37 AM, MitchAlsup1 wrote:
    Robert Finch wrote:

    Bigfoot pages are 1MB in size. That allows an entire 64GB address
    range to be mapped with one page table. With larger memory systems a
    larger page size is needed IMO. 64GB is 65,536 pages still when the
pages are 1MB in size. There is 32GB RAM in my machine today. Tomorrow
it will 64GB.

Above a certain point the added latency of filling/spilling a page to DISK
that is 64KB in size rather than 4KB in size outweighs the gain in the TLB.

    Sure. But the 4K size was first used when disk transfer rates were
    about 3 MB/sec. Today, they are many times that, even if you are using
    a single physical hard disk. RAID and SSD can make that even larger. I
    don't know what the "optimal" page size is, but it is certainly larger
    than 4KB.

    While the data transfer rates are far higher today, the disk latency has
    not kept pace with CPU performance {SSD is different}.

    The CPU went from 60ns (16 MHz: 360/67) to 200ps (5GHz) 300× !!
    And from 5 cycles per instruction to 3 instruction per cycle 15×
    for a combined gain of 4,500×

    Disk latency may have gone down from 8ms to 3ms (sometimes 2ms for
    smaller disks) 3×-4×

    NVMe over a low latency fabric has 10us end-to-end
    latency.


    {{I consider 360/67 as the beginning of paging; although Multics may be
    the beginning.}}


    I, also, believe that 4KB pages are a bit small for a new architecture.
    I chose 8KB for My 66000 more because it took 1 level out of the page
    tables supporting 64-bit VAS, and also made the paging hierarchy more
    easy to remember:: 8K, 8M, 8G, 8T, 8P, 8E is easier to remember than
    4K, 2M, 1G, ½T, ¼P, ⅛E.

    I also believe in the tension between pages that are too small and those
    that are too large. 256B is widely seen as too small (VAX).

    The VAX page size was 512 bytes, which matched the sector size.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Stephen Fuld on Tue Mar 5 18:10:01 2024
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 3/5/2024 7:37 AM, MitchAlsup1 wrote:
    Robert Finch wrote:

    Bigfoot pages are 1MB in size. That allows an entire 64GB address
    range to be mapped with one page table. With larger memory systems a
    larger page size is needed IMO. 64GB is 65,536 pages still when the
    pages are 1MB in size. There is 32GB RAM in my machine today. Tomorrow
    it will be 64GB.

    Above a certain point the added latency of filling/spilling a page to DISK
    that is 64KB in size rather than 4KB in size outweighs the gain in the TLB.

    Sure. But the 4K size was first used when disk transfer rates were
    about 3 MB/sec. Today, they are many times that, even if you are using
    a single physical hard disk. RAID and SSD can make that even larger. I
    don't know what the "optimal" page size is, but it is certainly larger
    than 4KB.

    If paging out and HDDs were still relevant, a good page size would be
    about 3MB (where the transfer time is similar to the seek time). But
    they are not.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Tue Mar 5 18:14:48 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    256B is widely seen as too small (VAX).

    Didn't VAX use 512B?

    I think most
    people are comfortable in the 4KB range. I think 64KB is too big since
    something like cat needs less code, less data, and less stack space than
    1 single 64KB page and another 64KB page to map cat from its VAS. So now,
    instead of four* 4KB pages (16KB = {code, data, stack, map}) we now need
    four 64KB pages (256KB).

    Yes, so what? Who cares if cat takes 16KB or 256KB when we have
    Gigabytes of RAM?

    A while ago <2020Oct9.190337@mips.complang.tuwien.ac.at> I posted
    numbers here about the sizes of VMAs in the processes on several Linux
    systems and how much extra space would be needed from larger pages:

    |  VMAs  unique       used      total    8KB    16KB    32KB     64KB
    |  7552    2333     555964    1033320   6704   22344   56344   125144  desktop
    | 82836   25276    5346060   15707448  76072  223000  514472  1113672  laptop
    | 47017   15425  105490636   60186068  40804  134492  319852   708588  server

    The numbers in the 8KB, 16KB, 32KB, 64KB columns estimate how much
    extra RAM would be needed if the pages were that large. So, 1.1GB
    extra for 64KB pages on the laptop, but 8KB and 16KB pages would be
    relatively cheap.
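
    (I do not know exactly how those columns were computed, but a plausible
    reconstruction is per-VMA round-up waste relative to 4KB pages; a C
    sketch, with made-up VMA lengths:)

        #include <stdio.h>

        static unsigned long round_up(unsigned long n, unsigned long p)
        {
            return (n + p - 1) / p * p;
        }

        /* Extra KB a page size of page_kb costs over 4KB pages,
           summed over all (unique) VMA lengths, given in KB. */
        static unsigned long extra_kb(const unsigned long *vma_kb, int n,
                                      unsigned long page_kb)
        {
            unsigned long extra = 0;
            for (int i = 0; i < n; i++)
                extra += round_up(vma_kb[i], page_kb)
                       - round_up(vma_kb[i], 4);
            return extra;
        }

        int main(void)
        {
            unsigned long vma_kb[4] = { 4, 132, 8, 2048 };  /* invented */
            unsigned long sizes[4] = { 8, 16, 32, 64 };
            for (int s = 0; s < 4; s++)
                printf("%2luKB pages: +%lu KB\n",
                       sizes[s], extra_kb(vma_kb, 4, sizes[s]));
            return 0;
        }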

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Tue Mar 5 19:08:34 2024
    Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    256B is widely seen as too small (VAX).

    Didn't VAX use 512B?

    It has been a long time.

    I think most
    people are comfortable in the 4KB range. I think 64KB is too big since
    something like cat needs less code, less data, and less stack space than
    1 single 64KB page and another 64KB page to map cat from its VAS. So now,
    instead of four* 4KB pages (16KB = {code, data, stack, map}) we now need
    four 64KB pages (256KB).

    Yes, so what? Who cares if cat takes 16KB or 256KB when we have
    Gigabytes of RAM?

    A while ago <2020Oct9.190337@mips.complang.tuwien.ac.at> I posted
    numbers here about the sizes of VMAs in the processes on several Linux systems and how much extra space would be needed from larger pages:

    |  VMAs  unique       used      total    8KB    16KB    32KB     64KB
    |  7552    2333     555964    1033320   6704   22344   56344   125144  desktop
    | 82836   25276    5346060   15707448  76072  223000  514472  1113672  laptop
    | 47017   15425  105490636   60186068  40804  134492  319852   708588  server

    Are the 8K numbers to be compared to unique ? used ? or total ? to estimate waste ??

    The numbers in the 8KB, 16KB, 32KB, 64KB columns estimate how much
    extra RAM would be needed if the pages were that large. So, 1.1GB
    extra for 64KB pages on the laptop, but 8KB and 16KB pages would be relatively cheap.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Tue Mar 5 10:32:03 2024
    On 3/5/2024 9:32 AM, MitchAlsup1 wrote:
    Stephen Fuld wrote:

    On 3/5/2024 7:37 AM, MitchAlsup1 wrote:
    Robert Finch wrote:

    Bigfoot pages are 1MB in size. That allows an entire 64GB address
    range to be mapped with one page table. With larger memory systems a
    larger page size is needed IMO. 64GB is 65,536 pages still when the
    pages are 1MB in size. There is 32GB RAM in my machine today.
    Tomorrow it will be 64GB.

    Above a certain point the added latency of filling/spilling a page to DISK
    that is 64KB in size rather than 4KB in size outweighs the gain in the TLB.

    Sure.  But the 4K size was first used when disk transfer rates were
    about 3 MB/sec.  Today, they are many times that, even if you are
    using a single physical hard disk.  RAID and SSD can make that even
    larger.  I don't know what the "optimal" page size is, but it is
    certainly larger than 4KB.

    While the data transfer rates are far higher today, the disk latency has
    not kept pace with CPU performance {SSD is different}.

    Sure, but the latency is the same no matter what the transfer size is.
    So the fact that latency improvements haven't kept pace is irrelevant to
    the question of optimal page size.


    The CPU went from 60ns (16 MHz: 360/67) to 200ps (5GHz) 300× !!
    And from 5 cycles per instruction to 3 instructions per cycle 15×
    for a combined gain of 4,500×

    Disk latency may have gone down from 8ms to 3ms (sometimes 2ms for
    smaller disks) 3×-4×

    {{I consider 360/67 as the beginning of paging; although Multics may be
    the beginning.}}

    Agreed.


    I, also, believe that 4KB pages are a bit small for a new architecture.
    I chose 8KB for My 66000 more because it took 1 level out of the page
    tables supporting 64-bit VAS, and also made the paging hierarchy more
    easy to remember:: 8K, 8M, 8G, 8T, 8P, 8E is easier to remember than
    4K, 2M, 1G, ½T, ¼P, ⅛E.

    Fair enough. When Unisys implemented paging in the 2200 Series in the
    1990s, they chose 16K (approximately - exactly if you consider 36 equals
    32!).


    I also believe in the tension between pages that are too small and those
    that are too large.

    Naturally.


    256B is widely seen as too small (VAX). I think most
    people are comfortable in the 4KB range.

    While true, how much of that is just what they are used to, as
    opposed to some kind of optimal?


    I think 64KB is too big since
    something like cat needs less code, less data, and less stack space than
    1 single 64KB page and another 64KB page to map cat from its VAS. So now,
    instead of four* 4KB pages (16KB = {code, data, stack, map}) we now need
    four 64KB pages (256KB). It is these small applications that drive the
    minimum page size down.

    Agreed. And large applications tend to drive the optimum page size up.
    And I think that small applications like cat tend to be quick, thus never
    swapped, and only "waste" memory for a short amount of time. On the
    other hand, larger page sizes cause a larger memory "waste" for all applications.

    So the optimum is, at least to some degree, usage dependent. Of course,
    this is all an argument for multiple page sizes.


    But with desktops coming with 64GB+ of DRAM, perhaps it matters less now.

    I think larger main memories argue for larger page sizes, both because
    the "waste" costs less, and larger memories require more pages and thus, perhaps a larger TLB.

    As with most such things, there is a tradeoff, and the optimum
    probably changes as technology changes.



    (*) 5 if you separate .bss from .data
       6 if you separate .rodata from .bss and .data

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Tue Mar 5 19:16:57 2024
    Stephen Fuld wrote:

    On 3/5/2024 9:32 AM, MitchAlsup1 wrote:
    Stephen Fuld wrote:

    On 3/5/2024 7:37 AM, MitchAlsup1 wrote:
    Robert Finch wrote:

    Bigfoot pages are 1MB in size. That allows an entire 64GB address
    range to be mapped with one page table. With larger memory systems a
    larger page size is needed IMO. 64GB is 65,536 pages still when the
    pages are 1MB in size. There is 32GB RAM in my machine today.
    Tomorrow it will be 64GB.

    Above a certain point the added latency of filling/spilling a page to DISK
    that is 64KB in size rather than 4KB in size outweighs the gain in the TLB.

    Sure.  But the 4K size was first used when disk transfer rates were
    about 3 MB/sec.  Today, they are many times that, even if you are
    using a single physical hard disk.  RAID and SSD can make that even
    larger.  I don't know what the "optimal" page size is, but it is
    certainly larger than 4KB.

    While the data transfer rates are far higher today, the disk latency has
    not kept pace with CPU performance {SSD is different}.

    Sure, but the latency is the same no matter what the transfer size is.
    So the fact that latency improvements haven't kept pace is irrelevant to
    the question of optimal page size.


    The CPU went from 60ns (16 MHz: 360/67) to 200ps (5GHz) 300× !!
    And from 5 cycles per instruction to 3 instructions per cycle 15×
    for a combined gain of 4,500×

    Disk latency may have gone down from 8ms to 3ms (sometimes 2ms for
    smaller disks) 3×-4×

    {{I consider 360/67 as the beginning of paging; although Multics may be
    the beginning.}}

    Agreed.


    I, also, believe that 4KB pages are a bit small for a new architecture.
    I chose 8KB for My 66000 more because it took 1 level out of the page
    tables supporting 64-bit VAS, and also made the paging hierarchy more
    easy to remember:: 8K, 8M, 8G, 8T, 8P, 8E is easier to remember than
    4K, 2M, 1G, ½T, ¼P, ⅛E.

    Fair enough. When Unisys implemented paging in the 2200 Series in the
    1990s, they chose 16K (approximately - exactly if you consider 36 equals 32!).


    I also believe in the tension between pages that are too small and those
    that are too large.

    Naturally.


    256B is widely seen as too small (VAX). I think most
    people are comfortable in the 4KB range.

    While true, how much of that is just what they are used to, as opposed to
    some kind of optimal?

    When paging first came about, OS people told us that they really wanted
    ~1M pages (this was 1981-ish). 1M was enough to juggle the then workloads
    at least somewhat efficiently. This corresponded rather well with the then 32-bit address spaces.

    Workloads are now bigger (heck, tabs in Chrome tend to be ~1GB in size).

    I think 64KB is too big since
    something like cat needs less code, less data, and less stack space than
    1 single 64KB page and another 64KB page to map cat from its VAS. So now,
    instead of four* 4KB pages (16KB = {code, data, stack, map}) we now need
    four 64KB pages (256KB). It is these small applications that drive the
    minimum page size down.

    Agreed. And large applications tend to drive the optimum page size up.
    And I think that small applications like cat tend to be quick, thus never
    swapped, and only "waste" memory for a short amount of time. On the
    other hand, larger page sizes cause a larger memory "waste" for all applications.

    So the optimum is, at least to some degree, usage dependent. Of course,
    this is all an argument for multiple page sizes.

    Which My 66000 provides, but it also provides big pages that can have an
    extent [0..extent-1] so you can use an 8GB page to map anything from 8KB
    through 8GB in a single PTE. The extent-bits are exactly the bits not
    being used as PA-bits.
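
    (My reading of the extent scheme, as a C sketch; the struct layout and
    names are invented for illustration, not taken from any My 66000
    document:)

        #include <stdbool.h>
        #include <stdint.h>

        /* For an 8GB-aligned page, PA bits [32:13] are not needed to hold
           the base address, so they can hold an extent in 8KB units. */
        typedef struct {
            uint64_t pa_base;   /* 8GB-aligned physical base            */
            uint32_t extent;    /* valid length in 8KB units; 0 => full */
        } big_pte;

        static bool translate(const big_pte *pte, uint64_t offset,
                              uint64_t *pa)
        {
            uint64_t limit = pte->extent ? (uint64_t)pte->extent << 13
                                         : (uint64_t)1 << 33;  /* 8GB */
            if (offset >= limit)
                return false;          /* beyond [0..extent-1]: fault */
            *pa = pte->pa_base + offset;
            return true;
        }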

    But with desktops coming with 64GB+ of DRAM, perhaps it matters less now.

    I think larger main memories argue for larger page sizes, both because
    the "waste" costs less, and larger memories require more pages and thus, perhaps a larger TLB.

    As with most such things, there is a tradeoff, and the optimum
    probably changes as technology changes.

    Given you want a single page size spanning cell-phones to multi-rack
    servers, finding something "optimal" is difficult at best.

    (*) 5 if you separate .bss from .data
       6 if you separate .rodata from .bss and .data

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Tue Mar 5 19:18:41 2024
    Stephen Fuld wrote:

    Fair enough. When Unisys implemented paging in the 2200 Series in the
    1990s, they chose 16K (approximately - exactly if you consider 36 equals 32!).

    36 base 9 is 33, which is close enough to 32 to be considered equal.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Robert Finch on Tue Mar 5 21:18:48 2024
    Robert Finch <robfi680@gmail.com> writes:
    On 2024-03-05 1:13 p.m., BGB wrote:


    Another possible trick was mapping these to ROM zero page when reloaded,
    and only "reviving" them as actual swap-space pages, when something was
    written to them. Since the page-table was also partly used to track
    pages in the pagefile, there needed to be special handling in the TLB
    miss handler to signal "yeah, this page is really a zeroes-only page".

    Say, page states:
      Invalid / unassigned;
      Valid / assigned;
      Invalid / mapped to pagefile;
        Page is swapped out.
      Valid / mapped to pagefile;
        Page is zeroed.

    Though, potentially, any hardware page-walker would need to be aware of
    the zero-page trick (vs, say, trying to map the page to an invalid
    physical address).
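
    (The state machine above, as a C enum a software TLB-miss or fault
    handler might switch on; all names are illustrative:)

        /* Page states as described above. */
        enum page_state {
            PG_INVALID,   /* invalid / unassigned                       */
            PG_VALID,     /* valid / assigned to a physical frame       */
            PG_SWAPPED,   /* invalid / mapped to pagefile: swapped out  */
            PG_ZERO       /* valid / mapped to pagefile: all-zero page,
                             no frame yet, revived on first write       */
        };

        /* On a write fault to a zero page: give it a real frame and
           promote it, instead of reading anything back from disk. */
        static void on_write_fault(enum page_state *st)
        {
            if (*st == PG_ZERO) {
                /* frame = alloc_zeroed_frame(); install PTE; (not shown) */
                *st = PG_VALID;
            }
        }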

    ...

    What is LLC? (Local Lan controller?)


    Last Level Cache (e.g. L3).


    Intel, AMD and ARM already support large page sizes (1GB for Intel/AMD
    and 512GB for ARM64).

    Do they support using the entire page as a page table?

    All multilevel page tables use an entire page (granule in ARMv8)
    at each level of the page table. To map larger page/block sizes,
    they just stop the table walk at one, two or three levels rather
    than walking all four levels to the leaf page size.
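
    (A minimal walker sketch showing the early-out: a generic 4-level,
    4KB-granule layout with 9 index bits per level, not any particular
    architecture's bit encoding; read_phys() is assumed to be supplied by
    the platform:)

        #include <stdbool.h>
        #include <stdint.h>

        extern uint64_t read_phys(uint64_t pa);   /* platform-provided */

        #define PTE_VALID 1u    /* entry maps something               */
        #define PTE_TABLE 2u    /* entry points at a next-level table */

        /* If an entry is VALID but not TABLE, the walk stops there and
           the entry maps a 4KB, 2MB, 1GB or 512GB block by level. */
        static bool walk(uint64_t root, uint64_t va, uint64_t *pa)
        {
            uint64_t table = root;
            for (int level = 3; level >= 0; level--) {
                int shift = 12 + 9 * level;
                uint64_t pte = read_phys(table + ((va >> shift) & 0x1ff) * 8);
                if (!(pte & PTE_VALID))
                    return false;                     /* translation fault */
                if (!(pte & PTE_TABLE) || level == 0) {   /* leaf: stop   */
                    uint64_t mask = ((uint64_t)1 << shift) - 1;
                    *pa = (pte & ~0xfffull & ~mask) | (va & mask);
                    return true;
                }
                table = pte & ~0xfffull;              /* descend a level  */
            }
            return false;
        }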


    Compression could be useful for something serialized to disk or through
    the network. Transferring the page number and compressed contents.

    Compressing tiered DRAM is looking to be the next big thing.

    https://www.zeropoint-tech.com/news/meta-and-google-unveil-a-revolutionary-hardware-compressed-cxl-memory-tier-specification

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Tue Mar 5 22:40:54 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Anton Ertl wrote:
    A while ago <2020Oct9.190337@mips.complang.tuwien.ac.at> I posted
    numbers here about the sizes of VMAs in the processes on several Linux
    systems and how much extra space would be needed from larger pages:

    |  VMAs  unique       used      total    8KB    16KB    32KB     64KB
    |  7552    2333     555964    1033320   6704   22344   56344   125144  desktop
    | 82836   25276    5346060   15707448  76072  223000  514472  1113672  laptop
    | 47017   15425  105490636   60186068  40804  134492  319852   708588  server

    Are the 8K numbers to be compared to unique ? used ? or total ? to
    estimate waste ??

    "unique" is the number of unique VMAs (so compare "unique" to "VMAs").

    "used" is the number of KB used reported by free. "total" is the
    number of KB in the unique VMAs. "total" can be larger than "used"
    because of copy-on-write (in particular, pages that are allocated and
    have not been written to yet). I don't know why the server workload
    gets a "used" number that's larger than the "total" number.

    The waste numbers (8KB-64KB) should be compared to total memory (16GB
    on the desktop, 8GB for the laptop, 128GB for the server).

    You can also compare it to the "total" numbers; both the waste and the
    "total" numbers are based on the same VMA data.

    You could also compare it to "used", but given that "used" reflects
    actual usage rather than just VMA size, a better approach would be to
    know which pages of each VMA are used, and base the estimate on that. Unfortunately, I don't know how to easily extract such numbers from
    the Linux kernel.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Mar 6 00:48:08 2024
    Scott Lurndal wrote:

    Robert Finch <robfi680@gmail.com> writes:


    Do they support using the entire page as a page table?

    All multilevel page tables use an entire page (granule in ARMv8)
    at each level of the page table. To map larger page/block sizes,
    they just stop the table walk at one, two or three levels rather
    than walking all four levels to the leaf page size.

    My 66000 page structure supports both skipping of levels in the table
    and stopping at any level in the table.


    Compression could be useful for something serialized to disk or through
    the network. Transferring the page number and compressed contents.

    Compressing tiered DRAM is looking to be the next big thing.

    https://www.zeropoint-tech.com/news/meta-and-google-unveil-a-revolutionary-hardware-compressed-cxl-memory-tier-specification

    CXL is just* an offload of DRAM from an on-die memory controller to
    PCIe 5.0-6.0 high speed links. What this means in practice is
    that one can put as many DRAM controllers on as many PCIe links as
    the chip provides. In PCIe 6.0, a link (4-wires; 2 in 2 out) trans-
    mits up to 64 GT/s compared to DDR6 at 22 Gb/s (PCIe is 3× faster),
    but you can trade width of the interface for number of independent
    interfaces. And with external companies making CXL DRAM controllers
    the chip/system designer can dispense with the DRC and spend more
    pins for PCIe BW.

    So, in the past when the on-die DRC only allowed for 2,3,4,6,8 DIMMs;
    with CXL DRC you can have as many DRAM DIMMs as makes sense for your
    target market--more DIMMs for costly servers, fewer pins for lower
    cost devices.

    Compression is icing on the cake.

    (*) it also allows for a lot of other new PCIe functionality.

    An interesting topic between PCIe 5.0 and 6.0 is the change from
    NRZ encoding to PAM4. This comes with a degradation of error rate
    from 10^-12 down to 10^-6, so you are going to need some kind
    of ECC on the PCIe links and layers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Wed Mar 6 03:45:12 2024
    BGB wrote:

    On 3/5/2024 3:07 PM, Robert Finch wrote:
    On 2024-03-05 1:13 p.m., BGB wrote:

    What is LLC? (Local Lan controller?) I used the text mode video display
    RAM in the past.

    Intel, AMD and ARM already support large page sizes (1GB for Intel/AMD
    and 512GB for ARM64).

    Do they support using the entire page as a page table?


    I think it is a case of skipping the last N levels.

    Should be able to skip any level. Root could point at a page containing
    only 8K translations, or it could point to the top of a tree supporting
    a 63-bit VAS. Each PTP (of which Root is one) can point at any further
    layer, skipping any number of intermediate non-needed levels.

    That is how one makes a 1 page "map" for tiny things like cat:: Root
    points at a page containing 8KB translations, of which only 4-6 have
    the valid bit set.

    Where, say, with 4K pages and 64-bit PTEs:
    4K: No Skip
    2MB: 1-level skip
    1GB: 2-level skip.

    One can have a root pointing at a 63-bit VAS, one PTP points at 13-bit
    VAS translations, another pointing at a 43-bit VAS which contains pointers
    to 23-bit VAS sub-spaces. So, you can skip any number of levels at each level.

    Seemingly, Windows had used 64K logical pages, but these were likely
    faked in software.

    MIPS did not allow aliasing at smaller granularities than 64KB due to
    their R2000/R3000 SRAM cache structure.

    Not entirely sure the reason for them doing so.


    Compression could be useful for something serialized to disk or through
    the network. Transferring the page number and compressed contents.

    Funny story: I was hired to look into a person marketing an FFT-based
    compression algorithm for a company running semiconductor testers. I
    looked at his algorithms and looked at the vector data sets. It turned
    out that if one does NRZ encoding vertically over the records, one gets
    98%-99% compression without any "interesting algorithms". It worked so
    well that they started to use it to decrease disk load time of the
    vector set--converting a 1 minute problem into something under 1 second.

    The records (cycles) were as wide as the number of signal pins on the part-under-test (several hundred bits wide) with a 3-state code {high,
    low, high-Z}. So the reset pin was asserted for 10-odd cycles, and then
    changed to 0 and stayed there for 1M-odd cycles. Pretty easy to compress
    stuff like this.

    My bet is that many data structures would encode rather well using this technique--for example pointers all having the same 24-HoBs being 0 (user)
    or 1 (super).
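
    (My reconstruction of the trick, not the original code: store each pin
    column as change-points instead of one sample per cycle. A toy C
    version:)

        #include <stdio.h>

        struct change { unsigned cycle; char value; };

        /* A pin that toggles twice in a million cycles costs two change
           records instead of a million samples. */
        static unsigned compress_pin(const char (*vec)[5], unsigned cycles,
                                     unsigned pin, struct change *out)
        {
            unsigned n = 0;
            char prev = 0;
            for (unsigned c = 0; c < cycles; c++) {
                if (vec[c][pin] != prev) {
                    out[n].cycle = c;
                    out[n].value = prev = vec[c][pin];
                    n++;
                }
            }
            return n;     /* usually << cycles */
        }

        int main(void)
        {
            /* 6 cycles x 4 pins; pin 0 plays the "reset" pin's role */
            const char vec[6][5] =
                { "1000", "1000", "1000", "0010", "0010", "0011" };
            struct change ch[6];
            unsigned n = compress_pin(vec, 6, 0, ch);
            for (unsigned i = 0; i < n; i++)
                printf("cycle %u -> %c\n", ch[i].cycle, ch[i].value);
            return 0;     /* two records for six cycles */
        }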

    ..

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Mar 6 14:25:15 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:

    Robert Finch <robfi680@gmail.com> writes:


    Do they support using the entire page as a page table?

    All multilevel page tables use an entire page (granule in ARMv8)
    at each level of the page table. To map larger page/block sizes,
    they just stop the table walk at one, two or three levels rather
    than walking all four levels to the leaf page size.

    My 66000 page structure supports both skipping of levels in the table
    and stopping at any level in the table.

    That's all well and good. What operating system do you
    have running on the My 66000 processor? When will I be able
    to purchase a system based on that processor?



    Compression could be useful for something serialized to disk or through
    the network. Transferring the page number and compressed contents.

    Compressing tiered DRAM is looking to be the next big thing.

    https://www.zeropoint-tech.com/news/meta-and-google-unveil-a-revolutionary-hardware-compressed-cxl-memory-tier-specification

    CXL is just* an offload of DRAM from an on-die memory controller to
    an on-PCIe 5.0-6.0 high speed links.

    Yes, I'm well aware of that. What you don't mention is that it
    can become part of the processor cache coherency domain.


    An interesting topic between PCIE 5.0 and 6.0 is the change from
    NRZ encoding to PAM 4. This comes with a degradation of error rate
    from 10^-12 goes down to 10^-6, so you are going to need some kind
    of ECC on the PCIe links and layers.

    PAM4 is something with which we have a great deal of expertise,
    along with the associated error correction.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Thu Mar 7 01:28:46 2024
    BGB wrote:

    On 3/6/2024 8:42 AM, Robert Finch wrote:



    In my case, access is figured out on cache-line fetch, and is precooked:
    NR, NW, NX, NU, NC: NoRead/NoWrite/NoExecute/NoUser/NoCache.
    Though, some combinations of these flags are special.

    Is there a reason these flags (other than user) are inverted ??
    {{And even noUser can be changed into Super.}}

    In addition, I think you will want to be able to specify which level of
    cache {L1, L2, LLC} this line is stored at, prefetched to, and pushed out
    to.

    My 66000 is using ASID instead of something like Super/Global because I
    don't want to have to flush the TLB on a hypervisor context switch --
    where one GuestOS Super/Global is not the same as another GuestOS's. When
    a GuestOS is accessing one of its user applications, AGEN automagically
    uses the application ASID instead of the GuestOS ASID. {{Similar for HV
    accessing GuestOS -- while switching from 1-level translation to 2-level.}}

    <snip>

    The L1 cache only hits if the current mode matches the mode that was in effect at the time the cache-line was fetched, and if KRR has not
    changed (as determined by a hash value), ...

    s/mode/ASID/

    For my system the ACL is not part of the PTE, it is part of the software
    managed page information, along with share counts. I do not see the ACL
    for a page being different depending on the page table.


    In my case, ACL handling is done via a combination of keyring register
    (KRR), and a small fully-associative cache (4 entry at present, 6 could
    be better in theory; luckily each entry is comparably small).

    The ACLID is tied to the TLBE, so the intersection of the ACLID and KRR
    entry are used to figure out access in the ACL cache (or,
    ignored/disabled if the low 16 bits of KRR are 0).


    I have dedicated some of the block RAMs for the page management
    information, so they may be read out in parallel with a memory access.
    So I shifted the block RAM usage from the TLB to the PMT. This makes the
    TLB smaller. It also reduces the memory usage. The page management
    information only needs one copy for each page of memory. If the
    information were in the TLBE / PTEs there would be multiple copies of
    the information in the page tables. How do you keep things coherent if
    there are multiple copies in page tables?



    The access ID for pages is kept in sync with the memory address, since
    both are uploaded to the TLB at the same time.

    However, as for ACL checks themselves, these are handled with a separate cache. So, say, changing the access to an ACLID, and flushing the corresponding entry from the ACL cache, will automatically apply to any
    pages previously loaded into the TLB.

    There was also the older VUGID system, which used traditional Unix-style
    permissions. If I were designing it now, I would likely design things
    around using exclusively ACL checking, which (ironically) also needs
    less bits to encode.



    Generally, software TLB miss handling is used in my case.

    There is no automatic way to keep the TLB in sync with the page table
    (if the page table entry is modified).

    My 66000 has a coherent TLB.

    Usual thing is that if the current page table is updated, then one needs
    to forge a special dummy entry, and then upload this entry to the TLB multiple times (via the LDTLB instruction) to knock the prior contents
    out of the TLB (or use the INVTLB instruction, but this currently
    invalidates the entire TLB; which is a bad situation for
    software-managed TLB...).

    See how much easier a coherent TLB is ??

    Generally, the assumption is that all pages in a mapping will have the
    same ACLID (generally corresponding to the "owner" of the mapping).

    An unsupported assumption if one wants to keep TLB flushes minimized.

    If using multiple page tables for context switching, it will be
    necessary to use ASIDs.

    See how much easier it is for HW to perform context switches en masse.

    It is possible to share global pages across "ASID groups", but currently there are not "truly global" pages (and, implicitly, some groups may
    disallow global pages).

    Where, say, the ASID is a 16-bit field:
    (15:10): ASID Group
    ( 9: 0): ASID ID

    At present, for most normal use, the idea is that the ASID and ACL/KRR
    ID's will be aliased to a process's PID.

    Not aliased to but accessed from !

    Say, with Groups 00..1F (in both ASID and ACLID space) being used for
    the PID aliased range (20..37 for special use, and 38..3F for selective one-off entries).

    Although completely under SW control, I am assuming that ASID = 0 is the
    hypervisor, that ASID = {1..255} is Guest HV, and {256..65535} is for
    GuestOS use.

    Currently, threads also eat PID's, but this is likely to change, say:
    TPID (Task ID):
    (31:16): PID
    (15: 0): Thread ID (local to a given PID)

    PIDs are GuestOS defined and used.
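
    (For concreteness, the field carving described above in C; the macros
    and the partition checks are purely illustrative:)

        #include <stdint.h>

        /* 16-bit ASID: (15:10) group, (9:0) id */
        #define ASID_GROUP(a)    (((a) >> 10) & 0x3f)
        #define ASID_ID(a)       ((a) & 0x3ff)
        #define MAKE_ASID(g, i)  ((uint16_t)((((g) & 0x3f) << 10) | \
                                             ((i) & 0x3ff)))

        /* proposed 32-bit TPID: (31:16) PID, (15:0) thread id */
        #define TPID_PID(t)      (((t) >> 16) & 0xffff)
        #define TPID_TID(t)      ((t) & 0xffff)

        /* the SW-defined partition of the ASID space */
        static inline int is_hv(uint16_t a)       { return a == 0; }
        static inline int is_guest_hv(uint16_t a) { return a >= 1 && a <= 255; }
        static inline int is_guest_os(uint16_t a) { return a >= 256; }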

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Robert Finch on Thu Mar 7 19:51:36 2024
    Robert Finch wrote:

    On 2024-03-07 1:39 a.m., BGB wrote:


    Bigfoot uses a whole byte for access rights, with separate
    read-write-execute for user and supervisor, and write protect for
    hypervisor and machine modes. He also uses 4 bits for the cache-ability
    which match the cache-ability bits on the bus.

    Can you think of an example where a user Read-Only page would not be
    writeable from super ?? Or a device on PCIe ??
    Can you think of an example where a user Execute-Only page would not
    be readable from super ?? Or a device on PCIe ??
    Can you think of an example where a user page marked RWE = 000 would
    not be readable and writeable from super ? Or a device on PCIe ??

    Can you think of an example where 2-bits denoting {L1, L2, LLC, DRAM}
    does not describe the cache placement of a line adequately ??


    I am using the value of zero for the ASID to represent the machine
    mode’s ASID. A lot of hardware is initialized to zero at reset, so it’s automatically the machine mode’s. Other than zero the ASID could be anything assigned by the OS.

    I do not rely on control registers being set to zero; instead, part of
    HW context switching ends up reading these out of ROM and into those
    registers--so they can have any reasonable bit pattern SW desires.
    {{This is sort of like how Alpha comes out of reset and streams a ROM
    through the scan path to initialize the internals.}}

    I am also assuming that ASID = 0 is the highest level of privilege;
    but this is purely a SW choice.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Thu Mar 7 11:58:53 2024
    On 3/5/2024 9:32 AM, MitchAlsup1 wrote:

    snip

    I also believe in the tension between pages that are too small and those
    that are too large. 256B is widely seen as too small (VAX). I think most people are comfortable in the 4KB range. I think 64KB is too big since something like cat needs less code, less data, and less stack space than
    1 single 64KB page and another 64KB page to map cat from its VAS. So, now; instead of four* 4KB pages (16KB ={code, data, stack, map} ) we now need
    four 64KB pages 256KB. It is these small applications that drive the
    minimum page size down.

    In thinking about this, an idea occurred to me that may ease this
    tension some. For a large page, you introduce a new protection mode
    such that, for example, the lower half of the addresses in the page are
    execute only, and the upper half are read/write enabled. This would
    allow the code and the data, and perhaps even the stack for such a
    program to share a single page, while still maintaining the required
    access protection. I think the hardware to implement this is pretty
    small. While the benefits of this would be modest, if such "small
    programs" occur often enough it may be worth the modest cost of the
    additional hardware.
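
    (The check such a mode implies is roughly one compare on the page
    offset; a C sketch with invented names:)

        #include <stdbool.h>
        #include <stdint.h>

        enum access { ACC_READ, ACC_WRITE, ACC_EXEC };

        /* Hypothetical "split" page mode: low half execute-only,
           high half read/write. */
        static bool split_page_allows(uint64_t offset, uint64_t page_size,
                                      enum access acc)
        {
            if (offset >= page_size / 2)              /* upper half */
                return acc == ACC_READ || acc == ACC_WRITE;
            return acc == ACC_EXEC;                   /* lower half */
        }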


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Stephen Fuld on Thu Mar 7 20:32:44 2024
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 3/5/2024 9:32 AM, MitchAlsup1 wrote:

    snip

    I also believe in the tension between pages that are too small and those
    that are too large. 256B is widely seen as too small (VAX). I think most
    people are comfortable in the 4KB range. I think 64KB is too big since
    something like cat needs less code, less data, and less stack space than
    1 single 64KB page and another 64KB page to map cat from its VAS. So, now; >> instead of four* 4KB pages (16KB ={code, data, stack, map} ) we now need
    four 64KB pages 256KB. It is these small applications that drive the
    minimum page size down.

    In thinking about this, an idea occurred to me that may ease this
    tension some. For a large page, you introduce a new protection mode
    such that, for example, the lower half of the addresses in the page are
    execute only, and the upper half are read/write enabled. This would
    allow the code and the data, and perhaps even the stack for such a
    program to share a single page, while still maintaining the required
    access protection. I think the hardware to implement this is pretty
    small. While the benefits of this would be modest, if such "small
    programs" occur often enough it may be worth the modest cost of the >additional hardware.

    The biggest problem with variable page sizes isn't the hardware.

    The problem is how to effectively use multiple page sizes without
    serious checkerboarding and subsequent allocation issues. Solved
    in Linux by preallocation at boot time of a range to be used
    only for large pages - which if they aren't used, are not
    available to be used as regular pages. Linux also has THP[*], which will move stuff around to make sufficiently sized (and aligned, which is
    the harder problem) regions that can be changed to a larger
    mapping.

    [*] Transparent Huge Pages.
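
    (From the application side, asking THP for a huge mapping is just an
    madvise() hint; a Linux-specific example, error handling elided:)

        #include <stddef.h>
        #include <sys/mman.h>

        int main(void)
        {
            size_t len = 64ul << 20;   /* 64MB anonymous region */
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                return 1;
            /* Hint only: khugepaged still has to find (and align)
               2MB frames to promote the mapping. */
            madvise(p, len, MADV_HUGEPAGE);
            return 0;
        }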

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Thu Mar 7 20:29:29 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Robert Finch wrote:

    On 2024-03-07 1:39 a.m., BGB wrote:


    Bigfoot uses a whole byte for access rights, with separate
    read-write-execute for user and supervisor, and write protect for
    hypervisor and machine modes. He also uses 4 bits for the cache-ability
    which match the cache-ability bits on the bus.

    Can you think of an example where a user Read-Only page would not be
    writeable from super?

    Yep. The entire user address space should not be accessible
    to privileged code (except when using specialized load and
    store instructions that validate the access using the user
    privileges). Aside from initially loading the code/data
    at 'exec' time. There are exceptions, such as the VDSO
    pages in Linux which are explicitly shared between user-mode
    and privileged software.

    This is a mitigation for kernel compromises to prevent access
    to secrets and/or code injection.

    ?? Or a device on PCIe ??

    That's entirely up to the IOMMU, which generally uses different
    translation tables than the processor(s).

    Can you think of an example where a user Execute-Only page would not
    be readable from super ?? Or a device on PCIe ??

    See above. Minimize the security footprint.


    Can you think of an example where a user page marked RWE = 000 would
    not be readable and writeable from super ? Or a device on PCIe ??

    Can you think of an example where 2-bits denoting {L1, L2, LLC, DRAM}
    does not describe the cache placement of a line adequately ??

    A remote cache in a non-coherent CXL compute expander?


    I am also assuming that ASID = 0 is the highest level of privilege;
    but this is purely a SW choice.

    I'm troubled about the idea of the ASID having anything to do
    with security. There are benefits to having the ASID being
    qualified by a guest (VM) identifier such that the guest
    operating system can use the entire range of ASIDs as if
    it were running on real hardware.

    Reminds me of the 1960s, when the base register == 0 would
    enable access to the privileged instructions. The next
    generation of that system switched to using a control register.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Robert Finch on Thu Mar 7 21:47:32 2024
    Robert Finch wrote:

    On 2024-03-07 2:51 p.m., MitchAlsup1 wrote:
    Robert Finch wrote:

    On 2024-03-07 1:39 a.m., BGB wrote:


    Bigfoot uses a whole byte for access rights, with separate
    read-write-execute for user and supervisor, and write protect for
    hypervisor and machine modes. He also uses 4 bits for the
    cache-ability which match the cache-ability bits on the bus.

    Can you think of an example where a user Read-Only page would not be
    writeable from super ?? Or a device on PCIe ??
    Can you think of an example where a user Execute-Only page would not
    be readable from super ?? Or a device on PCIe ??

    I cannot think of examples. But I had thought the hypervisor / machine
    might want to treat supervisor mode like an alternate user mode. The
    bits can always just be set = 7.

    Or, you can avoid blowing the 3-extra-bits and just assume a higher
    privilege level can access pages the lesser privileged cannot.

    Can you think of an example where a user page marked RWE = 000 would
    not be readable and writeable from super ? Or a device on PCIe ??

    A page marked RWE=000 is an unusable page. Perhaps to signal bad memory.
    Or perhaps as a hidden data page full of comments or remarks. If it's not
    readable, writeable, or executable, what is it? Nothing should be able to
    access it, except maybe the machine/debug operating mode.

    I chose this because I have a mandate on the Safe-Stack area that LDs
    and STs (and prefetches and post pushes) cannot access the data--only
    ENTER and EXIT can access the data so the contract (ABI) between caller
    and callee cannot be violated {{avoiding many attack strategies}}.

    Can you think of an example where 2-bits denoting {L1, L2, LLC, DRAM}
    does not describe the cache placement of a line adequately ??

    The cache-ability bits were not directly describing cache placement.
    They were like the cache-ability bits in the bus. They specified
    cache-policy: bufferable / non-bufferable, write-through, write-back,
    allocate, etc. But now that I have reviewed it, I see I had forgotten
    that I removed these
    bits from the PTE / TLBE.

    I do not like situations where all possible codes are used.

    I would like to apply this argument to integer encodings--integers need
    an encoding to represent "there is no value here" {similar to NaN}.

    So, I would probably use three bits. Could a cache line be located in a Register for instance?

    My cache lines are 64-bytes (512-bits) in size,
    My registers are 64-bits in size.
    Flip-flops can be any number of bits in size.

    4 cache lines ARE my register FILE.
    1 cache line IS my Thread Header
    5 cache lines ARE my Thread State (all of it)

    4 Thread States are accessible at any instant of time making context
    switching easier. SVC goes up the chain to higher privilege, SVR does
    the reverse.

    I cannot envision every usage; although a lot is known today, I thought
    it would be better to err on the side of providing too many bits rather
    than not enough.

    This works only so long as you do not run out of bits. Once you start scrambling to find an encoding of 2 (or 3) fields that would never be
    in use, and use the combination of 2 fields to mean something "special",
    you know you are in trouble.

    Not enough is hard to add later. There are loads of
    bits available in the 128-bit PTE; 96 bits would be enough. But it is
    not a nice power of two.

    Hmmmmmmmmmm,

    I got pairs of 63-bit virtual address spaces into a 64-bit container.
    And since we are only around 48-bits* of address space consumption,
    it will outlast my lifetime.

    (*) in the largest servers, while typical big desktops are down in the
    35-37-bit range.


    I am using the value of zero for the ASID to represent the machine
    mode’s ASID. A lot of hardware is initialized to zero at reset, so
    it’s automatically the machine mode’s. Other than zero the ASID could >>> be anything assigned by the OS.

    I do not rely on control registers being set to zero; instead, part of
    HW context switching ends up reading these out of ROM and into those
    registers--so they can have any reasonable bit pattern SW desires.
    {{This is sort of like how Alpha comes out of reset and streams a ROM
    through the scan path to initialize the internals.}}

    I am also assuming that ASID = 0 is the highest level of privilege;
    but this is purely a SW choice.

    Assuming a zero at reset was more of a default ‘if I forget’ approach. I have machine state being initialized from ROM too.

    In effect, I have 1 "register" that is set at reset {along with clearing
    the control bits of the pipeline.} Everything else is loaded from ROM.

    When control arrives after reset, you have a Thread Header which provides
    {IP, Root, Call Stack Pointer, raised and Enabled exceptions, ASID, Why,
    and a few more}. The privilege level determines which register file gets loaded. So by the time you are fetching instructions, you have a register
    file (filled with data), a data stack pointer, a call stack pointer, and
    30-odd registers containing whatever BOOT programmers thought was appropriate.

    You also have the MMU set up with the TLB enabled (and active). The MMU
    maps L1 and L2 in the allocated state, so you have at least 256KB of memory
    to work in BEFORE DRAM is identified, configured, initialized, and tuned to
    the electrical environment they are plugged into. This, in turn, means all
    (100%) of BOOT can be written in a HLL (C) without access to an assembler.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Scott Lurndal on Thu Mar 7 19:55:11 2024
    On 3/7/2024 12:32 PM, Scott Lurndal wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 3/5/2024 9:32 AM, MitchAlsup1 wrote:

    snip

    I also believe in the tension between pages that are too small and those >>> that are too large. 256B is widely seen as too small (VAX). I think most >>> people are comfortable in the 4KB range. I think 64KB is too big since
    something like cat needs less code, less data, and less stack space than >>> 1 single 64KB page and another 64KB page to map cat from its VAS. So, now; >>> instead of four* 4KB pages (16KB ={code, data, stack, map} ) we now need >>> four 64KB pages 256KB. It is these small applications that drive the
    minimum page size down.

    In thinking about this, an idea occurred to me that may ease this
    tension some. For a large page, you introduce a new protection mode
    such that, for example, the lower half of the addresses in the page are
    execute only, and the upper half are read/write enabled. This would
    allow the code and the data, and perhaps even the stack for such a
    program to share a single page, while still maintaining the required
    access protection. I think the hardware to implement this is pretty
    small. While the benefits of this would be modest, if such "small
    programs" occur often enough it may be worth the modest cost of the
    additional hardware.

    The biggest problem with variable page sizes isn't the hardware.

    What I proposed is not variable page sizes. All pages are the same
    size. This idea is to add a new protection option within the same page.
    The new option will allow "mixing" the code and data for a small
    program within the same page without sacrificing the protection that
    normally requires multiple pages.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David W Schroth@21:1/5 to sfuld@alumni.cmu.edu.invalid on Fri Mar 8 08:24:53 2024
    On Tue, 5 Mar 2024 10:32:03 -0800, Stephen Fuld
    <sfuld@alumni.cmu.edu.invalid> wrote:

    On 3/5/2024 9:32 AM, MitchAlsup1 wrote:
    Stephen Fuld wrote:

    On 3/5/2024 7:37 AM, MitchAlsup1 wrote:
    Robert Finch wrote:

    Bigfoot pages are 1MB in size. That allows an entire 64GB address
    range to be mapped with one page table. With larger memory systems a
    larger page size is needed IMO. 64GB is 65,536 pages still when the
    pages are 1MB in size. There is 32GB RAM in my machine today.
    Tomorrow it will be 64GB.

    Above a certain point the added latency of filling/spilling a page to DISK
    that is 64KB in size rather than 4KB in size outweighs the gain in the TLB.

    Sure. But the 4K size was first used when disk transfer rates were
    about 3 MB/sec. Today, they are many times that, even if you are
    using a single physical hard disk. RAID and SSD can make that even
    larger. I don't know what the "optimal" page size is, but it is
    certainly larger than 4KB.

    While the data transfer rates are far higher today, the disk latency has
    not kept pace with CPU performance {SSD is different}.

    Sure, but the latency is the same no matter what the transfer size is.
    So the fact that latency improvements haven't kept pace is irrelevant to
    the question of optimal page size.


    The CPU went from 60ns (16 MHz: 360/67) to 200ps (5GHz) 300× !!
    And from 5 cycles per instruction to 3 instructions per cycle 15×
    for a combined gain of 4,500×

    Disk latency may have gone down from 8ms to 3ms (sometimes 2ms for
    smaller disks) 3×-4×

    {{I consider 360/67 as the beginning of paging; although Multics may be
    the beginning.}}

    Agreed.


    I, also, believe that 4KB pages are a bit small for a new architecture.
    I chose 8KB for My 66000 more because it took 1 level out of the page
    tables supporting 64-bit VAS, and also made the paging hierarchy more
    easy to remember:: 8K, 8M, 8G, 8T, 8P, 8E is easier to remember than
    4K, 2M, 1G, ½T, ¼P, ⅛E.

    Fair enough. When Unisys implemented paging in the 2200 Series in the
    1990s, they chose 16K (approximately - exactly if you consider 36 equals
    32!).


    Actually, we use 4 KiW (contained in 32 KiB) as the page size. I
    don't remember spill/fill time being an issue.

    <snip>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Fri Mar 8 10:08:31 2024
    MitchAlsup1 wrote:
    Stephen Fuld wrote:

    So the optimum is, at least to some degree, usage dependent. Of
    course, this is all an argument for multiple page sizes.

    Which My 66000 provides, but it also provides big pages that can have an
    extent [0..extent-1] so you can use an 8GB page to map anything from 8KB
    through 8GB in a single PTE. The extent-bits are exactly the bits not
    being used as PA-bits.

    By the way, I borrowed your extent idea (thank you :-) ) as when combined
    with skipping from the root pointer to a lower level, it can be used to
    map the whole OS code and data with just a couple of PTE's.
    This eliminates table walks for most OS virtual addresses and
    could allow a few 'pinned' TLB entries to map the whole OS.

    This achieves a similar result to my Block Address Translate (BAT)
    approach but without requiring a whole separate MMU mechanism.

    The idea is for the OS to be separated into static and dynamic sets
    of code and RW data. The static parts are always resident in the OS
    plus any mandatory drivers. The linker packages all the static code
    and data together into two huge RE and RW memory sections at specific
    high end virtual addresses, aligned to a huge page boundary.

    The extent feature allows the OS static code and data to be loaded
    into just the portion of a huge page that each needs, with any unused
    remainder in the huge pages being returned to the general pool as
    smaller pages (so no wasted space in the 1GB or 8GB pages).

    And voila - two PTE's map the whole static OS code and data
    which can be permanently held in two MMU mapping registers.
    With one more for the graphics memory and the bulk of
    table walks for system space can be eliminated.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to David W Schroth on Fri Mar 8 06:58:11 2024
    On 3/8/2024 6:24 AM, David W Schroth wrote:
    On Tue, 5 Mar 2024 10:32:03 -0800, Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:

    On 3/5/2024 9:32 AM, MitchAlsup1 wrote:
    Stephen Fuld wrote:

    On 3/5/2024 7:37 AM, MitchAlsup1 wrote:
    Robert Finch wrote:

    Bigfoot pages are 1MB in size. That allows an entire 64GB address
    range to be mapped with one page table. With larger memory systems a >>>>>> larger page size is needed IMO. 64GB is 65,536 pages still when the >>>>>> pages are 1MB in size. There is 32GB RAM in my machine today.
    Tomorrow it will be 64GB.

    Above a certain point the added latency of filling/spilling a page to DISK
    that is 64KB in size rather than 4KB in size outweighs the gain in the TLB.

    Sure.  But the 4K size was first used when disk transfer rates were
    about 3 MB/sec.  Today, they are many times that, even if you are
    using a single physical hard disk.  RAID and SSD can make that even
    larger.  I don't know what the "optimal" page size is, but it is
    certainly larger than 4KB.

    While the data transfer rates are far higher today, the disk latency has >>> not kept pace with CPU performance {SSD is different}.

    Sure, but the latency is the same no matter what the transfer size is.
    So the fact that latency improvements haven't kept pace is irrelevant to
    the question of optimal page size.


    The CPU went from 60ns (16 MHz: 360/67) to 200ps (5GHz) 300× !!
    And from 5 cycles per instruction to 3 instructions per cycle 15×
    for a combined gain of 4,500×

    Disk latency may have gone down from 8ms to 3ms (sometimes 2ms for
    smaller disks) 3×-4×

    {{I consider 360/67 as the beginning of paging; although Multics may be
    the beginning.}}

    Agreed.


    I, also, believe that 4KB pages are a bit small for a new architecture.
    I chose 8KB for My 66000 more because it took 1 level out of the page
    tables supporting 64-bit VAS, and also made the paging hierarchy more
    easy to remember:: 8K, 8M, 8G, 8T, 8P, 8E is easier to remember than
    4K, 2M, 1G, ½T, ¼P, ⅛E.

    Fair enough. When Unisys implemented paging in the 2200 Series in the
    1990s, they chose 16K (approximately - exactly if you consider 36 equals
    32!).


    Actually, we use 4 KiW

    Yes. For those too young to remember, on the 2200 series, one word is
    36 bits. Hence my comment about 4KW being 16 KB, if 36 equals 32.

    (contained in 32 KiB) as the page size.

    I presume this is a result of the emulated systems using 64 bits for
    each 36 bit word. I was referring to the original implementation on
    native hardware.

    I
    don't remember spill/fill time being an issue.

    Ahh! That goes along with Anton's comment and contradicts Mitch's
    comments about disk times being a factor. Since you were there and
    involved in the implementation, what were the considerations in choosing
    4KW? Not that I am challenging it - I think it was probably the correct decision. I just want to better understand the reasoning behind it.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to moi on Fri Mar 8 09:52:44 2024
    On 3/8/2024 9:48 AM, moi wrote:

    {{I consider 360/67 as the beginning of paging; although Multics may be
    the beginning.}}

    Agreed.

    It certainly is not.
    IBM bought the right to use the Manchester University / Ferranti paging patent.


    I didn't know that. Thanks.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From moi@21:1/5 to All on Fri Mar 8 17:48:08 2024
    {{I consider 360/67 as the beginning of paging; although Multics may be
    the beginning.}}

    Agreed.

    It certainly is not.
    IBM bought the right to use the Manchester University / Ferranti paging
    patent.

    --
    Bill F.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Fri Mar 8 18:49:30 2024
    EricP wrote:

    MitchAlsup1 wrote:
    Stephen Fuld wrote:

    So the optimum is, at least to some degree, usage dependent. Of
    course, this is all an argument for multiple page sizes.

    Which My 66000 provides, but it also provides big pages that can have an
    extent [0..extent-1] so you can use an 8GB page to map anything from 8KB
    through 8GB in a single PTE. The extent-bits are exactly the bits not
    being used as PA-bits.

    By the way, I borrowed your extent idea (thank you :-) ) as when combined with skipping from the root pointer to a lower level, it can be used to
    map the whole OS code and data with just a couple of PTE's.
    This eliminates table walks for most OS virtual addresses and
    could allow a few 'pinned' TLB entries to map the whole OS.

    This achieves a similar result to my Block Address Translate (BAT)
    approach but without requiring a whole separate MMU mechanism.

    The idea is for the OS to be separated into static and dynamic sets
    of code and RW data. The static parts are always resident in the OS
    plus any mandatory drivers. The linker packages all the static code
    and data together into two huge RE and RW memory sections at specific
    high end virtual addresses, aligned to a huge page boundary.

    The extent feature allows the OS static code and data to be loaded
    into just the portion of a huge page that each needs, with any unused remainder in the huge pages being returned to the general pool as
    smaller pages (so no wasted space in the 1GB or 8GB pages).

    Those Huge page boundaries can be achieved at nominal page boundaries
    with a bit of paravirtualization help by HyperVisor simply because they
    are in GuestOS PaS = HV VaS.

    And voila - two PTE's map the whole static OS code and data
    which can be permanently held in two MMU mapping registers.

    And can be as big or small as GuestOS desires.

    With one more for the graphics memory and the bulk of
    table walks for system space can be eliminated.

    And so much for Kernel excursions wiping out the TLB.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to All on Fri Mar 8 14:41:51 2024
    On Thu, 7 Mar 2024 16:02:45 -0500, Robert Finch <robfi680@gmail.com>
    wrote:

    A page marked RWE=000 is an unusable page. Perhaps to signal bad memory.
    Or perhaps as a hidden data page full of comments or remarks. If it's not
    readable, writeable, or executable, what is it? Nothing should be able to
    access it, except maybe the machine/debug operating mode.

    The ability to change (at least data) pages between "untouchable" and
    RW is required for MMU assisted incremental GC. If the GC also
    handles code, then it must be able to mark pages executable as well.

    If an "untouchable" page can't be manipulated by user software, then
    you've disallowed an entire class of GC systems.
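
    For concreteness, the standard user-space mechanism for this on a POSIX
    system is mprotect() plus a fault handler. A minimal sketch of the
    pattern follows; gc_scan_page is a hypothetical collector hook, not any
    particular GC's API:

      #include <signal.h>
      #include <stddef.h>
      #include <stdint.h>
      #include <sys/mman.h>

      #define PAGE 4096

      extern void gc_scan_page(void *page);  /* hypothetical collector hook */

      static void on_fault(int sig, siginfo_t *si, void *ctx)
      {
          (void)sig; (void)ctx;
          void *page = (void *)((uintptr_t)si->si_addr & ~(uintptr_t)(PAGE - 1));
          gc_scan_page(page);                           /* do the GC work...   */
          mprotect(page, PAGE, PROT_READ | PROT_WRITE); /* ...then reopen page */
      }

      /* Make the heap "untouchable"; the first touch of each page traps
       * into on_fault, which scans that page and then re-enables access. */
      void install_barrier(void *heap, size_t len)
      {
          struct sigaction sa;
          sa.sa_flags = SA_SIGINFO;
          sigemptyset(&sa.sa_mask);
          sa.sa_sigaction = on_fault;
          sigaction(SIGSEGV, &sa, NULL);
          mprotect(heap, len, PROT_NONE);
      }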

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Fri Mar 8 22:34:12 2024
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:

    Robert Finch <robfi680@gmail.com> writes:


    Do they support using the entire page as a page table?

    All multilevel page tables use an entire page (granule in ARMv8)
    at each level of the page table. To map larger page/block sizes,
    they just stop the table walk at one, two or three levels rather
    than walking all four levels to the leaf page size.

    My 66000 page structure supports both skipping of levels in the table
    and stopping at any level in the table.
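
    As a rough sketch of what "stopping at any level" means in a table
    walker (the structure is illustrative, not the actual My 66000 or ARMv8
    format; skipping levels would additionally let the root pointer aim at a
    lower-level table):

      #include <stdint.h>

      #define PAGE_SHIFT 13                 /* 8 KB granule           */
      #define LVL_BITS   10                 /* 1024 entries per table */
      #define TOP_LEVEL  4

      /* A 'leaf' bit lets the walk stop at any level, so one entry can
       * map an 8K/8M/8G/... block. */
      typedef struct {
          uint64_t ppn;                     /* next table, or block   */
          unsigned valid : 1;
          unsigned leaf  : 1;
      } pte_t;

      extern pte_t *phys_to_table(uint64_t ppn);   /* hypothetical helper */

      int walk(pte_t *root, uint64_t va, uint64_t *pa)
      {
          pte_t *table = root;
          for (int lvl = TOP_LEVEL; lvl >= 0; lvl--) {
              unsigned shift = PAGE_SHIFT + lvl * LVL_BITS;
              pte_t e = table[(va >> shift) & ((1u << LVL_BITS) - 1)];
              if (!e.valid)
                  return -1;                        /* translation fault  */
              if (e.leaf || lvl == 0) {             /* stop at this level */
                  *pa = (e.ppn << PAGE_SHIFT) + (va & ((1ULL << shift) - 1));
                  return 0;
              }
              table = phys_to_table(e.ppn);         /* descend one level  */
          }
          return -1;                                /* not reached */
      }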

    That's all well and good. What operating system do you
    have running on the My 66000 processor? When will I be able
    to purchase a system based on that processor?

    Linux.



    Compression could be useful for something serialized to disk or through
    the network, transferring the page number and compressed contents.

    Compressing tiered DRAM is looking to be the next big thing.

    https://www.zeropoint-tech.com/news/meta-and-google-unveil-a-revolutionary-hardware-compressed-cxl-memory-tier-specification

    CXL is just an offload of DRAM from an on-die memory controller to
    high-speed PCIe 5.0-6.0 links.

    Yes, I'm well aware of that. What you don't mention is that it
    can become part of the processor cache coherency domain.

    My 66000 considers DRAM as part of the cache/memory hierarchy and
    considers LLC as the front end to DRAM.

    An interesting topic between PCIe 5.0 and 6.0 is the change from
    NRZ encoding to PAM4. This comes with a degradation of the raw bit
    error rate from 10^-12 down to 10^-6, so you are going to need some
    kind of ECC on the PCIe link and protocol layers.

    PAM4 is something with which we have a great deal of expertise,
    along with the associated error correction.

    I only point this out because it is "different".

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to George Neuner on Fri Mar 8 22:50:43 2024
    George Neuner wrote:

    On Thu, 7 Mar 2024 16:02:45 -0500, Robert Finch <robfi680@gmail.com>
    wrote:

    A page marked RWE=000 is an unusable page.

    Inaccessible, not unusable.

    ENTER and EXIT check that the Safe-Stack is inaccessible to the application
    (RWE = 000). This means the application cannot LD from or ST to the
    Safe-Stack; ENTER and EXIT can! This simple twist of the wrist eliminates
    the ability to overrun data onto the data-stack, yet does not alter the ABI
    guarantee that the callee returns to the caller with all of its preserved
    registers as if unchanged.

    Perhaps to signal bad memory.

    That is eminently possible.

    Or perhaps as a hidden data page full of comments or remarks. If it's not
    readable, writeable, or executable, what is it? Nothing should be able to
    access it, except maybe the machine/debug operating mode.

    A) it is accessible by more privileged levels of the system.
    B) GuestOS can put information in process VaS that the application cannot
    access {say, for example, to avoid keeping it in kernel address space}.
    C) it can still be accessed by devices.
    D) it can be decrypted as touched (GuestOS exception).
    E) a stack that guarantees the ABI in untrusted computing environments.
    ...
    The only one guaranteed no access is the application at the privilege level
    of that application's memory map. All higher-privilege levels retain access.

    The ability to change (at least data) pages between "untouchable" and
    RW is required for MMU assisted incremental GC. If the GC also
    handles code, then it must be able to mark pages executable as well.

    Another use.

    If an "untouchable" page can't be manipulated by user software, then
    you've disallowed an entire class of GC systems.

    I did not know of this technique, but it works in my design too, and
    without alteration.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David W Schroth@21:1/5 to sfuld@alumni.cmu.edu.invalid on Fri Mar 8 19:04:35 2024
    On Fri, 8 Mar 2024 06:58:11 -0800, Stephen Fuld
    <sfuld@alumni.cmu.edu.invalid> wrote:

    On 3/8/2024 6:24 AM, David W Schroth wrote:
    On Tue, 5 Mar 2024 10:32:03 -0800, Stephen Fuld
    <sfuld@alumni.cmu.edu.invalid> wrote:

    On 3/5/2024 9:32 AM, MitchAlsup1 wrote:
    Stephen Fuld wrote:

    On 3/5/2024 7:37 AM, MitchAlsup1 wrote:
    Robert Finch wrote:

    Bigfoot pages are 1MB in size. That allows an entire 64GB address
    range to be mapped with one page table. With larger memory systems a
    larger page size is needed IMO. 64GB is 65,536 pages still when the
    pages are 1MB in size. There is 32GB RAM in my machine today.
    Tomorrow it will be 64GB.

    Above a certain point the added latency of filling/spilling a page to
    DISK that is 64KB in size rather than 4KB in size outweighs the gain in
    the TLB.

    Sure. But the 4K size was first used when disk transfer rates were
    about 3 MB/sec. Today, they are many times that, even if you are
    using a single physical hard disk. RAID and SSD can make that even
    larger. I don't know what the "optimal" page size is, but it is
    certainly larger than 4KB.

    While the data transfer rates are far higher today, the disk latency has
    not kept pace with CPU performance {SSD is different}.

    Sure, but the latency is the same no matter what the transfer size is.
    So the fact that latency improvements haven't kept pace is irrelevant to
    the question of optimal page size.


    The CPU went from 60ns (16 MHz: 360/67) to 200ps (5GHz): a factor of 300!
    And from 5 cycles per instruction to 3 instructions per cycle: a factor
    of 15, for a combined gain of 4,500.

    Disk latency may have gone down from 8ms to 3ms (sometimes 2ms for
    smaller disks): a factor of 3-4.

    {{I consider 360/67 as the beginning of paging; although Multics may be
    the beginning.}}

    Agreed.


    I, also, believe that 4KB pages are a bit small for a new architecture. >>>> I chose 8KB for My 66000 more because it took 1 level out of the page
    tables supporting 64-bit VAS, and also made the paging hierarchy more
    easy to remember:: 8K, 8M, 8G, 8T, 8P, 8E is easier to remember than
    4K, 2M, 1G, T, P, ?E.

    Fair enough. When Unisys implemented paging in the 2200 Series in the
    1990s, they chose 16K (approximately - exactly if you consider 36 equals 32!).


    Actually, we use 4 KiW

    Yes. For those too young to remember, on the 2200 series, one word is
    36 bits. Hence my comment about 4KW being 16 KB, if 36 equals 32.

    (contained in 32 KiB) as the page size.

    I presume this is a result of the emulated systems using 64 bits for
    each 36 bit word. I was referring to the original implementation on
    native hardware.

    I
    don't remember spill/fill time being an issue.

    Ahh! That goes along with Anton's comment and contradicts Mitch's
    comments about disk times being a factor. Since you were there and
    involved in the implementation, what were the considerations in choosing
    4KW? Not that I am challenging it - I think it was probably the correct
    decision. I just want to better understand the reasoning behind it.

    I wasn't there when the paging architecture was defined, so I can't
    actually say. I would speculate that it's because the D
    (displacement) field in the Extended Mode instruction format is 12
    bits wide - 2^12 = 4,096 words, exactly one 4 KiW page.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lynn Wheeler@21:1/5 to mitchalsup@aol.com on Fri Mar 8 14:19:17 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    {{I consider 360/67 as the beginning of paging; although Multics may be
    the beginning.}}

    The Science Center thought IBM could get MIT MULTICS ... but it went to
    GE. Then the IBM mission for virtual memory/paging went to the "new"
    TSS/360 group. The Science Center modified a 360/40 with virtual memory &
    paging pending availability of the 360/67, which came standard with
    virtual memory (and CP40/CMS morphed into CP67/CMS).

    Melinda's history web pages
    http://www.leeandmelindavarian.com/Melinda#VMHist
    from (lots of early history: some CTSS/7094 people went to the 5th floor
    and Multics, and others went to the IBM Science Center on the 4th floor)
    http://www.leeandmelindavarian.com/Melinda/neuvm.pdf
    a footnote from Les Comeau:

    Since the early time-sharing experiments used base and limit registers
    for relocation, they had to roll in and roll out entire programs when
    switching users....Virtual memory, with its paging technique, was
    expected to reduce significantly the time spent waiting for an exchange
    of user programs.

    What was most significant was that the commitment to virtual memory was
    backed with no successful experience. A system of that period that had implemented virtual memory was the Ferranti Atlas computer, and that was
    known not to be working well. What was frightening is that nobody who
    was setting this virtual memory direction at IBM knew why Atlas didn't
    work.

    ... snip ...

    Atlas reference (original gone?), but it lives on at the Wayback Machine:
    https://web.archive.org/web/20121118232455/http://www.ics.uci.edu/~bic/courses/JaverOS/ch8.pdf
    from above:

    Paging can be credited to the designers of the ATLAS computer, who
    employed an associative memory for the address mapping [Kilburn, et
    al., 1962]. For the ATLAS computer, |w| = 9 (resulting in 512 words
    per page), |p| = 11 (resulting in 2048 pages), and f = 5 (resulting in
    32 page frames). Thus a 2^20-word virtual memory was provided for a
    2^14-word machine. But the original ATLAS operating system employed
    paging solely as a means of implementing a large virtual memory;
    multiprogramming of user processes was not attempted initially, and
    thus no process id's had to be recorded in the associative memory. The
    search for a match was performed only on the page number p.

    ... snip ...

    I was an undergraduate at the univ and was hired fulltime, responsible
    for OS/360. The univ had gotten a 360/67 for TSS/360, but was running it
    as a 360/65 (the univ shut down the datacenter on weekends and I had the
    whole place dedicated; 48hrs w/o sleep did make Monday classes hard).

    Then CSC came out to install CP67 (3rd installation, after CSC itself and
    MIT Lincoln Labs) and I mostly played with it in my weekend time. This
    early release had a very rudimentary page replacement algorithm and no
    page thrashing controls. I did a (global LRU) reference-bit scan
    and dynamic adaptive page thrashing controls.
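
    For readers who haven't seen it: a reference-bit scan of this sort is
    essentially the "clock" / second-chance policy. A minimal sketch (mine,
    not CP/67's actual code):

      #include <stdbool.h>

      typedef struct {
          bool in_use;
          bool referenced;   /* set by hardware on any touch of the page */
      } frame_t;

      /* Sweep all frames in a circle: a set reference bit buys the page
       * one more revolution; a clear bit means it has not been touched
       * since the last pass, so evict it. Assumes at least one frame is
       * in use. */
      unsigned pick_victim(frame_t *frames, unsigned nframes, unsigned *hand)
      {
          for (;;) {
              frame_t *f = &frames[*hand];
              unsigned cur = *hand;
              *hand = (*hand + 1) % nframes;
              if (!f->in_use)
                  continue;
              if (f->referenced)
                  f->referenced = false;   /* second chance */
              else
                  return cur;              /* victim frame  */
          }
      }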

    Nearly 15yrs later, at Dec81 ACM SIGOPS, Jim Gray asked if I could help a
    Tandem co-worker get a Stanford PhD .... it involved global LRU similar
    to the work that I had done in the 60s, and there were "local LRU" forces
    from the 60s lobbying hard against awarding a PhD (for work that wasn't
    "local LRU"). I had real live data from a CP/67 with global LRU on a
    768kbyte (104 pageable pages) 360/67 with 80 users that had better
    response and throughput than a CP/67 (with a nearly identical type of
    workload but 35 users) running the 60s "local LRU" implementation on a
    1mbyte 360/67 (155 pageable pages after fixed storage) ... aka half the
    users and 50% more real paging storage.

    A decade ago, I was asked to track down the decision to add virtual
    memory to all 370s ... found somebody involved; archived posts with
    pieces of the email exchange:
    https://www.garlic.com/~lynn//2011d.html#73

    --
    virtualization experience starting Jan1968, online at home since Mar1970

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David W Schroth@21:1/5 to sfuld@alumni.cmu.edu.invalid on Fri Mar 8 19:13:38 2024
    On Thu, 7 Mar 2024 11:58:53 -0800, Stephen Fuld
    <sfuld@alumni.cmu.edu.invalid> wrote:

    On 3/5/2024 9:32 AM, MitchAlsup1 wrote:

    snip

    I also believe in the tension between pages that are too small and those
    that are too large. 256B is widely seen as too small (VAX). I think most
    people are comfortable in the 4KB range. I think 64KB is too big since
    something like cat needs less code, less data, and less stack space than
    1 single 64KB page and another 64KB page to map cat from its VAS. So now,
    instead of four 4KB pages (16KB = {code, data, stack, map}) we now need
    four 64KB pages (256KB). It is these small applications that drive the
    minimum page size down.

    In thinking about this, an idea occurred to me that may ease this
    tension some. For a large page, you introduce a new protection mode
    such that, for example, the lower half of the addresses in the page are
    execute only, and the upper half are read/write enabled. This would
    allow the code and the data, and perhaps even the stack for such a
    program to share a single page, while still maintaining the required
    access protection. I think the hardware to implement this is pretty
    small. While the benefits of this would be modest, if such "small
    programs" occur often enough it may be worth the modest cost of the >additional hardware.

    I would suggest that your proposal would be better done by splitting access/protection from virtual to physical translation (think Mill
    turfs). I suppose OS2200 could use our existing protection to
    implement what you propose, but we haven't (largely because there
    seems to be no need/call for the capability).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to David W Schroth on Sat Mar 9 08:46:54 2024
    On 3/8/2024 5:13 PM, David W Schroth wrote:
    On Thu, 7 Mar 2024 11:58:53 -0800, Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:

    On 3/5/2024 9:32 AM, MitchAlsup1 wrote:

    snip

    I also believe in the tension between pages that are too small and those >>> that are too large. 256B is widely seen as too small (VAX). I think most >>> people are comfortable in the 4KB range. I think 64KB is too big since
    something like cat needs less code, less data, and less stack space than >>> 1 single 64KB page and another 64KB page to map cat from its VAS. So, now; >>> instead of four* 4KB pages (16KB ={code, data, stack, map} ) we now need >>> four 64KB pages 256KB. It is these small applications that drive the
    minimum page size down.

    In thinking about this, an idea occurred to me that may ease this
    tension some. For a large page, you introduce a new protection mode
    such that, for example, the lower half of the addresses in the page are
    execute only, and the upper half are read/write enabled. This would
    allow the code and the data, and perhaps even the stack for such a
    program to share a single page, while still maintaining the required
    access protection. I think the hardware to implement this is pretty
    small. While the benefits of this would be modest, if such "small
    programs" occur often enough it may be worth the modest cost of the
    additional hardware.

    I would suggest that your proposal would be better done by splitting access/protection from virtual to physical translation (think Mill
    turfs).

    I think we are in general agreement here. As you implied, the
    fundamental problem is that most current systems "overload" two pieces
    of functionality (memory management and protection) onto a single
    mechanism (paging). I like Mill's approach as it clearly separates the
    two functions, though it does require more hardware, and I am not sure
    how much easier it is for them than it would be for other systems since
    they use a single address space. My proposal was for a low hardware
    cost and easily implemented mechanism.


    I suppose OS2200 could use our existing protection to
    implement what you propose, but we haven't (largely because there
    seems to be no need/call for the capability).

    Since the 2200 already separates the protection function (banking) from
    the memory management function (paging), I think all that would be
    required is to allow multiple banks, perhaps even from multiple programs
    to use different parts of a page. But I agree that, with ~16 KB pages,
    the savings would be much less than if larger pages were used, so the
    benefits would be modest at best and probably not worth the effort.

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Lynn Wheeler on Sat Mar 9 13:07:44 2024
    Lynn Wheeler wrote:

    Nearly 15yrs later at Dec81 ACM SIGOPS, Jim Gray ask if I could help a
    Tandem co-worker get Stanford Phd .... it involved similar global LRU to
    the work that I had done in the 60s and there were "local LRU" forces
    from the 60s lobbying hard not to award a Phd (that wasn't "local
    LRU"). I had real live data from a CP/67 with global LRU on 768kbyte
    (104 pageable pages) 360/67 with 80users that had better response and throughput compared to a CP/67 (with nearly identical type of workload
    but 35users) that implemented 60s "local LRU" implementation and 1mbyte 360/67 (155 pageable pages after fixed storage) ... aka half the users
    and 50% more real paging storage.

    I assume by "local LRU" you mean local working set management,
    as opposed to global working set e.g. WSClock.

    Are you saying some people tried to block someone from getting a PhD
    because he researched a different working set management than their
    pet approach?

    If so, wow...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Robert Finch on Sat Mar 9 21:34:39 2024
    On 3/7/2024 9:13 PM, Robert Finch wrote:
    On 2024-03-07 2:58 p.m., Stephen Fuld wrote:
    On 3/5/2024 9:32 AM, MitchAlsup1 wrote:

    snip

    I also believe in the tension between pages that are too small and
    those that are too large. 256B is widely seen as too small (VAX). I
    think most
    people are comfortable in the 4KB range. I think 64KB is too big since
    something like cat needs less code, less data, and less stack space than
    1 single 64KB page and another 64KB page to map cat from its VAS. So now,
    instead of four 4KB pages (16KB = {code, data, stack, map}) we now need
    four 64KB pages (256KB). It is these small applications that drive the
    minimum page size down.

    In thinking about this, an idea occurred to me that may ease this
    tension some. For a large page, you introduce a new protection mode
    such that, for example, the lower half of the addresses in the page
    are execute only, and the upper half are read/write enabled. This
    would allow the code and the data, and perhaps even the stack for such
    a program to share a single page, while still maintaining the required
    access protection. I think the hardware to implement this is pretty
    small. While the benefits of this would be modest, if such "small
    programs" occur often enough it may be worth the modest cost of the
    additional hardware.


    I had thoughts along this line too. I have added user access rights for
    each 1/4 of a page. Only a single 64kB page split in four is needed for
    a small app then.

    Yes, similar. I suspect your solution is more general than mine and
    thus handles more cases, but requires more hardware, especially bits in
    the PTE. It's all a tradeoff.
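
    For concreteness, a sketch of the check such per-quarter rights imply.
    The encoding (four 3-bit RWE fields packed into 12 PTE bits) is
    hypothetical, not Robert's actual format:

      #include <stdbool.h>
      #include <stdint.h>

      #define QUARTER_SHIFT 14   /* 64KB page / 4 = 16KB quarters */

      /* rwe4 packs four 3-bit RWE fields, one per quarter of the page;
       * 'need' is the RWE mask the access requires. */
      bool access_ok(uint16_t rwe4, uint64_t va, unsigned need)
      {
          unsigned q      = (unsigned)(va >> QUARTER_SHIFT) & 3;  /* quarter */
          unsigned rights = (rwe4 >> (3 * q)) & 7;                /* its RWE */
          return (rights & need) == need;
      }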


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to All on Sun Mar 10 14:29:52 2024
    On Fri, 8 Mar 2024 20:26:08 -0500, Robert Finch <robfi680@gmail.com>
    wrote:

    I plan on having garbage collection as part of the OS. There is a shared
    hardware-card table involved.

    What kind?
    [Actually "kind" is the wrong word because any non-toy, real world GC
    will need to employ a combination of techniques. So the question
    really should be "in what major class is your GC"?]

    Problem is - whatever you choose - it will be wrong and have bad
    performance for some important class of GC'd applications.


    So, I guess that would disallow user
    garbage collectors using untouchable pages. The MMU could be faked out
    using a VM, so I have read.

    Yes, a VM can emulate MMU operation, but currently that requires using
    a hypervisor - a heavyweight solution that also requires a guest OS to
    run the program.

    There are a number of light(er) weight VMs for running programs in
    managed environments [which include GC] ... but all of them have to
    run under an OS and are at its mercy.

    GCs that use no-access pages are not rare, and they are just one class
    of MMU assisted GC systems. There are a number of ways a collector
    can leverage the MMU to help.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to gneuner2@comcast.net on Sun Mar 10 19:53:12 2024
    On Sun, 10 Mar 2024 14:29:52 -0400, George Neuner
    <gneuner2@comcast.net> wrote:


    Problem is - whatever [GC] you choose - it will be wrong and have bad
    performance for some important class of GC'd applications.

    "bad performance" may mean "slow", but also may mean memory use much
    higher than should be necessary.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to George Neuner on Mon Mar 11 09:29:57 2024
    George Neuner <gneuner2@comcast.net> writes:

    On Fri, 8 Mar 2024 20:26:08 -0500, Robert Finch <robfi680@gmail.com>
    wrote:

    I plan on having garbage collection as part of the OS. There is a shared
    hardware-card table involved.

    What kind?
    [Actually "kind" is the wrong word because any non-toy, real world GC
    will need to employ a combination of techniques. So the question
    really should be "in what major class is your GC"?]

    Problem is - whatever you choose - it will be wrong and have bad
    performance for some important class of GC'd applications.

    I'm curious to know what you consider to be the different kinds,
    or classes, of GC, and the same question for applications.

    Certainly, for any given GC implementation, one can construct an
    application that does poorly with respect to that GC, but that
    doesn't make the constructed application a "class". For the
    statement to have meaningful content there needs to be some kind
    of identification of what are the classes of GCs, and what are
    the classes of applications.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to George Neuner on Mon Mar 11 09:32:40 2024
    George Neuner <gneuner2@comcast.net> writes:

    On Sun, 10 Mar 2024 14:29:52 -0400, George Neuner
    <gneuner2@comcast.net> wrote:


    Problem is - whatever [GC] you choose - it will be wrong and have bad
    performance for some important class of GC'd applications.

    "bad performance" may mean "slow", but also may mean memory use much
    higher than should be necessary.

    Understood. And there are other relevant metrics as well, as
    for example not throughput but worst-case latency.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to tr.17687@z991.linuxsc.com on Tue Mar 12 07:30:08 2024
    On Mon, 11 Mar 2024 09:32:40 -0700, Tim Rentsch
    <tr.17687@z991.linuxsc.com> wrote:

    George Neuner <gneuner2@comcast.net> writes:

    On Sun, 10 Mar 2024 14:29:52 -0400, George Neuner
    <gneuner2@comcast.net> wrote:


    Problem is - whatever [GC] you choose - it will be wrong and have bad
    performance for some important class of GC'd applications.

    "bad performance" may mean "slow", but also may mean memory use much
    higher than should be necessary.

    Understood. And there are other relevant metrics as well, as
    for example not throughput but worst-case latency.

    If latency is the primary concern, then you should use a deterministic
    system such as Baker's Treadmill.

    Treadmill essentially is just a set of linked lists, and collection
    operations like marking and sweeping simply move blocks from one list
    to another. But you pay for that determinism with space - compared to
    other systems, Treadmills have a lot of per-block metadata.

    Note that for allocating and freeing to be deterministic, a Treadmill
    has to work with fixed-size blocks. But you can run multiple
    Treadmills for common block sizes, with a catchall for big blocks.
    Some malloc/free style allocators already work like this, using
    separate lists for some commonly requested block sizes.

    Depending on how you handle the metadata, Treadmills also are amenable
    to coalescing free space and/or compacting the heap / working set. But
    note that these types of operations can't be made deterministic.
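
    A minimal sketch of the structure (one treadmill of one fixed block
    size; the colored segments are spans of a single circular list):

      /* Per-block metadata: every block lives on one circular doubly-
       * linked list; the FREE, NEW, FROM and TO segments are delimited
       * by pointers into it. This overhead per block is the price paid
       * for O(1) allocate, mark, and free. */
      typedef struct block {
          struct block *next, *prev;
          unsigned char payload[64];    /* fixed size for this treadmill */
      } block_t;

      typedef struct {
          block_t *free, *bottom, *top, *scan;   /* segment boundaries */
      } treadmill_t;

      /* O(1): unlink a block from wherever it sits on the ring... */
      static void unlink_block(block_t *b)
      {
          b->prev->next = b->next;
          b->next->prev = b->prev;
      }

      /* ...and O(1): relink it just before pos. "Marking" a live block
       * is exactly unlink + insert into the TO segment: nothing is
       * copied and no address changes, which is what keeps the
       * operations deterministic. */
      static void insert_before(block_t *pos, block_t *b)
      {
          b->next = pos;
          b->prev = pos->prev;
          pos->prev->next = b;
          pos->prev = b;
      }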

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to George Neuner on Tue Mar 12 20:13:53 2024
    George Neuner <gneuner2@comcast.net> writes:

    On Mon, 11 Mar 2024 09:32:40 -0700, Tim Rentsch
    <tr.17687@z991.linuxsc.com> wrote:

    George Neuner <gneuner2@comcast.net> writes:

    On Sun, 10 Mar 2024 14:29:52 -0400, George Neuner
    <gneuner2@comcast.net> wrote:


    Problem is - whatever [GC] you choose - it will be wrong and have bad
    performance for some important class of GC'd applications.

    "bad performance" may mean "slow", but also may mean memory use much
    higher than should be necessary.

    Understood. And there are other relevant metrics as well, as
    for example not throughput but worst-case latency.

    If latency is the primary concern, then you should use a deterministic
    system such as Baker's Treadmill. [...]

    Does this mean you aren't going to answer my other question?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to tr.17687@z991.linuxsc.com on Tue Mar 12 23:57:02 2024
    On Mon, 11 Mar 2024 09:29:57 -0700, Tim Rentsch
    <tr.17687@z991.linuxsc.com> wrote:

    George Neuner <gneuner2@comcast.net> writes:

    On Fri, 8 Mar 2024 20:26:08 -0500, Robert Finch <robfi680@gmail.com>
    wrote:

    I plan on having garbage collection as part of the OS. There is a shared
    hardware-card table involved.

    What kind?
    [Actually "kind" is the wrong word because any non-toy, real world GC
    will need to employ a combination of techniques. So the question
    really should be "in what major class is your GC"?]

    Problem is - whatever you choose - it will be wrong and have bad
    performance for some important class of GC'd applications.

    I'm curious to know what you consider to be the different kinds,
    or classes, of GC, and the same question for applications.

    Certainly, for any given GC implementation, one can construct an
    application that does poorly with respect to that GC, but that
    doesn't make the constructed application a "class". For the
    statement to have meaningful content there needs to be some kind
    of identification of what are the classes of GCs, and what are
    the classes of applications.

    Feeling mathematical, are we?


    Every application contains delineated portions of its overall
    allocation profile which correspond closely to portions of the
    profiles of other applications.

    If a given profile performs poorly under a given GC, it is reasonable
    to infer that other applications having corresponding profiles also
    will perform poorly while those profiles are in force.

    That said ...



    GC systems - including their associated allocator(s) - are categorized
    (better word?) by their behavior. Unfortunately, behavior is
    described by a complex set of implementation choices.

    Understand that real-world GCs typically implement more than one
    algorithm, and often the algorithms are hybridized - derived from and
    relatable to published algorithms, but having a unique mix of function
    that won't be found "as is" in any search of the literature. [In truth,
    GC literature tends to leave a lot as an exercise for the reader.]

    GC behavior often is adaptive, reacting to run time conditions: e.g.,
    based on memory fragmentation it could shift between non-moving
    mark/sweep and moving mark/compact. It may also employ differing
    algorithms simultaneously in different spaces, such as being
    conservative in stacks while being precise in dynamic heaps, or being stop-world in thread private heaps while being concurrent or parallel
    in shared heaps. Etc.


    Concurrent GC (aka incremental) runs as a co-routine with the mutator.
    These systems are distinguished by how they are triggered to run, and
    what bounds may be placed on their execution time. There are
    concurrent systems having completely deterministic operation [their
    actual execution time, of course, may depend on factors beyond the
    GC's control, such as multitasking, caching or paging.]

    Parallel GC may be both prioritized and scheduled. These systems may
    offer some guarantees about the percentage of (application) time given
    to collector vs mutator(s).


    Major choices:

    - precise or conservative?
    - moving or non-moving?
    - tracing (marking)?
    - copying / compacting?
    - stop-world, concurrent, or parallel?

    - single or multiple spaces?
    - semispaces?
    - generational?

    Minor choices:

    - software-only or hardware (MMU) assisted?
    - snapshot at beginning?

    - bump or block allocation?
    - allocation color?

    - free blocks coalesced? {if not compacting}

    - multiple mutators?
    - mutation color?
    - writable shared objects?
    - FROM-space mutation?
    - finalization?


    Note that all of these represent free dimensions in design. As
    mentioned above, any particular system may implement multiple
    collection algorithms each embodying a different set of choices.



    You may wonder why I didn't mention "sweeping" ... essentially it is
    because sequential scan is more an implementation detail than a
    technique. Although "mark/sweep" is a well established technique, it
    is the marking (tracing) that really defines it. Then too, modern
    collectors often are neither mark/sweep nor copying as presented in
    textbooks: e.g., there are systems that mark and copy, systems that
    sweep and copy (without marking), and "copying" systems in which
    copies are logical and nothing actually changes address.

    Aside: all GC can be considered to use [logical] semispaces because
    all have the notion of segregated FROM-space and TO-space during
    collection. True semispaces are a set of (address) contiguous spaces
    - not necessarily equally sized - which are rotated as targets for new allocation. True semispaces do imply physical copying [but see the
    Treadmill for an example of "non-moving copy" using logical
    semispaces].



    So what do I consider to be the "kind" of a GC?

    The choices above pretty much define the extent of the design space
    [but note I did intentionally leave out reference counting]. However,
    the first 8 choices are structural, whereas the rest specify important characteristics but don't affect structure.

    A particular "kind" might be, e.g.,
    "precise, generational, multispace, non-moving, concurrent
    tracer".
    and so on.



    I'm guessing this probably didn't really answer your question, but it
    was fun to write.
    ;-)

    George

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to George Neuner on Sun Apr 28 15:27:41 2024
    George Neuner <gneuner2@comcast.net> writes:

    On Mon, 11 Mar 2024 09:29:57 -0700, Tim Rentsch
    <tr.17687@z991.linuxsc.com> wrote:

    George Neuner <gneuner2@comcast.net> writes:

    On Fri, 8 Mar 2024 20:26:08 -0500, Robert Finch <robfi680@gmail.com>
    wrote:

    I plan on having garbage collection as part of the OS. There
    is a shared hardware-card table involved.

    What kind?
    [Actually "kind" is the wrong word because any non-toy, real
    world GC will need to employ a combination of techniques. So
    the question really should be "in what major class is your GC"?]

    Problem is - whatever you choose - it will be wrong and have bad
    performance for some important class of GC'd applications.

    I'm curious to know what you consider to be the different kinds,
    or classes, of GC, and the same question for applications.

    Certainly, for any given GC implementation, one can construct an
    application that does poorly with respect to that GC, but that
    doesn't make the constructed application a "class". For the
    statement to have meaningful content there needs to be some kind
    of identification of what are the classes of GCs, and what are
    the classes of applications.

    Feeling mathematical, are we?

    If you're trying to say I make an effort to be accurate and
    precise in my writing, I plead guilty as charged.

    Every application contains delineated portions of its overall
    allocation profile which correspond closely to portions of the
    profiles of other applications.

    If a given profile performs poorly under a given GC, it is
    reasonable to infer that other applications having corresponding
    profiles also will perform poorly while those profiles are in
    force.

    An empty, circular observation. Very disappointing.

    That said ...



    GC systems - including their associated allocator(s) - are categorized (better word?) by their behavior. Unfortunately, behavior is
    described by a complex set of implementation choices.

    Understand that real world GC typically implement more than one
    algorithm, and often the algorithms are hybridized - derived from and relatable to published algorithms, but having unique mix of function
    that won't be found "as is" in any search of literature. [In truth,
    GC literature tends to leave a lot as exercise for the reader.]

    GC behavior often is adaptive, reacting to run time conditions: e.g.,
    based on memory fragmentation it could shift between non-moving
    mark/sweep and moving mark/compact. It may also employ differing
    algorithms simultaneously in different spaces, such as being
    conservative in stacks while being precise in dynamic heaps, or being stop-world in thread private heaps while being concurrent or parallel
    in shared heaps. Etc.


    Concurrent GC (aka incremental) runs as a co-routine with the mutator.
    These systems are distinguished by how they are triggered to run, and
    what bounds may be placed on their execution time. There are
    concurrent systems having completely deterministic operation [their
    actual execution time, of course, may depend on factors beyond the
    GC's control, such as multitasking, caching or paging.]

    Parallel GC may be both prioritized and scheduled. These systems may
    offer some guarantees about the percentage of (application) time given
    to collector vs mutator(s).


    Major choices:

    - precise or conservative?
    - moving or non-moving?
    - tracing (marking)?
    - copying / compacting?
    - stop-world, concurrent, or parallel?

    - single or multiple spaces?
    - semispaces?
    - generational?

    Minor choices:

    - software-only or hardware (MMU) assisted?
    - snapshot at beginning?

    - bump or block allocation?
    - allocation color?

    - free blocks coalesced? {if not compacting}

    - multiple mutators?
    - mutation color?
    - writable shared objects?
    - FROM-space mutation?
    - finalization?


    Note that all of these represent free dimensions in design. As
    mentioned above, any particular system may implement multiple
    collection algorithms each embodying a different set of choices.

    I'm familiar with many or perhaps most of the variations
    and techniques used in garbage collection. That isn't
    what I was asking about.

    You may wonder why I didn't mention "sweeping" ... essentially it is
    because sequential scan is more an implementation detail than a
    technique. Although "mark/sweep" is a well established technique, it
    is the marking (tracing) that really defines it. Then too, modern
    collectors often are neither mark/sweep nor copying as presented in textbooks: e.g., there are systems that mark and copy, systems that
    sweep and copy (without marking), and "copying" systems in which
    copies are logical and nothing actually changes address.

    Aside: all GC can be considered to use [logical] semispaces because
    all have the notion of segregated FROM-space and TO-space during
    collection. True semispaces are a set of (address) contiguous spaces
    - not necessarily equally sized - which are rotated as targets for new allocation. True semispaces do imply physical copying [but see the
    Treadmill for an example of "non-moving copy" using logical
    semispaces].



    So what do I consider to be the "kind" of a GC?

    The choices above pretty much define the extent of the design space
    [but note I did intentionally leave out reference counting]. However,
    the first 8 choices are structural, whereas the rest specify important characteristics but don't affect structure.

    A particular "kind" might be, e.g.,
    "precise, generational, multispace, non-moving, concurrent
    tracer".
    and so on.

    In effect what you are saying is that if we list all the possible
    attributes that a GC implementation might have, we can characterize
    what kind of GC it is by giving its value for each attribute. Not
    really a helpful statement.

    I'm guessing this probably didn't really answer your question,

    Your comments didn't address either of my questions, nor as best
    I can tell even make an effort to do so.

    but it was fun to write. ;-)

    I see. Next time I'll know better.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)