• Re: Page tables and TLBs [was Interrupts on the Concertina II]

    From EricP@21:1/5 to All on Thu Jan 25 09:47:26 2024
    MitchAlsup1 wrote:
    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    One accomplishes the same effect by caching the interior PTE nodes
    for each of the HV and GuestOS tables separately on the downward walk,
    and hold the combined nested table mapping in the TLB.
    The bottom-up table walkers on each interior PTE cache should
    eliminate 98% of the PTE reads with none of the headaches.

    I call these things:: TableWalk Accelerators.

    Given CAMs at your access, one can cache the outer layers and short
    circuit most of the MMU accesses--such that you don't simply read the
    Accelerator RAM 25 times (two 5-level tables), you CAM down both
    GuestOS and HV tables so only walk the parts not in your CAM. {And
    then put them in your CAM.} A density trick is for each CAM to have
    access to a whole cache line of PTEs (8 in my case).
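
    A minimal sketch of that density trick in C, assuming x86-style 8-byte
    PTEs in a 64-byte cache line; the structure and names are illustrative,
    not the actual My 66000 layout. One CAM tag covers a whole line of eight
    PTEs, so a single match can short-circuit a level for eight adjacent
    translations:

        #include <stdint.h>

        #define PTES_PER_LINE 8            /* 64-byte line / 8-byte PTE */

        /* Hypothetical TWA entry: one CAM tag maps a whole line of PTEs. */
        struct twa_entry {
            uint64_t va_tag;               /* VA bits above this level's index */
            uint64_t line_pa;              /* PA of the PTE line, for snooping */
            uint64_t pte[PTES_PER_LINE];   /* cached copy of the whole line    */
            uint8_t  level;                /* which table level this caches    */
            uint8_t  valid;                /* cleared when a snoop hits line_pa */
        };

        /* On a hit at this level, return one of the 8 PTEs directly,
           skipping the memory read for that level of the walk. */
        static inline uint64_t twa_select(const struct twa_entry *e,
                                          uint64_t va, unsigned idx_shift) {
            return e->pte[(va >> idx_shift) & (PTES_PER_LINE - 1)];
        }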

    An idea I had here was to allow the OS more explicit control
    over the invalidates of the interior-node caches.

    The interior nodes, stored in the CAM, retain their physical address,
    and are snooped, so no invalidation is required. ANY write to them is
    seen and the entry invalidates itself.

    On My66000, yes, but other cores don't have automatically coherent TLBs.
    This feature is intended for that general rabble.

    Just to play devil's advocate...

    To snoop page table updates, the My66000 TLB would need a large CAM with
    all the physical addresses of the PTEs' source cache lines parallel to
    the virtual and ASID CAMs, and route the cache line invalidates through it.

    While the virtual index CAMs are separated into different banks,
    one for each page table level, the P.A. CAM covers all entries in all
    banks. This extra P.A. CAM will have a lot of entries and therefore be slow.

    Also routing the Invalidate messages through the TLB could slow down all
    their ACK messages, even though there is very low probability of a hit,
    because page tables update relatively infrequently.

    Also the L2 TLBs, called the STLB for Second-level TLB by Intel,
    are set associative, and would have to be virtually indexed and virtually
    tagged with both VA and ASID plus table level to select the address mask.
    On Skylake the STLB for 4k/2M pages is 128-rows*12-way; for 1G it is
    4-rows*4-way.

    How can My66000 look up STLB entries by an invalidate's physical line address?
    It would have to scan all 128 rows for each message.

    On x86/x64 the interior cache invalidation had to be backwards compatible,
    so the INVLPG instruction has to guess what besides the main TLB needs to
    be invalidated, and it has to do so in a conservative (i.e. paranoid)
    manner. So it tosses these interior PTEs just in case, which means they
    have to be reloaded on the next TLB miss.

    The OS knows which paging levels it is recycling memory for and
    can provide finer-grain control for these TLB invalidates.
    The INVLPG and INVPCID instructions need a control bit mask allowing the
    OS to invalidate just the TLB levels it is changing for a virtual address,
    as sketched below.
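
    For concreteness, a sketch of what such a level-masked invalidate could
    look like from the OS side, in C. The real INVLPG takes no such mask;
    the invlpg_masked() intrinsic and the mask encoding below are
    hypothetical, purely to illustrate the proposal:

        #include <stdint.h>

        /* Hypothetical level-mask bits for a finer-grain INVLPG/INVPCID. */
        #define INV_LEAF   (1u << 0)   /* leaf TLB entry             */
        #define INV_PDE    (1u << 1)   /* level-2 interior PTE cache */
        #define INV_PDPTE  (1u << 2)   /* level-3 interior PTE cache */
        #define INV_PML4E  (1u << 3)   /* level-4 interior PTE cache */

        /* Hypothetical intrinsic: invalidate only the named levels for va. */
        void invlpg_masked(void *va, uint32_t level_mask);

        void unmap_one_page(void *va) {
            /* Only the leaf PTE changed, so the interior caches stay hot. */
            invlpg_masked(va, INV_LEAF);
        }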

    The OS or HV does not need to bother with any of this on My 66000.

    And for OS debugging purposes, all these HW TLB tables need to be
    readable and writable by some means (as control registers or whatever).
    Because when something craps out, what's in memory may not be the same
    as what was loaded into HW some time ago. A debugger should be able to
    look into and manipulate these HW structures.

    All control registers, including the TLB CAMs, are accessible via MMI/O
    accesses, so a remote core can decide what a crashed core was doing at
    the instant of the crash.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Thu Jan 25 17:12:42 2024
    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    One accomplishes the same effect by caching the interior PTE nodes
    for each of the HV and GuestOS tables separately on the downward walk,
    and hold the combined nested table mapping in the TLB.
    The bottom-up table walkers on each interior PTE cache should
    eliminate 98% of the PTE reads with none of the headaches.

    I call these things:: TableWalk Accelerators.

    Given CAMs at your access, one can cache the outer layers and short
    circuit most of the MMU accesses--such that you don't simply read the
    Accelerator RAM 25 times (two 5-level tables), you CAM down both
    GuestOS and HV tables so only walk the parts not in your CAM. {And
    then put them in your CAM.} A density trick is for each CAM to have
    access to a whole cache line of PTEs (8 in my case).

    An idea I had here was to allow the OS more explicit control
    over the invalidates of the interior-node caches.

    The interior nodes, stored in the CAM, retain their physical address,
    and are snooped, so no invalidation is required. ANY write to them is
    seen and the entry invalidates itself.

    On My66000, yes, but other cores don't have automatically coherent TLBs.
    This feature is intended for that general rabble.

    Just to play devil's advocate...

    To snoop page table updates, the My66000 TLB would need a large CAM with
    all the physical addresses of the PTEs' source cache lines parallel to
    the virtual and ASID CAMs, and route the cache line invalidates through it.

    Yes, .....

    While the virtual index CAMs are separated into different banks,
    one for each page table level, the P.A. CAM covers all entries in all
    banks. This extra P.A. CAM will have a lot of entries and therefore be slow.

    That is the TWA.

    Also routing the Invalidate messages through the TLB could slow down all
    their ACK messages, even though there is very low probability of a hit,
    because page tables update relatively infrequently.

    TLBs don't ACK; they self-invalidate. And they can be performing a
    translation while self-invalidating.

    Also the L2 TLBs, called the STLB for Second-level TLB by Intel,
    are set associative, and would have to be virtually indexed and virtually
    tagged with both VA and ASID plus table level to select the address mask.
    On Skylake the STLB for 4k/2M pages is 128-rows*12-way; for 1G it is
    4-rows*4-way.

    All TLB walks are performed with RealPA.
    All Snoops are performed with RealPA.

    How can My66000 look up STLB entries by an invalidate's physical line address?
    It would have to scan all 128 rows for each message.

    It is not structured like the Intel L2 TLB.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Fri Jan 26 10:47:38 2024
    MitchAlsup1 wrote:
    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    One accomplishes the same effect by caching the interior PTE nodes
    for each of the HV and GuestOS tables separately on the downward walk,
    and hold the combined nested table mapping in the TLB.
    The bottom-up table walkers on each interior PTE cache should
    eliminate 98% of the PTE reads with none of the headaches.

    I call these things:: TableWalk Accelerators.

    Given CAMs at your access, one can cache the outer layers and short
    circuit most of the MMU accesses--such that you don't simply read the
    Accelerator RAM 25 times (two 5-level tables), you CAM down both
    GuestOS and HV tables so only walk the parts not in your CAM. {And
    then put them in your CAM.} A density trick is for each CAM to have
    access to a whole cache line of PTEs (8 in my case).

    An idea I had here was to allow the OS more explicit control
    over the invalidates of the interior-node caches.

    The interior nodes, stored in the CAM, retain their physical address,
    and are snooped, so no invalidation is required. ANY write to them is
    seen and the entry invalidates itself.

    On My66000, yes, but other cores don't have automatically coherent TLBs.
    This feature is intended for that general rabble.

    Just to play devil's advocate...

    To snoop page table updates, the My66000 TLB would need a large CAM with
    all the physical addresses of the PTEs' source cache lines parallel to
    the virtual and ASID CAMs, and route the cache line invalidates through it.

    Yes, .....

    While the virtual index CAMs are separated into different banks,
    one for each page table level, the P.A. CAM covers all entries in all
    banks. This extra P.A. CAM will have a lot of entries and therefore be slow.

    That is the TWA.

    No, not the Table Walk Accelerator. I'm thinking the PA CAM would
    only need to be accessed for cache line invalidates. However it would be
    very inconvenient if the TLB CAMs had faster access time for virtual
    address lookups than for physical address lookups, so the access time
    would be the longer of the two, that being PA.

    Basically I'm suggesting the big PA CAM slows down VA translates
    and therefore possibly all memory accesses.

    Also routing the Invalidate messages through the TLB could slow down all
    their ACK messages, even though there is very low probability of a hit,
    because page tables update relatively infrequently.

    TLBs don't ACK; they self-invalidate. And they can be performing a
    translation while self-invalidating.

    Hmmm... Danger, Will Robinson. Most OS page table management depends on
    synchronizing after the shootdowns complete on all affected cores.

    The basic safe sequence is:
    - acquire page table mutex
    - modify PTE in memory for a VA
    - issue IPI's with VA to all cores that might have a copy in TLB
    - invalidate local TLB for VA
    - wait for IPI ACK's from remote cores
    - release mutex

    If it doesn't wait for shootdown ACKs then it might be possible for a
    stale PTE copy to exist and be used after the mutex is released.
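
    For concreteness, a minimal C-flavored sketch of that conventional
    sequence; the lock, IPI, and invalidate primitives here are hypothetical
    stand-ins for whatever the OS actually provides:

        #include <stdint.h>

        /* Hypothetical OS primitives, named for illustration only. */
        extern void lock(void *m);
        extern void unlock(void *m);
        extern void write_pte(uint64_t *pte, uint64_t newval);
        extern void send_shootdown_ipi(uint64_t cpumask, void *va);
        extern void invalidate_local_tlb(void *va);
        extern void wait_for_ipi_acks(uint64_t cpumask);

        void update_mapping(void *pt_mutex, uint64_t *pte, uint64_t newval,
                            void *va, uint64_t other_cpus) {
            lock(pt_mutex);                     /* acquire page table mutex   */
            write_pte(pte, newval);             /* modify PTE in memory       */
            send_shootdown_ipi(other_cpus, va); /* cores that may hold a copy */
            invalidate_local_tlb(va);           /* invalidate local TLB entry */
            wait_for_ipi_acks(other_cpus);      /* no stale copy remains...   */
            unlock(pt_mutex);                   /* ...before the mutex drops  */
        }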

    Also the L2 TLBs, called the STLB for Second-level TLB by Intel,
    are set associative, and would have to be virtually indexed and virtually
    tagged with both VA and ASID plus table level to select the address mask.
    On Skylake the STLB for 4k/2M pages is 128-rows*12-way; for 1G it is
    4-rows*4-way.

    All TLB walks are performed with RealPA.
    All Snoops are performed with RealPA.

    How can My66000 look up STLB entries by an invalidate's physical line address?
    It would have to scan all 128 rows for each message.

    It is not structured like the Intel L2 TLB.

    Are you saying the My66000 STLB is physically indexed, physically tagged?
    How's this work for a bottom-up table walk (aka your TWA)?

    The only way I know to do a bottom-up walk is to use the portion of the
    VA for the higher index level to get the PA of the page table page.
    That requires lookup by a masked portion of the VA with the ASID.
    The bottom-up walk is done by making the VA mask shorter for each level.
    This implies a Virtually Indexed Virtually Tagged PTE cache.

    The VIVT PTE cache implies that certain TLB VA or ASID invalidates require
    a full STLB table scan, which could be up to 128 clocks for that instruction.
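
    A sketch of that bottom-up lookup in C: probe the PTE cache with
    progressively shorter VA masks, one per level. The 9-bit level strides
    are x86-style assumptions, and ptc_lookup() is a hypothetical VIVT
    PTE-cache probe, not any shipping design:

        #include <stdbool.h>
        #include <stdint.h>

        /* Hypothetical VIVT PTE-cache probe: match (asid, masked VA, level). */
        extern bool ptc_lookup(uint16_t asid, uint64_t va_masked, int level,
                               uint64_t *table_pa);

        /* Bottom-up walk: the VA mask gets shorter at each higher level.
           Level 1 is the 4K leaf, 2 is 2M, 3 is 1G, and so on upward. */
        bool bottom_up_walk(uint16_t asid, uint64_t va, uint64_t *table_pa,
                            int *hit_level) {
            static const int shift[] = { 0, 12, 21, 30, 39, 48 };
            for (int level = 1; level <= 5; level++) {
                uint64_t mask = ~0ull << shift[level];
                if (ptc_lookup(asid, va & mask, level, table_pa)) {
                    *hit_level = level;  /* resume the walk below this level */
                    return true;
                }
            }
            return false;                /* full top-down walk from the root */
        }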

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Fri Jan 26 21:43:21 2024
    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    One accomplishes the same effect by caching the interior PTE nodes
    for each of the HV and GuestOS tables separately on the downward walk,
    and hold the combined nested table mapping in the TLB.
    The bottom-up table walkers on each interior PTE cache should
    eliminate 98% of the PTE reads with none of the headaches.

    I call these things:: TableWalk Accelerators.

    Given CAMs at your access, one can cache the outer layers and short
    circuit most of the MMU accesses--such that you don't simply read the
    Accelerator RAM 25 times (two 5-level tables), you CAM down both
    GuestOS and HV tables so only walk the parts not in your CAM. {And
    then put them in your CAM.} A density trick is for each CAM to have
    access to a whole cache line of PTEs (8 in my case).

    An idea I had here was to allow the OS more explicit control
    over the invalidates of the interior-node caches.

    The interior nodes, stored in the CAM, retain their physical address,
    and are snooped, so no invalidation is required. ANY write to them is
    seen and the entry invalidates itself.

    On My66000, yes, but other cores don't have automatically coherent TLBs.
    This feature is intended for that general rabble.

    Just to play devil's advocate...

    To snoop page table updates, the My66000 TLB would need a large CAM with
    all the physical addresses of the PTEs' source cache lines parallel to
    the virtual and ASID CAMs, and route the cache line invalidates through it.

    Yes, .....

    While the virtual index CAMs are separated into different banks,
    one for each page table level, the P.A. CAM covers all entries in all
    banks. This extra P.A. CAM will have a lot of entries and therefore be slow.

    That is the TWA.

    No, not the Table Walk Accelerator. I'm thinking the PA CAM would
    only need to be accessed for cache line invalidates. However it would be
    very inconvenient if the TLB CAMs had faster access time for virtual
    address lookups than for physical address lookups, so the access time
    would be the longer of the two, that being PA.

           VA                                PA
            |                                 |
            V                                 V
      +-----------+   +-----------+-+   +-----------+
      |  VA CAM   |-->|   PTEs    |v|<--|  PA CAM   |
      +-----------+   +-----------+-+   +-----------+

    Basically I'm suggesting the big PA CAM slows down VA translates
    and therefore possibly all memory accesses.

    It is a completely independent and concurrent hunk of logic that only
    has access to the valid bit and can only clear the valid bit.
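
    In C terms, corresponding to the diagram above (field names are
    illustrative, not any actual design): the PA snoop port shares nothing
    with the VA lookup side except the valid bit, which it can only clear:

        #include <stdatomic.h>
        #include <stdint.h>

        /* Hypothetical TLB entry with two independent match ports. */
        struct tlb_entry {
            uint64_t    va_tag;   /* matched by the VA CAM on translations */
            uint16_t    asid;
            uint64_t    pte;      /* the cached translation                */
            uint64_t    line_pa;  /* matched by the PA CAM on snoops       */
            atomic_bool valid;    /* the ONLY state the snoop port touches */
        };

        /* Snoop side: on any write to the PTE's source line, clear valid.
           This can proceed concurrently with a VA-side translation. */
        void snoop_invalidate(struct tlb_entry *e, uint64_t written_line_pa) {
            if (e->line_pa == written_line_pa)
                atomic_store(&e->valid, false);
        }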

    Also routing the Invalidate messages through the TLB could slow down all
    their ACK messages, even though there is very low probability of a hit,
    because page tables update relatively infrequently.

    TLBs don't ACK; they self-invalidate. And they can be performing a
    translation while self-invalidating.

    Hmmm... Danger, Will Robinson. Most OS page table management depends on
    synchronizing after the shootdowns complete on all affected cores.

    The basic safe sequence is:
    - acquire page table mutex
    - modify PTE in memory for a VA

    Here you have obtained write permission on the line containing the PTE
    being modified. By the time you have obtained write permission, all
    other TLBs will have been invalidated.

    - issue IPI's with VA to all cores that might have a copy in TLB
    - invalidate local TLB for VA
    - wait for IPI ACK's from remote cores
    - release mutex

    If it doesn't wait for shootdown ACKs then it might be possible for a
    stale PTE copy to exist and be used after the mutex is released.

    Race condition does not exist. By the time the core modifying the PTE
    obtains write permission, all the TLBs have been cleared of that entry.
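
    Under that model the shootdown steps collapse into the coherence
    protocol itself; a sketch using the same hypothetical primitives as
    the conventional sequence above:

        #include <stdint.h>

        extern void lock(void *m);
        extern void unlock(void *m);
        extern void write_pte(uint64_t *pte, uint64_t newval);

        /* With snooped, self-invalidating TLBs, the store's request for
           write permission on the PTE's line is itself the shootdown:
           every TLB holding a copy drops it before the store completes. */
        void update_mapping_coherent(void *pt_mutex, uint64_t *pte,
                                     uint64_t newval) {
            lock(pt_mutex);
            write_pte(pte, newval);  /* remote TLB copies already gone */
            unlock(pt_mutex);        /* no IPIs, no ACK wait           */
        }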

    Also the L2 TLBs, called the STLB for Second-level TLB by Intel,
    are set associative, and would have to be virtually indexed and virtually
    tagged with both VA and ASID plus table level to select the address mask.
    On Skylake the STLB for 4k/2M pages is 128-rows*12-way; for 1G it is
    4-rows*4-way.

    All TLB walks are performed with RealPA.
    All Snoops are performed with RealPA.

    How can My66000 look up STLB entries by an invalidate's physical line address?
    It would have to scan all 128 rows for each message.

    It is not structured like the Intel L2 TLB.

    Are you saying the My66000 STLB is physically indexed, physically tagged?
    How's this work for a bottom-up table walk (aka your TWA)?

    L2 TLB is a different structure (SRAM) than TWAs (CAM).
    I can't talk about it:: as Ivan used to say:: NYF.

    The only way I know to do a bottom-up walk is to use the portion of the
    VA for the higher index level to get the PA of the page table page.

    I <actually> did not say I did a bottom-up walk. I said I short-circuited
    the table walks for those layers where I have recent translation PTPs.
    It's more like CAM the Root to the last PTP, and every CAM that hits is
    one layer you don't have to access.

    That requires lookup by a masked portion of the VA with the ASID.
    The bottom-up walk is done by making the VA mask shorter for each level.
    This implies a Virtually Indexed Virtually Tagged PTE cache.

    The VIVT PTE cache implies that certain TLB VA or ASID invalidates require
    a full STLB table scan, which could be up to 128 clocks for that instruction.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Sun Jan 28 13:35:20 2024
    MitchAlsup1 wrote:
    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    One accomplishes the same effect by caching the interior PTE nodes
    for each of the HV and GuestOS tables separately on the downward walk,
    and hold the combined nested table mapping in the TLB.
    The bottom-up table walkers on each interior PTE cache should
    eliminate 98% of the PTE reads with none of the headaches.

    I call these things:: TableWalk Accelerators.

    Given CAMs at your access, one can cache the outer layers and short
    circuit most of the MMU accesses--such that you don't simply read the
    Accelerator RAM 25 times (two 5-level tables), you CAM down both
    GuestOS and HV tables so only walk the parts not in your CAM. {And
    then put them in your CAM.} A density trick is for each CAM to have
    access to a whole cache line of PTEs (8 in my case).

    An idea I had here was to allow the OS more explicit control
    over the invalidates of the interior-node caches.

    The interior nodes, stored in the CAM, retain their physical address,
    and are snooped, so no invalidation is required. ANY write to them is
    seen and the entry invalidates itself.

    On My66000, yes, but other cores don't have automatically coherent TLBs.
    This feature is intended for that general rabble.

    Just to play devil's advocate...

    To snoop page table updates, the My66000 TLB would need a large CAM with
    all the physical addresses of the PTEs' source cache lines parallel to
    the virtual and ASID CAMs, and route the cache line invalidates through it.

    Yes, .....

    While the virtual index CAMs are separated into different banks,
    one for each page table level, the P.A. CAM covers all entries in all
    banks. This extra P.A. CAM will have a lot of entries and therefore be slow.

    That is the TWA.

    No, not the Table Walk Accelerator. I'm thinking the PA CAM would
    only need to be accessed for cache line invalidates. However it would be
    very inconvenient if the TLB CAMs had faster access time for virtual
    address lookups than for physical address lookups, so the access time
    would be the longer of the two, that being PA.

           VA                                PA
            |                                 |
            V                                 V
      +-----------+   +-----------+-+   +-----------+
      |  VA CAM   |-->|   PTEs    |v|<--|  PA CAM   |
      +-----------+   +-----------+-+   +-----------+

    Of course, but for a 5 or 6 level page table you'd have a CAM bank
    for each level to search in parallel. The loading on the PA path
    would be the same as if you had a CAM as large as the sum of all entries.

    But as you point out below, this shouldn't be an issue because
    it has little to do after the lookup.

    Basically I'm suggesting the big PA CAM slows down VA translates
    and therefore possibly all memory accesses.

    It is a completely independent and concurrent hunk of logic that only
    has access to the valid bit and can only clear the valid bit.

    Yes, ok. The lookup on the PA path may take longer but there is
    little to do on a hit so the total path length is shorter,
    so PA invalidate won't be on the critical path for the MMU.

    Also routing the Invalidate messages through the TLB could slow down all
    their ACK messages, even though there is very low probability of a hit,
    because page tables update relatively infrequently.

    TLBs don't ACK; they self-invalidate. And they can be performing a
    translation while self-invalidating.

    Hmmm... Danger, Will Robinson. Most OS page table management depends on
    synchronizing after the shootdowns complete on all affected cores.

    The basic safe sequence is:
    - acquire page table mutex
    - modify PTE in memory for a VA

    Here you have obtained write permission on the line containing the PTE
    being modified. By the time you have obtained write permission, all
    other TLBs will have been invalidated.

    It means you can't use the outer cache levels to filter invalidates.
    You'd have to pass all invalidate messages from the coherence network
    directly to the TLB PA CAM, bypassing the cache hierarchy, to ensure the
    TLB entry is removed before the cache ACKs the invalidate message.

    - issue IPI's with VA to all cores that might have a copy in TLB
    - invalidate local TLB for VA
    - wait for IPI ACK's from remote cores
    - release mutex

    If it doesn't wait for shootdown ACKs then it might be possible for a
    stale PTE copy to exist and be used after the mutex is released.

    Race condition does not exist. By the time the core modifying the PTE
    obtains write permission, all the TLBs have been cleared of that entry.

    Ok

    Also the L2 TLBs, called the STLB for Second-level TLB by Intel,
    are set associative, and would have to be virtually indexed and virtually
    tagged with both VA and ASID plus table level to select the address mask.
    On Skylake the STLB for 4k/2M pages is 128-rows*12-way; for 1G it is
    4-rows*4-way.

    All TLB walks are performed with RealPA.
    All Snoops are performed with RealPA.

    How can My66000 look up STLB entries by an invalidate's physical line address?
    It would have to scan all 128 rows for each message.

    It is not structured like the Intel L2 TLB.

    Are you saying the My66000 STLB is physically indexed, physically tagged?
    How's this work for a bottom-up table walk (aka your TWA)?

    L2 TLB is a different structure (SRAM) than TWAs (CAM).
    I can't talk about it:: as Ivan used to say:: NYF.

    Rats.

    The only way I know to do a bottom-up walk is to use the portion of the
    VA for the higher index level to get the PA of the page table page.

    I <actually> did not say I did a bottom-up walk. I said I short-circuited
    the table walks for those layers where I have recent translation PTPs.
    It's more like CAM the Root to the last PTP, and every CAM that hits is
    one layer you don't have to access.

    What I call a bottom-up walk can be performed in parallel, serial,
    or a bit of both, across the banks for each page table level.

    I'd have a VA TLB lookup in parallel for page levels 1, 2 and 3 (4K, 2M,
    1G), and if all three miss then do sequential lookups for levels 4, 5, 6.
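
    A sketch of that parallel-then-serial policy, reusing the hypothetical
    ptc_lookup() probe from the earlier sketch and extending it to a
    6-level table; the 4K/2M/1G split is the one named above:

        #include <stdbool.h>
        #include <stdint.h>

        extern bool ptc_lookup(uint16_t asid, uint64_t va_masked, int level,
                               uint64_t *pa);

        /* Levels 1-3 (4K, 2M, 1G) probe in parallel in hardware; shown
           sequentially here. On a triple miss, fall back to serial probes
           of the interior levels 4-6. */
        bool lookup_policy(uint16_t asid, uint64_t va, uint64_t *pa,
                           int *level) {
            static const int shift[] = { 0, 12, 21, 30, 39, 48, 57 };
            for (int l = 1; l <= 3; l++)        /* parallel bank probes   */
                if (ptc_lookup(asid, va & (~0ull << shift[l]), l, pa)) {
                    *level = l;
                    return true;
                }
            for (int l = 4; l <= 6; l++)        /* serial, on triple miss */
                if (ptc_lookup(asid, va & (~0ull << shift[l]), l, pa)) {
                    *level = l;
                    return true;
                }
            return false;                       /* full walk from root    */
        }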

    That requires lookup by a masked portion of the VA with the ASID.
    The bottom-up walk is done by making the VA mask shorter for each level.
    This implies a Virtually Indexed Virtually Tagged PTE cache.

    The VIVT PTE cache implies that certain TLB VA or ASID invalidates require
    a full STLB table scan, which could be up to 128 clocks for that instruction.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Sun Jan 28 21:10:51 2024
    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    One accomplishes the same effect by caching the interior PTE nodes
    for each of the HV and GuestOS tables separately on the downward walk,
    and hold the combined nested table mapping in the TLB.
    The bottom-up table walkers on each interior PTE cache should
    eliminate 98% of the PTE reads with none of the headaches.

    I call these things:: TableWalk Accelerators.

    Given CAMs at your access, one can cache the outer layers and short
    circuit most of the MMU accesses--such that you don't simply read the
    Accelerator RAM 25 times (two 5-level tables), you CAM down both
    GuestOS and HV tables so only walk the parts not in your CAM. {And
    then put them in your CAM.} A density trick is for each CAM to have
    access to a whole cache line of PTEs (8 in my case).

    An idea I had here was to allow the OS more explicit control
    over the invalidates of the interior-node caches.

    The interior nodes, stored in the CAM, retain their physical address,
    and are snooped, so no invalidation is required. ANY write to them is
    seen and the entry invalidates itself.

    On My66000, yes, but other cores don't have automatically coherent TLBs.
    This feature is intended for that general rabble.

    Just to play devil's advocate...

    To snoop page table updates, the My66000 TLB would need a large CAM with
    all the physical addresses of the PTEs' source cache lines parallel to
    the virtual and ASID CAMs, and route the cache line invalidates through it.

    Yes, .....

    While the virtual index CAMs are separated into different banks,
    one for each page table level, the P.A. CAM covers all entries in all
    banks. This extra P.A. CAM will have a lot of entries and therefore be slow.
    That is the TWA.

    No, not the Table Walk Accelerator. I'm thinking the PA CAM would
    only need to be accessed for cache line invalidates. However it would be
    very inconvenient if the TLB CAMs had faster access time for virtual
    address lookups than for physical address lookups, so the access time
    would be the longer of the two, that being PA.

           VA                                PA
            |                                 |
            V                                 V
      +-----------+   +-----------+-+   +-----------+
      |  VA CAM   |-->|   PTEs    |v|<--|  PA CAM   |
      +-----------+   +-----------+-+   +-----------+

    Of course, but for a 5 or 6 level page table you'd have a CAM bank
    for each level to search in parallel. The loading on the PA path
    would be the same as if you had a CAM as large as the sum of all entries.

    What you see above is the TLB.
    What your above paragraph talks about is what I call the TWA.

    But as you point out below, this shouldn't be an issue because
    it has little to do after the lookup.

    Basically it only has to wait for SNOOPs and for TLB misses.

    Basically I'm suggesting the big PA CAM slows down VA translates
    and therefore possibly all memory accesses.

    It is a completely independent and concurrent hunk of logic that only has
    access to the valid bit and can only clear the valid bit.

    Yes, ok. The lookup on the PA path may take longer but there is
    little to do on a hit so the total path length is shorter,
    so PA invalidate won't be on the critical path for the MMU.

    Also routing the Invalidate messages through the TLB could slow down all
    their ACK messages, even though there is very low probability of a hit,
    because page tables update relatively infrequently.

    TLBs don't ACK; they self-invalidate. And they can be performing a
    translation while self-invalidating.

    Hmmm... Danger, Will Robinson. Most OS page table management depends on
    synchronizing after the shootdowns complete on all affected cores.

    The basic safe sequence is:
    - acquire page table mutex
    - modify PTE in memory for a VA

    Here you have obtained write permission on the line containing the PTE
    being modified. By the time you have obtained write permission, all
    other TLBs will have been invalidated.

    It means you can't use the outer cache levels to filter invalidates.
    You'd have to pass all invalidate messages from the coherence network
    directly to the TLB PA CAM, bypassing the cache hierarchy, to ensure the
    TLB entry is removed before the cache ACKs the invalidate message.

    With my exclusive cache, I have to do that anyway. With wider-than-register
    accesses I am already in a position where I have BW for these SNOOPs
    with little overhead on either channel.

    - issue IPI's with VA to all cores that might have a copy in TLB
    - invalidate local TLB for VA
    - wait for IPI ACK's from remote cores
    - release mutex

    If it doesn't wait for shootdown ACKs then it might be possible for a
    stale PTE copy to exist and be used after the mutex is released.

    Race condition does not exist. By the time the core modifying the PTE
    obtains write permission, all the TLBs have been cleared of that entry.

    Ok

    Also the L2 TLBs, called the STLB for Second-level TLB by Intel,
    are set associative, and would have to be virtually indexed and virtually
    tagged with both VA and ASID plus table level to select the address mask.
    On Skylake the STLB for 4k/2M pages is 128-rows*12-way; for 1G it is
    4-rows*4-way.

    All TLB walks are performed with RealPA.
    All Snoops are performed with RealPA.

    How can My66000 look up STLB entries by an invalidate's physical line address?
    It would have to scan all 128 rows for each message.

    It is not structured like the Intel L2 TLB.

    Are you saying the My66000 STLB is physically indexed, physically tagged?
    How's this work for a bottom-up table walk (aka your TWA)?

    L2 TLB is a different structure (SRAM) than TWAs (CAM).
    I can't talk about it:: as Ivan used to say:: NYF.

    Rats.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)