EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
One accomplishes the same effect by caching the interior PTE nodes
for each of the HV and GuestOS tables separately on the downward walk,
and holding the combined nested table mapping in the TLB.
The bottom-up table walkers on each interior PTE cache should
eliminate 98% of the PTE reads with none of the headaches.
I call these things:: TableWalk Accelerators.
Given CAMs at your access, one can cache the outer layers and short
circuit most of the MMU accesses--such that you don't simply read the
Accelerator RAM 25 times (two 5-level tables), you CAM down both
GuestOS and HV tables so you only walk the parts not in your CAM. {And
then put them in your CAM.} A density trick is for each CAM to have
access to a whole cache line of PTEs (8 in my case).
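To make the 25 concrete: in a nested walk, each of the five GuestOS-table
reads is a guest-physical address that itself needs a walk of the 5-level
HV table, so an uncached 2-D walk costs on the order of 5 x 5 = 25 PTE
reads. Below is a minimal sketch of the short-circuit for a single table;
twa_lookup/twa_insert/read_pte and the CAM tagging are hypothetical
models, not the actual My 66000 hardware, and in the nested case the same
trick applies to the HV legs of the walk.

    #include <stdint.h>
    #include <stdbool.h>

    #define LEVELS 5            /* one 5-level table; level 0 holds leaf PTEs */

    /* Hypothetical per-level CAM: a hit yields the physical address of
     * the page-table page that serves 'va' at that level. */
    bool twa_lookup(int level, uint64_t va, uint64_t *table_pa);
    void twa_insert(int level, uint64_t va, uint64_t table_pa);
    /* One memory read of the PTE selected by 'va' at 'level'. */
    uint64_t read_pte(uint64_t table_pa, uint64_t va, int level);

    uint64_t walk(uint64_t root_pa, uint64_t va, int *mem_reads)
    {
        int level = LEVELS - 1;
        uint64_t table_pa = root_pa;

        /* CAM down the table: start at the cached node closest to the leaf. */
        for (int l = 0; l < LEVELS; l++)
            if (twa_lookup(l, va, &table_pa)) { level = l; break; }

        /* Walk only the levels the CAMs did not cover. */
        for (;; level--) {
            uint64_t pte = read_pte(table_pa, va, level);
            (*mem_reads)++;
            if (level == 0)
                return pte;                      /* leaf translation */
            table_pa = pte & ~0xFFFull;          /* next page-table page */
            twa_insert(level - 1, va, table_pa); /* "then put them in your CAM" */
        }
    }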
An idea I had here was to allow the OS more explicit control
for the invalidates of the interior node caches.
The interior nodes, stored in the CAM, retain their physical address, and
are snooped, so no invalidation is required. ANY write to them is seen and the entry invalidates itself.
On x86/x64 the interior cache invalidation had to be backwards compatible,
so the INVLPG instruction has to guess what besides the main TLB needs to be
invalidated, and it has to do so in a conservative (ie paranoid) manner.
So it tosses these interior PTEs just in case, which means they
have to be reloaded on the next TLB miss.
The OS knows which paging levels it is recycling memory for and
can provide finer-grained control for these TLB invalidates.
The INVLPG and INVPCID instructions need a control bit mask allowing the OS
to invalidate just the TLB levels it is changing for a virtual address.
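To sketch what that might look like (the invlpg_levels() intrinsic and
its mask encoding are hypothetical, not an existing x86 instruction):

    #include <stdint.h>

    /* Hypothetical finer-grained INVLPG: bit i of 'levels' requests
     * invalidation of cached level-i interior entries covering 'va';
     * bit 0 is the leaf TLB entry itself. Not a real x86 instruction. */
    void invlpg_levels(void *va, uint32_t levels);

    /* Example: the OS freed a level-1 page-table page (the PTEs mapping
     * 2MB around va), so it flushes the leaf TLB entry and the cached
     * level-1 interior node, but keeps levels 2..4 cached. */
    void recycle_pt_page(void *va)
    {
        invlpg_levels(va, (1u << 0) | (1u << 1));
    }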
Neither OS nor HV needs to bother in My 66000.
And for OS debugging purposes, all these HW TLB tables need to be readable
and writable by some means (as control registers or whatever).
Because when something craps out, what's in memory may not be the same
as what was loaded into HW some time ago. A debugger should be able to
look into and manipulate these HW structures.
All control registers, including the TLB CAMs, are accessible via MMI/O
accesses. So a remote core can decide what a crashed core was doing at
the instant of the crash.
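As a sketch of what such remote inspection could look like (the MMIO base
address and register layout here are invented for illustration):

    #include <stdint.h>

    /* Hypothetical MMIO window onto a remote core's control registers;
     * the base address and layout are invented for illustration. */
    #define CORE_CREG_BASE(core)  (0xF0000000ull + (core) * 0x10000ull)
    #define TLB_CAM_OFFSET        0x4000

    volatile uint64_t *tlb_cam(unsigned core, unsigned entry)
    {
        return (volatile uint64_t *)(uintptr_t)
               (CORE_CREG_BASE(core) + TLB_CAM_OFFSET + entry * 8);
    }

    /* A debugger on a live core dumps one TLB CAM entry of a crashed core. */
    uint64_t read_remote_tlb(unsigned crashed_core, unsigned entry)
    {
        return *tlb_cam(crashed_core, entry);
    }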
MitchAlsup1 wrote:
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
One accomplishes the same effect by caching the interior PTE nodes
for each of the HV and GuestOS tables separately on the downward walk,
and holding the combined nested table mapping in the TLB.
The bottom-up table walkers on each interior PTE cache should
eliminate 98% of the PTE reads with none of the headaches.
I call these things:: TableWalk Accelerators.
Given CAMs at your access, one can cache the outer layers and short
circuit most of the MMU accesses--such that you don't simply read the
Accelerator RAM 25 times (two 5-level tables), you CAM down both
GuestOS and HV tables so you only walk the parts not in your CAM. {And
then put them in your CAM.} A density trick is for each CAM to have
access to a whole cache line of PTEs (8 in my case).
An idea I had here was to allow the OS more explicit control
for the invalidates of the interior node caches.
The interior nodes, stored in the CAM, retain their physical address, and
are snooped, so no invalidation is required. ANY write to them is seen and
the entry invalidates itself.
On My66000, yes; but other cores don't have automatically coherent TLBs.
This feature is intended for that general rabble.
Just to play devil's advocate...
To snoop page table updates, the My66000 TLB would need a large CAM with all
the physical addresses of the PTEs' source cache lines parallel to the
virtual and ASID CAMs, and route the cache line invalidates through it.
While the virtual index CAMs are separated into different banks,
one for each page table level, the P.A. CAM covers all entries in all banks.
This extra P.A. CAM will have a lot of entries and therefore be slow.
Also, routing the invalidate messages through the TLB could slow down all
their ACK messages, even though there is very low probability of a hit
because page tables update relatively infrequently.
Also the L2 TLBs, called the STLB for Second-level TLB by Intel,
are set associative, and would have to be virtually indexed and virtually
tagged with both VA and ASID plus table level to select the address mask.
On Skylake the STLB for 4k/2M pages is 128 rows * 12 ways; the 1G STLB
is 4 rows * 4 ways.
How can My66000 look up STLB entries by the invalidated physical line address?
It would have to scan all 128 rows for each message.
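To make the scan cost concrete, a toy model of that geometry
(illustrative, not Intel's actual implementation):

    #include <stdint.h>
    #include <stdbool.h>

    #define STLB_SETS 128           /* reported Skylake 4K/2M geometry */
    #define STLB_WAYS 12

    struct stlb_entry { bool valid; uint64_t vtag, pa_line; };
    struct stlb_entry stlb[STLB_SETS][STLB_WAYS];

    /* VA translate: the VA picks exactly one set to probe. */
    unsigned stlb_set(uint64_t va)
    {
        return (unsigned)((va >> 12) & (STLB_SETS - 1));
    }

    /* PA-keyed invalidate: the physical line address selects no set,
     * so all 128 sets x 12 ways must be examined -- ~128 row reads. */
    void stlb_inval_by_pa(uint64_t pa_line)
    {
        for (unsigned s = 0; s < STLB_SETS; s++)
            for (unsigned w = 0; w < STLB_WAYS; w++)
                if (stlb[s][w].valid && stlb[s][w].pa_line == pa_line)
                    stlb[s][w].valid = false;
    }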
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
One accomplishes the same effect by caching the interior PTE nodes
for each of the HV and GuestOS tables separately on the downward walk,
and holding the combined nested table mapping in the TLB.
The bottom-up table walkers on each interior PTE cache should
eliminate 98% of the PTE reads with none of the headaches.
I call these things:: TableWalk Accelerators.
Given CAMs at your access, one can cache the outer layers and short
circuit most of the MMU accesses--such that you don't simply read the
Accelerator RAM 25 times (two 5-level tables), you CAM down both
GuestOS and HV tables so you only walk the parts not in your CAM. {And
then put them in your CAM.} A density trick is for each CAM to have
access to a whole cache line of PTEs (8 in my case).
An idea I had here was to allow the OS more explicit control
for the invalidates of the interior node caches.
The interior nodes, stored in the CAM, retain their physical address, and
are snooped, so no invalidation is required. ANY write to them is seen and
the entry invalidates itself.
On My66000, yes; but other cores don't have automatically coherent TLBs.
This feature is intended for that general rabble.
Just to play devil's advocate...
To snoop page table updates, the My66000 TLB would need a large CAM with all
the physical addresses of the PTEs' source cache lines parallel to the
virtual and ASID CAMs, and route the cache line invalidates through it.
Yes, .....
While the virtual index CAMs are separated into different banks,
one for each page table level, the P.A. CAM covers all entries in all banks.
This extra P.A. CAM will have a lot of entries and therefore be slow.
That is the TWA.
Also, routing the invalidate messages through the TLB could slow down all
their ACK messages, even though there is very low probability of a hit
because page tables update relatively infrequently.
TLBs don't ACK, they self-invalidate. And they can be performing a
translation while self-invalidating.
Also the L2 TLBs, called the STLB for Second-level TLB by Intel,
are set associative, and would have to be virtually indexed and virtually
tagged with both VA and ASID plus table level to select the address mask.
On Skylake the STLB for 4k/2M pages is 128 rows * 12 ways; the 1G STLB
is 4 rows * 4 ways.
All TLB walks are performed with RealPA.
All snoops are performed with RealPA.
How can My66000 look up STLB entries by the invalidated physical line address?
It would have to scan all 128 rows for each message.
It is not structured like Intel's L2 TLB.
MitchAlsup1 wrote:
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
One accomplishes the same effect by caching the interior PTE nodes
for each of the HV and GuestOS tables separately on the downward walk,
and holding the combined nested table mapping in the TLB.
The bottom-up table walkers on each interior PTE cache should
eliminate 98% of the PTE reads with none of the headaches.
I call these things:: TableWalk Accelerators.
Given CAMs at your access, one can cache the outer layers and short
circuit most of the MMU accesses--such that you don't simply read the
Accelerator RAM 25 times (two 5-level tables), you CAM down both
GuestOS and HV tables so you only walk the parts not in your CAM. {And
then put them in your CAM.} A density trick is for each CAM to have
access to a whole cache line of PTEs (8 in my case).
An idea I had here was to allow the OS more explicit control
for the invalidates of the interior node caches.
The interior nodes, stored in the CAM, retain their physical address, and
are snooped, so no invalidation is required. ANY write to them is seen and
the entry invalidates itself.
On My66000, yes; but other cores don't have automatically coherent TLBs.
This feature is intended for that general rabble.
Just to play devil's advocate...
To snoop page table updates, the My66000 TLB would need a large CAM with all
the physical addresses of the PTEs' source cache lines parallel to the
virtual and ASID CAMs, and route the cache line invalidates through it.
Yes, .....
While the virtual index CAMs are separated into different banks,
one for each page table level, the P.A. CAM covers all entries in all banks.
This extra P.A. CAM will have a lot of entries and therefore be slow.
That is the TWA.
No, not the Table Walk Accelerator. I'm thinking the PA CAM would
only need to be accessed for cache line invalidates. However it would be
very inconvenient if the TLB CAMs had faster access time for virtual
address lookups than for physical address lookups, so the access time
would be the longer of the two, that being PA.
Basically I'm suggesting the big PA CAM slows down VA translates
and therefore possibly all memory accesses.
Also, routing the invalidate messages through the TLB could slow down all
their ACK messages, even though there is very low probability of a hit
because page tables update relatively infrequently.
TLBs don't ACK, they self-invalidate. And they can be performing a
translation while self-invalidating.
Hmmm... Danger Will Robinson. Most OS page table management depends on synchronizing after the shootdowns complete on all affected cores.
The basic safe sequence is:
- acquire page table mutex
- modify PTE in memory for a VA
- issue IPIs with VA to all cores that might have a copy in TLB
- invalidate local TLB for VA
- wait for IPI ACKs from remote cores
- release mutex
If it doesn't wait for shootdown ACKs then it might be possible for a
stale PTE copy to exist and be used after the mutex is released.
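That sequence as a skeleton (every name here is illustrative, not a
particular kernel's API):

    typedef unsigned long pte_t;
    typedef unsigned long cpumask_t;
    struct mm { int page_table_mutex; cpumask_t cpus_with_tlb_copies; };

    /* Illustrative primitives, not a specific kernel's API. */
    void lock(int *m); void unlock(int *m);
    void set_pte(struct mm *mm, void *va, pte_t pte);
    void send_shootdown_ipi(cpumask_t targets, void *va);
    void wait_for_acks(cpumask_t targets);
    void invlpg(void *va);

    void update_pte(struct mm *mm, void *va, pte_t new_pte)
    {
        lock(&mm->page_table_mutex);        /* acquire page table mutex */
        set_pte(mm, va, new_pte);           /* modify PTE in memory */
        cpumask_t targets = mm->cpus_with_tlb_copies;
        send_shootdown_ipi(targets, va);    /* IPI cores that may hold a copy */
        invlpg(va);                         /* invalidate local TLB for VA */
        wait_for_acks(targets);             /* stale copies now provably gone */
        unlock(&mm->page_table_mutex);      /* release mutex */
    }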
Also the L2 TLBs, called the STLB for Second-level TLB by Intel,
are set associative, and would have to be virtually indexed and virtually
tagged with both VA and ASID plus table level to select the address mask.
On Skylake the STLB for 4k/2M pages is 128 rows * 12 ways; the 1G STLB
is 4 rows * 4 ways.
All TLB walks are performed with RealPA.
All snoops are performed with RealPA.
How can My66000 look up STLB entries by the invalidated physical line address?
It would have to scan all 128 rows for each message.
It is not structured like Intel's L2 TLB.
Are you saying the My66000 STLB is physically indexed, physically tagged?
How's this work for a bottom-up table walk (aka your TWA)?
The only way I know to do a bottom-up walk is to use the portion of the
VA for the higher index level to get the PA of the page table page.
That requires lookup by a masked portion of the VA with the ASID.
The bottom-up walk is done by making the VA mask shorter for each level.
This implies a Virtually Indexed Virtually Tagged PTE cache.
The VIVT PTE cache implies that certain TLB VA or ASID invalidates require
a full STLB table scan, which could be up to 128 clocks for that instruction.
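A sketch of that mask arithmetic, assuming 4K pages and 9 index bits per
level (vivt_probe() and its tagging are a hypothetical model):

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_SHIFT 12
    #define BITS_PER_LEVEL 9        /* 512-entry table pages */

    /* Hypothetical VIVT PTE-cache probe keyed by (asid, masked VA, level). */
    bool vivt_probe(uint16_t asid, uint64_t masked_va, int level,
                    uint64_t *table_pa);

    /* Bottom-up: try the longest VA prefix first (the leaf-most table),
     * then shorten the mask one level at a time toward the root. */
    bool bottom_up_find(uint16_t asid, uint64_t va, int levels,
                        int *hit_level, uint64_t *table_pa)
    {
        for (int level = 1; level < levels; level++) {
            int shift = PAGE_SHIFT + level * BITS_PER_LEVEL;
            uint64_t masked = va & ~((1ull << shift) - 1); /* shorter each level */
            if (vivt_probe(asid, masked, level, table_pa)) {
                *hit_level = level;
                return true;        /* walk resumes below this level */
            }
        }
        return false;               /* full top-down walk needed */
    }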
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
One accomplishes the same effect by caching the interior PTE nodes
for each of the HV and GuestOS tables separately on the downward walk,
and holding the combined nested table mapping in the TLB.
The bottom-up table walkers on each interior PTE cache should
eliminate 98% of the PTE reads with none of the headaches.
I call these things:: TableWalk Accelerators.
Given CAMs at your access, one can cache the outer layers and short
circuit most of the MMU accesses--such that you don't simply read the
Accelerator RAM 25 times (two 5-level tables), you CAM down both
GuestOS and HV tables so you only walk the parts not in your CAM. {And
then put them in your CAM.} A density trick is for each CAM to have
access to a whole cache line of PTEs (8 in my case).
An idea I had here was to allow the OS more explicit control
for the invalidates of the interior node caches.
The interior nodes, stored in the CAM, retain their physical address, and
are snooped, so no invalidation is required. ANY write to them is seen and
the entry invalidates itself.
On My66000, yes; but other cores don't have automatically coherent TLBs.
This feature is intended for that general rabble.
Just to play devil's advocate...
To snoop page table updates, the My66000 TLB would need a large CAM with all
the physical addresses of the PTEs' source cache lines parallel to the
virtual and ASID CAMs, and route the cache line invalidates through it.
Yes, .....
While the virtual index CAMs are separated into different banks,
one for each page table level, the P.A. CAM covers all entries in all banks.
This extra P.A. CAM will have a lot of entries and therefore be slow.
That is the TWA.
No, not the Table Walk Accelerator. I'm thinking the PA CAM would
only need to be accessed for cache line invalidates. However it would be
very inconvenient if the TLB CAMs had faster access time for virtual
address lookups than for physical address lookups, so the access time
would be the longer of the two, that being PA.
      VA                                PA
      |                                 |
      V                                 V
+-----------+   +-----------+-+   +-----------+
|  VA CAM   |-->|   PTEs    |v|<--|  PA CAM   |
+-----------+   +-----------+-+   +-----------+
Basically I'm suggesting the big PA CAM slows down VA translates
and therefore possibly all memory accesses.
It is a completely independent and concurrent hunk of logic that only has
access to the valid bit and can only clear the valid bit.
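A behavioral model of that side logic, matching the diagram above
(structure and names illustrative):

    #include <stdint.h>
    #include <stdbool.h>

    #define TLB_ENTRIES 64

    struct tlb_entry {
        bool     valid;
        uint64_t vtag;      /* matched by the VA CAM on translates */
        uint64_t pte;
        uint64_t pa_line;   /* matched by the PA CAM on snoops */
    };
    struct tlb_entry tlb[TLB_ENTRIES];

    /* Snoop port: runs concurrently with translation; its only write
     * capability is clearing the valid bit, as described above. */
    void snoop_invalidate(uint64_t written_pa_line)
    {
        for (int i = 0; i < TLB_ENTRIES; i++)
            if (tlb[i].valid && tlb[i].pa_line == written_pa_line)
                tlb[i].valid = false;   /* entry invalidates itself */
    }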
Also, routing the invalidate messages through the TLB could slow down all
their ACK messages, even though there is very low probability of a hit
because page tables update relatively infrequently.
TLBs don't ACK, they self-invalidate. And they can be performing a
translation while self-invalidating.
Hmmm... Danger Will Robinson. Most OS page table management depends on
synchronizing after the shootdowns complete on all affected cores.
The basic safe sequence is:
- acquire page table mutex
- modify PTE in memory for a VA
Here you have obtained write permission on the line containing the PTE
being modified. By the time you have obtained write permission, all
other TLBs will have been invalidated.
- issue IPIs with VA to all cores that might have a copy in TLB
- invalidate local TLB for VA
- wait for IPI ACKs from remote cores
- release mutex
If it doesn't wait for shootdown ACKs then it might be possible for a
stale PTE copy to exist and be used after the mutex is released.
Race condition does not exist. By the time the core modifying the PTE
obtains write permission, all the TLBs have been cleared of that entry.
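Under that assumption, the OS-visible sequence collapses to something
like this sketch (reusing the illustrative declarations from the
shootdown skeleton above):

    /* Sketch assuming snoop-coherent TLBs as described: gaining write
     * permission on the PTE's cache line already cleared every cached
     * copy, so no IPIs and no ACK waits are needed. */
    void update_pte_coherent(struct mm *mm, void *va, pte_t new_pte)
    {
        lock(&mm->page_table_mutex);
        set_pte(mm, va, new_pte);  /* the RFO itself is the global shootdown */
        unlock(&mm->page_table_mutex);
    }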
Also the L2 TLBs, called the STLB for Second-level TLB by Intel,
are set associative, and would have to be virtually indexed and virtually
tagged with both VA and ASID plus table level to select the address mask.
On Skylake the STLB for 4k/2M pages is 128 rows * 12 ways; the 1G STLB
is 4 rows * 4 ways.
All TLB walks are performed with RealPA.
All snoops are performed with RealPA.
How can My66000 look up STLB entries by the invalidated physical line address?
It would have to scan all 128 rows for each message.
It is not structured like Intel's L2 TLB.
Are you saying the My66000 STLB is physically indexed, physically tagged?
How's this work for a bottom-up table walk (aka your TWA)?
L2 TLB is a different structure (SRAM) than TWAs (CAM).
I can't talk about it:: as Ivan used to say:: NYF.
The only way I know to do a bottom-up walk is to use the portion of the
VA for the higher index level to get the PA of the page table page.
I <actually> did not say I did a bottom-up walk. I said I short-circuited
the table walks for those layers where I have recent PTP translations. It's
more like: CAM the Root down to the last PTP, and every CAM that hits is one
layer you don't have to access.
That requires lookup by a masked portion of the VA with the ASID.
The bottom-up walk is done by making the VA mask shorter for each level.
This implies a Virtually Indexed Virtually Tagged PTE cache.
The VIVT PTE cache implies that certain TLB VA or ASID invalidates require
a full STLB table scan, which could be up to 128 clocks for that instruction.
MitchAlsup1 wrote:
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
One accomplishes the same effect by caching the interior PTE nodes
for each of the HV and GuestOS tables separately on the downward walk,
and holding the combined nested table mapping in the TLB.
The bottom-up table walkers on each interior PTE cache should
eliminate 98% of the PTE reads with none of the headaches.
I call these things:: TableWalk Accelerators.
Given CAMs at your access, one can cache the outer layers and short
circuit most of the MMU accesses--such that you don't simply read the
Accelerator RAM 25 times (two 5-level tables), you CAM down both
GuestOS and HV tables so you only walk the parts not in your CAM. {And
then put them in your CAM.} A density trick is for each CAM to have
access to a whole cache line of PTEs (8 in my case).
An idea I had here was to allow the OS more explicit control
for the invalidates of the interior node caches.
The interior nodes, stored in the CAM, retain their physical address, and
are snooped, so no invalidation is required. ANY write to them is seen and
the entry invalidates itself.
On My66000, yes; but other cores don't have automatically coherent TLBs.
This feature is intended for that general rabble.
Just to play devil's advocate...
To snoop page table updates, the My66000 TLB would need a large CAM with all
the physical addresses of the PTEs' source cache lines parallel to the
virtual and ASID CAMs, and route the cache line invalidates through it.
Yes, .....
While the virtual index CAMs are separated into different banks,
one for each page table level, the P.A. CAM covers all entries in all banks.
This extra P.A. CAM will have a lot of entries and therefore be slow.
That is the TWA.
No, not the Table Walk Accelerator. I'm thinking the PA CAM would
only need to be accessed for cache line invalidates. However it would be
very inconvenient if the TLB CAMs had faster access time for virtual
address lookups than for physical address lookups, so the access time
would be the longer of the two, that being PA.
      VA                                PA
      |                                 |
      V                                 V
+-----------+   +-----------+-+   +-----------+
|  VA CAM   |-->|   PTEs    |v|<--|  PA CAM   |
+-----------+   +-----------+-+   +-----------+
Of course, but for a 5 or 6 level page table you'd have a CAM bank
for each level to search in parallel. The loading on the PA path
would be the same as if you had a CAM as large as the sum of all entries.
But as you point out below, this shouldn't be an issue because
it has little to do after the lookup.
Basically I'm suggesting the big PA CAM slows down VA translates
and therefore possibly all memory accesses.
It is a completely independent and concurrent hunk of logic that only has
access to the valid bit and can only clear the valid bit.
Yes, ok. The lookup on the PA path may take longer but there is
little to do on a hit so the total path length is shorter,
so PA invalidate won't be on the critical path for the MMU.
Also, routing the invalidate messages through the TLB could slow down all
their ACK messages, even though there is very low probability of a hit
because page tables update relatively infrequently.
TLBs don't ACK, they self-invalidate. And they can be performing a
translation while self-invalidating.
Hmmm... Danger Will Robinson. Most OS page table management depends on
synchronizing after the shootdowns complete on all affected cores.
The basic safe sequence is:
- acquire page table mutex
- modify PTE in memory for a VA
Here you have obtained write permission on the line containing the PTE
being modified. By the time you have obtained write permission, all
other TLBs will have been invalidated.
It means you can't use the outer cache levels to filter invalidates.
You'd have to pass all invalidate messages from the coherence network
directly to the TLB PA CAM, bypassing the cache hierarchy, to ensure the
TLB entry is removed before the cache ACKs the invalidate message.
- issue IPIs with VA to all cores that might have a copy in TLB
- invalidate local TLB for VA
- wait for IPI ACKs from remote cores
- release mutex
If it doesn't wait for shootdown ACKs then it might be possible for a
stale PTE copy to exist and be used after the mutex is released.
Race condition does not exist. By the time the core modifying the PTE
obtains write permission, all the TLBs have been cleared of that entry.
Ok
Also the L2 TLBs, called the STLB for Second-level TLB by Intel,
are set associative, and would have to be virtually indexed and virtually
tagged with both VA and ASID plus table level to select the address mask.
On Skylake the STLB for 4k/2M pages is 128 rows * 12 ways; the 1G STLB
is 4 rows * 4 ways.
All TLB walks are performed with RealPA.
All snoops are performed with RealPA.
How can My66000 look up STLB entries by the invalidated physical line address?
It would have to scan all 128 rows for each message.
It is not structured like Intel's L2 TLB.
Are you saying the My66000 STLB is physically indexed, physically tagged?
How's this work for a bottom-up table walk (aka your TWA)?
L2 TLB is a different structure (SRAM) than TWAs (CAM).
I can't talk about it:: as Ivan used to say:: NYF.
Rats.