On 2/27/2024 7:33 PM, MitchAlsup1 wrote:
> A thought::
> Construct the 8-way cache from a pair of 4-way cache instances
> and connect both into one 8-way with a single layer of logic
> {multiplexing.}
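A minimal Verilog sketch of the single mux layer such a composition would
need (module and signal names are assumptions, not from anyone's actual
code); each 4-way half is presumed to present a hit flag and read data:

module join_8way #(parameter W = 128)(
    input  wire         hit_a, hit_b,    // hit flags from each 4-way half
    input  wire [W-1:0] data_a, data_b,  // read data from each half
    output wire         hit,
    output wire [W-1:0] data
);
    // At most one half can hit for a given address, so OR the hit
    // flags and use one of them to steer the data mux.
    assign hit  = hit_a | hit_b;
    assign data = hit_a ? data_a : data_b;
endmodule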
Possible, I have decided for now to stick with 4-way...
But, even then, efforts to optimize this seem to be causing
the LUT cost to increase rather than decrease...
Seemingly, Vivado's response to all this is to turn it almost
entirely into LUT3 instances (with a small number of LUT6's here and there).
Looking at the LUT3's, there seem to be various truth-tables in use.
But, off-hand, the patterns aren't super obvious.
A few common ones seem to be:
( I0 & I1) | (!I1 & I2)
( I0 & I1) | I2
(!I0 & I1) | I2
...
The first one seems to be a strong majority though. I think it is a bit
MUX using I1 to select between the other two bits (I0 and I2).
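For what it's worth, that reading checks out: the first truth table is
exactly a 2:1 bit mux with I1 as the select. A tiny Verilog sketch of the
equivalence (names are illustrative):

module lut3_as_mux(
    input  wire i0,  // selected when i1 == 1
    input  wire i1,  // select line
    input  wire i2,  // selected when i1 == 0
    output wire o
);
    // Same truth table as (i0 & i1) | (!i1 & i2).
    assign o = i1 ? i0 : i2;
endmodule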
....
On 2/28/2024 5:26 PM, MitchAlsup1 wrote:
> BGB wrote:
>> On 2/27/2024 7:33 PM, MitchAlsup1 wrote:
>>> A thought::
>>> Construct the 8-way cache from a pair of 4-way cache instances
>>> and connect both into one 8-way with a single layer of logic
>>> {multiplexing.}
>> Possible, I have decided for now to stick with 4-way...
>> But, even then, efforts to optimize this seem to be causing
>> the LUT cost to increase rather than decrease...
> Then you have tickled one of Verilog's insidious daemons.
> How many elements in a way? And how many bits in an element?
> Is there a way to make a "way" into a single SRAM? (Or part of a single
> SRAM?)
> What I am getting at is that "conceptually" an n-way set associative
> cache is not recognizably different from n copies of a 1/n-sized direct
> mapped cache coupled to a set/way selection multiplexer based on an
> address-bit compare. {{And of course write set selection.}}
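A minimal Verilog sketch of that arrangement, with 4 ways (parameter
values and names are assumptions, and valid bits plus the write path are
omitted for brevity):

module cache_4way #(
    parameter IDX_BITS  = 6,
    parameter TAG_BITS  = 20,
    parameter DATA_BITS = 128
)(
    input  wire [IDX_BITS-1:0]  idx,
    input  wire [TAG_BITS-1:0]  tag,
    output wire                 hit,
    output reg  [DATA_BITS-1:0] rdata
);
    // One tag/data array per way; each behaves as a small direct-mapped
    // cache and could map onto a single SRAM.
    reg [TAG_BITS-1:0]  tag_arr  [0:3][0:(1<<IDX_BITS)-1];
    reg [DATA_BITS-1:0] data_arr [0:3][0:(1<<IDX_BITS)-1];

    // One tag compare per way.
    wire [3:0] way_hit;
    genvar i;
    generate
        for (i = 0; i < 4; i = i + 1) begin : ways
            assign way_hit[i] = (tag_arr[i][idx] == tag);
        end
    endgenerate

    assign hit = |way_hit;

    // One-hot output mux across the ways.
    always @(*) begin
        case (way_hit)
            4'b0001: rdata = data_arr[0][idx];
            4'b0010: rdata = data_arr[1][idx];
            4'b0100: rdata = data_arr[2][idx];
            default: rdata = data_arr[3][idx];
        endcase
    end
endmodule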
I am not sure.
In this case, I interpreted it as, say, 4 or 8 parallel sets of arrays,
with the corresponding match and multiplex logic.
In the first instance, adding an item always shifted each item over one position and added a new item to the front.
The MTF logic tries to move an accessed item to the front, or shift each
item back (as before) if it is a new address. If the address hits while
adding an item, it behaves as if it were moving it to the front, but
effectively replaces the item being moved to the front with the data
being written.
The MTF logic seems to increase hit-rate, but eats a lot of additional LUTs.
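A rough Verilog sketch of that shift-style MTF update, for one 4-entry
group (signal names and widths are assumptions; the surrounding match
logic is omitted):

module mtf4 #(parameter W = 32)(
    input  wire         clk,
    input  wire         access,     // an access happens this cycle
    input  wire         is_hit,     // the access hit one of the slots
    input  wire [1:0]   hit_slot,   // which slot hit (when is_hit)
    input  wire [W-1:0] front_item  // hit entry, write data, or new item
);
    reg [W-1:0] slot [0:3];         // slot 0 is the front

    integer k;
    always @(posedge clk) begin
        if (access) begin
            // On a miss, every entry shifts back one position.
            // On a hit at hit_slot, only entries at positions below it
            // shift back; everything behind the hit keeps its place.
            for (k = 3; k > 0; k = k - 1)
                if (!is_hit || k <= hit_slot)
                    slot[k] <= slot[k-1];
            slot[0] <= front_item;  // front gets the accessed/written item
        end
    end
endmodule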
On 2/29/2024 1:39 PM, MitchAlsup1 wrote:
> They should be 4 or 8 parallel instances of a 1-way (DM) cache with a
> comparator, an output multiplexer signal, and an input write signal.
> The Move to Front is more easily done with Not-Recently-Used bits as a
> guise for least recently used.
> Each way has an NRU bit--at reset, and when all NRU bits are set, they
> are cleared asynchronously. Then as each way is hit, its NRU bit is set.
> You do not reallocate the ways with the NRU bit set. I see no reason to
> move one way to the front if you can alter the reallocation selection
> to avoid picking it. {{3 gates per line}}
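A minimal Verilog sketch of that NRU scheme for one 4-way set (names are
illustrative, and the all-set clear is done synchronously here for
simplicity, where the post describes doing it asynchronously):

module nru4(
    input  wire       clk,
    input  wire       rst,
    input  wire       hit,
    input  wire [1:0] hit_way,     // way that hit this cycle
    output wire [1:0] victim_way   // a way whose NRU bit is clear
);
    reg [3:0] nru;

    // Set the hit way's bit; when that would set all four, clear them all.
    wire [3:0] nru_next = nru | (hit ? (4'b0001 << hit_way) : 4'b0000);

    always @(posedge clk)
        if (rst || (nru_next == 4'b1111))
            nru <= 4'b0000;
        else
            nru <= nru_next;

    // Reallocation avoids ways whose NRU bit is set: pick the first
    // way with a clear bit.
    assign victim_way = !nru[0] ? 2'd0 :
                        !nru[1] ? 2'd1 :
                        !nru[2] ? 2'd2 : 2'd3;
endmodule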
I guess it could be possible, say, by having 8 bits (4x2 bits) or 24
bits (8x3 bits) to encode the permutation, then using the permutation to
drive which index to update, rather than having the items swap places.
This could possibly reduce LUT cost vs the existing MTF scheme...
I may need to explore this idea.
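A sketch of what the 4-way (8-bit) version of that might look like in
Verilog (all names and the exact encoding are assumptions; rank 0 is
taken as the "front"/most recent, and order[2*i+:2] names the way
holding rank i):

module mtf_perm4(
    input  wire       clk,
    input  wire       rst,
    input  wire       access,
    input  wire [1:0] acc_way,   // way being moved to the front
    output wire [1:0] back_way   // least-recent way, i.e. the one to replace
);
    reg [7:0] order;             // four 2-bit fields: {rank3,rank2,rank1,rank0}

    assign back_way = order[7:6];

    // Recompute the permutation: ranks in front of the accessed way's
    // old rank slide back one, and the accessed way takes rank 0.
    // The data arrays themselves never move; only these 8 bits change.
    reg [7:0] nxt;
    always @(*) begin
        nxt = order;
        if (order[1:0] != acc_way) begin
            nxt[3:2] = order[1:0];
            if (order[3:2] != acc_way) begin
                nxt[5:4] = order[3:2];
                if (order[5:4] != acc_way)
                    nxt[7:6] = order[5:4];
            end
        end
        nxt[1:0] = acc_way;
    end

    always @(posedge clk)
        if (rst)
            order <= 8'b11_10_01_00;  // ways 3,2,1,0 from back to front
        else if (access)
            order <= nxt;
endmodule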