• Re: Misc: 8-way vs 4-way cache, a cost mystery...

    From MitchAlsup1@21:1/5 to All on Wed Feb 28 01:33:14 2024
    A thought::

    Construct the 8-way cache from a pair of 4-way cache instances
    and connect both into one 8-way with a single layer of logic
    {multiplexing.}
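As a behavioral sketch of that idea (Python rather than HDL, all names hypothetical): probe two independent 4-way halves in parallel and resolve the result with a single mux layer.

```python
# Behavioral model (hypothetical names): an 8-way lookup built from
# two independent 4-way halves plus one layer of multiplexing.

class FourWaySet:
    """One set of a 4-way cache: up to 4 (tag, data) entries."""
    def __init__(self):
        self.ways = []                     # list of (tag, data), max 4

    def lookup(self, tag):
        for t, d in self.ways:
            if t == tag:
                return True, d             # hit in this half
        return False, None

    def fill(self, tag, data):
        if len(self.ways) == 4:
            self.ways.pop(0)               # evict oldest (FIFO for brevity)
        self.ways.append((tag, data))

def eight_way_lookup(half_a, half_b, tag):
    # Both 4-way halves are probed in parallel; the select logic below
    # is the single extra layer that combines the two hit signals.
    hit_a, data_a = half_a.lookup(tag)
    hit_b, data_b = half_b.lookup(tag)
    return (hit_a or hit_b), (data_a if hit_a else data_b)
```

The two halves never interact except at that final select, which is what keeps the added logic to one layer.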

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Wed Feb 28 23:26:29 2024
    BGB wrote:

    On 2/27/2024 7:33 PM, MitchAlsup1 wrote:
    A thought::

    Construct the 8-way cache from a pair of 4-way cache instances
    and connect both into one 8-way with a single layer of logic
    {multiplexing.}

    Possible, I have decided for now to stick with 4-way...


    But, even then, efforts at trying to optimize this seem to be causing
    the LUT cost to increase rather than decrease...

Then you have tickled one of Verilog's insidious demons.

    How many elements in a way ?? and how many bits in an element ??
Is there a way to make a "way" into a single SRAM ?? (or part of a single
    SRAM) ??

What I am getting at is that, "conceptually", an n-way set associative
cache is indistinguishable from n copies of a 1/n-size direct
mapped cache coupled to a set/way selection multiplexer driven by the
address-bit compares. {{And of course write set selection.}}
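That equivalence can be sketched behaviorally (Python, hypothetical names): n direct-mapped arrays indexed identically, with the tag comparators driving the output mux.

```python
# Behavioral model (hypothetical names): an n-way set-associative cache
# as n copies of a 1/n-size direct-mapped cache plus a tag-compare mux.

N_WAYS = 4
N_SETS = 8                                 # each way is direct mapped

# each way: array of (valid, tag, data) entries, indexed by set
ways = [[(False, 0, None)] * N_SETS for _ in range(N_WAYS)]

def lookup(addr):
    index = addr % N_SETS                  # same index into every way
    tag = addr // N_SETS
    for w in range(N_WAYS):                # n parallel comparators...
        valid, t, data = ways[w][index]
        if valid and t == tag:             # ...whose matches drive the mux
            return data
    return None                            # miss

def fill(addr, data, way):                 # write-way selection is external
    index = addr % N_SETS
    ways[way][index] = (True, addr // N_SETS, data)
```

All n ways see the same index and the same tag; only the compare-and-select at the end distinguishes them.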

My 1-wide My 66000 implementation carefully chose 3-way (or 6-way)
L1 caches because that exactly fits the number of bits in my SRAM
macro (-2 spare bits). This crams 3 tags, 3 line-states, and a 3-bit
LRU into one 128-bit SRAM word. The 24KB cache is 3-way while the
48KB cache is 6-way. The read speed path is 1 gate longer in the
6-way configuration.
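A rough packing sketch of such a word (Python; the tag and state widths below are my assumptions, since the post only fixes the 128-bit word, the 2 spare bits, and the 3-bit LRU):

```python
# Packing sketch for one 128-bit SRAM word holding 3 tags, 3 line
# states, and a 3-bit LRU. The widths are assumptions chosen so that
# 3 * (tag + state) + 3 = 126 bits, leaving the 2 spare bits.
TAG_BITS   = 38
STATE_BITS = 3
LRU_BITS   = 3
ENTRY_BITS = TAG_BITS + STATE_BITS         # 41 bits per way

def pack(tags, states, lru):
    word = 0
    for i in range(3):
        entry = (states[i] << TAG_BITS) | tags[i]
        word |= entry << (i * ENTRY_BITS)
    word |= lru << (3 * ENTRY_BITS)        # LRU sits at bits 123..125
    return word                            # bits 126..127 stay spare

def unpack(word):
    tags, states = [], []
    for i in range(3):
        entry = (word >> (i * ENTRY_BITS)) & ((1 << ENTRY_BITS) - 1)
        tags.append(entry & ((1 << TAG_BITS) - 1))
        states.append(entry >> TAG_BITS)
    lru = (word >> (3 * ENTRY_BITS)) & ((1 << LRU_BITS) - 1)
    return tags, states, lru

assert 3 * ENTRY_BITS + LRU_BITS == 126    # 128-bit word minus 2 spares
```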

Seemingly, Vivado's response to all this is to turn it almost
entirely into LUT3 instances (with a small number of LUT6's here and there).

It seems to me it is failing to see the SRAM and has just built it out of flip-flops.

    Looking at the LUT3's, there seem to be various truth-tables in use.

    But, off-hand, the patterns aren't super obvious.

    A few common ones seem to be:
    ( I0 & I1) | (!I1 & I2)
    ( I0 & I1) | I2
    (!I0 & I1) | I2
    ...

These appear to be standard decoding pattern recognizers, to me (although
the top one looks like a binary multiplexer).

    The first one seems to be a strong majority though. I think it is a bit
    MUX using I1 to select the other bit (I0 or I2).
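A quick exhaustive check confirms that reading: the first truth table is exactly a 2:1 bit mux with I1 as the select.

```python
# Exhaustively verify that (I0 & I1) | (~I1 & I2) -- the most common
# LUT3 truth table above -- is a 2:1 mux selecting I0 when I1=1 and
# I2 when I1=0.
from itertools import product

def lut3_a(i0, i1, i2):
    return (i0 & i1) | ((1 - i1) & i2)

def mux2(sel, a, b):
    return a if sel else b

# all 8 input patterns agree
assert all(lut3_a(i0, i1, i2) == mux2(i1, i0, i2)
           for i0, i1, i2 in product((0, 1), repeat=3))
```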

    ....

  • From MitchAlsup1@21:1/5 to BGB on Thu Feb 29 19:39:35 2024
    BGB wrote:

    On 2/28/2024 5:26 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 2/27/2024 7:33 PM, MitchAlsup1 wrote:
    A thought::

    Construct the 8-way cache from a pair of 4-way cache instances
    and connect both into one 8-way with a single layer of logic
    {multiplexing.}

    Possible, I have decided for now to stick with 4-way...


    But, even then, efforts at trying to optimize this seem to be causing
    the LUT cost to increase rather than decrease...

Then you have tickled one of Verilog's insidious demons.

    How many elements in a way ?? and how many bits in an element ??
Is there a way to make a "way" into a single SRAM ?? (or part of a single
    SRAM) ??

What I am getting at is that, "conceptually", an n-way set associative
cache is indistinguishable from n copies of a 1/n-size direct mapped
cache coupled to a set/way selection multiplexer driven by the
address-bit compares. {{And of course write set selection.}}


    I am not sure.

    In this case, I interpreted it as, say, 4 or 8 parallel sets of arrays,
    with the corresponding match and multiplex logic.

They should be 4 or 8 parallel instances of a 1-way (DM) cache, each with a comparator, an output multiplexer signal, and an input write signal.
Move to Front is more easily done with Not-Recently-Used bits as a guise
for least recently used.

Each way has an NRU bit--at reset, and whenever all NRU bits become set, they are
cleared asynchronously. Then as each way is hit, its NRU bit is set. You
do not reallocate the ways with the NRU bit set. I see no reason to move
one way to the front if you can alter the reallocation selection to avoid picking it. {{3 gates per line}}

    In the first instance, adding an item always shifted each item over one position and added a new item to the front.

The MTF logic tries to move an accessed item to the front, or shifts each
item back as before if it is a new address. If the address hits while
adding an item, it behaves as if it were moving it to the front, but
effectively replaces the item being moved to the front with the data
being written.

    The MTF logic seems to increase hit-rate, but eats a lot of additional LUTs.
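A minimal behavioral sketch of the MTF behavior described above (Python, hypothetical names), showing the shift-back, move-to-front, and hit-while-writing cases:

```python
# Behavioral sketch of the move-to-front (MTF) scheme: an accessed
# entry moves to position 0; a new entry is pushed at the front while
# everything else shifts back one place and the last entry falls off.

def mtf_access(entries, tag, data=None):
    """entries: list of (tag, data); position 0 is the front."""
    for i, (t, d) in enumerate(entries):
        if t == tag:                       # hit: move to the front,
            entries.pop(i)                 # replacing the data if this
            new = data if data is not None else d
            entries.insert(0, (tag, new))  # access is also a write
            return entries
    entries.insert(0, (tag, data))         # miss: new item at the front
    entries.pop()                          # back item is evicted
    return entries
```

Every access rewrites up to the whole list, which is consistent with the scheme costing noticeably more LUTs than a few replacement bits.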

  • From MitchAlsup1@21:1/5 to BGB on Sat Mar 2 18:00:18 2024
    BGB wrote:

    On 2/29/2024 1:39 PM, MitchAlsup1 wrote:

They should be 4 or 8 parallel instances of a 1-way (DM) cache, each with a
comparator, an output multiplexer signal, and an input write signal.
Move to Front is more easily done with Not-Recently-Used bits as a guise
for least recently used.

Each way has an NRU bit--at reset, and whenever all NRU bits become set, they are
cleared asynchronously. Then as each way is hit, its NRU bit is set. You
do not reallocate the ways with the NRU bit set. I see no reason to move
one way to the front if you can alter the reallocation selection to avoid
picking it. {{3 gates per line}}


I guess it could be possible, say, by having 8 bits (4x2 bits) or 24
bits (8x3 bits) to encode the permutation, then using the permutation to drive which index to update, rather than having the items swap places.

    This could possibly reduce LUT cost vs the existing MTF scheme...


    I may need to explore this idea.
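A sketch of that permutation idea (assumptions mine): keep a small recency rank per way, e.g. 4x2 bits, and rewrite only the ranks on access instead of swapping the entries themselves.

```python
# Sketch of permutation-encoded LRU for 4 ways (4 x 2-bit ranks, 8 bits
# total). Rank 0 = most recently used. An access rewrites only the
# small rank field; the cache entries themselves never move.

def touch(ranks, way):
    """ranks[w] is way w's recency rank, 0 = most recently used."""
    old = ranks[way]
    for w in range(len(ranks)):
        if ranks[w] < old:                 # everyone ahead moves back one
            ranks[w] += 1
    ranks[way] = 0                         # accessed way becomes newest
    return ranks

def victim(ranks):
    return ranks.index(max(ranks))         # evict the least recently used
```

Compared with physically shifting entries, only an 8-bit (or 24-bit) field is rewritten per access, which is where the hoped-for LUT savings would come from.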

Each line in a way has a Not-Recently-Used bit: 1 SR flip-flop per line.
Every time a line is accessed, its set input is asserted.
When all bits have been set, the reset input is asserted and they are all
cleared (asynchronously).

    The line to be replaced is determined by a find first zero. There is
    always at least 1 zero in the list.
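A behavioral model of the scheme as stated (Python; `NRU` is a hypothetical name): set on access, clear all bits when every bit is set, and pick the victim by find-first-zero.

```python
# Behavioral model of the NRU replacement scheme: one bit per line,
# set on access, all cleared when every bit becomes set, victim
# chosen by find-first-zero (at least one zero always exists).

class NRU:
    def __init__(self, n_lines):
        self.bits = [0] * n_lines

    def access(self, line):
        self.bits[line] = 1                # set input asserted on access
        if all(self.bits):                 # all set -> reset asserted,
            self.bits = [0] * len(self.bits)  # cleared asynchronously
    
    def victim(self):
        return self.bits.index(0)          # find-first-zero
```

Note the invariant the post relies on: the all-set state is cleared immediately, so `victim()` always finds a zero.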

I have been using this since the days of the MC68851.....
