• Re: Misc: 8-way vs 4-way cache, a cost mystery...

    From MitchAlsup1@21:1/5 to All on Wed Feb 28 01:33:14 2024
    A thought::

    Construct the 8-way cache from a pair of 4-way cache instances
    and connect both into one 8-way with a single layer of logic
    {multiplexing.}
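As a behavioral sketch of that idea (Python rather than HDL, all names hypothetical): probe two independent 4-way halves in parallel and resolve the result with a single mux layer.

```python
# Behavioral model (hypothetical names): an 8-way lookup built from
# two independent 4-way halves plus one layer of multiplexing.

class FourWaySet:
    """One set of a 4-way cache: up to 4 (tag, data) entries."""
    def __init__(self):
        self.ways = []                     # list of (tag, data), max 4

    def lookup(self, tag):
        for t, d in self.ways:
            if t == tag:
                return True, d             # hit in this half
        return False, None

    def fill(self, tag, data):
        if len(self.ways) == 4:
            self.ways.pop(0)               # evict oldest (FIFO for brevity)
        self.ways.append((tag, data))

def eight_way_lookup(half_a, half_b, tag):
    # Both 4-way halves are probed in parallel; the select logic below
    # is the single extra layer that combines the two hit signals.
    hit_a, data_a = half_a.lookup(tag)
    hit_b, data_b = half_b.lookup(tag)
    return (hit_a or hit_b), (data_a if hit_a else data_b)
```

The two halves never interact except at that final select, which is what keeps the added logic to one layer.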

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Wed Feb 28 23:26:29 2024
    BGB wrote:

    On 2/27/2024 7:33 PM, MitchAlsup1 wrote:
    A thought::

    Construct the 8-way cache from a pair of 4-way cache instances
    and connect both into one 8-way with a single layer of logic
    {multiplexing.}

    Possible, I have decided for now to stick with 4-way...


    But, even then, efforts at trying to optimize this seem to be causing
    the LUT cost to increase rather than decrease...

Then you have tickled one of Verilog's insidious demons.

    How many elements in a way ?? and how many bits in an element ??
Is there a way to make a "way" into a single SRAM ?? (or part of a single
    SRAM) ??

What I am getting at is that, "conceptually", an n-way set associative
cache is indistinguishable from n copies of a 1/n-size direct
mapped cache coupled to a set/way selection multiplexer driven by the
address-bit compares. {{And of course write set selection.}}
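That equivalence can be sketched behaviorally (Python, hypothetical names): n direct-mapped arrays indexed identically, with the tag comparators driving the output mux.

```python
# Behavioral model (hypothetical names): an n-way set-associative cache
# as n copies of a 1/n-size direct-mapped cache plus a tag-compare mux.

N_WAYS = 4
N_SETS = 8                                 # each way is direct mapped

# each way: array of (valid, tag, data) entries, indexed by set
ways = [[(False, 0, None)] * N_SETS for _ in range(N_WAYS)]

def lookup(addr):
    index = addr % N_SETS                  # same index into every way
    tag = addr // N_SETS
    for w in range(N_WAYS):                # n parallel comparators...
        valid, t, data = ways[w][index]
        if valid and t == tag:             # ...whose matches drive the mux
            return data
    return None                            # miss

def fill(addr, data, way):                 # write-way selection is external
    index = addr % N_SETS
    ways[way][index] = (True, addr // N_SETS, data)
```

All n ways see the same index and the same tag; only the compare-and-select at the end distinguishes them.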

My 1-wide My 66000 implementation carefully chose 3-way (or 6-way)
L1 caches because that exactly fits the number of bits in my SRAM
macro (-2 spare bits). This crams 3 tags, 3 line-states, and a 3-bit
LRU into one 128-bit SRAM word. The 24KB cache is 3-way while the
48KB cache is 6-way. The read speed path is 1 gate longer in the
6-way configuration.
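A rough packing sketch of such a word (Python; the tag and state widths below are my assumptions, since the post only fixes the 128-bit word, the 2 spare bits, and the 3-bit LRU):

```python
# Packing sketch for one 128-bit SRAM word holding 3 tags, 3 line
# states, and a 3-bit LRU. The widths are assumptions chosen so that
# 3 * (tag + state) + 3 = 126 bits, leaving the 2 spare bits.
TAG_BITS   = 38
STATE_BITS = 3
LRU_BITS   = 3
ENTRY_BITS = TAG_BITS + STATE_BITS         # 41 bits per way

def pack(tags, states, lru):
    word = 0
    for i in range(3):
        entry = (states[i] << TAG_BITS) | tags[i]
        word |= entry << (i * ENTRY_BITS)
    word |= lru << (3 * ENTRY_BITS)        # LRU sits at bits 123..125
    return word                            # bits 126..127 stay spare

def unpack(word):
    tags, states = [], []
    for i in range(3):
        entry = (word >> (i * ENTRY_BITS)) & ((1 << ENTRY_BITS) - 1)
        tags.append(entry & ((1 << TAG_BITS) - 1))
        states.append(entry >> TAG_BITS)
    lru = (word >> (3 * ENTRY_BITS)) & ((1 << LRU_BITS) - 1)
    return tags, states, lru

assert 3 * ENTRY_BITS + LRU_BITS == 126    # 128-bit word minus 2 spares
```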

Seemingly, Vivado's response to all this is to turn it almost
entirely into LUT3 instances (with a small number of LUT6's here and there).

It seems to me it is failing to see the SRAM and has just built it out of flip-flops.

    Looking at the LUT3's, there seem to be various truth-tables in use.

    But, off-hand, the patterns aren't super obvious.

    A few common ones seem to be:
    ( I0 & I1) | (!I1 & I2)
    ( I0 & I1) | I2
    (!I0 & I1) | I2
    ...

These appear to be standard decoding pattern recognizers, to me (although
the top one looks like a binary multiplexer).

    The first one seems to be a strong majority though. I think it is a bit
    MUX using I1 to select the other bit (I0 or I2).
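A quick exhaustive check confirms that reading: the first truth table is exactly a 2:1 bit mux with I1 as the select.

```python
# Exhaustively verify that (I0 & I1) | (~I1 & I2) -- the most common
# LUT3 truth table above -- is a 2:1 mux selecting I0 when I1=1 and
# I2 when I1=0.
from itertools import product

def lut3_a(i0, i1, i2):
    return (i0 & i1) | ((1 - i1) & i2)

def mux2(sel, a, b):
    return a if sel else b

# all 8 input patterns agree
assert all(lut3_a(i0, i1, i2) == mux2(i1, i0, i2)
           for i0, i1, i2 in product((0, 1), repeat=3))
```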

    ....

  • From MitchAlsup1@21:1/5 to BGB on Thu Feb 29 19:39:35 2024
    BGB wrote:

    On 2/28/2024 5:26 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 2/27/2024 7:33 PM, MitchAlsup1 wrote:
    A thought::

    Construct the 8-way cache from a pair of 4-way cache instances
    and connect both into one 8-way with a single layer of logic
    {multiplexing.}

    Possible, I have decided for now to stick with 4-way...


    But, even then, efforts at trying to optimize this seem to be causing
    the LUT cost to increase rather than decrease...

Then you have tickled one of Verilog's insidious demons.

    How many elements in a way ?? and how many bits in an element ??
Is there a way to make a "way" into a single SRAM ?? (or part of a single
    SRAM) ??

What I am getting at is that, "conceptually", an n-way set associative
cache is indistinguishable from n copies of a 1/n-size direct mapped
cache coupled to a set/way selection multiplexer driven by the
address-bit compares. {{And of course write set selection.}}


    I am not sure.

    In this case, I interpreted it as, say, 4 or 8 parallel sets of arrays,
    with the corresponding match and multiplex logic.

They should be 4 or 8 parallel instances of a 1-way (DM) cache, each with a comparator, an output multiplexer signal, and an input write signal.
Move to Front is more easily done with Not-Recently-Used bits as a guise
for least recently used.

Each way has an NRU bit--at reset, and whenever all NRU bits become set, they are
cleared asynchronously. Then as each way is hit, its NRU bit is set. You
do not reallocate the ways with the NRU bit set. I see no reason to move
one way to the front if you can alter the reallocation selection to avoid picking it. {{3 gates per line}}

    In the first instance, adding an item always shifted each item over one position and added a new item to the front.

The MTF logic tries to move an accessed item to the front, or shifts each
item back as before if it is a new address. If the address hits while
adding an item, it behaves as if it were moving it to the front, but
effectively replaces the item being moved to the front with the data
being written.

    The MTF logic seems to increase hit-rate, but eats a lot of additional LUTs.
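A minimal behavioral sketch of the MTF behavior described above (Python, hypothetical names), showing the shift-back, move-to-front, and hit-while-writing cases:

```python
# Behavioral sketch of the move-to-front (MTF) scheme: an accessed
# entry moves to position 0; a new entry is pushed at the front while
# everything else shifts back one place and the last entry falls off.

def mtf_access(entries, tag, data=None):
    """entries: list of (tag, data); position 0 is the front."""
    for i, (t, d) in enumerate(entries):
        if t == tag:                       # hit: move to the front,
            entries.pop(i)                 # replacing the data if this
            new = data if data is not None else d
            entries.insert(0, (tag, new))  # access is also a write
            return entries
    entries.insert(0, (tag, data))         # miss: new item at the front
    entries.pop()                          # back item is evicted
    return entries
```

Every access rewrites up to the whole list, which is consistent with the scheme costing noticeably more LUTs than a few replacement bits.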

  • From MitchAlsup1@21:1/5 to BGB on Sat Mar 2 18:00:18 2024
    BGB wrote:

    On 2/29/2024 1:39 PM, MitchAlsup1 wrote:

They should be 4 or 8 parallel instances of a 1-way (DM) cache, each with a
comparator, an output multiplexer signal, and an input write signal.
Move to Front is more easily done with Not-Recently-Used bits as a guise
for least recently used.

Each way has an NRU bit--at reset, and whenever all NRU bits become set, they are
cleared asynchronously. Then as each way is hit, its NRU bit is set. You
do not reallocate the ways with the NRU bit set. I see no reason to move
one way to the front if you can alter the reallocation selection to avoid
picking it. {{3 gates per line}}


I guess it could be possible, say, by having 8 bits (4x2 bits) or 24
bits (8x3 bits) to encode the permutation, then using the permutation to drive which index to update, rather than having the items swap places.

    This could possibly reduce LUT cost vs the existing MTF scheme...


    I may need to explore this idea.
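A sketch of that permutation idea (assumptions mine): keep a small recency rank per way, e.g. 4x2 bits, and rewrite only the ranks on access instead of swapping the entries themselves.

```python
# Sketch of permutation-encoded LRU for 4 ways (4 x 2-bit ranks, 8 bits
# total). Rank 0 = most recently used. An access rewrites only the
# small rank field; the cache entries themselves never move.

def touch(ranks, way):
    """ranks[w] is way w's recency rank, 0 = most recently used."""
    old = ranks[way]
    for w in range(len(ranks)):
        if ranks[w] < old:                 # everyone ahead moves back one
            ranks[w] += 1
    ranks[way] = 0                         # accessed way becomes newest
    return ranks

def victim(ranks):
    return ranks.index(max(ranks))         # evict the least recently used
```

Compared with physically shifting entries, only an 8-bit (or 24-bit) field is rewritten per access, which is where the hoped-for LUT savings would come from.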

Each line in a way has a Not-Recently-Used bit: 1 SR flip-flop per line.
Every time a line is accessed, its set input is asserted.
When all bits have been set, the reset input is asserted and they are all
cleared (asynchronously).

    The line to be replaced is determined by a find first zero. There is
    always at least 1 zero in the list.
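A behavioral model of the scheme as stated (Python; `NRU` is a hypothetical name): set on access, clear all bits when every bit is set, and pick the victim by find-first-zero.

```python
# Behavioral model of the NRU replacement scheme: one bit per line,
# set on access, all cleared when every bit becomes set, victim
# chosen by find-first-zero (at least one zero always exists).

class NRU:
    def __init__(self, n_lines):
        self.bits = [0] * n_lines

    def access(self, line):
        self.bits[line] = 1                # set input asserted on access
        if all(self.bits):                 # all set -> reset asserted,
            self.bits = [0] * len(self.bits)  # cleared asynchronously
    
    def victim(self):
        return self.bits.index(0)          # find-first-zero
```

Note the invariant the post relies on: the all-set state is cleared immediately, so `victim()` always finds a zero.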

I have been using this since the days of the MC68851.....
