EricP wrote:
MitchAlsup wrote:
That result feeds to the Log2 Parser which selects up to 6 instructions
from those source bytes.
Fetch Line Buf 0 (fully assoc index)
Fetch Line Buf 1
Fetch Line Buf 2
Fetch Line Buf 3
      v          v
  32B Blk1   32B Blk0
      v          v
  Alignment Shifter 8:1 muxes
               v
  Log2 Parser and Branch Detect
   v    v    v    v    v    v
  I5   I4   I3   I2   I1   I0
I have been thinking about this in the background for the last month. It seems
to me that a small fetch-predictor is in order.
This fetch-predictor makes use of the natural organization of the ICache as
a matrix of SRAM macros (of some given size:: say 2KB), each SRAM macro having
a ¼-line access width. Let us call this the horizontal direction. In the vertical direction we have sets (or ways, if you prefer).
Each 2KB SRAM macro is 128 bits wide by 128 words deep, so we need a 7-bit index.
Each SRAM column has {2,3,4,...} SRAM macros {4=16KB ICache}; so we need {2,3,...} bits of set index.
Putting 4 of these index sets together gives us a (7+3)×4 = 40-bit fetch-predictor entry, plus a few bits for state and control. {{We may need to
add a field used to access the fetch-predictor for the next cycle.}}
We are now in a position to access 4×¼ = 1 cache line (16 words) from the matrix of SRAM macros.
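As a sanity check on the entry width, here is a minimal C sketch of packing the 4 {index, set} pairs into one 40-bit entry. Only the 7-bit and 3-bit field widths come from the description above; the packing order and the names are my own invention:

```c
#include <assert.h>
#include <stdint.h>

/* 40 bits used; the remaining bits of the word are spare for state/control. */
typedef uint64_t fp_entry_t;

/* Pack 4 fields of (3-bit set | 7-bit macro index), low-to-high. */
static fp_entry_t fp_pack(const uint8_t idx[4], const uint8_t set[4])
{
    fp_entry_t e = 0;
    for (int c = 0; c < 4; c++) {
        uint64_t field = ((uint64_t)(set[c] & 0x7) << 7) | (idx[c] & 0x7F);
        e |= field << (10 * c);
    }
    return e;
}

static void fp_unpack(fp_entry_t e, uint8_t idx[4], uint8_t set[4])
{
    for (int c = 0; c < 4; c++) {
        uint64_t field = (e >> (10 * c)) & 0x3FF;
        idx[c] = field & 0x7F;
        set[c] = (field >> 7) & 0x7;
    }
}
```

Four 10-bit fields fit comfortably in one word, with 24 bits left over for the state, control, and next-access fields mentioned above.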
Sequential access:
It is easy to see that one can access 16 words (16 potential instructions)
in a linear sequence even when the access crosses a cache line boundary.
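A small sketch of that sequential case, under my assumption of 64B lines with column c holding quarter-line c of every line: a 16-word fetch starting at quarter 'start' pulls the leading columns from the following line, so their 7-bit index is incremented mod 128 (a different set for the next line would come from the predictor, not from this arithmetic):

```c
#include <assert.h>
#include <stdint.h>

/* For a sequential fetch of 4 quarter-lines beginning at quarter 'start'
   of line index 'line', columns c < start serve line+1 (index wraps mod
   128); the rest serve 'line'. */
static void seq_indices(uint8_t line, int start, uint8_t idx_out[4])
{
    for (int c = 0; c < 4; c++)
        idx_out[c] = (c < start) ? (uint8_t)((line + 1) & 0x7F) : line;
}
```

So a fetch starting at quarter 2 of line 10 reads quarters 2,3 of line 10 and quarters 0,1 of line 11 in the same cycle, which is exactly the line-boundary-crossing case.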
Non-sequential access:
Given a 6-wide machine (and known instruction statistics wrt VLE utilization) and the assumption of 1 taken branch per issue width:: the fetch-predictor accesses 4 SRAM macros, indexing each macro with its 7-bit index and choosing the set with its 3-bit index. {We are accessing a set-associative cache as if it were direct-mapped.}
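That direct-mapped-style access might be sketched like this (the array dimensions are my assumptions: 8 sets per column, 128 rows per 2KB macro, 4 words per quarter-line); note that no tag compare appears anywhere in the path:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SETS   8     /* 3-bit set index: assumed 8 macros per column */
#define ROWS   128   /* 7-bit index into each 2KB macro */
#define QWORDS 4     /* a 16B quarter-line = 4 32-bit words */

/* The SRAM matrix: 4 columns, each a stack of SETS macros. */
static uint32_t macro[4][SETS][ROWS][QWORDS];

/* Fetch 16 words in one cycle, treating the set-associative array as if
   it were direct-mapped: the predictor, not a tag compare, picks the set
   independently for each column. */
static void fp_fetch(const uint8_t idx[4], const uint8_t set[4],
                     uint32_t out[16])
{
    for (int c = 0; c < 4; c++)
        memcpy(&out[4 * c], macro[c][set[c]][idx[c]],
               QWORDS * sizeof(uint32_t));
}
```

Because each column gets its own {index, set} pair, the 4 quarters delivered in one cycle need not come from the same line or even the same set.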
Doubly non-sequential access:
There are many occurrences where a number of instructions lie on the
sequential path, then a conditional branch to a short run of instructions on
the alternate path ending with a direct branch to somewhere else. We use
the next fetch-predictor access field so that this direct branch does not incur an additional cycle of fetch (or execute) latency. This direct branch
can be a {branch, call, or return}.
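One way to picture that next-access field, assuming a small table of fetch-predictor entries (the table size and all names here are invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

#define FP_ENTRIES 64   /* size of the fetch-predictor table: my guess */

/* Each entry describes one cycle's 4 ICache accesses plus the predictor
   entry to use on the following cycle, so the trailing direct branch of
   a short alternate path costs no extra fetch cycle. */
struct fp_entry {
    uint8_t idx[4], set[4];  /* 4 x (7-bit index, 3-bit set) */
    uint8_t next;            /* fetch-predictor entry for next cycle */
};

static struct fp_entry fp_table[FP_ENTRIES];

/* One predicted-fetch step: issue the 4 ICache accesses described by
   fp_table[cur] (elided here), then return the entry for next cycle. */
static uint8_t fp_step(uint8_t cur)
{
    return fp_table[cur].next % FP_ENTRIES;
}
```

The fetch stream then follows the chain of next fields cycle after cycle, with no bubble at the doubly non-sequential point.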
Ramifications:
When instructions are written into the ICache, they are positioned in a set that allows the fetch-predictor to access both the sequential path of instructions and the alternate path of instructions.
All instructions are always fetched from the ICache, which is kept coherent by external SNOOP activity, so there is minimal excess state and no surgery at context switches or the like.
ICache placement ends up dependent on how control flow arrived at the instructions being written (satisfying the access method above).
This organization satisfies several "hard" cases::
a) 3 ST instructions, each 5 words in size: the ICache access supplies 16 words; all 4×¼ accesses are sequential but may span cache-line boundaries and set placements. These sequences are found in subroutine prologues setting up local variables with static assignments on the stack. The proposed machine can only perform 3 memory references per cycle, so this seems to be a reasonable balance.
b) One can process sequential instructions up to a call and several instructions at the call target in the same issue cycle. The same can transpire on return.
c) Should a return find a subsequent call (after a few instructions), the EXIT instruction can be cut short and the ENTER instruction cut short, because all the preserved registers are already where they need to be on the call/return stack; this takes fewer cycles wandering around the call/return tree.
So:: the fetch-predictor entry contains 5 accesses: 4 to the ICache for
instructions and 1 to itself for the next fetch-prediction.
{ set[0] column[0] set[1] column[1] set[2] column[2] set[3] column[3] next }
   |    +-------+   |   +-------+   |   +-------+   |   +-------+   |  +-------+
   |    |       |   +-->|       |   |   |       |   |   |       |   +->|       |
   |    +-------+       +-------+   |   +-------+   |   +-------+      +-------+
   +--->|       |       |       |   |   |       |   |   |       |
        +-------+       +-------+   |   +-------+   |   +-------+
        |       |       |       |   +-->|       |   +-->|       |
        +-------+       +-------+       +-------+       +-------+
        |       |       |       |       |       |       |       |
        +-------+       +-------+       +-------+       +-------+
            |               |               |               |
            V               V               V               V
         inst[0]         inst[1]         inst[2]         inst[3]
The instruction groups still have to be "routed" into some semblance of order, but this can take place over the 2 or 3 decode cycles.
All of the ICache tag checking is performed "later" in the pipeline, taking tag-check and selection multiplexing out of the instruction delivery path.
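The deferred tag check could look roughly like this (a sketch; the function and names are mine): one compare per quarter, off the delivery path, with any mismatch squashing the fetched group and retraining the predictor:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Compare the tags read from the 4 predicted {set, index} locations
   against the tags the fetch addresses demand. This runs in parallel
   with decode; returning false means the predictor picked a stale set,
   so the group is squashed and the predictor retrained, instead of a
   tag compare sitting in the instruction-delivery path. */
static bool late_tag_check(const uint32_t read_tags[4],
                           const uint32_t pc_tags[4])
{
    for (int c = 0; c < 4; c++)
        if (read_tags[c] != pc_tags[c])
            return false;   /* squash fetch group, retrain predictor */
    return true;
}
```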
--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)