EricP wrote:
MitchAlsup wrote:
That result feeds to the Log2 Parser which selects up to 6 instructions
from those source bytes.
Fetch Line Buf 0 (fully assoc index)
Fetch Line Buf 1
Fetch Line Buf 2
Fetch Line Buf 3
      v          v
  32B Blk1   32B Blk0
      v          v
  Alignment Shifter 8:1 muxes
               v
  Log2 Parser and Branch Detect
   v    v    v    v    v    v
  I5   I4   I3   I2   I1   I0
I have been thinking about this in the background for the last month. It seems
to me that a small fetch-predictor is in order.
This fetch-predictor makes use of the natural organization of the ICache as
a matrix of SRAM macros (of some given size:: say 2KB), each SRAM macro having
a ¼-line access width. Let us call this the horizontal direction. In the vertical direction we have sets (or ways, if you prefer).
Each 2KB SRAM macro is 128 bits wide by 128 words deep, so we need a 7-bit index.
Each SRAM column has {2,3,4,...} SRAM macros {4=16KB ICache}; so we need {2,3,...} bits of set index.
Putting 4 of these index sets together gives us a (7+3)×4 = 40-bit fetch-predictor entry, plus a few bits for state and control. {{We may need to
add a field used to access the fetch-predictor for the next cycle.}}
We are now in a position to access 4×¼ = 1 cache line (16 words) from the matrix of SRAM macros.
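As a sanity check on the entry width, here is a minimal C sketch of packing the 4 {index, set} pairs into one 40-bit entry. Only the 7-bit and 3-bit field widths come from the description above; the packing order and the names are my own invention:

```c
#include <assert.h>
#include <stdint.h>

/* 40 bits used; the remaining bits of the word are spare for state/control. */
typedef uint64_t fp_entry_t;

/* Pack 4 fields of (3-bit set | 7-bit macro index), low-to-high. */
static fp_entry_t fp_pack(const uint8_t idx[4], const uint8_t set[4])
{
    fp_entry_t e = 0;
    for (int c = 0; c < 4; c++) {
        uint64_t field = ((uint64_t)(set[c] & 0x7) << 7) | (idx[c] & 0x7F);
        e |= field << (10 * c);
    }
    return e;
}

static void fp_unpack(fp_entry_t e, uint8_t idx[4], uint8_t set[4])
{
    for (int c = 0; c < 4; c++) {
        uint64_t field = (e >> (10 * c)) & 0x3FF;
        idx[c] = field & 0x7F;
        set[c] = (field >> 7) & 0x7;
    }
}
```

Four 10-bit fields fit comfortably in one word, with 24 bits left over for the state, control, and next-access fields mentioned above.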
Sequential access:
It is easy to see that one can access 16 words (16 potential instructions)
in a linear sequence even when the access crosses a cache line boundary.
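A small sketch of that sequential case, under my assumption of 64B lines with column c holding quarter-line c of every line: a 16-word fetch starting at quarter 'start' pulls the leading columns from the following line, so their 7-bit index is incremented mod 128 (a different set for the next line would come from the predictor, not from this arithmetic):

```c
#include <assert.h>
#include <stdint.h>

/* For a sequential fetch of 4 quarter-lines beginning at quarter 'start'
   of line index 'line', columns c < start serve line+1 (index wraps mod
   128); the rest serve 'line'. */
static void seq_indices(uint8_t line, int start, uint8_t idx_out[4])
{
    for (int c = 0; c < 4; c++)
        idx_out[c] = (c < start) ? (uint8_t)((line + 1) & 0x7F) : line;
}
```

So a fetch starting at quarter 2 of line 10 reads quarters 2,3 of line 10 and quarters 0,1 of line 11 in the same cycle, which is exactly the line-boundary-crossing case.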
Non-sequential access:
Given a 6-wide machine (and known instruction statistics wrt VLE utilization) and the assumption of 1 taken branch per issue width:: the fetch-predictor accesses 4 SRAM macros, indexing each macro with its 7-bit index and choosing the set with its 3-bit index. {We are accessing a set-associative cache as if it were direct-mapped.}
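That direct-mapped-style access might be sketched like this (the array dimensions are my assumptions: 8 sets per column, 128 rows per 2KB macro, 4 words per quarter-line); note that no tag compare appears anywhere in the path:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SETS   8     /* 3-bit set index: assumed 8 macros per column */
#define ROWS   128   /* 7-bit index into each 2KB macro */
#define QWORDS 4     /* a 16B quarter-line = 4 32-bit words */

/* The SRAM matrix: 4 columns, each a stack of SETS macros. */
static uint32_t macro[4][SETS][ROWS][QWORDS];

/* Fetch 16 words in one cycle, treating the set-associative array as if
   it were direct-mapped: the predictor, not a tag compare, picks the set
   independently for each column. */
static void fp_fetch(const uint8_t idx[4], const uint8_t set[4],
                     uint32_t out[16])
{
    for (int c = 0; c < 4; c++)
        memcpy(&out[4 * c], macro[c][set[c]][idx[c]],
               QWORDS * sizeof(uint32_t));
}
```

Because each column gets its own {index, set} pair, the 4 quarters delivered in one cycle need not come from the same line or even the same set.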
Doubly non-sequential access:
There are many occurrences where a number of instructions lie on the
sequential path, then a conditional branch to a short run of instructions on
the alternate path ending with a direct branch to somewhere else. We use
the next fetch-predictor access field so that this direct branch does not incur an additional cycle of fetch (or execute) latency. This direct branch
can be a {branch, call, or return}.
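One way to picture that next-access field, assuming a small table of fetch-predictor entries (the table size and all names here are invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

#define FP_ENTRIES 64   /* size of the fetch-predictor table: my guess */

/* Each entry describes one cycle's 4 ICache accesses plus the predictor
   entry to use on the following cycle, so the trailing direct branch of
   a short alternate path costs no extra fetch cycle. */
struct fp_entry {
    uint8_t idx[4], set[4];  /* 4 x (7-bit index, 3-bit set) */
    uint8_t next;            /* fetch-predictor entry for next cycle */
};

static struct fp_entry fp_table[FP_ENTRIES];

/* One predicted-fetch step: issue the 4 ICache accesses described by
   fp_table[cur] (elided here), then return the entry for next cycle. */
static uint8_t fp_step(uint8_t cur)
{
    return fp_table[cur].next % FP_ENTRIES;
}
```

The fetch stream then follows the chain of next fields cycle after cycle, with no bubble at the doubly non-sequential point.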
Ramifications:
When instructions are written into the ICache, they are positioned in a set that allows the fetch-predictor to access both the sequential path of instructions and the alternate path of instructions.
All instructions are always fetched from the ICache, which is kept coherent by external SNOOP activity, so there is minimal excess state and no surgery at context switches or the like.
ICache placement ends up dependent on how control flow arrived at the instructions being written (satisfying the access method above).
This organization satisfies several "hard" cases::
a) 3 ST instructions, each 5 words in size: the ICache access supplies 16 words; all 4×¼ accesses are sequential but may span cache-line boundaries and set placements. These sequences are found in subroutine prologues setting up local variables with static assignments on the stack. The proposed machine can only perform 3 memory references per cycle, so this seems to be a reasonable balance.
b) One can process sequential instructions up to a call and several instructions at the call target in the same issue cycle. The same can transpire on return.
c) Should a return find a subsequent call (after a few instructions), the EXIT instruction can be cut short and the ENTER instruction cut short, because all the preserved registers are already where they need to be on the call/return stack; this takes fewer cycles wandering around the call/return tree.
So:: the fetch-predictor entry contains 5 accesses: 4 to the ICache for
instructions and 1 to itself for the next fetch-prediction.
{ set[0] column[0] set[1] column[1] set[2] column[2] set[3] column[3] next }
   |    +-------+   |   +-------+   |   +-------+   |   +-------+   |  +-------+
   |    |       |   +-->|       |   |   |       |   |   |       |   +->|       |
   |    +-------+       +-------+   |   +-------+   |   +-------+      +-------+
   +--->|       |       |       |   |   |       |   |   |       |
        +-------+       +-------+   |   +-------+   |   +-------+
        |       |       |       |   +-->|       |   +-->|       |
        +-------+       +-------+       +-------+       +-------+
        |       |       |       |       |       |       |       |
        +-------+       +-------+       +-------+       +-------+
            |               |               |               |
            V               V               V               V
         inst[0]         inst[1]         inst[2]         inst[3]
The instruction groups still have to be "routed" into some semblance of order, but this can take place over the 2 or 3 decode cycles.
All of the ICache tag checking is performed "later" in the pipeline, taking tag-check and selection multiplexing out of the instruction delivery path.
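The deferred tag check could look roughly like this (a sketch; the function and names are mine): one compare per quarter, off the delivery path, with any mismatch squashing the fetched group and retraining the predictor:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Compare the tags read from the 4 predicted {set, index} locations
   against the tags the fetch addresses demand. This runs in parallel
   with decode; returning false means the predictor picked a stale set,
   so the group is squashed and the predictor retrained, instead of a
   tag compare sitting in the instruction-delivery path. */
static bool late_tag_check(const uint32_t read_tags[4],
                           const uint32_t pc_tags[4])
{
    for (int c = 0; c < 4; c++)
        if (read_tags[c] != pc_tags[c])
            return false;   /* squash fetch group, retrain predictor */
    return true;
}
```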
--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)