• Decoding Instructions in Parallel

    From Quadibloc@21:1/5 to All on Sat Jan 6 16:36:31 2024
    Given that I do not know a whole lot about how cache
    coherency is done, and Mitch asked me what approach
    I was planning to take...

    I went on a web search to find more information on
    the subject.

    I learned that MSI went to MESI... and then there were
    a bunch of "ownership" schemes, such as Berkeley,
    Illinois, Firefly, and Dragon.

    By 1999, AMD seems to have done something in that area
    with MOESI, and later on Intel came up with MESIF instead,
    where "F", for Forwarding, is _like_ owned data, but it
    is also saved to RAM. Engineers at Intel recently also
    wrote papers on "MOESI Prime", which has primed versions
    of two of the MOESI states to avoid the cache coherency
    mechanism causing RowHammer-like behavior.

    Anyways... there was something else I found while looking
    this stuff up.

    I had noted that one of the reasons for offering the
    programmer a choice of writing programs with 32-bit
    long instructions and nothing but 32-bit long instructions,
    or using block headers for blocks of 256 bits in code,
    was to allow instructions to be decoded in parallel.

    Mitch pointed out that one could just start decoding
    in parallel at every possible instruction start location,
    while also, in parallel, quickly resolving instruction
    lengths so as to find which decodes result in executions.

    I acknowledged that one could certainly do that, but
    since it was somewhat wasteful of heat and electricity,
    I didn't think of this as describing a _typical_
    implementation of my ISA (and hence parallel decoding
    was still an excuse for having a block structure rather
    than conventional CISC-like variable-length instructions).

    Well, one of my search results showed that this was how
    they did it on the first 64-bit Opterons, from AMD, so
    that explains why this technique came so readily to
    Mitch's mind!

    John Savard

  • From MitchAlsup@21:1/5 to Quadibloc on Sat Jan 6 19:16:30 2024
    Quadibloc wrote:

    Given that I do not know a whole lot about how cache
    coherency is done, and Mitch asked me what approach
    I was planning to take...

    I went on a web search to find more information on
    the subject.

    I learned that MSI went to MESI... and then there were
    a bunch of "ownership" schemes, such as Berkeley,
    Illinois, Firefly, and Dragon.

    By 1999, AMD seems to have done something in that area
    with MOESI, and later on Intel came up with MESIF instead,
    where "F", for Forwarding, is _like_ owned data, but it
    is also saved to RAM. Engineers at Intel recently also
    wrote papers on "MOESI Prime", which has primed versions
    of two of the MOESI states to avoid the cache coherency
    mechanism causing RowHammer-like behavior.

    The OWNED state represents the concept that this copy is the
    only valid copy, so you better not lose it. A request can
    arrive back with OWNED data (in some protocols) and now the
    recipient is in charge of not losing it.

    Anyways... there was something else I found while looking
    this stuff up.

    I had noted that one of the reasons for offering the
    programmer a choice of writing programs with 32-bit
    long instructions and nothing but 32-bit long instructions,
    or using block headers for blocks of 256 bits in code,
    was to allow instructions to be decoded in parallel.

    Mitch pointed out that one could just start decoding
    in parallel at every possible instruction start location,

    Consider reading 4 words at a time out of the ICache. Even
    before one compares the tag and selects the data to be
    decoded, one can apply a block of logic about 40 gates in
    size and 4 gates of delay and have unary pointers to the
    {Next instruction, any displacement, any constant} by the
    time the tags have been compared; the 4 words are then
    gated out with these extra pointers (8 bits) on top of the
    128 bits of instructions.

    Each Next instruction pointer selects its successor, and
    a tree of these resolves 2->4->8->16 at 1 more gate of
    delay each. {Higher exponents seem accessible if desired}
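
    A minimal C sketch of this boundary-finding step, with a
    made-up length encoding standing in for the real bit
    patterns (which are not spelled out here): each word
    position computes its own next-instruction pointer
    independently, and the chaining that hardware would do
    with a log-depth tree is only modeled by the final loop.

      #include <stdint.h>
      #include <stdbool.h>

      #define GROUP 4   /* four 32-bit words per fetch group, as in the post */

      /* Hypothetical length decode: the bit patterns below are placeholders,
       * not a real ISA's encoding.  It answers: if an instruction started at
       * this word, how many 32-bit words would it occupy? */
      static int words_in_instruction(uint32_t w)
      {
          unsigned major = w >> 26;                 /* top 6 bits, stand-in major opcode */
          if ((major & 0x38u) == 0x08u)             /* pretend 001xxx marks variable length */
              return 1 + ((w >> 15) & 1) + ((w >> 13) & 1);  /* 1..3 words, invented rule */
          return 1;
      }

      /* PARSE model.  Step 1 is the part done at every position in parallel
       * (the small 40-gate block described above): each word computes where
       * the next instruction would start if one started here.  Step 2 chains
       * those pointers from position 0 (the group is assumed to begin on an
       * instruction boundary); hardware does this with a 2->4->8->16 tree in
       * log depth, the loop below only models the result. */
      void parse_group(const uint32_t words[GROUP], bool is_start[GROUP])
      {
          int next_if_start[GROUP];

          for (int i = 0; i < GROUP; i++)           /* all positions in parallel */
              next_if_start[i] = i + words_in_instruction(words[i]);

          for (int i = 0; i < GROUP; i++)
              is_start[i] = false;
          for (int p = 0; p < GROUP; p = next_if_start[p])  /* chain from position 0 */
              is_start[p] = true;
      }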

    while also, in parallel, quickly resolving instruction
    lengths so as to find which decodes result in executions.

    Generally one associates DECODE with the point where
    logical registers are applied to either the physical
    register file or to the register renamer. These are ports
    one must use efficiently, and if possible the stage before
    DECODE (which I call PARSE) routes instructions to suitable
    DECODERs {especially important in ISAs with multiple
    register files {GPR, FP, SIMD}}.

    I acknowledged that one could certainly do that, but
    since it was somewhat wasteful of heat and electricity,

    Separating PARSE from DECODE minimizes the waste heat,
    since all we are doing is looking at enough bits to route
    the instruction to somewhere it can be efficiently DECODEd.
    DECODE accesses the register ports and does all sorts of
    big-gate-count decoding; PARSE uses tiny pattern decoders
    only to route instructions.
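
    As a sketch of how little logic PARSE-stage routing needs,
    here is a C fragment that classifies an instruction word
    into a decoder class from a handful of bits; the bit
    patterns are invented for illustration, not taken from any
    real ISA.

      #include <stdint.h>

      /* Decoder classes for an ISA with separate GPR, FP and SIMD
       * register files. */
      enum decoder_class { DEC_GPR, DEC_FP, DEC_SIMD };

      /* Tiny pattern decode: route on the top 3 bits only.  The
       * patterns below are purely illustrative. */
      static enum decoder_class route(uint32_t w)
      {
          switch (w >> 29) {
          case 0x5: return DEC_FP;        /* invented pattern for FP   */
          case 0x6: return DEC_SIMD;      /* invented pattern for SIMD */
          default:  return DEC_GPR;       /* everything else: integer  */
          }
      }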

    I didn't think of this as describing a _typical_
    implementation of my ISA (and hence parallel decoding
    was still an excuse for having a block structure rather
    than conventional CISC-like variable-length instructions).

    Well, one of my search results showed that this was how
    they did it on the first 64-bit Opterons, from AMD, so
    that explains why this technique came so readily to
    Mitch's mind!

    Burned in solid. Opteron used a trailing marker bit so we
    knew whether we were looking at the last byte of an
    instruction (or not). My 66000 uses 4 Major OpCode patterns
    from 001xxx, and then uses 4 bit positions {15,14,13,11} to
    decode all VLE size information.
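
    A small C sketch of the trailing-marker idea mentioned
    above: with one predecode bit per byte marking instruction
    ends, every position in a fetch window can decide
    independently whether it begins an instruction.

      #include <stdbool.h>
      #include <stddef.h>

      /* One predecode bit per I-cache byte, set on the last byte of each
       * instruction (the trailing marker the post attributes to Opteron).
       * An instruction then starts wherever the previous byte was marked,
       * and every position can work that out independently.  Position 0
       * is assumed to be an instruction boundary here. */
      void starts_from_end_marks(const bool end_mark[], bool is_start[], size_t n)
      {
          for (size_t i = 0; i < n; i++)      /* each position is independent */
              is_start[i] = (i == 0) ? true : end_mark[i - 1];
      }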

    John Savard

  • From EricP@21:1/5 to MitchAlsup on Mon Jan 8 12:20:01 2024
    MitchAlsup wrote:
    Quadibloc wrote:

    Given that I do not know a whole lot about how cache
    coherency is done, and Mitch asked me what approach
    I was planning to take...

    I went on a web search to find more information on
    the subject.

    I learned that MSI went to MESI... and then there were
    a bunch of "ownership" schemes, such as Berkeley,
    Illinois, Firefly, and Dragon.

    By 1999, AMD seems to have done something in that area
    with MOESI, and later on Intel came up with MESIF instead,
    where "F", for Forwarding, is _like_ owned data, but it
    is also saved to RAM. Engineers at Intel recently also
    wrote papers on "MOESI Prime", which has primed versions
    of two of the MOESI states to avoid the cache coherency
    mechanism causing RowHammer-like behavior.

    The Forward state is to address the issue of who should respond to a
    request for a shared copy of a line when there are multiple sharers.
    If multiple sharers respond it could flood a requester with redundant
    messages.

    The Directory Controller (DC) records which lines are held in each
    core and in what state. It remembers the most recent core to
    read-share a line as the Forward state holder, on the assumption
    that that copy is most likely still resident, while the prior
    readers are tracked in the Shared state.

    The cache with the line in the Forward state is told to send a
    shared copy to a read-shared requester, who becomes the line's new
    Forward state holder. If no Forward copy is available the DC reads
    from DRAM.
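
    A toy C model of that directory decision, with illustrative field
    names; it only captures who supplies the data and who becomes the
    next Forward holder.

      #include <stdint.h>

      #define NO_CORE (-1)

      /* Directory entry for one line: which cores hold it Shared, and
       * which core (if any) holds it in the Forward state. */
      struct dir_entry {
          uint32_t sharers;       /* bitmap of cores holding the line Shared */
          int      fwd_holder;    /* core holding it in Forward, or NO_CORE  */
      };

      /* Read-shared request: the current Forward holder sources the data
       * and drops to Shared, the requester becomes the new Forward holder.
       * Returns the core that should supply the line, or NO_CORE to mean
       * the directory must read it from DRAM. */
      int read_shared(struct dir_entry *e, int requester)
      {
          int source = e->fwd_holder;
          if (source != NO_CORE)
              e->sharers |= 1u << source;   /* old Forward holder -> Shared */
          e->fwd_holder = requester;        /* requester -> Forward holder  */
          return source;
      }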

    The OWNED state represents the concept that this copy is the
    only valid copy, so you better not lose it. A request can
    arrive back with OWNED data (in some protocols) and now the recipient is
    in charge of not losing it.

    Also, OWNED is the modified-shared state, where the owner modifies a
    line and then shares read-only copies of it. The ownership can be
    passed to a new cache without writing the line back to DRAM or
    invalidating the shared copies. To modify the line again the owner
    first has to invalidate all the shared copies, returning it to the
    Exclusive state. When the owner eventually evicts the line, it is
    responsible for writing it back to DRAM.
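
    A compact C sketch of that Owned-line life cycle; the state names and
    the bookkeeping are modeled on the description above, not on any
    particular implementation, and the eviction handoff is only an option
    some protocols provide.

      #include <stdbool.h>

      #define NCACHES 4
      #define NONE    (-1)

      enum moesi { ST_I, ST_S, ST_E, ST_O, ST_M };

      struct line {
          enum moesi st[NCACHES];   /* per-cache state of one line      */
          int        owner;         /* cache holding it O or M, or NONE */
      };

      /* Another cache read-shares a dirty line: the holder drops M -> O
       * and the reader gets S; nothing is written back to DRAM. */
      void read_share_dirty(struct line *l, int reader)
      {
          if (l->owner != NONE && l->st[l->owner] == ST_M)
              l->st[l->owner] = ST_O;
          l->st[reader] = ST_S;
      }

      /* The owner wants to write again: all Shared copies are invalidated
       * first, and the owner's copy becomes exclusively writable again. */
      void owner_rewrites(struct line *l)
      {
          for (int c = 0; c < NCACHES; c++)
              if (c != l->owner && l->st[c] == ST_S)
                  l->st[c] = ST_I;
          l->st[l->owner] = ST_M;
      }

      /* Eviction of the owned copy: either the data goes back to DRAM, or
       * (in protocols that allow it) OWNERship is handed to another cache.
       * Returns true when the caller must perform the DRAM writeback. */
      bool evict_owner(struct line *l, int new_owner /* or NONE */)
      {
          l->st[l->owner] = ST_I;
          l->owner = new_owner;
          if (new_owner != NONE) {
              l->st[new_owner] = ST_O;
              return false;
          }
          return true;
      }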

  • From MitchAlsup@21:1/5 to EricP on Mon Jan 8 22:13:37 2024
    EricP wrote:

    MitchAlsup wrote:
    Quadibloc wrote:

    Given that I do not know a whole lot about how cache
    coherency is done, and Mitch asked me what approach
    I was planning to take...

    I went on a web search to find more information on
    the subject.

    I learned that MSI went to MESI... and then there were
    a bunch of "ownership" schemes, such as Berkeley,
    Illinois, Firefly, and Dragon.

    By 1999, AMD seems to have done something in that area
    with MOESI, and later on Intel came up with MESIF instead,
    where "F", for Forwarding, is _like_ owned data, but it
    is also saved to RAM. Engineers at Intel recently also
    wrote papers on "MOESI Prime", which has primed versions
    of two of the MOESI states to avoid the cache coherency
    mechanism causing RowHammer-like behavior.

    The Forward state is to address the issue of who should respond to a
    request for a shared copy of a line when there are multiple sharers.
    If multiple sharers respond it could flood a requester with redundant messages.

    The Directory Controller (DC) records which lines are held in each
    core and in what state. It remembers the most recent core to
    read-share a line as the Forward state holder, on the assumption
    that that copy is most likely still resident, while the prior
    readers are tracked in the Shared state.

    The cache with the line in the Forward state is told to send a
    shared copy to a read-shared requester, who becomes the line's new
    Forward state holder. If no Forward copy is available the DC reads
    from DRAM.

    The OWNED state represents the concept that this copy is the
    only valid copy, so you better not lose it. A request can
    arrive back with OWNED data (in some protocols) and now the recipient is
    in charge of not losing it.

    Also, OWNED is the modified-shared state, where the owner modifies a
    line and then shares read-only copies of it. The ownership can be
    passed to a new cache without writing the line back to DRAM or
    invalidating the shared copies. To modify the line again the owner
    first has to invalidate all the shared copies, returning it to the
    Exclusive state.

    Granted

    When the owner eventually evicts the line, it is responsible for writing
    it back to DRAM.

    Or sending it to another cache that can take OWNERship over it.
