• Spectre vs SC (was: Concertina II Progress)

    From Stefan Monnier@21:1/5 to All on Mon Dec 4 14:17:43 2023
    MitchAlsup [2023-12-04 18:58:51] wrote:
    Quadibloc wrote:
    On Sun, 03 Dec 2023 20:18:30 +0000, MitchAlsup wrote:
    You CAN build Spectre-free, Meltdown-free, ROP-free,... in GBOoO by
    following one simple rule:: No microarchitectural changes until the
    causing instruction retires. AND you can do this without losing
    performance.
    I thought that the mitigations that _were_ costly in performance
    were mostly attempts to approach following just that rule.

    IIUC the mitigations are all done in software, and on the hardware side
    they seem to largely disregard the issue as if they had abandoned all
    hope to solve it at all.
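
    For concreteness, the class of leak being discussed is the canonical
    Spectre v1 bounds-check bypass. A minimal sketch in C, following the
    published gadget (array and function names are illustrative, not from
    any real codebase):

        #include <stdint.h>
        #include <stddef.h>

        uint8_t array1[16];
        size_t  array1_size = 16;
        uint8_t array2[256 * 512];  /* probe array: one cache line per byte value */

        /* If "idx < array1_size" is mispredicted taken, the out-of-bounds
           read of array1[idx] and the dependent read of array2[] still run
           speculatively; the cache line they touch survives the squash and
           can be recovered afterwards by timing accesses to array2. */
        uint8_t victim(size_t idx)
        {
            if (idx < array1_size)
                return array2[array1[idx] * 512];
            return 0;
        }

    The rule quoted above closes this channel: the speculative array2
    access may not change the cache (or any other microarchitectural
    state) until the bounds check retires.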

    The mitigations were closer to:: cause the problem to vanish,
    but change as little of the µArchitecture as possible in doing
    it. But 6 years later, they apparently are still unwilling to
    alter the µArchitecture enough to completely eliminate them.

    While you write above that "you can do this without losing
    performance", IIUC it does have a cost in that you have to keep more
    information as "speculative" and for longer. Presumably we can make
    this cost small.

    But it reminds me of a recent discussion with similar costs: sequential
    consistency, where again the cost is to keep more instructions as
    "speculative" for longer.

    Is my association of those two just the result of my lack of knowledge
    of how these things are really implemented, or is there indeed
    some similarity?


    Stefan

  • From Scott Lurndal@21:1/5 to Stefan Monnier on Mon Dec 4 20:53:23 2023
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    MitchAlsup [2023-12-04 18:58:51] wrote:
    Quadibloc wrote:
    On Sun, 03 Dec 2023 20:18:30 +0000, MitchAlsup wrote:
    You CAN build Spectre-free, Meltdown-free, ROP-free,... in GBOoO by
    following one simple rule:: No microarchitectural changes until the
    causing instruction retires. AND you can do this without losing
    performance.
    I thought that the mitigations that _were_ costly in performance
    were mostly attempts to approach following just that rule.

    IIUC the mitigations are all done in software, and on the hardware side
    they seem to largely disregard the issue as if they had abandoned all
    hope to solve it at all.

    On the x86_64 side, perhaps.

    ARM added a number of hint instructions, including a speculation
    barrier instruction, that provide some hardware support for mitigations.
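
    The usual pattern puts such a barrier between a bounds check and the
    dependent load. A sketch in C with inline assembly (function and array
    names are illustrative; SB is the hint-encoded AArch64 Speculation
    Barrier, and assembling it may require an -march setting with +sb):

        #include <stdint.h>
        #include <stddef.h>

        extern uint8_t table[256];

        /* SB keeps the load below from executing speculatively past the
           bounds check; on cores without the extension the hint encoding
           executes as a NOP. */
        uint8_t read_checked(size_t idx, size_t len)
        {
            if (idx >= len)
                return 0;
            __asm__ volatile("sb" ::: "memory");
            return table[idx];
        }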

  • From Anton Ertl@21:1/5 to Stefan Monnier on Tue Dec 5 17:59:36 2023
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    MitchAlsup [2023-12-04 18:58:51] wrote:
    Quadibloc wrote:
    On Sun, 03 Dec 2023 20:18:30 +0000, MitchAlsup wrote:
    You CAN build Spectre-free, Meltdown-free, ROP-free,... in GBOoO by
    following one simple rule:: No microarchitectural changes until the
    causing instruction retires. AND you can do this without losing
    performance.
    I thought that the mitigations that _were_ costly in performance
    were mostly attempts to approach following just that rule.

    IIUC the mitigations are all done in software, and on the hardware side
    they seem to largely disregard the issue as if they had abandoned all
    hope to solve it at all.

    For a while I thought that the hardware people thought that this is
    Somebody Else's Problem (like the Rowhammer disaster, which IMO would
    have been fixed long ago if the same manufacturer made the memory and
    the memory controller; hmm, in case of Samsung that's actually the
    case, but I guess that Samsung is large enough that it's as if the two
    groups were different manufacturers).

    However, reading "Speculative interference attacks: breaking invisible
    speculation schemes", which has 6 authors with affiliation "Intel
    Corporation, USA" (out of 16), someone at Intel seems to be aware of
    the right way. Of course, the title of the paper and some of the
    contents read as if it is intended to discourage the idea that Spectre
    can be fixed, but if you ignore that, the paper points out a few more
    problems that need to be fixed; some of them are included in what
    Mitch Alsup wrote above; there is also a side channel through resource
    contention that must also be closed, but that problem has been dealt
    with, too.

    So why the pessimistic title and partial content? Maybe it's just
    that the authors are security researchers, and a successful attack is
    a matter of prestige in this community, and such a paper has to be
    published first before a defense paper can be seen as an achievement.
    More paranoid explanations: 1) Intel does not plan to do anything, so
    they don't want people to know that they could do something. 2) Intel
    is working on a microarchitecture that fixes Spectre, but wants to
    maximize this as a competitive advantage, so they want (superficial)
    readers from the competition to get the impression that there is no
    point in working on a fix.

    The mitigations were closer to:: cause the problem to vanish,
    but change as little of the µArchitecture as possible in doing
    it. But 6 years later, they apparently are still unwilling to
    alter the µArchitecture enough to completely eliminate them.

    While you write above that "you can do this without losing
    performance", IIUC it does have a cost in that you have to keep more
    information as "speculative" and for longer.

    You just have to keep speculative microarchitectural information
    speculative (which costs some area). Once it's no longer speculative,
    it can be promoted to the "permanent" microarchitectural state.

    But it reminds me of a recent discussion with similar costs: sequential
    consistency, where again the cost is to keep more instructions as
    "speculative" for longer.

    That's different. Sequential consistency can be implemented
    efficiently through an additional speculation mechanism: you would
    speculate that the values you loaded are consistent with an ordering
    of the loads and stores in the whole system that satisfies sequential
    consistency. So there you would speculate until that order is known
    to your local core, which could be a relatively long time.
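
    The standard illustration of what such speculation must preserve is
    the store-buffering litmus test; a C11 sketch (the two thread
    functions are assumed to run concurrently on two cores):

        #include <stdatomic.h>

        atomic_int x, y;   /* both initially 0 */
        int r0, r1;

        void thread0(void) {
            atomic_store(&x, 1);     /* seq_cst by default */
            r0 = atomic_load(&y);
        }

        void thread1(void) {
            atomic_store(&y, 1);
            r1 = atomic_load(&x);
        }

        /* Sequential consistency forbids r0 == 0 && r1 == 0; hardware
           store buffers (or memory_order_relaxed) allow it. The scheme
           above issues the loads early and squashes if the globally
           agreed order turns out to contradict the values observed. */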

    Fixing Spectre requires no additional speculation.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From MitchAlsup@21:1/5 to Anton Ertl on Tue Dec 5 20:52:14 2023
    Anton Ertl wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    MitchAlsup [2023-12-04 18:58:51] wrote:
    Quadibloc wrote:
    On Sun, 03 Dec 2023 20:18:30 +0000, MitchAlsup wrote:
    You CAN build Spectre-free, Meltdown-free, ROP-free,... in GBOoO by
    following one simple rule:: No microarchitectural changes until the
    causing instruction retires. AND you can do this without losing
    performance.
    I thought that the mitigations that _were_ costly in performance
    were mostly attempts to approach following just that rule.

    IIUC the mitigations are all done in software, and on the hardware side
    they seem to largely disregard the issue as if they had abandoned all
    hope to solve it at all.

    For a while I thought that the hardware people thought that this is
    Somebody Else's Problem (like the Rowhammer disaster, which IMO would
    have been fixed long ago if the same manufacturer made the memory and
    the memory controller; hmm, in case of Samsung that's actually the
    case, but I guess that Samsung is large enough that it's as if the two
    groups were different manufacturers).

    However, reading "Speculative interference attacks: breaking invisible
    speculation schemes", which has 6 authors with affiliation "Intel
    Corporation, USA" (out of 16), someone at Intel seems to be aware of
    the right way. Of course, the title of the paper and some of the
    contents read as if it is intended to discourage the idea that Spectre
    can be fixed, but if you ignore that, the paper points out a few more
    problems that need to be fixed; some of them are included in what
    Mitch Alsup wrote above; there is also a side channel through resource
    contention that must also be closed, but that problem has been dealt
    with, too.

    So why the pessimistic title and partial content? Maybe it's just
    that the authors are security researchers, and a successful attack is
    a matter of prestige in this community, and such a paper has to be
    published first before a defense paper can be seen as an achievement.
    More paranoid explanations: 1) Intel does not plan to do anything, so
    they don't want people to know that they could do something. 2) Intel
    is working on a microarchitecture that fixes Spectre, but wants to
    maximize this as a competitive advantage, so they want (superficial)
    readers from the competition to get the impression that there is no
    point in working on a fix.

    The mitigations were closer to:: cause the problem to vanish,
    but change as little of the µArchitecture as possible in doing
    it. But 6 years later, they apparently are still unwilling to
    alter the µArchitecture enough to completely eliminate them.

    While you write above that "you can do this without losing
    performance", IIUC it does have a cost in that you have to keep more
    information as "speculative" and for longer.

    You just have to keep speculative microarchitectural information
    speculative (which costs some area). Once it's no longer speculative,
    it can be promoted to the "permanent" microarchitectural state.

    Retired as it were.

    But it reminds me of a recent discussion with similar costs: sequential
    consistency, where again the cost is to keep more instructions as
    "speculative" for longer.

    That's different. Sequential consistency can be implemented
    efficiently through an additional speculation mechanism: you would
    speculate that the values you loaded are consistent with an ordering
    of the loads and stores in the whole system that satisfies sequential
    consistency. So there you would speculate until that order is known
    to your local core, which could be a relatively long time.

    Sequential consistency can be implemented by a minor addition to a
    memory dependence matrix, where dependencies are relaxed less
    aggressively.

    Fixing Spectre requires no additional speculation.

    - anton

  • From Anton Ertl@21:1/5 to Scott Lurndal on Tue Dec 5 21:27:35 2023
    scott@slp53.sl.home (Scott Lurndal) writes:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    IIUC the mitigations are all done in software, and on the hardware side
    they seem to largely disregard the issue as if they had abandoned all
    hope to solve it at all.

    On the x86_64 side, perhaps.

    ARM added a number of hint instructions, including a speculation
    barrier instruction, that provide some hardware support for mitigations.

    Such mitigation support instructions (which exist for Intel and AMD,
    too) are not the fix, on the contrary: If Spectre is fixed in the
    hardware, no such instructions are necessary. E.g., on ARM these
    instructions should be unnecessary (and do nothing) on A53, A55, A510
    and A520, because they do not perform speculative execution.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Scott Lurndal@21:1/5 to Anton Ertl on Tue Dec 5 21:34:29 2023
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    IIUC the mitigations are all done in software, and on the hardware side
    they seem to largely disregard the issue as if they had abandoned all
    hope to solve it at all.

    On the x86_64 side, perhaps.

    ARM added a number of hint instructions, including a speculation
    barrier instruction, that provide some hardware support for mitigations.

    Such mitigation support instructions (which exist for Intel and AMD,
    too) are not the fix, on the contrary: If Spectre is fixed in the
    hardware, no such instructions are necessary. E.g., on ARM these
    instructions should be unnecessary (and do nothing) on A53, A55, A510
    and A520, because they do not perform speculative execution.

    That's why those new instructions were allocated out of the "hint"
    instruction encoding.

  • From Stefan Monnier@21:1/5 to All on Tue Dec 5 17:40:42 2023
    While you write above that "you can do this without losing
    performance", IIUC it does have a cost in that you have to keep more
    information as "speculative" and for longer.

    You just have to keep speculative microarchitectural information
    speculative (which costs some area). Once it's no longer speculative,
    it can be promoted to the "permanent" microarchitectural state.

    Right, so it needs extra hardware to keep extra information until we
    know for sure that it's not speculative.

    But it reminds me of a recent discussion with similar costs: sequential
    consistency, where again the cost is to keep more instructions as
    "speculative" for longer.
    That's different. Sequential consistency can be implemented
    efficiently through an additional speculation mechanism: you would
    speculate that the values you loaded are consistent with an ordering
    of the loads and stores in the whole system that satisfies sequential
    consistency. So there you would speculate until that order is known
    to your local core, which could be a relatively long time.

    IIUC this boils down to keeping some memory operations (reads) as
    speculative for a longer amount of time (until all previous memory
    operations are not speculative any more), which delays their retirement.
    So the cost is that it requires keeping more instructions "in flight",
    so it increases the "residency" of instructions in things like a ROB,
    meaning that for a fixed size ROB you'll get less performance, or that
    to keep the same performance you need a larger ROB.
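
    That residency argument is just Little's law: instructions in flight
    equal completion rate times time in flight. A toy calculation (the
    numbers are illustrative, not measurements):

        #include <stdio.h>

        int main(void)
        {
            double ipc       = 4.0;    /* sustained instructions per cycle   */
            double residency = 150.0;  /* avg cycles from dispatch to retire */
            /* Little's law: entries needed to sustain that rate */
            printf("ROB entries: ~%.0f\n", ipc * residency);   /* ~600 */
            return 0;
        }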

    So in both cases, we need extra hardware to keep track of extra
    speculative info. In one case it's extra info about existing
    speculation, and in the other it's existing info but about "extra
    speculation".


    Stefan

  • From Anton Ertl@21:1/5 to MitchAlsup on Wed Dec 6 08:37:49 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Anton Ertl wrote:
    You just have to keep speculative microarchitectural information
    speculative (which costs some area). Once it's no longer speculative,
    it can be promoted to the "permanent" microarchitectural state.

    Retired as it were.

    Yes, although I think that "committed" is the better term in this
    context.

    That's different. Sequential consistency can be implemented
    efficiently through an additional speculation mechanism: you would
    speculate that the values you loaded are consistent with an ordering
    of the loads and stores in the whole system that satisfies sequential
    consistency. So there you would speculate until that order is known
    to your local core, which could be a relatively long time.

    Sequential consistency can be implemented by a minor addition to a
    memory dependence matrix, where dependencies are relaxed less
    aggressively.

    Yes, you can implement sequential consistency by restricting the
    ordering, but according to the advocates of weak consistency models,
    this costs performance for single-threaded code. I guess that's also
    why your architecture has blocks of sequential consistency, but
    switches to weaker consistency in other parts of the program.

    The alternative I outlined above is to speculate that the loaded value
    is not written by some other core in a way that would violate
    sequential consistency, and abandon the speculative state if that
    turns out to be wrong. For single-threaded code this provides the
    same performance as weak consistency (as long as the CPU has enough
    speculative resources).

    For code that communicates with other cores, it experiences slowdowns
    only in the contended case, unlike typical implementations of cores
    where weaker consistency is faster, and the programmer has to throw in
    membars or somesuch that always costs performance, or has to switch to
    a stronger consistency mode that always costs performance.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From MitchAlsup@21:1/5 to Stefan Monnier on Wed Dec 6 14:46:49 2023
    Stefan Monnier wrote:

    While you write above that "you can do this without losing
    performance", IIUC it does have a cost in that you have to keep more
    information as "speculative" and for longer.

    You just have to keep speculative microarchitectural information
    speculative (which costs some area). Once it's no longer speculative,
    it can be promoted to the "permanent" microarchitectural state.

    Right, so it needs extra hardware to keep extra information until we
    know for sure that it's not speculative.

    Why is the Miss Buffer considered "extra hardware" ??

    But it reminds me of a recent discussion with similar costs: sequential
    consistency, where again the cost is to keep more instructions as
    "speculative" for longer.
    That's different. Sequential consistency can be implemented
    efficiently through an additional speculation mechanism: you would
    speculate that the values you loaded are consistent with an ordering
    of the loads and stores in the whole system that satisfies sequential
    consistency. So there you would speculate until that order is known
    to your local core, which could be a relatively long time.

    IIUC this boils down to keeping some memory operations (reads) as
    speculative for a longer amount of time (until all previous memory
    operations are not speculative any more), which delays their retirement.
    So the cost is that it requires keeping more instructions "in flight",
    so it increases the "residency" of instructions in things like a ROB,
    meaning that for a fixed size ROB you'll get less performance, or that
    to keep the same performance you need a larger ROB.

    So in both cases, we need extra hardware to keep track of extra
    speculative info. In one case it's extra info about existing
    speculation, and in the other it's existing info but about "extra
    speculation".


    Stefan

  • From MitchAlsup@21:1/5 to Anton Ertl on Wed Dec 6 14:56:44 2023
    Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Anton Ertl wrote:
    You just have to keep speculative microarchitectural information
    speculative (which costs some area). Once it's no longer speculative,
    it can be promoted to the "permanent" microarchitectural state.

    Retired as it were.

    Yes, although I think that "committed" is the better term in this
    context.

    That's different. Sequential consistency can be implemented
    efficiently through an additional speculation mechanism: you would
    speculate that the values you loaded are consistent with an ordering
    of the loads and stores in the whole system that satisfies sequential
    consistency. So there you would speculate until that order is known
    to your local core, which could be a relatively long time.

    Sequential consistency can be implemented by a minor addition to a
    memory dependence matrix, where dependencies are relaxed less
    aggressively.

    Yes, you can implement sequential consistency by restricting the
    ordering, but according to the advocates of weak consistency models,
    this costs performance for single-threaded code. I guess that's also
    why your architecture has blocks of sequential consistency, but
    switches to weaker consistency in other parts of the program.

    Yes, My architecture uses sequential consistency only when required, and
    the programmer does not have to deal with the boundaries, HW does.

    The alternative I outlined above is to speculate that the loaded value
    is not written by some other core in a way that would violate
    sequential consistency, and abandon the speculative state if that
    turns out to be wrong. For single-threaded code this provides the
    same performance as weak consistency (as long as the CPU has enough
    speculative resources).

    For code that communicates with other cores, it experiences slowdowns
    only in the contended case, unlike typical implementations of cores
    where weaker consistency is faster, and the programmer has to throw in
    membars or somesuch that always costs performance, or has to switch to
    a stronger consistency mode that always costs performance.

    - anton

  • From Stefan Monnier@21:1/5 to All on Wed Dec 6 12:18:45 2023
    Right, so it needs extra hardware to keep extra information until we
    know for sure that it's not speculative.
    Why is the Miss Buffer considered "extra hardware" ??

    IIUC to avoid Spectre you need (among other things) to refrain from
    updating the branch prediction table until the prediction is verified,
    so you need to keep a buffer (presumably a queue) of prediction updates
    which is not otherwise needed.
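
    A sketch of such a deferred-update queue (types and sizes are
    illustrative; Mitch and Anton discuss below how small it can be, and
    whether the ROB already covers it):

        #include <stdbool.h>
        #include <stdint.h>

        struct bp_update {
            uint16_t index;   /* which predictor table entry to train */
            bool     taken;   /* resolved outcome of the branch       */
        };

        /* One slot per in-flight branch. Entries drain into the real
           predictor only when their branch retires; on a squash, the
           tail rolls back past the discarded branches, so speculative
           outcomes never reach committed predictor state. */
        struct bp_update pending[256];
        unsigned head, tail;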

    Similarly, when fetching something from memory you have to keep it in
    a miss buffer until the load instruction is committed. Maybe this miss
    buffer would exist in any case, but I suspect that the fact that you
    have to keep it there for (at least) a specific duration will increase
    the residency in this miss buffer and thus in turn would require
    a larger miss buffer if we don't want to slow down execution.


    Stefan

  • From MitchAlsup@21:1/5 to Stefan Monnier on Wed Dec 6 17:49:01 2023
    Stefan Monnier wrote:

    Right, so it needs extra hardware to keep extra information until we
    know for sure that it's not speculative.
    Why is the Miss Buffer considered "extra hardware" ??

    IIUC to avoid Spectre you need (among other things) to refrain from
    updating the branch prediction table until the prediction is verified,
    so you need to keep a buffer (presumably a queue) of prediction updates
    which is not otherwise needed.

    A buffer of window-size/issue-width elements of 1-bit plus 1 index into
    BP table.

    Similarly, when fetching something from memory you have to keep it in
    a miss buffer until the load instruction is committed.

    can commit, not is committed.

    Maybe this miss buffer would exist in any case, but I suspect that the
    fact that you have to keep it there for (at least) a specific duration
    will increase the residency in this miss buffer and thus in turn would
    require a larger miss buffer if we don't want to slow down execution.

    This is a far lower penalty than the workarounds it avoids.


    Stefan

  • From Scott Lurndal@21:1/5 to Stefan Monnier on Wed Dec 6 19:08:34 2023
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Right, so it needs extra hardware to keep extra information until we
    know for sure that it's not speculative.
    Why is the Miss Buffer considered "extra hardware" ??

    IIUC to avoid Spectre you need (among other things) to refrain from
    updating the branch prediction table until the prediction is verified,
    so you need to keep a buffer (presumably a queue) of prediction updates
    which is not otherwise needed.

    And you need to flush that queue on context switch, right?

  • From Stefan Monnier@21:1/5 to All on Wed Dec 6 17:22:36 2023
    Scott Lurndal [2023-12-06 19:08:34] wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    IIUC to avoid Spectre you need (among other things) to refrain from
    updating the branch prediction table until the prediction is verified,
    so you need to keep a buffer (presumably a queue) of prediction updates
    which is not otherwise needed.
    And you need to flush that queue on context switch, right?

    You need to flush the part starting at the first
    misprediction (if any), just as you flush the rest of the
    mispredicted instructions.


    Stefan

  • From MitchAlsup@21:1/5 to Scott Lurndal on Thu Dec 7 03:39:12 2023
    Scott Lurndal wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Right, so it needs extra hardware to keep extra information until we
    know for sure that it's not speculative.
    Why is the Miss Buffer considered "extra hardware" ??

    IIUC to avoid Spectre you need (among other things) to refrain from
    updating the branch prediction table until the prediction is verified,
    so you need to keep a buffer (presumably a queue) of prediction updates
    which is not otherwise needed.

    And you need to flush that queue on context switch, right?

    1) Every effect of every instruction executed to completion must be part
    of greater-machine-state.

    2) No state of any discarded instruction is allowed to change any single
    bit of greater-machine-state.

    Greater machine state is all of the bits required to restore the logic
    to this exact-state once there are no instructions executing in the
    machine (pipeline is drained). Exact-state includes all timing-visible
    prediction state (control visible timing) and buffering state (data
    visible timing).

    On an interrupt the pipeline has the freedom to choose which
    instruction represents the last instruction executed based on
    paragraph 1 and which is the first instruction in paragraph 2.
    Branch recovery does not get this choice;; the branch instruction
    itself is in paragraph 1, the architectural-target is in paragraph 2.

  • From Niklas Holsti@21:1/5 to MitchAlsup on Thu Dec 7 10:11:45 2023
    On 2023-12-07 5:39, MitchAlsup wrote:

    1) Every effect of every instruction executed to completion must be part
    of greater-machine-state.
    2) No state of any discarded instruction is allowed to change any single
    bit of greater-machine-state.

    Greater machine state is all of the bits required to restore the logic
    to this exact-state once there are no instructions executing in the
    machine (pipeline is drained). Exact-state includes all timing-visible
    prediction state (control visible timing) and buffering state (data
    visible timing).


    Neither paragraph 1 nor 2 mentions "exact-state". It seems paragraph 2
    should, IIUC.

  • From Anton Ertl@21:1/5 to MitchAlsup on Thu Dec 7 13:18:56 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Yes, My architecture uses sequential consistency only when required, and
    the programmer does not have to deal with the boundaries, HW does.

    That sounds cool, how does the hardware detect that it has to switch
    to sequential consistency?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to Stefan Monnier on Thu Dec 7 13:20:07 2023
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Right, so it needs extra hardware to keep extra information until we
    know for sure that it's not speculative.
    Why is the Miss Buffer considered "extra hardware" ??

    IIUC to avoid Spectre you need (among other things) to refrain from
    updating the branch prediction table until the prediction is verified,

    Yes.

    so you need to keep a buffer (presumably a queue) of prediction updates
    which is not otherwise needed.

    That buffer already exists: It's the ROB. When a branch and its
    outcome become non-speculative (or upon retirement), you feed the
    outcome of the branch into the branch predictor(s).

    Whether you also need speculation-based speculative "history"
    predictors is a good question, but in case of a misprediction it
    certainly looks like a good idea to me not to pollute the committed
    history predictors with these wrong predictions, so you also get a
    benefit from not putting speculative predictions into the history of
    predictors that are used for committed history.

    Concerning the return-address stack predictor, you certainly want a
    separate speculative version of that in addition to the committed
    version, otherwise a misprediction recovery can result in multiple
    subsequent return stack mispredictions. I expect that existing CPUs
    already have some mechanism for resynchronizing the return stack after
    a resolved misprediction, so I expect little additional cost for that.
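
    A sketch of that two-copy arrangement (names are illustrative): fetch
    predicts returns from a speculative top-of-stack pointer, and each
    in-flight branch checkpoints that pointer so recovery restores it:

        #include <stdint.h>

        #define RAS_DEPTH 16

        uint64_t ras[RAS_DEPTH];   /* predicted return addresses         */
        unsigned spec_top;         /* used by fetch; moves speculatively */
        unsigned ckpt[64];         /* spec_top snapshot per open branch  */

        void on_call(uint64_t ret_addr)            /* push on CALL */
        {
            spec_top = (spec_top + 1) % RAS_DEPTH;
            ras[spec_top] = ret_addr;
        }

        uint64_t on_ret(void)                      /* pop on RET */
        {
            uint64_t t = ras[spec_top];
            spec_top = (spec_top + RAS_DEPTH - 1) % RAS_DEPTH;
            return t;
        }

        /* On misprediction recovery: spec_top = ckpt[branch_tag], which
           "unRETs" any returns popped in the squashed shadow. */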

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to Stefan Monnier on Thu Dec 7 12:56:37 2023
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    While you write above that "you can do this without losing
    performance", IIUC it does have a cost in that you have to keep more
    information as "speculative" and for longer.

    You just have to keep speculative microarchitectural information
    speculative (which costs some area). Once it's no longer speculative,
    it can be promoted to the "permanent" microarchitectural state.

    Right, so it needs extra hardware to keep extra information until we
    know for sure that it's not speculative.

    If you use extra hardware for that, you also get extra performance
    (like you get extra performance from the extra architectural state).
    E.g., if you have n speculative load buffers, they serve as additional
    D-cache lines. OTOH, if you reduce the committed D-cache by these
    lines, you don't need much extra hardware, but you lose performance.

    That's different. Sequential consistency can be implemented
    efficiently through an additional speculation mechanism: you would
    speculate that the values you loaded are consistent with an ordering
    of the loads and stores in the whole system that satisfies sequential
    consistency. So there you would speculate until that order is known
    to your local core, which could be a relatively long time.

    IIUC this boils down to keeping some memory operations (reads) as
    speculative for a longer amount of time (until all previous memory
    operations are not speculative any more), which delays their retirement.
    So the cost is that it requires keeping more instructions "in flight",
    so it increases the "residency" of instructions in things like a ROB,
    meaning that for a fixed size ROB you'll get less performance, or that
    to keep the same performance you need a larger ROB.

    No, retirement is in-order, so all instructions after a cache miss
    have to wait in the ROB anyway until the miss is served. There is an
    implementation cost to sequential consistency, but this is not it. I
    think if the computer architects put their mind to it, they can find
    solutions that are good in performance and low in hardware cost.

    But that means that we first have to shatter the meme that weak
    consistency is necessary for performance. If the EPIC meme had been
    as successful at suppressing research and development of OoO as the
    weak consistency meme is at suppressing research into and development
    of efficient hardware implementations of sequential consistency,
    everybody would be using CPUs like Poulson and Efficeon these days,
    and think that EPIC is necessary for performance.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to Scott Lurndal on Thu Dec 7 13:35:28 2023
    scott@slp53.sl.home (Scott Lurndal) writes:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Right, so it needs extra hardware to keep extra information until we
    know for sure that it's not speculative.
    Why is the Miss Buffer considered "extra hardware" ??

    IIUC to avoid Spectre you need (among other things) to refrain from
    updating the branch prediction table until the prediction is verified,
    so you need to keep a buffer (presumably a queue) of prediction updates
    which is not otherwise needed.

    And you need to flush that queue on context switch, right?

    Not if the goal is to eliminate Spectre.

    The idea is that the hardware is as secure or insecure as hardware
    without speculation. The measures that achieve that make such flushes
    neither necessary nor sufficient.

    You may want to do such flushes for other security reasons, though
    (i.e., reasons beyond Spectre).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From MitchAlsup@21:1/5 to Anton Ertl on Thu Dec 7 19:00:23 2023
    Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Yes, My architecture uses sequential consistency only when required, and
    the programmer does not have to deal with the boundaries, HW does.

    That sounds cool, how does the hardware detect that it has to switch
    to sequential consistency?

    Translate an AGEN into MMI/O space
    or
    LD.lock

    - anton

  • From MitchAlsup@21:1/5 to Anton Ertl on Thu Dec 7 19:01:11 2023
    Anton Ertl wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    While you write above that "you can do this without losing
    performance", IIUC it does have a cost in that you have to keep more
    information as "speculative" and for longer.

    You just have to keep speculative microarchitectural information
    speculative (which costs some area). Once it's no longer speculative,
    it can be promoted to the "permanent" microarchitectural state.

    Right, so it needs extra hardware to keep extra information until we
    know for sure that it's not speculative.

    If you use extra hardware for that, you also get extra performance
    (like you get extra performance from the extra architectural state).
    E.g., if you have n speculative load buffers, they serve as additional
    D-cache lines. OTOH, if you reduce the committed D-cache by these
    lines, you don't need much extra hardware, but you lose performance.

    That's different. Sequential consistency can be implemented
    efficiently through an additional speculation mechanism: you would
    speculate that the values you loaded are consistent with an ordering
    of the loads and stores in the whole system that satisfies sequential
    consistency. So there you would speculate until that order is known
    to your local core, which could be a relatively long time.

    IIUC this boils down to keeping some memory operations (reads) as
    speculative for a longer amount of time (until all previous memory
    operations are not speculative any more), which delays their retirement.
    So the cost is that it requires keeping more instructions "in flight",
    so it increases the "residency" of instructions in things like a ROB,
    meaning that for a fixed size ROB you'll get less performance, or that
    to keep the same performance you need a larger ROB.

    No, retirement is in-order, so all instructions after a cache miss
    have to wait in the ROB anyway until the miss is served. There is an
    implementation cost to sequential consistency, but this is not it. I
    think if the computer architects put their mind to it, they can find
    solutions that are good in performance and low in hardware cost.

    But that means that we first have to shatter the meme that weak
    consistency is necessary for performance.

    Weak is not necessary, causal is.

    If the EPIC meme had been
    as successful at suppressing research and development of OoO as the
    weak consistency meme is at suppressing research into and development
    of efficient hardware implementations of sequential consistency,
    everybody would be using CPUs like Poulson and Efficeon these days,
    and think that EPIC is necessary for performance.

    - anton

  • From MitchAlsup@21:1/5 to Anton Ertl on Thu Dec 7 19:07:34 2023
    Anton Ertl wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Right, so it needs extra hardware to keep extra information until we
    know for sure that it's not speculative.
    Why is the Miss Buffer considered "extra hardware" ??

    IIUC to avoid Spectre you need (among other things) to refrain from
    updating the branch prediction table until the prediction is verified,

    Yes.

    so you need to keep a buffer (presumably a queue) of prediction updates
    which is not otherwise needed.

    That buffer already exists: It's the ROB. When a branch and its
    outcome become non-speculative (or upon retirement), you feed the
    outcome of the branch into the branch predictor(s).

    Whether you also need speculation-based speculative "history"
    predictors is a good question, but in case of a misprediction it
    certainly looks like a good idea to me not to pollute the committed
    history predictors with these wrong predictions, so you also get a
    benefit from not putting speculative predictions into the history of
    predictors that are used for committed history.

    Concerning the return-address stack predictor, you certainly want a
    separate speculative version of that in addition to the committed
    version, otherwise a misprediction recovery can result in multiple
    subsequent return stack mispredictions. I expect that existing CPUs
    already have some mechanism for resynchronizing the return stack after
    a resolved misprediction, so I expect little additional cost for that.

    Yes, you definitely have to be able to RET and then unRET if the RET
    was under the shadow of a mispredicted branch. Athlon and Opteron
    used a doubly linked list with 16 entries (4-bit pointers).

    - anton

  • From EricP@21:1/5 to MitchAlsup on Thu Dec 7 14:49:51 2023
    MitchAlsup wrote:
    Anton Ertl wrote:

    Concerning the return-address stack predictor, you certainly want a
    separate speculative version of that in addition to the committed
    version, otherwise a misprediction recovery can result in multiple
    subsequent return stack mispredictions. I expect that existing CPUs
    already have some mechanism for resynchronizing the return stack after
    a resolved misprediction, so I expect little additional cost for that.

    Yes, you definitely have to be able to RET and then unRET if the RET
    was under the shadow of a mispredicted branch. Athlon and Opteron
    used a doubly linked list with 16 entries (4-bit pointers).

    The doubly linked list approach could work but you have to checkpoint
    the whole Return Address Stack Predictor (RASP) data structure:
    (2*4b links + 64b return address per entry) * 16 entries + 2*4b list
    head = ~1160 bits per checkpoint, * 16 checkpoints = 18,560 bits of
    checkpoint SRAM.
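
    Spelling out that arithmetic (it does come to 18,560):

        /* Per entry: 2 links x 4 bits + a 64-bit return address = 72 bits.
           16 entries = 1152 bits, plus two 4-bit list heads = 1160 bits
           per checkpoint; 16 checkpoints = 18,560 bits of SRAM. */
        enum {
            ENTRY_BITS      = 2*4 + 64,             /* 72    */
            CHECKPOINT_BITS = 16*ENTRY_BITS + 2*4,  /* 1160  */
            TOTAL_BITS      = 16*CHECKPOINT_BITS    /* 18560 */
        };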

    To manipulate the doubly linked list, the RASP SRAM also requires
    6 read and 6 write ports _for each concurrent RET decode lane_,
    plus a bulk copy read port and a write port for RASP checkpoint/rollback.

    And I'm skipping over free entry list management.

    It's an expensive little gadget.

  • From MitchAlsup@21:1/5 to EricP on Thu Dec 7 22:09:04 2023
    EricP wrote:

    MitchAlsup wrote:
    Anton Ertl wrote:

    Concerning the return-address stack predictor, you certainly want a
    separate speculative version of that in addition to the committed
    version, otherwise a misprediction recovery can result in multiple
    subsequent return stack mispredictions. I expect that existing CPUs
    already have some mechanism for resynchronizing the return stack after
    a resolved misprediction, so I expect little additional cost for that.

    Yes, you definitely have to be able to RET and then unRET if the RET
    was under the shadow of a mispredicted branch. Athlon and Opteron
    used a doubly linked list with 16 entries (4-bit pointers).

    The doubly linked list approach could work but you have to checkpoint
    the whole Return Address Stack Predictor (RASP) data structure:
    (2*4b links + 64b return address per entry) * 16 entries + 2*4b list
    head = ~1160 bits per checkpoint, * 16 checkpoints = 18,560 bits of
    checkpoint SRAM.

    You just have to avoid reallocating the entry while the RET remains
    under a mispredictable branch shadow. This eliminates needing the
    64-bit address. 2×4×(16+1) = 136 bits--which is why the stack is 16
    instead of 8.

    To manipulate the doubly linked list, the RASP SRAM also requires
    6 read and 6 write ports _for each concurrent RET decode lane_,
    plus a bulk copy read port and a write port for RASP checkpoint/rollback.

    You are only modifying 8 bits per cycle (across 2 entries).

    And I'm skipping over free entry list management.

    It's an expensive little gadget.

  • From MitchAlsup@21:1/5 to Chris M. Thomasson on Fri Dec 8 02:23:14 2023
    Chris M. Thomasson wrote:

    On 12/7/2023 11:00 AM, MitchAlsup wrote:
    Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Yes, My architecture uses sequential consistency only when required,
    and the programmer does not have to deal with the boundaries, HW does.

    That sounds cool, how does the hardware detect that it has to switch
    to sequential consistency?

    Translate an AGEN into MMI/O space
    or
    LD.lock

    Akin to if a programmer knows it absolutely _needs_ to use a #StoreLoad,
    it will add them in the right places?

    Consider a virtualized device driver. He may think he is performing
    a ST to MMI/O space while the HyperVisor has remapped his virtual
    device control space back to DRAM.

    Now consider two copies of the same code (still a device driver), one
    virtualized, the other native.

    Finally ask the question of whether a MemBar should be in one but not
    the other.

    Virtualization means that sometimes even the most well-informed programmer
    has insufficient information .....

  • From Scott Lurndal@21:1/5 to MitchAlsup on Fri Dec 8 15:21:12 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Chris M. Thomasson wrote:

    On 12/7/2023 11:00 AM, MitchAlsup wrote:
    Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Yes, My architecture uses sequential consistency only when required,
    and the programmer does not have to deal with the boundaries, HW does.
    That sounds cool, how does the hardware detect that it has to switch
    to sequential consistency?

    Translate an AGEN into MMI/O space
    or
    LD.lock

    Akin to if a programmer knows it absolutely _needs_ to use a #StoreLoad,
    it will add them in the right places?

    Consider a virtualized device driver. He may think he is performing
    a ST to MMI/O space while the HyperVisor has remapped his virtual
    device control space back to DRAM.

    It's more likely to be using an SR-IOV type of interface to the device,
    allowing direct access to the hardware. HV intervention for MMIO is
    so 2000's. :-).
