• Spectre vs SC (was: Concertina II Progress)

    From Stefan Monnier@21:1/5 to All on Mon Dec 4 14:17:43 2023
    MitchAlsup [2023-12-04 18:58:51] wrote:
    Quadibloc wrote:
    On Sun, 03 Dec 2023 20:18:30 +0000, MitchAlsup wrote:
    You CAN build Spectre-free, Meltdown-free, ROP-free,... in GBOoO by
    following one simple rule:: No microarchitectural changes until the
    causing instruction retires. AND you can do this without losing
    performance.
    I thought that the mitigations that _were_ costly in performance
    were mostly attempts to approach following just that rule.

    IIUC the mitigations are all done in software, and on the hardware side
    they seem to largely disregard the issue as if they had abandoned all
    hope to solve it at all.
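
    For concreteness, the class of leak being discussed is the canonical
    Spectre v1 bounds-check bypass. A minimal sketch in C, following the
    published gadget (array and function names are illustrative, not from
    any real codebase):

        #include <stdint.h>
        #include <stddef.h>

        uint8_t array1[16];
        size_t  array1_size = 16;
        uint8_t array2[256 * 512];  /* probe array: one cache line per byte value */

        /* If "idx < array1_size" is mispredicted taken, the out-of-bounds
           read of array1[idx] and the dependent read of array2[] still run
           speculatively; the cache line they touch survives the squash and
           can be recovered afterwards by timing accesses to array2. */
        uint8_t victim(size_t idx)
        {
            if (idx < array1_size)
                return array2[array1[idx] * 512];
            return 0;
        }

    The rule quoted above closes this channel: the speculative array2
    access may not change the cache (or any other microarchitectural
    state) until the bounds check retires.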

    The mitigations were closer to:: cause the problem to vanish,
    but change as little of the µArchitecture as possible in doing
    it. But 6 years later, they apparently are still unwilling to
    alter the µArchitecture enough to completely eliminate them.

    While you write above that "you can do this without losing
    performance", IIUC it does have a cost in that you have to keep more
    information as "speculative" and for longer. Presumably we can make
    this cost small.

    But it reminds me of a recent discussion with similar costs: sequential
    consistency, where again the cost is to keep more instructions as
    "speculative" for longer.

    Is my association of those two just the result of my lack of knowledge
    of how these things are really implemented, or is there indeed
    some similarity?


    Stefan

  • From Scott Lurndal@21:1/5 to Stefan Monnier on Mon Dec 4 20:53:23 2023
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    MitchAlsup [2023-12-04 18:58:51] wrote:
    Quadibloc wrote:
    On Sun, 03 Dec 2023 20:18:30 +0000, MitchAlsup wrote:
    You CAN build Spectre-free, Meltdown-free, ROP-free,... in GBOoO by
    following one simple rule:: No microarchitectural changes until the
    causing instruction retires. AND you can do this without losing
    performance.
    I thought that the mitigations that _were_ costly in performance
    were mostly attempts to approach following just that rule.

    IIUC the mitigations are all done in software, and on the hardware side
    they seem to largely disregard the issue as if they had abandoned all
    hope to solve it at all.

    On the x86_64 side, perhaps.

    ARM added a number of hint instructions, including a speculation
    barrier instruction, that provide some hardware support for mitigations.
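
    The usual pattern puts such a barrier between a bounds check and the
    dependent load. A sketch in C with inline assembly (function and array
    names are illustrative; SB is the hint-encoded AArch64 Speculation
    Barrier, and assembling it may require an -march setting with +sb):

        #include <stdint.h>
        #include <stddef.h>

        extern uint8_t table[256];

        /* SB keeps the load below from executing speculatively past the
           bounds check; on cores without the extension the hint encoding
           executes as a NOP. */
        uint8_t read_checked(size_t idx, size_t len)
        {
            if (idx >= len)
                return 0;
            __asm__ volatile("sb" ::: "memory");
            return table[idx];
        }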

  • From Anton Ertl@21:1/5 to Stefan Monnier on Tue Dec 5 17:59:36 2023
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    MitchAlsup [2023-12-04 18:58:51] wrote:
    Quadibloc wrote:
    On Sun, 03 Dec 2023 20:18:30 +0000, MitchAlsup wrote:
    You CAN build Spectre-free, Meltdown-free, ROP-free,... in GBOoO by
    following one simple rule:: No microarchitectural changes until the
    causing instruction retires. AND you can do this without losing
    performance.
    I thought that the mitigations that _were_ costly in performance
    were mostly attempts to approach following just that rule.

    IIUC the mitigations are all done in software, and on the hardware side
    they seem to largely disregard the issue as if they had abandoned all
    hope to solve it at all.

    For a while I thought that the hardware people thought that this is
    Somebody Else's Problem (like the Rowhammer disaster, which IMO would
    have been fixed long ago if the same manufacturer made the memory and
    the memory controller; hmm, in case of Samsung that's actually the
    case, but I guess that Samsung is large enough that it's as if the two
    groups were different manufacturers).

    However, reading "Speculative interference attacks: breaking invisible
    speculation schemes", which has 6 authors with affiliation "Intel
    Corporation, USA" (out of 16), someone at Intel seems to be aware of
    the right way. Of course, the title of the paper and some of the
    contents read as if it is intended to discourage the idea that Spectre
    can be fixed, but if you ignore that, the paper points out a few more
    problems that need to be fixed; some of them are included in what
    Mitch Alsup wrote above; there is also a side channel through resource
    contention that must also be closed, but that problem has been dealt
    with, too.

    So why the pessimistic title and partial content? Maybe it's just
    that the authors are security researchers, and a successful attack is
    a matter of prestige in this community, and such a paper has to be
    published first before a defense paper can be seen as an achievement.
    More paranoid explanations: 1) Intel does not plan to do anything, so
    they don't want people to know that they could do something. 2) Intel
    is working on a microarchitecture that fixes Spectre, but wants to
    maximize this as a competitive advantage, so they want (superficial)
    readers from the competition to get the impression that there is no
    point in working on a fix.

    The mitigations were closer to:: cause the problem to vanish,
    but change as little of the µArchitecture as possible in doing
    it. But 6 years later, they apparently are still unwilling to
    alter the µArchitecture enough to completely eliminate them.

    While you write above that "you can do this without losing
    performance", IIUC it does have a cost in that you have to keep more
    information as "speculative" and for longer.

    You just have to keep speculative microarchitectural information
    speculative (which costs some area). Once it's no longer speculative,
    it can be promoted to the "permanent" microarchitectural state.

    But it reminds me of a recent discussion with similar costs: sequential
    consistency, where again the cost is to keep more instructions as
    "speculative" for longer.

    That's different. Sequential consistency can be implemented
    efficiently through an additional speculation mechanism: you would
    speculate that the values you loaded are consistent with an ordering
    of the loads and stores in the whole system that satisfies sequential
    consistency. So there you would speculate until that order is known
    to your local core, which could be a relatively long time.
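
    The standard illustration of what such speculation must preserve is
    the store-buffering litmus test; a C11 sketch (the two thread
    functions are assumed to run concurrently on two cores):

        #include <stdatomic.h>

        atomic_int x, y;   /* both initially 0 */
        int r0, r1;

        void thread0(void) {
            atomic_store(&x, 1);     /* seq_cst by default */
            r0 = atomic_load(&y);
        }

        void thread1(void) {
            atomic_store(&y, 1);
            r1 = atomic_load(&x);
        }

        /* Sequential consistency forbids r0 == 0 && r1 == 0; hardware
           store buffers (or memory_order_relaxed) allow it. The scheme
           above issues the loads early and squashes if the globally
           agreed order turns out to contradict the values observed. */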

    Fixing Spectre requires no additional speculation.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From MitchAlsup@21:1/5 to Anton Ertl on Tue Dec 5 20:52:14 2023
    Anton Ertl wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    MitchAlsup [2023-12-04 18:58:51] wrote:
    Quadibloc wrote:
    On Sun, 03 Dec 2023 20:18:30 +0000, MitchAlsup wrote:
    You CAN build Spectre-free, Meltdown-free, ROP-free,... in GBOoO by
    following one simple rule:: No microarchitectural changes until the
    causing instruction retires. AND you can do this without losing
    performance.
    I thought that the mitigations that _were_ costly in performance
    were mostly attempts to approach following just that rule.

    IIUC the mitigations are all done in software, and on the hardware side
    they seem to largely disregard the issue as if they had abandoned all
    hope to solve it at all.

    For a while I thought that the hardware people thought that this is
    Somebody Else's Problem (like the Rowhammer disaster, which IMO would
    have been fixed long ago if the same manufacturer made the memory and
    the memory controller; hmm, in case of Samsung that's actually the
    case, but I guess that Samsung is large enough that it's as if the two
    groups were different manufacturers).

    However, reading "Speculative interference attacks: breaking invisible
    speculation schemes", which has 6 authors with affiliation "Intel
    Corporation, USA" (out of 16), someone at Intel seems to be aware of
    the right way. Of course, the title of the paper and some of the
    contents read as if it is intended to discourage the idea that Spectre
    can be fixed, but if you ignore that, the paper points out a few more
    problems that need to be fixed; some of them are included in what
    Mitch Alsup wrote above; there is also a side channel through resource
    contention that must also be closed, but that problem has been dealt
    with, too.

    So why the pessimistic title and partial content? Maybe it's just
    that the authors are security researchers, and a successful attack is
    a matter of prestige in this community, and such a paper has to be
    published first before a defense paper can be seen as an achievement.
    More paranoid explanations: 1) Intel does not plan to do anything, so
    they don't want people to know that they could do something. 2) Intel
    is working on a microarchitecture that fixes Spectre, but wants to
    maximize this as a competitive advantage, so they want (superficial)
    readers from the competition to get the impression that there is no
    point in working on a fix.

    The mitigations were closer to:: cause the problem to vanish,
    but change as little of the µArchitecture as possible in doing
    it. But 6 years later, they apparently are still unwilling to
    alter the µArchitecture enough to completely eliminate them.

    While you write above that "you can do this without losing
    performance", IIUC it does have a cost in that you have to keep more
    information as "speculative" and for longer.

    You just have to keep speculative microarchitectural information
    speculative (which costs some area). Once it's no longer speculative,
    it can be promoted to the "permanent" microarchitectural state.

    Retired as it were.

    But it reminds me of a recent discussion with similar costs: sequential
    consistency, where again the cost is to keep more instructions as
    "speculative" for longer.

    That's different. Sequential consistency can be implemented
    efficiently through an additional speculation mechanism: you would
    speculate that the values you loaded are consistent with an ordering
    of the loads and stores in the whole system that satisfies sequential
    consistency. So there you would speculate until that order is known
    to your local core, which could be a relatively long time.

    Sequential consistency can be implemented by a minor addition to a
    memory dependence matrix, where dependencies are relaxed less
    aggressively.

    Fixing Spectre requires no additional speculation.

    - anton

  • From Anton Ertl@21:1/5 to Scott Lurndal on Tue Dec 5 21:27:35 2023
    scott@slp53.sl.home (Scott Lurndal) writes:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    IIUC the mitigations are all done in software, and on the hardware side
    they seem to largely disregard the issue as if they had abandoned all
    hope to solve it at all.

    On the x86_64 side, perhaps.

    ARM added a number of hint instructions, including a speculation
    barrier instruction, that provide some hardware support for mitigations.

    Such mitigation support instructions (which exist for Intel and AMD,
    too) are not the fix, on the contrary: If Spectre is fixed in the
    hardware, no such instructions are necessary. E.g., on ARM these
    instructions should be unnecessary (and do nothing) on A53, A55, A510
    and A520, because they do not perform speculative execution.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Scott Lurndal@21:1/5 to Anton Ertl on Tue Dec 5 21:34:29 2023
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    IIUC the mitigations are all done in software, and on the hardware side
    they seem to largely disregard the issue as if they had abandoned all
    hope to solve it at all.

    On the x86_64 side, perhaps.

    ARM added a number of hint instructions, including a speculation
    barrier instruction, that provide some hardware support for mitigations.

    Such mitigation support instructions (which exist for Intel and AMD,
    too) are not the fix, on the contrary: If Spectre is fixed in the
    hardware, no such instructions are necessary. E.g., on ARM these
    instructions should be unnecessary (and do nothing) on A53, A55, A510
    and A520, because they do not perform speculative execution.

    That's why those new instructions were allocated out of the "hint"
    instruction encoding.

  • From Stefan Monnier@21:1/5 to All on Tue Dec 5 17:40:42 2023
    While you write above that "you can do this without losing
    performance", IIUC it does have a cost in that you have to keep more
    information as "speculative" and for longer.

    You just have to keep speculative microarchitectural information
    speculative (which costs some area). Once it's no longer speculative,
    it can be promoted to the "permanent" microarchitectural state.

    Right, so it needs extra hardware to keep extra information until we
    know for sure that it's not speculative.

    But it reminds me of a recent discussion with similar costs: sequential
    consistency, where again the cost is to keep more instructions as
    "speculative" for longer.
    That's different. Sequential consistency can be implemented
    efficiently through an additional speculation mechanism: you would
    speculate that the values you loaded are consistent with an ordering
    of the loads and stores in the whole system that satisfies sequential
    consistency. So there you would speculate until that order is known
    to your local core, which could be a relatively long time.

    IIUC this boils down to keeping some memory operations (reads) as
    speculative for a longer amount of time (until all previous memory
    operations are not speculative any more), which delays their retirement.
    So the cost is that it requires keeping more instructions "in flight",
    so it increases the "residency" of instructions in things like a ROB,
    meaning that for a fixed size ROB you'll get less performance, or that
    to keep the same performance you need a larger ROB.
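
    That residency argument is just Little's law: instructions in flight
    equal completion rate times time in flight. A toy calculation (the
    numbers are illustrative, not measurements):

        #include <stdio.h>

        int main(void)
        {
            double ipc       = 4.0;    /* sustained instructions per cycle   */
            double residency = 150.0;  /* avg cycles from dispatch to retire */
            /* Little's law: entries needed to sustain that rate */
            printf("ROB entries: ~%.0f\n", ipc * residency);   /* ~600 */
            return 0;
        }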

    So in both cases, we need extra hardware to keep track of extra
    speculative info. In one case it's extra info about existing
    speculation, and in the other it's existing info but about "extra
    speculation".


    Stefan

  • From Anton Ertl@21:1/5 to MitchAlsup on Wed Dec 6 08:37:49 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Anton Ertl wrote:
    You just have to keep speculative microarchitectural information
    speculative (which costs some area). Once it's no longer speculative,
    it can be promoted to the "permanent" microarchitectural state.

    Retired as it were.

    Yes, although I think that "committed" is the better term in this
    context.

    That's different. Sequential consistency can be implemented
    efficiently through an additional speculation mechanism: you would
    speculate that the values you loaded are consistent with an ordering
    of the loads and stores in the whole system that satisfies sequential
    consistency. So there you would speculate until that order is known
    to your local core, which could be a relatively long time.

    Sequential consistency can be implemented by a minor addition to a
    memory dependence matrix, where dependencies are relaxed less
    aggressively.

    Yes, you can implement sequential consistency by restricting the
    ordering, but according to the advocates of weak consistency models,
    this costs performance for single-threaded code. I guess that's also
    why your architecture has blocks of sequential consistency, but
    switches to weaker consistency in other parts of the program.

    The alternative I outlined above is to speculate that the loaded value
    is not written by some other core in a way that would violate
    sequential consistency, and abandon the speculative state if that
    turns out to be wrong. For single-threaded code this provides the
    same performance as weak consistency (as long as the CPU has enough
    speculative resources).

    For code that communicates with other cores, it experiences slowdowns
    only in the contended case, unlike typical implementations of cores
    where weaker consistency is faster, and the programmer has to throw in
    membars or somesuch that always costs performance, or has to switch to
    a stronger consistency mode that always costs performance.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From MitchAlsup@21:1/5 to Stefan Monnier on Wed Dec 6 14:46:49 2023
    Stefan Monnier wrote:

    While you write above that "you can do this without losing
    performance", IIUC it does have a cost in that you have to keep more
    information as "speculative" and for longer.

    You just have to keep speculative microarchitectural information
    speculative (which costs some area). Once it's no longer speculative,
    it can be promoted to the "permanent" microarchitectural state.

    Right, so it needs extra hardware to keep extra information until we
    know for sure that it's not speculative.

    Why is the Miss Buffer considered "extra hardware" ??

    But it reminds me of a recent discussion with similar costs: sequential
    consistency, where again the cost is to keep more instructions as
    "speculative" for longer.
    That's different. Sequential consistency can be implemented
    efficiently through an additional speculation mechanism: you would
    speculate that the values you loaded are consistent with an ordering
    of the loads and stores in the whole system that satisfies sequential
    consistency. So there you would speculate until that order is known
    to your local core, which could be a relatively long time.

    IIUC this boils down to keeping some memory operations (reads) as
    speculative for a longer amount of time (until all previous memory
    operations are not speculative any more), which delays their retirement.
    So the cost is that it requires keeping more instructions "in flight",
    so it increases the "residency" of instructions in things like a ROB,
    meaning that for a fixed size ROB you'll get less performance, or that
    to keep the same performance you need a larger ROB.

    So in both cases, we need extra hardware to keep track of extra
    speculative info. In one case it's extra info about existing
    speculation, and in the other it's existing info but about "extra
    speculation".


    Stefan

  • From MitchAlsup@21:1/5 to Anton Ertl on Wed Dec 6 14:56:44 2023
    Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Anton Ertl wrote:
    You just have to keep speculative microarchitectural information
    speculative (which costs some area). Once it's no longer speculative,
    it can be promoted to the "permanent" microarchitectural state.

    Retired as it were.

    Yes, although I think that "committed" is the better term in this
    context.

    That's different. Sequential consistency can be implemented
    efficiently through an additional speculation mechanism: you would
    speculate that the values you loaded are consistent with an ordering
    of the loads and stores in the whole system that satisfies sequential
    consistency. So there you would speculate until that order is known
    to your local core, which could be a relatively long time.

    Sequential consistency can be implemented by a minor addition to a
    memory dependence matrix, where dependencies are relaxed less
    aggressively.

    Yes, you can implement sequential consistency by restricting the
    ordering, but according to the advocates of weak consistency models,
    this costs performance for single-threaded code. I guess that's also
    why your architecture has blocks of sequential consistency, but
    switches to weaker consistency in other parts of the program.

    Yes, My architecture uses sequential consistency only when required, and
    the programmer does not have to deal with the boundaries, HW does.

    The alternative I outlined above is to speculate that the loaded value
    is not written by some other core in a way that would violate
    sequential consistency, and abandon the speculative state if that
    turns out to be wrong. For single-threaded code this provides the
    same performance as weak consistency (as long as the CPU has enough
    speculative resources).

    For code that communicates with other cores, it experiences slowdowns
    only in the contended case, unlike typical implementations of cores
    where weaker consistency is faster, and the programmer has to throw in
    membars or somesuch that always costs performance, or has to switch to
    a stronger consistency mode that always costs performance.

    - anton

  • From Stefan Monnier@21:1/5 to All on Wed Dec 6 12:18:45 2023
    Right, so it needs extra hardware to keep extra information until we
    know for sure that it's not speculative.
    Why is the Miss Buffer considered "extra hardware" ??

    IIUC to avoid Spectre you need (among other things) to refrain from
    updating the branch prediction table until the prediction is verified,
    so you need to keep a buffer (presumably a queue) of prediction updates
    which is not otherwise needed.
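
    A sketch of such a deferred-update queue (types and sizes are
    illustrative; Mitch and Anton discuss below how small it can be, and
    whether the ROB already covers it):

        #include <stdbool.h>
        #include <stdint.h>

        struct bp_update {
            uint16_t index;   /* which predictor table entry to train */
            bool     taken;   /* resolved outcome of the branch       */
        };

        /* One slot per in-flight branch. Entries drain into the real
           predictor only when their branch retires; on a squash, the
           tail rolls back past the discarded branches, so speculative
           outcomes never reach committed predictor state. */
        struct bp_update pending[256];
        unsigned head, tail;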

    Similarly, when fetching something from memory you have to keep it in
    a miss buffer until the load instruction is committed. Maybe this miss
    buffer would exist in any case, but I suspect that the fact that you
    have to keep it there for (at least) a specific duration will increase
    the residency in this miss buffer and thus in turn would require
    a larger miss buffer if we don't want to slow down execution.


    Stefan

  • From MitchAlsup@21:1/5 to Stefan Monnier on Wed Dec 6 17:49:01 2023
    Stefan Monnier wrote:

    Right, so it needs extra hardware to keep extra information until we
    know for sure that it's not speculative.
    Why is the Miss Buffer considered "extra hardware" ??

    IIUC to avoid Spectre you need (among other things) to refrain from
    updating the branch prediction table until the prediction is verified,
    so you need to keep a buffer (presumably a queue) of prediction updates
    which is not otherwise needed.

    A buffer of window-size/issue-width elements of 1-bit plus 1 index into
    BP table.

    Similarly, when fetching something from memory you have to keep it in
    a miss buffer until the load instruction is committed.

    can commit, not is committed.

    Maybe this miss buffer would exist in any case, but I suspect that the
    fact that you have to keep it there for (at least) a specific duration
    will increase the residency in this miss buffer and thus in turn would
    require a larger miss buffer if we don't want to slow down execution.

    This is a far lower penalty than the workarounds it avoids.


    Stefan

  • From Scott Lurndal@21:1/5 to Stefan Monnier on Wed Dec 6 19:08:34 2023
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Right, so it needs extra hardware to keep extra information until we
    know for sure that it's not speculative.
    Why is the Miss Buffer considered "extra hardware" ??

    IIUC to avoid Spectre you need (among other things) to refrain from
    updating the branch prediction table until the prediction is verified,
    so you need to keep a buffer (presumably a queue) of prediction updates
    which is not otherwise needed.

    And you need to flush that queue on context switch, right?

  • From Stefan Monnier@21:1/5 to All on Wed Dec 6 17:22:36 2023
    Scott Lurndal [2023-12-06 19:08:34] wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    IIUC to avoid Spectre you need (among other things) to refrain from
    updating the branch prediction table until the prediction is verified,
    so you need to keep a buffer (presumably a queue) of prediction updates
    which is not otherwise needed.
    And you need to flush that queue on context switch, right?

    You need to flush the part starting at the first
    misprediction (if any), just as you flush the rest of the
    mispredicted instructions.


    Stefan

  • From MitchAlsup@21:1/5 to Scott Lurndal on Thu Dec 7 03:39:12 2023
    Scott Lurndal wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Right, so it needs extra hardware to keep extra information until we
    know for sure that it's not speculative.
    Why is the Miss Buffer considered "extra hardware" ??

    IIUC to avoid Spectre you need (among other things) to refrain from
    updating the branch prediction table until the prediction is verified,
    so you need to keep a buffer (presumably a queue) of prediction updates
    which is not otherwise needed.

    And you need to flush that queue on context switch, right?

    1) Every effect of every instruction executed to completion must be part
    of greater-machine-state.

    2) No state of any discarded instruction is allowed to change any single
    bit of greater-machine-state.

    Greater machine state is all of the bits required to restore the logic
    to this exact-state once there are no instructions executing in the
    machine (pipeline is drained). Exact-state includes all timing-visible
    prediction state (control visible timing) and buffering state (data
    visible timing).

    On an interrupt the pipeline has the freedom to choose which
    instruction represents the last instruction executed based on
    paragraph 1 and which is the first instruction in paragraph 2.
    Branch recovery does not get this choice;; the branch instruction
    itself is in paragraph 1, the architectural-target is in paragraph 2.

  • From Niklas Holsti@21:1/5 to MitchAlsup on Thu Dec 7 10:11:45 2023
    On 2023-12-07 5:39, MitchAlsup wrote:

    1) Every effect of every instruction executed to completion must be part
    of greater-machine-state.
    2) No state of any discarded instruction is allowed to change any single
    bit of greater-machine-state.

    Greater machine state is all of the bits required to restore the logic
    to this exact-state once there are no instructions executing in the
    machine (pipeline is drained). Exact-state includes all timing-visible
    prediction state (control visible timing) and buffering state (data
    visible timing).


    Neither paragraph 1 nor 2 mentions "exact-state". It seems paragraph 2
    should, IIUC.

  • From Anton Ertl@21:1/5 to MitchAlsup on Thu Dec 7 13:18:56 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Yes, My architecture uses sequential consistency only when required, and
    the programmer does not have to deal with the boundaries, HW does.

    That sounds cool, how does the hardware detect that it has to switch
    to sequential consistency?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to Stefan Monnier on Thu Dec 7 13:20:07 2023
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Right, so it needs extra hardware to keep extra information until we
    know for sure that it's not speculative.
    Why is the Miss Buffer considered "extra hardware" ??

    IIUC to avoid Spectre you need (among other things) to refrain from
    updating the branch prediction table until the prediction is verified,

    Yes.

    so you need to keep a buffer (presumably a queue) of prediction updates
    which is not otherwise needed.

    That buffer already exists: It's the ROB. When a branch and its
    outcome become non-speculative (or upon retirement), you feed the
    outcome of the branch into the branch predictor(s).

    Whether you also need speculation-based speculative "history"
    predictors is a good question, but in case of a misprediction it
    certainly looks like a good idea to me not to pollute the committed
    history predictors with these wrong predictions, so you also get a
    benefit from not putting speculative predictions into the history of
    predictors that are used for committed history.

    Concerning the return-address stack predictor, you certainly want a
    separate speculative version of that in addition to the committed
    version, otherwise a misprediction recovery can result in multiple
    subsequent return stack mispredictions. I expect that existing CPUs
    already have some mechanism for resynchronizing the return stack after
    a resolved misprediction, so I expect little additional cost for that.
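
    A sketch of that two-copy arrangement (names are illustrative): fetch
    predicts returns from a speculative top-of-stack pointer, and each
    in-flight branch checkpoints that pointer so recovery restores it:

        #include <stdint.h>

        #define RAS_DEPTH 16

        uint64_t ras[RAS_DEPTH];   /* predicted return addresses         */
        unsigned spec_top;         /* used by fetch; moves speculatively */
        unsigned ckpt[64];         /* spec_top snapshot per open branch  */

        void on_call(uint64_t ret_addr)            /* push on CALL */
        {
            spec_top = (spec_top + 1) % RAS_DEPTH;
            ras[spec_top] = ret_addr;
        }

        uint64_t on_ret(void)                      /* pop on RET */
        {
            uint64_t t = ras[spec_top];
            spec_top = (spec_top + RAS_DEPTH - 1) % RAS_DEPTH;
            return t;
        }

        /* On misprediction recovery: spec_top = ckpt[branch_tag], which
           "unRETs" any returns popped in the squashed shadow. */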

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to Stefan Monnier on Thu Dec 7 12:56:37 2023
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    While you write above that "you can do this without losing
    performance", IIUC it does have a cost in that you have to keep more
    information as "speculative" and for longer.

    You just have to keep speculative microarchitectural information
    speculative (which costs some area). Once it's no longer speculative,
    it can be promoted to the "permanent" microarchitectural state.

    Right, so it needs extra hardware to keep extra information until we
    know for sure that it's not speculative.

    If you use extra hardware for that, you also get extra performance
    (like you get extra performance from the extra architectural state).
    E.g., if you have n speculative load buffers, they serve as additional
    D-cache lines. OTOH, if you reduce the committed D-cache by these
    lines, you don't need much extra hardware, but you lose performance.

    That's different. Sequential consistency can be implemented
    efficiently through an additional speculation mechanism: you would
    speculate that the values you loaded are consistent with an ordering
    of the loads and stores in the whole system that satisfies sequential
    consistency. So there you would speculate until that order is known
    to your local core, which could be a relatively long time.

    IIUC this boils down to keeping some memory operations (reads) as
    speculative for a longer amount of time (until all previous memory
    operations are not speculative any more), which delays their retirement.
    So the cost is that it requires keeping more instructions "in flight",
    so it increases the "residency" of instructions in things like a ROB,
    meaning that for a fixed size ROB you'll get less performance, or that
    to keep the same performance you need a larger ROB.

    No, retirement is in-order, so all instructions after a cache miss
    have to wait in the ROB anyway until the miss is served. There is an
    implementation cost to sequential consistency, but this is not it. I
    think if the computer architects put their mind to it, they can find
    solutions that are good in performance and low in hardware cost.

    But that means that we first have to shatter the meme that weak
    consistency is necessary for performance. If the EPIC meme had been
    as successful at suppressing research and development of OoO as the
    weak consistency meme is at suppressing research into and development
    of efficient hardware implementations of sequential consistency,
    everybody would be using CPUs like Poulson and Efficeon these days,
    and think that EPIC is necessary for performance.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to Scott Lurndal on Thu Dec 7 13:35:28 2023
    scott@slp53.sl.home (Scott Lurndal) writes:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Right, so it needs extra hardware to keep extra information until we
    know for sure that it's not speculative.
    Why is the Miss Buffer considered "extra hardware" ??

    IIUC to avoid Spectre you need (among other things) to refrain from
    updating the branch prediction table until the prediction is verified,
    so you need to keep a buffer (presumably a queue) of prediction updates
    which is not otherwise needed.

    And you need to flush that queue on context switch, right?

    Not if the goal is to eliminate Spectre.

    The idea is that the hardware is as secure or insecure as hardware
    without speculation. The measures that achieve that make such flushes
    neither necessary nor sufficient.

    You may want to do such flushes for other security reasons, though
    (i.e., reasons beyond Spectre).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From MitchAlsup@21:1/5 to Anton Ertl on Thu Dec 7 19:00:23 2023
    Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Yes, My architecture uses sequential consistency only when required, and
    the programmer does not have to deal with the boundaries, HW does.

    That sounds cool, how does the hardware detect that it has to switch
    to sequential consistency?

    Translate an AGEN into MMI/O space
    or
    LD.lock

    - anton

  • From MitchAlsup@21:1/5 to Anton Ertl on Thu Dec 7 19:01:11 2023
    Anton Ertl wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    While you write above that "you can do this without losing
    performance", IIUC it does have a cost in that you have to keep more
    information as "speculative" and for longer.

    You just have to keep speculative microarchitectural information
    speculative (which costs some area). Once it's no longer speculative,
    it can be promoted to the "permanent" microarchitectural state.

    Right, so it needs extra hardware to keep extra information until we
    know for sure that it's not speculative.

    If you use extra hardware for that, you also get extra performance
    (like you get extra performance from the extra architectural state).
    E.g., if you have n speculative load buffers, they serve as additional
    D-cache lines. OTOH, if you reduce the committed D-cache by these
    lines, you don't need much extra hardware, but you lose performance.

    That's different. Sequential consistency can be implemented
    efficiently through an additional speculation mechanism: you would
    speculate that the values you loaded are consistent with an ordering
    of the loads and stores in the whole system that satisfies sequential
    consistency. So there you would speculate until that order is known
    to your local core, which could be a relatively long time.

    IIUC this boils down to keeping some memory operations (reads) as
    speculative for a longer amount of time (until all previous memory
    operations are not speculative any more), which delays their retirement.
    So the cost is that it requires keeping more instructions "in flight",
    so it increases the "residency" of instructions in things like a ROB,
    meaning that for a fixed size ROB you'll get less performance, or that
    to keep the same performance you need a larger ROB.

    No, retirement is in-order, so all instructions after a cache miss
    have to wait in the ROB anyway until the miss is served. There is an
    implementation cost to sequential consistency, but this is not it. I
    think if the computer architects put their mind to it, they can find
    solutions that are good in performance and low in hardware cost.

    But that means that we first have to shatter the meme that weak
    consistency is necessary for performance.

    Weak is not necessary, causal is.

    If the EPIC meme had been
    as successful at suppressing research and development of OoO as the
    weak consistency meme is at suppressing research into and development
    of efficient hardware implementations of sequential consistency,
    everybody would be using CPUs like Poulson and Efficeon these days,
    and think that EPIC is necessary for performance.

    - anton

  • From MitchAlsup@21:1/5 to Anton Ertl on Thu Dec 7 19:07:34 2023
    Anton Ertl wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Right, so it needs extra hardware to keep extra information until we
    know for sure that it's not speculative.
    Why is the Miss Buffer considered "extra hardware" ??

    IIUC to avoid Spectre you need (among other things) to refrain from
    updating the branch prediction table until the prediction is verified,

    Yes.

    so you need to keep a buffer (presumably a queue) of prediction updates
    which is not otherwise needed.

    That buffer already exists: It's the ROB. When a branch and its
    outcome become non-speculative (or upon retirement), you feed the
    outcome of the branch into the branch predictor(s).

    Whether you also need speculation-based speculative "history"
    predictors is a good question, but in case of a misprediction it
    certainly looks like a good idea to me not to pollute the committed
    history predictors with these wrong predictions, so you also get a
    benefit from not putting speculative predictions into the history of
    predictors that are used for committed history.

    Concerning the return-address stack predictor, you certainly want a
    separate speculative version of that in addition to the committed
    version, otherwise a misprediction recovery can result in multiple
    subsequent return stack mispredictions. I expect that existing CPUs
    already have some mechanism for resynchronizing the return stack after
    a resolved misprediction, so I expect little additional cost for that.

    Yes, you definitely have to be able to RET and then unRET if the RET
    was under the shadow of a mispredicted branch. Athlon and Opteron
    used a doubly linked list with 16 entries (4-bit pointers).

    - anton

  • From EricP@21:1/5 to MitchAlsup on Thu Dec 7 14:49:51 2023
    MitchAlsup wrote:
    Anton Ertl wrote:

    Concerning the return-address stack predictor, you certainly want a
    separate speculative version of that in addition to the committed
    version, otherwise a misprediction recovery can result in multiple
    subsequent return stack mispredictions. I expect that existing CPUs
    already have some mechanism for resynchronizing the return stack after
    a resolved misprediction, so I expect little additional cost for that.

    Yes, you definitely have to be able to RET and then unRET if the RET
    was under the shadow of a mispredicted branch. Athlon and Opteron
    used a doubly linked list with 16 entries (4-bit pointers).

    The doubly linked list approach could work but you have to checkpoint
    the whole Return Address Stack Predictor (RASP) data structure:
    (2*4b links + 64b return address per entry) * 16 entries + 2*4b list
    head = ~1160 bits per checkpoint, * 16 checkpoints = 18,560 bits of
    checkpoint SRAM.
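
    Spelling out that arithmetic (it does come to 18,560):

        /* Per entry: 2 links x 4 bits + a 64-bit return address = 72 bits.
           16 entries = 1152 bits, plus two 4-bit list heads = 1160 bits
           per checkpoint; 16 checkpoints = 18,560 bits of SRAM. */
        enum {
            ENTRY_BITS      = 2*4 + 64,             /* 72    */
            CHECKPOINT_BITS = 16*ENTRY_BITS + 2*4,  /* 1160  */
            TOTAL_BITS      = 16*CHECKPOINT_BITS    /* 18560 */
        };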

    To manipulate the doubly linked list, the RASP SRAM also requires
    6 read and 6 write ports _for each concurrent RET decode lane_,
    plus a bulk copy read port and a write port for RASP checkpoint/rollback.

    And I'm skipping over free entry list management.

    It's an expensive little gadget.

  • From MitchAlsup@21:1/5 to EricP on Thu Dec 7 22:09:04 2023
    EricP wrote:

    MitchAlsup wrote:
    Anton Ertl wrote:

    Concerning the return-address stack predictor, you certainly want a
    separate speculative version of that in addition to the committed
    version, otherwise a misprediction recovery can result in multiple
    subsequent return stack mispredictions. I expect that existing CPUs
    already have some mechanism for resynchronizing the return stack after
    a resolved misprediction, so I expect little additional cost for that.

    Yes, you definitely have to be able to RET and then unRET if the RET
    was under the shadow of a mispredicted branch. Athlon and Opteron
    used a doubly linked list with 16 entries (4-bit pointers).

    The doubly linked list approach could work but you have to checkpoint
    the whole Return Address Stack Predictor (RASP) data structure:
    (2*4b links + 64b return address per entry) * 16 entries + 2*4b list
    head = ~1160 bits per checkpoint, * 16 checkpoints = 18,560 bits of
    checkpoint SRAM.

    You just have to avoid reallocating the entry while the RET remains
    under a mispredictable branch shadow. This eliminates needing the
    64-bit address. 2×4×(16+1) = 136 bits--which is why the stack is 16
    instead of 8.

    To manipulate the doubly linked list, the RASP SRAM also requires
    6 read and 6 write ports _for each concurrent RET decode lane_,
    plus a bulk copy read port and a write port for RASP checkpoint/rollback.

    You are only modifying 8 bits per cycle (across 2 entries).

    And I'm skipping over free entry list management.

    It's an expensive little gadget.

  • From MitchAlsup@21:1/5 to Chris M. Thomasson on Fri Dec 8 02:23:14 2023
    Chris M. Thomasson wrote:

    On 12/7/2023 11:00 AM, MitchAlsup wrote:
    Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Yes, My architecture uses sequential consistency only when required,
    and the programmer does not have to deal with the boundaries, HW does.

    That sounds cool, how does the hardware detect that it has to switch
    to sequential consistency?

    Translate an AGEN into MMI/O space
    or
    LD.lock

    Akin to if a programmer knows it absolutely _needs_ to use a #StoreLoad,
    it will add them in the right places?

    Consider a virtualized device driver. He may think he is performing
    a ST to MMI/O space while the HyperVisor has remapped his virtual
    device control space back to DRAM.

    Now consider two copies of the same code (still a device driver), one
    virtualized, the other native.

    Finally ask the question of whether a MemBar should be in one but not
    the other.

    Virtualization means that sometimes even the most well-informed programmer
    has insufficient information .....

  • From Scott Lurndal@21:1/5 to MitchAlsup on Fri Dec 8 15:21:12 2023
    mitchalsup@aol.com (MitchAlsup) writes:
    Chris M. Thomasson wrote:

    On 12/7/2023 11:00 AM, MitchAlsup wrote:
    Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup) writes:
    Yes, My architecture uses sequential consistency only when required,
    and the programmer does not have to deal with the boundaries, HW does.
    That sounds cool, how does the hardware detect that it has to switch
    to sequential consistency?

    Translate an AGEN into MMI/O space
    or
    LD.lock

    Akin to if a programmer knows it absolutely _needs_ to use a #StoreLoad,
    it will add them in the right places?

    Consider a virtualized device driver. He may think he is performing
    a ST to MMI/O space while the HyperVisor has remapped his virtual
    device control space back to DRAM.

    It's more likely to be using an SR-IOV type of interface to the device,
    allowing direct access to the hardware. HV intervention for MMIO is
    so 2000's. :-).
