Quadibloc wrote:
> On Sun, 03 Dec 2023 20:18:30 +0000, MitchAlsup wrote:
>> You CAN build Spectre-free, Meltdown-free, ROP-free,... in GBOoO by
>> following one simple rule:: No microarchitectural changes until the
>> causing instruction retires. AND you can do this without losing
>> performance.
> I thought that the mitigations that _were_ costly in performance
> were mostly attempts to approach following just that rule.
The mitigations were closer to:: cause the problem to vanish,
but change as little of the µArchitecture as possible in doing
it. But 6 years later, they apparently are still unwilling to
alter the µArchitecture enough to completely eliminate them.
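
A minimal sketch of that rule in C, treating each in-flight instruction
as carrying a log of deferred microarchitectural updates; the structure
and all names here are illustrative, not any particular design:

  #include <stdbool.h>
  #include <stddef.h>

  /* One pending microarchitectural side effect (a predictor update,
     a cache fill, ...) produced by a speculatively executed insn. */
  typedef struct {
      void (*apply)(void *ctx);      /* performs the update */
      void *ctx;
  } uarch_effect;

  #define MAX_EFFECTS 64

  typedef struct {
      uarch_effect buf[MAX_EFFECTS]; /* effects of one in-flight insn */
      size_t n;
  } effect_log;

  /* While the instruction is speculative, effects are only logged. */
  static bool log_effect(effect_log *log, uarch_effect e)
  {
      if (log->n == MAX_EFFECTS)
          return false;              /* out of room: stall instead */
      log->buf[log->n++] = e;
      return true;
  }

  /* Only at retirement do the logged effects touch the machine. */
  static void retire(effect_log *log)
  {
      for (size_t i = 0; i < log->n; i++)
          log->buf[i].apply(log->buf[i].ctx);
      log->n = 0;
  }

  /* On a squash the log is simply dropped: no bit of committed
     microarchitectural state ever changed. */
  static void squash(effect_log *log)
  {
      log->n = 0;
  }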
MitchAlsup [2023-12-04 18:58:51] wrote:
> Quadibloc wrote:
>> On Sun, 03 Dec 2023 20:18:30 +0000, MitchAlsup wrote:
>>> You CAN build Spectre-free, Meltdown-free, ROP-free,... in GBOoO by
>>> following one simple rule:: No microarchitectural changes until the
>>> causing instruction retires. AND you can do this without losing
>>> performance.
>> I thought that the mitigations that _were_ costly in performance
>> were mostly attempts to approach following just that rule.

IIUC the mitigations are all done in software, and on the hardware side
they seem to largely disregard the issue as if they had abandoned all
hope to solve it at all.

> The mitigations were closer to:: cause the problem to vanish,
> but change as little of the µArchitecture as possible in doing
> it. But 6 years later, they apparently are still unwilling to
> alter the µArchitecture enough to completely eliminate them.

While you write above that "you can do this without losing
performance", IIUC it does have a cost in that you have to keep more
information as "speculative" and for longer.

But it reminds me of a recent discussion with similar costs: sequential
consistency, where there again the cost is to keep more instructions as
"speculative" for longer.
Stefan Monnier <monnier@iro.umontreal.ca> writes:
> MitchAlsup [2023-12-04 18:58:51] wrote:
>> Quadibloc wrote:
>>> On Sun, 03 Dec 2023 20:18:30 +0000, MitchAlsup wrote:
>>>> You CAN build Spectre-free, Meltdown-free, ROP-free,... in GBOoO by
>>>> following one simple rule:: No microarchitectural changes until the
>>>> causing instruction retires. AND you can do this without losing
>>>> performance.
>>> I thought that the mitigations that _were_ costly in performance
>>> were mostly attempts to approach following just that rule.
> IIUC the mitigations are all done in software, and on the hardware side
> they seem to largely disregard the issue as if they had abandoned all
> hope to solve it at all.

For a while I thought that the hardware people thought that this is
Somebody Else's Problem (like the Rowhammer disaster, which IMO would
have been fixed long ago if the same manufacturer made the memory and
the memory controller; hmm, in the case of Samsung that's actually the
case, but I guess that Samsung is large enough that it's as if the two
groups were different manufacturers).

However, reading "Speculative interference attacks: breaking invisible
speculation schemes", which has 6 authors with affiliation "Intel
Corporation, USA" (out of 16), someone at Intel seems to be aware of
the right way. Of course, the title of the paper and some of the
contents read as if it is intended to discourage the idea that Spectre
can be fixed, but if you ignore that, the paper points out a few more
problems that need to be fixed; some of them are included in what
Mitch Alsup wrote above; there is also a side channel through resource
contention that must also be closed, but that problem has been dealt
with, too.

So why the pessimistic title and partial content? Maybe it's just
that the authors are security researchers, a successful attack is
a matter of prestige in this community, and such a paper has to be
published before a defense paper can be seen as an achievement.
More paranoid explanations: 1) Intel does not plan to do anything, so
they don't want people to know that they could do something. 2) Intel
is working on a microarchitecture that fixes Spectre, but wants to
maximize this as a competitive advantage, so they want (superficial)
readers from the competition to get the impression that there is no
point in working on a fix.

>> The mitigations were closer to:: cause the problem to vanish,
>> but change as little of the µArchitecture as possible in doing
>> it. But 6 years later, they apparently are still unwilling to
>> alter the µArchitecture enough to completely eliminate them.
> While you write above that "you can do this without losing
> performance", IIUC it does have a cost in that you have to keep more
> information as "speculative" and for longer.
You just have to keep speculative microarchitectural information
speculative (which costs some area). Once it's no longer speculative,
it can be promoted to the "permanent" microarchitectural state.
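
A sketch of that promotion step for a load that missed the cache;
dcache_fill() is a hypothetical stand-in for the committed-state
update, and everything here is illustrative:

  #include <stdbool.h>
  #include <stdint.h>

  #define LINE_BYTES 64
  #define N_SPEC_BUFS 8

  /* A line fetched for a speculative load.  It is not entered into
     the D-cache yet, so a squashed load leaves no trace there. */
  typedef struct {
      uint64_t tag;
      uint8_t  data[LINE_BYTES];
      bool     valid;
  } spec_line;

  static spec_line spec_buf[N_SPEC_BUFS];

  /* hypothetical committed D-cache insert */
  extern void dcache_fill(uint64_t tag, const uint8_t *data);

  /* The load commits: promote the line from speculative to
     "permanent" microarchitectural state. */
  static void promote(int i)
  {
      if (spec_buf[i].valid) {
          dcache_fill(spec_buf[i].tag, spec_buf[i].data);
          spec_buf[i].valid = false;
      }
  }

  /* The load is squashed: the D-cache never sees the line, so a
     later probe cannot tell whether the load ever executed. */
  static void discard(int i)
  {
      spec_buf[i].valid = false;
  }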
> But it reminds me of a recent discussion with similar costs: sequential
> consistency, where there again the cost is to keep more instructions as
> "speculative" for longer.

That's different. Sequential consistency can be implemented
efficiently through an additional speculation mechanism: you would
speculate that the values you loaded are consistent with an ordering
of the loads and stores in the whole system that satisfies sequential
consistency. So there you would speculate until that order is known
to your local core, which could be a relatively long time.
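
A rough sketch of that mechanism, assuming the core snoops coherence
invalidations against a queue of loads whose place in the global order
is not yet known (all names invented for illustration):

  #include <stdbool.h>
  #include <stdint.h>

  #define N_LOADS 32

  /* Loads that have executed but are not yet known to fit into an
     SC-compatible global order. */
  typedef struct {
      uint64_t line_addr;
      bool     live;
  } spec_load;

  static spec_load slq[N_LOADS];

  /* hypothetical: flush and re-execute from the offending load */
  extern void squash_from(int idx);

  /* A snooped invalidation hitting a live entry means another core
     wrote the line, so the value we used may violate sequential
     consistency: stop speculating, roll back. */
  static void snoop_invalidate(uint64_t line_addr)
  {
      for (int i = 0; i < N_LOADS; i++) {
          if (slq[i].live && slq[i].line_addr == line_addr) {
              squash_from(i);
              return;
          }
      }
  }

  /* Once all older memory operations are globally ordered, the
     load's value is known to be SC-consistent and it can leave. */
  static void ordering_known(int i)
  {
      slq[i].live = false;
  }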
Fixing Spectre requires no additional speculation.
- anton
Stefan Monnier <monnier@iro.umontreal.ca> writes:
> IIUC the mitigations are all done in software, and on the hardware side
> they seem to largely disregard the issue as if they had abandoned all
> hope to solve it at all.
On the x86_64 side, perhaps.
ARM added a number of hint instructions, including a speculation
barrier instruction, that provide some hardware support for mitigations.
scott@slp53.sl.home (Scott Lurndal) writes:
> Stefan Monnier <monnier@iro.umontreal.ca> writes:
>> IIUC the mitigations are all done in software, and on the hardware side
>> they seem to largely disregard the issue as if they had abandoned all
>> hope to solve it at all.
> On the x86_64 side, perhaps.
> ARM added a number of hint instructions, including a speculation
> barrier instruction, that provide some hardware support for mitigations.

Such mitigation support instructions (which exist for Intel and AMD,
too) are not the fix; on the contrary: if Spectre is fixed in the
hardware, no such instructions are necessary. E.g., on ARM these
instructions should be unnecessary (and do nothing) on A53, A55, A510
and A520, because they do not perform speculative execution.
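
For illustration, this is the kind of code those barrier instructions
are aimed at; __builtin_speculation_safe_value is GCC's portable
wrapper for them, guarded here because not every compiler/target
supports it (the array names are the classic illustrative ones, not
from any real code base):

  #include <stddef.h>
  #include <stdint.h>

  extern uint8_t array1[16];
  extern size_t  array1_size;

  uint8_t victim_load(size_t x)
  {
      if (x < array1_size) {
          /* Without a barrier, a mispredicted branch can execute
             this load with an out-of-bounds x and leave a
             value-dependent footprint in the cache. */
  #ifdef __HAVE_SPECULATION_SAFE_VALUE
          x = __builtin_speculation_safe_value(x);
  #endif
          return array1[x];
      }
      return 0;
  }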
Anton Ertl wrote:
> You just have to keep speculative microarchitectural information
> speculative (which costs some area). Once it's no longer speculative,
> it can be promoted to the "permanent" microarchitectural state.

Retired as it were.

> That's different. Sequential consistency can be implemented
> efficiently through an additional speculation mechanism: you would
> speculate that the values you loaded are consistent with an ordering
> of the loads and stores in the whole system that satisfies sequential
> consistency. So there you would speculate until that order is known
> to your local core, which could be a relatively long time.
Sequential consistency can be implemented by minor addition to a
memory dependence matrix, where dependencies are relaxed less
aggressively.
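
Perhaps something like the following toy rendering of such a matrix
(the relaxation rule is the point; a real scheduler would relax more
aggressively than this, and all names are invented):

  #include <stdbool.h>
  #include <stdint.h>

  #define Q 16                /* memory-queue entries, oldest first */

  typedef struct {
      bool     is_store;
      bool     addr_known;
      uint64_t line;
  } memop;

  static memop q[Q];
  /* dep[i][j]: memory op i may not issue before op j completes */
  static bool dep[Q][Q];

  /* Weak model: a load waits only on older stores that overlap it
     (or whose address is still unknown).  SC, relaxed less
     aggressively: here, maximally conservative, every memory op
     also waits on ALL older memory ops, so loads become visible
     in program order. */
  static void set_deps(int i, bool sequential_consistency)
  {
      for (int j = 0; j < i; j++) {
          bool overlap = !q[j].addr_known || q[j].line == q[i].line;
          dep[i][j] = (q[j].is_store && overlap)
                   || sequential_consistency;
      }
  }

  /* op i may issue once its row is clear */
  static bool can_issue(int i)
  {
      for (int j = 0; j < i; j++)
          if (dep[i][j])
              return false;
      return true;
  }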
>> While you write above that "you can do this without losing
>> performance", IIUC it does have a cost in that you have to keep more
>> information as "speculative" and for longer.
> You just have to keep speculative microarchitectural information
> speculative (which costs some area). Once it's no longer speculative,
> it can be promoted to the "permanent" microarchitectural state.

Right, so it needs extra hardware to keep extra information until we
know for sure that it's not speculative.

>> But it reminds me of a recent discussion with similar costs: sequential
>> consistency, where there again the cost is to keep more instructions as
>> "speculative" for longer.
> That's different. Sequential consistency can be implemented
> efficiently through an additional speculation mechanism: you would
> speculate that the values you loaded are consistent with an ordering
> of the loads and stores in the whole system that satisfies sequential
> consistency. So there you would speculate until that order is known
> to your local core, which could be a relatively long time.
IIUC this boils down to keeping some memory operations (reads) as
speculative for a longer amount of time (until all previous memory
operations are not speculative any more), which delays their retirement.
So the cost is that it requires keeping more instructions "in flight",
so it increases the "residency" of instructions in things like a ROB,
meaning that for a fixed size ROB you'll get less performance, or that
to keep the same performance you need a larger ROB.
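
Back-of-the-envelope, via Little's law (entries needed = retire rate
times residency); the numbers are made up purely for illustration:

  #include <stdio.h>

  int main(void)
  {
      double ipc       = 4.0;  /* insns retired per cycle (assumed) */
      double residency = 60.0; /* cycles an insn sits in the ROB    */
      double extra     = 40.0; /* added wait for the global order   */

      printf("ROB needed, weak order: %.0f entries\n",
             ipc * residency);                    /* 240 */
      printf("ROB needed, SC wait:    %.0f entries\n",
             ipc * (residency + extra));          /* 400 */
      return 0;
  }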
So in both cases, we need extra hardware to keep track of extra
speculative info. In one case it's extra info about existing
speculation, and in the other it's existing info but about "extra speculation".
Stefan
mitchalsup@aol.com (MitchAlsup) writes:
> Anton Ertl wrote:
>> You just have to keep speculative microarchitectural information
>> speculative (which costs some area). Once it's no longer speculative,
>> it can be promoted to the "permanent" microarchitectural state.
> Retired as it were.
Yes, although I think that "committed" is the better term in this
context.
>> That's different. Sequential consistency can be implemented
>> efficiently through an additional speculation mechanism: you would
>> speculate that the values you loaded are consistent with an ordering
>> of the loads and stores in the whole system that satisfies sequential
>> consistency. So there you would speculate until that order is known
>> to your local core, which could be a relatively long time.
> Sequential consistency can be implemented by minor addition to a
> memory dependence matrix, where dependencies are relaxed less
> aggressively.
Yes, you can implement sequential consistency by restricting the
ordering, but according to the advocates of weak consistency models,
this costs performance for single-threaded code. I guess that's also
why your architecture has blocks of sequential consistency, but
switches to weaker consistency in other parts of the program.
The alternative I outlined above is to speculate that the loaded value
is not written by some other core in a way that would violate
sequential consistency, and abandon the speculative state if that
turns out to be wrong. For single-threaded code this provides the
same performance as weak consistency (as long as the CPU has enough speculative resources).
For code that communicates with other cores, it experiences slowdowns
only in the contended case, unlike typical implementations of cores
where weaker consistency is faster, and the programmer has to throw in
membars or somesuch that always cost performance, or has to switch to
a stronger consistency mode that always costs performance.
- anton
> Right, so it needs extra hardware to keep extra information until we
> know for sure that it's not speculative.

Why is the Miss Buffer considered "extra hardware" ??
>> Right, so it needs extra hardware to keep extra information until we
>> know for sure that it's not speculative.
> Why is the Miss Buffer considered "extra hardware" ??
IIUC to avoid Spectre you need (among other things) to refrain from
updating the branch prediction table until the prediction is verified,
so you need to keep a buffer (presumably a queue) of prediction updates
which is not otherwise needed.
Similarly, when fetching something from memory you have to keep it in
a miss buffer until the load instruction is committed.
Maybe this miss buffer would exist in any case, but I suspect that the
fact that you have to keep it there for (at least) a specific duration
will increase the residency in this miss buffer and thus in turn would
require a larger miss buffer if we don't want to slow down execution.
Stefan
Stefan Monnier <monnier@iro.umontreal.ca> writes:
> IIUC to avoid Spectre you need (among other things) to refrain from
> updating the branch prediction table until the prediction is verified,
> so you need to keep a buffer (presumably a queue) of prediction updates
> which is not otherwise needed.

And you need to flush that queue on context switch, right?
> Stefan Monnier <monnier@iro.umontreal.ca> writes:
>>>> Right, so it needs extra hardware to keep extra information until we
>>>> know for sure that it's not speculative.
>>> Why is the Miss Buffer considered "extra hardware" ??
>> IIUC to avoid Spectre you need (among other things) to refrain from
>> updating the branch prediction table until the prediction is verified,
>> so you need to keep a buffer (presumably a queue) of prediction updates
>> which is not otherwise needed.
> And you need to flush that queue on context switch, right?
1) Every effect of every instruction executed to completion must be part
of greater-machine-state.
2) No state of any discarded instruction is allowed to change any single
bit of greater-machine-state.

Greater-machine-state is all of the bits required to restore the logic
to this exact state once there are no instructions executing in the
machine (pipeline is drained). Exact state includes all timing-visible
prediction state (control-visible timing) and buffering state
(data-visible timing).

Yes, my architecture uses sequential consistency only when required, and
the programmer does not have to deal with the boundaries; HW does.
mitchalsup@aol.com (MitchAlsup) writes:
> Yes, my architecture uses sequential consistency only when required, and
> the programmer does not have to deal with the boundaries; HW does.
That sounds cool, how does the hardware detect that it has to switch
to sequential consistency?
- anton
Stefan Monnier <monnier@iro.umontreal.ca> writes:
>>> While you write above that "you can do this without losing
>>> performance", IIUC it does have a cost in that you have to keep more
>>> information as "speculative" and for longer.
>> You just have to keep speculative microarchitectural information
>> speculative (which costs some area). Once it's no longer speculative,
>> it can be promoted to the "permanent" microarchitectural state.
> Right, so it needs extra hardware to keep extra information until we
> know for sure that it's not speculative.
If you use extra hardware for that, you also get extra performance
(like you get extra performance from the extra architectural state).
E.g., if you have n speculative load buffers, they serve as additional
D-cache lines. OTOH, if you reduce the committed D-cache by these
lines, you don't need much extra hardware, but you lose performance.
>> That's different. Sequential consistency can be implemented
>> efficiently through an additional speculation mechanism: you would
>> speculate that the values you loaded are consistent with an ordering
>> of the loads and stores in the whole system that satisfies sequential
>> consistency. So there you would speculate until that order is known
>> to your local core, which could be a relatively long time.
> IIUC this boils down to keeping some memory operations (reads) as
> speculative for a longer amount of time (until all previous memory
> operations are not speculative any more), which delays their retirement.
> So the cost is that it requires keeping more instructions "in flight",
> so it increases the "residency" of instructions in things like a ROB,
> meaning that for a fixed-size ROB you'll get less performance, or that
> to keep the same performance you need a larger ROB.
No, retirement is in-order, so all instructions after a cache miss
have to wait in the ROB anyway until the miss is served. There is an
implementation cost to sequential consistency, but this is not it. I
think if the computer architects put their minds to it, they can find
solutions that are good in performance and low in hardware cost.

But that means that we first have to shatter the meme that weak
consistency is necessary for performance. If the EPIC meme had been
as successful at suppressing research and development of OoO as the
weak consistency meme is at suppressing research into and development
of efficient hardware implementations of sequential consistency,
everybody would be using CPUs like Poulson and Efficeon these days,
and think that EPIC is necessary for performance.
- anton
Stefan Monnier <monnier@iro.umontreal.ca> writes:
>>> Right, so it needs extra hardware to keep extra information until we
>>> know for sure that it's not speculative.
>> Why is the Miss Buffer considered "extra hardware" ??
> IIUC to avoid Spectre you need (among other things) to refrain from
> updating the branch prediction table until the prediction is verified,

Yes.

> so you need to keep a buffer (presumably a queue) of prediction updates
> which is not otherwise needed.
That buffer already exists: it's the ROB. When a branch and its
outcome become non-speculative (or upon retirement), you feed the
outcome of the branch into the branch predictor(s).
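
Roughly, in illustrative C (predictor_train() stands in for whatever
the real training interface looks like):

  #include <stdbool.h>
  #include <stdint.h>

  typedef struct {
      bool     is_branch;
      bool     taken;       /* resolved outcome   */
      bool     done;        /* execution finished */
      uint64_t pc;
  } rob_entry;

  #define ROB_SIZE 256
  static rob_entry rob[ROB_SIZE];
  static int head;          /* oldest in-flight instruction */

  extern void predictor_train(uint64_t pc, bool taken);

  /* In-order retire: only here does a branch outcome reach the
     predictor tables, so squashed wrong-path branches never train
     the committed history.  The ROB itself is the "queue" of
     pending predictor updates; no separate buffer is needed. */
  static void retire_one(void)
  {
      rob_entry *e = &rob[head];
      if (!e->done)
          return;           /* oldest not finished yet */
      if (e->is_branch)
          predictor_train(e->pc, e->taken);
      head = (head + 1) % ROB_SIZE;
  }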
Whether you also need speculation-based speculative "history"
predictors is a good question, but in case of a misprediction it
certainly looks like a good idea to me not to pollute the committed
history predictors with these wrong predictions, so you also get a
benefit from not putting speculative predictions into the history of
predictors that are used for committed history.
Concerning the return-address stack predictor, you certainly want a
separate speculative version of that in addition to the committed
version, otherwise a misprediction recovery can result in multiple
subsequent return stack mispredictions. I expect that existing CPUs
already have some mechanism for resynchronizing the return stack after
a resolved misprediction, so I expect little additional cost for that.
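
A minimal sketch of the two-copy arrangement; a real design would
checkpoint per predicted branch rather than bulk-copy on recovery, and
the names are invented:

  #include <stdint.h>
  #include <string.h>

  #define RAS_DEPTH 16

  typedef struct {
      uint64_t addr[RAS_DEPTH];
      int top;
  } ras;

  static ras committed;   /* updated only at retirement              */
  static ras speculative; /* used by fetch; may run down a wrong path */

  /* fetch-side prediction uses (and mutates) the speculative copy */
  static void spec_call(uint64_t ret_addr)
  {
      speculative.top = (speculative.top + 1) % RAS_DEPTH;
      speculative.addr[speculative.top] = ret_addr;
  }

  static uint64_t spec_ret(void)
  {
      uint64_t a = speculative.addr[speculative.top];
      speculative.top = (speculative.top + RAS_DEPTH - 1) % RAS_DEPTH;
      return a;
  }

  /* retirement maintains the committed copy the same way; on a
     resolved misprediction one bulk copy resynchronizes fetch,
     instead of a cascade of wrong return predictions. */
  static void recover(void)
  {
      memcpy(&speculative, &committed, sizeof speculative);
  }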
- anton
Anton Ertl wrote:
> Concerning the return-address stack predictor, you certainly want a
> separate speculative version of that in addition to the committed
> version, otherwise a misprediction recovery can result in multiple
> subsequent return stack mispredictions. I expect that existing CPUs
> already have some mechanism for resynchronizing the return stack after
> a resolved misprediction, so I expect little additional cost for that.
Yes, you definitely have to be able to RET and then unRET if the RET
was under the shadow of a mispredicted branch. Athlon and Opteron
used a doubly linked list with 16 entries (4-bit pointers).
MitchAlsup wrote:
> Anton Ertl wrote:
>> Concerning the return-address stack predictor, you certainly want a
>> separate speculative version of that in addition to the committed
>> version, otherwise a misprediction recovery can result in multiple
>> subsequent return stack mispredictions. I expect that existing CPUs
>> already have some mechanism for resynchronizing the return stack after
>> a resolved misprediction, so I expect little additional cost for that.
> Yes, you definitely have to be able to RET and then unRET if the RET
> was under the shadow of a mispredicted branch. Athlon and Opteron
> used a doubly linked list with 16 entries (4-bit pointers).
The doubly linked list approach could work, but you have to checkpoint
the whole Return Address Stack Predictor (RASP) data structure:
(2*4b links + 64b return address per entry) * 16 entries, + 2*4b list head
= ~1160 bits per checkpoint, * 16 checkpoints = 18,560 bits of checkpoint
SRAM.
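
(The same arithmetic, as a checkable C snippet:)

  #include <stdio.h>

  int main(void)
  {
      int entry_bits = 2*4 + 64;    /* two 4b links + 64b address */
      int entries    = 16;
      int head_bits  = 2*4;         /* list head/tail pointers    */
      int per_ckpt   = entry_bits*entries + head_bits;  /* 1160   */
      int ckpts      = 16;
      printf("%d bits/checkpoint, %d bits total\n",
             per_ckpt, per_ckpt*ckpts);                 /* 18560  */
      return 0;
  }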
To manipulate the doubly linked list, the RASP SRAM also requires
6 read and 6 write ports _for each concurrent RET decode lane_,
plus a bulk-copy read and a write port for RASP checkpoint/rollback.
And I'm skipping over free-entry list management.

It's an expensive little gadget.
On 12/7/2023 11:00 AM, MitchAlsup wrote:
> Anton Ertl wrote:
>> mitchalsup@aol.com (MitchAlsup) writes:
>>> Yes, my architecture uses sequential consistency only when required,
>>> and the programmer does not have to deal with the boundaries; HW does.
>> That sounds cool, how does the hardware detect that it has to switch
>> to sequential consistency?
> Translate an AGEN into MMI/O space
> or
> LD.lock
Akin to if a programmer knows it absolutely _needs_ to use a #StoreLoad,
it will add them in the right places?
Chris M. Thomasson wrote:
> On 12/7/2023 11:00 AM, MitchAlsup wrote:
>> Anton Ertl wrote:
>>> mitchalsup@aol.com (MitchAlsup) writes:
>>>> Yes, my architecture uses sequential consistency only when required,
>>>> and the programmer does not have to deal with the boundaries; HW does.
>>> That sounds cool, how does the hardware detect that it has to switch
>>> to sequential consistency?
>> Translate an AGEN into MMI/O space
>> or
>> LD.lock
> Akin to if a programmer knows it absolutely _needs_ to use a #StoreLoad,
> it will add them in the right places?
Consider a virtualized device driver. He may think he is performing a
ST to MMI/O space while the HyperVisor has remapped his virtual device
control space back to DRAM.