On Fri, 26 Jul 2024 17:00:07 +0000, Anton Ertl wrote:
Similarly, I expect that hardware that is designed for good TSO or
sequential consistency performance will run faster on code written for
this model than code written for weakly consistent hardware will run
on that hardware.
According to Lamport; only the ATOMIC stuff needs sequential
consistency.
So, it is completely possible to have a causally consistent processor
that switches to sequential consistency when doing ATOMIC stuff and gain
performance when not doing ATOMIC stuff, and gain programmability when
doing atomic stuff.
That's because software written for weakly
consistent hardware often has to insert barriers or atomic operations
just in case, and these operations are slow on hardware optimized for
weak consistency.
The operations themselves are not slow.
By contrast, one can design hardware for strong ordering such that the
slowness occurs only in those cases when actual (not potential)
communication between the cores happens, i.e., much less frequently.
How would you do this for a 256-way banked memory system of the
NEC SX ?? I.E., the processor is not in charge of memory order--
the memory system is.
On 7/26/2024 10:00 AM, Anton Ertl wrote:[...]
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
By contrast, one can design hardware for strong ordering such that the
slowness occurs only in those cases when actual (not potential)
communication between the cores happens, i.e., much less frequently.
and sometimes use cases do not care if they encounter "stale" data.
Great. Unless these "sometimes" cases are more often than the cases
where you perform some atomic operation or barrier because of
potential, but not actual communication between cores, the weak model
is still slower than a well-implemented strong model.
A strong model? You mean I don't have to use any memory barriers at all?
Tell that to SPARC in RMO mode...
Even the x86 requires a
membar when a store followed by a load to another location shall be
respected wrt order.
I
thought it was easier for a HW guy to implement weak consistency? At the
cost of the increased complexity wrt programming the sucker! ;^)
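A minimal C++ sketch of that store-followed-by-load case (illustrative names, not taken from any of the posts): each thread stores its own flag and then loads the other's; without a full barrier between the store and the load, both threads can read 0 even on x86, because each store can still be sitting in its core's store buffer.

  #include <atomic>
  #include <thread>
  #include <cassert>

  std::atomic<int> x{0}, y{0};
  int r1, r2;

  void t0() {
      x.store(1, std::memory_order_relaxed);
      std::atomic_thread_fence(std::memory_order_seq_cst); // full barrier (MFENCE-class on x86)
      r1 = y.load(std::memory_order_relaxed);
  }

  void t1() {
      y.store(1, std::memory_order_relaxed);
      std::atomic_thread_fence(std::memory_order_seq_cst); // without these fences r1 == r2 == 0 is possible
      r2 = x.load(std::memory_order_relaxed);
  }

  int main() {
      std::thread a(t0), b(t1);
      a.join(); b.join();
      assert(r1 == 1 || r2 == 1);  // cannot fire with the fences in place
  }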
mitchalsup@aol.com (MitchAlsup1) writes:
On Fri, 26 Jul 2024 17:00:07 +0000, Anton Ertl wrote:
Similarly, I expect that hardware that is designed for good TSO or
sequential consistency performance will run faster on code written for
this model than code written for weakly consistent hardware will run
on that hardware.
According to Lamport; only the ATOMIC stuff needs sequential
consistency.
So, it is completely possible to have a causally consistent processor
that switches to sequential consistency when doing ATOMIC stuff and gain
performance when not doing ATOMIC stuff, and gain programmability when
doing atomic stuff.
That's not what I have in mind. What I have in mind is hardware that,
e.g., speculatively performs loads, predicting that no other core will
store there with an earlier time stamp. But if another core actually performs such a store, the usual misprediction handling happens and
the code starting from that mispredicted load is reexecuted. So as
long as two cores do not access the same memory, they can run at full
speed, and there is only slowdown if there is actual (not potential) communication between the cores.
A problem with that approach is that this requires enough reorder
buffering (or something equivalent, there may be something cheaper for
this particular problem) to cover at least the shared-cache latency
(usually L3, more with multiple sockets).
That's because software written for weakly
consistent hardware often has to insert barriers or atomic operations
just in case, and these operations are slow on hardware optimized for
weak consistency.
The operations themselves are not slow.
Citation needed.
By contrast, one can design hardware for strong ordering such that the
slowness occurs only in those cases when actual (not potential)
communication between the cores happens, i.e., much less frequently.
How would you do this for a 256-way banked memory system of the
NEC SX ?? I.E., the processor is not in charge of memory order--
the memory system is.
Memory consistency is defined wrt what several processors do. Some
processor performs some reads and writes and another performs some
read and writes, and memory consistency defines what a processor sees
about what the other does, and what ends up in main memory. But as
long as the processors, their caches, and their interconnect gets the
memory ordering right, the main memory is just the backing store that eventually gets a consistent result of what the other components did.
So it does not matter whether the main memory has one bank or 256.
One interesting aspect is that for supercomputers I generally think
that they have not yet been struck by the software crisis:
Supercomputer hardware is more expensive than supercomputer software.
So I expect that supercomputer hardware designers tend to throw
complexity over the wall to the software people, and in many cases
they do (the Cell broadband engine offers many examples of that).
However, "some ... Fujitsu [ARM] CPUs run with TSO at all times" <https://lwn.net/Articles/970907/>; that sounds like the A64FX, a
processor designed for supercomputing. So apparently in this case the hardware designers actually accepted the hardware and design
complexity cost of TSO and gave a better model to software, even in
hardware designed for a supercomputer.
- anton
On 7/29/2024 10:49 AM, MitchAlsup1 wrote:
[...]
A MEMBAR dropped into the pipeline, when nothing is speculative, takes
no more time than an integer ADD. Only when there is speculation does
it have to take time to relax the speculation.
I am wondering if you were ever aware of the "danger zone" wrt putting a MEMBAR instruction in a memory delay slot over on the SPARC? The docs explicitly said not to do it. I guess it can cause some interesting
memory order "issues" that might allow a running system to last for say,
five years before crashing at a random time... Yikes!
[...]
On Mon, 29 Jul 2024 20:08:10 +0000, Chris M. Thomasson wrote:
On 7/29/2024 10:49 AM, MitchAlsup1 wrote:
[...]
A MEMBAR dropped into the pipeline, when nothing is speculative,
takes no more time than an integer ADD. Only when there is
speculation does it have to take time to relax the speculation.
I am wondering if you were ever aware of the "danger zone" wrt
putting a MEMBAR instruction in a memory delay slot over on the
SPARC? The docs explicitly said not to do it. I guess it can cause
some interesting memory order "issues" that might allow a running
system to last for say, five years before crashing at a random
time... Yikes!
Which, btw, is why one should exhaustively test atomic code before
putting it into production. You do not want to chase down a memory ordering
issue that occurs, in production, less than once a month.
I did a lot of programming on SPARCs and never needed a MEMBAR....
but the OS guys used them like they were "free".
All sorts of things were dangerous in the delay slot of a branch,
not the least of which was another branch.
Prior to the introduction of multi-threaded cores, the rounding
modes of the FPU status register were hard wired to the FU.
Afterwards, they became "just another piece of state" that got
pipelined down instruction execution.
[...]
On Mon, 29 Jul 2024 13:21:10 +0000, Anton Ertl wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Fri, 26 Jul 2024 17:00:07 +0000, Anton Ertl wrote:
Similarly, I expect that hardware that is designed for good TSO or
sequential consistency performance will run faster on code written for
this model than code written for weakly consistent hardware will run
on that hardware.
According to Lamport; only the ATOMIC stuff needs sequential
consistency.
So, it is completely possible to have a causally consistent processor
that switches to sequential consistency when doing ATOMIC stuff and gain
performance when not doing ATOMIC stuff, and gain programmability when
doing atomic stuff.
That's not what I have in mind. What I have in mind is hardware that,
e.g., speculatively performs loads, predicting that no other core will
store there with an earlier time stamp. But if another core actually
performs such a store, the usual misprediction handling happens and
the code starting from that mispredicted load is reexecuted. So as
long as two cores do not access the same memory, they can run at full
speed, and there is only slowdown if there is actual (not potential)
communication between the cores.
OK...
A problem with that approach is that this requires enough reorder
buffering (or something equivalent, there may be something cheaper for
this particular problem) to cover at least the shared-cache latency
(usually L3, more with multiple sockets).
The depth of the execution window may be smaller than the time it takes
to send the required information around and have this core recognize
that it is out-of-order wrt memory.
The operations themselves are not slow.
Citation needed.
A MEMBAR dropped into the pipeline, when nothing is speculative, takes
no more time than an integer ADD. Only when there is speculation does
it have to take time to relax the speculation.
Memory consistency is defined wrt what several processors do. Some
processor performs some reads and writes and another performs some
read and writes, and memory consistency defines what a processor sees
about what the other does, and what ends up in main memory. But as
long as the processors, their caches, and their interconnect gets the
memory ordering right, the main memory is just the backing store that
eventually gets a consistent result of what the other components did.
So it does not matter whether the main memory has one bank or 256.
NEC SX is a multi-processor vector machine with the property that
addresses are spewed out as fast as AGEN can perform. These addresses
are routed to banks based on bus-segment and can arrive OoO wrt
how they were spewed out.
So two processors accessing the same memory using vector LDs will
see a single vector having multiple memory orderings. P[0]V[0] ordered
before P[1]V[0] but P[1]V[1] ordered before P[0]V[1], ...
mitchalsup@aol.com (MitchAlsup1) writes:
On Mon, 29 Jul 2024 13:21:10 +0000, Anton Ertl wrote:
A problem with that approach is that this requires enough reorder
buffering (or something equivalent, there may be something cheaper for
this particular problem) to cover at least the shared-cache latency
(usually L3, more with multiple sockets).
The depth of the execution window may be smaller than the time it takes
to send the required information around and have this core recognize
that it is out-of-order wrt memory.
So if we don't want to stall for memory accesses all the time, we need
a bigger execution window, either by making the reorder buffer larger,
or by using a different, cheaper mechanism.
Concerning the cheaper mechanism, what I am thinking of is hardware checkpointing every, say, 200 cycles or so (subject to fine-tuning).
The idea here is that communication between cores is very rare, so
rolling back more cycles than the minimal necessary amount costs
little on average (except that it looks bad on cache ping-pong microbenchmarks).
The operations themselves are not slow.
Citation needed.
A MEMBAR dropped into the pipeline, when nothing is speculative, takes
no more time than an integer ADD. Only when there is speculation does
it have to take time to relax the speculation.
Not sure what kind of speculation you mean here. On in-order cores
like the non-Fujitsu SPARCs from before about 2010 memory barriers are expensive AFAIK, even though there is essentially no branch
speculation on in-order cores.
Of course, if you mean speculation about the order of loads and
stores, yes, if you don't have such speculation, the memory barriers
are fast, but then loads are extremely slow.
Memory consistency is defined wrt what several processors do. Some
processor performs some reads and writes and another performs some
read and writes, and memory consistency defines what a processor sees
about what the other does, and what ends up in main memory. But as
long as the processors, their caches, and their interconnect gets the
memory ordering right, the main memory is just the backing store that
eventually gets a consistent result of what the other components did.
So it does not matter whether the main memory has one bank or 256.
NEC SX is a multi-processor vector machine with the property that
addresses are spewed out as fast as AGEN can perform. These addresses
are routed to banks based on bus-segment and can arrive OoO wrt
how they were spewed out.
So two processors accessing the same memory using vector LDs will
see a single vector having multiple memory orderings. P[0]V[0] ordered
before P[1]V[0] but P[1]V[1] ordered before P[0]V[1], ...
As long as no stores happen, who cares about the order of the loads?
When stores happen, the loads are ordered wrt these stores (with
stronger memory orderings giving more guarantees). So the number of
memory banks does not matter for implementing a strong ordering
efficiently.
- anton
On Tue, 30 Jul 2024 9:51:46 +0000, Anton Ertl wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
The depth of the execution window may be smaller than the time it takes
to send the required information around and have this core recognize
that it is out-of-order wrt memory.
So if we don't want to stall for memory accesses all the time, we need
a bigger execution window, either by making the reorder buffer larger,
or by using a different, cheaper mechanism.
Mc 88120 had a 96-wide execution window, which could be filled up in
16 cycles (optimistically) and was filled up in 32 cycles (average).
Given that DRAM is not going to be less than 20 ns and a 5GHz core,
the execution window is 1/3rd that which would be required to absorb
a cache miss all the way to DRAM.
Concerning the cheaper mechanism, what I am thinking of is hardware
checkpointing every, say, 200 cycles or so (subject to fine-tuning).
The idea here is that communication between cores is very rare, so
rolling back more cycles than the minimal necessary amount costs
little on average (except that it looks bad on cache ping-pong
microbenchmarks).
You lost me::
Colloquially, there are 2 uses of the word checkpointing:: a) what
HW does each time it inserts a branch into the EW, b) what an OS or
application does to be able to recover from a crash (from any
mechanism).
A MEMBAR requires the memory order to catch up to the current point
before adding new AGENs to the problem space. If the memory order
is already SC then MEMBAR has nothing to do and is pushed through
the pipeline without delay.
Then consider 2 Vector processors performing 2 STs (1 each) to
non-overlapping addresses but with bank aliasing. Consider that
the STs are scatter based and the bank conflicts random. There
is no way to determine which store happened first or which
element of each vector store happened first.
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 30 Jul 2024 9:51:46 +0000, Anton Ertl wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
The depth of the execution window may be smaller than the time it takes
to send the required information around and have this core recognize
that it is out-of-order wrt memory.
So if we don't want to stall for memory accesses all the time, we need
a bigger execution window, either by making the reorder buffer larger,
or by using a different, cheaper mechanism.
Colloquially, there are 2 uses of the word checkpointing:: a) what
HW does each time it inserts a branch into the EW, b) what an OS or
application does to be able to recover from a crash (from any
mechanism).
What is "EW"?
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 30 Jul 2024 9:51:46 +0000, Anton Ertl wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
A MEMBAR requires the memory order to catch up to the current point
before adding new AGENs to the problem space. If the memory order
is already SC then MEMBAR has nothing to do and is pushed through
the pipeline without delay.
Yes, that's the slow implementation. The fast implementation is to
implement sequential consistency all the time (by predicting and
speculating that memory accesses do not interfere with those of other
cores, and recovering from that speculation when the speculation turns
out to be wrong). In such an implementation memory barriers are noops
(and thus fast), because the hardware already provides sequential consistency.
Then consider 2 Vector processors performing 2 STs (1 each) to
non-overlapping addresses but with bank aliasing. Consider that
the STs are scatter based and the bank conflicts random. There
is no way to determine which store happened first or which
element of each vector store happened first.
It's up to the architecture to define the order of stores and loads of
a given core. For sequential consistency you then interleave the
sequences coming from the cores in some convenient order.
It does not
matter what happens earlier in some inertial system. It only matters
what your hardware decides should be treated as being earlier. The
hardware has a lot of freedom here, but the end result as visible to
the cores must be sequentially consistent (or, with a weaker memory consistency model, consistent with that model).
- anton
On Thu, 1 Aug 2024 15:54:55 +0000, Anton Ertl wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 30 Jul 2024 9:51:46 +0000, Anton Ertl wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
A MEMBAR requires the memory order to catch up to the current point
before adding new AGENs to the problem space. If the memory order
is already SC then MEMBAR has nothing to do and is pushed through
the pipeline without delay.
Yes, that's the slow implementation. The fast implementation is to
implement sequential consistency all the time (by predicting and
speculating that memory accesses do not interfere with those of other
cores, and recovering from that speculation when the speculation turns
out to be wrong). In such an implementation memory barriers are noops
(and thus fast), because the hardware already provides sequential
consistency.
Why does SC need any MEMBARs ??
Then consider 2 Vector processors performing 2 STs (1 each) to
non-overlapping addresses but with bank aliasing. Consider that
the STs are scatter based and the bank conflicts random. There
is no way to determine which store happened first or which
element of each vector store happened first.
It's up to the architecture to define the order of stores and loads of
a given core. For sequential consistency you then interleave the
sequences coming from the cores in some convenient order.
Insufficient:: If an OoO processor orders LDs and STs as they leave AGEN
you cannot just interleave multiple core access streams and achieve
sequential consistency.
On 11/14/2024 11:25 PM, Anton Ertl wrote:
aph@littlepinkcloud.invalid writes:
Yes. That Alpha behaviour was a historic error. No one wants to do
that again.
Was it an actual behaviour of any Alpha for public sale, or was it
just the Alpha specification? I certainly think that Alpha's lack
of guarantees in memory ordering is a bad idea, and so is ARM's:
"It's only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously? Sequential consistency can be specified in one sentence: "The result
of any execution is the same as if the operations of all the
processors were executed in some sequential order, and the
operations of each individual processor appear in this sequence in
the order specified by its program."
Well, iirc, the Alpha is the only system that requires an explicit
membar for a RCU based algorithm. Even SPARC in RMO mode does not
need this. Iirc, akin to memory_order_consume in C++:
https://en.cppreference.com/w/cpp/atomic/memory_order
data dependent loads
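A small sketch of the data-dependent-load case in question (illustrative names): the reader's second load depends on the pointer produced by the first, and that dependency orders the two loads on every mainstream ISA except Alpha, which needs an explicit mb; memory_order_consume is how C++ spells that intent.

  #include <atomic>

  struct Node { int payload; };

  std::atomic<Node*> head{nullptr};

  void publish(Node* n) {
      n->payload = 42;                           // initialize the node first
      head.store(n, std::memory_order_release);  // then publish the pointer
  }

  int read() {
      // The load of n->payload depends on the value loaded from 'head'.
      // That dependency orders the two loads on every mainstream ISA except
      // Alpha, which needs an explicit mb; C++ spells the intent 'consume'.
      Node* n = head.load(std::memory_order_consume);
      return n ? n->payload : -1;
  }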
aph@littlepinkcloud.invalid writes:
Yes. That Alpha behaviour was a historic error. No one wants to do
that again.
Was it an actual behaviour of any Alpha for public sale, or was it
just the Alpha specification? I certainly think that Alpha's lack of guarantees in memory ordering is a bad idea, and so is ARM's: "It's
only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously?
Sequential consistency can be specified in one sentence: "The result
of any execution is the same as if the operations of all the
processors were executed in some sequential order, and the operations
of each individual processor appear in this sequence in the order
specified by its program."
However, I don't think that the Alpha architects considered the Alpha
memory ordering to be an error, and probably still don't, just like
the ARM architects don't consider their memory model to be an error.
I am pretty sure that no Alpha implementation ever made use of the
lack of causality in the Alpha memory model, so they could have added causality without outlawing existing implementations. That they did
not indicates that they thought that their memory model was right. An advocacy paper for weak memory models [adve&gharachorloo95] came from
the same place as Alpha, so it's no surprise that Alpha specifies weak consistency.
@TechReport{adve&gharachorloo95,
author = {Sarita V. Adve and Kourosh Gharachorloo},
title = {Shared Memory Consistency Models: A Tutorial},
institution = {Digital Western Research Lab},
year = {1995},
type = {WRL Research Report},
number = {95/7},
annote = {Gives an overview of architectural features of
shared-memory computers such as independent memory
banks and per-CPU caches, and how they make the (for
programmers) most natural consistency model hard to
implement, giving examples of programs that can fail
with weaker consistency models. It then discusses
several categories of weaker consistency models and
actual consistency models in these categories, and
which ``safety net'' (e.g., memory barrier
instructions) programmers need to use to work around
the deficiencies of these models. While the authors
recognize that programmers find it difficult to use
these safety nets correctly and efficiently, it
still advocates weaker consistency models, claiming
that sequential consistency is too inefficient, by
outlining an inefficient implementation (which is of
course no proof that no efficient implementation
exists). Still the paper is a good introduction to
the issues involved.}
}
- anton
Anybody doing that sort of programming, i.e. lock-free or distributed
algorithms, who can't handle weakly consistent memory models, shouldn't
be doing that sort of programming in the first place.
Strongly consistent memory won't help incompetence.
On 11/15/2024 4:05 PM, Chris M. Thomasson wrote:
On 11/15/2024 12:53 PM, BGB wrote:
On 11/15/2024 11:27 AM, Anton Ertl wrote:[...]
jseigh <jseigh_es00@xemaps.com> writes:
Anybody doing that sort of programming, i.e. lock-free or distributed
algorithms, who can't handle weakly consistent memory models, shouldn't
be doing that sort of programming in the first place.
Do you have any argument that supports this claim.
Strongly consistent memory won't help incompetence.
Strong words to hide lack of arguments?
In my case, as I see it:
  The tradeoff is more about implementation cost, performance, etc.
Weak model:
  Cheaper (and simpler) to implement;
  Performs better when there is no need to synchronize memory;
  Performs worse when there is need to synchronize memory;
  ...
A TSO from a weak memory model is as it is. It should not necessarily
perform "worse" than other systems that have TSO as a default. The
weaker models give us flexibility. Any weak memory model should be able
to give sequential consistency via using the right membars in the right
places.
The speed difference is mostly that, in a weak model, the L1 cache
merely needs to fetch memory from the L2 or similar, may write to it whenever, and need not proactively store back results.
As I understand it, a typical TSO like model will require, say:
Any L1 cache that wants to write to a cache line, needs to explicitly
request write ownership over that cache line;
Any attempt by other cores to access this line may require the L2 cache
to send a message to the core currently holding the cache line for
writing to write back its contents, with the request unable to be
handled until after the second core has written back the dirty cache
line.
This would create potential for significantly more latency in cases
where multiple cores touch the same part of memory; albeit the cores
will see each others' memory stores.
So, initially, weak model can be faster due to not needing any
additional handling.
But... Any synchronization points, such as a barrier or locking or
releasing a mutex, will require manually flushing the cache with a weak model.
And, locking/releasing the mutex itself will require a mechanism
that is consistent between cores (such as volatile atomic swaps or
similar, which may still be weak as a volatile-atomic-swap would still
not be atomic from the POV of the L2 cache; and an MMIO interface could
be stronger here).
Seems like there could possibly be some way to skip some of the cache flushing if one could verify that a mutex is only being locked and
unlocked on a single core.
Issue then is how to deal with trying to lock a mutex which has thus far
been exclusive to a single core. One would need some way for the core
that last held the mutex to know that it needs to perform an L1 cache
flush.
Though, one possibility could be to leave this part to the OS scheduler/syscall/...
mechanism; so the core that wants to lock the
mutex signals its intention to do so via the OS, and the next time the
core that last held the mutex does a syscall (or tries to lock the mutex again), the handler sees this, then performs the L1 flush and flags the
mutex as multi-core safe (at which point, the parties will flush L1s at
each mutex lock, though possibly with a timeout count so that, if the
mutex has been single-core for N locks, it reverts to single-core
behavior).
This could reduce the overhead of "frivolous mutex locking" in programs
that are otherwise single-threaded or single processor (leaving the
cache flushes for the ones that are in-fact being used for
synchronization purposes).
....
On 11/15/2024 9:27 AM, Anton Ertl wrote:
jseigh <jseigh_es00@xemaps.com> writes:
Anybody doing that sort of programming, i.e. lock-free or distributed
algorithms, who can't handle weakly consistent memory models, shouldn't
be doing that sort of programming in the first place.
Strongly consistent memory won't help incompetence.
Strong words to hide lack of arguments?
For instance, a 100% sequential memory order won't help you with, say,
solving ABA.
The tradeoff is more about implementation cost, performance, etc.
Weak model:
Cheaper (and simpler) to implement;
Performs better when there is no need to synchronize memory;
Performs worse when there is need to synchronize memory;
jseigh <jseigh_es00@xemaps.com> writes:
Anybody doing that sort of programming, i.e. lock-free or distributed
algorithms, who can't handle weakly consistent memory models, shouldn't
be doing that sort of programming in the first place.
Do you have any argument that supports this claim.
Fwiw, in C++ std::memory_order_consume is useful for traversing a node
based stack of something in RCU. In most systems it only acts like a
compiler barrier. On the Alpha, it must emit a membar instruction. Iirc,
mb for alpha? Cannot remember that one right now.
Even if the hardware
memory model is strongly ordered, compilers can reorder stuff,
so you still have to program as if a weak memory model was in
effect.
Or maybe disable reordering or optimization altogether
for those target architectures.
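A sketch of why (hypothetical code): with plain, non-atomic variables the compiler may reorder or cache accesses regardless of how strongly ordered the hardware is, so the flag handoff below is broken even on a TSO machine.

  int  data  = 0;
  bool ready = false;              // plain bool, not std::atomic<bool>

  void producer() {
      data  = 42;
      ready = true;                // compiler may reorder these two plain stores
  }

  void consumer() {
      while (!ready) { }           // compiler may hoist the load and spin forever
      // ... use data: may observe 0 even on TSO hardware
  }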
On 11/15/2024 11:37 PM, Anton Ertl wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
On 11/15/2024 9:27 AM, Anton Ertl wrote:
jseigh <jseigh_es00@xemaps.com> writes:
Anybody doing that sort of programming, i.e. lock-free or distributed
algorithms, who can't handle weakly consistent memory models, shouldn't
be doing that sort of programming in the first place.
Strongly consistent memory won't help incompetence.
Strong words to hide lack of arguments?
For instance, a 100% sequential memory order won't help you with, say,
solving ABA.
Sure, not all problems are solved by sequential consistency, and yes,
it won't solve race conditions like the ABA problem. But jseigh
implied that anyone who finds it easier to write correct and efficient code for
sequential consistency than for a weakly-consistent memory model
(e.g., Alpha's memory model) is incompetent.
What if you had to write code for a weakly ordered system, and the
performance guidelines said to only use a membar when you absolutely
have to? If you say something akin to "I do everything using
std::memory_order_seq_cst", well, that is a violation right off the bat.
Fair enough?
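As an illustration (hypothetical names), a one-way flag handoff needs only release/acquire; writing it with seq_cst everywhere is still correct, but on a weakly ordered machine it emits stronger, slower barriers for no benefit here.

  #include <atomic>

  std::atomic<bool> ready{false};
  int data;

  void producer() {
      data = 42;
      ready.store(true, std::memory_order_release);   // release is enough to publish 'data'
  }

  int consumer() {
      while (!ready.load(std::memory_order_acquire))  // pairs with the release store
          ;
      return data;                                    // guaranteed to read 42
  }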
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:I am trying to say you might not be hired if you only knew how to handle >std::memory_order_seq_cst wrt C++... ?
What if you had to write code for a weakly ordered system, and the
performance guidelines said to only use a membar when you absolutely
have to. If you say something akin to "I do everything using
std::memory_order_seq_cst", well, that is a violation right off the bat. ...
aph@littlepinkcloud.invalid writes:
Yes. That Alpha behaviour was a historic error. No one wants to do
that again.
Was it an actual behaviour of any Alpha for public sale, or was it
just the Alpha specification?
On 11/16/24 16:21, Chris M. Thomasson wrote:
Fwiw, in C++ std::memory_order_consume is useful for traversing a node
based stack of something in RCU. In most systems it only acts like a
compiler barrier. On the Alpha, it must emit a membar instruction. Iirc,
mb for alpha? Cannot remember that one right now.
That got deprecated. Too hard for compilers to deal with. It's now
the same as memory_order_acquire.
Which brings up an interesting point. Even if the hardware
memory model is strongly ordered, compilers can reorder stuff,
so you still have to program as if a weak memory model was in
effect.
On 11/18/2024 3:34 PM, Chris M. Thomasson wrote:
Don't tell me you want all of std::memory_order_* to default to
std::memory_order_seq_cst? If you're on a system that only has seq_cst and
nothing else, okay, but not on other weaker (memory order) systems,
right?
Defaulting a relaxed to a seq_cst is a bit much.... ;^o
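A typical case (illustrative) where relaxed really is the right order and promoting it to seq_cst is pure overhead: a statistics counter whose value orders nothing else.

  #include <atomic>

  std::atomic<long> hits{0};

  void on_event() {
      hits.fetch_add(1, std::memory_order_relaxed);  // only atomicity is needed, no ordering
  }

  long snapshot() {
      return hits.load(std::memory_order_relaxed);   // an approximate reading is fine
  }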
On 11/18/2024 3:20 PM, Chris M. Thomasson wrote:...
On 11/17/2024 11:11 PM, Anton Ertl wrote:
The flaw in the reasoning of the paper was:
|To solve it more easily without floating–point von Neumann had
|transformed equation Bx = c to B^TBx = B^Tc , thus unnecessarily
|doubling the number of sig. bits lost to ill-condition
This is an example of how the supposed gains that the harder-to-use
interface provides (in this case the bits "wasted" on the exponent)
are overcompensated by then having to use a software workaround for
the harder-to-use interface.
Don't tell me you want all of std::memory_order_* to default to
std::memory_order_seq_cst? If you're on a system that only has seq_cst and
nothing else, okay, but not on other weaker (memory order) systems, right?
On 11/17/2024 7:17 AM, Anton Ertl wrote:
jseigh <jseigh_es00@xemaps.com> writes:
Or maybe disable reordering or optimization altogether
for those target architectures.
So you want to throw out the baby with the bathwater.
No, keep the weak order systems and not throw them out wrt a system that
is 100% seq_cst? Perhaps? What am I missing here?
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
On 11/18/2024 3:20 PM, Chris M. Thomasson wrote:...
On 11/17/2024 11:11 PM, Anton Ertl wrote:
The flaw in the reasoning of the paper was:
|To solve it more easily without floating–point von Neumann had
|transformed equation Bx = c to B^TBx = B^Tc , thus unnecessarily
|doubling the number of sig. bits lost to ill-condition
This is an example of how the supposed gains that the
harder-to-use interface provides (in this case the bits "wasted"
on the exponent) are overcompensated by then having to use a
software workaround for the harder-to-use interface.
Don't tell me you want all of std::memory_order_* to default to
std::memory_order_seq_cst? If you're on a system that only has seq_cst
and nothing else, okay, but not on other weaker (memory order)
systems, right?
I tell anyone who wants to read it to stop buying hardware without FP
for non-integer work, and with weak memory ordering for work that
needs concurrent programming. There are enough affordable offerings
with FP and TSO that we do not need to waste programming time and
increase the frequency of hard-to-find bugs by figuring out how to get
good performance out of hardware without FP hardware and with weak
memory ordering.
Those who enjoy the challenge of dealing with the unnecessary problems
of sub-par hardware can continue to enjoy that.
But when developing production software, as a manager don't let
programmers with this hobby horse influence your hardware and
development decisions. Give full support for FP and TSO hardware, and limited support to weakly-ordered hardware. That limited support may
consist of using software implementations of FP (instead of designing software for fixed point arithmetic). In case of hardware with weak
ordering the limited support could be to use memory barriers liberally (without trying to minimize them at all; every memory barrier
elimination costs development time and increases the potential for hard-to-find bugs), of using OS mechanisms for concurrency (rather
than, e.g., lock-free algorithms), or maybe even only supporting single-threaded operation.
Efficiently-implemented sequentially-consistent hardware would be even
more preferable, and if it was widely available, I would recommend
buying that over TSO hardware, but unfortunately we are not there yet.
- anton
BTW, does your stance mean that you are strongly against the A64FX?
Lockless programming is horrendously complicated and error prone.
Sequential consistency removes only a small part of the potential
complications.
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
On 11/17/2024 7:17 AM, Anton Ertl wrote:
jseigh <jseigh_es00@xemaps.com> writes:
Or maybe disable reordering or optimization altogether
for those target architectures.
So you want to throw out the baby with the bathwater.
No, keep the weak order systems and not throw them out wrt a system that
is 100% seq_cst? Perhaps? What am I missing here?
Disabling optimization altogether costs a lot; e.g., look at <http://www.complang.tuwien.ac.at/anton/bentley.pdf>: if you compare
the lines for clang-3.5 -O0 with clang-3.5 -O3, you see a factor >2.5
for the tsp9 program. For gcc-5.2.0 the difference is even bigger.
That's why jseigh and people like him (I have read that suggestion
several times before) love to suggest disabling optimization
altogether. It's a straw man that does not even need beating up. Of
course they usually don't show results for the supposed benefits of
the particular "optimization" they advocate (or the drawbacks of
disabling it), and jseigh follows this pattern nicely.
The compiler is allowed to reorder code as long as it knows the
reordering can't be observed or detected.
If there are places
in the code it doesn't know this can't happen it won't optimize
across it, more or less.
If there are places
in the code it doesn't know this can't happen it won't optimize
across it, more or less.
The problem is HOW to TELL the COMPILER that these memory references
are "more special" than normal--when languages give few mechanisms.
If there are places
in the code it doesn't know this can't happen it won't optimize
across it, more or less.
The problem is HOW to TELL the COMPILER that these memory references
are "more special" than normal--when languages give few mechanisms.
We could start with something like
critical_region {
...
}
such that the compiler must refrain from any code motion within
those sections but is free to move things outside of those sections as
if execution was singlethreaded.
Stefan
You identify a second problem. Is it that you don't want code motion
across the boundary or you do not want code motion within the boundary??
If there are places
in the code it doesn't know this can't happen it won't optimize
across it, more or less.
The problem is HOW to TELL the COMPILER that these memory references
are "more special" than normal--when languages give few mechanisms.
We could start with something like
critical_region {
...
}
such that the compiler must refrain from any code motion within
those sections but is free to move things outside of those sections as if execution was singlethreaded.
On Tue, 3 Dec 2024 13:59:18 +0000, jseigh wrote:
The compiler is allowed to reorder code as long as it knows the
reordering can't be observed or detected.
With exceptions enabled, this would allow for almost no code
movement at all.
If there are places
in the code it doesn't know this can't happen it won't optimize
across it, more or less.
The problem is HOW to TELL the COMPILER that these memory references
are "more special" than normal--when languages give few mechanisms.
You identify a second problem. Is it that you don't want code motion
across the boundary or you do not want code motion within the boundary??
Concurrency is hard. 🙂
Stefan
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 3 Dec 2024 13:59:18 +0000, jseigh wrote:
The compiler is allowed to reorder code as long as it knows the
reordering can't be observed or detected.
With exceptions enabled, this would allow for almost no code
movement at all.
If there are places
in the code it doesn't know this can't happen it won't optimize
across it, more or less.
The problem is HOW to TELL the COMPILER that these memory references
are "more special" than normal--when languages give few mechanisms.
C and C++ have the 'volatile' keyword for this purpose.
On Wed, 4 Dec 2024 16:37:41 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 3 Dec 2024 13:59:18 +0000, jseigh wrote:
The compiler is allowed to reorder code as long as it knows the
reordering can't be observed or detected.
With exceptions enabled, this would allow for almost no code
movement at all.
If there are places
in the code it doesn't know this can't happen it won't optimize
across it, more or less.
The problem is HOW to TELL the COMPILER that these memory references
are "more special" than normal--when languages give few mechanisms.
C and C++ have the 'volatile' keyword for this purpose.
What if you want the volatile attribute only to hold
on an inner block::
{
    int i = ...;
    ...                // i is not volatile here
    {
        ...            // i is volatile in here
    }
    ...                // i is not volatile here
    ...
}
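In standard C and C++ there is no block-scoped volatility; the usual workaround (a sketch, assuming it is acceptable to access the plain variable through a volatile-qualified lvalue) is to do the inner-block accesses through a volatile pointer or reference:

  void f(void) {
      int i = 0;
      i += 1;                      // ordinary access: may be cached, reordered, or folded
      {
          volatile int *vi = &i;   // inner block: access only through a volatile lvalue
          *vi += 1;                // each such read and write must be performed as written
      }
      i += 1;                      // ordinary access again
  }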
On 12/4/2024 8:13 AM, jseigh wrote:
On 12/3/24 18:37, Stefan Monnier wrote:
                                          If there are places
in the code it doesn't know this can't happen it won't optimize
across it, more or less.
The problem is HOW to TELL the COMPILER that these memory references
are "more special" than normal--when languages give few mechanisms.
We could start with something like
    critical_region {
      ...
    }
such that the compiler must refrain from any code motion within
those sections but is free to move things outside of those sections
as if
execution was singlethreaded.
C/C++11 already defines what lock acquire/release semantics are.
Roughly you can move stuff outside of a critical section into it
but not vice versa.
Java uses synchronized blocks to denote the critical section.
C++ (the society for using RAII for everything) has scoped_lock
if you want to use RAII for your critical section. It's not
always obvious what the actual critical section is. I usually
use it inside its own bracket section to make it more obvious.
  { std::scoped_lock m(mutex);
    // .. critical section
  }
I'm not a big fan of c/c++ using acquire and release memory order
directives on everything since apart from a few situations it's
not intuitively obvious what they do in all cases. You can
look a compiler assembler output but you have to be real careful
generalizing from what you see.
The release on the unlock can allow some following stores and things to
sort of "bubble up before it?
Acquire and release confines things to the "critical section", the
release can allow for some following things to go above it, so to speak.
This is making me think of Alex over on c.p.t. !
:^)
Did I miss anything? Sorry Joe.
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 3 Dec 2024 13:59:18 +0000, jseigh wrote:
The compiler is allowed to reorder code as long as it knows the
reordering can't be observed or detected.
With exceptions enabled, this would allow for almost no code
movement at all.
If there are places
in the code it doesn't know this can't happen it won't optimize
across it, more or less.
The problem is HOW to TELL the COMPILER that these memory references
are "more special" than normal--when languages give few mechanisms.
C and C++ have the 'volatile' keyword for this purpose.
scott@slp53.sl.home (Scott Lurndal) writes:
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 3 Dec 2024 13:59:18 +0000, jseigh wrote:
The compiler is allowed to reorder code as long as it knows the
reordering can't be observed or detected.
With exceptions enabled, this would allow for almost no code
movement at all.
If there are places
in the code it doesn't know this can't happen it won't optimize
across it, more or less.
The problem is HOW to TELL the COMPILER that these memory references
are "more special" than normal--when languages give few mechanisms.
C and C++ have the 'volatile' keyword for this purpose.
A problem with using volatile is that volatile doesn't do what
most people think it does, especially with respect to what
reordering is or is not allowed.
Tim, did you send me a PM to check my email? I responded but then
silence. Could someone be pretending to be you?
On 12/5/2024 5:00 AM, jseigh wrote:
Maybe. For thread local non-shared data if the compiler can make that
determination but I don't know if the actual specs say that.
It would be strange to me if the compiler executed a weaker barrier than
what I said needed to be there. If I say I need a #LoadStore |
#StoreStore here, then the compiler better put that barrier in there.
Humm...
C++ doesn't use #LoadStore, etc... memory ordering terminology. They
use acquire, release, cst, relaxed, ... While in some cases it's straightforward as to what that means, in others it's less obvious.
Non-obvious isn't exactly what you want when writing multi-threaded
code. There's enough subtlety as it is.
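One concrete illustration of that vocabulary gap (a sketch with made-up names): the same publication can be written with the ordering attached to the store, or, closer to the SPARC style, with relaxed atomics plus an explicit fence corresponding roughly to membar #LoadStore | #StoreStore.

  #include <atomic>

  std::atomic<int> flag{0};
  int payload;

  void publish_release_style() {
      payload = 42;
      flag.store(1, std::memory_order_release);             // ordering attached to the store itself
  }

  void publish_fence_style() {                               // "SPARC way": relaxed ops + explicit barrier
      payload = 42;
      std::atomic_thread_fence(std::memory_order_release);   // ~ membar #LoadStore | #StoreStore
      flag.store(1, std::memory_order_relaxed);
  }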
On 12/17/2024 4:33 AM, jseigh wrote:
On 12/16/24 16:48, Chris M. Thomasson wrote:
On 12/5/2024 5:00 AM, jseigh wrote:
Maybe. For thread local non-shared data if the compiler can make that
determination but I don't know if the actual specs say that.
It would be strange to me if the compiler executed a weaker barrier
than what I said needed to be there. If I say I need a #LoadStore |
#StoreStore here, then the compiler better put that barrier in there.
Humm...
C++ concurrency was designed by a committee. They try to fit things
into their world view even if reality is a bit more nuanced or complex
than that world view.
Indeed.
C++ doesn't use #LoadStore, etc... memory ordering terminology. They
use acquire, release, cst, relaxed, ... While in some cases it's
straightforward as to what that means, in others it's less obvious.
Non-obvious isn't exactly what you want when writing multi-threaded
code. There's enough subtlety as it is.
Agreed. Humm... The CAS is interesting to me.
atomic_compare_exchange_weak
atomic_compare_exchange_strong
The weak one can fail spuriously... Akin to LL/SC in a sense?
atomic_compare_exchange_weak_explicit
atomic_compare_exchange_strong_explicit
A membar for the success path and one for the failure path. Oh that's
fun. Sometimes I think its better to use relaxed for all of the atomics
and use explicit barriers ala atomic_thread_fence for the order. Well,
that is more in line with the SPARC way of doing things... ;^)
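A small sketch of the shape being described (illustrative): compare_exchange_weak may fail spuriously, much like losing an LL/SC reservation, so it belongs in a retry loop, and the _explicit form takes one memory order for the success path and another for the failure path.

  #include <atomic>

  std::atomic<int> v{0};

  void add_one() {
      int expected = v.load(std::memory_order_relaxed);
      while (!v.compare_exchange_weak(expected, expected + 1,
                                      std::memory_order_release,     // order applied if the CAS succeeds
                                      std::memory_order_relaxed)) {  // order applied if it fails
          // spurious or real failure: 'expected' now holds the current value, retry
      }
  }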
On 12/19/2024 10:33 AM, MitchAlsup1 wrote:
On Thu, 5 Dec 2024 7:44:19 +0000, Chris M. Thomasson wrote:
On 12/4/2024 8:13 AM, jseigh wrote:
On 12/3/24 18:37, Stefan Monnier wrote:
If there are places
in the code it doesn't know this can't happen it won't optimize
across it, more or less.
The problem is HOW to TELL the COMPILER that these memory references
are "more special" than normal--when languages give few mechanisms.
We could start with something like
    critical_region {
      ...
    }
such that the compiler must refrain from any code motion within
those sections but is free to move things outside of those sections as
if
execution was singlethreaded.
C/C++11 already defines what lock acquire/release semantics are.
Roughly you can move stuff outside of a critical section into it
but not vice versa.
Java uses synchronized blocks to denote the critical section.
C++ (the society for using RAII for everything) has scoped_lock
if you want to use RAII for your critical section. It's not
always obvious what the actual critical section is. I usually
use it inside its own bracket section to make it more obvious.
  { std::scoped_lock m(mutex);
    // .. critical section
  }
I'm not a big fan of c/c++ using acquire and release memory order
directives on everything since apart from a few situations it's
not intuitively obvious what they do in all cases. You can
look a compiler assembler output but you have to be real careful
generalizing from what you see.
The release on the unlock can allow some following stores and things to
sort of "bubble up before it?
Acquire and release confines things to the "critical section", the
release can allow for some following things to go above it, so to speak.
This is making me think of Alex over on c.p.t. !
This sounds dangerous: if the thing allowed to go above it is unCacheable
while the lock:release is cacheable, the cacheable lock can arrive at
another core before the unCacheable store arrives at its destination.
Humm... Need to ponder on that. Wrt the sparc:
membar #LoadStore | #StoreStore
can allow following stores to bubble up before it. If we want to block
that then we would use a #StoreLoad. However, a #StoreLoad is not
required for unlocking a mutex.
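A sketch of that unlock path (illustrative spinlock, not from the posts): the unlocking store only has to keep the critical section's loads and stores from sinking below it, which is #LoadStore | #StoreStore on SPARC and exactly what a release store provides; no #StoreLoad full barrier is required, so later accesses are indeed free to bubble up above it.

  #include <atomic>

  std::atomic<int> lock_word{0};   // 0 = free, 1 = held

  void unlock() {
      // ~ "membar #LoadStore | #StoreStore ; st %g0, [lock]" on SPARC RMO:
      // critical-section accesses cannot sink below this store, but
      // accesses after it are still free to move above it.
      lock_word.store(0, std::memory_order_release);
  }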