• Memory ordering (was: Arguments for a sane ISA 6-years later)

    From Anton Ertl@21:1/5 to mitchalsup@aol.com on Mon Jul 29 13:21:10 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 26 Jul 2024 17:00:07 +0000, Anton Ertl wrote:
    Similarly, I expect that hardware that is designed for good TSO or
    sequential consistency performance will run faster on code written for
    this model than code written for weakly consistent hardware will run
    on that hardware.

    According to Lamport; only the ATOMIC stuff needs sequential
    consistency.
    So, it is completely possible to have a causally consistent processor
    that switches to sequential consistency when doing ATOMIC stuff and gain performance when not doing ATOMIC stuff, and gain programmability when
    doing atomic stuff.

    That's not what I have in mind. What I have in mind is hardware that,
    e.g., speculatively performs loads, predicting that no other core will
    store there with an earlier time stamp. But if another core actually
    performs such a store, the usual misprediction handling happens and
    the code starting from that mispredicted load is reexecuted. So as
    long as two cores do not access the same memory, they can run at full
    speed, and there is only slowdown if there is actual (not potential) communication between the cores.

    A problem with that approach is that this requires enough reorder
    buffering (or something equivalent, there may be something cheaper for
    this particular problem) to cover at least the shared-cache latency
    (usually L3, more with multiple sockets).

    That's because software written for weakly
    consistent hardware often has to insert barriers or atomic operations
    just in case, and these operations are slow on hardware optimized for
    weak consistency.

    The operations themselves are not slow.

    Citation needed.

    By contrast, one can design hardware for strong ordering such that the
    slowness occurs only in those cases when actual (not potential)
    communication between the cores happens, i.e., much less frequently.

    How would you do this for a 256-way banked memory system of the
    NEC SX ?? I.E., the processor is not in charge of memory order--
    the memory system is.

    Memory consistency is defined wrt what several processors do. Some
    processor performs some reads and writes and another performs some
    read and writes, and memory consistency defines what a processor sees
    about what the other does, and what ends up in main memory. But as
    long as the processors, their caches, and their interconnect gets the
    memory ordering right, the main memory is just the backing store that eventually gets a consistent result of what the other components did.
    So it does not matter whether the main memory has one bank or 256.

    One interesting aspect is that for supercomputers I generally think
    that they have not yet been struck by the software crisis:
    Supercomputer hardware is more expensive than supercomputer software.
    So I expect that supercomputer hardware designers tend to throw
    complexity over the wall to the software people, and in many cases
    they do (the Cell broadband engine offers many examples of that).
    However, "some ... Fujitsu [ARM] CPUs run with TSO at all times" <https://lwn.net/Articles/970907/>; that sounds like the A64FX, a
    processor designed for supercomputing. So apparently in this case the
    hardware designers actually accepted the hardware and design
    complexity cost of TSO and gave a better model to software, even in
    hardware designed for a supercomputer.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to Chris M. Thomasson on Mon Jul 29 14:10:26 2024
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 7/26/2024 10:00 AM, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    [...]
    By contrast, one can design hardware for strong ordering such that the
    slowness occurs only in those cases when actual (not potential)
    communication between the cores happens, i.e., much less frequently.

    and sometimes use cases do not care if they encounter "stale" data.

    Great. Unless these "sometimes" cases are more often than the cases
    where you perform some atomic operation or barrier because of
    potential, but not actual communication between cores, the weak model
    is still slower than a well-implemented strong model.

    A strong model? You mean I don't have to use any memory barriers at all?

    In the paragraph above I had a weaker model and a stronger model in
    mind, where the stronger model needs fewer barriers and atomic
    operations than the weak model.

    But sure, sequential consistency does not require any memory barriers,
    so that would be the ultimate strong model, and, if implemented
    efficiently, would beat the weaker TSO, just as an efficiently
    implemented TSO beats weaker models.

    Tell that to SPARC in RMO mode...

    What about it? If you mean that SPARC in TSO mode is slower, that may
    be the case. That's why I specified "well-implemented" above.

    Even the x86 requires a
    membar when a store followed by a load to another location shall be
    respected wrt order.

    So "the x86" (whatever you mean by that) is not sequentially
    consistent.
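    A minimal C++11 sketch of the store-buffering case in question
    (illustrative only; the variable names and harness are invented, not
    from the thread). With sequentially consistent atomics the outcome
    r1 == 0 && r2 == 0 cannot occur; on x86 the compiler has to emit an
    MFENCE or a locked operation to exclude it, because a plain store
    followed by a plain load to another location may be reordered:

        // Store-buffering litmus test (sketch).
        #include <atomic>
        #include <cstdio>
        #include <thread>

        std::atomic<int> x{0}, y{0};
        int r1, r2;

        void t1() {
            x.store(1, std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_seq_cst);
        }

        void t2() {
            y.store(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);
        }

        int main() {
            std::thread a(t1), b(t2);
            a.join(); b.join();
            std::printf("r1=%d r2=%d\n", r1, r2);  // r1==0 && r2==0 excluded by seq_cst
        }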

    I
    thought it was easier for a HW guy to implement weak consistency? At the
    cost of the increased complexity wrt programming the sucker! ;^)

    Yes, that's why the hardware people love to give us weak consistency.
    It's our job to say no and tell them to finish the job.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From MitchAlsup1@21:1/5 to Anton Ertl on Mon Jul 29 17:49:19 2024
    On Mon, 29 Jul 2024 13:21:10 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 26 Jul 2024 17:00:07 +0000, Anton Ertl wrote:
    Similarly, I expect that hardware that is designed for good TSO or
    sequential consistency performance will run faster on code written for
    this model than code written for weakly consistent hardware will run
    on that hardware.

    According to Lamport; only the ATOMIC stuff needs sequential
    consistency.
    So, it is completely possible to have a causally consistent processor
    that switches to sequential consistency when doing ATOMIC stuff and gain performance when not doing ATOMIC stuff, and gain programmability when doing atomic stuff.

    That's not what I have in mind. What I have in mind is hardware that,
    e.g., speculatively performs loads, predicting that no other core will
    store there with an earlier time stamp. But if another core actually performs such a store, the usual misprediction handling happens and
    the code starting from that mispredicted load is reexecuted. So as
    long as two cores do not access the same memory, they can run at full
    speed, and there is only slowdown if there is actual (not potential) communication between the cores.

    OK...

    A problem with that approach is that this requires enough reorder
    buffering (or something equivalent, there may be something cheaper for
    this particular problem) to cover at least the shared-cache latency
    (usually L3, more with multiple sockets).

    The depth of the execution window may be smaller than the time it takes
    to send the required information around and have this core recognize
    that it is out-of-order wrt memory.

    That's because software written for weakly
    consistent hardware often has to insert barriers or atomic operations
    just in case, and these operations are slow on hardware optimized for
    weak consistency.

    The operations themselves are not slow.

    Citation needed.

    A MEMBAR dropped into the pipeline, when nothing is speculative, takes
    no more time than an integer ADD. Only when there is speculation does
    it have to take time to relax the speculation.

    By contrast, one can design hardware for strong ordering such that the
    slowness occurs only in those cases when actual (not potential)
    communication between the cores happens, i.e., much less frequently.

    How would you do this for a 256-way banked memory system of the
    NEC SX ?? I.E., the processor is not in charge of memory order--
    the memory system is.

    Memory consistency is defined wrt what several processors do. Some
    processor performs some reads and writes and another performs some
    read and writes, and memory consistency defines what a processor sees
    about what the other does, and what ends up in main memory. But as
    long as the processors, their caches, and their interconnect gets the
    memory ordering right, the main memory is just the backing store that eventually gets a consistent result of what the other components did.
    So it does not matter whether the main memory has one bank or 256.

    NEC SX is a multi-processor vector machine with the property that
    addresses are spewed out as fast as AGEN can perform. These addresses
    are routed to banks based on bus-segment and can arrive OoO wrt
    how they were spewed out.

    So two processors accessing the same memory using vector LDs will
    see a single vector having multiple memory orderings. P[0]V[0] ordered
    before P[1]V[0] but P[1]V[1] ordered before P[0]V[1], ...

    One interesting aspect is that for supercomputers I generally think
    that they have not yet been struck by the software crisis:
    Supercomputer hardware is more expensive than supercomputer software.
    So I expect that supercomputer hardware designers tend to throw
    complexity over the wall to the software people, and in many cases
    they do (the Cell broadband engine offers many examples of that).
    However, "some ... Fujitsu [ARM] CPUs run with TSO at all times" <https://lwn.net/Articles/970907/>; that sounds like the A64FX, a
    processor designed for supercomputing. So apparently in this case the hardware designers actually accepted the hardware and design
    complexity cost of TSO and gave a better model to software, even in
    hardware designed for a supercomputer.

    That may be true, but I always saw it as "Supercomputers run
    applications for which they have source code"

    - anton

  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Mon Jul 29 20:43:47 2024
    On Mon, 29 Jul 2024 20:08:10 +0000, Chris M. Thomasson wrote:

    On 7/29/2024 10:49 AM, MitchAlsup1 wrote:
    [...]

    A MEMBAR dropped into the pipeline, when nothing is speculative, takes
    no more time than an integer ADD. Only when there is speculation does
    it have to take time to relax the speculation.

    I am wondering if you were ever aware of the "danger zone" wrt putting a MEMBAR instruction in a memory delay slot over on the SPARC? The docs explicitly said not to do it. I guess it can cause some interesting
    memory order "issues" that might allow a running system to last for say,
    five years before crashing at a random time... Yikes!

    Which, btw, is why one should exhaustively test atomic codes before
    putting them into production. You do not want to chase down a memory ordering
    issue that occurs, in production, less than once a month.
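    A minimal C++11 sketch of the kind of brute-force test meant here
    (illustrative only; the CAS-increment is an invented stand-in for the
    atomic code under test, and a real harness would also vary thread
    counts and run on the weakest-ordered machine available):

        // Stress test for a small atomic code path (sketch).
        #include <atomic>
        #include <cassert>
        #include <thread>
        #include <vector>

        std::atomic<long> counter{0};

        void hammer(int iters) {
            for (int i = 0; i < iters; i++) {
                long old = counter.load(std::memory_order_relaxed);
                // The code path being exercised: a CAS-increment loop.
                while (!counter.compare_exchange_weak(old, old + 1,
                                                      std::memory_order_relaxed))
                    ;  // compare_exchange_weak reloads 'old' on failure
            }
        }

        int main() {
            const int nthreads = 8, iters = 1000000;
            std::vector<std::thread> ts;
            for (int t = 0; t < nthreads; t++) ts.emplace_back(hammer, iters);
            for (auto &t : ts) t.join();
            assert(counter.load() == (long)nthreads * iters);  // invariant check
        }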

    I did a lot of programming on SPARCs and never needed a MEMBAR....
    but the OS guys used them like they were "free".

    All sorts of things were dangerous in the delay slot of a branch,
    not the least of which was another branch.

    Prior to the introduction of multi-threaded cores, the rounding
    modes of the FPU status register were hard wired to the FU.
    Afterwards, they became "just another piece of state" that got
    pipelined down instruction execution.

    [...]

  • From Michael S@21:1/5 to mitchalsup@aol.com on Tue Jul 30 12:47:42 2024
    On Mon, 29 Jul 2024 20:43:47 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On Mon, 29 Jul 2024 20:08:10 +0000, Chris M. Thomasson wrote:

    On 7/29/2024 10:49 AM, MitchAlsup1 wrote:
    [...]

    A MEMBAR dropped into the pipeline, when nothing is speculative,
    takes no more time than an integer ADD. Only when there is
    speculation does it have to take time to relax the speculation.

    I am wondering if you were ever aware of the "danger zone" wrt
    putting a MEMBAR instruction in a memory delay slot over on the
    SPARC? The docs explicitly said not to do it. I guess it can cause
    some interesting memory order "issues" that might allow a running
    system to last for say, five years before crashing at a random
    time... Yikes!

    Which, btw, is why one should exhaustively test atomic codes before
    putting them into production. You do not want to chase down a memory ordering
    issue that occurs, in production, less than once a month.

    I did a lot of programming on SPARCs and never needed a MEMBAR....
    but the OS guys used them like they were "free".


    Did you do programming on SPARC in RMO mode?
    According to my understanding, RMO mode existed only on a couple of
    SPARC CPU models (UltraSparc and UltraSparc-II) and was activated only
    by OSes other than Solaris. Possibly, only by Linux.

    All sorts of things were dangerous in the delay slot of a branch,
    not the least of which was another branch.


    Chris M. Thomasson wrote "memory delay slot".
    I wonder what's meant by that. According to my understanding, on SPARC,
    unlike on early MIPS, the load delay slot was always merely an
    implementation detail rather than a SW-visible architectural feature.

    Prior to the introduction of multi-threaded cores, the rounding
    modes of the FPU status register were hard wired to the FU.
    Afterwards, they became "just another piece of state" that got
    pipelined down instruction execution.

    [...]

  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Tue Jul 30 09:51:46 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 29 Jul 2024 13:21:10 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 26 Jul 2024 17:00:07 +0000, Anton Ertl wrote:
    Similarly, I expect that hardware that is designed for good TSO or
    sequential consistency performance will run faster on code written for this model than code written for weakly consistent hardware will run
    on that hardware.

    According to Lamport; only the ATOMIC stuff needs sequential
    consistency.
    So, it is completely possible to have a causally consistent processor that switches to sequential consistency when doing ATOMIC stuff and gain performance when not doing ATOMIC stuff, and gain programmability when doing atomic stuff.

    That's not what I have in mind. What I have in mind is hardware that,
    e.g., speculatively performs loads, predicting that no other core will
    store there with an earlier time stamp. But if another core actually
    performs such a store, the usual misprediction handling happens and
    the code starting from that mispredicted load is reexecuted. So as
    long as two cores do not access the same memory, they can run at full
    speed, and there is only slowdown if there is actual (not potential)
    communication between the cores.

    OK...

    A problem with that approach is that this requires enough reorder
    buffering (or something equivalent, there may be something cheaper for
    this particular problem) to cover at least the shared-cache latency
    (usually L3, more with multiple sockets).

    The depth of the execution window may be smaller than the time it takes
    to send the required information around and have this core recognize
    that it is out-of-order wrt memory.

    So if we don't want to stall for memory accesses all the time, we need
    a bigger execution window, either by making the reorder buffer larger,
    or by using a different, cheaper mechanism.

    Concerning the cheaper mechanism, what I am thinking of is hardware checkpointing every, say, 200 cycles or so (subject to fine-tuning).
    The idea here is that communication between cores is very rare, so
    rolling back more cycles than the minimal necessary amount costs
    little on average (except that it looks bad on cache ping-pong microbenchmarks). The cost of such a checkpoint is (at most) the
    number of architectural registers, plus the aggregated stores between
    the checkpoint and the next one. Once the global time reaches the
    timestamp of the checkpoint N+1 of the core, checkpoint N of the core
    can be released (i.e. all its instructions committed) and all its
    stores can be committed (and checked against speculative loads in other
    cores). If it turns out that an uncommitted load's result has been
    changed by a store committed by another core, a rollback to the latest checkpoint before the load happens, and the program is re-executed
    starting from that checkpoint.
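    A rough sketch of what such a per-core checkpoint would have to
    record, following the description above (C++ used only as a modelling
    notation; the type and field names are invented):

        // Model of the checkpoint state described above (sketch).
        #include <cstdint>
        #include <vector>

        constexpr int NUM_ARCH_REGS = 32;        // assumed register count

        struct Store { uint64_t addr, value; };  // an aggregated, not-yet-committed store

        struct Checkpoint {
            uint64_t timestamp;                  // global time when it was taken
            uint64_t arch_regs[NUM_ARCH_REGS];   // architectural register snapshot
            std::vector<Store> pending_stores;   // stores between this checkpoint and the next
        };

        // Checkpoint N is released once global time reaches checkpoint N+1's
        // timestamp: its instructions commit and its pending stores become
        // visible (and are checked against speculative loads in other cores).
        // If a committed remote store changes the result of an uncommitted
        // local load, the core rolls back to the latest checkpoint taken
        // before that load and re-executes from there.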

    Daya et al [daya+14] have already implemented sequential consistency
    in their 36-core research chip, with similar ideas (that inspired my
    statement above) and much more detail (that makes it hard to see the
    grand scheme of things IIRC).

    @InProceedings{daya+14,
    author = {Bhavya K. Daya and Chia-Hsin Owen Chen and Suvinay
    Subramanian and Woo-Cheol Kwon and Sunghyun Park and
    Tushar Krishna and Jim Holt and Anantha
    P. Chandrakasan and Li-Shiuan Peh},
    title = {{SCORPIO}: A 36-Core Research-Chip Demonstrating
    Snoopy Coherence on a Scalable Mesh {NoC} with
    In-Network Ordering},
    crossref = {isca14},
    OPTpages = {},
    url = {http://projects.csail.mit.edu/wiki/pub/LSPgroup/PublicationList/scorpio_isca2014.pdf},
    annote = {The cores on the chip described in this paper access
    their shared memory in a sequentially consistent
    manner; what's more, the chip provides a significant
    speedup in comparison to the distributed directory
    and HyperTransport coherence protocols. The main
    idea is to deal with the ordering separately from
    the data, in a distributed way. The ordering
    messages are relatively small (one bit per core).
    For details see the paper.}
    }

    @Proceedings{isca14,
    title = "$41^\textit{st}$ Annual International Symposium on Computer Architecture",
    booktitle = "$41^\textit{st}$ Annual International Symposium on Computer Architecture",
    year = "2014",
    key = "ISCA 2014",
    }

    The operations themselves are not slow.

    Citation needed.

    A MEMBAR dropped into the pipeline, when nothing is speculative, takes
    no more time than an integer ADD. Only when there is speculation does
    it have to take time to relax the speculation.

    Not sure what kind of speculation you mean here. On in-order cores
    like the non-Fujitsu SPARCs from before about 2010 memory barriers are expensive AFAIK, even though there is essentially no branch
    speculation on in-order cores.

    Of course, if you mean speculation about the order of loads and
    stores, yes, if you don't have such speculation, the memory barriers
    are fast, but then loads are extremely slow.

    Memory consistency is defined wrt what several processors do. Some
    processor performs some reads and writes and another performs some
    read and writes, and memory consistency defines what a processor sees
    about what the other does, and what ends up in main memory. But as
    long as the processors, their caches, and their interconnect gets the
    memory ordering right, the main memory is just the backing store that
    eventually gets a consistent result of what the other components did.
    So it does not matter whether the main memory has one bank or 256.

    NEC SX is a multi-processor vector machine with the property that
    addresses are spewed out as fast as AGEN can perform. These addresses
    are routed to banks based on bus-segment and can arrive OoO wrt
    how they were spewed out.

    So two processors accessing the same memory using vector LDs will
    see a single vector having multiple memory orderings. P[0]V[0] ordered
    before P[1]V[0] but P[1]V[1] ordered before P[0]V[1], ...

    As long as no stores happen, who cares about the order of the loads?
    When stores happen, the loads are ordered wrt these stores (with
    stronger memory orderings giving more guarantees). So the number of
    memory banks does not matter for implementing a strong ordering
    efficiently.

    The thinking about memory banks etc. comes when you approach the
    problem from the other direction: You have some memory subsystem that
    by itself gives you no consistency guarantees whatsoever, and then you
    think about what's the minimum you can do to make it actually useful
    for inter-core communication. And then you write up a paper like

    @TechReport{adve&gharachorloo95,
    author = {Sarita V. Adve and Kourosh Gharachorloo},
    title = {Shared Memory Consistency Models: A Tutorial},
    institution = {Digital Western Research Lab},
    year = {1995},
    type = {WRL Research Report},
    number = {95/7},
    annote = {Gives an overview of architectural features of
    shared-memory computers such as independent memory
    banks and per-CPU caches, and how they make the (for
    programmers) most natural consistency model hard to
    implement, giving examples of programs that can fail
    with weaker consistency models. It then discusses
    several categories of weaker consistency models and
    actual consistency models in these categories, and
    which ``safety net'' (e.g., memory barrier
    instructions) programmers need to use to work around
    the deficiencies of these models. While the authors
    recognize that programmers find it difficult to use
    these safety nets correctly and efficiently, it
    still advocates weaker consistency models, claiming
    that sequential consistency is too inefficient, by
    outlining an inefficient implementation (which is of
    course no proof that no efficient implementation
    exists). Still the paper is a good introduction to
    the issues involved.}
    }

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From MitchAlsup1@21:1/5 to Anton Ertl on Tue Jul 30 23:27:11 2024
    On Tue, 30 Jul 2024 9:51:46 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 29 Jul 2024 13:21:10 +0000, Anton Ertl wrote:
    A problem with that approach is that this requires enough reorder
    buffering (or something equivalent, there may be something cheaper for
    this particular problem) to cover at least the shared-cache latency
    (usually L3, more with multiple sockets).

    The depth of the execution window may be smaller than the time it takes
    to send the required information around and have this core recognize
    that it is out-of-order wrt memory.

    So if we don't want to stall for memory accesses all the time, we need
    a bigger execution window, either by making the reorder buffer larger,
    or by using a different, cheaper mechanism.

    Mc 88120 had a 96-wide execution window, which could be filled up in
    16 cycles (optimistically) and was filled up in 32 cycles (average).
    Given that DRAM is not going to be less than 20 ns and a 5GHz core,
    the execution window is 1/3rd that which would be required to absorb
    a cache miss all the way to DRAM. Add in 12-ish cycles for L2, 10-
    cycles for transporting L2 miss to Memory controller, 5-cycles between
    memory controller and DRAM controller-----and sooner or later it gets
    hard. So nobody tries to make the execution window big enough to do
    that (actually the Mc88120 was designed for ECL bus technology and was only 100 MHz,
    so its puny 16-cycle EW actually was big enough to absorb a cache
    miss all the way to DRAM.....but that is for another day.)
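    (Spelling out the arithmetic behind the 1/3: 20 ns at 5 GHz is 100
    cycles, and a window that fills in about 32 cycles covers roughly a
    third of that; the L2, transport, and DRAM-controller cycles listed
    above push the required depth out even further.)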

    Concerning the cheaper mechanism, what I am thinking of is hardware checkpointing every, say, 200 cycles or so (subject to fine-tuning).
    The idea here is that communication between cores is very rare, so
    rolling back more cycles than the minimal necessary amount costs
    little on average (except that it looks bad on cache ping-pong microbenchmarks).

    You lost me::

    Colloquially, there are 2 uses of the word checkpointing:: a) what
    HW does each time it inserts a branch into the EW, b) what an OS or
    application does to be able to recover from a crash (from any
    mechanism).

    Neither is used to describe interactions between cores.

    <snip>

    The operations themselves are not slow.

    Citation needed.

    A MEMBAR dropped into the pipeline, when nothing is speculative, takes
    no more time than an integer ADD. Only when there is speculation does
    it have to take time to relax the speculation.

    Not sure what kind of speculation you mean here. On in-order cores
    like the non-Fujitsu SPARCs from before about 2010 memory barriers are expensive AFAIK, even though there is essentially no branch
    speculation on in-order cores.

    You dropped 64 instructions into the EW, and AGEN performs 15 address generations in the order permitted by operand arrival. These addresses
    are routed to the L1s to determine who hits and who misses--all OoO.

    Thus, the Addresses are only in Operand order and they touched the
    caches in operand order.

    There could be branch mispredictions in EW also causing many of the
    AGENs to get thrown away after the branch is discovered to be poorly
    predicted.

    And on top of all of this several FP instructions may have raised
    exceptions.

    EW pipelines are generally designed to "sort all this stuff out at
    retirement"; occasionally, memory ordering issues are sorted out
    prior to retirement by replaying OoO memory references.

    Of course, if you mean speculation about the order of loads and
    stores, yes, if you don't have such speculation, the memory barriers
    are fast, but then loads are extremely slow.

    A MEMBAR requires the memory order to catch up to the current point
    before adding new AGENs to the problem space. If the memory order
    is already SC then MEMBAR has nothing to do and is pushed through
    the pipeline without delay.

    So, the delay has to do with catching up with memory order not in
    pushing the MEMBAR through the pipeline.


    Memory consistency is defined wrt what several processors do. Some
    processor performs some reads and writes and another performs some
    read and writes, and memory consistency defines what a processor sees
    about what the other does, and what ends up in main memory. But as
    long as the processors, their caches, and their interconnect gets the
    memory ordering right, the main memory is just the backing store that
    eventually gets a consistent result of what the other components did.
    So it does not matter whether the main memory has one bank or 256.

    NEC SX is a multi-processor vector machine with the property that
    addresses are spewed out as fast as AGEN can perform. These addresses
    are routed to banks based on bus-segment and can arrive OoO wrt
    how they were spewed out.

    So two processors accessing the same memory using vector LDs will
    see a single vector having multiple memory orderings. P[0]V[0] ordered before P[1]V[0] but P[1]V[1] ordered before P[0]V[1], ...

    As long as no stores happen, who cares about the order of the loads?
    When stores happen, the loads are ordered wrt these stores (with
    stronger memory orderings giving more guarantees). So the number of
    memory banks does not matter for implementing a strong ordering
    efficiently.

    Then consider 2 Vector processors performing 2 STs (1 each) to
    non-overlapping addresses but with bank aliasing. Consider that
    the STs are scatter based and the bank conflicts random. There
    is no way to determine which store happened first or which
    element of each vector store happened first.

    - anton

  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Thu Aug 1 15:54:55 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 30 Jul 2024 9:51:46 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    The depth of the execution window may be smaller than the time it takes to send the required information around and have this core recognize
    that it is out-of-order wrt memory.

    So if we don't want to stall for memory accesses all the time, we need
    a bigger execution window, either by making the reorder buffer larger,
    or by using a different, cheaper mechanism.

    Mc 88120 had a 96-wide execution window, which could be filled up in
    16 cycles (optimistically) and was filled up in 32 cycles (average).
    Given that DRAM is not going to be less than 20 ns and a 5GHz core,
    the execution window is 1/3rd that which would be required to absorb
    a cache miss all the way to DRAM.

    Relevant numbers for current cores are 400-600 instructions in the
    reorder buffer, 6-8 instructions per cycle, and the core-to-core
    latency is (for Bergamo) 30-40ns (90-120 cycles) within a CCX,
    100-120ns (300-360 cycles) within a socket, with a 212ns (636 cycles)
    worst case across sockets (data from <https://chipsandcheese.com/2024/06/22/testing-amds-bergamo-zen-4c-spam/>);
    I computed with 3GHz clock rate, which fits Bergamo. On the fast
    desktop chips the clock rate is higher, but the latency is lower in
    both ns and cycles (in particular, no dual-socket penalty); e.g., on
    the Ryzen 7950X the core-to-core latency is <80ns (456 cycles) <https://images.anandtech.com/doci/17585/AMD%20Ryzen%209%207950X%20Core%20to%20Core%20Latency%20Final.jpg>.

    The reorder buffers (and integer register files and store buffers)
    would need to be even larger to cover the 456 or 636 cycles (and maybe
    even more than that latency is needed before you are sure that a load
    or store is sequentially consistent), or alternatively one would need
    a cheaper mechanism.
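    (Working those numbers through: at 6-8 instructions per cycle,
    covering 456 cycles means roughly 2700-3600 instructions in flight,
    and 636 cycles roughly 3800-5100, against the 400-600 entries of
    current reorder buffers.)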

    Concerning the cheaper mechanism, what I am thinking of is hardware
    checkpointing every, say, 200 cycles or so (subject to fine-tuning).
    The idea here is that communication between cores is very rare, so
    rolling back more cycles than the minimal necessary amount costs
    little on average (except that it looks bad on cache ping-pong
    microbenchmarks).

    You lost me::

    Colloquially, there are 2 uses of the word checkpointing:: a) what
    HW does each time it inserts a branch into the EW, b) what an OS or application does to be able to recover from a crash (from any
    mechanism).

    What is "EW"?

    Anyway, here checkpointing would be a hardware mechanism that allows
    rolling back to the state at the point when the checkpoint was made.

    A MEMBAR requires the memory order to catch up to the current point
    before adding new AGENs to the problem space. If the memory order
    is already SC then MEMBAR has nothing to do and is pushed through
    the pipeline without delay.

    Yes, that's the slow implementation. The fast implementation is to
    implement sequential consistency all the time (by predicting and
    speculating that memory accesses do not interfere with those of other
    cores, and recovering from that speculation when the speculation turns
    out to be wrong). In such an implementation memory barriers are noops
    (and thus fast), because the hardware already provides sequential
    consistency.

    Then consider 2 Vector processors performing 2 STs (1 each) to
    non-overlapping addresses but with bank aliasing. Consider that
    the STs are scatter based and the bank conflicts random. There
    is no way to determine which store happened first or which
    element of each vector store happened first.

    It's up to the architecture to define the order of stores and loads of
    a given core. For sequential consistency you then interleave the
    sequences coming from the cores in some convenient order. It does not
    matter what happens earlier in some inertial system. It only matters
    what your hardware decides should be treated as being earlier. The
    hardware has a lot of freedom here, but the end result as visible to
    the cores must be sequentially consistent (or, with a weaker memory
    consistency model, consistent with that model).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Thu Aug 1 17:33:05 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 30 Jul 2024 9:51:46 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    The depth of the execution window may be smaller than the time it takes to send the required information around and have this core recognize that it is out-of-order wrt memory.

    So if we don't want to stall for memory accesses all the time, we need
    a bigger execution window, either by making the reorder buffer larger,
    or by using a different, cheaper mechanism.


    Colloquially, there are 2 uses of the word checkpointing:: a) what
    HW does each time it inserts a branch into the EW, b) what an OS or application does to be able to recover from a crash (from any
    mechanism).

    What is "EW"?

    Execution Window (referred to above)

  • From MitchAlsup1@21:1/5 to Anton Ertl on Thu Aug 1 19:57:24 2024
    On Thu, 1 Aug 2024 15:54:55 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 30 Jul 2024 9:51:46 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:

    A MEMBAR requires the memory order to catch up to the current point
    before adding new AGENs to the problem space. If the memory order
    is already SC then MEMBAR has nothing to do and is pushed through
    the pipeline without delay.

    Yes, that's the slow implementation. The fast implementation is to
    implement sequential consistency all the time (by predicting and
    speculating that memory accesses do not interfere with those of other
    cores, and recovering from that speculation when the speculation turns
    out to be wrong). In such an implementation memory barriers are noops
    (and thus fast), because the hardware already provides sequential consistency.

    Why does SC need any MEMBARs ??

    Then consider 2 Vector processors performing 2 STs (1 each) to
    non-overlapping addresses but with bank aliasing. Consider that
    the STs are scatter based and the bank conflicts random. There
    is no way to determine which store happened first or which
    element of each vector store happened first.

    It's up to the architecture to define the order of stores and loads of
    a given core. For sequential consistency you then interleave the
    sequences coming from the cores in some convenient order.

    Insufficient:: If an OoO processor orders LDs and STs as they leave AGEN
    you cannot just interleave multiple core access streams and achieve
    sequential consistency.

    It does not
    matter what happens earlier in some inertial system. It only matters
    what your hardware decides should be treated as being earlier. The

    Causal consistency is determined by arrival order at the memory controller.
    Cache consistency is enforced by allowing a single line to be "in progress"
    only once in the system--each cache line is serially reusable, while
    other cache lines remain unordered wrt that one.

    hardware has a lot of freedom here, but the end result as visible to
    the cores must be sequentially consistent (or, with a weaker memory consistency model, consistent with that model).

    - anton

  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Fri Aug 2 08:14:21 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 1 Aug 2024 15:54:55 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 30 Jul 2024 9:51:46 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:

    A MEMBAR requires the memory order to catch up to the current point before adding new AGENs to the problem space. If the memory order
    is already SC then MEMBAR has nothing to do and is pushed through
    the pipeline without delay.

    Yes, that's the slow implementation. The fast implementation is to
    implement sequential consistency all the time (by predicting and
    speculating that memory accesses do not interfere with those of other
    cores, and recovering from that speculation when the speculation turns
    out to be wrong). In such an implementation memory barriers are noops
    (and thus fast), because the hardware already provides sequential
    consistency.

    Why does SC need any MEMBARs ??

    A program written for sequential consistency does not need them. But
    if you have a program written for a weaker memory model, the memory
    barriers in that program will be noops and therefore really cheap.

    Then consider 2 Vector processors performing 2 STs (1 each) to
    non-overlapping addresses but with bank aliasing. Consider that
    the STs are scatter based and the bank conflicts random. There
    is no way to determine which store happened first or which
    element of each vector store happened first.

    It's up to the architecture to define the order of stores and loads of
    a given core. For sequential consistency you then interleave the
    sequences coming from the cores in some convenient order.

    Insufficient:: If an OoO processor orders LDs and STs as they leave AGEN
    you cannot just interleave multiple core access streams and achieve sequential consistency.

    Architecture is defined in the architecture manual. Implementation
    concepts like OoO and AGEN don't (or shouldn't) play a role there.
    WRT memory ordering most architectures define clearly what happens
    (for single-threaded programs), i.e., loads and stores happen exactly
    in the architectural execution order of the instructions, and they
    actually implement that, for single threaded programs.

    Then they take back some of these guarantees for multi-processing, and
    add some instructions (memory barriers, lock prefixes, etc.) to
    reestablish these guarantees when needed, in an expensive way.

    Sequential consistency is what you get if you do not take back these guarantees.

    Concerning vector instructions, what do architectures say about the
    memory order here? An ideal would be if they were treated as atomic,
    i.e., a read access is all performed after any earlier and before any
    later memory access in the stream of executed instructions. But even
    without multi-processing this tends to be inefficient, and has
    problems with page faults and the number of necessary pages in memory
    at the same time, especially with gather/scatter accesses and very
    long vector memory-memory instructions as on the NEC SX (IIRC). But
    of course, the NEC SX is a supercomputer architecture, a certain
    amount of architectural nonsense is not unusual there.

    Given such difficulties, vector instructions, at least with gather
    loads and scatter stores (whether strided or indirect), are not a good
    idea (and a recent Intel hardware vulnerability shows another reason
    why gather is not a good idea). Your VVM OTOH allows a clean
    architectural definition.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Michael S@21:1/5 to Chris M. Thomasson on Fri Nov 15 15:24:59 2024
    On Fri, 15 Nov 2024 03:17:22 -0800
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> wrote:

    On 11/14/2024 11:25 PM, Anton Ertl wrote:
    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification? I certainly think that Alpha's lack
    of guarantees in memory ordering is a bad idea, and so is ARM's:
    "It's only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously? Sequential consistency can be specified in one sentence: "The result
    of any execution is the same as if the operations of all the
    processors were executed in some sequential order, and the
    operations of each individual processor appear in this sequence in
    the order specified by its program."
    [...]


    Well, iirc, the Alpha is the only system that requires an explicit
    membar for an RCU-based algorithm. Even SPARC in RMO mode does not
    need this. Iirc, akin to memory_order_consume in C++:

    https://en.cppreference.com/w/cpp/atomic/memory_order

    data dependent loads


    Your response does not answer Anton's question.

  • From jseigh@21:1/5 to Anton Ertl on Fri Nov 15 11:08:29 2024
    On 11/15/2024 2:25 AM, Anton Ertl wrote:
    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification? I certainly think that Alpha's lack of guarantees in memory ordering is a bad idea, and so is ARM's: "It's
    only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously?
    Sequential consistency can be specified in one sentence: "The result
    of any execution is the same as if the operations of all the
    processors were executed in some sequential order, and the operations
    of each individual processor appear in this sequence in the order
    specified by its program."

    However, I don't think that the Alpha architects considered the Alpha
    memory ordering to be an error, and probably still don't, just like
    the ARM architects don't consider their memory model to be an error.
    I am pretty sure that no Alpha implementation ever made use of the
    lack of causality in the Alpha memory model, so they could have added causality without outlawing existing implementations. That they did
    not indicates that they thought that their memory model was right. An advocacy paper for weak memory models [adve&gharachorloo95] came from
    the same place as Alpha, so it's no surprise that Alpha specifies weak consistency.

    @TechReport{adve&gharachorloo95,
    author = {Sarita V. Adve and Kourosh Gharachorloo},
    title = {Shared Memory Consistency Models: A Tutorial},
    institution = {Digital Western Research Lab},
    year = {1995},
    type = {WRL Research Report},
    number = {95/7},
    annote = {Gives an overview of architectural features of
    shared-memory computers such as independent memory
    banks and per-CPU caches, and how they make the (for
    programmers) most natural consistency model hard to
    implement, giving examples of programs that can fail
    with weaker consistency models. It then discusses
    several categories of weaker consistency models and
    actual consistency models in these categories, and
    which ``safety net'' (e.g., memory barrier
    instructions) programmers need to use to work around
    the deficiencies of these models. While the authors
    recognize that programmers find it difficult to use
    these safety nets correctly and efficiently, it
    still advocates weaker consistency models, claiming
    that sequential consistency is too inefficient, by
    outlining an inefficient implementation (which is of
    course no proof that no efficient implementation
    exists). Still the paper is a good introduction to
    the issues involved.}
    }

    - anton

    Anybody doing that sort of programming, i.e. lock-free or distributed algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place. Strongly
    consistent memory won't help incompetence.

    Joe Seigh

  • From Anton Ertl@21:1/5 to jseigh on Fri Nov 15 17:27:37 2024
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Do you have any argument that supports this claim.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From MitchAlsup1@21:1/5 to BGB on Sat Nov 16 00:51:36 2024
    On Fri, 15 Nov 2024 23:35:22 +0000, BGB wrote:

    On 11/15/2024 4:05 PM, Chris M. Thomasson wrote:
    On 11/15/2024 12:53 PM, BGB wrote:
    On 11/15/2024 11:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed algorithms, who can't handle weakly consistent memory models, shouldn't be doing that sort of programming in the first place.

    Do you have any argument that supports this claim.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?


    In my case, as I see it:
       The tradeoff is more about implementation cost, performance, etc.

    Weak model:
       Cheaper (and simpler) to implement;
       Performs better when there is no need to synchronize memory;
       Performs worse when there is need to synchronize memory;
       ...
    [...]

    TSO obtained on top of a weak memory model is still TSO; it should not necessarily
    perform "worse" than other systems that have TSO as a default. The
    weaker models give us flexibility. Any weak memory model should be able
    to give sequential consistency by using the right membars in the right
    places.


    The speed difference is mostly that, in a weak model, the L1 cache
    merely needs to fetch memory from the L2 or similar, may write to it whenever, and need not proactively store back results.

    As I understand it, a typical TSO like model will require, say:
    Any L1 cache that wants to write to a cache line, needs to explicitly
    request write ownership over that cache line;

    The cache line may have been fetched from a core which modified the
    data, and handed this line directly to this requesting core on a
    typical read. So, it is possible for the line to show up with
    write permission even if the requesting core did not ask for write
    permission. So, not all lines being written have to request owner-
    ship.

    Any attempt by other cores to access this line,

    You are being rather loose with your time analysis in this question::

    Access this line before write permission has been requested,
    or
    Access this line after write permission has been requested but
    before it has arrived,
    or
    Access this line after write permission has arrived.

    may require the L2 cache
    to send a message to the core currently holding the cache line for
    writing to write back its contents, with the request unable to be
    handled until after the second core has written back the dirty cache
    line.

    L2 has to know something about how L1 has the line, and likely which
    core cache the data is in.

    This would create potential for significantly more latency in cases
    where multiple cores touch the same part of memory; albeit the cores
    will see each others' memory stores.

    One can ARGUE that this is a good thing as it makes latency part
    of the memory access model. More interfering accesses=higher
    latency.


    So, initially, weak model can be faster due to not needing any
    additional handling.


    But... Any synchronization points, such as a barrier or locking or
    releasing a mutex, will require manually flushing the cache with a weak model.

    Not necessarily:: My 66000 uses causal memory consistency, yet when
    an ATOMIC event begins it reverts to sequential consistency until
    the end of the event where it reverts back to causal. Use of MMI/O
    space reverts to sequential consistency, while access to config
    space reverts all the way back to strongly ordered.

    And, locking/releasing the mutex itself will require a mechanism
    that is consistent between cores (such as volatile atomic swaps or
    similar, which may still be weak as a volatile-atomic-swap would still
    not be atomic from the POV of the L2 cache; and an MMIO interface could
    be stronger here).
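    For reference, the conventional portable form of such a swap-based
    lock, as a C++11 sketch (illustrative; on cache-coherent hardware the
    coherence protocol moves the line and the acquire/release ordering on
    the flag stands in for any explicit cache flushing):

        // Minimal test-and-test-and-set spinlock (sketch).
        #include <atomic>

        struct SpinLock {
            std::atomic<bool> locked{false};

            void lock() {
                // Atomic swap; acquire ordering makes this critical section
                // see everything published before the previous unlock().
                while (locked.exchange(true, std::memory_order_acquire)) {
                    while (locked.load(std::memory_order_relaxed))
                        ;  // spin read-only instead of hammering the line with swaps
                }
            }

            void unlock() {
                // Release ordering publishes this critical section's stores
                // before the lock is seen as free by the next owner.
                locked.store(false, std::memory_order_release);
            }
        };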


    Seems like there could possibly be some way to skip some of the cache flushing if one could verify that a mutex is only being locked and
    unlocked on a single core.

    Issue then is how to deal with trying to lock a mutex which has thus far
    been exclusive to a single core. One would need some way for the core
    that last held the mutex to know that it needs to perform an L1 cache
    flush.

    This seems to be a job for Cache Consistency.

    Though, one possibility could be to leave this part to the OS scheduler/syscall/...

    The OS wants nothing to do with this.

    mechanism; so the core that wants to lock the
    mutex signals its intention to do so via the OS, and the next time the
    core that last held the mutex does a syscall (or tries to lock the mutex again), the handler sees this, then performs the L1 flush and flags the
    mutex as multi-core safe (at which point, the parties will flush L1s at
    each mutex lock, though possibly with a timeout count so that, if the
    mutex has been single-core for N locks, it reverts to single-core
    behavior).

    This could reduce the overhead of "frivolous mutex locking" in programs
    that are otherwise single-threaded or single processor (leaving the
    cache flushes for the ones that are in-fact being used for
    synchronization purposes).

    ....

  • From Anton Ertl@21:1/5 to Chris M. Thomasson on Sat Nov 16 07:37:44 2024
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/15/2024 9:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?

    For instance, a 100% sequential memory order won't help you with, say, solving ABA.

    Sure, not all problems are solved by sequential consistency, and yes,
    it won't solve race conditions like the ABA problem. But jseigh
    implied that anyone who finds it easier to write correct and efficient code for
    sequential consistency than for a weakly-consistent memory model
    (e.g., Alpha's memory model) is incompetent.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to BGB on Sat Nov 16 07:46:17 2024
    BGB <cr88192@gmail.com> writes:
    The tradeoff is more about implementation cost, performance, etc.

    Yes. And the "etc." includes "ease of programming".

    Weak model:
    Cheaper (and simpler) to implement;

    Yes.

    Performs better when there is no need to synchronize memory;

    Not in general. For a cheap multiprocessor implementation, yes. A sophisticated implementation of sequential consistency can just storm
    ahead in that case and achieve the same performance. It just has to
    keep checkpoints around in case that there is a need to synchronize
    memory.

    Performs worse when there is need to synchronize memory;

    With a cheap multiprocessor implementation, yes. In general, no: Any sequentially consistent implementation is also an implementation of
    every weaker memory model, and the memory barriers become nops in that
    kind of implementation. Ok, nops still have a cost, but it's very
    close to 0 on a modern CPU.

    Another potential performance disadvantage of sequential consistency
    even with a sophisticated implementation:

    If you have some algorithm that actually works correctly even when it
    gets stale data from a load (with some limits on the staleness), the sophisticated SC implementation will incur the latency coming from
    making the load non-stale while that latency will not occur or be less
    in a similarly-sophisticated implementation of an appropriate weak
    consistency model.

    However, given that the access to actually-shared memory is slow even
    on weakly-consistent hardware, software usually takes measures to
    avoid having a lot of such accesses, so that cost will usually be
    miniscule.
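    A typical example of such staleness-tolerant code, as a C++11 sketch
    (illustrative; the counter and function names are invented): worker
    threads bump a statistics counter with relaxed atomics, and a
    monitoring thread samples it, where a slightly stale value is
    perfectly acceptable:

        // Staleness-tolerant shared data: a statistics counter (sketch).
        #include <atomic>
        #include <cstdint>
        #include <cstdio>

        std::atomic<uint64_t> requests_served{0};

        // Workers: the increment must be atomic, but needs no ordering.
        void on_request() {
            requests_served.fetch_add(1, std::memory_order_relaxed);
        }

        // Monitor: a slightly stale value is fine, so nothing beyond the
        // atomic load itself is required.
        void report() {
            std::printf("served so far: %llu\n",
                        (unsigned long long)requests_served.load(std::memory_order_relaxed));
        }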


    What you missed: the big cost of weak memory models and cheap hardware implementations of them is in the software:

    * For correctness, the safe way is to insert a memory barrier between
    any two memory operations.

    * For performance (on cheap implementations of weak memory models) you
    want to execute as few memory barriers as possible.

    * You cannot use testing to find out whether you have enough (and the
    right) memory barriers. That's not only because the involved
    threads may not be in the right state during testing for uncovering
    the incorrectness, but also because the hardware used for testing
    may actually have stronger consistency than the memory model, and so
    some kinds of bugs will never show up in testing on that hardware,
    even when the threads reach the right state. And testing is still
    the go-to solution for software people to find errors (nowadays even
    glorified by continuous integration and modern fuzz testing
    approaches).

    The result is that a lot of software dealing with shared memory is
    incorrect because it does not have a memory barrier that it should
    have, or inefficient on cheap hardware with expensive memory barriers
    because it uses more memory barriers than necessary for the memory
    model. A program may even be incorrect in one place and have
    superfluous memory barriers in another one.
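    A classic instance of the testing trap, as a C++11 sketch
    (illustrative; names invented): written with plain or relaxed
    accesses, the message-passing pattern below typically passes every
    test on x86, because TSO keeps the two stores in order, yet it can
    fail on weakly ordered hardware; the release/acquire pair shown is
    what makes it correct everywhere:

        // Message-passing pattern (sketch).
        #include <atomic>
        #include <cassert>
        #include <thread>

        int payload = 0;
        std::atomic<bool> ready{false};

        void producer() {
            payload = 42;                                  // plain store
            ready.store(true, std::memory_order_release);  // publish
        }

        void consumer() {
            while (!ready.load(std::memory_order_acquire))
                ;                                          // wait for publication
            assert(payload == 42);  // guaranteed only by the release/acquire pair
        }

        int main() {
            std::thread p(producer), c(consumer);
            p.join(); c.join();
        }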

    Or programmers just don't do this stuff at all (as advocated by
    jseigh), and instead just write sequential programs, or use bottled
    solutions that often are a lot more expensive than superfluous memory
    barriers. E.g., in Gforth the primary inter-thread communication
    mechanism is currently implemented with pipes, involving the system
    calls read() and write(). And Bernd Paysan who implemented that is a
    really good programmer; I am sure he would be able to wrap his head
    around the whole memory model stuff and implement something much more efficient, but that would take time that he obviously prefers to spend
    on more productive things.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Anton Ertl on Sat Nov 16 17:28:21 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    jseigh <jseigh_es00@xemaps.com> writes:

    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Do you have any argument that supports this claim?

    It isn't a claim, just an opinion.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Sun Nov 17 09:03:06 2024
    On 11/16/24 16:21, Chris M. Thomasson wrote:

    Fwiw, in C++ std::memory_order_consume is useful for traversing a node
    based stack of something in RCU. In most systems it only acts like a
    compiler barrier. On the Alpha, it must emit a membar instruction. Iirc,
    mb for alpha? Cannot remember that one right now.

    That got deprecated. Too hard for compilers to deal with. It's now
    the same as memory_order_acquire.
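
    As a minimal sketch (illustrative names only) of what
    memory_order_consume was meant for: order only the accesses that are
    data-dependent on the loaded pointer.  Current compilers simply treat
    it as acquire.

    #include <atomic>

    struct Node { int value; Node* next; };
    std::atomic<Node*> head{nullptr};

    // Reader side of an RCU-style traversal.  consume was intended to order
    // only loads that depend on the pointer value (here n->value); in
    // practice compilers promote it to acquire.
    int first_value() {
        Node* n = head.load(std::memory_order_consume);
        return n ? n->value : -1;
    }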

    Which brings up an interesting point. Even if the hardware memory
    model is strongly ordered, compilers can reorder stuff,
    so you still have to program as if a weak memory model was in
    effect. Or maybe disable reordering or optimization altogether
    for those target architectures.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to jseigh on Sun Nov 17 15:17:52 2024
    jseigh <jseigh_es00@xemaps.com> writes:
    Even if the hardware memory
    model is strongly ordered, compilers can reorder stuff,
    so you still have to program as if a weak memory model was in
    effect.

    That's something between the user of a programming language and the
    compiler. If you use a programming language or compiler that gives
    weaker memory ordering guarantees than the architecture it compiles
    to, that's your choice. Nothing forces compilers to behave that way,
    and it's actually easier to write compilers that do not do such
    reordering.

    Or maybe disable reordering or optimization altogether
    for those target architectures.

    So you want to throw out the baby with the bathwater.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Chris M. Thomasson on Sun Nov 17 15:15:08 2024
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/15/2024 11:37 PM, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/15/2024 9:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?

    For instance, a 100% sequential memory order won't help you with, say,
    solving ABA.

    Sure, not all problems are solved by sequential consistency, and yes,
    it won't solve race conditions like the ABA problem. But jseigh
    implied that anyone who finds it easier to write correct and efficient
    code for sequential consistency than for a weakly-consistent memory
    model (e.g., Alpha's memory model) is incompetent.

    What if you had to write code for a weakly ordered system, and the
    performance guidelines said to only use a membar when you absolutely
    have to. If you say something akin to "I do everything using
    std::memory_order_seq_cst", well, that is a violation right off the bat.

    Fair enough?

    Are you trying to support my point?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Chris M. Thomasson on Mon Nov 18 07:11:04 2024
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    What if you had to write code for a weakly ordered system, and the
    performance guidelines said to only use a membar when you absolutely
    have to. If you say something akin to "I do everything using
    std::memory_order_seq_cst", well, that is a violation right off the bat. ...
    I am trying to say you might not be hired if you only knew how to handle
    std::memory_order_seq_cst wrt C++... ?

    I am not looking to be hired.

    In any case, this cuts both ways: If you are an employer working on multi-threaded software, say, for Windows or Linux, will you reduce
    your pool of potential hires by including a requirement like the one
    above? And then pay for longer development time and additional
    hard-to-find bugs coming from overshooting the requirement you stated
    above. Or do you limit your software support to TSO hardware (for
    lack of widely available SC hardware), and gain all the benefits of
    more potential hires, reduced development time, and fewer bugs?

    I have compared arguments against strong memory ordering with those
    against floating-point. Von Neumann argued for fixed point as follows <https://booksite.elsevier.com/9780124077263/downloads/historial%20perspectives/section_3.11.pdf>:

    |[...] human time is consumed in arranging for the introduction of
    |suitable scale factors. We only argue that the time consumed is a
    |very small percentage of the total time we will spend in preparing an
    |interesting problem for our machine. The first advantage of the
    |floating point is, we feel, somewhat illusory. In order to have such
    |a floating point, one must waste memory capacity which could
    |otherwise be used for carrying more digits per word.

    Kahan writes <https://people.eecs.berkeley.edu/~wkahan/SIAMjvnl.pdf>:

    |Papers in 1947/8 by Bargman, Goldstein, Montgomery and von Neumann
    |seemed to imply that 40-bit arithmetic would hardly ever deliver
    |usable accuracy for the solution of so few as 100 linear equations in
    |100 unknowns; but by 1954 engineers were solving bigger systems
    |routinely and getting satisfactory accuracy from arithmetics with no
    |more than 40 bits.

    The flaw in the reasoning of the paper was:

    |To solve it more easily without floating–point von Neumann had
    |transformed equation Bx = c to B^TBx = B^Tc , thus unnecessarily
    |doubling the number of sig. bits lost to ill-condition

    This is an example of how the supposed gains that the harder-to-use
    interface provides (in this case the bits "wasted" on the exponent)
    are overcompensated by then having to use a software workaround for
    the harder-to-use interface.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From aph@littlepinkcloud.invalid@21:1/5 to Anton Ertl on Mon Nov 18 12:03:55 2024
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification?

    I don't know. Given the contortions that the Linux kernel people had
    to go through, maybe it really was present in hardware.

    As a programming language implementer, I don't much think about "Will
    the hardware really do this?" because new hardware arises all the
    time, and I don't want users' programs to stop working.

    Andrew.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From aph@littlepinkcloud.invalid@21:1/5 to jseigh on Mon Nov 18 11:56:48 2024
    jseigh <jseigh_es00@xemaps.com> wrote:
    On 11/16/24 16:21, Chris M. Thomasson wrote:

    Fwiw, in C++ std::memory_order_consume is useful for traversing a node
    based stack of something in RCU. In most systems it only acts like a
    compiler barrier. On the Alpha, it must emit a membar instruction. Iirc,
    mb for alpha? Cannot remember that one right now.

    That got deprecated. Too hard for compilers to deal with. It's now
    same as memory_order_acquire.

    It's back in C++20. I think the problem wasn't so much implementing
    it, which as you say can be trivially done by aliasing with acquire,
    but specifying it. We use load dependency ordering in Java on AArch64
    to satisfy some memory model requirements, so it's not as if it's
    difficult to use.

    Which brings up an interesting point. Even if the hardware memory
    model is strongly ordered, compilers can reorder stuff,
    so you still have to program as if a weak memory model was in
    effect.

    Yes, exactly. It's not as if this is an issue that affects people who
    program in high-level languages; it's about what language implementers
    choose to do.

    Andrew.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Tue Nov 26 01:29:19 2024
    On Mon, 25 Nov 2024 23:59:02 +0000, Chris M. Thomasson wrote:

    On 11/18/2024 3:34 PM, Chris M. Thomasson wrote:


    Don't tell me you want all of std::memory_order_* to default to
    std::memory_order_seq_cst? If you're on a system that only has seq_cst and
    nothing else, okay, but not on other weaker (memory order) systems,
    right?

    defaulting a relaxed to a seq_cst is a bit much.... ;^o

    Defaulting to Strongly_Ordered is even worse.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Chris M. Thomasson on Tue Dec 3 08:32:52 2024
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/18/2024 3:20 PM, Chris M. Thomasson wrote:
    On 11/17/2024 11:11 PM, Anton Ertl wrote:
    The flaw in the reasoning of the paper was:

    |To solve it more easily without floating–point von Neumann had
    |transformed equation Bx = c to B^TBx = B^Tc , thus unnecessarily
    |doubling the number of sig. bits lost to ill-condition

    This is an example of how the supposed gains that the harder-to-use
    interface provides (in this case the bits "wasted" on the exponent)
    are overcompensated by then having to use a software workaround for
    the harder-to-use interface.
    ...
    Don't tell me you want all of std::memory_order_* to default to
    std::memory_order_seq_cst? If you're on a system that only has seq_cst and
    nothing else, okay, but not on other weaker (memory order) systems, right?

    I tell anyone who wants to read it to stop buying hardware without FP
    for non-integer work, and with weak memory ordering for work that
    needs concurrent programming. There are enough affordable offerings
    with FP and TSO that we do not need to waste programming time and
    increase the frequency of hard-to-find bugs by figuring out how to get
    good performance out of hardware without FP hardware and with weak
    memory ordering.

    Those who enjoy the challenge of dealing with the unnecessary problems
    of sub-par hardware can continue to enjoy that.

    But when developing production software, as a manager don't let
    programmers with this hobby horse influence your hardware and
    development decisions. Give full support for FP and TSO hardware, and
    limited support to weakly-ordered hardware. That limited support may
    consist of using software implementations of FP (instead of designing
    software for fixed point arithmetic). In case of hardware with weak
    ordering the limited support could be to use memory barriers liberally
    (without trying to minimize them at all; every memory barrier
    elimination costs development time and increases the potential for
    hard-to-find bugs), to use OS mechanisms for concurrency (rather
    than, e.g., lock-free algorithms), or maybe even to support only
    single-threaded operation.
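
    As a sketch of that "liberal" style (hypothetical names): take the
    default sequentially consistent ordering everywhere and spend no
    development time minimizing barriers.

    #include <atomic>

    std::atomic<long> hits{0};
    std::atomic<bool> shutdown{false};

    // Every operation uses the default std::memory_order_seq_cst; there is
    // nothing to reason about, at some cost on weakly ordered hardware.
    void record_hit() { hits.fetch_add(1); }
    void stop()       { shutdown.store(true); }
    bool stopping()   { return shutdown.load(); }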

    Efficiently-implemented sequentially-consistent hardware would be even
    more preferable, and if it was widely available, I would recommend
    buying that over TSO hardware, but unfortunately we are not there yet.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Chris M. Thomasson on Tue Dec 3 09:01:44 2024
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/17/2024 7:17 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Or maybe disable reordering or optimization altogether
    for those target architectures.

    So you want to throw out the baby with the bathwater.

    No, keep the weak order systems and not throw them out wrt a system that
    is 100% seq_cst? Perhaps? What am I missing here?

    Disabling optimization altogether costs a lot; e.g., look at <http://www.complang.tuwien.ac.at/anton/bentley.pdf>: if you compare
    the lines for clang-3.5 -O0 with clang-3.5 -O3, you see a factor >2.5
    for the tsp9 program. For gcc-5.2.0 the difference is even bigger.

    That's why jseigh and people like him (I have read that suggestion
    several times before) love to suggest disabling optimization
    altogether. It's a straw man that does not even need beating up. Of
    course they usually don't show results for the supposed benefits of
    the particular "optimization" they advocate (or the drawbacks of
    disabling it), and jseigh follows this pattern nicely.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Tue Dec 3 11:36:37 2024
    On Tue, 03 Dec 2024 08:32:52 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/18/2024 3:20 PM, Chris M. Thomasson wrote:
    On 11/17/2024 11:11 PM, Anton Ertl wrote:
    The flaw in the reasoning of the paper was:

    |To solve it more easily without floating–point von Neumann had
    |transformed equation Bx = c to B^TBx = B^Tc , thus unnecessarily
    |doubling the number of sig. bits lost to ill-condition

    This is an example of how the supposed gains that the
    harder-to-use interface provides (in this case the bits "wasted"
    on the exponent) are overcompensated by then having to use a
    software workaround for the harder-to-use interface.
    ...
    Don't tell me you want all of std::memory_order_* to default to
    std::memory_order_seq_cst? If you're on a system that only has seq_cst
    and nothing else, okay, but not on other weaker (memory order)
    systems, right?

    I tell anyone who wants to read it to stop buying hardware without FP
    for non-integer work, and with weak memory ordering for work that
    needs concurrent programming. There are enough affordable offerings
    with FP and TSO that we do not need to waste programming time and
    increase the frequency of hard-to-find bugs by figuring out how to get
    good performance out of hardware without FP hardware and with weak
    memory ordering.

    Those who enjoy the challenge of dealing with the unnecessary problems
    of sub-par hardware can continue to enjoy that.

    But when developing production software, as a manager don't let
    programmers with this hobby horse influence your hardware and
    development decisions. Give full support for FP and TSO hardware, and limited support to weakly-ordered hardware. That limited support may
    consist of using software implementations of FP (instead of designing software for fixed point arithmetic). In case of hardware with weak
    ordering the limited support could be to use memory barriers liberally (without trying to minimize them at all; every memory barrier
    elimination costs development time and increases the potential for hard-to-find bugs), of using OS mechanisms for concurrency (rather
    than, e.g., lock-free algorithms), or maybe even only supporting single-threaded operation.

    Efficiently-implemented sequentially-consistent hardware would be even
    more preferable, and if it was widely available, I would recommend
    buying that over TSO hardware, but unfortunately we are not there yet.

    - anton

    If you want a capable dual-core or quad-core processor integrated with
    an FPGA, then Arm Cortex-A is the only game in town right now, and probably
    for a few years going forward. Typically, old low-end cores. FPU is
    there, TSO is not.
    Fortunately, in the majority of applications of these chips there is no need
    for concurrent programming, but one is rarely 100% sure that the need
    won't emerge when starting a project.

    BTW, does your stance mean that you are strongly against the A64FX?

    My own stance is that people should not do lockless concurrent
    programming. Period.
    Well, almost period. Something like RCU in the Linux kernel is an exception.
    Maybe atomic updates of statistical counters are another exception,
    but only when one is sure that the application will never have to scale
    above two dozen cores.

    Lockless programming is horrendously complicated and error prone.
    Sequential consistency removes only a small part of the potential
    complications.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Michael S on Tue Dec 3 10:03:29 2024
    Michael S <already5chosen@yahoo.com> writes:
    BTW, does your stance mean that you are strongly against the A64FX?

    No. According to <https://lwn.net/Articles/970907/> A64FX is one of
    the few ARM A64 implementations that provides TSO.

    Lockless programming is horrendously complicated and error prone.
    Sequential consistency removes only small part of potential
    complications.

    It's only a part, true, but I am not sure that the part is small.

    Interestingly, the FP analogy persists here: having FP hardware does
    not mean that numerical programming is easy, just that it is not as
    hard as with fixed point.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Anton Ertl on Tue Dec 3 08:59:18 2024
    On 12/3/24 04:01, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/17/2024 7:17 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Or maybe disable reordering or optimization altogether
    for those target architectures.

    So you want to throw out the baby with the bathwater.

    No, keep the weak order systems and not throw them out wrt a system that
    is 100% seq_cst? Perhaps? What am I missing here?

    Disabling optimization altogether costs a lot; e.g., look at <http://www.complang.tuwien.ac.at/anton/bentley.pdf>: if you compare
    the lines for clang-3.5 -O0 with clang-3.5 -O3, you see a factor >2.5
    for the tsp9 program. For gcc-5.2.0 the difference is even bigger.

    That's why jseigh and people like him (I have read that suggestion
    several times before) love to suggest disabling optimization
    altogether. It's a straw man that does not even need beating up. Of
    course they usually don't show results for the supposed benefits of
    the particular "optimization" they advocate (or the drawbacks of
    disabling it), and jseigh follows this pattern nicely.

    That wasn't a serious suggestion.

    The compiler is allowed to reorder code as long as it knows the
    reordering can't be observed or detected. If there are places
    in the code where it doesn't know that, it won't optimize
    across them, more or less.

    If you are writing code with concurrent shared data access then
    you need to let the compiler know. One way is with locks.
    Another way, for lock-free data structures, is with
    memory barriers. Even if you had cst hardware you
    would still need to tell the compiler, so cst hardware doesn't
    buy you any less effort from a programming point of view.
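
    A minimal sketch of that point (illustrative names): with a plain
    variable the compiler may hoist the load out of the loop; with
    std::atomic it must reload it, no matter how strong the hardware is.

    #include <atomic>

    bool plain_flag = false;                // data race: compiler may cache it
    std::atomic<bool> shared_flag{false};   // compiler must re-load each time

    void spin_plain()  { while (!plain_flag) {} }          // may spin forever
    void spin_atomic() { while (!shared_flag.load()) {} }  // correct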

    If you are arguing that lock-free programming with memory barriers
    is hard, let's use locks for everything (disregarding that
    locks have acquire/release semantics that the compiler has
    to be aware of and programmers aren't always aware of), you
    might want to consider the following performance timings
    on some stuff I've been playing with.

    unsafe       53.344 nsecs (     0.000)     54.547 nsecs (     0.000)*
    smr          53.828 nsecs (     0.484)     55.485 nsecs (     0.939)
    smrlite      53.094 nsecs (     0.000)     54.329 nsecs (     0.000)
    arc         306.674 nsecs (   253.330)    313.931 nsecs (   259.384)
    rwlock      730.012 nsecs (   676.668)    830.340 nsecs (   775.793)
    mutex     2,881.690 nsecs ( 2,828.346)  3,305.382 nsecs ( 3,250.835)

    smr is smrproxy, something like user space rcu. smrlite is smr
    w/o thread_local access so I have an idea how much that
    adds to overhead. arc is arcproxy, lock-free reference count
    based deferred reclamation. rwlock and mutex are what their
    names would suggest. unsafe is no synchronization to get a
    base timing on the reader loop body.

    2nd col is per loop read lock/unlock average cpu time
    3rd col is with unsafe time subtracted out
    4th col is average elapsed time
    5th col is with unsafe time subtracted out.

    cpu time doesn't measure lock wait time so elapsed time
    gives some indication of that.

    8 reader threads, 1 writer thread

    smrproxy is the version that doesn't need the seq_cst
    memory barrier so it is pretty fast (you are welcome).

    arc, rwlock, and mutex use interlocked instructions which
    cause cache thrashing. mutex will not scale well with
    number of threads on top of that. rwlock depends on
    how much write locking is going on. With few write
    updates, it will look more like arc.

    Timings are for 8 reader threads, 1 writer thread on
    4 core/8 hw thread machine.

    There are going to be applications where that 2-to-3+ orders-of-magnitude
    difference in overhead is going to matter a lot.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to jseigh on Tue Dec 3 19:27:47 2024
    On Tue, 3 Dec 2024 13:59:18 +0000, jseigh wrote:

    The compiler is allow to reorder code as long as it knows the
    reordering can't be observed or detected.

    With exceptions enabled, this would allow for almost no code
    movement at all.

    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Tue Dec 3 18:37:41 2024
    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    We could start with something like

    critical_region {
    ...
    }

    such that the compiler must refrain from any code motion within
    those sections but is free to move things outside of those sections as if execution was singlethreaded.
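
    There is no such construct in C or C++ today; the closest widely used
    approximation (GCC/Clang specific, shown only as a sketch) is an empty
    asm statement with a "memory" clobber, which blocks compiler code
    motion across it without emitting any hardware fence.

    // Pure compiler barrier: no instruction is emitted, but the compiler
    // may not move memory accesses across it (GCC/Clang extension).
    #define COMPILER_BARRIER() asm volatile("" ::: "memory")

    int shared_x;

    void sketch() {
        COMPILER_BARRIER();   // accesses below stay below this point
        shared_x = 1;
        COMPILER_BARRIER();   // and above this one
    }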


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stefan Monnier on Wed Dec 4 02:47:58 2024
    On Tue, 3 Dec 2024 23:37:41 +0000, Stefan Monnier wrote:

    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    We could start with something like

    critical_region {
    ...
    }

    In the spirit of the Fortran rule about obeying parentheses, one could
    write::

    {
    ...
    }

    and the compiler would not allow code motion beyond the {s or }s.

    {{You may want to disable code motion even when the region is not critical--just dangerous in some way.}}

    such that the compiler must refrain from any code motion within
    those sections but is free to move things outside of those sections as
    if execution was singlethreaded.

    You identify a second problem. Is it that you don't want code motion
    across the boundary or you do not want code motion within the boundary??


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Dec 4 10:29:33 2024
    You identify a second problem. Is it that you don't want code motion
    across the boundary or you do not want code motion within the boundary??

    Concurrency is hard. 🙂


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Stefan Monnier on Wed Dec 4 11:13:17 2024
    On 12/3/24 18:37, Stefan Monnier wrote:
    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    We could start with something like

    critical_region {
    ...
    }

    such that the compiler must refrain from any code motion within
    those sections but is free to move things outside of those sections as if execution was singlethreaded.


    C/C++11 already defines what lock acquire/release semantics are.
    Roughly you can move stuff outside of a critical section into it
    but not vice versa.

    Java uses synchronized blocks to denote the critical section.
    C++ (the society for using RAII for everything) has scoped_lock
    if you want to use RAII for your critical section. It's not
    always obvious what the actual critical section is. I usually
    use it inside its own bracket section to make it more obvious.
    { std::scoped_lock m(mutex);
      // .. critical section
    }

    I'm not a big fan of c/c++ using acquire and release memory order
    directives on everything since apart from a few situations it's
    not intuitively obvious what they do in all cases. You can
    look at compiler assembler output but you have to be real careful
    generalizing from what you see.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Dec 4 16:37:41 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 3 Dec 2024 13:59:18 +0000, jseigh wrote:

    The compiler is allow to reorder code as long as it knows the
    reordering can't be observed or detected.

    With exceptions enabled, this would allow for almost no code
    movement at all.

    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    C and C++ have the 'volatile' keyword for this purpose.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stefan Monnier on Wed Dec 4 19:43:48 2024
    On Wed, 4 Dec 2024 15:29:33 +0000, Stefan Monnier wrote:

    You identify a second problem. Is it that you don't want code motion
    across the boundary or you do not want code motion within the boundary??

    Concurrency is hard. 🙂

    Suggest a way to disallow code motion across the boundary
    that still allows code motion within a block ??



    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Dec 4 19:42:11 2024
    On Wed, 4 Dec 2024 16:37:41 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 3 Dec 2024 13:59:18 +0000, jseigh wrote:

    The compiler is allow to reorder code as long as it knows the
    reordering can't be observed or detected.

    With exceptions enabled, this would allow for almost no code
    movement at all.

    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    C and C++ have the 'volatile' keyword for this purpose.

    What if you want the volatile attribute only to hold
    on an inner block::

    {
        int i = ...;
        ...        // i is not volatile here
        {
            ...    // i is volatile in here
        }
        ...        // i is not volatile here
        ...
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Dec 4 19:48:21 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 4 Dec 2024 16:37:41 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 3 Dec 2024 13:59:18 +0000, jseigh wrote:

    The compiler is allow to reorder code as long as it knows the
    reordering can't be observed or detected.

    With exceptions enabled, this would allow for almost no code
    movement at all.

    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    C and C++ have the 'volatile' keyword for this purpose.

    What if you want the volatile attribute only to hold
    on an inner block::

    {
        int i = ...;
        ...        // i is not volatile here
        {
            ...    // i is volatile in here
        }
        ...        // i is not volatile here
        ...
    }

    Cast it. Linux uses the macro ACCESS_ONCE to support this:

    uint64_t value = ACCESS_ONCE(memory_location);

    #define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))
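
    A small usage sketch (hypothetical variable name): the volatile cast
    forces the compiler to perform the access each time instead of caching
    the value in a register; it adds no hardware ordering whatsoever.

    #include <stdint.h>

    #define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

    uint64_t memory_location;

    void wait_for_nonzero(void) {
        // Re-read memory_location on every iteration; without the volatile
        // cast the compiler could hoist the load out of the loop.
        while (ACCESS_ONCE(memory_location) == 0) { }
    }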

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Thu Dec 5 08:00:40 2024
    On 12/5/24 02:44, Chris M. Thomasson wrote:
    On 12/4/2024 8:13 AM, jseigh wrote:
    On 12/3/24 18:37, Stefan Monnier wrote:
                                               If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    We could start with something like

         critical_region {
           ...
         }

    such that the compiler must refrain from any code motion within
    those sections but is free to move things outside of those sections
    as if
    execution was singlethreaded.


    C/C++11 already defines what lock acquire/release semantics are.
    Roughly you can move stuff outside of a critical section into it
    but not vice versa.

    Java uses synchronized blocks to denote the critical section.
    C++ (the society for using RAII for everything) has scoped_lock
    if you want to use RAII for your critical section.  It's not
    always obvious what the actual critical section is.  I usually
    use it inside its own bracket section to make it more obvious.
       { std::scoped_lock m(mutex);
         // .. critical section
       }

    I'm not a big fan of c/c++ using acquire and release memory order
    directives on everything since apart from a few situations it's
    not intuitively obvious what they do in all cases.  You can
    look a compiler assembler output but you have to be real careful
    generalizing from what you see.

    The release on the unlock can allow some following stores and things to
    sort of "bubble up before it?

    Acquire and release confines things to the "critical section", the
    release can allow for some following things to go above it, so to speak.
    This is making me think of Alex over on c.p.t. !

    :^)

    Did I miss anything? Sorry Joe.


    Maybe. For thread-local, non-shared data, if the compiler can make that determination, but I don't know if the actual specs say that.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Scott Lurndal on Thu Dec 5 07:19:24 2024
    scott@slp53.sl.home (Scott Lurndal) writes:

    mitchalsup@aol.com (MitchAlsup1) writes:

    On Tue, 3 Dec 2024 13:59:18 +0000, jseigh wrote:

    The compiler is allow to reorder code as long as it knows the
    reordering can't be observed or detected.

    With exceptions enabled, this would allow for almost no code
    movement at all.

    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    C and C++ have the 'volatile' keyword for this purpose.

    A problem with using volatile is that volatile doesn't do what
    most people think it does, especially with respect to what
    reordering is or is not allowed.
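
    A minimal sketch of the usual misconception (illustrative names):
    volatile keeps the two volatile accesses in program order as far as
    the compiler is concerned, but it neither constrains the CPU on weakly
    ordered hardware nor orders surrounding plain accesses; std::atomic
    with release/acquire does.

    #include <atomic>

    volatile int v_data = 0, v_flag = 0;

    void publish_volatile() {
        v_data = 42;
        v_flag = 1;     // no inter-thread ordering guarantee on weak hardware
    }

    std::atomic<int> a_data{0}, a_flag{0};

    void publish_atomic() {
        a_data.store(42, std::memory_order_relaxed);
        a_flag.store(1, std::memory_order_release);  // orders the data store
    }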

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Tim Rentsch on Fri Dec 6 10:04:29 2024
    Tim, did you send me a PM to check my email? I responded but then
    silence. Could someone be pretending to be you?

    Terje

    Tim Rentsch wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:

    mitchalsup@aol.com (MitchAlsup1) writes:

    On Tue, 3 Dec 2024 13:59:18 +0000, jseigh wrote:

    The compiler is allow to reorder code as long as it knows the
    reordering can't be observed or detected.

    With exceptions enabled, this would allow for almost no code
    movement at all.

    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    C and C++ have the 'volatile' keyword for this purpose.

    A problem with using volatile is that volatile doesn't do what
    most people think it does, especially with respect to what
    reordering is or is not allowed.



    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Terje Mathisen on Fri Dec 6 07:36:33 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim, did you send me a PM to check my email? I responded but then
    silence. Could someone be pretending to be you?

    Yes that was me. I'm just about to send you a followup.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Tue Dec 17 07:33:58 2024
    On 12/16/24 16:48, Chris M. Thomasson wrote:
    On 12/5/2024 5:00 AM, jseigh wrote:


    Maybe.  For thread local non-shared data if the compiler can make that
    determination but I don't know if the actual specs say that.

    It would be strange to me if the compiler executed a weaker barrier than
    what I said needed to be there. If I say I need a #LoadStore |
    #StoreStore here, then the compiler better put that barrier in there.
    Humm...

    C++ concurrency was designed by a committee. They try to fit things
    into their world view even if reality is a bit more nuanced or complex
    than that world view.

    C++ doesn't use #LoadStore, etc... memory ordering terminology. They
    use acquire, release, cst, relaxed, ... While in some cases it's straightforward as to what that means, in others it's less obvious.
    Non-obvious isn't exactly what you want when writing multi-threaded
    code. There's enough subtlety as it is.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From aph@littlepinkcloud.invalid@21:1/5 to jseigh on Tue Dec 17 20:38:18 2024
    jseigh <jseigh_es00@xemaps.com> wrote:

    C++ doesn't use #LoadStore, etc... memory ordering terminology. They
    use acquire, release, cst, relaxed, ... While in some cases it's straightforward as to what that means, in others it's less obvious.

    Indeed you don't know the exact mapping to instructions, but that's
    the idea: you ask for the ordering model you want, and the compiler
    chooses the instructions.

    Non-obvious isn't exactly what you want when writing multi-threaded
    code. There's enough subtlety as it is.

    There are efficiency advantages to be had from getting away from
    explicit barriers, though. AArch64 has seq-cst load and store
    instructions which don't need the sledgehammer of a full StoreLoad
    between a store and a load.
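
    For instance (a sketch only; the exact instruction selection is up to
    the compiler), current GCC and Clang typically lower a seq_cst
    store/load to STLR/LDAR on AArch64 rather than a plain access plus a
    full DMB.

    #include <atomic>

    std::atomic<int> x{0};

    void write_it() { x.store(1, std::memory_order_seq_cst); }    // usually STLR
    int  read_it()  { return x.load(std::memory_order_seq_cst); } // usually LDAR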

    Andrew.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Tue Dec 17 21:45:55 2024
    On Tue, 17 Dec 2024 20:41:23 +0000, Chris M. Thomasson wrote:

    On 12/17/2024 4:33 AM, jseigh wrote:
    On 12/16/24 16:48, Chris M. Thomasson wrote:
    On 12/5/2024 5:00 AM, jseigh wrote:


    Maybe.  For thread local non-shared data if the compiler can make that >>>> determination but I don't know if the actual specs say that.

    It would be strange to me if the compiler executed a weaker barrier
    than what I said needed to be there. If I say I need a #LoadStore |
    #StoreStore here, then the compiler better put that barrier in there.
    Humm...

    C++ concurrency was designed by a committee.  They try to fit things
    into their world view even if reality is a bit more nuanced or complex
    than that world view.

    Indeed.

    A committee with no/little HW design experience ...


    C++ doesn't use #LoadStore, etc... memory ordering terminology.  They
    use acquire, release, cst, relaxed, ...  While in some cases it's
    straightforward as to what that means, in others it's less obvious.
    Non-obvious isn't exactly what you want when writing multi-threaded
    code.  There's enough subtlety as it is.

    Agreed. Humm... The CAS is interesting to me.

    atomic_compare_exchange_weak
    atomic_compare_exchange_strong

    The weak one can fail spuriously... Akin to LL/SC in a sense?

    There are architectures in which CAS* is allowed to fail spuriously.
    In one case, if the miss buffers overflow while holding a locked
    variable, the CAS would fail (preventing an ABA problem possibility).

    (*) or any multi-instruction ATOMIC sequence.

    atomic_compare_exchange_weak_explicit
    atomic_compare_exchange_strong_explicit

    A membar for the success path and one for the failure path. Oh that's
    fun. Sometimes I think its better to use relaxed for all of the atomics
    and use explicit barriers ala atomic_thread_fence for the order. Well,
    that is more in line with the SPARC way of doing things... ;^)

    :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Wed Dec 18 06:43:43 2024
    On 12/17/24 15:41, Chris M. Thomasson wrote:

    Agreed. Humm... The CAS is interesting to me.

    atomic_compare_exchange_weak
    atomic_compare_exchange_strong

    The weak one can fail spuriously... Akin to LL/SC in a sense?

    Most likely LL/SC in the implementation. If you are calling
    CAS in a loop, then the weak form will be safe; it will just
    retry. The strong form is for when you are not in a loop; the
    CAS emulation code has extra logic to filter out spurious
    failures and retry internally.
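
    A minimal sketch of the loop case (illustrative counter):
    compare_exchange_weak may fail spuriously, e.g. when mapped onto
    LL/SC, but the loop simply retries, so weak is the natural choice;
    strong is for the non-loop case.

    #include <atomic>

    std::atomic<int> counter{0};

    void add_one() {
        int expected = counter.load(std::memory_order_relaxed);
        // On a spurious failure, expected is refreshed and we go around again.
        while (!counter.compare_exchange_weak(expected, expected + 1,
                                              std::memory_order_relaxed)) {
        }
    }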

    atomic_compare_exchange_weak_explicit
    atomic_compare_exchange_strong_explicit

    A membar for the success path and one for the failure path. Oh that's
    fun. Sometimes I think its better to use relaxed for all of the atomics
    and use explicit barriers ala atomic_thread_fence for the order. Well,
    that is more in line with the SPARC way of doing things... ;^)

    Not sure why that is there. Possibly for non-loop usages.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Thu Dec 19 18:33:36 2024
    On Thu, 5 Dec 2024 7:44:19 +0000, Chris M. Thomasson wrote:

    On 12/4/2024 8:13 AM, jseigh wrote:
    On 12/3/24 18:37, Stefan Monnier wrote:
                                               If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    We could start with something like

         critical_region {
           ...
         }

    such that the compiler must refrain from any code motion within
    those sections but is free to move things outside of those sections as
    if
    execution was singlethreaded.


    C/C++11 already defines what lock acquire/release semantics are.
    Roughly you can move stuff outside of a critical section into it
    but not vice versa.

    Java uses synchronized blocks to denote the critical section.
    C++ (the society for using RAII for everything) has scoped_lock
    if you want to use RAII for your critical section.  It's not
    always obvious what the actual critical section is.  I usually
    use it inside its own bracket section to make it more obvious.
      { std::scoped_lock m(mutex);
        // .. critical section
      }

    I'm not a big fan of c/c++ using acquire and release memory order
    directives on everything since apart from a few situations it's
    not intuitively obvious what they do in all cases.  You can
    look a compiler assembler output but you have to be real careful
    generalizing from what you see.

    The release on the unlock can allow some following stores and things to
    sort of "bubble up before it?

    Acquire and release confines things to the "critical section", the
    release can allow for some following things to go above it, so to speak.
    This is making me think of Alex over on c.p.t. !

    This sounds dangerous: if the thing allowed to go above it is unCacheable
    while the lock:release is cacheable, the cacheable lock can arrive at
    another core before the unCacheable store arrives at its destination.

    :^)

    Did I miss anything? Sorry Joe.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Thu Dec 19 23:59:01 2024
    On Thu, 19 Dec 2024 21:19:24 +0000, Chris M. Thomasson wrote:

    On 12/19/2024 10:33 AM, MitchAlsup1 wrote:
    On Thu, 5 Dec 2024 7:44:19 +0000, Chris M. Thomasson wrote:

    On 12/4/2024 8:13 AM, jseigh wrote:
    On 12/3/24 18:37, Stefan Monnier wrote:
                                               If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references >>>>>> are "more special" than normal--when languages give few mechanisms. >>>>>
    We could start with something like

         critical_region {
           ...
         }

    such that the compiler must refrain from any code motion within
    those sections but is free to move things outside of those sections as >>>>> if
    execution was singlethreaded.


    C/C++11 already defines what lock acquire/release semantics are.
    Roughly you can move stuff outside of a critical section into it
    but not vice versa.

    Java uses synchronized blocks to denote the critical section.
    C++ (the society for using RAII for everything) has scoped_lock
    if you want to use RAII for your critical section.  It's not
    always obvious what the actual critical section is.  I usually
    use it inside its own bracket section to make it more obvious.
       { std::scoped_lock m(mutex);
         // .. critical section
       }

    I'm not a big fan of c/c++ using acquire and release memory order
    directives on everything since apart from a few situations it's
    not intuitively obvious what they do in all cases.  You can
    look a compiler assembler output but you have to be real careful
    generalizing from what you see.

    The release on the unlock can allow some following stores and things to
    sort of "bubble up before it?

    Acquire and release confines things to the "critical section", the
    release can allow for some following things to go above it, so to speak. >>> This is making me think of Alex over on c.p.t. !

    This sounds dangerous if the thing allowed to go above it is unCacheable
    while the lock:release is cacheable, the cacheable lock can arrive at
    another core before the unCacheable store arrives at its destination.

    Humm... Need to ponder on that. Wrt the sparc:

    membar #LoadStore | #StoreStore

    can allow following stores to bubble up before it. If we want to block
    that then we would use a #StoreLoad. However, a #StoreLoad is not
    required for unlocking a mutex.

    It is the cacheable locks covering unCacheable data that got MOESI
    protocol in trouble (SPARC V8 era). MESI does not have this kind
    of problem. {{SuperSPARC MESI did not have this problem because
    writes to memory (via SNOOP hits) were slow, Ross MOESI did have
    this problem because cache-to-cache transfers (SNOOP hit) were as
    few as 6 cycles.}}

    SO, what kind of barriers a relaxed memory model needs becomes
    dependent on the cache coherency model !?!?!?! How is software
    going to deal with that ?!? It then becomes dependent on the
    memory order model, as a cascade of Oh-crap-what-have-I-done-to-
    myself ...

    It is stuff like this that led My 66000 to alter memory models
    as it accesses memory and to mandate that all critical sections
    are denoted (.lock) at the beginning and end of the ATOMIC event.
    Thus, the programmer gets the performance of relaxed memory ordering
    with the sanity of sequential consistency without programmer
    involvement.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Chris M. Thomasson on Fri Dec 20 14:39:03 2024
    Chris M. Thomasson wrote:
    On 12/19/2024 10:33 AM, MitchAlsup1 wrote:
    On Thu, 5 Dec 2024 7:44:19 +0000, Chris M. Thomasson wrote:

    On 12/4/2024 8:13 AM, jseigh wrote:
    On 12/3/24 18:37, Stefan Monnier wrote:
    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references >>>>>> are "more special" than normal--when languages give few mechanisms. >>>>>
    We could start with something like

    critical_region {
    ...
    }

    such that the compiler must refrain from any code motion within
    those sections but is free to move things outside of those sections as >>>>> if
    execution was singlethreaded.


    C/C++11 already defines what lock acquire/release semantics are.
    Roughly you can move stuff outside of a critical section into it
    but not vice versa.

    Java uses synchronized blocks to denote the critical section.
    C++ (the society for using RAII for everything) has scoped_lock
    if you want to use RAII for your critical section. It's not
    always obvious what the actual critical section is. I usually
    use it inside its own bracket section to make it more obvious.
    { std::scoped_lock m(mutex);
    // .. critical section
    }

    I'm not a big fan of c/c++ using acquire and release memory order
    directives on everything since apart from a few situations it's
    not intuitively obvious what they do in all cases. You can
    look a compiler assembler output but you have to be real careful
    generalizing from what you see.

    The release on the unlock can allow some following stores and things to
    sort of "bubble up before it?

    Acquire and release confines things to the "critical section", the
    release can allow for some following things to go above it, so to speak. >>> This is making me think of Alex over on c.p.t. !

    This sounds dangerous if the thing allowed to go above it is unCacheable
    while the lock:release is cacheable, the cacheable lock can arrive at
    another core before the unCacheable store arrives at its destination.

    Humm... Need to ponder on that. Wrt the sparc:

    membar #LoadStore | #StoreStore

    can allow following stores to bubble up before it. If we want to block
    that then we would use a #StoreLoad. However, a #StoreLoad is not
    required for unlocking a mutex.

    I had an idea a few weeks back of a different way to do membars
    that should be more flexible and controllable (if that's a good thing)
    so I thought I'd toss it out there for comments.

    This hypothetical ISA has normal LD and ST instructions, to which I
    would add a LW Load for Write instruction to optimize moving shared lines between caches. There are also the Atomic Fetch and OP instructions
    AFADD, AFAND, AFOR, AFXOR, plus ASWAP and ACAS, LL Load Locked and
    SC Store Conditional, for various sizes of naturally aligned data,
    and with various address modes.

    Here is the new part:

    To the above instructions is added a 3-bit Coherence Group (CG) field.
    This allows one to specify different groups that various above data
    accesses belong to.

    The ISA has a membar instruction: MBG Memory Barrier for Group

    MBG has three fields:
    - one 4-bit field where each bit enables which operations this barrier
    applies to, in older-younger order: Load-Load, Load-Store, Store-Load,
    and Store-Store.
    - two 8-bit fields where each bit selects which sets of Coherence Group(s)
    this barrier applies to, one field for the older (before the membar) sets,
    one for the younger (after the membar) sets.

    Also the Load Store Queue is assumed to be self coherent - that loads
    and stores to the same address by a single core are performed in order,
    and that nothing can bypass a load or store with an unresolved address.

    The CG numbers are assigned by convention, probably by the OS designers
    when they define the ABI for this ISA.
    Here I assigned CG:0 to be thread normal access, CG:1 to be atomic items,
    CG:2 to be shared memory sections. The remaining 5 CG's can be used to
    indicate different shared memory sections if their locks can overlap.

    E.g. an MBG with op bits for Load-Load and Load-Store, with a before CG of 1
    and after CG's 3 and 4 would block all younger loads and stores in groups
    3 and 4 from starting execution until all older loads in group 1 completed. Loads and stores in all other groups are free to reorder, within the
    LSQ self coherence rules.
    An MBG with all op bits and all CG bits set is a full membar.

    Also if one is say juggling multiple shared sections with multiple
    spinlocks or mutexes, then one can use multiple membars applied to
    different groups to achieve specific bypassing blocking effects.

    An MBG instruction completes and retires when no older groups of
    selected loads or stores are incomplete.
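
    To restate the proposal in code (my own field layout; no encoding was
    given above, so treat this purely as a sketch):

    #include <cstdint>

    struct MBG {
        uint8_t ops;         // bit0 LL, bit1 LS, bit2 SL, bit3 SS
        uint8_t older_cgs;   // groups whose older accesses are waited on
        uint8_t younger_cgs; // groups whose younger accesses are held back
    };

    // The example above: hold back younger loads and stores in CG 3 and 4
    // until older loads in CG 1 have completed.
    constexpr MBG example     = { 0b0011, 1u << 1, (1u << 3) | (1u << 4) };

    // All op bits and all group bits set: a full memory barrier.
    constexpr MBG full_membar = { 0b1111, 0xFF, 0xFF };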

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)