• Memory ordering (was: Arguments for a sane ISA 6-years later)

    From Anton Ertl@21:1/5 to mitchalsup@aol.com on Mon Jul 29 13:21:10 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 26 Jul 2024 17:00:07 +0000, Anton Ertl wrote:
    Similarly, I expect that hardware that is designed for good TSO or
    sequential consistency performance will run faster on code written for
    this model than code written for weakly consistent hardware will run
    on that hardware.

    According to Lamport; only the ATOMIC stuff needs sequential
    consistency.
    So, it is completely possible to have a causally consistent processor
    that switches to sequential consistency when doing ATOMIC stuff and gain performance when not doing ATOMIC stuff, and gain programmability when
    doing atomic stuff.

    That's not what I have in mind. What I have in mind is hardware that,
    e.g., speculatively performs loads, predicting that no other core will
    store there with an earlier time stamp. But if another core actually
    performs such a store, the usual misprediction handling happens and
    the code starting from that mispredicted load is reexecuted. So as
    long as two cores do not access the same memory, they can run at full
    speed, and there is only slowdown if there is actual (not potential) communication between the cores.

    A problem with that approach is that this requires enough reorder
    buffering (or something equivalent, there may be something cheaper for
    this particular problem) to cover at least the shared-cache latency
    (usually L3, more with multiple sockets).

    That's because software written for weakly
    consistent hardware often has to insert barriers or atomic operations
    just in case, and these operations are slow on hardware optimized for
    weak consistency.

    The operations themselves are not slow.

    Citation needed.

    By contrast, one can design hardware for strong ordering such that the
    slowness occurs only in those cases when actual (not potential)
    communication between the cores happens, i.e., much less frequently.

    How would you do this for a 256-way banked memory system of the
    NEC SX ?? I.E., the processor is not in charge of memory order--
    the memory system is.

    Memory consistency is defined wrt what several processors do. Some
    processor performs some reads and writes and another performs some
    read and writes, and memory consistency defines what a processor sees
    about what the other does, and what ends up in main memory. But as
    long as the processors, their caches, and their interconnect gets the
    memory ordering right, the main memory is just the backing store that eventually gets a consistent result of what the other components did.
    So it does not matter whether the main memory has one bank or 256.

    One interesting aspect is that for supercomputers I generally think
    that they have not yet been struck by the software crisis:
    Supercomputer hardware is more expensive than supercomputer software.
    So I expect that supercomputer hardware designers tend to throw
    complexity over the wall to the software people, and in many cases
    they do (the Cell broadband engine offers many examples of that).
    However, "some ... Fujitsu [ARM] CPUs run with TSO at all times" <https://lwn.net/Articles/970907/>; that sounds like the A64FX, a
    processor designed for supercomputing. So apparently in this case the
    hardware designers actually accepted the hardware and design
    complexity cost of TSO and gave a better model to software, even in
    hardware designed for a supercomputer.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to Chris M. Thomasson on Mon Jul 29 14:10:26 2024
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 7/26/2024 10:00 AM, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    [...]
    By contrast, one can design hardware for strong ordering such that the
    slowness occurs only in those cases when actual (not potential)
    communication between the cores happens, i.e., much less frequently.

    and sometimes use cases do not care if they encounter "stale" data.

    Great. Unless these "sometimes" cases are more often than the cases
    where you perform some atomic operation or barrier because of
    potential, but not actual communication between cores, the weak model
    is still slower than a well-implemented strong model.

    A strong model? You mean I don't have to use any memory barriers at all?

    In the paragraph above I had a weaker model and a stronger model in
    mind, where the stronger model needs fewer barriers and atomic
    operations than the weak model.

    But sure, sequential consistency does not require any memory barriers,
    so that would be the ultimate strong model, and, if implemented
    efficiently, would beat the weaker TSO, just as an efficiently
    implemented TSO beats weaker models.

    Tell that to SPARC in RMO mode...

    What about it? If you mean that SPARC in TSO mode is slower, that may
    be the case. That's why I specified "well-implemented" above.

    Even the x86 requires a
    membar when a store followed by a load to another location shall be
    respected wrt order.

    So "the x86" (whatever you mean by that) is not sequentially
    consistent.
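    A minimal C++11 sketch of the store-buffering case in question
    (illustrative only; the variable names and harness are invented, not
    from the thread). With sequentially consistent atomics the outcome
    r1 == 0 && r2 == 0 cannot occur; on x86 the compiler has to emit an
    MFENCE or a locked operation to exclude it, because a plain store
    followed by a plain load to another location may be reordered:

        // Store-buffering litmus test (sketch).
        #include <atomic>
        #include <cstdio>
        #include <thread>

        std::atomic<int> x{0}, y{0};
        int r1, r2;

        void t1() {
            x.store(1, std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_seq_cst);
        }

        void t2() {
            y.store(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);
        }

        int main() {
            std::thread a(t1), b(t2);
            a.join(); b.join();
            std::printf("r1=%d r2=%d\n", r1, r2);  // r1==0 && r2==0 excluded by seq_cst
        }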

    I
    thought it was easier for a HW guy to implement weak consistency? At the
    cost of the increased complexity wrt programming the sucker! ;^)

    Yes, that's why the hardware people love to give us weak consistency.
    It's our job to say no and tell them to finish the job.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From MitchAlsup1@21:1/5 to Anton Ertl on Mon Jul 29 17:49:19 2024
    On Mon, 29 Jul 2024 13:21:10 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 26 Jul 2024 17:00:07 +0000, Anton Ertl wrote:
    Similarly, I expect that hardware that is designed for good TSO or
    sequential consistency performance will run faster on code written for
    this model than code written for weakly consistent hardware will run
    on that hardware.

    According to Lamport; only the ATOMIC stuff needs sequential
    consistency.
    So, it is completely possible to have a causally consistent processor
    that switches to sequential consistency when doing ATOMIC stuff and gain performance when not doing ATOMIC stuff, and gain programmability when doing atomic stuff.

    That's not what I have in mind. What I have in mind is hardware that,
    e.g., speculatively performs loads, predicting that no other core will
    store there with an earlier time stamp. But if another core actually performs such a store, the usual misprediction handling happens and
    the code starting from that mispredicted load is reexecuted. So as
    long as two cores do not access the same memory, they can run at full
    speed, and there is only slowdown if there is actual (not potential) communication between the cores.

    OK...

    A problem with that approach is that this requires enough reorder
    buffering (or something equivalent, there may be something cheaper for
    this particular problem) to cover at least the shared-cache latency
    (usually L3, more with multiple sockets).

    The depth of the execution window may be smaller than the time it takes
    to send the required information around and have this core recognize
    that it is out-of-order wrt memory.

    That's because software written for weakly
    consistent hardware often has to insert barriers or atomic operations
    just in case, and these operations are slow on hardware optimized for
    weak consistency.

    The operations themselves are not slow.

    Citation needed.

    A MEMBAR dropped into the pipeline, when nothing is speculative, takes
    no more time than an integer ADD. Only when there is speculation does
    it have to take time to relax the speculation.

    By contrast, one can design hardware for strong ordering such that the
    slowness occurs only in those cases when actual (not potential)
    communication between the cores happens, i.e., much less frequently.

    How would you do this for a 256-way banked memory system of the
    NEC SX ?? I.E., the processor is not in charge of memory order--
    the memory system is.

    Memory consistency is defined wrt what several processors do. Some
    processor performs some reads and writes and another performs some
    read and writes, and memory consistency defines what a processor sees
    about what the other does, and what ends up in main memory. But as
    long as the processors, their caches, and their interconnect gets the
    memory ordering right, the main memory is just the backing store that eventually gets a consistent result of what the other components did.
    So it does not matter whether the main memory has one bank or 256.

    NEC SX is a multi-processor vector machine with the property that
    addresses are spewed out as fast as AGEN can perform. These addresses
    are routed to banks based on bus-segment and can arrive OoO wrt
    how they were spewed out.

    So two processors accessing the same memory using vector LDs will
    see a single vector having multiple memory orderings. P[0]V[0] ordered
    before P[1]V[0] but P[1]V[1] ordered before P[0]V[1], ...

    One interesting aspect is that for supercomputers I generally think
    that they have not yet been struck by the software crisis:
    Supercomputer hardware is more expensive than supercomputer software.
    So I expect that supercomputer hardware designers tend to throw
    complexity over the wall to the software people, and in many cases
    they do (the Cell broadband engine offers many examples of that).
    However, "some ... Fujitsu [ARM] CPUs run with TSO at all times" <https://lwn.net/Articles/970907/>; that sounds like the A64FX, a
    processor designed for supercomputing. So apparently in this case the hardware designers actually accepted the hardware and design
    complexity cost of TSO and gave a better model to software, even in
    hardware designed for a supercomputer.

    That may be true, but I always saw it as "Supercomputers run
    applications for which they have source code"

    - anton

  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Mon Jul 29 20:43:47 2024
    On Mon, 29 Jul 2024 20:08:10 +0000, Chris M. Thomasson wrote:

    On 7/29/2024 10:49 AM, MitchAlsup1 wrote:
    [...]

    A MEMBAR dropped into the pipeline, when nothing is speculative, takes
    no more time than an integer ADD. Only when there is speculation does
    it have to take time to relax the speculation.

    I am wondering if you were ever aware of the "danger zone" wrt putting a MEMBAR instruction in a memory delay slot over on the SPARC? The docs explicitly said not to do it. I guess it can cause some interesting
    memory order "issues" that might allow a running system to last for say,
    five years before crashing at a random time... Yikes!

    Which, btw, is why one should exhaustively test atomic codes before
    putting them into production. You do not want to chase down a memory ordering
    issue that occurs, in production, less than once a month.
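    A minimal C++11 sketch of the kind of brute-force test meant here
    (illustrative only; the CAS-increment is an invented stand-in for the
    atomic code under test, and a real harness would also vary thread
    counts and run on the weakest-ordered machine available):

        // Stress test for a small atomic code path (sketch).
        #include <atomic>
        #include <cassert>
        #include <thread>
        #include <vector>

        std::atomic<long> counter{0};

        void hammer(int iters) {
            for (int i = 0; i < iters; i++) {
                long old = counter.load(std::memory_order_relaxed);
                // The code path being exercised: a CAS-increment loop.
                while (!counter.compare_exchange_weak(old, old + 1,
                                                      std::memory_order_relaxed))
                    ;  // compare_exchange_weak reloads 'old' on failure
            }
        }

        int main() {
            const int nthreads = 8, iters = 1000000;
            std::vector<std::thread> ts;
            for (int t = 0; t < nthreads; t++) ts.emplace_back(hammer, iters);
            for (auto &t : ts) t.join();
            assert(counter.load() == (long)nthreads * iters);  // invariant check
        }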

    I did a lot of programming on SPARCs and never needed a MEMBAR....
    but the OS guys used them like they were "free".

    All sorts of things were dangerous in the delay slot of a branch,
    not the least of which was another branch.

    Prior to the introduction of multi-threaded cores, the rounding
    modes of the FPU status register were hard wired to the FU.
    Afterwards, they became "just another piece of state" that got
    pipelined down instruction execution.

    [...]

  • From Michael S@21:1/5 to mitchalsup@aol.com on Tue Jul 30 12:47:42 2024
    On Mon, 29 Jul 2024 20:43:47 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On Mon, 29 Jul 2024 20:08:10 +0000, Chris M. Thomasson wrote:

    On 7/29/2024 10:49 AM, MitchAlsup1 wrote:
    [...]

    A MEMBAR dropped into the pipeline, when nothing is speculative,
    takes no more time than an integer ADD. Only when there is
    speculation does it have to take time to relax the speculation.

    I am wondering if you were ever aware of the "danger zone" wrt
    putting a MEMBAR instruction in a memory delay slot over on the
    SPARC? The docs explicitly said not to do it. I guess it can cause
    some interesting memory order "issues" that might allow a running
    system to last for say, five years before crashing at a random
    time... Yikes!

    Which, btw, is why one should exhaustively test atomic codes before
    putting them into production. You do not want to chase down a memory ordering
    issue that occurs, in production, less than once a month.

    I did a lot of programming on SPARCs and never needed a MEMBAR....
    but the OS guys used them like they were "free".


    Did you do programming on SPARC in RMO mode?
    According to my understanding, RMO mode existed only on a couple of
    SPARC CPU models (UltraSparc and UltraSparc-II) and was activated only
    by OSes other than Solaris. Possibly, only by Linux.

    All sorts of things were dangerous in the delay slot of a branch,
    not the least of which was another branch.


    Chris M. Thomasson wrote "memory delay slot".
    I wonder what's meant by that. According to my understanding, on SPARC,
    unlike on early MIPS, the load delay slot was always merely an
    implementation detail rather than a SW-visible architectural feature.

    Prior to the introduction of multi-threaded cores, the rounding
    modes of the FPU status register were hard wired to the FU.
    Afterwards, they became "just another piece of state" that got
    pipelined down instruction execution.

    [...]

  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Tue Jul 30 09:51:46 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 29 Jul 2024 13:21:10 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 26 Jul 2024 17:00:07 +0000, Anton Ertl wrote:
    Similarly, I expect that hardware that is designed for good TSO or
    sequential consistency performance will run faster on code written for this model than code written for weakly consistent hardware will run
    on that hardware.

    According to Lamport; only the ATOMIC stuff needs sequential
    consistency.
    So, it is completely possible to have a causally consistent processor that switches to sequential consistency when doing ATOMIC stuff and gain performance when not doing ATOMIC stuff, and gain programmability when doing atomic stuff.

    That's not what I have in mind. What I have in mind is hardware that,
    e.g., speculatively performs loads, predicting that no other core will
    store there with an earlier time stamp. But if another core actually
    performs such a store, the usual misprediction handling happens and
    the code starting from that mispredicted load is reexecuted. So as
    long as two cores do not access the same memory, they can run at full
    speed, and there is only slowdown if there is actual (not potential)
    communication between the cores.

    OK...

    A problem with that approach is that this requires enough reorder
    buffering (or something equivalent, there may be something cheaper for
    this particular problem) to cover at least the shared-cache latency
    (usually L3, more with multiple sockets).

    The depth of the execution window may be smaller than the time it takes
    to send the required information around and have this core recognize
    that it is out-of-order wrt memory.

    So if we don't want to stall for memory accesses all the time, we need
    a bigger execution window, either by making the reorder buffer larger,
    or by using a different, cheaper mechanism.

    Concerning the cheaper mechanism, what I am thinking of is hardware checkpointing every, say, 200 cycles or so (subject to fine-tuning).
    The idea here is that communication between cores is very rare, so
    rolling back more cycles than the minimal necessary amount costs
    little on average (except that it looks bad on cache ping-pong microbenchmarks). The cost of such a checkpoint is (at most) the
    number of architectural registers, plus the aggregated stores between
    the checkpoint and the next one. Once the global time reaches the
    timestamp of the checkpoint N+1 of the core, checkpoint N of the core
    can be released (i.e. all its instructions committed) and all its
    stores can be committed (and checked against speculative loads in other
    cores). If it turns out that an uncommitted load's result has been
    changed by a store committed by another core, a rollback to the latest checkpoint before the load happens, and the program is re-executed
    starting from that checkpoint.
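    A rough sketch of what such a per-core checkpoint would have to
    record, following the description above (C++ used only as a modelling
    notation; the type and field names are invented):

        // Model of the checkpoint state described above (sketch).
        #include <cstdint>
        #include <vector>

        constexpr int NUM_ARCH_REGS = 32;        // assumed register count

        struct Store { uint64_t addr, value; };  // an aggregated, not-yet-committed store

        struct Checkpoint {
            uint64_t timestamp;                  // global time when it was taken
            uint64_t arch_regs[NUM_ARCH_REGS];   // architectural register snapshot
            std::vector<Store> pending_stores;   // stores between this checkpoint and the next
        };

        // Checkpoint N is released once global time reaches checkpoint N+1's
        // timestamp: its instructions commit and its pending stores become
        // visible (and are checked against speculative loads in other cores).
        // If a committed remote store changes the result of an uncommitted
        // local load, the core rolls back to the latest checkpoint taken
        // before that load and re-executes from there.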

    Daya et al [daya+14] have already implemented sequential consistency
    in their 36-core research chip, with similar ideas (that inspired my
    statement above) and much more detail (that makes it hard to see the
    grand scheme of things IIRC).

    @InProceedings{daya+14,
    author = {Bhavya K. Daya and Chia-Hsin Owen Chen and Suvinay
    Subramanian and Woo-Cheol Kwon and Sunghyun Park and
    Tushar Krishna and Jim Holt and Anantha
    P. Chandrakasan and Li-Shiuan Peh},
    title = {{SCORPIO}: A 36-Core Research-Chip Demonstrating
    Snoopy Coherence on a Scalable Mesh {NoC} with
    In-Network Ordering},
    crossref = {isca14},
    OPTpages = {},
    url = {http://projects.csail.mit.edu/wiki/pub/LSPgroup/PublicationList/scorpio_isca2014.pdf},
    annote = {The cores on the chip described in this paper access
    their shared memory in a sequentially consistent
    manner; what's more, the chip provides a significant
    speedup in comparison to the distributed directory
    and HyperTransport coherence protocols. The main
    idea is to deal with the ordering separately from
    the data, in a distributed way. The ordering
    messages are relatively small (one bit per core).
    For details see the paper.}
    }

    @Proceedings{isca14,
    title = "$41^\textit{st}$ Annual International Symposium on Computer Architecture",
    booktitle = "$41^\textit{st}$ Annual International Symposium on Computer Architecture",
    year = "2014",
    key = "ISCA 2014",
    }

    The operations themselves are not slow.

    Citation needed.

    A MEMBAR dropped into the pipeline, when nothing is speculative, takes
    no more time than an integer ADD. Only when there is speculation does
    it have to take time to relax the speculation.

    Not sure what kind of speculation you mean here. On in-order cores
    like the non-Fujitsu SPARCs from before about 2010 memory barriers are expensive AFAIK, even though there is essentially no branch
    speculation on in-order cores.

    Of course, if you mean speculation about the order of loads and
    stores, yes, if you don't have such speculation, the memory barriers
    are fast, but then loads are extremely slow.

    Memory consistency is defined wrt what several processors do. Some
    processor performs some reads and writes and another performs some
    read and writes, and memory consistency defines what a processor sees
    about what the other does, and what ends up in main memory. But as
    long as the processors, their caches, and their interconnect gets the
    memory ordering right, the main memory is just the backing store that
    eventually gets a consistent result of what the other components did.
    So it does not matter whether the main memory has one bank or 256.

    NEC SX is a multi-processor vector machine with the property that
    addresses are spewed out as fast as AGEN can perform. These addresses
    are routed to banks based on bus-segment and can arrive OoO wrt
    how they were spewed out.

    So two processors accessing the same memory using vector LDs will
    see a single vector having multiple memory orderings. P[0]V[0] ordered
    before P[1]V[0] but P[1]V[1] ordered before P[0]V[1], ...

    As long as no stores happen, who cares about the order of the loads?
    When stores happen, the loads are ordered wrt these stores (with
    stronger memory orderings giving more guarantees). So the number of
    memory banks does not matter for implementing a strong ordering
    efficiently.

    The thinking about memory banks etc. comes when you approach the
    problem from the other direction: You have some memory subsystem that
    by itself gives you no consistency guarantees whatsoever, and then you
    think about what's the minimum you can do to make it actually useful
    for inter-core communication. And then you write up a paper like

    @TechReport{adve&gharachorloo95,
    author = {Sarita V. Adve and Kourosh Gharachorloo},
    title = {Shared Memory Consistency Models: A Tutorial},
    institution = {Digital Western Research Lab},
    year = {1995},
    type = {WRL Research Report},
    number = {95/7},
    annote = {Gives an overview of architectural features of
    shared-memory computers such as independent memory
    banks and per-CPU caches, and how they make the (for
    programmers) most natural consistency model hard to
    implement, giving examples of programs that can fail
    with weaker consistency models. It then discusses
    several categories of weaker consistency models and
    actual consistency models in these categories, and
    which ``safety net'' (e.g., memory barrier
    instructions) programmers need to use to work around
    the deficiencies of these models. While the authors
    recognize that programmers find it difficult to use
    these safety nets correctly and efficiently, it
    still advocates weaker consistency models, claiming
    that sequential consistency is too inefficient, by
    outlining an inefficient implementation (which is of
    course no proof that no efficient implementation
    exists). Still the paper is a good introduction to
    the issues involved.}
    }

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From MitchAlsup1@21:1/5 to Anton Ertl on Tue Jul 30 23:27:11 2024
    On Tue, 30 Jul 2024 9:51:46 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 29 Jul 2024 13:21:10 +0000, Anton Ertl wrote:
    A problem with that approach is that this requires enough reorder
    buffering (or something equivalent, there may be something cheaper for
    this particular problem) to cover at least the shared-cache latency
    (usually L3, more with multiple sockets).

    The depth of the execution window may be smaller than the time it takes
    to send the required information around and have this core recognize
    that it is out-of-order wrt memory.

    So if we don't want to stall for memory accesses all the time, we need
    a bigger execution window, either by making the reorder buffer larger,
    or by using a different, cheaper mechanism.

    Mc 88120 had a 96-wide execution window, which could be filled up in
    16 cycles (optimistically) and was filled up in 32 cycles (average).
    Given that DRAM is not going to be less than 20 ns and a 5GHz core,
    the execution window is 1/3rd that which would be required to absorb
    a cache miss all the way to DRAM. Add in 12-ish cycles for L2, 10-
    cycles for transporting L2 miss to Memory controller, 5-cycles between
    memory controller and DRAM controller-----and sooner or later it gets
    hard. So nobody tries to make the execution window big enough to do
    that (actually the Mc88120 was designed for ECL bus technology and was only 100 MHz,
    so its puny 16-cycle EW actually was big enough to absorb a cache
    miss all the way to DRAM.....but that is for another day.)
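    (Spelling out the arithmetic behind the 1/3: 20 ns at 5 GHz is 100
    cycles, and a window that fills in about 32 cycles covers roughly a
    third of that; the L2, transport, and DRAM-controller cycles listed
    above push the required depth out even further.)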

    Concerning the cheaper mechanism, what I am thinking of is hardware checkpointing every, say, 200 cycles or so (subject to fine-tuning).
    The idea here is that communication between cores is very rare, so
    rolling back more cycles than the minimal necessary amount costs
    little on average (except that it looks bad on cache ping-pong microbenchmarks).

    You lost me::

    Colloquially, there are 2 uses of the word checkpointing:: a) what
    HW does each time it inserts a branch into the EW, b) what an OS or
    application does to be able to recover from a crash (from any
    mechanism).

    Neither is used to describe interactions between cores.

    <snip>

    The operations themselves are not slow.

    Citation needed.

    A MEMBAR dropped into the pipeline, when nothing is speculative, takes
    no more time than an integer ADD. Only when there is speculation does
    it have to take time to relax the speculation.

    Not sure what kind of speculation you mean here. On in-order cores
    like the non-Fujitsu SPARCs from before about 2010 memory barriers are expensive AFAIK, even though there is essentially no branch
    speculation on in-order cores.

    You dropped 64 instructions into the EW, and AGEN performs 15 address generations in the order permitted by operand arrival. These addresses
    are routed to the L1s to determine who hits and who misses--all OoO.

    Thus, the Addresses are only in Operand order and they touched the
    caches in operand order.

    There could be branch mispredictions in EW also causing many of the
    AGENs to get thrown away after the branch is discovered to be poorly
    predicted.

    And on top of all of this several FP instructions may have raised
    exceptions.

    EW pipelines are generally designed to "sort all this stuff out at
    retirement"; occasionally, memory ordering issues are sorted out
    prior to retirement by replaying OoO memory references.

    Of course, if you mean speculation about the order of loads and
    stores, yes, if you don't have such speculation, the memory barriers
    are fast, but then loads are extremely slow.

    A MEMBAR requires the memory order to catch up to the current point
    before adding new AGENs to the problem space. If the memory order
    is already SC then MEMBAR has nothing to do and is pushed through
    the pipeline without delay.

    So, the delay has to do with catching up with memory order not in
    pushing the MEMBAR through the pipeline.


    Memory consistency is defined wrt what several processors do. Some
    processor performs some reads and writes and another performs some
    read and writes, and memory consistency defines what a processor sees
    about what the other does, and what ends up in main memory. But as
    long as the processors, their caches, and their interconnect gets the
    memory ordering right, the main memory is just the backing store that
    eventually gets a consistent result of what the other components did.
    So it does not matter whether the main memory has one bank or 256.

    NEC SX is a multi-processor vector machine with the property that
    addresses are spewed out as fast as AGEN can perform. These addresses
    are routed to banks based on bus-segment and can arrive OoO wrt
    how they were spewed out.

    So two processors accessing the same memory using vector LDs will
    see a single vector having multiple memory orderings. P[0]V[0] ordered before P[1]V[0] but P[1]V[1] ordered before P[0]V[1], ...

    As long as no stores happen, who cares about the order of the loads?
    When stores happen, the loads are ordered wrt these stores (with
    stronger memory orderings giving more guarantees). So the number of
    memory banks does not matter for implementing a strong ordering
    efficiently.

    Then consider 2 Vector processors performing 2 STs (1 each) to
    non-overlapping addresses but with bank aliasing. Consider that
    the STs are scatter based and the bank conflicts random. There
    is no way to determine which store happened first or which
    element of each vector store happened first.

    - anton

  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Thu Aug 1 15:54:55 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 30 Jul 2024 9:51:46 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    The depth of the execution window may be smaller than the time it takes to send the required information around and have this core recognize
    that it is out-of-order wrt memory.

    So if we don't want to stall for memory accesses all the time, we need
    a bigger execution window, either by making the reorder buffer larger,
    or by using a different, cheaper mechanism.

    Mc 88120 had a 96-wide execution window, which could be filled up in
    16 cycles (optimistically) and was filled up in 32 cycles (average).
    Given that DRAM is not going to be less than 20 ns and a 5GHz core,
    the execution window is 1/3rd that which would be required to absorb
    a cache miss all the way to DRAM.

    Relevant numbers for current cores are 400-600 instructions in the
    reorder buffer, 6-8 instructions per cycle, and the core-to-core
    latency is (for Bergamo) 30-40ns (90-120 cycles) within a CCX,
    100-120ns (300-360 cycles) within a socket, with a 212ns (636 cycles)
    worst case across sockets (data from <https://chipsandcheese.com/2024/06/22/testing-amds-bergamo-zen-4c-spam/>);
    I computed with 3GHz clock rate, which fits Bergamo. On the fast
    desktop chips the clock rate is higher, but the latency is lower in
    both ns and cycles (in particular, no dual-socket penalty); e.g., on
    the Ryzen 7950X the core-to-core latency is <80ns (456 cycles) <https://images.anandtech.com/doci/17585/AMD%20Ryzen%209%207950X%20Core%20to%20Core%20Latency%20Final.jpg>.

    The reorder buffers (and integer register files and store buffers)
    would need to be even larger to cover the 456 or 636 cycles (and maybe
    even more than that latency is needed before you are sure that a load
    or store is sequentially consistent), or alternatively one would need
    a cheaper mechanism.
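    (Working those numbers through: at 6-8 instructions per cycle,
    covering 456 cycles means roughly 2700-3600 instructions in flight,
    and 636 cycles roughly 3800-5100, against the 400-600 entries of
    current reorder buffers.)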

    Concerning the cheaper mechanism, what I am thinking of is hardware
    checkpointing every, say, 200 cycles or so (subject to fine-tuning).
    The idea here is that communication between cores is very rare, so
    rolling back more cycles than the minimal necessary amount costs
    little on average (except that it looks bad on cache ping-pong
    microbenchmarks).

    You lost me::

    Colloquially, there are 2 uses of the word checkpointing:: a) what
    HW does each time it inserts a branch into the EW, b) what an OS or application does to be able to recover from a crash (from any
    mechanism).

    What is "EW"?

    Anyway, here checkpointing would be a hardware mechanism that allows
    rolling back to the state at the point when the checkpoint was made.

    A MEMBAR requires the memory order to catch up to the current point
    before adding new AGENs to the problem space. If the memory order
    is already SC then MEMBAR has nothing to do and is pushed through
    the pipeline without delay.

    Yes, that's the slow implementation. The fast implementation is to
    implement sequential consistency all the time (by predicting and
    speculating that memory accesses do not interfere with those of other
    cores, and recovering from that speculation when the speculation turns
    out to be wrong). In such an implementation memory barriers are noops
    (and thus fast), because the hardware already provides sequential
    consistency.

    Then consider 2 Vector processors performing 2 STs (1 each) to
    non-overlapping addresses but with bank aliasing. Consider that
    the STs are scatter based and the bank conflicts random. There
    is no way to determine which store happened first or which
    element of each vector store happened first.

    It's up to the architecture to define the order of stores and loads of
    a given core. For sequential consistency you then interleave the
    sequences coming from the cores in some convenient order. It does not
    matter what happens earlier in some inertial system. It only matters
    what your hardware decides should be treated as being earlier. The
    hardware has a lot of freedom here, but the end result as visible to
    the cores must be sequentially consistent (or, with a weaker memory
    consistency model, consistent with that model).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Thu Aug 1 17:33:05 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 30 Jul 2024 9:51:46 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    The depth of the execution window may be smaller than the time it takes to send the required information around and have this core recognize that it is out-of-order wrt memory.

    So if we don't want to stall for memory accesses all the time, we need
    a bigger execution window, either by making the reorder buffer larger,
    or by using a different, cheaper mechanism.


    Colloquially, there are 2 uses of the word checkpointing:: a) what
    HW does each time it inserts a branch into the EW, b) what an OS or application does to be able to recover from a crash (from any
    mechanism).

    What is "EW"?

    Execution Window (referred to above)

  • From MitchAlsup1@21:1/5 to Anton Ertl on Thu Aug 1 19:57:24 2024
    On Thu, 1 Aug 2024 15:54:55 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 30 Jul 2024 9:51:46 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:

    A MEMBAR requires the memory order to catch up to the current point
    before adding new AGENs to the problem space. If the memory order
    is already SC then MEMBAR has nothing to do and is pushed through
    the pipeline without delay.

    Yes, that's the slow implementation. The fast implementation is to
    implement sequential consistency all the time (by predicting and
    speculating that memory accesses do not interfere with those of other
    cores, and recovering from that speculation when the speculation turns
    out to be wrong). In such an implementation memory barriers are noops
    (and thus fast), because the hardware already provides sequential consistency.

    Why does SC need any MEMBARs ??

    Then consider 2 Vector processors performing 2 STs (1 each) to
    non-overlapping addresses but with bank aliasing. Consider that
    the STs are scatter based and the bank conflicts random. There
    is no way to determine which store happened first or which
    element of each vector store happened first.

    It's up to the architecture to define the order of stores and loads of
    a given core. For sequential consistency you then interleave the
    sequences coming from the cores in some convenient order.

    Insufficient:: If an OoO processor orders LDs and STs as they leave AGEN
    you cannot just interleave multiple core access streams and achieve
    sequential consistency.

    It does not
    matter what happens earlier in some inertial system. It only matters
    what your hardware decides should be treated as being earlier. The

    Causal consistency is determined by arrival order at the memory controller.
    Cache consistency is enforced by allowing a single line to be "in progress"
    only once in the system--each cache line is serially reusable, while
    other cache lines remain unordered wrt that one.

    hardware has a lot of freedom here, but the end result as visible to
    the cores must be sequentially consistent (or, with a weaker memory consistency model, consistent with that model).

    - anton

  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Fri Aug 2 08:14:21 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 1 Aug 2024 15:54:55 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 30 Jul 2024 9:51:46 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:

    A MEMBAR requires the memory order to catch up to the current point before adding new AGENs to the problem space. If the memory order
    is already SC then MEMBAR has nothing to do and is pushed through
    the pipeline without delay.

    Yes, that's the slow implementation. The fast implementation is to
    implement sequential consistency all the time (by predicting and
    speculating that memory accesses do not interfere with those of other
    cores, and recovering from that speculation when the speculation turns
    out to be wrong). In such an implementation memory barriers are noops
    (and thus fast), because the hardware already provides sequential
    consistency.

    Why does SC need any MEMBARs ??

    A program written for sequential consistency does not need them. But
    if you have a program written for a weaker memory model, the memory
    barriers in that program will be noops and therefore really cheap.

    Then consider 2 Vector processors performing 2 STs (1 each) to
    non-overlapping addresses but with bank aliasing. Consider that
    the STs are scatter based and the bank conflicts random. There
    is no way to determine which store happened first or which
    element of each vector store happened first.

    It's up to the architecture to define the order of stores and loads of
    a given core. For sequential consistency you then interleave the
    sequences coming from the cores in some convenient order.

    Insufficient:: If an OoO processor orders LDs and STs as they leave AGEN
    you cannot just interleave multiple core access streams and achieve sequential consistency.

    Architecture is defined in the architecture manual. Implementation
    concepts like OoO and AGEN don't (or shouldn't) play a role there.
    WRT memory ordering most architectures define clearly what happens
    (for single-threaded programs), i.e., loads and stores happen exactly
    in the architectural execution order of the instructions, and they
    actually implement that, for single threaded programs.

    Then they take back some of these guarantees for multi-processing, and
    add some instructions (memory barriers, lock prefixes, etc.) to
    reestablish these guarantees when needed, in an expensive way.

    Sequential consistency is what you get if you do not take back these guarantees.

    Concerning vector instructions, what do architectures say about the
    memory order here? An ideal would be if they were treated as atomic,
    i.e., a read access is all performed after any earlier and before any
    later memory access in the stream of executed instructions. But even
    without multi-processing this tends to be inefficient, and has
    problems with page faults and the number of necessary pages in memory
    at the same time, especially with gather/scatter accesses and very
    long vector memory-memory instructions as on the NEC SX (IIRC). But
    of course, the NEC SX is a supercomputer architecture, a certain
    amount of architectural nonsense is not unusual there.

    Given such difficulties, vector instructions, at least with gather
    loads and scatter stores (whether strided or indirect), are not a good
    idea (and a recent Intel hardware vulnerability shows another reason
    why gather is not a good idea). Your VVM OTOH allows a clean
    architectural definition.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Michael S@21:1/5 to Chris M. Thomasson on Fri Nov 15 15:24:59 2024
    On Fri, 15 Nov 2024 03:17:22 -0800
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> wrote:

    On 11/14/2024 11:25 PM, Anton Ertl wrote:
    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification? I certainly think that Alpha's lack
    of guarantees in memory ordering is a bad idea, and so is ARM's:
    "It's only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously? Sequential consistency can be specified in one sentence: "The result
    of any execution is the same as if the operations of all the
    processors were executed in some sequential order, and the
    operations of each individual processor appear in this sequence in
    the order specified by its program."
    [...]


    Well, iirc, the Alpha is the only system that requires an explicit
    membar for an RCU-based algorithm. Even SPARC in RMO mode does not
    need this. Iirc, akin to memory_order_consume in C++:

    https://en.cppreference.com/w/cpp/atomic/memory_order

    data dependent loads


    Your response does not answer Anton's question.

  • From jseigh@21:1/5 to Anton Ertl on Fri Nov 15 11:08:29 2024
    On 11/15/2024 2:25 AM, Anton Ertl wrote:
    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification? I certainly think that Alpha's lack of guarantees in memory ordering is a bad idea, and so is ARM's: "It's
    only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. Seriously?
    Sequential consistency can be specified in one sentence: "The result
    of any execution is the same as if the operations of all the
    processors were executed in some sequential order, and the operations
    of each individual processor appear in this sequence in the order
    specified by its program."

    However, I don't think that the Alpha architects considered the Alpha
    memory ordering to be an error, and probably still don't, just like
    the ARM architects don't consider their memory model to be an error.
    I am pretty sure that no Alpha implementation ever made use of the
    lack of causality in the Alpha memory model, so they could have added causality without outlawing existing implementations. That they did
    not indicates that they thought that their memory model was right. An advocacy paper for weak memory models [adve&gharachorloo95] came from
    the same place as Alpha, so it's no surprise that Alpha specifies weak consistency.

    @TechReport{adve&gharachorloo95,
    author = {Sarita V. Adve and Kourosh Gharachorloo},
    title = {Shared Memory Consistency Models: A Tutorial},
    institution = {Digital Western Research Lab},
    year = {1995},
    type = {WRL Research Report},
    number = {95/7},
    annote = {Gives an overview of architectural features of
    shared-memory computers such as independent memory
    banks and per-CPU caches, and how they make the (for
    programmers) most natural consistency model hard to
    implement, giving examples of programs that can fail
    with weaker consistency models. It then discusses
    several categories of weaker consistency models and
    actual consistency models in these categories, and
    which ``safety net'' (e.g., memory barrier
    instructions) programmers need to use to work around
    the deficiencies of these models. While the authors
    recognize that programmers find it difficult to use
    these safety nets correctly and efficiently, it
    still advocates weaker consistency models, claiming
    that sequential consistency is too inefficient, by
    outlining an inefficient implementation (which is of
    course no proof that no efficient implementation
    exists). Still the paper is a good introduction to
    the issues involved.}
    }

    - anton

    Anybody doing that sort of programming, i.e. lock-free or distributed algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place. Strongly
    consistent memory won't help incompetence.

    Joe Seigh

  • From Anton Ertl@21:1/5 to jseigh on Fri Nov 15 17:27:37 2024
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Do you have any argument that supports this claim.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From MitchAlsup1@21:1/5 to BGB on Sat Nov 16 00:51:36 2024
    On Fri, 15 Nov 2024 23:35:22 +0000, BGB wrote:

    On 11/15/2024 4:05 PM, Chris M. Thomasson wrote:
    On 11/15/2024 12:53 PM, BGB wrote:
    On 11/15/2024 11:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed algorithms, who can't handle weakly consistent memory models, shouldn't be doing that sort of programming in the first place.

    Do you have any argument that supports this claim.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?


    In my case, as I see it:
       The tradeoff is more about implementation cost, performance, etc.

    Weak model:
       Cheaper (and simpler) to implement;
       Performs better when there is no need to synchronize memory;
       Performs worse when there is need to synchronize memory;
       ...
    [...]

    TSO obtained on top of a weak memory model is still TSO; it should not necessarily
    perform "worse" than other systems that have TSO as a default. The
    weaker models give us flexibility. Any weak memory model should be able
    to give sequential consistency by using the right membars in the right
    places.


    The speed difference is mostly that, in a weak model, the L1 cache
    merely needs to fetch memory from the L2 or similar, may write to it whenever, and need not proactively store back results.

    As I understand it, a typical TSO like model will require, say:
    Any L1 cache that wants to write to a cache line, needs to explicitly
    request write ownership over that cache line;

    The cache line may have been fetched from a core which modified the
    data, and handed this line directly to this requesting core on a
    typical read. So, it is possible for the line to show up with
    write permission even if the requesting core did not ask for write
    permission. So, not all lines being written have to request owner-
    ship.

    Any attempt by other cores to access this line,

    You are being rather loose with your time analysis in this question::

    Access this line before write permission has been requested,
    or
    Access this line after write permission has been requested but
    before it has arrived,
    or
    Access this line after write permission has arrived.

    may require the L2 cache
    to send a message to the core currently holding the cache line for
    writing to write back its contents, with the request unable to be
    handled until after the second core has written back the dirty cache
    line.

    L2 has to know something about how L1 has the line, and likely which
    core cache the data is in.

    This would create potential for significantly more latency in cases
    where multiple cores touch the same part of memory; albeit the cores
    will see each others' memory stores.

    One can ARGUE that this is a good thing as it makes latency part
    of the memory access model. More interfering accesses=higher
    latency.


    So, initially, weak model can be faster due to not needing any
    additional handling.


    But... Any synchronization points, such as a barrier or locking or
    releasing a mutex, will require manually flushing the cache with a weak model.

    Not necessarily:: My 66000 uses causal memory consistency, yet when
    an ATOMIC event begins it reverts to sequential consistency until
    the end of the event where it reverts back to causal. Use of MMI/O
    space reverts to sequential consistency, while access to config
    space reverts all the way back to strongly ordered.

    And, locking/releasing the mutex itself will require a mechanism
    that is consistent between cores (such as volatile atomic swaps or
    similar, which may still be weak as a volatile-atomic-swap would still
    not be atomic from the POV of the L2 cache; and an MMIO interface could
    be stronger here).
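    For reference, the conventional portable form of such a swap-based
    lock, as a C++11 sketch (illustrative; on cache-coherent hardware the
    coherence protocol moves the line and the acquire/release ordering on
    the flag stands in for any explicit cache flushing):

        // Minimal test-and-test-and-set spinlock (sketch).
        #include <atomic>

        struct SpinLock {
            std::atomic<bool> locked{false};

            void lock() {
                // Atomic swap; acquire ordering makes this critical section
                // see everything published before the previous unlock().
                while (locked.exchange(true, std::memory_order_acquire)) {
                    while (locked.load(std::memory_order_relaxed))
                        ;  // spin read-only instead of hammering the line with swaps
                }
            }

            void unlock() {
                // Release ordering publishes this critical section's stores
                // before the lock is seen as free by the next owner.
                locked.store(false, std::memory_order_release);
            }
        };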


    Seems like there could possibly be some way to skip some of the cache flushing if one could verify that a mutex is only being locked and
    unlocked on a single core.

    Issue then is how to deal with trying to lock a mutex which has thus far
    been exclusive to a single core. One would need some way for the core
    that last held the mutex to know that it needs to perform an L1 cache
    flush.

    This seems to be a job for Cache Consistency.

    Though, one possibility could be to leave this part to the OS scheduler/syscall/...

    The OS wants nothing to do with this.

    mechanism; so the core that wants to lock the
    mutex signals its intention to do so via the OS, and the next time the
    core that last held the mutex does a syscall (or tries to lock the mutex again), the handler sees this, then performs the L1 flush and flags the
    mutex as multi-core safe (at which point, the parties will flush L1s at
    each mutex lock, though possibly with a timeout count so that, if the
    mutex has been single-core for N locks, it reverts to single-core
    behavior).

    This could reduce the overhead of "frivolous mutex locking" in programs
    that are otherwise single-threaded or single processor (leaving the
    cache flushes for the ones that are in-fact being used for
    synchronization purposes).

    ....

  • From Anton Ertl@21:1/5 to Chris M. Thomasson on Sat Nov 16 07:37:44 2024
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/15/2024 9:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?

    For instance, a 100% sequential memory order won't help you with, say, solving ABA.

    Sure, not all problems are solved by sequential consistency, and yes,
    it won't solve race conditions like the ABA problem. But jseigh
    implied that anyone who finds it easier to write correct and efficient code for
    sequential consistency than for a weakly-consistent memory model
    (e.g., Alpha's memory model) is incompetent.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to BGB on Sat Nov 16 07:46:17 2024
    BGB <cr88192@gmail.com> writes:
    The tradeoff is more about implementation cost, performance, etc.

    Yes. And the "etc." includes "ease of programming".

    Weak model:
    Cheaper (and simpler) to implement;

    Yes.

    Performs better when there is no need to synchronize memory;

    Not in general. For a cheap multiprocessor implementation, yes. A sophisticated implementation of sequential consistency can just storm
    ahead in that case and achieve the same performance. It just has to
    keep checkpoints around in case that there is a need to synchronize
    memory.

    Performs worse when there is need to synchronize memory;

    With a cheap multiprocessor implementation, yes. In general, no: Any sequentially consistent implementation is also an implementation of
    every weaker memory model, and the memory barriers become nops in that
    kind of implementation. Ok, nops still have a cost, but it's very
    close to 0 on a modern CPU.

    Another potential performance disadvantage of sequential consistency
    even with a sophisticated implementation:

    If you have some algorithm that actually works correctly even when it
    gets stale data from a load (with some limits on the staleness), the sophisticated SC implementation will incur the latency coming from
    making the load non-stale while that latency will not occur or be less
    in a similarly-sophisticated implementation of an appropriate weak
    consistency model.

    However, given that the access to actually-shared memory is slow even
    on weakly-consistent hardware, software usually takes measures to
    avoid having a lot of such accesses, so that cost will usually be
    miniscule.
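    A typical example of such staleness-tolerant code, as a C++11 sketch
    (illustrative; the counter and function names are invented): worker
    threads bump a statistics counter with relaxed atomics, and a
    monitoring thread samples it, where a slightly stale value is
    perfectly acceptable:

        // Staleness-tolerant shared data: a statistics counter (sketch).
        #include <atomic>
        #include <cstdint>
        #include <cstdio>

        std::atomic<uint64_t> requests_served{0};

        // Workers: the increment must be atomic, but needs no ordering.
        void on_request() {
            requests_served.fetch_add(1, std::memory_order_relaxed);
        }

        // Monitor: a slightly stale value is fine, so nothing beyond the
        // atomic load itself is required.
        void report() {
            std::printf("served so far: %llu\n",
                        (unsigned long long)requests_served.load(std::memory_order_relaxed));
        }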


    What you missed: the big cost of weak memory models and cheap hardware implementations of them is in the software:

    * For correctness, the safe way is to insert a memory barrier between
    any two memory operations.

    * For performance (on cheap implementations of weak memory models) you
    want to execute as few memory barriers as possible.

    * You cannot use testing to find out whether you have enough (and the
    right) memory barriers. That's not only because the involved
    threads may not be in the right state during testing for uncovering
    the incorrectness, but also because the hardware used for testing
    may actually have stronger consistency than the memory model, and so
    some kinds of bugs will never show up in testing on that hardware,
    even when the threads reach the right state. And testing is still
    the go-to solution for software people to find errors (nowadays even
    glorified by continuous integration and modern fuzz testing
    approaches).

    The result is that a lot of software dealing with shared memory is
    incorrect because it does not have a memory barrier that it should
    have, or inefficient on cheap hardware with expensive memory barriers
    because it uses more memory barriers than necessary for the memory
    model. A program may even be incorrect in one place and have
    superfluous memory barriers in another one.
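    A classic instance of the testing trap, as a C++11 sketch
    (illustrative; names invented): written with plain or relaxed
    accesses, the message-passing pattern below typically passes every
    test on x86, because TSO keeps the two stores in order, yet it can
    fail on weakly ordered hardware; the release/acquire pair shown is
    what makes it correct everywhere:

        // Message-passing pattern (sketch).
        #include <atomic>
        #include <cassert>
        #include <thread>

        int payload = 0;
        std::atomic<bool> ready{false};

        void producer() {
            payload = 42;                                  // plain store
            ready.store(true, std::memory_order_release);  // publish
        }

        void consumer() {
            while (!ready.load(std::memory_order_acquire))
                ;                                          // wait for publication
            assert(payload == 42);  // guaranteed only by the release/acquire pair
        }

        int main() {
            std::thread p(producer), c(consumer);
            p.join(); c.join();
        }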

    Or programmers just don't do this stuff at all (as advocated by
    jseigh), and instead just write sequential programs, or use bottled
    solutions that often are a lot more expensive than superfluous memory
    barriers. E.g., in Gforth the primary inter-thread communication
    mechanism is currently implemented with pipes, involving the system
    calls read() and write(). And Bernd Paysan who implemented that is a
    really good programmer; I am sure he would be able to wrap his head
    around the whole memory model stuff and implement something much more efficient, but that would take time that he obviously prefers to spend
    on more productive things.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Anton Ertl on Sat Nov 16 17:28:21 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    jseigh <jseigh_es00@xemaps.com> writes:

    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Do you have any argument that supports this claim?

    It isn't a claim, just an opinion.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Sun Nov 17 09:03:06 2024
    On 11/16/24 16:21, Chris M. Thomasson wrote:

    Fwiw, in C++ std::memory_order_consume is useful for traversing a node
    based stack of something in RCU. In most systems it only acts like a
    compiler barrier. On the Alpha, it must emit a membar instruction. Iirc,
    mb for alpha? Cannot remember that one right now.

    That got deprecated. Too hard for compilers to deal with. It's now
    the same as memory_order_acquire.
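
    As a minimal sketch (illustrative names only) of what
    memory_order_consume was meant for: order only the accesses that are
    data-dependent on the loaded pointer.  Current compilers simply treat
    it as acquire.

    #include <atomic>

    struct Node { int value; Node* next; };
    std::atomic<Node*> head{nullptr};

    // Reader side of an RCU-style traversal.  consume was intended to order
    // only loads that depend on the pointer value (here n->value); in
    // practice compilers promote it to acquire.
    int first_value() {
        Node* n = head.load(std::memory_order_consume);
        return n ? n->value : -1;
    }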

    Which brings up an interesting point. Even if the hardware memory
    model is strongly ordered, compilers can reorder stuff,
    so you still have to program as if a weak memory model was in
    effect. Or maybe disable reordering or optimization altogether
    for those target architectures.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to jseigh on Sun Nov 17 15:17:52 2024
    jseigh <jseigh_es00@xemaps.com> writes:
    Even if the hardware memory
    model is strongly ordered, compilers can reorder stuff,
    so you still have to program as if a weak memory model was in
    effect.

    That's something between the user of a programming language and the
    compiler. If you use a programming language or compiler that gives
    weaker memory ordering guarantees than the architecture it compiles
    to, that's your choice. Nothing forces compilers to behave that way,
    and it's actually easier to write compilers that do not do such
    reordering.

    Or maybe disable reordering or optimization altogether
    for those target architectures.

    So you want to throw out the baby with the bathwater.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Chris M. Thomasson on Sun Nov 17 15:15:08 2024
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/15/2024 11:37 PM, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/15/2024 9:27 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Anybody doing that sort of programming, i.e. lock-free or distributed
    algorithms, who can't handle weakly consistent memory models, shouldn't
    be doing that sort of programming in the first place.

    Strongly consistent memory won't help incompetence.

    Strong words to hide lack of arguments?

    For instance, a 100% sequential memory order won't help you with, say,
    solving ABA.

    Sure, not all problems are solved by sequential consistency, and yes,
    it won't solve race conditions like the ABA problem. But jseigh
    implied that anyone who finds it easier to write correct and efficient
    code for sequential consistency than for a weakly-consistent memory
    model (e.g., Alpha's memory model) is incompetent.

    What if you had to write code for a weakly ordered system, and the
    performance guidelines said to only use a membar when you absolutely
    have to. If you say something akin to "I do everything using
    std::memory_order_seq_cst", well, that is a violation right off the bat.

    Fair enough?

    Are you trying to support my point?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Chris M. Thomasson on Mon Nov 18 07:11:04 2024
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    What if you had to write code for a weakly ordered system, and the
    performance guidelines said to only use a membar when you absolutely
    have to. If you say something akin to "I do everything using
    std::memory_order_seq_cst", well, that is a violation right off the bat. ...
    I am trying to say you might not be hired if you only knew how to handle
    std::memory_order_seq_cst wrt C++... ?

    I am not looking to be hired.

    In any case, this cuts both ways: If you are an employer working on multi-threaded software, say, for Windows or Linux, will you reduce
    your pool of potential hires by including a requirement like the one
    above? And then pay for longer development time and additional
    hard-to-find bugs coming from overshooting the requirement you stated
    above. Or do you limit your software support to TSO hardware (for
    lack of widely available SC hardware), and gain all the benefits of
    more potential hires, reduced development time, and fewer bugs?

    I have compared arguments against strong memory ordering with those
    against floating-point. Von Neumann argued for fixed point as follows <https://booksite.elsevier.com/9780124077263/downloads/historial%20perspectives/section_3.11.pdf>:

    |[...] human time is consumed in arranging for the introduction of
    |suitable scale factors. We only argue that the time consumed is a
    |very small percentage of the total time we will spend in preparing an
    |interesting problem for our machine. The first advantage of the
    |floating point is, we feel, somewhat illusory. In order to have such
    |a floating point, one must waste memory capacity which could
    |otherwise be used for carrying more digits per word.

    Kahan writes <https://people.eecs.berkeley.edu/~wkahan/SIAMjvnl.pdf>:

    |Papers in 1947/8 by Bargman, Goldstein, Montgomery and von Neumann
    |seemed to imply that 40-bit arithmetic would hardly ever deliver
    |usable accuracy for the solution of so few as 100 linear equations in
    |100 unknowns; but by 1954 engineers were solving bigger systems
    |routinely and getting satisfactory accuracy from arithmetics with no
    |more than 40 bits.

    The flaw in the reasoning of the paper was:

    |To solve it more easily without floating–point von Neumann had
    |transformed equation Bx = c to B^TBx = B^Tc , thus unnecessarily
    |doubling the number of sig. bits lost to ill-condition

    This is an example of how the supposed gains that the harder-to-use
    interface provides (in this case the bits "wasted" on the exponent)
    are overcompensated by then having to use a software workaround for
    the harder-to-use interface.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From aph@littlepinkcloud.invalid@21:1/5 to Anton Ertl on Mon Nov 18 12:03:55 2024
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    aph@littlepinkcloud.invalid writes:
    Yes. That Alpha behaviour was a historic error. No one wants to do
    that again.

    Was it an actual behaviour of any Alpha for public sale, or was it
    just the Alpha specification?

    I don't know. Given the contortions that the Linux kernel people had
    to go through, maybe it really was present in hardware.

    As a programming language implementer, I don't much think about "Will
    the hardware really do this?" because new hardware arises all the
    time, and I don't want users' programs to stop working.

    Andrew.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From aph@littlepinkcloud.invalid@21:1/5 to jseigh on Mon Nov 18 11:56:48 2024
    jseigh <jseigh_es00@xemaps.com> wrote:
    On 11/16/24 16:21, Chris M. Thomasson wrote:

    Fwiw, in C++ std::memory_order_consume is useful for traversing a node
    based stack of something in RCU. In most systems it only acts like a
    compiler barrier. On the Alpha, it must emit a membar instruction. Iirc,
    mb for alpha? Cannot remember that one right now.

    That got deprecated. Too hard for compilers to deal with. It's now
    same as memory_order_acquire.

    It's back in C++20. I think the problem wasn't so much implementing
    it, which as you say can be trivially done by aliasing with acquire,
    but specifying it. We use load dependency ordering in Java on AArch64
    to satisfy some memory model requirements, so it's not as if it's
    difficult to use.

    Which brings up an interesting point. Even if the hardware memory
    model is strongly ordered, compilers can reorder stuff,
    so you still have to program as if a weak memory model was in
    effect.

    Yes, exactly. It's not as if this is an issue that affects people who
    program in high-level languages; it's about what language implementers
    choose to do.

    Andrew.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Tue Nov 26 01:29:19 2024
    On Mon, 25 Nov 2024 23:59:02 +0000, Chris M. Thomasson wrote:

    On 11/18/2024 3:34 PM, Chris M. Thomasson wrote:


    Don't tell me you want all of std::memory_order_* to default to
    std::memory_order_seq_cst? If you're on a system that only has seq_cst and
    nothing else, okay, but not on other weaker (memory order) systems,
    right?

    defaulting a relaxed to a seq_cst is a bit much.... ;^o

    Defaulting to Strongly_Ordered is even worse.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Chris M. Thomasson on Tue Dec 3 08:32:52 2024
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/18/2024 3:20 PM, Chris M. Thomasson wrote:
    On 11/17/2024 11:11 PM, Anton Ertl wrote:
    The flaw in the reasoning of the paper was:

    |To solve it more easily without floating–point von Neumann had
    |transformed equation Bx = c to B^TBx = B^Tc , thus unnecessarily
    |doubling the number of sig. bits lost to ill-condition

    This is an example of how the supposed gains that the harder-to-use
    interface provides (in this case the bits "wasted" on the exponent)
    are overcompensated by then having to use a software workaround for
    the harder-to-use interface.
    ...
    Don't tell me you want all of std::memory_order_* to default to
    std::memory_order_seq_cst? If you're on a system that only has seq_cst and
    nothing else, okay, but not on other weaker (memory order) systems, right?

    I tell anyone who wants to read it to stop buying hardware without FP
    for non-integer work, and with weak memory ordering for work that
    needs concurrent programming. There are enough affordable offerings
    with FP and TSO that we do not need to waste programming time and
    increase the frequency of hard-to-find bugs by figuring out how to get
    good performance out of hardware without FP hardware and with weak
    memory ordering.

    Those who enjoy the challenge of dealing with the unnecessary problems
    of sub-par hardware can continue to enjoy that.

    But when developing production software, as a manager don't let
    programmers with this hobby horse influence your hardware and
    development decisions. Give full support for FP and TSO hardware, and
    limited support to weakly-ordered hardware. That limited support may
    consist of using software implementations of FP (instead of designing
    software for fixed point arithmetic). In case of hardware with weak
    ordering the limited support could be to use memory barriers liberally
    (without trying to minimize them at all; every memory barrier
    elimination costs development time and increases the potential for
    hard-to-find bugs), to use OS mechanisms for concurrency (rather
    than, e.g., lock-free algorithms), or maybe even to support only
    single-threaded operation.
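
    As a sketch of that "liberal" style (hypothetical names): take the
    default sequentially consistent ordering everywhere and spend no
    development time minimizing barriers.

    #include <atomic>

    std::atomic<long> hits{0};
    std::atomic<bool> shutdown{false};

    // Every operation uses the default std::memory_order_seq_cst; there is
    // nothing to reason about, at some cost on weakly ordered hardware.
    void record_hit() { hits.fetch_add(1); }
    void stop()       { shutdown.store(true); }
    bool stopping()   { return shutdown.load(); }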

    Efficiently-implemented sequentially-consistent hardware would be even
    more preferable, and if it was widely available, I would recommend
    buying that over TSO hardware, but unfortunately we are not there yet.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Chris M. Thomasson on Tue Dec 3 09:01:44 2024
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/17/2024 7:17 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Or maybe disable reordering or optimization altogether
    for those target architectures.

    So you want to throw out the baby with the bathwater.

    No, keep the weak order systems and not throw them out wrt a system that
    is 100% seq_cst? Perhaps? What am I missing here?

    Disabling optimization altogether costs a lot; e.g., look at <http://www.complang.tuwien.ac.at/anton/bentley.pdf>: if you compare
    the lines for clang-3.5 -O0 with clang-3.5 -O3, you see a factor >2.5
    for the tsp9 program. For gcc-5.2.0 the difference is even bigger.

    That's why jseigh and people like him (I have read that suggestion
    several times before) love to suggest disabling optimization
    altogether. It's a straw man that does not even need beating up. Of
    course they usually don't show results for the supposed benefits of
    the particular "optimization" they advocate (or the drawbacks of
    disabling it), and jseigh follows this pattern nicely.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Tue Dec 3 11:36:37 2024
    On Tue, 03 Dec 2024 08:32:52 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/18/2024 3:20 PM, Chris M. Thomasson wrote:
    On 11/17/2024 11:11 PM, Anton Ertl wrote:
    The flaw in the reasoning of the paper was:

    |To solve it more easily without floating–point von Neumann had
    |transformed equation Bx = c to B^TBx = B^Tc , thus unnecessarily
    |doubling the number of sig. bits lost to ill-condition

    This is an example of how the supposed gains that the
    harder-to-use interface provides (in this case the bits "wasted"
    on the exponent) are overcompensated by then having to use a
    software workaround for the harder-to-use interface.
    ...
    Don't tell me you want all of std::memory_order_* to default to
    std::memory_order_seq_cst? If you're on a system that only has seq_cst
    and nothing else, okay, but not on other weaker (memory order)
    systems, right?

    I tell anyone who wants to read it to stop buying hardware without FP
    for non-integer work, and with weak memory ordering for work that
    needs concurrent programming. There are enough affordable offerings
    with FP and TSO that we do not need to waste programming time and
    increase the frequency of hard-to-find bugs by figuring out how to get
    good performance out of hardware without FP hardware and with weak
    memory ordering.

    Those who enjoy the challenge of dealing with the unnecessary problems
    of sub-par hardware can continue to enjoy that.

    But when developing production software, as a manager don't let
    programmers with this hobby horse influence your hardware and
    development decisions. Give full support for FP and TSO hardware, and limited support to weakly-ordered hardware. That limited support may
    consist of using software implementations of FP (instead of designing software for fixed point arithmetic). In case of hardware with weak
    ordering the limited support could be to use memory barriers liberally (without trying to minimize them at all; every memory barrier
    elimination costs development time and increases the potential for hard-to-find bugs), of using OS mechanisms for concurrency (rather
    than, e.g., lock-free algorithms), or maybe even only supporting single-threaded operation.

    Efficiently-implemented sequentially-consistent hardware would be even
    more preferable, and if it was widely available, I would recommend
    buying that over TSO hardware, but unfortunately we are not there yet.

    - anton

    If you want a capable dual-core or quad-core processor integrated with
    an FPGA, then Arm Cortex-A is the only game in town right now, and probably
    for a few years going forward. Typically, old low-end cores. FPU is
    there, TSO is not.
    Fortunately, in the majority of applications of these chips there is no need
    for concurrent programming, but one is rarely 100% sure that the need
    won't emerge when starting a project.

    BTW, does your stance mean that you are strongly against the A64FX?

    My own stance is that people should not do lockless concurrent
    programming. Period.
    Well, almost period. Something like RCU in the Linux kernel is an exception.
    Maybe atomic updates of statistical counters are another exception,
    but only when one is sure that the application will never have to scale
    above two dozen cores.

    Lockless programming is horrendously complicated and error prone.
    Sequential consistency removes only a small part of the potential
    complications.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Michael S on Tue Dec 3 10:03:29 2024
    Michael S <already5chosen@yahoo.com> writes:
    BTW, does your stance mean that you are strongly against the A64FX?

    No. According to <https://lwn.net/Articles/970907/> A64FX is one of
    the few ARM A64 implementations that provides TSO.

    Lockless programming is horrendously complicated and error prone.
    Sequential consistency removes only small part of potential
    complications.

    It's only a part, true, but I am not sure that the part is small.

    Interestingly, the FP analogy persists here: having FP hardware does
    not mean that numerical programming is easy, just that it is not as
    hard as with fixed point.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Anton Ertl on Tue Dec 3 08:59:18 2024
    On 12/3/24 04:01, Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 11/17/2024 7:17 AM, Anton Ertl wrote:
    jseigh <jseigh_es00@xemaps.com> writes:
    Or maybe disable reordering or optimization altogether
    for those target architectures.

    So you want to throw out the baby with the bathwater.

    No, keep the weak order systems and not throw them out wrt a system that
    is 100% seq_cst? Perhaps? What am I missing here?

    Disabling optimization altogether costs a lot; e.g., look at <http://www.complang.tuwien.ac.at/anton/bentley.pdf>: if you compare
    the lines for clang-3.5 -O0 with clang-3.5 -O3, you see a factor >2.5
    for the tsp9 program. For gcc-5.2.0 the difference is even bigger.

    That's why jseigh and people like him (I have read that suggestion
    several times before) love to suggest disabling optimization
    altogether. It's a straw man that does not even need beating up. Of
    course they usually don't show results for the supposed benefits of
    the particular "optimization" they advocate (or the drawbacks of
    disabling it), and jseigh follows this pattern nicely.

    That wasn't a serious suggestion.

    The compiler is allowed to reorder code as long as it knows the
    reordering can't be observed or detected. If there are places
    in the code where it doesn't know that, it won't optimize
    across them, more or less.

    If you are writing code with concurrent shared data access then
    you need to let the compiler know. One way is with locks.
    Another way, for lock-free data structures, is with
    memory barriers. Even if you had cst hardware you
    would still need to tell the compiler, so cst hardware doesn't
    buy you any less effort from a programming point of view.
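
    A minimal sketch of that point (illustrative names): with a plain
    variable the compiler may hoist the load out of the loop; with
    std::atomic it must reload it, no matter how strong the hardware is.

    #include <atomic>

    bool plain_flag = false;                // data race: compiler may cache it
    std::atomic<bool> shared_flag{false};   // compiler must re-load each time

    void spin_plain()  { while (!plain_flag) {} }          // may spin forever
    void spin_atomic() { while (!shared_flag.load()) {} }  // correct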

    If you are arguing that lock-free programming with memory barriers
    is hard, let's use locks for everything (disregarding that
    locks have acquire/release semantics that the compiler has
    to be aware of and programmers aren't always aware of), you
    might want to consider the following performance timings
    on some stuff I've been playing with.

    unsafe       53.344 nsecs (     0.000)     54.547 nsecs (     0.000)*
    smr          53.828 nsecs (     0.484)     55.485 nsecs (     0.939)
    smrlite      53.094 nsecs (     0.000)     54.329 nsecs (     0.000)
    arc         306.674 nsecs (   253.330)    313.931 nsecs (   259.384)
    rwlock      730.012 nsecs (   676.668)    830.340 nsecs (   775.793)
    mutex     2,881.690 nsecs ( 2,828.346)  3,305.382 nsecs ( 3,250.835)

    smr is smrproxy, something like user space rcu. smrlite is smr
    w/o thread_local access so I have an idea how much that
    adds to overhead. arc is arcproxy, lock-free reference count
    based deferred reclamation. rwlock and mutex are what their
    names would suggest. unsafe is no synchronization to get a
    base timing on the reader loop body.

    2nd col is per loop read lock/unlock average cpu time
    3rd col is with unsafe time subtracted out
    4th col is average elapsed time
    5th col is with unsafe time subtracted out.

    cpu time doesn't measure lock wait time so elapsed time
    gives some indication of that.

    8 reader threads, 1 writer thread

    smrproxy is the version that doesn't need the seq_cst
    memory barrier so it is pretty fast (you are welcome).

    arc, rwlock, and mutex use interlocked instructions which
    cause cache thrashing. mutex will not scale well with
    number of threads on top of that. rwlock depends on
    how much write locking is going on. With few write
    updates, it will look more like arc.

    Timings are for 8 reader threads, 1 writer thread on
    4 core/8 hw thread machine.

    There are going to be applications where that 2-to-3+ orders-of-magnitude
    difference in overhead is going to matter a lot.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to jseigh on Tue Dec 3 19:27:47 2024
    On Tue, 3 Dec 2024 13:59:18 +0000, jseigh wrote:

    The compiler is allow to reorder code as long as it knows the
    reordering can't be observed or detected.

    With exceptions enabled, this would allow for almost no code
    movement at all.

    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Tue Dec 3 18:37:41 2024
    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    We could start with something like

    critical_region {
    ...
    }

    such that the compiler must refrain from any code motion within
    those sections but is free to move things outside of those sections as if execution was singlethreaded.
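
    There is no such construct in C or C++ today; the closest widely used
    approximation (GCC/Clang specific, shown only as a sketch) is an empty
    asm statement with a "memory" clobber, which blocks compiler code
    motion across it without emitting any hardware fence.

    // Pure compiler barrier: no instruction is emitted, but the compiler
    // may not move memory accesses across it (GCC/Clang extension).
    #define COMPILER_BARRIER() asm volatile("" ::: "memory")

    int shared_x;

    void sketch() {
        COMPILER_BARRIER();   // accesses below stay below this point
        shared_x = 1;
        COMPILER_BARRIER();   // and above this one
    }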


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stefan Monnier on Wed Dec 4 02:47:58 2024
    On Tue, 3 Dec 2024 23:37:41 +0000, Stefan Monnier wrote:

    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    We could start with something like

    critical_region {
    ...
    }

    In the spirit of the Fortran rule about obeying parentheses, one could
    write::

    {
    ...
    }

    and the compiler would not allow code motion beyond the {s or }s.

    {{You may want to disable code motion even when the region is not critical--just dangerous in some way.}}

    such that the compiler must refrain from any code motion within
    those sections but is free to move things outside of those sections as
    if execution was singlethreaded.

    You identify a second problem. Is it that you don't want code motion
    across the boundary or you do not want code motion within the boundary??


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Dec 4 10:29:33 2024
    You identify a second problem. Is it that you don't want code motion
    across the boundary or you do not want code motion within the boundary??

    Concurrency is hard. 🙂


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Stefan Monnier on Wed Dec 4 11:13:17 2024
    On 12/3/24 18:37, Stefan Monnier wrote:
    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    We could start with something like

    critical_region {
    ...
    }

    such that the compiler must refrain from any code motion within
    those sections but is free to move things outside of those sections as if execution was singlethreaded.


    C/C++11 already defines what lock acquire/release semantics are.
    Roughly you can move stuff outside of a critical section into it
    but not vice versa.

    Java uses synchronized blocks to denote the critical section.
    C++ (the society for using RAII for everything) has scoped_lock
    if you want to use RAII for your critical section. It's not
    always obvious what the actual critical section is. I usually
    use it inside its own bracket section to make it more obvious.
    { std::scoped_lock m(mutex);
      // .. critical section
    }

    I'm not a big fan of c/c++ using acquire and release memory order
    directives on everything since apart from a few situations it's
    not intuitively obvious what they do in all cases. You can
    look at compiler assembler output but you have to be real careful
    generalizing from what you see.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Dec 4 16:37:41 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 3 Dec 2024 13:59:18 +0000, jseigh wrote:

    The compiler is allow to reorder code as long as it knows the
    reordering can't be observed or detected.

    With exceptions enabled, this would allow for almost no code
    movement at all.

    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    C and C++ have the 'volatile' keyword for this purpose.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stefan Monnier on Wed Dec 4 19:43:48 2024
    On Wed, 4 Dec 2024 15:29:33 +0000, Stefan Monnier wrote:

    You identify a second problem. Is it that you don't want code motion
    across the boundary or you do not want code motion within the boundary??

    Concurrency is hard. 🙂

    Suggest a way to disallow code motion across the boundary
    that still allows code motion within a block ??



    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Dec 4 19:42:11 2024
    On Wed, 4 Dec 2024 16:37:41 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 3 Dec 2024 13:59:18 +0000, jseigh wrote:

    The compiler is allow to reorder code as long as it knows the
    reordering can't be observed or detected.

    With exceptions enabled, this would allow for almost no code
    movement at all.

    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    C and C++ have the 'volatile' keyword for this purpose.

    What if you want the volatile attribute only to hold
    on an inner block::

    {
        int i = ...;
        ...        // i is not volatile here
        {
            ...    // i is volatile in here
        }
        ...        // i is not volatile here
        ...
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Dec 4 19:48:21 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 4 Dec 2024 16:37:41 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 3 Dec 2024 13:59:18 +0000, jseigh wrote:

    The compiler is allow to reorder code as long as it knows the
    reordering can't be observed or detected.

    With exceptions enabled, this would allow for almost no code
    movement at all.

    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    C and C++ have the 'volatile' keyword for this purpose.

    What if you want the volatile attribute only to hold
    on an inner block::

    {
        int i = ...;
        ...        // i is not volatile here
        {
            ...    // i is volatile in here
        }
        ...        // i is not volatile here
        ...
    }

    Cast it. Linux uses the macro ACCESS_ONCE to support this:

    uint64_t value = ACCESS_ONCE(memory_location);

    #define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))
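
    A small usage sketch (hypothetical variable name): the volatile cast
    forces the compiler to perform the access each time instead of caching
    the value in a register; it adds no hardware ordering whatsoever.

    #include <stdint.h>

    #define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

    uint64_t memory_location;

    void wait_for_nonzero(void) {
        // Re-read memory_location on every iteration; without the volatile
        // cast the compiler could hoist the load out of the loop.
        while (ACCESS_ONCE(memory_location) == 0) { }
    }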

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Thu Dec 5 08:00:40 2024
    On 12/5/24 02:44, Chris M. Thomasson wrote:
    On 12/4/2024 8:13 AM, jseigh wrote:
    On 12/3/24 18:37, Stefan Monnier wrote:
                                               If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    We could start with something like

         critical_region {
           ...
         }

    such that the compiler must refrain from any code motion within
    those sections but is free to move things outside of those sections
    as if
    execution was singlethreaded.


    C/C++11 already defines what lock acquire/release semantics are.
    Roughly you can move stuff outside of a critical section into it
    but not vice versa.

    Java uses synchronized blocks to denote the critical section.
    C++ (the society for using RAII for everything) has scoped_lock
    if you want to use RAII for your critical section.  It's not
    always obvious what the actual critical section is.  I usually
    use it inside its own bracket section to make it more obvious.
       { std::scoped_lock m(mutex);
         // .. critical section
       }

    I'm not a big fan of c/c++ using acquire and release memory order
    directives on everything since apart from a few situations it's
    not intuitively obvious what they do in all cases.  You can
    look a compiler assembler output but you have to be real careful
    generalizing from what you see.

    The release on the unlock can allow some following stores and things to
    sort of "bubble up before it?

    Acquire and release confines things to the "critical section", the
    release can allow for some following things to go above it, so to speak.
    This is making me think of Alex over on c.p.t. !

    :^)

    Did I miss anything? Sorry Joe.


    Maybe. For thread-local, non-shared data, if the compiler can make that determination, but I don't know if the actual specs say that.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Scott Lurndal on Thu Dec 5 07:19:24 2024
    scott@slp53.sl.home (Scott Lurndal) writes:

    mitchalsup@aol.com (MitchAlsup1) writes:

    On Tue, 3 Dec 2024 13:59:18 +0000, jseigh wrote:

    The compiler is allow to reorder code as long as it knows the
    reordering can't be observed or detected.

    With exceptions enabled, this would allow for almost no code
    movement at all.

    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    C and C++ have the 'volatile' keyword for this purpose.

    A problem with using volatile is that volatile doesn't do what
    most people think it does, especially with respect to what
    reordering is or is not allowed.
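
    A minimal sketch of the usual misconception (illustrative names):
    volatile keeps the two volatile accesses in program order as far as
    the compiler is concerned, but it neither constrains the CPU on weakly
    ordered hardware nor orders surrounding plain accesses; std::atomic
    with release/acquire does.

    #include <atomic>

    volatile int v_data = 0, v_flag = 0;

    void publish_volatile() {
        v_data = 42;
        v_flag = 1;     // no inter-thread ordering guarantee on weak hardware
    }

    std::atomic<int> a_data{0}, a_flag{0};

    void publish_atomic() {
        a_data.store(42, std::memory_order_relaxed);
        a_flag.store(1, std::memory_order_release);  // orders the data store
    }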

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Tim Rentsch on Fri Dec 6 10:04:29 2024
    Tim, did you send me a PM to check my email? I responded but then
    silence. Could someone be pretending to be you?

    Terje

    Tim Rentsch wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:

    mitchalsup@aol.com (MitchAlsup1) writes:

    On Tue, 3 Dec 2024 13:59:18 +0000, jseigh wrote:

    The compiler is allow to reorder code as long as it knows the
    reordering can't be observed or detected.

    With exceptions enabled, this would allow for almost no code
    movement at all.

    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    C and C++ have the 'volatile' keyword for this purpose.

    A problem with using volatile is that volatile doesn't do what
    most people think it does, especially with respect to what
    reordering is or is not allowed.



    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Terje Mathisen on Fri Dec 6 07:36:33 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim, did you send me a PM to check my email? I responded but then
    silence. Could someone be pretending to be you?

    Yes that was me. I'm just about to send you a followup.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Tue Dec 17 07:33:58 2024
    On 12/16/24 16:48, Chris M. Thomasson wrote:
    On 12/5/2024 5:00 AM, jseigh wrote:


    Maybe.  For thread local non-shared data if the compiler can make that
    determination but I don't know if the actual specs say that.

    It would be strange to me if the compiler executed a weaker barrier than
    what I said needed to be there. If I say I need a #LoadStore |
    #StoreStore here, then the compiler better put that barrier in there.
    Humm...

    C++ concurrency was designed by a committee. They try to fit things
    into their world view even if reality is a bit more nuanced or complex
    than that world view.

    C++ doesn't use #LoadStore, etc... memory ordering terminology. They
    use acquire, release, cst, relaxed, ... While in some cases it's straightforward as to what that means, in others it's less obvious.
    Non-obvious isn't exactly what you want when writing multi-threaded
    code. There's enough subtlety as it is.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From aph@littlepinkcloud.invalid@21:1/5 to jseigh on Tue Dec 17 20:38:18 2024
    jseigh <jseigh_es00@xemaps.com> wrote:

    C++ doesn't use #LoadStore, etc... memory ordering terminology. They
    use acquire, release, cst, relaxed, ... While in some cases it's straightforward as to what that means, in others it's less obvious.

    Indeed you don't know the exact mapping to instructions, but that's
    the idea: you ask for the ordering model you want, and the compiler
    chooses the instructions.

    Non-obvious isn't exactly what you want when writing multi-threaded
    code. There's enough subtlety as it is.

    There are efficiency advantages to be had from getting away from
    explicit barriers, though. AArch64 has seq-cst load and store
    instructions which don't need the sledgehammer of a full StoreLoad
    between a store and a load.
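
    For instance (a sketch only; the exact instruction selection is up to
    the compiler), current GCC and Clang typically lower a seq_cst
    store/load to STLR/LDAR on AArch64 rather than a plain access plus a
    full DMB.

    #include <atomic>

    std::atomic<int> x{0};

    void write_it() { x.store(1, std::memory_order_seq_cst); }    // usually STLR
    int  read_it()  { return x.load(std::memory_order_seq_cst); } // usually LDAR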

    Andrew.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Tue Dec 17 21:45:55 2024
    On Tue, 17 Dec 2024 20:41:23 +0000, Chris M. Thomasson wrote:

    On 12/17/2024 4:33 AM, jseigh wrote:
    On 12/16/24 16:48, Chris M. Thomasson wrote:
    On 12/5/2024 5:00 AM, jseigh wrote:


    Maybe.  For thread local non-shared data if the compiler can make that >>>> determination but I don't know if the actual specs say that.

    It would be strange to me if the compiler executed a weaker barrier
    than what I said needed to be there. If I say I need a #LoadStore |
    #StoreStore here, then the compiler better put that barrier in there.
    Humm...

    C++ concurrency was designed by a committee.  They try to fit things
    into their world view even if reality is a bit more nuanced or complex
    than that world view.

    Indeed.

    A committee with no/little HW design experience ...


    C++ doesn't use #LoadStore, etc... memory ordering terminology.  They
    use acquire, release, cst, relaxed, ...  While in some cases it's
    straightforward as to what that means, in others it's less obvious.
    Non-obvious isn't exactly what you want when writing multi-threaded
    code.  There's enough subtlety as it is.

    Agreed. Humm... The CAS is interesting to me.

    atomic_compare_exchange_weak
    atomic_compare_exchange_strong

    The weak one can fail spuriously... Akin to LL/SC in a sense?

    There are architectures in which CAS* is allowed to fail spuriously.
    In one case, if the miss buffers overflow while holding a locked
    variable, the CAS would fail (preventing an ABA problem possibility).

    (*) or any multi-instruction ATOMIC sequence.

    atomic_compare_exchange_weak_explicit
    atomic_compare_exchange_strong_explicit

    A membar for the success path and one for the failure path. Oh that's
    fun. Sometimes I think its better to use relaxed for all of the atomics
    and use explicit barriers ala atomic_thread_fence for the order. Well,
    that is more in line with the SPARC way of doing things... ;^)

    :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Wed Dec 18 06:43:43 2024
    On 12/17/24 15:41, Chris M. Thomasson wrote:

    Agreed. Humm... The CAS is interesting to me.

    atomic_compare_exchange_weak
    atomic_compare_exchange_strong

    The weak one can fail spuriously... Akin to LL/SC in a sense?

    Most likely LL/SC in the implementation. If you are calling
    CAS in a loop, then the weak form will be safe; it will just
    retry. The strong form is for when you are not in a loop; the
    CAS emulation code has extra logic to filter out spurious
    failures and retry internally.
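
    A minimal sketch of the loop case (illustrative counter):
    compare_exchange_weak may fail spuriously, e.g. when mapped onto
    LL/SC, but the loop simply retries, so weak is the natural choice;
    strong is for the non-loop case.

    #include <atomic>

    std::atomic<int> counter{0};

    void add_one() {
        int expected = counter.load(std::memory_order_relaxed);
        // On a spurious failure, expected is refreshed and we go around again.
        while (!counter.compare_exchange_weak(expected, expected + 1,
                                              std::memory_order_relaxed)) {
        }
    }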

    atomic_compare_exchange_weak_explicit
    atomic_compare_exchange_strong_explicit

    A membar for the success path and one for the failure path. Oh that's
    fun. Sometimes I think its better to use relaxed for all of the atomics
    and use explicit barriers ala atomic_thread_fence for the order. Well,
    that is more in line with the SPARC way of doing things... ;^)

    Not sure why that is there. Possibly for non-loop usages.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Thu Dec 19 18:33:36 2024
    On Thu, 5 Dec 2024 7:44:19 +0000, Chris M. Thomasson wrote:

    On 12/4/2024 8:13 AM, jseigh wrote:
    On 12/3/24 18:37, Stefan Monnier wrote:
                                               If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references
    are "more special" than normal--when languages give few mechanisms.

    We could start with something like

         critical_region {
           ...
         }

    such that the compiler must refrain from any code motion within
    those sections but is free to move things outside of those sections as
    if
    execution was singlethreaded.


    C/C++11 already defines what lock acquire/release semantics are.
    Roughly you can move stuff outside of a critical section into it
    but not vice versa.

    Java uses synchronized blocks to denote the critical section.
    C++ (the society for using RAII for everything) has scoped_lock
    if you want to use RAII for your critical section.  It's not
    always obvious what the actual critical section is.  I usually
    use it inside its own bracket section to make it more obvious.
      { std::scoped_lock m(mutex);
        // .. critical section
      }

    I'm not a big fan of c/c++ using acquire and release memory order
    directives on everything since apart from a few situations it's
    not intuitively obvious what they do in all cases.  You can
    look a compiler assembler output but you have to be real careful
    generalizing from what you see.

    The release on the unlock can allow some following stores and things to
    sort of "bubble up before it?

    Acquire and release confines things to the "critical section", the
    release can allow for some following things to go above it, so to speak.
    This is making me think of Alex over on c.p.t. !

    This sounds dangerous: if the thing allowed to go above it is unCacheable
    while the lock:release is cacheable, the cacheable lock can arrive at
    another core before the unCacheable store arrives at its destination.

    :^)

    Did I miss anything? Sorry Joe.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Thu Dec 19 23:59:01 2024
    On Thu, 19 Dec 2024 21:19:24 +0000, Chris M. Thomasson wrote:

    On 12/19/2024 10:33 AM, MitchAlsup1 wrote:
    On Thu, 5 Dec 2024 7:44:19 +0000, Chris M. Thomasson wrote:

    On 12/4/2024 8:13 AM, jseigh wrote:
    On 12/3/24 18:37, Stefan Monnier wrote:
                                               If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references >>>>>> are "more special" than normal--when languages give few mechanisms. >>>>>
    We could start with something like

         critical_region {
           ...
         }

    such that the compiler must refrain from any code motion within
    those sections but is free to move things outside of those sections as >>>>> if
    execution was singlethreaded.


    C/C++11 already defines what lock acquire/release semantics are.
    Roughly you can move stuff outside of a critical section into it
    but not vice versa.

    Java uses synchronized blocks to denote the critical section.
    C++ (the society for using RAII for everything) has scoped_lock
    if you want to use RAII for your critical section.  It's not
    always obvious what the actual critical section is.  I usually
    use it inside its own bracket section to make it more obvious.
       { std::scoped_lock m(mutex);
         // .. critical section
       }

    I'm not a big fan of c/c++ using acquire and release memory order
    directives on everything since apart from a few situations it's
    not intuitively obvious what they do in all cases.  You can
    look a compiler assembler output but you have to be real careful
    generalizing from what you see.

    The release on the unlock can allow some following stores and things to
    sort of "bubble up before it?

    Acquire and release confines things to the "critical section", the
    release can allow for some following things to go above it, so to speak. >>> This is making me think of Alex over on c.p.t. !

    This sounds dangerous if the thing allowed to go above it is unCacheable
    while the lock:release is cacheable, the cacheable lock can arrive at
    another core before the unCacheable store arrives at its destination.

    Humm... Need to ponder on that. Wrt the sparc:

    membar #LoadStore | #StoreStore

    can allow following stores to bubble up before it. If we want to block
    that then we would use a #StoreLoad. However, a #StoreLoad is not
    required for unlocking a mutex.

    It is the cacheable locks covering unCacheable data that got MOESI
    protocol in trouble (SPARC V8 era). MESI does not have this kind
    of problem. {{SuperSPARC MESI did not have this problem because
    writes to memory (via SNOOP hits) were slow, Ross MOESI did have
    this problem because cache-to-cache transfers (SNOOP hit) were as
    few as 6 cycles.}}

    SO, what kind of barriers a relaxed memory model needs becomes
    dependent on the cache coherency model !?!?!?! How is software
    going to deal with that ?!? It then becomes dependent on the
    memory order model, as a cascade of Oh-crap-what-have-I-done-to-
    myself ...

    It is stuff like this that led My 66000 to alter memory models
    as it accesses memory and to mandate that all critical sections
    are denoted (.lock) at the beginning and end of the ATOMIC event.
    Thus, the programmer gets the performance of relaxed memory ordering
    with the sanity of sequential consistency without programmer
    involvement.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Chris M. Thomasson on Fri Dec 20 14:39:03 2024
    Chris M. Thomasson wrote:
    On 12/19/2024 10:33 AM, MitchAlsup1 wrote:
    On Thu, 5 Dec 2024 7:44:19 +0000, Chris M. Thomasson wrote:

    On 12/4/2024 8:13 AM, jseigh wrote:
    On 12/3/24 18:37, Stefan Monnier wrote:
    If there are places
    in the code it doesn't know this can't happen it won't optimize
    across it, more or less.

    The problem is HOW to TELL the COMPILER that these memory references >>>>>> are "more special" than normal--when languages give few mechanisms. >>>>>
    We could start with something like

    critical_region {
    ...
    }

    such that the compiler must refrain from any code motion within
    those sections but is free to move things outside of those sections as >>>>> if
    execution was singlethreaded.


    C/C++11 already defines what lock acquire/release semantics are.
    Roughly you can move stuff outside of a critical section into it
    but not vice versa.

    Java uses synchronized blocks to denote the critical section.
    C++ (the society for using RAII for everything) has scoped_lock
    if you want to use RAII for your critical section. It's not
    always obvious what the actual critical section is. I usually
    use it inside its own bracket section to make it more obvious.
    { std::scoped_lock m(mutex);
    // .. critical section
    }

    I'm not a big fan of c/c++ using acquire and release memory order
    directives on everything since apart from a few situations it's
    not intuitively obvious what they do in all cases. You can
    look a compiler assembler output but you have to be real careful
    generalizing from what you see.

    The release on the unlock can allow some following stores and things to
    sort of "bubble up before it?

    Acquire and release confines things to the "critical section", the
    release can allow for some following things to go above it, so to speak. >>> This is making me think of Alex over on c.p.t. !

    This sounds dangerous if the thing allowed to go above it is unCacheable
    while the lock:release is cacheable, the cacheable lock can arrive at
    another core before the unCacheable store arrives at its destination.

    Humm... Need to ponder on that. Wrt the sparc:

    membar #LoadStore | #StoreStore

    can allow following stores to bubble up before it. If we want to block
    that then we would use a #StoreLoad. However, a #StoreLoad is not
    required for unlocking a mutex.

    I had an idea a few weeks back of a different way to do membars
    that should be more flexible and controllable (if that's a good thing)
    so I thought I'd toss it out there for comments.

    This hypothetical ISA has normal LD and ST instructions, to which I
    would add a LW Load for Write instruction to optimize moving shared lines between caches. There are also the Atomic Fetch and OP instructions
    AFADD, AFAND, AFOR, AFXOR, plus ASWAP and ACAS, LL Load Locked and
    SC Store Conditional, for various sizes of naturally aligned data,
    and with various address modes.

    Here is the new part:

    To the above instructions is added a 3-bit Coherence Group (CG) field.
    This allows one to specify different groups that various above data
    accesses belong to.

    The ISA has a membar instruction: MBG Memory Barrier for Group

    MBG has three fields:
    - one 4-bit field where each bit enables which operations this barrier
    applies to, in older-younger order: Load-Load, Load-Store, Store-Load,
    and Store-Store.
    - two 8-bit fields where each bit selects which sets of Coherence Group(s)
    this barrier applies to, one field for the older (before the membar) sets,
    one for the younger (after the membar) sets.

    Also the Load Store Queue is assumed to be self coherent - that loads
    and stores to the same address by a single core are performed in order,
    and that nothing can bypass a load or store with an unresolved address.

    The CG numbers are assigned by convention, probably by the OS designers
    when they define the ABI for this ISA.
    Here I assigned CG:0 to be thread normal access, CG:1 to be atomic items,
    CG:2 to be shared memory sections. The remaining 5 CG's can be used to
    indicate different shared memory sections if their locks can overlap.

    E.g. an MBG with op bits for Load-Load and Load-Store, with a before CG of 1
    and after CG's 3 and 4 would block all younger loads and stores in groups
    3 and 4 from starting execution until all older loads in group 1 completed. Loads and stores in all other groups are free to reorder, within the
    LSQ self coherence rules.
    An MBG with all op bits and all CG bits set is a full membar.

    Also if one is say juggling multiple shared sections with multiple
    spinlocks or mutexes, then one can use multiple membars applied to
    different groups to achieve specific bypassing blocking effects.

    An MBG instruction completes and retires when no older groups of
    selected loads or stores are incomplete.
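
    To restate the proposal in code (my own field layout; no encoding was
    given above, so treat this purely as a sketch):

    #include <cstdint>

    struct MBG {
        uint8_t ops;         // bit0 LL, bit1 LS, bit2 SL, bit3 SS
        uint8_t older_cgs;   // groups whose older accesses are waited on
        uint8_t younger_cgs; // groups whose younger accesses are held back
    };

    // The example above: hold back younger loads and stores in CG 3 and 4
    // until older loads in CG 1 have completed.
    constexpr MBG example     = { 0b0011, 1u << 1, (1u << 3) | (1u << 4) };

    // All op bits and all group bits set: a full memory barrier.
    constexpr MBG full_membar = { 0b1111, 0xFF, 0xFF };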

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)