• arm ldxr/stxr vs cas

    From jseigh@21:1/5 to All on Mon Sep 2 13:27:57 2024
    I read that arm added the cas instruction because they didn't think
    ldxr/stxr would scale well. It wasn't clear to me as to why that
    would be the case. I would think the memory lock mechanism would
    have really low overhead vs cas having to do an interlocked load
    and store. Unless maybe the memory lock size might be large
    enough to cause false sharing issues. Any ideas?
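
A minimal sketch of the two flavours being compared, assuming GCC/Clang inline asm and an ARMv8.1-A target for the LSE version; the function names are mine and purely illustrative, not from the post:

#include <cstdint>

// Hand-rolled CAS from exclusives: the store-exclusive can fail spuriously
// (reservation lost to another writer, an interrupt, etc.), so it needs a loop.
inline bool cas_llsc(uint64_t* addr, uint64_t expected, uint64_t desired) {
    uint64_t observed;
    uint32_t status;
    do {
        asm volatile("ldaxr %0, [%1]" : "=r"(observed) : "r"(addr) : "memory");
        if (observed != expected) {
            asm volatile("clrex" ::: "memory");  // drop the exclusive reservation
            return false;
        }
        asm volatile("stlxr %w0, %2, [%1]"
                     : "=&r"(status) : "r"(addr), "r"(desired) : "memory");
    } while (status != 0);                       // 0 means the store-exclusive succeeded
    return true;
}

// ARMv8.1-A LSE CAS: one interlocked instruction; it only fails by observing
// a different value, never spuriously.
inline bool cas_lse(uint64_t* addr, uint64_t expected, uint64_t desired) {
    uint64_t old_val = expected;                 // CASAL: compare value in, old value out
    asm volatile("casal %0, %2, [%1]"
                 : "+r"(old_val) : "r"(addr), "r"(desired) : "memory");
    return old_val == expected;
}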

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Mon Sep 2 15:59:36 2024
    On 9/2/24 14:58, Chris M. Thomasson wrote:
    On 9/2/2024 11:55 AM, Chris M. Thomasson wrote:
    On 9/2/2024 10:27 AM, jseigh wrote:
    I read that arm added the cas instruction because they didn't think
    ldxr/stxr would scale well.  It wasn't clear to me as to why that
    would be the case.  I would think the memory lock mechanism would
    have really low overhead vs cas having to do an interlocked load
    and store.  Unless maybe the memory lock size might be large
    enough to cause false sharing issues.  Any ideas?

Reservation granularity? Wrt the PPC, wow this is an older thread:

    https://groups.google.com/g/comp.arch/c/yREvvvKvr6k/m/nRZ5tpLwDNQJ

    LL/SC can spuriously fail... It's more obstruction free than
    lock-free? I think this is why a strong and weak CAS are in the C/C++
    std's?


    Btw, have you talked with Alexander Terekhov lately? He is a smart guy. :^)

    What, where? c.p.t. is gone. It's crickets everywhere else.
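
On the weak vs. strong CAS point above, a minimal C++11 sketch (illustrative only): compare_exchange_weak may fail spuriously, much as an LL/SC-based implementation can, so it normally sits in a retry loop, while compare_exchange_strong only fails when the values actually differ.

#include <atomic>

std::atomic<int> value{0};

// Weak CAS: may fail spuriously, so it is used inside a retry loop anyway.
void increment() {
    int expected = value.load(std::memory_order_relaxed);
    while (!value.compare_exchange_weak(expected, expected + 1,
                                        std::memory_order_acq_rel,
                                        std::memory_order_relaxed)) {
        // 'expected' has been reloaded with the observed value; just retry.
    }
}

// Strong CAS: fails only if the stored value differs from 'expected',
// which some state-machine style algorithms rely on.
bool try_claim(int expected, int desired) {
    return value.compare_exchange_strong(expected, desired,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed);
}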

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to jseigh on Wed Sep 4 21:27:32 2024
    On Mon, 2 Sep 2024 17:27:57 +0000, jseigh wrote:

    I read that arm added the cas instruction because they didn't think
    ldxr/stxr would scale well. It wasn't clear to me as to why that
    would be the case. I would think the memory lock mechanism would
    have really low overhead vs cas having to do an interlocked load
    and store. Unless maybe the memory lock size might be large
    enough to cause false sharing issues. Any ideas?

    A pipeline lock between the LD part of a CAS and the ST part of a
    CAS is essentially FREE. But the same is true for LL followed by
    a later SC.

    Older machines with looser than sequential consistency memory models
    and running OoO have a myriad of problems with LL - SC. This is
    why My 66000 architecture switches from causal consistency to
    sequential consistency when it encounters <effectively> LL and
switches back after seeing SC.

    No Fences necessary with causal consistency.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Thu Sep 5 00:59:36 2024
    On Wed, 4 Sep 2024 23:48:48 +0000, Chris M. Thomasson wrote:

    On 9/4/2024 2:27 PM, MitchAlsup1 wrote:
    On Mon, 2 Sep 2024 17:27:57 +0000, jseigh wrote:

    I read that arm added the cas instruction because they didn't think
    ldxr/stxr would scale well.  It wasn't clear to me as to why that
    would be the case.  I would think the memory lock mechanism would
    have really low overhead vs cas having to do an interlocked load
    and store.  Unless maybe the memory lock size might be large
    enough to cause false sharing issues.  Any ideas?

    A pipeline lock between the LD part of a CAS and the ST part of a
    CAS is essentially FREE. But the same is true for LL followed by
    a later SC.

    100% sure on that? No way to break the reservation from an unrelated
    aspect wrt LL/SC?

    I have been building pipelines like that since 1983.

    Now, it is completely possible to build a pipeline that gives
    one grief (lesser or greater) in doing these things--but it is
    definitely possible to build a grief free pipeline for LL- SC,
    and by extension CAS.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to All on Thu Sep 5 07:33:23 2024
    On 9/4/2024 5:27 PM, MitchAlsup1 wrote:
    On Mon, 2 Sep 2024 17:27:57 +0000, jseigh wrote:

    I read that arm added the cas instruction because they didn't think
    ldxr/stxr would scale well.  It wasn't clear to me as to why that
    would be the case.  I would think the memory lock mechanism would
    have really low overhead vs cas having to do an interlocked load
    and store.  Unless maybe the memory lock size might be large
    enough to cause false sharing issues.  Any ideas?

    A pipeline lock between the LD part of a CAS and the ST part of a
    CAS is essentially FREE. But the same is true for LL followed by
    a later SC.

    Older machines with looser than sequential consistency memory models
    and running OoO have a myriad of problems with LL - SC. This is
    why My 66000 architecture switches from causal consistency to
    sequential consistency when it encounters <effectively> LL and
    switches bac after seeing SC.

    No Fences necessary with causal consistency.


    I'm not sure I entirely follow. I was thinking of the effects on
cache. In theory the SC could fail without having to get the current
cache line exclusive, or without getting it at all. CAS has to get it
exclusive before it can definitively fail.

    Whenever they get around to making arm desktops without the OS tax
    so I can install linux I can compare the 2.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to jseigh on Thu Sep 5 14:17:19 2024
    jseigh <jseigh_es00@xemaps.com> writes:
    On 9/4/2024 5:27 PM, MitchAlsup1 wrote:
    On Mon, 2 Sep 2024 17:27:57 +0000, jseigh wrote:

    I read that arm added the cas instruction because they didn't think
    ldxr/stxr would scale well.  It wasn't clear to me as to why that
    would be the case.  I would think the memory lock mechanism would
    have really low overhead vs cas having to do an interlocked load
    and store.  Unless maybe the memory lock size might be large
    enough to cause false sharing issues.  Any ideas?

    A pipeline lock between the LD part of a CAS and the ST part of a
    CAS is essentially FREE. But the same is true for LL followed by
    a later SC.

    Older machines with looser than sequential consistency memory models
    and running OoO have a myriad of problems with LL - SC. This is
    why My 66000 architecture switches from causal consistency to
    sequential consistency when it encounters <effectively> LL and
    switches bac after seeing SC.

    No Fences necessary with causal consistency.


    I'm not sure I entirely follow. I was thinking of the effects on
    cache. In theory the SC could fail without having get the current
    cache line exclusive or at all. CAS has to get it exclusive before
    it can definitively fail.

    The nice property of atomic and CAS instructions is that the processor
    can delegate them to an agent closer to memory (e.g. LLC). Since the
    processor must acquire the line anyway, once it has the line the
    atomicity is satisfied.

    For uncached accesses (e.g. PCIexpress), they also can be delegated to
    the PCI Express endpoint directly.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to jseigh on Thu Sep 5 15:04:18 2024
    jseigh <jseigh_es00@xemaps.com> writes:
    Whenever they get around to making arm desktops without the OS tax
    so I can install linux

    What's wrong with the Raspi 5, or the Rock 5B? Or, if you want to
    spend a lot more to get a lot more cores, you can buy workstations
    with an Ampere Altra starting at EUR 3199 for 64 cores at <https://www.mifcom.de/workstations-ampere-altra-cid774>.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to jseigh on Thu Sep 5 19:46:25 2024
    On Thu, 5 Sep 2024 11:33:23 +0000, jseigh wrote:

    On 9/4/2024 5:27 PM, MitchAlsup1 wrote:
    On Mon, 2 Sep 2024 17:27:57 +0000, jseigh wrote:

    I read that arm added the cas instruction because they didn't think
    ldxr/stxr would scale well.  It wasn't clear to me as to why that
    would be the case.  I would think the memory lock mechanism would
    have really low overhead vs cas having to do an interlocked load
    and store.  Unless maybe the memory lock size might be large
    enough to cause false sharing issues.  Any ideas?

    A pipeline lock between the LD part of a CAS and the ST part of a
    CAS is essentially FREE. But the same is true for LL followed by
    a later SC.

    Older machines with looser than sequential consistency memory models
    and running OoO have a myriad of problems with LL - SC. This is
    why My 66000 architecture switches from causal consistency to
    sequential consistency when it encounters <effectively> LL and
    switches bac after seeing SC.

    No Fences necessary with causal consistency.


    I'm not sure I entirely follow. I was thinking of the effects on
    cache. In theory the SC could fail without having get the current
    cache line exclusive or at all. CAS has to get it exclusive before
    it can definitively fail.

    A LL that takes a miss in L1 will perform a fetch with intent to modify,
    so will a CAS. However, LL is allowed to silently fail if exclusive is
    not returned from its fetch, deferring atomic failure to SC, while CAS
    will fail when exclusive fails to return. LL-SC is designed so that
    when a failure happens, failure is visible at SC not necessarily at LL.

There are coherence protocols that allow the 2nd party to determine
    if it returns exclusive or not. The example I know is when the 2nd
    party is already performing an atomic event and it is better to fail
    the starting atomic event than to fail an ongoing atomic event.
    In My 66000 the determination is made under the notion of priority::
the higher priority thread is allowed to continue while the lower
    priority thread takes the failure. The higher priority thread can
    be the requestor (1st party) or the holder of data (2nd party)
    while all interested observers (3rd parties) are in a position
    to see what transpired and act accordingly (causal).

    Whenever they get around to making arm desktops without the OS tax
    so I can install linux I can compare the 2.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Thu Sep 5 17:49:32 2024
    On 9/5/24 16:34, Chris M. Thomasson wrote:
    On 9/5/2024 12:46 PM, MitchAlsup1 wrote:
    On Thu, 5 Sep 2024 11:33:23 +0000, jseigh wrote:

    On 9/4/2024 5:27 PM, MitchAlsup1 wrote:
    On Mon, 2 Sep 2024 17:27:57 +0000, jseigh wrote:

    I read that arm added the cas instruction because they didn't think
    ldxr/stxr would scale well.  It wasn't clear to me as to why that
    would be the case.  I would think the memory lock mechanism would
    have really low overhead vs cas having to do an interlocked load
    and store.  Unless maybe the memory lock size might be large
    enough to cause false sharing issues.  Any ideas?

    A pipeline lock between the LD part of a CAS and the ST part of a
    CAS is essentially FREE. But the same is true for LL followed by
    a later SC.

    Older machines with looser than sequential consistency memory models
    and running OoO have a myriad of problems with LL - SC. This is
    why My 66000 architecture switches from causal consistency to
    sequential consistency when it encounters <effectively> LL and
    switches bac after seeing SC.

    No Fences necessary with causal consistency.


    I'm not sure I entirely follow.  I was thinking of the effects on
    cache.  In theory the SC could fail without having get the current
    cache line exclusive or at all.  CAS has to get it exclusive before
    it can definitively fail.

    A LL that takes a miss in L1 will perform a fetch with intent to modify,
    so will a CAS. However, LL is allowed to silently fail if exclusive is
    not returned from its fetch, deferring atomic failure to SC, while CAS
    will fail when exclusive fails to return.

    CAS should only fail when the comparands are not equal to each other.
    Well, then there is the damn weak and strong CAS in C++11... ;^o


    LL-SC is designed so that
    when a failure happens, failure is visible at SC not necessarily at LL.

    There are coherence protocols that allows the 2nd party to determine
    if it returns exclusive or not. The example I know is when the 2nd
    party is already performing an atomic event and it is better to fail
    the starting atomic event than to fail an ongoing atomic event.
    In My 66000 the determination is made under the notion of priority::
    the higher priority thread is allows to continue while the lower
    priority thread takes the failure. The higher priority thread can
    be the requestor (1st party) or the holder of data (2nd party)
    while all interested observers (3rd parties) are in a position
    to see what transpired and act accordingly (causal).


I'm not so sure about making the memory lock granularity the same as
the cache line size but that's an implementation decision I guess.

    I do like the idea of detecting potential contention at the
    start of LL/SC so you can do back off. Right now the only way I
    can detect contention is after the fact when the CAS fails and
    I probably have the cache line exclusive at that point. It's
    pretty problematic.
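
As a rough sketch of that after-the-fact backoff (my code, assuming an x86 or AArch64 pause hint, nothing here is from the post): the only contention signal is the failed CAS itself, at which point the line has typically already been pulled in exclusive.

#include <atomic>

inline void cpu_relax() {
#if defined(__x86_64__)
    asm volatile("pause");
#elif defined(__aarch64__)
    asm volatile("yield");
#endif
}

// Retry a CAS with truncated exponential backoff; contention is only
// detected after the fact, when the compare-exchange fails.
template <typename T>
void cas_store_with_backoff(std::atomic<T>& target, T desired) {
    T expected = target.load(std::memory_order_relaxed);
    unsigned spins = 1;
    while (!target.compare_exchange_weak(expected, desired,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed)) {
        for (unsigned i = 0; i < spins; ++i) cpu_relax();
        if (spins < 1024) spins <<= 1;           // exponential, capped
    }
}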

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to jseigh on Thu Sep 5 23:45:36 2024
    On Thu, 5 Sep 2024 21:49:32 +0000, jseigh wrote:

    On 9/5/24 16:34, Chris M. Thomasson wrote:
    On 9/5/2024 12:46 PM, MitchAlsup1 wrote:
    On Thu, 5 Sep 2024 11:33:23 +0000, jseigh wrote:

    On 9/4/2024 5:27 PM, MitchAlsup1 wrote:
    On Mon, 2 Sep 2024 17:27:57 +0000, jseigh wrote:

I read that arm added the cas instruction because they didn't think
ldxr/stxr would scale well.  It wasn't clear to me as to why that
would be the case.  I would think the memory lock mechanism would
have really low overhead vs cas having to do an interlocked load
    and store.  Unless maybe the memory lock size might be large
    enough to cause false sharing issues.  Any ideas?

    A pipeline lock between the LD part of a CAS and the ST part of a
    CAS is essentially FREE. But the same is true for LL followed by
    a later SC.

Older machines with looser than sequential consistency memory models
and running OoO have a myriad of problems with LL - SC. This is
    why My 66000 architecture switches from causal consistency to
    sequential consistency when it encounters <effectively> LL and
    switches bac after seeing SC.

    No Fences necessary with causal consistency.


    I'm not sure I entirely follow.  I was thinking of the effects on
    cache.  In theory the SC could fail without having get the current
    cache line exclusive or at all.  CAS has to get it exclusive before
    it can definitively fail.

A LL that takes a miss in L1 will perform a fetch with intent to modify,
so will a CAS. However, LL is allowed to silently fail if exclusive is
    not returned from its fetch, deferring atomic failure to SC, while CAS
    will fail when exclusive fails to return.

    CAS should only fail when the comparands are not equal to each other.
    Well, then there is the damn weak and strong CAS in C++11... ;^o


    LL-SC is designed so that
    when a failure happens, failure is visible at SC not necessarily at LL.

    There are coherence protocols that allows the 2nd party to determine
    if it returns exclusive or not. The example I know is when the 2nd
    party is already performing an atomic event and it is better to fail
    the starting atomic event than to fail an ongoing atomic event.
    In My 66000 the determination is made under the notion of priority::
    the higher priority thread is allows to continue while the lower
    priority thread takes the failure. The higher priority thread can
    be the requestor (1st party) or the holder of data (2nd party)
    while all interested observers (3rd parties) are in a position
    to see what transpired and act accordingly (causal).


    I'm not so sure about making the memory lock granularity same as
    cache line size but that's an implementation decision I guess.

    I do like the idea of detecting potential contention at the
    start of LL/SC so you can do back off. Right now the only way I
    can detect contention is after the fact when the CAS fails and
    I probably have the cache line exclusive at that point. It's
    pretty problematic.

    My 66000 ISA has a pipelined version of LL-SC where there can be
as many as 8 LLs (to different cache lines) with as many LDs and
    STs to those lines as one desires. There are 2 cases::
    a) failure detected before checking for interference
    b) failure detected after checking for interference

    Failure before returns control to the first instruction.
    Failure after transfers control to a known location.

    case a is used to emulate Test&Set, Test&test&set, CAS, DCAS
    case b is used when doing more exotic atomic stuff.

    Note: interference is based on physical address not the
    data contained at that address.


    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Fri Sep 6 19:57:57 2024
    On Fri, 6 Sep 2024 19:36:36 +0000, Chris M. Thomasson wrote:

    On 9/5/2024 2:49 PM, jseigh wrote:
    On 9/5/24 16:34, Chris M. Thomasson wrote:
    On 9/5/2024 12:46 PM, MitchAlsup1 wrote:
    On Thu, 5 Sep 2024 11:33:23 +0000, jseigh wrote:

    On 9/4/2024 5:27 PM, MitchAlsup1 wrote:
    On Mon, 2 Sep 2024 17:27:57 +0000, jseigh wrote:

I read that arm added the cas instruction because they didn't think
ldxr/stxr would scale well.  It wasn't clear to me as to why that
would be the case.  I would think the memory lock mechanism would
have really low overhead vs cas having to do an interlocked load
and store.  Unless maybe the memory lock size might be large
    enough to cause false sharing issues.  Any ideas?

    A pipeline lock between the LD part of a CAS and the ST part of a
    CAS is essentially FREE. But the same is true for LL followed by
    a later SC.

Older machines with looser than sequential consistency memory models
and running OoO have a myriad of problems with LL - SC. This is
    why My 66000 architecture switches from causal consistency to
    sequential consistency when it encounters <effectively> LL and
    switches bac after seeing SC.

    No Fences necessary with causal consistency.


    I'm not sure I entirely follow.  I was thinking of the effects on
    cache.  In theory the SC could fail without having get the current
cache line exclusive or at all.  CAS has to get it exclusive before
it can definitively fail.

A LL that takes a miss in L1 will perform a fetch with intent to modify,
so will a CAS. However, LL is allowed to silently fail if exclusive is
not returned from its fetch, deferring atomic failure to SC, while CAS
will fail when exclusive fails to return.

    CAS should only fail when the comparands are not equal to each other.
    Well, then there is the damn weak and strong CAS in C++11... ;^o


    LL-SC is designed so that
when a failure happens, failure is visible at SC not necessarily at LL.
    There are coherence protocols that allows the 2nd party to determine
    if it returns exclusive or not. The example I know is when the 2nd
    party is already performing an atomic event and it is better to fail
    the starting atomic event than to fail an ongoing atomic event.
    In My 66000 the determination is made under the notion of priority::
    the higher priority thread is allows to continue while the lower
    priority thread takes the failure. The higher priority thread can
    be the requestor (1st party) or the holder of data (2nd party)
    while all interested observers (3rd parties) are in a position
    to see what transpired and act accordingly (causal).


    I'm not so sure about making the memory lock granularity same as
    cache line size but that's an implementation decision I guess.

    I do like the idea of detecting potential contention at the
    start of LL/SC so you can do back off.  Right now the only way I
    can detect contention is after the fact when the CAS fails and
    I probably have the cache line exclusive at that point.  It's
    pretty problematic.

    I wonder if the ability to determine why a "weak" CAS failed might help.
    They (weak) can fail for other reasons besides comparing comparands...
    Well, would be a little too low level for a general atomic op in
    C/C++11?

    One can detect that the CAS-line is no longer exclusive as a form
    of weak failure, rather than waiting for the data to show up and
    fail strongly on the compare.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to All on Sat Sep 7 11:02:56 2024
    On 9/6/24 15:57, MitchAlsup1 wrote:
    On Fri, 6 Sep 2024 19:36:36 +0000, Chris M. Thomasson wrote:

    On 9/5/2024 2:49 PM, jseigh wrote:
    On 9/5/24 16:34, Chris M. Thomasson wrote:
    On 9/5/2024 12:46 PM, MitchAlsup1 wrote:
    On Thu, 5 Sep 2024 11:33:23 +0000, jseigh wrote:

    On 9/4/2024 5:27 PM, MitchAlsup1 wrote:
    On Mon, 2 Sep 2024 17:27:57 +0000, jseigh wrote:

I read that arm added the cas instruction because they didn't think
ldxr/stxr would scale well.  It wasn't clear to me as to why that
would be the case.  I would think the memory lock mechanism would
have really low overhead vs cas having to do an interlocked load
and store.  Unless maybe the memory lock size might be large
    enough to cause false sharing issues.  Any ideas?

A pipeline lock between the LD part of a CAS and the ST part of a
CAS is essentially FREE. But the same is true for LL followed by
a later SC.

Older machines with looser than sequential consistency memory models
and running OoO have a myriad of problems with LL - SC. This is
    why My 66000 architecture switches from causal consistency to
    sequential consistency when it encounters <effectively> LL and
    switches bac after seeing SC.

    No Fences necessary with causal consistency.


I'm not sure I entirely follow.  I was thinking of the effects on
cache.  In theory the SC could fail without having get the current
cache line exclusive or at all.  CAS has to get it exclusive before
it can definitively fail.

A LL that takes a miss in L1 will perform a fetch with intent to modify,
so will a CAS. However, LL is allowed to silently fail if exclusive is
not returned from its fetch, deferring atomic failure to SC, while CAS
will fail when exclusive fails to return.

    CAS should only fail when the comparands are not equal to each other.
    Well, then there is the damn weak and strong CAS in C++11... ;^o


    LL-SC is designed so that
    when a failure happens, failure is visible at SC not necessarily at
    LL.

There are coherence protocols that allows the 2nd party to determine
if it returns exclusive or not. The example I know is when the 2nd
party is already performing an atomic event and it is better to fail
the starting atomic event than to fail an ongoing atomic event.
In My 66000 the determination is made under the notion of priority::
the higher priority thread is allows to continue while the lower
    priority thread takes the failure. The higher priority thread can
    be the requestor (1st party) or the holder of data (2nd party)
    while all interested observers (3rd parties) are in a position
    to see what transpired and act accordingly (causal).


    I'm not so sure about making the memory lock granularity same as
    cache line size but that's an implementation decision I guess.

    I do like the idea of detecting potential contention at the
    start of LL/SC so you can do back off.  Right now the only way I
    can detect contention is after the fact when the CAS fails and
    I probably have the cache line exclusive at that point.  It's
    pretty problematic.

    I wonder if the ability to determine why a "weak" CAS failed might help.
    They (weak) can fail for other reasons besides comparing comparands...
    Well, would be a little too low level for a general atomic op in
    C/C++11?

    One can detect that the CAS-line is no longer exclusive as a form
    of weak failure, rather than waiting for the data to show up and
    fail strongly on the compare.

    There is no requirement for CAS to calculate the expected value in
    any way, though typically the expected value is loaded from the CAS
target. In fact you can use random values and it will still work;
it will just take a lot longer. A typical optimization for pushing onto
    a stack that you expect to be empty more often than not is to
    initially load NULL as expected value instead of loading from the
    stack anchor, a load immediate vs load from storage.
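
A sketch of that push optimization (names are mine, not from the post): the first expected value is a nullptr immediate rather than a load from the anchor, and a failed CAS hands back the observed head for the retry.

#include <atomic>

struct node {
    node* next;
    // payload ...
};

std::atomic<node*> stack_head{nullptr};

// Optimistic push: assume the stack is empty, so the initial expected value
// is a load-immediate (nullptr) rather than a load from stack_head. On
// failure, compare_exchange writes the observed head back into 'expected'.
void push(node* n) {
    node* expected = nullptr;
    do {
        n->next = expected;
    } while (!stack_head.compare_exchange_weak(expected, n,
                                               std::memory_order_release,
                                               std::memory_order_relaxed));
}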

    x64 doesn't have an atomic 128 bit load but cmpxchg16b works
    ok nonetheless. The 2 64 bit loads just have to be effectively
    atomic most of the time or you can use the updated result from
    cmpxchg16b.

aarch64 didn't have an atomic 128 bit load, LDP, early on. You
had to do an LDXP/STXP to determine if the load was atomic. In
    practice if you're doing a LDXP/STXP loop anyway it doesn't
    matter too much as long as you can handle the occasional
    random 128 bit value.

I have had some success with after-the-fact contention backoff.
    I get 30% to 50% improvement in most cases. The main challenge
    is getting a 100+ nanosecond pause. nanosleep() doesn't hack it.
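
For the pause itself, a common workaround (a sketch under my own assumptions, not from the post) is to spin on a fast clock with a pause/yield hint, since nanosleep() goes through the scheduler and typically costs tens of microseconds:

#include <chrono>

// Busy-wait for roughly 'ns' nanoseconds without involving the scheduler.
inline void short_pause(unsigned ns) {
    auto deadline = std::chrono::steady_clock::now() +
                    std::chrono::nanoseconds(ns);
    while (std::chrono::steady_clock::now() < deadline) {
#if defined(__x86_64__)
        asm volatile("pause");
#elif defined(__aarch64__)
        asm volatile("yield");
#endif
    }
}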

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Sun Sep 8 00:59:51 2024
    On Sat, 7 Sep 2024 23:16:35 +0000, Chris M. Thomasson wrote:

    On 9/7/2024 4:14 PM, Chris M. Thomasson wrote:
    [...]
    When I am using CAS I don't really expect it to fail willy nilly even if
    the comparands are still the same. Weak vs Strong. Still irks me a bit.
    ;^)

    There are algorithms out there, usually state machines that depend on
    strong cas. When a CAS fails, it depends on it failing because the
    comparands were actually different...

    Leading to ABA failures::

    Do you really want the following CAS to succeed ??

LD R19,[someMemoryLocation]
    ..
    interrupt delays program execution for 1 week
    ..
    CAS R17,R19,[someMemoryLocation]

    Given that the someMemoryLocation is accessible to other programs
while this one is sleeping ??

    Thus, it seems reasonable to fail a CAS when one cannot determine
    if the memory location has been changed and changed back in the
    mean time.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Sun Sep 8 07:53:59 2024
    On 9/8/24 02:35, Chris M. Thomasson wrote:
    On 9/7/2024 5:59 PM, MitchAlsup1 wrote:
    On Sat, 7 Sep 2024 23:16:35 +0000, Chris M. Thomasson wrote:

    On 9/7/2024 4:14 PM, Chris M. Thomasson wrote:
    [...]
    When I am using CAS I don't really expect it to fail willy nilly
    even if
the comparands are still the same. Weak vs Strong. Still irks me a bit.
;^)

    There are algorithms out there, usually state machines that depend on
    strong cas. When a CAS fails, it depends on it failing because the
    comparands were actually different...

    Leading to ABA failures::

    Do you really want the following CAS to succeed ??

         LD    R19,[someMemoryValue]
    ..
    interrupt delays program execution for 1 week
    ..
         CAS   R17,R19,[someMemoryLocation]

    Given that the someMemoryLocation is accessible to other programs
    while tis one is sleeping ??

    Thus, it seems reasonable to fail a CAS when one cannot determine
    if the memory location has been changed and changed back in the
    mean time.

    ABA, well that can happen with CAS and certain algorithms that use them.
    The good ol' version counter is pretty nice, but it still can fail. 64
    bit words, 128 bit CAS. Loads to boot... ;^) Actually, I think Joe
    mentioned something interesting about CAS a long time ago wrt IBM...
    Candy Cane Striped books (Joe do you remember?) about a way to avoid
    live lock and failing in the os. I think Windows has some like it with
    their SList and SEH?

    You can have an ABA problem with single word CAS if your values (usually pointers) are one word in size. The solution to that is to use double
    word CAS, DWCAS, with the other word being a monotonic sequence number.
The logic is that it would take longer to wrap and reuse the sequence
number than any reasonable delay in execution. In the 70's, S/370
CDS was 64 bits (two 32-bit words) and it was estimated that it would
take a year to wrap a 32-bit counter.
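
A sketch of that pointer-plus-counter scheme, using GCC's 16-byte __atomic builtins rather than standard C/C++ atomics (needs -mcx16 on x86-64, and whether it ends up lock-free depends on the compiler/libatomic, so treat it as illustrative):

#include <cstdint>

struct node { node* next; };

// 16-byte anchor: pointer plus a monotonically increasing sequence number.
struct alignas(16) versioned_ptr {
    node*    ptr;
    uint64_t seq;
};

versioned_ptr stack_anchor = {nullptr, 0};

// Pop via DWCAS (cmpxchg16b on x86-64): even if the same pointer value
// reappears (ABA), the sequence number has moved on and the CAS fails.
// Safe reclamation of popped nodes (hazard pointers, RCU, ...) is a
// separate problem not shown here.
node* pop() {
    versioned_ptr expected, desired;
    __atomic_load(&stack_anchor, &expected, __ATOMIC_ACQUIRE);
    for (;;) {
        if (expected.ptr == nullptr) return nullptr;
        desired.ptr = expected.ptr->next;
        desired.seq = expected.seq + 1;
        if (__atomic_compare_exchange(&stack_anchor, &expected, &desired,
                                      /*weak=*/false,
                                      __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
            return expected.ptr;
        // 'expected' now holds the freshly observed anchor; retry.
    }
}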

    The risc-v arch manual mentions that you need DWCAS or LL/SC to avoid
    the ABA problem and that they chose LL/SC to avoid dealing with 128
    bit atomic data sizes (their loss but I'm not going into that here).

    Anyway if you want to avoid the ABA problem, you can't do it with
    C/C++ atomics since they don't support DWCAS.

    The candy striped Principle of Operation was an internal restricted
    document with architecture and engineering notes added. The note
    was that for short CAS loops fairness should be guaranteed so all
    processors could make equal forward progress.

Speaking of live lock, in the 80's, VM/XA spin locks broke on a 4300
box. VM/XA used a test-and-set loop without any back off. The 4300 TS
microcode timed out getting a lock and failed silently. So all the
cpus spun forever trying to get a lock that was free. We sort of
figured it out when I provided a patch that put in a load and test
of the lock so we could set a hardware break point to see if the
lock was actually not free. It suddenly started working because
the patch provided enough backoff for the microcode locking
to start working again.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Chris M. Thomasson on Sun Sep 8 16:10:41 2024
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 9/7/2024 5:59 PM, MitchAlsup1 wrote:
    On Sat, 7 Sep 2024 23:16:35 +0000, Chris M. Thomasson wrote:

    Thus, it seems reasonable to fail a CAS when one cannot determine
    if the memory location has been changed and changed back in the
    mean time.

    I think Scott Lurndal mentioned something about CAS or something on
    windows that will assert the bus lock after a lot of failures...

    On AMD processors (and likely intel), if a core cannot acquire
a cache line in a finite time, the core will assert the bus lock
    to ensure forward progress.

    Nothing to do with the operating software; purely a hardware thing.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Sun Sep 8 20:46:21 2024
    On Sun, 08 Sep 2024 16:10:41 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 9/7/2024 5:59 PM, MitchAlsup1 wrote:
    On Sat, 7 Sep 2024 23:16:35 +0000, Chris M. Thomasson wrote:

    Thus, it seems reasonable to fail a CAS when one cannot determine
    if the memory location has been changed and changed back in the
    mean time.

    I think Scott Lurndal mentioned something about CAS or something on
    windows that will assert the bus lock after a lot of failures...

    On AMD processors (and likely intel), if a core cannot acquire
    a cache line in a a finite time, the core will assert the bus lock
    to ensure forward progress.

    Nothing to do with the operating software; purely a hardware thing.


I think, on AMD processors made in this century, the only cases that
resort to a physical bus lock are
A) atomic accesses that cross a cache line boundary
B) atomic accesses that address non-cached memory regions

Both are in the realm of "Dear software, please, never do it. If you think
that you need it, you are wrong."
And in both cases it's not related to any sort of timeout.

In the case of Intel, I think that since Nehalem they don't have a physical
Bus Lock signal in their processors.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sun Sep 8 17:52:08 2024
    On Sun, 8 Sep 2024 16:10:41 +0000, Scott Lurndal wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 9/7/2024 5:59 PM, MitchAlsup1 wrote:
    On Sat, 7 Sep 2024 23:16:35 +0000, Chris M. Thomasson wrote:

    Thus, it seems reasonable to fail a CAS when one cannot determine
    if the memory location has been changed and changed back in the
    mean time.

    I think Scott Lurndal mentioned something about CAS or something on
    windows that will assert the bus lock after a lot of failures...

    On AMD processors (and likely intel), if a core cannot acquire
    a cache line in a a finite time, the core will assert the bus lock
    to ensure forward progress.

    There are busses without the ability to LOCK (outside of x86).

    Nothing to do with the operating software; purely a hardware thing.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Michael S on Sun Sep 8 18:32:42 2024
    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 08 Sep 2024 16:10:41 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 9/7/2024 5:59 PM, MitchAlsup1 wrote:
    On Sat, 7 Sep 2024 23:16:35 +0000, Chris M. Thomasson wrote:

    Thus, it seems reasonable to fail a CAS when one cannot determine
    if the memory location has been changed and changed back in the
    mean time.

    I think Scott Lurndal mentioned something about CAS or something on
    windows that will assert the bus lock after a lot of failures...

    On AMD processors (and likely intel), if a core cannot acquire
    a cache line in a a finite time, the core will assert the bus lock
    to ensure forward progress.

    Nothing to do with the operating software; purely a hardware thing.


    I think, on AMD processors made in this century the only cases that
    resort to physical bus lock are
    A) atomic accesses that cross cache boundary
    B) atomic accesses that address non-cached memory regions

    C) A core cannot acquire a cache line in a finite time. We
    encountered this in 2010 on AMD Opteron processors with
    our HyperTransport connected CXL-like chip (designed in 2005);
    r/t latency could be as high as 800ns to remote memory.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to jseigh on Mon Sep 9 09:14:40 2024
    jseigh wrote:

    I'm not so sure about making the memory lock granularity same as
    cache line size but that's an implementation decision I guess.

    Just make sure you never have multiple locks residing inside the same
    cache line!


    I do like the idea of detecting potential contention at the
    start of LL/SC so you can do back off.  Right now the only way I
    can detect contention is after the fact when the CAS fails and
    I probably have the cache line exclusive at that point.  It's
    pretty problematic.

    I do prefer LOCK XADD instead of CAS (CmpXchg*), because the return
    value will also tell you which queue entry to pick/work on.

    It will not be optimal when really contended, but at least one
    participant will make forward progress, and typically several of them.
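
A small sketch of that LOCK XADD pattern (names mine): the value returned by fetch_add is the ticket/index of the slot the caller now owns, so each participant gets a distinct entry.

#include <atomic>
#include <cstdint>

constexpr uint32_t N = 1024;                 // power of two
int slots[N];
std::atomic<uint64_t> next_index{0};

// Each caller atomically claims a distinct slot; fetch_add compiles to
// LOCK XADD on x86, and its return value is the index to work on.
int* claim_slot() {
    uint64_t ticket = next_index.fetch_add(1, std::memory_order_acq_rel);
    return &slots[ticket & (N - 1)];
}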

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Mon Sep 9 12:00:05 2024
    On Sun, 08 Sep 2024 18:32:42 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 08 Sep 2024 16:10:41 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:


    On AMD processors (and likely intel), if a core cannot acquire
    a cache line in a a finite time, the core will assert the bus lock
    to ensure forward progress.

    Nothing to do with the operating software; purely a hardware
    thing.


    I think, on AMD processors made in this century the only cases that
    resort to physical bus lock are
    A) atomic accesses that cross cache boundary
    B) atomic accesses that address non-cached memory regions

    C) A core cannot acquire a cache line in a finite time. We
    encountered this in 2010 on AMD Opteron processors with
    our HyperTransport connected CXL-like chip (designed in 2005);
    r/t latency could be as high as 800ns to remote memory.


    PathScale Infinipath, I suppose.

Bought by QLogic, then sold to Sicortex, then picked up by Cray when
Sicortex went belly-up, then resold to Intel, where it became the base
for Omni-Path, which was discontinued in 2019 but, according to Wikipedia,
still lives on under Cornelis Networks.
    https://www.cornelisnetworks.com/

    I never liked attempts to use it as cache-coherent interconnect.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Terje Mathisen on Mon Sep 9 10:29:25 2024
    On 9/9/24 03:14, Terje Mathisen wrote:
    jseigh wrote:

    I'm not so sure about making the memory lock granularity same as
    cache line size but that's an implementation decision I guess.

    Just make sure you never have multiple locks residing inside the same
    cache line!

This is the terminology ARM uses when describing their LL/SC
implementation. It is not the best choice of terminology.


    I do like the idea of detecting potential contention at the
    start of LL/SC so you can do back off.  Right now the only way I
    can detect contention is after the fact when the CAS fails and
    I probably have the cache line exclusive at that point.  It's
    pretty problematic.

    I do prefer LOCK XADD instead of CAS (CmpXchg*), because the return
    value will also tell you which queue entry to pick/work on.

    It will not be optimal when really contended, but at least one
    participant will make forward progress, and typically several of them.


    I'm not aware of any lock-free queue algorithms that use
    atomic_fetch_add that are actually lock-free, error free,
    and/or don't have an ABA problem. I'm not saying there
    aren't, just that I'm not aware of them.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Michael S on Mon Sep 9 15:42:45 2024
    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 08 Sep 2024 18:32:42 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 08 Sep 2024 16:10:41 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:


    On AMD processors (and likely intel), if a core cannot acquire
    a cache line in a a finite time, the core will assert the bus lock
    to ensure forward progress.

    Nothing to do with the operating software; purely a hardware
    thing.


    I think, on AMD processors made in this century the only cases that
    resort to physical bus lock are
    A) atomic accesses that cross cache boundary
    B) atomic accesses that address non-cached memory regions

    C) A core cannot acquire a cache line in a finite time. We
    encountered this in 2010 on AMD Opteron processors with
    our HyperTransport connected CXL-like chip (designed in 2005);
    r/t latency could be as high as 800ns to remote memory.


    PathScale Infinipath, I suppose.

Actually, no.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Mon Sep 9 20:49:46 2024
    On Mon, 09 Sep 2024 15:42:45 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 08 Sep 2024 18:32:42 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 08 Sep 2024 16:10:41 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:


    On AMD processors (and likely intel), if a core cannot acquire
    a cache line in a a finite time, the core will assert the bus
    lock to ensure forward progress.

    Nothing to do with the operating software; purely a hardware
    thing.


    I think, on AMD processors made in this century the only cases
    that resort to physical bus lock are
    A) atomic accesses that cross cache boundary
    B) atomic accesses that address non-cached memory regions

    C) A core cannot acquire a cache line in a finite time. We
    encountered this in 2010 on AMD Opteron processors with
    our HyperTransport connected CXL-like chip (designed in 2005);
    r/t latency could be as high as 800ns to remote memory.


    PathScale Infinipath, I suppose.

    Actully, no.


    Horus?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Michael S on Mon Sep 9 18:49:00 2024
    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 09 Sep 2024 15:42:45 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 08 Sep 2024 18:32:42 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 08 Sep 2024 16:10:41 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:


    On AMD processors (and likely intel), if a core cannot acquire
    a cache line in a a finite time, the core will assert the bus
    lock to ensure forward progress.

    Nothing to do with the operating software; purely a hardware
    thing.


    I think, on AMD processors made in this century the only cases
    that resort to physical bus lock are
    A) atomic accesses that cross cache boundary
    B) atomic accesses that address non-cached memory regions

    C) A core cannot acquire a cache line in a finite time. We
    encountered this in 2010 on AMD Opteron processors with
    our HyperTransport connected CXL-like chip (designed in 2005);
    r/t latency could be as high as 800ns to remote memory.


    PathScale Infinipath, I suppose.

    Actully, no.


    Horus?


    https://www.infoworld.com/article/2201330/3leaf-systems-scale-up-by-scaling-out.html

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to All on Mon Sep 9 14:46:24 2024
    Some testing of the effects of backoff on contention.

    $ lscpu -e
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ   MINMHZ       MHZ
  0    0      0    0 0:0:0:0          yes 3400.0000 400.0000  799.9840
  1    0      0    1 1:1:1:0          yes 3400.0000 400.0000 2296.7949
  2    0      0    2 2:2:2:0          yes 3400.0000 400.0000  712.2160
  3    0      0    3 3:3:3:0          yes 3400.0000 400.0000 1574.7550
  4    0      0    0 0:0:0:0          yes 3400.0000 400.0000  799.9760
  5    0      0    1 1:1:1:0          yes 3400.0000 400.0000 2600.7561
  6    0      0    2 2:2:2:0          yes 3400.0000 400.0000 2599.2959
  7    0      0    3 3:3:3:0          yes 3400.0000 400.0000  400.0000

    so 4 cores, 2 threads / core

The options:
-n  #enqueues per producer
-p  #producers
-c  #consumers
-b  backoff, if > 0
-f  0 = linear backoff, 1 = exponential backoff
-t  queue type: mpmc, spsc, spmc, mpsc (default mpmc)

taskset is used to specify which cpus to run on.
There are a lot more stats being output but I'm
just showing the overall rate.

    $ taskset -c 0-7 test/lfrbtest -n 4000000 -p6 -c2 -f 0 -b 0
    overall rate = 5,988,679.5093 /sec

    $ taskset -c 0-3 test/lfrbtest -n 4000000 -p6 -c2 -f 0 -b 0
    overall rate = 7,206,319.9145 /sec

    $ taskset -c 0-7 test/lfrbtest -n 4000000 -p6 -c2 -f 1 -b 12
    overall rate = 7,243,427.7786 /sec

    $ taskset -c 0-3 test/lfrbtest -n 4000000 -p6 -c2 -f 1 -b 12
    overall rate = 8,722,397.3449 /sec

    $ taskset -c 3 test/lfrbtest -n 4000000 -p6 -c2 -f 1 -b 12
    overall rate = 19,950,919.8062 /sec

    Some more threads than cpus timings ...

    $ taskset -c 0-7 test/lfrbtest -n 1000000 -p24 -c8 -f 1 -b 0
    overall rate = 6,454,866.9162 /sec

    $ taskset -c 0-7 test/lfrbtest -n 1000000 -p24 -c8 -f 1 -b 12
    overall rate = 7,350,906.2839 /sec

    Some 1 producer 1 consumer timings ... (no contention so backoff setting
    has no effect)

    $ taskset -c 3,4 test/lfrbtest -n 10000000 -p1 -c1 -f 1 -b 0
    overall rate = 15,329,272.1370 /sec

    $ taskset -c 3,4 test/lfrbtest -n 10000000 -p1 -c1 -f 1 -b 12
    overall rate = 15,063,074.6140 /sec

    $ taskset -c 3 test/lfrbtest -n 10000000 -p1 -c1 -f 1 -b 0
    overall rate = 19,983,735.6372 /sec

    $ taskset -c 3 test/lfrbtest -n 10000000 -p1 -c1 -f 1 -b 12
    overall rate = 19,935,213.3078 /sec

    $ taskset -c 3 test/lfrbtest -n 10000000 -p1 -c1 -f 1 -b 12 -x n
    overall rate = 32,921,077.1692 /sec

    $ taskset -c 2,3 test/lfrbtest -n 10000000 -p1 -c1 -f 1 -b 12 -t spsc
    overall rate = 38,620,404.6845 /sec

    $ taskset -c 3 test/lfrbtest -n 10000000 -p1 -c1 -f 1 -b 12 -t spsc
    overall rate = 109,891,021.6233 /sec

This is the producer and consumer running on the same thread, so no
thread switching overhead.

    $ taskset -c 3 test/lfrbtest -n 10000000 -p1 -c1 -f 1 -b 12 -x n -t spsc
    overall rate = 146,941,972.2478 /sec

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Mon Sep 9 23:39:30 2024
    On Mon, 09 Sep 2024 18:49:00 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 09 Sep 2024 15:42:45 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 08 Sep 2024 18:32:42 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 08 Sep 2024 16:10:41 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:


    On AMD processors (and likely intel), if a core cannot
    acquire a cache line in a a finite time, the core will
    assert the bus lock to ensure forward progress.

    Nothing to do with the operating software; purely a hardware
    thing.


    I think, on AMD processors made in this century the only cases
    that resort to physical bus lock are
    A) atomic accesses that cross cache boundary
    B) atomic accesses that address non-cached memory regions

    C) A core cannot acquire a cache line in a finite time. We
    encountered this in 2010 on AMD Opteron processors with
    our HyperTransport connected CXL-like chip (designed in
    2005); r/t latency could be as high as 800ns to remote memory.


    PathScale Infinipath, I suppose.

    Actully, no.


    Horus?


    https://www.infoworld.com/article/2201330/3leaf-systems-scale-up-by-scaling-out.html

O.k.
This article is a little more technical: https://www.eetimes.com/3leaf-links-32-amd-chips-in-one-server/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to jseigh on Tue Sep 10 11:22:05 2024
    jseigh wrote:
    On 9/9/24 03:14, Terje Mathisen wrote:
    jseigh wrote:

    I'm not so sure about making the memory lock granularity same as
    cache line size but that's an implementation decision I guess.

    Just make sure you never have multiple locks residing inside the same
    cache line!

    This the the terminology ARM uses when describing their LL/SC implementation.  It is not the best choice in terminology.


    I do like the idea of detecting potential contention at the
    start of LL/SC so you can do back off.  Right now the only way I
    can detect contention is after the fact when the CAS fails and
    I probably have the cache line exclusive at that point.  It's
    pretty problematic.

    I do prefer LOCK XADD instead of CAS (CmpXchg*), because the return
    value will also tell you which queue entry to pick/work on.

    It will not be optimal when really contended, but at least one
    participant will make forward progress, and typically several of them.


    I'm not aware of any lock-free queue algorithms that use
    atomic_fetch_add that are actually lock-free, error free,
    and/or don't have an ABA problem.  I'm not saying there
    aren't, just that I'm not aware of them.

I'm not sure either: I have written test code that should be general
purpose but never actually used it in any production systems.

    The one time I used this for critical work (a maximum throughput ntpd
    server) I had a single writer and N-1 readers, so I actually decided to
    create one queue per reader (cpu core), making them much easier to get
    right. :-)

    Each queue was a power of two in length, the write and read indices were
    full 32-bit unsigned variables that I masked before accessing the work
    list, so I never needed to worry about queue end wraparound.

    The writer simply allocated work to each queue in round robin fashion.
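
A minimal single-producer/single-consumer sketch of that scheme (my naming, not Terje's code): free-running 32-bit indices, masked only when indexing the array, so unsigned wraparound needs no special handling.

#include <atomic>
#include <cstdint>

constexpr uint32_t N = 256;                 // power of two
int ring[N];
std::atomic<uint32_t> write_idx{0};
std::atomic<uint32_t> read_idx{0};

// Single producer: indices run freely; only the array access is masked.
bool try_push(int item) {
    uint32_t w = write_idx.load(std::memory_order_relaxed);
    uint32_t r = read_idx.load(std::memory_order_acquire);
    if (w - r == N) return false;           // full (wrap-safe unsigned subtraction)
    ring[w & (N - 1)] = item;
    write_idx.store(w + 1, std::memory_order_release);
    return true;
}

// Single consumer.
bool try_pop(int& item) {
    uint32_t r = read_idx.load(std::memory_order_relaxed);
    uint32_t w = write_idx.load(std::memory_order_acquire);
    if (w == r) return false;               // empty
    item = ring[r & (N - 1)];
    read_idx.store(r + 1, std::memory_order_release);
    return true;
}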

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Tue Sep 10 10:40:05 2024
    On 9/9/24 17:09, Chris M. Thomasson wrote:
    On 9/9/2024 7:29 AM, jseigh wrote:
    On 9/9/24 03:14, Terje Mathisen wrote:
    jseigh wrote:

    I'm not so sure about making the memory lock granularity same as
    cache line size but that's an implementation decision I guess.

    Just make sure you never have multiple locks residing inside the same
    cache line!

    This the the terminology ARM uses when describing their LL/SC
    implementation.  It is not the best choice in terminology.


    I do like the idea of detecting potential contention at the
    start of LL/SC so you can do back off.  Right now the only way I
    can detect contention is after the fact when the CAS fails and
    I probably have the cache line exclusive at that point.  It's
    pretty problematic.

    I do prefer LOCK XADD instead of CAS (CmpXchg*), because the return
    value will also tell you which queue entry to pick/work on.

    It will not be optimal when really contended, but at least one
    participant will make forward progress, and typically several of them.


    I'm not aware of any lock-free queue algorithms that use
    atomic_fetch_add that are actually lock-free, error free,
    and/or don't have an ABA problem.  I'm not saying there
    aren't, just that I'm not aware of them.

    Here is an interesting one I did. A tweak from another algorithm.
    Basically a bakery algorithm:

<pseudo code, membars aside for a moment>
______________________________________________
    struct cell { uint32_t ver; double state; };

    uint32_t head = 0;
    uint32_t tail = 0;
    cell cells[N]; // N must be a power of 2

    void init() {
        for (uint32_t i = 0; i < N; ++i) cells[i].ver = i;
    }

    void producer(double state) {
        uint32_t ver = XADD(&head, 1);
        cell& c = cells[ver & (N - 1)];
        while (LOAD(&c.ver) != ver) backoff();
        c.state = state;
        STORE(&c.ver, ver + 1);
    }

    double consumer() {
        uint32_t ver = XADD(&tail, 1);
        cell& c = cells[ver & (N - 1)];
        while (LOAD(&c.ver) != ver + 1) backoff();
        double state = c.state;
        STORE(&c.ver, ver + N);
        return state;
    }
    ______________________________________________


    One of the things I have in my implementation is that an enqueue
operation can detect if wrap occurred if it got interrupted by
    a time slice end and log how far behind it got. I'm seeing about
    200,000 enqueue operations in those cases. That would have been
    a huge performance hit if my queue wasn't lock-free.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Terje Mathisen on Tue Sep 10 10:51:45 2024
    On 9/10/24 05:22, Terje Mathisen wrote:
    jseigh wrote:
    On 9/9/24 03:14, Terje Mathisen wrote:
    jseigh wrote:

    I'm not so sure about making the memory lock granularity same as
    cache line size but that's an implementation decision I guess.

    Just make sure you never have multiple locks residing inside the same
    cache line!

    This the the terminology ARM uses when describing their LL/SC
    implementation.  It is not the best choice in terminology.


    I do like the idea of detecting potential contention at the
    start of LL/SC so you can do back off.  Right now the only way I
    can detect contention is after the fact when the CAS fails and
    I probably have the cache line exclusive at that point.  It's
    pretty problematic.

    I do prefer LOCK XADD instead of CAS (CmpXchg*), because the return
    value will also tell you which queue entry to pick/work on.

    It will not be optimal when really contended, but at least one
    participant will make forward progress, and typically several of them.


    I'm not aware of any lock-free queue algorithms that use
    atomic_fetch_add that are actually lock-free, error free,
    and/or don't have an ABA problem.  I'm not saying there
    aren't, just that I'm not aware of them.

    I'm not sure either: I have written test code that should be general
    purpose but never actually used that in any production systems.

    The one time I used this for critical work (a maximum throughput ntpd
server) I had a single writer and N-1 readers, so I actually decided to
create one queue per reader (cpu core), making them much easier to get
    right. :-)

    Each queue was a power of two in length, the write and read indices were
    full 32-bit unsigned variables that I masked before accessing the work
    list, so I never needed to worry about queue end wraparound.

    The writer simply allocated work to each queue in round robin fashion.


    You should look at the RSEQ (restartable sequences) code for SPMC
    queue. It's in assembler (basically because RSEQ needs to know
    the exact address of the commit instruction), but essentially an
    enqueue operation is

    tail->next = &added_node;
    tail = &added_node; // the commit instruction

If you get interrupted before the tail gets updated, it doesn't matter
because the next successful enqueue will work fine. You
    don't even have to null the next pointer of the node you
    are enqueuing.

    The only downside is that if you are on a thousand processor
    box, you will have 1000 SPMC queues.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Tue Sep 10 17:34:28 2024
    On 9/10/24 14:46, Chris M. Thomasson wrote:
    On 9/10/2024 7:40 AM, jseigh wrote:


    One of the things I have in my implementation is that an enqueue
    operation can detect if wrap occurred if it got a interrupted by
    a time slice end and log how far behind it got.  I'm seeing about
    200,000 enqueue operations in those cases.  That would have been
    a huge performance hit if my queue wasn't lock-free.

    Is your queue similar to the M&S Queue?

    https://people.csail.mit.edu/xchen/parallel-computing/queue.pdf


    Yes.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Paul A. Clayton on Wed Sep 11 09:22:11 2024
    On 9/11/24 00:15, Paul A. Clayton wrote:
    On 9/9/24 3:14 AM, Terje Mathisen wrote:
    jseigh wrote:

    I'm not so sure about making the memory lock granularity same as
    cache line size but that's an implementation decision I guess.

    Just make sure you never have multiple locks residing inside the same
    cache line!

    Never?

    I suspect at least theoretically conditions could exist where
    having more than one lock within a cache line would be beneficial.

    If lock B is always acquired after lock A, then sharing a cache
    line might (I think) improve performance. One would lose
    prefetched capacity for the data protected by lock A and lock B.
    This assumes simple locks (e.g., not readers-writer locks).

    It seems to me that the pingpong problem may be less important
    than spatial locality depending on the contention for the cache
    line and the cache hierarchy locality of the contention
    (pingponging from a shared level of cache would be less
    expensive).

I've been performance testing on a 4 core, 8 hw thread cpu and
generally it's faster if I run on 4 hw threads, 1 per core,
than on all 8 hw threads. But if I run on 2 hw threads, both
on the same core, it's 2x faster than 2 hw threads on 2 cores,
1 per core. Probably because they're running out of the L1/L2
core-only cache rather than the L3 global cache.

    If work behind highly active locks is preferentially or forcefully
    localized, pingponging would be less of a problem, it seems.
    Instead of an arbitrary core acquiring a lock's cache line and
    doing some work, the core could send a message to the natural owner of
    the cache line to do the work.

    If communication between cores was low latency and simple messages
    used little bandwidth, one might also conceive of having a lock
    manager that tracks the lock state and sends a granted or not-
    granted message back. This assumes that the memory location of the
    lock itself is separate from the data guarded by the lock.

    Being able to grab a snapshot of some data briefly without
    requiring (longer-term) ownership change might be useful even
    beyond lock probing (where a conventional MESI would change the
    M-state cache to S forcing a request for ownership when the lock
    is released). I recall some paper proposed expiring cache line
    ownership to reduce coherence overhead.

    Within a multiple-cache-line atomic operation/memory transaction,
    I _think_ if the write set is owned, the read set could be grabbed
    as such snapshots. I.e., I think any remote write to the read set
    could be "after" the atomic/transaction commits. (Such might be
    too difficult to get right while still providing any benefit.)

Well there's HTM (hardware transactional memory) but apparently
that's hard to get right and it's limited in snapshot size.

    In software there's a large body of lock-free data structures
    using various deferred reclamation schemes, RCU, hazard pointers,
    etc... RCU has zero read access cost overhead, hazard pointers
    have almost zero cost, about 3x the cost of a pipelined load.
Hazard pointers got a lot faster when the need for a store/load memory
barrier was gotten rid of.

These lock-free data structures can be quite huge, e.g.
a lock-free map with millions of entries used to cache a redis
database with billions of entries. Certainly bigger than
    you can fit into hw cache.

    (Weird side-thought: I wonder if a conservative filter might be
    useful for locking, particularly for writer locks. On the one
    hand, such would increase the pingpong in the filter when writer
    locks are set/cleared; on the other hand, reader locks could use
    a remote increment within the filter check atomic to avoid slight
    cache pollution.)

    There are reader/writer bakery style spin locks. Bakery style
    spin locks are more cache friendly than regular spin locks.
An interlocked operation, fetch_and_add, is only done once
to get the wait ticket, and then the code just spins testing
the next value, which should be pretty efficient on a strongly
coherent cache.

    An early description of rw spin locks https://groups.google.com/g/comp.programming/c/tHkE6R4joe0/m/1NyR2OkkHJ4J
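
A minimal ticket-lock sketch of that idea (writer-only version, names mine): one fetch_add to take a ticket, then a plain read spin on now_serving, which stays shared in every waiter's cache until the holder releases.

#include <atomic>
#include <cstdint>

struct ticket_lock {
    std::atomic<uint32_t> next_ticket{0};
    std::atomic<uint32_t> now_serving{0};

    void lock() {
        // One interlocked op to join the queue...
        uint32_t my_ticket = next_ticket.fetch_add(1, std::memory_order_relaxed);
        // ...then spin on a value that is only read, so the line stays shared.
        while (now_serving.load(std::memory_order_acquire) != my_ticket) {
            // optional: cpu pause / backoff here
        }
    }

    void unlock() {
        now_serving.store(now_serving.load(std::memory_order_relaxed) + 1,
                          std::memory_order_release);
    }
};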

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Sep 11 10:33:16 2024
    I suspect at least theoretically conditions could exist where
    having more than one lock within a cache line would be beneficial.
    If lock B is always acquired after lock A, then sharing a cache
    line might (I think) improve performance. One would lose

    I suspect in practice this almost never happens because if it did, it
    would mean that the program would benefit from merging those two locks
    into a single one.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to jseigh on Wed Sep 11 14:04:17 2024
    On 9/9/24 14:46, jseigh wrote:
    Some testing of the effects of backoff on contention.


    I don't think we are getting anywhere in this line of inquiry. I
    think I will break off for a while and work on a lemonade
    recipe (obscure reference to an old sig).

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)