• arm ldxr/stxr vs cas

    From jseigh@21:1/5 to All on Mon Sep 2 13:27:57 2024
    I read that arm added the cas instruction because they didn't think
    ldxr/stxr would scale well. It wasn't clear to me as to why that
    would be the case. I would think the memory lock mechanism would
    have really low overhead vs cas having to do an interlocked load
    and store. Unless maybe the memory lock size might be large
    enough to cause false sharing issues. Any ideas?
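
A minimal sketch of the two flavours being compared, assuming GCC/Clang inline asm and an ARMv8.1-A target for the LSE version; the function names are mine and purely illustrative, not from the post:

#include <cstdint>

// Hand-rolled CAS from exclusives: the store-exclusive can fail spuriously
// (reservation lost to another writer, an interrupt, etc.), so it needs a loop.
inline bool cas_llsc(uint64_t* addr, uint64_t expected, uint64_t desired) {
    uint64_t observed;
    uint32_t status;
    do {
        asm volatile("ldaxr %0, [%1]" : "=r"(observed) : "r"(addr) : "memory");
        if (observed != expected) {
            asm volatile("clrex" ::: "memory");  // drop the exclusive reservation
            return false;
        }
        asm volatile("stlxr %w0, %2, [%1]"
                     : "=&r"(status) : "r"(addr), "r"(desired) : "memory");
    } while (status != 0);                       // 0 means the store-exclusive succeeded
    return true;
}

// ARMv8.1-A LSE CAS: one interlocked instruction; it only fails by observing
// a different value, never spuriously.
inline bool cas_lse(uint64_t* addr, uint64_t expected, uint64_t desired) {
    uint64_t old_val = expected;                 // CASAL: compare value in, old value out
    asm volatile("casal %0, %2, [%1]"
                 : "+r"(old_val) : "r"(addr), "r"(desired) : "memory");
    return old_val == expected;
}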

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Mon Sep 2 15:59:36 2024
    On 9/2/24 14:58, Chris M. Thomasson wrote:
    On 9/2/2024 11:55 AM, Chris M. Thomasson wrote:
    On 9/2/2024 10:27 AM, jseigh wrote:
    I read that arm added the cas instruction because they didn't think
    ldxr/stxr would scale well.  It wasn't clear to me as to why that
    would be the case.  I would think the memory lock mechanism would
    have really low overhead vs cas having to do an interlocked load
    and store.  Unless maybe the memory lock size might be large
    enough to cause false sharing issues.  Any ideas?

Reservation granularity? Wrt the PPC, wow this is an older thread:

    https://groups.google.com/g/comp.arch/c/yREvvvKvr6k/m/nRZ5tpLwDNQJ

    LL/SC can spuriously fail... It's more obstruction free than
    lock-free? I think this is why a strong and weak CAS are in the C/C++
    std's?


    Btw, have you talked with Alexander Terekhov lately? He is a smart guy. :^)

    What, where? c.p.t. is gone. It's crickets everywhere else.
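
On the weak vs. strong CAS point above, a minimal C++11 sketch (illustrative only): compare_exchange_weak may fail spuriously, much as an LL/SC-based implementation can, so it normally sits in a retry loop, while compare_exchange_strong only fails when the values actually differ.

#include <atomic>

std::atomic<int> value{0};

// Weak CAS: may fail spuriously, so it is used inside a retry loop anyway.
void increment() {
    int expected = value.load(std::memory_order_relaxed);
    while (!value.compare_exchange_weak(expected, expected + 1,
                                        std::memory_order_acq_rel,
                                        std::memory_order_relaxed)) {
        // 'expected' has been reloaded with the observed value; just retry.
    }
}

// Strong CAS: fails only if the stored value differs from 'expected',
// which some state-machine style algorithms rely on.
bool try_claim(int expected, int desired) {
    return value.compare_exchange_strong(expected, desired,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed);
}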

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to jseigh on Wed Sep 4 21:27:32 2024
    On Mon, 2 Sep 2024 17:27:57 +0000, jseigh wrote:

    I read that arm added the cas instruction because they didn't think
    ldxr/stxr would scale well. It wasn't clear to me as to why that
    would be the case. I would think the memory lock mechanism would
    have really low overhead vs cas having to do an interlocked load
    and store. Unless maybe the memory lock size might be large
    enough to cause false sharing issues. Any ideas?

    A pipeline lock between the LD part of a CAS and the ST part of a
    CAS is essentially FREE. But the same is true for LL followed by
    a later SC.

    Older machines with looser than sequential consistency memory models
    and running OoO have a myriad of problems with LL - SC. This is
    why My 66000 architecture switches from causal consistency to
    sequential consistency when it encounters <effectively> LL and
switches back after seeing SC.

    No Fences necessary with causal consistency.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Thu Sep 5 00:59:36 2024
    On Wed, 4 Sep 2024 23:48:48 +0000, Chris M. Thomasson wrote:

    On 9/4/2024 2:27 PM, MitchAlsup1 wrote:
    On Mon, 2 Sep 2024 17:27:57 +0000, jseigh wrote:

    I read that arm added the cas instruction because they didn't think
    ldxr/stxr would scale well.  It wasn't clear to me as to why that
    would be the case.  I would think the memory lock mechanism would
    have really low overhead vs cas having to do an interlocked load
    and store.  Unless maybe the memory lock size might be large
    enough to cause false sharing issues.  Any ideas?

    A pipeline lock between the LD part of a CAS and the ST part of a
    CAS is essentially FREE. But the same is true for LL followed by
    a later SC.

    100% sure on that? No way to break the reservation from an unrelated
    aspect wrt LL/SC?

    I have been building pipelines like that since 1983.

    Now, it is completely possible to build a pipeline that gives
    one grief (lesser or greater) in doing these things--but it is
    definitely possible to build a grief free pipeline for LL- SC,
    and by extension CAS.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to All on Thu Sep 5 07:33:23 2024
    On 9/4/2024 5:27 PM, MitchAlsup1 wrote:
    On Mon, 2 Sep 2024 17:27:57 +0000, jseigh wrote:

    I read that arm added the cas instruction because they didn't think
    ldxr/stxr would scale well.  It wasn't clear to me as to why that
    would be the case.  I would think the memory lock mechanism would
    have really low overhead vs cas having to do an interlocked load
    and store.  Unless maybe the memory lock size might be large
    enough to cause false sharing issues.  Any ideas?

    A pipeline lock between the LD part of a CAS and the ST part of a
    CAS is essentially FREE. But the same is true for LL followed by
    a later SC.

    Older machines with looser than sequential consistency memory models
    and running OoO have a myriad of problems with LL - SC. This is
    why My 66000 architecture switches from causal consistency to
    sequential consistency when it encounters <effectively> LL and
    switches bac after seeing SC.

    No Fences necessary with causal consistency.


    I'm not sure I entirely follow. I was thinking of the effects on
cache. In theory the SC could fail without having to get the current
cache line exclusive, or without getting it at all. CAS has to get it
exclusive before it can definitively fail.

    Whenever they get around to making arm desktops without the OS tax
    so I can install linux I can compare the 2.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to jseigh on Thu Sep 5 14:17:19 2024
    jseigh <jseigh_es00@xemaps.com> writes:
    On 9/4/2024 5:27 PM, MitchAlsup1 wrote:
    On Mon, 2 Sep 2024 17:27:57 +0000, jseigh wrote:

    I read that arm added the cas instruction because they didn't think
    ldxr/stxr would scale well.  It wasn't clear to me as to why that
    would be the case.  I would think the memory lock mechanism would
    have really low overhead vs cas having to do an interlocked load
    and store.  Unless maybe the memory lock size might be large
    enough to cause false sharing issues.  Any ideas?

    A pipeline lock between the LD part of a CAS and the ST part of a
    CAS is essentially FREE. But the same is true for LL followed by
    a later SC.

    Older machines with looser than sequential consistency memory models
    and running OoO have a myriad of problems with LL - SC. This is
    why My 66000 architecture switches from causal consistency to
    sequential consistency when it encounters <effectively> LL and
    switches bac after seeing SC.

    No Fences necessary with causal consistency.


    I'm not sure I entirely follow. I was thinking of the effects on
    cache. In theory the SC could fail without having get the current
    cache line exclusive or at all. CAS has to get it exclusive before
    it can definitively fail.

    The nice property of atomic and CAS instructions is that the processor
    can delegate them to an agent closer to memory (e.g. LLC). Since the
    processor must acquire the line anyway, once it has the line the
    atomicity is satisfied.

    For uncached accesses (e.g. PCIexpress), they also can be delegated to
    the PCI Express endpoint directly.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to jseigh on Thu Sep 5 15:04:18 2024
    jseigh <jseigh_es00@xemaps.com> writes:
    Whenever they get around to making arm desktops without the OS tax
    so I can install linux

    What's wrong with the Raspi 5, or the Rock 5B? Or, if you want to
    spend a lot more to get a lot more cores, you can buy workstations
    with an Ampere Altra starting at EUR 3199 for 64 cores at <https://www.mifcom.de/workstations-ampere-altra-cid774>.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to jseigh on Thu Sep 5 19:46:25 2024
    On Thu, 5 Sep 2024 11:33:23 +0000, jseigh wrote:

    On 9/4/2024 5:27 PM, MitchAlsup1 wrote:
    On Mon, 2 Sep 2024 17:27:57 +0000, jseigh wrote:

    I read that arm added the cas instruction because they didn't think
    ldxr/stxr would scale well.  It wasn't clear to me as to why that
    would be the case.  I would think the memory lock mechanism would
    have really low overhead vs cas having to do an interlocked load
    and store.  Unless maybe the memory lock size might be large
    enough to cause false sharing issues.  Any ideas?

    A pipeline lock between the LD part of a CAS and the ST part of a
    CAS is essentially FREE. But the same is true for LL followed by
    a later SC.

    Older machines with looser than sequential consistency memory models
    and running OoO have a myriad of problems with LL - SC. This is
    why My 66000 architecture switches from causal consistency to
    sequential consistency when it encounters <effectively> LL and
    switches bac after seeing SC.

    No Fences necessary with causal consistency.


    I'm not sure I entirely follow. I was thinking of the effects on
    cache. In theory the SC could fail without having get the current
    cache line exclusive or at all. CAS has to get it exclusive before
    it can definitively fail.

    A LL that takes a miss in L1 will perform a fetch with intent to modify,
    so will a CAS. However, LL is allowed to silently fail if exclusive is
    not returned from its fetch, deferring atomic failure to SC, while CAS
    will fail when exclusive fails to return. LL-SC is designed so that
    when a failure happens, failure is visible at SC not necessarily at LL.

There are coherence protocols that allow the 2nd party to determine
    if it returns exclusive or not. The example I know is when the 2nd
    party is already performing an atomic event and it is better to fail
    the starting atomic event than to fail an ongoing atomic event.
    In My 66000 the determination is made under the notion of priority::
the higher priority thread is allowed to continue while the lower
    priority thread takes the failure. The higher priority thread can
    be the requestor (1st party) or the holder of data (2nd party)
    while all interested observers (3rd parties) are in a position
    to see what transpired and act accordingly (causal).

    Whenever they get around to making arm desktops without the OS tax
    so I can install linux I can compare the 2.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Thu Sep 5 17:49:32 2024
    On 9/5/24 16:34, Chris M. Thomasson wrote:
    On 9/5/2024 12:46 PM, MitchAlsup1 wrote:
    On Thu, 5 Sep 2024 11:33:23 +0000, jseigh wrote:

    On 9/4/2024 5:27 PM, MitchAlsup1 wrote:
    On Mon, 2 Sep 2024 17:27:57 +0000, jseigh wrote:

    I read that arm added the cas instruction because they didn't think
    ldxr/stxr would scale well.  It wasn't clear to me as to why that
    would be the case.  I would think the memory lock mechanism would
    have really low overhead vs cas having to do an interlocked load
    and store.  Unless maybe the memory lock size might be large
    enough to cause false sharing issues.  Any ideas?

    A pipeline lock between the LD part of a CAS and the ST part of a
    CAS is essentially FREE. But the same is true for LL followed by
    a later SC.

    Older machines with looser than sequential consistency memory models
    and running OoO have a myriad of problems with LL - SC. This is
    why My 66000 architecture switches from causal consistency to
    sequential consistency when it encounters <effectively> LL and
    switches bac after seeing SC.

    No Fences necessary with causal consistency.


    I'm not sure I entirely follow.  I was thinking of the effects on
    cache.  In theory the SC could fail without having get the current
    cache line exclusive or at all.  CAS has to get it exclusive before
    it can definitively fail.

    A LL that takes a miss in L1 will perform a fetch with intent to modify,
    so will a CAS. However, LL is allowed to silently fail if exclusive is
    not returned from its fetch, deferring atomic failure to SC, while CAS
    will fail when exclusive fails to return.

    CAS should only fail when the comparands are not equal to each other.
    Well, then there is the damn weak and strong CAS in C++11... ;^o


    LL-SC is designed so that
    when a failure happens, failure is visible at SC not necessarily at LL.

    There are coherence protocols that allows the 2nd party to determine
    if it returns exclusive or not. The example I know is when the 2nd
    party is already performing an atomic event and it is better to fail
    the starting atomic event than to fail an ongoing atomic event.
    In My 66000 the determination is made under the notion of priority::
    the higher priority thread is allows to continue while the lower
    priority thread takes the failure. The higher priority thread can
    be the requestor (1st party) or the holder of data (2nd party)
    while all interested observers (3rd parties) are in a position
    to see what transpired and act accordingly (causal).


I'm not so sure about making the memory lock granularity the same as
the cache line size but that's an implementation decision I guess.

    I do like the idea of detecting potential contention at the
    start of LL/SC so you can do back off. Right now the only way I
    can detect contention is after the fact when the CAS fails and
    I probably have the cache line exclusive at that point. It's
    pretty problematic.
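
As a rough sketch of that after-the-fact backoff (my code, assuming an x86 or AArch64 pause hint, nothing here is from the post): the only contention signal is the failed CAS itself, at which point the line has typically already been pulled in exclusive.

#include <atomic>

inline void cpu_relax() {
#if defined(__x86_64__)
    asm volatile("pause");
#elif defined(__aarch64__)
    asm volatile("yield");
#endif
}

// Retry a CAS with truncated exponential backoff; contention is only
// detected after the fact, when the compare-exchange fails.
template <typename T>
void cas_store_with_backoff(std::atomic<T>& target, T desired) {
    T expected = target.load(std::memory_order_relaxed);
    unsigned spins = 1;
    while (!target.compare_exchange_weak(expected, desired,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed)) {
        for (unsigned i = 0; i < spins; ++i) cpu_relax();
        if (spins < 1024) spins <<= 1;           // exponential, capped
    }
}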

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to jseigh on Thu Sep 5 23:45:36 2024
    On Thu, 5 Sep 2024 21:49:32 +0000, jseigh wrote:

    On 9/5/24 16:34, Chris M. Thomasson wrote:
    On 9/5/2024 12:46 PM, MitchAlsup1 wrote:
    On Thu, 5 Sep 2024 11:33:23 +0000, jseigh wrote:

    On 9/4/2024 5:27 PM, MitchAlsup1 wrote:
    On Mon, 2 Sep 2024 17:27:57 +0000, jseigh wrote:

I read that arm added the cas instruction because they didn't think
ldxr/stxr would scale well.  It wasn't clear to me as to why that
would be the case.  I would think the memory lock mechanism would
have really low overhead vs cas having to do an interlocked load
    and store.  Unless maybe the memory lock size might be large
    enough to cause false sharing issues.  Any ideas?

    A pipeline lock between the LD part of a CAS and the ST part of a
    CAS is essentially FREE. But the same is true for LL followed by
    a later SC.

Older machines with looser than sequential consistency memory models
and running OoO have a myriad of problems with LL - SC. This is
    why My 66000 architecture switches from causal consistency to
    sequential consistency when it encounters <effectively> LL and
    switches bac after seeing SC.

    No Fences necessary with causal consistency.


    I'm not sure I entirely follow.  I was thinking of the effects on
    cache.  In theory the SC could fail without having get the current
    cache line exclusive or at all.  CAS has to get it exclusive before
    it can definitively fail.

A LL that takes a miss in L1 will perform a fetch with intent to modify,
so will a CAS. However, LL is allowed to silently fail if exclusive is
    not returned from its fetch, deferring atomic failure to SC, while CAS
    will fail when exclusive fails to return.

    CAS should only fail when the comparands are not equal to each other.
    Well, then there is the damn weak and strong CAS in C++11... ;^o


    LL-SC is designed so that
    when a failure happens, failure is visible at SC not necessarily at LL.

    There are coherence protocols that allows the 2nd party to determine
    if it returns exclusive or not. The example I know is when the 2nd
    party is already performing an atomic event and it is better to fail
    the starting atomic event than to fail an ongoing atomic event.
    In My 66000 the determination is made under the notion of priority::
    the higher priority thread is allows to continue while the lower
    priority thread takes the failure. The higher priority thread can
    be the requestor (1st party) or the holder of data (2nd party)
    while all interested observers (3rd parties) are in a position
    to see what transpired and act accordingly (causal).


    I'm not so sure about making the memory lock granularity same as
    cache line size but that's an implementation decision I guess.

    I do like the idea of detecting potential contention at the
    start of LL/SC so you can do back off. Right now the only way I
    can detect contention is after the fact when the CAS fails and
    I probably have the cache line exclusive at that point. It's
    pretty problematic.

    My 66000 ISA has a pipelined version of LL-SC where there can be
as many as 8 LLs (to different cache lines) with as many LDs and
    STs to those lines as one desires. There are 2 cases::
    a) failure detected before checking for interference
    b) failure detected after checking for interference

    Failure before returns control to the first instruction.
    Failure after transfers control to a known location.

    case a is used to emulate Test&Set, Test&test&set, CAS, DCAS
    case b is used when doing more exotic atomic stuff.

    Note: interference is based on physical address not the
    data contained at that address.


    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Fri Sep 6 19:57:57 2024
    On Fri, 6 Sep 2024 19:36:36 +0000, Chris M. Thomasson wrote:

    On 9/5/2024 2:49 PM, jseigh wrote:
    On 9/5/24 16:34, Chris M. Thomasson wrote:
    On 9/5/2024 12:46 PM, MitchAlsup1 wrote:
    On Thu, 5 Sep 2024 11:33:23 +0000, jseigh wrote:

    On 9/4/2024 5:27 PM, MitchAlsup1 wrote:
    On Mon, 2 Sep 2024 17:27:57 +0000, jseigh wrote:

I read that arm added the cas instruction because they didn't think
ldxr/stxr would scale well.  It wasn't clear to me as to why that
would be the case.  I would think the memory lock mechanism would
have really low overhead vs cas having to do an interlocked load
and store.  Unless maybe the memory lock size might be large
    enough to cause false sharing issues.  Any ideas?

    A pipeline lock between the LD part of a CAS and the ST part of a
    CAS is essentially FREE. But the same is true for LL followed by
    a later SC.

Older machines with looser than sequential consistency memory models
and running OoO have a myriad of problems with LL - SC. This is
    why My 66000 architecture switches from causal consistency to
    sequential consistency when it encounters <effectively> LL and
    switches bac after seeing SC.

    No Fences necessary with causal consistency.


    I'm not sure I entirely follow.  I was thinking of the effects on
    cache.  In theory the SC could fail without having get the current
cache line exclusive or at all.  CAS has to get it exclusive before
it can definitively fail.

A LL that takes a miss in L1 will perform a fetch with intent to modify,
so will a CAS. However, LL is allowed to silently fail if exclusive is
not returned from its fetch, deferring atomic failure to SC, while CAS
will fail when exclusive fails to return.

    CAS should only fail when the comparands are not equal to each other.
    Well, then there is the damn weak and strong CAS in C++11... ;^o


    LL-SC is designed so that
when a failure happens, failure is visible at SC not necessarily at LL.
    There are coherence protocols that allows the 2nd party to determine
    if it returns exclusive or not. The example I know is when the 2nd
    party is already performing an atomic event and it is better to fail
    the starting atomic event than to fail an ongoing atomic event.
    In My 66000 the determination is made under the notion of priority::
    the higher priority thread is allows to continue while the lower
    priority thread takes the failure. The higher priority thread can
    be the requestor (1st party) or the holder of data (2nd party)
    while all interested observers (3rd parties) are in a position
    to see what transpired and act accordingly (causal).


    I'm not so sure about making the memory lock granularity same as
    cache line size but that's an implementation decision I guess.

    I do like the idea of detecting potential contention at the
    start of LL/SC so you can do back off.  Right now the only way I
    can detect contention is after the fact when the CAS fails and
    I probably have the cache line exclusive at that point.  It's
    pretty problematic.

    I wonder if the ability to determine why a "weak" CAS failed might help.
    They (weak) can fail for other reasons besides comparing comparands...
    Well, would be a little too low level for a general atomic op in
    C/C++11?

    One can detect that the CAS-line is no longer exclusive as a form
    of weak failure, rather than waiting for the data to show up and
    fail strongly on the compare.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to All on Sat Sep 7 11:02:56 2024
    On 9/6/24 15:57, MitchAlsup1 wrote:
    On Fri, 6 Sep 2024 19:36:36 +0000, Chris M. Thomasson wrote:

    On 9/5/2024 2:49 PM, jseigh wrote:
    On 9/5/24 16:34, Chris M. Thomasson wrote:
    On 9/5/2024 12:46 PM, MitchAlsup1 wrote:
    On Thu, 5 Sep 2024 11:33:23 +0000, jseigh wrote:

    On 9/4/2024 5:27 PM, MitchAlsup1 wrote:
    On Mon, 2 Sep 2024 17:27:57 +0000, jseigh wrote:

I read that arm added the cas instruction because they didn't think
ldxr/stxr would scale well.  It wasn't clear to me as to why that
would be the case.  I would think the memory lock mechanism would
have really low overhead vs cas having to do an interlocked load
and store.  Unless maybe the memory lock size might be large
    enough to cause false sharing issues.  Any ideas?

A pipeline lock between the LD part of a CAS and the ST part of a
CAS is essentially FREE. But the same is true for LL followed by
a later SC.

Older machines with looser than sequential consistency memory models
and running OoO have a myriad of problems with LL - SC. This is
    why My 66000 architecture switches from causal consistency to
    sequential consistency when it encounters <effectively> LL and
    switches bac after seeing SC.

    No Fences necessary with causal consistency.


I'm not sure I entirely follow.  I was thinking of the effects on
cache.  In theory the SC could fail without having get the current
cache line exclusive or at all.  CAS has to get it exclusive before
it can definitively fail.

A LL that takes a miss in L1 will perform a fetch with intent to modify,
so will a CAS. However, LL is allowed to silently fail if exclusive is
not returned from its fetch, deferring atomic failure to SC, while CAS
will fail when exclusive fails to return.

    CAS should only fail when the comparands are not equal to each other.
    Well, then there is the damn weak and strong CAS in C++11... ;^o


    LL-SC is designed so that
    when a failure happens, failure is visible at SC not necessarily at
    LL.

There are coherence protocols that allows the 2nd party to determine
if it returns exclusive or not. The example I know is when the 2nd
party is already performing an atomic event and it is better to fail
the starting atomic event than to fail an ongoing atomic event.
In My 66000 the determination is made under the notion of priority::
the higher priority thread is allows to continue while the lower
    priority thread takes the failure. The higher priority thread can
    be the requestor (1st party) or the holder of data (2nd party)
    while all interested observers (3rd parties) are in a position
    to see what transpired and act accordingly (causal).


    I'm not so sure about making the memory lock granularity same as
    cache line size but that's an implementation decision I guess.

    I do like the idea of detecting potential contention at the
    start of LL/SC so you can do back off.  Right now the only way I
    can detect contention is after the fact when the CAS fails and
    I probably have the cache line exclusive at that point.  It's
    pretty problematic.

    I wonder if the ability to determine why a "weak" CAS failed might help.
    They (weak) can fail for other reasons besides comparing comparands...
    Well, would be a little too low level for a general atomic op in
    C/C++11?

    One can detect that the CAS-line is no longer exclusive as a form
    of weak failure, rather than waiting for the data to show up and
    fail strongly on the compare.

    There is no requirement for CAS to calculate the expected value in
    any way, though typically the expected value is loaded from the CAS
target. In fact you can use random values and it will still work;
it will just take a lot longer. A typical optimization for pushing onto
    a stack that you expect to be empty more often than not is to
    initially load NULL as expected value instead of loading from the
    stack anchor, a load immediate vs load from storage.
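
A sketch of that push optimization (names are mine, not from the post): the first expected value is a nullptr immediate rather than a load from the anchor, and a failed CAS hands back the observed head for the retry.

#include <atomic>

struct node {
    node* next;
    // payload ...
};

std::atomic<node*> stack_head{nullptr};

// Optimistic push: assume the stack is empty, so the initial expected value
// is a load-immediate (nullptr) rather than a load from stack_head. On
// failure, compare_exchange writes the observed head back into 'expected'.
void push(node* n) {
    node* expected = nullptr;
    do {
        n->next = expected;
    } while (!stack_head.compare_exchange_weak(expected, n,
                                               std::memory_order_release,
                                               std::memory_order_relaxed));
}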

    x64 doesn't have an atomic 128 bit load but cmpxchg16b works
    ok nonetheless. The 2 64 bit loads just have to be effectively
    atomic most of the time or you can use the updated result from
    cmpxchg16b.

aarch64 didn't have an atomic 128 bit load, LDP, early on. You
had to do an LDXP/STXP to determine if the load was atomic. In
    practice if you're doing a LDXP/STXP loop anyway it doesn't
    matter too much as long as you can handle the occasional
    random 128 bit value.

I have had some success with after-the-fact contention backoff.
    I get 30% to 50% improvement in most cases. The main challenge
    is getting a 100+ nanosecond pause. nanosleep() doesn't hack it.
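
For the pause itself, a common workaround (a sketch under my own assumptions, not from the post) is to spin on a fast clock with a pause/yield hint, since nanosleep() goes through the scheduler and typically costs tens of microseconds:

#include <chrono>

// Busy-wait for roughly 'ns' nanoseconds without involving the scheduler.
inline void short_pause(unsigned ns) {
    auto deadline = std::chrono::steady_clock::now() +
                    std::chrono::nanoseconds(ns);
    while (std::chrono::steady_clock::now() < deadline) {
#if defined(__x86_64__)
        asm volatile("pause");
#elif defined(__aarch64__)
        asm volatile("yield");
#endif
    }
}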

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Sun Sep 8 00:59:51 2024
    On Sat, 7 Sep 2024 23:16:35 +0000, Chris M. Thomasson wrote:

    On 9/7/2024 4:14 PM, Chris M. Thomasson wrote:
    [...]
    When I am using CAS I don't really expect it to fail willy nilly even if
    the comparands are still the same. Weak vs Strong. Still irks me a bit.
    ;^)

    There are algorithms out there, usually state machines that depend on
    strong cas. When a CAS fails, it depends on it failing because the
    comparands were actually different...

    Leading to ABA failures::

    Do you really want the following CAS to succeed ??

LD R19,[someMemoryLocation]
    ..
    interrupt delays program execution for 1 week
    ..
    CAS R17,R19,[someMemoryLocation]

    Given that the someMemoryLocation is accessible to other programs
while this one is sleeping ??

    Thus, it seems reasonable to fail a CAS when one cannot determine
    if the memory location has been changed and changed back in the
    mean time.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Sun Sep 8 07:53:59 2024
    On 9/8/24 02:35, Chris M. Thomasson wrote:
    On 9/7/2024 5:59 PM, MitchAlsup1 wrote:
    On Sat, 7 Sep 2024 23:16:35 +0000, Chris M. Thomasson wrote:

    On 9/7/2024 4:14 PM, Chris M. Thomasson wrote:
    [...]
    When I am using CAS I don't really expect it to fail willy nilly
    even if
the comparands are still the same. Weak vs Strong. Still irks me a bit.
;^)

    There are algorithms out there, usually state machines that depend on
    strong cas. When a CAS fails, it depends on it failing because the
    comparands were actually different...

    Leading to ABA failures::

    Do you really want the following CAS to succeed ??

         LD    R19,[someMemoryValue]
    ..
    interrupt delays program execution for 1 week
    ..
         CAS   R17,R19,[someMemoryLocation]

    Given that the someMemoryLocation is accessible to other programs
    while tis one is sleeping ??

    Thus, it seems reasonable to fail a CAS when one cannot determine
    if the memory location has been changed and changed back in the
    mean time.

    ABA, well that can happen with CAS and certain algorithms that use them.
    The good ol' version counter is pretty nice, but it still can fail. 64
    bit words, 128 bit CAS. Loads to boot... ;^) Actually, I think Joe
    mentioned something interesting about CAS a long time ago wrt IBM...
    Candy Cane Striped books (Joe do you remember?) about a way to avoid
    live lock and failing in the os. I think Windows has some like it with
    their SList and SEH?

    You can have an ABA problem with single word CAS if your values (usually pointers) are one word in size. The solution to that is to use double
    word CAS, DWCAS, with the other word being a monotonic sequence number.
The logic is that it would take longer to wrap and reuse the sequence
number than any reasonable delay in execution. In the 70's, S/370
CDS was 64 bits (two 32-bit words) and it was estimated that it would
take a year to wrap a 32-bit counter.
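
A sketch of that pointer-plus-counter scheme, using GCC's 16-byte __atomic builtins rather than standard C/C++ atomics (needs -mcx16 on x86-64, and whether it ends up lock-free depends on the compiler/libatomic, so treat it as illustrative):

#include <cstdint>

struct node { node* next; };

// 16-byte anchor: pointer plus a monotonically increasing sequence number.
struct alignas(16) versioned_ptr {
    node*    ptr;
    uint64_t seq;
};

versioned_ptr stack_anchor = {nullptr, 0};

// Pop via DWCAS (cmpxchg16b on x86-64): even if the same pointer value
// reappears (ABA), the sequence number has moved on and the CAS fails.
// Safe reclamation of popped nodes (hazard pointers, RCU, ...) is a
// separate problem not shown here.
node* pop() {
    versioned_ptr expected, desired;
    __atomic_load(&stack_anchor, &expected, __ATOMIC_ACQUIRE);
    for (;;) {
        if (expected.ptr == nullptr) return nullptr;
        desired.ptr = expected.ptr->next;
        desired.seq = expected.seq + 1;
        if (__atomic_compare_exchange(&stack_anchor, &expected, &desired,
                                      /*weak=*/false,
                                      __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
            return expected.ptr;
        // 'expected' now holds the freshly observed anchor; retry.
    }
}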

    The risc-v arch manual mentions that you need DWCAS or LL/SC to avoid
    the ABA problem and that they chose LL/SC to avoid dealing with 128
    bit atomic data sizes (their loss but I'm not going into that here).

    Anyway if you want to avoid the ABA problem, you can't do it with
    C/C++ atomics since they don't support DWCAS.

    The candy striped Principle of Operation was an internal restricted
    document with architecture and engineering notes added. The note
    was that for short CAS loops fairness should be guaranteed so all
    processors could make equal forward progress.

Speaking of live lock, in the 80's, VM/XA spin locks broke on a 4300
box. VM/XA used a test-and-set loop without any back off. The 4300 TS
microcode timed out getting a lock and failed silently. So all the
cpus spun forever trying to get a lock that was free. We sort of
figured it out when I provided a patch that put in a load and test
of the lock so we could set a hardware break point to see if the
lock was actually not free. It suddenly started working because
the patch provided enough backoff for the microcode locking
to start working again.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Chris M. Thomasson on Sun Sep 8 16:10:41 2024
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 9/7/2024 5:59 PM, MitchAlsup1 wrote:
    On Sat, 7 Sep 2024 23:16:35 +0000, Chris M. Thomasson wrote:

    Thus, it seems reasonable to fail a CAS when one cannot determine
    if the memory location has been changed and changed back in the
    mean time.

    I think Scott Lurndal mentioned something about CAS or something on
    windows that will assert the bus lock after a lot of failures...

    On AMD processors (and likely intel), if a core cannot acquire
a cache line in a finite time, the core will assert the bus lock
    to ensure forward progress.

    Nothing to do with the operating software; purely a hardware thing.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Sun Sep 8 20:46:21 2024
    On Sun, 08 Sep 2024 16:10:41 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 9/7/2024 5:59 PM, MitchAlsup1 wrote:
    On Sat, 7 Sep 2024 23:16:35 +0000, Chris M. Thomasson wrote:

    Thus, it seems reasonable to fail a CAS when one cannot determine
    if the memory location has been changed and changed back in the
    mean time.

    I think Scott Lurndal mentioned something about CAS or something on
    windows that will assert the bus lock after a lot of failures...

    On AMD processors (and likely intel), if a core cannot acquire
    a cache line in a a finite time, the core will assert the bus lock
    to ensure forward progress.

    Nothing to do with the operating software; purely a hardware thing.


I think, on AMD processors made in this century, the only cases that
resort to a physical bus lock are
A) atomic accesses that cross a cache line boundary
B) atomic accesses that address non-cached memory regions

Both are in the realm of "Dear software, please, never do it. If you think
that you need it, you are wrong."
And in both cases it's not related to any sort of timeout.

In the case of Intel, I think that since Nehalem they don't have a physical
Bus Lock signal in their processors.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sun Sep 8 17:52:08 2024
    On Sun, 8 Sep 2024 16:10:41 +0000, Scott Lurndal wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 9/7/2024 5:59 PM, MitchAlsup1 wrote:
    On Sat, 7 Sep 2024 23:16:35 +0000, Chris M. Thomasson wrote:

    Thus, it seems reasonable to fail a CAS when one cannot determine
    if the memory location has been changed and changed back in the
    mean time.

    I think Scott Lurndal mentioned something about CAS or something on
    windows that will assert the bus lock after a lot of failures...

    On AMD processors (and likely intel), if a core cannot acquire
    a cache line in a a finite time, the core will assert the bus lock
    to ensure forward progress.

    There are busses without the ability to LOCK (outside of x86).

    Nothing to do with the operating software; purely a hardware thing.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Michael S on Sun Sep 8 18:32:42 2024
    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 08 Sep 2024 16:10:41 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 9/7/2024 5:59 PM, MitchAlsup1 wrote:
    On Sat, 7 Sep 2024 23:16:35 +0000, Chris M. Thomasson wrote:

    Thus, it seems reasonable to fail a CAS when one cannot determine
    if the memory location has been changed and changed back in the
    mean time.

    I think Scott Lurndal mentioned something about CAS or something on
    windows that will assert the bus lock after a lot of failures...

    On AMD processors (and likely intel), if a core cannot acquire
    a cache line in a a finite time, the core will assert the bus lock
    to ensure forward progress.

    Nothing to do with the operating software; purely a hardware thing.


    I think, on AMD processors made in this century the only cases that
    resort to physical bus lock are
    A) atomic accesses that cross cache boundary
    B) atomic accesses that address non-cached memory regions

    C) A core cannot acquire a cache line in a finite time. We
    encountered this in 2010 on AMD Opteron processors with
    our HyperTransport connected CXL-like chip (designed in 2005);
    r/t latency could be as high as 800ns to remote memory.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to jseigh on Mon Sep 9 09:14:40 2024
    jseigh wrote:

    I'm not so sure about making the memory lock granularity same as
    cache line size but that's an implementation decision I guess.

    Just make sure you never have multiple locks residing inside the same
    cache line!


    I do like the idea of detecting potential contention at the
    start of LL/SC so you can do back off.  Right now the only way I
    can detect contention is after the fact when the CAS fails and
    I probably have the cache line exclusive at that point.  It's
    pretty problematic.

    I do prefer LOCK XADD instead of CAS (CmpXchg*), because the return
    value will also tell you which queue entry to pick/work on.

    It will not be optimal when really contended, but at least one
    participant will make forward progress, and typically several of them.
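
A small sketch of that LOCK XADD pattern (names mine): the value returned by fetch_add is the ticket/index of the slot the caller now owns, so each participant gets a distinct entry.

#include <atomic>
#include <cstdint>

constexpr uint32_t N = 1024;                 // power of two
int slots[N];
std::atomic<uint64_t> next_index{0};

// Each caller atomically claims a distinct slot; fetch_add compiles to
// LOCK XADD on x86, and its return value is the index to work on.
int* claim_slot() {
    uint64_t ticket = next_index.fetch_add(1, std::memory_order_acq_rel);
    return &slots[ticket & (N - 1)];
}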

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Mon Sep 9 12:00:05 2024
    On Sun, 08 Sep 2024 18:32:42 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 08 Sep 2024 16:10:41 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:


    On AMD processors (and likely intel), if a core cannot acquire
    a cache line in a a finite time, the core will assert the bus lock
    to ensure forward progress.

    Nothing to do with the operating software; purely a hardware
    thing.


    I think, on AMD processors made in this century the only cases that
    resort to physical bus lock are
    A) atomic accesses that cross cache boundary
    B) atomic accesses that address non-cached memory regions

    C) A core cannot acquire a cache line in a finite time. We
    encountered this in 2010 on AMD Opteron processors with
    our HyperTransport connected CXL-like chip (designed in 2005);
    r/t latency could be as high as 800ns to remote memory.


    PathScale Infinipath, I suppose.

Bought by QLogic, then sold to Sicortex, then picked up by Cray when
Sicortex went belly-up, then resold to Intel, where it became the base
for Omni-Path, which was discontinued in 2019 but, according to Wikipedia,
still lives on under Cornelis Networks.
    https://www.cornelisnetworks.com/

    I never liked attempts to use it as cache-coherent interconnect.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Terje Mathisen on Mon Sep 9 10:29:25 2024
    On 9/9/24 03:14, Terje Mathisen wrote:
    jseigh wrote:

    I'm not so sure about making the memory lock granularity same as
    cache line size but that's an implementation decision I guess.

    Just make sure you never have multiple locks residing inside the same
    cache line!

This is the terminology ARM uses when describing their LL/SC
implementation. It is not the best choice of terminology.


    I do like the idea of detecting potential contention at the
    start of LL/SC so you can do back off.  Right now the only way I
    can detect contention is after the fact when the CAS fails and
    I probably have the cache line exclusive at that point.  It's
    pretty problematic.

    I do prefer LOCK XADD instead of CAS (CmpXchg*), because the return
    value will also tell you which queue entry to pick/work on.

    It will not be optimal when really contended, but at least one
    participant will make forward progress, and typically several of them.


    I'm not aware of any lock-free queue algorithms that use
    atomic_fetch_add that are actually lock-free, error free,
    and/or don't have an ABA problem. I'm not saying there
    aren't, just that I'm not aware of them.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Michael S on Mon Sep 9 15:42:45 2024
    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 08 Sep 2024 18:32:42 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 08 Sep 2024 16:10:41 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:


    On AMD processors (and likely intel), if a core cannot acquire
    a cache line in a a finite time, the core will assert the bus lock
    to ensure forward progress.

    Nothing to do with the operating software; purely a hardware
    thing.


    I think, on AMD processors made in this century the only cases that
    resort to physical bus lock are
    A) atomic accesses that cross cache boundary
    B) atomic accesses that address non-cached memory regions

    C) A core cannot acquire a cache line in a finite time. We
    encountered this in 2010 on AMD Opteron processors with
    our HyperTransport connected CXL-like chip (designed in 2005);
    r/t latency could be as high as 800ns to remote memory.


    PathScale Infinipath, I suppose.

Actually, no.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Mon Sep 9 20:49:46 2024
    On Mon, 09 Sep 2024 15:42:45 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 08 Sep 2024 18:32:42 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 08 Sep 2024 16:10:41 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:


    On AMD processors (and likely intel), if a core cannot acquire
    a cache line in a a finite time, the core will assert the bus
    lock to ensure forward progress.

    Nothing to do with the operating software; purely a hardware
    thing.


    I think, on AMD processors made in this century the only cases
    that resort to physical bus lock are
    A) atomic accesses that cross cache boundary
    B) atomic accesses that address non-cached memory regions

    C) A core cannot acquire a cache line in a finite time. We
    encountered this in 2010 on AMD Opteron processors with
    our HyperTransport connected CXL-like chip (designed in 2005);
    r/t latency could be as high as 800ns to remote memory.


    PathScale Infinipath, I suppose.

    Actully, no.


    Horus?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Michael S on Mon Sep 9 18:49:00 2024
    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 09 Sep 2024 15:42:45 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 08 Sep 2024 18:32:42 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 08 Sep 2024 16:10:41 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:


    On AMD processors (and likely intel), if a core cannot acquire
    a cache line in a a finite time, the core will assert the bus
    lock to ensure forward progress.

    Nothing to do with the operating software; purely a hardware
    thing.


    I think, on AMD processors made in this century the only cases
    that resort to physical bus lock are
    A) atomic accesses that cross cache boundary
    B) atomic accesses that address non-cached memory regions

    C) A core cannot acquire a cache line in a finite time. We
    encountered this in 2010 on AMD Opteron processors with
    our HyperTransport connected CXL-like chip (designed in 2005);
    r/t latency could be as high as 800ns to remote memory.


    PathScale Infinipath, I suppose.

    Actully, no.


    Horus?


    https://www.infoworld.com/article/2201330/3leaf-systems-scale-up-by-scaling-out.html

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to All on Mon Sep 9 14:46:24 2024
    Some testing of the effects of backoff on contention.

    $ lscpu -e
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ   MINMHZ       MHZ
  0    0      0    0 0:0:0:0          yes 3400.0000 400.0000  799.9840
  1    0      0    1 1:1:1:0          yes 3400.0000 400.0000 2296.7949
  2    0      0    2 2:2:2:0          yes 3400.0000 400.0000  712.2160
  3    0      0    3 3:3:3:0          yes 3400.0000 400.0000 1574.7550
  4    0      0    0 0:0:0:0          yes 3400.0000 400.0000  799.9760
  5    0      0    1 1:1:1:0          yes 3400.0000 400.0000 2600.7561
  6    0      0    2 2:2:2:0          yes 3400.0000 400.0000 2599.2959
  7    0      0    3 3:3:3:0          yes 3400.0000 400.0000  400.0000

    so 4 cores, 2 threads / core

The options:
-n  #enqueues per producer
-p  #producers
-c  #consumers
-b  backoff, if > 0
-f  0 = linear backoff, 1 = exponential backoff
-t  queue type: mpmc, spsc, spmc, mpsc (default mpmc)

taskset is used to specify which cpus to run on.
There are a lot more stats being output but I'm
just showing the overall rate.

    $ taskset -c 0-7 test/lfrbtest -n 4000000 -p6 -c2 -f 0 -b 0
    overall rate = 5,988,679.5093 /sec

    $ taskset -c 0-3 test/lfrbtest -n 4000000 -p6 -c2 -f 0 -b 0
    overall rate = 7,206,319.9145 /sec

    $ taskset -c 0-7 test/lfrbtest -n 4000000 -p6 -c2 -f 1 -b 12
    overall rate = 7,243,427.7786 /sec

    $ taskset -c 0-3 test/lfrbtest -n 4000000 -p6 -c2 -f 1 -b 12
    overall rate = 8,722,397.3449 /sec

    $ taskset -c 3 test/lfrbtest -n 4000000 -p6 -c2 -f 1 -b 12
    overall rate = 19,950,919.8062 /sec

    Some more threads than cpus timings ...

    $ taskset -c 0-7 test/lfrbtest -n 1000000 -p24 -c8 -f 1 -b 0
    overall rate = 6,454,866.9162 /sec

    $ taskset -c 0-7 test/lfrbtest -n 1000000 -p24 -c8 -f 1 -b 12
    overall rate = 7,350,906.2839 /sec

    Some 1 producer 1 consumer timings ... (no contention so backoff setting
    has no effect)

    $ taskset -c 3,4 test/lfrbtest -n 10000000 -p1 -c1 -f 1 -b 0
    overall rate = 15,329,272.1370 /sec

    $ taskset -c 3,4 test/lfrbtest -n 10000000 -p1 -c1 -f 1 -b 12
    overall rate = 15,063,074.6140 /sec

    $ taskset -c 3 test/lfrbtest -n 10000000 -p1 -c1 -f 1 -b 0
    overall rate = 19,983,735.6372 /sec

    $ taskset -c 3 test/lfrbtest -n 10000000 -p1 -c1 -f 1 -b 12
    overall rate = 19,935,213.3078 /sec

    $ taskset -c 3 test/lfrbtest -n 10000000 -p1 -c1 -f 1 -b 12 -x n
    overall rate = 32,921,077.1692 /sec

    $ taskset -c 2,3 test/lfrbtest -n 10000000 -p1 -c1 -f 1 -b 12 -t spsc
    overall rate = 38,620,404.6845 /sec

    $ taskset -c 3 test/lfrbtest -n 10000000 -p1 -c1 -f 1 -b 12 -t spsc
    overall rate = 109,891,021.6233 /sec

This is the producer and consumer running on the same thread, so no
thread switching overhead.

    $ taskset -c 3 test/lfrbtest -n 10000000 -p1 -c1 -f 1 -b 12 -x n -t spsc
    overall rate = 146,941,972.2478 /sec

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Mon Sep 9 23:39:30 2024
    On Mon, 09 Sep 2024 18:49:00 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 09 Sep 2024 15:42:45 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 08 Sep 2024 18:32:42 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 08 Sep 2024 16:10:41 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:


    On AMD processors (and likely intel), if a core cannot
    acquire a cache line in a a finite time, the core will
    assert the bus lock to ensure forward progress.

    Nothing to do with the operating software; purely a hardware
    thing.


    I think, on AMD processors made in this century the only cases
    that resort to physical bus lock are
    A) atomic accesses that cross cache boundary
    B) atomic accesses that address non-cached memory regions

    C) A core cannot acquire a cache line in a finite time. We
    encountered this in 2010 on AMD Opteron processors with
    our HyperTransport connected CXL-like chip (designed in
    2005); r/t latency could be as high as 800ns to remote memory.


    PathScale Infinipath, I suppose.

    Actully, no.


    Horus?


    https://www.infoworld.com/article/2201330/3leaf-systems-scale-up-by-scaling-out.html

O.k.
This article is a little more technical: https://www.eetimes.com/3leaf-links-32-amd-chips-in-one-server/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to jseigh on Tue Sep 10 11:22:05 2024
    jseigh wrote:
    On 9/9/24 03:14, Terje Mathisen wrote:
    jseigh wrote:

    I'm not so sure about making the memory lock granularity same as
    cache line size but that's an implementation decision I guess.

    Just make sure you never have multiple locks residing inside the same
    cache line!

    This the the terminology ARM uses when describing their LL/SC implementation.  It is not the best choice in terminology.


    I do like the idea of detecting potential contention at the
    start of LL/SC so you can do back off.  Right now the only way I
    can detect contention is after the fact when the CAS fails and
    I probably have the cache line exclusive at that point.  It's
    pretty problematic.

    I do prefer LOCK XADD instead of CAS (CmpXchg*), because the return
    value will also tell you which queue entry to pick/work on.

    It will not be optimal when really contended, but at least one
    participant will make forward progress, and typically several of them.


    I'm not aware of any lock-free queue algorithms that use
    atomic_fetch_add that are actually lock-free, error free,
    and/or don't have an ABA problem.  I'm not saying there
    aren't, just that I'm not aware of them.

I'm not sure either: I have written test code that should be general
purpose but never actually used it in any production systems.

    The one time I used this for critical work (a maximum throughput ntpd
    server) I had a single writer and N-1 readers, so I actually decided to
    create one queue per reader (cpu core), making them much easier to get
    right. :-)

    Each queue was a power of two in length, the write and read indices were
    full 32-bit unsigned variables that I masked before accessing the work
    list, so I never needed to worry about queue end wraparound.

    The writer simply allocated work to each queue in round robin fashion.
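
A minimal single-producer/single-consumer sketch of that scheme (my naming, not Terje's code): free-running 32-bit indices, masked only when indexing the array, so unsigned wraparound needs no special handling.

#include <atomic>
#include <cstdint>

constexpr uint32_t N = 256;                 // power of two
int ring[N];
std::atomic<uint32_t> write_idx{0};
std::atomic<uint32_t> read_idx{0};

// Single producer: indices run freely; only the array access is masked.
bool try_push(int item) {
    uint32_t w = write_idx.load(std::memory_order_relaxed);
    uint32_t r = read_idx.load(std::memory_order_acquire);
    if (w - r == N) return false;           // full (wrap-safe unsigned subtraction)
    ring[w & (N - 1)] = item;
    write_idx.store(w + 1, std::memory_order_release);
    return true;
}

// Single consumer.
bool try_pop(int& item) {
    uint32_t r = read_idx.load(std::memory_order_relaxed);
    uint32_t w = write_idx.load(std::memory_order_acquire);
    if (w == r) return false;               // empty
    item = ring[r & (N - 1)];
    read_idx.store(r + 1, std::memory_order_release);
    return true;
}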

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Tue Sep 10 10:40:05 2024
    On 9/9/24 17:09, Chris M. Thomasson wrote:
    On 9/9/2024 7:29 AM, jseigh wrote:
    On 9/9/24 03:14, Terje Mathisen wrote:
    jseigh wrote:

    I'm not so sure about making the memory lock granularity same as
    cache line size but that's an implementation decision I guess.

    Just make sure you never have multiple locks residing inside the same
    cache line!

    This the the terminology ARM uses when describing their LL/SC
    implementation.  It is not the best choice in terminology.


    I do like the idea of detecting potential contention at the
    start of LL/SC so you can do back off.  Right now the only way I
    can detect contention is after the fact when the CAS fails and
    I probably have the cache line exclusive at that point.  It's
    pretty problematic.

    I do prefer LOCK XADD instead of CAS (CmpXchg*), because the return
    value will also tell you which queue entry to pick/work on.

    It will not be optimal when really contended, but at least one
    participant will make forward progress, and typically several of them.


    I'm not aware of any lock-free queue algorithms that use
    atomic_fetch_add that are actually lock-free, error free,
    and/or don't have an ABA problem.  I'm not saying there
    aren't, just that I'm not aware of them.

    Here is an interesting one I did. A tweak from another algorithm.
    Basically a bakery algorithm:

<pseudo code, membars aside for a moment>
______________________________________________
    struct cell { uint32_t ver; double state; };

    uint32_t head = 0;
    uint32_t tail = 0;
    cell cells[N]; // N must be a power of 2

    void init() {
        for (uint32_t i = 0; i < N; ++i) cells[i].ver = i;
    }

    void producer(double state) {
        uint32_t ver = XADD(&head, 1);
        cell& c = cells[ver & (N - 1)];
        while (LOAD(&c.ver) != ver) backoff();
        c.state = state;
        STORE(&c.ver, ver + 1);
    }

    double consumer() {
        uint32_t ver = XADD(&tail, 1);
        cell& c = cells[ver & (N - 1)];
        while (LOAD(&c.ver) != ver + 1) backoff();
        double state = c.state;
        STORE(&c.ver, ver + N);
        return state;
    }
    ______________________________________________


    One of the things I have in my implementation is that an enqueue
operation can detect if wrap occurred if it got interrupted by
    a time slice end and log how far behind it got. I'm seeing about
    200,000 enqueue operations in those cases. That would have been
    a huge performance hit if my queue wasn't lock-free.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Terje Mathisen on Tue Sep 10 10:51:45 2024
    On 9/10/24 05:22, Terje Mathisen wrote:
    jseigh wrote:
    On 9/9/24 03:14, Terje Mathisen wrote:
    jseigh wrote:

    I'm not so sure about making the memory lock granularity same as
    cache line size but that's an implementation decision I guess.

    Just make sure you never have multiple locks residing inside the same
    cache line!

    This the the terminology ARM uses when describing their LL/SC
    implementation.  It is not the best choice in terminology.


    I do like the idea of detecting potential contention at the
    start of LL/SC so you can do back off.  Right now the only way I
    can detect contention is after the fact when the CAS fails and
    I probably have the cache line exclusive at that point.  It's
    pretty problematic.

    I do prefer LOCK XADD instead of CAS (CmpXchg*), because the return
    value will also tell you which queue entry to pick/work on.

    It will not be optimal when really contended, but at least one
    participant will make forward progress, and typically several of them.


    I'm not aware of any lock-free queue algorithms that use
    atomic_fetch_add that are actually lock-free, error free,
    and/or don't have an ABA problem.  I'm not saying there
    aren't, just that I'm not aware of them.

    I'm not sure either: I have written test code that should be general
    purpose but never actually used that in any production systems.

    The one time I used this for critical work (a maximum throughput ntpd
server) I had a single writer and N-1 readers, so I actually decided to
create one queue per reader (cpu core), making them much easier to get
    right. :-)

    Each queue was a power of two in length, the write and read indices were
    full 32-bit unsigned variables that I masked before accessing the work
    list, so I never needed to worry about queue end wraparound.

    The writer simply allocated work to each queue in round robin fashion.


    You should look at the RSEQ (restartable sequences) code for SPMC
    queue. It's in assembler (basically because RSEQ needs to know
    the exact address of the commit instruction), but essentially an
    enqueue operation is

    tail->next = &added_node;
    tail = &added_node; // the commit instruction

If you get interrupted before the tail gets updated, it doesn't matter
because the next successful enqueue will work fine. You
    don't even have to null the next pointer of the node you
    are enqueuing.

    The only downside is that if you are on a thousand processor
    box, you will have 1000 SPMC queues.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Tue Sep 10 17:34:28 2024
    On 9/10/24 14:46, Chris M. Thomasson wrote:
    On 9/10/2024 7:40 AM, jseigh wrote:


    One of the things I have in my implementation is that an enqueue
    operation can detect if wrap occurred if it got a interrupted by
    a time slice end and log how far behind it got.  I'm seeing about
    200,000 enqueue operations in those cases.  That would have been
    a huge performance hit if my queue wasn't lock-free.

    Is your queue similar to the M&S Queue?

    https://people.csail.mit.edu/xchen/parallel-computing/queue.pdf


    Yes.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Paul A. Clayton on Wed Sep 11 09:22:11 2024
    On 9/11/24 00:15, Paul A. Clayton wrote:
    On 9/9/24 3:14 AM, Terje Mathisen wrote:
    jseigh wrote:

    I'm not so sure about making the memory lock granularity same as
    cache line size but that's an implementation decision I guess.

    Just make sure you never have multiple locks residing inside the same
    cache line!

    Never?

    I suspect at least theoretically conditions could exist where
    having more than one lock within a cache line would be beneficial.

    If lock B is always acquired after lock A, then sharing a cache
    line might (I think) improve performance. One would lose
    prefetched capacity for the data protected by lock A and lock B.
    This assumes simple locks (e.g., not readers-writer locks).

    It seems to me that the pingpong problem may be less important
    than spatial locality depending on the contention for the cache
    line and the cache hierarchy locality of the contention
    (pingponging from a shared level of cache would be less
    expensive).

I've been performance testing on a 4 core, 8 hw thread cpu and
generally it's faster if I run on 4 hw threads, 1 per core,
than on all 8 hw threads. But if I run on 2 hw threads, both
on the same core, it's 2x faster than 2 hw threads on 2 cores,
1 per core. Probably because they're running out of the L1/L2
core-only cache rather than the L3 global cache.

    If work behind highly active locks is preferentially or forcefully
    localized, pingponging would be less of a problem, it seems.
    Instead of an arbitrary core acquiring a lock's cache line and
    doing some work, the core could send a message to the natural owner of
    the cache line to do the work.

    If communication between cores was low latency and simple messages
    used little bandwidth, one might also conceive of having a lock
    manager that tracks the lock state and sends a granted or not-
    granted message back. This assumes that the memory location of the
    lock itself is separate from the data guarded by the lock.

    Being able to grab a snapshot of some data briefly without
    requiring (longer-term) ownership change might be useful even
    beyond lock probing (where a conventional MESI would change the
    M-state cache to S forcing a request for ownership when the lock
    is released). I recall some paper proposed expiring cache line
    ownership to reduce coherence overhead.

    Within a multiple-cache-line atomic operation/memory transaction,
    I _think_ if the write set is owned, the read set could be grabbed
    as such snapshots. I.e., I think any remote write to the read set
    could be "after" the atomic/transaction commits. (Such might be
    too difficult to get right while still providing any benefit.)

Well there's HTM (hardware transactional memory) but apparently
that's hard to get right and it's limited in snapshot size.

    In software there's a large body of lock-free data structures
    using various deferred reclamation schemes, RCU, hazard pointers,
    etc... RCU has zero read access cost overhead, hazard pointers
    have almost zero cost, about 3x the cost of a pipelined load.
Hazard pointers got a lot faster when the need for a store/load memory
barrier was gotten rid of.

These lock-free data structures can be quite huge, e.g.
a lock-free map with millions of entries used to cache a redis
database with billions of entries. Certainly bigger than
    you can fit into hw cache.

    (Weird side-thought: I wonder if a conservative filter might be
    useful for locking, particularly for writer locks. On the one
    hand, such would increase the pingpong in the filter when writer
    locks are set/cleared; on the other hand, reader locks could use
    a remote increment within the filter check atomic to avoid slight
    cache pollution.)

    There are reader/writer bakery style spin locks. Bakery style
    spin locks are more cache friendly than regular spin locks.
An interlocked operation, fetch_and_add, is only done once
to get the wait ticket, and then the code just spins testing
the next value, which should be pretty efficient on a strongly
coherent cache.

    An early description of rw spin locks https://groups.google.com/g/comp.programming/c/tHkE6R4joe0/m/1NyR2OkkHJ4J
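
A minimal ticket-lock sketch of that idea (writer-only version, names mine): one fetch_add to take a ticket, then a plain read spin on now_serving, which stays shared in every waiter's cache until the holder releases.

#include <atomic>
#include <cstdint>

struct ticket_lock {
    std::atomic<uint32_t> next_ticket{0};
    std::atomic<uint32_t> now_serving{0};

    void lock() {
        // One interlocked op to join the queue...
        uint32_t my_ticket = next_ticket.fetch_add(1, std::memory_order_relaxed);
        // ...then spin on a value that is only read, so the line stays shared.
        while (now_serving.load(std::memory_order_acquire) != my_ticket) {
            // optional: cpu pause / backoff here
        }
    }

    void unlock() {
        now_serving.store(now_serving.load(std::memory_order_relaxed) + 1,
                          std::memory_order_release);
    }
};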

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Sep 11 10:33:16 2024
    I suspect at least theoretically conditions could exist where
    having more than one lock within a cache line would be beneficial.
    If lock B is always acquired after lock A, then sharing a cache
    line might (I think) improve performance. One would lose

    I suspect in practice this almost never happens because if it did, it
    would mean that the program would benefit from merging those two locks
    into a single one.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to jseigh on Wed Sep 11 14:04:17 2024
    On 9/9/24 14:46, jseigh wrote:
    Some testing of the effects of backoff on contention.


    I don't think we are getting anywhere in this line of inquiry. I
    think I will break off for a while and work on a lemonade
    recipe (obscure reference to an old sig).

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)