On 9/2/2024 11:55 AM, Chris M. Thomasson wrote:
On 9/2/2024 10:27 AM, jseigh wrote:
I read that arm added the cas instruction because they didn't think
ldxr/stxr would scale well. It wasn't clear to me as to why that
would be the case. I would think the memory lock mechanism would
have really low overhead vs cas having to do an interlocked load
and store. Unless maybe the memory lock size might be large
enough to cause false sharing issues. Any ideas?
Reservation granularity? Wrt the PPC, wow, this is an older thread:
https://groups.google.com/g/comp.arch/c/yREvvvKvr6k/m/nRZ5tpLwDNQJ
LL/SC can spuriously fail... It's more obstruction-free than
lock-free? I think this is why both strong and weak CAS are in the C/C++
std's?
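As a concrete illustration of that split (a sketch of my own, function names invented, not from the thread): compare_exchange_weak is the form that maps naturally onto an LL/SC pair and is allowed to fail spuriously, so it belongs inside a retry loop, while compare_exchange_strong hides the retry and only reports a genuine compare mismatch.
______________________________________________
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> counter{0};

// Retry loop: a spurious failure (e.g. a lost LDXR/STXR reservation on
// ARM without LSE) just goes around again, so the weak form is fine here.
void add_one_weak() {
    uint64_t old_value = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(old_value, old_value + 1,
                                          std::memory_order_relaxed)) {
        // 'old_value' has been reloaded with the current value; just retry.
    }
}

// One-shot attempt: a spurious failure here would be indistinguishable from
// "someone else got there first", so the strong form, which only fails on a
// real compare mismatch, is the one that makes sense.
bool claim_if_zero() {
    uint64_t expected = 0;
    return counter.compare_exchange_strong(expected, 1);
}
______________________________________________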
Btw, have you talked with Alexander Terekhov lately? He is a smart guy. :^)
On 9/4/2024 2:27 PM, MitchAlsup1 wrote:
On Mon, 2 Sep 2024 17:27:57 +0000, jseigh wrote:
I read that arm added the cas instruction because they didn't think
ldxr/stxr would scale well. It wasn't clear to me as to why that
would be the case. I would think the memory lock mechanism would
have really low overhead vs cas having to do an interlocked load
and store. Unless maybe the memory lock size might be large
enough to cause false sharing issues. Any ideas?
A pipeline lock between the LD part of a CAS and the ST part of a
CAS is essentially FREE. But the same is true for LL followed by
a later SC.
100% sure on that? Is there no way for the reservation to be broken by
something unrelated, wrt LL/SC?
On 9/4/2024 5:27 PM, MitchAlsup1 wrote:
On Mon, 2 Sep 2024 17:27:57 +0000, jseigh wrote:
I read that arm added the cas instruction because they didn't think
ldxr/stxr would scale well. It wasn't clear to me as to why that
would be the case. I would think the memory lock mechanism would
have really low overhead vs cas having to do an interlocked load
and store. Unless maybe the memory lock size might be large
enough to cause false sharing issues. Any ideas?
A pipeline lock between the LD part of a CAS and the ST part of a
CAS is essentially FREE. But the same is true for LL followed by
a later SC.
Older machines with looser than sequential consistency memory models
and running OoO have a myriad of problems with LL - SC. This is
why My 66000 architecture switches from causal consistency to
sequential consistency when it encounters <effectively> LL and
switches back after seeing SC.
No Fences necessary with causal consistency.
I'm not sure I entirely follow. I was thinking of the effects on
cache. In theory the SC could fail without having to get the current
cache line exclusive, or at all. CAS has to get it exclusive before
it can definitively fail.
Whenever they get around to making ARM desktops without the OS tax,
so I can install Linux, I can compare the two.
Joe Seigh
On 9/5/2024 12:46 PM, MitchAlsup1 wrote:
On Thu, 5 Sep 2024 11:33:23 +0000, jseigh wrote:
On 9/4/2024 5:27 PM, MitchAlsup1 wrote:
On Mon, 2 Sep 2024 17:27:57 +0000, jseigh wrote:
I read that arm added the cas instruction because they didn't think
ldxr/stxr would scale well. It wasn't clear to me as to why that
would be the case. I would think the memory lock mechanism would
have really low overhead vs cas having to do an interlocked load
and store. Unless maybe the memory lock size might be large
enough to cause false sharing issues. Any ideas?
A pipeline lock between the LD part of a CAS and the ST part of a
CAS is essentially FREE. But the same is true for LL followed by
a later SC.
Older machines with looser than sequential consistency memory models
and running OoO have a myriad of problems with LL - SC. This is
why My 66000 architecture switches from causal consistency to
sequential consistency when it encounters <effectively> LL and
switches back after seeing SC.
No Fences necessary with causal consistency.
I'm not sure I entirely follow. I was thinking of the effects on
cache. In theory the SC could fail without having to get the current
cache line exclusive or at all. CAS has to get it exclusive before
it can definitively fail.
A LL that takes a miss in L1 will perform a fetch with intent to modify,
so will a CAS. However, LL is allowed to silently fail if exclusive is
not returned from its fetch, deferring atomic failure to SC, while CAS
will fail when exclusive fails to return.
CAS should only fail when the comparands are not equal to each other.
Well, then there is the damn weak and strong CAS in C++11... ;^o
LL-SC is designed so that
when a failure happens, failure is visible at SC not necessarily at LL.
There are coherence protocols that allow the 2nd party to determine
if it returns exclusive or not. The example I know is when the 2nd
party is already performing an atomic event and it is better to fail
the starting atomic event than to fail an ongoing atomic event.
In My 66000 the determination is made under the notion of priority::
the higher priority thread is allowed to continue while the lower
priority thread takes the failure. The higher priority thread can
be the requestor (1st party) or the holder of data (2nd party)
while all interested observers (3rd parties) are in a position
to see what transpired and act accordingly (causal).
On 9/5/24 16:34, Chris M. Thomasson wrote:
On 9/5/2024 12:46 PM, MitchAlsup1 wrote:
[...]
A LL that takes a miss in L1 will perform a fetch with intent to modify,
so will a CAS. However, LL is allowed to silently fail if exclusive is
not returned from its fetch, deferring atomic failure to SC, while CAS
will fail when exclusive fails to return.
CAS should only fail when the comparands are not equal to each other.
Well, then there is the damn weak and strong CAS in C++11... ;^o
LL-SC is designed so that
when a failure happens, failure is visible at SC not necessarily at LL.
There are coherence protocols that allow the 2nd party to determine
if it returns exclusive or not. The example I know is when the 2nd
party is already performing an atomic event and it is better to fail
the starting atomic event than to fail an ongoing atomic event.
In My 66000 the determination is made under the notion of priority::
the higher priority thread is allowed to continue while the lower
priority thread takes the failure. The higher priority thread can
be the requestor (1st party) or the holder of data (2nd party)
while all interested observers (3rd parties) are in a position
to see what transpired and act accordingly (causal).
I'm not so sure about making the memory lock granularity the same as
the cache line size, but that's an implementation decision I guess.
I do like the idea of detecting potential contention at the
start of LL/SC so you can back off. Right now the only way I
can detect contention is after the fact, when the CAS fails, and
I probably have the cache line exclusive at that point. It's
pretty problematic.
Joe Seigh
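For what it's worth, here is roughly what that after-the-fact backoff looks like with plain C++ atomics (a sketch under my own assumptions; the spin/yield thresholds are arbitrary). The only contention signal available is the failed CAS itself, by which point this core has usually already pulled the line in exclusive, which is exactly the problem described above.
______________________________________________
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

// Increment a shared counter with CAS, backing off a little more after
// each lost race. Contention only becomes visible after the CAS has failed.
void increment_with_backoff(std::atomic<uint64_t>& counter) {
    uint64_t observed = counter.load(std::memory_order_relaxed);
    unsigned failures = 0;
    while (!counter.compare_exchange_weak(observed, observed + 1,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed)) {
        // 'observed' now holds whatever value beat us; back off, then retry.
        ++failures;
        if (failures < 8) {
            for (unsigned i = 0; i < (1u << failures); ++i)
                std::this_thread::yield();
        } else {
            std::this_thread::sleep_for(std::chrono::microseconds(10));
        }
    }
}
______________________________________________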
On 9/5/2024 2:49 PM, jseigh wrote:
On 9/5/24 16:34, Chris M. Thomasson wrote:
[...]
I'm not so sure about making the memory lock granularity the same as
the cache line size, but that's an implementation decision I guess.
I do like the idea of detecting potential contention at the
start of LL/SC so you can back off. Right now the only way I
can detect contention is after the fact, when the CAS fails, and
I probably have the cache line exclusive at that point. It's
pretty problematic.
I wonder if the ability to determine why a "weak" CAS failed might help.
A weak CAS can fail for reasons other than the comparands differing...
Though it would probably be a little too low level for a general atomic
op in C/C++11?
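One rough approximation that is already expressible in C++ (my own sketch, nothing beyond what compare_exchange_weak documents): on failure the expected argument is reloaded with the current value, so if it still equals what was asked for, the failure was spurious rather than a lost race. An ABA change-and-change-back would be misclassified the same way, of course.
______________________________________________
#include <atomic>
#include <cstdint>

enum class CasResult { Success, ValueMismatch, SpuriousFailure };

// Classify a single compare_exchange_weak attempt. On failure, 'expected'
// is reloaded with the current value, so comparing it with what we asked
// for separates "someone changed the value" from "the implementation just
// refused this time" (e.g. a lost LL/SC reservation on ARM without LSE).
CasResult try_cas_once(std::atomic<uint64_t>& target,
                       uint64_t expected_value, uint64_t desired) {
    uint64_t expected = expected_value;
    if (target.compare_exchange_weak(expected, desired))
        return CasResult::Success;
    return (expected == expected_value) ? CasResult::SpuriousFailure
                                        : CasResult::ValueMismatch;
}
______________________________________________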
On Fri, 6 Sep 2024 19:36:36 +0000, Chris M. Thomasson wrote:
On 9/5/2024 2:49 PM, jseigh wrote:
[...]
I wonder if the ability to determine why a "weak" CAS failed might help.
A weak CAS can fail for reasons other than the comparands differing...
Though it would probably be a little too low level for a general atomic
op in C/C++11?
One can detect that the CAS-line is no longer exclusive as a form
of weak failure, rather than waiting for the data to show up and
fail strongly on the compare.
On 9/7/2024 4:14 PM, Chris M. Thomasson wrote:
[...]
When I am using CAS I don't really expect it to fail willy nilly even if
the comparands are still the same. Weak vs Strong. Still irks me a bit.
;^)
There are algorithms out there, usually state machines, that depend on
strong CAS. When a CAS fails, they depend on it failing because the
comparands were actually different...
On 9/7/2024 5:59 PM, MitchAlsup1 wrote:
On Sat, 7 Sep 2024 23:16:35 +0000, Chris M. Thomasson wrote:
On 9/7/2024 4:14 PM, Chris M. Thomasson wrote:
[...]
When I am using CAS I don't really expect it to fail willy nilly even if
the comparands are still the same. Weak vs Strong. Still irks me a bit.
;^)
There are algorithms out there, usually state machines, that depend on
strong CAS. When a CAS fails, they depend on it failing because the
comparands were actually different...
Leading to ABA failures::
Do you really want the following CAS to succeed ??
LD R19,[someMemoryLocation]
..
interrupt delays program execution for 1 week
..
CAS R17,R19,[someMemoryLocation]
Given that someMemoryLocation is accessible to other programs
while this one is sleeping ??
Thus, it seems reasonable to fail a CAS when one cannot determine
if the memory location has been changed and changed back in the
mean time.
ABA, well, that can happen with CAS and certain algorithms that use it.
The good ol' version counter is pretty nice, but it still can fail. 64-bit
words, 128-bit CAS. Loads to boot... ;^) Actually, I think Joe mentioned
something interesting about CAS a long time ago wrt IBM... Candy Cane
Striped books (Joe, do you remember?) about a way to avoid livelock and
failing in the OS. I think Windows has something like it with their
SList and SEH?
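Since the version counter keeps coming up, here is the usual shape of the 64-bit-pointer-plus-64-bit-counter trick as a sketch (names are mine; whether the 16-byte compare-exchange is actually lock-free depends on the target, e.g. cmpxchg16b on x86-64, and with GCC it may need -mcx16 or fall back to libatomic):
______________________________________________
#include <atomic>
#include <cstdint>

struct Node { Node* next; int value; };

// A pointer plus a version ("ABA") counter, packed into one 16-byte unit
// so the pair can be compared-and-swapped together where hardware allows.
struct alignas(16) CountedPtr {
    Node*    ptr;
    uint64_t ver;   // bumped on every successful update of 'top'
};

std::atomic<CountedPtr> top{CountedPtr{nullptr, 0}};

// Treiber-style pop. The counter keeps a recycled node address (classic
// ABA) from matching a stale snapshot. Reading cur.ptr->next still assumes
// nodes are not freed out from under us; real code needs safe reclamation.
Node* pop() {
    CountedPtr cur = top.load();
    while (cur.ptr != nullptr) {
        CountedPtr next{cur.ptr->next, cur.ver + 1};
        if (top.compare_exchange_weak(cur, next))   // failure reloads 'cur'
            return cur.ptr;
    }
    return nullptr;
}
______________________________________________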
On 9/7/2024 5:59 PM, MitchAlsup1 wrote:
On Sat, 7 Sep 2024 23:16:35 +0000, Chris M. Thomasson wrote:
Thus, it seems reasonable to fail a CAS when one cannot determine
if the memory location has been changed and changed back in the
mean time.
I think Scott Lurndal mentioned something about CAS or something on
windows that will assert the bus lock after a lot of failures...
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
On 9/7/2024 5:59 PM, MitchAlsup1 wrote:
On Sat, 7 Sep 2024 23:16:35 +0000, Chris M. Thomasson wrote:
Thus, it seems reasonable to fail a CAS when one cannot determine
if the memory location has been changed and changed back in the
mean time.
I think Scott Lurndal mentioned something about CAS or something on
windows that will assert the bus lock after a lot of failures...
On AMD processors (and likely intel), if a core cannot acquire
a cache line in a finite time, the core will assert the bus lock
to ensure forward progress.
Nothing to do with the operating software; purely a hardware thing.
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
On 9/7/2024 5:59 PM, MitchAlsup1 wrote:
On Sat, 7 Sep 2024 23:16:35 +0000, Chris M. Thomasson wrote:
Thus, it seems reasonable to fail a CAS when one cannot determine
if the memory location has been changed and changed back in the
mean time.
I think Scott Lurndal mentioned something about CAS or something on
windows that will assert the bus lock after a lot of failures...
On AMD processors (and likely intel), if a core cannot acquire
a cache line in a a finite time, the core will assert the bus lock
to ensure forward progress.
Nothing to do with the operating software; purely a hardware thing.
On Sun, 08 Sep 2024 16:10:41 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
On 9/7/2024 5:59 PM, MitchAlsup1 wrote:
On Sat, 7 Sep 2024 23:16:35 +0000, Chris M. Thomasson wrote:
Thus, it seems reasonable to fail a CAS when one cannot determine
if the memory location has been changed and changed back in the
mean time.
I think Scott Lurndal mentioned something about CAS or something on
windows that will assert the bus lock after a lot of failures...
On AMD processors (and likely intel), if a core cannot acquire
a cache line in a finite time, the core will assert the bus lock
to ensure forward progress.
Nothing to do with the operating software; purely a hardware thing.
I think, on AMD processors made in this century the only cases that
resort to physical bus lock are
A) atomic accesses that cross cache boundary
B) atomic accesses that address non-cached memory regions
Michael S <already5chosen@yahoo.com> writes:
On Sun, 08 Sep 2024 16:10:41 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
On AMD processors (and likely intel), if a core cannot acquire
a cache line in a finite time, the core will assert the bus lock
to ensure forward progress.
Nothing to do with the operating software; purely a hardware
thing.
I think, on AMD processors made in this century the only cases that
resort to physical bus lock are
A) atomic accesses that cross cache boundary
B) atomic accesses that address non-cached memory regions
C) A core cannot acquire a cache line in a finite time. We
encountered this in 2010 on AMD Opteron processors with
our HyperTransport connected CXL-like chip (designed in 2005);
r/t latency could be as high as 800ns to remote memory.
jseigh wrote:
I'm not so sure about making the memory lock granularity same as
cache line size but that's an implementation decision I guess.
Just make sure you never have multiple locks residing inside the same
cache line!
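A sketch of that guideline in C++ (the spin lock and names are mine; if the standard library does not provide hardware_destructive_interference_size, a hard-coded 64 covers most current x86 and ARM parts):
______________________________________________
#include <atomic>
#include <new>   // std::hardware_destructive_interference_size (C++17)

// Give every lock its own cache line so two unrelated locks never
// ping-pong the same line between cores.
struct alignas(std::hardware_destructive_interference_size) PaddedLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;

    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) { /* spin */ } }
    void unlock() { flag.clear(std::memory_order_release); }
};

PaddedLock lockA;   // each object occupies at least one full cache line
PaddedLock lockB;
______________________________________________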
I do like the idea of detecting potential contention at the
start of LL/SC so you can do back off. Right now the only way I
can detect contention is after the fact when the CAS fails and
I probably have the cache line exclusive at that point. It's
pretty problematic.
I do prefer LOCK XADD instead of CAS (CmpXchg*), because the return
value will also tell you which queue entry to pick/work on.
It will not be optimal when really contended, but at least one
participant will make forward progress, and typically several of them.
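Not the queue case, but the same fetch_add property in its most familiar form, sketched with made-up names: the value returned by the atomic add is the caller's own ticket, there is no acquisition retry loop, and exactly one waiter is released at a time.
______________________________________________
#include <atomic>
#include <cstdint>
#include <thread>

// Ticket lock: fetch_add hands every caller a unique ticket, and the
// returned value tells the caller exactly which turn is its own.
struct TicketLock {
    std::atomic<uint32_t> next{0};      // next ticket to hand out
    std::atomic<uint32_t> serving{0};   // ticket currently being served

    void lock() {
        uint32_t my_ticket = next.fetch_add(1, std::memory_order_relaxed);
        while (serving.load(std::memory_order_acquire) != my_ticket)
            std::this_thread::yield();
    }
    void unlock() {
        serving.store(serving.load(std::memory_order_relaxed) + 1,
                      std::memory_order_release);
    }
};
______________________________________________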
On Sun, 08 Sep 2024 18:32:42 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Sun, 08 Sep 2024 16:10:41 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
On AMD processors (and likely intel), if a core cannot acquire
a cache line in a finite time, the core will assert the bus lock
to ensure forward progress.
Nothing to do with the operating software; purely a hardware
thing.
I think, on AMD processors made in this century the only cases that
resort to physical bus lock are
A) atomic accesses that cross cache boundary
B) atomic accesses that address non-cached memory regions
C) A core cannot acquire a cache line in a finite time. We
encountered this in 2010 on AMD Opteron processors with
our HyperTransport connected CXL-like chip (designed in 2005);
r/t latency could be as high as 800ns to remote memory.
PathScale Infinipath, I suppose.
Michael S <already5chosen@yahoo.com> writes:
On Sun, 08 Sep 2024 18:32:42 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Sun, 08 Sep 2024 16:10:41 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
On AMD processors (and likely intel), if a core cannot acquire
a cache line in a finite time, the core will assert the bus
lock to ensure forward progress.
Nothing to do with the operating software; purely a hardware
thing.
I think, on AMD processors made in this century the only cases
that resort to physical bus lock are
A) atomic accesses that cross cache boundary
B) atomic accesses that address non-cached memory regions
C) A core cannot acquire a cache line in a finite time. We
encountered this in 2010 on AMD Opteron processors with
our HyperTransport connected CXL-like chip (designed in 2005);
r/t latency could be as high as 800ns to remote memory.
PathScale Infinipath, I suppose.
Actually, no.
On Mon, 09 Sep 2024 15:42:45 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Sun, 08 Sep 2024 18:32:42 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Sun, 08 Sep 2024 16:10:41 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
On AMD processors (and likely intel), if a core cannot acquire
a cache line in a finite time, the core will assert the bus
lock to ensure forward progress.
Nothing to do with the operating software; purely a hardware
thing.
I think, on AMD processors made in this century the only cases
that resort to physical bus lock are
A) atomic accesses that cross cache boundary
B) atomic accesses that address non-cached memory regions
C) A core cannot acquire a cache line in a finite time. We
encountered this in 2010 on AMD Opteron processors with
our HyperTransport connected CXL-like chip (designed in 2005);
r/t latency could be as high as 800ns to remote memory.
PathScale Infinipath, I suppose.
Actually, no.
Horus?
Michael S <already5chosen@yahoo.com> writes:
On Mon, 09 Sep 2024 15:42:45 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Sun, 08 Sep 2024 18:32:42 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Sun, 08 Sep 2024 16:10:41 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
On AMD processors (and likely intel), if a core cannot
acquire a cache line in a finite time, the core will
assert the bus lock to ensure forward progress.
Nothing to do with the operating software; purely a hardware
thing.
I think, on AMD processors made in this century the only cases
that resort to physical bus lock are
A) atomic accesses that cross cache boundary
B) atomic accesses that address non-cached memory regions
C) A core cannot acquire a cache line in a finite time. We
encountered this in 2010 on AMD Opteron processors with
our HyperTransport connected CXL-like chip (designed in
2005); r/t latency could be as high as 800ns to remote memory.
PathScale Infinipath, I suppose.
Actually, no.
Horus?
https://www.infoworld.com/article/2201330/3leaf-systems-scale-up-by-scaling-out.html
On 9/9/24 03:14, Terje Mathisen wrote:
jseigh wrote:
I'm not so sure about making the memory lock granularity same as
cache line size but that's an implementation decision I guess.
Just make sure you never have multiple locks residing inside the same
cache line!
This is the terminology ARM uses when describing their LL/SC implementation. It is not the best choice of terminology.
I do like the idea of detecting potential contention at the
start of LL/SC so you can do back off. Right now the only way I
can detect contention is after the fact when the CAS fails and
I probably have the cache line exclusive at that point. It's
pretty problematic.
I do prefer LOCK XADD instead of CAS (CmpXchg*), because the return
value will also tell you which queue entry to pick/work on.
It will not be optimal when really contended, but at least one
participant will make forward progress, and typically several of them.
I'm not aware of any lock-free queue algorithms that use
atomic_fetch_add that are actually lock-free, error free,
and/or don't have an ABA problem. I'm not saying there
aren't, just that I'm not aware of them.
On 9/9/2024 7:29 AM, jseigh wrote:
On 9/9/24 03:14, Terje Mathisen wrote:
jseigh wrote:
I'm not so sure about making the memory lock granularity same as
cache line size but that's an implementation decision I guess.
Just make sure you never have multiple locks residing inside the same
cache line!
This is the terminology ARM uses when describing their LL/SC
implementation. It is not the best choice of terminology.
I do like the idea of detecting potential contention at the
start of LL/SC so you can do back off. Right now the only way I
can detect contention is after the fact when the CAS fails and
I probably have the cache line exclusive at that point. It's
pretty problematic.
I do prefer LOCK XADD instead of CAS (CmpXchg*), because the return
value will also tell you which queue entry to pick/work on.
It will not be optimal when really contended, but at least one
participant will make forward progress, and typically several of them.
I'm not aware of any lock-free queue algorithms that use
atomic_fetch_add that are actually lock-free, error free,
and/or don't have an ABA problem. I'm not saying there
aren't, just that I'm not aware of them.
Here is an interesting one I did. A tweak from another algorithm.
Basically a bakery algorithm:
<pseudo code, membars aside for a moment>
______________________________________________
#include <atomic>
#include <cstdint>
#include <thread>

static constexpr uint32_t N = 128;          // N must be a power of 2

struct cell { std::atomic<uint32_t> ver; double state; };

std::atomic<uint32_t> head{0};
std::atomic<uint32_t> tail{0};
cell cells[N];

void backoff() { std::this_thread::yield(); }   // placeholder backoff

void init() {
    for (uint32_t i = 0; i < N; ++i) cells[i].ver = i;
}

void producer(double state) {
    uint32_t ver = head.fetch_add(1);           // XADD: claim the next slot
    cell& c = cells[ver & (N - 1)];
    while (c.ver != ver) backoff();             // wait until the slot is free
    c.state = state;
    c.ver = ver + 1;                            // hand the slot to a consumer
}

double consumer() {
    uint32_t ver = tail.fetch_add(1);           // XADD: claim the next slot
    cell& c = cells[ver & (N - 1)];
    while (c.ver != ver + 1) backoff();         // wait for the producer's data
    double state = c.state;
    c.ver = ver + N;                            // recycle the slot one lap later
    return state;
}
______________________________________________
jseigh wrote:
On 9/9/24 03:14, Terje Mathisen wrote:
jseigh wrote:
I'm not so sure about making the memory lock granularity same as
cache line size but that's an implementation decision I guess.
Just make sure you never have multiple locks residing inside the same
cache line!
This is the terminology ARM uses when describing their LL/SC
implementation. It is not the best choice of terminology.
I do like the idea of detecting potential contention at the
start of LL/SC so you can do back off. Right now the only way I
can detect contention is after the fact when the CAS fails and
I probably have the cache line exclusive at that point. It's
pretty problematic.
I do prefer LOCK XADD instead of CAS (CmpXchg*), because the return
value will also tell you which queue entry to pick/work on.
It will not be optimal when really contended, but at least one
participant will make forward progress, and typically several of them.
I'm not aware of any lock-free queue algorithms that use
atomic_fetch_add that are actually lock-free, error free,
and/or don't have an ABA problem. I'm not saying there
aren't, just that I'm not aware of them.
I'm not sure either: I have written test code that should be general
purpose but never actually used it in any production system.
The one time I used this for critical work (a maximum-throughput ntpd
server) I had a single writer and N-1 readers, so I actually decided to
create one queue per reader (CPU core), making them much easier to get
right. :-)
Each queue was a power of two in length, and the write and read indices
were full 32-bit unsigned variables that I masked before accessing the
work list, so I never needed to worry about queue-end wraparound.
The writer simply allocated work to each queue in round-robin fashion.
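A minimal single-producer/single-consumer ring along those lines (power-of-two capacity, free-running 32-bit indices masked on access, so wraparound never needs special handling); the type and member names are my own, not Terje's:
______________________________________________
#include <atomic>
#include <cstdint>

template <typename T, uint32_t N>       // N must be a power of 2
struct SpscQueue {
    static_assert((N & (N - 1)) == 0, "N must be a power of 2");

    T buf[N];
    std::atomic<uint32_t> wr{0};        // written only by the producer
    std::atomic<uint32_t> rd{0};        // written only by the consumer

    bool push(const T& item) {          // producer side
        uint32_t w = wr.load(std::memory_order_relaxed);
        if (w - rd.load(std::memory_order_acquire) == N) return false;  // full
        buf[w & (N - 1)] = item;
        wr.store(w + 1, std::memory_order_release);   // publish the element
        return true;
    }

    bool pop(T& out) {                  // consumer side
        uint32_t r = rd.load(std::memory_order_relaxed);
        if (r == wr.load(std::memory_order_acquire)) return false;      // empty
        out = buf[r & (N - 1)];
        rd.store(r + 1, std::memory_order_release);   // free the slot
        return true;
    }
};
______________________________________________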
On 9/10/2024 7:40 AM, jseigh wrote:
One of the things I have in my implementation is that an enqueue
operation can detect whether a wrap occurred if it got interrupted by
the end of a time slice, and log how far behind it got. I'm seeing about
200,000 enqueue operations in those cases. That would have been
a huge performance hit if my queue wasn't lock-free.
Is your queue similar to the M&S Queue?
https://people.csail.mit.edu/xchen/parallel-computing/queue.pdf
On 9/9/24 3:14 AM, Terje Mathisen wrote:
jseigh wrote:
I'm not so sure about making the memory lock granularity same as
cache line size but that's an implementation decision I guess.
Just make sure you never have multiple locks residing inside the same
cache line!
Never?
I suspect at least theoretically conditions could exist where
having more than one lock within a cache line would be beneficial.
If lock B is always acquired after lock A, then sharing a cache
line might (I think) improve performance. One would lose
prefetched capacity for the data protected by lock A and lock B.
This assumes simple locks (e.g., not readers-writer locks).
It seems to me that the pingpong problem may be less important
than spatial locality depending on the contention for the cache
line and the cache hierarchy locality of the contention
(pingponging from a shared level of cache would be less
expensive).
If work behind highly active locks is preferentially or forcefully
localized, pingponging would be less of a problem, it seems.
Instead of an arbitrary core acquiring a lock's cache line and
doing some work, the core could send a message to the natural owner of
the cache line to do the work.
If communication between cores was low latency and simple messages
used little bandwidth, one might also conceive of having a lock
manager that tracks the lock state and sends a granted or not-granted
message back. This assumes that the memory location of the
lock itself is separate from the data guarded by the lock.
Being able to grab a snapshot of some data briefly without
requiring (longer-term) ownership change might be useful even
beyond lock probing (where a conventional MESI would change the
M-state cache line to S, forcing a request for ownership when the lock
is released). I recall some paper proposed expiring cache line
ownership to reduce coherence overhead.
Within a multiple-cache-line atomic operation/memory transaction,
I _think_ if the write set is owned, the read set could be grabbed
as such snapshots. I.e., I think any remote write to the read set
could be "after" the atomic/transaction commits. (Such might be
too difficult to get right while still providing any benefit.)
(Weird side-thought: I wonder if a conservative filter might be
useful for locking, particularly for writer locks. On the one
hand, such would increase the pingpong in the filter when writer
locks are set/cleared; on the other hand, reader locks could use
a remote increment within the filter check atomic to avoid slight
cache pollution.)
Some testing of the effects of backoff on contention.