• MM instruction and the pipeline

    From Stephen Fuld@21:1/5 to All on Tue Oct 15 22:56:34 2024
    Even though this is about the MM instruction, and the MM instruction is
    mentioned in other threads, those threads have lots of other stuff (thread
    drift), and this isn't related to C, standard or otherwise, so I thought
    it best to start a new thread.

    My questions are about what happens to subsequent instructions that
    immediately follow the MM in the stream when an MM instruction is
    executing. Since an MM instruction may take quite a long time (in
    computer time) to complete I think it is useful to know what else can
    happen while the MM is executing.

    I will phrase this as a series of questions.

    1. I assume that subsequent non-memory reference instructions can
    proceed simultaneously with the MM. Is that correct?

    2. Can a load or store where the memory address is in neither the source
    nor the destination of the MM proceed simultaneously with the MM?

    3. Can a load where the memory address is within the source of the MM proceed?

    For the next questions, assume for exposition that the MM has proceeded
    to complete 1/3 of the move when the following instructions come up.

    4. Can a load in the first third of the destination range proceed?

    5. Can a store in the first third of the source range proceed?

    6. Can a store in the first third of the destination range proceed?
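
    For concreteness, here is a minimal C analogue of the six scenarios above.
    The MM is modeled as memcpy and every name is invented for illustration;
    this is not My 66000 syntax, just a way to pin down which addresses each
    question refers to.

    /* Hypothetical illustration of the six questions.  The MM is modeled as
     * memcpy(dst, src, n); the comments map each following access to a
     * question number. */
    #include <string.h>
    #include <stdint.h>

    void example(uint8_t *src, uint8_t *dst, size_t n,
                 uint8_t *unrelated, uint64_t *regs)
    {
        memcpy(dst, src, n);          /* the MM: may take a long time      */

        regs[0] = regs[1] + regs[2];  /* Q1: no memory reference           */
        regs[3] = unrelated[0];       /* Q2: load outside src and dst      */
        uint8_t a = src[n / 2];       /* Q3: load inside the source        */

        /* assume the MM has completed the first n/3 bytes at this point */
        uint8_t b = dst[n / 6];       /* Q4: load, first third of dst      */
        src[n / 6] = 0;               /* Q5: store, first third of src     */
        dst[n / 6] = 0;               /* Q6: store, first third of dst     */
        (void)a; (void)b;
    }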



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Wed Oct 16 19:26:46 2024
    On Wed, 16 Oct 2024 5:56:34 +0000, Stephen Fuld wrote:

    Even though this is about the MM instruction, and the MM instruction is mentioned in other threads, they have lots of other stuff (thread
    drift), and this isn't related to C, standard or otherwise, so I thought
    it best to start a new thread,

    My questions are about what happens to subsequent instructions that immediately follow the MM in the stream when an MM instruction is
    executing. Since an MM instruction may take quite a long time (in
    computer time) to complete I think it is useful to know what else can
    happen while the MM is executing.

    I will phrase this as a series of questions.

    1. I assume that subsequent non-memory reference instructions can
    proceed simultaneously with the MM. Is that correct?

    Yes, they may begin but they cannot retire.

    2. Can a load or store where the memory address is in neither the source nor the destination of the MM proceed simultaneously with the MM

    Yes, in higher end implementations--after checking for no-conflict
    {and this is dependent on accessing DRAM not MMI/O or config spaces}

    3. Can a load where the memory address is within the source of the MM proceed?

    It is just read data, so, yes--at least theoretically.

    For the next questions, assume for exposition that the MM has proceeded
    to complete 1/3 of the move when the following instructions come up.

    4. Can a load in the first third of the destination range proceed?

    5. Can a store in the first third of the source range proceed?

    6. Can a store in the first third of the destination range proceed?

    In all 3 of these cases, one must have a good way to determine what has
    already been MMed and what is waiting to be MMed. A low-end implementation
    is unlikely to have such; a high-end one will.

    On the other hand, MM is basically going to saturate the cache ports
    (if for no other reason than being as fast as it can be), so there
    may not be a lot of AGEN capability or cache access port availability.
    So, the faster one makes MM (and by extension MS) the less one needs
    of overlap and pipelining.
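
    To make "determine what has already been MMed" concrete, here is a
    minimal sketch of the kind of range check a high-end implementation's
    load/store conflict logic might perform. The done progress counter, the
    three-way result, and all of the names are assumptions made for
    illustration, not anything specified for My 66000.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    enum mm_conflict { MM_NO_CONFLICT, MM_MAY_PROCEED, MM_MUST_WAIT };

    /* half-open ranges [a, a+alen) and [b, b+blen) */
    static bool overlaps(uint64_t a, size_t alen, uint64_t b, size_t blen)
    { return a < b + blen && b < a + alen; }

    static bool contained(uint64_t a, size_t alen, uint64_t b, size_t blen)
    { return a >= b && a + alen <= b + blen; }

    enum mm_conflict
    classify_access(uint64_t addr, size_t size, bool is_store,
                    uint64_t src, uint64_t dst, size_t len, size_t done)
    {
        bool in_src = overlaps(addr, size, src, len);
        bool in_dst = overlaps(addr, size, dst, len);

        if (!in_src && !in_dst)
            return MM_NO_CONFLICT;          /* question 2                       */
        if (!is_store && in_src && !in_dst)
            return MM_MAY_PROCEED;          /* question 3: source is only read  */
        if (contained(addr, size, dst, done))
            return MM_MAY_PROCEED;          /* questions 4 and 6: copied prefix */
        if (is_store && contained(addr, size, src, done))
            return MM_MAY_PROCEED;          /* question 5: already-read prefix  */
        return MM_MUST_WAIT;                /* conservatively stall or replay   */
    }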



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Paul A. Clayton on Wed Oct 16 21:14:37 2024
    On Wed, 16 Oct 2024 20:48:39 +0000, Paul A. Clayton wrote:


    Here is a question that I will leave to Mitch:

    Can a MM that has confirmed permissions commit before it has been
    performed such that uncorrectable errors would be recognized not
    on read of the source but on later read of the destination?

    A Memory Move does not <necessarily> read the destination. In
    order to make the data transfers occur in cache line sizes,
    the first and the last line may be read, but the intermediate
    ones are not read (from DRAM) only to be re-written. An
    implementation with byte write enables might not read any
    of the destination lines.
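
    A sketch of that head/tail distinction, assuming 64-byte lines and a
    non-zero length (the line size, the helper name, and treating the copy as
    line-granular are all assumptions made for illustration):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    #define LINE_SIZE 64u

    /* Does the destination line at line_addr need a read before the copy's
     * data can be merged into it?  Only a partially covered first or last
     * line does; fully covered middle lines can simply be written (and with
     * byte write enables even the partial lines need no read).  Assumes
     * len > 0. */
    static bool dst_line_needs_read(uint64_t dst, size_t len, uint64_t line_addr)
    {
        uint64_t first_line = dst & ~(uint64_t)(LINE_SIZE - 1);
        uint64_t last_line  = (dst + len - 1) & ~(uint64_t)(LINE_SIZE - 1);
        bool partial_head   = (dst % LINE_SIZE) != 0;
        bool partial_tail   = ((dst + len) % LINE_SIZE) != 0;

        if (line_addr == first_line && partial_head) return true;
        if (line_addr == last_line  && partial_tail) return true;
        return false;
    }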

    Then there is the issue with uncorrectable errors at the
    receiving cache. The current protocol has the sender (core)
    not release his write buffer until LLC has replied that
    the data arrived without ECC trouble. Thus, the instruction
    causing the latent uncorrectable error is not retired until
    the data has arrived successfully at LLC.

    I could see some wanting to depend on the copy checking data
    validity synchronously, but some might be okay with a quasi-
    synchronous copy that allows the processor to continue doing work
    outside of the MM.

    As I mentioned before, yes, I intend to allow other instructions
    to operate concurrently with MM, but I also expect MM to consume
    all of L1 cache bandwidth -- just as a LD that misses L1 and L2
    operates concurrently with FDIV.

    If a translation map is provided for coherence, any MM could
    commit once it is not speculative but before the actual copy has
    been performed. Tracking what parts have been completed in the
    presence of other stores would have significant overhead.

    In practice, one is not going to allow MM to get farther than
    the miss buffer ahead of a mispredict shadow.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Thu Oct 17 08:49:08 2024
    On 10/16/2024 12:26 PM, MitchAlsup1 wrote:
    On Wed, 16 Oct 2024 5:56:34 +0000, Stephen Fuld wrote:

    Even though this is about the MM instruction, and the MM instruction is
    mentioned in other threads, they have lots of other stuff (thread
    drift), and this isn't related to C, standard or otherwise, so I thought
    it best to start a new thread,

    My questions are about what happens to subsequent instructions that
    immediately follow the MM in the stream when an MM instruction is
    executing.  Since an MM instruction may take quite a long time (in
    computer time) to complete I think it is useful to know what else can
    happen while the MM is executing.

    I will phrase this as a series of questions.

    1.    I assume that subsequent non-memory reference instructions can
    proceed simultaneously with the MM.  Is that correct?

    Yes, they may begin but they cannot retire.

    2.    Can a load or store where the memory address is in neither the source
    nor the destination of the MM proceed simultaneously with the MM

    Yes, in higher end implementations--after checking for no-conflict
    {and this is dependent on accessing DRAM not MMI/O or config spaces}

    3.    Can a load where the memory address is within the source of the MM
    proceed?

    It is just read data, so, yes--at least theoretically.

    For the next questions, assume for exposition that the MM has proceeded
    to complete 1/3 of the move when the following instructions come up.

    4.    Can a load in the first third of the destination range proceed?

    5.    Can a store in the first third of the source range proceed?

    6.    Can a store in the first third of the destination range proceed?

    In all 3 of these cases; one must have a good way to determine what has
    already been MMed and what is waiting to be MMed. A low end implementation
    is unlikely to have such, a high end will have such.

    On the other hand, MM is basically going to saturate the cache ports
    (if for no other reason than being as fast as it can be) so, there
    may not be a lot of AGEN capability or cache access port availability.


    Yes, but. For a large transfer, say many hundreds to thousands of
    bytes, why run the "middle" bytes through the cache, especially the L1
    (as you indicated in reply to Paul)? It would take some analysis of
    traces to know for sure, but I would expect the probability of reuse of
    such bytes to be low. If that is true, it would take far fewer resources
    (and avoid "sweeping" the cache) to do at least the intermediate reads
    and writes through just the L3, or even a dedicated very small buffer or
    two. Furthermore, for the transfers after the first, unless there is a
    page crossing, why go through a full AGEN when a simple add to the
    previous address is all that is required, thus freeing AGEN resources?


    So, the faster one makes MM (and by extension MS) the less one needs
    of overlap and pipelining.

    Certainly true for small transfers, but for larger ones, I am not so
    sure. It may make more sense to delay the MM completion slightly, for
    the time it takes a single load to take place, in order to allow the
    non-memory reference instructions following that load to execute
    overlapped with the completion of the MM. This needs trace analysis.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Stephen Fuld on Thu Oct 17 13:16:24 2024
    Stephen Fuld wrote:
    On 10/16/2024 12:26 PM, MitchAlsup1 wrote:
    On Wed, 16 Oct 2024 5:56:34 +0000, Stephen Fuld wrote:

    Even though this is about the MM instruction, and the MM instruction is
    mentioned in other threads, they have lots of other stuff (thread
    drift), and this isn't related to C, standard or otherwise, so I thought
    it best to start a new thread,

    My questions are about what happens to subsequent instructions that
    immediately follow the MM in the stream when an MM instruction is
    executing. Since an MM instruction may take quite a long time (in
    computer time) to complete I think it is useful to know what else can
    happen while the MM is executing.

    I will phrase this as a series of questions.

    1. I assume that subsequent non-memory reference instructions can
    proceed simultaneously with the MM. Is that correct?

    Yes, they may begin but they cannot retire.

    2. Can a load or store where the memory address is in neither the
    source
    nor the destination of the MM proceed simultaneously with the MM

    Yes, in higher end implementations--after checking for no-conflict
    {and this is dependent on accessing DRAM not MMI/O or config spaces}

    3. Can a load where the memory address is within the source of the MM
    proceed?

    It is just read data, so, yes--at least theoretically.

    For the next questions, assume for exposition that the MM has proceeded
    to complete 1/3 of the move when the following instructions come up.

    4. Can a load in the first third of the destination range proceed?

    5. Can a store in the first third of the source range proceed?

    6. Can a store in the first third of the destination range proceed?

    In all 3 of these cases; one must have a good way to determine what has
    already been MMed and what is waiting to be MMed. A low end implementation
    is unlikely to have such, a high end will have such.

    On the other hand, MM is basically going to saturate the cache ports
    (if for no other reason than being as fast as it can be) so, there
    may not be a lot of AGEN capability or cache access port availability.


    Yes, but. For a large transfer, say many hundreds to thousands of
    bytes, why run the "middle" bytes through the cache, especially the L1
    (as you indicated in reply to Paul)? It would take some analysis of
    traces to know for sure, but I would expect the probability of reuse of
    such bytes to be low. If that is true, it would take far less resources
    (and avoid "sweeping" the cache) to do at least the intermediate reads
    and writes into just L3, or even a dedicated very small buffer or two. Furthermore, for the transfers after the first, unless there is a page crossing, why go through a full AGEN, when a simple add to the previous address is all that is required, thus freeing AGEN resources.


    So, the faster one makes MM (and by extension MS) the less one needs
    of overlap and pipelining.

    Certainly true for small transfers, but for larger ones, I am not so
    sure. It may make more sense to delay the MM completion slightly for
    the time it takes for a single load to take place in order to allow the non-memory reference instructions following that load to execute
    overlapped with the completion of the MM. Needs trace analysis.

    MM is a (long) series of byte LD and ST to virtual addresses.
    The ordering rules for MM relative to scalar LD and ST before and after it
    should be no different; the only exception is that the exact order in
    which individual MM bytes are moved is not defined, other than being
    overlap safe.

    But the same bypassing and forwarding rules apply.
    E.g. under TSO, MM, being a sequence of stores, cannot start until it is
    at the end of the Load Store Queue and ready to retire. So all older LD
    and ST must have retired, and we only need to consider interactions of MM
    with younger LD and ST.

    If an implementation allows a younger LD [x] to bypass an older
    MM &dst, &src, then the LSQ must retain the LD [x] someplace so it can
    check whether, at some later time, perhaps far later, an MM store to the
    physical address of &dst overlaps the physical address of [x].
    If it does overlap, then the LSQ must trigger a replay of the younger
    LD [x] so it picks up the new value from the dst buffer.
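
    A compact sketch of that check (the structure and all names are invented;
    this only shows the physical-address overlap test and replay decision
    EricP describes):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    struct lsq_load {
        uint64_t pa;        /* translated load address      */
        size_t   size;      /* access width in bytes        */
        bool     bypassed;  /* issued ahead of an older MM? */
    };

    /* Called for each MM store as it is performed; a true result means the
     * retained younger load must be replayed to pick up the new value. */
    static bool needs_replay(const struct lsq_load *ld,
                             uint64_t store_pa, size_t store_len)
    {
        return ld->bypassed &&
               ld->pa < store_pa + store_len &&
               store_pa < ld->pa + ld->size;   /* physical ranges overlap */
    }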

    A younger store ST to any address cannot be seen to bypass an older MM
    store to any address, though it could prefetch.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Thu Oct 17 21:49:19 2024
    On Thu, 17 Oct 2024 17:16:24 +0000, EricP wrote:

    Stephen Fuld wrote:
    On 10/16/2024 12:26 PM, MitchAlsup1 wrote:

    MM is a (long) series of byte LD and ST to virtual addresses.
    The ordering rules for MM relative to scalar LD and ST before and after
    it should be no different, other than the exact order of individual MM
    bytes moved is not defined other than it is overlap safe.

    But the same bypassing and forwarding rules apply.
    E.g. Under TSO, MM being a sequence of stores cannot start until it is
    at the end of the Load Store Queue and ready to retire. So all older LD
    and ST must have retired and we only need consider interactions of MM
    with younger LD and ST.

    My 66000 is NOT TSO; it is causal unless special areas are being
    accessed.

    But, yes, MM must operate as if it is ordered with other memory
    reference instructions.

    If an implementation allows a younger LD [x] to bypass an older
    MM &dst, &src, then LSQ must retain the LD [x] someplace so it can
    check if at some later time, perhaps far later, that a MM store to the physical address of &dst overlaps the physical address of [x].
    If it does overlap then LSQ must trigger a replay of the younger LD [x]
    so it picks up the new value from dst buffer.

    No essential disagreement.

    A younger store ST to any address cannot be seen to bypass an older MM
    store to any address, though it could prefetch.

    Once again it is not TSO.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Paul A. Clayton on Mon Oct 21 00:25:57 2024
    On Sun, 20 Oct 2024 15:57:59 +0000, Paul A. Clayton wrote:

    On 10/16/24 5:14 PM, MitchAlsup1 wrote:
    On Wed, 16 Oct 2024 20:48:39 +0000, Paul A. Clayton wrote:


    Here is a question that I will leave to Mitch:

    Can a MM that has confirmed permissions commit before it has been
    performed such that uncorrectable errors would be recognized not
    on read of the source but on later read of the destination?

    A Memory Move does not <necessarily> read the destination. In
    order to make the data transfers occur in cache line sizes,
    The first and the last line may be read, but the intermediate
    ones are not read (from DRAM) only to be re-written. An
    implementation with byte write enables might not read any
    of the destination lines.

    I was referring to a following instruction reading the
    destination.

    Then there is the issue with uncorrectable errors at the
    receiving cache. The current protocol has the sender (core)
    not release his write buffer until LLC has replied that
    the data arrived without ECC trouble. Thus, the instruction
    causing the latent uncorrectable error is not retired until
    the data has arrived successfully at LLC.

    I was thinking primarily about uncorrectable errors in the
    source. It would be convenient to software for MM to fail early on
    an uncorrectable source error. It would be a little less
    convenient (and possibly more complex in hardware) to generate an
    ECC exception at the end of the MM instruction (or when it
    pauses from a context switch).

    As I describe above, all errors are detected at source read,
    precisely so that they can be reported or recovered.

    With source-signaled errors, MM might be used to scrub memory
    (assuming the microarchitecture did not optimize out copies
    onto self as nops☺).

    Not signaling the error until the destination is read some time
    later prevents software from assuming the copy was correct at the
    time of copying, but allows a copy to commit once all permissions
    have been verified.

    The real question, here, is: if you have corrupted data in memory
    and overwrite it, in its entirety, do you have to detect the error??
    Should you detect the error, or should overwriting the error make
    it "go away"???

    I could see some wanting to depend on the copy checking data
    validity synchronously, but some might be okay with a quasi-
    synchronous copy that allows the processor to continue doing work
    outside of the MM.

    As I mentioned before, Yes I intend to allow other instructions
    to operate concurrently with MM, but I also expect MM to consume
    all of L1 cache bandwidth. Just like LD L1-L2-miss operates
    concurrently with FDIV.

    For large copies, I could see having the copying done at L2 or
    even L3 with distinct address generation (at least within a page,

    As you get farther out than L1, you end up not having a TLB to
    translate page-crossing addresses.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Sun Oct 20 20:30:45 2024
    On 10/20/2024 5:25 PM, MitchAlsup1 wrote:
    On Sun, 20 Oct 2024 15:57:59 +0000, Paul A. Clayton wrote:

    On 10/16/24 5:14 PM, MitchAlsup1 wrote:
    On Wed, 16 Oct 2024 20:48:39 +0000, Paul A. Clayton wrote:


    Here is a question that I will leave to Mitch:

    Can a MM that has confirmed permissions commit before it has been
    performed such that uncorrectable errors would be recognized not
    on read of the source but on later read of the destination?

    A Memory Move does not <necessarily> read the destination. In
    order to make the data transfers occur in cache line sizes,
    The first and the last line may be read, but the intermediate
    ones are not read (from DRAM) only to be re-written. An
    implementation with byte write enables might not read any
    of the destination lines.

    I was referring to a following instruction reading the
    destination.

    Then there is the issue with uncorrectable errors at the
    receiving cache. The current protocol has the sender (core)
    not release his write buffer until LLC has replied that
    the data arrived without ECC trouble. Thus, the instruction
    causing the latent uncorrectable error is not retired until
    the data has arrived successfully at LLC.

    I was thinking primarily about uncorrectable errors in the
    source. It would be convenient to software for MM to fail early on
    an uncorrectable source error. It would be a little less
    convenient (and possibly more complex in hardware) to generate an
    ECC exception at the end of the MM instruction (or when it
    pauses from a context switch).

    As I describe above, all errors are detected at source read,
    precisely so that they can be detected or recovered.

    With source-signaled errors, MM might be used to scrub memory
    (assuming the microarchitecture did not optimize out copies
    onto self as nops☺).

    Not signaling the error until the destination is read some time
    later prevents software from assuming the copy was correct at the
    time of copying, but allows a copy to commit once all permissions
    have been verified.

    The real question, here, is: if you have corrupted data in memory
    and overwrite it, in its entirety, do you have to detect the error ??
    should you detect the error, or should overwriting the error make
    it "go away" ???

    I could see some wanting to depend on the copy checking data
    validity synchronously, but some might be okay with a quasi-
    synchronous copy that allows the processor to continue doing work
    outside of the MM.

    As I mentioned before, Yes I intend to allow other instructions
    to operate concurrently with MM, but I also expect MM to consume
    all of L1 cache bandwidth. Just like LD L1-L2-miss operates
    concurrently with FDIV.

    For large copies, I could see having the copying done at L2 or
    even L3 with distinct address generation (at least within a page,

    As you get farther out than L1; you end up not having a TLB to
    translate page crossing addresses.

    Yes, but once you translate the starting address of a page, which you
    can tell from the low-order X bits (X depends on page size) being
    zero, you don't need to translate again until the next page crossing.
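
    A sketch of that idea, assuming 4 KiB pages, 64-byte lines, and
    line-aligned operands; translate() and copy_one_line() are hypothetical
    stand-ins for a TLB lookup and the data mover:

    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_SIZE 4096u
    #define LINE_SIZE 64u

    extern uint64_t translate(uint64_t va);                      /* hypothetical */
    extern void copy_one_line(uint64_t src_pa, uint64_t dst_pa); /* hypothetical */

    void copy_lines(uint64_t src_va, uint64_t dst_va, size_t lines)
    {
        uint64_t src_pa = translate(src_va);
        uint64_t dst_pa = translate(dst_va);

        for (size_t i = 0; i < lines; i++) {
            copy_one_line(src_pa, dst_pa);

            src_va += LINE_SIZE;  dst_va += LINE_SIZE;
            src_pa += LINE_SIZE;  dst_pa += LINE_SIZE;

            /* low-order bits wrapped to zero => just crossed into a new page */
            if ((src_va & (PAGE_SIZE - 1)) == 0) src_pa = translate(src_va);
            if ((dst_va & (PAGE_SIZE - 1)) == 0) dst_pa = translate(dst_va);
        }
    }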


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Paul A. Clayton on Mon Oct 21 06:32:52 2024
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    I was thinking primarily about uncorrectable errors in the
    source. It would be convenient to software for MM to fail early on
    an uncorrectable source error. It would be a little less
    convenient (and possibly more complex in hardware) to generate an
    ECC exception at the end of the MM instruction (or when it
    pauses from a context switch).

    Welcome to the DDR5 age (which started in 2021). DDR5 not only has
    on-die ECC, it also has ECS (error check and scrub). From
    <https://in.micron.com/content/dam/micron/global/public/products/white-paper/ddr5-new-features-white-paper.pdf>:

    |An additional feature of the DDR5 SDRAM ECC is the error check and
    |scrub (ECS) function. The ECS function is a read of internal data and
    |the writing back of corrected data if an error occurred. ECS can be
    |used as a manual function initiated by a Multi-Purpose Command (MPC),
    |or the DDR5 SDRAM can run the ECS in automatic mode, where the DRAM
    |schedules and performs the ECS commands as needed to complete a full
    |scrub of the data bits in the array within the recommended 24-hour
    |period. At the completion of a full-array scrub, the DDR5 reports the
    |number of errors that were corrected during the scrub (once the error
    |count exceeds a minimum fail threshold) and reports the row with the
    |highest number of errors, which is also subject to a minimum
    |threshold.

    I am wondering why scrubbing is not performed automatically on
    refresh.

    Even before DDR5, scrubbing is a feature of the memory controller that
    it performs in hardware (i.e., without software having to do something
    in an interrupt handler or some such). Of course the My66000 memory
    controller may be less capable, and leave scrubbing to software; maybe
    also refresh?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Mon Oct 21 12:56:59 2024
    On Mon, 21 Oct 2024 06:32:52 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    I was thinking primarily about uncorrectable errors in the
    source. It would be convenient to software for MM to fail early on
    an uncorrectable source error. It would be a little less
    convenient (and possibly more complex in hardware) to generate an
    ECC exception at the end of the MM instruction (or when it
    pauses from a context switch).

    Welcome to the DDR5 age (which started in 2021). DDR5 not just has
    on-die ECC, it also has ECS (error check and scrub). From <https://in.micron.com/content/dam/micron/global/public/products/white-paper/ddr5-new-features-white-paper.pdf>:

    |An additional feature of the DDR5 SDRAM ECC is the error check and
    |scrub (ECS) function. The ECS function is a read of internal data and
    |the writing back of corrected data if an error occurred. ECS can be
    |used as a manual function initiated by a Multi-Purpose Command (MPC),
    |or the DDR5 SDRAM can run the ECS in automatic mode, where the DRAM
    |schedules and performs the ECS commands as needed to complete a full
    |scrub of the data bits in the array within the recommended 24-hour
    |period. At the completion of a full-array scrub, the DDR5 reports the
    |number of errors that were corrected during the scrub (once the error
    |count exceeds a minimum fail threshold) and reports the row with the
    |highest number of errors, which is also subject to a minimum
    |threshold.

    I am wondering why scrubbing is not performed automatically on
    refresh.


    A typical DDR5 row contains 4096 data bits. That's 32x bigger than the
    internal ECC block (the on-die ECC word covers 128 data bits, and
    4096 / 128 = 32). In order to do the scrub at refresh without a major
    increase in refresh timing, one would need a lot more ECC correction
    logic than is currently present.
    Even with a lot of extra logic there will be some slowdown, likely one
    clock (== 16T == 2.5 to 3.3 ns).

    Even before DDR5, scrubbing is a feature of the memory controller that
    it performs in hardware (i.e., without software having to do something
    in an interrupt handler or some such). Of course the My66000 memory controller may be less capable, and leave scrubbing to software; maybe
    also refresh?


    Hopefully the last part is tongue in cheek.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Tue Oct 22 13:08:38 2024
    MitchAlsup1 wrote:
    On Sun, 20 Oct 2024 15:57:59 +0000, Paul A. Clayton wrote:

    With source-signaled errors, MM might be used to scrub memory
    (assuming the microarchitecture did not optimize out copies
    onto self as nops☺).

    Not signaling the error until the destination is read some time
    later prevents software from assuming the copy was correct at the
    time of copying, but allows a copy to commit once all permissions
    have been verified.

    The real question, here, is: if you have corrupted data in memory
    and overwrite it, in its entirety, do you have to detect the error ??
    should you detect the error, or should overwriting the error make
    it "go away" ???

    IF an error is detected then it should be required to at least be logged.
    But that particular error, no, it should not be required to be detected,
    as that would force an extra read cycle onto every quadword store.

    By logging I mean a fifo buffer that error reports can be dumped into.
    This ensures that the fact an error was detected is not lost.
    This can be set to trigger a high priority interrupt on a number of
    conditions, such as the fifo buffer being 3/4 full, specific kinds of
    errors, or a report having sat in the buffer a while, say 1 sec.
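
    A minimal sketch of such an error-log fifo and its interrupt triggers
    (the entry layout, thresholds, and names are all invented for
    illustration):

    #include <stdbool.h>
    #include <stdint.h>

    #define LOG_ENTRIES 64u                 /* power of two */

    enum err_kind { ERR_CORRECTED, ERR_UNCORRECTED, ERR_POISON_CONSUMED };

    struct err_report {
        enum err_kind kind;
        uint64_t      phys_addr;            /* narrows down to bank/rank/FRU */
        uint64_t      timestamp;            /* for the age-based trigger     */
    };

    struct err_log {
        struct err_report e[LOG_ENTRIES];
        uint32_t head, tail;                /* head == tail means empty      */
    };

    static uint32_t log_count(const struct err_log *l)
    { return (l->tail - l->head) & (LOG_ENTRIES - 1); }

    /* Returns true when a high-priority interrupt should be requested. */
    static bool log_error(struct err_log *l, struct err_report r, uint64_t now_ns)
    {
        if (log_count(l) < LOG_ENTRIES - 1) {
            l->e[l->tail] = r;
            l->tail = (l->tail + 1) & (LOG_ENTRIES - 1);
        }                                   /* else the report is dropped    */

        bool nearly_full = log_count(l) >= (3 * LOG_ENTRIES) / 4;
        bool severe      = r.kind != ERR_CORRECTED;
        bool stale       = log_count(l) != 0 &&
                           now_ns - l->e[l->head].timestamp > 1000000000ull;
        return nearly_full || severe || stale;  /* ~3/4 full, kind, or ~1 s  */
    }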

    Also, that error should not raise an error exception in an application,
    for any number of reasons -- for example, most writes are lazy and could
    complete long after the app that did the store has been switched out.

    But each error must be assessed on an individual basis.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to EricP on Tue Oct 22 18:13:25 2024
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    MitchAlsup1 wrote:
    On Sun, 20 Oct 2024 15:57:59 +0000, Paul A. Clayton wrote:

    With source-signaled errors, MM might be used to scrub memory
    (assuming the microarchitecture did not optimize out copies
    onto self as nops☺).

    Not signaling the error until the destination is read some time
    later prevents software from assuming the copy was correct at the
    time of copying, but allows a copy to commit once all permissions
    have been verified.

    The real question, here, is: if you have corrupted data in memory
    and overwrite it, in its entirety, do you have to detect the error ??
    should you detect the error, or should overwriting the error make
    it "go away" ???

    IF an error is detected then it should be required to at least be logged.

    Even corrected errors must be logged for proper RAS. Undetected errors,
    such as Mitch has described, by definition can't be logged.

    If that corrupted data was read, the read transaction should be tagged
    as poisoned, and _when consumed_[*], an error should be raised
    and/or logged. If it is never consumed, it's an open question whether
    logging is useful or required.

    [*] If it were a speculative read, for example, that was never
    consumed, should it also raise/log an error?

    <snip>

    By logging I mean a fifo buffer that error reports can be dumped into.

    Typical reports should include the bank/RAM location information
    to at least narrow it down to an FRU.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Michael S on Tue Oct 22 18:21:04 2024
    On Mon, 21 Oct 2024 9:56:59 +0000, Michael S wrote:

    On Mon, 21 Oct 2024 06:32:52 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    I was thinking primarily about uncorrectable errors in the
    source. It would be convenient to software for MM to fail early on
    an uncorrectable source error. It would be a little less
    convenient (and possibly more complex in hardware) to generate an
    ECC exception at the end of the MM instruction (or when it
    pauses from a context switch).

    Welcome to the DDR5 age (which started in 2021). DDR5 not just has
    on-die ECC, it also has ECS (error check and scrub). From
    <https://in.micron.com/content/dam/micron/global/public/products/white-paper/ddr5-new-features-white-paper.pdf>:

    |An additional feature of the DDR5 SDRAM ECC is the error check and
    |scrub (ECS) function. The ECS function is a read of internal data and
    |the writing back of corrected data if an error occurred. ECS can be
    |used as a manual function initiated by a Multi-Purpose Command (MPC),
    |or the DDR5 SDRAM can run the ECS in automatic mode, where the DRAM
    |schedules and performs the ECS commands as needed to complete a full
    |scrub of the data bits in the array within the recommended 24-hour
    |period. At the completion of a full-array scrub, the DDR5 reports the
    |number of errors that were corrected during the scrub (once the error
    |count exceeds a minimum fail threshold) and reports the row with the
    |highest number of errors, which is also subject to a minimum
    |threshold.

    I am wondering why scrubbing is not performed automatically on
    refresh.


    Typical DDR5 row contains 4096 data bits. That's 32x bigger than
    internal ECC block. In order to do scrub at refresh without major
    increase in refresh timing one would need a lot more ECC correction
    logic than currently present.

    The standard 64+8 code is SECDED and takes 8 gate delays (5 XOR gates,
    2 NAND gates, 1 XOR gate).
    One can do as many 64+8 error checks as one wants in those 8 gates of
    delay.

    It is only when one wants better than SEC/DED that things get
    interesting.
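
    For readers who want to see what SEC-DED means operationally, here is a
    generic software illustration over a 72-bit word (64 data + 8 check bits)
    using the textbook extended-Hamming construction. It is only meant to
    show the correct/detect outcomes, not the specific code or gate-level
    structure being discussed here.

    #include <stdint.h>
    #include <stdbool.h>

    /* Codeword layout: bit 0 is the overall parity bit; bits 1..71 hold a
     * Hamming(71,64) code with check bits at positions 1,2,4,...,64 and
     * data bits at the remaining (non power-of-two) positions. */
    typedef struct { uint64_t lo, hi; } cw72_t;

    static int  get (const cw72_t *c, int i)
    { return (int)((i < 64 ? c->lo >> i : c->hi >> (i - 64)) & 1); }
    static void flip(cw72_t *c, int i)
    { if (i < 64) c->lo ^= 1ull << i; else c->hi ^= 1ull << (i - 64); }

    static cw72_t secded_encode(uint64_t data)
    {
        cw72_t c = {0, 0};
        for (int i = 3, d = 0; i <= 71 && d < 64; i++) {
            if ((i & (i - 1)) == 0) continue;          /* skip 4,8,...,64 */
            if ((data >> d++) & 1) flip(&c, i);
        }
        int syndrome = 0, overall = 0;
        for (int i = 1; i <= 71; i++)
            if (get(&c, i)) { syndrome ^= i; overall ^= 1; }
        for (int j = 0; j < 7; j++)                    /* set check bits  */
            if ((syndrome >> j) & 1) { flip(&c, 1 << j); overall ^= 1; }
        if (overall) flip(&c, 0);                      /* overall parity  */
        return c;
    }

    enum ecc_result { ECC_OK, ECC_CORRECTED, ECC_UNCORRECTABLE };

    static enum ecc_result secded_check(cw72_t *c)
    {
        int syndrome = 0, overall = 0;
        for (int i = 0; i < 72; i++)
            if (get(c, i)) { overall ^= 1; syndrome ^= i; }
        if (syndrome == 0 && overall == 0) return ECC_OK;
        if (overall == 1) {                /* odd number of flipped bits   */
            flip(c, syndrome);             /* syndrome 0 => parity bit 0   */
            return ECC_CORRECTED;          /* single-bit error corrected   */
        }
        return ECC_UNCORRECTABLE;          /* an even number (>= 2) of errors */
    }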

    Even with a lot of extra logic there will be some slowdown, likely one
    clock (== 16T == 2.5 to 3.3 ns).

    DRAM clock should not be that much slower than CPU clock -- a factor
    of 2-3 is expected (logic delay), putting DRAM clock in the 0.75ns
    range.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Tue Oct 22 18:16:25 2024
    On Tue, 22 Oct 2024 17:08:38 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Sun, 20 Oct 2024 15:57:59 +0000, Paul A. Clayton wrote:

    With source-signaled errors, MM might be used to scrub memory
    (assuming the microarchitecture did not optimize out copies
    onto self as nops☺).

    Not signaling the error until the destination is read some time
    later prevents software from assuming the copy was correct at the
    time of copying, but allows a copy to commit once all permissions
    have been verified.

    The real question, here, is: if you have corrupted data in memory
    and overwrite it, in its entirety, do you have to detect the error ??
    should you detect the error, or should overwriting the error make
    it "go away" ???

    IF an error is detected then it should be required to at least be
    logged. But that particular error, no it should not be required to
    be detected as that would force an extra read cycle onto every
    quadword store.

    This, too, is my interpretation--in a similar way that one can use
    bad-ECC to denote uninitialized data which "goes away" when the
    data gets written, Memory Moving good-data over bad-data makes
    the error "go away".

    By logging I mean a fifo buffer that error reports can be dumped into.

    As I specified, all data errors are available at execute time
    and can be properly trapped with precision.

    This ensures that the fact an error was detected is not lost.
    This can be set to trigger a high priority interrupt on a number of conditions such as fifo buffer 3/4 full, or specific kinds of errors,
    or when a log has been in the buffer a while, say 1 sec.

    Also that error should not raise an error exception in an application
    for any number of reasons, such as most writes are lazy and could be
    long after the app that did the store has been switched out.

    In My 66000 all written data is available prior to retirement, so
    errors can be detected/corrected before the store is retired.

    But each error must be assessed on an individual basis.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Scott Lurndal on Tue Oct 22 14:45:09 2024
    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    MitchAlsup1 wrote:
    On Sun, 20 Oct 2024 15:57:59 +0000, Paul A. Clayton wrote:

    With source-signaled errors, MM might be used to scrub memory
    (assuming the microarchitecture did not optimize out copies
    onto self as nops☺).

    Not signaling the error until the destination is read some time
    later prevents software from assuming the copy was correct at the
    time of copying, but allows a copy to commit once all permissions
    have been verified.
    The real question, here, is: if you have corrupted data in memory
    and overwrite it, in its entirety, do you have to detect the error ??
    should you detect the error, or should overwriting the error make
    it "go away" ???
    IF an error is detected then it should be required to at least be logged.

    Even corrected errors must be logged for proper RAS. Undetected errors,
    such as Mitch has described, by definition can't be logged.

    If that corrupted data was read, the read transaction should be tagged
    as poisoned, and _when consumed_[*], an error should be raised
    and/or logged. If it is never consumed, it's an open question whether logging is useful or required.

    The scenario given was a store that overwrites the erroneous data
    completely. Assuming an 8+64 SECDED ECC, no, it should not be required
    to be detected, and therefore not logged or raised, as that would force
    an unnecessary read cycle on every quadword memory store just to check
    for something that very rarely occurs.

    Compare that to, say, a byte memory store, which is a RMW cycle.
    If the read portion detects an error then it should be logged.
    But again not raised as an error, because stores are mostly asynchronous
    and could complete long after the store instruction.


    [*] If it were a speculative read, for example, that was never
    consumed, should it also raise/log and error?

    I'm drawing a distinction between logging, which is asynchronous,
    and raising some kind of exception, which requires the detection
    be synchronous with the code.

    But even if an error is synchronous, if it is transient *and corrected*,
    whether a memory or a bus error, or even an internal register parity error
    (if it had such checks), I want it logged but not elevated to an exception.

    A speculative read is the same as a scrubber read error detect --
    log but no exception, because it is asynchronous to the code.

    <snip>

    By logging I mean a fifo buffer that error reports can be dumped into.

    Typical reports should include the bank/ram location information
    to at least narrow down to an FRU.

    Yes, then software might take the physical addresses from the disk log
    and backtrack to the physical page frames affected, and if not transient
    then mark those pages to be retired the next time they get recycled.

    Oh, and a pop-up "You have memory errors"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)