• MM instruction and the pipeline

    From Stephen Fuld@21:1/5 to All on Tue Oct 15 22:56:34 2024
    Even though this is about the MM instruction, and the MM instruction is
    mentioned in other threads, those threads have lots of other stuff (thread
    drift), and this isn't related to C, standard or otherwise, so I thought
    it best to start a new thread.

    My questions are about what happens to subsequent instructions that
    immediately follow the MM in the stream when an MM instruction is
    executing. Since an MM instruction may take quite a long time (in
    computer time) to complete I think it is useful to know what else can
    happen while the MM is executing.

    I will phrase this as a series of questions.

    1. I assume that subsequent non-memory reference instructions can
    proceed simultaneously with the MM. Is that correct?

    2. Can a load or store where the memory address is in neither the source
    nor the destination of the MM proceed simultaneously with the MM?

    3. Can a load where the memory address is within the source of the MM proceed?

    For the next questions, assume for exposition that the MM has proceeded
    to complete 1/3 of the move when the following instructions come up.

    4. Can a load in the first third of the destination range proceed?

    5. Can a store in the first third of the source range proceed?

    6. Can a store in the first third of the destination range proceed?
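
    For concreteness, here is a minimal C analogue of the six scenarios above.
    The MM is modeled as memcpy and every name is invented for illustration;
    this is not My 66000 syntax, just a way to pin down which addresses each
    question refers to.

    /* Hypothetical illustration of the six questions.  The MM is modeled as
     * memcpy(dst, src, n); the comments map each following access to a
     * question number. */
    #include <string.h>
    #include <stdint.h>

    void example(uint8_t *src, uint8_t *dst, size_t n,
                 uint8_t *unrelated, uint64_t *regs)
    {
        memcpy(dst, src, n);          /* the MM: may take a long time      */

        regs[0] = regs[1] + regs[2];  /* Q1: no memory reference           */
        regs[3] = unrelated[0];       /* Q2: load outside src and dst      */
        uint8_t a = src[n / 2];       /* Q3: load inside the source        */

        /* assume the MM has completed the first n/3 bytes at this point */
        uint8_t b = dst[n / 6];       /* Q4: load, first third of dst      */
        src[n / 6] = 0;               /* Q5: store, first third of src     */
        dst[n / 6] = 0;               /* Q6: store, first third of dst     */
        (void)a; (void)b;
    }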



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Wed Oct 16 19:26:46 2024
    On Wed, 16 Oct 2024 5:56:34 +0000, Stephen Fuld wrote:

    Even though this is about the MM instruction, and the MM instruction is mentioned in other threads, they have lots of other stuff (thread
    drift), and this isn't related to C, standard or otherwise, so I thought
    it best to start a new thread,

    My questions are about what happens to subsequent instructions that immediately follow the MM in the stream when an MM instruction is
    executing. Since an MM instruction may take quite a long time (in
    computer time) to complete I think it is useful to know what else can
    happen while the MM is executing.

    I will phrase this as a series of questions.

    1. I assume that subsequent non-memory reference instructions can
    proceed simultaneously with the MM. Is that correct?

    Yes, they may begin but they cannot retire.

    2. Can a load or store where the memory address is in neither the source nor the destination of the MM proceed simultaneously with the MM

    Yes, in higher end implementations--after checking for no-conflict
    {and this is dependent on accessing DRAM not MMI/O or config spaces}

    3. Can a load where the memory address is within the source of the MM proceed?

    It is just read data, so, yes--at least theoretically.

    For the next questions, assume for exposition that the MM has proceeded
    to complete 1/3 of the move when the following instructions come up.

    4. Can a load in the first third of the destination range proceed?

    5. Can a store in the first third of the source range proceed?

    6. Can a store in the first third of the destination range proceed?

    In all 3 of these cases, one must have a good way to determine what has
    already been MMed and what is waiting to be MMed. A low-end implementation
    is unlikely to have such; a high-end one will.

    On the other hand, MM is basically going to saturate the cache ports
    (if for no other reason than being as fast as it can be), so there
    may not be a lot of AGEN capability or cache access port availability.
    So, the faster one makes MM (and by extension MS) the less one needs
    of overlap and pipelining.
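
    To make "determine what has already been MMed" concrete, here is a
    minimal sketch of the kind of range check a high-end implementation's
    load/store conflict logic might perform. The done progress counter, the
    three-way result, and all of the names are assumptions made for
    illustration, not anything specified for My 66000.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    enum mm_conflict { MM_NO_CONFLICT, MM_MAY_PROCEED, MM_MUST_WAIT };

    /* half-open ranges [a, a+alen) and [b, b+blen) */
    static bool overlaps(uint64_t a, size_t alen, uint64_t b, size_t blen)
    { return a < b + blen && b < a + alen; }

    static bool contained(uint64_t a, size_t alen, uint64_t b, size_t blen)
    { return a >= b && a + alen <= b + blen; }

    enum mm_conflict
    classify_access(uint64_t addr, size_t size, bool is_store,
                    uint64_t src, uint64_t dst, size_t len, size_t done)
    {
        bool in_src = overlaps(addr, size, src, len);
        bool in_dst = overlaps(addr, size, dst, len);

        if (!in_src && !in_dst)
            return MM_NO_CONFLICT;          /* question 2                       */
        if (!is_store && in_src && !in_dst)
            return MM_MAY_PROCEED;          /* question 3: source is only read  */
        if (contained(addr, size, dst, done))
            return MM_MAY_PROCEED;          /* questions 4 and 6: copied prefix */
        if (is_store && contained(addr, size, src, done))
            return MM_MAY_PROCEED;          /* question 5: already-read prefix  */
        return MM_MUST_WAIT;                /* conservatively stall or replay   */
    }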



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Paul A. Clayton on Wed Oct 16 21:14:37 2024
    On Wed, 16 Oct 2024 20:48:39 +0000, Paul A. Clayton wrote:


    Here is a question that I will leave to Mitch:

    Can a MM that has confirmed permissions commit before it has been
    performed such that uncorrectable errors would be recognized not
    on read of the source but on later read of the destination?

    A Memory Move does not <necessarily> read the destination. In
    order to make the data transfers occur in cache line sizes,
    the first and the last line may be read, but the intermediate
    ones are not read (from DRAM) only to be re-written. An
    implementation with byte write enables might not read any
    of the destination lines.
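
    A sketch of that head/tail distinction, assuming 64-byte lines and a
    non-zero length (the line size, the helper name, and treating the copy as
    line-granular are all assumptions made for illustration):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    #define LINE_SIZE 64u

    /* Does the destination line at line_addr need a read before the copy's
     * data can be merged into it?  Only a partially covered first or last
     * line does; fully covered middle lines can simply be written (and with
     * byte write enables even the partial lines need no read).  Assumes
     * len > 0. */
    static bool dst_line_needs_read(uint64_t dst, size_t len, uint64_t line_addr)
    {
        uint64_t first_line = dst & ~(uint64_t)(LINE_SIZE - 1);
        uint64_t last_line  = (dst + len - 1) & ~(uint64_t)(LINE_SIZE - 1);
        bool partial_head   = (dst % LINE_SIZE) != 0;
        bool partial_tail   = ((dst + len) % LINE_SIZE) != 0;

        if (line_addr == first_line && partial_head) return true;
        if (line_addr == last_line  && partial_tail) return true;
        return false;
    }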

    Then there is the issue with uncorrectable errors at the
    receiving cache. The current protocol has the sender (core)
    not release his write buffer until LLC has replied that
    the data arrived without ECC trouble. Thus, the instruction
    causing the latent uncorrectable error is not retired until
    the data has arrived successfully at LLC.

    I could see some wanting to depend on the copy checking data
    validity synchronously, but some might be okay with a quasi-
    synchronous copy that allows the processor to continue doing work
    outside of the MM.

    As I mentioned before, yes, I intend to allow other instructions
    to operate concurrently with MM, but I also expect MM to consume
    all of L1 cache bandwidth -- just as a LD that misses L1 and L2
    operates concurrently with FDIV.

    If a translation map is provided for coherence, any MM could
    commit once it is not speculative but before the actual copy has
    been performed. Tracking what parts have been completed in the
    presence of other stores would have significant overhead.

    In practice, one is not going to allow MM to get farther than
    the miss buffer ahead of a mispredict shadow.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Thu Oct 17 08:49:08 2024
    On 10/16/2024 12:26 PM, MitchAlsup1 wrote:
    On Wed, 16 Oct 2024 5:56:34 +0000, Stephen Fuld wrote:

    Even though this is about the MM instruction, and the MM instruction is
    mentioned in other threads, they have lots of other stuff (thread
    drift), and this isn't related to C, standard or otherwise, so I thought
    it best to start a new thread,

    My questions are about what happens to subsequent instructions that
    immediately follow the MM in the stream when an MM instruction is
    executing.  Since an MM instruction may take quite a long time (in
    computer time) to complete I think it is useful to know what else can
    happen while the MM is executing.

    I will phrase this as a series of questions.

    1.    I assume that subsequent non-memory reference instructions can
    proceed simultaneously with the MM.  Is that correct?

    Yes, they may begin but they cannot retire.

    2.    Can a load or store where the memory address is in neither the source
    nor the destination of the MM proceed simultaneously with the MM

    Yes, in higher end implementations--after checking for no-conflict
    {and this is dependent on accessing DRAM not MMI/O or config spaces}

    3.    Can a load where the memory address is within the source of the MM
    proceed?

    It is just read data, so, yes--at least theoretically.

    For the next questions, assume for exposition that the MM has proceeded
    to complete 1/3 of the move when the following instructions come up.

    4.    Can a load in the first third of the destination range proceed?

    5.    Can a store in the first third of the source range proceed?

    6.    Can a store in the first third of the destination range proceed?

    In all 3 of these cases; one must have a good way to determine what has
    already been MMed and what is waiting to be MMed. A low end implementation
    is unlikely to have such, a high end will have such.

    On the other hand, MM is basically going to saturate the cache ports
    (if for no other reason than being as fast as it can be) so, there
    may not be a lot of AGEN capability or cache access port availability.


    Yes, but. For a large transfer, say many hundreds to thousands of
    bytes, why run the "middle" bytes through the cache, especially the L1
    (as you indicated in reply to Paul)? It would take some analysis of
    traces to know for sure, but I would expect the probability of reuse of
    such bytes to be low. If that is true, it would take far fewer resources
    (and avoid "sweeping" the cache) to do at least the intermediate reads
    and writes through just the L3, or even a dedicated very small buffer or
    two. Furthermore, for the transfers after the first, unless there is a
    page crossing, why go through a full AGEN when a simple add to the
    previous address is all that is required, thus freeing AGEN resources?


    So, the faster one makes MM (and by extension MS) the less one needs
    of overlap and pipelining.

    Certainly true for small transfers, but for larger ones, I am not so
    sure. It may make more sense to delay the MM completion slightly, for
    the time it takes a single load to take place, in order to allow the
    non-memory reference instructions following that load to execute
    overlapped with the completion of the MM. This needs trace analysis.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Stephen Fuld on Thu Oct 17 13:16:24 2024
    Stephen Fuld wrote:
    On 10/16/2024 12:26 PM, MitchAlsup1 wrote:
    On Wed, 16 Oct 2024 5:56:34 +0000, Stephen Fuld wrote:

    Even though this is about the MM instruction, and the MM instruction is
    mentioned in other threads, they have lots of other stuff (thread
    drift), and this isn't related to C, standard or otherwise, so I thought
    it best to start a new thread,

    My questions are about what happens to subsequent instructions that
    immediately follow the MM in the stream when an MM instruction is
    executing. Since an MM instruction may take quite a long time (in
    computer time) to complete I think it is useful to know what else can
    happen while the MM is executing.

    I will phrase this as a series of questions.

    1. I assume that subsequent non-memory reference instructions can
    proceed simultaneously with the MM. Is that correct?

    Yes, they may begin but they cannot retire.

    2. Can a load or store where the memory address is in neither the
    source
    nor the destination of the MM proceed simultaneously with the MM

    Yes, in higher end implementations--after checking for no-conflict
    {and this is dependent on accessing DRAM not MMI/O or config spaces}

    3. Can a load where the memory address is within the source of the MM
    proceed?

    It is just read data, so, yes--at least theoretically.

    For the next questions, assume for exposition that the MM has proceeded
    to complete 1/3 of the move when the following instructions come up.

    4. Can a load in the first third of the destination range proceed?

    5. Can a store in the first third of the source range proceed?

    6. Can a store in the first third of the destination range proceed?

    In all 3 of these cases; one must have a good way to determine what has
    already been MMed and what is waiting to be MMed. A low end implementation
    is unlikely to have such, a high end will have such.

    On the other hand, MM is basically going to saturate the cache ports
    (if for no other reason than being as fast as it can be) so, there
    may not be a lot of AGEN capability or cache access port availability.


    Yes, but. For a large transfer, say many hundreds to thousands of
    bytes, why run the "middle" bytes through the cache, especially the L1
    (as you indicated in reply to Paul)? It would take some analysis of
    traces to know for sure, but I would expect the probability of reuse of
    such bytes to be low. If that is true, it would take far less resources
    (and avoid "sweeping" the cache) to do at least the intermediate reads
    and writes into just L3, or even a dedicated very small buffer or two. Furthermore, for the transfers after the first, unless there is a page crossing, why go through a full AGEN, when a simple add to the previous address is all that is required, thus freeing AGEN resources.


    So, the faster one makes MM (and by extension MS) the less one needs
    of overlap and pipelining.

    Certainly true for small transfers, but for larger ones, I am not so
    sure. It may make more sense to delay the MM completion slightly for
    the time it takes for a single load to take place in order to allow the non-memory reference instructions following that load to execute
    overlapped with the completion of the MM. Needs trace analysis.

    MM is a (long) series of byte LD and ST to virtual addresses.
    The ordering rules for MM relative to scalar LD and ST before and after it
    should be no different; the only exception is that the exact order in
    which individual MM bytes are moved is not defined, other than being
    overlap safe.

    But the same bypassing and forwarding rules apply.
    E.g. under TSO, MM, being a sequence of stores, cannot start until it is
    at the end of the Load Store Queue and ready to retire. So all older LD
    and ST must have retired, and we only need to consider interactions of MM
    with younger LD and ST.

    If an implementation allows a younger LD [x] to bypass an older
    MM &dst, &src, then the LSQ must retain the LD [x] someplace so it can
    check whether, at some later time, perhaps far later, an MM store to the
    physical address of &dst overlaps the physical address of [x].
    If it does overlap, then the LSQ must trigger a replay of the younger
    LD [x] so it picks up the new value from the dst buffer.
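
    A compact sketch of that check (the structure and all names are invented;
    this only shows the physical-address overlap test and replay decision
    EricP describes):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    struct lsq_load {
        uint64_t pa;        /* translated load address      */
        size_t   size;      /* access width in bytes        */
        bool     bypassed;  /* issued ahead of an older MM? */
    };

    /* Called for each MM store as it is performed; a true result means the
     * retained younger load must be replayed to pick up the new value. */
    static bool needs_replay(const struct lsq_load *ld,
                             uint64_t store_pa, size_t store_len)
    {
        return ld->bypassed &&
               ld->pa < store_pa + store_len &&
               store_pa < ld->pa + ld->size;   /* physical ranges overlap */
    }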

    A younger store ST to any address cannot be seen to bypass an older MM
    store to any address, though it could prefetch.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Thu Oct 17 21:49:19 2024
    On Thu, 17 Oct 2024 17:16:24 +0000, EricP wrote:

    Stephen Fuld wrote:
    On 10/16/2024 12:26 PM, MitchAlsup1 wrote:

    MM is a (long) series of byte LD and ST to virtual addresses.
    The ordering rules for MM relative to scalar LD and ST before and after
    it should be no different, other than the exact order of individual MM
    bytes moved is not defined other than it is overlap safe.

    But the same bypassing and forwarding rules apply.
    E.g. Under TSO, MM being a sequence of stores cannot start until it is
    at the end of the Load Store Queue and ready to retire. So all older LD
    and ST must have retired and we only need consider interactions of MM
    with younger LD and ST.

    My 66000 is NOT TSO; it is causal unless special areas are being
    accessed.

    But, yes, MM must operate as if it is ordered with other memory
    reference instructions.

    If an implementation allows a younger LD [x] to bypass an older
    MM &dst, &src, then LSQ must retain the LD [x] someplace so it can
    check if at some later time, perhaps far later, that a MM store to the physical address of &dst overlaps the physical address of [x].
    If it does overlap then LSQ must trigger a replay of the younger LD [x]
    so it picks up the new value from dst buffer.

    No essential disagreement.

    A younger store ST to any address cannot be seen to bypass an older MM
    store to any address, though it could prefetch.

    Once again it is not TSO.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Paul A. Clayton on Mon Oct 21 00:25:57 2024
    On Sun, 20 Oct 2024 15:57:59 +0000, Paul A. Clayton wrote:

    On 10/16/24 5:14 PM, MitchAlsup1 wrote:
    On Wed, 16 Oct 2024 20:48:39 +0000, Paul A. Clayton wrote:


    Here is a question that I will leave to Mitch:

    Can a MM that has confirmed permissions commit before it has been
    performed such that uncorrectable errors would be recognized not
    on read of the source but on later read of the destination?

    A Memory Move does not <necessarily> read the destination. In
    order to make the data transfers occur in cache line sizes,
    The first and the last line may be read, but the intermediate
    ones are not read (from DRAM) only to be re-written. An
    implementation with byte write enables might not read any
    of the destination lines.

    I was referring to a following instruction reading the
    destination.

    Then there is the issue with uncorrectable errors at the
    receiving cache. The current protocol has the sender (core)
    not release his write buffer until LLC has replied that
    the data arrived without ECC trouble. Thus, the instruction
    causing the latent uncorrectable error is not retired until
    the data has arrived successfully at LLC.

    I was thinking primarily about uncorrectable errors in the
    source. It would be convenient to software for MM to fail early on
    an uncorrectable source error. It would be a little less
    convenient (and possibly more complex in hardware) to generate an
    ECC exception at the end of the MM instruction (or when it
    pauses from a context switch).

    As I describe above, all errors are detected at source read,
    precisely so that they can be reported or recovered.

    With source-signaled errors, MM might be used to scrub memory
    (assuming the microarchitecture did not optimize out copies
    onto self as nops☺).

    Not signaling the error until the destination is read some time
    later prevents software from assuming the copy was correct at the
    time of copying, but allows a copy to commit once all permissions
    have been verified.

    The real question, here, is: if you have corrupted data in memory
    and overwrite it, in its entirety, do you have to detect the error??
    Should you detect the error, or should overwriting the error make
    it "go away"???

    I could see some wanting to depend on the copy checking data
    validity synchronously, but some might be okay with a quasi-
    synchronous copy that allows the processor to continue doing work
    outside of the MM.

    As I mentioned before, Yes I intend to allow other instructions
    to operate concurrently with MM, but I also expect MM to consume
    all of L1 cache bandwidth. Just like LD L1-L2-miss operates
    concurrently with FDIV.

    For large copies, I could see having the copying done at L2 or
    even L3 with distinct address generation (at least within a page,

    As you get farther out than L1, you end up not having a TLB to
    translate page-crossing addresses.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Sun Oct 20 20:30:45 2024
    On 10/20/2024 5:25 PM, MitchAlsup1 wrote:
    On Sun, 20 Oct 2024 15:57:59 +0000, Paul A. Clayton wrote:

    On 10/16/24 5:14 PM, MitchAlsup1 wrote:
    On Wed, 16 Oct 2024 20:48:39 +0000, Paul A. Clayton wrote:


    Here is a question that I will leave to Mitch:

    Can a MM that has confirmed permissions commit before it has been
    performed such that uncorrectable errors would be recognized not
    on read of the source but on later read of the destination?

    A Memory Move does not <necessarily> read the destination. In
    order to make the data transfers occur in cache line sizes,
    The first and the last line may be read, but the intermediate
    ones are not read (from DRAM) only to be re-written. An
    implementation with byte write enables might not read any
    of the destination lines.

    I was referring to a following instruction reading the
    destination.

    Then there is the issue with uncorrectable errors at the
    receiving cache. The current protocol has the sender (core)
    not release his write buffer until LLC has replied that
    the data arrived without ECC trouble. Thus, the instruction
    causing the latent uncorrectable error is not retired until
    the data has arrived successfully at LLC.

    I was thinking primarily about uncorrectable errors in the
    source. It would be convenient to software for MM to fail early on
    an uncorrectable source error. It would be a little less
    convenient (and possibly more complex in hardware) to generate an
    ECC exception at the end of the MM instruction (or when it
    pauses from a context switch).

    As I describe above, all errors are detected at source read,
    precisely so that they can be detected or recovered.

    With source-signaled errors, MM might be used to scrub memory
    (assuming the microarchitecture did not optimize out copies
    onto self as nops☺).

    Not signaling the error until the destination is read some time
    later prevents software from assuming the copy was correct at the
    time of copying, but allows a copy to commit once all permissions
    have been verified.

    The real question, here, is: if you have corrupted data in memory
    and overwrite it, in its entirety, do you have to detect the error ??
    should you detect the error, or should overwriting the error make
    it "go away" ???

    I could see some wanting to depend on the copy checking data
    validity synchronously, but some might be okay with a quasi-
    synchronous copy that allows the processor to continue doing work
    outside of the MM.

    As I mentioned before, Yes I intend to allow other instructions
    to operate concurrently with MM, but I also expect MM to consume
    all of L1 cache bandwidth. Just like LD L1-L2-miss operates
    concurrently with FDIV.

    For large copies, I could see having the copying done at L2 or
    even L3 with distinct address generation (at least within a page,

    As you get farther out than L1; you end up not having a TLB to
    translate page crossing addresses.

    Yes, but once you translate the starting address of a page, which you
    can tell from the low-order X bits (X depends on page size) being
    zero, you don't need to translate again until the next page crossing.
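
    A sketch of that idea, assuming 4 KiB pages, 64-byte lines, and
    line-aligned operands; translate() and copy_one_line() are hypothetical
    stand-ins for a TLB lookup and the data mover:

    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_SIZE 4096u
    #define LINE_SIZE 64u

    extern uint64_t translate(uint64_t va);                      /* hypothetical */
    extern void copy_one_line(uint64_t src_pa, uint64_t dst_pa); /* hypothetical */

    void copy_lines(uint64_t src_va, uint64_t dst_va, size_t lines)
    {
        uint64_t src_pa = translate(src_va);
        uint64_t dst_pa = translate(dst_va);

        for (size_t i = 0; i < lines; i++) {
            copy_one_line(src_pa, dst_pa);

            src_va += LINE_SIZE;  dst_va += LINE_SIZE;
            src_pa += LINE_SIZE;  dst_pa += LINE_SIZE;

            /* low-order bits wrapped to zero => just crossed into a new page */
            if ((src_va & (PAGE_SIZE - 1)) == 0) src_pa = translate(src_va);
            if ((dst_va & (PAGE_SIZE - 1)) == 0) dst_pa = translate(dst_va);
        }
    }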


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Paul A. Clayton on Mon Oct 21 06:32:52 2024
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    I was thinking primarily about uncorrectable errors in the
    source. It would be convenient to software for MM to fail early on
    an uncorrectable source error. It would be a little less
    convenient (and possibly more complex in hardware) to generate an
    ECC exception at the end of the MM instruction (or when it
    pauses from a context switch).

    Welcome to the DDR5 age (which started in 2021). DDR5 not only has
    on-die ECC, it also has ECS (error check and scrub). From
    <https://in.micron.com/content/dam/micron/global/public/products/white-paper/ddr5-new-features-white-paper.pdf>:

    |An additional feature of the DDR5 SDRAM ECC is the error check and
    |scrub (ECS) function. The ECS function is a read of internal data and
    |the writing back of corrected data if an error occurred. ECS can be
    |used as a manual function initiated by a Multi-Purpose Command (MPC),
    |or the DDR5 SDRAM can run the ECS in automatic mode, where the DRAM
    |schedules and performs the ECS commands as needed to complete a full
    |scrub of the data bits in the array within the recommended 24-hour
    |period. At the completion of a full-array scrub, the DDR5 reports the
    |number of errors that were corrected during the scrub (once the error
    |count exceeds a minimum fail threshold) and reports the row with the
    |highest number of errors, which is also subject to a minimum
    |threshold.

    I am wondering why scrubbing is not performed automatically on
    refresh.

    Even before DDR5, scrubbing is a feature of the memory controller that
    it performs in hardware (i.e., without software having to do something
    in an interrupt handler or some such). Of course the My66000 memory
    controller may be less capable, and leave scrubbing to software; maybe
    also refresh?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Mon Oct 21 12:56:59 2024
    On Mon, 21 Oct 2024 06:32:52 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    I was thinking primarily about uncorrectable errors in the
    source. It would be convenient to software for MM to fail early on
    an uncorrectable source error. It would be a little less
    convenient (and possibly more complex in hardware) to generate an
    ECC exception at the end of the MM instruction (or when it
    pauses from a context switch).

    Welcome to the DDR5 age (which started in 2021). DDR5 not just has
    on-die ECC, it also has ECS (error check and scrub). From <https://in.micron.com/content/dam/micron/global/public/products/white-paper/ddr5-new-features-white-paper.pdf>:

    |An additional feature of the DDR5 SDRAM ECC is the error check and
    |scrub (ECS) function. The ECS function is a read of internal data and
    |the writing back of corrected data if an error occurred. ECS can be
    |used as a manual function initiated by a Multi-Purpose Command (MPC),
    |or the DDR5 SDRAM can run the ECS in automatic mode, where the DRAM
    |schedules and performs the ECS commands as needed to complete a full
    |scrub of the data bits in the array within the recommended 24-hour
    |period. At the completion of a full-array scrub, the DDR5 reports the
    |number of errors that were corrected during the scrub (once the error
    |count exceeds a minimum fail threshold) and reports the row with the
    |highest number of errors, which is also subject to a minimum
    |threshold.

    I am wondering why scrubbing is not performed automatically on
    refresh.


    A typical DDR5 row contains 4096 data bits. That's 32x bigger than the
    internal ECC block (the on-die ECC word covers 128 data bits, and
    4096 / 128 = 32). In order to do the scrub at refresh without a major
    increase in refresh timing, one would need a lot more ECC correction
    logic than is currently present.
    Even with a lot of extra logic there will be some slowdown, likely one
    clock (== 16T == 2.5 to 3.3 ns).

    Even before DDR5, scrubbing is a feature of the memory controller that
    it performs in hardware (i.e., without software having to do something
    in an interrupt handler or some such). Of course the My66000 memory controller may be less capable, and leave scrubbing to software; maybe
    also refresh?


    Hopefully the last part is tongue in cheek.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Tue Oct 22 13:08:38 2024
    MitchAlsup1 wrote:
    On Sun, 20 Oct 2024 15:57:59 +0000, Paul A. Clayton wrote:

    With source-signaled errors, MM might be used to scrub memory
    (assuming the microarchitecture did not optimize out copies
    onto self as nops☺).

    Not signaling the error until the destination is read some time
    later prevents software from assuming the copy was correct at the
    time of copying, but allows a copy to commit once all permissions
    have been verified.

    The real question, here, is: if you have corrupted data in memory
    and overwrite it, in its entirety, do you have to detect the error ??
    should you detect the error, or should overwriting the error make
    it "go away" ???

    IF an error is detected then it should be required to at least be logged.
    But that particular error, no, it should not be required to be detected,
    as that would force an extra read cycle onto every quadword store.

    By logging I mean a fifo buffer that error reports can be dumped into.
    This ensures that the fact an error was detected is not lost.
    This can be set to trigger a high priority interrupt on a number of
    conditions, such as the fifo buffer being 3/4 full, specific kinds of
    errors, or a report having sat in the buffer a while, say 1 sec.
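
    A minimal sketch of such an error-log fifo and its interrupt triggers
    (the entry layout, thresholds, and names are all invented for
    illustration):

    #include <stdbool.h>
    #include <stdint.h>

    #define LOG_ENTRIES 64u                 /* power of two */

    enum err_kind { ERR_CORRECTED, ERR_UNCORRECTED, ERR_POISON_CONSUMED };

    struct err_report {
        enum err_kind kind;
        uint64_t      phys_addr;            /* narrows down to bank/rank/FRU */
        uint64_t      timestamp;            /* for the age-based trigger     */
    };

    struct err_log {
        struct err_report e[LOG_ENTRIES];
        uint32_t head, tail;                /* head == tail means empty      */
    };

    static uint32_t log_count(const struct err_log *l)
    { return (l->tail - l->head) & (LOG_ENTRIES - 1); }

    /* Returns true when a high-priority interrupt should be requested. */
    static bool log_error(struct err_log *l, struct err_report r, uint64_t now_ns)
    {
        if (log_count(l) < LOG_ENTRIES - 1) {
            l->e[l->tail] = r;
            l->tail = (l->tail + 1) & (LOG_ENTRIES - 1);
        }                                   /* else the report is dropped    */

        bool nearly_full = log_count(l) >= (3 * LOG_ENTRIES) / 4;
        bool severe      = r.kind != ERR_CORRECTED;
        bool stale       = log_count(l) != 0 &&
                           now_ns - l->e[l->head].timestamp > 1000000000ull;
        return nearly_full || severe || stale;  /* ~3/4 full, kind, or ~1 s  */
    }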

    Also, that error should not raise an error exception in an application,
    for any number of reasons -- for example, most writes are lazy and could
    complete long after the app that did the store has been switched out.

    But each error must be assessed on an individual basis.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to EricP on Tue Oct 22 18:13:25 2024
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    MitchAlsup1 wrote:
    On Sun, 20 Oct 2024 15:57:59 +0000, Paul A. Clayton wrote:

    With source-signaled errors, MM might be used to scrub memory
    (assuming the microarchitecture did not optimize out copies
    onto self as nops☺).

    Not signaling the error until the destination is read some time
    later prevents software from assuming the copy was correct at the
    time of copying, but allows a copy to commit once all permissions
    have been verified.

    The real question, here, is: if you have corrupted data in memory
    and overwrite it, in its entirety, do you have to detect the error ??
    should you detect the error, or should overwriting the error make
    it "go away" ???

    IF an error is detected then it should be required to at least be logged.

    Even corrected errors must be logged for proper RAS. Undetected errors,
    such as Mitch has described, by definition can't be logged.

    If that corrupted data was read, the read transaction should be tagged
    as poisoned, and _when consumed_[*], an error should be raised
    and/or logged. If it is never consumed, it's an open question whether
    logging is useful or required.

    [*] If it were a speculative read, for example, that was never
    consumed, should it also raise/log an error?

    <snip>

    By logging I mean a fifo buffer that error reports can be dumped into.

    Typical reports should include the bank/RAM location information
    to at least narrow it down to an FRU.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Michael S on Tue Oct 22 18:21:04 2024
    On Mon, 21 Oct 2024 9:56:59 +0000, Michael S wrote:

    On Mon, 21 Oct 2024 06:32:52 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    I was thinking primarily about uncorrectable errors in the
    source. It would be convenient to software for MM to fail early on
    an uncorrectable source error. It would be a little less
    convenient (and possibly more complex in hardware) to generate an
    ECC exception at the end of the MM instruction (or when it
    pauses from a context switch).

    Welcome to the DDR5 age (which started in 2021). DDR5 not just has
    on-die ECC, it also has ECS (error check and scrub). From
    <https://in.micron.com/content/dam/micron/global/public/products/white-paper/ddr5-new-features-white-paper.pdf>:

    |An additional feature of the DDR5 SDRAM ECC is the error check and
    |scrub (ECS) function. The ECS function is a read of internal data and
    |the writing back of corrected data if an error occurred. ECS can be
    |used as a manual function initiated by a Multi-Purpose Command (MPC),
    |or the DDR5 SDRAM can run the ECS in automatic mode, where the DRAM
    |schedules and performs the ECS commands as needed to complete a full
    |scrub of the data bits in the array within the recommended 24-hour
    |period. At the completion of a full-array scrub, the DDR5 reports the
    |number of errors that were corrected during the scrub (once the error
    |count exceeds a minimum fail threshold) and reports the row with the
    |highest number of errors, which is also subject to a minimum
    |threshold.

    I am wondering why scrubbing is not performed automatically on
    refresh.


    Typical DDR5 row contains 4096 data bits. That's 32x bigger than
    internal ECC block. In order to do scrub at refresh without major
    increase in refresh timing one would need a lot more ECC correction
    logic than currently present.

    The standard 64+8 code is SECDED and takes 8 gate delays (5 XOR gates,
    2 NAND gates, 1 XOR gate).
    One can do as many 64+8 error checks as one wants in those 8 gates of
    delay.

    It is only when one wants better than SEC/DED that things get
    interesting.
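
    For readers who want to see what SEC-DED means operationally, here is a
    generic software illustration over a 72-bit word (64 data + 8 check bits)
    using the textbook extended-Hamming construction. It is only meant to
    show the correct/detect outcomes, not the specific code or gate-level
    structure being discussed here.

    #include <stdint.h>
    #include <stdbool.h>

    /* Codeword layout: bit 0 is the overall parity bit; bits 1..71 hold a
     * Hamming(71,64) code with check bits at positions 1,2,4,...,64 and
     * data bits at the remaining (non power-of-two) positions. */
    typedef struct { uint64_t lo, hi; } cw72_t;

    static int  get (const cw72_t *c, int i)
    { return (int)((i < 64 ? c->lo >> i : c->hi >> (i - 64)) & 1); }
    static void flip(cw72_t *c, int i)
    { if (i < 64) c->lo ^= 1ull << i; else c->hi ^= 1ull << (i - 64); }

    static cw72_t secded_encode(uint64_t data)
    {
        cw72_t c = {0, 0};
        for (int i = 3, d = 0; i <= 71 && d < 64; i++) {
            if ((i & (i - 1)) == 0) continue;          /* skip 4,8,...,64 */
            if ((data >> d++) & 1) flip(&c, i);
        }
        int syndrome = 0, overall = 0;
        for (int i = 1; i <= 71; i++)
            if (get(&c, i)) { syndrome ^= i; overall ^= 1; }
        for (int j = 0; j < 7; j++)                    /* set check bits  */
            if ((syndrome >> j) & 1) { flip(&c, 1 << j); overall ^= 1; }
        if (overall) flip(&c, 0);                      /* overall parity  */
        return c;
    }

    enum ecc_result { ECC_OK, ECC_CORRECTED, ECC_UNCORRECTABLE };

    static enum ecc_result secded_check(cw72_t *c)
    {
        int syndrome = 0, overall = 0;
        for (int i = 0; i < 72; i++)
            if (get(c, i)) { overall ^= 1; syndrome ^= i; }
        if (syndrome == 0 && overall == 0) return ECC_OK;
        if (overall == 1) {                /* odd number of flipped bits   */
            flip(c, syndrome);             /* syndrome 0 => parity bit 0   */
            return ECC_CORRECTED;          /* single-bit error corrected   */
        }
        return ECC_UNCORRECTABLE;          /* an even number (>= 2) of errors */
    }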

    Even with a lot of extra logic there will be some slowdown, likely one
    clock (== 16T == 2.5 to 3.3 ns).

    DRAM clock should not be that much slower than CPU clock -- a factor
    of 2-3 is expected (logic delay), putting DRAM clock in the 0.75ns
    range.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Tue Oct 22 18:16:25 2024
    On Tue, 22 Oct 2024 17:08:38 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Sun, 20 Oct 2024 15:57:59 +0000, Paul A. Clayton wrote:

    With source-signaled errors, MM might be used to scrub memory
    (assuming the microarchitecture did not optimize out copies
    onto self as nops☺).

    Not signaling the error until the destination is read some time
    later prevents software from assuming the copy was correct at the
    time of copying, but allows a copy to commit once all permissions
    have been verified.

    The real question, here, is: if you have corrupted data in memory
    and overwrite it, in its entirety, do you have to detect the error ??
    should you detect the error, or should overwriting the error make
    it "go away" ???

    IF an error is detected then it should be required to at least be
    logged. But that particular error, no it should not be required to
    be detected as that would force an extra read cycle onto every
    quadword store.

    This, too, is my interpretation--in a similar way that one can use
    bad-ECC to denote uninitialized data which "goes away" when the
    data gets written, Memory Moving good-data over bad-data makes
    the error "go away".

    By logging I mean a fifo buffer that error reports can be dumped into.

    As I specified, all data errors are available at execute time
    and can be properly trapped with precision.

    This ensures that the fact an error was detected is not lost.
    This can be set to trigger a high priority interrupt on a number of conditions such as fifo buffer 3/4 full, or specific kinds of errors,
    or when a log has been in the buffer a while, say 1 sec.

    Also that error should not raise an error exception in an application
    for any number of reasons, such as most writes are lazy and could be
    long after the app that did the store has been switched out.

    In My 66000 all written data is available prior to retirement, so
    errors can be detected/corrected before the store is retired.

    But each error must be assessed on an individual basis.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Scott Lurndal on Tue Oct 22 14:45:09 2024
    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    MitchAlsup1 wrote:
    On Sun, 20 Oct 2024 15:57:59 +0000, Paul A. Clayton wrote:

    With source-signaled errors, MM might be used to scrub memory
    (assuming the microarchitecture did not optimize out copies
    onto self as nops☺).

    Not signaling the error until the destination is read some time
    later prevents software from assuming the copy was correct at the
    time of copying, but allows a copy to commit once all permissions
    have been verified.
    The real question, here, is: if you have corrupted data in memory
    and overwrite it, in its entirety, do you have to detect the error ??
    should you detect the error, or should overwriting the error make
    it "go away" ???
    IF an error is detected then it should be required to at least be logged.

    Even corrected errors must be logged for proper RAS. Undetected errors,
    such as Mitch has described, by definition can't be logged.

    If that corrupted data was read, the read transaction should be tagged
    as poisoned, and _when consumed_[*], an error should be raised
    and/or logged. If it is never consumed, it's an open question whether logging is useful or required.

    The scenario given was a store that overwrites the erroneous data
    completely. Assuming an 8+64 SECDED ECC, no, it should not be required
    to be detected, and therefore not logged or raised, as that would force
    an unnecessary read cycle on every quadword memory store just to check
    for something that very rarely occurs.

    Compare that to, say, a byte memory store, which is a RMW cycle.
    If the read portion detects an error then it should be logged.
    But again not raised as an error, because stores are mostly asynchronous
    and could complete long after the store instruction.


    [*] If it were a speculative read, for example, that was never
    consumed, should it also raise/log and error?

    I'm drawing a distinction between logging, which is asynchronous,
    and raising some kind of exception, which requires the detection
    be synchronous with the code.

    But even if an error is synchronous, if it is transient *and corrected*,
    whether a memory or a bus error, or even an internal register parity error
    (if it had such checks), I want it logged but not elevated to an exception.

    A speculative read is the same as a scrubber read error detect --
    log but no exception, because it is asynchronous to the code.

    <snip>

    By logging I mean a fifo buffer that error reports can be dumped into.

    Typical reports should include the bank/ram location information
    to at least narrow down to an FRU.

    Yes, then software might take the physical addresses from the disk log
    and backtrack to the physical page frames affected, and if not transient
    then mark those pages to be retired the next time they get recycled.

    Oh, and a pop-up "You have memory errors"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)