• Strange (cosmic ray?) corruption.

    From Tim Woodall@21:1/5 to All on Sun May 25 10:00:01 2025
    Hi All,

    This is the third time I've had the same problem. The first time it
    happened I put it down to a freak cosmic ray event. The second time it
    happened gave me doubts, but I cannot come up with a good alternative
    explanation. Now that it's happened for a third time, I'm hoping that someone
    else can suggest additional logging that might help debug next time or
    even a possible (non-cosmic ray) explanation.

    At the end of this email is the complete log of the failed backup that I
    get emailed.

    The setup: proxy17 is a vm guest running on a xen host. backup17 is
    another vm guest running on the same host. There are many other guests
    running on the same host and all write backups to backup17 but only
    proxy17 has a problem.

    The process:

    dump does a low-level backup of an ext4 partition on proxy17. The backup
    is written to stdout and piped through ssh to backup17, where it's piped
    through tee, which writes it to disk and pipes it on to sha256sum, which
    writes the checksum to a separate file.

    The sha256sum matches the data on the disk so I know there's no disk
    corruption happening.

    sha256sum _.dmp
    f86f80d8b7370fe7e91570f379f4f53212a534f413878d1a4367a04ac031bfcc  _.dmp
    cat _.dmp.sha256sum
    f86f80d8b7370fe7e91570f379f4f53212a534f413878d1a4367a04ac031bfcc  -
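
    (The stored name is "-" because sha256sum read from the tee pipe, so a
    scripted comparison has to strip the filename fields - a rough sketch,
    nothing more:)

      # compare the stored checksum against a freshly computed one
      expected=$(cut -d' ' -f1 _.dmp.sha256sum)
      actual=$(sha256sum < _.dmp | cut -d' ' -f1)
      [ "$expected" = "$actual" ] && echo "checksum OK" || echo "checksum MISMATCH"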

    The data is then read back (this may or may not actually involve reading
    the data back from disk) and sent over ssh to proxy17, where restore
    then verifies that the data in the backup matches the disk image.
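
    In outline (paths abbreviated here; the exact commands are in the cron
    log at the end), the two legs look roughly like this:

      # forward leg: dump on proxy17, store and checksum on backup17
      dump -z9 -1u -f - /dev/vg-proxy17/root-backup \
        | ssh backup@backup17 'tee /dumps/.../_.dmp | sha256sum >/dumps/.../_.dmp.sha256sum'

      # return leg: stream the stored dump back and compare against the mounted snapshot
      ssh backup@backup17 'cat /dumps/.../_.dmp' \
        | restore -C -D /tmp/backup.XXXXXX -f -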

    This process has worked for 20+ years.

    On 2025/04/29 (Tuesday) the restore verify failed with a decompression
    error. I checked the sha256sum and restored the data successfully (no
    decompression error). Because it failed on a log file that had changed
    since the backup ran, I couldn't check that the file it actually failed
    on was identical to what was dumped, but the log looked normal, with no
    visible corruption. I put this down to a cosmic ray event flipping a bit
    in memory while the restore was running.

    Then exactly the same thing happened again on 2025/05/04 (Sunday). This
    time the file it aborted on due to the decompression error was a file in
    sbin, and I could confirm that the restore (which completed with no
    decompression errors) restored the file to a state identical to the one
    that was backed up.

    It happened for a third time this morning, again Sunday. Sundays tend to
    be bigger backups and I updated the kernel yesterday so the compressed
    backup is around 180MB today.

    Both backup17 and proxy17 are xen guests running on the same xen host. I
    back up about 15 guests in the same way on the same xen host, none of
    the others of which has ever had this failure. I also back up another 25
    or so machines: 15 or so are on the same site but otherwise completely
    independent (they share network switches etc., but this backup isn't
    going over the network), and another 10 or so are in a different site
    where all the backups go over the network to the backup host.

    Nothing else has failed, no unexplained crashes, no kernel faults, no
    disk corruption (that I've noticed). Have I freakishly managed to be hit
    by a similar cosmic ray event three times(!), or is there something else
    that could have given the same symptoms?

    Below I include the complete log of today's failure - there is no
    more data and I cannot reproduce this at will to test things - so I'm
    looking for ideas on where I can add logging to try and track this down
    should it happen again. Is there anything obvious that could give these
    symptoms? (They were all essentially identical logs.) The problem must
    surely be on the "return" path: if the data on disk were corrupted I'd
    be able to reproduce the decompression error. So I don't see any good
    reason to do a sha256sum on the data before I send it over the ssh link,
    but can anyone else think of a reason?
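
    For what it's worth, if I do add logging on the return path, the obvious
    place would be to checksum the stream again on proxy17, just before it
    is fed into restore, so a failure can be localised - an untested sketch
    (bash, because of the process substitution; the extra output path is
    made up):

      ssh -e none -i /root/.ssh/id_rsa_backup backup@backup17 \
          'cat /dumps/proxy17.20250525.1/_.dmp' \
        | tee >(sha256sum >/tmp/_.dmp.return.sha256sum) \
        | /sbin/restore -C -D /tmp/backup.TLzpiF2wx5 -f -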

    The dump is OK and can be restored:
    # restore -t -f _.dmp >/dev/null && echo $?
    Dump tape is compressed.

    0

    The file is OK, can be restored, and matches the original:
    # restore -i -f _.dmp
    Dump tape is compressed.
    restore > add usr/lib/modules/6.1.0-37-amd64/kernel/drivers/hid/hid-sony.ko
    restore > extract
    You have not read any volumes yet.
    Unless you know which volume your file(s) are on you should start
    with the last volume and work towards the first.
    Specify next volume # (none if no more volumes): 1
    set owner/mode/attributes for '.'? [yn] n
    restore > quit

    root@backup17:# sha256sum ./usr/lib/modules/6.1.0-37-amd64/kernel/drivers/hid/hid-sony.ko
    fad808d9b24d457c2235b4e96e7b50872c0ff2b45bb3c2458034b074f9f99997 ./usr/lib/modules/6.1.0-37-amd64/kernel/drivers/hid/hid-sony.ko

    root@proxy17:# sha256sum /usr/lib/modules/6.1.0-37-amd64/kernel/drivers/hid/hid-sony.ko
    fad808d9b24d457c2235b4e96e7b50872c0ff2b45bb3c2458034b074f9f99997 /usr/lib/modules/6.1.0-37-amd64/kernel/drivers/hid/hid-sony.ko

    And, FWIW, also matches a completely different machine:
    tim@dirac:~ (none)$ sha256sum /usr/lib/modules/6.1.0-37-amd64/kernel/drivers/hid/hid-sony.ko
    fad808d9b24d457c2235b4e96e7b50872c0ff2b45bb3c2458034b074f9f99997 /usr/lib/modules/6.1.0-37-amd64/kernel/drivers/hid/hid-sony.ko

    Is the only possible explanation that there's a bug in restore and
    something with this particular backup is tickling it once in a blue
    moon? There isn't an obvious link between an update and this starting
    to happen; for reference, the recent upgrades of the dump package were:

    Start-Date: 2024-12-21 06:25:54
    Commandline: apt-get -y upgrade
    Upgrade: dump:amd64 (0.4b49-1+tjw12r4, 0.4b49-1+tjw12r5)
    End-Date: 2024-12-21 06:25:59

    Start-Date: 2024-12-30 06:26:01
    Commandline: apt-get -y upgrade
    Upgrade: dump:amd64 (0.4b49-1+tjw12r5, 0.4b49-1+tjw12r6)
    End-Date: 2024-12-30 06:26:07

    Start-Date: 2025-03-24 06:25:51
    Commandline: apt-get -y upgrade
    Upgrade: dump:amd64 (0.4b49-1+tjw12r6, 0.4b50-1+tjw12r1), local-sendmail-config:amd64 (12.1+tjw12r1, 12.2+tjw12r1)
    End-Date: 2025-03-24 06:26:12

    Start-Date: 2025-03-26 06:25:52
    Commandline: apt-get -y upgrade
    Upgrade: dump:amd64 (0.4b50-1+tjw12r1, 0.4b51-1+tjw12r1)
    End-Date: 2025-03-26 06:26:02

    Start-Date: 2025-05-02 06:26:06
    Commandline: apt-get -y upgrade
    Upgrade: dump:amd64 (0.4b51-1+tjw12r1, 0.4b52-1+tjw12r2)
    End-Date: 2025-05-02 06:26:16

    Start-Date: 2025-05-04 04:51:56
    Commandline: apt-get -y upgrade
    Upgrade: dump:amd64 (0.4b52-1+tjw12r2, 0.4b52-1+tjw12r3)
    End-Date: 2025-05-04 04:52:03

    Tim.

    p.s. the script renamed the dump to _.dmp.bad; I've renamed it back, as
    there's nothing wrong with it, which is why the tests above don't look
    like they're using the same file.

    Date: Sun, 25 May 2025 02:44:15 GMT
    From: Cron Daemon <root@proxy17.home.woodall.me.uk>
    To: root@proxy17.home.woodall.me.uk
    Subject: Cron <root@proxy17> /usr/sbin/backup


    /sbin/lvcreate -A n -l52 -s -nroot-backup /dev/vg-proxy17/root
    Logical volume "root-backup" created.
    WARNING: This metadata update is NOT backed up.

    /sbin/e2fsck -p /dev/vg-proxy17/root-backup
    /dev/vg-proxy17/root-backup: Clearing orphaned inode 39474 (uid=0, gid=0, mode=0100600, size=0)
    /dev/vg-proxy17/root-backup: clean, 27942/126208 files, 294517/524288 blocks

    ssh -e none -i /root/.ssh/id_rsa_backup backup@backup17 'mkdir -p /dumps/proxy17.20250525.1'

    /sbin/dump -z9 -1u -f - /dev/vg-proxy17/root-backup | ssh -e none -i /root/.ssh/id_rsa_backup backup@backup17 '/usr/bin/tee /dumps/proxy17.20250525.1/_.dmp | /usr/bin/sha256sum >/dumps/proxy17.20250525.1/_.dmp.sha256sum'
    DUMP: Date of this level 1 dump: Sun May 25 02:32:07 2025
    DUMP: Date of last level 0 dump: Sun May 4 02:32:12 2025
    DUMP: Dumping /dev/vg-proxy17/root-backup (an unlisted file system) to standard output
    DUMP: Label: none
    DUMP: Writing 10 Kilobyte records
    DUMP: Compressing output at transformation level 9 (zlib)
    DUMP: mapping (Pass I) [regular files]
    DUMP: mapping (Pass II) [directories]
    DUMP: estimated 586849 blocks.
    DUMP: Volume 1 started with block 1 at: Sun May 25 02:32:12 2025
    DUMP: dumping (Pass III) [directories]
    DUMP: dumping (Pass IV) [regular files]
    DUMP: 64.64% done at 1264 kB/s, finished in 0:02
    DUMP: Volume 1 completed at: Sun May 25 02:39:02 2025
    DUMP: Volume 1 took 0:06:50
    DUMP: Volume 1 transfer rate: 422 kB/s
    DUMP: Volume 1 583700kB uncompressed, 173243kB compressed, 3.370:1
    DUMP: 583700 blocks (570.02MB)
    DUMP: finished in 410 seconds, throughput 1423 kBytes/sec
    DUMP: Date of this level 1 dump: Sun May 25 02:32:07 2025
    DUMP: Date this dump completed: Sun May 25 02:39:02 2025
    DUMP: Average transfer rate: 422 kB/s
    DUMP: Wrote 583700kB uncompressed, 173243kB compressed, 3.370:1
    DUMP: DUMP IS DONE

    mount -o ro /dev/vg-proxy17/root-backup /tmp/backup.TLzpiF2wx5

    ssh -e none -i /root/.ssh/id_rsa_backup backup@backup17 'cat /dumps/proxy17.20250525.1/_.dmp' | /sbin/restore -C -D /tmp/backup.TLzpiF2wx5 -f -
    /sbin/restore: cannot open /dev/tty: No such device or address
    Dump tape is compressed.
    decompression error, block 207131: data error
    Tape decompression error while restoring ./usr/lib/modules/6.1.0-37-amd64/kernel/drivers/hid/hid-sony.ko
    continue? [yn]
    Dump date: Sun May 25 02:32:07 2025
    Dumped from: Sun May 4 02:32:12 2025
    Level 1 dump of an unlisted file system on proxy17.home.woodall.me.uk:/dev/vg-proxy17/root-backup
    Label: none
    filesys = /tmp/backup.TLzpiF2wx5

    umount /tmp/backup.TLzpiF2wx5

    ssh -e none -i /root/.ssh/id_rsa_backup backup@backup17 'mv /dumps/proxy17.20250525.1/_.dmp /dumps/proxy17.20250525.1/_.dmp.bad'

    echo 'Verify of /dev/vg-proxy17/root-backup failed'
    Verify of /dev/vg-proxy17/root-backup failed

    /sbin/lvremove -A n -f /dev/vg-proxy17/root-backup
    Logical volume "root-backup" successfully removed.
    WARNING: This metadata update is NOT backed up.

    mount | ssh -e none -i /root/.ssh/id_rsa_backup backup@backup17 'cat >/dumps/proxy17.20250525.1/mount.log'

    df | ssh -e none -i /root/.ssh/id_rsa_backup backup@backup17 'cat >/dumps/proxy17.20250525.1/df'

    fdisk -l | ssh -e none -i /root/.ssh/id_rsa_backup backup@backup17 'cat >/dumps/proxy17.20250525.1/fdisk'

    cat /var/lib/dumpdates | ssh -e none -i /root/.ssh/id_rsa_backup backup@backup17 'cat >/dumps/proxy17.20250525.1/dumpdates'

    cat /tmp/backup.log.7uH9zDqOIo | ssh -e none -i /root/.ssh/id_rsa_backup backup@backup17 'cat >/dumps/proxy17.20250525.1/_.log'

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Christensen@21:1/5 to Tim Woodall on Sun May 25 21:50:01 2025
    On 5/25/25 00:58, Tim Woodall wrote:
    [Tim's original message and backup log quoted in full; snipped]


    Please run and post:

    # cat /etc/debian_version ; uname -a


    Does your computer have ECC memory?


    Are you using integrity assured filesystems (e.g. btrfs or ZFS)?
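
    (Those verify data checksums on read, and a scrub can check everything,
    e.g.:)

      btrfs scrub start -B /mountpoint        # btrfs: scrub and wait for completion
      zpool scrub tank && zpool status tank   # ZFS: scrub a pool, then check the results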


    Can you reduce your process to a minimal Bourne shell script that exits
    with 0 on success and exits with 1 on error that readers can run on
    their computers to debug the issue?
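
    Even a rough skeleton along these lines would be a starting point
    (untested; the device, host, and paths are placeholders):

      #!/bin/sh
      # Placeholders - adjust DEV, HOST and DIR before running.
      DEV=/dev/vg-proxy17/root-backup
      HOST=backup@backup17
      DIR=/dumps/test.$$
      MNT=$(mktemp -d) || exit 1

      ssh "$HOST" "mkdir -p $DIR" || exit 1
      # dump the filesystem, storing and checksumming it on the backup host
      dump -z9 -0 -f - "$DEV" \
        | ssh "$HOST" "tee $DIR/_.dmp | sha256sum > $DIR/_.dmp.sha256sum" || exit 1
      # read it back and compare against the (read-only mounted) source
      mount -o ro "$DEV" "$MNT" || exit 1
      if ssh "$HOST" "cat $DIR/_.dmp" | restore -C -D "$MNT" -f -; then
          status=0
      else
          status=1
      fi
      umount "$MNT"
      exit $status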


    David

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Gareth Evans@21:1/5 to All on Mon May 26 01:30:02 2025
    Have you over/under-clocked or otherwise adjusted your CPU settings?

    This "solved" issue which seems to be similar is put down to processor core instability under certain conditions:

    https://bbs.archlinux.org/viewtopic.php?id=290093

    As the last comment mentions, I was also wondering about the possibility of thermal issues.

    Might a smartmon or memtest test be worthwhile?
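
    e.g. (device name is just an example; memtest86+ needs a reboot, but
    memtester can run from a live system):

      smartctl -a /dev/sda     # SMART attributes and error log for the disk
      memtester 1024M 3        # userspace RAM test: 3 passes over 1 GiB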

    You don't seem to be dumping from read-only snapshots afaics, but that these particular files might be changing between checksum and compression (or whichever comes first) seems an unlikely spanner in the works.

    HTH
    Gareth

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Gareth Evans@21:1/5 to Gareth Evans on Mon May 26 07:30:01 2025
    On Mon 26/05/2025 at 00:09, Gareth Evans <donotspam@fastmail.fm> wrote:
    You don't seem to be dumping from read-only snapshots afaics, but that
    these particular files might be changing between checksum and [de-/]compression (or whichever comes first) seems an unlikely spanner in the works.

    FTFM
    G

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Woodall@21:1/5 to Gareth Evans on Mon May 26 10:50:01 2025
    On Mon, 26 May 2025, Gareth Evans wrote:

    > Have you over/under-clocked or otherwise adjusted your CPU settings?

    No, nothing changed.

    > This "solved" issue which seems to be similar is put down to processor
    > core instability under certain conditions:
    >
    > https://bbs.archlinux.org/viewtopic.php?id=290093

    It's certainly a possibility, but it seems bizarre that it's only
    affecting one VM out of 13.

    It did start when the weather got warmer.

    > As the last comment mentions, I was also wondering about the
    > possibility of thermal issues.
    >
    > Might a smartmon or memtest test be worthwhile?
    >
    > You don't seem to be dumping from read-only snapshots afaics, but that
    > these particular files might be changing between checksum and
    > compression (or whichever comes first) seems an unlikely spanner in
    > the works.

    The snapshot is not readonly (deliberately, so I can run fsck before
    dumping), but it's not mounted while dumping, only while verifying.
    Changing files would cause a verification error, not a decompression
    error, though.

    Back in the olden days before rw snapshots were possible (or properly
    reliable, IIRC until about 2010), I did occasionally get verification
    errors that I put down to the snapshot being inconsistent in some way
    despite a sync. I'm not exactly sure what journal replay on a ro
    snapshot implies. Since I started using rw snapshots and fsck before
    dumping, verification errors are so rare I don't recall the last time I
    saw one other than this issue. (It does sometimes happen, but mostly
    it's a bug in dump or an ext4 feature I've not tested properly.)

    > HTH
    > Gareth



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Gareth Evans@21:1/5 to All on Tue May 27 21:00:01 2025
    On 26 May 2025, at 09:47, Tim Woodall <debianuser@woodall.me.uk> wrote:

    [Tim's reply quoted in full; snipped]

    Yes, the same VM and no other issues does seem odd. I would probably try
    to rule out all the other causes I could - SMART read errors/remappings(?) -
    though memory testing is a pain…

    Curious.
    G

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)