• Re: [gentoo-user] Another hard drive failure

    From Michael@21:1/5 to All on Sat Jun 7 23:42:11 2025
    On Saturday, 7 June 2025 19:12:14 British Summer Time Dale wrote:
    Michael wrote:
    Second drive gone bad within a few weeks, but it is a 2.5" HDD this time.
    I ran a couple of smartctl tests and sector 33428384 was reported to have
    a read failure; e.g.:
    ...
    # 5 Extended offline Completed: read failure 90% 1351 33428384
    # 6 Extended offline Completed: read failure 90% 1350 33428384
    # 7 Extended offline Completed: read failure 90% 1350 33428384
    # 8 Short offline Completed without error 00% 1350 -
    # 9 Conveyance offline Completed without error 00% 1350 -
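
    A minimal sketch for pulling the failing LBA out of a self-test log like the one above, so it can be fed to a per-sector check. It assumes the log comes from 'smartctl -l selftest' and that a failed entry ends with its LBA_of_first_error column; failing_lbas is a made-up helper name, not part of smartmontools.

```shell
# failing_lbas: read a SMART self-test log on stdin and print each distinct
# LBA reported by a self-test that ended in "read failure".
# Hypothetical helper; it only parses text and never touches the drive.
failing_lbas() {
    # Log entries start with "# <n>"; the last field of a failed entry
    # is the LBA of the first error.
    awk '/^# *[0-9]+ / && /read failure/ { print $NF }' | sort -un
}
```

    Typical use would be: smartctl -l selftest /dev/sdX | failing_lbas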

    Then I took the drive out of the PC and ran some tests again in a USB docking station. First, 'hdparm --read-sector 33428384' returned
    success. Then a short offline test returned "Completed without error",
    followed by a long test with the same result. Interestingly, smartctl now shows:

    "3 of 3 failed self-tests are outdated by newer successful extended offline self-test # 1"

    However, the smartctl Thresholds table is warning "FAILING_NOW":

    184 End-to-End_Error 0x0032 099 099 099 Old_age Always FAILING_NOW 1
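
    As a rough sketch, attributes in this state can be picked out of the table by looking at the WHEN_FAILED column. smartctl marks an attribute as failed when its normalized VALUE is at or below THRESH, which is the case here (099 against 099). The helper name failed_attributes is invented for illustration and assumes the usual ten-column 'smartctl -A' layout.

```shell
# failed_attributes: read a SMART attribute table on stdin and print every
# attribute whose WHEN_FAILED column (field 9) is not "-".
# Hypothetical helper; assumes the standard ten-column table layout.
failed_attributes() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $9 != "-" {
        print $1, $2, "value=" $4, "thresh=" $6, $9
    }'
}
```

    Typical use would be: smartctl -A /dev/sdX | failed_attributes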

    I'm running a reading test on it now to see if it reports any errors.
    Given the results so far, is it worth keeping it around? Perhaps for duplicate, non-mission-critical data?

    I found this command long ago and from what I read, if this reports
    zeros, it is considered OK. I'm not familiar with the 184 end-to-end
    you show tho. May have to look into that.

    The 184 End-to-End-Error SMART attribute was developed by Hewlett Packard to check if corruption took place as data was transferred from/to the buffer of the drive. Some drives report it and this one does. I suppose if the RAM
    data buffer is a bit unreliable, this kind of error will come and go. Bearing in mind this particular drive was in a laptop, overheating may have also contributed.


    smartctl -a /dev/sdX | egrep '(^ID|Reallocated_Sector_Ct|Reported_Uncorrectable_Er|Command_Timeout|Current_Pending_Sector|Offline_Uncorrectable)'

    This is what I got, which shows 8 re-allocated sectors:

    ~ # smartctl -a /dev/sdb | egrep '(^ID|Reallocated_Sector_Ct|Reported_Uncorrectable_Er|Command_Timeout|Current_Pending_Sector|Offline_Uncorrectable)'
    ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
      5 Reallocated_Sector_Ct   0x0033 100   100   010    Pre-fail Always  -           8
    188 Command_Timeout         0x0032 100   099   000    Old_age  Always  -           7
    197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always  -           0
    198 Offline_Uncorrectable   0x0010 100   100   000    Old_age  Offline -           0
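
    The "zeros are OK" rule of thumb can be sketched as a small filter that prints only the watched attributes with a non-zero raw value. The attribute names are the ones from the egrep pattern above; nonzero_raw is a hypothetical helper, and it assumes the raw value is the tenth column and starts with a plain number.

```shell
# nonzero_raw: read a SMART attribute table on stdin and print NAME=RAW for
# each watched attribute whose raw value is greater than zero.
# Hypothetical helper; attribute names are taken from the egrep pattern.
nonzero_raw() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Reported_Uncorrectable_Er|Command_Timeout|Current_Pending_Sector|Offline_Uncorrectable)$/ && $10 + 0 > 0 {
        print $2 "=" $10
    }'
}
```

    Typical use would be: smartctl -A /dev/sdX | nonzero_raw. No output means all the watched counters are zero.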


    When I have a drive that has some sort of errors, I try not to use it
    for anything important. I did have one drive that reported some
    corrected errors, number 5 in the list, that I used but only after I ran shred on it and the error count didn't change. As we know, bad spots
    can be marked bad and the drive knows not to use those. If you do the
    same and the error count goes up, I'd ditch the drive. If it is stable,
    then maybe use it to play with or something.
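
    The before/after comparison described here could look roughly like this. It is only a sketch: raw_value and count_grew are invented helper names, attribute 5 is Reallocated_Sector_Ct as in the table above, and the destructive full-drive write is left commented out.

```shell
# raw_value ID: print the RAW_VALUE (field 10) of SMART attribute ID from an
# attribute table read on stdin. Prints nothing if the ID is absent.
raw_value() {
    awk -v id="$1" '$1 == id { print $10 }'
}

# count_grew BEFORE AFTER: succeed (exit status 0) only if AFTER > BEFORE.
count_grew() {
    [ "$2" -gt "$1" ]
}

# Intended workflow, commented out because it wipes the whole drive:
#   before=$(smartctl -A /dev/sdX | raw_value 5)
#   shred -n 1 /dev/sdX
#   after=$(smartctl -A /dev/sdX | raw_value 5)
#   count_grew "$before" "$after" && echo "reallocated sectors increased"
```

    If count_grew succeeds after a full write pass, that matches the "error count goes up" case where the drive is best ditched.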

    I just finished writing to it and no more errors showed up. I'll format it with a slow read-write test for bad blocks, then recheck the smartctl attributes to see if more errors show up. Then I could play with it using some temporary media files, which I don't mind losing, and see how it behaves.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael@21:1/5 to All on Mon Jun 9 13:19:26 2025
    On Sunday, 8 June 2025 16:09:19 British Summer Time Dale wrote:
    Michael wrote:

    The 184 End-to-End-Error SMART attribute was developed by Hewlett Packard to check if corruption took place as data was transferred from/to the buffer of the drive. Some drives report it and this one does. I suppose if the RAM data buffer is a bit unreliable, this kind of error will come and go. Bearing in mind this particular drive was in a laptop,
    overheating may have also contributed.

    I've never seen that one before. It doesn't show up on the drives I
    have but another manufacturer may use that. May even be helpful.

    I can see it here on two different Seagate drives and the one I mentioned in this thread reports it is failing according to the End-to-End-Error SMART attribute. Various Western Digital drives and a 1.8" Toshiba drive don't have it.


    I just finished writing to it and no more errors showed up. I'll format it with a slow read-write test for bad blocks, then recheck the smartctl attributes to see if more errors show up. Then I could play with it using some temporary media files, which I don't mind losing, and see how it behaves.

    The drive I mentioned had some small number, single digit number, of
    those too. It wasn't much but it did have to correct an error. I'd do
    some serious testing and see if it is stable for sure. I guess
    badblocks is a good way to go but I've been known to use shred.
    Basically, anything that writes to the whole drive should catch errors.
    If you can write to it twice and it stays the same, it might be OK.

    SMART may report read errors and/or write errors. I have found the read
    errors can be somewhat spurious. SMART says it can't read some sector, but
    either dd or hdparm can read it, and the SMART read error count does not increase when you try.

    Sometimes it is a hard error and fsck will fail to reallocate it. In this
    case I try dd or 'hdparm --write-sector' to force a reallocation, and/or I try formatting with the -cc option. If this fails too, then the drive is retired from active duty on ill health grounds. :p
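
    A dry-run sketch of the per-sector commands mentioned above. It only prints the command lines so they can be reviewed first; sector_commands is a hypothetical helper, and the 512-byte sector size used for the dd variant is an assumption that should be checked against the drive's logical sector size.

```shell
# sector_commands LBA DEVICE [SECTOR_SIZE]: print, but do not run, the
# commands for probing one sector and, as a last resort, overwriting it.
# Hypothetical helper; SECTOR_SIZE defaults to an assumed 512 bytes.
sector_commands() {
    lba=$1
    dev=$2
    size=${3:-512}
    # Non-destructive reads of the suspect sector.
    echo "hdparm --read-sector $lba $dev"
    echo "dd if=$dev of=/dev/null bs=$size skip=$lba count=1 iflag=direct"
    # Destructive: overwrites the sector, which may trigger a reallocation.
    echo "hdparm --write-sector $lba --yes-i-know-what-i-am-doing $dev"
}
```

    For example, 'sector_commands 33428384 /dev/sdX' prints the three command lines for the sector discussed earlier in the thread.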


    Basically, test it well until you are comfy that it is stable and can be
    at least fairly trusted. I might also add, I stuck a post-it note on it
    that it has bad sectors/blocks. So I don't forget. :/

    Post back with what you get. Curious to see if it is stable or not. If
    so, maybe a drive with a small number of these errors and is stable is
    safe to use. Maybe.

    Dale

    :-) :-)

    As mentioned above I finished writing to it with dd - no errors.

    I also completed another extended smartctl test on it, still no errors. The End-to-End_Error attribute continues to show "FAILING_NOW". I think this is reasonable. If the drive buffer failed a CRC test once, for whatever unknown reason, it may do so again in the future. I noticed the temperature of this drive while spinning on the USB docking station is on the high side, compared with other drives. It may have overheated in a laptop and this is why the error occurred - who knows? Anyway, it seems stable enough for now to try
    with real data.

    Interestingly, I've a WD 2.5" 1TB drive which a couple of years ago had reported errors on a partition I was using for /var. An emerge had failed and smartctl confirmed at the time this drive was unable to reallocate the faulty sectors in the partition. I thought I should do something about it, while I was busying myself on all these drive rescue activities recently. The problem with this drive was widespread across many sectors. Overwriting a sector here and another there did not allow me to complete an extended smartctl test without more reallocation errors. Thankfully, all the errors were taking
    place on the same partition. So I overwrote the partition with dd, then deleted it along with two more partitions, repartitioned and eventually reformatted it with -cc. Following this operation, smartctl reported no more errors. I've reinstalled Gentoo and, to my surprise, it has been working with no problems so far. Since this WD drive lives in a laptop, I also took the opportunity to configure fscrypt for /home, plus cryptsetup for swap while at it. Time will tell how long it may survive.