Malfunctioning hard drive. What can be interpreted from the “smartctl -a” output?

Question

I was asked to replace a malfunctioning hard drive, which was used as a recording device in a TV setup.

(2.5" HDD, simple USB interface, in a two part plastic casing)

As it has a USB type A cable out, it was simple to just plug it into the Ubuntu laptop.

Added Note: The USB connection seemingly worked perfectly, the device appears to have problems with the storage media only. Therefore it was possible to see...

The largest files present appear to be 200 MB chunks of encrypted stream data. The remaining files are most likely metadata of various kinds; I won't even try to decipher any of it. The recordings are a random set of TV programs and cover 7.5% of the space.

"Disks" says:

Model: TOSHIBA MQ01ABD050V -63 (AX0N1Q)
Partitioning: 500 GB, Master Boot Record, 17MB Free, then a 500GB ext4 v1.0 partition.
Assessment: Disk is OK, 16376 bad sectors (29° C / 84° F)

Is there more to interpret from this than just "multiple escalating read errors"?

I suspect the "driving factor" for the breakage is the small (even TINY) fully enclosed casing with no venting, causing heat problems.

It might also have been exposed to shock, as the device was standing beside a TV for two(?) years. When cleaning dust, ooops! Dropped it.

$ sudo smartctl -a /dev/sdb
[sudo] password for hannu: 
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.13.0-37-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     TOSHIBA MQ01ABD050V -63
Serial Number:    885YC2J1TF6G
LU WWN Device Id: 5 000039 8b43822ba
Firmware Version: AX0N1Q
User Capacity:    500 107 862 016 bytes [500 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s (current: 1.5 Gb/s)
Local Time is:    Wed Mar 30 19:53:04 2022 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (  120) seconds.
Offline data collection
capabilities:            (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    ( 115) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   084   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       1125
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       200
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       10288
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   033   033   000    Old_age   Always       -       26898
 10 Spin_Retry_Count        0x0033   103   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       200
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       3
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       185
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       200
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       27 (Min/Max 22/58)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       854
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       6088
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       0
222 Loaded_Hours            0x0032   033   033   000    Old_age   Always       -       26898
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       178
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 467 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 467 occurred at disk power-on lifetime: 26805 hours (1116 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 b8 f0 73 13 4d  Error: UNC 184 sectors at LBA = 0x0d1373f0 = 219378672

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 d5 08 a0 73 13 40 00      06:40:51.442  READ DMA EXT
  25 d5 c0 e8 72 13 40 00      06:40:51.333  READ DMA EXT
  25 d5 98 58 71 13 40 00      06:40:51.137  READ DMA EXT
  25 d5 88 d8 6f 13 40 00      06:40:50.928  READ DMA EXT
  25 d5 d0 10 6e 13 40 00      06:40:50.728  READ DMA EXT

Error 466 occurred at disk power-on lifetime: 26805 hours (1116 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 18 e0 74 13 4d  Error: UNC 24 sectors at LBA = 0x0d1374e0 = 219378912

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 d5 18 e0 74 13 40 00      06:38:34.673  READ DMA EXT
  25 d5 48 a0 73 13 40 00      06:38:31.303  READ DMA EXT
  25 d5 c0 e8 72 13 40 00      06:38:31.292  READ DMA EXT
  25 d5 40 b0 71 13 40 00      06:38:31.083  READ DMA EXT
  25 d5 30 88 6f 13 40 00      06:38:30.890  READ DMA EXT

Error 465 occurred at disk power-on lifetime: 26805 hours (1116 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 f8 f0 73 13 4d  Error: UNC 248 sectors at LBA = 0x0d1373f0 = 219378672

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 d5 48 a0 73 13 40 00      06:38:31.303  READ DMA EXT
  25 d5 c0 e8 72 13 40 00      06:38:31.292  READ DMA EXT
  25 d5 40 b0 71 13 40 00      06:38:31.083  READ DMA EXT
  25 d5 30 88 6f 13 40 00      06:38:30.890  READ DMA EXT
  25 d5 b8 d8 6d 13 40 00      06:38:30.688  READ DMA EXT

Error 464 occurred at disk power-on lifetime: 26798 hours (1116 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 06 c2 76 06 40  Error: UNC 6 sectors at LBA = 0x000676c2 = 423618

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 06 c2 76 06 40 00      00:00:20.982  READ DMA EXT
  25 00 01 c1 76 06 40 00      00:00:17.605  READ DMA EXT
  25 00 01 c0 76 06 40 00      00:00:14.221  READ DMA EXT
  25 00 20 c0 76 06 40 00      00:00:10.840  READ DMA EXT
  25 00 08 b8 76 06 40 00      00:00:10.839  READ DMA EXT

Error 463 occurred at disk power-on lifetime: 26798 hours (1116 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 01 c1 76 06 40  Error: UNC 1 sectors at LBA = 0x000676c1 = 423617

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 01 c1 76 06 40 00      00:00:17.605  READ DMA EXT
  25 00 01 c0 76 06 40 00      00:00:14.221  READ DMA EXT
  25 00 20 c0 76 06 40 00      00:00:10.840  READ DMA EXT
  25 00 08 b8 76 06 40 00      00:00:10.839  READ DMA EXT
  25 00 20 90 76 06 40 00      00:00:10.838  READ DMA EXT

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


$ smartctl -P showall /dev/sdb1
No presets are defined for this drive.  Its identity strings:
MODEL:    /dev/sdb1
FIRMWARE: (any)
do not match any of the known regular expressions.
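
As a reading aid for the error log above: the LBA that smartctl prints for each error appears to be assembled from the logged registers, with the low four bits of DH as the top nibble, followed by CH, CL and SN. A minimal Python sketch, assuming that 28-bit layout (the function name `lba_from_registers` is made up for illustration), reproduces the two distinct regions seen in the log:

```python
def lba_from_registers(sn, cl, ch, dh):
    """Combine ATA error-log registers into a 28-bit LBA.

    Assumed layout: DH low nibble = bits 27..24, CH = bits 23..16,
    CL = bits 15..8, SN = bits 7..0.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register bytes copied from the "After command completion" rows above.
error_467 = lba_from_registers(sn=0xF0, cl=0x73, ch=0x13, dh=0x4D)
error_464 = lba_from_registers(sn=0xC2, cl=0x76, ch=0x06, dh=0x40)

print(error_467)  # same value as "LBA = 0x0d1373f0 = 219378672"
print(error_464)  # same value as "LBA = 0x000676c2 = 423618"
```

With the 512-byte logical sectors reported in the information section, these two regions sit roughly 112 GB and 217 MB from the start of the disk, so the logged errors are not confined to one spot.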

It's not at all clear what you are asking here. It's possible the drive is faulty, but it could also be the enclosure or the connection to the PC/PVR/TV. If you don't care whether data gets lost, run a SMART long test and see if it passes. That could help point towards or away from a disk failure, where "disk" means the actual SATA disk in the enclosure, excluding the USB-to-SATA interface. — davidgo, Mar 30 '22 at 18:30
“It might also have been exposed to shock, as the device was standing beside a TV for two(?) years. When cleaning dust, ooops! Dropped it.” It could be that the USB to SATA connection is broken, or that the drive itself was damaged in the fall. The way to check the USB to SATA connection is to use another enclosure or another USB to SATA connection elsewhere. If it mounts there, all good. But, sorry to say, based on what you posted my guess is the drive itself is damaged. — Giacomo1968, Mar 30 '22 at 18:58
Temperature actually doesn’t look too bad. The drive is done for, however. — Daniel B, Mar 31 '22 at 10:46
@Hannu Three days prior to the S.M.A.R.T. pull, the drive experienced hardware damage from an impact, which created at least five hardware errors [463 - 467] due to platter damage. This in and of itself doesn't mean the drive should be tossed, since it can still operate fine provided the self-tests finish with no errors _(the damaged blocks will be marked as bad and not used again - you can force this with a bad block scan)_: `smartctl -t short /dev/sdb; sleep 121; smartctl -a /dev/sdb; smartctl -t long /dev/sdb`. The long test will take hours - to see progress: `smartctl -a /dev/sdb | grep progress` — JW0914, Mar 31 '22 at 12:35
@Hannu _(cont'd...)_ If the short and long self-tests result with no errors, it would be recommended to self-test at a minimum of once weekly [long: Sun], preferably twice a week spaced three days apart [short: Wed; long: Sun], to ensure there isn't additional hardware damage. To prevent further impact damage, Velcro the drive to a stationary surface that's not the bottom of a shelf _(e.g. side of the shelf, top of shelf mounting it upside down, the backside of the entertainment center, etc. - if Velcro adhesive has an issue sticking, use VHB tape between it and the surface)_ — JW0914, Mar 31 '22 at 12:44
For in depth understanding; do you mind telling how you can tell about this? :-) It isn't clear to me from reading the provided data. — Hannu, Mar 31 '22 at 13:10
@Hannu In between the numerical values section and the self-test log section is the hardware error section, which lists the error number _(begins at 1 when a hardware error occurs)_ when platter damage occurs _(drive has experienced 467 hardware errors)_, along with the powered on hours time at which the error occurred _(listed as days + hours - take the powered on hours [`26898`] and divide by 24, then subtract the error time from that value to determine how many days ago the error occurred)_ and a bunch of other information. — JW0914, Mar 31 '22 at 19:56
@Hannu _(Cont'd...)_ Platter damage in and of itself doesn't mean the drive should be replaced since it can still operate fine provided self-tests result with no errors _(any blocks damaged will be marked as bad and not used again by the drive's firmware)_; however, without a long test, there's no way to determine if the drive should be replaced, as the only values that show conclusive hardware damage affecting drive health are the `Offline_Uncorrectable` sectors and self-test log results from a long test _(reallocated sectors, seek errors, etc. can all be caused by non-HDD hardware)_. — JW0914, Mar 31 '22 at 20:02
@Hannu _(Cont'd...)_ If the long test completes with errors, or doesn't complete due to errors, the drive should immediately be backed up and replaced; however, if it completes without errors, it should be monitored weekly for a month by running a long test once a week, and if it continues to finish without errors and the `Offline_Uncorrectable` sectors don't increase, the drive should be fine. I've had multiple drives in one of my TrueNAS zpools for 7yrs that had hardware errors occur due to an impact a little over a year into their powered-on hours and have had no issues with the drives since — JW0914, Mar 31 '22 at 20:10
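
The "how long ago" arithmetic described in the comments above can be sketched in a few lines of Python. The numbers are taken from the report (`Power_On_Hours` raw value and the "occurred at disk power-on lifetime" lines); the helper name `age_of_error` is hypothetical.

```python
POWER_ON_HOURS = 26898  # attribute 9 raw value when smartctl was run

# Power-on hour stamps of the five logged errors, from the error log.
ERROR_HOURS = {467: 26805, 466: 26805, 465: 26805, 464: 26798, 463: 26798}


def age_of_error(error_hours, now_hours=POWER_ON_HOURS):
    """Return (days, hours) of powered-on time elapsed since the error."""
    return divmod(now_hours - error_hours, 24)


for number, hours in ERROR_HOURS.items():
    days, rest = age_of_error(hours)
    print(f"Error {number}: {days} d {rest} h of power-on time before the report")
```

Note that this measures powered-on time, not calendar time: if the drive was switched off for part of that period, the errors are older on the calendar than the result suggests.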

r2d3 · Answer 1 · 2022-03-31T09:20:46.643

0

Hannu, do not believe those stupid one-dimensional assessments (red, yellow, green) or one phrase conclusions like

Assessment: Disk is OK, 16376 bad sectors (29° C / 84° F)

A disk with 16376 bad sectors is NOT OK! It indicates a sharply decreased life expectancy.

Furthermore 6088 unreadable pending sectors are NOT OK! as well.

Your temperature may well be 29°C now but it has been 58°C already and we don't know for how long. You have 6088 unreadable sectors, 10288 sectors have already been replaced. I replace a drive once unreadable sectors appear.

The G-Shock parameter (`G-Sense_Error_Rate`) probably suggests that you dropped the drive 3 times. Unfortunately I don't have experience with this particular parameter.

Here are the relevant report lines that document damage:

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       10288
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       3
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       27 (Min/Max 22/58)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       854
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       6088
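
As a cross-check on the figures above, the "16376 bad sectors" shown by Disks equals the sum of the reallocated and pending raw counts, which suggests (but does not prove) that this is how the figure is derived. A small Python sketch, assuming the raw values count 512-byte logical sectors:

```python
REALLOCATED = 10288         # attribute 5 raw value
PENDING = 6088              # attribute 197 raw value
LOGICAL_SECTOR_BYTES = 512  # "Sector Sizes: 512 bytes logical" in the report

affected_sectors = REALLOCATED + PENDING
affected_bytes = affected_sectors * LOGICAL_SECTOR_BYTES

print(affected_sectors)      # 16376, the number Disks reports
print(affected_bytes / 2**20)  # affected capacity in MiB (assumption: 512 B units)
```

Under that assumption the affected area is only about 8 MiB of a 500 GB disk; the concern is the count and whether it keeps growing, not the capacity lost.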

Conclusion:

Duplicate your drive using ddrescue or send it to a professional recovery lab!

P.S.: If you do duplicate the drive using ddrescue, would you be so kind as to link the logfile (mapfile)? That would give harrymc a chance to rethink his claim. Thank you.

edited Mar 31 '22 at 09:20

answered Mar 31 '22 at 08:44

r2d3


Your knowledge of SMART is lacking. You made the same errors in [this post](https://superuser.com/questions/1709167/is-my-hard-disk-about-to-fail-tests-seem-inconsistent) and you're refusing to study SMART indicators. You would rather tell posters to junk totally healthy disks. Read more about the subject before hopping in with totally wrong answers. – harrymc Mar 31 '22 at 08:57
I back the statement in my other post. You should rather worry why your normalized value is still at 100 when you already have 6000 pending sectors. – r2d3 Mar 31 '22 at 09:08
Generally I get the overall impression that "SMART" isn't that well integrated in this device. – Hannu Mar 31 '22 at 10:26
I do not look at normalized values anymore. The raw figures provide very good information unless they are coded but I have never seen coded figures for temperature, reallocated sectors or pending sectors. What is interesting with your case is that you frankly admitted that your drive fell and it looks like the G shock counter is reflecting it. Please use ddrescue to duplicate your drive and provide the mapfile (logfile) to this community. – r2d3 Mar 31 '22 at 10:41
Also `Raw_Read_Error_Rate 0x000b 100 084 ` indicates that this drive had lots of problems reading at some stage in the past, probably before it re-allocated the sectors. – dirkt Mar 31 '22 at 11:16
@r2d3 Three days prior to the S.M.A.R.T. pull, the drive experienced hardware damage from an impact, which created at least five hardware errors [463 - 467] due to platter damage; this in and of itself doesn't mean the drive should be replaced since it can still operate fine provided self-tests result with no errors _(blocks damaged will be marked as bad and not used again)_. However, without a long test, there's no way to determine if the drive should be replaced, as the only values that show conclusive hardware damage affecting usage are `Offline_Uncorrectable` [`0`] and the self-test values – JW0914 Mar 31 '22 at 12:56

harrymc · Answer 2 · 2022-03-31T12:01:09.570

-1

The SMART indicators of the disk show no errors at all, no bad sectors, nothing. As far as they are concerned, the disk is in perfect shape.

For the downvoters who do not understand SMART, here is a quote from NTFS.com, "S.M.A.R.T. Attributes":

Attribute values can range from 1 to 253 (1 representing the worst case and 253 representing the best). Depending on the manufacturer, a value of 100 or 200 will often be chosen as the "normal" value.

For most attributes, a value above this threshold is good and means no errors.
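
The two ways of reading the attribute table that are debated on this page can be put side by side. The following Python sketch is illustrative only: it parses a few rows copied from the question, lists attributes whose normalized VALUE has dropped to or below THRESH, and separately lists the non-zero raw counts for attributes 5, 197 and 198. The function names and the choice of watched IDs are assumptions made for this example, and treating the raw field as a plain integer is itself one of the contested points.

```python
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000b   100   084   050    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       10288
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       6088
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
"""

WATCHED_RAW_IDS = (5, 197, 198)  # illustrative choice, not a standard


def parse_attributes(text):
    """Map attribute ID -> normalized value, worst, threshold and raw count."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 10 or not parts[0].isdigit():
            continue  # skip headers and anything that is not an attribute row
        rows[int(parts[0])] = {
            "name": parts[1],
            "value": int(parts[3]),
            "worst": int(parts[4]),
            "thresh": int(parts[5]),
            "raw": int(parts[9]),  # assumes the raw field is a plain integer
        }
    return rows


def failing_normalized(rows):
    """IDs whose normalized VALUE is at or below a non-zero THRESH."""
    return [i for i, r in rows.items() if r["thresh"] > 0 and r["value"] <= r["thresh"]]


def nonzero_raw(rows, ids=WATCHED_RAW_IDS):
    """Watched IDs whose raw count is non-zero."""
    return [i for i in ids if i in rows and rows[i]["raw"] != 0]


rows = parse_attributes(SAMPLE)
print("normalized failures:", failing_normalized(rows))
print("non-zero raw counts:", {i: rows[i]["raw"] for i in nonzero_raw(rows)})
```

For this drive the first list is empty while the second contains attributes 5 and 197, which is exactly the disagreement in this thread.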

What merits attention is that you do have 467 ATA Errors of type READ DMA EXT.

According to the article ATA errors increasing on disk(s) in ReadyNAS:

ATA errors occur when the SATA controller of the ReadyNAS is unable to communicate with a hard disk.

The SATA controller of the ReadyNAS sends commands to the hard disk. When the controller cannot communicate with the disk, this may be as a result of an internal hardware error in the disk itself and may require replacement.

This basically means the motherboard has problems communicating with the disk.

Errors of this kind accumulate over the lifetime of the disk, and the timestamps do not contain a date, so they do not show when the errors occurred.

This might be caused by a bad SATA cable, or be a problem with the disk. Try a new cable and run SMART tests with smartctl. This may help determine whether the disk is actually failing or not.

Keep an eye on the ATA error count, to see if it is still increasing.

edited Mar 31 '22 at 12:01

answered Mar 30 '22 at 20:46

harrymc


"The SMART indicators of the disk show no errors at all, no bad sectors, nothing. As far as they are concerned, the disk is in perfect shape." 10000 reallocated sectors, 6000 pending sectors, signs of overheating (max 58°C) and possible G-Shock (3 events?) suggest the opposite. – r2d3 Mar 31 '22 at 08:33
@r2d3: Did you downvote? And why? – harrymc Mar 31 '22 at 08:34
Yes, of course. I added the reasons in a comment. This disk needs immediate recovery. I have rarely seen so many reallocated sectors. I had 12000 reallocated sectors on a notebook from a colleague that was most probably overheating as well. – r2d3 Mar 31 '22 at 08:35
@r2d3: Sorry, you don't know SMART. The values 100, 200 and sometimes 253 mean "no errors". These are the initial values of most SMART indicators. – harrymc Mar 31 '22 at 08:48
Please consult with others - there is nothing more left to explain on my side. I back my statement. You should have wondered why the normalized figures look so positive despite all the damage indications. – r2d3 Mar 31 '22 at 08:56
@harrymc I initially made the same mistake - you need to scroll over to the right to see the raw values instead of just the thresholds. The drive has platter damage with an impact three days prior to the S.M.A.R.T. pull [463 - 467 occurring at 1116d +14h]; however, without a long self-test, there's no way to determine if the drive should be replaced – JW0914 Mar 31 '22 at 12:46
@JW0914: That's a common mistake - the raw values are usually divided into bit-fields, so treating them as integers is meaningless. One should look at the normalized values, which are more homogeneous. Otherwise, one finds thousands of errors of all kinds, like the other answer does. – harrymc Mar 31 '22 at 12:51
