In a small server system, I have a ZFS pool with a mirrored pair of consumer-grade drives (Seagate Barracudas). A recent periodic scrub reported the following:

   pool: storage
  state: ONLINE
 status: One or more devices has experienced an unrecoverable error. An
         attempt was made to correct the error. Applications are unaffected.
 action: Determine if the device needs to be replaced, and clear the errors
         using 'zpool clear' or replace the device with 'zpool replace'.
    see: http://zfsonlinux.org/msg/ZFS-8000-9P
   scan: scrub repaired 10.9M in 44h14m with 0 errors on Tue Jun 6 00:11:23 2017
 config:

         NAME          STATE     READ WRITE CKSUM
         storage       ONLINE       0     0     0
           mirror-0    ONLINE       0     0     0
             map2_sda  ONLINE       0     0     0
             map2_sdb  ONLINE       0     0    55

 errors: No known data errors
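
For reference, this is ordinary 'zpool status' output; the action line's suggestion would amount to roughly the following once the device is judged healthy (sketch only, pool and device names taken from the output above):

 # show pool health and per-device READ/WRITE/CKSUM counters
 zpool status -v storage

 # reset the error counters once the device is judged healthy
 zpool clear storage map2_sdb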

There have been a few power failures and similar events between this scrub and the previous one, which seems like a plausible cause of the errors, but I worry that they could instead be a sign of an impending hardware fault, particularly since one disk was entirely clean and the other had multiple errors.

smartctl tells me that the suspect drive has had a total of 117 errors during its lifetime (of 935 days), but the most obvious error indicators are all well clear of their threshold values:

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   109   081   006    Pre-fail  Always       -       22737688
  5 Reallocated_Sector_Ct   0x0033   092   092   010    Pre-fail  Always       -       9784
  7 Seek_Error_Rate         0x000f   083   060   030    Pre-fail  Always       -       213798923
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       22599
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
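
For reference, the attribute table above is what smartctl's '-A' listing prints; a sketch of the relevant invocations follows (the /dev/sdb path is a guess on my part, since the pool refers to the disk by its mapped name map2_sdb):

 # attribute table as shown above (device path assumed)
 smartctl -A /dev/sdb

 # the drive's lifetime error log and self-test history
 smartctl -l error /dev/sdb
 smartctl -l selftest /dev/sdb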

Does anything here indicate that I need to preemptively replace this disk? I don't need 100% uptime on this machine, but I would rather not have to deal with the multiple days of resilvering that would be required if I had to replace the disk in an emergency.

  • Can you post the complete smartctl output of both disks? Commented Jun 12, 2017 at 5:49
  • Related topic regarding the cause of Checksum errors: serverfault.com/questions/789194/… Commented Jun 12, 2017 at 9:46

2 Answers


I wouldn't really panic if I were you, and certainly not by replacing the drive right away (which actually puts you in a dicier situation: a single remaining drive, nearly three years old, carrying the pool alone through a 44+ hour resilver). I'd put the biggest drive I could reasonably afford into a free slot and attach it to the pool (not as a spare, but as a third leg of the mirror), and when (if) one of the other two fails first, I'd replace it with another big one and grow the pool. That's one of the nicer features of ZFS... but that's just me.
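
A rough sketch of that attach step, assuming the new disk shows up as /dev/sdc (any existing member of mirror-0 works as the device to attach alongside):

 # attach a third disk next to an existing member, turning mirror-0 into a 3-way mirror
 zpool attach storage map2_sda /dev/sdc

 # watch the resilver progress
 zpool status storage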

It's old, but see Google's study of SMART data, drive failure rates, heat, and age.

  • This is a mirror vdev. You can easily add a new drive while both old ones are still online and providing data, then remove one of the old ones to get back to a two-way mirror. In fact, I'm pretty sure that's exactly what zpool replace will do for you. Commented Jun 12, 2017 at 14:30
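
For completeness, the replace form mentioned in the comment would look roughly like this (new device name assumed again); as long as the old disk stays readable, ZFS resilvers onto the new disk before detaching the old one:

 zpool replace storage map2_sdb /dev/sdc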

Checksum errors are far less critical than read or write errors. While read/write errors indicate that a block could not be read or written at all (which is most likely because it is permanently damaged), checksum errors just mean that what was received is not what should have been received (according to ZFS' own checksums).

You may want to investigate the cause of the errors:

  • Have these errors happened before, or was this the first time?
  • Has anything happened to the machine (somebody moved it, touched it, replaced other hardware)?
  • Were there unexpected reboots, power losses, or other power supply events (if your devices allow you to monitor that)?
  • What are the heat and physical shock conditions for both disks inside the case?
  • Are the two disks treated differently in any way (different cables, different positions in the case relative to the cabling, different controllers, etc.)?
  • Has anything odd shown up in any available logs? (A few commands for this are sketched below.)
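
A few places to look for that last point, sketched for a Linux system (exact tools, device names and log locations will vary):

 # kernel messages about ATA errors, resets or link problems
 dmesg | grep -iE 'ata|error|reset'

 # kernel messages of error priority or worse from the systemd journal
 journalctl -k -p err

 # the drive's own error log (device path assumed)
 smartctl -l error /dev/sdb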

If you cannot find anything AND additional checksum errors keep appearing (possibly in increasing or high numbers), you may want to replace the disk. You can do that by first adding a third mirror disk, as quadruplebucky suggested, and letting it resilver during off-hours; any additional load on the machine will slow the resilver down. Depending on the disks, it might even be faster to resilver from the "good" disk alone than from both, but only if the "bad" one is really bad (which I don't assume here).
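
If you go that route and later want to drop the suspect disk once the new one has finished resilvering, the final step would be roughly the following (device name from the question):

 # confirm the resilver has completed
 zpool status storage

 # then remove the suspect disk from the (now 3-way) mirror
 zpool detach storage map2_sdb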

  • The second sentence could also be rephrased to state that "while read/write errors indicate that the disk knows it's having difficulties, checksum errors mean that something was read back cleanly by the disk but differed from what was originally written or intended to be written". I'm not sure that's generally better than actual errors. Commented Jun 12, 2017 at 14:31
  • @MichaelKjörling Yes, from the perspective of the applications/system, silent corruption would be much worse than not delivering anything. My answer was more in regard to the question "what is more likely to indicate an imminent hardware failure of the disk", where read/write failures mean that either the disk refuses to work on this sector or the whole communication link is broken. It's maybe similar in idea to analog shortwave radio vs. digital: you get some static and broken/garbled sound, but at least you get something, so the sender cannot be completely dead. Commented Jun 13, 2017 at 7:17
