TL;DR: My ZFS mirror pool got some checksum errors. I replaced the controller, thinking that was the most likely cause, but the errors won't clear. zpool clear temporarily resets them, but they come back the next time I run a scrub. How can I clear them for good?

Full story: I have had a ZFS mirror (mirror-0) set up and running on Ubuntu 20.04.2 LTS for some time. When one of the drives died, I took advantage of the failure to replace both drives with larger ones, and also added a SATA III PCI card for the new drives (the old ones had been connected to the on-board SATA II controller, as I had no more SATA III ports available). After running on the new drives and controller for a few weeks, ZFS complained about checksum errors on both new drives and put the array into a "degraded" state as a result.

Some research led me to the conclusion that since both drives were showing the exact same number of checksum errors, it was much more likely to be an issue with the controller than with the drives themselves. So I pulled the new controller and put the drives back on the onboard SATA II controller for now, intending to replace the controller card once I verify that is the issue. I then deleted the two files that zpool status -v showed as having permanent errors, issued a zpool clear data to reset the errors, and ran a scrub.
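The clear-and-scrub cycle described above can be sketched roughly as follows. This is only a dry-run illustration: the file paths are hypothetical placeholders for whatever zpool status -v listed, and the `run` helper is an assumption added here so that nothing executes unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run sketch of the delete / clear / scrub cycle.
# DRY_RUN=1 (the default) only prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"
POOL="data"   # pool name as shown by zpool status

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

clear_and_scrub() {
    # Arguments: the files reported as having permanent errors.
    for f in "$@"; do
        run rm -- "$f"
    done
    run zpool clear "$POOL"      # reset the per-device error counters
    run zpool scrub "$POOL"      # starts a scrub; it runs in the background
    run zpool status -v "$POOL"  # rerun this later to see the scrub result
}

# Hypothetical paths, for illustration only.
clear_and_scrub /data/example/file1 /data/example/file2
```

With the default DRY_RUN=1 the script just echoes the commands, which makes it easy to review the sequence before running anything against a real pool.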

Unfortunately, after the scrub the errors reappeared, only now zpool status -v no longer showed a file name, just an address (an inode, I believe), presumably for one of the files I had deleted earlier. I tried again, with the same result. Every time I run a scrub, it comes back with the following result:

    root@watchman:~# zpool status -v
      pool: data
     state: DEGRADED
    status: One or more devices has experienced an error resulting in data
            corruption. Applications may be affected.
    action: Restore the file in question if possible. Otherwise restore the
            entire pool from backup.
       see: http://zfsonlinux.org/msg/ZFS-8000-8A
      scan: scrub repaired 16K in 0 days 09:10:20 with 1 errors on Sat Jul 24 15:48:21 2021
    config:

            NAME                                 STATE     READ WRITE CKSUM
            data                                 DEGRADED     0     0     0
              mirror-0                           DEGRADED     0     0     0
                ata-ST8000VE000-2P6101_WSD1M5NW  DEGRADED     0     0    15  too many errors
                ata-ST8000VE000-2P6101_WSD1HEJX  DEGRADED     0     0    15  too many errors

    errors: Permanent errors have been detected in the following files:

            data:<0x380508>
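The matching CKSUM counts on both mirror members are what pointed me at the controller rather than the drives. As a rough sketch, that comparison can be pulled out of the status output with awk. The `cksum_summary` function name is made up for this example, and it assumes the leaf devices use ata- by-id names as they do in this pool.

```shell
#!/bin/sh
# Sketch: print the CKSUM count of each leaf device found in `zpool status`
# output, and note when every device shows the same nonzero count.
cksum_summary() {
    awk '
        # The config table starts at the NAME ... CKSUM header line.
        $1 == "NAME" && $NF == "CKSUM" { in_config = 1; next }
        /^errors:/ { in_config = 0 }
        # Assumption: leaf devices are the rows whose name starts with ata-.
        in_config && $1 ~ /^ata-/ {
            print $1, $5          # device name and CKSUM column
            counts[$5]++
            n++
        }
        END {
            distinct = 0
            for (c in counts) { distinct++; last = c }
            if (n > 1 && distinct == 1 && last + 0 > 0)
                print "identical nonzero CKSUM counts on all " n " devices"
        }
    '
}

# Sample input copied from the status output above.
cksum_summary <<'EOF'
        NAME                                 STATE     READ WRITE CKSUM
        data                                 DEGRADED     0     0     0
          mirror-0                           DEGRADED     0     0     0
            ata-ST8000VE000-2P6101_WSD1M5NW  DEGRADED     0     0    15  too many errors
            ata-ST8000VE000-2P6101_WSD1HEJX  DEGRADED     0     0    15  too many errors
EOF
```

On a live system the same function could be fed with zpool status data | cksum_summary instead of the sample text.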

From what I can tell, this is just the same issue that already existed due, presumably, to the bad controller, but I can't seem to clear it out. How can I restore my mirror to a fully-functioning state?

UPDATE: I finally gave up on the idea of clearing the errors and started over instead. I created a new pool, stealing one of the drives from the existing mirror. I then ran an rsync to copy all the data over from the old pool to the new one. This did run into a few errors (ZFS wasn't lying about data errors), but nothing significant or troubling, and excluding the errored files allowed rsync to complete successfully. I then added the second drive to the new pool, and after a resilver everything now looks good: a scrub on the new pool completed without error.
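For anyone wanting to follow the same route, the rebuild looked roughly like the sketch below. Treat it as an outline under assumptions rather than the exact commands used: the new pool name `newdata`, the exclude-list path, the choice of which disk to move first, the default /poolname mountpoints, and the explicit destroy of the old pool are all placeholders. It only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run sketch of rebuilding the mirror as a fresh pool.
# DRY_RUN=1 (the default) only prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"
OLD_POOL="data"
NEW_POOL="newdata"   # hypothetical name for the new pool
DISK_A="/dev/disk/by-id/ata-ST8000VE000-2P6101_WSD1M5NW"
DISK_B="/dev/disk/by-id/ata-ST8000VE000-2P6101_WSD1HEJX"
EXCLUDES="/root/errored-files.txt"   # hypothetical list of unreadable files

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

rebuild_pool() {
    # Take one disk out of the old mirror and build a single-disk pool on it.
    run zpool detach "$OLD_POOL" "$DISK_B"
    run zpool create "$NEW_POOL" "$DISK_B"   # may need -f if old labels remain
    # Copy the data, skipping the files that fail to read.
    # Assumes both pools are mounted at their default /<poolname> paths.
    run rsync -a --exclude-from="$EXCLUDES" "/$OLD_POOL/" "/$NEW_POOL/"
    # Destructive: free the second disk only after the copy has been verified.
    run zpool destroy "$OLD_POOL"
    # Attaching the second disk turns the new pool into a mirror and resilvers.
    run zpool attach "$NEW_POOL" "$DISK_B" "$DISK_A"
    # Start the scrub only once the resilver has finished.
    run zpool scrub "$NEW_POOL"
}

rebuild_pool
```

Note that the pool has no redundancy between the detach and the end of the resilver, so a separate backup is worth having before starting.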

So assuming everything continues to look good for the next week or so, I think it's safe to assume the SATA III card was the cause of the issue, and replace it with a better brand/option :)

  • I believe it's time for a backup and a check for faulty hardware. Commented Jul 25, 2021 at 12:56
  • @djdomi Yes, I believe the controller was faulty. I have pulled it, but without being able to clear the current errors, it is somewhat difficult to confirm if that was indeed the case. Commented Jul 25, 2021 at 14:13
  • I'm going through a similar error in raidz-3 while upgrading to larger drives. Both new, larger drives now show the same checksum count (26) after the second new drive resilvered successfully, even after a reboot, and a scrub won't clear it. (No fault, only degraded because I have pulled another drive for replacement; no errors with -v. I haven't tried to clear it yet, as I'm trying to settle on the best course of action.) Commented Aug 7, 2023 at 14:56

1 Answer

From time to time I also get some checksum errors on a mirror-0 vdev, mostly after a reboot, and the status of the ZFS pool changes to degraded.

zpool status <poolname> 

To fix this and clean the errors I run:

zpool clear <poolname> 
