I recently installed three new disks in my QNAP TS-412 NAS.
The three new disks were to be combined with the already-present disk into a four-disk RAID5 array, so I started the migration process.
After multiple attempts (each taking about 24 hours), the migration seemed to complete, but it left the NAS unresponsive.
At that point I reset the NAS. Everything went downhill from there:
- The NAS boots, but marks the first disk as failed and drops it from all arrays, leaving them running degraded.
- I ran checks on the disk and couldn't find any issues with it (which would be odd anyway, as it's almost new).
- The admin interface didn't offer any recovery options, so I figured I'd just do it manually.
I've successfully rebuilt all of the QNAP-internal RAID1 arrays (/dev/md4, /dev/md13 and /dev/md9) using mdadm, leaving only the RAID5 data array, /dev/md0.
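For completeness, the RAID1 repairs boiled down to re-adding the dropped /dev/sda partitions to their arrays, roughly like this (the partition-to-array mapping follows the QNAP layout visible in /proc/mdstat below, and I'm reconstructing the exact commands from memory, so treat this as a sketch rather than a transcript):

    # e.g. for the system array md9, which had dropped sda1:
    mdadm /dev/md9 --add /dev/sda1
    # likewise for md13 (sda4); md4 was repaired the same way
    mdadm /dev/md13 --add /dev/sda4
    # watch the RAID1 resyncs complete
    cat /proc/mdstat

Those resyncs all finished without trouble.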
For /dev/md0 I've tried the following multiple times now:

    mdadm -w /dev/md0

(Required because the NAS re-assembled the array read-only after dropping /dev/sda3 from it; the array can't be modified while it is read-only.)

    mdadm /dev/md0 --re-add /dev/sda3

After that the array starts rebuilding, but it stalls at 99.9% while the system becomes extremely slow and/or unresponsive (logging in over SSH fails most of the time).
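Between those two commands I also check that the array really did switch to read-write before attempting the re-add; a small sanity check, assuming the QNAP kernel exposes the standard md sysfs attributes:

    # should report something other than "readonly" / "read-auto" after mdadm -w
    cat /sys/block/md0/md/array_state
    # the same information shows up in the State line here
    mdadm -D /dev/md0 | grep -i state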
Current state of things:

    [admin@nas01 ~]# cat /proc/mdstat
    Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
    md4 : active raid1 sdd2[2](S) sdc2[1] sdb2[0]
          530048 blocks [2/2] [UU]

    md0 : active raid5 sda3[4] sdd3[3] sdc3[2] sdb3[1]
          8786092608 blocks super 1.0 level 5, 64k chunk, algorithm 2 [4/3] [_UUU]
          [===================>.]  recovery = 99.9% (2928697160/2928697536) finish=0.0min speed=110K/sec

    md13 : active raid1 sda4[0] sdb4[1] sdd4[3] sdc4[2]
          458880 blocks [4/4] [UUUU]
          bitmap: 0/57 pages [0KB], 4KB chunk

    md9 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[1]
          530048 blocks [4/4] [UUUU]
          bitmap: 2/65 pages [8KB], 4KB chunk

    unused devices: <none>

(It has been stalled at 2928697160/2928697536 for hours now.)
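To see what the stalled recovery is actually doing beyond the percentage above, I've been reading the md sysfs attributes directly (again assuming the stock QNAP kernel exposes the standard md sysfs tree):

    # which operation md thinks is running (recover / resync / idle / ...)
    cat /sys/block/md0/md/sync_action
    # sectors completed vs. total for the current operation
    cat /sys/block/md0/md/sync_completed
    # current rebuild speed, plus the global speed limits in KB/s
    cat /sys/block/md0/md/sync_speed
    cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max

These mirror what /proc/mdstat shows, but sync_completed gives the exact sector offset the recovery is stuck at.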

    [admin@nas01 ~]# mdadm -D /dev/md0
    /dev/md0:
            Version : 01.00.03
      Creation Time : Thu Jan 10 23:35:00 2013
         Raid Level : raid5
         Array Size : 8786092608 (8379.07 GiB 8996.96 GB)
      Used Dev Size : 2928697536 (2793.02 GiB 2998.99 GB)
       Raid Devices : 4
      Total Devices : 4
    Preferred Minor : 0
        Persistence : Superblock is persistent

        Update Time : Mon Jan 14 09:54:51 2013
              State : clean, degraded, recovering
     Active Devices : 3
    Working Devices : 4
     Failed Devices : 0
      Spare Devices : 1

             Layout : left-symmetric
         Chunk Size : 64K

     Rebuild Status : 99% complete

               Name : 3
               UUID : 0c43bf7b:282339e8:6c730d6b:98bc3b95
             Events : 34111

        Number   Major   Minor   RaidDevice State
           4       8        3        0      spare rebuilding   /dev/sda3
           1       8       19        1      active sync   /dev/sdb3
           2       8       35        2      active sync   /dev/sdc3
           3       8       51        3      active sync   /dev/sdd3

After inspecting /mnt/HDA_ROOT/.logs/kmsg, it turns out that the actual issue appears to be with /dev/sdb3 instead:

    <6>[71052.730000] sd 3:0:0:0: [sdb] Unhandled sense code
    <6>[71052.730000] sd 3:0:0:0: [sdb] Result: hostbyte=0x00 driverbyte=0x08
    <6>[71052.730000] sd 3:0:0:0: [sdb] Sense Key : 0x3 [current] [descriptor]
    <4>[71052.730000] Descriptor sense data with sense descriptors (in hex):
    <6>[71052.730000]         72 03 00 00 00 00 00 0c 00 0a 80 00 00 00 00 01
    <6>[71052.730000]         5d 3e d9 c8
    <6>[71052.730000] sd 3:0:0:0: [sdb] ASC=0x0 ASCQ=0x0
    <6>[71052.730000] sd 3:0:0:0: [sdb] CDB: cdb[0]=0x88: 88 00 00 00 00 01 5d 3e d9 c8 00 00 00 c0 00 00
    <3>[71052.730000] end_request: I/O error, dev sdb, sector 5859367368
    <4>[71052.730000] raid5_end_read_request: 27 callbacks suppressed
    <4>[71052.730000] raid5:md0: read error not correctable (sector 5857246784 on sdb3).
    <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
    <4>[71052.730000] raid5:md0: read error not correctable (sector 5857246792 on sdb3).
    <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
    <4>[71052.730000] raid5:md0: read error not correctable (sector 5857246800 on sdb3).
    <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
    <4>[71052.730000] raid5:md0: read error not correctable (sector 5857246808 on sdb3).
    <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
    <4>[71052.730000] raid5:md0: read error not correctable (sector 5857246816 on sdb3).
    <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
    <4>[71052.730000] raid5:md0: read error not correctable (sector 5857246824 on sdb3).
    <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
    <4>[71052.730000] raid5:md0: read error not correctable (sector 5857246832 on sdb3).
    <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
    <4>[71052.730000] raid5:md0: read error not correctable (sector 5857246840 on sdb3).
    <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
    <4>[71052.730000] raid5:md0: read error not correctable (sector 5857246848 on sdb3).
    <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
    <4>[71052.730000] raid5:md0: read error not correctable (sector 5857246856 on sdb3).
    <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
    <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
    <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
    <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
    <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
    <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
    <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
    <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
    <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
    <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
    <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
    <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
    <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
    <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.

The above sequence is repeated at a steady rate for various (random?) sectors in the 585724XXXX range.
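Given those read errors, checking /dev/sdb directly seems like the obvious next step. This is what I intend to run, assuming smartctl (smartmontools) and hdparm are available on the firmware, which I haven't verified yet:

    # SMART health summary plus the reallocated / pending sector counters
    smartctl -a /dev/sdb
    # long surface self-test; takes hours, progress is visible via smartctl -a
    smartctl -t long /dev/sdb
    # spot-check one of the failing sectors reported in kmsg (read-only)
    hdparm --read-sector 5859367368 /dev/sdb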
My questions are:
- Why is the rebuild stalled so close to the end, and why is it still consuming so many resources that the system becomes unresponsive (the md0_raid5 and md0_resync processes are still running)?
- Is there any way to see what is causing it to fail/stall? (Most likely the sdb3 read errors shown above.)
- How can I get the operation to complete without losing my 3TB of data? For example by skipping the troublesome sectors on sdb3 while keeping the intact data; see the sketch below for what I'm considering.
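What I'm considering for that last point, but haven't dared to run yet, is cloning the failing disk onto a spare 3TB drive with GNU ddrescue and rebuilding from the clone, which returns zeroed data for the unreadable sectors instead of I/O errors. A rough sketch, assuming ddrescue can be installed (e.g. via Optware/ipkg) and that the spare disk would show up as /dev/sde; both the device name and the mapfile path are just placeholders:

    # stop the array so nothing touches the members during the copy
    mdadm --stop /dev/md0
    # first pass: copy everything that reads cleanly, keep a mapfile so the run can resume
    ddrescue -f -n /dev/sdb /dev/sde /share/sdb-rescue.map
    # second pass: retry the bad areas a few times
    ddrescue -f -r3 /dev/sdb /dev/sde /share/sdb-rescue.map
    # then assemble the array from the clone instead of the failing disk
    # (may need --force because of the interrupted rebuild; not sure)
    mdadm --assemble /dev/md0 /dev/sda3 /dev/sde3 /dev/sdc3 /dev/sdd3

Whether that approach is sane, and whether the rebuild of /dev/sda3 can then complete against the clone, is exactly the kind of feedback I'm hoping for.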