
I inherited a NAS server that is set up with 5 RAID1 soft arrays grouped together into an XFS volume group.

Whoever built the 5th md device created it without a Linux Auto RAID partition. It looks like they ran mdadm --create on the raw disks (i.e. /dev/sdj and /dev/sdk). It had been working fine until now, but today the entire /dev/md5 array disappeared.

The /dev/sdj drive appears to be in the process of failing:

Buffer I/O error on /dev/sdj logical block 0
Buffer I/O error on /dev/sdj logical block 1
Buffer I/O error on /dev/sdj logical block 2
Buffer I/O error on /dev/sdj logical block 3

Normally I would expect the RAID to fail the device but keep the array up with the 2nd drive. When I cat /proc/mdstat, however, my md5 device is gone. I suspect this is because those two drives did not have an Auto RAID partition, but I'm not sure.

I've tried to re-create the md5 array using

mdadm --create /dev/md5 --level=1 --raid-devices=2 /dev/sdj /dev/sdk 

but it says sdj is already part of a RAID device.

What is odd is that the XFS volume group still appears to be working fine: no data is lost as far as I can tell, and df still shows all the space as available. Could it be that XFS still sees the /dev/sdk drive and can successfully write to it? Both sdj and sdk show up in fdisk -l.

My questions are:

  1. Can I safely replace the /dev/sdj drive without messing up the (working but fragile) XFS volume?
  2. How can I recover/rebuild the md5 array if mdstat says it does not exist, but mdadm says it does?
  3. If I go and add a Linux Auto RAID partition to the remaining good drive in this array, will that corrupt the data already on it?
  4. How do you verify data integrity with XFS? (to ensure there really is no data loss)

Output of pvscan:

pvscan
/dev/sdj: read failed after 0 of 4096 at 0: Input/output error
/dev/sdj: read failed after 0 of 4096 at 2000398843904: Input/output error
PV /dev/sdd2   VG VolGroup00   lvm2 [74.41 GB / 0 free]
PV /dev/md2    VG dedvol       lvm2 [931.51 GB / 0 free]
PV /dev/md3    VG dedvol       lvm2 [931.51 GB / 0 free]
PV /dev/md0    VG dedvol       lvm2 [931.51 GB / 0 free]
PV /dev/md4    VG dedvol       lvm2 [931.51 GB / 0 free]
PV /dev/sdj    VG dedvol       lvm2 [1.82 TB / 63.05 GB free]
Total: 6 [5.53 TB] / in use: 6 [5.53 TB] / in no VG: 0 [0 ]

Disk /dev/sdj: 2000.3 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk /dev/sdj doesn't contain a valid partition table

Disk /dev/sdk: 2000.3 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk /dev/sdk doesn't contain a valid partition table

mdadm --misc -Q /dev/sdj
/dev/sdj: is not an md array
/dev/sdj: No md super block found, not an md component.

mdadm --misc -Q /dev/sdk
/dev/sdk: is not an md array
/dev/sdk: device 0 in 2 device undetected raid1 /dev/md5. Use mdadm --examine for more detail.

mdadm --examine /dev/sdk
/dev/sdk:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 25ead1e4:9ab7f998:73875d59:48b17be5
  Creation Time : Fri Nov 26 21:10:49 2010
     Raid Level : raid1
  Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
     Array Size : 1953514496 (1863.02 GiB 2000.40 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 5
    Update Time : Sat Mar 26 07:43:52 2011
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
       Checksum : 35a405cb - correct
         Events : 5720270

      Number   Major   Minor   RaidDevice State
this     0       8      144        0      active sync   /dev/sdj
   0     0       8      144        0      active sync   /dev/sdj
   1     1       8      160        1      active sync   /dev/sdk
  • I think the answer to #3 is "Yes" since if the raw drive was used before, the blocks for the partition table would have data on it now, and adding a partition table will overwrite that. As for the rest, edit your question with the output of pvscan, please? Commented Apr 13, 2011 at 17:52
  • Output of pvscan added Commented Apr 13, 2011 at 18:41
  • Ouch. It looks like sdj was never part of a raid volume and was added directly into the LVM. Same with /dev/sdd2 (though that is a member of a different volume group than the rest). Commented Apr 13, 2011 at 19:39
  • The odd thing is that it was part of the RAID array: prior to today, md5 showed up with the sdj and sdk drives. When sdj started failing, it dropped the RAID. Commented Apr 13, 2011 at 19:55
  • ... New question then. If /dev/sdk is already in an array, where the heck is it? What does mdadm --misc -Q /dev/sdk and mdadm --misc -Q /dev/sdj say? Commented Apr 13, 2011 at 20:00

1 Answer


So, according to the superblock on /dev/sdk, there was a /dev/md5 and sdj was in there with it, but according to /dev/sdj, there is no raid superblock. What I fear is that /dev/sdj was added to the md5 array then /dev/sdj was added to the volume group (instead of /dev/md5) and at some point lvm got around to overwriting the blocks that identified it as a member of the RAID device. I fear this because I honestly can't think of any other way /dev/sdj would end up being named specifically in the LVM group and not have a raid superblock anymore.
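As a cross-check on the "no superblock" result, here is a small sketch of where a version 0.90 superblock would be expected on a whole-disk member. It assumes the usual 0.90 layout (the device size rounded down to a 64 KiB boundary, then one more 64 KiB block back); the function name sb_offset_090 is made up for illustration. Fed the disk size from the fdisk output, it lands on 2000398843904 bytes, which matches both the Used Dev Size from --examine (1953514496 KiB) and the offset of the second read error in the pvscan output. That would be consistent with that region of sdj being unreadable right now, though it does not prove what is or isn't stored there.

```shell
# Offset (in bytes) of a v0.90 md superblock on a device of the given size.
# Assumption: 0.90 layout, i.e. round the size down to a 64 KiB boundary,
# then step back one more 64 KiB block.
sb_offset_090() {
    size_bytes=$1
    reserved=65536                      # 64 KiB reserved area
    echo $(( size_bytes / reserved * reserved - reserved ))
}

offset=$(sb_offset_090 2000398934016)   # disk size taken from the fdisk output
echo "superblock offset: $offset bytes"
echo "usable size:       $(( offset / 1024 )) KiB"
```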

Worst case nightmare scenario: both /dev/sdj and /dev/md5 were added to the LVM. Is your XFS partition bigger than the 5.5 TB in the LVM now? If this is the case, you should be able to get md5 back using mdadm --assemble but you need to be sure it's started in degraded mode without sdj, so it won't overwrite the data there.
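A hedged sketch of what "started in degraded mode without sdj" could look like: only the trusted member is named on the command line, and --run asks mdadm to start the array even though a member is missing. The run wrapper and the DRY_RUN variable are my own scaffolding, not part of mdadm; by default the snippet only prints the command so it can be read over before anything touches the disks.

```shell
# Print commands by default; set DRY_RUN=0 to really execute them (as root).
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Name only the member we trust. --run starts the array even though
# one of its two members is absent, i.e. degraded.
assemble_degraded() {
    run mdadm --assemble --run /dev/md5 /dev/sdk
}

assemble_degraded
```

Afterwards /proc/mdstat should list md5 with a single member if the assemble worked.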

Assuming that your /dev/md5 was never used in the LVM:

(...had you ever looked at pvscan before today?)
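One read-only way to probe that assumption: with 0.90 metadata the superblock sits at the end of each member, so the array's contents begin at byte 0 of the member disk. If md5 itself had ever been handed to pvcreate, an LVM2 label (the string LABELONE, normally in the second 512-byte sector) should be visible at the start of /dev/sdk as well. A minimal sketch under that assumption; has_lvm_label is a made-up helper name, and a failed read (as on sdj at the moment) will simply look like "no label".

```shell
# Succeeds if an LVM2 label is found in the first four 512-byte sectors.
# Only reads from the device; nothing is written.
has_lvm_label() {
    dd if="$1" bs=512 count=4 2>/dev/null | LC_ALL=C grep -q 'LABELONE'
}

# Check every device given on the command line.
for dev in "$@"; do
    if has_lvm_label "$dev"; then
        echo "$dev: LVM2 label present"
    else
        echo "$dev: no LVM2 label in first 4 sectors"
    fi
done
```

Saved as a script, it could be called as `sh check_label.sh /dev/sdk` (the script name is arbitrary).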

If you don't have backups, now is the time to start. If you do, now is the time to test them (and if they don't work, you don't have backups, see step 1).

There isn't an easy way out of this mess, and I haven't got a clue what might happen if you reboot at this point (can you unmount the filesystem?). Suppose I were certain that sdj had been added both as a raid member and as an LVM physical volume. Since LVM wasn't going through the raid driver to write to sdj, none of the data written to sdj would be on sdk. Perhaps this can be verified by comparing hex dumps of various chunks of /dev/sdj and /dev/sdk, with help from someone smarter than me who knows good places to look for things that would say "this is XFS" versus "this is random gibberish or a blank drive". In that case, what I'd do is this:

Start by trying to get SMART data on sdk to see if it is trustworthy or on the way out.
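For the SMART check, something along these lines should do, assuming smartmontools is installed and the drive reports the usual ATA-style health line. The helper smart_verdict is a hypothetical name; it only filters text, so it can be tried on saved output first.

```shell
# Extract the verdict from `smartctl -H` output read on stdin.
# Assumes the ATA-style line:
#   SMART overall-health self-assessment test result: PASSED
smart_verdict() {
    sed -n 's/^SMART overall-health self-assessment test result: *//p'
}

# Typical use (needs root and smartmontools):
#   smartctl -H /dev/sdk | smart_verdict
#   smartctl -A /dev/sdk    # attribute table, e.g. reallocated and pending sectors
```

A PASSED verdict alone is weak evidence, so the attribute table is worth reading too.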

If sdk is good, then I would thank my lucky stars for the former admin having wasted 63GB of /dev/sdj.

fdisk /dev/sdk 

(doublecheck EVERYTHING before hitting return). Have fdisk create a partition table and an md partition (mdadm manpage says use 0xDA, but every walkthrough and my own experience says 0xFD for raid autodetect), then

mdadm --create /dev/md6 --level=1 --raid-devices=2 missing /dev/sdk1 

(doublecheck EVERYTHING before hitting return). This will create a degraded raid1 array named md6 using the partition we made on sdk. These next steps are why that wasted space is important: we've lost some space due to the md superblock and due to the partition table, so our /dev/md6 is slightly smaller than /dev/sdj was. We're going to add /dev/md6 to the dedvol volume group and instruct LVM to move the 1.82TB of logical volume from /dev/sdj to /dev/md6. LVM can handle the filesystem being active while it does this.

pvcreate /dev/md6
vgextend dedvol /dev/md6
pvmove -v /dev/sdj

(doublecheck... you get the picture. I'd also run pvscan after pvcreate and again after vgextend to make sure things look right). This will begin the process of moving all the data allocated to /dev/sdj to /dev/md6 (specifically, the command moves everything off sdj, and md6 is the only place for it to go). Several hours later either this will complete or the system will lock up trying to read from sdj. If the system crashes, you can reboot and try pvmove without a device name to restart at the last checkpoint or just give up and reinstall from backups.
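The sequence with the extra pvscan checks could be wrapped up as below. This is only a sketch: the run wrapper and DRY_RUN are my own scaffolding, the default is to print each command rather than execute it, and in real use you would want to stop and read each pvscan before going on rather than let the chain run unattended.

```shell
# Print commands by default; set DRY_RUN=0 to really execute them (as root).
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Each step only runs if the previous one succeeded.
migrate_off_sdj() {
    run pvcreate /dev/md6        &&
    run pvscan                   &&   # md6 should now show up as a new PV
    run vgextend dedvol /dev/md6 &&
    run pvscan                   &&   # md6 should now belong to dedvol
    run pvmove -v /dev/sdj            # move every allocated extent off sdj
}

migrate_off_sdj
```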

If we succeed, we remove /dev/sdj from the volume group, then remove it as a physical volume:

vgreduce dedvol /dev/sdj
pvremove /dev/sdj

Now, for the corruption-checking part. The tool for checking and fixing xfs is xfs_repair (fsck will run on an xfs filesystem but it does nothing at all). The bad news? It uses gigs of RAM per terabyte of filesystem, so hopefully you have a 64 bit server with a 64 bit kernel and the 64 bit xfs_repair binary (which might be named xfs_repair64) and at least 10GB of RAM+Swap (you should be able to use some of that leftover empty space in dedvol to create a swap volume, then mkswap that volume, then swapon that volume). The filesystem must be unmounted before running xfs_repair on it. Also, xfs_repair can detect and (attempt to) fix damage to the filesystem itself, but it may not detect damage to the data (for instance, something overwriting part of a directory inode versus something overwritten in the middle of a text file).
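A quick way to see whether the box meets that RAM+swap figure is sketched below. The 10 GB threshold is just the number from the paragraph above, the function names are made up, and the LV name repairswap in the comments is hypothetical. One thing to watch: according to the pvscan output, all of the free space in dedvol currently sits on /dev/sdj, so a swap volume created before the pvmove would probably land on the failing disk.

```shell
# Sum MemTotal and SwapTotal (in kB) from /proc/meminfo-style text on stdin.
ram_plus_swap_kb() {
    awk '$1 == "MemTotal:" || $1 == "SwapTotal:" { total += $2 }
         END { printf "%.0f\n", total }'
}

# True when RAM+swap reaches roughly 10 GB.
enough_for_repair() {
    [ "$(ram_plus_swap_kb)" -ge $(( 10 * 1024 * 1024 )) ]
}

if [ -r /proc/meminfo ]; then
    if enough_for_repair < /proc/meminfo; then
        echo "RAM+swap looks sufficient"
    else
        echo "RAM+swap below 10 GB, consider adding a swap volume"
    fi
fi

# Adding swap from the volume group (hypothetical LV name, run as root):
#   lvcreate -L 10G -n repairswap dedvol
#   mkswap /dev/dedvol/repairswap
#   swapon /dev/dedvol/repairswap
# Then, with the filesystem unmounted, a no-modify pass first:
#   xfs_repair -n <device of the XFS logical volume>
```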

Finally, we need to buy a new /dev/sdj, install it, and add it to that degraded /dev/md6, keeping in mind that if we reboot the computer without sdj in it, it is possible sdk will move down to sdj and the new drive will be sdk instead (probably not, but best to be sure):

fdisk /dev/sdj 

Check to make sure that it isn't the drive we already partitioned and set up, then create a partition for md on it:

mdadm /dev/md6 -a /dev/sdj1 
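To keep an eye on the array while it resyncs (or to confirm that an array is missing altogether, as md5 is in the question), the status brackets in /proc/mdstat can be read with a small filter. A sketch only: md_state is a made-up name, and it assumes the usual mdstat layout where the line after the array name ends in something like [UU] or [_U].

```shell
# Report the state of one array from /proc/mdstat-style text on stdin.
# Prints "healthy", "degraded" or "absent".
md_state() {
    awk -v name="$1" '
        $1 == name && $2 == ":"      { found = 1; next }
        found && $2 == ":"           { exit }   # reached the next array
        found && $NF ~ /^\[[U_]+\]$/ {
            # "_" inside the brackets marks a missing or rebuilding member.
            state = ($NF ~ /_/) ? "degraded" : "healthy"
            print state
            done = 1
            exit
        }
        END { if (!done) print "absent" }
    '
}

# Typical use:
#   md_state md6 < /proc/mdstat
```

While the new drive is syncing the array should still read as degraded; it should flip to healthy once the rebuild completes.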

(It is entirely possible that the errors could be due to raid and lvm duking it out over the content of sdj, rather than the drive actually failing (usually failing drives generate a lot of gibberish from the driver in dmesg rather than just Input/Output errors) but I'm not sure I'd risk it.)

  • I'd like to add that I've never done any of the above in an emergency situation (I do use md and lvm), the sequence of commands feels right, but there's no guarantee that this will work, even without a failing drive. Commented Apr 14, 2011 at 1:05
  • Thanks for the very thorough answer. At this point, I think my best bet is to patch this up so it is stable, build a new server and get off this one. I'm fairly certain that a previous pvscan showed md5 in the LVM, but I'm not 100% sure. There is a chance sdj and sdk might have switched places (assuming my old notes about serial #s are correct). If the drive letters did change, could that explain some of the weirdness? Commented Apr 14, 2011 at 2:50
  • @John The raid superblock is supposed to keep track of all of that, but that superblock is missing from sdj. Is it gone because the drive died or is it gone because the LVM overwrote it? Since sdk has a superblock, it's highly unlikely that the array was built without them. If md5 was in the LVM before, then I would guess that when the kernel goes through the partitions to figure out what they do, if a partition fails to register as a RAID member, it tries to see if it's an LVM member, and whatever superblock LVM uses was still readable, so sdj became part of the LVM. Commented Apr 14, 2011 at 3:26
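On the drive-letter question, the --examine output in the question may already hold a hint. It was read from /dev/sdk, yet its "this" row records major 8, minor 144. Assuming the classic sd numbering (block major 8, 16 minors per disk), the sketch below maps a minor back to a disk name; sd_name_from_minor is a made-up helper. Minor 144 comes out as sdj and 160 as sdk, which would suggest that the disk now called sdk was sdj when its superblock was last updated. Treat that as a lead to check against the serial numbers, not as proof.

```shell
# Map an sd minor number to the whole-disk name.
# Assumption: block major 8 with 16 minors per disk, so minors 0..255
# cover sda..sdp.
sd_name_from_minor() {
    index=$(( $1 / 16 ))
    letter=$(echo abcdefghijklmnop | cut -c "$(( index + 1 ))")
    echo "sd$letter"
}

sd_name_from_minor 144   # minor on the "this" row of the --examine output
sd_name_from_minor 160   # minor recorded for the other member
```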
