I have an Intel Xeon-based system running Red Hat Enterprise Linux 9 with an Intel Software RAID configured. I'm using the mdadm utility for managing and monitoring software RAID devices, and I need to understand the possible RAID failure statuses. My goal is to monitor the status of the RAID arrays and be able to identify when they are in a degraded state, missing drives, or have failed devices.
Useful Commands:
cat /proc/mdstat
This command helps identify the active RAID devices. For example, I can see different RAID arrays with various states like "active" and "inactive." Example output:
# cat /proc/mdstat
Personalities : [raid1]
md125 : inactive nvme2n1[0](S)
      1105 blocks super external:imsm

md126 : active raid1 nvme0n1[1] nvme1n1[0]
      890806272 blocks super external:/md127/0 [2/2] [UU]

md127 : inactive nvme1n1[1](S) nvme0n1[0](S)
      10402 blocks super external:imsm

unused devices: <none>
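For what it is worth, this is the quick check I have been using in a test script to flag degraded arrays from /proc/mdstat; it assumes the usual [n/m] [UU] status markers, where an underscore inside the brackets marks a missing or failed member (please correct me if that assumption is wrong):

# grep -E '\[[0-9]+/[0-9]+\] +\[[U_]+\]' /proc/mdstat | grep '_' && echo "possible degraded array"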
mdadm --detail /dev/md126
This command provides detailed RAID volume information. The State field in the output indicates the health of the RAID volume. Example output:
# mdadm --detail /dev/md126
/dev/md126:
         Raid Level : raid1
         Array Size : 890806272 (849.54 GiB 912.19 GB)
              State : active
     Active Devices : 2
     Failed Devices : 0
 Consistency Policy : resync
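On a related note, the mdadm man page mentions a --test flag for use with --detail that sets the exit status according to array health (as I understand it: 0 = functioning normally, 1 = at least one failed device, 2 = too many failed devices to be usable, 4 = error reading the device), which looks scriptable:

# mdadm --detail --test /dev/md126; echo "exit status: $?"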
RAID Failure and Degradation Scenarios:
If a Hard Disk is Missing:
What will be the output and RAID status in the mdadm --detail and cat /proc/mdstat commands? Specifically, how will the RAID array reflect the missing disk's status?
If a Hard Disk is Offline:
How will this affect the RAID status shown by these commands? Will the status change to something like offline, syncing, or degraded?
If a RAID Array is Degraded:
What status will be reflected in the output? Specifically, how does the term "degraded" appear in the RAID status in both mdadm output and /proc/mdstat?
If a RAID is in a FailSpare State:
What is the output in mdadm --detail for a RAID array in this state? How is this state reflected in the status of the RAID array and individual drives? (A sketch of how I plan to reproduce these states on a test array follows below.)
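For context, this is roughly how I intend to reproduce and observe these states on a disposable test array once I understand the expected statuses. I am assuming that --fail, --remove, and --add behave the same on an IMSM external-metadata volume as on a native md array:

# mdadm --manage /dev/md126 --fail /dev/nvme1n1     # mark a member faulty
# cat /proc/mdstat                                  # watch for the degraded [U_] marker
# mdadm --detail /dev/md126                         # watch the State and Failed Devices fields
# mdadm --manage /dev/md126 --remove /dev/nvme1n1
# mdadm --manage /dev/md126 --add /dev/nvme1n1      # re-add the member and let it rebuild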
I have found the following statuses in the mdadm manual:
Critical Severity:
Fail, FailSpare, DeviceDisappeared, DegradedArray
Warning Severity:
RebuildStarted, RebuildNN, RebuildFinished, SparesMissing
Could you explain what these status values mean, especially when monitoring RAID arrays for failure or degradation?
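For background on why I am asking: my plan is to catch these events with mdadm's monitor mode (or the mdmonitor service RHEL provides, which reads MAILADDR/PROGRAM from /etc/mdadm.conf). A rough sketch of what I have in mind; the alert script path is just a placeholder of my own, and as far as I can tell the event name and array device are passed to that program as arguments:

# mdadm --monitor --scan --daemonise --delay=60 --mail=root --program=/usr/local/sbin/raid-alert.sh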
Additional Queries:
Is it better to use the mdadm --detail /dev/md<number> command to check the complete RAID status, or should I use mdadm --examine /dev/<disk> on any RAID member disk? For example:

# mdadm --examine /dev/nvme1n1

The output from this command provides information about the disk and its current state, but I am unsure about the relevance of this command compared to mdadm --detail.
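In case it helps to see what I am comparing, these are the invocations I have been looking at side by side (nvme1n1 is simply one member disk in my setup):

# mdadm --detail /dev/md126       # volume-level view: State, Active/Failed Devices
# mdadm --examine /dev/nvme1n1    # member-disk view: the metadata recorded on that one disk
# mdadm --detail --scan           # one-line summary of each assembled array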
Thank you for your assistance.