Honest answer from someone who's spent a lot of time dealing with dying RAID arrays and difficult drives: don't have all your drives from the same batch if you can avoid it.

My experience only applies to spinning disks; SSDs have their own issues and benefits to consider when bulk ordering.

Exactly how best to handle this depends mostly on how big the array you're working with is. If you're working with something like 6-drive arrays with 2-drive redundancy, you can probably safely buy similar drives from 3 manufacturers and split each array that way.

If you're using an odd number of drives, or you're working with arrays that can't be split up that neatly, you can try other approaches: buy the same drive model from different vendors, or, if you're buying in bulk, sort through the drives and separate them based on how likely they are to have been manufactured together (serial number ranges and date codes on the labels are a reasonable guide).
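To illustrate the kind of split I'm describing, here's a minimal sketch in Python (hypothetical vendor names and drive counts, not any particular tool) that checks whether any single batch contributes more drives to an array than the redundancy can absorb:

```python
from collections import Counter

# Hypothetical 6-drive array with 2-drive redundancy; each drive is tagged by
# manufacturer, but this could just as well be a batch or date code off the label.
array = ["WD", "WD", "Seagate", "Seagate", "HGST", "HGST"]
redundancy = 2

def worst_batch_exposure(drive_batches):
    """Return how many drives the single biggest batch contributes."""
    return max(Counter(drive_batches).values())

exposure = worst_batch_exposure(array)
if exposure > redundancy:
    print(f"Risky: one batch covers {exposure} drives, but the array only "
          f"tolerates {redundancy} failures.")
else:
    print(f"OK: losing an entire batch ({exposure} drives) stays within the "
          f"{redundancy}-drive redundancy.")
```

The point is that even if a whole batch turns out to be bad, the array can still limp along on drives from the other manufacturers.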

If you're running a small enough array with the right underlying tech, it might even be worth your time to build it incrementally from heterogeneous disk supplies. Start with the minimum number of drives you can get away with and buy the next supply a month or two later, or when you fill the system. That also lets you get a feel for any issues there might be with the particular models you picked.

The reason behind this advice is a combination of two quirks of drives.

  1. MTBF is remarkably broken when you have a lot of drives with similar origins. In statistics we'd call it sampling bias: because your samples are so similar, the averaging effect is much less useful. If there's a fault with the batch, or even with the design itself, and that happens more often than you'd think, then drives from that batch will fail sooner than the MTBF would suggest.

     If the drives are spread out, you might get [50%, 90%, 120%, 200%] of MTBF, but if all the drives come from that 50% batch you've got a mess on your hands. The first sketch after this list makes that difference concrete.

  2. RAID array rebuilds kill disks. No, really. If you get a drive failure and the array rebuilds, it's going to put extra load on the other drives while it reads the data back off them. If you have a drive close to failure, the rebuild may well take it out, or it may already have a bad sector you just weren't aware of because that section hadn't been read recently.

     If you've got a lot of drives from the same batch, the chances of this kind of cascading failure are much higher than if they come from different batches. You can mitigate this with regular patrol scans, scrubs, resilvering, or whatever the recommended practice is for the type of array you're using, but the downside is that it impacts performance and can take hours to complete. The second sketch below puts rough numbers on why the rebuild window is the dangerous part.
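To make the [50%, 90%, 120%, 200%] point from the first quirk concrete, here's a rough simulation sketch. All the numbers are made up and the exponential lifetime model is a big simplification (real drives follow more of a bathtub curve), but it shows how much sooner a 6-drive, 2-redundancy array dies when every drive comes from the weak batch:

```python
import random

RATED_MTBF = 100_000  # hours; a made-up rating purely for illustration

def drive_lifetime(batch_factor):
    # Exponential lifetime whose mean is the batch's *actual* MTBF.
    return random.expovariate(1.0 / (RATED_MTBF * batch_factor))

def hours_until_array_loss(batch_factors, tolerated_failures=2):
    # With no replacements, the array is lost at the first failure beyond
    # what the redundancy covers (the 3rd failure here).
    lifetimes = sorted(drive_lifetime(f) for f in batch_factors)
    return lifetimes[tolerated_failures]

def average_loss_time(batch_factors, trials=10_000):
    return sum(hours_until_array_loss(batch_factors) for _ in range(trials)) / trials

random.seed(0)
mixed = [0.5, 0.9, 1.2, 2.0, 0.9, 1.2]  # drives spread across several batches
weak = [0.5] * 6                        # every drive from the 50% batch

print("mixed batches: ", round(average_loss_time(mixed)), "hours to array loss")
print("one weak batch:", round(average_loss_time(weak)), "hours to array loss")
```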
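And for the second quirk, a back-of-the-envelope sketch of why rebuilds are the dangerous window. Again, the figures are assumptions (independent exponential failures, a 24-hour rebuild, a pessimistic effective MTBF for the shared-batch case) rather than a model of real rebuild stress, but the jump between the two cases is the point:

```python
import math

def p_second_failure(survivors, mtbf_hours, rebuild_hours):
    """P(at least one surviving drive fails before the rebuild finishes),
    assuming independent exponential failures."""
    p_one_drive_survives = math.exp(-rebuild_hours / mtbf_hours)
    return 1 - p_one_drive_survives ** survivors

# Made-up scenario: a 6-drive array rebuilding for 24 hours after one failure.
scenarios = [
    ("spread batches, healthy drives", 1_000_000),
    ("shared weak batch under rebuild load", 100_000),
]
for label, mtbf in scenarios:
    p = p_second_failure(survivors=5, mtbf_hours=mtbf, rebuild_hours=24)
    print(f"{label}: {p:.3%} chance of a second failure mid-rebuild")
```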

For some context on how wildly drive longevity varies, Backblaze publish a regular drive failure stats report. I'm not affiliated with the company in any way, but they should know what they're talking about on the subject of drive reliability. One example: https://www.backblaze.com/blog/hard-drive-failure-rates-q1-2017/ (your sample set will likely be smaller, so outliers can skew your own experience, but it's still a good reference).