Xsan

Dealing with Drive Failures with Xsan

I originally posted this at http://www.318.com/TechJournal

Sometimes a drive fails, or a RAID controller goes down, on an array with a redundant drive, and the parity on the RAID must be rebuilt. In other words, if you lose a drive in a RAID 5, RAID 1, RAID 0+1 or RAID 3 array you will be left with a degraded RAID (also referred to as a critical RAID) unless you have configured your Xserve RAID to use a hot spare. If you are using a hot spare on the channel of the failed drive, the RAID will begin to rebuild itself automatically. If you are not using a hot spare, you should return your degraded RAID to a healthy state as quickly as possible to avoid data loss. In the event of a second drive failure on the array most of the data could be lost, and Murphy’s Law is evil when it comes to RAIDs. The data should be backed up as quickly as possible if it has not already been backed up.

Once the data is backed up, you should perform a rebuild of the parity for the array. The parity is rebuilt based on the data that is on the array. This does not fix any issues that may be present with the actual data. In other words, if you were using the Xserve RAID as a local volume, the rebuild would only repair issues with the array; it would not also perform a repair disk on the drives. In an Xsan, any data corruption could force you to rebuild your volume from the LUNs. You would not need to relabel the LUNs, but you may have to rebuild your volume.
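To show why a parity rebuild cannot repair the data itself, here is a minimal Python sketch of XOR parity of the kind RAID 5 uses. It is only an illustration: the function names and the tiny 4-byte blocks are made up for this example, and it is not how the Xserve RAID firmware is actually implemented.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_parity(data_blocks):
    """Parity is derived purely from whatever data is currently on the array."""
    return xor_blocks(data_blocks)

def reconstruct_missing(surviving_blocks, parity):
    """Recover one lost block from the surviving blocks plus the parity block."""
    return xor_blocks(list(surviving_blocks) + [parity])

# A healthy stripe spread across three (hypothetical) data drives.
stripe = [b"ABCD", b"EFGH", b"IJKL"]
parity = rebuild_parity(stripe)

# Lose the second drive: its block can be recomputed from the rest.
recovered = reconstruct_missing([stripe[0], stripe[2]], parity)

# Now suppose the second block was already corrupted before the rebuild.
corrupted = [b"ABCD", b"EFGX", b"IJKL"]
parity_after_rebuild = rebuild_parity(corrupted)

# The new parity is perfectly consistent with the corrupted data,
# so a later reconstruction faithfully returns the bad block.
recovered_bad = reconstruct_missing([corrupted[0], corrupted[2]], parity_after_rebuild)
```

The rebuild makes the parity agree with whatever is on the drives, good or bad, which is why the file system or the Xsan volume may still need its own repair afterwards.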

In many situations you will be able to simply swap the bad drive out with an identical good drive and configure it as a hot spare. Then the Xserve RAID will automatically begin rebuilding the array, moving it from a degraded state into a healthy state.

However, there are often logical issues with drives and arrays. Also, hot spares do not always join the degraded array. In these situations you may need to manually rebuild an array. To do this:
Silence the alarm on the Xserve RAID.
Verify that you have a clean backup of your data.
Verify that you have a clean backup of your data again or, better, have someone else check as well.
Open up your trusty Xserve RAID Spare Parts Kit and grab the spare drive module.
Remove the drive module that has gone down (typically the one with the amber light).
Install the new drive in your now empty slot.
Open RAID Admin from the /Applications/Server directory.
Click on the RAID containing the damaged array.
Click on the Advanced button in the toolbar.
Enter the management password for the Xserve RAID you are rebuilding the parity for.
Click on the button for Verify or Rebuild Parity and click on Continue.
Select the array needing to be rebuilt.
Click Rebuild Array and be prepared to wait for hours during the rebuild process. It is possible to use the array during the rebuild, but performance will suffer, so if you don’t have to use the array it is probably best not to. During the rebuild the lights on the drive will alternate between amber and green.
Once the rebuild is complete, perform a Verify Array on the RAID.
Verify the data on the volumes using the array.
Order a new drive to replace the broken drive in your Xserve RAID Spare Parts Kit.

If the rebuild of the data does not go well and the array is lost, then you will likely need to delete the array and re-add it. This will cause you to lose the data that was stored on that array and possibly on the volume, so it can never hurt to call Apple first and see if they have any more steps you can attempt. This is one of the many good reasons for backing data up. Just because you are using a RAID does not mean you should not back your data up.

The Verify Array can also be used to help troubleshoot issues with corrupted arrays.
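Conceptually, a verify pass compares the stored parity with the parity recomputed from the data. The Python sketch below illustrates that idea under the assumption of simple XOR parity; the function name and the in-memory stripe layout are hypothetical and are not taken from RAID Admin or the Xserve RAID firmware.

```python
def find_inconsistent_stripes(stripes):
    """Return the indexes of stripes whose stored parity does not match the data.

    `stripes` is a list of (data_blocks, stored_parity) pairs, where every
    block in a stripe is a bytes object of the same length.
    """
    bad = []
    for index, (data_blocks, stored_parity) in enumerate(stripes):
        computed = bytes(len(stored_parity))  # start from all-zero bytes
        for block in data_blocks:
            computed = bytes(a ^ b for a, b in zip(computed, block))
        if computed != stored_parity:
            bad.append(index)
    return bad

# Two tiny example stripes: the first is consistent, the second is not.
stripes = [
    ([b"\x01\x02", b"\x03\x04"], b"\x02\x06"),
    ([b"\x10\x20", b"\x01\x02"], b"\x00\x00"),
]
bad_stripes = find_inconsistent_stripes(stripes)
```

A check like this can say that a stripe is inconsistent, but not which block in it is wrong, so it works as a troubleshooting aid and not as a repair.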

This process has been tested using firmware 1.5 and below for Xserve RAIDs.