In one of my articles I tried to define the mathematics of a RAID 5 stripe and how it relates to data recovery. Because parity is computed with the exclusive OR (XOR) truth table, the array can continue to run even when one drive has dropped out. This RAID state is known as degraded, and the IT professional must consider it a temporary state. Once the array is in a degraded state, the prudent technician should try to do the following:
1. Take every user off the server. Although the RAID is designed to run in a degraded state, it is not meant to stay in production that way. Ignore management, ignore the users, and log everyone off.
2. Make a complete and full backup.
3. Check your complete and full backup. Many a time I have heard a tech tell me that he did a full and complete backup, only to find out that some obscure piece of accounting software had a hidden flat file buried 27 folders deep that held the entire company’s payroll for the last 36 years and was not in his “complete backup”.
4. Pull every drive from the array and make a complete sector-by-sector image of each drive. Take those images and guard them with your life. If something goes amiss while you are trying to bring the array back online, you will have a clean starting point. This method is called the ‘hindsight is definitely 20/20’ school of thought and has saved my derriere on many occasions.
5. Check every cable, every slot, and every dust-laden chip to make sure that nothing has ‘broken’ loose.
6. Put the working drives back in the enclosure and replace the bad drive. Bring the array back online. Go into the RAID BIOS and make sure that any rebuild is pointing to the right drive. Although there may be metadata that tells the RAID card who is what, where, and how, double-check anyway.
7. Rebuild the array. If you get a stall, a hang, or a reboot, then stop everything. Execute step 5 again, and try the rebuild just one more time. If it fails again, then do a surface check of all the drives in the array, including the new drive. The fact that a drive is new does not necessarily mean that it will work out of the box. Many a time I have pulled a new drive out of its packaging only to have it fail the ‘smoke test’. With luck, the read tests of a surface check will expose any flaws on the media.
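The check in step 3 can be partly automated. What follows is only a minimal sketch, assuming the backup has been restored or mounted as an ordinary directory tree that mirrors the source; the names `file_digest` and `compare_trees` are made up for this illustration and do not come from any backup product. It walks every file under the source, however deep, and reports anything that is missing from the backup or whose contents differ.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(source_root, backup_root):
    """Compare every file under source_root with its copy under backup_root.

    Returns (missing, mismatched): relative paths that are absent from the
    backup, and relative paths whose contents differ.
    """
    source_root = Path(source_root)
    backup_root = Path(backup_root)
    missing, mismatched = [], []
    for src in sorted(source_root.rglob("*")):
        if not src.is_file():
            continue  # directories are covered by the files inside them
        rel = src.relative_to(source_root)
        dst = backup_root / rel
        if not dst.is_file():
            missing.append(rel.as_posix())
        elif file_digest(src) != file_digest(dst):
            mismatched.append(rel.as_posix())
    return missing, mismatched
```

Calling `compare_trees` with the live data directory and the restored backup directory (both paths are yours to supply) returns two lists, and both should be empty before you trust the backup. The same `file_digest` helper can be run on each drive image from step 4, so that you can later confirm an image has not changed since it was taken.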
If you have reached this point and still do not have a working array, then you must weigh time constraints, user complaints, and management breathing down your neck, and decide whether to spring for a new server and reload or to continue beating your head against the wall of an older server, using older software, running on an older operating system. Data is almost always exportable in a simple comma-delimited format and can then be imported into almost any application. Maybe now is the time to upgrade, and you can use this incident as leverage to pry money from management for a new server.
No matter what you decide, if you have followed the above steps, your data will be relatively safe. It is the seasoned IT professional who can think outside the box and bring the company back online with a minimum of aggravation.
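For readers who want to see why a RAID 5 array keeps running with one drive gone, here is a toy sketch of the XOR property mentioned at the start of this article. It is an illustration under simplifying assumptions, not how any particular controller is implemented: real arrays rotate parity across the drives and use much larger stripe units, and the block contents and the name `xor_blocks` are invented for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must be the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Three hypothetical data blocks that make up one stripe.
data = [b"ACCT", b"PAYR", b"OLL!"]

# The parity block is the XOR of all the data blocks.
parity = xor_blocks(data)

# Simulate the second drive dropping out: XOR the survivors with the parity.
surviving = [data[0], data[2], parity]
rebuilt = xor_blocks(surviving)

assert rebuilt == data[1]  # the lost block is recovered from the survivors
```

Because XORing a value with itself cancels out, combining the surviving blocks with the parity leaves exactly the missing block. Lose a second drive and there is no longer enough information to do this, which is why the degraded state has to be treated as temporary.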