Last week, we got a call from a Dublin physiotherapist`s practice. Their Dell Poweredge server, configured in RAID 5, had failed.
Their I.T support technician identified the problem immediately. However, for him data recovery from a RAID 5 server was unknown territory. For this blog post, here is an abridged version of the RAID recovery process which we used.
For recovery we decided to use Mdadm. It is a powerful linux-based RAID recovery utility. A good knowledge of Linux command line and in-depth experience of this tool are essential prerequisites for it’s operation.
The first step in the recovery process was to deterimine the status of the server’s drives in-situ.
We used the following command on every disk in the array:
mdadm – -examine
We were able to determine that drives /dev/sdc1 and /dev/sdd1 drives had failed (sdc1 being in worse condtion). Mdadm revealed that this RAID 5 had experienced double-disk failure.We then carefully labelled each drive and removed them from the server. Then, using a specialised hardware disk imager – we imaged the disks. This means that we would be working on copies of the disks rather than the original ones. In the unlikely event of the data recovery process being unsuccessful – the original configuration and data, as we received it, would still be intact.
The imaging process completed successfully. We now put the imaged drives into the server. With all the prep work completed. It was now time to take the RAID array “offline”. This can be achieved by using the “mdadm –stop” command. The last thing we wanted was for the RAID rebuilding process to start using a failed disk in bad condition (e.g. /dev/sdc1) To prevent this from happening, we cleared the superblock of this drive using the command:
mdadm –zero-superblock /dev/sdc1
Now using the output we got from mdadm –examine, we used the following command to rebuild the array:
mdadm –verbose –create –metadata=0.90 /dev/md0 –chunk=128 –level=5 – -raid-devices=5 /dev/sdd1 /dev/sde1 missing /dev/sda1 /dev/sdb1
We now had to check whether the array was aligned correctly using the command:
e2fsck -B 4096 -n /dev/md0
Using e2fsck it is always helpful to specify the block size before a scan to get a more accurate status of the array. We also used the -n prefix in case the array was mis-aligned and e2fsck attempted to fix it. ( e2fsck should never be executed on an array that is potentially mis-aligned)
E2fsck completed successfully and correctly identified the status and alignment of the array.
It was now safe to proceed with the repair and fix command
e2fsck -B 4096 /dev/md0
Notice that no “-n” was used this time. The scan took around 5.5 hours to complete. It found over 26 inode errors, hundreds of group errors and some bitmap errors.
Now, it was time to add the first failed drive back into the array. We used the command:
e2fsck -a /dev/sdc1
The RAID array now began to rebuild. After a couple of hours, the RAID 5 was totally re-created, albeit in degraded mode. But the the volume was mountable again and all data was now accessible.
The client had over 4 years of Sage Micropay and Sage 50 accounts files on the server. In addition, they had over 6 years worth of PhysioTools data files. This is a software package which they used to create customised exercise regimes for their patients. Reconstructing accounts and staff payslips would have been very time consuming and costly. For their staff to re-create patient exercise regimes, it would have incurred a huge time burden on them. Moreover, it would probably have been professionally damaging for thier reputation if they had to inform their patients that their customised exercise regimes had been “lost”.
We advised the client on some best-practice back-up strategies so they could prevent data loss in the future. It is deeply satisfying to help a customer like this when the “plan B” option would have been so disruptive for them. They could now get back to helping their patients with minimum downtime to their business.