A couple of weeks ago, an I.T. support administrator for a Dublin finance company called us. He was in a spot of bother. Last week, their RAID array on their HP Proliant server failed. Luckily, they had a complete back-up and no data had been lost. The I.T. admin decided to replace the four Western Digital (WD2500YS) Enterprise S-ATA setup in a RAID 5 configuration. He replaced them instead with four Western Digital Caviar Green disks (WD10EZRX). Using S-ATA to SAS adaptors he connected them to the HP Smart Array controller card.
He re-built the server, re-installed Windows Server 2008. But three days later, the server was down…again. In this short space of time, the RAID array had changed from “normal” to “degraded” status. He ran diagnostics on the disks. All of them passed. He suspected that the Smart controller card was at fault. He had a redudant server in his office using the same model of RAID controller card. He removed it and installed it in the problematic server and, for a second time, did a rebuild of the array. He tested it for a couple of hours. It worked fine. Then, just as he was about to start the data transfer process from the local backup, he rebooted the server . The dreaded “degraded” message appeared on the screen – again.
Being a past customer of Drive Rescue data recovery, the I.T. admin telephoned us for advice about the mysterious problem. After his description of his problem, we had a fairly good inkling as to what the cause might be. But, inklings or assumptions are dangerous. Most the great technological failures of mankind (nuclear power plant explosions, aircraft disasters,etc.) can be traced to someone, somewhere making a wrong assumption. The same applies to the data recovery process. Good data recovery methodology is not based – or never has been – on assumptions. We asked him to email us his server event logs, hard drive model numbers and exact model of RAID controller. After looking at his server logs, the specs of his controller card and model of hard disks used; it was now becoming clearer as to what the root-cause of the problem might be.
He got the disks delivered to us and we tested them using our own equipment. His Western Digital Green disks were indeed perfectly healthy. The problem with a lot of WD Green disks (and other non-entreprise-class disks) is that when they are used in a server or NAS – the RAID controller can erroneously detect them as faulty. The reason for this is quite simple. In some RAID setups, if the controller card detects that the disk is taking too long to respond, it simply drops it out of the array. But, it is normal for error recovery in some non-entreprise / non-NAS classified disks to take approximately 8 seconds to complete. With error recovery control (ERC) enabled on a disk, error recovery is usually limted to 8 seconds or under. (For example, with Hitachi branded disks ERC is limited to 7 seconds) This means the RAID controller will be less likely to report ERC-related false positives.
In this case, the Smart RAID controller, used commonly by HP Proliant servers, was detecting some of these disks as faulty when they were not. The most common type of error recovery control used by Western Digital is TLER (Time Limited Error Recovery). Most WD Caviar Green drives do not have this function. WD Red (NAS disks) and WD entreprise-class disks do have error recovery control.
RAID controllers (especially dedicated hardware controllers like those from LSI, Perc etc.) are very sensitive to read / write time delays. When a hard disk does not use error recovery control , a RAID controller will often report false positives as to the status of the array or, like what happpened in this case, the controller will simply drop the “defective” disk out of the array.
Entreprise-class disks (such as the WD Caviar RE2, RE2-GP) or disks made specifically for NAS devices (WD Red) have error recovery control enabled by default.
In this case, the I.T. admin replaced the WD Caviar Green disks with four 1TB WD RE SAS drives. He then rebuilt the RAID 5 array.
Yesterday, we got a nice email from him. The server has been running smoothly ever since. He has rebooted it a couple of times. The event logs are free from disk errors. He no longer has to worry about the uncertainty of the company’s server continually degrading. He can even sleep more soundly at night.