SMART disk monitoring: no longer fit for purpose in the SSD-era?

Predicting or detecting SSD failure is much harder than predicting HDD failure. A failing HDD can become slow, or cause the computer to freeze or crawl. It can trigger a kernel panic or a blue screen of death on the host system. And in some cases, the user will hear a clicking, grinding, beeping or chirping noise. A failing SSD, however, does few of these things. In fact, failing flash-based storage can be quieter than the proverbial church mouse.

That is worrying because a lot of users are not prepared for sudden-death failure of their disk. At least with an HDD, the user sometimes gets a bit of leeway to perform an emergency backup. Your SSD could fail in the morning without giving even a peep of warning. SSD manufacturers have brought over a legacy technology called SMART (Self-Monitoring, Analysis and Reporting Technology) to monitor and help predict failure. Originating with IBM and designed primarily for ATA and SCSI disks, it monitors disk parameters such as the Read Error Rate, Reallocated Sectors Count, Power-On Hours, Temperature and Uncorrectable Error Count. And for the SSD-era, parameters such as flash program fail, wear level count and wear-out indicator have been added to the SMART attribute set. But even taking these newly bolted-on features into account, SMART is still an old technology designed for electro-mechanical disks.
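To make the attribute set a little more concrete, here is a minimal Python sketch that lists a few attributes from a report shaped like the JSON that smartmontools' smartctl produces with its --json option. The sample data and the list_attributes helper are assumptions made up for illustration; attribute names and numbering vary from one manufacturer to the next.

```python
import json

# Hypothetical sample shaped like the "ata_smart_attributes" section of
# `smartctl --json -A` output. The values are invented for illustration.
SAMPLE = """
{
  "ata_smart_attributes": {
    "table": [
      {"id": 5, "name": "Reallocated_Sector_Ct", "value": 100, "worst": 100,
       "thresh": 10, "raw": {"value": 0}},
      {"id": 9, "name": "Power_On_Hours", "value": 97, "worst": 97,
       "thresh": 0, "raw": {"value": 14210}},
      {"id": 177, "name": "Wear_Leveling_Count", "value": 83, "worst": 83,
       "thresh": 0, "raw": {"value": 312}}
    ]
  }
}
"""


def list_attributes(report):
    """Return (id, name, normalised value, threshold, raw value) tuples."""
    table = report.get("ata_smart_attributes", {}).get("table", [])
    return [
        (a["id"], a["name"], a["value"], a["thresh"], a["raw"]["value"])
        for a in table
    ]


attributes = list_attributes(json.loads(SAMPLE))
for attr_id, name, value, thresh, raw in attributes:
    print(f"{attr_id:>3} {name:<24} value={value:<4} thresh={thresh:<3} raw={raw}")
```

Note that each attribute carries both a vendor-normalised value and a raw value. How the raw figure maps to the normalised one is up to the manufacturer.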

How Accurate is SMART?

SSDs are first and foremost electronic devices, and SMART does not take into account failure or impending failure of electronic components. A failing DRAM chip? A problem with write amplification? A problem with the LBA mapping tables? SMART, alas, does not have you covered. SMART will continue to merrily push out disk attributes, some of which have little relevance to the operation of a modern SSD.

While power-up and power-down events are recorded, SMART gives us no information as to whether these power events were clean or dirty. An SSD could fail with its DRAM cache full to the brim just before a data-corrupting power event, but SMART will be blissfully unaware of it.

SMART is a very siloed tool. It takes into account individual disk performance parameters but does not view them holistically.

SMART is not standardised. While the NVM Express working group is endeavouring to change this, SMART has been implemented by SSD manufacturers on a non-standardised basis. This means that a sector reallocation event on a Samsung Evo SSD might be defined totally differently on a SanDisk Plus SSD.

And because SMART has been implemented by manufacturers on their own terms, it has invariably been driven by a commercial imperative. Let’s face it, manufacturers do not want a deluge of RMA’ed SSDs being sent back to them at the slightest hint of malfunction. Therefore, most manufacturers have set the bar for tripping a SMART failure high.
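As a rough illustration of how such a threshold works, the sketch below applies the conventional ATA rule used by tools such as smartctl: an attribute only counts as failed once its normalised value drops to or below the vendor-set threshold, and a threshold of 0 means it can never trip. The attribute_status helper and the example figures are hypothetical.

```python
def attribute_status(value, thresh):
    """Classify one attribute by its normalised value and vendor threshold."""
    if thresh == 0:
        # A zero threshold means the attribute is informational only.
        return "never trips"
    if value <= thresh:
        return "FAILED"
    return f"passing, {value - thresh} points above threshold"


# Hypothetical figures: a drive whose normalised value has fallen from 100
# to 40 still passes comfortably if the vendor set the threshold at 10.
print(attribute_status(40, 10))
print(attribute_status(10, 10))
print(attribute_status(83, 0))
```

The first call reports a pass even though the attribute has lost more than half of its starting value, which is exactly how a worn drive can keep passing a SMART check.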

Why SMART is a problem for end-users, computer technicians and system administrators

SMART provides a false sense of security to users. They might have an SSD which is on its last legs, but it will pass a SMART test. Here at Drive Rescue, we’ve seen this sort of scenario play out countless times.

The problem of SMART and third-party SSD Diagnostic Tools

Most SSD diagnostic tools such as CrystalDiskInfo and SNMP monitoring tools like PRTG rely on SMART information to perform their tests. While these tools can be extremely useful, they can also provide inaccurate information. This is because many SSD manufacturers have designed their disks’ firmware so that its telemetry cannot be fully interrogated by third-party tools. These tools sometimes only scratch the surface of what is really going on inside your SSD.

The Solution

Perform regular backups of your important data. Throw away any notions that SSDs don’t fail or that you’re going to get some warning. Sometimes SSDs fail out of the blue. Backup strategies such as performing 3-2-1 backups are as relevant with SSDs as they were even with the creakiest spinning disks.
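For anyone who wants to sanity-check their own setup, the 3-2-1 rule (at least three copies of the data, on at least two different types of media, with at least one copy offsite) can be expressed in a few lines of Python. This is only a sketch: the Copy class, the satisfies_3_2_1 function and the example plan are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    label: str     # human-readable name for this copy
    medium: str    # e.g. "ssd", "hdd", "cloud", "tape"
    offsite: bool  # True if stored away from the primary location


def satisfies_3_2_1(copies):
    """Check for 3+ copies, 2+ media types and 1+ offsite copy."""
    media = {c.medium for c in copies}
    has_offsite = any(c.offsite for c in copies)
    return len(copies) >= 3 and len(media) >= 2 and has_offsite


# Hypothetical plan: the working SSD, a local external HDD and a cloud copy.
plan = [
    Copy("laptop SSD", "ssd", offsite=False),
    Copy("external drive", "hdd", offsite=False),
    Copy("cloud backup", "cloud", offsite=True),
]
print(satisfies_3_2_1(plan))
```

Dropping the cloud copy from the plan makes the check fail on two counts: only two copies remain, and neither is offsite.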

Try to use manufacturer-based tools for diagnosing SSD problems, for example Samsung Magician for Samsung SSDs or Crucial Storage Executive for Crucial SSDs. These tools tend to be slightly more accurate because they are typically allowed more privileged access to your disk’s telemetry data.

Unbelievably, some SSD manufacturers still don’t provide diagnostic tools for their disks. If this is the case, you can use an SSD diagnostic tool like Smart Disk Checker. This will not only read the SMART logs of your disk but will also perform a time-sensitive sector analysis of your disk. This can give you a much better picture of your SSD’s health. This tool is also bootable from USB, meaning you don’t have to remove the HDD or SSD from the system.
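The general idea behind a timing-based read test can be sketched in Python as follows. This is not how Smart Disk Checker itself is implemented, which we have no visibility into. It simply times each fixed-size read from a stream and flags the slow ones. The function names, the 4096-byte chunk size and the latency limit are assumptions, and a real tool would read the raw device while bypassing the operating system's cache.

```python
import io
import time


def time_reads(stream, chunk_size=4096):
    """Read a binary stream chunk by chunk, returning (offset, seconds) pairs."""
    timings = []
    offset = 0
    while True:
        start = time.perf_counter()
        chunk = stream.read(chunk_size)
        elapsed = time.perf_counter() - start
        if not chunk:
            break  # end of stream
        timings.append((offset, elapsed))
        offset += len(chunk)
    return timings


def slow_chunks(timings, limit_s):
    """Return the (offset, seconds) pairs that took longer than limit_s."""
    return [(off, t) for off, t in timings if t > limit_s]


# Demonstration on an in-memory buffer rather than a real disk.
demo = time_reads(io.BytesIO(bytes(10000)), chunk_size=4096)
print(len(demo), "chunks timed;", len(slow_chunks(demo, 0.5)), "slower than 0.5 s")
```

On a healthy disk, read times cluster tightly. Regions that repeatedly take far longer to read than their neighbours are the ones worth worrying about, even when the SMART attributes look clean.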

Drive Rescue, Dublin, Ireland, offers a complete data recovery service for inaccessible SATA and M.2 NVMe SSDs. Common SSDs we recover from include models such as Lenovo MZ-VKV5120, Toshiba THNSFJ256GDNU, THNSN5512GPUK, Samsung MZ-NLN5120, MZ-VLB5120 and MZ-VLB2560, WD SN520, WD SN550 and SanDisk X400.