What the mistakes of Tesla can teach us about data storage


At the recent Embedded World 2023 conference and exhibition, the “Great Tesla Recall” of 2021 was still being talked about by speakers, exhibitors, and attendees. Given that Tesla is such a high-profile company, that’s probably no surprise. Now you might be wondering, “What was this recall all about?” Well, most electric vehicles now have their own digital storage device installed. This is essentially the car’s hard drive. Tesla installed an 8GB eMMC module in its Model S and Model X cars. The module failed prematurely, prompting the manufacturer to initiate a massive recall.

What went wrong with the hard drive installed by Tesla?

Not surprisingly, car manufacturers don’t use electro-mechanical hard disk drives (HDDs) but instead use NAND-based storage. This is the same type of storage you would find in an SSD. Tesla used an 8GB eMMC NAND module in the MCU (Media Control Unit) of these models. However, this NAND module wore out a lot sooner than anticipated. This resulted in thousands of owners reporting problems with in-car systems such as the touchscreen display, the autopilot, the window demisting system, and even the turning indicator lights. Just like a computer with a failing storage device, the hardware and applications that depend on it start doing some pretty weird things.

Why didn’t Tesla give dealers or owners a new disk that they could slot in to replace the failing one?

Now this is where it gets complicated. Tesla designed the Media Control Unit with an embedded storage device known as an eMMC. Unlike an M.2 SSD, SD card, or CFexpress card, it can’t just be slotted out. Embedded NAND is soldered to the board and usually has to be micro-desoldered.

What can IT technicians learn from Tesla’s hard disk disaster?

Let’s start off with Tesla’s calculation of disk wear-out. Tesla used a TLC NAND module from the Korean manufacturer SK Hynix. (For the record, SK Hynix are a well-respected player in the NAND flash and DRAM market.) This type of NAND is rated for approximately 3,000 Program / Erase (P/E) cycles. Tesla did the maths. They calculated that this NAND module would provide a useful life of 11-12 years before it wore out. But that did not happen. For some Tesla users, the problem started to manifest itself after just two years. The miscalculation occurred because some onboard devices needed to access the firmware modules way more than expected, and some executed data logging more than expected. (Data logging can add a huge write overhead to NAND-based storage due to write-amplification effects.) To compound the issue, these same firmware modules had to be updated more often than expected. So, it’s small wonder that this little but constantly accessed 8GB NAND module got exhausted and died.
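To see how sensitive that kind of estimate is, here is a minimal Python sketch of the arithmetic. The 8GB capacity and 3,000 P/E cycles come from the figures above. The daily write volumes and write-amplification factors are purely illustrative assumptions, not Tesla’s actual numbers, and the function name is hypothetical.

```python
def estimated_lifetime_years(capacity_gb, pe_cycles,
                             host_writes_gb_per_day,
                             write_amplification=1.0):
    """Rough NAND lifetime: total write budget divided by daily NAND writes.

    Assumes perfect wear-levelling, which real devices only approximate.
    """
    if host_writes_gb_per_day <= 0 or write_amplification <= 0:
        raise ValueError("write volume and write amplification must be positive")
    # Total data the NAND can absorb before its P/E budget is spent.
    write_budget_gb = capacity_gb * pe_cycles
    # What the NAND actually sees once write amplification is included.
    nand_writes_gb_per_day = host_writes_gb_per_day * write_amplification
    return write_budget_gb / nand_writes_gb_per_day / 365.0


# Hypothetical "design" workload: 2 GB/day of host writes, amplification of 3.
design = estimated_lifetime_years(8, 3000, 2, 3)   # roughly 11 years
# Hypothetical "real" workload: heavier logging, 8 GB/day, amplification of 4.
actual = estimated_lifetime_years(8, 3000, 8, 4)   # roughly 2 years

print(f"design estimate: {design:.1f} years, heavier workload: {actual:.1f} years")
```

With these made-up inputs, a modest increase in logging and write amplification is enough to shrink an 11-year estimate to about two years, which is the shape of the problem described above.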

Compared to MLC and TLC NAND, QLC NAND does not have great endurance…

Matching the right storage choice with the use case

Anybody who performs advanced data recovery on NAND-based devices (such as SATA or M.2 NVMe SSDs) could have told Tesla this. On the NAND itself, the area that stores the disk’s own firmware modules is typically subjected to more repeated sequential reads than any other area. And because this area is not usually covered by wear-levelling algorithms or ECC, it tends to wear out even quicker. Maybe Tesla should have used a pSLC partition specifically for their device firmware modules. Or maybe Tesla should have used a separate NAND IC specifically to store the firmware for their cars’ vital functions.

Choosing the right type of storage device to fit the use case is important. Let’s say you support a video editing business with a moderate to heavy workflow. Installing QLC NAND-based SSDs inside their systems would probably not be a very wise decision. Likewise, fitting QLC or TLC-based storage in a CCTV system’s NVR box, which overwrites its ring buffer continuously, could be asking for disaster.
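To put a rough number on the NVR example, the sketch below estimates how many times a day a continuously recording ring buffer overwrites the whole drive. The camera count, bitrate, drive size, and the 1,000-cycle P/E figure are illustrative assumptions only, the function names are hypothetical, and write amplification is ignored, so real wear would likely be worse.

```python
def drive_writes_per_day(cameras, bitrate_mbps, capacity_gb):
    """Full-drive overwrites per day for continuous recording (decimal units)."""
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    mb_per_second = cameras * bitrate_mbps / 8    # megabits -> megabytes
    gb_per_day = mb_per_second * 86_400 / 1000    # seconds per day, MB -> GB
    return gb_per_day / capacity_gb


def days_until_pe_budget_spent(pe_cycles, writes_per_day):
    """Days until the assumed P/E budget is used up."""
    if writes_per_day <= 0:
        raise ValueError("writes_per_day must be positive")
    return pe_cycles / writes_per_day


# Hypothetical NVR: 16 cameras at 8 Mbit/s each, recording to a 1000 GB SSD.
dwpd = drive_writes_per_day(cameras=16, bitrate_mbps=8, capacity_gb=1000)
days = days_until_pe_budget_spent(pe_cycles=1000, writes_per_day=dwpd)

print(f"{dwpd:.2f} drive writes per day, about {days:.0f} days of P/E budget")
```

Under these assumptions the drive is rewritten more than once a day, so a low-endurance part would burn through its budget in roughly two years.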

Don’t skimp on storage

Let’s be frank here: Tesla used a paltry 8GB NAND storage chip to serve their cars’ control and data systems, when in fact this storage should have been of a much larger capacity. A Tomy toy car would probably use a bigger chip… What a lot of people forget about NAND-based storage such as SSDs and SD cards is that, as a rule of thumb, you should always keep at least 20-30% of their space free. This is because disk housekeeping functions such as TRIM, wear-levelling, and garbage collection need a bit of free space to perform optimally. This is especially true if your storage device does not natively use over-provisioning.

To use a real-life example: let’s say you’re assisting a user tomorrow who needs a new SSD in their laptop. On their existing disk they’ve already used 400GB. Migrating them to a new 512GB SSD is probably not going to be suitable, because in a few months’ time that disk usage could be well up to 450GB, and then all those essential SSD housekeeping functions are going to struggle. This increases the probability of events like uncorrectable bit errors, partition damage, or, as with Tesla, exhausted NAND. It is interesting to note that when Tesla revised their chip choice, they chose a 64GB NAND module.
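The laptop example can be checked with a few lines of Python. This is only a sketch of the rule of thumb above: the 25% free-space target is simply the midpoint of the 20-30% range, the 50GB growth figure mirrors the 400GB to 450GB scenario, and the function names are hypothetical. It also ignores the difference between advertised and formatted capacity.

```python
def free_fraction(capacity_gb, used_gb):
    """Share of the drive that is still free, as a value between 0 and 1."""
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    return (capacity_gb - used_gb) / capacity_gb


def min_capacity_gb(used_gb, target_free=0.25, expected_growth_gb=0.0):
    """Smallest capacity that still leaves target_free of the drive empty."""
    if not 0 <= target_free < 1:
        raise ValueError("target_free must be in [0, 1)")
    return (used_gb + expected_growth_gb) / (1.0 - target_free)


# The 512GB drive from the example, today and after a few months of growth.
print(f"free today:  {free_fraction(512, 400):.1%}")
print(f"free later:  {free_fraction(512, 450):.1%}")

# Capacity needed to keep a quarter of the drive free at 450GB used.
print(f"minimum capacity: {min_capacity_gb(400, 0.25, 50):.0f} GB")
```

On these figures the 512GB drive starts at the bottom of the recommended range and soon drops well below it, so the next common size up would be the safer choice.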

The benefits of removable storage

There are other things we can learn from this Tesla saga as well. Because Tesla used embedded eMMC NAND, the recall and repair process was very difficult. In a lot of cases Tesla had to replace the whole daughterboard housing the MCU. In the context of data storage devices, removability allows for serviceability. Devices that have removable storage can, in general, be serviced more quickly than those using embedded storage.

Let’s say you’re the IT manager at an engineering works and you have a problem with a CNC machine. The electronic control unit of the machine might use embedded NAND to store its PLC data. Now let’s say the machine starts to develop programming problems. In the absence of any serial port, wireless, or remote access, embedded NAND might prove problematic, especially if the machine’s nearest service centre is in Sweden or Switzerland. On the other hand, a machine that stores all its PLC information on a removable SD or CFexpress card allows for a much easier troubleshooting option. It’s easier to send an SD card to Malmö than a 100kg machine.

A problem that many technicians working with edge or IoT computing devices are noticing is that 4G or 5G coverage, or Bluetooth connectivity, is not always a given. Without a reliable connection, over-the-air (OTA) device updates can be impossible. However, using a removable storage device like a microSD card means updated firmware can be applied in a much more flexible way.

So, even though the world of digital storage might be changing quickly, it all comes back to the brass tacks of matching the right storage device to the right use case. Oh, and factoring in failure as a given…

In operation since 2007, Drive Rescue (Dublin, Ireland) offer a complete data recovery service for SD and CFexpress cards (SanDisk, Adata, Samsung Evo, Nextbase, Integral, and Kingston) along with our SSD data recovery service (Samsung Evo, Samsung NVMe, WD M.2 NVMe, SK Hynix, and Crucial). Outstanding success rates and competitive pricing.