
DATA-DRIVEN RELIABILITY FOR DATACENTER HARD DISK DRIVES

Alan Yang,1 AmirEmad Ghassami,1 Elyse Rosenbaum,1 and Negar Kiyavash2
1 Dept. of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign
2 Dept. of Electrical and Computer Engineering, Georgia Institute of Technology
asyang2@illinois.edu

EDFAAO (2019) 2:16-20   1537-0755/$19.00 ©ASM International

INTRODUCTION

Reliability is of critical concern in hardware design. In general, system designers attempt to mitigate the negative effects of component-level failures by introducing redundancy and by conservatively limiting system lifetime specifications. There is a clear trade-off between reliability and system cost. At the same time, telemetry and data storage technologies have enabled diagnostic sensor arrays to be embedded within hardware systems. System designers collect large volumes of sensor data streamed from operational devices in the field; those data are typically used to drive design choices and assist in field repairs. An intelligent failure prediction system would allow for a new reliability paradigm that relies more heavily on data analytics than on expensive redundancy or early retirement of hardware. This article explores the possibility of new data-driven approaches to hardware reliability through a case study of the workhorse of datacenter storage, the hard disk drive (HDD).

For datacenters, data integrity is of utmost importance. It would be unacceptable for information stored in the cloud (the reader's email, for example) to be lost due to a failure of the HDD storing that data. Indeed, HDDs are prone to failure because they rely on mechanical moving parts. In the field, annual disk replacement rates in datacenters typically range from 2-4% but have been found to be as high as 13%.[1] Solid-state drives (SSDs) tend to be more reliable because they do not have moving parts, but they can be four to 40 times costlier (per gigabyte of storage) than HDDs. Further, SSDs are not as well established as HDDs as a staple technology for large-scale storage applications.[1]

In general, the problem of constructing reliable systems out of unreliable components is solved by using redundancy schemes. Datacenters utilize RAID (redundant array of independent disks) configurations to ensure that no single drive failure leads to loss of information. The data stored on any drive can always be recovered, either by copying it directly from a backup or by reconstructing it from parity information. In fact, similar redundancy is built into individual storage drives. Reading from and writing to storage media are inherently noisy operations, and individual memory cells are prone to failure. Like the datacenters they are housed in, HDDs use redundancy to ensure reliability, in the form of error correcting codes (ECC) and spare memory cells.[2]
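To make the recovery-from-parity idea concrete, the following minimal sketch (Python, with invented block contents; not drawn from the article or from any particular RAID level or drive-internal ECC) shows single-parity redundancy: one XOR parity block lets any one lost data block be rebuilt exactly from the surviving blocks.

```python
# Minimal single-parity sketch (RAID-style): the parity block is the bitwise
# XOR of the data blocks, so any one missing block can be rebuilt from the
# survivors plus the parity. Real arrays and on-drive ECC use stronger codes,
# but the recovery principle is the same. Block contents here are invented.
from functools import reduce

def xor_blocks(blocks):
    """Bitwise XOR of equal-length byte strings."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

data = [b"mail-000", b"mail-001", b"mail-002"]  # blocks striped across three drives
parity = xor_blocks(data)                       # stored on a fourth drive

# The drive holding data[1] fails; XOR the survivors with the parity to rebuild it.
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == data[1]
```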
How much redundancy is needed, and how robust do systems and components need to be? Manufacturers usually aim to minimize costs while meeting a required lifetime specification, subject to standard or worst-case field usage conditions. Lifetime projections for HDDs are typically established using accelerated life tests during the design phase. Sample drives may be stress-tested with high read/write workloads, high operating temperatures, and other challenging worst-case conditions until failure. These failure-time samples are used to fit lifetime distributions (e.g., Weibull), which are then extrapolated to model lifetime under typical use conditions using physics-based models.[2]
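As a rough illustration of that workflow (a sketch only, not the article's actual test procedure): invented stress-test failure times are fit to a two-parameter Weibull distribution with SciPy, and an assumed Arrhenius acceleration factor, using a hypothetical activation energy and temperature pair, stretches the fitted scale parameter to project lifetime under field conditions.

```python
# Illustrative sketch only: fit a Weibull lifetime model to accelerated-test
# failure times, then extrapolate to field conditions with an Arrhenius
# acceleration factor. All numbers below are invented for illustration.
import numpy as np
from scipy import stats

# Hypothetical failure times (hours) from a high-temperature stress test.
stress_failures_h = np.array([1800., 2400., 2900., 3300., 3700.,
                              4100., 4600., 5200., 5900., 7100.])

# Two-parameter Weibull fit (location fixed at zero).
shape, _, scale_stress = stats.weibull_min.fit(stress_failures_h, floc=0)

# Arrhenius acceleration factor between stress and field temperatures.
Ea_eV = 0.5                          # assumed activation energy, eV
k_B = 8.617e-5                       # Boltzmann constant, eV/K
T_stress, T_field = 338.15, 313.15   # 65 C stress vs. 40 C field, in kelvin
AF = np.exp((Ea_eV / k_B) * (1.0 / T_field - 1.0 / T_stress))

# Simple time-scaling assumption: field life stretches the scale parameter by AF.
field_life = stats.weibull_min(c=shape, scale=scale_stress * AF)

print(f"shape={shape:.2f}, stress scale={scale_stress:.0f} h, AF={AF:.1f}")
print(f"projected field B10 life: {field_life.ppf(0.10):.0f} h")  # 10% failure point
```

In practice, the choice of acceleration model and its parameters is drive- and failure-mechanism-specific; the time-scaling step above is only one common assumption.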
Unfortunately, designing for worst-case conditions requires sacrificing performance and increasing cost, options that are becoming increasingly untenable. Intuitively, systems should instead be designed to meet reliability specifications under normal use conditions. However, real-world usage varies between applications and is difficult to profile; this affects systems and components at all levels. Real-time monitoring and modeling of system health is necessary to account for usage variability and to enable the preemptive replacement of soon-to-fail