May_EDFA_Digital

edfas.org 17 ELECTRONIC DEVICE FAILURE ANALYSIS | VOLUME 21 NO. 2

components, thereby meeting reliability specifications with minimal cost. To this end, how can diagnostic sensor data be used to predict component failures?

Modern HDDs are equipped with SMART (self-monitoring, analysis, and reporting technology) sensors that record internal temperature, read error rates, counts of error correction code (ECC) invocations, and other health-related information. Some datacenters have made their drives' SMART records publicly available, and as a result, there is growing interest in applying techniques in machine learning and data analytics to predict HDD failures. [3,4]

Causal inference and probabilistic modeling provide a framework for learning the underlying structure present in multichannel sensor data such as SMART. Such models inform more principled and interpretable failure predictors and provide feedback to component designers, potentially inspiring future design practices. Promising results are presented in the following sections and new directions of inquiry are proposed.

HDD DIAGNOSTICS USING SMART DATA

SMART is a diagnostic monitoring system used by HDDs. At any time, a drive can be polled for its SMART attributes, which can be used for diagnosis. A few examples of SMART attributes for HDDs include temperature sensors, power-on hour counts, and various operational error rates as shown in Table 1.

SMART data are reported as model-specific raw values and unitless normalized values, which are effectively the raw values quantized to integers between 1 and 253. In most cases, smaller values are worse, with 1 representing the worst-case value. One notable exception to this rule is temperature, whose normalized value is reported in degrees Celsius. Normalized values are reported to make SMART readings interpretable between models because many manufacturers do not format their raw values as numerical values.
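The raw/normalized distinction described above can be illustrated with a minimal Python sketch. The container `SmartReading`, the helpers `is_valid` and `is_degraded`, the cutoff of 50, and the sample values are all hypothetical and chosen only for illustration; they do not come from any vendor specification or from the Backblaze data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmartReading:
    """One polled SMART attribute (hypothetical container, not a vendor format)."""
    name: str
    raw: str         # model-specific and not always numeric, so kept as text
    normalized: int  # unitless integer, 1..253; smaller is worse in most cases


def is_valid(reading: SmartReading) -> bool:
    """Check that the normalized value lies in the 1..253 range."""
    return 1 <= reading.normalized <= 253


def is_degraded(reading: SmartReading, cutoff: int = 50) -> bool:
    """Flag a reading at or below an illustrative cutoff (assumed, not a standard).

    Temperature is skipped because its normalized value is a reading in
    degrees Celsius rather than a health score.
    """
    if reading.name == "Temperature":
        return False
    return reading.normalized <= cutoff


# Made-up sample readings for demonstration only.
readings = [
    SmartReading("Reported Uncorrectable Errors", raw="0x1f", normalized=12),
    SmartReading("Seek Error Rate", raw="0x0", normalized=200),
    SmartReading("Temperature", raw="31", normalized=31),
]
flagged = [r.name for r in readings if is_valid(r) and is_degraded(r)]
print(flagged)
```

Keeping the raw value as text reflects the point that raw values are not guaranteed to be numeric, while the normalized value is the field that can be compared across drive models.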
Two examples of normalized SMART value trajectories are shown in Fig. 1.

As noted earlier, datacenter HDDs make for a good first case study because some SMART data are made publicly available. For example, the cloud storage company Backblaze collects and stores SMART attributes of the drives in its datacenter once a day. Up to 40 SMART attributes have been reported for tens of thousands of drives since 2013 with the data published on their website. [5]

PREDICTING HDD FIELD FAILURES

Here, a machine learning framework for predicting field failures from SMART data is established. Consider a dataset D of M SMART attributes collected from N drives. Each drive, indexed by i, is sampled regularly (e.g., once a day), for T_i time slots. It is assumed that data collection is not synchronous; that is, the calendar date and time when two different drives started collecting data may not be the same. This can be formally written as:

Fig. 1 Example time series for two normalized SMART attributes from a single HDD in the Backblaze datacenter. [5] The drive failed at the end, after almost two years in the field. Lower normalized values for reported uncorrectable errors indicate more errors; this drive saw a rise in these errors in the last hundred days before failing.

Table 1 Examples of diagnostic SMART attributes for HDD

Examples of SMART Attributes for HDD
Read Error Rate
Seek Error Rate
Reallocated Sector Count
Reported Uncorrectable Errors
Power-on Hours
Temperature
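The dataset setup described in the text (N drives, M attributes, T_i time slots per drive, and no shared start date) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the record layout, the names `RECORDS`, `ATTRIBUTES`, and `build_dataset`, and all sample values are hypothetical, and the sketch assumes one sample per drive per day with no gaps.

```python
from collections import defaultdict
from datetime import date

# The M attributes tracked for every drive (here M = 2, chosen for illustration).
ATTRIBUTES = ["Reported Uncorrectable Errors", "Temperature"]

# Made-up daily records: (drive serial, calendar date, normalized values).
RECORDS = [
    ("drive-B", date(2016, 6, 11), {"Reported Uncorrectable Errors": 98, "Temperature": 29}),
    ("drive-A", date(2015, 1, 1), {"Reported Uncorrectable Errors": 100, "Temperature": 30}),
    ("drive-B", date(2016, 6, 10), {"Reported Uncorrectable Errors": 100, "Temperature": 28}),
    ("drive-A", date(2015, 1, 2), {"Reported Uncorrectable Errors": 100, "Temperature": 31}),
    ("drive-A", date(2015, 1, 3), {"Reported Uncorrectable Errors": 97, "Temperature": 31}),
]


def build_dataset(records, attributes):
    """Group records by drive and return {serial: list of T_i rows of M values}.

    Rows are sorted chronologically within each drive, and the calendar date
    is then dropped, so row index t is a per-drive time slot. Two drives that
    entered service on different dates both start at t = 0.
    """
    by_drive = defaultdict(list)
    for serial, day, values in records:
        by_drive[serial].append((day, [values[name] for name in attributes]))

    dataset = {}
    for serial, rows in by_drive.items():
        rows.sort(key=lambda row: row[0])  # chronological order within a drive
        dataset[serial] = [vector for _, vector in rows]
    return dataset


dataset = build_dataset(RECORDS, ATTRIBUTES)
for serial in sorted(dataset):
    print(serial, "T_i =", len(dataset[serial]))
```

Indexing each drive by its own time slot rather than by calendar date is one simple way to reflect the asynchronous collection assumption; the per-drive sequences naturally have different lengths T_i.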