
edfas.org 19 ELECTRONIC DEVICE FAILURE ANALYSIS | VOLUME 21 NO. 2

Reducing the number of attributes needed reduces the amount of data that must be stored and processed. While the number of features is typically not very large (<100) for SMART data, this is often not the case for similar applications, e.g., industrial big data processing and financial modeling. In these cases, feature selection may determine the success or failure of large-scale machine learning models.

Let us now explore the issue of finding subsets of useful sensor channels. Traditional feature selection methods for this problem fall under filter methods and wrapper methods.[6] Filter methods score each feature independently by a measured dependency, typically its correlation with the prediction target. However, these methods do not consider the structure of the data and, as such, may not be able to eliminate all redundant features. Wrapper methods try various subsets of features, training a predictor for each; the smallest feature subset that provides acceptable performance is used. However, wrapper methods depend on the choice of predictor model and can be computationally expensive.

Is it instead possible to develop a computationally efficient feature selection method that accounts for the structure within the data? Conceptually, each SMART attribute, as well as the health status, represents a node within a directed acyclic graph (DAG), which in turn represents a hierarchy of causal relationships between variables. While the structure learning problem involves inferring which edges are present in the entire graph, feature selection is much simpler; it is only necessary to find the immediate neighbors of the target variable. Notice that for the failure prediction problem, the binary variable Y representing the failed/not failed status of an HDD has no "children" in the DAG over the variables.
In other words, it can be assumed that SMART attributes can only reflect causes of failure and not the other way around. In this framework, the feature selection problem becomes a search for the direct parents of Y.

Suppose that a dataset contains M SMART attributes, represented as X_1, X_2, ..., X_M. For each X_i, how can it be determined whether the link X_i → Y is present? One common method attempts to iteratively rule out possible connections by performing conditional independence tests. If Y and X_i are independent, that is, Y ⊥ X_i, then the causal link X_i → Y may be ruled out. Similarly, the link may be ruled out if Y and X_i are conditionally independent given the other variables in the graph (Y ⊥ X_i | {X_j : j ≠ i}). As an example, the DAG X_i → X_j → Y implies the conditional independence relation Y ⊥ X_i | X_j. Here, X_i is a cause of Y only through the sole mediator X_j, so if X_j is known, X_i gives no new information about Y and may therefore be discarded. The feature selection problem now boils down to identifying a minimal subset of SMART attributes S such that Y ⊥ X_k | S for each X_k not in S. Figure 2 illustrates these concepts.

This probabilistic framework is useful because it makes no assumptions about the functional form of any causal relationship X_i → Y; it merely attempts to determine whether the relationship exists. Machine learning algorithms such as neural networks or SVMs may then be trained to learn the (possibly complicated) mapping p(Y | X_i) needed to predict Y using X_i. Note that this framework, which relies on general conditional independence testing, is more powerful than only measuring pairwise correlations, a measure capable of identifying only linear relationships. Further, in the case of X_i → X_j → Y above, X_i and X_j may both be correlated with Y, yet X_i may be eliminated as a predictive feature given that X_j is a selected feature.
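The chain X_i → X_j → Y can be checked numerically with a small simulation. The sketch below is illustrative only and is not taken from the article: it assumes binary variables, uses a plug-in estimate of (conditional) mutual information as a stand-in for a formal conditional independence test, and the helper names (`mutual_information`, `conditional_mutual_information`, `noisy_copy`), the flip probability of 0.1, and the sample size are all hypothetical choices.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats for two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    # (c/n) * log( (c/n) / ((px/n) * (py/n)) ) simplifies to the form below.
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def conditional_mutual_information(xs, ys, zs):
    """Plug-in estimate of I(X; Y | Z): weighted average of I(X; Y) within each value of Z."""
    n = len(zs)
    strata = {}
    for x, y, z in zip(xs, ys, zs):
        stratum = strata.setdefault(z, ([], []))
        stratum[0].append(x)
        stratum[1].append(y)
    return sum(
        (len(sx) / n) * mutual_information(sx, sy)
        for sx, sy in strata.values()
    )


def noisy_copy(bit, flip_prob, rng):
    """Return the bit, flipped with probability flip_prob (hypothetical noise model)."""
    return bit ^ 1 if rng.random() < flip_prob else bit


# Simulate the chain X_i -> X_j -> Y with binary variables (assumed toy data).
rng = random.Random(0)
n = 20000
x_i = [rng.randint(0, 1) for _ in range(n)]
x_j = [noisy_copy(b, 0.1, rng) for b in x_i]
y = [noisy_copy(b, 0.1, rng) for b in x_j]

mi_marginal = mutual_information(x_i, y)                      # dependence of Y on X_i alone
mi_conditional = conditional_mutual_information(x_i, y, x_j)  # dependence once X_j is known

print(f"I(Y; X_i)       = {mi_marginal:.4f} nats")
print(f"I(Y; X_i | X_j) = {mi_conditional:.4f} nats")
```

Under these assumptions, the marginal estimate should come out clearly positive, while the conditional estimate should be close to zero. This mirrors the statement that X_i may be correlated with Y yet carries no new information once X_j is known. A practical test would replace the visual comparison with a significance procedure and would need to handle continuous-valued SMART attributes, which this sketch does not attempt.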
CONCLUSION

This article explores the possibility of evaluating reliability in a way that depends more heavily on diagnostic data and analytics and less on built-in redundancy, with datacenter HDDs as an example application. The data-driven approach shows promise, based on successful recent studies of failure prediction from SMART sensor data for HDDs. Feature selection for SMART and other multichannel time-series sensor data is proposed as a promising future direction of inquiry. Learning the structure(s) present in data potentially allows for more generalizable and robust machine learning models. More importantly, it gives system and component designers a better understanding of usage and failure conditions in the field, which aids the design process.

Fig. 2 An example of a DAG for the failure prediction problem. Suppose that X_1, ..., X_5 represent five different SMART attributes, and Y represents drive health (failed/not failed). The graph implies that Y is directly caused by X_1, X_3, and X_5. Since Y ⊥ X_2 | X_3 and Y ⊥ X_4 | X_5, no predictor using these three variables can be improved by adding X_2 or X_4 as inputs.
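The situation in Fig. 2 can be reproduced on synthetic data to show how a search for the parents of Y might proceed. The following is a minimal sketch under stated assumptions, not the authors' method. The edge directions X_2 → X_3 and X_4 → X_5 are assumed (the caption only fixes the conditional independencies), and the majority-vote failure rule, the noise levels, the threshold of 0.005 nats, and the function names `cmi` and `select_parents` are hypothetical. The search itself is a simple greedy backward elimination that drops X_k whenever its estimated conditional mutual information with Y, given the remaining candidates, falls below the threshold.

```python
import math
import random
from collections import Counter, defaultdict


def cmi(x, y, cond_cols):
    """Plug-in estimate (nats) of I(X; Y | Z), with Z the tuple of columns in cond_cols.

    With an empty cond_cols this reduces to the ordinary mutual information I(X; Y).
    """
    n = len(x)
    strata = defaultdict(list)
    for row in range(n):
        z = tuple(col[row] for col in cond_cols)
        strata[z].append((x[row], y[row]))
    total = 0.0
    for pairs in strata.values():
        m = len(pairs)
        joint = Counter(pairs)
        px = Counter(a for a, _ in pairs)
        py = Counter(b for _, b in pairs)
        mi = sum(
            (c / m) * math.log(c * m / (px[a] * py[b]))
            for (a, b), c in joint.items()
        )
        total += (m / n) * mi
    return total


def select_parents(features, y, threshold=0.005):
    """Greedy backward elimination: drop X_k if I(Y; X_k | other kept features) < threshold."""
    selected = list(features)
    for name in list(features):
        others = [features[k] for k in selected if k != name]
        if cmi(features[name], y, others) < threshold:
            selected.remove(name)
    return selected


rng = random.Random(1)


def flip(bit, p):
    """Return the bit, flipped with probability p (assumed noise model)."""
    return bit ^ 1 if rng.random() < p else bit


# Synthetic data for an assumed version of the Fig. 2 graph:
# X2 -> X3, X4 -> X5, and Y caused directly by X1, X3, X5.
n = 40000
x1 = [rng.randint(0, 1) for _ in range(n)]
x2 = [rng.randint(0, 1) for _ in range(n)]
x3 = [flip(b, 0.2) for b in x2]
x4 = [rng.randint(0, 1) for _ in range(n)]
x5 = [flip(b, 0.2) for b in x4]
# Hypothetical failure rule: noisy majority vote of the three direct causes.
y = [flip(int(a + b + c >= 2), 0.05) for a, b, c in zip(x1, x3, x5)]

features = {"X1": x1, "X2": x2, "X3": x3, "X4": x4, "X5": x5}
selected = select_parents(features, y)
print("selected features:", selected)
```

With this toy data, the sketch should retain X_1, X_3, and X_5 and discard X_2 and X_4, even though X_2 and X_4 are each marginally dependent on Y. Conditioning on every remaining candidate at once requires enough samples in each stratum, so this brute-force form is only reasonable for a handful of discrete variables; real SMART data would call for discretization or a different independence test.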