with DNN algorithms, which are designed to generalize beyond what they are trained on and therefore must be able to cope with some error. The key is that this error must be within the acceptable limits of operation. The reliability of an analog chip clearly takes on a new meaning, one more closely intertwined with the accuracy of the computations during normal operation of the chip.

The key metric of a neural processor operating in inference mode is its accuracy on a task, such as image classification. For a benchmark deep neural network task, a well-defined baseline accuracy will exist for a digital system where all logical values are represented by standard floating-point precision (32 bit) variables. For example, a standard GPU running the ResNet50 neural network should correctly classify 75% of the images in the ImageNet test set, and this result should be exactly reproducible. Considering the exact same task on an analog processor, the final accuracy might be 74% under initial, ideal operating conditions, because the weights, inputs, and outputs are represented by analog attributes. The loss is caused by several factors that reduce the precision of these attributes, including device programming accuracy limits, physical variations, analog-to-digital conversion, and array parasitics. A small accuracy loss accompanied by orders-of-magnitude improvements in speed and energy efficiency can be an acceptable tradeoff for many applications, and methods are being researched to close this accuracy gap.

There is often the perception that analog neural processors are inherently immune or tolerant to device failures and reliability issues. While this is true in some instances, the reality is that reliability issues will degrade the accuracy of analog systems before they affect digital systems. For example, in standard digital flash memory, drift in the threshold voltage of a cell does not change the value read by the memory controller until it crosses a preset read voltage, at which point a bit error occurs. A large threshold voltage shift is usually required before the bit crosses this point, allowing the bit to drift slowly due to charge loss, possibly over many years, before any error occurs. Charge loss is hastened in environments such as high temperature or ionizing radiation. In an analog cell, as soon as the threshold voltage starts to drift, the accuracy of the neural operations can be affected. For an accelerator built on flash memory, this shift occurs at the same time across an ensemble of cells, and the collective shifts will immediately start to perturb the analog data values inside the network and may soon degrade the accuracy.

The case of erroneous bits is very clear for digital memory and can be remedied by error detection and correction codes. In analog, the very definition of an erroneous bit is ambiguous. In the case of a single cell that fails catastrophically, such that the difference between the actual and expected value is maximized, the effect is still not clear at the output of the system; it depends on the importance of that particular weight in the neural network. Connecting device-level analog errors to reliability at the system level requires a detailed model of the device up through the algorithm, which is a topic of current research. The sketches below illustrate these effects in miniature.
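As a first illustration, the accuracy gap between a floating-point baseline and an analog implementation can be modeled, to first order, as a random perturbation of the weight values. The following is a minimal sketch of that idea, not the author's methodology: a synthetic linear classifier stands in for a trained network, and the noise levels, sizes, and names are illustrative assumptions.

```python
# Minimal sketch: analog non-idealities (programming error, read noise)
# modeled as Gaussian perturbations on a weight matrix, and the resulting
# classification accuracy loss. All sizes and noise levels are assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 10-class linear classifier standing in for a trained network.
n_features, n_classes, n_samples = 64, 10, 5000
W_ideal = rng.normal(0.0, 1.0, (n_features, n_classes))  # "FP32 baseline" weights
X = rng.normal(0.0, 1.0, (n_samples, n_features))
y = np.argmax(X @ W_ideal, axis=1)  # labels defined by the ideal weights

def accuracy(W):
    return np.mean(np.argmax(X @ W, axis=1) == y)

# Analog weight: ideal value plus Gaussian programming/read error.
for sigma in [0.0, 0.02, 0.05, 0.1, 0.2]:
    W_analog = W_ideal + rng.normal(0.0, sigma, W_ideal.shape)
    print(f"weight-error sigma={sigma:.2f}  accuracy={accuracy(W_analog):.3f}")
```

At sigma = 0 the sketch reproduces its baseline exactly, mirroring the reproducible digital result; as sigma grows, accuracy falls gradually rather than failing outright.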
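The digital-versus-analog contrast under threshold-voltage drift can be captured in a few lines as well. In this hedged sketch, the read voltage and programmed threshold are arbitrary illustrative values, not device parameters from the article.

```python
# Sketch of the failure contrast described above: a digital flash read is
# insensitive to threshold-voltage (Vt) drift until the cell crosses a
# preset read voltage, while an analog readout degrades as soon as drift
# begins. All voltage values are illustrative assumptions.
import numpy as np

V_PROGRAMMED = 3.0  # Vt of a programmed cell (arbitrary units)
V_READ = 2.0        # preset read voltage separating '1' from '0'

def digital_read(vt):
    # Bit value is recovered exactly until Vt crosses the read voltage.
    return 1 if vt > V_READ else 0

def analog_error(vt):
    # The analog attribute is the Vt itself; any drift is immediately error.
    return abs(vt - V_PROGRAMMED)

for drift in np.linspace(0.0, 1.5, 7):
    vt = V_PROGRAMMED - drift  # charge loss lowers Vt over time
    print(f"drift={drift:.2f}  digital bit={digital_read(vt)}  "
          f"analog error={analog_error(vt):.2f}")
```

The digital bit stays correct until the drift exceeds the read margin and then flips all at once, while the analog error grows from the very first increment of drift.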
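Finally, the ambiguity of a single catastrophic cell failure can be probed by stuck-at fault injection: fail one weight at a time and record the accuracy impact. Again a sketch under assumptions, reusing the synthetic classifier above; the fault model (stuck at the opposite-sign maximum-magnitude value) is one illustrative choice.

```python
# Sketch: the system-level effect of one catastrophically failed cell
# depends on the importance of that weight in the network. Fail one
# weight at a time in a synthetic classifier and record the accuracy
# drop. The setup and fault model are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n_features, n_classes, n_samples = 64, 10, 2000
W = rng.normal(0.0, 1.0, (n_features, n_classes))
X = rng.normal(0.0, 1.0, (n_samples, n_features))
y = np.argmax(X @ W, axis=1)

def accuracy(Wp):
    return np.mean(np.argmax(X @ Wp, axis=1) == y)

W_MAX = np.abs(W).max()
drops = []
for i in range(n_features):
    for j in range(n_classes):
        W_fail = W.copy()
        # Catastrophic failure: cell stuck at the opposite-sign maximum,
        # maximizing |actual - expected| for that single weight.
        W_fail[i, j] = W_MAX if W[i, j] < 0 else -W_MAX
        drops.append(1.0 - accuracy(W_fail))

print(f"accuracy drop per single failed cell: "
      f"min={min(drops):.4f}  max={max(drops):.4f}  mean={np.mean(drops):.4f}")
```

The spread between the minimum and maximum drop is the point: an identical device-level fault can be nearly invisible or clearly damaging at the system output, depending on which weight it hits.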
Analog computing is an important step that will take deep neural network processing to the next level of efficiency. However, reliability is intimately tied to performance and efficiency in these systems, more so than in any modern digital system. Understanding the reliability of these systems presents a challenge and an opportunity for the electronics community.

ABOUT THE AUTHOR

Matthew J. Marinella is a distinguished member of the technical staff in Sandia's Microsystems S&T Center, where he leads research in emerging technologies for low-power, high-performance, and radiation-hardened computing. He has served in technical advising and leadership roles in various Lab- and DOE-level initiatives on next-generation computing for government applications. Marinella has authored or co-authored over 100 peer-reviewed publications, given numerous invited and contributed talks, and presented several short courses on these topics. He is a member of the SRC Decadal Plan Executive Committee, chairs the Emerging Memory Devices Section for the IRDS Roadmap Beyond CMOS Chapter, and serves on various technical program committees. He received a Ph.D. in electrical engineering from Arizona State University under Dieter Schroder in 2008.

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525. This article describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government.