Approximate Computing on Unreliable Silicon
Georgios Karakonstantis (2), Jeremy Constantin (1), Andreas Burg (1), Adam Teman (1)
(1) Swiss Federal Institute of Technology Lausanne (EPFL), Switzerland
(2) Queen's University Belfast, U.K.
Dagstuhl, 30-11/15
Objective: Improve Energy Efficiency

Classical
Main Idea: Reduce the complexity of an algorithm.
Techniques
• Scale down bit-precision
• Prune computations
• Simplify algorithms
Metrics
• Quality (SNR, PSNR, …)
• Energy

New
Main Idea: Utilize the application's error resiliency to address hardware-induced errors.
Techniques
• Allow and tolerate errors
• Limit errors to less significant computations and variables
• Ensure graceful performance degradation
Metrics
• Quality (SNR, PSNR, …)
• Energy
• Yield
• Reliability (e.g., MTTF)
Variability summarizes three different problems:
• True randomness
• Lack of knowledge
• Inability to model (chaotic behavior)
Variability creates the need for overdesign to account for worst-case assumptions.
Failure to design under all worst-case assumptions can and will lead to hardware misbehavior.

Two main types of failures
• Logic level: violation of timing constraints causes erroneous computations and control-plane failures
• Memory: data is lost or not properly stored
Static components
• Process variations: random dopant fluctuation, line-edge roughness, …

Dynamic/runtime factors
• Voltage (Vdd) variation
• Data dependencies
• Thermal effects
• Wearout/aging (e.g., NBTI)
• …

Single event upsets: the only errors that are truly random (intentionally not covered in this talk)
Manufacturing failure
• Die-to-die and within-die variations
• Each die is an individual realization of a random process
• Parameters are fixed after manufacturing

Runtime/dynamic failure (time scale: seconds)
• Behavior of each circuit is mostly deterministic and on a short time scale
• "Randomness" is due to random data and model uncertainty
• Averaging is only meaningful with truly random input

Wearout failure (time scale: years)
• Aging is a slow process
• Parameters change on a long time scale
• Long-term averages are meaningless

Non-ergodic behavior renders the analysis of circuits under variations difficult: averaging requires great care.
• Predicting the exact timing of a circuit is almost impossible, even if all factors are precisely known
• Predicting the consequences of a timing failure at any single point, let alone at multiple points, is even harder today
• Different instances of the same circuit behave very differently
• Despite this high sensitivity, the behavior of each individual circuit instance is still deterministic
Quality (SNR) degradation of different adders under frequency over-scaling

Some key observations:
• The transition region of graceful quality degradation is small
• Better architectures are also more sensitive to errors (smaller transition region)
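To make the effect concrete, here is a minimal behavioral sketch (not the measurement setup behind the plots above): frequency over-scaling is modeled by letting only the first k carry stages of a ripple-carry adder complete within the clock period, and the SNR against an exact adder is measured over random inputs. The adder model, bit width, and carry-truncation assumption are all illustrative.

```python
import numpy as np

def rca_overscaled(a, b, k, width=16):
    """Behavioral ripple-carry adder whose carry chain only completes k
    stages within the over-scaled clock period; later carries are assumed
    to remain 0 (a deliberately simple fault model)."""
    s, carry = 0, 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s |= (ai ^ bi ^ carry) << i
        carry = ((ai & bi) | (carry & (ai ^ bi))) if i < k else 0
    return s

def snr_db(k, width=16, n=10_000, seed=0):
    """SNR of the over-scaled adder against an exact adder on random inputs."""
    rng = np.random.default_rng(seed)
    a = rng.integers(0, 2**width, n)
    b = rng.integers(0, 2**width, n)
    exact = (a + b) & (2**width - 1)
    approx = np.array([rca_overscaled(int(x), int(y), k, width) for x, y in zip(a, b)])
    err = (exact - approx).astype(float)
    return 10 * np.log10(np.mean(exact.astype(float) ** 2) / (np.mean(err ** 2) + 1e-12))

# Quality collapses abruptly once k drops below the longest activated carry chain,
# mirroring the small transition region observed for real adders.
for k in (16, 12, 8, 4):
    print(f"carry stages completed: {k:2d}  ->  SNR = {snr_db(k):6.1f} dB")
```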
Objective: exploit timing margins in low-power processors
• Error-detection sequentials measure timing margins in all pipeline stages
• Cycle-by-cycle adjustable clock generator
• Processor state determines the instantaneous clock period
• Critical range optimization in OpenRISC: +38% speedup, -24% power consumption

J. Constantin, et al., "Exploiting dynamic timing margins in microprocessors for frequency-over-scaling with instruction-based clock adjustment", DATE 2015
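As a rough illustration of the instruction-based clock adjustment idea (the instruction classes and delay numbers below are hypothetical, not taken from the DATE 2015 paper), each instruction class carries its own characterized worst-case stage delay, and the clock generator applies that period cycle by cycle instead of one global worst case:

```python
# Minimal sketch of instruction-based clock adjustment. Per-class worst-case
# delays (hypothetical numbers) replace a single global worst-case clock period.
INSTR_DELAY_NS = {
    "alu":    1.10,
    "shift":  1.25,
    "load":   1.40,
    "branch": 1.05,
    "mul":    1.45,   # the slowest class sets the static worst-case clock
}

def runtime_ns(trace, dynamic=True):
    """Total execution time of an instruction trace with a per-cycle clock."""
    worst = max(INSTR_DELAY_NS.values())
    return sum(INSTR_DELAY_NS[i] if dynamic else worst for i in trace)

trace = ["alu", "alu", "load", "branch", "alu", "shift", "mul", "alu"] * 1000
t_static = runtime_ns(trace, dynamic=False)
t_dynamic = runtime_ns(trace, dynamic=True)
print(f"speedup with instruction-based clock: {t_static / t_dynamic:.2f}x")
```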
Graphs removed since unpublished

Summary of main points:
• Under timing violations without additional sources of randomness, there is a sudden transition between fault-free operation and 100% failure beyond the static timing limit
• When adding uncertainty by means of supply voltage noise, we get a transition region between functional operation and full failure
• Unfortunately, the transition region is rather small (e.g., 50 MHz at a clock of ~700 MHz)
Consideration of the application level provides additional scalability: graceful performance degradation

New paradigm:
• Approximate computing
• Scalable algorithms
• Stochastic computing
• Application/algorithm-level fault tolerance

Application to communications: iterative algorithms adjust to process variations, trading quality against execution time under a task deadline.
(Figure: path-delay histograms at nominal and low VDD relative to the target delay.)
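A minimal sketch of the graceful-degradation idea for iterative algorithms (purely illustrative; the deadline handling and the Newton iteration are not taken from the referenced systems): the computation refines its result until the task deadline expires, so a slower, variation-affected die simply returns a slightly coarser result instead of failing.

```python
import time

def iterative_sqrt(x, deadline_s, x0=1.0):
    """Newton iteration for sqrt(x) that returns its best estimate so far
    when the task deadline expires: fewer completed iterations mean lower
    quality, never a hard failure."""
    t_end = time.monotonic() + deadline_s
    est = x0
    while time.monotonic() < t_end:
        est = 0.5 * (est + x / est)
    return est

# A die with a slower (variation-affected) clock completes fewer iterations
# within the same deadline and delivers a slightly coarser result.
print(iterative_sqrt(2.0, deadline_s=1e-4))
```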
• Memories account for the bulk of leakage and active power consumption
• There is a clear relationship between savings (in area and power) and the amount of errors we expect
• Errors can easily be located and associated with individual variables or quantities at higher abstraction levels
• Important variables can be protected against errors
• The impact of errors is easy to model accurately and can be propagated well through the stack and across abstraction levels
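As one illustration of "important variables can be protected", the following sketch assumes a hypothetical split between a fully reliable memory region and a cheaper, error-prone one; the region names, sizes, and annotation scheme are invented for the example.

```python
# Illustrative sketch: tag each variable with its significance and place only
# the critical ones in the reliable region; error-tolerant data goes to the
# cheaper, unreliable region. Region names and variables are hypothetical.
RELIABLE, APPROX = "reliable_sram", "approx_sram"

def place(variables):
    """variables: list of (name, size_bytes, critical) tuples."""
    return {name: (RELIABLE if critical else APPROX)
            for name, size, critical in variables}

prog_vars = [
    ("loop_counters", 64,    True),   # control flow: an error here is fatal
    ("fft_twiddles",  4096,  True),   # shared coefficients: protect
    ("pixel_buffer",  65536, False),  # output samples: tolerate bit errors
]
print(place(prog_vars))
```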
1. Compact "better-than-worst-case" memory design for fault-tolerant applications
   • Memories with graceful performance degradation
   • Average-case refresh (retention time t_ret)
2. Study of the inherent fault tolerance of wireless systems (HSPA+ system, transmitter)
   • The system tolerates a surprisingly high number of defects in costly memories (50x)
3. Application of unreliable memories to forward error correction decoders

1 A. Teman, et al., "Energy versus data integrity trade-offs in embedded high-density logic compatible dynamic memories", DATE 2015
2 G. Karakonstantis, et al., "On the exploitation of the inherent error resilience of wireless systems under unreliable silicon", DAC 2012
3 P. Meinerzhagen, et al., "Refresh-Free Dynamic Standard-Cell Based Memories: Application to QC-LDPC Decoder", ISCAS 2015
Controlled errors with a modified test criterion
• Conventional yield criterion: accept only dies with no errors
• Modified yield criterion: accept dies with fewer than N errors
• Improves yield for a given power/quality metric
• Keeps the yield up under more stringent power constraints
(Figure: histograms of bit errors per die (0, <5, <100, >100). At nominal VDD the conventional criterion gives 80% yield (OK) and the modified criterion 90% (high); moving toward low-power operation at reduced VDD, the conventional criterion suffers a yield loss down to 60% (too low) while the modified criterion keeps 80% (OK).)
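The trend can be reproduced with a small Monte Carlo sketch. The Poisson model of bit errors per die and the error rates below are assumptions for illustration only, not fitted to the numbers on the slide.

```python
import numpy as np

def yields(mean_errors_per_die, n_dies=100_000, threshold=5, seed=0):
    """Yield under the conventional (no errors) and the modified
    (< threshold errors) acceptance criteria, for a synthetic Poisson
    model of bit errors per die (model and rates are assumptions)."""
    rng = np.random.default_rng(seed)
    errors = rng.poisson(mean_errors_per_die, n_dies)
    return np.mean(errors == 0), np.mean(errors < threshold)

# Lowering VDD raises the average number of bit errors per die; the modified
# criterion loses far less yield than the conventional one.
for vdd, rate in [("nominal VDD", 0.2), ("reduced VDD", 1.5)]:
    conv, mod = yields(rate)
    print(f"{vdd}: conventional {conv:.0%}, modified (<5 errors) {mod:.0%}")
```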
Problem: different instances of the same memory
• Each manufactured die is subject to a specific error pattern (number of errors and error locations)
• The impact on quality depends strongly on the number of errors and on the error location (word and bit location): a few errors in MSBs and many errors in LSBs have a very different performance impact
• Non-ergodicity invalidates quality assessment across dies
• Impact on the quality distribution: some chips with fewer than N errors work perfectly, others fail miserably
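The position dependence is easy to reproduce. The sketch below injects a stuck-at-1 fault at one bit position of every stored word and measures the resulting SNR; the unsigned data model and the pessimistic every-word fault are illustrative assumptions.

```python
import numpy as np

def snr_with_stuck_bit(bit, width=16, n=10_000, seed=0):
    """SNR of random unsigned samples stored in a memory whose cells at the
    given bit position are stuck at 1 (a pessimistic, illustrative model)."""
    rng = np.random.default_rng(seed)
    data = rng.integers(0, 2**width, n)
    faulty = data | (1 << bit)                    # stuck-at-1 in every word
    err = (data - faulty).astype(float)
    return 10 * np.log10(np.mean(data.astype(float) ** 2) / (np.mean(err ** 2) + 1e-12))

# The same single defective bit per word is nearly harmless in the LSB and
# devastating in the MSB.
for bit in (0, 4, 8, 12, 15):                     # LSB ... MSB
    print(f"stuck bit {bit:2d}: SNR = {snr_with_stuck_bit(bit):6.1f} dB")
```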
Binning based on the specific error pattern is not feasible: there are too many different patterns (predicting the impact of each pattern on quality during test is impossible). Proper test criteria are hard to define and ensuring consistent quality is difficult.

Solution: ensure that all chips with a given number of errors have the same average quality
• The average behavior over time must be independent of the physical error location
• Add logic to the memories to change the mapping between logical and physical locations
• Physical failures remain in the same location, but logical bit-failures wander around the memory
• Quality changes with each application of the algorithm (averaging)
(Figure: physical-to-logical bit/address mapping over time/algorithm iterations.)
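One simple way to make the failures wander is to re-randomize the logical-to-physical address mapping on every algorithm iteration, e.g., by XORing the address with a fresh key. The XOR scrambler below is one possible choice for a sketch, not necessarily the mapping logic used in the cited work.

```python
import numpy as np

class ScrambledMemory:
    """Toy word-addressable memory with a fixed set of physically failing
    words. The logical-to-physical mapping is re-randomized with an XOR key
    each algorithm iteration, so the same logical address does not always
    land on a broken cell (averaging over time)."""

    def __init__(self, n_words, failing_physical, seed=0):
        assert n_words & (n_words - 1) == 0, "power-of-two size for XOR mapping"
        self.n = n_words
        self.failing = set(failing_physical)      # fixed at manufacturing
        self.rng = np.random.default_rng(seed)
        self.key = 0

    def new_iteration(self):
        self.key = int(self.rng.integers(0, self.n))   # fresh mapping

    def is_faulty(self, logical_addr):
        return (logical_addr ^ self.key) in self.failing

mem = ScrambledMemory(1024, failing_physical=[17, 600])
for it in range(3):
    mem.new_iteration()
    hit = [a for a in range(1024) if mem.is_faulty(a)]
    print(f"iteration {it}: faulty logical addresses {hit}")
```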
Best-effort statistical data correction / data representations for unreliable memories

C. Roth, et al., "Statistical data correction for unreliable memories", Asilomar 2014
C. Roth, et al., "Data mapping for unreliable memories", Allerton Conference on Communication, Control, and Computing, 2012
Idea: identify failing bit locations at runtime and store the bits of lower significance (LSBs) in those locations

S. Ganapathy, et al., "Mitigating the Impact of Faults in Unreliable Memories for Error Resilient Applications", DAC 2015
Bit Shuffling Mechanism
• Identify failing bits in a memory word at run-time
• Use a shifter to store the bits of lower significance (LSBs) in those locations
• Shuffling can be performed at varying levels of granularity:
  – On a per-bit basis, where the failing bit always stores the LSB
  – On a segment basis, where groups of bits are shifted (2^n_fm segments per word)
  – This helps trade off area and power against output quality
• Error magnitude computed for a 32-bit integer in 2's complement representation
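A minimal per-bit sketch of the idea, assuming the failing bit position in a word is known at run-time: swap that position with the LSB before the write and swap back after the read, so the unreliable cell only ever carries the least significant bit. The swap-based interface and the stuck-at-0 example are illustrative, not the shifter design of the cited work.

```python
def swap_bits(word, i, j):
    """Swap bit positions i and j of an integer word."""
    if ((word >> i) ^ (word >> j)) & 1:
        word ^= (1 << i) | (1 << j)
    return word

def shuffled_write(word, failing_bit):
    """Per-bit shuffling: move the data LSB into the known failing cell
    position before storing (the inverse is applied on read-back)."""
    return swap_bits(word, 0, failing_bit)

def shuffled_read(stored, failing_bit):
    return swap_bits(stored, failing_bit, 0)

# Example: bit 13 of this word's row is known to fail (stuck-at-0 assumed).
failing_bit = 13
data = 0b0010_1101_1010_0111
stored = shuffled_write(data, failing_bit)
stored &= ~(1 << failing_bit)             # the unreliable cell loses its value
restored = shuffled_read(stored, failing_bit)
print(f"worst-case magnitude error: {abs(data - restored)}")   # at most 1 (the LSB)
```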
Priority ECC design
• The most significant 16 bits of a 32-bit word are protected with a (22,16) SECDED ECC
• Compared to a full (39,32) SECDED ECC, this reduces the power, area, and latency overhead by as much as 83%, 89%, and 77%, respectively
• For the 3 evaluated applications (ElasticNet, PCA, and KNN), the output quality is within 10%, 0.2%, and 7% of a fault-free memory with SECDED ECC
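To illustrate the priority split, the sketch below protects only the upper 16 bits of a 32-bit word with a (22,16) single-error-correcting code built as a standard extended Hamming layout (16 data bits, 5 Hamming parity bits, 1 overall parity bit) and stores the lower 16 bits raw. This is a conceptual stand-in, not necessarily the exact code construction of the cited DAC 2015 design; double-error signaling is omitted for brevity.

```python
def _data_positions():
    return [p for p in range(1, 22) if p & (p - 1)]        # non-power-of-two slots

def secded22_encode(data16):
    """(22,16) code word as a list of 22 bits: Hamming parity at positions
    1,2,4,8,16, data in the remaining slots, overall parity at position 0."""
    code = [0] * 22
    for i, p in enumerate(_data_positions()):
        code[p] = (data16 >> i) & 1
    for par in (1, 2, 4, 8, 16):
        code[par] = sum(code[p] for p in range(1, 22) if p & par) & 1
    code[0] = sum(code[1:]) & 1
    return code

def secded22_decode(code):
    """Correct any single bit error (double-error signaling omitted here)."""
    syndrome = sum(par for par in (1, 2, 4, 8, 16)
                   if sum(code[p] for p in range(1, 22) if p & par) & 1)
    code = code[:]
    if syndrome and sum(code) & 1:        # single error -> flip the indicated bit
        code[syndrome] ^= 1
    return sum(code[p] << i for i, p in enumerate(_data_positions()))

def priority_write(word32):
    """Protect only the 16 MSBs of a 32-bit word; store the 16 LSBs raw."""
    return secded22_encode(word32 >> 16), word32 & 0xFFFF

def priority_read(msb_code, lsb_raw):
    return (secded22_decode(msb_code) << 16) | lsb_raw

# A bit error in the protected MSB half is corrected; one in the raw LSB half
# merely perturbs the value by less than 2**16.
word = 0x7A3C_91F5
code, lsb = priority_write(word)
code[7] ^= 1                              # fault in the protected half -> corrected
lsb ^= 1 << 9                             # fault in the unprotected half -> small error
restored = priority_read(code, lsb)
print(hex(word), hex(restored), "error =", abs(word - restored))
```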