AST2016 Dang slides. Uploaded to ResearchGate by Abderazek Ben Abdallah on 30 November 2016: https://www.researchgate.net/publication/311113761

Reliability Assessment and Quantitative Evaluation of Soft-Error Resilient 3D Network-on-Chip Systems
Khanh N. Dang, Michael Meyer, Yuichi Okuyama, and Abderazek Ben Abdallah
{d8162103, d8161104, okuyama, benab}@u-aizu.ac.jp
Adaptive Systems Laboratory, Graduate School of Computer Science and Engineering, The University of Aizu, Aizu-Wakamatsu, Fukushima, Japan
25th IEEE Asian Test Symposium (ATS'16), Nov. 21-24, 2016, Hiroshima, Japan

Content
• Background
• Soft Error Resilient 3D NoC System
• Reliability Assessment Methodology
• Evaluation Result
• Conclusion & future work

VLSI Design Challenges
For decades, CMOS technology has progressed to provide efficient solutions; however, VLSI design now faces several challenges:
• Power Wall: Energy consumption increases by ~60% (high computing area) and ~40% (middle computing area) per year [Chang 2016].
• Yield Wall: With a similar number of process control steps (~420), the yield at 5 nm is predicted to be under 55%, compared with ~78% at 28 nm [Yield].
• Packaging: The pin count of Intel chips is expected to increase by 25% every 2 years (tick-tock period) [Intel Proc].

VLSI Design Challenges (cont.)
• Time-to-Market: Being one quarter or one year late to market (for a 2-year product life) leads to a revenue loss of over 33% or 90%, respectively [TTM].
• Reliability: Exposure to a variety of manufacturing, design, and operation factors makes future architectures more vulnerable to different types of faults [Henkel 2013].
  • A 10-15°C difference in operating temperature can lead to a 2x difference in MTTF [Shafique 2014].
  • The soft error rate at 0.45 V is 30x that at 0.7 V [Shafique 2014].
⇒ Reliability assessment has become an important part of the design process.

Network-on-Chip
• Network-on-Chip (NoC) is a new paradigm that replaces the traditional bus and offers the following benefits:
  • Low power
  • Reusability
  • Scalability
  • Parallelism
[Figure: 2D Mesh Network-on-Chip and 3D Mesh Network-on-Chip. Labeled components: Router (R), Network Interface, Processing Element (PE), Wires.]

Reliability Challenges
[Figures: open wire defect; Single Event Transient caused by a radiation particle.]
Fault types and their sources:
• Soft Errors: cross-talk, radiation particles, cosmic rays, thermal neutrons.
• Hard Faults: manufacturing defects, time-dependent dielectric breakdown, thermal stress, electro-migration, Negative-Bias Temperature Instability.

Reliability Challenges (cont.)
• Soft Errors
  • Potential effects: bit flip (gate/wire), data corruption, misrouting, lost/duplicated packets, packet latency, locking state.
  • Possible solutions: Error Correction Code, temporal redundancy, self-verification and roll-back.
• Hard Faults
  • Potential effects: open (gate/wire), bridge (gate/wire), stuck at 0/1, delay, data corruption, packet loss/duplication/misrouting, locking state.
  • Possible solutions: spare modules/gates for replacement, faulty part isolation, fault-tolerant routing.
With the increasing vulnerability of systems to faults and the critical effects of faults on NoC systems, NoC system reliability needs to be addressed.

Reliability Assessment
• Reliability assessment involves five phases:
  • System Definition
  • Preliminary Design
  • Detailed Design
  • Fabrication, Assembly, Integration and Test (FAIT)
  • Production/Support
• Reliability assessment is important in the early design stages in order to prevent costly redesigns of the system.
Three assessment approaches:
• Analytical Model
  • The design is analyzed under an analytical model.
  • The design reliability is estimated from the sub-modules or events.
  • Low complexity and quick.
• System-Level Simulation
  • Faults are injected into the system under specific distributions and rates.
  • Gives an accurate picture of the behavior under faults.
  • The result is trustworthy given fair fault distributions and a large number of statistical samples.
• Physical Analysis
  • Analyzes the design in terms of physical failures.
  • A full-chip assessment can be obtained by combining separate parts.
  • Highest accuracy.
  • Requires massive time and computation resources.
• The analytical model is efficient for the three early stages.
• By analyzing analytically, the critical part can be detected and improved.

Paper Contributions
1. An efficient soft error resilient mechanism and architecture (SER-3DR-NoC) for reliable 3D-NoC systems.
  • Uses redundant execution of a pipeline stage to detect soft errors.
  • Uses three execution results and majority voting to recover from a soft error.
2. A formulation of reliability assessment for fault-tolerant systems.
  • Based on Mean-Time-Between-Failure.
  • Modeled with a Markov-state model.

Proposed System Architecture
• The proposed system (SER-3DR-NoC) is a 3D-Mesh based Network-on-Chip.
• It consists of SER-3DR routers with 7 ports each (6 directions and 1 local).
• SER-3DR operates in 3 pipeline stages: BW (Buffer Writing), NPC/SA (Next Port Computing/Switch Allocation), and CT (Crossbar Traversal).
• An incoming flit is stored in the input buffer. Later, the routing information is used to compute the routing path and the intra-router arbitration. Flits are then forwarded through the crossbar.
[Figure: Proposed System: SER-3DR-NoC.]

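To make the 7-port router and the Next Port Computing step concrete, the following is a minimal sketch of how an output port could be chosen in a 3D mesh. The slides do not state which routing algorithm SER-3DR uses; dimension-order (XYZ) routing is assumed here purely for illustration, and the names `Port` and `next_port` as well as the coordinate convention are hypothetical.

```python
from enum import Enum
from typing import Tuple

Coord = Tuple[int, int, int]  # (x, y, z) position of a router in the 3D mesh


class Port(Enum):
    """The 7 router ports: 6 directions plus 1 local port (names are illustrative)."""
    LOCAL = 0
    EAST = 1
    WEST = 2
    NORTH = 3
    SOUTH = 4
    UP = 5
    DOWN = 6


def next_port(current: Coord, dest: Coord) -> Port:
    """Pick the output port with dimension-order routing: X first, then Y, then Z.

    ASSUMPTION: XYZ routing is a stand-in; the source does not specify the algorithm.
    """
    cx, cy, cz = current
    dx, dy, dz = dest
    if dx != cx:
        return Port.EAST if dx > cx else Port.WEST
    if dy != cy:
        return Port.NORTH if dy > cy else Port.SOUTH
    if dz != cz:
        return Port.UP if dz > cz else Port.DOWN
    return Port.LOCAL  # the flit has arrived; deliver it to the local PE


if __name__ == "__main__":
    print(next_port((0, 0, 0), (2, 1, 1)))
    print(next_port((2, 1, 1), (2, 1, 1)))
```

A soft error in such a computation changes the selected port, which is why a corrupted NPC result can misroute a flit.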
Soft Error Resilience Method
• Approach:
  • Replicate the execution of the pipeline stage.
  • Compare the two consecutive results: a difference means a fault occurred.
  • Correct the error by executing a third time and applying majority voting.
• Target:
  • The routing (NPC) and arbitration (SA) units play an important part inside the network.
  • A soft error in NPC or SA can lead to misrouting, lost/duplicated packets, or even locking states.
  • NPC and SA are therefore selected to be protected.

Soft Error Resilience Algorithm
1. The incoming flit is stored in the buffer in the Buffer Writing (BW) stage.
2. The routing information is used for the first NPC/SA execution.
3. A redundant execution is performed for each NPC/SA stage.
4. The two consecutive results of NPC and of SA are compared.
5. If they are identical, which means no soft error occurred, the flit is forwarded to the crossbar.
6. If one of them (or both) differs, the error is corrected by a third execution and majority voting.
7. Finally, the corrected flit is forwarded to the crossbar.
[Figure: flowchart distinguishing original and redundant pipeline stages. Cycle 1: BW. Cycle 2: compute NPC and SA. Cycle 3: compute the redundant RNPC and RSA and check RNPC = NPC? and RSA = SA?; if yes, compute CT. Cycle 4 (if no): roll back, re-compute NPC or SA, apply majority voting, then compute CT.]

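The detect-then-recover flow on this slide can be sketched in software as follows. This is a behavioral illustration only, not the authors' hardware design: the helper names `resilient_stage` and `make_faulty_stage`, and the way a soft error is modeled (a flipped low bit on selected calls), are assumptions made for the example.

```python
from __future__ import annotations

from collections import Counter
from typing import Callable, Hashable, TypeVar

T = TypeVar("T", bound=Hashable)


def resilient_stage(compute: Callable[[], T]) -> tuple[T, int]:
    """Execute a stage twice; on mismatch, execute a third time and majority-vote.

    Returns (result, number_of_executions).
    """
    first = compute()
    second = compute()           # redundant execution
    if first == second:          # results agree: treated as "no soft error"
        return first, 2
    third = compute()            # roll back and re-compute
    winner, votes = Counter([first, second, third]).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: all three executions disagree")
    return winner, 3


def make_faulty_stage(correct: int, faulty_calls: set[int]) -> Callable[[], int]:
    """Hypothetical stage model: returns `correct`, except that the calls whose
    1-based index is in `faulty_calls` return a value with the low bit flipped."""
    calls = 0

    def stage() -> int:
        nonlocal calls
        calls += 1
        return correct ^ 1 if calls in faulty_calls else correct

    return stage


if __name__ == "__main__":
    print(resilient_stage(make_faulty_stage(3, set())))  # fault-free run
    print(resilient_stage(make_faulty_stage(3, {2})))    # fault in the redundant run
```

In this sketch the fault-free path costs two executions and the recovery path three. Two identical faulty results in a row would pass the comparison undetected, which is a limitation of comparing only two executions.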
Reliability Assessment Methodology
• We propose a reliability assessment method based on a Markov-state model.
• A fault rate distribution is also proposed.
• To evaluate the efficiency of a fault-tolerance mechanism, we present a new parameter: the Reliability Acceleration Factor.
• We apply the method to assess the soft error resilient mechanism.

Mean Time Between Failure
Mean Time Between Failure (MTBF) is the average time between two consecutive failures.
[Figure: timeline of a system alternating between working and failed states. A failure begins when a soft error occurs and ends when the soft error is repaired; the Time Between Failures is marked along the time axis t.]
Given a reliability function R, MTBF is as follows:
$MTBF = \lim_{s \to 0} R(s)$, where $R(s)$ is the reliability function in the Laplace domain.

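As a worked illustration of the Laplace-domain definition of MTBF on this slide, assume a constant failure rate $\lambda$, so that the reliability function is exponential. The constant-rate model and the symbol $\lambda$ are assumptions made for this example and are not taken from the slides.

```latex
\begin{align*}
R(t) &= e^{-\lambda t}, \\
R(s) &= \int_0^{\infty} e^{-st}\, R(t)\, dt = \frac{1}{s + \lambda}, \\
\mathit{MTBF} &= \lim_{s \to 0} R(s) = \frac{1}{\lambda} = \int_0^{\infty} R(t)\, dt .
\end{align*}
```

The last equality shows why the limit works: setting $s = 0$ in the Laplace transform reduces it to the integral of $R(t)$ over all time, which is the expected time to failure.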