

Measurement of Timing Error Detection Performance of Software-based Error Detection Mechanisms and Its Correlation with Simulation
Yutaka Masuda, Masanori Hashimoto, Takao Onoye
Dept. of Information Systems Eng., Osaka University


  1. Measurement of Timing Error Detection Performance of Software-based Error Detection Mechanisms and Its Correlation with Simulation
     Yutaka Masuda, Masanori Hashimoto, Takao Onoye
     Dept. of Information Systems Eng., Osaka University
     {masuda.yutaka, hasimoto}@ist.osaka-u.ac.jp

  2. Agenda
     • Background and objective
     • Silicon measurement
     • Correlation between silicon measurement and simulation
     • Conclusion

  3. Challenges in Post-Silicon Validation
     Running a large number of tests exposes unexpected behavior caused by:
     • Logic bugs
     • Electrical timing errors (this work)
     To localize errors with a trace buffer, errors must be detected quickly.
     With a long detection latency (e.g. billions of cycles), the failure (system crash, blue screen, etc.) is observed only after the trace around the error has been discarded, because the latency exceeds the trace buffer depth. The error cannot be recorded in the trace buffer.

  4. EDM* Transformation for Quick Error Detection
     (*) Error Detection Mechanisms, one of the SW-based error detection techniques.
     Flow: the original program (C/C++) is transformed into an EDM program, compiled, loaded into RAM, and run on the processor. No HW modification is needed.
     E.g. EDM-L (EDM for short Latency) [1]
     • Duplicate all instructions: a = b; becomes a0 = b0; a1 = b1; a = b;
     • Check when a variable is written: if (a0 != a1) error();
     EDM-L quickly detects 86% of electrical timing errors that vary execution results [1] (only evaluated in simulation).
     [1] Y. Masuda, M. Hashimoto, and T. Onoye, "Performance Evaluation of Software-based Error Detection Mechanisms for Localizing Electrical Timing Failures under Dynamic Supply Noise," Proc. ICCAD, 2015.
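The duplicate-and-compare idea behind EDM-L can be illustrated with a minimal Python sketch. It only models the pattern (compute twice, compare when the value is written); it is not the authors' transformation tool. The names `edm_l_assign`, `TimingErrorDetected`, and the `inject_fault_in_copy` knob are hypothetical and exist only for this illustration.

```python
class TimingErrorDetected(Exception):
    """Raised when the two redundant copies of a value disagree."""


def edm_l_assign(compute, inject_fault_in_copy=None):
    """Model the EDM-L pattern for one assignment `a = b`.

    compute: zero-argument callable producing the integer to store
             (hypothetical stand-in for the original instruction).
    inject_fault_in_copy: 0 or 1 flips the lowest bit of that copy to
             emulate a timing error; None means a fault-free run.
             This knob is purely illustrative.
    """
    copies = [compute(), compute()]        # a0 = b0; a1 = b1;
    if inject_fault_in_copy is not None:
        copies[inject_fault_in_copy] ^= 1  # emulate a corrupted result
    if copies[0] != copies[1]:             # check when the variable is written
        raise TimingErrorDetected(f"mismatch: {copies[0]} != {copies[1]}")
    return copies[0]                       # a = b;
```

A fault-free call returns the value unchanged, while a fault in either copy is flagged at the write, which is what keeps the detection latency short in this model.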

  5. Objective
     1. To answer "How many electrical errors can EDM* localize?" based on silicon measurement.
     2. To evaluate the correlation between simulation and silicon results.
     Scenario 1: Localize electrical errors in the original program. The EDM program must reproduce the original error and detect it with short latency.
     Scenario 2: Localize electrical errors that vary execution results.

  6. Reproducibility and Detectability
     For EDM to work well, two conditions should be satisfied.
     • COND1: Reproducibility (necessary for Scenario 1). The error of the original program is reproduced in the EDM program.
     • COND2: Detectability (necessary for Scenario 1 and Scenario 2). The check detects the error quickly; satisfied when the error detection latency ≤ 1000 cycles.
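As a rough illustration of how the two conditions partition measurements into the four categories used in the result slides, here is a small Python sketch. The function name, the record fields (`reproduced`, `detected`, `latency_cycles`), and the category strings are assumptions made for this example; only the 1000-cycle threshold comes from the slide.

```python
from typing import Optional

LATENCY_LIMIT_CYCLES = 1000  # threshold quoted on the slide


def classify(reproduced: bool, detected: bool,
             latency_cycles: Optional[int] = None) -> str:
    """Return which of COND1 / COND2 hold for one (hypothetical) measurement."""
    cond1 = reproduced  # COND1: the original error shows up again under EDM
    cond2 = (detected and latency_cycles is not None
             and latency_cycles <= LATENCY_LIMIT_CYCLES)  # COND2: quick detection
    if cond1 and cond2:
        return "both"
    if cond1:
        return "only COND1"
    if cond2:
        return "only COND2"
    return "neither"
```

For instance, an error that is reproduced and detected but only after 5000 cycles would land in "only COND1" under this sketch.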

  7. Agenda
     • Background and objective
     • Silicon measurement
     • Correlation between silicon measurement and simulation
     • Conclusion

  8. Preparation
     Evaluate the error occurrence border frequency (border freq.) for each workload and Vdd.
     Setup: PC (USB, frequency), DC voltage source (supply Vdd), test chip (MeP processor fabricated in 65 nm).
     [Figure: Vdd voltage [V] (1.0 to 1.4) vs. frequency [MHz] (200 to 400), with series "false_result.txt" and "true_result.txt" and the border freq. marked.]

  9. Measurement
     Evaluate the error occurrence time for computing the error detection latency.
     Each run: initialization @ 10 MHz, then the user program for N_fast cycles @ border freq., then the remainder @ 10 MHz, then check whether the execution results are erroneous.
     - Repeat program execution, changing the N_fast cycle count in a binary search manner.
     Total: 75 measurements
     • User program: dijkstra, sha, crc (MiBench)
     • Supply voltage: 1.0 - 1.4 V with 0.1 V interval
     • Test chip: 5 chips
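The binary search over N_fast and the size of the measurement grid can be sketched as follows. This is a minimal sketch, not the authors' test harness: `result_is_wrong` is a hypothetical callback standing in for one run on the chip, and the search assumes the outcome is monotonic in N_fast (once the fast phase covers the failing cycle, the result stays wrong).

```python
from itertools import product


def find_error_cycle(result_is_wrong, total_cycles):
    """Smallest N_fast in [1, total_cycles] whose run gives a wrong result.

    result_is_wrong(n_fast) -> bool is a hypothetical stand-in for running
    the program with n_fast cycles at the border frequency and checking the
    output. Returns None if even the full-speed run is correct.
    """
    if not result_is_wrong(total_cycles):
        return None
    lo, hi = 1, total_cycles
    while lo < hi:
        mid = (lo + hi) // 2
        if result_is_wrong(mid):
            hi = mid        # error already happened within the first mid cycles
        else:
            lo = mid + 1    # error happens later than mid
    return lo


# Measurement grid from the slide: 3 programs x 5 supply voltages x 5 chips.
programs = ["dijkstra", "sha", "crc"]
voltages = [1.0, 1.1, 1.2, 1.3, 1.4]
chips = range(5)
grid = list(product(programs, voltages, chips))
```

Under these assumptions each search needs about log2(total_cycles) program executions, and `len(grid)` gives the 75 measurements stated on the slide.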

  10. Evaluation Result
      COND1: Reproducibility, COND2: Detectability
      Scenario 1 categories: both COND1 and COND2 satisfied; only COND1 satisfied; only COND2 satisfied; neither COND1 nor COND2 satisfied.
      Scenario 2 categories: detected & latency < 1000 cycles; detected & latency > 1000 cycles; not detected & correct results; not detected & incorrect results.
      [Figure: charts for Scenario 1 and Scenario 2; chart labels: 33%, 25%, 40%, 56%, 4%, 0%, 31%, 11%.]
      • Scenario 1: EDM detects 25% of original errors.
      • Scenario 2: EDM detects 56% of errors varying results.

  11. Agenda
      • Background and objective
      • Silicon measurement
      • Correlation between silicon measurement and simulation
      • Conclusion

  12. Simulation Setup
      Consider two simulation setups:
      1. Previous Sim. [1]
      2. Updated Sim., which updates the PDN and the definition of the border freq.
      Evaluation setup: PDN design; border freq. definition
      • Silicon: low noise; exec. results vary
      • Previous Sim. [1]: 3% - 7% Vdd drop; timing error occurs
      • Updated Sim.: zero noise; exec. results vary
      [Figure: # of errors vs. freq., marking where an error occurs and where the results vary.]

  13. Correlation between Silicon and Sim. (Scenario 1)
      (Localize electrical timing errors in original program)
      COND1: Reproducibility, COND2: Detectability
      [Figure: charts for Silicon, Updated Sim., and Previous Sim. [1], split into: both COND1 and COND2 satisfied; only COND1 satisfied; only COND2 satisfied; neither COND1 nor COND2 satisfied. Chart labels: 0%, 25%, 23%, 4%, 20%, 40%, 50%, 7%, 4%, 76%, 20%, 31%.]
      • Updated sim. is consistent with silicon.
        - Detectability for original errors: 25% (Silicon), 23% (Updated Sim.)

  14. Correlation between Silicon and Sim. (Scenario 2)
      (Localize potential errors that vary results)
      [Figure: charts for Updated Sim., Previous Sim. [1], and Silicon, split into: detected & latency < 1000 cycles; detected & latency > 1000 cycles; not detected & correct results; not detected & incorrect results. Chart labels: 13%, 2%, 20%, 0%, 33%, 44%, 1%, 56%, 0%, 43%, 77%, 11%.]
      • Consistency improved by the simulation update.
        For errors varying results, EDM detects:
        - 56% (Silicon)
        - 44% (Updated Sim.)
        - 87% = 0.2 / 0.23 (Previous Sim.)
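The previous-simulation figure on this slide appears to be a ratio over only those errors that vary the results, rather than a raw chart share. The sketch below shows that normalization. It assumes, as a reading of the chart labels and not something the transcript states outright, that 20%, 2%, and 1% are the three result-varying categories of the previous simulation. The function name and argument names are hypothetical.

```python
def detection_ratio(detected_fast, detected_slow, undetected_incorrect):
    """Share of result-varying errors that are detected within the latency limit.

    All three arguments are category shares (e.g. in percent). The
    'not detected & correct results' share is excluded, since those runs
    do not vary the results. The order of the last two arguments does
    not affect the ratio.
    """
    varying = detected_fast + detected_slow + undetected_incorrect
    if varying == 0:
        raise ValueError("no result-varying errors to normalize by")
    return detected_fast / varying


# Assumed previous-simulation shares: 20% quick detections out of 23% varying errors.
previous_sim = detection_ratio(20, 2, 1)
```

With those assumed shares the ratio is 20/23, which rounds to the 87% quoted for the previous simulation.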

  15. Agenda
      • Background and objective
      • Silicon measurement
      • Correlation between silicon measurement and simulation
      • Conclusion

  16. Conclusion
      • Evaluated the error detection performance of EDM transformation for supply-noise-induced timing errors based on silicon measurement.
        - Considered two EDM usage scenarios.
        - In Scenario 1, EDM detected 25% of original errors.
        - In Scenario 2, EDM detected 56% of errors varying results.
      • Evaluated the correlation of EDM performance between sim. and silicon.
        - Updated the PDN design and the definition of the border frequency.
        - Updated sim. is consistent with silicon.

  17. Backup Slide: Difficulty of Electrical Error Localization
      Program transformation changes the instruction sequence, and the supply voltage varies.
      [Figure: voltage vs. time for the original program (inst. seq. #1, inst. seq. #2, ...) with an error, and for the transformed program (inst. seq. #1 + #1', check, ...).]
      Does the same error appear? Can SW-based transformation debug the original error?

  18. Backup Slide: Why Low Reproduction Ratio?
      [Figure: ratio [%] vs. minimum voltage in the MOV instruction [mV] for "dijkstra_full-EDM" and "dijkstra_original"; voltage [mV] vs. time [ns].]
      Even when the same instructions are executed, memory and register usage changes.
      ⇒ EDM changes the inductive noise, and this prevents error reproduction.
