Post-Silicon Bug Diagnosis with Inconsistent Executions
Andrew DeOrio, Daya Shanker Khudia, Valeria Bertacco
University of Michigan
ICCAD’11, 9 November 2011
Impact of errors
• Functional bugs
– FDIV bug ($475M): "17 Jan 1995: Intel announces a pre-tax charge of $475M against earnings for replacement of flawed processors"
– Kris Kaspersky: "Remote Code Execution Through Intel CPU Bugs"
• Electrical failures
– 1024-bit RSA secret key extracted in 100 hours
• Transistor faults
– $1B: "Sandy Bridge Bug 2X Costly as Pentium FDIV Bug"
Post-silicon validation
• Flow: Pre-Silicon -> Post-Silicon -> Product
• Post-silicon: debug prototypes before shipment
+ Fast prototypes
+ High coverage
+ Test full system
+ Find deep bugs
- Poor observability
- Slow off-chip transfer
- Noisy
- Intermittent bugs
Post-silicon bugs
• Intermittent post-silicon bugs are challenging
– The same test does not expose the bug in every run
– Each run exhibits different behaviors
• Our goal: locate intermittent bugs
[Figure: the same post-silicon test (pushl %ebp, movl %ebp) runs on the post-si platform and yields many different results: difficult to debug!]
Post-silicon debugging
• Scan chains, logic analyzers [Whetsel 1991, Abramovici 2006, Dahlgren 2003]
– Limited observability
– Large manual effort
• Processor-core specific debugging [Park 2009]
– Limited areas of chip
– Limited time to catch bug
• Deterministic replay [Gao 2009, Li 2010, Yang 2008]
– HW/performance overhead
– Perturbation may prevent bug manifestation
BPS: “Bug Positioning System”
• Localize failures
– Time (cycle) and space (signals)
• Tolerate non-repeatable executions
– Statistical approach
• Scalable, adaptable to many HW subsystems
[Figure: BPS flow. HW logging: a post-si test runs on the post-si platform, where hw sensors record signatures over time for passing and failing runs. SW post-analysis: a band model reports the bug location and the bug occurrence.]
Signatures
• Goal: summarize signal value
• Encodings (Hamming, CRC, etc.)
– Large hardware
– Small change in input -> large change in output
• Counting schemes (time@1, toggles)
[Figure: waveform of signal A over two windows, with time@1=1 in one window and time@1=2 in the other]
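The two counting schemes named on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `time_at_1` and `toggles`, the sample waveform, and the choice to ignore toggles that cross a window boundary are all assumptions made for the example.

```python
from typing import List

def time_at_1(samples: List[int], window: int) -> List[int]:
    """Per window, count the cycles in which the signal is 1."""
    return [sum(samples[i:i + window]) for i in range(0, len(samples), window)]

def toggles(samples: List[int], window: int) -> List[int]:
    """Per window, count value changes between consecutive cycles.

    Assumption: a change across a window boundary is not counted.
    """
    counts = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        counts.append(sum(1 for a, b in zip(chunk, chunk[1:]) if a != b))
    return counts

# Hypothetical waveform of one signal, sampled once per cycle (two 4-cycle windows).
signal_a = [0, 1, 0, 0, 0, 1, 1, 0]

print(time_at_1(signal_a, 4))  # [1, 2]
print(toggles(signal_a, 4))    # [2, 2]
```

Both schemes reduce a window of samples to a single small integer, which is why they map onto simple counters in hardware.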
Statistical approach
• Traditional debugging: compare a passing testcase with a failing testcase (match?)
• Statistical debugging: compare passing testcases with failing testcases, because the same test can yield different results
• Distribution of signature values, where signature value = time@1 / window size
[Figure: distribution (0 to 1) vs. signature value (0 to 1) for passing and failing testcases]
Signatures for statistical approach
• Characterize populations of signatures
• Statistical separation between noise and bug
[Figure: two plots of distribution vs. signature value (0 to 1), each showing passing testcases and failing testcases. Example: CRC. Example: time@1.]
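The contrast between the CRC and time@1 examples can be illustrated with two runs that differ only by timing noise. This is a sketch under assumptions: `zlib.crc32` stands in for whatever encoding hardware would compute, both signatures are normalized to the 0 to 1 range used on the slide, and the two waveforms are made up for the example.

```python
import zlib

def crc_signature(samples):
    """CRC-32 of the window, scaled to [0, 1]. Stand-in for an encoding-based signature."""
    return zlib.crc32(bytes(samples)) / 0xFFFFFFFF

def time_at_1_signature(samples):
    """Fraction of cycles in the window during which the signal is 1."""
    return sum(samples) / len(samples)

# Two hypothetical 16-cycle windows of the same signal. In run_b one pulse
# arrives a cycle later, as could happen with a variable memory delay.
run_a = [0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
run_b = [0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

print(time_at_1_signature(run_a), time_at_1_signature(run_b))  # identical values
print(crc_signature(run_a), crc_signature(run_b))              # different values
```

Because the counting signature moves only slightly under small timing shifts, the signatures of noisy passing runs stay clustered, which leaves room for a buggy population to stand apart.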
Signature hardware
• Measure time@1
• Use custom hardware or reuse existing debug infrastructure
[Figure: chip under test with one register (with enable, EN) per monitored signal, feeding a memory buffer of 11KB for 100 signals x 100 windows; the buffer is read off-chip through the debug port]
BPS: “Bug Positioning System”
1. Hardware logging
2. Software post-analysis
Bug band model
• Passing band and failing band: µ ± 2σ of the signature values in each window
• Bug band: where the failing band separates from the passing band
• Markers: bug occurrence, bug detected
[Figure: signature value (0 to 1) vs. window (0 to 24), showing the passing band, the failing band and the bug band; behavior of 1 signal from the MEM stage of a 5-stage pipeline processor]
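The band model can be sketched for a single signal as follows. Only the µ ± 2σ bands come from the slide; measuring the bug band as the width of the part of the failing band that lies outside the passing band is an assumption of this sketch, and the signature values are synthetic.

```python
from statistics import mean, stdev

def band(values, k=2.0):
    """Return the (low, high) band mu -/+ k*sigma over the runs of one window."""
    mu, sigma = mean(values), stdev(values)
    return mu - k * sigma, mu + k * sigma

def bug_band(passing, failing, k=2.0):
    """Width of the part of the failing band outside the passing band (assumed definition)."""
    p_lo, p_hi = band(passing, k)
    f_lo, f_hi = band(failing, k)
    above = max(0.0, f_hi - max(p_hi, f_lo))
    below = max(0.0, min(p_lo, f_hi) - f_lo)
    return above + below

# Synthetic signature values of one signal: one inner list per window,
# one value per run. The failing runs diverge from window 2 onward.
passing_runs = [[0.50, 0.52, 0.48, 0.51], [0.40, 0.42, 0.41, 0.39],
                [0.60, 0.58, 0.61, 0.59], [0.30, 0.31, 0.29, 0.30]]
failing_runs = [[0.49, 0.51, 0.50, 0.52], [0.41, 0.40, 0.41, 0.40],
                [0.90, 0.88, 0.92, 0.91], [0.70, 0.72, 0.69, 0.71]]

bands = [bug_band(p, f) for p, f in zip(passing_runs, failing_runs)]
first_window = next((w for w, b in enumerate(bands) if b > 0.0), None)
print(first_window)  # 2
```

In the first two windows the failing band sits inside the passing band, so the bug band is zero; the first window with a non-zero bug band gives the time estimate for this signal.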
SW post-analysis
[Figure: signatures of the passing group and of the failing group, each organized by signal (signalA, signalB, signalC, ...) and by window, are combined in four numbered steps (1 to 4) into a bug band per signal and window]
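One way to turn per-signal, per-window bug bands into a location in time and space is to scan windows in order and report the first window in which any signal exceeds a threshold. This is a hedged sketch, not the authors' procedure: the function `localize`, the ranking by band size, and the example values are assumptions.

```python
def localize(bug_bands, threshold):
    """Return (window, signals) for the earliest window in which at least one
    signal's bug band exceeds the threshold, or None if no window qualifies.

    bug_bands maps a signal name to its list of per-window bug band values.
    Signals are ranked by decreasing bug band in that window.
    """
    n_windows = max((len(v) for v in bug_bands.values()), default=0)
    for w in range(n_windows):
        hits = [name for name, v in bug_bands.items()
                if w < len(v) and v[w] > threshold]
        if hits:
            hits.sort(key=lambda name: -bug_bands[name][w])
            return w, hits
    return None

# Hypothetical bug band values for three signals over four windows.
example = {
    "signalA": [0.00, 0.00, 0.07, 0.05],
    "signalB": [0.00, 0.00, 0.00, 0.20],
    "signalC": [0.00, 0.01, 0.02, 0.00],
}

print(localize(example, threshold=0.05))  # (2, ['signalA'])
```

The returned window approximates when the bug manifests, and the returned signals approximate where.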
Experimental setup
• 10 testcases
• 10 random seeds: variable memory delay, crossbar random traffic
• 10 bugs: e.g., functional bug in PCX, electrical error in Xbar
• Monitored 41,744 top-level control signals
• 100 passing runs and 1000 buggy runs -> BPS HW -> BPS SW -> detected signals, detection latency
Signal Localization
• Legend: √ = found; √+ = exact signal; n.b. = no bug; f.p. = false pos.; f.n. = false neg.
• Bugs: MMU combo, MCU combo, Xbar combo, PCX atm SA, PCX gnt SA, Xbar elect, MMU fxn, EXU elect, PCX fxn, BR fxn
• Callout: bug signal not observable
• Testcases (one entry per bug, in the column order of the slide):
– blimp_rand: √+ √ √ √ √+ √+ √+ f.n. √+ f.n.
– fp_addsub: n.b. f.p. √ √ √ √+ f.p. n.b. √+ f.p.
– fp_muldiv: n.b. f.p. √ √ √+ f.p. f.p. √+ f.p. √
– isa2_basic: n.b. f.n. √ n.b. √+ √+ √+ √+ n.b. f.n.
– isa3_asr_pr: n.b. √ √ f.n. √+ √ √+ √+ √ √
– isa3_window: n.b. √ √ n.b. √+ √ f.n. f.n. n.b. √
– ldst_sync: n.b. √+ √ √ √+ √+ √+ √+ √+ n.b.
– mpgen_smc: n.b. √+ √ √ √+ √+ √+ √+ √+ √+
– n2_lsu_asi: n.b. f.n. √ f.n. √+ √+ √+ √+ √+ n.b.
– tlu_rand: n.b. √+ √ √ √+ √+ √+ √+ √+ √+
Signal Localization (same results table)
• Callout: 3 noisy signals excited by floating point benchmarks
Signal Localization (same results table)
• Callout: wider effects, easier to catch
Time to detect bug
[Figure: bar chart of Δ time from bug injection to detection (cycles, 0 to 6,000) for PCX gnt SA, XBar elect, BR fxn, MMU fxn, PCX atm SA, PCX fxn, XBar combo, MCU combo, MMU combo, EXU elect and AVERAGE]
• Average: 1,273 cycles
Number of signals detected
[Figure: bar chart of number of signals detected (0 to 200) for PCX gnt SA, XBar elect, BR fxn, MMU fxn, PCX atm SA, PCX fxn, XBar combo, MCU combo, MMU combo, EXU elect and AVERAGE]
• Average: 75 signals (0.2%)
Threshold selection
• Threshold trade-off between false negatives and false positives
[Figure: sum total (0 to 60) of false negatives, false positives and their sum vs. bug detection threshold on the bug band (0 to 1)]
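The trade-off on this slide can be reproduced in miniature by sweeping the threshold and counting both kinds of error. This is an illustrative sketch: the function `sweep`, the rule that a case is flagged when its bug band exceeds the threshold, and the labeled values are all assumptions, and the numbers are synthetic rather than taken from the experiments.

```python
def sweep(cases, thresholds):
    """For each threshold return (threshold, false_positives, false_negatives, total).

    cases is a list of (bug_band, is_buggy) pairs. A case is flagged when its
    bug band is strictly greater than the threshold.
    """
    rows = []
    for t in thresholds:
        fp = sum(1 for value, buggy in cases if value > t and not buggy)
        fn = sum(1 for value, buggy in cases if value <= t and buggy)
        rows.append((t, fp, fn, fp + fn))
    return rows

# Synthetic cases: (bug band value, whether a bug was really present).
cases = [(0.05, False), (0.15, False), (0.35, False),
         (0.10, True), (0.45, True), (0.70, True), (0.90, True)]

rows = sweep(cases, [i / 10 for i in range(11)])
best = min(rows, key=lambda row: row[3])
print(best)  # (0.4, 0, 1, 1)
```

A low threshold flags noise as bugs (false positives), a high threshold misses real bugs (false negatives), and the sum is smallest somewhere in between.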
Area overhead
• Option 1: reuse existing debug structures
• Option 2: add counters and memory buffer
– Record a few signals at a time
– 11KB for 100 signals x 100 windows @ 9-bit precision
– 1.35 mm² with 65nm library
– 0.4% of OpenSPARC
[Figure: chip under test with one register (with enable, EN) per monitored signal, a memory buffer, and off-chip transfer through the debug port]
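The 11KB figure follows from the three numbers on the slide, as this short check shows. The variable names are illustrative, and the remark about the largest storable count assumes the 9 bits hold a plain unsigned counter value, which the slide does not state.

```python
signals = 100        # signals recorded at a time
windows = 100        # windows stored per signal
bits_per_entry = 9   # precision of each stored signature

total_bits = signals * windows * bits_per_entry
total_bytes = total_bits / 8

print(total_bits)          # 90000
print(total_bytes)         # 11250.0, i.e. roughly 11KB
print(total_bytes / 1024)  # about 10.99 when counted in units of 1024 bytes

# Assumption: an unsigned 9-bit counter can represent counts from 0 to 511.
max_count = 2 ** bits_per_entry - 1
print(max_count)           # 511
```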
Conclusions
• BPS automatically localizes bug time and location
• Leverages a statistical approach to tolerate noise
• Effective for a variety of bugs: functional, electrical and manufacturing
– 1,273 cycles, 75 signals on average