Machine Learning-based Anomaly Detection for Post-silicon Bug Diagnosis

  1. Machine Learning-based Anomaly Detection for Post-silicon Bug Diagnosis
     Andrew DeOrio, Qingkun Li, Matthew Burgess and Valeria Bertacco
     University of Michigan / University of Illinois

  2. Verification trends
     Source: Wilson Research Group and Mentor Graphics, 2010 Functional Verification Study

  3. Increasing post-silicon validation
     [Chart: design and pre-silicon verification effort vs. post-silicon validation effort]
     Source: Bob Barton, Intel. Invited talk at GSRC.

  4. Post-silicon validation
     Pre-silicon -> Post-silicon -> Product. Goal: locate the bug.
     + Fast prototypes    - Poor observability
     + High coverage      - Slow off-chip transfer
     + Test full system   - Noisy
     + Find deep bugs     - Intermittent bugs

  5. Post-silicon and credit cards
     Same test, many different results -> difficult to locate the bug! (illustrated with an instruction trace: pushl %ebp, movl %ebp)
     Same card, many different transactions -> difficult to locate the fraud!

  6. Post-silicon and credit cards
     Compare a failing test against runs of the same test to find the anomalous time and location. Anomaly?
     Likewise, compare a new transaction against the same card's prior transactions.

  7. Post-silicon and credit cards
     A clustering algorithm pinpoints the anomalous time and location in the failing test.
     Training data: positive (passing) examples; the unknown examples come from the test under diagnosis.
     Each example is a vector of per-window feature values, e.g. time@1=1 or time@1=2, for each signal (signal A, signal B, …); a sketch follows.
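     As a hypothetical illustration of per-window features like time@1 (the toggle-count feature, the window size, and all names below are assumptions, not from the slides), one signal's trace might be summarized per time window like this, in Python:

         # Hypothetical per-window feature extraction; the slides do not
         # spell out the exact feature definition.
         import numpy as np

         def extract_features(trace, window):
             """Split one signal's 0/1 trace into fixed-size time windows
             and summarize each window by its toggle (transition) count."""
             n = len(trace) // window
             windows = np.asarray(trace[: n * window]).reshape(n, window)
             return np.abs(np.diff(windows, axis=1)).sum(axis=1)

         # Example: signals A and B over 8 cycles, window size 4.
         sig_a = [0, 1, 1, 0, 0, 0, 1, 1]
         sig_b = [1, 1, 1, 1, 0, 1, 0, 1]
         features = np.stack([extract_features(s, 4) for s in (sig_a, sig_b)],
                             axis=1)  # rows = time windows, cols = signals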

  8. Learning clusters
     The clustering algorithm learns clusters from the feature values of passing examples.
     [Plot: one test, 1st time window; signal A feature value vs. signal B feature value, showing the learned clusters]
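     The slides do not name the clustering algorithm; as an illustrative stand-in, this minimal sketch learns clusters from passing-run feature values with k-means and records each cluster's radius (all data and parameters are made up):

         import numpy as np
         from sklearn.cluster import KMeans

         rng = np.random.default_rng(0)
         # Rows: passing examples for one time window;
         # columns: feature values of signal A and signal B.
         passing = np.vstack([rng.normal((1.0, 2.0), 0.1, (50, 2)),
                              rng.normal((2.0, 1.0), 0.1, (50, 2))])

         model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(passing)
         # Radius of each cluster: farthest member's distance to its center.
         dists = model.transform(passing)[np.arange(len(passing)), model.labels_]
         radius = np.array([dists[model.labels_ == c].max()
                            for c in range(model.n_clusters)])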

  9. Searching for anomalies
     Unknown examples are added after clustering. If they fall inside the learned clusters: no bug.
     [Plot: one test, 1st time window; signal A feature value vs. signal B feature value]

  10. Searching for anomalies
     If unknown examples fall outside the clusters, and the number of anomalies exceeds a threshold: bug found.
     [Plot: one test, 2nd time window; signal A feature value vs. signal B feature value]
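     Continuing the hypothetical sketch above, an unknown example counts as an anomaly when it lies outside every learned cluster, and the window is flagged only when the anomaly count exceeds the threshold (the threshold value here is illustrative):

         # Flag a time window as buggy when enough unknown examples fall
         # outside every learned cluster (THRESHOLD is illustrative).
         def count_anomalies(model, radius, unknown):
             d = model.transform(unknown)        # distance to each center
             outside = (d > radius).all(axis=1)  # beyond every radius
             return int(outside.sum())

         unknown = np.array([[1.0, 2.0],   # inside a cluster: no bug
                             [3.5, 3.5]])  # outside all clusters: anomaly
         THRESHOLD = 0
         if count_anomalies(model, radius, unknown) > THRESHOLD:
             print("bug found in this time window")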

  11. Clustering in X,000 dimensions
     • Each signal is a dimension
       – Circular clusters become hyper-spheres
       – High dimensionality is a challenge
     • In practice (sketched below):
       – Cap the number of signals in one clustering set (500)
       – Group signals by module(s) (100-500 signals)
       – Apply clustering to each group
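     A rough sketch of that grouping step, assuming hierarchical signal names whose top-level prefix identifies the module (the 500-signal cap is from the slide; the naming scheme is an assumption):

         from collections import defaultdict

         MAX_GROUP = 500  # cap on signals per clustering set (per slide 11)

         def group_signals(signal_names):
             by_module = defaultdict(list)
             for name in signal_names:
                 # Group by top-level module prefix, e.g. "pcx.gnt0" -> "pcx".
                 by_module[name.split(".")[0]].append(name)
             groups = []
             for sigs in by_module.values():
                 # Split oversized modules so no set exceeds the cap.
                 groups += [sigs[i:i + MAX_GROUP]
                            for i in range(0, len(sigs), MAX_GROUP)]
             return groups

         # Clustering is then applied to each group independently.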

  12. Experimental Setup
     10 testcases, 100 random seeds: variable memory delay, crossbar random traffic
     Monitored 41,743 top-level control signals
     Training data: 1000 passing HW runs; unknown data: 1000 buggy runs
     10 injected bugs, e.g., functional bug in PCX, electrical error in Xbar

  13. Bug injection
     Bug         Description
     PCX_gnt SA  Stuck-at in PCX grant
     Xbar elect  Electrical error in crossbar
     BR fxn      Functional bug in branch logic
     MMU fxn     Functional bug in memory controller
     PCX_atm SA  Stuck-at in PCX atomic grant
     PCX fxn     Functional bug in PCX
     XBar combo  Combined electrical errors in Xbar/PCX
     MCU combo   Combined electrical errors in mem/PCX
     MMU combo   Combined functional bugs in MMU/PCX
     EXU elect   Electrical error in execute unit

  14. Bug detection on OpenSPARC T2
     [Chart: percentage of testcases per bug. Bug detected: exact signal detected, or other signals detected. Bug not detected: bug signal not observable, or no bug effect (false negatives); false positives also shown.]
     9/10 bugs caught

  15. Bug signal vs. noise
     More training data -> more accuracy

  16. Conclusions
     • Machine learning automatically localizes the bug's time and location
     • Leverages a statistical approach to tolerate noise
     • Effective for a variety of bugs: functional, electrical and manufacturing
       – 336 cycles, 347 signals on average
