

  1. DACOTA: Post-silicon Validation of the Memory Subsystem in Multi-core Designs
     Andrew DeOrio, Ilya Wagner, Valeria Bertacco
     Advanced Computer Architecture Laboratory, University of Michigan, Ann Arbor
     HPCA 2009

  2. Multi-core Designs
     • Many simple processors
     • Communicate through an interconnect network
     [Chip examples pictured: Intel Polaris, Tilera TILE64]

  3. Complex Multi-core: Memory Subsystem
     • Cache coherence: the ordering of operations to a single cache line
     • Memory consistency: controls the ordering of operations among different memory addresses
     [Diagram: Core 0 ... Core N-1, each with a private L1 cache, connected through an interconnect to a shared L2 cache]
     The memory subsystem is hard to verify
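The cross-address rule is the subtle one. As a concrete illustration (a minimal sketch of my own, not material from the talk), the classic store-buffering litmus test below enumerates every interleaving of two cores that each store to one address and then load from the other. Under sequential consistency (SC) both loads can never return 0; weaker models such as TSO do allow that outcome, and that is exactly the kind of ordering the memory subsystem must get right.

```python
from itertools import permutations

# Thread 0: ST A = 1; LD B        Thread 1: ST B = 1; LD A
THREADS = [
    [("ST", "A"), ("LD", "B")],
    [("ST", "B"), ("LD", "A")],
]

def interleavings():
    # Every way to merge the two program-ordered instruction streams.
    for order in set(permutations([0, 0, 1, 1])):
        idx = [0, 0]
        seq = []
        for t in order:
            seq.append((t, THREADS[t][idx[t]]))
            idx[t] += 1
        yield seq

outcomes = set()
for seq in interleavings():
    mem = {"A": 0, "B": 0}
    loaded = {}
    for tid, (op, addr) in seq:
        if op == "ST":
            mem[addr] = 1
        else:
            loaded[tid] = mem[addr]
    outcomes.add((loaded[0], loaded[1]))

print(sorted(outcomes))          # [(0, 1), (1, 0), (1, 1)]: under SC, (0, 0) never occurs
assert (0, 0) not in outcomes    # both loads returning 0 would violate SC
```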

  4. The Verification Landscape
     Pre-Silicon (98% of bugs exposed, 70% of effort)
       • Logic simulation of the RTL with stimuli; slow: ~Hz
       • Stimuli generators
       • Random testers
       • Formal verification – Intel, AMD
     Post-Silicon (2% of bugs exposed, 30% of effort)
       • Early HW prototypes; fast: at-speed
       • Hard-to-find bugs
       • Microcode patching
       • Ad-hoc
     Runtime (<1% of bugs exposed, 0% of effort)
       • Fast: at-speed
       • Research ideas – Austin, Malik, Sorin
       • Relatively new technology

  5. Post-Silicon Validation Today
     [Diagram: a test generator drives both simulation servers and the silicon prototype; the prototype's final state is compared against the simulation's final state. Simulation sits on the critical path.]
     Simulation is the bottleneck of the validation process

  6. Escaped Bugs in the Memory Subsystem
     • 10% of the bugs that made it to product are related to the memory subsystem
     • Excerpt from a specification update [Nov. 2007] – bug AW38, No Fix: "Instruction fetch may cause a livelock during snoops of the L1 data cache"
     Memory-related bugs are hard to find

  7. Post-Silicon Design Goals
     • High coverage
       – Enable self-detection of memory ordering errors
       – Both coherence and consistency errors
     • Low area impact
     • No performance impact after shipment
     [Diagram: the test generator drives the prototype, whose final state is now self-checked on chip, taking simulation off the critical path]

  8. DACOTA: Data Coloring for Consistency Testing and Analysis
     Post-silicon validation for the memory subsystem
     • Logging
       – stores ordering information
       – uses cache storage temporarily
     • Checking
       – starts when the log storage fills
       – distributed algorithm runs on the individual cores
     [Diagram: cores with private L1 caches over an interconnect and shared L2 cache; the timeline alternates benchmark execution with check phases]

  9. Low Overhead Logging Architecture
     • The DACOTA controller augments the cache controller logic
     • Reconfigures a portion of the cache for the activity log
     [Diagram: each core's cache controller gains DACOTA components; each cache line carries an access vector alongside its data; the cores connect through the interconnect to the L2 cache]

  10. Low Overhead Logging Architecture
      • Attach an access vector to each cache line
        – Tracks the order of memory accesses to that line
        – One entry per core, each storing a sequence ID drawn from that core's counter
        – Example: 1. core 0 store, 2. core 1 store, 3. core 0 store updates core 0's and core 1's entries in turn
      • Allocate space for the activity log
        – Stores a sequence of access vectors in program order
      [Diagram: a cache line's data with its access vector; DACOTA control logic in each core; interconnect and L2 cache]
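As a rough software model of these structures (a sketch under my own naming; the real mechanism lives in the cache controller hardware, and the exact sequence-ID semantics here are a simplifying assumption), each line's access vector holds one sequence ID per core, and every access appends a vector snapshot to that core's activity log:

```python
# Hypothetical model of DACOTA's logging state: an access vector per
# cache line with one sequence-ID entry per core, and a per-core
# activity log of vector snapshots recorded in program order.
NUM_CORES = 2

class LoggingState:
    def __init__(self):
        self.vectors = {}                           # addr -> [seq ID per core]
        self.counters = [0] * NUM_CORES             # one counter per core
        self.logs = [[] for _ in range(NUM_CORES)]  # per-core activity logs

    def access(self, core, op, addr):
        vec = self.vectors.setdefault(addr, [0] * NUM_CORES)
        self.counters[core] += 1          # bump this core's sequence counter
        vec[core] = self.counters[core]   # stamp the line's entry for this core
        # Record a snapshot of the line's access vector in program order.
        self.logs[core].append((op, addr, list(vec)))

# Replaying the slide's example: core 0 store, core 1 store, core 0 store.
state = LoggingState()
state.access(0, "ST", 0xA)
state.access(1, "ST", 0xA)
state.access(0, "ST", 0xA)
assert state.vectors[0xA] == [2, 1]  # core 0's 2nd access, core 1's 1st
```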

  11. Checking Algorithm – On Site
      • Compares the activity logs from the L1 caches
      • Distributed algorithm runs on the cores:
        1. Aggregate the logs
        2. Construct a graph (protocol specific)
           • many protocols supported: SC, TSO, processor consistency, weak consistency
        3. Search the graph for cycles, which indicate an ordering violation
      [Diagram: per-core activity logs of operations (e.g., ST A 1, ST B 1, ST C 2) are aggregated over the interconnect and merged into a single ordering graph]
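Step 3 reduces to ordinary cycle detection. The sketch below (a centralized stand-in for the distributed, per-core algorithm described in the talk) walks the ordering graph, whose edges come from program order plus the same-address orderings recovered from the access vectors, and reports whether a cycle, and hence an ordering violation, exists:

```python
# Cycle detection over the ordering graph. Nodes are (core, operation)
# pairs; edges come from program order and from same-address orderings
# in the activity logs. A cycle means the observed execution admits no
# legal global order under the consistency model.
def has_cycle(edges):
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on DFS stack / done
    color = {}
    for start in edges:
        if color.get(start, WHITE) != WHITE:
            continue
        color[start] = GRAY
        stack = [(start, iter(edges.get(start, ())))]
        while stack:
            node, succs = stack[-1]
            for nxt in succs:
                if color.get(nxt, WHITE) == GRAY:
                    return True           # back edge closes a cycle
                if color.get(nxt, WHITE) == WHITE:
                    color[nxt] = GRAY
                    stack.append((nxt, iter(edges.get(nxt, ()))))
                    break
            else:
                color[node] = BLACK       # all successors explored
                stack.pop()
    return False
```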

  12. Example – Sequential Consistency
      Issue order:                      Actual order:
      [C1] store to address 0xC         [C1] store to address 0xC
      [C0] load from address 0xC        [C0] load from address 0xC
      [C1] load from address 0xB        [C1] load from address 0xA
      [C0] store to address 0xA         [C0] store to address 0xA
      [C0] store to address 0xB         [C0] store to address 0xB
      [C1] load from address 0xA        [C1] load from address 0xB
      The bug: core 1's two loads executed out of order relative to core 0's stores
      [Diagram: each core's cache holds lines 0xA, 0xB, 0xC with tag, data, and access vector; the per-core activity logs record the vectors in program order]

  13. Example – Sequential Consistency
      Activity logs: core 0 logged 0xC, 0xA, 0xB; core 1 logged 0xC, 0xB, 0xA
      [Graph: program-order edges chain each core's operations (C0: LD 0xC → ST 0xA → ST 0xB; C1: ST 0xC → LD 0xB → LD 0xA); address-reference edges link operations to the same address (ST 0xC → LD 0xC, LD 0xA → ST 0xA, ST 0xB → LD 0xB)]
      A cycle in the graph indicates a violation
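Feeding slide 12's scenario into the `has_cycle` sketch above reproduces the detection: program-order edges follow each core's issue order, while address-reference edges record that C1's load of 0xA returned the value from before C0's store, and that its load of 0xB observed C0's store:

```python
# Ordering graph for the slide's example. C1's LD 0xA must precede
# C0's ST 0xA (it read the old value), while C0's ST 0xB must precede
# C1's LD 0xB (it read the new value).
edges = {
    ("C0", "LD 0xC"): {("C0", "ST 0xA")},                    # program order
    ("C0", "ST 0xA"): {("C0", "ST 0xB")},                    # program order
    ("C0", "ST 0xB"): {("C1", "LD 0xB")},                    # address reference
    ("C1", "ST 0xC"): {("C1", "LD 0xB"), ("C0", "LD 0xC")},  # prog. + addr.
    ("C1", "LD 0xB"): {("C1", "LD 0xA")},                    # program order
    ("C1", "LD 0xA"): {("C0", "ST 0xA")},                    # address reference
}
# LD 0xB -> LD 0xA -> ST 0xA -> ST 0xB -> LD 0xB closes a cycle:
assert has_cycle(edges)
```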

  14. Experimental Setup
      • Implemented the checkers in the GEMS simulator
      • Created buggy versions of the cache controllers
      • TSO consistency model, directory-based MOESI cache coherence
      • 16 cores, mesh network, 4 MB L2 cache

  15. Experimental Setup
      • Testbenches
        – Directed random stimulus: memory intensive
        – SPLASH2 benchmarks
      • Injected bugs (one at a time), inspired by bugs found in processor errata:

        Bug              Description                                              Cycles to expose
        shared-store     store to a shared line may not invalidate other caches   0.3M
        invisible-store  store message may not reach all cores                    1.3M
        store-alloc1     store allocation in any core may not occur properly      1.9M
        store-alloc2     store allocation in one core may not occur properly      2.3M
        reorder1         invalid store reordering (all cores)                     1.4M
        reorder2         invalid store reordering (one core)                      2.8M
        reorder3         invalid store reordering (single address, all cores)     2.9M
        reorder4         invalid store reordering (single address, one core)      5.6M

  16. Performance Impact – Random
      [Bar chart: performance overhead (%) on the random testbenches, split into computation and communication overhead; most bars fall below ~120%, one outlier reaches 299%, and the average is marked]

  17. Performance Impact – SPLASH2
      Pre-silicon: 100,000,000% overhead; traditional post-silicon: 10,000%; DACOTA post-silicon: 60%
      [Bar chart: performance overhead (%) on the SPLASH2 benchmarks, split into computation and communication overhead, mostly under 50% with the average marked]
      100x more tests!

  18. Area Impact
      Runtime overhead: pre-silicon 100,000,000%; traditional post-silicon 10,000%; DACOTA post-silicon 60%; runtime (after shipment) 0%
      Area overhead (storage):
        DACOTA                   544 B
        Chen, et al., 2008       617,472 B
        Meixner, et al., 2006    940,032 B
      • Implemented DACOTA in Verilog
      • 0.01% area overhead in the OpenSPARC T1

  19. Communication Overhead
      [Line charts: overhead due to communication (%) vs. core activity log entries (64 to 2048). SPLASH2 benchmarks (radix, lu, cholesky, fft, average) stay below ~40%; random benchmarks (large_1000_shared, barrier, locks, small_0_shared, average) reach up to ~350%]

  20. Checking Algorithm Overhead
      [Line charts: overhead due to the checking algorithm (%) vs. core activity log entries (64 to 2048). SPLASH2 benchmarks (radix, lu, cholesky, fft, average) reach up to ~120%; random benchmarks (large_1000_shared, barrier, locks, small_0_shared, average) up to ~700%. An intermediate log size marks the ideal trade-off against communication overhead]

  21. Related Work
      • Pre-Silicon: Dill, et al., 1992; Abts, et al., 1993; Pong, et al., 1997; German, et al., 2003
        – Formal verification possible for the abstract protocol
        – Insufficient for the implementation
      • Post-Silicon: Whetsel, et al., 1991; Paniccia, et al., 1998; Tsang, et al., 2000; Josephson, et al., 2006
        – Post-Si testing problematic for functional errors
        – DeOrio, et al., 2008: post-Si verification; verifies coherence, but not consistency
      • Runtime: Meixner, et al., 2006; Chen, et al., 2008
        – Effective for protection against transient faults
        – High area overhead

  22. Conclusions
      • DACOTA is an on-chip post-silicon debugging solution for detecting errors in memory ordering
        – Enables self-detection of memory ordering errors
      • Effective at catching bugs
        – 100x more coverage than traditional post-silicon
      • Very low area overhead
        – 0.01% area overhead on the OpenSPARC T1
      • No performance impact to the end user
        – Disabled on shipment
