Vicis: A Reliable Network for Unreliable Silicon


  1. Vicis: A Reliable Network for Unreliable Silicon
     Andrew DeOrio, David Fick, Jin Hu, Valeria Bertacco, David Blaauw, Dennis Sylvester
     Electrical Engineering & Computer Science Department, The University of Michigan, Ann Arbor

  2. Network-on-Chip
     - Growing design complexity → modular architectures
     - Network-on-Chip (NoC)
     - Components (processors, memory, etc.) communicate via routers
     - Scalable bandwidth, inherent redundancy
     [Figure labels: router, processor]

  3. Transistor Wear-Out
     - Technology scaling increases the likelihood of wear-out
     - Reasons include oxide breakdown, electromigration, etc.
     [Figure: The Bathtub Curve, fault rate vs. chip time. Labels: eliminated by burn-in; useful chip lifetime; wear-out; eliminated by margining]

  4. Transistor Wear-Out
     - Technology scaling increases the likelihood of wear-out
     - Reasons include oxide breakdown, electromigration, etc.
     - Technology scaling → shorter useful life
     [Figure: The Bathtub Curve, fault rate vs. chip time. Labels: eliminated by burn-in; useful chip lifetime; wear-out; eliminated by margining]

  5. Transistor Wear-Out
     - Technology scaling increases the likelihood of wear-out
     - Reasons include oxide breakdown, electromigration, etc.
     [Figure: The Bathtub Curve, fault rate vs. chip time. Labels: eliminated by burn-in; useful chip lifetime; wear-out; eliminated by margining; reclaimed by Vicis]

  6. System Response to Wear-Out
     - No Protection: failure at first fault
     - Triple Modular Redundancy: failure after several faults
     - Vicis: graceful performance degradation
     [Figure: performance vs. errors for each of the three approaches]
     - Detection: detect if a fault has occurred
     - Diagnosis: diagnose what fault has occurred
     - Reconfiguration: reconfigure the network to account for the fault
     - Recovery: recover and resume normal operation
     [Figure labels above the four stages: Invariants, Vicis, Checkpointing]

  7. Fault Tolerance Strategy
     - If faults are isolated, fault-free lanes may continue to operate
     [Figure: router block diagram. Labels: input, decoder, FIFO, crossbar, routing table, output]

  8. Outline
     - Architecture Overview
     - Diagnostic Approach
     - Experimental Results
     - Conclusion

  9. Network Assumptions
     - Wormhole routing
     - 2D mesh or torus
     - Static routing
     - No virtual channels
     - Hard fault injection
     [Figure: a packet made of head, data, and tail flits; router i with N, S, E, W, and local ports; a routing table mapping each destination (0 ... n-1) to an output direction, e.g. 0 → N, 1 → S, i → local, n-1 → W]
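
The assumptions on this slide (wormhole routing, a static per-router routing table, no virtual channels) can be illustrated with a minimal sketch. The class names `Flit` and `InputPort`, the four-entry table, and the single-input-port model are hypothetical and are not taken from the Vicis design; the sketch only shows that the head flit selects an output direction from the table and that the data and tail flits follow it.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Flit:
    kind: str                   # "head", "data", or "tail"
    dest: Optional[int] = None  # only the head flit names a destination


class InputPort:
    """One router input port doing wormhole routing (illustrative model)."""

    def __init__(self, routing_table: Dict[int, str]):
        self.routing_table = routing_table       # static: dest -> direction
        self.held_output: Optional[str] = None   # output held by current packet

    def route(self, flit: Flit) -> str:
        if flit.kind == "head":
            # The head flit looks up the direction once for the whole packet.
            self.held_output = self.routing_table[flit.dest]
        if self.held_output is None:
            raise ValueError("data or tail flit arrived without a head flit")
        output = self.held_output
        if flit.kind == "tail":
            # The tail flit releases the output for the next packet.
            self.held_output = None
        return output


# Hypothetical table for a router whose own id is 2.
table = {0: "N", 1: "S", 2: "local", 3: "W"}
port = InputPort(table)
packet = [Flit("head", dest=3), Flit("data"), Flit("data"), Flit("tail")]
directions = [port.route(f) for f in packet]
```

Every flit of the packet leaves through the direction chosen by its head flit, which is why a single faulty output can block an entire packet unless the table is rewritten.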

  10. Vicis Reliability Features
      [Figure: router block diagram. Labels: routing table, FIFO, crossbar, decoder, output, input]

  11. Vicis Reliability Features
      [Figure: router block diagram. Labels: routing table, FIFO, crossbar, decoder, output, input, ECC]

  12. Vicis Reliability Features
      [Figure: router block diagram. Labels: routing table, FIFO, crossbar, decoder, bypass bus, output, input, ECC, controller]

  13. Vicis Reliability Features
      [Figure: router block diagram. Labels: routing table, FIFO, port swapper, crossbar, decoder, bypass bus, output, input, ECC, controller]

  14. Vicis Reliability Features
      [Figure: router block diagram. Labels: BIST Controller, routing table, FIFO, port swapper, crossbar, decoder, bypass bus, output, input, ECC, controller]

  15. Vicis Reliability Features
      [Figure: router block diagram. Labels: BIST Controller, routing table, config. table, FIFO, port swapper, crossbar, decoder, bypass bus, output, input, ECC, controller]

  16. Vicis Reliability Features
      [Figure: router block diagram. Labels: BIST Controller, routing table, config. table, Distributed Algorithm Engine, FIFO, port swapper, crossbar, decoder, bypass bus, output, input, ECC, controller]

  17. Input Port Swapping
      - Partial crossbar gives multiple connection options
      - Priority given to local port in order to increase # of available IPs
      [Figure labels: input ports (North, West, Local, South, East), port swapper, FIFO, decoder]

  18. Input Port Swapping
      [Figure: routers (R) with faulty ports (X); legend: input port, output port. Labels: Input Port Swap, Two Functional Links, Three Functional Links]

  19. Bypass Bus
      - Provides alternative path around crossbar
      - Round-robin arbiter inside crossbar controller
      - No penalty for single user, additional users must stall until free
      [Figure labels: crossbar, bypass bus, controller]
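
A round-robin arbiter of the kind this slide mentions can be sketched as follows. This is a behavioral illustration only: the class name `RoundRobinArbiter`, the five-port count, and the one-grant-per-cycle interface are assumptions, not details from the Vicis crossbar controller.

```python
from typing import Optional, Set


class RoundRobinArbiter:
    """Grants a shared bus to one requester per cycle, rotating priority."""

    def __init__(self, num_ports: int):
        self.num_ports = num_ports
        # Start so that port 0 has the highest priority on the first cycle.
        self.last_granted = num_ports - 1

    def grant(self, requests: Set[int]) -> Optional[int]:
        # Search from the port after the last winner, wrapping around.
        for offset in range(1, self.num_ports + 1):
            port = (self.last_granted + offset) % self.num_ports
            if port in requests:
                self.last_granted = port
                return port
        return None  # nobody asked for the bus this cycle


arbiter = RoundRobinArbiter(num_ports=5)

# A single user is granted every cycle, so it never stalls.
single_user = [arbiter.grant({2}) for _ in range(3)]

# Two users alternate; whichever is not granted stalls for that cycle.
two_users = [arbiter.grant({1, 3}) for _ in range(4)]
```

With one requester the bus behaves like a dedicated path; with two, each requester gets the bus on alternate cycles, matching the slide's point that only additional users pay a stall penalty.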

  20. Error Correction Codes
      [Figure: ECC-to-ECC path between two routers. Labels: BIST, ECC/DEC, swap, FIFO, XB, bbus, Link]
      - Six available paths between two routers
      - Only one fault may be corrected by ECC
      - Fault information for five unit types must be considered: crossbar, bypass bus, network link, input port swapper, FIFOs
      - All configurations must be performed simultaneously
      - Configurations affect one another, network-wide

  21. Error Correction Codes
      [Figure: ECC-to-ECC path between two routers with a single bit fault. Labels: BIST, ECC/DEC, swap, FIFO, XB, bbus, Link]
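
The slides do not say which code Vicis uses, so the following sketch substitutes a textbook Hamming(7,4) code purely to illustrate the "only one fault may be corrected" constraint. The function names and the `stuck_at` helper are hypothetical; a stuck wire anywhere on the ECC-to-ECC path is modeled as one codeword bit forced to a fixed value.

```python
from typing import List


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits as [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword: List[int]) -> List[int]:
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the bad bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


def stuck_at(codeword: List[int], wire: int, value: int) -> List[int]:
    """Model a hard fault: one wire on the path is stuck at `value`."""
    faulty = list(codeword)
    faulty[wire] = value
    return faulty


data = [1, 0, 1, 1]
sent = hamming74_encode(data)

# One stuck wire on the path is corrected by the decoder.
one_fault = hamming74_decode(stuck_at(sent, wire=4, value=1))

# Two flipped bits exceed what a single-error-correcting code can repair.
two_faults = hamming74_decode(stuck_at(stuck_at(sent, 0, 1), 3, 1))
```

Because a second fault on the same path defeats the code, the reconfiguration has to choose paths so that each one contains at most one uncorrected fault, which is why fault information from every unit on the path matters.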

  22. Network Re-routing [DATE 2009]
      - Distributed routing algorithm
      - Can route around an arbitrary number of faulty links
      - Requires no virtual channels
      - Implemented in fewer than 300 gates
      [Figure: mesh of routers with faulty links marked. Labels: Destination, Routed]
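
The slide does not describe the distributed algorithm itself, so the sketch below is not that algorithm. It swaps in a plain centralized breadth-first search on a 2D mesh to show what "routing around faulty links" means for the per-router tables; it does not address deadlock avoidance or the 300-gate hardware budget. The names `build_tables` and `walk` and the link representation are hypothetical.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

Node = Tuple[int, int]
DIRS = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0)}
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}


def build_tables(width: int, height: int,
                 faulty_links: Set[FrozenSet[Node]]) -> Dict[Node, Dict[Node, str]]:
    """For every destination, search outward and record the way back."""
    nodes = [(x, y) for y in range(height) for x in range(width)]
    tables: Dict[Node, Dict[Node, str]] = {n: {} for n in nodes}
    for dest in nodes:
        tables[dest][dest] = "local"
        frontier = deque([dest])
        while frontier:
            cur = frontier.popleft()
            for direction, (dx, dy) in DIRS.items():
                nxt = (cur[0] + dx, cur[1] + dy)
                if not (0 <= nxt[0] < width and 0 <= nxt[1] < height):
                    continue  # off the edge of the mesh
                if frozenset((cur, nxt)) in faulty_links or dest in tables[nxt]:
                    continue  # broken link, or neighbor already has a route
                # nxt reaches dest by stepping back toward cur.
                tables[nxt][dest] = OPPOSITE[direction]
                frontier.append(nxt)
    return tables


def walk(tables: Dict[Node, Dict[Node, str]], src: Node, dest: Node) -> Optional[List[Node]]:
    """Follow the tables hop by hop; None means dest is unreachable."""
    path = [src]
    while path[-1] != dest:
        direction = tables[path[-1]].get(dest)
        if direction is None:
            return None
        dx, dy = DIRS[direction]
        path.append((path[-1][0] + dx, path[-1][1] + dy))
    return path


broken = {frozenset(((0, 0), (1, 0)))}
tables = build_tables(3, 3, broken)
detour = walk(tables, (0, 0), (1, 0))
```

With the direct link between (0, 0) and (1, 0) broken, the recomputed tables send the packet on a three-hop detour instead of the one-hop direct route.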

  23. Hard Fault Diagnosis
      [Figure: a Built-in-Self-Test block with Unit 1 and Unit 2, shown in two arrangements. Labels: Stuck-0, "missed faults!"]

  24. Hard Fault Diagnosis
      Pattern Based Testing
      - Error unit, crossbar controller, routing table, decode/ECC, output ports, FIFO control
      [Figure: LFSR → unit under test → signature compared (=) with expected signature → to configuration table]
      Datapath Testing
      - FIFO datapath, input port swapper, links, crossbar, bypass bus, configuration table
      [Figure: pre-recorded test → unit under test → bit flip counter → save for reconfiguration routine]
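
The pattern-based flow in the figure (LFSR patterns in, response signature compared against an expected signature) can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the 16-bit tap set, the seed, the rotate-and-XOR compactor standing in for a real signature register, and the pass-through "unit under test". Real compactors can alias, meaning a faulty unit can occasionally produce the expected signature.

```python
from typing import Callable

MASK = 0xFFFF  # 16-bit patterns and signatures (an assumed width)


def lfsr_step(state: int) -> int:
    """Advance a 16-bit Fibonacci LFSR (taps at bits 16, 14, 13, 11)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def signature(unit: Callable[[int], int], seed: int = 0xACE1,
              patterns: int = 16) -> int:
    """Drive `unit` with LFSR patterns and compact its responses."""
    state, sig = seed, 0
    for _ in range(patterns):
        response = unit(state) & MASK
        # Rotate the running signature left by one bit, then fold in the response.
        sig = (((sig << 1) | (sig >> 15)) & MASK) ^ response
        state = lfsr_step(state)
    return sig


def healthy_unit(x: int) -> int:
    return x  # hypothetical unit under test: passes its input through


def faulty_unit(x: int) -> int:
    return x & ~0x0001  # same unit with output bit 0 stuck at 0


expected = signature(healthy_unit)             # recorded from a known-good unit
passes = signature(healthy_unit) == expected
fault_detected = signature(faulty_unit) != expected
```

In the slide's flow, the outcome of the comparison is what gets written to the configuration table so that reconfiguration can avoid the failed unit.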

  25. Experimental Setup
      - 3x3 Torus
      - 32-bit data flits, 32 flit buffers
      - Implemented in Verilog
      - Synthesized, automatic place and route in 45nm
      - Reliability results
      - Implemented in C++
      - Performance results
      - Injected stuck-at faults on gate outputs
      - Weighting based on gate area
      - 10,000 packets per test, random uniform traffic
      - Parallel packet injection
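
Area-weighted injection of stuck-at faults, as listed in the setup, could look roughly like the sketch below. The gate names, areas, and the `inject_faults` function are invented for illustration and do not come from the authors' tooling; the sketch samples with replacement, so the same gate can be picked more than once.

```python
import random
from typing import Dict, List, Tuple


def inject_faults(gate_areas: Dict[str, float], num_faults: int,
                  seed: int = 0) -> List[Tuple[str, int]]:
    """Pick gate outputs to break, with probability proportional to gate area."""
    rng = random.Random(seed)  # seeded so an experiment can be repeated
    gates = list(gate_areas)
    weights = [gate_areas[g] for g in gates]
    chosen = rng.choices(gates, weights=weights, k=num_faults)
    # Each selected gate output is stuck at 0 or 1 with equal probability.
    return [(gate, rng.choice((0, 1))) for gate in chosen]


# Hypothetical netlist fragment: gate name -> cell area (arbitrary units).
areas = {"nand2_u1": 1.0, "dff_u2": 4.5, "mux4_u3": 3.0}
faults = inject_faults(areas, num_faults=10)
```

Weighting by area means a large flip-flop is proportionally more likely to be hit than a small NAND gate, which approximates defects landing uniformly across the die.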

  26. Results – Router Reliability
      - Key advantage: keep processors connected to network

  27. Results – Network Performance
      [Figure: available routers (0 to 9) vs. network faults (0 to 100)]
      - ~50% of routers work despite 100 faults
      - ~1/2000 gates broken

  28. Results – Network Performance
      [Figure: available routers (0 to 9) and normalized network throughput (0.0 to 1.0, 5th-95th percentile) vs. network faults (0 to 100)]
      - Throughput decreases as routers reconfigure
      - Throughput increases as network shrinks

  29. Results – Network Reliability
      [Figure: likelihood of correct operation (%) vs. network faults (0 to 100). Curves: Vicis, TMR at different granularities, BulletProof*]
      - Annotations: 42% area overhead; 200+% area overhead
      - Key advantage: low area overhead
      - Key advantage: extended MTTF
      * BulletProof: A Defect-tolerant CMP Switch Architecture, Constantinides, et al. 2006

  30. Conclusion
      - Vicis can tolerate a large number of faults
      - Vicis provides much greater reliability than NMR-based solutions
      - Vicis: constant-reliability, probabilistic performance
      - NMR: constant-performance, probabilistic reliability
      - Vicis has an overhead of 42% versus 100+% for NMR
