
NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures (University of Cyprus)



  1. NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures • University of Cyprus, The Multicore Computer Architecture Laboratory (multiCAL), Ξ - Computer Architecture Research Group (Ξ - CARCH) • EuroCloud FP7 Project • International Symposium on Microarchitecture, December 3 2012, Vancouver, Canada • NoCAlert (MICRO-2012)

  2. Outline • Necessity of Networks-on-Chip (NoCs) • Reliability and NoCs • The NoCAlert Approach: Invariance Checking • Identifying invariances and examples • Evaluation • Results • Conclusion – Future Work

  3. Outline • Necessity of Networks-on-Chip (NoCs) • Reliability and NoCs • The NoCAlert Approach: Invariance Checking • Identifying invariances and examples • Evaluation • Results • Conclusion – Future Work

  4. The Network-on-Chip (NoC) paradigm (image courtesy of C. Daniloff) • On-chip interconnection fabric (backbone) to connect all nodes • Modular design • Structured interconnect layout • Scalable and efficient • Packet-based communication

  5. Core Number Increases • Following Moore’s law, the number of transistors per chip doubles approximately every 18-24 months • Designers turn to integrating more cores to take advantage of parallelism • Timeline: Intel 4004 (4-bit), 1971 (graph courtesy of www.crn.com)

  6. Core Number Increases • Timeline: … 1971, 2000

  7. Core Number Increases • Intel Core 2 Duo: 2 cores • Timeline: … 1971, 2000, 2007

  8. Core Number Increases • Intel Core i7 (Nehalem): 4 cores • Timeline: … 1971, 2000, 2007, 2008

  9. Core Number Increases • AMD Opteron 2400: 6 cores • Timeline: … 1971, 2000, 2007, 2008, 2009

  10. Core Number Increases • IBM POWER7: 8 cores • Timeline: … 1971, 2000, 2007, 2008, 2009, 2010

  11. Core Number Increases • Intel Xeon Westmere-EX: 10 cores • Timeline: … 1971, 2000, 2007, 2008, 2009, 2010, 2011

  12. Core Number Increases • Timeline: … 1971, 2000, 2007, 2008, 2009, 2010, 2011, 2012

  13. Core Number Increases • Intel Single-Chip Cloud Computer: 48 cores • Intel Polaris Chip: 80 cores • Timeline: … … 1971, 2000, 2007, 2008, 2009, 2010, 2011, 2012, Near Future

  14. It’s already happening! • Intel Polaris Chip • Router is becoming part of the core design • NoCs are becoming necessary • Tilera TILE64 (64 cores): 2D mesh NoC comprising 5 independent networks, one for each of 5 message classes • Timeline: … … 1971, 2000, 2007, 2008, 2009, 2010, 2011, 2012, Near Future

  15. Outline • Necessity of Networks-on-Chip (NoCs) • Reliability and NoCs • The NoCAlert Approach: Invariance Checking • Identifying invariances and examples • Evaluation • Results • Conclusion – Future Work

  16. Reliability in the nano era • Aggressive transistor downsizing – Increasing hardware variability – Susceptibility to wear-out (accelerated aging effects) • Permanent faults – Static (occurring at manufacture time): Process Variability (PV), manufacturing imperfections – Dynamic (occurring at run time; prolonged stressing → component wear-out): Electro-Migration (EM), Negative Bias Temperature Instability (NBTI), oxide breakdown, Stress-Induced Voiding (SIV), Hot Carrier Injection (HCI), etc. • Transient faults (or soft errors; Single-Event Upsets, SEUs) – Alpha particles (impurities in packaging/interconnect), cosmic-ray-induced neutrons, neutron-induced ¹⁰B fission (interconnect layer insulator) – Traditionally associated with memories: Error Correcting Codes (ECC) widely used in DRAM modules

  17. Ominous predictions regarding reliability • Figure: impact of NBTI on failure probability trends (axis: probability of failure) * • This recent study from DATE-2010 indicates increases in failure probabilities by tens of orders of magnitude at 12 nm, as opposed to 45 nm • Each new technology generation decreases IC lifetime by half [ITRS 2011] • Challenge: “Designing reliable systems from unreliable components” ** • * S.R. Nassif, N. Mehta, and Yu Cao, "A resilience roadmap," in Proc. of the Design, Automation and Test in Europe Conference (DATE), 2010. • ** S. Borkar, "Designing reliable systems from unreliable components: the challenges of transistor variability and degradation," in IEEE Micro, Nov/Dec 2005.

  18. NoC (un)reliability implications • A single fault in the NoC can cause: – Network disconnections – Deadlocks (network- and protocol-level) – Lost packets – Degraded performance → A single fault can paralyze the entire system (CMP) • Protecting the NoC is of paramount importance

  19. Outline • Necessity of Networks-on-Chip (NoCs) • Reliability and NoCs • The NoCAlert Approach: Invariance Checking • Identifying invariances and examples • Evaluation • Results • Conclusion – Future Work

  20. NoCAlert: The Big Picture • NoCAlert: – Lightweight distributed invariance checkers – Checkers behave like hardware assertions – Checks for legality, not correctness – Network’s operation is never interrupted – Provides almost instantaneous fault detection • Assumption: – Packet/flit contents are protected with ECC • NoCAlert protects against faults in the control logic • Interesting observation: – Erroneous but legal module outputs are always benign

  21. NoCAlert’s Terminology • Invariance violation: – The breaking of a fundamental functional rule within the context of a component’s operation – e.g., the routing computation unit outputs an illegal direction • Legality: – An output is illegal if it cannot occur under the set of functional correctness rules of a given component • Instantaneous fault detection: – Detect a fault as soon as it manifests (same clock cycle) – Recovery is easier – Localized information could identify the faulty location
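
The routing computation example can be sketched as a small legality check. This is a minimal sketch, assuming a one-hot encoding of five output ports; the encoding and the names `PORTS`, `rc_output_is_legal`, and `rc_checker` are hypothetical and chosen for illustration, not taken from the NoCAlert design.

```python
# Hypothetical one-hot encoding of the five output ports of a mesh router.
# The encoding and all names here are assumptions for illustration only.
PORTS = {
    "north": 0b00001,
    "east": 0b00010,
    "south": 0b00100,
    "west": 0b01000,
    "local": 0b10000,
}


def rc_output_is_legal(direction_bits: int) -> bool:
    """Return True if exactly one known port bit is set."""
    return direction_bits in PORTS.values()


def rc_checker(direction_bits: int) -> bool:
    """Return the checker flag: True means an invariance violation."""
    return not rc_output_is_legal(direction_bits)


print(rc_checker(PORTS["east"]))  # a single valid port: no violation
print(rc_checker(0b00110))        # two ports at once: violation
print(rc_checker(0b00000))        # no port selected: violation
```

Note that the check only tests legality: it would also accept `PORTS["north"]` for a packet that should have gone east, which is the "legality, not correctness" distinction the slides draw.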

  22. Invariance Checking • System is continuously (on-line) examined for illegal outputs – An illegal output can be the result of some kind of fault • Emulates assertions used in software • Example: Assume a variable X cannot get the value 5 – assert(X!=5) – In hardware this would be achieved with a comparison unit that raises a flag
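
The `assert(X!=5)` example can be modelled in software as a comparator that raises a flag instead of halting execution. The following is a minimal sketch under that assumption; `FORBIDDEN_VALUE`, `invariance_flag`, and `monitor` are hypothetical names, and the per-cycle list stands in for a signal sampled once per clock cycle.

```python
FORBIDDEN_VALUE = 5  # the value X must never take, as in the slide's example


def invariance_flag(x: int) -> bool:
    """Model a comparison unit: the flag is True when x is the forbidden value."""
    return x == FORBIDDEN_VALUE


def monitor(values):
    """Return (cycle, value) for every cycle on which the flag is raised.

    Unlike a software assert, the stream is never interrupted: every
    value is examined and violations are only reported.
    """
    return [(cycle, x) for cycle, x in enumerate(values) if invariance_flag(x)]


print(monitor([1, 5, 3, 5]))  # [(1, 5), (3, 5)]
```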

  23. Outline • Necessity of Networks-on-Chip (NoCs) • Reliability and NoCs • The NoCAlert Approach: Invariance Checking • Identifying invariances and examples • Evaluation • Results • Conclusion – Future Work

  24. A generic (typical) NoC router micro-architecture • Network packets are broken into multiple flits. A flit is a flow control unit, the smallest unit of flow control in the NoC • Figure: five input ports (North, East, South, West, Processing Element), each with an RC unit and virtual channels VC0-VC3 built from buffer slots of one-flit capacity, feeding a 5x5 crossbar with outputs North, East, South, West, and Processing Element • Router pipeline: – Routing Computation (RC): next-hop direction – VA1, local arbitration: choose one specific output VC in the adjacent router – VA2, global arbitration: resolve global conflicts – SA1, local arbitration: one winning VC in each port – SA2, global arbitration: resolve global conflicts (SA2 arbiters control the XBAR connections)
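
The arbitration stages suggest a natural kind of legality rule: an arbiter should grant at most one requester, and only a requester that actually asked. The following is a minimal sketch of such a check, assuming request and grant signals are bit vectors packed into integers; `arbiter_grant_is_legal` is a hypothetical name, and this rule is an illustrative example rather than one of the invariances listed in the presentation.

```python
def arbiter_grant_is_legal(requests: int, grants: int) -> bool:
    """Check two plausible arbiter rules on bit-vector signals.

    Rule 1: at most one grant bit is set (zero-hot or one-hot).
    Rule 2: every granted bit corresponds to an active request.
    """
    at_most_one_grant = (grants & (grants - 1)) == 0
    granted_only_requesters = (grants & ~requests) == 0
    return at_most_one_grant and granted_only_requesters


print(arbiter_grant_is_legal(0b0110, 0b0100))  # one requester granted: legal
print(arbiter_grant_is_legal(0b0110, 0b0110))  # two grants at once: illegal
print(arbiter_grant_is_legal(0b0110, 0b0001))  # grant without a request: illegal
```

As with the other checks, a grant to the "wrong" requester still passes as long as it is a legal grant.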

  25. Identifying invariances within the NoC Router • Identification of invariances relies on the modularity and hierarchy of the NoC router • The functional algorithm of each module is exhaustively inspected using a bottom-up approach – Identification of all the functional rules – Identification of all the functionally illegal outputs • Figure (hierarchy): Network Level, Router Level, and the modules Input Port, VA and SA Arbiters, Crossbar Switch, FIFO Buffers, RC Unit • End-to-end invariances at the network level are identified

  26. Invariance categorization • 32 invariances have been identified through detailed exploration of the router’s microarchitecture • Identified invariances are categorized based on the router module they are associated with – Routing Computation unit (3) – Arbiters (10) – Crossbar (3) – Buffer State (12) – Port-Level (3) – End-to-End (network-level) (1)
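
As a quick consistency check, the per-module counts above add up to the stated total of 32. The dictionary name `INVARIANCES_PER_MODULE` is a hypothetical label for this sketch; the counts are copied from the slide.

```python
# Per-module invariance counts as listed on the slide.
INVARIANCES_PER_MODULE = {
    "Routing Computation unit": 3,
    "Arbiters": 10,
    "Crossbar": 3,
    "Buffer State": 12,
    "Port-Level": 3,
    "End-to-End (network-level)": 1,
}

total = sum(INVARIANCES_PER_MODULE.values())
print(total)  # 32
```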
