ARIADNE: Agnostic Reconfiguration In A Disconnected Network Environment
Konstantinos Aisopos (Princeton, MIT), Andrew DeOrio (Michigan), Li-Shiuan Peh (MIT), Valeria Bertacco (Michigan)
What is “reconfiguration”?
• Silicon technologies move into the nanometer regime… transistors become unreliable
• “In future chips of 100 billion transistors, 10% of transistors will eventually fail over the lifetime of the chip” (Shekhar Borkar, Intel Fellow)
• For architects: permanent faults
• Our focus in this talk: Network-on-Chip
– cannot resend; need to re-route around the fault
• Reconfiguration: “the process of replacing the routing algorithm”
[Figure: tile labeled P, P$, S$, NIC and R; a source S and a destination D separated by a fault]
Why is reconfiguration challenging?
• XY routing [Figure: mesh with source S and destination D; the route runs along X, then along Y]
• Agnostic Reconfiguration algorithm In A Disconnected Network Environment
Outline • Motivation • Ariadne – Baseline – Deadlocks – Synchronization • Evaluation – Overhead – Performance – Reliability • Conclusions
How will S find a path to D?
[Figure sequence: mesh with source S and destination D. Frame by frame, more routers gain a routing table (RT) entry for D that gives the direction(s) toward D: first “D: S”, “D: E”, “D: W”, “D: N” at the routers next to D, then entries with two valid directions such as “D: E,S”, “D: W,S”, “D: W,N”, “D: E,N”. The final frame shows the entries “D: W”, “D: W” and “D: S” on the path from S to D.]
ARIADNE: baseline
• Upon a fault that changes the topology…
– a node can let everyone know how it can be reached with a single broadcast
– N nodes can let everyone know how they can be reached with N broadcasts
– every node broadcasts “in turn” to let the others know how it can be reached
• Every node has a statically assigned node ID
[Figure: nodes … 8 9 10 11 annotated with their broadcast order (1st, 2nd, 3rd, … last); the fault detector is marked]
• Issues:
– deadlock avoidance
– synchronization (when to broadcast, multiple detectors)
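The baseline idea (N broadcasts fill N routing-table entries per node) can be sketched in a few lines. This is only an illustration under stated assumptions, not the hardware mechanism: nodes are identified by hypothetical (row, col) mesh coordinates, faults are modeled as broken links, every shortest-path direction toward the broadcaster is recorded, and the deadlock and synchronization issues are ignored.

```python
from collections import deque

# Direction names and their (row, col) offsets on a 2D mesh (assumed convention).
DIRS = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

def neighbors(node, size, faulty):
    """Yield (direction, neighbor) pairs reachable over healthy links."""
    r, c = node
    for name, (dr, dc) in DIRS.items():
        nxt = (r + dr, c + dc)
        in_mesh = 0 <= nxt[0] < size and 0 <= nxt[1] < size
        if in_mesh and frozenset((node, nxt)) not in faulty:
            yield name, nxt

def broadcast(dest, size, faulty):
    """Flood from dest; return {node: directions that lead one hop closer to dest}."""
    dist = {dest: 0}
    queue = deque([dest])
    while queue:
        node = queue.popleft()
        for _, nxt in neighbors(node, size, faulty):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return {
        node: {name for name, nxt in neighbors(node, size, faulty)
               if dist[nxt] == d - 1}
        for node, d in dist.items() if node != dest
    }

def reconfigure(size, faulty):
    """Every node broadcasts in turn; each broadcast fills one RT entry per node."""
    tables = {(r, c): {} for r in range(size) for c in range(size)}
    for dest in sorted(tables):  # "in turn", ordered by the static node ID
        for node, directions in broadcast(dest, size, faulty).items():
            tables[node][dest] = directions
    return tables
```

For example, `reconfigure(3, {frozenset({(0, 0), (0, 1)})})` models a 3x3 mesh with one broken link; node (0, 0) then reaches its former east neighbor (0, 1) by first heading south.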
ARIADNE: deadlocks
[Figure: several source/destination pairs (S, D) whose routes can form a circular dependency]
ARIADNE: deadlocks
• up*/down*: disable routes where the rank goes lower → higher → lower, i.e. turns at a node whose rank is higher than that of both neighbors
• first bcast ONLY: nodes are assigned ranks
– bcaster (“root”): 0
– immediate neighbors: 1
– 2-hop neighbors: 2
– 3-hop neighbors: 3
• in every circle, 1 node will have a higher rank than its neighbors, breaking the circular route
• unique ordering: among nodes with the same rank, arbitrarily select a higher one
• connectivity: S can reach any node D via the root
ARIADNE: deadlocks
• Upon a fault that changes the topology…
– every node broadcasts “in turn” to let the others know how it can be reached
• Issues:
– deadlock avoidance. RULE: (i) the first broadcast ranks the nodes; (ii) the remaining broadcasts spread only via enabled turns
– synchronization
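A minimal sketch of the ranking rule, assuming the network is given as an adjacency dictionary and that ties between equal ranks are broken by node ID (the slides only say the choice is arbitrary). The function names are hypothetical; the turn rule is the standard up*/down* restriction that forbids going down (away from the root) and then up (toward it).

```python
from collections import deque

def assign_ranks(adj, root):
    """First broadcast: rank = hop distance from the broadcaster (root has rank 0)."""
    rank = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in rank:
                rank[v] = rank[u] + 1
                queue.append(v)
    return rank

def order(rank, node):
    """Unique ordering: rank first, then node ID as the tie-breaker (assumption)."""
    return (rank[node], node)

def is_up(rank, u, v):
    """True if the hop u -> v moves toward the root in the unique ordering."""
    return order(rank, v) < order(rank, u)

def turn_enabled(rank, a, b, c):
    """The turn a -> b -> c is disabled only when a -> b goes down and b -> c goes up."""
    goes_down_then_up = (not is_up(rank, a, b)) and is_up(rank, b, c)
    return not goes_down_then_up
```

On a four-node ring `adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}` rooted at node 0, node 3 receives the highest rank, so the turn 1 → 3 → 2 is disabled while 1 → 0 → 2 (through the root) stays enabled. Exactly one turn per direction around the ring is removed, which is what breaks the circular route.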
ARIADNE: synchronization
• How does a node know that the previous broadcast has completed? Can broadcasts overlap?
• How does the recipient of a flag know which node is broadcasting?
Solution: Atomic Broadcasts
• Nodes use the cycle count as a global reference point
• Each node is assigned a unique broadcast slot derived from the “global” cycle counter
ARIADNE: synchronization
• The cycle count (same for all nodes) is read as two fields of log(16) bits each: bcast node | bcast cycle
[Figure: 4×4 mesh, nodes 0 to 15; example counter value 1001 0111 = (bcast node 9, bcast cycle 7)]
• Timeline (counter bits → bcast node, bcast cycle):
– 0100 1111 → (4, 15): waiting for node 5's slot
– 0101 0000 → (5, 0): 5 initiates bcast
– 0101 1111 → (5, 15): 5's bcast completes
– 0110 0000 → (6, 0): 6 initiates bcast
– 0110 1111 → (6, 15): 6's bcast completes
– …
– 0100 1111 → (4, 15): 4's bcast completes
• The slot length covers the longest (in hops) broadcast
• Reconfiguration completes in 16² = (number of nodes)² cycles
ARIADNE: synchronization
• Same counter layout (bcast node | bcast cycle), now with two nodes, 5 and 8, waiting to broadcast [Figure: 4×4 mesh, nodes 0 to 15]
– 0100 1111 → (4, 15): waiting for node 5's slot
– 0101 0000 → (5, 0): 5 initiates bcast
– (5, 1): 1st hop
– 0101 0010 → (5, 2): 2nd hop; node 8, still waiting for its own slot, resigns from becoming the root node (!)
– …
– 0101 1111 → (5, 15): 5's bcast completes
• We need to reconfigure only once, even for multiple faults
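The slot arithmetic can be sketched as follows. This is an illustration that assumes the number of nodes is a power of two (as in the 16-node example) and that only the low 2·log2(N) bits of the counter are inspected; the function names are hypothetical.

```python
def slot_bits(num_nodes):
    """Bits per field: log2(num_nodes) when num_nodes is a power of two."""
    return max(1, (num_nodes - 1).bit_length())

def decode(cycle_count, num_nodes):
    """Split the low bits of the global cycle count into (bcast node, bcast cycle)."""
    k = slot_bits(num_nodes)
    mask = (1 << k) - 1
    return (cycle_count >> k) & mask, cycle_count & mask

def cycles_until_slot(cycle_count, node_id, num_nodes):
    """Cycles a node has to wait until the counter reads (node_id, 0)."""
    k = slot_bits(num_nodes)
    period = 1 << (2 * k)   # one full round of slots: num_nodes * num_nodes cycles
    slot_start = node_id << k
    return (slot_start - cycle_count) % period
```

With 16 nodes, `decode(0b10010111, 16)` gives `(9, 7)`, and node 5 would wait `cycles_until_slot(0b10010111, 5, 16)` cycles before the counter reads (5, 0). One full round of slots lasts 16 × 16 = 256 cycles, matching the (number of nodes)² figure above.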
Evaluation: Overhead
• On-chip routing algorithms for irregular topologies:
– Vicis routing algo (D. Fick, DATE’09): exceptions to the turn model to apply it to an arbitrary topology
– Immunet (V. Puente, ISCA’04): reserves an escape VC for deadlock freedom (routes deterministically in a ring)
– ARIADNE
• Overhead: Vicis 1.5%, Immunet 6.0%, ARIADNE 2.0%
• Synthesized a baseline 5-stage pipelined router (5 ports, 2 VCs, 5-flit buffer/VC) with Synopsys Design Compiler (IBM 130nm target library)
• Router area (mm²): baseline = 2.708, Ariadne = 2.761, Vicis = 2.748, Immunet = 2.870
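As a quick arithmetic check, the overhead percentages follow from the synthesized router areas when overhead is taken as the relative area increase over the baseline router (an assumption about how the slide's figures were derived):

```python
BASELINE_MM2 = 2.708
AREAS_MM2 = {"Ariadne": 2.761, "Vicis": 2.748, "Immunet": 2.870}

def area_overhead_percent(area, baseline=BASELINE_MM2):
    """Relative area increase over the baseline router, in percent."""
    return 100.0 * (area - baseline) / baseline

# Round to one decimal place, as on the slide.
overheads = {name: round(area_overhead_percent(a), 1) for name, a in AREAS_MM2.items()}
print(overheads)
```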
Evaluation: Performance
• Experimental setup: 10 PARSEC benchmarks, Garnet + GEMS
System Configuration (GEMS)
– processors: in-order SPARC cores
– coherence: MOESI protocol
– L1 caching: private unified 32KB/node, ways: 2, latency: 3 cycles
– L2 caching: shared distributed 1MB/node, ways: 16, latency: 15 cycles
Network Architecture (GARNET)
– network topology: 8x8 2D mesh
– memory controllers: 4 at chip corners
– channel width: 64 bits
– router architecture: 5-stage pipeline
– router ports, VCs: 5, 2 (private)
– router buffers/port: 5-flit for each VC
[Plot: Average Latency (cycles, lower is better) vs. Injected Faults (0 to 100), average over 100 topologies; curves: Ariadne (average), Vicis (average), Immunet (average); annotations: “traffic deadlocks”, “routing in a ring”]
Evaluation: Performance + Reliability
• On-chip routing algorithms for irregular topologies:
– Vicis routing algo (D. Fick, DATE’09): exceptions to the turn model to apply it to an arbitrary topology
– Immunet (V. Puente, ISCA’04): reserves an escape VC for deadlock freedom (routes deterministically in a ring)
– ARIADNE
• Comparison rows: overhead (Vicis 1.5%, Immunet 6.0%, ARIADNE 2.0%), performance, reliability
Conclusions
We have presented Ariadne, which:
• is a reconfiguration algorithm that provides deadlock-free routing paths in the irregular network topologies that result from faulty links
• is implemented in a fully distributed mode, resulting in simple hardware and low complexity
• enables a trade-off between performance and reliable functionality on unreliable silicon
Thank You! Questions?
The Greek legend of Princess Ariadne [source: wikipedia]
“Ariadne (Αριάδνη), was the daughter of King Minos of Crete. Minos attacked Athens after his son was killed there. The Athenians asked for terms, and were required to sacrifice seven young men and seven maidens every nine years to the Minotaur, a monster with the head of a bull on the body of a man. One year, the sacrificial party included Theseus, a young man who volunteered to come and kill the Minotaur. Ariadne fell in love at first sight, and helped him by giving him a ball of red fleece thread that she was spinning, to find his way out of the Minotaur's labyrinth.”
…similarly to Princess Ariadne, our Ariadne algorithm helps packets find their way in the labyrinth-like topology of a faulty network.