Failure Detection and Propagation in HPC Systems



  1. Failure Detection and Propagation in HPC Systems
  George Bosilca 1, Aurélien Bouteiller 1, Amina Guermouche 1, Thomas Hérault 1, Yves Robert 1,2, Pierre Sens 3 and Jack Dongarra 1,4
  1. University of Tennessee, Knoxville  2. ENS Lyon, France  3. LIP6, Paris, France  4. University of Manchester, UK
  CCDSC, October 4, 2016

  2. Failure detection: why?
  • Nodes do crash at scale (you've heard the story before)
  • Current solution:
    1. Detection: TCP time-out (≈ 20 min)
    2. Knowledge propagation: admin network
  • Work on fail-stop errors assumes instantaneous failure detection
  • It seems we put the cart before the horse

  3–7. Resilient applications (incremental build)
  • Continue execution after the crash of one or several nodes
  • Need rapid and global knowledge of group members:
    1. Rapid: failure detection
    2. Global: failure-knowledge propagation
  • The resilience mechanism should come for free, or at least have minimal impact

  8. Contribution
  • Failure-free overhead constant per node (memory, communications)
  • Failure detection with minimal overhead
  • Knowledge propagation based on a fault-tolerant broadcast overlay
  • Tolerate an arbitrary number of failures (but a bounded number within the threshold interval)

  9–10. Outline (next: 1. Model)
  1. Model
  2. Failure detector
  3. Worst-case analysis
  4. Implementation and experiments

  11. Framework
  • Large-scale platform with a (dense) interconnection graph (physical links)
  • One-port message-passing model
  • Reliable links (messages are not lost, duplicated, or modified)
  • Communication time on each link: randomly distributed but bounded by τ
  • Permanent node crashes

  12. Failure detector
  Definition (failure detector): a distributed service able to return the state of any node, alive or dead. It is perfect if:
    1. any failure is eventually detected by all living nodes, and
    2. no living node suspects another living node.
  Definition (stable configuration): all failed nodes are known to all processes (nodes may not be aware that they are in a stable configuration).

  13. Vocabulary
  • Node = physical resource
  • Process = program running on a node
  • Thread = part of a process that can run on a single core
  • The failure detector detects both process and node failures
  • A failure detector is mandatory to detect some node failures

  14. Outline (next: 2. Failure detector)
  1. Model
  2. Failure detector
  3. Worst-case analysis
  4. Implementation and experiments

  15. Timeout techniques: p observes q. A minimal push-style sketch follows this slide.
  • Pull technique: observer p asks q "Are you alive?" and q answers "I am alive"
    − More messages
    − Long timeout
  • Push technique [1]: observed q periodically sends "I am alive" heartbeats to p
    + Fewer messages
    + Faster detection (shorter timeout)
  [1] W. Chen, S. Toueg, and M. K. Aguilera. On the quality of service of failure detectors. IEEE Transactions on Computers, 2002.
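  A minimal sketch of the push technique in Python (a single-machine illustration, not the talk's implementation; names such as Observer, period, and timeout are made up): the observed process pushes periodic heartbeats, and the observer suspects a failure as soon as no heartbeat arrives within the timeout.

    import queue
    import threading
    import time

    class Observer:
        """Watches one process via push-style heartbeats."""
        def __init__(self, timeout):
            self.timeout = timeout      # suspicion timeout (the delta of later slides)
            self.inbox = queue.Queue()  # channel carrying heartbeats

        def run(self):
            while True:
                try:
                    # each received heartbeat restarts the wait
                    self.inbox.get(timeout=self.timeout)
                except queue.Empty:
                    print("observed process suspected dead")
                    return

    def observed(observer, period, crash_after):
        """Sends heartbeats every `period` seconds, then crashes
        (silently stops sending) after `crash_after` seconds."""
        start = time.time()
        while time.time() - start < crash_after:
            observer.inbox.put("I am alive")
            time.sleep(period)

    obs = Observer(timeout=0.5)                  # delta = 0.5 s
    t = threading.Thread(target=obs.run)
    t.start()
    observed(obs, period=0.1, crash_after=1.0)   # eta = 0.1 s, crash at t = 1 s
    t.join()                                     # suspicion fires ~0.5 s after the crash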

  16. Timeout techniques: platform-wide
  • All-to-all observation:
    + Immediate knowledge propagation
    − Dramatic overhead
  • Random nodes and gossip:
    + Quick knowledge propagation
    − Redundant/partial failure information (in an observation round where each of n nodes selects a random target, about n/e nodes are left unobserved in expectation; see the check below)
    − Difficult to define the timeout
    − Difficult to bound the detection latency
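  The n/e figure can be verified with a short simulation (a sketch, assuming each node picks one target uniformly among the other n − 1 nodes): the probability that a given node is picked by nobody is (1 − 1/(n−1))^(n−1) ≈ 1/e, so about n/e nodes go unobserved per round.

    import math
    import random

    def unobserved_after_one_round(n, trials=1000):
        """Average number of nodes that no one observes in one round."""
        total = 0
        for _ in range(trials):
            observed = set()
            for i in range(n):
                # node i picks a uniform random target other than itself
                t = random.randrange(n - 1)
                observed.add(t if t < i else t + 1)
            total += n - len(observed)
        return total / trials

    n = 1000
    print(f"simulated unobserved: {unobserved_after_one_round(n):.1f}")
    print(f"n/e prediction:       {n / math.e:.1f}")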

  17. Algorithm for failure detection
  (Figure: processes 0–8 arranged in a ring)
  • Processes arranged in a ring
  • Periodic heartbeats from each node to its successor
  • Maintain a ring of live nodes:
    → reconnect the ring after a failure
    → inform all processes

  18–23. Reconnecting the ring (animated sequence)
  (Figure: timeline of heartbeats among nodes 0–4, showing a failure being detected and the ring repaired)
  • η: heartbeat interval
  • δ: suspicion timeout, δ ≫ τ
  • After δ without a heartbeat, the observer declares its emitter dead, sends a reconnection message to the next live predecessor, and broadcasts the failure
  • The suspicion timeout is raised to 2δ during reconnection
  • Ring reconnected

  24. Algorithm

    task Initialization
        emitter_i ← (i − 1) mod N
        observer_i ← (i + 1) mod N
        D_i ← ∅
        HB-Timeout ← η
        Susp-Timeout ← δ
    end task

    task T1: when HB-Timeout expires
        HB-Timeout ← η
        send heartbeat(i) to observer_i
    end task

    task T2: upon reception of heartbeat(emitter_i)
        Susp-Timeout ← δ
    end task

    task T3: when Susp-Timeout expires
        Susp-Timeout ← 2δ
        D_i ← D_i ∪ {emitter_i}
        dead ← emitter_i
        emitter_i ← FindEmitter(D_i)
        send NewObserver(i) to emitter_i
        send BcastMsg(dead, i, D_i) to Neighbors(i, D_i)
    end task

    task T4: upon reception of NewObserver(j)
        observer_i ← j
        HB-Timeout ← 0
    end task

    task T5: upon reception of BcastMsg(dead, s, D)
        D_i ← D_i ∪ {dead}
        send BcastMsg(dead, s, D) to Neighbors(s, D)
    end task

    function FindEmitter(D_i)
        k ← emitter_i
        while k ∈ D_i do
            k ← (k − 1) mod N
        return k
    end function
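  A compact, single-threaded Python sketch of the reconnection logic in task T3 and FindEmitter (it models only who observes whom, not the timers or the messages; N and the crash order are made up for the example):

    N = 8

    def find_emitter(k, dead):
        """FindEmitter: walk backwards around the ring past known-dead nodes."""
        while k in dead:
            k = (k - 1) % N
        return k

    dead = set()             # D_0: node 0's set of known-dead nodes
    emitter = (0 - 1) % N    # node 0 initially observes node 7

    for crashed in (7, 6):   # nodes 7 then 6 crash
        # T3 at node 0: Susp-Timeout expires, the current emitter is declared dead
        dead.add(crashed)
        emitter = find_emitter(emitter, dead)
        # the full algorithm now sends NewObserver(0) to the new emitter
        # and broadcasts BcastMsg(crashed, 0, D_0) over the overlay
        print(f"node {crashed} declared dead; node 0 now observes node {emitter}")

  Running this prints that node 0 reconnects to node 6 and then to node 5, skipping both dead predecessors.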

  25. Broadcast algorithm
  (Figure: 3-dimensional hypercube on nodes 0–7)
  • Hypercube broadcast algorithm [1]
  • Disjoint paths deliver multiple copies of each broadcast message
  • Each node runs a recursive-doubling broadcast
  • Completes if f ≤ ⌊log(n)⌋ − 1 (f: number of failures, n: number of live processes)

  Paths from node 0 through each of its three neighbors:

    Node | via node 1 | via node 2 | via node 4
       1 | 0          | 0-2-3      | 0-4-5
       2 | 0-1-3      | 0          | 0-4-6
       3 | 0-1        | 0-2        | 0-4-5-7
       4 | 0-1-5      | 0-2-6      | 0
       5 | 0-1        | 0-2-6-7    | 0-4
       6 | 0-1-3-7    | 0-2        | 0-4
       7 | 0-1-3      | 0-2-6      | 0-4-5

  [1] P. Ramanathan and K. G. Shin. Reliable broadcast in hypercube multicomputers. IEEE Transactions on Computers, 1988.
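  The completion condition rests on the hypercube being log2(n)-connected: removing up to ⌊log(n)⌋ − 1 nodes cannot disconnect it, so a message forwarded over the overlay still reaches every live node. The sketch below checks this by flooding, which is simpler than (and not identical to) the recursive-doubling schedule of [1]; the scenario is made up.

    from collections import deque

    def hypercube_broadcast(d, source, dead):
        """Return the set of live nodes reached from `source` when the
        nodes in `dead` silently drop all messages."""
        n = 1 << d
        reached = {source}
        frontier = deque([source])
        while frontier:
            u = frontier.popleft()
            for bit in range(d):           # hypercube neighbors differ in one bit
                v = u ^ (1 << bit)
                if v not in dead and v not in reached:
                    reached.add(v)
                    frontier.append(v)
        return reached

    d = 3                                  # n = 8 nodes, tolerates f <= d - 1 = 2
    dead = {1, 2}                          # two failures
    live = set(range(1 << d)) - dead
    assert hypercube_broadcast(d, 0, dead) == live
    print("all live nodes reached despite", len(dead), "failures")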

  26. Failure propagation
  • Hypercube broadcast algorithm:
    • completes if f ≤ ⌊log(n)⌋ − 1 (f: number of failures, n: number of living processes)
    • completes after 2τ·log(n)
  • Application to the failure detector:
    • if n ≠ 2^l: let k = ⌊log(n)⌋, so 2^k < n < 2^(k+1), and initiate two successive broadcast operations
    • the source s of a broadcast sends its current list D of dead processes
    • D is not updated during a broadcast initiated by s (do NOT change the broadcast topology on the fly)

  27. Quick digression
  • We need a fault-tolerant overlay with a small fault-tolerant diameter and easy routing
  • Known only for specific values of n:
    • Hypercubes: n = 2^k
    • Binomial graphs: n = 2^k
    • Circulant networks: n = c·d^k
    • ...

  28. Outline (next: 3. Worst-case analysis)
  1. Model
  2. Failure detector
  3. Worst-case analysis
  4. Implementation and experiments
