Failure Detection and Propagation in HPC Systems



  1. Failure Detection and Propagation in HPC Systems
  George Bosilca 1, Aurélien Bouteiller 1, Amina Guermouche 1, Thomas Hérault 1, Yves Robert 1,2, Pierre Sens 3 and Jack Dongarra 1,4
  1. University of Tennessee, Knoxville  2. ENS Lyon, France  3. LIP6, Paris, France  4. University of Manchester, UK
  CCDSC, October 4, 2016

  2. Failure detection: why?
  • Nodes do crash at scale (you've heard the story before)
  • Current solution:
    1. Detection: TCP time-out (≈ 20 min)
    2. Knowledge propagation: admin network
  • Work on fail-stop errors assumes instantaneous failure detection
  • It seems we put the cart before the horse

  3–7. Resilient applications (incremental build)
  • Continue execution after the crash of one or several nodes
  • Need rapid and global knowledge of group members:
    1. Rapid: failure detection
    2. Global: failure-knowledge propagation
  • The resilience mechanism should come for free, or at least have minimal impact

  8. Contribution
  • Failure-free overhead constant per node (memory, communications)
  • Failure detection with minimal overhead
  • Knowledge propagation based on a fault-tolerant broadcast overlay
  • Tolerate an arbitrary number of failures (but a bounded number within the threshold interval)

  9–10. Outline (next: 1. Model)
  1. Model
  2. Failure detector
  3. Worst-case analysis
  4. Implementation and experiments

  11. Framework
  • Large-scale platform with a (dense) interconnection graph (physical links)
  • One-port message-passing model
  • Reliable links (messages are not lost, duplicated, or modified)
  • Communication time on each link: randomly distributed but bounded by τ
  • Permanent node crashes

  12. Failure detector
  Definition (failure detector): a distributed service able to return the state of any node, alive or dead. It is perfect if:
    1. any failure is eventually detected by all living nodes, and
    2. no living node suspects another living node.
  Definition (stable configuration): all failed nodes are known to all processes (nodes may not be aware that they are in a stable configuration).

  13. Vocabulary
  • Node = physical resource
  • Process = program running on a node
  • Thread = part of a process that can run on a single core
  • The failure detector detects both process and node failures
  • A failure detector is mandatory to detect some node failures

  14. Outline (next: 2. Failure detector)
  1. Model
  2. Failure detector
  3. Worst-case analysis
  4. Implementation and experiments

  15. Timeout techniques: p observes q. A minimal push-style sketch follows this slide.
  • Pull technique: observer p asks q "Are you alive?" and q answers "I am alive"
    − More messages
    − Long timeout
  • Push technique [1]: observed q periodically sends "I am alive" heartbeats to p
    + Fewer messages
    + Faster detection (shorter timeout)
  [1] W. Chen, S. Toueg, and M. K. Aguilera. On the quality of service of failure detectors. IEEE Transactions on Computers, 2002.
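  A minimal sketch of the push technique in Python (a single-machine illustration, not the talk's implementation; names such as Observer, period, and timeout are made up): the observed process pushes periodic heartbeats, and the observer suspects a failure as soon as no heartbeat arrives within the timeout.

    import queue
    import threading
    import time

    class Observer:
        """Watches one process via push-style heartbeats."""
        def __init__(self, timeout):
            self.timeout = timeout      # suspicion timeout (the delta of later slides)
            self.inbox = queue.Queue()  # channel carrying heartbeats

        def run(self):
            while True:
                try:
                    # each received heartbeat restarts the wait
                    self.inbox.get(timeout=self.timeout)
                except queue.Empty:
                    print("observed process suspected dead")
                    return

    def observed(observer, period, crash_after):
        """Sends heartbeats every `period` seconds, then crashes
        (silently stops sending) after `crash_after` seconds."""
        start = time.time()
        while time.time() - start < crash_after:
            observer.inbox.put("I am alive")
            time.sleep(period)

    obs = Observer(timeout=0.5)                  # delta = 0.5 s
    t = threading.Thread(target=obs.run)
    t.start()
    observed(obs, period=0.1, crash_after=1.0)   # eta = 0.1 s, crash at t = 1 s
    t.join()                                     # suspicion fires ~0.5 s after the crash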

  16. Timeout techniques: platform-wide
  • All-to-all observation:
    + Immediate knowledge propagation
    − Dramatic overhead
  • Random nodes and gossip:
    + Quick knowledge propagation
    − Redundant/partial failure information (in an observation round where each of n nodes selects a random target, about n/e nodes are left unobserved in expectation; see the check below)
    − Difficult to define the timeout
    − Difficult to bound the detection latency
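  The n/e figure can be verified with a short simulation (a sketch, assuming each node picks one target uniformly among the other n − 1 nodes): the probability that a given node is picked by nobody is (1 − 1/(n−1))^(n−1) ≈ 1/e, so about n/e nodes go unobserved per round.

    import math
    import random

    def unobserved_after_one_round(n, trials=1000):
        """Average number of nodes that no one observes in one round."""
        total = 0
        for _ in range(trials):
            observed = set()
            for i in range(n):
                # node i picks a uniform random target other than itself
                t = random.randrange(n - 1)
                observed.add(t if t < i else t + 1)
            total += n - len(observed)
        return total / trials

    n = 1000
    print(f"simulated unobserved: {unobserved_after_one_round(n):.1f}")
    print(f"n/e prediction:       {n / math.e:.1f}")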

  17. Algorithm for failure detection
  (Figure: processes 0–8 arranged in a ring)
  • Processes arranged in a ring
  • Periodic heartbeats from each node to its successor
  • Maintain a ring of live nodes:
    → reconnect the ring after a failure
    → inform all processes

  18–23. Reconnecting the ring (animated sequence)
  (Figure: timeline of heartbeats among nodes 0–4, showing a failure being detected and the ring repaired)
  • η: heartbeat interval
  • δ: suspicion timeout, δ ≫ τ
  • After δ without a heartbeat, the observer declares its emitter dead, sends a reconnection message to the next live predecessor, and broadcasts the failure
  • The suspicion timeout is raised to 2δ during reconnection
  • Ring reconnected

  24. Algorithm

    task Initialization
        emitter_i ← (i − 1) mod N
        observer_i ← (i + 1) mod N
        D_i ← ∅
        HB-Timeout ← η
        Susp-Timeout ← δ
    end task

    task T1: when HB-Timeout expires
        HB-Timeout ← η
        send heartbeat(i) to observer_i
    end task

    task T2: upon reception of heartbeat(emitter_i)
        Susp-Timeout ← δ
    end task

    task T3: when Susp-Timeout expires
        Susp-Timeout ← 2δ
        D_i ← D_i ∪ {emitter_i}
        dead ← emitter_i
        emitter_i ← FindEmitter(D_i)
        send NewObserver(i) to emitter_i
        send BcastMsg(dead, i, D_i) to Neighbors(i, D_i)
    end task

    task T4: upon reception of NewObserver(j)
        observer_i ← j
        HB-Timeout ← 0
    end task

    task T5: upon reception of BcastMsg(dead, s, D)
        D_i ← D_i ∪ {dead}
        send BcastMsg(dead, s, D) to Neighbors(s, D)
    end task

    function FindEmitter(D_i)
        k ← emitter_i
        while k ∈ D_i do
            k ← (k − 1) mod N
        return k
    end function
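  A compact, single-threaded Python sketch of the reconnection logic in task T3 and FindEmitter (it models only who observes whom, not the timers or the messages; N and the crash order are made up for the example):

    N = 8

    def find_emitter(k, dead):
        """FindEmitter: walk backwards around the ring past known-dead nodes."""
        while k in dead:
            k = (k - 1) % N
        return k

    dead = set()             # D_0: node 0's set of known-dead nodes
    emitter = (0 - 1) % N    # node 0 initially observes node 7

    for crashed in (7, 6):   # nodes 7 then 6 crash
        # T3 at node 0: Susp-Timeout expires, the current emitter is declared dead
        dead.add(crashed)
        emitter = find_emitter(emitter, dead)
        # the full algorithm now sends NewObserver(0) to the new emitter
        # and broadcasts BcastMsg(crashed, 0, D_0) over the overlay
        print(f"node {crashed} declared dead; node 0 now observes node {emitter}")

  Running this prints that node 0 reconnects to node 6 and then to node 5, skipping both dead predecessors.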

  25. Broadcast algorithm
  (Figure: 3-dimensional hypercube on nodes 0–7)
  • Hypercube broadcast algorithm [1]
  • Disjoint paths deliver multiple copies of each broadcast message
  • Each node runs a recursive-doubling broadcast
  • Completes if f ≤ ⌊log(n)⌋ − 1 (f: number of failures, n: number of live processes)

  Paths from node 0 through each of its three neighbors:

    Node | via node 1 | via node 2 | via node 4
       1 | 0          | 0-2-3      | 0-4-5
       2 | 0-1-3      | 0          | 0-4-6
       3 | 0-1        | 0-2        | 0-4-5-7
       4 | 0-1-5      | 0-2-6      | 0
       5 | 0-1        | 0-2-6-7    | 0-4
       6 | 0-1-3-7    | 0-2        | 0-4
       7 | 0-1-3      | 0-2-6      | 0-4-5

  [1] P. Ramanathan and K. G. Shin. Reliable broadcast in hypercube multicomputers. IEEE Transactions on Computers, 1988.
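  The completion condition rests on the hypercube being log2(n)-connected: removing up to ⌊log(n)⌋ − 1 nodes cannot disconnect it, so a message forwarded over the overlay still reaches every live node. The sketch below checks this by flooding, which is simpler than (and not identical to) the recursive-doubling schedule of [1]; the scenario is made up.

    from collections import deque

    def hypercube_broadcast(d, source, dead):
        """Return the set of live nodes reached from `source` when the
        nodes in `dead` silently drop all messages."""
        n = 1 << d
        reached = {source}
        frontier = deque([source])
        while frontier:
            u = frontier.popleft()
            for bit in range(d):           # hypercube neighbors differ in one bit
                v = u ^ (1 << bit)
                if v not in dead and v not in reached:
                    reached.add(v)
                    frontier.append(v)
        return reached

    d = 3                                  # n = 8 nodes, tolerates f <= d - 1 = 2
    dead = {1, 2}                          # two failures
    live = set(range(1 << d)) - dead
    assert hypercube_broadcast(d, 0, dead) == live
    print("all live nodes reached despite", len(dead), "failures")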

  26. Failure propagation
  • Hypercube broadcast algorithm:
    • completes if f ≤ ⌊log(n)⌋ − 1 (f: number of failures, n: number of living processes)
    • completes after 2τ·log(n)
  • Application to the failure detector:
    • if n ≠ 2^l: let k = ⌊log(n)⌋, so 2^k < n < 2^(k+1), and initiate two successive broadcast operations
    • the source s of a broadcast sends its current list D of dead processes
    • D is not updated during a broadcast initiated by s (do NOT change the broadcast topology on the fly)

  27. Quick digression
  • We need a fault-tolerant overlay with a small fault-tolerant diameter and easy routing
  • Known only for specific values of n:
    • Hypercubes: n = 2^k
    • Binomial graphs: n = 2^k
    • Circulant networks: n = c·d^k
    • ...

  28. Outline (next: 3. Worst-case analysis)
  1. Model
  2. Failure detector
  3. Worst-case analysis
  4. Implementation and experiments
