ParaStack: Efficient Hang Detection for MPI Programs at Large Scale
Hongbo Li, Zizhong Chen & Rajiv Gupta
Outline: Question, Solution, Evaluation
Outline: Question (Program Hang, Resource Wastage, Current Solution), Solution, Evaluation
Execution in Batch Mode
[Figure: timeline of processes 0, 1, 2, …, i over time; the marked bars denote occupied supercomputer time.]
Processes communicate via message passing (MPI).
Program Hang Occurs
Program hang: a type of bug whose occurrence stalls the program's execution.
The root cause can lie in a single process (e.g., process 0: incorrect thread-level synchronization or an infinite loop) or in all processes (e.g., a communication deadlock across all processes).
[Figure: timeline of processes 0, 1, 2, …, i over time.]
Hang Causes Resource Wastage
[Figure: timeline of processes 0, 1, 2, …, i at large scale, with the time after the hang marked as resource waste.]
Negative effect: significant resource wastage at large scale.
Solution: Hang Detection
[Figure: timeline of processes 0, 1, 2, …, i, marking the detection delay and the saved time.]
Release resources when a hang is detected.
Shorter detection delay → bigger saving.
Traditional Detection Method
Timeout is a commonly used method based on various metrics; e.g., IO-watchdog monitors how often a program writes.
Setting a good timeout is hard due to the following two dilemmas:
- Small timeout → large savings; too small a timeout → false alarms.
- Large timeout → avoids false positives; too large a timeout → large wastage.
Outline: Question, Solution (Statistical Model, Two Problems), Evaluation
ParaStack
Unlike timeout methods, it does not guess a threshold with nothing to go on.
It detects hangs based on runtime history.
Basic Concept
while (…) { user code; MPI_Function(); }
Definition: $S_{out} = N_{out} / N_{total}$, where $N_{out}$ denotes the number of processes executing inside user code (i.e., outside MPI calls) and $N_{total}$ denotes the total number of processes employed in the run.
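As a minimal sketch of the definition above, the fraction can be computed from one snapshot of per-process states. The function name `s_out` and the boolean snapshot representation are illustrative assumptions, not part of ParaStack.

```python
from typing import Sequence


def s_out(in_user_code: Sequence[bool]) -> float:
    """Fraction of observed processes currently executing user code.

    Each entry is True if that process was seen outside MPI calls.
    (Hypothetical representation chosen for this sketch.)
    """
    if not in_user_code:
        raise ValueError("need at least one observed process")
    return sum(in_user_code) / len(in_user_code)


# Hypothetical snapshot of 10 monitored processes, 3 of them in user code.
snapshot = [True, False, False, True, False, False, False, False, True, False]
print(s_out(snapshot))
```

With 10 monitored processes, the value moves in steps of 0.1, which is why a run of very small values stands out.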
Dynamic Variation of $S_{out}$
[Figure: $S_{out}$ versus running timeline for LU, FT, and SP.]
A snippet of $S_{out}$ variation obtained by sampling at 1-millisecond intervals.
When a Hang Occurs
[Figure: $S_{out}$ versus running timeline for a faulty LU run, where a fault is simulated by a very long sleep injected at the left border of the red region.]
A program hang is characterized by two features: (1) very small $S_{out}$ and (2) consecutive observations of (1).
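Suspicion (Feature 1: small $S_{out}$)
$\hat{F}(S_{out})$ is the empirical cumulative distribution function obtained from randomly sampling $S_{out}$.
Given a probability $\hat{p}$, we obtain the threshold $t = \hat{F}^{-1}(\hat{p})$ and classify each observed value of $S_{out}$ into one of a pair of opposite random events: suspicion ($S_{out} \le t$) or non-suspicion ($S_{out} > t$).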
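The thresholding step can be sketched as an inverse empirical CDF over the sample history. This is only an illustration: the function names, the choice of the generalized inverse (smallest observed value whose empirical CDF reaches the given probability), and the sample data are assumptions.

```python
import math
from typing import Sequence


def suspicion_threshold(history: Sequence[float], p_hat: float) -> float:
    """Smallest observed value t whose empirical CDF F(t) is at least p_hat."""
    if not history:
        raise ValueError("history must not be empty")
    if not 0.0 < p_hat <= 1.0:
        raise ValueError("p_hat must lie in (0, 1]")
    ordered = sorted(history)
    # Number of samples that must be <= t; the small epsilon guards
    # against floating-point noise in the product.
    rank = max(1, math.ceil(p_hat * len(ordered) - 1e-9))
    return ordered[rank - 1]


def is_suspicion(value: float, threshold: float) -> bool:
    """A sample is a suspicion when it does not exceed the threshold."""
    return value <= threshold


# Hypothetical history of S_out samples.
history = [0.0, 0.1, 0.1, 0.2, 0.3, 0.5, 0.6, 0.6, 0.8, 0.9]
t = suspicion_threshold(history, 0.27)
print(t, is_suspicion(0.1, t), is_suspicion(0.2, t))
```

Because the threshold comes from the run's own history, it adapts to each program rather than relying on a fixed cutoff.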
Significance Test of Hang (Features 1+2: consecutively small $S_{out}$)
Geometric distribution: the probability of observing $Y = k$ suspicions before the first occurrence of a non-suspicion is $P(Y = k) = q^k (1 - q)$, where $q$ estimates the true suspicion probability $p$.
Given the confidence level $1 - \alpha$, we claim a hang is detected if $P(Y \ge k) = q^k \le \alpha$.
Put simply: something is very likely wrong when a very rare event occurs.
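A small sketch of this test, assuming the suspicion/non-suspicion classification has already been done: it finds the shortest streak $k$ with $q^k \le \alpha$ and reports the first sample at which such a streak is reached. The function names and the value $q = 0.5$ in the example are hypothetical.

```python
from typing import Iterable, Optional


def min_streak(q: float, alpha: float) -> int:
    """Smallest k such that q**k <= alpha."""
    if not (0.0 < q < 1.0 and 0.0 < alpha < 1.0):
        raise ValueError("q and alpha must lie strictly between 0 and 1")
    k = 1
    while q ** k > alpha:
        k += 1
    return k


def first_detection(suspicions: Iterable[bool], q: float, alpha: float) -> Optional[int]:
    """Index of the sample at which a hang is claimed, or None."""
    need = min_streak(q, alpha)
    streak = 0
    for index, suspicious in enumerate(suspicions):
        # A non-suspicion resets the count of consecutive suspicions.
        streak = streak + 1 if suspicious else 0
        if streak >= need:
            return index
    return None


# Hypothetical observation sequence: a short scare, then a real stall.
observed = [True] * 3 + [False] + [True] * 12
print(min_streak(0.5, 0.001), first_detection(observed, 0.5, 0.001))
```

The reset on a non-suspicion is what encodes "consecutive": isolated small values never accumulate into a detection.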
The Whole Picture
[Figure: overview of the model, relating the suspicion probability to consecutively observed suspicions.]
Two Problems with the Model
(1) How to achieve random sampling?
(2) The observed suspicion probability ($\hat{p}$) does not reflect the truth ($p$), i.e., $\hat{p} \ne p$.
Random Sampling
Insert a random time step, derived from the sampling interval $I$, between two consecutive samplings.
Too small $I$ → lack of randomness; bigger $I$ → better randomness.
[Figure: $S_{out}$ versus running timeline; ✗ marks samples that lack randomness, ✓ marks samples with better randomness.]
Solution: use the runs test to check the randomness of the sample sequence, and double $I$ whenever the sequence lacks randomness, until randomness is assured.
Random Sampling (Cont.)
Runs test: a standard test that checks the randomness of a two-valued data sequence.
Runs test procedure:
1) calculate the average of the sample sequence;
2) denote values bigger than the average as (+) and those smaller than the average as (-);
3) count the number of runs ($R$), where a run is a series of consecutive (+) or consecutive (-);
4) too small or too large $R$ → the sequence lacks randomness (significance test).
Random Sampling (Cont.)
Example. We have the sample sequence
0.2 0.1 0.1 0.2 0.1 0.1 0.0 0.0 0.8 0.9 1.0 0.8 0.9 0.1 0.9 0.9,
which can be transformed as below:
− − − − − − − − + + + + + − + +
Its average is 0.44375, the non-rejection region at 95% confidence is (4, 14), and $R = 4$. As $R$ is outside the non-rejection region, we claim the sampling is not random and thus double $I$.
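The procedure and the example can be sketched as follows. One assumption to flag: this sketch uses the large-sample normal approximation of the runs test with a two-sided 95% cutoff, rather than the non-rejection region quoted on the slide; on the slide's sequence it counts the same 4 runs and also rejects randomness. The names `runs_test` and `next_interval` are illustrative.

```python
import math
from typing import Sequence, Tuple


def runs_test(samples: Sequence[float], z_crit: float = 1.96) -> Tuple[int, bool]:
    """Return (number of runs, looks_random) for one sample sequence."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    mean = sum(samples) / len(samples)
    # True stands for (+); values not above the mean count as (-).
    signs = [value > mean for value in samples]
    n_plus = sum(signs)
    n_minus = len(signs) - n_plus
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    if n_plus == 0 or n_minus == 0:
        return runs, False  # a one-valued sequence carries no randomness
    n = n_plus + n_minus
    pairs = 2 * n_plus * n_minus
    expected = pairs / n + 1
    variance = pairs * (pairs - n) / (n * n * (n - 1))
    if variance <= 0:
        return runs, False  # too few samples to judge
    z = (runs - expected) / math.sqrt(variance)
    return runs, abs(z) < z_crit


def next_interval(interval_ms: int, samples: Sequence[float]) -> int:
    """Double the sampling interval when the sequence fails the runs test."""
    _, looks_random = runs_test(samples)
    return interval_ms if looks_random else 2 * interval_ms


SLIDE_EXAMPLE = [0.2, 0.1, 0.1, 0.2, 0.1, 0.1, 0.0, 0.0,
                 0.8, 0.9, 1.0, 0.8, 0.9, 0.1, 0.9, 0.9]
print(runs_test(SLIDE_EXAMPLE), next_interval(400, SLIDE_EXAMPLE))
```

For short sequences an exact critical-value table is the more careful choice; the approximation is used here only to keep the sketch self-contained.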
$\hat{p} \ne p$
The difference ($e$) between the observed probability ($\hat{p}$) and the true probability ($p$) is closely related to the sample size $n$.
Solution: we estimate $|p - \hat{p}| \le e$ at different sample-size levels with high confidence (95%):
  Sample size      $\hat{p}$    $e$
  11 ≤ n < 19      0.47         0.3
  19 ≤ n < 42      0.27         0.2
  42 ≤ n < 86      0.12         0.1
  86 ≤ n           0.06         0.05
At each level, we use a different credible $\hat{p}$ to define what a suspicion is ($S_{out} \le \hat{F}^{-1}(\hat{p})$).
Put simply: the difference gets smaller as the sample size increases.
$\hat{p} \ne p$ (Cont.)
$|p - \hat{p}| \le e$ is not enough, as underestimating $p$, i.e., $\hat{p} < p$, leads to false positives.
Given $\hat{p} < p$, $\hat{p}^k$ (the estimated probability that the program is still healthy) converges to the significance level $\alpha$ faster than $p^k$ as $k$ increases → more false positives.
We use $q = \hat{p} + e$ as an estimate of $p$ in the calculation of the hang probability ($q^k$), which guarantees that $q \ge p$ with 97.5% confidence.
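The sample-size levels and the conservative estimate $q = \hat{p} + e$ can be sketched as a simple lookup. The table values are taken from the slides; the function names and the choice to return `None` below 11 samples are assumptions made for illustration.

```python
from typing import Optional, Tuple

# (minimum sample size, p_hat, e), largest level first; values from the slides.
LEVELS = [(86, 0.06, 0.05), (42, 0.12, 0.1), (19, 0.27, 0.2), (11, 0.47, 0.3)]


def level_for(n: int) -> Optional[Tuple[float, float]]:
    """Return (p_hat, e) for sample size n, or None when n is below 11."""
    for min_n, p_hat, e in LEVELS:
        if n >= min_n:
            return p_hat, e
    return None  # assumption: no decision is made with fewer than 11 samples


def conservative_q(n: int) -> Optional[float]:
    """Upper estimate q = p_hat + e of the true suspicion probability."""
    level = level_for(n)
    if level is None:
        return None
    p_hat, e = level
    return p_hat + e


def healthy_probability(n: int, k: int) -> Optional[float]:
    """q**k: estimated chance that k consecutive suspicions occur in a healthy run."""
    q = conservative_q(n)
    return None if q is None else q ** k


for size in (5, 11, 30, 50, 100):
    print(size, level_for(size), conservative_q(size))
```

As more samples accumulate, $q$ shrinks, so fewer consecutive suspicions are needed before $q^k$ drops below the significance level.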
Outline: Question, Solution, Evaluation
Goal
- Trivial overhead
- High accuracy and low false positive rate
- ParaStack > Timeout
- Short detection delay
- Enable resource saving when a hang occurs
Evaluation Setting
Fault injection: a hang is simulated by injecting a sufficiently long sleep() in either the source code or the binary.
Target programs: HPL, HPCG, and the NPB benchmark set.
ParaStack's default setting:
- 10 randomly selected processes are monitored.
- Significance level $\alpha$ = 0.1%.
- The initial maximal sampling interval is set to $I$ = 400 ms.
Evaluation Setting (Cont.)
Number of hang-injected runs using default ParaStack, by scale, on the platforms Tardis, Tianhe-2, and Stampede:
  Scale    Runs
  256      800+, 20+
  1024     300+, 100+
  4096     50
  8192     5
  16384    3
Used notations:
  AC   Accuracy
  FP   False positive rate
  D    Average delay
  S    Standard deviation of delays
Overhead, Accuracy & False Alarms
Overhead: measured at scale 1024 with 5 runs of each program, with the automatic adaptation of $I$ disabled.
Average accuracy: over 99% for 100 runs of each program.
No false alarm was reported in:
- 39.7 hours of hang-free runs at scale 1024
- 66 hours of hang-free runs at scale 256
- all hang-injected runs
ParaStack vs. Timeout
10 runs per setting & 256 processes.
Timeout baseline: a hang is claimed upon K consecutive observations of $S_{out} \le 0$, sampled at a fixed interval I. Like ParaStack, it samples only 10 processes to keep the overhead trivial.
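For contrast with the statistical model, the timeout baseline can be sketched as a fixed-streak rule. The function name is hypothetical, and the threshold is left as an explicit parameter; the sample values below are made up for illustration and are assumed to arrive at a fixed interval.

```python
from typing import Iterable, Optional


def timeout_detect(samples: Iterable[float], k: int, threshold: float) -> Optional[int]:
    """Index at which k consecutive samples have been <= threshold, or None."""
    if k < 1:
        raise ValueError("k must be at least 1")
    streak = 0
    for index, value in enumerate(samples):
        # Any sample above the threshold resets the streak.
        streak = streak + 1 if value <= threshold else 0
        if streak >= k:
            return index
    return None


# Hypothetical S_out samples taken at a fixed interval.
trace = [0.3, 0.0, 0.0, 0.2, 0.0, 0.0, 0.0]
print(timeout_detect(trace, k=3, threshold=0.0))
```

Here both K and the interval are chosen by hand, which is exactly the tuning burden the dilemmas on the timeout slide describe.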
ParaStack vs. Timeout (Cont.)
10 runs per setting & 256 processes.
Setting of ParaStack:
- P: ParaStack initializing $I$ as 400 ms.
- P*: ParaStack initializing $I$ as 10 ms, which doesn't deliver random sampling.
P* compares well with P, as ParaStack is able to automatically adjust $I$ to ensure a good model.
Detection Delay
The median of detection delays (unit: seconds), based on 100 runs per setting at scale 256:
  BT   CG   LU   SP   FT   MG   HPL   HPCG
  4    6    3    3    13   3    4     5
Detection Delay (Cont.)
[Figure: delay on Tianhe-2 with 50 runs per setting.]
[Figure: delay on Stampede with 20 runs per setting at scale 1024 and 10 runs per setting at scale 4096.]
ParaStack detects hangs in a few seconds, which is far less than the commonly used 1-minute timeout.
Timesaving
[Figure: saved time (%) for each of 10 hangs; the bar values are 88.7%, 59.2%, 55.5%, 44.8%, 33.5%, 27.5%, 24.0%, 10.0%, 11.3%, and 0.0%.]
10 faulty HPL runs, with the hang's occurrence uniformly distributed over the program execution.
On average, 35.5% time saving.
Thank you! Any questions?