Scheduling computational workflows on failure-prone platforms Guillaume Aupy, Anne Benoit, Henri Casanova & Yves Robert Joint Lab, Nov. 2014
Motivation
Many HPC applications can be represented as a computational workflow, modeled as a DAG:
◮ Vertices are tightly coupled parallel tasks
◮ Edges represent data dependencies
E.g., the CyberShake workflow (used to characterize earthquake hazards), as presented by Pegasus.
1 Motivation
2 Models
◮ Platform
◮ Fault-tolerance
◮ Application
3 Results
◮ Computation of the expected makespan
◮ NP-hardness, polynomial algorithms for special graphs
4 Efficient heuristic evaluation
◮ Heuristics
◮ Evaluation
5 Conclusion
Platform and processor assignments
Failure-prone platform:
◮ $p$ processors
◮ Exponential failure distribution, MTBF: $\mu = 1/\lambda$
Mixed parallelism is hard, even without failures:
◮ Assignment of processors to tasks? (throughput)
◮ Traversal of the graph? (scheduling)
◮ Data redistribution? (model redistribution cost)
Simplified scenario: each task uses all available processors; the workflow is linearized.
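Under the exponential failure model, the probability that no failure strikes during $t$ seconds of execution is $e^{-\lambda t}$, with $\lambda = 1/\mu$. The sketch below only illustrates that relation; the function name and its parameters are hypothetical and not part of the talk.

```python
import math

def success_probability(duration, mtbf):
    """Probability of no failure during `duration` seconds.

    Assumes exponentially distributed failures with rate lambda = 1 / mtbf.
    The name and signature are illustrative only.
    """
    if mtbf <= 0:
        raise ValueError("mtbf must be positive")
    lam = 1.0 / mtbf
    return math.exp(-lam * duration)
```

For instance, a segment as long as the MTBF completes without failure with probability $e^{-1}$, which is why long uncheckpointed segments are risky.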
Fault tolerance
We use checkpointing for fault tolerance.
Checkpointing within tasks is expensive or hard:
◮ Expensive: an application-agnostic checkpoint must save the full image
◮ Hard: checkpointing only what is necessary requires modifying the implementation of the tasks
Checkpoint model: we only checkpoint the output data of tasks.
Application model
Given a DAG $G = (V, E)$. For each task $T_i$, we know:
◮ $w_i$: its execution time
◮ $c_i$: the time to checkpoint its output
◮ $r_i$: the time to recover its output
DAG-CkptSched
◮ In which order should the tasks be executed?
◮ Which tasks should be checkpointed?
We want to minimize the expected execution time.
Motivational example
[Figure: example DAG with tasks $T_0$ to $T_7$]
A solution (schedule):
◮ Order: $T_0, T_1, T_2, T_3, T_4, T_5, T_6, T_7$
◮ Checkpointed: $T_1$, $T_4$
Timeline of an execution in which a fault occurs after the checkpoint of $T_4$:
$w_0\ w_1\ c_1\ w_2\ w_3\ w_4\ c_4$ [fault] $r_1\ w_5\ r_4\ w_6\ w_2\ w_3\ w_7$
After the fault, the checkpointed outputs of $T_1$ and $T_4$ are recovered ($r_1$, $r_4$) when they are needed, while $T_2$ and $T_3$, which were not checkpointed, are re-executed before $T_7$.
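The replay logic of this example can be sketched as follows: after a fault, an output that is needed is recovered if its task was checkpointed, and recomputed otherwise. This is a minimal sketch, not the authors' code. The edge set `PREDS` is an assumption (the figure is not recoverable from the slide) chosen only because it reproduces the timeline of the example, and the function `replay` and its parameters are hypothetical.

```python
# ASSUMPTION: predecessor lists are guessed so that the resulting timeline
# matches the one in the example; they are not taken from the slide.
PREDS = {0: [], 1: [0], 2: [1], 3: [2], 4: [3], 5: [1], 6: [4, 5], 7: [3]}
ORDER = [0, 1, 2, 3, 4, 5, 6, 7]   # linearized schedule
CKPT = {1, 4}                      # tasks whose output is checkpointed

def replay(order, preds, ckpt, fault_after):
    """Return the symbolic timeline when a single fault wipes memory right
    after task `fault_after` (and its checkpoint, if any) completes."""
    steps = []
    in_memory = set()   # outputs currently available in memory
    done = set()        # tasks that completed at least once

    def obtain(t):
        # Make the output of task t available in memory.
        if t in in_memory:
            return
        if t in done and t in ckpt:
            steps.append(f"r{t}")   # recover from the checkpoint
            in_memory.add(t)
        else:
            run(t)                  # not checkpointed: re-execute

    def run(t):
        for p in preds[t]:
            obtain(p)
        steps.append(f"w{t}")
        in_memory.add(t)
        if t in ckpt:
            steps.append(f"c{t}")
        done.add(t)

    for t in order:
        run(t)
        if t == fault_after:
            steps.append("fault")
            in_memory.clear()       # a fault loses everything not checkpointed
    return steps

print(" ".join(replay(ORDER, PREDS, CKPT, fault_after=4)))
```

With the assumed edges, the printed sequence follows the timeline of the example, with the fault placed after $c_4$.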
Previous results (Bougeret et al. 2011)
Let $E[t(w; c; r)]$ be the expected time to execute a single application with:
◮ $w$ sec. of computation in a fault-free execution
◮ $c$ sec. to checkpoint the output
◮ $r$ sec. to recover (if a failure occurs)
$$E[t(w; c; r)] = e^{\lambda r} \left( \frac{1}{\lambda} + D \right) \left( e^{\lambda (w + c)} - 1 \right),$$
where $D$ denotes the downtime after a failure.
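The closed form above is straightforward to evaluate numerically. The sketch below is only an illustration: the function name is hypothetical, and $D$ is assumed to be a constant downtime paid after each failure.

```python
import math

def expected_time(w, c, r, lam, downtime=0.0):
    """Evaluate E[t(w; c; r)] = e^{lam*r} * (1/lam + D) * (e^{lam*(w+c)} - 1).

    w, c, r are in seconds; lam is the failure rate (1 / MTBF);
    `downtime` plays the role of D (assumed constant).
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    # expm1 keeps precision when lam * (w + c) is very small.
    return math.exp(lam * r) * (1.0 / lam + downtime) * math.expm1(lam * (w + c))
```

As the failure rate tends to zero, the value tends to $w + c$, the fault-free time; it grows as the failure rate increases.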
Theorem
Given a DAG and a schedule for this DAG, it is possible to compute the expected execution time in polynomial time.
$X_i$: execution time between the end of the first successful execution of $T_{i-1}$ and the end of the first successful execution of $T_i$ (a random variable).
[Timeline: $w_0\ w_1\ c_1\ w_2\ w_3\ w_4\ c_4\ r_1\ w_5\ r_4\ w_6\ w_2\ w_3\ w_7$, with the segments $X_0$, $X_1$, $X_5$, $X_7$ marked]
We want to compute $E\left[\sum_i X_i\right] = \sum_i E[X_i]$.
Sketch of Proof (1/2)
$Z^i_k$: "There was a fault during $X_k$ and no fault during $X_{k+1}$ to $X_{i-1}$" (i.e., when starting $X_i$, the last fault was during $X_k$).
$$\rightarrow\ E[X_i] = \sum_{k=0}^{i-1} P(Z^i_k)\, E[X_i \mid Z^i_k]$$
$T^{\downarrow k}_i$: all $T_j$'s whose output should be computed during $X_i$ if $Z^i_k$. We separate their impact on the execution time into $W^i_k$ and $R^i_k$ (depending on whether $T_j$ was checkpointed).
Example (same DAG and schedule; timeline $w_0\ w_1\ c_1\ w_2\ w_3\ w_4\ c_4\ r_1\ w_5$):
◮ $T_4 \in T^{\downarrow 5}_6$, so $R^6_5 = r_4$
◮ $T_1, T_5, T_2, T_3 \notin T^{\downarrow 5}_6$