
Automatic and Coordinated Job Recovery for High Performance Computing



  1. Automatic and Coordinated Job Recovery for High Performance Computing. Wei Tang (1), Zhiling Lan (1), Narayan Desai (2), and Daniel Buettner (2). (1) Illinois Institute of Technology; (2) Argonne National Laboratory. Nov 15, 2010.

  2. Outline: Motivation, System Design, Implementation, Evaluations.

  3. System Failure and Fault Tolerance. System failures are increasingly common as the scale of supercomputers grows. Fault tolerance schemes have been proposed continuously: redundancy and replication; checkpoint/restart; failure prediction plus process migration; failure prediction plus fault-aware job scheduling. Most existing fault tolerance schemes focus on pre-failure avoidance, though post-failure handling is equally important.

  4. Resource management system. Functionality: manages the processing load and prevents jobs from competing with each other for limited compute resources. Two parts: the resource manager, which maintains resources (e.g., job queues and computing nodes), and the job scheduler, which makes scheduling decisions (i.e., when and where to run a job). Examples: PBS (Altair), Moab (Adaptive Computing), LSF (Platform), LoadLeveler (IBM), Cobalt (ANL).

  5. Motivation. Fault-tolerance aspect: precautionary fault avoidance does not suffice, because failures are inevitable. Post-failure recovery is important, but little existing work addresses it. Resource management aspect: the resource manager assumes that jobs will run to completion, so it hardly supports post-failure handling. Because resources are limited, failed jobs should be treated differently according to their importance or priority.

  6. Our approach. AuCoRe: Automatic and Coordinated job Recovery. It extends the resource management system to support post-failure handling. AuCoRe automatically resubmits failed jobs in a systematic manner, treating failed jobs according to their recovery priority and coordinating failed-job recovery with the queuing of regular jobs.

  7. Design diagram. Figure: Diagram of AuCoRe. Users are allowed to specify their job recovery options in job submission scripts or commands. Jobs are maintained in three groups: the waiting job queue, the running job list, and the failed job queue. A recovery manager enables automatic and coordinated job recovery and supports an incentive management mechanism.

  8. Recovery Options. The user specifies a recovery option in the submission script. Suggested options:
     - Option A: notify only
     - Option B: resubmit to the rear of the queue
     - Option C: restart the job on its original nodes when they are repaired
     - Option D: insert the job in the middle of the queue
     - Option E: resubmit to the head of the queue

  9. Coordinated recovery. Figure: Treatments for failed jobs with different recovery options. Option-A jobs are set aside to wait for manual resubmission; option-C jobs are suspended until their computing nodes are recovered; jobs with option B, D, or E are resubmitted to different parts of the waiting job queue.
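
The routing described on slides 8 and 9 can be sketched in a few lines of Python. This is only an illustration, not the AuCoRe or Cobalt code: the function name `recover`, the three containers, and the choice of the integer midpoint for "the middle of the queue" are all assumptions made for the example.

```python
def recover(job, option, waiting, suspended, notified):
    """Route a failed job by its recovery option (A to E).

    Hypothetical helper: `waiting` is the waiting job queue (index 0 is the
    head), `suspended` holds option-C jobs until their nodes are repaired,
    and `notified` holds option-A jobs awaiting manual resubmission.
    """
    if option == "A":
        notified.append(job)                    # notify only
    elif option == "B":
        waiting.append(job)                     # rear of the queue
    elif option == "C":
        suspended.append(job)                   # wait for original nodes
    elif option == "D":
        # Assumption: "middle" means the integer midpoint of the queue.
        waiting.insert(len(waiting) // 2, job)
    elif option == "E":
        waiting.insert(0, job)                  # head of the queue
    else:
        raise ValueError(f"unknown recovery option: {option!r}")


waiting = ["w1", "w2", "w3", "w4"]   # regular jobs already queued
suspended, notified = [], []
for opt in "ABCDE":
    recover(f"j{opt}", opt, waiting, suspended, notified)

print(waiting)    # failed jobs B, D, E interleaved with regular jobs
print(suspended)  # option-C job
print(notified)   # option-A job
```

A plain list keeps the sketch short; a production scheduler would more likely recompute queue positions from priorities than splice entries in by index.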

  10. Incentive management. User behavior is hard to manage: users may ignore the recovery option, or game the system by always specifying high options. Incentive mechanism: users pay for each recovery option with (virtual) credits at job submission. A higher recovery priority costs more credits. Credits are prepaid and are not returned even if no failure occurs (like insurance). A job defaults to the lowest option if none is specified.

  11. Incentive mechanism. Pricing: $C = \alpha_i \times T \times N$, where $C$ is the cost for a job with recovery option $i$, $\alpha_i$ is the unit price of recovery option $i$, $T$ is the job's running time (in hours), and $N$ is the number of the job's computing nodes. User recovery account: $S = \beta \times T \times N$, where $S$ is the amount of credits assigned to a user each time the user submits a job, and $\beta$ is a parameter set by the system owner, usually the median unit price ($P_m$). Charging: $B = (\alpha_i - \beta) \times T \times N$, where $B$ is the actual charge for a job.
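
As a worked illustration of the three formulas on slide 11, the sketch below computes C, S, and B for one job. The unit prices in `UNIT_PRICE` are invented for the example (the slides give no concrete values), and reading α_i as a price per node-hour is an interpretation based on the units of T and N. Note that B = C − S follows directly from the definitions.

```python
from statistics import median

# Hypothetical unit prices per node-hour for options A to E (made-up values).
UNIT_PRICE = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0, "E": 5.0}

# beta is set by the system owner; the slide suggests the median unit price.
BETA = median(UNIT_PRICE.values())


def cost(option, hours, nodes):
    """C = alpha_i * T * N: credits a job pays for its recovery option."""
    return UNIT_PRICE[option] * hours * nodes


def allotment(hours, nodes):
    """S = beta * T * N: credits granted to the user at job submission."""
    return BETA * hours * nodes


def charge(option, hours, nodes):
    """B = (alpha_i - beta) * T * N: net charge, equal to C - S."""
    return (UNIT_PRICE[option] - BETA) * hours * nodes


# A 2-hour job on 512 nodes.
for opt in "ACE":
    print(opt, cost(opt, 2, 512), allotment(2, 512), charge(opt, 2, 512))
```

With these made-up prices, the median-priced option nets to zero, a cheaper option yields a negative B (the user keeps surplus credits), and a more expensive option draws down the account, which appears to be the intended incentive.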

  12. Implementation. Figure: AuCoRe implementation with Cobalt, a production resource management system developed by Argonne National Laboratory.

  13. Evaluation. Event-driven simulation using Qsim, a job scheduling simulator that works with the Cobalt resource manager. Uses a real job trace from the Blue Gene/P system at Argonne National Laboratory. Uses synthetic failure events that follow a Weibull distribution.
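
One plausible way to produce synthetic failure events like those mentioned on slide 13 is to draw the time between consecutive failures from a Weibull distribution. The sketch below does this with Python's standard `random.weibullvariate`; it is not Qsim code, and the function name, the scale and shape values, and the assumption that the Weibull distribution governs inter-arrival times are all illustrative.

```python
import random


def failure_times(scale_hours, shape, horizon_hours, seed=0):
    """Return sorted failure timestamps (in hours) within the horizon.

    Assumption: gaps between consecutive failures are i.i.d. Weibull draws
    with the given scale and shape. All parameter values are hypothetical.
    """
    rng = random.Random(seed)          # seeded for reproducible traces
    now, events = 0.0, []
    while True:
        # weibullvariate(alpha, beta): alpha is the scale, beta the shape.
        now += rng.weibullvariate(scale_hours, shape)
        if now > horizon_hours:
            return events
        events.append(now)


events = failure_times(scale_hours=24.0, shape=0.7, horizon_hours=1000.0)
print(len(events), "failures in 1000 simulated hours")
```

A shape parameter below 1 gives a decreasing hazard rate, so failures tend to cluster; that choice here is illustrative rather than taken from the slides.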

  14. Simulation cases.

      | Cases                   | Denote | Description                       |
      |-------------------------|--------|-----------------------------------|
      | W/O AuCoRe              | FF     | failure-free                      |
      | W/O AuCoRe              | MR     | failure-present, manual resubmit  |
      | W/ AuCoRe (multi-opt)   | Even   | Option proportion is 1:1:1:1:1    |
      | W/ AuCoRe (multi-opt)   | Normal | Option proportion is 1:2:4:2:1    |
      | W/ AuCoRe (single-opt)  | All-B  | all with option B                 |
      | W/ AuCoRe (single-opt)  | All-C  | all with option C                 |
      | W/ AuCoRe (single-opt)  | All-D  | all with option D                 |
      | W/ AuCoRe (single-opt)  | All-E  | all with option E                 |

  15. Evaluation metrics. Response time (RESP): a job's response time is the time from its submission to its completion; RESP is the average over all jobs. Failure slowdown (FSD): the ratio of the time delay caused by failure to the failure-free job execution time; FSD is the average over failed jobs.
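
The two metrics on slide 15 can be computed as below. This is a minimal sketch: the `JobRecord` fields are hypothetical, and the delay attributed to a failure is taken as a given input, since the slides do not say how it is measured.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class JobRecord:
    # Hypothetical per-job record; all times are in the same unit.
    submit: float                # submission time
    finish: float                # completion time
    runtime: float               # failure-free execution time
    failed: bool = False         # whether the job experienced a failure
    failure_delay: float = 0.0   # extra time caused by the failure


def resp(jobs):
    """Average response time (completion minus submission) over all jobs."""
    return mean(j.finish - j.submit for j in jobs)


def fsd(jobs):
    """Average failure slowdown (delay / failure-free runtime) over failed jobs."""
    return mean(j.failure_delay / j.runtime for j in jobs if j.failed)


jobs = [
    JobRecord(submit=0.0, finish=10.0, runtime=4.0),
    JobRecord(submit=2.0, finish=20.0, runtime=5.0, failed=True, failure_delay=10.0),
]
print(resp(jobs))  # mean of 10.0 and 18.0
print(fsd(jobs))   # 10.0 / 5.0 for the single failed job
```

`fsd` raises `statistics.StatisticsError` when no job failed, which matches the failure-free (FF) case where the metric is undefined.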

  16. Baseline simulations. Figure: Baseline.

  17. Comparison. Figure: Comparing multi-option cases with single-option ones. The X-axis represents the job groups categorized by their recovery options.

  18. Multi-option vs single-option. Figure: Comparing multi-option cases with single-option ones. The X-axis represents the job groups categorized by their recovery options.

  19. Performance under different MTTR. Figure: Performance under different MTTR.

  20. Performance under different system MTBF. Figure: Performance under different system MTBF.

  21. Performance under different job arrival rates. Figure: Performance under different job arrival rates.

  22. Results Summary. AuCoRe can significantly improve the performance of failed jobs and the overall system performance. In the multi-option cases, higher-priority recovery options yield larger performance gains than lower-priority options, especially on FSD. That is, recovery option diversity can benefit the jobs that are considered truly important. The recovery performance is sensitive to MTTR; therefore, MTTR should be considered when setting the relative unit price of option C. AuCoRe is effective under different system failure rates and job arrival rates.
