Optimizing Asynchronous Multi-Level Checkpoint/Restart Configurations with Machine Learning
Tonmoy Dey†1, Kento Sato†2, Jian Guo†2, Bogdan Nicolae†3, Jens Domke†2, Weikuan Yu†1, Franck Cappello†3, Kathryn Mohror†4
†1 Florida State University, USA   †2 RIKEN Center for Computational Science (R-CCS), Japan   †3 Argonne National Laboratory, USA   †4 Lawrence Livermore National Laboratory, USA
Introduction: Checkpoint/Restart (C/R)
■ Checkpoint/Restart is a commonly used technique for large-scale applications that run for a long time:
§ It writes a snapshot of the application at fixed intervals, and
§ on a failure, the application can restart from the last checkpoint.
■ With the emergence of fast local storage, Multi-Level Checkpointing (MLC) has become a common approach, with checkpoints written hierarchically across the storage hierarchy.
[Figure: multi-level checkpointing over time across the storage hierarchy. Level-1 checkpoints (L1 interval) go to node-local storage with XOR-encoded groups; Level-2 checkpoints (L2 interval) go to the parallel file system (PFS).]
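As a rough illustration of the hierarchy above, the sketch below shows one simple, hypothetical policy for deciding which level a checkpoint goes to; the function and parameter names are illustrative, not the poster's API.

```python
# Hypothetical multi-level checkpoint policy (names are illustrative):
# write a cheap level-1 checkpoint to node-local storage every interval,
# and promote every k-th checkpoint to a level-2 copy on the PFS.

def checkpoint_level(step: int, l1_interval: int, l2_every: int) -> int:
    """Return 0 (no checkpoint), 1 (node-local/XOR), or 2 (PFS)."""
    if step % l1_interval != 0:
        return 0
    count = step // l1_interval
    return 2 if count % l2_every == 0 else 1

# Example: L1 checkpoint every 100 steps; every 5th one also goes to the PFS.
levels = [checkpoint_level(s, 100, 5) for s in range(1, 1001)]
```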
Background and Motivation: Optimal Checkpoint Configuration
■ Determining the optimal checkpoint configuration is crucial for efficient checkpointing. However, finding this optimal configuration is complicated.
■ There is a tradeoff in choosing the configuration:
§ Frequent checkpointing: more resilient, but spends more I/O time on checkpoints (high overhead).
§ Infrequent checkpointing: low overhead, but loses more useful computation on a failure (less resilient).
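For a single checkpoint level, this tradeoff has a classical first-order solution (Young's formula); the snippet below computes it only to make the tradeoff concrete. This is not the poster's method, which targets multi-level checkpointing, where no such simple closed form applies.

```python
import math

# Classical single-level estimate (Young's formula), shown only to
# illustrate the tradeoff between checkpoint cost and lost work.
def young_interval(ckpt_cost_s: float, mtbf_s: float) -> float:
    """Near-optimal checkpoint interval ~ sqrt(2 * C * MTBF) for one level."""
    return math.sqrt(2.0 * ckpt_cost_s * mtbf_s)

# e.g. a 60 s checkpoint on a system with a 24 h MTBF:
print(young_interval(60.0, 24 * 3600.0))  # ~3220 s, roughly 54 min
```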
Background and Motivation: Approaches to Determine the Optimal Configuration
■ There are two existing approaches to determine the checkpoint configuration:
§ Approach 1: Modeling checkpointing behaviors. Execution states are categorized into compute, checkpoint, and recovery states. This approach works well for simpler checkpoint models, but becomes significantly harder to apply to complex systems.
§ Approach 2: Simulating to find the optimal checkpointing. The simulation approach is much more accurate than the modeling approach; however, it takes a very long time to find optimal checkpoint configurations.
■ In this paper, we obtain the optimal checkpoint configuration for a given HPC system by exploiting the accuracy of the simulation approach and combining it with machine learning models, avoiding the time a full simulation sweep would take to reach the optimal result.
Design and Implementation: Combining Simulation with Machine Learning
■ We apply AI techniques to learn checkpoint schemes for different C/R scenarios. There are two distinct ways to achieve this (a sketch of the first follows below):
§ Machine Learning (ML) model: use existing machine learning models on the simulated dataset and see how well they learn.
§ Neural Network (NN) model: build our own neural network and see how well it can learn and predict the optimal configuration.
[Figure: different C/R scenarios feed an ML model (a forest of decision trees T1 ... Tn) that predicts an approximate optimum: the L1 checkpoint count, the L1 checkpoint interval, and the resulting efficiency.]
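A minimal sketch of the ML-model path, assuming a scikit-learn random forest and a synthetic stand-in for the simulated dataset; the real feature and target layout is not specified on the poster.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the simulated dataset: each row would map a C/R
# scenario (per-level checkpoint overheads, restart times, failure rates)
# to the optimal L1 checkpoint count found by simulation. The feature
# count and target formula here are placeholders, not the real schema.
rng = np.random.default_rng(0)
X = rng.uniform(size=(5000, 6))            # stand-in scenario parameters
y = (10 * X[:, 0] + 5 * X[:, 3]).round()   # stand-in optimal L1 counts

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)
print("R^2 on held-out scenarios:", model.score(X_te, y_te))
```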
Design and Implementation: Simulation
■ The simulator was developed to replicate the behavior of real-world scenarios when using three-level checkpointing on large-scale systems.
■ The simulator takes three critical parameters for each level: checkpoint overhead, restart time, and failure rate.
■ From these parameters, the simulator reports the elapsed time and the efficiency (the percentage of time spent on useful computation) of the system.
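A greatly simplified, single-level Monte Carlo sketch of what such a simulator computes; the actual simulator models three levels with per-level overheads, restart times, and failure rates, and the exponential failure model here is an assumption.

```python
import random

def simulate(work_s, interval_s, ckpt_s, restart_s, mtbf_s, seed=0):
    """Return (elapsed_s, efficiency) for one simulated single-level run."""
    rng = random.Random(seed)
    elapsed, done = 0.0, 0.0
    next_failure = rng.expovariate(1.0 / mtbf_s)
    while done < work_s:
        compute = min(interval_s, work_s - done)
        segment = compute + ckpt_s
        if elapsed + segment <= next_failure:
            elapsed += segment      # compute segment and checkpoint complete
            done += compute         # progress is now safely stored
        else:
            # Failure mid-segment: work since the last checkpoint is lost;
            # pay the restart cost and draw the next failure time.
            elapsed = next_failure + restart_s
            next_failure = elapsed + rng.expovariate(1.0 / mtbf_s)
    return elapsed, work_s / elapsed  # efficiency = useful compute / elapsed

# e.g. 24 h of work, 1 h intervals, 60 s checkpoints, 300 s restarts, 12 h MTBF
print(simulate(86400, 3600, 60, 300, 43200))
```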
Design and Implementation: Model Optimization
§ Daisy chaining: feed the output of the checkpoint-count prediction as an input to the neural network that predicts the checkpoint interval (see the sketch below).
§ Parameter optimization/reduction: remove interdependent, redundant parameters.
[Figure: different C/R scenarios feed a random forest (decision trees T1 ... Tn) that predicts the optimal checkpoint count; this count is daisy-chained into the C/R NN model, which predicts the optimal checkpoint interval.]
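A minimal sketch of daisy chaining, assuming a random forest for the count stage and a small scikit-learn MLP standing in for the C/R NN model; the synthetic data and model choices are assumptions, not the poster's exact setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

# Stand-in data: scenarios, optimal counts, and optimal intervals.
rng = np.random.default_rng(1)
X = rng.uniform(size=(5000, 6))                  # simulated C/R scenarios
y_count = (10 * X[:, 0] + 5 * X[:, 3]).round()   # stand-in optimal counts
y_interval = 3600 * X[:, 1] + 60 * y_count       # stand-in optimal intervals

# Stage 1: predict the checkpoint count from the scenario features.
count_model = RandomForestRegressor(random_state=0).fit(X, y_count)

# Stage 2 (daisy chain): append the count prediction to the features
# before the interval model sees them.
X_chained = np.column_stack([X, count_model.predict(X)])
interval_model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500,
                              random_state=0).fit(X_chained, y_interval)
```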
Evaluation: Neural Network vs. Machine Learning
§ For a three-level checkpoint model, the neural network performed better, with accuracy improvements between 19% and 51% over the machine learning models.
[Chart: neural network performance improvement over Random Forest, Gaussian Naïve Bayes, and Support Vector Clustering for the Count_L1 and Count_L2 predictions, ranging from about 19.18% to 51.05%.]
Conclusion
§ We present an approach that combines the simulation approach with machine learning models to determine optimized parameter values for different C/R configurations.
§ We show that our models can predict the optimized parameter values when trained with data from the simulation approach.
§ We also demonstrate that techniques such as neural networks can improve performance over the machine learning models, with the neural network sometimes exceeding a machine learning model's performance by over 50%.
Contact Information
Tonmoy Dey: td18d@my.fsu.edu
Kento Sato: kento.sato@riken.jp
Bogdan Nicolae: bnicolae@anl.gov
Jian Guo: jian.guo@riken.jp
Jens Domke: jens.domke@riken.jp
Weikuan Yu: yuw@cs.fsu.edu
Franck Cappello: cappello@mcs.anl.gov
Kathryn Mohror: mohror1@llnl.gov