Improving Performance of Iterative Methods by Lossy Checkpointing

  1. Improving Performance of Iterative Methods by Lossy Checkpointing
     Dingwen Tao (University of California, Riverside), Sheng Di (Argonne National Laboratory), Xin Liang (University of California, Riverside), Zizhong Chen (University of California, Riverside), Franck Cappello (Argonne National Laboratory)
     June 2018

  6. Outline
     ➢ Introduction
       • Why do we need to checkpoint iterative methods?
     ➢ Background
       • Traditional checkpointing for iterative methods
       • Performance model of traditional checkpointing
     ➢ Our Designs
       • Lossy checkpointing for iterative methods
       • Performance model of our new checkpointing
     ➢ Theoretical Analysis
       • Impact of lossy checkpointing for different methods
       • Expected fault tolerance overhead
     ➢ Experimental Evaluation

  8. Why Do We Need to Checkpoint Iterative Methods?
     ➢ Iterative methods are used for solving large, sparse linear systems
       • "Gaia" mission by the European Space Agency (ESA)
       • Producing a 5-parameter astrometric catalogue at the microarcsecond level for 1 billion stars in the Galaxy
       • Resulting in a very large, sparse linear system of 72 billion equations
       • Scientists use the LSQR iterative algorithm
       • Takes more than 54 hours on 2,048 BlueGene/Q nodes
       • Largest symmetric indefinite sparse matrix from the UFL sparse matrix collection (KKT240 with 28 million linear equations)
       • 2,048 cores / 64 nodes on the Bebop cluster at Argonne
       • GMRES solver implemented in PETSc
       • Relative convergence tolerance of 10^-6, execution time > 1 hour
       • The MTBF of the Sunway TaihuLight supercomputer can be one hour or less
     [Figure: execution time (seconds) and number of iterations vs. number of processes (256, 512, 1024, 2048)]

  9. Importance of Improving Checkpointing Performance of Iterative Methods
     ➢ Scientific simulations involving PDEs
       • Solve linear systems within each timestep
       • Sparse linear systems include most of the variables
       • E.g., 3D CFD problems from the Navier-Stokes equations
       • Semi-Implicit Method for Pressure-Linked Equations (SIMPLE) algorithm
       • 5 out of 9 fluid-flow variables need to be checkpointed in the iterative method
     ➢ Significantly improve checkpointing performance of iterative methods → significantly improve application performance

  10. State-of-the-Art: Failure-Stop Failure
      [Diagram: checkpoint/restart model; compute processes 1 to k save their process states P_1 ... P_k to stable storage]
      • Periodic checkpointing to the file system is expensive
      • Difficult to scale up due to the I/O bandwidth bottleneck

  11. State-of-the-Art: Failure-Stop Failure
      [Diagram: compute processes 1 to k keep local checkpoints C_1 ... C_k of their process states P_1 ... P_k; a checkpoint process holds the XOR checkpoint encoding C; stable storage]
      Diskless checkpoint (J. Plank), 2 steps:
        1. Checkpointing the state of each application processor in memory
        2. Encoding these in-memory checkpoints and storing the encodings in checkpointing processors: C_1 + ... + C_n = C
      • More scalable (pros)
      • 2X or more memory overhead (cons) → reduces usable memory and problem size
      • Only able to tolerate partial failures, not a whole-system failure (cons)
      • Requires spare nodes and dedicated processors (cons)
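
As a rough illustration of the encoding step on this slide, the following Python sketch XORs in-memory checkpoints into one parity block and rebuilds a single lost checkpoint from the survivors. The function names (`encode_parity`, `recover_lost`) and the toy process states are hypothetical and not taken from the slides; the sketch assumes equal-sized checkpoints and at most one failed process.

```python
import numpy as np

def encode_parity(checkpoints):
    """XOR all in-memory checkpoints into one parity block (C_1 ^ ... ^ C_n = C)."""
    parity = np.zeros_like(checkpoints[0])
    for c in checkpoints:
        parity ^= c
    return parity

def recover_lost(surviving, parity):
    """Rebuild the single lost checkpoint from the survivors and the parity block."""
    lost = parity.copy()
    for c in surviving:
        lost ^= c
    return lost

# Hypothetical process states for 4 processes (illustration only).
states = [np.arange(8, dtype=np.float64) * (k + 1) for k in range(4)]
# Treat each checkpoint as raw bytes so XOR is well defined.
ckpts = [s.view(np.uint8).copy() for s in states]
parity = encode_parity(ckpts)

# Simulate losing process 2 and rebuild its checkpoint from the others.
rebuilt = recover_lost(ckpts[:2] + ckpts[3:], parity).view(np.float64)
```

The parity block is as large as one checkpoint, which is one way to see the extra memory cost listed among the cons: every process keeps a full in-memory copy of its state, and the checkpointing processors hold the encodings on top of that.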

  12. Failures and Checkpointing
      Optimized techniques to improve the scalability of checkpointing
      • Diskless checkpoint
      • Multi-level checkpoint
      • Asynchronous checkpoint
      • Lossless-compressed checkpoint
      • ...
      Question: Can we use lossy compression to (1) reduce checkpoint size and overhead and (2) improve performance and scalability?

  13. Failures and Checkpointing
      Question: Can we use lossy compression to (1) reduce checkpoint size and overhead and (2) improve performance and scalability?
      Lossy checkpointing: two important questions
        (1) What is the impact of the lossy checkpointing data on the execution performance?
        (2) Can lossy checkpointing actually improve the overall performance (including C/R and lossy compression) in the context of restarting with altered data?

  18. Traditional Checkpointing for Iterative Methods
      ➢ Checkpoint
        1. Checkpoint static variables (e.g., A, M) at the beginning
        2. Checkpoint dynamic variables (e.g., i, ρ, p, x) every several iterations
      ➢ Recovery
        1. Recover a correct computational environment
        2. Recover static variables
        3. Recover dynamic variables
        4. Recover recomputed variables (e.g., r)
      ➢ C/R cost is dominated by dynamic variables
        • Static variables are not checkpointed along the iterations (at most once)
        • Static variables: linear system matrix A and preconditioner M
        • The number of nonzeros (nnz) in A is usually 1x to 10x the size of the dynamic variables (i.e., the vector size)
        • M is much sparser than A, e.g., block Jacobi, ILU
        • Checkpoint frequency is usually much higher than the failure rate
        • MTTI = 4 hrs, T_ckpt = 18 s → checkpoint interval (Young's formula) = 12 mins
        • Checkpoint frequency is 30x higher than recovery frequency
      Focus on reducing the C/R overhead of dynamic variables in iterative methods by lossy compressors.
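
The checkpoint and recovery steps above can be sketched with a plain conjugate gradient (CG) solver. This is a minimal sketch under assumptions, not the authors' implementation: the checkpoint is kept in memory, the failure is simulated through a hypothetical `fail_at` parameter, no preconditioner M is used, and the static data (A, b) is assumed to have been restored already.

```python
import numpy as np

def cg_with_checkpoint(A, b, tol=1e-10, max_iter=1000, ckpt_every=3, fail_at=None):
    """CG that checkpoints only the dynamic variables (i, rho, p, x)."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rho = r @ r
    b_norm = np.linalg.norm(b)
    i = 0
    ckpt = None      # latest checkpoint of the dynamic variables
    failed = False   # the simulated failure happens at most once
    while i < max_iter and np.sqrt(rho) > tol * b_norm:
        if i % ckpt_every == 0:
            ckpt = (i, rho, p.copy(), x.copy())
        if fail_at is not None and i == fail_at and not failed:
            failed = True
            # Recovery: restore the dynamic variables from the checkpoint ...
            i, rho, p, x = ckpt[0], ckpt[1], ckpt[2].copy(), ckpt[3].copy()
            # ... and recompute the residual r instead of storing it.
            r = b - A @ x
            continue
        q = A @ p
        alpha = rho / (p @ q)
        x = x + alpha * p
        r = r - alpha * q
        rho_new = r @ r
        p = r + (rho_new / rho) * p
        rho = rho_new
        i += 1
    return x, i

# Hypothetical small symmetric positive definite test system.
rng = np.random.default_rng(0)
G = rng.standard_normal((20, 20))
A = G.T @ G + 20.0 * np.eye(20)
b = rng.standard_normal(20)

x_ok, it_ok = cg_with_checkpoint(A, b)                # failure-free run
x_fail, it_fail = cg_with_checkpoint(A, b, fail_at=4) # roll back from iteration 4 to 3
```

Only i, ρ, p, and x are saved, and r is rebuilt as b - Ax after the restore, which mirrors the "recomputed variables" step of the recovery list. The iterations between the last checkpoint and the failure are executed twice, which is the rollback cost.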

  21. Theoretical Analysis of Checkpointing Overhead for Iterative Methods
      • Overall execution time is composed of iteration time, checkpoint time, and recover/rollback time
      • Based on Young's formula and the expected mean time of a rollback, $T_{rb} = k T_{it} / 2$
      • Overall time can be simplified to [equation not included in the transcript]
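
To make these quantities concrete, here is a small sketch assuming Young's first-order formula (interval = sqrt(2 · T_ckpt · MTTI)) and reading k as the number of iterations between checkpoints and T_it as the time per iteration. Those readings, the function names, and the 2-second iteration time are assumptions for illustration; MTTI = 4 hrs and an 18 s checkpoint time are the values quoted in the traditional checkpointing slide.

```python
import math

def young_interval(t_ckpt, mtti):
    """Young's first-order optimal checkpoint interval, in seconds."""
    return math.sqrt(2.0 * t_ckpt * mtti)

def expected_rollback(k, t_it):
    """Expected rollback time T_rb = k * T_it / 2 (half an interval on average)."""
    return k * t_it / 2.0

mtti = 4 * 3600.0   # mean time to interruption: 4 hours, in seconds
t_ckpt = 18.0       # time to write one checkpoint, in seconds
interval = young_interval(t_ckpt, mtti)   # 720.0 s, i.e. 12 minutes

t_it = 2.0                      # hypothetical time per iteration, in seconds
k = round(interval / t_it)      # 360 iterations between checkpoints
t_rb = expected_rollback(k, t_it)   # 360.0 s lost per failure on average
```

The factor 1/2 reflects the assumption that a failure is equally likely at any point within a checkpoint interval, so on average half of the interval has to be recomputed.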
