
Resilience for Multigrid Software at the Extreme Scale Markus Huber - PowerPoint PPT Presentation

Resilience for Multigrid Software at the Extreme Scale. Markus Huber, joint work with: Björn Gmeiner, Lorenz John, Ulrich Rüde, Barbara Wohlmuth. huber@ma.tum.de, Technische Universität München, Germany. January 25-27, 2016, SPPEXA


  1. Resilience for Multigrid Software at the Extreme Scale
     Markus Huber, joint work with: Björn Gmeiner, Lorenz John, Ulrich Rüde, Barbara Wohlmuth
     huber@ma.tum.de, Technische Universität München, Germany
     January 25-27, 2016, SPPEXA Symposium 2016

  2. 0 Outline
     Overview
     • Terraneo: An Exa-scale Mantle Convection Framework
         • Model problem
         • Ultra scalability
     • Building a Fault Tolerant Multigrid Solver
         • Challenges in exa-scale systems
         • Problem setting
         • Recovery strategies
         • Single fault scenarios
         • Multiple fault scenarios
     • Towards Geophysical Applications

  3. 1 Terraneo: An Exa-scale Mantle Convection Framework
     Terraneo: An Exa-scale Mantle Convection Framework

  4. 1 Terraneo: An Exa-scale Mantle Convection Framework
     Stokes equations and equal order discretization
     Let $\Omega \subset \mathbb{R}^3$ with $\Gamma = \partial\Omega$:
     $$-\nu \Delta u + \nabla p = f \ \text{in } \Omega, \qquad \operatorname{div} u = 0 \ \text{in } \Omega, \qquad u = 0 \ \text{on } \Gamma.$$
     Equal order discretization ($P_1$–$P_1$) [Hughes 1986] [Brezzi, Douglas 1988]
     Find $(u_h, p_h) \in V_h \times Q_h$ such that
     $$a(u_h, v_h) + b(v_h, p_h) = f(v_h) \quad \forall v_h \in V_h,$$
     $$b(u_h, q_h) - c_h(q_h, p_h) = g_h(q_h) \quad \forall q_h \in Q_h,$$
     with the level-dependent stabilization terms
     $$c_h(q_h, p_h) = \sum_{T \in \mathcal{T}_h} \delta_T h_T^2 \langle \nabla p_h, \nabla q_h \rangle_T \quad \text{and} \quad g_h(q_h) = -\sum_{T \in \mathcal{T}_h} \delta_T h_T^2 \langle f, \nabla q_h \rangle_T.$$

  5. 1 Terraneo: An Exa-scale Mantle Convection Framework
     Numerical simulation to the extreme
     • Uzawa-type multigrid method [Bank, Welfert, Yserentant 90], [Schöberl, Zulehner 03]
       Apply an inexact Uzawa smoother
       $$u_{k+1} = u_k + \hat{A}^{-1}(f - A u_k - B^\top p_k),$$
       $$p_{k+1} = p_k + \hat{S}^{-1}(B u_{k+1} - C p_k - g).$$
       Remark: For convergence we need $\hat{A} \ge A$ and $\hat{S} \ge C + B \hat{A}^{-1} B^\top$.
     • Scalability on a current peta-scale system (JUQUEEN)

       | Nodes  | Threads | DoFs                  | iter | time   |
       |--------|---------|-----------------------|------|--------|
       | 5      | 80      | $2.7 \cdot 10^{9}$    | 10   | 617.28 |
       | 40     | 640     | $2.1 \cdot 10^{10}$   | 10   | 703.69 |
       | 320    | 5 120   | $1.2 \cdot 10^{11}$   | 10   | 741.86 |
       | 2 560  | 40 960  | $1.7 \cdot 10^{12}$   | 9    | 720.24 |
       | 20 480 | 327 680 | $1.1 \cdot 10^{13}$   | 9    | 776.09 |
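The two update formulas of the inexact Uzawa smoother can be tried on a tiny saddle-point system. The following is only a minimal sketch, not the multigrid smoother used in the talk: the 2x2 matrix `A`, the constraint `B`, the stabilization `C`, the right-hand sides and the scalar approximations `a_hat` (standing for $\hat{A} = 4\,\mathrm{Id}$) and `s_hat` (standing for $\hat{S} = 1$) are all made-up illustration values, chosen so that $\hat{A} \ge A$ and $\hat{S} \ge C + B\hat{A}^{-1}B^\top = 0.6$ hold.

```python
def matvec(M, x):
    """Dense matrix-vector product for small list-of-lists matrices."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def inexact_uzawa(A, B, C, f, g, a_hat, s_hat, iterations):
    """Repeat u <- u + a_hat^{-1} (f - A u - B^T p), then
    p <- p + s_hat^{-1} (B u_new - C p - g), starting from zero."""
    Bt = [list(col) for col in zip(*B)]
    u = [0.0] * len(A)
    p = [0.0] * len(B)
    for _ in range(iterations):
        Au, Btp = matvec(A, u), matvec(Bt, p)
        u = [ui + (fi - a - b) / a_hat for ui, fi, a, b in zip(u, f, Au, Btp)]
        Bu, Cp = matvec(B, u), matvec(C, p)  # uses the updated velocity
        p = [pj + (b - c - gj) / s_hat for pj, b, c, gj in zip(p, Bu, Cp, g)]
    return u, p


def residual(A, B, C, f, g, u, p):
    """Max-norm residual of A u + B^T p = f and B u - C p = g."""
    Bt = [list(col) for col in zip(*B)]
    r_u = [fi - a - b for fi, a, b in zip(f, matvec(A, u), matvec(Bt, p))]
    r_p = [gj - b + c for gj, b, c in zip(g, matvec(B, u), matvec(C, p))]
    return max(abs(r) for r in r_u + r_p)


# Hypothetical toy data (not from the talk).
A = [[2.0, -1.0], [-1.0, 2.0]]   # eigenvalues 1 and 3, so 4*Id >= A
B = [[1.0, 1.0]]
C = [[0.1]]
f = [1.0, 0.0]
g = [0.0]

u, p = inexact_uzawa(A, B, C, f, g, a_hat=4.0, s_hat=1.0, iterations=200)
print(residual(A, B, C, f, g, u, p))
```

In the talk this iteration acts as a smoother inside a multigrid cycle; here it is simply run as a stand-alone solver to show the order of the velocity and pressure updates.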

  6. 1 Terraneo: An Exa-scale Mantle Convection Framework
     Mountain Climbing and Faults

  7. 2 Building a Fault Tolerant Multigrid Solver
     Resilience
     • Past: Reliability of systems was a big concern for computing pioneers.
       "The problem of building reliable systems out of unreliable components did preoccupy the first generation of computing system designers - see, e.g., Von Neumann, 1956, as first generation computers were very failure prone." [Cappello et al. 2009]
     • Present: Built-in system level resilience. Hardware failure is of minor relevance for numerical simulation.
     • Future: Huge number of components in exa-scale. Algorithmic resilience will be of increasing importance for computational sciences [Dongarra et al. 2015].
     Storage of a vector of size $O(10^{13})$: 73 TBytes.
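The 73 TBytes figure can be reproduced with a short calculation. This sketch assumes 8-byte double-precision entries and binary terabytes ($2^{40}$ bytes); neither assumption is stated on the slide, and the function name is hypothetical.

```python
def vector_storage_tib(num_entries, bytes_per_entry=8):
    """Storage of a dense vector in binary terabytes (2**40 bytes).

    bytes_per_entry=8 assumes double precision (an assumption).
    """
    return num_entries * bytes_per_entry / 2**40


print(round(vector_storage_tib(10**13), 1))  # about 72.8, i.e. roughly 73
```

Under these assumptions a single solution vector is already far larger than the memory of one node, which is the reason why storing full checkpoints becomes expensive at this scale.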

  8. 2 Building a Fault Tolerant Multigrid Solver
     Problem setting and fault model
     Model problem: $-\Delta u = f$ in $\Omega$, + BC
     • Discretized by the linear FE method
     • Solved by multigrid V-cycles with standard components in the HPC framework Hierarchical Hybrid Grids [Bergen, Rüde et al. 2002, Gmeiner 2014]
     Node crash in the MG:
     • Faulty domain: $u_F$ in $\Omega_F$
     • Interface: $u_\Gamma$ on $\Gamma$
     • Intact domain: $u_I$ in $\Omega_I$

  9. 2 Building a Fault Tolerant Multigrid Solver
     No fault recovery strategy within a MG
     From almost on the top back to the checkpoint level.

  10. 2 Building a Fault Tolerant Multigrid Solver
      Comparison of local recovery strategies
      [Figure: residual plots at the 6th and 7th iteration. Panel labels: fault, no recovery, local recovery, one F-cycle. Scale: $\alpha = \log(\|\text{Residual}\|)$.]

  11. 2 Building a Fault Tolerant Multigrid Solver
      Local recovery strategy
      In case of a fault:
      • Fix interface values $u_I$ on $\Gamma$
      • Recover faulty values $u_F$ by solving $-\Delta u_F = f$ in $\Omega_F$ with $u_I$ as Dirichlet BC.
      Possibilities for local recovery: smoother, cg-iterations, multigrid cycles, direct solver...
      • Faulty domain: $u_F$ in $\Omega_F$
      • Interface: $u_\Gamma$ on $\Gamma$
      • Intact domain: $u_I$ in $\Omega_I$
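The local recovery idea can be illustrated in one dimension. The sketch below is an assumption-laden stand-in, not the HHG implementation: it uses a finite-difference discretization of $-u'' = f$ on a uniform grid, a direct tridiagonal solve (one of the options listed above) instead of multigrid cycles, and hypothetical names and index ranges. The global state is fully converged before the fault, so the recovered values match the lost ones up to rounding; in the middle of an iteration the recovered values would only be as accurate as the frozen interface data.

```python
def solve_poisson_dirichlet(f_vals, h, left, right):
    """Solve -u'' = f on consecutive interior nodes of a uniform 1D grid
    (spacing h) with Dirichlet values left/right, via the Thomas algorithm
    for the tridiagonal system 2*u[i] - u[i-1] - u[i+1] = h*h*f[i]."""
    n = len(f_vals)
    rhs = [h * h * fi for fi in f_vals]
    rhs[0] += left      # move known boundary values to the right-hand side
    rhs[-1] += right
    c = [0.0] * n       # modified super-diagonal
    d = [0.0] * n       # modified right-hand side
    c[0], d[0] = -0.5, rhs[0] / 2.0
    for i in range(1, n):
        denom = 2.0 + c[i - 1]
        c[i] = -1.0 / denom
        d[i] = (rhs[i] + d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# Hypothetical setup: 31 interior nodes on (0, 1), f = 1, zero boundary values.
n, h = 31, 1.0 / 32
f = [1.0] * n
u = solve_poisson_dirichlet(f, h, 0.0, 0.0)   # stand-in for the solver state

lo, hi = 10, 20                    # faulty nodes lo..hi-1 (the domain Omega_F)
reference = u[lo:hi]               # kept only to check the recovery
u[lo:hi] = [0.0] * (hi - lo)       # the fault: values in Omega_F are lost

# Local recovery: freeze the interface values u[lo-1], u[hi] and solve the
# Dirichlet problem on the faulty subdomain only.
u[lo:hi] = solve_poisson_dirichlet(f[lo:hi], h, u[lo - 1], u[hi])
recovery_error = max(abs(a - b) for a, b in zip(u[lo:hi], reference))
print(recovery_error)
```

Only data from the faulty subdomain and its two interface neighbours is touched, which is what makes the recovery local.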

  12. 2 Building a Fault Tolerant Multigrid Solver
      Numerical results
      Fault and local recovery ...
      • ... after 5th iteration with a perfect superman.
      • ... after 11th iteration with a perfect superman.
      Only MG cycles are efficient.

  13. 2 Building a Fault Tolerant Multigrid Solver
      Fault for the Stokes system
      Algorithmic strategy:
      • Fault in a multigrid algorithm with Uzawa-type smoother
      • Freeze velocity and pressure data at the interface
      • Locally recalculate the lost values with superman power
      [Figure: fault after the 5th (left) and 11th (right) iteration step]

  14. 2 Building a Fault Tolerant Multigrid Solver
      Optimal fault recovery strategy within a MG
      From almost on the top to the top without delay.

  15. 2 Building a Fault Tolerant Multigrid Solver
      Data structure for the recovery
      • Ghost layer primitives
      • Stencil and sub-stencil structure
      [Figure labels: tp_mr, tp_br, mp_tr, mp_mr, mp_br, bp_tr, bp_mr]

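  16. 2 Building a Fault Tolerant Multigrid Solver
      Global recovery strategies based on tearing concepts
      Basic idea: coupling via halos on lower primitives
      • Dirichlet (faulty)–Dirichlet (healthy) strategy (DD)
        $$
        \begin{pmatrix}
        A_{II} & A_{I\Gamma_I} & 0 & 0 & 0 \\
        0 & \mathrm{Id} & -\mathrm{Id} & 0 & 0 \\
        A_{\Gamma I} & 0 & A_{\Gamma\Gamma} & 0 & A_{\Gamma F} \\
        0 & 0 & -\mathrm{Id} & \mathrm{Id} & 0 \\
        0 & 0 & 0 & A_{F\Gamma_F} & A_{FF}
        \end{pmatrix}
        \begin{pmatrix} u_I \\ u_{\Gamma_I} \\ u_\Gamma \\ u_{\Gamma_F} \\ u_F \end{pmatrix}
        $$
      • Dirichlet (faulty)–Neumann (healthy) strategy (DN)
        $$
        \begin{pmatrix}
        A_{II} & A_{I\Gamma_I} & 0 & 0 & 0 & 0 \\
        A_{\Gamma_I I} & A_{\Gamma_I\Gamma_I} & \mathrm{Id} & 0 & 0 & 0 \\
        0 & \mathrm{Id} & 0 & -\mathrm{Id} & 0 & 0 \\
        A_{\Gamma I} & 0 & 0 & A_{\Gamma\Gamma} & 0 & A_{\Gamma F} \\
        0 & 0 & 0 & -\mathrm{Id} & \mathrm{Id} & 0 \\
        0 & 0 & 0 & 0 & A_{F\Gamma_F} & A_{FF}
        \end{pmatrix}
        \begin{pmatrix} u_I \\ u_{\Gamma_I} \\ \lambda_{\Gamma_I} \\ u_\Gamma \\ u_{\Gamma_F} \\ u_F \end{pmatrix}
        $$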

  17. 2 Building a Fault Tolerant Multigrid Solver
      Dirichlet-Dirichlet Recovery Strategy
      Dirichlet boundary condition on the healthy domain, Dirichlet boundary condition on the faulty domain.
      The recovery refers to the rows ("lines") of the torn system
      $$
      \begin{pmatrix}
      A_{II} & A_{I\Gamma_I} & 0 & 0 & 0 \\
      0 & \mathrm{Id} & -\mathrm{Id} & 0 & 0 \\
      A_{\Gamma I} & 0 & A_{\Gamma\Gamma} & 0 & A_{\Gamma F} \\
      0 & 0 & -\mathrm{Id} & \mathrm{Id} & 0 \\
      0 & 0 & 0 & A_{F\Gamma_F} & A_{FF}
      \end{pmatrix}
      \begin{pmatrix} u_I \\ u_{\Gamma_I} \\ u_\Gamma \\ u_{\Gamma_F} \\ u_F \end{pmatrix}
      \qquad (1)
      $$
      Alg. 1 Dirichlet-Dirichlet recovery
       1: Solve $Au = f$ by multigrid cycles.
       2: if Fault has occurred then STOP solving.
       3:   Recover boundary data $u_{\Gamma_F}$ from line 4 of (1)
       4:   Initialize $u_F$ with zero
       5:   In parallel do:
       6:     a) Use $n_F$ MG cycles accelerated by $\eta_s$ to approximate line 5 of (1):
       7:        $A_{FF} u_F = f_F - A_{F\Gamma_F} u_{\Gamma_F}$
       8:     b) Use $n_I$ MG cycles to approximate line 1 of (1):
       9:        $A_{II} u_I = f_I - A_{I\Gamma_I} u_{\Gamma_I}$
      10:   RETURN to line 1 with new values $u_I$ in $\Omega_I$ and $u_F$ in $\Omega_F$.
      11: end if
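The control flow of the Dirichlet-Dirichlet recovery can be sketched on a 1D model problem. This is a simplified illustration under several assumptions: Gauss-Seidel sweeps stand in for multigrid cycles, the discretization is a 1D finite-difference scheme for $-u'' = f$, the faulty and healthy parts are processed one after the other (they are decoupled by the frozen interface, so the order does not matter), and all names and index ranges are hypothetical.

```python
def gauss_seidel(u, f, h, lo, hi, sweeps):
    """In-place Gauss-Seidel sweeps for -u'' = f on nodes lo..hi-1.
    Values outside that range act as fixed Dirichlet data."""
    for _ in range(sweeps):
        for i in range(lo, hi):
            u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])


def residual_norm(u, f, h):
    """Max-norm residual over all interior nodes."""
    return max(abs(f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h))
               for i in range(1, len(u) - 1))


def dd_recovery(u, f, h, fault_lo, fault_hi, n_I, eta_s):
    """Dirichlet-Dirichlet recovery for lost nodes fault_lo..fault_hi-1.
    The interface nodes fault_lo-1 and fault_hi stay frozen throughout."""
    n = len(u) - 1
    for i in range(fault_lo, fault_hi):
        u[i] = 0.0                               # initialize u_F with zero
    # a) faulty subdomain: eta_s * n_I sweeps with Dirichlet interface data
    gauss_seidel(u, f, h, fault_lo, fault_hi, eta_s * n_I)
    # b) healthy subdomains: n_I sweeps, interface nodes not updated
    gauss_seidel(u, f, h, 1, fault_lo - 1, n_I)
    gauss_seidel(u, f, h, fault_hi + 1, n, n_I)


# Hypothetical setup: 17 nodes on [0, 1], zero boundary values, f = 1.
n = 16
h = 1.0 / n
f = [1.0] * (n + 1)
u = [0.0] * (n + 1)

gauss_seidel(u, f, h, 1, n, 20)                  # global solve until the fault
interface_before = (u[5], u[10])
dd_recovery(u, f, h, fault_lo=6, fault_hi=10, n_I=3, eta_s=4)
interface_after = (u[5], u[10])
gauss_seidel(u, f, h, 1, n, 1000)                # return to the global solve
print(residual_norm(u, f, h))
```

The healthy part keeps iterating with its own frozen interface data while the faulty part catches up with `eta_s` times as many sweeps, which mirrors steps 5 to 9 of Alg. 1.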

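  18. 2 Building a Fault Tolerant Multigrid Solver
      Cycle advantage factor $\kappa$
      Define
      $$\kappa := \frac{k_R - k}{k_F} \in [0, 1],$$
      where
      • $k$, $k_R$: required number of iterations
      • $n_I$: number of MG cycles on the healthy subdomain
      • $\eta_s n_I$: number of MG cycles on the faulty subdomain

      Fault at $k_F = 5$ and speedup $\eta_s = 2$

      | $n_I$ | 17% loss, DD | 17% loss, DN | 2% loss, DD | 2% loss, DN | 0.6% loss, DD | 0.6% loss, DN |
      |-------|------|------|------|------|------|------|
      | 0     | 0.80 | 0.80 | 0.80 | 0.80 | 0.80 | 0.80 |
      | 1     | 0.20 | 0.00 | 0.20 | 0.20 | 0.20 | 0.00 |
      | 2     | 0.20 | 0.00 | 0.20 | 0.00 | 0.00 | 0.00 |
      | 3     | 0.40 | 0.40 | 0.40 | 0.20 | 0.20 | 0.00 |
      | 4     | 0.60 | 0.60 | 0.60 | 0.40 | 0.40 | 0.20 |

      Fault at $k_F = 11$ and speedup $\eta_s = 5$

      | $n_I$ | 17% loss, DD | 17% loss, DN | 2% loss, DD | 2% loss, DN | 0.6% loss, DD | 0.6% loss, DN |
      |-------|------|------|------|------|------|------|
      | 0     | 0.82 | 0.82 | 0.82 | 0.82 | 0.91 | 0.91 |
      | 1     | 0.36 | 0.36 | 0.27 | 0.27 | 0.27 | 0.27 |
      | 2     | 0.09 | 0.00 | 0.09 | 0.00 | 0.00 | 0.00 |
      | 3     | 0.18 | 0.09 | 0.27 | 0.09 | 0.09 | 0.09 |
      | 4     | 0.27 | 0.18 | 0.36 | 0.18 | 0.18 | 0.18 |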
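The cycle advantage factor is a simple ratio, so a small worked example may help when reading the tables. The sketch assumes that $k$ is the iteration count of the fault-free run, $k_R$ the count of the run with fault and recovery, and $k_F$ the iteration at which the fault occurs; the iteration counts used below are hypothetical and not taken from the talk.

```python
def cycle_advantage(k_recovery, k_fault_free, k_fault):
    """kappa = (k_R - k) / k_F: extra iterations caused by the fault,
    relative to the iteration k_F at which the fault occurred."""
    return (k_recovery - k_fault_free) / k_fault


# Hypothetical iteration counts for illustration only.
print(round(cycle_advantage(19, 15, 5), 2))    # 4 extra iterations, fault at 5
print(round(cycle_advantage(24, 15, 11), 2))   # 9 extra iterations, fault at 11
print(round(cycle_advantage(15, 15, 5), 2))    # no extra iterations
```

With this reading, $\kappa = 0$ means the fault costs no additional iterations, and the tabulated values are consistent with whole numbers of extra iterations divided by $k_F$ (for example $0.80 = 4/5$ and $0.82 \approx 9/11$).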

  19. 2 Building a Fault Tolerant Multigrid Solver
      Parallel setup: 0.6% to 0.00047% information loss
      Adaptive steering: $n_F = \eta_s n_I + \Delta n_F$
      • DD and DN strategies: one failure at $k_F = 7$ with $n_I = 3$

        | Size                | No Rec     | DD, $\eta_s=1$ | DD, 2 | DD, 4 | DD, 8 | DN, $\eta_s=1$ | DN, 2 | DN, 4 | DN, 8 |
        |---------------------|------------|------|-------|-------|-------|-------|------|------|------|
        | $4.5 \cdot 10^{8}$  | 13.73 (21) | 9.14 | -0.01 | 0.00  | -0.00 | 11.47 | 2.30 | 0.02 | 0.04 |
        | $2.1 \cdot 10^{9}$  | 11.69 (20) | 9.31 | 0.04  | 0.08  | 0.11  | 9.35  | 2.41 | 0.11 | 0.14 |
        | $1.2 \cdot 10^{10}$ | 12.49 (20) | 7.42 | -0.01 | -0.02 | -0.00 | 9.96  | 2.54 | 0.06 | 0.06 |
        | $8.2 \cdot 10^{10}$ | 11.16 (19) | 5.54 | 0.08  | 0.07  | 0.04  | 8.36  | 0.11 | 0.15 | 0.17 |
        | $6.0 \cdot 10^{11}$ | 13.59 (19) | 3.47 | 0.13  | 0.19  | 0.13  | 0.13  | 0.24 | 0.29 | 0.26 |

      • DN strategy: two consecutive failures at $k_F = 5$ and $k_F = 9$ with $\eta_s = 4$

        | Size                | No Rec     | (1,2) | (1,3) | (2,2) | (2,3) |
        |---------------------|------------|-------|-------|-------|-------|
        | $4.5 \cdot 10^{8}$  | 18.35 (23) | 0.02  | 0.03  | 0.03  | 0.04  |
        | $2.1 \cdot 10^{9}$  | 16.33 (22) | 0.05  | 0.06  | 0.06  | 0.06  |
        | $1.2 \cdot 10^{10}$ | 17.43 (22) | 0.07  | 0.08  | 0.09  | 0.08  |
        | $8.2 \cdot 10^{10}$ | 16.69 (21) | 0.16  | 0.17  | 0.16  | 0.17  |
        | $6.0 \cdot 10^{11}$ | 20.64 (21) | 0.30  | 0.33  | 0.36  | 0.36  |

      Global recovery can fully compensate for a fault with respect to time-to-solution.

  20. 2 Building a Fault Tolerant Multigrid Solver
      Towards Geophysics
