Data-driven time parallelism and model reduction
Kevin Carlberg (Sandia National Laboratories)
Lukas Brencher, Bernard Haasdonk, Andrea Barth (University of Stuttgart)
SIAM Conference on UQ, April 7, 2016
Model reduction and UQ at Sandia
CFD model with high simulation costs: 100 million cells; 6 weeks on 5000 cores; 200,000 time steps; 6 runs max out Cielo.
Barrier: 'in the field' fast-turnaround applications, e.g., Bayesian inference and stochastic optimization.
Cavity-flow problem
Unsteady Navier–Stokes, DES turbulence model
Re = 6.3 × 10^6, M_∞ = 0.6
1.2 million degrees of freedom
CFD code: AERO-F [Farhat et al., 2003]
GNAT model [C et al., 2011, C et al., 2013]
Sample mesh: 4.1% of nodes, 3.0% of cells
+ Small problem size: can run on many fewer cores
GNAT performance (vorticity and pressure fields: GNAT ROM vs. FOM)
FOM: 5 hours × 48 CPUs. GNAT ROM: 32 minutes × 2 CPUs.
+ 229× CPU-hour savings: good for many-query analyses.
- Only 9.4× wall-time savings: inadequate for real time. Why?
GNAT: strong scaling (Ahmed body) [C, 2011]
(Plots: (a) CPU-hour savings (CPU × T_FOM)/(CPU × T_ROM) and (b) wall-time savings T_FOM/T_ROM, each vs. number of CPUs.)
+ Significant CPU-hour savings (max: 438× at 4 CPUs)
- Modest wall-time savings (max: 7× at 12 CPUs)
Spatial parallelism is quickly saturated!
Time-parallel algorithms [Lions et al., 2001a, Farhat and Chandesris, 2003]
Goal: expose more parallelism to reduce wall time.
(Diagram: time domain partitioned into M̄ coarse intervals T_0 < T_1 < ... < T_M̄, with coarse step H and fine step h within each interval.)
Fine propagator, time step h: F(x; τ_1, τ_2)
Coarse propagator, time step H: G(x; τ_1, τ_2)
Parareal iteration k (sequential and parallel steps):
x_{k+1}^{m+1} = G(x_{k+1}^m; T_m, T_{m+1}) + F(x_k^m; T_m, T_{m+1}) − G(x_k^m; T_m, T_{m+1})
Interpretations [Gander and Vandewalle, 2007, Falgout et al., 2014]:
Deferred/residual-correction scheme B(x_{k+1}) = B(x_k) − A(x_k)
Multiple-shooting method with finite-difference Jacobian approximation
Two-level multigrid
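A minimal sketch of this iteration in Python/NumPy may help fix ideas. Here `fine` and `coarse` are placeholder callables standing in for F and G (assumed names, not the authors' code), and a real implementation would evaluate the fine sweep concurrently across intervals:

```python
import numpy as np  # states may be NumPy arrays; +/- act elementwise

def parareal(x0, fine, coarse, T, n_iter):
    """Parareal sketch: propagate x0 across interval boundaries T[0..Mbar]."""
    Mbar = len(T) - 1
    # Iteration 0: sequential coarse sweep seeds all interval endpoints.
    x = [x0]
    for m in range(Mbar):
        x.append(coarse(x[m], T[m], T[m + 1]))
    for _ in range(n_iter):
        # Parallelizable step: fine and coarse runs from the old iterate
        # (G on the old iterate is recomputed here for clarity; real
        # implementations cache it from the previous sweep).
        F = [fine(x[m], T[m], T[m + 1]) for m in range(Mbar)]
        G = [coarse(x[m], T[m], T[m + 1]) for m in range(Mbar)]
        # Sequential correction sweep: x_{k+1}^{m+1} = G_new + F_old - G_old.
        x_new = [x0]
        for m in range(Mbar):
            x_new.append(coarse(x_new[m], T[m], T[m + 1]) + F[m] - G[m])
        x = x_new
    return x
```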
Parareal: sequential and parallel steps [Lions et al., 2001a]
(Plots: state variable vs. time step over the four stages below.)
Iteration 0 (sequential): x_0^{m+1} = G(x_0^m; T_m, T_{m+1}); (parallel): compute F(x_0^m; T_m, T_{m+1}).
Iteration 1 (parallel): compute F(x_1^m; T_m, T_{m+1}); (sequential): x_1^{m+1} = F(x_0^m; T_m, T_{m+1}) + G(x_1^m; T_m, T_{m+1}) − G(x_0^m; T_m, T_{m+1}).
Coarse propagator
Critical: the coarse propagator should be fast, accurate, and stable.
Existing coarse propagators:
Same integrator [Lions et al., 2001b, Bal and Maday, 2002]
Coarse spatial discretization [Fischer et al., 2005, Farhat et al., 2006, Cortial and Farhat, 2009]
Simplified physics model [Baffico et al., 2002, Maday and Turinici, 2003, Blouza et al., 2011, Engblom, 2009, Maday, 2007]
Relaxed solver tolerance [Guibert and Tromeur-Dervout, 2007]
Reduced-order model, built on the fly [Farhat et al., 2006, Cortial and Farhat, 2009, Ruprecht and Krause, 2012, Chen et al., 2014]
ROM context: can we leverage offline data to improve the coarse propagator?
Model reduction
Full-order model (FOM): ẋ(t, µ) = f(x; t, µ), x(0, µ) = x_0(µ)
Offline: snapshot collection X_i := [x(0, µ_i) ··· x(t_M, µ_i)] ∈ R^{N×M}, with [X_1 ··· X_{n_train}] = U Σ V^T
Online: projection
Trial subspace Φ = [u_1 ··· u_N̂] ∈ R^{N×N̂}, x ≈ x̃(t, µ) = Φ x̂(t, µ)
Test subspace Ψ ∈ R^{N×N̂}: Ψ = Φ (Galerkin), or Ψ = (α_0 I − Δt β_0 ∂f/∂x) Φ (LSPG [C et al., 2015a])
ROM: dx̂/dt (t, µ) = (Ψ^T Φ)^{−1} Ψ^T f(Φ x̂; t, µ), x̂(0, µ) = Φ^T x_0(µ)
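As a concrete illustration, a minimal POD/Galerkin sketch in NumPy (function and variable names are assumptions, not the authors' code):

```python
import numpy as np

def pod_basis(snapshots, n_modes):
    """POD: left singular vectors of [X_1 ... X_{n_train}] give the trial
    basis Phi; the rows of Vt carry the temporal content reused later."""
    U, S, Vt = np.linalg.svd(snapshots, full_matrices=False)
    return U[:, :n_modes], S, Vt

def galerkin_velocity(f, Phi, x_hat, t, mu):
    """Galerkin ROM velocity: with Psi = Phi and Phi orthonormal,
    (Psi^T Phi)^{-1} Psi^T f(Phi x_hat) reduces to Phi^T f(Phi x_hat)."""
    return Phi.T @ f(Phi @ x_hat, t, mu)
```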
Revisit the SVD
[X_1 X_2 X_3] = U Σ V^T
(Plot: the first row of V^T traces the time evolution of x̂_1 over the n training trajectories of M time steps each.)
The j-th row of V^T contains a basis for the time evolution of x̂_j.
Construct Ξ_j, a global time-evolution basis for x̂_j:
Ξ_j := [ξ_j^1 ··· ξ_j^{n_train}], ξ_j^i := [v_{M(i−1)+1, j} ··· v_{Mi, j}]^T
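A sketch of this construction, continuing the NumPy notation above (array shapes are assumptions inferred from the slide's definitions):

```python
import numpy as np

def temporal_basis(Vt, j, M, n_train):
    """Global time-evolution basis Xi_j for reduced coordinate j: each
    column restricts the j-th right singular vector to one training
    trajectory of M time steps."""
    cols = [Vt[j, M * i : M * (i + 1)] for i in range(n_train)]
    return np.column_stack(cols)  # shape (M, n_train)
```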
First attempt [C et al., 2015b]
1. Compute a global forecast by gappy POD in the time domain:
(Plot: x̂_1 computed so far; memory α = 4; forecast from the temporal basis.)
z_j = arg min_{z ∈ R^{a_j}} ‖Z(m−1) Ξ_j z − Z(m−1) g(x̂_j)‖_2
Time sampling: Z(k) := [e_{k−β} ··· e_k]^T
Time unrolling: g(x̂_j) : x̂_j ↦ [x̂_j(t_0) ··· x̂_j(t_M)]^T
2. Use e_m^T Ξ_j z_j as the initial guess for x̂_j(t_m) in the Newton solver
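A minimal sketch of the gappy-POD fit for one coordinate: a least-squares fit to the sampled entries, then extrapolation through the basis (names and the generic sample index set are assumptions):

```python
import numpy as np

def gappy_forecast(Xi_j, observed, sample_idx):
    """Fit basis coefficients z to the time samples observed so far
    (rows sample_idx of Xi_j), then forecast the full trajectory."""
    z, *_ = np.linalg.lstsq(Xi_j[sample_idx, :], observed, rcond=None)
    return Xi_j @ z  # forecast of x̂_j at every time step
```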
First attempt: structural dynamics [C et al., 2015b]
(Plots: Newton-iteration reduction and speedup improvement vs. memory α.)
+ Newton iterations reduced by up to ~2×
+ Speedup improved by up to ~1.5×
+ No accuracy loss
+ Applicable to any nonlinear ROM
- Insufficient for real-time computation
Can we apply the same idea to the coarse propagator?
Coarse propagator via local forecasting
Offline: construct local time-evolution bases Ξ_j^m
(Plot: time domain for x̂_1 partitioned into intervals with bases Ξ_1^1, Ξ_1^2, Ξ_1^3, Ξ_1^4, Ξ_1^5.)
Online: coarse propagator G_j^m defined via forecasting:
1. Compute α time steps with the fine propagator
2. Compute the local forecast via gappy POD
3. Select the last time step of the local forecast
G_j^m : (x̂_j; T_m, T_{m+1}) ↦ e_{H/h}^T Ξ_j^m (Z(α+1) Ξ_j^m)^+ [F(x̂_j; T_m, T_m + h) ··· F(x̂_j; T_m, T_m + hα)]^T
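A per-coordinate sketch of this coarse propagator (here `fine_step` advances a single reduced coordinate purely for illustration; in the method the fine propagator advances the full reduced state and the gappy fit is applied coordinate-wise):

```python
import numpy as np

def coarse_forecast(x0_j, Xi_m, fine_step, alpha, n_fine):
    """Local-forecast coarse propagator for one reduced coordinate:
    Xi_m has one row per fine step in the coarse interval (n_fine = H/h)."""
    samples, x = [], x0_j
    for _ in range(alpha):            # 1. alpha steps with the fine propagator
        x = fine_step(x)
        samples.append(x)
    # 2. gappy-POD fit of the local basis to the sampled fine steps
    z, *_ = np.linalg.lstsq(Xi_m[:alpha, :], np.asarray(samples), rcond=None)
    return Xi_m[n_fine - 1, :] @ z    # 3. forecast at the end of the interval
```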
Initial seed
x_{k+1}^{m+1} = G(x_{k+1}^m; T_m, T_{m+1}) + F(x_k^m; T_m, T_{m+1}) − G(x_k^m; T_m, T_{m+1})
How to compute the initial seed x_0^m, m = 0, ..., M̄?
1. Typical time integrator
2. Local forecast
3. Global forecast
Ideal-conditions speedup
Theorem. If g(x̂_j) ∈ range(Ξ_j) for j = 1, ..., N̂, then the proposed method converges in one parareal iteration and realizes a theoretical speedup of
M̄ / ( M̄(M̄ − 1) α / M + 1 ).
(Plot: ideal-conditions speedup vs. number of processors M̄ for M = 5000 and α ∈ {1, 2, 4, 8, 12}.)
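A small helper evaluating this bound (assuming the formula as written above):

```python
def ideal_speedup(M, M_bar, alpha):
    """Theorem's ideal-conditions speedup: M̄ / (M̄(M̄−1)α/M + 1)."""
    return M_bar / (M_bar * (M_bar - 1) * alpha / M + 1)

print(ideal_speedup(5000, 35, 1))  # ≈ 28.3
```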
Ideal-conditions speedup with initial guesses
Corollary. If f is nonlinear, g(x̂_j) ∈ range(Ξ_j) for j = 1, ..., N̂, and the forecasting method also provides Newton-solver initial guesses, then
1. the method converges in one parareal iteration, and
2. only α nonlinear systems of algebraic equations are solved in each time interval.
The method then realizes a theoretical speedup of
M / ( M̄ α + (M/M̄ − α) τ_r )
relative to the sequential algorithm without forecasting. Here, τ_r = (residual-computation time) / (nonlinear-system solution time).
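And the corollary's bound, again assuming the formula as written:

```python
def ideal_speedup_with_guesses(M, M_bar, alpha, tau_r):
    """Corollary's speedup with forecast-provided Newton initial guesses:
    M / (M̄·α + (M/M̄ − α)·τ_r)."""
    return M / (M_bar * alpha + (M / M_bar - alpha) * tau_r)

print(ideal_speedup_with_guesses(5000, 35, 1, 0.1))  # ≈ 101.7
```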
Ideal-conditions speedup with initial guesses
(Plot: speedup vs. number of processors M̄ for M = 5000, τ_r = 1/10, and α ∈ {1, 2, 4, 8, 12}.)
Significant speedups possible by leveraging time-domain data!
Stability
Theorem. If the fine propagator is stable, i.e.,
‖F(x; τ, τ + H)‖ ≤ (1 + C_F H) ‖x‖ for all 0 ≤ τ ≤ τ + H,
then the proposed method is also stable, i.e.,
‖x̂_{k+1}^m‖ ≤ C_m exp(C_F mH) ‖x̂^0‖,
where
C_m := Σ_{k=1}^m (m choose k) β_k γ^m α^k (H/h)^{m−k},
β_k := exp(−C_F k (H − hα)) ≤ 1,
γ := max_{m,j} max( 1/‖Z(α+1) Ξ_j^m‖, 1/σ_min(Z(α+1) Ξ_j^m) ).
Example: inviscid Burgers' equation [Rewienski, 2003]
∂u(x, τ)/∂τ + (1/2) ∂(u²(x, τ))/∂x = 0.02 e^{µ₂ x}
u(0, τ) = µ₁ for all τ ∈ [0, 25]; u(x, 0) = 1 for all x ∈ [0, 100]
Discretization: Godunov's scheme
(µ₁, µ₂) ∈ [2.5, 3.5] × [0.02, 0.03]
h = 0.1, M = 250 fine time steps
FOM: N = 500 degrees of freedom
ROM: LSPG [C et al., 2011], POD basis dimension N̂ = 100
n_train = 4 training points (LHS sampling); random online point
2 coarse propagators: backward Euler and local forecast
3 initial seeds: backward Euler, local forecast, global forecast
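For concreteness, a sketch of the Godunov spatial discretization of this FOM (an explicit update is used here purely for illustration; the slide specifies backward Euler only for one coarse propagator, so the fine integrator is an assumption):

```python
import numpy as np

def godunov_flux(uL, uR):
    """Godunov numerical flux for Burgers' flux f(u) = u^2 / 2."""
    if uL <= uR:                                  # rarefaction: minimize f
        return 0.0 if uL <= 0.0 <= uR else min(uL**2, uR**2) / 2.0
    return max(uL**2, uR**2) / 2.0                # shock: maximize f

def burgers_step(u, x, dx, dt, mu1, mu2):
    """One explicit Godunov update of the parameterized Burgers FOM
    with inflow boundary u(0, tau) = mu1 and source 0.02 * exp(mu2 * x)."""
    u_left = np.concatenate(([mu1], u[:-1]))
    f_in = np.array([godunov_flux(a, b) for a, b in zip(u_left, u)])
    f_out = np.append(f_in[1:], godunov_flux(u[-1], u[-1]))  # outflow BC
    return u - dt / dx * (f_out - f_in) + dt * 0.02 * np.exp(mu2 * x)
```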