Case Studies in Using a DSL and Task Graphs for Portable Reacting Flow Simulations
James C. Sutherland, Associate Professor, Chemical Engineering
Tony Saad, Assistant Professor, Chemical Engineering
Acknowledgments
Babak Goshayeshi (Research Staff)
Christopher Earl (Postdoctoral Researcher, now at LLNL)
Ph.D. and M.S. students: Abhishek Bagusetty, Mike Hansen, Devin Robison, Josh McConnell, Michael Brown
Awards: DE-NA0002375, DE-NA-000740, XPS award 1337145, DE-SC0008998
Nebo (E)DSL: “Matlab for PDEs on Supercomputers”
$$\mathrm{rhs} = -\frac{\partial}{\partial x}\left(J_x + C_x\right) - \frac{\partial}{\partial y}\left(J_y + C_y\right) - \frac{\partial}{\partial z}\left(J_z + C_z\right)$$
Field & stencil operations:
    rhs <<= -divOpX( xConvFlux + xDiffFlux )
            -divOpY( yConvFlux + yDiffFlux )
            -divOpZ( zConvFlux + zDiffFlux );
Can “chain” stencil operations where necessary.
• Stencils: >150 natively supported stencil operations (easily extensible)
• cond: “vectorized if”
• Arbitrary composition of operations
• Masked assignment (perform operations on a defined subset of points)
• Portable: same code works for CPU, multicore, GPU execution
• Embedded in C++ → “plays well with others”
Auto-generates code for efficient execution on CPU, GPU, Xeon Phi, etc. during compilation.
[Figure: expressiveness vs. efficiency, locating the DSL relative to C++ and Matlab.]
Earl, C., Might, M., Bagusetty, A., & Sutherland, J. C., Journal of Systems and Software (2016).
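The single Nebo statement above fuses the flux addition and the divergence stencil. As a rough illustration of the arithmetic in one direction only, a hand-written loop might look like the sketch below. This is not Nebo's API or its generated code: the names `divergence_x` and `rhs_x`, the uniform 1-D grid, and the face/cell storage layout (n+1 face fluxes, n cell values) are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: finite-volume divergence on a uniform 1-D grid.
// face_flux holds n+1 face values; the result holds n cell-centered values.
std::vector<double> divergence_x(const std::vector<double>& face_flux, double dx) {
    assert(face_flux.size() >= 2 && dx > 0.0);
    std::vector<double> div(face_flux.size() - 1);
    for (std::size_t i = 0; i < div.size(); ++i) {
        div[i] = (face_flux[i + 1] - face_flux[i]) / dx;
    }
    return div;
}

// rhs = -d/dx (J_x + C_x), written as explicit loops (assumed layout, see above).
std::vector<double> rhs_x(const std::vector<double>& diff_flux,
                          const std::vector<double>& conv_flux, double dx) {
    assert(diff_flux.size() == conv_flux.size());
    std::vector<double> total(diff_flux.size());
    for (std::size_t i = 0; i < total.size(); ++i) {
        total[i] = diff_flux[i] + conv_flux[i];  // add the two face fluxes
    }
    std::vector<double> rhs = divergence_x(total, dx);
    for (double& r : rhs) r = -r;  // apply the leading minus sign
    return rhs;
}
```

The point of the DSL is that the one-line assignment replaces loops like these and can be retargeted to CPU or GPU without rewriting them.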
The Power of Task Graphs
Register all expressions
• Each “expression” calculates one or more field quantities.
• Each expression advertises its direct dependencies.
Set a “root” expression; construct a graph
• All dependencies are discovered/resolved automatically.
• Highly localized influence of changes in models.
• Not all expressions in the registry may be relevant/used.
From the graph:
• Deduce storage requirements & allocate memory (externally to each expression).
• Automatically schedule evaluation, ensuring proper ordering.
• Robust scheduling algorithms are key.
[Figure: an expression registry and the graph built from it for $\Gamma = \Gamma(T, p, y_i)$, distinguishing direct (expressed) dependencies from indirect (discovered) dependencies; nodes include $\Gamma$, $p$, $y_i$, $T$, $\rho$, $u$, $\tau$, $s_\phi$, $\phi$.]
*Notz, Pawlowski, & Sutherland (2012). ACM Transactions on Mathematical Software, 39(1).
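The register / set-root / schedule steps above can be sketched with a depth-first traversal. This is a minimal illustration, not the framework's implementation: the `Registry` type and `schedule` function are hypothetical names, and real schedulers also handle memory allocation and parallel execution, which are omitted here.

```cpp
#include <functional>
#include <map>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical registry: each expression name maps to its direct dependencies.
using Registry = std::map<std::string, std::vector<std::string>>;

// Returns an evaluation order for `root` in which every expression appears
// after all of its direct and indirect dependencies. Registered expressions
// that the root does not reach are left out of the order.
std::vector<std::string> schedule(const Registry& registry, const std::string& root) {
    std::vector<std::string> order;
    std::set<std::string> done;      // fully processed expressions
    std::set<std::string> visiting;  // expressions on the current path (cycle check)

    std::function<void(const std::string&)> visit = [&](const std::string& name) {
        if (done.count(name)) return;
        if (!visiting.insert(name).second)
            throw std::runtime_error("cyclic dependency at " + name);
        auto it = registry.find(name);
        if (it == registry.end())
            throw std::runtime_error("unregistered expression: " + name);
        for (const std::string& dep : it->second) visit(dep);  // resolve dependencies first
        visiting.erase(name);
        done.insert(name);
        order.push_back(name);
    };
    visit(root);
    return order;
}
```

Calling `schedule(registry, "Gamma")` on a registry where `Gamma` depends on `T`, `p`, and `y` returns those names before `Gamma`, and silently skips any registered expression that `Gamma` never reaches.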
Changes in model form are naturally handled
Pure substance heat flux:
$$q = -\lambda \nabla T$$
[Figure: dependency graph in which $q$ depends on $\lambda$ and $T$.]
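Changes in model form are naturally handled
Multi-species mixture heat flux:
$$q = -\lambda \nabla T + \sum_{i=1}^{n} h_i J_i$$
[Figure: dependency graph in which $q$ depends on $\lambda$, $T$, $h_1 \ldots h_n$, and $J_1 \ldots J_n$, with $y_1 \ldots y_n$ further upstream.]
No complex logic changes in the code are needed when models are added or changed.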
“Modifiers”: injecting new dependencies
Motivation:
• Boundary conditions: modify a subset of the computed values.
• Multiphase coupling: add source terms to RHS of equations.
Modifiers allow “push” rather than “pull” dependency addition.
Modifiers are deployed after the node they are attached to, and are provided a handle to the field just computed.
Modifiers can introduce new dependencies to the graph.
[Figure: graph with nodes A, B, and C; modifiers BC1 and S1 attached to nodes; new dependency nodes D, E, and F introduced by the modifiers.]
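The “push” behavior described above can be sketched as a node that runs its own computation and then hands the freshly computed field to each attached modifier. The `Node` class and `attach_modifier` method are hypothetical names for illustration, not the framework's API. The sketch also does not model graph bookkeeping: in a real graph, a field read by a modifier would be registered as a new dependency.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

using Field = std::vector<double>;

// Hypothetical graph node: computes a field, then applies attached modifiers.
class Node {
public:
    explicit Node(std::function<void(Field&)> compute) : compute_(std::move(compute)) {}

    // "Push" a modifier onto this node; the node's own code is not edited.
    void attach_modifier(std::function<void(Field&)> modifier) {
        modifiers_.push_back(std::move(modifier));
    }

    // Evaluate the node, then run each modifier in attachment order,
    // handing it the field that was just computed.
    void evaluate(Field& field) const {
        compute_(field);
        for (const auto& modify : modifiers_) modify(field);
    }

private:
    std::function<void(Field&)> compute_;
    std::vector<std::function<void(Field&)>> modifiers_;
};
```

A source-term modifier would add to every entry of the field, while a boundary-condition modifier would overwrite only the boundary entries; both are attached from outside the node.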
Example: PoKiTT (Portable Kinetics, Thermodynamics & Transport)
$$\rho \frac{\partial y_i}{\partial t} = -\nabla \cdot J_i + s_i$$
$$\rho \frac{\partial h}{\partial t} = -\nabla \cdot q$$
• Detailed kinetics
• Mixture-averaged transport
• Detailed thermodynamics
Triple flame computed on GPU with PoKiTT:
• 32 PDEs
• 256^2 grid points
• 8 million timesteps
• 8 days on 1 GPU (~5 months on 1 CPU core)
[Figure: bar chart of speedup (axis 0 to 30) for 12 cores and for GPU at 256^2, 512^2, and 1024^2 grid points; bar labels as extracted: 2.4, 5, 5, 18.2, 27, 30.]
Yonkee & Sutherland, SIAM Journal on Scientific Computing (2016)
Titan: Hybrid Low Mach Algorithm
Everything on GPU except Poisson solve on CPU.
[Figure: Weak scaling. Mean time per timestep (0.01 s to 100 s, log axis) vs. number of GPUs (1, 2, 8, 64, 512, 4096, 8192, 12800; also # Titan nodes, 1 GPU per Titan node), with curves for 16^3, 32^3, 64^3, and 128^3.]
Saad, T., & Sutherland, J. C., Journal of Computational Science (2016)
Titan: Hybrid Low Mach Algorithm
[Figure, left panel: Weak scaling. Mean time per timestep (0.01 s to 100 s, log axis) vs. number of GPUs (1 to 12800; also # Titan nodes, 1 GPU per Titan node), with curves for 16^3, 32^3, 64^3, and 128^3.]
[Figure, right panel: GPU speedup. Speedup (CPU/GPU), 0X to 2X, vs. number of CPUs/GPUs (1 to 12800; also # Titan nodes, 1 MPI rank per Titan node), with curves for 16^3, 32^3, 64^3, and 128^3.]
Saad, T., & Sutherland, J. C., Journal of Computational Science (2016)
Titan: Compressible Algorithm
[Figure, left panel: Weak scaling. Mean time per timestep (0.01 s to 10 s, log axis) vs. number of GPUs (1, 8, 512, 8192, 18252; also # Titan nodes, 1 GPU per Titan node), with curves for 16^3, 32^3, 64^3, and 128^3.]
[Figure, right panel: GPU speedup. Speedup (CPU/GPU), 0.1X to 100X (log axis), vs. number of CPUs (1, 8, 512, 8192, 18252; also # Titan nodes, 1 MPI rank per Titan node), with curves for 16^3, 32^3, 64^3, and 128^3.]
Saad, T., & Sutherland, J. C., Journal of Computational Science (2016)
What next?
[Figure: GPU speedup (CPU/GPU) for 16^3, 32^3, 64^3, and 128^3. Low-Mach panel: 0X to 2X. Compressible panel: 0.1X to 100X.]
Wait for linear solvers to get us to many-GPU systems?
• Even when these arrive, it puts a lot of demand on black-box linear solvers to achieve scalability & performance.
Consider alternative algorithms?
Institute for Clean and Secure Energy, The University of Utah
Point-implicit algorithms:
• High arithmetic intensity
• Communication patterns are the same as explicit codes (ghost/halo updates)
• Well-suited for reacting flow calculations.
$$\left(I - \Delta\sigma\,\frac{\partial h}{\partial u}\right)\frac{\Delta u}{\Delta\sigma} = h(u)$$
where $h(u)$ is the local residual and $\partial h/\partial u$ is the local Jacobian matrix.
Computational kernel:
- Residual (right-hand side) evaluation
- Pointwise Jacobian evaluation
- Local linear solves
- Local eigenvalue decompositions
Matrix assembly must be efficient and extensible to complex, multiphysics problems.
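To make the “local linear solve” kernel concrete, the sketch below performs one point-implicit update at a single grid point. It assumes the update has the form $(I - \Delta\sigma\,\partial h/\partial u)\,\Delta u = \Delta\sigma\,h(u)$, which is one reading of the slide's equation. `solve_dense` and `point_implicit_step` are hypothetical names, and the dense Gaussian elimination stands in for whatever small-matrix solver the actual code uses.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // small dense matrix, row-major

// Solve A x = b by Gaussian elimination with partial pivoting.
// A and b are taken by value because they are overwritten.
Vec solve_dense(Mat A, Vec b) {
    const std::size_t n = b.size();
    for (std::size_t k = 0; k < n; ++k) {
        std::size_t p = k;  // row with the largest pivot candidate in column k
        for (std::size_t i = k + 1; i < n; ++i)
            if (std::fabs(A[i][k]) > std::fabs(A[p][k])) p = i;
        assert(A[p][k] != 0.0 && "singular local matrix");
        std::swap(A[k], A[p]);
        std::swap(b[k], b[p]);
        for (std::size_t i = k + 1; i < n; ++i) {
            const double m = A[i][k] / A[k][k];
            for (std::size_t j = k; j < n; ++j) A[i][j] -= m * A[k][j];
            b[i] -= m * b[k];
        }
    }
    Vec x(n);
    for (std::size_t i = n; i-- > 0;) {  // back substitution
        double s = b[i];
        for (std::size_t j = i + 1; j < n; ++j) s -= A[i][j] * x[j];
        x[i] = s / A[i][i];
    }
    return x;
}

// One point-implicit update at a single grid point (assumed form):
// solve (I - dsigma * J) * du = dsigma * h, then return u + du.
Vec point_implicit_step(const Vec& u, const Vec& h, const Mat& J, double dsigma) {
    const std::size_t n = u.size();
    Mat A(n, Vec(n, 0.0));
    Vec rhs(n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j)
            A[i][j] = (i == j ? 1.0 : 0.0) - dsigma * J[i][j];
        rhs[i] = dsigma * h[i];
    }
    const Vec du = solve_dense(A, rhs);
    Vec next(n);
    for (std::size_t i = 0; i < n; ++i) next[i] = u[i] + du[i];
    return next;
}
```

Because each grid point solves its own small system, the work per point is high and no communication is needed beyond the usual ghost/halo exchange for the residual.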
Example: Highly nonlinear, parameterized ODE systems
Right-hand side: $K + Q + T$, where $K$ is the kinetics source terms, $Q$ is convective heat transfer, and $T$ is mixing/flow.
Jacobian:
$$\left(\frac{\partial K}{\partial V} + \frac{\partial Q}{\partial V}\right)\frac{\partial V}{\partial U} - \frac{1}{\tau} I$$
with $\partial K/\partial V$ a full matrix (dense submatrix), $\partial Q/\partial V$ a 1-element matrix (sparse), $\partial V/\partial U$ a 2N-element matrix (sparse), and $\frac{1}{\tau} I$ a scalar matrix.
C++ code:
    ( dKdV + dqdV ) * dVdU - invT
• Detailed chemical kinetics
  - Analytical Jacobian in PoKiTT w/ Nebo for GPU
  - Dense matrix formed w/ primitives and sparse transformation
• Simple convective heat transfer
  - Single-element Jacobian combined with sparse transform
• Finite mixing time
  - Scalar Jacobian matrix
[Figure: GPU speedup for a 16x16 matrix (axis 0 to 30) at 16^3, 32^3, and 64^3, for four operations: dot product, MatVec, Ax=b, and eigen-decomposition.]
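The Jacobian expression above can be evaluated term by term. The sketch below does so with plain dense storage, purely to show the arithmetic: the function name `assemble_jacobian` and the representation of the one-element heat-transfer term as a (row, column, value) triple are assumptions, and the sparse matrix types used in the actual code are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mat = std::vector<std::vector<double>>;  // small dense matrix, row-major

// Hypothetical dense evaluation of (dKdV + dqdV) * dVdU - (1/tau) * I.
// dqdV is represented by its single nonzero entry at (q_row, q_col).
Mat assemble_jacobian(const Mat& dKdV, std::size_t q_row, std::size_t q_col,
                      double dqdV_value, const Mat& dVdU, double tau) {
    const std::size_t n = dKdV.size();
    assert(dVdU.size() == n && q_row < n && q_col < n && tau > 0.0);

    Mat sum = dKdV;                   // dense kinetics block
    sum[q_row][q_col] += dqdV_value;  // add the one-element heat-transfer term

    Mat jac(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const double s = sum[i][k];
            if (s == 0.0) continue;  // skip zero entries of the summed block
            for (std::size_t j = 0; j < n; ++j) jac[i][j] += s * dVdU[k][j];
        }
    }
    for (std::size_t i = 0; i < n; ++i) jac[i][i] -= 1.0 / tau;  // scalar mixing term
    return jac;
}
```

Storing each term in its natural form (dense, single-element, sparse, scalar) and combining them with overloaded operators is what lets the one-line C++ expression on the slide stay readable while avoiding dense work on structurally sparse terms.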
Conclusions
Robust abstractions are needed to facilitate portable & performant applications on upcoming architectures.
• DAG-based software design allows flexibility needed for multiphysics codes on heterogeneous platforms.
• (E)-DSLs can provide convenient, portable & performant abstractions for HPC applications.
The Algorithm-Hardware collision:
• Scalable GPU linear solvers are needed for traditional algorithms to be viable on new architectures.
• Alternative algorithms may be needed with higher arithmetic intensity
  • higher-order
  • point-implicit?