Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation
Cheng-Han Du*, I-Hsin Chung**, Weichung Wang*
*Institute of Applied Mathematical Sciences, National Taiwan University, Taipei, Taiwan
**IBM T. J. Watson Research Center, NY, US
GTC 2017 @ San Jose, 5/8/2017
Outline
- Introduction
- Implementation
- Numerical Results I
- P2P Matrix Sharing
- Numerical Results II
- Summary
Introduction
- Photonics: waveguides, resonant cavities, frequency filters, plasmonic devices (Ref: Sun et al., Nature 528, 2015)
- Design concerns: structural characteristics, parameter refinement, experiment data (Ref: Ivinskaya & Lavrinenko, 2011)
Introduction - Why Multi-GPU Scaling
- Global supercomputing trend (image sources: ORNL, NVIDIA)
- High energy efficiency
- Growing popularity in deep learning applications
- Integration of high-performance numerical simulation and deep learning
Introduction
Project software stack (diagram): Machine-Learning-Derived Behavior Model and Intelligent Design; Photonic Integrated Circuit Design; Nonlinear Equations with Multiphysics; Broadband Spectral Analysis Features; Photonic Crystal Analyzer; Shift-Inverse Eigensolver; Preconditioner and Algorithm for Iterative Side-Equation Solver; Parallel Direct FDFD Solver Kernel
When the iterative solver fails, the Parallel Direct FDFD Solver Kernel provides the robust fallback.
Introduction - Objectives
- Fast generation of numerical data for different parameters
- Data-driven intelligent design of optical components
- Explicit and fast acquisition of quantitative characteristics
- Reduction of postprocessing and data storage/transfer requirements
Focus: the Parallel Direct FDFD Solver Kernel for Finite-Difference Frequency-Domain (FDFD) simulation
Outline
- Introduction
- Implementation
- Numerical Results I
- P2P Matrix Sharing
- Numerical Results II
- Summary
Implementation - FDFD Problem
- Linear system from the frequency-domain wave equation: $-\nabla \times \nabla \times \vec{E} + k_0^2\,\varepsilon_r \vec{E} = \vec{c}$, where the source term $\vec{c}$ involves the current density $\vec{J}$ and $k_0$
- Direct solver for robust solution: Yee's mesh, perfectly matched layer, high-frequency problem
- Challenge: heavy factorization loads
- Parallel Direct FDFD Solver Kernel
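As a reading aid (an assumption on my part, not taken from the slide): after Yee discretization with complex PML stretching, the equation becomes a large sparse, complex, non-Hermitian linear system, schematically
\[
  A\,\vec{e} = \vec{b}, \qquad A = -\,C_H C_E + k_0^2\,\operatorname{diag}(\varepsilon_r),
\]
where $C_E$ and $C_H$ denote the discrete curl operators on the Yee mesh. Factorizing $A$ robustly is the heavy load the direct solver kernel is built for.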
Implementation - Compressed Hierarchical Schur method (CHiS)
- Domain decomposition, multi-level algorithm
- 3D nested dissection of Yee's mesh ($N_x \times N_y \times N_z$)
- Ideal periodic structure:
  $E_1 = E_2 = E_3 = \cdots = E_{16}$
  $T_{1,1} = T_{1,2} = T_{1,3} = \cdots = T_{1,8}$
  $T_{2,1} = T_{2,2} = T_{2,3} = T_{2,4}$
  $T_{3,1} = T_{3,2}$
  $T_{4,1}$
Implementation - Compressed Hierarchical Schur method
- Elimination tree deduplication (diagram): diagonal blocks and interfaces to children, with interface blocks $J_V$ and $J_M$
Implementation - Compressed Hierarchical Schur method
- Elimination tree deduplication (diagram, continued): duplicated diagonal blocks and interfaces to children
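A minimal sketch of how such deduplication might be bookkept: identical blocks share one factorization through a cache keyed by a hypothetical structural signature. This is illustrative only, not the CHiS implementation.

```cpp
#include <complex>
#include <cstdint>
#include <functional>
#include <map>
#include <memory>
#include <vector>

// Packed LU factors and pivots of one diagonal block T_{j,k} (placeholder).
struct BlockFactor {
    std::vector<std::complex<double>> lu;
    std::vector<int> ipiv;
};

// Hypothetical key under which identical subdomains/separators of an ideal
// periodic structure collide (e.g., a hash of stencil pattern and materials).
using BlockKey = std::uint64_t;

// Deduplication cache: a block is factorized the first time its key is seen;
// every later duplicate simply reuses the stored factors.
class FactorCache {
public:
    std::shared_ptr<const BlockFactor>
    getOrFactorize(BlockKey key, const std::function<BlockFactor()>& factorizeOnce) {
        auto it = cache_.find(key);
        if (it != cache_.end()) return it->second;               // duplicate: reuse
        auto f = std::make_shared<const BlockFactor>(factorizeOnce());
        cache_.emplace(key, f);
        return f;                                                // unique: factor once
    }
private:
    std::map<BlockKey, std::shared_ptr<const BlockFactor>> cache_;
};
```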
Implementation - Compressed Hierarchical Schur method
- Leaf-level Interface Compression (LIC): reuse one updating submatrix for multiple Schur-complement submatrices via row/column permutations
- Less sparse-matrix computation means less CPU-centric load
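As I read the slide, the LIC reuse could look like the following sketch: one dense update matrix is scatter-applied to several Schur-complement blocks through per-target row/column permutations. All names and the in-memory layout here are assumptions for illustration.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// One shared dense update U (n x n, column-major) is reused for several
// Schur-complement blocks S_t; only the row/column permutations differ per
// target, so U is computed and stored once.
void applySharedUpdate(const std::vector<cplx>& U, int n,
                       const std::vector<std::vector<int>>& rowPerm,  // per target
                       const std::vector<std::vector<int>>& colPerm,  // per target
                       std::vector<std::vector<cplx>>& S,             // per target
                       int ldS) {
    for (std::size_t t = 0; t < S.size(); ++t) {
        const auto& rp = rowPerm[t];
        const auto& cp = colPerm[t];
        // Scatter-subtract: S_t(rp[i], cp[j]) -= U(i, j).
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < n; ++i)
                S[t][rp[i] + static_cast<std::size_t>(cp[j]) * ldS] -=
                    U[i + static_cast<std::size_t>(j) * n];
    }
}
```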
Implementation - Compressed Hierarchical Schur method
- Expose larger chunks of matrix computation
- Major function calls and libraries:
  - Subdomains (sparse diagonal: sparse factorization; sparse interface: sparse LS solve and matrix multiply): (Option 1) PARDISO + Sparse BLAS, or (Option 2) MUMPS
  - Separators (dense diagonal: dense LU; packed dense interface: dense LS solve and matrix multiply): BLAS (ZGEMM) and LAPACK (ZGETRF, ZGETRS)
  - Hardware acceleration: GPU libraries (cuBLAS, cuSolver, etc.)
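A minimal sketch of how the dense separator work maps onto cuSOLVER/cuSOLVER's ZGETRF/ZGETRS and cuBLAS ZGEMM, assuming the blocks are already on the device in column-major cuDoubleComplex storage. The function name, argument layout, and omission of error checks are illustrative, not the presenters' API.

```cpp
#include <cublas_v2.h>
#include <cusolverDn.h>
#include <cuComplex.h>
#include <cuda_runtime.h>

void denseSeparatorUpdate(cusolverDnHandle_t sol, cublasHandle_t blas,
                          cuDoubleComplex *dT,  int n,      /* diagonal block T (n x n)     */
                          cuDoubleComplex *dJv, int nrhs,   /* interface block J_V (n x nrhs) */
                          const cuDoubleComplex *dJm, int m,/* J_M (m x n)                   */
                          cuDoubleComplex *dS /* update buffer (m x nrhs) */) {
    int lwork = 0, *dInfo = NULL, *dIpiv = NULL;
    cuDoubleComplex *dWork = NULL;
    cudaMalloc((void**)&dInfo, sizeof(int));
    cudaMalloc((void**)&dIpiv, n * sizeof(int));

    /* Dense diagonal: LU factorization of T (ZGETRF), in place. */
    cusolverDnZgetrf_bufferSize(sol, n, n, dT, n, &lwork);
    cudaMalloc((void**)&dWork, lwork * sizeof(cuDoubleComplex));
    cusolverDnZgetrf(sol, n, n, dT, n, dWork, dIpiv, dInfo);

    /* Dense interface: solve T X = J_V in place (ZGETRS), i.e. X = T^{-1} J_V. */
    cusolverDnZgetrs(sol, CUBLAS_OP_N, n, nrhs, dT, n, dIpiv, dJv, n, dInfo);

    /* Update: S = J_M * (T^{-1} J_V) via ZGEMM; the caller accumulates S into
       the parent-level Schur complement. */
    const cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);
    cublasZgemm(blas, CUBLAS_OP_N, CUBLAS_OP_N, m, nrhs, n,
                &one, dJm, m, dJv, n, &zero, dS, m);

    cudaFree(dWork); cudaFree(dIpiv); cudaFree(dInfo);
}
```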
Implementation - GPU acceleration
- Considerations: multi-GPU scaling in a single node (scale-up); no longer based solely on nested dissection
- Asynchronous streams for small submatrices; overlapping some computation kernels
- Hardware scheduling: threaded GPU controls, thread affinity
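A minimal sketch of the "threaded GPU controls" idea, assuming OpenMP on the host: one worker thread per GPU, each with private streams and a cuBLAS handle so small submatrix jobs can be issued asynchronously and overlapped. Thread-core affinity is assumed to come from the runtime (e.g., OMP_PROC_BIND/OMP_PLACES); this is illustrative, not the actual scheduler.

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <omp.h>

#define STREAMS_PER_GPU 4

void launchPerGpuWorkers(int ngpu) {
    #pragma omp parallel num_threads(ngpu)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);                      /* bind this host thread to one GPU */

        cudaStream_t streams[STREAMS_PER_GPU];
        cublasHandle_t blas;
        cublasCreate(&blas);
        for (int s = 0; s < STREAMS_PER_GPU; ++s)
            cudaStreamCreate(&streams[s]);

        /* Small submatrix jobs for this GPU would be issued here, rotating
           over streams so H2D copies and ZGEMM kernels overlap:
             cublasSetStream(blas, streams[job % STREAMS_PER_GPU]);
             cudaMemcpyAsync(...); cublasZgemm(...);                      */

        for (int s = 0; s < STREAMS_PER_GPU; ++s)
            cudaStreamDestroy(streams[s]);
        cublasDestroy(blas);
    }
}
```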
Implementation - GPU acceleration
Factorize all diagonal blocks $T_{j,k}$ related to level $j$ (CPU or GPU work).
Implementation - GPU acceleration
Asynchronously send some blocks to the GPU and perform $T_{j,k}^{-1} J_V$.
Implementation - GPU acceleration
Continue to ZGEMM with no D2H data transfer: $T_{j,k}^{-1} J_V$ is kept on the GPU for the later $J_M (T_{j,k}^{-1} J_V)$ operation. The workspace is simply discarded when no longer needed.
Implementation - GPU acceleration
Asynchronously perform the ZGEMM $J_M (T_{j,k}^{-1} J_V)$.
Implementation - GPU acceleration
Collect $J_M (T_{j,k}^{-1} J_V)$ from all GPUs and perform the higher-level Schur update on the CPU.
Implementation - GPU acceleration
Continue with more ZGEMMs related to $T_{j,k}^{-1} J_V$ and $J_M (T_{j,k}^{-1} J_V)$, and with further Schur updates…
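Putting the preceding steps together, a sketch of the per-block data flow on one GPU, assuming pinned host buffers, a prefactorized $T_{j,k}$ (LU factors and pivots already resident on the device), and stream-bound cuSOLVER/cuBLAS handles. Names and argument layout are hypothetical; only the final product goes back to the host, matching the "no D2H for the workspace" point above.

```cpp
#include <cublas_v2.h>
#include <cusolverDn.h>
#include <cuComplex.h>
#include <cuda_runtime.h>

void scheduleInterfaceUpdate(cusolverDnHandle_t sol, cublasHandle_t blas,
                             cudaStream_t stream,
                             const cuDoubleComplex *hJv, cuDoubleComplex *dJv,
                             int n, int nrhs,
                             const cuDoubleComplex *dLU, const int *dIpiv, int *dInfo,
                             const cuDoubleComplex *dJm, int m,
                             cuDoubleComplex *dProd, cuDoubleComplex *hProd) {
    cusolverDnSetStream(sol, stream);
    cublasSetStream(blas, stream);

    /* Asynchronously send the J_V block to the GPU. */
    cudaMemcpyAsync(dJv, hJv, sizeof(cuDoubleComplex) * n * nrhs,
                    cudaMemcpyHostToDevice, stream);

    /* X = T_{j,k}^{-1} J_V via ZGETRS; the result overwrites dJv and stays on
       the GPU -- no D2H transfer of this workspace. */
    cusolverDnZgetrs(sol, CUBLAS_OP_N, n, nrhs, dLU, n, dIpiv, dJv, n, dInfo);

    /* ZGEMM: J_M * (T^{-1} J_V) on the same stream. */
    const cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);
    cublasZgemm(blas, CUBLAS_OP_N, CUBLAS_OP_N, m, nrhs, n,
                &one, dJm, m, dJv, n, &zero, dProd, m);

    /* Copy only the product back; the CPU later collects these from all GPUs
       and applies the higher-level Schur update. */
    cudaMemcpyAsync(hProd, dProd, sizeof(cuDoubleComplex) * m * nrhs,
                    cudaMemcpyDeviceToHost, stream);
}
```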
Implementation - GPU acceleration
Workload balance for multi-GPU:
- Distribute $J_V$ blocks by parent levels
- Tackles extreme cases with lots of duplicates
- Minor increase in H2D transfer
Implementation - GPU acceleration
Workload balance for multi-GPU:
- Split $J_V$ into column panels; each $J_V$ column panel should be large enough
- Multiple $J_M$ copies sent to GPUs
- Moderate increase in H2D transfer
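One possible reading of the panel-based workload balance, sketched as a greedy longest-processing-time assignment of $J_V$ column panels to the least-loaded GPU. The Panel struct, the cost estimate, and the heuristic itself are assumptions for illustration; the presenters' scheduler may differ. Splitting $J_V$ across GPUs is what forces the extra $J_M$ copies mentioned above.

```cpp
#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct Panel { int block; int firstCol; int numCols; double estCost; };

std::vector<std::vector<Panel>> assignPanels(const std::vector<Panel>& panels, int ngpu) {
    std::vector<std::vector<Panel>> plan(ngpu);

    // Min-heap of (accumulated cost, gpu id): always feed the least-loaded GPU.
    using Load = std::pair<double, int>;
    std::priority_queue<Load, std::vector<Load>, std::greater<Load>> heap;
    for (int g = 0; g < ngpu; ++g) heap.push({0.0, g});

    // Largest panels first gives a tighter greedy bound (LPT heuristic).
    std::vector<Panel> sorted = panels;
    std::sort(sorted.begin(), sorted.end(),
              [](const Panel& a, const Panel& b) { return a.estCost > b.estCost; });

    for (const Panel& p : sorted) {
        auto [cost, g] = heap.top(); heap.pop();
        plan[g].push_back(p);
        heap.push({cost + p.estCost, g});
    }
    return plan;
}
```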
Implementation - GPU acceleration
Without workload balance: finishing time > 325 seconds.
Implementation - GPU acceleration
With workload balance: finishing time < 250 seconds.
Outline
- Introduction
- Implementation
- Numerical Results I
- P2P Matrix Sharing
- Numerical Results II
- Summary
Numerical Results I - Hardware specifications
          Brillante                                  | P8Exp
CPU:      2 × Intel E5-2670 v3 (12 + 12 cores used)  | 2 × IBM POWER8 (8 + 8 cores used)
Memory:   256 GB                                     | 1 TB
GPU:      2 × K40                                    | 4 × K80
Software: Intel Parallel Studio 2016 Update 1, Intel PARDISO, CUDA 7.5 | IBM ESSL and Parallel ESSL, IBM XL Fortran and XL C Compiler, MUMPS 5.0.1, CUDA 7.5
Numerical Results I
SOI dielectric waveguide
- Total grids: 79 × 319 × 39; matrix dimension 2,948,517 (three field unknowns per grid point: 3 × 79 × 319 × 39)
- Wavelength: 1.5 μm; grid size: 0.02 μm
- 100 GB RAM
Numerical Results I
Brillante: 2 × K40
ZGETRS + ZGEMM: 439.3 seconds (90% of overall time)
Numerical Results I
Brillante: 2 × K40
Naïve GPU acceleration yields good speedup due to high arithmetic intensity. "Scatter" time includes the D2H transfer.
Numerical Results I
Brillante: 2 × K40
Async streams apply to low-level separators, which finish in 31 seconds even in CPU-only mode.
Numerical Results I
Brillante: 2 × K40
Workload balance yields better speedup and multi-GPU scaling.
Numerical Results I
P8Exp: 4 × K80 with autoboost
- Good performance scaling in the quad-K80 server
- Higher performance with half-K80 computing
- Two threads compete for a single PCI-E link's bandwidth when using full K80s
Numerical Results I
P8Exp: 4 × K80 with autoboost
AccTRSMM: multi-GPU scaling
- Increased H2D transfer due to multiple $J_M$ copies sent to work-sharing GPUs
- Scaling performance is still acceptable
Numerical Results I
Periodic air-hole wavelength filter
- No propagation at $\lambda_0 = 1.5$ μm
- Total grids: 79 × 575 × 47; matrix dimension 6,404,925 (3 × 79 × 575 × 47)
- 188 GB RAM
Numerical Results I
Brillante: 2 × K40 (results figure)
Numerical Results I
P8Exp: 4 × K80 with autoboost (results figure)
Numerical Results I
P8Exp: GPU scaling of AccTRSMM
- Many more dense matrix operations
- Good scaling in multi-GPU systems