frequency domain photonic simulation

  1. Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation. Cheng-Han Du*, I-Hsin Chung**, Weichung Wang*. *Institute of Applied Mathematical Sciences, National Taiwan University, Taipei, Taiwan. **IBM T. J. Watson Research Center, NY, US. 5/8/2017, GTC 2017 @ San Jose

  2. Outline: Introduction; Implementation; Numerical Results I; P2P Matrix Sharing; Numerical Results II; Summary

  3. Introduction. Photonics: waveguides, resonant cavities, frequency filters, plasmonic devices. Design concerns: structural characteristics, parameter refinement, experiment data. (Figure refs: Sun et al., Nature 528, 2015; Ivinskaya & Lavrinenko, 2011)

  4. Introduction - Why Multi-GPU Scaling. Global supercomputing trend; high energy efficiency; growing popularity in deep learning applications; integration of high-performance numerical simulation and deep learning. (Image sources: ORNL, NVIDIA)

  5. Introduction. [Diagram] Boxes: Machine-Learning-Derived Behavior Model and Intelligent Design; Photonic Integrated Circuit Design; Broadband Spectral Analysis; Nonlinear Equations with Multiphysics Features; Photonic Crystal Analyzer; Preconditioner and Algorithm for Iterative Side-Equation Solver; Shift-Inverse Eigensolver; Parallel Direct FDFD Solver Kernel.

  6. Introduction. [Diagram] Boxes: Machine-Learning-Derived Behavior Model; Photonic Integrated Circuit Design; Broadband Spectral Analysis; Nonlinear Equations with Multiphysics Features; Photonic Crystal Analyzer; Preconditioner and Algorithm for Iterative Side-Equation Solver; Shift-Inverse Eigensolver; Parallel Direct FDFD Solver Kernel. Callout: "When iterative solver fails…"

  7. Introduction. Objectives: fast generation of numerical data for different parameters; data-driven intelligent design of optical components; explicit and fast acquisition of quantitative characteristics; reduction of postprocessing and data storage/transfer requirements. Finite-Difference Frequency-Domain: Parallel Direct FDFD Solver Kernel.

  8. Outline: Introduction; Implementation; Numerical Results I; P2P Matrix Sharing; Numerical Results II; Summary

  9. Implementation. FDFD problem: linear system $-\nabla \times \nabla \times \mathbf{E} + k_0^2 \varepsilon_r \mathbf{E} = c\,\vec{J}$. Direct solver for robust solution: Yee's mesh; perfectly-matched layer; high-frequency problem. Challenge: heavy factorization loads. (Parallel Direct FDFD Solver Kernel)
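
A minimal sketch of what "direct solver" means for a frequency-domain system, reduced to one dimension. For a field transverse to x, the curl-curl operator reduces to a second derivative, so the system becomes d²E/dx² + k0² ε_r E = cJ, which is tridiagonal on a uniform grid and can be solved by LU elimination. The wavelength and grid size follow the waveguide test case in the deck; the grid length, the lossy permittivity, the zero boundary values, and the source amplitude are assumptions made only so that the toy problem is well posed. The solver in the talk is 3D and hierarchical; this is not that method.

```python
import math


def solve_tridiagonal(lower, diag, upper, rhs):
    """Directly solve a tridiagonal system by LU elimination without pivoting."""
    n = len(diag)
    d = list(diag)
    b = list(rhs)
    # Forward elimination: remove the sub-diagonal entry of each row.
    for i in range(1, n):
        factor = lower[i - 1] / d[i - 1]
        d[i] -= factor * upper[i - 1]
        b[i] -= factor * b[i - 1]
    # Back substitution.
    x = [0j] * n
    x[-1] = b[-1] / d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (b[i] - upper[i] * x[i + 1]) / d[i]
    return x


wavelength = 1.5           # micrometres (from the waveguide test case)
dx = 0.02                  # micrometres (from the waveguide test case)
n = 200                    # number of grid points (assumption)
k0 = 2 * math.pi / wavelength
eps_r = [1.0 + 0.1j] * n   # hypothetical lossy medium; the loss keeps the matrix nonsingular

# Discretise d^2E/dx^2 + k0^2 * eps_r * E = c*J with E = 0 outside the grid.
off = [1.0 / dx**2 + 0j] * (n - 1)
diag = [-2.0 / dx**2 + k0**2 * eps_r[i] for i in range(n)]
rhs = [0j] * n
rhs[n // 2] = 1.0 + 0j     # point source at the centre (hypothetical amplitude)

E = solve_tridiagonal(off, diag, off, rhs)


def residual_norm(field):
    """Largest absolute entry of A @ field - rhs."""
    worst = 0.0
    for i in range(n):
        left = field[i - 1] if i > 0 else 0j
        right = field[i + 1] if i < n - 1 else 0j
        r = (left + right) / dx**2 + diag[i] * field[i] - rhs[i]
        worst = max(worst, abs(r))
    return worst
```

Calling `residual_norm(E)` checks the solution against the original equations. In 3D the matrix is no longer tridiagonal, and the factorization cost is what the rest of the deck is about.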

  10. Implementation. Compressed hierarchical Schur method (CHiS): domain decomposition, multi-level algorithm; 3D nested dissection of Yee's mesh ($N_x \times N_y \times N_z$). Ideal periodic structure: $D_1 = D_2 = D_3 = \cdots = D_{16}$; $S_{1,1} = S_{1,2} = S_{1,3} = \cdots = S_{1,8}$; $S_{2,1} = S_{2,2} = S_{2,3} = S_{2,4}$; $S_{3,1} = S_{3,2}$; $S_{4,1}$.
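
The saving from an ideal periodic structure can be counted directly. The sketch below assumes a binary nested-dissection tree in which all blocks on the same level are identical, as in the 16-subdomain example on this slide; the function name is hypothetical.

```python
def count_blocks(num_subdomains):
    """Return (total_diagonal_blocks, unique_blocks) for a binary
    nested-dissection tree whose blocks are identical within each level."""
    if num_subdomains < 1 or num_subdomains & (num_subdomains - 1):
        raise ValueError("num_subdomains must be a power of two")
    total = 0
    unique = 0
    blocks = num_subdomains
    while blocks >= 1:
        total += blocks   # subdomains first, then separators level by level
        unique += 1       # one representative block per level
        blocks //= 2
    return total, unique


total, unique = count_blocks(16)
```

For 16 subdomains the tree holds 16 + 8 + 4 + 2 + 1 = 31 diagonal blocks, but only 5 distinct ones need to be factorized when the structure is perfectly periodic.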

  11. Implementation. Compressed hierarchical Schur method, elimination tree deduplication: diagonals; interfaces to children ($I_U$, $I_L$).

  12. Implementation. Compressed hierarchical Schur method, elimination tree deduplication: diagonals; interfaces to children.

  13. Implementation. Compressed hierarchical Schur method, Leaf-level Interface Compression (LIC): use one updating submatrix over multiple Schur complement submatrices with row/column permutations. The less sparse-matrix computation there is, the lighter the CPU-centric load.

  14. Implementation. Compressed hierarchical Schur method: expose larger chunks of matrix computation. Major function calls and libraries. Subdomains: sparse diagonal, sparse factorize; sparse interface, sparse LS solve and matrix multiply; libraries (Option 1) PARDISO, Sparse BLAS, or (Option 2) MUMPS. Separators: dense diagonal, dense LU; packed dense interface, dense LS solve and matrix multiply; libraries BLAS (ZGEMM) and LAPACK (ZGETRF, ZGETRS), with hardware acceleration (GPU: cuBLAS, cuSolver, etc.).

  15. Implementation. GPU acceleration, considerations: multi-GPU scaling in a single node (scale-up); no longer solely based on nested dissection; asynchronous streams for small submatrices; overlapping some computation kernels; hardware scheduling; threaded GPU controls; thread affinity.

  16. Implementation. GPU acceleration: factorize all diagonal blocks $S_{i,j}$ related to level $i$ (CPU or GPU work).

  17. Implementation. GPU acceleration: asynchronously send some blocks to the GPU and perform $S_{i,j}^{-1} I_U$.

  18. Implementation. GPU acceleration: continue to ZGEMM with no D2H data transmission; $S_{i,j}^{-1} I_U$ is kept in the GPU for the $I_L S_{i,j}^{-1} I_U$ operation later. The workspace is simply discarded if no longer needed.

  19. Implementation. GPU acceleration: asynchronously perform ZGEMM $I_L (S_{i,j}^{-1} I_U)$.

  20. Implementation. GPU acceleration: collect $I_L (S_{i,j}^{-1} I_U)$ from all GPUs and perform the higher-level Schur update on the CPU.

  21. Implementation. GPU acceleration: continue with more ZGEMM related to $(S_{i,j}^{-1} I_U)$ and $I_L (S_{i,j}^{-1} I_U)$ Schur updates…
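
The sequence on slides 16 to 21 (factorize a diagonal block, solve against the upper interface block, multiply by the lower interface block, subtract from the parent) can be sketched on tiny dense matrices. This is a CPU-only illustration in plain Python, assuming dense blocks small enough to handle as nested lists; the helper names are hypothetical, and the GPU streams, transfers, and library calls of the actual implementation are not modelled.

```python
def lu_solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting
    (the role a dense LU factorize-and-solve pair plays for separators)."""
    n = len(a)
    m = len(b[0])
    # Work on an augmented copy so the inputs are left untouched.
    aug = [[complex(v) for v in a[i]] + [complex(v) for v in b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + m):
                aug[r][c] -= f * aug[col][c]
    x = [[0j] * m for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for j in range(m):
            s = aug[i][n + j] - sum(aug[i][k] * x[k][j] for k in range(i + 1, n))
            x[i][j] = s / aug[i][i]
    return x


def matmul(a, b):
    """Dense matrix product (the role ZGEMM plays)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def schur_update(parent, lower_iface, diag_block, upper_iface):
    """Return parent - lower_iface @ (diag_block^{-1} @ upper_iface)."""
    solved = lu_solve(diag_block, upper_iface)   # diag_block^{-1} @ upper_iface
    product = matmul(lower_iface, solved)        # lower_iface @ (...)
    return [[parent[i][j] - product[i][j] for j in range(len(parent[0]))]
            for i in range(len(parent))]


diag_block = [[2, 0], [0, 4]]
upper_iface = [[2], [4]]
lower_iface = [[1, 1]]
parent = [[5]]
updated = schur_update(parent, lower_iface, diag_block, upper_iface)
```

Here the solve gives a column of ones, the product is 2, and the parent entry drops from 5 to 3. Keeping the solved block in place between the solve and the multiply is what avoids the intermediate device-to-host transfer described on slide 18.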

  22. Implementation. GPU acceleration, workload balance for multi-GPU: distribute $I_U$ blocks by parent levels; tackle extreme cases with lots of duplicates; minor increase in H2D transfer.

  23. Implementation. GPU acceleration, workload balance for multi-GPU: panel $I_U$; each $I_U$ column should be large enough; multiple $I_L$ copies sent to GPUs; moderate increase in H2D transfer.
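
One way to picture the panel strategy is as a split of an interface block's columns into contiguous ranges, one per GPU, with a floor on panel width so that no GPU receives a piece too small to be worth the transfer. The function, its parameters, and the width threshold below are hypothetical; the deck does not state how panel sizes are chosen.

```python
def split_into_panels(num_columns, num_gpus, min_panel_width):
    """Split columns into contiguous (start, stop) panels, at most one per GPU,
    using fewer GPUs when the block is too narrow for useful panels."""
    if num_columns <= 0 or num_gpus <= 0 or min_panel_width <= 0:
        raise ValueError("all arguments must be positive")
    panels = max(1, min(num_gpus, num_columns // min_panel_width))
    base, extra = divmod(num_columns, panels)
    ranges = []
    start = 0
    for p in range(panels):
        width = base + (1 if p < extra else 0)  # spread the remainder evenly
        ranges.append((start, start + width))
        start += width
    return ranges


wide = split_into_panels(1000, 4, 128)
narrow = split_into_panels(300, 4, 128)
```

With these made-up numbers, a 1000-column block is shared by all four GPUs in panels of 250 columns, while a 300-column block is split across only two. Each GPU that receives a panel also needs its own copy of the matrix it multiplies by, which is the source of the extra H2D traffic noted on the slide.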

  24. Implementation. GPU acceleration without workload balance: finishing time > 325 seconds.

  25. Implementation. GPU acceleration with workload balance: finishing time < 250 seconds.

  26. Outline: Introduction; Implementation; Numerical Results I; P2P Matrix Sharing; Numerical Results II; Summary

  27. Numerical Results I. Hardware specifications. Brillante: CPU 2 × Intel E5-2670 v3 (12 + 12 cores used); memory 256 GB; GPU 2 × K40; software Intel Parallel Studio 2016 update 1, Intel PARDISO, CUDA 7.5. P8Exp: CPU 2 × IBM Power8 (8 + 8 cores used); memory 1 TB; GPU 4 × K80; software IBM ESSL and Parallel ESSL, IBM XL Fortran and XL C Compiler, MUMPS 5.0.1, CUDA 7.5.

  28. Numerical Results I. SOI dielectric waveguide: total grids 79 × 319 × 39, matrix dimension 2,948,517; wavelength 1.5 μm; grid size 0.02 μm; 100 GB RAM.
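
The stated matrix dimension is consistent with three unknowns per grid cell, which is what one would expect if each Yee cell carries three electric-field components. That interpretation is an inference from the numbers, not something the slide states; the function name is hypothetical.

```python
def fdfd_unknowns(nx, ny, nz, components=3):
    """Number of unknowns for an nx x ny x nz grid with `components`
    field components per cell (three is an assumption)."""
    return nx * ny * nz * components


waveguide_unknowns = fdfd_unknowns(79, 319, 39)
```

The same arithmetic reproduces the 6,404,925 unknowns quoted for the 79 × 575 × 47 air hole filter case later in the deck.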

  29. Numerical Results I. Brillante, 2 × K40. ZGETRS + ZGEMM: 439.3 seconds (90% of overall time).

  30. Numerical Results I. Brillante, 2 × K40. Naïve GPU acceleration yields good speedup due to high arithmetic intensity (AI). "Scatter" time includes D2H transfer.

  31. Numerical Results I. Brillante, 2 × K40. Async streams apply to low-level separators, which finish in seconds even in CPU-only mode.

  32. Numerical Results I. Brillante, 2 × K40. Workload balance yields better speedup and multi-GPU scaling.

  33. Numerical Results I. P8Exp, 4 × K80 with autoboost: good performance scaling in the quad-K80 server; higher performance with half-K80 computing; two threads compete for the bandwidth of a single PCI-E link when using the full K80.

  34. Numerical Results I. P8Exp, 4 × K80 with autoboost. AccTRSMM multi-GPU scaling: increased H2D transfer due to multiple $I_L$ copies sent to work-sharing GPUs; scaling performance is still acceptable.

  35. Numerical Results I. Periodic air hole wavelength filter: no propagation at $\lambda_0 = 1.5$ μm; total grids 79 × 575 × 47, matrix dimension 6,404,925; 188 GB RAM.

  36. Numerical Results I. Brillante, 2 × K40.

  37. Numerical Results I. P8Exp, 4 × K80 with autoboost.

  38. Numerical Results I. P8Exp, GPU scaling of AccTRSMM: much more dense matrix operations; good scaling in multi-GPU systems.

