Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation
Cheng-Han Du*, I-Hsin Chung**, Weichung Wang*
*Institute of Applied Mathematical Sciences, National Taiwan University, Taipei, Taiwan
**IBM T. J. Watson Research Center, NY, US
GTC 2017 @ San Jose, 5/8/2017
Outline
- Introduction
- Implementation
- Numerical Results I
- P2P Matrix Sharing
- Numerical Results II
- Summary
Introduction
- Photonics: waveguides, resonant cavities, frequency filters, plasmonic devices (Ref: Sun et al., Nature 528, 2015)
- Design concerns: structural characteristics, parameter refinement, experiment data (Ref: Ivinskaya & Lavrinenko, 2011)
Introduction - Why Multi-GPU Scaling
- Global supercomputing trend (image sources: ORNL, NVIDIA)
- High energy efficiency
- Growing popularity in deep learning applications
- Integration of high-performance numerical simulation and deep learning
Introduction
Project software stack (diagram): Machine-Learning-Derived Behavior Model and Intelligent Design; Photonic Integrated Circuit Design; Nonlinear Equations with Multiphysics; Broadband Spectral Analysis Features; Photonic Crystal Analyzer; Shift-Inverse Eigensolver; Preconditioner and Algorithm for Iterative Side-Equation Solver; Parallel Direct FDFD Solver Kernel
When the iterative solver fails, the Parallel Direct FDFD Solver Kernel provides the robust fallback.
Introduction - Objectives
- Fast generation of numerical data for different parameters
- Data-driven intelligent design of optical components
- Explicit and fast acquisition of quantitative characteristics
- Reduction of postprocessing and data storage/transfer requirements
Focus: the Parallel Direct FDFD Solver Kernel for Finite-Difference Frequency-Domain (FDFD) simulation
Outline
- Introduction
- Implementation
- Numerical Results I
- P2P Matrix Sharing
- Numerical Results II
- Summary
Implementation - FDFD Problem
- Linear system from the frequency-domain wave equation: $-\nabla \times \nabla \times \vec{E} + k_0^2\,\varepsilon_r \vec{E} = \vec{c}$, where the source term $\vec{c}$ involves the current density $\vec{J}$ and $k_0$
- Direct solver for robust solution: Yee's mesh, perfectly matched layer, high-frequency problem
- Challenge: heavy factorization loads
- Parallel Direct FDFD Solver Kernel
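As a reading aid (an assumption on my part, not taken from the slide): after Yee discretization with complex PML stretching, the equation becomes a large sparse, complex, non-Hermitian linear system, schematically
\[
  A\,\vec{e} = \vec{b}, \qquad A = -\,C_H C_E + k_0^2\,\operatorname{diag}(\varepsilon_r),
\]
where $C_E$ and $C_H$ denote the discrete curl operators on the Yee mesh. Factorizing $A$ robustly is the heavy load the direct solver kernel is built for.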
Implementation - Compressed Hierarchical Schur method (CHiS)
- Domain decomposition, multi-level algorithm
- 3D nested dissection of Yee's mesh ($N_x \times N_y \times N_z$)
- Ideal periodic structure:
  $E_1 = E_2 = E_3 = \cdots = E_{16}$
  $T_{1,1} = T_{1,2} = T_{1,3} = \cdots = T_{1,8}$
  $T_{2,1} = T_{2,2} = T_{2,3} = T_{2,4}$
  $T_{3,1} = T_{3,2}$
  $T_{4,1}$
Implementation - Compressed Hierarchical Schur method
- Elimination tree deduplication (diagram): diagonal blocks and interfaces to children, with interface blocks $J_V$ and $J_M$
Implementation - Compressed Hierarchical Schur method
- Elimination tree deduplication (diagram, continued): duplicated diagonal blocks and interfaces to children
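A minimal sketch of how such deduplication might be bookkept: identical blocks share one factorization through a cache keyed by a hypothetical structural signature. This is illustrative only, not the CHiS implementation.

```cpp
#include <complex>
#include <cstdint>
#include <functional>
#include <map>
#include <memory>
#include <vector>

// Packed LU factors and pivots of one diagonal block T_{j,k} (placeholder).
struct BlockFactor {
    std::vector<std::complex<double>> lu;
    std::vector<int> ipiv;
};

// Hypothetical key under which identical subdomains/separators of an ideal
// periodic structure collide (e.g., a hash of stencil pattern and materials).
using BlockKey = std::uint64_t;

// Deduplication cache: a block is factorized the first time its key is seen;
// every later duplicate simply reuses the stored factors.
class FactorCache {
public:
    std::shared_ptr<const BlockFactor>
    getOrFactorize(BlockKey key, const std::function<BlockFactor()>& factorizeOnce) {
        auto it = cache_.find(key);
        if (it != cache_.end()) return it->second;               // duplicate: reuse
        auto f = std::make_shared<const BlockFactor>(factorizeOnce());
        cache_.emplace(key, f);
        return f;                                                // unique: factor once
    }
private:
    std::map<BlockKey, std::shared_ptr<const BlockFactor>> cache_;
};
```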
Implementation - Compressed Hierarchical Schur method
- Leaf-level Interface Compression (LIC): reuse one updating submatrix for multiple Schur-complement submatrices via row/column permutations
- Less sparse-matrix computation means less CPU-centric load
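As I read the slide, the LIC reuse could look like the following sketch: one dense update matrix is scatter-applied to several Schur-complement blocks through per-target row/column permutations. All names and the in-memory layout here are assumptions for illustration.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// One shared dense update U (n x n, column-major) is reused for several
// Schur-complement blocks S_t; only the row/column permutations differ per
// target, so U is computed and stored once.
void applySharedUpdate(const std::vector<cplx>& U, int n,
                       const std::vector<std::vector<int>>& rowPerm,  // per target
                       const std::vector<std::vector<int>>& colPerm,  // per target
                       std::vector<std::vector<cplx>>& S,             // per target
                       int ldS) {
    for (std::size_t t = 0; t < S.size(); ++t) {
        const auto& rp = rowPerm[t];
        const auto& cp = colPerm[t];
        // Scatter-subtract: S_t(rp[i], cp[j]) -= U(i, j).
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < n; ++i)
                S[t][rp[i] + static_cast<std::size_t>(cp[j]) * ldS] -=
                    U[i + static_cast<std::size_t>(j) * n];
    }
}
```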
Implementation - Compressed Hierarchical Schur method
- Expose larger chunks of matrix computation
- Major function calls and libraries:
  - Subdomains (sparse diagonal: sparse factorization; sparse interface: sparse LS solve and matrix multiply): (Option 1) PARDISO + Sparse BLAS, or (Option 2) MUMPS
  - Separators (dense diagonal: dense LU; packed dense interface: dense LS solve and matrix multiply): BLAS (ZGEMM) and LAPACK (ZGETRF, ZGETRS)
  - Hardware acceleration: GPU libraries (cuBLAS, cuSolver, etc.)
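A minimal sketch of how the dense separator work maps onto cuSOLVER/cuSOLVER's ZGETRF/ZGETRS and cuBLAS ZGEMM, assuming the blocks are already on the device in column-major cuDoubleComplex storage. The function name, argument layout, and omission of error checks are illustrative, not the presenters' API.

```cpp
#include <cublas_v2.h>
#include <cusolverDn.h>
#include <cuComplex.h>
#include <cuda_runtime.h>

void denseSeparatorUpdate(cusolverDnHandle_t sol, cublasHandle_t blas,
                          cuDoubleComplex *dT,  int n,      /* diagonal block T (n x n)     */
                          cuDoubleComplex *dJv, int nrhs,   /* interface block J_V (n x nrhs) */
                          const cuDoubleComplex *dJm, int m,/* J_M (m x n)                   */
                          cuDoubleComplex *dS /* update buffer (m x nrhs) */) {
    int lwork = 0, *dInfo = NULL, *dIpiv = NULL;
    cuDoubleComplex *dWork = NULL;
    cudaMalloc((void**)&dInfo, sizeof(int));
    cudaMalloc((void**)&dIpiv, n * sizeof(int));

    /* Dense diagonal: LU factorization of T (ZGETRF), in place. */
    cusolverDnZgetrf_bufferSize(sol, n, n, dT, n, &lwork);
    cudaMalloc((void**)&dWork, lwork * sizeof(cuDoubleComplex));
    cusolverDnZgetrf(sol, n, n, dT, n, dWork, dIpiv, dInfo);

    /* Dense interface: solve T X = J_V in place (ZGETRS), i.e. X = T^{-1} J_V. */
    cusolverDnZgetrs(sol, CUBLAS_OP_N, n, nrhs, dT, n, dIpiv, dJv, n, dInfo);

    /* Update: S = J_M * (T^{-1} J_V) via ZGEMM; the caller accumulates S into
       the parent-level Schur complement. */
    const cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);
    cublasZgemm(blas, CUBLAS_OP_N, CUBLAS_OP_N, m, nrhs, n,
                &one, dJm, m, dJv, n, &zero, dS, m);

    cudaFree(dWork); cudaFree(dIpiv); cudaFree(dInfo);
}
```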
Implementation - GPU acceleration
- Considerations: multi-GPU scaling in a single node (scale-up); no longer based solely on nested dissection
- Asynchronous streams for small submatrices; overlapping some computation kernels
- Hardware scheduling: threaded GPU controls, thread affinity
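A minimal sketch of the "threaded GPU controls" idea, assuming OpenMP on the host: one worker thread per GPU, each with private streams and a cuBLAS handle so small submatrix jobs can be issued asynchronously and overlapped. Thread-core affinity is assumed to come from the runtime (e.g., OMP_PROC_BIND/OMP_PLACES); this is illustrative, not the actual scheduler.

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <omp.h>

#define STREAMS_PER_GPU 4

void launchPerGpuWorkers(int ngpu) {
    #pragma omp parallel num_threads(ngpu)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);                      /* bind this host thread to one GPU */

        cudaStream_t streams[STREAMS_PER_GPU];
        cublasHandle_t blas;
        cublasCreate(&blas);
        for (int s = 0; s < STREAMS_PER_GPU; ++s)
            cudaStreamCreate(&streams[s]);

        /* Small submatrix jobs for this GPU would be issued here, rotating
           over streams so H2D copies and ZGEMM kernels overlap:
             cublasSetStream(blas, streams[job % STREAMS_PER_GPU]);
             cudaMemcpyAsync(...); cublasZgemm(...);                      */

        for (int s = 0; s < STREAMS_PER_GPU; ++s)
            cudaStreamDestroy(streams[s]);
        cublasDestroy(blas);
    }
}
```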
Implementation - GPU acceleration
Factorize all diagonal blocks $T_{j,k}$ related to level $j$ (CPU or GPU work).
Implementation - GPU acceleration
Asynchronously send some blocks to the GPU and perform $T_{j,k}^{-1} J_V$.
Implementation - GPU acceleration
Continue to ZGEMM with no D2H data transfer: $T_{j,k}^{-1} J_V$ is kept on the GPU for the later $J_M (T_{j,k}^{-1} J_V)$ operation. The workspace is simply discarded when no longer needed.
Implementation - GPU acceleration
Asynchronously perform the ZGEMM $J_M (T_{j,k}^{-1} J_V)$.
Implementation - GPU acceleration
Collect $J_M (T_{j,k}^{-1} J_V)$ from all GPUs and perform the higher-level Schur update on the CPU.
Implementation - GPU acceleration
Continue with more ZGEMMs related to $T_{j,k}^{-1} J_V$ and $J_M (T_{j,k}^{-1} J_V)$, and with further Schur updates…
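Putting the preceding steps together, a sketch of the per-block data flow on one GPU, assuming pinned host buffers, a prefactorized $T_{j,k}$ (LU factors and pivots already resident on the device), and stream-bound cuSOLVER/cuBLAS handles. Names and argument layout are hypothetical; only the final product goes back to the host, matching the "no D2H for the workspace" point above.

```cpp
#include <cublas_v2.h>
#include <cusolverDn.h>
#include <cuComplex.h>
#include <cuda_runtime.h>

void scheduleInterfaceUpdate(cusolverDnHandle_t sol, cublasHandle_t blas,
                             cudaStream_t stream,
                             const cuDoubleComplex *hJv, cuDoubleComplex *dJv,
                             int n, int nrhs,
                             const cuDoubleComplex *dLU, const int *dIpiv, int *dInfo,
                             const cuDoubleComplex *dJm, int m,
                             cuDoubleComplex *dProd, cuDoubleComplex *hProd) {
    cusolverDnSetStream(sol, stream);
    cublasSetStream(blas, stream);

    /* Asynchronously send the J_V block to the GPU. */
    cudaMemcpyAsync(dJv, hJv, sizeof(cuDoubleComplex) * n * nrhs,
                    cudaMemcpyHostToDevice, stream);

    /* X = T_{j,k}^{-1} J_V via ZGETRS; the result overwrites dJv and stays on
       the GPU -- no D2H transfer of this workspace. */
    cusolverDnZgetrs(sol, CUBLAS_OP_N, n, nrhs, dLU, n, dIpiv, dJv, n, dInfo);

    /* ZGEMM: J_M * (T^{-1} J_V) on the same stream. */
    const cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);
    cublasZgemm(blas, CUBLAS_OP_N, CUBLAS_OP_N, m, nrhs, n,
                &one, dJm, m, dJv, n, &zero, dProd, m);

    /* Copy only the product back; the CPU later collects these from all GPUs
       and applies the higher-level Schur update. */
    cudaMemcpyAsync(hProd, dProd, sizeof(cuDoubleComplex) * m * nrhs,
                    cudaMemcpyDeviceToHost, stream);
}
```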
Implementation - GPU acceleration
Workload balance for multi-GPU:
- Distribute $J_V$ blocks by parent levels
- Tackles extreme cases with lots of duplicates
- Minor increase in H2D transfer
Implementation - GPU acceleration
Workload balance for multi-GPU:
- Split $J_V$ into column panels; each $J_V$ column panel should be large enough
- Multiple $J_M$ copies sent to GPUs
- Moderate increase in H2D transfer
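One possible reading of the panel-based workload balance, sketched as a greedy longest-processing-time assignment of $J_V$ column panels to the least-loaded GPU. The Panel struct, the cost estimate, and the heuristic itself are assumptions for illustration; the presenters' scheduler may differ. Splitting $J_V$ across GPUs is what forces the extra $J_M$ copies mentioned above.

```cpp
#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct Panel { int block; int firstCol; int numCols; double estCost; };

std::vector<std::vector<Panel>> assignPanels(const std::vector<Panel>& panels, int ngpu) {
    std::vector<std::vector<Panel>> plan(ngpu);

    // Min-heap of (accumulated cost, gpu id): always feed the least-loaded GPU.
    using Load = std::pair<double, int>;
    std::priority_queue<Load, std::vector<Load>, std::greater<Load>> heap;
    for (int g = 0; g < ngpu; ++g) heap.push({0.0, g});

    // Largest panels first gives a tighter greedy bound (LPT heuristic).
    std::vector<Panel> sorted = panels;
    std::sort(sorted.begin(), sorted.end(),
              [](const Panel& a, const Panel& b) { return a.estCost > b.estCost; });

    for (const Panel& p : sorted) {
        auto [cost, g] = heap.top(); heap.pop();
        plan[g].push_back(p);
        heap.push({cost + p.estCost, g});
    }
    return plan;
}
```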
Implementation - GPU acceleration
Without workload balance: finishing time > 325 seconds.
Implementation - GPU acceleration
With workload balance: finishing time < 250 seconds.
Outline
- Introduction
- Implementation
- Numerical Results I
- P2P Matrix Sharing
- Numerical Results II
- Summary
Numerical Results I - Hardware specifications
          Brillante                                  | P8Exp
CPU:      2 × Intel E5-2670 v3 (12 + 12 cores used)  | 2 × IBM POWER8 (8 + 8 cores used)
Memory:   256 GB                                     | 1 TB
GPU:      2 × K40                                    | 4 × K80
Software: Intel Parallel Studio 2016 Update 1, Intel PARDISO, CUDA 7.5 | IBM ESSL and Parallel ESSL, IBM XL Fortran and XL C Compiler, MUMPS 5.0.1, CUDA 7.5
Numerical Results I
SOI dielectric waveguide
- Total grids: 79 × 319 × 39; matrix dimension 2,948,517 (three field unknowns per grid point: 3 × 79 × 319 × 39)
- Wavelength: 1.5 μm; grid size: 0.02 μm
- 100 GB RAM
Numerical Results I
Brillante: 2 × K40
ZGETRS + ZGEMM: 439.3 seconds (90% of overall time)
Numerical Results I
Brillante: 2 × K40
Naïve GPU acceleration yields good speedup due to high arithmetic intensity. "Scatter" time includes the D2H transfer.
Numerical Results I
Brillante: 2 × K40
Async streams apply to low-level separators, which finish in 31 seconds even in CPU-only mode.
Numerical Results I
Brillante: 2 × K40
Workload balance yields better speedup and multi-GPU scaling.
Numerical Results I
P8Exp: 4 × K80 with autoboost
- Good performance scaling in the quad-K80 server
- Higher performance with half-K80 computing
- Two threads compete for a single PCI-E link's bandwidth when using full K80s
Numerical Results I
P8Exp: 4 × K80 with autoboost
AccTRSMM: multi-GPU scaling
- Increased H2D transfer due to multiple $J_M$ copies sent to work-sharing GPUs
- Scaling performance is still acceptable
Numerical Results I
Periodic air-hole wavelength filter
- No propagation at $\lambda_0 = 1.5$ μm
- Total grids: 79 × 575 × 47; matrix dimension 6,404,925 (3 × 79 × 575 × 47)
- 188 GB RAM
Numerical Results I
Brillante: 2 × K40 (results figure)
Numerical Results I
P8Exp: 4 × K80 with autoboost (results figure)
Numerical Results I
P8Exp: GPU scaling of AccTRSMM
- Many more dense matrix operations
- Good scaling in multi-GPU systems