

  1. Performance Portable Line Smoother for Multiphysics Problems using Compact Batched BLAS Kyungjoo Kim Mehmet Deveci Andrew M. Bradley Simon D. Hammond Sivasankaran Rajamanickam Center for Computing Research, Sandia National Labs BLIS Retreat, Sep 18 - 19, 2017 Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. SAND NO. SAND2017-10026 C

  2. Outline: Introduction: Line Smoother; Compact Data Layout; Numerical Experiments; Summary

  3. Line Smoother

  4. Line Smoother Consider a block sparse system of equations Ax = b arising from coupled PDEs. By splitting A = M − S, we obtain
      (M − S)x = b
      Mx = b + Sx
      x = M⁻¹(b + Sx) = x + M⁻¹(b − Ax),
  where M is the set of block tridiagonal matrices corresponding to the extracted lines of elements. Solve this iteratively:
      x_{k+1} = M⁻¹(b + Sx_k).
  [Figure: a set of line elements extracted along the streamline direction from an m × n × k local grid.]
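To make the splitting concrete, here is a minimal sketch of the iteration x_{k+1} = x_k + M⁻¹(b − Ax_k) on a tiny 3 × 3 system. For brevity it takes M = diag(A) (a Jacobi-type splitting) rather than the block tridiagonal M of the slides; the names are illustrative only.

```cpp
#include <array>

constexpr int N = 3;
using Vec = std::array<double, N>;
using Mat = std::array<Vec, N>;

// One sweep of the splitting iteration x_{k+1} = x_k + M^{-1}(b - A x_k).
// Here M = diag(A) (Jacobi-type splitting) to keep the sketch short;
// in the slides M is the block-tridiagonal part extracted along lines.
Vec smoother_iterate(const Mat& A, const Vec& b, Vec x, int sweeps) {
  for (int k = 0; k < sweeps; ++k) {
    Vec r{};  // residual r = b - A x
    for (int i = 0; i < N; ++i) {
      r[i] = b[i];
      for (int j = 0; j < N; ++j) r[i] -= A[i][j] * x[j];
    }
    for (int i = 0; i < N; ++i)  // x += M^{-1} r with M = diag(A)
      x[i] += r[i] / A[i][i];
  }
  return x;
}

// small helper for checking convergence
bool approx(double a, double b) { return (a > b ? a - b : b - a) < 1e-8; }
```

Since A here is diagonally dominant, the iteration converges to the exact solution; in the smoother setting only a fixed, small number of sweeps is applied.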

  5. Considerations for Line Smoother Implementation A fixed number of iterations of x_{k+1} = M⁻¹(b + Sx_k) is used as a preconditioner. M is factorized once per solve (or per nonlinear iteration) and applied multiple times. Typical block sizes b are 3, 5, 9 and 15, which correspond to scientific applications, e.g., elasticity, ideal gas and multiphysics fluid problems. Memory usage is limited to 16 GB, i.e., MCDRAM on KNL and GPU device memory. With this memory constraint, typical local problem sizes (m × n × k) are 128 × 128 × 128 for b = 3, 5 and 64 × 64 × 128 for b = 10, 15. Batch parallelism: a sequential algorithm runs within a parallel for. Initialization of the line smoother:
      for T in {T_0, T_1, ···, T_{m×n−1}} do in parallel
          for r ← 0 to k − 2 do
              A_r := LU(A_r)
              B_r := L⁻¹ B_r
              C_r := C_r U⁻¹
              A_{r+1} := A_{r+1} − C_r B_r
          A_{k−1} := LU(A_{k−1})
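For scalar blocks (b = 1) the factorization loop above reduces to the classic Thomas algorithm. The sketch below mirrors the slide's steps under that simplification (factor A_r, update C_r, Schur-complement update of A_{r+1}); the names are illustrative, not the actual implementation.

```cpp
#include <vector>

// Scalar (b = 1) specialization of the factorization loop on the slide:
// diagonal a[0..k-1], super-diagonal b[0..k-2], sub-diagonal c[0..k-2].
// For 1x1 blocks, LU(a_r) = a_r with L = 1, so B_r := L^{-1} B_r is a
// no-op and C_r := C_r U^{-1} is a division.
void factor_tridiag(std::vector<double>& a, std::vector<double>& b,
                    std::vector<double>& c) {
  const int k = static_cast<int>(a.size());
  for (int r = 0; r < k - 1; ++r) {
    c[r] /= a[r];             // C_r := C_r U^{-1}
    a[r + 1] -= c[r] * b[r];  // A_{r+1} := A_{r+1} - C_r B_r (Schur update)
  }
}

// Apply the factors: forward solve with L, then backward solve with U.
std::vector<double> solve_tridiag(const std::vector<double>& a,
                                  const std::vector<double>& b,
                                  const std::vector<double>& c,
                                  std::vector<double> rhs) {
  const int k = static_cast<int>(a.size());
  for (int r = 1; r < k; ++r) rhs[r] -= c[r - 1] * rhs[r - 1];  // L y = rhs
  rhs[k - 1] /= a[k - 1];
  for (int r = k - 2; r >= 0; --r)                              // U x = y
    rhs[r] = (rhs[r] - b[r] * rhs[r + 1]) / a[r];
  return rhs;
}

// small helper for checking results
bool near(double x, double y) { return (x > y ? x - y : y - x) < 1e-12; }
```

The separation into factor and solve reflects the usage pattern stated above: M is factorized once and applied many times.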

  6. Objective: Develop Performance Portable Line Preconditioner Architecture constraints: the wide vector length (AVX-512) of KNL for small problem sizes; the large number of CUDA threads (including vector lanes, 132K) on GPUs; hierarchical parallelism to improve concurrency when solving each tridiagonal system; and different data access patterns for different architectures, i.e., core-affine access on KNL vs. coalesced access on GPUs.

  7. Problems Using Vendor DLA Libraries Batched BLAS: batched BLAS computes many dense problems in parallel, conceptually similar to BLAS calls within a parallel for. In the line-smoother context, a batched tridiagonal factor/solve built from a sequence of batch operations is not efficient: no data locality is exploited across the sequence of batch calls, and each parallel for implicitly imposes a synchronization. BLAS within OpenMP: BLAS is not optimized for 3 × 3 matrix computations, and cuBLAS provides only a batched interface.

  8. Compact Data Layout

  9. Compact Data Layout (SIMD type) Blocks are too small to use the wide vector units, so we vectorize across batch instances. Compact BLAS uses a packed vector as the computing unit instead of a scalar. By overloading the arithmetic operators, the scalar BLAS routines are reused and naturally vectorized. [Figure: blocks A1–A4 stored interleaved element by element, so the same entry of consecutive matrices occupies adjacent vector lanes.]

      // computing unit
      struct VectorAVX512D {
        union {
          __m512d v;
          double s[8];
        };
      };

      // overload arithmetic operators (+, −, *, /)
      VectorAVX512D operator+(VectorAVX512D const &a, VectorAVX512D const &b) {
        VectorAVX512D r;
        r.v = _mm512_add_pd(a.v, b.v);
        return r;
      }

  In collaboration with the Intel MKL team, the Compact Batched BLAS standard was adopted as a new batch interface.
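The cross-matrix vectorization idea can be sketched portably, without AVX-512 intrinsics: a fixed-width pack type with overloaded operators lets one templated scalar gemm serve both plain double and packed lanes, so each lane computes one matrix of the batch. `Pack` and `gemm` here are illustrative stand-ins, not the KokkosKernels API.

```cpp
#include <array>

// Portable stand-in for the __m512d computing unit on the slide:
// one Pack holds the same entry from W different matrices (the lanes).
template <int W>
struct Pack {
  std::array<double, W> s{};
  Pack() = default;
  explicit Pack(double x) { s.fill(x); }
};
template <int W>
Pack<W> operator+(Pack<W> a, const Pack<W>& b) {
  for (int l = 0; l < W; ++l) a.s[l] += b.s[l];
  return a;
}
template <int W>
Pack<W> operator*(Pack<W> a, const Pack<W>& b) {
  for (int l = 0; l < W; ++l) a.s[l] *= b.s[l];
  return a;
}

// Scalar gemm, C += A * B (row-major N x N), written once for any
// "scalar" type T. With T = Pack<W> it performs W independent gemms in
// lockstep: cross-matrix vectorization via the compact interleaved layout.
template <typename T, int N>
void gemm(const T* A, const T* B, T* C) {
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j) {
      T acc = C[i * N + j];
      for (int p = 0; p < N; ++p)
        acc = acc + A[i * N + p] * B[p * N + j];
      C[i * N + j] = acc;
    }
}
```

With a real SIMD type in place of `Pack`, the compiler maps the lane loops to vector instructions, which is exactly why the overloaded-operator approach lets the scalar BLAS code vectorize "for free".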

  10. Line Smoother Impl. using Compact Data Layout [Figure: block A of T0 and T1 is packed so that corresponding elements are aligned to vector lanes, e.g., α^T0_00, α^T1_00, α^T0_01, α^T1_01, ···]
      for each pair in {(T_0, T_1), (T_2, T_3), ···, (T_{m×n−2}, T_{m×n−1})} do in parallel
          for r ← 0 to k − 2 do
              A_r := LU(A_r)
              B_r := L⁻¹ B_r
              C_r := C_r U⁻¹
              A_{r+1} := A_{r+1} − C_r B_r
          A_{k−1} := LU(A_{k−1})
  Layered APIs for Kokkos Hierarchical Parallelism: Serial (Vector): a single thread is mapped to a single instance of the parallel batch. Team: a team of threads is mapped to a single instance of the parallel batch. Device: the entire device is used, like the current batch interfaces provided by vendors. The Serial and Team APIs are used to compose the batched block tridiagonal factorization/solve.
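A minimal, hypothetical sketch of the Serial layer idea: a functor whose static `invoke()` runs one batch instance sequentially, driven by a loop that stands in for a Kokkos `parallel_for` over the batch. The names (`SerialScale`, `batched_scale`) are illustrative, not the actual KokkosKernels interface.

```cpp
#include <vector>

// "Serial" layer: a functor with a static invoke() that one thread (or
// one vector lane) runs on a single batch instance, start to finish.
struct SerialScale {
  static void invoke(double alpha, double* A, int n) {
    for (int i = 0; i < n; ++i) A[i] *= alpha;  // sequential work per instance
  }
};

// Batch driver: each iteration is one independent instance. In Kokkos this
// loop would be a parallel_for over [0, num); composing Serial calls inside
// it is how the batched factorization/solve above is assembled.
void batched_scale(double alpha, std::vector<double>& batch, int block_len) {
  const int num = static_cast<int>(batch.size()) / block_len;
  for (int b = 0; b < num; ++b)
    SerialScale::invoke(alpha, batch.data() + b * block_len, block_len);
}
```

The Team layer follows the same pattern, except that the inner loop of `invoke` is itself distributed over a team of threads working on one instance.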

  11. Line Smoother Impl. using Compact Data Layout The same packed pairwise factorization as on the previous slide. Some issues: because vectorization crosses matrices, pivoting in the LU factorization is not feasible → for preconditioning purposes this does not matter. There is a repacking overhead when the standard format is used → the block tridiagonal matrices are extracted and repacked at the same time.

  12. Numerical Experiments

  13. Test Setup Intel Knights Landing: 34 tiles, 2 cores per tile (4 threads/core), 2x AVX-512 units per core, 1 MB L2 per tile; 3+ TFLOPS in double precision, 400+ GB/s (MCDRAM). NVIDIA Tesla P100: 56 SMs, 32 FP64 CUDA cores/SM; 4.7 TFLOPS in double precision, 732+ GB/s (HBM). Benchmark: the compact batched BLAS is implemented in KokkosKernels. We evaluate our compact batched BLAS against vendor-optimized BLAS implementations: 1) MKL with OpenMP, 2) MKL batched APIs, and 3) cuBLAS batched APIs. The line smoother is implemented using our compact batched BLAS Serial and Team interfaces and compared with SPARC, Sandia's production simulation code for solving the Navier–Stokes equations for compressible and reacting flows.

  14. Intel Knights Landing: Batched BLAS The SIMD data layout allows vectorization for small block sizes. [Figure: batched LU performance, GFLOPS/core vs. number of cores (1–68), for block sizes 3, 5, 10 and 15, comparing KokkosKernels with MKL OpenMP.]

  15. Intel Knights Landing: Batched BLAS The SIMD data layout allows vectorization for small block sizes. [Figure: batched Trsm performance, GFLOPS/core vs. number of cores (1–68), for block sizes 3, 5, 10 and 15, comparing KokkosKernels with MKL Batch and MKL OpenMP.]

  16. Intel Knights Landing: Batched BLAS The SIMD data layout allows vectorization for small block sizes. [Figure: batched Gemm performance, GFLOPS/core vs. number of cores (1–68), for block sizes 3, 5, 10 and 15, comparing KokkosKernels with MKL Batch, MKL OpenMP and libxsmm.]

  17. Intel Knights Landing: Roofline Analysis The APEX toolkit (developed at Sandia) is used for performance analysis. The compute characteristics are memory-bandwidth bound, and our implementation fully exploits the MCDRAM memory bandwidth. [Figure: roofline plots of batched Gemm for MKL OpenMP and MKL Batch, attainable GFLOP/s vs. arithmetic intensity (flops/byte), with peak GFLOP/s ceilings (with FMA, without FMA, without vectorization) and DDR, MCDRAM and L2 bandwidth lines.]
