TD-BEM Formulation (10)
Linear Formulation

Notations:
• δΩ discretized in N unknowns/degrees of freedom
• $M^k$: the convolution matrices (dimension N × N), input
• $l^n$: the incident wave emitted by a source on the unknowns at time step n, input
• $a^n$: the state of the system at time step n, to compute

Convolution system:
$$M^0 \cdot a^n + \sum_{k \ge 1}^{K_{max}} M^k \cdot a^{n-k} = l^n \qquad (1)$$

Solve at each time step:
$$a^n = (M^0)^{-1} \left( l^n - \sum_{k=1}^{K_{max}} M^k \cdot a^{n-k} \right) \qquad (2)$$

[Figure: incident wave $l^n$ reaching the domain Ω]

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain, Bérenger Bramas
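As an illustration of equation (2), here is a minimal time-marching sketch in plain Python. It is not the solver from the talk: the function names (`solve_dense`, `matvec`, `time_march`), the tiny 2 × 2 matrices and the assumption that the system is at rest before step 0 are all hypothetical choices made for the example, and the dense Gaussian elimination merely stands in for the external $M^0$ solver.

```python
def solve_dense(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [list(A[r]) + [b[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def matvec(M, v):
    """Dense matrix/vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def time_march(M, l):
    """M[k] is the convolution matrix M^k, l[n] the incident wave at step n.

    Returns the list of states a^0, a^1, ... (assumption: a^m = 0 for m < 0).
    """
    n_unknowns = len(l[0])
    history = []  # history[n] = a^n
    for n, l_n in enumerate(l):
        # Summation stage: s^n = sum over k >= 1 of M^k . a^(n-k)
        s_n = [0.0] * n_unknowns
        for k in range(1, len(M)):
            if n - k < 0:
                break  # no state exists before step 0
            contrib = matvec(M[k], history[n - k])
            s_n = [s + c for s, c in zip(s_n, contrib)]
        # Linear solve: M^0 . a^n = l^n - s^n
        rhs = [li - si for li, si in zip(l_n, s_n)]
        history.append(solve_dense(M[0], rhs))
    return history


# Hypothetical data: K_max = 2, N = 2, four time steps.
M = [
    [[2.0, 0.5], [0.5, 2.0]],  # M^0
    [[0.0, 0.3], [0.3, 0.0]],  # M^1
    [[0.1, 0.0], [0.0, 0.1]],  # M^2
]
l = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
a = time_march(M, l)
```

Each computed state can be checked by substituting it back into the convolution system (1).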
TD-BEM Formulation (11)
Interaction/Convolution Matrices ($M^k$)

• Interactions between unknowns
• Symmetric and sparse: $M^k(i, j) \neq 0$ if $distance(i, j) \approx k \cdot c \cdot \Delta t$
• Pre-computed (external tool)

[Figure: sparsity patterns of $M^0$, $M^1$, $M^2$, ..., $M^{K_{max}}$]
TD-BEM Formulation (12)
Solve (Schematic View)

$$a^n = (M^0)^{-1} \left( l^n - \sum_{k=1}^{K_{max}} M^k \cdot a^{n-k} \right) \qquad (3)$$

[Figure: $s^n$ is the sum of the products of $M^1, ..., M^5$ with $a^{n-1}, ..., a^{n-5}$; then $\tilde{s}^n = l^n - s^n$; then the linear solver computes $a^n$ from $M^0$ and $\tilde{s}^n$]
TD-BEM Formulation (13)
SpMV (sparse matrix/vector product)

Summation stage → $K_{max}$ SpMVs
• Permutation, advanced storages/kernels, blocking [White III and Sadayappan, 1997, Pinar and Heath, 1999, Pichel et al., 2005, Vuduc and Moon, 2005]
• Auto-tuning [Im and Yelick, 2001, Vuduc et al., 2005]

Low Flop-rate:
• Memory bound operation
• Flop/Word hardware limit
• Irregular/not contiguous memory accesses
• Instruction (pipelining, vectorization)
• Not appropriate for GPUs [Garland, 2008, Baskaran and Bordawekar, 2008, Bell and Garland, 2009]
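To make the "irregular/not contiguous memory accesses" point concrete, the following sketch implements an SpMV in compressed row storage (CRS, also called CSR). The helper names and the 4 × 4 matrix are hypothetical, and the code is a plain reference loop, not one of the tuned kernels cited above.

```python
def dense_to_csr(dense):
    """Convert a dense matrix (list of rows) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, x in enumerate(row):
            if x != 0.0:
                values.append(x)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A . x for A stored in CSR."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        for p in range(row_ptr[i], row_ptr[i + 1]):
            # x is read through col_idx: an indirect, non-contiguous access.
            acc += values[p] * x[col_idx[p]]
        y[i] = acc
    return y


# Hypothetical 4 x 4 sparse matrix with 7 non-zero values.
dense = [
    [4.0, 0.0, 0.0, 1.0],
    [0.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
    [1.0, 0.0, 2.0, 5.0],
]
x = [1.0, 2.0, 3.0, 4.0]
values, col_idx, row_ptr = dense_to_csr(dense)
y = csr_spmv(values, col_idx, row_ptr, x)
print(y)  # [8.0, 6.0, 8.0, 27.0]
```

Every non-zero value costs two flops (one multiply, one add) but requires loading the value, its column index and one entry of `x`, which is why the operation tends to be limited by memory traffic rather than by arithmetic.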
TD-BEM Formulation (14)
SpMV (Performance)

[Figure: SpMVs MKL/cuSparse (double precision). Y axis: GFlop/s (0 to 30). Series: COO MKL, CRS MKL, DIA MKL, BCSR MKL, CRS cuSparse, BCSR cuSparse. Test cases: Dense 5000, Diagonal 15/500000, Random 4/20000, Block Random (5/80000), Dense 200 (x10000).]

Peak performance: Haswell Intel Xeon E5-2680 CPU at 2.50 GHz, 20 GFlop/s per core; K40-M GPU, 1.43 TFlop/s.
TD-BEM Formulation (15)
TD-BEM Application Stages

User inputs, simulation parameters
↓
Mesh generator, configuration, interaction matrices pre-computation
↓
Solver
· Summation stage
· $M^0$ Linear Solver (external tool)
↓
Post-processing (TD → FD)
Outline (16)

• Problem Formulation
• BEM Solver (Matrix Approach)
• Fast-Multipole Method Approach
• FMM Algorithm & Parallelization
• FMM BEM Solver (Experimental Implementation)
• Conclusion & Perspectives
Improving the Summation (17)
Computational Ordering

[Figure: the summation drawn as a 3D block in which the matrices $M^6, ..., M^1$ are applied to the states $A^0, ..., A^5$ to produce $S^6$. Three traversal orders are shown: Top (i), Front (k)/SpMV, Side (j).]

$$s^n(i) = \sum_{k=1}^{K_{max}} \sum_{j=1}^{N} M^k(i, j) \times a^{n-k}(j), \quad 1 \le i \le N. \qquad (4)$$
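The three views of equation (4) differ only in which index drives the outer loop. The sketch below, with a hypothetical `summation` function and small hypothetical data, runs the same triple sum with k, i or j as the outer loop and shows that the result is identical; only the memory access pattern changes.

```python
def summation(M, a_past, order):
    """Evaluate s(i) = sum_k sum_j M[k][i][j] * a_past[k][j].

    M[k] holds M^(k+1) and a_past[k] holds a^(n-(k+1)).
    `order` is a permutation of "kij" giving the outer, middle and inner loop.
    """
    K, N = len(M), len(M[0])
    ranges = {"k": range(K), "i": range(N), "j": range(N)}
    outer, middle, inner = order
    s = [0.0] * N
    for x in ranges[outer]:
        for y in ranges[middle]:
            for z in ranges[inner]:
                idx = {outer: x, middle: y, inner: z}
                k, i, j = idx["k"], idx["i"], idx["j"]
                s[i] += M[k][i][j] * a_past[k][j]
    return s


# Hypothetical data: two matrices (M^1, M^2) and two past states, N = 3.
M = [
    [[1.0, 0.0, 2.0], [0.0, 3.0, 0.0], [2.0, 0.0, 1.0]],
    [[0.0, 1.0, 0.0], [1.0, 0.0, 4.0], [0.0, 4.0, 0.0]],
]
a_past = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

front = summation(M, a_past, "kij")  # Front (k): one SpMV per matrix
top = summation(M, a_past, "ikj")    # Top (i): one row of every matrix at a time
side = summation(M, a_past, "jik")   # Side (j): one column of every matrix at a time
print(front)  # [12.0, 34.0, 25.0]
```

The test values are small integers stored as floats, so the three orders agree exactly; with general floating-point data they would agree up to rounding.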
Improving the Summation (18)
Structure of a Slice Matrix

A $Slice_j$:
• Obtained when the outer loop index is j
• The concatenation of column j of the interaction matrices $M^k$ (except $M^0$)
• Size $(N \times (K_{max} - 1))$
• There is one dense vector per row
• $Slice_j(i, k) = M^k(i, j) \neq 0$ with $k_s = d(i, j)/(c \Delta t)$ and $k_s \le k \le k_s + p$

[Figure: $Slice_j$ built from the columns $M^1(*,j)$, $M^2(*,j)$, ..., $M^{12}(*,j)$]
Improving the Summation (19)
Computing with a Slice Matrix

[Figure: $Slice_j$ (columns $M^1(*,j)$, ..., $M^{12}(*,j)$) multiplied by the past values $a^{*<n}(j)$ of unknown j and accumulated into $s^n$]

Computation with N vector/vector products (one per row):
• Regular memory access (vectorization, pipelining)
• Low Flop/word ratio (same as SpMV)
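A minimal sketch of the slice-based summation follows. The names `build_slice` and `summation_by_slices` are hypothetical, and for simplicity each slice row is stored in full, whereas the slides indicate that only one dense vector per row is non-zero and would be stored.

```python
def build_slice(M, j):
    """Return Slice_j with Slice_j[i][k] = M[k][i][j] (column j of every matrix)."""
    N = len(M[0])
    return [[M_k[i][j] for M_k in M] for i in range(N)]


def summation_by_slices(M, a_past):
    """Compute s^n with j as the outer loop.

    M[k] holds M^(k+1) and a_past[k] holds a^(n-(k+1)).
    """
    K, N = len(M), len(M[0])
    s = [0.0] * N
    for j in range(N):
        slice_j = build_slice(M, j)
        # Past values of unknown j, most recent first: contiguous in this layout.
        past_j = [a_past[k][j] for k in range(K)]
        for i in range(N):
            # One vector/vector product per row of the slice.
            s[i] += sum(m * a for m, a in zip(slice_j[i], past_j))
    return s


# Hypothetical data: two matrices (M^1, M^2) and two past states, N = 3.
M = [
    [[1.0, 0.0, 2.0], [0.0, 3.0, 0.0], [2.0, 0.0, 1.0]],
    [[0.0, 1.0, 0.0], [1.0, 0.0, 4.0], [0.0, 4.0, 0.0]],
]
a_past = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
s_n = summation_by_slices(M, a_past)
print(s_n)  # [12.0, 34.0, 25.0]
```

In a real implementation the slices would be built once, before the time loop, rather than at every call as in this sketch.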
Improving the Summation (20)
Improving the Flop/Word Ratio

[Figure: the state of unknown j over time (..., $a^{n+2}$, $a^{n+1}$, $a^n$, $a^{n-1}$, ..., $a^{n-9}$), with the values not yet computed marked "?". A row $M^1(*,j)$, ..., $M^9(*,j)$ of $Slice_j$ is applied to shifted windows of this state to produce $s^n$, then $s^{n+1}$ and $s^n$ ($n_g = 2$), then $s^{n+2}$, $s^{n+1}$ and $s^n$ ($n_g = 3$). In the final frames the unknown values are replaced by 0.]
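The idea in the figure can be sketched as follows: one row of a slice is reused for $n_g$ consecutive time steps, and the state values that are not yet known are treated as 0. The function names, the row values and the `radiate` correction are assumptions made for this example; in particular, it is only assumed here that the "Radiation" step shown later in the deck plays the role of adding the missing terms once a new state is available.

```python
def multi_vectors_vector(row, past, n_g):
    """Partial contributions of unknown j to s^n(i), ..., s^(n+n_g-1)(i).

    row[k-1]  = M^k(i, j) for k = 1..v
    past[m-1] = a^(n-m)(j) for m = 1..len(past)
    States at steps >= n are not known yet and are treated as 0.
    """
    v = len(row)
    results = []
    for g in range(n_g):
        acc = 0.0
        for k in range(1, v + 1):
            m = k - g  # a^(n+g-k) is a^(n-m)
            if 1 <= m <= len(past):
                acc += row[k - 1] * past[m - 1]
        results.append(acc)
    return results


def radiate(results, row, g_known, value):
    """Add the terms skipped earlier, once a^(n+g_known)(j) = value is known."""
    for g in range(g_known + 1, len(results)):
        k = g - g_known  # M^k multiplies a^(n+g-k) = a^(n+g_known)
        if k <= len(row):
            results[g] += row[k - 1] * value


# Hypothetical data: v = 3, n_g = 3.
row = [1.0, 2.0, 3.0]       # M^1(i,j), M^2(i,j), M^3(i,j)
past = [10.0, 20.0, 30.0]   # a^(n-1)(j), a^(n-2)(j), a^(n-3)(j)
partial = multi_vectors_vector(row, past, n_g=3)
print(partial)  # [140.0, 80.0, 30.0]

completed = list(partial)
radiate(completed, row, g_known=0, value=5.0)  # a^n(j) becomes known
radiate(completed, row, g_known=1, value=7.0)  # a^(n+1)(j) becomes known
print(completed)  # [140.0, 85.0, 47.0]
```

The row and the window of past values are loaded once and used for all $n_g$ results, which is where the gain in Flop/word comes from.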
Multi-Vectors/Vector Product (21)
Flop/Word Ratio

Vector length v = 4, group size $n_g$ = 4 ($v \times n_g \times 2$ Flops):

[Figure: three ways to compute the $n_g$ results r from the slice values a, b, c, d: vector/vector products, vector/matrix product, multi-vectors/vector product]

Number of words accessed:
• Vectors product (≈ SpMV): $n_g(2v + 1)$
• Vector/matrix product: $v + n_g(v + 1)$
• Multi-vectors/vector product: $(v + n_g - 1) + (v) + (n_g)$

[Figure: F/W for $n_g = 8$ as a function of v (0 to 20); F/W axis from 0 to 6]
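The three word counts can be turned into Flop/word ratios directly. The small helper below (the name `flop_per_word` is hypothetical) evaluates them for the slide's example, v = 4 and $n_g$ = 4.

```python
def flop_per_word(v, n_g):
    """Flop/word ratio of the three formulations for vector length v and group size n_g."""
    flops = 2 * v * n_g
    words = {
        "vector/vector": n_g * (2 * v + 1),
        "vector/matrix": v + n_g * (v + 1),
        "multi-vectors/vector": (v + n_g - 1) + v + n_g,
    }
    return {name: flops / count for name, count in words.items()}


ratios = flop_per_word(v=4, n_g=4)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}")
```

With these values the operation costs 32 flops for 36, 24 and 15 words respectively, so the multi-vectors/vector product more than doubles the ratio of the plain vector/vector products.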
Multi-Vectors/Vector Product (22)
Multi-vectors/vector Product (CPU)

[Figure: two panels, $N_r$ = 1 024 and $N_r$ = 20 480. X axis: length of vectors (v), 0 to 80. Y axis: speed (GFlop/s), 0 to 20. Series: AVX-Asm, AVX-Intrinsic, AVX-Template, SSE-Intrinsic, Compiler Version.]

Plots show the GFlop/s with $n_g$ = 8 for test cases of dimension $N_r \times v$ (in double precision). Haswell Intel Xeon E5-2680 at 2.50 GHz (20 GFlop/s).
Multi-Vectors/Vector Product on GPUs (23)
GPUs Slice Storages

• Blocking scheme (small conversion overhead)
• Data access appropriate for SIMT/SIMD
• Memory accesses (coalesced, low bank conflicts)
• Data re-use (shared memory)
• CPU/GPU Balancing

[Figure: panels (a), (b) and (c) showing $Slice_j$, the past values $a^{*<n}(j)$ and the blocked storage of the slice]
Parallelization (24)
Parallelization

Sequential algorithm:

[Figure: $s^n$ is the sum of the products of $M^1, ..., M^5$ with $a^{n-1}, ..., a^{n-5}$; then $\tilde{s}^n = l^n - s^n$; then the linear solver computes $a^n$ from $M^0$ and $\tilde{s}^n$]
Parallelization (25)
Parallel Solver (Schematic View)

[Figure: each process P0, P1, P2 computes a partial $s^n$ from its part of $M^1, ..., M^5$ and $a^{n-1}, ..., a^{n-5}$; the partial results are summed into $s^n$; then $\tilde{s}^n = l^n - s^n$ and a parallel linear solver computes $a^n$ from $M^0$ and $\tilde{s}^n$]
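The schematic can be mimicked without any message-passing library: each simulated process computes a partial $s^n$ over the data it owns, and the partial vectors are then added. The distribution used here (round-robin ownership of the columns j, that is, of the slices) is an assumption for the example, as are the function names; the deck does not state how the data is actually distributed.

```python
def partial_summation(M, a_past, columns):
    """Partial s^n restricted to the given columns j (the slices owned by one process)."""
    K, N = len(M), len(M[0])
    s = [0.0] * N
    for j in columns:
        for k in range(K):
            for i in range(N):
                s[i] += M[k][i][j] * a_past[k][j]
    return s


def parallel_summation(M, a_past, n_proc):
    """Simulate n_proc processes one after the other, then reduce their results."""
    N = len(M[0])
    partials = [
        partial_summation(M, a_past, range(p, N, n_proc))  # round-robin ownership
        for p in range(n_proc)
    ]
    # Stand-in for a sum reduction across processes.
    return [sum(values) for values in zip(*partials)], partials


# Hypothetical data: two matrices (M^1, M^2) and two past states, N = 3.
M = [
    [[1.0, 0.0, 2.0], [0.0, 3.0, 0.0], [2.0, 0.0, 1.0]],
    [[0.0, 1.0, 0.0], [1.0, 0.0, 4.0], [0.0, 4.0, 0.0]],
]
a_past = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
s_n, partials = parallel_summation(M, a_past, n_proc=2)
print(partials)  # [[7.0, 28.0, 5.0], [5.0, 6.0, 20.0]]
print(s_n)       # [12.0, 34.0, 25.0]
```

The only exchange between processes in this sketch is the final reduction, followed in the real solver by the parallel solve with $M^0$.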
Parallelization (26)
Parallel Solver with $n_g$ > 1 (Schematic View)

[Figure: $T/n_g$ outer loops. In each one, every process P0, P1, P2 computes its partial $s^n$, $s^{n+1}$, $s^{n+2}$ from $M^1, ..., M^5$ and $a^{n-1}, ..., a^{n-5}$. Then $n_g$ inner loops follow: the partial $s^{n+f}$ are summed, $\tilde{s}^{n+f} = l^{n+f} - s^{n+f}$, the parallel linear solver computes $a^{n+f}$ from $M^0$ and $\tilde{s}^{n+f}$, and a Radiation step on each process updates $s^{n+f+1}$, $s^{n+f+2}$ from $a^{n+f}$ with $M^1$, $M^2$.]
Results (27)
Airplane Simulation

• Acoustics
• N = 23 962
• 10 823 time iterations
• $K_{max}$ = 341 interaction matrices $M^k$
• $n_g$ = 8
• 70 GB of data
• double precision
• Homogeneous node: 24-core CPU (128 GB memory)
• Heterogeneous node: 24-core CPU (128 GB memory) and 4 K40M GPUs (12 GB memory)
Results (28)
Parallel Efficiency/Percentage (Homogeneous)

[Figure, left: parallel efficiency (0 to 1) of the Full-MPI version against the number of nodes (1 to 20)]
[Figure, right: percentage of time (0 to 100 %) spent in Summation, Direct solver $M^0$ and Idle against the number of nodes (1 to 20)]

Summation stage ↘, $M^0$ Solve →
Results (29)
With GPUs

[Figure: Execution time. Y axis: time (seconds), 300 to 20,000. X axis: number of nodes (1 to 5). Series: CPU-Only, 1GPU, 2GPU, 3GPU, 4GPU.]
[Figure: Speedup against CPU-Only. Y axis: speedup, 1.0 to 21.0. X axis: number of nodes (1 to 5). Series: 1GPU, 2GPU, 3GPU, 4GPU.]

Problem size ≈ 70 GB; GPU memory: 12 GB
Summary (30)

Summary:
• New computational ordering [Bramas et al., 2014]
• Solver with few communication points

Additional contributions:
• Permutations/SpMV
• Efficient SIMD kernel for CPU
• Efficient blocking scheme/kernel for GPU [Bramas et al., 2015]
• Dynamic balancing (CPU/GPU)

Limits:
• $M^0$ Linear solver
• GPUs' memory
• Interaction matrices construction
• Complexity → $O(N^2)$ for each iteration
Outline (31)

• Problem Formulation
• BEM Solver (Matrix Approach)
• Fast-Multipole Method Approach
• FMM Algorithm & Parallelization
• FMM BEM Solver (Experimental Implementation)
• Conclusion & Perspectives
FMM Algorithm (32)
FMM Operators (1D)

• Spatial decomposition → Potential decomposition: $f_i = f_i^{near} + f_i^{far}$
• Near field by direct interactions (leaves)
• Far field with FMM operators (tree)

[Figure: 1D tree with levels l = 0 to l = 3 and the operators P2P, M2M, M2L and L2L]
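The decomposition $f_i = f_i^{near} + f_i^{far}$ can be illustrated on a 1D example. The sketch below only shows the split induced by the leaves of the tree: both parts are evaluated by direct summation, whereas an actual FMM would evaluate the far part through the M2M, M2L and L2L operators. The kernel 1/|x - y|, the point set and the function names are assumptions chosen for the example.

```python
def leaf_index(x, level):
    """Index of the leaf cell containing x in [0, 1) at the given tree level."""
    return min(int(x * 2 ** level), 2 ** level - 1)


def split_potential(points, charges, level=3):
    """Return (near, far) potentials at every point for the kernel 1/|x - y|.

    Sources in the same leaf or in an adjacent leaf count as near field;
    all others count as far field (evaluated directly here, unlike in an FMM).
    """
    leaves = [leaf_index(x, level) for x in points]
    near = [0.0] * len(points)
    far = [0.0] * len(points)
    for i, xi in enumerate(points):
        for j, xj in enumerate(points):
            if i == j:
                continue  # no self-interaction
            term = charges[j] / abs(xi - xj)
            if abs(leaves[i] - leaves[j]) <= 1:
                near[i] += term
            else:
                far[i] += term
    return near, far


def direct_potential(points, charges):
    """Reference: full direct summation."""
    return [
        sum(q / abs(xi - xj) for j, (xj, q) in enumerate(zip(points, charges)) if j != i)
        for i, xi in enumerate(points)
    ]


# Hypothetical point set in [0, 1) with unit charges.
points = [0.05, 0.10, 0.30, 0.45, 0.60, 0.80, 0.95]
charges = [1.0] * len(points)
near, far = split_potential(points, charges)
reference = direct_potential(points, charges)
total = [n + f for n, f in zip(near, far)]
```

Up to rounding, `total` matches `reference`; the FMM keeps the near part as it is (P2P) and replaces the direct far-field loop with tree operators.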
FMM Parallelization (33)

Related work:
• Multicore study [Chandramowlishwaran et al., 2010]
• NVidia GPU [Yokota and Barba, 2011]
• Distributed GPU [Hamada et al., 2009]
• Distributed CPU/GPU [Hu et al., 2011, Lashuk et al., 2012, Malhotra and Biros, 2015]
• Using a runtime system (multicore) [Ltaief and Yokota, 2014]
FMM Parallelization (34)
Paradigms

• Fork-join
  - Parallel-for (OpenMP)
• Task-based
  - Tasks pool (OpenMP 3.1) [Agullo et al., 2014] ¹
  - Tasks-and-dependencies (runtime systems, OpenMP 4)

¹ Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., and Takahashi, T. (2014). Task-based FMM for multicore architectures. SIAM Journal on Scientific Computing, 36(1):C66-C93.
FMM Parallelization (35)
Tasks-and-Dependencies Model (OpenMP 4, StarPU)

Challenges
• Granularity
• Computational kernels
• Scheduling
FMM Parallelization (36)
Scheduling
[Figure: a scheduler holding the ready tasks A to F and assigning them to the workers CPU0, CPU1 and GPU0]
Strategies:
• Priority
• Work stealing [Blumofe and Leiserson, 1999]
• Heterogeneous Earliest Finish Time (HEFT) [Topcuoglu et al., 2002]
Drawbacks:
• Calibration
• Overhead
• Ready-tasks view
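To make the earliest-finish-time idea concrete, here is a minimal sketch of a greedy placement over ready tasks only. It is a simplification of HEFT (the published algorithm also ranks tasks over the whole graph), and every name and cost below is hypothetical. The per-worker-type cost table is the kind of information that has to be calibrated beforehand.

```python
def earliest_finish_schedule(tasks, workers):
    """Greedy placement: each ready task goes to the worker that finishes it first.

    tasks:   list of (task_name, {worker_type: estimated_cost})
    workers: list of (worker_name, worker_type)
    Returns {task_name: (worker_name, finish_time)}.
    """
    available = {name: 0.0 for name, _ in workers}  # time at which each worker is free
    placement = {}
    for task_name, costs in tasks:
        best = None
        for worker_name, worker_type in workers:
            if worker_type not in costs:
                continue  # no kernel for this task on this worker type
            finish = available[worker_name] + costs[worker_type]
            if best is None or finish < best[0]:
                best = (finish, worker_name)
        finish, worker_name = best
        available[worker_name] = finish
        placement[task_name] = (worker_name, finish)
    return placement


# Hypothetical workers and cost estimates (arbitrary time units).
workers = [("CPU0", "cpu"), ("CPU1", "cpu"), ("GPU0", "gpu")]
tasks = [
    ("A", {"cpu": 4, "gpu": 1}),
    ("B", {"cpu": 2}),            # CPU-only task
    ("C", {"cpu": 3, "gpu": 3}),
    ("D", {"cpu": 6, "gpu": 1}),
    ("E", {"cpu": 2, "gpu": 4}),
    ("F", {"cpu": 5, "gpu": 2}),
]
placement = earliest_finish_schedule(tasks, workers)
makespan = max(finish for _, finish in placement.values())
```

The sketch only sees the tasks it is given, which mirrors the ready-tasks view listed among the drawbacks, and its quality depends entirely on the accuracy of the cost estimates.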
FMM Parallelization (37)
Heteroprio
• Heteroprio [Agullo et al., 2015] (1)
  • Steady-state: execute tasks where they have the best acceleration factor
  • Critical-state: a worker executes a task only if doing so does not delay the hypothetical end of the execution
[Figure: a scheduler holding the ready tasks A to F, with separate CPU and GPU priority lists, serving the workers CPU0, CPU1 and GPU0]
(1) Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., and Takahashi, T. (2015). Task-based FMM for heterogeneous architectures. Concurrency and Computation: Practice and Experience.
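The steady-state rule can be illustrated with a small sketch: ready tasks are stored in one bucket per task type, and each worker type scans the buckets in its own priority order. This is a simplified illustration, not the StarPU or ScalFMM implementation; the class name, the priority orders and the task labels are hypothetical, and the critical-state rule is left out.

```python
from collections import deque


class HeteroprioSketch:
    """One FIFO bucket per task type, one bucket ordering per worker type."""

    def __init__(self, priorities):
        # priorities: {worker_type: [task_type, ...]}, most preferred first
        self.priorities = priorities
        self.buckets = {}
        for order in priorities.values():
            for task_type in order:
                self.buckets.setdefault(task_type, deque())

    def push(self, task_type, task):
        """Store a ready task in the bucket of its type."""
        self.buckets[task_type].append(task)

    def pop(self, worker_type):
        """Return (task_type, task) from the first non-empty bucket, else None."""
        for task_type in self.priorities[worker_type]:
            if self.buckets[task_type]:
                return task_type, self.buckets[task_type].popleft()
        return None


# Hypothetical orders: the GPU looks first at the kernels it accelerates best.
priorities = {
    "cpu": ["P2M", "M2M", "L2L", "L2P", "M2L", "P2P"],
    "gpu": ["P2P", "M2L"],
}
scheduler = HeteroprioSketch(priorities)
for task_type, task in [("P2P", "A"), ("M2L", "B"), ("P2M", "C"), ("P2P", "D")]:
    scheduler.push(task_type, task)

picked = [
    scheduler.pop("gpu"),
    scheduler.pop("cpu"),
    scheduler.pop("cpu"),
    scheduler.pop("gpu"),
    scheduler.pop("gpu"),
]
```

A pop costs at most one scan over a short list of buckets, and no per-task cost model is needed, only an ordering per worker type.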
FMM Particles Interaction Simulations (38)
Test Case
[Figure: test platform with 24 CPU cores and four GPUs (GPU 1 to GPU 4)]
• N = 30 million particles
• Spherical Expansion/Rotation Kernel
• Acc = $10^{-3}$, h = 7 and Granularity = 1500
FMM Particles Interaction Simulations (39)
Trace - Heterogeneous
Configuration        Execution time
0 GPU / 24 CPUs      15.5 s
1 GPU / 23 CPUs      13.4 s
2 GPUs / 22 CPUs     10.9 s
3 GPUs / 21 CPUs     9.4 s
4 GPUs / 20 CPUs     8.7 s
[Figures: execution traces, one per configuration. Legend: P2P, P2M, M2M, M2L, L2L, L2P and Idle]
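As a quick reading aid, the execution times reported for the heterogeneous traces can be turned into speedups over the CPU-only run. The short sketch below only performs that arithmetic; the variable names are illustrative.

```python
# Execution times (seconds) reported on the traces, indexed by the number of GPUs.
times = {0: 15.5, 1: 13.4, 2: 10.9, 3: 9.4, 4: 8.7}

# Speedup of each configuration relative to the CPU-only execution.
speedups = {gpus: times[0] / t for gpus, t in times.items()}

for gpus, value in speedups.items():
    print(f"{gpus} GPU(s): x{value:.2f}")
```

With four GPUs the run is roughly 1.8 times faster than the 24-core CPU-only execution.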
FMM Particles Interaction Simulations (40)
Test Case
[Figure: test platform with 24 CPU cores and four GPUs (GPU 1 to GPU 4)]
• N = 30 million particles
• Uniform/Lagrange kernel
• Acc = $\{10^{-5}, 10^{-7}\}$, h = 7 and Granularity = 1500
FMM Particles Interaction Simulations (41)
Trace - Heterogeneous (4 GPUs)
• Acc = $10^{-5}$: 7.9 s
• Acc = $10^{-7}$: 17 s
[Figures: execution traces, one per accuracy. Legend: P2P, P2M, M2M, M2L, L2L, L2P and Idle]
FMM Particles Interaction Simulations (42)
Test Cases
[Figure: 7 nodes (Node 1 to Node 7) with 24 cores each]
• N = 200 million particles
• Spherical Expansion/Rotation Kernel
• Acc = $10^{-3}$, h = 8 and Granularity = 2000
FMM Particles Interaction Simulations (43)
Trace - 7 nodes × 24 CPUs
[Figure: execution trace. Legend: P2P, P2M, M2M, M2L, L2L, L2P and Idle]
Summary (44)
Summary:
• Generic
• Kernel independent
• Architecture independent
• Performance portability
Additional contributions:
• Commutativity expression in FMM
• MPI/OpenMP implementation
All included in ScalFMM (C++/HPC library)
Outline (45)
• Problem Formulation
• BEM Solver (Matrix Approach)
• Fast-Multipole Method Approach
• FMM Algorithm & Parallelization
• FMM BEM Solver (Experimental Implementation)
• Conclusion & Perspectives
Propagation of the Current State to the Future (46)
Summation over the past states (illustrated with five matrices $M^1$ to $M^5$):
$s^n = \sum_{k=1}^{5} M^k \cdot a^{n-k}$, $\tilde{s}^n = l^n - s^n$, Linear Solver: $M^0 \cdot a^n = \tilde{s}^n$
Propagation to the future: once the linear solver has returned $a^n$, its contribution is added to the future sums:
$s^{n+k} \leftarrow s^{n+k} + M^k \cdot a^n$ for $k = 1, \dots, 5$
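The two orderings of the same computation can be compared on a toy problem. The sketch below is an illustration under strong assumptions: the matrices and incident vectors are made up, $M^0$ is taken diagonal so that the "linear solver" is a simple division, and the function names are hypothetical. One function applies the summation over past states at each step, the other pushes each new state $a^n$ into the future sums $s^{n+k}$.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def solve_diagonal(m0, rhs):
    """Stand-in for the linear solver; assumes M^0 is diagonal."""
    return [r / m0[i][i] for i, r in enumerate(rhs)]


def solve_by_past_summation(m, incident):
    """At step n, gather s^n = sum_k M^k a^(n-k), then solve M^0 a^n = l^n - s^n."""
    k_max = len(m) - 1
    states = []
    for n, l_n in enumerate(incident):
        s_n = [0.0] * len(l_n)
        for k in range(1, k_max + 1):
            if n - k >= 0:
                contribution = matvec(m[k], states[n - k])
                s_n = [s + c for s, c in zip(s_n, contribution)]
        states.append(solve_diagonal(m[0], [l - s for l, s in zip(l_n, s_n)]))
    return states


def solve_by_future_propagation(m, incident):
    """At step n, solve for a^n, then add M^k a^n to every future sum s^(n+k)."""
    k_max = len(m) - 1
    size = len(incident[0])
    future = [[0.0] * size for _ in range(len(incident) + k_max)]
    states = []
    for n, l_n in enumerate(incident):
        a_n = solve_diagonal(m[0], [l - s for l, s in zip(l_n, future[n])])
        states.append(a_n)
        for k in range(1, k_max + 1):
            contribution = matvec(m[k], a_n)
            future[n + k] = [s + c for s, c in zip(future[n + k], contribution)]
    return states


# Made-up data: K_max = 2, N = 2 unknowns, 4 time steps.
m = [
    [[2.0, 0.0], [0.0, 4.0]],    # M^0 (diagonal by assumption)
    [[0.5, 0.1], [0.0, 0.25]],   # M^1
    [[0.1, 0.0], [0.2, 0.1]],    # M^2
]
incident = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0], [1.0, 1.0]]

past = solve_by_past_summation(m, incident)
future_first = solve_by_future_propagation(m, incident)
```

Both functions return the same states up to rounding. The second form only needs the newest state at each step, which is what makes it possible to hand part of the propagation to another operator.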
With FMM (47)
[Figure: after the linear solve, $a^n$ contributes to $s^{n+1}$ and $s^{n+2}$ through the matrices $M^1$ and $M^2$, and to $s^{n+3}$, $s^{n+4}$ and $s^{n+5}$ through the FMM]
• Far interactions in time (between far elements in space) are computed by the FMM
• The spatial decomposition is given by the octree
Overview (48)
• The octree is built over a mesh (integration points)
• Interaction matrices between leaves
• Approximation/FMM
  • development in the time domain (TD)
  • multipole: what a cell emits to the outside
  • local: what a cell receives from the outside
  • operators in the frequency domain (FD) or in TD
  • accurate up to a chosen frequency
  • the TD results of the matrix approach $\neq$ those of the FMM
Figure: Complete unit sphere
Figure: Truncated unit sphere
Operators (Overview) (49)
• P2M
  • compute what is emitted by a leaf to the outside
• M2M/L2L
  • Extrapolation + time shift
• M2L
  • Convolution product in TD (term-by-term multiplication in FD)
• L2P
  • Integration
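The remark on M2L relies on a standard identity: a convolution in the time domain equals a term-by-term product in the frequency domain. The sketch below checks this on two short, made-up sequences with a naive discrete Fourier transform. It is a generic illustration of the identity, not the M2L operator of the solver, and all names are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (inverse includes the 1/n factor)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def convolve_td(signal, kernel):
    """Linear convolution computed directly in the time domain."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, g in enumerate(kernel):
            out[i + j] += s * g
    return out


def convolve_fd(signal, kernel):
    """Same convolution: zero-pad, transform, multiply term by term, transform back."""
    size = len(signal) + len(kernel) - 1
    padded_signal = list(signal) + [0.0] * (size - len(signal))
    padded_kernel = list(kernel) + [0.0] * (size - len(kernel))
    product = [a * b for a, b in zip(dft(padded_signal), dft(padded_kernel))]
    return [value.real for value in dft(product, inverse=True)]


# Made-up sequences.
signal = [1.0, 2.0, 0.0, -1.0]
kernel = [0.5, 0.25, 0.125]

td_result = convolve_td(signal, kernel)
fd_result = convolve_fd(signal, kernel)
```

With a fast transform instead of the naive one used here, the frequency-domain route costs less for long signals, at the price of transforming to and from the frequency domain.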
Results (50)
Cone-Sphere Test Cases
Case                                   C-927    C-4269    C-10012
Number of unknowns                     927      4269      10012
FMM tree height                        3        4         5
Number of leaves                       16       64        234
Number of $M^k$ matrices ($K_{max}$)   117      244       370
Number of $M^k$ matrices (leaves)      60       64        49
Number of time steps ($T$)             2033     4345      6647
Results (51)
Sequential Executions
TD vs. FD operators:
Stages                FMM (TD)    FMM (TD + FD-M2L)    FMM (FD)    Matrix approach
$M^k$ construction    76 s        76 s                 76 s        242 s
Solve                 58 122 s    53 241 s             97 861 s    7.8 s (*)
Total                 58 198 s    53 317 s             97 937 s    249.8 s
Execution time of the TD-FMM vs. the matrix approach to solve Case C-927 in double precision.
(*) Our optimized BEM solver.
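The totals of the sequential timings are the sum of the construction and solve stages, and the gap between the approaches is easier to see as a ratio. The sketch below redoes that arithmetic with the reported values; the dictionary layout and names are illustrative.

```python
# (M^k construction, solve) in seconds for Case C-927, as reported above.
timings = {
    "TD": (76, 58122),
    "TD + FD-M2L": (76, 53241),
    "FD": (76, 97861),
    "Matrix approach": (242, 7.8),
}

totals = {name: build + solve for name, (build, solve) in timings.items()}

# Fastest FMM variant, then its slowdown relative to the matrix approach.
fmm_totals = {name: t for name, t in totals.items() if name != "Matrix approach"}
best_fmm = min(fmm_totals, key=fmm_totals.get)
slowdown = totals[best_fmm] / totals["Matrix approach"]

print(best_fmm, round(slowdown, 1))
```

On this small sequential case the fastest FMM variant is the one with the FD M2L operator, and it remains about two hundred times slower than the matrix approach.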
Results (52)
Parallel Executions (FMM vs. Matrix Approach)
[Figure: three bar charts of time (s), split into matrix generation and solve, comparing FMM and Matrix Approach. Panels: C-927 (×3.8), C-4269 (×1), C-10012 (×1.4). Bar labels: 883 and 237; 10,080 and 9,426; 33,408 and 25,256]
The captions of the different cases show the overhead of the FMM TD-BEM against the matrix approach.
Summary (53)
Summary:
• Preliminary results
• Best configuration: TD + FD M2L
• Not competitive against the direct approach (maybe on larger test cases)
• Any improvement of the matrix creation will make the FMM less competitive
Additional contributions:
• Incomplete/4D FMM
• Sphere discretization/length APS signal
Conclusion & Perspectives (54)