High Performance Geo-Computing Group
GPU Acceleration on the 3D Elastic RTM Method
Lin Gan, Tsinghua University
May 8th, 2017, GTC 2017
About Tsinghua HPGC
• High Performance Geo-Computing Group
  – Interdisciplinary research group
  – High performance, high resolution geo-science acceleration
    [Figure labels: data computing, climate change, seismic modeling, high performance computing]
  – The most advanced HPC platforms
    • Multi-core CPU, many-core GPU & MIC
    • Reconfigurable data flow engines
      – Maxeler DFEs, IBM OpenPower, Intel Xeon+FPGA
    • Supercomputers
      – Tianhe-1A: 7168 CPU-GPU nodes, 4.7 PFlops Rpeak
      – Tianhe-2: 16,000 CPU-3MIC nodes, 54.9 PFlops Rpeak
      – Tsinghua Explore100: 740 CPU nodes, 4TFlops Rpeak
  – Cooperation and Sponsorship
GPU Acceleration on Elastic RTM
About This Work
• HPGC-SEP Summer Exchange Project
  – Advisors: Dr. Haohuan Fu, Dr. Robert Clapp, and Prof. Biondo Biondi
  – Special thanks to Gustavo Alves and Ettore Biondi
• Achievements on GPU
  – 10x speedup accelerating a 2D elastic RTM code over 24 CPU cores
  – Implementation of a 3D elastic RTM kernel with adjustable interfaces
  – 27x speedup accelerating the 3D RTM kernel over 24 CPU cores
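3D Elastic RTM Stencils
• State variables (data) and the attributes (model)
  – Data
    • Particle velocities: $V_x, V_y, V_z$
    • Normal stresses: $\sigma_{xx}, \sigma_{yy}, \sigma_{zz}$
    • Shear stresses: $\sigma_{xy}, \sigma_{xz}, \sigma_{yz}$
  – Model
    • Density: $\rho = \dfrac{\text{mass}}{\Delta x\,\Delta y\,\Delta z}$, in kg/m^3
    • Lambda: $\lambda = \rho\,\dfrac{\partial P}{\partial \rho}$, in GPa
    • Mu: $\mu = \dfrac{\text{Force}/\text{Area}}{\Delta x/\text{length}}$, in GPa
  [Figure: Forward links Model to Data; Adjoint links Data to Model]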
3D Elastic RTM Stencils
• Forward and Adjoint
  [Figure: Forward and Adjoint time loops between t=0 and t=Nt, each linking Model and Data, with a Memory block]
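3D Elastic RTM Stencils
• Wave Equations

$$
\begin{aligned}
\frac{\partial}{\partial t} V_x(\mathbf{x},t) &= \frac{1}{\rho(\mathbf{x})}\left[\frac{\partial}{\partial x}\sigma_{xx}(\mathbf{x},t) + \frac{\partial}{\partial y}\sigma_{xy}(\mathbf{x},t) + \frac{\partial}{\partial z}\sigma_{xz}(\mathbf{x},t) + S_x(\mathbf{x},t)\right] \\
\frac{\partial}{\partial t} V_y(\mathbf{x},t) &= \frac{1}{\rho(\mathbf{x})}\left[\frac{\partial}{\partial x}\sigma_{xy}(\mathbf{x},t) + \frac{\partial}{\partial y}\sigma_{yy}(\mathbf{x},t) + \frac{\partial}{\partial z}\sigma_{yz}(\mathbf{x},t) + S_y(\mathbf{x},t)\right] \\
\frac{\partial}{\partial t} V_z(\mathbf{x},t) &= \frac{1}{\rho(\mathbf{x})}\left[\frac{\partial}{\partial x}\sigma_{xz}(\mathbf{x},t) + \frac{\partial}{\partial y}\sigma_{yz}(\mathbf{x},t) + \frac{\partial}{\partial z}\sigma_{zz}(\mathbf{x},t) + S_z(\mathbf{x},t)\right] \\
\frac{\partial}{\partial t} \sigma_{xx}(\mathbf{x},t) &= \left[\lambda(\mathbf{x}) + 2\mu(\mathbf{x})\right]\frac{\partial}{\partial x}V_x(\mathbf{x},t) + \lambda(\mathbf{x})\left[\frac{\partial}{\partial y}V_y(\mathbf{x},t) + \frac{\partial}{\partial z}V_z(\mathbf{x},t)\right] + S_{xx}(\mathbf{x},t) \\
\frac{\partial}{\partial t} \sigma_{yy}(\mathbf{x},t) &= \left[\lambda(\mathbf{x}) + 2\mu(\mathbf{x})\right]\frac{\partial}{\partial y}V_y(\mathbf{x},t) + \lambda(\mathbf{x})\left[\frac{\partial}{\partial x}V_x(\mathbf{x},t) + \frac{\partial}{\partial z}V_z(\mathbf{x},t)\right] + S_{yy}(\mathbf{x},t) \\
\frac{\partial}{\partial t} \sigma_{zz}(\mathbf{x},t) &= \left[\lambda(\mathbf{x}) + 2\mu(\mathbf{x})\right]\frac{\partial}{\partial z}V_z(\mathbf{x},t) + \lambda(\mathbf{x})\left[\frac{\partial}{\partial x}V_x(\mathbf{x},t) + \frac{\partial}{\partial y}V_y(\mathbf{x},t)\right] + S_{zz}(\mathbf{x},t) \\
\frac{\partial}{\partial t} \sigma_{xy}(\mathbf{x},t) &= \mu(\mathbf{x})\left[\frac{\partial}{\partial y}V_x(\mathbf{x},t) + \frac{\partial}{\partial x}V_y(\mathbf{x},t)\right] + S_{xy}(\mathbf{x},t) \\
\frac{\partial}{\partial t} \sigma_{xz}(\mathbf{x},t) &= \mu(\mathbf{x})\left[\frac{\partial}{\partial z}V_x(\mathbf{x},t) + \frac{\partial}{\partial x}V_z(\mathbf{x},t)\right] + S_{xz}(\mathbf{x},t) \\
\frac{\partial}{\partial t} \sigma_{yz}(\mathbf{x},t) &= \mu(\mathbf{x})\left[\frac{\partial}{\partial z}V_y(\mathbf{x},t) + \frac{\partial}{\partial y}V_z(\mathbf{x},t)\right] + S_{yz}(\mathbf{x},t)
\end{aligned}
$$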
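As a rough illustration of the six stress equations on this slide, the sketch below evaluates their right-hand sides at a single grid point. The function name `stress_rates`, the `grad_v[i][j]` layout for the velocity gradients, and the omission of the source terms are assumptions made for this example; they are not taken from the original kernel.

```python
def stress_rates(grad_v, lam, mu):
    """Time derivatives of the stress tensor at one grid point.

    grad_v[i][j] holds dV_i/dx_j (i, j = 0, 1, 2 for x, y, z).
    lam and mu are the local Lame parameters.
    Source terms S_ij are left out of this sketch.
    """
    div_v = grad_v[0][0] + grad_v[1][1] + grad_v[2][2]
    rates = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(3):
            if i == j:
                # (lam + 2 mu) dV_i/dx_i + lam * (the other two diagonal terms)
                rates[i][j] = lam * div_v + 2 * mu * grad_v[i][i]
            else:
                # mu * (dV_i/dx_j + dV_j/dx_i)
                rates[i][j] = mu * (grad_v[i][j] + grad_v[j][i])
    return rates


if __name__ == "__main__":
    example = stress_rates([[1, 2, 0], [0, 3, 0], [0, 0, 4]], lam=2, mu=1)
    for row in example:
        print(row)
```

Writing the normal-stress rows as `lam * div_v + 2 * mu * grad_v[i][i]` is algebraically the same as the bracketed form in the equations and keeps one code path for all three diagonal terms.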
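3D Elastic RTM Stencils
• For time: 2nd-order F.D. approximation (forward updates)

$$
\begin{aligned}
V_x^{t+\frac{\Delta t}{2}}(\mathbf{x}) &= V_x^{t-\frac{\Delta t}{2}}(\mathbf{x}) + \frac{\Delta t}{\rho(\mathbf{x})}\left[\frac{\partial}{\partial x}\sigma_{xx}^{t}(\mathbf{x}) + \frac{\partial}{\partial y}\sigma_{xy}^{t}(\mathbf{x}) + \frac{\partial}{\partial z}\sigma_{xz}^{t}(\mathbf{x}) + S_x^{t}(\mathbf{x})\right] \\
V_y^{t+\frac{\Delta t}{2}}(\mathbf{x}) &= V_y^{t-\frac{\Delta t}{2}}(\mathbf{x}) + \frac{\Delta t}{\rho(\mathbf{x})}\left[\frac{\partial}{\partial x}\sigma_{xy}^{t}(\mathbf{x}) + \frac{\partial}{\partial y}\sigma_{yy}^{t}(\mathbf{x}) + \frac{\partial}{\partial z}\sigma_{yz}^{t}(\mathbf{x}) + S_y^{t}(\mathbf{x})\right] \\
V_z^{t+\frac{\Delta t}{2}}(\mathbf{x}) &= V_z^{t-\frac{\Delta t}{2}}(\mathbf{x}) + \frac{\Delta t}{\rho(\mathbf{x})}\left[\frac{\partial}{\partial x}\sigma_{xz}^{t}(\mathbf{x}) + \frac{\partial}{\partial y}\sigma_{yz}^{t}(\mathbf{x}) + \frac{\partial}{\partial z}\sigma_{zz}^{t}(\mathbf{x}) + S_z^{t}(\mathbf{x})\right]
\end{aligned}
$$

• Based on staggered grid
• For space: 10th-order F.D. approximation
  [Figure labels: Forward, Adjoint; Stencil spanning 4 or 5 points on one side and 5 or 4 on the other]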
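To make the 10th-order staggered-grid derivative concrete, here is a minimal sketch that builds the standard Taylor-series coefficients and applies them at a half grid point. The slides do not say which coefficients the kernel uses, so the Taylor-series choice and the helper names `staggered_coeffs` and `staggered_diff` are assumptions of this example.

```python
from fractions import Fraction


def staggered_coeffs(half_width):
    """Taylor-series weights for a first derivative on a staggered grid.

    half_width = 5 gives the 10th-order operator (10 points in total).
    """
    coeffs = []
    for k in range(1, half_width + 1):
        num = Fraction(1)
        den = Fraction(2 * k - 1)
        for j in range(1, half_width + 1):
            if j != k:
                num *= (2 * j - 1) ** 2
                den *= abs((2 * k - 1) ** 2 - (2 * j - 1) ** 2)
        coeffs.append((-1) ** (k + 1) * num / den)
    return coeffs


def staggered_diff(f, i, dx, coeffs):
    """Approximate df/dx at the half point i + 1/2.

    Uses samples f[i - M + 1] .. f[i + M], where M = len(coeffs):
    M - 1 points to the left of index i and M points to its right.
    """
    total = 0
    for k, c in enumerate(coeffs, start=1):
        total += c * (f[i + k] - f[i - k + 1])
    return total / dx


if __name__ == "__main__":
    c = staggered_coeffs(5)
    print([float(v) for v in c])
    line = [Fraction(3 * n) for n in range(12)]  # f(x) = 3x sampled with dx = 1
    print(staggered_diff(line, 5, 1, c))
```

With `half_width = 5` the operator centred between `i` and `i + 1` reads 4 points on one side of index `i` and 5 on the other, which is one way to read the "4 or 5 / 5 or 4" stencil extents noted on the slide.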
3D Elastic RTM Stencils
• GPU Optimizations
  – K40 GPU, (200*200*200) grid * 1000 time steps
    • Configuration of different block sizes and registers per block
    • Best: block size ← 20*20; max registers ← 56
    • Variable data into L1/shared memory, constant data into read-only cache
  – Dynamic Pointer Switch & Minimum Data Cubes
    • Only allocate data cubes covering three time steps
  [Figure: three data cubes per state variable (e.g. $\sigma_{yz}$, $V_x$) for t-1, t, t+1, held in memory; their labels rotate each step: (pre, cur, next) → (cur, next, pre) → (next, pre, cur)]
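The pointer switch can be sketched as follows: three buffers are allocated once, and after every step only the names `pre`, `cur`, and `nxt` are rotated, so no data is copied and no new memory is requested. The toy 1D scalar wave update, the function name `simulate`, and the `courant2` parameter are stand-ins chosen for this example; the actual kernel updates 3D velocity and stress cubes on the GPU.

```python
def simulate(nx, nt, courant2=0.25):
    """Run nt steps of a toy 1D wave update using only three buffers.

    Returns the buffer holding the latest time step and the list of the
    three buffers in allocation order.
    """
    buffers = [[0.0] * nx for _ in range(3)]  # allocated once, reused throughout
    pre, cur, nxt = buffers
    cur[nx // 2] = 1.0  # initial impulse in the middle of the grid

    for _ in range(nt):
        for i in range(1, nx - 1):
            lap = cur[i - 1] - 2.0 * cur[i] + cur[i + 1]
            nxt[i] = 2.0 * cur[i] - pre[i] + courant2 * lap
        # Pointer switch: the oldest buffer is recycled as the next output.
        pre, cur, nxt = cur, nxt, pre

    return cur, buffers


if __name__ == "__main__":
    latest, buffers = simulate(nx=11, nt=3)
    print([round(v, 4) for v in latest])
    print("latest step lives in buffer", [b is latest for b in buffers].index(True))
```

After three steps the names are back on the buffers they started on, which is why three cubes per state variable are enough for any number of time steps.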
3D Elastic RTM Stencils
• GPU Optimizations
  – Multiple GPUs
  [Figure: data cube with x, y, z axes split into subdomains; each subdomain has an Internal region bounded by halo layers 4 or 5 (5 or 4) grid points thick]
3D Elastic RTM Stencils
• GPU Optimizations
  – Multiple GPUs: GPU algorithm per stencil sweep
    For each subdomain:
    ① Calculate RTM stencil
    ② Update halo
    ③ Add source
    ④ Switch pointer
  [Figure: GPU 0, GPU 1, GPU 2, each holding an Internal region and halos; workflow of stencil computing followed by updating halo]
3D Elastic RTM Stencils
• GPU Optimizations
  – Multiple GPUs: GPU algorithm per stencil sweep (overlapping workflow)
    For each subdomain:
    ① Calculate halo RTM stencil
    ② Calculate internal RTM stencil
    ③ Update halo
    ④ Add source
    ⑤ Switch pointers
  [Figure: GPU 0, GPU 1, GPU 2, each holding an Internal region and halos; updating halo overlaps with the internal stencil computation]
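The ordering of the overlapping workflow can be sketched serially: each subdomain first computes the cells its neighbours need, then its internal cells, then the halos are refreshed, the source is added, and the buffers are swapped. This is only a 1D illustration under several assumptions: a toy 3-point stencil with halo width `H = 1` instead of the 4 or 5 points of the RTM stencil, Python lists standing in for GPU memory, a source placed away from subdomain edges, and hypothetical helper names throughout. On real hardware steps ② and ③ would run concurrently; here they run one after the other, so the sketch shows the data dependencies rather than the speedup.

```python
H = 1  # halo width (assumption for this toy stencil)


def stencil(cur, nxt, lo, hi):
    """Toy 3-point stencil over cells lo..hi-1, a stand-in for the RTM stencil."""
    for i in range(lo, hi):
        nxt[i] = cur[i - 1] + cur[i] + cur[i + 1]


def split(field, parts):
    """Cut a 1D field into equal subdomains, each padded with H halo cells."""
    n = len(field) // parts
    padded = [0] * H + list(field) + [0] * H
    return [padded[p * n:p * n + n + 2 * H] for p in range(parts)]


def gather(subdomains):
    """Drop the halos and join the owned cells back together."""
    return [v for sub in subdomains for v in sub[H:len(sub) - H]]


def sweep(cur, nxt, source=None):
    """One stencil sweep over all subdomains; returns the swapped buffers."""
    n = len(cur[0]) - 2 * H
    for c, x in zip(cur, nxt):              # 1. cells next to the halos
        stencil(c, x, H, 2 * H)
        stencil(c, x, n, n + H)
    for c, x in zip(cur, nxt):              # 2. internal cells
        stencil(c, x, 2 * H, n)
    for left, right in zip(nxt, nxt[1:]):   # 3. update halos from neighbours
        right[:H] = left[n:n + H]
        left[n + H:] = right[H:2 * H]
    if source is not None:                  # 4. add source (internal cell only)
        part, cell, amplitude = source
        nxt[part][H + cell] += amplitude
    return nxt, cur                         # 5. switch pointers


def single_domain(field, steps, source=None):
    """Reference run without decomposition, used to check the result."""
    cur = [0] * H + list(field) + [0] * H
    for _ in range(steps):
        nxt = [0] * len(cur)
        stencil(cur, nxt, H, len(cur) - H)
        if source is not None:
            cell, amplitude = source
            nxt[H + cell] += amplitude
        cur = nxt
    return cur[H:len(cur) - H]


if __name__ == "__main__":
    field = [0, 0, 1, 0, 0, 0, 0, 0]
    cur = split(field, 2)
    nxt = [[0] * len(s) for s in cur]
    for _ in range(3):
        cur, nxt = sweep(cur, nxt)
    print(gather(cur) == single_domain(field, 3))
```

Computing the edge cells first is what makes the overlap possible: once they are done, the halo transfer has everything it needs and no longer depends on the much larger internal update.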
3D Elastic RTM Stencils
• Validation and Performance
  – GPU cluster in SEP
    • 4 K40 GPUs compared against a 24-core CPU (OpenMP)
    • 200*200*200 grid, 1000 time steps (recorded every 100 steps)
  [Figure: wavefield snapshots, panels Vx and Vy]