Accelerating Reverse Time Migration application for seismic imaging with GPU architecture
Sergio Orlandini, Cristiano Calonaci, Luca Ferraro @CINECA
Nicola Bienati @ENI
GPU Technology Conference (GTC), San Jose (CA), April 4-7, 2016 (presented 5th April 2016)
Reverse Time Migration developers
@CINECA: Simone Campagna, Cristiano Calonaci, Marco Comparato, Massimiliano Culpo, Luca Ferraro, Roberto Gori, Chiara Latini, Sergio Orlandini, Stefano Tagliaventi
@ENI: Nicola Bienati, Jacopo Panizzardi
@NVIDIA: Paulius Micikevicius, Peng Wang
Reverse Time Migration algorithm
RTM applies the discretized acoustic wave equation to propagate waves through a given velocity model. It has 3 main parts:
1. Forward propagation of the solution of the wave equation to model the source wavefield.
2. Backward propagation, in reverse time, of the data recorded in the field.
3. Calculation of the imaging condition.
Both forward and backward propagation consist of a Finite Difference (FD) solution of the wave equation, which is:
- Compute intensive: cost grows roughly with the fourth power of the grid dimension (grid points times time steps)
- Memory demanding: requires domain decomposition
- Communication intensive: ghost cells are exchanged with first neighbours at every time step
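As a reference, a standard second-order-in-time formulation of the acoustic update and of the cross-correlation imaging condition is sketched below; the exact (e.g. TTI) scheme used in the production code is not spelled out in the slides.

```latex
% Acoustic wave equation with a point source at x_s (standard form)
\[
  \frac{\partial^2 p}{\partial t^2} = v^2 \nabla^2 p + s(t)\,\delta(\mathbf{x}-\mathbf{x}_s)
\]
% Leap-frog FD update in time, with a discrete Laplacian \nabla^2_h in space
\[
  p^{\,n+1}_{ijk} = 2\,p^{\,n}_{ijk} - p^{\,n-1}_{ijk}
                  + v_{ijk}^2\,\Delta t^2\,\big(\nabla^2_h p\big)^{n}_{ijk}
\]
% Cross-correlation imaging condition between source (S) and receiver (R) wavefields
\[
  I(\mathbf{x}) = \sum_{t} S(\mathbf{x},t)\,R(\mathbf{x},t)
\]
```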
RTM workflow
3 main tasks:
1. FD propagation: the second-order wave equation is solved with a finite difference approximation.
2. Exchange borders: at each time step the domain borders have to be exchanged between first neighbours (MPI communications).
3. Imaging, at the imaging frequency: write the source field in Forward, read the source field in Backward, compute the imaging condition in Backward.
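A minimal host-side sketch of how these three tasks combine in a per-shot loop is shown below. All names (Wavefield, fd_step, exchange_borders, save_snapshot, load_snapshot, imaging_condition) are hypothetical placeholders, not the actual API of the CINECA/ENI code.

```cpp
// Hedged sketch of the per-shot RTM driver loop (host side).
struct Wavefield;  // device-resident pressure field for one propagation

void fd_step(Wavefield& w);                      // task 1: FD propagation kernel(s)
void exchange_borders(Wavefield& w);             // task 2: MPI halo exchange
void save_snapshot(const Wavefield& w, int it);  // task 3 (forward): write source field
void load_snapshot(Wavefield& w, int it);        // task 3 (backward): read source field
void imaging_condition(float* image, const Wavefield& src, const Wavefield& rcv);

void rtm_shot(Wavefield& src, Wavefield& rcv, float* image,
              int nt, int imaging_freq)
{
    // Forward: model the source wavefield, saving snapshots at the imaging frequency
    for (int it = 0; it < nt; ++it) {
        fd_step(src);
        exchange_borders(src);
        if (it % imaging_freq == 0) save_snapshot(src, it);
    }
    // Backward: propagate the recorded data in reverse time and correlate
    for (int it = nt - 1; it >= 0; --it) {
        fd_step(rcv);
        exchange_borders(rcv);
        if (it % imaging_freq == 0) {
            load_snapshot(src, it);
            imaging_condition(image, src, rcv);
        }
    }
}
```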
Finite Difference propagation kernel performances
- Finite Difference propagation is the most compute-intensive task and the most relevant part of the RTM application.
- "Easily" ported to GPU.
- Speed-up vs CPU roughly 4x (1 node / 16 cores vs 1 node with 2 GPUs).
- NB: this concerns only the stand-alone FD kernel performance, not the entire RTM application.
[Charts: TTI kernel and Isotropic kernel throughput, single GPU, stencil radius = 4, 512x512x512 grid, 200 time steps]
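For illustration, a naive isotropic radius-4 FD kernel is sketched below: one thread per grid point, reading straight from global memory. The coefficients, grid layout and kernel name are assumptions; the production kernel additionally uses shared-memory tiling and a register pipeline along z, which are not shown.

```cuda
// Naive sketch of an isotropic radius-4 (8th-order in space) FD time step.
__constant__ float c[5];   // central + 4 one-sided Laplacian coefficients (illustrative)

__global__ void fd_step_kernel(const float* __restrict__ p_prev,
                               const float* __restrict__ p_cur,
                               float*       __restrict__ p_next,
                               const float* __restrict__ vel2dt2, // v^2 * dt^2 / dx^2
                               int nx, int ny, int nz)
{
    const int R = 4;
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    int iz = blockIdx.z * blockDim.z + threadIdx.z;
    // skip the halo region: it is filled by the border exchange
    if (ix < R || iy < R || iz < R || ix >= nx - R || iy >= ny - R || iz >= nz - R)
        return;

    long idx = (long)iz * nx * ny + (long)iy * nx + ix;
    long sx = 1, sy = nx, sz = (long)nx * ny;   // strides along x, y, z

    // 25-point Laplacian: central term (once per axis) plus 4 symmetric taps per axis
    float lap = 3.0f * c[0] * p_cur[idx];
    for (int r = 1; r <= R; ++r) {
        lap += c[r] * (p_cur[idx + r * sx] + p_cur[idx - r * sx]
                     + p_cur[idx + r * sy] + p_cur[idx - r * sy]
                     + p_cur[idx + r * sz] + p_cur[idx - r * sz]);
    }
    // leap-frog update: p^{n+1} = 2 p^n - p^{n-1} + v^2 dt^2 * Laplacian
    p_next[idx] = 2.0f * p_cur[idx] - p_prev[idx] + vel2dt2[idx] * lap;
}
```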
RTM-GPU workflow
Queue of imaging steps with a dedicated buffer for wavefield snapshots.
2 concurrent host threads (a sketch of this producer/consumer scheme is given below):
- Kernel thread [KT], device management: computes wavefields, manages FD kernel launches, exchanges borders, transfers wavefields D2H.
- Imaging thread [IT], host management: computes imaging, waits for an available snapshot then computes imaging, writes/reads to/from I/O, forks into a pool of threads for computing imaging.
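A minimal sketch of the two-host-thread scheme, assuming a bounded queue of pre-allocated snapshot buffers: the kernel thread produces snapshots after the D2H copy, the imaging thread consumes them. Names, types and the queue depth are illustrative assumptions, not the actual implementation.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Snapshot { int timestep; std::vector<float> field; };

class SnapshotQueue {
    std::queue<Snapshot> q_;
    std::mutex m_;
    std::condition_variable not_empty_, not_full_;
    const size_t capacity_ = 4;          // number of pre-allocated snapshot buffers (assumed)
public:
    void push(Snapshot s) {              // called by the kernel thread [KT]
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&]{ return q_.size() < capacity_; });
        q_.push(std::move(s));
        not_empty_.notify_one();
    }
    Snapshot pop() {                     // called by the imaging thread [IT]
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&]{ return !q_.empty(); });
        Snapshot s = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return s;
    }
};
```

In this scheme the kernel thread pushes a snapshot as soon as the D2H transfer completes, while the imaging thread pops it, optionally fans out to a pool of worker threads for the imaging computation, and performs the I/O; when the queue is full, the kernel thread blocks until a buffer is consumed.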
RTM-GPU workflow
Optimization of communications: FD propagation is split in 2 steps:
1. Propagate the halo regions that have to be exchanged.
2. Propagate the internal domain; while the internal domain is being propagated, concurrently exchange the halo regions.
Overlap communications with computation: communications are hidden behind the device computation (see the streams sketch below).
The exchange borders task becomes more complex: Device/Host copies, MPI, GPU-Direct, Peer-to-Peer.
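A hedged sketch of the overlap using two CUDA streams is shown below. Kernel names, launch configurations and the staging through host buffers are illustrative assumptions; the real code may instead use GPU-Direct or peer-to-peer copies, and the halo packing step is omitted here.

```cuda
#include <cuda_runtime.h>
#include <mpi.h>

__global__ void fd_halo_kernel(float* p, int nx, int ny, int nz);     // halo slabs only
__global__ void fd_interior_kernel(float* p, int nx, int ny, int nz); // interior points

void time_step_overlapped(float* d_p, float* d_halo, float* h_halo,
                          float* h_recv, float* d_recv, size_t halo_bytes,
                          int nx, int ny, int nz, int left, int right,
                          cudaStream_t halo_stream, cudaStream_t interior_stream)
{
    dim3 block(256);
    // 1) propagate only the halo slabs, then stage them to the host
    //    (assume d_halo holds the contiguous halo slabs; packing omitted)
    fd_halo_kernel<<<dim3(64), block, 0, halo_stream>>>(d_p, nx, ny, nz);
    cudaMemcpyAsync(h_halo, d_halo, halo_bytes, cudaMemcpyDeviceToHost, halo_stream);

    // 2) propagate the (much larger) interior concurrently on another stream
    fd_interior_kernel<<<dim3(1024), block, 0, interior_stream>>>(d_p, nx, ny, nz);

    // 3) once the halos are on the host, exchange them with the MPI neighbours;
    //    this overlaps with the interior kernel still running on the device
    cudaStreamSynchronize(halo_stream);
    MPI_Sendrecv(h_halo, (int)halo_bytes, MPI_BYTE, left,  0,
                 h_recv, (int)halo_bytes, MPI_BYTE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpyAsync(d_recv, h_recv, halo_bytes, cudaMemcpyHostToDevice, halo_stream);

    // 4) both streams must finish before the next time step
    cudaDeviceSynchronize();
}
```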
RTM-GPU performances: only FD kernels
Simulation details:
- Imaging disabled (no I/O operations): only the computational FD throughput is measured
- From 1 up to 5 nodes, 2 GPUs per node
- TTI simulation, prestack mode
- Throughput normalized on the number of nodes and devices
Considerations:
- Stand-alone kernel performance: same trend as the pure FD computational speed-up
- No overhead introduced inside the RTM application
- Complete overlap of halo communications behind computation
RTM-GPU performances: with imaging calculation
Simulation details:
- High-frequency imaging (heavily stressed I/O): FD + imaging computation
- From 1 up to 5 nodes, 2 GPUs per node
- TTI simulation, prestack mode
- Throughput normalized on the number of nodes and devices
Considerations:
- I/O bottleneck: the RTM speed-up depends on the I/O bandwidth, not on the GPU throughput
- Unstable and dropping RTM kernel performance due to the imbalance of I/O operations across nodes
RTM-GPU I/O bottleneck
I/O operations are complex processes: Forward writes and Backward reads (from the last step back to the first one). A sketch of this write/read path is given below.
Write phase:
1. Data are compressed with lossy compression (a multi-step process); the workload differs from process to process.
2. Compressed data are written to disk.
Read phase:
1. Read compressed data from disk.
2. Uncompress data for computing imaging.
Increasing the compression rate exploits the maximum I/O bandwidth and reduces the drop in RTM kernel performance.
NB: the maximum lossy compression used is the one with the maximum acceptable numerical error.
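A minimal sketch of the snapshot write/read path follows. The lossy compressor interface (lossy_compress / lossy_decompress), the file naming and the error parameter are hypothetical placeholders; the actual multi-step compression scheme is not described in the slides.

```cpp
#include <cstdio>
#include <vector>

std::vector<unsigned char> lossy_compress(const std::vector<float>& field,
                                          float max_error);                  // hypothetical
std::vector<float> lossy_decompress(const std::vector<unsigned char>& blob); // hypothetical

// Forward phase: compress then write one snapshot per imaging step
void write_snapshot(const std::vector<float>& field, int it, float max_error) {
    std::vector<unsigned char> blob = lossy_compress(field, max_error);
    char name[64];
    std::snprintf(name, sizeof(name), "snapshot_%06d.bin", it);   // naming is illustrative
    FILE* f = std::fopen(name, "wb");
    std::fwrite(blob.data(), 1, blob.size(), f);   // fewer bytes to write -> less I/O time
    std::fclose(f);
}

// Backward phase: read (in reverse time order) then decompress before imaging
std::vector<float> read_snapshot(int it) {
    char name[64];
    std::snprintf(name, sizeof(name), "snapshot_%06d.bin", it);
    FILE* f = std::fopen(name, "rb");
    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);
    std::vector<unsigned char> blob(size);
    std::fread(blob.data(), 1, blob.size(), f);
    std::fclose(f);
    return lossy_decompress(blob);
}
```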
RTM-GPU strong/weak points
Strong points:
- Fully exploits device throughput
- No overhead is introduced
- Communications are hidden behind device computation
- Overlap of CPU/GPU computation
Weak points:
- Complex scheme
- Sophisticated CPU/GPU synchronization
- More complex communications (MPI / H2D / P2P)
- Tricky tuning of CPU / GPU / communications and I/O
- Device throughput too high compared to the I/O bandwidth
- Limited GPU memory
Ways out of the GPU memory limitation:
1. Increase the domain decomposition
2. Reduce memory utilization using a tiny (16-bit) representation
3. Increase the order of the stencil
RTM-GPU scalability
RTM performance scaling on multiple nodes (TTI simulation). The standard scaling metrics are recalled below.
- Strong scaling: increase the number of nodes with a constant total domain size; node = 20 cores + 2 K20x.
- Weak scaling: increase both the number of nodes and the problem size (constant domain size per process); node = 16 cores + 2 K10.
2 types of simulation:
1. Imaging disabled (no I/O operations)
2. Imaging enabled
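For reference, the usual definitions behind these two tests (not explicitly spelled out in the slides) are:

```latex
% Strong scaling: fixed total problem size, N nodes
\[
  S_{\text{strong}}(N) = \frac{T(1)}{T(N)}, \qquad
  E_{\text{strong}}(N) = \frac{T(1)}{N\,T(N)}
\]
% Weak scaling: problem size grows with N (constant size per process)
\[
  E_{\text{weak}}(N) = \frac{T(1)}{T(N)}
\]
```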
RTM-GPU 16-bit representations
Used only for storing, not for computing, and only for velocity fields, not for wavefields: there are more velocity fields than wavefields (larger memory gain), and keeping the wavefields at full precision limits the numerical error.
16-bit options (generic fixed-point range/resolution formulas are recalled below):
1. Floating Point (FP) @16 bits:
   - FP16 IEEE-754 (aka half-float): 1 sign bit, 5 exponent bits, 10 stored mantissa bits (11 bits of precision with the implicit bit); range [6.1e-5 : 65504]
   - FP16 custom (bits and bias), normalized: 1 sign bit, 4 exponent bits, 11 stored mantissa bits; one more bit to the mantissa and one less to the exponent, with the bias changed to fit the normalized range; accuracy increases to the detriment of the dynamic range; range [-1 : 1]
2. Fixed Point (FX) @16 bits:
   - FX16 Q-2.17 (stores v²·Δt²/Δx²): range [-0.25 : 2.49e-1], resolution 7.2e-6
   - FX16 Q0.15 (all bits to the fractional part): range [-1 : 1], resolution 3.05e-5; data are normalized over a multidimensional normalization domain (1D/2D/3D)
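As a reference, with the usual signed Qm.f convention (one sign bit, m integer bits, f fractional bits) the representable range and resolution are:

```latex
% Generic signed Qm.f fixed-point format
\[
  x \in \left[-2^{m},\; 2^{m} - 2^{-f}\right], \qquad \text{resolution} = 2^{-f}
\]
% e.g. Q0.15: range [-1, 1 - 2^{-15}], resolution 2^{-15} \approx 3.05 \cdot 10^{-5}
```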
RTM-GPU 16-bit representations
- Least numerical error obtained with the normalized FX16 Q0.15 format.
- FX16 memory gain: roughly 50% for the velocity fields (it depends on the normalization domain).
- Best normalization domain: 2D, 4x4 tiles along the xy-plane.
FX16 usage (see the sketch below):
1. Pre-compute data: normalize the FP32 data over the normalization domain, convert the normalized FP32 to FX16, store the FX16 data and the normalization factors @32 bits.
2. Reading data: read the FX16 data and the normalization factors, convert FX16 to FP32, de-normalize the FP32 data.
FX16 performance loss: about 4%.
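A minimal sketch of the FX16 Q0.15 encode/decode path with per-tile normalization, assuming one 32-bit normalization factor per 4x4 tile along the xy-plane; the tile layout and function names are assumptions based on the slide, not the actual implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

const float Q15_SCALE = 32768.0f;          // 2^15

// Encode one tile: find its max magnitude, store it @32 bits, map samples to [-1, 1]
void encode_tile_q15(const float* tile, int n, int16_t* out, float* norm_factor)
{
    float maxabs = 0.0f;
    for (int i = 0; i < n; ++i) maxabs = std::max(maxabs, std::fabs(tile[i]));
    *norm_factor = (maxabs > 0.0f) ? maxabs : 1.0f;
    for (int i = 0; i < n; ++i) {
        float x = tile[i] / *norm_factor;                   // normalize to [-1, 1]
        float q = std::round(x * Q15_SCALE);                // quantize to Q0.15
        q = std::min(std::max(q, -32768.0f), 32767.0f);     // clamp to int16 range
        out[i] = static_cast<int16_t>(q);
    }
}

// Decode one tile: convert back to FP32 and de-normalize with the stored factor
void decode_tile_q15(const int16_t* in, int n, float norm_factor, float* tile)
{
    for (int i = 0; i < n; ++i)
        tile[i] = (static_cast<float>(in[i]) / Q15_SCALE) * norm_factor;
}
```

With one 32-bit factor per 4x4 tile this stores 16*2 + 4 = 36 bytes instead of 64 bytes of FP32, roughly in line with the quoted memory gain of about 50%.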