Applying OpenCL technology for seismic modeling using grid-characteristic methods
Andrey Ivanov, MIPT; Nikolay Khokhlov, MIPT
Quasilinear equations, inverse problems and their applications
Moscow Institute of Physics and Technology, Dolgoprudny, 12-15 Sept. 2016
Outline
- Mathematical model and numerical method
- Test conditions
- Description of the program
- Optimization
- Test results
  - Single GPU: speedup (compared to CPU), percentage of peak performance, performance (FLOPS)
  - Multiple GPUs: speedup (compared to a single GPU), speedup with GPUDirect
Mathematical model
Relation between velocity and deformation:
- Motion equation
- Hooke's law
Notation: ρ – density; λ, μ – Lame elastic parameters; v – velocity; T – stress tensor.
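The equation images on this slide did not survive extraction; a standard velocity-stress form of linear elasticity consistent with the notation above would be:

```latex
% Reconstruction (not taken verbatim from the slide): motion
% equation and the rate form of Hooke's law for an isotropic medium
\begin{align}
  \rho \,\frac{\partial \mathbf{v}}{\partial t}
    &= \nabla \cdot \mathbf{T}
    && \text{(motion equation)} \\
  \frac{\partial \mathbf{T}}{\partial t}
    &= \lambda \,(\nabla \cdot \mathbf{v})\,\mathbf{I}
     + \mu \left( \nabla \mathbf{v} + (\nabla \mathbf{v})^{\mathsf{T}} \right)
    && \text{(Hooke's law)}
\end{align}
```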
Numerical method
The unknowns are collected into the vector u = (v_x, v_y, T_xx, T_xy, T_yy)^T, which yields a hyperbolic problem; the 2D system is solved by directional splitting into 1D problems along x and y (reconstructed below).
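A reconstruction of the garbled formulas, following the usual grid-characteristic splitting scheme:

```latex
% 2D hyperbolic system and its directional splitting into two
% 1D sub-steps (reconstruction of the slide's garbled formulas)
\begin{gather}
  \frac{\partial \mathbf{u}}{\partial t}
    + \mathbf{A}_x \frac{\partial \mathbf{u}}{\partial x}
    + \mathbf{A}_y \frac{\partial \mathbf{u}}{\partial y} = 0,
  \qquad
  \mathbf{u} = (v_x,\, v_y,\, T_{xx},\, T_{xy},\, T_{yy})^{\mathsf{T}} \\
  \text{step X: }
  \frac{\partial \mathbf{u}}{\partial t}
    + \mathbf{A}_x \frac{\partial \mathbf{u}}{\partial x} = 0,
  \qquad
  \text{step Y: }
  \frac{\partial \mathbf{u}}{\partial t}
    + \mathbf{A}_y \frac{\partial \mathbf{u}}{\partial y} = 0
\end{gather}
```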
Test conditions
CPU — compiler: icc; options: -O2 -mavx -fopenmp (auto-vectorization)
GPU — compilers: nvcc, gcc; options: -O2 -use_fast_math
CPU properties: Intel Xeon E5-2697, 2.7 GHz
GPU properties (Cores = CUDA cores / streaming processors):

GPU                  Cores   Clock, MHz   GFLOPS (SP)   SP:DP   GFLOPS (DP)
GeForce GT 640         384          900           691      24            29
GeForce GTX 480        480         1401          1345       8           168
GeForce GTX 680       1536         1006          3090      24           129
GeForce GTX 760       1152          980          2258      24            94
GeForce GTX 780       2304          863          3977      24           166
GeForce GTX 780 Ti    2880          876          5046      24           210
GeForce GTX 980       2048         1126          4612      32           144
Tesla M2070            448         1150          1030       2           515
Tesla K40m            2880          745          4291       3          1430
Tesla K80             2496          562          2806       1.5        1870
Radeon HD 7950        1792          800          2867       4           717
Radeon R9 290         2560          947          4849       8           606
Test program
- Grid size: 4096 x 4096
- Time steps: 6500
- Data types: float, double
- Per grid node: 5 values (float or double)
- Occupied memory: 320 MB (float), 640 MB (double)
CPU version
- Single-precision and double-precision variants
- ~190 FLOP to recalculate one grid node
- The whole run performs about 18.8 TFLOP of work
- Single thread, single CPU core
- AVX instructions via auto-vectorization (a sketch follows)
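A hypothetical sketch of such a baseline, not the authors' code: a plain loop over contiguous arrays that icc can auto-vectorize with AVX when the pointers don't alias and the body is branch-free. The update formula is a simplified stand-in for the real ~190-FLOP grid-characteristic stencil.

```cuda
// Hypothetical single-threaded CPU baseline (X sweep).
// `u` holds one field of the grid in row-major order.
void step_x_cpu(const float* __restrict u, float* __restrict u_new,
                int nx, int ny, float c)
{
    for (int j = 0; j < ny; ++j)
        for (int i = 1; i < nx - 1; ++i) {
            int k = j * nx + i;
            // placeholder 1D characteristic-style update
            u_new[k] = u[k] - c * 0.5f * (u[k + 1] - u[k - 1]);
        }
}
```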
Optimization
- Array of structures (AoS) data layout
- Two grids kept on the GPU (double buffering)
- Block size: 16x16
A sketch of this baseline layout follows.
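A minimal sketch of the baseline layout under these assumptions (struct and function names are hypothetical):

```cuda
#include <cuda_runtime.h>

// Array-of-structures (AoS): the five values of a grid node
// sit together in one struct.
struct Node {
    float vx, vy;          // velocity components
    float txx, txy, tyy;   // stress tensor components
};

// Two full grids live on the GPU; each time step reads `cur`,
// writes `next`, and then the host swaps the pointers.
void alloc_grids(Node** cur, Node** next, int nx, int ny)
{
    cudaMalloc(cur,  (size_t)nx * ny * sizeof(Node));
    cudaMalloc(next, (size_t)nx * ny * sizeof(Node));
}
```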
Optimization
- Switch from array of structures to structure of arrays (AoS -> SoA)
- Coalesced memory access
- Use of GPU shared memory
- Reduced conditional branching
A sketch of the SoA version follows.
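A hedged sketch of the SoA version: each field gets its own contiguous array, so consecutive threads of a warp read consecutive addresses (coalesced), and a row tile is staged in shared memory so neighbour reads avoid global memory. Kernel and variable names are illustrative, not the authors'; the update is a placeholder for the real stencil over all five fields.

```cuda
// Structure-of-arrays (SoA): one contiguous array per field.
struct GridSoA {
    float *vx, *vy, *txx, *txy, *tyy;
};

// X-sweep sketch: a 256-thread block processes one row segment.
// nx is assumed to be a multiple of blockDim.x (4096 here).
__global__ void step_x(GridSoA cur, GridSoA next, int nx, float c)
{
    __shared__ float s_vx[256 + 2];                  // tile + 2 halo cells

    int i = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int j = blockIdx.y;                              // row
    int k = j * nx + i;

    s_vx[threadIdx.x + 1] = cur.vx[k];               // coalesced load
    if (threadIdx.x == 0)
        s_vx[0] = (i > 0) ? cur.vx[k - 1] : cur.vx[k];
    if (threadIdx.x == blockDim.x - 1)
        s_vx[blockDim.x + 1] = (i < nx - 1) ? cur.vx[k + 1] : cur.vx[k];
    __syncthreads();

    // simplified, branch-light update (placeholder for the real
    // ~190-FLOP grid-characteristic stencil)
    next.vx[k] = s_vx[threadIdx.x + 1]
               - c * 0.5f * (s_vx[threadIdx.x + 2] - s_vx[threadIdx.x]);
}
```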
Optimization
- Block size in step X: 256x1
- Block size in step Y: 16x16
The corresponding launch configuration is sketched below.
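In launch-configuration terms (building on the sketch above; `step_y` is an assumed kernel analogous to `step_x`):

```cuda
// Hypothetical driver for one time step with the tuned block sizes
// (nx = ny = 4096, both divisible by the block dimensions).
void time_step(GridSoA cur, GridSoA next, int nx, int ny, float c)
{
    dim3 blockX(256, 1), gridX(nx / 256, ny);         // X sweep: 256x1
    step_x<<<gridX, blockX>>>(cur, next, nx, c);

    dim3 blockY(16, 16), gridY(nx / 16, ny / 16);     // Y sweep: 16x16
    step_y<<<gridY, blockY>>>(next, cur, nx, ny, c);  // analogous Y kernel
}
```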
[Chart] Speedup of GPU implementation compared to CPU (Intel Xeon E5-2697), float + fast math: CUDA and OpenCL bars per device, from GeForce GT 640 to Radeon R9 290; x-axis: speedup, 0-60.
[Chart] Speedup of GPU implementation compared to CPU (Intel Xeon E5-2697), double: same devices, CUDA vs. OpenCL; x-axis: speedup, 0-50.
[Chart] Percentage of peak performance, float + fast math: CUDA vs. OpenCL per device; x-axis: 0-16 %.
[Chart] Percentage of peak performance, double: CUDA vs. OpenCL per device; x-axis: 0-35 %.
[Chart] Performance, float + fast math: CUDA vs. OpenCL per device; x-axis: 0-500 GFLOPS.
[Chart] Performance, double: CUDA vs. OpenCL per device; x-axis: 0-160 GFLOPS.
GPU parallelization
- Multiple GPUs: the grid is divided along the Y axis
- Adjacent sub-domains exchange boundary grid nodes between GPUs
- GPUDirect (CUDA only): data is exchanged over PCI Express directly between GPUs, bypassing the CPU
A sketch of the peer-to-peer exchange follows.
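A minimal sketch of the CUDA side of this exchange (device indices and buffer names are hypothetical): with peer access enabled, `cudaMemcpyPeer` moves the halo rows GPU-to-GPU over PCI Express without staging through host memory, which is the GPUDirect P2P path mentioned above.

```cuda
#include <cuda_runtime.h>

// Each GPU owns a horizontal slab of the grid. After a time step,
// neighbouring GPUs swap their boundary rows (5 fields per node,
// so row_bytes = nx * 5 * sizeof(float) in the float version).
void exchange_halo(int dev_a, float* d_send_a, float* d_recv_a,
                   int dev_b, float* d_send_b, float* d_recv_b,
                   size_t row_bytes)
{
    // Enabled once at startup so the copies below go GPU<->GPU
    // over PCIe (GPUDirect P2P), bypassing host memory:
    //   cudaSetDevice(dev_a); cudaDeviceEnablePeerAccess(dev_b, 0);
    //   cudaSetDevice(dev_b); cudaDeviceEnablePeerAccess(dev_a, 0);
    cudaMemcpyPeer(d_recv_b, dev_b, d_send_a, dev_a, row_bytes);
    cudaMemcpyPeer(d_recv_a, dev_a, d_send_b, dev_b, row_bytes);
}
```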
[Chart] Speedup vs. number of GPUs (1-8), float: one curve per device (Radeon R9 290, GeForce GTX 680, GTX 780 Ti, GTX 980, Tesla M2070, K40m, K80); y-axis: speedup, 0-7.
[Chart] Speedup vs. number of GPUs (1-8) with GPUDirect (all devices except Radeon R9 290), float; y-axis: speedup, 0-7.
[Chart] Speedup vs. number of GPUs (1-8), double: same devices; y-axis: speedup, 0-8.
[Chart] Speedup vs. number of GPUs (1-8) with GPUDirect (except Radeon R9 290), double; y-axis: speedup, 0-8.
Conclusion
- Speedup (single GPU compared with CPU):
  - single precision: up to 55x (GeForce GTX 780 Ti)
  - double precision: up to 44x (Tesla K80)
- Performance (single GPU):
  - single precision: up to 460 GFLOPS (GeForce GTX 780 Ti)
  - double precision: up to 138 GFLOPS (Tesla K80)
- Speedup (multiple GPUs compared with a single GPU):
  - single precision: up to 6.1x (Tesla K40m)
  - double precision: up to 7.1x (GeForce GTX 780 Ti)
- Additional speedup with GPUDirect:
  - single precision: 10 % on 8x GeForce GTX 780 Ti
  - double precision: 2.4 % on 8x GeForce GTX 780 Ti