DiamondTile Algorithm for High-Performance Wave Modeling
Vadim Levchenko, Anastasia Perepelkina
Keldysh Institute of Applied Mathematics RAS
GTC 2015
FLOPs and Bandwidth Performance Ratio
[Figure: memory bandwidth (GB/s, 10 to 1000) versus peak fp32 performance (TFLOP/s, 0.1 to 10), with guide lines at 1, 0.4, and 0.04 Bytes/Flops; points: nVidia Maxwell (2014-15), nVidia Kepler (2012-13), Intel CPU (2014), NEC SX (199x).]
RoofLine model
S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM, 52:65–76, 2009.
L. Barba, R. Yokota. How Will the Fast Multipole Method Fare in the Exascale Era?
Wave Modeling Specifics
$$\frac{\partial^2 F}{\partial t^2} = c^2\left(\frac{\partial^2 F}{\partial x^2} + \frac{\partial^2 F}{\partial y^2} + \frac{\partial^2 F}{\partial z^2}\right) \quad (+\,\text{BCs} + \text{ICs})$$
Finite difference along each axis:
$$\left.\frac{\partial^2 F}{\partial x^2}\right|_{x_0,y_0,z_0} = \frac{1}{\Delta x^2}\sum_{i=0}^{N_O/2} C_i\left(F|_{x_0+i\Delta x,\,y_0,\,z_0} + F|_{x_0-i\Delta x,\,y_0,\,z_0}\right)$$
$N_O = 2$ for $\partial^2 F/\partial t^2$; $N_O = 2, 4, 6, \ldots, 14$ for the coordinate axes.
[Figure: cone $x^2+y^2+z^2=c^2t^2$ over axes x, y, t, showing the domain of influence, the domain of dependence, the asynchronous domains, and the synchronization instant.]
Per one cell, per one time step calculation:
◮ $O = 1 + 3N_O$ FMA operations
◮ $D = 3 + 3N_O$ data
Operational intensity: $O/D \sim 1/2$ Flop/byte (naïve algorithm).
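The finite-difference formula above can be checked numerically. The sketch below is an illustration only, not the authors' code: it assumes the convention that the i = 0 term counts F(x0) twice, so the standard central-difference weights are rewritten as C_0 = -1, C_1 = 1 for N_O = 2 and C_0 = -5/4, C_1 = 4/3, C_2 = -1/12 for N_O = 4. The names `COEFFS` and `second_derivative` are hypothetical.

```python
import math

# Coefficients C_i in the assumed convention: the i = 0 term adds F(x0) twice,
# so C_0 is half of the usual central weight.
COEFFS = {
    2: [-1.0, 1.0],
    4: [-5.0 / 4.0, 4.0 / 3.0, -1.0 / 12.0],
}

def second_derivative(f, x0, dx, order):
    """Approximate f''(x0) with a symmetric stencil of the given order N_O."""
    total = 0.0
    for i, c_i in enumerate(COEFFS[order]):
        total += c_i * (f(x0 + i * dx) + f(x0 - i * dx))
    return total / dx**2

# For f = sin, the exact second derivative is -sin.
exact = -math.sin(1.0)
err2 = abs(second_derivative(math.sin, 1.0, 0.1, 2) - exact)
err4 = abs(second_derivative(math.sin, 1.0, 0.1, 4) - exact)
print(f"N_O = 2 error: {err2:.2e}")
print(f"N_O = 4 error: {err4:.2e}")
```

A wider stencil (larger N_O) lowers the truncation error but reads more neighbors per cell, which is why the data count D grows with N_O in the naïve estimate.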
Wave Modeling Specifics: the cross-shaped stencil fits into a diamond shape.
Wave equation modelling Computational Grid projection to (x–t)
Traditional stepwise evaluation order
Overlapping stencils increase operational intensity:
◮ $O = 1 + 3N_O$ FMA operations
◮ $D = 3$ data
Operational intensity: $O/D \sim (1 + N_O)$ Flop/byte
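The two operation and data counts from the slides can be compared side by side. This is a small sketch with hypothetical function names; the ratios are counted in FMA operations per data access and are deliberately not converted to Flop/byte, since that conversion depends on assumptions (flops per FMA, bytes per value) that the slides do not spell out.

```python
def fma_ops(order):
    """FMA operations per cell per time step: O = 1 + 3 * N_O."""
    return 1 + 3 * order

def data_naive(order):
    """Data accesses per cell when every stencil point is loaded: D = 3 + 3 * N_O."""
    return 3 + 3 * order

# With overlapping stencils reused between neighboring cells, D = 3.
DATA_STEPWISE = 3

ratios = {}
for order in (2, 4, 6, 14):
    ops = fma_ops(order)
    ratios[order] = (ops / data_naive(order), ops / DATA_STEPWISE)
    naive, stepwise = ratios[order]
    print(f"N_O = {order:2d}: naive O/D = {naive:.2f}, stepwise O/D = {stepwise:.2f}")
```

The naïve ratio stays below 1 for every order, while the stepwise ratio grows roughly linearly with N_O.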
RoofLine Model for Wave Equation on GPGPU
[Figure: roofline plot of performance (10^9 cells/sec, 10 to 1000) versus localization parameter (cell calculations/(data loads+stores), 0.1 to 10), with rooflines for TitanZ and GTX 970; marked: naive, the best of stepwise, CUDA FDTD3d results.]
LRnLA method
◮ Locality: take advantage of the memory subsystem hierarchy, from on-chip CPU cache up to disk and network
◮ Recursivity: application of the “divide et impera” strategy in any setting (computer architectures, numerical schemes, etc.)
◮ non-Locality: optimized for distributed computations
◮ Asynchrony: adaptable parallel computations at any level
Memory Subsystem Hierarchy for GPGPU and CPU
[Figure: data throughput (B/sec, 10^9 to 10^14) versus data set size (B, 1K to 1T) for GK110 (GTX Titan), Haswell (Xeon E5 v3), and GM204 (GTX 980); levels shown: regs, L1+sh or L1, L2, LLC, GDDR5, DDR4, SSD/PCIe, HDD.]
DiamondTile based algorithm construction Computational grid in x-y and x-t projections
DiamondTile based algorithm construction
The computational domain is subdivided into diamond-shaped tiles in x-y.
◮ Each diamond encloses the cross-shaped stencil
◮ All elements along the 3rd (z) axis are included
DiamondTile based algorithm construction
◮ Choose a DiamondTile on the first time step
◮ Plot the influence cone of the first tile
◮ Choose a shifted DiamondTile on another time step (Nt steps later)
◮ Plot the dependence cone of the last tile
◮ Find the intersection
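The construction steps above can be sketched in the x-t projection. The code below is a simplified illustration, not the authors' implementation. It assumes that both cones widen by `radius` cells per time step (the stencil half-width) and that the last tile is shifted by `radius * n_steps` cells; the slides say only that the tile is "shifted", so this shift value is an assumption chosen because it yields a slanted shape of constant width. The function name and all parameters are hypothetical.

```python
def cone_intersection(a, b, radius, n_steps, shift):
    """Per-step x-spans where the influence cone of the first tile [a, b]
    (at step 0) overlaps the dependence cone of the last tile
    [a + shift, b + shift] (at step n_steps)."""
    spans = []
    for n in range(n_steps + 1):
        back = n_steps - n  # steps remaining until the last tile
        lo = max(a - radius * n, a + shift - radius * back)
        hi = min(b + radius * n, b + shift + radius * back)
        spans.append((lo, hi))
    return spans

# First tile covers cells 10..13; the last tile is shifted left by radius * n_steps.
spans = cone_intersection(a=10, b=13, radius=1, n_steps=4, shift=-4)
for n, (lo, hi) in enumerate(spans):
    print(f"step {n}: cells {lo}..{hi}")
```

Under these assumptions every step has the same span width, and the span moves by `radius` cells per step, which gives the tilted prism shape in the x-t projection.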
DiamondTorre Algorithm shape
Understand Algorithm as a shape
◮ Stepwise
◮ Domain decomposition
◮ More operational intensity
◮ DiamondTorre
DiamondTorre Algorithm shape
◮ DiamondTorre tilt depends on stencil size
◮ Stencil width is determined by the order of approximation ($N_O$)
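Why a tilted shape is a valid evaluation order can be seen in a 1D analogue. The sketch below is not the DiamondTorre implementation: it assumes a 1D leapfrog wave scheme with N_O = 2 (stencil half-width 1), zero boundary values, and tiles of `width` cells that lean one cell per time step. Each tile is advanced through all time steps before the next tile starts, and the result is compared with the ordinary stepwise order. All names and parameter values are hypothetical.

```python
import math

def update(u, n, i, k):
    """Leapfrog update of cell i: level n + 1 from levels n and n - 1."""
    lap = u[n][i + 1] - 2.0 * u[n][i] + u[n][i - 1]
    u[n + 1][i] = 2.0 * u[n][i] - u[n - 1][i] + k * lap

def initial_levels(n_cells, n_steps):
    """Levels 0 and 1 hold a Gaussian pulse; boundary cells stay zero."""
    u = [[0.0] * n_cells for _ in range(n_steps + 2)]
    for i in range(1, n_cells - 1):
        u[0][i] = u[1][i] = math.exp(-((i - n_cells / 2) / 3.0) ** 2)
    return u

def run_stepwise(n_cells, n_steps, k):
    """Traditional order: finish a whole time level before starting the next."""
    u = initial_levels(n_cells, n_steps)
    for n in range(1, n_steps + 1):
        for i in range(1, n_cells - 1):
            update(u, n, i, k)
    return u

def run_skewed(n_cells, n_steps, k, width):
    """Tilted order: each tile runs through all time steps, leaning one cell
    to the left per step; tiles are processed from left to right."""
    u = initial_levels(n_cells, n_steps)
    num_tiles = (n_cells + n_steps) // width + 1
    for tile in range(num_tiles):
        for n in range(1, n_steps + 1):
            lo = tile * width - (n - 1)
            for i in range(max(lo, 1), min(lo + width, n_cells - 1)):
                update(u, n, i, k)
    return u

reference = run_stepwise(40, 12, 0.25)
tiled = run_skewed(40, 12, 0.25, width=4)
print("orders agree:", reference == tiled)
```

The tilt of one cell per step matches the stencil half-width, so every value a tile reads was produced either by the same tile at an earlier step or by a tile processed before it. A wider stencil would need a proportionally larger tilt.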
DiamondTorre Algorithm parameters
Performance depends on careful choice of algorithm parameters:
◮ Size of the DiamondTorre base: Diamond Tile Size, DTS
◮ Quantity of time layers: Nt
Operational Intensity ∼ DTS/(4-1/DTS) (for large Nt)
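The large-Nt estimate above is easy to tabulate. The snippet below evaluates DTS/(4-1/DTS) for the DTS values that appear on the roofline plots in this deck; the function name is hypothetical, and the formula is taken from the slide as an estimate rather than a measured value.

```python
def operational_intensity(dts):
    """Large-Nt estimate from the slide: DTS / (4 - 1/DTS)."""
    return dts / (4.0 - 1.0 / dts)

table = {dts: operational_intensity(dts) for dts in (1, 4, 7, 14, 20)}
for dts, value in table.items():
    print(f"DTS = {dts:2d}: intensity ~ {value:.2f}")
```

For large DTS the estimate approaches DTS/4, so the intensity grows almost linearly with the tile size.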
RoofLine Model for Wave Equation on GPGPU
[Figure: roofline plot of performance (10^9 cells/sec, 10 to 1000) versus localization parameter (cell calculations/(data loads+stores), 0.1 to 10), with rooflines for TitanZ and GTX 970; marked: naive, the best of stepwise, DiamondTile DTS=1, and DiamondTorre for various DTS (DTS = 1, 4, 7, 14, 20).]
DiamondTorre Algorithm with CUDA
In each tile, Nz elements (along the 3rd (z) axis) are processed asynchronously by CUDA threads in a block.
◮ First stage
◮ Second stage
Odd and even stages alternate, with synchronization after each stage.
DiamondTorre Algorithm with CUDA
At first, some cells remain on the first time step, while others are advanced through several time layers.
DiamondTorre Algorithm with CUDA
At the end, all data have been advanced to a given time step. This time step is determined by the DiamondTorre height.
RoofLine Model for Wave Equation on GPGPU
[Figure: roofline plot of performance (10^9 cells/sec, 10 to 1000) versus localization parameter (cell calculations/(data loads+stores), 0.1 to 10), with rooflines for TitanZ and GTX 970; marked: naive, the best of stepwise, CUDA FDTD3d results, DiamondTile DTS=1, and DiamondTorre for various DTS (DTS = 1, 4, 7, 14, 20).]
[Figure: calculation rate (Gcells/sec, 0 to 60) on GTX 750Ti, GTX 970, and TitanZ (1) for various scheme/algorithm parameters NO/DTS: 2/1, 4/1, 6/1, 8/1, 10/1, 12/1, 14/1, 6/1, 6/2, 4/1, 4/2, 4/3, 2/1, 2/2, 2/3, 2/4, 2/5, 2/6, 2/7.]
[Figure: calculation rate (Gcells/sec, 0.01 to 100) versus parallel level (warps, 0.01 to 1000) for TitanZ and GTX970; reference levels: FDTD3d TitanZ rate, FDTD3d GTX970 rate, FDTD3d CPU rate with -O3, FDTD3d CPU rate.]
Wave Modeling Applications FDTD simulation for electromagnetics (2nd and 4th order approximation, PML) (Zakirov A., Goryachev I.)
Wave Modeling Applications Gas Dynamics with RKDG scheme (Korneev B.)
Wave Modeling Applications FDTD simulation for elastic seismic media (Levander scheme, 4th order, PML, Thomsen anisotropy, TFSF source) (Levchenko V., Zakirov A., Perepelkina A., Ivanov A.)
Wave Modeling Applications Particle-in-cell plasma kinetics (Levchenko V., Perepelkina A., Goryachev I.)
Main Results and Conclusions
◮ New DiamondTile algorithms of the LRnLA family have been developed for wave modeling. The algorithms are efficient on the memory and parallelism models of CUDA GPGPUs.
◮ Unlike the traditional stepwise evaluation order, data dependencies are traced across many time steps. This increases operational intensity and makes higher calculation rates reachable.
◮ A performance of 50-60 billion cells/s is achieved on Titan, as well as on GTX970, in the wave modeling implementation.