www.bsc.es HPC Application Porting to CUDA at BSC Pau Farré, Marc Jordà GTC 2016 - San Jose
Agenda
● WARIS-Transport
  ○ Atmospheric volcanic ash transport simulation
  ○ Computer Applications department
● PELE
  ○ Protein-drug interaction simulation
  ○ Life Sciences department
WARIS-Transport Volcano ash dispersion simulation
Motivation
● VAAC: Volcanic Ash Advisory Centers
  ○ Monitoring volcanic eruptions
  ○ Helping airlines → redirect flights
● Forecast of atmospheric transport and deposition of volcanic ash
  ○ Meteorological models
Eruptions
● Eyjafjallajökull eruption (Iceland, 2010)
  ○ 48% of flights in Europe cancelled during a week (107,000 flights)
  ○ Over €1.3 billion in losses
  [Figures: ash extension map, airspace shutdown]
● Puyehue-Cordón Caulle eruption (Chile, 2011)
  ○ Multiple flights cancelled in
    ■ Chile
    ■ Argentina
    ■ South Africa
    ■ Australia
  [Figure: ash extension map]
Description
● Rectangular Cartesian grid (x, y, z)
● Factors controlling atmospheric transport:
  ○ Wind advection
  ○ Turbulent diffusion
  ○ Gravitational settling of particles
● General Advection-Diffusion-Reaction equation ⇒ custom Jacobi stencil
  [Figure: output stencil]
Algorithm
● Finite difference method: iterative process
● Main computation: Advection-Diffusion-Reaction
CUDA Implementation (I)
1. Advection-Diffusion-Reaction kernel
   ○ ~80% of CPU execution time (see the stencil sketch below)
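A minimal sketch of what a Jacobi-style advection-diffusion update kernel can look like in CUDA: each thread reads from the `in` grid and writes one cell of the `out` grid. The 7-point neighborhood, the coefficient names (`adv`, `diff`), and the grid layout are illustrative assumptions, not WARIS' actual stencil.

```cuda
// Jacobi-style ADR update: read-only input grid, write-only output grid.
__global__ void adr_stencil(const float* __restrict__ in,
                            float* __restrict__ out,
                            int nx, int ny, int nz,
                            float adv, float diff, float dt)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    // Skip halo/boundary cells
    if (x < 1 || y < 1 || z < 1 || x >= nx - 1 || y >= ny - 1 || z >= nz - 1)
        return;

    int idx = (z * ny + y) * nx + x;
    float c = in[idx];

    // Central differences: Laplacian for diffusion, x-gradient for advection
    float lap = in[idx - 1] + in[idx + 1]
              + in[idx - nx] + in[idx + nx]
              + in[idx - nx * ny] + in[idx + nx * ny]
              - 6.0f * c;
    float grad_x = (in[idx + 1] - in[idx - 1]) * 0.5f;

    out[idx] = c + dt * (diff * lap - adv * grad_x);
}
```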
CUDA Implementation (II)
1. Advection-Diffusion-Reaction kernel
2. Compute terminal velocity
   ○ Meteorological computations
CUDA Implementation (III)
1. Advection-Diffusion-Reaction kernel
2. Compute terminal velocity
3. Implement all non-I/O computations on the GPU
   ○ Minimize CPU ⇔ GPU copies
CUDA Implementation (IV)
1. Advection-Diffusion-Reaction kernel
2. Compute terminal velocity
3. Implement all non-I/O computations on the GPU
4. Different particle sizes are launched in different streams
Kernel Overlap
[Figures: kernel timelines for the Chile 2011 dataset at 0.25º (grid size 121x121x64) and 0.05º (grid size 601x601x64)]
● Some datasets are too small to fully occupy all SMs with only one kernel
● Concurrent kernel execution fully occupies all SMs (see the sketch below)
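A sketch of launching the ADR kernel for each particle-size bin in its own CUDA stream, so that kernels on small grids can run concurrently and fill the SMs. The bin count, the per-bin buffers, and the reuse of the `adr_stencil` kernel above are illustrative assumptions.

```cuda
#include <vector>
#include <cuda_runtime.h>

// One stream per particle-size bin; kernels in different streams may overlap
// when a single launch does not occupy all SMs (e.g. the small 0.25º grid).
void run_all_bins(float** d_in, float** d_out, int numBins,
                  int nx, int ny, int nz,
                  float adv, float diff, float dt)
{
    std::vector<cudaStream_t> streams(numBins);
    for (int b = 0; b < numBins; ++b)
        cudaStreamCreate(&streams[b]);

    dim3 block(32, 4, 4);
    dim3 grid((nx + block.x - 1) / block.x,
              (ny + block.y - 1) / block.y,
              (nz + block.z - 1) / block.z);

    for (int b = 0; b < numBins; ++b) {
        // Each bin has its own input/output buffers, so launches are independent.
        adr_stencil<<<grid, block, 0, streams[b]>>>(d_in[b], d_out[b],
                                                    nx, ny, nz, adv, diff, dt);
    }

    cudaDeviceSynchronize();
    for (int b = 0; b < numBins; ++b)
        cudaStreamDestroy(streams[b]);
}
```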
Results
● Implementations:
  ○ MPI + AVX
  ○ MPI + OpenMP + AVX
  ○ MIC (MPI + OpenMP + AVX)
  ○ MPI + CUDA (1 GPU/rank)
● Setup:
  ○ Chile 2011 dataset, 0.05º
  ○ MareNostrum supercomputer: 16 cores/node, 2x Intel MIC
  ○ GPU server: 4x NVIDIA Tesla K40
● 4 GPUs run as fast as 8 MareNostrum 3 nodes (128 cores)
PELE: Protein Energy Landscape Exploration Interactive Drug Design with Monte Carlo Simulations
PELE Vision
● Drug design is a costly process
● Design through interactive biomolecular simulations
  ○ Statistical approach → faster simulations
  ○ Visual analysis
● Computational power + human intuition
[Figure: PELE-GUI]
PELE: Protein Energy Landscape Exploration
Monte Carlo approach where each trial does:
● Perturbation
  ○ Protein shape + ligand position
● Relaxation
  ○ Further refinement to a more stable position (energy minimization)
● Acceptance test
  ○ If accepted, used as initial conformation for future trials
[Figure: perturbation and relaxation steps]
PELE Demo 17
PELE Energy Formula
Initial profiling → energy computation was the most time-consuming task
Execution time cost of energy terms:
● Bond energy: 1.27%
● Angle energy: 0.93%
● Dihedral energy: 2.13%
● Non-bonding interactions: 37.58% total
  ○ Electrostatic
  ○ Lennard-Jones
  ○ Solvent energy
● Update alphas: 27.96%
CUDA Implementation
● Update alphas (27.96%)
  ○ All-to-all atom interactions
  ○ No major issues
● Non-bonding terms (37.58%)
  ○ List of interactions (atom pairs)
    ■ Several cut-offs to reduce the number of interactions
  ○ CUDA implementation
    ■ New data structure for the interaction list on the GPU
    ■ With atomics: profiling showed high overheads
      ● Lack of double-precision atomics?
      ● High contention due to list order?
    ■ Without atomics (see the sketch below)
      ● Main kernel + custom reduction to aggregate results
      ● ~3x faster than the first approach
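A minimal sketch of the atomic-free approach: each thread handles one interaction pair and writes its energy into a per-pair partial array, which a separate reduction step then sums. The pair layout, the `double4` atom format (x, y, z, charge), and the simplified electrostatic-only formula are illustration assumptions, not PELE's actual force field.

```cuda
#include <cuda_runtime.h>

// Each thread computes one pair energy and stores a partial result;
// no atomicAdd on a shared accumulator.
struct Pair { int i, j; };

__global__ void nonbonding_partial(const Pair* __restrict__ pairs, int numPairs,
                                   const double4* __restrict__ atoms,
                                   double* __restrict__ partial)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= numPairs) return;

    double4 a = atoms[pairs[p].i];
    double4 b = atoms[pairs[p].j];
    double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    double r  = sqrt(dx * dx + dy * dy + dz * dz);

    // Simplified electrostatic term only; LJ and solvent terms omitted.
    partial[p] = a.w * b.w / r;
}
```

The `partial` array is then aggregated by a custom reduction kernel or a library reduction (a CUB example is sketched in the Conclusions).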
CUDA Implementation (II)
● Energy computations are performed multiple times in different parts of PELE
  [Figure: energy computations over time]
● Data must be kept coherent between CPU and GPU
● High code complexity
  ○ Porting everything in between involves a major refactoring
  [Figure: PELE call graph]
CPU/GPU data coherence
Explicit CPU ⇔ GPU copies
● Code is harder to follow and maintain
● Complex application:
  ○ Difficult to track which CPU code uses GPU results
  ○ Usage may depend on many conditions
● Programmers tend to be conservative
  ○ Always copy GPU results to the host after the kernel
    ■ If not used, a performance cost for no reason
Automatic CPU ⇔ GPU copies
● CUDA Unified Virtual Memory (UVM)
● Unified CPU & GPU data structures
  ○ Allocation pointers can be used both on the CPU and the GPU
  ○ The CUDA runtime manages the copies internally
● Custom std::allocator for std::vectors (see the sketch below)
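A sketch of a custom std::allocator that backs std::vector storage with CUDA managed (UVM) memory, so the same pointer is valid on both CPU and GPU. The class name and the `ManagedVector` alias are illustrative, not PELE's actual code.

```cuda
#include <cstddef>
#include <new>
#include <vector>
#include <cuda_runtime.h>

// std::allocator backed by cudaMallocManaged: one pointer for host and device.
template <typename T>
struct ManagedAllocator {
    using value_type = T;

    ManagedAllocator() = default;
    template <typename U> ManagedAllocator(const ManagedAllocator<U>&) {}

    T* allocate(std::size_t n) {
        void* ptr = nullptr;
        if (cudaMallocManaged(&ptr, n * sizeof(T)) != cudaSuccess)
            throw std::bad_alloc();
        return static_cast<T*>(ptr);
    }
    void deallocate(T* p, std::size_t) { cudaFree(p); }
};

template <typename T, typename U>
bool operator==(const ManagedAllocator<T>&, const ManagedAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const ManagedAllocator<T>&, const ManagedAllocator<U>&) { return false; }

// vec.data() can be passed directly to a kernel; the runtime migrates the pages.
using ManagedVector = std::vector<double, ManagedAllocator<double>>;
```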
UVM profiling
● 4KB copies are not large enough to reach maximum PCIe bandwidth
● Also, some unnecessary copies
  ○ The runtime has to be conservative because it doesn't always know what is input or output
  ○ Our use of streams and allocations attached to them was not optimal
Semi-automatic memory manager
UVM style
● It maintains pairs of allocations (CPU & GPU)
● DtoH copies are only performed when data is really needed on the CPU
  ○ A page-fault handler detects CPU accesses
● Each copy transfers the whole allocation at once
  ○ Better bandwidth
Before launching a kernel
● Call owner_GPU(void* host_ptr, access_type)
  ○ Access types: Read, Write, ReadWrite, FullWrite
  ○ Returns gpu_ptr
After the kernel launch
● Call owner_CPU(...) to notify the memory manager
● As said, copies are done lazily when needed (see the usage sketch below)
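A hypothetical usage sketch of the semi-automatic memory manager around a kernel launch. The `MemoryManager` interface, the `energy_kernel`, and the exact C++ signatures are assumptions; only the owner_GPU/owner_CPU protocol and the access types come from the slides.

```cuda
#include <cuda_runtime.h>

// Illustrative interface for the manager described above (declarations only).
enum class AccessType { Read, Write, ReadWrite, FullWrite };

struct MemoryManager {
    void* owner_GPU(void* host_ptr, AccessType access);  // returns the paired GPU pointer
    void  owner_CPU(void* host_ptr);                      // marks the data as GPU-owned
};

__global__ void energy_kernel(const double* atoms, double* energies, int n);

void compute_energies(MemoryManager& mm, double* h_atoms, double* h_energies, int n)
{
    // The manager copies HtoD only if the GPU replica is stale, then returns
    // the device pointer paired with each host allocation.
    double* d_atoms    = static_cast<double*>(mm.owner_GPU(h_atoms, AccessType::Read));
    double* d_energies = static_cast<double*>(mm.owner_GPU(h_energies, AccessType::FullWrite));

    energy_kernel<<<(n + 255) / 256, 256>>>(d_atoms, d_energies, n);

    // Notify the manager; the DtoH copy happens lazily when a later CPU
    // access to h_energies triggers the page-fault handler.
    mm.owner_CPU(h_energies);
}
```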
Performance comparison
[Figures: profiler timelines for UVM vs. the semi-automatic memory manager]
● The semi-automatic memory manager has better performance
  ○ Mainly because of better PCIe bandwidth
Results (I)
[Figure: measured speedups of 55x, 15.09x, and 5.29x]
Results (II)
[Figure: overall speedups of 2x and 2.4x, against an upper bound of 2.9x (Amdahl's law)]
● PELE acceleration is still ongoing
  ○ Non-bonding list generation
  ○ Computations in the perturbation step
  ○ Etc.
Conclusions
Conclusions
Acceleration of existing applications
● Some parts are accelerated while others are kept on the CPU
● Maintaining data coherence between CPU & GPU is complex
● We showed two examples:
  ○ WARIS-Transport
    ■ Simple enough to port most of the computations to the GPU and keep the data there
  ○ PELE
    ■ Complex app → use a manager to handle the copies
    ■ UVM is a great tool to automate the copies
    ■ We implemented a semi-automatic memory manager to improve performance
Atomics might have a large performance impact
● Store partial results and apply a reduction step after the kernel
● Libraries can help with reductions (see the sketch below)
  ○ CUB, Modern GPU, etc.
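A minimal sketch of aggregating per-thread partial results with CUB's DeviceReduce::Sum instead of atomics (two-phase call: first query the temporary storage size, then run the reduction). The function and variable names are ours.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Sum a device array of partial results with a library reduction.
double sum_partials(const double* d_partial, int numItems)
{
    double* d_sum = nullptr;
    cudaMalloc(&d_sum, sizeof(double));

    void*  d_temp     = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_partial, d_sum, numItems);  // size query
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_partial, d_sum, numItems);  // reduction

    double h_sum = 0.0;
    cudaMemcpy(&h_sum, d_sum, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_temp);
    cudaFree(d_sum);
    return h_sum;
}
```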
www.bsc.es Thank you! For further information please contact pau.farre@bsc.es marc.jorda@bsc.es