
  1. www.bsc.es HPC Application Porting to CUDA at BSC Pau Farré, Marc Jordà GTC 2016 - San Jose

  2. Agenda
     ● WARIS-Transport
        ○ Atmospheric volcanic ash transport simulation
        ○ Computer Applications department
     ● PELE
        ○ Protein-drug interaction simulation
        ○ Life Sciences department

  3. WARIS-Transport: Volcanic ash dispersion simulation

  4. Motivation
     ● VAAC: Volcanic Ash Advisory Centers
        ○ Monitoring volcanic eruptions
        ○ Helping airlines → redirecting flights
     ● Forecast of atmospheric transport and deposition of volcanic ash
        ○ Meteorological models

  5. Eruptions
     ● Eyjafjallajökull eruption (Iceland, 2010)
        ○ 48% of flights in Europe cancelled during one week (107,000 flights)
        ○ Over €1.3 billion in losses
        [Figures: ash extension map, airspace shutdown]
     ● Puyehue-Cordón Caulle eruption (Chile, 2011)
        ○ Multiple flights cancelled in:
           ■ Chile
           ■ Argentina
           ■ South Africa
           ■ Australia
        [Figure: ash extension map]

  6. Description
     ● Rectangular Cartesian grid (x, y, z)
     ● Factors controlling atmospheric transport:
        ○ Wind advection
        ○ Turbulent diffusion
        ○ Gravitational settling of particles
     ● General Advection-Diffusion-Reaction equation ⇒ custom Jacobi stencil (sketched below)
     [Figure: output stencil]
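A rough CUDA sketch of the idea above: a 7-point, Jacobi-style update (read-only input grid, write-only output grid) with central-difference advection, diffusion, and a reaction term. The coefficients, the x-fastest array layout, and the interior-only guard are assumptions for illustration; WARIS-Transport's actual operator (including gravitational settling and boundary handling) is more involved.

```cuda
// Illustrative 7-point advection-diffusion-reaction Jacobi step (not WARIS code).
#include <cuda_runtime.h>

__global__ void adrStep(const float* __restrict__ in, float* __restrict__ out,
                        int nx, int ny, int nz,
                        float ax, float ay, float az,      // advection coeffs (u*dt / 2dx)
                        float dx2, float dy2, float dz2,   // diffusion coeffs (K*dt / dx^2)
                        float react)                       // reaction/decay coefficient
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;

    // Interior points only; boundary cells would be handled separately.
    if (x < 1 || y < 1 || z < 1 || x > nx - 2 || y > ny - 2 || z > nz - 2)
        return;

    int i  = (z * ny + y) * nx + x;        // x-fastest layout
    int sx = 1, sy = nx, sz = nx * ny;     // neighbour strides in x, y, z
    float c = in[i];

    float adv  = ax * (in[i + sx] - in[i - sx])               // central-difference advection
               + ay * (in[i + sy] - in[i - sy])
               + az * (in[i + sz] - in[i - sz]);
    float diff = dx2 * (in[i + sx] - 2.0f * c + in[i - sx])   // 7-point diffusion stencil
               + dy2 * (in[i + sy] - 2.0f * c + in[i - sy])
               + dz2 * (in[i + sz] - 2.0f * c + in[i - sz]);

    out[i] = c - adv + diff + react * c;   // Jacobi: never writes into `in`
}
```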

  7. Algorithm
     ● Finite difference method: iterative process
     ● Main computation: Advection-Diffusion-Reaction

  8. CUDA Implementation (I)
     1. Advection-Diffusion-Reaction kernel
        ○ ~80% of CPU execution time

  9. CUDA Implementation (II)
     1. Advection-Diffusion-Reaction kernel
     2. Compute terminal velocity
        ○ Meteorological computations

  10. CUDA Implementation (III)
      1. Advection-Diffusion-Reaction kernel
      2. Compute terminal velocity
      3. Implement all non-I/O computations on the GPU
         ○ Minimize CPU ⇔ GPU copies

  11. CUDA Implementation (IV)
      1. Advection-Diffusion-Reaction kernel
      2. Compute terminal velocity
      3. Implement all non-I/O computations on the GPU
      4. Different particle sizes are launched in different streams

  12. Kernel Overlap
      ● Some datasets are too small to fully occupy all SMs with only one kernel
      ● Parallel kernel execution to fully occupy all SMs (see the sketch below)
      [Figures: Chile 2011 dataset at 0.25º (grid size 121x121x64) and at 0.05º (grid size 601x601x64)]
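To make the overlap concrete, here is a minimal host-side sketch (not the WARIS-Transport driver) that issues the kernel for each particle size bin into its own CUDA stream, so independent kernels can run concurrently when a single small grid cannot occupy all SMs. It reuses the adrStep sketch from slide 6; the bin count, launch geometry, and array-of-pointers layout are assumptions.

```cuda
// One stream per particle-size bin → independent kernels may overlap on the GPU.
#include <cuda_runtime.h>
#include <vector>

void runAllBins(float** d_in, float** d_out, int numBins,
                int nx, int ny, int nz,
                float ax, float ay, float az,
                float dx2, float dy2, float dz2, float react)
{
    std::vector<cudaStream_t> streams(numBins);
    for (int b = 0; b < numBins; ++b)
        cudaStreamCreate(&streams[b]);

    dim3 block(32, 4, 4);
    dim3 grid((nx + block.x - 1) / block.x,
              (ny + block.y - 1) / block.y,
              (nz + block.z - 1) / block.z);

    // Each bin's grid is independent, so its kernel goes to its own stream.
    for (int b = 0; b < numBins; ++b)
        adrStep<<<grid, block, 0, streams[b]>>>(d_in[b], d_out[b], nx, ny, nz,
                                                ax, ay, az, dx2, dy2, dz2, react);

    for (int b = 0; b < numBins; ++b) {
        cudaStreamSynchronize(streams[b]);
        cudaStreamDestroy(streams[b]);
    }
}
```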

  13. Results
      ● Implementations:
         ○ MPI + AVX
         ○ MPI + OpenMP + AVX
         ○ MIC (MPI + OpenMP + AVX)
         ○ MPI + CUDA (1 GPU/rank)
      ● Chile 2011 dataset, 0.05º
      ● MareNostrum supercomputer
         ○ 16 cores/node
         ○ 2x Intel MIC
      ● GPU server
         ○ 4x NVIDIA Tesla K40
      ● 4 GPUs run as fast as 8 MareNostrum 3 nodes (128 cores)

  14. PELE: Protein Energy Landscape Exploration Interactive Drug Design with Monte Carlo Simulations

  15. PELE Vision
      ● Drug design is a costly process
      ● Design through interactive biomolecular simulations
         ○ Statistical approach → faster simulations
         ○ Visual analysis
      ● Computational power + human intuition
      [Figure: PELE-GUI]

  16. PELE: Protein Energy Landscape Exploration
      Monte Carlo approach where each trial does:
      ● Perturbation
         ○ Protein shape + ligand position
      ● Relaxation
         ○ Further refinement to a more stable position (energy minimization)
      ● Acceptance test
         ○ If accepted, used as the initial conformation for future trials
      [Figures: perturbation, relaxation]
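For readers new to the scheme, a toy, host-only sketch of the trial structure above (perturbation → relaxation → Metropolis-style acceptance). The one-dimensional Conformation, the quadratic energy landscape, the step sizes, and kT are invented for illustration; PELE's real perturbation, minimization, and force field are far more elaborate.

```cpp
// Toy Monte Carlo trial loop in the spirit of the slide (not PELE code).
#include <cmath>
#include <random>

struct Conformation { double x = 0.0; double e = 0.0; };   // stand-in for protein + ligand state

int main() {
    std::mt19937 rng(42);
    std::normal_distribution<double> step(0.0, 0.5);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    const double kT = 0.6;                                   // assumed acceptance temperature

    auto energy = [](const Conformation& c) { return c.x * c.x; };  // toy energy landscape

    Conformation current;                                    // initial conformation
    current.e = energy(current);

    for (int trial = 0; trial < 1000; ++trial) {
        // 1) Perturbation: move the structure (toy: random displacement).
        Conformation proposal = current;
        proposal.x += step(rng);

        // 2) Relaxation: refine toward a more stable position (toy: one descent step).
        proposal.x -= 0.2 * proposal.x;                      // 0.1 * gradient of x^2
        proposal.e = energy(proposal);

        // 3) Acceptance test: accepted conformations seed future trials.
        double dE = proposal.e - current.e;
        if (dE <= 0.0 || uni(rng) < std::exp(-dE / kT))
            current = proposal;
    }
    return 0;
}
```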

  17. PELE Demo

  18. PELE Energy Formula
      Initial profiling → energy computation was the most time-consuming task.
      Execution time cost of energy terms:
      ● Bond energy: 1.27%
      ● Angle energy: 0.93%
      ● Dihedral energy: 2.13%
      ● Non-bonding interactions (electrostatic, Lennard-Jones, solvent energy): 37.58%
      ● Update alphas: 27.96%
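Written out, the terms listed above amount to a force-field decomposition of roughly this shape (a sketch of the structure only; the slide does not give the exact functional forms, and the percentages are the execution-time shares quoted above):

```latex
E_{\text{total}} \;\approx\;
  \underbrace{E_{\text{bond}} + E_{\text{angle}} + E_{\text{dihedral}}}_{\text{bonded terms, } \approx 4.3\% \text{ of exec.\ time}}
  \;+\;
  \underbrace{E_{\text{elec}} + E_{\text{LJ}} + E_{\text{solvent}}}_{\text{non-bonding interactions, } 37.58\%}
```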


  20. CUDA Implementation
      ● Update Alphas (27.96%)
         ○ All-to-all atom interactions
         ○ No major issues
      ● Non-bonding Terms (37.58%)
         ○ List of interactions (atom pairs)
            ■ Several cut-offs to reduce the number of interactions
         ○ CUDA implementation
            ■ New data structure for the interaction list on the GPU
            ■ With atomics: profiling showed high overheads (lack of double-precision atomics? high contention due to the list order?)
            ■ Without atomics: main kernel + custom reduction to aggregate results (sketched below); ~3x faster than the first approach
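A rough sketch of the "without atomics" variant: every thread writes one partial energy and a CUB device-wide sum aggregates them, avoiding double-precision atomicAdd (which had no hardware support on Kepler-class GPUs such as the K40). The Interaction struct, the pairEnergy() placeholder, and the buffer handling are assumptions, not PELE's actual data structures.

```cuda
// Partial results + CUB reduction instead of double-precision atomics (illustrative).
#include <cub/cub.cuh>
#include <cuda_runtime.h>

struct Interaction { int atomA, atomB; };        // entry of the interaction (pair) list

__device__ double pairEnergy(Interaction it)
{
    // Placeholder; the real kernel evaluates electrostatic, Lennard-Jones
    // and solvent contributions for the atom pair.
    return 1.0 / (2.0 + it.atomA + it.atomB);
}

__global__ void nonBondingPartials(const Interaction* pairs, int numPairs, double* partial)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPairs)
        partial[i] = pairEnergy(pairs[i]);       // one slot per interaction: no contention
}

double nonBondingEnergy(const Interaction* d_pairs, int numPairs,
                        double* d_partial, double* d_total)
{
    int threads = 256;
    int blocks  = (numPairs + threads - 1) / threads;
    nonBondingPartials<<<blocks, threads>>>(d_pairs, numPairs, d_partial);

    // CUB two-phase call: first query temporary storage size, then reduce.
    void*  d_temp   = nullptr;
    size_t tempSize = 0;
    cub::DeviceReduce::Sum(d_temp, tempSize, d_partial, d_total, numPairs);
    cudaMalloc(&d_temp, tempSize);
    cub::DeviceReduce::Sum(d_temp, tempSize, d_partial, d_total, numPairs);

    double h_total = 0.0;
    cudaMemcpy(&h_total, d_total, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_temp);
    return h_total;
}
```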

  21. CUDA Implementation (II)
      ● Energy computations are performed multiple times in different parts of PELE
      ● Data must be kept coherent between CPU and GPU
      ● High code complexity
         ○ Porting everything in between would involve a major refactoring
      [Figures: energy computations over time, PELE call graph]

  22. CPU/GPU data coherence
      ● Explicit CPU ⇔ GPU copies
         ○ Code is harder to follow and maintain
         ○ Complex application:
            ■ Difficult to track which CPU code uses GPU results
            ■ Usage may depend on many conditions
         ○ Programmers tend to be conservative
            ■ Always copy GPU results to the host after the kernel
            ■ If not used, a performance cost for no reason
      ● Automatic CPU ⇔ GPU copies
         ○ CUDA Unified Virtual Memory (UVM)
         ○ Unified CPU & GPU data structures
            ■ Allocation pointers can be used both on the CPU and the GPU
            ■ The CUDA runtime manages the copies internally
         ○ Custom std::allocator for std::vectors (sketched below)
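A minimal sketch of the "custom std::allocator" idea: backing std::vector storage with cudaMallocManaged so the same pointer is valid on both host and device and the CUDA runtime migrates the data. The managed_allocator and managed_vector names are illustrative, not PELE's.

```cuda
// std::vector backed by CUDA managed (UVM) memory (illustrative sketch).
#include <cuda_runtime.h>
#include <cstddef>
#include <new>
#include <vector>

template <typename T>
struct managed_allocator {
    using value_type = T;

    managed_allocator() = default;
    template <typename U>
    managed_allocator(const managed_allocator<U>&) {}        // needed for rebinding

    T* allocate(std::size_t n) {
        void* p = nullptr;
        // Single allocation usable from CPU and GPU; the runtime handles migration.
        if (cudaMallocManaged(&p, n * sizeof(T)) != cudaSuccess)
            throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) noexcept { cudaFree(p); }
};

template <typename T, typename U>
bool operator==(const managed_allocator<T>&, const managed_allocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const managed_allocator<T>&, const managed_allocator<U>&) { return false; }

// A vector whose data() pointer can be passed directly to a kernel launch.
template <typename T>
using managed_vector = std::vector<T, managed_allocator<T>>;
```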

  23. UVM profiling
      ● 4 KB copies are not large enough to reach maximum PCIe bandwidth
      ● Also, some unnecessary copies
         ○ The runtime has to be conservative because it doesn't always know what is input and what is output
         ○ Our use of streams and allocations attached to them was not optimal (see the sketch below)
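The last point refers to attaching managed allocations to streams. A minimal sketch of that mechanism, with invented names and sizes: with cudaMemAttachSingle the runtime only needs to account for the buffer around work in the given stream, instead of treating every launch as a potential consumer.

```cuda
// Attaching a managed allocation to one stream to limit UVM bookkeeping (illustrative).
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;
    float* data = nullptr;
    cudaMallocManaged(&data, bytes);                 // default: visible to all streams

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Restrict the managed buffer to this stream's work.
    cudaStreamAttachMemAsync(stream, data, 0, cudaMemAttachSingle);
    cudaStreamSynchronize(stream);                   // attachment takes effect after a sync

    // ... launch kernels that use `data` in `stream` ...

    cudaStreamDestroy(stream);
    cudaFree(data);
    return 0;
}
```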

  24. Semi-automatic memory manager
      ● UVM style
         ○ Maintains pairs of allocations (CPU & GPU)
         ○ DtoH copies are only performed when the data is really needed on the CPU
            ■ A page-fault handler detects CPU accesses
         ○ Copies the whole allocation at once → better bandwidth
      ● Before launching a kernel
         ○ Call owner_GPU(void* host_ptr, access_type)
            ■ Access types: Read, Write, ReadWrite, FullWrite
            ■ Returns gpu_ptr
      ● After the kernel launch
         ○ Call owner_CPU(...) to notify the memory manager
         ○ As noted, copies are done lazily when needed (a toy sketch follows below)
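To illustrate the call pattern, a deliberately simplified toy version of the owner_GPU/owner_CPU interface described above: it keeps CPU/GPU allocation pairs in a map and copies eagerly, whereas the real manager defers the device-to-host copy until its page-fault handler sees a CPU access. Only the owner_GPU/owner_CPU names and the access types come from the slide; everything else is invented.

```cuda
// Toy pair-of-allocations memory manager in the spirit of the slide (not the real one).
#include <cuda_runtime.h>
#include <cstddef>
#include <map>

enum class AccessType { Read, Write, ReadWrite, FullWrite };

struct Pair { void* gpu; size_t bytes; bool gpuDirty; };
static std::map<void*, Pair> g_pairs;        // host_ptr -> (gpu_ptr, size, state)

void registerAllocation(void* host_ptr, size_t bytes) {
    void* gpu = nullptr;
    cudaMalloc(&gpu, bytes);
    g_pairs[host_ptr] = {gpu, bytes, false};
}

void* owner_GPU(void* host_ptr, AccessType access) {
    Pair& p = g_pairs.at(host_ptr);
    // Copy HtoD unless the kernel fully overwrites the buffer (FullWrite).
    if (access != AccessType::FullWrite)
        cudaMemcpy(p.gpu, host_ptr, p.bytes, cudaMemcpyHostToDevice);
    if (access != AccessType::Read)
        p.gpuDirty = true;                   // GPU now holds the newest version
    return p.gpu;                            // pointer to pass to the kernel
}

void owner_CPU(void* host_ptr) {
    Pair& p = g_pairs.at(host_ptr);
    // Toy version copies eagerly; the real manager waits for a CPU page fault.
    if (p.gpuDirty) {
        cudaMemcpy(host_ptr, p.gpu, p.bytes, cudaMemcpyDeviceToHost);
        p.gpuDirty = false;
    }
}
```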

  25. Performance comparison
      [Profiles: UVM vs. semi-automatic memory manager]
      ● The semi-automatic memory manager performs better
         ○ Mainly because of better PCIe bandwidth

  26. Results (I)
      [Chart: speedups of 55x, 15.09x, and 5.29x]

  27. Results (II)
      [Chart: speedups of 2x and 2.4x; upper bound 2.9x (Amdahl's law)]
      ● PELE acceleration is still ongoing:
         ○ Non-bonding list generation
         ○ Computations in the perturbation step
         ○ Etc.

  28. Conclusions

  29. Conclusions
      ● Acceleration of existing applications
         ○ Some parts are accelerated while others are kept on the CPU
         ○ Maintaining data coherence between CPU & GPU is complex
         ○ We showed two examples:
            ■ WARIS-Transport: simple enough to port most of the computations to the GPU and keep the data there
            ■ PELE: complex app → use a manager to handle the copies; UVM is a great tool to automate the copies, and we implemented a semi-automatic memory manager to improve performance
      ● Atomics might have a large performance impact
         ○ Store partial results and apply a reduction step after the kernel
         ○ Libraries can help with reductions (CUB, Modern GPU, etc.)

  30. www.bsc.es Thank you! For further information, please contact pau.farre@bsc.es or marc.jorda@bsc.es
