Tackling Performance Bottlenecks in the Diversifying CUDA HPC Ecosystem: a Molecular Dynamics Perspective
Szilárd Páll, KTH Royal Institute of Technology
GTC 2015
Diversifying hardware & complex parallelism
● Increasingly:
  – parallel on multiple levels
  – heterogeneous
  – power constrained
[Figure: levels of parallelism — widening SIMD/SIMT units, increasing CPU/GPU core count & NUMA on die, “skinny” workstations to fat compute nodes, mini-clusters to petaflop machines, on-demand compute cloud]
Diversifying hardware & complex parallelism
● Need to address each level
● Choice of parallelization is important
● How much of the burden is placed on the user?
[Figure: same levels-of-parallelism spectrum as on the previous slide]
Molecular dynamics: modelling physics
[Figure: molecular mechanics potential energy function terms; reproduced under CC BY-SA from http://commons.wikimedia.org/wiki/File:MM_PEF.png]
Molecular dynamics: basics
● Given: N particles, their masses, and a potential V
● Newton's equations of motion
● Integrate (leap-frog):
  – accelerations → velocities
  – velocities → coordinates
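For reference, a minimal sketch of the update the slide alludes to, written in the standard textbook leap-frog form (symbols are the usual ones, not GROMACS-specific notation):

```latex
\begin{align*}
\mathbf{F}_i &= -\nabla_i V(\mathbf{r}_1,\dots,\mathbf{r}_N), &
\mathbf{a}_i &= \mathbf{F}_i / m_i \\
\mathbf{v}_i\!\left(t+\tfrac{\Delta t}{2}\right) &=
\mathbf{v}_i\!\left(t-\tfrac{\Delta t}{2}\right) + \mathbf{a}_i(t)\,\Delta t \\
\mathbf{r}_i(t+\Delta t) &= \mathbf{r}_i(t) +
\mathbf{v}_i\!\left(t+\tfrac{\Delta t}{2}\right)\Delta t
\end{align*}
```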
Molecular dynamics: interactions
Molecular dynamics: forces
● Bonded forces: loop over all interactions
  – few but localized → imbalance challenge (threading & domain decomposition)
● Non-bonded forces: a double loop over all atom pairs?
  → too expensive: limit the interaction range (cut-off)
Pair interactions: cut-off
● LJ decays fast (~1/r^6): can use a cut-off
● Coulomb decays slowly (~1/r): a cut-off is not good enough
  → treat the long-range part separately: Particle Mesh Ewald (PME)
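For reference, the standard forms of the two pair potentials compared above (textbook notation, assumed here rather than quoted from the GROMACS manual):

```latex
\begin{align*}
V_{\mathrm{LJ}}(r) &= 4\varepsilon\left[\left(\frac{\sigma}{r}\right)^{12}
  - \left(\frac{\sigma}{r}\right)^{6}\right], &
V_{\mathrm{Coul}}(r) &= \frac{1}{4\pi\varepsilon_0}\,\frac{q_i q_j}{r}
\end{align*}
```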
GROMACS: fast, flexible, free
● Developers: Stockholm & Uppsala, SE, and many more worldwide
● Open source: LGPLv2
● Open development: https://gerrit.gromacs.org
● Large user base:
  – 10k's academic & industry
  – 100k's through F@H
● Supports all major force fields
[Figure: parallel constraints, arbitrary unit cells, virtual interaction sites, eighth-shell domain decomposition; triclinic unit cell with load balancing and staggered cell boundaries]
GROMACS: fast, flexible, free
● Code: portability is of great importance
  – C++98 (subset)
  – CMake
● Pretty large: ~2 million LOC, ½ of which is SIMD!
● Bottom-up performance tuning
  → absolute performance is what matters (to users)
[Figure: same illustrations as on the previous slide]
Costs in MD
● Every step: 10^6-10^8 flops
● Every simulation: 10^6-10^8 steps
● What are the flops spent on?
[Figure: flop breakdown by kernel — pair search (distance check); non-bonded (LJ + Coulomb tabulated, F and F+E; Coulomb tabulated, F and F+E; 1,4 non-bonded interactions); PME (calc weights, spread Q B-spline, gather F B-spline, 3D-FFT); bonded (angles, propers); Settle]
Molecular dynamics step
● MD iteration = step
● Pair-search step every 10-50 iterations
● One step: pair search (when needed) → bonded F → non-bonded F → PME → integration → constraints
● ~ milliseconds or less per step
● Goal: do it as fast as possible!
Heterogeneous accelerated GROMACS
● 2nd-gen GPU acceleration: offload, since GROMACS v4.6
● Advantages:
  – 2-4x speedup
  – offload → multi-GPU “for free”
  – wide feature support
● Challenges:
  – added latencies, overheads
  – load balancing
[Figure: “regular” MD step with pair search/domain decomposition every 10-50 iterations — pair search & DD, bonded F, non-bonded F (offloaded), PME, integration, constraints; 100s of microseconds per step at peak!]
Heterogeneous accelerated MD
● 2nd-gen GPU acceleration: since GROMACS v4.6
● Advantages:
  – 3-4x speedup
  – offload → multi-GPU “for free”
  – wide feature support
● Challenges:
  – added latency, re-casting work for GPUs
  – load balancing: intra-GPU, intra-node, ...
[Figure: per-step timeline, pair search every 10-50 iterations — CPU (OpenMP threads): launch GPU, pair search, bonded F, PME, wait for GPU, integration, constraints; GPU (CUDA): H2D pair-list, H2D x,q, non-bonded F & pair-list pruning, clear F, D2H F, otherwise idle; average CPU-GPU overlap: 60-80% per step]
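A hypothetical sketch of this kind of offload loop in plain CUDA: the non-bonded kernel and transfers run asynchronously in a stream while the CPU computes bonded forces and PME. All names (nbnxn_kernel, do_bonded_cpu, the buffers) are illustrative placeholders, not the actual GROMACS API, and buffers are assumed to be allocated elsewhere.

```cuda
#include <cuda_runtime.h>

// Offload sketch: GPU non-bonded work overlaps with CPU bonded + PME work.
cudaStream_t nb_stream;
cudaStreamCreate(&nb_stream);

for (int step = 0; step < nsteps; ++step)
{
    // 1. Ship coordinates/charges and launch the non-bonded kernel early
    //    (the pair list is assumed to already be on the GPU from the last
    //    search step).
    cudaMemcpyAsync(d_xq, h_xq, natoms * sizeof(float4),
                    cudaMemcpyHostToDevice, nb_stream);
    nbnxn_kernel<<<nblocks, 128, 0, nb_stream>>>(d_xq, d_pairlist, d_f);
    cudaMemcpyAsync(h_f_nb, d_f, natoms * sizeof(float3),
                    cudaMemcpyDeviceToHost, nb_stream);

    // 2. CPU works on bonded forces and PME while the GPU is busy.
    do_bonded_cpu(h_xq, h_f_cpu);
    do_pme_cpu(h_xq, h_f_cpu);

    // 3. Wait for the GPU forces, reduce, then integrate and constrain.
    cudaStreamSynchronize(nb_stream);
    reduce_forces(h_f_cpu, h_f_nb);
    integrate_and_constrain(h_xq, h_f_cpu);
}
cudaStreamDestroy(nb_stream);
```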
Parallel heterogeneous MD
● Pair-search/domain-decomposition step every 10-50 iterations; otherwise “regular” MD steps
[Figure: per-step timeline with MPI — CPU (OpenMP threads): local pair search, non-local pair search, bonded F, PME, MPI receive non-local x, wait for non-local F, MPI send non-local F, wait for local F, integration, constraints; GPU (CUDA), local stream: H2D local pair-list, H2D local x,q, local list pruning, local non-bonded F, clear F, D2H local F; non-local stream: H2D non-local pair-list, H2D non-local x,q, non-local list pruning, non-local non-bonded F, D2H non-local F]
Intra-node: The accelerator
SIMD/SIMT-targeted algorithms
● Cluster pair interaction algorithm:
  – lends itself well to efficient fine-grained parallelization
  – adaptable to the characteristics of the architecture
Particle cluster algorithm: SIMD implementation
● Cluster size and grouping are the “knobs” to adjust for a specific architecture:
  – data reuse
  – arithmetic intensity
  – cache efficiency
[Figure: classical 1x1 neighborlist on 4-way SIMD (traditional algorithm: cache pressure, ill data reuse); 4x4 setup on 4-way SIMD (cluster algorithm for SSE4, VMX, 128-bit AVX: cache friendly, 4-way j-reuse, in-register shuffle); 4x4 setup on SIMT (cluster algorithm for fine-grained hardware threading)]
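A simplified illustration of how a 4x4 cluster pair can map onto SIMT threads: a 4x4 tile of threads computes all 16 atom-pair interactions between one i-cluster and one j-cluster. This is a sketch of the idea only, with a toy Coulomb interaction and atomic force accumulation; the data layout, cut-off handling, and reductions of the real GROMACS nbnxn kernels are far more elaborate.

```cuda
// launch: cluster_pair_sketch<<<numClusterPairs, dim3(4, 4)>>>(...)
__global__ void cluster_pair_sketch(const float4 *xq,            // x, y, z, charge
                                    const int2   *cluster_pairs, // (i-cluster, j-cluster)
                                    float3       *f,             // forces
                                    float         rc2)           // cut-off squared
{
    const int pair = blockIdx.x;   // one cluster pair per block (sketch only)
    const int ti   = threadIdx.x;  // 0..3: i-atom within the i-cluster
    const int tj   = threadIdx.y;  // 0..3: j-atom within the j-cluster

    const int ia = cluster_pairs[pair].x * 4 + ti;
    const int ja = cluster_pairs[pair].y * 4 + tj;

    const float4 xi = xq[ia];
    const float4 xj = xq[ja];

    const float dx = xi.x - xj.x, dy = xi.y - xj.y, dz = xi.z - xj.z;
    const float r2 = dx * dx + dy * dy + dz * dz;

    if (r2 < rc2 && r2 > 0.0f)
    {
        // Toy interaction: plain Coulomb, force scale = q_i*q_j / r^3.
        const float rinv  = rsqrtf(r2);
        const float fscal = xi.w * xj.w * rinv * rinv * rinv;

        // Atomic accumulation of the i-forces stands in for the in-kernel
        // shuffle reductions used in the real implementation; j-forces
        // (Newton's third law) are omitted for brevity.
        atomicAdd(&f[ia].x, fscal * dx);
        atomicAdd(&f[ia].y, fscal * dy);
        atomicAdd(&f[ia].z, fscal * dz);
    }
}
```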
Tuning kernels: feeding the GPU
● Avoid load imbalance:
  – create enough independent work units (= blocks) per SM[X|M]
  – sort them
● Workload regularization improves performance by up to:
  – 2-3x on “narrower” GPUs
  – 3-5x on “wider” GPUs
● (Re-)tuning is needed for new architectures
● Tradeoffs:
  – lowers j-particle data reuse
  – atomic clashes
[Figure: raw vs. reshaped pair-list size distributions (raw lists: too few blocks, imbalanced execution; regularized lists: balanced SMX execution) and per-SMX workload in KCycles on a Tesla K20c, 1500 atoms (regularized: 4x faster execution)]
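A hedged, host-side illustration of the list-regularization idea: split long pair lists into bounded chunks and sort the chunks by size so every SM gets similarly sized blocks of work. The PairList type and the splitting/sorting policy are invented for this sketch; the actual nbnxn list reshaping works differently in detail.

```cuda
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy pair-list entry: an i-cluster and the indices of its j-clusters.
struct PairList
{
    int              iCluster;
    std::vector<int> jClusters;
};

// Split lists into chunks of at most maxLen j-clusters and sort by
// decreasing length, so each GPU block gets a similar amount of work and
// the longest lists are scheduled first.
static std::vector<PairList> regularize(const std::vector<PairList> &raw,
                                        std::size_t                  maxLen)
{
    std::vector<PairList> out;
    for (const PairList &pl : raw)
    {
        for (std::size_t start = 0; start < pl.jClusters.size(); start += maxLen)
        {
            const std::size_t end = std::min(start + maxLen, pl.jClusters.size());
            PairList chunk;
            chunk.iCluster = pl.iCluster;
            chunk.jClusters.assign(pl.jClusters.begin() + start,
                                   pl.jClusters.begin() + end);
            out.push_back(chunk);
        }
    }
    std::sort(out.begin(), out.end(),
              [](const PairList &a, const PairList &b)
              { return a.jClusters.size() > b.jClusters.size(); });
    return out;
}
```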
Tuning kernels: ready for automation
[Figure: Tesla K20c force kernel tuning — kernel time (ms) vs. list-split setting scanned from 0 → 1000, for systems of 960, 1.5k, 3k, 6k, and 12k atoms]
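A hedged sketch of how such a parameter scan could be automated with CUDA event timing. run_with_split() is a placeholder that rebuilds the pair list with the given split factor and launches the non-bonded kernel; it is not an actual GROMACS function.

```cuda
#include <cuda_runtime.h>

// Scan a tunable parameter (here: the pair-list split factor) and keep the
// value that gives the shortest measured kernel time.
float best_time  = 1e30f;
int   best_split = 1;

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

for (int split = 1; split <= 1000; split *= 2)
{
    cudaEventRecord(start);
    run_with_split(split);              // rebuild list + launch kernel (placeholder)
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    if (ms < best_time)
    {
        best_time  = ms;
        best_split = split;
    }
}
cudaEventDestroy(start);
cudaEventDestroy(stop);
```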
Tuning kernels: still getting faster
● Up to 2x faster wrt the first version
● We keep finding ways to improve performance; most recently:
  – better inter-SM load balancing
  – more consistent list lengths
  – concurrent atomic operations
[Figure: Tesla C2070 CUDA non-bonded force kernel, PME, rc=1.0 nm, nstlist=20 — step time per 1000 atoms (ms) vs. system size (0.96k-768k atoms) for successive kernel versions: initial target, first alpha, second alpha, pair-list tweaks, Kepler back-port, improved sorting, CUDA 5.5 + buffer improvement, GMX 5.0, reduction tweak]
Tuning kernels: still getting faster
● Up to 2x faster wrt the first version
● We keep finding ways to improve performance; most recently:
  – better inter-SM load balancing
  – more consistent list lengths
  – concurrent atomic operations
● But NVIDIA does too!
[Figure: same plot as on the previous slide, with a GTX TITAN X curve added]
Adapting the cluster algorithm to the GK210
● Doubled register file size
  ⇒ can fit 2x threads/block: 1 i-cluster vs. 2 j-clusters per i-loop pass
[Figure: particle cluster processing on GF, GK1xx, GM (one j-cluster per i-loop pass) vs. on GK210 (two j-clusters per i-loop pass)]
K80 kernel performance
● Pre-GK210: 64 threads/block, 16 blocks per SM
  – occupancy: max 50%, achieved ~49.5%
● Extra registers allow 128 threads/block, still 16 blocks per SM
  – occupancy: max 100%, achieved ~92%

Kernel time by GPU and #threads x #blocks (ms):
  K80  64x16   3.95
  K80  128x16  2.90
  K80  128x15  2.91
  K80  128x14  3.08
  K80  256x8   3.06
  K40  64x16   3.41
  K40  128x8   3.57
  K40  256x4   5.50
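One common way such occupancy targets are expressed in CUDA is via __launch_bounds__; a hedged sketch follows (the kernel name and empty body are placeholders, not the real GROMACS kernel):

```cuda
// Ask the compiler to cap register use so that 128-thread blocks can reach
// 16 resident blocks per SM on GK210 (128 x 16 = 2048 threads, i.e. 100%
// theoretical occupancy). On pre-GK210 Kepler the smaller register file
// forces a lower thread count for the same per-thread register budget.
__global__ void __launch_bounds__(128, 16)
nbnxn_kernel_sketch(/* pair lists, coordinates, forces, ... */)
{
    // placeholder body; per the slide before, the real kernel processes one
    // i-cluster against two j-clusters per pass to use the extra registers
}
```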