Auto-Tuning Kernel Launch Parameters for Maximum Performance
Joshua A. Anderson
THE GLOTZER GROUP
Example applications of HOOMD-blue
• Tethered nanospheres - Langevin dynamics (Marson, R. et al., Nano Letters 14, 4, 2014)
• Truncated tetrahedra - hard particle MC (Damasceno, P. F. et al., ACS Nano 6, 609, 2012)
• Arbitrary polyhedra - hard particle MC (Damasceno, P. F. et al., Science 337, 453, 2012)
• Quasicrystal growth - molecular dynamics (Engel, M. et al., Nature Materials 14, 109-116, 2014)
• Self-propelled colloids - non-equilibrium MD (Nguyen, N., Phys. Rev. E 86, 1, 2012)
• Surfactant coated surfaces - dissipative particle dynamics (Pons-Siepermann, I. C., Soft Matter 6, 3919, 2012)
• Interacting nanoplates - hard particle MC with interactions (Ye, X. et al., Nature Chemistry, cover article, 2013)
• Hard disks, hexatic phase - hard particle MC (Engel, M. et al., PRE 87, 042134, 2013)
Features in HOOMD-blue v1.0

Integration:
• NVT (Nosé-Hoover)
• NPT
• NPH
• Brownian dynamics
• Dissipative particle dynamics
• NVE
• FIRE energy minimization
• Rigid body dynamics

Bond forces:
• Harmonic
• FENE
• Table

Angle forces:
• Harmonic
• CGCMM
• Table

Dihedral/Improper forces:
• Harmonic
• Table

Pair forces:
• Lennard-Jones
• Gaussian
• CGCMM
• Morse
• Table
• Yukawa
• PPPM electrostatics

Many-body forces:
• EAM

Snapshot formats:
• MOL2
• DCD
• PDB
• XML

Simulation types:
• 2D and 3D
• Triclinic box
• Replica exchange (via script)

Hardware support:
• All recent NVIDIA GPUs
• Multi-GPU with MPI
• Multi-CPU with MPI
HPMC - Massively parallel MC on the GPU
• Hard Particle Monte Carlo plugin for HOOMD-blue
• 2D shapes: disk, convex (sphero)polygon, concave polygon, ellipse
• 3D shapes: sphere, ellipsoid, convex (sphero)polyhedron
• NVT and NPT ensembles
• Frenkel-Ladd free energy
• Parallel execution on a single GPU
• Domain decomposition across multiple nodes (CPUs or GPUs)

Damasceno, P. F. et al., ACS Nano 6, 609 (2012); Damasceno, P. F. et al., Science 337, 453 (2012); Engel, M. et al., PRE 87, 042134 (2013)
Example job script

from hoomd_script import *
from hoomd_plugins import hpmc

init.read_xml(filename='init.xml')

mc = hpmc.integrate.convex_polygon(seed=10, d=0.25, a=0.3);
mc.shape_param.set('A', vertices=[(-0.5, -0.5), (0.5, -0.5), (0.5, 0.5), (-0.5, 0.5)]);

run(10e3)
Kernel performance depends on launch parameters
[Chart: performance vs. block size on K20 and K40. For N=1e6 the optimal block size differs between GPUs (32 vs. 224), with performance varying by 4%; for N=4096 it differs more widely (64 vs. 416), with performance varying by 30%.]
Kernel performance depends on launch parameters
[Chart: performance vs. launch parameter pairs (e.g. 128,4) on K20 and K40. For N=1e6 the optimum differs between GPUs (128,4 vs. 160,2), with performance varying by up to 200%; for N=4096 (128,4 vs. 160,16), by up to 40%.]
The need for runtime autotuning
• 100+ kernels, many with multiple variants
• Many GPU generations
• Variations within generations (K20, K40, K80)
• Different CUDA compiler versions (5.5, 6.0, 6.5, 7.0, …)
• Infinite workloads based on user configuration
• Workloads can vary during a single run
• Multiple dimensions of launch parameters (block size, stride, alternate algorithms, …)
• Performance vs. launch parameter is not predictable
Time steps
• Tuning occurs during actual simulation run steps
• Each kernel has a separate Autotuner
• Kernels may be called at different rates
[Diagram: over successive time steps, each of three kernels sweeps its candidate launch parameters (e.g. 32, 64, 96, 128, 160, …) at its own call rate, then settles on its optimum (e.g. 96).]
Repeated tuning
• One scan through the launch parameters takes ~1 second
• Lock to the optimum for 5 minutes (user configurable)
• Then sample again to adapt to changes
• Run at non-optimal sizes for less than 0.2% of the run
[Diagram: each kernel tunes for ~1 s, runs with its optimal parameters for 5 minutes, then re-tunes.]
Autotuner interface

constructor():
    m_tuner = Autotuner(valid_launch_params)

update(timestep):
    m_tuner.begin()
    call_kernel(…, m_tuner.getParam())
    m_tuner.end()

• Minimal additional code in a module
• Initialize the tuner once
• Wrap each kernel call between begin() and end()
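The calling pattern can be sketched as a CPU-only illustration; ToyAutotuner and launch_kernel below are hypothetical stand-ins that show only how a module wires begin()/getParam()/end() around its kernel call, not the HOOMD-blue implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the real Autotuner: while scanning it hands out each
// candidate launch parameter in turn; the timing and selection logic
// of the real class are omitted here.
class ToyAutotuner
    {
    public:
    ToyAutotuner(const std::vector<unsigned int>& params)
        : m_params(params), m_current(0) {}
    void begin() { /* the real tuner records a CUDA start event here */ }
    void end() { m_current = (m_current + 1) % m_params.size(); } // advance the scan
    unsigned int getParam() const { return m_params[m_current]; }
    private:
    std::vector<unsigned int> m_params;
    std::size_t m_current;
    };

// Placeholder for a kernel launch; returns the block size it was given.
unsigned int launch_kernel(unsigned int block_size) { return block_size; }
```

A module's update() then reads exactly as on the slide: begin(), launch with getParam(), end().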
Implementation details
• The Autotuner methods drive a state machine that controls tuning
• When not tuning:
    • getParam() returns the optimal parameters found
    • begin() and end() are no-ops
• When tuning:
    • getParam() switches to a new parameter on each call
    • begin() and end() use CUDA events to measure the kernel time
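The scanning behavior can be sketched without a GPU as a small state machine. The class, state names, and record() hook below are illustrative assumptions; the actual Autotuner measures times with CUDA events inside begin()/end() instead:

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Illustrative two-state tuner: SCANNING measures each candidate once,
// then IDLE returns the optimum found.
class SketchTuner
    {
    public:
    enum State { SCANNING, IDLE };

    SketchTuner(const std::vector<unsigned int>& params)
        : m_params(params), m_state(SCANNING), m_index(0), m_optimal(params[0]) {}

    unsigned int getParam() const
        {
        return (m_state == SCANNING) ? m_params[m_index] : m_optimal;
        }

    // Record the measured time for the current parameter, then advance.
    void record(float time_ms)
        {
        m_times.push_back(time_ms);
        ++m_index;
        if (m_index == m_params.size())
            {
            // Scan complete: pick the fastest parameter and go idle.
            std::size_t best = 0;
            for (std::size_t i = 1; i < m_times.size(); ++i)
                if (m_times[i] < m_times[best]) best = i;
            m_optimal = m_params[best];
            m_state = IDLE;
            }
        }

    State state() const { return m_state; }

    private:
    std::vector<unsigned int> m_params;
    std::vector<float> m_times;
    State m_state;
    std::size_t m_index;
    unsigned int m_optimal;
    };
```

A periodic re-tune, as described on the previous slide, would just reset the state to SCANNING and clear the samples.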
Code

void Autotuner::begin()
    {
    if (m_state == STARTUP || m_state == SCANNING)
        cudaEventRecord(m_start, 0);
    }

void Autotuner::end()
    {
    if (m_state == STARTUP || m_state == SCANNING)
        {
        cudaEventRecord(m_stop, 0);
        cudaEventSynchronize(m_stop);
        cudaEventElapsedTime(&m_samples[m_current_element][m_current_sample],
                             m_start, m_stop);
        }
    // ... implement state machine update
    }
Sampling
• Kernel launch times are noisy
• Record M samples per launch parameter (e.g. 5)
• Take the median, mean, or max
• The warmup phase needs to sample len(valid_launch_params)*M launches
• Subsequent scans only need to replace one set of samples, i.e. len(valid_launch_params) launches
• Typically only 32-192 launches
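Taking the median of the M samples recorded for one launch parameter can be sketched as follows (median_time is a hypothetical helper, not taken from the HOOMD-blue source):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of the M timing samples recorded for one launch parameter.
// nth_element partially sorts, placing the middle sample at position mid
// without the cost of a full sort; the copy keeps the caller's data intact.
float median_time(std::vector<float> samples)
    {
    std::size_t mid = samples.size() / 2;
    std::nth_element(samples.begin(), samples.begin() + mid, samples.end());
    return samples[mid];
    }
```

The median is a reasonable default because one outlier launch (e.g. delayed by unrelated GPU work) does not shift it the way it shifts the mean.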
Invalid block sizes
• What about invalid block sizes?
• Not all kernels can be run at every possible block size
• A simple approach:
    • Put all possible params in valid_launch_params
    • Clamp to the max possible block size
    • Account for dynamic shared memory if used

cudaFuncAttributes attr;
cudaFuncGetAttributes(&attr, kernel<template params>);
int block_size = min(attr.maxThreadsPerBlock, target_block_size);
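Applying that clamp over the whole candidate list can be sketched in plain C++ (clamp_params is a hypothetical helper; the maximum would come from cudaFuncGetAttributes in practice, but is passed in here so the sketch runs without a GPU):

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Clamp every candidate block size to the kernel's maximum, and drop the
// duplicates the clamping creates, so the tuner never launches an invalid
// size and never times the same clamped size twice.
std::vector<unsigned int> clamp_params(const std::vector<unsigned int>& candidates,
                                       unsigned int max_threads_per_block)
    {
    std::set<unsigned int> unique;
    for (unsigned int p : candidates)
        unique.insert(std::min(p, max_threads_per_block));
    return std::vector<unsigned int>(unique.begin(), unique.end());
    }
```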
Drawbacks
• Kernels called with a long period can take minutes to fully tune
• Runtime auto-tuning only works with iterative methods; other codes can tune offline
• Floating point reduction kernels give non-deterministic results, since changing the block size changes the order of summation
Code available:
• Autotuner.h / Autotuner.cc in HOOMD-blue
• http://codeblue.umich.edu/hoomd-blue

Funding / Resources
• Research supported by the National Science Foundation, Division of Materials Research, Award # DMR 1409620.

email: joaander@umich.edu