Accelerating a Spectral Algorithm for Plasma Physics with Python/Numba on GPU

FBPIC: A spectral, quasi-3D, GPU-accelerated Particle-In-Cell code

Manuel Kirchen, Center for Free-Electron Laser Science & University of Hamburg, Germany (manuel.kirchen@desy.de)
Rémi Lehe, BELLA Center & Center for Beam Physics, LBNL, USA (rlehe@lbl.gov)
Contents
‣ Introduction to Plasma Accelerators
‣ Modelling Plasma Physics with Particle-In-Cell Simulations
‣ A Spectral, Quasi-3D PIC Code (FBPIC)
‣ Two-Level Parallelization Concept
‣ GPU Acceleration with Numba
‣ Implementation & Performance
‣ Summary
Introduction to Plasma Accelerators

[Figures: example of a laser-driven wakefield; basic principle of laser wakefield acceleration, showing the laser pulse, the trailing electron bunch (few fs long) and the 10-100 µm plasma wake formed by electrons oscillating around the static heavy-ion background over one plasma period. Image taken from: http://features.boats.com/boat-content/files/2013/07/centurion-elite.jpg]

‣ cm-scale plasma target (ionized gas)
‣ Laser pulse or electron beam drives the wake
‣ Length scale of the accelerating structure: plasma wavelength (µm scale)
‣ Charge separation induces strong electric fields (~100 GV/m)

➞ Shrinks the accelerating distance from the km to the mm scale (orders of magnitude), with ultra-short timescales (few fs)
Modelling Plasma Physics with Particle-In-Cell Simulations

‣ Fields live on a discrete grid (simulation box, cell size ∆x)
‣ Macroparticles interact with the fields

The PIC cycle:
‣ Charge/current deposition onto the grid nodes
‣ Fields are advanced ➔ Maxwell equations
‣ Fields are gathered onto the particles
‣ Particles are pushed ➔ Lorentz equation

Millions of cells, particles and iterations!
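To make the cycle concrete, here is a minimal, self-contained sketch of the four stages in Python. It is a 1D electrostatic toy model in normalized units, not FBPIC's electromagnetic quasi-3D algorithm; all function and variable names are illustrative assumptions.

```python
import numpy as np

def pic_step(x, v, L, n_cells, dt):
    """One toy PIC cycle for electrons (charge -1, mass 1, eps0 = 1)."""
    dx = L / n_cells
    cell = np.floor(x / dx).astype(np.int64) % n_cells
    frac = x / dx - np.floor(x / dx)

    # 1. Charge deposition: linear (CIC) weights onto the two nearest nodes
    rho = np.zeros(n_cells)
    np.add.at(rho, cell, 1.0 - frac)
    np.add.at(rho, (cell + 1) % n_cells, frac)
    rho *= -1.0 / dx                       # electron charge density
    rho -= rho.mean()                      # neutralizing ion background

    # 2. Field solve: Gauss's law dE/dx = rho, solved in Fourier space
    k = 2 * np.pi * np.fft.fftfreq(n_cells, d=dx)
    rho_k = np.fft.fft(rho)
    E_k = np.zeros_like(rho_k)
    E_k[1:] = -1j * rho_k[1:] / k[1:]
    E = np.fft.ifft(E_k).real

    # 3. Field gathering: interpolate E back onto the particle positions
    E_part = (1.0 - frac) * E[cell] + frac * E[(cell + 1) % n_cells]

    # 4. Particle push: leapfrog update with the electric force only
    v = v - E_part * dt                    # acceleration = (q/m) E = -E
    x = (x + v * dt) % L
    return x, v
```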
Productivity of a (Computational) Physicist

[Figure: productivity over time. "Simulations take too long!" ➔ develop a novel algorithm + efficient parallelization… ➔ "Fast simulations, physical insights!" Python/Numba helped us (as physicists) speed up this process.]

Our goal: a reasonably fast & accurate code with many features and a user-friendly interface
A Spectral, Quasi-3D PIC Code

PIC simulations in 3D are essential, but computationally demanding. The majority of algorithms are based on finite-difference schemes that introduce numerical artefacts.

Spectral solvers (PSATD algorithm, Haber et al., 1973):
‣ Correct evolution of electromagnetic waves
‣ Fewer numerical artefacts

Quasi-cylindrical symmetry (Lifschitz et al., 2009):
‣ Captures important 3D effects
‣ Computational cost similar to a 2D code

Combine the best of both worlds ➞ a spectral & quasi-cylindrical algorithm
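As a hedged illustration of why spectral solvers evolve electromagnetic waves correctly, the sketch below advances a 1D vacuum field exactly in Fourier space: the update is a pure phase rotation, so there is no numerical dispersion for any time step. This is only a vacuum analogue of the idea behind PSATD, not the actual PSATD update (which includes current sources and, in FBPIC, acts on cylindrical azimuthal modes via Hankel transforms).

```python
import numpy as np

c = 299792458.0

def push_fields_spectral(Ey, Bz, dx, dt):
    """Advance a 1D vacuum EM field exactly over one step in k-space."""
    k = 2 * np.pi * np.fft.fftfreq(Ey.size, d=dx)
    Ek, Bk = np.fft.fft(Ey), np.fft.fft(Bz)
    # Characteristic combinations F± = E ± cB are advected at exactly ±c,
    # so one time step is an exact phase rotation in k-space.
    Fp = (Ek + c * Bk) * np.exp(-1j * k * c * dt)
    Fm = (Ek - c * Bk) * np.exp(+1j * k * c * dt)
    Ey_new = np.fft.ifft(0.5 * (Fp + Fm)).real
    Bz_new = np.fft.ifft(0.5 * (Fp - Fm) / c).real
    return Ey_new, Bz_new
```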
A Spectral, Quasi-3D PIC Code

FBPIC (Fourier-Bessel Particle-In-Cell), algorithm developed by Rémi Lehe (R. Lehe et al., 2016)
‣ Written entirely in Python and uses Numba Just-In-Time compilation
‣ Single-core only, and not easy to parallelize due to global operations (FFT and DHT)
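For context, the Numba style looks like the following minimal sketch (illustrative, not the actual FBPIC source): a plain Python loop over particle arrays that the @njit decorator compiles to machine code.

```python
import math
from numba import njit

# Illustrative sketch of the Numba approach (u = p / (m c) are assumed
# to be normalized momenta, as is common in PIC codes).
@njit
def compute_inv_gamma(ux, uy, uz, inv_gamma):
    for i in range(ux.shape[0]):
        inv_gamma[i] = 1.0 / math.sqrt(1.0 + ux[i]**2 + uy[i]**2 + uz[i]**2)
```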
Parallelization Approach for Spectral PIC Algorithms

Spectral codes are not easy to parallelize by domain decomposition, due to the FFT & DHT.

‣ Standard (FDTD): local domain decomposition ➞ local exchange, low accuracy
‣ Spectral, global transformations ➞ global communication, high accuracy
‣ Spectral, transformations & domain decomposition ➞ local communication & exchange, arbitrary accuracy

➞ Local parallelization of the global operations & global domain decomposition
Parallelization Concept

[Figure: typical HPC infrastructure, a cluster of nodes connected by a local area network; each node contains a CPU with RAM and a GPU with device memory.]

Intra-node parallelization:
‣ Shared memory layout
‣ GPU (or multi-core CPU)
‣ Parallel PIC methods & transformations
‣ Numba + CUDA

Inter-node parallelization:
‣ Distributed memory layout
‣ Multi-CPU / Multi-GPU
‣ Spatial domain decomposition for spectral codes (Vay et al., 2013)
‣ mpi4py

Shared and distributed memory layouts ➞ two-level parallelization entirely with Python (a minimal sketch follows below)
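A minimal sketch of the two-level layout (assumed setup, not the FBPIC source): mpi4py handles the distributed inter-node level, while Numba/CUDA handles the shared intra-node level, with one MPI rank driving one GPU.

```python
from mpi4py import MPI
from numba import cuda

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Intra-node level: bind this rank to a local GPU (assuming ranks are
# placed round-robin across the GPUs of each node).
n_gpus_per_node = len(cuda.gpus)
cuda.select_device(rank % n_gpus_per_node)

# Inter-node level: each rank now advances its own spatial subdomain on
# its GPU and exchanges boundary (guard) data with its neighbours via MPI.
```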
Intra-Node Parallelization of PIC Methods

Particles:
‣ Particle push: each thread updates one particle
‣ Field gathering: some threads read the same field value
‣ Field deposition: some threads write the same field value ➞ race conditions! (one common mitigation is sketched below)

Fields:
‣ Field push and current correction: each thread updates one grid value
‣ Transformations: use optimized parallel algorithms

Intra-node parallelization ➞ CUDA with Numba
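The race condition arises when two threads deposit onto the same grid node at once. One common remedy, shown here purely for illustration, is an atomic add, which serializes the conflicting writes; this is not what FBPIC does (the sort-based approach on the next slide avoids the conflict altogether).

```python
from numba import cuda

# Illustrative kernel (assumed array layout): one thread per particle,
# atomic adds resolve concurrent writes to the same grid value.
@cuda.jit
def deposit_rho_atomic(cell_index, weight, rho):
    i = cuda.grid(1)
    if i < cell_index.shape[0]:
        cuda.atomic.add(rho, cell_index[i], weight[i])
```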
CUDA Implementation with Numba

Particles:
‣ Field gathering and particle push: per-particle
‣ Field deposition: particles are sorted and each thread loops over the particles in its cell (sketch below)

Fields:
‣ Transformations ➞ CUDA libraries
‣ Field push & current correction: per-cell
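A minimal sketch of the sort-based deposition pattern (assumed array layout, not the FBPIC source): after sorting the particles by cell, `cell_start[i]:cell_start[i + 1]` indexes the particles of cell i, so each thread owns exactly one cell and no two threads write the same grid value. Nearest-grid-point weighting is used for brevity; a real shape-factor deposition also collects contributions from neighbouring cells.

```python
from numba import cuda

@cuda.jit
def deposit_rho_per_cell(cell_start, weight, rho):
    i = cuda.grid(1)                       # one thread per cell
    if i < rho.shape[0]:
        acc = 0.0
        for p in range(cell_start[i], cell_start[i + 1]):
            acc += weight[p]               # particles sorted into cell i
        rho[i] += acc                      # only this thread writes rho[i]
```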
CUDA Implementation with Numba

‣ Simple interface for writing CUDA kernels
‣ Made use of cuBLAS, cuFFT and RadixSort
‣ Manual memory management: data is kept on the GPU and only copied to the CPU for I/O
‣ Almost full control over the CUDA API
‣ Ported the code to the GPU in less than 3 weeks

[Figure: a simple CUDA kernel in FBPIC]
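The kernel shown on the slide is not reproduced here; the following is a hedged sketch of what writing and launching such a kernel with Numba looks like (illustrative names and values, not the actual FBPIC source), including the pattern of keeping data resident on the GPU and copying it back only for I/O.

```python
import numpy as np
from numba import cuda

@cuda.jit
def push_positions_kernel(x, z, ux, uz, inv_gamma, c_dt):
    """Per-particle position push: one CUDA thread updates one particle."""
    i = cuda.grid(1)
    if i < x.shape[0]:
        x[i] += c_dt * inv_gamma[i] * ux[i]
        z[i] += c_dt * inv_gamma[i] * uz[i]

# Arrays are copied to the device once and stay there for the whole run
n = 1_000_000
ux, uz = np.random.normal(size=n), np.random.normal(size=n)
d_x = cuda.to_device(np.zeros(n))
d_z = cuda.to_device(np.zeros(n))
d_ux, d_uz = cuda.to_device(ux), cuda.to_device(uz)
d_inv_gamma = cuda.to_device(1.0 / np.sqrt(1.0 + ux**2 + uz**2))

threads = 256
blocks = (n + threads - 1) // threads
for step in range(100):
    # c * dt with an fs-scale time step (illustrative values)
    push_positions_kernel[blocks, threads](d_x, d_z, d_ux, d_uz,
                                           d_inv_gamma, 3.0e8 * 1.0e-15)
x_host = d_x.copy_to_host()                # copied back to the CPU only for I/O
```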
Single-GPU Performance Results

[Charts: speed-up on different Nvidia GPUs relative to a single-core Intel Xeon E5-2650 v2 (M2070: ~26x, K20m: ~67x, K20x: ~77x); runtime distribution of the GPU PIC methods across field deposition, field gathering, particle sort, particle push, field push, FFT and DHT.]

‣ Speed-up of up to ~70x compared to the single-core CPU version
‣ ~20 ns per particle per step
Parallelization of FBPIC

‣ Standard FDTD: local domain decomposition ➞ local exchange, low accuracy ✘
‣ PSATD: transformations & domain decomposition ➞ local communication & exchange, limited accuracy ✔ (work in progress)
‣ PSATD: global transformations ➞ global communication, high accuracy ?
Inter-Node Parallelization

Spatial domain decomposition:
‣ Split the work by spatial decomposition
‣ Domains are computed in parallel
‣ Exchange local information at the boundaries (local field and particle exchange)
‣ The order of accuracy defines the guard region size (large guard regions for quasi-spectral accuracy)

[Figure: concept of domain decomposition in the longitudinal direction, with the box split across Process 0-3 and neighbouring domains overlapping in guard regions.]
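A minimal sketch of the guard-region exchange at one longitudinal boundary (assumed array layout and sizes, not the FBPIC source): each domain sends its outermost interior slices to its right neighbour, which copies them into its left guard cells; the mirror-image exchange to the left is analogous.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_guard = 64                                   # large for quasi-spectral accuracy
fields = np.zeros((1024, 512))                 # placeholder local (z, r) grid,
                                               # with guard cells at both z ends

left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# Send the last n_guard *interior* slices to the right neighbour and receive
# the slices that overlap our own left guard region.
send_right = np.ascontiguousarray(fields[-2 * n_guard:-n_guard])
recv_left = np.empty_like(send_right)
comm.Sendrecv(send_right, dest=right, recvbuf=recv_left, source=left)
if left != MPI.PROC_NULL:
    fields[:n_guard] = recv_left               # fill our left guard cells
```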
Scaling of the MPI version of FBPIC

Strong scaling on the JURECA supercomputer (Nvidia K80), preliminary results (not optimized)

[Plot: GPU scaling of FBPIC, speed-up (1-32x) vs number of GPUs (4-128), for 16384x512 cells with 64 guard cells per domain; an annotation marks where the guard region size equals the local domain size.]

Best strategy for our case: extensive intra-node parallelization on the GPU and only a few inter-node domains.

For productive and fast simulations, 4-32 GPUs are more than enough!
Summary

‣ Motivation: efficient and easy parallelization of a novel PIC algorithm to combine speed, accuracy and usability in order to work productively as a physicist
‣ FBPIC is entirely written in Python (easy to develop and maintain the code)
‣ Implementation uses Numba (JIT compilation and an interface for writing CUDA-Python)
‣ Intra- and inter-node parallelization approach suitable for spectral algorithms
‣ A single GPU is well suited for the global operations (FFT & DHT)
‣ Enabling CUDA support for the full code took less than 3 weeks
‣ Multi-GPU parallelization by spatial domain decomposition with mpi4py
‣ Outlook: finalize multi-GPU support, CUDA streams, GPUDirect, open-sourcing of FBPIC
Thanks… Questions?

Thanks to funding and support contributed by BMBF (FSP302), the JURECA supercomputer and LBNL.
Special thanks to the LBNL WARP code group and to the groups of Rémi Lehe, Brian McNeil, Johannes Bahrdt and Jens Osterhoff.