Petascale Computational Fluid Dynamics with Python on GPUs
F.D. Witherden, P.E. Vincent
Department of Aeronautics, Imperial College London
Introduction • Computational fluid dynamics (CFD) is the bedrock of several high-tech industries. • There is a desire amongst practitioners to perform unsteady, scale-resolving simulations in the vicinity of complex geometries.
Image courtesy of A.S. Ayer
The Need for FLOP/s • From The Opportunities and Challenges of Exascale Computing, US DOE, fall 2010.
Rmax ≠ Rpeak • FLOP/s are great… if you can get them. • Most commercial codes struggle to get ~10% of peak on CPUs.
PyFR • A high-order compressible Navier–Stokes solver for unstructured grids. • Designed from the ground up to run on NVIDIA GPUs. • Written entirely in Python!
The Py in PyFR • Leverages PyCUDA and mpi4py. • Makes extensive use of run-time code generation. • All compute is performed on the device. • Overhead from the Python interpreter is < 1%.
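As a hedged illustration of the run-time code generation idea (not PyFR's actual kernels, which are templated with Mako), the sketch below specialises and compiles a CUDA kernel at run time with PyCUDA; the Python interpreter only orchestrates work that stays on the device.

    import numpy as np
    import pycuda.autoinit
    from pycuda import gpuarray
    from pycuda.compiler import SourceModule

    n = 1024

    # Kernel source specialised at run time, e.g. with the size baked in
    ksrc = """
    __global__ void axpy(double a, const double *x, double *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < %(n)d)
            y[i] += a*x[i];
    }
    """ % {'n': n}

    axpy = SourceModule(ksrc).get_function('axpy')

    x = gpuarray.to_gpu(np.random.rand(n))
    y = gpuarray.to_gpu(np.random.rand(n))

    # All compute happens on the device; Python merely launches the kernel
    axpy(np.float64(2.0), x.gpudata, y.gpudata,
         block=(128, 1, 1), grid=(n // 128, 1))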
The FR in PyFR • Uses the flux reconstruction (FR) approach; • can recover well-known schemes including nodal Discontinuous Galerkin (DG) methods. • Lots of element-local structured compute.
The FR in PyFR • Majority of operations are block-by-panel type matrix multiplications, C = A·B, where C is M × N, A is M × K, and B is K × N; • here N ~ 10^5 and N ≫ (M, K).
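In code, the shapes look like the NumPy sketch below (sizes illustrative): a small, constant operator matrix A is applied to a wide panel B whose columns correspond to elements.

    import numpy as np

    M, K, N = 24, 15, 100_000   # N ~ 1e5 and N >> M, K

    A = np.random.rand(M, K)    # small, constant operator matrix
    B = np.random.rand(K, N)    # one column of state per element
    C = A @ B                   # block-by-panel product, M x N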
The FR in PyFR • In parallel, only simple halo exchanges are required between MPI ranks.
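A minimal mpi4py sketch of such an exchange is given below; the buffer contents and neighbour topology are hypothetical, and in practice PyFR overlaps these transfers with on-device compute.

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    left, right = (rank - 1) % size, (rank + 1) % size

    send = np.full(64, rank, dtype=np.float64)   # interface state to share
    recv = np.empty_like(send)

    # Swap halo data with the neighbouring ranks
    comm.Sendrecv(send, dest=right, recvbuf=recv, source=left)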
The FR in PyFR • FR is a great fit for modern hardware. • Previous GTC talks have outlined the key tenets of an efficient multi-GPU capable implementation: • GTC 2014 — PyFR: Technical Challenges of Bringing Next Generation Fluid Dynamics to GPUs • GTC 2015 — GiMMiK: Generating Bespoke Matrix Multiplication Kernels
PyFR Scaling • Evaluated on the Piz Daint cluster at CSCS. • Test case is a NACA 0021 aerofoil at a high angle of attack. Animation courtesy of J.S. Park
PyFR Strong Scaling [Plot: % of peak FLOP/s vs. number of K20X GPUs, 50–400]
PyFR Weak Scaling [Plot: % of peak FLOP/s vs. number of K20X GPUs, 2–2000; sustained rate of 1.31 PFLOP/s]
So The Solver Scales • There's a lot more to a code than just the solver… and it all needs to scale.
Traditional Visualisation • Traditional visualisation pipeline with PyFR:
Traditional Visualisation • Disk I/O is like device ↔ host transfers, only much slower! [Bar chart: bandwidth in MiB/s, device ↔ host vs. disk]
In-situ Visualisation • Cut out the middle men… • Using ParaView Catalyst it is possible to avoid disk I/O…
In-situ Visualisation • Pipeline with Catalyst: solution → triangle list… • majority of processing performed on the host with VTK.
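As a rough sketch of what that host-side stage produces, the snippet below assembles a single triangle into a vtkPolyData with VTK's Python bindings; the geometry is illustrative, not PyFR's actual extraction code.

    import vtk

    points = vtk.vtkPoints()
    for p in [(0, 0, 0), (1, 0, 0), (0, 1, 0)]:
        points.InsertNextPoint(p)

    tri = vtk.vtkTriangle()
    for i in range(3):
        tri.GetPointIds().SetId(i, i)

    tris = vtk.vtkCellArray()
    tris.InsertNextCell(tri)

    poly = vtk.vtkPolyData()    # the triangle list handed to the pipeline
    poly.SetPoints(points)
    poly.SetPolys(tris)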
In-situ Visualisation • Can we do better? • Yes!
In-situ Visualisation • Interface with PyFR using the plugin infrastructure: a PyFR plugin passes a CUDA device pointer to a C++ shared library, as sketched below.
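A hedged sketch of the idea, assuming a hypothetical plugin class and library; none of these names are PyFR's actual API. The point is that only a raw device pointer crosses the language boundary, so no device-to-host copy is needed.

    import ctypes

    # Hypothetical C++ shared library with a C entry point taking a raw
    # CUDA device pointer and an element count
    lib = ctypes.CDLL('libinsituviz.so')
    lib.process_solution.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

    class InSituPlugin:
        def __call__(self, intg):
            # intg.soln: hypothetical per-element solution arrays on the GPU
            for soln in intg.soln:
                ptr = ctypes.c_void_p(int(soln.gpudata))
                lib.process_solution(ptr, soln.size)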
In-situ Visualisation • Pipeline with Catalyst and VTK-m: solution → triangle list… • all compute performed on the device.
In-situ Visualisation • Kitware: Utkarsh Ayachit, T.J. Corona, David DeMarle, Berk Geveci, Robert Maynard, Robert O'Bara, Patrick O'Leary • NVIDIA: Bhushan Desam, Tom Fogal, Peter Messmer, Jeremy Purches • ORNL: Jack Wells • Zenotech: Mark Allan, Jamil Appa, David Standingford • Imperial College: Andrei Cimpoeru, Arvind Iyer, Jin Seok Park, Brian Vermeire
In-situ Visualisation Animation courtesy of A.S. Ayer
Summary • Funded and supported by [sponsor logos]. • Any questions? • E-mail: freddie.witherden08@imperial.ac.uk • Website: http://pyfr.org