A Technical Overview of PyFR
F.D. Witherden
Department of Ocean Engineering, Texas A&M University
Why Go High-Order? • Greater resolving power per degree of freedom (DOF)… • …and thus fewer overall DOFs for the same accuracy. • Tight coupling between DOFs inside an element… • …reduces indirection and saves memory bandwidth.
Flux Reconstruction • Our high-order method of choice is the flux reconstruction (FR) scheme of Huynh. • It is both unifying and capable of operating effectively on mixed unstructured grids.
PyFR Python + Flux Reconstruction
PyFR • Features.
• Governing equations: compressible and incompressible Navier-Stokes.
• Spatial discretisation: arbitrary order flux reconstruction on mixed unstructured grids (tris, quads, hexes, tets, prisms, and pyramids).
• Temporal discretisation: adaptive explicit Runge-Kutta schemes.
• Precision: single or double.
• Sub-grid scale models: none.
• Platforms: CPU and Xeon Phi clusters, NVIDIA GPU clusters, AMD GPU clusters.
PyFR • High level structure.
• Python outer layer (hardware independent): setup, data, and distributed memory parallelism; the outer loop calls hardware-specific kernels.
• Matrix multiply kernels (interpolation/extrapolation etc.): call GEMM.
• Point-wise nonlinear kernels (flux functions, Riemann solvers etc.): passed as templates through a Mako-derived templating engine.
• Hardware-specific kernels: C/OpenMP, OpenCL, and CUDA.
PyFR • Enables heterogeneous computing from a homogeneous code base.
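To make this concrete, the point-wise kernels are written once as templates and specialised per backend at run time. Below is a minimal sketch of that idea using the Mako templating engine; the kernel body and the dtype parameter are simplified illustrations, not PyFR's actual templates.

    from mako.template import Template

    # One backend-agnostic point-wise kernel template (hypothetical,
    # heavily simplified; PyFR's real templates carry far more context).
    kern = Template("""
    ${dtype} negdivconf(${dtype} tdivf, ${dtype} rcpdjac)
    {
        return -rcpdjac*tdivf;
    }""")

    # The same source rendered for two configurations.
    print(kern.render(dtype='float'))   # e.g. a single-precision CUDA build
    print(kern.render(dtype='double'))  # e.g. a double-precision C/OpenMP build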
PyFR • PyFR can scale up to leadership-class DOE machines and was shortlisted for the 2016 Gordon Bell Prize.
Implementing FR Efficiently 1. Use non-blocking communication primitives. 2. Arrange data in a cache- and vectorisation-friendly manner. 3. Cast key kernels as performance primitives.
Non-Blocking Communication • Time to solution is heavily impacted by the parallel scaling of a code. • This, in turn, is influenced by the amount of communication performed at each time step.
Non-Blocking Communication • If a code is to strong scale, it is therefore essential that it overlap communication with computation.
Non-Blocking Communication • [Timeline, blocking: MPI Recv, then compute kernels A, B, C, and D, then MPI Send, all serialised in time.]
Non-Blocking Communication • [Timeline, non-blocking: MPI IRecv and MPI ISend are posted up front; compute kernels A, C, and D proceed while messages are in flight; an MPI Wait precedes kernel B, which needs the received data.]
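A minimal mpi4py sketch of this pattern; the neighbour rank, buffer sizes, and stand-in compute "kernels" are illustrative placeholders rather than PyFR's internals.

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    nbr = (comm.rank + 1) % comm.size          # placeholder neighbour rank

    sbuf = np.ones(1024)                       # outgoing interface data
    rbuf = np.empty(1024)                      # incoming interface data

    # Post the communication up front...
    reqs = [comm.Irecv(rbuf, source=nbr), comm.Isend(sbuf, dest=nbr)]

    # ...overlap it with work needing no remote data (kernels A, C, and D)...
    interior = sbuf.sum()

    # ...and block only when the remote data is required (kernel B).
    MPI.Request.Waitall(reqs)
    boundary = rbuf.sum()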
Implementing FR Efficiently 1. Use non-blocking communication primitives. 2. Arrange data in a cache- and vectorisation-friendly manner. 3. Cast key kernels as performance primitives.
Data Layouts • FR is very often a memory-bandwidth-bound algorithm. • It is therefore vital that a code arrange its data in a way that enables it to extract a high fraction of peak bandwidth.
Data Layouts • Three main layouts: • AoS • SoA • AoSoA(k)
Data Layouts: AoS struct { float rho; float rhou; float E; } data[NELES];
Data Layouts: AoS • Cache and TLB friendly. • Difficult to vectorise.
Data Layouts: SoA struct { float rho[NELES]; float rhou[NELES]; float E[NELES]; } data;
Data Layouts: SoA • Trivial to vectorise. • Can put pressure on the TLB and/or hardware prefetchers.
Data Layouts: AoSoA(k = 2) struct { float rho[k]; float rhou[k]; float E[k]; } data[NELES / k];
Data Layouts: AoSoA(k = 2) • Can be vectorised efficiently for a suitable k. • Cache and TLB friendly.
Data Layouts: AoSoA(k = 2) • The ideal ‘Goldilocks’ solution… • …albeit at the cost of messy indexing… • …and compilers require coaxing to vectorise it.
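The indexing cost is easy to see in a small numpy sketch; the sizes and names below are illustrative. The same field is held in SoA and AoSoA(k) form and element e's density is read from each.

    import numpy as np

    neles, nvars, k = 1024, 3, 8     # illustrative; nvars = (rho, rhou, E)

    soa = np.arange(nvars*neles, dtype=np.float32).reshape(nvars, neles)

    # AoSoA(k): block elements in groups of k with variables interleaved.
    aosoa = np.ascontiguousarray(
        soa.reshape(nvars, neles // k, k).transpose(1, 0, 2))

    e, var = 37, 0                   # element 37, variable 0 (rho)
    assert soa[var, e] == aosoa[e // k, var, e % k]   # the messy indexing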
Data Layouts: AoSoA(k) Results • FR with SoA vs FR with AoSoA on an Intel KNL. • [Bar chart: time per DOF per RK stage in ns, from 0 to 12, for p = 1 through p = 4.]
Implementing FR Efficiently 1. Use non-blocking communication primitives. 2. Arrange data in a cache- and vectorisation-friendly manner. 3. Cast key kernels as performance primitives.
Performance Primitives • On modern hardware it can be extremely difficult to extract a high percentage of peak FLOP/s in otherwise compute-bound kernels. • To this end it is important—where possible—to cast operations in terms of performance primitives .
Performance Primitives • Have data at one set of points within an element and want to interpolate it to another set of points. • This defines a constant operator matrix M whose entries are the nodal basis polynomials evaluated at the target points, M_ij = ℓ_j(x̃_i).
Performance Primitives • For a single element this operation can be recognised as a matrix-vector product (GEMV), u = Mv. • If we are working in transformed space then M is the same for all elements. • Stacking the per-element vectors as the columns of V, the operation over all elements can be cast as a matrix-matrix product (GEMM), U = MV.
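A minimal numpy sketch of this cast, with illustrative sizes: one GEMM over all elements reproduces the per-element GEMVs.

    import numpy as np

    npts, ntgts, neles = 8, 12, 4096   # illustrative sizes
    M = np.random.rand(ntgts, npts)    # constant operator matrix
    V = np.random.rand(npts, neles)    # one column of point data per element

    U = M @ V                                  # a single GEMM...
    assert np.allclose(U[:, 2], M @ V[:, 2])   # ...matches per-element GEMVs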
Performance Primitives • Both GEMV and GEMM are performance primitives and optimised implementations are readily available from vendor BLAS libraries. • These routines can perform an order of magnitude better than hand-rolled routines.
Performance Primitives • In FR the operator matrix M can sometimes be sparse. • This calls for more specialised primitives, such as those found in GiMMiK and libxsmm, which account for the size and sparsity of FR operators.
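To illustrate the idea behind such bespoke kernels (this is not GiMMiK's actual API, and the real library generates native GPU/CPU kernels rather than Python), here is a toy generator that bakes the nonzero structure of a small operator matrix into fully unrolled source:

    import numpy as np

    def gen_unrolled(M, tol=1e-12):
        # Emit source for y = M @ x with M's nonzeros fixed at generation time.
        lines = ['def kernel(x, y):']
        for i, row in enumerate(M):
            terms = [f'{float(v)!r}*x[{j}]'
                     for j, v in enumerate(row) if abs(v) > tol]
            lines.append(f'    y[{i}] = ' + (' + '.join(terms) or '0.0'))
        return '\n'.join(lines)

    M = np.array([[1.0, 0.0, -1.0],
                  [0.0, 2.0,  0.0]])
    ns = {}
    exec(gen_unrolled(M), ns)                  # compile the generated kernel

    x, y = np.array([3.0, 4.0, 5.0]), np.empty(2)
    ns['kernel'](x, y)                         # y now equals M @ x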
Summary • Use non-blocking communication primitives. • Arrange data in a cache- and vectorisation-friendly manner. • Cast key kernels as performance primitives.