PyOP2: A performance portable unstructured mesh framework

  1. PyOP2: A performance portable unstructured mesh framework
     Graham Markall, Florian Rathgeber, Nicolas Loriant, Gheorghe-Teodor Bercea, David Ham, Paul Kelly – Imperial College London
     Lawrence Mitchell – EPCC, Edinburgh
     Mike Giles, Gihan Mudalige – Oxford University
     Istvan Reguly – Pazmany Peter Catholic University

  2. • Performance portability: platform-agnostic performance without source code changes
     • It is essential for performance portability that both the kernel and its call site are generated:
       – GPU: kernel call, shared-memory staging
       – CPU: AVX vectorisation, data movement
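     For example, the same parallel loop invocation can be realised as an OpenMP loop nest on a CPU or as a CUDA kernel launch with shared-memory staging on a GPU, with no change to the application source.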

  3. PyOP2
     • Driving application: finite element assembly
     • Hardware-specific performance optimisations in the form compiler break modularity
     • Based on OP2 – a statically compiled C++ API
     • Python re-implementation:
       – JIT compilation
       – Linear algebra
       – Iteration spaces

  4. PyOP2 Overview
     • Application code sits above the API boundary
     • Runtime core with linear algebra (PETSc, CUSP)
     • Backends: sequential C, C with OpenMP, OpenCL, CUDA; MPI
     • Code generation: Instant, PyOpenCL, PyCUDA

  5. Data declarations
     (Figure: a mesh of two triangular cells, cells 0 and 1, over dofs 0–3)

     dofs = op2.Set(4)
     cells = op2.Set(2)
     cell_dof = op2.Map(cells, dofs, 3, [0, 1, 3,
                                         2, 3, 1])
     dof_vals = op2.Dat(dofs, 1, [0.0, 0.0, 0.0, 0.0])
     cell_vals = op2.Dat(cells, 1, [1.0, 2.0])
     sparsity = op2.Sparsity([(cell_dof, cell_dof)])
     mat = op2.Mat(sparsity)
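     cell_dof maps each of the two cells to its three dofs (cell 0 → {0, 1, 3}, cell 1 → {2, 3, 1}); the sparsity pairs this map with itself to describe the nonzero structure of the matrix.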

  6. Kernel and parallel loop

     user_kernel = op2.Kernel("""
     void kernel(double *dof_val, double *cell_val) {
         for (int i = 0; i < 3; i++)
             dof_val[i] += *cell_val;
     }""", "kernel")

     op2.par_loop(user_kernel, cells,
                  dof_vals(cell_dof, op2.INC),
                  cell_vals(op2.IdentityMap, op2.READ))
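     For the mesh declared above, the loop increments each cell's value into its three dofs: cell 0 adds 1.0 to dofs 0, 1 and 3, and cell 1 adds 2.0 to dofs 2, 3 and 1, leaving dof_vals = [1.0, 3.0, 2.0, 3.0].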

  7. Iteration spaces – design + API
     • Entry-to-thread mapping should be handled by the runtime, not the user kernel
     • Define the user kernel in terms of one matrix entry; a sketch follows below

     op2.par_loop(kernel, cells(3, 3),
                  mat(cell_dof[op2.i[0]], cell_dof[op2.i[1]]), *args)

     op2.par_loop(kernel, cells(12, 12),
                  mat(cell_dof[op2.i[0]], cell_dof[op2.i[1]]), *args)
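     A per-entry kernel might look like the following (a hypothetical sketch: the kernel name, arguments and body are illustrative placeholders, not the actual generated assembly code; coords is assumed to be a Dat of vertex data declared like dof_vals above):

     entry_kernel = op2.Kernel("""
     void entry_kernel(double A[1][1], double *x[3], int i, int j) {
         /* Compute the single (i, j) entry of the 3x3 element matrix;
            i and j are supplied by the runtime from the iteration space. */
         A[0][0] += x[i][0] * x[j][0];
     }""", "entry_kernel")

     op2.par_loop(entry_kernel, cells(3, 3),
                  mat(cell_dof[op2.i[0]], cell_dof[op2.i[1]]),
                  coords(cell_dof, op2.READ))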

  8. Iteration spaces – motivation
     • Multiple element matrices of 144 entries each (12×12)
     • One thread per tile – but what should the tile size be?

     Per element (one thread computes all 144 entries):

     void user_kernel(...) {
         for (ele = TID; ele < n; ele += NT)
             for (i = 0; i < 12; i++)
                 for (j = 0; j < 12; j++)
                     A[i,j] += ...
     }

     Per 4×4 tile (nine threads per element, one tile each):

     void user_kernel(...) {
         patch_i = TID % 3;
         patch_j = (TID % 9) / 3;
         for (ele = TID / 9; ele < n; ele += NT / 9)
             for (i = 0; i < 4; i++)
                 for (j = 0; j < 4; j++)
                     A[patch_i*4 + i, patch_j*4 + j] += ...
     }
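     With 4×4 tiles, the 144 entries of each element matrix are covered by 9 threads computing 16 entries each instead of 1 thread computing all 144; since the best tile size depends on the hardware, the runtime rather than the user kernel should choose it.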

  9. Iteration spaces – code generation
     The user kernel is written for a single matrix entry:

     user_kernel(..., int i, int j) {
         A[i,j] += ...
     }

     Generated wrapper, one thread per 4×4 tile:

     for (ele = TID / 9; ele < n; ele += NT / 9) {
         patch_i = TID % 3;
         patch_j = (TID % 9) / 3;
         for (i = 0; i < 4; i++)
             for (j = 0; j < 4; j++) {
                 ki = patch_i*4 + i;
                 kj = patch_j*4 + j;
                 user_kernel(..., ki, kj);
                 addto(matrix, ki, kj, ele);
             }
     }

     Generated wrapper, one thread per element:

     for (ele = TID; ele < n; ele += NT)
         for (i = 0; i < 12; i++)
             for (j = 0; j < 12; j++) {
                 user_kernel(..., i, j);
                 addto(matrix, i, j, ele);
             }
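     Because the generated wrapper, not the user kernel, maps threads to entries and performs the addto, the runtime is free to change the tiling or traversal order without any change to user code.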

  10. Parallel Execution
      • Two key optimisations:
        – Partitioning
        – Colouring
      (Figure: a partitioned mesh, with vertices, edges and cross-partition edges highlighted)

  11. Parallel Execution
      • Two key optimisations:
        – Partitioning
        – Colouring: elements of the edge set are coloured to avoid races due to concurrent updates to shared nodes
      (Figure: a partitioned mesh, with vertices, edges and cross-partition edges highlighted)
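      The idea can be sketched in a few lines (a minimal illustration of greedy edge colouring, not PyOP2's actual plan construction; the edge representation is hypothetical):

      # Greedy colouring: edges sharing a vertex get different colours,
      # so same-coloured edges can update their endpoints concurrently.
      def colour_edges(edges):
          colours = []
          for v0, v1 in edges:
              # colours already taken by coloured edges touching v0 or v1
              used = {c for (w0, w1), c in zip(edges, colours)
                      if v0 in (w0, w1) or v1 in (w0, w1)}
              c = 0
              while c in used:   # smallest colour unused by any neighbour
                  c += 1
              colours.append(c)
          return colours

      colour_edges([(0, 1), (1, 2), (2, 3), (3, 0)])  # -> [0, 1, 0, 1]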

  12. Parallel execution
      Parallel loop → generate plan* → generate code* → execute kernel
      (* cached)
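      The caching step might be sketched as follows (a hypothetical memoisation scheme for illustration only; PyOP2's real cache keys and entry points differ):

      _cache = {}

      def cached(kind, key, build):
          # Return the cached artefact for (kind, key), building it on a miss
          try:
              return _cache[(kind, key)]
          except KeyError:
              artefact = _cache[(kind, key)] = build()
              return artefact

      # Key on everything that influences the result, e.g. the kernel
      # source and the access descriptors of the arguments:
      #   plan = cached("plan", (iterset, arg_layout), build_plan)
      #   code = cached("code", (kernel_source, arg_layout), generate_code)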

  13. Summary
      • By taking control of the data layout,
      • generating the data movement code, and
      • using its freedom to manage the iteration space,
      • PyOP2 provides performance portability for unstructured mesh applications
      In the future, this will allow:
      • AVX vectorisation for CPU
      • Multi-GPU support with CUDA+MPI

  15. Spare / unused slides

  16. Colouring

      __device__ void user_kernel(args...) { ... }

      __global__ void wrap_user_kernel__(args) {
          for (partition = 0; partition < np; partition++) {
              /* Stage in data for partition */
              for (col = 0; col < ncol; col++) {
                  for (i = 0; i < itspace_i; i++)
                      for (j = 0; j < itspace_j; j++)
                          user_kernel(..., i, j);
              }
              /* Stage out data for partition */
          }
      }

      Host side, one launch per colour (PyCUDA):

      for col in xrange(plan.ncolors):
          fun.prepared_async_call(grid_size, block_size, stream,
                                  *arglist, shared_size=shared_size)
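      Colouring thus appears at two levels: the host launches one grid per block colour, while inside each launch the wrapper sweeps the element colours within a partition, so no two concurrent threads update the same dof.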

  17. API
      • Data declarations:
        – Sets: vertices, edges, cells, etc.
        – Dats: data on sets (pressure, velocity)
        – Maps: represent connectivity (cells → vertices)
        – Sparsities: matrix structure
        – Mats: matrix data
      • Parallel execution:
        – Kernel definition
        – Parallel loop invocation

  18. Data declarations
      • The runtime is free to manage the data structures
      • The user is freed from having to manage data
      • Data is wrapped in NumPy arrays – an accessor is available when necessary
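      For example (a small sketch, assuming the accessor on a Dat is exposed as a data property returning the wrapped NumPy array):

      vals = dof_vals.data   # NumPy view of the Dat declared on slide 5
      total = vals.sum()     # ordinary NumPy operations apply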

  19. Kernel and parallel loop
      • A kernel expresses the computation for a single set element
      • The parallel loop may traverse the set in any order
      • Dat arguments are accessed:
        – directly, with the identity map
        – indirectly, through a map
        – with access modes READ, WRITE, RW, INC, MAX, MIN
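      In the parallel loop on slide 6, for instance, cell_vals is accessed directly (op2.IdentityMap, op2.READ) while dof_vals is accessed indirectly through cell_dof with op2.INC, so concurrent increments to a shared dof are combined safely.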

  20. CUDA/OpenCL Execution
      • Coalescing: little opportunity on unstructured meshes
      • Staging into shared memory is used instead
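      Neighbouring elements of an unstructured mesh are scattered in memory, so a warp's indirect loads rarely touch contiguous addresses; staging copies a partition's data into shared memory with coalesced accesses once, after which the irregular accesses hit fast shared memory.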

  21. Parallel Execution
      • Two key optimisations:
        – Partitioning
        – Colouring, at two levels: partitions are coloured against each other, and elements within a partition are coloured
      (Figure: a partitioned mesh, with vertices, edges and cross-partition edges highlighted)
