
The PyOP2 abstraction, its role in Firedrake, and some optimisations that it enables



  1. The PyOP2 abstraction, its role in Firedrake, and some optimisations that it enables
     Paul H J Kelly
     Group Leader, Software Performance Optimisation
     Co-Director, Centre for Computational Methods in Science and Engineering
     Department of Computing, Imperial College London
     Joint work with:
     - David Ham (Imperial Computing/Maths/Grantham Inst for Climate Change)
     - Gerard Gorman (Imperial Earth Science Engineering, Applied Modelling and Computation Group)
     - Mike Giles, Gihan Mudalige, Istvan Reguly (Mathematical Inst, Oxford)
     - Doru Bercea, Fabio Luporini, Graham Markall, Lawrence Mitchell, Florian Rathgeber, George Rokos (Software Perf Opt Group, Imperial Computing)
     - Spencer Sherwin (Aeronautics, Imperial), Chris Cantwell (Cardio-mathematics group, Mathematics, Imperial)
     - Michelle Mills Strout, Chris Krieger, Cathie Olschanowsky (Colorado State University)
     - Carlo Bertolli (IBM Research)
     - Ram Ramanujam (Louisiana State University)

  2. What we are doing…
     [Diagram with four columns: Projects, Contexts, Technologies, Applications]
     Projects:
     - PyOP2/OP2: unstructured-mesh stencils
     - Firedrake: finite-element assembly
     - PAMELA: dense SLAM, 3D vision
     - PRAgMaTIc: dynamic mesh adaptation
     - GiMMiK: small-matrix multiplication
     - TINTL: Fourier interpolation
     Contexts:
     - Finite-volume CFD
     - Finite-element
     - Real-time 3D scene understanding
     - Adaptive-mesh CFD
     - Unsteady CFD, higher-order flux-reconstruction
     - Ab-initio computational chemistry (ONETEP)
     Technologies:
     - Vectorisation, parametric polyhedral tiling
     - Tiling for unstructured-mesh stencils
     - Lazy, data-driven compute-communicate
     - Runtime code generation
     - Multicore graph worklists
     - Massive common sub-expressions
     - Optimisation of composite transforms
     Applications:
     - Aeroengine turbo-machinery
     - Weather and climate
     - Tidal turbines
     - Domestic robotics, augmented reality
     - Formula-1, UAVs
     - Solar energy, drug design
     Targeting MPI, OpenMP, OpenCL, Dataflow/FPGA, from supercomputers to mobile, embedded and wearable

  3. This talk
     - OP2 and PyOP2: a stencil "DSL" for unstructured meshes
       - An instance of a "decoupled access-execute" model
     - Firedrake: a compiler for a higher-level DSL
       - That uses PyOP2 as an intermediate representation (IR)
     - COFFEE: a domain-specific compiler for kernels
     This talk's message:
     - Optimise at the right level of abstraction
     - Stencil ideas generalise
     - The "DSL" can be an IR (and can look like a library)
     - Runtime code generation can be incredibly powerful

  4. From DSL to loop chains
     Firedrake provides a DSL for finite element methods:

         phi, p = Function(mesh, …)
         …
         while not convergence:
             …
             phi -= dt / 2 * p                                      # loop over the mesh
             if …:
                 p += assemble(dt * inner(nabla_grad(v), …) * dx)   # loop over the mesh
             else:
                 solve(…)                                           # call to third-party library
             …
             phi += dt / 2 * p                                      # loop over the mesh
             …

     Each of these loops is implemented in PyOP2.

  5. The OP2/PyOP2 programming model

         void incrVertices(double* e_weight, double* v1, double* v2) {
           *v1 += f(e_weight);
           *v2 += f(e_weight);
         }

         op_par_loop(incrVertices, edges,
                     op_arg_dat(edgeWeight, -1, OP_ID, OP_READ),
                     op_arg_dat(vertexDat, 0, edges2vertices, OP_INC),
                     op_arg_dat(vertexDat, 1, edges2vertices, OP_INC));
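To make the meaning of the op_par_loop on this slide concrete, here is a minimal sequential sketch in Python with NumPy. It is not OP2 or PyOP2 code: `par_loop_edges` is a hypothetical helper, the slide's undefined `f` is assumed to be the identity, and the small path-graph mesh is made up for illustration. It shows the kernel being applied once per edge, with the two vertex arguments reached indirectly through the `edges2vertices` map.

```python
import numpy as np


def incr_vertices(e_weight, v1, v2):
    """Per-edge kernel: add the edge weight to both endpoint vertices.

    The slide applies an undefined function f to the weight; the identity
    is assumed here.
    """
    v1[0] += e_weight[0]
    v2[0] += e_weight[0]


def par_loop_edges(kernel, edge_weight, vertex_dat, edges2vertices):
    """Hypothetical sequential reference for the parallel loop over edges.

    edge_weight is accessed directly (one entry per edge); vertex_dat is
    accessed indirectly through the map, i.e. as A[B[i]].
    """
    for e in range(edges2vertices.shape[0]):
        i0, i1 = (int(i) for i in edges2vertices[e])
        # One-element slices are views, so the kernel updates vertex_dat in place.
        kernel(edge_weight[e:e + 1],
               vertex_dat[i0:i0 + 1],
               vertex_dat[i1:i1 + 1])


# Made-up mesh: a path of four vertices joined by edges 0-1, 1-2, 2-3.
edges2vertices = np.array([[0, 1], [1, 2], [2, 3]])
edge_weight = np.array([1.0, 10.0, 100.0])
vertex_dat = np.zeros(4)

par_loop_edges(incr_vertices, edge_weight, vertex_dat, edges2vertices)
```

Because interior vertices are shared by two edges, running the iterations concurrently without care would race on `vertex_dat`; the OP_INC access descriptor is what tells the runtime that these updates are increments.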

  6. The OP2/PyOP2 programming model (same kernel and op_par_loop as slide 5)
     Annotation: INDIRECT MEMORY ACCESSES ( A[B[i]] )

  7. The OP2/PyOP2 programming model (same kernel and op_par_loop as slide 5)
     Annotation: INDIRECT MEMORY ACCESSES ( A[B[i]] )
     Followed by: op_par_loop (X, cells, …)

  8. Loop chains in OP2/PyOP2
     - op_par_loop (incrVertices, edges, …) as on slide 5, with INDIRECT MEMORY ACCESSES ( A[B[i]] )
     - op_par_loop (X, cells, …)
     - Synchronization point (function call, e.g. PETSc)
     - op_par_loop (Y, vertices, …)
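One way to picture a loop chain is as the run of parallel loops between two synchronization points, together with the access descriptors that say which data each loop touches. The sketch below is an illustration only, not the OP2 or PyOP2 implementation. `ParLoop`, `SYNC`, `split_into_chains` and `dependences` are hypothetical names, and because the slide elides the arguments of loops X and Y, the loops `cell_update` and `vertex_scale` and their arguments are invented stand-ins.

```python
from dataclasses import dataclass

READ, WRITE, INC = "READ", "WRITE", "INC"


@dataclass
class ParLoop:
    kernel: str   # kernel name
    iterset: str  # set the loop iterates over
    args: tuple   # (dat name, access mode) pairs


# Marker for a call outside the model, e.g. a solver library.
SYNC = object()


def split_into_chains(program):
    """Cut a sequence of loops into chains at every synchronization point."""
    chains, current = [], []
    for item in program:
        if item is SYNC:
            if current:
                chains.append(current)
            current = []
        else:
            current.append(item)
    if current:
        chains.append(current)
    return chains


def dependences(chain):
    """List (earlier, later, dat) where both loops touch dat and one modifies it."""
    deps = []
    for i, a in enumerate(chain):
        for b in chain[i + 1:]:
            for dat_a, acc_a in a.args:
                for dat_b, acc_b in b.args:
                    dep = (a.kernel, b.kernel, dat_a)
                    modifies = acc_a != READ or acc_b != READ
                    if dat_a == dat_b and modifies and dep not in deps:
                        deps.append(dep)
    return deps


program = [
    # Arguments taken from the slide's op_par_loop.
    ParLoop("incrVertices", "edges",
            (("edgeWeight", READ), ("vertexDat", INC), ("vertexDat", INC))),
    # Invented loops standing in for X and Y, whose arguments the slide elides.
    ParLoop("cell_update", "cells", (("vertexDat", READ), ("cellDat", WRITE))),
    SYNC,
    ParLoop("vertex_scale", "vertices", (("vertexDat", WRITE),)),
]

chains = split_into_chains(program)
deps_first_chain = dependences(chains[0])
```

With this information available at run time, a tool can reason about which loops in a chain must be ordered and which iterations could be reorganised across loops.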

  9. Implementation of an op_par_loop in CUDA

         void incrVertices(double* e, double* v1, double* v2) {
           *v1 += *e;
           *v2 += *e;
         }

         op_par_loop(incrVertices, edges,
                     op_arg_dat(edgeWeight, -1, OP_ID, OP_READ),
                     op_arg_dat(vertexDat, 0, edges2vertices, OP_INC),
                     op_arg_dat(vertexDat, 1, edges2vertices, OP_INC));

     Coloring is used to avoid race conditions in shared-memory parallel execution.
     [Figure: colored mesh]
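The coloring idea can be sketched in a few lines of Python with NumPy. This is an illustration under stated assumptions, not the scheme OP2 generates for CUDA: `color_edges` is a hypothetical greedy colouring, the four-edge cycle mesh is made up, and each colour is processed as one vectorised batch in place of a parallel kernel launch. Edges that share a vertex receive different colours, so all edges of one colour can increment their vertices at the same time without conflicts.

```python
import numpy as np


def color_edges(edges2vertices, n_vertices):
    """Greedy colouring: edges that share a vertex get different colours."""
    used = [set() for _ in range(n_vertices)]  # colours already incident to each vertex
    colors = []
    for v0, v1 in edges2vertices.tolist():
        c = 0
        while c in used[v0] or c in used[v1]:
            c += 1
        colors.append(c)
        used[v0].add(c)
        used[v1].add(c)
    return np.array(colors)


# Made-up mesh: a cycle of four vertices.
edges2vertices = np.array([[0, 1], [1, 2], [2, 3], [3, 0]])
edge_weight = np.array([1.0, 10.0, 100.0, 1000.0])
vertex_dat = np.zeros(4)

colors = color_edges(edges2vertices, n_vertices=4)

# Process one colour at a time. Within a colour no vertex is touched by two
# edges, so the fancy-indexed += below has no duplicate indices and is safe.
for c in range(int(colors.max()) + 1):
    batch = np.flatnonzero(colors == c)
    for k in range(2):  # the two endpoints of each edge
        vertex_dat[edges2vertices[batch, k]] += edge_weight[batch]
```

The price of this approach is one pass per colour, which is why keeping the number of colours small matters in practice.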

  10. Implementation of an op_par_loop in CUDA (same kernel and op_par_loop as slide 9)
      Coloring is used to avoid race conditions in shared-memory parallel execution.
      Each partition is assigned to a thread block and further colored.
      [Figure: partitioned and colored mesh]

