Optimising SpMV for FEM on FPGAs
Paul Grigoras, Pavel Burovskiy, Wayne Luk, Spencer Sherwin
Finite Element Methods
● Solve PDEs over large, unstructured geometries
● PDEs: incompressible Navier-Stokes, shallow water equations, etc.
● Applications: computational fluid dynamics, biomedicine, geoscience, etc.
Finite Element Methods
Mesh over unstructured domain → Mesh elements → Assembly → Sparse Matrix → PDE Solver → Iterative Linear Solver ⇒ SpMV
● Vector Gather/Scatter: [Burovskiy FPL15]
● Block Diagonal SpMV: this work
(Figures: mesh over an unstructured domain and the resulting CFD simulation; source: www.nektar.info)
Overview
● Point of departure: focus on high order, spectral/hp FEM with local assembly
○ block diagonal SpMV (this work) vs generic SpMV (prior work)
Block SpMV
● Each dense block corresponds to one element
● Larger dense blocks ⇒ more structured computation
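As a point of reference, the sketch below shows the block-diagonal SpMV computation on the host in plain C++. It assumes the dense blocks are simply concatenated in row-major order, which is an illustrative layout rather than the on-device format.

    // Minimal host-side sketch of block-diagonal SpMV (illustrative layout,
    // not the on-device format): dense blocks concatenated in row-major order.
    #include <cstddef>
    #include <vector>

    void block_diag_spmv(const std::vector<int>& block_sizes,   // one n per mesh element
                         const std::vector<double>& values,     // concatenated n x n blocks
                         const std::vector<double>& x,
                         std::vector<double>& y) {
      std::size_t val = 0, off = 0;          // running offsets into values and x/y
      for (int n : block_sizes) {            // one dense n x n block per element
        for (int i = 0; i < n; ++i) {
          double acc = 0.0;
          for (int j = 0; j < n; ++j)
            acc += values[val + i * n + j] * x[off + j];
          y[off + i] = acc;
        }
        val += static_cast<std::size_t>(n) * n;
        off += n;
      }
    }

The property exploited by the hardware is visible here: within a block, accesses to the values and to x are fully regular, so no per-nonzero metadata is needed.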
Overview (continued)
● Contributions:
○ Optimised architecture and implementation for block diagonal SpMV
○ Resource-constrained performance model for the proposed architecture
○ Automated method to customise the architecture based on mesh parameters
● Result: a custom, mesh-specific architecture generator
○ Maximise throughput/area ⇒ fit larger meshes & improve performance
Architecture
● Each MPE has:
○ an independent memory channel
○ a customisable precision datapath
○ a variable depth FIFO, to support block size variations at runtime
● Design:
○ parametric: NMPEs, MPEwidth
○ task vs data parallelism tradeoff
○ ⇒ mesh-specific optimal configuration
● Block SpMV advantages:
⇒ simplified control (format decoding)
⇒ reduced metadata
⇒ simplified reduction circuit
Parameter Extraction
● Assume the matrix is block diagonal
● Extract mesh parameters: size and number of blocks for each element
● In DSE: find and synthesise optimal architectures
● At runtime: select the appropriate architecture
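As an illustration of the extraction step, the sketch below builds a block-size histogram from a per-element list of block sizes; the function name and input representation are assumptions for this example, not the CASK interface.

    // Illustrative parameter extraction: histogram of block sizes over the mesh.
    // The representation and names are assumptions, not the CASK API.
    #include <map>
    #include <vector>

    std::map<int, int> extract_block_histogram(const std::vector<int>& block_sizes) {
      std::map<int, int> histogram;          // block size -> number of such blocks
      for (int n : block_sizes)
        ++histogram[n];
      return histogram;                      // drives DSE and runtime selection
    }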
Performance Model
● Mesh parameters ⇒ optimal architecture parameters
● Models for performance and resource usage, subject to functional and hardware constraints
⇒ See paper for details
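The slide's equations are not reproduced here. As a hedged first-order illustration (not the paper's exact model), a resource-constrained performance model for block-diagonal SpMV typically takes the following shape, where n_e is the block size of element e, f the clock frequency, and R(·) the resource usage of a configuration:

    \[
      t_{\mathrm{exec}} \;\approx\; \frac{\sum_{e} n_e^{2}}{N_{\mathrm{MPE}} \cdot \mathrm{MPE}_{\mathrm{width}} \cdot f}
      \qquad \text{subject to} \qquad
      R(N_{\mathrm{MPE}}, \mathrm{MPE}_{\mathrm{width}}) \le R_{\mathrm{FPGA}}, \quad
      N_{\mathrm{MPE}} \cdot bw_{\mathrm{chan}} \le bw_{\mathrm{DRAM}}
    \]

The optimal (NMPEs, MPEwidth) pair is then the one that minimises the modelled execution time within the resource and bandwidth constraints.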
Runtime
● Software layer that can be integrated in existing FEM software packages
● Reorder blocks to enforce a linear access pattern in DRAM:
○ maximise throughput
○ minimise control logic
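A minimal sketch of the reordering idea, assuming blocks are assigned to memory channels in contiguous ranges so that each channel streams its data linearly; the layout and names are illustrative, not the actual runtime's data structures.

    // Illustrative reordering: pack blocks into one linear buffer per memory
    // channel so each channel streams contiguously from DRAM (names assumed).
    #include <cstddef>
    #include <vector>

    std::vector<std::vector<double>> pack_per_channel(
        const std::vector<int>& block_sizes,
        const std::vector<double>& values,   // concatenated dense blocks
        int n_mpe) {
      std::vector<std::vector<double>> streams(n_mpe);
      std::size_t off = 0;
      for (std::size_t b = 0; b < block_sizes.size(); ++b) {
        std::size_t len = static_cast<std::size_t>(block_sizes[b]) * block_sizes[b];
        // contiguous range of blocks per channel => linear reads on each channel
        std::size_t chan = b * n_mpe / block_sizes.size();
        streams[chan].insert(streams[chan].end(),
                             values.begin() + off, values.begin() + off + len);
        off += len;
      }
      return streams;
    }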
Putting it Together
● Offline tuning: build a repository of customised architectures from a set of mesh instances
● Runtime: select the optimal architecture for an input mesh instance
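To make the selection step concrete, the sketch below picks, from a repository of pre-built configurations, the one with the lowest modelled runtime for the input mesh's block-size histogram. The types, names and cost model are illustrative assumptions, not the actual CASK implementation.

    // Illustrative runtime selection (C++14): choose the pre-built configuration
    // with the lowest modelled runtime for this mesh. Types and the cost model
    // are assumptions for the example, not the actual CASK implementation.
    #include <limits>
    #include <map>
    #include <vector>

    struct ArchConfig {
      int n_mpe;       // number of MPEs (task parallelism)
      int mpe_width;   // vector lanes per MPE (data parallelism)
      // + path to the pre-built bitstream, etc.
    };

    double modelled_cycles(const ArchConfig& a, const std::map<int, int>& hist) {
      double cycles = 0.0;
      for (const auto& kv : hist) {
        int n = kv.first, count = kv.second;
        int row_cycles = (n + a.mpe_width - 1) / a.mpe_width;  // ceil(n / width)
        cycles += static_cast<double>(count) * n * row_cycles;
      }
      return cycles / a.n_mpe;               // blocks spread across the MPEs
    }

    ArchConfig select_architecture(const std::vector<ArchConfig>& repo,
                                   const std::map<int, int>& hist) {
      ArchConfig best = repo.front();        // assumes a non-empty repository
      double best_cycles = std::numeric_limits<double>::max();
      for (const auto& a : repo) {
        double c = modelled_cycles(a, hist);
        if (c < best_cycles) { best_cycles = c; best = a; }
      }
      return best;
    }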
Evaluation
● Implementation:
○ Design: MaxCompiler + MaxJ dataflow language
○ FPGA server: Maxeler MAX4 Maia (Stratix V, 48 GB DRAM per board)
○ Software: C++14, G++ 5.2
○ CPU server: dual Intel Xeon E5-2640, 64 GB DRAM, Infiniband QSFP
○ Place and route with Altera Quartus 14.1
○ Available as an extension to the CASK framework [Grigoras et al, FPGA 16]: http://caskorg.github.io/cask/
● Reference software: Nektar++ FEM package, http://www.nektar.info/
● Reference hardware: [Burovskiy et al, FPL 15], Nektar++ accelerated FEM
Experiments
1. What is the benefit of tuning the architecture based on mesh properties?
a. Fixed mesh (NACA 1L, [Burovskiy et al, FPL 2015]), optimal architecture
● Compute efficiency is maximised for smaller MPE Width
● Achieved DRAM bandwidth is maximised for larger MPE Width
⇒ aggressive tuning (maximum MPE Width) is not resource efficient
Experiments (continued)
a. Fixed mesh, optimal architecture:
⇒ find an architecture with good efficiency
⇒ improve performance subject to resource usage
b. Fixed architecture, variable mesh: data parallel vs task parallel
A1: more memory channels, fewer vector lanes, 1058 BRAMs ⇒ good for small blocks (~2X better there)
A2: fewer memory channels, more vector lanes, 686 BRAMs ⇒ good for large blocks (~2X better there)
Experiments (continued)
b. Fixed architecture, variable mesh:
⇒ select optimal MPE Width and NMPEs for a given mesh
⇒ improve performance, reduce resource usage
2. What is the expected benefit for a full FEM implementation?
a. Baseline: Nektar++ implementation from [Burovskiy et al, FPL 2015]
Experiments (continued)
a. Full FEM implementation vs the Nektar++ baseline from [Burovskiy et al, FPL 2015]:
⇒ enables larger problem sizes, not supported by previous work
⇒ enables a good proportion of the projected speedup (3X over CPU)