FBLAS: Streaming Linear Algebra Kernels on FPGA

spcl.inf.ethz.ch @spcl_eth
Tiziano De Matteis, Johannes de Fine Licht and Torsten Hoefler
FBLAS: Streaming Linear Algebra Kernels on FPGA
5th International Workshop on Heterogeneous High-performance Reconfigurable Computing

FPGA for HPC
Modern high-performance FPGAs are attractive for HPC workloads:
- they are offered with native floating-point units (DSPs), HBM, network interfaces, …
However, they are rarely considered in HPC:
- Productivity: HLS and OpenCL ease the programmer's life
- Tools and libraries: there is a lack of maintained, publicly available and reusable components
We contribute FBLAS, an open-source project:
- the first open-source (HLS) and complete BLAS available for FPGA
- numerical module interfaces are designed to natively support streaming communication across on-chip connections
github.com/spcl/FBLAS

FBLAS: library design
HLS Modules: implement numerical routines (e.g. DOT, GEMV, …):
- exploit spatial parallelism and fast on-chip memory
- have a streaming interface to enable communication through on-chip FIFO buffers: data arrives and is produced through input/output channels
Host Layer: allows the user to invoke numerical routines from the host:
- the API is written in C++ and provides a set of library calls matching the BLAS API
- can be used to offload a single routine to the FPGA
FBLAS currently targets the Intel ecosystem (e.g. Stratix 10):
- eventually, both SDx and Intel OpenCL will be supported with the same interface
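The streaming interface described above can be pictured with a small software analogy. The sketch below is not FBLAS code and does not use its API: `dot_module` and the channel names are hypothetical, and a Python `deque` stands in for an on-chip FIFO channel.

```python
from collections import deque

def dot_module(ch_x, ch_y, ch_out, n):
    """Hypothetical streaming DOT: pop n elements from each input
    channel and push the single result to the output channel."""
    acc = 0.0
    for _ in range(n):
        acc += ch_x.popleft() * ch_y.popleft()
    ch_out.append(acc)

# Producers fill the input channels; here they are pre-filled for simplicity.
x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
ch_x, ch_y, ch_out = deque(x), deque(y), deque()

dot_module(ch_x, ch_y, ch_out, len(x))
result = ch_out.popleft()
```

The module never sees where its operands come from: another module could feed `ch_x` directly, which is the property the composition slides rely on.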

Modules implementation
FBLAS modules are pre-optimized with key HLS transformations, such as pipelined loops, replication, and tiling
Tiling has implications for how data is streamed to/from modules
For GEMM, computation is organized in a 2D systolic array
[Figure: tiled matrices with tiles numbered 1 to 6]
Optimizations are configurable by the user according to the desired performance or utilization requirements
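To illustrate why tiling affects the streaming order, the sketch below emits the elements of a small matrix tile by tile, traversing the tiles either by rows or by columns. The function name `stream_tiled` and its parameters are assumptions made for this illustration; the actual tile sizes and orders used by FBLAS modules are not given in the slides.

```python
def stream_tiled(matrix, tile_rows, tile_cols, tiles_by_rows=True):
    """Yield matrix elements tile by tile (row-major inside each tile)."""
    n, m = len(matrix), len(matrix[0])
    row_starts = range(0, n, tile_rows)
    col_starts = range(0, m, tile_cols)
    if tiles_by_rows:
        tiles = [(r, c) for r in row_starts for c in col_starts]
    else:
        tiles = [(r, c) for c in col_starts for r in row_starts]
    for r0, c0 in tiles:
        for r in range(r0, min(r0 + tile_rows, n)):
            for c in range(c0, min(c0 + tile_cols, m)):
                yield matrix[r][c]

A = [[0, 1, 2, 3],
     [4, 5, 6, 7],
     [8, 9, 10, 11],
     [12, 13, 14, 15]]

by_rows = list(stream_tiled(A, 2, 2, tiles_by_rows=True))
by_cols = list(stream_tiled(A, 2, 2, tiles_by_rows=False))
```

The two lists contain the same elements in different orders, so a producer and a consumer connected by a channel have to agree on the tiling scheme.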

Module composition
The streaming interface enables communication through on-chip memory rather than through off-chip DRAM
Example: consider the following computation
[Figure: GER followed by GEMV, communicating through RAM and through an on-chip stream]
- through DRAM: I/O = $3N^2 + 5N$
- streamed: I/O = $N^2 + 5N$
Streaming reduces costly off-chip memory accesses and allows pipelined, parallel execution of modules
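The two I/O figures can be reproduced by counting off-chip element transfers. The operand breakdown below is an assumption, since the slide only gives the totals: GER is taken as B = A + x yᵀ and GEMV as z = B v + w, with all matrices N × N and scalars ignored.

```python
def io_through_dram(n):
    """Off-chip transfers when GER and GEMV communicate through DRAM."""
    ger = n * n + 2 * n + n * n   # read A, x, y; write B
    gemv = n * n + 2 * n + n      # read B, v, w; write z
    return ger + gemv

def io_streamed(n):
    """Off-chip transfers when B is streamed on chip from GER to GEMV."""
    reads = n * n + 4 * n         # read A, x, y, v, w
    writes = n                    # write z only; B never leaves the chip
    return reads + writes

n = 1024
saved = io_through_dram(n) - io_streamed(n)
```

Under these assumptions the streamed version saves exactly the write and the re-read of the intermediate matrix, that is, 2N² transfers.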

Streaming Composition
A computation is expressed by a Module Directed Acyclic Graph (MDAG)
An MDAG is valid if:
- it expresses a composition that will terminate
- all the edges are valid. An edge is valid if:
  - number of elements produced = number of elements consumed
  - order in which elements are consumed = order in which they are produced
[Figure: MDAG with inputs x and y, modules M1 and M2, and output z]
Composition of multi-trees
A multi-tree module composition with valid edges is always valid.
E.g. axpydot:
- as separate BLAS calls: requires 3 BLAS calls, I/O = 7N
- streamed: I/O = 3N + 1 (and modules run in parallel)
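A rough software analogy of the streamed axpydot composition is sketched below, with Python generators standing in for on-chip channels. It assumes the usual definition z = w − αv, β = zᵀu; the function names are hypothetical and are not part of the FBLAS API.

```python
def axpy_stream(alpha, xs, ys):
    """Streaming AXPY: emit alpha * x + y one element at a time."""
    for x, y in zip(xs, ys):
        yield alpha * x + y

def dot_stream(xs, ys):
    """Streaming DOT: consume both streams and return one scalar."""
    return sum(x * y for x, y in zip(xs, ys))

def axpydot(alpha, w, v, u):
    # z = w - alpha * v is never stored: each element goes straight
    # from the AXPY stage into the DOT stage.
    z = axpy_stream(-alpha, iter(v), iter(w))
    return dot_stream(z, iter(u))

w = [1.0, 2.0, 3.0]
v = [1.0, 1.0, 1.0]
u = [1.0, 0.0, 2.0]
beta = axpydot(2.0, w, v, u)
```

Only w, v and u are read (3N elements) and one scalar is written, which matches the 3N + 1 count on the slide. The edge between the two stages is valid in the sense defined above: N elements are produced and N consumed, in the same order.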

Streaming Composition
Composition of non-multi-trees
Invalid graphs can occur in generic compositions
[Figure: MDAG with modules M1, M2 and M3]
This is solved by:
- setting the channel size appropriately (according to the size of the input data)
- breaking the MDAG into multiple valid components
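Whether a composition is a multi-tree can be checked mechanically. The sketch below uses the standard graph-theoretic definition (a DAG with at most one directed path between any two nodes); the graph encoding and the two example graphs are hypothetical, since the edges of the slide's figure are not recoverable from the text.

```python
def is_multitree(graph):
    """graph maps each node to a list of successors and is assumed acyclic.
    Returns False as soon as some node is reachable via two distinct paths."""
    for source in graph:
        visited = {source}
        stack = [source]
        while stack:
            node = stack.pop()
            for succ in graph.get(node, []):
                if succ in visited:
                    return False  # second path from source to succ
                visited.add(succ)
                stack.append(succ)
    return True

# Hypothetical examples: a simple chain, and a graph where M3 can be
# reached from M1 both directly and through M2.
chain = {"M1": ["M2"], "M2": []}
diamond = {"M1": ["M2", "M3"], "M2": ["M3"], "M3": []}

chain_ok = is_multitree(chain)
diamond_ok = is_multitree(diamond)
```

A graph that fails this check is not necessarily invalid, but it falls outside the always-valid class and may need one of the two remedies listed above.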

Results
Target architecture:
- FPGA: Stratix 10, 5.7K DSPs, 29 MB BRAM, 32 GB DRAM
- Host: 10-core Intel Xeon, 64 GB DRAM
Module evaluation: scaling with different vectorization widths and tiling. Input data is generated on chip
Streaming composition: speedup with respect to the DRAM implementation, evaluated over various meaningful compositions

Conclusions
FBLAS is the first HLS-based BLAS implementation available for FPGA
Users can offload routines from a host program or integrate them into HLS codes
HLS modules have a streaming interface to enable communication through on-chip FIFO buffers rather than DRAM
github.com/spcl/FBLAS

Thanks! Any questions?
