Making FPGAs Programmable as Computers and Doing It At Scale
Paul Chow
High-Performance Reconfigurable Computing Group
Department of Electrical and Computer Engineering
University of Toronto
What’s the real goal?
• Build large-scale applications with FPGAs without added pain :)
• Where do we stand?
• Need for abstractions and middleware support
• Our work at UofT to support HPC with FPGAs
November 14, 2016, H2RC 2016
OUR PHILOSOPHY
Preserve Current Programming Models
• Program and use an FPGA just like any other software-based processor
• (Software) programmers should not necessarily need to know that processing is done on an FPGA
– Ability to pick FPGA execution for performance/power reasons
– Even better if this is automatic!
WHERE DO WE STAND?
High-Level Synthesis
• Raises the level of abstraction above hardware design
• Lots of great research
• Absolutely required
• Tremendous progress recently
• Can describe complex computations and functions algorithmically and create hardware
But!!! We are still just building custom hardware. HLS is only a part of the big picture…
Consider Portability
How many of you have taken a C program written on one platform and just recompiled it to run on another?
How many have done the same for code targeted for an FPGA platform? (If you have even tried! :) )
• Much invested in writing any application
• Reuse, modify, enhance, evolve it
• Code should run on any platform
Consider Design Environments
• Well-developed in software
– IDEs – Visual Studio, NetBeans
– Linux + emacs/vi + gcc
– Good open source options
• FPGA hardware
– Vivado, Quartus
– Not what a software developer would expect
– No open source options
• Makefiles vs TCL!
COMPUTING ABSTRACTIONS FOR FPGAS
What Abstractions?
• Memory model
– Data[127:0] vs connect to a memory controller
• I/O
– read()/write() vs connect to a PCIe controller
– USB
– Networking – TCP/IP, UDP
• Services
– Filesystem, status, control
• FPGAs have lacked all of these things
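The read()/write() contrast above can be made concrete with a small sketch. This is a conceptual illustration only, not any vendor's API: the `FpgaChannel` type and `fpga_read`/`fpga_write` names are hypothetical, and an in-memory byte stream stands in for the PCIe/AXI channel the programmer would otherwise wire up by hand.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical device handle: a byte stream standing in for a
// PCIe/AXI channel that the programmer never sees directly.
struct FpgaChannel {
    std::vector<unsigned char> buf;
    std::size_t rd = 0;  // read cursor
};

// POSIX-style write: the caller pushes bytes, nothing more.
std::size_t fpga_write(FpgaChannel& ch, const void* data, std::size_t n) {
    const unsigned char* p = static_cast<const unsigned char*>(data);
    ch.buf.insert(ch.buf.end(), p, p + n);
    return n;  // bytes accepted
}

// POSIX-style read: the caller pulls whatever bytes are available.
std::size_t fpga_read(FpgaChannel& ch, void* data, std::size_t n) {
    std::size_t avail = ch.buf.size() - ch.rd;
    std::size_t take = n < avail ? n : avail;
    std::memcpy(data, ch.buf.data() + ch.rd, take);
    ch.rd += take;
    return take;  // bytes delivered
}
```

The point is the shape of the interface: software sees familiar byte-stream calls, while the controller logic behind them is someone else's problem.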
Many Approaches
• APIs to connect to FPGAs
• Hardware threads
Ø Commercial (non-vendor) tools
Ø Commercial vendor tools
Ø UofT approach
• Give a representative, not complete, set of examples
Commercial Non-Vendor Tools
• HLS with an environment to debug, monitor performance, load and run hardware
– Handel-C
– Impulse-C
– Maxeler – proprietary hardware
• Not broadly used because of proprietary tools (and sometimes hardware)
Commercial Vendor Tools
• OpenCL tools
– SDAccel, SDK for OpenCL
– Data centre is a major target
• OpenCL is an open “standard”
– Possible to have cross-vendor, cross-platform portability, even cross-architecture (GPUs, PHI, etc.)
– More interest than proprietary approach
• SDSoC – for C/C++
– On SoC platforms, but could be any heterogeneous system
Why FPGA OpenCL is more like computing
• Provides a higher-level software abstraction
– Don’t worry about the CPU-FPGA communication layer
• PCIe, QPI, AXI, Ethernet, etc.
– Runtime manages bitstreams, memory allocation, data transfers
– Transparently uses HLS for the kernels
• A knowledgeable software person can use it
– Must understand parallelism, basic architectural concepts, latency and throughput, I/O for data in terms of structure and protocols
– Doesn’t need to know about clocks
• Early days still, but you can see where it’s going
– FPGA vendors learning to be computer companies
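What "runtime manages bitstreams, memory allocation, data transfers" means can be sketched in plain C++. This is a minimal conceptual model, not real OpenCL: the `Runtime`, `DeviceBuffer`, and `launch` names are hypothetical, and a plain function stands in for an HLS-generated kernel. The host-side flow, however, mirrors what an OpenCL runtime does under the hood.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Stand-in for device-side memory on the FPGA card.
struct DeviceBuffer { std::vector<float> mem; };

// Hypothetical runtime: hides bitstream loading, DMA setup,
// and the physical CPU-FPGA link behind simple calls.
struct Runtime {
    DeviceBuffer alloc(std::size_t n) {
        return DeviceBuffer{std::vector<float>(n)};
    }
    void copy_in(DeviceBuffer& d, const std::vector<float>& h)  { d.mem = h; }
    void copy_out(const DeviceBuffer& d, std::vector<float>& h) { h = d.mem; }
    void launch(const std::function<void(DeviceBuffer&)>& kernel,
                DeviceBuffer& d) {
        kernel(d);  // in reality: enqueue on the accelerator
    }
};

// Host code: no PCIe/QPI/AXI details are visible anywhere.
std::vector<float> scale_on_device(const std::vector<float>& in, float a) {
    Runtime rt;
    DeviceBuffer d = rt.alloc(in.size());
    rt.copy_in(d, in);
    rt.launch([a](DeviceBuffer& b) {          // the "kernel"
        for (float& x : b.mem) x *= a;
    }, d);
    std::vector<float> out;
    rt.copy_out(d, out);
    return out;
}
```

The host programmer reasons about buffers, transfers, and kernel launches, exactly the vocabulary of any accelerator runtime; clocks and interconnects never appear.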
OpenCL is not HLS
• Recognize that HLS is not what makes OpenCL a computing environment
– HLS is necessary but not sufficient
• It’s the other stuff under the hood + HLS
– Run-time services
• Data transfer, memory management, bitstream loading
– Hardware shell services
• CPU/FPGA interconnect, DMA engine, memory controller
Scalability
• OpenCL, as a programming model, does not scale
• Could scale by using MPI between nodes, and OpenCL to build the accelerator
– As is done with MPI + OpenMP
– Need to deal with two programming models
WORK AT UOFT
Classic accelerator model: Master-Slave
Need custom APIs to interact with accelerators
Lacks portability and scalability
Our programming model philosophy
• Use a common API for Software and Hardware
[Figure: kernels migrate between x86 CPUs, embedded CPUs, and custom processing elements on FPGAs; every application component sits on the same Common API, layered above drivers, hardware, and the interconnect]
Common SW/HW API
• CPU and FPGA components can initiate data transfers – they are peers
• SW and HW components use similar call formats
• For distributed memory and message-passing, this was implemented by TMD-MPI (TMD: Toronto Molecular Dynamics)
• For shared memory, building hardware infrastructure for a common API for PGAS
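The "similar call formats" idea can be sketched with a toy message-passing layer. This is only a conceptual illustration of the common-API principle behind TMD-MPI, not its implementation: `api_send`/`api_recv` are hypothetical names, and a trivial in-process mailbox stands in for the real on-chip/off-chip network. The key property is that a software rank on a CPU and a hardware rank on an FPGA would call exactly the same signatures.

```cpp
#include <cstddef>
#include <map>
#include <queue>
#include <vector>

using Msg = std::vector<int>;
// Per-rank mailboxes, standing in for the message-passing fabric.
std::map<int, std::queue<Msg>> inbox;

// Same call format whether the caller is a CPU rank or an FPGA rank.
void api_send(const int* buf, std::size_t count, int dest) {
    inbox[dest].push(Msg(buf, buf + count));
}

std::size_t api_recv(int* buf, std::size_t max, int self) {
    Msg m = inbox[self].front();
    inbox[self].pop();
    std::size_t n = m.size() < max ? m.size() : max;
    for (std::size_t i = 0; i < n; ++i) buf[i] = m[i];
    return n;  // words delivered
}
```

Because both kinds of rank speak the same interface, a task can be prototyped in software and later migrated to hardware without touching the code of its communication partners.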
Why, again, a common API?
• Developer can focus on the algorithm, exposing parallel tasks in a pure software environment
– Easier development: SW Prototyping → Migration
• Model makes no distinction between CPUs and FPGAs (in terms of data communication, synchronization)
• Map tasks to computing elements later
– Not as part of the initial design
• FPGA-initiated communication relieves the CPU (even more so for one-sided communication)
• FPGA-only systems (or one CPU + many FPGAs) can work efficiently
BUILDING A LARGE HETEROGENEOUS HPC APPLICATION WITH MPI
Molecular Dynamics
• Simulate motion of molecules at the atomic level
• Highly compute-intensive
• Understand protein folding
• Computer-aided drug design
Origin of Computational Complexity

Bonded terms, O(n):

$$U_t = \sum_i \begin{cases} k_i\left[1 + \cos(n_i\phi_i - \gamma_i)\right], & n_i \neq 0 \\ k_i\,(\gamma_i - \phi_i)^2, & n_i = 0 \end{cases}$$

$$U_a = \sum_i k_i^{a}\,(\theta_i - \theta_i^{0})^2 \qquad U_b = \sum_i k_i^{b}\,(r_i - r_i^{0})^2$$

Nonbonded terms ($10^3$–$10^4$ atoms), O(n²):

$$U = \frac{1}{2}\sum_n \sum_{i=1}^{N}\sum_{j=1}^{N} V(r_{ij}), \qquad V(r_{ij}) = \frac{q_i q_j}{r_{ij}} + 4\varepsilon\left[\left(\frac{\sigma}{r_{ij}}\right)^{12} - \left(\frac{\sigma}{r_{ij}}\right)^{6}\right]$$
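The O(n²) nonbonded term is where the compute time goes, and it is simple to state in code. A minimal sketch, assuming a single Lennard-Jones ε and σ for all atoms and Coulomb in reduced units (real force fields use per-pair parameters, cutoffs, and periodic images; none of that is shown here):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Atom { double x, y, z, q; };  // position and charge

// All-pairs nonbonded energy: Lennard-Jones 12-6 plus Coulomb.
// O(n^2) in the number of atoms -- the dominant cost in MD.
double nonbonded_energy(const std::vector<Atom>& a,
                        double eps, double sigma) {
    double U = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = i + 1; j < a.size(); ++j) {  // each pair once
            double dx = a[i].x - a[j].x;
            double dy = a[i].y - a[j].y;
            double dz = a[i].z - a[j].z;
            double r = std::sqrt(dx * dx + dy * dy + dz * dz);
            double sr6 = std::pow(sigma / r, 6);
            U += 4.0 * eps * (sr6 * sr6 - sr6)   // LJ 12-6 term
               + a[i].q * a[j].q / r;            // Coulomb term
        }
    return U;
}
```

The doubly nested loop over atom pairs is exactly the structure that makes this term a natural target for deeply pipelined FPGA engines.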
The TMD Machine
• The Toronto Molecular Dynamics Machine
• Use a multi-FPGA system to accelerate MD
• Built using an MPI programming model
• Principal algorithm developer: Chris Madill, Ph.D. candidate (now done!) in Biochemistry
– Writes C++ using MPI, not Verilog/VHDL
• Have used three platforms – portability
• Plus scalability and maintainability
UofT MPI Approach (FPL 2006)
Also a system simulation
HLS can do this
Platform Evolution
FPGA portability and design abstraction facilitated ongoing migration.
Network of Five V2Pro PCI Cards (2006)
• First to integrate hardware acceleration
• Simple LJ fluids only
Network of BEE2 Multi-FPGA Boards (2007)
• Added electrostatic terms
• Added bonded terms
2010 – Xilinx/Nallatech ACP
Stack of 5 large Virtex-5 FPGAs + 1 FPGA for FSB PHY interface
Quad-socket Xeon Server
Typical MD Simulator
[Figure: each process i runs on CPU i, holds Data i, and computes the Bonded, Nonbonded, and PME terms itself]
TMD Machine Architecture
MPI::Send(&msg, size, dest …);
[Figure: MPI-connected hardware engines: Atom Managers, a Bond Engine, short-range Nonbond Engines, long-range Electrostatics Engines, plus Input, Scheduler, Output, and Visualizer blocks]