Making FPGAs Programmable as Computers and Doing It At Scale
Paul Chow
High-Performance Reconfigurable Computing Group
Department of Electrical and Computer Engineering
University of Toronto
What’s the real goal?
• Build large-scale applications with FPGAs without added pain :)
• Where do we stand?
• Need for abstractions and middleware support
• Our work at UofT to support HPC with FPGAs
November 14, 2016, H2RC 2016
OUR PHILOSOPHY
Preserve Current Programming Models
• Program and use an FPGA just like any other software-based processor
• (Software) programmers should not necessarily need to know that processing is done on an FPGA
– Ability to pick FPGA execution for performance/power reasons
– Even better if this is automatic!
WHERE DO WE STAND?
High-Level Synthesis
• Raises the level of abstraction above hardware design
• Lots of great research
• Absolutely required
• Tremendous progress recently
• Can describe complex computations and functions algorithmically and create hardware
But!!! We are still just building custom hardware. HLS is only a part of the big picture…
Consider Portability
How many of you have taken a C program written on one platform and just recompiled it to run on another?
How many have done the same for code targeted for an FPGA platform? (If you have even tried! :) )
• Much invested in writing any application
• Reuse, modify, enhance, evolve it
• Code should run on any platform
Consider Design Environments
• Well-developed in software
– IDEs – Visual Studio, NetBeans
– Linux + emacs/vi + gcc
– Good open source options
• FPGA hardware
– Vivado, Quartus
– Not what a software developer would expect
– No open source options
• Makefiles vs TCL!
COMPUTING ABSTRACTIONS FOR FPGAS
What Abstractions?
• Memory model
– Data[127:0] vs connect to a memory controller
• I/O
– read()/write() vs connect to a PCIe controller
– USB
– Networking – TCP/IP, UDP
• Services
– Filesystem, status, control
• FPGAs have lacked all of these things
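The read()/write() contrast above can be made concrete with a small sketch. This is a conceptual illustration only, not any vendor's API: the `FpgaChannel` type and `fpga_read`/`fpga_write` names are hypothetical, and an in-memory byte stream stands in for the PCIe/AXI channel the programmer would otherwise wire up by hand.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical device handle: a byte stream standing in for a
// PCIe/AXI channel that the programmer never sees directly.
struct FpgaChannel {
    std::vector<unsigned char> buf;
    std::size_t rd = 0;  // read cursor
};

// POSIX-style write: the caller pushes bytes, nothing more.
std::size_t fpga_write(FpgaChannel& ch, const void* data, std::size_t n) {
    const unsigned char* p = static_cast<const unsigned char*>(data);
    ch.buf.insert(ch.buf.end(), p, p + n);
    return n;  // bytes accepted
}

// POSIX-style read: the caller pulls whatever bytes are available.
std::size_t fpga_read(FpgaChannel& ch, void* data, std::size_t n) {
    std::size_t avail = ch.buf.size() - ch.rd;
    std::size_t take = n < avail ? n : avail;
    std::memcpy(data, ch.buf.data() + ch.rd, take);
    ch.rd += take;
    return take;  // bytes delivered
}
```

The point is the shape of the interface: software sees familiar byte-stream calls, while the controller logic behind them is someone else's problem.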
Many Approaches
• APIs to connect to FPGAs
• Hardware threads
Ø Commercial (non-vendor) tools
Ø Commercial vendor tools
Ø UofT approach
• Give a representative, not complete, set of examples
Commercial Non-Vendor Tools
• HLS with an environment to debug, monitor performance, load and run hardware
– Handel-C
– Impulse-C
– Maxeler – proprietary hardware
• Not broadly used because of proprietary tools (and sometimes hardware)
Commercial Vendor Tools
• OpenCL tools
– SDAccel, SDK for OpenCL
– Data centre is a major target
• OpenCL is an open “standard”
– Possible to have cross-vendor, cross-platform portability, even cross-architecture (GPUs, PHI, etc.)
– More interest than proprietary approach
• SDSoC – for C/C++
– On SoC platforms, but could be any heterogeneous system
Why FPGA OpenCL is more like computing
• Provides a higher-level software abstraction
– Don’t worry about the CPU-FPGA communication layer
• PCIe, QPI, AXI, Ethernet, etc.
– Runtime manages bitstreams, memory allocation, data transfers
– Transparently uses HLS for the kernels
• A knowledgeable software person can use it
– Must understand parallelism, basic architectural concepts, latency and throughput, I/O for data in terms of structure and protocols
– Doesn’t need to know about clocks
• Early days still, but you can see where it’s going
– FPGA vendors learning to be computer companies
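What "runtime manages bitstreams, memory allocation, data transfers" means can be sketched in plain C++. This is a minimal conceptual model, not real OpenCL: the `Runtime`, `DeviceBuffer`, and `launch` names are hypothetical, and a plain function stands in for an HLS-generated kernel. The host-side flow, however, mirrors what an OpenCL runtime does under the hood.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Stand-in for device-side memory on the FPGA card.
struct DeviceBuffer { std::vector<float> mem; };

// Hypothetical runtime: hides bitstream loading, DMA setup,
// and the physical CPU-FPGA link behind simple calls.
struct Runtime {
    DeviceBuffer alloc(std::size_t n) {
        return DeviceBuffer{std::vector<float>(n)};
    }
    void copy_in(DeviceBuffer& d, const std::vector<float>& h)  { d.mem = h; }
    void copy_out(const DeviceBuffer& d, std::vector<float>& h) { h = d.mem; }
    void launch(const std::function<void(DeviceBuffer&)>& kernel,
                DeviceBuffer& d) {
        kernel(d);  // in reality: enqueue on the accelerator
    }
};

// Host code: no PCIe/QPI/AXI details are visible anywhere.
std::vector<float> scale_on_device(const std::vector<float>& in, float a) {
    Runtime rt;
    DeviceBuffer d = rt.alloc(in.size());
    rt.copy_in(d, in);
    rt.launch([a](DeviceBuffer& b) {          // the "kernel"
        for (float& x : b.mem) x *= a;
    }, d);
    std::vector<float> out;
    rt.copy_out(d, out);
    return out;
}
```

The host programmer reasons about buffers, transfers, and kernel launches, exactly the vocabulary of any accelerator runtime; clocks and interconnects never appear.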
OpenCL is not HLS
• Recognize that HLS is not what makes OpenCL a computing environment
– HLS is necessary but not sufficient
• It’s the other stuff under the hood + HLS
– Run-time services
• Data transfer, memory management, bitstream loading
– Hardware shell services
• CPU/FPGA interconnect, DMA engine, memory controller
Scalability
• OpenCL, as a programming model, does not scale
• Could scale by using MPI between nodes, and OpenCL to build the accelerator
– As is done with MPI + OpenMP
– Need to deal with two programming models
WORK AT UOFT
Classic accelerator model: Master-Slave
Need custom APIs to interact with accelerators
Lacks portability and scalability
Our programming model philosophy
• Use a common API for Software and Hardware
[Figure: kernels migrate between x86 CPUs, embedded CPUs, and custom processing elements on FPGAs; every application component sits on the same Common API, layered above drivers, hardware, and the interconnect]
Common SW/HW API
• CPU and FPGA components can initiate data transfers – they are peers
• SW and HW components use similar call formats
• For distributed memory and message-passing, this was implemented by TMD-MPI (TMD: Toronto Molecular Dynamics)
• For shared memory, building hardware infrastructure for a common API for PGAS
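The "similar call formats" idea can be sketched with a toy message-passing layer. This is only a conceptual illustration of the common-API principle behind TMD-MPI, not its implementation: `api_send`/`api_recv` are hypothetical names, and a trivial in-process mailbox stands in for the real on-chip/off-chip network. The key property is that a software rank on a CPU and a hardware rank on an FPGA would call exactly the same signatures.

```cpp
#include <cstddef>
#include <map>
#include <queue>
#include <vector>

using Msg = std::vector<int>;
// Per-rank mailboxes, standing in for the message-passing fabric.
std::map<int, std::queue<Msg>> inbox;

// Same call format whether the caller is a CPU rank or an FPGA rank.
void api_send(const int* buf, std::size_t count, int dest) {
    inbox[dest].push(Msg(buf, buf + count));
}

std::size_t api_recv(int* buf, std::size_t max, int self) {
    Msg m = inbox[self].front();
    inbox[self].pop();
    std::size_t n = m.size() < max ? m.size() : max;
    for (std::size_t i = 0; i < n; ++i) buf[i] = m[i];
    return n;  // words delivered
}
```

Because both kinds of rank speak the same interface, a task can be prototyped in software and later migrated to hardware without touching the code of its communication partners.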
Why, again, a common API?
• Developer can focus on the algorithm, exposing parallel tasks in a pure software environment
– Easier development: SW Prototyping → Migration
• Model makes no distinction between CPUs and FPGAs (in terms of data communication, synchronization)
• Map tasks to computing elements later
– Not as part of the initial design
• FPGA-initiated communication relieves the CPU (even more so for one-sided communication)
• FPGA-only systems (or one CPU + many FPGAs) can work efficiently
BUILDING A LARGE HETEROGENEOUS HPC APPLICATION WITH MPI
Molecular Dynamics
• Simulate motion of molecules at the atomic level
• Highly compute-intensive
• Understand protein folding
• Computer-aided drug design
Origin of Computational Complexity

Bonded terms, O(n):

$$U_t = \sum_i \begin{cases} k_i\left[1 + \cos(n_i\phi_i - \gamma_i)\right], & n_i \neq 0 \\ k_i\,(\gamma_i - \phi_i)^2, & n_i = 0 \end{cases}$$

$$U_a = \sum_i k_i^{a}\,(\theta_i - \theta_i^{0})^2 \qquad U_b = \sum_i k_i^{b}\,(r_i - r_i^{0})^2$$

Nonbonded terms ($10^3$–$10^4$ atoms), O(n²):

$$U = \frac{1}{2}\sum_n \sum_{i=1}^{N}\sum_{j=1}^{N} V(r_{ij}), \qquad V(r_{ij}) = \frac{q_i q_j}{r_{ij}} + 4\varepsilon\left[\left(\frac{\sigma}{r_{ij}}\right)^{12} - \left(\frac{\sigma}{r_{ij}}\right)^{6}\right]$$
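The O(n²) nonbonded term is where the compute time goes, and it is simple to state in code. A minimal sketch, assuming a single Lennard-Jones ε and σ for all atoms and Coulomb in reduced units (real force fields use per-pair parameters, cutoffs, and periodic images; none of that is shown here):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Atom { double x, y, z, q; };  // position and charge

// All-pairs nonbonded energy: Lennard-Jones 12-6 plus Coulomb.
// O(n^2) in the number of atoms -- the dominant cost in MD.
double nonbonded_energy(const std::vector<Atom>& a,
                        double eps, double sigma) {
    double U = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = i + 1; j < a.size(); ++j) {  // each pair once
            double dx = a[i].x - a[j].x;
            double dy = a[i].y - a[j].y;
            double dz = a[i].z - a[j].z;
            double r = std::sqrt(dx * dx + dy * dy + dz * dz);
            double sr6 = std::pow(sigma / r, 6);
            U += 4.0 * eps * (sr6 * sr6 - sr6)   // LJ 12-6 term
               + a[i].q * a[j].q / r;            // Coulomb term
        }
    return U;
}
```

The doubly nested loop over atom pairs is exactly the structure that makes this term a natural target for deeply pipelined FPGA engines.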
The TMD Machine
• The Toronto Molecular Dynamics Machine
• Use a multi-FPGA system to accelerate MD
• Built using an MPI programming model
• Principal algorithm developer: Chris Madill, Ph.D. candidate (now done!) in Biochemistry
– Writes C++ using MPI, not Verilog/VHDL
• Have used three platforms – portability
• Plus scalability and maintainability
UofT MPI Approach (FPL 2006)
Also a system simulation
HLS can do this
Platform Evolution
FPGA portability and design abstraction facilitated ongoing migration.
Network of Five V2Pro PCI Cards (2006)
• First to integrate hardware acceleration
• Simple LJ fluids only
Network of BEE2 Multi-FPGA Boards (2007)
• Added electrostatic terms
• Added bonded terms
2010 – Xilinx/Nallatech ACP
Stack of 5 large Virtex-5 FPGAs + 1 FPGA for FSB PHY interface
Quad-socket Xeon Server
Typical MD Simulator
[Figure: each process i runs on CPU i, holds Data i, and computes the Bonded, Nonbonded, and PME terms itself]
TMD Machine Architecture
MPI::Send(&msg, size, dest …);
[Figure: MPI-connected hardware engines: Atom Managers, a Bond Engine, short-range Nonbond Engines, long-range Electrostatics Engines, plus Input, Scheduler, Output, and Visualizer blocks]