
AUTOMATIC CODE RESTRUCTURING FOR FPGAS: CURRENT STATUS, TRENDS AND OPEN ISSUES
Special Day on “Embedded Meets Hyperscale and HPC”
João MP Cardoso, jmpc@acm.org
DATE 2019 - Design, Automation and Test in Europe, Firenze, Italy, March 27, 2019

Compiling to hardware: Timeline
[Figure: timeline spanning the 80’s, 90’s, 00’s, 10’s and 20’s]

Compiling to FPGAs (hardware)
• Of paramount importance for allowing software developers to map computations to FPGA-based accelerators
• Efficient compilation will improve designer productivity and will make the use of FPGA technology viable for software programmers
• Challenge:
  • The added complexity of the extensive set of execution models supported by FPGAs makes efficient compilation (and programming) very hard
• Years of research on High-Level Synthesis (mostly on hardware generation from C) and the adoption of mature compiler frameworks are resulting in the effective use of HLS

Outline
• Intro
• Why source to source compilers?
• Code restructuring
• Some approaches for code restructuring
• Our ongoing work
• Conclusion
• Future work

Why source to source compilers?
• There are many optimizations and code transformations that can be explored at the source code level
• Target code is still legible
• Not tied to a specific target compiler (tool flow) or target architecture!
But:
• Not all optimizations can be done at source code level!
• Some code transformations are too specific and without enough application potential to justify inclusion in a compiler (unless the code is too important and must be regularly used/modified/extended)

Source level code transf.: 3D Path Planner
• Target: ML507 Xilinx Virtex-5 board, PowerPC@400 MHz, CCUs@100 MHz
• See: Cardoso et al., Specifying Compiler Strategies for FPGA-based Systems. FCCM 2012
• Source: EU-Funded FP7 REFLECT project

Optimizations combined across Strategies 1 to 8:
• Loop fission and move
• Replicate array 3×
• Map gridit to HW core
• Pointer-based accesses and strength reduction
• Unroll 2×
• Eliminating array accesses
• Move data access
• Specialization → 3 HW cores
• Transfer pot data according to gridit call
• Transfer obstacles data according to gridit call
• On-demand obstacles data transfer

Speedup over the pure software solution:
  Strategy   Speedup
  1          1.94
  2          5.01
  3          5.61
  4          5.94
  5          6.08
  6          6.68
  7          6.72
  8          6.80
Strategy 8: 6.8× faster than the pure software solution

FPGA resources per implementation (grouped by strategy):
  Implementation            1       2,3,4   5,6     7,8
  # Slice Registers as FF   901     939     956     2,470
  # Slice LUTs              1,182   1,284   1,308   2,148
  # occupied Slices         531     663     642     1,004
  # BlockRAM/# DSP48Es      34/6    34/6    98/6    98/12

Simple code restructuring example: an FIR

Code restructuring: FIR example

  // x is an input array
  // y is an output array
  #define c0 2
  #define c1 4
  #define c2 4
  #define c3 2
  #define M 256 // no. of samples
  #define N 4 // no. of coeff.
  int c[N] = {c0, c1, c2, c3};
  ...
  // Loop 1:
  for (int j=N-1; j<M; j++) {
    output=0;
    // Loop 2:
    for (int i=0; i<N; i++) {
      output+=c[i]*x[j-i];
    }
    y[j] = output;
  }

Code restructuring: FIR example (Loop 2 fully unrolled)

  // Loop 1 (II=2)
  for (int j=3; j<M; j++) {
    x_3=x[j];
    x_2=x[j-1];
    x_1=x[j-2];
    x_0=x[j-3];
    output=c0*x_3;
    output+=c1*x_2;
    output+=c2*x_1;
    output+=c3*x_0;
    y[j] = output;
  }

1 sample per 2 clock cycles

Code restructuring: FIR example (samples kept in scalars across iterations)

  x_0=x[0];
  x_1=x[1];
  x_2=x[2];
  // Loop 1 (II=1)
  for (int j=3; j<M; j++) {
    x_3=x[j];
    output=c0*x_3;
    output+=c1*x_2;
    output+=c2*x_1;
    output+=c3*x_0;
    x_0=x_1;
    x_1=x_2;
    x_2=x_3;
    y[j] = output;
  }

1 sample per clock cycle (the II=2 version delivers 1 sample per 2 clock cycles)

Code restructuring: FIR example (Loop 1 unrolled 2×)
See: João M. P. Cardoso, Markus Weinhardt, High-Level Synthesis. FPGAs for Software Programmers 2016.

  x_0=x[0];
  x_1=x[1];
  x_2=x[2];
  // Loop 1 (II=1)
  // The bound is M-1 so that x[j+1] and y[j+1] stay inside the arrays
  for (int j=3; j<M-1; j+=2) {
    x_3=x[j];
    output=c0*x_3;
    output+=c1*x_2;
    output+=c2*x_1;
    output+=c3*x_0;
    x_0=x_1;
    x_1=x_2;
    x_2=x_3;
    y[j] = output;
    x_3=x[j+1];
    output=c0*x_3;
    output+=c1*x_2;
    output+=c2*x_1;
    output+=c3*x_0;
    x_0=x_1;
    x_1=x_2;
    x_2=x_3;
    y[j+1] = output;
  }
  // Remaining sample when the number of outputs (M-3) is odd, as for M=256
  if ((M-3)%2 != 0) {
    x_3=x[M-1];
    output=c0*x_3;
    output+=c1*x_2;
    output+=c2*x_1;
    output+=c3*x_0;
    y[M-1] = output;
  }

2 samples per clock cycle (versus 1 sample per 2 clock cycles for the II=2 version and 1 sample per clock cycle for the II=1 version without unrolling)
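The FIR variants above are meant to be functionally identical; only the structure seen by the HLS tool changes. A minimal sketch of how that equivalence could be checked in plain C follows. The function names, the `int` element type and the test signal are assumptions made for illustration and do not come from the slides.

```c
#define M 256 /* no. of samples */
#define N 4   /* no. of coefficients */

static const int c[N] = {2, 4, 4, 2};

/* Original form: nested loops, N reads of x per output sample. */
void fir_original(const int x[M], int y[M]) {
    for (int j = N - 1; j < M; j++) {
        int output = 0;
        for (int i = 0; i < N; i++) {
            output += c[i] * x[j - i];
        }
        y[j] = output;
    }
}

/* Restructured form: inner loop unrolled and the last three samples kept
   in scalars that behave like a shift register, so each iteration reads
   x only once. */
void fir_restructured(const int x[M], int y[M]) {
    int x_0 = x[0], x_1 = x[1], x_2 = x[2];
    for (int j = 3; j < M; j++) {
        int x_3 = x[j];
        int output = c[0] * x_3;
        output += c[1] * x_2;
        output += c[2] * x_1;
        output += c[3] * x_0;
        x_0 = x_1;
        x_1 = x_2;
        x_2 = x_3;
        y[j] = output;
    }
}

/* Hypothetical helper: returns 1 when both versions produce the same
   outputs for a small deterministic test signal, 0 otherwise. */
int fir_versions_agree(void) {
    int x[M], y_a[M] = {0}, y_b[M] = {0};
    for (int j = 0; j < M; j++) {
        x[j] = (j * 7 + 3) % 31 - 15;
    }
    fir_original(x, y_a);
    fir_restructured(x, y_b);
    for (int j = 0; j < M; j++) {
        if (y_a[j] != y_b[j]) {
            return 0;
        }
    }
    return 1;
}
```

Calling `fir_versions_agree()` from a test harness gives a quick software-only regression check before either version is handed to an HLS tool.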

Code restructuring
• Manual
  • Programmers need to know the impact of code styles and structures on the generated architecture, much as HDL developers do, although at a different level
• Fully automatic with a source-to-source compiler (refactoring tool)
  • Need to devise the code transformations to apply and their ordering
  • Need source to source compilers integrating a vast portfolio of code transformations
• Semi-automatic with a source-to-source compiler (refactoring tool)
  • Code transformations automatically applied but guided by users
  • Users can define their own code transformations

Some approaches for code restructuring/opt.
Approaches:
• Flag selection
• Phase ordering
• Polyhedral models
• Graph-based transformations
Representative work:
- LegUp [Canis et al., ACM TECS’13]: flag selection and phase ordering (via LLVM + opt) [Huang et al., ACM TRETS’15]
- The Merlin Compiler and source to source optimizations by Cong et al., FSP’16
- Polyhedral transformations by Zuo et al., FPGA’13
- Polyhedral in nested loop pipelining by Morvan et al., IEEE TCAD’13
- Graph-based code restructuring by Ferreira and Cardoso, FSP’18, ARC’19

Flag selection
• Generation is controlled by enabling/disabling compiler flags; the sequences of optimizations are those built in and predefined for each flag
• Suitable for the most common approaches, but without taking full advantage of customization/specialization
Helps, but does not solve, the code restructuring problem!

Phase ordering
• Providing specific sequences of compiler optimizations
• The problem is very complex: besides selecting the phases, one needs to provide sequences, usually with repeated phases
• Difficult to find the sequence!
• Fully dependent on the portfolio of phases a compiler may include; phases need to justify their inclusion (i.e., whether they pay off)
Limitations for solving the code restructuring problem!
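To give a feel for why the sequence is hard to find: if a compiler offers P phases and sequences of up to L phases may repeat phases, the number of candidate sequences is P + P^2 + ... + P^L. The sketch below computes that count; the function name and the example values of P and L are hypothetical and not taken from any particular compiler.

```c
#include <stdint.h>

/* Number of distinct phase sequences of length 1..max_len drawn from
   num_phases phases, repetition allowed: the sum of num_phases^k.
   Returns UINT64_MAX when the count does not fit in 64 bits. */
uint64_t count_phase_sequences(uint64_t num_phases, unsigned max_len) {
    uint64_t total = 0;
    uint64_t power = 1;
    for (unsigned k = 1; k <= max_len; k++) {
        if (num_phases != 0 && power > UINT64_MAX / num_phases) {
            return UINT64_MAX; /* num_phases^k would overflow */
        }
        power *= num_phases;
        if (total > UINT64_MAX - power) {
            return UINT64_MAX; /* running sum would overflow */
        }
        total += power;
    }
    return total;
}
```

Even a modest hypothetical setting such as 10 phases and sequences of up to 2 phases already yields 110 candidates, and the count grows exponentially with the sequence length, which is why exhaustive search is rarely an option.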

Polyhedral models
• Applied to Static Control Parts: they require specific loop structures and statically known iteration spaces, and are limited to affine domains
• Pure polyhedral models transform iteration spaces; more advanced approaches combine the polyhedral model with AST transformations
• Able to provide useful code transformations and justify their inclusion in the portfolio of compiler optimizations
Helps in solving the code restructuring problem!
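As an illustration of what falls inside and outside a Static Control Part, the sketch below shows a loop nest with constant bounds and affine subscripts, the same nest after a loop interchange (one of the iteration-space transformations a polyhedral framework can express), and a loop whose subscript is data dependent. All names and sizes are assumptions chosen for the example; this is not the output of any of the cited tools.

```c
#define ROWS 8
#define COLS 16

/* Static control part: constant loop bounds and affine subscripts. */
void update_row_major(int a[ROWS][COLS], int k) {
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            a[i][j] = k * a[i][j] + i + j;
        }
    }
}

/* Same iteration space visited in interchanged order. Legal here because
   every iteration reads and writes only its own element a[i][j]. */
void update_col_major(int a[ROWS][COLS], int k) {
    for (int j = 0; j < COLS; j++) {
        for (int i = 0; i < ROWS; i++) {
            a[i][j] = k * a[i][j] + i + j;
        }
    }
}

/* Outside the model: the subscript idx[i] is only known at run time,
   so the access is not an affine function of the loop counter. */
void scatter_add(int out[COLS], const int idx[ROWS], const int v[ROWS]) {
    for (int i = 0; i < ROWS; i++) {
        out[idx[i]] += v[i];
    }
}

/* Hypothetical helper: returns 1 when both loop orders give the same
   array contents, 0 otherwise. */
int interchange_preserves_result(void) {
    int a[ROWS][COLS], b[ROWS][COLS];
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            a[i][j] = (3 * i + 5 * j) % 11;
            b[i][j] = a[i][j];
        }
    }
    update_row_major(a, 3);
    update_col_major(b, 3);
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            if (a[i][j] != b[i][j]) {
                return 0;
            }
        }
    }
    return 1;
}
```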

Graph-based transformations (our ongoing work)
• Traces of computations are represented in Dataflow Graphs (DFGs)
• The code restructuring problem is solved by graph transformations
• Able to achieve high levels of code restructuring and to produce suitable HLS directives
A proof of concept… scalability still needs to be solved!

Code restructuring: ongoing
[Figure: flow from Application Code (Software Programming Language) through Analysis, Profiling, Execution to Graphs (e.g., Representing Traces), then Graph-based Optimizations and Code Generation; Strategies are provided as input]

Code restructuring: graph-based approach
[Figure: flow from Application Code (Software Programming Language) through Analysis, Profiling, Execution to a DFG (Representing a Trace), then Graph-based Optimizations and Code Generation + directives; Configurations are provided as input]
Graph-based optimization steps:
• Optimize DFG
• Split in subDFGs
• Fold DFGs
• Identify data reuse
• Balance chains of operations
• Data partitioning
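One of the steps listed above, balancing chains of operations, can be pictured at source level with the FIR tap sum used earlier. The sketch below is only an assumed illustration of the idea (function names are hypothetical and the code is not generated by the authors' tool): the first version accumulates through a chain of dependent additions, while the second regroups the same terms into a tree so that two partial sums can be computed independently.

```c
/* Chain: each addition depends on the previous one, so the four
   products are accumulated through three dependent additions. */
int fir_tap_chain(int x_0, int x_1, int x_2, int x_3) {
    int output = 2 * x_3;
    output += 4 * x_2;
    output += 4 * x_1;
    output += 2 * x_0;
    return output;
}

/* Balanced: two independent partial sums followed by one final
   addition, which shortens the longest dependence path from three
   additions to two. Integer addition is associative (overflow aside),
   so the result is unchanged. */
int fir_tap_balanced(int x_0, int x_1, int x_2, int x_3) {
    int s0 = 2 * x_3 + 4 * x_2;
    int s1 = 4 * x_1 + 2 * x_0;
    return s0 + s1;
}
```

For a four-tap filter the gain is small, but the same regrouping applied to long accumulation chains is what lets a scheduler place more operations in parallel.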
