AUTOMATIC CODE RESTRUCTURING FOR FPGAS: CURRENT STATUS, TRENDS AND OPEN ISSUES
Special Day on “Embedded Meets Hyperscale and HPC”
João MP Cardoso, jmpc@acm.org
DATE 2019 | DATE - Design, Automation and Test in Europe, Firenze, Italy, March 27, 2019
Compiling to hardware: Timeline
[Figure: timeline axis: ... 80’s, 90’s, 00’s, 10’s, 20’s]
Compiling to FPGAs (hardware)
• Of paramount importance for allowing software developers to map computations to FPGA-based accelerators
• Efficient compilation will improve designer productivity and will make the use of FPGA technology viable for software programmers
• Challenge:
  • The extensive set of execution models supported by FPGAs adds complexity and makes efficient compilation (and programming) very hard
• Years of research on High-Level Synthesis (mostly on hardware generation from C) and the adoption of mature compiler frameworks are resulting in the effective use of HLS
Outline
• Intro
• Why source to source compilers?
• Code restructuring
• Some approaches for code restructuring
• Our ongoing work
• Conclusion
• Future work
Why source to source compilers?
• There are many optimizations and code transformations that can be explored at the source code level
• Target code is still legible
• Not tied to a specific target compiler (tool flow) or target architecture!
But:
• Not all optimizations can be done at source code level!
• Some code transformations are too specific and lack enough application potential to justify inclusion in a compiler (unless the code is important enough and must be regularly used/modified/extended)
Source level code transf.: 3D Path Planner
• Target: ML507 Xilinx Virtex-5 board, PowerPC@400 MHz, CCUs@100 MHz
• See: Cardoso et al., Specifying Compiler Strategies for FPGA-based Systems. FCCM 2012
• Source: EU-Funded FP7 REFLECT project

Optimizations applied across strategies 1 to 8:
• Loop fission and move
• Replicate array 3×
• Map gridit to HW core
• Pointer-based accesses and strength reduction
• Unroll 2×
• Eliminating array accesses
• Move data access
• Specialization → 3 HW cores
• Transfer pot data according to gridit call
• Transfer obstacles data according to gridit call
• On-demand obstacles data transfer

Speedup over the pure software solution, per strategy:
  Strategy   Speedup
  1          1.94
  2          5.01
  3          5.61
  4          5.94
  5          6.08
  6          6.68
  7          6.72
  8          6.80
Strategy 8: 6.8× faster than the pure software solution

FPGA resources, per implementation (grouped by strategy):
  Implementation            1       2,3,4   5,6     7,8
  # Slice Registers as FF   901     939     956     2,470
  # Slice LUTs              1,182   1,284   1,308   2,148
  # occupied Slices         531     663     642     1,004
  # BlockRAM/# DSP48Es      34/6    34/6    98/6    98/12
Simple code restructuring example
An FIR
Code restructuring: FIR example
    // x is an input array
    // y is an output array
    #define c0 2
    #define c1 4
    #define c2 4
    #define c3 2
    #define M 256 // no. of samples
    #define N 4 // no. of coeff.
    int c[N] = {c0, c1, c2, c3};
    ...
    // Loop 1:
    for (int j=N-1; j<M; j++) {
      output=0;
      // Loop 2:
      for (int i=0; i<N; i++) {
        output+=c[i]*x[j-i];
      }
      y[j] = output;
    }
Code restructuring: FIR example
Loop 2 is fully unrolled and the x[] accesses are assigned to scalars (arrays, coefficients and M as in the original version):
    // Loop 1
    for (int j=3; j<M; j++) {
      x_3=x[j];
      x_2=x[j-1];
      x_1=x[j-2];
      x_0=x[j-3];
      output=c0*x_3;
      output+=c1*x_2;
      output+=c2*x_1;
      output+=c3*x_0;
      y[j] = output;
    }
II=2: 1 sample per 2 clock cycles
Code restructuring: FIR example
The scalars now form a sliding window over x[], so each iteration reads x[] only once (arrays, coefficients and M as in the original version):
    x_0=x[0];
    x_1=x[1];
    x_2=x[2];
    // Loop 1
    for (int j=3; j<M; j++) {
      x_3=x[j];
      output=c0*x_3;
      output+=c1*x_2;
      output+=c2*x_1;
      output+=c3*x_0;
      x_0=x_1;
      x_1=x_2;
      x_2=x_3;
      y[j] = output;
    }
II=1: 1 sample per clock cycle (the version with four x[] reads per iteration: II=2, 1 sample per 2 clock cycles)
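The step from the nested-loop FIR to the sliding-window version can be checked functionally with a small C sketch. The function names `fir_baseline` and `fir_window`, the `coef` array and the `TAPS` constant are illustrative and not taken from the slides; the II figures quoted on the slides depend on the HLS tool and the memory ports available, so the sketch only shows that both versions compute the same outputs.

```c
#include <assert.h>
#include <stddef.h>

#define TAPS 4  /* number of coefficients (N on the slides) */

/* Coefficients c0..c3 as given on the slides. */
static const int coef[TAPS] = {2, 4, 4, 2};

/* Baseline: nested loops, TAPS reads of x[] per output sample. */
void fir_baseline(const int *x, int *y, size_t m) {
    assert(m >= TAPS);
    for (size_t j = TAPS - 1; j < m; j++) {
        int output = 0;
        for (size_t i = 0; i < TAPS; i++)
            output += coef[i] * x[j - i];
        y[j] = output;
    }
}

/* Restructured: inner loop unrolled and the last three samples kept
   in scalars acting as a shift register, so each iteration reads
   x[] exactly once. */
void fir_window(const int *x, int *y, size_t m) {
    assert(m >= TAPS);
    int x_0 = x[0], x_1 = x[1], x_2 = x[2];
    for (size_t j = 3; j < m; j++) {
        int x_3 = x[j];
        y[j] = coef[0] * x_3 + coef[1] * x_2
             + coef[2] * x_1 + coef[3] * x_0;
        /* Slide the window by one sample. */
        x_0 = x_1;
        x_1 = x_2;
        x_2 = x_3;
    }
}
```

Calling both functions on the same input and comparing `y[3]` to `y[m-1]` is enough to confirm the rewrite; as on the slides, `y[0]` to `y[2]` are left untouched by both versions.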
Code restructuring: FIR example
See: João M. P. Cardoso, Markus Weinhardt, High-Level Synthesis. FPGAs for Software Programmers 2016.
Loop 1 is additionally unrolled 2× (arrays, coefficients and M as in the original version):
    x_0=x[0];
    x_1=x[1];
    x_2=x[2];
    // Loop 1
    // Note: with M=256 the last iteration (j=255) accesses x[M] and y[M];
    // if x and y hold exactly M elements, a guard or epilogue is needed.
    for (int j=3; j<M; j+=2) {
      x_3=x[j];
      output=c0*x_3;
      output+=c1*x_2;
      output+=c2*x_1;
      output+=c3*x_0;
      x_0=x_1;
      x_1=x_2;
      x_2=x_3;
      y[j] = output;
      x_3=x[j+1];
      output=c0*x_3;
      output+=c1*x_2;
      output+=c2*x_1;
      output+=c3*x_0;
      x_0=x_1;
      x_1=x_2;
      x_2=x_3;
      y[j+1] = output;
    }
II=1: 2 samples per clock cycle
Earlier versions for comparison: four x[] reads per iteration, II=2, 1 sample per 2 clock cycles; sliding window without unrolling, II=1, 1 sample per clock cycle
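A hedged sketch of the 2× unrolled loop with an epilogue is given here, since the number of output samples (M-3) may be odd. The names `fir_ref` and `fir_unroll2` and the `enum` constants are assumptions made for illustration; the coefficient values are the ones from the slides. The sketch checks functional equivalence only and makes no claim about the II a given HLS tool would achieve.

```c
#include <assert.h>
#include <stddef.h>

/* Coefficients from the slides. */
enum { C0 = 2, C1 = 4, C2 = 4, C3 = 2 };

/* Reference: one output sample per iteration. */
void fir_ref(const int *x, int *y, size_t m) {
    assert(m >= 4);
    for (size_t j = 3; j < m; j++)
        y[j] = C0 * x[j] + C1 * x[j - 1] + C2 * x[j - 2] + C3 * x[j - 3];
}

/* Unrolled 2x: two output samples per iteration, sharing the sliding
   window; an epilogue handles a leftover sample when m-3 is odd. */
void fir_unroll2(const int *x, int *y, size_t m) {
    assert(m >= 4);
    int x_0 = x[0], x_1 = x[1], x_2 = x[2];
    size_t j = 3;
    for (; j + 1 < m; j += 2) {
        int a = x[j];      /* sample for y[j]   */
        int b = x[j + 1];  /* sample for y[j+1] */
        y[j]     = C0 * a + C1 * x_2 + C2 * x_1 + C3 * x_0;
        y[j + 1] = C0 * b + C1 * a   + C2 * x_2 + C3 * x_1;
        /* Slide the window by two samples (x_0 must be updated first). */
        x_0 = x_2;
        x_1 = a;
        x_2 = b;
    }
    if (j < m)  /* epilogue: one remaining sample */
        y[j] = C0 * x[j] + C1 * x_2 + C2 * x_1 + C3 * x_0;
}
```

With M = 256 the main loop produces 252 samples and the epilogue produces the last one, so no access goes past index M-1.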
Code restructuring
• Manual
  • Programmers need to know the impact of code styles and structures on the generated architecture, much as HDL developers do, although at a different level
• Fully automatic with a source-to-source compiler (refactoring tool)
  • Need to devise the code transformations to apply and their ordering
  • Need source-to-source compilers integrating a vast portfolio of code transformations
• Semi-automatic with a source-to-source compiler (refactoring tool)
  • Code transformations automatically applied but guided by users
  • Users can define their own code transformations
Some approaches for code restructuring/opt.
Techniques:
• Flag selection
• Phase ordering
• Polyhedral models
• Graph-based transformations
Approaches:
- LegUp [Canis et al., ACM TECS’13]: flag selection and phase ordering (via LLVM + opt) [Huang et al., ACM TRETS’15]
- The Merlin Compiler and source to source optimizations by Cong et al., FSP’16
- Polyhedral transformations by Zuo et al., FPGA’13
- Polyhedral in nested loop pipelining by Morvan et al., IEEE TCAD’13
- Graph-based code restructuring by Ferreira and Cardoso, FSP’18, ARC’19
Flag selection
• Generation is controlled by enabling/disabling compiler flags: the sequences of optimizations are the built-in ones, fixed in advance for each flag
• Suitable for the most common approaches, but does not take full advantage of customization/specialization
Helps, but does not solve the code restructuring problem!
Phase ordering
• Provides specific sequences of compiler optimizations
• The problem is very complex: besides selecting the phases, one needs to provide their sequence, usually with repeated phases
• Difficult to find the sequence!
• Fully dependent on the portfolio of phases a compiler includes: phases need to justify their inclusion (i.e., whether they pay off)
Limitations for solving the code restructuring problem!
Polyhedral models
• Applied to Static Control Parts: require specific loop structures and statically known iteration spaces, and are limited to affine domains
• Pure polyhedral models transform iteration spaces; more advanced approaches combine the polyhedral model with AST transformations
• Able to provide useful code transformations and justify their inclusion in the portfolio of compiler optimizations
Helps to solve the code restructuring problem!
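As a small illustration of an iteration-space transformation on a static control part, the sketch below applies loop interchange to a loop nest with constant bounds and affine array indices. The example, including the names `colsum_by_column`, `colsum_by_row` and `check_interchange`, is hypothetical and not taken from the slides or from any of the cited tools. The only dependence is from iteration (i-1, j) to (i, j), which stays lexicographically positive in both loop orders, so the interchange is legal.

```c
#include <assert.h>

enum { ROWS = 5, COLS = 4 };

/* Original order: j outer, i inner. The recurrence
   a[i][j] <- a[i-1][j] is carried by the inner loop. */
void colsum_by_column(int a[ROWS][COLS], int b[ROWS][COLS]) {
    for (int j = 0; j < COLS; j++)
        for (int i = 1; i < ROWS; i++)
            a[i][j] = a[i - 1][j] + b[i][j];
}

/* Interchanged order: i outer, j inner. The recurrence is now
   carried by the outer loop, so inner iterations are independent. */
void colsum_by_row(int a[ROWS][COLS], int b[ROWS][COLS]) {
    for (int i = 1; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            a[i][j] = a[i - 1][j] + b[i][j];
}

/* Runs both versions on the same input and checks that they agree. */
void check_interchange(void) {
    int a1[ROWS][COLS] = {{1, 2, 3, 4}};
    int a2[ROWS][COLS] = {{1, 2, 3, 4}};
    int b[ROWS][COLS];
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            b[i][j] = 10 * i + j;
    colsum_by_column(a1, b);
    colsum_by_row(a2, b);
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            assert(a1[i][j] == a2[i][j]);
}
```

Whether the interchanged order actually pipelines better depends on the target tool; the point here is only that the legality of the transformation follows from the affine dependence, which is the kind of reasoning polyhedral frameworks automate.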
Graph-based transformations (our ongoing work)
• Traces of computations are represented as Dataflow Graphs (DFGs)
• The code restructuring problem is solved by graph transformations
• Able to achieve high levels of code restructuring and to derive suitable HLS directives
A proof of concept… scalability still needs to be solved!
Code restructuring: ongoing
Flow: Application Code (Software Programming Language) → Analysis, Profiling, Execution → Graphs (e.g., Representing Traces) → Graph-based Optimizations → Code Generation
Input: Strategies
Code restructuring: graph-based approach
Flow: Application Code (Software Programming Language) → Analysis, Profiling, Execution → DFG (Representing a Trace) → Graph-based Optimizations → Code Generation + directives
Input: Configurations
Graph-based optimization steps:
• Optimize DFG
• Split in subDFGs
• Fold DFGs
• Identify data reuse
• Balance chains of operations
• Data partitioning
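One of the steps listed above, balancing chains of operations, can be illustrated at source level with a reduction. This is a sketch of the general idea only and not the graph-based implementation the slides refer to; `sum_chain` and `sum_tree` are hypothetical names. The regrouping relies on integer addition being associative when no overflow occurs; it would not be value-preserving in general for floating-point data.

```c
#include <assert.h>
#include <stddef.h>

/* Linear chain: n-1 additions, each depending on the previous one. */
int sum_chain(const int *v, size_t n) {
    assert(n >= 1);
    int acc = v[0];
    for (size_t k = 1; k < n; k++)
        acc += v[k];
    return acc;
}

/* Balanced tree: the same n-1 additions regrouped so that the longest
   dependence chain has about log2(n) additions. Recursion is used for
   brevity; HLS tools typically expect it to be flattened. */
int sum_tree(const int *v, size_t n) {
    assert(n >= 1);
    if (n == 1)
        return v[0];
    size_t half = n / 2;
    return sum_tree(v, half) + sum_tree(v + half, n - half);
}
```

For the four FIR products c0*x_3, c1*x_2, c2*x_1 and c3*x_0, the chain form corresponds to the `output+=` sequence on the FIR slides, while the tree form computes (c0*x_3 + c1*x_2) + (c2*x_1 + c3*x_0), shortening the critical path from three dependent additions to two.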