Automatic Generation of Efficient Accelerator Designs for Reconfigurable Hardware
David Koeplinger, Raghu Prabhakar, Yaqi Zhang, Christina Delimitrou, Christos Kozyrakis, Kunle Olukotun
Stanford University
ISCA 2016

FPGAs in Data Centers
- Increasing interest in the use of FPGAs as application accelerators in data centers
- Key advantage: performance per watt

Problem: Large Design Spaces
- Design spaces grow exponentially with the number of parameters
- Even relatively small designs can have very large spaces
- Parameters can change runtime by orders of magnitude
- Parameters typically aren't independent
- Manual exploration is tedious and may result in suboptimal designs

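To make the exponential growth concrete, here is a minimal sketch that counts design points as the product of the number of options per parameter. The parameter names and value ranges are hypothetical and chosen only for illustration; they are not taken from the talk.

```python
from math import prod

def design_space_size(choices):
    """Number of design points = product of the option counts of all parameters."""
    return prod(len(options) for options in choices.values())

# Hypothetical parameters for a small dot-product design (illustrative values only).
params = {
    "tile_size": [2 ** k for k in range(6, 14)],  # 8 options
    "outer_par": [1, 2, 4, 8],                    # 4 options
    "load_par": [1, 2, 4, 8, 16],                 # 5 options
    "inner_par": [1, 2, 4, 8, 16, 32],            # 6 options
    "pipelined": [False, True],                   # 2 options
}

size = design_space_size(params)  # 8 * 4 * 5 * 6 * 2 design points
```

Each additional parameter multiplies the count by its number of options, which is why even a five-parameter design already has thousands of candidate points.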
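Design Space Example: Dot Product
Algorithm: dot product of vectors A and B
[Diagram: vectors A and B in DRAM are loaded into FPGA scratchpads Tile A and Tile B; a multiplier (×) and an adder (+) feed the register acc. Key: scratchpad, register, operation.]
- Small and simple, but slow!
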
Important Parameters: Tile Sizes
Algorithm: dot product of vectors A and B
[Diagram: same dot product design, with the tile scratchpads highlighted. Key: scratchpad, register, operation.]
- Runtime: increases length of DRAM accesses
- Runtime: increases exploited spatial locality
- Area: increases local memory sizes

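Important Parameters: Pipelining
Algorithm: dot product of vectors A and B
[Diagram: the design split into Stage 1 (loading Tile A and Tile B from DRAM) and Stage 2 (multiply and accumulate into acc), with the tiles held in double buffers. Key: double buffer, register, operation.]
- Runtime: overlaps memory and compute
- Area: increases local memory sizes
- Area: adds synchronization logic
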
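The runtime effect of overlapping the load and compute stages can be illustrated with a first-order cycle count. This is a minimal sketch assuming a two-stage pipeline with double buffering and fixed per-tile costs; the function names and cycle values are hypothetical, not the talk's latency model.

```python
def sequential_cycles(num_tiles, load_cycles, compute_cycles):
    """Each tile is fully loaded, then fully processed, before the next tile starts."""
    return num_tiles * (load_cycles + compute_cycles)

def pipelined_cycles(num_tiles, load_cycles, compute_cycles):
    """Two-stage pipeline: tile i+1 is loaded while tile i is being processed."""
    if num_tiles == 0:
        return 0
    # In steady state the slower stage sets the rate.
    steady_state = max(load_cycles, compute_cycles)
    # Fill and drain cost one load plus one compute; the remaining tiles overlap.
    return load_cycles + compute_cycles + (num_tiles - 1) * steady_state

# Hypothetical costs: 10 tiles, 100 cycles to load a tile, 80 cycles to process it.
before = sequential_cycles(10, 100, 80)
after = pipelined_cycles(10, 100, 80)
```

With these numbers the overlapped version takes 1080 cycles instead of 1800, at the price of the second buffer and the synchronization logic listed above.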
Important Parameters: Parallelization
Algorithm: dot product of vectors A and B
[Diagram: the multiplier is replicated and the products are combined by a tree of adders before reaching acc. Key: scratchpad, register, operation.]
- Runtime: improves element throughput
- Area: duplicates compute resources

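The throughput/area trade-off of a parallelization factor can be sketched with a first-order estimate for the inner multiply-accumulate loop. The model below (one multiplier per lane, a balanced adder tree, one iteration per group of lanes) is an illustrative assumption, and `inner_reduce_estimate` is a hypothetical helper rather than part of DHDL.

```python
def inner_reduce_estimate(tile_size, par):
    """First-order cost of reducing one tile with `par` parallel lanes."""
    if tile_size < 1 or par < 1:
        raise ValueError("tile_size and par must be positive")
    iterations = -(-tile_size // par)      # ceil(tile_size / par)
    tree_depth = (par - 1).bit_length()    # ceil(log2(par)), 0 when par == 1
    return {
        "iterations": iterations,   # loop trips needed to consume the tile
        "tree_depth": tree_depth,   # adder levels between multipliers and acc
        "multipliers": par,         # one multiplier per lane
        "tree_adders": par - 1,     # adders in a balanced reduction tree
    }

serial = inner_reduce_estimate(1024, 1)
parallel = inner_reduce_estimate(1024, 4)
```

Going from 1 lane to 4 cuts the trip count from 1024 to 256 while quadrupling the multipliers and adding a two-level adder tree.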
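Language/Tool Requirements
[Table: each tool is checked against each requirement; the per-cell marks are not present in this text.]
Tools compared: VHDL, Verilog, LegUp, Vivado HLS, OpenCL SDK, Aladdin, DHDL
Requirements:
- Targets FPGAs
- Enables pipelining at arbitrary loop levels
- Exposes design parameters to the compiler
- Evaluates designs prior to synthesis
- Explores design space automatically
- Generates synthesizable code
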
Delite Hardware Definition Language (DHDL)
- Includes a variety of parameterized templates:
    - Parallel patterns with implicit parallelization factors
    - Pipeline constructs for pipelining at arbitrary levels
    - Explicit size parameters for loop step sizes and buffer sizes
- All parameters are exposed to the compiler
- Compiler includes latency and area models for quick design evaluation
- Compiler automatically explores the design space
- Generates synthesizable MaxJ HGL after exploration

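Dot Product DHDL Diagram
[Diagram: an outer Reduce loads Tile A and Tile B from vectors A and B in DRAM and accumulates into out; an inner Reduce multiplies the tile elements (×) and sums them (+). Annotated parameters: tile size (B), parallelism factors #1, #2, and #3, and a pipelining toggle.]
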
Dot Product in DHDL

    val output = Reg[Float]
    val vectorA = OffChipMem[Float](N)
    val vectorB = OffChipMem[Float](N)
    // Outer reduce: tile size (B), parallelism factor #1, pipelining toggle
    Reduce(N by B)(output){ i =>
      val tileA = Scratchpad[Float](B)
      val tileB = Scratchpad[Float](B)
      val acc = Reg[Float]
      // Stage 1: tile loads, parallelism factor #2
      tileA load vectorA(i :: i+B)
      tileB load vectorB(i :: i+B)
      // Stage 2: inner reduce, parallelism factor #3
      Reduce(B by 1)(acc){ j =>
        tileA(j) * tileB(j)
      }{a, b => a + b}
    }{a, b => a + b}

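For readers unfamiliar with DHDL, the nested Reduce in the listing above corresponds to the following software loop nest. This is a plain Python sketch of the same computation, not generated code; `tiled_dot` is a hypothetical name, and the sketch says nothing about how the hardware schedules or parallelizes the loops.

```python
def tiled_dot(vector_a, vector_b, tile_size):
    """Dot product computed tile by tile, mirroring the two nested reductions."""
    if len(vector_a) != len(vector_b):
        raise ValueError("vectors must have the same length")
    if tile_size < 1:
        raise ValueError("tile_size must be positive")
    output = 0.0
    for i in range(0, len(vector_a), tile_size):   # outer reduce: N by B
        tile_a = vector_a[i:i + tile_size]         # stand-in for the scratchpad load
        tile_b = vector_b[i:i + tile_size]
        acc = 0.0
        for j in range(len(tile_a)):               # inner reduce: B by 1
            acc += tile_a[j] * tile_b[j]
        output += acc                              # outer combine: a + b
    return output

result = tiled_dot([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], tile_size=2)
```

The tile size changes only how the work is grouped, not the result, which is what makes it a free parameter for the compiler to tune.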
DHDL to Hardware
[Flow: DHDL -> Simple Analyses -> DHDL + Design Space -> Design Space Exploration -> Fixed DHDL -> Code Generation -> MaxJ HGL -> MaxCompiler + Altera Toolchain]

DHDL Enables Fast DSE
[Diagram relating properties of a DHDL program to fast design space exploration. Labels:]
- DHDL program: parameterized templates, concise IR
- Simple models, linear space, easily derived constraints
- No unrolling, no scheduling
- Fast estimation, space pruning, smaller spaces
- Result: fast design space exploration

Latency Modeling
Analytical model
- Uses depth-first search to get the critical path of pipelines
- Accurate estimation requires data size annotations
Main-memory model
- Mathematical model fit to observed runtimes
- Parameterized by:
    - Number of contending readers/writers
    - Number of commands issued in sequence
    - Command length

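The critical-path step of the analytical model can be illustrated with a memoized depth-first search over a dataflow graph. This is a minimal sketch, not the DHDL compiler's implementation; the node names and per-node latencies are hypothetical.

```python
from functools import lru_cache

def critical_path_latency(latency, successors):
    """Longest latency-weighted path through a DAG, found by depth-first search."""

    @lru_cache(maxsize=None)
    def longest_from(node):
        # Depth-first: resolve every successor before finishing this node.
        tails = [longest_from(nxt) for nxt in successors.get(node, ())]
        return latency[node] + max(tails, default=0)

    return max(longest_from(node) for node in latency)

# Hypothetical pipeline body: two loads feed a multiply, then an add, then a store.
latency = {"load_a": 2, "load_b": 3, "mul": 6, "add": 8, "store": 1}
successors = {
    "load_a": ["mul"],
    "load_b": ["mul"],
    "mul": ["add"],
    "add": ["store"],
}

cycles = critical_path_latency(latency, successors)
```

Here the slower load (`load_b`) lies on the critical path, so the estimate is 3 + 6 + 8 + 1 = 18 cycles.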
Area Modeling
Analytical model
- Simple summation of the area of each template
- Includes estimates for delay lines and banked memories
Neural network models
- Model routing costs and memory duplication
- Simple 3-layer networks suffice here (we use 11-6-1)
- Trained on a set of about 200 characterization designs
Total area = analytical area + neural net area

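The way the two area estimates combine can be sketched as a template-area sum plus the output of a small 11-6-1 network. Everything beyond that shape is an assumption made for illustration: the tanh activation, the meaning of the 11 input features, and the placeholder weights below are hypothetical and are not the trained models from the talk.

```python
import math

def mlp_11_6_1(features, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of an 11-input, 6-hidden-unit, 1-output network (tanh assumed)."""
    if len(features) != 11 or len(w_hidden) != 6 or len(w_out) != 6:
        raise ValueError("expected an 11-6-1 network")
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

def total_area(template_areas, features, weights):
    """Total area = analytical sum over templates + neural-network correction."""
    analytical = sum(template_areas)
    return analytical + mlp_11_6_1(features, *weights)

# Placeholder (untrained) weights: the network contributes only its output bias.
placeholder_weights = ([[0.0] * 11 for _ in range(6)], [0.0] * 6, [0.0] * 6, 50.0)
estimate = total_area([100.0, 200.0, 300.0], [0.0] * 11, placeholder_weights)
```

With the placeholder weights the correction term is just the bias of 50, so the estimate is 600 + 50 = 650 area units; a trained network would make the correction depend on the design's features.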
Evaluation
- Accuracy: How accurate are the models, compared to observations?
- Speed: How fast are the predictions, compared to commercial tools?
- Space: Do the design parameters help capture an interesting space?
- Performance: How good is the best generated design?

Results: Model Accuracy (Area)
[Figure: bar charts of resource usage (%), model vs. synthesized, for ALMs, BRAMs, and DSPs across dotproduct, outerprod, tpchq6, blackscholes, gda, kmeans, and gemm.]
- Area models follow important trends and are accurate enough to drive automatic design space exploration

Results: Model Accuracy (Latency)
[Figure: bar chart of average error (%) for dotproduct, outerprod, tpchq6, blackscholes, gda, kmeans, and gemm. Bar labels as they appear in this text, not matched to benchmarks: 18.4%, 6.7%, 7%, 3.4%, 3.1%, 2.8%, 1.3%.]
- Latency models follow important trends and are accurate enough to drive automatic design space exploration

Results: Prediction Speed
DHDL:
    Benchmark          Designs    Search time
    Dot Product          5,426    5.3 ms / design
    Outer Product        1,702    30 ms / design
    TPCHQ6               5,426    8.2 ms / design
    Blackscholes           572    27 ms / design
    Matrix Multiply     70,740    11 ms / design
    K-Means             75,200    20 ms / design
    GDA                 42,800    17 ms / design
Vivado HLS:
    Benchmark          Designs    Search time
    GDA                    250    1.85 min / design
- 6533x speedup over HLS!

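The quoted speedup can be checked against the per-design times for GDA. The short calculation below uses only the rounded figures from the slide, so it lands slightly below 6533x; the gap is presumably due to rounding of the reported times.

```python
# Per-design estimation time for GDA, taken from the slide (rounded values).
hls_ms_per_design = 1.85 * 60 * 1000   # 1.85 min expressed in milliseconds
dhdl_ms_per_design = 17.0

speedup = hls_ms_per_design / dhdl_ms_per_design   # roughly 6529x

# Total search time implied by the design counts on the slide.
dhdl_total_s = 42_800 * dhdl_ms_per_design / 1000  # seconds for 42,800 designs
hls_total_min = 250 * 1.85                         # minutes for 250 designs
```

In other words, DHDL evaluates 42,800 GDA designs in about 12 minutes, while Vivado HLS needs over 7.5 hours for 250.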
Results: GDA Design Space
[Figure: three scatter plots of cycles (log scale, 10^7 to 10^10) against resource usage (% of maximum) for ALMs, DSPs, and BRAMs. Legend: valid design point, invalid design point, Pareto-optimal design, synthesized Pareto design point.]
- Performance limited by available BRAMs
- Space for GDA spans four orders of magnitude

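The distinction between valid, invalid, and Pareto-optimal points in the plot can be sketched as a simple filter. This is an illustrative sketch with a single resource axis and made-up design points; the actual exploration considers ALMs, DSPs, and BRAMs together.

```python
def pareto_front(points):
    """Keep points that no other point beats or ties on both axes (lower is better)."""
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

# Hypothetical design points: (resource usage in % of the device, runtime in cycles).
designs = [(20, 10**9), (40, 10**8), (60, 10**8), (80, 10**7), (120, 10**6)]

# A design is valid only if it fits on the device.
valid = [d for d in designs if d[0] <= 100]
front = pareto_front(valid)
```

The 120% point is fastest but invalid, and the 60% point is dropped because the 40% design matches its runtime with less area.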
Evaluation: Multi-Core Comparison
FPGA
- Altera Stratix V (28 nm)
- 150 MHz clock
- Peak main memory bandwidth of 37.5 GB/sec
Multi-core CPU
- Intel Xeon E5-2630 (32 nm)
- 2.3 GHz
- Peak main memory bandwidth of 42.6 GB/sec
- 6 cores, 6 threads
- Multi-threaded C++ code generated from Delite
Execution time = FPGA execution time
- Does not include CPU-FPGA communication or configuration time

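Results: Comparison with Multi-Core
[Figure: bar chart of speedup over the multi-core CPU for dotproduct, outerprod, tpchq6, blackscholes, gda, kmeans, and gemm, with bars marked as compute-bound or memory-bound. Bar labels as they appear in this text, not matched to benchmarks: 16.73, 4.55, 2.42, 1.11, 1.15, 1.07, 0.1.]
- Gemm uses multi-threaded OpenBLAS on the CPU
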
Summary
- DHDL exposes large design spaces to the compiler
- Parameterized templates enable fast, accurate estimators
- Fast estimators enable rapid automated DSE
- Up to 6533x faster estimation compared to Vivado HLS
- Up to 16.7x speedup over a 6-core CPU

Results: TPCHQ6 Design Space
[Figure: three scatter plots of cycles (log scale, 10^6 to 10^8) against resource usage (% of maximum) for ALMs, DSPs, and BRAMs. Legend: valid design point, invalid design point, Pareto-optimal design, synthesized Pareto design point.]

Results: Blackscholes Design Space
[Figure: three scatter plots of cycles (log scale, 10^6 to 10^8) against resource usage (% of maximum) for ALMs, DSPs, and BRAMs. Legend: valid design point, invalid design point, Pareto-optimal design, synthesized Pareto design point.]