Automatic Generation of Efficient Accelerator Designs for Reconfigurable Hardware
David Koeplinger, Raghu Prabhakar, Yaqi Zhang, Christina Delimitrou, Christos Kozyrakis, Kunle Olukotun
Stanford University
ISCA 2016

FPGAs in Data Centers
- Increasing interest in using FPGAs as application accelerators in data centers
- Key advantage: performance per watt

Problem: Large Design Spaces
- Design spaces grow exponentially with the number of parameters
- Even relatively small designs can have very large spaces
- Parameters can change runtime by orders of magnitude
- Parameters typically aren't independent
- Manual exploration is tedious and may result in suboptimal designs
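To make the growth concrete, here is a minimal Python sketch that counts and enumerates the points in a small design space. The parameter names and value ranges are hypothetical and chosen only for illustration; they are not taken from the talk.

```python
from itertools import product
from math import prod

# Hypothetical parameter ranges for a small design (illustrative only).
params = {
    "tile_size": [64, 128, 256, 512, 1024],
    "outer_par": [1, 2, 4, 8],
    "inner_par": [1, 2, 4, 8, 16],
    "pipelined": [False, True],
}


def space_size(params):
    """Number of design points: the product of the per-parameter choice counts."""
    return prod(len(choices) for choices in params.values())


def enumerate_space(params):
    """Yield every design point as a dict of parameter name -> value."""
    names = list(params)
    for values in product(*(params[name] for name in names)):
        yield dict(zip(names, values))


print(space_size(params))  # 5 * 4 * 5 * 2 = 200
```

Every additional parameter multiplies the count by its number of choices, which is why exhaustive manual exploration stops being practical quickly.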

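Design Space Example: Dot Product
Algorithm: dot product of vectors A and B
[Figure: vectors A and B in DRAM are loaded into scratchpads Tile A and Tile B on the FPGA; elements are multiplied and added into register acc. Key: scratchpad, register, operation.]
Small and simple, but slow!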

Important Parameters: Tile Sizes
Algorithm: dot product of vectors A and B
[Figure: dot product design with scratchpads Tile A and Tile B on the FPGA. Key: scratchpad, register, operation.]
- Increases length of DRAM accesses (runtime)
- Increases exploited spatial locality (runtime)
- Increases local memory sizes (area)

Important Parameters: Pipelining
Algorithm: dot product of vectors A and B
[Figure: dot product design split into Stage 1 and Stage 2, with Tile A and Tile B as double buffers. Key: double buffer, register, operation.]
- Overlaps memory and compute (runtime)
- Increases local memory sizes (area)
- Adds synchronization logic (area)
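The runtime effect of overlapping memory and compute can be approximated with the standard two-stage pipeline formula. The sketch below is a simplified model under stated assumptions: every tile has the same load and compute latency, and synchronization overhead is ignored. The function names and cycle counts are hypothetical.

```python
def sequential_cycles(n_tiles, load, compute):
    """No pipelining: each tile is loaded, then processed, one after another."""
    return n_tiles * (load + compute)


def pipelined_cycles(n_tiles, load, compute):
    """Two-stage pipeline with double buffering.

    The first load fills the pipeline, the slower stage then sets the
    steady-state rate, and the last compute drains the pipeline.
    """
    if n_tiles == 0:
        return 0
    return load + (n_tiles - 1) * max(load, compute) + compute


# Hypothetical per-tile latencies, in cycles.
print(sequential_cycles(100, 400, 300))  # 70000
print(pipelined_cycles(100, 400, 300))   # 40300
```

In this model the pipelined design approaches the cost of the slower stage alone, at the price of the extra buffering and synchronization logic listed above.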

Important Parameters: Parallelization
Algorithm: dot product of vectors A and B
[Figure: dot product design with duplicated multipliers feeding an adder tree into register acc. Key: scratchpad, register, operation.]
- Improves element throughput (runtime)
- Duplicates compute resources (area)

Language/Tool Requirements
Tools compared: VHDL, Verilog, LegUp, Vivado HLS, OpenCL SDK, Aladdin, DHDL
Requirements:
- Targets FPGAs
- Enables pipelining at arbitrary loop levels
- Exposes design parameters to the compiler
- Evaluates designs prior to synthesis
- Explores design space automatically
- Generates synthesizable code

Delite Hardware Definition Language
- Includes a variety of parameterized templates
- Parallel patterns with implicit parallelization factors
- Pipeline constructs for pipelining at arbitrary levels
- Explicit size parameters for loop step size and buffer sizes
- All parameters are exposed to the compiler
- Compiler includes latency and area models for quick design evaluation
- Compiler automatically explores the design space
- Generates synthesizable MaxJ HGL after exploration

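Dot Product DHDL Diagram
[Figure: vectors A and B in DRAM are loaded into Tile A and Tile B; an inner Reduce multiplies and adds tile elements, and an outer Reduce accumulates the result into out. Annotated parameters: tile size (B), parallelism factors #1, #2, and #3, and a pipelining toggle.]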

Dot Product in DHDL

    val output  = Reg[Float]
    val vectorA = OffChipMem[Float](N)
    val vectorB = OffChipMem[Float](N)

    // Outer reduce: parallelism factor #1, pipelining toggle
    Reduce(N by B)(output){ i =>
      val tileA = Scratchpad[Float](B)   // tile size (B)
      val tileB = Scratchpad[Float](B)
      val acc   = Reg[Float]
      // Stage 1: tile loads, parallelism factor #2
      tileA load vectorA(i :: i+B)
      tileB load vectorB(i :: i+B)
      // Stage 2: inner reduce, parallelism factor #3
      Reduce(B by 1)(acc){ j =>
        tileA(j) * tileB(j)
      }{a, b => a + b}
    }{a, b => a + b}
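As a reading aid, the following Python function is a sequential software model of what the nested Reduce computes. It is not DHDL and says nothing about the generated hardware; the function name and the requirement that the vector length be a multiple of the tile size are assumptions made for this sketch.

```python
def dot_product_tiled(vector_a, vector_b, tile_size):
    """Sequential model of the tiled dot product (outer and inner reduce)."""
    n = len(vector_a)
    if n != len(vector_b):
        raise ValueError("vectors must have the same length")
    if tile_size <= 0 or n % tile_size != 0:
        raise ValueError("length must be a positive multiple of tile_size")

    output = 0.0
    # Outer reduce: one iteration per tile (N by B).
    for i in range(0, n, tile_size):
        # Stage 1: copy one tile of each vector into local storage.
        tile_a = vector_a[i:i + tile_size]
        tile_b = vector_b[i:i + tile_size]
        # Stage 2: inner reduce over the tile (B by 1).
        acc = 0.0
        for j in range(tile_size):
            acc += tile_a[j] * tile_b[j]
        output += acc
    return output


print(dot_product_tiled([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0], 2))  # 10.0
```

The tile size changes how the work is grouped but not the result, which is what lets the compiler treat it as a free design parameter.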

DHDL to Hardware
DHDL -> Simple Analyses -> DHDL + Design Space -> Design Space Exploration -> Fixed DHDL -> Code Generation -> MaxJ HGL -> MaxCompiler + Altera Toolchain

DHDL Enables Fast DSE
[Figure: flow from a DHDL program, through parameterized templates and a concise IR, to fast design space exploration. Labels: simple linear models; easily derived space constraints; space pruning; fast estimation; no unrolling; no scheduling; smaller spaces.]

Latency Modeling
- Analytical model
  - Uses depth-first search to get the critical path of pipelines
  - Accurate estimation requires data size annotations
- Main-memory model
  - Mathematical model fit to observed runtimes
  - Parameterized by:
    - Number of contending readers/writers
    - Number of commands issued in sequence
    - Command length
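The critical-path step can be illustrated with a memoized depth-first search over a dataflow graph. This is a generic sketch of the technique, not the compiler's implementation; the node names and per-node latencies are hypothetical.

```python
from functools import lru_cache


def critical_path_latency(latency, successors):
    """Longest latency-weighted path through a DAG, found by memoized DFS.

    latency:    dict mapping node -> latency in cycles
    successors: dict mapping node -> list of downstream nodes
    """
    @lru_cache(maxsize=None)
    def longest_from(node):
        # Latency of this node plus the slowest path through its successors.
        tail = max((longest_from(s) for s in successors.get(node, ())), default=0)
        return latency[node] + tail

    return max((longest_from(node) for node in latency), default=0)


# Hypothetical pipeline body: two reads feed a multiply, then an add, then a write.
latency = {"read_a": 1, "read_b": 1, "mul": 6, "add": 4, "write_acc": 1}
successors = {
    "read_a": ["mul"],
    "read_b": ["mul"],
    "mul": ["add"],
    "add": ["write_acc"],
}

print(critical_path_latency(latency, successors))  # 1 + 6 + 4 + 1 = 12
```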

Area Modeling
- Analytical model
  - Simple summation of the area of each template
  - Includes estimates for delay lines and banked memories
- Neural network models
  - Model routing costs and memory duplication
  - Simple 3-layer networks suffice here (we use 11-6-1)
  - Trained on a set of about 200 characterization designs
- Total area = analytical area + neural net area
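The sum "total area = analytical area + neural net area" can be sketched as follows. Only the 11-6-1 layer shape comes from the slide. The template names, per-template areas, input features, tanh activation, and weights are all placeholders assumed for illustration.

```python
import math


def analytical_area(template_counts, area_per_template):
    """Sum the area of every template instance in the design."""
    return sum(count * area_per_template[name]
               for name, count in template_counts.items())


def mlp_11_6_1(features, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of an 11-6-1 network (tanh hidden layer is an assumption)."""
    if len(features) != 11 or len(w_hidden) != 6 or len(w_out) != 6:
        raise ValueError("expected 11 inputs and 6 hidden units")
    hidden = [math.tanh(sum(w * x for w, x in zip(row, features)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


def total_area(template_counts, area_per_template, features, weights):
    """Total area = analytical area + neural net correction."""
    return (analytical_area(template_counts, area_per_template)
            + mlp_11_6_1(features, *weights))


# Placeholder data: hypothetical templates and untrained (all-zero) weights.
counts = {"scratchpad": 2, "multiplier": 4, "adder": 4}
areas = {"scratchpad": 120, "multiplier": 35, "adder": 20}
zero_weights = ([[0.0] * 11 for _ in range(6)], [0.0] * 6, [0.0] * 6, 40.0)

print(total_area(counts, areas, [0.0] * 11, zero_weights))  # 460 + 40.0 = 500.0
```

In a real flow the weights would come from training on the characterization designs; here they only show where the learned correction enters the sum.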

Evaluation
- Accuracy: How accurate are the models, compared to observations?
- Speed: How fast are the predictions, compared to commercial tools?
- Space: Do the design parameters help capture an interesting space?
- Performance: How good is the best generated design?

Results: Model Accuracy (Area)
[Figure: modeled vs. synthesized resource usage (%) of ALMs, BRAMs, and DSPs for dotproduct, outerprod, tpchq6, blackscholes, gda, kmeans, and gemm.]
Area models follow important trends and are accurate enough to drive automatic design space exploration.

Results: Model Accuracy (Latency)
[Figure: average latency error (%) for dotproduct, outerprod, tpchq6, blackscholes, gda, kmeans, and gemm. Labeled errors range from 1.3% to 18.4%.]
Latency models follow important trends and are accurate enough to drive automatic design space exploration.

Results: Prediction Speed

DHDL:
Benchmark          Designs    Search Time
Dot Product          5,426    5.3 ms / design
Outer Product        1,702    30 ms / design
TPCHQ6               5,426    8.2 ms / design
Blackscholes           572    27 ms / design
Matrix Multiply     70,740    11 ms / design
K-Means             75,200    20 ms / design
GDA                 42,800    17 ms / design

Vivado HLS:
Benchmark          Designs    Search Time
GDA                    250    1.85 min / design

6533x speedup over HLS!
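The reported speedup can be checked against the GDA rows of the table. The short calculation below uses only the rounded per-design times shown on the slide, so it lands near 6533x rather than exactly on it; the variable names are ours.

```python
# Per-design estimation times for GDA, as shown on the slide (rounded).
dhdl_ms_per_design = 17.0
hls_min_per_design = 1.85

# Convert the Vivado HLS time to milliseconds and take the ratio.
hls_ms_per_design = hls_min_per_design * 60 * 1000
speedup = hls_ms_per_design / dhdl_ms_per_design

# Total search time for the number of designs each tool evaluated.
dhdl_total_s = 42_800 * dhdl_ms_per_design / 1000
hls_total_min = 250 * hls_min_per_design

print(f"speedup per design: about {speedup:.0f}x")
print(f"DHDL total: about {dhdl_total_s / 60:.1f} min for 42,800 designs")
print(f"Vivado HLS total: about {hls_total_min / 60:.1f} h for 250 designs")
```

With these rounded inputs the ratio comes out a little above 6500x, consistent with the slide's 6533x figure.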

Results: GDA Design Space
[Figure: cycles (log scale) vs. resource usage (% of maximum) for ALMs, DSPs, and BRAMs. Legend: valid design point, invalid design point, Pareto-optimal design, synthesized Pareto design point.]
- Performance limited by available BRAMs
- Space for GDA spans four orders of magnitude
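The Pareto-optimal points in plots like this one are the valid designs that no other valid design beats on both area and cycles. The sketch below shows one way to extract them. It assumes a single combined area percentage per design rather than separate ALM, DSP, and BRAM figures, and the data points are made up.

```python
def pareto_front(designs, max_area=100):
    """Return the Pareto-optimal (area, cycles) points among valid designs.

    designs:  iterable of (area_percent, cycles) tuples
    max_area: designs above this resource usage are invalid and discarded
    """
    valid = sorted({d for d in designs if d[0] <= max_area})
    front = []
    best_cycles = float("inf")
    # Walking in order of increasing area, a design is Pareto-optimal only if
    # it is strictly faster than every design that is no larger than it.
    for area, cycles in valid:
        if cycles < best_cycles:
            front.append((area, cycles))
            best_cycles = cycles
    return front


# Hypothetical design points: (area % of device, estimated cycles).
designs = [
    (20, 10**9), (40, 10**8), (45, 5 * 10**8),
    (60, 10**8), (80, 10**7), (120, 10**6),  # 120% does not fit the device
]

print(pareto_front(designs))
# [(20, 1000000000), (40, 100000000), (80, 10000000)]
```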

Evaluation: Multi-Core Comparison
- FPGA
  - Altera Stratix V (28 nm)
  - 150 MHz clock
  - Peak main memory bandwidth of 37.5 GB/sec
- Multi-core CPU
  - Intel Xeon E5-2630 (32 nm)
  - 2.3 GHz
  - Peak main memory bandwidth of 42.6 GB/sec
  - 6 cores, 6 threads
  - Multi-threaded C++ code generated from Delite
- Execution time = FPGA execution time
  - Does not include CPU-FPGA communication or configuration time

Results: Comparison with Multi-Core
[Figure: speedup over the multi-core CPU for dotproduct, outerprod, tpchq6, blackscholes, gda, kmeans, and gemm, grouped into compute-bound and memory-bound benchmarks. Labeled speedups range from 0.1x to 16.73x.]
Gemm uses multi-threaded OpenBLAS on the CPU.

Summary
- DHDL exposes large design spaces to the compiler
- Parameterized templates enable fast, accurate estimators
- Fast estimators enable rapid automated DSE
- Up to 6533x faster estimation compared to Vivado HLS
- Up to 16.7x speedup over a 6-core CPU

Results: TPCHQ6 Design Space
[Figure: cycles (log scale) vs. resource usage (% of maximum) for ALMs, DSPs, and BRAMs. Legend: valid design point, invalid design point, Pareto-optimal design, synthesized Pareto design point.]

Results: Blackscholes Design Space
[Figure: cycles (log scale) vs. resource usage (% of maximum) for ALMs, DSPs, and BRAMs. Legend: valid design point, invalid design point, Pareto-optimal design, synthesized Pareto design point.]
