Building FPGA-Targeted Accelerators with HeteroCL



1. Building FPGA-Targeted Accelerators with HeteroCL
Zhiru Zhang, School of ECE, Cornell University, csl.cornell.edu/~zhiruz
In collaboration with Cornell: Yi-Hsiang Lai, Shaojie Xiang, Yuwei Hu; UCLA: Yuze Chi, Jie Wang, Cody Yu, Jason Cong
TVM Workshop @ UW, 12/5/2019

2. HeteroCL Overview
▸ A programming framework built with TVM for productive hardware specialization
– Flexible: mixed declarative & imperative programming
– Efficient: mapping to high-performance spatial architecture templates
– Portable: clean decoupling of algorithm & hardware customizations
[Figure: HeteroCL sits between high-level DSLs and processors + accelerators; an algorithm spec (declarative + imperative) is paired with compute, data type, and memory customizations and mapped to CPUs and custom accelerators such as FPGAs]
github.com/cornell-zhang/heterocl
Y.-H. Lai, et al., HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing, FPGA'2019 (Best Paper Award)
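
To make the overview concrete, here is a minimal end-to-end HeteroCL program in the style of the project's tutorials: declare an algorithm, create a schedule (where all customizations attach), build, and run on the default CPU simulator. The tensor shape and the add-one kernel are illustrative choices of mine, not from the talk.

    import heterocl as hcl
    import numpy as np

    hcl.init(hcl.Int(32))                  # set the default data type

    A = hcl.placeholder((10,), "A")        # symbolic input tensor

    def kernel(A):
        # declarative algorithm: B[x] = A[x] + 1
        return hcl.compute(A.shape, lambda x: A[x] + 1, "B")

    s = hcl.create_schedule([A], kernel)   # customization primitives attach to s
    f = hcl.build(s)                       # default target: CPU simulation

    hcl_A = hcl.asarray(np.arange(10))
    hcl_B = hcl.asarray(np.zeros(10))
    f(hcl_A, hcl_B)
    print(hcl_B.asnumpy())                 # -> [ 1  2 ... 10 ]

The same algorithm can later be retargeted to FPGA backends by changing only the schedule and the build target, which is the portability claim the rest of the deck develops.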

3. Essential Techniques for Hardware Specialization
▸ Compute customization: parallelization, pipelining, ...
[Figure: a 2D array of processing elements (PEs) built from 32-bit multiplier and adder datapaths]

4. Essential Techniques for Hardware Specialization
▸ Compute customization: parallelization, pipelining, ...
▸ Data type customization: low-bitwidth integer, fixed point, ...
[Figure: the same PE array with the datapath narrowed to 16-bit operands]

5. Essential Techniques for Hardware Specialization
▸ Compute customization: parallelization, pipelining, ...
▸ Data type customization: low-bitwidth integer, fixed point, ...
▸ Memory customization: banking, data reuse, streaming, ...
[Figure: the PE array inside an accelerator, with a loader and unloader moving data between memory/storage and the array through FIFOs and a scratchpad]
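
As a sketch of how these three customization classes surface as HeteroCL primitives: pipeline for compute, downsize (via a scheme) for data types, and partition for memory banking. The toy doubling kernel, shapes, and the partition arguments (dim/factor per my reading of the API) are my assumptions.

    import heterocl as hcl

    hcl.init(hcl.Int(32))
    A = hcl.placeholder((64, 64), "A")

    def kernel(A):
        return hcl.compute(A.shape, lambda y, x: A[y, x] * 2, "B")

    # data type customization: shrink B to 8-bit integers via a scheme
    scheme = hcl.create_scheme([A], kernel)
    scheme.downsize(kernel.B, hcl.Int(8))
    s = hcl.create_schedule_from_scheme(scheme)

    # compute customization: pipeline the inner loop of B
    s[kernel.B].pipeline(kernel.B.axis[1])

    # memory customization: cyclically bank the second dimension of A into 4
    s.partition(A, hcl.Partition.Cyclic, dim=2, factor=4)

    f = hcl.build(s)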

6. FPGA as a Programmable Accelerator
▸ Massive amount of fine-grained parallelism
– Highly parallel / deeply pipelined architecture
– Distributed data/control dispatch
▸ Silicon configurable to fit the application
– Compute at desired numerical accuracy
– Customized memory hierarchy
▸ High performance/watt
– Low clock speed
– Pre-fabricated architecture blocks
But FPGAs are really hard to PROGRAM.
[Figure: AWS F1 FPGA instance, Xilinx UltraScale+ VU9P: ~2 million logic blocks, ~5000 DSP blocks, ~300 Mb of block RAM. Figure source: David Pellerin, AWS]

7. Increasing Use of High-Level Synthesis (HLS)
▸ The same 8-bit counter in HLS C:

    uint8 dut() {
      static uint8 c;
      c += 1;
      return c;
    }

vs. RTL Verilog:

    module dut(rst, clk, q);
      input rst;
      input clk;
      output [7:0] q;
      reg [7:0] c;
      always @(posedge clk) begin
        if (rst == 1'b1)
          c <= 8'b00000000;
        else
          c <= c + 1;
      end
      assign q = c;
    endmodule

[Chart: number of publications per year, 2012-2018; 3000+ HLS papers since 2012]

8. FPGA Programming with HLS
▸ Example: 3×3 convolution. The algorithm itself is four lines:

    for (int y = 0; y < N; y++)
      for (int x = 0; x < N; x++)
        for (int r = 0; r < 3; r++)
          for (int c = 0; c < 3; c++)
            out[x, y] += image[x+r, y+c] * kernel[r, c];

▸ With custom compute (loop tiling), custom data types (quantization), and custom memory (reuse buffers) applied, the HLS code becomes:

    #pragma HLS array_partition variable=kernel dim=0
    hls::LineBuffer<3, N, ap_fixed<8,4> > buf;      // custom memory: reuse buffers
    hls::Window<3, 3, ap_fixed<8,4> > window;
    for (int y = 0; y < N; y++) {
      for (int xo = 0; xo < N/M; xo++) {            // custom compute: loop tiling
    #pragma HLS pipeline II=1
        for (int xi = 0; xi < M; xi++) {
          int x = xo*M + xi;
          ap_fixed<8,4> acc = 0;                    // custom data type: quantization
          ap_fixed<8,4> in = image[y][x];
          buf.shift_up(x);
          buf.insert_top(in, x);
          window.shift_left();
          for (int r = 0; r < 2; r++)
            window.insert(buf.getval(r, x), r, 2);
          window.insert(in, 2, 2);
          if (y >= 2 && x >= 2) {
            for (int r = 0; r < 3; r++)
              for (int c = 0; c < 3; c++)
                acc += window.getval(r, c) * kernel[r][c];
            out[y-2][x-2] = acc;
          }
        }
      }
    }

▸ Hardware customization and algorithm are entangled: the algorithm is split into fragments (Algorithm #1-3) interleaved with compute, data type, and memory customization code, making it less portable, less maintainable, and less productive.
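
For reference, here is a plain NumPy version of the 3×3 convolution that any customized variant must reproduce. The function and variable names are mine; note that the slide's naive loop reads past the image border, so this reference keeps only the (N-2)×(N-2) valid region, matching the out[y-2][x-2] writes in the HLS code above.

    import numpy as np

    def conv3x3_ref(image, kernel):
        # out[y, x] = sum over r, c of image[y+r, x+c] * kernel[r, c]
        N = image.shape[0]
        out = np.zeros((N - 2, N - 2), dtype=image.dtype)
        for r in range(3):
            for c in range(3):
                out += kernel[r, c] * image[r:r + N - 2, c:c + N - 2]
        return out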

9. Decoupling Algorithm from Hardware Customizations
▸ HLS C: entangled algorithm specification and customization schemes; the algorithm (Algorithm #1-3) is interleaved with compute, data type, and memory customization.
▸ HeteroCL: fully decoupled customization schemes [1,2,3], plus a clean abstraction capturing the interdependence among them; the algorithm (Algorithm #1-3) stays in one place, with compute, data type, and memory customizations specified separately.

10. Decoupled Compute Customization
▸ Declarative programming (TVM-based) in HeteroCL:

    r = hcl.reduce_axis(0, 3)
    c = hcl.reduce_axis(0, 3)
    out = hcl.compute((N, N),
        lambda y, x: hcl.sum(image[x+r, y+c] * kernel[r, c],
                             axis=[r, c]))

is equivalent to the HLS loop nest:

    for (int y = 0; y < N; y++)
      for (int x = 0; x < N; x++)
        for (int r = 0; r < 3; r++)
          for (int c = 0; c < 3; c++)
            out[x, y] += image[x+r, y+c] * kernel[r, c];

▸ Decoupled customization primitives (portable, less error-prone) tile and reorder the loops:

    s = hcl.create_schedule()
    xo, xi = s[out].split(out.x, factor=M)
    s[out].reorder(xi, xo, out.y)

which yields:

    for (int xi = 0; xi < M; xi++)
      for (int xo = 0; xo < N/M; xo++)
        for (int y = 0; y < N; y++)
          for (int r = 0; r < 3; r++)
            for (int c = 0; c < 3; c++)
              out[xi+xo*M, y] += image[xi+xo*M+r, y+c] * kernel[r, c];
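
Filled out into a runnable sketch: released HeteroCL creates the schedule from the inputs and the algorithm function, and addresses loops through stage axes rather than the out.x shorthand on the slide. The padded image shape, the concrete N and M, and the "vhls" code-emission target are my additions.

    import heterocl as hcl

    N, M = 64, 8
    hcl.init(hcl.Float())

    image  = hcl.placeholder((N + 2, N + 2), "image")
    kernel = hcl.placeholder((3, 3), "kernel")

    def conv(image, kernel):
        r = hcl.reduce_axis(0, 3)
        c = hcl.reduce_axis(0, 3)
        return hcl.compute((N, N),
            lambda y, x: hcl.sum(image[y + r, x + c] * kernel[r, c],
                                 axis=[r, c]), "out")

    s = hcl.create_schedule([image, kernel], conv)
    out = conv.out
    xo, xi = s[out].split(out.axis[1], factor=M)  # tile the x loop
    s[out].reorder(xi, xo)                        # move xi outward, as on the slide
    print(hcl.build(s, target="vhls"))            # emit Vivado HLS C++ source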

11. Decoupled Data Type Customization
▸ Bit-accurate data type support (e.g., Int(15), Fixed(7,4))
– W.I.P.: custom floating-point types (e.g., bfloat16)
▸ Decoupled customization primitives: downsize & quantize

    r = hcl.reduce_axis(0, 3)
    c = hcl.reduce_axis(0, 3)
    out = hcl.compute((N, N),
        lambda y, x: hcl.sum(image[x+r, y+c] * kernel[r, c],
                             axis=[r, c]))

    # quantize/downsize
    for i in range(2, 8):
        s = hcl.create_scheme()
        s.quantize([out], Fixed(i, i-2))

[Figure: bit layouts. 32-bit floating point: 1b sign, 8b exponent, 23b mantissa. 16-bit brain floating point (bfloat): 1b sign, 8b exponent, 7b mantissa. 8-bit fixed point Fixed(8,6): 2b integer, 6b fraction. 2-bit integer Int(2): 2b.]
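
A runnable version of the quantization sweep in the style of HeteroCL's data type tutorial. The squaring kernel and the random test data are illustrative; Fixed(i, i-2) keeps two integer bits, as on the slide.

    import heterocl as hcl
    import numpy as np

    hcl.init(hcl.Float())
    A = hcl.placeholder((100,), "A")

    def kernel(A):
        return hcl.compute(A.shape, lambda x: A[x] * A[x], "out")

    a = np.random.rand(100)      # inputs in [0, 1)
    ref = a * a

    for i in range(2, 8):
        scheme = hcl.create_scheme([A], kernel)
        scheme.quantize([kernel.out], hcl.Fixed(i, i - 2))  # i bits, i-2 fractional
        s = hcl.create_schedule_from_scheme(scheme)
        f = hcl.build(s)
        out = hcl.asarray(np.zeros(100), dtype=hcl.Fixed(i, i - 2))
        f(hcl.asarray(a), out)
        print("Fixed(%d,%d): mean error %.4f"
              % (i, i - 2, np.abs(out.asnumpy() - ref).mean()))

Sweeping the bitwidth this way lets accuracy be traded against hardware cost without touching the algorithm, which is the point of decoupling.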

12. Decoupled Memory Customization
▸ Inferring custom on-chip storage with the reuse_at() primitive. The loop nest

    for (int y = 0; y < N; y++)
      for (int x = 0; x < N; x++)
        for (int r = 0; r < 3; r++)
          for (int c = 0; c < 3; c++)
            out[x, y] += image[x+r, y+c] * kernel[r, c];

written declaratively as

    r = hcl.reduce_axis(0, 3)
    c = hcl.reduce_axis(0, 3)
    out = hcl.compute((N, N),
        lambda y, x: hcl.sum(image[x+r, y+c] * kernel[r, c],
                             axis=[r, c]))

gets a line buffer from a single customization primitive:

    s = hcl.create_schedule()
    linebuf = s[image].reuse_at(out, out.y)

[Figure: a line buffer sliding over the image along y to produce out]

13. Decoupled Memory Customization
▸ A second reuse_at() call infers a window buffer on top of the line buffer:

    winbuf = s[linebuf].reuse_at(out, out.x)

[Figure: the line buffer feeding a 3×3 window buffer that slides along x over the image]

14. Decoupled Memory Customization
▸ With both buffers in place, the image streams through the line buffer and window buffer to compute image ⨂ kernel = out.
[Figure: image ⨂ kernel = out, realized with the line buffer and window buffer]
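
A runnable sketch combining both reuse buffers, close to HeteroCL's reuse-buffer tutorial. In the released API, reuse_at hangs off the schedule as s.reuse_at(tensor, stage, axis) rather than the s[image].reuse_at(...) shorthand on the slides; the shapes and test data are illustrative.

    import heterocl as hcl
    import numpy as np

    hcl.init()
    N = 10
    A = hcl.placeholder((N, N), "A")
    F = hcl.placeholder((3, 3), "F")

    def conv(A, F):
        r = hcl.reduce_axis(0, 3)
        c = hcl.reduce_axis(0, 3)
        return hcl.compute((N - 2, N - 2),
            lambda y, x: hcl.sum(A[y + r, x + c] * F[r, c],
                                 axis=[r, c]), "B")

    s = hcl.create_schedule([A, F], conv)
    B = conv.B
    # line buffer: reuse rows of A as the window slides along y
    linebuf = s.reuse_at(A, s[B], B.axis[0])
    # window buffer: reuse the 3x3 window as it slides along x
    winbuf = s.reuse_at(linebuf, s[B], B.axis[1])
    f = hcl.build(s)

    a = np.random.randint(0, 10, (N, N))
    k = np.random.randint(0, 10, (3, 3))
    b = np.zeros((N - 2, N - 2))
    hcl_a, hcl_k, hcl_b = hcl.asarray(a), hcl.asarray(k), hcl.asarray(b)
    f(hcl_a, hcl_k, hcl_b)     # hcl_b now holds the 3x3 'valid' convolution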

15. Decoupled Data Placement
▸ A unified interface for specifying data placement & movement

    from heterocl import platform

    @hcl.def_()
    def conv(input, kernel):
        r = hcl.reduce_axis(0, 3)
        c = hcl.reduce_axis(0, 3)
        return hcl.compute((N, N),
            lambda y, x: hcl.sum(input[x+r, y+c] * kernel[r, c],
                                 axis=[r, c]))

    out1 = conv(image, kernel1, "conv1")
    out2 = conv(out1, kernel2, "conv2")
    s = hcl.create_schedule()
    p = platform.fpga_soc
    f = hcl.build(p)

[Figure: host holding image, kernel1/kernel2, out1/out2; an Xcel region with compute units conv1 and conv2; all data initially resides on the host]

16. Decoupled Data Placement
▸ A unified interface for specifying data placement & movement between host and accelerator; compute placement is then inferred automatically:

    s.to([image, kernel1, kernel2], p.xcel)
    s.to(out2, p.host)

[Figure: image and the kernels are moved to the Xcel, where conv1 streams out1 to conv2; out2 is moved back to the host]
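
A sketch of the two-kernel pipeline with explicit data placement. I am assuming one of HeteroCL's built-in platform handles, hcl.platform.zc706, in place of the slide's platform.fpga_soc, and I write the two convolutions as plain compute stages rather than with @hcl.def_; exact platform setup may differ by HeteroCL version.

    import heterocl as hcl

    N = 64
    hcl.init(hcl.Float())
    image   = hcl.placeholder((N + 4, N + 4), "image")
    kernel1 = hcl.placeholder((3, 3), "kernel1")
    kernel2 = hcl.placeholder((3, 3), "kernel2")

    def top(image, kernel1, kernel2):
        r1 = hcl.reduce_axis(0, 3, "r1")
        c1 = hcl.reduce_axis(0, 3, "c1")
        out1 = hcl.compute((N + 2, N + 2),
            lambda y, x: hcl.sum(image[y + r1, x + c1] * kernel1[r1, c1],
                                 axis=[r1, c1]), "out1")
        r2 = hcl.reduce_axis(0, 3, "r2")
        c2 = hcl.reduce_axis(0, 3, "c2")
        return hcl.compute((N, N),
            lambda y, x: hcl.sum(out1[y + r2, x + c2] * kernel2[r2, c2],
                                 axis=[r2, c2]), "out2")

    s = hcl.create_schedule([image, kernel1, kernel2], top)
    p = hcl.platform.zc706                     # FPGA SoC platform (assumption)
    s.to([image, kernel1, kernel2], p.xcel)    # place inputs on the accelerator
    s.to(top.out2, p.host)                     # move the final result back to host
    f = hcl.build(s, target=p)                 # compute placement inferred in between

The Stream arrow between the two kernels on the slide is, as I understand the interface, also expressed with s.to, by directing the intermediate tensor into the consumer stage through an on-chip FIFO.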
