

  1. HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing
Yi-Hsiang Lai¹, Yuze Chi², Yuwei Hu¹, Jie Wang², Cody Hao Yu²,³, Yuan Zhou¹, Jason Cong², Zhiru Zhang¹
¹ Cornell University  ² University of California, Los Angeles  ³ Falcon Computing Solutions, Inc.

  2. Essential Techniques for Hardware Acceleration
▸ Compute customization
  • Parallelization
  • Pipelining, etc.
▸ Data type customization
  • Low-bitwidth integer
  • Fixed point, etc.
▸ Memory customization
  • Banking
  • Data reuse (e.g., FIFOs), etc.
There exists interdependence among the different customizations.

  3. Hardware Customization in High-Level Synthesis
▸ Driving example: convolutional kernel

Algorithm:
    for (int y = 0; y < N; y++)
      for (int x = 0; x < N; x++)
        for (int r = 0; r < 3; r++)
          for (int c = 0; c < 3; c++)
            out[x, y] += image[x+r, y+c] * kernel[r, c]

HLS C implementation, with customizations entangled in the algorithm:
    #pragma HLS array_partition variable=kernel dim=0
    hls::LineBuffer<3, N, ap_fixed<8,4> > buf;       // custom memory (reuse buffers)
    hls::Window<3, 3, ap_fixed<8,4> > window;
    for (int y = 0; y < N; y++) {
      for (int xo = 0; xo < N/M; xo++) {             // custom compute (loop tiling)
    #pragma HLS pipeline II=1
        for (int xi = 0; xi < M; xi++) {
          int x = xo*M + xi;
          ap_fixed<8,4> acc = 0;                     // custom data type (quantization)
          ap_fixed<8,4> in = image[y][x];
          buf.shift_up(x);
          buf.insert_top(in, x);
          window.shift_left();
          for (int r = 0; r < 2; r++)
            window.insert(buf.getval(r, x), r, 2);
          window.insert(in, 2, 2);
          if (y >= 2 && x >= 2) {
            for (int r = 0; r < 3; r++)
              for (int c = 0; c < 3; c++)
                acc += window.getval(r, c) * kernel[r][c];
            out[y-2][x-2] = acc;
          }
        }
      }
    }

Entangled hardware customization (compute, data type, memory) and algorithm:
• Less portable
• Less maintainable
• Less productive
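The driving example can be checked against a plain-Python reference before any customization is applied. This is an illustrative sketch, not the slides' code; the sample image and kernel below are made up for testing.

```python
# Plain-Python reference for the 3x3 valid-mode convolution driving example.
# The HLS and HeteroCL versions on these slides compute the same result.
def conv2d(image, kernel):
    n = len(image)
    k = len(kernel)          # 3 in the running example
    m = n - k + 1            # valid-mode output size
    out = [[0] * m for _ in range(m)]
    for y in range(m):
        for x in range(m):
            for r in range(k):
                for c in range(k):
                    out[y][x] += image[y + r][x + c] * kernel[r][c]
    return out
```

With an identity kernel (single 1 in the center) the output is the interior of the input, which makes the reference easy to sanity-check.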

  4. Decoupling Algorithm from Hardware Customization
▸ HLS C [1,2,3]: entangled algorithm specification and customization schemes; each algorithm carries its own compute, data type, and memory customization
▸ Halide, TVM, etc. [4,5,6,7,8]: decoupled temporal schedules (compute customization), with data type and memory customization still entangled
▸ HeteroCL: fully decoupled customization schemes, plus a clean abstraction capturing the interdependence among them
[1] Intel HLS  [2] Xilinx Vivado HLS  [3] Canis, et al. FPGA'11  [4] Ragan-Kelley, et al. SIGPLAN'13  [5] Baghdadi, et al. arXiv'18  [6] Rong, et al. arXiv'17  [7] Pu, et al. TACO'17  [8] Chen, et al. arXiv'18

  5. Decoupled Compute Customization

Algorithm (declarative programming), HeteroCL code:
    r = hcl.reduce_axis(0, 3)
    c = hcl.reduce_axis(0, 3)
    out = hcl.compute((N, N),
        lambda y, x: hcl.sum(image[x+r, y+c] * kernel[r, c],
                             axis=[r, c]))

Decoupled customization primitives (tile loop, reorder loops):
    s = hcl.create_schedule()
    xo, xi = s[out].split(out.x, factor=M)
    s[out].reorder(xi, xo, out.y)

Generated HLS code:
    for (int xi = 0; xi < M; xi++)
      for (int xo = 0; xo < N/M; xo++)
        for (int y = 0; y < N; y++)
          for (int r = 0; r < 3; r++)
            for (int c = 0; c < 3; c++)
              out[xi+xo*M, y] += image[xi+xo*M+r, y+c] * kernel[r, c]

• More productive / less labor-intensive
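The effect of split and reorder can be modeled in plain Python: the transformation changes only the order in which loop iterations execute, never the set of points visited. This is an illustrative sketch of the loop-nest semantics, not HeteroCL's implementation.

```python
# Model of split + reorder on the (y, x) loop nest of the example:
# x of extent N is split into (xo, xi) with factor M, then xi is
# hoisted outermost. The iteration *order* changes; the iteration
# *set* is identical, so the computed result is unchanged.
def original_order(N):
    return [(x, y) for y in range(N) for x in range(N)]

def tiled_order(N, M):
    assert N % M == 0
    order = []
    for xi in range(M):              # reordered: xi outermost
        for xo in range(N // M):
            for y in range(N):
                order.append((xo * M + xi, y))
    return order
```

Comparing the two orders as sets shows they cover the same iteration space, which is why the schedule can be chosen independently of the algorithm.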

  6. Decoupled Memory Customization
▸ Primitives can be applied in a user-defined sequence

Algorithm, HeteroCL code:
    r = hcl.reduce_axis(0, 3)
    c = hcl.reduce_axis(0, 3)
    out = hcl.compute((N, N),
        lambda y, x: hcl.sum(image[x+r, y+c] * kernel[r, c],
                             axis=[r, c]))

Equivalent loop nest:
    for (int y = 0; y < N; y++)
      for (int x = 0; x < N; x++)
        for (int r = 0; r < 3; r++)
          for (int c = 0; c < 3; c++)
            out[x, y] += image[x+r, y+c] * kernel[r, c]

  7. Decoupled Memory Customization
▸ Primitives can be applied in a user-defined sequence (same convolution algorithm as before)

    s = hcl.create_schedule()
    linebuf = s[image].reuse_at(out, out.y)   # line buffer over image along out.y

  8. Decoupled Memory Customization
▸ Primitives can be applied in a user-defined sequence (same convolution algorithm as before)

    s = hcl.create_schedule()
    linebuf = s[image].reuse_at(out, out.y)   # line buffer
    winbuf = s[linebuf].reuse_at(out, out.x)  # window buffer built on the line buffer

  9. Decoupled Memory Customization
▸ Primitives can be applied in a user-defined sequence

    s = hcl.create_schedule()
    linebuf = s[image].reuse_at(out, out.y)
    winbuf = s[linebuf].reuse_at(out, out.x)

[Figure: image pixels stream through the line buffer into the window buffer; each cycle the 3x3 window is convolved (⨂) with the kernel to produce one output pixel.]
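The behavior of the two reuse_at buffers can be mimicked in software: a line buffer holds the last three rows of each column, and a per-row window holds the last three columns, so each 3x3 window is produced with exactly one new off-chip pixel read. This is an illustrative model; the names and structure are assumptions, not HeteroCL's generated hardware.

```python
# Software model of the reuse buffers created by the two reuse_at calls:
# one fresh pixel is read per (y, x) step, and every 3x3 window ending at
# (y, x) with y >= 2, x >= 2 is assembled entirely from buffered data.
from collections import deque

def stream_windows(image):
    n = len(image)
    linebuf = [deque(maxlen=3) for _ in range(n)]   # last 3 rows, per column
    windows = []
    for y in range(n):
        window = deque(maxlen=3)                    # last 3 buffered columns
        for x in range(n):
            linebuf[x].append(image[y][x])          # the single new read
            window.append(list(linebuf[x]))         # shift newest column in
            if y >= 2 and x >= 2:
                # windows[k][r][c] == image[y-2+r][x-2+c]
                windows.append([[window[c][r] for c in range(3)]
                                for r in range(3)])
    return windows
```

Feeding these windows to a 3x3 multiply-accumulate reproduces the convolution while touching each input pixel only once, which is the point of the line/window buffers.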

  10. Decoupled Data Type Customization
▸ Bit-accurate data type support (e.g., Int(15), Fixed(7,4))
▸ Decoupled customization primitives: downsize & quantize

    r = hcl.reduce_axis(0, 3)
    c = hcl.reduce_axis(0, 3)
    out = hcl.compute((N, N),
        lambda y, x: hcl.sum(image[x+r, y+c] * kernel[r, c],
                             axis=[r, c]))

    s = hcl.create_scheme()
    s.quantize([out], Fixed(6, 4))

  11. Decoupled Data Type Customization
▸ Bit-accurate data type support (e.g., Int(15), Fixed(7,4))
▸ Decoupled customization primitives: downsize & quantize
▸ Quantization can be explored programmatically:

    for i in range(2, 8):
        s = hcl.create_scheme()
        s.quantize([out], Fixed(i, i-2))

[Plot: accuracy (%) vs. total bitwidth, showing the trade-off between accuracy and resource for a neural network.]
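What quantize does numerically can be sketched in a few lines, assuming Fixed(total, frac) means a signed value with `total` bits of which `frac` are fractional (matching the slides' Fixed(6, 4)). Truncation toward zero and saturation on overflow are assumptions of this model, not a statement about HeteroCL's exact rounding.

```python
# Sketch of fixed-point quantization as performed by a quantize primitive:
# map a real value onto a signed Fixed(total, frac) grid, truncating the
# fraction and saturating values outside the representable range.
def quantize_fixed(value, total, frac):
    scale = 1 << frac                    # grid step is 1/scale
    lo = -(1 << (total - 1))             # most negative raw code
    hi = (1 << (total - 1)) - 1          # most positive raw code
    raw = int(value * scale)             # truncate toward zero (assumption)
    raw = max(lo, min(hi, raw))          # saturate on overflow
    return raw / scale
```

Sweeping `total` as in the slide's loop shrinks the grid step and range together, which is exactly the accuracy/resource trade-off shown in the plot.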

  12. Currently Supported Customization Primitives
▸ Compute customization
▸ Data type customization
▸ Memory customization
⨁ Macros for spatial architecture templates

  13. Macro for Stencil with Dataflow Architecture
▸ A sliding window applied on a tensor
▸ For applications where data elements are updated with some fixed, local patterns
▸ Incorporates SODA [Y. Chi, et al. ICCAD'18]: scalable reuse buffers with minimum buffer size that achieve the highest throughput

    r = hcl.reduce_axis(0, 3)
    c = hcl.reduce_axis(0, 3)
    out = hcl.compute((N, N),
        lambda y, x: hcl.sum(image[x+r, y+c] * kernel[r, c],
                             axis=[r, c]))

    s = hcl.create_schedule()
    s[out].stencil()

[Figure: pipelined reuse buffers (RB) feed forwarding (FW) modules and processing elements (PE). PE: compute module implementing the kernel function. FW: forwarding module implementing a FIFO and distributing data.]

  14. Macro for Systolic Array
▸ A group of PEs locally connected to each other
▸ For applications having perfectly nested loops with uniform dependences
▸ Incorporates PolySA [J. Cong, et al. ICCAD'18]: systematic and efficient design space exploration, yielding performance comparable to manual designs within hours

Same convolution algorithm as before, with:
    s = hcl.create_schedule()
    s[out].systolic()

[Figure: a loader streams data from off-chip DDRs; feeders fill on-chip BRAMs along the edges of a 4x4 PE grid. Architecture from X. Wei, et al. DAC'17.]
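The systolic idea (PEs consuming data that arrives skewed by one cycle per hop) can be illustrated with a toy cycle-level model for matrix multiply, the classic systolic workload. The scheduling below is an illustrative output-stationary mapping, not PolySA's actual design-space result.

```python
# Toy cycle-level model of an output-stationary systolic array computing
# C = A @ B. PE (i, j) sees the operand pair (a[i][k], b[k][j]) at cycle
# t = i + j + k, i.e., data is skewed one cycle per hop through the grid,
# and accumulates its own output element locally.
def systolic_matmul(A, B):
    n = len(A)
    acc = [[0] * n for _ in range(n)]    # one accumulator per PE
    for t in range(3 * n - 2):           # total cycles until the array drains
        for i in range(n):
            for j in range(n):
                k = t - i - j            # which operand pair arrives now
                if 0 <= k < n:
                    acc[i][j] += A[i][k] * B[k][j]
    return acc
```

Because every PE only talks to its neighbors and works every cycle once the pipeline fills, the structure maps well to the uniform-dependence loop nests the slide describes.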

  15. Imperative Programming in HeteroCL
▸ HeteroCL further provides an embedded imperative DSL
  • Not all algorithms can be described with declarative code
  • A DSL is needed because plain Python is too flexible (i.e., not all of its semantics are synthesizable)
▸ Unified interface for applying hardware customization to both imperative and declarative code

    with hcl.for_(0, N) as y:
        with hcl.for_(0, N) as x:
            with hcl.for_(0, 3) as r:
                with hcl.for_(0, 3) as c:
                    out[x, y] += image[x+r, y+c] * kernel[r, c]

    s = hcl.create_schedule()
    s[out].split(out.x, M)
    linebuf = s[image].reuse_at(out, out.y)
    s.quantize([out], Fixed(6, 4))
    # ...

  16. Explore the Interdependence: Dot Product

    i = hcl.reduce_axis(0, N)
    return hcl.compute((1,),
        lambda x: hcl.sum(local_A[i] * local_B[i], axis=i))

    for W in [4, 8, 16, 32]:
        NUM_PE = BANDWIDTH / W
        xo, xi = s[psum].split(x, NUM_PE)     # compute
        s[psum].unroll(xi)
        s.quantize(local_A, hcl.Fixed(W))     # data type
        s[local_A].partition(NUM_PE)          # memory

Performance = min(compute throughput, #elements per I/O access)

[Figure: a dot-product accelerator with NUM_PE processing elements (multipliers feeding an adder tree) reading local_A and local_B, filled from off-chip memory via DMA; performance curves for W = 8, 16, 32.]
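The slide's performance model can be sketched directly: narrowing the bitwidth W lets the fixed memory bandwidth feed more elements (and thus more PEs) per cycle, until the design becomes compute-bound. The bandwidth units (bits per cycle) and the PE cap are illustrative assumptions, not numbers from the paper.

```python
# Sketch of the roofline-style model on the slide:
#   performance = min(compute throughput, elements per I/O access)
# bandwidth is in bits/cycle; max_pe caps the MACs the fabric can host.
def performance(W, bandwidth, max_pe):
    compute = max_pe          # MACs per cycle available on-chip
    io = bandwidth // W       # elements per cycle the memory can deliver
    return min(compute, io)
```

With, say, 256 bits/cycle of bandwidth and 16 PEs, quantizing from W=32 down to W=8 moves the design from memory-bound (8 MACs/cycle) to compute-bound (16 MACs/cycle), which is the interdependence between data type, memory, and compute customization the slide is demonstrating.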
