Spatial: A Language and Compiler for Application Accelerators
David Koeplinger, Matthew Feldman, Raghu Prabhakar, Yaqi Zhang, Stefan Hadjis, Ruben Fiszel, Tian Zhao, Luigi Nardi, Ardavan Pedram, Christos Kozyrakis, Kunle Olukotun
PLDI, June 21, 2018
Instructions Add Overheads

On an instruction-based architecture, even a simple kernel like vectorA · vectorB compiles to a loop of x86 instructions:

      mov  r8, rcx
      mov  r9, rdx
      mov  rcx, rax
      mov  rax, 0
  .calc:
      mov  rbx, [r9]
      imul rbx, [r8]
      add  rax, rbx
      add  r8, 8
      add  r9, 8
      loop .calc

Each floating-point instruction costs roughly 70 pJ, and nearly all of it is control overhead: instruction queue, register file, and I-cache accesses (25 pJ, 38 pJ, and 6 pJ in the figure). The useful work, a 32-bit ADD, costs only ~0.5 pJ.

[Figure: CPU block diagram (instruction queue, register file, L1 instruction and data caches, L2 cache, DRAM) annotated with these energy costs.]
Source: Mark Horowitz, "Computing's Energy Problem (and what we can do about it)", ISSCC 2014
A Dark Tale: The CPU Power Wall
A More Efficient Way

Configuration-based execution avoids these overheads: instead of fetching and decoding instructions, configure a custom hardware datapath for vectorA · vectorB. A counter drives reads of the two vectors into a multiplier feeding an accumulator, so the control and arithmetic energy goes toward the computation itself.

[Figure: instruction-based CPU pipeline vs. configuration-based custom hardware datapath for the dot product (*not to scale).]
The Future Is (Probably) Reconfigurable

[Figure: energy efficiency (MOPS/mW, log scale from 0.1 to 10,000) vs. programmability (not programmable → more programmable). Dedicated ASICs are the most efficient but not programmable; reconfigurable CGRAs and FPGAs sit in the middle; instruction-based GPUs and CPUs are the most programmable but least efficient. Callouts: Brainwave (ISCA '18) at 287 MOps/mW; XPU (HotChips '17) at 25x perf/W vs. a CPU.]
Key Question

How can we more productively target reconfigurable architectures like FPGAs?

- Productivity: fast and efficient programmers
- Performance: fast and efficient designs
- Portability: target-generic solutions
Language Taxonomy

[Figure: languages arranged on two axes, domain specificity (general-purpose → domain-specific) and abstraction level. Higher-level languages say "what?"; lower-level languages say "how?". On the instruction-based (CPU) side: x86 at the bottom, general-purpose languages above it, domain-specific languages like Halide at the top. On the reconfigurable (FPGA) side: netlists at the bottom, with Verilog, VHDL, and MyHDL above.]
Abstracting Hardware Design

[Figure: the same taxonomy, reconfigurable (FPGA) side only: netlists at the lowest level, HDLs above them, and C + hardware pragmas above HDLs.]
HDLs

Hardware Description Languages (HDLs), e.g. Verilog, VHDL, Chisel, Bluespec.

- Performance: ✓ arbitrary RTL
- Productivity: ✘ no high-level abstractions
- Portability: ✘ significant target-specific code
C + Pragmas

Existing high-level synthesis (C + pragmas), e.g. Vivado HLS, SDAccel, Altera OpenCL.

- Productivity: ✓ nested loops; ✘ ad-hoc mix of software and hardware
- Performance: ✘ no memory hierarchy; ✘ no arbitrary pipelining; ✘ difficult to optimize
- Portability: ✓ portable, but only within a single vendor
Criteria for Improved HLS

- Represent memory hierarchy explicitly: aids on-chip memory optimization and specialization
- Express control as nested loops: enables analysis of access patterns
- Support arbitrarily nested pipelining: exploits nested parallelism
- Specialize memory transfers: enables customized memory controllers based on access patterns
- Capture design parameters: enables automatic design tuning in the compiler
Design Space Parameters Example

Baseline design for vectorA · vectorB: the vectors live in DRAM; the FPGA holds one tile of each (tileA, tileB) in SRAM and a single multiply-accumulate datapath driven by a counter. Small and simple, but slow!
Important Parameters: Buffer Sizes

Increasing the tile (buffer) size:
- Increases the length of DRAM accesses (runtime)
- Increases exploited locality (runtime)
- Increases local memory sizes (area)
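In the Spatial code shown later in the talk, this tile size is a single declared parameter with a tuning range. A minimal sketch, assuming vectorA is the DRAM vector from the running example and the code sits inside an Accel block:

  val B = 64 (64 → 1024)          // tile size, with a range the compiler may tune over
  Foreach(N by B){ i =>
    val tileA = SRAM[Float](B)    // larger B: longer DRAM bursts, more locality, more SRAM
    tileA load vectorA(i :: i+B)  // each transfer moves one B-element tile
  }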
Important Parameters: Pipelining

Metapipelining overlaps the stages of an outer loop (Stage 1: load tileA/tileB; Stage 2: compute), but requires buffering: while Stage 2 computes on tileA(0)/tileB(0), Stage 1 loads tileA(1)/tileB(1) into double buffers.
- Overlaps memory and compute (runtime)
- Increases local memory sizes (area)
- Adds synchronization logic (area)
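In Spatial this structure falls out of loop nesting: an outer loop containing multiple inner controllers becomes a metapipeline, and the compiler inserts the double buffers itself. A minimal sketch, reusing vectorA from the running example plus a hypothetical output DRAM vectorOut and tile tileOut:

  Foreach(N by B){ i =>                  // outer controller: a metapipeline
    val tileA   = SRAM[Float](B)
    val tileOut = SRAM[Float](B)
    tileA load vectorA(i :: i+B)         // Stage 1: load tile i from DRAM
    Foreach(B by 1){ j =>                // Stage 2: compute on tile i
      tileOut(j) = tileA(j) * 2
    }
    vectorOut(i :: i+B) store tileOut    // Stage 3: store tile i back to DRAM
  }
  // While Stage 2 computes on tile i, Stage 1 is already loading tile i+1;
  // the compiler double-buffers tileA and tileOut to make that overlap legal.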
Important Parameters: Parallelization

Parallelizing the inner loop duplicates the multiply datapath (e.g. three multipliers, each with its own counter, feeding an adder tree into the accumulator):
- Improves element throughput (runtime)
- Duplicates compute resources (area)
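In Spatial this is one annotation on the loop counter (the par syntax appears on the slides ahead). A sketch, reusing acc, tileA, and tileB from the dot-product example later in the deck:

  Reduce(acc)(B by 1 par 3){ j =>   // 3 lanes: three multipliers combined by an adder tree
    tileA(j) * tileB(j)
  }{ (a,b) => a + b }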
Important Parameters: Memory Banking

Parallelization requires banking: reading three elements of tileA and tileB per cycle means each SRAM must be banked so the parallel lanes hit different banks.
- Improves memory bandwidth (runtime)
- May duplicate memory resources (area)
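The programmer never declares the banks; the compiler derives them from the parallel access pattern. A sketch with a hypothetical 64-element buffer read 16 elements per cycle:

  val sum = Reg[Float]
  val buffer = SRAM[Float](64)
  Reduce(sum)(64 par 16){ i =>
    buffer(i)                 // 16 concurrent reads: the compiler banks buffer 16 ways
  }{ (a,b) => a + b }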
Criteria for Improved HLS (revisited)

The design parameters above map back onto the same five requirements: an explicit memory hierarchy, control expressed as nested loops, arbitrarily nested pipelining, specialized memory transfers, and captured design parameters.
Rethinking HLS

An improved HLS (vs. HDLs and C + pragmas):

- Productivity: ✓ memory hierarchy; ✓ nested loops; ✓ implicit design parameters (unrolling, banking, etc.)
- Performance: ✓ arbitrary pipelining; ✓ automatic memory banking/buffering; ✓ automated design tuning
- Portability: ✓ target-generic source across reconfigurable architectures
Abstracting Hardware Design

[Figure: the taxonomy again, now with Spatial placed above both HDLs and C + pragmas: a higher-level ("what?") language for reconfigurable architectures.]
Spatial: Memory Hierarchy

  // Off-chip DDR DRAM (GBs)
  val image = DRAM[UInt8](H,W)
  buffer load image(i, j::j+C)      // dense transfer
  buffer gather image(a)            // sparse transfer

  // On-chip SRAM (MBs)
  val buffer = SRAM[UInt8](C)
  val fifo   = FIFO[Float](D)
  val lbuf   = LineBuffer[Int](R,C)

  // Local registers (KBs)
  val accum  = Reg[Double]
  val pixels = RegFile[UInt8](R,C)
Spatial: Control And Design Parameters

  // Implicit/explicit parallelization factors
  // (optional, but can be explicitly declared)
  val P = 16 (1 → 32)
  Reduce(0)(N by 1 par P){ i => data(i) }{ (a,b) => a + b }

  // Implicit/explicit control schemes
  // (also optional, but can be used to override the compiler)
  Stream.Foreach(0 until N){ i => … }

  // Explicit size parameters for loop step and buffer sizes
  // (informs the compiler that it can tune this value)
  val B = 64 (64 → 1024)
  val buffer = SRAM[Float](B)
  Foreach(N by B){ i => … }

  // Implicit memory banking and buffering schemes for parallelized accesses
  Foreach(64 par 16){ i =>
    buffer(i)  // parallel read
  }
Dot Product in Spatial

  val output  = ArgOut[Float]
  // Off-chip memory declarations
  val vectorA = DRAM[Float](N)
  val vectorB = DRAM[Float](N)

  Accel {
    // Tiled reduction (outer): explicit work division in the IR
    Reduce(output)(N by B){ i =>
      // On-chip memory declarations
      val tileA = SRAM[Float](B)
      val tileB = SRAM[Float](B)
      val acc   = Reg[Float]
      // DRAM -> SRAM transfers (Spatial also has store, scatter, and gather)
      tileA load vectorA(i :: i+B)
      tileB load vectorB(i :: i+B)
      // Inner reduction over the tile
      Reduce(acc)(B by 1){ j =>
        tileA(j) * tileB(j)
      }{ (a,b) => a + b }
    }{ (a,b) => a + b }
  }

The outer Reduce forms a metapipeline: the compiler double-buffers tileA and tileB (tileA(0)/tileA(1), tileB(0)/tileB(1)) so the loads for tile i+1 overlap the inner reduction over tile i.
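For context, a hedged sketch of the host code that would wrap this Accel block. The scaffolding here (the @spatial annotation, SpatialApp, setMem, getArg, and the r"..." interpolator) follows Spatial's public tutorials as best we recall and is not from the slides; treat it as illustrative rather than definitive:

  import spatial.dsl._

  @spatial object DotProduct extends SpatialApp {
    def main(args: Array[String]): Unit = {
      val N = 1024
      val B = 64
      val a = Array.tabulate(N){ i => i.to[Float] }        // host input data
      val b = Array.tabulate(N){ i => (2*i).to[Float] }

      val output  = ArgOut[Float]
      val vectorA = DRAM[Float](N)
      val vectorB = DRAM[Float](N)
      setMem(vectorA, a)            // copy host arrays into accelerator DRAM
      setMem(vectorB, b)

      Accel {
        // dot-product kernel from the slide above
      }

      println(r"dot = ${getArg(output)}")  // read the ArgOut back on the host
    }
  }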