DSAGEN: Synthesizing Programmable Spatial Accelerators

Jian Weng, Sihao Liu, Vidushi Dadu, Zhengrong Wang, Preyas Shah, Tony Nowatzki
University of California, Los Angeles
May 15th, 2020


  1. DSAGEN: Synthesizing Programmable Spatial Accelerators
     Jian Weng, Sihao Liu, Vidushi Dadu, Zhengrong Wang, Preyas Shah, Tony Nowatzki
     University of California, Los Angeles
     May 15th, 2020

  2. Existing Domain-Specific Approach: Specialized Accelerators
     • Specialized architecture often occupies 1/5 to 1/3 of publications in top conferences.
     • [Figure labels: Apps, Idioms, Specialized Mechanisms, Sw/Hw Interface, Compiler, High-level Abstraction]
     • [Bar chart: share of publications (axis 0% to 40%) at ISCA'19, ASPLOS'19, MICRO'19, HPCA'20]

  3. DSAGEN: Decoupled Spatial Accelerator Generator
     • [Figure labels: Apps, Design Space Explorer, Specialized Hardware]

  4. DSAGEN: Decoupled Spatial Accelerator Generator
     • Transformations with tradeoffs on performance and hardware cost
     • [Figure labels: Apps, Compiler, Multiple Xformed IR, Candidate Hardware, Proposed Hardware, Design Space Exp.]

  5. Outline
     • Design Space: Decoupled-Spatial Architecture
       • Insight from Prior Work
       • The Programming Paradigm
       • Design Space: Hardware Primitives (& Composition)
     • Compilation
     • Design Space Exploration
     • Evaluation

  6. [Figure panels: (a) ASPLOS18-MAERI, (b) ISCA17-Plasticine, (c) ISCA17-Softbrain]
     • Decoupled-Spatial Paradigm
       • Decoupled Compute/Memory
       • Spatially exposed resources
     • Design Space
       • Composing hardware with simple primitives
       • Architecture Description Graph

  7. Background: Decoupled-Spatial Architecture
       for (int i = 0; i < n; ++i)
         c[i] += a[i] * b[i];
     [Figure labels: Ctrl Host, Scratch Memory, Memory Controller, Address Generator, Sync. Elem, Processing Elements, Switches; streams a[0:n], b[0:n], c[0:n]; operators × and +]
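The decoupling in this slide can be illustrated with a minimal Python sketch. It assumes that address generation can be modeled as a generator that yields array elements in order, and that the processing elements only see values arriving on streams. The names `linear_stream` and `multiply_accumulate` are hypothetical and are not part of DSAGEN.

```python
from typing import Iterator, List


def linear_stream(data: List[int], start: int, length: int) -> Iterator[int]:
    """Hypothetical address generator: yields data[start : start + length] in order."""
    for addr in range(start, start + length):
        yield data[addr]


def multiply_accumulate(a: List[int], b: List[int], c: List[int]) -> List[int]:
    """Compute c[i] + a[i] * b[i] for every i, with memory access decoupled from compute."""
    n = len(c)
    # Memory side: three independent streams, one per array.
    a_s, b_s, c_s = (linear_stream(x, 0, n) for x in (a, b, c))
    # Compute side: a multiply followed by an add, fed only by stream values.
    return [c_i + a_i * b_i for a_i, b_i, c_i in zip(a_s, b_s, c_s)]


result = multiply_accumulate([1, 2, 3], [4, 5, 6], [10, 20, 30])
print(result)  # [14, 30, 48]
```

The compute expression never indexes memory directly, which is the property the decoupled-spatial paradigm relies on.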

  8. Hardware Primitives: Processing Element & Switch
     Design points, by scheduling (static or dynamic) and hardware sharing (dedicated = 1 instruction, shared > 1 instruction); cost ranges from low to high:
     • Statically scheduled, dedicated: "Systolic", 1x area (e.g. Softbrain)
       + No contention
       - Harder to map
     • Statically scheduled, shared: "CGRA", 2.6x area (e.g. conventional CGRA)
       + Better resource utilization
       - Higher power
       - Harder to map
     • Dynamically scheduled, dedicated: "Ordered Dataflow", 2.1x area (e.g. SPU)
       + Better flexibility
     • Dynamically scheduled, shared: "Tagged Dataflow", 5.8x area (e.g. Triggered Instruction)
       + Better flexibility
       + Better resource utilization
     [Figure labels: MUX, Instruction Buffer, Function Unit, Register File, Instruction Scheduler]

  9. Hardware Primitives: Memory
     • Memory
       • Size
       • Bandwidth
       • Indirect Support
         • a[b[i]]
       • Atomic Update
         • a[b[i]] += 1
     [Figure labels: Ind. Address Generator, XBAR, memory banks, FU, Reorder Buffer]
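As a software-level illustration of the two access patterns named above, the following sketch shows what an indirect read `a[b[i]]` and an atomic update `a[b[i]] += 1` compute. It models only the semantics, not the banking, crossbar, or reorder buffer; the function names are hypothetical.

```python
def indirect_read(a, b):
    """Gather: returns [a[b[0]], a[b[1]], ...]."""
    return [a[idx] for idx in b]


def atomic_update(a, b, delta=1):
    """Applies a[b[i]] += delta for every i; repeated indices accumulate."""
    out = list(a)  # copy so the caller's array is left untouched
    for idx in b:
        out[idx] += delta
    return out


print(indirect_read([10, 20, 30, 40], [3, 0, 0]))  # [40, 10, 10]
print(atomic_update([0, 0, 0, 0], [1, 3, 1, 1]))   # [0, 3, 0, 1]
```

The second call is a histogram: index 1 appears three times, so its bin ends at 3.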

  10. Examples of ADG
      [Figure panels: (a) Softbrain, (b) MAERI, (c) Diannao, (d) Data Path of Complex Mul.]

  11. Outline
      • Decoupled-Spatial Architecture
      • Compilation
        • High-Level Abstraction
        • Hardware-Aware Modular Compilation
      • Design Space Exploration
      • Evaluation

  12. Compiling High-Level Lang. to Decoupled Spatial
      [Figure labels: Apps, Pragma Annotation, ?, Executable Binary]
      How to abstract diverse underlying features with a unified high-level interface?
      • Programmer Hints
        • Which code regions are offloaded onto the spatial accelerator.
        • Which memory accesses can be decoupled into intrinsics.
        • Which offloaded regions should be concurrent.

  13. An example of pragma annotation
        // The offloaded regions in this compound body are concurrent
        #pragma config
        {
          // The memory accesses below will be restricted
          #pragma stream
          for (i=0; i<n; ++i)
            // The computational instructions below will be offloaded
            #pragma offload
            for (j=0; j<n; ++j)
              a[i*n+j] += b[c[j]] * d[i*n+j];
        }
      [Figure labels: streams c[0:n], b[], d[0:n], a[0:n]; operators × and +; output a[0:n]]

  14. Compiling High-Level Lang. to Decoupled Spatial
      [Figure labels: Apps, Pragma Annotation, Modular XFORM, Compute Graph, Encoded Mem. Stream, ?, Executable Binary, Architecture Description Graph (ADG)]
      How to hide the diversity of underlying hardware?
      • Modular Transformation
        • Specialized hardware features often dictate the code transformation
        • A fallback is required when the hardware feature is not available

  15. Modular Transformation
      Inspect the hardware features to generate the corresponding version of the indirect memory access.
        #pragma config
        {
          #pragma stream
          for (i=0; i<n; ++i)
            #pragma offload
            for (j=0; j<n; ++j)
              a[i*n+j] += b[c[j]] * d[i*n+j];
        }

        // With indirect support
        Read c[0:n], stream0
        Indirect b, stream0, stream1

        // Without indirect support
        for (j=0; j<n; ++j)
          Scalar b[c[j]], stream0
      [Figure labels: streams c[0:n], b[], d[0:n], a[0:n]; operators × and +; output a[0:n]]
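The feature check on this slide can be sketched as a single lowering function with a fallback. This is a minimal illustration, assuming the ADG's capabilities are available as a set of strings; `lower_indirect_access` and the `"indirect"` feature name are hypothetical, and the emitted text simply mirrors the pseudo-instructions shown on the slide.

```python
def lower_indirect_access(adg_features, array="b", index="c", n="n"):
    """Return stream pseudo-instructions for reading array[index[j]], j in [0, n)."""
    if "indirect" in adg_features:
        # Hardware can chain streams: read the indices, then gather through them.
        return [
            f"Read {index}[0:{n}], stream0",
            f"Indirect {array}, stream0, stream1",
        ]
    # Fallback: the host issues one scalar access per element.
    return [
        f"for (j=0; j<{n}; ++j)",
        f"  Scalar {array}[{index}[j]], stream0",
    ]


with_support = lower_indirect_access({"arithmetic", "indirect"})
without_support = lower_indirect_access({"arithmetic"})
print("\n".join(with_support))
print("\n".join(without_support))
```

Keeping both versions behind one entry point is what lets the rest of the compiler stay unaware of which hardware feature was present.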

  16. Compiling High-Level Lang. to Decoupled Spatial
      [Figure labels: Apps, Pragma Annotation, Modular XFORM, Compute Graph, Encoded Mem. Stream, Executable Binary]
      How is the dependence graph of computational instructions mapped?

  17. Spatial Mapping
      How is the dependence graph of computational instructions mapped?
      1. Placement: Map each instruction to a PE with the corresponding capability.
      2. Routing: Route the dependence edges through the spatial network.
      3. Timing: If necessary, balance the timing of data arrival.
      • If any of steps 1-3 is not successful, revert some nodes and repeat steps 1-3.
      [Figure: a dependence graph (Sync, ×, +, Sync) mapped onto the spatial fabric]
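The three steps above can be sketched on a toy mesh. This is a simplified illustration under several assumptions that are not from the slides: PEs sit on a rows × cols grid with one directed link per neighbor pair, each link carries at most one edge, every operation takes one cycle, placement is tried exhaustively, and a failed routing discards the whole placement rather than reverting only some nodes. All names are hypothetical.

```python
import itertools
from collections import deque


def neighbors(pe, rows, cols):
    r, c = pe
    for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            yield (nr, nc)


def route(src, dst, rows, cols, used_links):
    """Step 2: breadth-first search for a shortest path that avoids used links."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        if cur == dst:
            path = []
            while parent[cur] is not None:
                path.append((parent[cur], cur))
                cur = parent[cur]
            return path[::-1]
        for nxt in neighbors(cur, rows, cols):
            if nxt not in parent and (cur, nxt) not in used_links:
                parent[nxt] = cur
                queue.append(nxt)
    return None


def map_graph(ops, edges, caps, rows, cols):
    """ops: {node: opcode}, listed in topological order; caps: {pe: set of opcodes}."""
    nodes = list(ops)
    for pes in itertools.permutations(caps, len(nodes)):  # Step 1: placement
        place = dict(zip(nodes, pes))
        if any(ops[n] not in caps[place[n]] for n in nodes):
            continue
        used, hops, ok = set(), {}, True
        for s, d in edges:
            path = route(place[s], place[d], rows, cols, used)
            if path is None:
                ok = False
                break
            used.update(path)
            hops[(s, d)] = len(path)
        if not ok:
            continue  # discard this placement and try the next one
        # Step 3: timing. Pad early-arriving operands so all inputs align.
        ready, delays = {}, {}
        for n in nodes:
            incoming = [(s, ready[s] + 1 + hops[(s, n)]) for s, d in edges if d == n]
            ready[n] = max((t for _, t in incoming), default=0)
            for s, t in incoming:
                delays[(s, n)] = ready[n] - t
        return place, delays
    return None


caps = {(0, 0): {"mul"}, (0, 1): {"mul"}, (1, 0): {"add"}, (1, 1): {"add"}}
ops = {"m1": "mul", "m2": "mul", "s": "add"}
place, delays = map_graph(ops, [("m1", "s"), ("m2", "s")], caps, 2, 2)
print(place, delays)
```

In this example one multiplier is adjacent to the adder and the other is two hops away, so the nearer operand needs one cycle of delay to arrive in step with the farther one.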

  18. Outline
      • Decoupled-Spatial Architecture
      • Compilation
      • Design Space Exploration
        • Drive the Search
        • Evaluating Design Points
        • Repairing the Mapping
      • Evaluation

  19. Design Space Exploration
      • Evaluate the sw/hw pair
      • Create a new ADG based on the current one
      [Figure labels: Multiple Xformed IR, Map, Remap, Architecture Description Graph (ADG), Design Space Exp.]
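The evaluate-and-mutate loop on this slide can be sketched as a simple accept-if-better search. This is only an illustration: the slides do not say which search strategy DSAGEN uses, so the greedy acceptance rule, the `explore` function, and the toy `mutate`/`evaluate` stand-ins below are all assumptions.

```python
import random


def explore(initial_adg, mutate, evaluate, steps, seed=0):
    """Greedy design space exploration: keep a candidate only if it scores higher."""
    rng = random.Random(seed)
    best, best_score = initial_adg, evaluate(initial_adg)
    for _ in range(steps):
        candidate = mutate(best, rng)   # create a new ADG based on the current one
        score = evaluate(candidate)     # map (or remap) and score the sw/hw pair
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Toy stand-ins (hypothetical): an "ADG" is just a PE count.
def toy_mutate(adg, rng):
    return {"pes": max(1, adg["pes"] + rng.choice((-1, 1)))}


def toy_evaluate(adg):
    perf = min(adg["pes"], 8)   # the toy workload can use at most 8 PEs
    area = adg["pes"]           # area grows linearly with the PE count
    return perf ** 2 / area


best, score = explore({"pes": 25}, toy_mutate, toy_evaluate, steps=500)
print(best, score)
```

In the toy setup the search shrinks the design until removing another PE would cost performance, which is the kind of pruning a cost-aware objective encourages.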

  20. • Performance
        • Spatial architecture essentially enables hardware-specialized software pipelining
        • The ratio of data availability determines the performance
        • Perf = #Inst * (Activity Ratio)
        • The model has a mean performance error of 7% and a maximum error of 30%.
      • Power/Area Estimation Model
        • Synthesis can be time consuming
        • A regression model can predict the trend of hardware cost
      [Chart: Model Validation. Area and power, model vs. synthesis, for Sparse CNN, MachSuite, and Dense NN]
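The performance formula above can be written out as a short sketch. The formula `Perf = #Inst * (Activity Ratio)` is from the slide; deriving the activity ratio as supplied bandwidth over demanded bandwidth, capped at 1, is an assumption made here purely for illustration, and both function names are hypothetical.

```python
def activity_ratio(supplied_bw, demanded_bw):
    """Assumed model: fraction of cycles in which operands are available."""
    return min(1.0, supplied_bw / demanded_bw)


def estimated_perf(num_insts, ratio):
    """Perf = #Inst * (Activity Ratio), as stated on the slide."""
    return num_insts * ratio


# 8 mapped instructions, memory supplies half of the bandwidth they demand.
print(estimated_perf(8, activity_ratio(32, 64)))  # 4.0
```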

  21. Repairing the Spatial Mapping
        // Original Code
        for (i=0; i<n; ++i)
          c[i] += a[i] * b[i];
      [Figure: two mappings of the loop, each fed by streams a[0:n], b[0:n], c[0:n] through Sync elements into × and + units that write c[0:n]: "No Unrolling" and "Unroll by 2"]
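The two variants in the figure compute the same result; they differ in how many multiply and add units they occupy per iteration. A minimal sketch of that equivalence, with hypothetical function names and a remainder step added for odd `n`:

```python
def mac(a, b, c):
    """No unrolling: one multiply and one add per iteration."""
    out = list(c)
    for i in range(len(out)):
        out[i] += a[i] * b[i]
    return out


def mac_unrolled_by_2(a, b, c):
    """Unroll by 2: two independent multiply-add lanes per iteration."""
    out = list(c)
    n = len(out)
    for i in range(0, n - 1, 2):
        out[i] += a[i] * b[i]
        out[i + 1] += a[i + 1] * b[i + 1]
    if n % 2:  # leftover element when n is odd
        out[n - 1] += a[n - 1] * b[n - 1]
    return out


a, b, c = [1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [0, 1, 0, 1, 0]
print(mac(a, b, c))                # [6, 15, 24, 37, 50]
print(mac_unrolled_by_2(a, b, c))  # [6, 15, 24, 37, 50]
```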

  22. Hardware/Software Interface Generation
      • How to configure an accelerator with arbitrary topology?
        • Reuse the data path for configuration
        • Find path(s) that cover(s) all the components
      • A heuristic algorithm minimizes the longest configuration path
        • For a graph with m nodes covered by n paths, the longest path cannot be shorter than ⌈m/n⌉.
        • We introduce only 40% overhead over the ideal bound.
      [Figure: configuration paths through Sync, ×, and + nodes]
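The bound on this slide follows from a counting argument: if n paths together cover m nodes, at least one path must contain ⌈m/n⌉ of them. The sketch below computes that bound and the overhead of a given path cover relative to it. It does not implement the heuristic itself, which the slides do not describe; the function names are hypothetical.

```python
import math


def ideal_longest_path(m, n):
    """Lower bound on the longest of n paths that together cover m nodes."""
    return math.ceil(m / n)


def longest_config_path(paths, nodes):
    """Length (in nodes) of the longest path, after checking that all nodes are covered."""
    covered = set().union(*paths)
    if covered != set(nodes):
        raise ValueError("paths do not cover every component")
    return max(len(p) for p in paths)


def overhead(paths, nodes):
    """Relative overhead of a path cover over the ideal bound."""
    bound = ideal_longest_path(len(set(nodes)), len(paths))
    return longest_config_path(paths, nodes) / bound - 1.0


nodes = list(range(10))
paths = [[0, 1, 2, 3, 4, 5], [6, 7], [8, 9]]
print(ideal_longest_path(10, 3))  # 4
print(overhead(paths, nodes))     # 0.5
```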

  23. Outline
      • Decoupled-Spatial Architecture
      • Compilation
      • Design Space Exploration
      • Evaluation
        • Methodology
        • Compiler
        • Design Space Exploration

  24. Methodology
      • Performance
        • gem5 RISC-V in-order core integrated with a cycle-accurate spatial accelerator simulator
        • The in-order core is extended with a stream-decoupled ISA
      • Power/Area
        • All the components are implemented in Chisel RTL
        • Synthesized with Synopsys DC at 28nm, 1.00GHz
        • SRAM power/area are estimated by CACTI 7.0

  25. Compiler Performance
      • Softbrain: MachSuite
        • Versatile accelerator that can handle moderate irregularity
      • SPU: Histogram and Key Join
        • Accelerator specialized for irregular workloads
      • REVEL and Trigger: DSP
        • Accelerators specialized for imperfect loop bodies
      • MAERI: PolyBench
        • Accelerator for neural networks

  26. Compiler Performance
      [Bar chart: compiled vs. manual (axis ticks 0 to 30) for MachSuite (Softbrain), Irregular (SPU), DSP (REVEL), DSP (Trigger), PolyBench (MAERI)]

  27. Design Space Explorer
      • Workloads
        • Dense Neural Network
        • MachSuite
        • Sparse Convolutional Neural Network
      • Initial Design
        • A 5x5 mesh with all capabilities (arithmetic, control, and indirect)
      • Objective: perf²/mm²
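The starting point and objective above can be sketched as follows. The representation is an assumption: the slides do not say how the mesh is encoded, so `mesh_adg` simply builds a grid of nodes that each carry the full capability set, with one undirected link per adjacent pair. Only the objective `perf²/mm²` is taken from the slide.

```python
def mesh_adg(rows, cols, capabilities):
    """Hypothetical ADG: grid nodes with capabilities, plus nearest-neighbor links."""
    nodes = {(r, c): set(capabilities) for r in range(rows) for c in range(cols)}
    links = set()
    for r, c in nodes:
        if r + 1 < rows:
            links.add(((r, c), (r + 1, c)))
        if c + 1 < cols:
            links.add(((r, c), (r, c + 1)))
    return nodes, links


def objective(perf, area_mm2):
    """perf^2 / mm^2: rewards performance quadratically and penalizes area linearly."""
    return perf ** 2 / area_mm2


nodes, links = mesh_adg(5, 5, ("arithmetic", "control", "indirect"))
print(len(nodes), len(links))  # 25 40
print(objective(4.0, 2.0))     # 8.0
```

Squaring performance means that halving area is worth less than doubling performance, so the search is biased toward keeping hardware that the workloads actually use.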

  28. Design Space Explorer
      • Sparse CNN: 24h
      • MachSuite: 19.2h
      • Dense NN: 16h
      [Charts: Area Breakdown and Power Breakdown, split into fu, nw, sync, and mem]

  29. Conclusion
                    HLS                           Manual                        DSAGEN
      Frontend      C+Pragma                      DSL/Intrinsics, etc.          C+Pragma
      Design Flow   Nearly Automated              Manual                        Nearly Automated
      Input         A Single Application          Multiple Target Applications  Multiple Target Applications
      Output        Application-Specific Accel.   ASIC/Programmable Accel.      A Programmable Accelerator
      Design Space  Limited                       Rich                          Rich

  30. Q&A
      • Our framework is a work in progress at: https://github.com/PolyArch/dsa-framework
      • All questions and comments are welcome
