DSAGEN: Synthesizing Programmable Spatial Accelerators
Jian Weng, Sihao Liu, Vidushi Dadu, Zhengrong Wang, Preyas Shah, Tony Nowatzki
University of California, Los Angeles
May 15th, 2020
Existing Domain-Specific Approach: Specialized Accelerators
• Specialized architectures account for roughly 1/5 to 1/3 of publications at top conferences (ISCA'19, ASPLOS'19, MICRO'19, HPCA'20).
• Building one requires the full vertical stack: apps, idioms, specialized mechanisms, SW/HW interface, compiler, and a high-level abstraction.
[Chart: share of specialized-architecture papers at each conference.]
DSAGEN: Decoupled Spatial Accelerator Generator
• Applications drive a design space explorer, which produces the specialized hardware.
DSAGEN: Decoupled Spatial Accelerator Generator
• The compiler lowers applications into multiple transformed IRs, applying transformations with tradeoffs between performance and hardware cost.
• The design space explorer evaluates candidate hardware against the transformed IRs and proposes the final hardware.
Outline
• Design Space: Decoupled-Spatial Architecture
  • Insight from Prior Work
  • The Programming Paradigm
  • Design Space: Hardware Primitives (& Composition)
• Compilation
• Design Space Exploration
• Evaluation
Insight from Prior Work
• Decoupled-Spatial Paradigm
  • Decoupled compute/memory
  • Spatially exposed resources
• Design Space
  • Compose hardware from simple primitives
  • Architecture Description Graph
[Figure: block diagrams of (a) MAERI (ASPLOS'18), (b) Plasticine (ISCA'17), and (c) Softbrain (ISCA'17), each built from memories, memory controllers, functional units, stream dispatchers, switches, control, and sync elements.]
Background: Decoupled-Spatial Architecture
• Example: for (int i = 0; i < n; ++i) c[i] += a[i] * b[i];
• The control host issues stream commands; address generators and the memory controller read the streams a[0:n], b[0:n], and c[0:n] from memory/scratchpad and write c[0:n] back.
• Processing elements (×, +) compute, switches route values through the spatial network, and sync elements buffer operands until all inputs have arrived (sketched below).
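As a rough illustration only (not DSAGEN's actual ISA), the loop above can be viewed as three read streams and one write stream feeding a small multiply-add dataflow graph. The sketch below models this in plain C++, with queues standing in for streams and a fire-when-ready rule standing in for the sync elements; the names Stream and the queue-based structure are invented for this example.

```cpp
#include <cstdio>
#include <deque>
#include <vector>

// A "stream" is just a FIFO of values, decoupled from the compute fabric.
using Stream = std::deque<int>;

int main() {
    const int n = 8;
    std::vector<int> a(n, 2), b(n, 3), c(n, 1);

    // Address generators: turn a[0:n], b[0:n], c[0:n] into streams.
    Stream sa(a.begin(), a.end());
    Stream sb(b.begin(), b.end());
    Stream sc(c.begin(), c.end());
    Stream out;  // write-back stream for c[0:n]

    // Spatial fabric: a multiply PE feeding an add PE, firing only when
    // all operands are available (the job of the sync elements).
    while (!sa.empty() && !sb.empty() && !sc.empty()) {
        int prod = sa.front() * sb.front();   // multiply PE
        int sum  = prod + sc.front();         // add PE
        sa.pop_front(); sb.pop_front(); sc.pop_front();
        out.push_back(sum);                   // route result toward memory
    }

    // Memory controller drains the output stream back into c[0:n].
    for (int i = 0; i < n; ++i) c[i] = out[i];
    for (int i = 0; i < n; ++i) printf("%d ", c[i]);  // prints 7 eight times
    printf("\n");
    return 0;
}
```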
Hardware Primitives: Processing Element & Switch
• A PE is composed of a function unit plus optional register file, instruction buffer, instruction scheduler, and muxes; switches route values between PEs.
• Two design axes: dedicated (one instruction per PE) vs. shared (>1), and statically vs. dynamically scheduled. Cost ranges from low to high:
  • Statically scheduled, dedicated ("systolic", e.g. Softbrain): 1x area; no contention, but harder to map.
  • Statically scheduled, shared (conventional CGRA): 2.6x area; better resource utilization, but higher power and harder to map.
  • Dynamically scheduled, dedicated ("ordered dataflow"): 2.1x area; better flexibility.
  • Dynamically scheduled, shared ("tagged dataflow", e.g. Triggered Instructions, SPU): 5.8x area; better flexibility and resource utilization, at the highest cost.
Hardware Primitives: Memory
• Memory parameters: size and bandwidth.
• Indirect access support, e.g. a[b[i]], via an indirect address generator and a crossbar.
• Atomic update support, e.g. a[b[i]] += 1, via near-bank FUs and a reorder buffer.
[Figure: banked memory with indirect address generator, crossbar, per-bank FUs, and a reorder buffer.]
Examples of ADG
• The Architecture Description Graph can express existing accelerators: (a) Softbrain, (b) MAERI, (c) DianNao, and (d) the data path of a complex multiplication.
[Figure: each design drawn as a graph of memories, controllers, stream dispatchers, switches, and functional units (×, +).]
Outline
• Decoupled-Spatial Architecture
• Compilation
  • High-Level Abstraction
  • Hardware-Aware Modular Compilation
• Design Space Exploration
• Evaluation
Compiling a High-Level Language to Decoupled-Spatial
• Flow: Apps → Pragma Annotation → ? → Executable Binary
• How do we abstract diverse underlying features behind a unified high-level interface?
• Programmer hints:
  • Which code regions are offloaded onto the spatial accelerator.
  • Which memory accesses can be decoupled into stream intrinsics.
  • Which offloaded regions should be concurrent.
An Example of Pragma Annotation

#pragma config                 ← the offloaded regions in this compound body are concurrent
{
  #pragma stream               ← the memory accesses below will be decoupled into streams
  for (i = 0; i < n; ++i)
    #pragma offload            ← the computational instructions below will be offloaded
    for (j = 0; j < n; ++j)
      a[i*n+j] += b[c[j]] * d[i*n+j];
}

[Figure: the resulting compute graph, with streams a[0:n], c[0:n], d[0:n] and indirect reads b[...] feeding × and + nodes.]
Compiling a High-Level Language to Decoupled-Spatial
• Flow: Apps → Pragma Annotation → Modular XFORM → Compute Graph + Encoded Memory Streams → Executable Binary, guided by the Architecture Description Graph (ADG).
• How do we hide the diversity of the underlying hardware?
• Modular transformation:
  • Specialized hardware features often dictate the code transformation.
  • A fallback is required when a hardware feature is not available.
Modular Transformation
• Inspect the hardware features to generate the corresponding version of the indirect memory access b[c[j]] from the example above:

  // With indirect support
  Read c[0:n], stream0
  Indirect b, stream0, stream1

  // Without indirect support
  for (j = 0; j < n; ++j)
    Scalar b[c[j]], stream0
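A minimal sketch of this kind of feature-driven lowering, assuming a hypothetical ADG query (has_indirect_stream) and string-level stream command emission; DSAGEN's real compiler works on LLVM IR against the ADG, not on strings.

```cpp
#include <cstdio>
#include <string>

// Hypothetical view of the target's ADG capabilities.
struct AdgFeatures {
    bool has_indirect_stream;  // indirect address generator present in hardware?
};

// Lower the access pattern b[c[j]], j in [0, n), to stream commands.
// Without indirect streams, fall back to scalar loads issued one by one.
std::string lower_indirect_read(const AdgFeatures& adg, int n) {
    if (adg.has_indirect_stream) {
        return "Read   c[0:" + std::to_string(n) + "] -> stream0\n"
               "Indirect b, stream0 -> stream1\n";
    }
    return "for (j = 0; j < " + std::to_string(n) + "; ++j)\n"
           "  Scalar b[c[j]] -> stream0\n";
}

int main() {
    AdgFeatures with_ind{true}, without_ind{false};
    printf("-- indirect hardware --\n%s", lower_indirect_read(with_ind, 64).c_str());
    printf("-- fallback --\n%s", lower_indirect_read(without_ind, 64).c_str());
    return 0;
}
```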
Compiling a High-Level Language to Decoupled-Spatial
• Flow: Apps → Pragma Annotation → Modular XFORM → Compute Graph + Encoded Memory Streams → Executable Binary.
• How is the dependence graph of computational instructions mapped onto the spatial fabric?
Spatial Mapping
• How is the dependence graph of computational instructions mapped? See the sketch below.
  1. Placement: map each instruction to a PE with the corresponding capability.
  2. Routing: route the dependence edges through the spatial network.
  3. Timing: if necessary, balance the timing of data arrival.
• If any of steps 1-3 fails, revert some nodes and repeat.
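A minimal sketch of the revert-and-retry loop, assuming hypothetical place/route/retime helpers that return the nodes they failed on; DSAGEN's actual scheduler performs a stochastic search over the ADG, which this skeleton only gestures at.

```cpp
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical placeholders for the three mapping phases. Each returns the
// set of compute-graph nodes it could not handle (empty means success).
std::vector<int> place(const std::vector<int>& nodes)  { return {}; }
std::vector<int> route(const std::vector<int>& nodes)  { return {}; }
std::vector<int> retime(const std::vector<int>& nodes) { return {}; }

// Iterative mapping: try placement, routing, and timing; on failure, rip up
// the offending nodes plus a random extra node and try again.
bool map_compute_graph(std::vector<int> nodes, int max_iters = 1000) {
    for (int it = 0; it < max_iters; ++it) {
        std::vector<int> failed = place(nodes);
        if (failed.empty()) failed = route(nodes);
        if (failed.empty()) failed = retime(nodes);
        if (failed.empty()) return true;  // fully mapped

        // Revert: also un-place a random node to escape local minima.
        failed.push_back(nodes[rand() % nodes.size()]);
        // (a real scheduler would clear the placement/routing state here)
    }
    return false;  // give up; the explorer may propose different hardware
}

int main() {
    std::vector<int> compute_graph = {0, 1, 2, 3};  // node ids of x, +, etc.
    printf("mapped: %s\n", map_compute_graph(compute_graph) ? "yes" : "no");
    return 0;
}
```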
Outline
• Decoupled-Spatial Architecture
• Compilation
• Design Space Exploration
  • Driving the Search
  • Evaluating Design Points
  • Repairing the Mapping
• Evaluation
Design Space Exploration
• Map the multiple transformed IRs onto the current Architecture Description Graph (ADG).
• Evaluate the resulting software/hardware pair.
• Create a new ADG based on the current one, remap, and repeat (see the sketch below).
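A rough sketch of this explore-evaluate loop, assuming hypothetical mutate, map_all, and estimate helpers; the objective quoted later in the talk (perf²/mm²) is plugged in here purely for concreteness, and the cost numbers are made up.

```cpp
#include <cstdio>

// Hypothetical stand-ins for the explorer's building blocks.
struct Adg { int pes = 25; };  // architecture description graph (here: just a PE count)
struct Estimate { double perf = 1.0, area_mm2 = 1.0; };

Adg      mutate(const Adg& adg)   { Adg next = adg; next.pes += (next.pes % 2 ? -1 : 1); return next; }
bool     map_all(const Adg& adg)  { return adg.pes >= 4; }          // can every app IR be mapped?
Estimate estimate(const Adg& adg) { return {0.5 * adg.pes, 0.1 * adg.pes}; }

double objective(const Estimate& e) { return e.perf * e.perf / e.area_mm2; }  // perf^2 / mm^2

int main() {
    Adg best;                                     // initial design (e.g. a 5x5 mesh)
    double best_obj = objective(estimate(best));
    for (int iter = 0; iter < 100; ++iter) {
        Adg cand = mutate(best);                  // create a new ADG from the current one
        if (!map_all(cand)) continue;             // remap all transformed IRs; skip if any fails
        double obj = objective(estimate(cand));   // evaluate the sw/hw pair
        if (obj > best_obj) { best = cand; best_obj = obj; }
    }
    printf("best objective: %.2f with %d PEs\n", best_obj, best.pes);
    return 0;
}
```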
Evaluating Design Points
• Power/Area estimation model
  • Synthesis can be time-consuming; a regression model can predict the trend of hardware cost.
• Performance model
  • A spatial architecture essentially enables hardware-specialized software pipelining, so the ratio of data availability determines the performance.
  • Perf = #Inst × (Activity Ratio)
• Model validation (Dense NN, Sparse CNN, MachSuite): mean performance error of 7%, maximum error of 30%.
[Chart: model vs. synthesis area and power for the three workload suites.]
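A toy instance of that performance estimate, just to show the arithmetic; the bandwidth-based activity ratio below is a made-up approximation, whereas the real model derives it from the streams and the mapped compute graph.

```cpp
#include <algorithm>
#include <cstdio>

// Toy version of Perf = #Inst * ActivityRatio. The activity ratio is
// approximated as the fraction of cycles the memory system can keep every
// stream fed (an illustrative formula, not the paper's model).
double estimate_perf(int mapped_insts, double bytes_needed_per_cycle,
                     double mem_bandwidth_bytes_per_cycle) {
    double activity = std::min(1.0, mem_bandwidth_bytes_per_cycle / bytes_needed_per_cycle);
    return mapped_insts * activity;  // useful instructions per cycle
}

int main() {
    // 16 mapped instructions that want 48 B/cycle from a 32 B/cycle memory.
    printf("est. throughput: %.2f insts/cycle\n", estimate_perf(16, 48.0, 32.0));
    return 0;
}
```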
Repairing the Spatial Mapping
• Original code: for (i = 0; i < n; ++i) c[i] += a[i] * b[i];
• The same loop can be mapped under different transformations:
  • No unrolling: a single ×/+ chain fed by streams a[0:n], b[0:n], c[0:n].
  • Unroll by 2: two ×/+ chains mapped side by side to use additional resources.
• When the explorer modifies the hardware, the existing mapping is repaired rather than rescheduled from scratch.
Hardware/Software Interface Generation
• How do we configure an accelerator with an arbitrary topology?
  • Reuse the data path for configuration: find path(s) that cover all the components.
  • A heuristic algorithm minimizes the longest configuration path.
• For a graph with m nodes covered by n paths, the longest path cannot be shorter than ⌈m/n⌉. For example, covering a 25-node mesh with 4 paths gives a lower bound of ⌈25/4⌉ = 7 nodes.
• Our heuristic introduces only about 40% overhead over this ideal bound.
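The sketch below illustrates the path-cover idea with a naive greedy walk over a tiny mesh and compares the longest path against the ⌈m/n⌉ bound. This is only an illustration of the concept, not DSAGEN's actual heuristic, and the 2x3 topology is invented for the example.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    // A tiny 2x3 mesh of components, as an adjacency list (invented example).
    std::vector<std::vector<int>> adj = {
        {1, 3}, {0, 2, 4}, {1, 5}, {0, 4}, {1, 3, 5}, {2, 4}};
    int m = (int)adj.size();
    std::vector<bool> covered(m, false);

    // Greedy cover: start a path at an uncovered node, extend it through
    // uncovered neighbors until stuck, then start the next path.
    std::vector<std::vector<int>> paths;
    for (int start = 0; start < m; ++start) {
        if (covered[start]) continue;
        std::vector<int> path = {start};
        covered[start] = true;
        int cur = start;
        bool extended = true;
        while (extended) {
            extended = false;
            for (int nxt : adj[cur]) {
                if (!covered[nxt]) {
                    covered[nxt] = true; path.push_back(nxt);
                    cur = nxt; extended = true; break;
                }
            }
        }
        paths.push_back(path);
    }

    size_t longest = 0;
    for (auto& p : paths) longest = std::max(longest, p.size());
    int n = (int)paths.size();
    int bound = (m + n - 1) / n;  // ceil(m / n): no cover can do better
    printf("%d paths, longest = %zu nodes, lower bound = %d\n", n, longest, bound);
    return 0;
}
```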
Outline
• Decoupled-Spatial Architecture
• Compilation
• Design Space Exploration
• Evaluation
  • Methodology
  • Compiler
  • Design Space Exploration
Methodology
• Performance
  • gem5 RISC-V in-order core integrated with a cycle-accurate spatial accelerator simulator.
  • The in-order core is extended with the stream-decoupled ISA.
• Power/Area
  • All components are implemented in Chisel RTL.
  • Synthesized with Synopsys DC at 28nm, 1.00 GHz.
  • SRAM power/area are estimated with CACTI 7.0.
Compiler Performance
• Softbrain on MachSuite: a versatile accelerator that handles moderate irregularity.
• SPU on Histogram and Key Join: an accelerator specialized for irregular workloads.
• REVEL and Triggered Instructions on DSP kernels: accelerators specialized for imperfect loop bodies.
• MAERI on PolyBench: an accelerator for neural networks.
Compiler Performance
[Chart: speedup of compiled vs. manually written code across MachSuite (Softbrain), irregular workloads (SPU), DSP (REVEL and Trigger), and PolyBench (MAERI).]
Design Space Explorer
• Workloads
  • Dense neural network
  • MachSuite
  • Sparse convolutional neural network
• Initial design: a 5x5 mesh with all capabilities (arithmetic, control, and indirect).
• Objective: perf² / mm²
Design Space Explorer
• Exploration time: Sparse CNN 24h, MachSuite 19.2h, Dense NN 16h.
[Chart: area and power breakdown of the explored designs into functional units, network, sync elements, and memory.]
Conclusion

                HLS                        Manual                       DSAGEN
Frontend        C + Pragma                 DSL/Intrinsics, etc.         C + Pragma
Design Flow     Nearly Automated           Manual                       Nearly Automated
Input           A Single Application       Multiple Target Applications Multiple Target Applications
Output          Application-Specific Accel. ASIC/Programmable Accel.    A Programmable Accelerator
Design Space    Limited                    Rich                         Rich
Q&A
• Our framework is a work in progress: https://github.com/PolyArch/dsa-framework
• All questions and comments are welcome.