DSAGEN: Synthesizing Programmable Spatial Accelerators
Jian Weng, Sihao Liu, Vidushi Dadu, Zhengrong Wang, Preyas Shah, Tony Nowatzki
University of California, Los Angeles
May 15th, 2020
Existing Domain-Specific Approach: Specialized Accelerators
• Specialized architectures account for roughly 1/5 to 1/3 of publications at top conferences (ISCA'19, ASPLOS'19, MICRO'19, HPCA'20).
• Building one requires the full vertical stack: apps, idioms, specialized mechanisms, SW/HW interface, compiler, and a high-level abstraction.
[Chart: share of specialized-architecture papers at each conference.]
DSAGEN: Decoupled Spatial Accelerator Generator
• Applications drive a design space explorer, which produces the specialized hardware.
DSAGEN: Decoupled Spatial Accelerator Generator
• The compiler lowers applications into multiple transformed IRs, applying transformations with tradeoffs between performance and hardware cost.
• The design space explorer evaluates candidate hardware against the transformed IRs and proposes the final hardware.
Outline
• Design Space: Decoupled-Spatial Architecture
  • Insight from Prior Work
  • The Programming Paradigm
  • Design Space: Hardware Primitives (& Composition)
• Compilation
• Design Space Exploration
• Evaluation
Insight from Prior Work
• Decoupled-Spatial Paradigm
  • Decoupled compute/memory
  • Spatially exposed resources
• Design Space
  • Compose hardware from simple primitives
  • Architecture Description Graph
[Figure: block diagrams of (a) MAERI (ASPLOS'18), (b) Plasticine (ISCA'17), and (c) Softbrain (ISCA'17), each built from memories, memory controllers, functional units, stream dispatchers, switches, control, and sync elements.]
Background: Decoupled-Spatial Architecture
• Example: for (int i = 0; i < n; ++i) c[i] += a[i] * b[i];
• The control host issues stream commands; address generators and the memory controller read the streams a[0:n], b[0:n], and c[0:n] from memory/scratchpad and write c[0:n] back.
• Processing elements (×, +) compute, switches route values through the spatial network, and sync elements buffer operands until all inputs have arrived (sketched below).
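As a rough illustration only (not DSAGEN's actual ISA), the loop above can be viewed as three read streams and one write stream feeding a small multiply-add dataflow graph. The sketch below models this in plain C++, with queues standing in for streams and a fire-when-ready rule standing in for the sync elements; the names Stream and the queue-based structure are invented for this example.

```cpp
#include <cstdio>
#include <deque>
#include <vector>

// A "stream" is just a FIFO of values, decoupled from the compute fabric.
using Stream = std::deque<int>;

int main() {
    const int n = 8;
    std::vector<int> a(n, 2), b(n, 3), c(n, 1);

    // Address generators: turn a[0:n], b[0:n], c[0:n] into streams.
    Stream sa(a.begin(), a.end());
    Stream sb(b.begin(), b.end());
    Stream sc(c.begin(), c.end());
    Stream out;  // write-back stream for c[0:n]

    // Spatial fabric: a multiply PE feeding an add PE, firing only when
    // all operands are available (the job of the sync elements).
    while (!sa.empty() && !sb.empty() && !sc.empty()) {
        int prod = sa.front() * sb.front();   // multiply PE
        int sum  = prod + sc.front();         // add PE
        sa.pop_front(); sb.pop_front(); sc.pop_front();
        out.push_back(sum);                   // route result toward memory
    }

    // Memory controller drains the output stream back into c[0:n].
    for (int i = 0; i < n; ++i) c[i] = out[i];
    for (int i = 0; i < n; ++i) printf("%d ", c[i]);  // prints 7 eight times
    printf("\n");
    return 0;
}
```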
Hardware Primitives: Processing Element & Switch
• A PE is composed of a function unit plus optional register file, instruction buffer, instruction scheduler, and muxes; switches route values between PEs.
• Two design axes: dedicated (one instruction per PE) vs. shared (>1), and statically vs. dynamically scheduled. Cost ranges from low to high:
  • Statically scheduled, dedicated ("systolic", e.g. Softbrain): 1x area; no contention, but harder to map.
  • Statically scheduled, shared (conventional CGRA): 2.6x area; better resource utilization, but higher power and harder to map.
  • Dynamically scheduled, dedicated ("ordered dataflow"): 2.1x area; better flexibility.
  • Dynamically scheduled, shared ("tagged dataflow", e.g. Triggered Instructions, SPU): 5.8x area; better flexibility and resource utilization, at the highest cost.
Hardware Primitives: Memory
• Memory parameters: size and bandwidth.
• Indirect access support, e.g. a[b[i]], via an indirect address generator and a crossbar.
• Atomic update support, e.g. a[b[i]] += 1, via near-bank FUs and a reorder buffer.
[Figure: banked memory with indirect address generator, crossbar, per-bank FUs, and a reorder buffer.]
Examples of ADG
• The Architecture Description Graph can express existing accelerators: (a) Softbrain, (b) MAERI, (c) DianNao, and (d) the data path of a complex multiplication.
[Figure: each design drawn as a graph of memories, controllers, stream dispatchers, switches, and functional units (×, +).]
Outline
• Decoupled-Spatial Architecture
• Compilation
  • High-Level Abstraction
  • Hardware-Aware Modular Compilation
• Design Space Exploration
• Evaluation
Compiling a High-Level Language to Decoupled-Spatial
• Flow: Apps → Pragma Annotation → ? → Executable Binary
• How do we abstract diverse underlying features behind a unified high-level interface?
• Programmer hints:
  • Which code regions are offloaded onto the spatial accelerator.
  • Which memory accesses can be decoupled into stream intrinsics.
  • Which offloaded regions should be concurrent.
An Example of Pragma Annotation

#pragma config                 ← the offloaded regions in this compound body are concurrent
{
  #pragma stream               ← the memory accesses below will be decoupled into streams
  for (i = 0; i < n; ++i)
    #pragma offload            ← the computational instructions below will be offloaded
    for (j = 0; j < n; ++j)
      a[i*n+j] += b[c[j]] * d[i*n+j];
}

[Figure: the resulting compute graph, with streams a[0:n], c[0:n], d[0:n] and indirect reads b[...] feeding × and + nodes.]
Compiling a High-Level Language to Decoupled-Spatial
• Flow: Apps → Pragma Annotation → Modular XFORM → Compute Graph + Encoded Memory Streams → Executable Binary, guided by the Architecture Description Graph (ADG).
• How do we hide the diversity of the underlying hardware?
• Modular transformation:
  • Specialized hardware features often dictate the code transformation.
  • A fallback is required when a hardware feature is not available.
Modular Transformation
• Inspect the hardware features to generate the corresponding version of the indirect memory access b[c[j]] from the example above:

  // With indirect support
  Read c[0:n], stream0
  Indirect b, stream0, stream1

  // Without indirect support
  for (j = 0; j < n; ++j)
    Scalar b[c[j]], stream0
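A minimal sketch of this kind of feature-driven lowering, assuming a hypothetical ADG query (has_indirect_stream) and string-level stream command emission; DSAGEN's real compiler works on LLVM IR against the ADG, not on strings.

```cpp
#include <cstdio>
#include <string>

// Hypothetical view of the target's ADG capabilities.
struct AdgFeatures {
    bool has_indirect_stream;  // indirect address generator present in hardware?
};

// Lower the access pattern b[c[j]], j in [0, n), to stream commands.
// Without indirect streams, fall back to scalar loads issued one by one.
std::string lower_indirect_read(const AdgFeatures& adg, int n) {
    if (adg.has_indirect_stream) {
        return "Read   c[0:" + std::to_string(n) + "] -> stream0\n"
               "Indirect b, stream0 -> stream1\n";
    }
    return "for (j = 0; j < " + std::to_string(n) + "; ++j)\n"
           "  Scalar b[c[j]] -> stream0\n";
}

int main() {
    AdgFeatures with_ind{true}, without_ind{false};
    printf("-- indirect hardware --\n%s", lower_indirect_read(with_ind, 64).c_str());
    printf("-- fallback --\n%s", lower_indirect_read(without_ind, 64).c_str());
    return 0;
}
```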
Compiling a High-Level Language to Decoupled-Spatial
• Flow: Apps → Pragma Annotation → Modular XFORM → Compute Graph + Encoded Memory Streams → Executable Binary.
• How is the dependence graph of computational instructions mapped onto the spatial fabric?
Spatial Mapping
• How is the dependence graph of computational instructions mapped? See the sketch below.
  1. Placement: map each instruction to a PE with the corresponding capability.
  2. Routing: route the dependence edges through the spatial network.
  3. Timing: if necessary, balance the timing of data arrival.
• If any of steps 1-3 fails, revert some nodes and repeat.
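A minimal sketch of the revert-and-retry loop, assuming hypothetical place/route/retime helpers that return the nodes they failed on; DSAGEN's actual scheduler performs a stochastic search over the ADG, which this skeleton only gestures at.

```cpp
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical placeholders for the three mapping phases. Each returns the
// set of compute-graph nodes it could not handle (empty means success).
std::vector<int> place(const std::vector<int>& nodes)  { return {}; }
std::vector<int> route(const std::vector<int>& nodes)  { return {}; }
std::vector<int> retime(const std::vector<int>& nodes) { return {}; }

// Iterative mapping: try placement, routing, and timing; on failure, rip up
// the offending nodes plus a random extra node and try again.
bool map_compute_graph(std::vector<int> nodes, int max_iters = 1000) {
    for (int it = 0; it < max_iters; ++it) {
        std::vector<int> failed = place(nodes);
        if (failed.empty()) failed = route(nodes);
        if (failed.empty()) failed = retime(nodes);
        if (failed.empty()) return true;  // fully mapped

        // Revert: also un-place a random node to escape local minima.
        failed.push_back(nodes[rand() % nodes.size()]);
        // (a real scheduler would clear the placement/routing state here)
    }
    return false;  // give up; the explorer may propose different hardware
}

int main() {
    std::vector<int> compute_graph = {0, 1, 2, 3};  // node ids of x, +, etc.
    printf("mapped: %s\n", map_compute_graph(compute_graph) ? "yes" : "no");
    return 0;
}
```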
Outline
• Decoupled-Spatial Architecture
• Compilation
• Design Space Exploration
  • Driving the Search
  • Evaluating Design Points
  • Repairing the Mapping
• Evaluation
Design Space Exploration
• Map the multiple transformed IRs onto the current Architecture Description Graph (ADG).
• Evaluate the resulting software/hardware pair.
• Create a new ADG based on the current one, remap, and repeat (see the sketch below).
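A rough sketch of this explore-evaluate loop, assuming hypothetical mutate, map_all, and estimate helpers; the objective quoted later in the talk (perf²/mm²) is plugged in here purely for concreteness, and the cost numbers are made up.

```cpp
#include <cstdio>

// Hypothetical stand-ins for the explorer's building blocks.
struct Adg { int pes = 25; };  // architecture description graph (here: just a PE count)
struct Estimate { double perf = 1.0, area_mm2 = 1.0; };

Adg      mutate(const Adg& adg)   { Adg next = adg; next.pes += (next.pes % 2 ? -1 : 1); return next; }
bool     map_all(const Adg& adg)  { return adg.pes >= 4; }          // can every app IR be mapped?
Estimate estimate(const Adg& adg) { return {0.5 * adg.pes, 0.1 * adg.pes}; }

double objective(const Estimate& e) { return e.perf * e.perf / e.area_mm2; }  // perf^2 / mm^2

int main() {
    Adg best;                                     // initial design (e.g. a 5x5 mesh)
    double best_obj = objective(estimate(best));
    for (int iter = 0; iter < 100; ++iter) {
        Adg cand = mutate(best);                  // create a new ADG from the current one
        if (!map_all(cand)) continue;             // remap all transformed IRs; skip if any fails
        double obj = objective(estimate(cand));   // evaluate the sw/hw pair
        if (obj > best_obj) { best = cand; best_obj = obj; }
    }
    printf("best objective: %.2f with %d PEs\n", best_obj, best.pes);
    return 0;
}
```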
Evaluating Design Points
• Power/Area estimation model
  • Synthesis can be time-consuming; a regression model can predict the trend of hardware cost.
• Performance model
  • A spatial architecture essentially enables hardware-specialized software pipelining, so the ratio of data availability determines the performance.
  • Perf = #Inst × (Activity Ratio)
• Model validation (Dense NN, Sparse CNN, MachSuite): mean performance error of 7%, maximum error of 30%.
[Chart: model vs. synthesis area and power for the three workload suites.]
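A toy instance of that performance estimate, just to show the arithmetic; the bandwidth-based activity ratio below is a made-up approximation, whereas the real model derives it from the streams and the mapped compute graph.

```cpp
#include <algorithm>
#include <cstdio>

// Toy version of Perf = #Inst * ActivityRatio. The activity ratio is
// approximated as the fraction of cycles the memory system can keep every
// stream fed (an illustrative formula, not the paper's model).
double estimate_perf(int mapped_insts, double bytes_needed_per_cycle,
                     double mem_bandwidth_bytes_per_cycle) {
    double activity = std::min(1.0, mem_bandwidth_bytes_per_cycle / bytes_needed_per_cycle);
    return mapped_insts * activity;  // useful instructions per cycle
}

int main() {
    // 16 mapped instructions that want 48 B/cycle from a 32 B/cycle memory.
    printf("est. throughput: %.2f insts/cycle\n", estimate_perf(16, 48.0, 32.0));
    return 0;
}
```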
Repairing the Spatial Mapping
• Original code: for (i = 0; i < n; ++i) c[i] += a[i] * b[i];
• The same loop can be mapped under different transformations:
  • No unrolling: a single ×/+ chain fed by streams a[0:n], b[0:n], c[0:n].
  • Unroll by 2: two ×/+ chains mapped side by side to use additional resources.
• When the explorer modifies the hardware, the existing mapping is repaired rather than rescheduled from scratch.
Hardware/Software Interface Generation
• How do we configure an accelerator with an arbitrary topology?
  • Reuse the data path for configuration: find path(s) that cover all the components.
  • A heuristic algorithm minimizes the longest configuration path.
• For a graph with m nodes covered by n paths, the longest path cannot be shorter than ⌈m/n⌉. For example, covering a 25-node mesh with 4 paths gives a lower bound of ⌈25/4⌉ = 7 nodes.
• Our heuristic introduces only about 40% overhead over this ideal bound.
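The sketch below illustrates the path-cover idea with a naive greedy walk over a tiny mesh and compares the longest path against the ⌈m/n⌉ bound. This is only an illustration of the concept, not DSAGEN's actual heuristic, and the 2x3 topology is invented for the example.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    // A tiny 2x3 mesh of components, as an adjacency list (invented example).
    std::vector<std::vector<int>> adj = {
        {1, 3}, {0, 2, 4}, {1, 5}, {0, 4}, {1, 3, 5}, {2, 4}};
    int m = (int)adj.size();
    std::vector<bool> covered(m, false);

    // Greedy cover: start a path at an uncovered node, extend it through
    // uncovered neighbors until stuck, then start the next path.
    std::vector<std::vector<int>> paths;
    for (int start = 0; start < m; ++start) {
        if (covered[start]) continue;
        std::vector<int> path = {start};
        covered[start] = true;
        int cur = start;
        bool extended = true;
        while (extended) {
            extended = false;
            for (int nxt : adj[cur]) {
                if (!covered[nxt]) {
                    covered[nxt] = true; path.push_back(nxt);
                    cur = nxt; extended = true; break;
                }
            }
        }
        paths.push_back(path);
    }

    size_t longest = 0;
    for (auto& p : paths) longest = std::max(longest, p.size());
    int n = (int)paths.size();
    int bound = (m + n - 1) / n;  // ceil(m / n): no cover can do better
    printf("%d paths, longest = %zu nodes, lower bound = %d\n", n, longest, bound);
    return 0;
}
```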
Outline
• Decoupled-Spatial Architecture
• Compilation
• Design Space Exploration
• Evaluation
  • Methodology
  • Compiler
  • Design Space Exploration
Methodology
• Performance
  • gem5 RISC-V in-order core integrated with a cycle-accurate spatial accelerator simulator.
  • The in-order core is extended with the stream-decoupled ISA.
• Power/Area
  • All components are implemented in Chisel RTL.
  • Synthesized with Synopsys DC at 28nm, 1.00 GHz.
  • SRAM power/area are estimated with CACTI 7.0.
Compiler Performance
• Softbrain on MachSuite: a versatile accelerator that handles moderate irregularity.
• SPU on Histogram and Key Join: an accelerator specialized for irregular workloads.
• REVEL and Triggered Instructions on DSP kernels: accelerators specialized for imperfect loop bodies.
• MAERI on PolyBench: an accelerator for neural networks.
Compiler Performance
[Chart: speedup of compiled vs. manually written code across MachSuite (Softbrain), irregular workloads (SPU), DSP (REVEL and Trigger), and PolyBench (MAERI).]
Design Space Explorer
• Workloads
  • Dense neural network
  • MachSuite
  • Sparse convolutional neural network
• Initial design: a 5x5 mesh with all capabilities (arithmetic, control, and indirect).
• Objective: perf² / mm²
Design Space Explorer
• Exploration time: Sparse CNN 24h, MachSuite 19.2h, Dense NN 16h.
[Chart: area and power breakdown of the explored designs into functional units, network, sync elements, and memory.]
Conclusion

                HLS                        Manual                       DSAGEN
Frontend        C + Pragma                 DSL/Intrinsics, etc.         C + Pragma
Design Flow     Nearly Automated           Manual                       Nearly Automated
Input           A Single Application       Multiple Target Applications Multiple Target Applications
Output          Application-Specific Accel. ASIC/Programmable Accel.    A Programmable Accelerator
Design Space    Limited                    Rich                         Rich
Q&A
• Our framework is a work in progress: https://github.com/PolyArch/dsa-framework
• All questions and comments are welcome.