Balancing Efficiency and Flexibility for DNN Acceleration via Temporal GPU-Systolic Array Integration
Cong Guo¹, Yangjie Zhou¹, Jingwen Leng¹, Yuhao Zhu², Zidong Du³, Quan Chen¹, Chao Li¹, Bin Yao¹, Minyi Guo¹
¹Shanghai Jiao Tong University, ²University of Rochester, ³Institute of Computing Technology, Chinese Academy of Sciences
Biography
• Cong Guo: first-year Ph.D. student at Shanghai Jiao Tong University, interested in computer architecture and high-performance computing.
• Jingwen Leng: associate professor, John Hopcroft Center for Computer Science and Department of Computer Science and Engineering, Shanghai Jiao Tong University.
Outline
• Introduction
• Simultaneous Multi-mode Architecture
• Evaluation
Introduction
• The efficiency-flexibility trade-off: the Google TPU [ISCA'16] is a core specialized for GEMM (General Matrix Multiply), while the Nvidia GPU [Volta, 2017] is a general-purpose core.
Inefficiency for TPU
• Setup: TPU v2 with 1 core (22 TFLOPS) vs. GPU V100 (15 TFLOPS).
• TPU profiling on hybrid models:
  • Mask R-CNN: CNN/FC layers run 20% faster than on the GPU, yet the whole model is 75% slower than on the GPU.
  • DeepLab: CNN/FC layers run 40% faster than on the GPU, but ArgMax is 2x slower and CRF is 10x slower.
Inefficiency for GPU
• Spatial integration (Tensor Core):
  • Explicit synchronization
  • Fixed-shape GEMM
• Performance inefficiency:
  • Tensor Core efficiency below 60%
  • TPU efficiency above 99%
• Efficiency: achieved FLOPS divided by the peak FLOPS (written out below).
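Spelling out the metric used on this slide; the 60% instance below is illustrative arithmetic with an assumed achieved throughput, not a measured number from the talk:

$$\text{Efficiency} = \frac{\text{FLOPS}_{\text{achieved}}}{\text{FLOPS}_{\text{peak}}}, \qquad \text{e.g.}\; \frac{9\ \text{TFLOPS}}{15\ \text{TFLOPS}} = 60\%.$$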
Outline
• Introduction
• Simultaneous Multi-mode Architecture
• Evaluation
GPU
• Parallelism
  • Single Instruction Multiple Data (SIMD)
  • Massive threads
  • Warp active mask
• Memory
  • Register file: vector access
  • Shared memory: scalar access
• Communication
  • PE array (CUDA cores) with shared memory
  • Warp shuffle instruction
[Figure: GPU SM microarchitecture, with instruction cache, warp scheduler, dispatch unit, register file, CUDA cores, SFU, LD/ST units, operand collector, result queue, interconnect network, cache, and DRAM]
TPU
• Parallelism
  • 2D systolic array (MISD)
  • High concurrency
  • Active/non-active PEs (one instruction, two states)
• Memory
  • Weights: vector access (continuous)
  • Inputs/outputs: scalar access (discontinuous)
• Communication
  • Interconnected PE array
Similarity
• Parallelism: the TPU's 2D systolic array (MISD) with high concurrency and active/non-active PEs (one instruction, two states) mirrors the GPU's SIMD execution with massive threads and a warp active mask.
• Memory: the TPU's continuous (vector) weight access and discontinuous (scalar) input/output access map to the GPU's vector register-file access and scalar shared-memory access.
• Communication: the TPU's interconnected PE array corresponds to the GPU's PE array of CUDA cores communicating through shared memory.
SMA Hardware Design
The Simultaneous Multi-mode Architecture (SMA) keeps the hardware similar to the GPU and addresses three design challenges:
1. Massive output scalar accesses
2. Inter-PE communication
3. How to control the systolic array
Challenge 1: Massive output scalar accesses
• Solution: a semi-broadcasted, weight-stationary dataflow (see the CUDA sketch after the next slide).
  • Weights: preloaded into the register file.
  • Outputs: written to the register file with vector accesses.
  • Inputs: read from shared memory with scalar accesses; semi-broadcasting avoids shared-memory bank conflicts.
Challenge 2: Partial-sum communication
• Horizontal neighbor PEs: partial sums are latency-sensitive, so they travel over fast one-way wires with negligible overhead.
• Vertical PEs: their traffic is latency-insensitive, so it uses the slower broadcast and prefetch paths.
• No PE-layout reconfiguration is needed.
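The dataflow on the last two slides can be approximated on a stock GPU with warp-level primitives: each lane acts as one PE of a 4x4 weight-stationary tile, keeps its weight in a register, reads inputs from shared memory, and hands its partial sum to the next PE in the chain with a warp shuffle. This is only a behavioral sketch under assumptions of my own (kernel name, lane layout, float data, and the use of __shfl_up_sync in place of SMA's dedicated one-way wires and LSMA instruction):

```cuda
#include <cuda_runtime.h>

// One 16-lane group models a 4x4 weight-stationary PE tile computing C = A*W,
// with W 4x4 stationary in registers, A of size M x 4, and C of size M x 4.
// Lane layout (an assumption of this sketch): lane = c*4 + r, so each group of
// four consecutive lanes forms one partial-sum chain.
__global__ void systolic_tile_sketch(const float* A, const float* W,
                                     float* C, int M) {
    const int lane = threadIdx.x;   // 0..15
    const int r = lane & 3;         // position along the partial-sum chain
    const int c = lane >> 2;        // output column owned by this chain

    // Preload the stationary weight into a register.
    const float w = W[r * 4 + c];

    // Inputs are staged in shared memory and read with scalar accesses.
    __shared__ float a_row[4];

    for (int m = 0; m < M; ++m) {
        if (lane < 4) a_row[lane] = A[m * 4 + lane];
        __syncthreads();

        const float a = a_row[r];
        float psum = 0.0f;
        // Forward partial sums one hop per step, emulating the one-way wires
        // between neighboring PEs with a warp shuffle.
        for (int step = 0; step < 4; ++step) {
            const float incoming = __shfl_up_sync(0xFFFFu, psum, 1, 4);
            if (r == step) psum = (r == 0 ? 0.0f : incoming) + a * w;
        }
        // The last PE of each chain holds the finished dot product.
        if (r == 3) C[m * 4 + c] = psum;
        __syncthreads();
    }
}
```

Launched as systolic_tile_sketch<<<1, 16>>>(dA, dW, dC, M), this reproduces the 4-wide weight-stationary GEMM; the real SMA array performs the same data movement in hardware, without shuffles or __syncthreads().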
Challenge 3: Instruction control
• A new instruction: LSMA (Load, Store and Multiply-Accumulate).
• A systolic controller per SM:
  • Input buffer: 8 x 2 x 4 bytes = 64 bytes; output buffer: 24 x 2 x 4 bytes = 192 bytes; 256 bytes in total.
  • Against the 256 KB register file and 128 KB L1 cache/shared memory per SM, this is an area overhead below 0.1%.
Software Design: Tiling GEMM
• Tiled GEMM built on CUTLASS 1.3.
• The kernel issues the new LSMA instruction in PTX on half-precision tiles, with synchronization between tiling stages.
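Since the slide does not reproduce the actual SMA/CUTLASS kernel, here is a generic shared-memory-tiled GEMM in plain CUDA to show the tiling structure being referred to: tiles of A and B are staged in shared memory, a per-thread register accumulator plays the role of the output tile, and the inner multiply-accumulate is the spot where LSMA (or a Tensor Core op) would be issued. The names, the 16x16 tile size, and the use of float instead of half are choices of this sketch, not the paper's code:

```cuda
#include <cuda_runtime.h>

constexpr int TILE = 16;

// Generic tiled GEMM: C = A * B with A (M x K), B (K x N), C (M x N), all
// row-major. Each thread block owns one TILE x TILE tile of C; the k-loop
// stages TILE-wide slices of A and B in shared memory.
__global__ void tiled_gemm(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    const int row = blockIdx.y * TILE + threadIdx.y;
    const int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;   // per-thread accumulator: the register-file output tile

    for (int k0 = 0; k0 < K; k0 += TILE) {
        // Stage the next pair of tiles, guarding the matrix edges.
        As[threadIdx.y][threadIdx.x] =
            (row < M && k0 + threadIdx.x < K) ? A[row * K + k0 + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (k0 + threadIdx.y < K && col < N) ? B[(k0 + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        // Inner multiply-accumulate: SMA would issue LSMA here instead of
        // scalar fused multiply-adds.
        for (int kk = 0; kk < TILE; ++kk)
            acc += As[threadIdx.y][kk] * Bs[kk][threadIdx.x];
        __syncthreads();
    }

    if (row < M && col < N) C[row * N + col] = acc;
}

// Example launch:
//   dim3 block(TILE, TILE);
//   dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
//   tiled_gemm<<<grid, block>>>(dA, dB, dC, M, N, K);
```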
Outline
• Introduction
• Simultaneous Multi-mode Architecture
• Evaluation
Evaluation
• Methodology:
  • Simulator: GPGPU-Sim 4.0.
  • Workloads: GEMM kernels based on CUTLASS.
Iso-FLOP comparison
• On square GEMMs, a 2-SMA configuration reaches over 90% efficiency, about 30% higher than 4 Tensor Cores (4-TC).
• SMA with broadcasting is 20% to 40% more efficient than SMA without broadcasting.
Iso-area comparison
• Across 5 networks, a 3-SMA configuration is on average 63% faster and uses 23% less energy than 4-TC.
End-to-end application: autopilot [Shih-Chieh Lin et al., ASPLOS'18]
• Workloads: DeepLab (CNN), GO-TURN (CNN), ORB-SLAM (non-CNN).
• Platforms at the same area: GPU (CUDA cores only), TC (CUDA cores plus Tensor Cores), SMA (CUDA cores plus SMA).
[Figure: GEMM speedup of each platform, normalized to the GPU (1.0x)]
End-to-end application: autopilot, continued [Euphrates, ISCA'18]
• The non-GEMM work left on the CUDA cores becomes the bottleneck.
• With N = 4, SMA reduces end-to-end latency by 50% while offering more flexibility.
Summary
• Hardware
  • Parallelism similarity
  • Memory and communication
  • Systolic controller
• Software
  • Tiling GEMM
• Evaluation
  • Efficiency
  • Flexibility
Questions Thank you!
Backup Slides
SMA Hardware Design Simultaneous Multi-mode Architecture (SMA)
LSMA execution (example: C = A x B, with A 8x4, weights B 4x4, and output C 8x4)
• Controller state: a counter holding the number of input rows still to feed (initialized to 8) and a 5-bit mask for the 4x4 array.
• Steps: preload the weights, prefetch inputs from shared memory, execute, and store outputs to the register file.
• Cycle 0: counter = 8, mask = 1 0 0 0 0.
Cycle-by-cycle controller state (animation frames):
Cycle  Counter  Mask        Action
1      8        1 0 0 0 0   Prefetch input (discontinuous access)
2      7        1 1 0 0 0   Execute
3      6        1 1 1 0 0
4      5        1 1 1 1 0
5      4        1 1 1 1 1   Store output (continuous access)
6      3        1 1 1 1 1
7      2        1 1 1 1 1
8      1        1 1 1 1 1
9      0        0 1 1 1 1
10     0        0 0 1 1 1
11     0        0 0 0 1 1
12     0        0 0 0 0 1
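The counter/mask progression above can be captured by a tiny state machine; the sketch below (plain host code, buildable with nvcc or any C++ compiler) reproduces the frames. The exact update rule, mask width, and member names are inferred from the animation, not taken from the paper's RTL:

```cuda
#include <cstdio>

// Sketch of the per-SM systolic controller for a 4x4 array: a row counter
// (input rows still to feed) plus a 5-bit occupancy mask that fills from the
// left as the wavefront enters and drains once the counter reaches zero.
struct SystolicController {
    int      counter;   // input rows still to feed (8 in the backup slides)
    unsigned mask;      // 5-bit mask; bit 4 is the leftmost position

    explicit SystolicController(int rows) : counter(rows), mask(0x10u) {}

    bool busy() const { return counter > 0 || mask != 0; }

    // Advance one cycle: shift the occupancy right by one and, while rows
    // remain, keep feeding a new row into the leading position.
    void step() {
        mask >>= 1;
        if (counter > 0) --counter;
        if (counter > 0) mask |= 0x10u;
    }

    void print(int cycle) const {
        std::printf("cycle %2d  counter %d  mask", cycle, counter);
        for (int b = 4; b >= 0; --b) std::printf(" %u", (mask >> b) & 1u);
        std::printf("\n");
    }
};

int main() {
    SystolicController ctrl(8);   // 8 input rows, as in the 8x4 example
    ctrl.print(1);                // cycle 1: counter 8, mask 1 0 0 0 0
    for (int cycle = 2; ctrl.busy(); ++cycle) {
        ctrl.step();
        ctrl.print(cycle);        // matches the table up to cycle 12
    }
    return 0;
}
```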
Software Design: Tiling GEMM