TANGRAM: Optimized Coarse-Grained Dataflow for Scalable NN Accelerators
Mingyu Gao, Xuan Yang, Jing Pu, Mark Horowitz, Christos Kozyrakis
Stanford University / Tsinghua University / Google
ASPLOS, April 2019
Neural Networks (NNs)
• Unprecedented accuracy for challenging applications
  o Fully-connected (MLPs), convolutional (CNNs), and recurrent (LSTMs) NNs
• Inference: layer-wise processing on directed acyclic graphs (DAGs)
[Figure: a convolutional NN, an LSTM cell, and an Inception module drawn as layer DAGs]
NN Accelerators
• Domain-specific processing engine
  o An array of specialized processing elements (PEs)
  o On-chip register files and SRAMs
  o Roughly 100x gains in performance and energy efficiency
• DianNao/Cambricon, Google TPU, Eyeriss, Cnvlutin, EIE, ...
[Figure: an NN processing engine: a PE array with per-PE register files and a shared global buffer]
Scaling NN Performance
• Use more PEs and more on-chip buffers
• Monolithic engine of PEs
  ✗ Low resource utilization
  ✗ Long array buses
  ✗ Far from SRAM
• Tiled architecture (the focus of our work)
  ✓ Mostly local data transfers
  ✓ Easy to scale up/down
  ? Dataflow scheduling
[Figure: a monolithic PE array with one global buffer vs. a tiled architecture of small arrays, each with its own buffer, connected to multiple memory channels]
TANGRAM: Optimizing Coarse-Grained Dataflow
• Intra-layer parallelism: buffer sharing dataflow
  o Reuse data across engines → higher energy efficiency
  o Avoid on-chip data duplication → smaller buffer area
• Inter-layer pipelining: fine-grained data forwarding, and pipelining of complex DAGs
  o Reduce pipeline stalls → higher throughput
  o Temporarily store forwarded data → smaller buffer area
[Figure: a tiled architecture annotated with the two classes of optimizations]
Intra-Layer Parallelism
Parallelizing a Single Layer

    foreach b in batch Nb
      foreach ifmap i in Ni
        foreach ofmap o in No
          O[b][o] += I[b][i] * W[o][i]   // 2D conv

[Figure: ifmaps x weights = ofmaps with sizes Nb, Ni, No, and a 2x2 tile assignment in which each tile buffers a replicated slice of I and W]
• Inefficient buffer use for shared data
  ✗ Replicated buffered data (area)
  ✗ Data reuse limited to within each tile (energy)
• ALL parallelization schemes share some data! (see the sketch below)
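To make the duplication concrete, here is a minimal NumPy sketch (not the paper's implementation) of output-parallel partitioning across two engines; the shapes, names, and the use of 1x1 convolutions (plain matrix products) are illustrative assumptions:

    import numpy as np

    Nb, Ni, No = 2, 4, 4
    I = np.random.rand(Nb, Ni)          # ifmaps
    W = np.random.rand(No, Ni)          # weights
    O = np.zeros((Nb, No))              # ofmaps

    # Engine e computes the ofmaps in [e*No//2, (e+1)*No//2).
    # Note: EVERY engine reads the SAME ifmaps I[b][:], so a naive
    # tiled design buffers a full replica of I in each engine's SRAM.
    for e in range(2):
        local_I = I.copy()              # replicated buffered data (area cost)
        for b in range(Nb):
            for o in range(e * No // 2, (e + 1) * No // 2):
                for i in range(Ni):
                    O[b][o] += local_I[b][i] * W[o][i]

    assert np.allclose(O, I @ W.T)      # same result as the full layer

Every engine reads the same ifmaps, so the naive scheme pays both the area of the replicas and the energy of filling each one from memory.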
Optimizing Dataflow for Shared Data
[Figure: four engines rotating their buffered ifmap and weight slices among themselves over successive time steps]
• Skew the computation order across engines
  o All engines start in parallel → high throughput
• Rotate buffered data between engines (see the sketch below)
  o Fully reuse shared data → low energy
  o No on-chip data duplication → low area
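A minimal sketch of the rotation, under the same illustrative assumptions as above: the shared ifmaps are split into E partitions, each engine buffers only one partition at a time, and partitions rotate between engines at every time step:

    import numpy as np

    Nb, Ni, No, E = 2, 4, 4, 2
    I = np.random.rand(Nb, Ni)
    W = np.random.rand(No, Ni)
    O = np.zeros((Nb, No))

    i_part = Ni // E                    # ifmap channels per partition
    for t in range(E):                  # rotation time steps
        for e in range(E):              # engines run in parallel
            p = (e + t) % E             # skewed: engine e holds partition p
            for b in range(Nb):
                for o in range(e * No // E, (e + 1) * No // E):
                    for i in range(p * i_part, (p + 1) * i_part):
                        O[b][o] += I[b][i] * W[o][i]
            # After each step, engine e forwards its partition to a
            # neighbor: the shared data moves instead of being duplicated.

    assert np.allclose(O, I @ W.T)      # same result, no replicas

At any step, distinct engines hold distinct partitions, so the aggregate buffered footprint is exactly one copy of the shared data.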
Buffer Sharing Dataflow
• Unify the distributed buffers into one ideal large buffer
  o Efficiently store and reuse data
• Formalize as loop transformations
  o (tile coordinate x, time step t) → index i of the data to buffer (see the sketch below)
  o See the paper for the detailed math
• Easy to implement
  o The buffer controller fetches from memory or from other tiles
  o No changes to the dataflow within a tile
• Supports all parallelization schemes (including hybrids)
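The paper derives the general mapping; as a hedged illustration of the simplest rotating case only, the chunk buffered in tile x at time step t can be written as (x + t) mod N, so at every step the N tiles collectively hold all N chunks exactly once:

    # Simplest rotating instance of the (x, t) -> i mapping; the
    # paper's general loop-transformation math covers more schemes.
    def buffered_chunk(x: int, t: int, n_tiles: int) -> int:
        return (x + t) % n_tiles

    # At any step t, the tiles together hold every chunk exactly once,
    # so the distributed buffers behave like one large shared buffer.
    N = 4
    for t in range(N):
        chunks = [buffered_chunk(x, t, N) for x in range(N)]
        assert sorted(chunks) == list(range(N))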
Inter-Layer Pipelining
Pipelining Multiple Layers
[Figure: layers 1 to 4 of an NN mapped onto disjoint regions of the tiled chip]
• Pros: avoids off-chip accesses for intermediate data
  o Saves DRAM bandwidth and energy
• Cons: uses resources less efficiently
  o Long delays: pipeline filling/draining due to inter-layer data dependencies
  o Large SRAM buffers: must store the entire intermediate data
Fine-Grained Data Forwarding
• Forward each subset of data to the next layer as soon as it is ready
  o Reduces pipeline stalls: the next layer starts earlier
  o Reduces buffer capacity: only the subset currently being forwarded is stored
• Requires matched access patterns between adjacent layers (see the sketch below)

    // Consumer with ifmap-outer order: each forwarded fmap is used as
    // soon as it arrives; no dependencies, trivially pipelined.
    foreach b in batch Nb
      foreach ifmap i in Ni
        foreach ofmap o in No
          O[b][o] += I[b][i] * W[o][i]   // 2D conv

    // Consumer with ofmap-outer order: the first ofmap already needs
    // ALL ifmaps, so the layer must wait for everything.
    foreach b in batch Nb
      foreach ofmap o in No
        foreach ifmap i in Ni
          O[b][o] += I[b][i] * W[o][i]   // 2D conv

[Figure: timelines of ifmap arrivals (0, 1, 2, ...) and ofmap production under the two loop orders]
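A toy Python sketch of the forwarding idea (the names and the per-fmap granularity are illustrative assumptions, not the hardware interface): a producer emits fmaps one at a time, and a consumer whose outer loop runs over ifmaps can start on fmap 0 before fmap 1 exists, buffering a single fmap at a time:

    from typing import Iterator
    import numpy as np

    def producer(n_ofmaps: int) -> Iterator[np.ndarray]:
        # Stand-in for the upstream layer: emits one ofmap per step.
        for o in range(n_ofmaps):
            yield np.full((4, 4), float(o))

    def consumer(ifmaps: Iterator[np.ndarray]) -> np.ndarray:
        # Outer loop over ifmaps: each forwarded fmap is consumed on
        # arrival, so only one fmap needs on-chip storage at a time.
        acc = np.zeros((4, 4))
        for fmap in ifmaps:
            acc += fmap                 # stand-in for per-ifmap partial sums
        return acc

    result = consumer(producer(3))      # starts after fmap 0, not after all 3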
Alternate Layer Loop Ordering (ALLO)
• Alternate the loop order from layer to layer, so that adjacent layers have matched access patterns (see the sketch below)
• Unoptimized: each layer buffers ALL fmaps and delays for ALL fmaps; optimized: ONE fmap of buffering and ONE fmap of delay
• The benefits apply to half of all layers
[Figure: unoptimized vs. optimized three-layer pipelines over time]
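A structural sketch of ALLO with illustrative names (the MAC bodies are elided): loop orders alternate layer by layer, so every other layer boundary has a producer emitting fmaps in exactly the order its consumer reads them:

    def ofmap_outer(n_out: int, n_in: int) -> None:
        # Finishes one ofmap at a time -> can forward each ofmap early.
        for o in range(n_out):
            for i in range(n_in):
                pass                    # MACs for (ofmap o, ifmap i)
            # ofmap o is complete here: forward it downstream immediately

    def ifmap_outer(n_out: int, n_in: int) -> None:
        # Retires one ifmap at a time -> consumes each forwarded fmap on
        # arrival, but its own ofmaps only complete at the very end.
        for i in range(n_in):
            for o in range(n_out):
                pass                    # MACs for (ofmap o, ifmap i)
            # ifmap i is no longer needed: its buffer slot can be reused

    # Layers 1, 3, 5, ... use ofmap_outer; layers 2, 4, 6, ... use
    # ifmap_outer. Each ofmap_outer -> ifmap_outer boundary gets
    # fine-grained forwarding, i.e., half of all layer boundaries.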
Layer Pipelining for Complex NN DAGs
• A dataflow tool explores pipeline schedules across multiple layers
• Subject to design rules imposed by data dependency constraints
  o E.g., no layer may have multiple predecessor layers on-chip (see the sketch below)
[Figure: on-chip region assignments (R0 to R6) for an Inception module, a ResNet module, and an LSTM cell]
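A hedged sketch of the kind of rule check such a tool might perform (a hypothetical helper, not the actual nn_dataflow API): a pipeline segment is a set of layers resident on-chip together, and the rule rejects segments in which some layer has more than one on-chip predecessor, since those predecessors' outputs would have to be live simultaneously:

    def obeys_single_predecessor_rule(segment: set, dag: dict) -> bool:
        """segment: set of layer names; dag: layer -> set of predecessors."""
        for layer in segment:
            if len(dag[layer] & segment) > 1:
                return False            # multiple predecessor layers on-chip
        return True

    # Example on a tiny diamond DAG (a -> b, a -> c, {b, c} -> d):
    dag = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
    assert obeys_single_predecessor_rule({"a", "b"}, dag)
    assert not obeys_single_predecessor_rule({"b", "c", "d"}, dag)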
Evaluation Results
Modeling Methodology
• State-of-the-art NNs
  o CNNs: AlexNet, VGGNet, GoogLeNet, ResNet
  o MLPs and LSTMs: medium and large scales
• Hardware
  o Per-tile inference engine: Eyeriss [ISCA'16], 8 × 8 PEs, 32 kB buffer, 500 MHz
  o Off-chip memory: LPDDR3-1600, 4 channels
  o Overall chip: 16 × 16 tiles, i.e., 16,384 PEs and 8 MB of SRAM, in 90 mm² at 28 nm
Overall Comparison
[Figure: normalized energy and runtime of AlexNet, VGGNet, GoogLeNet, ResNet, MLP-M/L, and LSTM-M/L for the monolithic, base tiled, and TANGRAM designs]
• Base tiled vs. monolithic: 3.6x better performance, but 7% worse energy
  o Less flexible and less efficient use of on-chip SRAM buffers
• TANGRAM: 2x over the base tiled design, and outperforms the monolithic design
Intra- vs. Inter-Layer Optimizations
[Figure: normalized energy of AlexNet, GoogLeNet, MLP-L, and LSTM-M for TANGRAM vs. TANGRAM without the intra-layer and without the inter-layer optimizations; the y-axis is clipped, with one bar reaching 4.59]
• Intra-layer: buffer sharing
  o AlexNet: fits large fmaps on-chip
  o MLP-L: enables weight pinning
• Inter-layer: ALLO + complex-DAG pipelining
  o Helps AlexNet, GoogLeNet, and LSTM-M
  o Linear NNs benefit less
Summary
• Efficiently scale NN acceleration
  o Coarse-grained parallel dataflow on tiled architectures
  o Optimized tiled architectures outperform monolithic engines
• TANGRAM: dataflow optimizations
  o Intra-layer buffer sharing
  o Inter-layer pipelining with fine-grained data forwarding
  o Pipelining of complex NN DAGs
• Dataflow scheduling tool open-sourced: https://github.com/stanford-mast/nn_dataflow

Thank you!