SPARSE PERSISTENT RNN
Feiwen Zhu, 5/9/2017
2
AGENDA
Motivation
Introduction
Algorithm
Naïve Implementation
Optimizations
Experiments
Conclusion
3
MOTIVATION
Recurrent Neural Networks (RNNs) are a powerful tool!
Sequence-based problems
Larger networks are better, but slower
“Persistent” RNNs
Efficient GPU implementation, minimizes weight bandwidth (Diamos et al. (2016))
Size of on-chip storage imposes a strict upper limit on the network size
Model pruning
Significant reductions in RNN parameters (Narang et al. (2017))
Network weights become sparse
Exploit sparsity for faster, larger networks
4
INTRODUCTION
An efficient implementation for sparse RNNs, including the following optimization methods:

LDS.128 optimization
Guided layout optimization
Lamport barrier

A cuSPARSE-based RNN is implemented as a competitor. The final results show that, by using the sparse persistent RNN:

On-chip storage can hold a larger network
A big sparse network can fit on a small chip
It is more efficient than existing methods for sparse RNNs
5
ALGORITHM

RNN

The weight matrix R is reloaded at each timeslice, so global memory becomes the bottleneck.
Parameters are shared by all time steps.
RNN formula: $h_t^g = \tau(R\, h_{t-1}^g + c_t^v)$, where $h_t^g$ is the hidden state at timestep $t$, $R$ is the recurrent weight matrix, and $c_t^v$ is the input contribution for timestep $t$.
[Figure: fold structure of the RNN. Each timestep launches its own GEMM kernel (gemm<<<>>>): $c_t^v$ and the previous hidden state $h_{t-1}^g$ feed a GEMM against R to produce $h_t^g$. The per-timestep launches invalidate on-chip state and synchronize CTAs implicitly, and every launch re-reads R.]
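To make the per-timestep structure concrete, here is a minimal host-side sketch of the conventional GEMM-per-timestep formulation shown in the figure (a sketch only, not the deck's code: d_R, d_h, d_c and the tanh_activation kernel are illustrative names, and τ is taken to be tanh). The point is that cublasSgemm streams the whole recurrent matrix R from global memory again on every timestep.

    #include <cublas_v2.h>

    // Illustrative elementwise activation kernel (an assumption, not a cuBLAS/cuDNN call).
    __global__ void tanh_activation(float* out, const float* in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = tanhf(in[i]);
    }

    void dense_rnn_forward(cublasHandle_t handle,
                           const float* d_R,   // hiddenSize x hiddenSize recurrent weights (column-major)
                           float** d_h,        // d_h[t]: hiddenSize x batch hidden state h_t^g
                           float** d_c,        // d_c[t]: hiddenSize x batch input projection c_t^v
                           int hiddenSize, int batch, int timesteps)
    {
        const float one = 1.0f;
        for (int t = 1; t <= timesteps; ++t) {
            // d_c[t] += R * d_h[t-1]; this GEMM re-reads all of R from global memory every timestep.
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                        hiddenSize, batch, hiddenSize,
                        &one, d_R, hiddenSize,
                        d_h[t - 1], hiddenSize,
                        &one, d_c[t], hiddenSize);
            // h_t = tau(R h_{t-1} + c_t), with tau = tanh here.
            int n = hiddenSize * batch;
            tanh_activation<<<(n + 255) / 256, 256>>>(d_h[t], d_c[t], n);
        }
    }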
7
ALGORITHM

Sparse Persistent RNN

Minimize bandwidth by keeping the parameters in on-chip storage only.
Compress the parameters by using <column_index, value> pairs.
[Figure: fold structure of the sparse persistent RNN. A single persistent kernel, Sparse_Persist_RNN<<<>>>, processes all timesteps; R is kept on chip in sparse format as <idx, val> pairs, and each timestep consumes $c_t^v$ and $h_{t-1}^g$ to produce $h_t^g$.]
8
ALGORITHM: WORK DISTRIBUTION
Sparse Persistent RNN
[Figure: the rows of R are distributed across CTA 0, CTA 1, CTA 2, …, CTA N]

Force all rows to nz_per_row nonzeros by padding rows that have fewer than the maximum nnz with <index, 0.0f> pairs.
All of one thread's nonzeros belong to a single row.
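A minimal host-side sketch of the padded format described above (not the authors' code; the struct and function names are illustrative). Each row is forced to exactly nz_per_row <column_index, value> pairs by appending <index, 0.0f> padding, so a fixed-length slice of one row can be handed to each thread.

    #include <algorithm>
    #include <utility>
    #include <vector>

    struct PaddedSparse {
        int nz_per_row;                 // maximum nonzeros over all rows
        std::vector<int>   col_idx;     // rows * nz_per_row column indices
        std::vector<float> val;         // rows * nz_per_row values (0.0f where padded)
    };

    PaddedSparse pad_rows(const std::vector<std::vector<std::pair<int, float>>>& rows)
    {
        PaddedSparse out;
        out.nz_per_row = 0;
        for (const auto& r : rows)
            out.nz_per_row = std::max(out.nz_per_row, (int)r.size());

        for (const auto& r : rows) {
            for (const auto& p : r) {                        // real nonzeros
                out.col_idx.push_back(p.first);
                out.val.push_back(p.second);
            }
            for (int i = (int)r.size(); i < out.nz_per_row; ++i) {
                out.col_idx.push_back(0);                    // padded <index, 0.0f> pair:
                out.val.push_back(0.0f);                     // the 0.0f value makes its FMA a no-op
            }
        }
        return out;
    }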
9
NAÏVE IMPLEMENTATION
Load weights into RF
For each timeslice:
    For each mini-batch Bi:
        Load activations into aShMem
        Compute the accumulator:
            foreach <col_idx, value> in this thread:
                acc += value * aShMem[col_idx]
        Reduce the accumulators across the CTA
        Store the output
    Synchronize across the Grid
pseudocode of Naïve implementation of Sparse Persistent RNN
One FMA op and one shared-memory load per nonzero
Math throughput vs. ShMem throughput = 4:1
PERF IS LIMITED BY SHARED MEMORY
In a warp, all 32 col_idx values may point to the same bank
A LARGE NUMBER OF BANK CONFLICTS
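A minimal device-side sketch of the accumulation step in the pseudocode above (assumed names: each thread's <col_idx, value> pairs live in registers/local arrays, and aShMem holds the current sample's activations in shared memory). It makes the bottleneck visible: every FMA needs one shared-memory load, and nothing constrains which bank each col_idx hits.

    __device__ float accumulate_row(const int*   col_idx,       // this thread's column indices
                                    const float* value,         // this thread's weight values
                                    int          nz_per_thread,
                                    const float* aShMem)        // activations for the current sample (shared memory)
    {
        float acc = 0.0f;
        for (int i = 0; i < nz_per_thread; ++i) {
            // One shared-memory load + one FMA per nonzero; with math : ShMem throughput at ~4 : 1,
            // the loop is limited by shared memory, and bank conflicts make it worse.
            acc += value[i] * aShMem[col_idx[i]];
        }
        return acc;
    }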
10
SHARED MEMORY

In the CUDA programming model, many threads access ShMem.
ShMem is divided into 32 banks to achieve high bandwidth.
Each bank can service one ShMem address per cycle, so the SOL (speed-of-light rate) is 32 addresses/cycle.
Multiple column_index values pointing to the same bank cause a bank conflict, and conflicting accesses are serialized: 32 addresses every 32 clocks (32-way bank conflict) vs. 32 addresses every clock (no bank conflict).
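As a small illustration of the bank rule used below (standard CUDA shared-memory behavior, not code from the slides): with 4-byte banks, element aShMem[col_idx] lives in bank col_idx % 32, which is why the layouts label a nonzero's bank as column_index % 32.

    __device__ __forceinline__ int bank_of(int col_idx)
    {
        // 32 banks, 4 bytes wide: consecutive float elements of aShMem cycle through banks 0..31,
        // so the bank touched by aShMem[col_idx] is simply col_idx % 32.
        return col_idx % 32;
    }
    // If all 32 loads of a warp land in one bank they are serviced one per cycle (serialized);
    // if they land in 32 distinct banks they complete together.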
[Figure: weight layout without guided reordering vs. the ideal layout (no bank conflicts). BankN means column_index % 32 = N. A warp accesses one <index, value> pair from each of 32 rows, with nz_per_row pairs per row; the unguided layout shows several bank conflicts per access, while the ideal layout shows none.]
11
OPTIMIZATION
12
GUIDED LAYOUT OPTIMIZATION
Guided layout (fewer bank conflicts)

Reordering the nonzero locations within each row doesn't affect the final result, but it does change the access sequence.
Reordering nonzeros costs once; the benefit lasts for the whole lifetime of the network.
Each row is colored with 32 colors corresponding to the 32 ShMem banks: if location <rowX, columnY> belongs to color M (which preferentially stores nonzeros whose column_index mod 32 equals M), then locations <rowX+1, columnY> and <rowX, columnY+1> belong to color (M+1).
Greedy algorithm: a nonzero that needs Bank M (column_index % 32 == M) is stored in a Color(M) location first, unless Color(M)'s locations are full. For example, <34, 0.2> (34 maps to Bank 2) can be stored in another color's location if Color(2)'s locations are full.

[Figure: a 32-row × nz_per_row weight tile with diagonally assigned colors Color(0), Color(1), Color(2), …; BankN means column_index % 32 = N, and one warp accesses one location per row.]
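A host-side sketch of the greedy recoloring, under my reading of the rule above that slot <r, c> of a 32-row tile has color (r + c) % 32 (names and data layout are illustrative, not the authors' implementation). A nonzero whose column_index % 32 == M goes to a free Color(M) slot first and spills to any free slot only when Color(M) is full; a warp then reads slot c of 32 consecutive rows, so matching colors put the 32 accesses in 32 distinct banks.

    #include <utility>
    #include <vector>

    // tile: 32 rows, each already padded to nz_per_row <column_index, value> pairs.
    void guided_reorder_tile(std::vector<std::vector<std::pair<int, float>>>& tile,
                             int nz_per_row)
    {
        for (int r = 0; r < 32; ++r) {
            std::vector<std::pair<int, float>> src = tile[r];
            std::vector<std::pair<int, float>> dst(nz_per_row, {0, 0.0f});
            std::vector<bool> used(nz_per_row, false);

            auto place = [&](const std::pair<int, float>& nz) {
                int want = nz.first % 32;                      // desired bank == desired color
                for (int c = 0; c < nz_per_row; ++c)           // pass 1: a free slot of the wanted color
                    if (!used[c] && (r + c) % 32 == want) { dst[c] = nz; used[c] = true; return; }
                for (int c = 0; c < nz_per_row; ++c)           // pass 2: wanted color full, spill anywhere
                    if (!used[c]) { dst[c] = nz; used[c] = true; return; }
            };
            for (const auto& nz : src) place(nz);
            tile[r] = dst;
        }
    }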
13
LDS.128 OPTIMIZATION
[Figure: activation loads with mini-batch = 4 vs. mini-batch = 1. With mini-batch = 4, 8 threads each issue an LDS.128 that fetches the four samples of one column; with mini-batch = 1, 32 threads each issue an LDS.32 that fetches one sample. The example <col_idx, value> tiles span rows #0–#31 and are annotated with their bank-conflict counts (3 and 2).]
Max bank conflict# per instruction becomes 8 instead of 32
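A device-side sketch of the LDS.128 idea under an assumed activation layout: the four samples of a mini-batch of 4 are stored in four consecutive, 16-byte-aligned floats per column of aShMem (fma4, aShMem, and col_idx are illustrative names). One float4 shared-memory load (one LDS.128) then fetches all four samples of a column, so a warp presents at most 8 distinct column indices per load instruction instead of 32, matching the bound above.

    __device__ void fma4(float4& acc, float w, const float* aShMem, int col_idx)
    {
        // One 16-byte vector load (LDS.128) fetches samples 0..3 of column col_idx;
        // aShMem must store the four samples of each column contiguously and 16-byte aligned.
        float4 a = *reinterpret_cast<const float4*>(aShMem + 4 * col_idx);
        acc.x += w * a.x;   // sample 0
        acc.y += w * a.y;   // sample 1
        acc.z += w * a.z;   // sample 2
        acc.w += w * a.w;   // sample 3
    }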
14
LAMPORT BARRIER
Software pipelining / Lamport barrier + LDS.128 (with LDS.128, mini-batch = 4):

    Load weights into RF
    For each timeslice:
        For each mini-batch Bi:
            Load mini-batch Bi+1 into RF
            Compute the accumulators for Bi
            Lamport check & store Bi+1 to aShMem
            Reduce the accumulators across the CTA
            Store the output
        Synchronize across the Grid

Naïve sparse persistent RNN, for comparison:

    Load weights into RF
    For each timeslice:
        For each mini-batch Bi:
            Load activations into aShMem
            Compute the accumulator
            Reduce the accumulators across the CTA
            Store the output
        Synchronize across the Grid
The naïve version has a read-to-use dependency between loading the activations and computing on them; the pipelined version overlaps the load with compute.
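A sketch of the Lamport-style check (an illustration of the general flag/sentinel technique, not the authors' kernel; the sentinel value and names are assumptions). The activation buffer for the next mini-batch is initialized to a known "not written yet" bit pattern, so a consumer can issue the load early, overlap it with compute, and spin only if the value it already loaded still holds the sentinel when it is needed.

    // Illustrative sentinel: a NaN bit pattern the buffer is initialized to before the step.
    __device__ const int ACT_NOT_READY = 0x7fffffff;

    __device__ float load_early(volatile const float* g_act, int idx)
    {
        return g_act[idx];                        // issued before the compute it overlaps with
    }

    __device__ float lamport_check(volatile const float* g_act, int idx, float v)
    {
        // If the early load still holds the sentinel, the producer's store has not landed yet:
        // re-load until it has. This is the only case that stalls.
        while (__float_as_int(v) == ACT_NOT_READY)
            v = g_act[idx];
        return v;
    }

Checking the value itself replaces an explicit barrier before the store to aShMem, which is what allows the load of Bi+1 to overlap the compute of Bi.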
15
EVALUATE OPTIMIZATIONS
Hidden size 1152, batch size 8, 256 timesteps, 23% nonzeros
Bank conflicts reduced by 82%
[Chart: speedup vs. cublasRNN and bank conflict count as optimizations are added (Naïve, +LDS.128, +Guided Layout, +Lamport Barrier). Bank conflicts fall from 61,604 to 51,571 to 11,009 and stay at 11,009; the recoverable speedup values are 2.26, 3.10, and 3.36.]
16
EXPERIMENTS
17
EXPERIMENTS
Environment: Tesla P100-SXM2-16GB, cuDNN 6.0, CUDA 8.0, cuSPARSE 8.0
RNN implementations: cublasRNN (cudnnRNN), the baseline; persistentRNN (cudnnPersistRNN); cuSparseRNN; sparsePersistentRNN
Test cases: vary batch size; vary density; vary hidden size; fix the parameter count while varying hidden size and density
18
SOME PRUNING CONTEXT
10% density is a reasonable target for recurrent networks
Han et al. 2017
19
EFFECT OF VARYING BATCH SIZE
Hidden layer size = 1792, density = 10%, 256 timesteps, varying batch size. The dense persistent RNN can still run at this size.
[Chart: speedup vs. cublasRNN for persistRNN, cusparseRNN, and sparsePersistRNN at batch sizes 4, 8, 16, 32, 64; ShMem utilization 12.3%, 49.4%, 49.8%, 50.4%, 49.8%; plotted speedups include 6.29, 1.81, 10.46, 6.72, 3.89, 8.04, 1.85.]
20
MAXIMUM SIZE FOR DENSE PERSISTENT RNN
Hidden layer size = 1792, batch size = 4, 256 timesteps, varying density
[Chart: speedup vs. cublasRNN at densities 1% (28 SMs), 5% (56 SMs), 10% (56 SMs), 20% (56 SMs), 30% (56 SMs) for persistRNN, cusparseRNN, and sparsePersistRNN; ShMem utilization 12.3%, 29.7%, 45.7%, 50.3%, 55.0%; plotted speedups include 6.20, 6.29, 13.42, 12.08, 10.46, 7.80, 6.56.]
21
TOO LARGE FOR DENSE PERSISTENT KERNELS
Hidden layer size = 2304, batch size = 4, time steps = 256, varying density. The dense persistent RNN cannot run at this size.
[Chart: speedup vs. cublasRNN at densities 1% (36 SMs), 5% (36 SMs), 10% (36 SMs), 20% (36 SMs), 30% (18 SMs); ShMem utilization 15.7%, 30.6%, 36.4%, 16.3%, 8.3%; plotted speedups include 14.78, 11.20, 8.28, 1.98, 0.53.]
22
WAY TOO LARGE FOR DENSE PERSISTENT KERNELS
Hidden layer size = 5760, batch size = 2, time steps = 256, varying density. The cuSPARSE RNN outperforms the others at 5% density.
[Chart: speedup vs. cublasRNN at densities 1% (45 SMs), 5% (45 SMs), and 10%; ShMem utilization values 19.9% and 10.0%; plotted speedups include 4.27, 3.16, 41.17, 2.92.]
23
SMALL DENSE VS LARGE SPARSE
[Chart: error rate vs. number of parameters for dense and sparse Deep Speech 2 model options]
A larger network with the same number of nonzero parameters will outperform the smaller, dense network. (Baidu, ICLR 2017)
Different gates can be pruned to different levels of sparsity. (See et al., 2016)
24
EFFECTIVE NETWORK SIZE
The parameter count is fixed at 1.3M; larger hidden sizes are pruned to the same nnz:
2304 at 25% density, 3456 at 11% density, 4608 at 6.25% density, 5760 at 4% density
[Chart: runtime (ms) of cudnnRNN vs. sparsePersistRNN for hidden size 2304 at 25% (36 SMs), 3456 at 11% (54 SMs), 4608 at 6.25% (36 SMs), 5760 at 4% (45 SMs); plotted times include 16.52, 10.69, 19.44, 19.73 ms.]
25
RNN TOOLS
CUBLAS RNN
- General tool

PERSISTENT RNN
- Best tool when the weights can be resident in on-chip storage

CUSPARSE RNN
- General sparse tool
- It can outperform the others in some cases

SPARSE PERSISTENT RNN
- Best sparse tool when the nonzeros can be resident in on-chip storage
- Can handle larger networks than the persistent RNN
- Up to 10X speedup on 10%-dense cases
- A big sparse network can fit on a small chip
27
THANKS!
Jeff Pool, Jeremy Appleyard, Fung Xie, Michael Andersch
My e-mail: mzhu@nvidia.com