

  1. SPARSE PERSISTENT RNN Feiwen Zhu, 5/9/2017

  2. AGENDA: Motivation, Introduction, Algorithm, Naïve Implementation, Optimizations, Experiments, Conclusion

  3. MOTIVATION: Exploit sparsity for faster, larger networks
     - Recurrent Neural Networks (RNNs) are a powerful tool for sequence-based problems; a larger network is better, but slower.
     - "Persistent" RNNs are an efficient GPU implementation that minimizes weight bandwidth (Diamos et al., 2016), but the size of on-chip storage imposes a strict upper limit on the network size.
     - Model pruning yields significant reductions in RNN parameters (Narang et al., 2017); the network weights become sparse.

  4. INTRODUCTION
     - An efficient implementation for sparse RNNs, with the following optimizations: LDS.128, guided layout, and a Lamport barrier.
     - A cuSPARSE-based RNN is implemented as a competitor.
     - The final results show that with the sparse persistent RNN, on-chip storage can hold a larger network (a big sparse network fits on a small chip) and it is more efficient than existing methods for sparse RNNs.

  5. ALGORITHM: RNN
     [Figure: the folded structure of an RNN, one gemm<<<>>> launch per timestep; between launches the on-chip state is invalidated, and each launch boundary implicitly synchronizes the CTAs.]
     - RNN formula: h_t = τ(R · h_{t-1} + c_t); the parameters R are shared by all time steps.
     - In the folded structure, the weights R are reloaded at each timeslice, so global memory becomes the bottleneck.
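     To make the recurrence concrete, here is a minimal host-side reference of the formula above, assuming a tanh activation, row-major storage for R, and that c[t] already contains the input contribution at timestep t; all names are illustrative rather than taken from the deck. It makes explicit that the same R is reused at every timestep, which is why reloading it from global memory dominates.

        // Reference for h_t = tau(R * h_{t-1} + c_t); illustrative names only.
        #include <cmath>
        #include <vector>

        void rnn_reference(const std::vector<float>& R,                  // hiddenSize x hiddenSize, row-major
                           const std::vector<std::vector<float>>& c,     // c[t][i]: input term at timestep t
                           std::vector<float>& h,                        // h_0 on entry, h_T on exit
                           int hiddenSize, int timeSteps)
        {
            std::vector<float> hNext(hiddenSize);
            for (int t = 0; t < timeSteps; ++t) {                        // the same R is reused every timestep
                for (int row = 0; row < hiddenSize; ++row) {
                    float acc = c[t][row];
                    for (int col = 0; col < hiddenSize; ++col)
                        acc += R[row * hiddenSize + col] * h[col];
                    hNext[row] = std::tanh(acc);                         // tau assumed to be tanh
                }
                h.swap(hNext);
            }
        }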

  6. ALGORITHM: SPARSE PERSISTENT RNN
     [Figure: the folded structure of the sparse persistent RNN, a single Sparse_Persist_RNN<<<>>> launch covering all timesteps, with R kept on chip in a sparse format of <idx, val> pairs.]
     - Minimize bandwidth by keeping the parameters in on-chip storage.
     - Compress the parameters by storing only <column_index, value> pairs for the nonzeros.

  7. ALGORITHM: WORK DISTRIBUTION
     [Figure: the rows of the sparse matrix are distributed across CTA 0 … CTA N; all of one thread's nonzeros belong to one row.]
     - Force all rows to nz_per_row nonzeros by padding rows that have fewer than the maximum nnz with <index, 0.0f> pairs (a packing sketch follows below).
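     A hypothetical host-side sketch of this packing step (the function and type names are illustrative, not from the deck): each row's nonzeros are written as <column, value> pairs into a fixed-width buffer of nz_per_row slots, and rows with fewer nonzeros are padded with <0, 0.0f>, which multiplies to zero and leaves the result unchanged.

        // Pack a dense row-major R into hiddenSize rows of exactly nzPerRow <col, val> pairs.
        #include <vector>

        struct NzPair { int col; float val; };

        std::vector<NzPair> pack_rows(const std::vector<float>& R, int hiddenSize, int nzPerRow)
        {
            // nzPerRow is assumed to be the maximum nonzero count over all rows.
            std::vector<NzPair> packed(static_cast<size_t>(hiddenSize) * nzPerRow, NzPair{0, 0.0f});
            for (int row = 0; row < hiddenSize; ++row) {
                int k = 0;
                for (int col = 0; col < hiddenSize && k < nzPerRow; ++col) {
                    float v = R[row * hiddenSize + col];
                    if (v != 0.0f) packed[row * nzPerRow + k++] = NzPair{col, v};
                }
                // Remaining slots keep the <0, 0.0f> padding and contribute nothing.
            }
            return packed;
        }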

  8. NAÏVE IMPLEMENTATION
     Pseudocode of the naïve sparse persistent RNN (a CUDA sketch follows below):
        Load weights into RF
        For each timeslice:
           For each mini-batch B_i:
              Load activations into aShMem
              For each <col_idx, value> in this thread:
                 Compute the accumulator: acc += value * aShMem[col_idx]
              Reduce the accumulators across the CTA
              Store the output
           Synchronize across the grid
     Performance is limited by shared memory: there are a large number of bank conflicts.
     - Each nonzero costs one FMA op and one shared-memory load.
     - Within a warp, the 32 col_idx values may all point to the same bank.
     - Math throughput vs. ShMem throughput = 4:1.
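     A minimal CUDA sketch of this pseudocode, under several simplifying assumptions that are mine rather than the deck's: each thread owns one whole row (so the CTA-level reduction disappears), the grid covers exactly hiddenSize rows, the input array already holds the c_t terms, and the grid-wide synchronization between timesteps is only marked with a comment. All kernel and parameter names are illustrative.

        // Naïve sparse persistent RNN, one thread per row; NZ_PER_THREAD pairs per row.
        constexpr int NZ_PER_THREAD = 128;

        struct NzPair { int col; float val; };

        __global__ void sparse_persistent_rnn_naive(const NzPair* __restrict__ packedR,
                                                    const float* __restrict__ input,   // c_t, laid out [t][b][hidden]
                                                    float* h0, float* h1,              // ping-pong activation buffers
                                                    int hiddenSize, int timeSteps, int batch)
        {
            NzPair w[NZ_PER_THREAD];                           // this thread's row, loaded once ("persistent")
            int row = blockIdx.x * blockDim.x + threadIdx.x;   // assumes grid size == hiddenSize
            for (int k = 0; k < NZ_PER_THREAD; ++k)
                w[k] = packedR[row * NZ_PER_THREAD + k];

            extern __shared__ float aShMem[];                  // activations of the current sample
            float* hIn = h0;
            float* hOut = h1;

            for (int t = 0; t < timeSteps; ++t) {
                for (int b = 0; b < batch; ++b) {
                    for (int i = threadIdx.x; i < hiddenSize; i += blockDim.x)
                        aShMem[i] = hIn[b * hiddenSize + i];   // load activations into aShMem
                    __syncthreads();

                    float acc = input[(t * batch + b) * hiddenSize + row];
                    for (int k = 0; k < NZ_PER_THREAD; ++k)    // one shared load + one FMA per pair;
                        acc += w[k].val * aShMem[w[k].col];    // the shared loads are where bank conflicts occur

                    hOut[b * hiddenSize + row] = tanhf(acc);   // store the output
                    __syncthreads();                           // aShMem is reused by the next sample
                }
                // A grid-wide synchronization is required here before the buffers are swapped;
                // it is omitted in this sketch (the Lamport barrier slide addresses it).
                float* tmp = hIn; hIn = hOut; hOut = tmp;
            }
        }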

  9. SHARED MEMORY
     [Figure: a warp reads one <index, value> pair from each of 32 rows; in an arbitrary weight layout, several column indices (e.g. <163, 0.1>, <35, 0.2>, <67, 0.2>) map to the same bank and conflict, while an ideal layout has no conflicts. Bank N means column_index % 32 == N.]
     - In the CUDA programming model many threads access ShMem, so ShMem is divided into 32 banks to achieve high bandwidth.
     - Each bank can service one ShMem address per cycle, so the speed-of-light is 32 addresses per cycle.
     - Multiple column indices pointing to the same bank cause bank conflicts, and the conflicting accesses are serialized: 32 addresses every 32 clocks (32-way conflict) vs. 32 addresses every clock (no conflict).
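     As a small illustration (not from the deck), the replay cost of one warp's shared-memory access can be estimated on the host by bucketing the 32 column indices by bank; the warp needs as many cycles as the most-loaded bank has requests. This simplification ignores the broadcast case in which several lanes read the exact same address.

        // Worst-case replay factor for one warp's 32 column indices; bank = col % 32.
        #include <algorithm>

        int warp_replay_factor(const int colIdx[32])
        {
            int perBank[32] = {0};
            for (int lane = 0; lane < 32; ++lane)
                ++perBank[colIdx[lane] % 32];                    // each bank serves one address per cycle
            return *std::max_element(perBank, perBank + 32);     // 1 = conflict-free, 32 = fully serialized
        }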

  10. OPTIMIZATION

  11. GUIDED LAYOUT OPTIMIZATION
     [Figure: a 32-row warp tile of nz_per_row slots colored with 32 colors; e.g. the nonzero <34, 0.2> (bank 2) can be stored to another color's location if Color(2)'s locations in its row are full. Bank N means column_index % 32 == N.]
     - Each row's slots are colored with 32 colors corresponding to the 32 ShMem banks: if location <rowX, columnY> belongs to color M (preferring nonzeros whose index mod 32 equals M), then locations <rowX+1, columnY> and <rowX, columnY+1> belong to color (M+1).
     - Reordering the nonzero locations within each row does not affect the final result, only the access sequence.
     - The reordering is paid for once, and its benefit lasts for the whole lifetime of the network.
     - Greedy algorithm: a nonzero that requires bank M (column_index % 32 == M) is stored to one of Color(M)'s locations first, unless Color(M)'s locations are full (see the sketch below).
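     A hypothetical host-side sketch of the greedy placement for a single row, assuming the slot coloring implied by the rule above, color(row, c) = (row + c) mod 32; the function and variable names are mine. When every row of a 32-row warp tile respects its colors, the 32 pairs read in the same iteration fall into 32 distinct banks.

        // Reorder one row's <col, val> pairs so each pair prefers a slot whose color
        // (row + slot) % 32 matches its bank col % 32; spill to any free slot otherwise.
        #include <vector>

        struct NzPair { int col; float val; };

        void guided_reorder_row(std::vector<NzPair>& rowPairs, int row)
        {
            int nzPerRow = static_cast<int>(rowPairs.size());
            std::vector<NzPair> out(nzPerRow, NzPair{0, 0.0f});
            std::vector<bool> used(nzPerRow, false);
            std::vector<NzPair> spilled;

            for (const NzPair& p : rowPairs) {                   // first pass: matching-color slots
                bool placed = false;
                for (int c = 0; c < nzPerRow && !placed; ++c) {
                    if (!used[c] && (row + c) % 32 == p.col % 32) {
                        out[c] = p; used[c] = true; placed = true;
                    }
                }
                if (!placed) spilled.push_back(p);
            }
            int c = 0;
            for (const NzPair& p : spilled) {                    // second pass: any free slot
                while (used[c]) ++c;
                out[c] = p; used[c] = true;
            }
            rowPairs.swap(out);
        }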

  12. LDS.128 OPTIMIZATION
     [Figure: with minibatch = 1 (left), 32 threads each issue an LDS.32 for one sample, and a warp can hit many-way bank conflicts (3 conflicts in the pictured rows); with minibatch = 4 (right), 8 threads each issue an LDS.128 that fetches all four samples at once (2 conflicts in the pictured rows). Column indices run over the hidden dimension, up to 1792.]
     - With LDS.128, the maximum number of bank conflicts per instruction becomes 8 instead of 32.
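     A device-side sketch of the vectorized inner loop, assuming the shared activations are repacked as one float4 per hidden index (the four samples of the minibatch side by side) and reusing the NzPair layout from the earlier sketches; names are illustrative. One LDS.128 then feeds four FMAs, and, as the slide notes, the maximum bank-conflict count per instruction becomes 8 instead of 32.

        // Minibatch-4 accumulation: aShMem4[i] packs samples 0..3 at hidden index i.
        struct NzPair { int col; float val; };

        __device__ void accumulate_minibatch4(const float4* __restrict__ aShMem4,
                                              const NzPair* w, int nzPerThread,
                                              float4& acc)
        {
            for (int k = 0; k < nzPerThread; ++k) {
                float4 a = aShMem4[w[k].col];   // one LDS.128 fetches all four samples
                float  v = w[k].val;
                acc.x += v * a.x;               // sample 0
                acc.y += v * a.y;               // sample 1
                acc.z += v * a.z;               // sample 2
                acc.w += v * a.w;               // sample 3
            }
        }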

  13. LAMPORT BARRIER
     Naïve sparse persistent RNN:
        Load weights into RF
        For each timeslice:
           For each mini-batch B_i:
              Load activations into aShMem      (read-to-use dependency)
              Compute the accumulators
              Reduce the accumulators across the CTA
              Store the output
           Synchronize across the grid
     Software pipelining / Lamport barrier + LDS.128 (for LDS.128, mini-batch = 4):
        Load weights into RF
        For each timeslice:
           For each mini-batch B_i:
              Load mini-batch B_{i+1} into RF   (overlap the load with the compute)
              Compute the accumulators for B_i
              Lamport check & store B_{i+1} to aShMem
              Reduce the accumulators across the CTA
              Store the output
           Synchronize across the grid
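     The deck names the Lamport check but does not spell out its mechanics, so the following is only one common way such a check is realized (an assumption, not necessarily the author's exact scheme): the destination buffer is pre-filled with a sentinel value such as NaN, and a consumer simply reloads until the value is no longer the sentinel, which tells it the producer's store has landed without needing a full grid barrier.

        // Hypothetical sentinel-based Lamport check; the producer side is assumed to
        // reset the buffer to NaN before the timestep and then write real values.
        #include <math.h>

        __device__ float lamport_load(volatile const float* addr)
        {
            float v;
            do {
                v = *addr;          // volatile read: always goes back to memory
            } while (isnan(v));     // spin until the producer's value has landed
            return v;
        }

     Checked this way, the prefetched B_{i+1} activations can be validated and copied into aShMem while the accumulators for B_i are still in flight, overlapping the load latency with compute as the slide's pipeline shows.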

  14. EVALUATE OPTIMIZATIONS (hidden size 1152, batch size 8, 256 timesteps, 23% nonzeros)
     [Chart: speedup vs. cublasRNN and shared-memory bank-conflict count for the naïve kernel and with +LDS.128, +Guided Layout, and +Lamport Barrier applied cumulatively; the speedup grows to 3.36×, while the bank conflicts drop from 61,604 to 11,009, an 82% reduction.]

  15. EXPERIMENTS

  16. EXPERIMENTS
     - Environment: Tesla P100-SXM2-16GB, cuDNN 6.0, CUDA 8.0, cuSPARSE 8.0.
     - RNN implementations: cublasRNN (cudnnRNN, the baseline), persistentRNN (cudnnPersistRNN), cuSparseRNN, sparsePersistentRNN.
     - Test cases: vary the batch size; vary the density; vary the hidden size; fix the parameter count while varying hidden size and density.

  17. SOME PRUNING CONTEXT
     10% density is a reasonable target for recurrent networks (Han et al., 2017).

  18. EFFECT OF VARYING BATCH SIZE (hidden layer size = 1792, density = 10%, 256 timesteps, batch size 4-64)
     At this size the dense persistent RNN can still run.
     [Chart: speedup vs. cublasRNN for persistRNN, cusparseRNN, and sparsePersistRNN at batch sizes 4, 8, 16, 32, and 64, plus sparsePersistRNN's shared-memory utilization (roughly 50% at the smaller batch sizes); sparsePersistRNN peaks at 10.46× at batch size 4.]

  19. MAXIMUM SIZE FOR DENSE PERSISTENT RNN (hidden layer size = 1792, batch size = 4, 256 timesteps, density 1%-30%)
     [Chart: speedup vs. cublasRNN for persistRNN, cusparseRNN, and sparsePersistRNN at 1% (28 SMs used by sparsePersistRNN), 5%, 10%, 20%, and 30% density (56 SMs each), plus sparsePersistRNN's shared-memory utilization; sparsePersistRNN reaches 13.42× at 1% density and 10.46× at 10%.]

  20. TOO LARGE FOR DENSE PERSISTENT KERNELS (hidden layer size = 2304, batch size = 4, 256 timesteps, density 1%-30%)
     At this size the dense persistent RNN can no longer run.
     [Chart: speedup vs. cublasRNN for cusparseRNN and sparsePersistRNN at 1%, 5%, 10%, 20% (36 SMs used by sparsePersistRNN), and 30% density (18 SMs), plus sparsePersistRNN's shared-memory utilization; the recovered speedup values are 14.78, 11.20, 8.28, 1.98, and 0.53.]

  21. WAY TOO LARGE FOR DENSE PERSISTENT KERNELS (hidden layer size = 5760, batch size = 2, 256 timesteps, density 1%-10%)
     cuSparseRNN outperforms the others at 5% density.
     [Chart: speedup vs. cublasRNN for cusparseRNN and sparsePersistRNN at 1% (45 SMs used by sparsePersistRNN), 5% (45 SMs), and 10% density, plus sparsePersistRNN's shared-memory utilization; sparsePersistRNN reaches 41.17× at 1% density.]

  22. SMALL DENSE VS LARGE SPARSE
     - A larger network with the same number of nonzero parameters will outperform the smaller, dense network (Baidu, ICLR 2017).
     - Different gates can be pruned to different levels of sparsity (See et al., 2016).
     [Chart: error rate vs. number of parameters for dense and sparse Deep Speech 2 model options; the sparse models reach lower error at the same parameter count.]
