SPARSE PERSISTENT RNN
Feiwen Zhu, 5/9/2017
2
AGENDA
Motivation
Introduction
Algorithm
Naïve Implementation
Optimizations
Experiments
Conclusion
3
MOTIVATION
Recurrent Neural Networks (RNNs) are a powerful tool!
Sequence-based problems
Larger networks are better, but slower
“Persistent” RNNs
Efficient GPU implementation, minimizes weight bandwidth (Diamos et al. (2016))
Size of on-chip storage imposes a strict upper limit on the network size
Model pruning
Significant reductions in RNN parameters (Narang et al. (2017))
Network weights become sparse
Exploit sparsity for faster, larger networks
4
INTRODUCTION
An efficient implementation for sparse RNNs, including the following optimization methods:

LDS.128 optimization
Guided layout optimization
Lamport barrier

A cuSPARSE-based RNN is implemented as a competitor. The final results show that, by using the sparse persistent RNN:

On-chip storage can hold a larger network
A big sparse network can fit on a small chip
It is more efficient than existing methods for sparse RNNs
5
ALGORITHM

RNN

The weight matrix R is reloaded at each timeslice, so global memory becomes the bottleneck.
Parameters are shared by all time steps.
RNN formula: $h_t^g = \tau(R\, h_{t-1}^g + c_t^v)$, where $h_t^g$ is the hidden state at timestep $t$, $R$ is the recurrent weight matrix, and $c_t^v$ is the input contribution for timestep $t$.
[Figure: fold structure of the RNN. Each timestep launches its own GEMM kernel (gemm<<<>>>): $c_t^v$ and the previous hidden state $h_{t-1}^g$ feed a GEMM against R to produce $h_t^g$. The per-timestep launches invalidate on-chip state and synchronize CTAs implicitly, and every launch re-reads R.]
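To make the per-timestep structure concrete, here is a minimal host-side sketch of the conventional GEMM-per-timestep formulation shown in the figure (a sketch only, not the deck's code: d_R, d_h, d_c and the tanh_activation kernel are illustrative names, and τ is taken to be tanh). The point is that cublasSgemm streams the whole recurrent matrix R from global memory again on every timestep.

    #include <cublas_v2.h>

    // Illustrative elementwise activation kernel (an assumption, not a cuBLAS/cuDNN call).
    __global__ void tanh_activation(float* out, const float* in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = tanhf(in[i]);
    }

    void dense_rnn_forward(cublasHandle_t handle,
                           const float* d_R,   // hiddenSize x hiddenSize recurrent weights (column-major)
                           float** d_h,        // d_h[t]: hiddenSize x batch hidden state h_t^g
                           float** d_c,        // d_c[t]: hiddenSize x batch input projection c_t^v
                           int hiddenSize, int batch, int timesteps)
    {
        const float one = 1.0f;
        for (int t = 1; t <= timesteps; ++t) {
            // d_c[t] += R * d_h[t-1]; this GEMM re-reads all of R from global memory every timestep.
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                        hiddenSize, batch, hiddenSize,
                        &one, d_R, hiddenSize,
                        d_h[t - 1], hiddenSize,
                        &one, d_c[t], hiddenSize);
            // h_t = tau(R h_{t-1} + c_t), with tau = tanh here.
            int n = hiddenSize * batch;
            tanh_activation<<<(n + 255) / 256, 256>>>(d_h[t], d_c[t], n);
        }
    }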
7
ALGORITHM

Sparse Persistent RNN

Minimize bandwidth by keeping the parameters in on-chip storage only.
Compress the parameters by using <column_index, value> pairs.
[Figure: fold structure of the sparse persistent RNN. A single persistent kernel, Sparse_Persist_RNN<<<>>>, processes all timesteps; R is kept on chip in sparse format as <idx, val> pairs, and each timestep consumes $c_t^v$ and $h_{t-1}^g$ to produce $h_t^g$.]
8
ALGORITHM: WORK DISTRIBUTION
Sparse Persistent RNN
[Figure: the rows of R are distributed across CTA 0, CTA 1, CTA 2, …, CTA N]

Force all rows to nz_per_row nonzeros by padding rows that have fewer than the maximum nnz with <index, 0.0f> pairs.
All of one thread's nonzeros belong to a single row.
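A minimal host-side sketch of the padded format described above (not the authors' code; the struct and function names are illustrative). Each row is forced to exactly nz_per_row <column_index, value> pairs by appending <index, 0.0f> padding, so a fixed-length slice of one row can be handed to each thread.

    #include <algorithm>
    #include <utility>
    #include <vector>

    struct PaddedSparse {
        int nz_per_row;                 // maximum nonzeros over all rows
        std::vector<int>   col_idx;     // rows * nz_per_row column indices
        std::vector<float> val;         // rows * nz_per_row values (0.0f where padded)
    };

    PaddedSparse pad_rows(const std::vector<std::vector<std::pair<int, float>>>& rows)
    {
        PaddedSparse out;
        out.nz_per_row = 0;
        for (const auto& r : rows)
            out.nz_per_row = std::max(out.nz_per_row, (int)r.size());

        for (const auto& r : rows) {
            for (const auto& p : r) {                        // real nonzeros
                out.col_idx.push_back(p.first);
                out.val.push_back(p.second);
            }
            for (int i = (int)r.size(); i < out.nz_per_row; ++i) {
                out.col_idx.push_back(0);                    // padded <index, 0.0f> pair:
                out.val.push_back(0.0f);                     // the 0.0f value makes its FMA a no-op
            }
        }
        return out;
    }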
9
NAÏVE IMPLEMENTATION
Load weights into RF
For each timeslice:
    For each mini-batch Bi:
        Load activations into aShMem
        Compute the accumulator:
            foreach <col_idx, value> in this thread:
                acc += value * aShMem[col_idx]
        Reduce the accumulators across the CTA
        Store the output
    Synchronize across the Grid
pseudocode of Naïve implementation of Sparse Persistent RNN
One FMA op and one shared-memory load per nonzero
Math throughput vs. ShMem throughput = 4:1
PERF IS LIMITED BY SHARED MEMORY
In a warp, all 32 col_idx values may point to the same bank
A LARGE NUMBER OF BANK CONFLICTS
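A minimal device-side sketch of the accumulation step in the pseudocode above (assumed names: each thread's <col_idx, value> pairs live in registers/local arrays, and aShMem holds the current sample's activations in shared memory). It makes the bottleneck visible: every FMA needs one shared-memory load, and nothing constrains which bank each col_idx hits.

    __device__ float accumulate_row(const int*   col_idx,       // this thread's column indices
                                    const float* value,         // this thread's weight values
                                    int          nz_per_thread,
                                    const float* aShMem)        // activations for the current sample (shared memory)
    {
        float acc = 0.0f;
        for (int i = 0; i < nz_per_thread; ++i) {
            // One shared-memory load + one FMA per nonzero; with math : ShMem throughput at ~4 : 1,
            // the loop is limited by shared memory, and bank conflicts make it worse.
            acc += value[i] * aShMem[col_idx[i]];
        }
        return acc;
    }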
10
SHARED MEMORY

In the CUDA programming model, many threads access ShMem.
ShMem is divided into 32 banks to achieve high bandwidth.
Each bank can service one ShMem address per cycle, so the SOL (speed-of-light rate) is 32 addresses/cycle.
Multiple column_index values pointing to the same bank cause a bank conflict, and conflicting accesses are serialized: 32 addresses every 32 clocks (32-way bank conflict) vs. 32 addresses every clock (no bank conflict).
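As a small illustration of the bank rule used below (standard CUDA shared-memory behavior, not code from the slides): with 4-byte banks, element aShMem[col_idx] lives in bank col_idx % 32, which is why the layouts label a nonzero's bank as column_index % 32.

    __device__ __forceinline__ int bank_of(int col_idx)
    {
        // 32 banks, 4 bytes wide: consecutive float elements of aShMem cycle through banks 0..31,
        // so the bank touched by aShMem[col_idx] is simply col_idx % 32.
        return col_idx % 32;
    }
    // If all 32 loads of a warp land in one bank they are serviced one per cycle (serialized);
    // if they land in 32 distinct banks they complete together.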
[Figure: weight layout without guided reordering vs. the ideal layout (no bank conflicts). BankN means column_index % 32 = N. A warp accesses one <index, value> pair from each of 32 rows, with nz_per_row pairs per row; the unguided layout shows several bank conflicts per access, while the ideal layout shows none.]
11
OPTIMIZATION
12
GUIDED LAYOUT OPTIMIZATION
Guided layout (fewer bank conflicts)

Reordering the nonzero locations within each row doesn't affect the final result, but it does change the access sequence.
Reordering nonzeros costs once; the benefit lasts for the whole lifetime of the network.
Each row is colored with 32 colors corresponding to the 32 ShMem banks: if location <rowX, columnY> belongs to color M (which preferentially stores nonzeros whose column_index mod 32 equals M), then locations <rowX+1, columnY> and <rowX, columnY+1> belong to color (M+1).
Greedy algorithm: a nonzero that needs Bank M (column_index % 32 == M) is stored in a Color(M) location first, unless Color(M)'s locations are full. For example, <34, 0.2> (34 maps to Bank 2) can be stored in another color's location if Color(2)'s locations are full.

[Figure: a 32-row × nz_per_row weight tile with diagonally assigned colors Color(0), Color(1), Color(2), …; BankN means column_index % 32 = N, and one warp accesses one location per row.]
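A host-side sketch of the greedy recoloring, under my reading of the rule above that slot <r, c> of a 32-row tile has color (r + c) % 32 (names and data layout are illustrative, not the authors' implementation). A nonzero whose column_index % 32 == M goes to a free Color(M) slot first and spills to any free slot only when Color(M) is full; a warp then reads slot c of 32 consecutive rows, so matching colors put the 32 accesses in 32 distinct banks.

    #include <utility>
    #include <vector>

    // tile: 32 rows, each already padded to nz_per_row <column_index, value> pairs.
    void guided_reorder_tile(std::vector<std::vector<std::pair<int, float>>>& tile,
                             int nz_per_row)
    {
        for (int r = 0; r < 32; ++r) {
            std::vector<std::pair<int, float>> src = tile[r];
            std::vector<std::pair<int, float>> dst(nz_per_row, {0, 0.0f});
            std::vector<bool> used(nz_per_row, false);

            auto place = [&](const std::pair<int, float>& nz) {
                int want = nz.first % 32;                      // desired bank == desired color
                for (int c = 0; c < nz_per_row; ++c)           // pass 1: a free slot of the wanted color
                    if (!used[c] && (r + c) % 32 == want) { dst[c] = nz; used[c] = true; return; }
                for (int c = 0; c < nz_per_row; ++c)           // pass 2: wanted color full, spill anywhere
                    if (!used[c]) { dst[c] = nz; used[c] = true; return; }
            };
            for (const auto& nz : src) place(nz);
            tile[r] = dst;
        }
    }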
13
LDS.128 OPTIMIZATION
[Figure: activation loads with mini-batch = 4 vs. mini-batch = 1. With mini-batch = 4, 8 threads each issue an LDS.128 that fetches the four samples of one column; with mini-batch = 1, 32 threads each issue an LDS.32 that fetches one sample. The example <col_idx, value> tiles span rows #0–#31 and are annotated with their bank-conflict counts (3 and 2).]
Max bank conflict# per instruction becomes 8 instead of 32
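A device-side sketch of the LDS.128 idea under an assumed activation layout: the four samples of a mini-batch of 4 are stored in four consecutive, 16-byte-aligned floats per column of aShMem (fma4, aShMem, and col_idx are illustrative names). One float4 shared-memory load (one LDS.128) then fetches all four samples of a column, so a warp presents at most 8 distinct column indices per load instruction instead of 32, matching the bound above.

    __device__ void fma4(float4& acc, float w, const float* aShMem, int col_idx)
    {
        // One 16-byte vector load (LDS.128) fetches samples 0..3 of column col_idx;
        // aShMem must store the four samples of each column contiguously and 16-byte aligned.
        float4 a = *reinterpret_cast<const float4*>(aShMem + 4 * col_idx);
        acc.x += w * a.x;   // sample 0
        acc.y += w * a.y;   // sample 1
        acc.z += w * a.z;   // sample 2
        acc.w += w * a.w;   // sample 3
    }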
14
LAMPORT BARRIER
Software pipelining / Lamport barrier + LDS.128 (with LDS.128, mini-batch = 4):

    Load weights into RF
    For each timeslice:
        For each mini-batch Bi:
            Load mini-batch Bi+1 into RF
            Compute the accumulators for Bi
            Lamport check & store Bi+1 to aShMem
            Reduce the accumulators across the CTA
            Store the output
        Synchronize across the Grid

Naïve sparse persistent RNN, for comparison:

    Load weights into RF
    For each timeslice:
        For each mini-batch Bi:
            Load activations into aShMem
            Compute the accumulator
            Reduce the accumulators across the CTA
            Store the output
        Synchronize across the Grid
The naïve version has a read-to-use dependency between loading the activations and computing on them; the pipelined version overlaps the load with compute.
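A sketch of the Lamport-style check (an illustration of the general flag/sentinel technique, not the authors' kernel; the sentinel value and names are assumptions). The activation buffer for the next mini-batch is initialized to a known "not written yet" bit pattern, so a consumer can issue the load early, overlap it with compute, and spin only if the value it already loaded still holds the sentinel when it is needed.

    // Illustrative sentinel: a NaN bit pattern the buffer is initialized to before the step.
    __device__ const int ACT_NOT_READY = 0x7fffffff;

    __device__ float load_early(volatile const float* g_act, int idx)
    {
        return g_act[idx];                        // issued before the compute it overlaps with
    }

    __device__ float lamport_check(volatile const float* g_act, int idx, float v)
    {
        // If the early load still holds the sentinel, the producer's store has not landed yet:
        // re-load until it has. This is the only case that stalls.
        while (__float_as_int(v) == ACT_NOT_READY)
            v = g_act[idx];
        return v;
    }

Checking the value itself replaces an explicit barrier before the store to aShMem, which is what allows the load of Bi+1 to overlap the compute of Bi.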
15
EVALUATE OPTIMIZATIONS
Hidden size 1152, batch size 8, 256 timesteps, 23% nonzeros
Bank conflicts reduced by 82%
[Chart: speedup vs. cublasRNN and bank conflict count as optimizations are added (Naïve, +LDS.128, +Guided Layout, +Lamport Barrier). Bank conflicts fall from 61,604 to 51,571 to 11,009 and stay at 11,009; the recoverable speedup values are 2.26, 3.10, and 3.36.]
16
EXPERIMENTS
17
EXPERIMENTS
Environment: Tesla P100-SXM2-16GB, cuDNN 6.0, CUDA 8.0, cuSPARSE 8.0
RNN implementations: cublasRNN (cudnnRNN), the baseline; persistentRNN (cudnnPersistRNN); cuSparseRNN; sparsePersistentRNN
Test cases: vary batch size; vary density; vary hidden size; fix the parameter count while varying hidden size and density
18
SOME PRUNING CONTEXT
10% density is a reasonable target for recurrent networks
Han et al. 2017
19
EFFECT OF VARYING BATCH SIZE
Hidden layer size = 1792, density = 10%, 256 timesteps, varying batch size. The dense persistent RNN can still run at this size.
[Chart: speedup vs. cublasRNN for persistRNN, cusparseRNN, and sparsePersistRNN at batch sizes 4, 8, 16, 32, 64; ShMem utilization 12.3%, 49.4%, 49.8%, 50.4%, 49.8%; plotted speedups include 6.29, 1.81, 10.46, 6.72, 3.89, 8.04, 1.85.]
20
MAXIMUM SIZE FOR DENSE PERSISTENT RNN
Hidden layer size = 1792, batch size = 4, 256 timesteps, varying density
[Chart: speedup vs. cublasRNN at densities 1% (28 SMs), 5% (56 SMs), 10% (56 SMs), 20% (56 SMs), 30% (56 SMs) for persistRNN, cusparseRNN, and sparsePersistRNN; ShMem utilization 12.3%, 29.7%, 45.7%, 50.3%, 55.0%; plotted speedups include 6.20, 6.29, 13.42, 12.08, 10.46, 7.80, 6.56.]
21
TOO LARGE FOR DENSE PERSISTENT KERNELS
Hidden layer size = 2304, batch size = 4, time steps = 256, varying density. The dense persistent RNN cannot run at this size.
[Chart: speedup vs. cublasRNN at densities 1% (36 SMs), 5% (36 SMs), 10% (36 SMs), 20% (36 SMs), 30% (18 SMs); ShMem utilization 15.7%, 30.6%, 36.4%, 16.3%, 8.3%; plotted speedups include 14.78, 11.20, 8.28, 1.98, 0.53.]
22
WAY TOO LARGE FOR DENSE PERSISTENT KERNELS
Hidden layer size = 5760, batch size = 2, time steps = 256, varying density. The cuSPARSE RNN outperforms the others at 5% density.
[Chart: speedup vs. cublasRNN at densities 1% (45 SMs), 5% (45 SMs), and 10%; ShMem utilization values 19.9% and 10.0%; plotted speedups include 4.27, 3.16, 41.17, 2.92.]
23
SMALL DENSE VS LARGE SPARSE
[Chart: error rate vs. number of parameters for dense and sparse Deep Speech 2 model options]
A larger network with the same number of nonzero parameters will outperform the smaller, dense network. (Baidu, ICLR 2017)
Different gates can be pruned to different levels of sparsity. (See et al., 2016)
24
EFFECTIVE NETWORK SIZE
The parameter count is fixed at 1.3M; larger hidden sizes are pruned to the same nnz:
2304 at 25% density, 3456 at 11% density, 4608 at 6.25% density, 5760 at 4% density
[Chart: runtime (ms) of cudnnRNN vs. sparsePersistRNN for hidden size 2304 at 25% (36 SMs), 3456 at 11% (54 SMs), 4608 at 6.25% (36 SMs), 5760 at 4% (45 SMs); plotted times include 16.52, 10.69, 19.44, 19.73 ms.]
25
RNN TOOLS
CUBLAS RNN
- General tool

PERSISTENT RNN
- Best tool when the weights can be resident in on-chip storage

CUSPARSE RNN
- General sparse tool
- It can outperform the others in some cases

SPARSE PERSISTENT RNN
- Best sparse tool when the nonzeros can be resident in on-chip storage
- Can handle larger networks than the persistent RNN
- Up to 10X speedup on 10%-dense cases
- A big sparse network can fit on a small chip
27
THANKS!
Jeff Pool, Jeremy Appleyard, Fung Xie, Michael Andersch
My e-mail: mzhu@nvidia.com