SLIDE 1

Feiwen Zhu, 5/9/2017

SPARSE PERSISTENT RNN

SLIDE 2

AGENDA

• Motivation
• Introduction
• Algorithm
• Naïve Implementation
• Optimizations
• Experiments
• Conclusion

SLIDE 3

MOTIVATION

Recurrent Neural Networks (RNNs) are a powerful tool!

• Sequence-based problems
• Larger networks == better results, but slower

“Persistent” RNNs

• Efficient GPU implementation that minimizes weight bandwidth (Diamos et al., 2016)
• The size of on-chip storage imposes a strict upper limit on the network size

Model pruning

• Significant reductions in RNN parameters (Narang et al., 2017)
• The network weights become sparse

Exploit sparsity for faster, larger networks

SLIDE 4

INTRODUCTION

An efficient implementation for sparse RNNs, including the following optimization methods:

• LDS.128 optimization
• Guided layout optimization
• Lamport barrier

A cuSPARSE-based RNN is implemented as a competitor. The final results show that, by using the sparse persistent RNN:

• On-chip storage can hold a larger network
• A big sparse network can fit on a small chip
• It is more efficient than existing methods for sparse RNNs

SLIDE 5

ALGORITHM

RNN formula: h_t = τ(R · h_{t-1} + c_t), where R is the recurrent weight matrix, h_{t-1} is the previous hidden state, and c_t is the input contribution at timeslice t.

The parameters R are shared by all time steps, but in the fold structure of a standard RNN (one gemm<<<>>> launch per timeslice, with an implicit CTA synchronization and state invalidation between launches) R is reloaded from global memory at every timeslice, so global memory becomes the bottleneck.

[Figure: fold structure of the RNN. The hidden states h_0 → h_1 → ... → h_o are produced by one gemm<<<>>> per timeslice, each multiplying the previous state by R and adding c_t.]
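The baseline fold structure can be summarized with a minimal sketch (not the author's code; dense_rnn_forward, hiddenSize, batch, and T are illustrative names), which makes it clear why R is re-read from global memory at every timeslice:

```
#include <cublas_v2.h>
#include <cuda_runtime.h>

// One GEMM per timeslice: h_{t+1} = R * h_t + c_t (the activation τ would be a
// separate element-wise kernel in this naive structure). R is streamed from
// global memory T times, once per step.
void dense_rnn_forward(cublasHandle_t handle,
                       const float* R,   // [hiddenSize x hiddenSize] recurrent weights
                       const float* c,   // [T x hiddenSize x batch] input contributions
                       float* h,         // [(T+1) x hiddenSize x batch] hidden states
                       int hiddenSize, int batch, int T) {
    const float alpha = 1.0f, beta = 1.0f;
    const size_t step = (size_t)hiddenSize * batch;
    for (int t = 0; t < T; ++t) {
        // Initialize h_{t+1} with c_t, then accumulate R * h_t into it.
        cudaMemcpy(h + (t + 1) * step, c + t * step,
                   sizeof(float) * step, cudaMemcpyDeviceToDevice);
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    hiddenSize, batch, hiddenSize,
                    &alpha, R, hiddenSize,
                    h + t * step, hiddenSize,
                    &beta, h + (t + 1) * step, hiddenSize);
    }
}
```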

SLIDE 6

ALGORITHM

Sparse Persistent RNN

Minimize bandwidth by keeping the parameters in on-chip storage only. Compress the parameters as <column_index, value> pairs.

[Figure: fold structure of the sparse persistent RNN. R is compressed into its sparse format of <idx, val> pairs and kept on chip, and a single Sparse_Persist_RNN<<<>>> kernel launch carries the hidden states h_0 → h_1 → ... → h_o through all timeslices, instead of one GEMM per timeslice reloading R.]

SLIDE 7

ALGORITHM: WORK DISTRIBUTION

Sparse Persistent RNN

[Figure: the rows of R are partitioned across CTA 0, CTA 1, CTA 2, ..., CTA N.]

Force all rows to nz_per_row nonzeros by padding rows that have fewer than the maximum nnz with <index, 0.0f> pairs.

Each thread's nonzeros belong to a single row (a minimal padding sketch follows).
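A minimal host-side sketch of the padding step (illustrative, not the author's code; the NzPair struct and pad_rows function are assumed names, and the per-row input layout is an assumption):

```
#include <algorithm>
#include <cstdint>
#include <vector>

struct NzPair { int32_t col; float val; };

// Pad every row of R to nz_per_row <column_index, value> pairs so that each
// thread can own a fixed-size slice of exactly one row.
std::vector<NzPair> pad_rows(const std::vector<std::vector<NzPair>>& rows) {
    size_t nz_per_row = 0;
    for (const auto& r : rows) nz_per_row = std::max(nz_per_row, r.size());

    std::vector<NzPair> packed;
    packed.reserve(rows.size() * nz_per_row);
    for (const auto& r : rows) {
        packed.insert(packed.end(), r.begin(), r.end());
        // Pad short rows with <index, 0.0f>: the extra FMAs just add zero.
        packed.insert(packed.end(), nz_per_row - r.size(), NzPair{0, 0.0f});
    }
    return packed;  // row-major: rows.size() * nz_per_row pairs
}
```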

SLIDE 8

NAÏVE IMPLEMENTATION

Load weights into RF
For each timeslice:
  For each mini-batch bi:
    Load activations into aShMem
    Compute the accumulator:
      foreach <col_idx, value> in this thread:
        acc += value * aShMem[col_idx]
    Reduce the accumulators across the CTA
    Store the output
  Synchronize across the Grid

Pseudocode of the naïve implementation of the sparse persistent RNN.
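A minimal device-side sketch of the inner accumulation loop (illustrative, not the author's kernel; NZ_PER_THREAD and naive_row_accumulate are assumed names, and in the real kernel the <col_idx, value> pairs stay in registers across all timeslices):

```
#define NZ_PER_THREAD 8   // illustrative value

// Weights live in registers as <col_idx, value> pairs; activations for the
// current timeslice live in shared memory (aShMem).
__device__ float naive_row_accumulate(const int   col_idx[NZ_PER_THREAD],
                                      const float value[NZ_PER_THREAD],
                                      const float* aShMem) {
    float acc = 0.0f;
    #pragma unroll
    for (int i = 0; i < NZ_PER_THREAD; ++i) {
        // One FMA and one shared-memory load per nonzero; if many col_idx in a
        // warp fall into the same bank, these loads serialize (bank conflicts).
        acc += value[i] * aShMem[col_idx[i]];
    }
    return acc;
}
```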

One FMA op and one shared-memory load per nonzero, while math throughput vs. ShMem throughput is 4:1.

PERF IS LIMITED BY SHARED MEMORY

Within a warp, all 32 col_idx values may point to the same bank.

A LARGE NUMBER OF BANK CONFLICTS

SLIDE 9

SHARED MEMORY


In the CUDA programming model, many threads access ShMem. ShMem is divided into 32 banks to achieve high bandwidth, and each bank can service one ShMem address per cycle, so the speed of light is 32 addresses per cycle. Multiple column_index values pointing to the same bank result in bank conflicts, and the conflicting accesses are serialized: 32 addresses every 32 clocks (32-way bank conflict) vs. 32 addresses every clock (no bank conflict).
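As a rough host-side illustration (assuming 4-byte elements, so a column index maps to bank column_index % 32; warp_replay_factor is a hypothetical helper, not part of any CUDA API), the number of serialized passes caused by one warp-wide access can be counted like this:

```
#include <algorithm>

// Count how many serialized passes (replays) a warp's 32 activation loads need.
int warp_replay_factor(const int col_idx[32]) {
    int hits[32] = {0};
    for (int lane = 0; lane < 32; ++lane)
        ++hits[col_idx[lane] % 32];               // bank = column index mod 32
    return *std::max_element(hits, hits + 32);    // 1 = conflict-free, 32 = fully serialized
}
```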

[Figure: weights layout without guidance vs. ideal layout. Each warp accesses a 32-row by nz_per_row tile of <column_index, value> pairs (BankN means column_index % 32 == N). In the unguided layout, several of the 32 indices accessed together fall into the same bank, producing bank conflicts; in the ideal layout (no bank conflict), the 32 pairs accessed together, e.g. <35, 0.2>, <163, 0.1>, <67, 0.2>, all map to distinct banks.]

SLIDE 10

OPTIMIZATION

SLIDE 11

GUIDED LAYOUT OPTIMIZATION

Reordering the nonzero locations within each row does not change the final result, only the access sequence. The reordering is paid for once, and the benefit lasts for the whole lifetime of the network.

Each row position is colored with one of 32 colors corresponding to the 32 ShMem banks: if location <rowX, columnY> has color M (meaning it should store a nonzero whose column_index mod 32 equals M first), then locations <rowX+1, columnY> and <rowX, columnY+1> have color M+1.

Greedy algorithm: a nonzero that requires bank M (column_index % 32 == M) is stored into one of Color(M)'s locations first; only if Color(M)'s locations are full does it fall back to another color. For example, <34, 0.2> (bank 2) is stored into another color's location if Color(2)'s locations are full. A minimal sketch of this greedy assignment follows.

[Figure: guided layout (fewer bank conflicts), a 32-row by nz_per_row tile whose locations are colored Color(0), Color(1), Color(2), ...; BankN means column_index % 32 == N.]
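A minimal host-side sketch of the greedy assignment for one row (illustrative, not the author's code: the NzPair struct and guided_layout_row function are assumed names, and the diagonal coloring formula (row + col) % 32 is an assumption inferred from the slide's description of how colors advance by row and column):

```
#include <cstdint>
#include <vector>

struct NzPair { int32_t col; float val; };

// Reorder one row's padded <column_index, value> pairs so that, where possible,
// each pair sits in a slot whose color matches its shared-memory bank.
void guided_layout_row(int row, int nz_per_row,
                       const std::vector<NzPair>& nonzeros,   // this row's padded pairs
                       std::vector<NzPair>& out_row) {        // reordered result
    out_row.assign(nz_per_row, NzPair{0, 0.0f});
    std::vector<bool> used(nz_per_row, false);

    auto place = [&](const NzPair& nz, bool match_color) {
        int bank = nz.col % 32;                                // bank this pair will hit
        for (int c = 0; c < nz_per_row; ++c) {
            int color = (row + c) % 32;                        // assumed diagonal coloring
            if (used[c] || (match_color && color != bank)) continue;
            out_row[c] = nz; used[c] = true; return true;
        }
        return false;
    };

    for (const NzPair& nz : nonzeros)
        if (!place(nz, /*match_color=*/true))                  // preferred: matching color
            place(nz, /*match_color=*/false);                  // spill: any free slot
}
```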

SLIDE 12

LDS.128 OPTIMIZATION

[Figure: activation access patterns for mini-batch = 1 (LDS.32) vs. mini-batch = 4 (LDS.128). With mini-batch = 1, 32 threads each issue an LDS.32 for one sample, so up to 32 column indices per instruction can collide on the same bank (the example rows show several conflicts). With mini-batch = 4, 8 threads each issue an LDS.128 that fetches four samples for one column index, so at most 8 column indices collide per instruction.]

The maximum number of bank conflicts per instruction becomes 8 instead of 32 (a minimal sketch of the vectorized load follows).
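A minimal device-side sketch (illustrative, not the author's kernel; accumulate_lds128 and aShMem4 are assumed names) of the vectorized activation load when the four samples of a mini-batch are stored contiguously per column index:

```
// One 128-bit shared-memory load (LDS.128) replaces four 32-bit loads and
// feeds four accumulators, one per sample in the mini-batch of 4.
__device__ void accumulate_lds128(int col_idx, float weight,
                                  const float4* aShMem4,  // 4 samples per column index
                                  float acc[4]) {
    float4 a = aShMem4[col_idx];   // single LDS.128
    acc[0] += weight * a.x;        // sample 0
    acc[1] += weight * a.y;        // sample 1
    acc[2] += weight * a.z;        // sample 2
    acc[3] += weight * a.w;        // sample 3
}
```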

SLIDE 13

LAMPORT BARRIER

Naïve sparse persistent RNN:

Load weights into RF
For each timeslice:
  For each mini-batch Bi:
    Load activations into aShMem
    Compute the accumulator
    Reduce the accumulators across the CTA
    Store the output
  Synchronize across the Grid

Software pipelining / Lamport barrier + LDS.128 (for LDS.128, mini-batch = 4):

Load weights into RF
For each timeslice:
  For each mini-batch Bi:
    Load mini-batch Bi+1 into RF
    Compute the accumulators for Bi
    Lamport check & store Bi+1 to aShMem
    Reduce the accumulators across the CTA
    Store the output
  Synchronize across the Grid

The read-to-use dependency is hidden by overlapping the load of Bi+1 with the compute of Bi (a minimal sketch of the Lamport check follows).
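A minimal sketch of the Lamport-style check (illustrative, not the author's kernel: lamport_store, lamport_load, and SENTINEL_BITS are assumed names, and the real kernel pipelines whole mini-batches rather than single elements). The buffer for Bi+1 is pre-filled with a known sentinel bit pattern; the producer simply overwrites it with real data, and the consumer spins until the sentinel disappears instead of paying for a full barrier:

```
#define SENTINEL_BITS 0x7fffffff   // assumed "not written yet" bit pattern

// Producer side: publishing the prefetched activation is the only "barrier".
__device__ void lamport_store(volatile float* aShMemNext, int idx, float v) {
    aShMemNext[idx] = v;
}

// Consumer side: spin until the producer has overwritten the sentinel.
__device__ float lamport_load(volatile float* aShMemNext, int idx) {
    float v;
    do {
        v = aShMemNext[idx];
    } while (__float_as_int(v) == SENTINEL_BITS);
    return v;
}
```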

SLIDE 14

EVALUATE OPTIMIZATIONS

Hidden size = 1152, batch size = 8, 256 timesteps, 23% nonzeros. The combined optimizations reduce bank conflicts by 82%.


[Chart: speedup vs. cublasRNN and bank-conflict count for the Naïve kernel, +LDS.128, +Guided Layout, and +Lamport Barrier. The bank-conflict count drops from 61,604 to 51,571 and then to 11,009 as the optimizations are added, and the reported speedups over cublasRNN rise from 2.26x to 3.10x to 3.36x.]

SLIDE 15

EXPERIMENTS

SLIDE 16

EXPERIMENTS

Environment: Tesla P100-SXM2-16GB, cuDNN 6.0, CUDA 8.0, cuSPARSE 8.0

RNN implementations:
• cublasRNN (cudnnRNN), the baseline
• persistentRNN (cudnnPersistRNN)
• cuSparseRNN
• sparsePersistentRNN

Test cases:
• Vary batch size
• Vary density
• Vary hidden size
• Fix the parameter count while varying hidden size and density

SLIDE 17

SOME PRUNING CONTEXT

10% density is a reasonable target for recurrent networks

Han et al. 2017

SLIDE 18

EFFECT OF VARYING BATCH SIZE

Hidden layer size = 1792, density = 10%, 256 timesteps, varying batch size. The dense persistent RNN can still work at this size.

[Chart: speedup vs. cublasRNN for cublasRNN, persistRNN, cusparseRNN, and sparsePersistRNN at batch sizes 4, 8, 16, 32, and 64, together with a ShMem% series ranging from 12.3% to about 50%.]

SLIDE 19

MAXIMUM SIZE FOR DENSE PERSISTENT RNN

Hidden layer size = 1792, batch size = 4, 256 timesteps, varying density.

[Chart: speedup vs. cublasRNN for cublasRNN, persistRNN, cusparseRNN, and sparsePersistRNN at densities 1% (28 SMs used by sparsePersistRNN), 5% (56 SMs), 10% (56 SMs), 20% (56 SMs), and 30% (56 SMs), together with a ShMem% series ranging from 12.3% to 55.0%.]

SLIDE 20

TOO LARGE FOR DENSE PERSISTENT KERNELS

Hidden layer size = 2304, batch size = 4, 256 timesteps, varying density. The dense persistent RNN cannot work at this size.

[Chart: speedup vs. cublasRNN for cublasRNN, persistRNN, cusparseRNN, and sparsePersistRNN at densities 1% (36 SMs used by sparsePersistRNN), 5% (36 SMs), 10% (36 SMs), 20% (36 SMs), and 30% (18 SMs), together with a ShMem% series ranging from 8.3% to 36.4%.]

SLIDE 21

WAY TOO LARGE FOR DENSE PERSISTENT KERNELS

Hidden layer size = 5760, batch size = 2, 256 timesteps, varying density. The cuSPARSE RNN outperforms the others at 5% density.

[Chart: speedup vs. cublasRNN for cublasRNN, persistRNN, cusparseRNN, and sparsePersistRNN at densities 1% (45 SMs used by sparsePersistRNN), 5% (45 SMs), and 10%, together with the ShMem% series.]

SLIDE 22

SMALL DENSE VS LARGE SPARSE

[Chart: error rate vs. number of parameters for dense and sparse Deep Speech 2 model options.]

A larger network with the same number of nonzero parameters will outperform the smaller, dense network (Baidu, ICLR 2017). Different gates can be pruned to different levels of sparsity (See et al., 2016).

SLIDE 23

EFFECTIVE NETWORK SIZE

Fixed at 1.3M parameters; the other hidden sizes are pruned to the same number of nonzeros:

2304 at 25% density, 3456 at 11% density, 4608 at 6.25% density, 5760 at 4% density.

[Chart: runtime in ms for cudnnRNN vs. sparsePersistRNN at hidden size / density combinations 2304 / 25% (36 SMs), 3456 / 11% (54 SMs), 4608 / 6.25% (36 SMs), and 5760 / 4% (45 SMs).]

SLIDE 24

RNN TOOLS

CUBLAS RNN
• General tool

PERSISTENT RNN
• Best tool when the weights can be resident in on-chip storage

CUSPARSE RNN
• General sparse tool
• It can outperform the others in some cases

SPARSE PERSISTENT RNN
• Best sparse tool when the nonzeros can be resident in on-chip storage
• Can handle larger networks than the persistent RNN
• Up to 10X speedup on 10% dense cases
• Big sparse networks can fit on a small chip

SLIDE 25

THANKS!

Jeff Pool, Jeremy Appleyard, Fung Xie, Michael Andersch. My e-mail: mzhu@nvidia.com
