E-LSTM: Efficient Inference of Sparse LSTM on Embedded Heterogeneous System
Runbin Shi 1, Junjie Liu 1, Shuo Wang 2, Yun Liang 2, Hayden So 1
1 Department of Electrical and Electronic Engineering, The University of Hong Kong
2 Center for Energy-efficient Computing and Applications, School of EECS, Peking University
Design Automation Conference, June 2019
1 / 17
Table of Contents
1. Background
   - LSTM-based Neural Networks
   - Target Embedded Platform of E-LSTM
2. Method: An Area-saving Sparse Weight Format (eSELL)
3. Method: Optimizations for LSTM Inter-Cell Parallelism
   - Generic E-LSTM Architecture and Throughput Bottleneck
   - Optimization with Inherent Sparsity in LSTM Arithmetic
   - Scheduling with Cell-fusion
4. Experiments and Evaluation
5. Conclusion
2 / 17
Iterative Cell Evaluation in LSTM Inference
An illustration of the LSTM-cell iteration: the input sequence x = (x_1, x_2, …, x_ts) (words, audio, image), one vector per time step (ts), is fed into the embedded LSTM cell, which produces the output sequence h = (h_1, h_2, …, h_ts) (translation, prediction).
Figure: The LSTM cell and its iterative evaluation over the temporal sequence.
3 / 17
Iterative Cell Evaluation in LSTM Inference
Unrolled over time, the cell is evaluated once per iteration: iteration t consumes x_t together with the context link from the previous iteration (h_{t-1} and the cell state c_{t-1}) and produces h_t and c_t.
Figure: The LSTM cell unrolled over the temporal sequence.
3 / 17
Arithmetic of LSTM-cell Computation
f_t = σ(W_f x_t + U_f h_{t−1} + b_f)    (1)
i_t = σ(W_i x_t + U_i h_{t−1} + b_i)    (2)
c_t = f_t · c_{t−1} + i_t · tanh(W_c x_t + U_c h_{t−1} + b_c)    (3)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)    (4)
h_t = o_t · tanh(c_t)    (5)
Figure: Detailed dataflow in the LSTM cell.
4 / 17
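Equations (1)-(5) can be sketched as one cell iteration in NumPy. The dict-of-gates layout and the function name are illustrative choices, not from the paper; the arithmetic follows the equations above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM iteration following Eqs. (1)-(5).

    W, U, b are dicts keyed by gate name ('f', 'i', 'c', 'o');
    this layout is illustrative, not the paper's implementation.
    """
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])                  # Eq. (1)
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])                  # Eq. (2)
    c = f * c_prev + i * np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # Eq. (3)
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])                  # Eq. (4)
    h = o * np.tanh(c)                                                    # Eq. (5)
    return h, c
```

Iterating this function over t = 1 … ts, feeding back h and c, reproduces the unrolled evaluation of the earlier slides.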
Heavy Workload vs. Low Performance of Embedded CPU
Main computational workload of LSTM: matrix–vector multiplication
W x_t, with W = (W_f, W_i, W_c, W_o)^T ∈ R^{4n×m} and x_t ∈ R^m
U h_t, with U = (U_f, U_i, U_c, U_o)^T ∈ R^{4n×n} and h_t ∈ R^n
In a benchmark layer for machine comprehension, m = n = 1500 and one sequence has 35 time steps (cell iterations), i.e. 630,000,000 MACC operations per sequence. One LSTM layer then costs 0.63 s on a CPU sustaining 1 GOp/s.
5 / 17
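The MACC count quoted on the slide follows directly from the matrix shapes: each time step costs 4n·m for W x_t plus 4n·n for U h_t. A back-of-the-envelope check:

```python
# Check of the per-sequence MACC count on the slide (m = n = 1500, 35 steps).
m = n = 1500
time_steps = 35
macc_per_step = 4 * n * (m + n)   # W x_t (4n*m) plus U h_t (4n*n)
total_macc = macc_per_step * time_steps
print(total_macc)                 # 630,000,000

latency_s = total_macc / 1e9      # CPU sustaining 1 GOp/s, counting 1 MACC as 1 op
print(latency_s)                  # 0.63 s per sequence
```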
Heavy Workload vs. Low Performance of Embedded CPU
Main computational workload of LSTM: matrix–vector multiplication
W x_t, with W = (W_f, W_i, W_c, W_o)^T ∈ R^{4n×m} and x_t ∈ R^m
U h_t, with U = (U_f, U_i, U_c, U_o)^T ∈ R^{4n×n} and h_t ∈ R^n
Sparsity in the weight matrices: Sparsity(W, U) ∈ [0.2, 0.8], yet CPU performance drops when computing sparse matrix–vector multiplication (SpMV).
Embedded solution for LSTM inference: a heterogeneous system coupling a CPU with a generic LSTM accelerator.
5 / 17
Target Platform: Tightly-coupled Heterogeneous System
The RISC-V CPU and the hardware accelerator are coupled on-chip through the ROCC interface (control path and data path); the accelerator reaches DRAM through the CPU's DRAM controller.
Advantages:
- Lower latency: 1 cycle (DCache access via ROCC) vs. ~30 cycles (DRAM access).
- Smaller area: saving chip area on the embedded device.
Limitations:
- The chip-area limitation forces off-chip (DRAM) weight storage.
- ROCC bandwidth: 64 bits/cycle.
Figure: Tightly-coupled architecture.
6 / 17
eSELL: Area-saving Sparse Weight Format
Figure: Column-major SpMV with the Compressed Sparse Column (CSC) format: 4 MACC per cycle, but each result needs its own buffer port.
Figure: Coalesced access to the result buffer: still 4 MACC per cycle, but with fewer ports, giving a 63% area reduction relative to CSC.
SRAM area estimation [?]: area ∝ (#bits)^0.9 × (#ports)^0.7
8 / 17
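The port-count exponent in the area model explains why access coalescing pays off. A hedged sketch, applying the slide's model with only the ports changed (the slide's exact 63% figure also depends on buffer sizing details it does not give, so this is an estimate):

```python
# SRAM area model from the slide: area ∝ (#bits)^0.9 * (#ports)^0.7.
# Going from 4 result-buffer ports (CSC) to 1 (coalesced access) with the
# bit capacity held fixed already yields roughly the quoted saving.
def sram_area(bits, ports):
    return bits ** 0.9 * ports ** 0.7

bits = 4 * 1500 * 16            # illustrative buffer size: 4n results, 16 bits each
a_csc = sram_area(bits, 4)      # CSC: 4 independent write ports
a_coalesced = sram_area(bits, 1)  # coalesced access: 1 port
saving = 1 - a_coalesced / a_csc
print(f"{saving:.1%}")          # ~62% with this simplified model
```

Note the bit count cancels out of the ratio; the saving depends only on the port term, 1 − 4^−0.7.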
eSELL: Area-saving Sparse Weight Format
Weight format construction (STEP 1 through STEP 4): starting from the values and column indices of the sparse matrix (MAT_h × MAT_w), the matrix is tiled into blocks (BLK_h × BLK_w) that are further divided into chunks (CHK_h rows each); each chunk stores a row-permutation index (IDX_row), bin-encoded column indices (EIDX_col), and its chunk width (CHK_W), alongside the packed non-zero values.
Figure: Steps for eSELL weight format construction.
8 / 17
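eSELL's name suggests it builds on the sliced-ELLPACK (SELL) family of formats. The following is a hedged sketch of SELL-style chunk packing only — rows within a chunk permuted by non-zero count and padded to the chunk width — with the eSELL-specific bin encoding of column indices omitted; all names are illustrative, not the paper's implementation.

```python
import numpy as np

def sell_pack(mat, chk_h):
    """SELL-style packing sketch: per chunk of chk_h rows, return
    (idx_row, chk_w, values, column_indices), where idx_row permutes
    rows by descending non-zero count and chk_w is the padded width."""
    chunks = []
    for top in range(0, mat.shape[0], chk_h):
        rows = mat[top:top + chk_h]
        nnz = [np.flatnonzero(r) for r in rows]
        # IDX_row: permutation sorting rows by non-zero count (densest first)
        idx_row = sorted(range(len(rows)), key=lambda i: -len(nnz[i]))
        chk_w = len(nnz[idx_row[0]])              # CHK_W: widest row in the chunk
        vals = np.zeros((len(rows), chk_w))
        cols = np.full((len(rows), chk_w), -1)    # -1 marks padding
        for out_i, i in enumerate(idx_row):
            vals[out_i, :len(nnz[i])] = rows[i][nnz[i]]
            cols[out_i, :len(nnz[i])] = nnz[i]
        chunks.append((idx_row, chk_w, vals, cols))
    return chunks
```

Sorting rows inside small chunks (rather than globally) keeps padding low while the permutation metadata stays a few bits per row, which is what makes the format cheap to decode in hardware.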
eSELL: Area-saving Sparse Weight Format
Alignment to the ROCC interface: each chunk is stored as a head word carrying the metadata (IDX_row: 3×4 bits, EIDX_col: 3×4 bits, CHK_w: 2 bits, plus padding), followed by the 16-bit values, so that the stream packs into 64-bit ROCC words with minimal invalid bits.
Figure: eSELL storage/transmission pattern aligned with the 64-bit ROCC interface.
8 / 17
Generic Accelerator Hardware for Embedded LSTM
The accelerator comprises an instruction decoder, a load/store controller with a request interface to ROCC, an array of SpMV PEs fed by the eSELL weight-matrix stream, a vector PE, and on-chip vector buffers (BUF_x, BUF_wx, BUF_uh, BUF_h, BUF_b, BUF_c).
Figure: Accelerator architecture in E-LSTM.
10 / 17
Throughput Bottleneck
Pipeline diagram for the single SpMV-PE case: the ROCC interface alternates between loading matrix W and matrix U, while the PE computes W x_1, U h_1, W x_2, …, W x_ts, U h_ts.
Figure: Cell iterations processed in sequence; both ROCC and the PE are fully utilized.
11 / 17
Throughput Bottleneck
Pipeline diagram for the multiple SpMV-PE case (e.g. three PEs): while ROCC streams W, the SpMV-PEs compute W x_1, W x_2, W x_3, … in parallel; the U h_t products must still run in sequence, leaving a PE stall period.
Figure: W x_t processed in parallel, U h_t in sequence.
Pipeline stall: W x_t and U h_t cannot be computed concurrently, because ROCC can load only one word of W or U per cycle. The PE stall is therefore unavoidable, and U h_t becomes the throughput bottleneck.
11 / 17
Optimization 1: Shorten the U h_t Period with the Inherent Sparsity of h_t
Backtrace of the h_t computation:
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
h_t = o_t · tanh(c_t), with σ(x) = 1 / (1 + e^{−x})
Inherent sparsity of h_t: since P(o_t < 0.1) ≈ 0.32 and tanh(c_t) ∈ (−1, 1), a considerable portion of h_t is close to zero and can be treated as zero in the U h_t computation.
Figure: The logistic function σ(x), with the σ(x) = 0.1 threshold marked; p = 32% of the output-gate values fall below it.
12 / 17
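The inherent-sparsity observation can be sketched as a thresholding step on h_t. The 0.1 threshold comes from the slide; the explicit zeroing function and its name are our illustration of the idea, not the paper's code.

```python
import numpy as np

def sparsify_h(o_t, c_t, thresh=0.1):
    """Compute h_t = o_t * tanh(c_t) (Eq. 5), then treat entries whose
    output gate is below the threshold as exact zeros, so the U h_t
    product can skip the matching weight columns."""
    h_t = o_t * np.tanh(c_t)
    h_t[o_t < thresh] = 0.0   # |h_t| < thresh here, since |tanh(c_t)| < 1
    return h_t
```

Because |tanh(c_t)| < 1, every zeroed entry was already smaller than the threshold in magnitude, which bounds the approximation error per element.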
Optimization 1: Shorten the U h_t Period with the Inherent Sparsity of h_t
Sparse-matrix sparse-vector multiplication (SpMSpV) in U h_t.
Figure: SpMV, the original U h_t computation: every weight column is read.
Figure: SpMSpV, the U h_t computation exploiting the inherent sparsity of h_t: only the weight columns matching non-zero entries of h_t are read. In this example, SpMSpV achieves a 3× speedup on the U h_t computation.
12 / 17
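The column-skipping idea behind SpMSpV can be sketched with a column-wise sparse layout (plain CSC here for clarity; the accelerator uses eSELL). A zero entry of h_t lets the whole matching weight column be skipped, which is where the speedup comes from.

```python
import numpy as np

def spmspv_csc(indptr, indices, data, h, n_rows):
    """y = U @ h with U in CSC form (indptr/indices/data). Columns whose
    h entry is zero are skipped entirely, mirroring the slide's scheme."""
    y = np.zeros(n_rows)
    for j, hj in enumerate(h):
        if hj == 0.0:
            continue                          # skip this weight column
        for k in range(indptr[j], indptr[j + 1]):
            y[indices[k]] += data[k] * hj     # one MACC per stored non-zero
    return y
```

With a third of the h_t entries non-zero, roughly a third of the columns are read, matching the slide's 3× example.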
Optimization 2: Scheduling with Cell-fusion
Inter-cell parallel scheme (cell-fusion): assuming there are N_pe PEs, (N_pe − 1) of them process W x_t (SpMV) and the remaining one processes U h_t (SpMSpV). In addition, each SpMV-PE processes W x_t for N_fuse cell iterations in an interleaved fashion.
Figure: Schedule for N_fuse = 3, ts = 18. PE1 computes W x_{1,2,3}, W x_{7,8,9}, W x_{13,14,15}, PE2 computes W x_{4,5,6}, W x_{10,11,12}, W x_{16,17,18}, and PE3 computes U h_1 … U h_18; the schedule divides into T_prolog, T_main, and T_epilog phases.
13 / 17
Optimization 2: Scheduling with Cell-fusion
Advantage: W x_t and U h_t are processed concurrently. In every N_fuse cycles, the ROCC interface is occupied by loading W for 1 cycle and by loading U for the remaining (N_fuse − 1) cycles.
Figure: Schedule for N_fuse = 3, ts = 18 (as above), showing the ROCC occupancy split between loading W and loading U.
13 / 17
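The work assignment in the figure can be sketched as a small scheduling function: the N_pe − 1 SpMV-PEs take W x_t in blocks of N_fuse consecutive iterations round-robin, while the last PE takes every U h_t in order. The function name and return layout are illustrative.

```python
def cell_fusion_schedule(n_pe, n_fuse, ts):
    """Assign cell iterations 1..ts under cell-fusion: blocks of n_fuse
    W x_t iterations rotate over the n_pe - 1 SpMV-PEs; the remaining
    PE handles U h_t for every iteration in sequence."""
    spmv_pes = {pe: [] for pe in range(n_pe - 1)}
    for blk, start in enumerate(range(1, ts + 1, n_fuse)):
        pe = blk % (n_pe - 1)
        spmv_pes[pe].append(list(range(start, min(start + n_fuse, ts + 1))))
    spmspv_pe = list(range(1, ts + 1))   # U h_1 .. U h_ts in order
    return spmv_pes, spmspv_pe

sched, uh = cell_fusion_schedule(3, 3, 18)
# Matches the figure: PE1 gets Wx 1-3, 7-9, 13-15; PE2 gets Wx 4-6, 10-12, 16-18.
```

Grouping N_fuse iterations per PE is what lets one streamed copy of W serve several cell iterations, shrinking W's share of ROCC cycles to 1 in N_fuse.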