Improving Efficiency in Neural Network Accelerator using Operands Hamming Distance Optimization

Meng Li*, YiLei Li*, Pierce Chuang, Liangzhen Lai, and Vikas Chandra

EMC2 Workshop @ NeurIPS 2019
Facebook Silicon AI Research
Motivation

• Dataflow processing is widely exploited to amortize memory access energy
• Datapath energy therefore becomes important for dataflow accelerators
  • It consists of compute energy in the processing elements (PEs) and data propagation energy among PEs

[Figure: input-stationary (Thinker [Yin+, JSSC'18]) and output-stationary (ShiDianNao [Du+, ISCA'15]) PE array dataflows, with energy breakdowns showing the datapath consuming 57.7% and 87.3% of total energy, respectively]
Motivation

• In dataflow processing, operands are streamed into the compute array
• Datapath energy is determined by the total bit flips induced by operand streaming
• Target: propose post-training and training-aware techniques to reduce the bit flips of weight streaming

[Figure: weights and activations streamed into a PE array, and a scatter plot of normalized datapath energy (0 to 600) versus total bit flips (0 to 4x10^5)]

K, C, H, and W denote output channel, input channel, output height, and output width, respectively.
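Since datapath energy tracks total bit flips, it helps to count them for a given streaming order. Below is a minimal sketch, assuming 8-bit quantized weights of shape (K, C) streamed one output channel after another, so each PE sees consecutive output channels' weights; the function name is ours, not the paper's.

```python
import numpy as np

def total_bit_flips(weights: np.ndarray) -> int:
    """Total Hamming distance between consecutive output channels.

    weights: uint8 array of shape (K, C); output channel k+1 is streamed
    right after output channel k into the same PEs, so every differing
    bit between the two rows toggles a datapath wire once.
    """
    diffs = weights[1:] ^ weights[:-1]        # XOR marks flipped bits
    return int(np.unpackbits(diffs).sum())    # popcount over all rows
```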
Post-Training Optimization: Output Channel Reordering

• To reduce bit flips, the most straightforward technique is output channel reordering
• Output channel reordering can be mapped to a traveling salesman problem, which can be approximately solved with efficient greedy algorithms (a greedy sketch follows below)

[Figure: example of reordering the K output channels of a small weight matrix so that consecutive rows have a small Hamming distance]
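As one concrete instance of such a greedy heuristic, the nearest-neighbor sketch below repeatedly appends the unvisited output channel closest in Hamming distance to the last one. This is an illustrative assumption; the paper's exact heuristic may differ.

```python
import numpy as np

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Bitwise Hamming distance between two uint8 weight rows."""
    return int(np.unpackbits(a ^ b).sum())

def greedy_reorder(weights: np.ndarray) -> list:
    """Return an output-channel order that greedily minimizes the
    Hamming distance between consecutively streamed rows."""
    K = weights.shape[0]
    order = [0]                               # arbitrary start channel
    remaining = set(range(1, K))
    while remaining:
        last = weights[order[-1]]
        nxt = min(remaining, key=lambda k: hamming(last, weights[k]))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

Note that the chosen permutation must also be applied wherever the output channels are consumed (e.g., the next layer's input channels), so the network's function is unchanged.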
Post-Training Optimization: Input Channel Clustering

• For most networks, the channel dimension can be larger than the compute array size
  • Weight matrices need to be segmented first and then fed into the compute array
  • Each weight sub-matrix can use a different output channel order
• Before segmenting the weight matrix, different input channels can be clustered first
• Propose an iterative assignment-and-update approach for input clustering (a sketch follows below)

[Figure: clustering the C input channels into two clusters, then reordering the K output channels within each cluster]
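The slide does not spell out the objective or update rule, so the following is a heavily hedged sketch assuming a k-means-style loop: each cluster keeps its own greedy output-channel order (reusing greedy_reorder from above), and each input channel is reassigned to the cluster whose order costs it the fewest bit flips.

```python
import numpy as np

def column_flips(col: np.ndarray, order: list) -> int:
    """Bit flips contributed by one input channel (a uint8 column)
    under a given output-channel streaming order."""
    streamed = col[order]
    return int(np.unpackbits(streamed[1:] ^ streamed[:-1]).sum())

def cluster_inputs(weights: np.ndarray, n_clusters: int, iters: int = 10):
    """weights: uint8 array of shape (K, C). Returns a cluster id per
    input channel and one output-channel order per cluster."""
    C = weights.shape[1]
    assign = np.arange(C) % n_clusters            # naive initial assignment
    for _ in range(iters):
        # Update step: recompute each cluster's output-channel order.
        orders = [greedy_reorder(weights[:, assign == g])
                  for g in range(n_clusters)]
        # Assignment step: move each column to its cheapest cluster.
        new = np.array([min(range(n_clusters),
                            key=lambda g: column_flips(weights[:, c], orders[g]))
                        for c in range(C)])
        if np.array_equal(new, assign):           # converged
            break
        assign = new
    return assign, orders
```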
Experimental Results

• Post-training optimization technique comparison
  • Use 1x1 Conv in MobileNetV2 and 3x3 Conv in ResNet26 for evaluation
• Combining post-training and training-aware optimization
  • Incorporate a bit-flip loss term into the training loss function (a hedged sketch follows below)
  • Use MobileNetV2 trained on CIFAR-100 for evaluation

[Figure: average Hamming distance reduction of Direct Reorder vs. Cluster-then-Reorder on MobileNetV2 and ResNet26 for 8/16/32/64 channels per cluster; and HD/energy reduction of Baseline, Post-Training, Training-Aware, and Combined optimization]
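The slide only states that a bit-flip term is added to the loss. Counting flipped bits of quantized weights is non-differentiable, so the sketch below substitutes a smooth proxy (L1 distance between adjacent output channels) with an assumed weighting factor lam; both the proxy and the PyTorch framing are our illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def bit_flip_proxy(weight: torch.Tensor) -> torch.Tensor:
    """Smooth stand-in for the Hamming distance between consecutively
    streamed output channels. weight: conv weight of shape (K, C, kh, kw)."""
    w = weight.flatten(1)                     # (K, C*kh*kw)
    if w.shape[0] < 2:                        # single output channel: no neighbors
        return w.new_zeros(())
    return (w[1:] - w[:-1]).abs().mean()      # penalize neighbor differences

def total_loss(logits, labels, model, lam=1e-3):
    """Task loss plus the bit-flip regularizer over all conv layers."""
    task = F.cross_entropy(logits, labels)
    reg = sum(bit_flip_proxy(m.weight) for m in model.modules()
              if isinstance(m, nn.Conv2d))
    return task + lam * reg
```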