CMSC5743 L02: CNN Accurate Speedup I
Bei Yu (Latest update: September 28, 2020)
Fall 2020
These slides contain/adapt materials developed by
◮ Minsik Cho and Daniel Brand (2017). “MEC: memory-efficient convolution for deep neural network”. In: Proc. ICML
◮ Asit K. Mishra et al. (2017). “Fine-grained accelerators for sparse machine learning workloads”. In: Proc. ASPDAC, pp. 635–640
◮ Jongsoo Park et al. (2017). “Faster CNNs with direct sparse convolutions and guided pruning”. In: Proc. ICLR
◮ UC Berkeley EE290: “Hardware for Machine Learning” https://inst.eecs.berkeley.edu/~ee290-2/sp20/
Overview
◮ Convolution 101
◮ GEMM
◮ Sparse Convolution
◮ Direct Convolution
◮ Further Discussions
2D-Convolution

[Figure: a 5×5 input activation (entries a–y) is convolved with a 3×3 weight (entries 1–9) to produce a 3×3 output activation (entries A–I); a padded variant surrounds the input with a ring of zeros.]

H: Height of Input Activation
W: Width of Input Activation
R: Height of Weight
S: Width of Weight
P: Height of Output Activation
Q: Width of Output Activation
stride: # of rows/columns traversed per step
padding: # of zero rows/columns added around the input

With stride 1, the top-left and bottom-right outputs are

A = a·1 + b·2 + c·3 + f·4 + g·5 + h·6 + k·7 + l·8 + m·9
I = m·1 + n·2 + o·3 + r·4 + s·5 + t·6 + w·7 + x·8 + y·9

Output size without padding:

P = (H - R) / stride + 1
Q = (W - S) / stride + 1

Output size with zero padding:

P = (H - R + 2·pad) / stride + 1
Q = (W - S + 2·pad) / stride + 1
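These formulas are easy to sanity-check in code. A minimal sketch (the helper name `conv_out_size` is our own, not from the lecture):

```python
def conv_out_size(in_size: int, k_size: int, stride: int = 1, pad: int = 0) -> int:
    """P = (H - R + 2*pad) / stride + 1, and likewise for Q."""
    return (in_size - k_size + 2 * pad) // stride + 1

# 5x5 input, 3x3 weight, stride 1: 3x3 output, as in the figure above.
assert conv_out_size(5, 3) == 3
# One ring of zero padding preserves the 5x5 spatial size.
assert conv_out_size(5, 3, stride=1, pad=1) == 5
```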
3D-Convolution

[Figure: the input activation gains a channel dimension C; each weight is an R×S×C tensor, and K such weights produce a K-channel P×Q output activation. With batching, N input activations are processed into N output activations.]

H: Height of Input Activation
W: Width of Input Activation
R: Height of Weight
S: Width of Weight
P: Height of Output Activation
Q: Width of Output Activation
stride: # of rows/columns traversed per step
padding: # of zero rows/columns added
C: # of Input Channels
K: # of Output Channels
N: Batch size
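With channels and a batch, the computation becomes a loop nest over N, K, C, P, Q, R, S. A compact sketch of our own (assuming NCHW inputs, KCRS weights, stride 1, and no padding):

```python
import numpy as np

def conv3d_batched(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """x: (N, C, H, W) inputs; w: (K, C, R, S) weights.
    Returns y: (N, K, P, Q) with stride 1 and no padding."""
    N, C, H, W = x.shape
    K, Cw, R, S = w.shape
    assert C == Cw
    P, Q = H - R + 1, W - S + 1
    y = np.zeros((N, K, P, Q), dtype=x.dtype)
    for p in range(P):
        for q in range(Q):
            # Reduce over C, R, S for every (n, k) at this output pixel.
            y[:, :, p, q] = np.einsum('ncrs,kcrs->nk',
                                      x[:, :, p:p + R, q:q + S], w)
    return y
```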
Convolution 101

[Figure: a numeric example, a zero-padded input convolved with a 3×3 kernel to produce a same-size output.]

Direct convolution: no extra memory overhead
◮ Low performance
◮ Poor memory access pattern due to geometry-specific constraints
◮ Relatively short dot products
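A naive loop-nest sketch of single-channel direct convolution (the function and its NumPy phrasing are our own illustration, not the lecture's code):

```python
import numpy as np

def direct_conv2d(x: np.ndarray, w: np.ndarray, stride: int = 1, pad: int = 0) -> np.ndarray:
    """Naive direct convolution: no lowering buffer, but the inner
    reduction is only R*S elements long, and consecutive window rows
    sit a full input row apart in memory (poor access pattern)."""
    H, W = x.shape
    R, S = w.shape
    if pad:
        x = np.pad(x, pad)  # zero padding on both spatial axes
    P = (H - R + 2 * pad) // stride + 1
    Q = (W - S + 2 * pad) // stride + 1
    y = np.zeros((P, Q), dtype=x.dtype)
    for p in range(P):
        for q in range(Q):
            window = x[p * stride : p * stride + R, q * stride : q * stride + S]
            y[p, q] = np.sum(window * w)  # short R*S dot product
    return y
```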
Background: Memory System

[Figure: the memory hierarchy. Moving away from the processor, L1$, L2$, Main Memory, and Secondary Memory grow in both size and access time, while the transfer granularity grows from 4–8 bytes (word) to 8–32 bytes (block), 1 to 4 blocks, and 1,024+ bytes (disk sector = page). The hierarchy is inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in Main Memory, which is a subset of what is in Secondary Memory.]

◮ Spatial locality
◮ Temporal locality
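Spatial locality can be observed from plain user code. A small experiment of our own (not from the slides): both reductions below read the same number of elements, but the strided view touches a new cache line for nearly every element, while the contiguous slice streams through memory.

```python
import numpy as np
import time

a = np.random.rand(1 << 26)  # ~512 MB of float64
n = a.size // 64

t0 = time.perf_counter()
dense = a[:n].sum()          # unit stride: 8 float64 per 64-byte cache line
t1 = time.perf_counter()
sparse = a[::64].sum()       # stride 64: roughly one cache miss per element
t2 = time.perf_counter()

print(f"contiguous: {t1 - t0:.4f}s  strided: {t2 - t1:.4f}s")
```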
Im2col (Image2Column)

[Figure: the numeric convolution example lowered to a matrix-vector product, every 3×3 patch of the zero-padded input becomes one row of a 25 × 9 matrix, which is multiplied by the 9 × 1 flattened kernel.]

◮ Large extra memory overhead
◮ Good performance
◮ BLAS-friendly memory layout to enjoy SIMD/locality/parallelism
◮ Applicable for any convolution configuration on any platform
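A minimal single-channel im2col sketch (our own code; production implementations also fold the C input channels and K filters into the two matrix dimensions):

```python
import numpy as np

def im2col_conv2d(x: np.ndarray, w: np.ndarray, stride: int = 1, pad: int = 0) -> np.ndarray:
    """Lower convolution to one large matrix product: one row per
    output position, one column per kernel element."""
    R, S = w.shape
    if pad:
        x = np.pad(x, pad)
    H, W = x.shape  # padded size
    P = (H - R) // stride + 1
    Q = (W - S) // stride + 1
    # The lowered matrix costs (P*Q) x (R*S) extra memory...
    cols = np.empty((P * Q, R * S), dtype=x.dtype)
    for p in range(P):
        for q in range(Q):
            cols[p * Q + q] = x[p * stride : p * stride + R,
                                q * stride : q * stride + S].ravel()
    # ...but all the arithmetic becomes a single BLAS-friendly GEMM/GEMV.
    return (cols @ w.ravel()).reshape(P, Q)

# For the slide's example (5x5 input, 3x3 kernel, stride 1, pad 1),
# cols is exactly the 25 x 9 matrix and w.ravel() the 9 x 1 vector.
```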