Accelerating Winograd Convolutions using Symbolic Computation and Meta-programming

A. Mazaheri¹, T. Beringer¹, M. Moskewicz², F. Wolf¹, A. Jannesari³
¹TU Darmstadt, ²Tesla Inc., ³Iowa State University

EuroSys'20, Heraklion, Crete, Greece, April 30, 2020
Neural networks are everywhere

• Vision: object detection, semantic segmentation, autonomous cars
• Audio and language: speech recognition, translation, music composition
• Text: sentiment analysis, word prediction, intelligent agents
Convolutional neural networks

[Figure: feature-map visualization of a CNN pipeline; stages of Convolution+ReLU and Max pooling, followed by Fully connected+ReLU layers and a Softmax over the final 1×1×1000 output.]
Convolution & tensors

• Input tensor: C × H × W
• Kernel tensor: OC × IC × M × N
• Output tensor: H′ × W′
• Each output element is an element-wise multiplication of a kernel-sized input window, followed by a summation
• Convolutions dominate computation (>90% of runtime)
• Similar to generalized matrix-matrix multiplication → massive GPU parallelism (a loop-nest sketch follows below)
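To make the tensor shapes and the "element-wise multiplication + summation" concrete, here is a minimal direct-convolution sketch (our illustration, not the paper's code); the loop nest over output channels and positions is exactly the structure that maps onto GPU parallelism.

    import numpy as np

    def direct_conv(inp, kernel):
        """inp: (C, H, W); kernel: (OC, IC, M, N) with IC == C -> out: (OC, H', W')."""
        C, H, W = inp.shape
        OC, IC, M, N = kernel.shape
        assert IC == C
        Hp, Wp = H - M + 1, W - N + 1            # "valid" convolution, stride 1
        out = np.zeros((OC, Hp, Wp), dtype=inp.dtype)
        for oc in range(OC):
            for y in range(Hp):
                for x in range(Wp):
                    # element-wise multiplication of the window, then summation
                    out[oc, y, x] = np.sum(inp[:, y:y+M, x:x+N] * kernel[oc])
        return out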
Winograd convolution F(m, r)

• Sample: F(m = 2×2, r = 3×3)
• Pipeline: tiled input → input/filter transformation → element-wise multiplication → output transformation → output
• Internal tile size: α = m + r − 1 (see the 1-D sketch below)

Research questions:
• Can we reduce the overhead of the Winograd transformations?
• How do we properly choose the right α?
• How do we run Winograd efficiently on a wide range of GPU platforms?
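A minimal 1-D sketch of F(m = 2, r = 3), assuming the standard transform matrices from Lavin and Gray (the paper generates its own variants symbolically): transform the input tile and the filter, multiply element-wise, then transform back. The internal tile holds α = m + r − 1 = 4 samples, and only 4 multiplications are needed instead of 6.

    import numpy as np

    BT = np.array([[1, 0, -1,  0],
                   [0, 1,  1,  0],
                   [0, -1, 1,  0],
                   [0, 1,  0, -1]], dtype=float)   # input transform B^T
    G  = np.array([[1.0, 0.0, 0.0],
                   [0.5, 0.5, 0.5],
                   [0.5, -0.5, 0.5],
                   [0.0, 0.0, 1.0]])               # filter transform G
    AT = np.array([[1, 1,  1,  0],
                   [0, 1, -1, -1]], dtype=float)   # output transform A^T

    d = np.array([1.0, 2.0, 3.0, 4.0])   # one input tile: alpha = 2 + 3 - 1 = 4 samples
    g = np.array([1.0, 2.0, 3.0])        # one 3-tap filter

    # transform -> element-wise multiply (4 products instead of 6) -> transform back
    y = AT @ ((G @ g) * (BT @ d))

    # reference: 'valid' cross-correlation, as used by CNN "convolutions"
    ref = np.convolve(d, g[::-1], mode='valid')
    assert np.allclose(y, ref)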
Winograd code generation workflow

• CNN frontend (Caffe, TensorFlow, ...): a network description, e.g. myCNN.proto with layers conv1, conv2, ..., is parsed into a compute graph (CG)
• Graph-level optimization: the annotated CG is refined, e.g. by fusing conv1 + activation
• Code generation: driven by a Winograd specification F(m, r), using Winograd transformation templates, an info/matrices DB, and per-operation variant selection (non-fused Winograd, fused Winograd, direct convolution, SGEMM); a recipe generator and a template meta-programming library (C++ metacode) specialize CUCL kernel templates such as

      KERNEL conv(in, filts) // CUCL IN img:chan:y:x
      main(cf1, cf2) { %(filts_buf_loads); }

  into concrete CUDA/OpenCL/Vulkan kernels, e.g. with the load expanded to filts_buf[0+tid] = filts[tid]; (a substitution sketch follows below)
• Auto-tuning & HW-specific selection: OpenCL, CUDA, and Vulkan backends targeting Nvidia GPUs, AMD GPUs, Qualcomm Snapdragon, and new targets
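A toy substitution engine in the spirit of the CUCL templates above (our sketch; the real recipe generator is C++ metacode). It reproduces the before/after snippets from the workflow figure. Note that Python's %-formatting needs a trailing 's' on each placeholder, unlike the bare %(name) in the CUCL syntax.

    template = (
        "KERNEL conv(in, filts) // CUCL IN img:chan:y:x\n"
        "main(cf1, cf2) {\n"
        "  %(filts_buf_loads)s;\n"
        "}\n"
    )
    recipe = {"filts_buf_loads": "filts_buf[0 + tid] = filts[tid]"}
    print(template % recipe)   # emits the specialized kernel shown in the workflow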
Optimizing Winograd transformations [Symbolic analysis]

• Represent the target matrix by symbols
• Perform the multiplication and obtain the result, e.g. for the F(2,3) filter transform Gg (for each column j = 0, 1, 2):

    (Gg)_{:,j} = [ -1*g_{0,j} + 0 + 0
                   g_{0,j}/2 + g_{1,j}/2 + g_{2,j}/2
                   g_{0,j}/2 - g_{1,j}/2 + g_{2,j}/2
                   0 + 0 + 1*g_{2,j} ]

Matrix multiplication code before optimization:

    for (i = 0; i < alpha; i++) {
      for (j = 0; j < r; j++) {
        Gg[i][j] = 0;
        for (k = 0; k < r; k++)
          Gg[i][j] += G[i][k] * g[k][j];
      }
    }
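A sketch of this step with sympy (our tooling choice; the paper's own symbolic engine may differ): multiplying the constant transform matrix G by a fully symbolic 3×3 filter yields the expressions above, and every 0 and ±1 coefficient vanishes automatically. The G below is the F(2,3) variant implied by the generated code shown later in the talk.

    import sympy as sp

    # symbolic 3x3 filter g and the constant F(2,3) filter-transform matrix G
    g = sp.Matrix(3, 3, lambda i, j: sp.Symbol(f'g{i}{j}'))
    half = sp.Rational(1, 2)
    G = sp.Matrix([[-1,    0,     0],
                   [half,  half,  half],
                   [half, -half,  half],
                   [0,     0,     1]])

    Gg = (G * g).expand()
    sp.pprint(Gg[:, 0])  # [-g00, g00/2 + g10/2 + g20/2, g00/2 - g10/2 + g20/2, g20]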
Optimizing Winograd transformations [Remove 1s and 0s]

Multiplications by 0 and ±1 disappear from the symbolic result:

    -1*g_{0,j} + 0 + 0   ->   -g_{0,j}
    0 + 0 + 1*g_{2,j}    ->    g_{2,j}

    (Gg)_{:,j} = [ -g_{0,j}
                   g_{0,j}/2 + g_{1,j}/2 + g_{2,j}/2
                   g_{0,j}/2 - g_{1,j}/2 + g_{2,j}/2
                   g_{2,j} ]
Optimizing Winograd transformations [Index representation]

Identical expressions across columns are collapsed into a single column index j:

    (Gg)_{:,j} = [ -g_{0,j}
                   g_{0,j}/2 + g_{1,j}/2 + g_{2,j}/2
                   g_{0,j}/2 - g_{1,j}/2 + g_{2,j}/2
                   g_{2,j} ]
Optimizing Winograd transformations [Factorization]

The common factor 1/2 is pulled out of each sum:

    g_{0,j}/2 + g_{1,j}/2 + g_{2,j}/2   ->   1/2 * (g_{0,j} + g_{1,j} + g_{2,j})
    g_{0,j}/2 - g_{1,j}/2 + g_{2,j}/2   ->   1/2 * (g_{0,j} - g_{1,j} + g_{2,j})

    (Gg)_{:,j} = [ -g_{0,j}
                   1/2 * (g_{0,j} + g_{1,j} + g_{2,j})
                   1/2 * (g_{0,j} - g_{1,j} + g_{2,j})
                   g_{2,j} ]
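The same factorization step with sympy (again our tooling choice, sketching the idea rather than the paper's implementation):

    import sympy as sp

    g0, g1, g2 = sp.symbols('g0 g1 g2')
    print(sp.factor_terms(g0/2 + g1/2 + g2/2))   # -> (g0 + g1 + g2)/2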
Optimizing Winograd transformations [Common subexpression elimination]

The sum g_{0,j} + g_{2,j} appears in two rows and is computed only once:

    1/2 * (g_{0,j} + g_{1,j} + g_{2,j})   ->   1/2 * (cse1 + g_{1,j})
    1/2 * (g_{0,j} - g_{1,j} + g_{2,j})   ->   1/2 * (cse1 - g_{1,j})

    where cse1 = g_{0,j} + g_{2,j}
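sympy can perform this step directly via sympy.cse (a sketch under our tooling assumption; the auxiliary names such as x0 are sympy's defaults, and the exact printed form may vary by version):

    import sympy as sp

    g0, g1, g2 = sp.symbols('g0 g1 g2')
    rows = [-g0, (g0 + g1 + g2)/2, (g0 - g1 + g2)/2, g2]
    repl, reduced = sp.cse(rows)
    print(repl)     # typically: [(x0, g0 + g2)]
    print(reduced)  # typically: [-g0, (g1 + x0)/2, (-g1 + x0)/2, g2]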
Optimizing Winograd transformations [Code generation]

Final expressions (with cse1 = g_{0,j} + g_{2,j}):

    (Gg)_{:,j} = [ -g_{0,j}
                   1/2 * (cse1 + g_{1,j})
                   1/2 * (cse1 - g_{1,j})
                   g_{2,j} ]

Before optimizations:

    for (i = 0; i < alpha; i++) {
      for (j = 0; j < r; j++) {
        Gg[i][j] = 0;
        for (k = 0; k < r; k++)
          Gg[i][j] += G[i][k] * g[k][j];
      }
    }

After optimizations (rows fully unrolled; j runs over the r = 3 filter columns):

    for (j = 0; j < 3; j++) {
      cse1 = g[0][j] + g[2][j];
      Gg[0][j] = -g[0][j];
      Gg[1][j] = 0.5 * (cse1 + g[1][j]);
      Gg[2][j] = 0.5 * (cse1 - g[1][j]);
      Gg[3][j] = g[2][j];
    }
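A quick check (ours, not from the paper) that the generated code computes the same filter transform as the naive triple loop, using the G matrix implied by the unrolled statements and random 3×3 filters:

    import numpy as np

    G = np.array([[-1, 0, 0], [.5, .5, .5], [.5, -.5, .5], [0, 0, 1]])

    def optimized(g):                 # mirrors the generated code above
        Gg = np.empty((4, 3))
        for j in range(3):
            cse1 = g[0][j] + g[2][j]
            Gg[0][j] = -g[0][j]
            Gg[1][j] = 0.5 * (cse1 + g[1][j])
            Gg[2][j] = 0.5 * (cse1 - g[1][j])
            Gg[3][j] = g[2][j]
        return Gg

    g = np.random.rand(3, 3)
    assert np.allclose(optimized(g), G @ g)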
Performance auto-tuning

Tuning knobs:
• Winograd variant
• Thread blocking
• Register blocking
• Loop-unrolling factor
• Winograd output tile size

A tensor-operation kernel is generated per knob configuration, e.g. a 7×7 filter load for a Winograd variant with α = 8:

    float g[7][7]; float Gg[8][7]; float tmp[8][8];
    const GASQ float *B = filts_ref + (k * 3 + c) * 7 * 7;
    for (int i = 0; i < 7; ++i)
      for (int j = 0; j < 7; ++j)
        g[i][j] = B[7*j + i];

and the lowest-runtime kernel is selected (a tuning-loop sketch follows below).
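A hypothetical auto-tuning loop (the knob names, values, and benchmark function are ours, not the paper's API): enumerate knob configurations, time each generated kernel variant, and keep the fastest one.

    import itertools
    import random

    KNOBS = {
        "tile_m": [2, 3, 4, 5, 6],       # Winograd output tile size m
        "block":  [(8, 8), (16, 16)],    # thread blocking
        "regs":   [2, 4],                # register blocking
        "unroll": [1, 2, 4],             # loop-unrolling factor
    }

    def benchmark(config):
        # Placeholder: the real system compiles the kernel variant for
        # `config` and times it on the GPU; here we fake a runtime in ms.
        return random.uniform(0.1, 1.0)

    configs = [dict(zip(KNOBS, vals)) for vals in itertools.product(*KNOBS.values())]
    best = min(configs, key=benchmark)
    print("lowest-runtime kernel config:", best)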
Winograd convolution accuracy

• L1-norm error analysis for various Winograd internal tile sizes

[Figure: L1-norm error (log scale, roughly 10^-7 to 10^-1) and error-increase rate for α = 4 .. 16; error grows with the internal tile size, and an annotation marks the point of lowest error growth.]
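A sketch of the measurement idea (ours, reusing the standard F(2,3) matrices from the earlier 1-D example): run one Winograd tile in float32 and compare against a float64 direct convolution. Larger α values use transform matrices with larger entries, which amplifies this kind of rounding error.

    import numpy as np

    rng = np.random.default_rng(0)
    d64 = rng.standard_normal(4); g64 = rng.standard_normal(3)
    ref = np.convolve(d64, g64[::-1], mode='valid')   # float64 reference

    BT = np.array([[1,0,-1,0],[0,1,1,0],[0,-1,1,0],[0,1,0,-1]], np.float32)
    G  = np.array([[1,0,0],[.5,.5,.5],[.5,-.5,.5],[0,0,1]], np.float32)
    AT = np.array([[1,1,1,0],[0,1,-1,-1]], np.float32)
    d, g = d64.astype(np.float32), g64.astype(np.float32)

    y = AT @ ((G @ g) * (BT @ d))                     # float32 Winograd tile
    print("L1-norm error:", np.abs(y - ref).sum())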
Winograd transformation optimization results

• Overall arithmetic reduction ratios from the transformation optimizations, both for the transformation steps alone and for the whole Winograd algorithm, for a single tile

[Figure: arithmetic reduction ratios for F(2,r) .. F(10,r) with r = 3, 5, 7; the α = 8 variant is annotated in each group.]

• Runtime comparison on an Nvidia 1080 Ti: non-optimized vs. optimized transformations

[Figure: runtime (ms) and speedup ratio for 3×3, 5×5, and 7×7 convolutions across F(2,r) .. F(9,r).]