1. StrassenNets: Deep Learning with a Multiplication Budget. Michael Tschannen* (michaelt@nari.ee.ethz.ch), 13 July 2018. Joint work with Aran Khanna* and Anima Anandkumar*. (*Work done at Amazon AI.)

2. Motivation. Outstanding predictive performance of deep neural networks (DNNs) comes at the cost of high computational complexity and high energy consumption.

3. Motivation. Outstanding predictive performance of deep neural networks (DNNs) comes at the cost of high computational complexity and high energy consumption. Known solutions:
◮ Architectural optimizations [Iandola et al. 2016, Howard et al. 2017, Zhang et al. 2017]
◮ Factorizations of weight matrices and tensors [Denton et al. 2014, Novikov et al. 2015, Kossaifi et al. 2017, Kim et al. 2017]
◮ Pruning of weights and filters [Liu et al. 2015, Wen et al. 2016, Lebedev et al. 2016]
◮ Reducing the numerical precision of weights and activations [Courbariaux et al. 2015, Rastegari et al. 2016, Zhou et al. 2016, Lin et al. 2017]

4. Motivation. Our approach: reducing the number of multiplications as a guiding principle.

5. Motivation. Our approach: reducing the number of multiplications as a guiding principle.
◮ This strategy has led to many fast algorithms: Strassen's matrix multiplication algorithm, Winograd filter-based convolution [Gray & Lavin 2016]

6. Motivation. Our approach: reducing the number of multiplications as a guiding principle.
◮ This strategy has led to many fast algorithms: Strassen's matrix multiplication algorithm, Winograd filter-based convolution [Gray & Lavin 2016]
◮ DNNs with {−1, 0, 1}-valued weights have 60% higher throughput on an FPGA than on a GPU, while being 2.3× better in performance/watt [Nurvitadhi et al. 2017]

7. Motivation. Our approach: reducing the number of multiplications as a guiding principle.
◮ This strategy has led to many fast algorithms: Strassen's matrix multiplication algorithm, Winograd filter-based convolution [Gray & Lavin 2016]
◮ DNNs with {−1, 0, 1}-valued weights have 60% higher throughput on an FPGA than on a GPU, while being 2.3× better in performance/watt [Nurvitadhi et al. 2017]
◮ Multiplications take up to 32× more cycles than additions on (low-end) MCUs

8. Motivation. Our approach: reducing the number of multiplications as a guiding principle.
◮ This strategy has led to many fast algorithms: Strassen's matrix multiplication algorithm, Winograd filter-based convolution [Gray & Lavin 2016]
◮ DNNs with {−1, 0, 1}-valued weights have 60% higher throughput on an FPGA than on a GPU, while being 2.3× better in performance/watt [Nurvitadhi et al. 2017]
◮ Multiplications take up to 32× more cycles than additions on (low-end) MCUs
◮ On ASICs, additions are more area-efficient and hence consume much less energy (3–30× less [Horowitz 2014]) than multiplications

9. Casting matrix multiplications as 2-layer sum-product networks (SPNs). A large fraction of the arithmetic operations in DNNs is due to matrix multiplications.

10. Casting matrix multiplications as 2-layer sum-product networks (SPNs). A large fraction of the arithmetic operations in DNNs is due to matrix multiplications.
[Diagram: 2-layer SPN with r hidden units computing vec(C) = W_c ((W_a vec(A)) ⊙ (W_b vec(B))), equivalent to C = AB.]

11. Casting matrix multiplications as 2-layer sum-product networks (SPNs). A large fraction of the arithmetic operations in DNNs is due to matrix multiplications.
[Diagram: 2-layer SPN with r hidden units computing vec(C) = W_c ((W_a vec(A)) ⊙ (W_b vec(B))), equivalent to C = AB.]
A is k × m, B is m × n: ternary ({−1, 0, 1}) W_a, W_b, W_c exist if r ≥ nmk.
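
To make the existence claim concrete, here is a small NumPy sketch (an illustration, not taken from the talk) of the trivial ternary construction with exactly r = nmk hidden units: unit (i, j, l) picks out A[i, j] and B[j, l], and W_c sums the resulting products into the entries of C.

```python
import numpy as np

def naive_spn_matrices(k, m, n):
    """Trivial ternary SPN realizing C = A @ B for A of shape (k, m) and B of shape
    (m, n), using r = k*m*n hidden units: unit (i, j, l) computes A[i, j] * B[j, l]."""
    r = k * m * n
    Wa = np.zeros((r, k * m), dtype=np.int8)   # selects A[i, j] from row-major vec(A)
    Wb = np.zeros((r, m * n), dtype=np.int8)   # selects B[j, l] from row-major vec(B)
    Wc = np.zeros((k * n, r), dtype=np.int8)   # sums the products into vec(C)
    idx = 0
    for i in range(k):
        for j in range(m):
            for l in range(n):
                Wa[idx, i * m + j] = 1
                Wb[idx, j * n + l] = 1
                Wc[i * n + l, idx] = 1
                idx += 1
    return Wa, Wb, Wc

k, m, n = 3, 4, 2
A, B = np.random.randn(k, m), np.random.randn(m, n)
Wa, Wb, Wc = naive_spn_matrices(k, m, n)
hidden = (Wa @ A.ravel()) * (Wb @ B.ravel())      # r = k*m*n multiplications
assert np.allclose((Wc @ hidden).reshape(k, n), A @ B)
```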

12. Casting matrix multiplications as 2-layer sum-product networks (SPNs). A large fraction of the arithmetic operations in DNNs is due to matrix multiplications.
[Diagram: 2-layer SPN with r hidden units computing vec(C) = W_c ((W_a vec(A)) ⊙ (W_b vec(B))), equivalent to C = AB.]
A is k × m, B is m × n: ternary ({−1, 0, 1}) W_a, W_b, W_c exist if r ≥ nmk.
A, B are 2 × 2: Strassen's algorithm gives ternary W_a, W_b, W_c for r = 7.
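
For 2 × 2 matrices, Strassen's classic algorithm fits this SPN template with r = 7. The sketch below hard-codes the textbook Strassen coefficients as ternary W_a, W_b, W_c (assuming row-major vectorization; these are the standard matrices, not necessarily the ones learned by the method) and verifies that the SPN reproduces C = AB.

```python
import numpy as np

# Ternary SPN matrices encoding Strassen's algorithm for 2 x 2 matrices
# (row-major vec(A) = [a11, a12, a21, a22], likewise for B and C).
Wa = np.array([[ 1, 0, 0, 1],    # m1 uses a11 + a22
               [ 0, 0, 1, 1],    # m2 uses a21 + a22
               [ 1, 0, 0, 0],    # m3 uses a11
               [ 0, 0, 0, 1],    # m4 uses a22
               [ 1, 1, 0, 0],    # m5 uses a11 + a12
               [-1, 0, 1, 0],    # m6 uses a21 - a11
               [ 0, 1, 0,-1]])   # m7 uses a12 - a22
Wb = np.array([[ 1, 0, 0, 1],    # m1 uses b11 + b22
               [ 1, 0, 0, 0],    # m2 uses b11
               [ 0, 1, 0,-1],    # m3 uses b12 - b22
               [-1, 0, 1, 0],    # m4 uses b21 - b11
               [ 0, 0, 0, 1],    # m5 uses b22
               [ 1, 1, 0, 0],    # m6 uses b11 + b12
               [ 0, 0, 1, 1]])   # m7 uses b21 + b22
Wc = np.array([[ 1, 0, 0, 1,-1, 0, 1],   # c11 = m1 + m4 - m5 + m7
               [ 0, 0, 1, 0, 1, 0, 0],   # c12 = m3 + m5
               [ 0, 1, 0, 1, 0, 0, 0],   # c21 = m2 + m4
               [ 1,-1, 1, 0, 0, 1, 0]])  # c22 = m1 - m2 + m3 + m6

A, B = np.random.randn(2, 2), np.random.randn(2, 2)
products = (Wa @ A.ravel()) * (Wb @ B.ravel())   # exactly r = 7 multiplications
assert np.allclose((Wc @ products).reshape(2, 2), A @ B)
```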

13. Casting matrix multiplications as 2-layer sum-product networks (SPNs). A large fraction of the arithmetic operations in DNNs is due to matrix multiplications.
[Diagram: 2-layer SPN with r hidden units computing vec(C) = W_c ((W_a vec(A)) ⊙ (W_b vec(B))), equivalent to C = AB.]
Change of assumptions: with A fixed and B distributed on a low-dimensional "manifold", approximate multiplication can be realized for r ≪ nmk.

14. Casting matrix multiplications as 2-layer sum-product networks (SPNs). A large fraction of the arithmetic operations in DNNs is due to matrix multiplications.
[Diagram: 2-layer SPN with r hidden units computing vec(C) = W_c ((W_a vec(A)) ⊙ (W_b vec(B))), equivalent to C = AB.]
Idea: associate A with the weights/filters and B with the activations/feature maps, and learn W_a, W_b, W_c with r ≪ nmk end-to-end. Alternatively, learn ã = W_a vec(A) from scratch.

15. Casting matrix multiplications as 2-layer sum-product networks (SPNs). A large fraction of the arithmetic operations in DNNs is due to matrix multiplications.
[Diagram: 2-layer SPN with r hidden units computing vec(C) = W_c (ã ⊙ (W_b vec(B))), where the learned vector ã replaces W_a vec(A).]
Idea: associate A with the weights/filters and B with the activations/feature maps, and learn W_a, W_b, W_c with r ≪ nmk end-to-end. Alternatively, learn ã = W_a vec(A) from scratch.
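
At inference time the first layer collapses into the fixed vector ã, and since W_b and W_c are ternary, their matrix-vector products need only additions and subtractions; the r entries of the elementwise product are the only real multiplications. A minimal sketch with hypothetical sizes (the shapes and the random ternary matrices below are illustrative, not learned):

```python
import numpy as np

def ternary_matvec(W, x):
    """Matrix-vector product with a {-1, 0, 1}-valued W: on hardware this needs only
    additions/subtractions (sign flips and accumulation), no multiplications."""
    pos = (W == 1).astype(x.dtype)
    neg = (W == -1).astype(x.dtype)
    return pos @ x - neg @ x

def spn_inference(a_tilde, Wb, Wc, vec_b):
    """Evaluate vec(C) = Wc (a_tilde ⊙ (Wb vec(B))).
    Real-valued multiplications: exactly r = len(a_tilde), in the elementwise product."""
    hidden = a_tilde * ternary_matvec(Wb, vec_b)   # r multiplications
    return ternary_matvec(Wc, hidden)              # additions only

# Hypothetical sizes: a 64 x 64 times 64 x 1 product (nmk = 4096) compressed to r = 256.
r, m, k = 256, 64, 64
a_tilde = np.random.randn(r)                       # learned, fixed at inference time
Wb = np.random.randint(-1, 2, size=(r, m))         # stand-ins for learned ternary matrices
Wc = np.random.randint(-1, 2, size=(k, r))
vec_c = spn_inference(a_tilde, Wb, Wc, np.random.randn(m))
print(vec_c.shape, "computed with", r, "multiplications")
```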

16. Application to 2D convolution. Writing the convolution as a matrix multiplication (im2col) → impractically large W_a, W_b, W_c.

17. Application to 2D convolution. Writing the convolution as a matrix multiplication (im2col) → impractically large W_a, W_b, W_c.
Compress the computation of c_out × p × p outputs from c_in × (p − 1 + k) × (p − 1 + k) inputs.

18. Application to 2D convolution. Writing the convolution as a matrix multiplication (im2col) → impractically large W_a, W_b, W_c.
Compress the computation of c_out × p × p outputs from c_in × (p − 1 + k) × (p − 1 + k) inputs.
[Diagram: W_b convolution mapping the c_in input channels to r intermediate channels, channelwise multiplication with ã, and W_c convolution mapping the r channels to c_out output channels of spatial size p × p.]

19. Application to 2D convolution. Writing the convolution as a matrix multiplication (im2col) → impractically large W_a, W_b, W_c.
Compress the computation of c_out × p × p outputs from c_in × (p − 1 + k) × (p − 1 + k) inputs.
[Diagram: W_b realized as a convolution with r filters of shape c_in × (p − 1 + k) × (p − 1 + k), stride p, g groups; channelwise multiplication with ã; W_c realized as a transposed convolution (stride 1/p) with c_out filters of shape r × p × p.]

20. Application to 2D convolution. Writing the convolution as a matrix multiplication (im2col) → impractically large W_a, W_b, W_c.
Compress the computation of c_out × p × p outputs from c_in × (p − 1 + k) × (p − 1 + k) inputs.
[Diagram: W_b realized as a convolution with r filters of shape c_in × (p − 1 + k) × (p − 1 + k), stride p, g groups; channelwise multiplication with ã; W_c realized as a transposed convolution (stride 1/p) with c_out filters of shape r × p × p.]
→ multiplication reduction by a factor of c_in c_out k² p² / r.
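
The layer can be assembled from standard framework primitives. Below is a PyTorch-style sketch (for illustration only; the padding choice for odd k and the initialization are my assumptions, and the ternary quantization of W_b, W_c during training is omitted): W_b becomes a grouped convolution with stride p, the r intermediate channels are scaled channelwise by ã, and W_c becomes a transposed convolution with stride p, i.e. "stride 1/p".

```python
import torch
import torch.nn as nn

class StrassenConv2d(nn.Module):
    """Strassen-style 2D convolution: c_in -> c_out channels, k x k kernel, computing
    p x p output pixels at a time with r multiplications per output patch."""
    def __init__(self, c_in, c_out, k, r, p=1, g=1):
        super().__init__()
        assert k % 2 == 1 and c_in % g == 0 and r % g == 0
        # Wb: r filters of shape (c_in/g) x (p-1+k) x (p-1+k), stride p, g groups.
        self.Wb = nn.Conv2d(c_in, r, kernel_size=p - 1 + k, stride=p,
                            padding=(k - 1) // 2, groups=g, bias=False)
        # a_tilde: one learned scale per hidden channel -> the only real multiplications.
        self.a_tilde = nn.Parameter(torch.randn(1, r, 1, 1))
        # Wc: transposed convolution ("stride 1/p") producing c_out x p x p per patch.
        self.Wc = nn.ConvTranspose2d(r, c_out, kernel_size=p, stride=p, bias=False)

    def forward(self, x):
        h = self.Wb(x)            # additions only once Wb is ternary
        h = h * self.a_tilde      # r multiplications per p x p output patch
        return self.Wc(h)         # additions only once Wc is ternary

# Example: a 3x3, 64 -> 64 convolution needs 64*64*9 = 36864 mults per output pixel;
# with p = 2 and r = 128, this layer needs r / p^2 = 32 mults per output pixel.
layer = StrassenConv2d(c_in=64, c_out=64, k=3, r=128, p=2, g=1)
y = layer(torch.randn(1, 64, 32, 32))
print(y.shape)   # torch.Size([1, 64, 32, 32])
```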

21. Training.
◮ SGD with momentum
◮ Quantize (W_a), W_b, W_c with the method described by [Li et al. 2016]: quantization in the forward pass, straight-through gradient estimator for the backward pass, gradient step on the full-precision weights
◮ Pretraining with full-precision weights
◮ Knowledge distillation [Hinton et al. 2015]: L_KD(f_S, f_T; x, y) = (1 − λ) L(f_S(x), y) + λ CE(f_S(x), f_T(x))
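
A compact PyTorch sketch of these two ingredients (an illustration, not the authors' code): ternary quantization in the forward pass with a straight-through gradient, where the 0.7 · mean(|w|) threshold is the TWN choice from [Li et al. 2016] and the per-tensor scaling factor is omitted to keep the weights strictly {−1, 0, 1}; and the distillation loss with CE implemented as a soft cross-entropy between student and teacher outputs.

```python
import torch
import torch.nn.functional as F

class TernaryQuant(torch.autograd.Function):
    """Forward: quantize to {-1, 0, 1} with a TWN-style threshold [Li et al. 2016]
    (their per-tensor scaling factor is omitted here).
    Backward: straight-through estimator, gradient passes to the full-precision weights."""
    @staticmethod
    def forward(ctx, w):
        delta = 0.7 * w.abs().mean()                  # threshold assumed from TWN
        return torch.sign(w) * (w.abs() > delta).float()
    @staticmethod
    def backward(ctx, grad_out):
        return grad_out                               # straight-through

def kd_loss(student_logits, teacher_logits, target, lam=0.5):
    """L_KD = (1 - lambda) * L(f_S(x), y) + lambda * CE(f_S(x), f_T(x)),
    with CE taken as soft cross-entropy between output distributions (assumption)."""
    hard = F.cross_entropy(student_logits, target)
    soft = -(F.softmax(teacher_logits, dim=1)
             * F.log_softmax(student_logits, dim=1)).sum(dim=1).mean()
    return (1 - lam) * hard + lam * soft

# Usage: quantize in the forward pass; SGD updates the full-precision weights w_full.
w_full = torch.randn(128, 64, requires_grad=True)
w_q = TernaryQuant.apply(w_full)                      # used in place of w_full in a layer
logits_s, logits_t = torch.randn(8, 10), torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(kd_loss(logits_s, logits_t, labels, lam=0.5))
```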

22. Experiment: ResNet-18 on ImageNet.
[Plots: top-1 accuracy [%] vs. number of multiplications, number of additions, and model size [MB]; baselines: FP (full precision), TTQ, TWN, BWN.]

23. Experiment: ResNet-18 on ImageNet.
[Plots: top-1 accuracy [%] vs. number of multiplications, number of additions, and model size [MB]; baselines: FP (full precision), TTQ, TWN, BWN.]
Legend: blue: p = 2, g = 1; green: p = 1, g = 1; red: p = 1, g = 4; marker type encodes r/c_out.

25. Experiment: Character-CNN language model on Penn Treebank. Compact model proposed by [Kim et al. 2016]:
◮ Word-level decoder
◮ 2-layer LSTM, 650 units
◮ 2-layer highway network, 650 units
◮ Convolution layer, 1100 filters
◮ Character-level embedding

26. Experiment: Character-CNN language model on Penn Treebank.
[Plots: testing perplexity vs. number of multiplications, number of additions, and model size [MB]; baselines: FP (full precision), TWN.]

27. Rediscovering Strassen's algorithm. Learn to multiply 2 × 2 matrices using 7 multiplications: W_a, W_b ∈ {−1, 0, 1}^(7×4), W_c ∈ {−1, 0, 1}^(4×7) → solution space of size 3^(3·4·7) = 3^84. L2 loss, 100k synthetic training examples, 25 random initializations.
[Learned ternary matrices W_a, W_b (7 × 4) and W_c (4 × 7) shown on the slide; they realize an exact 2 × 2 matrix multiplication with 7 multiplications, i.e. a Strassen-like algorithm.]
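
A minimal reconstruction of that experiment (shapes and L2 loss from the slide; the straight-through ternarization, the optimizer, and the number of steps are my guesses, and only a single random initialization is shown):

```python
import torch

def ternarize(w):
    """Straight-through ternary quantization to {-1, 0, 1} (threshold assumed)."""
    delta = 0.7 * w.abs().mean()
    q = torch.sign(w) * (w.abs() > delta).float()
    return w + (q - w).detach()          # quantized forward value, identity gradient

# Full-precision parameters behind the ternary Wa, Wb (7 x 4) and Wc (4 x 7).
Wa = torch.randn(7, 4, requires_grad=True)
Wb = torch.randn(7, 4, requires_grad=True)
Wc = torch.randn(4, 7, requires_grad=True)
opt = torch.optim.Adam([Wa, Wb, Wc], lr=1e-2)

for step in range(20000):                # stand-in for the 100k synthetic examples
    A = torch.randn(64, 4)               # batch of row-major vectorized 2 x 2 matrices
    B = torch.randn(64, 4)
    target = torch.bmm(A.view(-1, 2, 2), B.view(-1, 2, 2)).reshape(-1, 4)
    prod = (A @ ternarize(Wa).t()) * (B @ ternarize(Wb).t())   # 7 products per pair
    pred = prod @ ternarize(Wc).t()
    loss = ((pred - target) ** 2).mean() # L2 loss from the slide
    opt.zero_grad()
    loss.backward()
    opt.step()

# If training succeeds, the ternary Wa, Wb, Wc realize an exact (Strassen-like)
# 2 x 2 matrix multiplication using only 7 multiplications.
```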

  28. Summary & Outlook Proposed and evaluated a versatile framework to learn fast approximate matrix multiplications for DNNs end-to-end Over 99.5% multiplication reduction in image classification and language modeling applications while maintaining predictive performance Method can learn fast exact 2 × 2 matrix multiplication 14 / 16
