Training Binary Neural Networks Using the Bayesian Learning Rule
Xiangming Meng (RIKEN AIP), Roman Bachmann (EPFL), Mohammad Emtiyaz Khan (RIKEN AIP)
Thirty-seventh International Conference on Machine Learning (ICML 2020)
Binary Neural Networks (BiNN)
• BiNN: neural networks with binary weights
• Much faster and much smaller [1,2]
• Difficult to optimize in theory (discrete optimization)
• But easy in practice: just use SGD with the “straight-through estimator” (STE)!
• It is mysterious as to why this works [3]
• Are there any principled approaches to explain this?
1. Courbariaux et al. Training deep neural networks with binary weights during propagations. NeurIPS, 2015.
2. Courbariaux et al. Binarized neural networks. arXiv:1602.02830, 2016.
3. Yin, P. et al. Understanding straight-through estimator in training activation quantized neural nets. arXiv, 2019.
Our Contribution: Training BiNN Using Bayes
• We show that the Bayesian learning rule [1,2] (natural-gradient variational inference) justifies such previous approaches
• Main point: optimize the parameters of a Bernoulli distribution (a continuous optimization problem)
• The Bayesian approach also gives an estimate of uncertainty, which can be used for continual learning [3]
1. Khan, M. E. and Rue, H. Learning-algorithms from Bayesian principles. arXiv, 2019.
2. Khan, M. E. and Lin, W. Conjugate-computation variational inference. AISTATS, 2017.
3. Kirkpatrick, J. et al. Overcoming catastrophic forgetting in neural networks. PNAS, 114(13):3521–3526, 2017.
Training BiNN Is a Discrete Optimization Problem!
[Figure: a neural network maps an input to an output using binary weights, which feed into a loss]
• Easy in practice: SGD with the “straight-through estimator” (STE) [1], which updates real-valued “latent” weights
• Helwegen et al. [2] argued that the “latent” weights are not weights but “inertia”, and proposed the Binary Optimizer (Bop)
• Open question: why does this work? [3]
1. Bengio et al. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv:1308.3432, 2013.
2. Helwegen et al. Latent weights do not exist: Rethinking binarized neural network optimization. arXiv:1906.02107, 2019.
3. Yin, P. et al. Understanding straight-through estimator in training activation quantized neural nets. arXiv, 2019.
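For concreteness, here is a minimal sketch of the STE recipe above on a toy problem. The quadratic loss, the target vector, and the learning rate are illustrative choices and not part of the slides:

```python
# Minimal sketch of SGD with the straight-through estimator (STE):
# forward pass uses sign(latent weights); the gradient w.r.t. the binary
# weights is passed "straight through" to update the latent weights.
import numpy as np

rng = np.random.default_rng(0)
w_latent = rng.normal(size=5)           # real-valued "latent" weights
w_target = np.sign(rng.normal(size=5))  # toy binary target

def loss_grad(w_bin):
    # Gradient of a toy loss: squared distance to the binary target.
    return 2.0 * (w_bin - w_target)

lr = 0.1
for step in range(100):
    w_bin = np.sign(w_latent)       # forward pass with binary weights
    g = loss_grad(w_bin)            # gradient taken at the binary weights
    w_latent -= lr * g              # STE: apply it to the latent weights

print(np.sign(w_latent) == w_target)  # should be all True on this toy problem
```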
BayesBiNN
• Main point: optimize the parameters of a Bernoulli distribution (a continuous optimization problem)
• Problem reformulation: optimize a distribution over weights [1,2]

    min_{q(w)}  𝔼_{q(w)}[ ℓ(y, f_w(x)) ] + KL( q(w) || p(w) ),

  where q(w) is the posterior approximation, p(w) is the prior distribution, the first term is the expected loss over weights, and the second term is the KL divergence.
• q(w) is chosen to be a mean-field Bernoulli distribution:

    q(w) = ∏_{i=1}^{D} p_i^{(1+w_i)/2} (1 − p_i)^{(1−w_i)/2} = ∏_{i=1}^{D} exp[ λ_i ϕ(w_i) − A(λ_i) ],   w_i ∈ {−1, +1},

  where p_i is the probability of w_i = +1 and the natural parameters are λ_i := (1/2) log( p_i / (1 − p_i) ).
1. Zellner, A. Optimal information processing and Bayes’s theorem. The American Statistician, 42(4):278–280, 1988.
2. Bissiri et al. A general framework for updating belief distributions. Journal of the Royal Statistical Society, 78(5):1103–1130, 2016.
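A small sketch of this parameterization (the helper names are ours): converting between the probabilities p_i, the natural parameters λ_i, and the mean μ_i = 𝔼[w_i], and drawing a binary weight sample.

```python
# Mean-field Bernoulli over {-1, +1} weights: p <-> lambda <-> mu, plus sampling.
import numpy as np

rng = np.random.default_rng(0)

def p_to_lambda(p):
    return 0.5 * np.log(p / (1.0 - p))       # lambda_i = 1/2 * log(p_i / (1 - p_i))

def lambda_to_p(lam):
    return 1.0 / (1.0 + np.exp(-2.0 * lam))  # inverse map: p_i = sigmoid(2 * lambda_i)

def lambda_to_mu(lam):
    return np.tanh(lam)                      # mean of w_i in {-1, +1}: mu_i = 2 p_i - 1

def sample_w(lam, rng):
    p = lambda_to_p(lam)
    return np.where(rng.random(lam.shape) < p, 1.0, -1.0)

lam = p_to_lambda(np.array([0.1, 0.5, 0.9]))
print(lambda_to_p(lam))    # recovers [0.1, 0.5, 0.9]
print(lambda_to_mu(lam))   # [-0.8, 0.0, 0.8]
print(sample_w(lam, rng))  # one binary weight sample in {-1, +1}^3
```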
BayesBiNN
• The Bayesian learning rule [1] (natural-gradient variational inference) updates the natural parameter λ of q(w):

    λ ← (1 − α) λ + α ( λ_0 − ∇_μ 𝔼_{q(w)}[ ℓ(y, f_w(x)) ] ),

  where α is the learning rate, μ is the expectation parameter of q(w), and λ_0 is the natural parameter of the prior p(w).
• How to compute the natural gradient ∇_μ 𝔼_{q(w)}[ℓ]?
• Using the Gumbel-softmax trick [2,3], we can approximate the natural gradient by a minibatch gradient, which is easy to compute:

    ∇_μ 𝔼_{q(w)}[ ℓ(y, f_w(x)) ] ≈ s ⊙ ∇_w ℓ(y, f_w(x)) |_{w = w_r},

  where w_r is a relaxed binary sample and s is a scale vector.
1. Khan, M. E. and Rue, H. Learning-algorithms from Bayesian principles. arXiv, 2019.
2. Maddison et al. The concrete distribution: A continuous relaxation of discrete random variables. arXiv:1611.00712, 2016.
3. Jang et al. Categorical reparameterization with Gumbel-softmax. arXiv:1611.01144, 2016.
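Below is a sketch of one BayesBiNN-style update under the relaxation above. The toy loss, the hyperparameters, and the exact forms of the relaxed sample w_r and the scale vector s follow our reading of the method and should be treated as assumptions of this sketch, not as the reference implementation.

```python
# One BayesBiNN-style update of the natural parameters lambda, with the
# natural gradient approximated by a scaled minibatch gradient at a relaxed sample.
import numpy as np

rng = np.random.default_rng(0)
D = 5
lam = np.zeros(D)                        # natural parameters of q(w)
lam0 = np.zeros(D)                       # natural parameters of the prior (uniform)
w_target = np.sign(rng.normal(size=D))   # toy binary target defining a toy loss

def grad_loss(w):
    # Gradient of a toy loss: squared distance to the binary target.
    return 2.0 * (w - w_target)

alpha, tau = 0.1, 1.0                    # learning rate and relaxation temperature
for step in range(200):
    eps = rng.random(D)
    delta = 0.5 * np.log(eps / (1.0 - eps))       # logistic noise for the relaxation
    w_r = np.tanh((lam + delta) / tau)            # relaxed binary weights in (-1, 1)
    s = (1.0 - w_r**2) / (tau * (1.0 - np.tanh(lam)**2) + 1e-10)  # assumed scale vector
    nat_grad = s * grad_loss(w_r)                 # approximate natural gradient
    lam = (1.0 - alpha) * lam + alpha * (lam0 - nat_grad)

print(np.sign(lam) == w_target)                   # lambda should align with the toy target
```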
BayesBiNN Justifies Some Previous Methods
• Note that the natural parameter λ in BayesBiNN plays the role of the real-valued “latent” weights w_r in STE
• Main point 1: STE works as a special case of BayesBiNN in the limit τ → 0
• Main point 2: BayesBiNN justifies the “exponential average” used in Bop
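To make the correspondence concrete, here is an illustrative (not verbatim) comparison of the STE update on latent weights with the BayesBiNN λ-update in the zero-noise, τ → 0 limit, using a toy gradient; the loss, constants, and the idea of absorbing the scale vector into an effective step size are assumptions of this sketch.

```python
# Structural comparison: STE latent-weight update vs. BayesBiNN lambda-update
# in the deterministic, vanishing-temperature limit.
import numpy as np

w_target = np.array([1.0, -1.0, 1.0])
grad = lambda w: 2.0 * (w - w_target)   # gradient of a toy loss at binary weights

def ste_step(w_latent, lr):
    return w_latent - lr * grad(np.sign(w_latent))

def bayesbinn_step_limit(lam, alpha, lr_eff):
    # No noise and tau -> 0: the relaxed sample tanh((lam + delta)/tau) becomes
    # sign(lam); the scale vector is absorbed into an effective step size lr_eff.
    return (1.0 - alpha) * lam - alpha * lr_eff * grad(np.sign(lam))

w_latent = np.array([0.3, 0.2, -0.4])
lam = np.array([0.3, 0.2, -0.4])
print(ste_step(w_latent, lr=0.1))
print(bayesbinn_step_limit(lam, alpha=0.1, lr_eff=1.0))
# Apart from the (1 - alpha) decay on lambda (an exponential moving average, as in Bop),
# both updates move a real-valued parameter using the gradient taken at sign(.).
```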
Uncertainty Estimation
• Main point: BayesBiNN obtains uncertainty estimates around the classification boundaries via a Monte Carlo average of the predictions:

    p̂_k ← (1/C) ∑_{c=1}^{C} p( y = k | x, w^{(c)} ),   w^{(c)} ~ q(w),   C = 10

  [Figure: classification on the two moons dataset]
• STE finds a deterministic boundary
• Open-source code available: https://github.com/team-approx-bayes/BayesBiNN
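A minimal sketch of the Monte Carlo predictive average p̂_k above; the tiny two-class "network", its predict() function, and the input x are purely illustrative.

```python
# Predictive averaging: draw C binary weight samples from q(w) and average
# the class probabilities; values near 0.5 signal high uncertainty.
import numpy as np

rng = np.random.default_rng(0)
D, C = 4, 10
lam = rng.normal(size=D)                  # learned natural parameters of q(w)

def predict(x, w):
    # Toy "network": a linear score per class followed by a softmax.
    logits = np.array([np.dot(w, x), -np.dot(w, x)])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def sample_w(lam, rng):
    p = 1.0 / (1.0 + np.exp(-2.0 * lam))  # p_i = Prob(w_i = +1)
    return np.where(rng.random(lam.shape) < p, 1.0, -1.0)

x = rng.normal(size=D)
p_hat = np.mean([predict(x, sample_w(lam, rng)) for _ in range(C)], axis=0)
print(p_hat)   # averaged class probabilities for this input
```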
[Figure: BayesBiNN ≈ STE]
Uncertainty Provided by BayesBiNN Enables Continual Learning
• Main point: BayesBiNN enables continual learning (CL) for BiNN, using its intrinsic KL divergence as a regularizer
• CL: sequentially learning new tasks without forgetting old ones (overcoming catastrophic forgetting) [1]
• Common method: regularizing the weights
• But it is unclear how to regularize the binary weights of a BiNN trained with STE/Bop
• In BayesBiNN, there is a natural solution using the KL divergence. For task t with data D_t,

    min_{q_t(w)}  𝔼_{q_t(w)}[ ∑_{i ∈ D_t} ℓ( y_i^t, f_w(x_i^t) ) ] + KL( q_t(w) || p(w) ),

  where independent learning uses a uniform prior p(w); for continual learning, the prior is replaced by the posterior learned on the previous task, q_{t−1}(w).
1. Kirkpatrick, J. et al. Overcoming catastrophic forgetting in neural networks. PNAS, 114(13):3521–3526, 2017.
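A sketch of this continual-learning recipe, reusing the toy BayesBiNN update from earlier: the prior natural parameter for task t is set to the λ learned on task t−1, so the KL term keeps q_t(w) close to the previous posterior. The toy per-task losses and all constants are assumptions of this sketch.

```python
# Continual learning with BayesBiNN-style updates: the previous task's lambda
# serves as the prior natural parameter for the next task.
import numpy as np

rng = np.random.default_rng(0)
D, alpha, tau = 5, 0.1, 1.0

def train_task(lam_prior, grad_loss, steps=200):
    lam = lam_prior.copy()
    for _ in range(steps):
        eps = rng.random(D)
        delta = 0.5 * np.log(eps / (1.0 - eps))
        w_r = np.tanh((lam + delta) / tau)
        s = (1.0 - w_r**2) / (tau * (1.0 - np.tanh(lam)**2) + 1e-10)
        lam = (1.0 - alpha) * lam + alpha * (lam_prior - s * grad_loss(w_r))
    return lam

# Two toy tasks that each constrain only part of the weight vector.
t1 = np.array([1.0, 1.0, -1.0, 0.0, 0.0])
t2 = np.array([0.0, 0.0, 0.0, 1.0, -1.0])
grad1 = lambda w: 2.0 * (w - t1) * (t1 != 0)
grad2 = lambda w: 2.0 * (w - t2) * (t2 != 0)

lam1 = train_task(np.zeros(D), grad1)   # task 1: uniform prior (lambda_0 = 0)
lam2 = train_task(lam1, grad2)          # task 2: previous posterior as prior
print(np.sign(lam2))                    # retains the task-1 signs on the first three weights
```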