Bayesian Deep Learning and Restricted Boltzmann Machines


  1. Bayesian Deep Learning and Restricted Boltzmann Machines. Narada Warakagoda, Forsvarets Forskningsinstitutt, ndw@ffi.no. November 1, 2018.

  2. Overview: (1) Probability Review, (2) Bayesian Deep Learning, (3) Restricted Boltzmann Machines.

  3. Probability Review

  4. Probability and Statistics Basics
     Normal (Gaussian) distribution: $p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right) = \mathcal{N}(\boldsymbol{\mu}, \Sigma)$
     Categorical distribution: $P(x) = \prod_{i=1}^{k} p_i^{[x = i]}$
     Sampling: $\mathbf{x} \sim p(\mathbf{x})$
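
A minimal sketch of these two distributions and of drawing samples $\mathbf{x} \sim p(\mathbf{x})$, assuming NumPy is available; the concrete means, covariances, and probabilities are made-up illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Multivariate Gaussian N(mu, Sigma): illustrative mean and covariance
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
x = rng.multivariate_normal(mu, Sigma, size=1000)   # x ~ N(mu, Sigma)
print(x.mean(axis=0), np.cov(x.T))                  # close to mu and Sigma

# Categorical distribution P(x = i) = p_i
p = np.array([0.2, 0.5, 0.3])
c = rng.choice(len(p), size=1000, p=p)              # samples in {0, 1, 2}
print(np.bincount(c) / len(c))                      # empirical frequencies close to p
```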

  5. Probability and Statistics Basics
     Independent variables: $p(\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_k) = \prod_{i=1}^{k} p(\mathbf{x}_i)$
     Expectation: $\mathbb{E}_{p(\mathbf{x})} f(\mathbf{x}) = \int f(\mathbf{x}) \, p(\mathbf{x}) \, d\mathbf{x}$
     or, for discrete variables, $\mathbb{E}_{p(\mathbf{x})} f(\mathbf{x}) = \sum_{i=1}^{k} f(\mathbf{x}_i) \, P(\mathbf{x}_i)$
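
As a quick numeric check of the discrete expectation formula, the sketch below compares the exact sum $\sum_i f(\mathbf{x}_i) P(\mathbf{x}_i)$ with a sampling-based estimate; the distribution and the function $f$ are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

P = np.array([0.2, 0.5, 0.3])           # P(x = i) for i = 0, 1, 2
f = np.array([1.0, 4.0, 9.0])           # f(x_i)

exact = np.sum(f * P)                   # E_P[f(x)] by the discrete sum
samples = rng.choice(len(P), size=100_000, p=P)
estimate = f[samples].mean()            # sampling-based estimate of the same expectation
print(exact, estimate)                  # the two values agree closely
```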

  6. Kullback-Leibler Distance
     $KL(q(\mathbf{x}) \,\|\, p(\mathbf{x})) = \mathbb{E}_{q(\mathbf{x})} \log \frac{q(\mathbf{x})}{p(\mathbf{x})} = \int \left[ q(\mathbf{x}) \log q(\mathbf{x}) - q(\mathbf{x}) \log p(\mathbf{x}) \right] d\mathbf{x}$
     For the discrete case: $KL(Q(\mathbf{x}) \,\|\, P(\mathbf{x})) = \sum_{i=1}^{k} \left[ Q(\mathbf{x}_i) \log Q(\mathbf{x}_i) - Q(\mathbf{x}_i) \log P(\mathbf{x}_i) \right]$
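
The discrete formula can be computed directly; a tiny sketch with two made-up distributions, which also shows that KL is not symmetric and is zero only when Q = P:

```python
import numpy as np

def kl_discrete(Q, P):
    """KL(Q || P) = sum_i Q_i (log Q_i - log P_i) for discrete distributions."""
    Q, P = np.asarray(Q, float), np.asarray(P, float)
    mask = Q > 0                                  # terms with Q_i = 0 contribute 0
    return np.sum(Q[mask] * (np.log(Q[mask]) - np.log(P[mask])))

Q = [0.1, 0.6, 0.3]
P = [0.3, 0.4, 0.3]
print(kl_discrete(Q, P), kl_discrete(P, Q))       # asymmetric
print(kl_discrete(Q, Q))                          # 0.0
```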

  7. Bayesian Deep Learning

  8. Bayesian Statistics
     Joint distribution: $p(\mathbf{x}, \mathbf{y}) = p(\mathbf{x} \mid \mathbf{y}) \, p(\mathbf{y})$
     Marginalization: $p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{y}) \, d\mathbf{y}$, or $P(\mathbf{x}) = \sum_{\mathbf{y}} P(\mathbf{x}, \mathbf{y})$ for the discrete case
     Conditional distribution (Bayes' rule): $p(\mathbf{x} \mid \mathbf{y}) = \frac{p(\mathbf{x}, \mathbf{y})}{p(\mathbf{y})} = \frac{p(\mathbf{y} \mid \mathbf{x}) \, p(\mathbf{x})}{\int p(\mathbf{y} \mid \mathbf{x}) \, p(\mathbf{x}) \, d\mathbf{x}}$
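
A minimal discrete illustration of marginalization and Bayes' rule, using an arbitrary 2x2 joint probability table chosen only for the example:

```python
import numpy as np

# Joint distribution P(x, y) as a table: rows index x, columns index y (illustrative numbers)
P_xy = np.array([[0.10, 0.30],
                 [0.40, 0.20]])

P_x = P_xy.sum(axis=1)                    # marginalization: P(x) = sum_y P(x, y)
P_y = P_xy.sum(axis=0)                    # P(y) = sum_x P(x, y)

P_x_given_y = P_xy / P_y                  # conditional: P(x | y) = P(x, y) / P(y)
P_y_given_x = P_xy / P_x[:, None]         # P(y | x) = P(x, y) / P(x)

# Bayes' rule: P(x | y) = P(y | x) P(x) / sum_x P(y | x) P(x)
bayes = (P_y_given_x * P_x[:, None]) / np.sum(P_y_given_x * P_x[:, None], axis=0)
print(np.allclose(P_x_given_y, bayes))    # True: both routes give the same conditional
```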

  9. Statistical View of Neural Networks
     Prediction (regression): $p(\mathbf{y} \mid \mathbf{x}, \mathbf{w}) = \mathcal{N}(\mathbf{f}_{\mathbf{w}}(\mathbf{x}), \Sigma)$
     Classification: $P(y \mid \mathbf{x}, \mathbf{w}) = \prod_{i=1}^{k} f_{\mathbf{w},i}(\mathbf{x})^{[y = i]}$, where $f_{\mathbf{w},i}(\mathbf{x})$ is the $i$-th component of $\mathbf{f}_{\mathbf{w}}(\mathbf{x})$
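
To make this view concrete, a hedged sketch: a hypothetical network output is read either as the mean of a Gaussian with a fixed $\Sigma$ (regression) or, after a softmax, as the categorical probabilities $f_{\mathbf{w},i}(\mathbf{x})$ (classification). The raw output values and the softmax helper are made up for illustration, not taken from the presentation:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                       # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical raw network outputs for one input x (stand-ins for f_w(x))
regression_out = np.array([2.1, -0.4])    # read as the mean of N(f_w(x), Sigma)
Sigma = 0.25 * np.eye(2)                  # assumed fixed output covariance

logits = np.array([1.5, 0.2, -1.0])       # raw classifier outputs
f = softmax(logits)                       # f_i(x): categorical class probabilities

y = 0                                     # a class label
likelihood = np.prod(f ** (np.arange(len(f)) == y))   # P(y | x, w) = prod_i f_i^{[y=i]}
print(f, likelihood)                      # the likelihood equals f[y]
```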

  10. Training Criteria
     Maximum Likelihood (ML): $\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} \, p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w})$
     Maximum A Posteriori (MAP): $\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} \, p(\mathbf{Y}, \mathbf{w} \mid \mathbf{X}) = \arg\max_{\mathbf{w}} \, p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w}) \, p(\mathbf{w})$
     Bayesian: $p(\mathbf{w} \mid \mathbf{Y}, \mathbf{X}) = \frac{p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w}) \, p(\mathbf{w})}{p(\mathbf{Y} \mid \mathbf{X})} = \frac{p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w}) \, p(\mathbf{w})}{\int p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w}) \, p(\mathbf{w}) \, d\mathbf{w}}$
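
One way to make the ML/MAP distinction concrete: with a Gaussian likelihood and a Gaussian prior on the weights, the negative log of the MAP objective is the squared-error loss plus an L2 penalty, so MAP behaves like ML with weight decay. A minimal sketch for a linear model; the toy data, noise variance, and prior variance are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                    # toy inputs
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true + 0.1 * rng.normal(size=50)      # toy targets

sigma2 = 0.1 ** 2                               # assumed observation noise variance
tau2 = 1.0                                      # assumed prior variance, p(w) = N(0, tau2 I)

def neg_log_likelihood(w):                      # -log p(Y | X, w), up to a constant
    return np.sum((Y - X @ w) ** 2) / (2 * sigma2)

def neg_log_posterior(w):                       # -log [p(Y | X, w) p(w)], up to a constant
    return neg_log_likelihood(w) + np.sum(w ** 2) / (2 * tau2)

# ML solution: least squares; MAP solution: ridge regression with lambda = sigma2 / tau2
w_ml = np.linalg.lstsq(X, Y, rcond=None)[0]
lam = sigma2 / tau2
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ Y)
print(w_ml, w_map)
```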

  11. Motivation for Bayesian Approach

  12. Motivation for Bayesian Approach

  13. Uncertainty with the Bayesian Approach
     Not only the prediction/classification itself but also its uncertainty can be calculated.
     Since we have $p(\mathbf{w} \mid \mathbf{Y}, \mathbf{X})$, we can sample $\mathbf{w}$ and use each sample as the network parameters when calculating the prediction/classification $p(\tilde{y} \mid \tilde{\mathbf{x}}, \mathbf{w})$ (i.e. the network output for a given input).
     The prediction/classification is the mean of $p(\tilde{y} \mid \tilde{\mathbf{x}}, \mathbf{w})$: $p_{\text{out}} = p(\tilde{y} \mid \tilde{\mathbf{x}}, \mathbf{Y}, \mathbf{X}) = \int p(\tilde{y} \mid \tilde{\mathbf{x}}, \mathbf{w}) \, p(\mathbf{w} \mid \mathbf{Y}, \mathbf{X}) \, d\mathbf{w}$
     The uncertainty of the prediction/classification is the variance of $p(\tilde{y} \mid \tilde{\mathbf{x}}, \mathbf{w})$: $\mathrm{Var}(p(\tilde{y} \mid \tilde{\mathbf{x}}, \mathbf{w})) = \int \left[ p(\tilde{y} \mid \tilde{\mathbf{x}}, \mathbf{w}) - p_{\text{out}} \right]^2 p(\mathbf{w} \mid \mathbf{Y}, \mathbf{X}) \, d\mathbf{w}$
     Uncertainty is important in safety-critical applications (e.g. self-driving cars, medical diagnosis, military applications).
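
These two integrals can be estimated by averaging over weight samples. A hedged sketch, where both the posterior samples and the predict function are placeholders rather than the presentation's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(x, w):
    """Hypothetical network output p(y~ | x~, w) for one input; here a toy sigmoid model."""
    return 1.0 / (1.0 + np.exp(-(w @ x)))

# Pretend these are samples from p(w | Y, X); here just a stand-in Gaussian
w_samples = rng.normal(loc=[1.0, -0.5], scale=0.3, size=(1000, 2))

x_new = np.array([0.8, 1.2])
preds = np.array([predict(x_new, w) for w in w_samples])

p_out = preds.mean()              # prediction: mean over the posterior samples
uncertainty = preds.var()         # uncertainty: variance over the posterior samples
print(p_out, uncertainty)
```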

  14. Other Advantages of the Bayesian Approach
     Natural interpretation of regularization
     Model selection
     Input data selection (active learning)

  15. Main Challenge of the Bayesian Approach
     We calculate, for the continuous case: $p(\mathbf{w} \mid \mathbf{Y}, \mathbf{X}) = \frac{p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w}) \, p(\mathbf{w})}{\int p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w}) \, p(\mathbf{w}) \, d\mathbf{w}}$
     and for the discrete case: $P(\mathbf{w} \mid \mathbf{Y}, \mathbf{X}) = \frac{p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w}) \, P(\mathbf{w})}{\sum_{\mathbf{w}} p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w}) \, P(\mathbf{w})}$
     Calculating the denominator is often intractable. E.g. consider a weight vector $\mathbf{w}$ of 100 elements, each of which can take two values. Then there are $2^{100} \approx 1.27 \times 10^{30}$ different weight vectors. Compare this with the universe's age of 13.7 billion years (about $4.3 \times 10^{17}$ seconds).
     We need approximations.
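
The counting argument checks out with back-of-the-envelope arithmetic; the one-configuration-per-nanosecond rate and the seconds-per-year factor below are rough assumptions made only for illustration:

```python
n_configs = 2 ** 100                      # number of distinct weight vectors
print(f"{n_configs:.3e}")                 # ~1.268e+30
universe_age_s = 13.7e9 * 3.15e7          # rough age of the universe in seconds
checked = universe_age_s * 1e9            # configurations visited at one per nanosecond
print(f"{checked:.3e}")                   # ~4.3e+26: still roughly 3000x short of 2^100
```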

  16. Different Approaches
     Monte Carlo techniques (e.g. Markov Chain Monte Carlo, MCMC)
     Variational inference
     Introducing random elements in training (e.g. dropout)

  17. Advantages and Disadvantages of Different Approaches
     Markov Chain Monte Carlo (MCMC): asymptotically exact, but computationally expensive.
     Variational inference: no guarantee of exactness, but the possibility of faster computation.

  18. Monte Carlo Techniques
     We are interested in
     $p_{\text{out}} = \mathrm{Mean}(p(\tilde{y} \mid \tilde{\mathbf{x}}, \mathbf{w})) = p(\tilde{y} \mid \tilde{\mathbf{x}}, \mathbf{Y}, \mathbf{X}) = \int p(\tilde{y} \mid \tilde{\mathbf{x}}, \mathbf{w}) \, p(\mathbf{w} \mid \mathbf{Y}, \mathbf{X}) \, d\mathbf{w}$
     $\mathrm{Var}(p(\tilde{y} \mid \tilde{\mathbf{x}}, \mathbf{w})) = \int \left[ p(\tilde{y} \mid \tilde{\mathbf{x}}, \mathbf{w}) - p_{\text{out}} \right]^2 p(\mathbf{w} \mid \mathbf{Y}, \mathbf{X}) \, d\mathbf{w}$
     Both are integrals of the type $I = \int F(\mathbf{w}) \, p(\mathbf{w} \mid \mathcal{D}) \, d\mathbf{w}$, where $\mathcal{D} = (\mathbf{Y}, \mathbf{X})$ is the training data.
     Approximate the integral by sampling $\mathbf{w}_i$ from $p(\mathbf{w} \mid \mathcal{D})$: $I \approx \frac{1}{L} \sum_{i=1}^{L} F(\mathbf{w}_i)$
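
A minimal sketch of this approximation on a case where the answer is known: with $F(w) = w^2$ and $p(w)$ a standard normal, the exact integral is 1, and the sample average converges to it. The choices of $F$ and $p$ are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def F(w):
    return w ** 2                          # any function of the parameters

L = 10_000
w_samples = rng.standard_normal(L)         # stand-in for samples w_i ~ p(w | D)
I_mc = F(w_samples).mean()                 # I ~ (1/L) sum_i F(w_i)
print(I_mc)                                # close to the exact value E[w^2] = 1 under N(0, 1)
```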

  19. Monte Carlo Techniques
     Challenge: we do not have the posterior $p(\mathbf{w} \mid \mathcal{D}) = p(\mathbf{w} \mid \mathbf{Y}, \mathbf{X}) = \frac{p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w}) \, p(\mathbf{w})}{\int p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w}) \, p(\mathbf{w}) \, d\mathbf{w}}$
     "Solution": use importance sampling, i.e. sample from a proposal distribution $q(\mathbf{w})$:
     $I = \int F(\mathbf{w}) \, \frac{p(\mathbf{w} \mid \mathcal{D})}{q(\mathbf{w})} \, q(\mathbf{w}) \, d\mathbf{w} \approx \frac{1}{L} \sum_{i=1}^{L} F(\mathbf{w}_i) \, \frac{p(\mathbf{w}_i \mid \mathcal{D})}{q(\mathbf{w}_i)}$
     Problem: we still do not have $p(\mathbf{w} \mid \mathcal{D})$.

  20. Monte Carlo Techniques
     Problem: we still do not have $p(\mathbf{w} \mid \mathcal{D})$.
     Solution: use the unnormalized posterior $\tilde{p}(\mathbf{w} \mid \mathcal{D}) = p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w}) \, p(\mathbf{w})$, where the normalization factor is $Z = \int p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w}) \, p(\mathbf{w}) \, d\mathbf{w}$, such that $p(\mathbf{w} \mid \mathcal{D}) = \frac{\tilde{p}(\mathbf{w} \mid \mathcal{D})}{Z}$
     The integral can then be calculated with: $I \approx \frac{\sum_{i=1}^{L} F(\mathbf{w}_i) \, \tilde{p}(\mathbf{w}_i \mid \mathcal{D}) / q(\mathbf{w}_i)}{\sum_{i=1}^{L} \tilde{p}(\mathbf{w}_i \mid \mathcal{D}) / q(\mathbf{w}_i)}$
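
A hedged one-dimensional sketch of this ratio estimator. The "posterior" here is a toy unnormalized density standing in for $p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w}) \, p(\mathbf{w})$, and the proposal is a deliberately wide Gaussian; the unknown $Z$ cancels in the ratio, so it is never computed:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(w):
    """Unnormalized target, standing in for p(Y | X, w) p(w); here an unnormalized N(2, 0.5^2)."""
    return np.exp(-0.5 * ((w - 2.0) / 0.5) ** 2)

def F(w):
    return w                                      # estimate the posterior mean as an example

q_mu, q_sigma = 0.0, 3.0                          # proposal q(w) = N(0, 3^2), chosen wide enough

def q_pdf(w):
    return np.exp(-0.5 * ((w - q_mu) / q_sigma) ** 2) / (q_sigma * np.sqrt(2 * np.pi))

L = 50_000
w = rng.normal(q_mu, q_sigma, size=L)             # w_i ~ q(w)
ratios = p_tilde(w) / q_pdf(w)                    # p~(w_i | D) / q(w_i)

I_est = np.sum(F(w) * ratios) / np.sum(ratios)    # the normalizer Z cancels in this ratio
print(I_est)                                      # close to the true mean 2.0
```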

  21. Weakness of Importance Sampling
     The proposal distribution must be close to the non-zero areas of the original distribution $p(\mathbf{w} \mid \mathcal{D})$.
     In neural networks, $p(\mathbf{w} \mid \mathcal{D})$ is typically small except in a few narrow areas. Blind sampling from $q(\mathbf{w})$ therefore has a high chance of producing samples that fall outside the non-zero areas of $p(\mathbf{w} \mid \mathcal{D})$.
     We must actively try to get samples that lie close to $p(\mathbf{w} \mid \mathcal{D})$. Markov Chain Monte Carlo (MCMC) is such a technique.

  22. Metropolis Algorithm
     The Metropolis algorithm is an example of MCMC.
     Draw samples repeatedly from the random walk $\mathbf{w}_{t+1} = \mathbf{w}_t + \boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon}$ is a small random vector, $\boldsymbol{\epsilon} \sim q(\boldsymbol{\epsilon})$ (e.g. Gaussian noise).
     The sample drawn at step $t$ is accepted or rejected based on the ratio $\frac{\tilde{p}(\mathbf{w}_t \mid \mathcal{D})}{\tilde{p}(\mathbf{w}_{t-1} \mid \mathcal{D})}$:
     If $\tilde{p}(\mathbf{w}_t \mid \mathcal{D}) > \tilde{p}(\mathbf{w}_{t-1} \mid \mathcal{D})$, accept the sample.
     If $\tilde{p}(\mathbf{w}_t \mid \mathcal{D}) < \tilde{p}(\mathbf{w}_{t-1} \mid \mathcal{D})$, accept the sample with probability $\frac{\tilde{p}(\mathbf{w}_t \mid \mathcal{D})}{\tilde{p}(\mathbf{w}_{t-1} \mid \mathcal{D})}$.
     If a sample is accepted, use it for calculating $I$; the same formula as before can be used: $I \approx \frac{\sum_{i=1}^{L} F(\mathbf{w}_i) \, \tilde{p}(\mathbf{w}_i \mid \mathcal{D}) / q(\mathbf{w}_i)}{\sum_{i=1}^{L} \tilde{p}(\mathbf{w}_i \mid \mathcal{D}) / q(\mathbf{w}_i)}$
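
A minimal random-walk Metropolis sketch in one dimension, with a toy unnormalized density standing in for $\tilde{p}(\mathbf{w} \mid \mathcal{D})$. As a simplification of the slide's recipe, the chain keeps the current state on rejection (the usual convention) and a burn-in stretch is discarded before averaging:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(w):
    """Unnormalized target, standing in for p(w | D) up to the factor Z."""
    return np.exp(-0.5 * ((w - 2.0) / 0.5) ** 2)

w = 0.0                                         # initial state
samples = []
for t in range(20_000):
    w_prop = w + rng.normal(scale=0.3)          # random walk: w_{t+1} = w_t + eps
    ratio = p_tilde(w_prop) / p_tilde(w)
    if ratio >= 1 or rng.uniform() < ratio:     # accept if better, else with probability = ratio
        w = w_prop
    samples.append(w)                           # keep the current state on rejection

samples = np.array(samples[2_000:])             # discard an initial burn-in period
print(samples.mean(), samples.var())            # approximate posterior mean and variance
```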

  23. Other Monte Carlo and Related Techniques
     Hybrid Monte Carlo (Hamiltonian Monte Carlo): similar to the Metropolis algorithm, but uses gradient information rather than a random walk.
     Simulated annealing

  24. Variational Inference
     Goal: computation of the posterior $p(\mathbf{w} \mid \mathcal{D})$, i.e. the distribution of the neural network parameters $\mathbf{w}$ given the data $\mathcal{D} = (\mathbf{Y}, \mathbf{X})$. But this computation is often intractable.
     Idea: find a distribution $q(\mathbf{w})$ from a family of distributions $\mathcal{Q}$ such that $q(\mathbf{w})$ closely approximates $p(\mathbf{w} \mid \mathcal{D})$.
     How do we measure the distance between $q(\mathbf{w})$ and $p(\mathbf{w} \mid \mathcal{D})$? With the Kullback-Leibler distance $KL\left( q(\mathbf{w}) \,\|\, p(\mathbf{w} \mid \mathcal{D}) \right)$.
     The problem can be formulated as $\hat{p}(\mathbf{w} \mid \mathcal{D}) = \arg\min_{q(\mathbf{w})} KL\left( q(\mathbf{w}) \,\|\, p(\mathbf{w} \mid \mathcal{D}) \right)$
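
A very small sketch of this formulation on a one-dimensional grid: a Gaussian family $q(w; m, s)$ is searched over a coarse set of $(m, s)$ values, and the member with the smallest (discretized) KL to a toy, numerically normalized posterior is kept. The toy posterior and the grid search are assumptions for illustration only; practical variational inference instead optimizes the parameters of $q$ with gradients:

```python
import numpy as np

grid = np.linspace(-5, 10, 2_000)                     # 1-D grid over w
dw = grid[1] - grid[0]

def p_tilde(w):
    """Unnormalized toy 'posterior' standing in for p(Y | X, w) p(w)."""
    return np.exp(-0.5 * ((w - 2.0) / 0.7) ** 2) + 0.3 * np.exp(-0.5 * ((w - 4.0) / 0.5) ** 2)

p = p_tilde(grid)
p /= p.sum() * dw                                     # normalize numerically on the grid

def gaussian(w, m, s):
    return np.exp(-0.5 * ((w - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def kl_q_p(q, p):
    """Grid approximation of KL(q || p)."""
    mask = q > 1e-12
    return np.sum(q[mask] * (np.log(q[mask]) - np.log(p[mask]))) * dw

best = min((kl_q_p(gaussian(grid, m, s), p), m, s)
           for m in np.linspace(0, 5, 51)
           for s in np.linspace(0.2, 2.0, 19))
print(best)                                           # smallest KL and the chosen (m, s)
```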
