Bayesian Deep Learning
Prof. Leal-Taixé and Prof. Niessner
Going full Bayesian
• Bayes = probabilities; hypothesis = model, evidence = data
• Bayes' theorem: $p(H \mid E) = \frac{p(E \mid H)\, p(H)}{p(E)}$
Going full Bayesian
• Start with a prior on the model parameters $p(\theta)$
• Choose a statistical model for the data, $p(x \mid \theta)$
• Use the data to refine the prior, i.e., compute the posterior $p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$; the denominator $p(x)$ has no dependence on the parameters
Going full Bayesian
• Start with a prior on the model parameters $p(\theta)$
• Choose a statistical model for the data, $p(x \mid \theta)$
• Use the data to refine the prior, i.e., compute the posterior: $\underbrace{p(\theta \mid x)}_{\text{posterior}} \propto \underbrace{p(x \mid \theta)}_{\text{likelihood}}\, \underbrace{p(\theta)}_{\text{prior}}$ (a small worked example follows below)
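As a concrete illustration (not from the slides), the following is a minimal sketch of this prior-to-posterior update for a conjugate Beta-Bernoulli model, where the posterior is available in closed form; the prior parameters and data are made up:

```python
import numpy as np
from scipy import stats

# Hypothetical example: infer the bias theta of a coin from observed flips.
# Prior: theta ~ Beta(2, 2); likelihood: each flip x_i ~ Bernoulli(theta).
prior_a, prior_b = 2.0, 2.0
data = np.array([1, 0, 1, 1, 1, 0, 1, 1])  # observed flips (1 = heads)

# Conjugacy: the posterior is again Beta(prior_a + #heads, prior_b + #tails),
# so the evidence p(x) never has to be computed explicitly.
post_a = prior_a + data.sum()
post_b = prior_b + len(data) - data.sum()
posterior = stats.beta(post_a, post_b)

print("Posterior mean:", posterior.mean())                  # point estimate
print("95% credible interval:", posterior.interval(0.95))   # uncertainty
```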
Going full Bayesian
• Learning: computing the posterior
– Finding a point estimate (MAP) → what we have been doing so far: maximize the unnormalized posterior $p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta)$
– Finding a full probability distribution over $\theta$ → this lecture: $p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$
What have we learned so far?
• Advantages of deep learning models:
– Very expressive models
– Good for tasks such as classification, regression, sequence prediction
– Modular structure, efficient training, many tools
– Scales well with large amounts of data
• But there are also disadvantages:
– "Black-box" feeling
– We cannot judge how "confident" the model is about a decision
Modeling uncertainty
• Wish list:
– We want to know what our models know and what they do not know
Modeling uncertainty
• Example: I have built a dog breed classifier with the classes Bulldog, German Shepherd, and Chihuahua. What answer will my NN give?
Modeling uncertainty
• Example: I have built a dog breed classifier with the classes Bulldog, German Shepherd, and Chihuahua. I would rather get as an answer that my model is not certain about the type of dog breed.
Modeling uncertainty
• Wish list:
– We want to know what our models know and what they do not know
• Why do we care?
– Decision making
– Learning from limited, noisy, and missing data
– Insights into why a model failed
Modeling uncertainty
• Finding the posterior:
– Finding a point estimate (MAP) → what we have been doing so far!
– Finding a full probability distribution over $\theta$
Image: https://medium.com/@joeDiHare/deep-bayesian-neural-networks-952763a9537
Modeling uncertainty
• We can sample many times from the parameter distribution and see how this affects our model's predictions
• If the predictions are consistent, the model is confident
Image: https://medium.com/@joeDiHare/deep-bayesian-neural-networks-952763a9537
Modeling uncertainty
(Figure: a prediction annotated "I am not really sure")
Kendall & Gal. "What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?" NIPS 2017
Why?
How do we get the posterior?
• Compute the posterior over the weights: $p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$
• The evidence $p(x) = \int_\theta p(x \mid \theta)\, p(\theta)\, d\theta$ is the probability of observing our data under all possible model parameters, so $p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{\int_\theta p(x \mid \theta)\, p(\theta)\, d\theta}$. How do we compute this?
How do we get the posterior?
• How do we compute $p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{\int_\theta p(x \mid \theta)\, p(\theta)\, d\theta}$?
• The denominator is intractable: we cannot integrate over all possible parameter combinations
• Two ways to compute an approximation of the posterior:
– Markov Chain Monte Carlo
– Variational Inference
How do we get the posterior?
• Markov Chain Monte Carlo (MCMC)
– A chain of samples $\theta_t \to \theta_{t+1} \to \theta_{t+2} \to \dots$ that converges to $p(\theta \mid x)$ (SLOW)
• Variational Inference
– Find an approximation $q(\theta)$ by solving $\arg\min_q \mathrm{KL}\left(q(\theta) \,\|\, p(\theta \mid x)\right)$ (a minimal sketch follows below)
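Not from the slides, but as a minimal sketch of the variational-inference idea: the hypothetical PyTorch snippet below fits a Gaussian $q(\theta)$ to a simple 1-D posterior by minimizing a single-sample estimate of the negative ELBO (equivalent to minimizing the KL divergence above up to a constant). All data and names are made up for illustration.

```python
import torch

torch.manual_seed(0)
data = torch.randn(20) + 2.0                      # hypothetical observations

def log_joint(theta):
    # log p(x | theta) + log p(theta): unit-variance Gaussian likelihood,
    # standard-normal prior on theta.
    log_lik = torch.distributions.Normal(theta, 1.0).log_prob(data).sum()
    log_prior = torch.distributions.Normal(0.0, 1.0).log_prob(theta).sum()
    return log_lik + log_prior

# Variational parameters of q(theta) = N(mu, softplus(rho)^2)
mu = torch.zeros(1, requires_grad=True)
rho = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, rho], lr=0.05)

for step in range(2000):
    opt.zero_grad()
    sigma = torch.nn.functional.softplus(rho)
    q = torch.distributions.Normal(mu, sigma)
    theta = q.rsample()                           # reparameterized sample
    # Single-sample estimate of the negative ELBO: E_q[log q(theta) - log p(x, theta)]
    loss = q.log_prob(theta).sum() - log_joint(theta)
    loss.backward()
    opt.step()

print("q mean:", mu.item())
print("q std:", torch.nn.functional.softplus(rho).item())
```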
Dropout for Bayesian Inference
Recall: Dropout
• Disable a random set of neurons (typically 50%) in the forward pass
Srivastava et al. 2014
Recall: Dropout
• Using half the network = half capacity → redundant representations
(Figure: features such as "furry", "has two eyes", "has a tail", "has paws", "has two ears")
Recall: Dropout
• Using half the network = half capacity
– Redundant representations
– Base your scores on more features
• Consider it as a model ensemble
Recall: Dropout
• Two models in one (Figure: Model 1, Model 2); see the sketch below
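As an illustration (not the lecture's code), here is a minimal NumPy sketch of inverted dropout: every training forward pass samples a fresh binary mask, so effectively a different thinned sub-network is used each time, while at test time the layer is the identity.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, p_drop=0.5, train=True):
    """Inverted dropout: zero a random fraction p_drop of the activations during
    training and rescale the survivors so the expected activation is unchanged."""
    if not train:
        return activations                           # identity at test time
    mask = rng.random(activations.shape) >= p_drop   # keep ~50% of the neurons
    return activations * mask / (1.0 - p_drop)

x = rng.standard_normal(8)
print(dropout_forward(x))   # two different "thinned" sub-networks
print(dropout_forward(x))
```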
MC dropout
• Variational Inference
– Find an approximation $q(\theta)$ that minimizes $\mathrm{KL}\left(q(\theta) \,\|\, p(\theta \mid x)\right)$
• Dropout training
– The variational distribution is based on a Bernoulli distribution (where the states are "on" and "off")
Y. Gal, Z. Ghahramani, "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning", ICML 2016
MC dropout
1. Train a model with dropout before every weight layer
2. Apply dropout at test time
– Sampling is done in a Monte Carlo fashion, hence the name Monte Carlo dropout
Y. Gal, Z. Ghahramani, "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning", ICML 2016
MC dropout
– Sampling is done in a Monte Carlo fashion, e.g., for classification: $p(y = c \mid x) \approx \frac{1}{T} \sum_{t=1}^{T} \mathrm{Softmax}\big(f^{\hat{\theta}_t}(x)\big)$, where $f$ is the neural network, $\hat{\theta}_t \sim q(\theta)$ are the sampled parameters, and $q(\theta)$ is the dropout distribution (a minimal sketch follows below)
Y. Gal, Z. Ghahramani, "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning", ICML 2016
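A minimal PyTorch sketch of MC dropout at test time (an illustration, not the authors' code; the architecture and sizes are made up): the dropout layers are kept stochastic during inference, the softmax outputs of T forward passes are averaged, and their spread serves as a simple uncertainty estimate.

```python
import torch
import torch.nn as nn

# Hypothetical small classifier with dropout before every weight layer.
model = nn.Sequential(
    nn.Dropout(p=0.5), nn.Linear(16, 64), nn.ReLU(),
    nn.Dropout(p=0.5), nn.Linear(64, 3),
)

def mc_dropout_predict(model, x, num_samples=50):
    """Average softmax outputs over num_samples stochastic forward passes."""
    model.eval()
    for m in model.modules():            # keep dropout active at test time
        if isinstance(m, nn.Dropout):
            m.train()
    with torch.no_grad():
        probs = torch.stack([
            torch.softmax(model(x), dim=-1) for _ in range(num_samples)
        ])                               # (num_samples, batch, classes)
    return probs.mean(dim=0), probs.std(dim=0)

x = torch.randn(4, 16)                   # dummy batch
mean_probs, std_probs = mc_dropout_predict(model, x)
print(mean_probs)                        # averaged class probabilities
print(std_probs)                         # per-class spread = uncertainty estimate
```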
Measure your model's uncertainty
Kendall & Gal. "What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?" NIPS 2017
Variational Autoencoders
Recall: Autoencoders
• Encode the input into a low-dimensional representation (bottleneck) and reconstruct it with the decoder; see the sketch below
(Figure: Encoder (Conv): $x \to z$; Decoder (Transpose Conv): $z \to \tilde{x}$)
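As a minimal sketch (not the lecture's architecture; layer sizes are assumed for 28x28 single-channel inputs), a convolutional autoencoder with an encoder, a bottleneck $z$, and a transposed-convolution decoder trained by minimizing the reconstruction error:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical plain autoencoder for 28x28 single-channel images.
encoder = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),    # 28x28 -> 14x14
    nn.Conv2d(16, 4, 3, stride=2, padding=1),                # 14x14 -> 7x7 bottleneck z
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(4, 16, 2, stride=2), nn.ReLU(),       # 7x7  -> 14x14
    nn.ConvTranspose2d(16, 1, 2, stride=2), nn.Sigmoid(),    # 14x14 -> 28x28
)

x = torch.rand(8, 1, 28, 28)               # dummy batch
z = encoder(x)                             # bottleneck representation
x_tilde = decoder(z)                       # reconstruction
loss = F.mse_loss(x_tilde, x)              # train by minimizing reconstruction error
loss.backward()
```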
Variational Autoencoder
(Figure: Encoder $q_\phi(z \mid x)$ with parameters $\phi$ (Conv): $x \to z$; Decoder $p_\theta(\tilde{x} \mid z)$ with parameters $\theta$ (Transpose Conv): $z \to \tilde{x}$)
Variational Autoencoder
Goal: sample from the latent distribution to generate new outputs!
(Figure: Encoder $\phi$: $x \to z$; Decoder $\theta$: $z \to \tilde{x}$)
Variational Autoencoder
• The latent space is now a distribution
• Specifically, it is a Gaussian: $z \mid x \sim \mathcal{N}(\mu_{z \mid x}, \Sigma_{z \mid x})$
(Figure: the encoder $\phi$ maps $x$ to $\mu_{z \mid x}$ and $\Sigma_{z \mid x}$, a sample $z$ is drawn, and the decoder $\theta$ reconstructs $\tilde{x}$)
Variational Autoencoder
• The latent space is now a distribution
• Specifically, it is a Gaussian: $z \mid x \sim \mathcal{N}(\mu_{z \mid x}, \Sigma_{z \mid x})$
(Figure: the encoder $\phi$ outputs the mean $\mu_{z \mid x}$ and a diagonal covariance $\Sigma_{z \mid x}$)
Variational Autoencoder
• Training: the encoder $\phi$ produces $\mu_{z \mid x}$ and $\Sigma_{z \mid x}$, a sample $z \mid x \sim \mathcal{N}(\mu_{z \mid x}, \Sigma_{z \mid x})$ is drawn, and the decoder $\theta$ reconstructs $\tilde{x}$
Variational Autoencoder
• Test: sample $z \sim \mathcal{N}(\mu_{z \mid x}, \Sigma_{z \mid x})$ from the latent space and decode it with $\theta$ to generate $\tilde{x}$ (a minimal sketch follows below)
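A minimal PyTorch sketch of the encode-sample-decode path (an illustration, not the lecture's code: fully connected layers are used instead of the conv / transposed-conv blocks, and all sizes are assumed). The sampling step uses the reparameterization trick $z = \mu + \sigma \odot \epsilon$ so that it stays differentiable.

```python
import torch
import torch.nn as nn

latent_dim, input_dim = 8, 784      # hypothetical sizes (e.g., flattened 28x28 images)

encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU())
to_mu = nn.Linear(128, latent_dim)
to_logvar = nn.Linear(128, latent_dim)
decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                        nn.Linear(128, input_dim), nn.Sigmoid())

def encode_sample_decode(x):
    h = encoder(x)
    mu, logvar = to_mu(h), to_logvar(h)          # parameters of N(mu, diag(sigma^2))
    eps = torch.randn_like(mu)                   # epsilon ~ N(0, I)
    z = mu + torch.exp(0.5 * logvar) * eps       # reparameterization trick
    return decoder(z), mu, logvar

x = torch.rand(4, input_dim)                     # dummy batch
x_tilde, mu, logvar = encode_sample_decode(x)

# At test time, new outputs can be generated by sampling z and decoding it:
z = torch.randn(4, latent_dim)
generated = decoder(z)
```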
VAE: training
Goal: we want to estimate the parameters $\theta$ of our generative model.
• Back to the Bayesian view for training: $p_\theta(x) = \int_z p_\theta(x \mid z)\, p_\theta(z)\, dz$
– Prior $p_\theta(z)$ = Gaussian
– Decoder (neural network) = $p_\theta(x \mid z)$
– The integral is intractable: we cannot compute the output for every $z$
VAE: training
Goal: we want to estimate the parameters $\theta$ of our generative model.
• We approximate the intractable posterior with an encoder $q_\phi(z \mid x)$
(Figure: the encoder $q_\phi(z \mid x)$ outputs $\mu_{z \mid x}$ and $\Sigma_{z \mid x}$, a sample $z$ is drawn, and the decoder $p_\theta(\tilde{x} \mid z)$ reconstructs $\tilde{x}$)
VAE: loss function
• Loss function for a data point $x_i$: start from $\log p_\theta(x_i) = \mathbb{E}_{z \sim q_\phi(z \mid x_i)}\left[\log p_\theta(x_i)\right]$
– I draw samples of the latent variable $z$ from my encoder (a sketch of the resulting loss follows below)
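The slide shows only the first step of the ELBO derivation; as a hedged illustration of where it typically ends up, below is a sketch of the standard negative-ELBO loss (a reconstruction term plus the closed-form KL divergence between the Gaussian encoder and a standard-normal prior). It reuses the hypothetical `encode_sample_decode`, `mu`, and `logvar` from the previous sketch and assumes inputs in [0, 1] with a sigmoid decoder.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_tilde, mu, logvar):
    """Negative ELBO = reconstruction term + KL(q_phi(z|x) || N(0, I))."""
    # Reconstruction: Bernoulli log-likelihood (binary cross-entropy).
    recon = F.binary_cross_entropy(x_tilde, x, reduction="sum")
    # Closed-form KL divergence between N(mu, diag(exp(logvar))) and N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# Usage with the previous sketch:
# x_tilde, mu, logvar = encode_sample_decode(x)
# loss = vae_loss(x, x_tilde, mu, logvar)
# loss.backward()
```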