Bayesian Deep Learning
Prof. Leal-Taixé and Prof. Niessner
Going full Bayesian
• Bayes = probabilities; hypothesis = model; evidence = data
• Bayes' theorem: p(Hypothesis | Evidence) = p(Evidence | Hypothesis) p(Hypothesis) / p(Evidence)
Going full Bayesian
• Start with a prior on the model parameters
• Choose a statistical model for the data
• Use the data to refine the prior, i.e., compute the posterior
• The evidence (the denominator) has no dependence on the parameters
Going full Bayesian
• Start with a prior on the model parameters
• Choose a statistical model for the data
• Use the data to refine the prior, i.e., compute the posterior:
  p(w | D) = p(D | w) p(w) / p(D)   (posterior = likelihood × prior / evidence)
Going full Bayesian
• Learning: computing the posterior
  – Finding a point estimate (MAP) → what we have been doing so far!
  – Finding a probability distribution over the weights → this lecture
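To make the prior → posterior update concrete, here is a minimal sketch (an added illustration, not from the slides) of exact Bayesian updating for a coin-flip model with a Beta prior; the prior parameters and observed counts are made up.

```python
from scipy import stats

# Prior over the "heads" probability theta: Beta(a, b)
a, b = 2.0, 2.0

# Observed data: 10 coin flips, 7 of them heads
heads, tails = 7, 3

# Bernoulli likelihood + Beta prior => the posterior is again a Beta
# distribution (conjugacy): Beta(a + heads, b + tails)
posterior = stats.beta(a + heads, b + tails)

# Point estimate (MAP = posterior mode) vs. the full posterior distribution
theta_map = (a + heads - 1) / (a + b + heads + tails - 2)
print("MAP estimate:         ", theta_map)
print("Posterior mean:       ", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```

The MAP estimate is a single number, while the full posterior also tells us how uncertain we are about that number.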
What have we learned so far?
• Advantages of deep learning models
  – Very expressive models
  – Good for tasks such as classification, regression, sequence prediction
  – Modular structure, efficient training, many tools
  – Scales well with large amounts of data
• But we also have disadvantages
  – "Black-box" feeling
  – We cannot judge how "confident" the model is about a decision
Modeling uncertainty
• Wish list:
  – We want to know what our models know and what they do not know
Modeling uncertainty
• Example: I have built a dog breed classifier with the classes Bulldog, German Shepherd, and Chihuahua
• What answer will my NN give?
Modeling uncertainty
• Example: I have built a dog breed classifier with the classes Bulldog, German Shepherd, and Chihuahua
• I would rather get as an answer that my model is not certain about the type of dog breed
Modeling uncertainty
• Wish list:
  – We want to know what our models know and what they do not know
• Why do we care?
  – Decision making
  – Learning from limited, noisy, and missing data
  – Insights on why a model failed
Modeling uncertainty
• Finding the posterior
  – Finding a point estimate (MAP) → what we have been doing so far!
  – Finding a probability distribution over the weights
Image: https://medium.com/@joeDiHare/deep-bayesian-neural-networks-952763a9537
Modeling uncertainty
• We can sample many times from the distribution and see how this affects our model's predictions
• If the predictions are consistent, the model is confident
Image: https://medium.com/@joeDiHare/deep-bayesian-neural-networks-952763a9537
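As a small illustration of this idea (added here, not part of the slides; the toy model and the weight samples are made up), draw several weight samples from the approximate posterior, predict with each, and inspect the spread of the predictions:

```python
import numpy as np

def predict(x, w):
    # Hypothetical model: a single linear unit with a sigmoid output
    return 1.0 / (1.0 + np.exp(-x @ w))

rng = np.random.default_rng(0)
x = np.array([0.5, -1.2, 3.0])

# Pretend these are samples from the posterior over the weights p(w | D)
weight_samples = [rng.normal(0.0, 1.0, size=3) for _ in range(100)]

preds = np.array([predict(x, w) for w in weight_samples])
print("predictive mean:", preds.mean())
print("predictive std :", preds.std())  # large spread = the model is uncertain
```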
Modeling uncertainty
(Figure: an example prediction where the model should report "I am not really sure")
Kendall & Gal, "What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?", NIPS 2017
How do we get the posterior?
• Compute the posterior over the weights:
  p(w | D) = p(D | w) p(w) / p(D)
• The denominator p(D) = ∫ p(D | w) p(w) dw is the probability of observing our data under all possible model parameters
• How do we compute this?
How do we get the posterior?
• How do we compute the denominator p(D)?
• We cannot: it would require integrating over all possible parameter combinations
• Two ways to compute an approximation of the posterior:
  – Markov Chain Monte Carlo (MCMC)
  – Variational Inference
How do we get the posterior?
• Markov Chain Monte Carlo (MCMC)
  – A chain of samples that converges to the true posterior p(w | D) → SLOW
• Variational Inference
  – Find an approximation q(w) that is close to the true posterior p(w | D)
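In standard notation (reconstructed here for clarity, not copied from the slide), variational inference picks a tractable family of distributions q_θ(w) and minimizes its KL divergence to the intractable posterior, which is equivalent to maximizing the evidence lower bound (ELBO):

  q* = argmin_θ KL( q_θ(w) ‖ p(w | D) )   ⇔   max_θ  E_{w ~ q_θ}[ log p(D | w) ] − KL( q_θ(w) ‖ p(w) )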
Dropout for Bayesian Inference
Recall: Dropout
• Disable a random set of neurons (typically 50%) in each forward pass
Srivastava et al., 2014
Recall: Dropout
• Using half the network = half capacity → redundant representations
(Figure: example features such as "furry", "has two eyes", "has a tail", "has paws", "has two ears")
Recall: Dropout
• Using half the network = half capacity
  – Redundant representations
  – Base your scores on more features
• Consider it as a model ensemble
Recall: Dropout
• Two models in one (Model 1 / Model 2)
MC dropout
• Variational Inference
  – Find an approximation q(w) that is close to the true posterior p(w | D)
• Dropout training
  – The variational distribution q(w) is built from Bernoulli variables (where the states are "on" and "off")
Y. Gal, Z. Ghahramani, "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning", ICML 2016
MC dropout
• 1. Train a model with dropout before every weight layer
• 2. Apply dropout at test time
  – Sampling is done in a Monte Carlo fashion, hence the name Monte Carlo dropout
Y. Gal, Z. Ghahramani, "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning", ICML 2016
MC dropout
• Sampling is done in a Monte Carlo fashion, e.g., for classification:
  p(y | x, D) ≈ (1/T) Σ_{t=1..T} p(y | x, ŵ_t),  with parameter samples ŵ_t ~ q(w), where q(w) is the dropout distribution
Y. Gal, Z. Ghahramani, "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning", ICML 2016
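A minimal PyTorch sketch of MC dropout at test time (an illustration with made-up layer sizes and T = 50 samples, not the course reference code): keep dropout active during inference, run T stochastic forward passes, and average the softmax outputs.

```python
import torch
import torch.nn as nn

# A small classifier with dropout before every weight layer (as in step 1 above)
model = nn.Sequential(
    nn.Dropout(p=0.5), nn.Linear(784, 256), nn.ReLU(),
    nn.Dropout(p=0.5), nn.Linear(256, 10),
)

def mc_dropout_predict(model, x, T=50):
    model.train()  # keep dropout ON at test time (this is the whole trick)
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(T)])
    mean = probs.mean(dim=0)  # Monte Carlo estimate of p(y | x, D)
    std = probs.std(dim=0)    # spread over the T samples = uncertainty
    return mean, std

x = torch.randn(1, 784)  # dummy input
mean, std = mc_dropout_predict(model, x)
print("predicted class:", mean.argmax(dim=-1).item(), "max std:", std.max().item())
```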
Measure your model's uncertainty
Kendall & Gal, "What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?", NIPS 2017
Another look
Let us take another look
• We know the posterior is intractable, so we approximate it
• The denominator expresses how my data is generated
Let us take another look
• We assume that the data is generated by some random process involving an unobserved continuous random (latent) variable z
• Generation process: p(x) = ∫ p(x | z) p(z) dz
• Posterior: p(z | x) = p(x | z) p(z) / p(x)
Let us take another look
• Variational Inference
  – Find an approximation q(z | x) that is close to the true posterior p(z | x)
• My approximation is parameterized by a model (a neural network)
Variational Autoencoders
Recall: Autoencoders
• Encode the input into a low-dimensional representation (bottleneck) and reconstruct it with the decoder
• Encoder: convolutions; decoder: transpose convolutions
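For reference, a compact PyTorch autoencoder in this spirit (an illustration; the channel sizes and the 28×28 single-channel input are assumptions): a convolutional encoder compresses the input into a bottleneck, and a transposed-convolution decoder reconstructs it.

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: compress the image into a small bottleneck representation
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 28x28 -> 14x14
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 14x14 -> 7x7
        )
        # Decoder: reconstruct the input from the bottleneck
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```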
Variational Autoencoder
(Figure: the same encoder–decoder architecture, with convolutions in the encoder and transpose convolutions in the decoder)
Variational Autoencoder
• The latent space is now a distribution
• Specifically, it is a Gaussian
Variational Autoencoder
• The latent space is now a distribution
• Specifically, it is a Gaussian: the encoder predicts its mean and diagonal covariance
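In practice (a standard construction, sketched here as an addition to the slides), the encoder outputs a mean and a diagonal log-variance, and a latent code is drawn with the reparameterization trick so that gradients can flow through the sampling step:

```python
import torch

def sample_latent(mu, log_var):
    # z = mu + sigma * eps with eps ~ N(0, I)  (reparameterization trick)
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)
    return mu + std * eps

# Example: the encoder predicts mu and log_var for a 2-dimensional latent space
mu = torch.tensor([[0.3, -1.0]])
log_var = torch.tensor([[0.1, 0.2]])
z = sample_latent(mu, log_var)
```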
Variational Autoencoder
• Back to our Bayesian view; our generation process was p(x) = ∫ p(x | z) p(z) dz
• This is the denominator of the posterior p(z | x) = p(x | z) p(z) / p(x)
• The data likelihood p(x) is what I want to optimize
Variational Autoencoder
• Loss function for a data point x: start from the log-likelihood and write it as an expectation over samples of the latent variable z drawn from my encoder,
  log p(x) = E_{z ~ q(z | x)} [ log p(x) ]
Variational Autoencoder
• Loss function for a data point: apply Bayes' rule to substitute the posterior,
  log p(x) = E_{z ~ q(z | x)} [ log ( p(x | z) p(z) / p(z | x) ) ]
Variational Autoencoder
• Loss function for a data point: note that log p(x) does not depend on z, so it is just a constant with respect to the expectation over z ~ q(z | x)
Variational Autoencoder
• Loss function for a data point: multiply and divide by q(z | x) inside the logarithm and split the expectation,
  log p(x) = E_q[ log p(x | z) ] − E_q[ log ( q(z | x) / p(z) ) ] + E_q[ log ( q(z | x) / p(z | x) ) ]
Variational Autoencoder
• Loss function for a data point: the last two terms are Kullback-Leibler divergences,
  log p(x) = E_q[ log p(x | z) ] − KL( q(z | x) ‖ p(z) ) + KL( q(z | x) ‖ p(z | x) )
Variational Autoencoder
• Loss function for a data point:
  – E_q[ log p(x | z) ]: the reconstruction loss
  – KL( q(z | x) ‖ p(z) ): measures how good my latent distribution is with respect to the prior
  – KL( q(z | x) ‖ p(z | x) ): I still cannot express the shape of this distribution, but I know it is non-negative
Variational Autoencoder
• Loss function for a data point (lower bound, the ELBO):
  log p(x) ≥ E_{z ~ q(z | x)} [ log p(x | z) ] − KL( q(z | x) ‖ p(z) )
• We maximize this lower bound, i.e., minimize its negative as the training loss
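A minimal PyTorch sketch of this loss (the negative ELBO) for a diagonal-Gaussian q(z | x) and a standard normal prior; the binary-cross-entropy reconstruction term assumes inputs scaled to [0, 1] and a Bernoulli decoder, which are assumptions made here for illustration.

```python
import torch
import torch.nn.functional as F

def vae_loss(x_reconstructed, x, mu, log_var):
    # Reconstruction term: -E_q[ log p(x | z) ] with a Bernoulli decoder
    recon = F.binary_cross_entropy(x_reconstructed, x, reduction="sum")
    # KL( q(z|x) || p(z) ) in closed form for a diagonal Gaussian q
    # and a standard normal prior p(z) = N(0, I)
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    # Negative ELBO = reconstruction loss + KL regularizer (to be minimized)
    return recon + kl
```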