Marrying Graphical Models & Deep Learning
Max Welling
University of Amsterdam, UvA-Qualcomm QUVA Lab, Canadian Institute for Advanced Research
Overview
• Machine Learning as Computational Statistics
• Generative versus discriminative modeling
• Graphical Models: Bayes nets, MRFs, latent variable models
• Inference: variational inference, MCMC
• Learning: EM, amortized EM, variational autoencoder
• Deep Learning: CNN, dropout
• Bayesian inference, Bayesian deep models, compression
ML as Statistics
• Given data, optimize an objective: (unsupervised) maximize the log likelihood, or (supervised) minimize a loss.
• ML is more than an optimization problem: it’s a statistical inference problem.
• E.g.: you should not optimize parameters more precisely than the scale at which the MLE fluctuates under resampling of the data, or you risk overfitting.
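The scale at which the MLE fluctuates can be estimated by resampling. The sketch below is only an illustration under assumed settings: the data are 100 draws from a Gaussian, the MLE is the sample mean, and the helper name `bootstrap_mle_spread` is hypothetical.

```python
import random
import statistics

def bootstrap_mle_spread(data, n_resamples=500, seed=0):
    """Std. dev. of the Gaussian-mean MLE (the sample mean) under resampling."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        # Resample the data set with replacement and recompute the MLE.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistics.fmean(resample))
    return statistics.stdev(estimates)

data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(100)]  # assumed toy data
mle = statistics.fmean(data)
spread = bootstrap_mle_spread(data)
print(f"MLE of the mean: {mle:.3f}, fluctuation under resampling: {spread:.3f}")
```

With 100 points of unit variance the spread is roughly 0.1, so optimizing the mean to many more digits than that buys nothing statistically.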
Bias Variance Tradeoff
Source: http://scott.fortmann-roe.com/docs/BiasVariance.html
Graphical Models
• A graphical representation to concisely represent (conditional) independence relations between variables.
• There is a one-to-one correspondence between the dependencies implied by the graph and the probabilistic model.
• E.g. Bayes Nets:
P(all) = P(traffic-jam | rush-hour, bad-weather, accident) × P(sirens | accident) × P(accident | bad-weather) × P(bad-weather) × P(rush-hour)
• rush-hour independent of bad-weather ⇔ sum_{traffic-jam, sirens, accident} P(all) = P(rush-hour) P(bad-weather)
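The independence claim for this Bayes net can be checked numerically. In the sketch below all variables are binary and every probability value is made up for illustration; only the factorization is taken from the slide.

```python
from itertools import product

# Hypothetical conditional probability tables (all variables are 0/1).
p_rush = {1: 0.3, 0: 0.7}
p_weather = {1: 0.2, 0: 0.8}
p_accident_given_weather = {1: 0.10, 0: 0.02}  # P(accident=1 | bad-weather)
p_sirens_given_accident = {1: 0.90, 0: 0.05}   # P(sirens=1 | accident)

def p_jam(rush, weather, accident):
    """P(traffic-jam=1 | rush-hour, bad-weather, accident), assumed form."""
    return min(0.95, 0.1 + 0.4 * rush + 0.2 * weather + 0.3 * accident)

def bern(p_one, value):
    return p_one if value == 1 else 1.0 - p_one

def joint(rush, weather, accident, sirens, jam):
    """P(all), factorized exactly as on the slide."""
    return (bern(p_jam(rush, weather, accident), jam)
            * bern(p_sirens_given_accident[accident], sirens)
            * bern(p_accident_given_weather[weather], accident)
            * p_weather[weather]
            * p_rush[rush])

def marginal(rush, weather):
    """Sum P(all) over traffic-jam, sirens and accident."""
    return sum(joint(rush, weather, a, s, t)
               for a, s, t in product((0, 1), repeat=3))

total = sum(joint(*v) for v in product((0, 1), repeat=5))
max_gap = max(abs(marginal(r, b) - p_rush[r] * p_weather[b])
              for r, b in product((0, 1), repeat=2))
print(f"sum of joint: {total:.6f}, largest deviation from independence: {max_gap:.2e}")
```

The deviation is zero up to rounding whatever numbers are put in the tables, because the independence follows from the graph structure rather than from the parameter values.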
Rush-hour independent of bad-weather
Markov Random Fields
• Undirected edges.
• (Conditional) independence relationships are easy to read off: A is independent of B given C if all paths between A and B are blocked by C.
• Probability distribution: $P(x) = \frac{1}{Z} \prod_C \psi_C(x_C)$, where each C is a maximal clique (a largest completely connected subgraph) and Z is the normalization constant.
• Hammersley-Clifford Theorem: if P(x) > 0 for all x, then all (conditional) independencies in P match those of the graph.
Source: Bishop
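A tiny chain A - C - B makes the blocking rule concrete: C blocks the only path between A and B. The potentials below are arbitrary positive numbers chosen for illustration; the check itself is a sketch, not part of the original slides.

```python
from itertools import product

# Hypothetical clique potentials for the chain A - C - B (binary variables).
psi_ac = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 0.5, (1, 1): 3.0}
psi_cb = {(0, 0): 1.5, (0, 1): 0.7, (1, 0): 1.0, (1, 1): 2.5}

states = list(product((0, 1), repeat=3))  # ordered as (a, c, b)
Z = sum(psi_ac[a, c] * psi_cb[c, b] for a, c, b in states)
p = {(a, c, b): psi_ac[a, c] * psi_cb[c, b] / Z for a, c, b in states}

def conditional_gap():
    """Largest |P(a,b|c) - P(a|c) P(b|c)| over all states."""
    gap = 0.0
    for c in (0, 1):
        p_c = sum(p[a, c, b] for a in (0, 1) for b in (0, 1))
        for a, b in product((0, 1), repeat=2):
            p_ab = p[a, c, b] / p_c
            p_a = sum(p[a, c, b2] for b2 in (0, 1)) / p_c
            p_b = sum(p[a2, c, b] for a2 in (0, 1)) / p_c
            gap = max(gap, abs(p_ab - p_a * p_b))
    return gap

def marginal_gap():
    """Largest |P(a,b) - P(a) P(b)| when C is not observed."""
    p_ab = {(a, b): sum(p[a, c, b] for c in (0, 1))
            for a, b in product((0, 1), repeat=2)}
    p_a = {a: p_ab[a, 0] + p_ab[a, 1] for a in (0, 1)}
    p_b = {b: p_ab[0, b] + p_ab[1, b] for b in (0, 1)}
    return max(abs(p_ab[a, b] - p_a[a] * p_b[b]) for a, b in p_ab)

print(f"given C: {conditional_gap():.2e}, marginally: {marginal_gap():.3f}")
```

A and B are independent once C is given, but dependent when C is summed out.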
Latent Variable Models
• Introducing latent (unobserved) variables will dramatically increase the capacity of a model.
• Problem: P(Z|X) is intractable for most nontrivial models.
Approximate Inference
Variational Inference:
• Deterministic
• Biased
• Local minima
• Easy to assess convergence
Sampling:
• Stochastic (sample error)
• Unbiased
• Hard to mix between modes
• Hard to assess convergence
[Figure: the variational family Q inside the set of all probability distributions, with the target p and the approximation q*.]
Independence Samplers & MCMC
Generating Independent Samples
• Sample from a proposal g and suppress samples with low p(θ|X), e.g. a) Rejection Sampling, b) Importance Sampling.
• Does not scale to high dimensions.
Markov Chain Monte Carlo
• Make steps by perturbing the previous sample.
• Probability of visiting a state is equal to P(θ|X).
[Figure: target p(θ|X) and proposal g.]
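Both independence samplers can be sketched in one dimension. The target here is an assumed unnormalized standard Gaussian and the proposal g is a wider Gaussian; these choices, and the constant M, are illustrative only.

```python
import math
import random

SIGMA_G = 2.0  # proposal standard deviation (assumption)

def p_tilde(theta):
    """Unnormalized target: a standard Gaussian without its constant."""
    return math.exp(-0.5 * theta * theta)

def g_pdf(theta):
    """Normalized Gaussian proposal density."""
    return math.exp(-0.5 * (theta / SIGMA_G) ** 2) / (SIGMA_G * math.sqrt(2 * math.pi))

# Envelope constant: p_tilde(theta) <= M * g_pdf(theta) for every theta.
M = SIGMA_G * math.sqrt(2 * math.pi)

def rejection_sample(n, rng):
    """Draw from g and keep each draw with probability p_tilde / (M * g)."""
    samples = []
    while len(samples) < n:
        theta = rng.gauss(0.0, SIGMA_G)
        if rng.random() < p_tilde(theta) / (M * g_pdf(theta)):
            samples.append(theta)
    return samples

def importance_estimate(f, n, rng):
    """Self-normalized importance sampling estimate of E_p[f]."""
    num = den = 0.0
    for _ in range(n):
        theta = rng.gauss(0.0, SIGMA_G)
        w = p_tilde(theta) / g_pdf(theta)  # down-weights low-probability draws
        num += w * f(theta)
        den += w
    return num / den

rng = random.Random(0)
accepted = rejection_sample(20000, rng)
second_moment_rs = sum(t * t for t in accepted) / len(accepted)
second_moment_is = importance_estimate(lambda t: t * t, 20000, rng)
print(f"E[theta^2]: rejection {second_moment_rs:.3f}, importance {second_moment_is:.3f}")
```

Both estimates should land near 1, the second moment of the target. In high dimensions the acceptance rate and the effective number of weighted samples collapse, which is the scaling problem the slide refers to.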
Sampling 101 – What is MCMC?
Given a target distribution $S_0$, design transitions $T(\theta_{t+1} \mid \theta_t)$ such that $p_t(\theta_t) \to S_0$ as $t \to \infty$.
Chain: $\theta_0, \theta_1, \ldots, \theta_t, \theta_{t+1}, \ldots$; the burn-in is thrown away, the remaining states are samples from $S_0$.
$I = \langle f \rangle_{S_0} \approx \hat{I} = \frac{1}{T} \sum_{t=1}^{T} f(\theta_t)$
$\mathrm{Bias}(\hat{I}) = E[\hat{I} - I] = 0$
$\mathrm{Var}(\hat{I}) = \frac{\tau \, \mathrm{Var}(f)}{T}$, where $\tau$ is the autocorrelation time.
[Figure: two trace plots of the last position coordinate against iteration (0 to 1000), one with high τ and one with low τ.]
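The variance formula can be checked on a chain whose autocorrelation time is known. The sketch below uses a stationary AR(1) process as a stand-in for an MCMC chain (an assumption made for illustration); for coefficient ρ its autocorrelation time is (1 + ρ) / (1 − ρ).

```python
import math
import random
import statistics

RHO = 0.9        # AR(1) coefficient (hypothetical choice)
T = 2000         # chain length
N_CHAINS = 200   # independent repetitions used to measure Var(I_hat)

def chain_mean(rng):
    """I_hat for f(theta) = theta along one stationary AR(1) chain."""
    theta = rng.gauss(0.0, 1.0 / math.sqrt(1.0 - RHO ** 2))  # start at stationarity
    total = 0.0
    for _ in range(T):
        theta = RHO * theta + rng.gauss(0.0, 1.0)
        total += theta
    return total / T

rng = random.Random(0)
means = [chain_mean(rng) for _ in range(N_CHAINS)]
empirical_var = statistics.variance(means)

tau = (1.0 + RHO) / (1.0 - RHO)    # autocorrelation time of AR(1)
var_f = 1.0 / (1.0 - RHO ** 2)     # stationary variance of theta
predicted_var = tau * var_f / T
print(f"empirical Var(I_hat) = {empirical_var:.4f}, tau Var(f) / T = {predicted_var:.4f}")
```

Raising ρ increases τ, so the same number of iterations carries fewer effectively independent samples.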
Sampling 101 – Metropolis-Hastings
Transition Kernel $T(\theta_{t+1} \mid \theta_t)$:
• Propose: $\theta' \sim q(\theta' \mid \theta_t)$
• Accept/Reject Test: $P_a = \min\left[1, \frac{q(\theta_t \mid \theta')}{q(\theta' \mid \theta_t)} \cdot \frac{S_0(\theta')}{S_0(\theta_t)}\right]$
  The first ratio asks: is it easy to come back to the current state? The second asks: is the new state more probable?
• $\theta_{t+1} \leftarrow \theta'$ with probability $P_a$, and $\theta_{t+1} \leftarrow \theta_t$ with probability $1 - P_a$
For Bayesian Posterior Inference, $S_0(\theta) \propto p(\theta) \prod_{i=1}^{N} p(x_i \mid \theta)$
1) Burn-in is unnecessarily slow.
2) $\mathrm{Var}[\hat{I}] \propto \frac{1}{T}$ is too high.
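A minimal sketch of the propose and accept/reject steps, assuming a toy Bayesian posterior: a Gaussian prior on θ and unit-variance Gaussian likelihoods. The proposal is a symmetric random walk, so the q ratio equals 1 and only the S_0 ratio remains. All numeric settings are illustrative.

```python
import math
import random

PRIOR_STD = 10.0  # assumed prior: theta ~ N(0, 10^2)

def log_s0(theta, data):
    """log of p(theta) * prod_i p(x_i | theta), up to an additive constant."""
    log_prior = -0.5 * (theta / PRIOR_STD) ** 2
    log_lik = sum(-0.5 * (x - theta) ** 2 for x in data)  # touches all N points
    return log_prior + log_lik

def metropolis_hastings(data, n_steps, step, rng, theta0=0.0):
    theta = theta0
    log_p = log_s0(theta, data)
    chain = []
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, step)   # symmetric proposal q
        log_p_new = log_s0(proposal, data)
        log_accept = min(0.0, log_p_new - log_p)  # log P_a with q ratio = 1
        if math.log(1.0 - rng.random()) < log_accept:
            theta, log_p = proposal, log_p_new    # accept
        chain.append(theta)                       # on reject, repeat theta_t
    return chain

rng = random.Random(0)
data = [rng.gauss(1.5, 1.0) for _ in range(50)]
chain = metropolis_hastings(data, n_steps=5000, step=0.3, rng=rng)
kept = chain[1000:]  # throw away burn-in
posterior_mean_mcmc = sum(kept) / len(kept)
posterior_mean_exact = sum(data) / (len(data) + 1.0 / PRIOR_STD ** 2)
print(f"MCMC: {posterior_mean_mcmc:.3f}, exact: {posterior_mean_exact:.3f}")
```

Every accept test evaluates the likelihood of all N data points, which is the cost behind the two complaints on the slide.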
Approximate MCMC
[Figure: samples from an approximate distribution S_ϵ compared with the target S_0. One extreme: low variance (fast) but high bias. Other extreme: high variance (slow) but low bias. Arrow: decreasing ϵ.]
Minimizing Risk
Risk = Bias² + Variance:
$E[(I - \hat{I})^2] = (\langle f \rangle_P - \langle f \rangle_{P_\epsilon})^2 + \frac{\sigma^2 \tau}{T}$
Given finite sampling time, ϵ = 0 is not the optimal setting.
[Figure: X axis: ϵ; Y axis: Bias², Variance, Risk; annotated with computational time.]
Stochastic Gradient Langevin Dynamics (Welling & Teh 2011)
• Gradient Ascent ↔ Langevin Dynamics ↓ Metropolis-Hastings Accept Step
• Stochastic Gradient Ascent ↔ Stochastic Gradient Langevin Dynamics, Metropolis-Hastings Accept Step
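A sketch of the update from Welling & Teh 2011: half a step along a minibatch estimate of the gradient of the log posterior, plus Gaussian noise whose variance equals the step size, with no accept/reject test. The toy model (Gaussian prior, unit-variance Gaussian likelihood), the fixed step size and the batch size are assumptions chosen for illustration.

```python
import math
import random

PRIOR_STD = 10.0  # assumed prior: theta ~ N(0, 10^2)

def sgld(data, n_steps, eps, batch_size, rng, theta0=0.0):
    n_data = len(data)
    theta = theta0
    chain = []
    for _ in range(n_steps):
        batch = rng.sample(data, batch_size)            # subsample a mini-batch
        grad_prior = -theta / PRIOR_STD ** 2
        # Rescale the minibatch gradient so it estimates the full-data gradient.
        grad_lik = (n_data / batch_size) * sum(x - theta for x in batch)
        noise = rng.gauss(0.0, math.sqrt(eps))          # injected Langevin noise
        theta += 0.5 * eps * (grad_prior + grad_lik) + noise
        chain.append(theta)
    return chain

rng = random.Random(0)
data = [rng.gauss(1.5, 1.0) for _ in range(1000)]
chain = sgld(data, n_steps=5000, eps=1e-4, batch_size=50, rng=rng)
kept = chain[500:]  # discard burn-in
sgld_mean = sum(kept) / len(kept)
sgld_std = math.sqrt(sum((t - sgld_mean) ** 2 for t in kept) / (len(kept) - 1))
exact_mean = sum(data) / (len(data) + 1.0 / PRIOR_STD ** 2)
exact_std = 1.0 / math.sqrt(len(data) + 1.0 / PRIOR_STD ** 2)
print(f"mean: {sgld_mean:.3f} (exact {exact_mean:.3f}), "
      f"std: {sgld_std:.3f} (exact {exact_std:.3f})")
```

With a fixed step size the minibatch gradient noise adds to the injected noise, so the sampled spread tends to come out somewhat wider than the exact posterior; shrinking the step size reduces that bias at the cost of slower mixing.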
Demo: Stochastic Gradient LD
A Closer Look … large
A Closer Look … small
Demo SGLD: large stepsize
Demo SGLD: small stepsize
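Variational Inference
• Choose a tractable family of distributions Q (e.g. Gaussian, discrete).
• Minimize over Q: $\mathrm{KL}[Q(Z \mid X, \Phi) \,\|\, P(Z \mid X, \Theta)]$
• Equivalent to maximizing over Φ: $B(Q) = \sum_Z Q(Z \mid X, \Phi)\,(\log P(X \mid Z, \Theta) + \log P(Z) - \log Q(Z \mid X, \Phi))$
[Figure: the distribution P and the family Q.]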
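The equivalence between minimizing the KL divergence and maximizing the bound rests on the identity log P(X) = B(Q) + KL[Q || P(Z|X)]. The sketch below verifies it for a three-state latent variable; the prior, likelihood values and the trial distribution q are all made up.

```python
import math

prior = [0.5, 0.3, 0.2]          # P(Z), hypothetical
likelihood = [0.10, 0.40, 0.70]  # P(X = x_obs | Z), hypothetical

evidence = sum(pz * lz for pz, lz in zip(prior, likelihood))    # P(X)
posterior = [pz * lz / evidence for pz, lz in zip(prior, likelihood)]

def bound(q):
    """B(Q) = sum_Z Q(Z) (log P(X|Z) + log P(Z) - log Q(Z))."""
    return sum(qz * (math.log(lz) + math.log(pz) - math.log(qz))
               for qz, pz, lz in zip(q, prior, likelihood) if qz > 0)

def kl(q, p):
    """KL[q || p] for discrete distributions."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)

q = [0.2, 0.5, 0.3]  # an arbitrary member of the variational family
gap = math.log(evidence) - bound(q)
print(f"log P(X) = {math.log(evidence):.4f}, B(q) = {bound(q):.4f}, "
      f"gap = {gap:.4f}, KL = {kl(q, posterior):.4f}")
```

Since log P(X) does not depend on Q, pushing the bound up is the same as pushing the KL term down, and the bound is tight exactly when Q equals the posterior.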
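Learning: Expectation Maximization
$\log P(X \mid \Theta) = B(Q) + \mathrm{KL}[Q(Z \mid X, \Phi) \,\|\, P(Z \mid X, \Theta)]$
• Bound: B(Q). Gap: the KL term.
• E-step: maximize the bound over Q (variational inference).
• M-step: maximize the bound over Θ (approximate learning).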
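A compact sketch of the two alternating steps for a mixture of two unit-variance Gaussians in one dimension. Here the E-step is exact (Q is set to the true posterior over the component label, so the gap closes) and the M-step has a closed form. The data, initial values and iteration count are assumptions for illustration.

```python
import math
import random

def normal_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def log_likelihood(data, pi, mu):
    return sum(math.log(sum(pi[k] * normal_pdf(x, mu[k]) for k in range(2)))
               for x in data)

def em_step(data, pi, mu):
    # E-step: q(z = k | x) = P(z = k | x, Theta), the responsibilities.
    resp = []
    for x in data:
        w = [pi[k] * normal_pdf(x, mu[k]) for k in range(2)]
        s = sum(w)
        resp.append([wk / s for wk in w])
    # M-step: maximize the bound over Theta = (pi, mu) with q held fixed.
    nk = [sum(r[k] for r in resp) for k in range(2)]
    new_pi = [nk[k] / len(data) for k in range(2)]
    new_mu = [sum(r[k] * x for r, x in zip(resp, data)) / nk[k] for k in range(2)]
    return new_pi, new_mu

rng = random.Random(0)
data = ([rng.gauss(-2.0, 1.0) for _ in range(200)]
        + [rng.gauss(3.0, 1.0) for _ in range(200)])

pi, mu = [0.5, 0.5], [-1.0, 1.0]
history = [log_likelihood(data, pi, mu)]
for _ in range(30):
    pi, mu = em_step(data, pi, mu)
    history.append(log_likelihood(data, pi, mu))
print(f"means: {sorted(mu)}, final log likelihood: {history[-1]:.2f}")
```

The recorded log likelihood never decreases from one iteration to the next, which is the practical signature of the bound argument.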
Amortized Inference
• By making q(z|x) a function of x and sharing parameters φ, we can do very fast inference at test time (i.e. avoid iterative optimization of q_test(z)).
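A deliberately small sketch of the idea, assuming a linear-Gaussian model where everything is available in closed form: z ~ N(0, 1), x | z ~ N(z, σ²), and an amortized posterior mean of the form φ·x with one shared parameter φ. The model, learning rate and names are illustrative assumptions, not taken from the slides.

```python
import random

NOISE_VAR = 0.5  # sigma^2 in x | z ~ N(z, sigma^2), prior z ~ N(0, 1)

rng = random.Random(0)
zs = [rng.gauss(0.0, 1.0) for _ in range(500)]
xs = [z + rng.gauss(0.0, NOISE_VAR ** 0.5) for z in zs]
mean_x2 = sum(x * x for x in xs) / len(xs)

# Gradient ascent on the data-averaged bound with respect to the shared phi.
# For q(z|x) = N(phi * x, s^2) the gradient is E[x^2] * ((1 - phi) / sigma^2 - phi).
phi = 0.0
for _ in range(200):
    grad = mean_x2 * ((1.0 - phi) / NOISE_VAR - phi)
    phi += 0.05 * grad

def infer_mean(x):
    """Test-time inference: one multiplication, no per-point optimization."""
    return phi * x

exact_slope = 1.0 / (1.0 + NOISE_VAR)  # exact posterior mean is x / (1 + sigma^2)
print(f"learned phi = {phi:.6f}, exact slope = {exact_slope:.6f}")
print(f"q mean for a new x = 2.0: {infer_mean(2.0):.4f}")
```

In a VAE the linear map is replaced by a deep network, but the principle is the same: the cost of optimizing φ is paid once during training and reused for every new x.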
Deep NN as a glorified conditional distribution
[Figure: a deep network mapping input X to output Y, representing P(Y|X).]
The “Deepify” Operator
• Find a graphical model with conditional distributions and replace those with a deep NN.
• Logistic regression → deep NN.
• “Deep survival analysis”: Cox’s proportional hazard function: replace with deep NN!
• Latent variable model: replace generative and recognition models with deep NNs → “Variational Autoencoder” (VAE).
Variational Autoencoder
[Figure: a latent variable model in which two conditional distributions are each marked “deepify”.]
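Deep Generative Model: The Variational Auto-Encoder
[Figure: two deep neural nets. Q maps the observed x through deterministic hidden nodes h to μ and σ of the latent z; P maps z through deterministic hidden nodes h to p(x). Legend: deterministic NN node, unobserved stochastic node, observed stochastic node.]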
Stochastic Variational Bayesian Inference
$B(Q) = \sum_Z Q(Z \mid X, \Phi)\,(\log P(X \mid Z, \Theta) + \log P(Z) - \log Q(Z \mid X, \Phi))$
$\nabla_\Phi B(Q) = \sum_Z Q(Z \mid X, \Phi)\, \nabla_\Phi \log Q(Z \mid X, \Phi)\,(\log P(X \mid Z, \Theta) + \log P(Z) - \log Q(Z \mid X, \Phi))$
Subsample a mini-batch of X and sample Z:
$\nabla_\Phi B(Q) \approx \frac{1}{N} \frac{1}{S} \sum_{i=1}^{N} \sum_{s=1}^{S} \nabla_\Phi \log Q(Z_{is} \mid X_i, \Phi)\,(\log P(X_i \mid Z_{is}, \Theta) + \log P(Z_{is}) - \log Q(Z_{is} \mid X_i, \Phi))$
This estimator has very high variance.
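The estimator above multiplies a sampled value by the gradient of log Q. A stripped-down sketch of that mechanism, assuming a single Bernoulli latent with Q(z = 1) = sigmoid(φ) and an arbitrary function f standing in for the bracketed log terms (both are illustrative assumptions):

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def f(z):
    """Hypothetical stand-in for the term multiplying grad log Q."""
    return 3.0 if z == 1 else 1.0

def score_function_gradient(phi, n_samples, rng):
    """Monte Carlo estimate of d/dphi E_Q[f(z)] using f(z) * grad log Q(z)."""
    q1 = sigmoid(phi)
    total = 0.0
    for _ in range(n_samples):
        z = 1 if rng.random() < q1 else 0
        grad_log_q = z - q1  # d/dphi log Q(z | phi) for a sigmoid-Bernoulli
        total += f(z) * grad_log_q
    return total / n_samples

PHI = 0.4
rng = random.Random(0)
estimate = score_function_gradient(PHI, 20000, rng)
q1 = sigmoid(PHI)
exact = (f(1) - f(0)) * q1 * (1.0 - q1)  # closed-form gradient for comparison
print(f"score-function estimate: {estimate:.4f}, exact: {exact:.4f}")
```

The estimate is unbiased, but it needs many samples to settle because each term fluctuates in sign and size.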
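Reducing the Variance: The Reparametrization Trick (Kingma 2013, Bengio 2013, Kingma & Welling 2014)
• Reparameterization: $z_s = g(\epsilon_s, \Phi)$, $\epsilon_s \sim P(\epsilon)$
• Applied to VAE: $\nabla_\Phi B(\Theta, \Phi) = \nabla_\Phi \int dz\, Q_\Phi(z \mid x)\,[\log P_\Theta(x, z) - \log Q_\Phi(z \mid x)] \approx \nabla_\Phi [\log P_\Theta(x, z_s) - \log Q_\Phi(z_s \mid x)]$, with $z_s = g(\epsilon_s, \Phi)$, $\epsilon_s \sim P(\epsilon)$
• Example: $\nabla_\mu \int dz\, \mathcal{N}_z(\mu, \sigma)\, z \approx \frac{1}{S} \sum_s z_s (z_s - \mu) / \sigma^2$, with $z_s \sim \mathcal{N}_z(\mu, \sigma)$, or $\frac{1}{S} \sum_s 1$, with $\epsilon_s \sim \mathcal{N}_\epsilon(0, 1)$, $z = \mu + \sigma \epsilon$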
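The two estimators in the example can be compared directly. For f(z) = z the reparameterized estimator is exactly 1 with zero variance, so the sketch below uses f(z) = z² instead (an assumption made so that both estimators have nonzero variance); the true gradient is then 2μ.

```python
import random
import statistics

MU, SIGMA = 1.0, 0.5  # hypothetical parameters of N(mu, sigma)

def f(z):
    return z * z

def df(z):
    return 2.0 * z

rng = random.Random(0)
score_terms, reparam_terms = [], []
for _ in range(5000):
    eps = rng.gauss(0.0, 1.0)
    z = MU + SIGMA * eps                       # z = g(eps, mu) = mu + sigma * eps
    score_terms.append(f(z) * (z - MU) / SIGMA ** 2)  # f(z) * d/dmu log N(z)
    reparam_terms.append(df(z))                # d/dmu f(mu + sigma * eps)

score_grad = statistics.fmean(score_terms)
reparam_grad = statistics.fmean(reparam_terms)
score_var = statistics.variance(score_terms)
reparam_var = statistics.variance(reparam_terms)
print(f"true gradient: {2 * MU:.2f}")
print(f"score function: {score_grad:.3f} (per-sample variance {score_var:.2f})")
print(f"reparameterized: {reparam_grad:.3f} (per-sample variance {reparam_var:.2f})")
```

Both estimators target the same gradient, but the reparameterized one differentiates through the sample and has a much smaller per-sample variance in this setting.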
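Semi-Supervised VAE I
D.P. Kingma, D.J. Rezende, S. Mohamed, M. Welling, NIPS 2014
[Figure: recognition model Q and generative model P over x, latent z and label y, with deterministic hidden nodes h; y is a sometimes observed stochastic node.]
Objective: (normal VB objective) + (term boosting the influence of q(y|x))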
Discriminative or Generative?
Discriminative: Deep Learning, Kernel Methods, Random Forests, Boosting
Generative: Bayesian Networks, Probabilistic Programs, Simulator Models
Variational Auto-Encoder
Advantages generative models:
• Inject expert knowledge
• Model causal relations
• Interpretable
• Data efficient
• More robust to domain shift
• Facilitate un/semi-supervised learning
Advantages discriminative models:
• Flexible map from input to target (low bias)
• Efficient training algorithms available
• Solve the problem you are evaluating on.
• Very successful and accurate!
Big N vs. Small N?
Big N (N = 10^8-10^9): we need computational efficiency
• Customer Intelligence
• Finance
• Video/image
• Internet of Things
Small N (N = 100-1000): we need statistical efficiency
• Healthcare (p >> N)
• Generative, causal models generalize much better to new unknown situations (domain invariance)
Combining Generative and Discriminative Models
• Use physics
• Use causality
• Use expert knowledge
• Black box DNN/CNN
Deep Convolutional Networks
• Input dimensions have “topology”: 1D speech, 2D image, 3D MRI, 2+1D video, 4D fMRI.
• Forward: filter, subsample, filter, nonlinearity, subsample, …, classify.
• Backward: backpropagation (propagate error signal backward).
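One forward stage on a 1D input can be sketched in a few lines. The ordering used here (filter, then nonlinearity, then subsample), the ReLU nonlinearity, max pooling and the toy numbers are assumptions for illustration; a real network stacks many such stages with learned filters.

```python
def correlate_valid(signal, kernel):
    """Filter: slide the kernel over the signal (no padding)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(values):
    """Nonlinearity: clip negative responses to zero."""
    return [max(0.0, v) for v in values]

def max_pool(values, size=2):
    """Subsample: keep the maximum of each non-overlapping window."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]

signal = [1.0, 2.0, 0.0, -1.0, 3.0, 1.0]
kernel = [1.0, -1.0]  # a simple difference filter (hypothetical)

filtered = correlate_valid(signal, kernel)  # [-1.0, 2.0, 1.0, -4.0, 2.0]
activated = relu(filtered)                  # [0.0, 2.0, 1.0, 0.0, 2.0]
features = max_pool(activated)              # [2.0, 1.0]
print(features)
```

The same filter weights are reused at every position, which is how the network exploits the topology of the input.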
Dropout
Example: Dermatology
Example: Retinopathy
What do these problems have in common?
It’s the same CNN in all cases: Inception-v3
So..., CNNs work really well. However:
• They are way too big
• They consume too much energy
• They use too much memory
→ we need to make them more efficient!
Reasons for Bayesian Deep Learning
• Automatic model selection / pruning
• Automatic regularization
• Realistic prediction uncertainty (important for decision making)
[Images: Autonomous Driving; Computer Aided Diagnosis]
Example: increased uncertainty away from data
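The effect can be reproduced with the simplest Bayesian model available. The sketch below swaps the deep network for Bayesian linear regression with features (1, x), a Gaussian weight prior and Gaussian noise; the precisions and input locations are assumed values. The predictive variance is the noise variance plus a term that grows as the query point moves away from the training inputs.

```python
ALPHA = 1.0   # prior precision on the weights (hypothetical)
BETA = 25.0   # noise precision (hypothetical)

train_x = [-1.0, -0.5, 0.0, 0.5, 1.0]  # training inputs; targets do not affect the variance

def posterior_cov(xs):
    """Inverse of ALPHA * I + BETA * Phi^T Phi for features phi(x) = (1, x)."""
    a11 = ALPHA + BETA * len(xs)
    a12 = BETA * sum(xs)
    a22 = ALPHA + BETA * sum(x * x for x in xs)
    det = a11 * a22 - a12 * a12
    return ((a22 / det, -a12 / det), (-a12 / det, a11 / det))

def predictive_variance(x, cov):
    """Noise variance plus phi(x)^T S phi(x)."""
    phi = (1.0, x)
    quad = sum(phi[i] * cov[i][j] * phi[j] for i in range(2) for j in range(2))
    return 1.0 / BETA + quad

cov = posterior_cov(train_x)
var_near = predictive_variance(0.0, cov)  # inside the training range
var_far = predictive_variance(5.0, cov)   # far outside the training range
print(f"predictive variance at x=0: {var_near:.4f}, at x=5: {var_far:.4f}")
```

A point estimate of the weights would report the same confidence at both locations; keeping a posterior over the weights is what makes the uncertainty grow away from the data.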