Latent Variable Models

Stefano Ermon, Aditya Grover
Stanford University
Lecture 6

Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 6 1 / 25
Plan for today

1. Latent Variable Models
   - Learning deep generative models
   - Stochastic optimization: the reparameterization trick
   - Inference amortization
Variational Autoencoder

A mixture of an infinite number of Gaussians:

1. z ∼ N(0, I)
2. p(x | z) = N(µ_θ(z), Σ_θ(z)), where µ_θ, Σ_θ are neural networks
3. Even though p(x | z) is simple, the marginal p(x) is very complex/flexible
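The "infinite mixture" idea can be sketched in a few lines. This is a toy 1-D version with hypothetical hand-picked functions standing in for the neural networks µ_θ and Σ_θ; it only illustrates that a Gaussian p(x | z) per z yields a non-Gaussian marginal p(x).

```python
import math
import random

random.seed(0)

# Hypothetical 1-D stand-ins for the decoder networks mu_theta, Sigma_theta.
def mu_theta(z):
    return math.tanh(3.0 * z)        # nonlinear mean, pushes mass toward ±1

def sigma_theta(z):
    return 0.1 + 0.05 * z * z        # input-dependent standard deviation

def sample_x():
    z = random.gauss(0.0, 1.0)                        # z ~ N(0, I)
    return random.gauss(mu_theta(z), sigma_theta(z))  # x ~ p(x | z)

xs = [sample_x() for _ in range(10_000)]
# Each p(x|z) is a single Gaussian, but the marginal p(x) is a continuous
# mixture: here it is bimodal (mass near ±1), unlike any one Gaussian.
```

Sampling from the model is just ancestral sampling: draw z from the prior, then x from the conditional.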
Recap: Latent Variable Models

- Allow us to define complex models p(x) in terms of simple building blocks p(x | z)
- Natural for unsupervised learning tasks (clustering, unsupervised representation learning, etc.)
- No free lunch: much more difficult to learn compared to fully observed, autoregressive models
Recap: Variational Inference

Suppose q(z) is any probability distribution over the hidden variables:

    D_KL(q(z) ‖ p(z | x; θ)) = −∑_z q(z) log p(z, x; θ) + log p(x; θ) − H(q) ≥ 0

Evidence lower bound (ELBO), which holds for any q:

    log p(x; θ) ≥ ∑_z q(z) log p(z, x; θ) + H(q)

Equality holds if q = p(z | x; θ):

    log p(x; θ) = ∑_z q(z) log p(z, x; θ) + H(q)
Recap: The Evidence Lower Bound

What if the posterior p(z | x; θ) is intractable to compute?

Suppose q(z; φ) is a (tractable) probability distribution over the hidden variables, parameterized by φ (the variational parameters). For example, a Gaussian with mean and covariance specified by φ:

    q(z; φ) = N(φ_1, φ_2)

Variational inference: pick φ so that q(z; φ) is as close as possible to p(z | x; θ). In the figure, the posterior p(z | x; θ) (blue) is better approximated by N(2, 2) (orange) than by N(−4, 0.75) (green).
Recap: The Evidence Lower Bound

    log p(x; θ) ≥ ∑_z q(z; φ) log p(z, x; θ) + H(q(z; φ)) = L(x; θ, φ)   (ELBO)

    log p(x; θ) = L(x; θ, φ) + D_KL(q(z; φ) ‖ p(z | x; θ))

The better q(z; φ) can approximate the posterior p(z | x; θ), the smaller D_KL(q(z; φ) ‖ p(z | x; θ)) we can achieve, and the closer the ELBO will be to log p(x; θ).

Next: jointly optimize over θ and φ to maximize the ELBO over a dataset.
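The decomposition log p(x; θ) = ELBO + KL can be checked numerically on a toy discrete model, where every quantity is computable exactly. The numbers below are hypothetical, chosen only to make the identity concrete.

```python
import math

# Toy model: binary latent z, one fixed observation x (hypothetical numbers).
p_z = [0.4, 0.6]              # prior p(z)
p_x_given_z = [0.1, 0.7]      # likelihood p(x | z) evaluated at this x

p_joint = [p_z[k] * p_x_given_z[k] for k in range(2)]    # p(z, x)
log_px = math.log(sum(p_joint))                          # evidence log p(x)
posterior = [pj / sum(p_joint) for pj in p_joint]        # p(z | x)

q = [0.3, 0.7]  # an arbitrary variational distribution q(z)

# ELBO = sum_z q(z) log p(z, x) + H(q)
elbo = sum(q[k] * (math.log(p_joint[k]) - math.log(q[k])) for k in range(2))
# KL(q || p(z|x))
kl = sum(q[k] * (math.log(q[k]) - math.log(posterior[k])) for k in range(2))

assert abs(log_px - (elbo + kl)) < 1e-12   # identity holds exactly
assert kl >= 0                              # hence ELBO <= log p(x)
```

Since the KL term is nonnegative, the ELBO is a lower bound for any choice of q, and it is tight exactly when q equals the posterior.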
Variational learning

L(x; θ, φ_1) and L(x; θ, φ_2) are both lower bounds on log p(x; θ). We want to jointly optimize θ and φ.
The Evidence Lower Bound applied to the entire dataset

The ELBO holds for any q(z; φ):

    log p(x; θ) ≥ ∑_z q(z; φ) log p(z, x; θ) + H(q(z; φ)) = L(x; θ, φ)   (ELBO)

Maximum likelihood learning (over the entire dataset):

    ℓ(θ; D) = ∑_{x_i ∈ D} log p(x_i; θ) ≥ ∑_{x_i ∈ D} L(x_i; θ, φ_i)

Therefore

    max_θ ℓ(θ; D) ≥ max_{θ, φ_1, …, φ_M} ∑_{x_i ∈ D} L(x_i; θ, φ_i)

Note that we use different variational parameters φ_i for every data point x_i, because the true posterior p(z | x_i; θ) is different across datapoints x_i.
A variational approximation to the posterior

Assume p(z, x_i; θ) is close to p_data(z, x_i). Suppose z captures information such as the digit identity (label), style, etc. For simplicity, assume z ∈ {0, 1, 2, …, 9}.

Suppose q(z; φ_i) is a (categorical) probability distribution over the hidden variable z, parameterized by φ_i = [p_0, p_1, …, p_9]:

    q(z; φ_i) = ∏_{k ∈ {0, 1, 2, …, 9}} (φ_k^i)^{1[z = k]}

- If φ_i = [0, 0, 0, 1, 0, …, 0], is q(z; φ_i) a good approximation of p(z | x_1; θ) (x_1 is the leftmost datapoint)? Yes.
- If φ_i = [0, 0, 0, 1, 0, …, 0], is q(z; φ_i) a good approximation of p(z | x_3; θ) (x_3 is the rightmost datapoint)? No.

For each x_i, we need to find a good φ_i,* (via optimization, which can be expensive).
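The "Yes/No" answers above can be quantified by the KL divergence between the one-hot q and a hypothetical posterior. For a one-hot q concentrated on class k, only the k-th term of the KL sum survives, giving −log p(z = k | x; θ); the posterior values below are made up for illustration.

```python
import math

def kl_onehot(k, posterior):
    # KL(q || p(z|x)) for a one-hot q at class k: the z != k terms are
    # 0 * log(0/p) = 0, so only -log posterior[k] remains.
    return -math.log(posterior[k])

# Hypothetical posteriors over digit identity z in {0, ..., 9}.
post_x1 = [0.01] * 10; post_x1[3] = 0.91   # p(z | x_1): peaked on digit 3
post_x3 = [0.01] * 10; post_x3[7] = 0.91   # p(z | x_3): peaked on digit 7

q_class = 3  # q puts all its mass on digit 3
assert kl_onehot(q_class, post_x1) < kl_onehot(q_class, post_x3)
# small KL (~0.09) for x_1, large KL (~4.6) for x_3: good fit vs. bad fit
```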
Learning via stochastic variational inference (SVI)

Optimize ∑_{x_i ∈ D} L(x_i; θ, φ_i) as a function of θ, φ_1, …, φ_M using (stochastic) gradient descent:

    L(x_i; θ, φ_i) = ∑_z q(z; φ_i) log p(z, x_i; θ) + H(q(z; φ_i))
                   = E_{q(z; φ_i)}[log p(z, x_i; θ) − log q(z; φ_i)]

1. Initialize θ, φ_1, …, φ_M
2. Randomly sample a data point x_i from D
3. Optimize L(x_i; θ, φ_i) as a function of φ_i:
   1. Repeat φ_i = φ_i + η ∇_{φ_i} L(x_i; θ, φ_i)
   2. until convergence to φ_i,* ≈ arg max_φ L(x_i; θ, φ)
4. Compute ∇_θ L(x_i; θ, φ_i,*)
5. Update θ in the gradient direction. Go to step 2.

How to compute the gradients? There might not be a closed-form solution for the expectations, so we use Monte Carlo sampling.
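The inner loop (step 3) can be made concrete on a toy model where the ELBO and its gradients are available in closed form (a hypothetical conjugate-Gaussian setup, so we can verify convergence against the exact posterior): p(z) = N(0, 1), p(x | z) = N(z, 1), q(z; φ) = N(m, s²), whose exact posterior is N(x/2, 1/2).

```python
import math

# SVI inner loop for one data point x, with analytically derived gradients
# of the ELBO for this toy model (not a general implementation).
x = 3.0
m, s = 0.0, 1.0           # initialize phi = (m, s)
eta = 0.01                 # step size

for _ in range(5000):
    # ELBO(m, s) = -0.5((x-m)^2 + s^2) - 0.5(m^2 + s^2) + log s + const
    grad_m = (x - m) - m           # dELBO/dm
    grad_s = 1.0 / s - 2.0 * s     # dELBO/ds (the 1/s comes from H(q))
    m += eta * grad_m              # gradient ascent: phi <- phi + eta * grad
    s += eta * grad_s

assert abs(m - x / 2) < 1e-3            # q's mean converges to x/2
assert abs(s - math.sqrt(0.5)) < 1e-3   # q's std converges to sqrt(1/2)
```

In deep models these gradients have no closed form, which is exactly why the Monte Carlo estimators on the next slides are needed.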
Learning Deep Generative Models

    L(x; θ, φ) = ∑_z q(z; φ) log p(z, x; θ) + H(q(z; φ))
               = E_{q(z; φ)}[log p(z, x; θ) − log q(z; φ)]

Note: we drop the i superscript from φ_i for compactness.

To evaluate the bound, sample z_1, …, z_k from q(z; φ) and estimate

    E_{q(z; φ)}[log p(z, x; θ) − log q(z; φ)] ≈ (1/k) ∑_k (log p(z_k, x; θ) − log q(z_k; φ))

Key assumption: q(z; φ) is tractable, i.e., easy to sample from and evaluate.

We want to compute ∇_θ L(x; θ, φ) and ∇_φ L(x; θ, φ). The gradient with respect to θ is easy:

    ∇_θ E_{q(z; φ)}[log p(z, x; θ) − log q(z; φ)] = E_{q(z; φ)}[∇_θ log p(z, x; θ)]
                                                  ≈ (1/k) ∑_k ∇_θ log p(z_k, x; θ)
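A sanity check of the Monte Carlo ELBO estimator, on the same hypothetical Gaussian toy model used above (p(z) = N(0, 1), p(x | z) = N(z, 1), q(z; φ) = N(m, s²)), where the ELBO also has a closed form to compare against:

```python
import math
import random

random.seed(0)

x, m, s = 3.0, 1.0, 1.0    # observation and variational parameters (made up)

def log_normal(v, mean, std):
    return -0.5 * math.log(2 * math.pi * std * std) - (v - mean) ** 2 / (2 * std * std)

K = 200_000
est = 0.0
for _ in range(K):
    z = random.gauss(m, s)                                       # z_k ~ q(z; phi)
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)  # log p(z_k, x)
    est += (log_joint - log_normal(z, m, s)) / K                 # MC average

# Closed-form ELBO for this Gaussian model: E_q[log p(x|z)] + E_q[log p(z)] + H(q)
closed = (-math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s * s)
          - 0.5 * (m * m + s * s) + 0.5 * math.log(2 * math.pi * math.e * s * s))
assert abs(est - closed) < 0.05   # unbiased estimator, close for large K
```

The estimate also stays below the exact evidence log p(x) (here p(x) = N(0, 2)), as the bound requires.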
Learning Deep Generative Models

    L(x; θ, φ) = ∑_z q(z; φ) log p(z, x; θ) + H(q(z; φ))
               = E_{q(z; φ)}[log p(z, x; θ) − log q(z; φ)]

We want to compute ∇_θ L(x; θ, φ) and ∇_φ L(x; θ, φ). The gradient with respect to φ is more complicated because the expectation depends on φ. We still want to estimate it with a Monte Carlo average.

Later in the course we'll see a general technique called REINFORCE (from reinforcement learning). For now, a better but less general alternative that only works for continuous z (and only some distributions).
Reparameterization

Want to compute a gradient with respect to φ of

    E_{q(z; φ)}[r(z)] = ∫ q(z; φ) r(z) dz

where z is now continuous. Suppose q(z; φ) = N(µ, σ²I) is Gaussian with parameters φ = (µ, σ). These are equivalent ways of sampling:

- Sample z ∼ q_φ(z)
- Sample ε ∼ N(0, I), then set z = µ + σε = g(ε; φ)

Using this equivalence we can write the expectation in two ways:

    E_{z ∼ q(z; φ)}[r(z)] = E_{ε ∼ N(0, I)}[r(g(ε; φ))] = ∫ p(ε) r(µ + σε) dε

    ∇_φ E_{q(z; φ)}[r(z)] = ∇_φ E_ε[r(g(ε; φ))] = E_ε[∇_φ r(g(ε; φ))]

This is easy to estimate via Monte Carlo if r and g are differentiable w.r.t. φ and ε is easy to sample from (backpropagation):

    E_ε[∇_φ r(g(ε; φ))] ≈ (1/k) ∑_k ∇_φ r(g(ε_k; φ))   where ε_1, …, ε_k ∼ N(0, I)

Typically much lower variance than REINFORCE.
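A minimal worked instance of the trick, for the hypothetical choice r(z) = z² with q(z; φ) = N(µ, σ²), where the exact answer is known: E[z²] = µ² + σ², so ∇_µ = 2µ and ∇_σ = 2σ. With z = µ + σε, the per-sample gradients are ∇_µ r = 2z and ∇_σ r = 2zε.

```python
import random

random.seed(0)

mu, sigma = 1.0, 2.0     # variational parameters phi = (mu, sigma)
K = 200_000
g_mu = g_sigma = 0.0
for _ in range(K):
    eps = random.gauss(0.0, 1.0)     # eps ~ N(0, 1)
    z = mu + sigma * eps             # z = g(eps; phi)
    g_mu += 2 * z / K                # d r(z)/d mu    = 2z * dz/dmu    = 2z
    g_sigma += 2 * z * eps / K       # d r(z)/d sigma = 2z * dz/dsigma = 2z*eps

assert abs(g_mu - 2 * mu) < 0.1        # exact gradient is 2*mu = 2
assert abs(g_sigma - 2 * sigma) < 0.1  # exact gradient is 2*sigma = 4
```

The key point: randomness is pushed into ε, which does not depend on φ, so the gradient moves inside the expectation and each sample gives a differentiable path from φ to r.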
Learning Deep Generative Models

    L(x; θ, φ) = ∑_z q(z; φ) log p(z, x; θ) + H(q(z; φ))
               = E_{q(z; φ)}[log p(z, x; θ) − log q(z; φ)]  =  E_{q(z; φ)}[r(z, φ)]

Our case is slightly more complicated because we have E_{q(z; φ)}[r(z, φ)] instead of E_{q(z; φ)}[r(z)]: the term inside the expectation also depends on φ. We can still use reparameterization. Assume z = µ + σε = g(ε; φ) as before. Then

    E_{q(z; φ)}[r(z, φ)] = E_ε[r(g(ε; φ), φ)] ≈ (1/k) ∑_k r(g(ε_k; φ), φ)
Amortized Inference

    max_θ ℓ(θ; D) ≥ max_{θ, φ_1, …, φ_M} ∑_{x_i ∈ D} L(x_i; θ, φ_i)

So far we have used a separate set of variational parameters φ_i for each data point x_i. This does not scale to large datasets.

Amortization: instead, we learn a single parametric function f_λ that maps each x to a set of (good) variational parameters, like doing regression on x_i ↦ φ_i,*. For example, if the q(z | x_i) are Gaussians with different means µ_1, …, µ_m, we learn a single neural network f_λ mapping x_i to µ_i.

We approximate the posteriors p(z | x_i; θ) using this distribution q_λ(z | x).
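Amortization can be illustrated on the same hypothetical conjugate-Gaussian toy model (p(z) = N(0, 1), p(x | z) = N(z, 1)), where the optimal variational mean for each x_i is x_i / 2. Instead of fitting one φ_i per point, a single linear "encoder" f_λ(x) = λ·x is trained by gradient ascent on the summed ELBO; λ should recover 0.5.

```python
import math
import random

random.seed(0)

# Dataset drawn from the model's marginal p(x) = N(0, 2).
xs = [random.gauss(0.0, math.sqrt(2)) for _ in range(500)]

lam = 0.0     # single shared parameter of the encoder f_lambda(x) = lam * x
eta = 0.001
for _ in range(2000):
    # dELBO_i/dm = x_i - 2m at m = lam * x_i, times dm/dlam = x_i
    # (full-batch gradient for simplicity; SVI would use minibatches)
    grad = sum((x - 2 * lam * x) * x for x in xs) / len(xs)
    lam += eta * grad

assert abs(lam - 0.5) < 1e-3   # one shared function fits every posterior mean
```

One optimization over λ replaces M separate per-datapoint optimizations, and f_λ generalizes to unseen x; this is the role the encoder network plays in a VAE.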