CS 3750 Advanced Machine Learning
Latent Variable Generative Models II
Ahmad Diab, AHD23@cs.pitt.edu
Feb 4, 2020
Based on slides of Professor Milos Hauskrecht

Outline
• Latent Variable Generative Models
• Cooperative Vector Quantizer Model
  • Model Formulation
  • Expectation Maximization (EM)
  • Variational Approximation
• Noisy-OR Component Analyzer
  • Model Formulation
  • Variational EM for NOCA
• References
Latent Variable Generative Models
• Generative models: unsupervised learning models that capture the underlying structure (e.g., interesting patterns) and causal structure of the data in order to generate new data like it.
• Latent (hidden) variables are random variables that are hard to observe directly (e.g., length can be measured, but intelligence cannot) and are assumed to affect the observed response variables.
• The idea: introduce an unobserved latent variable s and use it to express a complex distribution through simpler, tractable pieces:
  p(x, s) = p(x | s) p(s)
  where p(x) is the complex distribution and p(x | s), p(s) are the simpler distributions.

Latent Variable Generative Models
• Assumption: observable variables are conditionally independent given the latent variables.
• [Figure: two-layer graphical model with latent variables s_1, s_2, ..., s_q on top and observed variables x_1, x_2, ..., x_{d-1}, x_d below.]
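Written out, the independence assumption gives the factorization below; this is only a restatement of the slide's equation, with the sum becoming an integral for continuous latent variables:

```latex
p(\mathbf{x}, \mathbf{s}) = p(\mathbf{s}) \prod_{j=1}^{d} p(x_j \mid \mathbf{s}),
\qquad
p(\mathbf{x}) = \sum_{\mathbf{s}} p(\mathbf{s}) \prod_{j=1}^{d} p(x_j \mid \mathbf{s}).
```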
Cooperative Vector Quantizer (CVQ)
• Latent variables s: binary variables, dimensionality k
• Observed variables x: real-valued variables, dimensionality d
• [Figure: two-layer graphical model with latent sources s_1, s_2, ..., s_k and observed variables x_1, x_2, ..., x_{d-1}, x_d.]

CVQ – Model Description
• s: k binary variables (sources); x: d real-valued variables
• Model: x = W s + ε = Σ_{i=1}^k s_i w_i + ε, where W is a d × k weight matrix with entries w_{ji}, column w_i is the weight vector contributed by source s_i when it is active, and ε is zero-mean Gaussian noise.
• Latent variables s_i
  • ~ Bernoulli distribution with parameter π_i:
    P(s_i | π_i) = π_i^{s_i} (1 − π_i)^{1 − s_i}
• Observable variables x
  • ~ Normal distribution with parameters W, Σ:
    P(x | s) = N(W s, Σ), and we assume Σ = σ² I
• Joint for one instance of s and x:
  p(x, s | Θ) = (2πσ²)^{−d/2} exp{ −(1/(2σ²)) (x − W s)^T (x − W s) } ∏_{i=1}^k π_i^{s_i} (1 − π_i)^{1 − s_i}
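To make the generative process concrete, here is a minimal NumPy sketch of ancestral sampling from the CVQ model above; the function name, toy sizes, and parameter values are illustrative, not from the slides:

```python
import numpy as np

def sample_cvq(W, pi, sigma, n_samples, seed=None):
    """Ancestral sampling from the CVQ model: s_i ~ Bernoulli(pi_i),
    x = W s + eps with eps ~ N(0, sigma^2 I)."""
    rng = np.random.default_rng(seed)
    d, k = W.shape
    S = (rng.random((n_samples, k)) < pi).astype(float)          # binary sources s
    X = S @ W.T + sigma * rng.standard_normal((n_samples, d))    # observations x
    return X, S

# toy usage: k = 3 binary sources, d = 5 observed dimensions
W = np.random.default_rng(0).standard_normal((5, 3))
X, S = sample_cvq(W, pi=np.array([0.2, 0.5, 0.8]), sigma=0.1, n_samples=100)
```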
CVQ – Model Description
• Objective: learn the parameters of the model: W, π, σ
• If both x and s are observable:
  • Use the complete-data log-likelihood:
    l(Θ) = Σ_{n=1}^N log P(x_n, s_n | Θ)
         = Σ_{n=1}^N [ −d log σ − (1/(2σ²)) (x_n − W s_n)^T (x_n − W s_n)
           + Σ_{i=1}^k ( s_i^(n) log π_i + (1 − s_i^(n)) log(1 − π_i) ) ] + c
  • The solution is nice and easy: the terms decompose and can be maximized directly.

CVQ – Model Description
• Objective: learn the parameters of the model: W, π, σ
• If only x is observable:
  • Log-likelihood of the data:
    l(Θ) = Σ_{n=1}^N log P(x_n | Θ) = Σ_{n=1}^N log Σ_{s_n} P(x_n, s_n | Θ)
  • The solution is hard: with the sum inside the logarithm we can no longer benefit from the decomposition.
  • Use Expectation Maximization (EM).
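As a sketch of the "nice and easy" complete-data case above, the log-likelihood can be evaluated directly when both X and S are observed; constant terms absorbed into c are dropped, and the helper name is illustrative:

```python
import numpy as np

def complete_loglik(X, S, W, pi, sigma):
    """Complete-data log-likelihood sum_n log P(x_n, s_n | Theta) for the CVQ
    model, up to the additive constant c (terms in 2*pi)."""
    N, d = X.shape
    R = X - S @ W.T                                        # residuals x_n - W s_n
    gauss = -N * d * np.log(sigma) - (R ** 2).sum() / (2 * sigma ** 2)
    bern = (S * np.log(pi) + (1 - S) * np.log(1 - pi)).sum()
    return gauss + bern
```

Because the Gaussian and Bernoulli terms separate, W and σ can be fit from the first term and π from the second, which is why the complete-data solution is easy.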
Expectation Maximization (EM)
• Let H be the set of all variables with hidden or missing values and D the observed data.
• P(H, D | Θ, ξ) = P(H | D, Θ, ξ) P(D | Θ, ξ)
• log P(H, D | Θ, ξ) = log P(H | D, Θ, ξ) + log P(D | Θ, ξ)
• log P(D | Θ, ξ) = log P(H, D | Θ, ξ) − log P(H | D, Θ, ξ)
• Average both sides with the posterior P(H | D, Θ′, ξ) for current parameters Θ′:
  ⟨ log P(D | Θ, ξ) ⟩_{P(H|D,Θ′)} = ⟨ log P(H, D | Θ, ξ) ⟩_{P(H|D,Θ′)} − ⟨ log P(H | D, Θ, ξ) ⟩_{P(H|D,Θ′)}
• log P(D | Θ, ξ) = F(Θ | Θ′) = E(Θ | Θ′) + H(Θ | Θ′)
  where the left-hand side is the log-likelihood of the data (it does not depend on H),
  E(Θ | Θ′) = ⟨ log P(H, D | Θ, ξ) ⟩_{P(H|D,Θ′)} and H(Θ | Θ′) = − ⟨ log P(H | D, Θ, ξ) ⟩_{P(H|D,Θ′)}
• EM uses the true posterior P(H | D, Θ′, ξ).

Expectation Maximization (EM)
• General EM algorithm:
  • Initialize parameters Θ
  • Set Θ′ = Θ
  • Expectation step:
    E(Θ | Θ′) = ⟨ log P(H, D | Θ, ξ) ⟩_{P(H|D,Θ′)}
  • Maximization step:
    Θ = argmax_Θ E(Θ | Θ′)
  • Repeat until there is no or only a small improvement in Θ (Θ = Θ′)
• Problem
  • For CVQ the posterior factorizes over data points: P(H | D, Θ′) = ∏_{n=1}^N P(s_n | x_n, Θ′)
  • Each data point requires us to calculate 2^k probabilities
  • If k is large, this is a bottleneck
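A small sketch of why the exact E-step becomes a bottleneck: the true posterior over s for even a single data point requires enumerating all 2^k binary configurations. The helper below is illustrative and assumes 0 < π_i < 1:

```python
import itertools
import numpy as np

def exact_posterior(x, W, pi, sigma):
    """Exact posterior P(s | x, Theta) for one data point by enumerating all
    2^k configurations of s -- exponential in k, hence the bottleneck."""
    d, k = W.shape
    S = np.array(list(itertools.product([0, 1], repeat=k)), dtype=float)   # (2^k, k)
    r = x - S @ W.T                                                        # (2^k, d)
    log_joint = (-(r ** 2).sum(axis=1) / (2 * sigma ** 2)
                 + (S * np.log(pi) + (1 - S) * np.log(1 - pi)).sum(axis=1))
    w = np.exp(log_joint - log_joint.max())     # constant Gaussian factors cancel on normalization
    return S, w / w.sum()                       # configurations and their posterior probabilities
```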
Variational Approximation
• An alternative to approximate-inference methods based on stochastic sampling.
• Let H be the set of all variables with hidden or missing values.
• log P(D | Θ, ξ) = log P(H, D | Θ, ξ) − log P(H | D, Θ, ξ)
• Average both sides using a distribution Q(H | λ) [a surrogate posterior]:
  ⟨ log P(D | Θ, ξ) ⟩_{Q(H|λ)} = ⟨ log P(H, D | Θ, ξ) ⟩_{Q(H|λ)} − ⟨ log Q(H | λ) ⟩_{Q(H|λ)}
                                 + ⟨ log Q(H | λ) ⟩_{Q(H|λ)} − ⟨ log P(H | D, Θ, ξ) ⟩_{Q(H|λ)}
  log P(D | Θ, ξ) = F(Q, Θ) + KL(Q, P)
  F(Q, Θ) = Σ_{H} Q(H | λ) log P(H, D | Θ, ξ) − Σ_{H} Q(H | λ) log Q(H | λ)
  KL(Q, P) = Σ_{H} Q(H | λ) [ log Q(H | λ) − log P(H | D, Θ) ]

Variational Approximation
  log P(D | Θ, ξ) = F(Q, Θ) + KL(Q, P)
  F(Q, Θ) = Σ_{H} Q(H | λ) log P(H, D | Θ, ξ) − Σ_{H} Q(H | λ) log Q(H | λ)
  KL(Q, P) = Σ_{H} Q(H | λ) [ log Q(H | λ) − log P(H | D, Θ) ]
• Approximation: maximize F(Q, Θ)
• Parameters: Θ, λ
• Maximization of F pushes up the lower bound on the log-likelihood:
  log P(D | Θ, ξ) ≥ F(Q, Θ).
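The decomposition log P(D | Θ, ξ) = F(Q, Θ) + KL(Q, P) can be checked numerically on a tiny model by enumerating all hidden configurations. The sketch below is illustrative (feasible only for small k) and assumes 0 < π_i, λ_i < 1; it is not code from the slides:

```python
import itertools
import numpy as np

def bound_decomposition(x, W, pi, sigma, lam):
    """For one data point, return (log p(x), F(Q, Theta), KL(Q || p(s|x))) with
    Q(s | lam) a fully factorized Bernoulli(lam) surrogate posterior."""
    d, k = W.shape
    S = np.array(list(itertools.product([0, 1], repeat=k)), dtype=float)   # all 2^k configs
    r = x - S @ W.T
    log_joint = (-d / 2 * np.log(2 * np.pi * sigma ** 2)
                 - (r ** 2).sum(axis=1) / (2 * sigma ** 2)
                 + (S * np.log(pi) + (1 - S) * np.log(1 - pi)).sum(axis=1))
    m = log_joint.max()
    log_px = m + np.log(np.exp(log_joint - m).sum())       # exact log p(x)
    log_post = log_joint - log_px                          # exact log p(s | x)
    log_q = (S * np.log(lam) + (1 - S) * np.log(1 - lam)).sum(axis=1)
    q = np.exp(log_q)                                      # Q(s | lam) per configuration
    F = (q * (log_joint - log_q)).sum()                    # lower bound F(Q, Theta)
    kl = (q * (log_q - log_post)).sum()                    # KL(Q || p(s|x)) >= 0
    return log_px, F, kl                                   # log_px == F + kl
```

For any choice of λ the bound F stays below log p(x), and the gap is exactly the KL term; it closes only when Q matches the true posterior.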
Kullback-Leibler (KL) divergence
• A way to measure the difference between two probability distributions over the same variable x:
  KL(P || Q)
• The "||" operator indicates "divergence", here P's divergence from Q.
• Entropy: the average amount of information in a probability distribution:
  H(P) = E_P[ I_P(X) ] = − Σ_{i=1}^n P(i) log P(i)
• KL(P || Q) = H(P, Q) − H(P) = − Σ_{i=1}^n P(i) log Q(i) + Σ_{i=1}^n P(i) log P(i) = Σ_{i=1}^n P(i) log( P(i) / Q(i) )
• If we have some theoretically optimal target distribution P, we try to find an approximation Q that gets as close to it as possible by minimizing the KL divergence (a small numeric sketch follows the next slide).

Variational EM
• To use variational EM, we hope that if we choose Q(H | λ) well, the optimization of both λ and Θ becomes easy.
• A well-behaved choice for Q(H | λ) is the mean field approximation.
• Let H be the set of all variables with hidden or missing values:
  • E-step: compute the expectation over the hidden variables.
    Optimize F(Q, Θ) with respect to λ while keeping Θ fixed.
  • M-step: maximize the expected log-likelihood.
    Optimize F(Q, Θ) with respect to Θ while keeping λ fixed.
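The numeric sketch promised above: computing KL(P || Q) for two discrete distributions over the same support (assumes Q assigns positive probability wherever P does):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) = sum_i P(i) log(P(i) / Q(i)); terms with P(i) = 0 contribute 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
print(kl_divergence(p, q), kl_divergence(q, p), kl_divergence(p, p))   # asymmetric; zero only for identical distributions
```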
Mean Field Approximation
• To find the distribution Q, we use the mean field approximation.
• Assumptions:
  • Q(H | λ) is the mean field approximation
  • The variables H_i in the Q(H) distribution are independent
  • Q is completely factorized: Q(H | λ) = ∏_i Q_i(H_i | λ_i)
• For our CVQ model:
  • The hidden variables are the binary sources:
    Q(H | λ) = ∏_{n=1..N} Q(s_n | λ_n)
    Q(s_n | λ_n) = ∏_{i=1..k} Q(s_i^(n) | λ_i^(n))
    Q(s_i^(n)) = (λ_i^(n))^{s_i^(n)} (1 − λ_i^(n))^{1 − s_i^(n)}

Mean Field Approximation
• Functional F for the mean field:
  F(Q, Θ) = Σ_{H} Q(H | λ) log P(H, D | Θ, ξ) − Σ_{H} Q(H | λ) log Q(H | λ)
• Over the data set, F decomposes per data point:
  F(Q, Θ) = Σ_{n=1}^N [ ⟨ log P(x_n, s_n | Θ) ⟩_{Q(s_n|λ_n)} − ⟨ log Q(s_n | λ_n) ⟩_{Q(s_n|λ_n)} ]
• Assume just one data point x and corresponding s; its contribution expands to:
    ⟨ −d log σ − (1/(2σ²)) (x − W s)^T (x − W s) ⟩_{Q(s|λ)}            (1)
  + ⟨ Σ_{i=1}^k [ s_i log π_i + (1 − s_i) log(1 − π_i) ] ⟩_{Q(s|λ)}     (2)
  − ⟨ Σ_{i=1}^k [ s_i log λ_i + (1 − s_i) log(1 − λ_i) ] ⟩_{Q(s|λ)}     (3)
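Setting ∂F/∂λ_i = 0 in the single-data-point expression above gives a fixed-point (coordinate-ascent) update for the variational parameters. The slides excerpted here stop before that step, so the following NumPy sketch is a derived illustration under that assumption rather than the lecture's own algorithm:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_field_e_step(x, W, pi, sigma, n_iters=20):
    """Coordinate ascent on F(Q, Theta) for one data point: each lambda_i is
    set to the value maximizing F with the other lambda_j held fixed
    (obtained from dF/dlambda_i = 0)."""
    d, k = W.shape
    lam = np.full(k, 0.5)                                  # initial variational parameters
    norms = (W ** 2).sum(axis=0)                           # ||w_i||^2
    logit_pi = np.log(pi) - np.log(1 - pi)
    for _ in range(n_iters):
        for i in range(k):
            resid = x - W @ lam + lam[i] * W[:, i]         # remove source i's own contribution
            z = (W[:, i] @ resid - norms[i] / 2) / sigma ** 2 + logit_pi[i]
            lam[i] = sigmoid(z)
    return lam                                             # approximate posterior means of s_i
```

Cycling these updates to convergence plays the role of the E-step in the variational EM loop from the previous slide; the M-step then re-estimates W, π, and σ with λ held fixed.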