Latent Variable Models
CS3750
Xiaoting Li

Outline
• Latent Variable Models
• Expectation Maximization Algorithm (EM)
• Factor Analysis
• Probabilistic Principal Component Analysis
  • Model Formulation
  • Maximum Likelihood for PPCA
  • EM for PPCA
  • Examples
• Sensible Principal Component Analysis
  • Model Formulation
  • EM for SPCA
• References
Latent Variable Models: Motivation
• Gaussian mixture models
  • A single Gaussian is not a good fit to the data
  • But a mixture of two different Gaussians may be
  • The true class of each point is unobservable

Latent Variable Models
• A latent variable model p is a probability distribution over two sets of variables s, x:
  p(s, x; θ)
  where the x variables are observed at learning time in a dataset D and the s variables are never observed
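To make the latent-variable view concrete, the sketch below samples from a two-component Gaussian mixture: a latent class s is drawn first, then x is drawn from the corresponding Gaussian, but only x would appear in the dataset D. All parameters here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters of a two-component Gaussian mixture.
weights = np.array([0.4, 0.6])                     # p(s = k)
means = np.array([[-2.0, 0.0], [3.0, 1.0]])
covs = np.array([np.eye(2), 0.5 * np.eye(2)])

n = 500
s = rng.choice(2, size=n, p=weights)               # latent class of each point (never observed)
x = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in s])

# At learning time the dataset D contains only x; s must be inferred.
print(x.shape)  # (500, 2)
```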
Latent Variable Models
• The goal of a latent variable model is to express the distribution p(x) of the observed variables x_1, ..., x_d in terms of a smaller number of latent variables s = (s_1, ..., s_q), where q < d
• [Graphical model: latent nodes s_1, s_2, s_3 (latent variable s, q dimensions, q < d) pointing to observed nodes x_1, x_2, x_3, x_4 (observed variable x, d dimensions)]

Expectation-Maximization (EM) algorithm
• The EM algorithm is a hugely important and widely used algorithm for learning directed latent-variable graphical models
• The key idea of the method: compute the parameter estimates iteratively by performing the following two steps:
  1. Expectation step. For all hidden and missing variables (and their possible value assignments), calculate their expectations under the current set of parameters Θ'
  2. Maximization step. Compute the new estimates of Θ by considering the expectations of the different value completions
• Stop when no further improvement is possible
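As a concrete instance of the two steps, here is a minimal EM loop for the Gaussian mixture model from the motivation slides. The initialization scheme, iteration count, and function name are arbitrary choices for illustration, not part of the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(x, K=2, n_iter=50, seed=0):
    """Minimal EM for a Gaussian mixture (illustrative sketch, not production code)."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    pi = np.full(K, 1.0 / K)                        # mixing weights
    mu = x[rng.choice(n, K, replace=False)]         # initialize means at random data points
    cov = np.stack([np.cov(x.T) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: expected class responsibilities under the current parameters Θ'
        r = np.stack([pi[k] * multivariate_normal.pdf(x, mu[k], cov[k]) for k in range(K)], axis=1)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate Θ from the expected value completions
        Nk = r.sum(axis=0)
        pi = Nk / n
        mu = (r.T @ x) / Nk[:, None]
        for k in range(K):
            diff = x - mu[k]
            cov[k] = (r[:, k, None] * diff).T @ diff / Nk[k]
    return pi, mu, cov
```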
Factor Analysis
• Assumptions:
  • The underlying latent variable has a Gaussian distribution
    • s ~ N(0, I): independent, Gaussian with unit variance
  • Linear relationship between latent and observed variables
  • Diagonal Gaussian noise in the data dimensions
    • ε ~ N(0, Ψ), Gaussian noise

Factor Analysis
• A common latent variable model where the relationship is linear:
  x = Ws + μ + ε
  • d-dimensional observation vector x
  • q-dimensional vector of latent variables s
  • d × q matrix W relates the two sets of variables, q < d
  • μ permits the model to have non-zero mean
  • s ~ N(0, I): independent, Gaussian with unit variance
  • ε ~ N(0, Ψ), Gaussian noise
• Then x ~ N(μ, WW^T + Ψ)
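A short sketch of the factor-analysis generative process, with made-up parameters, checking that the empirical covariance of the samples approaches the model covariance WW^T + Ψ:

```python
import numpy as np

rng = np.random.default_rng(1)
d, q, n = 4, 2, 100_000

# Hypothetical factor-analysis parameters (chosen arbitrarily for illustration).
W = rng.normal(size=(d, q))                         # d x q factor loading matrix
mu = np.array([1.0, -0.5, 0.0, 2.0])
Psi = np.diag(rng.uniform(0.1, 0.5, size=d))        # diagonal noise covariance

s = rng.normal(size=(n, q))                         # s ~ N(0, I)
eps = rng.multivariate_normal(np.zeros(d), Psi, size=n)
x = s @ W.T + mu + eps                              # x = Ws + mu + eps

# Empirical covariance should approach the model covariance W W^T + Psi.
print(np.allclose(np.cov(x.T), W @ W.T + Psi, atol=0.05))
```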
Factor Analysis
• [Graphical model: latent nodes s_1, s_2, s_3 (latent variable s, q dimensions) mapped to observed nodes x_1, x_2, x_3, x_4 (observed variable x, d dimensions)]
• s ~ N(0, I)
• Remapping: Ws (weight matrix W) + μ (location parameter) + ε ~ N(0, Ψ), Gaussian noise
• Parameters of interest: W (weight matrix), Ψ (variance of the noise), μ
• x = Ws + μ + ε
• x ~ N(μ, WW^T + Ψ)

Factor Analysis: Optimization
• Use EM to solve for the parameters
• E-step:
  • compute the posterior p(s|x)
• M-step:
  • take derivatives of the expected complete log likelihood with respect to the parameters
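The slides do not spell out the E-step formulas. Assuming the standard Gaussian-conditioning result for this model, the posterior is p(s|x) = N(m, V) with V = (I + W^T Ψ^{-1} W)^{-1} and m = V W^T Ψ^{-1} (x − μ); a minimal sketch:

```python
import numpy as np

def fa_posterior(x, W, mu, Psi):
    """E-step of factor analysis: posterior p(s | x) for one observation.

    Uses the standard Gaussian conditioning result:
      V = (I + W^T Psi^{-1} W)^{-1},   m = V W^T Psi^{-1} (x - mu).
    """
    q = W.shape[1]
    Psi_inv = np.linalg.inv(Psi)                          # Psi is diagonal, so this is cheap
    V = np.linalg.inv(np.eye(q) + W.T @ Psi_inv @ W)      # posterior covariance
    m = V @ W.T @ Psi_inv @ (x - mu)                      # posterior mean
    return m, V
```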
Principal Component Analysis
• The general motivation is to transform the data into a reduced-dimensionality representation
• A linear transformation of a d-dimensional input x to a q-dimensional vector s, with q < d, under which the retained variance is maximal
• Limitations:
  • Absence of an associated probabilistic model for the observed data
  • Computationally intensive because of the covariance matrix
  • Does not deal properly with missing data

Probabilistic PCA
• Motivations:
  • The corresponding likelihood measure would permit comparison with other density-estimation techniques and would facilitate statistical testing
  • Provides a natural framework for thinking about hypothesis testing
  • Offers the potential to extend the scope of conventional PCA
  • Can be utilized as a constrained Gaussian density model
    • Constrained covariance
  • Allows us to deal with missing values in the data set
  • Can be used to model class-conditional densities and hence can be applied to classification problems
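For reference, a minimal sketch of conventional PCA via the eigendecomposition of the sample covariance matrix, i.e. the linear projection that maximizes the retained variance (function and variable names are mine, not from the slides):

```python
import numpy as np

def pca(X, q):
    """Conventional PCA: project d-dimensional data onto the top-q principal directions.

    Returns the q-dimensional representation and the fraction of variance retained.
    """
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False)                  # d x d sample covariance
    eigvals, eigvecs = np.linalg.eigh(S)         # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]            # sort descending by variance
    U_q = eigvecs[:, order[:q]]                  # top-q eigenvectors
    Z = (X - mu) @ U_q                           # q-dimensional representation
    retained = eigvals[order[:q]].sum() / eigvals.sum()
    return Z, retained
```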
Generative View of PPCA
• Generative view of PPCA for a 2-d data space and a 1-d latent space
• [Figure: samples of the 1-d latent variable s mapped into the 2-d data space]

PPCA
• Assumptions:
  • The underlying q-dimensional latent variable s has a Gaussian distribution
  • Linear relationship between the q-dimensional latent s and the d-dimensional observed variables x
  • Isotropic Gaussian noise in the observed dimensions
    • Noise variances constrained to be equal
PPCA
• A special case of factor analysis with the noise variances constrained to be equal:
  • ε ~ N(0, σ²I)
• The conditional probability distribution over x-space given s:
  • x | s ~ N(Ws + μ, σ²I)
• Latent variables:
  • s ~ N(0, I)
• The observed data x is obtained by integrating out the latent variables:
  • x ~ N(μ, C)
  • E[x] = E[μ + Ws + ε] = μ + W E[s] + E[ε] = μ + W·0 + 0 = μ
  • C = Cov(x) = E[(μ + Ws + ε − μ)(μ + Ws + ε − μ)^T] = E[(Ws + ε)(Ws + ε)^T] = WW^T + σ²I (the observation covariance model)
• The maximum likelihood estimator for μ is given by the mean of the data; S is the sample covariance matrix of the observations {x_n}
• Estimates for W and σ² can be obtained in two ways:
  • Closed form
  • EM algorithm

PPCA
• [Graphical model: latent nodes s_1, s_2, s_3 (latent variable s, q dimensions) mapped to observed nodes x_1, x_2, x_3, x_4 (observed variable x, d dimensions)]
• s ~ N(0, I)
• Remapping: Ws (weight matrix W) + μ (location parameter) + random error (noise): ε ~ N(0, σ²I)
• Parameters of interest: W (weight matrix), σ² (variance of the noise), μ
• x = Ws + μ + ε
• x ~ N(μ, WW^T + σ²I)
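A quick numerical check of this marginalization, with made-up W, μ, and σ²: sampling s and then x | s should give a sample mean near μ and a sample covariance near WW^T + σ²I.

```python
import numpy as np

rng = np.random.default_rng(3)
d, q, n = 3, 1, 200_000

# Hypothetical PPCA parameters (for illustration only).
W = np.array([[2.0], [1.0], [-0.5]])
mu = np.array([0.5, -1.0, 2.0])
sigma2 = 0.1

s = rng.normal(size=(n, q))                                      # s ~ N(0, I)
x = s @ W.T + mu + np.sqrt(sigma2) * rng.normal(size=(n, d))     # x | s ~ N(Ws + mu, sigma2 I)

print(np.allclose(x.mean(axis=0), mu, atol=0.02))                          # E[x] = mu
print(np.allclose(np.cov(x.T), W @ W.T + sigma2 * np.eye(d), atol=0.05))   # Cov(x) = W W^T + sigma2 I
```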
Factor Analysis vs. PPCA
• PPCA
  • x ~ N(μ, WW^T + σ²I)
  • Isotropic error
• Factor Analysis
  • x ~ N(μ, WW^T + Ψ)
  • The error covariance is a diagonal matrix
  • FA doesn't change if you scale variables
    • FA looks for directions of large correlation in the data
    • FA doesn't chase large-noise features that are uncorrelated with other features
  • FA changes if you rotate the data
    • can't interpret multiple factors as being unique

Maximum Likelihood for PPCA
• The log-likelihood of the observed data under this model is given by
  ℒ = Σ_{n=1}^N ln p(x_n) = −(Nd/2) ln 2π − (N/2) ln|C| − (N/2) tr(C⁻¹S)
• where S is the sample covariance matrix of the observations {x_n}:
  S = (1/N) Σ_{n=1}^N (x_n − μ)(x_n − μ)^T
  C = WW^T + σ²I
• The log-likelihood is maximized when the columns of W span the principal subspace of the data
• Fit the parameters (W, μ, σ) by maximum likelihood: make the constrained model covariance as close as possible to the observed covariance
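As a small illustration of this objective, here is a sketch (function and variable names are mine, not from the slides) that evaluates the PPCA log-likelihood for given W, μ, σ² directly from the formula above:

```python
import numpy as np

def ppca_log_likelihood(X, W, mu, sigma2):
    """Evaluate the PPCA log-likelihood
        L = -(N/2) * [ d*ln(2*pi) + ln|C| + tr(C^{-1} S) ],   C = W W^T + sigma2 I,
    where S is the sample covariance of the observations."""
    N, d = X.shape
    C = W @ W.T + sigma2 * np.eye(d)
    Xc = X - mu
    S = Xc.T @ Xc / N                                  # sample covariance matrix
    sign, logdet = np.linalg.slogdet(C)                # stable ln|C|
    return -0.5 * N * (d * np.log(2 * np.pi) + logdet + np.trace(np.linalg.solve(C, S)))
```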
Maximum Likelihood for PPCA
• Consider the derivatives with respect to W:
  ∂ℒ/∂W = N(C⁻¹SC⁻¹W − C⁻¹W)
• Maximizing with respect to W:
  W_ML = U_q (Λ_q − σ²I)^{1/2} R
• where
  • the q column vectors in U_q are eigenvectors of S, with corresponding eigenvalues in the diagonal matrix Λ_q
  • R is an arbitrary q × q orthogonal rotation matrix
• For W = W_ML, the maximum-likelihood estimator for σ² is given by
  σ²_ML = (1/(d − q)) Σ_{j=q+1}^d λ_j
  • the average variance associated with the discarded dimensions

Maximum Likelihood for PPCA
• Consider the derivatives with respect to W:
  ∂ℒ/∂W = N(C⁻¹SC⁻¹W − C⁻¹W)
• At the stationary points SC⁻¹W = W, assuming that C⁻¹ exists
• Three possible classes of solutions:
  • W = 0: the minimum of the log-likelihood
  • C = S: the covariance model is exact
    • WW^T = S − σ²I has a known solution at W = U(Λ − σ²I)^{1/2} R, where U is a square matrix whose columns are the eigenvectors of S, Λ is the corresponding diagonal matrix of eigenvalues, and R is an arbitrary orthogonal matrix
  • SC⁻¹W = W, but W ≠ 0 and C ≠ S
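A minimal sketch of the closed-form maximum-likelihood fit implied by these formulas (eigendecompose S, keep the top q eigenpairs, take R = I), followed by a numerical check of the stationarity condition SC⁻¹W = W; the generated data and the choice q = 2 are arbitrary.

```python
import numpy as np

def ppca_ml(X, q):
    """Closed-form ML estimates for PPCA:
       W_ML = U_q (Lambda_q - sigma2 I)^{1/2} R  (with R = I here),
       sigma2_ML = average of the discarded eigenvalues of the sample covariance S."""
    N, d = X.shape
    mu = X.mean(axis=0)
    S = (X - mu).T @ (X - mu) / N
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]                   # descending eigenvalues
    lam, U = eigvals[order], eigvecs[:, order]
    sigma2 = lam[q:].mean()                             # average discarded variance
    W = U[:, :q] @ np.diag(np.sqrt(lam[:q] - sigma2))   # take R = I
    return W, mu, sigma2, S

# Numerical check of the stationarity condition S C^{-1} W = W.
rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 5)) @ rng.normal(size=(5, 5))   # arbitrary correlated data
W, mu, sigma2, S = ppca_ml(X, q=2)
C = W @ W.T + sigma2 * np.eye(X.shape[1])
print(np.allclose(S @ np.linalg.solve(C, W), W))           # True (up to numerical error)
```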