Factor Analysis & Exact Inference for Gaussian Networks
Probabilistic Graphical Models
Sharif University of Technology, Spring 2016
Soleymani
Multivariate Gaussian distribution

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}$$

- The natural, canonical, or information parameterization of a Gaussian distribution arises from the quadratic form:
$$\mathcal{N}(x \mid \eta, \Lambda) \propto \exp\left\{ -\tfrac{1}{2} x^T \Lambda x + \eta^T x \right\}, \qquad \Lambda = \Sigma^{-1}, \quad \eta = \Sigma^{-1} \mu$$
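A minimal sketch of the two parameterizations (the dimensions and numbers below are illustrative assumptions, not from the slides): the moment form and the canonical form should differ only by an $x$-independent normalization constant.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
A = rng.standard_normal((2, 2))
Sigma = A @ A.T + 2 * np.eye(2)               # a random SPD covariance

Lambda = np.linalg.inv(Sigma)                 # precision  Lambda = Sigma^{-1}
eta = Lambda @ mu                             # natural mean  eta = Sigma^{-1} mu

def log_moment(x):                            # moment form  N(x | mu, Sigma)
    return (-0.5 * np.log(np.linalg.det(2 * np.pi * Sigma))
            - 0.5 * (x - mu) @ Lambda @ (x - mu))

def log_canonical(x):                         # canonical form, up to a constant
    return -0.5 * x @ Lambda @ x + eta @ x

x1, x2 = np.array([0.5, 0.3]), np.array([-1.0, 2.0])
# the two forms differ only by an x-independent constant
assert np.isclose(log_moment(x1) - log_canonical(x1),
                  log_moment(x2) - log_canonical(x2))
```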
Joint Gaussian distribution: block elements

- If we partition the vector $x$ into $x_1$ and $x_2$:
$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}, \qquad \Sigma_{21} = \Sigma_{12}^T$$
  $\Sigma_{11}$ and $\Sigma_{22}$ are symmetric.
$$\mathcal{N}(x \mid \mu, \Sigma) = \mathcal{N}\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \,\middle|\, \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right)$$
Marginal and conditional of Gaussian

[Figure from Bishop: joint density $p(y_1, y_2)$, the marginal $p(y_1)$, and the conditional $p(y_1 \mid y_2 = 0.7)$]

- For a multivariate Gaussian distribution, all marginal and conditional distributions are also Gaussian.
Matrix inverse lemma

$$\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} M^{-1} & -M^{-1} B D^{-1} \\ -D^{-1} C M^{-1} & D^{-1} + D^{-1} C M^{-1} B D^{-1} \end{bmatrix}, \qquad M = A - B D^{-1} C$$
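A small numerical check of the partitioned inverse and of the matrix inversion lemma used later in the slides (block sizes are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3)) + 4 * np.eye(3)
B = rng.standard_normal((3, 2))
C = rng.standard_normal((2, 3))
D = rng.standard_normal((2, 2)) + 4 * np.eye(2)

full = np.block([[A, B], [C, D]])
inv = np.linalg.inv(full)

M = A - B @ np.linalg.inv(D) @ C                 # Schur complement of D
# top-left block of the inverse of the partitioned matrix is M^{-1}
assert np.allclose(inv[:3, :3], np.linalg.inv(M))

# matrix inversion lemma:
# (A - B D^{-1} C)^{-1} = A^{-1} + A^{-1} B (D - C A^{-1} B)^{-1} C A^{-1}
Ainv = np.linalg.inv(A)
lemma = Ainv + Ainv @ B @ np.linalg.inv(D - C @ Ainv @ B) @ C @ Ainv
assert np.allclose(np.linalg.inv(M), lemma)
```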
Precision matrix

- In many situations, it is convenient to work with $\Lambda = \Sigma^{-1}$, known as the precision matrix:
$$\Lambda = \Sigma^{-1} = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}^{-1} = \begin{bmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{bmatrix}, \qquad \Lambda_{21} = \Lambda_{12}^T$$
  $\Lambda_{11}$ and $\Lambda_{22}$ are symmetric.
- Relation between the inverse of a partitioned matrix and the inverses of its partitions (using the matrix inverse lemma):
$$\Lambda_{11} = \left( \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \right)^{-1}$$
$$\Lambda_{12} = -\left( \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \right)^{-1} \Sigma_{12} \Sigma_{22}^{-1}$$
Marginal and conditional distributions based on block elements of $\Lambda$

$$\Lambda = \Sigma^{-1} = \begin{bmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{bmatrix}$$

- Conditional:
$$p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid \mu_{1|2}, \Sigma_{1|2})$$
$$\mu_{1|2} = \mu_1 - \Lambda_{11}^{-1} \Lambda_{12} (x_2 - \mu_2) \qquad \text{(linear-Gaussian model)}$$
$$\Sigma_{1|2} = \Lambda_{11}^{-1}$$
- Marginal:
$$p(x_1) = \mathcal{N}(x_1 \mid \mu_1, \Sigma_1), \qquad \Sigma_1 = \left( \Lambda_{11} - \Lambda_{12} \Lambda_{22}^{-1} \Lambda_{21} \right)^{-1}$$
Marginal and conditional distributions based on block elements of $\Sigma$

- Conditional distributions:
$$p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid \mu_{1|2}, \Sigma_{1|2})$$
$$\mu_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2)$$
$$\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$$
- Marginal distributions based on block elements of $\mu$ and $\Sigma$:
$$p(x_1) = \mathcal{N}(x_1 \mid \mu_1, \Sigma_{11}), \qquad p(x_2) = \mathcal{N}(x_2 \mid \mu_2, \Sigma_{22})$$
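A sketch comparing the two routes to $p(x_1 \mid x_2)$, via covariance blocks (this slide) and via precision blocks (previous slide); the partition sizes and random joint Gaussian are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d1, d2 = 2, 3
A = rng.standard_normal((d1 + d2, d1 + d2))
Sigma = A @ A.T + np.eye(d1 + d2)                # SPD joint covariance
mu = rng.standard_normal(d1 + d2)
mu1, mu2 = mu[:d1], mu[d1:]
S11, S12 = Sigma[:d1, :d1], Sigma[:d1, d1:]
S21, S22 = Sigma[d1:, :d1], Sigma[d1:, d1:]

x2 = rng.standard_normal(d2)                     # value we condition on

# covariance-block formulas
m_cov = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)
V_cov = S11 - S12 @ np.linalg.solve(S22, S21)

# precision-block formulas
Lam = np.linalg.inv(Sigma)
L11, L12 = Lam[:d1, :d1], Lam[:d1, d1:]
L21, L22 = Lam[d1:, :d1], Lam[d1:, d1:]
m_prec = mu1 - np.linalg.solve(L11, L12 @ (x2 - mu2))
V_prec = np.linalg.inv(L11)

assert np.allclose(m_cov, m_prec) and np.allclose(V_cov, V_prec)
# marginal of x1 is N(mu1, Sigma11); equivalently (Lam11 - Lam12 Lam22^{-1} Lam21)^{-1}
assert np.allclose(S11, np.linalg.inv(L11 - L12 @ np.linalg.solve(L22, L21)))
```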
Factor analysis

- Gaussian latent variable $z$ ($M$-dimensional)
  - Continuous latent variable
- Observed variable $x$ ($D$-dimensional)
$$p(z) = \mathcal{N}(z \mid 0, I)$$
$$p(x \mid z) = \mathcal{N}(x \mid \mu + W z, \Psi)$$
  $z \in \mathbb{R}^M$, $x \in \mathbb{R}^D$
  $W$: factor loading $D \times M$ matrix
  $\Psi$: diagonal covariance matrix
Marginal distribution

$$p(z) = \mathcal{N}(z \mid 0, I), \qquad p(x \mid z) = \mathcal{N}(x \mid \mu + W z, \Psi)$$
$$x = \mu + W z + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \Psi), \quad \epsilon \text{ independent of } z$$

- The product of Gaussian distributions is Gaussian, as is the marginal of a Gaussian; thus $p(x) = \int p(z)\, p(x \mid z)\, dz$ is Gaussian.
$$E[x] = E[\mu + W z + \epsilon] = \mu + W E[z] = \mu$$
$$\mathrm{Cov}[x] = E\left[(x - \mu)(x - \mu)^T\right] = E\left[(W z + \epsilon)(W z + \epsilon)^T\right] = W E[z z^T] W^T + \Psi = W W^T + \Psi$$
$$\Rightarrow \quad p(x) = \mathcal{N}(x \mid \mu, W W^T + \Psi)$$
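A sketch (with assumed toy dimensions) that samples from the FA generative model and checks that the sample covariance approaches $W W^T + \Psi$, as derived above.

```python
import numpy as np

rng = np.random.default_rng(3)
D, M, N = 5, 2, 200_000
W = rng.standard_normal((D, M))                  # factor loading matrix
mu = rng.standard_normal(D)
psi = rng.uniform(0.1, 1.0, size=D)              # diagonal of Psi

Z = rng.standard_normal((N, M))                  # z ~ N(0, I)
eps = rng.standard_normal((N, D)) * np.sqrt(psi) # eps ~ N(0, Psi)
X = mu + Z @ W.T + eps                           # x = mu + W z + eps

emp_cov = np.cov(X, rowvar=False)
model_cov = W @ W.T + np.diag(psi)
print(np.max(np.abs(emp_cov - model_cov)))       # small Monte Carlo error
assert np.allclose(emp_cov, model_cov, atol=0.1)
```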
Joint Gaussian distribution

$$\Sigma_{zx} = \mathrm{Cov}(z, x) = E\left[ z (W z + \epsilon)^T \right] = W^T$$
$$\Sigma_{xx} = W W^T + \Psi, \qquad \Sigma_{zz} = I$$
$$p\left( \begin{bmatrix} z \\ x \end{bmatrix} \right) = \mathcal{N}\left( \begin{bmatrix} z \\ x \end{bmatrix} \,\middle|\, \begin{bmatrix} 0 \\ \mu \end{bmatrix}, \begin{bmatrix} I & W^T \\ W & W W^T + \Psi \end{bmatrix} \right)$$
Conditional distributions

- Recall: $p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid \mu_{1|2}, \Sigma_{1|2})$ with $\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)$ and $\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$.
- Applied to the FA joint:
$$p(z \mid x) = \mathcal{N}(z \mid \mu_{z|x}, \Sigma_{z|x})$$
$$\mu_{z|x} = W^T (W W^T + \Psi)^{-1} (x - \mu)$$
$$\Sigma_{z|x} = I - W^T (W W^T + \Psi)^{-1} W$$
- A $D \times D$ matrix is required to be inverted. If $M < D$, it is preferable to use:
$$\mu_{z|x} = (I + W^T \Psi^{-1} W)^{-1} W^T \Psi^{-1} (x - \mu) = \Sigma_{z|x} W^T \Psi^{-1} (x - \mu)$$
$$\Sigma_{z|x} = (I + W^T \Psi^{-1} W)^{-1}$$
- The posterior covariance does not depend on the observed data $x$! Computing the posterior mean is a linear operation.

Matrix inversion lemma: $(A - B D^{-1} C)^{-1} = A^{-1} + A^{-1} B (D - C A^{-1} B)^{-1} C A^{-1}$
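A sketch (toy dimensions assumed) checking that the two expressions for the FA posterior $p(z \mid x)$ agree: the direct $D \times D$ inverse and the cheaper $M \times M$ inverse obtained from the matrix inversion lemma.

```python
import numpy as np

rng = np.random.default_rng(4)
D, M = 6, 2
W = rng.standard_normal((D, M))
mu = rng.standard_normal(D)
Psi = np.diag(rng.uniform(0.2, 1.0, size=D))
x = rng.standard_normal(D)

# form 1: invert the D x D marginal covariance W W^T + Psi
C = W @ W.T + Psi
m1 = W.T @ np.linalg.solve(C, x - mu)
V1 = np.eye(M) - W.T @ np.linalg.solve(C, W)

# form 2: invert only an M x M matrix (matrix inversion lemma)
V2 = np.linalg.inv(np.eye(M) + W.T @ np.linalg.solve(Psi, W))
m2 = V2 @ W.T @ np.linalg.solve(Psi, x - mu)

assert np.allclose(V1, V2) and np.allclose(m1, m2)
```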
Geometric illustration

[Figure from Jordan: the low-dimensional manifold spanned by $W$ embedded in the data space ($y_1, y_2, y_3$ axes), with $p(z)$ along the manifold and $p(x \mid z)$ as noise around it]

- To generate data, first generate a point within the manifold, then add noise.
Factor analysis: dimensionality reduction

- FA is just a constrained Gaussian model.
  - If $\Psi$ were not diagonal, then we could model any Gaussian.
- FA is a low-rank parameterization of a multivariate Gaussian.
  - Since $p(x) = \mathcal{N}(x \mid \mu, W W^T + \Psi)$, FA approximates the covariance matrix of the visible vector using the low-rank decomposition $W W^T$ and the diagonal matrix $\Psi$.
  - $W W^T + \Psi$ is the product of two low-rank matrices plus a diagonal matrix (i.e., $O(MD)$ parameters instead of $O(D^2)$).
- Given $\{x^{(1)}, \dots, x^{(N)}\}$ (the observations of high-dimensional data), by learning from incomplete data we find $W$ for transforming the data to a lower-dimensional space.
Incomplete likelihood

$$\ell(\theta; \mathcal{D}) = -\frac{N}{2} \log\left| W W^T + \Psi \right| - \frac{1}{2} \sum_{n=1}^{N} (x^{(n)} - \mu)^T (W W^T + \Psi)^{-1} (x^{(n)} - \mu)$$
$$= -\frac{N}{2} \log\left| W W^T + \Psi \right| - \frac{1}{2} \mathrm{tr}\left[ (W W^T + \Psi)^{-1} S \right]$$
$$S = \sum_{n=1}^{N} (x^{(n)} - \mu)(x^{(n)} - \mu)^T, \qquad \mu = \frac{1}{N} \sum_{n=1}^{N} x^{(n)}$$
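A sketch (assumed toy data) evaluating the incomplete log-likelihood in both forms of the slide, the per-sample sum and the trace form with the scatter matrix $S$; both omit the constant $-\tfrac{ND}{2}\log 2\pi$.

```python
import numpy as np

rng = np.random.default_rng(5)
D, M, N = 4, 2, 50
W = rng.standard_normal((D, M))
Psi = np.diag(rng.uniform(0.2, 1.0, size=D))
X = rng.standard_normal((N, D))                  # arbitrary data for the check

mu = X.mean(axis=0)
C = W @ W.T + Psi                                # model covariance
Cinv = np.linalg.inv(C)
Xc = X - mu

ll_sum = (-0.5 * N * np.linalg.slogdet(C)[1]
          - 0.5 * np.einsum('nd,de,ne->', Xc, Cinv, Xc))
S = Xc.T @ Xc                                    # scatter matrix
ll_trace = -0.5 * N * np.linalg.slogdet(C)[1] - 0.5 * np.trace(Cinv @ S)
assert np.isclose(ll_sum, ll_trace)
```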
E-step: expected sufficient statistics

$$E_{p(h \mid x, \theta^t)}\left[ \log p(x, h \mid \theta) \right] = \sum_{n=1}^{N} E_{p(z^{(n)} \mid x^{(n)}, \theta^t)}\left[ \log p(z^{(n)} \mid \theta) + \log p(x^{(n)} \mid z^{(n)}, \theta) \right]$$

- Expected complete-data log-likelihood (centered data, constants dropped):
$$E\left[ \log p(x, h \mid \theta) \right] = -\frac{N}{2} \log|\Psi| - \frac{1}{2} \sum_{n=1}^{N} \mathrm{tr}\left( E[z^{(n)} z^{(n)T}] \right) - \frac{1}{2} \sum_{n=1}^{N} E\left[ (x^{(n)} - W z^{(n)})^T \Psi^{-1} (x^{(n)} - W z^{(n)}) \right]$$
$$E\left[ (x^{(n)} - W z^{(n)})(x^{(n)} - W z^{(n)})^T \right] = x^{(n)} x^{(n)T} - W E[z^{(n)}]\, x^{(n)T} - x^{(n)} E[z^{(n)}]^T W^T + W\, E[z^{(n)} z^{(n)T}]\, W^T$$
- Expected sufficient statistics:
$$E_{p(z^{(n)} \mid x^{(n)}, \theta^t)}[z^{(n)}] = \mu_{z|x^{(n)}} = \Sigma_{z|x} W^T \Psi^{-1} (x^{(n)} - \mu)$$
$$E_{p(z^{(n)} \mid x^{(n)}, \theta^t)}[z^{(n)} z^{(n)T}] = \Sigma_{z|x} + \mu_{z|x^{(n)}} \mu_{z|x^{(n)}}^T$$
$$\text{where } \Sigma_{z|x} = (I + W^T \Psi^{-1} W)^{-1}$$
M-step

$$W^{t+1} = \left( \sum_{n=1}^{N} x^{(n)} E[z^{(n)}]^T \right) \left( \sum_{n=1}^{N} E[z^{(n)} z^{(n)T}] \right)^{-1}$$
$$\Psi^{t+1} = \frac{1}{N} \,\mathrm{diag}\left( \sum_{n=1}^{N} E\left[ (x^{(n)} - W^{t+1} z^{(n)})(x^{(n)} - W^{t+1} z^{(n)})^T \right] \right) = \frac{1}{N} \,\mathrm{diag}\left( \sum_{n=1}^{N} x^{(n)} x^{(n)T} - W^{t+1} E[z^{(n)}]\, x^{(n)T} \right)$$
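A compact EM sketch that puts the E-step statistics and this M-step together, under the assumptions of these slides ($\mu$ fixed to the sample mean, toy synthetic data). The incomplete log-likelihood should increase monotonically, which is a useful sanity check on any implementation.

```python
import numpy as np

def fa_em(X, M, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    W = rng.standard_normal((D, M))
    psi = np.var(Xc, axis=0)                       # diagonal of Psi
    lls = []
    for _ in range(iters):
        # E-step: posterior statistics E[z^(n)] and sum_n E[z^(n) z^(n)^T]
        Sz = np.linalg.inv(np.eye(M) + W.T @ (W / psi[:, None]))   # Sigma_{z|x}
        Ez = Xc @ (W / psi[:, None]) @ Sz                          # N x M
        Ezz_sum = N * Sz + Ez.T @ Ez
        # M-step: closed-form updates of W and the diagonal Psi
        W = (Xc.T @ Ez) @ np.linalg.inv(Ezz_sum)
        psi = np.diag(Xc.T @ Xc - W @ Ez.T @ Xc) / N
        # track the incomplete log-likelihood (constants dropped)
        C = W @ W.T + np.diag(psi)
        lls.append(-0.5 * N * np.linalg.slogdet(C)[1]
                   - 0.5 * np.trace(np.linalg.solve(C, Xc.T @ Xc)))
    return W, psi, lls

X = np.random.default_rng(6).standard_normal((500, 5)) @ np.diag([3, 2, 1, 1, 1])
W, psi, lls = fa_em(X, M=2)
assert all(b >= a - 1e-6 for a, b in zip(lls, lls[1:]))            # monotone EM
```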
Unidentifiability

- $W$ only appears as the product $W W^T$, thus the model is invariant to rotations and axis flips of the latent space.
- $W$ can be replaced with $W R$ for any orthonormal matrix $R$, and the model, which depends only on $W W^T$, remains the same.
- Thus, FA is an unidentifiable model.
  - The likelihood objective on a data set will not have a unique maximum (an infinite number of parameter settings attain the maximum score).
  - It cannot be guaranteed to identify the same parameters.
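A quick sketch (toy sizes assumed) of the rotation non-identifiability: $W$ and $W R$ give exactly the same marginal covariance, hence the same likelihood.

```python
import numpy as np

rng = np.random.default_rng(7)
D, M = 5, 2
W = rng.standard_normal((D, M))
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # orthonormal: R R^T = I
assert np.allclose((W @ R) @ (W @ R).T, W @ W.T)  # same W W^T, same model
```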
Probabilistic PCA & FA

- Data is a linear function of low-dimensional latent coordinates, plus Gaussian noise.
$$p(z) = \mathcal{N}(z \mid 0, I), \qquad p(x \mid z) = \mathcal{N}(x \mid W z + \mu, \Psi), \qquad p(x) = \mathcal{N}(x \mid \mu, W W^T + \Psi)$$
- Factor analysis: $\Psi$ is a general diagonal matrix.
- Probabilistic PCA: $\Psi = \sigma^2 I$.

[Graphical model from Bishop: $z \rightarrow x$]
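For probabilistic PCA the maximum-likelihood parameters have a known closed form (Tipping & Bishop): $W_{ML}$ is built from the top-$M$ eigenpairs of the sample covariance $S$, and $\sigma^2_{ML}$ is the average of the discarded eigenvalues. A sketch (data and sizes are assumptions) verifying the characteristic property that the implied covariance keeps the top-$M$ eigenvalues of $S$ and flattens the rest to $\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(8)
D, M, N = 6, 2, 5000
X = rng.standard_normal((N, D)) @ rng.standard_normal((D, D))
S = np.cov(X, rowvar=False)

evals, evecs = np.linalg.eigh(S)
evals, evecs = evals[::-1], evecs[:, ::-1]         # sort descending
sigma2 = evals[M:].mean()                          # average discarded variance
W_ml = evecs[:, :M] @ np.diag(np.sqrt(evals[:M] - sigma2))

C = W_ml @ W_ml.T + sigma2 * np.eye(D)             # implied ML covariance
c_evals = np.sort(np.linalg.eigvalsh(C))[::-1]
# C keeps the top-M eigenvalues of S and replaces the rest by sigma^2
assert np.allclose(c_evals[:M], evals[:M]) and np.allclose(c_evals[M:], sigma2)
```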
PPCA

- The posterior mean is not an orthogonal projection, since it is shrunk somewhat towards the prior mean. [Murphy]
Exact inference for Gaussian networks
Multivariate Gaussian distribution

$$p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}$$
$$p(x) \propto \exp\left\{ -\tfrac{1}{2} x^T \Lambda x + (\Lambda \mu)^T x \right\}, \qquad \Lambda = \Sigma^{-1}$$

- $p$ is normalizable (i.e., the normalization constant is finite) and defines a legal Gaussian distribution if and only if $\Lambda$ is positive definite.
- Directed model
  - Linear-Gaussian model
- Undirected model
  - Gaussian MRF
Linear-Gaussian model

- Linear-Gaussian model for CPDs:
$$p(x_i \mid x_{pa_i}) = \mathcal{N}\left( x_i \,\middle|\, \sum_{j \in pa_i} w_{ij} x_j + b_i, \; v_i \right)$$
- The joint distribution is Gaussian:
$$\ln p(x_1, \dots, x_D) = -\sum_{i=1}^{D} \frac{1}{2 v_i} \left( x_i - \sum_{j \in pa_i} w_{ij} x_j - b_i \right)^2 + \text{const}$$
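A sketch (an assumed toy network $x_1 \rightarrow x_2 \rightarrow x_3$ with made-up weights) showing that linear-Gaussian CPDs compose into one joint Gaussian: writing $x = Bx + b + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \mathrm{diag}(v))$ gives mean $(I-B)^{-1}b$ and covariance $(I-B)^{-1}\,\mathrm{diag}(v)\,(I-B)^{-T}$, checked here against ancestral sampling from the CPDs.

```python
import numpy as np

rng = np.random.default_rng(9)
# CPDs: x1 ~ N(1, 0.5);  x2 ~ N(2*x1 - 1, 1.0);  x3 ~ N(-x2 + 3, 0.2)
B = np.array([[0., 0., 0.],
              [2., 0., 0.],
              [0., -1., 0.]])                       # weights w_ij (parents only)
b = np.array([1., -1., 3.])
v = np.array([0.5, 1.0, 0.2])

I = np.eye(3)
mu = np.linalg.solve(I - B, b)
Sigma = np.linalg.solve(I - B, np.diag(v)) @ np.linalg.inv(I - B).T

# ancestral sampling from the CPDs
N = 200_000
X = np.zeros((N, 3))
X[:, 0] = 1 + np.sqrt(0.5) * rng.standard_normal(N)
X[:, 1] = 2 * X[:, 0] - 1 + np.sqrt(1.0) * rng.standard_normal(N)
X[:, 2] = -X[:, 1] + 3 + np.sqrt(0.2) * rng.standard_normal(N)

assert np.allclose(X.mean(axis=0), mu, atol=0.02)
assert np.allclose(np.cov(X, rowvar=False), Sigma, atol=0.05)
```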