Factor analysis & Exact inference for Gaussian networks




  1. Factor analysis & Exact inference for Gaussian networks. Probabilistic Graphical Models, Sharif University of Technology, Spring 2016, Soleymani.

  2. Multivariate Gaussian distribution: $\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\}$. The natural, canonical, or information parameterization of a Gaussian distribution arises from the quadratic form: $\mathcal{N}(x \mid h, J) \propto \exp\{-\frac{1}{2} x^T J x + h^T x\}$, where $\Lambda = J = \Sigma^{-1}$ is the precision matrix and $h = \Sigma^{-1}\mu$.
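To make the two parameterizations concrete, here is a minimal NumPy sketch (my own illustration, not part of the slides) converting between the moment form $(\mu, \Sigma)$ and the information form $(h, J)$ and checking the round trip:

```python
import numpy as np

def to_information_form(mu, Sigma):
    """Return (h, J) with J = Sigma^{-1} and h = Sigma^{-1} mu."""
    J = np.linalg.inv(Sigma)
    return J @ mu, J

def to_moment_form(h, J):
    """Return (mu, Sigma) with Sigma = J^{-1} and mu = J^{-1} h."""
    Sigma = np.linalg.inv(J)
    return Sigma @ h, Sigma

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
h, J = to_information_form(mu, Sigma)
mu2, Sigma2 = to_moment_form(h, J)
assert np.allclose(mu, mu2) and np.allclose(Sigma, Sigma2)
```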

  3. Joint Gaussian distribution: block elements. If we partition the vector $x$ into $x_1$ and $x_2$: $\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$ and $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$, where $\Sigma_{21} = \Sigma_{12}^T$ and $\Sigma_{11}$, $\Sigma_{22}$ are symmetric, so that $P\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \mathcal{N}\left( \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \,\middle|\, \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right)$.

  4. Marginal and conditional of Gaussian. [Figure from Bishop: contours of the joint $P(x_1, x_2)$ with the slice $x_2 = 0.7$, alongside the conditional $P(x_1 \mid x_2 = 0.7)$ and the marginal $P(x_1)$.] For a multivariate Gaussian distribution, all marginal and conditional distributions are also Gaussian.

  5. Matrix inverse lemma. For a partitioned matrix, $\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} M & -MBD^{-1} \\ -D^{-1}CM & D^{-1} + D^{-1}CMBD^{-1} \end{pmatrix}$, where $M = (A - BD^{-1}C)^{-1}$ is the inverse of the Schur complement of $D$.
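As a sanity check, the partitioned-inverse formula is easy to verify numerically. A small sketch of mine (block names follow the slide; $D$ is shifted to keep the test matrix well-conditioned):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.normal(size=(3, 3)), rng.normal(size=(3, 2))
C, D = rng.normal(size=(2, 3)), rng.normal(size=(2, 2)) + 5 * np.eye(2)

Di = np.linalg.inv(D)
M = np.linalg.inv(A - B @ Di @ C)          # inverse of the Schur complement of D
top = np.hstack([M, -M @ B @ Di])
bottom = np.hstack([-Di @ C @ M, Di + Di @ C @ M @ B @ Di])
block_inverse = np.vstack([top, bottom])

full = np.block([[A, B], [C, D]])
assert np.allclose(block_inverse, np.linalg.inv(full))
```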

  6. Precision matrix. In many situations it is convenient to work with $\Lambda = \Sigma^{-1}$, known as the precision matrix: $\Lambda = \Sigma^{-1} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1} = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix}$, where $\Lambda_{21} = \Lambda_{12}^T$ and $\Lambda_{11}$, $\Lambda_{22}$ are symmetric. The relation between the inverse of a partitioned matrix and the inverses of its partitions (using the matrix inverse lemma): $\Lambda_{11} = (\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})^{-1}$ and $\Lambda_{12} = -(\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})^{-1}\Sigma_{12}\Sigma_{22}^{-1}$.

  7. Marginal and conditional distributions based on block elements of $\Lambda = \Sigma^{-1} = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix}$. Conditional: $P(x_1 \mid x_2) = \mathcal{N}(x_1 \mid \mu_{1|2}, \Sigma_{1|2})$ with $\mu_{1|2} = \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}(x_2 - \mu_2)$ and $\Sigma_{1|2} = \Lambda_{11}^{-1}$ (a linear-Gaussian model). Marginal: $P(x_1) = \mathcal{N}(x_1 \mid \mu_1, \Sigma_1)$ with $\Sigma_1 = (\Lambda_{11} - \Lambda_{12}\Lambda_{22}^{-1}\Lambda_{21})^{-1}$.

  8. Marginal and conditional distributions based on block elements of $\Sigma$. Conditional distributions: $P(x_1 \mid x_2) = \mathcal{N}(x_1 \mid \mu_{1|2}, \Sigma_{1|2})$ with $\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)$ and $\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$. Marginal distributions based on block elements of $\mu$ and $\Sigma$: $P(x_1) = \mathcal{N}(x_1 \mid \mu_1, \Sigma_{11})$ and $P(x_2) = \mathcal{N}(x_2 \mid \mu_2, \Sigma_{22})$.
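Slides 7 and 8 describe the same conditional through different blocks. A short sketch (mine, with arbitrary test values) confirming that both routes agree:

```python
import numpy as np

rng = np.random.default_rng(1)
Q = rng.normal(size=(5, 5))
Sigma = Q @ Q.T + 5 * np.eye(5)            # random SPD covariance
mu = rng.normal(size=5)
i1, i2 = slice(0, 2), slice(2, 5)          # partition x = (x1, x2)
x2 = rng.normal(size=3)                    # an observed value for x2

# Route 1: blocks of Sigma (slide 8).
S11, S12, S22 = Sigma[i1, i1], Sigma[i1, i2], Sigma[i2, i2]
mu_c = mu[i1] + S12 @ np.linalg.solve(S22, x2 - mu[i2])
Sig_c = S11 - S12 @ np.linalg.solve(S22, S12.T)

# Route 2: blocks of the precision Lambda = Sigma^{-1} (slide 7).
Lam = np.linalg.inv(Sigma)
L11, L12 = Lam[i1, i1], Lam[i1, i2]
mu_c2 = mu[i1] - np.linalg.solve(L11, L12 @ (x2 - mu[i2]))
Sig_c2 = np.linalg.inv(L11)

assert np.allclose(mu_c, mu_c2) and np.allclose(Sig_c, Sig_c2)
```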

  9. Factor analysis. A Gaussian latent variable $z$ ($L$-dimensional, continuous) and an observed variable $x$ ($D$-dimensional): $P(z) = \mathcal{N}(z \mid 0, I)$ and $P(x \mid z) = \mathcal{N}(x \mid \mu + Az, \Psi)$, where $z \in \mathbb{R}^L$, $x \in \mathbb{R}^D$, $A$ is the $D \times L$ factor loading matrix, and $\Psi$ is a diagonal covariance matrix.

  10. Marginal distribution. Equivalently, $x = \mu + Az + w$ with noise $w \sim \mathcal{N}(0, \Psi)$ independent of $z$. The product of Gaussian distributions is Gaussian, as is the marginal of a Gaussian, thus $P(x) = \int P(z) P(x \mid z) \, dz$ is Gaussian: $\mu_x = E[x] = E[\mu + Az + w] = \mu + A\,E[z] = \mu$, and $\Sigma_{xx} = E[(x - \mu)(x - \mu)^T] = E[(Az + w)(Az + w)^T] = A\,E[zz^T]A^T + \Psi = AA^T + \Psi$. Hence $P(x) = \mathcal{N}(x \mid \mu, AA^T + \Psi)$.
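A generative sketch of this marginal (my own illustration): sample $z$, set $x = \mu + Az + w$, and compare the empirical covariance with $AA^T + \Psi$:

```python
import numpy as np

rng = np.random.default_rng(2)
D, L, N = 4, 2, 200_000
A = rng.normal(size=(D, L))                # factor loading matrix
mu = rng.normal(size=D)
psi = rng.uniform(0.5, 1.5, size=D)        # diagonal of Psi

Z = rng.normal(size=(N, L))                # z ~ N(0, I)
W = rng.normal(size=(N, D)) * np.sqrt(psi) # w ~ N(0, Psi)
X = mu + Z @ A.T + W

emp_cov = np.cov(X, rowvar=False)
model_cov = A @ A.T + np.diag(psi)
print(np.max(np.abs(emp_cov - model_cov)))   # small for large N
```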

  11. Joint Gaussian distribution. $\Sigma_{zx} = \mathrm{Cov}(z, x) = E[z(Az + w)^T] = A^T$, together with $\Sigma_{xx} = AA^T + \Psi$ and $\Sigma_{zz} = I$, gives $\Sigma = \begin{pmatrix} I & A^T \\ A & AA^T + \Psi \end{pmatrix}$ and $E\begin{pmatrix} z \\ x \end{pmatrix} = \begin{pmatrix} 0 \\ \mu \end{pmatrix}$, so $P\begin{pmatrix} z \\ x \end{pmatrix} = \mathcal{N}\left( \begin{pmatrix} z \\ x \end{pmatrix} \,\middle|\, \begin{pmatrix} 0 \\ \mu \end{pmatrix}, \begin{pmatrix} I & A^T \\ A & AA^T + \Psi \end{pmatrix} \right)$.

  12. Conditional distributions. Applying $\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)$ and $\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$ to the joint above gives the posterior $P(z \mid x) = \mathcal{N}(z \mid \mu_{z|x}, \Sigma_{z|x})$ with $\mu_{z|x} = A^T(AA^T + \Psi)^{-1}(x - \mu)$ and $\Sigma_{z|x} = I - A^T(AA^T + \Psi)^{-1}A$. Here a $D \times D$ matrix must be inverted; if $L < D$, it is preferable to use $\Sigma_{z|x} = (I + A^T\Psi^{-1}A)^{-1}$ and $\mu_{z|x} = (I + A^T\Psi^{-1}A)^{-1}A^T\Psi^{-1}(x - \mu) = \Sigma_{z|x}A^T\Psi^{-1}(x - \mu)$, by the identity $(A - BD^{-1}C)^{-1} = A^{-1} + A^{-1}B(D - CA^{-1}B)^{-1}CA^{-1}$. Note that the posterior covariance does not depend on the observed data $x$, and computing the posterior mean is a linear operation.
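Both posterior forms can be checked against each other numerically. A sketch (variable names mirror the slide; the test values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
D, L = 6, 2
A = rng.normal(size=(D, L))
mu = rng.normal(size=D)
Psi = np.diag(rng.uniform(0.5, 1.5, size=D))
x = rng.normal(size=D)

# D x D version.
G = np.linalg.inv(A @ A.T + Psi)
mean1 = A.T @ G @ (x - mu)
cov1 = np.eye(L) - A.T @ G @ A

# L x L version (preferred when L < D).
Psi_inv = np.linalg.inv(Psi)
cov2 = np.linalg.inv(np.eye(L) + A.T @ Psi_inv @ A)
mean2 = cov2 @ A.T @ Psi_inv @ (x - mu)

assert np.allclose(mean1, mean2) and np.allclose(cov1, cov2)
```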

  13. Geometric illustration. [Figure from Jordan: data space $(x_1, x_2, x_3)$ with the low-dimensional latent manifold, showing $P(z)$ and $P(z \mid x)$.] To generate data, first generate a point within the manifold, then add noise.

  14. Factor analysis: dimensionality reduction. FA is just a constrained Gaussian model: if $\Psi$ were not diagonal, we could model any Gaussian. FA is a low-rank parameterization of a multivariate Gaussian: since $P(x) = \mathcal{N}(x \mid \mu, AA^T + \Psi)$, FA approximates the covariance matrix of the visible vector using the low-rank decomposition $AA^T$ and the diagonal matrix $\Psi$. That is, $AA^T + \Psi$ is the outer product of a low-rank matrix with itself plus a diagonal matrix, requiring $O(LD)$ parameters instead of $O(D^2)$; for example, with $D = 1000$ and $L = 10$ this is roughly $10^4$ parameters rather than the $\approx 5 \times 10^5$ of a full covariance matrix. Given $\{x^{(1)}, \dots, x^{(N)}\}$ (observations of high-dimensional data), learning from incomplete data yields an $A$ for transforming data to a lower-dimensional space.

  15. Incomplete likelihood. $\ell(\theta; \mathcal{D}) = -\frac{N}{2}\log|AA^T + \Psi| - \frac{1}{2}\sum_{n=1}^{N}(x^{(n)} - \mu)^T(AA^T + \Psi)^{-1}(x^{(n)} - \mu) = -\frac{N}{2}\log|AA^T + \Psi| - \frac{1}{2}\mathrm{tr}[(AA^T + \Psi)^{-1}S]$, where $S = \sum_{n=1}^{N}(x^{(n)} - \mu)(x^{(n)} - \mu)^T$ and $\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N} x^{(n)}$.
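A direct NumPy evaluation of this objective (a sketch of mine; the additive constant $-\frac{Nd}{2}\log 2\pi$, dropped on the slide, is included for completeness):

```python
import numpy as np

def fa_loglik(X, mu, A, psi):
    """log p(X | theta) under p(x) = N(x | mu, A A^T + diag(psi))."""
    N, D = X.shape
    C = A @ A.T + np.diag(psi)
    Xc = X - mu
    S = Xc.T @ Xc                                    # scatter matrix
    _, logdet = np.linalg.slogdet(C)
    return (-0.5 * N * (logdet + D * np.log(2 * np.pi))
            - 0.5 * np.trace(np.linalg.solve(C, S)))
```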

  16. E-step: expected sufficient statistics. $E_{P(\mathcal{H}|\mathcal{D},\theta)}[\log P(\mathcal{D}, \mathcal{H} \mid \theta)] = \sum_{n=1}^{N} E_{P(z^{(n)}|x^{(n)},\theta)}[\log P(z^{(n)} \mid \theta) + \log P(x^{(n)} \mid z^{(n)}, \theta)] = -\frac{N}{2}\log|\Psi| - \frac{1}{2}\sum_{n=1}^{N}\mathrm{tr}(E[z^{(n)}z^{(n)T}]) - \frac{1}{2}\sum_{n=1}^{N}\mathrm{tr}(\Psi^{-1}E[(x^{(n)} - Az^{(n)})(x^{(n)} - Az^{(n)})^T]) + c$, where $E[(x^{(n)} - Az^{(n)})(x^{(n)} - Az^{(n)})^T] = x^{(n)}x^{(n)T} - A\,E[z^{(n)}]x^{(n)T} - x^{(n)}E[z^{(n)}]^T A^T + A\,E[z^{(n)}z^{(n)T}]A^T$. The expected sufficient statistics: $E_{P(z^{(n)}|x^{(n)},\theta)}[z^{(n)}] = \mu_{z|x^{(n)}} = \Sigma_{z|x}A^T\Psi^{-1}(x^{(n)} - \mu)$ and $E_{P(z^{(n)}|x^{(n)},\theta)}[z^{(n)}z^{(n)T}] = \Sigma_{z|x} + \mu_{z|x^{(n)}}\mu_{z|x^{(n)}}^T$, with $\Sigma_{z|x} = (I + A^T\Psi^{-1}A)^{-1}$.
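A sketch of the E-step as a function (my own packaging of the statistics above; it assumes the diagonal $\Psi$ is stored as the vector `psi`):

```python
import numpy as np

def fa_e_step(X, mu, A, psi):
    """Return E[z^(n)] (N x L) and sum_n E[z^(n) z^(n)^T] (L x L)."""
    Psi_inv_A = A / psi[:, None]                  # Psi^{-1} A, with Psi diagonal
    cov_z = np.linalg.inv(np.eye(A.shape[1]) + A.T @ Psi_inv_A)   # Sigma_{z|x}
    Ez = (X - mu) @ Psi_inv_A @ cov_z             # rows: Sigma_{z|x} A^T Psi^{-1} (x - mu)
    Ezz_sum = X.shape[0] * cov_z + Ez.T @ Ez      # sum of Sigma_{z|x} + E[z] E[z]^T
    return Ez, Ezz_sum
```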

  17. M-Step. $A^{t+1} = \left[\sum_{n=1}^{N} x^{(n)}E[z^{(n)}]^T\right]\left[\sum_{n=1}^{N} E[z^{(n)}z^{(n)T}]\right]^{-1}$ (with the data centered, i.e., $x^{(n)}$ standing for $x^{(n)} - \mu_{ML}$), and $\Psi^{t+1} = \frac{1}{N}\mathrm{diag}\left\{\sum_{n=1}^{N} E[(x^{(n)} - A^{t+1}z^{(n)})(x^{(n)} - A^{t+1}z^{(n)})^T]\right\} = \frac{1}{N}\mathrm{diag}\left\{\sum_{n=1}^{N} x^{(n)}x^{(n)T} - A^{t+1}E[z^{(n)}]x^{(n)T}\right\}$.
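A matching M-step plus a small EM loop wiring the two together (a sketch under the same assumptions; `fa_e_step` is the function from the previous slide's example, and $\mu$ is fixed at the sample mean):

```python
import numpy as np

def fa_m_step(X, mu, Ez, Ezz_sum):
    """One M-step: returns updated A (D x L) and psi (diagonal of Psi)."""
    Xc = X - mu
    # A^{t+1} = [sum_n (x - mu) E[z]^T] [sum_n E[z z^T]]^{-1}
    A_new = np.linalg.solve(Ezz_sum.T, (Xc.T @ Ez).T).T
    # Psi^{t+1} = (1/N) diag{ sum_n x x^T - A^{t+1} E[z] x^T }, x centered
    psi_new = np.mean(Xc * Xc, axis=0) - np.mean((Ez @ A_new.T) * Xc, axis=0)
    return A_new, psi_new

def fit_fa(X, L, iters=100, seed=0):
    """Plain EM for factor analysis."""
    rng = np.random.default_rng(seed)
    mu = X.mean(axis=0)
    A, psi = rng.normal(size=(X.shape[1], L)), np.ones(X.shape[1])
    for _ in range(iters):
        Ez, Ezz_sum = fa_e_step(X, mu, A, psi)   # E-step (slide 16)
        A, psi = fa_m_step(X, mu, Ez, Ezz_sum)   # M-step (slide 17)
    return mu, A, psi
```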

  18. Unidentifiability. $A$ only appears through the outer product $AA^T$, thus the model is invariant to rotations and axis flips of the latent space: $A$ can be replaced with $AR$ for any orthogonal matrix $R$, and the model, which contains only $AA^T = ARR^TA^T$, remains the same. Thus FA is an unidentifiable model: the likelihood objective on a data set does not have a unique maximum (an infinite family of parameters attains the maximum score), so learning is not guaranteed to recover the same parameters across runs.
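A tiny demonstration of this invariance (mine; the orthogonal $R$ comes from a QR factorization of a random matrix):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(5, 3))
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # a random orthogonal matrix
# A and A R induce exactly the same marginal covariance A A^T + Psi.
assert np.allclose(A @ A.T, (A @ R) @ (A @ R).T)
```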

  19. Probabilistic PCA & FA. In both, data is a linear function of low-dimensional latent coordinates, plus Gaussian noise: $P(z) = \mathcal{N}(z \mid 0, I)$, $P(x \mid z) = \mathcal{N}(x \mid Az + \mu, \Psi)$, $P(x) = \mathcal{N}(x \mid \mu, AA^T + \Psi)$. Factor analysis: $\Psi$ is a general diagonal matrix. Probabilistic PCA: $\Psi = \sigma^2 I$ (isotropic noise). [Figure from Bishop: the generative view of the latent-variable model.]
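Unlike FA, PPCA also has a closed-form maximum-likelihood solution (Tipping & Bishop), not covered on the slides: with the eigendecomposition of the sample covariance, $\sigma^2_{ML}$ is the mean of the $D - L$ discarded eigenvalues and $A_{ML} = U_L(\Lambda_L - \sigma^2 I)^{1/2}$, up to a latent rotation. A sketch (function and variable names are mine):

```python
import numpy as np

def fit_ppca(X, L):
    """Closed-form ML fit of PPCA (Tipping & Bishop), up to a latent rotation."""
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False)                  # sample covariance
    vals, vecs = np.linalg.eigh(S)               # eigenvalues in ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]       # re-sort to descending
    sigma2 = vals[L:].mean()                     # mean of the D - L discarded eigenvalues
    A = vecs[:, :L] * np.sqrt(vals[:L] - sigma2) # A = U_L (Lambda_L - sigma^2 I)^{1/2}
    return mu, A, sigma2
```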

  20. PPCA. The posterior mean is not an orthogonal projection, since it is shrunk somewhat towards the prior mean. [Murphy]

  21. Exact inference for Gaussian networks

  22. Multivariate Gaussian distribution. $P(X) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\{-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)\}$, or in information form $P(X) \propto \exp\{-\frac{1}{2}x^T J x + (J\mu)^T x\}$ with $J = \Sigma^{-1}$. $P$ is normalizable (i.e., the normalization constant is finite) and defines a legal Gaussian distribution if and only if $J$ is positive definite. As a directed model this gives the linear-Gaussian model; as an undirected model, the Gaussian MRF.

  23. Linear-Gaussian model. Linear-Gaussian model for CPDs: $P(X_j \mid Pa(X_j)) = \mathcal{N}\left(X_j \,\middle|\, \sum_{X_k \in Pa(X_j)} w_{jk}X_k + b_j,\; v_j\right)$. The joint distribution is Gaussian: $\ln P(X_1, \dots, X_D) = -\sum_{j=1}^{D}\frac{1}{2v_j}\left(X_j - \sum_{X_k \in Pa(X_j)} w_{jk}X_k - b_j\right)^2 + C$.
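To see how the linear-Gaussian CPDs assemble into one joint Gaussian, here is a sketch of my own construction: collect the weights $w_{jk}$ into a strictly lower-triangular $W$ (nodes in topological order), so $x = Wx + b + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \mathrm{diag}(v))$, giving $\mu = (I - W)^{-1}b$ and $\Sigma = (I - W)^{-1}\mathrm{diag}(v)(I - W)^{-T}$:

```python
import numpy as np

def joint_gaussian(W, b, v):
    """Joint (mu, Sigma) of a linear-Gaussian network x = W x + b + eps."""
    D = len(b)
    M = np.linalg.inv(np.eye(D) - W)
    mu = M @ b
    Sigma = M @ np.diag(v) @ M.T
    return mu, Sigma

# Example: X1 -> X2 -> X3 with X2 = 2*X1 + 1 and X3 = -X2 + 0.5, plus noise.
W = np.array([[0.0,  0.0, 0.0],
              [2.0,  0.0, 0.0],
              [0.0, -1.0, 0.0]])
b = np.array([0.0, 1.0, 0.5])
v = np.array([1.0, 0.5, 0.2])     # CPD variances v_j
mu, Sigma = joint_gaussian(W, b, v)
```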
