Part III: Latent Tree Models


  1. Spectral Algorithms for Latent Variable Models
  Part III: Latent Tree Models
  Le Song
  ICML 2012 Tutorial on Spectral Algorithms for Latent Variable Models, Edinburgh, UK
  Joint work with Mariya Ishteva, Ankur Parikh, Eric Xing, Byron Boots, Geoff Gordon, Alex Smola and Kenji Fukumizu

  2. Latent Tree Graphical Models
  Graphical model: nodes represent variables, edges represent conditional independence relations.
  Latent tree graphical models: latent and observed variables are arranged in a tree structure.
  [Figures: a latent tree with latent variables X7-X10 above observed variables X1-X6, and a hidden Markov model with hidden chain X7-X12 emitting observations X1-X6.]
  Many real-world applications, e.g., time-series prediction and topic modeling.

  3. Scope of This Tutorial
  Estimating the marginal probability of the observed variables:
  • Spectral HMMs (Hsu et al., COLT'09)
  • Kernel spectral HMMs (Song et al., ICML'10)
  • Spectral latent trees (Parikh et al., ICML'11; Song et al., NIPS'11)
  • Spectral dimensionality reduction for HMMs (Foster et al., arXiv)
  • More recent: Cohen et al., ACL'12; Balle et al., ICML'12
  Estimating latent parameters:
  • PCA approach (Mossel & Roch, AOAP'06)
  • PCA and SVD approaches (Anandkumar et al., COLT'12 and arXiv)
  Estimating the structure of latent variable models:
  • Recursive grouping (Choi et al., JMLR'11)
  • Spectral short quartet (Anandkumar et al., NIPS'11)

  4. Challenge of Estimating the Marginal of the Observed Variables
  Exponential number of entries in P(X1, X2, ..., X6): with discrete variables taking n possible values each, P has O(n^6) entries!
  A latent tree reduces the number of parameters:
  P(X1, X2, ..., X6) = Σ_{x7,x8,x9,x10} P(x10) P(x7|x10) P(X1|x7) P(X2|x7) P(x8|x10) P(X3|x8) P(X4|x8) P(x9|x10) P(X5|x9) P(X6|x9)
  The root marginal P(X10) has O(n) parameters and each of the 9 edge conditionals has O(n^2), so the latent tree needs only O(9n^2) parameters in total. Significant saving!
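As a quick check of the arithmetic, a minimal Python sketch (the choice n = 10 is arbitrary, not from the deck):

```python
# Size of the full joint table over X1..X6 vs. the latent tree's parameter
# count; n = 10 states per variable is an arbitrary illustrative choice.
n = 10
full_joint = n ** 6                 # entries in P(X1,...,X6)
latent_tree = n + 9 * n ** 2        # root marginal + 9 edge conditionals
print(full_joint, latent_tree)      # 1000000 vs. 910
```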

  5. EM Algorithm for Parameter Estimation
  We do not observe the latent variables, yet we must estimate the parameters that involve them, e.g., P(X7|X10) and P(X1|X7).
  Goal of the spectral algorithm: estimate the marginal in a local-minimum-free fashion.
  [Figure: a data table of m samples over the observed variables only, (x1^i, ..., x6^i) for i = 1, ..., m.]
  Expectation maximization: maximize the likelihood of the observations, max Π_{i=1}^m P(x1^i, ..., x6^i).
  Drawbacks: local maxima, slow to converge, difficult to analyze.

  6. Key Features of Spectral Algorithms
  Represent the joint probability table of the observed variables with a low-rank factorization, without ever using the full joint table in the computation! E.g., P_{{1,...,t};{t+1,...,2t}} = Reshape(P(X1, ..., X2t), {1,...,t}) is an n^t × n^t matrix.
  • Represent it by low-rank factors to avoid the exponential blowup.
  • Use a clever decomposition technique to avoid directly using all the entries of the table.
  • Use the singular value decomposition.

  7. Tensor View of the Marginal Probability
  The marginal probability table 𝒫 = P(X1, X2, ..., X6), with each discrete variable taking n possible values in {1, ..., n}, is a 6-way table, i.e., a 6th-order tensor. Each dimension is labeled by one variable, and the value of the variable is the index into that dimension, so 6 indexes are needed to access a single entry: P(X1=1, X2=4, ..., X6=3) is the entry 𝒫[1, 4, ..., 3].
  Running examples: the latent tree (latents X7-X10 over observed X1-X6) and the hidden Markov model (hidden chain X7-X12 over observations X1-X6).
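A minimal numpy sketch of this tensor view; the random table below is only a stand-in for a real marginal:

```python
import numpy as np

# A 6-way table over X1,...,X6, each variable taking n values; the random
# table is a stand-in for a real marginal P(X1,...,X6).
rng = np.random.default_rng(0)
n = 4
P = rng.random((n,) * 6)
P /= P.sum()                        # normalize: entries sum to one

# P(X1=1, X2=4, ..., X6=3) is the entry P[1,4,...,3]; numpy is 0-based,
# and the middle values are chosen arbitrarily here.
print(P[0, 3, 0, 0, 0, 2])
print(P.ndim, P.size)               # 6 dimensions, n**6 = 4096 entries
```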

  8. Reshaping the Tensor into Matrices
  T = Reshape(𝒫, S): the variables in the index set S are mapped into the row index, and the remaining variables into the column index.
  E.g., for 𝒫 = P(X1, X2, X3), a 3rd-order tensor with n = 3, P_{{2};{1,3}} = Reshape(𝒫, {2}) turns the dimension of X2 into the rows.
  [Figure: the tensor 𝒫 is sliced along the X3 dimension into the slices X3=1, X3=2, X3=3, which are laid side by side to form the 3 × 9 matrix T.]
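A minimal numpy sketch of the Reshape operation; the function name reshape_marginal and the exact column ordering are implementation choices, not from the deck (any fixed ordering works as long as it is used consistently):

```python
import numpy as np

def reshape_marginal(P, rows):
    """Map the axes listed in `rows` to the row index and all remaining
    axes to the column index (C-order within each group)."""
    others = [d for d in range(P.ndim) if d not in rows]
    M = np.transpose(P, list(rows) + others)
    n_rows = int(np.prod([P.shape[d] for d in rows]))
    return M.reshape(n_rows, -1)

# E.g. a 3rd-order tensor P(X1, X2, X3) with n = 3:
rng = np.random.default_rng(0)
P = rng.random((3, 3, 3))
P /= P.sum()
P_2_13 = reshape_marginal(P, [1])   # X2 (axis 1, 0-based) becomes the rows
print(P_2_13.shape)                 # (3, 9): the 3 x 9 matrix P_{{2};{1,3}}
```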

  9. Reshaping a 6th-Order Tensor
  T = P_{{1,2,3};{4,5,6}} = Reshape(P(X1, ..., X6), {1,2,3})
  [Figure: for n = 3, the 27 × 27 matrix whose rows are indexed by the assignments of (X1, X2, X3) and whose columns are indexed by the assignments of (X4, X5, X6).]
  Each entry is the probability of one unique assignment to X1, ..., X6, e.g., P(2, 3, 1, 2, 1, 2).
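The index bookkeeping can be made concrete in numpy. One convention difference, flagged in the comments: the slide's figure varies the first variable of each group fastest, while numpy's C-order reshape varies the last one fastest; either convention is fine if used consistently:

```python
import numpy as np

# After reshaping, every matrix entry is still the probability of one full
# assignment to X1,...,X6. Here numpy's C-order reshape makes the last
# variable of each group vary fastest.
rng = np.random.default_rng(0)
n = 3
P = rng.random((n,) * 6)
P /= P.sum()
M = P.reshape(n**3, n**3)

x = (1, 2, 0, 1, 0, 1)              # the assignment (2,3,1,2,1,2) in 1-based form
row = (x[0] * n + x[1]) * n + x[2]  # encode (X1, X2, X3)
col = (x[3] * n + x[4]) * n + x[5]  # encode (X4, X5, X6)
print(M[row, col] == P[x])          # True: same entry, two addressing schemes
```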

  10. Reshaping According to the Latent Tree Structure
  For the marginal 𝒫 = P(X1, X2, ..., X6) of a latent tree model, reshape it according to the edges in the tree (each index set below groups the observed variables on one side of an edge; see the sketch after this slide):
  P_{{1};{2,3,4,5,6}} = Reshape(𝒫, {1})
  P_{{1,2};{3,4,5,6}} = Reshape(𝒫, {1,2})
  P_{{1,2,3,4};{5,6}} = Reshape(𝒫, {1,2,3,4})
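Because each row set here is a prefix of (X1, ..., X6), a plain C-order reshape suffices; a minimal sketch with a random stand-in table:

```python
import numpy as np

# The three reshapings named on this slide, applied to a random stand-in
# for the latent tree marginal P(X1,...,X6).
rng = np.random.default_rng(0)
n = 3
P = rng.random((n,) * 6)
P /= P.sum()

P1_23456 = P.reshape(n, n**5)       # Reshape(P, {1}):       n   x n^5
P12_3456 = P.reshape(n**2, n**4)    # Reshape(P, {1,2}):     n^2 x n^4
P1234_56 = P.reshape(n**4, n**2)    # Reshape(P, {1,2,3,4}): n^4 x n^2
print(P1_23456.shape, P12_3456.shape, P1234_56.shape)
```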

  11. Low-Rank Structure after Reshaping
  The size of P_{{1,2};{3,4,5,6}} is n^2 × n^4, but its rank is only n. Conditioned on the latent variables, the two groups of observations are independent:
  P(X1, X2, ..., X6) = Σ_{x7,x10} P(X1, X2 | x7) P(x7, x10) P(X3, X4, X5, X6 | x10)
  Matrix multiplication expresses the summation over X7 and X10:
  P_{{1,2};{3,4,5,6}} = P_{{1,2}|7} P_{{7};{10}} P_{{3,4,5,6}|{10}}^⊤
  where P_{{1,2}|7} := Reshape(P(X1, X2 | X7), {1,2}) is n^2 × n, P_{{7};{10}} = P(X7, X10) is n × n, and P_{{3,4,5,6}|{10}} := Reshape(P(X3, X4, X5, X6 | X10), {3,4,5,6}) is n^4 × n.
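A minimal numpy sketch that builds the running latent tree from random conditional probability tables (CPTs) and confirms the rank claim:

```python
import numpy as np

# Build P(X1,...,X6) for the running latent tree (root X10, internal nodes
# X7, X8, X9) from random CPTs, then check that the n^2 x n^4 reshaping
# P_{{1,2};{3,4,5,6}} has rank n rather than min(n^2, n^4).
rng = np.random.default_rng(0)
n = 3

def cpt(*shape):
    t = rng.random(shape)
    return t / t.sum(axis=0, keepdims=True)   # normalize over the child axis

p10 = cpt(n)                                   # P(X10)
p7, p8, p9 = cpt(n, n), cpt(n, n), cpt(n, n)   # P(X7|X10), P(X8|X10), P(X9|X10)
leaf = [cpt(n, n) for _ in range(6)]           # P(X1|X7), ..., P(X6|X9)
P = np.einsum('j,aj,bj,cj,pa,qa,rb,sb,tc,uc->pqrstu',
              p10, p7, p8, p9, *leaf)          # sum out x7, x8, x9, x10

M = P.reshape(n**2, n**4)                      # rows (X1,X2), cols (X3,...,X6)
print(M.shape, np.linalg.matrix_rank(M))       # (9, 81), rank 3 = n
```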

  12. Low-Rank Structure of the Latent Tree Model
  The same factorization applies to every reshaping along an edge of the tree:
  P_{{3,4};{1,2,5,6}} = P_{{3,4}|8} P_{{8};{10}} P_{{1,2,5,6}|{10}}^⊤ (n^2 × n^4, factored through rank n)
  P_{{1};{2,3,4,5,6}} = P_{{1}|7} P_{{7};{7}} P_{{2,3,4,5,6}|{7}}^⊤ (n × n^5; here P_{{7};{7}} = diag(P(X7)))
  All these reshapings are low rank, with rank n.

  13. Low-Rank Structure of Hidden Markov Models
  [Figure: HMM with hidden chain X7 → X8 → ... → X12 and observations X1, ..., X6.]
  The hidden chain separates past observations from future ones, so the same factorizations hold:
  P_{{1,2};{3,4,5,6}} = P_{{1,2}|8} P_{{8};{9}} P_{{3,4,5,6}|{9}}^⊤ (n^2 × n^4)
  P_{{1,2,3};{4,5,6}} = P_{{1,2,3}|9} P_{{9};{10}} P_{{4,5,6}|{10}}^⊤ (n^3 × n^3)
  Both have rank n.
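The same check for the HMM, in a minimal sketch with random transition and emission matrices (the names pi, T, O are illustrative, not from the deck):

```python
import numpy as np

# HMM version: hidden chain X7 -> ... -> X12 with observations X1,...,X6.
rng = np.random.default_rng(0)
n = 3
pi = rng.dirichlet(np.ones(n))             # P(X7)
T = rng.dirichlet(np.ones(n), size=n).T    # T[j, i] = P(X_{t+1}=j | X_t=i)
O = rng.dirichlet(np.ones(n), size=n).T    # O[x, h] = P(observation x | state h)

# P(X1,...,X6): sum the product of chain and emission factors over the chain.
P = np.einsum('a,ba,cb,dc,ed,fe,pa,qb,rc,sd,te,uf->pqrstu',
              pi, T, T, T, T, T, O, O, O, O, O, O)

M = P.reshape(n**3, n**3)                  # rows (X1,X2,X3), cols (X4,X5,X6)
print(np.linalg.matrix_rank(M))            # n = 3
```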

  14. Key Features of Spectral Algorithms
  (Recap.) Represent the joint probability table of the observed variables with a low-rank factorization, without ever using the full joint table in the computation! E.g., P_{{1,...,t};{t+1,...,2t}} = Reshape(P(X1, ..., X2t), {1,...,t}) is an n^t × n^t matrix.
  • Represent it by low-rank factors to avoid the exponential blowup.
  • Use a clever decomposition technique to avoid directly using all the entries of the table.
  • Use the singular value decomposition.

  15. Key Theorem
  Theorem 1. Let P be of size m × n with rank k, let A be of size n × k with rank k, and let B be of size k × m with rank k. If BPA is invertible, then
  P = PA (BPA)^{-1} BP.
  Here P will be a reshaped joint probability table, and A and B will be marginalization operators. Theorem 1 will be applied recursively, and it recovers several existing spectral algorithms as special cases.
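Theorem 1 is easy to verify numerically; a minimal sketch with random matrices of the stated sizes (with probability one, BPA is invertible):

```python
import numpy as np

# Random instance of Theorem 1: P of size m x n with rank k, A of size n x k,
# B of size k x m. When BPA is invertible, P = PA (BPA)^{-1} BP exactly.
rng = np.random.default_rng(0)
m, n, k = 8, 6, 3
P = rng.random((m, k)) @ rng.random((k, n))   # a rank-k "joint table"
A = rng.random((n, k))
B = rng.random((k, m))

recon = P @ A @ np.linalg.inv(B @ P @ A) @ B @ P
print(np.allclose(recon, P))                  # True
```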

  16. Marginalization Operators A and B
  Computing the marginal probability of a subset of variables can be expressed as a matrix product:
  P(X1, X2, X3, X4) = Σ_{x5,x6} P(X1, X2, X3, X4, x5, x6)
  P_{{1,2,3};{4}} = P_{{1,2,3};{4,5,6}} A, where A = 1_n ⊗ 1_n ⊗ I_n is an n^3 × n matrix (1_n is the all-ones vector and I_n the n × n identity).
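A minimal numpy sketch of this operator. One caveat flagged in the comments: the slide orders the column indices with X4 varying fastest, while numpy's C-order reshape makes X6 vary fastest, so the identity block sits in a different position of the Kronecker product here:

```python
import numpy as np

# Marginalization as a matrix product: right-multiplying the reshaped table
# by a Kronecker product of all-ones vectors and one identity block sums out
# X5 and X6 while keeping X4. With numpy's C-order reshape X6 varies fastest,
# so the identity block goes first (the slide's ordering puts it last).
rng = np.random.default_rng(0)
n = 3
P = rng.random((n,) * 6)
P /= P.sum()                                    # stand-in for P(X1,...,X6)
M = P.reshape(n**3, n**3)                       # rows (X1,X2,X3), cols (X4,X5,X6)

ones = np.ones((n, 1))
A = np.kron(np.eye(n), np.kron(ones, ones))     # n^3 x n, keeps X4
direct = P.sum(axis=(4, 5)).reshape(n**3, n)    # marginalize X5, X6 directly
print(np.allclose(M @ A, direct))               # True: M A = P_{{1,2,3};{4}}
```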

  17. Zoom into the Marginalization Operation
  P_{{1,2,3};{4,5,6}} (1_3 ⊗ 1_3 ⊗ I_3) = P_{{1,2,3};{4}}
  [Figure: for n = 3, the 27 columns of P_{{1,2,3};{4,5,6}} are indexed by the assignments of (X4, X5, X6); right-multiplying by 1_3 ⊗ 1_3 ⊗ I_3 sums all columns that share the same value of X4.]

  18. Apply Theorem 1 to the Latent Tree Model
  Let
  P = P_{{1,2};{3,4,5,6}}
  A = 1_n ⊗ 1_n ⊗ 1_n ⊗ I_n (sum out X4, X5, X6; keep X3)
  B = I_n ⊗ 1_n^⊤ (sum out X1; keep X2)
  Then
  PA = P_{{1,2};{3}}
  BP = P_{{2};{3,4,5,6}}
  BPA = P_{{2};{3}}
  Finally, using P = PA (BPA)^{-1} BP:
  P_{{1,2};{3,4,5,6}} = P_{{1,2};{3}} (P_{{2};{3}})^{-1} P_{{2};{3,4,5,6}}
  Every factor on the right involves at most three observed variables, so the big table never has to be formed.
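A minimal numpy check of the final identity on a random latent tree (random CPTs as stand-ins; the equality is exact by Theorem 1):

```python
import numpy as np

# Build P(X1,...,X6) for the running latent tree from random CPTs, then
# reconstruct the n^2 x n^4 table P_{{1,2};{3,4,5,6}} from three small
# tables that involve at most three observed variables each.
rng = np.random.default_rng(0)
n = 3

def cpt(*shape):
    t = rng.random(shape)
    return t / t.sum(axis=0, keepdims=True)

p10, p7, p8, p9 = cpt(n), cpt(n, n), cpt(n, n), cpt(n, n)
leaf = [cpt(n, n) for _ in range(6)]
P = np.einsum('j,aj,bj,cj,pa,qa,rb,sb,tc,uc->pqrstu', p10, p7, p8, p9, *leaf)

P12_3456 = P.reshape(n**2, n**4)                # the big table (9 x 81)
P12_3 = P.sum(axis=(3, 4, 5)).reshape(n**2, n)  # P(X1, X2, X3)
P2_3 = P.sum(axis=(0, 3, 4, 5))                 # P(X2, X3), an n x n matrix
P2_3456 = P.sum(axis=0).reshape(n, n**4)        # P(X2, X3, X4, X5, X6)

recon = P12_3 @ np.linalg.inv(P2_3) @ P2_3456   # PA (BPA)^{-1} BP
print(np.allclose(recon, P12_3456))             # True
```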
