Spectral Algorithms for Latent Variable Models
Part III: Latent Tree Models

Le Song
ICML 2012 Tutorial on Spectral Algorithms for Latent Variable Models, Edinburgh, UK

Joint work with Mariya Ishteva, Ankur Parikh, Eric Xing, Byron Boots, Geoff Gordon, Alex Smola and Kenji Fukumizu
Latent Tree Graphical Models

- Graphical model: nodes represent variables, edges represent conditional independence relations.
- Latent tree graphical models: latent and observed variables are arranged in a tree structure.
- [Figure: two running examples -- a latent tree with latent variables $X_7, \dots, X_{10}$ over observed variables $X_1, \dots, X_6$, and a hidden Markov model with hidden chain $X_7, \dots, X_{12}$ emitting $X_1, \dots, X_6$.]
- Many real-world applications, e.g., time-series prediction, topic modeling.
Scope of This Tutorial

- Estimating marginal probabilities of the observed variables
  - Spectral HMMs (Hsu et al., COLT'09)
  - Kernel spectral HMMs (Song et al., ICML'10)
  - Spectral latent trees (Parikh et al., ICML'11; Song et al., NIPS'11)
  - Spectral dimensionality reduction for HMMs (Foster et al., arXiv)
  - More recent: Cohen et al., ACL'12; Balle et al., ICML'12
- Estimating latent parameters
  - PCA approach (Mossel & Roch, AOAP'06)
  - PCA and SVD approaches (Anandkumar et al., COLT'12, arXiv)
- Estimating the structure of latent variable models
  - Recursive grouping (Choi et al., JMLR'11)
  - Spectral short quartet (Anandkumar et al., NIPS'11)
Challenge of Estimating Marginals of Observed Variables

- Exponential number of entries in $P(X_1, X_2, \dots, X_6)$: with each discrete variable taking $k$ possible values, $P$ has $O(k^6)$ entries!
- A latent tree reduces the number of parameters:
  $P(X_1, \dots, X_6) = \sum_{x_7, x_8, x_9, x_{10}} P(x_{10})\, P(x_7|x_{10})\, P(X_1|x_7)\, P(X_2|x_7)\, P(x_8|x_{10})\, P(X_3|x_8)\, P(X_4|x_8)\, P(x_9|x_{10})\, P(X_5|x_9)\, P(X_6|x_9)$
- The full table has $O(k^6)$ parameters, while the latent tree has only $O(9k^2)$: one $k \times k$ conditional table per edge, plus $k$ values for the root marginal. A significant saving! (See the sketch below.)
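A minimal numerical sketch of this factorization (numpy; the helper `random_cpt` and all variable names are ours, not from the tutorial):

```python
# Build P(X1,...,X6) for the k-state latent tree above from its O(9k^2)
# conditional probability tables, never storing parameters for all k^6 entries.
import numpy as np

k = 3
rng = np.random.default_rng(0)

def random_cpt(rows, cols):
    """Random conditional probability table; each column sums to 1."""
    T = rng.random((rows, cols))
    return T / T.sum(axis=0)

p10  = random_cpt(k, 1).ravel()                 # P(X10), the root marginal
cond = {e: random_cpt(k, k) for e in            # P(child | parent), 9 edges
        ["7|10", "8|10", "9|10", "1|7", "2|7", "3|8", "4|8", "5|9", "6|9"]}

# Sum out the latent variables (einsum letters: a,b,c,d = x7,x8,x9,x10
# and i,j,k,l,m,n = values of X1,...,X6).
joint = np.einsum("d,ad,bd,cd,ia,ja,kb,lb,mc,nc->ijklmn",
                  p10, cond["7|10"], cond["8|10"], cond["9|10"],
                  cond["1|7"], cond["2|7"], cond["3|8"],
                  cond["4|8"], cond["5|9"], cond["6|9"])

print(joint.shape, round(joint.sum(), 6))       # (3, 3, 3, 3, 3, 3) 1.0
print("full table entries:", k**6, "vs latent tree params:", 9 * k**2 + k)
```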
EM Algorithm for Parameter Estimation

- We do not observe the latent variables, so we need to estimate the corresponding parameters, e.g., $P(X_7|X_{10})$ and $P(X_1|X_7)$.
- Goal of spectral algorithms: estimate the marginal in a local-minimum-free fashion.
- [Table: $m$ training samples; the observed values $x_1^i, \dots, x_6^i$ are recorded for each sample $i = 1, \dots, m$, while the latent variables remain unobserved.]
- Expectation maximization: maximize the likelihood of the observations,
  $\max_\theta \prod_{i=1}^m P(x_1^i, \dots, x_6^i \mid \theta)$
- Drawbacks: local maxima, slow convergence, difficult to analyze.
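For concreteness, a minimal sketch of the objective EM climbs (the joint table and the samples below are random stand-ins for illustration only):

```python
# The observed-data log-likelihood sum_i log P(x1^i,...,x6^i; theta) that EM
# locally maximizes over theta.
import numpy as np

k = 3
rng = np.random.default_rng(1)
joint = rng.random((k,) * 6)
joint /= joint.sum()                             # stand-in for P(X1,...,X6; theta)

samples = rng.integers(0, k, size=(100, 6))      # 100 observed draws of (x1,...,x6)
log_lik = sum(np.log(joint[tuple(s)]) for s in samples)
print(log_lik)  # EM increases this in theta, but can stall at a local maximum
```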
Key Features of Spectral Algorithms

- Represent the joint probability table of the observed variables with a low rank factorization, without ever using the joint table in the computation!
- E.g., reshape the joint of $2d$ variables into a $k^d \times k^d$ matrix:
  $P_{\{1,\dots,d\};\{d+1,\dots,2d\}} = \text{reshape}(P(X_1, \dots, X_{2d}), \{1,\dots,d\})$
- Represent it by low rank factors to avoid the exponential blowup.
- Use a clever decomposition technique to avoid directly using all entries of the table.
- Use singular value decomposition.
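A minimal sketch of the low-rank idea (the stand-in joint below has one hidden variable by construction, so the rank bound is built in; names are illustrative):

```python
# Reshape a joint over 2d variables into a k^d x k^d matrix and inspect its
# singular values: hidden-variable structure keeps the rank at most k.
import numpy as np

k, d = 3, 2
rng = np.random.default_rng(2)

# P = sum_h P(h) P(left block | h) P(right block | h), already in matrix form.
p_h   = rng.dirichlet(np.ones(k))
left  = rng.dirichlet(np.ones(k**d), size=k)    # row h: P(x_1..x_d | h)
right = rng.dirichlet(np.ones(k**d), size=k)    # row h: P(x_{d+1}..x_{2d} | h)
P = (left.T * p_h) @ right                      # the k^d x k^d reshaped joint

U, s, Vt = np.linalg.svd(P)
print("numerical rank:", int(np.sum(s > 1e-10)), "out of", k**d)   # 3 out of 9
```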
Tensor View of the Marginal Probability

- Marginal probability table $\mathcal{P} = P(X_1, X_2, \dots, X_6)$:
  - Each discrete variable takes $k$ possible values in $\{1, \dots, k\}$.
  - 6-way table, or 6th order tensor.
  - Each dimension is labeled by a variable; the value of the variable is the index into the corresponding dimension, so 6 indexes are needed to access a single entry.
  - $P(X_1=1, X_2=4, \dots, X_6=3)$ is the entry $\mathcal{P}[1, 4, \dots, 3]$.
- Running examples: the latent tree (latent $X_7, \dots, X_{10}$ over observed $X_1, \dots, X_6$) and the hidden Markov model (hidden $X_7, \dots, X_{12}$ emitting $X_1, \dots, X_6$).
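A minimal sketch of this indexing (stand-in table, 0-based numpy indices):

```python
# A 6th-order tensor whose axes are indexed by the values of X1,...,X6.
import numpy as np

k = 3
rng = np.random.default_rng(3)
P = rng.random((k,) * 6)
P /= P.sum()                       # stand-in for P(X1,...,X6)

# P(X1=1, X2=4, ..., X6=3) would be P[0, 3, ..., 2] with 0-based indexing;
# with k=3 here, e.g. P(X1=1, X2=2, X3=3, X4=1, X5=2, X6=3):
print(P[0, 1, 2, 0, 1, 2])
```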
Reshaping a Tensor into Matrices

- $A = \text{reshape}(\mathcal{P}, I)$: the multi-index $I$ is mapped into the row index, and the remaining indexes into the column index.
- E.g., for $\mathcal{P} = P(X_1, X_2, X_3)$, a 3rd order tensor, and $I = \{2\}$:
  $P_{\{2\};\{1,3\}} = \text{reshape}(\mathcal{P}, \{2\})$ turns the dimension of $X_2$ into the rows.
- [Figure: the tensor $\mathcal{P}$ sliced at $X_3 = 1, 2, 3$; the slices laid side by side form the $k \times k^2$ matrix $A$.]
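A minimal sketch of this reshaping in numpy (the helper name `reshape_tensor` is ours, not from the tutorial):

```python
# reshape(P, I): move the dimensions in I to the front, flatten them into the
# row index, and flatten the remaining dimensions into the column index.
import numpy as np

def reshape_tensor(P, I):
    rest = [ax for ax in range(P.ndim) if ax not in I]
    Pt = np.transpose(P, list(I) + rest)
    rows = int(np.prod([P.shape[ax] for ax in I]))
    return Pt.reshape(rows, -1)

k = 3
rng = np.random.default_rng(4)
P = rng.random((k, k, k))
P /= P.sum()                                   # stand-in for P(X1, X2, X3)

A = reshape_tensor(P, [1])                     # P_{{2};{1,3}}: k x k^2
print(A.shape)                                 # (3, 9)
# Row x2=2 collects exactly the entries with X2=2:
print(np.allclose(A[1, :].sum(), P[:, 1, :].sum()))   # True
```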
Reshaping a 6th Order Tensor

- $A = P_{\{1,2,3\};\{4,5,6\}} = \text{reshape}(P(X_1, \dots, X_6), \{1,2,3\})$
- [Figure: the $k^3 \times k^3$ matrix with rows indexed by $(X_1, X_2, X_3)$ and columns indexed by $(X_4, X_5, X_6)$; in the column multi-index, $X_6$ varies slowest and $X_4$ fastest.]
- Each entry is the probability of a unique assignment to $X_1, \dots, X_6$, e.g., $P(X_1=2, X_2=3, X_3=1, X_4=2, X_5=1, X_6=2)$.
Reshaping According to the Latent Tree Structure

- For the marginal $\mathcal{P} = P(X_1, X_2, \dots, X_6)$ of a latent tree model, reshape it according to the edges in the tree:
  - $P_{\{1\};\{2,3,4,5,6\}} = \text{reshape}(\mathcal{P}, \{1\})$ (cutting the edge $X_1$ - $X_7$)
  - $P_{\{1,2\};\{3,4,5,6\}} = \text{reshape}(\mathcal{P}, \{1,2\})$ (cutting the edge $X_7$ - $X_{10}$)
  - $P_{\{1,2,3,4\};\{5,6\}} = \text{reshape}(\mathcal{P}, \{1,2,3,4\})$ (cutting the edge $X_9$ - $X_{10}$)
Low Rank Structure after Reshaping

- The size of $P_{\{1,2\};\{3,4,5,6\}}$ is $k^2 \times k^4$, but its rank is just $k$:
  $P(X_1, \dots, X_6) = \sum_{x_7, x_{10}} P(X_1, X_2 | x_7)\, P(x_7, x_{10})\, P(X_3, X_4, X_5, X_6 | x_{10})$
- Use matrix multiplications to express the summation over $X_7, X_{10}$:
  $P_{\{1,2\};\{3,4,5,6\}} = P_{\{1,2\}|7}\, P_{\{7\};\{10\}}\, P_{\{3,4,5,6\}|\{10\}}^\top$
  where
  $P_{\{1,2\}|7} := \text{reshape}(P(X_1, X_2 | X_7), \{1,2\})$, of size $k^2 \times k$,
  $P_{\{3,4,5,6\}|\{10\}} := \text{reshape}(P(X_3, X_4, X_5, X_6 | X_{10}), \{3,4,5,6\})$, of size $k^4 \times k$.
- [Figure: the $k^2 \times k^4$ matrix factorizes as $(k^2 \times k)(k \times k)(k \times k^4)$.]
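A minimal numerical check (reusing the einsum construction of `joint` from the earlier latent tree sketch; all names are illustrative):

```python
# Build the k^2 x k^4 reshaping of the latent tree joint and confirm its rank
# is k, matching P_{{1,2};{3,4,5,6}} = P_{{1,2}|7} P_{{7};{10}} P_{{3,4,5,6}|{10}}^T.
import numpy as np

k = 3
rng = np.random.default_rng(0)

def random_cpt(rows, cols):
    T = rng.random((rows, cols))
    return T / T.sum(axis=0)

p10  = random_cpt(k, 1).ravel()
cond = {e: random_cpt(k, k) for e in
        ["7|10", "8|10", "9|10", "1|7", "2|7", "3|8", "4|8", "5|9", "6|9"]}
joint = np.einsum("d,ad,bd,cd,ia,ja,kb,lb,mc,nc->ijklmn",
                  p10, cond["7|10"], cond["8|10"], cond["9|10"],
                  cond["1|7"], cond["2|7"], cond["3|8"],
                  cond["4|8"], cond["5|9"], cond["6|9"])

M = joint.reshape(k**2, k**4)                  # P_{{1,2};{3,4,5,6}}
print(M.shape, np.linalg.matrix_rank(M))       # (9, 81) 3  -- rank k, not k^2
```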
Low Rank Structure of Latent Tree Models

- $P_{\{3,4\};\{1,2,5,6\}} = P_{\{3,4\}|8}\, P_{\{8\};\{10\}}\, P_{\{1,2,5,6\}|\{10\}}^\top$ (a $k^2 \times k^4$ matrix factored through rank $k$)
- $P_{\{1\};\{2,3,4,5,6\}} = P_{\{1\}|7}\, P_{\{7\};\{7\}}\, P_{\{2,3,4,5,6\}|\{7\}}^\top$ (a $k \times k^5$ matrix factored through rank $k$; here $P_{\{7\};\{7\}}$ is the diagonal matrix with $P(X_7)$ on its diagonal)
- All these reshapings are low rank, with rank $k$.
Low Rank Structure of Hidden Markov Models

- [Figure: HMM with hidden chain $X_7 \to X_8 \to \dots \to X_{12}$ emitting the observations $X_1, \dots, X_6$.]
- $P_{\{1,2\};\{3,4,5,6\}} = P_{\{1,2\}|8}\, P_{\{8\};\{9\}}\, P_{\{3,4,5,6\}|\{9\}}^\top$ ($k^2 \times k^4$, rank $k$)
- $P_{\{1,2,3\};\{4,5,6\}} = P_{\{1,2,3\}|9}\, P_{\{9\};\{10\}}\, P_{\{4,5,6\}|\{10\}}^\top$ ($k^3 \times k^3$, rank $k$)
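A minimal sketch for the HMM case (transition and emission tables are random stand-ins):

```python
# With hidden chain X7 -> ... -> X12 and emissions Xt | X_{t+6}, the reshaping
# P_{{1,2,3};{4,5,6}} factors through a single hidden edge, so its rank is at
# most k.
import numpy as np

k = 3
rng = np.random.default_rng(5)
pi = rng.dirichlet(np.ones(k))                  # P(X7)
T  = rng.dirichlet(np.ones(k), size=k).T        # T[h', h] = P(next = h' | h)
O  = rng.dirichlet(np.ones(k), size=k).T        # O[x, h]  = P(obs = x | h)

# Joint over X1..X6, summing out the hidden chain (letters a..f = X7..X12).
joint = np.einsum("a,ba,cb,dc,ed,fe,ia,jb,kc,ld,me,nf->ijklmn",
                  pi, T, T, T, T, T, O, O, O, O, O, O)

M = joint.reshape(k**3, k**3)                   # P_{{1,2,3};{4,5,6}}
print(np.linalg.matrix_rank(M))                 # 3 = k, not k^3
```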
Key Features of Spectral Algorithms (recap)

- Represent the joint probability table of the observed variables with a low rank factorization, without ever using the joint table in the computation!
  $P_{\{1,\dots,d\};\{d+1,\dots,2d\}} = \text{reshape}(P(X_1, \dots, X_{2d}), \{1,\dots,d\})$
- Represent it by low rank factors to avoid the exponential blowup.
- Use a clever decomposition technique to avoid directly using all entries of the table; use singular value decomposition.
Key Theorem

- Theorem 1: Let $F$ be of size $s \times t$ with rank $k$, $A$ of size $t \times k$ with rank $k$, and $B$ of size $k \times s$ with rank $k$. If $BFA$ is invertible, then
  $F = FA\, (BFA)^{-1}\, BF$.
- $F$ will be the reshaped joint probability table.
- $A$ and $B$ will be marginalization operators.
- Theorem 1 will be applied recursively.
- Recovers several existing spectral algorithms as special cases.
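A minimal numerical check of Theorem 1 on random matrices:

```python
# For a rank-k matrix F and any A, B making BFA invertible, F = FA (BFA)^{-1} BF.
import numpy as np

rng = np.random.default_rng(6)
s, t, k = 8, 10, 3
F = rng.random((s, k)) @ rng.random((k, t))    # rank-k F, size s x t
A = rng.random((t, k))                         # t x k
B = rng.random((k, s))                         # k x s

BFA = B @ F @ A                                # k x k, invertible generically
F_rec = F @ A @ np.linalg.inv(BFA) @ B @ F
print(np.allclose(F, F_rec))                   # True
```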
Marginalization Operators A and B

- Computing the marginal probability of a subset of variables can be expressed as a matrix product:
  $P(X_1, X_2, X_3, X_4) = \sum_{x_5, x_6} P(X_1, X_2, X_3, X_4, x_5, x_6)$
  $P_{\{1,2,3\};\{4\}} = P_{\{1,2,3\};\{4,5,6\}}\, A$, where $A = 1_k \otimes 1_k \otimes I_k$
- Here $1_k$ is the all-ones vector of length $k$; multiplying by $A$ sums out $x_5$ and $x_6$ while keeping $x_4$.
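A minimal sketch of this operator in numpy. Note the multi-index ordering matters: the slides' grid orders the columns with $X_6$ slowest and $X_4$ fastest, so we transpose the column axes to match before applying $A = 1_k \otimes 1_k \otimes I_k$:

```python
# Summing out x5, x6 from P_{{1,2,3};{4,5,6}} via the operator 1 (x) 1 (x) I.
import numpy as np

k = 3
rng = np.random.default_rng(7)
P = rng.random((k,) * 6)
P /= P.sum()                                    # stand-in joint P(X1,...,X6)

# Column multi-index ordered as in the slides: X6 slowest, X4 fastest.
M = P.transpose(0, 1, 2, 5, 4, 3).reshape(k**3, k**3)   # P_{{1,2,3};{4,5,6}}

ones = np.ones((k, 1))
A = np.kron(ones, np.kron(ones, np.eye(k)))     # 1_k (x) 1_k (x) I_k, k^3 x k

marg = M @ A                                    # sums out x5, x6; keeps x4
print(np.allclose(marg, P.sum(axis=(4, 5)).reshape(k**3, k)))   # True
```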
Zoom into the Marginalization Operation

- [Figure: multiplying the $k^3 \times k^3$ matrix $P_{\{1,2,3\};\{4,5,6\}}$ by $A = 1_3 \otimes 1_3 \otimes I_3$ collapses the columns over $(x_5, x_6)$, leaving the $k^3 \times k$ matrix $P_{\{1,2,3\};\{4\}}$.]
Apply Theorem 1 to the Latent Tree Model

- Let
  $F = P_{\{1,2\};\{3,4,5,6\}}$
  $A = 1_k \otimes 1_k \otimes 1_k \otimes I_k$
  $B = (I_k \otimes 1_k)^\top$
- Then
  $FA = P_{\{1,2\};\{3,4,5,6\}}\, A = P_{\{1,2\};\{3\}}$
  $BF = B\, P_{\{1,2\};\{3,4,5,6\}} = P_{\{2\};\{3,4,5,6\}}$
  $BFA = B\, P_{\{1,2\};\{3,4,5,6\}}\, A = P_{\{2\};\{3\}}$
- Finally, using $F = FA\, (BFA)^{-1}\, BF$:
  $P_{\{1,2\};\{3,4,5,6\}} = P_{\{1,2\};\{3\}}\, (P_{\{2\};\{3\}})^{-1}\, P_{\{2\};\{3,4,5,6\}}$
  (verified numerically in the sketch below)
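A minimal end-to-end check, reusing the einsum construction of `joint` from the earlier latent tree sketch. The $A$ and $B$ operators are realized directly as tensor sums, which sidesteps Kronecker-ordering conventions:

```python
# The big k^2 x k^4 table P_{{1,2};{3,4,5,6}} is recovered from the small
# observable marginals P_{{1,2};{3}}, P_{{2};{3}}, and P_{{2};{3,4,5,6}}.
import numpy as np

k = 3
rng = np.random.default_rng(0)

def random_cpt(rows, cols):
    T = rng.random((rows, cols))
    return T / T.sum(axis=0)

p10  = random_cpt(k, 1).ravel()
cond = {e: random_cpt(k, k) for e in
        ["7|10", "8|10", "9|10", "1|7", "2|7", "3|8", "4|8", "5|9", "6|9"]}
joint = np.einsum("d,ad,bd,cd,ia,ja,kb,lb,mc,nc->ijklmn",
                  p10, cond["7|10"], cond["8|10"], cond["9|10"],
                  cond["1|7"], cond["2|7"], cond["3|8"],
                  cond["4|8"], cond["5|9"], cond["6|9"])

F       = joint.reshape(k**2, k**4)                   # P_{{1,2};{3,4,5,6}}
P12_3   = joint.sum(axis=(3, 4, 5)).reshape(k**2, k)  # FA  = P_{{1,2};{3}}
P2_3456 = joint.sum(axis=0).reshape(k, k**4)          # BF  = P_{{2};{3,4,5,6}}
P2_3    = joint.sum(axis=(0, 3, 4, 5))                # BFA = P_{{2};{3}}, k x k

F_rec = P12_3 @ np.linalg.inv(P2_3) @ P2_3456
print(np.allclose(F, F_rec))                          # True
```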