

  1. Learning Overcomplete Latent Variable Models through Tensor Methods
     Anima Anandkumar (UC Irvine). Joint work with Majid Janzamin (UC Irvine) and Rong Ge (Microsoft Research).

  2-6. Latent Variable Probabilistic Models
     Latent (hidden) variable h ∈ R^k, observed variable x ∈ R^d.
     • Multiview linear mixture models: categorical hidden variable h; views x1, x2, x3 are conditionally independent given h (graphical model: h with children x1, x2, x3); linear model E[x1 | h] = a_h, E[x2 | h] = b_h, E[x3 | h] = c_h.
     • Gaussian mixture: categorical hidden variable h; x | h ∼ N(µ_h, Σ_h).
     • Other examples: ICA, sparse coding, HMMs, topic modeling, ...
     Question: efficient learning of the parameters a_h, µ_h, ...?
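To make the generative model concrete, here is a minimal sketch (not part of the original slides) that samples from a multiview linear mixture model; the dimensions, the names `w`, `A`, `B`, `C` (whose columns play the roles of a_j, b_j, c_j), and the Gaussian noise level are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 10, 3, 5000                    # ambient dim, #components, #samples

# Hypothetical parameters: mixing weights and conditional-mean vectors.
w = rng.dirichlet(np.ones(k))            # w_j = Pr[h = j]
A = rng.standard_normal((d, k))          # column a_j = E[x1 | h = j]
B = rng.standard_normal((d, k))          # column b_j = E[x2 | h = j]
C = rng.standard_normal((d, k))          # column c_j = E[x3 | h = j]

# Draw the hidden label, then the three views independently given h.
h = rng.choice(k, size=n, p=w)
x1 = A[:, h].T + 0.1 * rng.standard_normal((n, d))
x2 = B[:, h].T + 0.1 * rng.standard_normal((n, d))
x3 = C[:, h].T + 0.1 * rng.standard_normal((n, d))
```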

  7-10. Method-of-Moments (Spectral Methods)
     Multivariate observed moments:
        M1 := E[x],   M2 := E[x ⊗ x],   M3 := E[x ⊗ x ⊗ x].
     • The matrix E[x ⊗ x] ∈ R^{d×d} is a second-order tensor: E[x ⊗ x]_{i1,i2} = E[x_{i1} x_{i2}]; for matrices, E[x ⊗ x] = E[x x^⊤].
     • The tensor E[x ⊗ x ⊗ x] ∈ R^{d×d×d} is a third-order tensor: E[x ⊗ x ⊗ x]_{i1,i2,i3} = E[x_{i1} x_{i2} x_{i3}].
     Question: how much information do these moments carry for learning LVMs?
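As an illustration of the plug-in moment estimates (assuming i.i.d. samples stacked in an `(n, d)` array; the function name is hypothetical):

```python
import numpy as np

def empirical_moments(x):
    """Plug-in estimates of M1 = E[x], M2 = E[x (x) x], M3 = E[x (x) x (x) x]
    from samples x of shape (n, d)."""
    n = x.shape[0]
    M1 = x.mean(axis=0)                                  # d-vector
    M2 = np.einsum('ni,nj->ij', x, x) / n                # d x d matrix
    M3 = np.einsum('ni,nj,nl->ijl', x, x, x) / n         # d x d x d tensor
    return M1, M2, M3
```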

  11-15. Multiview Mixture Model
     Notation: [k] := {1, ..., k}.
     • Categorical hidden variable h ∈ [k] with mixing weights w_j := Pr[h = j].
     • Views x1, x2, x3 are conditionally independent given h, with E[x1 | h] = a_h, E[x2 | h] = b_h, E[x3 | h] = c_h.
     • Second-order cross moment (condition on h, then use conditional independence of the views):
          E[x1 ⊗ x2] = E_h[ E[x1 ⊗ x2 | h] ] = E_h[ a_h ⊗ b_h ] = Σ_{j∈[k]} w_j a_j ⊗ b_j.
     • Third-order cross moment:
          E[x1 ⊗ x2 ⊗ x3] = Σ_{j∈[k]} w_j a_j ⊗ b_j ⊗ c_j.
     Learning the LVM thus reduces to tensor (matrix) factorization of these moments.
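A small simulation sketch, reusing the hypothetical `w`, `A`, `B`, `C`, `x1`, `x2`, `x3` from the earlier sampling snippet, that compares the empirical cross moments with the population expressions on this slide (the agreement holds because the per-view noise was drawn independently with zero mean):

```python
import numpy as np

# Empirical cross moments of the three views.
n = x1.shape[0]
M2_hat = np.einsum('ni,nj->ij', x1, x2) / n
M3_hat = np.einsum('ni,nj,nl->ijl', x1, x2, x3) / n

# Population expressions from the slide.
M2_pop = np.einsum('j,ij,kj->ik', w, A, B)            # sum_j w_j a_j (x) b_j
M3_pop = np.einsum('j,ij,kj,lj->ikl', w, A, B, C)     # sum_j w_j a_j (x) b_j (x) c_j

print(np.linalg.norm(M2_hat - M2_pop))                 # small for large n
print(np.linalg.norm(M3_hat - M3_pop))
```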

  16-19. Tensor Rank and Tensor Decomposition
     • Rank-1 tensor: T = w · a ⊗ b ⊗ c ⇔ T(i, j, l) = w · a(i) · b(j) · c(l).
     • CANDECOMP/PARAFAC (CP) decomposition:
          T = Σ_{j∈[k]} w_j a_j ⊗ b_j ⊗ c_j ∈ R^{d×d×d},   a_j, b_j, c_j ∈ S^{d-1}.
       (Figure: tensor T drawn as the sum w_1 · a_1 ⊗ b_1 ⊗ c_1 + w_2 · a_2 ⊗ b_2 ⊗ c_2 + ...)
     • k: tensor rank; d: ambient dimension. k ≤ d: undercomplete; k > d: overcomplete.
     This talk: guarantees for overcomplete tensor decomposition.
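A minimal sketch of building a CP tensor from unit-norm factors, here with k > d so the tensor is overcomplete (all names and dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 8, 12                              # k > d: an overcomplete example

# Random unit-norm factors a_j, b_j, c_j on the sphere and positive weights w_j.
A = rng.standard_normal((d, k)); A /= np.linalg.norm(A, axis=0)
B = rng.standard_normal((d, k)); B /= np.linalg.norm(B, axis=0)
C = rng.standard_normal((d, k)); C /= np.linalg.norm(C, axis=0)
w = rng.uniform(0.5, 1.5, size=k)

# T = sum_j w_j a_j (x) b_j (x) c_j, a rank-k CP tensor in R^{d x d x d}.
T = np.einsum('j,ij,kj,lj->ikl', w, A, B, C)
```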

  20-23. Challenges in Tensor Decomposition
     Symmetric tensor T ∈ R^{d×d×d}:  T = Σ_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i.
     Challenges:
     • A decomposition may not always exist for general tensors.
     • Finding the decomposition is NP-hard in general.
     Tractable case: orthogonal tensor decomposition (⟨v_i, v_j⟩ = 0 for i ≠ j).
     Algorithm: the tensor power method, v ↦ T(I, v, v) / ‖T(I, v, v)‖.
     • The {v_i} are the only robust fixed points.
     • All other eigenvectors are saddle points.
     For an orthogonal tensor, there are no spurious local optima!
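A sketch of one run of the tensor power iteration on a symmetric tensor (the function name is hypothetical; in practice one uses many random restarts plus deflation to recover all k components, which is omitted here):

```python
import numpy as np

def tensor_power_iteration(T, n_iter=100, seed=0):
    """Iterate v <- T(I, v, v) / ||T(I, v, v)|| on a symmetric T in R^{d x d x d};
    returns an (eigenvalue, eigenvector) pair of the tensor."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        Tv = np.einsum('ijk,j,k->i', T, v, v)    # multilinear map T(I, v, v)
        v = Tv / np.linalg.norm(Tv)
    lam = np.einsum('ijk,i,j,k->', T, v, v, v)   # eigenvalue T(v, v, v)
    return lam, v
```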

  24-26. Beyond Orthogonal Tensor Decomposition
     Limitation: not all tensors have an orthogonal decomposition (unlike matrices).
     Undercomplete tensors (k ≤ d) with full-rank components:
     • Non-orthogonal decomposition T1 = Σ_i w_i a_i ⊗ a_i ⊗ a_i.
     • A whitening matrix W maps the components a_i to orthogonal directions v_i; the multilinear transform T2 = T1(W, W, W) is then an orthogonally decomposable tensor (a code sketch follows this slide).
     • Whitening requires k ≤ d, so this reduction does not cover the overcomplete case.
     This talk: guarantees for overcomplete tensor decomposition.
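To make the whitening reduction concrete, a minimal sketch under the assumption that the second moment M2 = Σ_i w_i a_i a_i^⊤ has rank k ≤ d (the function name is hypothetical):

```python
import numpy as np

def whiten_and_transform(M2, T1, k):
    """Build a whitening matrix W from M2 so that W^T M2 W = I_k, then apply
    the multilinear transform T2 = T1(W, W, W); when M2 and T1 share the same
    full-rank components a_i, the components of T2 are orthogonal."""
    eigvals, eigvecs = np.linalg.eigh(M2)
    U, s = eigvecs[:, -k:], eigvals[-k:]              # top-k eigenpairs of M2
    W = U / np.sqrt(s)                                # d x k whitening matrix
    T2 = np.einsum('abc,ai,bj,cl->ijl', T1, W, W, W)  # k x k x k tensor T1(W, W, W)
    return W, T2
```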

  27. Outline
     1. Introduction
     2. Overcomplete tensor decomposition
     3. Sample Complexity Analysis
     4. Conclusion

  28. Our Setup
     So far: general tensor decomposition is NP-hard, and orthogonal tensor decomposition is too limiting. Are there tractable cases, and do they cover overcomplete tensors?
