Tensor Methods for Large-Scale Machine Learning. Anima Anandkumar, U.C. Irvine.
Learning with Big Data
Data vs. Information
Missing observations, gross corruptions, outliers.
High-dimensional regime: as data grows, so does the number of variables!
A data deluge, but an information desert!
Learning in the High-Dimensional Regime
Useful information: low-dimensional structures. Learning with big data: an ill-posed problem.
Learning is finding a needle in a haystack.
Learning with big data is also computationally challenging! Principled approaches for finding low-dimensional structures?
How to model information structures? Latent variable models incorporate hidden or latent variables. Information structures: relationships between latent variables and observed data.
Basic approach: mixtures/clusters, where the hidden variable is categorical.
Advanced: probabilistic models, where hidden variables have more general distributions and can model mixed-membership/hierarchical groups.
(Diagram: hidden variables h_1, h_2, h_3 over observations x_1, ..., x_5.)
Latent Variable Models (LVMs)
Document modeling. Observed: words. Hidden: topics.
Social network modeling. Observed: social interactions. Hidden: communities, relationships.
Recommendation systems. Observed: recommendations (e.g., reviews). Hidden: user and business attributes.
Unsupervised learning: learn the LVM without labeled examples.
LVM for Feature Engineering
Learn good features/representations for classification tasks, e.g., in computer vision and NLP.
Sparse coding/dictionary learning: sparse representations, low-dimensional hidden structures. A few dictionary elements can compose complicated shapes.
Challenges in Learning LVMs
Computational challenges: maximum likelihood is non-convex optimization and NP-hard in general.
In practice, local search approaches such as gradient descent, EM, and variational Bayes have no consistency guarantees: they can get stuck in bad local optima, have poor convergence rates, and are hard to parallelize.
Alternatives? Guaranteed and efficient learning?
Outline: 1. Introduction  2. Spectral Methods  3. Moment Tensors of Latent Variable Models  4. Experiments  5. Conclusion
Classical Spectral Methods: Matrix PCA
For centered samples {x_i}, find a projection P with rank(P) = k minimizing (1/n) Σ_{i ∈ [n]} ‖x_i − P x_i‖².
Result: eigen-decomposition of Cov(X).
Beyond PCA: spectral methods on tensors?
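For concreteness, here is a minimal numpy sketch (not from the slides) of rank-k PCA via an eigen-decomposition of the sample covariance; the data array X and the helper name pca_projection are illustrative assumptions.

```python
import numpy as np

def pca_projection(X, k):
    """Rank-k PCA via eigen-decomposition of the sample covariance.

    X: (n, d) array of centered samples; k: target rank.
    Returns the projection P = U_k U_k^T minimizing
    (1/n) * sum_i ||x_i - P x_i||^2 over rank-k projections.
    """
    n, d = X.shape
    cov = X.T @ X / n                        # sample covariance Cov(X)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    U_k = eigvecs[:, -k:]                    # top-k eigenvectors
    return U_k @ U_k.T

# usage sketch on synthetic data
X = np.random.randn(1000, 20)
X -= X.mean(axis=0)                          # center the samples
P = pca_projection(X, k=5)
reconstruction_error = np.mean(np.sum((X - X @ P) ** 2, axis=1))
```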
Moment Matrices and Tensors
Multivariate moments: M_1 := E[x], M_2 := E[x ⊗ x], M_3 := E[x ⊗ x ⊗ x].
The matrix E[x ⊗ x] ∈ R^{d×d} is a second-order tensor with E[x ⊗ x]_{i_1, i_2} = E[x_{i_1} x_{i_2}]; for matrices, E[x ⊗ x] = E[x x^T].
The tensor E[x ⊗ x ⊗ x] ∈ R^{d×d×d} is a third-order tensor with E[x ⊗ x ⊗ x]_{i_1, i_2, i_3} = E[x_{i_1} x_{i_2} x_{i_3}].
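A minimal numpy sketch of the corresponding empirical moment estimates, assuming the samples are rows of an array X of shape (n, d); forming M_3 densely costs O(d³) memory, so this is only illustrative for small d.

```python
import numpy as np

def empirical_moments(X):
    """Empirical estimates of M1 = E[x], M2 = E[x ⊗ x], M3 = E[x ⊗ x ⊗ x]
    from samples stored as rows of X (shape (n, d))."""
    n, d = X.shape
    M1 = X.mean(axis=0)                               # E[x]
    M2 = np.einsum('ni,nj->ij', X, X) / n             # E[x ⊗ x] = E[x x^T]
    M3 = np.einsum('ni,nj,nk->ijk', X, X, X) / n      # E[x ⊗ x ⊗ x]
    return M1, M2, M3
```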
Spectral Decomposition of Tensors
Matrix: M_2 = Σ_i λ_i u_i ⊗ v_i = λ_1 u_1 ⊗ v_1 + λ_2 u_2 ⊗ v_2 + ...
Tensor: M_3 = Σ_i λ_i u_i ⊗ v_i ⊗ w_i = λ_1 u_1 ⊗ v_1 ⊗ w_1 + λ_2 u_2 ⊗ v_2 ⊗ w_2 + ...
u ⊗ v ⊗ w is a rank-1 tensor since its (i_1, i_2, i_3)-th entry is u_{i_1} v_{i_2} w_{i_3}.
How to solve this non-convex problem?
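As a small illustration of the rank-1 structure, the sketch below builds u ⊗ v ⊗ w entrywise with numpy; the random factors are placeholders, not values from the slides.

```python
import numpy as np

# Rank-1 tensor u ⊗ v ⊗ w: entry (i1, i2, i3) equals u[i1] * v[i2] * w[i3].
u, v, w = np.random.randn(4), np.random.randn(4), np.random.randn(4)
rank1 = np.einsum('i,j,k->ijk', u, v, w)

# A rank-2 tensor as on the slide: M3 = λ1 u1⊗v1⊗w1 + λ2 u2⊗v2⊗w2,
# built here from random factor matrices purely for illustration.
lam = np.array([2.0, 1.0])
U, V, W = (np.random.randn(4, 2) for _ in range(3))
M3 = np.einsum('r,ir,jr,kr->ijk', lam, U, V, W)
```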
Orthogonal Tensor Power Method
Symmetric orthogonal tensor T ∈ R^{d×d×d}: T = Σ_{i ∈ [k]} λ_i v_i ⊗ v_i ⊗ v_i.
Recall the matrix power method: v ↦ M(I, v) / ‖M(I, v)‖.
Algorithm (tensor power method): v ↦ T(I, v, v) / ‖T(I, v, v)‖.
How do we avoid spurious solutions (not part of the decomposition)?
• The {v_i}'s are the only robust fixed points.
• All other eigenvectors are saddle points.
For an orthogonal tensor, there are no spurious local optima!
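A minimal numpy sketch of the tensor power update with random restarts and deflation; the function names, restart counts, and iteration counts are illustrative assumptions, and it presumes T is symmetric and (approximately) orthogonally decomposable.

```python
import numpy as np

def tensor_power_method(T, n_restarts=10, n_iters=100):
    """Recover one (eigenvalue, eigenvector) pair of a symmetric
    orthogonally decomposable tensor T via the power update
    v <- T(I, v, v) / ||T(I, v, v)||, with random restarts."""
    d = T.shape[0]
    best_lam, best_v = -np.inf, None
    for _ in range(n_restarts):
        v = np.random.randn(d)
        v /= np.linalg.norm(v)
        for _ in range(n_iters):
            Tv = np.einsum('ijk,j,k->i', T, v, v)     # T(I, v, v)
            v = Tv / np.linalg.norm(Tv)
        lam = np.einsum('ijk,i,j,k->', T, v, v, v)    # T(v, v, v)
        if lam > best_lam:
            best_lam, best_v = lam, v
    return best_lam, best_v

def orthogonal_decomposition(T, k):
    """Extract k components by power iteration plus deflation."""
    lams, vs = [], []
    for _ in range(k):
        lam, v = tensor_power_method(T)
        lams.append(lam)
        vs.append(v)
        T = T - lam * np.einsum('i,j,k->ijk', v, v, v)  # deflate found component
    return np.array(lams), np.stack(vs)
```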
Putting it together
Non-orthogonal tensor: M_3 = Σ_i w_i a_i ⊗ a_i ⊗ a_i, with M_2 = Σ_i w_i a_i ⊗ a_i.
A whitening matrix W maps the components a_1, a_2, a_3 to orthonormal directions v_1, v_2, v_3.
Multilinear transform: T = M_3(W, W, W) converts the tensor M_3 into an orthogonal tensor T.
Tensor decomposition: guaranteed non-convex optimization!
For what latent variable models can we obtain these M_2 and M_3 forms?
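A minimal sketch of the whitening step, assuming M_2 and M_3 have already been estimated, the rank k is known, and the top-k eigenvalues of M_2 are positive.

```python
import numpy as np

def whiten_and_transform(M2, M3, k):
    """Build a whitening matrix W with W^T M2 W = I_k from the top-k
    eigen-decomposition of M2, then apply the multilinear transform
    T = M3(W, W, W), which is orthogonally decomposable when the model holds."""
    eigvals, eigvecs = np.linalg.eigh(M2)      # ascending eigenvalues
    U = eigvecs[:, -k:]                        # top-k eigenvectors
    S = eigvals[-k:]                           # assumed positive
    W = U / np.sqrt(S)                         # d x k whitening matrix
    T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)   # k x k x k tensor
    return W, T
```

The (λ_i, v_i) pairs recovered from T by the power method can then be mapped back to estimates of the original components through the pseudo-inverse of W; the exact un-whitening constants depend on the moment normalization used.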
Outline: 1. Introduction  2. Spectral Methods  3. Moment Tensors of Latent Variable Models  4. Experiments  5. Conclusion
Topic Modeling
Moments for Single Topic Models
E[x_i | h] = A h, with w := E[h]. Goal: learn the topic-word matrix A and the vector w.
(Diagram: a single topic h generates the words x_1, ..., x_5 through the topic-word matrix A.)
Pairwise co-occurrence matrix: M_2 := E[x_1 ⊗ x_2] = E[E[x_1 ⊗ x_2 | h]] = Σ_{i=1}^k w_i a_i ⊗ a_i.
Triples tensor: M_3 := E[x_1 ⊗ x_2 ⊗ x_3] = E[E[x_1 ⊗ x_2 ⊗ x_3 | h]] = Σ_{i=1}^k w_i a_i ⊗ a_i ⊗ a_i.
Can be extended to learning LDA: multiple topics in a document.
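A minimal sketch of estimating these moments from data, under the assumption (not stated on the slide) that the corpus is given as docs, a list of documents, each a list of word indices into a vocabulary of size d; the moments average over distinct word positions, matching x_1, x_2, x_3 above.

```python
import numpy as np

def cooccurrence_moments(docs, d):
    """Empirical estimates of M2 = E[x1 ⊗ x2] and M3 = E[x1 ⊗ x2 ⊗ x3]
    for the single-topic model, where x1, x2, x3 are one-hot encodings of
    distinct word positions in the same document.  Pairs and triples are
    pooled across documents; per-document averaging is a common refinement."""
    M2 = np.zeros((d, d))
    M3 = np.zeros((d, d, d))
    n_pairs = n_triples = 0
    for doc in docs:
        L = len(doc)
        for a in range(L):
            for b in range(L):
                if b == a:
                    continue
                M2[doc[a], doc[b]] += 1
                n_pairs += 1
                for c in range(L):
                    if c in (a, b):
                        continue
                    M3[doc[a], doc[b], doc[c]] += 1
                    n_triples += 1
    return M2 / n_pairs, M3 / n_triples
```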
Tractable Learning for LVMs
(Diagrams: graphical models for GMMs, HMMs, ICA, and multiview and topic models, with hidden variables h_1, ..., h_k and observations x_1, ..., x_d.)
Overall Framework
(Pipeline: unlabeled data → probabilistic admixture models → tensor method → inference.)
Outline: 1. Introduction  2. Spectral Methods  3. Moment Tensors of Latent Variable Models  4. Experiments  5. Conclusion
Learning Communities through Tensor Methods
Datasets: Yelp (users, businesses, reviews; n ≈ 40k) and the DBLP coauthorship network (n ≈ 1 million; subsample of ≈ 100k).
Error (E) and recovery ratio (R):

Dataset          | k̂  | Method      | Running Time | E     | R
DBLP sub (k=250) | 500 | ours        | 10,157       | 0.139 | 89%
DBLP sub (k=250) | 500 | variational | 558,723      | 16.38 | 99%
DBLP (k=6000)    | 100 | ours        | 5,407        | 0.105 | 95%

Thanks to Prem Gopalan and David Mimno for providing the variational code.
Experimental Results on Yelp
Lowest-error business categories and largest-weight businesses:

Rank | Category       | Business                  | Stars | Review Counts
1    | Latin American | Salvadoreno Restaurant    | 4.0   | 36
2    | Gluten Free    | P.F. Chang's China Bistro | 3.5   | 55
3    | Hobby Shops    | Make Meaning              | 4.5   | 14
4    | Mass Media     | KJZZ 91.5 FM              | 4.0   | 13
5    | Yoga           | Sutra Midtown             | 4.5   | 31