Tensor Methods for Large-Scale Machine Learning. Anima Anandkumar, U.C. Irvine.
Learning with Big Data
Data vs. Information
Missing observations, gross corruptions, outliers.
High-dimensional regime: as data grows, so does the number of variables!
A data deluge, but an information desert!
Learning in the High-Dimensional Regime
Useful information: low-dimensional structures. Learning with big data: an ill-posed problem.
Learning is finding a needle in a haystack.
Learning with big data is also computationally challenging! Principled approaches for finding low-dimensional structures?
How to model information structures? Latent variable models incorporate hidden or latent variables. Information structures: relationships between latent variables and observed data.
Basic approach: mixtures/clusters, where the hidden variable is categorical.
Advanced: probabilistic models, where hidden variables have more general distributions and can model mixed-membership/hierarchical groups.
(Diagram: hidden variables h_1, h_2, h_3 over observations x_1, ..., x_5.)
Latent Variable Models (LVMs)
Document modeling. Observed: words. Hidden: topics.
Social network modeling. Observed: social interactions. Hidden: communities, relationships.
Recommendation systems. Observed: recommendations (e.g., reviews). Hidden: user and business attributes.
Unsupervised learning: learn the LVM without labeled examples.
LVM for Feature Engineering
Learn good features/representations for classification tasks, e.g., in computer vision and NLP.
Sparse coding/dictionary learning: sparse representations, low-dimensional hidden structures. A few dictionary elements can compose complicated shapes.
Challenges in Learning LVMs
Computational challenges: maximum likelihood is non-convex optimization and NP-hard in general.
In practice, local search approaches such as gradient descent, EM, and variational Bayes have no consistency guarantees: they can get stuck in bad local optima, have poor convergence rates, and are hard to parallelize.
Alternatives? Guaranteed and efficient learning?
Outline: 1. Introduction  2. Spectral Methods  3. Moment Tensors of Latent Variable Models  4. Experiments  5. Conclusion
Classical Spectral Methods: Matrix PCA
For centered samples {x_i}, find a projection P with rank(P) = k minimizing (1/n) Σ_{i ∈ [n]} ‖x_i − P x_i‖².
Result: eigen-decomposition of Cov(X).
Beyond PCA: spectral methods on tensors?
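For concreteness, here is a minimal numpy sketch (not from the slides) of rank-k PCA via an eigen-decomposition of the sample covariance; the data array X and the helper name pca_projection are illustrative assumptions.

```python
import numpy as np

def pca_projection(X, k):
    """Rank-k PCA via eigen-decomposition of the sample covariance.

    X: (n, d) array of centered samples; k: target rank.
    Returns the projection P = U_k U_k^T minimizing
    (1/n) * sum_i ||x_i - P x_i||^2 over rank-k projections.
    """
    n, d = X.shape
    cov = X.T @ X / n                        # sample covariance Cov(X)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    U_k = eigvecs[:, -k:]                    # top-k eigenvectors
    return U_k @ U_k.T

# usage sketch on synthetic data
X = np.random.randn(1000, 20)
X -= X.mean(axis=0)                          # center the samples
P = pca_projection(X, k=5)
reconstruction_error = np.mean(np.sum((X - X @ P) ** 2, axis=1))
```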
Moment Matrices and Tensors
Multivariate moments: M_1 := E[x], M_2 := E[x ⊗ x], M_3 := E[x ⊗ x ⊗ x].
The matrix E[x ⊗ x] ∈ R^{d×d} is a second-order tensor with E[x ⊗ x]_{i_1, i_2} = E[x_{i_1} x_{i_2}]; for matrices, E[x ⊗ x] = E[x x^T].
The tensor E[x ⊗ x ⊗ x] ∈ R^{d×d×d} is a third-order tensor with E[x ⊗ x ⊗ x]_{i_1, i_2, i_3} = E[x_{i_1} x_{i_2} x_{i_3}].
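A minimal numpy sketch of the corresponding empirical moment estimates, assuming the samples are rows of an array X of shape (n, d); forming M_3 densely costs O(d³) memory, so this is only illustrative for small d.

```python
import numpy as np

def empirical_moments(X):
    """Empirical estimates of M1 = E[x], M2 = E[x ⊗ x], M3 = E[x ⊗ x ⊗ x]
    from samples stored as rows of X (shape (n, d))."""
    n, d = X.shape
    M1 = X.mean(axis=0)                               # E[x]
    M2 = np.einsum('ni,nj->ij', X, X) / n             # E[x ⊗ x] = E[x x^T]
    M3 = np.einsum('ni,nj,nk->ijk', X, X, X) / n      # E[x ⊗ x ⊗ x]
    return M1, M2, M3
```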
Spectral Decomposition of Tensors
Matrix: M_2 = Σ_i λ_i u_i ⊗ v_i = λ_1 u_1 ⊗ v_1 + λ_2 u_2 ⊗ v_2 + ...
Tensor: M_3 = Σ_i λ_i u_i ⊗ v_i ⊗ w_i = λ_1 u_1 ⊗ v_1 ⊗ w_1 + λ_2 u_2 ⊗ v_2 ⊗ w_2 + ...
u ⊗ v ⊗ w is a rank-1 tensor since its (i_1, i_2, i_3)-th entry is u_{i_1} v_{i_2} w_{i_3}.
How to solve this non-convex problem?
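As a small illustration of the rank-1 structure, the sketch below builds u ⊗ v ⊗ w entrywise with numpy; the random factors are placeholders, not values from the slides.

```python
import numpy as np

# Rank-1 tensor u ⊗ v ⊗ w: entry (i1, i2, i3) equals u[i1] * v[i2] * w[i3].
u, v, w = np.random.randn(4), np.random.randn(4), np.random.randn(4)
rank1 = np.einsum('i,j,k->ijk', u, v, w)

# A rank-2 tensor as on the slide: M3 = λ1 u1⊗v1⊗w1 + λ2 u2⊗v2⊗w2,
# built here from random factor matrices purely for illustration.
lam = np.array([2.0, 1.0])
U, V, W = (np.random.randn(4, 2) for _ in range(3))
M3 = np.einsum('r,ir,jr,kr->ijk', lam, U, V, W)
```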
Orthogonal Tensor Power Method
Symmetric orthogonal tensor T ∈ R^{d×d×d}: T = Σ_{i ∈ [k]} λ_i v_i ⊗ v_i ⊗ v_i.
Recall the matrix power method: v ↦ M(I, v) / ‖M(I, v)‖.
Algorithm (tensor power method): v ↦ T(I, v, v) / ‖T(I, v, v)‖.
How do we avoid spurious solutions (not part of the decomposition)?
• The {v_i}'s are the only robust fixed points.
• All other eigenvectors are saddle points.
For an orthogonal tensor, there are no spurious local optima!
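A minimal numpy sketch of the tensor power update with random restarts and deflation; the function names, restart counts, and iteration counts are illustrative assumptions, and it presumes T is symmetric and (approximately) orthogonally decomposable.

```python
import numpy as np

def tensor_power_method(T, n_restarts=10, n_iters=100):
    """Recover one (eigenvalue, eigenvector) pair of a symmetric
    orthogonally decomposable tensor T via the power update
    v <- T(I, v, v) / ||T(I, v, v)||, with random restarts."""
    d = T.shape[0]
    best_lam, best_v = -np.inf, None
    for _ in range(n_restarts):
        v = np.random.randn(d)
        v /= np.linalg.norm(v)
        for _ in range(n_iters):
            Tv = np.einsum('ijk,j,k->i', T, v, v)     # T(I, v, v)
            v = Tv / np.linalg.norm(Tv)
        lam = np.einsum('ijk,i,j,k->', T, v, v, v)    # T(v, v, v)
        if lam > best_lam:
            best_lam, best_v = lam, v
    return best_lam, best_v

def orthogonal_decomposition(T, k):
    """Extract k components by power iteration plus deflation."""
    lams, vs = [], []
    for _ in range(k):
        lam, v = tensor_power_method(T)
        lams.append(lam)
        vs.append(v)
        T = T - lam * np.einsum('i,j,k->ijk', v, v, v)  # deflate found component
    return np.array(lams), np.stack(vs)
```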
Putting it together
Non-orthogonal tensor: M_3 = Σ_i w_i a_i ⊗ a_i ⊗ a_i, with M_2 = Σ_i w_i a_i ⊗ a_i.
A whitening matrix W maps the components a_1, a_2, a_3 to orthonormal directions v_1, v_2, v_3.
Multilinear transform: T = M_3(W, W, W) converts the tensor M_3 into an orthogonal tensor T.
Tensor decomposition: guaranteed non-convex optimization!
For what latent variable models can we obtain these M_2 and M_3 forms?
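A minimal sketch of the whitening step, assuming M_2 and M_3 have already been estimated, the rank k is known, and the top-k eigenvalues of M_2 are positive.

```python
import numpy as np

def whiten_and_transform(M2, M3, k):
    """Build a whitening matrix W with W^T M2 W = I_k from the top-k
    eigen-decomposition of M2, then apply the multilinear transform
    T = M3(W, W, W), which is orthogonally decomposable when the model holds."""
    eigvals, eigvecs = np.linalg.eigh(M2)      # ascending eigenvalues
    U = eigvecs[:, -k:]                        # top-k eigenvectors
    S = eigvals[-k:]                           # assumed positive
    W = U / np.sqrt(S)                         # d x k whitening matrix
    T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)   # k x k x k tensor
    return W, T
```

The (λ_i, v_i) pairs recovered from T by the power method can then be mapped back to estimates of the original components through the pseudo-inverse of W; the exact un-whitening constants depend on the moment normalization used.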
Outline: 1. Introduction  2. Spectral Methods  3. Moment Tensors of Latent Variable Models  4. Experiments  5. Conclusion
Topic Modeling
Moments for Single Topic Models
E[x_i | h] = A h, with w := E[h]. Goal: learn the topic-word matrix A and the vector w.
(Diagram: a single topic h generates the words x_1, ..., x_5 through the topic-word matrix A.)
Pairwise co-occurrence matrix: M_2 := E[x_1 ⊗ x_2] = E[E[x_1 ⊗ x_2 | h]] = Σ_{i=1}^k w_i a_i ⊗ a_i.
Triples tensor: M_3 := E[x_1 ⊗ x_2 ⊗ x_3] = E[E[x_1 ⊗ x_2 ⊗ x_3 | h]] = Σ_{i=1}^k w_i a_i ⊗ a_i ⊗ a_i.
Can be extended to learning LDA: multiple topics in a document.
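A minimal sketch of estimating these moments from data, under the assumption (not stated on the slide) that the corpus is given as docs, a list of documents, each a list of word indices into a vocabulary of size d; the moments average over distinct word positions, matching x_1, x_2, x_3 above.

```python
import numpy as np

def cooccurrence_moments(docs, d):
    """Empirical estimates of M2 = E[x1 ⊗ x2] and M3 = E[x1 ⊗ x2 ⊗ x3]
    for the single-topic model, where x1, x2, x3 are one-hot encodings of
    distinct word positions in the same document.  Pairs and triples are
    pooled across documents; per-document averaging is a common refinement."""
    M2 = np.zeros((d, d))
    M3 = np.zeros((d, d, d))
    n_pairs = n_triples = 0
    for doc in docs:
        L = len(doc)
        for a in range(L):
            for b in range(L):
                if b == a:
                    continue
                M2[doc[a], doc[b]] += 1
                n_pairs += 1
                for c in range(L):
                    if c in (a, b):
                        continue
                    M3[doc[a], doc[b], doc[c]] += 1
                    n_triples += 1
    return M2 / n_pairs, M3 / n_triples
```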
Tractable Learning for LVMs
(Diagrams: graphical models for GMMs, HMMs, ICA, and multiview and topic models, with hidden variables h_1, ..., h_k and observations x_1, ..., x_d.)
Overall Framework
(Pipeline: unlabeled data → probabilistic admixture models → tensor method → inference.)
Outline: 1. Introduction  2. Spectral Methods  3. Moment Tensors of Latent Variable Models  4. Experiments  5. Conclusion
Learning Communities through Tensor Methods
Datasets: Yelp (users, businesses, reviews; n ≈ 40k) and the DBLP coauthorship network (n ≈ 1 million; subsample of ≈ 100k).
Error (E) and recovery ratio (R):

Dataset          | k̂  | Method      | Running Time | E     | R
DBLP sub (k=250) | 500 | ours        | 10,157       | 0.139 | 89%
DBLP sub (k=250) | 500 | variational | 558,723      | 16.38 | 99%
DBLP (k=6000)    | 100 | ours        | 5,407        | 0.105 | 95%

Thanks to Prem Gopalan and David Mimno for providing the variational code.
Experimental Results on Yelp
Lowest-error business categories and largest-weight businesses:

Rank | Category       | Business                  | Stars | Review Counts
1    | Latin American | Salvadoreno Restaurant    | 4.0   | 36
2    | Gluten Free    | P.F. Chang's China Bistro | 3.5   | 55
3    | Hobby Shops    | Make Meaning              | 4.5   | 14
4    | Mass Media     | KJZZ 91.5 FM              | 4.0   | 13
5    | Yoga           | Sutra Midtown             | 4.5   | 31