Tensor Methods for Feature Learning
Anima Anandkumar, U.C. Irvine
Feature Learning for Efficient Classification
Find good transformations of the input for improved classification.
Figures used are attributed to Fei-Fei Li, Rob Fergus, Antonio Torralba, et al.
Principles Behind Feature Learning
Classification/regression tasks: predict y given x.
Find a feature transform φ(x) to better predict y.
Feature learning: learn φ(·) from data.
[Diagram: x → φ(x) → y.]
Learning φ(x) from Labeled vs. Unlabeled Samples
Labeled samples {x_i, y_i} and unlabeled samples {x_i}.
Labeled samples should lead to better feature learning of φ(·) but are harder to obtain.
Learn features φ(x) through latent variables related to x, y.
Conditional Latent Variable Models: Two Cases
Multi-layer Neural Networks:
E[y | x] = σ(A_d σ(A_{d−1} σ(··· A_2 σ(A_1 x))))
Mixture of Classifiers or GLMs:
G(x) := E[y | x, h] = σ(⟨Uh, x⟩ + ⟨b, h⟩)
[Diagrams: a feature hierarchy x → φ(x) → φ(φ(x)) → y, and a mixture model with latent variable h between x and y.]
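As a concrete reference for these two conditional models, here is a minimal numpy sketch of the two conditional means above. The function names, the sigmoid choice for σ, and the one-hot encoding of h are assumptions made for illustration, not part of the original slides.

```python
import numpy as np

def sigma(z):
    # Elementwise nonlinearity; the sigmoid is one common (assumed) choice.
    return 1.0 / (1.0 + np.exp(-z))

def nn_conditional_mean(x, weights):
    """E[y | x] = sigma(A_d sigma(... sigma(A_1 x))) for a multi-layer network.

    `weights` is the list [A_1, ..., A_d] of layer matrices."""
    h = x
    for A in weights:
        h = sigma(A @ h)
    return h

def mixture_glm_conditional_mean(x, h, U, b):
    """E[y | x, h] = sigma(<U h, x> + <b, h>) for a mixture of classifiers/GLMs.

    Here h is assumed to be the one-hot indicator of the latent component."""
    return sigma((U @ h) @ x + b @ h)
```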
Challenges in Learning LVMs
Challenge: Identifiability Conditions
When can the model be identified (given infinite computation and data)?
Does identifiability also lead to tractable algorithms?
Computational Challenges
Maximum likelihood is NP-hard in most scenarios.
In practice, local search approaches such as back-propagation, EM, and variational Bayes have no consistency guarantees.
Sample Complexity
Sample complexity needs to be low in the high-dimensional regime.
Guaranteed and efficient learning through tensor methods.
Outline
1. Introduction
2. Spectral and Tensor Methods
3. Generative Models for Feature Learning
4. Proposed Framework
5. Conclusion
Classical Spectral Methods: Matrix PCA and CCA
Unsupervised Setting: PCA
For centered samples {x_i}, find a projection P with rank(P) = k s.t.
min_P (1/n) Σ_{i∈[n]} ‖x_i − P x_i‖².
Result: eigen-decomposition of S = Cov(X).
Supervised Setting: CCA
For centered samples {x_i, y_i}, find directions a, b maximizing the correlation between ⟨a, x⟩ and ⟨b, y⟩:
max_{a,b} a⊤Ê[x y⊤]b / ( √(a⊤Ê[x x⊤]a) · √(b⊤Ê[y y⊤]b) ).
Result: generalized eigen-decomposition.
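A minimal sketch of both classical estimators, assuming centered data matrices X and Y with samples as rows and using standard numpy/scipy eigen-solvers; the small ridge term `reg` is an added assumption for numerical stability.

```python
import numpy as np
from scipy.linalg import eigh

def pca_projection(X, k):
    """Rank-k projector P minimizing (1/n) sum_i ||x_i - P x_i||^2 (rows of X centered)."""
    S = np.cov(X, rowvar=False)                   # sample covariance
    vals, vecs = np.linalg.eigh(S)
    U = vecs[:, np.argsort(vals)[::-1][:k]]       # top-k eigenvectors
    return U @ U.T                                # projector onto their span

def cca_directions(X, Y, reg=1e-8):
    """Leading canonical pair (a, b) via a generalized eigen-decomposition."""
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])  # Ê[x x^T] (plus a small ridge)
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])  # Ê[y y^T]
    Cxy = X.T @ Y / n                             # Ê[x y^T]
    # Solve Cxy Cyy^{-1} Cyx a = rho^2 Cxx a.
    M = Cxy @ np.linalg.solve(Cyy, Cxy.T)
    vals, vecs = eigh(M, Cxx)                     # generalized symmetric eigenproblem
    a = vecs[:, -1]                               # direction for the largest rho^2
    b = np.linalg.solve(Cyy, Cxy.T @ a)
    return a, b / np.linalg.norm(b)
```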
Beyond SVD: Spectral Methods on Tensors
How to learn mixture models without separation constraints?
  – PCA uses the covariance matrix of the data. Are higher-order moments helpful?
A unified framework?
  – Moment-based estimation of probabilistic latent variable models?
SVD gives a spectral decomposition of matrices.
  – What are the analogues for tensors?
Moment Matrices and Tensors
Multivariate Moments
M_1 := E[x],  M_2 := E[x ⊗ x],  M_3 := E[x ⊗ x ⊗ x].
The matrix E[x ⊗ x] ∈ R^{d×d} is a second-order tensor: E[x ⊗ x]_{i_1,i_2} = E[x_{i_1} x_{i_2}]. For matrices, E[x ⊗ x] = E[x x⊤].
The tensor E[x ⊗ x ⊗ x] ∈ R^{d×d×d} is a third-order tensor: E[x ⊗ x ⊗ x]_{i_1,i_2,i_3} = E[x_{i_1} x_{i_2} x_{i_3}].
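The empirical counterparts of these moments can be formed directly from samples; the sketch below assumes the samples are stored as the rows of a numpy array and is illustrative only.

```python
import numpy as np

def empirical_moments(X):
    """Empirical M1, M2, M3 for samples stored as the rows of X (shape n x d)."""
    n, d = X.shape
    M1 = X.mean(axis=0)                           # E[x], a d-vector
    M2 = np.einsum('ni,nj->ij', X, X) / n         # E[x ⊗ x], a d x d matrix
    M3 = np.einsum('ni,nj,nk->ijk', X, X, X) / n  # E[x ⊗ x ⊗ x], a d x d x d tensor
    return M1, M2, M3
```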
Spectral Decomposition of Tensors
Matrix: M_2 = Σ_i λ_i u_i ⊗ v_i,  e.g.  M_2 = λ_1 u_1 ⊗ v_1 + λ_2 u_2 ⊗ v_2 + ···
Tensor: M_3 = Σ_i λ_i u_i ⊗ v_i ⊗ w_i,  e.g.  M_3 = λ_1 u_1 ⊗ v_1 ⊗ w_1 + λ_2 u_2 ⊗ v_2 ⊗ w_2 + ···
u ⊗ v ⊗ w is a rank-1 tensor since its (i_1, i_2, i_3)-th entry is u_{i_1} v_{i_2} w_{i_3}.
Guaranteed recovery (Anandkumar et al. 2012, Zhang & Golub 2001).
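One standard way to recover such a decomposition in the symmetric, orthogonally decomposable case is the tensor power iteration; the following is only a sketch of extracting a single component plus deflation, not the full robust procedure with guarantees.

```python
import numpy as np

def tensor_power_iteration(T, n_iters=100, seed=0):
    """One run of the power iteration on a symmetric third-order tensor T,
    returning an estimate (lam, u) of one component of
    T ≈ sum_i lam_i u_i ⊗ u_i ⊗ u_i (orthogonal u_i assumed)."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(T.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(n_iters):
        u = np.einsum('ijk,j,k->i', T, u, u)      # the map u -> T(I, u, u)
        u /= np.linalg.norm(u)
    lam = np.einsum('ijk,i,j,k->', T, u, u, u)    # T(u, u, u)
    return lam, u

def deflate(T, lam, u):
    """Subtract a recovered rank-1 component before extracting the next one."""
    return T - lam * np.einsum('i,j,k->ijk', u, u, u)
```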
Moment Tensors for Conditional Models
Multivariate Moments: many possibilities...
E[x ⊗ y], E[x ⊗ x ⊗ y], E[φ(x) ⊗ y], ...
Feature Transformations of the Input: x ↦ φ(x)
How to exploit them? Are moments E[φ(x) ⊗ y] useful?
If φ(x) is a matrix/tensor, we have matrix/tensor moments.
We can then carry out spectral decomposition of these moments.
Construct φ(x) based on the input distribution?
Score Function of Input Distribution
Score function: S(x) := −∇ log p(x)
[Figure: 1-d PDF, 1-d score, and 2-d score, with (a) p(x) = (1/Z) exp(−E(x)) and (b) ∂/∂x log p(x) = −∂/∂x E(x). Figures from Alain and Bengio 2014.]
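For a concrete case where the score function is available in closed form, consider a multivariate Gaussian, where S(x) = Σ⁻¹(x − μ); a small sketch:

```python
import numpy as np

def gaussian_score(x, mu, Sigma):
    """S(x) = -∇ log p(x) for x ~ N(mu, Sigma), i.e. S(x) = Sigma^{-1} (x - mu)."""
    return np.linalg.solve(Sigma, x - mu)
```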
Why Score Function Features?
S(x) := −∇ log p(x)
Utilizes generative models for the input.
Can be learnt from unlabeled data.
Score matching methods work for non-normalized models.
Approximation of the score function using denoising auto-encoders:
∇ log p(x) ≈ (r*(x + n) − x) / σ²
Recall our goal: construct moments E[y ⊗ φ(x)].
Beyond vector features?
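A sketch of this approximation, assuming access to an already-trained denoising auto-encoder r* with noise level σ; the `denoiser` callable and its interface are hypothetical, introduced only for illustration.

```python
import numpy as np

def score_from_denoiser(x, denoiser, sigma_noise, rng=None):
    """∇ log p(x) ≈ (r*(x + n) - x) / sigma^2, with n ~ N(0, sigma^2 I).

    `denoiser` stands in for an already-trained reconstruction function r*;
    both the name and its availability are assumptions of this sketch."""
    if rng is None:
        rng = np.random.default_rng()
    n = sigma_noise * rng.standard_normal(x.shape)
    return (denoiser(x + n) - x) / sigma_noise ** 2
```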
Matrix and Tensor-valued Features
Higher order score functions: S_m(x) := (−1)^m ∇^(m) p(x) / p(x)
Can be a matrix or a tensor instead of a vector.
Can be used to construct matrix and tensor moments E[y ⊗ φ(x)].
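For the Gaussian example above, the second-order score also has a closed form, S_2(x) = Σ⁻¹(x − μ)(x − μ)⊤Σ⁻¹ − Σ⁻¹; a sketch for the m = 2 case:

```python
import numpy as np

def gaussian_score_m2(x, mu, Sigma):
    """Second-order score S_2(x) = ∇^(2) p(x) / p(x) for x ~ N(mu, Sigma):
    S_2(x) = Sigma^{-1} (x - mu)(x - mu)^T Sigma^{-1} - Sigma^{-1},
    a d x d matrix-valued feature of x."""
    Sinv = np.linalg.inv(Sigma)
    v = Sinv @ (x - mu)
    return np.outer(v, v) - Sinv
```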
Operations on Score Function Features
Form the cross-moments E[y ⊗ S_m(x)].
Our result (an extension of Stein's lemma):
E[y ⊗ S_m(x)] = E[∇^(m) G(x)],  where G(x) := E[y | x].
Extract discriminative directions through spectral decomposition:
E[y ⊗ S_m(x)] = E[∇^(m) G(x)] = Σ_{j∈[k]} λ_j · u_j ⊗ u_j ⊗ ··· ⊗ u_j  (m times).
Construct σ(u_j⊤ x) for some nonlinearity σ.
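A minimal end-to-end sketch of this pipeline for the special case m = 2, scalar labels, and Gaussian inputs (so S_2 has the closed form used earlier). The matrix eigendecomposition stands in for the general tensor decomposition, and all function names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def discriminative_directions(X, y, mu, Sigma, k):
    """Estimate E[y * S_2(x)] with the Gaussian closed-form S_2 and return its
    top-k eigenvectors as the directions u_1, ..., u_k (columns of U)."""
    n, d = X.shape
    Sinv = np.linalg.inv(Sigma)
    M = np.zeros((d, d))
    for xi, yi in zip(X, y):
        v = Sinv @ (xi - mu)
        M += yi * (np.outer(v, v) - Sinv)         # y * S_2(x)
    M /= n
    vals, vecs = np.linalg.eigh(M)
    idx = np.argsort(np.abs(vals))[::-1][:k]      # largest-magnitude eigenvalues
    return vecs[:, idx]

def score_features(X, U, nonlinearity=np.tanh):
    """New features sigma(u_j^T x), one per recovered direction u_j."""
    return nonlinearity(X @ U)
```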
Automated Extraction of Discriminative Features