

  1. Tensor Methods for Feature Learning. Anima Anandkumar, U.C. Irvine.

  2. Feature Learning for Efficient Classification. Find good transformations of the input for improved classification. Figures attributed to Fei-Fei Li, Rob Fergus, Antonio Torralba, et al.

  3-6. Principles Behind Feature Learning (diagram: x → φ(x) → y). Classification/regression tasks: predict y given x. Find a feature transform φ(x) to better predict y. Feature learning: learn φ(·) from data. Learning φ(x) from labeled vs. unlabeled samples: labeled samples {x_i, y_i} versus unlabeled samples {x_i}; labeled samples should lead to better feature learning of φ(·) but are harder to obtain. Learn features φ(x) through latent variables related to x and y.

  7-12. Conditional Latent Variable Models: Two Cases. (i) Multi-layer neural networks (diagram: x → φ(x) → φ(φ(x)) → y), with E[y | x] = σ(A_d σ(A_{d−1} σ(··· A_2 σ(A_1 x)))). (ii) Mixture of classifiers or GLMs with a hidden choice variable h (diagram: h and x jointly generate y), with G(x) := E[y | x, h] = σ(⟨Uh, x⟩ + ⟨b, h⟩).
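To make the two conditional models concrete, here is a minimal numpy sketch of their forward maps; the sigmoid choice of σ, the variable shapes, and the function names are illustrative assumptions, not part of the slides.

```python
import numpy as np

def sigma(z):
    """One common choice of nonlinearity; the slides leave σ unspecified."""
    return 1.0 / (1.0 + np.exp(-z))

def multilayer_nn(x, weights):
    """E[y | x] = σ(A_d σ(A_{d-1} σ(··· σ(A_1 x)))) for weights = [A_1, ..., A_d]."""
    h = x
    for A in weights:
        h = sigma(A @ h)
    return h

def mixture_of_glms(x, h, U, b):
    """G(x) = E[y | x, h] = σ(⟨U h, x⟩ + ⟨b, h⟩) for a (e.g. one-hot) hidden choice h."""
    return sigma(np.dot(U @ h, x) + np.dot(b, h))
```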

  13. Challenges in Learning LVMs. Identifiability conditions: when can the model be identified (given infinite computation and data)? Does identifiability also lead to tractable algorithms? Computational challenges: maximum likelihood is NP-hard in most scenarios; in practice, local search approaches such as back-propagation, EM, and variational Bayes have no consistency guarantees. Sample complexity: sample complexity needs to be low in the high-dimensional regime. Goal: guaranteed and efficient learning through tensor methods.

  14. Outline: 1. Introduction; 2. Spectral and Tensor Methods; 3. Generative Models for Feature Learning; 4. Proposed Framework; 5. Conclusion.

  15. Classical Spectral Methods: Matrix PCA and CCA. Unsupervised setting (PCA): for centered samples {x_i}, find a projection P with Rank(P) = k minimizing (1/n) Σ_{i ∈ [n]} ‖x_i − P x_i‖². Result: eigen-decomposition of S = Cov(X). Supervised setting (CCA): for centered samples {x_i, y_i}, find directions a and b (projections ⟨a, x⟩ and ⟨b, y⟩) maximizing a⊤ Ê[x y⊤] b / √(a⊤ Ê[x x⊤] a · b⊤ Ê[y y⊤] b). Result: generalized eigen-decomposition.
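As a concrete illustration (a minimal sketch, not from the slides), both classical methods reduce to (generalized) eigen-decompositions; the ridge term `reg` is added purely for numerical stability and is an assumption of this sketch.

```python
import numpy as np
from scipy.linalg import eigh

def pca_projection(X, k):
    """Top-k PCA directions via eigen-decomposition of the sample covariance S = Cov(X)."""
    Xc = X - X.mean(axis=0)
    S = np.cov(Xc, rowvar=False)
    w, V = eigh(S)                 # eigenvalues in ascending order
    return V[:, ::-1][:, :k]       # columns spanning the rank-k projection P

def cca_directions(X, Y, reg=1e-6):
    """Leading CCA pair (a, b) via the generalized eigen-problem
    [[0, Cxy], [Cyx, 0]] v = rho [[Cxx, 0], [0, Cyy]] v."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n, dx = Xc.shape
    dy = Yc.shape[1]
    Cxx = Xc.T @ Xc / n + reg * np.eye(dx)
    Cyy = Yc.T @ Yc / n + reg * np.eye(dy)
    Cxy = Xc.T @ Yc / n
    A = np.block([[np.zeros((dx, dx)), Cxy], [Cxy.T, np.zeros((dy, dy))]])
    B = np.block([[Cxx, np.zeros((dx, dy))], [np.zeros((dy, dx)), Cyy]])
    w, V = eigh(A, B)              # generalized eigen-decomposition
    v = V[:, -1]                   # top generalized eigenvector
    return v[:dx], v[dx:]
```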

  16. Beyond SVD: Spectral Methods on Tensors. How to learn mixture models without separation constraints? PCA uses the covariance matrix of the data; are higher order moments helpful? A unified framework? Moment-based estimation of probabilistic latent variable models? SVD gives the spectral decomposition of matrices; what are the analogues for tensors?

  17. Moment Matrices and Tensors. Multivariate moments: M_1 := E[x], M_2 := E[x ⊗ x], M_3 := E[x ⊗ x ⊗ x]. The matrix E[x ⊗ x] ∈ R^{d×d} is a second order tensor with E[x ⊗ x]_{i1,i2} = E[x_{i1} x_{i2}]; for matrices, E[x ⊗ x] = E[x x⊤]. The tensor E[x ⊗ x ⊗ x] ∈ R^{d×d×d} is a third order tensor with E[x ⊗ x ⊗ x]_{i1,i2,i3} = E[x_{i1} x_{i2} x_{i3}].
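A short numpy sketch (not from the slides) of the corresponding empirical moments; it assumes the samples are stacked in a matrix X of shape (n, d).

```python
import numpy as np

def empirical_moments(X):
    """Empirical estimates of M1 = E[x], M2 = E[x ⊗ x], M3 = E[x ⊗ x ⊗ x]."""
    n, d = X.shape
    M1 = X.mean(axis=0)                              # shape (d,)
    M2 = np.einsum('ni,nj->ij', X, X) / n            # shape (d, d); equals E[x x^T]
    M3 = np.einsum('ni,nj,nk->ijk', X, X, X) / n     # shape (d, d, d)
    return M1, M2, M3
```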

  18. Spectral Decomposition of Tensors. Matrix case: M_2 = Σ_i λ_i u_i ⊗ v_i, i.e. M_2 = λ_1 u_1 ⊗ v_1 + λ_2 u_2 ⊗ v_2 + ···. Tensor case: M_3 = Σ_i λ_i u_i ⊗ v_i ⊗ w_i, i.e. M_3 = λ_1 u_1 ⊗ v_1 ⊗ w_1 + λ_2 u_2 ⊗ v_2 ⊗ w_2 + ···. Here u ⊗ v ⊗ w is a rank-1 tensor since its (i_1, i_2, i_3)-th entry is u_{i1} v_{i2} w_{i3}. Guaranteed recovery (Anandkumar et al. 2012; Zhang & Golub 2001).
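One standard way to compute such a decomposition in the symmetric, orthogonally decomposable case is the tensor power method with deflation. The sketch below assumes a symmetric third order tensor with an (approximately) orthogonal decomposition; in practice one first whitens with M_2 and adds random restarts. All names are illustrative, not from the slides.

```python
import numpy as np

def tensor_power_method(T, k, n_iter=100, seed=0):
    """Recover an (approximately) orthogonal symmetric decomposition
    T ≈ sum_j lam_j u_j ⊗ u_j ⊗ u_j by power iteration with deflation."""
    rng = np.random.default_rng(seed)
    d = T.shape[0]
    T = T.copy()
    lams, us = [], []
    for _ in range(k):
        v = rng.normal(size=d)
        v /= np.linalg.norm(v)
        for _ in range(n_iter):
            v = np.einsum('ijk,j,k->i', T, v, v)        # tensor contraction T(I, v, v)
            v /= np.linalg.norm(v)
        lam = np.einsum('ijk,i,j,k->', T, v, v, v)      # T(v, v, v)
        lams.append(lam)
        us.append(v)
        T = T - lam * np.einsum('i,j,k->ijk', v, v, v)  # deflate the recovered component
    return np.array(lams), np.array(us)
```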

  19-20. Moment Tensors for Conditional Models. Multivariate moments, many possibilities: E[x ⊗ y], E[x ⊗ x ⊗ y], E[φ(x) ⊗ y], .... Feature transformations of the input x ↦ φ(x): how to exploit them? Are moments E[φ(x) ⊗ y] useful? If φ(x) is a matrix/tensor, we have matrix/tensor moments and can carry out a spectral decomposition of the moments. Construct φ(x) based on the input distribution?
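A small sketch (not from the slides) showing how such cross-moments are formed empirically; it assumes scalar labels y and a precomputed stack phi_X of features φ(x_i).

```python
import numpy as np

def cross_moments(X, y, phi_X):
    """Empirical E[x ⊗ y], E[x ⊗ x ⊗ y], and E[φ(x) ⊗ y] for scalar labels y.
    phi_X stacks φ(x_i) along axis 0; φ(x) may be a vector, matrix, or tensor."""
    n = X.shape[0]
    Exy  = np.einsum('ni,n->i', X, y) / n           # E[x ⊗ y] (a vector when y is scalar)
    Exxy = np.einsum('ni,nj,n->ij', X, X, y) / n    # E[x ⊗ x ⊗ y]
    Ephy = np.tensordot(y, phi_X, axes=(0, 0)) / n  # E[φ(x) ⊗ y], same order as φ(x)
    return Exy, Exxy, Ephy
```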

  21. Outline: 1. Introduction; 2. Spectral and Tensor Methods; 3. Generative Models for Feature Learning; 4. Proposed Framework; 5. Conclusion.

  22-23. Score Function of Input Distribution. Score function S(x) := −∇ log p(x). For an energy-based density p(x) = (1/Z) exp(−E(x)), the score satisfies ∂/∂x log p(x) = −∂/∂x E(x). (Figure: 1-d PDF, 1-d score, and 2-d score; figures from Alain and Bengio 2014.)
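A standard worked example (not on the slides): for a Gaussian input x ~ N(μ, Σ), p(x) ∝ exp(−½ (x − μ)⊤ Σ^{-1} (x − μ)), so the score is S(x) = −∇ log p(x) = Σ^{-1}(x − μ), which is linear in x; non-Gaussian inputs give nonlinear score features.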

  24-27. Why Score Function Features? S(x) := −∇ log p(x). Utilizes generative models for the input. Can be learned from unlabeled data. Score matching methods work for non-normalized models. Approximation of the score function using denoising auto-encoders: ∇ log p(x) ≈ (r*(x + n) − x) / σ², where r* is the auto-encoder's reconstruction function and n is the Gaussian corruption noise with variance σ². Recall our goal: construct moments E[y ⊗ φ(x)]. Beyond vector features?
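A minimal sketch of that approximation (the `denoiser` callable, `sigma`, and the function name are assumptions of this sketch); any reconstruction function r* trained with Gaussian corruption of standard deviation σ could be plugged in.

```python
import numpy as np

def dae_score_estimate(x, denoiser, sigma, rng=None):
    """Estimate S(x) = -∇ log p(x) from a trained denoising auto-encoder.
    `denoiser` is assumed to be its reconstruction function r*, trained with
    Gaussian corruption of standard deviation `sigma`."""
    rng = np.random.default_rng() if rng is None else rng
    n = rng.normal(scale=sigma, size=x.shape)        # corruption noise
    grad_log_p = (denoiser(x + n) - x) / sigma**2    # ≈ ∇ log p(x), the slide's approximation
    return -grad_log_p                               # the score S(x)
```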

  28. Matrix and Tensor-valued Features. Higher order score functions: S_m(x) := (−1)^m ∇^(m) p(x) / p(x). These can be a matrix or a tensor instead of a vector, and can be used to construct matrix and tensor moments E[y ⊗ φ(x)].
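A worked example under a zero-mean Gaussian input assumption (not from the slides): if x ~ N(0, Σ), then S_1(x) = −∇p(x)/p(x) = Σ^{-1} x is a vector, while S_2(x) = ∇^(2) p(x)/p(x) = Σ^{-1} x x⊤ Σ^{-1} − Σ^{-1} is a matrix, matching the general definition above.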

  29. Outline: 1. Introduction; 2. Spectral and Tensor Methods; 3. Generative Models for Feature Learning; 4. Proposed Framework; 5. Conclusion.

  30-34. Operations on Score Function Features. Form the cross-moments E[y ⊗ S_m(x)]. Our result, an extension of Stein's lemma: E[y ⊗ S_m(x)] = E[∇^(m) G(x)], where G(x) := E[y | x]. Extract discriminative directions through spectral decomposition: E[y ⊗ S_m(x)] = E[∇^(m) G(x)] = Σ_{j ∈ [k]} λ_j · u_j ⊗ u_j ⊗ ··· ⊗ u_j (m times). Then construct features σ(u_j⊤ x) for some nonlinearity σ.
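Putting the pieces together for m = 2 and scalar labels, here is a sketch of the pipeline. It assumes a zero-mean Gaussian input so that S_2(x) has the closed form Σ^{-1} x x⊤ Σ^{-1} − Σ^{-1} (the slides do not fix the input model), and the sigmoid σ is one illustrative choice.

```python
import numpy as np

def score_feature_pipeline(X, y, k):
    """Sketch for m = 2 and scalar labels y.  Assumes x ~ N(0, Sigma) so that the
    second-order score has the closed form S_2(x) = P x x^T P - P with P = Sigma^{-1}."""
    n, d = X.shape
    P = np.linalg.inv(np.cov(X, rowvar=False))      # estimate of Sigma^{-1}
    XP = X @ P                                      # rows are (P x_i)^T
    S2 = np.einsum('ni,nj->nij', XP, XP) - P        # S_2(x_i), shape (n, d, d)
    M = np.einsum('n,nij->ij', y, S2) / n           # cross-moment E[y * S_2(x)]
    eigvals, eigvecs = np.linalg.eigh(M)            # spectral decomposition
    order = np.argsort(np.abs(eigvals))[::-1][:k]
    U = eigvecs[:, order]                           # top-k discriminative directions u_j
    feats = 1.0 / (1.0 + np.exp(-(X @ U)))          # features sigma(u_j^T x), sigmoid sigma
    return U, feats
```

For m = 3 the cross-moment is a third order tensor, and the eigen-decomposition step would be replaced by a tensor decomposition such as the power method sketched earlier.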

  35. Automated Extraction of Discriminative Features
