Beating the Perils of Non-Convexity: Machine Learning using Tensor Methods


  1. Beating the Perils of Non-Convexity: Machine Learning using Tensor Methods. Anima Anandkumar. Joint work with Majid Janzamin and Hanie Sedghi. U.C. Irvine.

  2. Learning with Big Data. Learning is finding a needle in a haystack.

  3. Learning with Big Data. Learning is finding a needle in a haystack. High-dimensional regime: as data grows, so does the number of variables! Useful information: low-dimensional structures. Learning with big data: an ill-posed problem.

  4. Learning with Big Data. Learning is finding a needle in a haystack. High-dimensional regime: as data grows, so does the number of variables! Useful information: low-dimensional structures. Learning with big data: an ill-posed problem. Learning with big data: statistically and computationally challenging!

  5. Optimization for Learning. Most learning problems can be cast as optimization. Unsupervised learning: clustering (k-means, hierarchical, ...); maximum likelihood estimation for probabilistic latent variable models. Supervised learning: optimizing a neural network (input, neurons, output) with respect to a loss function.

  6. Convex vs. Non-convex Optimization. Progress so far is only the tip of the iceberg. Images taken from https://www.facebook.com/nonconvex

  7. Convex vs. Non-convex Optimization. Progress so far is only the tip of the iceberg. The real world is mostly non-convex! Images taken from https://www.facebook.com/nonconvex

  8. Convex vs. Nonconvex Optimization. Convex: a unique optimum that is both global and local. Nonconvex: multiple local optima.

  9. Convex vs. Nonconvex Optimization. Convex: a unique optimum that is both global and local. Nonconvex: multiple local optima. In high dimensions, possibly exponentially many local optima.

  10. Convex vs. Nonconvex Optimization. Convex: a unique optimum that is both global and local. Nonconvex: multiple local optima. In high dimensions, possibly exponentially many local optima. How to deal with non-convexity?

  11. Outline. 1. Introduction. 2. Guaranteed Training of Neural Networks. 3. Overview of Other Results on Tensors. 4. Conclusion.

  12. Training Neural Networks. Tremendous practical impact with deep learning. Algorithm: backpropagation. Highly non-convex optimization.

  13. Toy Example: Failure of Backpropagation. [Diagram: a two-hidden-unit network with sigmoid activations σ(·), weights w_1, w_2, inputs x_1, x_2, and labels y = 1 / y = −1 over labeled input samples.] Goal: binary classification. Our method: guaranteed risk bounds for training neural networks.

  16. Backpropagation vs. Our Method. Weights w_2 randomly drawn and fixed. [Surface plot: backprop (quadratic) loss surface over w_1(1), w_1(2).]

  17. Backpropagation vs. Our Method. Weights w_2 randomly drawn and fixed. [Surface plots: backprop (quadratic) loss surface vs. the loss surface for our method, over w_1(1), w_1(2).]

  18. Overcoming Hardness of Training. In general, training a neural network is NP-hard. How does knowledge of the input distribution help?

  20. Generative vs. Discriminative Models. [Plots: p(x, y) and p(y | x) versus input data x, for classes y = 0 and y = 1.] Generative models: encode domain knowledge. Discriminative models: good classification performance. A neural network is a discriminative model. Do generative models help in discriminative tasks?

  21. Feature Transformation for Training Neural Networks. [Diagram: x → φ(x) → y.] Feature learning: learn φ(·) from input data. How to use φ(·) to train neural networks?

  22. Feature Transformation for Training Neural Networks. [Diagram: x → φ(x) → y.] Feature learning: learn φ(·) from input data. How to use φ(·) to train neural networks? Multivariate moments: many possibilities, e.g. E[x ⊗ y], E[x ⊗ x ⊗ y], E[φ(x) ⊗ y], ...

  23. Tensor Notation for Higher Order Moments. Multivariate higher-order moments form tensors. Are there spectral operations on tensors akin to PCA on matrices? Matrix: E[x ⊗ y] ∈ ℝ^{d×d} is a second-order tensor, with entries E[x ⊗ y]_{i1,i2} = E[x_{i1} y_{i2}]; for matrices, E[x ⊗ y] = E[x yᵀ]. Tensor: E[x ⊗ x ⊗ y] ∈ ℝ^{d×d×d} is a third-order tensor, with entries E[x ⊗ x ⊗ y]_{i1,i2,i3} = E[x_{i1} x_{i2} y_{i3}]. In general, E[φ(x) ⊗ y] is a tensor. What class of φ(·) is useful for training neural networks?
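To make the notation concrete, here is a minimal sketch (not from the slides) of estimating the cross-moment tensors E[x ⊗ y] and E[x ⊗ x ⊗ y] from samples with NumPy. The sample size, dimension, and random data are illustrative assumptions.

```python
# Sketch: empirical cross-moment tensors via averaging of outer products.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 5                    # n samples, dimension d (illustrative)
X = rng.normal(size=(n, d))          # inputs  x_i ∈ R^d
Y = rng.normal(size=(n, d))          # outputs y_i ∈ R^d

# Second-order tensor (a matrix): E[x ⊗ y] = E[x yᵀ] ∈ R^{d×d}
M2 = np.einsum('ni,nj->ij', X, Y) / n

# Third-order tensor: E[x ⊗ x ⊗ y] ∈ R^{d×d×d},
# with entries E[x_{i1} x_{i2} y_{i3}]
M3 = np.einsum('ni,nj,nk->ijk', X, X, Y) / n

print(M2.shape, M3.shape)            # (5, 5) (5, 5, 5)
```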

  24. Score Function Transformations. Score function for x ∈ ℝ^d with pdf p(·): S_1(x) := −∇_x log p(x), where S_1(x) ∈ ℝ^d for input x ∈ ℝ^d.

  27. Score Function Transformations. m-th-order score function: S_m(x) := (−1)^m ∇^(m) p(x) / p(x).

  28. Score Function Transformations. First order: S_1(x) ∈ ℝ^d.

  29. Score Function Transformations. Second order: S_2(x) ∈ ℝ^{d×d}.

  30. Score Function Transformations. Third order: S_3(x) ∈ ℝ^{d×d×d}.
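As a concrete check of the definition, here is a small sketch (not from the slides) that evaluates the first-order score S_1(x) = −∇_x log p(x) in closed form for a Gaussian density and compares it against a finite-difference gradient. The Gaussian choice and all parameters are assumptions made for illustration.

```python
# Sketch: first-order score function of N(mu, Sigma) vs. finite differences.
import numpy as np

rng = np.random.default_rng(1)
d = 3
mu = rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)      # a positive-definite covariance
Sigma_inv = np.linalg.inv(Sigma)

def log_p(x):
    """Log-density of N(mu, Sigma), up to the normalizing constant."""
    diff = x - mu
    return -0.5 * diff @ Sigma_inv @ diff

def score_1(x):
    """Closed form: S_1(x) = -∇ log p(x) = Sigma^{-1} (x - mu)."""
    return Sigma_inv @ (x - mu)

# Central finite-difference gradient of log p at a random point
x = rng.normal(size=d)
eps = 1e-5
grad_fd = np.array([
    (log_p(x + eps * e) - log_p(x - eps * e)) / (2 * eps)
    for e in np.eye(d)
])

print(np.allclose(score_1(x), -grad_fd, atol=1e-4))   # True
```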

  31. Moments of a Neural Network. Label function: E[y | x] = f(x) = a_2ᵀ σ(A_1ᵀ x + b_1) + b_2. [Diagram: input x ∈ ℝ^d → k hidden units σ(·) with weight matrix A_1 → output y with weights a_2.]

  32. Moments of a Neural Network. Given labeled examples {(x_i, y_i)}: E[y · S_m(x)] = E_x[∇^(m) f(x)]. In particular, M_1 = E[y · S_1(x)] = Σ_{j ∈ [k]} λ_{1,j} · u_j.

  33. Moments of a Neural Network. The components u_j are the hidden-layer weight columns: M_1 = E[y · S_1(x)] = Σ_{j ∈ [k]} λ_{1,j} · (A_1)_j.

  34. Moments of a Neural Network. Pictorially, M_1 = λ_{1,1} (A_1)_1 + λ_{1,2} (A_1)_2 + ...

  35. Moments of a Neural Network. Second order: M_2 = E[y · S_2(x)] = Σ_{j ∈ [k]} λ_{2,j} · (A_1)_j ⊗ (A_1)_j.

  36. Moments of a Neural Network. Pictorially, M_2 = λ_{2,1} (A_1)_1 ⊗ (A_1)_1 + λ_{2,2} (A_1)_2 ⊗ (A_1)_2 + ...

  37. Moments of a Neural Network. Third order: M_3 = E[y · S_3(x)] = Σ_{j ∈ [k]} λ_{3,j} · (A_1)_j ⊗ (A_1)_j ⊗ (A_1)_j.

  38. Moments of a Neural Network. Pictorially, M_3 is a sum of rank-one terms λ_{3,j} (A_1)_j ⊗ (A_1)_j ⊗ (A_1)_j.
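The rank-one structure of M_3 is what makes the weight columns recoverable. Below is a simplified sketch of that idea, not the authors' full algorithm: it builds a symmetric tensor T = Σ_j λ_j u_j ⊗ u_j ⊗ u_j with orthonormal components (the general case needs an additional whitening step) and recovers the components by tensor power iteration with deflation. Sizes, seeds, and the helper name power_iteration are illustrative assumptions.

```python
# Sketch: recover rank-one components of an orthogonally decomposable tensor.
import numpy as np

rng = np.random.default_rng(2)
d, k = 8, 3
U, _ = np.linalg.qr(rng.normal(size=(d, k)))      # orthonormal columns, stand-ins for (A_1)_j
lam = np.array([3.0, 2.0, 1.0])                   # weights lambda_j > 0

# Build T = sum_j lam_j * u_j ⊗ u_j ⊗ u_j
T = np.einsum('j,ij,kj,lj->ikl', lam, U, U, U)

def power_iteration(T, iters=100):
    """One run of tensor power iteration from a random start: v <- T(I, v, v), normalized."""
    v = rng.normal(size=T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = np.einsum('ijk,j,k->i', T, v, v)
        v /= np.linalg.norm(v)
    lam_hat = np.einsum('ijk,i,j,k->', T, v, v, v)
    return lam_hat, v

# Recover components one by one, deflating after each
T_work = T.copy()
for _ in range(k):
    lam_hat, v = power_iteration(T_work)
    j = np.argmax(np.abs(U.T @ v))                # closest true component (up to sign)
    print(f"lambda ≈ {lam_hat:.3f}, |<v, u_{j}>| ≈ {abs(U[:, j] @ v):.3f}")
    T_work -= lam_hat * np.einsum('i,j,k->ijk', v, v, v)
```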
