  1. MIT 9.520/6.860, Fall 2018. Statistical Learning Theory and Applications.
     Class 04: Features and Kernels. Lorenzo Rosasco.

  2. Linear functions
     Let $\mathcal{H}_{\mathrm{lin}}$ be the space of linear functions $f(x) = w^\top x$.
     - $f \leftrightarrow w$ is one to one,
     - inner product $\langle f, \bar{f} \rangle_{\mathcal{H}_{\mathrm{lin}}} := w^\top \bar{w}$,
     - norm/metric $\| f - \bar{f} \|_{\mathcal{H}_{\mathrm{lin}}} := \| w - \bar{w} \|$.

  3. An observation
     The function norm controls point-wise convergence. Since
     $| f(x) - \bar{f}(x) | \le \| x \| \, \| w - \bar{w} \|, \quad \forall x \in X,$
     then $w_j \to w \;\Rightarrow\; f_j(x) \to f(x)$, $\forall x \in X$.

  4. ERM
     $\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} (y_i - w^\top x_i)^2 + \lambda \| w \|^2, \quad \lambda \ge 0,$
     - $\lambda \to 0$: ordinary least squares (bias towards the minimal norm solution),
     - $\lambda > 0$: ridge regression (stable).

  5. Computations
     Let $X_n \in \mathbb{R}^{n \times d}$ and $\hat{Y} \in \mathbb{R}^n$. The ridge regression solution is
     $w_\lambda = (X_n^\top X_n + n\lambda I)^{-1} X_n^\top \hat{Y}$   [$O(nd^2 \vee d^3)$ time, $O(nd \vee d^2)$ memory],
     but also
     $w_\lambda = X_n^\top (X_n X_n^\top + n\lambda I)^{-1} \hat{Y}$   [$O(dn^2 \vee n^3)$ time, $O(nd \vee n^2)$ memory].
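A minimal NumPy sketch (not from the slides; names, sizes, and parameter values are illustrative) contrasting the two equivalent solves: the first scales with the dimension $d$, the second with the number of points $n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 10, 0.1
X = rng.standard_normal((n, d))   # data matrix X_n, one row per point
Y = rng.standard_normal(n)        # targets Y_hat

# Form 1: solve a d x d system, ~O(n d^2 + d^3) time
w1 = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y)

# Form 2: solve an n x n system, ~O(d n^2 + n^3) time
c = np.linalg.solve(X @ X.T + n * lam * np.eye(n), Y)
w2 = X.T @ c

print(np.allclose(w1, w2))        # True: both expressions give the same w_lambda
```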

  6. Representer theorem in disguise
     We noted that
     $w_\lambda = X_n^\top c = \sum_{i=1}^{n} x_i c_i \quad \Leftrightarrow \quad \hat{f}_\lambda(x) = \sum_{i=1}^{n} x^\top x_i \, c_i,$
     $c = (X_n X_n^\top + n\lambda I)^{-1} \hat{Y}, \qquad (X_n X_n^\top)_{ij} = x_i^\top x_j.$

  7. Limits of linear functions: Regression

  8. Limits of linear functions: Classification

  9. Nonlinear functions
     Two main possibilities:
     $f(x) = w^\top \Phi(x), \qquad f(x) = \Phi(w^\top x),$
     where $\Phi$ is a nonlinear map.
     - The former choice leads to linear spaces of functions (the spaces are linear, NOT the functions!).
     - The latter choice can be iterated: $f(x) = \Phi(w_L^\top \Phi(w_{L-1}^\top \dots \Phi(w_1^\top x)))$.

  10. Features and feature maps
     $f(x) = w^\top \Phi(x)$, where $\Phi : X \to \mathbb{R}^p$,
     $\Phi(x) = (\varphi_1(x), \dots, \varphi_p(x))^\top$ and $\varphi_j : X \to \mathbb{R}$, for $j = 1, \dots, p$.
     - $X$ need not be $\mathbb{R}^d$.
     - We can also write $f(x) = \sum_{j=1}^{p} w_j \varphi_j(x)$.
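For concreteness, a tiny sketch of one possible feature map (a degree-2 monomial map on $\mathbb{R}^2$; an assumed example, not one from the slides):

```python
import numpy as np

def phi(x):
    # Assumed example: degree-2 monomial features, Phi: R^2 -> R^6
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x2, x1**2, x2**2])

w = np.array([0.5, -1.0, 2.0, 0.3, 1.0, -0.7])   # illustrative weights
x = np.array([0.2, 0.8])
f_x = w @ phi(x)   # f(x) = w^T Phi(x): linear in w, nonlinear in x
print(f_x)
```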

  11. Geometric view
     $f(x) = w^\top \Phi(x)$

  12. An example

  13. More examples
     The equation $f(x) = w^\top \Phi(x) = \sum_{j=1}^{p} w_j \varphi_j(x)$ suggests thinking of features as some form of basis. Indeed we can consider
     - the Fourier basis,
     - wavelets and their variations,
     - ...

  14. And even more examples
     Any set of functions $\varphi_j : X \to \mathbb{R}$, $j = 1, \dots, p$, can be considered.
     Feature design/engineering:
     - vision: SIFT, HOG
     - audio: MFCC
     - ...

  15. Nonlinear functions using features
     Let $\mathcal{H}_\Phi$ be the space of linear functions $f(x) = w^\top \Phi(x)$.
     - $f \leftrightarrow w$ is one to one, if the $(\varphi_j)_j$ are linearly independent,
     - inner product $\langle f, \bar{f} \rangle_{\mathcal{H}_\Phi} := w^\top \bar{w}$,
     - norm/metric $\| f - \bar{f} \|_{\mathcal{H}_\Phi} := \| w - \bar{w} \|$.
     In this case $| f(x) - \bar{f}(x) | \le \| \Phi(x) \| \, \| w - \bar{w} \|$, $\forall x \in X$.

  16. Back to ERM
     $\min_{w \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^{n} (y_i - w^\top \Phi(x_i))^2 + \lambda \| w \|^2, \quad \lambda \ge 0.$
     Equivalent to
     $\min_{f \in \mathcal{H}_\Phi} \; \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \| f \|_{\mathcal{H}_\Phi}^2, \quad \lambda \ge 0.$

  17. Computations using features
     Let $\hat{\Phi} \in \mathbb{R}^{n \times p}$ with $(\hat{\Phi})_{ij} = \varphi_j(x_i)$. The ridge regression solution is
     $w_\lambda = (\hat{\Phi}^\top \hat{\Phi} + n\lambda I)^{-1} \hat{\Phi}^\top \hat{Y}$   [$O(np^2 \vee p^3)$ time, $O(np \vee p^2)$ memory],
     but also
     $w_\lambda = \hat{\Phi}^\top (\hat{\Phi} \hat{\Phi}^\top + n\lambda I)^{-1} \hat{Y}$   [$O(pn^2 \vee n^3)$ time, $O(np \vee n^2)$ memory].

  18. Representer theorem a little less in disguise
     Analogously to before,
     $w_\lambda = \hat{\Phi}^\top c = \sum_{i=1}^{n} \Phi(x_i) c_i \quad \Leftrightarrow \quad \hat{f}_\lambda(x) = \sum_{i=1}^{n} \Phi(x)^\top \Phi(x_i) \, c_i,$
     $c = (\hat{\Phi} \hat{\Phi}^\top + n\lambda I)^{-1} \hat{Y}, \qquad (\hat{\Phi} \hat{\Phi}^\top)_{ij} = \Phi(x_i)^\top \Phi(x_j),$
     $\Phi(x)^\top \Phi(\bar{x}) = \sum_{s=1}^{p} \varphi_s(x) \varphi_s(\bar{x}).$
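A small numerical check (a sketch, not from the slides; data and sizes are made up) that the two ways of computing $\hat{f}_\lambda$ agree: through the explicit weights $w_\lambda$, or through the coefficients $c$ and inner products of features only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 8, 0.1
Phi = rng.standard_normal((n, p))        # feature matrix, (Phi)_ij = phi_j(x_i)
Y = rng.standard_normal(n)
phi_x = rng.standard_normal(p)           # features Phi(x) of a new point x

# Through the weights w_lambda
w = np.linalg.solve(Phi.T @ Phi + n * lam * np.eye(p), Phi.T @ Y)
f1 = phi_x @ w

# Through the coefficients c: only inner products of features are needed
c = np.linalg.solve(Phi @ Phi.T + n * lam * np.eye(n), Y)
f2 = (phi_x @ Phi.T) @ c                 # sum_i Phi(x)^T Phi(x_i) c_i

print(np.isclose(f1, f2))                # True: same prediction
```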

  19. Unleash the features
     - Can we consider linearly dependent features?
     - Can we consider $p = \infty$?

  20. An observation
     For $X = \mathbb{R}$ consider
     $\varphi_j(x) = x^{j-1} e^{-x^2 \gamma} \sqrt{\frac{(2\gamma)^{j-1}}{(j-1)!}}, \quad j = 1, \dots, \infty$
     (so $\varphi_1(x) = e^{-x^2\gamma}$). Then
     $\sum_{j=1}^{\infty} \varphi_j(x) \varphi_j(\bar{x})
       = \sum_{j=1}^{\infty} \sqrt{\frac{(2\gamma)^{j-1}}{(j-1)!}} \, x^{j-1} e^{-x^2\gamma}
         \sqrt{\frac{(2\gamma)^{j-1}}{(j-1)!}} \, \bar{x}^{j-1} e^{-\bar{x}^2\gamma}$
     $= e^{-x^2\gamma} e^{-\bar{x}^2\gamma} \sum_{j=1}^{\infty} \frac{(2\gamma)^{j-1}}{(j-1)!} (x\bar{x})^{j-1}
       = e^{-x^2\gamma} e^{-\bar{x}^2\gamma} e^{2x\bar{x}\gamma}
       = e^{-|x - \bar{x}|^2 \gamma}.$
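A quick numerical check (a sketch, not part of the slides) that truncating the series above reproduces the Gaussian kernel $e^{-|x-\bar{x}|^2\gamma}$; the chosen values and the truncation length 40 are arbitrary.

```python
from math import exp, factorial

gamma, x, xbar = 0.7, 0.3, -1.2           # arbitrary values for the check

def phi(z, j, gamma):
    # phi_j(z) = z^(j-1) * exp(-gamma z^2) * sqrt((2 gamma)^(j-1) / (j-1)!)
    return z ** (j - 1) * exp(-gamma * z ** 2) * ((2 * gamma) ** (j - 1) / factorial(j - 1)) ** 0.5

series = sum(phi(x, j, gamma) * phi(xbar, j, gamma) for j in range(1, 40))
print(series, exp(-gamma * (x - xbar) ** 2))   # the two numbers agree to high precision
```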

  21. From features to kernels
     $\Phi(x)^\top \Phi(\bar{x}) = \sum_{j=1}^{\infty} \varphi_j(x) \varphi_j(\bar{x}) = k(x, \bar{x})$
     We might be able to compute the series in closed form. The function $k$ is called a kernel.
     Can we run ridge regression?

  22. Kernel ridge regression
     We have
     $\hat{f}_\lambda(x) = \sum_{i=1}^{n} \Phi(x)^\top \Phi(x_i) \, c_i = \sum_{i=1}^{n} k(x, x_i) c_i,$
     $c = (\hat{K} + n\lambda I)^{-1} \hat{Y}, \qquad (\hat{K})_{ij} = \Phi(x_i)^\top \Phi(x_j) = k(x_i, x_j).$
     $\hat{K}$ is the kernel matrix, the Gram (inner products) matrix of the data.
     "The kernel trick"
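A minimal kernel ridge regression sketch in NumPy, assuming a Gaussian kernel; the data, the helper `gaussian_gram`, and the parameter values are illustrative, not the course's reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam, gamma = 100, 3, 0.1, 0.5       # illustrative sizes and parameters
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)

def gaussian_gram(A, B, gamma):
    # Gram matrix K_ij = exp(-gamma * ||a_i - b_j||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

K = gaussian_gram(X, X, gamma)                    # kernel matrix K_hat
c = np.linalg.solve(K + n * lam * np.eye(n), Y)   # c = (K_hat + n*lam*I)^{-1} Y_hat

x_new = rng.standard_normal((1, d))
f_new = gaussian_gram(x_new, X, gamma) @ c        # f_lambda(x) = sum_i k(x, x_i) c_i
print(f_new)
```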

  23. Kernels
     - Can we start from kernels instead of features?
     - Which functions $k : X \times X \to \mathbb{R}$ define kernels we can use?

  24. Positive definite kernels
     A function $k : X \times X \to \mathbb{R}$ is called positive definite if:
     - the matrix $\hat{K}$ is positive semidefinite for every choice of points $x_1, \dots, x_n$, i.e.
       $a^\top \hat{K} a \ge 0, \quad \forall a \in \mathbb{R}^n;$
     - equivalently,
       $\sum_{i,j=1}^{n} k(x_i, x_j) a_i a_j \ge 0,$
       for any $a_1, \dots, a_n \in \mathbb{R}$, $x_1, \dots, x_n \in X$.
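A small sketch of the first bullet in code: build a kernel matrix on an arbitrary set of points and check that its smallest eigenvalue is (numerically) non-negative. A finite sample can only refute, never prove, positive definiteness; the Gaussian kernel here is an assumed example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2))          # arbitrary points x_1, ..., x_n

def gauss(x, y, gamma=1.0):
    # assumed Gaussian kernel for the check
    return np.exp(-gamma * np.sum((x - y) ** 2))

K = np.array([[gauss(xi, xj) for xj in X] for xi in X])
eigs = np.linalg.eigvalsh(K)              # K is symmetric, so eigvalsh applies
print(eigs.min() >= -1e-10)               # True: a^T K a >= 0 for all a (up to round-off)
```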

  25. Inner product kernels are pos. def.
     Assume $\Phi : X \to \mathbb{R}^p$, $p \le \infty$, and
     $k(x, \bar{x}) = \Phi(x)^\top \Phi(\bar{x}).$
     Note that
     $\sum_{i,j=1}^{n} k(x_i, x_j) a_i a_j = \sum_{i,j=1}^{n} \Phi(x_i)^\top \Phi(x_j) a_i a_j = \Big\| \sum_{i=1}^{n} \Phi(x_i) a_i \Big\|^2 \ge 0.$
     Clearly $k$ is symmetric.

  26. But there are many pos. def. kernels
     Classic examples:
     - linear: $k(x, \bar{x}) = x^\top \bar{x}$
     - polynomial: $k(x, \bar{x}) = (x^\top \bar{x} + 1)^s$
     - Gaussian: $k(x, \bar{x}) = e^{-\| x - \bar{x} \|^2 \gamma}$
     But one can also consider:
     - kernels on probability distributions
     - kernels on strings
     - kernels on functions
     - kernels on groups
     - kernels on graphs
     - ...
     It is natural to think of a kernel as a measure of similarity.

  27. From pos. def. kernels to functions
     Let $X$ be any set. Given a pos. def. kernel $k$:
     - consider the space $\mathcal{H}_k$ of functions
       $f(x) = \sum_{i=1}^{N} k(x, x_i) a_i$
       for any $a_1, \dots, a_N \in \mathbb{R}$, $x_1, \dots, x_N \in X$ and any $N \in \mathbb{N}$;
     - define an inner product on $\mathcal{H}_k$:
       $\langle f, \bar{f} \rangle_{\mathcal{H}_k} = \sum_{i=1}^{N} \sum_{j=1}^{\bar{N}} k(x_i, \bar{x}_j) \, a_i \bar{a}_j;$
     - $\mathcal{H}_k$ can be completed to a Hilbert space.
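A sketch (with an assumed Gaussian kernel and made-up centers and coefficients) of how an element of $\mathcal{H}_k$ can be stored, and how the inner product above is computed from kernel evaluations alone.

```python
import numpy as np

def k(x, y, gamma=1.0):
    # assumed Gaussian kernel, for illustration only
    return np.exp(-gamma * np.sum((x - y) ** 2))

# f = sum_i a_i k(., x_i) and g = sum_j b_j k(., z_j): store centers and coefficients
x_centers, a = np.array([[0.0], [1.0], [2.0]]), np.array([1.0, -0.5, 0.2])
z_centers, b = np.array([[0.5], [1.5]]), np.array([0.3, 0.7])

def evaluate(t, centers, coeffs):
    # point-wise evaluation f(t) = sum_i coeffs_i k(t, centers_i)
    return sum(ci * k(t, xi) for xi, ci in zip(centers, coeffs))

# <f, g>_{H_k} = sum_{i,j} a_i b_j k(x_i, z_j);  ||f||_{H_k}^2 = <f, f>_{H_k}
inner_fg = sum(ai * bj * k(xi, zj)
               for xi, ai in zip(x_centers, a)
               for zj, bj in zip(z_centers, b))
norm_f = np.sqrt(sum(ai * aj * k(xi, xj)
                     for xi, ai in zip(x_centers, a)
                     for xj, aj in zip(x_centers, a)))
print(evaluate(np.array([0.7]), x_centers, a), inner_fg, norm_f)
```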

  28. A key result
     Theorem. Given a pos. def. $k$ there exists $\Phi$ such that
     $k(x, \bar{x}) = \langle \Phi(x), \Phi(\bar{x}) \rangle_{\mathcal{H}_k}$ and $\mathcal{H}_\Phi \simeq \mathcal{H}_k$.
     Roughly speaking,
     $f(x) = w^\top \Phi(x) \quad \simeq \quad f(x) = \sum_{i=1}^{N} k(x, x_i) a_i.$

  29. An illustration
     [Figure: functions defined by Gaussian kernels with large and small widths.]

  30. From features and kernels to RKHS and beyond
     $\mathcal{H}_k$ and $\mathcal{H}_\Phi$ have many properties, characterizations, and connections:
     - reproducing property
     - reproducing kernel Hilbert spaces (RKHS)
     - Mercer theorem (Karhunen-Loève expansion)
     - Gaussian processes
     - Cameron-Martin spaces

  31. Reproducing property
     Note that by definition of $\mathcal{H}_k$:
     - $k_x = k(x, \cdot)$ is a function in $\mathcal{H}_k$,
     - for all $f \in \mathcal{H}_k$, $x \in X$,
       $f(x) = \langle f, k_x \rangle_{\mathcal{H}_k},$
       called the reproducing property,
     - note that
       $| f(x) - \bar{f}(x) | \le \| k_x \|_{\mathcal{H}_k} \, \| f - \bar{f} \|_{\mathcal{H}_k}, \quad \forall x \in X.$
     The above observations have a converse.
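The bound in the last bullet is just Cauchy-Schwarz together with the reproducing property applied to $k_x$ itself; a one-line derivation for reference (not on the slide):

```latex
\[
  |f(x)-\bar f(x)|
  = \bigl|\langle f-\bar f,\; k_x\rangle_{\mathcal H_k}\bigr|
  \le \|k_x\|_{\mathcal H_k}\,\|f-\bar f\|_{\mathcal H_k},
  \qquad
  \|k_x\|_{\mathcal H_k}^{2} = \langle k_x, k_x\rangle_{\mathcal H_k} = k(x,x).
\]
```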

  32. RKHS
     Definition. An RKHS $\mathcal{H}$ is a Hilbert space with a function $k : X \times X \to \mathbb{R}$ s.t.
     - $k_x = k(x, \cdot) \in \mathcal{H}$,
     - and $f(x) = \langle f, k_x \rangle_{\mathcal{H}}$.
     Theorem. If $\mathcal{H}$ is an RKHS then $k$ is pos. def.

  33. Evaluation functionals in a RKHS
     If $\mathcal{H}$ is an RKHS then the evaluation functionals $e_x(f) = f(x)$ are continuous, i.e.
     $| e_x(f) - e_x(\bar{f}) | \lesssim \| f - \bar{f} \|_{\mathcal{H}}, \quad \forall x \in X,$
     since $e_x(f) = \langle f, k_x \rangle_{\mathcal{H}}$.
     Note that $L^2(\mathbb{R}^d)$ or $C(\mathbb{R}^d)$ don't have this property!

  34. Alternative RKHS definition
     It turns out the previous property also characterizes an RKHS.
     Theorem. A Hilbert space with continuous evaluation functionals is an RKHS.

  35. Summing up
     - From linear to nonlinear functions
       - using features
       - using kernels
     plus
     - pos. def. functions
     - reproducing property
     - RKHS
