Less is More: Nyström Computational Regularization


  1. Less is More: Nyström Computational Regularization. Alessandro Rudi, Raffaello Camoriano, Lorenzo Rosasco. University of Genova - Istituto Italiano di Tecnologia, Massachusetts Institute of Technology. ale rudi@mit.edu. Dec 10th, NIPS 2015

  2. A Starting Point. Classically: statistics and optimization are distinct steps in algorithm design.

  3. A Starting Point. Classically: statistics and optimization are distinct steps in algorithm design. Large scale: consider the interplay between statistics and optimization! (Bottou, Bousquet '08)

  4. Supervised Learning. Problem: estimate $f^*$.

  5. Supervised Learning. Problem: estimate $f^*$ given $S_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$. [Figure: sample points $(x_1, y_1), \ldots, (x_5, y_5)$ scattered around the curve $f^*$.]

  6. Supervised Learning. Problem: estimate $f^*$ given $S_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$. The setting: $y_i = f^*(x_i) + \varepsilon_i$, $i \in \{1, \ldots, n\}$ ◮ $\varepsilon_i \in \mathbb{R}$, $x_i \in \mathbb{R}^d$ random (with unknown distribution) ◮ $f^*$ unknown. [Figure: sample points $(x_i, y_i)$ scattered around the curve $f^*$.]
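The snippet below is a minimal sketch (not from the slides) of this data-generating setting: random inputs $x_i \in \mathbb{R}^d$ and labels $y_i = f^*(x_i) + \varepsilon_i$ with Gaussian noise. The specific $f^*$, dimensions, and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3                                # sample size and input dimension (illustrative)
X = rng.uniform(-1.0, 1.0, size=(n, d))      # inputs x_i in R^d, drawn at random

def f_star(X):
    # a stand-in for the unknown target function f*
    return np.sin(3.0 * X[:, 0]) + X[:, 1] * X[:, 2]

eps = 0.1 * rng.standard_normal(n)           # noise eps_i in R
y = f_star(X) + eps                          # labels y_i = f*(x_i) + eps_i
```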

  7. Outline: Learning with kernels; Data-dependent subsampling.

  8. Non-linear/non-parametric learning. $\hat f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i)$

  9. Non-linear/non-parametric learning. $\hat f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i)$ ◮ $q$ a non-linear function

  10. Non-linear/non-parametric learning. $\hat f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i)$ ◮ $q$ a non-linear function ◮ $w_i \in \mathbb{R}^d$ centers

  11. Non-linear/non-parametric learning. $\hat f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i)$ ◮ $q$ a non-linear function ◮ $w_i \in \mathbb{R}^d$ centers ◮ $c_i \in \mathbb{R}$ coefficients

  12. Non-linear/non-parametric learning. $\hat f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i)$ ◮ $q$ a non-linear function ◮ $w_i \in \mathbb{R}^d$ centers ◮ $c_i \in \mathbb{R}$ coefficients ◮ $M = M_n$ could/should grow with $n$

  13. Non-linear/non-parametric learning. $\hat f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i)$ ◮ $q$ a non-linear function ◮ $w_i \in \mathbb{R}^d$ centers ◮ $c_i \in \mathbb{R}$ coefficients ◮ $M = M_n$ could/should grow with $n$. Question: how to choose $w_i$, $c_i$ and $M$ given $S_n$?

  14. Learning with Positive Definite Kernels. There is an elegant answer if: ◮ $q$ is symmetric ◮ all the matrices $\hat Q_{ij} = q(x_i, x_j)$ are positive semi-definite¹. (¹ They have non-negative eigenvalues.)

  15. Learning with Positive Definite Kernels. There is an elegant answer if: ◮ $q$ is symmetric ◮ all the matrices $\hat Q_{ij} = q(x_i, x_j)$ are positive semi-definite¹. Representer Theorem (Kimeldorf, Wahba '70; Schölkopf et al. '01): ◮ $M = n$ ◮ $w_i = x_i$ ◮ $c_i$ by convex optimization! (¹ They have non-negative eigenvalues.)

  16. Kernel Ridge Regression (KRR), a.k.a. Penalized Least Squares: $\hat f_\lambda = \operatorname{argmin}_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \|f\|^2$

  17. Kernel Ridge Regression (KRR), a.k.a. Penalized Least Squares: $\hat f_\lambda = \operatorname{argmin}_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \|f\|^2$, where $\mathcal{H} = \{ f \mid f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i),\ c_i \in \mathbb{R},\ w_i \in \mathbb{R}^d \text{ (any center!)},\ M \in \mathbb{N} \text{ (any length!)} \}$

  18. Kernel Ridge Regression (KRR), a.k.a. Penalized Least Squares: $\hat f_\lambda = \operatorname{argmin}_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \|f\|^2$, where $\mathcal{H} = \{ f \mid f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i),\ c_i \in \mathbb{R},\ w_i \in \mathbb{R}^d \text{ (any center!)},\ M \in \mathbb{N} \text{ (any length!)} \}$. Solution: $\hat f_\lambda = \sum_{i=1}^{n} c_i\, q(x, x_i)$ with $\hat c = (\hat Q + \lambda n I)^{-1} \hat y$
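As a concrete illustration, here is a minimal sketch of the KRR solution above, assuming a Gaussian kernel $q(x, x') = \exp(-\|x - x'\|^2 / 2\sigma^2)$; the kernel choice, function names, and bandwidth are assumptions, not from the slides.

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_kernel(A, B, sigma=1.0):
    # q(x, x') = exp(-||x - x'||^2 / (2 sigma^2)); an assumed kernel choice
    return np.exp(-cdist(A, B, "sqeuclidean") / (2.0 * sigma ** 2))

def krr_fit(X, y, lam, sigma=1.0):
    # Solve (Q_hat + lambda * n * I) c = y, as on the slide
    n = X.shape[0]
    Q = gaussian_kernel(X, X, sigma)                    # O(n^2) memory
    return np.linalg.solve(Q + lam * n * np.eye(n), y)  # O(n^3) time

def krr_predict(X_train, c, X_test, sigma=1.0):
    # f_hat_lambda(x) = sum_i c_i q(x, x_i)
    return gaussian_kernel(X_test, X_train, sigma) @ c
```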

  19. KRR: Statistics

  20. KRR: Statistics. Well understood statistical properties. Classical Theorem: if $f^* \in \mathcal{H}$, then $\mathbb{E}(\hat f_{\lambda^*}(x) - f^*(x))^2 \lesssim \frac{1}{\sqrt{n}}$, with $\lambda^* = \frac{1}{\sqrt{n}}$.

  21. KRR: Statistics. Classical Theorem: if $f^* \in \mathcal{H}$, then $\mathbb{E}(\hat f_{\lambda^*}(x) - f^*(x))^2 \lesssim \frac{1}{\sqrt{n}}$, with $\lambda^* = \frac{1}{\sqrt{n}}$. Remarks:

  22. KRR: Statistics. Classical Theorem: if $f^* \in \mathcal{H}$, then $\mathbb{E}(\hat f_{\lambda^*}(x) - f^*(x))^2 \lesssim \frac{1}{\sqrt{n}}$, with $\lambda^* = \frac{1}{\sqrt{n}}$. Remarks: 1. Optimal nonparametric bound.

  23. KRR: Statistics. Classical Theorem: if $f^* \in \mathcal{H}$, then $\mathbb{E}(\hat f_{\lambda^*}(x) - f^*(x))^2 \lesssim \frac{1}{\sqrt{n}}$, with $\lambda^* = \frac{1}{\sqrt{n}}$. Remarks: 1. Optimal nonparametric bound. 2. Results for general kernels (e.g. splines/Sobolev etc.): $\mathbb{E}(\hat f_{\lambda^*}(x) - f^*(x))^2 \lesssim n^{-\frac{2s}{2s+1}}$, with $\lambda^* = n^{-\frac{1}{2s+1}}$.

  24. KRR: Statistics. Classical Theorem: if $f^* \in \mathcal{H}$, then $\mathbb{E}(\hat f_{\lambda^*}(x) - f^*(x))^2 \lesssim \frac{1}{\sqrt{n}}$, with $\lambda^* = \frac{1}{\sqrt{n}}$. Remarks: 1. Optimal nonparametric bound. 2. Results for general kernels (e.g. splines/Sobolev etc.): $\mathbb{E}(\hat f_{\lambda^*}(x) - f^*(x))^2 \lesssim n^{-\frac{2s}{2s+1}}$, with $\lambda^* = n^{-\frac{1}{2s+1}}$. 3. Adaptive tuning via cross validation.
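A minimal sketch of remark 3, using a simple hold-out split rather than full cross validation; it reuses the hypothetical krr_fit / krr_predict helpers from the earlier sketch, and the grid of candidate values is an assumption.

```python
import numpy as np

def select_lambda(X, y, lambdas, sigma=1.0, val_frac=0.2, seed=0):
    # Hold out a validation split, fit KRR for each lambda, keep the best one.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_val = int(val_frac * len(y))
    val, tr = idx[:n_val], idx[n_val:]
    errors = []
    for lam in lambdas:
        c = krr_fit(X[tr], y[tr], lam, sigma)
        pred = krr_predict(X[tr], c, X[val], sigma)
        errors.append(np.mean((pred - y[val]) ** 2))
    return lambdas[int(np.argmin(errors))]

# e.g. lam = select_lambda(X, y, np.logspace(-6, 0, 10))
```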

  25. KRR: Optimization. $\hat f_\lambda = \sum_{i=1}^{n} c_i\, q(x, x_i)$ with $\hat c = (\hat Q + \lambda n I)^{-1} \hat y$, i.e. the linear system $\hat Q\, \hat c = \hat y$ (regularized). Complexity: ◮ Space $O(n^2)$ ◮ Time $O(n^3)$

  26. KRR: Optimization. $\hat f_\lambda = \sum_{i=1}^{n} c_i\, q(x, x_i)$ with $\hat c = (\hat Q + \lambda n I)^{-1} \hat y$, i.e. the linear system $\hat Q\, \hat c = \hat y$ (regularized). Complexity: ◮ Space $O(n^2)$ ◮ Time $O(n^3)$. BIG DATA? Running out of space before running out of time... Can this be fixed?

  27. Outline: Learning with kernels; Data-dependent subsampling.

  28. Subsampling. 1. Pick $w_i$ at random...

  29. Subsampling. 1. Pick $w_i$ at random... from the training set (Smola, Schölkopf '00): $\tilde w_1, \ldots, \tilde w_M \subset \{x_1, \ldots, x_n\}$, $M \ll n$.

  30. Subsampling. 1. Pick $w_i$ at random... from the training set (Smola, Schölkopf '00): $\tilde w_1, \ldots, \tilde w_M \subset \{x_1, \ldots, x_n\}$, $M \ll n$. 2. Perform KRR on $\mathcal{H}_M = \{ f \mid f(x) = \sum_{i=1}^{M} c_i\, q(x, \tilde w_i),\ c_i \in \mathbb{R} \}$ (the centers and the size $M$ are now fixed, no longer free).

  31. Subsampling. 1. Pick $w_i$ at random... from the training set (Smola, Schölkopf '00): $\tilde w_1, \ldots, \tilde w_M \subset \{x_1, \ldots, x_n\}$, $M \ll n$. 2. Perform KRR on $\mathcal{H}_M = \{ f \mid f(x) = \sum_{i=1}^{M} c_i\, q(x, \tilde w_i),\ c_i \in \mathbb{R} \}$. Linear system: $\hat Q_M\, \hat c = \hat y$. Complexity: ◮ Space $O(n^2) \to O(nM)$ ◮ Time $O(n^3) \to O(nM^2)$

  32. Subsampling. 1. Pick $w_i$ at random... from the training set (Smola, Schölkopf '00): $\tilde w_1, \ldots, \tilde w_M \subset \{x_1, \ldots, x_n\}$, $M \ll n$. 2. Perform KRR on $\mathcal{H}_M = \{ f \mid f(x) = \sum_{i=1}^{M} c_i\, q(x, \tilde w_i),\ c_i \in \mathbb{R} \}$. Linear system: $\hat Q_M\, \hat c = \hat y$. Complexity: ◮ Space $O(n^2) \to O(nM)$ ◮ Time $O(n^3) \to O(nM^2)$. What about statistics? What's the price for efficient computations?
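A minimal sketch of the subsampled (Nyström) KRR step, reusing the gaussian_kernel helper from the earlier sketch. The normal-equation formulation below, minimizing $(1/n)\|\hat Q_{nM} c - y\|^2 + \lambda\, c^\top \hat Q_{MM} c$, is one standard way to write the reduced problem; the function names and the lstsq solve are my choices, not from the slides.

```python
import numpy as np

def nystrom_krr_fit(X, y, M, lam, sigma=1.0, seed=0):
    # 1. Pick M centers uniformly at random from the training points
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = X[rng.choice(n, size=M, replace=False)]
    # 2. Penalized least squares on H_M:
    #    minimize (1/n) ||Q_nM c - y||^2 + lam * c^T Q_MM c
    Q_nM = gaussian_kernel(X, centers, sigma)           # n x M block, O(nM) memory
    Q_MM = gaussian_kernel(centers, centers, sigma)     # M x M block
    A = Q_nM.T @ Q_nM + lam * n * Q_MM                  # M x M system, O(nM^2) time
    c = np.linalg.lstsq(A, Q_nM.T @ y, rcond=None)[0]
    return centers, c

def nystrom_krr_predict(centers, c, X_test, sigma=1.0):
    # f_hat(x) = sum_i c_i q(x, w_tilde_i)
    return gaussian_kernel(X_test, centers, sigma) @ c
```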

  33. Putting our Result in Context. ◮ *Many* different subsampling schemes (Smola, Schölkopf '00; Williams, Seeger '01; ... 20+)

  34. Putting our Result in Context. ◮ *Many* different subsampling schemes (Smola, Schölkopf '00; Williams, Seeger '01; ... 20+) ◮ Theoretical guarantees mainly on matrix approximation (Mahoney and Drineas '09; Cortes et al. '10; Kumar et al. '12; ... 10+): $\|\hat Q - \hat Q_M\| \lesssim \frac{1}{\sqrt{M}}$

  35. Putting our Result in Context. ◮ *Many* different subsampling schemes (Smola, Schölkopf '00; Williams, Seeger '01; ... 20+) ◮ Theoretical guarantees mainly on matrix approximation (Mahoney and Drineas '09; Cortes et al. '10; Kumar et al. '12; ... 10+): $\|\hat Q - \hat Q_M\| \lesssim \frac{1}{\sqrt{M}}$ ◮ Few prediction guarantees, either suboptimal or in restricted settings (Cortes et al. '10; Jin et al. '11; Bach '13; Alaoui, Mahoney '14)
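For illustration of the matrix-approximation viewpoint mentioned above, here is a minimal sketch that measures $\|\hat Q - \hat Q_M\|$ for the standard Nyström approximation $\hat Q_M = \hat Q_{nM}\, \hat Q_{MM}^{+}\, \hat Q_{nM}^\top$ (my formulation, reusing the assumed gaussian_kernel helper); it is only feasible for moderate $n$, since it builds the full $n \times n$ matrix for the comparison.

```python
import numpy as np

def nystrom_approx_error(X, M, sigma=1.0, seed=0):
    # Spectral-norm error ||Q_hat - Q_hat_M|| of the Nystrom approximation
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(X.shape[0], size=M, replace=False)]
    Q = gaussian_kernel(X, X, sigma)
    Q_nM = gaussian_kernel(X, centers, sigma)
    Q_MM = gaussian_kernel(centers, centers, sigma)
    Q_approx = Q_nM @ np.linalg.pinv(Q_MM) @ Q_nM.T
    return np.linalg.norm(Q - Q_approx, 2)   # expected to decay roughly as 1/sqrt(M)
```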

  36. Main Result. Theorem: if $f^* \in \mathcal{H}$, then $\mathbb{E}(\hat f_{\lambda^*, M^*}(x) - f^*(x))^2 \lesssim \frac{1}{\sqrt{n}}$, with $\lambda^* = \frac{1}{\sqrt{n}}$ and $M^* = \frac{1}{\lambda^*}$.

  37. Main Result. Theorem: if $f^* \in \mathcal{H}$, then $\mathbb{E}(\hat f_{\lambda^*, M^*}(x) - f^*(x))^2 \lesssim \frac{1}{\sqrt{n}}$, with $\lambda^* = \frac{1}{\sqrt{n}}$ and $M^* = \frac{1}{\lambda^*}$. Remarks:

  38. Main Result. Theorem: if $f^* \in \mathcal{H}$, then $\mathbb{E}(\hat f_{\lambda^*, M^*}(x) - f^*(x))^2 \lesssim \frac{1}{\sqrt{n}}$, with $\lambda^* = \frac{1}{\sqrt{n}}$ and $M^* = \frac{1}{\lambda^*}$. Remarks: 1. Subsampling achieves the optimal bound...

  39. Main Result. Theorem: if $f^* \in \mathcal{H}$, then $\mathbb{E}(\hat f_{\lambda^*, M^*}(x) - f^*(x))^2 \lesssim \frac{1}{\sqrt{n}}$, with $\lambda^* = \frac{1}{\sqrt{n}}$ and $M^* = \frac{1}{\lambda^*}$. Remarks: 1. Subsampling achieves the optimal bound... 2. ...with $M^* \sim \sqrt{n}$!!

  40. Main Result. Theorem: if $f^* \in \mathcal{H}$, then $\mathbb{E}(\hat f_{\lambda^*, M^*}(x) - f^*(x))^2 \lesssim \frac{1}{\sqrt{n}}$, with $\lambda^* = \frac{1}{\sqrt{n}}$ and $M^* = \frac{1}{\lambda^*}$. Remarks: 1. Subsampling achieves the optimal bound... 2. ...with $M^* \sim \sqrt{n}$!! 3. More generally: $\mathbb{E}_x(\hat f_{\lambda^*, M^*}(x) - f^*(x))^2 \lesssim n^{-\frac{2s}{2s+1}}$, with $\lambda^* = n^{-\frac{1}{2s+1}}$ and $M^* = \frac{1}{\lambda^*}$.

  41. Main Result. Theorem: if $f^* \in \mathcal{H}$, then $\mathbb{E}(\hat f_{\lambda^*, M^*}(x) - f^*(x))^2 \lesssim \frac{1}{\sqrt{n}}$, with $\lambda^* = \frac{1}{\sqrt{n}}$ and $M^* = \frac{1}{\lambda^*}$. Remarks: 1. Subsampling achieves the optimal bound... 2. ...with $M^* \sim \sqrt{n}$!! 3. More generally: $\mathbb{E}_x(\hat f_{\lambda^*, M^*}(x) - f^*(x))^2 \lesssim n^{-\frac{2s}{2s+1}}$, with $\lambda^* = n^{-\frac{1}{2s+1}}$ and $M^* = \frac{1}{\lambda^*}$. Note: an interesting insight is obtained by rewriting the result...
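A minimal usage sketch of the rule suggested by the theorem, taking $\lambda \sim 1/\sqrt{n}$ and $M \sim \sqrt{n}$ centers (constants and rounding are my choices); it reuses the data and the hypothetical Nyström helpers from the earlier sketches.

```python
import numpy as np

n = X.shape[0]
lam_star = 1.0 / np.sqrt(n)            # lambda* = 1/sqrt(n)
M_star = int(np.ceil(np.sqrt(n)))      # M* ~ sqrt(n) = 1/lambda*

centers, c = nystrom_krr_fit(X, y, M_star, lam_star, sigma=1.0)
y_pred = nystrom_krr_predict(centers, c, X, sigma=1.0)
print("training MSE:", np.mean((y_pred - y) ** 2))
```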

  42. Computational Regularization (CoRe). A simple idea: "swap" the roles of $\lambda$ and $M$...
