Less is More: Computational Regularization by Subsampling


  1. Less is More: Computational Regularization by Subsampling. Lorenzo Rosasco, University of Genova - Istituto Italiano di Tecnologia - Massachusetts Institute of Technology, lcsl.mit.edu. Joint work with Alessandro Rudi and Raffaello Camoriano. Paris.

  2. A Starting Point. Classically: statistics and optimization are distinct steps in algorithm design. Empirical process theory + optimization.

  3. A Starting Point. Classically: statistics and optimization are distinct steps in algorithm design. Empirical process theory + optimization. Large scale: consider the interplay between statistics and optimization! (Bottou, Bousquet '08)

  4. A Starting Point. Classically: statistics and optimization are distinct steps in algorithm design. Empirical process theory + optimization. Large scale: consider the interplay between statistics and optimization! (Bottou, Bousquet '08) Computational Regularization: computation “tricks” = regularization.

  5. Supervised Learning. Problem: estimate f∗.

  6. Supervised Learning. Problem: estimate f∗ given S_n = {(x_1, y_1), ..., (x_n, y_n)}. [Figure: sample points (x_1, y_1), ..., (x_5, y_5) around the unknown function f∗.]

  7. Supervised Learning. Problem: estimate f∗ given S_n = {(x_1, y_1), ..., (x_n, y_n)}. [Figure: sample points (x_1, y_1), ..., (x_5, y_5) around the unknown function f∗.] The Setting: y_i = f∗(x_i) + ε_i, i ∈ {1, ..., n}, where ◮ ε_i ∈ R, x_i ∈ R^d are random (bounded but with unknown distribution) ◮ f∗ is unknown.

  8. Outline: Nonparametric Learning; Data-Dependent Subsampling; Data-Independent Subsampling.

  9. Non-linear/non-parametric learning: f(x) = Σ_{i=1}^M c_i q(x, w_i).

  10. Non-linear/non-parametric learning: f(x) = Σ_{i=1}^M c_i q(x, w_i). ◮ q non-linear function

  11. Non-linear/non-parametric learning: f(x) = Σ_{i=1}^M c_i q(x, w_i). ◮ q non-linear function ◮ w_i ∈ R^d centers

  12. Non-linear/non-parametric learning: f(x) = Σ_{i=1}^M c_i q(x, w_i). ◮ q non-linear function ◮ w_i ∈ R^d centers ◮ c_i ∈ R coefficients

  13. Non-linear/non-parametric learning: f(x) = Σ_{i=1}^M c_i q(x, w_i). ◮ q non-linear function ◮ w_i ∈ R^d centers ◮ c_i ∈ R coefficients ◮ M = M_n could/should grow with n

  14. Non-linear/non-parametric learning: f(x) = Σ_{i=1}^M c_i q(x, w_i). ◮ q non-linear function ◮ w_i ∈ R^d centers ◮ c_i ∈ R coefficients ◮ M = M_n could/should grow with n. Question: how to choose w_i, c_i and M given S_n?
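
To make the model concrete, here is a minimal numpy sketch of evaluating f(x) = Σ_{i=1}^M c_i q(x, w_i). The Gaussian choice of q, and the specific centers and coefficients, are placeholders for illustration only; choosing them from data is exactly the question above.

```python
import numpy as np

def gaussian_q(x, w, sigma=1.0):
    # One common (illustrative) choice for the non-linear function q: a Gaussian centered at w.
    return np.exp(-np.linalg.norm(x - w) ** 2 / (2 * sigma ** 2))

def f(x, centers, coeffs, sigma=1.0):
    # Evaluate f(x) = sum_{i=1}^M c_i q(x, w_i).
    return sum(c * gaussian_q(x, w, sigma) for c, w in zip(coeffs, centers))

# Toy example with M = 3 centers in R^2 and arbitrary coefficients.
centers = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([-1.0, 0.5])]
coeffs = [0.5, -1.2, 2.0]
print(f(np.array([0.3, 0.4]), centers, coeffs))
```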

  15. Learning with Positive Definite Kernels. There is an elegant answer if: ◮ q is symmetric ◮ all the matrices Q̂, with Q̂_ij = q(x_i, x_j), are positive semi-definite [1]. [1] They have non-negative eigenvalues.

  16. Learning with Positive Definite Kernels. There is an elegant answer if: ◮ q is symmetric ◮ all the matrices Q̂, with Q̂_ij = q(x_i, x_j), are positive semi-definite [1]. Representer Theorem (Kimeldorf, Wahba '70; Schölkopf et al. '01): ◮ M = n ◮ w_i = x_i ◮ c_i by convex optimization! [1] They have non-negative eigenvalues.

  17. Kernel Ridge Regression (KRR), a.k.a. Tikhonov Regularization: f̂_λ = argmin_{f ∈ H} (1/n) Σ_{i=1}^n (y_i − f(x_i))² + λ ‖f‖², where [2] H = { f | f(x) = Σ_{i=1}^M c_i q(x, w_i), c_i ∈ R, w_i ∈ R^d (any center!), M ∈ N (any length!) }. [2] The norm is induced by the inner product ⟨f, f′⟩ = Σ_{i,j} c_i c′_j q(x_i, x_j).

  18. Kernel Ridge Regression (KRR), a.k.a. Tikhonov Regularization: f̂_λ = argmin_{f ∈ H} (1/n) Σ_{i=1}^n (y_i − f(x_i))² + λ ‖f‖², where [2] H = { f | f(x) = Σ_{i=1}^M c_i q(x, w_i), c_i ∈ R, w_i ∈ R^d (any center!), M ∈ N (any length!) }. Solution: f̂_λ(x) = Σ_{i=1}^n c_i q(x, x_i) with c = (Q̂ + λnI)^{-1} ŷ. [2] The norm is induced by the inner product ⟨f, f′⟩ = Σ_{i,j} c_i c′_j q(x_i, x_j).
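
A minimal numpy sketch of the closed-form solution above, assuming a Gaussian kernel for q (any positive definite kernel would do); the function and variable names are illustrative, not the authors' code.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Kernel matrix with entries q(a_i, b_j); the Gaussian kernel is one positive definite choice of q.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def krr_fit(X, y, lam, sigma=1.0):
    # Coefficients c = (Q + lam * n * I)^{-1} y, obtained by solving a linear system.
    n = X.shape[0]
    Q = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(Q + lam * n * np.eye(n), y)

def krr_predict(X_train, c, X_test, sigma=1.0):
    # f_lambda(x) = sum_i c_i q(x, x_i) evaluated at the test points.
    return gaussian_kernel(X_test, X_train, sigma) @ c

# Toy usage on synthetic 1-d data.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)
c = krr_fit(X, y, lam=1e-3)
print(krr_predict(X, c, X[:5]))
```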

  19. KRR: Statistics

  20. KRR: Statistics. Well understood statistical properties. Classical Theorem: if f∗ ∈ H, then E(f̂_{λ∗}(x) − f∗(x))² ≲ 1/√n, with λ∗ = 1/√n.

  21. KRR: Statistics. Well understood statistical properties. Classical Theorem: if f∗ ∈ H, then E(f̂_{λ∗}(x) − f∗(x))² ≲ 1/√n, with λ∗ = 1/√n. Remarks:

  22. KRR: Statistics. Well understood statistical properties. Classical Theorem: if f∗ ∈ H, then E(f̂_{λ∗}(x) − f∗(x))² ≲ 1/√n, with λ∗ = 1/√n. Remarks: 1. Optimal nonparametric bound.

  23. KRR: Statistics. Well understood statistical properties. Classical Theorem: if f∗ ∈ H, then E(f̂_{λ∗}(x) − f∗(x))² ≲ 1/√n, with λ∗ = 1/√n. Remarks: 1. Optimal nonparametric bound. 2. More refined results for smooth kernels: E(f̂_{λ∗}(x) − f∗(x))² ≲ n^{-2s/(2s+1)}, with λ∗ = n^{-1/(2s+1)}.

  24. KRR: Statistics. Well understood statistical properties. Classical Theorem: if f∗ ∈ H, then E(f̂_{λ∗}(x) − f∗(x))² ≲ 1/√n, with λ∗ = 1/√n. Remarks: 1. Optimal nonparametric bound. 2. More refined results for smooth kernels: E(f̂_{λ∗}(x) − f∗(x))² ≲ n^{-2s/(2s+1)}, with λ∗ = n^{-1/(2s+1)}. 3. Adaptive tuning, e.g. via cross validation. 4. Proofs: inverse problems results + random matrices (Smale and Zhou; Caponnetto, De Vito, R.).
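
For remark 3, adaptive tuning of λ could look like the following rough sketch: K-fold cross validation over a logarithmic grid, again with a Gaussian kernel assumed purely for illustration.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))

def krr_cv_error(X, y, lam, n_folds=5, sigma=1.0):
    # K-fold cross-validation estimate of the prediction error of KRR with regularization lam.
    idx = np.arange(X.shape[0])
    errors = []
    for fold in np.array_split(idx, n_folds):
        tr = np.setdiff1d(idx, fold)
        Q = gaussian_kernel(X[tr], X[tr], sigma)
        c = np.linalg.solve(Q + lam * len(tr) * np.eye(len(tr)), y[tr])
        preds = gaussian_kernel(X[fold], X[tr], sigma) @ c
        errors.append(np.mean((preds - y[fold]) ** 2))
    return np.mean(errors)

# Select lambda on a logarithmic grid (toy data).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)
lambdas = np.logspace(-6, 0, 7)
best_lam = min(lambdas, key=lambda lam: krr_cv_error(X, y, lam))
print("selected lambda:", best_lam)
```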

  25. KRR: Optimization. f̂_λ(x) = Σ_{i=1}^n c_i q(x, x_i) with c = (Q̂ + λnI)^{-1} ŷ. Linear system complexity: ◮ Space O(n²) ◮ Time O(n³). [Figure: the linear system Q̂ c = ŷ as a matrix equation.]

  26. KRR: Optimization. f̂_λ(x) = Σ_{i=1}^n c_i q(x, x_i) with c = (Q̂ + λnI)^{-1} ŷ. Linear system complexity: ◮ Space O(n²) ◮ Time O(n³). [Figure: the linear system Q̂ c = ŷ as a matrix equation.] BIG DATA? Running out of time and space ... Can this be fixed?

  27. Beyond Tikhonov: Spectral Filtering. (Q̂ + λI)^{-1} is an approximation of Q̂†, controlled by λ.

  28. Beyond Tikhonov: Spectral Filtering. (Q̂ + λI)^{-1} is an approximation of Q̂†, controlled by λ. Can we approximate Q̂† by saving computations?

  29. Beyond Tikhonov: Spectral Filtering. (Q̂ + λI)^{-1} is an approximation of Q̂†, controlled by λ. Can we approximate Q̂† by saving computations? Yes!

  30. Beyond Tikhonov: Spectral Filtering. (Q̂ + λI)^{-1} is an approximation of Q̂†, controlled by λ. Can we approximate Q̂† by saving computations? Yes! Spectral filtering (Engl '96, inverse problems; Rosasco et al. '05, ML): g_λ(Q̂) ∼ Q̂†. The filter function g_λ defines the form of the approximation.

  31. Spectral filtering. Examples: ◮ Tikhonov - ridge regression ◮ Truncated SVD - principal component regression ◮ Landweber iteration - GD / L²-boosting ◮ nu-method - accelerated GD / Chebyshev method ◮ ...
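
One way to realize two of the filters just listed, Tikhonov and truncated SVD, is through an eigendecomposition of Q̂. The sketch below is illustrative only (in particular, the exact normalization of λ versus λn is glossed over).

```python
import numpy as np

def apply_spectral_filter(Q, y, g_lambda):
    # c = g_lambda(Q) y, computed via the eigendecomposition of the symmetric PSD kernel matrix Q.
    eigvals, eigvecs = np.linalg.eigh(Q)
    return eigvecs @ (g_lambda(eigvals) * (eigvecs.T @ y))

def tikhonov_filter(lam):
    # Ridge regression: g_lambda(s) = 1 / (s + lambda).
    return lambda s: 1.0 / (s + lam)

def tsvd_filter(lam):
    # Principal component regression: invert eigenvalues above lambda, discard the rest.
    return lambda s: np.where(s > lam, 1.0 / np.maximum(s, lam), 0.0)

# Usage: c = apply_spectral_filter(Q, y, tikhonov_filter(1e-3))
```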

  32. Spectral filtering. Examples: ◮ Tikhonov - ridge regression ◮ Truncated SVD - principal component regression ◮ Landweber iteration - GD / L²-boosting ◮ nu-method - accelerated GD / Chebyshev method ◮ ... Landweber iteration (truncated power series): c_t = g_t(Q̂) ŷ, with g_t(Q̂) = γ Σ_{r=0}^{t−1} (I − γ Q̂)^r.

  33. Spectral filtering. Examples: ◮ Tikhonov - ridge regression ◮ Truncated SVD - principal component regression ◮ Landweber iteration - GD / L²-boosting ◮ nu-method - accelerated GD / Chebyshev method ◮ ... Landweber iteration (truncated power series): c_t = g_t(Q̂) ŷ, with g_t(Q̂) = γ Σ_{r=0}^{t−1} (I − γ Q̂)^r. ... it's GD for ERM!! c_r = c_{r−1} − γ (Q̂ c_{r−1} − ŷ), r = 1, ..., t, with c_0 = 0.
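
The recursion on this slide, written out as a short numpy sketch; the step-size choice below is a standard safe default, not something specified on the slide.

```python
import numpy as np

def landweber_gd(Q, y, t, gamma=None):
    # Gradient descent on the empirical risk with c_0 = 0:
    #   c_r = c_{r-1} - gamma * (Q c_{r-1} - y),  r = 1, ..., t
    # which equals the truncated power series gamma * sum_{r<t} (I - gamma Q)^r y.
    if gamma is None:
        gamma = 1.0 / np.linalg.norm(Q, 2)  # safe step size based on the spectral norm of Q
    c = np.zeros(Q.shape[0])
    for _ in range(t):
        c = c - gamma * (Q @ c - y)
    return c

# The number of iterations t plays the role of 1/lambda: fewer iterations, more regularization.
```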

  34. Statistics and computations with spectral filtering. The different filters achieve essentially the same optimal statistical error!

  35. Statistics and computations with spectral filtering. The different filters achieve essentially the same optimal statistical error! The difference is in computations:

      Filter           | Time        | Space
      Tikhonov         | n³          | n²
      GD∗              | n² λ^{-1}   | n²
      Accelerated GD∗  | n² λ^{-1/2} | n²
      Truncated SVD∗   | n² λ^{-γ}   | n²

      ∗ Note: λ^{-1} = t for the iterative methods.

  36. Semiconvergence. [Figure: empirical error and expected error as a function of the iteration (0 to 5000), illustrating semiconvergence of the expected error.] ◮ Iterations control statistics and time complexity.
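
A rough way to exploit semiconvergence in practice, not spelled out on the slide, is to monitor the error on a hold-out set and keep the iterate where it bottoms out. In the sketch below, Q_val denotes the kernel matrix between hold-out and training points; the names are my own illustration.

```python
import numpy as np

def gd_with_early_stopping(Q_tr, y_tr, Q_val, y_val, max_iter=5000, gamma=None):
    # Q_tr: kernel matrix on the training points; Q_val: kernel matrix between
    # hold-out points and training points. The hold-out error is used as a proxy
    # for the expected error, and the iterate with the smallest value is kept.
    if gamma is None:
        gamma = 1.0 / np.linalg.norm(Q_tr, 2)
    c = np.zeros(Q_tr.shape[0])
    best_c, best_err, best_t = c.copy(), np.inf, 0
    for t in range(1, max_iter + 1):
        c = c - gamma * (Q_tr @ c - y_tr)
        val_err = np.mean((Q_val @ c - y_val) ** 2)
        if val_err < best_err:
            best_c, best_err, best_t = c.copy(), val_err, t
    return best_c, best_t
```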

  37. Computational Regularization

  38. Computational Regularization. BIG DATA? Running out of time and space ... ("time and space" is struck through on the slide)

  39. Computational Regularization. BIG DATA? Running out of time and space ... ("time and space" is struck through on the slide) Is there a principle to control statistics, time and space complexity?

  40. Outline: Nonparametric Learning; Data-Dependent Subsampling; Data-Independent Subsampling.

  41. Subsampling. 1. Pick w_i at random...

  42. Subsampling. 1. Pick w_i at random... from the training set (Smola, Schölkopf '00): {w̃_1, ..., w̃_M} ⊂ {x_1, ..., x_n}, with M ≪ n.

  43. Subsampling. 1. Pick w_i at random... from the training set (Smola, Schölkopf '00): {w̃_1, ..., w̃_M} ⊂ {x_1, ..., x_n}, with M ≪ n. 2. Perform KRR on H_M = { f | f(x) = Σ_{i=1}^M c_i q(x, w̃_i), c_i ∈ R } (on the slide, "w_i ∈ R^d" and "M ∈ N" are struck through: the centers and M are now fixed by the subsampling, no longer free).
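
One standard way to compute the subsampled estimator is to solve the M × M normal equations (K_nM^T K_nM + λ n K_MM) c = K_nM^T y. The sketch below assumes a Gaussian kernel and uniform sampling without replacement, purely for illustration.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))

def subsampled_krr(X, y, M, lam, sigma=1.0, seed=0):
    # 1. Pick M centers uniformly at random from the training inputs.
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    W = X[rng.choice(n, size=M, replace=False)]
    # 2. KRR restricted to H_M: coefficients solve
    #    (K_nM^T K_nM + lam * n * K_MM) c = K_nM^T y.
    # K_MM may be ill-conditioned; in practice a small extra ridge is often added.
    K_nM = gaussian_kernel(X, W, sigma)   # n x M
    K_MM = gaussian_kernel(W, W, sigma)   # M x M
    c = np.linalg.solve(K_nM.T @ K_nM + lam * n * K_MM, K_nM.T @ y)
    return W, c

def subsampled_predict(X_test, W, c, sigma=1.0):
    # f(x) = sum_{i=1}^M c_i q(x, w_i) at the test points.
    return gaussian_kernel(X_test, W, sigma) @ c

# Space is O(nM) and the solve costs O(nM^2 + M^3), instead of O(n^2) and O(n^3).
```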
