Less is More: Computational Regularization by Subsampling
Lorenzo Rosasco
University of Genova - Istituto Italiano di Tecnologia - Massachusetts Institute of Technology
lcsl.mit.edu
Joint work with Alessandro Rudi and Raffaello Camoriano
Paris
A Starting Point

Classically: statistics and optimization are distinct steps in algorithm design
(empirical process theory + optimization)

Large Scale: consider the interplay between statistics and optimization! (Bottou, Bousquet ’08)

Computational Regularization: computation “tricks” = regularization
Supervised Learning

Problem: estimate f* given S_n = {(x_1, y_1), ..., (x_n, y_n)}

[Figure: data points (x_1, y_1), ..., (x_5, y_5) and the target function f*]

The Setting
y_i = f*(x_i) + ε_i,   i ∈ {1, ..., n}
◮ ε_i ∈ ℝ, x_i ∈ ℝ^d random (bounded but with unknown distribution)
◮ f* unknown
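To make the setting concrete, here is a minimal sketch of the sampling model y_i = f*(x_i) + ε_i. The specific target function, input distribution and noise level below are illustrative assumptions: the slide only requires bounded noise and leaves f* unknown.

```python
import numpy as np

# Hypothetical instance of the setting y_i = f*(x_i) + eps_i.
# The choices of f*, the input law and the noise bound are illustrative only.
rng = np.random.default_rng(0)

def f_star(x):
    # "unknown" target function (here: a smooth nonlinear choice)
    return np.sin(3 * x) + 0.5 * x

n, d = 200, 1
X = rng.uniform(-2, 2, size=(n, d))      # x_i in R^d, random
eps = rng.uniform(-0.3, 0.3, size=n)     # bounded noise eps_i
y = f_star(X[:, 0]) + eps                # y_i = f*(x_i) + eps_i
```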
Outline
◮ Nonparametric Learning
◮ Data Dependent Subsampling
◮ Data Independent Subsampling
Non-linear/non-parametric learning

f(x) = Σ_{i=1}^M c_i q(x, w_i)

◮ q non-linear function
◮ w_i ∈ ℝ^d centers
◮ c_i ∈ ℝ coefficients
◮ M = M_n could/should grow with n

Question: how to choose w_i, c_i and M given S_n?
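As a toy instance of this expansion (not taken from the talk), one can let q be a Gaussian bump, pick a few centers and coefficients, and evaluate f at a new point; the kernel choice and all numbers below are assumptions for illustration.

```python
import numpy as np

def q_gauss(x, w, sigma=0.5):
    # one possible nonlinear q: a Gaussian centered at w (illustrative choice)
    return np.exp(-np.sum((x - w) ** 2, axis=-1) / (2 * sigma ** 2))

def f(x, centers, coeffs):
    # f(x) = sum_{i=1}^M c_i q(x, w_i)
    return sum(c * q_gauss(x, w) for c, w in zip(coeffs, centers))

M, d = 5, 1
rng = np.random.default_rng(1)
centers = rng.uniform(-2, 2, size=(M, d))   # w_i in R^d
coeffs = rng.normal(size=M)                 # c_i in R
x_new = np.array([[0.3]])
print(f(x_new, centers, coeffs))
```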
Learning with Positive Definite Kernels

There is an elegant answer if:
◮ q is symmetric
◮ all the matrices Q̂_ij = q(x_i, x_j) are positive semi-definite¹

Representer Theorem (Kimeldorf, Wahba ’70; Schölkopf et al. ’01)
◮ M = n,
◮ w_i = x_i,
◮ c_i by convex optimization!

¹ They have non-negative eigenvalues.
Kernel Ridge Regression (KRR), a.k.a. Tikhonov Regularization

f̂_λ = argmin_{f ∈ H} (1/n) Σ_{i=1}^n (y_i − f(x_i))² + λ‖f‖²

where²

H = { f | f(x) = Σ_{i=1}^M c_i q(x, w_i),  c_i ∈ ℝ,  w_i ∈ ℝ^d (any center!),  M ∈ ℕ (any length!) }

Solution

f̂_λ(x) = Σ_{i=1}^n c_i q(x, x_i)   with   c = (Q̂ + λnI)^{−1} ŷ

² The norm is induced by the inner product ⟨f, f′⟩ = Σ_{i,j} c_i c′_j q(x_i, x_j).
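A direct transcription of the closed-form solution above, using a Gaussian kernel as a stand-in for q (the talk does not commit to a particular kernel); this is a sketch, not the speakers' implementation.

```python
import numpy as np

def gauss_kernel(A, B, sigma=0.5):
    # pairwise kernel matrix q(a_i, b_j) for a Gaussian q (illustrative choice)
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def krr_fit(X, y, lam, kernel=gauss_kernel):
    # c = (Q_hat + lam * n * I)^{-1} y
    n = X.shape[0]
    Q = kernel(X, X)
    return np.linalg.solve(Q + lam * n * np.eye(n), y)

def krr_predict(X_train, c, X_test, kernel=gauss_kernel):
    # f_lambda(x) = sum_i c_i q(x, x_i)
    return kernel(X_test, X_train) @ c

# usage, with any data X (n, d) and y (n,):
# c = krr_fit(X, y, lam=1.0 / np.sqrt(len(y)))
# y_hat = krr_predict(X, c, X_test)
```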
KRR: Statistics

Well understood statistical properties:

Classical Theorem. If f* ∈ H, then

E(f̂_{λ*}(x) − f*(x))² ≲ 1/√n,   with   λ* = 1/√n

Remarks
1. Optimal nonparametric bound
2. More refined results for smooth kernels:
   E(f̂_{λ*}(x) − f*(x))² ≲ n^{−2s/(2s+1)},   with   λ* = n^{−1/(2s+1)}
3. Adaptive tuning, e.g. via cross validation
4. Proofs: inverse problems results + random matrices (Smale and Zhou; Caponnetto, De Vito, R.)
KRR: Optimization

f̂_λ(x) = Σ_{i=1}^n c_i q(x, x_i)   with   c = (Q̂ + λnI)^{−1} ŷ

Linear system: (Q̂ + λnI) c = ŷ

Complexity
◮ Space O(n²)
◮ Time O(n³)

BIG DATA? Running out of time and space ... Can this be fixed?
Beyond Tikhonov: Spectral Filtering

(Q̂ + λI)^{−1} is an approximation of Q̂† controlled by λ.

Can we approximate Q̂† while saving computations? Yes!

Spectral filtering (Engl ’96, inverse problems; Rosasco et al. ’05, ML)

g_λ(Q̂) ∼ Q̂†

The filter function g_λ defines the form of the approximation.
Spectral filtering

Examples
◮ Tikhonov: ridge regression
◮ Truncated SVD: principal component regression
◮ Landweber iteration: GD / L²-boosting
◮ nu-method: accelerated GD / Chebyshev method
◮ ...

Landweber iteration (truncated power series) ...

c_t = g_t(Q̂) ŷ = γ Σ_{r=0}^{t−1} (I − γ Q̂)^r ŷ

... it's GD for ERM!

c_r = c_{r−1} − γ (Q̂ c_{r−1} − ŷ),   r = 1, ..., t,   c_0 = 0
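The Landweber/GD iteration on the slide, written out as code. The step-size rule γ = 1/‖Q̂‖ is an assumption added for stability; the slide does not specify how γ is chosen.

```python
import numpy as np

def landweber(Q, y, t, gamma=None):
    """Run t Landweber / gradient-descent iterations:
       c_r = c_{r-1} - gamma * (Q c_{r-1} - y),  c_0 = 0.
    The number of iterations t plays the role of 1/lambda."""
    n = Q.shape[0]
    if gamma is None:
        # conservative step size; any gamma < 2 / ||Q|| keeps the iteration stable
        gamma = 1.0 / np.linalg.norm(Q, 2)
    c = np.zeros(n)
    for _ in range(t):
        c = c - gamma * (Q @ c - y)
    return c

# usage: with Q = gauss_kernel(X, X) and labels y,
# c_t = landweber(Q, y, t=100); predictions as in krr_predict.
```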
Statistics and computations with spectral filtering

The different filters achieve essentially the same optimal statistical error!

The difference is in computations:

Filter            Time          Space
Tikhonov          n³            n²
GD*               n² λ^{-1}     n²
Accelerated GD*   n² λ^{-1/2}   n²
Truncated SVD*    n² λ^{-γ}     n²

* Note: λ^{-1} = t for iterative methods.
Semiconvergence

[Plot: Empirical Error and Expected Error versus Iteration (0 to 5000)]

◮ Iterations control statistics and time complexity
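One common way to exploit semiconvergence is early stopping: run the iteration and keep the iterate with the smallest error on a held-out set, used as a proxy for the expected error. The sketch below reuses the Landweber-style update from the previous snippet; the validation data and the stopping heuristic are assumptions, not a prescription from the talk.

```python
import numpy as np

def landweber_early_stopping(Q_tr, y_tr, K_val_tr, y_val, t_max, gamma):
    """Gradient descent as before, tracking validation error and returning
    the coefficients at the iteration where it was smallest.
    K_val_tr is the kernel matrix between validation and training inputs."""
    c = np.zeros(Q_tr.shape[0])
    best_c, best_err = c.copy(), np.inf
    for t in range(1, t_max + 1):
        c = c - gamma * (Q_tr @ c - y_tr)
        val_err = np.mean((K_val_tr @ c - y_val) ** 2)  # proxy for expected error
        if val_err < best_err:
            best_err, best_c = val_err, c.copy()
    return best_c, best_err
```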
Computational Regularization

BIG DATA? Running out of time and space (struck out on the slide) ...

Is there a principle to control statistics, time and space complexity?
Outline
◮ Nonparametric Learning
◮ Data Dependent Subsampling
◮ Data Independent Subsampling
Subsampling

1. Pick w_i at random ... from the training set (Smola, Schölkopf ’00):
   w̃_1, ..., w̃_M ⊂ {x_1, ..., x_n},   M ≪ n

2. Perform KRR on
   H_M = { f | f(x) = Σ_{i=1}^M c_i q(x, w̃_i),  c_i ∈ ℝ }
   (the centers and the number of terms M are now fixed, so only the c_i remain free)
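A sketch of this subsampled (Nyström-style) estimator: draw M centers uniformly from the training inputs and solve the resulting M × M system obtained by minimizing the regularized empirical risk over H_M, with ‖f‖² = c^T K_MM c as in footnote 2. Variable names and the kernel argument are mine; a Gaussian kernel such as gauss_kernel above can be plugged in.

```python
import numpy as np

def subsampled_krr_fit(X, y, M, lam, kernel, rng=None):
    """KRR restricted to H_M: f(x) = sum_{i=1}^M c_i q(x, w_i),
    with centers w_i drawn uniformly from the training inputs."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    idx = rng.choice(n, size=M, replace=False)  # M << n random centers
    W = X[idx]                                  # w_1, ..., w_M
    K_nM = kernel(X, W)                         # n x M
    K_MM = kernel(W, W)                         # M x M
    # minimize (1/n) ||y - K_nM c||^2 + lam * c^T K_MM c
    A = K_nM.T @ K_nM + lam * n * K_MM
    c = np.linalg.solve(A, K_nM.T @ y)
    return W, c

def subsampled_krr_predict(W, c, X_test, kernel):
    # f(x) = sum_i c_i q(x, w_i)
    return kernel(X_test, W) @ c
```

Compared with full KRR, the linear system is M × M instead of n × n, so the space and time costs scale with M rather than n.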