Less is More: Nyström Computational Regularization
Alessandro Rudi, Raffaello Camoriano, Lorenzo Rosasco
University of Genova - Istituto Italiano di Tecnologia - Massachusetts Institute of Technology
ale_rudi@mit.edu
NIPS 2015, December 10th
A Starting Point

Classically: statistics and optimization are distinct steps in algorithm design.
Large scale: consider the interplay between statistics and optimization! (Bottou, Bousquet '08)
Supervised Learning

Problem: estimate $f^*$ given $S_n = \{(x_1, y_1), \dots, (x_n, y_n)\}$

[Figure: sample points $(x_1, y_1), \dots, (x_5, y_5)$ scattered around the target function $f^*$]

The Setting
$$ y_i = f^*(x_i) + \varepsilon_i, \qquad i \in \{1, \dots, n\} $$
◮ $\varepsilon_i \in \mathbb{R}$, $x_i \in \mathbb{R}^d$ random (with unknown distribution)
◮ $f^*$ unknown
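To make the setting concrete, here is a minimal data-generation sketch (purely illustrative, not from the slides: the stand-in $f^*$, the Gaussian sampling distribution, and the noise level are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3

def f_star(x):
    """Stand-in for the unknown target function f*."""
    return np.sin(x[:, 0]) + 0.5 * x[:, 1] ** 2

X = rng.standard_normal((n, d))        # x_i drawn from an (unknown) distribution
eps = 0.1 * rng.standard_normal(n)     # noise eps_i in R
y = f_star(X) + eps                    # y_i = f*(x_i) + eps_i
S_n = list(zip(X, y))                  # the training set S_n
```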
Outline

◮ Learning with kernels
◮ Data Dependent Subsampling
Non-linear/non-parametric learning

$$ f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i) $$

◮ $q$ non-linear function
◮ $w_i \in \mathbb{R}^d$ centers
◮ $c_i \in \mathbb{R}$ coefficients
◮ $M = M_n$ could/should grow with $n$

Question: How to choose $w_i$, $c_i$ and $M$ given $S_n$?
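As a concrete illustration (my addition, not from the slides), here is a minimal sketch of evaluating such an expansion, assuming a Gaussian choice $q(x, w) = \exp(-\|x - w\|^2 / 2\sigma^2)$; the centers, coefficients, and bandwidth below are placeholder values.

```python
import numpy as np

def gaussian_q(X, W, sigma=1.0):
    """Matrix of q(x_j, w_i) values for all rows of X vs rows of W."""
    sq_dists = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def f_hat(X, W, c, sigma=1.0):
    """Evaluate f(x) = sum_i c_i q(x, w_i) at the rows of X."""
    return gaussian_q(X, W, sigma) @ c

# placeholder centers and coefficients (M = 3, d = 2)
W = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
c = np.array([0.5, -1.0, 2.0])
X = np.random.randn(5, 2)       # 5 query points
print(f_hat(X, W, c))           # values of the expansion at the queries
```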
Learning with Positive Definite Kernels

There is an elegant answer if:
◮ $q$ is symmetric
◮ all the matrices $\widehat{Q}_{ij} = q(x_i, x_j)$ are positive semi-definite¹

Representer Theorem (Kimeldorf, Wahba '70; Schölkopf et al. '01)
◮ $M = n$,
◮ $w_i = x_i$,
◮ $c_i$ by convex optimization!

¹ They have non-negative eigenvalues.
Kernel Ridge Regression (KRR), a.k.a. Penalized Least Squares

$$ \widehat{f}_\lambda = \operatorname*{argmin}_{f \in \mathcal{H}} \ \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \|f\|^2 $$

where
$$ \mathcal{H} = \Big\{ f \ \Big|\ f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i), \ c_i \in \mathbb{R}, \ \underbrace{w_i \in \mathbb{R}^d}_{\text{any center!}}, \ \underbrace{M \in \mathbb{N}}_{\text{any length!}} \Big\} $$

Solution
$$ \widehat{f}_\lambda(x) = \sum_{i=1}^{n} c_i\, q(x, x_i) \quad \text{with} \quad \widehat{c} = (\widehat{Q} + \lambda n I)^{-1} \widehat{y} $$
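A minimal numpy sketch of this closed-form solution (not the authors' code), assuming a Gaussian kernel for $q$; the bandwidth, synthetic data, and function names are illustrative. The comments flag the memory and time costs discussed on the next slide.

```python
import numpy as np

def gram(X1, X2, sigma=1.0):
    """Gaussian kernel matrix q(x, x') between rows of X1 and X2."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def krr_fit(X, y, lam, sigma=1.0):
    """Solve (Q + lambda*n*I) c = y for the KRR coefficients."""
    n = X.shape[0]
    Q = gram(X, X, sigma)                             # O(n^2) memory
    c = np.linalg.solve(Q + lam * n * np.eye(n), y)   # O(n^3) time
    return c

def krr_predict(X_test, X_train, c, sigma=1.0):
    """f_lambda(x) = sum_i c_i q(x, x_i)."""
    return gram(X_test, X_train, sigma) @ c

# synthetic example: y = f*(x) + noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
c = krr_fit(X, y, lam=1 / np.sqrt(200))   # lambda* ~ 1/sqrt(n), as in the next slides
print(krr_predict(X[:5], X, c))
```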
KRR: Statistics

Well understood statistical properties:

Classical Theorem
If $f^* \in \mathcal{H}$, then
$$ \mathbb{E}\,(\widehat{f}_{\lambda_*}(x) - f^*(x))^2 \lesssim \frac{1}{\sqrt{n}}, \qquad \lambda_* = \frac{1}{\sqrt{n}} $$

Remarks
1. Optimal nonparametric bound
2. Results for general kernels (e.g. splines/Sobolev etc.):
   $$ \mathbb{E}\,(\widehat{f}_{\lambda_*}(x) - f^*(x))^2 \lesssim n^{-\frac{2s}{2s+1}}, \qquad \lambda_* = n^{-\frac{1}{2s+1}} $$
3. Adaptive tuning via cross validation
KRR: Optimization

$$ \widehat{f}_\lambda(x) = \sum_{i=1}^{n} c_i\, q(x, x_i) \quad \text{with} \quad \widehat{c} = (\widehat{Q} + \lambda n I)^{-1} \widehat{y} $$

Linear system (schematically): $\widehat{Q}\,\widehat{c} = \widehat{y}$
Complexity
◮ Space $O(n^2)$
◮ Time $O(n^3)$

BIG DATA? Running out of space before running out of time... Can this be fixed?
Outline

◮ Learning with kernels
◮ Data Dependent Subsampling
Subsampling

1. Pick $w_i$ at random... from the training set (Smola, Schölkopf '00):
   $$ \tilde{w}_1, \dots, \tilde{w}_M \subset \{x_1, \dots, x_n\}, \qquad M \ll n $$
2. Perform KRR on
   $$ \mathcal{H}_M = \Big\{ f \ \Big|\ f(x) = \sum_{i=1}^{M} c_i\, q(x, \tilde{w}_i), \ c_i \in \mathbb{R} \Big\} $$
   (the freedoms $w_i \in \mathbb{R}^d$ and $M \in \mathbb{N}$ are now crossed out: centers and expansion length are fixed).

Linear system: $\widehat{Q}_M \widehat{c} = \widehat{y}$
Complexity
◮ Space: $O(n^2) \to O(nM)$
◮ Time: $O(n^3) \to O(nM^2)$

What about statistics? What's the price for efficient computations?
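A minimal sketch of the subsampled estimator (my code, not the authors'), assuming a Gaussian kernel and plain uniform sampling of the centers. It solves the standard $M \times M$ normal equations $(K_{nM}^\top K_{nM} + \lambda n\, K_{MM})\, c = K_{nM}^\top y$, one common way to write the $\widehat{Q}_M \widehat{c} = \widehat{y}$ system on the slide; names, jitter, and data are illustrative.

```python
import numpy as np

def gram(X1, X2, sigma=1.0):
    """Gaussian kernel matrix between rows of X1 and X2."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def nystrom_krr_fit(X, y, lam, M, sigma=1.0, rng=None):
    """Uniformly subsample M training points as centers, then solve an M x M system."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    idx = rng.choice(n, size=M, replace=False)
    W = X[idx]                                   # centers ~w_1, ..., ~w_M
    K_nM = gram(X, W, sigma)                     # O(nM) memory
    K_MM = gram(W, W, sigma)
    A = K_nM.T @ K_nM + lam * n * K_MM           # O(nM^2) time to assemble/solve
    c = np.linalg.solve(A + 1e-10 * np.eye(M), K_nM.T @ y)   # small jitter for stability
    return W, c

def nystrom_predict(X_test, W, c, sigma=1.0):
    return gram(X_test, W, sigma) @ c

# usage: n = 200, M ~ sqrt(n), lambda ~ 1/sqrt(n)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
W, c = nystrom_krr_fit(X, y, lam=1 / np.sqrt(200), M=14, rng=rng)
print(nystrom_predict(X[:5], W, c))
```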
Putting our Result in Context

◮ *Many* different subsampling schemes (Smola, Schölkopf '00; Williams, Seeger '01; ... 20+)
◮ Theoretical guarantees mainly on matrix approximation (Mahoney, Drineas '09; Cortes et al. '10; Kumar et al. '12; ... 10+):
  $$ \|\widehat{Q} - \widehat{Q}_M\| \lesssim \frac{1}{\sqrt{M}} $$
◮ Few prediction guarantees, either suboptimal or in restricted settings (Cortes et al. '10; Jin et al. '11; Bach '13; Alaoui, Mahoney '14)
Main Result

Theorem
If $f^* \in \mathcal{H}$, then
$$ \mathbb{E}\,(\widehat{f}_{\lambda_*, M_*}(x) - f^*(x))^2 \lesssim \frac{1}{\sqrt{n}}, \qquad \lambda_* = \frac{1}{\sqrt{n}}, \quad M_* = \frac{1}{\lambda_*} $$

Remarks
1. Subsampling achieves the optimal bound...
2. ...with $M_* \sim \sqrt{n}$!!
3. More generally,
   $$ \mathbb{E}_x\,(\widehat{f}_{\lambda_*, M_*}(x) - f^*(x))^2 \lesssim n^{-\frac{2s}{2s+1}}, \qquad \lambda_* = n^{-\frac{1}{2s+1}}, \quad M_* = \frac{1}{\lambda_*} $$

Note: An interesting insight is obtained by rewriting the result...
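A quick consistency check (my own arithmetic, under the usual convention that the basic case $f^* \in \mathcal{H}$ corresponds to $s = 1/2$): plugging $s = 1/2$ into Remark 3 recovers the first bound,
$$ n^{-\frac{2s}{2s+1}}\Big|_{s=1/2} = \frac{1}{\sqrt{n}}, \qquad \lambda_* = n^{-\frac{1}{2s+1}}\Big|_{s=1/2} = \frac{1}{\sqrt{n}}, \qquad M_* = \frac{1}{\lambda_*} = \sqrt{n}. $$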
Computational Regularization (CoRe)

A simple idea: "swap" the roles of $\lambda$ and $M$...
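One way to read this idea in code (my sketch, not the authors' implementation; the validation and center-selection details are illustrative): keep $\lambda$ fixed and small, and choose the number of centers $M$ on held-out data, so that $M$ itself plays the role of the regularization parameter.

```python
import numpy as np

def gram(X1, X2, sigma=1.0):
    """Gaussian kernel matrix between rows of X1 and X2."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def core_select_M(X, y, X_val, y_val, M_grid, lam=1e-6, sigma=1.0, seed=0):
    """Computational regularization: fix a small lambda and tune the number of
    centers M on a validation set, i.e. M acts as the regularizer."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[0])         # nested center sets: first M points of one permutation
    n = X.shape[0]
    errs = {}
    for M in M_grid:
        W = X[perm[:M]]                        # ~w_1, ..., ~w_M
        K_nM, K_MM = gram(X, W, sigma), gram(W, W, sigma)
        c = np.linalg.solve(K_nM.T @ K_nM + lam * n * K_MM + 1e-10 * np.eye(M),
                            K_nM.T @ y)
        errs[M] = np.mean((gram(X_val, W, sigma) @ c - y_val) ** 2)
    best_M = min(errs, key=errs.get)
    return best_M, errs

# usage sketch (X_tr, y_tr, X_val, y_val are a train/validation split):
# best_M, errs = core_select_M(X_tr, y_tr, X_val, y_val, M_grid=[2, 4, 8, 16, 32, 64])
```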