Less is More: Nyström Computational Regularization
Alessandro Rudi, Raffaello Camoriano, Lorenzo Rosasco
University of Genova - Istituto Italiano di Tecnologia - Massachusetts Institute of Technology
ale_rudi@mit.edu
NIPS 2015, December 10th
A Starting Point

Classically: statistics and optimization are distinct steps in algorithm design.
Large scale: consider the interplay between statistics and optimization! (Bottou, Bousquet '08)
Supervised Learning

Problem: estimate $f^*$ given $S_n = \{(x_1, y_1), \dots, (x_n, y_n)\}$

[Figure: sample points $(x_1, y_1), \dots, (x_5, y_5)$ scattered around the target function $f^*$]

The Setting
$$ y_i = f^*(x_i) + \varepsilon_i, \qquad i \in \{1, \dots, n\} $$
◮ $\varepsilon_i \in \mathbb{R}$, $x_i \in \mathbb{R}^d$ random (with unknown distribution)
◮ $f^*$ unknown
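To make the setting concrete, here is a minimal data-generation sketch (purely illustrative, not from the slides: the stand-in $f^*$, the Gaussian sampling distribution, and the noise level are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3

def f_star(x):
    """Stand-in for the unknown target function f*."""
    return np.sin(x[:, 0]) + 0.5 * x[:, 1] ** 2

X = rng.standard_normal((n, d))        # x_i drawn from an (unknown) distribution
eps = 0.1 * rng.standard_normal(n)     # noise eps_i in R
y = f_star(X) + eps                    # y_i = f*(x_i) + eps_i
S_n = list(zip(X, y))                  # the training set S_n
```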
Outline

◮ Learning with kernels
◮ Data Dependent Subsampling
Non-linear/non-parametric learning

$$ f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i) $$

◮ $q$ non-linear function
◮ $w_i \in \mathbb{R}^d$ centers
◮ $c_i \in \mathbb{R}$ coefficients
◮ $M = M_n$ could/should grow with $n$

Question: How to choose $w_i$, $c_i$ and $M$ given $S_n$?
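As a concrete illustration (my addition, not from the slides), here is a minimal sketch of evaluating such an expansion, assuming a Gaussian choice $q(x, w) = \exp(-\|x - w\|^2 / 2\sigma^2)$; the centers, coefficients, and bandwidth below are placeholder values.

```python
import numpy as np

def gaussian_q(X, W, sigma=1.0):
    """Matrix of q(x_j, w_i) values for all rows of X vs rows of W."""
    sq_dists = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def f_hat(X, W, c, sigma=1.0):
    """Evaluate f(x) = sum_i c_i q(x, w_i) at the rows of X."""
    return gaussian_q(X, W, sigma) @ c

# placeholder centers and coefficients (M = 3, d = 2)
W = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
c = np.array([0.5, -1.0, 2.0])
X = np.random.randn(5, 2)       # 5 query points
print(f_hat(X, W, c))           # values of the expansion at the queries
```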
Learning with Positive Definite Kernels

There is an elegant answer if:
◮ $q$ is symmetric
◮ all the matrices $\widehat{Q}_{ij} = q(x_i, x_j)$ are positive semi-definite¹

Representer Theorem (Kimeldorf, Wahba '70; Schölkopf et al. '01)
◮ $M = n$,
◮ $w_i = x_i$,
◮ $c_i$ by convex optimization!

¹ They have non-negative eigenvalues.
Kernel Ridge Regression (KRR), a.k.a. Penalized Least Squares

$$ \widehat{f}_\lambda = \operatorname*{argmin}_{f \in \mathcal{H}} \ \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \|f\|^2 $$

where
$$ \mathcal{H} = \Big\{ f \ \Big|\ f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i), \ c_i \in \mathbb{R}, \ \underbrace{w_i \in \mathbb{R}^d}_{\text{any center!}}, \ \underbrace{M \in \mathbb{N}}_{\text{any length!}} \Big\} $$

Solution
$$ \widehat{f}_\lambda(x) = \sum_{i=1}^{n} c_i\, q(x, x_i) \quad \text{with} \quad \widehat{c} = (\widehat{Q} + \lambda n I)^{-1} \widehat{y} $$
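A minimal numpy sketch of this closed-form solution (not the authors' code), assuming a Gaussian kernel for $q$; the bandwidth, synthetic data, and function names are illustrative. The comments flag the memory and time costs discussed on the next slide.

```python
import numpy as np

def gram(X1, X2, sigma=1.0):
    """Gaussian kernel matrix q(x, x') between rows of X1 and X2."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def krr_fit(X, y, lam, sigma=1.0):
    """Solve (Q + lambda*n*I) c = y for the KRR coefficients."""
    n = X.shape[0]
    Q = gram(X, X, sigma)                             # O(n^2) memory
    c = np.linalg.solve(Q + lam * n * np.eye(n), y)   # O(n^3) time
    return c

def krr_predict(X_test, X_train, c, sigma=1.0):
    """f_lambda(x) = sum_i c_i q(x, x_i)."""
    return gram(X_test, X_train, sigma) @ c

# synthetic example: y = f*(x) + noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
c = krr_fit(X, y, lam=1 / np.sqrt(200))   # lambda* ~ 1/sqrt(n), as in the next slides
print(krr_predict(X[:5], X, c))
```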
KRR: Statistics

Well understood statistical properties:

Classical Theorem
If $f^* \in \mathcal{H}$, then
$$ \mathbb{E}\,(\widehat{f}_{\lambda_*}(x) - f^*(x))^2 \lesssim \frac{1}{\sqrt{n}}, \qquad \lambda_* = \frac{1}{\sqrt{n}} $$

Remarks
1. Optimal nonparametric bound
2. Results for general kernels (e.g. splines/Sobolev etc.):
   $$ \mathbb{E}\,(\widehat{f}_{\lambda_*}(x) - f^*(x))^2 \lesssim n^{-\frac{2s}{2s+1}}, \qquad \lambda_* = n^{-\frac{1}{2s+1}} $$
3. Adaptive tuning via cross validation
KRR: Optimization

$$ \widehat{f}_\lambda(x) = \sum_{i=1}^{n} c_i\, q(x, x_i) \quad \text{with} \quad \widehat{c} = (\widehat{Q} + \lambda n I)^{-1} \widehat{y} $$

Linear system (schematically): $\widehat{Q}\,\widehat{c} = \widehat{y}$
Complexity
◮ Space $O(n^2)$
◮ Time $O(n^3)$

BIG DATA? Running out of space before running out of time... Can this be fixed?
Outline

◮ Learning with kernels
◮ Data Dependent Subsampling
Subsampling

1. Pick $w_i$ at random... from the training set (Smola, Schölkopf '00):
   $$ \tilde{w}_1, \dots, \tilde{w}_M \subset \{x_1, \dots, x_n\}, \qquad M \ll n $$
2. Perform KRR on
   $$ \mathcal{H}_M = \Big\{ f \ \Big|\ f(x) = \sum_{i=1}^{M} c_i\, q(x, \tilde{w}_i), \ c_i \in \mathbb{R} \Big\} $$
   (the freedoms $w_i \in \mathbb{R}^d$ and $M \in \mathbb{N}$ are now crossed out: centers and expansion length are fixed).

Linear system: $\widehat{Q}_M \widehat{c} = \widehat{y}$
Complexity
◮ Space: $O(n^2) \to O(nM)$
◮ Time: $O(n^3) \to O(nM^2)$

What about statistics? What's the price for efficient computations?
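A minimal sketch of the subsampled estimator (my code, not the authors'), assuming a Gaussian kernel and plain uniform sampling of the centers. It solves the standard $M \times M$ normal equations $(K_{nM}^\top K_{nM} + \lambda n\, K_{MM})\, c = K_{nM}^\top y$, one common way to write the $\widehat{Q}_M \widehat{c} = \widehat{y}$ system on the slide; names, jitter, and data are illustrative.

```python
import numpy as np

def gram(X1, X2, sigma=1.0):
    """Gaussian kernel matrix between rows of X1 and X2."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def nystrom_krr_fit(X, y, lam, M, sigma=1.0, rng=None):
    """Uniformly subsample M training points as centers, then solve an M x M system."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    idx = rng.choice(n, size=M, replace=False)
    W = X[idx]                                   # centers ~w_1, ..., ~w_M
    K_nM = gram(X, W, sigma)                     # O(nM) memory
    K_MM = gram(W, W, sigma)
    A = K_nM.T @ K_nM + lam * n * K_MM           # O(nM^2) time to assemble/solve
    c = np.linalg.solve(A + 1e-10 * np.eye(M), K_nM.T @ y)   # small jitter for stability
    return W, c

def nystrom_predict(X_test, W, c, sigma=1.0):
    return gram(X_test, W, sigma) @ c

# usage: n = 200, M ~ sqrt(n), lambda ~ 1/sqrt(n)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
W, c = nystrom_krr_fit(X, y, lam=1 / np.sqrt(200), M=14, rng=rng)
print(nystrom_predict(X[:5], W, c))
```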
Putting our Result in Context

◮ *Many* different subsampling schemes (Smola, Schölkopf '00; Williams, Seeger '01; ... 20+)
◮ Theoretical guarantees mainly on matrix approximation (Mahoney, Drineas '09; Cortes et al. '10; Kumar et al. '12; ... 10+):
  $$ \|\widehat{Q} - \widehat{Q}_M\| \lesssim \frac{1}{\sqrt{M}} $$
◮ Few prediction guarantees, either suboptimal or in restricted settings (Cortes et al. '10; Jin et al. '11; Bach '13; Alaoui, Mahoney '14)
Main Result

Theorem
If $f^* \in \mathcal{H}$, then
$$ \mathbb{E}\,(\widehat{f}_{\lambda_*, M_*}(x) - f^*(x))^2 \lesssim \frac{1}{\sqrt{n}}, \qquad \lambda_* = \frac{1}{\sqrt{n}}, \quad M_* = \frac{1}{\lambda_*} $$

Remarks
1. Subsampling achieves the optimal bound...
2. ...with $M_* \sim \sqrt{n}$!!
3. More generally,
   $$ \mathbb{E}_x\,(\widehat{f}_{\lambda_*, M_*}(x) - f^*(x))^2 \lesssim n^{-\frac{2s}{2s+1}}, \qquad \lambda_* = n^{-\frac{1}{2s+1}}, \quad M_* = \frac{1}{\lambda_*} $$

Note: An interesting insight is obtained by rewriting the result...
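A quick consistency check (my own arithmetic, under the usual convention that the basic case $f^* \in \mathcal{H}$ corresponds to $s = 1/2$): plugging $s = 1/2$ into Remark 3 recovers the first bound,
$$ n^{-\frac{2s}{2s+1}}\Big|_{s=1/2} = \frac{1}{\sqrt{n}}, \qquad \lambda_* = n^{-\frac{1}{2s+1}}\Big|_{s=1/2} = \frac{1}{\sqrt{n}}, \qquad M_* = \frac{1}{\lambda_*} = \sqrt{n}. $$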
Computational Regularization (CoRe)

A simple idea: "swap" the roles of $\lambda$ and $M$...
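One way to read this idea in code (my sketch, not the authors' implementation; the validation and center-selection details are illustrative): keep $\lambda$ fixed and small, and choose the number of centers $M$ on held-out data, so that $M$ itself plays the role of the regularization parameter.

```python
import numpy as np

def gram(X1, X2, sigma=1.0):
    """Gaussian kernel matrix between rows of X1 and X2."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def core_select_M(X, y, X_val, y_val, M_grid, lam=1e-6, sigma=1.0, seed=0):
    """Computational regularization: fix a small lambda and tune the number of
    centers M on a validation set, i.e. M acts as the regularizer."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[0])         # nested center sets: first M points of one permutation
    n = X.shape[0]
    errs = {}
    for M in M_grid:
        W = X[perm[:M]]                        # ~w_1, ..., ~w_M
        K_nM, K_MM = gram(X, W, sigma), gram(W, W, sigma)
        c = np.linalg.solve(K_nM.T @ K_nM + lam * n * K_MM + 1e-10 * np.eye(M),
                            K_nM.T @ y)
        errs[M] = np.mean((gram(X_val, W, sigma) @ c - y_val) ** 2)
    best_M = min(errs, key=errs.get)
    return best_M, errs

# usage sketch (X_tr, y_tr, X_val, y_val are a train/validation split):
# best_M, errs = core_select_M(X_tr, y_tr, X_val, y_val, M_grid=[2, 4, 8, 16, 32, 64])
```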