Regularization Methods for Large Scale Machine Learning
Alessandro Rudi
University of Genova - Istituto Italiano di Tecnologia - Massachusetts Institute of Technology, lcsl.mit.edu
In collaboration with: Raffaello Camoriano, Lorenzo Rosasco
May 6th, 2017 - RegML, Oslo
Machine Learning Algorithms: Desiderata
◮ flexible non-linear / non-parametric models
◮ scalable computation
◮ statistical guarantees
Statistics and computations (Chandrasekaran, Jordan ’12)
Supervised Learning
Problem: Estimate $f^*$ given $S_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$.
[Figure: sample points $(x_1,y_1), \ldots, (x_5,y_5)$ together with the target function $f^*$]
Setting: $y_i = f^*(x_i) + \varepsilon_i$, $i \in \{1, \ldots, n\}$
◮ $\varepsilon_i \in \mathbb{R}$, $x_i \in \mathbb{R}^d$ random (unknown distribution)
◮ $f^*$ unknown
Recall Kernel Ridge Regression
$$\hat f_\lambda(x) = \Phi(x)^\top \hat w_\lambda = \sum_{i=1}^n K(x, x_i)\,(\hat c_\lambda)_i, \qquad \hat c_\lambda = (\hat K + \lambda n I)^{-1}\hat y$$
◮ $\Phi(x)^\top \Phi(x') = K(x, x')$ kernel, e.g. Gaussian
◮ $\hat K$ the $n \times n$ kernel matrix, $(\hat K)_{ij} = K(x_i, x_j)$
◮ $\hat y$ the outputs vector
Computational complexity: Time $O(n^3)$, Memory $O(n^2)$
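For concreteness, here is a minimal NumPy sketch of the estimator above, assuming a Gaussian kernel; the function names, the width parameter gamma, and the data layout (X an n x d array, y an n-vector) are illustrative choices, not part of the slides.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def krr_fit(X, y, lam, gamma=1.0):
    """Solve (K_hat + lam*n*I) c = y: O(n^3) time, O(n^2) memory."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def krr_predict(X_train, c, X_test, gamma=1.0):
    """f_hat(x) = sum_i K(x, x_i) * c_i."""
    return gaussian_kernel(X_test, X_train, gamma) @ c
```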
Statistical Guarantees
Let $\mathcal{E}(f) = \mathbb{E}(y - f(x))^2$.
Theorem (Caponnetto, De Vito '05). Assume $\exists\, w$ such that $f^*(x) = w^\top \Phi(x)$, subexponential noise, and bounded kernel. Then w.h.p.
$$\mathcal{E}(\hat f_\lambda) - \mathcal{E}(f^*) \lesssim \frac{1}{\lambda n} + \lambda,$$
so that for $\lambda_n = 1/\sqrt{n}$, w.h.p.
$$\mathcal{E}(\hat f_{\lambda_n}) - \mathcal{E}(f^*) \lesssim \frac{1}{\sqrt{n}}.$$
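The choice $\lambda_n = 1/\sqrt{n}$ is just the value that balances the two terms of the bound; a one-line derivation:

```latex
\frac{d}{d\lambda}\left(\frac{1}{\lambda n} + \lambda\right)
= -\frac{1}{\lambda^2 n} + 1 = 0
\;\Longrightarrow\;
\lambda_n = \frac{1}{\sqrt{n}},
\qquad
\frac{1}{\lambda_n n} + \lambda_n = \frac{2}{\sqrt{n}} \lesssim \frac{1}{\sqrt{n}}.
```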
Remarks
◮ Bound is minimax optimal (Caponnetto, De Vito '05)
◮ Adaptivity via cross validation or Lepskii method (Caponnetto, Yao '07; De Vito, Pereverzev, R. '07)
◮ Refined results, e.g. for Sobolev classes the rate is $n^{-\frac{2s}{2s+d}}$ (Caponnetto, De Vito '05)
Computational complexity kills the method for large problems: Time $O(n^3)$, Memory $O(n^2)$
BIG DATA
Where is it possible to run kernel ridge regression?
◮ $n \approx 10\,000$: laptop ($\sim$ 1 gigabyte memory)
◮ $n \approx 100\,000$: desktop ($\sim$ 100 gigabytes memory)
◮ $n \approx 1\,000\,000$: cluster ($\sim$ 10 terabytes memory)
◮ $n \approx 10\,000\,000$: supercomputer (TOP10) ($\sim$ 1 petabyte memory)
High energy physics experiments: $n \approx 10^7$ per second...
Outline
◮ Data-independent subsampling
◮ Data-dependent subsampling
◮ Adaptive subsampling
An idea
If $K(x, x') = \Phi_M(x)^\top \Phi_M(x')$ with $\Phi_M(x) \in \mathbb{R}^M$:
$$\hat f_{\lambda,M}(x) = \Phi_M(x)^\top \hat w_{\lambda,M}, \qquad \hat w_{\lambda,M} = (\hat\Phi^\top \hat\Phi + \lambda n I)^{-1} \hat\Phi^\top \hat y$$
◮ $\hat\Phi = (\Phi_M(x_1), \ldots, \Phi_M(x_n))^\top \in \mathbb{R}^{n \times M}$
◮ $\hat w_{\lambda,M}$ vector in $\mathbb{R}^M$
◮ $\hat y$ outputs vector in $\mathbb{R}^n$
Computational complexity: Time $O(n^3) \to O(nM^2)$, Memory $O(n^2) \to O(nM)$
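A minimal sketch of this primal solver, assuming the explicit features have already been computed as an n x M matrix Phi (names are illustrative); the feature maps constructed in the next slides plug directly into it.

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Solve (Phi^T Phi + lam*n*I) w = Phi^T y: O(n M^2) time, O(n M) memory."""
    n, M = Phi.shape
    A = Phi.T @ Phi + lam * n * np.eye(M)
    return np.linalg.solve(A, Phi.T @ y)

def ridge_predict(Phi_test, w):
    """f_hat(x) = Phi_M(x)^T w."""
    return Phi_test @ w
```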
Kernel Approximation
Find $\Phi_M(x) \in \mathbb{R}^M$ such that $K(x, x') \approx \Phi_M(x)^\top \Phi_M(x')$, then apply the previous algorithm.
Random Fourier Features
Gaussian kernel:
$$e^{-\gamma\|x - x'\|^2} = \mathbb{E}_\omega\!\left[e^{i\omega^\top x}\, e^{-i\omega^\top x'}\right], \qquad \omega \sim \mathcal{N}(0, \gamma)$$
$$\approx \frac{1}{M}\sum_{j=1}^M e^{i\omega_j^\top x}\, e^{-i\omega_j^\top x'}, \qquad \omega_j \sim \mathcal{N}(0, \gamma),$$
$$\Phi_M(x) := \frac{1}{\sqrt{M}}\left(e^{i\omega_1^\top x}, \ldots, e^{i\omega_M^\top x}\right).$$
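A short sketch of this construction (names are illustrative). The code writes the frequency variance explicitly as $2\gamma$ so that $\mathbb{E}[e^{i\omega^\top(x-x')}] = e^{-\gamma\|x-x'\|^2}$ holds exactly with the kernel parameterized as on the slide.

```python
import numpy as np

def sample_frequencies(d, M, gamma, seed=None):
    """Draw M Gaussian frequencies; variance 2*gamma gives
    E[exp(i w^T (x - x'))] = exp(-gamma * ||x - x'||^2) exactly."""
    rng = np.random.default_rng(seed)
    return rng.normal(scale=np.sqrt(2 * gamma), size=(M, d))

def rff_map(X, W):
    """Complex map Phi_M(x) = (1/sqrt(M)) * (exp(i w_1^T x), ..., exp(i w_M^T x))."""
    M = W.shape[0]
    return np.exp(1j * X @ W.T) / np.sqrt(M)       # n x M, complex-valued

# Re(Phi_M(x) . conj(Phi_M(x'))) approximates exp(-gamma * ||x - x'||^2);
# the equivalent real-valued map stacks cos(Wx) and sin(Wx), each scaled by 1/sqrt(M).
```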
Random Features Expansion
Find $q$ such that
$$K(x, x') = \mathbb{E}_\omega\!\left[q(x, \omega)\, q(x', \omega)\right],$$
sample $\omega_1, \ldots, \omega_M$ and consider
$$K(x, x') \approx \Phi_M(x)^\top \Phi_M(x') := \frac{1}{M}\sum_{j=1}^M q(x, \omega_j)\, q(x', \omega_j),$$
$$\Phi_M(x) := \frac{1}{\sqrt{M}}\left(q(x, \omega_1), \ldots, q(x, \omega_M)\right).$$
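In code, the general recipe is a short Monte Carlo loop once $q$ and the sampling law are given; this generic sketch (with hypothetical q / sample_omega callbacks) reduces to the Fourier features above for the Gaussian kernel.

```python
import numpy as np

def random_features(X, q, sample_omega, M, seed=None):
    """Monte Carlo feature map: Phi[i, j] = q(x_i, omega_j) / sqrt(M).

    q(x, omega) -> scalar and sample_omega(rng) -> omega are user-supplied
    placeholders; Phi @ Phi.T then approximates the kernel matrix
    (use the conjugate transpose if q is complex-valued)."""
    rng = np.random.default_rng(seed)
    omegas = [sample_omega(rng) for _ in range(M)]
    Phi = np.array([[q(x, w) for w in omegas] for x in X]) / np.sqrt(M)
    return Phi, omegas
```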
Examples of Random Features
◮ translation invariant kernels
◮ dot product kernels
◮ group invariant kernels
◮ infinite neural net kernels
◮ homogeneous additive kernels
◮ sums, products, compositions of kernels
◮ ...
Computations
If $M \ll n$:
◮ TIME: $O(nM^2) \ll O(n^3)$
◮ SPACE: $O(nM) \ll O(n^2)$
Any loss in ACCURACY?
Previous Results
◮ *Many* different random features for different kernels (Rahimi, Recht '07; Vedaldi, Zisserman; ... 10+)
◮ Theoretical guarantees, mainly kernel approximation (Rahimi, Recht '07, ..., Sriperumbudur, Szabo '15):
$$|K(x, x') - \Phi_M(x)^\top \Phi_M(x')| \lesssim \frac{1}{\sqrt{M}}$$
◮ Theoretical guarantees, generalization bounds (Rahimi, Recht '09; Bach '15):
$$\mathcal{E}(\hat f_{\lambda,M}) - \mathcal{E}(f^*) \lesssim \frac{1}{\sqrt{n}} \;\Longleftarrow\; M = n$$
Statistical Guarantees
Let $\mathcal{E}(f) = \mathbb{E}(y - f(x))^2$.
Theorem (R., Camoriano, Rosasco '16). Assume $\exists\, w$ such that $f^*(x) = w^\top \Phi(x)$, subexponential noise, and bounded kernel. Then w.h.p.
$$\mathcal{E}(\hat f_{\lambda,M}) - \mathcal{E}(f^*) \lesssim \frac{1}{\lambda n} + \frac{1}{M} + \lambda,$$
so that for
$$\lambda_n = \frac{1}{\sqrt{n}}, \qquad M_n = \frac{1}{\lambda_n},$$
the following holds w.h.p.:
$$\mathcal{E}(\hat f_{\lambda_n, M_n}) - \mathcal{E}(f^*) \lesssim \frac{1}{\sqrt{n}}.$$
Remarks
◮ Bound is minimax optimal, same as Tikhonov
◮ Adaptivity via cross validation or Lepskii method
◮ Refined results, e.g. for Sobolev classes the rate is $n^{-\frac{2s}{2s+d}}$ (R., Camoriano, Rosasco '16)
$M = \sqrt{n}$ guarantees NO loss in accuracy.
Computational complexity: Time $O(n^3) \to O(n^2)$, Memory $O(n^2) \to O(n\sqrt{n})$
M controls space, time and statistics
Corollary: the same result holds if we set
$$M_n = \sqrt{n}, \qquad \lambda_n = \frac{1}{M_n}.$$
◮ $M$ can be seen as a regularization parameter
New incremental algorithm (sketched below):
1. Pick a random feature + compute solution
2. Pick another random feature + rank-one update
3. Pick another random feature ...
M controls at the same time: Space, Time, Statistics
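A hedged sketch of how such an incremental scheme could look in NumPy/SciPy; this illustrates the idea rather than the authors' implementation. A bordered Cholesky update absorbs each new feature in $O(nM + M^2)$ operations instead of re-solving from scratch, and new_feature is a hypothetical callback returning one random feature column evaluated on all data.

```python
import numpy as np
from scipy.linalg import solve_triangular

def incremental_rf_ridge(X, y, lam, new_feature, M_max):
    """Grow the random-feature model one feature at a time.

    new_feature(X, j) -> (n,) array: the j-th random feature evaluated on all
    data points (e.g. a single random Fourier feature). User-supplied placeholder.
    """
    n = X.shape[0]
    Phi = np.empty((n, 0))        # current feature matrix
    L = np.empty((0, 0))          # Cholesky factor of A_M = Phi^T Phi + lam*n*I
    b = np.empty(0)               # Phi^T y
    w = np.empty(0)
    for j in range(M_max):
        phi = new_feature(X, j)                      # one new random feature column
        a = phi @ phi + lam * n                      # new diagonal entry of A_{M+1}
        if Phi.shape[1] == 0:
            L = np.array([[np.sqrt(a)]])
        else:
            v = Phi.T @ phi                          # O(nM): correlations with old features
            l = solve_triangular(L, v, lower=True)   # O(M^2) triangular solve
            d = np.sqrt(a - l @ l)
            # bordered Cholesky update: append one row/column instead of refactorizing
            L = np.block([[L, np.zeros((L.shape[0], 1))],
                          [l[None, :], np.array([[d]])]])
        Phi = np.column_stack([Phi, phi])
        b = np.append(b, phi @ y)
        # current solution of (Phi^T Phi + lam*n*I) w = Phi^T y via two triangular solves
        w = solve_triangular(L.T, solve_triangular(L, b, lower=True), lower=False)
        # here one could track validation error and stop early: M acts as the regularizer
    return w, Phi
```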
Outline
◮ Data-independent subsampling
◮ Data-dependent subsampling
◮ Adaptive subsampling
Data dependent subsampling with Nyström
$\{\tilde x_1, \ldots, \tilde x_M\} \subseteq \{x_1, \ldots, x_n\}$
$$\hat f_{\lambda,M}(x) = \sum_{j=1}^M K(x, \tilde x_j)\,(\hat c_{\lambda,M})_j, \qquad \hat c_{\lambda,M} = (\hat K_{nM}^\top \hat K_{nM} + \lambda n\, \hat K_{MM})^{-1} \hat K_{nM}^\top \hat y$$
◮ $(\hat K_{nM})_{ij} = K(x_i, \tilde x_j)$, an $n \times M$ matrix
◮ $(\hat K_{MM})_{ij} = K(\tilde x_i, \tilde x_j)$, an $M \times M$ matrix
◮ $\hat y$ outputs vector
Computational complexity: Time $O(nM^2)$, Memory $O(nM)$
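A minimal sketch of this estimator with plain uniform subsampling of the centers (the simplest of the many subsampling schemes mentioned next); the gaussian_kernel helper, the uniform sampling, and the function names are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom_krr_fit(X, y, lam, M, gamma=1.0, seed=None):
    """Uniform Nystrom subsampling: O(n M^2) time, O(n M) memory."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=M, replace=False)      # centers {x~_1, ..., x~_M}
    X_M = X[idx]
    K_nM = gaussian_kernel(X, X_M, gamma)           # n x M
    K_MM = gaussian_kernel(X_M, X_M, gamma)         # M x M
    A = K_nM.T @ K_nM + lam * n * K_MM              # M x M linear system
    c = np.linalg.solve(A, K_nM.T @ y)
    return X_M, c

def nystrom_krr_predict(X_M, c, X_test, gamma=1.0):
    """f_hat(x) = sum_j K(x, x~_j) * c_j."""
    return gaussian_kernel(X_test, X_M, gamma) @ c
```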
Remarks
◮ Plenty of methods for subsampling ...
◮ Connections to Nyström methods for integral operators
◮ Previous results: kernel approximation $\|\hat K - \hat K_{nM}\, \hat K_{MM}^\dagger\, \hat K_{nM}^\top\|$
Any loss in ACCURACY?
Statistical Guarantees
Let $\mathcal{E}(f) = \mathbb{E}(y - f(x))^2$.
Theorem (R., Camoriano, Rosasco '15). Assume $\exists\, w$ such that $f^*(x) = w^\top \Phi(x)$, subexponential noise, and bounded kernel. Then w.h.p.
$$\mathcal{E}(\hat f_{\lambda,M}) - \mathcal{E}(f^*) \lesssim \frac{1}{\lambda n} + \lambda + \frac{1}{M},$$
so that for
$$\lambda_n = \frac{1}{\sqrt{n}}, \qquad M_n = \frac{1}{\lambda_n},$$
the following holds w.h.p.:
$$\mathcal{E}(\hat f_{\lambda_n, M_n}) - \mathcal{E}(f^*) \lesssim \frac{1}{\sqrt{n}}.$$
Remarks
◮ Bound is minimax optimal, same as Tikhonov
◮ Adaptivity via cross validation or Lepskii method
◮ Refined results, e.g. for Sobolev classes the rate is $n^{-\frac{2s}{2s+d}}$ (R., Camoriano, Rosasco '15)
$M = \sqrt{n}$ guarantees NO loss in accuracy.