Regularization Methods for Large Scale Machine Learning
Alessandro Rudi
University of Genova - Istituto Italiano di Tecnologia - Massachusetts Institute of Technology, lcsl.mit.edu
In collaboration with: Raffaello Camoriano, Lorenzo Rosasco
May 6th, 2017 - RegML, Oslo
Machine Learning Algorithms: Desiderata
◮ flexible non-linear / non-parametric models
◮ scalable computation
◮ statistical guarantees
Statistics and computations (Chandrasekaran, Jordan ’12)
Supervised Learning
Problem: Estimate $f^*$ given $S_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$.
[Figure: sample points $(x_1,y_1), \ldots, (x_5,y_5)$ together with the target function $f^*$]
Setting: $y_i = f^*(x_i) + \varepsilon_i$, $i \in \{1, \ldots, n\}$
◮ $\varepsilon_i \in \mathbb{R}$, $x_i \in \mathbb{R}^d$ random (unknown distribution)
◮ $f^*$ unknown
Recall Kernel Ridge Regression
$$\hat f_\lambda(x) = \Phi(x)^\top \hat w_\lambda = \sum_{i=1}^n K(x, x_i)\,(\hat c_\lambda)_i, \qquad \hat c_\lambda = (\hat K + \lambda n I)^{-1}\hat y$$
◮ $\Phi(x)^\top \Phi(x') = K(x, x')$ kernel, e.g. Gaussian
◮ $\hat K$ the $n \times n$ kernel matrix, $(\hat K)_{ij} = K(x_i, x_j)$
◮ $\hat y$ the outputs vector
Computational complexity: Time $O(n^3)$, Memory $O(n^2)$
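For concreteness, here is a minimal NumPy sketch of the estimator above, assuming a Gaussian kernel; the function names, the width parameter gamma, and the data layout (X an n x d array, y an n-vector) are illustrative choices, not part of the slides.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def krr_fit(X, y, lam, gamma=1.0):
    """Solve (K_hat + lam*n*I) c = y: O(n^3) time, O(n^2) memory."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def krr_predict(X_train, c, X_test, gamma=1.0):
    """f_hat(x) = sum_i K(x, x_i) * c_i."""
    return gaussian_kernel(X_test, X_train, gamma) @ c
```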
Statistical Guarantees
Let $\mathcal{E}(f) = \mathbb{E}(y - f(x))^2$.
Theorem (Caponnetto, De Vito '05). Assume $\exists\, w$ such that $f^*(x) = w^\top \Phi(x)$, subexponential noise, and bounded kernel. Then w.h.p.
$$\mathcal{E}(\hat f_\lambda) - \mathcal{E}(f^*) \lesssim \frac{1}{\lambda n} + \lambda,$$
so that for $\lambda_n = 1/\sqrt{n}$, w.h.p.
$$\mathcal{E}(\hat f_{\lambda_n}) - \mathcal{E}(f^*) \lesssim \frac{1}{\sqrt{n}}.$$
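The choice $\lambda_n = 1/\sqrt{n}$ is just the value that balances the two terms of the bound; a one-line derivation:

```latex
\frac{d}{d\lambda}\left(\frac{1}{\lambda n} + \lambda\right)
= -\frac{1}{\lambda^2 n} + 1 = 0
\;\Longrightarrow\;
\lambda_n = \frac{1}{\sqrt{n}},
\qquad
\frac{1}{\lambda_n n} + \lambda_n = \frac{2}{\sqrt{n}} \lesssim \frac{1}{\sqrt{n}}.
```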
Remarks
◮ Bound is minimax optimal (Caponnetto, De Vito '05)
◮ Adaptivity via cross validation or Lepskii method (Caponnetto, Yao '07; De Vito, Pereverzev, R. '07)
◮ Refined results, e.g. for Sobolev classes the rate is $n^{-\frac{2s}{2s+d}}$ (Caponnetto, De Vito '05)
Computational complexity kills the method for large problems: Time $O(n^3)$, Memory $O(n^2)$
BIG DATA
Where is it possible to run kernel ridge regression?
◮ $n \approx 10\,000$: laptop ($\sim$ 1 gigabyte memory)
◮ $n \approx 100\,000$: desktop ($\sim$ 100 gigabytes memory)
◮ $n \approx 1\,000\,000$: cluster ($\sim$ 10 terabytes memory)
◮ $n \approx 10\,000\,000$: supercomputer (TOP10) ($\sim$ 1 petabyte memory)
High energy physics experiments: $n \approx 10^7$ per second...
Outline
◮ Data-independent subsampling
◮ Data-dependent subsampling
◮ Adaptive subsampling
An idea
If $K(x, x') = \Phi_M(x)^\top \Phi_M(x')$ with $\Phi_M(x) \in \mathbb{R}^M$:
$$\hat f_{\lambda,M}(x) = \Phi_M(x)^\top \hat w_{\lambda,M}, \qquad \hat w_{\lambda,M} = (\hat\Phi^\top \hat\Phi + \lambda n I)^{-1} \hat\Phi^\top \hat y$$
◮ $\hat\Phi = (\Phi_M(x_1), \ldots, \Phi_M(x_n))^\top \in \mathbb{R}^{n \times M}$
◮ $\hat w_{\lambda,M}$ vector in $\mathbb{R}^M$
◮ $\hat y$ outputs vector in $\mathbb{R}^n$
Computational complexity: Time $O(n^3) \to O(nM^2)$, Memory $O(n^2) \to O(nM)$
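A minimal sketch of this primal solver, assuming the explicit features have already been computed as an n x M matrix Phi (names are illustrative); the feature maps constructed in the next slides plug directly into it.

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Solve (Phi^T Phi + lam*n*I) w = Phi^T y: O(n M^2) time, O(n M) memory."""
    n, M = Phi.shape
    A = Phi.T @ Phi + lam * n * np.eye(M)
    return np.linalg.solve(A, Phi.T @ y)

def ridge_predict(Phi_test, w):
    """f_hat(x) = Phi_M(x)^T w."""
    return Phi_test @ w
```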
Kernel Approximation
Find $\Phi_M(x) \in \mathbb{R}^M$ such that $K(x, x') \approx \Phi_M(x)^\top \Phi_M(x')$, then apply the previous algorithm.
Random Fourier Features
Gaussian kernel:
$$e^{-\gamma\|x - x'\|^2} = \mathbb{E}_\omega\!\left[e^{i\omega^\top x}\, e^{-i\omega^\top x'}\right], \qquad \omega \sim \mathcal{N}(0, \gamma)$$
$$\approx \frac{1}{M}\sum_{j=1}^M e^{i\omega_j^\top x}\, e^{-i\omega_j^\top x'}, \qquad \omega_j \sim \mathcal{N}(0, \gamma),$$
$$\Phi_M(x) := \frac{1}{\sqrt{M}}\left(e^{i\omega_1^\top x}, \ldots, e^{i\omega_M^\top x}\right).$$
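A short sketch of this construction (names are illustrative). The code writes the frequency variance explicitly as $2\gamma$ so that $\mathbb{E}[e^{i\omega^\top(x-x')}] = e^{-\gamma\|x-x'\|^2}$ holds exactly with the kernel parameterized as on the slide.

```python
import numpy as np

def sample_frequencies(d, M, gamma, seed=None):
    """Draw M Gaussian frequencies; variance 2*gamma gives
    E[exp(i w^T (x - x'))] = exp(-gamma * ||x - x'||^2) exactly."""
    rng = np.random.default_rng(seed)
    return rng.normal(scale=np.sqrt(2 * gamma), size=(M, d))

def rff_map(X, W):
    """Complex map Phi_M(x) = (1/sqrt(M)) * (exp(i w_1^T x), ..., exp(i w_M^T x))."""
    M = W.shape[0]
    return np.exp(1j * X @ W.T) / np.sqrt(M)       # n x M, complex-valued

# Re(Phi_M(x) . conj(Phi_M(x'))) approximates exp(-gamma * ||x - x'||^2);
# the equivalent real-valued map stacks cos(Wx) and sin(Wx), each scaled by 1/sqrt(M).
```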
Random Features Expansion
Find $q$ such that
$$K(x, x') = \mathbb{E}_\omega\!\left[q(x, \omega)\, q(x', \omega)\right],$$
sample $\omega_1, \ldots, \omega_M$ and consider
$$K(x, x') \approx \Phi_M(x)^\top \Phi_M(x') := \frac{1}{M}\sum_{j=1}^M q(x, \omega_j)\, q(x', \omega_j),$$
$$\Phi_M(x) := \frac{1}{\sqrt{M}}\left(q(x, \omega_1), \ldots, q(x, \omega_M)\right).$$
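In code, the general recipe is a short Monte Carlo loop once $q$ and the sampling law are given; this generic sketch (with hypothetical q / sample_omega callbacks) reduces to the Fourier features above for the Gaussian kernel.

```python
import numpy as np

def random_features(X, q, sample_omega, M, seed=None):
    """Monte Carlo feature map: Phi[i, j] = q(x_i, omega_j) / sqrt(M).

    q(x, omega) -> scalar and sample_omega(rng) -> omega are user-supplied
    placeholders; Phi @ Phi.T then approximates the kernel matrix
    (use the conjugate transpose if q is complex-valued)."""
    rng = np.random.default_rng(seed)
    omegas = [sample_omega(rng) for _ in range(M)]
    Phi = np.array([[q(x, w) for w in omegas] for x in X]) / np.sqrt(M)
    return Phi, omegas
```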
Examples of Random Features
◮ translation invariant kernels
◮ dot product kernels
◮ group invariant kernels
◮ infinite neural net kernels
◮ homogeneous additive kernels
◮ sums, products, compositions of kernels
◮ ...
Computations
If $M \ll n$:
◮ TIME: $O(nM^2) \ll O(n^3)$
◮ SPACE: $O(nM) \ll O(n^2)$
Any loss in ACCURACY?
Previous Results
◮ *Many* different random features for different kernels (Rahimi, Recht '07; Vedaldi, Zisserman; ... 10+)
◮ Theoretical guarantees, mainly kernel approximation (Rahimi, Recht '07, ..., Sriperumbudur, Szabo '15):
$$|K(x, x') - \Phi_M(x)^\top \Phi_M(x')| \lesssim \frac{1}{\sqrt{M}}$$
◮ Theoretical guarantees, generalization bounds (Rahimi, Recht '09; Bach '15):
$$\mathcal{E}(\hat f_{\lambda,M}) - \mathcal{E}(f^*) \lesssim \frac{1}{\sqrt{n}} \;\Longleftarrow\; M = n$$
Statistical Guarantees
Let $\mathcal{E}(f) = \mathbb{E}(y - f(x))^2$.
Theorem (R., Camoriano, Rosasco '16). Assume $\exists\, w$ such that $f^*(x) = w^\top \Phi(x)$, subexponential noise, and bounded kernel. Then w.h.p.
$$\mathcal{E}(\hat f_{\lambda,M}) - \mathcal{E}(f^*) \lesssim \frac{1}{\lambda n} + \frac{1}{M} + \lambda,$$
so that for
$$\lambda_n = \frac{1}{\sqrt{n}}, \qquad M_n = \frac{1}{\lambda_n},$$
the following holds w.h.p.:
$$\mathcal{E}(\hat f_{\lambda_n, M_n}) - \mathcal{E}(f^*) \lesssim \frac{1}{\sqrt{n}}.$$
Remarks
◮ Bound is minimax optimal, same as Tikhonov
◮ Adaptivity via cross validation or Lepskii method
◮ Refined results, e.g. for Sobolev classes the rate is $n^{-\frac{2s}{2s+d}}$ (R., Camoriano, Rosasco '16)
$M = \sqrt{n}$ guarantees NO loss in accuracy.
Computational complexity: Time $O(n^3) \to O(n^2)$, Memory $O(n^2) \to O(n\sqrt{n})$
M controls space, time and statistics
Corollary: the same result holds if we set
$$M_n = \sqrt{n}, \qquad \lambda_n = \frac{1}{M_n}.$$
◮ $M$ can be seen as a regularization parameter
New incremental algorithm (sketched below):
1. Pick a random feature + compute solution
2. Pick another random feature + rank-one update
3. Pick another random feature ...
M controls at the same time: Space, Time, Statistics
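A hedged sketch of how such an incremental scheme could look in NumPy/SciPy; this illustrates the idea rather than the authors' implementation. A bordered Cholesky update absorbs each new feature in $O(nM + M^2)$ operations instead of re-solving from scratch, and new_feature is a hypothetical callback returning one random feature column evaluated on all data.

```python
import numpy as np
from scipy.linalg import solve_triangular

def incremental_rf_ridge(X, y, lam, new_feature, M_max):
    """Grow the random-feature model one feature at a time.

    new_feature(X, j) -> (n,) array: the j-th random feature evaluated on all
    data points (e.g. a single random Fourier feature). User-supplied placeholder.
    """
    n = X.shape[0]
    Phi = np.empty((n, 0))        # current feature matrix
    L = np.empty((0, 0))          # Cholesky factor of A_M = Phi^T Phi + lam*n*I
    b = np.empty(0)               # Phi^T y
    w = np.empty(0)
    for j in range(M_max):
        phi = new_feature(X, j)                      # one new random feature column
        a = phi @ phi + lam * n                      # new diagonal entry of A_{M+1}
        if Phi.shape[1] == 0:
            L = np.array([[np.sqrt(a)]])
        else:
            v = Phi.T @ phi                          # O(nM): correlations with old features
            l = solve_triangular(L, v, lower=True)   # O(M^2) triangular solve
            d = np.sqrt(a - l @ l)
            # bordered Cholesky update: append one row/column instead of refactorizing
            L = np.block([[L, np.zeros((L.shape[0], 1))],
                          [l[None, :], np.array([[d]])]])
        Phi = np.column_stack([Phi, phi])
        b = np.append(b, phi @ y)
        # current solution of (Phi^T Phi + lam*n*I) w = Phi^T y via two triangular solves
        w = solve_triangular(L.T, solve_triangular(L, b, lower=True), lower=False)
        # here one could track validation error and stop early: M acts as the regularizer
    return w, Phi
```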
Outline
◮ Data-independent subsampling
◮ Data-dependent subsampling
◮ Adaptive subsampling
Data dependent subsampling with Nyström
$\{\tilde x_1, \ldots, \tilde x_M\} \subseteq \{x_1, \ldots, x_n\}$
$$\hat f_{\lambda,M}(x) = \sum_{j=1}^M K(x, \tilde x_j)\,(\hat c_{\lambda,M})_j, \qquad \hat c_{\lambda,M} = (\hat K_{nM}^\top \hat K_{nM} + \lambda n\, \hat K_{MM})^{-1} \hat K_{nM}^\top \hat y$$
◮ $(\hat K_{nM})_{ij} = K(x_i, \tilde x_j)$, an $n \times M$ matrix
◮ $(\hat K_{MM})_{ij} = K(\tilde x_i, \tilde x_j)$, an $M \times M$ matrix
◮ $\hat y$ outputs vector
Computational complexity: Time $O(nM^2)$, Memory $O(nM)$
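A minimal sketch of this estimator with plain uniform subsampling of the centers (the simplest of the many subsampling schemes mentioned next); the gaussian_kernel helper, the uniform sampling, and the function names are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom_krr_fit(X, y, lam, M, gamma=1.0, seed=None):
    """Uniform Nystrom subsampling: O(n M^2) time, O(n M) memory."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=M, replace=False)      # centers {x~_1, ..., x~_M}
    X_M = X[idx]
    K_nM = gaussian_kernel(X, X_M, gamma)           # n x M
    K_MM = gaussian_kernel(X_M, X_M, gamma)         # M x M
    A = K_nM.T @ K_nM + lam * n * K_MM              # M x M linear system
    c = np.linalg.solve(A, K_nM.T @ y)
    return X_M, c

def nystrom_krr_predict(X_M, c, X_test, gamma=1.0):
    """f_hat(x) = sum_j K(x, x~_j) * c_j."""
    return gaussian_kernel(X_test, X_M, gamma) @ c
```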
Remarks
◮ Plenty of methods for subsampling ...
◮ Connections to Nyström methods for integral operators
◮ Previous results: kernel approximation $\|\hat K - \hat K_{nM}\, \hat K_{MM}^\dagger\, \hat K_{nM}^\top\|$
Any loss in ACCURACY?
Statistical Guarantees
Let $\mathcal{E}(f) = \mathbb{E}(y - f(x))^2$.
Theorem (R., Camoriano, Rosasco '15). Assume $\exists\, w$ such that $f^*(x) = w^\top \Phi(x)$, subexponential noise, and bounded kernel. Then w.h.p.
$$\mathcal{E}(\hat f_{\lambda,M}) - \mathcal{E}(f^*) \lesssim \frac{1}{\lambda n} + \lambda + \frac{1}{M},$$
so that for
$$\lambda_n = \frac{1}{\sqrt{n}}, \qquad M_n = \frac{1}{\lambda_n},$$
the following holds w.h.p.:
$$\mathcal{E}(\hat f_{\lambda_n, M_n}) - \mathcal{E}(f^*) \lesssim \frac{1}{\sqrt{n}}.$$
Remarks
◮ Bound is minimax optimal, same as Tikhonov
◮ Adaptivity via cross validation or Lepskii method
◮ Refined results, e.g. for Sobolev classes the rate is $n^{-\frac{2s}{2s+d}}$ (R., Camoriano, Rosasco '15)
$M = \sqrt{n}$ guarantees NO loss in accuracy.