Regularization Methods for Large Scale Machine Learning


  1. Regularization Methods for Large Scale Machine Learning
     Alessandro Rudi
     University of Genova - Istituto Italiano di Tecnologia
     Massachusetts Institute of Technology, lcsl.mit.edu
     In collaboration with: Raffaello Camoriano, Lorenzo Rosasco
     May 6th, 2017 - RegML, Oslo

  2. Machine Learning Algorithms: Desiderata
     ◮ flexible non-linear / non-parametric models
     ◮ scalable computation
     ◮ statistical guarantees

  3. Statistics and computations (Chandrasekaran, Jordan ’12)

  4.-6. Supervised Learning
     Problem: estimate f* given S_n = {(x_1, y_1), ..., (x_n, y_n)}.
     [Figure: noisy samples (x_1, y_1), ..., (x_5, y_5) scattered around the target function f*.]
     Setting: y_i = f*(x_i) + ε_i, i ∈ {1, ..., n}
     ◮ ε_i ∈ R, x_i ∈ R^d random (unknown distribution)
     ◮ f* unknown
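
A minimal synthetic-data sketch of this setting (the target f*, the noise level, and the sampling distribution below are illustrative assumptions, not taken from the slides):

    import numpy as np

    rng = np.random.default_rng(0)

    def f_star(x):
        # hypothetical target function, chosen only for illustration
        return np.sin(3 * x[:, 0])

    n, d = 1000, 1
    X = rng.uniform(-1.0, 1.0, size=(n, d))    # x_i drawn from some (unknown) distribution
    eps = 0.1 * rng.standard_normal(n)         # noise ε_i
    y = f_star(X) + eps                        # y_i = f*(x_i) + ε_i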

  7. Recall Kernel Ridge Regression
     f̂_λ(x) = Φ(x)^⊤ ŵ_λ = Σ_{i=1}^n (ĉ_λ)_i K(x, x_i),    ĉ_λ = (K̂ + λnI)^{-1} ŷ
     ◮ Φ(x)^⊤ Φ(x') = K(x, x') kernel, e.g. Gaussian
     ◮ K̂ the n by n kernel matrix
     ◮ ŷ the outputs vector
     Computational complexity: Time O(n^3), Memory O(n^2)
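
A minimal NumPy sketch of this estimator, assuming the synthetic (X, y) from the snippet above and a Gaussian kernel with an illustrative, untuned bandwidth gamma:

    import numpy as np

    def gaussian_kernel(A, B, gamma=1.0):
        # K(x, x') = exp(-gamma * ||x - x'||^2)
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)

    def krr_fit(X, y, lam, gamma=1.0):
        n = X.shape[0]
        K = gaussian_kernel(X, X, gamma)                     # n x n kernel matrix: O(n^2) memory
        return np.linalg.solve(K + lam * n * np.eye(n), y)   # c = (K + λnI)^{-1} y: O(n^3) time

    def krr_predict(X_train, c, X_test, gamma=1.0):
        return gaussian_kernel(X_test, X_train, gamma) @ c   # f(x) = Σ_i c_i K(x, x_i)

    c_hat = krr_fit(X, y, lam=1.0 / np.sqrt(len(X)))
    y_pred = krr_predict(X, c_hat, X)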

  8. Statistical Guarantees
     Let E(f) = E(y - f(x))^2.
     Theorem (Caponnetto, De Vito '05). Assume there exists w such that f*(x) = w^⊤ Φ(x),
     sub-exponential noise, and a bounded kernel. Then w.h.p.
     E(f̂_λ) - E(f*) ≲ 1/(λn) + λ,
     so that for λ_n = 1/√n, w.h.p.
     E(f̂_{λ_n}) - E(f*) ≲ 1/√n.
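
A quick check on the choice of λ_n (a one-line balancing argument, not spelled out on the slide): the two terms of the bound are equal when

     1/(λn) = λ  ⇔  λ = 1/√n,    and then    1/(λ_n n) + λ_n = 2/√n ≲ 1/√n.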

  9. Remarks
     ◮ Bound is minimax optimal (Caponnetto, De Vito '05)
     ◮ Adaptivity via cross validation or the Lepskii method (Caponnetto, Yao '07; De Vito, Pereverzev, R. '07)
     ◮ Refined results, e.g. for Sobolev classes the rate is n^{-2s/(2s+d)} (Caponnetto, De Vito '05)
     Computational complexity kills the method for large problems:
     Time O(n^3), Memory O(n^2)

  10. BIG DATA
     Where it is possible to run Kernel Ridge Regression:
     ◮ n ≈ 10,000: laptop (∼1 gigabyte of memory)
     ◮ n ≈ 100,000: desktop (∼100 gigabytes of memory)
     ◮ n ≈ 1,000,000: cluster (∼10 terabytes of memory)
     ◮ n ≈ 10,000,000: supercomputer (TOP10) (∼1 petabyte of memory)
     High energy physics experiments: n ≈ 10^7 per second...
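
These figures are consistent with storing the n by n kernel matrix in double precision (a back-of-the-envelope check, assuming 8 bytes per entry and ignoring solver workspace):

     memory ≈ 8 n^2 bytes:  n = 10^4 → 0.8 GB,  n = 10^5 → 80 GB,  n = 10^6 → 8 TB,  n = 10^7 → 0.8 PB.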

  11. Outline
     ◮ Data independent subsampling
     ◮ Data dependent subsampling
     ◮ Adaptive subsampling

  12.-13. An idea
     If K(x, x') = Φ_M(x)^⊤ Φ_M(x') with Φ_M(x) ∈ R^M, then
     f̂_{λ,M}(x) = Φ_M(x)^⊤ ŵ_{λ,M},    ŵ_{λ,M} = (Φ̂^⊤ Φ̂ + λnI)^{-1} Φ̂^⊤ ŷ
     ◮ Φ̂ = (Φ_M(x_1), ..., Φ_M(x_n))^⊤ ∈ R^{n×M}
     ◮ ŵ_{λ,M} vector in R^M
     ◮ ŷ outputs vector in R^n
     Computational complexity: Time O(n^3) → O(nM^2), Memory O(n^2) → O(nM)
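
A minimal sketch of this primal-form solver for a generic feature matrix (the helper names are illustrative; any of the feature maps discussed below can produce Phi):

    import numpy as np

    def ridge_with_features(Phi, y, lam):
        # Phi: (n, M) feature matrix, y: (n,) outputs
        n, M = Phi.shape
        A = Phi.T @ Phi + lam * n * np.eye(M)   # M x M system, O(nM^2) to form
        b = Phi.T @ y                           # O(nM)
        return np.linalg.solve(A, b)            # O(M^3)

    def predict(feature_map, w, X_test):
        # feature_map: any callable returning an (n_test, M) array
        return feature_map(X_test) @ w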

  14. Kernel Approximation
     Find Φ_M(x) ∈ R^M such that K(x, x') ≈ Φ_M(x)^⊤ Φ_M(x'), then apply the previous algorithm.

  15.-17. Random Fourier Features
     Gaussian kernel:
     e^{-γ‖x - x'‖^2} = E_ω[ e^{iω^⊤x} e^{-iω^⊤x'} ],    ω ∼ N(0, γ)
                      ≈ (1/M) Σ_{j=1}^M e^{iω_j^⊤x} e^{-iω_j^⊤x'},    ω_j ∼ N(0, γ),
     Φ_M(x) := (1/√M) (e^{iω_1^⊤x}, ..., e^{iω_M^⊤x}).
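
A small NumPy sketch of this feature map, written in the standard real cos/sin form of the complex exponentials (equivalent up to a change of representation); the sampling variance 2*gamma follows the usual Bochner parametrization and may differ from the slide's N(0, γ) convention by a constant:

    import numpy as np

    def rff_map(X, M, gamma, seed=0):
        # Random Fourier features with Phi_M(x)^T Phi_M(x') ≈ exp(-gamma * ||x - x'||^2)
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        W = rng.standard_normal((d, M)) * np.sqrt(2.0 * gamma)   # ω_j ~ N(0, 2γ I)
        Z = X @ W
        return np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(M)    # (n, 2M) features

    # plug into the primal ridge solver sketched earlier:
    # Phi = rff_map(X, M=int(np.sqrt(len(X))), gamma=1.0)
    # w = ridge_with_features(Phi, y, lam=1.0 / np.sqrt(len(X)))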

  18.-19. Random Features Expansion
     Find q such that K(x, x') = E_ω[ q(x, ω) q(x', ω) ],
     sample ω_1, ..., ω_M and consider
     K(x, x') ≈ Φ_M(x)^⊤ Φ_M(x') := (1/M) Σ_{j=1}^M q(x, ω_j) q(x', ω_j),
     Φ_M(x) := (1/√M) (q(x, ω_1), ..., q(x, ω_M)).
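
The general recipe translates directly into code; a sketch, where q and the sampled omegas are user-supplied placeholders for whichever expansion is being used:

    import numpy as np

    def random_feature_map(X, q, omegas):
        # Phi_M(x) = (1/√M) (q(x, ω_1), ..., q(x, ω_M));
        # q(X, omega) is assumed to be vectorized over the rows of X.
        M = len(omegas)
        cols = [q(X, omega) for omega in omegas]      # each column has shape (n,)
        return np.stack(cols, axis=1) / np.sqrt(M)    # (n, M)

    # e.g. for translation invariant kernels: q(x, (ω, b)) = sqrt(2) * cos(ω·x + b)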

  20. Examples of Random Features
     ◮ translation invariant kernels
     ◮ dot product kernels
     ◮ group invariant kernels
     ◮ infinite neural nets kernels
     ◮ homogeneous additive kernels
     ◮ sums, products, compositions of kernels
     ◮ ...

  21. Computations
     If M ≪ n:
     ◮ TIME: O(nM^2) ≪ O(n^3)
     ◮ SPACE: O(nM) ≪ O(n^2)
     Any loss in ACCURACY?

  22. Previous Results
     ◮ *Many* different random features for different kernels (Rahimi, Recht '07; Vedaldi, Zisserman; ... 10+)
     ◮ Theoretical guarantees, mainly kernel approximation (Rahimi, Recht '07, ..., Sriperumbudur, Szabo '15):
       |K(x, x') - Φ_M(x)^⊤ Φ_M(x')| ≲ 1/√M
     ◮ Theoretical guarantees, generalization bounds (Rahimi, Recht '09; Bach '15):
       E(f̂_{λ,M}) - E(f*) ≲ 1/√n  provided  M = n

  23. Statistical Guarantees
     Let E(f) = E(y - f(x))^2.
     Theorem (R., Camoriano, Rosasco '16). Assume there exists w such that f*(x) = w^⊤ Φ(x),
     sub-exponential noise, and a bounded kernel. Then w.h.p.
     E(f̂_{λ,M}) - E(f*) ≲ 1/(λn) + 1/M + λ,
     so that for λ_n = 1/√n and M_n = 1/λ_n the following holds w.h.p.:
     E(f̂_{λ_n,M_n}) - E(f*) ≲ 1/√n.

  24.-25. Remarks
     ◮ Bound is minimax optimal, same as Tikhonov
     ◮ Adaptivity via cross validation or the Lepskii method
     ◮ Refined results, e.g. for Sobolev classes the rate is n^{-2s/(2s+d)} (R., Camoriano, Rosasco '16)
     M = √n guarantees NO loss in accuracy.
     Computational complexity: Time O(n^3) → O(n^2), Memory O(n^2) → O(n√n)

  26.-31. M controls Space, Time and Statistics
     Corollary: same result if we set M_n = √n, λ_n = 1/M_n.
     ◮ M can be seen as a regularization parameter
     New incremental algorithm (see the sketch below):
       1. Pick a random feature + compute the solution
       2. Pick another random feature + rank-one update
       3. Pick another random feature ...
     M controls at the same time: Space, Time, Statistics
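
One possible reading of the incremental scheme as code (my own sketch, not code from the talk): grow the feature matrix one column at a time and extend a Cholesky factor of Φ̂^⊤Φ̂ + λnI by one row per new feature, so each step costs O(nM + M^2) instead of a full refactorization.

    import numpy as np
    from scipy.linalg import solve_triangular

    def incremental_rf_ridge(y, feature_columns, lam):
        # feature_columns: iterable yielding one new random-feature column phi of shape (n,)
        n = len(y)
        L = np.zeros((0, 0))     # Cholesky factor of the current M x M system
        Phi = np.zeros((n, 0))   # feature columns added so far
        b = np.zeros(0)          # Phi^T y
        solutions = []
        for phi in feature_columns:
            a = Phi.T @ phi                              # couplings with the old features
            alpha = phi @ phi + lam * n
            l = solve_triangular(L, a, lower=True) if L.size else np.zeros(0)
            d = np.sqrt(alpha - l @ l)
            M = L.shape[0]
            if M == 0:
                L = np.array([[d]])
            else:
                # one-column extension: new factor is [[L, 0], [l^T, d]]
                L = np.block([[L, np.zeros((M, 1))], [l[None, :], np.array([[d]])]])
            Phi = np.column_stack([Phi, phi])
            b = np.append(b, phi @ y)
            # current solution via two triangular solves: O(M^2)
            w = solve_triangular(L.T, solve_triangular(L, b, lower=True), lower=False)
            solutions.append(w)
        return solutions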

  32. Outline
     ◮ Data independent subsampling
     ◮ Data dependent subsampling
     ◮ Adaptive subsampling

  33.-35. Data dependent subsampling with Nyström
     {x̃_1, ..., x̃_M} ⊆ {x_1, ..., x_n}
     f̂_{λ,M}(x) = Σ_{j=1}^M (ĉ_{λ,M})_j K(x, x̃_j),
     ĉ_{λ,M} = (K̂_{nM}^⊤ K̂_{nM} + λn K̂_{MM})^{-1} K̂_{nM}^⊤ ŷ
     ◮ (K̂_{nM})_{ij} = K(x_i, x̃_j), n by M matrix
     ◮ (K̂_{MM})_{ij} = K(x̃_i, x̃_j), M by M matrix
     ◮ ŷ outputs vector
     Computational complexity: Time O(nM^2), Memory O(nM)
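
A minimal NumPy sketch of this estimator with uniformly subsampled centers (one of many possible selection schemes), reusing the gaussian_kernel helper from the earlier KRR snippet:

    import numpy as np

    def nystrom_fit(X, y, M, lam, gamma=1.0, seed=0):
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        idx = rng.choice(n, size=M, replace=False)         # centers: uniform subsample of the inputs
        X_M = X[idx]
        K_nM = gaussian_kernel(X, X_M, gamma)              # n x M
        K_MM = gaussian_kernel(X_M, X_M, gamma)            # M x M
        A = K_nM.T @ K_nM + lam * n * K_MM                 # M x M system: O(nM^2) time, O(nM) memory
        c = np.linalg.lstsq(A, K_nM.T @ y, rcond=None)[0]  # lstsq in case K_MM is rank deficient
        return X_M, c

    def nystrom_predict(X_M, c, X_test, gamma=1.0):
        return gaussian_kernel(X_test, X_M, gamma) @ c     # f(x) = Σ_j c_j K(x, x̃_j)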

  36. Remarks
     ◮ Plenty of methods for subsampling ...
     ◮ Connections to Nyström methods for integral operators
     ◮ Previous results: kernel approximation, ‖K̂ - K̂_{nM} K̂_{MM}^† K̂_{nM}^⊤‖
     Any loss in ACCURACY?

  37. Statistical Guarantees
     Let E(f) = E(y - f(x))^2.
     Theorem (R., Camoriano, Rosasco '15). Assume there exists w such that f*(x) = w^⊤ Φ(x),
     sub-exponential noise, and a bounded kernel. Then w.h.p.
     E(f̂_{λ,M}) - E(f*) ≲ 1/(λn) + λ + 1/M,
     so that for λ_n = 1/√n and M_n = 1/λ_n the following holds w.h.p.:
     E(f̂_{λ_n,M_n}) - E(f*) ≲ 1/√n.

  38. Remarks
     ◮ Bound is minimax optimal, same as Tikhonov
     ◮ Adaptivity via cross validation or the Lepskii method
     ◮ Refined results, e.g. for Sobolev classes the rate is n^{-2s/(2s+d)} (R., Camoriano, Rosasco '15)
     M = √n guarantees NO loss in accuracy.
