

  1. An inverse problem perspective on machine learning
     Lorenzo Rosasco
     University of Genova, Massachusetts Institute of Technology, Istituto Italiano di Tecnologia
     lcsl.mit.edu
     Feb 9th, 2018, Inverse Problems and Machine Learning Workshop, CM+X, Caltech

  2. Today's selection
     - Classics: "Learning as an inverse problem"
     - Latest releases: "Kernel methods as a test bed for algorithm design"

  3. Outline
     - Learning theory 2000
     - Learning as an inverse problem
     - Regularization
     - Recent advances

  4. What's learning?
     [Figure: five labeled training points $(x_1, y_1), \dots, (x_5, y_5)$]

  5. What's learning?
     [Figure: the same training points, plus new inputs $(x_6, ?)$ and $(x_7, ?)$ with unknown labels]

  6. What's learning?
     [Figure: training points and new unlabeled inputs, as above]
     Learning is about inference, not interpolation.

  7. Statistical Machine Learning (ML)
     - $(X, Y)$ a pair of random variables in $\mathcal{X} \times \mathbb{R}$
     - $L : \mathbb{R} \times \mathbb{R} \to [0, \infty)$ a loss function
     - $\mathcal{H} \subset \mathbb{R}^{\mathcal{X}}$ a hypothesis space
     Problem: solve
     $$\min_{f \in \mathcal{H}} \mathbb{E}[L(f(X), Y)]$$
     given only $(x_1, y_1), \dots, (x_n, y_n)$, a sample of $n$ i.i.d. copies of $(X, Y)$.

  8. ML theory around 2000-2010
     - All algorithms are ERM (empirical risk minimization):
       $$\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i)$$
       [Vapnik '96]
     - Emphasis on empirical process theory...
       $$\mathbb{P}\Bigg( \sup_{f \in \mathcal{H}} \Bigg| \frac{1}{n} \sum_{i=1}^{n} L(f(X_i), Y_i) - \mathbb{E}[L(f(X), Y)] \Bigg| > \epsilon \Bigg)$$
       [Vapnik, Chervonenkis '71; Dudley, Giné, Zinn '94]
     - ...and complexity measures, e.g. Gaussian/Rademacher complexities
       $$C(\mathcal{H}) = \mathbb{E} \sup_{f \in \mathcal{H}} \sum_{i=1}^{n} \sigma_i f(X_i)$$
       [Bartlett, Bousquet, Koltchinskii, Massart, Mendelson... '00]
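
The following Python sketch is not from the slides; it illustrates the two objects above on a toy one-parameter class, assuming the squared loss (the data, the class, and the Monte Carlo estimator are all illustrative choices): ERM picks the empirically best hypothesis, and the empirical Rademacher complexity is estimated by averaging the supremum over random sign patterns.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy sample and a small finite hypothesis class {f_w : x -> w * x}
X = rng.uniform(-1, 1, size=100)
y = np.sin(3 * X) + 0.1 * rng.standard_normal(100)
W = np.linspace(-2.0, 2.0, 41)                    # candidate slopes

# ERM: pick the hypothesis minimizing the empirical risk (squared loss)
emp_risk = [(np.mean((w * X - y) ** 2), w) for w in W]
best_risk, w_hat = min(emp_risk)

def rademacher_complexity(X, W, n_trials=2000):
    # Monte Carlo estimate of E_sigma sup_f sum_i sigma_i f(x_i)
    vals = np.empty(n_trials)
    for t in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=X.shape[0])
        vals[t] = max(np.dot(sigma, w * X) for w in W)
    return vals.mean()

print(w_hat, best_risk, rademacher_complexity(X, W))
```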

  9. Around the same time
     - Cucker and Smale, On the mathematical foundations of learning, Bull. AMS
     - Caponnetto, De Vito, R., Verri, Learning as an Inverse Problem, JMLR
     - Smale, Zhou, Shannon sampling and function reconstruction from point values, Bull. AMS

  10. Outline
      - Learning theory 2000
      - Learning as an inverse problem
      - Regularization
      - Recent advances

  11. Inverse Problems (IP)
      - $A : \mathcal{H} \to \mathcal{G}$ a bounded linear operator between Hilbert spaces
      - $g \in \mathcal{G}$
      Problem: find $f$ solving $A f = g$, assuming $A$ and $g_\delta$ are given, with $\|g - g_\delta\| \le \delta$.
      [Engl, Hanke, Neubauer '96]

  12. Ill-posedness
      - Existence: $g \notin \mathrm{Range}(A)$
      - Uniqueness: $\mathrm{Ker}(A) \neq \{0\}$
      - Stability: $\|A^\dagger\| = \infty$ (large is also a mess)
      [Figure: $A$ mapping $\mathcal{H}$ to $\mathcal{G}$, with $f^\dagger$, $g$, $g_\delta$ and $\mathrm{Range}(A)$]
      $$f^\dagger = A^\dagger g = \operatorname*{argmin}_{f \in \mathcal{O}} \|f\|_{\mathcal{H}}, \qquad \mathcal{O} = \operatorname*{argmin}_{f \in \mathcal{H}} \|A f - g\|^2$$

  13. Is machine learning an inverse problem?
      Inverse problem: given $A : \mathcal{H} \to \mathcal{G}$ and $g_\delta$ with $\|g - g_\delta\| \le \delta$, find $f$ solving $A f = g$.
      Machine learning: given only $(x_1, y_1), \dots, (x_n, y_n)$, with loss $L : \mathbb{R} \times \mathbb{R} \to [0, \infty)$ and $\mathcal{H} \subset \mathbb{R}^{\mathcal{X}}$, solve $\min_{f \in \mathcal{H}} \mathbb{E}[L(f(X), Y)]$.
      Actually yes, under some assumptions.

  14. Key assumptions: least squares and RKHS
      Assumption (loss): $L(f(x), y) = (f(x) - y)^2$.
      Assumption (hypothesis space):
      - $(\mathcal{H}, \langle \cdot, \cdot \rangle)$ is a real, separable Hilbert space
      - the evaluation functionals are continuous: for all $x \in \mathcal{X}$, let $e_x : \mathcal{H} \to \mathbb{R}$ with $e_x(f) = f(x)$; then $|e_x(f) - e_x(f')| \lesssim \|f - f'\|$
      [Aronszajn '50]

  15. Key assumptions: least squares and RKHS
      Assumption (loss): $L(f(x), y) = (f(x) - y)^2$.
      Assumption (hypothesis space):
      - $(\mathcal{H}, \langle \cdot, \cdot \rangle)$ is a real, separable Hilbert space
      - the evaluation functionals are continuous: for all $x \in \mathcal{X}$, let $e_x : \mathcal{H} \to \mathbb{R}$ with $e_x(f) = f(x)$; then $|e_x(f) - e_x(f')| \lesssim \|f - f'\|$
      Implications [Aronszajn '50]:
      - $\|f\|_\infty \lesssim \|f\|$
      - there exists $k_x \in \mathcal{H}$ such that $f(x) = \langle f, k_x \rangle$

  16. Interpolation and sampling operator [Bertero, De Mol, Pike '85, '88]
      Sampling operator: $S_n : \mathcal{H} \to \mathbb{R}^n$, $(S_n f)_i = \langle f, k_{x_i} \rangle$, for all $i = 1, \dots, n$.
      Interpolation: $f(x_i) = \langle f, k_{x_i} \rangle = y_i$, $i = 1, \dots, n$, i.e. $S_n f = y$.
      [Figure: a function on $\mathcal{X}$ interpolating samples at $x_1, \dots, x_5$]
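
To make the sampling operator concrete, here is a minimal Python sketch (the Gaussian kernel, the variable names, and the toy data are my assumptions, not from the slides): for a function $f = \sum_j c_j k_{z_j}$ in the span of kernel sections, computing $(S_n f)_i = \langle f, k_{x_i} \rangle = f(x_i)$ reduces to a kernel-matrix product.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # k(a, b) = exp(-||a - b||^2 / sigma)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma)

def sample_operator(x_train, z, c, sigma=1.0):
    # (S_n f)_i = <f, k_{x_i}> = f(x_i) for f = sum_j c_j k_{z_j}
    return gaussian_kernel(x_train, z, sigma) @ c

rng = np.random.default_rng(0)
z = rng.uniform(-1, 1, size=(5, 2))       # centers defining f
c = rng.standard_normal(5)                # coefficients of f
x_train = rng.uniform(-1, 1, size=(8, 2)) # sampling points x_1, ..., x_n

print(sample_operator(x_train, z, c))     # the vector (f(x_1), ..., f(x_n))
```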

  17. Learning and restriction operator [Caponnetto, De Vito, R. '05]
      Restriction operator: $S_\rho : \mathcal{H} \to L^2(\mathcal{X}, \rho)$, $(S_\rho f)(x) = \langle f, k_x \rangle$, $\rho$-almost surely.
      Learning: $\langle f, k_x \rangle = f_\rho(x)$ $\rho$-a.s., i.e. $S_\rho f = f_\rho$, where
      $$f_\rho(x) = \int y \, d\rho(y \mid x) \quad \rho\text{-almost surely}, \qquad L^2(\mathcal{X}, \rho) = \Big\{ f \in \mathbb{R}^{\mathcal{X}} \;\Big|\; \|f\|_\rho^2 = \int d\rho \, |f(x)|^2 < \infty \Big\}.$$
      [Figure: $S_\rho$ mapping $\mathcal{H}$ into $L^2(\mathcal{X}, \rho)$]

  18. Learning as an inverse problem
      Inverse problem: find $f$ solving $S_\rho f = f_\rho$, given $S_n$ and $y_n = (y_1, \dots, y_n)$.

  19. Learning as an inverse problem
      Inverse problem: find $f$ solving $S_\rho f = f_\rho$, given $S_n$ and $y_n = (y_1, \dots, y_n)$.
      Least squares connection:
      $$\mathbb{E}(f(X) - Y)^2 - \mathbb{E}(f_\rho(X) - Y)^2 = \|S_\rho f - f_\rho\|_\rho^2, \qquad \text{so the problem becomes} \quad \min_{f \in \mathcal{H}} \|S_\rho f - f_\rho\|_\rho^2.$$
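
The identity above follows from the standard orthogonality argument, which the slide leaves implicit; a short derivation:

```latex
\mathbb{E}\,(f(X) - Y)^2
  = \mathbb{E}\,\big( f(X) - f_\rho(X) + f_\rho(X) - Y \big)^2
  = \mathbb{E}\,(f(X) - f_\rho(X))^2 + \mathbb{E}\,(f_\rho(X) - Y)^2 ,
```

since the cross term $2\,\mathbb{E}\big[(f(X) - f_\rho(X))\,\mathbb{E}[f_\rho(X) - Y \mid X]\big]$ vanishes by the definition of $f_\rho$, and the first term is exactly $\|S_\rho f - f_\rho\|_\rho^2$.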

  20. Let's see what we got
      - Noise model
      - Integral operators & covariance operators
      - Kernels

  21. Noise model
      Ideal: $S_\rho f = f_\rho$, i.e. $S_\rho^* S_\rho f = S_\rho^* f_\rho$.
      Empirical: $S_n f = y$, i.e. $S_n^* S_n f = S_n^* y$.
      Noise model: $\|S_n^* y - S_\rho^* f_\rho\| \le \delta_1$, $\|S_n^* S_n - S_\rho^* S_\rho\| \le \delta_2$.
      (Compare: discretization of inverse problems, econometrics.)

  22. Integral and covariance operators
      - Extension operator $S_\rho^* : L^2(\mathcal{X}, \rho) \to \mathcal{H}$:
        $$(S_\rho^* f)(x') = \int d\rho(x) \, k(x', x) f(x), \qquad \text{where } k(x, x') = \langle k_x, k_{x'} \rangle \text{ is positive definite}$$
      - Covariance operator $S_\rho^* S_\rho : \mathcal{H} \to \mathcal{H}$:
        $$S_\rho^* S_\rho = \int d\rho(x) \, k_x \otimes k_x$$

  23. Kernels
      Choosing an RKHS implies choosing a representation.
      Theorem (Moore-Aronszajn). Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be positive definite. Then the completion of
      $$\Big\{ f \in \mathbb{R}^{\mathcal{X}} \;\Big|\; f = \sum_{j=1}^{N} c_j k_{x_j}, \; c_1, \dots, c_N \in \mathbb{R}, \; x_1, \dots, x_N \in \mathcal{X}, \; N \in \mathbb{N} \Big\}$$
      with respect to $\langle k_x, k_{x'} \rangle = k(x, x')$ is an RKHS.

  24. Kernels
      If $K(x, x') = x^\top x'$, then:
      - $S_n$ is the $n \times D$ data matrix ($S_\rho$ the "infinite data matrix")
      - $S_n^* S_n$ and $S_\rho^* S_\rho$ are the empirical and true covariance operators

  25. Kernels
      If $K(x, x') = x^\top x'$, then:
      - $S_n$ is the $n \times D$ data matrix ($S_\rho$ the "infinite data matrix")
      - $S_n^* S_n$ and $S_\rho^* S_\rho$ are the empirical and true covariance operators
      Other kernels:
      - $K(x, x') = (1 + x^\top x')^p$
      - $K(x, x') = e^{-\|x - x'\|^2 / \sigma}$
      - $K(x, x') = e^{-\|x - x'\| / \sigma}$
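
A minimal Python sketch of the kernels listed above, together with the linear-kernel identifications from the previous slide. The width parameter sigma and degree p are illustrative choices (the slide's parameter symbols are garbled in the transcript).

```python
import numpy as np

def linear_kernel(A, B):
    return A @ B.T

def polynomial_kernel(A, B, p=3):
    return (1.0 + A @ B.T) ** p

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma)

def laplacian_kernel(A, B, sigma=1.0):
    d = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
    return np.exp(-d / sigma)

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))   # n = 10 points, D = 3 features

# Linear kernel: S_n is the n x D data matrix X, the kernel matrix is
# K_n = S_n S_n^* = X X^T, and S_n^* S_n = X^T X is the (uncentered,
# unnormalized) empirical covariance.
K_n = linear_kernel(X, X)          # shape (10, 10)
cov = X.T @ X                      # shape (3, 3)

print(K_n.shape, cov.shape, gaussian_kernel(X, X).shape)
```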

  26. What now? Steal

  27. Outline
      - Learning theory 2000
      - Learning as an inverse problem
      - Regularization
      - Recent advances

  28. Tikhonov aka ridge regression
      $$f_n^\lambda = (S_n^* S_n + \lambda n I)^{-1} S_n^* y$$

  29. Tikhonov aka ridge regression
      $$f_n^\lambda = (S_n^* S_n + \lambda n I)^{-1} S_n^* y = S_n^* \big( \underbrace{S_n S_n^*}_{K_n} + \lambda n I \big)^{-1} y,$$
      i.e. $f_n^\lambda = S_n^* c$ with coefficients solving $(K_n + \lambda n I) c = y$.
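
A minimal Python sketch of the computation above, assuming a Gaussian kernel (the kernel choice, parameter values, and variable names are mine, not from the slides): solve $(K_n + \lambda n I) c = y$ and predict with $f(x) = \sum_i c_i k(x, x_i)$.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma)

def krr_fit(X, y, lam, sigma=1.0):
    # Tikhonov / kernel ridge: coefficients solve (K_n + lambda * n * I) c = y
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def krr_predict(X_train, c, X_test, sigma=1.0):
    # f(x) = sum_i c_i k(x, x_i)
    return gaussian_kernel(X_test, X_train, sigma) @ c

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)

c = krr_fit(X, y, lam=1e-3)
X_test = np.linspace(-1, 1, 5).reshape(-1, 1)
print(krr_predict(X, c, X_test))
```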

  30. Statistics
      Theorem (Caponnetto, De Vito '05). Assume $K(X, X), |Y| \le 1$ a.s. and $f^\dagger \in \mathrm{Range}\big((S_\rho S_\rho^*)^r\big)$, $1/2 < r < 1$. If $\lambda_n = n^{-\frac{1}{2r+1}}$, then
      $$\mathbb{E}\big[ \| S_\rho f_n^{\lambda_n} - f^\dagger \|_\rho^2 \big] \lesssim n^{-\frac{2r}{2r+1}}.$$

  31. Statistics
      Theorem (Caponnetto, De Vito '05), as on the previous slide.
      Proof sketch: for all $\lambda > 0$,
      $$\mathbb{E}\big[ \| S_\rho f_n^{\lambda} - f_\rho \|_\rho^2 \big] \lesssim \frac{1}{\lambda} (\delta_1 + \delta_2) + \lambda^{2r}, \qquad \mathbb{E}[\delta_1], \, \mathbb{E}[\delta_2] \lesssim \frac{1}{\sqrt{n}}.$$

  32. Iterative regularization
      From the Neumann series...
      $$f_n^t = \gamma \sum_{j=0}^{t-1} (I - \gamma S_n^* S_n)^j S_n^* y$$

  33. Iterative regularization
      From the Neumann series...
      $$f_n^t = \gamma \sum_{j=0}^{t-1} (I - \gamma S_n^* S_n)^j S_n^* y = \gamma S_n^* \sum_{j=0}^{t-1} (I - \gamma \underbrace{S_n S_n^*}_{K_n})^j y$$

  34. Iterative regularization
      From the Neumann series...
      $$f_n^t = \gamma \sum_{j=0}^{t-1} (I - \gamma S_n^* S_n)^j S_n^* y = \gamma S_n^* \sum_{j=0}^{t-1} (I - \gamma \underbrace{S_n S_n^*}_{K_n})^j y$$
      ...to gradient descent:
      $$f_n^t = f_n^{t-1} - \gamma S_n^* (S_n f_n^{t-1} - y), \qquad c^t = c^{t-1} - \gamma (K_n c^{t-1} - y).$$
      [Figure: training and test error as functions of the iteration number t]
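
A minimal Python sketch of the coefficient-space iteration $c^t = c^{t-1} - \gamma (K_n c^{t-1} - y)$, with the number of iterations playing the role of the regularization parameter. The kernel choice, step size rule, and validation-based stopping are illustrative assumptions, not the tuned choices analyzed in the cited results.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma)

def iterative_regularization(K, y, K_val, y_val, gamma=None, max_iter=500):
    # gradient descent on the unregularized least squares problem;
    # early stopping on a validation set acts as regularization
    n = K.shape[0]
    if gamma is None:
        gamma = 1.0 / np.linalg.norm(K, 2)        # step from the top eigenvalue
    c = np.zeros(n)
    best_c, best_err = c.copy(), np.inf
    for t in range(max_iter):
        c = c - gamma * (K @ c - y)               # c^t = c^{t-1} - gamma (K_n c^{t-1} - y)
        val_err = np.mean((K_val @ c - y_val) ** 2)
        if val_err < best_err:
            best_err, best_c = val_err, c.copy()
    return best_c

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)
X_tr, y_tr, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

K_tr = gaussian_kernel(X_tr, X_tr)
K_val = gaussian_kernel(X_val, X_tr)
c = iterative_regularization(K_tr, y_tr, K_val, y_val)
print(np.mean((K_val @ c - y_val) ** 2))          # validation error at the stopping time
```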

  35. Iterative regularization: statistics
      Theorem (Bauer, Pereverzev, R. '07). Assume $K(X, X), |Y| \le 1$ a.s. and $f^\dagger \in \mathrm{Range}\big((S_\rho S_\rho^*)^r\big)$, $1/2 < r < 1$. If $t_n = n^{\frac{1}{2r+1}}$, then
      $$\mathbb{E}\big[ \| S_\rho f_n^{t_n} - f^\dagger \|_\rho^2 \big] \lesssim n^{-\frac{2r}{2r+1}}.$$

  36. Iterative regularization: statistics
      Theorem (Bauer, Pereverzev, R. '07), as on the previous slide.
      Proof sketch: for all $t > 0$,
      $$\mathbb{E}\big[ \| S_\rho f_n^{t} - f_\rho \|_\rho^2 \big] \lesssim t (\delta_1 + \delta_2) + \frac{1}{t^{2r}}, \qquad \mathbb{E}[\delta_1], \, \mathbb{E}[\delta_2] \lesssim \frac{1}{\sqrt{n}}.$$

  37. Tikhonov vs iterative regularization
      - Same statistical properties...
      - ...but different time complexities: $O(n^3)$ vs $O(n^2 \cdot n^{\frac{1}{2r+1}})$
      - Iterative regularization provides a bridge between statistics and computations.
      - Kernel methods become a test bed for algorithmic solutions.

  38. Computational regularization
      Tikhonov: time $O(n^3)$ + space $O(n^2)$ for a $1/\sqrt{n}$ learning bound.

  39. Computational regularization
      Tikhonov: time $O(n^3)$ + space $O(n^2)$ for a $1/\sqrt{n}$ learning bound.
      Iterative regularization: time $O(n^2 \sqrt{n})$ + space $O(n^2)$ for a $1/\sqrt{n}$ learning bound.

  40. Outline
      - Learning theory 2000
      - Learning as an inverse problem
      - Regularization
      - Recent advances

  41. Steal from optimization
      Acceleration:
      - Conjugate gradient [Blanchard, Krämer '96]
      - Chebyshev method [Bauer, Pereverzev, R. '07]
      - Nesterov acceleration (Nesterov '83) [Salzo, R. '18]
      Stochastic gradient:
      - Single-pass stochastic gradient [Tarres, Yao '05; Pontil, Ying '09; Bach, Dieuleveut, Flammarion '17]
      - Multi-pass incremental gradient [Villa, R. '15]
      - Multi-pass stochastic gradient with mini-batches [Lin, R. '16] (a minimal sketch follows below)
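
To make the stochastic gradient variants concrete, here is a minimal sketch of multi-pass mini-batch SGD in kernel coefficient space, assuming the same least squares setting as above. The batch size, step size heuristic, and number of passes are illustrative choices, not the tuned values from the cited papers.

```python
import numpy as np

def minibatch_sgd(K, y, gamma=None, batch_size=10, n_passes=5, seed=0):
    # multi-pass mini-batch SGD for kernel least squares, run directly on
    # the coefficients c of f = sum_j c_j k_{x_j}: each step touches only
    # the coefficients (and kernel rows) of the current mini-batch
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    if gamma is None:
        gamma = 1.0 / np.max(np.diag(K))          # heuristic, safe step size
    c = np.zeros(n)
    for _ in range(n_passes):
        for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            residual = K[idx] @ c - y[idx]        # f(x_i) - y_i on the batch
            c[idx] -= gamma * residual / len(idx)
    return c

# toy usage with a linear kernel K_n = X X^T
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(200)
c = minibatch_sgd(X @ X.T, y, n_passes=10)
print(np.mean((X @ X.T @ c - y) ** 2))            # training error
```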
