An inverse problem perspective on machine learning
Lorenzo Rosasco
University of Genova, Massachusetts Institute of Technology, Istituto Italiano di Tecnologia
lcsl.mit.edu
Feb 9th, 2018 – Inverse Problems and Machine Learning Workshop, CM+X Caltech
Today's selection
- Classics: “Learning as an inverse problem”
- Latest releases: “Kernel methods as a test bed for algorithm design”
Outline
- Learning theory 2000
- Learning as an inverse problem
- Regularization
- Recent advances
What's learning?
[Figure: labeled training points $(x_1, y_1), \ldots, (x_5, y_5)$ together with new, unlabeled points $(x_6, ?)$, $(x_7, ?)$.]
Learning is about inference, not interpolation.
Statistical Machine Learning (ML)
- $(X, Y)$ a pair of random variables in $\mathcal{X} \times \mathbb{R}$.
- $L : \mathbb{R} \times \mathbb{R} \to [0, \infty)$ a loss function.
- $\mathcal{H} \subset \mathbb{R}^{\mathcal{X}}$ a hypothesis space.

Problem: solve
$$\min_{f \in \mathcal{H}} \mathbb{E}[L(f(X), Y)]$$
given only $(x_1, y_1), \ldots, (x_n, y_n)$, a sample of $n$ i.i.d. copies of $(X, Y)$.
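As a toy illustration of the setup above, here is a minimal sketch (assuming square loss and a made-up data model $Y = \sin(X) + \text{noise}$; all names and numbers are illustrative, not from the talk) comparing the empirical risk on $n$ samples with a Monte Carlo estimate of the expected risk:

```python
# Minimal sketch: empirical vs. expected risk for square loss on toy data.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n i.i.d. copies of (X, Y) with Y = sin(X) + noise (toy model)."""
    X = rng.uniform(-3, 3, size=n)
    Y = np.sin(X) + 0.1 * rng.normal(size=n)
    return X, Y

f = lambda x: 0.8 * np.sin(x)                 # a candidate function f

X, Y = sample(50)                             # the data we are actually given
emp_risk = np.mean((f(X) - Y) ** 2)           # empirical risk on n = 50 points

X_big, Y_big = sample(1_000_000)              # Monte Carlo proxy for E[L(f(X), Y)]
exp_risk = np.mean((f(X_big) - Y_big) ** 2)
print(emp_risk, exp_risk)
```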
ML theory around 2000-2010
- All algorithms are ERM (empirical risk minimization)
$$\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i)$$
[Vapnik '96]
- Emphasis on empirical process theory ...
$$\mathbb{P}\left( \sup_{f \in \mathcal{H}} \left| \frac{1}{n} \sum_{i=1}^n L(f(X_i), Y_i) - \mathbb{E}[L(f(X), Y)] \right| > \epsilon \right)$$
[Vapnik, Chervonenkis '71; Dudley, Giné, Zinn '94]
- ... and complexity measures, e.g. Gaussian/Rademacher complexities
$$C(\mathcal{H}) = \mathbb{E} \sup_{f \in \mathcal{H}} \sum_{i=1}^n \sigma_i f(X_i)$$
[Bartlett, Bousquet, Koltchinskii, Massart, Mendelson, ... '00s]
Around the same time
- Cucker and Smale, On the mathematical foundations of learning, Bull. AMS
- Caponnetto, De Vito, R., Verri, Learning as an Inverse Problem, JMLR
- Smale, Zhou, Shannon sampling and function reconstruction from point values, Bull. AMS
Learning as an inverse problem
Inverse Problems (IP)
- $A : \mathcal{H} \to \mathcal{G}$ a bounded linear operator between Hilbert spaces
- $g \in \mathcal{G}$

Problem: find $f$ solving
$$A f = g,$$
assuming $A$ and $g_\delta$ are given, with $\|g - g_\delta\| \le \delta$.
[Engl, Hanke, Neubauer '96]
Ill-posedness
- Existence: $g \notin \mathrm{Range}(A)$
- Uniqueness: $\mathrm{Ker}(A) \neq \{0\}$
- Stability: $\|A^\dagger\| = \infty$ (large is also a mess)

[Figure: the spaces $\mathcal{H}$ and $\mathcal{G}$, $\mathrm{Range}(A)$, the data $g$, $g_\delta$, and the minimal norm solution $f^\dagger$.]

$$f^\dagger = A^\dagger g = \operatorname*{argmin}_{f \in \mathcal{O}} \|f\|_{\mathcal{H}}, \qquad \mathcal{O} = \operatorname*{argmin}_{f \in \mathcal{H}} \|A f - g\|^2.$$
Is machine learning an inverse problem?

Inverse problem: $A : \mathcal{H} \to \mathcal{G}$, $g \in \mathcal{G}$; find $f$ solving $A f = g$, given $A$ and $g_\delta$ with $\|g - g_\delta\| \le \delta$.

Machine learning: $(X, Y)$, $L : \mathbb{R} \times \mathbb{R} \to [0, \infty)$, $\mathcal{H} \subset \mathbb{R}^{\mathcal{X}}$; solve $\min_{f \in \mathcal{H}} \mathbb{E}[L(f(X), Y)]$, given only $(x_1, y_1), \ldots, (x_n, y_n)$.

Actually yes, under some assumptions.
Key assumptions: least squares and RKHS

Assumption (loss): $L(f(x), y) = (f(x) - y)^2$.

Assumption (hypothesis space):
- $(\mathcal{H}, \langle \cdot, \cdot \rangle)$ is a real, separable Hilbert space
- the evaluation functionals are continuous: for all $x \in \mathcal{X}$, let $e_x : \mathcal{H} \to \mathbb{R}$, $e_x(f) = f(x)$; then $|e_x(f) - e_x(f')| \lesssim \|f - f'\|$.
[Aronszajn '50]

Implications:
- $\|f\|_\infty \lesssim \|f\|$
- there exists $k_x \in \mathcal{H}$ such that $f(x) = \langle f, k_x \rangle$
Interpolation and sampling operator [Bertero, De Mol, Pike '85, '88]

Interpolation: find $f$ such that $f(x_i) = \langle f, k_{x_i} \rangle = y_i$, $i = 1, \ldots, n$.

Sampling operator: $S_n : \mathcal{H} \to \mathbb{R}^n$, $(S_n f)_i = \langle f, k_{x_i} \rangle$ for all $i = 1, \ldots, n$, so that the interpolation problem reads $S_n f = y$.

[Figure: a function $f$ sampled at the points $x_1, \ldots, x_5$ in $\mathcal{X}$.]
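A minimal sketch of how $S_n$ acts in practice, assuming a Gaussian kernel and a function given as a finite kernel expansion $f = \sum_j \alpha_j k(\cdot, z_j)$; the names `gaussian_kernel`, `z`, `alpha` are illustrative, not from the talk:

```python
# Minimal sketch: applying the sampling operator S_n to f = sum_j alpha_j k(., z_j).
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix k(a, b) = exp(-||a - b||^2 / sigma) for the rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma)

def apply_sampling_operator(x, z, alpha, kernel=gaussian_kernel):
    # (S_n f)_i = <f, k_{x_i}> = sum_j alpha_j k(x_i, z_j)
    return kernel(x, z) @ alpha

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))       # sample points x_1, ..., x_n
z = rng.normal(size=(3, 2))       # centers defining f
alpha = rng.normal(size=3)
print(apply_sampling_operator(x, z, alpha))   # S_n f, a vector in R^n
```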
Learning and restriction operator [Caponnetto, De Vito, R. '05]

Goal: find $f$ such that $\langle f, k_x \rangle = f_\rho(x)$, $\rho$-almost surely, where $f_\rho(x) = \int y \, d\rho(y \mid x)$ is the regression function.

Restriction operator: $S_\rho : \mathcal{H} \to L^2(\mathcal{X}, \rho)$, $(S_\rho f)(x) = \langle f, k_x \rangle$, $\rho$-almost surely, so that the problem reads $S_\rho f = f_\rho$.

Here $L^2(\mathcal{X}, \rho) = \{ f \in \mathbb{R}^{\mathcal{X}} \mid \|f\|_\rho^2 = \int_{\mathcal{X}} d\rho(x)\, |f(x)|^2 < \infty \}$.
Learning as an inverse problem

Inverse problem: find $f$ solving $S_\rho f = f_\rho$, given $S_n$ and $y_n = (y_1, \ldots, y_n)$.

Least squares:
$$\mathbb{E}(f(X) - Y)^2 - \mathbb{E}(f_\rho(X) - Y)^2 = \|S_\rho f - f_\rho\|_\rho^2,$$
so that $\min_{f \in \mathcal{H}} \mathbb{E}(f(X) - Y)^2$ is equivalent to $\min_{f \in \mathcal{H}} \|S_\rho f - f_\rho\|_\rho^2$.
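For completeness, the identity above is the usual decomposition of the excess risk, spelled out here using only that $f_\rho$ is the conditional mean of $Y$ given $X$:
$$\mathbb{E}(f(X) - Y)^2 = \mathbb{E}(f(X) - f_\rho(X))^2 + \mathbb{E}(f_\rho(X) - Y)^2,$$
since the cross term vanishes,
$$\mathbb{E}\big[(f(X) - f_\rho(X))(f_\rho(X) - Y)\big] = \mathbb{E}_X\big[(f(X) - f_\rho(X))\, \mathbb{E}[f_\rho(X) - Y \mid X]\big] = 0,$$
and $\mathbb{E}(f(X) - f_\rho(X))^2 = \|S_\rho f - f_\rho\|_\rho^2$ for $f \in \mathcal{H}$.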
Let's see what we got
- Noise model
- Integral operators & covariance operators
- Kernels
Noise model

Ideal: $S_\rho f = f_\rho$, i.e. $S_\rho^* S_\rho f = S_\rho^* f_\rho$.
Empirical: $S_n f = y$, i.e. $S_n^* S_n f = S_n^* y$.

Noise model: $\|S_n^* y - S_\rho^* f_\rho\| \le \delta_1$, $\quad \|S_n^* S_n - S_\rho^* S_\rho\| \le \delta_2$.

(Compare with inverse problem discretization, and with econometrics.)
Integral and covariance operators
- Extension operator $S_\rho^* : L^2(\mathcal{X}, \rho) \to \mathcal{H}$,
$$(S_\rho^* f)(x') = \int d\rho(x)\, k(x', x) f(x),$$
where $k(x, x') = \langle k_x, k_{x'} \rangle$ is positive definite.
- Covariance operator $S_\rho^* S_\rho : \mathcal{H} \to \mathcal{H}$,
$$S_\rho^* S_\rho = \int d\rho(x)\, k_x \otimes k_x.$$
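For the linear kernel these operators have concrete matrix counterparts; the following toy check (assuming $\mathcal{H} = \mathbb{R}^D$, with data and names that are purely illustrative, and with the $1/n$ normalization omitted as in the operators above) verifies that $S_n^* S_n = X^\top X$ and $S_n S_n^* = X X^\top$ share their nonzero spectrum:

```python
# Minimal sketch: empirical covariance operator vs. kernel matrix for the linear kernel.
import numpy as np

rng = np.random.default_rng(0)
n, D = 6, 3
X = rng.normal(size=(n, D))        # data matrix: rows are x_1, ..., x_n

cov = X.T @ X                      # S_n^* S_n : R^D -> R^D (covariance, up to 1/n)
K_n = X @ X.T                      # S_n S_n^* : R^n -> R^n (kernel matrix K_n)

eig_cov = np.sort(np.linalg.eigvalsh(cov))[::-1]
eig_K = np.sort(np.linalg.eigvalsh(K_n))[::-1]
print(np.allclose(eig_cov[:D], eig_K[:D]))   # True: same nonzero eigenvalues
```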
Kernels

Choosing a RKHS implies choosing a representation.

Theorem (Moore-Aronszajn). Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be positive definite; then the completion of
$$\Big\{ f \in \mathbb{R}^{\mathcal{X}} \;\Big|\; f = \sum_{j=1}^N c_j k_{x_j}, \; c_1, \ldots, c_N \in \mathbb{R}, \; x_1, \ldots, x_N \in \mathcal{X}, \; N \in \mathbb{N} \Big\}$$
with respect to the inner product defined by $\langle k_x, k_{x'} \rangle = k(x, x')$ is a RKHS.
Kernels

If $K(x, x') = x^\top x'$, then
- $S_n$ is the $n \times D$ data matrix ($S_\rho$ an "infinite data matrix")
- $S_n^* S_n$ and $S_\rho^* S_\rho$ are the empirical and true covariance operators

Other kernels:
- $K(x, x') = (1 + x^\top x')^p$
- $K(x, x') = e^{-\|x - x'\|^2 / \sigma}$
- $K(x, x') = e^{-\|x - x'\| / \sigma}$
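A minimal sketch of the kernels listed above (linear, polynomial, Gaussian, Laplacian), with illustrative parameter names `p` and `sigma` and the Gaussian written as on the slide; it also checks that the resulting Gram matrices are positive semidefinite:

```python
# Minimal sketch: Gram matrices for the kernels listed above.
import numpy as np

def linear(X, Z):
    return X @ Z.T

def polynomial(X, Z, p=2):
    return (1.0 + X @ Z.T) ** p

def gaussian(X, Z, sigma=1.0):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma)          # exp(-||x - x'||^2 / sigma)

def laplacian(X, Z, sigma=1.0):
    d = np.sqrt(((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1))
    return np.exp(-d / sigma)           # exp(-||x - x'|| / sigma)

X = np.random.default_rng(0).normal(size=(4, 2))
for k in (linear, polynomial, gaussian, laplacian):
    K = k(X, X)
    print(k.__name__, np.all(np.linalg.eigvalsh(K) > -1e-10))  # pos. semi-def. Gram matrices
```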
What now? Steal
Regularization
Tikhonov aka ridge regression
$$f_n^\lambda = (S_n^* S_n + \lambda n I)^{-1} S_n^* y = S_n^* \big( \underbrace{S_n S_n^*}_{K_n} + \lambda n I \big)^{-1} y,$$
i.e. $f_n^\lambda = \sum_{i=1}^n c_i k_{x_i}$ with $c$ solving $(K_n + \lambda n I)\, c = y$.
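A minimal sketch of the resulting algorithm in the dual variables, assuming a Gaussian kernel and toy one-dimensional data (all names are illustrative): solve $(K_n + \lambda n I)\, c = y$ and predict with $f(x) = \sum_i c_i k(x, x_i)$.

```python
# Minimal sketch: Tikhonov / kernel ridge regression in the dual variables.
import numpy as np

def gaussian(X, Z, sigma=1.0):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma)

def krr_fit(X, y, lam, kernel=gaussian):
    n = len(y)
    K = kernel(X, X)
    c = np.linalg.solve(K + lam * n * np.eye(n), y)   # (K_n + lambda n I) c = y
    return c

def krr_predict(X_new, X, c, kernel=gaussian):
    return kernel(X_new, X) @ c                       # f(x) = sum_i c_i k(x, x_i)

# toy usage: noisy samples of a one-dimensional function
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
c = krr_fit(X, y, lam=1e-3)
print(krr_predict(np.array([[0.0], [1.0]]), X, c))
```

The direct solve of the $n \times n$ linear system is the $O(n^3)$ step referred to in the complexity discussion later in the talk.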
Statistics

Theorem (Caponnetto, De Vito '05). Assume $K(X, X) \le 1$, $|Y| \le 1$ a.s. and $f^\dagger \in \mathrm{Range}\big((S_\rho S_\rho^*)^r\big)$, $1/2 < r < 1$. If $\lambda_n = n^{-\frac{1}{2r+1}}$, then
$$\mathbb{E}\big[ \| S f_n^{\lambda_n} - f^\dagger \|_\rho^2 \big] \lesssim n^{-\frac{2r}{2r+1}}.$$

Proof sketch: for all $\lambda > 0$,
$$\mathbb{E}\big[ \| S f_n^\lambda - f_\rho \|_\rho^2 \big] \lesssim \frac{1}{\lambda} (\delta_1 + \delta_2) + \lambda^{2r}, \qquad \mathbb{E}[\delta_1], \, \mathbb{E}[\delta_2] \lesssim \frac{1}{\sqrt{n}}.$$
Iterative regularization

From the Neumann series ...
$$f_n^t = \gamma \sum_{j=0}^{t-1} (I - \gamma S_n^* S_n)^j S_n^* y = \gamma S_n^* \sum_{j=0}^{t-1} (I - \gamma \underbrace{S_n S_n^*}_{K_n})^j y$$
... to gradient descent
$$f_n^t = f_n^{t-1} - \gamma S_n^* (S_n f_n^{t-1} - y), \qquad c_n^t = c_n^{t-1} - \gamma (K_n c_n^{t-1} - y).$$

[Figure: training and test error as a function of the number of iterations $t$.]
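A minimal sketch of iterative regularization by early stopping in the dual variables, assuming a Gaussian kernel and a held-out set to pick the stopping time $t$ (step size, kernel and data are illustrative choices, not the talk's):

```python
# Minimal sketch: gradient descent c_t = c_{t-1} - gamma (K_n c_{t-1} - y) with early stopping.
import numpy as np

def gaussian(X, Z, sigma=1.0):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)
X_tr, y_tr, X_val, y_val = X[:60], y[:60], X[60:], y[60:]

K = gaussian(X_tr, X_tr)
K_val = gaussian(X_val, X_tr)
gamma = 1.0 / np.linalg.norm(K, 2)          # step size below 2 / ||K_n|| for convergence
c = np.zeros(len(y_tr))

best_t, best_err = 0, np.inf
for t in range(1, 501):
    c -= gamma * (K @ c - y_tr)             # c_t = c_{t-1} - gamma (K_n c_{t-1} - y)
    val_err = np.mean((K_val @ c - y_val) ** 2)
    if val_err < best_err:
        best_t, best_err = t, val_err
print("early stopping at t =", best_t, "validation MSE =", best_err)
```

Each iteration costs one matrix-vector product, i.e. $O(n^2)$, which is the basis of the complexity comparison below.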
Iterative regularization: statistics

Theorem (Bauer, Pereverzev, R. '07). Assume $K(X, X) \le 1$, $|Y| \le 1$ a.s. and $f^\dagger \in \mathrm{Range}\big((S_\rho S_\rho^*)^r\big)$, $1/2 < r < 1$. If $t_n = n^{\frac{1}{2r+1}}$, then
$$\mathbb{E}\big[ \| S f_n^{t_n} - f^\dagger \|_\rho^2 \big] \lesssim n^{-\frac{2r}{2r+1}}.$$

Proof sketch: for all $t > 0$,
$$\mathbb{E}\big[ \| S f_n^t - f_\rho \|_\rho^2 \big] \lesssim t (\delta_1 + \delta_2) + t^{-2r}, \qquad \mathbb{E}[\delta_1], \, \mathbb{E}[\delta_2] \lesssim \frac{1}{\sqrt{n}}.$$
Tikhonov vs iterative regularization
- Same statistical properties ...
- ... but different time complexities: $O(n^3)$ vs $O(n^2 \cdot n^{\frac{1}{2r+1}})$.
- Iterative regularization provides a bridge between statistics and computations.
- Kernel methods become a test bed for algorithmic solutions.
Computational regularization

Tikhonov: time $O(n^3)$ + space $O(n^2)$ for a $1/\sqrt{n}$ learning bound.
Iterative regularization: time $O(n^2 \sqrt{n})$ + space $O(n^2)$ for a $1/\sqrt{n}$ learning bound.
Recent advances
Steal from optimization

Acceleration
- Conjugate gradient [Blanchard, Kramer '96]
- Chebyshev method [Bauer, Pereverzev, R. '07]
- Nesterov acceleration (Nesterov '83) [Salzo, R. '18]

Stochastic gradient
- Single-pass stochastic gradient [Tarres, Yao '05; Pontil, Ying '09; Bach, Dieuleveut, Flammarion '17]
- Multi-pass incremental gradient [Villa, R. '15]
- Multi-pass stochastic gradient with mini-batches [Lin, R. '16]