An inverse problem perspective on machine learning
Lorenzo Rosasco
University of Genova, Massachusetts Institute of Technology, Istituto Italiano di Tecnologia
lcsl.mit.edu
Feb 9th, 2018 – Inverse Problems and Machine Learning Workshop, CM+X Caltech
Today's selection
- Classics: “Learning as an inverse problem”
- Latest releases: “Kernel methods as a test bed for algorithm design”
Outline
- Learning theory 2000
- Learning as an inverse problem
- Regularization
- Recent advances
What's learning?
[Figure: labeled training points $(x_1, y_1), \ldots, (x_5, y_5)$ together with new, unlabeled points $(x_6, ?)$, $(x_7, ?)$.]
Learning is about inference, not interpolation.
Statistical Machine Learning (ML)
- $(X, Y)$ a pair of random variables in $\mathcal{X} \times \mathbb{R}$.
- $L : \mathbb{R} \times \mathbb{R} \to [0, \infty)$ a loss function.
- $\mathcal{H} \subset \mathbb{R}^{\mathcal{X}}$ a hypothesis space.

Problem: solve
$$\min_{f \in \mathcal{H}} \mathbb{E}[L(f(X), Y)]$$
given only $(x_1, y_1), \ldots, (x_n, y_n)$, a sample of $n$ i.i.d. copies of $(X, Y)$.
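As a toy illustration of the setup above, here is a minimal sketch (assuming square loss and a made-up data model $Y = \sin(X) + \text{noise}$; all names and numbers are illustrative, not from the talk) comparing the empirical risk on $n$ samples with a Monte Carlo estimate of the expected risk:

```python
# Minimal sketch: empirical vs. expected risk for square loss on toy data.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n i.i.d. copies of (X, Y) with Y = sin(X) + noise (toy model)."""
    X = rng.uniform(-3, 3, size=n)
    Y = np.sin(X) + 0.1 * rng.normal(size=n)
    return X, Y

f = lambda x: 0.8 * np.sin(x)                 # a candidate function f

X, Y = sample(50)                             # the data we are actually given
emp_risk = np.mean((f(X) - Y) ** 2)           # empirical risk on n = 50 points

X_big, Y_big = sample(1_000_000)              # Monte Carlo proxy for E[L(f(X), Y)]
exp_risk = np.mean((f(X_big) - Y_big) ** 2)
print(emp_risk, exp_risk)
```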
ML theory around 2000-2010
- All algorithms are ERM (empirical risk minimization)
$$\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i)$$
[Vapnik '96]
- Emphasis on empirical process theory ...
$$\mathbb{P}\left( \sup_{f \in \mathcal{H}} \left| \frac{1}{n} \sum_{i=1}^n L(f(X_i), Y_i) - \mathbb{E}[L(f(X), Y)] \right| > \epsilon \right)$$
[Vapnik, Chervonenkis '71; Dudley, Giné, Zinn '94]
- ... and complexity measures, e.g. Gaussian/Rademacher complexities
$$C(\mathcal{H}) = \mathbb{E} \sup_{f \in \mathcal{H}} \sum_{i=1}^n \sigma_i f(X_i)$$
[Bartlett, Bousquet, Koltchinskii, Massart, Mendelson, ... '00s]
Around the same time
- Cucker and Smale, On the mathematical foundations of learning, Bull. AMS
- Caponnetto, De Vito, R., Verri, Learning as an Inverse Problem, JMLR
- Smale, Zhou, Shannon sampling and function reconstruction from point values, Bull. AMS
Learning as an inverse problem
Inverse Problems (IP)
- $A : \mathcal{H} \to \mathcal{G}$ a bounded linear operator between Hilbert spaces
- $g \in \mathcal{G}$

Problem: find $f$ solving
$$A f = g,$$
assuming $A$ and $g_\delta$ are given, with $\|g - g_\delta\| \le \delta$.
[Engl, Hanke, Neubauer '96]
Ill-posedness
- Existence: $g \notin \mathrm{Range}(A)$
- Uniqueness: $\mathrm{Ker}(A) \neq \{0\}$
- Stability: $\|A^\dagger\| = \infty$ (large is also a mess)

[Figure: the spaces $\mathcal{H}$ and $\mathcal{G}$, $\mathrm{Range}(A)$, the data $g$, $g_\delta$, and the minimal norm solution $f^\dagger$.]

$$f^\dagger = A^\dagger g = \operatorname*{argmin}_{f \in \mathcal{O}} \|f\|_{\mathcal{H}}, \qquad \mathcal{O} = \operatorname*{argmin}_{f \in \mathcal{H}} \|A f - g\|^2.$$
Is machine learning an inverse problem?

Inverse problem: $A : \mathcal{H} \to \mathcal{G}$, $g \in \mathcal{G}$; find $f$ solving $A f = g$, given $A$ and $g_\delta$ with $\|g - g_\delta\| \le \delta$.

Machine learning: $(X, Y)$, $L : \mathbb{R} \times \mathbb{R} \to [0, \infty)$, $\mathcal{H} \subset \mathbb{R}^{\mathcal{X}}$; solve $\min_{f \in \mathcal{H}} \mathbb{E}[L(f(X), Y)]$, given only $(x_1, y_1), \ldots, (x_n, y_n)$.

Actually yes, under some assumptions.
Key assumptions: least squares and RKHS

Assumption (loss): $L(f(x), y) = (f(x) - y)^2$.

Assumption (hypothesis space):
- $(\mathcal{H}, \langle \cdot, \cdot \rangle)$ is a real, separable Hilbert space
- the evaluation functionals are continuous: for all $x \in \mathcal{X}$, let $e_x : \mathcal{H} \to \mathbb{R}$, $e_x(f) = f(x)$; then $|e_x(f) - e_x(f')| \lesssim \|f - f'\|$.
[Aronszajn '50]

Implications:
- $\|f\|_\infty \lesssim \|f\|$
- there exists $k_x \in \mathcal{H}$ such that $f(x) = \langle f, k_x \rangle$
Interpolation and sampling operator [Bertero, De Mol, Pike '85, '88]

Interpolation: find $f$ such that $f(x_i) = \langle f, k_{x_i} \rangle = y_i$, $i = 1, \ldots, n$.

Sampling operator: $S_n : \mathcal{H} \to \mathbb{R}^n$, $(S_n f)_i = \langle f, k_{x_i} \rangle$ for all $i = 1, \ldots, n$, so that the interpolation problem reads $S_n f = y$.

[Figure: a function $f$ sampled at the points $x_1, \ldots, x_5$ in $\mathcal{X}$.]
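A minimal sketch of how $S_n$ acts in practice, assuming a Gaussian kernel and a function given as a finite kernel expansion $f = \sum_j \alpha_j k(\cdot, z_j)$; the names `gaussian_kernel`, `z`, `alpha` are illustrative, not from the talk:

```python
# Minimal sketch: applying the sampling operator S_n to f = sum_j alpha_j k(., z_j).
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix k(a, b) = exp(-||a - b||^2 / sigma) for the rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma)

def apply_sampling_operator(x, z, alpha, kernel=gaussian_kernel):
    # (S_n f)_i = <f, k_{x_i}> = sum_j alpha_j k(x_i, z_j)
    return kernel(x, z) @ alpha

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))       # sample points x_1, ..., x_n
z = rng.normal(size=(3, 2))       # centers defining f
alpha = rng.normal(size=3)
print(apply_sampling_operator(x, z, alpha))   # S_n f, a vector in R^n
```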
Learning and restriction operator [Caponnetto, De Vito, R. '05]

Goal: find $f$ such that $\langle f, k_x \rangle = f_\rho(x)$, $\rho$-almost surely, where $f_\rho(x) = \int y \, d\rho(y \mid x)$ is the regression function.

Restriction operator: $S_\rho : \mathcal{H} \to L^2(\mathcal{X}, \rho)$, $(S_\rho f)(x) = \langle f, k_x \rangle$, $\rho$-almost surely, so that the problem reads $S_\rho f = f_\rho$.

Here $L^2(\mathcal{X}, \rho) = \{ f \in \mathbb{R}^{\mathcal{X}} \mid \|f\|_\rho^2 = \int_{\mathcal{X}} d\rho(x)\, |f(x)|^2 < \infty \}$.
Learning as an inverse problem

Inverse problem: find $f$ solving $S_\rho f = f_\rho$, given $S_n$ and $y_n = (y_1, \ldots, y_n)$.

Least squares:
$$\mathbb{E}(f(X) - Y)^2 - \mathbb{E}(f_\rho(X) - Y)^2 = \|S_\rho f - f_\rho\|_\rho^2,$$
so that $\min_{f \in \mathcal{H}} \mathbb{E}(f(X) - Y)^2$ is equivalent to $\min_{f \in \mathcal{H}} \|S_\rho f - f_\rho\|_\rho^2$.
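For completeness, the identity above is the usual decomposition of the excess risk, spelled out here using only that $f_\rho$ is the conditional mean of $Y$ given $X$:
$$\mathbb{E}(f(X) - Y)^2 = \mathbb{E}(f(X) - f_\rho(X))^2 + \mathbb{E}(f_\rho(X) - Y)^2,$$
since the cross term vanishes,
$$\mathbb{E}\big[(f(X) - f_\rho(X))(f_\rho(X) - Y)\big] = \mathbb{E}_X\big[(f(X) - f_\rho(X))\, \mathbb{E}[f_\rho(X) - Y \mid X]\big] = 0,$$
and $\mathbb{E}(f(X) - f_\rho(X))^2 = \|S_\rho f - f_\rho\|_\rho^2$ for $f \in \mathcal{H}$.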
Let's see what we got
- Noise model
- Integral operators & covariance operators
- Kernels
Noise model

Ideal: $S_\rho f = f_\rho$, i.e. $S_\rho^* S_\rho f = S_\rho^* f_\rho$.
Empirical: $S_n f = y$, i.e. $S_n^* S_n f = S_n^* y$.

Noise model: $\|S_n^* y - S_\rho^* f_\rho\| \le \delta_1$, $\quad \|S_n^* S_n - S_\rho^* S_\rho\| \le \delta_2$.

(Compare with inverse problem discretization, and with econometrics.)
Integral and covariance operators
- Extension operator $S_\rho^* : L^2(\mathcal{X}, \rho) \to \mathcal{H}$,
$$(S_\rho^* f)(x') = \int d\rho(x)\, k(x', x) f(x),$$
where $k(x, x') = \langle k_x, k_{x'} \rangle$ is positive definite.
- Covariance operator $S_\rho^* S_\rho : \mathcal{H} \to \mathcal{H}$,
$$S_\rho^* S_\rho = \int d\rho(x)\, k_x \otimes k_x.$$
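For the linear kernel these operators have concrete matrix counterparts; the following toy check (assuming $\mathcal{H} = \mathbb{R}^D$, with data and names that are purely illustrative, and with the $1/n$ normalization omitted as in the operators above) verifies that $S_n^* S_n = X^\top X$ and $S_n S_n^* = X X^\top$ share their nonzero spectrum:

```python
# Minimal sketch: empirical covariance operator vs. kernel matrix for the linear kernel.
import numpy as np

rng = np.random.default_rng(0)
n, D = 6, 3
X = rng.normal(size=(n, D))        # data matrix: rows are x_1, ..., x_n

cov = X.T @ X                      # S_n^* S_n : R^D -> R^D (covariance, up to 1/n)
K_n = X @ X.T                      # S_n S_n^* : R^n -> R^n (kernel matrix K_n)

eig_cov = np.sort(np.linalg.eigvalsh(cov))[::-1]
eig_K = np.sort(np.linalg.eigvalsh(K_n))[::-1]
print(np.allclose(eig_cov[:D], eig_K[:D]))   # True: same nonzero eigenvalues
```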
Kernels

Choosing a RKHS implies choosing a representation.

Theorem (Moore-Aronszajn). Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be positive definite; then the completion of
$$\Big\{ f \in \mathbb{R}^{\mathcal{X}} \;\Big|\; f = \sum_{j=1}^N c_j k_{x_j}, \; c_1, \ldots, c_N \in \mathbb{R}, \; x_1, \ldots, x_N \in \mathcal{X}, \; N \in \mathbb{N} \Big\}$$
with respect to the inner product defined by $\langle k_x, k_{x'} \rangle = k(x, x')$ is a RKHS.
Kernels

If $K(x, x') = x^\top x'$, then
- $S_n$ is the $n \times D$ data matrix ($S_\rho$ an "infinite data matrix")
- $S_n^* S_n$ and $S_\rho^* S_\rho$ are the empirical and true covariance operators

Other kernels:
- $K(x, x') = (1 + x^\top x')^p$
- $K(x, x') = e^{-\|x - x'\|^2 / \sigma}$
- $K(x, x') = e^{-\|x - x'\| / \sigma}$
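A minimal sketch of the kernels listed above (linear, polynomial, Gaussian, Laplacian), with illustrative parameter names `p` and `sigma` and the Gaussian written as on the slide; it also checks that the resulting Gram matrices are positive semidefinite:

```python
# Minimal sketch: Gram matrices for the kernels listed above.
import numpy as np

def linear(X, Z):
    return X @ Z.T

def polynomial(X, Z, p=2):
    return (1.0 + X @ Z.T) ** p

def gaussian(X, Z, sigma=1.0):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma)          # exp(-||x - x'||^2 / sigma)

def laplacian(X, Z, sigma=1.0):
    d = np.sqrt(((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1))
    return np.exp(-d / sigma)           # exp(-||x - x'|| / sigma)

X = np.random.default_rng(0).normal(size=(4, 2))
for k in (linear, polynomial, gaussian, laplacian):
    K = k(X, X)
    print(k.__name__, np.all(np.linalg.eigvalsh(K) > -1e-10))  # pos. semi-def. Gram matrices
```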
What now? Steal
Regularization
Tikhonov aka ridge regression
$$f_n^\lambda = (S_n^* S_n + \lambda n I)^{-1} S_n^* y = S_n^* \big( \underbrace{S_n S_n^*}_{K_n} + \lambda n I \big)^{-1} y,$$
i.e. $f_n^\lambda = \sum_{i=1}^n c_i k_{x_i}$ with $c$ solving $(K_n + \lambda n I)\, c = y$.
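A minimal sketch of the resulting algorithm in the dual variables, assuming a Gaussian kernel and toy one-dimensional data (all names are illustrative): solve $(K_n + \lambda n I)\, c = y$ and predict with $f(x) = \sum_i c_i k(x, x_i)$.

```python
# Minimal sketch: Tikhonov / kernel ridge regression in the dual variables.
import numpy as np

def gaussian(X, Z, sigma=1.0):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma)

def krr_fit(X, y, lam, kernel=gaussian):
    n = len(y)
    K = kernel(X, X)
    c = np.linalg.solve(K + lam * n * np.eye(n), y)   # (K_n + lambda n I) c = y
    return c

def krr_predict(X_new, X, c, kernel=gaussian):
    return kernel(X_new, X) @ c                       # f(x) = sum_i c_i k(x, x_i)

# toy usage: noisy samples of a one-dimensional function
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
c = krr_fit(X, y, lam=1e-3)
print(krr_predict(np.array([[0.0], [1.0]]), X, c))
```

The direct solve of the $n \times n$ linear system is the $O(n^3)$ step referred to in the complexity discussion later in the talk.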
Statistics

Theorem (Caponnetto, De Vito '05). Assume $K(X, X) \le 1$, $|Y| \le 1$ a.s. and $f^\dagger \in \mathrm{Range}\big((S_\rho S_\rho^*)^r\big)$, $1/2 < r < 1$. If $\lambda_n = n^{-\frac{1}{2r+1}}$, then
$$\mathbb{E}\big[ \| S f_n^{\lambda_n} - f^\dagger \|_\rho^2 \big] \lesssim n^{-\frac{2r}{2r+1}}.$$

Proof sketch: for all $\lambda > 0$,
$$\mathbb{E}\big[ \| S f_n^\lambda - f_\rho \|_\rho^2 \big] \lesssim \frac{1}{\lambda} (\delta_1 + \delta_2) + \lambda^{2r}, \qquad \mathbb{E}[\delta_1], \, \mathbb{E}[\delta_2] \lesssim \frac{1}{\sqrt{n}}.$$
Iterative regularization

From the Neumann series ...
$$f_n^t = \gamma \sum_{j=0}^{t-1} (I - \gamma S_n^* S_n)^j S_n^* y = \gamma S_n^* \sum_{j=0}^{t-1} (I - \gamma \underbrace{S_n S_n^*}_{K_n})^j y$$
... to gradient descent
$$f_n^t = f_n^{t-1} - \gamma S_n^* (S_n f_n^{t-1} - y), \qquad c_n^t = c_n^{t-1} - \gamma (K_n c_n^{t-1} - y).$$

[Figure: training and test error as a function of the number of iterations $t$.]
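A minimal sketch of iterative regularization by early stopping in the dual variables, assuming a Gaussian kernel and a held-out set to pick the stopping time $t$ (step size, kernel and data are illustrative choices, not the talk's):

```python
# Minimal sketch: gradient descent c_t = c_{t-1} - gamma (K_n c_{t-1} - y) with early stopping.
import numpy as np

def gaussian(X, Z, sigma=1.0):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)
X_tr, y_tr, X_val, y_val = X[:60], y[:60], X[60:], y[60:]

K = gaussian(X_tr, X_tr)
K_val = gaussian(X_val, X_tr)
gamma = 1.0 / np.linalg.norm(K, 2)          # step size below 2 / ||K_n|| for convergence
c = np.zeros(len(y_tr))

best_t, best_err = 0, np.inf
for t in range(1, 501):
    c -= gamma * (K @ c - y_tr)             # c_t = c_{t-1} - gamma (K_n c_{t-1} - y)
    val_err = np.mean((K_val @ c - y_val) ** 2)
    if val_err < best_err:
        best_t, best_err = t, val_err
print("early stopping at t =", best_t, "validation MSE =", best_err)
```

Each iteration costs one matrix-vector product, i.e. $O(n^2)$, which is the basis of the complexity comparison below.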
Iterative regularization: statistics

Theorem (Bauer, Pereverzev, R. '07). Assume $K(X, X) \le 1$, $|Y| \le 1$ a.s. and $f^\dagger \in \mathrm{Range}\big((S_\rho S_\rho^*)^r\big)$, $1/2 < r < 1$. If $t_n = n^{\frac{1}{2r+1}}$, then
$$\mathbb{E}\big[ \| S f_n^{t_n} - f^\dagger \|_\rho^2 \big] \lesssim n^{-\frac{2r}{2r+1}}.$$

Proof sketch: for all $t > 0$,
$$\mathbb{E}\big[ \| S f_n^t - f_\rho \|_\rho^2 \big] \lesssim t (\delta_1 + \delta_2) + t^{-2r}, \qquad \mathbb{E}[\delta_1], \, \mathbb{E}[\delta_2] \lesssim \frac{1}{\sqrt{n}}.$$
Tikhonov vs iterative regularization
- Same statistical properties ...
- ... but different time complexities: $O(n^3)$ vs $O(n^2 \cdot n^{\frac{1}{2r+1}})$.
- Iterative regularization provides a bridge between statistics and computations.
- Kernel methods become a test bed for algorithmic solutions.
Computational regularization

Tikhonov: time $O(n^3)$ + space $O(n^2)$ for a $1/\sqrt{n}$ learning bound.
Iterative regularization: time $O(n^2 \sqrt{n})$ + space $O(n^2)$ for a $1/\sqrt{n}$ learning bound.
Recent advances
Steal from optimization

Acceleration
- Conjugate gradient [Blanchard, Kramer '96]
- Chebyshev method [Bauer, Pereverzev, R. '07]
- Nesterov acceleration (Nesterov '83) [Salzo, R. '18]

Stochastic gradient
- Single-pass stochastic gradient [Tarres, Yao '05; Pontil, Ying '09; Bach, Dieuleveut, Flammarion '17]
- Multi-pass incremental gradient [Villa, R. '15]
- Multi-pass stochastic gradient with mini-batches [Lin, R. '16]