Online Learning. Tomaso Poggio and Lorenzo Rosasco. 9.520 Class 15

Online Learning. Tomaso Poggio and Lorenzo Rosasco. 9.520 Class 15, March 30 2011. About this class. Goal: to introduce theory and algorithms for online learning.


  1. Online Learning. Tomaso Poggio and Lorenzo Rosasco. 9.520 Class 15, March 30 2011.

  2. About this class. Goal: to introduce theory and algorithms for online learning.

  3. Plan. Different views on online learning; from batch to online least squares; other loss functions; theory.

  4. (Batch) Learning Algorithms. A learning algorithm $A$ is a map from the data space into the hypothesis space, with $f_S = A(S)$, where $S = S_n = ((x_0, y_0), \dots, (x_{n-1}, y_{n-1}))$. We typically assume that $A$ is deterministic and that $A$ does not depend on the ordering of the points in the training set. Notation: note the weird numbering of the training set (indices start at 0)!

  5. Online Learning Algorithms. The pure online learning approach is $O(1)$ in time and memory with respect to the data: let $f_1 = \text{init}$; for $n = 1, \dots$, set
     $$f_{n+1} = A(f_n, (x_n, y_n)).$$
     The algorithm works sequentially and has a recursive definition.
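
The recursion on this slide can be sketched as a generic loop. This is a minimal illustration, not the lecturers' code: the names `online_learn` and `lms_update`, the scalar data, and the fixed step size are all assumptions made for the example.

```python
from typing import Callable, Iterable, Tuple, TypeVar

F = TypeVar("F")                 # type of the hypothesis f_n
Sample = Tuple[float, float]     # one (x_n, y_n) pair; scalar data is an assumption


def online_learn(init: F, update: Callable[[F, Sample], F], stream: Iterable[Sample]) -> F:
    """Run f_{n+1} = A(f_n, (x_n, y_n)) over a stream, keeping only the current f."""
    f = init
    for sample in stream:
        f = update(f, sample)    # constant work and memory per sample
    return f


def lms_update(w: float, sample: Sample, step: float = 0.1) -> float:
    """Hypothetical choice of A: one gradient step on the squared error."""
    x, y = sample
    return w + step * x * (y - w * x)


# The stream is a generator, so past samples are never stored.
stream = ((x, 2.0 * x) for x in [1.0, -1.0, 0.5, 2.0] * 50)
w_final = online_learn(0.0, lms_update, stream)
```

Because `online_learn` only ever holds the current hypothesis and the current sample, its memory use does not grow with the number of samples seen.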

  6. Online Learning Algorithms (cont.). A related approach (similar to transductive learning) is typically $O(1)$ in time but not in memory with respect to the data: let $f_1 = \text{init}$; for $n = 1, \dots$, set
     $$f_{n+1} = A(f_n, S_n, (x_n, y_n)).$$
     In this case too the algorithm works sequentially and has a recursive definition, but it requires storing the past data $S_n$.

  7. Why Online Learning? There are different motivations and perspectives, which often correspond to different theoretical frameworks: biological plausibility; stochastic approximation; incremental optimization; non-i.i.d. data and the game-theoretic view.

  8. Online Learning and Stochastic Approximation. Our goal is to minimize the expected risk
     $$I[f] = \mathbb{E}_{(x,y)}[V(f(x), y)] = \int V(f(x), y)\, d\mu(x, y)$$
     over the hypothesis space $\mathcal{H}$, but the data distribution is not known. The idea is to use the samples to build an approximate solution and to update this solution as we get more data.

  10. Online Learning and Stochastic Approximation (cont.). More precisely, if we are given samples $(x_i, y_i)_i$ in a sequential fashion, and at the $n$-th step we have an approximation $G(f, (x_n, y_n))$ of the gradient of $I[f]$, then we can define a recursion by: let $f_1 = \text{init}$; for $n = 1, \dots$, set
     $$f_{n+1} = f_n + \gamma_n\, G(f_n, (x_n, y_n)).$$
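
As a hedged illustration of this recursion, the sketch below specializes it to a linear model with the square loss. Several choices are assumptions rather than part of the slides: $G$ is taken as the negative one-sample gradient (up to a factor of 2), so that the plus sign in the recursion yields a descent step; the step-size schedule `gamma0 / (1 + 0.01 * n)` is arbitrary; and the names `grad_estimate`, `stochastic_approximation`, and `sample_stream` are hypothetical.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Vector = List[float]
Sample = Tuple[Vector, float]


def grad_estimate(w: Vector, sample: Sample) -> Vector:
    """G(w, (x, y)) = x * (y - x^T w): a one-sample descent direction for the square loss."""
    x, y = sample
    residual = y - sum(wi * xi for wi, xi in zip(w, x))
    return [xi * residual for xi in x]


def stochastic_approximation(samples: Iterable[Sample], dim: int, gamma0: float = 0.5) -> Vector:
    """Iterate w_{n+1} = w_n + gamma_n * G(w_n, (x_n, y_n)) with a decaying step size."""
    w = [0.0] * dim                               # f_1 = init
    for n, sample in enumerate(samples, start=1):
        gamma_n = gamma0 / (1.0 + 0.01 * n)       # assumed schedule, not from the slides
        g = grad_estimate(w, sample)
        w = [wi + gamma_n * gi for wi, gi in zip(w, g)]
    return w


def sample_stream(n_samples: int, w_true: Vector, seed: int = 0) -> Iterator[Sample]:
    """Yield noise-free pairs (x, w_true^T x) with x uniform in [-1, 1]^d."""
    rng = random.Random(seed)
    for _ in range(n_samples):
        x = [rng.uniform(-1.0, 1.0) for _ in w_true]
        yield x, sum(wi * xi for wi, xi in zip(w_true, x))


w_true = [1.0, -2.0]
w_hat = stochastic_approximation(sample_stream(5000, w_true), dim=2)
```

Each sample is consumed once and then discarded, which matches the sequential setting described on the slide.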

  12. Incremental Optimization. Here our goal is to minimize the empirical risk $I_S[f]$, or the regularized empirical risk
     $$I_S^\lambda[f] = I_S[f] + \lambda \|f\|^2,$$
     over the hypothesis space $\mathcal{H}$, when the number of points is so big (say $n = 10^8$ to $10^9$) that standard solvers would not be feasible. Memory is the main constraint here.

  13. Incremental Optimization (cont.). In this case we can consider: let $f_1 = \text{init}$; for $t = 1, \dots$, set
     $$f_{t+1} = f_t + \gamma_t\, G(f_t, (x_{n_t}, y_{n_t})),$$
     where $G(f_t, (x_{n_t}, y_{n_t}))$ is a pointwise estimate of the gradient of $I_S$ or $I_S^\lambda$. Epochs: note that in this case the number of iterations is decoupled from the index of the training set points, and we can look at the data more than once, that is, consider different epochs.
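
The epoch idea can be sketched as repeated passes over a fixed training set. The following is an assumption-laden example rather than the method from the lecture: it uses a linear model with the square loss, a constant step size `gamma`, a reshuffled visiting order $n_t$ in each epoch, and a descent direction in which constant factors are absorbed into the step size. The function name `incremental_least_squares` is hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]
Dataset = Sequence[Tuple[Vector, float]]


def incremental_least_squares(data: Dataset, dim: int, epochs: int = 20,
                              gamma: float = 0.1, lam: float = 0.0,
                              seed: int = 0) -> Vector:
    """Run w_{t+1} = w_t + gamma * G(w_t, (x_{n_t}, y_{n_t})) for several epochs."""
    rng = random.Random(seed)
    w = [0.0] * dim
    indices = list(range(len(data)))
    for _ in range(epochs):          # one epoch = one pass over the training set
        rng.shuffle(indices)         # n_t: a reshuffled order in each epoch (assumption)
        for n_t in indices:
            x, y = data[n_t]
            residual = y - sum(wi * xi for wi, xi in zip(w, x))
            # Pointwise descent direction for the (regularized) square loss;
            # constant factors are absorbed into gamma.
            w = [wi + gamma * (xi * residual - lam * wi) for wi, xi in zip(w, x)]
    return w


# Four points that are fit exactly by w = [3, -1].
data = [([1.0, 0.0], 3.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 2.0), ([1.0, -1.0], 4.0)]
w_epochs = incremental_least_squares(data, dim=2, epochs=200)
```

The iteration counter `t` runs over `epochs * len(data)` steps, while the sample index `n_t` only ranges over the training set, which is the decoupling the slide refers to.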

  14. Non-i.i.d. data, game-theoretic view. If the data are not i.i.d., we can consider a setting where the data are a finite sequence that will be disclosed to us in a sequential (possibly adversarial) fashion. Then we can see learning as a two-player game where, at each step, nature chooses a sample $(x_i, y_i)$ and a learner chooses an estimator $f_n$. The goal of the learner is to perform as well as if it could view the whole sequence.

  16. Plan. Different views on online learning; from batch to online least squares; other loss functions; theory.

  17. Recalling Least Squares. We start by considering a linear kernel, so that
     $$I_S[f] = \frac{1}{n} \sum_{i=0}^{n-1} (y_i - x_i^T w)^2 = \|Y - Xw\|_n^2.$$
     Remember that in this case
     $$w_n = (X^T X)^{-1} X^T Y = C_n^{-1} \sum_{i=0}^{n-1} x_i y_i,$$
     where $C_n = X^T X = \sum_{i=0}^{n-1} x_i x_i^T$. (Note that if we regularize we have $(C_n + \lambda I)^{-1}$ in place of $C_n^{-1}$.) Notation: note the weird numbering of the training set!

  19. A Recursive Least Squares Algorithm. Then we can consider
     $$w_{n+1} = w_n + C_{n+1}^{-1} x_n [y_n - x_n^T w_n].$$
     Proof. We have
     $$w_n = C_n^{-1} \Big( \sum_{i=0}^{n-1} x_i y_i \Big), \qquad w_{n+1} = C_{n+1}^{-1} \Big( \sum_{i=0}^{n-1} x_i y_i + x_n y_n \Big),$$
     so that
     $$w_{n+1} - w_n = C_{n+1}^{-1} (x_n y_n) + C_{n+1}^{-1} (C_n - C_{n+1}) C_n^{-1} \sum_{i=0}^{n-1} x_i y_i.$$
     Since $C_{n+1} - C_n = x_n x_n^T$ and $C_n^{-1} \sum_{i=0}^{n-1} x_i y_i = w_n$, the second term equals $-C_{n+1}^{-1} x_n x_n^T w_n$, which gives the claimed update.

  23. A Recursive Least Squares Algorithm (cont.). We derived the algorithm
     $$w_{n+1} = w_n + C_{n+1}^{-1} x_n [y_n - x_n^T w_n].$$
     The above approach is recursive; requires storing all the data; and requires inverting a matrix $C_i$ at each step.

  24. A Recursive Least Squares Algorithm (cont.). The following matrix equality helps to alleviate the computational burden. Matrix Inversion Lemma:
     $$[A + BCD]^{-1} = A^{-1} - A^{-1} B [D A^{-1} B + C^{-1}]^{-1} D A^{-1}.$$
     Then
     $$C_{n+1}^{-1} = C_n^{-1} - \frac{C_n^{-1} x_n x_n^T C_n^{-1}}{1 + x_n^T C_n^{-1} x_n}.$$
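
To see how the lemma yields this update, one can substitute $A = C_n$, $B = x_n$, $C = 1$ (a scalar) and $D = x_n^T$, using $C_{n+1} = C_n + x_n x_n^T$ from the recursive least squares derivation. The worked substitution below is a sketch of that step:

```latex
\begin{aligned}
C_{n+1}^{-1} &= \left[ C_n + x_n \cdot 1 \cdot x_n^T \right]^{-1} \\
&= C_n^{-1} - C_n^{-1} x_n \left[ x_n^T C_n^{-1} x_n + 1 \right]^{-1} x_n^T C_n^{-1} \\
&= C_n^{-1} - \frac{C_n^{-1} x_n x_n^T C_n^{-1}}{1 + x_n^T C_n^{-1} x_n}.
\end{aligned}
```

The bracketed term is a scalar, so its inverse is a plain division. Updating $C_n^{-1}$ this way costs $O(d^2)$ operations for $x_n \in \mathbb{R}^d$, compared with roughly $O(d^3)$ for inverting $C_{n+1}$ from scratch.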

  26. A Recursive Least Squares Algorithm (cont.). Moreover,
     $$C_{n+1}^{-1} x_n = C_n^{-1} x_n - \frac{C_n^{-1} x_n x_n^T C_n^{-1}}{1 + x_n^T C_n^{-1} x_n}\, x_n = \frac{C_n^{-1}}{1 + x_n^T C_n^{-1} x_n}\, x_n,$$
     so we can derive the algorithm
     $$w_{n+1} = w_n + \frac{C_n^{-1}}{1 + x_n^T C_n^{-1} x_n}\, x_n [y_n - x_n^T w_n].$$
     Since the above iteration is equivalent to empirical risk minimization (ERM), the conditions ensuring its convergence as $n \to \infty$ are the same as those for ERM.
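
The final update can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the lecturers' implementation: the inverse is initialized to $(\lambda I)^{-1}$, which corresponds to the regularized variant with $(C_n + \lambda I)^{-1}$, the value of `lam` is arbitrary, and the helper names `mat_vec`, `dot`, and `rls` are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def mat_vec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def rls(samples: Sequence[Tuple[Vector, float]], dim: int, lam: float = 1e-3) -> Vector:
    """Recursive least squares using a rank-one update of the inverse.

    c_inv starts at (lam * I)^{-1}, i.e. the regularized variant;
    the value of lam is an assumption of this sketch.
    """
    w = [0.0] * dim
    c_inv = [[(1.0 / lam if i == j else 0.0) for j in range(dim)] for i in range(dim)]
    for x, y in samples:
        cx = mat_vec(c_inv, x)              # C_n^{-1} x_n (C_n^{-1} is symmetric)
        denom = 1.0 + dot(x, cx)            # 1 + x_n^T C_n^{-1} x_n
        gain = [c / denom for c in cx]      # equals C_{n+1}^{-1} x_n
        residual = y - dot(x, w)            # y_n - x_n^T w_n
        w = [wi + gi * residual for wi, gi in zip(w, gain)]
        # C_{n+1}^{-1} = C_n^{-1} - (C_n^{-1} x_n)(C_n^{-1} x_n)^T / denom
        c_inv = [[c_inv[i][j] - cx[i] * cx[j] / denom for j in range(dim)]
                 for i in range(dim)]
    return w


samples = [([1.0, 0.0], 3.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 2.0), ([1.0, -1.0], 4.0)]
w_rls = rls(samples, dim=2)
```

No matrix is inverted inside the loop, and only `w` and `c_inv` are carried from one step to the next. For these four points $X^T X = 3I$ and $X^T Y = (9, -3)$, so the result should agree with the regularized batch solution $(X^T X + \lambda I)^{-1} X^T Y$ up to rounding.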
