Online Learning 9.520 Class, 19 March 2007 Sanmay Das (using some slides from Andrea Caponnetto)
About this class
Goal: To introduce the general setting of online learning. To discuss convergence results of the classical Perceptron algorithm. To discuss online gradient descent. To introduce the "experts" framework and prove mistake bounds in that framework. To show the relationship between online learning and the theory of learning in games.
What is online learning?
Sample data are arranged in a sequence. Each time we get a new input, the algorithm tries to predict the corresponding output. As the number of samples seen increases, the predictions hopefully improve.
Assets
1. does not require storing all data samples
2. more plausible model for sequential problems, especially those that involve decision-making
3. typically fast algorithms
4. it is possible to give formal guarantees without assuming probabilistic hypotheses (mistake bounds)
Problems
• Performance can be worse than that of the best batch algorithms
• Generalization bounds always require some assumption on the generation of sample data
Online setting
Sequence of sample data $z_1, z_2, \ldots, z_n$. Each sample is an input-output pair $z_i = (x_i, y_i)$, with $x_i \in X \subset \mathbb{R}^d$ and $y_i \in Y \subset \mathbb{R}$. In the classification case $Y = \{+1, -1\}$; in the regression case $Y = [-M, M]$.
Loss function $V : \mathbb{R} \times Y \to \mathbb{R}_+$ (e.g. the 0-1 loss $E(w, y) = \Theta(-yw)$ and the hinge loss $V(w, y) = |1 - yw|_+$).
Estimators $f_i : X \to Y$ constructed using the first $i$ data samples.
Online setting (cont.)
• initialization: $f_0$
• for $i = 1, 2, \ldots, n$:
  – receive $x_i$
  – predict $f_{i-1}(x_i)$
  – receive $y_i$
  – update: $(f_{i-1}, z_i) \to f_i$
Note: storing $f_{i-1}$ efficiently may require much less memory than storing all previous samples $z_1, z_2, \ldots, z_{i-1}$.
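A minimal sketch of this protocol in Python (illustrative only; the learner interface with `predict` and `update` methods is an assumption, not part of these notes):

```python
def run_online(samples, learner, loss):
    """Generic online protocol: predict with f_{i-1}, then update to f_i.

    samples: iterable of (x_i, y_i) pairs.
    learner: any object with predict(x) and update(x, y) methods
             (hypothetical interface, for illustration).
    loss:    function (prediction, y) -> nonnegative float.
    Returns the cumulative loss sum_i V(f_{i-1}(x_i), y_i).
    """
    cumulative = 0.0
    for x, y in samples:              # receive x_i
        y_hat = learner.predict(x)    # predict f_{i-1}(x_i)
        cumulative += loss(y_hat, y)  # receive y_i, incur loss
        learner.update(x, y)          # update (f_{i-1}, z_i) -> f_i
    return cumulative
```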
Goals
Batch learning: reducing the expected loss $I[f_n] = \mathbb{E}_z\, V(f_n(x), y)$.
Online learning: reducing the cumulative loss $\sum_{i=1}^n V(f_{i-1}(x_i), y_i)$.
The Perceptron Algorithm
We consider the classification problem: $Y = \{-1, +1\}$. We deal with linear estimators $f_i(x) = \omega_i \cdot x$, with $\omega_i \in \mathbb{R}^d$. The 0-1 loss $E(f_i(x), y) = \Theta(-y(\omega_i \cdot x))$ is the natural choice in the classification context. We will also consider the more tractable hinge loss $V(f_i(x), y) = |1 - y(\omega_i \cdot x)|_+$.
Initialize the weight vector to 0. Update rule: if $E_i = E(f_{i-1}(x_i), y_i) = 0$ then $\omega_i = \omega_{i-1}$; otherwise $\omega_i = \omega_{i-1} + y_i x_i$.
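The update rule translates directly into code. A sketch in Python (NumPy arrays are assumed; a mistake is counted whenever $y_i(\omega_{i-1} \cdot x_i) \le 0$, matching the convention used in the proof below):

```python
import numpy as np

def perceptron(samples, d):
    """Perceptron: w_i = w_{i-1} + y_i x_i on each mistake.

    samples: iterable of (x, y) with x in R^d and y in {-1, +1}.
    Returns the final weight vector and the mistake count M.
    """
    w = np.zeros(d)                  # initialize weight vector to 0
    mistakes = 0
    for x, y in samples:
        if y * np.dot(w, x) <= 0:    # E_i = 1: x_i is misclassified
            w = w + y * x            # aggressive step toward larger margin
            mistakes += 1
        # E_i = 0: passive, w is unchanged
    return w, mistakes
```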
The Perceptron Algorithm (cont.)
The update rule follows a passive-aggressive strategy. If $f_{i-1}$ classifies $x_i$ correctly, don't move. If $f_{i-1}$ classifies $x_i$ incorrectly, try to increase the margin $y_i(\omega \cdot x_i)$. In fact,
$$y_i(\omega_i \cdot x_i) = y_i(\omega_{i-1} \cdot x_i) + y_i^2 \|x_i\|^2 > y_i(\omega_{i-1} \cdot x_i).$$
Perceptron Convergence Theorem
Theorem: If the samples $z_1, \ldots, z_n$ are linearly separable, then, presenting them cyclically to the Perceptron algorithm, the sequence of weight vectors $\omega_i$ will eventually converge.
We will prove a more general result encompassing both the separable and the inseparable cases.
(Duda, Hart, and Stork, Pattern Classification, 2001)
Mistake Bound
Theorem: Assume $\|x_i\| \le R$ for every $i = 1, 2, \ldots, n$. Then for every $u \in \mathbb{R}^d$,
$$M \le \left( R\|u\| + \sqrt{\sum_{i=1}^n \hat{V}_i^2} \right)^2,$$
where $\hat{V}_i = V(u \cdot x_i, y_i)$ and $M$ is the total number of mistakes: $M = \sum_{i=1}^n E_i = \sum_{i=1}^n E(f_{i-1}(x_i), y_i)$.
(Crammer et al., Online Passive-Aggressive Algorithms, 2003)
Mistake Bound (cont.)
• The boundedness condition $\|x_i\| \le R$ is necessary.
• In the separable case, there exists $u^*$ inducing margins $y_i(u^* \cdot x_i) \ge 1$, and therefore null "batch" loss over the sample points. The mistake bound becomes $M \le R^2 \|u^*\|^2$.
• In the inseparable case, we can let $u$ be the best possible linear separator. The bound compares the online performance with the best batch performance over a given class of competitors.
Proof
The terms $\omega_i \cdot u$ increase as $i$ increases:
1. If $E_i = 0$ then $\omega_i \cdot u = \omega_{i-1} \cdot u$.
2. If $E_i = 1$, since $\hat{V}_i = |1 - y_i(x_i \cdot u)|_+$,
$$\omega_i \cdot u = \omega_{i-1} \cdot u + y_i(x_i \cdot u) \ge \omega_{i-1} \cdot u + 1 - \hat{V}_i.$$
3. Hence, in both cases, $\omega_i \cdot u \ge \omega_{i-1} \cdot u + (1 - \hat{V}_i) E_i$.
4. Summing up, $\omega_n \cdot u \ge M - \sum_{i=1}^n \hat{V}_i E_i$.
Proof (cont.)
The terms $\|\omega_i\|$ do not increase too quickly:
1. If $E_i = 0$ then $\|\omega_i\|^2 = \|\omega_{i-1}\|^2$.
2. If $E_i = 1$, since $y_i(\omega_{i-1} \cdot x_i) \le 0$,
$$\|\omega_i\|^2 = (\omega_{i-1} + y_i x_i) \cdot (\omega_{i-1} + y_i x_i) = \|\omega_{i-1}\|^2 + \|x_i\|^2 + 2 y_i(\omega_{i-1} \cdot x_i) \le \|\omega_{i-1}\|^2 + R^2.$$
3. Summing up, $\|\omega_n\|^2 \le M R^2$.
Proof (cont.)
Using the estimates for $\omega_n \cdot u$ and $\|\omega_n\|^2$, and applying the Cauchy-Schwarz inequality:
1. By C-S, $\omega_n \cdot u \le \|\omega_n\| \|u\|$, hence
$$M - \sum_{i=1}^n \hat{V}_i E_i \le \omega_n \cdot u \le \|\omega_n\| \|u\| \le \sqrt{M}\, R \|u\|.$$
2. By C-S again, $\sum_{i=1}^n \hat{V}_i E_i \le \sqrt{\sum_{i=1}^n E_i^2}\, \sqrt{\sum_{i=1}^n \hat{V}_i^2} = \sqrt{M} \sqrt{\sum_{i=1}^n \hat{V}_i^2}$ (since $E_i^2 = E_i$). Combining with step 1 and dividing by $\sqrt{M}$,
$$\sqrt{M} - \sqrt{\sum_{i=1}^n \hat{V}_i^2} \le R \|u\|.$$
Squaring yields the stated bound.
Online Gradient Descent
In classical gradient descent algorithms, at each time step take a step in the direction of steepest descent:
$$\Delta w^{(\tau)} = -\eta\, \nabla E \big|_{w^{(\tau)}}.$$
The analysis can grow complicated, depending on various things; typically one uses a quadratic approximation to the error function in the neighborhood of the weight vector (matrix) that actually minimizes the error function.
In online variants,
$$\Delta w^{(\tau)} = -\eta\, \nabla E_n \big|_{w^{(\tau)}},$$
where $E_n$ is the error on a single training example, sampled sequentially or chosen at random.
Online Gradient Descent (contd.)
An example (Werfel, Xie, and Seung, 2004):
$$E = \tfrac{1}{2} \|y - wx\|^2.$$
Suppose $y$ is generated by a teacher network with weights $w^*$, so $y = w^* x$. Let $W = w - w^*$. Then
$$\nabla E = \nabla\!\left( \tfrac{1}{2} \|Wx\|^2 \right) = W x x^T.$$
Therefore, $\Delta w = -\eta\, W x x^T$.
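A short simulation of this teacher-student setup in Python (the dimensions, learning rate, and random teacher are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 5, 3
w_star = rng.normal(size=(d_out, d_in))  # teacher weights (assumed)
w = np.zeros((d_out, d_in))              # student weights
eta = 0.05                               # fixed learning rate

for _ in range(5000):
    x = rng.normal(size=d_in)            # one training example at a time
    y = w_star @ x                       # target produced by the teacher
    W = w - w_star                       # weight error
    w -= eta * np.outer(W @ x, x)        # Delta w = -eta W x x^T

print(np.linalg.norm(w - w_star))        # approaches 0 as training proceeds
```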
Discussion
• The choice of learning rate affects convergence. Choosing $\eta(\tau) \propto 1/\tau$ can guarantee convergence, but can be very slow to converge. A stationary $\eta$ is often the choice in practice, and is particularly useful in dealing with nonstationarity issues.
• Online gradient descent is efficient, especially with redundant information in the training set.
• Its stochastic nature implies it can get out of local minima.
• It may overshoot minima.
• (Bishop, 1995) has lots of information, derivations, etc.
The Experts Framework
We will focus on the classification case. Suppose we have a pool of prediction strategies, called experts, denoted $E = \{E_1, \ldots, E_n\}$. Each expert predicts $y_i$ based on $x_i$. We want to combine these experts to produce a single master algorithm for classification, and prove bounds on how much worse it is than the best expert.
The Halving Algorithm
Suppose all the experts are functions (their predictions for a point in the space do not change over time) and at least one of them is consistent with the data. At each step, predict what the majority of the experts that have not made a mistake so far would predict. Note that all inconsistent experts get thrown away! This makes a maximum of $\log_2(|E|)$ errors (see the sketch below).
But what if there is no consistent function in the pool? (Noise in the data, limited pool, etc.)
(Barzdin and Freivald, On the Prediction of General Recursive Functions, 1972; Littlestone and Warmuth, The Weighted Majority Algorithm, 1994)
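A sketch of the halving algorithm in Python (experts are modeled as plain prediction functions $x \mapsto \{-1, +1\}$, an illustrative assumption):

```python
def halving(experts, samples):
    """Halving: majority vote over the experts still consistent so far.

    experts: list of functions x -> {-1, +1}, at least one of which is
             assumed consistent with the data.
    samples: iterable of (x, y) pairs.
    Returns the number of mistakes made by the master algorithm.
    """
    alive = list(experts)
    mistakes = 0
    for x, y in samples:
        votes = sum(e(x) for e in alive)
        prediction = 1 if votes > 0 else -1      # majority vote (ties -> -1)
        if prediction != y:
            mistakes += 1                        # a majority of `alive` erred
        alive = [e for e in alive if e(x) == y]  # discard inconsistent experts
    return mistakes
```

Each master mistake means at least half of the surviving experts erred and are discarded, so the pool at least halves; since one consistent expert always survives, at most $\log_2(|E|)$ mistakes occur.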
The Weighted Majority Algorithm
Associate a weight $w_i$ with every expert; initialize all weights to 1. At example $t$:
$$q_{-1} = \sum_{i=1}^{|E|} w_i\, I[E_i \text{ predicted } y_t = -1], \qquad q_1 = \sum_{i=1}^{|E|} w_i\, I[E_i \text{ predicted } y_t = 1].$$
Predict $y_t = 1$ if $q_1 > q_{-1}$, else predict $y_t = -1$. If the prediction is wrong, multiply the weights of each expert that made a wrong prediction by $\beta$, where $0 \le \beta < 1$. Note that for $\beta = 0$ we get the halving algorithm.
(Littlestone and Warmuth, 1994)
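A sketch of this algorithm in Python, following the variant stated above (weights are demoted only on trials where the master errs; `beta` is the demotion parameter):

```python
def weighted_majority(experts, samples, beta=0.5):
    """Weighted Majority: weighted vote; demote erring experts by beta.

    experts: list of functions x -> {-1, +1}.
    samples: iterable of (x, y) pairs.
    Returns the number of mistakes made by the master algorithm.
    """
    weights = [1.0] * len(experts)
    mistakes = 0
    for x, y in samples:
        preds = [e(x) for e in experts]
        q_pos = sum(w for w, p in zip(weights, preds) if p == 1)
        q_neg = sum(w for w, p in zip(weights, preds) if p == -1)
        prediction = 1 if q_pos > q_neg else -1
        if prediction != y:
            mistakes += 1
            # multiply the weight of every expert that erred by beta
            weights = [w * beta if p != y else w
                       for w, p in zip(weights, preds)]
    return mistakes
```

With `beta=0`, erring experts' weights drop to zero, recovering the halving algorithm.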
Mistake Bound for WM
For some example $t$, let $W_t = \sum_{i=1}^{|E|} w_i = q_{-1} + q_1$. When a mistake occurs, $W_{t+1} \le u W_t$ where $u < 1$. Therefore $W_0 u^m \ge W_n$, or
$$m \le \frac{\log(W_0 / W_n)}{\log(1/u)}.$$
Setting $u = \frac{1+\beta}{2}$ gives
$$m \le \frac{\log(W_0 / W_n)}{\log(2/(1+\beta))}.$$
Mistake Bound for WM (contd.)
Why? Because when a mistake is made, the ratio of total weight after the trial to total weight before the trial is at most $(1+\beta)/2$. W.l.o.g. assume WM predicted $-1$ and the true outcome was $+1$, so $q_{-1} \ge q_1$. Then the new total weight after the trial is
$$\beta q_{-1} + q_1 \le \beta q_{-1} + q_1 + \frac{1-\beta}{2}(q_{-1} - q_1) = \frac{1+\beta}{2}(q_{-1} + q_1).$$
The main theorem (Littlestone & Warmuth): assume $m_i$ is the number of mistakes made by the $i$th expert on a sequence of $n$ instances, and that $|E| = k$. Then the WM algorithm makes at most
$$\frac{\log(k) + m_i \log(1/\beta)}{\log(2/(1+\beta))}$$
mistakes.
Big fact: ignoring leading constants, the number of errors of the pooled predictor is bounded by the sum of the number of errors of the best expert in the pool and the log of the number of experts!
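To make the constants concrete, instantiating the bound at $\beta = 1/2$ (an illustrative choice) and taking logarithms base 2 gives
$$m \le \frac{\lg k + m_i}{\lg(4/3)} \approx 2.41\,(m_i + \lg k),$$
i.e., roughly 2.41 times the best expert's mistake count plus 2.41 times the log of the pool size.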
Finishing the Proof
$W_0 = k$ and $W_n \ge \beta^{m_i}$.
$\log(W_0 / W_n) = \log(W_0) - \log(W_n)$. Since $W_n \ge \beta^{m_i}$, $\log(W_n) \ge m_i \log \beta$, so $-\log(W_n) \le m_i \log(1/\beta)$.
Therefore $\log(W_0) - \log(W_n) \le \log k + m_i \log(1/\beta)$, and the bound follows.