  1. Online Learning. Fei Xia, Language Technology Institute, feixia@cs.cmu.edu. March 16, 2015.

  2. Outline
     - Introduction: why online learning; basic stuff about online learning
     - Prediction with expert advice: Halving algorithm; Weighted majority algorithm; Randomized weighted majority algorithm; Exponential weighted average algorithm

  3. Why online learning?
     - In many cases, data arrives sequentially while predictions are required on the fly
     - Online algorithms do not require any distributional assumption
     - Applicable in adversarial environments
     - Simple algorithms
     - Theoretical guarantees

  4. Introduction: Basic Properties
     - Instead of learning from a training set and then testing on a test set, the online learning scenario mixes the training and test phases.
     - Instead of assuming a fixed distribution over data points, with training and test points sampled in an i.i.d. fashion, online learning makes no distributional assumption.
     - Instead of learning a hypothesis with small generalization error, online learning algorithms are measured using a mistake model and the regret.

  5. Introduction: Basic Setting
     For $t = 1, 2, \ldots, T$:
     - Receive an instance $x_t \in X$
     - Make a prediction $\hat{y}_t \in Y$
     - Receive the true label $y_t \in Y$
     - Suffer loss $L(\hat{y}_t, y_t)$
     Objective: minimize the cumulative loss $\sum_{t=1}^{T} L(\hat{y}_t, y_t)$.
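A minimal Python sketch of this receive/predict/observe loop. The `learner` object (with `predict`/`update` methods), the `stream` of (x, y) pairs, and the `loss` function are illustrative placeholders, not part of the original slides.

```python
def run_online(learner, stream, loss):
    """Receive x_t, predict, receive y_t, suffer loss; return the cumulative loss."""
    total_loss = 0.0
    for x_t, y_t in stream:              # instances arrive sequentially
        y_hat = learner.predict(x_t)     # the prediction is made before the label is seen
        total_loss += loss(y_hat, y_t)   # suffer loss L(y_hat_t, y_t)
        learner.update(x_t, y_t)         # the true label is revealed only afterwards
    return total_loss
```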

  6. Prediction with Expert Advice
     For $t = 1, 2, \ldots, T$:
     - Receive an instance $x_t \in X$
     - Receive advice $y_{t,i} \in Y$, $i \in [1, N]$, from the $N$ experts
     - Make a prediction $\hat{y}_t \in Y$
     - Receive the true label $y_t \in Y$
     - Suffer loss $L(\hat{y}_t, y_t)$
     Figure: Weather forecast, an example of a prediction problem based on expert advice [Mohri et al., 2012]

  7. Regret Analysis
     Objective: minimize the regret $R_T$:
     $$R_T = \sum_{t=1}^{T} L(\hat{y}_t, y_t) - \min_{1 \le i \le N} \sum_{t=1}^{T} L(y_{t,i}, y_t)$$
     What does low regret mean?
     - It means that we don't lose much from not knowing future events.
     - It means that we can perform almost as well as someone who observes the entire sequence and picks the best prediction strategy in hindsight.
     - It means that we can compete with a changing environment.
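A small sketch of how this regret would be computed empirically from recorded per-round losses; the variable names are illustrative, not from the slides.

```python
def regret(learner_losses, expert_losses):
    """R_T: cumulative learner loss minus the cumulative loss of the best expert in hindsight.

    learner_losses: list of L(y_hat_t, y_t) over t = 1..T
    expert_losses:  expert_losses[i] is the list of expert i's losses over t = 1..T
    """
    best_expert = min(sum(losses) for losses in expert_losses)
    return sum(learner_losses) - best_expert
```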

  8. Halving algorithm
     - Realizable case: after some number of rounds $T$, we will learn the concept and no longer make errors.
     - Mistake bound: how many mistakes do we make before we learn a particular concept?
     - Maximum number of mistakes a learning algorithm $A$ makes for a concept $c$, over all sample sequences $S$: $M_A(c) = \max_S |\text{mistakes}(A, c)|$
     - Maximum number of mistakes a learning algorithm $A$ makes for a concept class $C$: $M_A(C) = \max_{c \in C} M_A(c)$

  9. Halving algorithm
     Algorithm 1 HALVING(H)
      1: H_1 ← H
      2: for t ← 1 to T do
      3:   RECEIVE(x_t)
      4:   ŷ_t ← MAJORITYVOTE(H_t, x_t)
      5:   RECEIVE(y_t)
      6:   if ŷ_t ≠ y_t then
      7:     H_{t+1} ← {c ∈ H_t : c(x_t) = y_t}
      8: return H_{T+1}
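A possible Python rendering of Algorithm 1, assuming each hypothesis is a callable c(x) that returns a label; the names and the returned mistake count are illustrative additions.

```python
from collections import Counter

def halving(hypotheses, stream):
    """Predict by majority vote over the active set; on a mistake, drop inconsistent hypotheses."""
    active = list(hypotheses)                      # H_1 <- H
    mistakes = 0
    for x_t, y_t in stream:
        votes = Counter(c(x_t) for c in active)    # MAJORITYVOTE(H_t, x_t)
        y_hat = votes.most_common(1)[0][0]
        if y_hat != y_t:
            mistakes += 1
            active = [c for c in active if c(x_t) == y_t]   # keep only consistent hypotheses
    return active, mistakes
```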

  10. Halving algorithm
      Theorem. Let $H$ be a finite hypothesis set. Then $M_{\text{Halving}}(H) \le \log_2 |H|$.
      Proof. The algorithm makes predictions using a majority vote over the active set, so at each mistake the active set is reduced by at least half. Hence, after $\log_2 |H|$ mistakes, only one active hypothesis can remain. Since we are in the realizable case, this hypothesis must coincide with the target concept, and no further mistakes are made.

  11. Weighted majority algorithm
      Algorithm 2 WEIGHTED-MAJORITY(N)
       1: for i ← 1 to N do
       2:   w_{1,i} ← 1
       3: for t ← 1 to T do
       4:   RECEIVE(x_t)
       5:   if Σ_{i: y_{t,i}=1} w_{t,i} ≥ Σ_{i: y_{t,i}=0} w_{t,i} then
       6:     ŷ_t ← 1
       7:   else
       8:     ŷ_t ← 0
       9:   RECEIVE(y_t)
      10:   if ŷ_t ≠ y_t then
      11:     for i ← 1 to N do
      12:       if y_{t,i} ≠ y_t then
      13:         w_{t+1,i} ← β w_{t,i}
      14:       else w_{t+1,i} ← w_{t,i}
      15: return w_{T+1}
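A Python sketch of Algorithm 2 for binary labels. The callbacks `expert_advice(t)` (returning a list of N predictions in {0, 1}) and `true_label(t)` are hypothetical placeholders.

```python
def weighted_majority(N, T, expert_advice, true_label, beta=0.5):
    w = [1.0] * N                                   # w_{1,i} <- 1
    mistakes = 0
    for t in range(T):
        advice = expert_advice(t)                   # y_{t,i} for i = 1..N
        weight_for_1 = sum(w[i] for i in range(N) if advice[i] == 1)
        weight_for_0 = sum(w[i] for i in range(N) if advice[i] == 0)
        y_hat = 1 if weight_for_1 >= weight_for_0 else 0   # weighted majority vote
        y_t = true_label(t)
        if y_hat != y_t:                            # weights change only when WM errs
            mistakes += 1
            w = [beta * w[i] if advice[i] != y_t else w[i] for i in range(N)]
    return w, mistakes
```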

  12. Weighted majority algorithm
      Theorem. Fix $\beta \in (0, 1)$. Let $m_T$ be the number of mistakes made by algorithm WM after $T \ge 1$ rounds, and let $m_T^*$ be the number of mistakes made by the best of the $N$ experts. Then the following inequality holds:
      $$m_T \le \frac{\log N + m_T^* \log \frac{1}{\beta}}{\log \frac{2}{1+\beta}}$$
      Proof. Introduce a potential function $W_t = \sum_{i=1}^{N} w_{t,i}$, then derive upper and lower bounds on it. Since the predictions are generated by a weighted majority vote, if the algorithm makes an error at round $t$, at least half of the total weight sits on mistaken experts and is multiplied by $\beta$, so
      $$W_{t+1} \le \left(\frac{1+\beta}{2}\right) W_t$$

  13. Weighted majority algorithm
      Proof (cont.). With $m_T$ mistakes after $T$ rounds and $W_1 = N$, this gives
      $$W_{T+1} \le N \left(\frac{1+\beta}{2}\right)^{m_T}$$
      Note that we also have $W_{T+1} \ge w_{T+1,i} = \beta^{m_{T,i}}$, where $m_{T,i}$ is the number of mistakes made by the $i$-th expert. Thus,
      $$\beta^{m_T^*} \le N \left(\frac{1+\beta}{2}\right)^{m_T} \;\Rightarrow\; m_T \le \frac{\log N + m_T^* \log \frac{1}{\beta}}{\log \frac{2}{1+\beta}}$$

  14. Weighted majority algorithm
      $$m_T \le \frac{\log N + m_T^* \log \frac{1}{\beta}}{\log \frac{2}{1+\beta}}$$
      - $m_T \le O(\log N) + \text{constant} \times (\text{mistakes of the best expert})$
      - No assumption about the sequence of samples
      - The number of mistakes is roughly a constant times that of the best expert in hindsight
      - When $m_T^* = 0$, the bound reduces to $m_T \le O(\log N)$, which matches the bound of the Halving algorithm
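A quick numeric sanity check of the bound; the numbers N = 10, β = 1/2, m*_T = 20 are made up purely for illustration.

```python
from math import log

N, beta, m_star = 10, 0.5, 20
bound = (log(N) + m_star * log(1 / beta)) / log(2 / (1 + beta))
print(round(bound, 1))  # ≈ 56.2: roughly log(1/beta)/log(2/(1+beta)) ≈ 2.4 mistakes per best-expert mistake
```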

  15. Randomized weighted majority algorithm
      - Drawback of the weighted majority algorithm: with the zero-one loss, no deterministic algorithm can achieve regret $R_T = o(T)$ in the worst case.
      - In the randomized scenario, a set $A = \{1, \ldots, N\}$ of $N$ actions is available.
      - At each round $t \in [1, T]$, an online algorithm $\mathcal{A}$ selects a distribution $p_t$ over the set of actions.
      - It then receives a loss vector $l_t$, where $l_{t,i} \in \{0, 1\}$ is the loss associated with action $i$.
      - Define the expected loss for round $t$: $\mathcal{L}_t = \sum_{i=1}^{N} p_{t,i} l_{t,i}$, and the total expected loss over $T$ rounds: $\mathcal{L}_T = \sum_{t=1}^{T} \mathcal{L}_t$.
      - Define the total loss associated with action $i$: $L_{T,i} = \sum_{t=1}^{T} l_{t,i}$, and the minimal loss of a single action: $L_T^{\min} = \min_{i \in A} L_{T,i}$.

  16. Randomized weighted majority algorithm
      Algorithm 3 RANDOMIZED-WEIGHTED-MAJORITY(N)
       1: for i ← 1 to N do
       2:   w_{1,i} ← 1
       3:   p_{1,i} ← 1/N
       4: for t ← 1 to T do
       5:   for i ← 1 to N do
       6:     if l_{t,i} = 1 then
       7:       w_{t+1,i} ← β w_{t,i}
       8:     else w_{t+1,i} ← w_{t,i}
       9:   W_{t+1} ← Σ_{i=1}^{N} w_{t+1,i}
      10:   for i ← 1 to N do
      11:     p_{t+1,i} ← w_{t+1,i} / W_{t+1}
      12: return w_{T+1}
      Note: For binary prediction, let $w^{(0)}$ be the total weight on outcome 0, $w^{(1)}$ the total weight on outcome 1, and $W = w^{(0)} + w^{(1)}$; then the prediction strategy is to predict outcome $i$ with probability $w^{(i)}/W$.
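A Python sketch of Algorithm 3. The callback `loss_vector(t)`, returning the N losses l_{t,i} ∈ {0, 1}, is a hypothetical placeholder, and the expected loss is accumulated as defined on the previous slide.

```python
def randomized_weighted_majority(N, T, loss_vector, beta=0.5):
    w = [1.0] * N                      # w_{1,i} <- 1
    p = [1.0 / N] * N                  # p_{1,i} <- 1/N
    total_expected_loss = 0.0
    for t in range(T):
        l = loss_vector(t)                                           # loss vector l_t
        total_expected_loss += sum(pi * li for pi, li in zip(p, l))  # L_t = sum_i p_{t,i} l_{t,i}
        w = [beta * wi if li == 1 else wi for wi, li in zip(w, l)]   # penalize actions that lost
        W = sum(w)                                                   # W_{t+1}
        p = [wi / W for wi in w]                                     # p_{t+1,i} = w_{t+1,i} / W_{t+1}
    return p, total_expected_loss
```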

  17. Randomized weighted majority algorithm
      Theorem. Fix $\beta \in [1/2, 1)$. Then for any $T \ge 1$, the loss of algorithm RWM on any sequence can be bounded as follows:
      $$\mathcal{L}_T \le \frac{\log N}{1 - \beta} + (2 - \beta) L_T^{\min}$$
      In particular, for $\beta = \max\{1/2,\, 1 - \sqrt{(\log N)/T}\}$, the loss can be bounded as:
      $$\mathcal{L}_T \le L_T^{\min} + 2\sqrt{T \log N}$$
      Proof. Define the potential function $W_t = \sum_{i=1}^{N} w_{t,i}$, $t \in [1, T]$.

  18. Proof (cont.)
      $$W_{t+1} = \sum_{i: l_{t,i}=0} w_{t,i} + \beta \sum_{i: l_{t,i}=1} w_{t,i} = W_t + (\beta - 1)\, W_t \sum_{i: l_{t,i}=1} p_{t,i} = W_t \big(1 - (1-\beta)\mathcal{L}_t\big)$$
      $$\Rightarrow\; W_{T+1} = N \prod_{t=1}^{T} \big(1 - (1-\beta)\mathcal{L}_t\big)$$
      Note that we also have $W_{T+1} \ge \max_{i \in [1,N]} w_{T+1,i} = \beta^{L_T^{\min}}$, thus
      $$\beta^{L_T^{\min}} \le N \prod_{t=1}^{T} \big(1 - (1-\beta)\mathcal{L}_t\big)$$
      Taking logarithms and using $\log(1-x) \le -x$,
      $$L_T^{\min} \log \beta \le \log N - (1-\beta)\,\mathcal{L}_T \;\Rightarrow\; \mathcal{L}_T \le \frac{\log N}{1-\beta} + (2-\beta) L_T^{\min}$$
      (the last step rearranges and uses $\log \frac{1}{\beta} \le (1-\beta)(2-\beta)$ for $\beta \ge 1/2$). Since $L_T^{\min} \le T$, this also implies
      $$\mathcal{L}_T \le \frac{\log N}{1-\beta} + (1-\beta) T + L_T^{\min}$$
      Minimizing the right-hand side with respect to $\beta$ gives
      $$\mathcal{L}_T \le L_T^{\min} + 2\sqrt{T \log N} \;\Leftrightarrow\; R_T \le 2\sqrt{T \log N}$$

  19. Exponential weighted average algorithm
      The WM algorithm can be extended to other loss functions $L$ taking values in [0, 1]. The EWA algorithm presented here is a further extension to the case where $L$ is convex in its first argument.
      Algorithm 4 EXPONENTIAL-WEIGHTED-AVERAGE(N)
       1: for i ← 1 to N do
       2:   w_{1,i} ← 1
       3: for t ← 1 to T do
       4:   RECEIVE(x_t)
       5:   ŷ_t ← (Σ_{i=1}^{N} w_{t,i} y_{t,i}) / (Σ_{i=1}^{N} w_{t,i})
       6:   RECEIVE(y_t)
       7:   for i ← 1 to N do
       8:     w_{t+1,i} ← w_{t,i} · e^{−η L(y_{t,i}, y_t)}
       9: return w_{T+1}
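A Python sketch of Algorithm 4. The squared loss on labels in [0, 1] is used purely as an example of a loss that is convex in its first argument; `expert_advice(t)` and `true_label(t)` are hypothetical callbacks.

```python
from math import exp

def exponential_weighted_average(N, T, expert_advice, true_label, eta,
                                 loss=lambda y_hat, y: (y_hat - y) ** 2):
    w = [1.0] * N                                                  # w_{1,i} <- 1
    for t in range(T):
        advice = expert_advice(t)                                  # expert predictions y_{t,i}
        y_hat = sum(wi * a for wi, a in zip(w, advice)) / sum(w)   # weighted-average prediction
        y_t = true_label(t)
        # exponential update: experts with larger loss lose weight faster
        w = [wi * exp(-eta * loss(a, y_t)) for wi, a in zip(w, advice)]
    return w
```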

  20. Exponential weighted average algorithm
      Theorem. Assume that the loss function $L$ is convex in its first argument and takes values in [0, 1]. Then, for any $\eta > 0$ and any sequence $y_1, \ldots, y_T \in Y$, the regret of the EWA algorithm is bounded as:
      $$R_T \le \frac{\log N}{\eta} + \frac{\eta T}{8}$$
      In particular, for $\eta = \sqrt{8 \log N / T}$, the regret is bounded as:
      $$R_T \le \sqrt{(T/2) \log N}$$
      Proof. Define the potential function $\Phi_t = \log \sum_{i=1}^{N} w_{t,i}$, $t \in [1, T]$.

  21. Exponential weighted average algorithm
      Proof (cont.). Using Hoeffding's lemma and the convexity of $L$ in its first argument, one can show that
      $$\Phi_{t+1} - \Phi_t \le -\eta L(\hat{y}_t, y_t) + \frac{\eta^2}{8} \;\Rightarrow\; \Phi_{T+1} - \Phi_1 \le -\eta \sum_{t=1}^{T} L(\hat{y}_t, y_t) + \frac{\eta^2 T}{8}$$
      Then we lower-bound $\Phi_{T+1} - \Phi_1$, writing $L_{T,i} = \sum_{t=1}^{T} L(y_{t,i}, y_t)$ for the cumulative loss of expert $i$:
      $$\Phi_{T+1} - \Phi_1 = \log \sum_{i=1}^{N} e^{-\eta L_{T,i}} - \log N \ge \log \max_{i=1}^{N} e^{-\eta L_{T,i}} - \log N = -\eta \min_{i=1}^{N} L_{T,i} - \log N$$
      Combining the lower and upper bounds, we get
      $$\sum_{t=1}^{T} L(\hat{y}_t, y_t) - \min_{i=1}^{N} L_{T,i} \le \frac{\log N}{\eta} + \frac{\eta T}{8}$$

  22. Exponential weighted average algorithm
      The optimal choice of $\eta$ requires knowledge of $T$, which is a disadvantage of this analysis. How can this be solved? The doubling trick: divide time into periods $[2^k, 2^{k+1} - 1]$ of length $2^k$, $k = 0, \ldots, n$, and choose $\eta_k = \sqrt{\frac{8 \log N}{2^k}}$ in each period. This leads to the following theorem.
      Theorem. Assume that the loss function $L$ is convex in its first argument and takes values in $[0, 1]$. Then for any $T \ge 1$ and any sequence $y_1, \ldots, y_T \in Y$, the regret of the EWA algorithm (with the doubling trick) after $T$ rounds is bounded as follows:
      $$R_T \le \frac{\sqrt{2}}{\sqrt{2} - 1}\sqrt{\frac{T \log N}{2}} + \sqrt{\frac{\log N}{2}}$$
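A small sketch of the doubling-trick schedule (period boundaries and the corresponding η_k); the function name is illustrative. EWA would then be restarted on each period with the yielded η_k.

```python
from math import log, sqrt

def doubling_schedule(N, T):
    """Yield (start, length, eta_k) for periods [2^k, 2^(k+1) - 1], truncated at T."""
    start, k = 0, 0
    while start < T:
        length = min(2 ** k, T - start)
        yield start, length, sqrt(8 * log(N) / 2 ** k)   # eta_k tuned for a horizon of 2^k
        start += length
        k += 1
```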
