  1. CS6501: Topics in Learning and Game Theory (Fall 2019). Intro to Online Learning. Instructor: Haifeng Xu

  2. Outline
  Ø Online Learning/Optimization
  Ø Measure Algorithm Performance via Regret
  Ø Warm-up: A Simple Example

  3. Overview of Machine Learning
  Ø Supervised learning: labeled training data → ML algorithm → classifier / regression function
  Ø Unsupervised learning: unlabeled training data → ML algorithm → clusters / knowledge
  Ø Semi-supervised learning (a combination of the two)
  What else is there?

  4. Overview of Machine Learning
  Ø Supervised learning
  Ø Unsupervised learning
  Ø Semi-supervised learning
  Ø Online learning
  Ø Reinforcement learning
  Ø Active learning
  Ø ...

  6. Online Learning: When Data Come Online
  The online learning pipeline (shown as a diagram on the slide): starting from an initial ML algorithm, the learner makes predictions/decisions, receives a loss/reward, observes one more training instance, and updates the ML algorithm before the next round.

  7. Typical Assumptions on Data
  Ø Statistical feedback: instances drawn from a fixed distribution
    • Image classification, predicting stock prices, choosing restaurants, gambling machines (a.k.a. bandits)
  Ø Adversarial feedback: instances are drawn adversarially
    • Spam detection, anomaly detection, game playing
  Ø Markovian feedback: instances drawn from a distribution which is dynamically changing
    • Interventions, treatments

  11. Online Learning for Decision Making
  Ø Learn to commute to school
    • Bus, walking, or driving? Which route? Uncertainty on the way?
  Ø Learn to gamble or buy stocks
  Ø Advertisers learn to bid for keywords
  Ø Recommendation systems learn to make recommendations
  Ø Clinical trials
  Ø Robots learn to react
  Ø Learn to play games (video games and strategic games)
  Ø Even how you learn to make decisions in your life
  Ø ...

  12. Model Sketch
  Ø A learner acts in an uncertain world for $T$ time steps
  Ø At each step $t = 1, \dots, T$, the learner takes an action $i_t \in [n] = \{1, \dots, n\}$
  Ø The learner observes a cost vector $c_t$, where $c_t(i) \in [0,1]$ is the cost of action $i \in [n]$
    • The learner suffers cost $c_t(i_t)$ at step $t$
    • Can be defined analogously with rewards instead of costs; not much difference
    • There are also "partial feedback" models (not covered here)
  Ø Adversarial feedback: $c_t$ is chosen by an adversary
    • The powerful adversary has access to the entire history up to step $t-1$ (learner actions, past costs, etc.) and also knows the learner's algorithm
    • There are also models with stochastic feedback (not covered here)
  Ø Learner's goal: minimize $\sum_{t \in [T]} c_t(i_t)$

  13. Formal Procedure of the Model
  At each time step $t = 1, \dots, T$, the following occurs in order:
  1. The learner picks a distribution $p_t$ over the actions $[n]$
  2. The adversary picks a cost vector $c_t \in [0,1]^n$ (he knows $p_t$)
  3. An action $i_t \sim p_t$ is drawn and the learner incurs cost $c_t(i_t)$
  4. The learner observes $c_t$ (for use in future time steps)
  Ø The learner tries to pick the distribution sequence $p_1, \dots, p_T$ to minimize the expected cost $\mathbb{E}\big[\sum_{t \in [T]} c_t(i_t)\big]$
    • The expectation is over the randomness of the actions
  Ø The adversary does not have to really exist; it is assumed mainly for the purpose of worst-case analysis
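A minimal Python sketch of these four steps, assuming illustrative `learner`/`adversary` objects with `pick_distribution`, `pick_costs`, and `observe` methods (these names are not from the slides):

```python
# A sketch of the full-information, adversarial-feedback protocol above.
# The Learner/Adversary interfaces are assumptions for illustration only.
import numpy as np

def run_protocol(learner, adversary, n_actions, T, seed=0):
    """Run T rounds and return the learner's realized total cost."""
    rng = np.random.default_rng(seed)
    total_cost = 0.0
    for t in range(T):
        p_t = learner.pick_distribution()      # 1. distribution over [n]
        c_t = adversary.pick_costs(p_t)        # 2. adversary sees p_t, picks c_t in [0,1]^n
        i_t = rng.choice(n_actions, p=p_t)     # 3. sample i_t ~ p_t, incur c_t(i_t)
        total_cost += c_t[i_t]
        learner.observe(c_t)                   # 4. full cost vector revealed to the learner
        adversary.observe(i_t, c_t)            # adversary also records the history
    return total_cost
```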

  14. Well, the Adversary Seems Too Powerful?
  Ø The adversary can choose $c_t \equiv 1$ for all $t$; the learner then suffers cost $T$ regardless of what it does
    • So we cannot do anything non-trivial? Are we done?
  Ø But if $c_t \equiv 1$ for all $t$ and you look back at the end, you do not regret anything: even had you known these costs in hindsight, you could not have done better
    • From this perspective, cost $T$ in this case is not bad
  So what is a good measure for the performance of an online learning algorithm?

  15. Outline
  Ø Online Learning/Optimization
  Ø Measure Algorithm Performance via Regret
  Ø Warm-up: A Simple Example

  17. Regret
  Ø Measures how much the learner regrets, had he known the cost vectors $c_1, \dots, c_T$ in hindsight
  Ø Formally, $R_T = \mathbb{E}_{i_t \sim p_t}\big[\sum_{t \in [T]} c_t(i_t)\big] - \min_{i \in [n]} \sum_{t \in [T]} c_t(i)$
  Ø The benchmark $\min_{i \in [n]} \sum_{t \in [T]} c_t(i)$ is the learner's cost had he known $c_1, \dots, c_T$ in advance and been allowed to take the best single action across all rounds
    • There are other notions of regret, e.g., swap regret (coming later)
    • But the benchmark $\min_{i \in [n]} \sum_{t \in [T]} c_t(i)$ is the one mostly used
  Regret is an appropriate performance measure for online algorithms: it measures exactly the loss due to not knowing the data in advance.

  18. Average Regret
  $\bar{R}_T = \frac{R_T}{T} = \mathbb{E}_{i_t \sim p_t}\big[\frac{1}{T}\sum_{t \in [T]} c_t(i_t)\big] - \min_{i \in [n]} \frac{1}{T}\sum_{t \in [T]} c_t(i)$
  Ø When $\bar{R}_T \to 0$ as $T \to \infty$, we say the algorithm has vanishing regret or no regret; the algorithm is called a no-regret online learning algorithm
    • Equivalently, $R_T$ is sublinear in $T$
    • Both phrasings are used, depending on your habits
  Our goal: design no-regret algorithms by minimizing regret
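Under this notation, a small Python sketch that evaluates the realized regret and average regret of one run; `costs` (an assumed $T \times n$ array of the $c_t(i)$) and `actions` (the realized $i_1, \dots, i_T$) are illustrative inputs, and the true regret additionally takes an expectation over the learner's randomness.

```python
# Realized regret R_T and average regret R_T / T from one run of an algorithm.
import numpy as np

def regret(costs, actions):
    T = costs.shape[0]
    learner_cost = costs[np.arange(T), actions].sum()   # sum_t c_t(i_t)
    best_fixed_cost = costs.sum(axis=0).min()            # min_i sum_t c_t(i)
    R_T = learner_cost - best_fixed_cost
    return R_T, R_T / T                                   # regret, average regret
```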

  19. A Naive Strategy: Follow the Leader (FTL)
  Ø That is, pick the action with the smallest accumulated cost so far
  What is the worst-case regret of FTL? Answer: the worst (largest) regret is $T/2$.
  Ø Consider the following instance with 2 actions:

      t         1  2  3  4  5  ...  T
      c_t(1)    1  0  1  0  1  ...
      c_t(2)    0  1  0  1  0  ...

  Ø FTL always picks the action that is about to incur cost 1, so its total cost is $T$
  Ø The best action in hindsight has cost at most $T/2$
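To make the calculation concrete, here is a hedged sketch of FTL together with the alternating instance from the table above; ties in `argmin` break toward the smaller index, which matches the behavior described on the slide.

```python
# Follow-the-Leader on the two-action alternating instance.
import numpy as np

def ftl_total_cost(costs):
    """Play FTL on a T x n cost matrix; return the total cost incurred."""
    T, n = costs.shape
    cumulative = np.zeros(n)
    total = 0.0
    for t in range(T):
        i_t = int(np.argmin(cumulative))   # action with smallest accumulated cost so far
        total += costs[t, i_t]
        cumulative += costs[t]
    return total

T = 10
costs = np.zeros((T, 2))
costs[0::2, 0] = 1.0                       # c_t(1) = 1, 0, 1, 0, ...
costs[1::2, 1] = 1.0                       # c_t(2) = 0, 1, 0, 1, ...
print(ftl_total_cost(costs))               # T   -- FTL pays 1 every round
print(costs.sum(axis=0).min())             # T/2 -- best single action in hindsight
```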

  20. Randomization is Necessary
  In fact, any deterministic algorithm suffers (linear) regret of at least $(n-1)T/n$.
  Ø Recall that the adversary knows the history and the learner's algorithm
    • So he can infer our $p_t$ at time $t$ (but does not know our sampled $i_t \sim p_t$)
  Ø But if the algorithm is deterministic, the action $i_t$ can also be inferred
  Ø The adversary then simply sets $c_t(i_t) = 1$ and $c_t(i) = 0$ for all $i \neq i_t$
  Ø The learner suffers total cost $T$
  Ø The best action in hindsight has cost at most $T/n$
  Can a randomized algorithm achieve sublinear regret?
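A small sketch of this argument: against any deterministic learner, modeled here as a function from the observed cost history to an action index, the adversary can simulate the learner and place cost 1 exactly on the action it is about to play. The function name and interface are illustrative.

```python
# Adversary against an arbitrary deterministic learner.
import numpy as np

def adversarial_costs(deterministic_learner, n, T):
    """Build the T x n cost matrix; the learner pays 1 in every round."""
    history = []
    costs = np.zeros((T, n))
    for t in range(T):
        i_t = deterministic_learner(history)   # predictable, since there is no randomness
        costs[t, i_t] = 1.0
        history.append(costs[t].copy())
    return costs
# By pigeonhole, some action's total cost is at most T/n,
# so the learner's regret is at least T - T/n = (n-1)T/n.
```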

  21. Outline
  Ø Online Learning/Optimization
  Ø Measure Algorithm Performance via Regret
  Ø Warm-up: A Simple Example

  22. Consider a Simpler (Special) Setting
  Ø Only two types of costs: $c_t(i) \in \{0, 1\}$
  Ø One of the actions is perfect: it always has cost 0
    • The minimum cost in hindsight is thus 0
    • The learner does not know which action is perfect
  Is it possible to achieve sublinear regret in this simpler setting?

  23. A Natural Algorithm
  Observations:
  1. If an action has ever had a non-zero cost, it is not perfect
  2. Among the actions with all-zero costs so far, we currently have no real way to distinguish them
  These observations motivate the following natural algorithm (a code sketch follows below).
  For $t = 1, \dots, T$:
  Ø Identify the set of actions with zero total cost so far, and pick one action from this set uniformly at random.
  Note: there is always at least one action to pick, since the perfect action is always a candidate.
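A minimal sketch of this algorithm for the special setting above (0/1 costs, one perfect action); `cost_stream`, the function name, and the seed handling are illustrative assumptions.

```python
# Each round, play uniformly at random among actions with zero total cost so far.
import numpy as np

def play_zero_cost_uniform(cost_stream, n, seed=0):
    """cost_stream yields 0/1 cost vectors of length n; returns total cost incurred."""
    rng = np.random.default_rng(seed)
    cumulative = np.zeros(n)
    total = 0.0
    for c_t in cost_stream:
        s_good = np.flatnonzero(cumulative == 0)   # actions never seen to incur cost
        i_t = rng.choice(s_good)                   # never empty: the perfect action stays in
        total += c_t[i_t]
        cumulative += np.asarray(c_t)
    return total
```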

  26. Analysis of the Algorithm
  Ø Fix a round $t$; we examine the expected cost incurred in this round
  Ø Let $S_{\text{good}} = \{$actions with zero total cost before round $t\}$ and $k = |S_{\text{good}}|$
    • So each action in $S_{\text{good}}$ is picked with probability $1/k$
  Ø For any parameter $\epsilon \in [0,1]$, one of the following two cases happens:
    • Case 1: at most $\epsilon k$ actions from $S_{\text{good}}$ have cost 1 in this round, in which case we suffer expected cost at most $\epsilon$
    • Case 2: …
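Spelling out the Case 1 bound under the notation above (uniform play over $S_{\text{good}}$, and at most $\epsilon k$ of its actions have cost 1 in this round):

$$
\mathbb{E}\big[c_t(i_t)\big] \;=\; \sum_{i \in S_{\text{good}}} \frac{1}{k}\, c_t(i) \;\le\; \frac{\epsilon k}{k} \;=\; \epsilon .
$$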
