  1. Multi-armed bandits. S. Bubeck, N. Cesa-Bianchi, Foundations and Trends in Machine Learning, 2012. * Real title: Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

  2. Overview
  • Stochastic setting, adversarial setting, extensions & connections.
  • Applications: clinical trials, ad placement, packet routing, video games.
  • Flexible theoretical framework with rigorous guarantees.
  • Feedback is only partially observed: we only see the reward of the arm we chose.
  • This is not supervised learning!

  3. Multi-armed bandit setting
  • Casino with K slot machines.
  • T rounds, timesteps t = 1, …, T.
  • For t = 1, …, T we choose a slot machine / arm to pull / ad to show.
  • If we pull arm k at timestep t, we receive (only) the reward r_{t,k}.
  • Stochastic setting: r_{t,k} is sampled i.i.d. from the distribution of arm k, for example normal: N(μ_k, 1).
  • Generally we don't know the distributions, or we may only know their class. (A minimal simulation sketch follows below.)
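A minimal Python sketch of this setting. The class name GaussianBandit, the pull method, and the fixed unit variance are illustrative assumptions, not from the slides:

```python
import numpy as np

class GaussianBandit:
    """Stochastic K-armed bandit whose arm k pays rewards r ~ N(mu_k, 1)."""

    def __init__(self, means, seed=0):
        self.means = np.asarray(means, dtype=float)  # the mu_k, unknown to the learner
        self.rng = np.random.default_rng(seed)

    @property
    def n_arms(self):
        return len(self.means)

    def pull(self, k):
        # Reward r_{t,k}: an i.i.d. draw from arm k's distribution N(mu_k, 1).
        return self.rng.normal(self.means[k], 1.0)
```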

  4. Multi-armed bandit setting
  • We want to maximize our gains.
  • Say our algorithm chooses machine A_t in round t.
  • Gain: G_A = Σ_{t=1}^T r_{t,A_t}.
  • Performance measure: regret.
  • Regret compares our performance to the best fixed action.
  • Always playing the best arm (with mean μ*) gives expected gain T·μ*.
  • Regret: R = T·μ* − G_A. Low regret = high gain = good. (A bookkeeping sketch follows below.)
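A sketch of the gain and regret bookkeeping, reusing the hypothetical GaussianBandit above; run_policy and the policy(t, history) interface are illustrative assumptions, and the true means are looked up only to evaluate the regret, never by the policy:

```python
import numpy as np

def run_policy(bandit, policy, T):
    """Play T rounds with policy(t, history) -> arm; return the gain G_A and the regret R."""
    gain = 0.0
    history = []                      # (arm, reward) pairs observed so far
    for t in range(T):
        arm = policy(t, history)
        reward = bandit.pull(arm)     # only this arm's reward is revealed
        history.append((arm, reward))
        gain += reward
    # Regret compares against always playing the best fixed arm: R = T * mu_star - G_A.
    regret = T * bandit.means.max() - gain
    return gain, regret
```

For example, a uniformly random baseline policy could be `lambda t, h: np.random.randint(bandit.n_arms)`.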

  5. Stochastic setting
  • Naïve greedy strategy: try all arms once (explore), then keep pulling the best-looking arm (exploit).
  • Why can this fail? We may get unlucky and observe an unrepresentative (low) sample from the best arm in the first round.
  • Balance exploration & exploitation:
  • Exploration: try enough arms to be certain you have a 'good' one.
  • Exploitation: pull the good arm enough to maximize your reward.
  • Algorithms: UCB, Thompson Sampling. Optimism in the face of uncertainty. (A UCB1 sketch follows below.)
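A rough UCB1 sketch under the same assumptions as above (the bandit object is the hypothetical one from the earlier sketch; the sqrt(2 log t / n_k) bonus is the standard textbook form, and the tie-breaking is arbitrary):

```python
import numpy as np

def ucb1(bandit, T):
    """UCB1: always play the arm with the highest upper confidence bound."""
    K = bandit.n_arms
    counts = np.zeros(K)   # number of pulls per arm
    sums = np.zeros(K)     # cumulative reward per arm
    gain = 0.0
    for t in range(T):
        if t < K:
            arm = t        # explore: pull every arm once
        else:
            means = sums / counts
            bonus = np.sqrt(2.0 * np.log(t) / counts)   # optimism in the face of uncertainty
            arm = int(np.argmax(means + bonus))
        r = bandit.pull(arm)
        counts[arm] += 1
        sums[arm] += r
        gain += r
    return gain
```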

  6. Adversarial setting
  • Problem with the stochastic setting: we assume the distributions of the arms are fixed.
  • For example, for advertisements the reward distribution can change over time.
  • Adversary chooses rewards r_{t,k} ∈ [0,1]: no assumptions are made about the rewards.
  • If we choose arm k at time t, the adversary can set r_{t,k} very low.
  • Solve by randomization! For each t, choose a distribution p_t over all K arms.
  • By surprising the adversary, we can still get low regret.
  • Algorithms: EXP3, EXP3.P, FPL.
  • Pessimistic / 'play it safe' / always spread your chances. (An EXP3 sketch follows below.)
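A sketch of EXP3 in this adversarial setting, assuming the whole reward sequence is pre-generated as a T x K matrix with entries in [0, 1] (the gamma value and seed are arbitrary illustrative choices):

```python
import numpy as np

def exp3(reward_matrix, gamma=0.1, seed=0):
    """EXP3: exponential weights with importance-weighted bandit feedback."""
    T, K = reward_matrix.shape
    rng = np.random.default_rng(seed)
    weights = np.ones(K)
    gain = 0.0
    for t in range(T):
        # Randomize: mix the exponential-weights distribution with uniform exploration.
        p = (1.0 - gamma) * weights / weights.sum() + gamma / K
        arm = rng.choice(K, p=p)
        r = reward_matrix[t, arm]            # only the chosen arm's reward is observed
        gain += r
        r_hat = r / p[arm]                   # unbiased, importance-weighted estimate of r_{t,arm}
        weights[arm] *= np.exp(gamma * r_hat / K)
    return gain
```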

  7. Extensions & Connections
  • Contextual bandit: use 'side information'.
  • In each round we receive information about the user (in the advertising example).
  • Naïve solution: run a separate bandit for each user category (cluster users first).
  • More related to supervised learning, since we can use 'features'.
  • Non-stationary bandit: distributions change slowly instead of rewards being chosen by an adversary.
  • The adversary assumption may be too pessimistic, but i.i.d. may be too optimistic.
  • Connection to reinforcement learning: a bandit is an MDP with only one state.
  • Full-information setting: we observe the rewards of all arms in every round.
  • Online learning / Hedge algorithm / exponential weights algorithm, EXP2. (A Hedge sketch follows below.)
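For contrast with EXP3, a sketch of Hedge / exponential weights in the full-information setting, where every arm's reward is revealed each round (the eta value and the expected-gain bookkeeping are illustrative assumptions):

```python
import numpy as np

def hedge(reward_matrix, eta=0.1):
    """Hedge / exponential weights with full information: all K rewards are seen each round."""
    T, K = reward_matrix.shape
    weights = np.ones(K)
    expected_gain = 0.0
    for t in range(T):
        p = weights / weights.sum()              # play arm k with probability p[k]
        expected_gain += p @ reward_matrix[t]    # expected reward this round
        # Unlike EXP3, every reward is observed, so no importance weighting is needed.
        weights *= np.exp(eta * reward_matrix[t])
    return expected_gain
```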
