  1. Multi-armed Bandits Prof. Kuan-Ting Lai 2020/3/12

  2. k-armed Bandit Problem • Play k bandit machines and find a way to win the most money! • Note: assume you have unlimited money and never go bankrupt! https://towardsdatascience.com/reinforcement-learning-multi-arm-bandit-implementation-5399ef67b24b

  3. 10-armed Testbed • Each bandit machine has its own reward distribution • In the book's testbed, each true value $q_*(a)$ is drawn from $N(0, 1)$ and each reward from $N(q_*(a), 1)$
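A minimal Python sketch of such a testbed, following the setup above (the class and method names are illustrative, not from the slides):

```python
import numpy as np

class Testbed:
    """A k-armed bandit: one stationary Gaussian reward distribution per arm."""

    def __init__(self, k=10, seed=None):
        self.rng = np.random.default_rng(seed)
        self.k = k
        # True action values q*(a), drawn once from N(0, 1).
        self.q_star = self.rng.normal(0.0, 1.0, size=k)

    def pull(self, a):
        # The reward for pulling arm a is drawn from N(q*(a), 1).
        return self.rng.normal(self.q_star[a], 1.0)
```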

  4. Action-value Function • $Q_t(a)$: the estimated value (expected reward) of action a at time t • Let $q_*(a)$ be the true (optimal) action value: $q_*(a) \doteq \mathbb{E}[R_t \mid A_t = a]$

  5. ε-greedy • Greedy action: always select the action with the maximal estimated value, $A_t \doteq \arg\max_a Q_t(a)$ • ε-greedy: select the greedy action with probability 1 − ε and a uniformly random action with probability ε
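A sketch of ε-greedy with incremental sample-average estimates, reusing the illustrative `Testbed` above (the helper names are assumptions, not the slides' code):

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng):
    """With prob. epsilon pick a random arm, else the greedy arm argmax_a Q(a)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))

def run(bandit, epsilon=0.1, steps=1000, seed=None):
    rng = np.random.default_rng(seed)
    Q = np.zeros(bandit.k)  # estimated action values Q_t(a)
    N = np.zeros(bandit.k)  # selection counts N_t(a)
    total = 0.0
    for _ in range(steps):
        a = epsilon_greedy(Q, epsilon, rng)
        r = bandit.pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]  # incremental sample average (slide 9)
        total += r
    return Q, total / steps
```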

  6. Performance of ε -greedy • Average rewards over 2000 runs with ε =0, 0.1, 0.01

  7. Optimal Actions Selected by ε-greedy • Percentage of optimal actions selected over 2000 runs with ε = 0, 0.1, 0.01

  8. Updating $Q_t(a)$ • Let $Q_n$ denote the estimate of an action's value after it has been selected n − 1 times: $Q_n \doteq \frac{R_1 + R_2 + \cdots + R_{n-1}}{n-1}$

  9. Deriving the Update Rule • $Q_{n+1} = Q_n + \frac{1}{n}\bigl[R_n - Q_n\bigr]$ (derived below) • Requires only memory of $Q_n$ and $n$
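Written out, the derivation expands the sample average (the standard algebra from the book):

```latex
\begin{aligned}
Q_{n+1} &= \frac{1}{n}\sum_{i=1}^{n} R_i
         = \frac{1}{n}\Bigl(R_n + \sum_{i=1}^{n-1} R_i\Bigr)
         = \frac{1}{n}\bigl(R_n + (n-1)\,Q_n\bigr) \\
        &= Q_n + \frac{1}{n}\bigl[R_n - Q_n\bigr]
\end{aligned}
```

This is the general pattern NewEstimate ← OldEstimate + StepSize · [Target − OldEstimate].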

  10. Tracking a Nonstationary Problem • Use a constant step-size $\alpha \in (0, 1]$: $Q_{n+1} \doteq Q_n + \alpha\,[R_n - Q_n]$ • With a constant step-size the estimates never fully converge; instead they keep tracking the most recent rewards, which is exactly what a nonstationary problem calls for

  11. Exponential Recency-weighted Average
$$Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i$$
The weights sum to one: $(1-\alpha)^n + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} = 1$
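A quick numerical sanity check (illustrative values α = 0.1, n = 20): the weight on reward $R_i$ is $\alpha(1-\alpha)^{n-i}$, decaying exponentially with age, and together with the weight $(1-\alpha)^n$ on $Q_1$ the weights sum to one.

```python
import numpy as np

alpha, n = 0.1, 20
# Weight on each past reward R_1 ... R_n: alpha * (1 - alpha)^(n - i)
w = alpha * (1 - alpha) ** (n - np.arange(1, n + 1))
print(w[-3:])                      # the most recent rewards weigh the most
print((1 - alpha) ** n + w.sum())  # plus the weight on Q_1 -> 1.0
```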

  12. Optimistic Initial Values • Setting the initial estimates optimistically high (e.g., $Q_1(a) = +5$ on the testbed) encourages early exploration, since every sampled reward disappoints the estimate • In practice, the bias introduced by the initial values is usually not a problem and fades as actions are sampled
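A sketch of the trick on the testbed (reusing the illustrative `Testbed`; α = 0.1 and $Q_1 = +5$ match the book's comparison figure):

```python
import numpy as np

bandit = Testbed(k=10, seed=0)
Q = np.full(bandit.k, 5.0)  # optimistic initial estimates Q_1(a) = +5
for _ in range(1000):
    a = int(np.argmax(Q))   # purely greedy: epsilon = 0
    r = bandit.pull(a)
    # Each sampled reward "disappoints" the optimistic estimate, pushing it
    # down and sending the greedy choice to a different, still-optimistic arm.
    Q[a] += 0.1 * (r - Q[a])
```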

  13. Upper-Confidence-Bound Action Selection • $N_t(a)$: number of times that action a has been selected prior to time t • $A_t \doteq \arg\max_a \left[\, Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \,\right]$ • Not practical for large state spaces
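A sketch of the selection rule (one assumption made explicit: arms with $N_t(a) = 0$ are tried first, as the book treats untried actions as maximizing):

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ]; untried arms go first."""
    if np.any(N == 0):
        return int(np.argmin(N))  # pick an arm that has never been tried
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
```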

  14. Gradient Bandit Algorithms • Learn a numerical preference $H_t(a)$ for each action and turn preferences into probabilities with the soft-max function: $\pi_t(a) \doteq \Pr\{A_t = a\} = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$ • $\pi_t(a)$ is the probability of taking action a at time t

  15. Selecting Actions Based on $\pi_t(a)$
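A minimal NumPy sketch covering both slides: the soft-max of the preferences and sampling an action from the resulting probabilities (subtracting the max is a standard numerical-stability trick, not part of the formula; helper names are illustrative):

```python
import numpy as np

def softmax(H):
    """pi(a) = exp(H(a)) / sum_b exp(H(b))."""
    z = np.exp(H - H.max())  # shifting by max(H) leaves the ratios unchanged
    return z / z.sum()

def sample_action(H, rng):
    """Draw A_t from the distribution pi_t defined by the preferences H."""
    return int(rng.choice(len(H), p=softmax(H)))
```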

  16. Gradient Ascent • Update each preference in the direction of the gradient of the expected reward: $H_{t+1}(a) \doteq H_t(a) + \alpha \frac{\partial\, \mathbb{E}[R_t]}{\partial H_t(a)}$

  17. Calculating the Gradient • $\mathbb{E}[R_t] = \sum_x \pi_t(x)\, q_*(x)$ • Adding a baseline $B_t$, giving $\sum_x \bigl(q_*(x) - B_t\bigr) \frac{\partial \pi_t(x)}{\partial H_t(a)}$, changes nothing: the probabilities sum to one, so $\sum_x \frac{\partial \pi_t(x)}{\partial H_t(a)} = 0$

  18. Converting the Equation into an Expectation • Multiply and divide by $\pi_t(x)$ so that the sum becomes an expectation over $A_t$ (written out below) • Choose the baseline $B_t = \bar{R}_t$, the average of the rewards received so far
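Written out, following the book's derivation (the last step uses $\mathbb{E}[R_t \mid A_t] = q_*(A_t)$ and the baseline choice $B_t = \bar{R}_t$):

```latex
\begin{aligned}
\frac{\partial\, \mathbb{E}[R_t]}{\partial H_t(a)}
  &= \sum_x \bigl(q_*(x) - B_t\bigr)\,\frac{\partial \pi_t(x)}{\partial H_t(a)}
   = \sum_x \pi_t(x)\,\bigl(q_*(x) - B_t\bigr)\,
     \frac{\partial \pi_t(x)/\partial H_t(a)}{\pi_t(x)} \\
  &= \mathbb{E}\!\left[\bigl(R_t - \bar{R}_t\bigr)\,
     \frac{\partial \pi_t(A_t)/\partial H_t(a)}{\pi_t(A_t)}\right]
\end{aligned}
```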

  19. Calculating the Gradient of the Softmax
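The standard soft-max calculus this slide carries out (the quotient rule applied to the definition from slide 14):

```latex
\frac{\partial \pi_t(x)}{\partial H_t(a)}
  = \frac{\partial}{\partial H_t(a)}\,
    \frac{e^{H_t(x)}}{\sum_{b=1}^{k} e^{H_t(b)}}
  = \pi_t(x)\,\bigl(\mathbb{1}_{a=x} - \pi_t(a)\bigr)
```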

  20. Final Result • Substituting the soft-max gradient gives the gradient bandit update $H_{t+1}(a) \doteq H_t(a) + \alpha \bigl(R_t - \bar{R}_t\bigr)\bigl(\mathbb{1}_{a = A_t} - \pi_t(a)\bigr)$ • The gradient bandit algorithm performs stochastic gradient ascent on the expected reward!
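Putting the whole algorithm into code, reusing the illustrative `Testbed` and `softmax` sketches from above (a sketch under those assumptions, not the slides' implementation):

```python
import numpy as np

def gradient_bandit(bandit, alpha=0.1, steps=1000, seed=None):
    rng = np.random.default_rng(seed)
    H = np.zeros(bandit.k)  # action preferences H_t(a)
    r_bar = 0.0             # baseline: incremental average of all rewards
    for t in range(1, steps + 1):
        pi = softmax(H)
        a = int(rng.choice(bandit.k, p=pi))
        r = bandit.pull(a)
        r_bar += (r - r_bar) / t
        one_hot = np.zeros(bandit.k)
        one_hot[a] = 1.0
        # H_{t+1}(a) = H_t(a) + alpha * (R_t - Rbar_t) * (1{a = A_t} - pi_t(a)):
        # stochastic gradient ascent on the expected reward.
        H += alpha * (r - r_bar) * (one_hot - pi)
    return H
```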

  21. Reference • Chapter 2 of Richard S. Sutton and Andrew G. Barto, “Reinforcement Learning: An Introduction,” 2nd edition, Nov. 2018
