  1. Reinforcement Learning Kevin Spiteri April 21, 2015

  2. n-armed bandit

  3. n-armed bandit: true payoff probabilities 0.9, 0.5, 0.1

  4. n-armed bandit: true payoff probabilities 0.9, 0.5, 0.1; estimates initialized to 0.0, 0.0, 0.0

  5. n-armed bandit: estimates 0.0, 0.0, 0.0; attempts 0, 0, 0; payoff 0, 0, 0

  6. n-armed bandit: the first arm is pulled once and pays off: estimates 1.0, 0.0, 0.0; attempts 1, 0, 0; payoff 1, 0, 0

  7. n-armed bandit: estimates 0.5, 0.0, 1.0; attempts 2, 0, 1; payoff 1, 0, 1

  8. Exploration: estimates 0.67, 0.0, 1.0; attempts 3, 0, 1; payoff 2, 0, 1

  9. Going on … estimates 0.9, 0.5, 0.1; attempts 280, 10, 10 (total 300); payoff 252, 5, 1 (total 258, average 0.86)

  10. Changing environment: true probabilities change to 0.7, 0.8, 0.1; estimates still 0.9, 0.5, 0.1; attempts 280, 10, 10 (total 300); payoff 252, 5, 1 (total 258, average 0.86)

  11. Changing environment: true probabilities 0.7, 0.8, 0.1; estimates 0.8, 0.65, 0.1; attempts 560, 20, 20 (total 600); payoff 448, 13, 2 (total 463, average 0.77)

  12. Changing environment: true probabilities 0.7, 0.8, 0.1; estimates 0.74, 0.74, 0.1; attempts 1400, 50, 50 (total 1500); payoff 1036, 37, 5 (total 1078, average 0.72)

  13. n-armed bandit ● Optimal payoff (0.82): 0.9 x 300 + 0.8 x 1200 = 1230 ● Actual payoff (0.72): 0.9 x 280 + 0.5 x 10 + 0.1 x 10 + 0.7 x 1120 + 0.8 x 40 + 0.1 x 40 = 1078

  14. n-armed bandit ● Evaluation vs instruction. ● Discounting. ● Initial estimates. ● There is no best way or standard way.
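
  A minimal sketch of the bookkeeping behind slides 4-12, assuming an epsilon-greedy selection rule; the epsilon value and the constant step size are illustrative choices, not values from the slides:

    import random

    def run_bandit(true_probs, steps, epsilon=0.1, step_size=None):
        """Epsilon-greedy n-armed bandit. step_size=None keeps sample averages;
        a constant step_size (e.g. 0.1) tracks a changing environment better."""
        n = len(true_probs)
        estimates = [0.0] * n   # the "estimate" row on the slides
        attempts = [0] * n      # the "attempts" row
        payoffs = [0.0] * n     # the "payoff" row
        for _ in range(steps):
            if random.random() < epsilon:
                arm = random.randrange(n)                        # explore
            else:
                arm = max(range(n), key=lambda a: estimates[a])  # exploit
            reward = 1.0 if random.random() < true_probs[arm] else 0.0
            attempts[arm] += 1
            payoffs[arm] += reward
            alpha = step_size if step_size is not None else 1.0 / attempts[arm]
            estimates[arm] += alpha * (reward - estimates[arm])  # incremental mean
        return estimates, attempts, payoffs

    # e.g. run_bandit([0.9, 0.5, 0.1], steps=300)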

  15. Markov Decision Process (MDP)

  16. Markov Decision Process (MDP) ● States

  17. Markov Decision Process (MDP) ● States

  18. Markov Decision Process (MDP) ● States ● Actions (a, b, c in the diagram)

  19. Markov Decision Process (MDP) ● States ● Actions ● Model (action a: transition probabilities 0.75 and 0.25 in the diagram)

  20. Markov Decision Process (MDP) ● States ● Actions ● Model (action a: transition probabilities 0.75 and 0.25 in the diagram)

  21. Markov Decision Process (MDP) ● States ● Actions ● Model ● Reward (rewards 5, -1 and 0 on the diagram's transitions)

  22. Markov Decision Process (MDP) ● States ● Actions ● Model ● Reward ● Policy

  23. Markov Decision Process (MDP) ● States: ball on table (t), ball in hand (h), ball in basket (b), ball on floor (f)

  24. Markov Decision Process (MDP) ● States: ball on table (t), ball in hand (h), ball in basket (b), ball on floor (f)

  25. Markov Decision Process (MDP) ● States: ball on table (t), in hand (h), in basket (b), on floor (f) ● Actions: a) attempt, b) drop, c) wait

  26. Markov Decision Process (MDP) ● States: ball on table (t), in hand (h), in basket (b), on floor (f) ● Actions: a) attempt, b) drop, c) wait (diagram: attempt transitions with probabilities 0.75 and 0.25)

  27. Markov Decision Process (MDP) ● States: ball on table (t), in hand (h), in basket (b), on floor (f) ● Actions: a) attempt, b) drop, c) wait (diagram: attempt transitions with probabilities 0.75 and 0.25)

  28. Markov Decision Process (MDP) ● States: ball on table (t), in hand (h), in basket (b), on floor (f) ● Actions: a) attempt, b) drop, c) wait (diagram: attempt reaches the basket with probability 0.25 for reward 5 and the floor with probability 0.75 for reward -1; wait gives reward 0)

  29. Markov Decision Process (MDP) ● States: ball on table (t), in hand (h), in basket (b), on floor (f) ● Actions: a) attempt, b) drop, c) wait (diagram: attempt reaches the basket with probability 0.25 for reward 5 and the floor with probability 0.75 for reward -1; wait gives reward 0) Expected reward per round: 0.25 x 5 + 0.75 x (-1) = 0.5

  30. Markov Decision Process (MDP) ● States: ball on table (t), in hand (h), in basket (b), on floor (f) ● Actions: a) attempt, b) drop, c) wait (diagram: rewards 5, -1 and 0 as before, plus a -1 reward at the hand state)

  31. Markov Decision Process (MDP) ● States: ball on table (t), in hand (h), in basket (b), on floor (f) ● Actions: a) attempt, b) drop, c) wait (diagram: rewards 5, -1 and 0 as before, plus a -1 reward at the hand state)
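
  A sketch of the ball example as a transition table. Only part of the model is stated on the slides (attempt: probability 0.25 to the basket with reward 5, 0.75 to the floor with reward -1; wait: reward 0); the state the attempt is made from and the remaining transitions are assumptions for illustration:

    # model[state][action] -> list of (probability, next_state, reward)
    BALL_MDP = {
        "hand": {
            "attempt": [(0.25, "basket", 5.0), (0.75, "floor", -1.0)],  # from the slides
            "drop":    [(1.0, "floor", -1.0)],                          # assumption
            "wait":    [(1.0, "hand", 0.0)],                            # reward 0 from the slides
        },
        # transitions from "table", "basket" and "floor" are not given on the slides
    }

    def expected_reward(outcomes):
        """One-step expected reward of an action: sum of probability x reward."""
        return sum(p * r for p, _, r in outcomes)

    # expected_reward(BALL_MDP["hand"]["attempt"]) == 0.25 * 5 + 0.75 * (-1) == 0.5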

  32. Reinforcement Learning Tools ● Dynamic Programming ● Monte Carlo Methods ● Temporal Difference Learning

  33. Grid World ● Reward: normal move -1, move over obstacle -10 ● Best total reward: -15

  34. Optimal Policy

  35. Value Function (4x4 grid):
      -15  -8  -7   0
      -14  -9  -6  -1
      -13 -10  -5  -2
      -12 -11  -4  -3

  36. Initial Policy

  37. Policy Iteration (4x4 value grid):
      -21 -11 -10   0
      -22 -12 -11  -1
      -23 -13 -12  -2
      -24 -14 -13  -3

  38. Policy Iteration (4x4 value grid):
      -21 -11 -10   0
      -22 -12 -11  -1
      -23 -13 -12  -2
      -24 -14 -13  -3

  39. Policy Iteration (4x4 value grid):
      -21 -11 -10   0
      -22 -12 -11  -1
      -23 -13 -12  -2
      -24 -14 -13  -3

  40. Policy Iteration

  41. Policy Iteration (4x4 value grid):
      -21 -11 -10   0
      -22 -12 -11  -1
      -23 -13 -12  -2
      -15 -14  -4  -3

  42. Policy Iteration (4x4 value grid):
      -21 -11 -10   0
      -22 -12 -11  -1
      -23 -13 -12  -2
      -15 -14  -4  -3

  43. Policy Iteration (4x4 value grid):
      -21 -11 -10   0
      -22 -12 -11  -1
      -23 -13 -12  -2
      -15 -14  -4  -3

  44. Policy Iteration

  45. Policy Iteration (4x4 value grid):
      -15  -8  -7   0
      -14  -9  -6  -1
      -13 -10  -5  -2
      -12 -11  -4  -3
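
  A compact sketch of the policy-iteration loop behind slides 36-45, written against a generic model function rather than the slides' exact grid (the obstacle layout is not fully recoverable); it assumes, as in the slides, that every policy eventually reaches the terminal state, so no discounting is needed:

    def policy_iteration(states, actions, model, gamma=1.0, theta=1e-6):
        """model(s, a) -> list of (probability, next_state, reward); terminal
        states return an empty list. Alternates evaluation and greedy improvement."""
        policy = {s: actions[0] for s in states}
        V = {s: 0.0 for s in states}

        def backup(s, a):
            return sum(p * (r + gamma * V[s2]) for p, s2, r in model(s, a))

        while True:
            # Policy evaluation: sweep until the value of the current policy
            # settles (slides 37-39 and 41-43 show such value grids).
            while True:
                delta = 0.0
                for s in states:
                    v = backup(s, policy[s])
                    delta = max(delta, abs(v - V[s]))
                    V[s] = v
                if delta < theta:
                    break
            # Policy improvement: act greedily with respect to the current values.
            stable = True
            for s in states:
                best = max(actions, key=lambda a: backup(s, a))
                if backup(s, best) > backup(s, policy[s]):
                    policy[s] = best
                    stable = False
            if stable:
                return policy, V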

  46. Value Iteration (4x4 value grid):
       0  0  0  0
       0  0  0  0
       0  0  0  0
       0  0  0  0

  47. Value Iteration (4x4 value grid):
      -1 -1 -1  0
      -1 -1 -1 -1
      -1 -1 -1 -1
      -1 -1 -1 -1

  48. Value Iteration (4x4 value grid):
      -2 -2 -2  0
      -2 -2 -2 -1
      -2 -2 -2 -2
      -2 -2 -2 -2

  49. Value Iteration (4x4 value grid):
      -3 -3 -3  0
      -3 -3 -3 -1
      -3 -3 -3 -2
      -3 -3 -3 -3

  50. Value Iteration (4x4 value grid):
      -15  -8  -7   0
      -14  -9  -6  -1
      -13 -10  -5  -2
      -12 -11  -4  -3
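
  The same grid example as a value-iteration sketch: slides 46-50 are successive full sweeps of the Bellman optimality backup, starting from all zeros; the model interface matches the policy-iteration sketch above:

    def value_iteration(states, actions, model, gamma=1.0, theta=1e-6):
        """Repeat Bellman optimality sweeps until the values stop changing.
        model(s, a) -> list of (probability, next_state, reward); terminal
        states return an empty list and keep value 0."""
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                v = max(sum(p * (r + gamma * V[s2]) for p, s2, r in model(s, a))
                        for a in actions)
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                return V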

  51. Stochastic Model: a move goes in the intended direction with probability 0.95 and in each of two other directions with probability 0.025

  52. Value Iteration (stochastic model 0.95 / 0.025 / 0.025, 4x4 value grid):
      -19.2 -10.4  -9.3   0
      -18.1 -12.1  -8.2  -1.5
      -17.0 -13.6  -6.7  -2.9
      -15.7 -14.7  -5.1  -4.0

  53. Value Iteration (stochastic model 0.95 / 0.025 / 0.025, 4x4 value grid):
      -19.2 -10.4  -9.3   0
      -18.1 -12.1  -8.2  -1.5
      -17.0 -13.6  -6.7  -2.9
      -15.7 -14.7  -5.1  -4.0
      E.g. 13.6 = 0.950 x 13.1 + 0.025 x 27.0 + 0.025 x 16.7
           16.6 = 0.950 x 16.7 + 0.025 x 13.1 + 0.025 x 15.7
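
  Reading the grid entries as positive costs, the worked backup on this slide fits the pattern below; the per-move costs 1 and 10 come from slide 33 (13.1 = 1 + 12.1, 27.0 = 10 + 17.0, 16.7 = 1 + 15.7). This is a sketch of that reading, not the slide's own notation:

    \mathrm{cost}(s) \;=\; \min_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[\, c(s, s') + \mathrm{cost}(s') \,\bigr],
    \qquad c(s, s') = 1 \ \text{(normal move)}, \quad 10 \ \text{(over an obstacle)}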

  54. Richard Bellman

  55. Bellman Equation
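
  In standard notation, the Bellman optimality equation for state values is:

    V^{*}(s) \;=\; \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[\, R(s, a, s') + \gamma\, V^{*}(s') \,\bigr]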

  56. Reinforcement Learning Tools ● Dynamic Programming ● Monte Carlo Methods ● Temporal Difference Learning

  57. Monte Carlo Methods (stochastic model 0.95 / 0.025 / 0.025)

  58. Monte Carlo Methods (stochastic model 0.95 / 0.025 / 0.025)

  59. Monte Carlo Methods (stochastic model): values -32, -22, -10, 0, -21, -11 written in the cells visited by a sampled episode

  60. Monte Carlo Methods (stochastic model 0.95 / 0.025 / 0.025)

  61. Monte Carlo Methods (stochastic model): values -21, -11, -10, 0 written in the cells visited by a sampled episode

  62. Monte Carlo Methods (stochastic model 0.95 / 0.025 / 0.025)

  63. Monte Carlo Methods (stochastic model): values -32, -10, 0, -31, -21, -11 written in the cells visited by a sampled episode
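
  A sketch of the Monte Carlo value estimate these slides illustrate: run whole episodes, then average the return observed from each visited state. The episode format and the discount default are assumptions for illustration:

    from collections import defaultdict

    def monte_carlo_values(episodes, gamma=1.0):
        """Every-visit Monte Carlo. Each episode is a list of (state, reward)
        pairs, where reward is received on leaving that state; returns the
        average return from each state, like the numbers written in the cells."""
        returns = defaultdict(list)
        for episode in episodes:
            G = 0.0
            # Walk backwards so G accumulates the return that follows each state.
            for state, reward in reversed(episode):
                G = reward + gamma * G
                returns[state].append(G)
        return {s: sum(gs) / len(gs) for s, gs in returns.items()}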

  64. Q-Value (stochastic model): action values -15, -10, -8, -20 shown for one cell

  65. Bellman Equation (relating the action values -15, -10, -8, -20)
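
  In standard notation, the corresponding Bellman optimality equation for action values is:

    Q^{*}(s, a) \;=\; \sum_{s'} P(s' \mid s, a)\,\bigl[\, R(s, a, s') + \gamma \max_{a'} Q^{*}(s', a') \,\bigr]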

  66. Learning Rate ● We do not replace an old Q value with a new one. ● We update at a chosen learning rate (see the update rule below). ● Learning rate too small: slow to converge. ● Learning rate too large: unstable. ● Will Dabney's PhD thesis: Adaptive Step-Sizes for Reinforcement Learning.
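
  The update described here, with learning rate (step size) alpha, is the usual incremental form:

    Q_{\text{new}}(s, a) \;=\; Q_{\text{old}}(s, a) + \alpha\,\bigl[\, \text{target} - Q_{\text{old}}(s, a) \,\bigr], \qquad 0 < \alpha \le 1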

  67. Reinforcement Learning Tools ● Dynamic Programming ● Monte Carlo Methods ● Temporal Difference Learning

  68. Richard Sutton

  69. Temporal Difference Learning ● Dynamic Programming: Learn a guess from other guesses (Bootstrapping). ● Monte Carlo Methods: Learn without knowing model.

  70. Temporal Difference Learning Temporal Difference: ● Learn a guess from other guesses (Bootstrapping). ● Learn without knowing model. ● Works with longer episodes than Monte Carlo methods.

  71. Temporal Difference Learning Monte Carlo Methods: ● First run through whole episode. ● Update states at end. Temporal Difference Learning: ● Update state at each step using earlier guesses.

  72. Monte Carlo Methods (stochastic model): values -32, -10, 0, -31, -21, -11 written in the cells visited by a sampled episode

  73. Monte Carlo Methods (stochastic model): values -32, -10, 0, -31, -21, -11 written in the cells visited by a sampled episode

  74. Temporal Difference (stochastic model): values -19, -10, 0, -22, -18, -12 written in the visited cells

  75. Temporal Difference (stochastic model): one-step updated values -23, -28, -21, -11, -10 shown alongside the earlier values -19, -10, 0, -22, -18, -12

  76. Temporal Difference (stochastic model): each new value is the step cost plus the estimate of the next state: 23 = 1 + 22, 28 = 10 + 18, 21 = 10 + 11, 11 = 1 + 10, 10 = 10 + 0 (grid values -19, -10, -23, -10, 0, -22, -18, -12, -28, -21, -11)
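
  A sketch of the TD(0) backup these numbers illustrate; the slide writes costs as positive magnitudes of the negative values, and the learning rate here is an illustrative assumption:

    def td0_update(V, s, reward, s_next, alpha=0.1, gamma=1.0):
        """One TD(0) backup: move V(s) toward the one-step target
        reward + gamma * V(s_next). Slide 76's '11 = 1 + 10' arithmetic is this
        target written in positive-cost form."""
        target = reward + gamma * V[s_next]
        V[s] += alpha * (target - V[s])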

  77. Function Approximation ● Most problems have large state space. ● We can generally design an approximation for the state space. ● Choosing the correct approximation has a large influence on system performance.

  78. Mountain Car Problem

  79. Mountain Car Problem ● Car cannot make it to the top directly. ● Car can swing back and forth to gain momentum. ● We know x and ẋ. ● x and ẋ give an infinite state space. ● Random – may get to the top in 1000 steps. ● Optimal – may get to the top in 102 steps.

  80. Function Approximation ● We can partition the state space into a 200 x 200 grid. ● Coarse coding: different ways of partitioning the state space. ● We can approximate V = wᵀf. ● E.g. f = (x, ẋ, height, ẋ²)ᵀ. ● We can estimate w to solve the problem.
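
  A sketch of the linear approximation V = wᵀf with a semi-gradient TD(0) weight update, using the slide's example features; the step size is an illustrative assumption:

    import numpy as np

    def features(x, xdot, height):
        """Slide 80's example feature vector: f = (x, xdot, height, xdot^2)."""
        return np.array([x, xdot, height, xdot ** 2])

    def value(w, f):
        """Linear approximation V = w^T f."""
        return w @ f

    def td0_weight_update(w, f, reward, f_next, alpha=0.01, gamma=1.0):
        """Semi-gradient TD(0): nudge w along the feature vector by the TD error."""
        td_error = reward + gamma * (w @ f_next) - (w @ f)
        return w + alpha * td_error * f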

  81. Problems with Reinforcement Learning ● The policy sometimes gets worse during learning: Safe Reinforcement Learning (Phil Thomas) guarantees an improved policy over the current policy. ● Policies are often very specific to the training task: Learning Parameterized Skills (Bruno Castro da Silva, PhD thesis).

  82. Checkers ● Arthur Samuel (IBM) 1959

  83. TD-Gammon ● Neural networks and temporal difference. ● Current programs play better than human experts. ● Expert work in input selection.
