

  1. Reinforcement Learning. Robert Platt, Northeastern University. Some images and slides are used from: 1. CS188, UC Berkeley; 2. Russell & Norvig, AIMA.

  2. Conception of agent. [Diagram: the Agent acts on the World and senses the World in return.]

  3. RL conception of agent. The agent takes actions a; the agent perceives states and rewards s, r. [Diagram: the Agent sends actions a to the World; the World returns s, r.] The transition model and reward function are initially unknown to the agent! Value iteration assumed knowledge of these two things...

  4. Value iteration: we know the reward function, and we know the probabilities of moving in each direction when an action is executed. Image: Berkeley CS188 course notes (downloaded Summer 2015)

  5. Reinforcement Learning: the same problem, but now we do not know the reward function or the probabilities of moving in each direction when an action is executed. Image: Berkeley CS188 course notes (downloaded Summer 2015)

  6. The difference between RL and value iteration: RL is an online learning method, while value iteration is an offline solution. Image: Berkeley CS188 course notes (downloaded Summer 2015)

  7. Value iteration vs RL. [Diagram: the racing-car MDP with states Cool, Warm, and Overheated, actions Slow and Fast, transition probabilities 0.5 and 1.0, and rewards +1, +2, and -10.] RL still assumes that we have an MDP. Image: Berkeley CS188 course notes (downloaded Summer 2015)

  8. Value iteration vs RL. [Diagram: the same racing-car MDP (Cool, Warm, Overheated), now shown without its transition probabilities and rewards.] RL still assumes that we have an MDP, but we assume we don't know T or R. Image: Berkeley CS188 course notes (downloaded Summer 2015)

  9. RL example https://www.youtube.com/watch?v=goqWX7bC-ZY

  10. Model-based RL
    1. Estimate T, R by averaging experiences:
       a. choose an exploration policy (a policy that enables the agent to explore all relevant states)
       b. follow the policy for a while
       c. estimate T and R
    2. Solve for a policy using value iteration.
    Image: Berkeley CS188 course notes (downloaded Summer 2015)

  11. Model-based RL (same procedure as the previous slide). The estimates are built from: the number of times the agent reached s' by taking a from s, and the set of rewards obtained when reaching s' by taking a from s.

  12. Model-based RL (same procedure). What's wrong w/ this approach? The estimates use only the number of times the agent reached s' by taking a from s and the set of rewards obtained when reaching s' by taking a from s; a sketch of this estimation step follows below.
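
A minimal sketch of the estimation step on slides 10-12, assuming the experience gathered while following the exploration policy is available as (s, a, s', r) tuples; the function name and data layout are illustrative, not taken from the slides:

```python
from collections import defaultdict

def estimate_model(experiences):
    """Estimate T_hat(s, a, s') and R_hat(s, a, s') by averaging experiences.

    experiences: iterable of (s, a, s_next, r) tuples gathered while
    following some exploration policy.
    """
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = N(s, a, s')
    rewards = defaultdict(list)                      # rewards[(s, a, s')] = observed r values

    for s, a, s_next, r in experiences:
        counts[(s, a)][s_next] += 1
        rewards[(s, a, s_next)].append(r)

    T_hat, R_hat = {}, {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        for s_next, n in outcomes.items():
            T_hat[(s, a, s_next)] = n / total                          # relative frequency
            R_hat[(s, a, s_next)] = sum(rewards[(s, a, s_next)]) / n   # average observed reward

    return T_hat, R_hat
```

The learned T-hat and R-hat then define an ordinary MDP that step 2 (value iteration) can solve.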

  13. Model-based vs Model-free learning. Goal: compute the expected age of students in this class. With known P(A), take the expectation directly. Without P(A), instead collect samples [a_1, a_2, ..., a_N]. Unknown P(A), "Model Based": why does this work? Because eventually you learn the right model. Unknown P(A), "Model Free": why does this work? Because samples appear with the right frequencies. Slide: Berkeley CS188 course notes (downloaded Summer 2015)
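
The same contrast in a few lines of Python; the sample ages below are made up purely for illustration:

```python
from collections import Counter

samples = [21, 22, 22, 23, 25, 21, 22]   # hypothetical sampled ages a_1..a_N
N = len(samples)

# "Model-based": first estimate P(a) from counts, then take the expectation.
p_hat = {age: n / N for age, n in Counter(samples).items()}
expected_model_based = sum(p * age for age, p in p_hat.items())

# "Model-free": average the samples directly, never forming P(a).
expected_model_free = sum(samples) / N

print(expected_model_based, expected_model_free)   # both give the same average (about 22.29)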

  14. RL: a model-free learning approach to estimating the value function
    - We want to improve our estimate of V by computing these averages (written out below).
    - Idea: take samples of outcomes s' (by doing the action!) and average.
    [Diagram: from state s, the agent takes action π(s) and a first successor s'_1 is sampled.]
    Slide: Berkeley CS188 course notes (downloaded Summer 2015)

  15. (Same slide, second sample.) [Diagram: successors s'_1 and s'_2 have now been sampled from the q-state (s, π(s)).] Slide: Berkeley CS188 course notes (downloaded Summer 2015)

  16. (Same slide, third sample.) [Diagram: successors s'_1, s'_2, and s'_3 have been sampled from (s, π(s)).] Slide: Berkeley CS188 course notes (downloaded Summer 2015)

  17. (Same slide, completed.) The sampled successors are averaged to update V(s). Slide: Berkeley CS188 course notes (downloaded Summer 2015)
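
The "averages" referred to on slides 14-17 were equations in the original slide images; reconstructed here from the standard CS188 definitions rather than from recovered text, the fixed-policy value and its sample-based estimate are:

$$ V^\pi(s) = \sum_{s'} T\big(s,\pi(s),s'\big)\,\Big[R\big(s,\pi(s),s'\big) + \gamma\, V^\pi(s')\Big] $$

$$ \text{sample}_i = R\big(s,\pi(s),s'_i\big) + \gamma\, V^\pi(s'_i), \qquad V^\pi(s) \approx \frac{1}{n}\sum_{i=1}^{n} \text{sample}_i $$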

  18. Sidebar: exponential moving average
    - The running interpolation update (written out below).
    - Makes recent samples more important.
    - Forgets about the past (distant past values were wrong anyway).
    Slide: Berkeley CS188 course notes (downloaded Summer 2015)
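
The running interpolation update itself did not survive extraction; the standard form used on this CS188 slide is:

$$ \bar{x}_n = (1-\alpha)\,\bar{x}_{n-1} + \alpha\, x_n $$

Unrolling the recursion shows why recent samples matter more: a sample that is k steps old carries weight proportional to $\alpha(1-\alpha)^k$, which decays geometrically.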

  19. TD Value Learning
    - Big idea: learn from every experience! Update V(s) each time we experience a transition (s, a, s', r); likely outcomes s' will contribute updates more often.
    - Temporal difference learning of values: the policy is still fixed, and we are still doing evaluation!
    - Move values toward the value of whatever successor occurs: a running average. The sample of V(s), the update to V(s), and the equivalent form of the same update are written out below.
    Slide: Berkeley CS188 course notes (downloaded Summer 2015)
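
The sample and update expressions named at the end of slide 19 are missing from the extracted text; the standard TD(0) evaluation update they refer to is:

$$ \text{sample} = r + \gamma\, V^\pi(s') $$

$$ V^\pi(s) \leftarrow (1-\alpha)\,V^\pi(s) + \alpha\,\text{sample} \quad\Longleftrightarrow\quad V^\pi(s) \leftarrow V^\pi(s) + \alpha\,\big(\text{sample} - V^\pi(s)\big) $$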

  20. TD Value Learning: example. Observed states and transitions shown on a grid of states A, B, C, D, E; initial values V(A)=0, V(B)=0, V(C)=0, V(D)=8, V(E)=0. Assume γ = 1, α = 1/2. Slide: Berkeley CS188 course notes (downloaded Summer 2015)

  21. TD Value Learning: example (continued). Observed transition and reward: (B, east, C, -2). Values before the update: A=0, B=0, C=0, D=8, E=0; values after: A=0, B=-1, C=0, D=8, E=0. Slide: Berkeley CS188 course notes (downloaded Summer 2015)

  22. TD Value Learning: example (continued). Observed transitions and rewards: (B, east, C, -2), then (C, east, D, -2). Values after both updates: A=0, B=-1, C=3, D=8, E=0. Slide: Berkeley CS188 course notes (downloaded Summer 2015)
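
A short script that reproduces the two updates above (γ = 1, α = 1/2); the grid layout is irrelevant here, only the value table matters:

```python
gamma, alpha = 1.0, 0.5
V = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 8.0, "E": 0.0}

def td_update(V, s, s_next, r):
    """TD(0) evaluation update: move V(s) toward the observed sample r + gamma*V(s')."""
    sample = r + gamma * V[s_next]
    V[s] += alpha * (sample - V[s])

td_update(V, "B", "C", -2)   # V(B) = 0 + 0.5 * (-2 + 0 - 0) = -1
td_update(V, "C", "D", -2)   # V(C) = 0 + 0.5 * (-2 + 8 - 0) = 3
print(V)                     # {'A': 0.0, 'B': -1.0, 'C': 3.0, 'D': 8.0, 'E': 0.0}
```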

  23. What's the problem w/ TD Value Learning?

  24. What's the problem w/ TD Value Learning? Can't turn the estimated value function into a policy! This is how we did it when we were using value iteration (the extraction step is written out after slide 25). Why can't we do this now?

  25. What's the problem w/ TD Value Learning? Can't turn the estimated value function into a policy! This is how we did it when we were using value iteration (see below). Why can't we do this now? Solution: use TD value learning to estimate Q*, not V*.
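
The extraction step alluded to on slides 24 and 25 was an equation in the original images; written out from the standard definitions, the contrast is:

$$ \pi(s) = \arg\max_a \sum_{s'} T(s,a,s')\,\big[R(s,a,s') + \gamma\, V^*(s')\big] \qquad \text{(needs } T \text{ and } R\text{)} $$

$$ \pi(s) = \arg\max_a Q^*(s,a) \qquad \text{(no model needed)} $$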

  26. Detour: Q-Value Iteration
    - Value iteration: find successive (depth-limited) values. Start with V_0(s) = 0, which we know is right; given V_k, calculate the depth k+1 values for all states.
    - But Q-values are more useful, so compute them instead. Start with Q_0(s,a) = 0, which we know is right; given Q_k, calculate the depth k+1 q-values for all q-states.
    Both updates are written out below.
    Slide: Berkeley CS188 course notes (downloaded Summer 2015)
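
The two depth-(k+1) updates named on slide 26 were equations in the original images; the standard forms are:

$$ V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s,a,s')\,\big[R(s,a,s') + \gamma\, V_k(s')\big] $$

$$ Q_{k+1}(s,a) \leftarrow \sum_{s'} T(s,a,s')\,\big[R(s,a,s') + \gamma \max_{a'} Q_k(s',a')\big] $$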

  27. Q-Learning
    - Q-Learning: sample-based Q-value iteration. Learn Q(s,a) values as you go.
    - Receive a sample (s, a, s', r); consider your old estimate Q(s,a); consider your new sample estimate; incorporate the new estimate into a running average. The update is sketched below.
    Slide: Berkeley CS188 course notes (downloaded Summer 2015)
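
The sample and running-average steps were equations in the original images; they are the standard Q-learning update, sketched here in Python (the defaultdict table and parameter values are illustrative, not from the slides):

$$ \text{sample} = r + \gamma \max_{a'} Q(s',a'), \qquad Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,\text{sample} $$

```python
from collections import defaultdict

Q = defaultdict(float)        # tabular Q-values: Q[(s, a)] defaults to 0
gamma, alpha = 0.9, 0.1       # illustrative discount and learning rate

def q_learning_update(s, a, s_next, r, actions):
    """Fold one experienced transition (s, a, s', r) into the running average."""
    sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions)   # new sample estimate
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample          # blend with old estimate
```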

  28. Exploration v exploitation Image: Berkeley CS188 course notes (downloaded Summer 2015)

  29. Exploration v exploitation: ε-greedy action selection
    - Several schemes for forcing exploration.
    - Simplest: random actions (ε-greedy). Every time step, flip a coin: with (small) probability ε, act randomly; with (large) probability 1-ε, act on the current policy.
    - Problems with random actions? You do eventually explore the space, but you keep thrashing around once learning is done.
    - One solution: lower ε over time. Another solution: exploration functions.
    A sketch of ε-greedy selection follows below.
    Slide: Berkeley CS188 course notes (downloaded Summer 2015)
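
A minimal sketch of ε-greedy selection over a tabular Q like the one above; the function name and default ε are illustrative:

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon act randomly; otherwise act on the current greedy policy."""
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])      # exploit
```

Lowering ε over time, as the slide suggests, is just a schedule on the epsilon argument.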

  30. Generalizing across states
    - Basic Q-Learning keeps a table of all q-values.
    - In realistic situations, we cannot possibly learn about every single state! Too many states to visit them all in training; too many states to hold the q-tables in memory.
    - Instead, we want to generalize: learn about some small number of training states from experience, and generalize that experience to new, similar situations.
    - This is a fundamental idea in machine learning, and we'll see it over and over again.
    Slide: Berkeley CS188 course notes (downloaded Summer 2015)

  31. Generalizing across states. Let's say we discover through experience that this state is bad; in naïve q-learning, we know nothing about this other state, or even this one! [Images: three similar Pacman states.] Slide: Berkeley CS188 course notes (downloaded Summer 2015)

  32. Feature-based representations
    - Solution: describe a state using a vector of features (properties).
    - Features are functions from states to real numbers (often 0/1) that capture important properties of the state.
    - Example features: distance to closest ghost; distance to closest dot; number of ghosts; 1 / (dist to dot)^2; is Pacman in a tunnel? (0/1); ...etc.; is it the exact state on this slide?
    - Can also describe a q-state (s, a) with features (e.g. action moves closer to food).
    Slide: Berkeley CS188 course notes (downloaded Summer 2015)

  33. Linear value functions
    - Using a feature representation, we can write a q function (or value function) for any state using a few weights; a sketch follows below.
    - Advantage: our experience is summed up in a few powerful numbers.
    - Disadvantage: states may share features but actually be very different in value!
    Slide: Berkeley CS188 course notes (downloaded Summer 2015)
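
A minimal sketch of a linear Q-function over features, together with the standard approximate Q-learning weight update from the CS188 materials; the feature function below is a placeholder and the parameter values are illustrative:

$$ Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + \cdots + w_n f_n(s,a) $$

$$ \text{difference} = \big[r + \gamma \max_{a'} Q(s',a')\big] - Q(s,a), \qquad w_i \leftarrow w_i + \alpha\,\text{difference}\, f_i(s,a) $$

```python
import numpy as np

def features(s, a):
    """Placeholder feature vector f(s, a); in Pacman these might be distance to the
    closest ghost, distance to the closest dot, and so on."""
    return np.array([1.0, float(s), float(a)])

w = np.zeros(3)                  # one weight per feature
gamma, alpha = 0.9, 0.01         # illustrative values

def q_value(s, a):
    return float(w @ features(s, a))          # Q(s,a) = sum_i w_i * f_i(s,a)

def approx_q_update(s, a, s_next, r, actions):
    """Approximate Q-learning: move the weights along the TD error instead of a table entry."""
    global w
    target = r + gamma * max(q_value(s_next, a2) for a2 in actions)
    difference = target - q_value(s, a)
    w = w + alpha * difference * features(s, a)
```

Because states that share features share weights, experience generalizes across similar states, which is exactly the advantage and the disadvantage listed on slide 33.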
