  1. CMU-Q 15-381 Lecture 18: Reinforcement Learning I Teacher: Gianni A. Di Caro

  2. HOW REALISTIC ARE MDPS?
     - Assumption 1: the state is known exactly after performing an action. Do we always have an infinitely powerful "GPS" that tells us where we are in the world? Think of a robot moving in a building: how does it know where it is? Relaxing this assumption gives the Partially Observable MDP (POMDP).
     - Assumption 2: the model of the world's dynamics and reward, T and R, is known. Do we always know what the effect of our actions will be when chance is playing against us? Where do those numbers come from? Imagine having to fill in the T matrix for the actions of a wheeled robot on an icy surface...
     - Relaxing this assumption gives Reinforcement Learning problems.

  3. REINFORCEMENT LEARNING
     [Agent-environment diagram: the agent interacts with a memoryless stochastic reward process (MRP), sending an Action and receiving a State and a Reward; the transition model and reward model are unknown ("Transition model? Reward model?").]
     Goal: maximize the expected sum of future rewards.

  4. MDP PLANNING VS. REINFORCEMENT LEARNING
     We don't have a simulator! We have to actually learn what happens if we take an action in a state. (Drawings by Ketrina Yim)

  5. REINFORCEMENT LEARNING PROBLEM
     - The agent can "sense" the environment (it knows the state) and has goals.
     - It learns the effect of actions from interaction with the environment: trial-and-error search.
     - (Delayed) rewards (advisory signals, not error signals).
     - Which actions to take? → the exploration-exploitation dilemma.
     - The agent has to generate the training set by interaction.

  6. REINFORCEMENT LEARNING
     [Same agent-environment diagram as slide 3: memoryless stochastic reward process (MRP); Action, State, Reward; transition and reward models unknown.]
     Goal: maximize the expected sum of future rewards.

  7. PASSIVE REINFORCEMENT LEARNING
     - Before figuring out how to act, let's first just try to figure out how good a given policy π is.
     - Passive learning: the agent's policy is fixed (i.e., in state s it always executes action π(s)) and the task is to estimate the policy's value → learn state values V(s), or state-action values Q(s, a) → policy evaluation.
     - Policy evaluation in MDPs ∼ passive RL: with a known (T, R) model we can apply the Bellman equations; without the (T, R) model we have to learn.

  8. PASSIVE REINFORCEMENT LEARNING
     Two approaches:
     1. Build a model: estimate T(s,a,s')=0.8, R(s,a,s')=4, ..., then solve with Value Iteration.
     [Agent-environment diagram: Transition model? Reward model?; Action, State, Reward.]

  9. PASSIVE REINFORCEMENT LEARNING
     Two approaches:
     1. Build a model.
     2. Model-free: directly estimate V^π, e.g. V^π(s1)=1.8, V^π(s2)=2.5, ...
     [Same agent-environment diagram.]

  10. PASSIVE RL: BUILD A MODEL
      1. Build a model: T(s,a,s')=0.8, R(s,a,s')=4, ...
      [Same agent-environment diagram.]

  11. GRID WORLD EXAMPLE
      Start at (1,1).

  12. GRID WORLD EXAMPLE
      Start at (1,1).
      s=(1,1), action=up ("try up")
      (Adaptation of drawing by Ketrina Yim)

  13. GRID WORLD EXAMPLE
      Start at (1,1).
      s=(1,1), action=up, s'=(1,2), r=-0.01
      (Adaptation of drawing by Ketrina Yim)

  14. GRID WORLD EXAMPLE
      Start at (1,1).
      s=(1,1), action=up, s'=(1,2), r=-0.01
      s=(1,2), action=up
      (Adaptation of drawing by Ketrina Yim)

  15. GRID WORLD EXAMPLE
      Start at (1,1).
      s=(1,1), action=up, s'=(1,2), r=-0.01
      s=(1,2), action=up, s'=(1,2), r=-0.01
      (Adaptation of drawing by Ketrina Yim)

  16. GRID WORLD EXAMPLE
      Start at (1,1).
      s=(1,1), action=up,    s'=(1,2), r=-0.01
      s=(1,2), action=up,    s'=(1,2), r=-0.01
      s=(1,2), action=up,    s'=(1,3), r=-0.01
      s=(1,3), action=right, s'=(2,3), r=-0.01
      s=(2,3), action=right, s'=(3,3), r=-0.01
      s=(3,3), action=right, s'=(3,2), r=-0.01
      s=(3,2), action=up,    s'=(3,3), r=-0.01
      s=(3,3), action=right, s'=(4,3), r=1
      (Adaptation of drawing by Ketrina Yim)

  17. GRID WORLD EXAMPLE
      Start at (1,1).
      s=(1,1), action=up,    s'=(1,2), r=-0.01
      s=(1,2), action=up,    s'=(1,2), r=-0.01
      s=(1,2), action=up,    s'=(1,3), r=-0.01
      s=(1,3), action=right, s'=(2,3), r=-0.01
      s=(2,3), action=right, s'=(3,3), r=-0.01
      s=(3,3), action=right, s'=(3,2), r=-0.01
      s=(3,2), action=up,    s'=(3,3), r=-0.01
      s=(3,3), action=right, s'=(4,3), r=1
      The gathered experience can be used to estimate the MDP's T and R models.
      (Adaptation of drawing by Ketrina Yim)

  18. GRID WORLD EXAMPLE
      Start at (1,1).
      s=(1,1), action=up,    s'=(1,2), r=-0.01
      s=(1,2), action=up,    s'=(1,2), r=-0.01
      s=(1,2), action=up,    s'=(1,3), r=-0.01
      s=(1,3), action=right, s'=(2,3), r=-0.01
      s=(2,3), action=right, s'=(3,3), r=-0.01
      s=(3,3), action=right, s'=(3,2), r=-0.01
      s=(3,2), action=up,    s'=(3,3), r=-0.01
      s=(3,3), action=right, s'=(4,3), r=1
      The gathered experience can be used to estimate the MDP's T and R models.
      Estimate of T(<1,2>, up, <1,3>) = 1/2
      (Adaptation of drawing by Ketrina Yim)
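
The 1/2 estimate above is just a ratio of transition counts. Below is a minimal sketch of that counting scheme (not from the lecture; the tuple layout and the helper name `T_hat` are my own), using the episode listed on this slide:

```python
from collections import defaultdict

# One recorded episode as (s, a, s', r) tuples, taken from the slide's trajectory.
episode = [
    ((1, 1), "up", (1, 2), -0.01),
    ((1, 2), "up", (1, 2), -0.01),
    ((1, 2), "up", (1, 3), -0.01),
    ((1, 3), "right", (2, 3), -0.01),
    ((2, 3), "right", (3, 3), -0.01),
    ((3, 3), "right", (3, 2), -0.01),
    ((3, 2), "up", (3, 3), -0.01),
    ((3, 3), "right", (4, 3), 1.0),
]

# Count how often each (s, a) pair was tried and where each try landed.
sa_counts = defaultdict(int)    # N(s, a)
sas_counts = defaultdict(int)   # N(s, a, s')
for s, a, s_next, r in episode:
    sa_counts[(s, a)] += 1
    sas_counts[(s, a, s_next)] += 1

def T_hat(s, a, s_next):
    """Estimate T(s, a, s') = N(s, a, s') / N(s, a) from the counts."""
    if sa_counts[(s, a)] == 0:
        return None  # action never tried in s: no estimate available
    return sas_counts[(s, a, s_next)] / sa_counts[(s, a)]

print(T_hat((1, 2), "up", (1, 3)))  # -> 0.5, matching the slide's estimate of 1/2
```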

  19. MODEL-BASED PASSIVE REINFORCEMENT LEARNING
      1. Follow policy π, observing transitions and rewards.
      2. Estimate the MDP model parameters T and R from the observed transitions and rewards. With a finite set of states and actions, we can just make a table, count, and average.
      3. Use the estimated MDP to do policy evaluation of π (e.g., using Value Iteration).
      Does this give us all the parameters for an MDP?
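
As a rough sketch of steps 1-3 (my own code, not from the slides), the model is estimated by counting and averaging and the value of π is then computed by repeated Bellman backups on the estimated MDP; the function names, the dict-based data layout, and the discount value are assumptions made for the example:

```python
from collections import defaultdict

def estimate_model(episodes):
    """Estimate the MDP model from experience tuples (s, a, s', r) by counting."""
    n_sa = defaultdict(int)      # N(s, a): how often action a was tried in s
    n_sas = defaultdict(int)     # N(s, a, s'): how often that try landed in s'
    r_sum = defaultdict(float)   # accumulated reward observed for (s, a, s')
    for episode in episodes:
        for s, a, s2, r in episode:
            n_sa[(s, a)] += 1
            n_sas[(s, a, s2)] += 1
            r_sum[(s, a, s2)] += r
    T_hat = {k: n_sas[k] / n_sa[(k[0], k[1])] for k in n_sas}  # count ratio
    R_hat = {k: r_sum[k] / n_sas[k] for k in n_sas}            # average reward
    return T_hat, R_hat

def evaluate_policy(pi, T_hat, R_hat, gamma=0.9, iters=500):
    """Iterative policy evaluation (Bellman backups) on the estimated MDP."""
    states = {s for (s, _, _) in T_hat} | {s2 for (_, _, s2) in T_hat}
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V_new = {}
        for s in states:
            a = pi.get(s)  # states where pi was never observed keep value 0
            V_new[s] = sum(p * (R_hat[(s, a, s2)] + gamma * V[s2])
                           for (s0, a0, s2), p in T_hat.items()
                           if s0 == s and a0 == a)
        V = V_new
    return V
```

Here `pi` is a dict mapping each visited state to the action π(s); with the grid-world episodes above, `estimate_model` reproduces the table-count estimates (e.g. T̂(⟨1,2⟩, up, ⟨1,3⟩) = 0.5) before the backups are run.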

  20. SOME PARAMETERS ARE MISSING
      Start at (1,1).
      s=(1,1), action=up,    s'=(1,2), r=-0.01
      s=(1,2), action=up,    s'=(1,2), r=-0.01
      s=(1,2), action=up,    s'=(1,3), r=-0.01
      s=(1,3), action=right, s'=(2,3), r=-0.01
      s=(2,3), action=right, s'=(3,3), r=-0.01
      s=(3,3), action=right, s'=(3,2), r=-0.01
      s=(3,2), action=up,    s'=(3,3), r=-0.01
      s=(3,3), action=right, s'=(4,3), r=1
      Estimate of T(<1,2>, right, <1,3>)? No idea! We never tried this action...
      (Adaptation of drawing by Ketrina Yim)

  21. PASSIVE MODEL-BASED RL
      - Does this give us all the parameters of the underlying MDP? No.
      - But does that matter for computing the policy value? No: we don't need to reconstruct the whole MDP to perform policy evaluation.
      - We have all the parameters we need: since we know π(s), we can assign non-zero probabilities to the observed transitions and zero to the unobserved ones.
      - We do need to visit every state s ∈ S at least once in order to solve the Bellman equations for all states:

      $$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ R(s_{t+1}) + \gamma V^{\pi}(s_{t+1}) \mid s_t = s \right]
                   = \sum_{s' \in S} p(s' \mid s, \pi(s)) \left[ R(s', s, \pi(s)) + \gamma V^{\pi}(s') \right] \qquad \forall s \in S$$
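
Because the policy is fixed, this Bellman system is linear in V^π, so instead of iterating backups one can solve it directly. A small sketch with NumPy (my own, not from the lecture), assuming the estimated transition matrix under π and the expected one-step rewards have already been arranged as arrays `P_pi` and `r_pi`:

```python
import numpy as np

def solve_policy_value(P_pi, r_pi, gamma=0.9):
    """Solve V = r_pi + gamma * P_pi @ V exactly, i.e. V = (I - gamma P_pi)^{-1} r_pi.

    P_pi[i, j]: estimated probability of moving from state i to state j under pi.
    r_pi[i]:    expected one-step reward from state i when following pi,
                i.e. sum_{s'} p(s'|s, pi(s)) R(s', s, pi(s)).
    """
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

# Tiny 2-state illustration (made-up numbers, not the grid world):
P_pi = np.array([[0.8, 0.2],
                 [0.1, 0.9]])
r_pi = np.array([-0.01, 1.0])
print(solve_policy_value(P_pi, r_pi))
```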

  22. PASSIVE MODEL-BASED RL
      Start at (1,1). Two episodes of experience in the MDP; use them to estimate the MDP parameters and evaluate π.
      Episode 1:
      s=(1,1), action=up,    s'=(1,2), r=-0.01
      s=(1,2), action=up,    s'=(1,2), r=-0.01
      s=(1,2), action=up,    s'=(1,3), r=-0.01
      s=(1,3), action=right, s'=(2,3), r=-0.01
      s=(2,3), action=right, s'=(3,3), r=-0.01
      s=(3,3), action=right, s'=(3,2), r=-0.01
      s=(3,2), action=up,    s'=(3,3), r=-0.01
      s=(3,3), action=right, s'=(4,3), r=1
      Episode 2:
      s=(1,1), action=up,    s'=(2,1), r=-0.01
      s=(2,1), action=right, s'=(3,1), r=-0.01
      s=(3,1), action=up,    s'=(4,1), r=-0.01
      s=(4,1), action=left,  s'=(3,1), r=-0.01
      s=(3,1), action=up,    s'=(3,2), r=-0.01
      s=(3,2), action=up,    s'=(4,2), r=-1
      Is the computed policy value likely to be correct? (1) Yes (2) No (3) Not sure
      (Adaptation of drawing by Ketrina Yim)

  23. PASSIVE REINFORCEMENT LEARNING
      Two approaches:
      1. Build a model.
      2. Model-free: directly estimate V^π, e.g. V^π(s1)=1.8, V^π(s2)=2.5, ...
      [Same agent-environment diagram.]

  24. LET'S CONSIDER AN EPISODIC SCENARIO
      Start at (1,1). Two episodes of (MDP) experience:
      Episode 1:
      s=(1,1), action=up,    s'=(1,2), r=-0.01
      s=(1,2), action=up,    s'=(1,2), r=-0.01
      s=(1,2), action=up,    s'=(1,3), r=-0.01
      s=(1,3), action=right, s'=(2,3), r=-0.01
      s=(2,3), action=right, s'=(3,3), r=-0.01
      s=(3,3), action=right, s'=(3,2), r=-0.01
      s=(3,2), action=up,    s'=(3,3), r=-0.01
      s=(3,3), action=right, s'=(4,3), r=1
      Episode 2:
      s=(1,1), action=up,    s'=(2,1), r=-0.01
      s=(2,1), action=right, s'=(3,1), r=-0.01
      s=(3,1), action=up,    s'=(4,1), r=-0.01
      s=(4,1), action=left,  s'=(3,1), r=-0.01
      s=(3,1), action=up,    s'=(3,2), r=-0.01
      s=(3,2), action=up,    s'=(4,2), r=-1
      Estimate of V(1,1)? Average the episode returns:

      $$\hat{V}(1,1) = \frac{1}{2}\Big[\big(1 + 7\cdot(-0.01)\big) + \big(-1 + 5\cdot(-0.01)\big)\Big]$$

      (Adaptation of drawing by Ketrina Yim)
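
A tiny sketch of this return-averaging computation (my own code; it assumes the undiscounted return, γ = 1, which is how the numbers on the slide combine):

```python
# Rewards received along each of the two episodes that start in (1,1).
episode_rewards = [
    [-0.01] * 7 + [1.0],    # episode 1: seven step penalties, then the +1 goal
    [-0.01] * 5 + [-1.0],   # episode 2: five step penalties, then the -1 pit
]

# Undiscounted return of each episode, then the Monte Carlo estimate of V(1,1).
returns = [sum(rs) for rs in episode_rewards]
v_11 = sum(returns) / len(returns)
print(returns, v_11)   # [0.93, -1.05] (up to float rounding), average -0.06
```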

  25. AVERAGING OBSERVED RETURNS
      Averaging the returns from n episodes, G_1, G_2, ..., G_n:
      - Arithmetic average:
        $$V_{n+1}(s) = \frac{1}{n}\sum_{i=1}^{n} G_i$$
      - Incremental arithmetic average:
        $$V_{n+1}(s) = V_n(s) + \frac{1}{n}\big(G_n - V_n(s)\big)$$
      - Incremental weighted arithmetic average, with weight $w_i$ for episode $i$ and sum over $n$ episodes $W_n = \sum_{i=1}^{n} w_i$:
        $$V_{n+1}(s) = V_n(s) + \frac{w_n}{W_n}\big(G_n - V_n(s)\big)$$
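
A short sketch of the two incremental updates above (my own helper names), showing that the running estimates match the batch averages:

```python
def incremental_average(returns):
    """V <- V + (1/n)(G_n - V): equivalent to the arithmetic mean of the returns."""
    v, n = 0.0, 0
    for g in returns:
        n += 1
        v += (g - v) / n
    return v

def weighted_incremental_average(returns, weights):
    """V <- V + (w_n / W_n)(G_n - V), with W_n the running sum of the weights."""
    v, w_total = 0.0, 0.0
    for g, w in zip(returns, weights):
        w_total += w
        v += (w / w_total) * (g - v)
    return v

returns = [0.93, -1.05, 0.80]
print(incremental_average(returns))                      # same as sum(returns) / 3
print(weighted_incremental_average(returns, [1, 1, 1]))  # equal weights recover the mean
```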

  26. AVERAGING OBSERVED RETURNS
      - Exponentially-weighted average (moving average):
        $$V_{n+1}(s) = V_n(s) + \alpha\big(G_n - V_n(s)\big) = (1-\alpha)V_n(s) + \alpha G_n$$
        (Note: constant $\alpha$ vs. $\tfrac{1}{n}$.)
      - The weights on past returns decrease exponentially:
        $$V_{n+1}(s) = (1-\alpha)^n V_0(s) + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} G_i$$
        For example:
        $$V_1(s) = (1-\alpha)V_0(s) + \alpha G_1$$
        $$V_2(s) = (1-\alpha)V_1(s) + \alpha G_2 = (1-\alpha)\big[(1-\alpha)V_0(s) + \alpha G_1\big] + \alpha G_2 = (1-\alpha)^2 V_0(s) + \alpha(1-\alpha) G_1 + \alpha G_2$$
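
And a sketch of the constant-step-size version (my own code; α here is the constant step size from the update above, so recent returns carry exponentially more weight than old ones):

```python
def exponential_average(returns, alpha=0.1, v0=0.0):
    """V <- V + alpha (G_n - V) = (1 - alpha) V + alpha G_n  (constant step size)."""
    v = v0
    for g in returns:
        v += alpha * (g - v)
    return v

# Recent returns dominate: the weight on G_i is alpha * (1 - alpha)**(n - i).
print(exponential_average([0.93, -1.05, 0.80], alpha=0.5))
```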
