Reinforcement Learning III


  1. Reinforcement Learning III (Dec 03, 2008)

  2. Large State Spaces. When a problem has a large state space, we can no longer represent the U or Q functions as explicit tables. Even if we had enough memory, there is never enough training data, and learning takes too long. What to do?

  3. Function Approximation. There is never enough training data, so we must generalize what is learned in one situation to other "similar" new situations. Idea: instead of using a large table to represent U or Q, use a parameterized function with a small number of parameters (generally exponentially fewer parameters than the number of states), and learn the parameters from experience. When we update the parameters based on observations in one state, the U or Q estimate also changes for other similar states, which facilitates generalization of experience.

  4. Example. Consider a grid problem with no obstacles and deterministic actions U/D/L/R (49 states). Features for state s = (x, y): f1(s) = x, f2(s) = y (just 2 features). (Figure: a 7x7 grid with coordinates 0-6, reward 10 at the goal corner and 0 elsewhere.)

  5. Linear Function Approximation. Define a set of state features f1(s), …, fn(s). The features are used as our representation of states; states with similar feature values will be treated similarly. A common approximation is to represent U(s) as a weighted sum of the features (i.e. a linear approximation): U(s) = θ0 + θ1 f1(s) + θ2 f2(s) + … + θn fn(s).
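As a concrete illustration of this weighted-sum form, here is a minimal Python sketch of a linear value approximator; the function name, feature functions, and parameter values are illustrative assumptions, not something given on the slides.

# Linear value approximator: U_theta(s) = theta_0 + theta_1*f1(s) + ... + theta_n*fn(s)

def linear_value(state, theta, features):
    """Evaluate theta[0] + theta[1]*f1(s) + ... + theta[n]*fn(s)."""
    value = theta[0]  # bias parameter theta_0
    for weight, f in zip(theta[1:], features):
        value += weight * f(state)
    return value

# Hypothetical usage with the grid features from the slides: f1(s) = x, f2(s) = y.
features = [lambda s: s[0], lambda s: s[1]]
theta = [10.0, -1.0, -1.0]
print(linear_value((2, 3), theta, features))  # 10 - 2 - 3 = 5.0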

  6. Example. Consider the grid problem with no obstacles and deterministic actions U/D/L/R (49 states). Features for state s = (x, y): f1(s) = x, f2(s) = y (just 2 features), so U(s) = θ0 + θ1 x + θ2 y. Is there a good linear approximation? Yes: θ0 = 10, θ1 = -1, θ2 = -1 (note the upper-right corner is the origin). U(s) = 10 - x - y subtracts the Manhattan distance to the goal from the goal reward. Instead of storing a table of 49 entries, we now only need to store 3 parameters.
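A quick check of this claim, assuming (as the slide describes) that the goal sits at the origin with reward 10 and the true utility is the goal reward minus the Manhattan distance to the goal; the helper names are mine.

# The 3-parameter linear form U(s) = 10 - x - y reproduces all 49 entries of the table
# "goal reward minus Manhattan distance to the goal at (0, 0)".

theta = (10.0, -1.0, -1.0)

def u_linear(x, y):
    return theta[0] + theta[1] * x + theta[2] * y

def u_table(x, y):
    return 10.0 - (abs(x - 0) + abs(y - 0))  # Manhattan distance to the assumed goal at (0, 0)

assert all(u_linear(x, y) == u_table(x, y) for x in range(7) for y in range(7))
print("3 parameters reproduce all 49 table entries")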

  7. Function Approximation Accuracy. The approximation accuracy is fundamentally limited by the information provided by the features. Can we always define features that allow for a perfect linear approximation? Yes: assign each state an indicator feature (i.e. the i'th feature is 1 iff the i'th state is present, and θi represents the value of the i'th state). Of course, this requires far too many features and gives no generalization.
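The indicator-feature construction can be spelled out in a few lines; this sketch uses a made-up value table just to show that one weight per state yields an exact, but non-generalizing, tabular representation.

# Indicator features: f_i(s) = 1 if s is the i'th state, else 0.
# Setting theta_i to the value of state i reproduces any table exactly,
# but needs one parameter per state and generalizes to nothing.

states = [(x, y) for x in range(7) for y in range(7)]      # 49 states
true_values = {s: 10.0 - s[0] - s[1] for s in states}      # made-up target table

def indicator_features(s):
    return [1.0 if s == s_i else 0.0 for s_i in states]

theta = [true_values[s_i] for s_i in states]               # theta_i = value of the i'th state

def u_hat(s):
    return sum(w * f for w, f in zip(theta, indicator_features(s)))

assert all(u_hat(s) == true_values[s] for s in states)
print("exact fit, but", len(theta), "parameters for", len(states), "states")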

  8. Changed Reward: Bad Linear Approximation. U(s) = θ0 + θ1 x + θ2 y. Is there a good linear approximation? No. (Figure: the reward of 10 is moved to the middle of the grid, with 0 elsewhere.)

  9. But What If… U(s) = θ0 + θ1 x + θ2 y + θ3 z. Include a new feature z = |3-x| + |3-y|, the Manhattan distance to the goal location. Does this allow a good linear approximation? Yes: θ0 = 10, θ1 = θ2 = 0, θ3 = -1.
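To tie slides 8 and 9 together, here is a small least-squares check, assuming the changed reward puts the goal at the grid center (3, 3) and the true utility is 10 minus the Manhattan distance to it: the (x, y) features alone leave a large residual, while adding z = |3-x| + |3-y| gives an exact linear fit.

import numpy as np

# Assumed ground truth for the changed-reward grid: goal at (3, 3), U(s) = 10 - (|3-x| + |3-y|).
states = [(x, y) for x in range(7) for y in range(7)]
u_true = np.array([10.0 - (abs(3 - x) + abs(3 - y)) for x, y in states])

def max_fit_error(feature_rows):
    """Least-squares fit of U(s) to theta . features(s); return the worst absolute error."""
    A = np.array(feature_rows)
    theta, *_ = np.linalg.lstsq(A, u_true, rcond=None)
    return np.max(np.abs(A @ theta - u_true))

# Features (1, x, y): no good linear approximation exists.
err_xy = max_fit_error([[1.0, x, y] for x, y in states])
# Features (1, x, y, z) with z = |3-x| + |3-y|: exact fit, theta ~ [10, 0, 0, -1].
err_xyz = max_fit_error([[1.0, x, y, abs(3 - x) + abs(3 - y)] for x, y in states])

print(f"max error with (x, y) only: {err_xy:.2f}")   # clearly nonzero
print(f"max error with z added:     {err_xyz:.2e}")  # essentially zero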

  10. Linear Function Approximation. Define a set of features f1(s), …, fn(s). The features are used as our representation of states; states with similar feature values will be treated similarly, and more complex functions require more complex features. U(s) = θ0 + θ1 f1(s) + θ2 f2(s) + … + θn fn(s). Our goal is to learn good parameter values (i.e. feature weights) that approximate the value function well. How can we do this? Use TD-based RL and somehow update the parameters based on each experience.

  11. TD-based RL for Linear Approximators. 1. Start with initial parameter values. 2. Take an action according to an explore/exploit policy (should converge to a greedy policy, i.e. GLIE). 3. Update the estimated model. 4. Perform a TD update for each parameter: θi ← ? 5. Goto 2. What is a "TD update" for a parameter?

  12. Aside: Gradient Descent for Squared Error. Suppose that we have a sequence of states and target values for each state: (s1, u(s1)), (s2, u(s2)), …, e.g. produced by the TD-based RL loop. Our goal is to minimize the sum of squared errors between our estimated function and each target value: Ej = ½ (Ûθ(sj) - u(sj))², the squared error of example j, where u(sj) is the target value for the j'th state and Ûθ(sj) is our estimated value for the j'th state. After seeing the j'th state, the gradient descent rule tells us to update all parameters by θi ← θi - α ∂Ej/∂θi, where ∂Ej/∂θi = (∂Ej/∂Ûθ(sj)) (∂Ûθ(sj)/∂θi) and α is the learning rate.
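A generic gradient-descent step for this squared error can be sketched as follows; u_hat and grad_u_hat stand for the approximator and its parameter gradient in whatever form they take, and all names here are mine, not the slides'.

# One gradient-descent step on E_j = 1/2 * (U_hat(s_j) - u(s_j))^2.
# Chain rule: dE_j/dtheta_i = (U_hat(s_j) - u(s_j)) * dU_hat(s_j)/dtheta_i, so
# theta_i <- theta_i - alpha * dE_j/dtheta_i = theta_i + alpha * (u(s_j) - U_hat(s_j)) * dU_hat(s_j)/dtheta_i.

def gradient_step(theta, state, target, u_hat, grad_u_hat, alpha=0.1):
    """Move every parameter theta_i toward the target value for this one example."""
    error = target - u_hat(state, theta)   # u(s_j) - U_hat(s_j)
    grad = grad_u_hat(state, theta)        # list of dU_hat(s_j)/dtheta_i
    return [t + alpha * error * g for t, g in zip(theta, grad)]

# Tiny illustration with a one-parameter approximator U_hat(s) = theta[0] * s:
theta = gradient_step([0.0], state=2.0, target=4.0,
                      u_hat=lambda s, th: th[0] * s,
                      grad_u_hat=lambda s, th: [s])
print(theta)  # [0.8], moved toward the target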

  13. Aside: continued. θi ← θi - α ∂Ej/∂θi = θi + α (u(sj) - Ûθ(sj)) ∂Ûθ(sj)/∂θi, where ∂Ûθ(sj)/∂θi depends on the form of the approximator. For a linear approximation function, Ûθ(s) = θ0 + θ1 f1(s) + θ2 f2(s) + … + θn fn(s), so ∂Ûθ(sj)/∂θi = fi(sj). Thus the update becomes θi ← θi + α (u(sj) - Ûθ(sj)) fi(sj). For linear functions this update is guaranteed to converge to the best approximation for a suitable learning rate schedule.
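For the linear case the gradient ∂Ûθ(s)/∂θi is just fi(s), so the whole update fits in a few lines; a sketch, with the convention (my assumption, not stated on the slide) that the feature vector starts with a constant 1 so that θ0 is handled like any other weight.

# Linear-approximator update: theta_i <- theta_i + alpha * (u(s) - U_hat(s)) * f_i(s).

def u_hat(features_s, theta):
    """U_hat(s) for a feature vector that includes a leading 1 for theta_0."""
    return sum(t * f for t, f in zip(theta, features_s))

def linear_update(theta, features_s, target, alpha=0.1):
    error = target - u_hat(features_s, theta)   # u(s) - U_hat(s)
    return [t + alpha * error * f for t, f in zip(theta, features_s)]

# Example: features (1, x, y) for state (2, 3), pushing the estimate toward a target of 5.
theta = linear_update([0.0, 0.0, 0.0], features_s=[1.0, 2.0, 3.0], target=5.0)
print(theta)  # [0.5, 1.0, 1.5]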

  14. TD-based RL for Linear Approximators. 1. Start with initial parameter values. 2. Take an action according to an explore/exploit policy (should converge to a greedy policy, i.e. GLIE); transition from s to s'. 3. Update the estimated model. 4. Perform a TD update for each parameter: θi ← θi + α (u(s) - Ûθ(s)) fi(s). 5. Goto 2. What should we use for the "target value" u(s)? Use the TD prediction based on the next state s': u(s) = R(s) + γ Ûθ(s'). This is the same as the previous TD method, only with approximation.
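Plugging the TD target into that linear update gives a single TD step; this sketch assumes an observed transition from s to s' with reward R(s) and a discount factor gamma, and the helper names are mine.

# One TD update with a linear approximator:
#   target u(s) = R(s) + gamma * U_hat(s'), then
#   theta_i <- theta_i + alpha * (u(s) - U_hat(s)) * f_i(s).

def u_hat(features_s, theta):
    return sum(t * f for t, f in zip(theta, features_s))

def td_update(theta, features_s, features_s_next, reward, alpha=0.1, gamma=0.9):
    target = reward + gamma * u_hat(features_s_next, theta)   # TD target from the next state s'
    error = target - u_hat(features_s, theta)                 # TD error
    return [t + alpha * error * f for t, f in zip(theta, features_s)]

# Example: transition from s = (2, 3) to s' = (1, 3) with reward 0, features (1, x, y).
theta = td_update([10.0, -1.0, -1.0], [1.0, 2.0, 3.0], [1.0, 1.0, 3.0], reward=0.0)
print(theta)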

  15. TD-based RL for Linear Approximators. 1. Start with initial parameter values. 2. Take an action according to an explore/exploit policy (should converge to a greedy policy, i.e. GLIE). 3. Update the estimated model. 4. Perform a TD update for each parameter: θi ← θi + α (R(s) + γ Ûθ(s') - Ûθ(s)) fi(s). 5. Goto 2. Note that step 2 still requires the model T to select an action; to avoid this we can do the same thing for model-free Q-learning.

  16. Q-learning with Linear Approximators. Q̂θ(s, a) = θ0 + θ1 f1(s, a) + θ2 f2(s, a) + … + θn fn(s, a), where the features are a function of states and actions. 1. Start with initial parameter values. 2. Take an action according to an explore/exploit policy (should converge to a greedy policy, i.e. GLIE). 3. Perform a TD update for each parameter: θi ← θi + α (R(s) + γ max_a' Q̂θ(s', a') - Q̂θ(s, a)) fi(s, a). 4. Goto 2. For both Q and U, these algorithms converge to the closest linear approximation to the optimal Q or U.
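A compact sketch of the whole Q-learning loop with a linear approximator; the environment interface (env_reset, env_step), the feature map, and the epsilon-greedy exploration here are illustrative stand-ins, not something specified on the slides.

import random

# Q_hat(s, a) = theta . f(s, a), updated by
#   theta_i <- theta_i + alpha * (R(s) + gamma * max_a' Q_hat(s', a') - Q_hat(s, a)) * f_i(s, a).

def q_hat(theta, feats):
    return sum(t * f for t, f in zip(theta, feats))

def q_learning(env_reset, env_step, features, actions, episodes=100,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    theta = [0.0] * len(features(env_reset(), actions[0]))
    for _ in range(episodes):
        s, done = env_reset(), False
        while not done:
            # Explore/exploit policy (epsilon-greedy as a stand-in for GLIE).
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: q_hat(theta, features(s, a_)))
            s_next, r, done = env_step(s, a)    # observe reward and next state
            # TD target uses a max over next actions; no transition model T is needed.
            best_next = 0.0 if done else max(q_hat(theta, features(s_next, a_)) for a_ in actions)
            f_sa = features(s, a)
            error = r + gamma * best_next - q_hat(theta, f_sa)
            theta = [t + alpha * error * f for t, f in zip(theta, f_sa)]
            s = s_next
    return theta

The max over a' in the target is what removes the need for the transition model T that the U-based update in slide 15 still required.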

  17. Summary of RL. MDPs: definition of an MDP (T, R, S); solving an MDP for the optimal policy via value iteration and policy iteration. RL: the difference between RL and MDPs; methods for passive RL: DUE, ADP, TD; methods for active RL: ADP, Q-learning with TD learning; function approximation for large state/action spaces.

  18. Learning Objectives. 1) Students are able to apply supervised learning algorithms to prediction problems and evaluate the results. 2) Students are able to apply unsupervised learning algorithms to data analysis problems and evaluate the results. 3) Students are able to apply reinforcement learning algorithms to control problems and evaluate the results. 4) Students are able to take a description of a new problem and decide what kind of problem (supervised, unsupervised, or reinforcement) it is.
