our toy problem: a 3×3 grid of states and a lookup table of action values, all initialised to zero

the grid (state 7 is home):

  1  2  3
  4  5  6
  7  8  9

the lookup table, one row per state, one column per action:

        N   S   E   W
  1     0   0   0   0
  2     0   0   0   0
  3     0   0   0   0
  4     0   0   0   0
  5     0   0   0   0
  6     0   0   0   0
  7     0   0   0   0   (home)
  8     0   0   0   0
  9     0   0   0   0
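as a minimal sketch (not the lecture's code), the lookup table is just a 9×4 array of zeros; the names here are my own labels:

import numpy as np

# states 1..9 of the 3x3 grid (7 is home), actions N, S, E, W
ACTIONS = ["N", "S", "E", "W"]
Q = np.zeros((9, len(ACTIONS)))   # Q[state - 1, action_index]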
reward structure?

  move to 7 (home):                  +10
  move out of bounds:                 -5
  move to 5:                         -10
  move to any cell except 5 and 7:    -1
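a sketch of this reward rule as code; I'm assuming (not stated on the slide) that an out-of-bounds move leaves the agent where it is:

# 3x3 grid, states numbered 1..9 row by row; 7 is home, 5 is the penalised cell
def step(state, action):
    """Return (next_state, reward) for one move in the toy grid."""
    row, col = divmod(state - 1, 3)
    drow, dcol = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}[action]
    nrow, ncol = row + drow, col + dcol
    if not (0 <= nrow < 3 and 0 <= ncol < 3):
        return state, -5                     # out of bounds: stay put
    nxt = nrow * 3 + ncol + 1
    if nxt == 7:
        return nxt, 10                       # reached home
    if nxt == 5:
        return nxt, -10                      # stepped into cell 5
    return nxt, -1                           # any other cell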
let's fix α = 0.1, γ = 0.5 (the lookup table is still all zeros)
the Q-learning update:

$$Q(s, a) \leftarrow Q(s, a) + \alpha\bigl(r_s^a + \gamma \max_{a'} Q(s', a') - Q(s, a)\bigr)$$

with α = 0.1, γ = 0.5 and, say, an ε-greedy policy… episode 1 begins...
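a minimal tabular sketch of the update and the ε-greedy choice (Q indexed as in the earlier sketch; the slide does not fix ε, 0.1 below is just for illustration):

import random
import numpy as np

ALPHA, GAMMA, EPSILON = 0.1, 0.5, 0.1      # epsilon is my choice, not the slide's

def epsilon_greedy(Q, state):
    """With probability epsilon explore, otherwise take the greedy action."""
    if random.random() < EPSILON:
        return random.randrange(Q.shape[1])
    return int(np.argmax(Q[state - 1]))

def q_update(Q, state, action, reward, next_state, terminal):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    bootstrap = 0.0 if terminal else np.max(Q[next_state - 1])
    Q[state - 1, action] += ALPHA * (reward + GAMMA * bootstrap
                                     - Q[state - 1, action])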
episode 1, step by step. each move changes only the entry Q(s, a) of the pair just taken; every entry involved is still 0, and so is the bootstrapped max, so each first update is just Q(s, a) ← 0 + 0.1·(r + 0.5·0 - 0) = 0.1·r:

  move 1: ordinary move, r = -1   →  that entry becomes -0.1
  move 2: bumps out of bounds, r = -5   →  -0.5
  move 3: ordinary move, r = -1   →  -0.1
  move 4: steps into cell 5, r = -10   →  -1
  move 5: ordinary move, r = -1   →  -0.1
  move 6: reaches home, r = +10   →  +1

episode 1 ends.
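a quick check of episode 1's numbers (rewards read off the slides; the exact cells visited aren't spelled out in the text, but every entry is 0 when first updated, so each update collapses to 0.1·r):

alpha, gamma = 0.1, 0.5

def td_update(q_sa, r, max_q_next):
    # one Q-learning update for a single table entry
    return q_sa + alpha * (r + gamma * max_q_next - q_sa)

print([td_update(0.0, r, 0.0) for r in [-1, -5, -1, -10, -1, 10]])
# [-0.1, -0.5, -0.1, -1.0, -0.1, 1.0]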
let's work out the next episode, starting at state 4: go WEST and then SOUTH. how does the table change?

going WEST from 4 bumps out of bounds (r = -5) and going SOUTH from 4 reaches home (r = +10), so two entries change: Q(4, W) = -0.5 and Q(4, S) = 1 (worked out below); the rest of the table is untouched.
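worked out with the update rule (the bootstrapped max is still 0 in both cases, and I'm assuming the out-of-bounds move leaves the agent in state 4):

$$Q(4, \mathrm{W}) \leftarrow 0 + 0.1\bigl(-5 + 0.5 \max_{a'} Q(4, a') - 0\bigr) = 0.1 \cdot (-5) = -0.5$$

$$Q(4, \mathrm{S}) \leftarrow 0 + 0.1\bigl(+10 + 0.5 \cdot 0 - 0\bigr) = 1$$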
and the next episode, starting at state 3: go WEST -> SOUTH -> WEST -> SOUTH
four entries change this time: Q(3, W) = -0.1, Q(2, S) = -1, Q(5, W) = -0.05, and Q(4, S) grows from 1 to 1.9 (worked out below). over time, values will converge to optimal!
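worked out step by step; note how Q(4, S) = 1, learned in the previous episode, already feeds back into the update of Q(5, W):

$$Q(3, \mathrm{W}) \leftarrow 0 + 0.1\bigl(-1 + 0.5 \max_{a'} Q(2, a') - 0\bigr) = 0.1(-1 + 0) = -0.1$$

$$Q(2, \mathrm{S}) \leftarrow 0 + 0.1\bigl(-10 + 0.5 \max_{a'} Q(5, a') - 0\bigr) = 0.1(-10 + 0) = -1$$

$$Q(5, \mathrm{W}) \leftarrow 0 + 0.1\bigl(-1 + 0.5 \max_{a'} Q(4, a') - 0\bigr) = 0.1(-1 + 0.5 \cdot 1) = -0.05$$

$$Q(4, \mathrm{S}) \leftarrow 1 + 0.1\bigl(+10 + 0.5 \cdot 0 - 1\bigr) = 1 + 0.9 = 1.9$$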
what we just saw was some episodes of Q-learning: values update towards the value of the optimal policy, because the target comes from the value of the assumed next best action. this is off-policy learning.
and SARSA-learning? values update towards the value of the current policy, because the target comes from the value of the actual next action. this is on-policy learning.
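as a sketch, the only difference between the two is the target each one bootstraps from (function names are mine; Q indexed as before):

import numpy as np

def q_learning_target(Q, reward, next_state, gamma=0.5):
    # off-policy: bootstrap from the assumed best next action
    return reward + gamma * np.max(Q[next_state - 1])

def sarsa_target(Q, reward, next_state, next_action, gamma=0.5):
    # on-policy: bootstrap from the action the current policy actually takes next
    return reward + gamma * Q[next_state - 1, next_action]

# either way, the table entry then moves towards the target:
#   Q[s - 1, a] += alpha * (target - Q[s - 1, a])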
Q vs SARSA (ε = 0.1, γ = 1.0): Q-learning learns from data not generated by the target policy, SARSA from data generated by the target policy. Example credit Travis DeWolf: https://studywolf.wordpress.com/ and https://git.io/vFBvv. Image by Andreas Tille (own work) [GFDL (www.gnu.org/copyleft/fdl.html) or CC-BY-SA-3.0-2.5-2.0-1.0 (www.creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons.
Problem Decomposition: nested sub-problems; the solution to a sub-problem informs the solution to the whole problem.
Bellman Expectation Backup: a system of linear equations whose solution is the value of a given policy. In the backup diagrams, the value of a node = Σ over paths of P(path) × Value(path):

$$v_\pi(s) = \sum_a \pi(a \mid s)\Bigl(r_s^a + \gamma \sum_{s'} P^a_{ss'}\, v_\pi(s')\Bigr)$$

$$q_\pi(s, a) = r_s^a + \gamma \sum_{s'} P^a_{ss'} \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')$$

the Bellman expectation equations under a given policy.
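used as an iterative update this gives policy evaluation; a sketch assuming P[s][a] is a list of (prob, next_state, reward) triples and pi[s][a] the policy's action probabilities (these data structures are mine, not the lecture's):

def policy_evaluation(P, pi, states, actions, gamma=0.5, sweeps=100):
    """Sweep the Bellman expectation backup over all states to evaluate a fixed policy."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: sum(pi[s][a] * sum(p * (r + gamma * V[s2])
                                   for p, s2, r in P[s][a])
                    for a in actions)
             for s in states}
    return V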
Bellman Optimality Backup: a system of non-linear equations whose solution is the value of the optimal policy. Again the value of a node = Σ over paths of P(path) × Value(path), now maximising over actions:

$$v_*(s) = \max_a \Bigl(r_s^a + \gamma \sum_{s'} P^a_{ss'}\, v_*(s')\Bigr)$$

$$q_*(s, a) = r_s^a + \gamma \sum_{s'} P^a_{ss'} \max_{a'} q_*(s', a')$$

the Bellman optimality equations under the optimal policy.
Value Based
Dynamic Programming: …using the Bellman equations as iterative updates. [figure: a 2×2 gridworld (states 1-4 plus a home cell) with the reward of each move (-1, -5, -10, +10) marked on the arrows; what's best to do?]
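a sketch of that idea with the optimality backup, under the same assumed P[s][a] structure as above; iterating the backup is value iteration, and the greedy read-out answers "what's best to do?":

def value_iteration(P, states, actions, gamma=0.5, sweeps=100):
    """Iterate the Bellman optimality backup, then read off a greedy policy."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                    for a in actions)
             for s in states}
    greedy = {s: max(actions,
                     key=lambda a, s=s: sum(p * (r + gamma * V[s2])
                                            for p, s2, r in P[s][a]))
              for s in states}
    return V, greedy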