Reinforcement Learning
- M. Soleymani
Sharif University of Technology, Spring 2020. Most slides are based on Bhiksha Raj, 11-785, CMU 2019; some slides are from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2018.
Supervised learning:
– x is data, y is label
– Goal: learn a function mapping x to y
– Examples: classification, regression, object detection, semantic segmentation, image captioning, etc.
Unsupervised learning:
– Just data, no labels!
– Goal: learn some underlying structure of the data
– Examples: clustering, dimensionality reduction, feature learning, density estimation, etc.
Reinforcement learning:
– An agent interacting with an environment, which provides numeric reward signals
– Concerned with taking sequences of actions in order to maximize (cumulative) reward
– Can be described in terms of an agent interacting with a previously unknown environment, trying to maximize cumulative reward
Example (robot control):
– Observations: camera images, joint angles
– Actions: joint torques
– Rewards: stay balanced, navigate to target locations, serve and protect humans
Example (driving), possible actions at each instant:
– Change lane left?
– Change lane right?
– Accelerate?
– Decelerate?
– Whether your actions were good or bad: you will not know for a while
– Your actions beget rewards
– But not deterministically
– You don't have full access to the function you're trying to optimize
– You are interacting with a stateful world: the inputs $x_t$ depend on your previous actions
– There is an exponentially large number of them (possible positions/states)
– In the beginning we don't know if they are good or bad
– Each time a board position leads to victory, give it a little green color
– Each time it leads to a loss, give it a little red
– Alternates with states arrived at by moves by the opponent
– Too many possibilities; can't be certain of their "winningness"
– Fade the green with distance
– Alternates with states arrived at by moves by the opponent
– Too many possibilities; can't be certain of their "losingness"
– Fade the red with distance
Loss: final move is by opponent
– Some states will get greener
– Some will get redder
– Some, that can lead to both victory and loss, will become different shades of yellow
– When you win, your opponent loses, and vice versa
– Collections of games by amateurs and experts, of which you can find millions in the books
– A schizophrenic computer can play thousands of games with itself in the time that it plays with another person
After many practice games:
– Some positions will be green (more winning than losing), red (more losing than winning), or various shades of yellow/green/orange (can go either way)
– But many positions will never be encountered in the practice games
– In fact the vast majority of positions will be unvisited!
– We need a function that assigns a color to any board position
– Which will have some color between red and green
– How do you describe a board position numerically?
– What type of function maps a board position to a color between red and green?
Rewards may take many forms:
– Game was won/lost (binary)
– Time taken to arrive
– Amount of money made
Rewards may be delayed:
– Wait till the end of the game!
– Must optimize actions for maximum total reward
At each time $t$, the agent:
– Makes an observation $o_t$ of the environment
– Receives a reward $r_t$
– Performs an action $a_t$
In response, the environment:
– Receives the action $a_t$
– Emits a reward $r_{t+1}$
– Changes and produces a new observation $o_{t+1}$
The history at time $t$ is the sequence of observations, rewards, and actions so far:
$h_t = o_0, r_0, a_0, o_1, r_1, a_1, \dots, o_t, r_t$
– Based on the history, the agent must decide which actions should be chosen
– The goal: the strategy that maximizes the total reward $r_0 + r_1 + \cdots + r_T$
The state: the information used to decide what happens next.
– E.g., in an automobile: [position, velocity, acceleration]
– In traffic: the position, velocity, and acceleration of every vehicle on the road
– In chess: the state of the board + whose turn it is next
The environment state: the environment's own internal representation
– This is what will finally decide the rewards
– But the agent may not be able to observe all of it
The agent state: the agent's own representation of the environment state
– It may not match the true one at all
– Formally, the agent state $S_t = g(h_t)$ is some function of the history
– The closer the agent's model is to the true environment state, the better the agent will be able to strategize
Image lifted from David Silver
– Knowing the true environment state would greatly improve prediction
– The environment: a mathematical characterization with a true value for its parameters, representing the actual environment
– The agent: formulates its own model of the environment, which should ideally match the true values as closely as possible
The Markov assumption: the future depends only on the present state
– Generally valid for a properly defined true environment state:
$P(S_{t+1} \mid S_0, S_1, \dots, S_t) = P(S_{t+1} \mid S_t)$
– This does not mean the agent can model what it observes of the environment as Markov!
– Amazing, but trivial result: e.g., the observations generated by an HMM are not Markov
– The agent may only have a local model of the true state of the system
– The agent's observations need not be Markov, even though the environment's actual states do
Observability:
– The agent's observations inform it about the environment state
– The agent may observe the entire environment state
– Or only part of it
– Chess: environment state fully observable to the agent
– Poker: environment state only partially and indirectly observable
– We focus on the fully observable case in these lectures
A Markov process (Markov chain): a memoryless random process in which the future is determined only by the present
– Memoryless: transitions are governed by probabilities $P(s_i \mid s_j)$ that depend only on the current state
– Formally, the tuple $M = \langle \mathcal{S}, \mathcal{P} \rangle$
– $\mathcal{S}$ is the (possibly finite) set of states
– $\mathcal{P}$ is the complete set of transition probabilities $P(s \mid s')$
– Note $P(s \mid s')$ stands for $P(S_{t+1} = s \mid S_t = s')$ at any time $t$
– We will use the shorthand $P_{s,s'}$
A Markov Reward Process: a Markov process whose states also give you rewards
– $\mathcal{S}$ is the (possibly finite) set of states
– $\mathcal{P}$ is the complete set of transition probabilities $P_{s,s'}$
– $\mathcal{R}$ is a reward function, consisting of the distributions $P(r \mid s)$ or $P(r \mid s, s')$
– $\gamma \in [0,1]$ is a discount factor
The return is the total discounted reward from time $t$:
$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$
– The discount expresses that we trust our own predictions of the future less and less
– $\gamma = 0$: the future is totally unpredictable; only trust what you see immediately ahead of you (myopic)
– $\gamma = 1$: the future is clear; consider all of it (far-sighted)
A Markov Decision Process (MDP): a Markov reward process in which the agent has the ability to decide its actions!
– We will represent the action at time $t$ as $a_t$
– The transitions made by the environment are functions of the action
– The rewards returned are functions of the action
Formally, an MDP is the tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ (a small code sketch follows this list):
– $\mathcal{S}$ is a (possibly finite) set of states: $\mathcal{S} = \{s\}$
– $\mathcal{A}$ is a (possibly finite) set of actions: $\mathcal{A} = \{a\}$
– $\mathcal{P}$ is the set of action-conditioned transition probabilities $P^a_{s,s'} = P(S_{t+1} = s' \mid S_t = s, a_t = a)$
– $\mathcal{R}^a_{ss'}$ is an action-conditioned reward function $\mathbb{E}[r \mid S_t = s, a_t = a, S_{t+1} = s']$
– $\gamma \in [0,1]$ is a discount factor
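To make the tuple concrete, here is a minimal sketch (not from the slides) of a tabular MDP stored as NumPy arrays. The two-state, two-action numbers are entirely hypothetical; only the layout of P, R, and gamma matters.

```python
import numpy as np

# Hypothetical 2-state / 2-action MDP stored as tables.
# Convention: P[a, s, s'] = P(S_{t+1}=s' | S_t=s, a_t=a),
#             R[a, s, s'] = expected reward for that transition.
n_states, n_actions = 2, 2

P = np.zeros((n_actions, n_states, n_states))
P[0] = [[0.9, 0.1],   # action 0: mostly stay where you are
        [0.1, 0.9]]
P[1] = [[0.2, 0.8],   # action 1: mostly switch states
        [0.8, 0.2]]

R = np.zeros((n_actions, n_states, n_states))
R[1, 0, 1] = 1.0      # reward for reaching state 1 from state 0 under action 1

gamma = 0.9           # discount factor

# Sanity check: every row of every transition matrix is a distribution.
assert np.allclose(P.sum(axis=2), 1.0)
```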
A policy is the probability distribution over actions that the agent may take at any state:
$\pi(a \mid s) = P(a_t = a \mid s_t = s)$
– E.g., what are the preferred actions of the spider at any state?
– A deterministic policy picks one action per state: $\pi(s) = a_s$, where $a_s$ is the preferred action in state $s$
At each time step $t$:
– The agent selects action $a_t$
– The environment samples the next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$
– The environment samples a reward $r_t \sim R(\cdot \mid s_t, a_t, s_{t+1})$
– The agent receives reward $r_t$ and next state $s_{t+1}$
The objective is to maximize the expected discounted return (a rollout sketch follows):
$G = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots = \sum_{k=0}^{\infty} \gamma^k r_k$
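As a hedged illustration of this loop, the sketch below simulates one episode of the hypothetical toy MDP from the earlier snippet (redefined here so it runs on its own) and accumulates the discounted return; the horizon T and the uniform-random policy are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP tables (hypothetical): P[a, s, s'], R[a, s, s'], discount gamma.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.2, 0.8], [0.8, 0.2]]])
R = np.zeros((2, 2, 2)); R[1, 0, 1] = 1.0
gamma, T = 0.9, 50

def rollout(policy, s0=0):
    """Simulate one episode of length T and return its discounted return G."""
    s, G, discount = s0, 0.0, 1.0
    for t in range(T):
        a = policy(s)                                 # agent selects action a_t
        s_next = rng.choice(len(P[a, s]), p=P[a, s])  # s_{t+1} ~ P(. | s_t, a_t)
        r = R[a, s, s_next]                           # reward r_t
        G += discount * r                             # accumulate gamma^t * r_t
        discount *= gamma
        s = s_next
    return G

uniform_policy = lambda s: rng.integers(2)            # pick actions at random
print(rollout(uniform_policy))
```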
– Problem: the tree of possible moves is exponentially large
– What do we mean by "generalize"?
– If a particular board position always leads to loss, avoid any moves that move you into that position
The value of a state under policy $\pi$ is the expected return when the process begins in that state:
$V^\pi(s) = \mathbb{E}[G_0 \mid S_0 = s, \pi]$
Since the future depends only on the present and not the past:
$V^\pi(s) = \mathbb{E}[G_t \mid S_t = s, \pi]$
so the time index can be dropped:
$V^\pi(s) = \mathbb{E}[G \mid S = s, \pi]$
The state value function and the action value (Q) function:
$V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, \pi\right]$
$Q^\pi(s,a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$
– Each is simply the expected sum of discounted rewards upon starting in state $s$ (and, for $Q$, first taking action $a$) and thereafter taking actions according to $\pi$
They satisfy the recursions:
$V^\pi(s) = \mathbb{E}[r + \gamma V^\pi(s') \mid s, \pi]$
$Q^\pi(s,a) = \mathbb{E}[r + \gamma Q^\pi(s', \pi(s')) \mid s, a, \pi]$
Bellman Equations
$V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P^a_{s,s'} \left( \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \right)$
For deterministic policies:
$V^\pi(s) = \sum_{s'} P^{\pi(s)}_{s,s'} \left( \mathcal{R}^{\pi(s)}_{ss'} + \gamma V^\pi(s') \right)$
This is the Bellman Expectation Equation for the state value function of an MDP. Note: although the reward did not depend on the action in the fly example, more generally it will.
The prediction problem:
– Given the complete MDP (all transition probabilities $P^a_{s,s'}$, expected rewards $R^a_{s,s'}$, and discount $\gamma$)
– and a policy $\pi$
– find all value terms $V^\pi(s)$ and/or $Q^\pi(s,a)$
For a fixed policy, the Bellman expectation equations are a set of simultaneous linear equations that can be solved for the value functions (see the sketch below)
– Although this will be computationally intractable for very large state spaces
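Because the system is linear, a small MDP can be solved for V in closed form. A minimal sketch, assuming the hypothetical toy tables from earlier and a deterministic policy given as an array pi[s]:

```python
import numpy as np

# Toy MDP (hypothetical) and a fixed deterministic policy pi[s] -> action.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.2, 0.8], [0.8, 0.2]]])
R = np.zeros((2, 2, 2)); R[1, 0, 1] = 1.0
gamma = 0.9
pi = np.array([1, 0])                              # action chosen in each state

n = P.shape[1]
P_pi = P[pi, np.arange(n)]                         # P_pi[s, s'] = P^{pi(s)}_{s,s'}
r_pi = (P_pi * R[pi, np.arange(n)]).sum(axis=1)    # expected one-step reward per state

# Bellman expectation equation: V = r_pi + gamma * P_pi V  =>  (I - gamma P_pi) V = r_pi
V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
print(V)
```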
Iterative solution (policy evaluation): initialize $V^\pi_{(0)}(s)$, e.g., to zero, and repeatedly apply the Bellman backup for every state (sketched below):
$V^\pi_{(k+1)}(s) = \sum_{s'} P^{\pi(s)}_{s,s'} \left( R^{\pi(s)}_{s,s'} + \gamma V^\pi_{(k)}(s') \right)$
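The same value function can be found by sweeping this backup until it stops changing. A sketch under the same toy-MDP assumptions (P, R, gamma, pi as in the previous snippet):

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.2, 0.8], [0.8, 0.2]]])
R = np.zeros((2, 2, 2)); R[1, 0, 1] = 1.0
gamma = 0.9
pi = np.array([1, 0])

def evaluate_policy(pi, tol=1e-8):
    """Iterative policy evaluation: repeat the Bellman expectation backup."""
    n = P.shape[1]
    V = np.zeros(n)                                # V^(0) initialised to zero
    while True:
        # V^(k+1)(s) = sum_s' P^{pi(s)}_{s,s'} (R^{pi(s)}_{s,s'} + gamma V^(k)(s'))
        V_new = np.array([
            np.sum(P[pi[s], s] * (R[pi[s], s] + gamma * V)) for s in range(n)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

print(evaluate_policy(pi))
```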
– Prediction: given any policy $\pi$, find the value function $V^\pi(s)$
– Control: find the optimal policy
Both are posed in terms of the expected discounted reward at every state:
$\mathbb{E}[G_t \mid S_t = s] = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid S_t = s\right]$
– Recall: why do we consider the discounted return, rather than the actual return $\sum_{k=0}^{\infty} r_{t+k+1}$?
A policy $\pi$ is better than or equal to a policy $\pi'$ if its value function is greater than or equal to the value function under $\pi'$ at all states:
$\pi \geq \pi' \Rightarrow V^\pi(s) \geq V^{\pi'}(s) \;\; \forall s$
– i.e., it is at least as good to follow $\pi$ no matter what the current state
There is an optimal policy $\pi^*$ that is better than or equal to all other policies:
$\pi^* \geq \pi \;\; \forall \pi$
– There may be more than one optimal policy, but they all achieve the same value functions:
$V^{\pi^*}(s) = V^*(s) \;\; \forall s$
$Q^{\pi^*}(s,a) = Q^*(s,a) \;\; \forall s, a$
The optimal policy is greedy with respect to the optimal action value function:
$\pi^*(s) = \arg\max_{a \in \mathcal{A}(s)} Q^*(s,a)$
– For any other policy $\pi$: $Q^\pi(s,a) \leq Q^*(s,a)$
– So if we can find $Q^*(s,a)$, we can find the optimal policy
Figures from Sutton.
$V^*(s) = \max_a Q^*(s,a)$
$Q^*(s,a) = \sum_{s'} P^a_{s,s'} \left( R^a_{ss'} + \gamma V^*(s') \right)$
Combining the two:
$V^*(s) = \max_a Q^*(s,a) = \max_a \sum_{s'} P^a_{s,s'} \left( R^a_{ss'} + \gamma V^*(s') \right)$
Similarly, substituting $V^*(s') = \max_{a'} Q^*(s',a')$ into
$Q^*(s,a) = \sum_{s'} P^a_{s,s'} \left( R^a_{ss'} + \gamma V^*(s') \right)$
gives
$Q^*(s,a) = \sum_{s'} P^a_{s,s'} \left( R^a_{ss'} + \gamma \max_{a'} Q^*(s',a') \right)$
Given the MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$, the Bellman optimality equations are:
$V^*(s) = \max_a Q^*(s,a)$
$Q^*(s,a) = \sum_{s'} P^a_{s,s'} \left( R^a_{ss'} + \gamma V^*(s') \right)$
and the optimal policy is
$\pi^*(s) = \arg\max_{a \in \mathcal{A}(s)} Q^*(s,a)$
Equivalently, in expectation form:
$Q^*(s,a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^*(s',a') \mid s, a \right]$
– Value-based solutions solve for $V^*(s)$ and $Q^*(s,a)$ and derive the optimal policy from them
– Policy-based solutions directly estimate $\pi^*(s)$
When the MDP is known (dynamic programming):
– Value iteration
– Policy iteration
When the MDP is unknown (model-free):
– Q-learning
– SARSA, ...
Value iteration:
– In each iteration, update the value function for every state (a sketch follows this list):
$V^{(k)}(s) = \max_a \sum_{s'} P^a_{s,s'} \left( R^a_{s,s'} + \gamma V^{(k-1)}(s') \right)$
– But intermediate value function estimates may not represent any policy
– For N states, this estimates N terms
– Iterating over action values instead, with M actions, must estimate MN terms
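A minimal value-iteration sketch on the same hypothetical toy MDP: the loop sweeps all states until the value function stops changing, then reads off a greedy policy.

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.2, 0.8], [0.8, 0.2]]])
R = np.zeros((2, 2, 2)); R[1, 0, 1] = 1.0
gamma = 0.9

def value_iteration(tol=1e-8):
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[a, s] = sum_s' P^a_{s,s'} (R^a_{s,s'} + gamma V(s'))
        Q = np.einsum('ast,ast->as', P, R + gamma * V)
        V_new = Q.max(axis=0)                 # V(s) = max_a Q(s, a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)    # optimal values and a greedy policy
        V = V_new

V_star, pi_star = value_iteration()
print(V_star, pi_star)
```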
Value iteration can equivalently be carried out over action values:
$Q^{(k+1)}(s,a) = \sum_{s'} P^a_{s,s'} \left( R^a_{s,s'} + \gamma \max_{a'} Q^{(k)}(s',a') \right)$
Policy iteration alternates two steps (sketched below):
– Use iterative policy evaluation (prediction DP) to find the value function $V^{\pi^{(k)}}(s)$ of the current policy
– Find the greedy policy with respect to it: $\pi^{(k+1)}(s) = \mathrm{greedy}\left(V^{\pi^{(k)}}\right)(s)$
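A policy-iteration sketch under the same toy-MDP assumptions; the evaluation step reuses the iterative backup from the earlier snippet, and the improvement step is the greedy argmax.

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.2, 0.8], [0.8, 0.2]]])
R = np.zeros((2, 2, 2)); R[1, 0, 1] = 1.0
gamma = 0.9
n_actions, n_states, _ = P.shape

def evaluate(pi, tol=1e-8):
    """Iterative policy evaluation for a deterministic policy pi[s] -> action."""
    V = np.zeros(n_states)
    while True:
        V_new = np.array([np.sum(P[pi[s], s] * (R[pi[s], s] + gamma * V))
                          for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_iteration():
    pi = np.zeros(n_states, dtype=int)           # start from an arbitrary policy
    while True:
        V = evaluate(pi)                         # prediction step
        # Greedy improvement: pi'(s) = argmax_a sum_s' P^a (R^a + gamma V(s'))
        Q = np.einsum('ast,ast->as', P, R + gamma * V)
        pi_new = Q.argmax(axis=0)
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new

print(policy_iteration())
```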
So far we assumed someone gave us the MDP
– In general the MDP is unknown
– Can we estimate the value function of a policy without knowing the underlying MDP?
– Model-free prediction
– Can we find the optimal policy without knowing the MDP?
– Model-free control
In the model-free setting we have no knowledge of the system dynamics
– The key knowledge required to "solve" for the best policy
– A reasonable assumption in many discrete-state scenarios
– Can be generalized to other scenarios with infinite or unknowable state
Monte Carlo prediction: learn value functions directly from episodes of experience, by visiting different states
– Take actions according to policy $\pi$
– Note the states visited and the rewards obtained as a result
– Record the entire sequence: $s_1, a_1, r_2, s_2, a_2, r_3, \dots, s_T$
– Assumption: each "episode" ends at some time
$V^\pi(s) = \mathbb{E}[G_t \mid S_t = s] = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{T-t-1} r_T \mid S_t = s \right]$
Monte Carlo: replace the expectation with the empirical average of $G_t$ over visits with $S_t = s$
– episode 1: $s_1^{(1)}, a_1^{(1)}, r_2^{(1)}, s_2^{(1)}, a_2^{(1)}, r_3^{(1)}, \dots, s_{T_1}^{(1)}$
– episode 2: $s_1^{(2)}, a_1^{(2)}, r_2^{(2)}, s_2^{(2)}, a_2^{(2)}, r_3^{(2)}, \dots, s_{T_2}^{(2)}$
– ...
– Different episodes may have different lengths
Compute the return from every visited time step of every episode:
$G_i^{(1)} = r_{i+1}^{(1)} + \gamma r_{i+2}^{(1)} + \cdots + \gamma^{T_1 - i - 1} r_{T_1}^{(1)}$
$G_i^{(2)} = r_{i+1}^{(2)} + \gamma r_{i+2}^{(2)} + \cdots + \gamma^{T_2 - i - 1} r_{T_2}^{(2)}$
– ...
– Initialize: count $N(s) = 0$ and total return $V^\pi(s) = 0$ for every state
– For every episode: for every state $s$ visited in it, increment $N(s)$ and add the return $G$ observed from that visit to $V^\pi(s)$
– Finally: $V^\pi(s) = V^\pi(s) / N(s)$
– If every state has been visited a sufficiently large number of times, we will obtain good estimates of the value functions of all states
Properties of the Monte Carlo estimate (a code sketch follows):
– Will eventually get to the right answer
– Unbiased estimate
– Cannot update anything until the end of an episode
– High variance! Each return adds many random values
– Slow to converge
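A first-visit Monte Carlo sketch of this procedure, again on the hypothetical toy MDP; the episode generator, episode count, and horizon are arbitrary choices for the example.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

# Toy MDP (hypothetical) and a fixed policy to evaluate.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.2, 0.8], [0.8, 0.2]]])
R = np.zeros((2, 2, 2)); R[1, 0, 1] = 1.0
gamma, T = 0.9, 30
pi = np.array([1, 0])

def episode():
    """Return the visited states and the reward that followed each visit."""
    s, states, rewards = 0, [], []
    for _ in range(T):
        a = pi[s]
        s_next = rng.choice(2, p=P[a, s])
        states.append(s)
        rewards.append(R[a, s, s_next])
        s = s_next
    return states, rewards

N = defaultdict(int)        # visit counts N(s)
V = defaultdict(float)      # running value estimates V(s)

for _ in range(2000):
    states, rewards = episode()
    G, returns = 0.0, [0.0] * len(states)
    for t in reversed(range(len(states))):       # G_t = r_{t+1} + gamma * G_{t+1}
        G = rewards[t] + gamma * G
        returns[t] = G
    seen = set()
    for t, s in enumerate(states):               # first-visit: count each state once
        if s not in seen:
            seen.add(s)
            N[s] += 1
            V[s] += (returns[t] - V[s]) / N[s]   # incremental mean update

print(dict(V))
```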
A running mean of a sequence $x_1, x_2, \dots$ can be computed as
$\bar{x}_k = \frac{1}{k} \sum_{i=1}^{k} x_i$
or, incrementally,
$\bar{x}_k = \frac{(k-1)\,\bar{x}_{k-1} + x_k}{k}$
$\bar{x}_k = \bar{x}_{k-1} + \frac{1}{k}\left( x_k - \bar{x}_{k-1} \right)$
The same running mean
$\bar{x}_k = \bar{x}_{k-1} + \frac{1}{k}\left( x_k - \bar{x}_{k-1} \right)$
can be generalized by replacing $\frac{1}{k}$ with a step size $\alpha$:
$\bar{x}_k = \bar{x}_{k-1} + \alpha\left( x_k - \bar{x}_{k-1} \right)$
– The step sizes must not shrink too fast; for convergence we need
$\sum_k \alpha_k^2 < \infty, \quad \sum_k \alpha_k = \infty, \quad \alpha_k \geq 0$
(A small comparison of the two updates is sketched below.)
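A tiny sketch contrasting the exact running mean (step size 1/k) with a constant step size on an arbitrary, hypothetical data stream:

```python
import numpy as np

rng = np.random.default_rng(0)
stream = rng.normal(loc=5.0, scale=2.0, size=10_000)   # hypothetical data stream

mean_exact, mean_const, alpha = 0.0, 0.0, 0.05
for k, x in enumerate(stream, start=1):
    mean_exact += (x - mean_exact) / k        # exact: x_k-bar = x-bar + (1/k)(x - x-bar)
    mean_const += alpha * (x - mean_const)    # constant step: biased early, tracks later

print(mean_exact, mean_const)                 # both approach the true mean of 5.0
```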
With a constant step size (e.g., $\alpha = 0.1$, $0.05$, $0.03$) the running estimate is biased (early estimates can be expected to be wrong) but converges to the true value.
Applying this to Monte Carlo estimation: instead of averaging at the end,
$V^\pi(s) = \frac{1}{N(s)} \sum_{i=1}^{N(s)} G^{(i)}$
update incrementally after each visited state $S_t$ with observed return $G_t$:
$N(S_t) \leftarrow N(S_t) + 1$
$V^\pi(S_t) \leftarrow V^\pi(S_t) + \frac{1}{N(S_t)} \left( G_t - V^\pi(S_t) \right)$
or, with a constant step size,
$V^\pi(S_t) \leftarrow V^\pi(S_t) + \alpha \left( G_t - V^\pi(S_t) \right)$
Problem: the update
$V^\pi(S_t) \leftarrow V^\pi(S_t) + \alpha \left( G_t - V^\pi(S_t) \right)$
still needs the full return $G_t$, which is only known at the end of the episode. But note that
$G_t = r_{t+1} + \gamma G_{t+1}$
so if we approximate $G_{t+1}$ by its current estimate $V^\pi(S_{t+1})$, we get
$G_t \approx r_{t+1} + \gamma V^\pi(S_{t+1})$
Substituting this approximation into the incremental update gives the temporal-difference (TD) update:
$V^\pi(S_t) \leftarrow V^\pi(S_t) + \alpha \left( r_{t+1} + \gamma V^\pi(S_{t+1}) - V^\pi(S_t) \right)$
– The term in parentheses is the error between an (estimated) observation of $G_t$ and the current estimate $V^\pi(S_t)$
Problem: estimate the value function from a given set of episodes
– Using MC
– Using TD, where you are allowed to repeatedly go over the data
TD(0) prediction (a code sketch follows):
– For every time $t = 1 \dots T$ in an episode:
$V^\pi(S_t) \leftarrow V^\pi(S_t) + \alpha \left( r_{t+1} + \gamma V^\pi(S_{t+1}) - V^\pi(S_t) \right)$
– Each update uses only the reward and the state the process arrives at at the next time
– So each update touches only the current state (and its successor)
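A TD(0) sketch on the same hypothetical toy MDP; note that each visit updates V(s_t) immediately from r_{t+1} and V(s_{t+1}), with no need to wait for the episode to end. The step size and episode counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.2, 0.8], [0.8, 0.2]]])
R = np.zeros((2, 2, 2)); R[1, 0, 1] = 1.0
gamma, alpha, T = 0.9, 0.05, 30
pi = np.array([1, 0])                      # fixed policy being evaluated

V = np.zeros(2)
for _ in range(2000):                      # many episodes
    s = 0
    for _ in range(T):
        a = pi[s]
        s_next = rng.choice(2, p=P[a, s])
        r = R[a, s, s_next]
        # TD(0): V(s) <- V(s) + alpha * (r + gamma V(s') - V(s))
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next

print(V)
```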
TD converges to the true value function:
– Although initial values will be biased, as seen before
– It actually has lower variance than MC!
– Particularly when TD is allowed to loop over all learning episodes
Like MC, TD requires no knowledge of the dynamics
– TD is quicker to update, and in many situations it is the better solution
We can now evaluate the value function of a given policy on an MDP whose transition probabilities are unknown
– But to improve the policy from $V^\pi$ we will need extra information, namely transition probabilities
– Which we do not have
– The optimal policy in any state: choose the action that has the largest optimal action value
$\pi^*(s) = \arg\max_{a \in \mathcal{A}} \sum_{s'} P^a_{ss'} \left( \mathcal{R}^a_{ss'} + \gamma V(s') \right)$
– This needs knowledge of transition probabilities
– Working with the action value function instead needs no such knowledge:
$\pi^*(s) = \arg\max_{a \in \mathcal{A}} Q(s,a)$
So instead we estimate $Q(s,a)$ directly from episodes
$s_1, a_1, r_2, s_2, a_2, r_3, s_3, a_3, r_4, \dots, s_T$
– The optimal policy can then be found from it
We would like to learn $Q$ while following a policy
– So that we can continuously improve our policy from ongoing experience
If we always act greedily with respect to the current estimates:
– We only learn to evaluate our current policy
– We will never learn about alternate policies that may turn out to be better
Solution: act greedily most of the time
– But choose a random action a fraction $\epsilon$ of the time
– The "epsilon-greedy" policy (a selection sketch follows)
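A minimal epsilon-greedy selector over a tabular Q, matching the policy described above; the tie-breaking, table contents, and value of epsilon are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, epsilon=0.1):
    """Pick the greedy action w.r.t. Q[s], but a random action with probability epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))    # explore
    return int(np.argmax(Q[s]))                 # exploit

# Example: 2 states x 2 actions table of action values.
Q = np.array([[0.0, 1.0],
              [0.5, 0.2]])
print([epsilon_greedy(Q, 0) for _ in range(10)])
```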
Apply the TD updates while traversing the episode $s_1, a_1, r_2, s_2, a_2, r_3, s_3, a_3, r_4, \dots, s_T$
– Do not actually wait until the end of the episode
– Update for $s_1$, then update for $s_2$, then update for $s_3$, and so on
Q-learning:
– Behave according to an exploratory policy, e.g., ε-greedy, generating $s_1, a_1, r_2, s_2, a_2, r_3, s_3, a_3, r_4, \dots, s_T$
– At each step, from state $s$ take action $a$:
– Accept reward $r$
– Transition to $s'$
– Find the best action $a'$ for $s'$
– Use it to update $Q(s,a)$: $Q(s,a) \leftarrow Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right)$
– But then actually perform an epsilon-greedy action $a''$ from $s'$
– The hypothetical action used in the update is guaranteed to be at least as good as the one you actually take
– But you still explore (non-greedy)
(A tabular sketch follows.)
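Putting the pieces together, a tabular Q-learning sketch on the same hypothetical toy MDP: the behavior is ε-greedy, while the update bootstraps from the greedy (hypothetical) action in the next state.

```python
import numpy as np

rng = np.random.default_rng(0)

P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.2, 0.8], [0.8, 0.2]]])
R = np.zeros((2, 2, 2)); R[1, 0, 1] = 1.0
gamma, alpha, epsilon, T = 0.9, 0.1, 0.1, 30

Q = np.zeros((2, 2))                           # Q[s, a]

for _ in range(3000):                          # episodes
    s = 0
    for _ in range(T):
        # Behave epsilon-greedily.
        a = int(rng.integers(2)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next = rng.choice(2, p=P[a, s])      # transition
        r = R[a, s, s_next]                    # accept reward
        # Update using the best (hypothetical) next action, not the one taken next.
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(Q, Q.argmax(axis=1))                     # learned values and the greedy policy
```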
The greedy policy:
$\pi(a \mid s) = \begin{cases} 1 & \text{for } a = \arg\max_{a'} Q(s,a') \\ 0 & \text{otherwise} \end{cases}$
The epsilon-greedy policy (with $N_a$ actions):
$\pi(a \mid s) = \begin{cases} 1 - \epsilon & \text{for } a = \arg\max_{a'} Q(s,a') \\ \frac{\epsilon}{N_a - 1} & \text{otherwise} \end{cases}$
Further topics:
– Value function approximation
– Continuous state spaces
– Deep Q-learning