SLIDE 1

Reinforcement Learning

  • M. Soleymani

Sharif University of Technology, Spring 2020. Most slides are based on Bhiksha Raj, 11-785, CMU 2019; some slides are from the lectures of Fei-Fei Li and colleagues, cs231n, Stanford 2018.

SLIDE 2

Overview

  • What is Reinforcement Learning?
  • Markov Decision Processes
  • Q-Learning
  • Policy Gradients

SLIDE 3

Supervised Learning

  • Data: (x, y)

– x is data – y is label

  • Goal: Learn a function to map x -> y
  • Examples: Classification, regression, object detection, semantic

segmentation, image captioning, etc.

SLIDE 4

Unsupervised Learning

  • Data: x

– Just data, no labels!

  • Goal: Learn some underlying hidden

structure of the data

  • Examples: Clustering, dimensionality

reduction, feature learning, density estimation, etc.

SLIDE 5

Reinforcement Learning

  • Goal: Learn how to take actions

in order to maximize reward

– Concerned with taking sequences of actions

  • Described in terms of an agent interacting with a previously unknown environment, trying to maximize cumulative reward
  • An agent interacting with an environment, which provides numeric reward signals

SLIDE 6

Reinforcement Learning

SLIDE 7

Reinforcement Learning

SLIDE 8

Reinforcement Learning

SLIDE 9

Reinforcement Learning

SLIDE 10

Robot Locomotion

SLIDE 11

Cart-Pole Problem

  • Objective: Balance a pole on top of a movable cart
  • State: angle, angular speed, horizontal velocity, …
  • Action: horizontal force applied on the cart
  • Reward: 1 at each time step if the pole is upright

SLIDE 12

Motor Control and Robotics

  • Robotics:

– Observations: camera images, joint angles – Actions: joint torques – Rewards: stay balanced, navigate to target locations, serve and protect humans

SLIDE 13

Atari Games

SLIDE 14

Go

SLIDE 15

An interesting class of problems

  • Do I

– Change lane left? – Change lane right? – Accelerate? – Decelerate?

SLIDE 16

An interesting class of problems

  • Is an investment plan good?

– You will not know for a while

SLIDE 17

Reward-based problems

  • And many others
  • Common theme: These are control problems where

– Your actions beget rewards

  • Win the game
  • Make money
  • Get home sooner

– But not deterministically

  • A world out there that is not predictable
  • From experience of belated rewards, you must learn to act rationally

SLIDE 18

How does RL relate to other machine learning problems?

  • It is a sequential decision-making problem
  • Differences between RL and supervised learning:

– You don't have full access to the function you're trying to optimize

  • must query it through interaction.

– Interacting with a stateful world: the inputs $x_t$ depend on your previous actions

SLIDE 19

Let's draw a diagram...

  • Circles are game states

– Exponentially large number of them – In the beginning we don’t know if they are good or bad

  • Each state can move into one of N states depending on the opponent’s move

– Figure does not show arrows

SLIDE 20
  • Play a very large number of games

– Each time a board position leads to victory, give it a little green color – Each time it leads to a loss give it a little red


Let's draw a diagram...

SLIDE 21
  • Sequence of game states we moved into until the winning state

– Alternates with states arrived at by moves by the opponent

  • All of these are “winning” states: color them green
  • But things from the distant past are less certain

– Too many possibilities; can't be certain of their “winningness” – Fade the green with distance


A game we won

SLIDE 22
  • Sequence of game states we moved into until the winning state

– Alternates with states arrived at by moves by the opponent

  • All of these are “winning” states: color them green
  • But things from the distant past are less certain

– Too many possibilities; can't be certain of their “winningness” – Fade the green with distance


A game we won

SLIDE 23
  • Sequence of game states we moved into until the losing state

– Alternates with states arrived at by moves by the opponent

  • All of these are “losing” states: color them red
  • But things from the distant past are less certain

– Too many possibilities; can’t be certain of their “losingness” – Fade the red with distance


Loss: final move is by opponent

A game we lost

SLIDE 24

Continue playing games

  • Play many many games
  • Some of which you will lose..

SLIDE 25

Continue playing games

  • Play many many games
  • Some of which you will lose..
  • And some you’ll win..

SLIDE 26

Continue playing games

  • When multiple games visit a state, simply average the colors derived from all visits

– Some states will get greener – Some will get redder – Some, that can lead to both victory and loss, will become different shades of yellow

  • More in the early stages of the game than during the endgame

SLIDE 27

Collecting more games…

  • You can also learn colors from your opponent’s moves

– When you win he/she loses and vice versa

  • You can learn from others’ games

– Collections of games by amateurs and experts, of which you can find millions in books

  • To really speed up matters, play against yourself

– A schizophrenic computer can play thousands of games with itself in the time it takes to play one with another person

SLIDE 28

Let's draw a diagram...

  • Eventually, we’ll get many board positions with different shades of green (more

winning than losing), red (more losing than winning) or various shades of yellow/green/orange (can go either way)

  • We will also get many “blank” positions that were never visited in all our

practice games

– In fact the vast majority of positions will be unvisited!

SLIDE 29

Let's draw a diagram...

  • We will also get many “blank” positions that were never visited in all our

practice games

– In fact the vast majority of positions will be unvisited!

SLIDE 30

Let's draw a diagram...

  • Generalization: From the coloured nodes, learn some way of colouring the blank nodes too

– Which will have some colour between red and green

  • Different nodes will have different colours
  • The magic: some function that assigns color to different board positions

– How do you describe a board position numerically? – What type of function maps a board position to a color between red and green?

SLIDE 31

Let's formalize the problem

  • There is no supervisor, only a reward signal
  • i.e. nobody telling the agent “you did well”
  • Reward is a scalar – a single number, which may be negative

– Game was won/lost (binary) – Time taken to arrive – Amount of money made

  • Reward may be delayed

– Wait till the end of the game!

  • The agent's actions affect its current and future rewards

– Must optimize actions for maximum reward

𝑏" 𝑡" 𝑠

"

SLIDE 32

Let's formalize the system

  • At each time $t$ the agent:

– Makes an observation $o_t$ of the environment – Receives a reward $r_t$ – Performs an action $a_t$

  • At each time $t$ the environment:

– Receives an action $a_t$ – Emits a reward $r_{t+1}$ – Changes and produces an observation $o_{t+1}$

  • Challenge: How must the agent behave to maximize its rewards?

𝑏" 𝑝" 𝑠

"

𝑏" 𝑠

"()

𝑝"() action

SLIDE 33

From the perspective of the Agent

  • What the agent perceives..
  • The following History:

$h_t = o_0, r_0, a_0, o_1, r_1, a_1, \ldots, o_t, r_t$

  • The total history at any time is the sequence of observations, rewards and actions
  • We need to model this sequence such that at any time $t$, the best $a_t \mid h_t$ can be chosen

– The strategy that maximizes total reward $r_0 + r_1 + \cdots + r_T$

SLIDE 34

Can define a “state”

  • Fully captures the “status” of the system

– E.g., in an automobile: [position, velocity, acceleration] – In traffic: the position, velocity, acceleration of every vehicle on the road – In Chess: the state of the board + whose turn it is next

SLIDE 35

How can we mathematically formalize the RL problem?

SLIDE 36

The state of the environment

  • The environment’s state!

– This is what will finally decide the rewards

  • May be a complex combination of many things
  • Generally assumed to be dynamic – keeps changing
  • The agent’s actions can affect the way in which it responds

– But agent may not be able to observe all of it

SLIDE 37

The Agent's Side of the Story

  • Agent has an internal representation of the environment state

– May not match the true one at all

  • May be defined in any manner

– Formally, the agent state $S_t = f(h_t)$ is some function of the history – The closer the agent's model is to the true environment state, the better the agent will be able to strategize

SLIDE 38

Defining Agent State

  • What is the outcome?

Image lifted from David Silver

SLIDE 39

Defining Agent State

  • Different definitions of state result in different predictions
  • True environment state not really known

– Would greatly improve prediction if known

SLIDE 40

To Maximize Reward

  • We can represent the environment as a process

– A mathematical characterization with a true value for its parameters representing the actual environment

  • The agent must model this environment process

– Formulate its own model for the environment, which must ideally match the true values as closely as possible

  • Based only on what it observes
  • Agent must formulate winning strategy based on model of environment

SLIDE 41

Markov property and observability

  • Environment state is Markov

– An assumption that is generally valid for a properly defined true environment state: $P(S_{t+1} \mid S_0, S_1, \ldots, S_t) = P(S_{t+1} \mid S_t)$

  • In theory, if the agent doesn’t observe the environment’s internals, he cannot

model what he observes of the environment as Markov!

– Amazing, but trivial result – E.g. the observations generated by an HMM are not Markov

  • In practice, the agent may assume anything

– The agent may only have a local model of the true state of the system

  • But can still assume that the states in its model behave in the same Markovian way that the

environment’s actual states do

SLIDE 42

Markov property and observability

  • Observability

– The agent's observations inform it about the environment state – The agent may observe the entire environment state

  • Now the agent's state is isomorphic to the environment state
  • Note – observing the state is not the same as knowing the state's true dynamics $P(S_{t+1} = s' \mid S_t = s)$
  • Markov Decision Process

– Or it may observe only part of it

  • E.g. only seeing some stock prices, or only the traffic immediately in front of you
  • Partially Observable Markov Decision Process

Chess: environment state fully observable to agent. Poker: environment state only partially and indirectly observable to agent.

We focus on the fully observable case in our lectures

SLIDE 43

A Markov Process

  • A Markov process is a random process where the future is only determined

by the present

– Memoryless

  • Is fully defined by the set of states $\mathcal{S}$, and the state transition probabilities $P(s' \mid s)$

– Formally, the tuple $M = \langle \mathcal{S}, \mathcal{P} \rangle$ – $\mathcal{S}$ is the (possibly finite) set of states – $\mathcal{P}$ is the complete set of transition probabilities $P(s \mid s')$ – Note $P(s \mid s')$ stands for $P(S_{t+1} = s \mid S_t = s')$ at any time $t$ – Will use the shorthand $P_{s,s'}$

SLIDE 44

Markov Reward Process

  • A Markov Reward Process (MRP) is a Markov Process where states

give you rewards

  • Formally, a Markov Reward Process is the tuple $M = \langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$

– $\mathcal{S}$ is the (possibly finite) set of states – $\mathcal{P}$ is the complete set of transition probabilities $P_{s,s'}$ – $\mathcal{R}$ is a reward function, consisting of the distributions $P(r \mid s)$ or $P(r \mid s, s')$

  • Or alternately, the expected value $E[r \mid s]$ or $E[r \mid s, s']$

– $\gamma \in [0,1]$ is a discount factor

SLIDE 45

The discounted return

𝐻" = 𝑠

"() + 𝛿𝑠 "(L + 𝛿L𝑠 "(M + ⋯ = N 𝛿O𝑠 "(O() P OQ,

  • The return is the total future reward all the way to the end
  • But each future step is slightly less “believable” and is hence discounted

– We trust our own observations of the future less and less

  • The future is a fuzzy place
  • The discount factor $\gamma$ is our belief in the predictability of the future

– $\gamma = 0$: The future is totally unpredictable, only trust what you see immediately ahead of you (myopic) – $\gamma = 1$: The future is clear; consider all of it (far-sighted)

  • Part of the Markov Reward Process model
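To make the discount concrete, here is a minimal sketch (not from the slides) of computing a discounted return; the reward list and $\gamma$ values are made-up examples.

```python
def discounted_return(rewards, gamma):
    """Compute G = r_1 + gamma*r_2 + gamma^2*r_3 + ... for a finite reward list."""
    G = 0.0
    # Iterate backwards so each step applies one more factor of gamma
    for r in reversed(rewards):
        G = r + gamma * G
    return G

rewards = [0.0, 0.0, 1.0]                 # hypothetical reward sequence
print(discounted_return(rewards, 0.0))    # 0.0: myopic, only the immediate reward counts
print(discounted_return(rewards, 0.9))    # 0.81 = 0.9^2 * 1.0: far-sighted
```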

SLIDE 46

The Markov Decision Process

  • Mathematical formulation of RL problems
  • A Markov Decision Process is a Markov Reward Process, where the

agent has the ability to decide its actions!

– We will represent the action at time $t$ as $a_t$

  • The agent’s actions affect the environment’s behavior

– The transitions made by the environment are functions of the action – The rewards returned are functions of the action

SLIDE 47

The Markov Decision Process

  • Formally, a Markov Decision Process is the tuple $M = \langle \mathcal{S}, \mathcal{P}, \mathcal{A}, \mathcal{R}, \gamma \rangle$

– $\mathcal{S}$ is a (possibly finite) set of states: $\mathcal{S} = \{s\}$ – $\mathcal{A}$ is a (possibly finite) set of actions: $\mathcal{A} = \{a\}$ – $\mathcal{P}$ is the set of action-conditioned transition probabilities $P^a_{s,s'} = P(S_{t+1} = s \mid S_t = s', a_t = a)$ – $\mathcal{R}^a_{ss'}$ is an action-conditioned reward function $E[r \mid S = s, A = a, S' = s']$ – $\gamma \in [0,1]$ is a discount factor
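As a concrete representation, the tuple can be stored as plain arrays. A sketch with made-up sizes (indexed here as action, current state, next state); later snippets in this section assume this same layout:

```python
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# P[a, s, s2] = P(s2 | s, a): a random but valid transition model (rows sum to 1)
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)

# R[a, s, s2] = expected reward for the transition (s, a) -> s2
R = rng.standard_normal((n_actions, n_states, n_states))

gamma = 0.9   # discount factor in [0, 1]
```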

SLIDE 48

Policy

  • The policy is the probability distribution over actions that the agent may take at any state: $\pi(a \mid s) = P(a_t = a \mid s_t = s)$

– What are the preferred actions of the spider at any state

  • The policy may be deterministic, i.e. $\pi(s) = a_s$, where $a_s$ is the preferred action in state $s$

SLIDE 49

Markov Decision Process

  • At time step $t = 0$, the environment samples the initial state $s_0 \sim p(s_0)$
  • Then, for $t = 0$ until done:

– Agent selects action $a_t$ – Environment samples next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$ – Environment samples reward $r_t \sim R(\cdot \mid s_t, a_t, s_{t+1})$ – Agent receives reward $r_t$ and next state $s_{t+1}$

  • A policy $\pi$ is a function from $\mathcal{S}$ to $\mathcal{A}$ that specifies what action to take in each state
  • Objective: find policy $\pi^*$ that maximizes cumulative discounted reward:

$G = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots = \sum_{k=0}^{\infty} \gamma^k r_k$
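This interaction protocol can be sketched as a generic loop. `env.reset`, `env.step`, and `policy` below are hypothetical Gym-style stand-ins, not anything defined in the slides:

```python
def run_episode(env, policy, gamma=0.99):
    """Roll out one episode following `policy` and return the discounted return.

    `env` is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done), mirroring the loop above.
    """
    s = env.reset()                      # environment samples s0 ~ p(s0)
    G, discount, done = 0.0, 1.0, False
    while not done:
        a = policy(s)                    # agent selects action a_t
        s, r, done = env.step(a)         # environment samples s_{t+1} and r_t
        G += discount * r                # accumulate gamma^t * r_t
        discount *= gamma
    return G
```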

SLIDE 50

Learning from experience

  • Learn by playing (or observing)

– Problem: The tree of possible moves is exponentially large

  • Learn to generalize

– What do we mean by “generalize”? – If a particular board position always leads to loss, avoid any moves that move you into that position

SLIDE 51

A simple MDP: Grid World

SLIDE 52

A simple MDP: Grid World

SLIDE 53

Introducing the “Value” function

  • The “Value” of a state is the expected total discounted return, when the process begins in that state: $V^\pi(s) = E[G_0 \mid S_0 = s, \pi]$
  • Or, since the process is Markov and the future only depends on the present and not the past: $V^\pi(s) = E[G_t \mid S_t = s, \pi]$
  • Or more generally: $V^\pi(s) = E[G \mid S = s, \pi]$

SLIDE 54

Definitions: Value function and Q-value function

SLIDE 55

Value function for policy $\pi$

$V^\pi(s) = E\left[\textstyle\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, \pi\right]$

$Q^\pi(s, a) = E\left[\textstyle\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$

  • $V^\pi(s)$: How good it is for the agent to be in the state $s$ when its policy is $\pi$

– It is simply the expected sum of discounted rewards upon starting in state $s$ and taking actions according to $\pi$

$V^\pi(s) = E[r + \gamma V^\pi(s') \mid s, \pi]$

$Q^\pi(s, a) = E[r + \gamma Q^\pi(s', \pi(s')) \mid s, a, \pi]$

Bellman Equations

SLIDE 56

The state value function of an MDP

  • The expected return from any state depends on the policy you follow
  • We will index the value of any state by the policy to indicate this

$V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s'} P^a_{s,s'} \left[ \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \right]$

For deterministic policies: $V^\pi(s) = \sum_{s'} P^{\pi(s)}_{s,s'} \left[ \mathcal{R}^{\pi(s)}_{ss'} + \gamma V^\pi(s') \right]$

Bellman Expectation Equation for State Value Functions of an MDP. Note: Although reward was not dependent on action for the fly example, more generally it will be

SLIDE 57

“Computing” the MDP

  • Finding the state and/or action value functions for the MDP:

– Given the complete MDP (all transition probabilities $P^a_{s,s'}$, expected rewards $R^a_{s,s'}$, and discount $\gamma$) – and a policy $\pi$ – find all value terms $V^\pi(s)$ and/or $Q^\pi(s, a)$

  • The Bellman expectation equations are simultaneous equations that

can be solved for the value functions

– Although this will be computationally intractable for very large state spaces
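For a small MDP that system can be solved directly, since the Bellman expectation equation is linear in $V^\pi$. A sketch with NumPy; the 3-state chain and the policy-conditioned `P` and `R` below are made-up stand-ins:

```python
import numpy as np

# Hypothetical 3-state chain under a fixed policy pi:
# P[s, s2] = P(s2 | s), R[s, s2] = expected reward on that transition
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([[1.0, 2.0, 0.0],
              [0.0, 0.0, 5.0],
              [0.0, 0.0, 0.0]])
gamma = 0.9

# Bellman expectation in matrix form: V = r + gamma * P V,
# with r[s] = sum_s2 P[s, s2] * R[s, s2]
r = (P * R).sum(axis=1)
V = np.linalg.solve(np.eye(3) - gamma * P, r)
print(V)
```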

SLIDE 58

Value Iteration (Prediction DP)

  • Start with an initialization $V^\pi_{(0)}$
  • Iterate ($k = 0 \ldots$ convergence): for all states

$V^\pi_{(k+1)}(s) = \sum_{s'} P^{\pi(s)}_{s,s'} \left[ R^{\pi(s)}_{s,s'} + \gamma V^\pi_{(k)}(s') \right]$
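A sketch of this prediction DP in Python, reusing the same policy-conditioned `P` and `R` arrays assumed in the previous snippet:

```python
import numpy as np

def evaluate_policy(P, R, gamma, tol=1e-8):
    """Iterative policy evaluation: sweep the Bellman expectation update to convergence."""
    n = P.shape[0]
    V = np.zeros(n)                       # arbitrary initialization V_(0)
    while True:
        # V_(k+1)(s) = sum_s2 P[s, s2] * (R[s, s2] + gamma * V_(k)(s2))
        V_new = (P * (R + gamma * V[None, :])).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```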
SLIDE 59

Value-based Planning

  • “Value”-based solution
  • Breakdown:

– Prediction: Given any policy $\pi$, find the value function $V^\pi(s)$ – Control: Find the optimal policy

SLIDE 60

Optimal Policies

  • Different policies can result in different value functions
  • What is the optimal policy?
  • The optimal policy is the policy that will maximize the expected total discounted reward at every state:

$E[G_t \mid S_t = s] = E\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid S_t = s \right]$

– Recall: why do we consider the discounted return, rather than the actual return $\sum_{k=0}^{\infty} r_{t+k+1}$?

SLIDE 61

Policy ordering definition

  • A policy $\pi$ is “better” than a policy $\pi'$ if the value function under $\pi$ is greater than or equal to the value function under $\pi'$ at all states: $\pi \ge \pi' \Rightarrow V^\pi(s) \ge V^{\pi'}(s)\ \forall s$

  • Under the better policy, you will expect a better overall outcome no matter what the current state is

SLIDE 62

The optimal policy theorem

  • Theorem: For any MDP there exists an optimal policy $\pi^*$ that is better than or equal to every other policy: $\pi^* \ge \pi\ \forall \pi$
  • Corollary: If there are multiple optimal policies $\pi_{opt_1}, \pi_{opt_2}, \ldots$, all of them achieve the same value function: $V^{\pi_{opt_i}}(s) = V^{*}(s)\ \forall s$
  • All optimal policies achieve the same action value function: $Q^{\pi_{opt_i}}(s, a) = Q^{*}(s, a)\ \forall s, a$

SLIDE 63

How to find the optimal policy

  • For the optimal policy:

$\pi^*(s) = \operatorname{argmax}_{a \in \mathcal{A}(s)} Q^*(s, a)$

  • Easy to prove

– For any other policy $\pi$, $Q^\pi(s, a) \le Q^*(s, a)$

  • Knowing the optimal action value function $Q^*(s, a)\ \forall s, a$ is sufficient to find the optimal policy

SLIDE 64

Backup diagram

$V^*(s) = \max_a Q^*(s, a)$

$Q^*(s, a) = \sum_{s'} P^a_{s,s'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$

Figures from Sutton

SLIDE 65

Backup diagram

$V^*(s) = \max_a Q^*(s, a)$

$Q^*(s, a) = \sum_{s'} P^a_{s,s'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$

$V^*(s) = \max_a Q^*(s, a) = \max_a \sum_{s'} P^a_{s,s'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$

Figures from Sutton

SLIDE 66

Backup diagram

Figures from Sutton

$Q^*(s, a) = \sum_{s'} P^a_{s,s'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$

$Q^*(s, a) = \sum_{s'} P^a_{s,s'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right]$

SLIDE 67

Optimality relationships: Summary

  • Given the MDP: $\langle \mathcal{S}, \mathcal{P}, \mathcal{A}, \mathcal{R}, \gamma \rangle$
  • Given the optimal action value function, the optimal value function can be found:

$V^*(s) = \max_a Q^*(s, a)$

  • Given the optimal value function, the optimal action value function can be found:

$Q^*(s, a) = \sum_{s'} P^a_{s,s'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$

  • Given the optimal action value function, the optimal policy can be found:

$\pi^*(s) = \operatorname{argmax}_{a \in \mathcal{A}(s)} Q^*(s, a)$

$Q^*(s, a) = E\left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right]$

SLIDE 68

“Solving” the MDP

  • Solving the MDP equates to finding the optimal policy $\pi^*(s)$
  • Which is equivalent to finding the optimal value function $V^*(s)$
  • Or finding the optimal action value function $Q^*(s, a)$
  • Various solutions will estimate one or the other

– Value-based solutions solve for $V^*(s)$ and $Q^*(s, a)$ and derive the optimal policy from them – Policy-based solutions directly estimate $\pi^*(s)$

SLIDE 69

Solving the Bellman Optimality Equation

  • No closed form solutions
  • Solutions are iterative
  • Given the MDP (Planning):

– Value iteration – Policy iteration

  • Not given the MDP (Reinforcement Learning):

– Q-learning – SARSA..

SLIDE 70

Value Iteration

  • Start with any initial value function $V^{(0)}(s)$
  • Iterate ($k = 1 \ldots$ convergence):

– Update the value function $V^{(k)}(s) = \max_a \sum_{s'} P^a_{s,s'} \left[ R^a_{s,s'} + \gamma V^{(k-1)}(s') \right]$

  • Note: no explicit policy estimation
  • Directly learning the optimal value function
  • Guaranteed to give you the optimal value function at convergence

– But intermediate value function estimates may not represent any policy
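A sketch of value iteration over the action-conditioned model, with hypothetical arrays `P[a, s, s2]` and `R[a, s, s2]` standing in for the transition probabilities and expected rewards (indexed as action, current state, next state):

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """P, R have shape (num_actions, num_states, num_states).

    Returns the optimal value function V* and a greedy policy derived from it.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[a, s] = sum_s2 P[a, s, s2] * (R[a, s, s2] + gamma * V[s2])
        Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)
        V_new = Q.max(axis=0)             # V(s) = max_a Q(a, s)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new
```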

SLIDE 71

Alternate strategy

  • Worked with Value function

– For N states, estimates N terms

  • Could alternately work with action-value function

– For M actions, must estimate MN terms

  • Much more expensive
  • But more useful in some scenarios
SLIDE 72

Solving for the optimal policy: Value iteration

$Q^{(k+1)}(s, a) = \sum_{s'} P^a_{s,s'} \left[ R^a_{s,s'} + \gamma \max_{a'} Q^{(k)}(s', a') \right]$

$Q^{(k+1)}(s, a) = E\left[ r + \gamma \max_{a'} Q^{(k)}(s', a') \right]$

SLIDE 73

Policy Iteration

  • Start with any policy $\pi^{(0)}$
  • Iterate ($k = 0 \ldots$ convergence):

– Use value iteration (prediction DP) to find the value function $V^{\pi^{(k)}}(s)$ – Find the greedy policy $\pi^{(k+1)}(s) = \operatorname{greedy}\left(V^{\pi^{(k)}}(s)\right)$
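A sketch of policy iteration over the same hypothetical `P[a, s, s2]` / `R[a, s, s2]` arrays, using an exact linear solve for the evaluation step:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)    # arbitrary initial policy pi_(0)
    while True:
        # Evaluate pi exactly: V = (I - gamma * P_pi)^-1 r_pi
        P_pi = P[pi, np.arange(n_states)]                       # (S, S)
        r_pi = (P_pi * R[pi, np.arange(n_states)]).sum(axis=1)  # expected reward per state
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Greedy improvement over the action value of the evaluated policy
        Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)
        pi_new = Q.argmax(axis=0)
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new
```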

SLIDE 74

Next Up

  • We’ve worked so far with planning

– Someone gave us the MDP

  • Next: Reinforcement Learning

– MDP unknown..

SLIDE 75

Model-Free Methods

  • AKA model-free reinforcement learning
  • How do you find the value of a policy, without knowing the

underlying MDP?

– Model-free prediction

  • How do you find the optimal policy, without knowing the underlying

MDP?

– Model-free control

SLIDE 76

Model-Free Methods

  • AKA model-free reinforcement learning
  • How do you find the value of a policy, without knowing the underlying MDP?

– Model-free prediction

  • How do you find the optimal policy, without knowing the underlying MDP?

– Model-free control

  • Assumption: We can identify the states, know the actions, and measure rewards, but

have no knowledge of the system dynamics

– The key knowledge required to “solve” for the best policy – A reasonable assumption in many discrete-state scenarios – Can be generalized to other scenarios with infinite or unknowable state

SLIDE 77

Methods

  • Monte-Carlo Learning
  • Temporal-Difference Learning
SLIDE 78

Monte-Carlo learning to learn the value of a policy $\pi$

  • Just “let the system run” while following the policy $\pi$ and learn the value of different states
  • Procedure: Record several episodes of the following

– Take actions according to policy $\pi$ – Note states visited and rewards obtained as a result – Record the entire sequence: $s_1, a_1, r_2, s_2, a_2, r_3, \ldots, s_T$ – Assumption: Each “episode” ends at some time

  • Estimate value functions based on observations by counting
SLIDE 79

Monte-Carlo Value Estimation

  • Objective: Estimate the value function $V^\pi(s)$ for every state $s$, given recordings of the kind:

$s_1, a_1, r_2, s_2, a_2, r_3, \ldots, s_T$

  • Recall, the value function is the expected return:

$V^\pi(s) = E[G_t \mid S_t = s] = E\left[ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{T-t-1} r_T \mid S_t = s \right]$

  • To estimate this, we replace the statistical expectation $E[G_t \mid S_t = s]$ by the empirical average $\operatorname{avg}[G_t \mid S_t = s]$

SLIDE 80

A bit of notation

  • We actually record many episodes

– episode 1 $= s_1^{(1)}, a_1^{(1)}, r_2^{(1)}, s_2^{(1)}, a_2^{(1)}, r_3^{(1)}, \ldots, s_{T_1}^{(1)}$
– episode 2 $= s_1^{(2)}, a_1^{(2)}, r_2^{(2)}, s_2^{(2)}, a_2^{(2)}, r_3^{(2)}, \ldots, s_{T_2}^{(2)}$
– …
– Different episodes may be different lengths

  • Return at time $i$ for each episode:

– $G_i^{(1)} = r_{i+1}^{(1)} + \gamma r_{i+2}^{(1)} + \cdots + \gamma^{T_1 - i - 1} r_{T_1}^{(1)}$
– $G_i^{(2)} = r_{i+1}^{(2)} + \gamma r_{i+2}^{(2)} + \cdots + \gamma^{T_2 - i - 1} r_{T_2}^{(2)}$
– …
– $G_i^{(e)} = r_{i+1}^{(e)} + \gamma r_{i+2}^{(e)} + \cdots + \gamma^{T_e - i - 1} r_{T_e}^{(e)}$

SLIDE 81

Estimating the Value of a State

  • For every state $s$

– Initialize: Count $N(s) = 0$, Total return $V^\pi(s) = 0$ – For every episode $e$:

  • For every time $t = 1 \ldots T_e$:
  • Compute $G_t$
  • If $S_t == s$:
  • $N(s) = N(s) + 1$
  • $V^\pi(s) = V^\pi(s) + G_t$

– Finally: $V^\pi(s) = V^\pi(s) / N(s)$

  • Can be done more efficiently..
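One more efficient variant accumulates returns in a single backward pass per episode. A sketch of every-visit Monte Carlo estimation, assuming episodes are given as lists of (state, reward) pairs where the reward is the one received on leaving that state:

```python
from collections import defaultdict

def mc_value_estimate(episodes, gamma):
    """Every-visit Monte Carlo: average observed returns G_t over all visits to each state.

    `episodes` is a list of trajectories, each a list of (state, reward) pairs.
    """
    total = defaultdict(float)   # sum of returns per state
    count = defaultdict(int)     # number of visits per state
    for episode in episodes:
        G = 0.0
        # Walk backwards so G accumulates the discounted future reward at each step
        for state, reward in reversed(episode):
            G = reward + gamma * G
            total[state] += G
            count[state] += 1
    return {s: total[s] / count[s] for s in total}
```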
SLIDE 82

Monte Carlo estimation

  • Learning from experience explicitly
  • After a sufficiently large number of episodes, in which all states have

been visited a sufficiently large number of times, we will obtain good estimates of the value functions of all states

  • Easily extended to evaluating action value functions
SLIDE 83

Monte Carlo: Good and Bad

  • Good:

– Will eventually get to the right answer – Unbiased estimate

  • Bad:

– Cannot update anything until the end of an episode

  • Which may last forever

– High variance! Each return adds many random values – Slow to converge

SLIDE 84

Incremental Update of Averages

  • Given a sequence $x_1, x_2, x_3, \ldots$ a running estimate of their average can be computed as

$\bar{x}_k = \frac{1}{k} \sum_{i=1}^{k} x_i$

  • This can be rewritten as:

$\bar{x}_k = \frac{(k-1)\,\bar{x}_{k-1} + x_k}{k}$

  • And further refined to

$\bar{x}_k = \bar{x}_{k-1} + \frac{1}{k}\left( x_k - \bar{x}_{k-1} \right)$

SLIDE 85

Incremental Update of Averages

  • Given a sequence $x_1, x_2, x_3, \ldots$ a running estimate of their average can be computed as

$\bar{x}_k = \bar{x}_{k-1} + \frac{1}{k}\left( x_k - \bar{x}_{k-1} \right)$

  • Or more generally as

$\bar{x}_k = \bar{x}_{k-1} + \alpha\left( x_k - \bar{x}_{k-1} \right)$

  • The latter is particularly useful for non-stationary environments
  • For stationary environments $\alpha$ must shrink with iterations, but not too fast

– $\sum_k \alpha_k^2 < \infty$, $\sum_k \alpha_k = \infty$, $\alpha_k \ge 0$
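Both update rules are one-liners; a sketch comparing them on a made-up sequence:

```python
def running_average(xs, alpha=None):
    """Incremental mean. With alpha=None uses the exact 1/k step;
    otherwise uses a constant step (suited to non-stationary data)."""
    mean = 0.0
    for k, x in enumerate(xs, start=1):
        step = (1.0 / k) if alpha is None else alpha
        mean += step * (x - mean)
    return mean

xs = [1.0, 2.0, 3.0, 4.0]
print(running_average(xs))             # 2.5, the exact mean
print(running_average(xs, alpha=0.1))  # biased toward the initial value of 0
```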

SLIDE 86

Incremental Updates

  • Example of running average of a uniform random variable

[Figure: running-average traces comparing $\bar{x}_k = \bar{x}_{k-1} + \frac{1}{k}(x_k - \bar{x}_{k-1})$ with the constant-step rule $\bar{x}_k = \bar{x}_{k-1} + \alpha(x_k - \bar{x}_{k-1})$ for $\alpha = 0.1, 0.05, 0.03$]

SLIDE 87

Incremental Updates

  • The exact ($1/k$) equation is unbiased and converges to the true value
  • The equation with constant $\alpha$ is biased (early estimates can be expected to be wrong) but converges to the true value

[Figure: running-average traces for the two rules, with $\alpha = 0.1, 0.05, 0.03$]

SLIDE 88

Updating Value Function Incrementally

  • Actual update

$V^\pi(s) = \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_{t(i)}$

  • $N(s)$ is the total number of visits to state $s$ across all episodes
  • $G_{t(i)}$ is the discounted return at the time instant of the $i$-th visit to state $s$
SLIDE 89

Online update

  • Given any episode
  • Update the value of each state visited

$N(S_t) = N(S_t) + 1$
$V^\pi(S_t) = V^\pi(S_t) + \frac{1}{N(S_t)}\left( G_t - V^\pi(S_t) \right)$

  • Incremental version

$V^\pi(S_t) = V^\pi(S_t) + \alpha\left( G_t - V^\pi(S_t) \right)$

  • Still an unrealistic rule
  • Requires the entire track until the end of the episode to compute $G_t$
SLIDE 90

Online update

  • Given any episode
  • Update the value of each state visited

$N(S_t) = N(S_t) + 1$
$V^\pi(S_t) = V^\pi(S_t) + \frac{1}{N(S_t)}\left( G_t - V^\pi(S_t) \right)$

  • Incremental version

$V^\pi(S_t) = V^\pi(S_t) + \alpha\left( G_t - V^\pi(S_t) \right)$

  • Still an unrealistic rule
  • Requires the entire track until the end of the episode to compute $G_t$

Problem

SLIDE 91

Temporal Difference (TD) solution

𝑊] 𝑇" = 𝑊] 𝑇" + 𝛽 𝐻" − 𝑊] 𝑇"

  • But

𝐻" = 𝑠

"() + 𝛿𝐻"()

  • We can approximate 𝐻"() by the expected return at the next state

𝑇"() ≈ 𝑊] 𝑇"() 𝐻" ≈ 𝑠

"() + 𝛿𝑊] 𝑇"()

  • We don’t know the real value of 𝑊] 𝑇"() but we can “bootstrap” it by

its current estimate

Problem

SLIDE 92

TD vs MC

  • What are $V(A)$ and $V(B)$?

– Using MC – Using TD, where you are allowed to repeatedly go over the data

SLIDE 93

TD solution: Online update

𝑊] 𝑇" = 𝑊] 𝑇" + 𝛽 𝐻" − 𝑊] 𝑇"

  • Where

𝐻" ≈ 𝑠"() + 𝛿𝑊] 𝑇"()

  • Giving us

– 𝑊] 𝑇" = 𝑊] 𝑇" + 𝛽 𝑠

"() + 𝛿𝑊] 𝑇"() − 𝑊] 𝑇" The error between an (estimated) observation of 𝐻" and the current estimate 𝑊] 𝑇"

SLIDE 94

TD solution: Online update

  • For all $s$ initialize: $V^\pi(s) = 0$
  • For every episode $e$

– For every time $t = 1 \ldots T_e$:
– $V^\pi(S_t) = V^\pi(S_t) + \alpha\left( r_{t+1} + \gamma V^\pi(S_{t+1}) - V^\pi(S_t) \right)$

  • There's a “lookahead” of one state, to know which state the process arrives at at the next time step

  • But is otherwise online, with continuous updates
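A sketch of this TD(0) loop, assuming each episode is a list of (state, reward, next_state) transitions and that a terminal next state is represented by `None`:

```python
from collections import defaultdict

def td0_evaluate(episodes, gamma, alpha):
    """TD(0) policy evaluation: bootstrap each update from the current estimate of V(s')."""
    V = defaultdict(float)
    for episode in episodes:
        for state, reward, next_state in episode:
            # TD target: r + gamma * V(s'), with terminal states valued at 0
            target = reward + gamma * (V[next_state] if next_state is not None else 0.0)
            V[state] += alpha * (target - V[state])   # move V(s) toward the TD target
    return dict(V)
```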
SLIDE 95

TD Solution

  • Updates continuously – improve estimates as soon as you observe a state

(and its successor)

  • Can work even with infinitely long processes that never terminate
  • Guaranteed to converge to the true values eventually

– Although initial values will be biased as seen before – Is actually lower variance than MC!!

  • Only incorporates one RV at any time
  • TD can give correct answers when MC goes wrong

– Particularly when TD is allowed to loop over all learning episodes

SLIDE 96

Story so far

  • Want to compute the values of all states, given a policy, but no

knowledge of dynamics

  • Have seen Monte Carlo and temporal difference solutions

– TD is quicker to update, and in many situations the better solution

SLIDE 97

Optimal Policy: Control

  • We learned how to estimate the state value functions for an MDP

whose transition probabilities are unknown for a given policy

  • How do we find the optimal policy?
SLIDE 98

Value vs. Action Value

  • The solution we saw so far only computes the value functions of states
  • Not sufficient – to compute the optimal policy from value functions alone,

we will need extra information, namely transition probabilities

– Which we do not have

  • Instead, we can use the same method to compute action value functions

– Optimal policy in any state : Choose the action that has the largest optimal action value

SLIDE 99

Value vs. Action value

  • Given only value functions, the optimal policy must be estimated as:

$\pi^*(s) = \operatorname{argmax}_{a \in \mathcal{A}} \sum_{s'} \mathcal{P}^a_{ss'} \left( \mathcal{R}^a_{ss'} + \gamma V(s') \right)$

– Needs knowledge of transition probabilities

  • Given action value functions, we can find it as:

$\pi^*(s) = \operatorname{argmax}_{a \in \mathcal{A}} Q(s, a)$

  • This is model-free (no need for knowledge of model parameters)
SLIDE 100

Problem of optimal control

  • From a series of episodes of the kind:

$s_1, a_1, r_2, s_2, a_2, r_3, s_3, a_3, r_4, \ldots, s_T$

  • Find the optimal action value function $Q^*(s, a)$

– The optimal policy can be found from it

  • Ideally do this online

– So that we can continuously improve our policy from ongoing experience

SLIDE 101

Exploration vs. Exploitation

  • Optimal policy search happens while gathering experience by following a policy

  • For fastest learning, we will follow an estimate of the optimal policy
  • Risk: We run the risk of positive feedback

– Only learn to evaluate our current policy – Will never learn about alternate policies that may turn out to be better

  • Solution: We will follow our current optimal policy $1 - \epsilon$ of the time

– But choose a random action $\epsilon$ of the time – The “epsilon-greedy” policy

SLIDE 102

Online methods for estimating the value of a policy:

  • Temporal Difference Learning (TD)
  • Idea: Update your value estimates after every observation

$s_1, a_1, r_2, s_2, a_2, r_3, s_3, a_3, r_4, \ldots, s_T$

– Do not actually wait until the end of the episode

[Figure: updates for $S_1$, $S_2$, $S_3$ are made as each transition is observed]

SLIDE 103

Solving for the optimal policy: Q-learning algorithm

  • Initialize $Q(s, a)$ arbitrarily
  • Repeat (for each episode):
  • Initialize $s$
  • Repeat (for each step of episode):
  • Choose $a$ from $s$ using a policy derived from $Q$ (e.g., ε-greedy)
  • Take action $a$, receive reward $r$, observe new state $s'$
  • $Q(s, a) \leftarrow Q(s, a) + \alpha\left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
  • $s \leftarrow s'$
  • until $s$ is terminal
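A sketch of the algorithm in Python; `env` is a hypothetical Gym-style environment with `reset()` and `step()`, and the behavior policy is ε-greedy as in the pseudocode:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy.

    `env` is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done).
    """
    Q = defaultdict(float)                 # Q[(s, a)], initialized to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy: explore with prob. eps, otherwise act greedily on Q
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)
            # Off-policy target: uses the best action at s2, not the action actually taken next
            best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```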

SLIDE 104

Solution: Off-policy learning

  • The policy for learning is the “what if” policy

$s_1, a_1, r_2, s_2, a_2, r_3, s_3, a_3, r_4, \ldots, s_T$

  • Use the best action for $s_{t+1}$ as your hypothetical off-policy action
  • But actually follow an epsilon-greedy action

– The hypothetical action is guaranteed to be better than the one you actually took – But you still explore (non-greedy)

SLIDE 105

Q-Learning

  • From any state-action pair $(s, a)$

– Accept reward $r$ – Transition to $s'$ – Find the best action $a'$ for $s'$ – Use it to update $Q(s, a)$ – But then actually perform an epsilon-greedy action $a''$ from $s'$

SLIDE 106

What about the actual policy?

  • Optimal greedy policy:

$\pi(a \mid s) = \begin{cases} 1 & \text{for } a = \operatorname{argmax}_{a'} Q(s, a') \\ 0 & \text{otherwise} \end{cases}$

  • Exploration policy:

$\pi(a \mid s) = \begin{cases} 1 - \epsilon & \text{for } a = \operatorname{argmax}_{a'} Q(s, a') \\ \dfrac{\epsilon}{N_a - 1} & \text{otherwise} \end{cases}$

  • Ideally $\epsilon$ should decrease with time
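The two policies above as a sketch, with `Q` a dict keyed by (state, action) as in the previous snippet; setting `eps = 0` recovers the greedy policy:

```python
def action_probabilities(Q, state, actions, eps):
    """Epsilon-greedy distribution pi(a|s): 1 - eps on the greedy action,
    eps / (N_a - 1) spread over the rest. Assumes at least two actions."""
    greedy = max(actions, key=lambda a: Q[(state, a)])
    n = len(actions)
    return {a: (1.0 - eps) if a == greedy else eps / (n - 1) for a in actions}
```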
SLIDE 107

Q-Learning

  • Currently the most popular RL algorithm
  • Topics not covered yet:

– Value function approximation – Continuous state spaces – Deep Q-learning