This lecture will be recorded!!! Welcome to DS595/CS525 Reinforcement Learning. Prof. Yanhua Li. Time: 6:00pm – 8:50pm R. Zoom Lecture, Fall 2020
Last Lecture v What is reinforcement learning? v Difference from other AI problems v Application stories. v Topics to be covered in this course. v Course logistics
Reinforcement Learning: What is it? Reinforcement learning (RL) is an area of machine learning concerned with how software agents ought to take actions in an environment to maximize some notion of cumulative reward. (From Wikipedia) Key components: 1. Model, 2. Value function, 3. Policy
RL involves 4 key aspects
1. Optimization
v Goal is to find an optimal way to make decisions, with maximized total cumulative rewards
2. Exploration
3. Generalization
v Programming all possibilities is not possible.
4. Delayed consequences
Branches of Machine Learning [Diagram from David Silver's slides: Machine Learning branches into Supervised Learning, Unsupervised Learning, and Reinforcement Learning; related areas include AI planning and Imitation learning.]
Today’s topics v Reinforcement Learning Components § Model, Value function, Policy v Model-based Planning § Policy Evaluation, Policy Search v Project 1 demo and description.
Today’s topics v Reinforcement Learning Components § State vs observation § Stochastic vs deterministic model and policy § Model, Value function, Policy v Model-based Planning § Policy Evaluation, Policy Search v Project 1 demo and description.
Reinforcement Learning Components [Diagram: the agent and the environment interact through observations, actions, and rewards.]
Agent-Environment interactions over time (sequential decision process). Each time step t: 1. Agent takes an action a_t; 2. World updates given action a_t, emits observation o_t and reward r_t; 3. Agent receives observation o_t and reward r_t.
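The loop above can be written in a few lines. A minimal sketch, assuming a generic `env` with `reset()`/`step()` and an `agent` with `act()` (hypothetical placeholder names, not a specific library API):

```python
# Minimal sketch of the agent-environment loop.
# Assumes env.reset() returns an initial observation and
# env.step(action) returns (observation, reward, done).
def run_episode(env, agent, max_steps=100):
    obs = env.reset()
    history = []
    for t in range(max_steps):
        action = agent.act(obs)                  # 1. agent takes action a_t
        obs, reward, done = env.step(action)     # 2. world emits o_t and r_t
        history.append((action, obs, reward))    # 3. agent receives o_t and r_t
        if done:
            break
    return history
```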
Interaction history, Decision-making. History: h_t = (a_1, o_1, r_1, ..., a_t, o_t, r_t). Agent chooses action a_{t+1} based on history h_t. State: information assumed to determine what happens next, as a function of history: s_t = f(h_t). In many cases, for simplicity, s_t = o_t.
State transition & Markov property. Observation/State: s_t = o_t. Transition probability: p(s_{t+1} | s_t, a_t). State s_t is Markov if and only if: p(s_{t+1} | s_t, a_t) = p(s_{t+1} | h_t, a_t). Future is independent of past, given present.
A taxi driver seeks for passengers: State (observation): (current location, with or without passenger); Action: a direction to go (Path 1, Path 2, Path 3). Hypertension control: State: (current blood pressure); Action: take medication or not.
More on Markov Property? 1. Does the Markov property always hold? No. 2. What if the Markov property does not hold?
More on Markov Property? 1. Does the Markov property always hold? No. 2. What if the Markov property does not hold? Make it Markov by setting the state to be the history: s_t = h_t. Again, in practice, we often assume the most recent observation is a sufficient statistic of the history: s_t = o_t. The state representation has big implications for: 1. Computational complexity 2. Data required 3. Resulting performance
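The two representations just mentioned can be sketched directly on the (action, observation, reward) history from the loop above. These helper names are my own, for illustration only:

```python
# Two state representations for the same interaction history.
def state_from_history(history):
    # s_t = h_t: Markov by construction, but the state grows with t.
    return tuple(history)

def state_from_last_observation(history):
    # s_t = o_t: compact, but assumes the most recent observation is a
    # sufficient statistic of the history.
    return history[-1][1] if history else None
```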
Fully vs Partially Observable Markov Decision Process. Fully observable: what you observe fully represents the environment state, s_t = o_t. Partially observable: what you observe only partially represents the environment state, so the agent may need the history, s_t = h_t.
Examples: Breakout game (fully observable) vs. Poker games (partially observable).
Deterministic vs Stochastic Model.
Deterministic: given history & action, a single observation & reward. Common assumption in robotics and controls. p(s_{t+1} | s_t, a_t) = 1 for s_{t+1} = s, and p(s_{t+1} | s_t, a_t) = 0 for s_{t+1} ≠ s; r(s_t, a_t) = 3 for s_t = s, a_t = a.
Stochastic: given history & action, many potential observations & rewards. Common assumption for customers, patients, hard-to-model domains. 0 ≤ p(s_{t+1} | s_t, a_t) < 1; P[r(s_t, a_t) = 3] = 50%, P[r(s_t, a_t) = 5] = 50%, for s_t = s, a_t = a.
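As a small illustration of the distinction (the table layout below is my own choice, not from the slides), a deterministic model looks up a single next state, while a stochastic model samples from a distribution over next states and rewards:

```python
import random

def deterministic_next_state(s, a, table):
    # table[(s, a)] is the single next state: p(s'|s,a) = 1 for it, 0 otherwise.
    return table[(s, a)]

def stochastic_next_state(s, a, table):
    # table[(s, a)] is a dict mapping next states to probabilities.
    next_states, probs = zip(*table[(s, a)].items())
    return random.choices(next_states, weights=probs)[0]

def stochastic_reward_example():
    # The reward example above: 3 with probability 50%, 5 with probability 50%.
    return random.choices([3, 5], weights=[0.5, 0.5])[0]
```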
Examples: Breakout game (deterministic) vs. Hypertension control (stochastic, for both transition and reward).
Example: Taxi passenger-seeking task as a decision-making process. [Diagram: six locations s_1, ..., s_6 along a road.] States: locations of the taxi (s_1, ..., s_6) on the road. Actions: Left or Right. Rewards: +1 in state s_1, +3 in state s_5, 0 in all other states.
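A hedged encoding of this task as stated on the slide; the state names and data layout are my own choices, and the transition dynamics are left out here (an example transition model appears a few slides later):

```python
# Taxi passenger-seeking task: states, actions, and true rewards as listed above.
states = ["s1", "s2", "s3", "s4", "s5", "s6"]
actions = ["left", "right"]
rewards = {"s1": 1, "s2": 0, "s3": 0, "s4": 0, "s5": 3, "s6": 0}
```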
RL components v Often include one or more of § Model: Representation of how the world changes in response to agent’s action § Policy: function mapping agent’s states to action § Value function: Future rewards from being in a state and/or action when following a particular policy
RL components: Model v Agent's representation of how the world changes in response to agent's action, with two parts: § Transition model: predicts the next agent state, p(s_{t+1} = s' | s_t = s, a_t = a) § Reward model: predicts the immediate reward
Taxi passenger-seeking task: Stochastic Markov Model. Taxi agent's transition model: 0.5 = p(s_3 | s_3, right) = p(s_4 | s_3, right); 0.5 = p(s_4 | s_4, right) = p(s_5 | s_4, right). The RL agent's reward model is r'_1 = r'_2 = ... = r'_6 = 0, which may be wrong. The true reward model is r = [1, 0, 0, 0, 3, 0].
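A sketch of the agent's models from this slide, reusing the state names above. Only the transitions explicitly listed are filled in; every other entry would be an assumption, so it is left out:

```python
# Agent's (possibly wrong) models for the taxi task.
transition_model = {
    ("s3", "right"): {"s3": 0.5, "s4": 0.5},
    ("s4", "right"): {"s4": 0.5, "s5": 0.5},
    # ... remaining (state, action) pairs are not specified on the slide
}

states = ["s1", "s2", "s3", "s4", "s5", "s6"]
agent_reward_model = {s: 0 for s in states}                 # r'_i = 0 for all i (may be wrong)
true_reward_model = dict(zip(states, [1, 0, 0, 0, 3, 0]))   # true r = [1, 0, 0, 0, 3, 0]
```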
RL components v Often include one or more of § Model: Representation of how the world changes in response to agent’s action § Policy: function mapping agent’s states to action § Value function: Future rewards from being in a state and/or action when following a particular policy
RL components: Policy v Policy π determines how the agent chooses actions § π : S → A, mapping from states to actions v Deterministic policy: § π(s) = a § In other words, π(a|s) = 1, and π(a'|s) = π(a''|s) = 0 for the other actions a', a'' v Stochastic policy: § π(a|s) = Pr(a_t = a | s_t = s)
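Minimal sketches of the two policy types (the action names are taken from the taxi example; this is not the course's reference code):

```python
import random

def deterministic_policy(s):
    # pi(s) = a: exactly one action per state, i.e. pi(a|s) = 1 for that action.
    return "right"

def stochastic_policy(s):
    # pi(a|s) = Pr(a_t = a | s_t = s): sample an action from a distribution,
    # here 50%/50% left/right as in the example on the next slide.
    return random.choices(["left", "right"], weights=[0.5, 0.5])[0]
```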
Taxi passenger-seeking task: Policy. [Diagram: states s_1, ..., s_6 with the policy shown by arrows, 50% / 50%.] Action set: {left, right}. The policy is represented by the arrows. Q1: Is this a deterministic or stochastic policy? Q2: Give an example of another policy type.
RL components v Often include one or more of § Model: Representation of how the world changes in response to agent’s action § Policy: function mapping agent’s states to action § Value function: Future rewards from being in a state and/or action when following a particular policy
RL components: Value Function v Value function V^π: expected discounted sum of future rewards under a particular policy π v Discount factor γ weighs immediate vs future rewards, with γ in [0,1] v Can be used to quantify goodness/badness of states and actions v And to decide how to act by comparing policies
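In the standard notation this is V^π(s) = E_π[ r_t + γ r_{t+1} + γ² r_{t+2} + ... | s_t = s ]. A small sketch of the quantity inside the expectation, the discounted return of a single trajectory:

```python
# Discounted return of one trajectory of rewards [r_1, r_2, ...]:
#   G = r_1 + gamma * r_2 + gamma**2 * r_3 + ...
def discounted_return(rewards, gamma):
    g, discount = 0.0, 1.0
    for r in rewards:
        g += discount * r
        discount *= gamma
    return g

# V^pi(s) is the expectation of this return over trajectories that start in s
# and follow pi; averaging discounted_return over many rollouts gives a
# Monte Carlo estimate of it.
```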
Taxi passenger-seeking task: Value function. Discount factor γ = 0. Policy #1: π(s_1) = π(s_2) = ··· = π(s_6) = right. Q: V^π? Policy #2: π(left|s_i) = π(right|s_i) = 50%, for i = 1, ..., 6. Q: V^π?
Types of RL agents/algorithms. Model-based: explicit model; may or may not have a policy and/or value function. Model-free: explicit value function and/or policy function; no model.
Today’s topics v Reinforcement Learning Components § Model, Value function, Policy v Model-based Planning § MDP model § Policy Evaluation, Policy Search v Project 1 demo and description.
MDP v Markov Decision Process
MDP components: Transition Model, Reward Model, Policy function, Value function.
Taxi passenger-seeking task as an MDP. [Diagram: states s_1, ..., s_6, actions a_1 and a_2.] Deterministic transition model.
Taxi passenger-seeking task: MDP Policy Evaluation. [Diagram: states s_1, ..., s_6, actions a_1 and a_2.] v Let π(s) = a_1 ∀ s, and γ = 0. v What is the value of this policy?
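A sketch of iterative policy evaluation for a finite MDP (illustrative, not the course's reference implementation). It assumes dictionaries `P[(s, a)]` mapping next states to probabilities and `R[s]` giving state rewards, in the shape of the tables sketched earlier; with γ = 0 it simply returns V(s) = R(s):

```python
def policy_evaluation(states, policy, P, R, gamma, tol=1e-8):
    # Repeatedly apply the Bellman backup for the fixed policy until values converge.
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy(s)
            v_new = R[s] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```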
Taxi passenger-seeking task: MDP Control. [Diagram: states s_1, ..., s_6, actions a_1 and a_2.] v 6 discrete states (location of the taxi) v 2 actions: Left or Right v How many deterministic policies are there? v Is the optimal policy for an MDP always unique?
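As a worked illustration of the counting behind the first question: a deterministic policy assigns one action to each state, so for |S| states and |A| actions there are |A|^|S| of them.

```python
from itertools import product

# 6 states x 2 actions: |A| ** |S| = 2 ** 6 = 64 deterministic policies.
actions, n_states = ["left", "right"], 6
policies = list(product(actions, repeat=n_states))
print(len(policies))  # 64
```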
If the policy doesn't change, can it ever change again? Is there a maximum number of iterations of policy iteration?
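These questions are about policy iteration, which alternates evaluation with a greedy improvement step. A sketch, reusing `policy_evaluation` from above (illustrative, not the course's reference implementation): once the greedy step leaves the policy unchanged it reproduces the same policy forever, and since there are only |A|^|S| deterministic policies the loop terminates after finitely many iterations.

```python
def policy_iteration(states, actions, P, R, gamma):
    policy = {s: actions[0] for s in states}
    while True:
        # Evaluate the current policy, then improve it greedily at every state.
        V = policy_evaluation(states, lambda s: policy[s], P, R, gamma)
        stable = True
        for s in states:
            best = max(actions, key=lambda a: R[s] + gamma *
                       sum(p * V[s2] for s2, p in P[(s, a)].items()))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V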
Project 1 starts today. Due 9/24, midnight. v https://users.wpi.edu/~yli15/courses/DS595CS525Fall20/Assignments.html
Any Comments & Critiques?