
Introduction to Reinforcement Learning and Q-Learning
Skyler Seto (ss3349), May 2, 2016



  1. Introduction to Reinforcement Learning and Q-Learning. Skyler Seto (ss3349), May 2, 2016.

  2. Outline: (1) Reinforcement Learning and Markov Decision Process, (2) Q-Learning, (3) Q-Learning Convergence.

  3. Introduction. How does an agent behave? (1) An agent can be a passive learner, lounging around analyzing data and then constructing its model. (2) An agent can be an active learner, learning to act on the fly from sequences of the form (state, action, reward). In this talk, we consider an agent that actively learns from the environment (case 2).

  4. Markov Decision Process (MDP). Definition: the MDP framework consists of the four elements $(S, A, R, P)$:
     • $S$ is the finite set of possible states,
     • $A$ is the finite set of possible actions,
     • $R$ is the reward model $R : S \times A \to \mathbb{R}$,
     • $P$ is the transition model $P(s' \mid s, a)$ with $\sum_{s' \in S} P(s' \mid s, a) = 1$.
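To make the four elements concrete, here is a minimal sketch of an MDP as plain Python data. The two states, two actions, rewards, and transition probabilities below are invented purely for illustration; they are not taken from the talk.

```python
# A minimal sketch of the (S, A, R, P) tuple as plain Python data.
# All specific values here are illustrative placeholders.

S = ["s0", "s1"]                      # finite state set
A = ["left", "right"]                 # finite action set

# Reward model R : S x A -> real numbers
R = {("s0", "left"): 0.0, ("s0", "right"): 1.0,
     ("s1", "left"): 0.0, ("s1", "right"): 0.0}

# Transition model P(s' | s, a); each inner dict sums to 1
P = {("s0", "left"):  {"s0": 1.0},
     ("s0", "right"): {"s0": 0.2, "s1": 0.8},
     ("s1", "left"):  {"s0": 1.0},
     ("s1", "right"): {"s1": 1.0}}

# Sanity check: every row of the transition model is a probability distribution
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in P.values())
```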

  5. Robot Navigation. (1) The state space $S$ is the set of all possible locations and directions. (2) The action space $A$ is the set of possible motions: move forward, move backward, etc. (3) The reward model $R$ rewards the robot positively if it reaches the goal and negatively if it hits an obstacle. (4) The transition probability accounts for some chance that the robot moves forward, does not move, or moves forward twice.

  6. Markov Decision Process Diagram. Figure: a two-step Markov decision process.

  7. Properties of the MDP. (1) The reward function $R(s, a)$ is deterministic and time-homogeneous. (2) $P(s_{t+1} \mid s_t, a_t)$ is independent of $t$ and thus time-homogeneous. (3) The transition model is Markovian.

  8. Reinforcement Learning in the MDP. (1) Consider a partially known model $(S, A, R, P)$ where $S$ and $A$ are known, but $R$ and $P$ must be learned as the agent acts. (2) Define the policy for the MDP, $\pi_t : S \to A$, as the solution to the MDP. (3) What is the optimal policy $\pi^*$ the agent should learn in order to maximize its total expected discounted reward (with discount factor $\gamma$)?
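As a small worked illustration of the objective, the discounted return of a single sample reward sequence is $\sum_t \gamma^t r_t$; the numbers below are made up.

```python
# Discounted return for one sample reward sequence (illustrative numbers only).
gamma = 0.9
rewards = [1.0, 0.0, 0.0, 5.0]               # r_0, r_1, r_2, r_3

discounted_return = sum(gamma ** t * r for t, r in enumerate(rewards))
print(discounted_return)                     # 1.0 + 0.9**3 * 5.0 = 4.645
```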

  9. Outline: (1) Reinforcement Learning and Markov Decision Process, (2) Q-Learning, (3) Q-Learning Convergence.

  10. Value and Optimal Value. Definition: for a given policy $\pi$ and discount factor $\gamma$, the value of a state $s$ is
$$V^{\pi}(s) = R_s(\pi(s)) + \gamma \sum_{y \in S} P_{s,y}(\pi(s))\, V^{\pi}(y)$$
$$V^{*}(s) = V^{\pi^{*}}(s) = \max_a \Big[ R_s(a) + \gamma \sum_{y \in S} P_{s,y}(a)\, V^{\pi^{*}}(y) \Big]$$
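The Bellman optimality equation above can be solved numerically by applying it repeatedly as an update. The sketch below is a standard value-iteration loop over the dictionary-based S, A, R, P containers from the earlier MDP sketch; it is an illustration of the equation, not a procedure given in the slides.

```python
# A minimal value-iteration sketch for the Bellman optimality equation above.
# Assumes S, A are lists and R, P are dicts keyed by (state, action),
# as in the earlier illustrative MDP sketch.

def value_iteration(S, A, R, P, gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            # V*(s) = max_a [ R_s(a) + gamma * sum_y P_{s,y}(a) V*(y) ]
            best = max(R[(s, a)] + gamma * sum(p * V[y] for y, p in P[(s, a)].items())
                       for a in A)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```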

  11. Q Function. Definition: for a policy $\pi$, define the Q values (action-values) as
$$Q^{\pi}(s, a) = R_s(a) + \gamma \sum_{y \in S} P_{s,y}(a)\, V^{\pi}(y) = \mathbb{E}_s\big[ R_s(a) + \gamma V^{\pi}(y) \big]$$
The Q value is the expected discounted reward for executing action $a$ in state $s$ and following policy $\pi$ afterwards.

  12. Q Values for the Optimal Policy. (1) Let $Q^{*}(s, a) = Q^{\pi^{*}}(s, a)$ be the optimal action-values, (2) $V^{*}(s) = \max_a Q^{*}(s, a)$ be the optimal value, and (3) $\pi^{*}(s) = \arg\max_a Q^{*}(s, a)$ be the optimal policy. If an agent learns the Q values, it can easily determine the optimal action.
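A minimal sketch of how the optimal value and optimal action fall out of a learned Q table by maximizing over actions; the Q values below are placeholders, not results from the talk.

```python
# Extracting V*(s) and pi*(s) from a Q table stored as a plain dict.

Q = {("s0", "left"): 0.3, ("s0", "right"): 1.7,
     ("s1", "left"): 0.9, ("s1", "right"): 0.4}

def optimal_value(Q, s, actions):
    return max(Q[(s, a)] for a in actions)           # V*(s) = max_a Q*(s, a)

def optimal_action(Q, s, actions):
    return max(actions, key=lambda a: Q[(s, a)])     # pi*(s) = argmax_a Q*(s, a)

print(optimal_value(Q, "s0", ["left", "right"]))     # 1.7
print(optimal_action(Q, "s0", ["left", "right"]))    # "right"
```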

  13. Q-Learning. In Q-learning the agent experiences a sequence of stages. At the $n$th stage, the agent:
     • observes its current state $x_n$,
     • performs an action $a_n$,
     • observes the subsequent state transition to $y_n$,
     • receives reward $r_n$,

  14. Q-Learning (continued).
     • updates its Q function with learning factor $\alpha_n$ according to:
     If $s = x_n$ and $a = a_n$:
$$Q_n(s, a) = (1 - \alpha_n)\, Q_{n-1}(s, a) + \alpha_n \big[ r_n + \gamma V_{n-1}(y_n) \big]$$
     Otherwise:
$$Q_n(s, a) = Q_{n-1}(s, a)$$
     where $V_{n-1}(y) = \max_b \{ Q_{n-1}(y, b) \}$.
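Below is a minimal tabular Q-learning sketch of this update. The environment interface (env.reset() / env.step(a)) and the epsilon-greedy exploration rule are assumptions added for illustration; the slides specify only the update rule itself.

```python
import random
from collections import defaultdict

# Tabular Q-learning sketch. The env object (with reset/step) and the
# epsilon-greedy action choice are illustrative assumptions.

def q_learning(env, actions, episodes=500, gamma=0.9, alpha=0.1, epsilon=0.1):
    Q = defaultdict(float)                             # Q[(state, action)], starts at 0
    for _ in range(episodes):
        x, done = env.reset(), False
        while not done:
            # choose a_n: explore with probability epsilon, otherwise act greedily
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(x, b)])
            y, r, done = env.step(a)                   # observe y_n and r_n
            V_next = max(Q[(y, b)] for b in actions)   # V_{n-1}(y_n) = max_b Q_{n-1}(y_n, b)
            # Q_n(x, a) = (1 - alpha) Q_{n-1}(x, a) + alpha [r + gamma V_{n-1}(y)]
            Q[(x, a)] = (1 - alpha) * Q[(x, a)] + alpha * (r + gamma * V_next)
            x = y
    return Q
```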

  15. Outline: (1) Reinforcement Learning and Markov Decision Process, (2) Q-Learning, (3) Q-Learning Convergence.

  16. Convergence Theorem. Let $n_i(s, a)$ be the $i$th time that action $a$ is tried in state $s$. Theorem: given bounded rewards $|r_n| \le R$, learning rates $0 \le \alpha_n < 1$, and
$$\sum_{i=1}^{\infty} \alpha_{n_i(s,a)} = \infty, \qquad \sum_{i=1}^{\infty} \big[ \alpha_{n_i(s,a)} \big]^2 < \infty \quad \forall s, a,$$
then $Q_n(s, a) \xrightarrow{\text{a.s.}} Q^{*}(s, a)$ as $n \to \infty$.
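One standard schedule satisfying both conditions is to let the learning rate for each (state, action) pair decay as one over its visit count: the harmonic series diverges while the sum of its squares is finite. This particular choice is an illustration, not something prescribed in the talk.

```python
from collections import defaultdict

# Per-(state, action) learning rate alpha = 1 / (number of visits).
# sum_i 1/i = infinity and sum_i 1/i^2 < infinity, matching the theorem's conditions.

visits = defaultdict(int)

def learning_rate(s, a):
    visits[(s, a)] += 1
    return 1.0 / visits[(s, a)]
```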

  17. Action Replay Process (ARP). (1) Let $S = \{s\}$, $A = \{a\}$ be the sets of states and actions of the original MDP. (2) Create an infinite deck of cards, with the $j$th card from the bottom having $(s_j, a_j, y_j, r_j, \alpha_j)$ written on it. (3) Additionally, take the bottom card to hold the values $Q_0(s, a)$ for all $s$ and $a$. (4) The ARP is defined to have state space $S' = \{(s, n)\}$ and action space $A' = A = \{a\}$.

  18. State Transitions in the ARP. Given current state $(s, n)$ and action $a$, we determine the next state by:
     (1) removing all cards for stages after $n$;
     (2) finding the first $t$, searching from the top of the remaining deck, where $s_t = s$ and $a_t = a$;
     (3) flipping a biased coin with probability $\alpha_t$ of heads.
     • If the coin lands heads, return reward $r_t$ and transition to the state $(y_t, t - 1)$; repeat the process on the remaining deck without card $t$.
     • If the coin lands tails, continue searching down the deck for another card with $s$ and $a$.
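A small simulation sketch of this card-drawing transition, assuming the deck is stored as a list of $(s_j, a_j, y_j, r_j, \alpha_j)$ tuples indexed from the bottom. Terminating with reward $Q_0(s, a)$ when the search reaches the bottom of the deck is an added assumption; the slide states only that the bottom card carries $Q_0$.

```python
import random

# One ARP transition: from state (s, n) with action a, search downward
# through the cards for a matching (s, a) pair and flip a coin with bias
# alpha_t at each match. Reaching the bottom card is treated as terminal
# with reward Q0(s, a) (assumption, see lead-in).

def arp_step(deck, Q0, state, action):
    s, n = state
    for t in range(n, 0, -1):                  # cards for stages after n are ignored
        s_t, a_t, y_t, r_t, alpha_t = deck[t - 1]
        if s_t == s and a_t == action and random.random() < alpha_t:
            return r_t, (y_t, t - 1)           # heads: reward r_t, move to (y_t, t - 1)
    return Q0[(s, action)], None               # bottom reached: terminal reward Q0(s, a)
```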

  19. Transition Probability for the ARP. (1) Define the expected reward of card $n$ as determined by the ARP as $R^{(n)}_s(a)$. (2) Define the transition probability for the ARP as $P^{\mathrm{ARP}}_{(x,n),(y,m)}(a)$, with
$$P^{(n)}_{x,y}(a) = \sum_{m=1}^{n-1} P^{\mathrm{ARP}}_{(x,n),(y,m)}(a).$$

  20. Lemma A: the $Q_n$ are Optimal for the ARP. Lemma:
$$Q_n(s, a) = Q^{*}_{\mathrm{ARP}}((s, n), a)$$
that is, the $Q_n(s, a)$ are the optimal action values for ARP states $(s, n)$ and ARP actions $a$.

  21. Lemma A: the $Q_n$ are Optimal for the ARP (proof sketch). The ARP was constructed to have this property. At $n = 0$, $Q_0(s, a)$ is the optimal (and only possible) action-value of $(s, 0)$ and $a$, so
$$Q_0(s, a) = Q^{*}_{\mathrm{ARP}}((s, 0), a).$$
It is easy to see by induction that for all $a$ and $s$, and for any $n$,
$$Q_n(s, a) = Q^{*}_{\mathrm{ARP}}((s, n), a).$$

  22. Lemma B: Convergence of Transitions and Rewards. Lemma: with probability 1, the transition probabilities $P^{(n)}_{x,y}(a)$ and expected rewards $R^{(n)}_x(a)$ in the ARP converge to the transition matrices and expected rewards of the true process as the card level $n \to \infty$.
