Foundations of Machine Learning
Reinforcement Learning
Reinforcement Learning
Agent exploring an environment.
Interactions with the environment: [diagram: the agent sends an action to the environment, which returns a state and a reward].
Problem: find an action policy that maximizes the cumulative reward over the course of the interactions.
Key Features
Contrast with supervised learning:
• no explicit labeled training data.
• distribution defined by the actions taken.
Delayed rewards or penalties.
RL trade-off:
• exploration (of unknown states and actions) to gain more reward information; vs.
• exploitation (of known information) to optimize the reward.
Applications
• Robot control, e.g., RoboCup Soccer Teams (Stone et al., 1999).
• Board games, e.g., TD-Gammon (Tesauro, 1995).
• Elevator scheduling (Crites and Barto, 1996).
• Ads placement.
• Telecommunications.
• Inventory management.
• Dynamic radio channel assignment.
This Lecture
• Markov Decision Processes (MDPs)
• Planning
• Learning
• Multi-armed bandit problem
Markov Decision Process (MDP)
Definition: a Markov Decision Process is defined by:
• a set of decision epochs $\{0, \ldots, T\}$.
• a set of states $S$, possibly infinite.
• a start state or initial state $s_0$.
• a set of actions $A$, possibly infinite.
• a transition probability $\Pr[s' \mid s, a]$: distribution over destination states $s' = \delta(s, a)$.
• a reward probability $\Pr[r' \mid s, a]$: distribution over rewards returned $r' = r(s, a)$.
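For concreteness, a finite MDP can be stored as a pair of arrays. The sketch below is a minimal tabular container used by the later illustrations; the class name FiniteMDP and the array layout are assumptions, not part of the lecture.

```python
import numpy as np

class FiniteMDP:
    """Minimal tabular MDP container (illustrative, assumed layout).

    P[a, s, s'] : transition probability Pr[s' | s, a]
    R[s, a]     : expected reward E[r(s, a)]
    gamma       : discount factor in [0, 1)
    """
    def __init__(self, P, R, gamma):
        self.P = np.asarray(P, dtype=float)   # shape (n_actions, n_states, n_states)
        self.R = np.asarray(R, dtype=float)   # shape (n_states, n_actions)
        self.gamma = gamma
        # each row of each P[a] must be a probability distribution over next states
        assert np.allclose(self.P.sum(axis=2), 1.0)

    @property
    def n_states(self):
        return self.P.shape[1]

    @property
    def n_actions(self):
        return self.P.shape[0]
```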
Model
• State observed at time $t$: $s_t \in S$.
• Action taken at time $t$: $a_t \in A$.
• State reached: $s_{t+1} = \delta(s_t, a_t)$.
• Reward received: $r_{t+1} = r(s_t, a_t)$.
[Diagram: agent–environment loop generating the trajectory $s_t \xrightarrow{a_t / r_{t+1}} s_{t+1} \xrightarrow{a_{t+1} / r_{t+2}} s_{t+2}$.]
MDPs - Properties
• Finite MDPs: $S$ and $A$ are finite sets.
• Finite horizon when $T < \infty$.
• Reward $r(s, a)$: often a deterministic function.
Example - Robot Picking up Balls
[State diagram with states start and other; edges labeled action/[probability, reward]: search/[.1, R1], search/[.9, R1], pickup/[1, R2], carry/[.5, R3], carry/[.5, -1].]
Policy
Definition: a policy is a mapping $\pi \colon S \to A$.
Objective: find a policy $\pi$ maximizing the expected return.
• finite horizon return: $\sum_{t=0}^{T-1} r\big(s_t, \pi(s_t)\big)$.
• infinite horizon return: $\sum_{t=0}^{+\infty} \gamma^t\, r\big(s_t, \pi(s_t)\big)$.
Theorem: there exists an optimal policy from any start state.
Policy Value
Definition: the value of a policy $\pi$ at state $s$ is
• finite horizon: $V_\pi(s) = \mathbb{E}\big[\sum_{t=0}^{T-1} r\big(s_t, \pi(s_t)\big) \,\big|\, s_0 = s\big]$.
• infinite horizon: with discount factor $\gamma \in [0, 1)$, $V_\pi(s) = \mathbb{E}\big[\sum_{t=0}^{+\infty} \gamma^t\, r\big(s_t, \pi(s_t)\big) \,\big|\, s_0 = s\big]$.
Problem: find a policy $\pi$ with maximum value for all states.
Policy Evaluation
Analysis of the policy value:
$V_\pi(s) = \mathbb{E}\big[\sum_{t=0}^{+\infty} \gamma^t\, r\big(s_t, \pi(s_t)\big) \,\big|\, s_0 = s\big]$
$= \mathbb{E}[r(s, \pi(s))] + \gamma\, \mathbb{E}\big[\sum_{t=0}^{+\infty} \gamma^t\, r\big(s_{t+1}, \pi(s_{t+1})\big) \,\big|\, s_0 = s\big]$
$= \mathbb{E}[r(s, \pi(s))] + \gamma\, \mathbb{E}[V_\pi(\delta(s, \pi(s)))]$.
Bellman equations (system of linear equations):
$V_\pi(s) = \mathbb{E}[r(s, \pi(s))] + \gamma \sum_{s'} \Pr[s' \mid s, \pi(s)]\, V_\pi(s')$.
Bellman Equation - Existence and Uniqueness
Notation:
• transition probability matrix: $\mathbf{P}_{s, s'} = \Pr[s' \mid s, \pi(s)]$.
• value column matrix: $\mathbf{V} = V_\pi(s)$.
• expected reward column matrix: $\mathbf{R} = \mathbb{E}[r(s, \pi(s))]$.
Theorem: for a finite MDP, Bellman's equation admits a unique solution, given by $\mathbf{V}_0 = (\mathbf{I} - \gamma \mathbf{P})^{-1} \mathbf{R}$.
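Since the solution is a linear system, it can be computed directly with a standard solver. The following sketch (building on the hypothetical FiniteMDP container above, with a deterministic policy encoded as an array of action indices) illustrates exact policy evaluation; it is not the lecture's own code.

```python
def evaluate_policy(mdp, policy):
    """Exact policy evaluation: solve (I - gamma P_pi) V = R_pi.

    policy : int array of length n_states giving the action chosen in each state.
    """
    n = mdp.n_states
    # P_pi[s, s'] = Pr[s' | s, pi(s)],  R_pi[s] = E[r(s, pi(s))]
    P_pi = mdp.P[policy, np.arange(n), :]
    R_pi = mdp.R[np.arange(n), policy]
    return np.linalg.solve(np.eye(n) - mdp.gamma * P_pi, R_pi)
```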
Bellman Equation - Existence and Uniqueness
Proof: Bellman's equation can be rewritten as $\mathbf{V} = \mathbf{R} + \gamma \mathbf{P} \mathbf{V}$.
• $\mathbf{P}$ is a stochastic matrix, thus $\|\mathbf{P}\|_\infty = \max_s \sum_{s'} |\mathbf{P}_{s s'}| = \max_s \sum_{s'} \Pr[s' \mid s, \pi(s)] = 1$.
• This implies that $\|\gamma \mathbf{P}\|_\infty = \gamma < 1$. The eigenvalues of $\gamma \mathbf{P}$ are all less than one and $(\mathbf{I} - \gamma \mathbf{P})$ is invertible.
Notes: general shortest-distance problem (MM, 2002).
Optimal Policy
Definition: a policy $\pi^*$ with maximal value for all states $s \in S$.
• value of $\pi^*$ (optimal value): $\forall s \in S,\ V_{\pi^*}(s) = \max_\pi V_\pi(s)$.
• optimal state-action value function: expected return for taking action $a$ at state $s$ and then following the optimal policy:
$Q^*(s, a) = \mathbb{E}[r(s, a)] + \gamma\, \mathbb{E}[V^*(\delta(s, a))] = \mathbb{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V^*(s')$.
Optimal Values - Bellman Equations
Property: the following equalities hold:
$\forall s \in S,\ V^*(s) = \max_{a \in A} Q^*(s, a)$.
Proof: by definition, for all $s$, $V^*(s) \le \max_{a \in A} Q^*(s, a)$.
• If for some $s$ we had $V^*(s) < \max_{a \in A} Q^*(s, a)$, then the maximizing action would define a better policy. Thus,
$V^*(s) = \max_{a \in A} \big\{ \mathbb{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V^*(s') \big\}$.
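These relations translate directly into code: given $V^*$, a maximizing action in each state recovers an optimal policy. The sketch below reuses the hypothetical FiniteMDP container; the function names are illustrative, not the lecture's.

```python
def q_from_v(mdp, V):
    """Q(s, a) = E[r(s, a)] + gamma * sum_{s'} Pr[s'|s,a] V(s'), for all (s, a)."""
    expected_next = np.einsum("ast,t->sa", mdp.P, V)   # shape (n_states, n_actions)
    return mdp.R + mdp.gamma * expected_next

def greedy_policy(mdp, V):
    """A maximizing action in each state; applied to V*, this is an optimal policy."""
    return np.argmax(q_from_v(mdp, V), axis=1)
```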
This Lecture
• Markov Decision Processes (MDPs)
• Planning
• Learning
• Multi-armed bandit problem
Known Model
Setting: the environment model is known.
Problem: find the optimal policy.
Algorithms:
• value iteration.
• policy iteration.
• linear programming.
Value Iteration Algorithm
$\Phi(V)(s) = \max_{a \in A} \big\{ \mathbb{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V(s') \big\}$, that is, $\Phi(V) = \max_\pi \{ \mathbf{R}_\pi + \gamma \mathbf{P}_\pi V \}$.

ValueIteration($V_0$)
1  $V \leftarrow V_0$   ($V_0$ arbitrary value)
2  while $\|V - \Phi(V)\| \ge \frac{(1 - \gamma)\epsilon}{\gamma}$ do
3      $V \leftarrow \Phi(V)$
4  return $\Phi(V)$
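A direct translation of this scheme might look like the sketch below (again assuming the FiniteMDP container introduced earlier). The stopping test mirrors the $(1-\gamma)\epsilon/\gamma$ threshold of the pseudocode, and the value returned is $\Phi(V)$.

```python
def value_iteration(mdp, V0=None, eps=1e-6):
    """Iterate V <- Phi(V) until ||V - Phi(V)||_inf < (1 - gamma) * eps / gamma."""
    V = np.zeros(mdp.n_states) if V0 is None else np.asarray(V0, dtype=float)
    threshold = (1.0 - mdp.gamma) * eps / mdp.gamma
    while True:
        # Phi(V)(s) = max_a { E[r(s,a)] + gamma * sum_{s'} Pr[s'|s,a] V(s') }
        Q = mdp.R + mdp.gamma * np.einsum("ast,t->sa", mdp.P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < threshold:
            return V_new          # this is Phi(V) for the last V
        V = V_new
```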
VI Algorithm - Convergence
Theorem: for any initial value $V_0$, the sequence defined by $V_{n+1} = \Phi(V_n)$ converges to $V^*$.
Proof: we show that $\Phi$ is $\gamma$-contracting for $\|\cdot\|_\infty$, which implies the existence and uniqueness of a fixed point for $\Phi$.
• for any $s \in S$, let $a^*(s)$ be the maximizing action defining $\Phi(V)(s)$. Then, for $s \in S$ and any $U$,
$\Phi(V)(s) - \Phi(U)(s) \le \Phi(V)(s) - \big( \mathbb{E}[r(s, a^*(s))] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a^*(s)]\, U(s') \big)$
$= \gamma \sum_{s' \in S} \Pr[s' \mid s, a^*(s)]\, [V(s') - U(s')]$
$\le \gamma \sum_{s' \in S} \Pr[s' \mid s, a^*(s)]\, \|V - U\|_\infty = \gamma\, \|V - U\|_\infty$.
Complexity and Optimality
Complexity: convergence in $O(\log \frac{1}{\epsilon})$ iterations. Observe that
$\|V_{n+1} - V_n\|_\infty \le \gamma\, \|V_n - V_{n-1}\|_\infty \le \gamma^n\, \|\Phi(V_0) - V_0\|_\infty$.
Thus, $\gamma^n\, \|\Phi(V_0) - V_0\|_\infty \le \frac{(1 - \gamma)\epsilon}{\gamma}$ holds after $n = O\big(\log \frac{1}{\epsilon}\big)$ iterations.
$\epsilon$-Optimality: let $V_{n+1}$ be the value returned. Then,
$\|V^* - V_{n+1}\|_\infty \le \|V^* - \Phi(V_{n+1})\|_\infty + \|\Phi(V_{n+1}) - V_{n+1}\|_\infty \le \gamma\, \|V^* - V_{n+1}\|_\infty + \gamma\, \|V_{n+1} - V_n\|_\infty$.
Thus, $\|V^* - V_{n+1}\|_\infty \le \frac{\gamma}{1 - \gamma}\, \|V_{n+1} - V_n\|_\infty \le \epsilon$.
VI Algorithm - Example
[State diagram with two states, 1 and 2; edges labeled action/[probability, reward]: a/[3/4, 2], a/[1/4, 2], b/[1, 2], c/[1, 2], d/[1, 3].]
$V_{n+1}(1) = \max\big\{ 2 + \gamma\big(\tfrac{3}{4} V_n(1) + \tfrac{1}{4} V_n(2)\big),\ 2 + \gamma V_n(2) \big\}$
$V_{n+1}(2) = \max\big\{ 3 + \gamma V_n(1),\ 2 + \gamma V_n(2) \big\}$.
For $V_0(1) = -1$, $V_0(2) = 1$, $\gamma = 1/2$: $V_1(1) = V_1(2) = 5/2$. But $V^*(1) = 14/3$, $V^*(2) = 16/3$.
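As a check, the example can be run through the value_iteration sketch above. The encoding below (pairing actions a/d and b/c into two action slots of the hypothetical FiniteMDP container) is an assumption made to fit the tabular layout; it reproduces exactly the two update equations of the slide.

```python
# Two states (indices 0 and 1 for states 1 and 2) and two action slots per state:
# slot 0 = {a in state 1, d in state 2}, slot 1 = {b in state 1, c in state 2}.
P = np.array([
    [[3/4, 1/4],    # a from state 1
     [1.0, 0.0]],   # d from state 2
    [[0.0, 1.0],    # b from state 1
     [0.0, 1.0]],   # c from state 2
])
R = np.array([[2.0, 2.0],    # rewards of a, b in state 1
              [3.0, 2.0]])   # rewards of d, c in state 2
example = FiniteMDP(P, R, gamma=0.5)

V = value_iteration(example, V0=[-1.0, 1.0], eps=1e-10)
print(V)   # approx. [4.6667, 5.3333], i.e. [14/3, 16/3]
```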
Policy Iteration Algorithm

PolicyIteration($\pi_0$)
1  $\pi \leftarrow \pi_0$   ($\pi_0$ arbitrary policy)
2  $\pi' \leftarrow$ nil
3  while ($\pi \ne \pi'$) do
4      $V \leftarrow V_\pi$   (policy evaluation: solve $(\mathbf{I} - \gamma \mathbf{P}_\pi)\mathbf{V} = \mathbf{R}_\pi$)
5      $\pi' \leftarrow \pi$
6      $\pi \leftarrow \operatorname{argmax}_\pi \{ \mathbf{R}_\pi + \gamma \mathbf{P}_\pi \mathbf{V} \}$   (greedy policy improvement)
7  return $\pi$
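A compact rendering of this loop, reusing the hypothetical evaluate_policy and greedy_policy sketches above (an illustration rather than the lecture's own implementation):

```python
def policy_iteration(mdp, policy0=None):
    """Alternate exact policy evaluation and greedy improvement until the policy is stable."""
    policy = (np.zeros(mdp.n_states, dtype=int)
              if policy0 is None else np.asarray(policy0, dtype=int))
    previous = None
    while previous is None or not np.array_equal(policy, previous):
        V = evaluate_policy(mdp, policy)     # solve (I - gamma P_pi) V = R_pi
        previous = policy
        policy = greedy_policy(mdp, V)       # argmax_a Q(s, a)
    return policy, V
```

On the two-state example above, this should return the policy that plays b in state 1 and d in state 2, whose value matches $V^* = (14/3, 16/3)$.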