Reinforcement Learning: Approximate Dynamic Programming
Decision Making Under Uncertainty, Chapter 10
Christos Dimitrakakis (Chalmers)
November 21, 2013
1 Introduction
    Error bounds
    Features
2 Approximate policy iteration
    Estimation building blocks
    The value estimation step
    Policy estimation
    Rollout-based policy iteration methods
    Least Squares Methods
3 Approximate Value Iteration
    Approximate backwards induction
    State aggregation
    Representative states
Introduction

Definition 1 (u-greedy policy and value function)
\[ \pi^*_u \in \arg\max_{\pi} \mathcal{L}_{\pi} u, \qquad v^*_u = \mathcal{L} u, \tag{1.1} \]
where $\pi : S \to D(A)$ maps from states to action distributions.

Parametric value function estimation
\[ V_\Theta = \{ v_\theta \mid \theta \in \Theta \}, \qquad \theta^* \in \arg\min_{\theta \in \Theta} \| v_\theta - u \|_\phi, \tag{1.2} \]
where $\| \cdot \|_\phi \triangleq \int_S | \cdot | \,\mathrm{d}\phi$.

Parametric policy estimation
\[ \Pi_\Theta = \{ \pi_\theta \mid \theta \in \Theta \}, \qquad \theta^* \in \arg\min_{\theta \in \Theta} \| \pi_\theta - \pi^*_u \|_\phi, \tag{1.3} \]
where $\pi^*_u = \arg\max_{\pi \in \Pi} \mathcal{L}_\pi u$.
Introduction: Error bounds

Theorem 2
Consider a finite MDP $\mu$ with discount factor $\gamma < 1$ and a vector $u \in V$ such that $\| u - V^*_\mu \|_\infty = \epsilon$. If $\pi$ is the $u$-greedy policy, then
\[ \| V^\pi_\mu - V^*_\mu \|_\infty \le \frac{2 \gamma \epsilon}{1 - \gamma} . \]
In addition, there exists $\epsilon_0 > 0$ such that if $\epsilon < \epsilon_0$, then $\pi$ is optimal.
Introduction: Features

Feature mapping $f : S \times A \to X$. For $X \subset \mathbb{R}^n$, the feature mapping can be written in vector form:
\[ f(s, a) = ( f_1(s, a), \ldots, f_n(s, a) )^\top . \tag{1.4} \]

Example 3 (Radial Basis Functions)
Let $d$ be a metric on $S \times A$ and let $\{ (s_i, a_i) \mid i = 1, \ldots, n \}$ be a set of centres. Then we define each element of $f$ as:
\[ f_i(s, a) \triangleq \exp\{ - d[(s, a), (s_i, a_i)] \} . \tag{1.5} \]
These functions are sometimes called kernels.

Example 4 (Tilings)
Let $G = \{ X_1, \ldots, X_n \}$ be a partition of $S \times A$ of size $n$. Then:
\[ f_i(s, a) \triangleq \mathbb{I}\{ (s, a) \in X_i \} . \tag{1.6} \]
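To make the two constructions concrete, the sketch below computes radial basis and tiling features for a vector encoding of state-action pairs. The encoding, the default Euclidean metric and the function names are illustrative assumptions, not part of the slides.

```python
# Illustrative feature constructions for Examples 3 and 4 (names are assumptions).
import numpy as np

def rbf_features(sa, centres, metric=None):
    """Radial basis features f_i(s,a) = exp(-d[(s,a), (s_i,a_i)]), eq. (1.5)."""
    if metric is None:
        metric = lambda x, y: np.linalg.norm(x - y)  # assumed default: Euclidean metric
    return np.array([np.exp(-metric(sa, c)) for c in centres])

def tiling_features(sa, cells):
    """Tiling features f_i(s,a) = I{(s,a) in X_i}, eq. (1.6).
    `cells` is a list of membership predicates that form a partition."""
    return np.array([1.0 if cell(sa) else 0.0 for cell in cells])

# Usage with a 2-dimensional encoding of (state, action):
centres = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
print(rbf_features(np.array([0.5, 0.5]), centres))
cells = [lambda x: x[0] < 0.5, lambda x: x[0] >= 0.5]
print(tiling_features(np.array([0.7, 0.2]), cells))
```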
Approximate policy iteration

Algorithm 1 Generic approximate policy iteration algorithm
input: initial value function $v_0$, approximate Bellman operator $\hat{\mathcal{L}}$, approximate value estimator $\hat{V}$.
for $k = 1, \ldots$ do
    $\pi_k = \arg\min_{\pi \in \hat{\Pi}} \| \hat{\mathcal{L}}_\pi v_{k-1} - \hat{\mathcal{L}} v_{k-1} \|$   // policy improvement
    $v_k = \arg\min_{v \in \hat{V}} \| v - V^{\pi_k}_\mu \|$   // policy evaluation
end for
Approximate policy iteration: Theoretical guarantees

Assumption 1
Consider a discounted problem with discount factor $\gamma$ and iterates $v_k, \pi_k$ such that, for all $k$:
\[ \| v_k - V^{\pi_k} \|_\infty \le \epsilon, \tag{2.1} \]
\[ \| \mathcal{L}_{\pi_{k+1}} v_k - \mathcal{L} v_k \|_\infty \le \delta . \tag{2.2} \]

Theorem 5 ([6], Proposition 6.2)
Under Assumption 1,
\[ \limsup_{k \to \infty} \| V^{\pi_k} - V^* \|_\infty \le \frac{\delta + 2 \gamma \epsilon}{(1 - \gamma)^2} . \tag{2.3} \]
Approximate policy iteration: Estimation building blocks
Lookahead policies

Single-step lookahead
\[ \pi_q(a \mid i) > 0 \quad \text{iff} \quad a \in \arg\max_{a' \in A} q(i, a'), \tag{2.4} \]
\[ q(i, a) \triangleq r_\mu(i, a) + \gamma \sum_{j \in S} P_\mu(j \mid i, a) u(j) . \tag{2.5} \]

$T$-step lookahead
\[ \pi(i; q_T) = \arg\max_{a \in A} q_T(i, a), \tag{2.6} \]
where $q_k$ and $u_k$ are recursively defined as:
\[ q_k(i, a) = r_\mu(i, a) + \gamma \sum_{j \in S} P_\mu(j \mid i, a) u_{k-1}(j), \tag{2.7} \]
\[ u_k(i) = \max\{ q_k(i, a) \mid a \in A \}, \tag{2.8} \]
and $u_0 = u$.
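As a concrete illustration, the sketch below computes the $T$-step lookahead q-values and the corresponding greedy policy for a small finite MDP given as arrays. The array layout and variable names are assumptions for the example.

```python
# Illustrative T-step lookahead (eqs. 2.4-2.8) for a finite MDP given as arrays.
import numpy as np

def lookahead(P, r, u, gamma, T=1):
    """P: (S, A, S) transition probabilities, r: (S, A) rewards, u: (S,) value estimate.
    Returns q_T (S, A) and the corresponding greedy policy (S,)."""
    q = r + gamma * np.einsum('saj,j->sa', P, u)       # q_1 from u_0 = u
    for _ in range(T - 1):
        u = q.max(axis=1)                              # eq. (2.8)
        q = r + gamma * np.einsum('saj,j->sa', P, u)   # eq. (2.7)
    return q, q.argmax(axis=1)                         # eqs. (2.6) / (2.4)

# A 2-state, 2-action example with u_0 = 0:
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])
q1, pi1 = lookahead(P, r, u=np.zeros(2), gamma=0.9, T=1)
qT, piT = lookahead(P, r, u=np.zeros(2), gamma=0.9, T=3)
```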
Approximate policy iteration: Estimation building blocks
Rollout policies

Rollout estimate of the q-factor
\[ q(i, a) = \frac{1}{K_i} \sum_{k=1}^{K_i} \sum_{t=0}^{T_k - 1} r(s_{t,k}, a_{t,k}), \]
where $s_{t,k}, a_{t,k} \sim P^\pi_\mu(\cdot \mid s_0 = i, a_0 = a)$ and $T_k \sim \mathrm{Geom}(1 - \gamma)$.

Rollout policy estimation. Given a set of samples $q(i, a)$ for $i \in \hat{S}$, we estimate
\[ \min_\theta \| \pi_\theta - \pi^*_q \|_\phi \]
for some $\phi$ on $\hat{S}$.
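A minimal sketch of the rollout estimator follows. It assumes access to a simulator `step(s, a, rng)` returning a reward and next state, and a stochastic policy `pi(s, rng)`; both interfaces are assumptions for the example.

```python
# Illustrative rollout estimate of q(i, a) with geometric horizons T_k ~ Geom(1 - gamma).
import numpy as np

def rollout_q(i, a, step, pi, gamma, K=100, rng=None):
    """step(s, a, rng) -> (reward, next_state) and pi(s, rng) -> action are
    assumed simulator/policy interfaces."""
    rng = np.random.default_rng() if rng is None else rng
    total = 0.0
    for _ in range(K):
        T = rng.geometric(1.0 - gamma)   # random horizon T_k ~ Geom(1 - gamma)
        s, act = i, a                    # first transition uses the queried action a
        for _ in range(T):
            rew, s = step(s, act, rng)   # simulate one transition
            total += rew
            act = pi(s, rng)             # follow pi after the first action
    return total / K                     # average undiscounted return over K rollouts
```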
Approximate policy iteration: The value estimation step

Generalised linear model using features (or kernel)
Feature mapping $f : S \to \mathbb{R}^n$, parameters $\theta \in \mathbb{R}^n$:
\[ v_\theta(s) = \sum_{i=1}^n \theta_i f_i(s) . \tag{2.9} \]

Fitting a value function.
\[ c_s(\theta) = \phi(s) \| v_\theta(s) - v(s) \|_p^\kappa, \qquad c(\theta) = \sum_{s \in \hat{S}} c_s(\theta) . \tag{2.10} \]

Example 6
The case $p = 2$, $\kappa = 2$:
\[ \theta'_j = \theta_j - 2 \alpha \phi(s) [ v_\theta(s) - v(s) ] f_j(s) . \tag{2.11} \]
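A sketch of the stochastic gradient update (2.11) for the case $p = 2$, $\kappa = 2$; the dataset, weights and step size below are placeholders.

```python
# Illustrative gradient fit of a linear value function using update (2.11).
import numpy as np

def fit_value(features, targets, weights, alpha=0.05, epochs=200):
    """features: (N, n) rows f(s); targets: (N,) values v(s); weights: (N,) phi(s)."""
    theta = np.zeros(features.shape[1])
    for _ in range(epochs):
        for f_s, v_s, phi_s in zip(features, targets, weights):
            err = theta @ f_s - v_s                  # v_theta(s) - v(s)
            theta -= 2 * alpha * phi_s * err * f_s   # eq. (2.11)
    return theta

# Usage on toy data:
rng = np.random.default_rng(0)
F = rng.normal(size=(20, 3))
v = F @ np.array([1.0, -0.5, 2.0]) + 0.01 * rng.normal(size=20)
theta_hat = fit_value(F, v, np.ones(20))
```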
Approximate policy iteration: Policy estimation

Generalised linear model using features (or kernel).
Feature mapping $f : S \times A \to \mathbb{R}^n$, parameters $\theta \in \mathbb{R}^n$:
\[ \pi_\theta(a \mid s) = \frac{g(s, a)}{h(s)}, \qquad g(s, a) = \sum_{i=1}^n \theta_i f_i(s, a), \qquad h(s) = \sum_{b \in A} g(s, b) . \tag{2.12} \]

Fitting a policy through a cost function.
\[ c_s(\theta) = \phi(s) \| \pi_\theta(\cdot \mid s) - \pi(\cdot \mid s) \|_p^\kappa, \qquad c(\theta) = \sum_{s \in \hat{S}} c_s(\theta) . \tag{2.13} \]

The case $p = 1$, $\kappa = 1$:
\[ \theta'_j = \theta_j - \alpha \phi(s) \Big[ \pi_\theta(a \mid s) \sum_{b \in A} f_j(s, b) - f_j(s, a) \Big] . \tag{2.14} \]
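A sketch of the normalised linear policy (2.12) and one step of the update (2.14). It assumes $g(s, a) > 0$ (for example nonnegative features and weights) so that the normalisation yields a distribution; all names are illustrative.

```python
# Illustrative normalised linear policy (2.12) and one update step of (2.14).
import numpy as np

def policy_probs(theta, feats_sa):
    """feats_sa: (|A|, n) features f(s, a) for each action at a fixed state s.
    Assumes g(s, a) > 0 so that the normalisation gives a distribution."""
    g = feats_sa @ theta          # g(s, a) = sum_i theta_i f_i(s, a)
    return g / g.sum()            # pi_theta(a | s) = g(s, a) / h(s)

def policy_update(theta, feats_sa, a, phi_s=1.0, alpha=0.1):
    """One step of (2.14), pushing pi_theta(. | s) towards the target action a."""
    pi = policy_probs(theta, feats_sa)
    grad = pi[a] * feats_sa.sum(axis=0) - feats_sa[a]   # bracket in (2.14)
    return theta - alpha * phi_s * grad

# Usage with nonnegative random features for 3 actions:
rng = np.random.default_rng(0)
feats = rng.uniform(size=(3, 4))
theta = np.ones(4)
theta = policy_update(theta, feats, a=2)
```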
Approximate policy iteration: Rollout-based policy iteration methods

Algorithm 2 Rollout Sampling Approximate Policy Iteration
for $k = 1, \ldots$ do
    Select a set of representative states $\hat{S}_k$.
    for $n = 1, \ldots$ do
        Select a state $s_n \in \hat{S}_k$ maximising $U_n(s)$ and perform a rollout.
        If $\hat{a}^*(s_n)$ is optimal w.p. $1 - \delta$, put $s_n$ in $\hat{S}_k(\delta)$ and remove it from $\hat{S}_k$.
    end for
    Calculate $q_k \approx Q^{\pi_k}$ from the rollouts.
    Train a classifier $\pi_{\theta_{k+1}}$ on the set of states $\hat{S}_k(\delta)$ with actions $\hat{a}^*(s)$.
end for
Approximate policy iteration: Least Squares Methods
Least squares value estimation

Projection. Setting $v = \Phi \theta$, where $\Phi$ is a feature matrix and $\theta$ is a parameter vector, we have
\[ \Phi \theta = r + \gamma P_{\mu,\pi} \Phi \theta, \tag{2.15} \]
\[ \theta = [ (I - \gamma P_{\mu,\pi}) \Phi ]^{-1} r . \tag{2.16} \]
Replacing the inverse with the pseudo-inverse, with $A = (I - \gamma P_{\mu,\pi}) \Phi$:
\[ \tilde{A}^{-1} \triangleq A^\top ( A A^\top )^{-1} . \]

Empirical constructions. Given a set of data points $\{ (s_i, a_i, r_i, s'_i) \mid i = 1, \ldots, n \}$, which may not be consecutive, we define:
1. $r = (r_i)_i$.
2. $\Phi_i = f(s_i, a_i)$, $\Phi = (\Phi_i)_i$.
3. $P_{\mu,\pi} = P_\mu P_\pi$, with $P_{\mu,\pi}(i, j) = \mathbb{I}\{ j = i + 1 \}$.
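The sketch below builds the empirical quantities from a trajectory of consecutive samples, using the shift matrix $P_{\mu,\pi}(i, j) = \mathbb{I}\{ j = i + 1 \}$, and solves (2.16) with a pseudo-inverse (via least squares). The data is a random placeholder, only meant to show the shapes involved.

```python
# Illustrative empirical least-squares value estimation via eq. (2.16).
import numpy as np

def ls_value_params(Phi, r, gamma):
    n = Phi.shape[0]
    P = np.eye(n, k=1)                            # P_{mu,pi}(i, j) = I{j = i + 1}
    A = (np.eye(n) - gamma * P) @ Phi             # A = (I - gamma P_{mu,pi}) Phi
    theta, *_ = np.linalg.lstsq(A, r, rcond=None) # pseudo-inverse in place of A^{-1}
    return theta

rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 3))   # Phi_i = f(s_i, a_i) for consecutive samples
r = rng.normal(size=50)          # r = (r_i)_i
theta = ls_value_params(Phi, r, gamma=0.95)
```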
Approximate policy iteration: Least Squares Methods

Algorithm 3 LSTDQ - Least Squares Temporal Differences on q-factors
input: data $D = \{ (s_i, a_i, r_i, s'_i) \mid i = 1, \ldots, n \}$, feature mapping $f$, policy $\pi$
    $\theta = [ (I - \gamma P_{\mu,\pi}) \Phi ]^{-1} r$

Algorithm 4 LSPI - Least Squares Policy Iteration
input: data $D = \{ (s_i, a_i, r_i, s'_i) \mid i = 1, \ldots, n \}$, feature mapping $f$
Set $\pi_0$ arbitrarily.
for $k = 1, \ldots$ do
    $\theta_k = \mathrm{LSTDQ}(D, f, \pi_{k-1})$.
    $\pi_k = \pi^*_{\Phi \theta_k}$, the greedy policy with respect to $\Phi \theta_k$.
end for
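As an illustration, here is a tabular instantiation of LSTDQ and the LSPI loop with one-hot state-action features. It uses the common sample-based normal-equation form of LSTDQ rather than the pseudo-inverse expression above; the data format, the fixed iteration count and the toy dataset are assumptions.

```python
# Illustrative LSTDQ and LSPI (Algorithms 3 and 4) with one-hot features.
import numpy as np

def lstdq(D, n_states, n_actions, policy, gamma):
    """D: list of (s, a, r, s') tuples; policy: (S,) array of greedy actions.
    With one-hot features f(s, a), theta is effectively a tabular q estimate."""
    d = n_states * n_actions
    A = np.zeros((d, d))
    b = np.zeros(d)
    for s, a, r, s_next in D:
        phi = np.zeros(d); phi[s * n_actions + a] = 1.0
        phi_next = np.zeros(d); phi_next[s_next * n_actions + policy[s_next]] = 1.0
        A += np.outer(phi, phi - gamma * phi_next)
        b += r * phi
    return np.linalg.lstsq(A, b, rcond=None)[0]

def lspi(D, n_states, n_actions, gamma, iters=20):
    policy = np.zeros(n_states, dtype=int)          # pi_0: arbitrary
    for _ in range(iters):
        theta = lstdq(D, n_states, n_actions, policy, gamma)
        q = theta.reshape(n_states, n_actions)      # q = Phi theta with one-hot features
        policy = q.argmax(axis=1)                   # pi_k: greedy w.r.t. Phi theta_k
    return policy, theta

# Usage on a toy dataset of transitions (s, a, r, s'):
D = [(0, 0, 1.0, 1), (1, 1, 0.0, 0), (0, 1, 0.5, 0), (1, 0, 2.0, 1)]
pi, theta = lspi(D, n_states=2, n_actions=2, gamma=0.9)
```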
Approximate Value Iteration: Approximate backwards induction

\[ V^*_t(s) = \max_{a \in A} \{ r(s, a) + \gamma \mathbb{E}_\mu( V^*_{t+1} \mid s_t = s, a_t = a ) \} \tag{3.1} \]

Iterative approximation
\[ \hat{V}_t(s) = \max_{a \in A} \Big[ r(s, a) + \gamma \sum_{s'} P_\mu(s' \mid s, a) v_{t+1}(s') \Big], \tag{3.2} \]
\[ v_t = \arg\min_{v \in V} \| v - \hat{V}_t \| . \tag{3.3} \]

Online gradient estimation
\[ \theta_{t+1} = \theta_t - \alpha_t \nabla_\theta \| v_t - \hat{V}_t \| . \tag{3.4} \]
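A sketch of the iterative approximation (3.2)-(3.3), fitting a linear value function by least squares at each stage; the MDP arrays and features are illustrative placeholders.

```python
# Illustrative approximate backwards induction with a linear value function.
import numpy as np

def approx_backwards_induction(P, r, Phi, gamma, horizon):
    """P: (S, A, S), r: (S, A), Phi: (S, n) state features."""
    theta = np.zeros(Phi.shape[1])                             # parameters of v_{t+1}
    for _ in range(horizon):
        v_next = Phi @ theta                                   # v_{t+1}
        V_hat = (r + gamma * np.einsum('saj,j->sa', P, v_next)).max(axis=1)  # eq. (3.2)
        theta, *_ = np.linalg.lstsq(Phi, V_hat, rcond=None)    # eq. (3.3), least-squares fit
    return Phi @ theta

# Example with a small random MDP and simple polynomial features.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=(4, 2))   # (S=4, A=2, S=4) transition kernel
r = rng.uniform(size=(4, 2))
Phi = np.column_stack([np.ones(4), np.arange(4), np.arange(4) ** 2])
print(approx_backwards_induction(P, r, Phi, gamma=0.9, horizon=30))
```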
Approximate Value Iteration: State aggregation

Aggregated estimate. Let $G = \{ S_0, S_1, \ldots, S_n \}$ be a partition of $S$, with $S_0 = \emptyset$, let $\theta \in \mathbb{R}^n$ and let $f_k(s_t) = \mathbb{I}\{ s_t \in S_k \}$. Then the approximate value function is
\[ v(s) = \theta(k), \quad \text{if } s \in S_k, \ k \ne 0 . \tag{3.5} \]

Online gradient estimate. Consider the case $\| \cdot \| = \| \cdot \|_2^2$. For $s_t \in S_k$:
\[ \theta_{t+1}(k) = (1 - \alpha) \theta_t(k) + \alpha \max_{a \in A} \Big[ r(s_t, a) + \gamma \sum_j P(j \mid s_t, a) v_t(j) \Big] . \tag{3.6} \]
For $s_t \notin S_k$:
\[ \theta_{t+1}(k) = \theta_t(k) . \tag{3.7} \]
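A sketch of the aggregated estimate (3.5) and the online update (3.6)-(3.7) on a small random MDP; the partition, data and step size are illustrative.

```python
# Illustrative state aggregation: value estimate (3.5) and update (3.6)-(3.7).
import numpy as np

def aggregate_update(theta, s_t, cell_of, P, r, gamma, alpha=0.1):
    """cell_of: (S,) array mapping each state to its cell index k >= 1.
    Only the cell containing s_t is updated (3.6); all other cells are unchanged (3.7)."""
    v = theta[cell_of - 1]                        # v(s) = theta(k) for s in S_k, eq. (3.5)
    target = (r[s_t] + gamma * P[s_t] @ v).max()  # bracket in eq. (3.6)
    k = cell_of[s_t]
    theta = theta.copy()
    theta[k - 1] = (1 - alpha) * theta[k - 1] + alpha * target
    return theta

# Example: 6 states aggregated into 3 cells, 2 actions.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(6), size=(6, 2))   # (S, A, S)
r = rng.uniform(size=(6, 2))
cell_of = np.array([1, 1, 2, 2, 3, 3])
theta = np.zeros(3)
for s_t in rng.integers(0, 6, size=1000):
    theta = aggregate_update(theta, s_t, cell_of, P, r, gamma=0.9)
print(theta)
```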