Direct Gradient-Based Reinforcement Learning Jonathan Baxter Research School of Information Sciences and Engineering Australian National University http://csl.anu.edu.au/~jon Joint work with Peter Bartlett and Lex Weaver December 5, 1999
1 Reinforcement Learning Models an agent interacting with its environment. 1. Agent receives information about its state. 2. Agent chooses an action or control based on that state information. 3. Agent receives a reward. 4. State is updated. 5. Go to 1.
2 Reinforcement Learning • Goal: Adjust agent’s behaviour to maximize long-term average reward. • Key Assumption: state transitions are Markov.
3 Chess • State: Board position. • Control: Move pieces. • State Transitions: My move, followed by opponent’s move. • Reward: Win, draw, or lose.
4 Call Admission Control Telecomms carrier selling bandwidth: queueing problem. • State: Mix of call types on channel. • Control: Accept calls of certain type. • State Transitions: Calls finish. New calls arrive. • Reward: Revenue from calls accepted.
5 Cleaning Robot • State: Robot and environment (position, velocity, dust levels, . . . ). • Control: Actions available to robot. • State Transitions: depend on dynamics of robot and statistics of environment. • Reward: Pick up rubbish, don’t damage the furniture.
6 Summary Previous approaches: • Dynamic Programming can find optimal policies in small state spaces. • Approximate Value-Function based approaches currently the method of choice in large state spaces. • Numerous practical successes, BUT • Policy performance can degrade at each step.
7 Summary Alternative Approach: • Policy parameters θ ∈ R^K, performance η(θ). • Compute ∇η(θ) and step uphill (gradient ascent). • Previous algorithms relied on an accurate reward baseline or recurrent states.
8 Summary Our Contribution: • Approximation ∇_β η(θ) to ∇η(θ). • Parameter β ∈ [0, 1) related to the mixing time of the problem. • Algorithm to approximate ∇_β η(θ) via simulation (POMDPG). • Line search in the presence of noise.
9 Partially Observable Markov Decision Processes (POMDPs)
States: S = {1, 2, ..., n}, X_t ∈ S.
Observations: Y = {1, 2, ..., M}, Y_t ∈ Y.
Actions or Controls: U = {1, 2, ..., N}, U_t ∈ U.
Observation process ν: Pr(Y_t = y | X_t = i) = ν_y(i).
Stochastic policy µ: Pr(U_t = u | Y_t = y) = µ_u(θ, y).
Rewards: r : S → R.
Adjustable parameters: θ ∈ R^K.
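A minimal way to make these ingredients concrete is to write them down as arrays for a toy problem. The sketch below is only an invented example (none of the numbers come from the talk): a POMDP with n = 3 states, M = 2 observations, N = 2 controls, and a softmax policy µ_u(θ, y) with one parameter per (observation, action) pair.

```python
import numpy as np

n, M, N = 3, 2, 2          # states, observations, controls
K = M * N                  # number of policy parameters theta

# p[u, i, j] = Pr(X_{t+1} = j | X_t = i, U_t = u)  -- invented numbers
p = np.array([
    [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]],   # action 0
    [[0.1, 0.6, 0.3], [0.3, 0.1, 0.6], [0.6, 0.3, 0.1]],   # action 1
])

# nu[i, y] = Pr(Y_t = y | X_t = i)  -- the observation process
nu = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])

# r[i] = reward received in state i
r = np.array([1.0, 0.0, -1.0])

def policy(theta, y):
    """Softmax policy: mu_u(theta, y) over the N controls, given observation y."""
    logits = theta.reshape(M, N)[y]
    e = np.exp(logits - logits.max())
    return e / e.sum()

theta = np.zeros(K)        # start from the uniform policy
print(policy(theta, y=0))  # -> [0.5 0.5]
```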
10 POMDP Transition Probabilities: Pr(X_{t+1} = j | X_t = i, U_t = u) = p_ij(u)
11 POMDP [Diagram: the environment, in state X_t, emits the observation Y_t via ν and the reward r(X_t); the agent’s policy µ maps Y_t to the control U_t, which acts back on the environment.]
12 The Induced Markov Chain • Transition probabilities: p_ij(θ) = Pr(X_{t+1} = j | X_t = i) = E_{y∼ν(X_t)} E_{u∼µ(θ,y)} [p_ij(u)]. • Transition matrix: P(θ) = [p_ij(θ)].
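Computationally, the induced chain just averages p_ij(u) over the observation and action distributions. A sketch, repeating the invented toy POMDP so it runs on its own:

```python
import numpy as np

# Toy POMDP from the earlier sketch (invented numbers), repeated so this runs alone.
p = np.array([
    [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]],
    [[0.1, 0.6, 0.3], [0.3, 0.1, 0.6], [0.6, 0.3, 0.1]],
])                                                     # p[u, i, j]
nu = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])    # nu[i, y]

def policy(theta, y):                                  # softmax over the 2 controls
    e = np.exp(theta.reshape(2, 2)[y])
    return e / e.sum()

def induced_P(theta):
    """P(theta)[i, j] = sum_y nu_y(i) sum_u mu_u(theta, y) p_ij(u)."""
    N, n, M = p.shape[0], p.shape[1], nu.shape[1]
    P = np.zeros((n, n))
    for i in range(n):
        for y in range(M):
            mu = policy(theta, y)
            for u in range(N):
                P[i] += nu[i, y] * mu[u] * p[u, i]
    return P

P = induced_P(np.zeros(4))
print(P.sum(axis=1))                                   # each row sums to 1
```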
13 Stationary Distributions q = [q_1 ··· q_n]' ∈ R^n is a distribution over states: X_t ∼ q ⇒ X_{t+1} ∼ q'P(θ). Definition: A probability distribution π ∈ R^n is a stationary distribution of the Markov chain if π'P(θ) = π'.
14 Stationary Distributions Convenient Assumption: For all values of the parameters θ, there is a unique stationary distribution π(θ). This implies the Markov chain mixes: for every starting state X_0, the distribution of X_t approaches π(θ). Inconvenient Assumption: The number of states n is “essentially infinite”. Meaning: forget about storing a number for each state, or inverting n × n matrices.
15 Measuring Performance • Average reward: η(θ) = Σ_{i=1}^{n} π_i(θ) r(i). • Goal: Find θ maximizing η(θ).
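Under the uniqueness assumption above, π(θ) for a small chain is the left eigenvector of P(θ) with eigenvalue 1, normalized to sum to one, and η(θ) is then the π-weighted average reward. A sketch with an invented 3-state transition matrix:

```python
import numpy as np

# A small induced transition matrix P(theta) and reward vector (invented numbers).
P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
r = np.array([1.0, 0.0, -1.0])

def stationary(P):
    """Return pi with pi' P = pi': the left eigenvector for eigenvalue 1, normalized."""
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return pi / pi.sum()

pi = stationary(P)
eta = pi @ r            # average reward: eta(theta) = sum_i pi_i(theta) r(i)
print(pi, eta)
```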
16 Summary • Partially Observable Markov Decision Processes. • Previous approaches: value function methods. • Direct gradient ascent • Approximating the gradient of the average reward. • Estimating the approximate gradient: POMDPG. • Line search in the presence of noise. • Experimental results.
17 Approximate Value Functions • Discount factor β ∈ [0, 1); discounted value of state i under policy µ: J^µ_β(i) = E_µ[ r(X_0) + β r(X_1) + β^2 r(X_2) + ··· | X_0 = i ]. • Idea: Choose a restricted class of value functions J̃(θ, i), θ ∈ R^K, i ∈ S (e.g. a neural network with parameters θ).
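For a chain small enough to write down, the discounted values solve the linear system J_β = r + βP J_β, i.e. J_β = (I − βP)^{-1} r; the “approximate” in the slide title refers to replacing this exact J^µ_β with a parametric J̃(θ, ·) when n is too large. A minimal sketch of the exact computation (matrix and rewards invented):

```python
import numpy as np

beta = 0.95
P = np.array([[0.8, 0.1, 0.1],        # induced transition matrix (invented)
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
r = np.array([1.0, 0.0, -1.0])

# J_beta = r + beta * P @ J_beta  =>  (I - beta * P) J_beta = r
J_beta = np.linalg.solve(np.eye(3) - beta * P, r)
print(J_beta)
```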
18 Policy Iteration Iterate: • Given policy µ, find an approximation J̃(θ, ·) to J^µ_β. • Many algorithms for finding θ: TD(λ), Q-learning, Bellman residuals, .... • Simulation and non-simulation based. • Generate a new policy µ' using J̃(θ, ·) (greedy step, sketched in code below): µ'_{u*}(θ, i) = 1 ⇔ u* = argmax_{u∈U} Σ_{j∈S} p_ij(u) J̃(θ, j).
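The greedy step in the last bullet can be written directly for a small, fully observed problem: for each state i, pick the action maximizing the expected approximate value of the next state. A sketch under invented numbers, with a fixed vector standing in for the fitted J̃:

```python
import numpy as np

# p[u, i, j] = Pr(next state j | state i, action u)  -- invented numbers
p = np.array([
    [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]],
    [[0.1, 0.6, 0.3], [0.3, 0.1, 0.6], [0.6, 0.3, 0.1]],
])
J_tilde = np.array([2.0, 0.5, -1.0])   # stand-in for the fitted value function

def greedy_policy(p, J):
    """mu'_{u*}(i) = 1  <=>  u* = argmax_u  sum_j p_ij(u) J(j)."""
    expected_next = np.einsum('uij,j->iu', p, J)   # shape (n_states, n_actions)
    return expected_next.argmax(axis=1)            # greedy action per state

print(greedy_policy(p, J_tilde))
```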
19 Approximate Value Functions • The Good: ⋆ Backgammon (world-champion), chess (International Master), job-shop scheduling, elevator control, .... ⋆ The notion of “backing-up” state values can be efficient. • The Bad: ⋆ Unless |J̃(θ, i) − J^µ_β(i)| = 0 for all states i, the new policy µ' can be a lot worse than the old one. ⋆ “Essentially infinite” state spaces mean we are likely to have very bad approximation error for some states.
20 Summary • Partially Observable Markov Decision Processes. • Previous approaches: value function methods. • Direct gradient ascent. • Approximating the gradient of the average reward. • Estimating the approximate gradient: POMDPG. • Line search in the presence of noise. • Experimental results.
21 Direct Gradient Ascent • Desideratum: Adjusting the agent’s parameters θ should improve its performance. • Implies: adjust the parameters in the direction of the gradient of the average reward: θ := θ + γ ∇η(θ).
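The update θ := θ + γ∇η(θ) is ordinary gradient ascent; in practice ∇η(θ) is replaced by an estimate such as the ∇_β η(θ) estimator described later. A hedged sketch of the outer loop, with the estimator left as a placeholder callable:

```python
import numpy as np

def gradient_ascent(theta0, grad_estimate, gamma=0.1, num_steps=100):
    """Repeatedly step uphill along a (possibly noisy) gradient estimate.

    grad_estimate(theta) is assumed to return an estimate of grad eta(theta),
    e.g. produced by simulating the POMDP for T steps (placeholder here).
    """
    theta = np.array(theta0, dtype=float)
    for _ in range(num_steps):
        theta += gamma * grad_estimate(theta)
    return theta

# Toy usage: pretend eta(theta) = -(theta - 1)^2, so the true gradient is known.
theta_star = gradient_ascent(np.zeros(1), lambda th: -2.0 * (th - 1.0))
print(theta_star)   # -> close to [1.0]
```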
22 Direct Gradient Ascent: Main Results 1. An algorithm to estimate the approximate gradient (∇_β η) from a sample path. 2. Accuracy of the approximation depends on a parameter of the algorithm (β): a bias/variance trade-off. 3. A line search algorithm using only gradient estimates.
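The line-search algorithm itself is not spelled out in these slides. One way to search using only (noisy) gradient estimates, in the spirit of result 3, is to step along the search direction until the estimated gradient’s projection onto that direction changes sign, which brackets a maximum. The sketch below is an illustrative reconstruction, not the talk’s algorithm:

```python
import numpy as np

def bracket_line_search(theta, direction, grad_estimate, step0=0.1, max_doublings=20):
    """Find an interval [a, b] along `direction` that brackets a local maximum.

    Uses only the sign of <grad_estimate, direction>: keep doubling the step
    while the projected gradient stays positive; stop once it turns negative.
    grad_estimate(theta) is a placeholder for a noisy gradient estimator.
    """
    a, step = 0.0, step0
    for _ in range(max_doublings):
        g = grad_estimate(theta + (a + step) * direction)
        if np.dot(g, direction) > 0:      # still uphill: extend the bracket
            a += step
            step *= 2.0
        else:                             # downhill: maximum lies in [a, a + step]
            return a, a + step
    return a, a + step

# Toy usage with a known quadratic eta(theta) = -(theta - 3)^2.
grad = lambda th: -2.0 * (th - 3.0)
print(bracket_line_search(np.zeros(1), np.ones(1), grad))   # bracket around 3.0
```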
23 Related Work Machine Learning: Williams’ REINFORCE algorithm (1992). • Gradient ascent algorithm for a restricted class of MDPs. • Requires an accurate reward baseline, i.i.d. transitions. Kimura et al., 1998: extension to the infinite-horizon setting. Discrete Event Systems: Algorithms that rely on recurrent states. MDPs: (Cao and Chen, 1997); POMDPs: (Marbach and Tsitsiklis, 1998). Control Theory: Direct adaptive control using derivatives (Hjalmarsson, Gunnarsson, Gevers, 1994), (Kammer, Bitmead, Bartlett, 1997), (DeBruyne, Anderson, Gevers, Linard, 1997).
24 Summary • Partially Observable Markov Decision Processes. • Previous approaches: value function methods. • Direct gradient ascent. • Approximating the gradient of the average reward. • Estimating the approximate gradient: POMDPG. • Line search in the presence of noise. • Experimental results.
25 Approximating the gradient Recall: for β ∈ [0, 1), the discounted value of state i is
J_β(i) = E[ r(X_0) + β r(X_1) + β^2 r(X_2) + ··· | X_0 = i ].
Vector notation: J_β = (J_β(1), ..., J_β(n)).
Theorem: For all β ∈ [0, 1),
∇η(θ) = β π'(θ) ∇P(θ) J_β + (1 − β) ∇π'(θ) J_β,
where the first term equals β ∇_β η(θ) (the quantity we estimate) and the second term → 0 as β → 1.
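For a chain small enough to handle exactly, the theorem can be checked numerically by computing both sides with finite differences in θ. The sketch below does this for a two-state chain whose transition matrix depends on a scalar θ (all specifics invented); the point is only that ∇η = βπ'∇P J_β + (1 − β)∇π' J_β holds.

```python
import numpy as np

beta, eps = 0.9, 1e-6
r = np.array([1.0, 0.0])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def P(theta):
    """A 2-state chain whose switching probability depends on theta (invented)."""
    a = sigmoid(theta)
    return np.array([[1 - a, a],
                     [0.3,   0.7]])

def stationary(P_):
    vals, vecs = np.linalg.eig(P_.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return pi / pi.sum()

def J(theta):
    return np.linalg.solve(np.eye(2) - beta * P(theta), r)

theta = 0.2
pi   = stationary(P(theta))
pi_p = stationary(P(theta + eps))
pi_m = stationary(P(theta - eps))
dP   = (P(theta + eps) - P(theta - eps)) / (2 * eps)   # finite-difference grad P
dpi  = (pi_p - pi_m) / (2 * eps)                       # finite-difference grad pi'
deta = (pi_p @ r - pi_m @ r) / (2 * eps)               # finite-difference grad eta

rhs = beta * pi @ dP @ J(theta) + (1 - beta) * dpi @ J(theta)
print(deta, rhs)   # the two numbers agree up to finite-difference error
```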
26 Mixing Times of Markov Chains • ℓ1-distance: if p, q are distributions on the states, ‖p − q‖_1 := Σ_{i=1}^{n} |p(i) − q(i)|. • d(t)-distance: let p_t(i) be the distribution over states at time t, starting from state i; d(t) := max_{i,j} ‖p_t(i) − p_t(j)‖_1. • Unique stationary distribution ⇒ d(t) → 0.
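For a small chain these quantities can be computed directly: p_t(i) is the i-th row of P^t, d(t) is the largest ℓ1-distance between rows, and the mixing time of the next slide is the first t at which d(t) drops below e^{-1}. A sketch with an invented transition matrix:

```python
import numpy as np

P = np.array([[0.8, 0.1, 0.1],        # invented induced transition matrix
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

def d(P, t):
    """d(t) = max_{i,j} || row_i(P^t) - row_j(P^t) ||_1."""
    Pt = np.linalg.matrix_power(P, t)
    return max(np.abs(Pt[i] - Pt[j]).sum()
               for i in range(len(P)) for j in range(len(P)))

def mixing_time(P, t_max=1000):
    """tau* = min { t : d(t) <= 1/e }."""
    for t in range(1, t_max + 1):
        if d(P, t) <= np.exp(-1):
            return t
    return None

print([round(d(P, t), 4) for t in range(1, 6)], mixing_time(P))
```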
27 Approximating the gradient Mixing time: τ* := min{ t : d(t) ≤ e^{-1} }. Theorem: For all β ∈ [0, 1) and θ ∈ R^K, ‖∇η(θ) − ∇_β η(θ)‖ ≤ constant × τ*(θ)(1 − β). That is, if 1/(1 − β) is large compared with the mixing time τ*(θ), then ∇_β η(θ) accurately approximates the gradient direction ∇η(θ).
28 Summary • Partially Observable Markov Decision Processes. • Previous approaches: value function methods. • Direct gradient ascent. • Approximating the gradient of the average reward. • Estimating the approximate gradient: POMDPG. • Line search in the presence of noise. • Experimental results.
29 Estimating ∇_β η(θ): POMDPG
Given: parameterized policies µ_u(θ, y) and β ∈ [0, 1):
1. Set z_0 = ∆_0 = 0 ∈ R^K.
2. for each observation y_t, control u_t, reward r(i_{t+1}) do
3. Set z_{t+1} = β z_t + ∇µ_{u_t}(θ, y_t) / µ_{u_t}(θ, y_t)   (eligibility trace)
4. Set ∆_{t+1} = ∆_t + (1/(t+1)) [ r(i_{t+1}) z_{t+1} − ∆_t ]
5. end for
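Below is a minimal simulation of steps 1–5 on the invented toy POMDP used earlier, with the same softmax policy (an assumption, not part of the talk); for the softmax, the ratio ∇µ/µ is just the gradient of log µ and has a closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, N = 3, 2, 2

# Toy POMDP (invented numbers): p[u, i, j], nu[i, y], r[i]
p = np.array([
    [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]],
    [[0.1, 0.6, 0.3], [0.3, 0.1, 0.6], [0.6, 0.3, 0.1]],
])
nu = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
r = np.array([1.0, 0.0, -1.0])

def mu(theta, y):                       # softmax policy over the N controls
    e = np.exp(theta.reshape(M, N)[y])
    return e / e.sum()

def grad_log_mu(theta, y, u):           # equals grad(mu_u)/mu_u for the softmax
    g = np.zeros((M, N))
    g[y] = -mu(theta, y)
    g[y, u] += 1.0
    return g.ravel()

def pomdpg(theta, beta=0.9, T=100_000):
    """Steps 1-5 of the slide: Delta_T estimates grad_beta eta(theta)."""
    z = np.zeros(theta.size)            # eligibility trace z_t
    delta = np.zeros(theta.size)        # running average Delta_t
    i = 0                               # start in an arbitrary state
    for t in range(T):
        y = rng.choice(M, p=nu[i])                   # observe y_t
        u = rng.choice(N, p=mu(theta, y))            # act u_t ~ mu(theta, y_t)
        i = rng.choice(n, p=p[u, i])                 # transition to i_{t+1}
        z = beta * z + grad_log_mu(theta, y, u)      # step 3
        delta += (r[i] * z - delta) / (t + 1)        # step 4
    return delta

print(pomdpg(np.zeros(M * N)))
```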
30 Convergence of POMDPG Theorem: For all β ∈ [0, 1) and θ ∈ R^K, ∆_t → ∇_β η(θ).
31 Explanation of POMDPG
The algorithm computes
∆_T = (1/T) Σ_{t=0}^{T−1} [ ∇µ_{u_t}(θ, y_t) / µ_{u_t}(θ, y_t) ] [ r(i_{t+1}) + β r(i_{t+2}) + ··· + β^{T−t−1} r(i_T) ],
where the bracketed sum of rewards is an estimate of the discounted value ‘due to’ the action u_t.
• ∇µ_{u_t}(θ, y_t) is the direction to increase the probability of the action u_t.
• It is weighted by something involving subsequent rewards, and
• divided by µ_{u_t}: this ensures “popular” actions don’t dominate.
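The batch form above is algebraically the same quantity the online updates accumulate; a short check on an invented trajectory (random vectors standing in for ∇µ/µ, random rewards) makes the rearrangement explicit.

```python
import numpy as np

rng = np.random.default_rng(1)
T, K, beta = 50, 4, 0.9
g = rng.normal(size=(T, K))     # stand-ins for grad(mu_{u_t})/mu_{u_t} at each step
rew = rng.normal(size=T)        # stand-ins for r(i_{t+1}), t = 0..T-1

# Online form: eligibility trace + running average (steps 3-4 of the algorithm).
z, delta = np.zeros(K), np.zeros(K)
for t in range(T):
    z = beta * z + g[t]
    delta += (rew[t] * z - delta) / (t + 1)

# Batch form from the slide: each g_t weighted by its discounted future rewards.
batch = np.zeros(K)
for t in range(T):
    discounted = sum(beta ** (s - t) * rew[s] for s in range(t, T))
    batch += g[t] * discounted
batch /= T

print(np.allclose(delta, batch))   # -> True
```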