Direct Gradient-Based Reinforcement Learning Jonathan Baxter Research School of Information Sciences and Engineering Australian National University http://csl.anu.edu.au/~jon Joint work with Peter Bartlett and Lex Weaver December 5, 1999
1 Reinforcement Learning Models an agent interacting with its environment. 1. Agent receives information about its state. 2. Agent chooses an action or control based on that state information. 3. Agent receives a reward. 4. State is updated. 5. Go to 1.
2 Reinforcement Learning • Goal: Adjust agent’s behaviour to maximize long-term average reward. • Key Assumption: state transitions are Markov.
3 Chess • State: Board position. • Control: Move pieces. • State Transitions: My move, followed by opponent’s move. • Reward: Win, draw, or lose.
4 Call Admission Control Telecomms carrier selling bandwidth: queueing problem. • State: Mix of call types on channel. • Control: Accept calls of certain type. • State Transitions: Calls finish. New calls arrive. • Reward: Revenue from calls accepted.
5 Cleaning Robot • State: Robot and environment (position, velocity, dust levels, . . . ). • Control: Actions available to robot. • State Transitions: depend on dynamics of robot and statistics of environment. • Reward: Pick up rubbish, don’t damage the furniture.
6 Summary Previous approaches: • Dynamic Programming can find optimal policies in small state spaces. • Approximate Value-Function based approaches currently the method of choice in large state spaces. • Numerous practical successes, BUT • Policy performance can degrade at each step.
7 Summary Alternative Approach: • Policy parameters θ ∈ R^K, performance η(θ). • Compute ∇η(θ) and step uphill (gradient ascent). • Previous algorithms relied on an accurate reward baseline or recurrent states.
8 Summary Our Contribution: • Approximation ∇_β η(θ) to ∇η(θ). • Parameter β ∈ [0, 1) related to the mixing time of the problem. • Algorithm to approximate ∇_β η(θ) via simulation (POMDPG). • Line search in the presence of noise.
9 Partially Observable Markov Decision Processes (POMDPs)
States: S = {1, 2, ..., n}, X_t ∈ S.
Observations: Y = {1, 2, ..., M}, Y_t ∈ Y.
Actions or Controls: U = {1, 2, ..., N}, U_t ∈ U.
Observation process ν: Pr(Y_t = y | X_t = i) = ν_y(i).
Stochastic policy µ: Pr(U_t = u | Y_t = y) = µ_u(θ, y).
Rewards: r : S → R.
Adjustable parameters: θ ∈ R^K.
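A minimal way to make these ingredients concrete is to write them down as arrays for a toy problem. The sketch below is only an invented example (none of the numbers come from the talk): a POMDP with n = 3 states, M = 2 observations, N = 2 controls, and a softmax policy µ_u(θ, y) with one parameter per (observation, action) pair.

```python
import numpy as np

n, M, N = 3, 2, 2          # states, observations, controls
K = M * N                  # number of policy parameters theta

# p[u, i, j] = Pr(X_{t+1} = j | X_t = i, U_t = u)  -- invented numbers
p = np.array([
    [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]],   # action 0
    [[0.1, 0.6, 0.3], [0.3, 0.1, 0.6], [0.6, 0.3, 0.1]],   # action 1
])

# nu[i, y] = Pr(Y_t = y | X_t = i)  -- the observation process
nu = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])

# r[i] = reward received in state i
r = np.array([1.0, 0.0, -1.0])

def policy(theta, y):
    """Softmax policy: mu_u(theta, y) over the N controls, given observation y."""
    logits = theta.reshape(M, N)[y]
    e = np.exp(logits - logits.max())
    return e / e.sum()

theta = np.zeros(K)        # start from the uniform policy
print(policy(theta, y=0))  # -> [0.5 0.5]
```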
10 POMDP Transition Probabilities: Pr(X_{t+1} = j | X_t = i, U_t = u) = p_ij(u)
11 POMDP [Diagram: the environment, in state X_t, emits the observation Y_t via ν and the reward r(X_t); the agent’s policy µ maps Y_t to the control U_t, which acts back on the environment.]
12 The Induced Markov Chain • Transition probabilities: p_ij(θ) = Pr(X_{t+1} = j | X_t = i) = E_{y∼ν(X_t)} E_{u∼µ(θ,y)} [p_ij(u)]. • Transition matrix: P(θ) = [p_ij(θ)].
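Computationally, the induced chain just averages p_ij(u) over the observation and action distributions. A sketch, repeating the invented toy POMDP so it runs on its own:

```python
import numpy as np

# Toy POMDP from the earlier sketch (invented numbers), repeated so this runs alone.
p = np.array([
    [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]],
    [[0.1, 0.6, 0.3], [0.3, 0.1, 0.6], [0.6, 0.3, 0.1]],
])                                                     # p[u, i, j]
nu = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])    # nu[i, y]

def policy(theta, y):                                  # softmax over the 2 controls
    e = np.exp(theta.reshape(2, 2)[y])
    return e / e.sum()

def induced_P(theta):
    """P(theta)[i, j] = sum_y nu_y(i) sum_u mu_u(theta, y) p_ij(u)."""
    N, n, M = p.shape[0], p.shape[1], nu.shape[1]
    P = np.zeros((n, n))
    for i in range(n):
        for y in range(M):
            mu = policy(theta, y)
            for u in range(N):
                P[i] += nu[i, y] * mu[u] * p[u, i]
    return P

P = induced_P(np.zeros(4))
print(P.sum(axis=1))                                   # each row sums to 1
```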
13 Stationary Distributions q = [q_1 ··· q_n]' ∈ R^n is a distribution over states: X_t ∼ q ⇒ X_{t+1} ∼ q'P(θ). Definition: A probability distribution π ∈ R^n is a stationary distribution of the Markov chain if π'P(θ) = π'.
14 Stationary Distributions Convenient Assumption: For all values of the parameters θ, there is a unique stationary distribution π(θ). This implies the Markov chain mixes: for every starting state X_0, the distribution of X_t approaches π(θ). Inconvenient Assumption: The number of states n is “essentially infinite”. Meaning: forget about storing a number for each state, or inverting n × n matrices.
15 Measuring Performance • Average reward: η(θ) = Σ_{i=1}^{n} π_i(θ) r(i). • Goal: Find θ maximizing η(θ).
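Under the uniqueness assumption above, π(θ) for a small chain is the left eigenvector of P(θ) with eigenvalue 1, normalized to sum to one, and η(θ) is then the π-weighted average reward. A sketch with an invented 3-state transition matrix:

```python
import numpy as np

# A small induced transition matrix P(theta) and reward vector (invented numbers).
P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
r = np.array([1.0, 0.0, -1.0])

def stationary(P):
    """Return pi with pi' P = pi': the left eigenvector for eigenvalue 1, normalized."""
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return pi / pi.sum()

pi = stationary(P)
eta = pi @ r            # average reward: eta(theta) = sum_i pi_i(theta) r(i)
print(pi, eta)
```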
16 Summary • Partially Observable Markov Decision Processes. • Previous approaches: value function methods. • Direct gradient ascent • Approximating the gradient of the average reward. • Estimating the approximate gradient: POMDPG. • Line search in the presence of noise. • Experimental results.
17 Approximate Value Functions • Discount factor β ∈ [0, 1); discounted value of state i under policy µ: J^µ_β(i) = E_µ[ r(X_0) + β r(X_1) + β^2 r(X_2) + ··· | X_0 = i ]. • Idea: Choose a restricted class of value functions J̃(θ, i), θ ∈ R^K, i ∈ S (e.g. a neural network with parameters θ).
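For a chain small enough to write down, the discounted values solve the linear system J_β = r + βP J_β, i.e. J_β = (I − βP)^{-1} r; the “approximate” in the slide title refers to replacing this exact J^µ_β with a parametric J̃(θ, ·) when n is too large. A minimal sketch of the exact computation (matrix and rewards invented):

```python
import numpy as np

beta = 0.95
P = np.array([[0.8, 0.1, 0.1],        # induced transition matrix (invented)
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
r = np.array([1.0, 0.0, -1.0])

# J_beta = r + beta * P @ J_beta  =>  (I - beta * P) J_beta = r
J_beta = np.linalg.solve(np.eye(3) - beta * P, r)
print(J_beta)
```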
18 Policy Iteration Iterate: • Given policy µ, find an approximation J̃(θ, ·) to J^µ_β. • Many algorithms for finding θ: TD(λ), Q-learning, Bellman residuals, .... • Simulation and non-simulation based. • Generate a new policy µ' using J̃(θ, ·) (greedy step, sketched in code below): µ'_{u*}(θ, i) = 1 ⇔ u* = argmax_{u∈U} Σ_{j∈S} p_ij(u) J̃(θ, j).
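The greedy step in the last bullet can be written directly for a small, fully observed problem: for each state i, pick the action maximizing the expected approximate value of the next state. A sketch under invented numbers, with a fixed vector standing in for the fitted J̃:

```python
import numpy as np

# p[u, i, j] = Pr(next state j | state i, action u)  -- invented numbers
p = np.array([
    [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]],
    [[0.1, 0.6, 0.3], [0.3, 0.1, 0.6], [0.6, 0.3, 0.1]],
])
J_tilde = np.array([2.0, 0.5, -1.0])   # stand-in for the fitted value function

def greedy_policy(p, J):
    """mu'_{u*}(i) = 1  <=>  u* = argmax_u  sum_j p_ij(u) J(j)."""
    expected_next = np.einsum('uij,j->iu', p, J)   # shape (n_states, n_actions)
    return expected_next.argmax(axis=1)            # greedy action per state

print(greedy_policy(p, J_tilde))
```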
19 Approximate Value Functions • The Good: ⋆ Backgammon (world-champion), chess (International Master), job-shop scheduling, elevator control, .... ⋆ The notion of “backing-up” state values can be efficient. • The Bad: ⋆ Unless |J̃(θ, i) − J^µ_β(i)| = 0 for all states i, the new policy µ' can be a lot worse than the old one. ⋆ “Essentially infinite” state spaces mean we are likely to have very bad approximation error for some states.
20 Summary • Partially Observable Markov Decision Processes. • Previous approaches: value function methods. • Direct gradient ascent. • Approximating the gradient of the average reward. • Estimating the approximate gradient: POMDPG. • Line search in the presence of noise. • Experimental results.
21 Direct Gradient Ascent • Desideratum: Adjusting the agent’s parameters θ should improve its performance. • Implies: adjust the parameters in the direction of the gradient of the average reward: θ := θ + γ ∇η(θ).
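The update θ := θ + γ∇η(θ) is ordinary gradient ascent; in practice ∇η(θ) is replaced by an estimate such as the ∇_β η(θ) estimator described later. A hedged sketch of the outer loop, with the estimator left as a placeholder callable:

```python
import numpy as np

def gradient_ascent(theta0, grad_estimate, gamma=0.1, num_steps=100):
    """Repeatedly step uphill along a (possibly noisy) gradient estimate.

    grad_estimate(theta) is assumed to return an estimate of grad eta(theta),
    e.g. produced by simulating the POMDP for T steps (placeholder here).
    """
    theta = np.array(theta0, dtype=float)
    for _ in range(num_steps):
        theta += gamma * grad_estimate(theta)
    return theta

# Toy usage: pretend eta(theta) = -(theta - 1)^2, so the true gradient is known.
theta_star = gradient_ascent(np.zeros(1), lambda th: -2.0 * (th - 1.0))
print(theta_star)   # -> close to [1.0]
```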
22 Direct Gradient Ascent: Main Results 1. An algorithm to estimate the approximate gradient (∇_β η) from a sample path. 2. Accuracy of the approximation depends on a parameter of the algorithm (β): a bias/variance trade-off. 3. A line search algorithm using only gradient estimates.
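The line-search algorithm itself is not spelled out in these slides. One way to search using only (noisy) gradient estimates, in the spirit of result 3, is to step along the search direction until the estimated gradient’s projection onto that direction changes sign, which brackets a maximum. The sketch below is an illustrative reconstruction, not the talk’s algorithm:

```python
import numpy as np

def bracket_line_search(theta, direction, grad_estimate, step0=0.1, max_doublings=20):
    """Find an interval [a, b] along `direction` that brackets a local maximum.

    Uses only the sign of <grad_estimate, direction>: keep doubling the step
    while the projected gradient stays positive; stop once it turns negative.
    grad_estimate(theta) is a placeholder for a noisy gradient estimator.
    """
    a, step = 0.0, step0
    for _ in range(max_doublings):
        g = grad_estimate(theta + (a + step) * direction)
        if np.dot(g, direction) > 0:      # still uphill: extend the bracket
            a += step
            step *= 2.0
        else:                             # downhill: maximum lies in [a, a + step]
            return a, a + step
    return a, a + step

# Toy usage with a known quadratic eta(theta) = -(theta - 3)^2.
grad = lambda th: -2.0 * (th - 3.0)
print(bracket_line_search(np.zeros(1), np.ones(1), grad))   # bracket around 3.0
```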
23 Related Work Machine Learning: Williams’ REINFORCE algorithm (1992). • Gradient ascent algorithm for a restricted class of MDPs. • Requires an accurate reward baseline, i.i.d. transitions. Kimura et al., 1998: extension to the infinite-horizon setting. Discrete Event Systems: Algorithms that rely on recurrent states. MDPs: (Cao and Chen, 1997); POMDPs: (Marbach and Tsitsiklis, 1998). Control Theory: Direct adaptive control using derivatives (Hjalmarsson, Gunnarsson, Gevers, 1994), (Kammer, Bitmead, Bartlett, 1997), (DeBruyne, Anderson, Gevers, Linard, 1997).
24 Summary • Partially Observable Markov Decision Processes. • Previous approaches: value function methods. • Direct gradient ascent. • Approximating the gradient of the average reward. • Estimating the approximate gradient: POMDPG. • Line search in the presence of noise. • Experimental results.
25 Approximating the gradient Recall: for β ∈ [0, 1), the discounted value of state i is
J_β(i) = E[ r(X_0) + β r(X_1) + β^2 r(X_2) + ··· | X_0 = i ].
Vector notation: J_β = (J_β(1), ..., J_β(n)).
Theorem: For all β ∈ [0, 1),
∇η(θ) = β π'(θ) ∇P(θ) J_β + (1 − β) ∇π'(θ) J_β,
where the first term equals β ∇_β η(θ) (the quantity we estimate) and the second term → 0 as β → 1.
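For a chain small enough to handle exactly, the theorem can be checked numerically by computing both sides with finite differences in θ. The sketch below does this for a two-state chain whose transition matrix depends on a scalar θ (all specifics invented); the point is only that ∇η = βπ'∇P J_β + (1 − β)∇π' J_β holds.

```python
import numpy as np

beta, eps = 0.9, 1e-6
r = np.array([1.0, 0.0])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def P(theta):
    """A 2-state chain whose switching probability depends on theta (invented)."""
    a = sigmoid(theta)
    return np.array([[1 - a, a],
                     [0.3,   0.7]])

def stationary(P_):
    vals, vecs = np.linalg.eig(P_.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return pi / pi.sum()

def J(theta):
    return np.linalg.solve(np.eye(2) - beta * P(theta), r)

theta = 0.2
pi   = stationary(P(theta))
pi_p = stationary(P(theta + eps))
pi_m = stationary(P(theta - eps))
dP   = (P(theta + eps) - P(theta - eps)) / (2 * eps)   # finite-difference grad P
dpi  = (pi_p - pi_m) / (2 * eps)                       # finite-difference grad pi'
deta = (pi_p @ r - pi_m @ r) / (2 * eps)               # finite-difference grad eta

rhs = beta * pi @ dP @ J(theta) + (1 - beta) * dpi @ J(theta)
print(deta, rhs)   # the two numbers agree up to finite-difference error
```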
26 Mixing Times of Markov Chains • ℓ1-distance: if p, q are distributions on the states, ‖p − q‖_1 := Σ_{i=1}^{n} |p(i) − q(i)|. • d(t)-distance: let p_t(i) be the distribution over states at time t, starting from state i; d(t) := max_{i,j} ‖p_t(i) − p_t(j)‖_1. • Unique stationary distribution ⇒ d(t) → 0.
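For a small chain these quantities can be computed directly: p_t(i) is the i-th row of P^t, d(t) is the largest ℓ1-distance between rows, and the mixing time of the next slide is the first t at which d(t) drops below e^{-1}. A sketch with an invented transition matrix:

```python
import numpy as np

P = np.array([[0.8, 0.1, 0.1],        # invented induced transition matrix
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

def d(P, t):
    """d(t) = max_{i,j} || row_i(P^t) - row_j(P^t) ||_1."""
    Pt = np.linalg.matrix_power(P, t)
    return max(np.abs(Pt[i] - Pt[j]).sum()
               for i in range(len(P)) for j in range(len(P)))

def mixing_time(P, t_max=1000):
    """tau* = min { t : d(t) <= 1/e }."""
    for t in range(1, t_max + 1):
        if d(P, t) <= np.exp(-1):
            return t
    return None

print([round(d(P, t), 4) for t in range(1, 6)], mixing_time(P))
```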
27 Approximating the gradient Mixing time: τ* := min{ t : d(t) ≤ e^{-1} }. Theorem: For all β ∈ [0, 1) and θ ∈ R^K, ‖∇η(θ) − ∇_β η(θ)‖ ≤ constant × τ*(θ)(1 − β). That is, if 1/(1 − β) is large compared with the mixing time τ*(θ), then ∇_β η(θ) accurately approximates the gradient direction ∇η(θ).
28 Summary • Partially Observable Markov Decision Processes. • Previous approaches: value function methods. • Direct gradient ascent. • Approximating the gradient of the average reward. • Estimating the approximate gradient: POMDPG. • Line search in the presence of noise. • Experimental results.
29 Estimating ∇_β η(θ): POMDPG
Given: parameterized policies µ_u(θ, y) and β ∈ [0, 1):
1. Set z_0 = ∆_0 = 0 ∈ R^K.
2. for each observation y_t, control u_t, reward r(i_{t+1}) do
3. Set z_{t+1} = β z_t + ∇µ_{u_t}(θ, y_t) / µ_{u_t}(θ, y_t)   (eligibility trace)
4. Set ∆_{t+1} = ∆_t + (1/(t+1)) [ r(i_{t+1}) z_{t+1} − ∆_t ]
5. end for
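Below is a minimal simulation of steps 1–5 on the invented toy POMDP used earlier, with the same softmax policy (an assumption, not part of the talk); for the softmax, the ratio ∇µ/µ is just the gradient of log µ and has a closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, N = 3, 2, 2

# Toy POMDP (invented numbers): p[u, i, j], nu[i, y], r[i]
p = np.array([
    [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]],
    [[0.1, 0.6, 0.3], [0.3, 0.1, 0.6], [0.6, 0.3, 0.1]],
])
nu = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
r = np.array([1.0, 0.0, -1.0])

def mu(theta, y):                       # softmax policy over the N controls
    e = np.exp(theta.reshape(M, N)[y])
    return e / e.sum()

def grad_log_mu(theta, y, u):           # equals grad(mu_u)/mu_u for the softmax
    g = np.zeros((M, N))
    g[y] = -mu(theta, y)
    g[y, u] += 1.0
    return g.ravel()

def pomdpg(theta, beta=0.9, T=100_000):
    """Steps 1-5 of the slide: Delta_T estimates grad_beta eta(theta)."""
    z = np.zeros(theta.size)            # eligibility trace z_t
    delta = np.zeros(theta.size)        # running average Delta_t
    i = 0                               # start in an arbitrary state
    for t in range(T):
        y = rng.choice(M, p=nu[i])                   # observe y_t
        u = rng.choice(N, p=mu(theta, y))            # act u_t ~ mu(theta, y_t)
        i = rng.choice(n, p=p[u, i])                 # transition to i_{t+1}
        z = beta * z + grad_log_mu(theta, y, u)      # step 3
        delta += (r[i] * z - delta) / (t + 1)        # step 4
    return delta

print(pomdpg(np.zeros(M * N)))
```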
30 Convergence of POMDPG Theorem: For all β ∈ [0, 1) and θ ∈ R^K, ∆_t → ∇_β η(θ).
31 Explanation of POMDPG
The algorithm computes
∆_T = (1/T) Σ_{t=0}^{T−1} [ ∇µ_{u_t}(θ, y_t) / µ_{u_t}(θ, y_t) ] [ r(i_{t+1}) + β r(i_{t+2}) + ··· + β^{T−t−1} r(i_T) ],
where the bracketed sum of rewards is an estimate of the discounted value ‘due to’ the action u_t.
• ∇µ_{u_t}(θ, y_t) is the direction to increase the probability of the action u_t.
• It is weighted by something involving subsequent rewards, and
• divided by µ_{u_t}: this ensures “popular” actions don’t dominate.
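The batch form above is algebraically the same quantity the online updates accumulate; a short check on an invented trajectory (random vectors standing in for ∇µ/µ, random rewards) makes the rearrangement explicit.

```python
import numpy as np

rng = np.random.default_rng(1)
T, K, beta = 50, 4, 0.9
g = rng.normal(size=(T, K))     # stand-ins for grad(mu_{u_t})/mu_{u_t} at each step
rew = rng.normal(size=T)        # stand-ins for r(i_{t+1}), t = 0..T-1

# Online form: eligibility trace + running average (steps 3-4 of the algorithm).
z, delta = np.zeros(K), np.zeros(K)
for t in range(T):
    z = beta * z + g[t]
    delta += (rew[t] * z - delta) / (t + 1)

# Batch form from the slide: each g_t weighted by its discounted future rewards.
batch = np.zeros(K)
for t in range(T):
    discounted = sum(beta ** (s - t) * rew[s] for s in range(t, T))
    batch += g[t] * discounted
batch /= T

print(np.allclose(delta, batch))   # -> True
```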