Temporal Difference Learning CMPUT 366: Intelligent Systems S&B §6.0-6.2, §6.4-6.5
Lecture Overview 1. Recap 2. TD Prediction 3. On-Policy TD Control (Sarsa) 4. Off-Policy TD Control (Q-Learning)
Recap: Monte Carlo RL
• Monte Carlo estimation: Estimate the expected return from a state or action by averaging actual returns over sampled trajectories
• Estimating action values requires either exploring starts or a soft policy (e.g., ε-greedy)
• Off-policy learning is the estimation of value functions for a target policy based on episodes generated by a different behaviour policy
• Off-policy control is learning the optimal policy (target policy) using episodes from a behaviour policy
Learning from Experience
• Suppose we are playing a blackjack-like game in person, but we don't know the rules
• We know the actions we can take, we can see the cards, and we get told when we win or lose
• Question: Could we compute an optimal policy using dynamic programming in this scenario?
• Question: Could we compute an optimal policy using Monte Carlo?
• What would be the pros and cons of running Monte Carlo?
Bootstrapping

                          Bootstraps    Does not bootstrap
  Learns from experience  TD            MC
  Requires full dynamics  DP

• Dynamic programming bootstraps: each iteration's estimates are based partly on estimates from previous iterations
• Each Monte Carlo estimate is based only on actual returns
Updates

Dynamic programming:
$V(S_t) \leftarrow \sum_a \pi(a \mid S_t) \sum_{s', r} p(s', r \mid S_t, a)\left[ r + \gamma V(s') \right]$

Monte Carlo:
$V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right]$

TD(0):
$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$

All three are approximations of
$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$

• Monte Carlo: approximate because the expectation $\mathbb{E}_\pi$ is replaced by sampled returns
• Dynamic programming: approximate because $v_\pi$ is not known, so the current estimate $V$ is used in its place
• TD(0): approximate for both reasons — it samples the expectation $\mathbb{E}_\pi$ and it bootstraps from $V$ instead of $v_\pi$
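As a concrete illustration of the difference between the last two updates, here is a minimal Python sketch (not from the slides) of the Monte Carlo and TD(0) update rules applied to a tabular value estimate; the names V, G, alpha, and gamma are illustrative assumptions.

def mc_update(V, s, G, alpha):
    # Monte Carlo: move V(s) toward the actual return G observed from state s
    V[s] += alpha * (G - V[s])

def td0_update(V, s, r, s_next, gamma, alpha):
    # TD(0): move V(s) toward the bootstrapped target r + gamma * V(s')
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

The dynamic programming update would instead sum over all actions and successor states, which requires the full dynamics p(s′, r | s, a).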
TD(0) Algorithm

Tabular TD(0) for estimating v_π
Input: the policy π to be evaluated
Algorithm parameter: step size α ∈ (0, 1]
Initialize V(s), for all s ∈ S⁺, arbitrarily except that V(terminal) = 0
Loop for each episode:
    Initialize S
    Loop for each step of episode:
        A ← action given by π for S
        Take action A, observe R, S′
        V(S) ← V(S) + α[R + γV(S′) − V(S)]
        S ← S′
    until S is terminal

Question: What information does this algorithm use?
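A minimal Python sketch of tabular TD(0) prediction, assuming a hypothetical episodic environment with reset() → state and step(action) → (next_state, reward, done), and a policy given as a function of the state; these interfaces and hyperparameter values are illustrative, not part of the slides.

from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    # V defaults to 0 for unseen states, so V(terminal) = 0 is maintained automatically
    V = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # TD(0) update: bootstrap from the current estimate of the next state
            target = r + gamma * (0.0 if done else V[s_next])
            V[s] += alpha * (target - V[s])
            s = s_next
    return V

Note the information the loop touches: only the policy π and the sampled transitions (S, A, R, S′), never the dynamics p.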
TD for Control
• We can plug TD prediction into the generalized policy iteration framework
• Monte Carlo control loop:
    1. Generate an episode using the estimated π
    2. Update the estimates of π and Q
• On-policy TD control loop:
    1. Take an action according to π
    2. Update the estimates of π and Q
On-Policy TD Control

Sarsa (on-policy TD control) for estimating Q ≈ q*
Algorithm parameters: step size α ∈ (0, 1], small ε > 0
Initialize Q(s, a), for all s ∈ S⁺, a ∈ A(s), arbitrarily except that Q(terminal, ·) = 0
Loop for each episode:
    Initialize S
    Choose A from S using policy derived from Q (e.g., ε-greedy)
    Loop for each step of episode:
        Take action A, observe R, S′
        Choose A′ from S′ using policy derived from Q (e.g., ε-greedy)
        Q(S, A) ← Q(S, A) + α[R + γQ(S′, A′) − Q(S, A)]
        S ← S′; A ← A′
    until S is terminal

Question: What information does this algorithm use?
Question: Will this estimate the Q-values of the optimal policy?
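A minimal Python sketch of Sarsa under the same assumed environment interface as the TD(0) sketch above; the ε-greedy helper, the (state, action) keying of Q, and the hyperparameter values are illustrative assumptions.

import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, epsilon):
    # With probability epsilon explore uniformly, otherwise act greedily w.r.t. Q
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa(env, actions, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)            # Q(terminal, ·) is never updated, so it stays 0
    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            if done:
                # Terminal transition: the target is just the reward
                Q[(s, a)] += alpha * (r - Q[(s, a)])
            else:
                # Choose A' with the same ε-greedy policy that will actually be followed
                a_next = epsilon_greedy(Q, s_next, actions, epsilon)
                Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
                s, a = s_next, a_next
    return Q

Because A′ is chosen by the same ε-greedy policy that generates behaviour, the learned Q-values are those of the ε-greedy policy itself — which is what makes Sarsa on-policy.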
Actual Q-Values vs. Optimal Q-Values
• Just as with on-policy Monte Carlo control, Sarsa does not converge to the optimal policy, because it always chooses ε-greedy actions
• And the estimated Q-values are with respect to the actions actually taken, which are ε-greedy
• Question: Why is it necessary to choose ε-greedy actions?
• What if we acted ε-greedily, but learned the Q-values for the optimal policy?
Off-Policy TD Control

Q-learning (off-policy TD control) for estimating π ≈ π*
Algorithm parameters: step size α ∈ (0, 1], small ε > 0
Initialize Q(s, a), for all s ∈ S⁺, a ∈ A(s), arbitrarily except that Q(terminal, ·) = 0
Loop for each episode:
    Initialize S
    Loop for each step of episode:
        Choose A from S using policy derived from Q (e.g., ε-greedy)
        Take action A, observe R, S′
        Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S′, a) − Q(S, A)]
        S ← S′
    until S is terminal

Question: What information does this algorithm use?
Question: Why aren't we estimating the policy π explicitly?
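A minimal Python sketch of Q-learning under the same assumed interfaces as the previous sketches; the only substantive change from the Sarsa sketch is the target, which bootstraps from max_a Q(S′, a) rather than from the action actually taken.

import random
from collections import defaultdict

def q_learning(env, actions, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)            # Q(terminal, ·) is never used as a target, so it stays 0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # Behaviour policy: ε-greedy with respect to the current Q
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # Target: greedy — bootstrap from the best next action, not the one that will be taken
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q

The greedy target is what makes Q-learning off-policy: the target policy is greedy with respect to Q (and so needs no explicit representation), while behaviour remains ε-greedy.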
Example: The Cliff

[Figure: the cliff-walking gridworld — start S at bottom-left, goal G at bottom-right, "The Cliff" along the bottom edge between them; the optimal path runs right along the cliff edge, the safer path loops further away from it. γ = 1 (undiscounted); R = −1 per step; R = −100 for stepping into the cliff.]

• The agent gets −1 reward per step until it reaches the goal state
• Stepping into the cliff region gives reward −100 and sends the agent back to the start
• Question: How will Q-learning estimate the value of this state?
• Question: How will Sarsa estimate the value of this state?
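A minimal sketch of the cliff gridworld as a Python class matching the assumed reset/step interface used in the sketches above; the 4×12 layout follows the standard cliff-walking example, and all names are illustrative.

class CliffWalk:
    """Cliff-walking gridworld: 4x12 grid, start S at bottom-left, goal G at
    bottom-right, cliff cells along the bottom edge between them."""

    HEIGHT, WIDTH = 4, 12
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right

    def reset(self):
        self.pos = (3, 0)                           # start S (bottom-left corner)
        return self.pos

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        row = min(max(self.pos[0] + dr, 0), self.HEIGHT - 1)
        col = min(max(self.pos[1] + dc, 0), self.WIDTH - 1)
        if row == 3 and 1 <= col <= 10:             # stepped into the cliff
            self.pos = (3, 0)                       # back to start, episode continues
            return self.pos, -100, False
        self.pos = (row, col)
        done = (row, col) == (3, 11)                # reached the goal G
        return self.pos, -1, done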
Performance on The Cliff

[Figure: sum of rewards during each episode (y-axis, −100 to −25) vs. episodes (x-axis, 0 to 500) for Sarsa and Q-learning; the Sarsa curve sits above the Q-learning curve.]

Q-learning estimates the optimal policy, but Sarsa consistently outperforms Q-learning during learning. (Why?)
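A hypothetical usage sketch tying the pieces above together (the CliffWalk, sarsa, and q_learning sketches); reproducing the learning curves would additionally require summing the rewards within each episode inside the control loops.

env = CliffWalk()
actions = list(range(4))            # indices into CliffWalk.ACTIONS
Q_sarsa = sarsa(env, actions, num_episodes=500)
Q_qlearning = q_learning(env, actions, num_episodes=500)

Intuitively, Q-learning's greedy target values the path along the cliff edge, but its ε-greedy behaviour occasionally falls in; Sarsa's targets account for the exploratory actions, so it learns the safer path and collects more reward per episode while exploring.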
Summary
• Temporal difference learning bootstraps and learns from experience
• Dynamic programming bootstraps, but doesn't learn from experience (requires full dynamics)
• Monte Carlo learns from experience, but doesn't bootstrap
• Prediction: TD(0) algorithm
• Sarsa estimates the action values of the actual ε-greedy policy
• Q-learning estimates the action values of the optimal policy while executing an ε-greedy policy