Temporal-Difference Learning: What is MC estimation doing?




1. Coming Up With Better Policies

We can interleave policy evaluation with policy improvement as before:

    π_0 →(E) Q^{π_0} →(I) π_1 →(E) · · · →(I) π* →(E) Q*

where E denotes a policy evaluation step and I a policy improvement step.

We've just figured out how to do policy evaluation. Policy improvement is even easier, because we now have the direct expected rewards Q(s, a) for each action in each state, so we just pick the best action among these.

[Figure: the optimal policy for Blackjack, showing the HIT/STICK regions and the value function V* as functions of the player's sum and the dealer's showing card, for hands with and without a usable ace.]

On-Policy Learning

On-policy methods attempt to evaluate the same policy that is being used to make decisions.

Get rid of the assumption of exploring starts. Instead, use an ε-greedy method: some ε proportion of the time you don't take the greedy action, but instead take a random action.

Soft policies: all actions have non-zero probability of being selected in all states.

For any ε-soft policy π, any ε-greedy strategy with respect to Q^π is guaranteed to be an improvement over π.

If we move the ε-greedy requirement inside the environment, so that nature follows your chosen action 1 − ε of the time and otherwise picks a random action, then the best one can do with general strategies in the new environment is the same as the best one could do with ε-greedy strategies in the old environment.
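To make the ε-greedy construction concrete, here is a minimal sketch of deriving an ε-greedy (and therefore ε-soft) policy from an action-value table. The dictionary-based tabular representation and all names are illustrative assumptions, not part of the original notes.

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """With probability epsilon take a uniformly random action;
    otherwise take a greedy action w.r.t. the current Q estimates.
    Every action keeps non-zero probability, so the policy is epsilon-soft."""
    if random.random() < epsilon:
        return random.choice(actions)
    best_value = max(Q[(state, a)] for a in actions)
    greedy_actions = [a for a in actions if Q[(state, a)] == best_value]
    return random.choice(greedy_actions)  # break ties at random
```

Interleaving evaluation of the current ε-greedy policy with this improvement step gives an on-policy control loop that no longer relies on exploring starts.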

2. Temporal-Difference Learning

What is MC estimation doing? It updates the value of a state toward the return that followed it:

    V(s_t) ← (1 − α_t) V(s_t) + α_t R_t

where R_t is the return received following being in state s_t.

Suppose we switch to a constant step size α (this is a trick often used in nonstationary environments).

TD methods basically bootstrap off of existing estimates instead of waiting for the whole reward sequence R to materialize:

    V(s_t) ← (1 − α) V(s_t) + α [r_{t+1} + γ V(s_{t+1})]

(based on the actual observed reward and new state). This target uses the current value as an estimate of V, whereas the Monte Carlo target uses the sample reward as an estimate of the expected reward.

Adaptive Dynamic Programming

Simple idea: take actions in the environment (follow some strategy like ε-greedy with respect to your current belief about what the value function is) and update your transition and reward models according to observations. Then update your value function by doing full dynamic programming on your current believed model.

In some sense this does as well as possible, subject to the agent's ability to learn the transition model. But it is highly impractical for anything with a big state space (Backgammon has 10^50 states).

Q-Learning: A Model-Free Approach

Even without a model of the environment, you can learn effectively. Q-learning is conceptually similar to TD learning, but uses the Q function instead of the value function:

1. In state s, choose some action a using a policy derived from the current Q (for example, ε-greedy), resulting in state s' with reward r.

2. Update:

    Q(s, a) ← (1 − α) Q(s, a) + α (r + γ max_{a'} Q(s', a'))

If we actually want to converge to the optimal policy, the decision-making policy must be GLIE (greedy in the limit of infinite exploration): it must become more and more likely to take the greedy action, so that we don't end up with faulty estimates (a problem that can be exacerbated by the fact that we're bootstrapping).

You don't need a model for either learning or action selection! As environments become more complex, using a model can help more (anecdotally).
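The two update rules above translate almost directly into code. The following sketch shows a tabular TD(0) value update and a Q-learning loop with ε-greedy action selection; the environment interface (env.reset(), env.step()), the dictionary-based tables, and the default step sizes are assumptions made for illustration, not part of the original notes.

```python
import random

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """V(s) <- (1 - alpha) V(s) + alpha [r + gamma V(s')]"""
    V[s] = (1 - alpha) * V[s] + alpha * (r + gamma * V[s_next])

def q_learning_episode(env, Q, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Run one episode of Q-learning with an epsilon-greedy behaviour policy.
    Assumes env.reset() -> state and env.step(a) -> (next_state, reward, done)."""
    s = env.reset()
    done = False
    while not done:
        # Choose an action from a policy derived from the current Q (epsilon-greedy).
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s_next, r, done = env.step(a)
        # Q(s,a) <- (1 - alpha) Q(s,a) + alpha (r + gamma max_a' Q(s',a'))
        target = r if done else r + gamma * max(Q[(s_next, act)] for act in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
        s = s_next
```

With V and Q stored as collections.defaultdict(float), unseen entries default to zero; decaying epsilon across episodes is one simple way to approximate the GLIE requirement mentioned above.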

3. Generalization in Reinforcement Learning

So far, we've thought of Q functions and utility functions as being represented by tables.

Question: can we parameterize the state space so that we can learn (for example) a linear function of the parameterization?

    V_θ(s) = θ_1 f_1(s) + θ_2 f_2(s) + · · · + θ_n f_n(s)

Monte Carlo methods: we obtain samples of V(s) and then learn the θ's to minimize squared error. In general, it often makes more sense to use an online procedure, like the Widrow-Hoff rule.

Suppose our linear function predicts V_θ(s) and we actually would "like" it to have predicted something else, say v. Define the error as E(s) = (V_θ(s) − v)^2 / 2. Then the update rule is:

    θ_i ← θ_i − α ∂E(s)/∂θ_i = θ_i + α (v − V_θ(s)) ∂V_θ(s)/∂θ_i

If we look at the TD-learning updates in this framework, we see that we essentially replace what we'd "like" it to be with the learned backup (the sum of the reward and the value function of the next state):

    θ_i ← θ_i + α [R(s) + γ V_θ(s') − V_θ(s)] ∂V_θ(s)/∂θ_i

This can be shown to converge to the closest function to the true function when linear function approximators are used, but it's not clear how good a linear function will be at approximating non-linear functions in general, and all bets on convergence are off when we move to non-linear spaces.

The power of function approximation: it allows you to generalize to values of states you haven't yet seen! In backgammon, Tesauro constructed a player as good as the best humans although it only examined one out of every 10^44 possible states.

Caveat: this is one of the few successes that has been achieved with function approximation and RL. Most of the time it's hard to get a good parameterization and get it to work.
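Since V_θ(s) = θ_1 f_1(s) + · · · + θ_n f_n(s), the partial derivative ∂V_θ(s)/∂θ_i is simply f_i(s), so both updates above amount to nudging θ along the feature vector. Below is a minimal sketch of the two updates; the NumPy array representation of the features and the default step sizes are assumptions for illustration.

```python
import numpy as np

def v_theta(theta, features):
    """Linear value estimate: V_theta(s) = sum_i theta_i * f_i(s)."""
    return float(np.dot(theta, features))

def widrow_hoff_update(theta, features, v_target, alpha=0.01):
    """theta_i <- theta_i + alpha (v - V_theta(s)) * f_i(s), for a known target v."""
    return theta + alpha * (v_target - v_theta(theta, features)) * features

def td_linear_update(theta, features_s, reward, features_s_next, alpha=0.01, gamma=0.99):
    """Same rule, but with the bootstrapped target R(s) + gamma V_theta(s')."""
    td_error = reward + gamma * v_theta(theta, features_s_next) - v_theta(theta, features_s)
    return theta + alpha * td_error * features_s
```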
