Approximate Q-Learning 3-25-16
Exploration policy vs. optimal policy
Where do the exploration traces come from?
● We need some policy for acting in the environment before we understand it.
● We’d like to get decent rewards while exploring.
○ Explore/exploit tradeoff.
● In lab, we’re using an epsilon-greedy exploration policy (sketched below).
○ After exploration, continuing to take random bad moves doesn’t make much sense.
● If the Q-value estimates are correct, a greedy policy is optimal.
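A minimal sketch of an epsilon-greedy action choice (the function and variable names are illustrative, not taken from the lab code):

```python
import random

def epsilon_greedy_action(q_values, legal_actions, epsilon):
    """Pick a random legal action with probability epsilon,
    otherwise pick the action with the highest Q-value estimate.

    q_values: dict mapping action -> estimated Q(s, a) for the current state.
    """
    if random.random() < epsilon:
        return random.choice(legal_actions)                    # explore
    return max(legal_actions, key=lambda a: q_values[a])       # exploit (greedy)
```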
On-policy learning
Instead of updating based on the best action from the next state, update based on the action your current policy actually takes from the next state.
SARSA update: Q(s, a) ← Q(s, a) + α · [r + γ · Q(s′, a′) − Q(s, a)], where a′ is the action the current policy actually takes in s′.
When would this be better or worse than Q-learning? (Both updates are sketched below.)
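For contrast, a sketch of the two updates side by side, assuming Q is a dict keyed by (state, action) pairs and alpha/gamma are the learning rate and discount (this representation is an assumption, not the lab’s API):

```python
def q_learning_update(Q, s, a, r, s_next, legal_next, alpha, gamma):
    # Off-policy: bootstrap from the best action available in the next state.
    # (Terminal-state handling omitted for brevity.)
    target = r + gamma * max(Q[(s_next, a2)] for a2 in legal_next)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy: bootstrap from the action the current policy actually took in s_next.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```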
Demo: Q-learning vs SARSA
https://studywolf.wordpress.com/2013/07/01/reinforcement-learning-sarsa-vs-q-learning/
Problem: large state spaces
If the state space is large, several problems arise.
● The table of Q-value estimates can get extremely large.
● Q-value updates can be slow to propagate.
● High-reward states can be hard to find.
The state space grows exponentially with the feature dimension.
PacMan state space
● PacMan’s location (107 possibilities).
● Location of each ghost (107² possibilities).
● Locations still containing food: 2¹⁰⁴ combinations.
○ Not all feasible, because PacMan can’t jump.
● Pills remaining (4 possibilities).
● Whether each ghost is scared (4 possibilities … ignoring the timer).
107³ * 4² = 19,600,688 states … ignoring the food!
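A quick sanity check of that count (assuming 2 ghosts, which matches the 107² ghost locations and the 4 scared/not-scared combinations):

```python
# 107 PacMan positions * 107^2 ghost positions * 4 pill counts * 4 scared combinations
positions = 107
print(positions * positions**2 * 4 * 4)  # 19600688
```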
Reward Shaping
Idea: give some small intermediate rewards that help the agent learn.
● Like a heuristic, this can guide the search in the right direction.
● Rewarding novelty can encourage exploration.
Disadvantages:
● Requires intervention by the designer to add domain-specific knowledge.
● If reward/discount are not balanced right, the agent might prefer accumulating the small rewards to actually solving the problem.
● Doesn’t reduce the size of the Q-table.
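One possible sketch of shaping, using a hypothetical bonus for moving closer to the nearest food (not part of the lab code; the function and parameter names are assumptions):

```python
def shaped_reward(raw_reward, old_dist_to_food, new_dist_to_food, bonus=0.1):
    """Add a small intermediate reward for moving closer to the nearest food.

    If the bonus is too large relative to the real rewards and the discount,
    the agent may chase the shaping signal instead of solving the task.
    """
    shaping = bonus if new_dist_to_food < old_dist_to_food else 0.0
    return raw_reward + shaping
```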
Function Approximation
Key idea: learn the Q-function as a linear combination of features.
● We can think of feature extraction as a change of basis.
● For each state encountered, determine its representation in terms of features.
● Perform a Q-learning-style update on each feature’s weight.
● The value estimate is a weighted sum over the state’s features (sketched below).
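A minimal sketch of the linear form Q(s, a) = Σᵢ wᵢ · fᵢ(s, a), assuming features come back as a dict mapping feature name to value (the representation and names are assumptions):

```python
def q_value(weights, features):
    """Linear Q-value estimate: dot product of weights and feature values.

    weights:  dict feature_name -> learned weight
    features: dict feature_name -> value extracted from (state, action)
    """
    return sum(weights.get(name, 0.0) * value for name, value in features.items())
```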
PacMan features from lab
● "bias": always 1.0.
● "#-of-ghosts-1-step-away": the number of ghosts (regardless of whether they are safe or dangerous) that are 1 step away from Pac-Man.
● "closest-food": the distance in Pac-Man steps to the closest food pellet (takes into account walls that may be in the way).
● "eats-food": 1 if Pac-Man will eat a pellet of food by taking the given action in the given state, 0 otherwise.
Exercise: extract features from these states
● bias
● #-of-ghosts-1-step-away
● closest-food
● eats-food
Approximate Q-learning update
Initialize the weight for each feature to 0. After each transition (s, a) → s′ with reward r:
difference = [r + γ · max over a′ of Q(s′, a′)] − Q(s, a)
wᵢ ← wᵢ + α · difference · fᵢ(s, a)
Note: this is performing gradient descent; derivation in the reading.
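A sketch of that update under the same assumed dict-of-features representation (the helper and parameter names are illustrative, not the lab’s):

```python
def approximate_q_update(weights, features, reward, max_q_next, alpha, gamma):
    """One approximate Q-learning step, updating each feature's weight in place.

    features:   dict feature_name -> f_i(s, a) for the transition just taken
    max_q_next: max over a' of Q(s', a') under the current weights (0 if s' is terminal)
    """
    current_q = sum(weights.get(n, 0.0) * v for n, v in features.items())
    difference = (reward + gamma * max_q_next) - current_q
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
```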
Advantages and disadvantages of approximation
+ Dramatically reduces the size of the Q-table.
+ States will share many features.
+ Allows generalization to unvisited states.
+ Makes behavior more robust: similar decisions are made in similar states.
+ Handles continuous state spaces!
- Requires feature selection (often must be done by hand).
- Restricts the accuracy of the learned values.
- The true Q-function may not be linear in the features.
Exercise: approximate Q-learning
Features: COL ∈ {0, ⅓, ⅔, 1}, R0 ∈ {0, 1}, R1 ∈ {0, 1}, R2 ∈ {0, 1}
Discount: 0.9; learning rate: 0.2
Use these exploration traces:
● (0,0)→(1,0)→(2,0)→(2,1)→(3,1)
● (0,0)→(0,1)→(0,2)→(1,2)→(2,2)→(3,2)
● (0,0)→(0,1)→(0,2)→(1,2)→(2,2)→(2,1)→(3,1)
● (0,0)→(0,1)→(0,2)→(1,2)→(2,2)→(3,2)
[Grid-world figure: columns 0–3, rows 0–2, start state S at (0,0), +1 terminal reward in row 2 and -1 terminal reward in row 1.]