Reinforcement Learning
Robert Platt, Northeastern University
Some images and slides are used from: 1. CS188 UC Berkeley 2. Russell & Norvig, AIMA
Conception of agent: the agent acts on the world and senses the world. [Agent–World loop diagram: act, sense]
RL conception of agent
The agent takes actions a; it perceives states s and rewards r from the world. [Agent–World loop diagram]
The transition model and reward function are initially unknown to the agent!
– value iteration assumed knowledge of these two things...
Value iteration
– We know the reward function
– We know the probabilities of moving in each direction when an action is executed
Image: Berkeley CS188 course notes (downloaded Summer 2015)
Reinforcement Learning
– We do not know the reward function
– We do not know the probabilities of moving in each direction when an action is executed
Image: Berkeley CS188 course notes (downloaded Summer 2015)
The difference between RL and value iteration
– Online learning (RL)
– Offline solution (value iteration)
Image: Berkeley CS188 course notes (downloaded Summer 2015)
Value iteration vs RL
[Racing-car MDP diagram: states Cool, Warm, Overheated; actions Slow and Fast, with transition probabilities (0.5, 1.0) and rewards (+1, +2, -10)]
RL still assumes that we have an MDP
Image: Berkeley CS188 course notes (downloaded Summer 2015)
Value iteration vs RL
[Same MDP diagram: states Cool, Warm, Overheated, with the transition probabilities and rewards now hidden]
RL still assumes that we have an MDP
– but we assume we don't know T or R
Image: Berkeley CS188 course notes (downloaded Summer 2015)
RL example https://www.youtube.com/watch?v=goqWX7bC-ZY
Model-based RL
1. Estimate T, R by averaging experiences:
   a. choose an exploration policy – a policy that enables the agent to explore all relevant states
   b. follow the policy for a while
   c. estimate T and R (see the sketch below):
      T̂(s, a, s') ≈ N(s, a, s') / N(s, a), where N(s, a, s') is the number of times the agent reached s' by taking a from s
      R̂(s, a, s') ≈ average of the set of rewards obtained when reaching s' by taking a from s
2. Solve for a policy using value iteration
What's wrong with this approach?
Image: Berkeley CS188 course notes (downloaded Summer 2015)
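A rough sketch of the estimation step above, assuming experience arrives as (s, a, s', r) tuples; the dictionary layout and function names are illustrative, not from the slides:

```python
from collections import defaultdict

counts = defaultdict(int)        # N(s, a, s'): times s' was reached by taking a from s
reward_sums = defaultdict(float) # running sum of rewards observed for (s, a, s')
totals = defaultdict(int)        # N(s, a): times action a was taken from s

def record(s, a, s_next, r):
    counts[(s, a, s_next)] += 1
    reward_sums[(s, a, s_next)] += r
    totals[(s, a)] += 1

def T_hat(s, a, s_next):
    # estimated transition probability: N(s, a, s') / N(s, a)
    return counts[(s, a, s_next)] / totals[(s, a)] if totals[(s, a)] else 0.0

def R_hat(s, a, s_next):
    # estimated reward: average of rewards observed when reaching s' via (s, a)
    n = counts[(s, a, s_next)]
    return reward_sums[(s, a, s_next)] / n if n else 0.0
```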
Model-based vs Model-free learning
Goal: compute the expected age of students in this class
– Known P(A): E[A] = Σ_a P(a) · a
– Unknown P(A): instead collect samples [a_1, a_2, ..., a_N]
  – "Model Based": estimate P̂(a) = num(a)/N from the samples, then E[A] ≈ Σ_a P̂(a) · a. Why does this work? Because eventually you learn the right model.
  – "Model Free": E[A] ≈ (1/N) Σ_i a_i. Why does this work? Because samples appear with the right frequencies.
Slide: Berkeley CS188 course notes (downloaded Summer 2015)
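A small sketch contrasting the two estimates, using made-up sample ages (the data is hypothetical, not from the slides):

```python
from collections import Counter

samples = [20, 21, 21, 22, 20, 23, 21]   # ages a_1 ... a_N (hypothetical)
N = len(samples)

# Model-based: first estimate P_hat(a) from sample frequencies, then take the expectation.
P_hat = {a: c / N for a, c in Counter(samples).items()}
expected_age_model_based = sum(a * p for a, p in P_hat.items())

# Model-free: average the samples directly, never forming P_hat.
expected_age_model_free = sum(samples) / N

print(expected_age_model_based, expected_age_model_free)  # identical for the same samples
```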
RL: model-free learning approach to estimating the value function
We want to improve our estimate of V by computing these averages:
V^π_{k+1}(s) ← Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V^π_k(s') ]
Idea: take samples of outcomes s' (by doing the action!) and average:
– sample_i = R(s, π(s), s'_i) + γ V^π_k(s'_i), for observed outcomes s'_1, s'_2, s'_3, ...
– V^π_{k+1}(s) ← (1/n) Σ_i sample_i
Slide: Berkeley CS188 course notes (downloaded Summer 2015)
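A minimal sketch of the "take samples and average" idea, assuming we can repeatedly restart the environment at state s; env.reset_to and env.step are hypothetical placeholders, not a real API:

```python
def sample_based_value(env, s, pi, V, gamma, n_samples=100):
    total = 0.0
    for _ in range(n_samples):
        env.reset_to(s)                  # hypothetical: restart the environment at state s
        s_next, r = env.step(pi(s))      # take the policy's action, observe outcome and reward
        total += r + gamma * V[s_next]   # sample_i = R(s, pi(s), s'_i) + gamma * V(s'_i)
    return total / n_samples             # average of the samples
```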
Sidebar: exponential moving average
The running interpolation update: x̄_n = (1 - α) · x̄_{n-1} + α · x_n
– Makes recent samples more important: the weight on each older sample decays by a factor of (1 - α) per update
– Forgets about the past (distant past values were wrong anyway)
Slide: Berkeley CS188 course notes (downloaded Summer 2015)
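A minimal sketch of the running interpolation update, with an arbitrary stream of samples for illustration:

```python
def update_average(x_bar, x, alpha=0.1):
    # x_bar <- (1 - alpha) * x_bar + alpha * x
    return (1 - alpha) * x_bar + alpha * x

x_bar = 0.0
for x in [4.0, 6.0, 5.0, 10.0]:
    x_bar = update_average(x_bar, x)   # recent samples weigh more; old ones decay by (1 - alpha)^k
```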
TD Value Learning
Big idea: learn from every experience!
– Update V(s) each time we experience a transition (s, a, s', r)
– Likely outcomes s' will contribute updates more often
Temporal difference learning of values
– Policy still fixed, still doing evaluation!
– Move values toward value of whatever successor occurs: running average
Sample of V(s): sample = r + γ V^π(s')
Update to V(s): V^π(s) ← (1 - α) V^π(s) + α · sample
Same update: V^π(s) ← V^π(s) + α (sample - V^π(s))
Slide: Berkeley CS188 course notes (downloaded Summer 2015)
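A sketch of the TD value-learning update for a fixed policy, assuming a dict-backed value table V; the function name is illustrative:

```python
def td_update(V, s, r, s_next, gamma=1.0, alpha=0.5):
    sample = r + gamma * V[s_next]                 # sample of V(s) from the observed transition
    V[s] = (1 - alpha) * V[s] + alpha * sample     # running average toward the sample
```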
TD Value Learning: example
States: A, B, C, D, E. Initial values: V(A)=0, V(B)=0, V(C)=0, V(D)=8, V(E)=0. Assume γ = 1, α = 1/2.
Observed transition: B, east, C, reward -2
– sample = -2 + γ V(C) = -2, so V(B) ← (1/2)(0) + (1/2)(-2) = -1
Observed transition: C, east, D, reward -2
– sample = -2 + γ V(D) = 6, so V(C) ← (1/2)(0) + (1/2)(6) = 3
Resulting values: V(A)=0, V(B)=-1, V(C)=3, V(D)=8, V(E)=0
Slide: Berkeley CS188 course notes (downloaded Summer 2015)
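Running the td_update sketch from above on the two observed transitions reproduces the numbers in this example:

```python
V = {'A': 0.0, 'B': 0.0, 'C': 0.0, 'D': 8.0, 'E': 0.0}
td_update(V, 'B', -2, 'C')   # sample = -2 + 0 = -2 -> V(B) = 0.5*0 + 0.5*(-2) = -1
td_update(V, 'C', -2, 'D')   # sample = -2 + 8 =  6 -> V(C) = 0.5*0 + 0.5*6   =  3
```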
What's the problem w/ TD Value Learning?
Can't turn the estimated value function into a policy!
This is how we did it when we were using value iteration:
π(s) = argmax_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V(s') ]
Why can't we do this now? Because we no longer know T or R.
Solution: Use TD value learning to estimate Q*, not V*
Detour: Q-Value Iteration
Value iteration: find successive (depth-limited) values
– Start with V_0(s) = 0, which we know is right
– Given V_k, calculate the depth k+1 values for all states:
  V_{k+1}(s) ← max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ]
But Q-values are more useful, so compute them instead
– Start with Q_0(s,a) = 0, which we know is right
– Given Q_k, calculate the depth k+1 q-values for all q-states:
  Q_{k+1}(s,a) ← Σ_{s'} T(s, a, s') [ R(s, a, s') + γ max_{a'} Q_k(s', a') ]
Slide: Berkeley CS188 course notes (downloaded Summer 2015)
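A sketch of the Q-value iteration backup with a known model, assuming T is given as T[s][a] -> list of (prob, s') pairs and R is a callable R(s, a, s'); these names are illustrative:

```python
def q_value_iteration(states, actions, T, R, gamma, n_iters=100):
    Q = {(s, a): 0.0 for s in states for a in actions}   # Q_0(s, a) = 0
    for _ in range(n_iters):
        Q_new = {}
        for s in states:
            for a in actions:
                # Q_{k+1}(s,a) = sum_{s'} T(s,a,s') [ R(s,a,s') + gamma * max_{a'} Q_k(s',a') ]
                Q_new[(s, a)] = sum(
                    p * (R(s, a, s2) + gamma * max(Q[(s2, a2)] for a2 in actions))
                    for p, s2 in T[s][a]
                )
        Q = Q_new
    return Q
```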
Q-Learning
Q-Learning: sample-based Q-value iteration
Learn Q(s,a) values as you go
– Receive a sample (s, a, s', r)
– Consider your old estimate: Q(s, a)
– Consider your new sample estimate: sample = r + γ max_{a'} Q(s', a')
– Incorporate the new estimate into a running average: Q(s, a) ← (1 - α) Q(s, a) + α · sample
Slide: Berkeley CS188 course notes (downloaded Summer 2015)
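A sketch of the tabular Q-learning update for one sample, assuming Q is a dict over (state, action) pairs and actions(s') returns the legal actions in s'; the signature is illustrative:

```python
def q_learning_update(Q, s, a, r, s_next, actions, gamma=0.9, alpha=0.1):
    # new sample estimate: r + gamma * max_{a'} Q(s', a')
    sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions(s_next))
    # running average toward the sample
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```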
Exploration v exploitation Image: Berkeley CS188 course notes (downloaded Summer 2015)
Exploration v exploitation: ε-greedy action selection
Several schemes for forcing exploration
– Simplest: random actions (ε-greedy)
  – Every time step, flip a coin
  – With (small) probability ε, act randomly
  – With (large) probability 1-ε, act on current policy
Problems with random actions?
– You do eventually explore the space, but keep thrashing around once learning is done
– One solution: lower ε over time
– Another solution: exploration functions
Slide: Berkeley CS188 course notes (downloaded Summer 2015)
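A minimal sketch of ε-greedy action selection over a tabular Q function; lowering epsilon over time, as noted above, is one way to reduce thrashing once learning settles:

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                # explore: act randomly
    return max(actions, key=lambda a: Q[(s, a)])     # exploit: act on the current greedy policy
```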
Generalizing across states
Basic Q-Learning keeps a table of all q-values
In realistic situations, we cannot possibly learn about every single state!
– Too many states to visit them all in training
– Too many states to hold the q-tables in memory
Instead, we want to generalize:
– Learn about some small number of training states from experience
– Generalize that experience to new, similar situations
– This is a fundamental idea in machine learning, and we'll see it over and over again
Slide: Berkeley CS188 course notes (downloaded Summer 2015)
Generalizing across states
Let's say we discover through experience that this state is bad. In naïve q-learning, we know nothing about this state, or even this one! [Pacman screenshots of three nearly identical states]
Slide: Berkeley CS188 course notes (downloaded Summer 2015)
Feature-based representations
Solution: describe a state using a vector of features (properties)
– Features are functions from states to real numbers (often 0/1) that capture important properties of the state
– Example features:
  – Distance to closest ghost
  – Distance to closest dot
  – Number of ghosts
  – 1 / (distance to dot)²
  – Is Pacman in a tunnel? (0/1)
  – ... etc.
  – Is it the exact state on this slide?
– Can also describe a q-state (s, a) with features (e.g. action moves closer to food)
Slide: Berkeley CS188 course notes (downloaded Summer 2015)
Linear value functions
Using a feature representation, we can write a q function (or value function) for any state using a few weights:
V(s) = w_1 f_1(s) + w_2 f_2(s) + ... + w_n f_n(s)
Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + ... + w_n f_n(s,a)
Advantage: our experience is summed up in a few powerful numbers
Disadvantage: states may share features but actually be very different in value!
Slide: Berkeley CS188 course notes (downloaded Summer 2015)
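A sketch of a linear q-function over features f_1 ... f_n as written above; the weights and feature functions here are illustrative placeholders that would come from some feature extractor and a learning rule:

```python
def q_linear(weights, features, s, a):
    # Q(s, a) = w_1*f_1(s, a) + w_2*f_2(s, a) + ... + w_n*f_n(s, a)
    return sum(w * f(s, a) for w, f in zip(weights, features))
```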