Recognizers: A Study in Learning How to Model Temporally Extended Behaviors
Jordan Frank (jordan.frank@cs.mcgill.ca)
Reasoning and Learning Lab, McGill University
http://www.cs.mcgill.ca/~jfrank8/
Joint work with Doina Precup
Background and Motivation
• Want a flexible way to represent hierarchical knowledge (Options [Sutton, Precup & Singh, 1999]).
• Want an efficient way to learn about these hierarchies (Recognizers [Precup et al., 2006]).
• Concerned with off-policy learning in environments with continuous state and action spaces [Precup, Sutton & Dasgupta, 2001].
Terminology
• Option: a tuple ⟨I, β, π⟩, where I is a set of initiation states, β a termination condition, and π a policy.
• Recognizer: a filter on actions. A recognizer specifies a class of policies that we are interested in learning about (both notions are sketched below).
• Off-policy learning: we are interested in learning about a target policy π by observing an agent whose behavior is governed by a different (possibly unknown) policy b.
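A minimal Python sketch (not from the talk) of the two objects just defined: an Option as a tuple of initiation set, termination condition, and policy, and a Recognizer as a function over state-action pairs. All names and types are illustrative assumptions.

```python
# Illustrative data structures for the terminology above; names are assumptions.
from dataclasses import dataclass
from typing import Any, Callable

State = Any
Action = Any

@dataclass
class Option:
    initiation: Callable[[State], bool]       # I: can the option start in state s?
    termination: Callable[[State], float]     # beta(s): probability of terminating in s
    policy: Callable[[State, Action], float]  # pi(s, a): probability of action a in s

# A recognizer c(s, a) in [0, 1] says to what extent action a is "allowed" in s.
Recognizer = Callable[[State, Action], float]
```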
Example Problem
• Puddle World [RL-Glue]
  - Continuous state space
  - Continuous action space
• Goal is to do off-policy learning. The behavior policy is unknown.
Recognizers: Formally
• An MDP is a tuple ⟨S, A, P, R⟩. At time step t, an agent receives a state s_t ∈ S and chooses an action a_t ∈ A.
• A fixed (unknown) behavior policy b : S × A → [0, 1] is used to generate actions.
• A recognizer is a function c : S × A → [0, 1], where c(s, a) indicates to what extent the recognizer allows action a in state s.
• The target policy π generated by b and c is
  π(s, a) = b(s, a) c(s, a) / Σ_x b(s, x) c(s, x) = b(s, a) c(s, a) / μ(s),
  where μ(s) is the recognition probability at s (a small numerical example follows below).
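As a small numerical illustration of the definition above, the sketch below computes π(s, ·) and μ(s) for a finite set of candidate actions; in the continuous-action setting of the talk the sum over x becomes an integral. The function and example values are illustrative, not from the talk.

```python
import numpy as np

def target_policy(b_row, c_row):
    """pi(s, .) and mu(s) over a finite set of candidate actions,
    given behavior probabilities b(s, .) and recognizer values c(s, .)."""
    weights = b_row * c_row          # b(s, x) * c(s, x) for each action x
    mu = weights.sum()               # recognition probability mu(s)
    return weights / mu, mu

# Toy check: 4 actions, uniform behavior, recognizer accepts only the first two.
b_row = np.full(4, 0.25)
c_row = np.array([1.0, 1.0, 0.0, 0.0])
pi_row, mu = target_policy(b_row, c_row)
print(pi_row, mu)                    # [0.5 0.5 0.  0. ] 0.5
```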
Importance Sampling
• Based on the following observation:
  E_π{x} = ∫ x π(x) dx = ∫ x (π(x)/b(x)) b(x) dx = E_b{x π(x)/b(x)}.
• We are trying to learn about a target policy π using samples drawn from a behavior policy b, so we just need to compute the appropriate weights.
• The weights (also called corrections) are given by
  ρ(s, a) = π(s, a)/b(s, a) = c(s, a)/μ(s)
  (see the update sketch below).
• Full details of the algorithm are given in Precup et al. (2006).
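The sketch below shows how the correction ρ(s, a) = c(s, a)/μ(s) would re-weight a single linear TD(0) update. It is only a schematic of the idea, not the full algorithm of Precup et al. (2006), which also handles eligibility traces and option termination; all hyperparameters are placeholders.

```python
import numpy as np

def corrected_td_update(theta, phi_s, phi_s_next, reward, c_sa, mu_s,
                        alpha=0.1, gamma=1.0):
    """One linear TD(0) step re-weighted by rho = c(s, a) / mu(s).
    Schematic only; traces and termination from the full algorithm are omitted."""
    rho = c_sa / mu_s                                        # importance-sampling correction
    td_error = reward + gamma * theta @ phi_s_next - theta @ phi_s
    return theta + alpha * rho * td_error * phi_s

theta = np.zeros(3)
theta = corrected_td_update(theta, np.array([1.0, 0.0, 0.0]),
                            np.array([0.0, 1.0, 0.0]), reward=1.0,
                            c_sa=1.0, mu_s=0.5)
print(theta)                                                 # [0.2 0.  0. ]
```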
Importance Sampling Correction
• μ(s) depends on b.
• If b is unknown, we can use a maximum-likelihood estimate μ̂ : S → [0, 1].
• For linear function approximation, we can use logistic regression with the same set of features in order to estimate μ (sketched below).
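A hedged sketch of that idea: fit a logistic-regression model μ̂ on the same feature vectors used for value estimation, with a 0/1 label indicating whether the behavior action was recognized. The training loop and toy data below are illustrative assumptions, not the exact procedure used in the experiments.

```python
import numpy as np

def fit_mu_hat(features, recognized, epochs=200, lr=0.5):
    """Estimate mu_hat(s) = P(behavior action recognized | s) by logistic
    regression on the value-function features. Plain gradient ascent on the
    log-likelihood; epochs and learning rate are arbitrary choices."""
    w = np.zeros(features.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-features @ w))    # predicted recognition probability
        w += lr * features.T @ (recognized - p) / len(recognized)
    return lambda phi: 1.0 / (1.0 + np.exp(-phi @ w))

# Toy data: recognition is more likely when the second feature is active.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 0.0, 1.0])
mu_hat = fit_mu_hat(X, y)
print(mu_hat(np.array([1.0, 1.0])))                # high (data separable on feature 2)
```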
Experiment 1: Puddle World [RL-Glue]
• Continuous state space, continuous actions. Movement is noisy.
• Positive reward for reaching the goal (+10), negative reward for entering the puddle (−10 at its middle).
• The start state is chosen randomly within a small square in the lower-left corner. Reaching the goal moves the agent back to the start state.
Experiment 1: Setup
• Standard tile-coding function approximation for the state space.
• The behavior policy picks actions uniformly at random; the target policy is to pick actions that lead directly towards the goal state.
• Binary recognizer: recognizes actions within a 45° cone facing directly towards the goal state (a sketch of such a recognizer follows below). A recognizer episode can be initiated anywhere, and terminates when either the goal state or the puddle is entered.
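The following sketch shows one plausible implementation of such a cone recognizer. The goal location and the exact cone width are assumptions; the slide only states that the cone is 45° wide and faces the goal.

```python
import numpy as np

def cone_recognizer(state_xy, action_xy, goal_xy=(1.0, 1.0), half_angle_deg=22.5):
    """Binary recognizer: accept an action whose direction lies within a cone
    pointing from the current position towards the goal. The goal position and
    the 22.5-degree half-angle (a 45-degree cone) are illustrative assumptions."""
    to_goal = np.asarray(goal_xy, dtype=float) - np.asarray(state_xy, dtype=float)
    a = np.asarray(action_xy, dtype=float)
    cos_angle = (to_goal @ a) / (np.linalg.norm(to_goal) * np.linalg.norm(a) + 1e-12)
    return 1.0 if cos_angle >= np.cos(np.radians(half_angle_deg)) else 0.0

print(cone_recognizer([0.2, 0.2], [0.05, 0.05]))    # towards the goal -> 1.0
print(cone_recognizer([0.2, 0.2], [-0.05, 0.0]))    # away from the goal -> 0.0
```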
Experiment 1: Results
[Figure: learned reward model over the state space, colour scale from −10 to 5]
• This matches our intuition that moving directly towards the goal is good unless you are below and to the left of the puddle.
Experiment 1: Results
[Figures: state value estimates (estimated value with estimated recognition probability vs. true value with exact recognition probability; expected reward vs. number of steps ×10^4) and recognition probability estimates (recognition probability vs. number of steps ×10^4)]
• We observe that the recognition probability estimate converges to the correct value, and that estimating this value during learning does not bias our state value estimates.
Experiment 2: Ship Steering [RL-Glue]
• Stochastic environment. 3D continuous state space, 2D continuous actions (throttle and rudder angle).
• The goal is to keep a ship on a desired heading with a high velocity.
Experiment 2: Setup
• Goal is to demonstrate that we can learn about multiple recognizers from one stream of experience.
• The behavior policy picks a rudder orientation randomly to bring the ship towards the desired heading.
• Four recognizers recognize different ranges of motion, from small, smooth rudder adjustments to huge, sharp ones (see the sketch below).
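One plausible way to define such a family of recognizers is to filter on the magnitude of the rudder change, as in the sketch below. The threshold values and the dictionary-style state/action interface are invented for illustration and are not the ones used in the experiment.

```python
def make_rudder_recognizer(min_change, max_change):
    """Build a binary recognizer that accepts only rudder adjustments whose
    magnitude falls in [min_change, max_change). Thresholds are illustrative."""
    def recognizer(state, action):
        change = abs(action["rudder"] - state["rudder"])
        return 1.0 if min_change <= change < max_change else 0.0
    return recognizer

# Four recognizers sharing one stream of experience, from smooth to sharp.
small  = make_rudder_recognizer(0.0, 0.1)
medium = make_rudder_recognizer(0.1, 0.3)
large  = make_rudder_recognizer(0.3, 0.6)
huge   = make_rudder_recognizer(0.6, float("inf"))

print(small({"rudder": 0.0}, {"rudder": 0.05}))   # 1.0
print(huge({"rudder": 0.0}, {"rudder": 0.05}))    # 0.0
```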
Experiment 2: Results
[Figures: expected reward vs. number of steps (×10^8) for two conditions, "way off course and moving slowly" and "way off course and moving quickly", comparing recognizers for small, medium, large, and huge adjustments with recognizing all actions]
• We can see that policies that make smaller rudder adjustments outperform those that make large adjustments.
Conclusion and Future Work
• Recognizers are useful for learning about options when we cannot control, or do not know, the behavior policy.
• Convergence has been shown for state aggregation; proofs for function approximation are still needed, but the empirical results are promising.
• More experiments.
Questions?
• RL-Glue, University of Alberta. http://rlai.cs.ualberta.ca/RLBB/top.html
• Precup, D., Sutton, R.S., and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proceedings of the 18th International Conference on Machine Learning, pages 417-424.
• Precup, D., Sutton, R.S., Paduraru, C., Koop, A., and Singh, S. (2006). Off-policy learning with recognizers. In Advances in Neural Information Processing Systems 18 (NIPS*05).
• Sutton, R.S., Precup, D., and Singh, S.P. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181-211.