Unsupervised Methods for Subgoal Discovery During Intrinsic Motivation in Model-Free Hierarchical Reinforcement Learning
Jacob Rafati (http://rafati.net), Ph.D. Candidate, Electrical Engineering and Computer Science
Co-authored with David C. Noelle
Computational Cognitive Neuroscience Laboratory, University of California, Merced
Games
Goals & Rules
• “Key components of games are goals, rules, challenge, and interaction. Games generally involve mental or physical stimulation, and often both.” https://en.wikipedia.org/wiki/Game
Reinforcement Learning
Reinforcement learning (RL) is learning how to map situations (states) to actions so as to maximize the numerical reward signals received during the experiences that an artificial agent has as it interacts with its environment.
Experience: $e_t = \{s_t, a_t, s_{t+1}, r_{t+1}\}$
Objective: learn a policy $\pi : S \to A$ that maximizes cumulative rewards (Sutton and Barto, 2017).
Super-Human Success (Mnih et al., 2015)
Failure in a Complex Task (Mnih et al., 2015)
Learning Representations in Hierarchical Reinforcement Learning
• The trade-off between exploration and exploitation in an environment with sparse feedback is a major challenge.
• Learning to operate over different levels of temporal abstraction is an important open problem in reinforcement learning.
• Exploring the state space while learning reusable skills through intrinsic motivation.
• Discovering useful subgoals in large-scale hierarchical reinforcement learning is a major open problem.
Return
The return is the cumulative sum of received rewards:
$G_t = \sum_{t'=t+1}^{T} \gamma^{t'-t-1} r_{t'}$
where $\gamma \in [0, 1]$ is the discount factor.
(Figure: a trajectory $s_0, \dots, s_{t-1}, s_t, s_{t+1}, \dots, s_T$ with actions $a_{t-1}, a_t$ and rewards $r_{t-1}, r_t, r_{t+1}$.)
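As a concrete illustration of the return formula (not part of the original slides), here is a minimal Python sketch that accumulates discounted rewards backwards through an episode; the reward sequence and discount factor are arbitrary example values.

```python
# Minimal sketch (illustrative, not from the slides): computing the return
# G_t = sum_{t'=t+1}^{T} gamma^(t'-t-1) * r_{t'} for a list of rewards.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    # Accumulate from the end of the episode backwards: G_t = r_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.81
```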
Policy Function
• Policy function: at each time step, the agent implements a mapping from states to possible actions, $\pi : S \to A$.
• Objective: find an optimal policy that maximizes the cumulative reward,
$\pi^* = \arg\max_\pi \mathbb{E}_\pi\left[G_t \mid S_t = s\right], \quad \forall s \in S$.
Q-Function
• The state-action value function $Q^\pi : S \times A \to \mathbb{R}$ is the expected return when starting from $(s, a)$ and following policy $\pi$ thereafter:
$Q^\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$.
Temporal Difference
• A model-free reinforcement learning algorithm: state-transition probabilities and the reward function are not available.
• A powerful computational cognitive neuroscience model of learning in the brain.
• A combination of the Monte Carlo method and dynamic programming.
Q-learning update:
$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
where $Q(s, a)$ is the prediction of the return and $r + \gamma \max_{a'} Q(s', a')$ is the target value.
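A minimal tabular sketch of this Q-learning update, added for illustration; the action set, hyperparameters, and ε-greedy exploration scheme are assumptions, not details from the talk.

```python
import random
from collections import defaultdict

Q = defaultdict(float)             # Q[(state, action)] -> estimated return
alpha, gamma, epsilon = 0.1, 0.99, 0.1
actions = [0, 1, 2, 3]             # e.g., the four moves in a grid world (assumed)

def epsilon_greedy(state):
    # Explore with probability epsilon, otherwise act greedily w.r.t. Q.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(s, a, r, s_next):
    # target = r + gamma * max_a' Q(s', a');  prediction = Q(s, a)
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```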
Generalization
• Function approximation: $Q(s, a) \approx q(s, a; w)$.
(Figure: a function approximator with weights $w$ maps a state $s$ to the state-action values $q(s, a_i; w)$.)
Deep RL
$w^* = \arg\min_w L(w)$
$L(w) = \mathbb{E}_{(s, a, r, s') \sim D}\left[\left(r + \gamma \max_{a'} q(s', a'; w^-) - q(s, a; w)\right)^2\right]$
$D = \{e_t \mid t = 0, \dots, T\}$ is the experience replay memory.
Stochastic gradient descent: $w \leftarrow w - \alpha \nabla_w L(w)$
Q-Learning with experience replay memory
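A rough sketch of deep Q-learning with an experience replay memory and a target network, in the spirit of the loss $L(w)$ above. The network sizes, optimizer, and hyperparameters are illustrative assumptions, not the settings used in the talk, and the replay memory is assumed to be filled with transitions $(s, a, r, s', \text{done})$ during interaction with the environment.

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2   # placeholder sizes
gamma = 0.99

# Online network q(s, a; w) and target network q(s, a; w^-).
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)

# Experience replay memory D of tuples e_t = (s, a, r, s', done).
D = deque(maxlen=100_000)

def train_step(batch_size=32):
    if len(D) < batch_size:
        return
    batch = random.sample(D, batch_size)
    s, a, r, s2, done = map(np.array, zip(*batch))
    s = torch.as_tensor(s, dtype=torch.float32)
    s2 = torch.as_tensor(s2, dtype=torch.float32)
    a = torch.as_tensor(a, dtype=torch.int64)
    r = torch.as_tensor(r, dtype=torch.float32)
    done = torch.as_tensor(done, dtype=torch.float32)

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)                 # q(s, a; w)
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s2).max(1).values  # r + gamma max_a' q(s', a'; w^-)
    loss = nn.functional.mse_loss(q_sa, target)                         # L(w)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```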
Failure: Sparse Feedback → Subgoals (Botvinick et al., 2009)
Hierarchy in Human Behavior & Brain Structure
(Figure: a hierarchy relating major goals, minor goals, complex tasks, simple tasks, and actions.)
Hierarchical Reinforcement Learning Subproblems • Subproblem 1: Learning a meta-policy to choose a subgoal • Subproblem 2: Developing skills through intrinsic motivation • Subproblem 3: Subgoal discovery
Meta-Controller/Controller Framework (Kulkarni et al., 2016)
(Figure: the meta-controller selects a subgoal $g_t$; the controller selects actions $a_t$; an internal critic evaluates subgoal attainment and supplies intrinsic reward $\tilde{r}_{t+1}$; the environment returns $s_{t+1}$ and extrinsic reward $r_{t+1}$.)
Subproblem 1: Temporal Abstraction
Rooms Task
(Figure: a four-room grid world, Rooms 1–4.)
Subproblem 2: Developing Skills through Intrinsic Motivation
State-Goal Q-Function $q(s_t, g_t, a; w)$
(Figure: the state $s_t$ is encoded with a Gaussian representation, the subgoal $g_t$ gates a conjunctive distributed weighted representation, and fully connected layers output the values $q(s_t, g_t, a; w)$.)
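A simplified sketch of such a state-goal value network, added for illustration: the subgoal is one-hot encoded and used to gate a hidden representation of the state before the output layer. The layer sizes, the sigmoid gating, and the encodings are assumptions, not the architecture on the slide.

```python
import torch
import torch.nn as nn

class StateGoalQNet(nn.Module):
    """Sketch of q(s, g, a; w): a goal-gated state representation feeds an action-value head."""

    def __init__(self, state_dim, n_goals, n_actions, hidden=128):
        super().__init__()
        self.state_enc = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.goal_gate = nn.Sequential(nn.Linear(n_goals, hidden), nn.Sigmoid())  # gate in [0, 1]
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, n_actions))

    def forward(self, state, goal_onehot):
        h = self.state_enc(state)
        gate = self.goal_gate(goal_onehot)
        return self.head(h * gate)   # conjunctive (gated) state-goal representation

# Example usage with placeholder sizes: 4-dim state, 8 subgoals, 4 actions.
q_values = StateGoalQNet(4, 8, 4)(torch.randn(1, 4), torch.eye(8)[[2]])
```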
Reusing the Skills
(Figure sequence: the Grid World Task with Key and Door spanning Rooms 1–4, with the key, the lock, and the box marked; the same navigation skills are reused at successive stages of the task.)
Subproblem 3: Subgoal Discovery, i.e. finding a proper set of subgoals G (Şimşek et al., 2005; Goel and Huber, 2003; Machado et al., 2017)
Subproblem 3: Subgoal Discovery
• Purpose: discovering promising states to pursue, i.e. finding the subgoal set G.
• Implementing a subgoal discovery algorithm for large-scale model-free reinforcement learning problems.
• No access to the MDP model (state-transition probabilities, environment reward function, or the state space).
Subproblem 3: Candidate Subgoals
A useful candidate subgoal state:
• is close (in terms of actions) to a rewarding state, or
• represents a set of states, at least some of which tend to lie along a state-transition path to a rewarding state.
Subproblem 3: Subgoal Discovery
• Unsupervised learning (clustering) on the limited past experiences collected in the replay memory during intrinsic motivation learning.
• Centroids of clusters are useful subgoals (e.g. rooms).
• Detected outliers are potential subgoals (e.g. the key or the box).
• The boundary between two clusters can also yield subgoals (e.g. the doorway between rooms).
A minimal sketch of this idea follows.
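A minimal sketch, assuming states and rewards have been extracted from the replay memory and using scikit-learn's K-means: cluster centroids serve as region-like subgoals, and anomalously rewarding states (threshold chosen arbitrarily here) are kept as outlier subgoals. The function name and parameters are illustrative, not from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def discover_subgoals(states, rewards, k=4, reward_threshold=0.0):
    """Unsupervised subgoal discovery sketch on experiences from the replay memory."""
    states = np.asarray(states, dtype=np.float32)
    rewards = np.asarray(rewards, dtype=np.float32)

    # Centroids of K-means clusters (e.g. the centers of rooms) are candidate subgoals.
    kmeans = KMeans(n_clusters=k, n_init=10).fit(states)
    centroid_subgoals = kmeans.cluster_centers_

    # Outlier experiences with unusual (here: positive) rewards, e.g. reaching a key
    # or a box, are also kept as candidate subgoals.
    outlier_subgoals = states[rewards > reward_threshold]

    return centroid_subgoals, outlier_subgoals
```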
Unsupervised Subgoal Discovery
Unification of Hierarchical Reinforcement Learning Subproblems
• Implementing a hierarchical reinforcement learning framework that simultaneously performs subgoal discovery, learns skills through intrinsic motivation, and learns the meta-policy.
• The unifying element is the shared experience replay memory D.
Model-Free HRL
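A high-level sketch of how the three subproblems can share one training loop and one replay memory. Every name here (the meta_controller, controller, intrinsic_critic, env, and discover_subgoals objects) is a hypothetical placeholder standing in for the corresponding component, not the actual interface used in this work.

```python
def unified_hrl_episode(env, meta_controller, controller, intrinsic_critic,
                        D, subgoals, discover_subgoals, discovery_interval=10_000):
    """Run one episode of a unified model-free HRL loop (illustrative sketch)."""
    s = env.reset()
    done, step = False, 0
    while not done:
        g = meta_controller.select_subgoal(s, subgoals)       # subproblem 1: temporal abstraction
        extrinsic_return = 0.0
        while not done and not intrinsic_critic.reached(s, g):
            a = controller.select_action(s, g)                # subproblem 2: intrinsic motivation
            s2, r, done = env.step(a)
            r_intrinsic = intrinsic_critic.reward(s2, g)
            D.append((s, g, a, r, r_intrinsic, s2, done))     # shared experience replay memory D
            controller.update(D)
            extrinsic_return += r
            s, step = s2, step + 1
            if step % discovery_interval == 0:
                subgoals = discover_subgoals(D)               # subproblem 3: subgoal discovery
        meta_controller.update(s, g, extrinsic_return, D)
    return subgoals
```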
Rooms Task Results
(Figure: success in reaching subgoals (%), success in solving the task (%), and episode return over 100,000 training steps; our unified model-free HRL method vs. regular RL.)
Montezuma’s Revenge
(Figure: the meta-controller and controller network architectures.)
Montezuma’s Revenge Results
(Figure: average return over 10 episodes and success in reaching subgoals (%) over 2.5 million training steps; our unified model-free HRL method vs. the DeepMind DQN algorithm (Mnih et al., 2015).)
Conclusions
• Unsupervised learning can be used to discover useful subgoals in games.
• Subgoals can be discovered with model-free methods.
• Learning at multiple levels of temporal abstraction is key to solving games with sparse, delayed feedback.
• Intrinsic motivation learning and subgoal discovery can be unified in a model-free HRL framework.
References
• Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
• Sutton, R. S. and Barto, A. G. (2017). Reinforcement Learning: An Introduction. MIT Press, 2nd edition.
• Botvinick, M. M., Niv, Y., and Barto, A. C. (2009). Hierarchically organized behavior and its neural foundations: A reinforcement learning perspective. Cognition, 113(3):262–280.
• Goel, S. and Huber, M. (2003). Subgoal discovery for hierarchical reinforcement learning using learned policies. In Russell, I. and Haller, S. M., editors, FLAIRS Conference, pages 346–350. AAAI Press.
• Kulkarni, T. D., Narasimhan, K., Saeedi, A., and Tenenbaum, J. B. (2016). Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems (NeurIPS 2016).
• Machado, M. C., Bellemare, M. G., and Bowling, M. H. (2017). A Laplacian framework for option discovery in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), pages 2295–2304.
• Sutton, R. S., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211.
Slides, Paper, and Code: http://rafati.net Poster Session on Wednesday.