Hierarchical RL and Skill Discovery (CS 330)

  1. Hierarchical RL and Skill Discovery (CS 330)

  2. The Plan: Information-theoretic concepts, Skill discovery, Using discovered skills, Hierarchical RL.

  3. Why Skill Discovery? What if we want to discover interesting behaviors? [Postural hand synergies for tool use, Santello et al., 1998] [The construction of movement by the spinal cord, Tresch et al., 1999]

  4. Why Skill Discovery? A more practical version: coming up with tasks is tricky… Exercise: write down task ideas for a tabletop manipulation scenario. [Meta-World, Yu, Quillen, He, et al., 2019]

  5. Why Hierarchical RL? Performing tasks at various levels of abstraction, and better exploration. Example hierarchy: Bake a cheesecake > Buy ingredients > Go to the store > Walk to the door > Take a step > Contract muscle X.

  6. The Plan: Information-theoretic concepts (up next), Skill discovery, Using discovered skills, Hierarchical RL.

  7. Entropy. (Slide adapted from Sergey Levine.)
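The entropy formula on this slide did not survive the transcript; the standard definition (not copied from the slide) is:

```latex
% Shannon entropy: expected "surprise" under p; high entropy = spread-out / uncertain distribution
\mathcal{H}(p) \;=\; -\sum_x p(x)\,\log p(x) \;=\; \mathbb{E}_{x \sim p}\!\big[-\log p(x)\big]
```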

  8. KL-divergence: a (non-symmetric) measure of distance between two distributions.
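Likewise, the slide's equation is missing from the transcript; the standard definition, with the caveat that KL is asymmetric, so "distance" is informal:

```latex
% KL divergence from p to q: non-negative, zero iff p = q, asymmetric in its arguments
D_{\mathrm{KL}}(p \,\|\, q) \;=\; \sum_x p(x)\,\log\frac{p(x)}{q(x)}
  \;=\; \mathbb{E}_{x \sim p}\!\big[\log p(x) - \log q(x)\big]
```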

  9. Mutual information. Which pair has high MI? x = it rains tomorrow, y = the streets are wet tomorrow (high MI); x = it rains tomorrow, y = we find life on Mars tomorrow (essentially zero MI). (Slide adapted from Sergey Levine.)

  10. Mutual information. (Slide adapted from Sergey Levine.)
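For reference (the slide's equations are not in the transcript), mutual information can be written either as a reduction in entropy or as a KL divergence:

```latex
% Mutual information: how much observing Y reduces uncertainty about X (and vice versa)
\mathcal{I}(X; Y) \;=\; \mathcal{H}(X) - \mathcal{H}(X \mid Y)
  \;=\; D_{\mathrm{KL}}\big(p(x, y) \,\|\, p(x)\,p(y)\big)
```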

  11. The Plan: Information-theoretic concepts, Skill discovery, Using discovered skills, Hierarchical RL.

  12. Soft Q-learning. Objective, value- and Q-functions, and the policy. Q-learning acts greedily, π(a | s) = argmax_a Q_θ(s, a); soft Q-learning instead uses an energy-based softmax policy, π(a | s) ∝ exp(Q_θ(s, a)).
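Since the equations on this slide were garbled in extraction, here is the standard form of the maximum-entropy objective and the soft value function behind soft Q-learning (temperature α; notation may differ slightly from the original slide):

```latex
% MaxEnt RL objective: expected reward plus policy entropy, weighted by temperature \alpha
J(\pi) \;=\; \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
  \Big[\, r(s_t, a_t) \;+\; \alpha\,\mathcal{H}\big(\pi(\cdot \mid s_t)\big) \,\Big]
% Soft value function and the corresponding optimal energy-based policy
V_{\mathrm{soft}}(s) \;=\; \alpha \log \sum_a \exp\!\big(Q_{\mathrm{soft}}(s, a)/\alpha\big),
\qquad
\pi^*(a \mid s) \;=\; \exp\!\big(\,(Q_{\mathrm{soft}}(s, a) - V_{\mathrm{soft}}(s))/\alpha\,\big)
```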

  13. Soft Q-learning benefits: exploration, fine-tunability, robustness. [Haarnoja et al., RL with Deep Energy-Based Policies, 2017]

  14. Learning diverse skills (indexed by a skill / task index z). Why can't we just use MaxEnt RL? 1. Action entropy is not the same as state entropy: the agent can take very different actions but still land in similar states. 2. MaxEnt policies are stochastic but not always controllable: intuitively, we want low diversity for a fixed z and high diversity across z's. Intuition: different skills should visit different regions of the state space. (Slide adapted from Sergey Levine.) [Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need.]

  15. Diversity-promoting reward function: the policy (agent), conditioned on a skill z, acts in the environment; a discriminator D tries to predict the skill from the resulting state, and the policy is rewarded for visiting states from which the skill is easy to predict. (Slide adapted from Sergey Levine.) [Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need.]
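A minimal sketch of how such a diversity-promoting reward can be computed, assuming a uniform prior over a finite set of skills; the function and argument names are illustrative, not from the paper's code:

```python
import numpy as np

def diversity_reward(discriminator_logits, z, num_skills):
    """Intrinsic reward r = log q(z | s) - log p(z) for a uniform skill prior.

    discriminator_logits: unnormalized discriminator scores over all skills,
        evaluated at the state s the agent reached.
    z: index of the skill this trajectory was conditioned on.
    """
    # log-softmax for a numerically stable log q(z | s)
    log_probs = discriminator_logits - np.logaddexp.reduce(discriminator_logits)
    log_q_z_given_s = log_probs[z]
    log_p_z = -np.log(num_skills)        # uniform prior: p(z) = 1 / num_skills
    return log_q_z_given_s - log_p_z     # large when z is easy to infer from s
```

With a uniform prior the -log p(z) term is a constant offset, so the reward is positive exactly when the discriminator does better than chance at recovering the skill.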

  16. Examples of learned tasks: Cheetah, Ant, Mountain Car. [Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need.]

  17. A connection to mutual information. (Slide adapted from Sergey Levine.) [Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need.] See also: Gregor et al., Variational Intrinsic Control, 2016.
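The connection the slide refers to is the standard DIAYN derivation: the objective combines mutual-information and entropy terms, and the intractable I(S; Z) is replaced by a variational lower bound using the learned discriminator q_φ(z | s):

```latex
% DIAYN objective (maximized): skills should determine states, not actions directly
\mathcal{F}(\theta) \;=\; \mathcal{I}(S; Z) \;+\; \mathcal{H}(A \mid S) \;-\; \mathcal{I}(A; Z \mid S)
% Variational lower bound on I(S;Z) via the discriminator q_\phi(z | s)
\mathcal{I}(S; Z) \;=\; \mathcal{H}(Z) - \mathcal{H}(Z \mid S)
  \;\ge\; \mathbb{E}_{z \sim p(z),\, s}\big[\log q_\phi(z \mid s) - \log p(z)\big]
```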

  18. The Plan: Information-theoretic concepts, Skill discovery, Using discovered skills, Hierarchical RL.

  19. How to use learned skills? How can we use the learned skills to accomplish a task? Learn a higher-level policy that operates on z's. [Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need.]
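A minimal sketch of "a policy that operates on z's": a meta-policy picks a skill index, the frozen skill policy executes it for a fixed number of steps, and the meta-policy is trained on the task reward. A gym-style environment is assumed and all names are illustrative:

```python
def run_skills_hierarchically(env, meta_policy, skill_policy, skill_len=100, max_steps=1000):
    """Use frozen skills for a downstream task: meta_policy(s) -> skill index z,
    skill_policy(s, z) -> low-level action. Both callables are placeholders."""
    s = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done and steps < max_steps:
        z = meta_policy(s)                      # high-level "action" = which skill to run
        for _ in range(skill_len):
            a = skill_policy(s, z)              # pre-trained skill, kept frozen
            s, reward, done, _ = env.step(a)
            total_reward += reward              # meta-policy is trained on this task reward
            steps += 1
            if done:
                break
    return total_reward
```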

  20. Results: hierarchical RL. Can we do better? [Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need.]

  21. What's the problem? The skills might not be particularly useful, and it is not easy to use the learned skills. What makes a useful skill?

  22. What's the problem? Some skills have consequences that are hard to predict; others have consequences that are easy to predict.

  23. A slightly different mutual information: the future should be hard to predict across different skills, but predictable for a given skill. [Sharma, Gu, Levine, Kumar, Hausman, DADS, 2019.]
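In symbols, DADS maximizes the mutual information between the skill and the next state, conditioned on the current state:

```latex
% Maximize I(s'; z | s): outcomes should be diverse across skills (first term)
% but predictable for a given skill (second term)
\mathcal{I}(s'; z \mid s) \;=\; \mathcal{H}(s' \mid s) \;-\; \mathcal{H}(s' \mid s, z)
```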

  24. Skill-dynamics model: we learn a skill-conditioned dynamics model q(s' | s, z) rather than a conventional global dynamics model, and the skills are optimized specifically to make skill-dynamics easier to model. [Sharma, Gu, Levine, Kumar, Hausman, DADS, 2019.]

  25. DADS algorithm: repeat { sample skills z ~ p(z); collect rollouts (s_1, a_1) … (s_T, a_T) with the policy π(a | s, z); update the skill-dynamics model q_θ(s' | s, z) on those transitions; compute intrinsic rewards r_z(s, a, s'); update π(a | s, z) on the relabeled rollouts (s_1, a_1, r_1) … (s_T, a_T, r_T) }. [Sharma, Gu, Levine, Kumar, Hausman, DADS, 2019.]
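A compressed sketch of one iteration of this loop, including the intrinsic reward built from the skill-dynamics model. Helper names (collect_rollout, skill_dynamics.fit / .log_prob, policy.update, skill_prior.sample) are placeholders, not the paper's API:

```python
import numpy as np

def dads_intrinsic_reward(log_q, s, s_next, z, prior_skills):
    """r_z(s, s') = log q(s'|s,z) - log[(1/L) sum_i q(s'|s,z_i)], with z_i ~ p(z).
    log_q(s, z, s_next) returns the log-density of the skill-dynamics model."""
    log_under_z = log_q(s, z, s_next)
    log_under_prior = np.array([log_q(s, zi, s_next) for zi in prior_skills])
    log_marginal = np.logaddexp.reduce(log_under_prior) - np.log(len(prior_skills))
    return log_under_z - log_marginal

def dads_iteration(env, policy, skill_dynamics, skill_prior, horizon=200):
    """One round of the alternation on the slide (all helpers are placeholders)."""
    z = skill_prior.sample()
    rollout = collect_rollout(env, policy, z, horizon)     # list of (s, a, s') tuples
    skill_dynamics.fit(rollout, z)                          # maximum-likelihood update of q
    rewards = [dads_intrinsic_reward(skill_dynamics.log_prob, s, s2, z,
                                     [skill_prior.sample() for _ in range(32)])
               for (s, _a, s2) in rollout]
    policy.update(rollout, rewards, z)                      # any RL update, e.g. SAC-style
```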

  26. DADS results (comparison with DIAYN).

  27. Using learned skills: use skill-dynamics for model-based planning; plan over skills, not actions; tasks can be learned zero-shot. (Diagram: propose candidate skill plans p_i = (z_1, z_2, …, z_H); roll each out into (s_0, a_0, …, s_H, a_H) using the skill-dynamics model q_θ and the policy π; compute or estimate the cumulative reward ȓ_i; update the planner; iterate.)
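A sketch of that planning loop, using a simple random-shooting planner in place of whatever planner the paper uses; all names are illustrative:

```python
import numpy as np

def plan_in_skill_space(s0, predict_next_state, task_reward, horizon=5,
                        n_candidates=64, skill_dim=2):
    """Zero-shot planning over skills: sample candidate skill sequences, roll them
    out through the learned skill-dynamics model, and keep the best one."""
    best_plan, best_return = None, -np.inf
    for _ in range(n_candidates):
        plan = np.random.uniform(-1.0, 1.0, size=(horizon, skill_dim))  # z_1 ... z_H
        s, total = s0, 0.0
        for z in plan:
            s_next = predict_next_state(s, z)   # outcome of skill z predicted by q(s'|s,z)
            total += task_reward(s, s_next)     # task reward evaluated on model predictions
            s = s_next
        if total > best_return:
            best_plan, best_return = plan, total
    return best_plan  # execute its first skill with pi(a|s,z), then re-plan (MPC-style)
```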

  28. Summary: - Two skill-discovery algorithms that use mutual information - Predictability can be used as a proxy for "usefulness" - A method that optimizes for both predictability and diversity - Model-based planning in the skill space - Opens new avenues such as unsupervised meta-RL [Gupta et al., Unsupervised Meta-Learning for RL, 2018]

  29. The Plan: Information-theoretic concepts, Skill discovery, Using discovered skills, Hierarchical RL (up next).

  30. Why Hierarchical RL? Performing tasks at various levels of abstraction, and better exploration. Example hierarchy: Bake a cheesecake > Buy ingredients > Go to the store > Walk to the door > Take a step > Contract muscle X.

  31. Hierarchical RL – design choices: goal-conditioned vs. not; pre-trained vs. end-to-end (e2e); self-terminating vs. fixed rate; on-policy vs. off-policy. (Diagram: a high-level policy emits latent commands z_1, z_2, …; a low-level policy consumes the current command together with the state s_t and sends actions a_t to the environment.)

  32. Learning Locomotor Controllers: a high-level controller sends a command to a low-level controller, updated every K steps; the high level receives task-specific information while the low level receives proprioceptive information. This paper's design choices: HL and LL trained separately; trained with policy gradients; hierarchical noise; commands at a fixed rate. [Heess, Wayne, Tassa, Lillicrap, Riedmiller, Silver, Learning Locomotor Controllers, 2016.]

  33. Option-Critic: an option is a self-terminating mini-policy, and everything is trained together with policy gradient (so among the design choices above: end-to-end, self-terminating). [Bacon, Harb, Precup, The Option-Critic Architecture, 2016.]
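To make "self-terminating mini-policy" concrete, here is a sketch of the call-and-return execution of options (only the execution loop, not the Option-Critic learning rules; names are illustrative and a gym-style environment is assumed):

```python
import numpy as np

def run_options(env, policy_over_options, option_policies, terminations, max_steps=1000):
    """Pick an option w, follow its intra-option policy pi_w(a | s) until its
    termination function beta_w(s) fires, then pick a new option."""
    s = env.reset()
    w = policy_over_options(s)                       # high-level choice of option
    for _ in range(max_steps):
        a = option_policies[w](s)                    # intra-option policy pi_w(a | s)
        s, reward, done, _ = env.step(a)
        if done:
            break
        if np.random.rand() < terminations[w](s):    # beta_w(s): prob. the option ends here
            w = policy_over_options(s)
```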

  34. Relay Policy Learning (same design-choices checklist: goal-conditioned vs. not; pre-trained vs. e2e; self-terminating vs. fixed rate; on-policy vs. off-policy). [Gupta, Kumar, Lynch, Levine, Hausman, Relay Policy Learning, 2019.]

  35. Relay Policy Learning, this paper's choices: goal-conditioned policies with relabeling; demonstrations used to pre-train everything; on-policy training. [Gupta, Kumar, Lynch, Levine, Hausman, Relay Policy Learning, 2019.]

  36. HIRO, this paper's choices: goal-conditioned policies with relabeling; off-policy training made possible through off-policy corrections. [Nachum, Gu, Lee, Levine, HIRO, 2018.]
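HIRO's goal-conditioned low-level reward has a simple standard form (the high level proposes a desired change in the state, and the low level is rewarded for realizing it); this is a sketch assuming compatible state/goal shapes:

```python
import numpy as np

def low_level_goal_reward(s, goal, s_next):
    """Intrinsic reward for the low-level policy: reach the state offset the high
    level asked for, r = -||s + g - s'||. Off-policy training of the high level
    then relies on relabeling goals so that stored low-level behavior stays likely
    (the "off-policy corrections" mentioned on the slide)."""
    return -np.linalg.norm(s + goal - s_next)
```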
