The Option-Critic Architecture
Pierre-Luc Bacon, Jean Harb, Doina Precup
Reasoning and Learning Lab, McGill University, Montreal, Canada
AAAI 2017
Intelligence: the ability to generalize and adapt efficiently to new and uncertain situations
• Having good representations is key
“[...] solving a problem simply means representing it so as to make the solution transparent.” — Simon, 1969
Reinforcement Learning: a general framework for AI
Equipped with a good state representation, RL has led to impressive results:
• Tesauro’s TD-Gammon (1995)
• Watson’s Daily-Double wagering in Jeopardy! (2013)
• Human-level video game play in the Atari games (2013)
• AlphaGo (2016)...
The ability to abstract knowledge temporally, over many different time scales, is still missing.
Temporal abstraction
• Higher-level steps: choosing the type of coffee maker, type of coffee beans
• Medium-level steps: grind the beans, measure the right quantity of water, boil the water
• Lower-level steps: wrist and arm movements while adding coffee to the filter, ...
Temporal abstraction in AI
A cornerstone of AI planning since the 1970s:
• Fikes et al. (1972), Newell (1972), Kuipers (1979), Korf (1985), Laird (1986), Iba (1989), Drescher (1991), etc.
It has been shown to:
• Generate shorter plans
• Reduce the complexity of choosing actions
• Provide robustness against model misspecification
• Improve exploration by taking shortcuts in the environment
Temporal abstraction in RL
Options (Sutton, Precup & Singh, 1999) can represent courses of action at variable time scales:
[Figure: a sample trajectory over time, shown both at the high level (options) and at the low level (primitive actions)]
Options framework
An option ω is a triple:
1. initiation set: I_ω
2. internal policy: π_ω
3. termination condition: β_ω
Example (robot navigation): if there is no obstacle in front (I_ω), go forward (π_ω) until you get too close to another object (β_ω).
We can derive a policy over options π_Ω that maximizes the expected discounted sum of rewards:
$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, s_0, \omega_0 \right]$
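To make the triple concrete, below is a minimal Python sketch (not the authors' code) of an option and its call-and-return execution; the Option class, the gym-style env.step interface, and the discount factor are illustrative assumptions.

```python
# Minimal sketch of an option and call-and-return execution (illustrative only).
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    initiation: Callable[[int], bool]     # I_omega: can the option start in state s?
    policy: Callable[[int], int]          # pi_omega: action to take in state s
    termination: Callable[[int], float]   # beta_omega: probability of terminating in s

def run_option(env, state, option, gamma=0.99):
    """Execute the option's internal policy until its termination condition fires."""
    assert option.initiation(state), "an option can only start inside its initiation set"
    total_reward, discount, done = 0.0, 1.0, False
    while not done:
        action = option.policy(state)
        state, reward, done, _ = env.step(action)   # gym-style interface (assumed)
        total_reward += discount * reward
        discount *= gamma
        if random.random() < option.termination(state):
            break
    return state, total_reward, discount, done
```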
Contribution of this work
The problem of constructing/discovering good options has been a challenge for more than 15 years. Option-critic is a scalable solution to this problem:
• Online, continual and model-free (but models can be used if desired)
• Requires no a priori domain knowledge, decomposition, or human intervention
• Learns within a single task, at least as fast as methods that do not use temporal abstraction
• Applies to general continuous state and action spaces
Actor-Critic Architecture (Sutton, 1984)
[Diagram: the actor (policy) takes the state s_t and emits the action a_t; the critic (value function) observes the reward r_t and sends a TD error to the actor as a policy-gradient signal; both interact with the environment]
• The policy (actor) is decoupled from its value function
• The critic provides feedback to improve the actor
• Learning is fully online
Option-Critic Architecture
[Diagram: the policy over options π_Ω selects the option ω_t; the options (π_ω, β_ω) emit actions a_t; the critic maintains Q_U and A_Ω and sends TD errors and gradients back; all interact with the environment through s_t and r_t]
• Parameterize the internal policies and termination conditions
• The policy over options is computed by a separate process
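As a rough sketch of how these pieces interact (an illustration, not the authors' implementation), one interaction step in a tabular setting might look as follows; taking the policy over options to be ε-greedy over the critic's option values Q_Ω is an assumption here, and the Option objects are the hypothetical ones from the earlier sketch.

```python
# Sketch of one option-critic interaction step in a tabular setting (illustrative).
import numpy as np

def epsilon_greedy_option(Q_Omega, state, epsilon=0.05):
    """Policy over options, computed from the critic's estimates Q_Omega[state, option]."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q_Omega.shape[1])
    return int(np.argmax(Q_Omega[state]))

def interaction_step(env, state, omega, Q_Omega, options, epsilon=0.05):
    """Act with the current option's policy; re-select an option if it terminates."""
    action = options[omega].policy(state)                          # pi_omega
    next_state, reward, done, _ = env.step(action)                 # gym-style interface (assumed)
    if np.random.rand() < options[omega].termination(next_state):  # beta_omega
        omega = epsilon_greedy_option(Q_Omega, next_state, epsilon)
    return next_state, omega, action, reward, done
```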
Main result: Gradient updates
• The gradient w.r.t. the internal policy parameters θ is given by:
  $\mathbb{E}\left[\frac{\partial \log \pi_{\omega,\theta}(a \mid s)}{\partial \theta}\, Q_U(s, \omega, a)\right]$
  This has the usual interpretation: take better primitives more often inside the option.
• The gradient w.r.t. the termination parameters ν is given by:
  $\mathbb{E}\left[-\frac{\partial \beta_{\omega,\nu}(s')}{\partial \nu}\, A_{\pi_\Omega}(s', \omega)\right]$
  where $A_{\pi_\Omega} = Q_{\pi_\Omega} - V_{\pi_\Omega}$ is the advantage function.
  This means that we want to lengthen options that have a large advantage.
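A minimal tabular sketch of these two updates, assuming softmax internal policies with parameters theta and sigmoid termination functions with parameters nu; the learning rate and the greedy approximation of V_Ω by max_ω Q_Ω(s', ω) are assumptions made for illustration.

```python
# Sketch of the intra-option policy and termination gradient updates (illustrative).
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def intra_option_policy_update(theta, s, omega, a, Q_U, lr=0.25):
    """theta[omega, s, :] parameterizes pi_{omega,theta}(.|s) via a softmax.
    Ascend along lr * d log pi(a|s) / d theta * Q_U(s, omega, a)."""
    probs = softmax(theta[omega, s])
    grad_log = -probs
    grad_log[a] += 1.0                            # gradient of log-softmax w.r.t. the logits
    theta[omega, s] += lr * grad_log * Q_U[s, omega, a]

def termination_update(nu, s_next, omega, Q_Omega, lr=0.25):
    """nu[omega, s'] parameterizes beta_{omega,nu}(s') via a sigmoid.
    Gradient ascent on the return gives nu -= lr * d beta / d nu * A(s', omega)."""
    beta = 1.0 / (1.0 + np.exp(-nu[omega, s_next]))
    advantage = Q_Omega[s_next, omega] - Q_Omega[s_next].max()   # V approximated greedily (assumption)
    nu[omega, s_next] -= lr * beta * (1.0 - beta) * advantage
```

With this update, a large advantage for the current option pushes β_ω down, which is exactly the "lengthen options that have a large advantage" behaviour described above.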
Results: Options transfer
[Figure: four-rooms gridworld showing the walls, the hallways, the initial goal, and the random goal location used after 1000 episodes]
Results: Options transfer
[Plot: steps per episode vs. episodes (0 to 2000) for SARSA(0) and AC-PG with primitive actions, and for option-critic (OC) with 4 and 8 options; the goal moves randomly after episode 1000]
• Learning in the first task is no slower than with primitive actions
• Once the goal is moved, learning is faster with the temporal abstractions discovered by option-critic
Results: Learned options are intuitive
[Figure: probability of terminating in each state of the four-rooms domain, shown separately for Option 1 through Option 4]
• Terminations are more likely near the hallways (although no pseudo-rewards are provided)
Results: Nonlinear function approximation
[Diagram: the last 4 frames pass through convolutional layers into a shared representation, which feeds the policy over options, the termination functions, and the internal policies]
Same architecture as DQN (Mnih et al., 2013) for the first 4 layers, but hybridized with the options and the policy over them.
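As an illustration of this hybrid architecture, here is a hypothetical PyTorch sketch of a shared convolutional trunk feeding three option-critic heads; the layer sizes and head shapes are assumptions and not necessarily the configuration used in the paper.

```python
# Sketch of a DQN-style trunk shared by the three option-critic heads (illustrative).
import torch
import torch.nn as nn

class OptionCriticNet(nn.Module):
    def __init__(self, num_actions, num_options):
        super().__init__()
        self.num_actions, self.num_options = num_actions, num_options
        self.trunk = nn.Sequential(                       # shared representation
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
        )
        self.q_omega = nn.Linear(512, num_options)                       # option values (policy over options)
        self.termination = nn.Linear(512, num_options)                   # beta_omega, one per option
        self.policy_logits = nn.Linear(512, num_options * num_actions)   # pi_omega, one per option

    def forward(self, frames):                            # frames: (batch, 4, 84, 84), last 4 frames
        features = self.trunk(frames / 255.0)
        q_omega = self.q_omega(features)
        betas = torch.sigmoid(self.termination(features))
        pi = torch.softmax(
            self.policy_logits(features).view(-1, self.num_options, self.num_actions), dim=-1)
        return q_omega, betas, pi
```

The policy over options can then be derived from q_omega (e.g., ε-greedy), while the termination and intra-option policy heads are trained with the gradients from the previous slide.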
Performance matching or better than DQN
[Plots: average score vs. training epoch (0 to 200) for option-critic and DQN on four Atari games: (a) Asterix, (b) Ms. Pacman, (c) Seaquest, (d) Zaxxon]
Interpretable and specialized options in Seaquest
[Figure: action trajectory over time, with option 1 shown in white and option 2 in black; transitions from option 1 to option 2 are marked]
• Option 1: downward shooting sequences
• Option 2: upward shooting sequences
Conclusion
Our results seem to be the first to be:
• fully end-to-end
• within a single task
• at a speed comparable to or better than methods using only primitive actions
Using ideas from policy gradient methods, option-critic:
• provides continual option construction
• can be used with nonlinear function approximators
• can easily incorporate regularizers or pseudo-rewards
Future work
• Learn initiation sets:
  ◦ Would require a new notion of stochastic initiation functions
• More empirical results!
Try our code: https://github.com/jeanharb/option_critic