Model Based Reinforcement Learning Oriol Vinyals (DeepMind) @OriolVinyalsML May 2018 Stanford University
The Reinforcement Learning Paradigm: an Agent interacts with an Environment, receiving observations (and a goal) and emitting actions.
The Reinforcement Learning Paradigm
Maximize the Return (long-term reward): R_t = Σ_{t'≥t} γ^{t'-t} r_{t'} = r_t + γ R_{t+1}, with discount γ ∈ [0,1], state x_t, action a_t, reward r_t.
With a Policy (action distribution): π = P(a_t | x_t, ...).
Measure success with the Value Function: V^π(x_t) = E_π[R_t].
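A minimal sketch of these definitions on a toy reward sequence (the discount and rewards below are illustrative, not from the talk):

```python
# Toy illustration of the discounted return R_t = sum_{t'>=t} gamma^(t'-t) r_{t'}
# and its recursive form R_t = r_t + gamma * R_{t+1}.

def discounted_returns(rewards, gamma=0.99):
    """Compute R_t for every step of a finite episode, back to front."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # R_t = r_t + gamma * R_{t+1}
        returns[t] = running
    return returns

rewards = [0.0, 0.0, 1.0, 0.0, 5.0]   # illustrative reward sequence
print(discounted_returns(rewards, gamma=0.9))
# The value function V^pi(x_t) is the expectation of R_t over trajectories
# sampled from policy pi; in practice it is estimated by averaging such returns.
```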
A Classic Dilemma “Old school” AI Researcher Deep Learning Researcher
A Classic Dilemma Model Based RL Deep RL
(Deep) Model Based RL: Model Based RL + Deep RL (a deep generative model combined with deep RL). Talk outline: Imagination Augmented Agents; Learning Model Based Planning from Scratch.
Imagination Augmented Agents (NIPS17) Joint work with: Theo Weber*, Sebastien Racaniere*, David Reichert*, Razvan Pascanu*, Yujia Li*, Lars Buesing, Arthur Guez, Danilo Rezende, Adrià Puigdomènech Badia, Peter Battaglia, Nicolas Heess, David Silver, Daan Wierstra
Intro to I2A ● We have good environment models ⇒ can we use them to solve tasks? ● How do we do model-based RL and deal with imperfect simulators? ● In this particular approach, we treat the generative model as an oracle of possible futures. ⇒ How do we interpret those ‘warnings’?
Imagination Augmented Agents (I2A)
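A heavily hedged structural sketch of the I2A idea, with placeholder functions standing in for the learned environment model, rollout policy, rollout encoder, and model-free path; it shows the data flow (imagine a few short futures, encode each, aggregate with a model-free path before the policy and value heads), not the actual architecture code.

```python
import numpy as np

# Hypothetical stand-ins for learned components; in the paper these are neural
# networks (the environment model, a rollout policy, a rollout encoder, and the
# model-free path of a standard actor-critic agent).
def env_model(state, action):          # predicts the next state (reward omitted here)
    return state + 0.01 * action, 0.0

def rollout_policy(state):             # cheap policy used only while imagining
    return np.random.randn(*state.shape)

def encode_rollout(states, rewards):   # summarises one imagined trajectory
    return np.concatenate([states[-1], [sum(rewards)]])

def i2a_features(state, n_rollouts=3, depth=2):
    """Imagine several short futures and aggregate them with a model-free path."""
    codes = []
    for _ in range(n_rollouts):
        s, traj_states, traj_rewards = state, [], []
        for _ in range(depth):
            s, r = env_model(s, rollout_policy(s))
            traj_states.append(s)
            traj_rewards.append(r)
        codes.append(encode_rollout(traj_states, traj_rewards))
    model_free = state                  # placeholder for the model-free path
    return np.concatenate([model_free] + codes)   # fed to the policy/value heads

print(i2a_features(np.zeros(4)).shape)
```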
Imagination Planning Networks (IPNs)
Sokoban environment ● Procedurally generated ● Irreversible decisions
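One way to see why decisions are irreversible: a box pushed into a non-target corner can never be recovered, so the level becomes unsolvable. A toy check, assuming a simple string-grid encoding rather than the environment's actual API:

```python
# Toy deadlock check for Sokoban: a box pushed into a non-target corner can
# never be moved again, so the level is lost (an irreversible mistake).
def is_corner_deadlock(grid, box_row, box_col):
    """grid: list of strings with '#' for walls, ' ' for floor, '*' for targets."""
    if grid[box_row][box_col] == '*':
        return False                      # boxes may rest on target squares
    wall = lambda r, c: grid[r][c] == '#'
    vertical = wall(box_row - 1, box_col) or wall(box_row + 1, box_col)
    horizontal = wall(box_row, box_col - 1) or wall(box_row, box_col + 1)
    return vertical and horizontal        # blocked on two perpendicular sides

level = ["#####",
         "#  *#",
         "#   #",
         "#####"]
print(is_corner_deadlock(level, 1, 1))    # True: plain floor cell in a corner
print(is_corner_deadlock(level, 1, 3))    # False: that corner is a target square
```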
Video: success and failure examples.
What happens if our model is bad?
Mental retries with I2A Solves 95% of levels!
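A hedged sketch of the mental-retry idea: before committing to an action in the real environment, imagine its outcome with the learned model and resample if it looks bad. The model, policy, and "looks bad" test below are placeholders, not the agent's actual components.

```python
import random

# Sketch of a "mental retry" loop: imagine the proposed action with the learned
# model and resample if the imagined outcome looks bad; otherwise act for real.
def imagine(model, state, action):
    return model(state, action)                 # predicted next state

def mental_retry_step(env_state, policy, model, looks_bad, max_retries=5):
    for _ in range(max_retries):
        action = policy(env_state)
        if not looks_bad(imagine(model, env_state, action)):
            return action                       # imagination approves: act for real
    return action                               # give up retrying, act anyway

# Toy usage: avoid actions whose imagined next state is negative.
toy_model = lambda s, a: s + a
toy_policy = lambda s: random.choice([-1, +1])
print(mental_retry_step(0, toy_policy, toy_model, looks_bad=lambda s: s < 0))
```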
Imagination efficiency: imagination is expensive ⇒ can we limit the number of times we ask the agent to imagine a transition in order to solve a level? In other words, can we guide the search more efficiently than current methods?
One model, many tasks
Metaminipacman
Five events:
● Do nothing
● Eat a small pill
● Eat a power pill
● Eat a ghost
● Be eaten by a ghost
We assign a different reward to each event, creating five different games:
● ‘Regular’
● ‘Rush’ (eat big pills as fast as possible)
● ‘Hunt’ (eat ghosts; pills are OK, I guess)
● ‘Ambush’ (eat ghosts, avoid everything else)
● ‘Avoid’ (everything hurts)
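Concretely, each task is the same environment dynamics with a different reward attached to each event. A sketch with illustrative reward values (the actual numbers are not on this slide):

```python
# Each task reuses the same MiniPacman dynamics with a different reward vector
# over the five events. The numbers below are illustrative placeholders.
EVENTS = ["step", "small_pill", "power_pill", "eat_ghost", "eaten_by_ghost"]

TASK_REWARDS = {
    "regular": {"step": 0, "small_pill": 1,  "power_pill": 2,  "eat_ghost": 5,  "eaten_by_ghost": -10},
    "rush":    {"step": 0, "small_pill": 0,  "power_pill": 10, "eat_ghost": 0,  "eaten_by_ghost": -10},
    "hunt":    {"step": 0, "small_pill": 0,  "power_pill": 1,  "eat_ghost": 10, "eaten_by_ghost": -10},
    "ambush":  {"step": 0, "small_pill": -1, "power_pill": -1, "eat_ghost": 10, "eaten_by_ghost": -10},
    "avoid":   {"step": 1, "small_pill": -1, "power_pill": -1, "eat_ghost": -1, "eaten_by_ghost": -10},
}

def reward(task, event):
    return TASK_REWARDS[task][event]

print(reward("hunt", "eat_ghost"))   # one environment model serves every task
```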
Results: ‘Avoid’ and ‘Ambush’ tasks.
Learning model-based planning from scratch Joint work with: Razvan Pascanu*, Yujia Li*, Theo Weber*, Sebastien Racaniere*, David Reichert*, Lars Buesing, Arthur Guez, Danilo Rezende, Adrià Puigdomènech Badia, Peter Battaglia, Nicolas Heess, David Silver, Daan Wierstra
Prior work: Spaceship Task v1.0
Hamrick, Ballard, Pascanu, Vinyals, Heess, Battaglia (2017). Metacontrol for Adaptive Imagination-Based Optimization. ICLR 2017.
● Propel the spaceship to the home planet (white) by choosing the thruster force (direction and magnitude)
● The other planets’ (grey) gravitational fields influence the trajectory
● Continuous, contextual bandit problem
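A rough physics sketch of the task: pick a single thrust, let the ship coast under the planets' gravity, and score the final distance to the target. The constants and the Euler integration below are assumptions, not the paper's actual simulator.

```python
import numpy as np

# Toy simulation of the spaceship task: choose one thrust (a 2D force vector),
# then integrate gravity from the fixed planets and measure distance to target.
# Gravitational constant, masses and step sizes are illustrative.
def simulate(thrust, planets, target, pos, steps=200, dt=0.05, g=1.0):
    vel = np.array(thrust, dtype=float)               # the thrust sets the initial velocity
    pos = np.array(pos, dtype=float)
    for _ in range(steps):
        acc = np.zeros(2)
        for p_pos, p_mass in planets:
            delta = np.asarray(p_pos) - pos
            dist = np.linalg.norm(delta) + 1e-6
            acc += g * p_mass * delta / dist**3       # inverse-square attraction
        vel += acc * dt
        pos += vel * dt
    return np.linalg.norm(pos - np.asarray(target))   # loss: final distance to home

planets = [((1.0, 0.0), 2.0), ((0.0, 1.5), 1.0)]      # grey planets as (position, mass)
print(simulate(thrust=(0.4, 0.6), planets=planets, target=(2.0, 2.0), pos=(0.0, 0.0)))
```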
Prior work: Imagination-based metacontroller ● Restricted to bandit problems
This paper: Imagination-based Planner (IBP)
Spaceship Task v2.0: Multiple actions
● Use the thruster multiple times
● More difficult than Spaceship Task v1.0:
  1. Pay for fuel
  2. Multiplicative control noise
● Opens up new strategies, such as:
  1. Move away from challenging gravity wells
  2. Apply the thruster toward the target
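A hedged sketch of the two v2.0 additions on top of a simulator like the one above: each thrust costs fuel and the executed thrust is perturbed multiplicatively. The coefficients are illustrative.

```python
import numpy as np

# Illustrative v2.0 modifications: thrusting costs fuel proportional to its
# magnitude, and the executed thrust is the commanded one scaled by noise.
def apply_thrust(commanded, fuel_cost=0.1, noise_std=0.2, rng=np.random.default_rng(0)):
    commanded = np.asarray(commanded, dtype=float)
    executed = commanded * (1.0 + noise_std * rng.standard_normal())  # multiplicative noise
    cost = fuel_cost * np.linalg.norm(commanded)                      # pay for fuel
    return executed, cost

executed, cost = apply_thrust([0.4, 0.6])
print(executed, cost)
```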
Imagination-based Planner
● Imagination can be:
  ○ Current step only: imagine only from the current state
  ○ Chained steps only: imagine a sequence of actions
  ○ Imagination tree: the manager chooses whether to use the current (root) state, or chain imagined states together
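A minimal sketch of that distinction: the only difference between the three modes is which previously seen state the next imagination starts from. In the full agent a learned manager makes this choice; here it is a hard-coded parameter and a random pick.

```python
import random

# Which state does the next imagination step start from?
#  - "one_step": always the current real (root) state
#  - "n_step":   the most recently imagined state (a chained rollout)
#  - "tree":     any previously imagined state, or the root (an imagination tree)
def pick_imagination_start(strategy, root_state, imagined_states):
    if strategy == "one_step" or not imagined_states:
        return root_state
    if strategy == "n_step":
        return imagined_states[-1]
    if strategy == "tree":
        return random.choice([root_state] + imagined_states)  # stand-in for the
                                                               # learned manager
    raise ValueError(strategy)

print(pick_imagination_start("tree", root_state=0, imagined_states=[1, 2]))
```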
Imagination-based Planner
Real trials: 3 actions. Panels: 0, 1, and 2 imaginations per action. More complex plans emerge: 1. moves away from complex gravity; 2. slows its velocity; 3. moves to the target.
Different strategies for exploration: 1-step, n-step, and imagination trees.
Results
Imagination-based Planner
How does it work? (learnable components in bold)
1. On each step, inputs:
  ○ State, s_t: the planet and ship positions, etc.
  ○ Imagined state, s'_t: internal state belief
  ○ History, h_t: summary of the planning steps so far
2. The controller policy returns an action, a_t
3. The manager routes the action to the world or to imagination, r_t
4. Depending on the route, r_t:
  a. “Imagination”: the model predicts an imagined state, s'_{t+1}
  b. “World”: the world produces a new state, s_{t+1}
5. Memory aggregates the new information into an updated history, h_{t+1}
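A structural sketch of one episode following steps 1–5 above. Every component is a placeholder lambda standing in for the learned networks, and the way imagination resets after a real step is a simplification of mine, not the paper's exact scheme.

```python
# Structural sketch of the Imagination-based Planner loop (steps 1-5 above).
# controller, manager, model and memory are placeholders for learned networks.
def ibp_episode(state, controller, manager, model, memory, max_steps=20):
    imagined, history = state, None
    for _ in range(max_steps):
        action = controller(state, imagined, history)              # 2. propose action a_t
        route = manager(state, imagined, history)                  # 3. imagine or act? r_t
        if route == "imagine":
            imagined = model(imagined, action)                     # 4a. imagined state s'_{t+1}
        else:
            state = model(state, action)                           # 4b. real step (model as
            imagined = state                                       #     stand-in for the world)
        history = memory(history, state, imagined, action, route)  # 5. updated history h_{t+1}
    return state

# Toy usage with trivial stand-ins: imagine until the imagined state reaches 3, then act.
out = ibp_episode(
    state=0.0,
    controller=lambda s, si, h: 1.0,
    manager=lambda s, si, h: "imagine" if si < 3 else "act",
    model=lambda s, a: s + a,
    memory=lambda h, s, si, a, r: (s, si),
)
print(out)
```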
Imagination-based Planner
How is it trained? Three distinct, concurrent, on-policy training loops:
1. Model / Imagination (interaction network). Supervised: s_t, a_t → s_{t+1}
2. Controller / Memory (MLP / LSTM). SVG: the reward (cost), u_t, is assumed to be |s_{t+1} - s*|^2. The model, imagination, memory, and controller are differentiable; the manager's discrete r_t choices are treated as constants.
3. Manager: finite-horizon MDP (MLP Q-net, stochastic). REINFORCE: return = (reward + computation costs), i.e. (u_t + c_t)
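A hedged sketch of the three training signals as toy loss functions; the sign conventions and the stand-in data are my reading of the slide, not the paper's implementation.

```python
import numpy as np

# Hedged sketch of the three concurrent training signals (toy data throughout).

# 1. Model / imagination: supervised prediction of the next state.
def model_loss(predicted_next, actual_next):
    return np.mean((predicted_next - actual_next) ** 2)

# 2. Controller / memory: SVG-style objective. The cost |s_{t+1} - s*|^2 is
#    differentiated through the (differentiable) model, imagination, memory and
#    controller; the manager's discrete routing is held fixed.
def controller_cost(next_state, goal_state):
    return np.sum((next_state - goal_state) ** 2)

# 3. Manager: REINFORCE on the negated total of task cost plus computation cost,
#    so imagining more only pays off if it reduces the task cost enough.
def manager_reinforce_loss(log_prob_routes, task_costs, compute_costs):
    ret = -(np.sum(task_costs) + np.sum(compute_costs))   # return from -(u_t + c_t)
    return -ret * np.sum(log_prob_routes)                 # score-function estimator

print(model_loss(np.array([1.0, 2.0]), np.array([1.1, 1.9])))
print(controller_cost(np.array([0.5, 0.5]), np.array([0.0, 0.0])))
print(manager_reinforce_loss(np.array([-0.3, -0.7]), [2.0], [0.1, 0.1]))
```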
Bonus Paper: MCTSnet Joint work with: Arthur Guez*, Theo Weber*, Ioannis Antonoglou, Karen Simonyan, Daan Wierstra, Remi Munos, David Silver