Value Iteration Networks
Aviv Tamar
Joint work with Pieter Abbeel, Sergey Levine, Garrett Thomas, Yi Wu
Presented by: Kent Sommer (most content directly from Aviv Tamar’s 2016 presentation)
April 25, 2017, Korea Advanced Institute of Science and Technology
introduction
motivation
∙ Goal: autonomous robots ("Robot, bring me the milk bottle!")
  [Image: http://www.wellandgood.com/wp-content/uploads/2015/02/Shira-fridge.jpg]
∙ Solution: RL?
introduction
∙ Deep RL learns policies from high-dimensional visual input [1, 2]
∙ Learns to act, but does it understand?
∙ A simple test: generalization on grid worlds
[1] Mnih et al., Nature 2015
[2] Levine et al., JMLR 2016
introduction
[Figure: reactive policy – Image → Conv Layers → Fully Connected Layers → Action Probability]
introduction
Why don’t reactive policies generalize?
∙ A sequential task requires a planning computation
∙ RL gets around that – learns a mapping
  ∙ State → Q-value
  ∙ State → action with high return
  ∙ State → action with high advantage
  ∙ State → expert action
  ∙ [State] → [planning-based term]
∙ Q/return/advantage: planning on training domains
∙ New task – need to re-plan
introduction
In this work:
∙ Learn to plan
∙ Policies that generalize to unseen tasks
background
background
Planning in MDPs
∙ States s ∈ S, actions a ∈ A
∙ Reward R(s, a)
∙ Transitions P(s′ | s, a)
∙ Policy π(a | s)
∙ Value function V^π(s) = E^π[ ∑_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s ]
∙ Value iteration (VI):
  Q_n(s, a) = R(s, a) + γ ∑_{s′} P(s′ | s, a) V_n(s′)
  V_{n+1}(s) = max_a Q_n(s, a)  ∀ s
∙ Converges to V* = max_π V^π
∙ Optimal policy π*(a | s) = arg max_a Q*(s, a)
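As a concrete reference, here is a minimal tabular value-iteration sketch in NumPy that follows the update above; the reward and transition arrays, discount, and iteration count are placeholders, not values from the slides.

```python
import numpy as np

def value_iteration(R, P, gamma=0.99, n_iters=100):
    """Tabular value iteration.

    R: rewards, shape (S, A)
    P: transitions, shape (S, A, S), where P[s, a, s2] = P(s2 | s, a)
    Returns V* (shape (S,)) and the greedy policy (shape (S,)).
    """
    V = np.zeros(R.shape[0])
    for _ in range(n_iters):
        # Q_n(s, a) = R(s, a) + gamma * sum_{s'} P(s' | s, a) * V_n(s')
        Q = R + gamma * (P @ V)       # shape (S, A)
        # V_{n+1}(s) = max_a Q_n(s, a)
        V = Q.max(axis=1)
    # pi*(a | s): greedy with respect to Q*
    return V, Q.argmax(axis=1)
```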
background
Policies in RL / imitation learning
∙ State observation φ(s)
∙ Policy: π_θ(a | φ(s))
  ∙ Neural network
  ∙ Greedy w.r.t. Q (DQN)
∙ Algorithms perform SGD, require ∇_θ π_θ(a | φ(s))
∙ Only the loss function varies (see the sketch below):
  ∙ Q-learning (DQN)
  ∙ Trust region policy optimization (TRPO)
  ∙ Guided policy search (GPS)
  ∙ Imitation learning (supervised learning, DAgger)
∙ Focus on policy representation
∙ Applies to model-free RL / imitation learning
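To illustrate that only the loss changes while the policy representation π_θ(a | φ(s)) and its gradient are shared, here is a small hypothetical sketch; `policy_net` is assumed to map observations to action logits, and all names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

# Same policy network, different losses: the representation pi_theta(a | phi(s))
# (and its gradient with respect to theta) is shared across algorithms.

def imitation_loss(policy_net, obs, expert_action):
    # Supervised / DAgger-style: cross-entropy against the expert's action.
    return F.cross_entropy(policy_net(obs), expert_action)

def policy_gradient_loss(policy_net, obs, action, advantage):
    # REINFORCE-style surrogate: -log pi_theta(a | s) * advantage.
    log_prob = F.log_softmax(policy_net(obs), dim=1)
    chosen = log_prob.gather(1, action.unsqueeze(1)).squeeze(1)
    return -(chosen * advantage).mean()
```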
a model for policies that plan
a planning-based policy model
∙ Start from a reactive policy
a planning-based policy model
∙ Add an explicit planning computation
∙ Map the observation to a planning MDP M̄
∙ Assumption: the observation can be mapped to a useful (but unknown) planning computation
a planning-based policy model
∙ NNs map the observation to reward and transitions
∙ Later – learn these
How to use the planning computation?
a planning-based policy model
∙ Fact 1: the value function = sufficient information about the plan
∙ Idea 1: add it as a feature vector to the reactive policy
a planning-based policy model
∙ Fact 2: action prediction can require only a subset of V̄*
  π*(a | s) = arg max_a [ R(s, a) + γ ∑_{s′} P(s′ | s, a) V*(s′) ]
∙ Similar to attention models, effective for learning [1]
[1] Xu et al., ICML 2015
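A minimal sketch of this attention step, assuming the VI module outputs Q̄ as a (batch, A, H, W) tensor and the agent's grid position is given as (row, col); written in PyTorch for illustration.

```python
import torch

def attention(q_bar, row, col):
    """Pick out the Q-bar values of the agent's current cell.

    q_bar: (batch, A, H, W) tensor produced by the VI module
    row, col: (batch,) integer tensors with the agent's grid position
    Returns a (batch, A) tensor that is fed to the reactive policy head.
    """
    batch_idx = torch.arange(q_bar.shape[0], device=q_bar.device)
    # Advanced indexing over the spatial dims keeps only the current state's Q values.
    return q_bar[batch_idx, :, row, col]
```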
a planning-based policy model
∙ The policy is still a mapping φ(s) → Prob(a)
∙ Parameters θ for the mappings R̄, P̄, and attention
∙ Can we backprop?
How to backprop through the planning computation?
value iteration = convnet
value iteration = convnet
VI Module: K iterations of
  Q̄_n(s̄, ā) = R̄(s̄, ā) + γ ∑_{s̄′} P̄(s̄′ | s̄, ā) V̄_n(s̄′)
  V̄_{n+1}(s̄) = max_ā Q̄_n(s̄, ā)  ∀ s̄
[Figure: the VI module as a convnet – the reward R̄ and previous value V̄ feed a convolution producing the Q̄ layer, which is max-pooled channel-wise into the new value; the recurrence is applied K times]
∙ |Ā| channels in the Q̄ layer (one per action)
∙ Linear filters ⇐⇒ γ P̄
∙ Tied weights
∙ Channel-wise max-pooling
∙ Best for locally connected dynamics (grids, graphs)
∙ Extension – input-dependent filters
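The recurrence above can be written as a small convolutional module. Below is a possible PyTorch sketch; the channel counts, kernel size, and K are illustrative, and the learned 3 × 3 filter bank stands in for γ P̄.

```python
import torch
import torch.nn as nn

class VIModule(nn.Module):
    """K iterations of (conv -> channel-wise max), with tied weights.

    A sketch only: the single 3x3 filter bank plays the role of gamma * P-bar,
    and the reward map R-bar is a (batch, 1, H, W) image.
    """
    def __init__(self, n_actions=8, k=20):
        super().__init__()
        self.k = k
        # One filter bank reused in every iteration (tied weights).
        self.q_conv = nn.Conv2d(2, n_actions, kernel_size=3, padding=1, bias=False)

    def forward(self, r_bar):
        v_bar = torch.zeros_like(r_bar)                      # V-bar_0 = 0
        for _ in range(self.k):
            # Q-bar_n = conv over [R-bar; V-bar_n]  (one channel per action)
            q_bar = self.q_conv(torch.cat([r_bar, v_bar], dim=1))
            # V-bar_{n+1}(s) = max_a Q-bar_n(s, a): max-pool over action channels
            v_bar, _ = q_bar.max(dim=1, keepdim=True)
        return q_bar, v_bar
```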
value iteration networks
value iteration network
∙ Use the VI module for planning
value iteration network
∙ Value iteration network (VIN)
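Putting the pieces together, a hypothetical end-to-end VIN for the grid-world might look like the sketch below, reusing the `VIModule` and `attention` sketches above; the layer sizes are illustrative and not taken from the slides.

```python
import torch.nn as nn
import torch.nn.functional as F

class VIN(nn.Module):
    """Observation (obstacle + goal image) -> action log-probabilities."""
    def __init__(self, n_actions=8, k=20):
        super().__init__()
        # f_R: map the 2-channel observation to a 1-channel reward map R-bar.
        self.reward_net = nn.Sequential(
            nn.Conv2d(2, 150, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(150, 1, kernel_size=1, bias=False),
        )
        self.vi = VIModule(n_actions=n_actions, k=k)         # differentiable planner
        self.policy_head = nn.Linear(n_actions, n_actions)   # reactive FC + softmax

    def forward(self, obs, row, col):
        r_bar = self.reward_net(obs)          # (batch, 1, H, W)
        q_bar, _ = self.vi(r_bar)             # (batch, A, H, W)
        q_s = attention(q_bar, row, col)      # (batch, A): Q-bar at the agent's cell
        return F.log_softmax(self.policy_head(q_s), dim=1)
```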
experiments
experiments
Questions:
1. Can VINs learn a planning computation?
2. Do VINs generalize better than reactive policies?
grid-world domain
grid-world domain
∙ Supervised learning from expert (shortest path)
∙ Observation: image of obstacles + goal, current state
∙ Compare VINs with reactive policies
grid-world domain
∙ VI state space: grid-world
∙ VI reward map: convnet
∙ VI transitions: 3 × 3 kernel
∙ Attention: choose Q̄ values for the current state
∙ Reactive policy: FC, softmax
grid-world domain
Compare with:
∙ CNN inspired by the DQN architecture [1]
  ∙ 5 layers
  ∙ Current state as an additional input channel
∙ Fully convolutional net (FCN) [2]
  ∙ Pixel-wise semantic segmentation (labels = actions)
  ∙ Similar to our attention mechanism
  ∙ 3 layers
  ∙ Full-sized kernel – receptive field always includes the goal
Training:
∙ 5000 random maps, 7 trajectories in each
∙ Supervised learning from shortest path (see the training sketch below)
[1] Mnih et al., Nature 2015
[2] Long et al., CVPR 2015
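A hypothetical training loop for this supervised setting, reusing the `VIN` sketch above; the `train_loader`, optimizer choice, and hyperparameters are assumptions, not the ones used in the experiments.

```python
import torch
import torch.nn.functional as F

model = VIN(n_actions=8, k=20)
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)

for epoch in range(30):
    for obs, row, col, expert_action in train_loader:   # placeholder DataLoader
        log_probs = model(obs, row, col)                 # (batch, A)
        loss = F.nll_loss(log_probs, expert_action)      # imitate the shortest-path expert
        optimizer.zero_grad()
        loss.backward()      # gradients flow through the VI module (it is just a convnet)
        optimizer.step()
```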
grid-world domain
Evaluation:
∙ Action prediction error (on the test set)
∙ Success rate – reach the target without hitting obstacles

Results:

Domain    VIN (pred. loss / success)   CNN (pred. loss / success)   FCN (pred. loss / success)
8 × 8     0.004 / 99.6%                0.02 / 97.9%                 0.01 / 97.3%
16 × 16   0.05 / 99.3%                 0.10 / 87.6%                 0.07 / 88.3%
28 × 28   0.11 / 97%                   0.13 / 74.2%                 0.09 / 76.6%

VINs learn to plan!
grid-world domain
Results:
[Figures: sample grid-world maps with predicted trajectories, comparing VIN and FCN]
summary & outlook
summary
∙ Learn to plan → generalization
∙ Framework for planning-based NN policies
∙ Motivated by dynamic programming theory
∙ Differentiable planner (VI = CNN)
∙ Compositionality of NNs – perception & control
∙ Exploits flexible prior knowledge
∙ Simple to use
thank you!