CS885 Reinforcement Learning
Module 3 (July 5, 2020): Imitation Learning
References:
• Torabi, F., Warnell, G., & Stone, P. (2018). Behavioral cloning from observation. In IJCAI (pp. 4950-4957).
• Ho, J., & Ermon, S. (2016). Generative adversarial imitation learning. In NeurIPS (pp. 4565-4573).
University of Waterloo CS885 Spring 2020, Pascal Poupart
Imitation Learning
• Behavioural cloning (supervised learning)
• Generative adversarial imitation learning (GAIL)
• Imitation learning from observations
• Inverse reinforcement learning
Motivation
• Learn from expert demonstrations
 – No reward function needed
 – Faster learning
• Applications: autonomous driving, chatbots, robotics
Behavioural Cloning
• Simplest form of imitation learning
• Assumption: state-action pairs are observable
• Observe expert trajectories: (s_1, a_1), (s_2, a_2), (s_3, a_3), …, (s_n, a_n)
• Create a training set mapping states to actions: S → A
• Train by supervised learning
 – Classification (discrete actions) or regression (continuous actions)
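The steps above can be sketched in a few lines. This is a hypothetical toy example (not from the slides): the expert maps 1-D states to one of two actions, and we clone it by fitting a logistic-regression policy π(a=1|s) via maximum likelihood on the observed (s, a) pairs.

```python
# Behavioural cloning as supervised classification (toy 1-D example).
import numpy as np

rng = np.random.default_rng(0)
states = rng.uniform(-1, 1, size=200)     # observed expert states
actions = (states > 0).astype(float)      # assumed expert rule: a=1 iff s>0

w, b = 0.0, 0.0                           # policy parameters
for _ in range(2000):                     # gradient ascent on log-likelihood
    p = 1.0 / (1.0 + np.exp(-(w * states + b)))
    grad_w = np.mean((actions - p) * states)
    grad_b = np.mean(actions - p)
    w += 1.0 * grad_w
    b += 1.0 * grad_b

def policy(s):
    """Cloned policy: greedy action from the learned classifier."""
    return int(1.0 / (1.0 + np.exp(-(w * s + b))) > 0.5)
```

With discrete actions this is plain classification; with continuous actions (e.g., steering angles) the same recipe becomes regression on the expert's actions.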
Case Study I: Autonomous Driving
• Bojarski et al. (2016). End-to-end learning for self-driving cars
• On-road tests:
 – Holmdel to Atlantic Highlands (NJ): autonomous ~98% of the time
 – Garden State Parkway (10 miles): no human intervention
Case Study II: Conversational Agents (Sordoni et al., 2015)
• Encoder: state s (e.g., "How are you doing?")
• Decoder: action a (e.g., "I am fine")
• Pr(a|s) = ∏_i Pr(a_i | a_{i−1}, …, a_1, s)
• Objective: max_a Pr(a|s)
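The decoder factorisation can be made concrete with a tiny numerical sketch (hypothetical vocabulary and probabilities, not from the slides): the probability of a whole response is the product of per-token conditionals, so its log-probability is a sum of log-conditionals.

```python
# Sequence likelihood under the chain-rule factorisation
# Pr(a|s) = prod_i Pr(a_i | a_1..a_{i-1}, s).
import numpy as np

def sequence_log_prob(step_dists, tokens):
    """step_dists[i]: conditional distribution over the vocabulary at step i
    (already conditioned on s and the previously generated tokens);
    tokens[i]: the token actually chosen at step i."""
    return sum(np.log(dist[tok]) for dist, tok in zip(step_dists, tokens))

# Toy 3-word vocabulary, 2-token response.
dists = [np.array([0.7, 0.2, 0.1]),   # Pr(a_1 | s)
         np.array([0.1, 0.8, 0.1])]   # Pr(a_2 | a_1, s)
logp = sequence_log_prob(dists, [0, 1])   # log(0.7 * 0.8)
```

Training by behavioural cloning then amounts to maximizing this log-likelihood over the expert's (context, response) pairs.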
Generative Adversarial Imitation Learning (GAIL)
• Common approach: train a generator to maximize the likelihood of expert actions
• Alternative: train the generator to fool a discriminator into believing that the generated actions come from the expert
 – Leverages GANs (generative adversarial networks)
 – Ho & Ermon, 2016
Generative Adversarial Networks (GANs)
• Generator g_θ: maps a random vector z to a fake data point g_θ(z)
• Discriminator D_w: labels a data point y as real or fake
• Example: StyleGAN2 (Karras et al., 2020) trained on CelebA (Liu et al., 2015)
• Objective:
 min_θ max_w Σ_i [ log Pr(y_i is real; w) + log Pr(g_θ(z_i) is fake; w) ]
 = min_θ max_w Σ_i [ log D_w(y_i) + log(1 − D_w(g_θ(z_i))) ]
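A minimal sketch of this min-max game on 1-D data (hypothetical setup, not from the slides): the discriminator D_w is a logistic classifier and the "generator" simply shifts noise by θ, alternating an ascent step on (w, b) with a descent step on θ.

```python
# One-dimensional GAN: alternating updates on
# min_theta max_w  E[log D(real)] + E[log(1 - D(G(z)))].
import numpy as np

rng = np.random.default_rng(1)
real = rng.normal(3.0, 0.5, size=256)   # real data centred at 3
z = rng.normal(0.0, 0.5, size=256)      # generator input noise

theta = 0.0                             # generator shift: G(z) = z + theta
w, b = 1.0, 0.0                         # discriminator parameters

def D(x):                               # Pr(x is real)
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

for _ in range(200):
    fake = z + theta
    # ascend on (w, b): maximize log D(real) + log(1 - D(fake))
    gw = np.mean((1 - D(real)) * real) + np.mean(-D(fake) * fake)
    gb = np.mean(1 - D(real)) + np.mean(-D(fake))
    w += 0.1 * gw
    b += 0.1 * gb
    # descend on theta: minimize log(1 - D(fake)), i.e. make fakes look real
    gt = np.mean(D(z + theta)) * w      # -d/dtheta of log(1 - D(G(z)))
    theta += 0.1 * gt
```

The generator's shift moves toward the real data's mean because pushing D(G(z)) up is exactly descending the second term of the objective.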
GAIL Pseudocode
Input: expert trajectories τ_E ∼ π_expert where τ_E = (s_0, a_0, s_1, a_1, …)
Initialize parameters θ of policy π_θ and parameters w of discriminator D_w
Repeat until stopping criterion:
 • Update discriminator parameters:
  δ_w = Σ_{(s,a) ∈ τ_E} ∇_w log D_w(s, a) + Σ_{(s,a) ∼ π_θ(a|s)} ∇_w log(1 − D_w(s, a))
  w ← w + α_w δ_w
 • Update policy parameters with TRPO:
  Cost(s_t, a_t) = Σ_{(s,a) | s_t, a_t, π_θ} log(1 − D_w(s, a))
  δ_θ = Σ_{(s,a) | π_θ} ∇_θ log π_θ(a|s) Cost(s, a) − λ ∇_θ H(π_θ)
  θ ← θ − α_θ δ_θ
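The loop above can be sketched on a deliberately tiny problem (hypothetical, not from the paper): a one-step task with a single state and two actions, where the expert always picks action 1. Both the policy π_θ and the discriminator are Bernoulli logits, and the TRPO step is replaced by a plain REINFORCE step on the cost log(1 − D_w(s, a)).

```python
# Miniature GAIL loop: discriminator ascent + REINFORCE policy descent.
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

theta = 0.0    # policy logit: pi(a=1) = sigmoid(theta)
w = 0.0        # discriminator logit: D(a) = sigmoid(w * (2a - 1))

for _ in range(500):
    # roll out the current policy (batch of one-step episodes)
    acts = (rng.uniform(size=64) < sigmoid(theta)).astype(float)
    # discriminator step: maximize log D(expert a=1) + log(1 - D(generated))
    D_exp = sigmoid(w)                          # D on the expert action
    D_gen = sigmoid(w * (2 * acts - 1))         # D on generated actions
    grad_w = (1 - D_exp) + np.mean(-D_gen * (2 * acts - 1))
    w += 0.1 * grad_w
    # policy step: minimize E[log(1 - D)] via REINFORCE
    cost = np.log(1.0 - sigmoid(w * (2 * acts - 1)) + 1e-8)
    grad_logp = acts - sigmoid(theta)           # d log pi(a) / d theta
    theta -= 0.1 * np.mean(grad_logp * cost)
```

As the discriminator learns to score the expert action highly, the generated action 1 incurs a much lower cost than action 0, so the policy gradient drives π_θ toward the expert's behaviour; the entropy bonus −λ∇H(π_θ) from the slide is omitted for brevity.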
Robotics Experiments
• Robot imitating an expert policy with GAIL (Ho & Ermon, 2016)
Imitation Learning from Observations
• Consider imitation learning from a human expert (Schaal et al., 2003)
• Actions (e.g., forces) are unobservable
• Only states/observations (e.g., joint positions) are observable
• Problem: infer actions from state/observation sequences
Inverse Dynamics
Two steps:
1. Learn inverse dynamics
 – Learn Pr(a | s, s′) by supervised learning
 – From (s, a, s′) samples obtained by executing random actions
2. Behavioural cloning
 – Learn π(â | s) by supervised learning
 – From (s, s′) samples from expert trajectories and â ∼ Pr(a | s, s′) sampled from the inverse dynamics model
Pseudocode: Imitation Learning from Observations
Input: expert trajectories τ_E ∼ π_expert where τ_E = (s_0, s_1, s_2, …)
Initialize agent policy π_θ at random
Repeat:
 • Learn inverse dynamics model with parameters w:
  Sample (s_t^{π_θ}, a_t^{π_θ}, s_{t+1}^{π_θ}) by executing π_θ
  w ← argmax_w Σ_t log Pr_w(a_t^{π_θ} | s_t^{π_θ}, s_{t+1}^{π_θ})
 • Learn policy parameters θ:
  For each (s_t^{τ_E}, s_{t+1}^{τ_E}) from expert trajectories τ_E do:
   â_t^{τ_E} ∼ Pr(a | s_t^{τ_E}, s_{t+1}^{τ_E})
  θ ← argmax_θ Σ_t log π_θ(â_t^{τ_E} | s_t^{τ_E})
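Both steps can be sketched on a hypothetical 1-D chain (not from the paper): actions a ∈ {0, 1} move the state by −1 or +1, the expert always moves right, but we only observe its states. Step 1 fits an inverse dynamics model Pr(a | s, s′) from random-action experience; step 2 labels the expert's (s, s′) pairs with that model and clones the result.

```python
# Behavioural cloning from observation on a 1-D chain.
import numpy as np

rng = np.random.default_rng(3)

# --- experience from executing random actions: (s, a, s') triples ---
s = rng.integers(-5, 5, size=300).astype(float)
a = rng.integers(0, 2, size=300).astype(float)
s_next = s + (2 * a - 1)                 # deterministic dynamics

# step 1: inverse dynamics as a logistic model Pr(a=1 | s' - s)
v = 0.0
for _ in range(500):                     # gradient ascent on log-likelihood
    p = 1.0 / (1.0 + np.exp(-v * (s_next - s)))
    v += 0.5 * np.mean((a - p) * (s_next - s))

def inverse_dynamics(s, s_next):
    """Most likely action for an observed transition."""
    return int(1.0 / (1.0 + np.exp(-v * (s_next - s))) > 0.5)

# --- expert trajectory: states only, expert always moves right ---
expert_states = np.arange(0.0, 10.0)
labels = [inverse_dynamics(x, y)         # step 2a: infer expert actions
          for x, y in zip(expert_states[:-1], expert_states[1:])]

# step 2b: behavioural cloning on (state, inferred action); the expert
# acts identically everywhere here, so the cloned policy reduces to the
# majority inferred action
policy_action = int(np.mean(labels) > 0.5)
```

In the full algorithm the "repeat" matters: each improved policy generates experience closer to the expert's state distribution, which in turn improves the inverse dynamics model.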
Robotics Experiments
• Torabi et al., 2018