CS885 Reinforcement Learning
Module 3 (July 5, 2020): Imitation Learning
References:
• Torabi, F., Warnell, G., & Stone, P. (2018). Behavioral cloning from observation. In IJCAI (pp. 4950-4957).
• Ho, J., & Ermon, S. (2016). Generative adversarial imitation learning. In NeurIPS (pp. 4565-4573).
University of Waterloo CS885 Spring 2020, Pascal Poupart
Imitation Learning
• Behavioural cloning (supervised learning)
• Generative adversarial imitation learning (GAIL)
• Imitation learning from observations
• Inverse reinforcement learning
Motivation
• Learn from expert demonstrations
 – No reward function needed
 – Faster learning
• Applications: autonomous driving, chatbots, robotics
Behavioural Cloning
• Simplest form of imitation learning
• Assumption: state-action pairs are observable
• Observe expert trajectories: (s_1, a_1), (s_2, a_2), (s_3, a_3), …, (s_n, a_n)
• Create a training set mapping states to actions: S → A
• Train by supervised learning
 – Classification (discrete actions) or regression (continuous actions)
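The steps above can be sketched in a few lines. This is a hypothetical toy example (not from the slides): the expert maps 1-D states to one of two actions, and we clone it by fitting a logistic-regression policy π(a=1|s) via maximum likelihood on the observed (s, a) pairs.

```python
# Behavioural cloning as supervised classification (toy 1-D example).
import numpy as np

rng = np.random.default_rng(0)
states = rng.uniform(-1, 1, size=200)     # observed expert states
actions = (states > 0).astype(float)      # assumed expert rule: a=1 iff s>0

w, b = 0.0, 0.0                           # policy parameters
for _ in range(2000):                     # gradient ascent on log-likelihood
    p = 1.0 / (1.0 + np.exp(-(w * states + b)))
    grad_w = np.mean((actions - p) * states)
    grad_b = np.mean(actions - p)
    w += 1.0 * grad_w
    b += 1.0 * grad_b

def policy(s):
    """Cloned policy: greedy action from the learned classifier."""
    return int(1.0 / (1.0 + np.exp(-(w * s + b))) > 0.5)
```

With discrete actions this is plain classification; with continuous actions (e.g., steering angles) the same recipe becomes regression on the expert's actions.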
Case Study I: Autonomous Driving
• Bojarski et al. (2016). End-to-end learning for self-driving cars
• On-road tests:
 – Holmdel to Atlantic Highlands (NJ): autonomous ~98% of the time
 – Garden State Parkway (10 miles): no human intervention
Case Study II: Conversational Agents (Sordoni et al., 2015)
• Encoder: state s (e.g., "How are you doing?")
• Decoder: action a (e.g., "I am fine")
• Pr(a|s) = ∏_i Pr(a_i | a_{i−1}, …, a_1, s)
• Objective: max_a Pr(a|s)
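The decoder factorisation can be made concrete with a tiny numerical sketch (hypothetical vocabulary and probabilities, not from the slides): the probability of a whole response is the product of per-token conditionals, so its log-probability is a sum of log-conditionals.

```python
# Sequence likelihood under the chain-rule factorisation
# Pr(a|s) = prod_i Pr(a_i | a_1..a_{i-1}, s).
import numpy as np

def sequence_log_prob(step_dists, tokens):
    """step_dists[i]: conditional distribution over the vocabulary at step i
    (already conditioned on s and the previously generated tokens);
    tokens[i]: the token actually chosen at step i."""
    return sum(np.log(dist[tok]) for dist, tok in zip(step_dists, tokens))

# Toy 3-word vocabulary, 2-token response.
dists = [np.array([0.7, 0.2, 0.1]),   # Pr(a_1 | s)
         np.array([0.1, 0.8, 0.1])]   # Pr(a_2 | a_1, s)
logp = sequence_log_prob(dists, [0, 1])   # log(0.7 * 0.8)
```

Training by behavioural cloning then amounts to maximizing this log-likelihood over the expert's (context, response) pairs.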
Generative Adversarial Imitation Learning (GAIL)
• Common approach: train a generator to maximize the likelihood of expert actions
• Alternative: train the generator to fool a discriminator into believing that the generated actions come from the expert
 – Leverages GANs (generative adversarial networks)
 – Ho & Ermon, 2016
Generative Adversarial Networks (GANs)
• Generator g_θ: maps a random vector z to a fake data point g_θ(z)
• Discriminator D_w: labels a data point y as real or fake
• Example: StyleGAN2 (Karras et al., 2020) trained on CelebA (Liu et al., 2015)
• Objective:
 min_θ max_w Σ_i [ log Pr(y_i is real; w) + log Pr(g_θ(z_i) is fake; w) ]
 = min_θ max_w Σ_i [ log D_w(y_i) + log(1 − D_w(g_θ(z_i))) ]
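A minimal sketch of this min-max game on 1-D data (hypothetical setup, not from the slides): the discriminator D_w is a logistic classifier and the "generator" simply shifts noise by θ, alternating an ascent step on (w, b) with a descent step on θ.

```python
# One-dimensional GAN: alternating updates on
# min_theta max_w  E[log D(real)] + E[log(1 - D(G(z)))].
import numpy as np

rng = np.random.default_rng(1)
real = rng.normal(3.0, 0.5, size=256)   # real data centred at 3
z = rng.normal(0.0, 0.5, size=256)      # generator input noise

theta = 0.0                             # generator shift: G(z) = z + theta
w, b = 1.0, 0.0                         # discriminator parameters

def D(x):                               # Pr(x is real)
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

for _ in range(200):
    fake = z + theta
    # ascend on (w, b): maximize log D(real) + log(1 - D(fake))
    gw = np.mean((1 - D(real)) * real) + np.mean(-D(fake) * fake)
    gb = np.mean(1 - D(real)) + np.mean(-D(fake))
    w += 0.1 * gw
    b += 0.1 * gb
    # descend on theta: minimize log(1 - D(fake)), i.e. make fakes look real
    gt = np.mean(D(z + theta)) * w      # -d/dtheta of log(1 - D(G(z)))
    theta += 0.1 * gt
```

The generator's shift moves toward the real data's mean because pushing D(G(z)) up is exactly descending the second term of the objective.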
GAIL Pseudocode
Input: expert trajectories τ_E ∼ π_expert where τ_E = (s_0, a_0, s_1, a_1, …)
Initialize parameters θ of policy π_θ and parameters w of discriminator D_w
Repeat until stopping criterion:
 • Update discriminator parameters:
  δ_w = Σ_{(s,a) ∈ τ_E} ∇_w log D_w(s, a) + Σ_{(s,a) ∼ π_θ(a|s)} ∇_w log(1 − D_w(s, a))
  w ← w + α_w δ_w
 • Update policy parameters with TRPO:
  Cost(s_t, a_t) = Σ_{(s,a) | s_t, a_t, π_θ} log(1 − D_w(s, a))
  δ_θ = Σ_{(s,a) | π_θ} ∇_θ log π_θ(a|s) Cost(s, a) − λ ∇_θ H(π_θ)
  θ ← θ − α_θ δ_θ
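The loop above can be sketched on a deliberately tiny problem (hypothetical, not from the paper): a one-step task with a single state and two actions, where the expert always picks action 1. Both the policy π_θ and the discriminator are Bernoulli logits, and the TRPO step is replaced by a plain REINFORCE step on the cost log(1 − D_w(s, a)).

```python
# Miniature GAIL loop: discriminator ascent + REINFORCE policy descent.
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

theta = 0.0    # policy logit: pi(a=1) = sigmoid(theta)
w = 0.0        # discriminator logit: D(a) = sigmoid(w * (2a - 1))

for _ in range(500):
    # roll out the current policy (batch of one-step episodes)
    acts = (rng.uniform(size=64) < sigmoid(theta)).astype(float)
    # discriminator step: maximize log D(expert a=1) + log(1 - D(generated))
    D_exp = sigmoid(w)                          # D on the expert action
    D_gen = sigmoid(w * (2 * acts - 1))         # D on generated actions
    grad_w = (1 - D_exp) + np.mean(-D_gen * (2 * acts - 1))
    w += 0.1 * grad_w
    # policy step: minimize E[log(1 - D)] via REINFORCE
    cost = np.log(1.0 - sigmoid(w * (2 * acts - 1)) + 1e-8)
    grad_logp = acts - sigmoid(theta)           # d log pi(a) / d theta
    theta -= 0.1 * np.mean(grad_logp * cost)
```

As the discriminator learns to score the expert action highly, the generated action 1 incurs a much lower cost than action 0, so the policy gradient drives π_θ toward the expert's behaviour; the entropy bonus −λ∇H(π_θ) from the slide is omitted for brevity.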
Robotics Experiments
• Robot imitating an expert policy with GAIL (Ho & Ermon, 2016)
Imitation Learning from Observations
• Consider imitation learning from a human expert (Schaal et al., 2003)
• Actions (e.g., forces) are unobservable
• Only states/observations (e.g., joint positions) are observable
• Problem: infer actions from state/observation sequences
Inverse Dynamics
Two steps:
1. Learn inverse dynamics
 – Learn Pr(a | s, s′) by supervised learning
 – From (s, a, s′) samples obtained by executing random actions
2. Behavioural cloning
 – Learn π(â | s) by supervised learning
 – From (s, s′) samples from expert trajectories and â ∼ Pr(a | s, s′) sampled from the inverse dynamics model
Pseudocode: Imitation Learning from Observations
Input: expert trajectories τ_E ∼ π_expert where τ_E = (s_0, s_1, s_2, …)
Initialize agent policy π_θ at random
Repeat:
 • Learn inverse dynamics model with parameters w:
  Sample (s_t^{π_θ}, a_t^{π_θ}, s_{t+1}^{π_θ}) by executing π_θ
  w ← argmax_w Σ_t log Pr_w(a_t^{π_θ} | s_t^{π_θ}, s_{t+1}^{π_θ})
 • Learn policy parameters θ:
  For each (s_t^{τ_E}, s_{t+1}^{τ_E}) from expert trajectories τ_E do:
   â_t^{τ_E} ∼ Pr(a | s_t^{τ_E}, s_{t+1}^{τ_E})
  θ ← argmax_θ Σ_t log π_θ(â_t^{τ_E} | s_t^{τ_E})
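Both steps can be sketched on a hypothetical 1-D chain (not from the paper): actions a ∈ {0, 1} move the state by −1 or +1, the expert always moves right, but we only observe its states. Step 1 fits an inverse dynamics model Pr(a | s, s′) from random-action experience; step 2 labels the expert's (s, s′) pairs with that model and clones the result.

```python
# Behavioural cloning from observation on a 1-D chain.
import numpy as np

rng = np.random.default_rng(3)

# --- experience from executing random actions: (s, a, s') triples ---
s = rng.integers(-5, 5, size=300).astype(float)
a = rng.integers(0, 2, size=300).astype(float)
s_next = s + (2 * a - 1)                 # deterministic dynamics

# step 1: inverse dynamics as a logistic model Pr(a=1 | s' - s)
v = 0.0
for _ in range(500):                     # gradient ascent on log-likelihood
    p = 1.0 / (1.0 + np.exp(-v * (s_next - s)))
    v += 0.5 * np.mean((a - p) * (s_next - s))

def inverse_dynamics(s, s_next):
    """Most likely action for an observed transition."""
    return int(1.0 / (1.0 + np.exp(-v * (s_next - s))) > 0.5)

# --- expert trajectory: states only, expert always moves right ---
expert_states = np.arange(0.0, 10.0)
labels = [inverse_dynamics(x, y)         # step 2a: infer expert actions
          for x, y in zip(expert_states[:-1], expert_states[1:])]

# step 2b: behavioural cloning on (state, inferred action); the expert
# acts identically everywhere here, so the cloned policy reduces to the
# majority inferred action
policy_action = int(np.mean(labels) > 0.5)
```

In the full algorithm the "repeat" matters: each improved policy generates experience closer to the expert's state distribution, which in turn improves the inverse dynamics model.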
Robotics Experiments
• Torabi et al., 2018