CS885 Reinforcement Learning
Module 3 (July 5, 2020): Imitation Learning
University of Waterloo, CS885 Spring 2020, Pascal Poupart
References:
• Torabi, F., Warnell, G., & Stone, P. (2018). Behavioral cloning from observation. In IJCAI (pp. 4950-4957).
• Ho, J., & Ermon, S. (2016). Generative adversarial imitation learning. In NeurIPS (pp. 4565-4573).
Imitation Learning
• Behavioural cloning (supervised learning)
• Generative adversarial imitation learning (GAIL)
• Imitation learning from observations
• Inverse reinforcement learning
Motivation
• Learn from expert demonstrations
  – No reward function needed
  – Faster learning
• Applications: autonomous driving, chatbots, robotics
Behavioural Cloning
• Simplest form of imitation learning
• Assumption: state-action pairs are observable
• Observe trajectories: (s_1, a_1), (s_2, a_2), (s_3, a_3), …, (s_n, a_n)
• Create a training set mapping states to actions: S → A
• Train by supervised learning
  – Classification (discrete actions) or regression (continuous actions)
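The supervised-learning view above can be sketched in a few lines. This is a toy illustration, not any particular paper's setup: the 2-D states, the expert's decision rule, and the softmax classifier are all made-up assumptions chosen so the example is self-contained.

```python
import numpy as np

# Behavioural cloning as plain supervised learning: fit a softmax
# classifier mapping states to expert actions. The toy 2-D states and
# the expert's rule below are illustrative assumptions.
rng = np.random.default_rng(0)

# Expert demonstrations: (state, action) pairs
states = rng.normal(size=(200, 2))
actions = (states[:, 0] > 0).astype(int)  # hypothetical expert rule

W = np.zeros((2, 2))  # one weight column per action

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Maximum-likelihood training: gradient ascent on sum_t log pi(a_t | s_t)
for _ in range(300):
    p = softmax(states @ W)
    onehot = np.eye(2)[actions]
    W += 0.1 * states.T @ (onehot - p) / len(states)

cloned = softmax(states @ W).argmax(axis=1)
accuracy = (cloned == actions).mean()  # should closely match the expert
```

Because the expert's rule here is linear in the state, a linear classifier recovers it almost exactly; in practice the classifier or regressor is typically a neural network.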
Case Study I: Autonomous Driving
• Bojarski et al. (2016). End-to-end learning for self-driving cars
• On-road tests:
  – Holmdel to Atlantic Highlands (NJ): autonomous ~98% of the time
  – Garden State Parkway (10 miles): no human intervention
Case Study II: Conversational Agents (Sordoni et al., 2015)
• Encoder: state s = "How are you doing ?"
• Decoder: action a = "I am fine"
• Pr(a | s) = ∏_i Pr(a_i | a_{i-1}, …, a_1, s)
• Objective: max_a Pr(a | s)
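The chain-rule factorization above means the response probability is just the product of per-token conditionals. A minimal numeric sketch, where the per-token probabilities are made-up numbers standing in for decoder outputs:

```python
import numpy as np

# Pr(a | s) = prod_i Pr(a_i | a_1..a_{i-1}, s) for a 3-token response.
# The per-token conditionals below are illustrative values, not the
# outputs of a trained model.
token_probs = np.array([0.9, 0.8, 0.95])  # e.g., "I", "am", "fine"

likelihood = np.prod(token_probs)          # Pr(a | s)
log_likelihood = np.sum(np.log(token_probs))  # summed in practice for stability
```

Training maximizes this (log-)likelihood over expert dialogue pairs, which is exactly behavioural cloning with sequences as actions.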
Generative Adversarial Imitation Learning (GAIL)
• Common approach: train a generator to maximize the likelihood of expert actions
• Alternative: train the generator to fool a discriminator into believing that the generated actions come from the expert
  – Leverages GANs (generative adversarial networks)
  – Ho & Ermon, 2016
Generative Adversarial Networks (GANs)
• Generator g_θ: maps a random vector z to a fake sample g_θ(z)
• Discriminator D_w: labels a sample y as real or fake
• Image examples: StyleGAN2 (Karras et al., 2020); CelebA (Liu et al., 2015)
• Minimax objective:
  min_θ max_w Σ_i [ log Pr(y_i is real; w) + log Pr(g_θ(z_i) is fake; w) ]
  = min_θ max_w Σ_i [ log D_w(y_i) + log(1 − D_w(g_θ(z_i))) ]
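One evaluation of this objective can be sketched directly. The logistic discriminator, the linear generator, and all parameter values below are toy assumptions for illustration; real GANs use neural networks for both players.

```python
import numpy as np

# Evaluate the GAN minimax objective for fixed toy players:
# a logistic discriminator D_w(y) = sigmoid(w . y) and a linear
# generator g(z) = z @ G. All values are illustrative assumptions.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
w = np.array([1.0, -0.5])                      # discriminator weights
real = rng.normal(loc=2.0, size=(100, 2))      # "real" data samples y_i
z = rng.normal(size=(100, 2))                  # random input vectors
fake = z @ (0.5 * np.eye(2))                   # generator output g(z_i)

# sum_i log D_w(y_i) + log(1 - D_w(g(z_i)))
# (maximized over w by the discriminator, minimized over the generator)
objective = np.log(sigmoid(real @ w)).sum() + np.log(1 - sigmoid(fake @ w)).sum()
```

Each term is a log of a probability, so the objective is always non-positive; training alternates gradient ascent on w with gradient descent on the generator's parameters.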
GAIL Pseudocode
Input: expert trajectories τ_E ∼ π_expert, where τ_E = (s_0, a_0, s_1, a_1, …)
Initialize parameters θ of policy π_θ and parameters w of discriminator D_w
Repeat until stopping criterion:
  Update discriminator parameters:
    g_w = Σ_{(s,a) ∈ τ_E} ∇_w log D_w(s, a) + Σ_{(s,a) ∼ π_θ(a|s)} ∇_w log(1 − D_w(s, a))
    w ← w + α_w g_w
  Update policy parameters with TRPO:
    Cost(s_t, a_t) = Σ_{(s,a) | s_t, a_t, π_θ} log(1 − D_w(s, a))
    g_θ = Σ_{(s,a) ∼ π_θ} ∇_θ log π_θ(a | s) Cost(s, a) − λ ∇_θ H(π_θ)
    θ ← θ − α_θ g_θ
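The alternating updates above can be sketched on a deliberately tiny problem. This is a simplified illustration, not Ho & Ermon's algorithm: plain REINFORCE replaces TRPO, the entropy bonus is dropped, and the one-state environment with two actions is a made-up toy.

```python
import numpy as np

# GAIL-style alternating updates on a one-state, two-action toy problem.
# Simplifications (assumptions, not the paper's method): REINFORCE in
# place of TRPO, no entropy regularizer, tabular policy/discriminator.
rng = np.random.default_rng(0)

n_actions = 2
theta = np.zeros(n_actions)   # policy logits: pi(a) = softmax(theta)[a]
w = np.zeros(n_actions)       # discriminator logits: D(a) = sigmoid(w[a])
expert_actions = np.zeros(64, dtype=int)  # toy expert: always action 0

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

alpha_w, alpha_theta = 0.5, 0.5
for _ in range(200):
    pi = softmax(theta)
    agent_actions = rng.choice(n_actions, size=64, p=pi)

    # Discriminator ascent: push D up on expert pairs, down on agent pairs
    g_w = np.zeros(n_actions)
    for a in expert_actions:
        g_w[a] += 1.0 - sigmoid(w[a])   # d/dw log D(a)
    for a in agent_actions:
        g_w[a] -= sigmoid(w[a])         # d/dw log(1 - D(a))
    w += alpha_w * g_w / 64

    # Policy descent on Cost(a) = log(1 - D(a)) via REINFORCE
    cost = np.log(1.0 - sigmoid(w) + 1e-8)
    g_theta = np.zeros(n_actions)
    for a in agent_actions:
        grad_logpi = -pi
        grad_logpi[a] += 1.0            # d/dtheta log pi(a)
        g_theta += grad_logpi * cost[a]
    theta -= alpha_theta * g_theta / 64

pi_final = softmax(theta)  # probability mass should concentrate on action 0
```

The policy ends up imitating the expert's choice because actions the discriminator labels as expert-like receive the lowest cost.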
Robotics Experiments
• Robot imitating an expert policy with GAIL (Ho & Ermon, 2016)
Imitation Learning from Observations
• Consider imitation learning from a human expert (Schaal et al., 2003)
• Actions (e.g., forces) are unobservable
• Only states/observations (e.g., joint positions) are observable
• Problem: infer actions from state/observation sequences
Inverse Dynamics
Two steps:
1. Learn inverse dynamics
   – Learn Pr(a | s, s′) by supervised learning
   – From (s, a, s′) samples obtained by executing random actions
2. Behavioural cloning
   – Learn π(â | s) by supervised learning
   – From (s, s′) samples from expert trajectories, with â ∼ Pr(a | s, s′) sampled from the inverse dynamics model
Pseudocode: Imitation Learning from Observations
Input: expert trajectories τ_E ∼ π_expert, where τ_E = (s_0, s_1, s_2, …)
Initialize agent policy π_θ at random
Repeat:
  Learn inverse dynamics model with parameters w:
    Sample (s_t^{π_θ}, a_t^{π_θ}, s_{t+1}^{π_θ}) by executing π_θ
    w ← argmax_w Σ_t log Pr_w(a_t^{π_θ} | s_t^{π_θ}, s_{t+1}^{π_θ})
  Learn policy parameters θ:
    For each (s_t^{τ_E}, s_{t+1}^{τ_E}) from expert trajectories τ_E do:
      â_t^{τ_E} ∼ Pr(a_t | s_t^{τ_E}, s_{t+1}^{τ_E})
    θ ← argmax_θ Σ_t log π_θ(â_t^{τ_E} | s_t^{τ_E})
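The two stages above can be sketched on a toy deterministic environment. Everything here is an illustrative simplification of Torabi et al.'s BCO: a 1-D chain stands in for the robot, and count-based tables replace the paper's neural networks.

```python
import numpy as np

# Toy sketch of imitation learning from observations on a deterministic
# 1-D chain: states 0..4, actions -1 (left) and +1 (right), s' = s + a
# (clipped at the ends). The chain, count-based inverse dynamics model,
# and tabular policy are illustrative assumptions, not the paper's setup.
rng = np.random.default_rng(0)
n_states, actions = 5, np.array([-1, +1])

def step(s, a):
    return int(np.clip(s + a, 0, n_states - 1))

# Stage 1: learn inverse dynamics Pr(a | s, s') from random-action rollouts
counts = np.zeros((n_states, n_states, len(actions)))  # counts[s, s', a_idx]
s = 2
for _ in range(500):
    a_idx = rng.integers(len(actions))
    s_next = step(s, actions[a_idx])
    counts[s, s_next, a_idx] += 1
    s = s_next

def infer_action(s, s_next):
    # Most likely action index under the learned inverse dynamics model
    return int(np.argmax(counts[s, s_next]))

# Stage 2: behavioural cloning on an expert *state-only* trajectory
expert_states = [0, 1, 2, 3, 4]               # expert always moves right
policy = np.zeros((n_states, len(actions)))   # inferred action counts per state
for s, s_next in zip(expert_states, expert_states[1:]):
    policy[s, infer_action(s, s_next)] += 1

greedy = policy.argmax(axis=1)  # greedy imitation policy per visited state
```

Even though the expert trajectory contains no actions, the inverse dynamics model recovers the missing labels, after which stage 2 is ordinary behavioural cloning.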
Robotics Experiments
• Imitation learning from observations on robot control tasks (Torabi et al., 2018)