Maximum Entropy Framework: Inverse RL, Soft Optimality, and More Chelsea Finn and Sergey Levine UC Berkeley 5/20/2017
Introductions: Sergey Levine (assistant professor), Chelsea Finn (PhD student)
Outline 1. A World without Rewards 2. A Probabilistic Model of Behavior 3. Application: Inverse RL 4. GANs and Energy-Based Models 5. Application: Soft-Q Learning
What is the reward? [Diagram: reinforcement learning agent receiving a reward; Mnih et al. '15.] In the real world, humans don't get a score. [Video from Montessori New Zealand]
A reward function is essential for RL (Tesauro '95; Kohl & Stone '04; Mnih et al. '15; Silver et al. '16). In real-world domains, the reward/cost is often difficult to specify: • robotic manipulation • autonomous driving • dialog systems • virtual assistants • and more…
One approach: mimic the actions of a human expert (behavioral cloning). + simple, sometimes works well - but no reasoning about outcomes or dynamics - the expert might have different degrees of freedom - the expert might not always be optimal. Can we reason about human decision-making?
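For concreteness, behavioral cloning reduces to ordinary supervised learning on the expert's (state, action) pairs. Below is a minimal sketch assuming a linear policy fit by least squares; the function and variable names are illustrative, not from the talk.

```python
import numpy as np

def behavioral_cloning(expert_states, expert_actions):
    """Fit a linear policy a ~= s @ W to expert (state, action) pairs.
    expert_states: (N, state_dim), expert_actions: (N, action_dim)."""
    W, *_ = np.linalg.lstsq(expert_states, expert_actions, rcond=None)
    return W

# At test time the cloned policy just replays this regression fit; it never
# reasons about dynamics or outcomes, which is the limitation noted above.
```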
Outline 1. A World without Rewards 2. A Probabilistic Model of Behavior 3. Application: Inverse RL 4. GANs and Energy-Based Models 5. Application: Soft-Q Learning
Optimal Control as a Model of Human Behavior. Muybridge (c. 1870); Mombaur et al. '09; Li & Todorov '06; Ziebart '08. Optimize this (the cost/reward) to explain the data.
What if the data is not optimal? Some mistakes matter more than others! Behavior is stochastic, but good behavior is still the most likely.
A probabilistic graphical model of decision making: no assumption of optimal behavior!
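For reference, the model can be stated compactly (this is the standard soft-optimality formulation from this line of work; the talk derives it on the whiteboard rather than on the slide). A binary optimality variable O_t is attached to each timestep and observed with likelihood proportional to the exponentiated reward:

```latex
p(\mathcal{O}_t = 1 \mid s_t, a_t) \propto \exp\big(r(s_t, a_t)\big), \qquad
p(\tau \mid \mathcal{O}_{1:T}) \propto p(s_1)\,\prod_t p(s_{t+1} \mid s_t, a_t)\,
  \exp\Big(\textstyle\sum_t r(s_t, a_t)\Big).
```

With deterministic dynamics this reduces to p(τ) ∝ exp(Σ_t r(s_t, a_t)): good behavior is the most likely, but suboptimal trajectories retain nonzero probability.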
Inference = planning. How do we do inference?
A closer look at the backward pass: the "optimistic" transition (not a good idea!). See Ziebart et al. '10, "Modeling Interaction via the Principle of Maximum Causal Entropy".
Stochastic optimal control (MaxCausalEnt) summary: 1. Probabilistic graphical model for optimal control 2. Control = inference (similar to HMM, EKF, etc.) 3. Very similar to dynamic programming, value iteration, etc. (but "soft")
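A minimal tabular sketch of the "soft" backward pass summarized above; array shapes and names are illustrative assumptions, and the recursion itself is what the talk derives on the whiteboard.

```python
import numpy as np
from scipy.special import logsumexp

def soft_value_iteration(P, r, T):
    """P: (S, A, S) transition probabilities, r: (S, A) rewards, T: horizon.
    Replacing logsumexp with a hard max recovers standard value iteration."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(T):
        # Take the expectation over the true dynamics: this is the MaxCausalEnt
        # fix for the "optimistic transition" issue on the previous slide.
        Q = r + P @ V                 # Q(s, a) = r(s, a) + E_{s'}[V(s')]
        V = logsumexp(Q, axis=1)      # "soft" max over actions
    policy = np.exp(Q - V[:, None])   # pi(a | s) = exp(Q(s, a) - V(s))
    return Q, V, policy
```

The sketch is undiscounted and finite-horizon purely to stay short; the same backup works with a discount factor.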
Outline 1. A World without Rewards 2. A Probabilistic Model of Behavior 3. Application: Inverse RL 4. GANs and Energy-Based Models 5. Application: Soft-Q Learning
Given a reward, we can model how humans sub-optimally maximize it. How can this help us with learning?
Inverse Optimal Control / Inverse Reinforcement Learning (IOC/IRL): infer the cost/reward function from demonstrations (Kalman '64; Ng & Russell '00). Given: state & action space, roll-outs from π*, dynamics model [sometimes]. Goal: recover the reward function, then use the reward to get a policy. Challenges: underdefined problem; difficult to evaluate a learned reward; demonstrations may not be precisely optimal.
Early IRL Approaches - deterministic MDP - alternate between solving the MDP & updating the reward - heuristics for handling sub-optimality. Ng & Russell '00: expert actions should have higher value than other actions, and a larger gap is better. Abbeel & Ng '04: the policy that is optimal w.r.t. the learned cost should match the feature counts of the expert trajectories. Ratliff et al. '06: max-margin formulation between the value of expert actions and other actions. How to handle ambiguity and suboptimality?
Maximum Entropy Inverse RL (Ziebart et al. '08): handle ambiguity using a probabilistic model of behavior. Notation: r_ψ is the reward with parameters ψ [linear case: r_ψ(s) = ψᵀ f(s)]; D is the dataset of demonstrations. [Whiteboard]
Maximum Entropy Inverse RL (Ziebart et al. ’08)
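A tabular sketch of the resulting algorithm, assuming a linear reward r_ψ(s) = ψᵀ f(s) and known dynamics. It reuses soft_value_iteration from the earlier sketch, uses a stationary soft policy for brevity, and all names are illustrative rather than the paper's.

```python
import numpy as np

def maxent_irl(P, features, demos, T, n_iters=200, lr=0.1):
    """P: (S, A, S) dynamics, features: (S, F), demos: list of state-index
    sequences. Gradient of the log-likelihood = expert feature expectations
    minus feature expectations of the current soft-optimal policy."""
    S, A, _ = P.shape
    psi = np.zeros(features.shape[1])
    f_expert = np.mean([features[traj].sum(axis=0) for traj in demos], axis=0)
    starts = np.array([traj[0] for traj in demos])
    d0 = np.bincount(starts, minlength=S) / len(demos)   # initial-state distribution
    for _ in range(n_iters):
        r = np.repeat((features @ psi)[:, None], A, axis=1)   # state-based reward
        _, _, policy = soft_value_iteration(P, r, T)          # sketch defined above
        # Forward pass: expected state visitation counts under the soft policy.
        d, rho = d0.copy(), d0.copy()
        for _ in range(T - 1):
            d = np.einsum('s,sa,sak->k', d, policy, P)
            rho += d
        psi += lr * (f_expert - rho @ features)   # gradient ascent on log-likelihood
    return psi
```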
What about unknown dynamics? Whiteboard
Case Study: Guided Cost Learning (ICML 2016). Goals: - remove the need to solve the MDP in the inner loop - be able to handle unknown dynamics - handle continuous state & action spaces
Guided cost learning algorithm [diagram]: start from an initial policy π_0; generate samples from π; update the reward using samples & demos; update π w.r.t. the reward; repeat.
Guided cost learning algorithm [same diagram]: update the reward (discriminator) using generator samples & demos; partially optimize π (generator) w.r.t. the reward, i.e., the reward is updated in the inner loop of policy optimization.
Guided cost learning algorithm [same diagram]; a closely related adversarial formulation appears in Ho et al., ICML '16, NIPS '16.
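The reward ("discriminator") update at the heart of this loop can be written compactly: policy samples provide an importance-sampled estimate of the MaxEnt partition function. A minimal sketch, with trajectories summarized by feature vectors and a linear reward purely for illustration; the paper itself uses neural-net rewards and alternates this step with policy optimization, which is not shown here.

```python
import numpy as np

def gcl_reward_step(theta, demo_feats, sample_feats, sample_logq, lr=1e-2):
    """demo_feats: (Nd, F) features of demonstrations; sample_feats: (Ns, F)
    features of policy samples; sample_logq: (Ns,) log-probabilities of those
    samples under the sampling policy (the importance-sampling proposal)."""
    r_samples = sample_feats @ theta
    # Self-normalized importance weights w_i proportional to exp(r(tau_i)) / q(tau_i).
    logw = r_samples - sample_logq
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # Gradient of the log-likelihood: demo features minus weighted sample features.
    grad = demo_feats.mean(axis=0) - w @ sample_feats
    return theta + lr * grad
```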
GCL experiments, real-world tasks. Dish placement: state includes the goal plate pose. Pouring almonds: state includes unsupervised visual features [Finn et al. '16]. Actions: joint torques.
Comparisons: Path Integral IRL (Kalakrishnan et al. '13) and Relative Entropy IRL (Boularias et al. '11). [Diagram: samples are drawn from a fixed distribution q rather than an updated policy; the reward is updated using those samples & demos.]
Dish placement, demos
Dish placement, standard cost
Dish placement, RelEnt IRL • video of dish baseline method
Dish placement, GCL policy • video of dish our method - samples & reoptimizing
Pouring, demos • video of pouring demos
Pouring, RelEnt IRL • video of pouring baseline method
Pouring, GCL policy • video of pouring our method - samples
Conclusion : We can recover successful policies for new positions. Is the reward function also useful for new scenarios?
Dish placement - GCL reopt. • video of dish our method - samples & reoptimizing
Pouring - GCL reopt. • video of pouring our method - reoptimization Note : normally the GAN discriminator is discarded
Guided Cost Learning & Generative Adversarial Imitation Learning. Strengths: - can handle unknown dynamics - scales to neural net rewards - efficient enough for real robots. Limitations: - adversarial optimization is hard - can't scale to raw pixel observations of demos - demonstrations typically collected with kinesthetic teaching or teleoperation (first person)
Outline 1. A World without Rewards 2. A Probabilistic Model of Behavior 3. Application: Inverse RL 4. GANs and Energy-Based Models 5. Application: Soft-Q Learning
Generative Adversarial Networks (Goodfellow et al. '14; Arjovsky et al. '17; Zhu et al. '17; Isola et al. '17). Similarly, GANs learn an objective for generative modeling. [Diagram: the generator G maps noise to generated samples; the discriminator D classifies real vs. generated.]
Connection between inverse RL and GANs: trajectory τ ↔ sample x; policy π ~ q(τ) ↔ generator G; reward r ↔ discriminator D. Discriminator: only needs to learn the data distribution, with θ independent of the generator density. (Finn*, Christiano*, Abbeel, Levine, arXiv '16)
Connection between inverse RL and GANs: trajectory τ ↔ sample x; policy π ~ q(τ) ↔ generator G; cost c ↔ discriminator D. Generator: the generator objective is entropy-regularized RL. (Finn*, Christiano*, Abbeel, Levine, arXiv '16)
GANs for training EBMs: MaxEnt IRL is an energy-based model. Sampler q(x) ↔ generator G; energy E ↔ discriminator D. Use the generator's density q(x) to form a consistent estimator of the energy function:
D_\theta(x) = \frac{\frac{1}{Z}\exp(-E_\theta(x))}{\frac{1}{Z}\exp(-E_\theta(x)) + q(x)}
(Kim & Bengio, ICLR Workshop '16; Zhao et al., arXiv '16; Zhai et al., ICLR sub. '17; Dai et al., ICLR submission '17; Finn*, Christiano*, Abbeel, Levine, arXiv '16)
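The slide's discriminator can be implemented directly once the generator's density q(x) is known. A small numerical sketch, folding the partition function into a scalar log_Z that would itself be estimated or learned (an assumption made here for illustration):

```python
import numpy as np

def ebm_discriminator_logit(energy, log_q, log_Z):
    """log D - log(1 - D) for D(x) = p_theta(x) / (p_theta(x) + q(x)),
    with log p_theta(x) = -energy(x) - log_Z."""
    return (-energy - log_Z) - log_q      # D(x) = sigmoid(log p_theta(x) - log q(x))

def discriminator_loss(energy_data, logq_data, energy_gen, logq_gen, log_Z):
    """Standard GAN discriminator loss with D parameterized through the energy."""
    logit_data = ebm_discriminator_logit(energy_data, logq_data, log_Z)
    logit_gen = ebm_discriminator_logit(energy_gen, logq_gen, log_Z)
    # -E[log D(x_data)] - E[log(1 - D(x_gen))], via numerically stable softplus.
    return np.mean(np.logaddexp(0.0, -logit_data)) + np.mean(np.logaddexp(0.0, logit_gen))
```

At the optimum (with q covering the data support), driving this loss down pushes exp(-E_θ(x))/Z toward the data density, which is the consistency property referenced on the slide.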
Outline 1. A World without Rewards 2. A Probabilistic Model of Behavior 3. Application: Inverse RL 4. GANs and Energy-Based Models 5. Application: Soft-Q Learning
Stochastic models for learning control: how can we track both hypotheses?
Stochastic energy-based policies. Tuomas Haarnoja, Haoran Tang
Soft Q-learning
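The core backup can be sketched for the discrete-action case (the continuous-action machinery on the next slide is what makes it practical at scale): the hard max becomes a temperature-scaled log-sum-exp, and the policy is the corresponding Boltzmann distribution. Names and shapes below are illustrative assumptions.

```python
import numpy as np
from scipy.special import logsumexp

def soft_q_target(Q_next, r, gamma=0.99, alpha=1.0):
    """Q_next: (N, A) soft Q-values at sampled next states; r: (N,) rewards.
    Returns the regression target r + gamma * V(s'), where
    V(s') = alpha * log sum_a exp(Q(s', a) / alpha)."""
    V_next = alpha * logsumexp(Q_next / alpha, axis=1)
    return r + gamma * V_next

def soft_policy(Q, alpha=1.0):
    """pi(a | s) proportional to exp(Q(s, a) / alpha); alpha -> 0 recovers the
    greedy (hard max) policy."""
    logits = Q / alpha
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)
```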
Tractable amortized inference for continuous actions (Wang & Liu '17)
Stochastic energy-based policies aid exploration
Stochastic energy-based policies provide pretraining
Stochastic Optimal Control & MaxEnt in RL. Sallans & Hinton. Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task. 2000. Nachum et al. Bridging the Gap Between Value and Policy Based Reinforcement Learning. 2017. O'Donoghue et al. Combining Policy Gradient and Q-Learning. 2017. Peters et al. Relative Entropy Policy Search. 2010.