Trajectory Optimization, Imitation Learning Lecture 14
What will you take home today?
● Recap LQR
● Trajectory Optimization Paper
● Imitation Learning
● Supervised Learning
● DAgger
How to solve Optimal Control Problems?
Sequential Quadratic Programming
Example – Newton-Raphson Method
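The Newton-Raphson method named on this slide can be sketched in a few lines; the root-finding problem below (computing √2) is an illustrative example, not necessarily the one worked in the lecture:

```python
# Newton-Raphson iteration: x <- x - f(x) / f'(x), repeated until the
# step size falls below a tolerance.

def newton_raphson(f, df, x0, tol=1e-10, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Illustrative example: solve x^2 - 2 = 0, i.e. compute sqrt(2).
root = newton_raphson(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
```

SQP/SLQ apply the same idea one level up: each outer iteration solves a local quadratic (or linear-quadratic) approximation of the nonlinear problem.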
Sequential Linear Quadratic Programming
SLQ Algorithm
Linear Dynamical Systems, Quadratic Cost – Linear Quadratic Regulator (LQR)
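The finite-horizon LQR solution can be sketched as a backward Riccati recursion that produces time-varying feedback gains u_t = -K_t x_t; the double-integrator system and weights below are illustrative assumptions, not values from the lecture:

```python
# Finite-horizon discrete-time LQR, assuming linear dynamics
# x_{t+1} = A x_t + B u_t and quadratic cost sum_t (x'Qx + u'Ru) + x_T' Qf x_T.
import numpy as np

def lqr_gains(A, B, Q, R, Qf, T):
    P = Qf
    gains = []
    for _ in range(T):
        # K_t = (R + B'PB)^{-1} B'PA, then Riccati update for P
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]  # gains[t] is the feedback matrix at time t

# Illustrative example: double integrator (position, velocity).
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.eye(2); R = np.array([[1.0]]); Qf = 10.0 * np.eye(2)
Ks = lqr_gains(A, B, Q, R, Qf, T=50)
```

Rolling the closed loop x_{t+1} = (A - B K_t) x_t forward drives the state toward the origin, which is the regulator behavior the slide refers to.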
Trajectory Optimization
What will you take home today?
● Recap LQR
● Trajectory Optimization Paper
● Imitation Learning
● Supervised Learning
● DAgger
Assumptions in Optimal Control
1. Known and/or simple system dynamics
2. Known cost function
What are approaches for unknown dynamics and/or cost?
1. Learning approaches
   a. Reinforcement learning
      i. Model-based
      ii. Model-free
   b. Imitation learning
      i. Imitate an expert policy
Learning to make single predictions versus a sequence of predictions
Running Example: Super Tux Cart from A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. Ross, Gordon, Bagnell. AISTATS. 2011. https://www.youtube.com/watch?feature=oembed&v=V00npNnWzSU
Imitation Learning
1. Useful when dynamics and/or cost are unknown/complex
   a. We don't know what the next state will look like. Hard to model.
   b. We don't know the cost-to-go for an action.
      ○ C(s, a): the expected immediate cost of taking action a in state s
      ○ C_π(s) = E_{a∼π(s)}[C(s, a)]: the expected immediate cost of executing policy π in state s
      ○ J(π): the cost-to-go, i.e. the total cost of executing π over T steps
Imitation learning – core idea
1. Idea: imitate expert trajectories!
   a. Bound J(π) for any cost function C based on how well π mimics the expert's policy
Imitation Learning by Classification Algorithm from - A Course in Machine Learning by Hal Daumé III. Ch. 18
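The classification-based imitation learning on this slide (fit a classifier to expert state-action pairs, then run it as a policy) can be sketched as follows; the 1-nearest-neighbor classifier and the toy steering data are illustrative assumptions, not Daumé's algorithm verbatim:

```python
# Behavior cloning sketch: treat expert demonstrations as an i.i.d.
# supervised dataset and fit a classifier from states to actions.
import numpy as np

def fit_bc_policy(states, actions):
    """Return a 1-nearest-neighbor policy trained on expert (state, action) pairs."""
    states = np.asarray(states, dtype=float)
    actions = list(actions)
    def policy(s):
        dists = np.linalg.norm(states - np.asarray(s, dtype=float), axis=1)
        return actions[int(np.argmin(dists))]
    return policy

# Toy demonstration (illustrative): expert steers toward the track center x = 0.
demo_states = [[-2.0], [-1.0], [1.0], [2.0]]
demo_actions = ["right", "right", "left", "left"]
pi = fit_bc_policy(demo_states, demo_actions)
```

The next slides show why this simple reduction degrades: once π is executed, it visits states the expert never labeled.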
How well does Imitation Learning by Classification work? 1. Depends on a. How good the expert is. b. How much error the classifier makes.
Running Example: Super Tux Cart Figure from ‘Interactive Learning for Sequential Decisions and Predictions’ by Stephane Ross.
Learned behavior influences states and observations
Challenge:
● System dynamics are assumed both unknown and complex: we cannot compute d_π and can only sample it by executing π in the system.
● Non-i.i.d. supervised learning problem, due to the dependence of the input distribution on the policy π itself.
● Difficult optimization: this dependence makes the problem non-convex.
When are data samples i.i.d.?
● Typical assumption in statistics and machine learning: observations in a sample are independent and identically distributed.
● This simplifies many methods, although it is not true in many practical settings.
● Examples of i.i.d. samples: coin flips, roulette spins.
Running Example: Super Tux Cart from A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. Ross, Gordon, Bagnell. AISTATS. 2011. https://www.youtube.com/watch?feature=oembed&v=V00npNnWzSU
Another example – Super Mario from A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. Ross, Gordon, Bagnell. AISTATS. 2011. https://www.youtube.com/watch?v=anOI0xZ3kGM
How do we train a policy that can deal with any possible situation?
1. This is impossible: the state/observation space may be prohibitively large, and we cannot train on all possible configurations. If we could, we might as well just memorize them.
2. Goal: train f to do well on the configurations that it encounters itself.
3. Chicken-and-egg problem:
   a. We want a policy that does well in a set of world configurations.
   b. Which configurations? The ones it encounters.
4. Solve by iteration: roll out f, collect data, retrain.
Dataset Aggregation Algorithm (Dagger) Figure and Algorithm from - A Course in Machine Learning by Hal Daumé III. Ch. 18
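The DAgger loop on this slide (roll out the current policy, query the expert on the visited states, aggregate, retrain) can be sketched as follows; the toy 1-D environment, expert, and 1-nearest-neighbor `fit` are illustrative stand-ins, not the lecture's implementation:

```python
# DAgger (Dataset Aggregation) sketch: the learner's own actions drive the
# rollout, but the expert labels every state that is visited.

def dagger(env_reset, env_step, expert, fit, n_iters=5, horizon=20):
    dataset = []               # aggregated (state, expert_action) pairs
    policy = expert            # iteration 0 effectively rolls out the expert
    for _ in range(n_iters):
        s = env_reset()
        for _ in range(horizon):
            dataset.append((s, expert(s)))  # expert labels the visited state
            s = env_step(s, policy(s))      # learner's action drives the rollout
        policy = fit(dataset)               # retrain on all aggregated data
    return policy

# Toy 1-D example (illustrative): the expert always steers the state toward 0.
expert = lambda s: -1.0 if s > 0 else 1.0
fit = lambda data: (lambda s: min(data, key=lambda p: abs(p[0] - s))[1])  # 1-NN
pi = dagger(env_reset=lambda: 3.0,
            env_step=lambda s, a: s + 0.5 * a,
            expert=expert, fit=fit)
```

Because the expert labels states induced by the learner's own behavior, the training distribution matches the test distribution, which is what the no-regret guarantee on the next slide exploits.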
How well does Dagger work? Theorem from - A Course in Machine Learning by Hal Daumé III. Ch. 18
Running Example: Super Tux Cart from A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. Ross, Gordon, Bagnell. AISTATS. 2011.
Requirements on the Expert
1. Human demonstrations
2. An expensive but exact algorithm that is too slow to run in real time