Reduction of Imitation Learning to No-Regret Online Learning
Stephane Ross
Joint work with Drew Bagnell & Geoff Gordon
Imitation Learning
[Diagram: Expert → Demonstrations → Machine Learning Algorithm → Learned Policy]
Imitation Learning
• Many successes:
  – Legged locomotion [Ratliff 06]
  – Outdoor navigation [Silver 08]
  – Helicopter flight [Abbeel 07]
  – Car driving [Pomerleau 89]
  – etc.
Example Scenario: Learning to Drive from Demonstrations
• Input: camera image
• Output: a policy producing steering in [-1, 1] (hard left turn to hard right turn)
Supervised Training Procedure
• Collect a dataset of expert trajectories and train on the expert's state distribution D(π*):
  π̂_sup = argmin_{π ∈ Π} E_{s ~ D(π*)} [ ℓ(π, s, π*(s)) ]
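A minimal sketch of this supervised (behavioral cloning) procedure, assuming hypothetical `expert_trajectories` (lists of (state, expert action) pairs) and a generic `train_classifier` helper; neither is from the talk:

```python
def supervised_imitation(expert_trajectories, train_classifier):
    """Behavioral cloning: fit a policy to state-action pairs drawn from the
    expert's own state distribution D(pi*)."""
    dataset = [(s, a) for trajectory in expert_trajectories for (s, a) in trajectory]
    return train_classifier(dataset)   # pi_sup: minimizes the imitation loss on expert states
```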
Poor Performance in Practice
# Mistakes Grows Quadratically in T! [Ross 2010]
• J(π̂_sup) ≤ T² ε
  – J(π̂_sup): expected # of mistakes over T steps
  – ε: avg. loss on D(π*)
  – T: # of time steps
• Reason: the policy doesn't learn how to recover from its own errors!
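A hedged sketch of where the quadratic factor comes from (a paraphrase of the standard compounding-error argument, not the talk's exact proof):

```latex
% Suppose the learned policy errs with probability at most \epsilon on states drawn
% from the expert's distribution D(\pi^*).  Once it makes its first error it may leave
% that distribution and, in the worst case, err on every remaining step.
\begin{align*}
  \mathbb{E}[\#\text{ mistakes}]
    \;\le\; \sum_{t=1}^{T} \Pr[\text{first error at step } t]\,\bigl(T - t + 1\bigr)
    \;\le\; \sum_{t=1}^{T} \epsilon \, T
    \;=\; \epsilon\, T^{2}.
\end{align*}
```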
Reduction-Based Approach & Analysis
• Reduce the hard learning problem to (a sequence of) easier related problems:
  – Solve the easier problems with performance ε ⇒ solve the hard problem with performance f(ε)
• Example: cost-sensitive multiclass classification reduced to binary classification [Beygelzimer 2005]
Previous Work: Forward Training [Ross 2010]
• Sequentially learn one policy π_t per time step t
• # mistakes grows linearly: J(π_{1:T}) ≤ O(T ε)
• Impractical if T is large
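A minimal sketch of the forward-training loop, with hypothetical `env` (where `env.step(action)` returns the next state), `expert`, and `train_classifier` interfaces; these names are assumptions, not the talk's code:

```python
def forward_training(env, expert, train_classifier, T, rollouts_per_step):
    """Forward training: learn one policy per time step, each trained on the states
    actually reached by executing the policies learned for the earlier steps."""
    policies = []
    for t in range(T):
        data_t = []
        for _ in range(rollouts_per_step):
            s = env.reset()
            for i in range(t):                      # reach step t with the learned policies
                s = env.step(policies[i](s))
            data_t.append((s, expert(s)))           # expert labels the states seen at step t
        policies.append(train_classifier(data_t))   # policy used only at step t
    return policies                                 # non-stationary policy (pi_1, ..., pi_T)
```

Because each step's policy is trained on the state distribution it will actually face, errors no longer compound quadratically, but T separate training rounds are needed.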
Previous Work: SMILe [Ross 2010]
• Learn a stochastic policy, changing the policy slowly:
  – π_n = π_{n-1} + α_n (π̂'_n − π*)
  – π̂'_n trained to mimic π* under D(π_{n-1})
  – Similar to SEARN [Daume 2009]
• Near-linear bound: J(π) ≤ O(T log(T) ε + 1)
• Stochasticity is undesirable
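A sketch of executing the resulting stochastic mixture, assuming the geometric weights α_i = α(1−α)^(i−1) that make the mixture sum to 1 (the slide only writes α_n, so the exact weights are an assumption):

```python
import random

def smile_policy(expert, learned, alpha):
    """SMILe mixture after n = len(learned) iterations: follow the expert with
    probability (1-alpha)^n, and the i-th learned policy with probability
    alpha * (1-alpha)^(i-1)."""
    n = len(learned)
    def act(s):
        r = random.random()
        threshold = (1 - alpha) ** n
        if r < threshold:
            return expert(s)                      # expert component of the mixture
        for i, pi in enumerate(learned, start=1):
            threshold += alpha * (1 - alpha) ** (i - 1)
            if r < threshold:
                return pi(s)
        return learned[-1](s)                     # guard against floating-point round-off
    return act
```

This stochasticity at execution time is what the slide flags as undesirable.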
DAgger: Dataset Aggregation
• Collect trajectories with expert π*
• Dataset D_0 = {(s, π*(s))}
• Train π_1 on D_0
• Collect new trajectories with π_1
• New dataset D_1' = {(s, π*(s))}: states visited by π_1, steering labels from the expert
• Aggregate datasets: D_1 = D_0 ∪ D_1'
• Train π_2 on D_1
DAgger: Dataset Aggregation
• Collect new trajectories with π_2
• New dataset D_2' = {(s, π*(s))}
• Aggregate datasets: D_2 = D_1 ∪ D_2'
• Train π_3 on D_2
DAgger: Dataset Aggregation
• In general, at iteration n:
  – Collect new trajectories with π_n
  – New dataset D_n' = {(s, π*(s))}
  – Aggregate datasets: D_n = D_{n-1} ∪ D_n'
  – Train π_{n+1} on D_n
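A minimal sketch of the full loop on the preceding slides, reusing the same hypothetical `env`, `expert`, and `train_classifier` interfaces as above:

```python
def dagger(env, expert, train_classifier, n_iterations, rollouts_per_iter, horizon):
    """DAgger: roll out the current policy, have the expert label every visited state,
    aggregate all data so far, and retrain."""
    dataset = []                                    # aggregated (state, expert action) pairs
    policy = expert                                 # iteration 0: trajectories come from the expert
    learned = []
    for _ in range(n_iterations):
        for _ in range(rollouts_per_iter):
            s = env.reset()
            for _ in range(horizon):
                dataset.append((s, expert(s)))      # expert provides the label for state s...
                s = env.step(policy(s))             # ...but the current policy chooses the action
        policy = train_classifier(dataset)          # train pi_{n+1} on the aggregate dataset D_n
        learned.append(policy)
    return learned                                  # e.g. return the policy that validates best
```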
Online Learning
• A repeated game between a learner and an adversary: each round i, the learner picks a hypothesis h_i, the adversary reveals examples and the loss L_i, and the learner updates.
[Figure: learner and adversary exchanging batches of labeled (+/−) examples over successive rounds]
• Avg. regret: (1/n) Σ_{i=1}^n L_i(h_i) − min_{h ∈ H} (1/n) Σ_{i=1}^n L_i(h)
DAgger as Online Learning
• View each DAgger iteration as a round of online learning, where the adversary's loss is the imitation loss under the current policy's state distribution:
  L_n(π) = E_{s ~ D(π_n)} [ ℓ(π, s, π*(s)) ]
• DAgger's update, training on the aggregate dataset, is then exactly Follow-The-Leader (FTL):
  π_{n+1} = argmin_{π ∈ Π} Σ_{i=1}^n L_i(π)
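A hedged sketch of why the no-regret property matters here (a paraphrase of the argument, with constants omitted), setting up the bound on the next slide:

```latex
% \gamma_N: average regret of the online learner; \epsilon_N: loss of the best single
% policy in hindsight on the aggregate dataset.  No-regret means \gamma_N -> 0.
\begin{align*}
  \frac{1}{N}\sum_{i=1}^{N} L_i(\pi_i)
    \;\le\; \min_{\pi \in \Pi}\frac{1}{N}\sum_{i=1}^{N} L_i(\pi) \;+\; \gamma_N
    \;=\; \epsilon_N + \gamma_N .
\end{align*}
% Since L_i(\pi_i) is the expected per-step imitation loss of \pi_i under its *own*
% state distribution, the best policy in the sequence has per-step loss at most
% \epsilon_N + \gamma_N, and hence at most T(\epsilon_N + \gamma_N) mistakes over T steps.
```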
Theoretical Guarantees of DAgger
• The best policy in the sequence π_{1:N} guarantees:
  J(π̂) ≤ T ε_N + O(T γ_N)
  – N: # of iterations of DAgger
  – ε_N: avg. loss of the best policy on the aggregate dataset
  – γ_N: avg. regret of π_{1:N}
• For strongly convex loss, N = O(T log T) iterations give: J(π̂) ≤ T ε_N + O(1)
• Any no-regret algorithm has the same guarantees
Theoretical Guarantees of DAgger
• If we sample m trajectories at each iteration, then w.p. ≥ 1 − δ:
  J(π̂) ≤ T ε̂_N + O(T γ_N) + O(T √(log(1/δ) / (N m)))
  – ε̂_N: empirical avg. loss of the best policy on the aggregate dataset
  – γ_N: avg. regret of π_{1:N}
• For strongly convex loss, N = O(T² log(1/δ)) and m = 1 give, w.p. ≥ 1 − δ:
  J(π̂) ≤ T ε̂_N + O(1)
Experiments: 3D Racing Game
• Input: camera image, resized to 25×19 pixels (1425 features)
• Output: steering in [-1, 1]
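A small sketch of turning a camera frame into the feature vector; note 1425 = 25 × 19 × 3, which suggests the three RGB channels are kept (an inference, not stated on the slide):

```python
import numpy as np
from PIL import Image

def frame_to_features(frame):
    """Resize a camera frame to 25x19 and flatten it into a 1425-dimensional vector."""
    img = Image.fromarray(frame).resize((25, 19))         # (width, height) as on the slide
    return np.asarray(img, dtype=np.float32).reshape(-1)  # 19 * 25 * 3 = 1425 features
```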
DAgger Test-Time Execution
Average Falls/Lap
[Plot: average falls per lap; lower is better]
Experiments: Super Mario Bros
• From the Mario AI competition 2009
• Input: 27K+ binary features extracted from the last 4 observations (14 binary features for every cell)
• Output: Jump in {0,1}, Right in {0,1}, Left in {0,1}, Speed in {0,1}
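A hypothetical sketch of the feature construction; the 14 indicators per cell and the stacking over 4 observations are from the slide, but the grid size (22 × 22, which gives 22·22·14·4 ≈ 27K) and the specific indicators are assumptions:

```python
import numpy as np

GRID_H, GRID_W, CELL_FEATURES, N_FRAMES = 22, 22, 14, 4   # grid size assumed

def mario_features(last_frames):
    """last_frames: the 4 most recent observation grids, each a 0/1 array of shape
    (GRID_H, GRID_W, CELL_FEATURES); returns one flat binary feature vector."""
    assert len(last_frames) == N_FRAMES
    return np.concatenate([f.reshape(-1) for f in last_frames])  # ~27K binary features
```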
Test-Time Execution
Average Distance/Stage
[Plot: average distance travelled per stage; higher is better]
Conclusion
• Take-home message: simple iterative procedures can yield much better performance.
• Can also be applied to structured prediction:
  – NLP (e.g. handwriting recognition)
  – Computer vision [Ross et al., CVPR 2011]
• Future work:
  – Combining with other imitation learning techniques [Ratliff 06]
  – Potential extensions to reinforcement learning?
Questions?
Structured Prediction
• Example: scene labeling
[Diagram: image and the corresponding graph structure over pixel labels]
Structured Prediction
• Sequentially label each node using neighboring predictions
  – e.g. in breadth-first-search order (forward & backward passes)
[Diagram: graph unrolled into a sequence of classifications]
Structured Prediction
• Input to the classifier:
  – Local image features in the neighborhood of the pixel
  – Current neighboring pixels' labels
• Neighboring labels depend on the classifier itself
• DAgger finds a classifier that does well at predicting pixel labels given the neighbors' labels it itself generates during the labeling process (see the sketch below).
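A minimal sketch of that labeling pass, with hypothetical `local_features` and `classifier` helpers (the interfaces are assumptions, not the paper's code):

```python
def label_graph(nodes, neighbors, local_features, classifier, n_passes=2):
    """Label nodes sequentially, feeding each prediction the current labels of its
    neighbors; forward and backward passes let later predictions refine earlier ones."""
    labels = {v: None for v in nodes}                # None = not yet labeled
    order = list(nodes)                              # e.g. a breadth-first ordering
    for p in range(n_passes):
        sweep = order if p % 2 == 0 else list(reversed(order))
        for v in sweep:
            neighbor_labels = [labels[u] for u in neighbors[v]]
            labels[v] = classifier(local_features(v), neighbor_labels)
    return labels
```

At training time, DAgger would run this same pass with the current classifier and aggregate (input, true label) pairs from the inputs it actually generates.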
Experiments: Handwriting Recognition [Taskar 2003]
• Input: image of the current letter + the previously predicted letter
• Output: current letter in {a, b, ..., z}
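A small sketch of the per-letter input this implies; encoding the previous prediction as a one-hot vector is an assumption:

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def letter_input(pixel_vector, prev_predicted):
    """Features for one character: the letter image's pixels (a flat array) plus a
    one-hot encoding of the previously *predicted* letter (all zeros for the first
    letter of a word)."""
    prev = np.zeros(len(ALPHABET))
    if prev_predicted is not None:
        prev[ALPHABET.index(prev_predicted)] = 1.0
    return np.concatenate([pixel_vector, prev])
```

At test time `prev_predicted` is the classifier's own previous output, which is exactly the train/test mismatch DAgger's aggregation corrects.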
Character Accuracy on Test Folds
[Plot: character accuracy per test fold; higher is better]