  1. Reduction of Imitation Learning to No-Regret Online Learning Stephane Ross Joint work with Drew Bagnell & Geoff Gordon

  2. Imitation Learning. Diagram: the Expert provides Demonstrations to a Machine Learning Algorithm, which outputs a Policy.

  3. Imitation Learning • Many successes: – Legged locomotion [Ratliff 06] – Outdoor navigation [Silver 08] – Helicopter flight [Abbeel 07] – Car driving [Pomerleau 89] – etc.

  4. Example Scenario: learning to drive from demonstrations. Input: camera image. Output: policy mapping images to steering in [-1,1] (hard left turn to hard right turn).

  5. Supervised Training Procedure. Dataset: expert trajectories. Learned policy: $\hat{\pi}_{\sup} = \arg\min_{\pi \in \Pi} E_{s \sim D(\pi^*)}[\ell(\pi, s, \pi^*(s))]$
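
A minimal sketch of this supervised (behavioral cloning) procedure, assuming a generic learner with a fit/predict interface and expert trajectories given as (state, expert action) pairs; all names are placeholders, not code from the talk:

```python
import numpy as np

# Fit a policy to state-action pairs from the expert's own trajectories, i.e.
# minimize the surrogate loss under D(pi*), as on slide 5. The learner is any
# classifier/regressor exposing fit/predict (a placeholder interface).
def train_supervised(expert_trajectories, learner):
    states = np.array([s for traj in expert_trajectories for (s, a) in traj])
    actions = np.array([a for traj in expert_trajectories for (s, a) in traj])
    learner.fit(states, actions)
    return learner.predict          # the learned policy pi_sup: state -> action
```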

  6. Poor Performance in Practice

  7. # Mistakes Grows Quadratically in T! [Ross 2010]: $J(\hat{\pi}_{\sup}) \leq T^2 \epsilon$, where $J$ is the expected # of mistakes over $T$ steps, $T$ the # of time steps, and $\epsilon$ the avg. loss on $D(\pi^*)$. Reason: the policy doesn't learn how to recover from its own errors!
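
The intuition behind the quadratic bound can be sketched informally (a rough reconstruction of the argument, not the proof from [Ross 2010]):

```latex
% Under D(pi*), the learned policy errs with probability at most epsilon per step.
% After its first error it may reach states the expert never demonstrated and keep
% erring for the rest of the horizon, so a first error at step t can cost up to
% T - t + 1 mistakes. Summing over the T possible first-error steps:
\[
  J(\hat{\pi}_{\sup}) \;\lesssim\; \sum_{t=1}^{T} \epsilon\,(T - t + 1)
  \;=\; \epsilon\,\frac{T(T+1)}{2} \;\le\; T^2 \epsilon .
\]
```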

  8. Reduction-Based Approach & Analysis. Diagram: a hard learning problem with performance $f(\epsilon)$ is reduced to one or more easier related problems solved with performance $\epsilon$. Example: reduction of cost-sensitive multiclass classification to binary classification [Beygelzimer 2005].

  9. Previous Work: Forward Training [Ross 2010] • Sequentially learn one policy $\pi_t$ per time step • # mistakes grows linearly: $J(\pi_{1:T}) \leq T\epsilon$ • Impractical if $T$ is large
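
A minimal sketch of what Forward Training does, assuming a placeholder environment (reset/step), expert, and learner factory; none of these interfaces come from the slides:

```python
# Learn one policy per time step, training pi_t on the states reached at step t
# by executing the already-learned pi_1, ..., pi_{t-1}, so each policy's training
# distribution matches what it sees at test time.
def forward_training(env, expert, make_learner, T, n_rollouts=50):
    policies = []
    for t in range(T):
        states, actions = [], []
        for _ in range(n_rollouts):
            s = env.reset()
            for i in range(t):                        # roll in with pi_1 .. pi_{t-1}
                s, _ = env.step(policies[i].predict(s))
            states.append(s)
            actions.append(expert(s))                 # expert labels the step-t state
        learner = make_learner()
        learner.fit(states, actions)
        policies.append(learner)                      # pi_t is only used at step t
    return policies
```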

  10. Previous Work: SMILe [Ross 2010] • Learn a stochastic policy, changing the policy slowly – $\pi_n = \pi_{n-1} + \alpha_n(\pi'_n - \pi^*)$ – $\pi'_n$ trained to mimic $\pi^*$ under $D(\pi_{n-1})$ – Similar to SEARN [Daume 2009] • Near-linear bound: $J(\pi) \leq O(T\log(T)\,\epsilon + 1)$ • Stochasticity is undesirable
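
Read as a mixture over policies, the update shifts probability mass from the expert onto the newly trained policy; a minimal sketch of such a stochastic mixture (placeholder names, not code from the talk):

```python
import random

# SMILe-style stochastic mixture: pi_n = pi_{n-1} + alpha_n (pi'_n - pi*) moves
# alpha_n of the probability mass currently on the expert onto the newly trained
# policy pi'_n. Executing pi_n samples one component policy per query.
class MixturePolicy:
    def __init__(self, expert):
        self.components = [expert]   # pi_0 = pi*
        self.weights = [1.0]

    def update(self, new_policy, alpha_n):
        self.weights[0] -= alpha_n            # take alpha_n of mass off the expert
        self.components.append(new_policy)    # ...and put it on pi'_n
        self.weights.append(alpha_n)

    def act(self, state):
        pi = random.choices(self.components, weights=self.weights)[0]
        return pi(state)
```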

  11. DAgger: Dataset Aggregation • Collect trajectories with expert π*

  12. DAgger: Dataset Aggregation • Collect trajectories with expert π* • Dataset D_0 = {(s, π*(s))}

  13. DAgger: Dataset Aggregation • Collect trajectories with expert π* • Dataset D_0 = {(s, π*(s))} • Train π_1 on D_0

  14. DAgger: Dataset Aggregation • Collect new trajectories with π_1

  15. DAgger: Dataset Aggregation • Collect new trajectories with π_1 • New dataset D_1' = {(s, π*(s))}

  16. DAgger: Dataset Aggregation • Collect new trajectories with π_1 • New dataset D_1' = {(s, π*(s))} • Aggregate datasets: D_1 = D_0 ∪ D_1'

  17. DAgger: Dataset Aggregation • Collect new trajectories with π_1 • New dataset D_1' = {(s, π*(s))} • Aggregate datasets: D_1 = D_0 ∪ D_1' • Train π_2 on D_1

  18. DAgger: Dataset Aggregation • Collect new trajectories with π_2 • New dataset D_2' = {(s, π*(s))} • Aggregate datasets: D_2 = D_1 ∪ D_2' • Train π_3 on D_2

  19. DAgger: Dataset Aggregation • Collect new trajectories with π_n • New dataset D_n' = {(s, π*(s))} • Aggregate datasets: D_n = D_{n-1} ∪ D_n' • Train π_{n+1} on D_n
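
Putting slides 11-19 together, a minimal sketch of the full DAgger loop; the environment, expert, and base learner interfaces below are assumptions made for illustration, not APIs from the talk:

```python
import copy

# Collect states under the current policy, label them with the expert, aggregate
# all data, and retrain; repeat for n_iters iterations.
def dagger(env, expert, base_learner, n_iters=20, horizon=100):
    D_states, D_actions = [], []   # aggregate dataset D_n
    policies = []                  # pi_1, ..., pi_N
    for n in range(n_iters):
        # Roll out the expert on the first iteration, the latest policy afterwards.
        act = expert if n == 0 else policies[-1].predict
        s = env.reset()
        for _ in range(horizon):
            D_states.append(s)
            D_actions.append(expert(s))   # states visited by pi_n, labels from the expert
            s, done = env.step(act(s))
            if done:
                break
        # Train pi_{n+1} on the aggregate dataset D_n = D_{n-1} U D_n'.
        learner = copy.deepcopy(base_learner)
        learner.fit(D_states, D_actions)
        policies.append(learner)
    # In practice, the best policy in the sequence is then chosen on validation data.
    return policies
```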

  20. Online Learning. Diagram: a Learner repeatedly plays a game against an Adversary.

  21. Online Learning. Diagram: round 1, the adversary reveals labelled examples (+ / -) and the learner picks a hypothesis.

  22. Online Learning. Diagram: round 2, more labelled examples arrive and the learner updates its hypothesis.

  23. Online Learning. Diagram: round 3 of the same game.

  24. Online Learning. Diagram: round 4 of the same game.

  25. Online Learning. Avg. regret after $n$ rounds: $\gamma_n = \frac{1}{n}\left(\sum_{i=1}^{n} L_i(h_i) - \min_{h \in H} \sum_{i=1}^{n} L_i(h)\right)$. An online algorithm is no-regret if $\gamma_n \to 0$ as $n \to \infty$.

  26. DAgger as Online Learning. The adversary at iteration $n$ picks the loss $L_n(\pi) = E_{s \sim D(\pi_n)}[\ell(\pi, s, \pi^*(s))]$.

  27. DAgger as Online Learning. The learner picks $\pi_{n+1} = \arg\min_{\pi \in \Pi} \sum_{i=1}^{n} L_i(\pi)$, with $L_n(\pi) = E_{s \sim D(\pi_n)}[\ell(\pi, s, \pi^*(s))]$.

  28. DAgger as Online Learning. The learner picks $\pi_{n+1} = \arg\min_{\pi \in \Pi} \sum_{i=1}^{n} L_i(\pi)$, with $L_n(\pi) = E_{s \sim D(\pi_n)}[\ell(\pi, s, \pi^*(s))]$: this is Follow-The-Leader (FTL).
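
For concreteness, a minimal sketch of the Follow-The-Leader view, shown over a finite candidate policy set purely for illustration (names are placeholders, not from the talk):

```python
# Follow-The-Leader: at iteration n, play the hypothesis that minimizes the sum
# of all losses revealed so far. In DAgger, L_i is approximated by the empirical
# surrogate loss on the data collected under pi_i, so FTL is exactly the
# "train pi_{n+1} on the aggregate dataset D_n" step in the loop above.
def follow_the_leader(candidate_policies, losses_so_far):
    def cumulative_loss(pi):
        return sum(L(pi) for L in losses_so_far)
    return min(candidate_policies, key=cumulative_loss)
```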

  29. Theoretical Guarantees of DAgger • The best policy $\hat{\pi}$ in the sequence $\pi_{1:N}$ guarantees: $J(\hat{\pi}) \leq T(\epsilon_N + \gamma_N) + O(T/N)$, where $N$ is the # of DAgger iterations, $\epsilon_N$ the avg. loss on the aggregate dataset, and $\gamma_N$ the avg. regret of $\pi_{1:N}$.

  30. Theoretical Guarantees of DAgger • The best policy $\hat{\pi}$ in the sequence $\pi_{1:N}$ guarantees: $J(\hat{\pi}) \leq T(\epsilon_N + \gamma_N) + O(T/N)$ • For strongly convex losses, after $N = O(T \log T)$ iterations: $J(\hat{\pi}) \leq T\epsilon_N + O(1)$

  31. Theoretical Guarantees of DAgger • The best policy $\hat{\pi}$ in the sequence $\pi_{1:N}$ guarantees: $J(\hat{\pi}) \leq T(\epsilon_N + \gamma_N) + O(T/N)$ • For strongly convex losses, after $N = O(T \log T)$ iterations: $J(\hat{\pi}) \leq T\epsilon_N + O(1)$ • Any no-regret online algorithm has the same guarantees

  32. Theoretical Guarantees of DAgger • If we sample $m$ trajectories at each iteration, then w.p. at least $1-\delta$: $J(\hat{\pi}) \leq T(\hat{\epsilon}_N + \gamma_N) + O\!\left(T\sqrt{\log(1/\delta)/(Nm)}\right)$, where $\hat{\epsilon}_N$ is the empirical avg. loss on the aggregate dataset and $\gamma_N$ the avg. regret of $\pi_{1:N}$.

  33. Theoretical Guarantees of DAgger • If we sample $m$ trajectories at each iteration, then w.p. at least $1-\delta$: $J(\hat{\pi}) \leq T(\hat{\epsilon}_N + \gamma_N) + O\!\left(T\sqrt{\log(1/\delta)/(Nm)}\right)$ • For strongly convex losses, with $N = O(T^2 \log(1/\delta))$ and $m = 1$, w.p. at least $1-\delta$: $J(\hat{\pi}) \leq T\hat{\epsilon}_N + O(1)$

  34. Experiments: 3D Racing Game. Input: game image resized to 25x19 pixels (1425 features). Output: steering in [-1,1].

  35. DAgger Test-Time Execution

  36. Average Falls/Lap: results chart (lower is better).

  37. Experiments: Super Mario Bros (from the Mario AI competition 2009). Input: 27K+ binary features extracted from the last 4 observations (14 binary features for every cell). Output: Jump in {0,1}, Right in {0,1}, Left in {0,1}, Speed in {0,1}.

  38. Test-Time Execution

  39. Average Distance/Stage: results chart (higher is better).

  40. Conclusion • Take-home message: simple iterative procedures can yield much better performance. • Can also be applied to Structured Prediction: – NLP (e.g., handwriting recognition) – Computer vision [Ross et al., CVPR 2011] • Future work: – Combining with other imitation learning techniques [Ratliff 06] – Potential extensions to reinforcement learning?

  41. Questions

  42. Structured Prediction • Example: scene labeling. Diagram: an image and the corresponding graph structure over its labels.

  43. Structured Prediction • Sequentially label each node using the neighboring predictions – e.g., in breadth-first-search order (forward & backward passes). Diagram: the graph unrolled into a sequence of classifications.

  44. Structured Prediction • Input to the classifier: – local image features in the neighborhood of the pixel – the current neighboring pixels' labels • Neighboring labels depend on the classifier itself • DAgger finds a classifier that does well at predicting pixel labels given the neighbors' labels it itself generates during the labeling process (see the sketch below).
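
A minimal sketch of this labeling process and of how DAgger-style training examples could be collected for it; all names and interfaces below are illustrative placeholders, not code from the talk:

```python
# Label the nodes of a graph sequentially (e.g., in BFS order): each prediction
# is fed the node's local features plus the labels already predicted for its
# neighbors, as described on slides 43-44.
def label_graph(classifier, nodes_in_bfs_order, local_features, neighbors):
    labels = {}
    for v in nodes_in_bfs_order:
        neighbor_labels = [labels.get(u) for u in neighbors[v]]  # None if not yet labeled
        labels[v] = classifier.predict(local_features[v], neighbor_labels)
    return labels

# DAgger-style data collection: the neighbor labels in each training input come
# from the classifier's own labeling pass, while the target is the ground truth.
def collect_dagger_examples(classifier, nodes_in_bfs_order, local_features,
                            neighbors, true_labels):
    examples, labels = [], {}
    for v in nodes_in_bfs_order:
        neighbor_labels = [labels.get(u) for u in neighbors[v]]
        examples.append(((local_features[v], neighbor_labels), true_labels[v]))
        labels[v] = classifier.predict(local_features[v], neighbor_labels)
    return examples
```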

  45. Experiments: Handwriting Recognition [Taskar 2003]. Input: image of the current letter and the previous predicted letter. Output: current letter in {a, b, ..., z}.

  46. Character Accuracy on Test Folds: results chart (higher is better).
