Model Estimation Within Planning and Learning - Alborz Geramifard - PowerPoint PPT Presentation

  1. Model Estimation Within Planning and Learning. Alborz Geramifard, ICML Workshop, June 2011. agf@mit.edu

  2. Joint work: Joshua Redding, Joshua Joseph, Jonathan How.

  3. Motivating GridWorld figure (slides 3-6 are animated builds of the same figure): reward values +1, -1, and -0.01; 20% noise; conservative, aggressive, and optimal policies are illustrated.
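
As a rough illustration of the kind of domain sketched on these slides, here is a minimal noisy GridWorld in Python. The grid size, reward placement, and names (NoisyGridWorld, step) are assumptions made for illustration; they are not the experimental code behind the slides.

```python
import random

class NoisyGridWorld:
    """Minimal sketch of a GridWorld with action noise (layout is assumed)."""
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def __init__(self, width=10, height=7, goal=(0, 9), pit=(1, 9), noise=0.2):
        self.width, self.height = width, height
        self.goal, self.pit = goal, pit
        self.noise = noise                  # probability the intended action is replaced
        self.state = (height - 1, 0)        # assumed start cell

    def step(self, action):
        """Apply an action; with probability `noise` a random action is taken instead."""
        if random.random() < self.noise:
            action = random.choice(self.ACTIONS)
        r, c = self.state
        dr, dc = action
        self.state = (min(max(r + dr, 0), self.height - 1),
                      min(max(c + dc, 0), self.width - 1))
        if self.state == self.goal:
            return +1.0, True               # goal reward
        if self.state == self.pit:
            return -1.0, True               # failure reward
        return -0.01, False                 # small per-step reward
```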

  7. Big Picture. Planner: a fast, safe, sub-optimal solution. Model: an estimator of the true model using a parametric form. Learner: a reinforcement learning algorithm running online.
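
The three roles above can be pictured as minimal interfaces, sketched below. This only illustrates the division of responsibilities; the class names and method signatures are assumptions, not the authors' implementation.

```python
from abc import ABC, abstractmethod

class Model(ABC):
    """Parametric estimator of the true dynamics."""
    @abstractmethod
    def update(self, state, action, next_state, reward): ...
    @abstractmethod
    def sample(self, state, action): ...            # simulate one transition

class Planner(ABC):
    """Produces a fast, safe, possibly sub-optimal policy from the model."""
    @abstractmethod
    def plan(self, model): ...                      # returns a policy pi_p

class Learner(ABC):
    """Online reinforcement learning algorithm (e.g. Sarsa)."""
    @abstractmethod
    def act(self, state): ...
    @abstractmethod
    def update(self, state, action, reward, next_state): ...
```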

  8. Question: a framework to integrate planner, model, and learner. Goals: explore safely, reduce sample complexity, and reach the optimal solution asymptotically.

  9. Existing gap: overly restrictive [Heger 1994]; lack of analytical convergence [Geibel et al. 2005]; no safety guarantees [Abbeel et al. 2005]; requires the planner's value function [Knox et al. 2010].

  10. Contributions: extended our previous framework to support adaptive modeling; empirically verified the advantage of the new approach; discussed the limitation of our approach and provided two potential solutions.

  11. Approach (exploit and explore): the intelligent Cooperative Control Architecture (iCCA) couples planner and learner; the planner initializes the policy and regulates the exploration of the learner.

  12. Previous work [ACC 2011] (slides 12-22 are animated builds of one flow diagram): Offline, the planner and a static model produce a policy π_p. Online, a Type Rmax learner runs alongside it; at each step the system asks "Suggest action?". If the learner does not suggest one, the agent follows a ∼ π_p. If it suggests a ∼ π_l, the action goes through a "Safe action?" check: if safe it is executed, otherwise the agent falls back to π_p.
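
A minimal sketch of this suggest-then-safety-check loop, assuming a learner that only volunteers an action when it is confident (in the spirit of the Type Rmax test above); the function names and the safety predicate are placeholders, not the ACC 2011 code.

```python
def icca_action(state, learner, planner_policy, is_safe):
    """One action choice in the (static-model) iCCA loop.

    learner.suggest(state) is assumed to return an action only when the learner
    is confident enough (e.g. a Type Rmax knownness test), and None otherwise.
    """
    a_learner = learner.suggest(state)
    if a_learner is None:                  # "Suggest action?" -> No
        return planner_policy(state)       # act from the planner's policy pi_p
    if is_safe(state, a_learner):          # "Safe action?" -> Yes
        return a_learner                   # act from the learner's policy pi_l
    return planner_policy(state)           # unsafe suggestion: fall back to pi_p
```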

  23. New approach (Adaptive Model iCCA): the model is no longer static but is estimated online; (1) the planner and model are updated online, refreshing π_p, while (2) the Type Rmax learner keeps the same loop: "Suggest action?" No → a ∼ π_p; Yes → a ∼ π_l, followed by the "Safe action?" check with fallback to π_p.
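
A minimal sketch of the adaptive-model step, under the simplifying assumption that the parametric model is a single action-noise probability (as in the GridWorld experiment on the next slide); the estimator and the replanning rule below are illustrative, not the paper's exact algorithm.

```python
class NoiseModel:
    """Sketch: estimate the action-noise probability from observed transitions."""
    def __init__(self, initial_noise=0.4):
        self.disagreements, self.total = 0, 0
        self.initial_noise = initial_noise

    def update(self, intended_next_state, observed_next_state):
        # Count how often the observed outcome disagrees with the noise-free prediction.
        self.total += 1
        self.disagreements += int(observed_next_state != intended_next_state)

    @property
    def noise(self):
        if self.total == 0:
            return self.initial_noise
        return self.disagreements / self.total

def replan(model):
    """Pick the planner's policy from the current model estimate
    (the AM-iCCA rule on the next slide: aggressive once estimated noise <= 25%)."""
    return "aggressive" if model.noise <= 0.25 else "conservative"
```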

  24. Empirical results: 100 learning trials on the GridWorld; Sarsa with an ε-greedy policy. iCCA: noise = 40%, planner's policy = conservative. AM-iCCA: initial noise = 40%; if noise ≤ 25%, planner's policy = aggressive, else conservative.
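
For reference, a textbook tabular Sarsa learner with ε-greedy action selection looks roughly like the sketch below; the step size, ε, and discount are illustrative defaults, not the settings used in these trials.

```python
import random
from collections import defaultdict

# Tabular action values, defaulting to 0 for unseen (state, action) pairs.
Q = defaultdict(float)

def epsilon_greedy(state, actions, eps=0.1):
    """epsilon-greedy selection over Q (eps is an illustrative default)."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_update(s, a, r, s2, a2, alpha=0.1, gamma=1.0):
    """One Sarsa backup: Q(s,a) += alpha * (r + gamma*Q(s',a') - Q(s,a))."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
```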

  25. Empirical results (figure): return versus learning steps (0 to 10,000), comparing AM-iCCA, iCCA, Sarsa, and the fixed aggressive and conservative policies.

  26. Extensions: what if the parametric form of the model cannot represent the true model? Two potential solutions: when knownness is high, ignore safety checking; and estimate the value of planner policies by reflecting back on past data.
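
One way to read the first remedy is sketched below, assuming knownness is proxied by state-action visit counts (an Rmax-style measure); both the threshold and the bypass rule are assumptions used to illustrate the idea, not the authors' proposal.

```python
def safe_or_known(state, action, counts, is_safe, known_threshold=20):
    """Skip the model-based safety check once (state, action) is well known.
    `counts` maps (state, action) pairs to visit counts; the threshold is illustrative."""
    knownness = min(1.0, counts[(state, action)] / known_threshold)
    if knownness >= 1.0:
        return True                   # trust experience over a possibly misspecified model
    return is_safe(state, action)     # otherwise keep the model-based safety check
```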

  27. Contributions: extended our previous framework to support adaptive modeling; empirically verified the advantage of the new approach; discussed the limitation of our approach and provided two potential solutions.
