Understanding Human Teaching Modalities in Reinforcement Learning Environments: A Preliminary Report


  1. Understanding Human Teaching Modalities in Reinforcement Learning Environments: A Preliminary Report. Slides available on the Program page of the ALIHT website. W. Bradley Knox, Matthew E. Taylor, and Peter Stone.

  2. Knowledge! Desires!

  3. Current state of interactive learning evaluation: "Beats hand-coded!" "Nice demo!" "Better than RL!"

  4. 1st, 2nd, 3rd

  5. Reinforcement learning tasks
     • Learn from limited feedback
     • Delayed reward
     • Very general
     • Possibly slow learning
     • Human end-user cannot determine correct behavior
     [Diagram: the agent-environment loop; the agent sends an action to the environment, which returns a state and a reward.]
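As a minimal sketch (not from the original deck), here is the agent-environment loop from the diagram in Python, with tabular Q-learning standing in for the agent and a hypothetical chain environment standing in for tasks like Mountain Car, where reward is delayed until the goal is reached:

```python
import random

class ChainEnv:
    """Hypothetical toy task: walk right from state 0 to the goal.
    The reward is zero everywhere except the goal (delayed reward)."""
    def __init__(self, length=6):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # action: 0 = left, 1 = right
        move = 1 if action == 1 else -1
        self.state = max(0, min(self.length - 1, self.state + move))
        done = self.state == self.length - 1
        return self.state, (1.0 if done else 0.0), done

env = ChainEnv()
q = {(s, a): 0.0 for s in range(env.length) for a in (0, 1)}
alpha, gamma, epsilon = 0.5, 0.95, 0.2

for episode in range(200):
    s, done = env.reset(), False
    while not done:
        # Epsilon-greedy: the agent must explore because feedback is limited.
        a = random.choice((0, 1)) if random.random() < epsilon \
            else max((0, 1), key=lambda a: q[(s, a)])
        s2, r, done = env.step(a)
        # Bootstrapped update propagates the delayed goal reward backward.
        q[(s, a)] += alpha * (r + gamma * max(q[(s2, 0)], q[(s2, 1)]) - q[(s, a)])
        s = s2
```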

  6. Learning from demonstration (LfD)
     • Goal: reproduce the demonstrated behavior / policy
     • Generalizing effectively to unseen situations
     Argall, Chernova, Veloso, and Browning. A Survey of Robot Learning from Demonstration. RAS, 2009.
     [Images: example systems from Nicolescu & Matarić; Grollman & Jenkins; Argall, Browning & Veloso; Lockerd & Breazeal]
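One common way to frame LfD (one formulation among several surveyed by Argall et al., not necessarily the one used in this work) is supervised learning on the demonstrated (state, action) pairs. A sketch with a 1-nearest-neighbor policy and hypothetical demonstration data:

```python
import math

# Hypothetical demonstrations: (state features, demonstrated action),
# e.g. (position, velocity) pairs for a Mountain Car-like task.
demonstrations = [
    ((-0.8, -0.02), "left"),
    ((-0.4,  0.01), "right"),
    (( 0.1,  0.04), "right"),
]

def policy(state):
    """Generalize to unseen states: copy the action shown in the most
    similar demonstrated state (1-nearest-neighbor)."""
    _, action = min(demonstrations, key=lambda d: math.dist(d[0], state))
    return action

print(policy((-0.5, 0.0)))  # a state that was never demonstrated
```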

  7. Learning from feedback (interactive shaping): TAMER
     Key insight: the trainer evaluates behavior using a model of its long-term quality.
     Knox and Stone, K-CAP 2009.

  8. Learning from feedback (interactive shaping)
     TAMER: learn a model Ĥ of the human's reinforcement, then directly exploit the model to determine the action. If greedy: a = argmax_a Ĥ(s, a).
     Knox and Stone, K-CAP 2009.
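A hedged sketch of that idea in Python (simplified from Knox and Stone's TAMER; the credit assignment over feedback delay used in the full algorithm is omitted, and the features and learning rate below are assumptions):

```python
ACTIONS = ("left", "none", "right")

def features(state, action):
    """Hypothetical joint features: 2-d state features gated by the action."""
    phi = []
    for a in ACTIONS:
        phi.extend(state if a == action else (0.0,) * len(state))
    return phi

weights = [0.0] * (2 * len(ACTIONS))  # linear model of human reinforcement

def h_hat(state, action):
    """Predicted human reinforcement for taking `action` in `state`."""
    return sum(w * x for w, x in zip(weights, features(state, action)))

def act(state):
    # Greedy action selection: a = argmax_a H_hat(s, a). No discounting is
    # needed because the trainer's signal already judges long-term quality.
    return max(ACTIONS, key=lambda a: h_hat(state, a))

def update(state, action, feedback, lr=0.1):
    """SGD step toward the trainer's scalar feedback (e.g. +1 or -1)."""
    error = feedback - h_hat(state, action)
    for i, x in enumerate(features(state, action)):
        weights[i] += lr * error * x

s = (0.3, -0.05)
update(s, "left", +1.0)  # trainer rewards a "left" action
print(act(s))            # the model now prefers "left" in similar states
```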

  9. Learning from feedback (interactive shaping) Training:

  10. Learning from feedback (interactive shaping) After training:

  11. Learning from feedback (interactive shaping) Training:

  12. LfD and LfF vs. RL
     • Noisy
     • Limited by human ability
     • Requires the human's time
     • Faster learning
     • Empowers humans to define the task

  13. And out come the contendas!! Learning from Demonstration (LfD, "Just do as I do.") vs. Learning from Feedback (LfF, "Good robot!")

  14. An a priori comparison
     Demonstration more specifically points to the correct action.
     Interface:
     • The LfD interface may be familiar to video game players
     • The LfF interface is simpler and task-independent

  15. An a priori comparison
     Expression of the learned model during training: LfF? Yes. LfD? Generally no.
     • LfD: better initial training performance
     • LfF: can observe and address the model's weaknesses
     • LfF: training and testing performance match up better
     [Illustration painted with the MLDemos software]

  16. An a priori comparison
     Task expertise:
     • LfF: easier to judge than to control
     • Easier for the human to increase expertise while training with LfD
     Cognitive load: less for LfF.

  17. An a priori comparison
     General hypothesis: LfD generally performs better, but performance is situation-dependent.

  18. Pilot study

  19. Pilot study
     16 undergraduates; Cart Pole first, then Mountain Car
     • Practice and test rounds
     • Randomized: LfF or LfD first (unbalanced result: LfF was first for 87.5% of CP and 69% of MC)
     Keyboard interface (see the sketch below):
     • LfD: j, k, l
     • LfF: z, /
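An illustrative sketch of that keyboard interface. Which key maps to which action or feedback value is an assumption; the slides only list the keys (j, k, l for LfD; z and / for LfF):

```python
LFD_KEYS = {          # demonstration mode: keys select the agent's action
    "j": "left",
    "k": "none",      # assumed no-op
    "l": "right",
}

LFF_KEYS = {          # feedback mode: keys deliver scalar reinforcement
    "z": -1.0,        # assumed negative feedback
    "/": +1.0,        # assumed positive feedback
}

def handle_key(key, mode):
    """Route a keypress to a demonstrated action or a feedback value."""
    table = LFD_KEYS if mode == "LfD" else LFF_KEYS
    return table.get(key)  # None for unmapped keys

print(handle_key("j", "LfD"), handle_key("/", "LfF"))  # left 1.0
```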

  20. Pilot study: main result
     [Four plots of mean reward per episode vs. episode comparing LfD and LfF: online performance (CP), online performance (MC), policy performance (CP), policy performance (MC).]

  21. Pilot study: interaction effects
     [Plots of mean reward per episode vs. teaching order for LfD and LfF: effects of teaching order (MC) and effects of teaching order (CP).]

  22. Pilot study: interaction effects
     [Plots of mean reward per episode for LfD and LfF under the original vs. revised instruction sets: effect of instruction change (MC) and effects of instruction change (CP).]
     The revision added a verbal instruction to give frequent feedback for LfF.

  23. Pilot study: interaction effects
     [Plot of mean reward per episode vs. episode: effect of experimental setup on online LfD (MC), comparing LfD (current data), LfF (current data), and LfD (past data).]
     The previous experiment differed:
     • more subject preparation
     • announced high scores in progress
     • ...

  24. Pilot study: online vs. offline performance
     [Plots of mean reward per episode vs. episode for LfD and LfF: policy performance minus online performance, for MC and CP.]

  25. Pilot study: tentative takeaways from performance comparisons
     LfD was better in our experiments, but both were sensitive to the experimental setup.

  26. Pilot study: tentative takeaways from performance comparisons
     Subjects need more preparation for LfF.
     • With zero task expertise, LfD still allows learning on the job
     • LfF vs. LfD interfaces

  27. Pilot study: tentative takeaways from performance comparisons
     LfD's offline, learned performance is generally worse than its training samples. LfF's offline, learned performance is generally as good as or better than during training.

  28. To conclude,
     Results:
     • LfD was better.
     • But performance was situational.
     • LfF needed more subject preparation.
     • LfF models compared better to training performance.

  29. To conclude,
     Near future work:
     • More subjects
     • More balanced conditions
     • More interesting manipulations (e.g., model representation and control interface quality)
     • Aim for crossover interactions [sketch: performance vs. condition (A, B)]
     • Learn from both LfD and LfF!

  30. To conclude, 1st, 2nd, 3rd

