Understanding Human Teaching Modalities in Reinforcement Learning Environments: A Preliminary Report
W. Bradley Knox, Matthew E. Taylor, and Peter Stone
Knowledge! Desires!
Current state of interactive learning evaluation: "Beats hand-coded!" "Nice demo!" "Better than RL!"
[Graphic: podium with 1st, 2nd, and 3rd places]
Reinforcement learning tasks
• Learn from limited feedback
• Delayed reward
• Very general
• Possibly slow learning
• Human end-user cannot determine correct behavior
[Diagram: agent-environment loop; the agent sends an action, the environment returns state and reward]
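To make the loop in this slide concrete, here is a minimal sketch of the agent-environment interaction. It assumes a hypothetical agent with choose_action()/update() methods and a Gym-style env whose step() returns (next_state, reward, done); these names and interfaces are illustrative, not from the talk.

```python
# Minimal agent-environment interaction loop (hypothetical Agent/Env interfaces).
def run_episode(agent, env, max_steps=200):
    """Run one episode; the agent learns only from the scalar, delayed reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.choose_action(state)           # act on the current state
        next_state, reward, done = env.step(action)   # environment responds
        agent.update(state, action, reward, next_state)  # learn from limited feedback
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```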
Learning from demonstration (LfD)
• Goal: reproduce the demonstrated behavior / policy
• ...generalizing effectively to unseen situations
Argall, Chernova, Veloso, and Browning. A Survey of Robot Learning from Demonstration. RAS, 2009.
[Images: example LfD systems by Nicolescu & Matarić; Grollman & Jenkins; Argall, Browning & Veloso; and Lockerd & Breazeal]
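As an illustration of the LfD setting (not the method of any of the cited systems), a minimal behavioral-cloning sketch: demonstrated (state, action) pairs train a supervised policy that generalizes to unseen states. The toy data and the nearest-neighbor learner are assumptions made for the example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Demonstrations: each entry pairs an observed state with the teacher's action.
demo_states = np.array([[0.1, 0.0], [0.5, -0.2], [0.9, 0.3]])  # toy 2-D states
demo_actions = np.array([0, 1, 1])                             # teacher's discrete actions

# Fit a supervised policy directly on the demonstration pairs.
policy = KNeighborsClassifier(n_neighbors=1)
policy.fit(demo_states, demo_actions)

# Generalize to an unseen state by imitating the nearest demonstrated behavior.
unseen_state = np.array([[0.6, 0.0]])
print(policy.predict(unseen_state))  # -> [1]
```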
Learning from feedback (interactive shaping): TAMER
Key insight: the trainer evaluates behavior using a model of its long-term quality.
Knox and Stone, K-CAP 2009
Learning from feedback (interactive shaping): TAMER
• Learn a model Ĥ of human reinforcement
• Directly exploit the model to determine each action
• If greedy: a = argmax_a Ĥ(s, a)
Knox and Stone, K-CAP 2009
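A minimal sketch of the TAMER loop on this slide, under simplifying assumptions: a linear model per action over hand-coded features, with each feedback signal credited to the most recent step (the published algorithm also handles delayed feedback).

```python
import numpy as np

class TamerAgent:
    """Learns a model H_hat(s, a) of human reinforcement and acts greedily on it."""

    def __init__(self, n_features, actions, lr=0.1):
        self.actions = actions
        self.lr = lr
        # One weight vector per action: H_hat(s, a) = w_a . phi(s)
        self.w = {a: np.zeros(n_features) for a in actions}

    def choose_action(self, features):
        # Greedy: a = argmax_a H_hat(s, a). The model is exploited directly,
        # since human reward is taken to already encode long-term quality.
        return max(self.actions, key=lambda a: self.w[a] @ features)

    def update(self, features, action, human_reward):
        # Incremental regression of H_hat toward the trainer's scalar signal.
        error = human_reward - self.w[action] @ features
        self.w[action] += self.lr * error * features
```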
Learning from feedback (interactive shaping)
[Demo videos: agent behavior during training and after training]
LfD and LfF vs. RL
Costs: • Noisy • Limited by human ability • Requires the human's time
Benefits: • Faster learning • Empowers humans to define the task
And out come the contendas!!
Learning from Demonstration (LfD): "Just do as I do."
vs.
Learning from Feedback (LfF): "Good robot!"
An a priori comparison
Demonstration more specifically points to the correct action.
Interface:
• LfD interface may be familiar to video game players
• LfF interface is simpler and task-independent
An a priori comparison
Expression of the learned model during training: LfF, yes; LfD, generally no.
• LfD: better initial training performance
• LfF: trainer can observe and address the model's weaknesses
• LfF: training and testing performance match up better
[Illustration painted with the MLDemos software]
An a priori comparison
Task expertise:
• LfF: easier to judge behavior than to control it
• LfD: easier for the human to increase expertise while training
Cognitive load: less for LfF
An a priori comparison
General hypothesis: LfD generally performs better, but performance is situation-dependent.
Pilot study
Pilot study
• 16 undergraduates
• Cart Pole first, then Mountain Car
• Practice and test rounds
• Randomized: LfF or LfD first (unbalanced result: LfF was first for 87.5% of Cart Pole and 69% of Mountain Car subjects)
• Keyboard interface: LfD uses j, k, l; LfF uses z, /
Pilot study: main result
[Figure: four panels of mean reward per episode vs. episode, comparing LfD and LfF; online performance and policy performance, each for Cart Pole and Mountain Car]
Pilot study: interaction effects
[Figure: effect of teaching order on mean reward per episode for LfD and LfF, Mountain Car and Cart Pole]
Pilot study: interaction effects
[Figure: effect of an instructions change (original vs. revised set) on mean reward per episode for LfD and LfF, Mountain Car and Cart Pole]
Added a verbal instruction to give frequent feedback for LfF.
Pilot study: interaction effects
[Figure: effect of experimental setup on online LfD performance, Mountain Car; current data vs. a previous experiment]
The previous experiment differed:
• more subject preparation
• announced high scores in progress
• ...
Pilot study: online vs. offline performance
[Figure: policy performance minus online performance (mean reward per episode vs. episode) for LfD and LfF, Mountain Car and Cart Pole]
Pilot study: tentative takeaways from performance comparisons
LfD was better in our experiments, but both modalities were sensitive to the experimental setup.
Pilot study: tentative takeaways from performance comparisons
Subjects need more preparation for LfF.
• With zero task expertise, LfD still allows learning on the job
• LfF vs. LfD interfaces
Pilot study: tentative takeaways from performance comparisons
LfD's offline, learned performance is generally worse than its training samples.
LfF's offline, learned performance is generally as good as or better than during training.
To conclude: Results
• LfD was better.
• But performance was situational.
• LfF needed more subject preparation.
• LfF models compared better to training performance.
To conclude: Near future work
• More subjects
• More balanced conditions
• More interesting manipulations (e.g., model representation and control interface quality)
• Aim for crossover interactions
• Learn from both LfD and LfF!
[Illustration: hypothetical crossover interaction, performance vs. conditions A and B]