Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations



  1. Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations. Daniel Brown*, Wonjoon Goo*, Prabhat Nagarajan, and Scott Niekum.

  2–5. Inverse Reinforcement Learning vs. IRL via Ranked Demonstrations. Current approaches to IRL: 1. Can't do better than the demonstrator. 2. Are hard to scale to complex problems. IRL via ranked demonstrations instead finds a reward function that explains the ranking, allowing for extrapolation beyond the demonstrator; inverse reinforcement learning becomes standard binary classification.
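
  The "binary classification" view can be written down directly: for a pair of trajectories with τ_i ranked below τ_j, the summed predicted rewards act as logits of a classifier that predicts which trajectory is preferred. In the paper this yields a cross-entropy (Bradley–Terry style) ranking loss over the learned reward r̂_θ:

  ```latex
  \mathcal{L}(\theta)
    = -\sum_{\tau_i \prec \tau_j}
      \log \frac{\exp \sum_{s \in \tau_j} \hat{r}_\theta(s)}
                {\exp \sum_{s \in \tau_i} \hat{r}_\theta(s)
               + \exp \sum_{s \in \tau_j} \hat{r}_\theta(s)}
  ```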

  6–16. Trajectory-ranked Reward Extrapolation (T-REX). Given ranked demonstrations, how do we train the reward function? [Figure-only slides illustrating the T-REX pipeline; a sketch of the training step follows.]
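
  A minimal PyTorch sketch of one answer, assuming states are flat feature vectors; `RewardNet` and `trex_loss` are illustrative names, and the small MLP is a stand-in rather than the paper's exact architecture:

  ```python
  import torch
  import torch.nn as nn

  class RewardNet(nn.Module):
      """Illustrative reward network: maps each state to a scalar reward."""
      def __init__(self, state_dim):
          super().__init__()
          self.net = nn.Sequential(
              nn.Linear(state_dim, 64), nn.ReLU(),
              nn.Linear(64, 1),
          )

      def forward(self, states):      # states: (T, state_dim)
          return self.net(states)     # per-state rewards: (T, 1)

  def trex_loss(reward_net, traj_worse, traj_better):
      """Ranking loss on one pair; traj_better is ranked above traj_worse."""
      # The summed predicted return of each trajectory acts as a logit.
      logits = torch.stack([
          reward_net(traj_worse).sum(),
          reward_net(traj_better).sum(),
      ]).unsqueeze(0)                 # shape (1, 2)
      # Label 1: the classifier should prefer the higher-ranked trajectory.
      return nn.functional.cross_entropy(logits, torch.tensor([1]))
  ```

  Training then just loops over sampled pairs with a standard optimizer such as Adam; once r̂_θ is learned, the paper optimizes a policy against it with an off-the-shelf RL algorithm (PPO).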

  17. Trajectory-ranked Reward Extrapolation (T-REX). We subsample trajectories to create a large dataset of weakly labeled pairs! (See the sketch below.)
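
  A sketch of that subsampling step, assuming the demonstrations are given worst-to-best and that any snippet from a higher-ranked demonstration is weakly labeled as preferred over any snippet from a lower-ranked one (`make_pairs`, the snippet length, and the pair count are illustrative):

  ```python
  import random

  def make_pairs(ranked_trajs, num_pairs=5000, snippet_len=50):
      """Build weakly labeled snippet pairs from ranked demonstrations.

      ranked_trajs: trajectories (sequences of states) sorted worst to best,
      each at least snippet_len long. Returns (worse, better) snippet pairs.
      """
      pairs = []
      for _ in range(num_pairs):
          # Two distinct demonstrations; i < j means trajectory j is ranked higher.
          i, j = sorted(random.sample(range(len(ranked_trajs)), 2))
          worse, better = ranked_trajs[i], ranked_trajs[j]
          # A random contiguous snippet from each trajectory.
          ws = random.randint(0, len(worse) - snippet_len)
          bs = random.randint(0, len(better) - snippet_len)
          pairs.append((worse[ws:ws + snippet_len],
                        better[bs:bs + snippet_len]))
      return pairs
  ```

  The paper additionally constrains where snippets may start (so the preferred snippet is not drawn from an earlier portion of its episode than the other); that detail is omitted here for brevity.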

  18–20. Trajectory-ranked Reward Extrapolation (T-REX)
  • Simple:
    • IRL as binary classification.
    • No human supervision during policy learning.
    • No inner-loop MDP solver.
    • No inference-time data collection (e.g., GAIL).
    • No action labels required.
  • Scales to high-dimensional tasks (e.g., Atari games).
  • Can produce policies much better than the demonstrator.

  21. T-REX Policy Performance

  22. T-REX on HalfCheetah: best demo (88.97) vs. T-REX (143.40).

  23. Reward Extrapolation: T-REX can extrapolate beyond the performance of the best demo on HalfCheetah, Hopper, and Ant.

  24. Results: Atari Games. T-REX outperforms the best demonstration on 7 out of 8 games!

  25. T-REX on Enduro: best demo (84) vs. T-REX (520).

  26–27. Come see our poster @ Pacific Ballroom #47: human demos and ranking labels, robustness to noisy ranking labels, automatic ranking by watching a learner improve at a task, and reward function visualization.
