Modular multitask reinforcement learning with policy sketches

  1. Modular multitask reinforcement learning with policy sketches. Jacob Andreas, Sergey Levine and Dan Klein

  2. The learning problem: make planks

  3. The learning problem: make planks, make sticks

  4. Learning from sketches: get wood, use saw; get wood, use axe

  5. The options framework

  6. The options framework: +1

  7. The options framework: +1

  8. The options framework [Sutton et al. 99, Bacon & Precup 16]

  9. Learning from intermediate rewards: r, r [Kearns & Singh 02, Kulkarni et al. 16]

  10. Learning from demonstrations: $\pi$ [Stolle & Precup 02, Fox & Krishnan et al. 16]

  11. Learning from policy sketches: get wood, use saw; $\pi$

  12. Why sketches? Easy to collect. Portable. Crafting environment: make plank (get wood, use toolshed); make stick (get wood, use workbench); make cloth (get grass, use factory); make rope (get grass, use toolshed); make bridge (get iron, get wood); make bed ∗ (get wood, use toolshed); make axe ∗ (get wood, use workbench); make shears (get wood, use workbench); get gold (get iron, get wood); get gem (get wood, use workbench)

  13. Learning from policy sketches

  14. Learning from policy sketches. make planks: get wood, use saw

  15. Learning from policy sketches. make sticks: get wood, use axe

  16. Learning from policy sketches: get wood $\pi_a$ use saw; get wood $\pi_b$ use axe [e.g. Branavan et al. 09, Oh et al. 17, Hermann et al. 17]

  17. Learning from policy sketches: get wood, use saw; get wood, use axe

  18. get wood, use saw: $\pi_1$, $\pi_2$; get wood, use axe: $\pi_1$, $\pi_3$

  19. get wood, use saw: $\pi_1$, $\pi_2$; get wood, use axe: $\pi_1$, $\pi_3$

  20. get wood: $\pi_1$

  21. Policy representation: $\pi_1$ (get wood)

  22. Policy representation: ??? $\pi_1$ (get wood)

  23. Policy representation

  24. Policy representation

  25. Policy representation

  26. Policy representation: action probabilities; $\pi_1$ (get wood)

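  27. Policy search: $\sum_{\text{tasks}} \sum_{\text{steps}} \nabla \log \pi(a \mid s)\,(r_t - b)$, with action $a$, state $s$, reward $r_t$ and baseline $b$

  28. Policy search: $\sum_{\text{tasks}} \sum_{\text{steps}} \nabla \log \pi(a \mid s)\,(r_t - b)$ (get wood)

  29. Policy search: $\sum_{\text{tasks}} \sum_{\text{steps}} \nabla \log \pi(a \mid s)\,(r_t - b)$ (use axe)

  30. Policy search: $\sum_{\text{tasks}} \sum_{\text{steps}} \nabla \log \pi(a \mid s)\,(r_t - b)$. Reward .40. Label: SUBPOLICY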

  31. Improving policy search

  32. Improving policy search: $\sum_{\text{tasks}} \sum_{\text{steps}} \nabla \log \pi(a \mid s)\,(r_t - b)$, with action $a$, state $s$, reward $r_t$ and baseline $b$

  33. Improving policy search: a grid of terms $\nabla \log \pi(a \mid s)\,(r_t - b)$, one for each subpolicy (use saw, use axe, get wood, get iron) and each task (make planks, make nails)

  34. Improving policy search: $\sum_{\text{tasks}} \sum_{\text{steps}} \nabla \log \pi(a \mid s)\,(r_t - b)$. Values shown: .89, Reward .40. Labels: SUBPOLICY, TASK

  35. Do sketches help?

  36. The maze navigation task

  37. The maze navigation task

  38. The maze navigation task. [Plot: reward vs. episodes (0 to 3 × 10^6); curves: Sketches (modular), Sketches (joint), Unsupervised]

  39. The mini-craft task

  40. The mini-craft task

  41. The mini-craft task. [Plot: reward vs. episodes (0 to 3 × 10^6); curves: Sketches (modular), Sketches (joint), Unsupervised]

  42. The cliff-walking task

  43. The cliff-walking task. [Plot: log reward vs. timesteps (0 to 3 × 10^8); curves: Sketches (modular), Sketches (joint), Unsupervised]

  44. Zero-shot generalization: What if I see a sketch I’ve never seen before? get iron, use axe

  45. Zero-shot generalization: What if I see a sketch I’ve never seen before? [Bar chart, axis 0 to 100; series: Joint, Modular; groups: Multitask, Zero-shot; values shown: 89, 77, 49, 1]

  46. Zero-shot generalization: What if I see a sketch I’ve never seen before? [Bar chart, axis 0 to 100; series: Joint, Modular; groups: Multitask, Zero-shot; values shown: 89, 77, 49, 1]

  47. Fast adaptation: What if I don’t get a sketch at test time? ???

  48. Fast adaptation: What if I don’t get a sketch at test time? [Bar chart, axis 0 to 100; series: Unsupervised, Sketches; groups: Multitask, Adaptation; values shown: 89, 77, 47, 1]

  49. Fast adaptation: What if I don’t get a sketch at test time? [Bar chart, axis 0 to 100; series: Unsupervised, Sketches; groups: Multitask, Adaptation; values shown: 89, 76, 47, 42]

  50. Conclusions

  51. A tiny bit of data goes a long way. Crafting environment: make plank (get wood, use toolshed); make stick (get wood, use workbench); make cloth (get grass, use factory); make rope (get grass, use toolshed); make bridge (get iron, get wood, use factory); make bed ∗ (get wood, use toolshed, get grass, use workbench); make axe ∗ (get wood, use workbench, get iron, use toolshed); make shears (get wood, use workbench, get iron, use workbench); get gold (get iron, get wood, use factory, use bridge); get gem (get wood, use workbench, get iron, use toolshed, use axe)

  52. A tiny bit of data goes a long way. Crafting environment: make plank (get wood, use toolshed); make stick (get wood, use workbench); make cloth (get grass, use factory); make rope (get grass, use toolshed); make bridge (get iron, get wood, use factory); make bed ∗ (get wood, use toolshed, get grass, use workbench); make axe ∗ (get wood, use workbench, get iron, use toolshed); make shears (get wood, use workbench, get iron, use workbench); get gold (get iron, get wood, use factory, use bridge); get gem (get wood, use workbench, get iron, use toolshed, use axe)

  53. Thank you! https://github.com/jacobandreas/psketch
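
The policy-search slides (27 to 34) sum, over tasks and steps, a term of the form gradient of log π(action | state) times (reward minus baseline), with one subpolicy per sketch symbol shared across tasks. The Python sketch below is only a rough illustration of that bookkeeping, not the authors' implementation (see the repository linked above for that). Everything beyond the two sketches from slides 14 and 15 is an assumption made for the example: the subpolicies are stateless softmax distributions over a hypothetical `N_ACTIONS` actions, the reward is a single episode return rather than a per-step `r_t`, and the episodes and baseline values are invented toy numbers.

```python
import math

# Two sketches taken from the slides; both tasks share the "get wood" symbol.
SKETCHES = {
    "make planks": ["get wood", "use saw"],
    "make sticks": ["get wood", "use axe"],
}
N_ACTIONS = 3  # hypothetical number of primitive actions


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_prob(logits, action):
    """Gradient of log softmax(logits)[action] with respect to the logits."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def policy_gradient(logits_by_symbol, episodes, baseline_by_task):
    """Accumulate grad log pi(a) * (reward - baseline) per subpolicy.

    episodes: list of (task, steps, reward), where steps is a list of
    (sketch symbol, action) pairs and reward is the episode return.
    The baseline is looked up per task (an assumption of this sketch).
    """
    grads = {sym: [0.0] * N_ACTIONS for sym in logits_by_symbol}
    for task, steps, reward in episodes:
        advantage = reward - baseline_by_task[task]
        for symbol, action in steps:
            g = grad_log_prob(logits_by_symbol[symbol], action)
            for i in range(N_ACTIONS):
                grads[symbol][i] += g[i] * advantage
    return grads


# One subpolicy (here just a logit vector) per distinct sketch symbol.
symbols = sorted({s for sketch in SKETCHES.values() for s in sketch})
logits = {sym: [0.0] * N_ACTIONS for sym in symbols}

# Invented toy data: one successful and one failed episode.
episodes = [
    ("make planks", [("get wood", 0), ("use saw", 1)], 1.0),
    ("make sticks", [("get wood", 0), ("use axe", 2)], 0.0),
]
baselines = {"make planks": 0.5, "make sticks": 0.5}  # hypothetical values

grads = policy_gradient(logits, episodes, baselines)
for sym in symbols:
    print(sym, [round(g, 3) for g in grads[sym]])
```

With these toy numbers the two tasks push the shared get wood subpolicy in opposite directions, so its accumulated gradient cancels, while use saw and use axe each receive an update from their own task only.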
