Learning Curriculum Policies for Reinforcement Learning


  1. Learning Curriculum Policies for Reinforcement Learning
  Sanmit Narvekar and Peter Stone, Department of Computer Science, University of Texas at Austin
  {sanmit, pstone}@cs.utexas.edu

  2. Successes of Reinforcement Learning
  • Approaching or passing human-level performance
  • BUT it can take millions of episodes! People learn this MUCH faster

  3. People Learn via Curricula
  • People are able to learn many complex tasks very efficiently

  4. Example: Quick Chess
  • Quickly learn the fundamentals of chess
  • 5 x 6 board
  • Fewer pieces per type
  • No castling
  • No en passant

  5. Example: Quick Chess
  [Figure: sequence of Quick Chess boards]

  6. Task Space
  [Task space figure: empty task, pawns only, pawns + king, one piece per type, target task]
  • Quick Chess is a curriculum designed for people
  • We want to do something similar automatically for autonomous agents

  7. Curriculum Learning
  [Diagram: a task is an MDP; agent-environment loop with state, action, and reward. Components: task creation (assumed given), sequencing, and transfer learning (this work uses 2 types)]
  • Curriculum learning is a complex problem that ties together task creation, sequencing, and transfer learning

  8. Value Function Transfer
  • Initialize the Q-function in the target task using values learned in a source task, Q_source(s, a)
  • Assumptions: tasks have overlapping state and action spaces, OR an inter-task mapping is provided
  • Existing related work on learning such mappings
  (Image credit: Taylor and Stone, JMLR 2009)
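  A minimal sketch of this kind of transfer, assuming tabular Q-learning and a user-supplied inter-task mapping; `map_state` and `map_action` are hypothetical helpers (identity when the spaces overlap directly), not names from the talk:

```python
from collections import defaultdict

def value_function_transfer(q_source, target_states, target_actions,
                            map_state=lambda s: s, map_action=lambda a: a):
    """Initialize a target-task Q-table from a learned source-task Q-table.

    q_source: dict mapping (source_state, source_action) -> value.
    map_state / map_action: inter-task mapping from target to source.
    """
    q_target = defaultdict(float)
    for s in target_states:
        for a in target_actions:
            key = (map_state(s), map_action(a))
            if key in q_source:
                q_target[(s, a)] = q_source[key]  # transferred initialization
    return q_target
```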

  9. Reward Shaping Transfer
  • The reward function in the target task is augmented with a shaping reward f: r'(s, a, s') = r(s, a, s') + f  (new reward = old reward + shaping reward)
  • Potential-based advice restricts f to be a difference of potential functions: f(s, a, s', a') = γ Φ(s', a') - Φ(s, a)
  • Use the value function of the source as the potential function: Φ(s, a) = Q_source(s, a)
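  A minimal sketch of potential-based advice with the source task's action values as the potential (the standard shaping form); `q_source` and the discount `gamma` are assumed inputs, not names from the talk:

```python
def shaping_reward(q_source, s, a, s_next, a_next, gamma=0.99):
    """f(s, a, s', a') = gamma * Phi(s', a') - Phi(s, a),
    where the source task's Q-values serve as the potential Phi."""
    phi = lambda state, action: q_source.get((state, action), 0.0)
    return gamma * phi(s_next, a_next) - phi(s, a)

def shaped_reward(r, q_source, s, a, s_next, a_next, gamma=0.99):
    """New reward = old reward + shaping reward."""
    return r + shaping_reward(q_source, s, a, s_next, a_next, gamma)
```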

  10. The Problem: Autonomous Sequencing
  • Existing work is heuristic-based, e.g., examining performance on the target task and using heuristics to select the next task
  • In this work, we use learning to do the sequencing

  11. Sequencing as an MDP
  [Diagram: a curriculum agent selects a curriculum action (one of Task 1 ... Task N); inside each task, an RL agent interacts with its environment via state, action, and reward; the outcome returns to the curriculum agent as a curriculum state and curriculum reward]

  12. Sequencing as an MDP
  [Diagram: curriculum MDP with policies π_i as states, tasks M_j as actions, and rewards R_{i,j}]
  • State space S_C: all policies π_i an agent can represent
  • Action space A_C: different tasks M_j an agent can train on
  • Transition function p_C(s_C, a_C): learning task a_C transforms the agent's policy s_C
  • Reward function r_C(s_C, a_C): cost in time steps to learn task a_C given policy s_C
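  A minimal sketch of these four components, assuming a hypothetical `BaseAgent` that exposes its policy parameters via `get_params()` and a `train(task)` method returning the time steps used; none of these names come from the talk:

```python
class CurriculumMDP:
    """Curriculum MDP: states are base-agent policies, actions are tasks."""

    def __init__(self, base_agent, tasks):
        self.agent = base_agent   # the learning (student) agent
        self.tasks = tasks        # action space A_C: candidate tasks

    def state(self):
        # s_C: the agent's current policy, represented by its parameters
        return self.agent.get_params()

    def step(self, task):
        # Transition p_C: training on the chosen task transforms the agent's policy.
        steps_used = self.agent.train(task)
        # Reward r_C: (negative) cost in time steps to learn the task.
        return self.state(), -steps_used
```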

  13. Sequencing as an MDP
  [Diagram: the same curriculum MDP over policies π_i, tasks M_j, and rewards R_{i,j}]
  • A policy π_C : S_C → A_C on this curriculum MDP (CMDP) specifies which task to train on given the learning agent's policy π_i
  • Essentially training a teacher
  • How to do learning over the CMDP?
  • How does the CMDP change when the transfer method changes?
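  A minimal sketch of the resulting teacher-student loop, assuming a CMDP object like the one sketched above; `curriculum_policy` and the termination test `solved_target` are hypothetical stand-ins (the curriculum policy itself would be learned, e.g., by RL over CMDP states and task actions):

```python
def run_curriculum_episode(cmdp, curriculum_policy, target_task, solved_target):
    """One CMDP episode: the teacher sequences tasks until the student
    can solve the target task. Returns the total cost (negative time steps)."""
    total_reward = 0.0
    s_c = cmdp.state()
    while not solved_target(cmdp.agent, target_task):
        task = curriculum_policy(s_c)   # teacher picks the next task to train on
        s_c, r_c = cmdp.step(task)      # student trains; cost comes back as reward
        total_reward += r_c
    return total_reward
```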

  14. Learning in Curriculum MDPs
  [Diagram: pipeline from the base agent's weight vectors: extract raw CMDP state variables → extract features → function approximation and learning]
  • Express the raw CMDP state using the weights of the base agent's value function/policy
  • Extract features so that similar policies (CMDP states) are "close" in feature space
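  A minimal sketch of the "extract raw CMDP state variables" step for a discrete base agent whose knowledge is a Q-table; the per-state normalization shown here is one illustrative way to make similar policies map to similar vectors, not necessarily the exact features used in the paper:

```python
import numpy as np

def raw_cmdp_state(q_table, states, actions):
    """Flatten the base agent's Q-values into a CMDP state vector,
    normalized state-by-state so that only the relative preference
    between actions (i.e., the policy) matters."""
    features = []
    for s in states:
        q = np.array([q_table.get((s, a), 0.0) for a in actions])
        span = q.max() - q.min()
        features.append((q - q.min()) / span if span > 0 else np.zeros_like(q))
    return np.concatenate(features)
```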

  15. Example: Discrete Representations

              CMDP State 1           CMDP State 2           CMDP State 3
              Left  Right  Policy    Left  Right  Policy    Left  Right  Policy
  State 1     0.3   0.7    →         0.2   0.8    →         0.7   0.3    ←
  State 2     0.1   0.9    →         0.2   0.8    →         0.9   0.1    ←
  State 3     0.4   0.6    →         0.2   0.8    →         0.6   0.4    ←
  State 4     0.0   1.0    →         0.3   0.7    →         0.0   1.0    →

  • CMDP states 1 and 2 encode very similar policies, and should be close in CMDP representation space

  16. Example: Discrete Representations
  [Figure: a separate tiling per primitive state, over the axes Normalized Q(State 1, Left) vs. Normalized Q(State 1, Right) and Normalized Q(State 2, Left) vs. Normalized Q(State 2, Right)]
  • One approach: use tile coding
  • Create a separate tiling on a state-by-state level
  • When comparing CMDP states, the more similar the policies are in a primitive state, the more common tiles will be activated
  • Each primitive state contributes equally towards the similarity of the CMDP state
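  A minimal sketch of the state-by-state tile coding idea, using a small hand-rolled tiler rather than any particular library; each primitive state gets its own tilings over its normalized Q-values, and the CMDP feature set is the union of active tiles (all names here are illustrative):

```python
import numpy as np

def tile_indices(values, n_tilings=4, n_bins=8):
    """Active tile indices for one primitive state's normalized Q-values
    (values in [0, 1]), using n_tilings offset uniform grids."""
    values = np.asarray(values)
    active = []
    for t in range(n_tilings):
        offset = t / (n_tilings * n_bins)   # shift each tiling slightly
        bins = np.minimum(((values + offset) * n_bins).astype(int), n_bins - 1)
        active.append((t,) + tuple(bins))   # one active tile per tiling
    return active

def cmdp_features(per_state_q):
    """per_state_q: list of normalized Q-value vectors, one per primitive state.
    Two CMDP states share more features the more similar their policies are
    in each primitive state, and each state contributes equally."""
    feats = set()
    for i, q in enumerate(per_state_q):
        feats.update((i, tile) for tile in tile_indices(q))
    return feats
```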

  17. Continuous CMDP Representations
  • In continuous domains, weights are not local to a state
  • Feature extraction needs to be done separately for each domain (neural networks, tile coding, etc.)
  • If the base agent uses a linear function approximator, one can use tile coding over the parameters as before
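  A minimal sketch of the continuous case for a base agent with a linear function approximator: the raw CMDP state is simply the flattened weight vector, over which a tiler like the one above could then be applied. The optional normalization is an assumption beyond the slide's "raw weights":

```python
import numpy as np

def continuous_cmdp_state(weights, normalize=False):
    """Use the base agent's linear-FA weights directly as the CMDP state."""
    w = np.asarray(weights, dtype=float).ravel()
    if normalize:  # assumption: rescaling may help when weight magnitudes vary
        norm = np.linalg.norm(w)
        w = w / norm if norm > 0 else w
    return w
```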

  18. Changes in Transfer Algorithm
  [Diagram: the same curriculum MDP over policies π_i, tasks M_j, and rewards R_{i,j}]
  • The transfer method directly affects the CMDP state representation and transition function
  • CMDP states represent "states of knowledge," where knowledge is represented as a value function, a shaping reward, etc.
  • A similar process can be applied whenever the knowledge is parameterizable

  19. Experimental Results
  • Evaluate whether curriculum policies can be learned
  • Grid world: multiple base agents, multiple CMDP state representations
  • Pacman: multiple transfer learning algorithms; how long to train on sources?

  20. Grid World Setup
  Agent Types
  • Basic Agent: state is sensors on 4 sides that measure distance to keys, locks, etc.; actions are move in 4 directions, pick up key, unlock lock
  • Action-dependent Agent: state difference: weights on features are shared over the 4 directions
  • Rope Agent: action difference: like the basic agent, but can use a rope action to negate a pit
  CMDP Representations
  • Finite State Representation: for discrete domains, groups and normalizes raw weights state-by-state to form CMDP features
  • Continuous State Representation: directly uses the raw weights of the learning agent as features for the CMDP agent

  21. Basic Agent Results

  22. Action-Dependent Agent Results

  23. Rope Agent Results

  24. Pacman Setup
  Agent Representation
  • Action-dependent egocentric features
  CMDP Representation
  • Continuous State Representation: directly uses the raw weights of the learning agent as features for the CMDP agent
  Transfer Methods
  • Value Function Transfer
  • Reward Shaping Transfer
  How long to train on a source task?

  25. Pacman Value Function Transfer
  [Plot: cost to learn the target task vs. CMDP episodes, comparing no curriculum, the continuous state representation, and naive length-1 and length-2 representations]

  26. Pacman Reward Shaping Transfer
  [Plot: cost to learn the target task vs. CMDP episodes, comparing no curriculum, Svetlik et al. (2017), the continuous state representation, and the naive length-2 representation]

  27. How Long to Train?
  [Plot: cost to learn the target task vs. CMDP episodes, comparing return-based and small fixed training durations for both reward shaping and value function transfer]

  28. Related Work
  Restrictions on source tasks
  • Florensa et al. 2018, Riedmiller et al. 2018, Sukhbaatar et al. 2017
  Heuristic-based sequencing
  • Da Silva et al. 2018, Svetlik et al. 2017
  MDP/POMDP-based sequencing
  • Matiisen et al. 2017, Narvekar et al. 2017
  CL for supervised learning
  • Bengio et al. 2009, Fan et al. 2018, Graves et al. 2017
