Autonomous Task Sequencing for Customized Curriculum Design in Reinforcement Learning
Sanmit Narvekar, Jivko Sinapov, and Peter Stone
Department of Computer Science
University of Texas at Austin
{sanmit, jsinapov, pstone}@cs.utexas.edu
Successes of Reinforcement Learning
• Approaching or passing human-level performance
• BUT: it can take millions of episodes! People learn these tasks MUCH faster
People Learn via Curricula
• People are able to learn many complex tasks very efficiently
Example: Quick Chess
• Quickly learn the fundamentals of chess
• 5 x 6 board
• Fewer pieces per type
• No castling
• No en passant
Example: Quick Chess
[Figure: sequence of Quick Chess boards]
Task Space
[Figure: task space graph from an empty task through intermediate tasks (pawns only; pawns + king; one piece per type) to the target task]
• Quick Chess is a curriculum designed for people
• We want to do something similar automatically for autonomous agents
Curriculum Learning
• Task = MDP
[Figure: agent-environment loop (state, action, reward) linking three components: task creation (presented at AAMAS '16), sequencing, and transfer learning (via value function transfer)]
• Curriculum learning is a complex problem that ties together task creation, sequencing, and transfer learning
Autonomous Task Sequencing
Sequencing as an MDP
[Figure: curriculum MDP graph — nodes are agent policies π_0 ... π_5 and π_f; edges are tasks M_1 ... M_4 with costs R_{i,j}]
• State space S_C: all policies π_i an agent can represent
• Action space A_C: the different tasks M_j an agent can train on
• Transition function p_C(s_C, a_C): learning task a_C transforms the agent's policy s_C
• Reward function r_C(s_C, a_C): the cost in time steps to learn task a_C given policy s_C
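To make the formulation concrete, here is a minimal Python sketch of the CMDP interface described above. All names and types are illustrative assumptions, not the authors' implementation.

    # Minimal sketch of the curriculum MDP (CMDP). All names are illustrative
    # assumptions, not the authors' code.
    from dataclasses import dataclass
    from typing import Callable, Dict, List, Tuple

    Policy = Dict   # s_C: the learning agent's current policy (e.g., a Q-table)
    Task = str      # a_C: an identifier for a task MDP M_j

    @dataclass
    class CurriculumMDP:
        tasks: List[Task]                                    # action space A_C
        train: Callable[[Policy, Task], Tuple[Policy, int]]  # runs RL on a task;
                                                             # returns (new policy,
                                                             # time steps spent)

        def step(self, policy: Policy, task: Task) -> Tuple[Policy, float]:
            """Transition p_C and reward r_C for taking CMDP action `task`."""
            new_policy, steps_spent = self.train(policy, task)
            return new_policy, -steps_spent   # reward = negative time-step cost

Note that a single CMDP "step" is expensive: it runs an entire course of RL training on one task, which is exactly why the next slides approximate a solution with a single trace rather than learning the full CMDP policy.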
Sequencing as an MDP
• A policy π_C : S_C → A_C on this curriculum MDP (CMDP) specifies which task to train on given the learning agent's current policy π_i
• Learning the full policy π_C can be difficult!
  • Taking a single action requires solving a full task MDP
  • Transitions are not deterministic
Sequencing as an MDP
• Instead, find one trace/execution of π_C* in the CMDP
• Main Idea: leverage the fact that we know the target task, and therefore what is relevant for the final-state policy π_f, to guide the selection of tasks
Autonomous Sequencing
Target Task
• Grid world domain
• Objectives:
  • Navigate the world
  • Pick up keys
  • Unlock locks
  • Avoid pits
Autonomous Sequencing
[Figure: recursion tree of tasks, partitioned into solvable and unsolvable tasks]
• Recursive algorithm (6 steps); a skeleton of the control flow is sketched below
• Each iteration adds a source task to the curriculum
  • This in turn updates the policy
• Terminates when performance on the target task exceeds a desired performance threshold
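The following compact Python skeleton sketches the six-step loop. The helper functions are assumed stubs standing in for operations detailed on the next slides (try_solve is sketched under Step 1 below; the Step-4 and Step-5 criteria are sketched under those steps); this shows the control flow, not the authors' implementation.

    # Skeleton of the six-step recursive procedure (helpers are assumed stubs).
    def build_curriculum(target, agent, budget, threshold):
        curriculum = []
        while True:
            # Step 1: attempt the target directly within the budget, saving samples
            solved, samples = try_solve(agent, target, budget)
            if solved and performance(agent, target) >= threshold:
                return curriculum
            # Step 2: create candidate source tasks (AAMAS '16 methods)
            sources = create_source_tasks(target, samples)
            # Step 3: partition sources into solvable / unsolvable
            solvable = [t for t in sources
                        if try_solve(agent.copy(), t, budget)[0]]
            if solvable:
                # Step 4: train on the source whose policy update helps most
                # on the saved target-task samples, then loop back to Step 1
                best = max(solvable,
                           key=lambda t: update_on_samples(agent, t, samples))
                train(agent, best, budget)
                curriculum.append(best)
            elif sources:
                # Step 5: recurse, treating the most sample-relevant
                # (but unsolvable) source as the new target
                sub_target = max(sources, key=lambda t: relevance(t, samples))
                curriculum += build_curriculum(sub_target, agent, budget, threshold)
            else:
                # Step 6: nothing usable; increase the budget and retry
                budget *= 2   # doubling is an assumption; the slide only
                              # says "increase budget"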
Autonomous Sequencing
Step 1
• Assume a learning budget κ
• Attempt to solve the target task directly within κ steps; save the samples
• Solvable? Either the target task was easy to learn, or we started with a policy that made it easy to learn. Either way, done.
• Goal: incrementally learn subtasks to build a policy that can learn the target task
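A minimal sketch of the Step-1 attempt, assuming a standard episodic agent/task API (act, update, reset, step, solved — all hypothetical names): run for at most κ environment steps, saving every transition sample for the source-task selection in Step 4.

    # Sketch of Step 1: try to solve a task within a budget of `kappa` steps,
    # saving all transition samples. The agent/task API here is assumed.
    def try_solve(agent, task, kappa):
        samples, steps = [], 0
        while steps < kappa:
            state, done = task.reset(), False
            while not done and steps < kappa:
                action = agent.act(state)
                next_state, reward, done = task.step(action)
                agent.update(state, action, reward, next_state)
                samples.append((state, action, reward, next_state))
                state, steps = next_state, steps + 1
            if task.solved():             # e.g., reached goal performance
                return True, samples      # solvable within the budget
        return False, samples             # budget exhausted: unsolvable for now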
Autonomous Sequencing
Step 2
• Could not solve the target
• Create source tasks using the methods from AAMAS '16
Step 3
• Attempt to solve each source task within κ steps
• Partition the sources into solvable / unsolvable
Autonomous Sequencing
Step 4
[Figure: initial policy π_0 and candidate policies π_1, π_2 from solvable source tasks, compared on target-task samples [s_1, s_2, s_3, s_4, ..., s_κ]]
• If solvable tasks exist, select the one that updates the policy the most on samples [s_1, s_2, s_3, s_4, ..., s_κ] drawn from the target task (one possible criterion is sketched below)
• Assumption:
  • Source tasks that can be solved have policies that are relevant to the target task
  • They don't provide negative transfer
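One way "updates the policy the most" could be measured for a value-based agent. The specific metric below (total change in greedy state values over the saved target samples, for a tabular Q-learner) and the helper train_on_copy are assumptions for illustration, not the paper's definition.

    # Sketch of a possible Step-4 selection criterion for a tabular Q-learner:
    # how much did training on a candidate source change the greedy values of
    # the states the agent saw in the target task? (Metric is an assumption.)
    def policy_change_on_target(q_before, q_after, target_samples):
        change = 0.0
        for (state, _, _, _) in target_samples:
            if state in q_before and state in q_after:
                change += abs(max(q_after[state].values())
                              - max(q_before[state].values()))
        return change

    # Hypothetical usage: train a copy of the Q-table on each solvable source,
    # then keep the source that moved the target-state values the most.
    # best = max(solvable, key=lambda t: policy_change_on_target(
    #     q, train_on_copy(q, t), target_samples))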
Autonomous Sequencing
Step 4 (cont.)
• Add the selected source task to the curriculum
• Return to Step 1 (re-evaluate on the target task)
  • The policy has changed, so we will get a new set of samples
  • Samples are biased towards the agent's current set of experiences
  • This in turn guides the selection of source tasks
Autonomous Sequencing
Step 5
• No sources solvable
• Sort tasks by sample relevance
  • Compare the states experienced in the target task with those experienced in the sources (one plausible measure is sketched below)
• Recursively create sub-source tasks
• Return to Step 2 with the current source task as the target task
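The slide does not give the exact relevance measure; below is one plausible sketch that scores each source by the overlap between the states visited in it and the states visited in the target task.

    # Sketch of a possible "sample relevance" score for Step 5: the fraction of
    # a source's visited states that also appear in the target-task samples.
    # (The precise measure is not specified on the slide; this is an assumption.)
    def sample_relevance(source_samples, target_samples):
        source_states = {s for (s, _, _, _) in source_samples}
        target_states = {s for (s, _, _, _) in target_samples}
        if not source_states:
            return 0.0
        return len(source_states & target_states) / len(source_states)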
Autonomous Sequencing
Step 6
• No sources usable after exhausting the tree
• Increase the budget and return to Step 1
• Learning can be cached, so the agent can pick up where it left off (sketched below)
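A sketch of how the caching mentioned here could work, reusing the try_solve sketch from Step 1 and assuming hypothetical snapshot/restore methods on the agent: a retried task resumes from the learning state left by the previous attempt instead of starting over.

    # Sketch of Step 6's caching (snapshot/restore are assumed agent methods):
    # when the budget is increased and a task is retried, resume from the
    # learning state of the previous attempt rather than from scratch.
    cache = {}   # task id -> agent snapshot after the last attempt

    def try_solve_cached(agent, task, kappa):
        if task.name in cache:
            agent.restore(cache[task.name])   # pick up where we left off
        solved, samples = try_solve(agent, task, kappa)
        cache[task.name] = agent.snapshot()
        return solved, samples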
Connection to CMDPs
[Figure: the recursion tree of solvable/unsolvable tasks alongside the curriculum MDP graph]
• An optimal path in the CMDP is one that reaches π_f with least cost
• The selection in Step 4 picks tasks that update the policy most towards π_f
• The learning budget minimizes cost
• The algorithm behaves greedily to balance updates and cost
Experimental Setup
• Grid world domain presented previously
• Create multiple agents
  • Using multiple agents shows that the algorithm does not depend on the implementation of the RL agent
  • Evaluate whether different agents benefit from individualized curricula
Experimental Setup
Agent Types
• Basic Agent
  • State: sensors on 4 sides that measure distance to keys, locks, etc. (a feature-layout sketch follows below)
  • Actions: move in 4 directions, pick up key, unlock lock
• Action-dependent Agent
  • State difference: weights on features are shared over the 4 directions
• Rope Agent
  • Action difference: like the basic agent, but can use a rope action to negate a pit
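An illustrative sketch of the Basic Agent's state features as described above; the exact object set, ordering, and missing-value convention are assumptions.

    # Illustrative feature layout for the Basic Agent: one distance reading per
    # (direction, object type) pair. Object set and ordering are assumptions.
    OBJECT_TYPES = ["key", "lock", "pit"]            # etc., per the slide
    DIRECTIONS = ["north", "south", "east", "west"]

    def basic_agent_features(distances):
        """distances: dict mapping (direction, object_type) -> distance."""
        return [distances.get((d, o), float("inf"))
                for d in DIRECTIONS for o in OBJECT_TYPES]

    # The Action-dependent Agent would instead share one weight per object type
    # across all four directions; the Rope Agent adds a "rope" action instead.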
Basic Agent Results
[Figure: results for the Basic Agent]
Action-Dependent Agent Results
[Figure: results for the Action-dependent Agent]
Rope Agent Results
[Figure: results for the Rope Agent]
Summary
• Presented a novel formulation of curriculum generation as an MDP
• Proposed an algorithm to approximate a trace in this MDP
• Demonstrated that the proposed method can create curricula tailored to the sensing and action capabilities of agents