13. Reinforcement Learning

[Read Chapter 13]
[Exercises 13.1, 13.2, 13.4]

• Control learning
• Control policies that choose optimal actions
• Q learning
• Convergence

255    lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997
Control Learning

Consider learning to choose actions, e.g.,
• Robot learning to dock on a battery charger
• Learning to choose actions to optimize factory output
• Learning to play Backgammon

Note several problem characteristics:
• Delayed reward
• Opportunity for active exploration
• Possibility that the state is only partially observable
• Possible need to learn multiple tasks with the same sensors/effectors

256
One Example: TD-Gammon [Tesauro, 1995]

Learn to play Backgammon.

Immediate reward:
• +100 if win
• -100 if lose
• 0 for all other states

Trained by playing 1.5 million games against itself.
Now approximately equal to the best human player.

257
Reinforcement Learning Problem

[Figure: the agent observes state s_t from the environment, chooses action a_t, and receives reward r_t; the interaction produces the sequence s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, ...]

Goal: learn to choose actions that maximize

    r_0 + γ r_1 + γ^2 r_2 + ...,   where 0 ≤ γ < 1

258
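The interaction loop on this slide can be sketched directly in code. The Python fragment below is only an illustration: the environment interface (reset() returning s_0, step(a) returning the next state and immediate reward) is an assumed, hypothetical API, not something defined in these slides.

    # Minimal sketch of the agent-environment loop (hypothetical interface, not from
    # the slides): env.reset() is assumed to return s_0, and env.step(a) is assumed
    # to return the pair (next_state, immediate_reward).

    def run_episode(env, policy, gamma=0.9, horizon=100):
        """Follow `policy` and return the discounted return r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
        s = env.reset()
        discounted_return, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)            # agent chooses action a_t given state s_t
            s, r = env.step(a)       # environment returns s_{t+1} and reward r_t
            discounted_return += discount * r
            discount *= gamma
        return discounted_return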
Markov Decision Processes

Assume:
• a finite set of states S
• a set of actions A
• at each discrete time t, the agent observes state s_t ∈ S and chooses action a_t ∈ A
• it then receives immediate reward r_t
• and the state changes to s_{t+1}

• Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
  - i.e., r_t and s_{t+1} depend only on the current state and action
  - the functions δ and r may be nondeterministic
  - the functions δ and r are not necessarily known to the agent

259
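For a deterministic world, δ and r are simply tables indexed by (state, action). A minimal sketch follows, using a made-up two-state, two-action MDP purely for illustration (none of these states, actions, or rewards come from the slides):

    # A toy deterministic MDP (states, actions, and rewards made up for illustration).
    # delta[(s, a)] gives the next state; reward[(s, a)] gives the immediate reward.

    states = ["s1", "s2"]
    actions = ["stay", "move"]

    delta = {
        ("s1", "stay"): "s1", ("s1", "move"): "s2",
        ("s2", "stay"): "s2", ("s2", "move"): "s1",
    }
    reward = {
        ("s1", "stay"): 0, ("s1", "move"): 0,
        ("s2", "stay"): 10, ("s2", "move"): 0,
    }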
Agent's Learning Task

Execute actions in the environment, observe the results, and
• learn an action policy π : S → A that maximizes

      E[ r_t + γ r_{t+1} + γ^2 r_{t+2} + ... ]

  from any starting state in S
• here 0 ≤ γ < 1 is the discount factor for future rewards

Note something new:
• the target function is π : S → A
• but we have no training examples of the form ⟨s, a⟩
• training examples are of the form ⟨⟨s, a⟩, r⟩

260
Value Function

To begin, consider deterministic worlds...

For each possible policy π the agent might adopt, we can define an evaluation function over states:

    V^π(s) ≡ r_t + γ r_{t+1} + γ^2 r_{t+2} + ...
           ≡ Σ_{i=0}^∞ γ^i r_{t+i}

where r_t, r_{t+1}, ... are generated by following policy π starting at state s.

Restated, the task is to learn the optimal policy π*:

    π* ≡ argmax_π V^π(s),   (∀s)

261
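In a deterministic world, V^π(s) can be estimated by simply following π from s and summing discounted rewards over a long but finite horizon. A sketch, reusing the toy delta/reward tables above (the example policy is likewise made up):

    # Estimate V^pi(s) in a deterministic world by rolling the policy forward.
    # Reuses the `delta` and `reward` tables of the toy MDP above.

    def value_of_policy(s, policy, delta, reward, gamma=0.9, horizon=100):
        """Truncated sum r_t + gamma*r_{t+1} + ... obtained by following `policy` from s."""
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            total += discount * reward[(s, a)]
            discount *= gamma
            s = delta[(s, a)]
        return total

    # Example policy (made up): move toward s2, then stay there.
    pi = lambda s: "move" if s == "s1" else "stay"
    print(value_of_policy("s1", pi, delta, reward))   # approx. 90: 0 + 0.9*10 + 0.81*10 + ...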
[Figure: a simple grid world with absorbing goal state G and discount factor γ = 0.9. Four panels show the immediate rewards r(s, a) (100 for actions that enter G, 0 otherwise), the resulting Q(s, a) values (100, 90, 81, 72, ...), the optimal state values V*(s) (100, 90, 81), and the arrows of one optimal policy.]

r(s, a) (immediate reward) values        Q(s, a) values        V*(s) values        One optimal policy

262
What to Learn

We might try to have the agent learn the evaluation function V^{π*} (which we write as V*).

It could then do a lookahead search to choose the best action from any state s, because

    π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]

A problem:
• this works well if the agent knows δ : S × A → S, and r : S × A → ℝ
• but when it doesn't, it can't choose actions this way

263
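If δ, r, and V* were all known, the lookahead rule above is a one-line computation. A sketch under exactly that assumption, where v_star is a hypothetical table of optimal state values:

    # One-step lookahead policy: usable only when delta, reward, and V* are known.
    # `v_star` is a hypothetical dictionary mapping each state to its optimal value V*(s).

    def lookahead_policy(s, actions, delta, reward, v_star, gamma=0.9):
        """pi*(s) = argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ]"""
        return max(actions, key=lambda a: reward[(s, a)] + gamma * v_star[delta[(s, a)]])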
Q Function

Define a new function very similar to V*:

    Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))

If the agent learns Q, it can choose optimal actions even without knowing δ!

    π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
    π*(s) = argmax_a Q(s, a)

Q is the evaluation function the agent will learn.

264
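By contrast with the lookahead policy, acting greedily with respect to a learned Q table needs no model at all. A minimal sketch, where q is assumed to be a dictionary of current Q-value estimates keyed by (state, action):

    # Greedy action selection from a learned Q table: no knowledge of delta or r is needed.
    # `q` is assumed to map (state, action) pairs to Q-value estimates.

    def greedy_action(s, actions, q):
        """pi*(s) = argmax_a Q(s, a)"""
        return max(actions, key=lambda a: q[(s, a)])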
Training Rule to Learn Q

Note that Q and V* are closely related:

    V*(s) = max_{a'} Q(s, a')

which allows us to write Q recursively as

    Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))
                = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')

Nice! Let Q̂ denote the learner's current approximation to Q. Consider the training rule

    Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')

where s' is the state resulting from applying action a in state s.

265
Q Learning for Deterministic Worlds

For each s, a initialize the table entry Q̂(s, a) ← 0.

Observe the current state s.

Do forever:
• select an action a and execute it
• receive immediate reward r
• observe the new state s'
• update the table entry for Q̂(s, a) as follows:

      Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')

• s ← s'

266
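The procedure above translates almost line for line into Python. The sketch below assumes the world is given by delta/reward tables (deterministic transitions and rewards), uses uniform random action selection for exploration, and truncates the "do forever" loop; none of these implementation choices are prescribed by the slide.

    import random

    # Sketch of the deterministic-world Q-learning procedure above.

    def q_learning(states, actions, delta, reward, start, gamma=0.9, steps=10000):
        q = {(s, a): 0.0 for s in states for a in actions}     # initialize each Q-hat(s, a) to 0
        s = start                                              # observe the current state s
        for _ in range(steps):
            a = random.choice(actions)                         # select an action a and execute it
            r = reward[(s, a)]                                 # receive immediate reward r
            s_next = delta[(s, a)]                             # observe the new state s'
            # Q-hat(s, a) <- r + gamma * max_a' Q-hat(s', a')
            q[(s, a)] = r + gamma * max(q[(s_next, a2)] for a2 in actions)
            s = s_next                                         # s <- s'
        return q

On the toy MDP from the earlier sketch, q_learning(states, actions, delta, reward, "s1") should converge to Q̂(s1, move) = 90 and Q̂(s2, stay) = 100, matching the recursive definition. With an absorbing goal state such as G in the grid world, one would also reset s whenever the goal is reached.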
Updating Q̂

[Figure: the robot R takes action a_right, moving from initial state s1 to next state s2 in the grid world; the Q̂ entry for (s1, a_right) changes from 72 to 90, using the Q̂ values 63, 81, 100 of state s2.]

    Q̂(s1, a_right) ← r + γ max_{a'} Q̂(s2, a')
                    ← 0 + 0.9 max{63, 81, 100}
                    ← 90

Notice that if rewards are non-negative, then

    (∀ s, a, n)   Q̂_{n+1}(s, a) ≥ Q̂_n(s, a)

and

    (∀ s, a, n)   0 ≤ Q̂_n(s, a) ≤ Q(s, a)

267
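The single table update above is easy to verify numerically. In the snippet below, only the three Q̂ values 63, 81, 100 come from the slide; the action labels for state s2 are assumed for illustration.

    # Check the update: Q-hat(s1, a_right) <- r + gamma * max_a' Q-hat(s2, a')
    gamma = 0.9
    r = 0                                            # moving right out of s1 gives no immediate reward
    q_s2 = {"up": 63, "left": 81, "right": 100}      # current Q-hat entries for s2 (labels assumed)
    q_s1_right = r + gamma * max(q_s2.values())
    print(q_s1_right)                                # 90.0, the new value of Q-hat(s1, a_right)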
Q̂ converges to Q

Consider the case of a deterministic world where each ⟨s, a⟩ is visited infinitely often.

Proof: define a full interval to be an interval during which each ⟨s, a⟩ is visited. During each full interval, the largest error in the Q̂ table is reduced by a factor of γ.

Let Q̂_n be the table after n updates, and Δ_n be the maximum error in Q̂_n; that is,

    Δ_n = max_{s,a} | Q̂_n(s, a) - Q(s, a) |

For any table entry Q̂_n(s, a) updated on iteration n+1, the error in the revised estimate Q̂_{n+1}(s, a) is

    | Q̂_{n+1}(s, a) - Q(s, a) | = | (r + γ max_{a'} Q̂_n(s', a')) - (r + γ max_{a'} Q(s', a')) |
                                 = γ | max_{a'} Q̂_n(s', a') - max_{a'} Q(s', a') |
                                 ≤ γ max_{a'} | Q̂_n(s', a') - Q(s', a') |
                                 ≤ γ max_{s'', a'} | Q̂_n(s'', a') - Q(s'', a') |

    | Q̂_{n+1}(s, a) - Q(s, a) | ≤ γ Δ_n

268
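The contraction can be watched in action on the earlier toy MDP: one sweep over every ⟨s, a⟩ pair is a full interval, and the maximum error shrinks by a factor of γ per sweep. The true Q values hard-coded below were worked out by hand for that toy MDP and are an assumption of this sketch, as is the reuse of the delta, reward, and actions tables defined earlier.

    # Watch the contraction from the proof on the toy MDP defined earlier.
    gamma = 0.9
    true_q = {("s1", "move"): 90.0, ("s1", "stay"): 81.0,
              ("s2", "stay"): 100.0, ("s2", "move"): 81.0}

    q = {sa: 0.0 for sa in true_q}
    for sweep in range(5):
        for (s, a) in true_q:                        # visit every (s, a) once: one full interval
            s_next = delta[(s, a)]
            q[(s, a)] = reward[(s, a)] + gamma * max(q[(s_next, a2)] for a2 in actions)
        max_error = max(abs(q[sa] - true_q[sa]) for sa in true_q)
        print(sweep + 1, max_error)                  # approximately 90, 81, 72.9, 65.61, 59.049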
Note that we used the general fact that

    | max_a f_1(a) - max_a f_2(a) | ≤ max_a | f_1(a) - f_2(a) |

269
Nondeterministic Case

What if the reward and next state are non-deterministic?

We redefine V and Q by taking expected values:

    V^π(s) ≡ E[ r_t + γ r_{t+1} + γ^2 r_{t+2} + ... ]
           ≡ E[ Σ_{i=0}^∞ γ^i r_{t+i} ]

    Q(s, a) ≡ E[ r(s, a) + γ V*(δ(s, a)) ]

270
Nondeterministic Case

Q learning generalizes to nondeterministic worlds.

Alter the training rule to

    Q̂_n(s, a) ← (1 - α_n) Q̂_{n-1}(s, a) + α_n [ r + γ max_{a'} Q̂_{n-1}(s', a') ]

where

    α_n = 1 / (1 + visits_n(s, a))

Convergence of Q̂ to Q can still be proved [Watkins and Dayan, 1992].

271
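The nondeterministic training rule is a small change to the deterministic update. A sketch, assuming q and visits are dictionaries keyed by (state, action) and that the caller has already executed action a in state s and observed reward r and next state s_next (this calling convention is assumed, not given on the slide):

    # Sketch of the nondeterministic training rule with decaying learning rate
    # alpha_n = 1 / (1 + visits_n(s, a)).

    def nondeterministic_update(q, visits, s, a, r, s_next, actions, gamma=0.9):
        visits[(s, a)] = visits.get((s, a), 0) + 1
        alpha = 1.0 / (1.0 + visits[(s, a)])
        target = r + gamma * max(q.get((s_next, a2), 0.0) for a2 in actions)
        q[(s, a)] = (1.0 - alpha) * q.get((s, a), 0.0) + alpha * target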