
13. Reinforcement Learning [Read Chapter 13] [Exercises 13.1, 13.2, 13.4]

Lecture slides for the textbook Machine Learning, T. Mitchell, McGraw Hill, 1997.


  1. Reinforcement Learning [Read Chapter 13] [Exercises 13.1, 13.2, 13.4]
     - Control learning
     - Control policies that choose optimal actions
     - Q learning
     - Convergence

  2. Control Learning
     Consider learning to choose actions, e.g.,
     - robot learning to dock on a battery charger
     - learning to choose actions to optimize factory output
     - learning to play Backgammon
     Note several problem characteristics:
     - delayed reward
     - opportunity for active exploration
     - possibility that the state is only partially observable
     - possible need to learn multiple tasks with the same sensors/effectors

  3. One Example: TD-Gammon [Tesauro, 1995]
     Learn to play Backgammon.
     Immediate reward:
     - +100 if win
     - -100 if lose
     - 0 for all other states
     Trained by playing 1.5 million games against itself. Now approximately equal to the best human player.

  4. Reinforcement Learning Problem
     [Figure: agent-environment interaction loop. At each step the agent observes state s_i, takes action a_i, and receives reward r_i from the environment, generating the sequence s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, ...]
     Goal: learn to choose actions that maximize
         r_0 + γ r_1 + γ^2 r_2 + ... ,   where 0 ≤ γ < 1
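     For any finite prefix of the reward sequence this discounted sum is straightforward to compute; a minimal Python sketch (the function name and the example reward sequence are illustrative, not from the slides):

```python
# Minimal sketch: discounted return for a finite reward sequence,
# truncating the infinite sum r_0 + gamma*r_1 + gamma^2*r_2 + ...
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# e.g. three steps of reward
print(discounted_return([0, 0, 100]))  # 0 + 0.9*0 + 0.9**2 * 100, about 81
```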

  5. Markov Decision Processes
     Assume:
     - a finite set of states S
     - a set of actions A
     - at each discrete time t the agent observes state s_t ∈ S and chooses action a_t ∈ A
     - it then receives immediate reward r_t
     - and the state changes to s_{t+1}
     - Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
       - i.e., r_t and s_{t+1} depend only on the current state and action
       - the functions δ and r may be nondeterministic
       - the functions δ and r are not necessarily known to the agent
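     A deterministic MDP of this kind can be written down as two lookup tables for δ and r. The sketch below is a hypothetical three-state example of my own (not the grid world from the later figure); the snippets that follow reuse it:

```python
# Minimal sketch of a deterministic MDP as two lookup tables
# (states, actions, and rewards here are illustrative).
STATES = ["s1", "s2", "goal"]
ACTIONS = ["left", "right"]

# transition function delta(s, a) -> next state
DELTA = {
    ("s1", "right"): "s2",
    ("s1", "left"):  "s1",
    ("s2", "right"): "goal",
    ("s2", "left"):  "s1",
    ("goal", "right"): "goal",
    ("goal", "left"):  "goal",
}

# reward function r(s, a): 100 for the action entering the goal, else 0
REWARD = {(s, a): 0 for s in STATES for a in ACTIONS}
REWARD[("s2", "right")] = 100
```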

  6. Agent's Learning Task
     Execute actions in the environment, observe the results, and
     - learn an action policy π : S → A that maximizes
           E[ r_t + γ r_{t+1} + γ^2 r_{t+2} + ... ]
       from any starting state in S
     - here 0 ≤ γ < 1 is the discount factor for future rewards
     Note something new:
     - the target function is π : S → A
     - but we have no training examples of the form ⟨s, a⟩
     - training examples are of the form ⟨⟨s, a⟩, r⟩

  7. Value Function
     To begin, consider deterministic worlds...
     For each possible policy π the agent might adopt, we can define an evaluation function over states
         V^π(s) ≡ r_t + γ r_{t+1} + γ^2 r_{t+2} + ...
                ≡ Σ_{i=0..∞} γ^i r_{t+i}
     where r_t, r_{t+1}, ... are generated by following policy π starting at state s.
     Restated, the task is to learn the optimal policy π*
         π* ≡ argmax_π V^π(s),   (∀s)
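     In a deterministic world, V^π(s) can be estimated simply by rolling the policy forward and summing discounted rewards. A sketch, reusing the illustrative DELTA/REWARD tables above and truncating the infinite sum at a fixed horizon:

```python
# Minimal sketch: evaluate V^pi(s) by following the policy forward in a
# deterministic world (truncated after `horizon` steps).
def value_of_policy(policy, start, gamma=0.9, horizon=50):
    s, total, discount = start, 0.0, 1.0
    for _ in range(horizon):
        a = policy[s]
        total += discount * REWARD[(s, a)]
        discount *= gamma
        s = DELTA[(s, a)]
    return total

pi = {"s1": "right", "s2": "right", "goal": "right"}
print(value_of_policy(pi, "s1"))  # 0 + 0.9*100 + 0 + ... = 90.0
```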

  8. r(s, a), Q(s, a), and V*(s) values
     [Figure: the textbook's six-state grid world with absorbing goal state G (γ = 0.9). Four panels show the r(s, a) immediate reward values (100 for each action entering G, 0 otherwise), the corresponding Q(s, a) values (100, 90, 81, 72, ...), the V*(s) values (100, 90, 81, ...), and one optimal policy.]

  9. What to Learn
     We might try to have the agent learn the evaluation function V^{π*} (which we write as V*).
     It could then do a lookahead search to choose the best action from any state s, because
         π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
     A problem:
     - this works well if the agent knows δ : S × A → S and r : S × A → ℝ
     - but when it doesn't, it can't choose actions this way
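     A sketch of this one-step lookahead, assuming δ, r, and a V* table are all known (the V* values are illustrative, chosen to be consistent with the toy tables above):

```python
# Minimal sketch of pi*(s) = argmax_a [ r(s,a) + gamma * V*(delta(s,a)) ],
# usable only because DELTA, REWARD, and V* are assumed known here.
V_STAR = {"s1": 90.0, "s2": 100.0, "goal": 0.0}  # illustrative values

def greedy_lookahead(s, gamma=0.9):
    return max(ACTIONS, key=lambda a: REWARD[(s, a)] + gamma * V_STAR[DELTA[(s, a)]])

print(greedy_lookahead("s1"))  # 'right'
```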

  10. Q Function
      Define a new function very similar to V*:
          Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))
      If the agent learns Q, it can choose the optimal action even without knowing δ:
          π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
          π*(s) = argmax_a Q(s, a)
      Q is the evaluation function the agent will learn.
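      With a learned Q table, action selection reduces to an argmax over the table and needs neither δ nor r; a minimal sketch with illustrative Q values:

```python
# Minimal sketch: choosing actions from a Q table alone
# (values illustrative, consistent with the toy MDP above).
Q = {("s1", "right"): 90.0, ("s1", "left"): 81.0,
     ("s2", "right"): 100.0, ("s2", "left"): 81.0}

def greedy_policy(s):
    return max(ACTIONS, key=lambda a: Q[(s, a)])

print(greedy_policy("s1"))  # 'right'
```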

  11. Training Rule to Learn Q
      Note that Q and V* are closely related:
          V*(s) = max_{a'} Q(s, a')
      which allows us to write Q recursively as
          Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))
                      = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')
      Nice! Let Q̂ denote the learner's current approximation to Q. Consider the training rule
          Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
      where s' is the state resulting from applying action a in state s.

  12. Q Learning for Deterministic Worlds
      For each s, a initialize the table entry Q̂(s, a) ← 0
      Observe the current state s
      Do forever:
      - select an action a and execute it
      - receive immediate reward r
      - observe the new state s'
      - update the table entry for Q̂(s, a) as follows:
            Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
      - s ← s'
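      A minimal Python sketch of this loop, run on the illustrative DELTA/REWARD tables above. Random action selection and the periodic reset to the start state are my additions for the toy run; the slide itself only says "do forever":

```python
import random

# Tabular Q-learning for a deterministic world, following the slide's steps.
def q_learning(steps=2000, gamma=0.9):
    q_hat = {(s, a): 0.0 for s in STATES for a in ACTIONS}   # initialize Q-hat to 0
    s = "s1"                                                 # observe current state
    for _ in range(steps):
        a = random.choice(ACTIONS)              # select an action and execute it
        r = REWARD[(s, a)]                      # receive immediate reward r
        s_next = DELTA[(s, a)]                  # observe the new state s'
        q_hat[(s, a)] = r + gamma * max(q_hat[(s_next, a2)] for a2 in ACTIONS)
        s = s_next if s_next != "goal" else "s1"   # reset from the absorbing goal
    return q_hat

print(q_learning()[("s1", "right")])  # converges toward 90.0
```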

  13. Updating Q̂
      [Figure: the grid world before and after the update, with initial state s_1 and next state s_2; the Q̂ value on the "right" arrow out of s_1 is revised from 72 to 90, and the arrows leaving s_2 carry Q̂ values 63, 81, and 100.]
          Q̂(s_1, a_right) ← r + γ max_{a'} Q̂(s_2, a')
                          ← 0 + 0.9 max{63, 81, 100}
                          ← 90
      Notice that if rewards are non-negative, then
          (∀ s, a, n)  Q̂_{n+1}(s, a) ≥ Q̂_n(s, a)
      and
          (∀ s, a, n)  0 ≤ Q̂_n(s, a) ≤ Q(s, a)
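      The worked update above can be checked in a couple of lines (γ = 0.9 and immediate reward 0, as in the example):

```python
# Quick check of the worked update: new Q-hat(s1, a_right)
r, gamma = 0, 0.9
q_s2 = [63, 81, 100]            # current Q-hat values for actions out of s2
print(r + gamma * max(q_s2))    # 90.0
```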

  14. Q̂ converges to Q
      Consider the case of a deterministic world where each ⟨s, a⟩ is visited infinitely often.
      Proof: Define a full interval to be an interval during which each ⟨s, a⟩ is visited. During each full interval the largest error in the Q̂ table is reduced by a factor of γ.
      Let Q̂_n be the table after n updates, and Δ_n be the maximum error in Q̂_n; that is
          Δ_n = max_{s,a} | Q̂_n(s, a) − Q(s, a) |
      For any table entry Q̂_n(s, a) updated on iteration n + 1, the error in the revised estimate Q̂_{n+1}(s, a) is
          | Q̂_{n+1}(s, a) − Q(s, a) | = | (r + γ max_{a'} Q̂_n(s', a')) − (r + γ max_{a'} Q(s', a')) |
                                      = γ | max_{a'} Q̂_n(s', a') − max_{a'} Q(s', a') |
                                      ≤ γ max_{a'} | Q̂_n(s', a') − Q(s', a') |
                                      ≤ γ max_{s'', a'} | Q̂_n(s'', a') − Q(s'', a') |
          | Q̂_{n+1}(s, a) − Q(s, a) | ≤ γ Δ_n

  15. Note that we used the general fact that
          | max_a f_1(a) − max_a f_2(a) | ≤ max_a | f_1(a) − f_2(a) |

  16. Nondeterministic Case
      What if reward and next state are nondeterministic?
      We redefine V and Q by taking expected values:
          V^π(s) ≡ E[ r_t + γ r_{t+1} + γ^2 r_{t+2} + ... ]
                 ≡ E[ Σ_{i=0..∞} γ^i r_{t+i} ]
          Q(s, a) ≡ E[ r(s, a) + γ V*(δ(s, a)) ]

  17. Nondeterministic Case
      Q learning generalizes to nondeterministic worlds.
      Alter the training rule to
          Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [ r + γ max_{a'} Q̂_{n−1}(s', a') ]
      where
          α_n = 1 / (1 + visits_n(s, a))
      Convergence of Q̂ to Q can still be proved [Watkins and Dayan, 1992].
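      A sketch of this update with a per-pair visit counter driving the decaying learning rate α_n (the table and variable names are mine):

```python
from collections import defaultdict

# Minimal sketch of the nondeterministic training rule: a decaying,
# per-pair learning rate alpha_n = 1 / (1 + visits_n(s, a)) blends the
# old estimate with the new sample.
q_hat = defaultdict(float)
visits = defaultdict(int)

def nondeterministic_update(s, a, r, s_next, actions, gamma=0.9):
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])
    sample = r + gamma * max(q_hat[(s_next, a2)] for a2 in actions)
    q_hat[(s, a)] = (1 - alpha) * q_hat[(s, a)] + alpha * sample

# one sample transition, e.g. reward 100 on entering the goal
nondeterministic_update("s2", "right", 100, "goal", ["left", "right"])
print(q_hat[("s2", "right")])  # first sample: alpha = 1/2, so 0.5*0 + 0.5*100 = 50.0
```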
