Reinforcement Learning
[Read Chapter 13]
[Exercises 13.1, 13.2, 13.4]
• Control learning
• Control policies that choose optimal actions
• Q learning
• Convergence
Lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997

Control Learning
Consider learning to choose actions, e.g.,
• Robot learning to dock on battery charger
• Learning to choose actions to optimize factory output
• Learning to play Backgammon
Note several problem characteristics:
• Delayed reward
• Opportunity for active exploration
• Possibility that state is only partially observable
• Possible need to learn multiple tasks with same sensors/effectors

One Example: TD-Gammon
[Tesauro, 1995]
Learn to play Backgammon
Immediate reward:
• +100 if win
• -100 if lose
• 0 for all other states
Trained by playing 1.5 million games against itself
Now approximately equal to best human player

Reinforcement Learning Problem
[Figure: agent-environment loop. The agent receives State and Reward from the Environment and sends back an Action, producing the sequence s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, ...]
Goal: Learn to choose actions that maximize
$r_0 + \gamma r_1 + \gamma^2 r_2 + \ldots$, where $0 \le \gamma < 1$

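The discounted sum in the goal above can be sketched for a finite reward sequence as follows. The function name `discounted_return` and the sample rewards are illustrative assumptions, not part of the slides.

```python
def discounted_return(rewards, gamma):
    """Compute r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite list of rewards."""
    total = 0.0
    # Work backwards so each step applies one more factor of gamma
    # to everything that comes later in the sequence.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: two zero-reward steps, then a reward of 100.
example_return = discounted_return([0, 0, 100], gamma=0.9)
print(example_return)  # 0 + 0.9*0 + 0.81*100, up to floating-point rounding
```

A reward received two steps in the future is weighted by $\gamma^2$, so with $\gamma = 0.9$ the reward of 100 contributes about 81.
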
Markov Decision Processes
Assume
• finite set of states $S$
• set of actions $A$
• at each discrete time agent observes state $s_t \in S$ and chooses action $a_t \in A$
• then receives immediate reward $r_t$
• and state changes to $s_{t+1}$
• Markov assumption: $s_{t+1} = \delta(s_t, a_t)$ and $r_t = r(s_t, a_t)$
  • i.e., $r_t$ and $s_{t+1}$ depend only on current state and action
  • functions $\delta$ and $r$ may be nondeterministic
  • functions $\delta$ and $r$ not necessarily known to agent

Agent's Learning Task
Execute actions in environment, observe results, and
• learn action policy $\pi : S \to A$ that maximizes
  $E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots]$
  from any starting state in $S$
• here $0 \le \gamma < 1$ is the discount factor for future rewards
Note something new:
• Target function is $\pi : S \to A$
• but we have no training examples of form $\langle s, a \rangle$
• training examples are of form $\langle \langle s, a \rangle, r \rangle$

Value Function
To begin, consider deterministic worlds...
For each possible policy $\pi$ the agent might adopt, we can define an evaluation function over states
$V^{\pi}(s) \equiv r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots \equiv \sum_{i=0}^{\infty} \gamma^i r_{t+i}$
where $r_t, r_{t+1}, \ldots$ are generated by following policy $\pi$ starting at state $s$
Restated, the task is to learn the optimal policy $\pi^*$
$\pi^* \equiv \arg\max_{\pi} V^{\pi}(s), \; (\forall s)$

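As a minimal sketch of $V^{\pi}$ in a deterministic world, the code below follows a fixed policy and accumulates discounted rewards. The three-state chain (`s0`, `s1`, `goal`), its `DELTA` and `REWARD` tables, and the two policies are hypothetical and chosen only for illustration.

```python
GAMMA = 0.9

# Hypothetical deterministic world: s0 <-> s1 -> goal (goal is absorbing).
DELTA = {("s0", "right"): "s1", ("s0", "stay"): "s0",
         ("s1", "right"): "goal", ("s1", "left"): "s0"}
REWARD = {("s0", "right"): 0, ("s0", "stay"): 0,
          ("s1", "right"): 100, ("s1", "left"): 0}


def policy_value(policy, state, horizon=200):
    """Approximate V^pi(state) by following `policy` for at most `horizon` steps."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        if state not in policy:  # absorbing state: no further reward
            break
        action = policy[state]
        total += discount * REWARD[(state, action)]
        discount *= GAMMA
        state = DELTA[(state, action)]
    return total


go_right = {"s0": "right", "s1": "right"}
stay_put = {"s0": "stay", "s1": "right"}
print(policy_value(go_right, "s0"), policy_value(stay_put, "s0"))
```

Under these assumptions the policy that heads for the goal has the higher value at `s0`, which is what the argmax over policies selects.
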
[Figure: grid world with goal state G, shown in four panels: $r(s, a)$ (immediate reward) values; $Q(s, a)$ values; $V^*(s)$ values; one optimal policy.]

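The values in such a grid world can be reproduced with a short sketch. The layout below is an assumption: a 2-by-3 grid with an absorbing goal G in the top-right cell, moves between adjacent cells, reward 100 for entering G and 0 otherwise, and $\gamma = 0.9$. The computation is plain value iteration using the known $\delta$ and $r$, which is a planning technique and not the learning method these slides develop.

```python
GAMMA = 0.9
ROWS, COLS = 2, 3
GOAL = (0, 2)  # assumed position of G: top-right cell, absorbing
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def legal_actions(state):
    """Actions that keep the agent on the grid; none are available in G."""
    if state == GOAL:
        return []
    row, col = state
    return [a for a, (dr, dc) in MOVES.items()
            if 0 <= row + dr < ROWS and 0 <= col + dc < COLS]


def delta(state, action):
    """Deterministic state transition function."""
    dr, dc = MOVES[action]
    return (state[0] + dr, state[1] + dc)


def reward(state, action):
    """Immediate reward: 100 for any move that enters G, else 0."""
    return 100 if delta(state, action) == GOAL else 0


states = [(r, c) for r in range(ROWS) for c in range(COLS)]
V = {s: 0.0 for s in states}
for _ in range(50):  # far more sweeps than this small grid needs
    V = {s: max((reward(s, a) + GAMMA * V[delta(s, a)]
                 for a in legal_actions(s)), default=0.0)
         for s in states}

# Q(s, a) = r(s, a) + gamma * V*(delta(s, a))
Q = {(s, a): reward(s, a) + GAMMA * V[delta(s, a)]
     for s in states for a in legal_actions(s)}

for r in range(ROWS):
    print([round(V[(r, c)], 1) for c in range(COLS)])
```

Under these assumptions the cells adjacent to G get value 100, cells two moves away get 90, and the far corner gets 81.
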
What to Learn
We might try to have agent learn the evaluation function $V^{\pi^*}$ (which we write as $V^*$)
It could then do a lookahead search to choose best action from any state $s$ because
$\pi^*(s) = \arg\max_{a} [r(s, a) + \gamma V^*(\delta(s, a))]$
A problem:
• This works well if agent knows $\delta : S \times A \to S$, and $r : S \times A \to \Re$
• But when it doesn't, it can't choose actions this way

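The one-step lookahead can be sketched as below. Note that the function has to read both the transition table and the reward table, which is exactly the knowledge the agent may lack. The chain world, the tables `DELTA`, `REWARD`, `V_STAR`, and the function name are hypothetical.

```python
GAMMA = 0.9

# Hypothetical known model of a deterministic chain: s0 <-> s1 -> goal.
DELTA = {("s0", "right"): "s1", ("s0", "stay"): "s0",
         ("s1", "right"): "goal", ("s1", "left"): "s0"}
REWARD = {("s0", "right"): 0, ("s0", "stay"): 0,
          ("s1", "right"): 100, ("s1", "left"): 0}
# V* for this chain with gamma = 0.9 (assumed to have been learned already).
V_STAR = {"s0": 90.0, "s1": 100.0, "goal": 0.0}


def lookahead_action(state):
    """Pick argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ] using the model."""
    actions = [a for (s, a) in DELTA if s == state]
    return max(actions,
               key=lambda a: REWARD[(state, a)] + GAMMA * V_STAR[DELTA[(state, a)]])


print(lookahead_action("s0"), lookahead_action("s1"))
```
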
Q Function
Define new function very similar to $V^*$
$Q(s, a) \equiv r(s, a) + \gamma V^*(\delta(s, a))$
If agent learns $Q$, it can choose optimal action even without knowing $\delta$!
$\pi^*(s) = \arg\max_{a} [r(s, a) + \gamma V^*(\delta(s, a))]$
$\pi^*(s) = \arg\max_{a} Q(s, a)$
$Q$ is the evaluation function the agent will learn

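A minimal sketch of action selection from a $Q$ table alone: the lookup never touches $\delta$ or $r$. The table `Q_STAR`, its state and action names, and the helper `greedy_action` are illustrative assumptions.

```python
# Hypothetical Q values for a small chain world (gamma = 0.9).
Q_STAR = {("s0", "right"): 90.0, ("s0", "stay"): 81.0,
          ("s1", "right"): 100.0, ("s1", "left"): 81.0}


def greedy_action(q_table, state):
    """Return argmax_a Q(state, a) using only the table entries for `state`."""
    candidates = {a: q for (s, a), q in q_table.items() if s == state}
    return max(candidates, key=candidates.get)


policy = {s: greedy_action(Q_STAR, s) for s in ("s0", "s1")}
print(policy)
```
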
Training Rule to Learn Q
Note $Q$ and $V^*$ closely related:
$V^*(s) = \max_{a'} Q(s, a')$
Which allows us to write $Q$ recursively as
$Q(s_t, a_t) = r(s_t, a_t) + \gamma V^*(\delta(s_t, a_t))$
$= r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a')$
Nice! Let $\hat{Q}$ denote learner's current approximation to $Q$. Consider training rule
$\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$
where $s'$ is the state resulting from applying action $a$ in state $s$

Q Learning for Deterministic Worlds
For each $s, a$ initialize table entry $\hat{Q}(s, a) \leftarrow 0$
Observe current state $s$
Do forever:
• Select an action $a$ and execute it
• Receive immediate reward $r$
• Observe the new state $s'$
• Update the table entry for $\hat{Q}(s, a)$ as follows:
  $\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$
• $s \leftarrow s'$

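The loop above can be sketched in Python as follows. Several details are assumptions beyond the slide: the 2-by-3 grid with an absorbing goal in the top-right cell, uniformly random action selection, restarting from a random non-goal state whenever the goal is reached (instead of a literal "do forever"), and the fixed seed and episode count.

```python
import random

GAMMA = 0.9
ROWS, COLS = 2, 3
GOAL = (0, 2)  # assumed absorbing goal cell
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def legal_actions(state):
    """Moves that stay on the grid; the goal state has none."""
    if state == GOAL:
        return []
    row, col = state
    return [a for a, (dr, dc) in MOVES.items()
            if 0 <= row + dr < ROWS and 0 <= col + dc < COLS]


def step(state, action):
    """Environment: returns (reward, next_state). The learner only sees the result."""
    dr, dc = MOVES[action]
    nxt = (state[0] + dr, state[1] + dc)
    return (100 if nxt == GOAL else 0), nxt


def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    states = [(r, c) for r in range(ROWS) for c in range(COLS)]
    # Initialize every table entry Q_hat(s, a) to zero.
    q = {(s, a): 0.0 for s in states for a in legal_actions(s)}
    starts = [s for s in states if s != GOAL]
    for _ in range(episodes):
        s = rng.choice(starts)
        while s != GOAL:
            a = rng.choice(legal_actions(s))       # select and execute an action
            r, s_next = step(s, a)                  # receive reward, observe s'
            best_next = max((q[(s_next, a2)] for a2 in legal_actions(s_next)),
                            default=0.0)           # max over a' of Q_hat(s', a')
            q[(s, a)] = r + GAMMA * best_next       # deterministic update rule
            s = s_next
    return q


q_hat = q_learning()
print(q_hat[((0, 0), "right")], q_hat[((1, 0), "up")])
```

Because the world is deterministic, the update needs no learning rate: each entry is simply overwritten with the newest one-step estimate.
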
Updating $\hat{Q}$
[Figure: robot R in initial state $s_1$ takes action $a_{right}$ and arrives in next state $s_2$. Before the update the entry for $(s_1, a_{right})$ is 72 and the entries for actions out of $s_2$ are 63, 81, and 100; after the update the entry for $(s_1, a_{right})$ is 90.]
$\hat{Q}(s_1, a_{right}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a')$
$\leftarrow 0 + 0.9 \max\{63, 81, 100\}$
$\leftarrow 90$
notice if rewards non-negative, then
$(\forall s, a, n) \;\; \hat{Q}_{n+1}(s, a) \ge \hat{Q}_n(s, a)$
and
$(\forall s, a, n) \;\; 0 \le \hat{Q}_n(s, a) \le Q(s, a)$

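The single update worked above can be checked with a few lines of code. The action names `a1`, `a2`, `a3` for the moves out of $s_2$ and the helper `q_update` are hypothetical; the numbers are those of the example.

```python
GAMMA = 0.9

# Table entries before the update (values from the worked example).
q_table = {("s1", "right"): 72.0,
           ("s2", "a1"): 63.0, ("s2", "a2"): 81.0, ("s2", "a3"): 100.0}


def q_update(table, s, a, r, s_next, gamma):
    """Apply Q_hat(s, a) <- r + gamma * max_a' Q_hat(s_next, a') in place."""
    best_next = max(v for (state, _), v in table.items() if state == s_next)
    table[(s, a)] = r + gamma * best_next
    return table[(s, a)]


new_value = q_update(q_table, "s1", "right", r=0, s_next="s2", gamma=GAMMA)
print(new_value)  # the entry rises from 72 to about 90
```

The entry increases, which is consistent with the monotonicity noted for non-negative rewards.
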
$\hat{Q}$ converges to $Q$
Consider case of deterministic world where each $\langle s, a \rangle$ is visited infinitely often.
Proof: Define a full interval to be an interval during which each $\langle s, a \rangle$ is visited. During each full interval the largest error in $\hat{Q}$ table is reduced by factor of $\gamma$
Let $\hat{Q}_n$ be table after $n$ updates, and $\Delta_n$ be the maximum error in $\hat{Q}_n$; that is
$\Delta_n = \max_{s, a} |\hat{Q}_n(s, a) - Q(s, a)|$
For any table entry $\hat{Q}_n(s, a)$ updated on iteration $n + 1$, the error in the revised estimate $\hat{Q}_{n+1}(s, a)$ is
$|\hat{Q}_{n+1}(s, a) - Q(s, a)| = |(r + \gamma \max_{a'} \hat{Q}_n(s', a')) - (r + \gamma \max_{a'} Q(s', a'))|$
$= \gamma \, |\max_{a'} \hat{Q}_n(s', a') - \max_{a'} Q(s', a')|$
$\le \gamma \max_{a'} |\hat{Q}_n(s', a') - Q(s', a')|$
$\le \gamma \max_{s'', a'} |\hat{Q}_n(s'', a') - Q(s'', a')|$
$|\hat{Q}_{n+1}(s, a) - Q(s, a)| \le \gamma \Delta_n$

Note we used general fact that
$|\max_{a} f_1(a) - \max_{a} f_2(a)| \le \max_{a} |f_1(a) - f_2(a)|$

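The general fact above can be spot-checked numerically. This is only a sanity check on random inputs under assumed ranges and sizes, not a proof; the function name, seed, and sample sizes are illustrative.

```python
import random


def max_gap_bound_holds(f1, f2, tol=1e-12):
    """Check |max f1 - max f2| <= max |f1 - f2| for two equal-length value lists."""
    lhs = abs(max(f1) - max(f2))
    rhs = max(abs(x - y) for x, y in zip(f1, f2))
    return lhs <= rhs + tol


rng = random.Random(0)
# Each trial draws two functions over the same five "actions".
trials = [([rng.uniform(-100, 100) for _ in range(5)],
           [rng.uniform(-100, 100) for _ in range(5)])
          for _ in range(1000)]
all_hold = all(max_gap_bound_holds(f1, f2) for f1, f2 in trials)
print(all_hold)
```

The intuition is that if $f_1$ attains its maximum at $a_1$ and $\max f_1 \ge \max f_2$, then $\max f_1 - \max f_2 \le f_1(a_1) - f_2(a_1)$, which is bounded by the largest pointwise gap.
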