RL LECTURE 3

SIMPLE LEARNING TAXONOMY

• Supervised Learning
  – A "teacher" provides the required response to inputs. Desired behaviour known. "Costly".
• Unsupervised Learning
  – The learner looks for patterns in inputs. No "right" answer.
• Reinforcement Learning
  – The learner is not told which actions to take, but gets reward/punishment from the environment and adjusts/learns which action to pick next time.

LEARNING FROM INTERACTION

• With the environment
• To achieve some goal
• Baby playing. No teacher. Sensorimotor connection to the environment.
  – Cause – effect
  – Action – consequences
  – How to achieve goals
• Learning to drive a car, hold a conversation, etc.
  – The environment's response affects our subsequent actions
  – We find out the effects of our actions later

REINFORCEMENT LEARNING

Learning a mapping from situations to actions in order to maximise a scalar reward/reinforcement signal.

BUT HOW? Which actions are best?

• Try out actions to learn which produces the highest reward – trial-and-error search
• Actions affect the immediate reward, the next situation, and all subsequent rewards – delayed effects, delayed reward
• Situations, Actions, Goals
• Sense situations, choose actions, TO achieve goals
• Environment uncertain

EXPLORATION/EXPLOITATION TRADE-OFF

• High rewards come from trying previously-well-rewarded actions – EXPLOITATION
• BUT we must also try actions not tried before – EXPLORATION
• MUST DO BOTH
• Especially if the task is stochastic, try each action many times per situation to get a reliable estimate of its reward (see the sketch below)
• Gradually prefer those actions that prove to lead to high reward
• (This trade-off doesn't arise in supervised learning)
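A minimal sketch in Python of the point about stochastic rewards: the task, its reward means and noise are made up purely for illustration, and the learner here only explores. A single trial of an action tells us little, but an incremental sample average over many trials gives a reliable estimate.

```python
import random

# Hypothetical 3-action stochastic task: rewards are noisy, so each action
# must be tried many times before its estimated reward can be trusted.
true_means = [0.2, 0.5, 0.8]            # unknown to the learner

def pull(action):
    return true_means[action] + random.gauss(0, 1.0)   # noisy reward

estimates = [0.0, 0.0, 0.0]             # running reward estimate per action
counts = [0, 0, 0]

for _ in range(3000):
    a = random.randrange(3)             # pure exploration, for illustration only
    r = pull(a)
    counts[a] += 1
    estimates[a] += (r - estimates[a]) / counts[a]      # incremental sample average

print(estimates)                        # approaches true_means given enough trials
```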
EXAMPLES

• Animal learning to find food and avoid predators
• Robot trying to learn how to dock with its charging station
• Backgammon player learning to beat an opponent
• Football team trying to find strategies to score goals
• Infant learning to feed itself with a spoon
• Cornet player learning to produce beautiful sounds
• Temperature controller keeping FH warm while minimising fuel consumption

FRAMEWORK

[Figure: the agent–environment loop. The AGENT, in state/situation s_t and receiving reward r_t, sends action a_t to the ENVIRONMENT, which returns reward r_{t+1} and new situation s_{t+1}.]

The agent in situation s_t chooses action a_t. One tick later it is in situation s_{t+1} and gets reward r_{t+1}.

POLICY

π(s, a) = Pr{ a_t = a | s_t = s }

Given that the situation at time t is s, the policy gives the probability that the agent's action will be a.

Reinforcement learning = get/find/learn the policy.

EXAMPLE POLICIES

Find the coffee machine.

[Figure: floor plan with rooms labelled 1–4; the agent starts in room 1.]

If in state 1, the policy π(a) assigns a probability to each available action: turn left, straight on, turn right, go through the door, etc.

JARGON

• Policy – decision on what action to take in each state, π(s, a)
• Reward function – defines the goal, and good and bad experience for the learner
• Value function – predicts reward; an estimate of total future reward
• Model of the environment – maps states and actions onto states: if in state s_t we take action a_t, it predicts s_{t+1} (and sometimes the reward r_{t+1}). Not all agents use models.

The reward function and the environmental model are fixed, external to the agent.
The policy, the value function, and the estimate of the model are adjusted during learning.

Bandit problem: 10 arms, a Q table gives the Q value for each arm.
ε-greedy policy: with probability 1 − ε pick the greedy arm a* = argmax_a Q(a); otherwise pick an arm at random (a code sketch is given below).
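A small sketch of ε-greedy action selection over the bandit's Q table. It assumes the common formulation in which exploration picks uniformly among all arms; the exact probabilities on the slide are not recoverable, so the function name and parameter values here are illustrative.

```python
import random

def epsilon_greedy(Q, epsilon=0.1):
    """Pick an arm given a Q table (one estimated value per arm).
    With probability epsilon explore uniformly at random;
    otherwise exploit the arm with the highest Q value (ties broken randomly)."""
    if random.random() < epsilon:
        return random.randrange(len(Q))                                   # explore
    best = max(Q)
    return random.choice([a for a, q in enumerate(Q) if q == best])      # exploit

# 10-armed bandit, all estimates initially zero
Q = [0.0] * 10
action = epsilon_greedy(Q, epsilon=0.1)
```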
VALUE FUNCTIONS

• How desirable is it to be in a certain state? What is its value?
• The value of a state is (an estimate of) the expected future reward from that state
• Value vs. reward: long-term vs. immediate
• We want actions that lead to states of high value, not necessarily high immediate reward
• Learn the policy via learning values – when we know the values of states we can choose to go to states of high value
• cf. GA/GP, which discover a policy directly
• Genotypical vs. phenotypical learning? (GA/GP vs. RL)

GENERAL RL ALGORITHM

1. Initialise the learner's internal state (e.g. Q values, other statistics)
2. Do for a long time:
   • Observe the current world state s
   • Choose action a using the policy
   • Execute action a
   • Let r be the immediate reward, s' the new world state
   • Update the internal state based on s, a, r, s' and the previous internal state
3. Output a policy based on, e.g., the learnt Q values and follow it

(A code sketch of this loop is given below.)

We need:
• A decision on what constitutes an internal state
• A decision on what constitutes a world state
• Sensing of a world state
• An action-choice mechanism (policy), based usually on an evaluation (of the current world and internal state) function
• A means of executing the action
• A way of updating the internal state

The environment (simulator?) provides:
• Transitions between world states, i.e. a model
• A reward function
But of course the learner has to discover what these are while exploring the world.

EXAMPLE - 0 AND X

See Sutton and Barto, Section 1.4 and Figure 1.1.
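A sketch of the general RL algorithm above. The `env` object (with `reset()`, `step(action)` returning a next state and reward, and a discrete action list `env.actions`) is hypothetical, and the ε-greedy choice and one-step Q-learning-style backup are just one possible instantiation of the "choose action" and "update internal state" steps, not the only one.

```python
import random

def run_rl(env, num_steps=10000, alpha=0.1, epsilon=0.1):
    """Sketch of the general RL loop: initialise internal state, interact
    for a long time, then output a policy from the learnt values."""
    Q = {}                                    # 1. initialise internal state (Q table)

    def q(s, a):
        return Q.get((s, a), 0.0)

    state = env.reset()
    for _ in range(num_steps):                # 2. do for a long time
        # choose an action with an epsilon-greedy policy over current Q values
        if random.random() < epsilon:
            action = random.choice(env.actions)
        else:
            action = max(env.actions, key=lambda a: q(state, a))
        next_state, reward = env.step(action)             # execute, observe r and s'
        # update internal state from (s, a, r, s'); a one-step Q-learning-style
        # backup stands in here for the generic "update" step
        best_next = max(q(next_state, a) for a in env.actions)
        Q[(state, action)] = q(state, action) + alpha * (reward + best_next - q(state, action))
        state = next_state

    # 3. output a greedy policy based on the learnt Q values
    return {s: max(env.actions, key=lambda a: q(s, a)) for (s, _) in Q}
```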
EXAMPLE

Construct a player to play against an imperfect opponent.

For each board state s, set up V(s) – an estimate of the probability of winning from that state:
• States with three X's in a row: V(s) = 1
• States with three O's in a row: V(s) = 0
• Rest: V(s) = 0.5 initially

Play many games.

Move selection:
• mostly pick the move leading to the state with the highest V
• sometimes explore

Value adjustment:
• back up the value of the state reached after each non-exploratory move to the state preceding the move
• e.g. V(s) ← V(s) + α [ V(s') − V(s) ]

Reduce α over time; V converges to the probabilities of winning – the optimal policy.
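The back-up rule above written directly as code; a minimal sketch in which the board representation and the surrounding game loop are left out, and `all_states`, `s`, `s_next` in the usage comment are placeholders.

```python
def backup(V, state, next_state, alpha=0.1):
    """Value adjustment from the slide: V(s) <- V(s) + alpha * (V(s') - V(s)),
    where V maps board states to estimated win probabilities."""
    V[state] += alpha * (V[next_state] - V[state])

# Hypothetical usage after a non-exploratory move took us from state s to s_next:
# V = {s: 0.5 for s in all_states}   # 1 for won states, 0 for lost states
# backup(V, s, s_next, alpha=0.1)
```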