
  1. RL LECTURE 3: LEARNING FROM INTERACTION
     - with environment
     - to achieve some goal
     * Baby playing. No teacher. Sensorimotor connection to environment.
       - Cause – effect
       - Action – consequences
       - How to achieve goals
     * Learning to drive a car, hold a conversation, etc.
       - The environment's response affects our subsequent actions
       - We find out the effects of our actions later

  2. SIMPLE LEARNING TAXONOMY
     * Supervised Learning
       - "Teacher" provides the required response to inputs. Desired behaviour known. "Costly"
     * Unsupervised Learning
       - Learner looks for patterns in the inputs. No "right" answer
     * Reinforcement Learning
       - Learner not told which actions to take, but gets reward/punishment from the environment and adjusts/learns the action to pick next time.

  3. REINFORCEMENT LEARNING
     * Learning a mapping from situations to actions in order to maximise a scalar reward/reinforcement signal
     * HOW? Try out actions to learn which produces the highest reward
       - trial-and-error search
     * Actions affect the immediate reward, the next situation, and all subsequent rewards
       - delayed effects, delayed reward
     * Situations, Actions, Goals
       - Sense situations, choose actions to achieve goals
       - Environment uncertain

  4. EXPLORATION/EXPLOITATION TRADE-OFF
     * High rewards from trying previously-well-rewarded actions – EXPLOITATION
     * BUT which actions are best? Must try ones not tried before – EXPLORATION
     * MUST DO BOTH
     * Especially if the task is stochastic, try each action many times per situation to get a reliable estimate of its reward.
     * Gradually prefer those actions that prove to lead to high reward.
     * (Doesn't arise in supervised learning)

  5. EXAMPLES
     * Animal learning to find food and avoid predators
     * Robot trying to learn how to dock with a charging station
     * Backgammon player learning to beat an opponent
     * Football team trying to find strategies to score goals
     * Infant learning to feed itself with a spoon
     * Cornet player learning to produce beautiful sounds
     * Temperature controller keeping FH warm while minimising fuel consumption

  6. FRAMEWORK

     [Figure: agent-environment loop in which the AGENT receives state/situation s_t and reward r_t and emits action a_t; the ENVIRONMENT returns r_{t+1} and s_{t+1}.]

     * The agent in situation s_t chooses action a_t
     * One tick later, in situation s_{t+1}, it gets reward r_{t+1}

     POLICY
       $\pi(s, a) = \Pr\{ a_t = a \mid s_t = s \}$
     Given that the situation at time $t$ is $s$, the policy gives the probability that the agent's action will be $a$.

     Reinforcement learning: get/find/learn the policy
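
A minimal sketch of this interaction loop in Python. The env object with reset()/step() methods and the dictionary-based stochastic policy are illustrative assumptions, not something specified on the slide:

    import random

    def sample_action(policy, state, actions):
        # Sample a with probability pi(state, a); policy is a dict keyed by (state, action).
        weights = [policy[(state, a)] for a in actions]
        return random.choices(actions, weights=weights, k=1)[0]

    def run_episode(env, policy, actions, steps=100):
        # Agent-environment loop: observe s_t, choose a_t from the policy,
        # then the environment returns s_{t+1} and r_{t+1}.
        state = env.reset()
        total_reward = 0.0
        for _ in range(steps):
            action = sample_action(policy, state, actions)
            state, reward = env.step(action)      # s_{t+1}, r_{t+1}
            total_reward += reward
        return total_reward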

  7. EXAMPLE POLICIES

     Find the coffee machine
     [Figure: grid of four rooms (1-4), start in room 1.]
     The policy specifies, for each room, which action to take and with what probability: turn left, straight on, turn right, go through the door, etc.

     Bandit problem
     10 arms; a Q table gives the Q value for each arm.
     $\epsilon$-greedy policy:
       $a_t = \arg\max_a Q_t(a)$   with probability $1 - \epsilon$
       $a_t =$ a randomly chosen arm   otherwise
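
A minimal sketch of the $\epsilon$-greedy bandit policy just described; the sample-average Q update and the specific constants are illustrative assumptions:

    import random

    N_ARMS = 10
    EPSILON = 0.1                 # assumed exploration rate
    Q = [0.0] * N_ARMS            # Q table: estimated value of each arm
    counts = [0] * N_ARMS         # pulls per arm, for the sample-average update

    def epsilon_greedy_action():
        # With probability 1 - epsilon exploit the best arm; otherwise explore.
        if random.random() < EPSILON:
            return random.randrange(N_ARMS)
        return max(range(N_ARMS), key=lambda a: Q[a])

    def update(arm, reward):
        # Incremental sample-average estimate of the arm's reward.
        counts[arm] += 1
        Q[arm] += (reward - Q[arm]) / counts[arm]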

  8. JARGON
     * Policy
       Decision on what action $a$ to do in state $s$: $\pi(s, a)$
     * Reward function
       Defines the goal, and good and bad experience for the learner
     * Value function
       Predicts reward. Estimate of total future reward
     * Model of the environment
       Maps states and actions onto states: if in state $s_t$ we take action $a_t$, it predicts $s_{t+1}$ (and sometimes the reward $r_{t+1}$). Not all agents use models.

     The reward function and the environmental model are fixed, external to the agent. The policy, value function, and estimate of the model are adjusted during learning.
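
Purely as an illustration of these four ingredients, a hypothetical set of containers tied to the coffee-machine example; every name and number here is an assumption:

    # Hypothetical containers for the four ingredients above.
    policy = {("room 1", "turn left"): 0.25}      # pi(s, a): probability of action a in state s
    value = {"room 1": 0.0}                       # V(s): estimate of total future reward from s
    model = {("room 1", "turn left"): "room 2"}   # (s_t, a_t) -> predicted s_{t+1}

    def reward_fn(state, action, next_state):
        # Reward function: fixed by the environment; it defines the goal.
        return 1.0 if next_state == "coffee machine" else 0.0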

  9. VALUE FUNCTIONS
     * How desirable is it to be in a certain state? What is its value?
       Value is (an estimate of) the expected future reward from that state
     * Value vs. reward
       Long-term vs. immediate. Want actions that lead to states of high value, not necessarily high immediate reward
     * Learn the policy via learning values: when we know the values of states we can choose to go to states of high value
       cf. GA/GP, which discover a policy directly
     * Genotypical vs. phenotypical learning? (GA/GP vs. RL)
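
As a small sketch of "choose to go to states of high value", assuming model and value lookups of the kind described on the jargon slide:

    def greedy_over_values(state, actions, model, V):
        # Pick the action whose predicted next state has the highest value V(s').
        return max(actions, key=lambda a: V[model[(state, a)]])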

  10. GENERAL RL ALGORITHM
      1. Initialise the learner's internal state (e.g. Q values, other statistics)
      2. Do for a long time:
         * Observe the current world state $s$
         * Choose action $a$ using the policy
         * Execute the action
         * Let $r$ be the immediate reward and $s'$ the new world state
         * Update the internal state based on $s$, $a$, $r$, $s'$ and the previous internal state
      3. Output a policy based on, e.g., the learnt Q values, and follow it

      We need:
      * A decision on what constitutes an internal state
      * A decision on what constitutes a world state
      * Sensing of a world state
      * An action-choice mechanism (policy), usually based on an evaluation function (of the current world and internal state)
      * A means of executing the action
      * A way of updating the internal state
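
A minimal sketch of this general loop. The slide leaves the update rule open; tabular Q-learning with an $\epsilon$-greedy policy is one common concrete choice, and env is an assumed simulator with reset()/step() methods:

    import random
    from collections import defaultdict

    ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1      # assumed step size, discount, exploration rate
    Q = defaultdict(float)                     # internal state: (world state, action) -> value

    def choose_action(state, actions):
        # Action-choice mechanism (policy) based on the evaluation Q.
        if random.random() < EPSILON:
            return random.choice(actions)                      # explore
        return max(actions, key=lambda a: Q[(state, a)])       # exploit

    def learn(env, actions, steps=10000):
        state = env.reset()                                    # sense a world state
        for _ in range(steps):
            action = choose_action(state, actions)
            next_state, reward = env.step(action)              # execute; observe r and s'
            best_next = max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
            state = next_state
        # Output a policy based on the learnt Q values.
        states = {s for s, _ in Q}
        return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}

Here the Q table plays the role of the learner's internal state, and the greedy read-out at the end is the output policy of step 3.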

  11. The environment (simulator?) provides
      * Transitions between world states, i.e. a model
      * A reward function
      But of course the learner has to discover what these are while exploring the world.

  12. EXAMPLE - 0 AND X
      See Sutton and Barto, Section 1.4 and Figure 1.1.

  13. EXAMPLE
      Construct a player to play against an imperfect opponent.
      For each board state $s$, set up $V(s)$ – an estimate of the probability of winning from that state:
      * states with three X's in a row: $V = 1$
      * states with three O's in a row: $V = 0$
      * all other states: $V = 0.5$ initially
      Play many games.
      Move selection:
      * mostly pick the move leading to the state with the highest $V$
      * sometimes explore
      Value adjustment:
      * back up the value of the state reached after each non-exploratory move to the state preceding the move
      * e.g. $V(s) \leftarrow V(s) + \alpha \, [\, V(s') - V(s) \,]$
      Reduce $\alpha$ over time; the values converge to the probabilities of winning – the optimal policy.
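
A minimal sketch of this value-table scheme; the board encoding, exploration rate, and helper names are assumptions:

    import random

    ALPHA = 0.1          # step size; the slide reduces this over time
    EPSILON = 0.1        # assumed exploration rate
    V = {}               # board state (e.g. a string encoding) -> estimated P(win)

    def value(state, is_win=False, is_loss=False):
        # Look up V(state), initialising to 1 for wins, 0 for losses, 0.5 otherwise.
        if state not in V:
            V[state] = 1.0 if is_win else (0.0 if is_loss else 0.5)
        return V[state]

    def select_move(state, candidate_states):
        # Mostly greedy on V; sometimes explore.
        if random.random() < EPSILON:
            return random.choice(candidate_states)     # exploratory move (not backed up)
        return max(candidate_states, key=value)

    def back_up(prev_state, next_state):
        # V(s) <- V(s) + alpha * (V(s') - V(s)), applied after non-exploratory moves.
        V[prev_state] = value(prev_state) + ALPHA * (value(next_state) - value(prev_state))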
