Intro to AI: Lecture 12
Machine Learning: Reinforcement Learning
Volker Sorge
Basics
◮ Reinforcement learning is the area concerned with how an agent ought to take actions in an environment so as to maximize some notion of reward.
◮ “A way of programming agents by reward and punishment without needing to specify how the task is to be achieved.”
◮ Specify what to do, but not how to do it.
◮ Only formulate the reward function.
◮ Learning “fills in the details”.
◮ Computes better final solutions for a task.
◮ Based on actual experiences, not on programmer assumptions.
◮ Less (human) time needed to find a good solution.
Main Notions: Policies
◮ Policy: The function that allows us to compute the next action for a particular state.
◮ An optimal policy is a policy that maximizes the expected reward/reinforcement/feedback of a state.
◮ Thus, the task of RL is to use observed rewards to find an optimal policy for the environment.
Main Notions: Modes of Learning
◮ Passive learning: The agent's policy is fixed and our task is to learn how good the policy is.
◮ Active learning: The agent must also learn what actions to take.
◮ Off-policy learning: Learn the value of the optimal policy independently of the agent's actions.
◮ On-policy learning: Learn the value of the policy the agent actually follows.
Main Notions: Exploration vs. Exploitation
◮ Exploitation: Use the knowledge already learned about what the best next action is in the current state.
◮ Exploration: In order to improve its policy the agent must explore a number of states, i.e., select an action different from the one that it currently thinks is best. (A minimal code sketch of this trade-off follows below.)
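To make the exploitation/exploration trade-off concrete, here is a minimal Python sketch of the common ε-greedy selection rule; the rule itself and the names Q, state, actions, epsilon are illustrative additions, not taken from the slides.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore a random action,
    otherwise exploit the action currently believed to be best."""
    if random.random() < epsilon:
        return random.choice(actions)                    # explore
    return max(actions, key=lambda a: Q[(state, a)])     # exploit
```

Here Q is assumed to be a table (e.g. a dictionary) mapping state-action pairs to their learned values.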
Difficulties of Reinforcement Learning
◮ Blame attribution problem: The problem of determining which action was responsible for a reward or punishment.
◮ The responsible action may have occurred a long time before the reward was received.
◮ A combination of actions might have led to a reward.
◮ Recognising delayed rewards: What seem to be poor actions now might lead to much greater rewards in the future than what appear to be good actions.
◮ Future rewards need to be recognised and back-propagated.
◮ Problem complexity increases if the world is dynamic.
◮ Explore-exploit dilemma: If the agent has worked out a good course of actions, should it continue to follow these actions or should it explore to find better actions?
◮ An agent that never explores cannot improve its policy.
◮ An agent that only explores never uses what it has learned.
Some Algorithms
◮ Temporal Difference learning
◮ Q-learning
◮ SARSA
◮ Monte Carlo methods
◮ Evolutionary algorithms
Q-Learning Basics
◮ Off-policy learning technique.
◮ The environment is typically formulated as a Markov Decision Process (MDP). (See reading assignment on Markov Processes.)
◮ Finite sets of states S = {s_0, ..., s_n} and actions A = {a_0, ..., a_m}.
◮ Probabilities P_a(s, s') for transitions from state s to s' with action a.
◮ A reward function R giving the feedback received for taking an action in a state.
◮ The goal is to learn the optimal action-value function Q*(s, a), from which an optimal policy can be derived.
MDP Example
◮ Here is a simple example of an MDP with three states S_0, S_1, S_2 and two actions a_0, a_1.
[Figure: transition diagram of the three-state, two-action MDP. Source: http://wikipedia.org/]
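Since the transition diagram itself is not reproduced here, the following is only an illustrative sketch of how such a three-state, two-action MDP could be encoded in Python; all probabilities and rewards are made-up placeholders, not the values from the original figure.

```python
# Hypothetical MDP: P[state][action] is a list of
# (probability, next_state, reward) triples; numbers are placeholders.
P = {
    "S0": {"a0": [(0.5, "S0", 0.0), (0.5, "S2", 0.0)],
           "a1": [(1.0, "S2", 0.0)]},
    "S1": {"a0": [(0.7, "S0", 5.0), (0.3, "S1", 0.0)],
           "a1": [(0.9, "S1", 0.0), (0.1, "S2", 0.0)]},
    "S2": {"a0": [(0.4, "S0", 0.0), (0.6, "S2", 0.0)],
           "a1": [(0.3, "S0", -1.0), (0.7, "S1", 0.0)]},
}
```

For each state-action pair the listed probabilities sum to 1, as required for the transition model of an MDP.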
Q-Learning
◮ Learn the quality of state-action combinations: Q : S × A → ℝ
◮ We learn Q over a (possibly infinite) sequence of discrete time steps ⟨s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, s_3, a_3, r_4, s_4, ...⟩, where the s_i are states, the a_i actions, and the r_i rewards.
◮ We learn from the quality of one experience at a time, i.e., from a tuple ⟨s, a, r, s'⟩.
Q-Learning: Algorithm
◮ Maintain a table for Q with an entry for each valid state-action pair (s, a).
◮ Initialise the table with some uniform value.
◮ Update the values over time steps t ≥ 0 according to the following formula:

  Q(s_t, a_t) = Q(s_t, a_t) + α × [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

  where α is the learning rate and γ is the discount factor.
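The update rule above translates almost directly into code. The following is a minimal tabular Q-learning sketch, assuming a hypothetical environment object with reset() and step(action) methods; these names and the ε-greedy exploration are illustrative assumptions, not part of the slides.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                     # uniform initial value 0.0 for every (s, a)
    for _ in range(episodes):
        s = env.reset()                        # assumed: returns the start state
        done = False
        while not done:
            # epsilon-greedy choice between exploration and exploitation
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)      # assumed: returns next state, reward, episode end
            # off-policy update: bootstrap on the best action available in s_next
            best_next = max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```

Because the update always maximises over the next actions, the learned values do not depend on which action the agent actually takes next; this is what makes Q-learning off-policy.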
Learning Rate
◮ Models how forgetful or stubborn an agent is.
◮ The learning rate α (small Greek letter alpha) determines to what extent the newly acquired information will override the old information.
◮ With α = 0 the agent does not learn anything.
◮ With α = 1 the agent considers only the most recent information.
Discounted Reward
◮ The basic idea is to weigh rewards differently over time: rewards received sooner count for more than rewards received further in the future.
◮ E.g., when a child learns to walk the rewards and punishments are pretty high, but the older we get the less we have to actually adjust our walking behaviour.
◮ One can model this by including a factor that decreases over time, i.e., it discounts an experience more and more.
◮ The discount is normally expressed by a multiplicative factor γ (small Greek letter gamma), with 0 ≤ γ < 1.
◮ γ = 0 will make the agent “opportunistic” by only considering current rewards.
◮ A γ value closer to 1 will make the agent strive for long-term reward. (A small numeric illustration follows below.)
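As a small numeric illustration of the discount factor (the reward sequence below is invented purely for the example): the discounted return of a sequence of rewards r_1, r_2, r_3, ... is r_1 + γ r_2 + γ² r_3 + ..., so the smaller γ is, the less future rewards contribute.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r over the reward sequence (k = 0, 1, 2, ...)."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1, 1, 1, 1]                        # illustrative reward sequence
print(discounted_return(rewards, 0.0))        # 1.0   -> only the immediate reward counts
print(discounted_return(rewards, 0.5))        # 1.875
print(discounted_return(rewards, 0.9))        # 3.439
```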
SARSA Learning
◮ On-policy learning technique.
◮ Very similar to Q-learning.
◮ SARSA stands for “State-Action-Reward-State-Action”.
◮ Learns the quality of the next move by actually carrying out the next move. Hence we no longer maximise over the possible next Q values, but use the Q value of the action actually taken.
SARSA: Algorithm
◮ As in Q-learning, we initialise and maintain the table for Q with an entry for each valid state-action pair (s, a).
◮ The update formula is similar to Q-learning, with the exception that we only take the action actually chosen in the next state into account:

  Q(s_t, a_t) = Q(s_t, a_t) + α × [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]

  where α is the learning rate and γ is the discount factor.
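For comparison with the Q-learning sketch above, here is a minimal SARSA loop under the same assumed environment interface (reset() and step(action) are illustrative names). The only substantive difference is that the update uses the Q value of the action actually chosen in the next state instead of the maximum over actions.

```python
import random
from collections import defaultdict

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                     # uniform initial value 0.0 for every (s, a)

    def choose(s):
        # epsilon-greedy policy that is both followed and evaluated (on-policy)
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a_: Q[(s, a_)])

    for _ in range(episodes):
        s = env.reset()                        # assumed: returns the start state
        a = choose(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)      # assumed: returns next state, reward, episode end
            a_next = choose(s_next)            # the action actually taken next
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```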