Spring 2018 CIS 693, EEC 693, EEC 793: Autonomous Intelligent Robotics
Instructor: Shiqi Zhang
http://eecs.csuohio.edu/~szhang/teaching/18spring/
Reinforcement Learning
Adapted from Peter Bodík
Previous Lectures
- Supervised learning
  - classification, regression
- Unsupervised learning
  - clustering, dimensionality reduction
- Reinforcement learning
  - generalization of supervised learning
  - learn from interaction with the environment to achieve a goal
[figure: agent-environment loop: the agent takes an action; the environment returns a reward and a new state]
Today
- examples
- defining a Markov Decision Process
- solving an MDP using Dynamic Programming
- Reinforcement Learning
  - Monte Carlo methods
  - Temporal-Difference learning
- automatic resource allocation for in-memory database
- miscellaneous
  - state representation
  - function approximation, rewards
Robot in a room
[figure: grid world with START, +1 at [4,3], -1 at [4,2]]
- actions: UP, DOWN, LEFT, RIGHT
- action UP: 80% move UP, 10% move LEFT, 10% move RIGHT
- reward +1 at [4,3], -1 at [4,2]
- reward -0.04 for each step
- what's the strategy to achieve max reward?
- what if the actions were deterministic?
Other examples
- pole-balancing
- walking robot (applet)
- TD-Gammon [Gerry Tesauro]
- helicopter [Andrew Ng]
- no teacher who would say "good" or "bad"
  - is reward "10" good or bad?
  - rewards could be delayed
- explore the environment and learn from the experience
  - not just blind search; try to be smart about it
Outline
- examples
- defining a Markov Decision Process
- solving an MDP using Dynamic Programming
- Reinforcement Learning
  - Monte Carlo methods
  - Temporal-Difference learning
- miscellaneous
  - state representation
  - function approximation, rewards
Robot in a room
[figure: grid world with START, +1 at [4,3], -1 at [4,2]]
- actions: UP, DOWN, LEFT, RIGHT
- action UP: 80% move UP, 10% move LEFT, 10% move RIGHT
- reward +1 at [4,3], -1 at [4,2]; reward -0.04 for each step
- states
- actions
- rewards
- what is the solution?
Is this a solution?
[figure: grid world with a candidate path from START to +1, avoiding -1]
- only if the actions were deterministic
- not in this case (actions are stochastic)
- solution/policy
  - mapping from each state to an action
Optimal policy [figure: the optimal action shown in each grid cell]
Reward for each step: -2 [figure: resulting optimal policy]
Reward for each step: -0.1 [figure: resulting optimal policy]
Reward for each step: -0.04 [figure: resulting optimal policy]
Reward for each step: -0.01 [figure: resulting optimal policy]
Reward for each step: +0.01 [figure: resulting optimal policy]
Markov Decision Process (MDP)
- set of states S, set of actions A, initial state s0
- transition model P(s'|s,a)
  - P( [1,2] | [1,1], up ) = 0.8
  - Markov assumption
- reward function r(s)
  - r( [4,3] ) = +1
- goal: maximize cumulative reward in the long run
- policy: mapping from S to A
  - π(s) or π(s,a)
- reinforcement learning
  - transitions and rewards usually not available
  - how to change the policy based on experience
  - how to explore the environment
[figure: agent-environment loop: the agent takes an action; the environment returns a reward and a new state]
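A minimal Python sketch of how the robot-in-a-room MDP above could be encoded. The data layout, helper names, the blocked cell at (2,2), and the coordinate convention are illustrative assumptions, not taken from the slides.

```python
# Sketch: encoding the robot-in-a-room MDP as plain Python data.
# Coordinates are (column, row); the wall cell is an assumption.
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}    # reward +1 at [4,3], -1 at [4,2]
WALLS = {(2, 2)}                             # assumed blocked cell
STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) not in WALLS]
ACTIONS = ['up', 'down', 'left', 'right']
STEP_REWARD = -0.04

MOVES = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}
SLIPS = {'up': ('left', 'right'), 'down': ('left', 'right'),
         'left': ('up', 'down'), 'right': ('up', 'down')}

def move(s, direction):
    """Deterministic move; bumping into the edge or a wall leaves s unchanged."""
    nxt = (s[0] + MOVES[direction][0], s[1] + MOVES[direction][1])
    return nxt if nxt in STATES else s

def transition(s, a):
    """Transition model P(s'|s,a): 80% intended move, 10% each perpendicular slip."""
    dist = {}
    for d, p in [(a, 0.8), (SLIPS[a][0], 0.1), (SLIPS[a][1], 0.1)]:
        s2 = move(s, d)
        dist[s2] = dist.get(s2, 0.0) + p
    return dist

def reward(s):
    """Reward function r(s): +1/-1 at the terminals, -0.04 per step elsewhere."""
    return TERMINALS.get(s, STEP_REWARD)

# e.g. transition((1, 1), 'up') == {(1, 2): 0.8, (1, 1): 0.1, (2, 1): 0.1}
```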
Computing return from rewards
- episodic (vs. continuing) tasks
  - "game over" after N steps
  - optimal policy depends on N; harder to analyze
- additive rewards
  - V(s0, s1, ...) = r(s0) + r(s1) + r(s2) + ...
  - infinite value for continuing tasks
- discounted rewards
  - V(s0, s1, ...) = r(s0) + γ·r(s1) + γ²·r(s2) + ...
  - value bounded if rewards are bounded
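As a small illustration of the discounted-return formula, here is a hedged Python sketch; the function name and the γ = 0.9 default are assumptions.

```python
def discounted_return(rewards, gamma=0.9):
    """V(s0, s1, ...) = r(s0) + gamma*r(s1) + gamma^2*r(s2) + ...
    Bounded whenever the rewards are bounded and 0 <= gamma < 1."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total

# e.g. three -0.04 steps followed by the +1 terminal reward:
# discounted_return([-0.04, -0.04, -0.04, 1.0], gamma=0.9) ≈ 0.62
```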
Value functions
- state value function: Vπ(s)
  - expected return when starting in s and following π
- state-action value function: Qπ(s,a)
  - expected return when starting in s, performing a, and following π
  - useful for finding the optimal policy
  - can be estimated from experience
  - pick the best action using Qπ(s,a)
- Bellman equation: expresses Vπ(s) recursively in terms of the successor values Vπ(s')
[figure: backup diagram s → a → r → s']
Optimal value functions
- there is a set of optimal policies
  - Vπ defines a partial ordering on policies
  - the optimal policies share the same optimal value function
- Bellman optimality equation
  - a system of n non-linear equations
  - solve for V*(s)
  - easy to extract the optimal policy
  - having Q*(s,a) makes it even simpler
[figure: backup diagram s → a → r → s']
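A short sketch of the last two bullets: extracting the optimal policy from V* needs a one-step lookahead through the model, while from Q* a plain argmax suffices. The dict-based layout and function names are assumptions.

```python
def greedy_from_v(V, states, actions, transition):
    """One-step lookahead through the model P(s'|s,a); since r(s) does not depend
    on the action here, maximizing the expected next-state value is enough."""
    return {s: max(actions,
                   key=lambda a: sum(p * V[s2]
                                     for s2, p in transition(s, a).items()))
            for s in states}

def greedy_from_q(Q, states, actions):
    """With Q*(s,a) the optimal action is a plain argmax; no model required."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```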
Outline
- examples
- defining a Markov Decision Process
- solving an MDP using Dynamic Programming
- Reinforcement Learning
  - Monte Carlo methods
  - Temporal-Difference learning
- miscellaneous
  - state representation
  - function approximation, rewards
Dynamic programming
- main idea
  - use value functions to structure the search for good policies
  - needs a perfect model of the environment
- two main components
  - policy evaluation: compute Vπ from π
  - policy improvement: improve π based on Vπ
- start with an arbitrary policy; repeat evaluation/improvement until convergence
Policy evaluation/improvement
- policy evaluation: π → Vπ
  - the Bellman equations define a system of n equations
  - could solve it directly, but we use the iterative version
  - start with an arbitrary value function V0 and iterate until Vk converges
- policy improvement: Vπ → π'
  - π' is either strictly better than π, or π' is optimal (if π = π')
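A hedged sketch of the iterative version under the r(s) reward convention used in these slides; transition(s, a) → {s': prob} and reward(s) are assumed helpers, and terminal-state handling is omitted for brevity.

```python
def policy_evaluation(policy, states, transition, reward, gamma=0.9, theta=1e-6):
    """Iterate V_{k+1}(s) = r(s) + gamma * sum_{s'} P(s'|s,pi(s)) * V_k(s')
    from an arbitrary V_0 until the largest change falls below theta."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = reward(s) + gamma * sum(
                p * V[s2] for s2, p in transition(s, policy[s]).items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```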
Policy/Value iteration
- Policy iteration
  - two nested iterations; too slow
  - don't need to fully converge to Vπk
  - just move towards it
- Value iteration
  - use the Bellman optimality equation as an update
  - converges to V*
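A companion sketch of value iteration, using the Bellman optimality equation as the update; it relies on the same assumed transition/reward helpers as the policy-evaluation sketch above.

```python
def value_iteration(states, actions, transition, reward, gamma=0.9, theta=1e-6):
    """Iterate V_{k+1}(s) = r(s) + gamma * max_a sum_{s'} P(s'|s,a) * V_k(s');
    converges to V*, from which the greedy policy can be read off."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = reward(s) + gamma * max(
                sum(p * V[s2] for s2, p in transition(s, a).items())
                for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```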
Using DP
- needs a complete model of the environment and rewards
  - robot in a room: state space, action space, transition model
- can we use DP to solve
  - robot in a room?
  - backgammon?
  - helicopter?
- DP bootstraps
  - updates estimates on the basis of other estimates
Outline
- defining a Markov Decision Process
- solving an MDP using Dynamic Programming
- Reinforcement Learning
  - Monte Carlo methods
  - Temporal-Difference learning
- miscellaneous
  - state representation
  - function approximation, rewards
Monte Carlo methods
- don't need full knowledge of the environment
  - just experience, or
  - simulated experience
- averaging sample returns
  - defined only for episodic tasks
- but similar to DP
  - policy evaluation, policy improvement
Monte Carlo policy evaluation
- want to estimate Vπ(s) = expected return starting from s and following π
  - estimate as the average of observed returns in state s
- first-visit MC
  - average the returns following the first visit to state s
[figure: four sample episodes through state s with returns R1(s) = +2, R2(s) = +1, R3(s) = -5, R4(s) = +4]
- Vπ(s) ≈ (2 + 1 − 5 + 4)/4 = 0.5
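A sketch of first-visit MC evaluation, under the assumption that an episode is a list of (state, reward) pairs produced by an assumed generate_episode(policy) helper; names are illustrative.

```python
from collections import defaultdict

def first_visit_mc(policy, generate_episode, n_episodes=1000, gamma=1.0):
    """Estimate V_pi(s) as the average return observed after the first visit to s."""
    returns = defaultdict(list)
    for _ in range(n_episodes):
        episode = generate_episode(policy)          # [(s0, r0), (s1, r1), ...]
        # return from each time step, accumulated backwards
        G, rets = 0.0, [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            G = episode[t][1] + gamma * G
            rets[t] = G
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:                       # first visit only
                seen.add(s)
                returns[s].append(rets[t])
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```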
Monte Carlo control
- Vπ alone is not enough for policy improvement
  - would need an exact model of the environment
- estimate Qπ(s,a) instead
- MC control
  - update after each episode
  - non-stationary environment
- a problem
  - a greedy policy won't explore all actions
Maintaining exploration
- key ingredient of RL
- a deterministic/greedy policy won't explore all actions
  - we don't know anything about the environment at the beginning
  - need to try all actions to find the optimal one
- maintain exploration
  - use soft policies instead: π(s,a) > 0 for all s,a
- ε-greedy policy
  - with probability 1-ε perform the optimal/greedy action
  - with probability ε perform a random action
  - will keep exploring the environment
  - slowly move it towards the greedy policy: ε → 0
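A small sketch of ε-greedy action selection with a decaying ε; the decay schedule and the dict layout of Q are assumptions.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Soft policy: explore with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

# slowly move towards the greedy policy, e.g. epsilon_k = 1.0 / (1 + k)
```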
Simulated experience
- 5-card draw poker
  - s0: A♣, A♦, 6♠, A♥, 2♠
  - a0: discard 6♠, 2♠
  - s1: A♣, A♦, A♥, A♠, 9♠ + dealer takes 4 cards
  - return: +1 (probably)
- DP
  - list all states and actions, compute P(s,a,s')
  - P( [A♣,A♦,6♠,A♥,2♠], [6♠,2♠], [A♠,9♠,4] ) = 0.00192
- MC
  - all you need are sample episodes
  - let MC play against a random policy, or itself, or another algorithm
Summary of Monte Carlo
- doesn't need a model of the environment
  - averages sample returns
  - only for episodic tasks
- learns from:
  - sample episodes
  - simulated experience
- can concentrate on "important" states
  - doesn't need a full sweep
- no bootstrapping
  - less harmed by violation of the Markov property
- needs to maintain exploration
  - use soft policies
Outline
- defining a Markov Decision Process
- solving an MDP using Dynamic Programming
- Reinforcement Learning
  - Monte Carlo methods
  - Temporal-Difference learning
- miscellaneous
  - state representation
  - function approximation, rewards
Temporal-Difference Learning
- combines ideas from MC and DP
  - like MC: learn directly from experience (no model needed)
  - like DP: bootstrap
  - works for continuing tasks; usually faster than MC
- constant-α MC
  - has to wait until the end of the episode to update
- simplest TD
  - update after every step, based on the successor's value (the TD target)
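To make the contrast concrete, here is a hedged sketch of the two tabular update rules; the function names and step size are illustrative.

```python
def constant_alpha_mc_update(V, s, G, alpha=0.1):
    """Constant-alpha MC: update only at episode end, towards the full return G."""
    V[s] += alpha * (G - V[s])

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, done=False):
    """Simplest TD: update after every step, towards the target r + gamma * V(s')."""
    target = r + (0.0 if done else gamma * V[s_next])
    V[s] += alpha * (target - V[s])
```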
MC vs. TD
- observed the following 8 episodes:
  - A–0, B–0
  - B–1 (six times)
  - B–0
- MC and TD agree on V(B) = 3/4
- MC: V(A) = 0
  - converges to the values that minimize the error on the training data
- TD: V(A) = 3/4
  - converges to the estimate of the underlying Markov process
[figure: Markov process: A goes to B with r = 0 (100%); from B, r = 1 with probability 75% and r = 0 with probability 25%]
Sarsa
- again, we need Q(s,a), not just V(s)
[figure: trajectory s_t, a_t, r_t, s_t+1, a_t+1, r_t+1, s_t+2, a_t+2, ...]
- control
  - start with a random policy
  - update Q and π after each step
  - again, need ε-soft policies
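A sketch of the per-step Sarsa update; Q is assumed to be a dict keyed by (state, action), and a_next is the action the same ε-soft policy actually picks in s_next (names are illustrative).

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9, done=False):
    """On-policy: the target bootstraps from the action the policy will really take."""
    target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```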
Q-learning
- previous algorithms: on-policy
  - start with a random policy, iteratively improve
  - converge to the optimal policy
- Q-learning: off-policy
  - use any policy to estimate Q
  - Q directly approximates Q* (Bellman optimality equation)
  - independent of the policy being followed
  - only requirement: keep updating each (s,a) pair
[figure: Q-learning update rule, with the Sarsa update shown for comparison]
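For comparison with the Sarsa sketch above, a sketch of the Q-learning update; the max over next actions is what makes it off-policy (names are illustrative).

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9, done=False):
    """Off-policy: the target uses max_a' Q(s', a') regardless of which action the
    behaviour policy takes next; just keep visiting every (s, a) pair."""
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```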
Outline
- defining a Markov Decision Process
- solving an MDP using Dynamic Programming
- Reinforcement Learning
  - Monte Carlo methods
  - Temporal-Difference learning
- miscellaneous
  - state representation
  - function approximation, rewards