Spring 2018 CIS 693, EEC 693, EEC 793: Autonomous Intelligent Robotics
Instructor: Shiqi Zhang
http://eecs.csuohio.edu/~szhang/teaching/18spring/
Reinforcement Learning
Adapted from Peter Bodík
Previous Lectures
- Supervised learning
  - classification, regression
- Unsupervised learning
  - clustering, dimensionality reduction
- Reinforcement learning
  - generalization of supervised learning
  - learn from interaction with the environment to achieve a goal
[figure: agent-environment loop: the agent takes an action; the environment returns a reward and a new state]
Today
- examples
- defining a Markov Decision Process
- solving an MDP using Dynamic Programming
- Reinforcement Learning
  - Monte Carlo methods
  - Temporal-Difference learning
- automatic resource allocation for in-memory database
- miscellaneous
  - state representation
  - function approximation, rewards
Robot in a room
[figure: grid world with START, +1 at [4,3], -1 at [4,2]]
- actions: UP, DOWN, LEFT, RIGHT
- action UP: 80% move UP, 10% move LEFT, 10% move RIGHT
- reward +1 at [4,3], -1 at [4,2]
- reward -0.04 for each step
- what's the strategy to achieve max reward?
- what if the actions were deterministic?
Other examples
- pole-balancing
- walking robot (applet)
- TD-Gammon [Gerry Tesauro]
- helicopter [Andrew Ng]
- no teacher who would say "good" or "bad"
  - is reward "10" good or bad?
  - rewards could be delayed
- explore the environment and learn from the experience
  - not just blind search; try to be smart about it
Outline
- examples
- defining a Markov Decision Process
- solving an MDP using Dynamic Programming
- Reinforcement Learning
  - Monte Carlo methods
  - Temporal-Difference learning
- miscellaneous
  - state representation
  - function approximation, rewards
Robot in a room
[figure: grid world with START, +1 at [4,3], -1 at [4,2]]
- actions: UP, DOWN, LEFT, RIGHT
- action UP: 80% move UP, 10% move LEFT, 10% move RIGHT
- reward +1 at [4,3], -1 at [4,2]; reward -0.04 for each step
- states
- actions
- rewards
- what is the solution?
Is this a solution?
[figure: grid world with a candidate path from START to +1, avoiding -1]
- only if the actions were deterministic
- not in this case (actions are stochastic)
- solution/policy
  - mapping from each state to an action
Optimal policy [figure: the optimal action shown in each grid cell]
Reward for each step: -2 [figure: resulting optimal policy]
Reward for each step: -0.1 [figure: resulting optimal policy]
Reward for each step: -0.04 [figure: resulting optimal policy]
Reward for each step: -0.01 [figure: resulting optimal policy]
Reward for each step: +0.01 [figure: resulting optimal policy]
Markov Decision Process (MDP)
- set of states S, set of actions A, initial state s0
- transition model P(s'|s,a)
  - P( [1,2] | [1,1], up ) = 0.8
  - Markov assumption
- reward function r(s)
  - r( [4,3] ) = +1
- goal: maximize cumulative reward in the long run
- policy: mapping from S to A
  - π(s) or π(s,a)
- reinforcement learning
  - transitions and rewards usually not available
  - how to change the policy based on experience
  - how to explore the environment
[figure: agent-environment loop: the agent takes an action; the environment returns a reward and a new state]
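A minimal Python sketch of how the robot-in-a-room MDP above could be encoded. The data layout, helper names, the blocked cell at (2,2), and the coordinate convention are illustrative assumptions, not taken from the slides.

```python
# Sketch: encoding the robot-in-a-room MDP as plain Python data.
# Coordinates are (column, row); the wall cell is an assumption.
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}    # reward +1 at [4,3], -1 at [4,2]
WALLS = {(2, 2)}                             # assumed blocked cell
STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) not in WALLS]
ACTIONS = ['up', 'down', 'left', 'right']
STEP_REWARD = -0.04

MOVES = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}
SLIPS = {'up': ('left', 'right'), 'down': ('left', 'right'),
         'left': ('up', 'down'), 'right': ('up', 'down')}

def move(s, direction):
    """Deterministic move; bumping into the edge or a wall leaves s unchanged."""
    nxt = (s[0] + MOVES[direction][0], s[1] + MOVES[direction][1])
    return nxt if nxt in STATES else s

def transition(s, a):
    """Transition model P(s'|s,a): 80% intended move, 10% each perpendicular slip."""
    dist = {}
    for d, p in [(a, 0.8), (SLIPS[a][0], 0.1), (SLIPS[a][1], 0.1)]:
        s2 = move(s, d)
        dist[s2] = dist.get(s2, 0.0) + p
    return dist

def reward(s):
    """Reward function r(s): +1/-1 at the terminals, -0.04 per step elsewhere."""
    return TERMINALS.get(s, STEP_REWARD)

# e.g. transition((1, 1), 'up') == {(1, 2): 0.8, (1, 1): 0.1, (2, 1): 0.1}
```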
Computing return from rewards
- episodic (vs. continuing) tasks
  - "game over" after N steps
  - optimal policy depends on N; harder to analyze
- additive rewards
  - V(s0, s1, ...) = r(s0) + r(s1) + r(s2) + ...
  - infinite value for continuing tasks
- discounted rewards
  - V(s0, s1, ...) = r(s0) + γ·r(s1) + γ²·r(s2) + ...
  - value bounded if rewards are bounded
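As a small illustration of the discounted-return formula, here is a hedged Python sketch; the function name and the γ = 0.9 default are assumptions.

```python
def discounted_return(rewards, gamma=0.9):
    """V(s0, s1, ...) = r(s0) + gamma*r(s1) + gamma^2*r(s2) + ...
    Bounded whenever the rewards are bounded and 0 <= gamma < 1."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total

# e.g. three -0.04 steps followed by the +1 terminal reward:
# discounted_return([-0.04, -0.04, -0.04, 1.0], gamma=0.9) ≈ 0.62
```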
Value functions
- state value function: Vπ(s)
  - expected return when starting in s and following π
- state-action value function: Qπ(s,a)
  - expected return when starting in s, performing a, and following π
  - useful for finding the optimal policy
  - can be estimated from experience
  - pick the best action using Qπ(s,a)
- Bellman equation: expresses Vπ(s) recursively in terms of the successor values Vπ(s')
[figure: backup diagram s → a → r → s']
Optimal value functions
- there is a set of optimal policies
  - Vπ defines a partial ordering on policies
  - the optimal policies share the same optimal value function
- Bellman optimality equation
  - a system of n non-linear equations
  - solve for V*(s)
  - easy to extract the optimal policy
  - having Q*(s,a) makes it even simpler
[figure: backup diagram s → a → r → s']
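A short sketch of the last two bullets: extracting the optimal policy from V* needs a one-step lookahead through the model, while from Q* a plain argmax suffices. The dict-based layout and function names are assumptions.

```python
def greedy_from_v(V, states, actions, transition):
    """One-step lookahead through the model P(s'|s,a); since r(s) does not depend
    on the action here, maximizing the expected next-state value is enough."""
    return {s: max(actions,
                   key=lambda a: sum(p * V[s2]
                                     for s2, p in transition(s, a).items()))
            for s in states}

def greedy_from_q(Q, states, actions):
    """With Q*(s,a) the optimal action is a plain argmax; no model required."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```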
Outline
- examples
- defining a Markov Decision Process
- solving an MDP using Dynamic Programming
- Reinforcement Learning
  - Monte Carlo methods
  - Temporal-Difference learning
- miscellaneous
  - state representation
  - function approximation, rewards
Dynamic programming
- main idea
  - use value functions to structure the search for good policies
  - needs a perfect model of the environment
- two main components
  - policy evaluation: compute Vπ from π
  - policy improvement: improve π based on Vπ
- start with an arbitrary policy; repeat evaluation/improvement until convergence
Policy evaluation/improvement
- policy evaluation: π → Vπ
  - the Bellman equations define a system of n equations
  - could solve it directly, but we use the iterative version
  - start with an arbitrary value function V0 and iterate until Vk converges
- policy improvement: Vπ → π'
  - π' is either strictly better than π, or π' is optimal (if π = π')
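A hedged sketch of the iterative version under the r(s) reward convention used in these slides; transition(s, a) → {s': prob} and reward(s) are assumed helpers, and terminal-state handling is omitted for brevity.

```python
def policy_evaluation(policy, states, transition, reward, gamma=0.9, theta=1e-6):
    """Iterate V_{k+1}(s) = r(s) + gamma * sum_{s'} P(s'|s,pi(s)) * V_k(s')
    from an arbitrary V_0 until the largest change falls below theta."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = reward(s) + gamma * sum(
                p * V[s2] for s2, p in transition(s, policy[s]).items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```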
Policy/Value iteration
- Policy iteration
  - two nested iterations; too slow
  - don't need to fully converge to Vπk
  - just move towards it
- Value iteration
  - use the Bellman optimality equation as an update
  - converges to V*
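A companion sketch of value iteration, using the Bellman optimality equation as the update; it relies on the same assumed transition/reward helpers as the policy-evaluation sketch above.

```python
def value_iteration(states, actions, transition, reward, gamma=0.9, theta=1e-6):
    """Iterate V_{k+1}(s) = r(s) + gamma * max_a sum_{s'} P(s'|s,a) * V_k(s');
    converges to V*, from which the greedy policy can be read off."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = reward(s) + gamma * max(
                sum(p * V[s2] for s2, p in transition(s, a).items())
                for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```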
Using DP
- needs a complete model of the environment and rewards
  - robot in a room: state space, action space, transition model
- can we use DP to solve
  - robot in a room?
  - backgammon?
  - helicopter?
- DP bootstraps
  - updates estimates on the basis of other estimates
Outline
- defining a Markov Decision Process
- solving an MDP using Dynamic Programming
- Reinforcement Learning
  - Monte Carlo methods
  - Temporal-Difference learning
- miscellaneous
  - state representation
  - function approximation, rewards
Monte Carlo methods
- don't need full knowledge of the environment
  - just experience, or
  - simulated experience
- averaging sample returns
  - defined only for episodic tasks
- but similar to DP
  - policy evaluation, policy improvement
Monte Carlo policy evaluation
- want to estimate Vπ(s) = expected return starting from s and following π
  - estimate as the average of observed returns in state s
- first-visit MC
  - average the returns following the first visit to state s
[figure: four sample episodes through state s with returns R1(s) = +2, R2(s) = +1, R3(s) = -5, R4(s) = +4]
- Vπ(s) ≈ (2 + 1 − 5 + 4)/4 = 0.5
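A sketch of first-visit MC evaluation, under the assumption that an episode is a list of (state, reward) pairs produced by an assumed generate_episode(policy) helper; names are illustrative.

```python
from collections import defaultdict

def first_visit_mc(policy, generate_episode, n_episodes=1000, gamma=1.0):
    """Estimate V_pi(s) as the average return observed after the first visit to s."""
    returns = defaultdict(list)
    for _ in range(n_episodes):
        episode = generate_episode(policy)          # [(s0, r0), (s1, r1), ...]
        # return from each time step, accumulated backwards
        G, rets = 0.0, [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            G = episode[t][1] + gamma * G
            rets[t] = G
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:                       # first visit only
                seen.add(s)
                returns[s].append(rets[t])
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```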
Monte Carlo control
- Vπ alone is not enough for policy improvement
  - would need an exact model of the environment
- estimate Qπ(s,a) instead
- MC control
  - update after each episode
  - non-stationary environment
- a problem
  - a greedy policy won't explore all actions
Maintaining exploration
- key ingredient of RL
- a deterministic/greedy policy won't explore all actions
  - we don't know anything about the environment at the beginning
  - need to try all actions to find the optimal one
- maintain exploration
  - use soft policies instead: π(s,a) > 0 for all s,a
- ε-greedy policy
  - with probability 1-ε perform the optimal/greedy action
  - with probability ε perform a random action
  - will keep exploring the environment
  - slowly move it towards the greedy policy: ε → 0
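A small sketch of ε-greedy action selection with a decaying ε; the decay schedule and the dict layout of Q are assumptions.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Soft policy: explore with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

# slowly move towards the greedy policy, e.g. epsilon_k = 1.0 / (1 + k)
```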
Simulated experience
- 5-card draw poker
  - s0: A♣, A♦, 6♠, A♥, 2♠
  - a0: discard 6♠, 2♠
  - s1: A♣, A♦, A♥, A♠, 9♠ + dealer takes 4 cards
  - return: +1 (probably)
- DP
  - list all states and actions, compute P(s,a,s')
  - P( [A♣,A♦,6♠,A♥,2♠], [6♠,2♠], [A♠,9♠,4] ) = 0.00192
- MC
  - all you need are sample episodes
  - let MC play against a random policy, or itself, or another algorithm
Summary of Monte Carlo
- doesn't need a model of the environment
  - averages sample returns
  - only for episodic tasks
- learns from:
  - sample episodes
  - simulated experience
- can concentrate on "important" states
  - doesn't need a full sweep
- no bootstrapping
  - less harmed by violation of the Markov property
- needs to maintain exploration
  - use soft policies
Outline
- defining a Markov Decision Process
- solving an MDP using Dynamic Programming
- Reinforcement Learning
  - Monte Carlo methods
  - Temporal-Difference learning
- miscellaneous
  - state representation
  - function approximation, rewards
Temporal-Difference Learning
- combines ideas from MC and DP
  - like MC: learn directly from experience (no model needed)
  - like DP: bootstrap
  - works for continuing tasks; usually faster than MC
- constant-α MC
  - has to wait until the end of the episode to update
- simplest TD
  - update after every step, based on the successor's value (the TD target)
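To make the contrast concrete, here is a hedged sketch of the two tabular update rules; the function names and step size are illustrative.

```python
def constant_alpha_mc_update(V, s, G, alpha=0.1):
    """Constant-alpha MC: update only at episode end, towards the full return G."""
    V[s] += alpha * (G - V[s])

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, done=False):
    """Simplest TD: update after every step, towards the target r + gamma * V(s')."""
    target = r + (0.0 if done else gamma * V[s_next])
    V[s] += alpha * (target - V[s])
```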
MC vs. TD
- observed the following 8 episodes:
  - A–0, B–0
  - B–1 (six times)
  - B–0
- MC and TD agree on V(B) = 3/4
- MC: V(A) = 0
  - converges to the values that minimize the error on the training data
- TD: V(A) = 3/4
  - converges to the estimate of the underlying Markov process
[figure: Markov process: A goes to B with r = 0 (100%); from B, r = 1 with probability 75% and r = 0 with probability 25%]
Sarsa
- again, we need Q(s,a), not just V(s)
[figure: trajectory s_t, a_t, r_t, s_t+1, a_t+1, r_t+1, s_t+2, a_t+2, ...]
- control
  - start with a random policy
  - update Q and π after each step
  - again, need ε-soft policies
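A sketch of the per-step Sarsa update; Q is assumed to be a dict keyed by (state, action), and a_next is the action the same ε-soft policy actually picks in s_next (names are illustrative).

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9, done=False):
    """On-policy: the target bootstraps from the action the policy will really take."""
    target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```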
Q-learning
- previous algorithms: on-policy
  - start with a random policy, iteratively improve
  - converge to the optimal policy
- Q-learning: off-policy
  - use any policy to estimate Q
  - Q directly approximates Q* (Bellman optimality equation)
  - independent of the policy being followed
  - only requirement: keep updating each (s,a) pair
[figure: Q-learning update rule, with the Sarsa update shown for comparison]
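For comparison with the Sarsa sketch above, a sketch of the Q-learning update; the max over next actions is what makes it off-policy (names are illustrative).

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9, done=False):
    """Off-policy: the target uses max_a' Q(s', a') regardless of which action the
    behaviour policy takes next; just keep visiting every (s, a) pair."""
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```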
Outline
- defining a Markov Decision Process
- solving an MDP using Dynamic Programming
- Reinforcement Learning
  - Monte Carlo methods
  - Temporal-Difference learning
- miscellaneous
  - state representation
  - function approximation, rewards