Machine Learning 10-701
Tom M. Mitchell
Machine Learning Department, Carnegie Mellon University
April 26, 2011

Today:
• Learning of control policies
• Markov Decision Processes
• Temporal difference learning
• Q learning

Readings:
• Mitchell, chapter 13
• Kaelbling et al., Reinforcement Learning: A Survey

Thanks to Aarti Singh for several slides

Reinforcement Learning [Sutton and Barto 1981; Samuel 1957; ...]
Reinforcement Learning: Backgammon [Tesauro, 1995]

Learning task:
• choose a move at arbitrary board states
Training signal:
• final win or loss
Training:
• played 300,000 games against itself
Algorithm:
• reinforcement learning + neural network
Result:
• world-class Backgammon player

Outline
• Learning control strategies
  – Credit assignment and delayed reward
  – Discounted rewards
• Markov Decision Processes
  – Solving a known MDP
• Online learning of control strategies
  – When the next-state function is known: value function V*(s)
  – When the next-state function is unknown: learning Q*(s,a)
• Role in modeling reward learning in animals
Markov Decision Process = Reinforcement Learning Setting

• Set of states S
• Set of actions A
• At each time, the agent observes state s_t ∈ S, then chooses action a_t ∈ A
• It then receives reward r_t, and the state changes to s_{t+1}
• Markov assumption: P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ...) = P(s_{t+1} | s_t, a_t)
• Also assume the reward is Markov: P(r_t | s_t, a_t, s_{t-1}, a_{t-1}, ...) = P(r_t | s_t, a_t)

• The task: learn a policy π : S → A for choosing actions that maximizes the expected discounted reward

  E[r_0 + γ r_1 + γ² r_2 + ...],  where 0 ≤ γ < 1

  for every possible starting state s_0
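As a concrete sketch (not from the slides), the ingredients above can be written down directly; the names S, A, P, R, and gamma below are illustrative, and discounted_return computes the quantity the policy is asked to maximize:

S = ["s0", "s1"]                                  # set of states
A = ["stay", "go"]                                # set of actions
P = {                                             # transition model P(s'|s,a)
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}},
}
R = {                                             # expected immediate reward r(s,a)
    "s0": {"stay": 0.0, "go": 0.0},
    "s1": {"stay": 1.0, "go": 0.0},
}
gamma = 0.9                                       # discount factor, 0 <= gamma < 1

def discounted_return(rewards, gamma=0.9):
    """r_0 + gamma*r_1 + gamma^2*r_2 + ... for an observed reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([0, 0, 1, 1]))            # 0.9**2 + 0.9**3 ≈ 1.539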
HMM, Markov Process, Markov Decision Process
(figures comparing the graphical models of HMMs, Markov processes, and MDPs; not reproduced here)
Reinforcement Learning Task for Autonomous Agent

Execute actions in the environment, observe results, and
• learn a control policy π : S → A that maximizes

  E[r_0 + γ r_1 + γ² r_2 + ...]

  from every state s ∈ S

Example: Robot grid world, deterministic reward r(s,a)

Yikes!!
• Function to be learned is π : S → A
• But training examples are not of the form <s, a>
• They are instead of the form < <s,a>, r >
Value Function for each Policy

• Given a policy π : S → A, define

  V^π(s) ≡ E[r_0 + γ r_1 + γ² r_2 + ... | s_0 = s]

  assuming the action sequence is chosen according to π, starting at state s

• Then we want the optimal policy π*, where

  π* = argmax_π V^π(s)  (for every state s)

• For any MDP, such a policy exists!
• We’ll abbreviate V^π*(s) as V*(s)
• Note that if we have V*(s) and P(s_{t+1}|s_t,a), we can compute π*(s)

Value Function – what are the V^π(s) values?
(grid-world figure; not reproduced here)
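When P(s'|s,a) and r(s,a) are known, one way to compute these V^π(s) values is iterative policy evaluation. A sketch under the same hypothetical tabular representation (S, P, R, gamma) used earlier, with the policy pi given as a dict from state to action:

def evaluate_policy(pi, S, P, R, gamma=0.9, tol=1e-8):
    """Repeatedly apply V(s) <- r(s,pi(s)) + gamma * sum_s' P(s'|s,pi(s)) V(s')."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            a = pi[s]
            new_v = R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:                       # estimates have stopped changing
            return V

# e.g. evaluate_policy({"s0": "go", "s1": "stay"}, S, P, R)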
Value Function – what are the V*(s) values?
(grid-world figure; not reproduced here)

Immediate rewards r(s,a) and state values V*(s)
(grid-world figures; not reproduced here)
Recursive definition for V*(s)

Assuming actions are chosen according to the optimal policy π*:

  V*(s) = E[r(s, π*(s))] + γ Σ_{s'} P(s' | s, π*(s)) V*(s')
        = max_a ( E[r(s,a)] + γ Σ_{s'} P(s' | s, a) V*(s') )

Value Iteration for learning V*: assumes P(s_{t+1} | s_t, a) is known

Initialize V(s) arbitrarily
Loop until policy good enough
  Loop for s in S
    Loop for a in A
      Q(s,a) ← E[r(s,a)] + γ Σ_{s'} P(s'|s,a) V(s')
    End loop
    V(s) ← max_a Q(s,a)
  End loop
End loop

V(s) converges to V*(s) (dynamic programming)
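A direct Python rendering of this loop (a sketch, assuming the same hypothetical tabular S, A, P, R, gamma as before):

def value_iteration(S, A, P, R, gamma=0.9, eps=1e-6):
    V = {s: 0.0 for s in S}
    while True:                                    # "until policy good enough"
        delta = 0.0
        for s in S:
            # Q(s,a) = E[r(s,a)] + gamma * sum_s' P(s'|s,a) V(s')
            q = {a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                 for a in A}
            new_v = max(q.values())                # V(s) <- max_a Q(s,a)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < eps:                            # successive estimates agree to eps
            return V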
Value Iteration

Interestingly, value iteration works even if we randomly traverse the environment instead of looping through each state and action methodically
• but we must still visit each state infinitely often on an infinite run
• For details: [Bertsekas 1989]
• Implications: online learning as the agent randomly roams

If the maximum (over states) difference between two successive value function estimates is less than ε, then the value of the greedy policy differs from the optimal policy by no more than 2εγ / (1 − γ). (A sketch of extracting that greedy policy follows below.)

So far: learning the optimal policy when we know P(s_t | s_{t-1}, a_{t-1}). What if we don’t?
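Before turning to the unknown-model case, here is the greedy-policy extraction the bound above refers to, sketched with the same hypothetical tabular names:

def greedy_policy(V, S, A, P, R, gamma=0.9):
    """pi(s) = argmax_a [ E[r(s,a)] + gamma * sum_s' P(s'|s,a) V(s') ]."""
    return {s: max(A, key=lambda a: R[s][a] +
                   gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
            for s in S}

# If successive value-iteration sweeps differed by less than eps, the value of
# this greedy policy is within 2*eps*gamma/(1 - gamma) of optimal.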
Q learning

Define a new function, closely related to V*:

  Q(s,a) ≡ E[r(s,a)] + γ Σ_{s'} P(s'|s,a) V*(s')

so that V*(s) = max_a Q(s,a).

If the agent knows Q(s,a), it can choose the optimal action without knowing P(s_{t+1}|s_t,a):

  π*(s) = argmax_a Q(s,a)

And it can learn Q without knowing P(s_{t+1}|s_t,a).

Immediate rewards r(s,a), state values V*(s), and state-action values Q*(s,a)
(grid-world figures; not reproduced here)

Bellman equation. Consider first the case where P(s'|s,a) is deterministic, with s' the next state that follows from taking action a in state s:

  Q(s,a) = r(s,a) + γ max_{a'} Q(s', a')
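A minimal sketch of the resulting tabular Q-learning update for this deterministic case (the names Q, q_update, and the sample experience are illustrative): after executing a in s and observing reward r and next state s', set Q(s,a) ← r + γ max_{a'} Q(s',a'); no model P(s'|s,a) is needed.

from collections import defaultdict

Q = defaultdict(float)                      # Q[(s, a)], implicitly initialized to 0

def q_update(s, a, r, s_next, actions, gamma=0.9):
    """Deterministic-case update: Q(s,a) <- r + gamma * max_a' Q(s',a')."""
    Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)

# One observed experience < <s,a>, r > plus the resulting next state:
q_update("s0", "go", 0.0, "s1", ["stay", "go"])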
MDPs and RL: What You Should Know

• Learning to choose optimal actions a
• from delayed reward
• by learning evaluation functions like V(s), Q(s,a)

Key ideas:
• If the next-state function S_t × A_t → S_{t+1} is known
  – can use dynamic programming to learn V(s)
  – once learned, choose the action a_t that maximizes V(s_{t+1})
• If the next-state function S_t × A_t → S_{t+1} is unknown
  – learn Q(s_t, a_t) = E[r_t + γ V(s_{t+1})]
  – to learn it, sample S_t × A_t → S_{t+1} in the actual world
  – once learned, choose the action a_t that maximizes Q(s_t, a_t)
MDPs and Reinforcement Learning: Further Issues

• What strategy for choosing actions will optimize
  – learning rate? (explore uninvestigated states)
  – obtained reward? (exploit what you know so far)
• Partially observable Markov decision processes
  – the state is not fully observable
  – maintain a probability distribution over the possible states you’re in
• Convergence guarantees with function approximators?
  – our proof assumed a tabular representation for Q, V
  – some types of function approximators still converge (e.g., nearest neighbor) [Gordon, 1999]
• Correspondence to human learning?