CS325 Artificial Intelligence
Ch. 21 – Reinforcement Learning
Cengiz Günay, Emory Univ., Spring 2013
Rats!
A rat is put in a cage with a lever. Each lever press sends a signal to the rat's brain, to the reward center.
The rat presses the lever continuously until . . . it dies, because it stops eating and drinking.
(Image: Fundooprofessor)
Dopamine Neurons Respond to Novelty (Schultz et al., 1997)
It turns out: novelty detection = the Temporal Difference rule in Reinforcement Learning (Sutton and Barto, 1981).
(Image: sciencemuseum.org.uk)
[Figure: learning agent architecture – a critic compares sensor feedback against a performance standard; the learning element updates the performance element's knowledge and sets learning goals; a problem generator proposes exploratory actions, which the agent's actuators apply to the environment.]
Entry/Exit Surveys
Exit survey: Planning Under Uncertainty
  Why can't we use a regular MDP for partially-observable situations?
  Give an example where you think MDPs would help you solve a problem in your daily life.
Entry survey: Reinforcement Learning (0.25 points of final grade)
  In a partially-observable scenario, can reinforcement be used to learn MDP rewards?
  How can we improve MDPs by using the plan-execute cycle?
Blindfolded MDPs: Enter Reinforcement Learning
[Figure: 4×3 grid world with start state S and goal state G.]
What if the agent does not know anything about:
  where the walls are
  where the goals/penalties are
Can we use the plan-execute cycle?
  Explore first
  Update the world state based on reward/reinforcement
⇒ Reinforcement Learning (see the Scholarpedia article)
Where Does Reinforcement Learning Fit?
Machine learning so far:
  Unsupervised learning: find regularities in input data, x
  Supervised learning: find a mapping between input and output, f(x) → y
  Reinforcement learning: find a mapping between states and actions, s → a (by finding the optimal policy, π(s) → a)
Which is it? (S = supervised, U = unsupervised, R = reinforcement)
  Speech recognition: connect sounds to transcripts → S
  Star data: find groupings from spectral emissions → U
  Rat presses lever: gets reward based on certain conditions → R
  Elevator controller: multiple elevators, minimize wait time → R
But, Wasn't That What Markov Decision Processes Were?
Find the optimal policy to maximize reward:
  \pi^*(s) = \arg\max_\pi E\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(s_t), s_{t+1}) \right],
with reward at a state, R(s), or from an action, R(s, a, s').
By estimating utility values:
  V(s) \leftarrow R(s) + \gamma \max_a \sum_{s'} P(s' \mid s, a)\, V(s'),
with transition probabilities P(s' | s, a).
Assumes we know R(s) and P(s' | s, a).
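As an illustration only (not from the slides), here is a minimal Python sketch of the value-iteration update above; the dictionary-based model P and reward R are assumed inputs, which is exactly the knowledge an RL agent does not have:

```python
# Minimal value-iteration sketch for the update
#   V(s) <- R(s) + gamma * max_a sum_{s'} P(s'|s,a) * V(s')
# Assumes R[s] gives the reward of state s and P[(s, a)] maps next states
# to their transition probabilities (both known, unlike in RL).
def value_iteration(states, actions, P, R, gamma=0.9, n_iters=100):
    V = {s: 0.0 for s in states}
    for _ in range(n_iters):
        V = {
            s: R[s] + gamma * max(
                sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in actions
            )
            for s in states
        }
    return V
```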
Blindfolded Agent Must Learn From Rewards
Don't know R(s) or P(s' | s, a). What to do?
Use Reinforcement Learning (RL) to explore and find rewards.
Agent types:
                    knows   learns    uses
  Utility agent     P       R → U     U
  Q-learning (RL)   –       Q(s, a)   Q
  Reflex            –       π(s)      π(s)
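As a rough sketch of the Q-learning row in the table (not from the slides), the one-step update Q(s,a) ← Q(s,a) + α [r + γ max_{a'} Q(s',a') − Q(s,a)] could be implemented as below; the `env` object with `reset()`/`step()` is a hypothetical interface used only for illustration:

```python
import random
from collections import defaultdict

# One-step Q-learning sketch:
#   Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
# `env` with reset()/step(action) -> (next_state, reward, done) is assumed.
def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration: occasionally try a random action
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)
            best_next = max(Q[(s2, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```

Note that Q-learning needs neither R(s) nor P(s' | s, a); it learns from the observed rewards alone.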
Video: Backgammon and Choppers
How Much to Learn?
1. Passive RL: the simple case
   Keep the policy π(s) fixed, learn the rest (the utilities)
   Always do the same actions, and learn utilities
   Examples: a public transit commute; learning a difficult game
2. Active RL
   Learn the policy at the same time
   Explore better by changing the policy
   Example: driving your own car
RL in Practice: Temporal Difference (TD) Rule
Animals use a derivative. Remember value iteration:
  V(s) \leftarrow R(s) + \gamma \max_a \sum_{s'} P(s' \mid s, a)\, V(s').
TD rule: use the temporal difference when going s → s':
  V(s) \leftarrow V(s) + \alpha \left[ R(s) + \gamma V(s') - V(s) \right]
where α is the learning rate and γ is the discount factor.
It's even simpler than before!
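A minimal sketch of the TD rule for passive RL (illustrative, not from the slides), assuming a recorded trajectory of (state, reward) pairs obtained by following the fixed policy; the state names in the usage example are made up:

```python
# TD(0) update applied along one trajectory:
#   V(s) <- V(s) + alpha * (R(s) + gamma * V(s') - V(s))
# `trajectory` is an assumed list of (state, reward) pairs from the fixed policy.
def td_evaluate(V, trajectory, alpha=0.1, gamma=0.9):
    for (s, r), (s_next, _) in zip(trajectory, trajectory[1:]):
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V

# Example usage with a hypothetical three-step walk to the goal:
V = {"s1": 0.0, "s2": 0.0, "goal": 0.0}
V = td_evaluate(V, [("s1", -0.04), ("s2", -0.04), ("goal", 1.0)], alpha=0.5)
```

Unlike value iteration, this update never touches P(s' | s, a); it only needs the states actually visited and the rewards actually received.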