Reinforcement Learning
Philipp Koehn
16 April 2020
Rewards
● Agent takes actions
● Agent occasionally receives a reward
● Maybe only at the end of the process, e.g., chess:
  – agent has to decide on individual moves
  – reward only at the end: win/lose
● Maybe more frequently
  – Scrabble: points for each word played
  – ping pong: any point scored
  – baby learning to crawl: any forward movement
Markov Decision Process
[Figures: state map, stochastic movement model]
● States s ∈ S, actions a ∈ A
● Model T(s, a, s′) ≡ P(s′ | s, a) = probability that action a in state s leads to s′
● Reward function R(s) (or R(s, a), R(s, a, s′))
    R(s) = { −0.04 (small penalty) for nonterminal states
           { ±1                    for terminal states
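To make the setup concrete, here is a minimal Python sketch of such a grid-world MDP. The 4×3 layout with a blocked square, the goal at (4,3), and the 0.8/0.1/0.1 movement noise are assumptions for illustration; the slides themselves only specify the reward values and the −1 state at (4,2).

```python
# Sketch of a 4x3 grid-world MDP (layout and 0.8/0.1/0.1 movement noise are
# assumptions; the slides only specify R(s) = -0.04 / +1 / -1).
ACTIONS = ['up', 'down', 'left', 'right']
WALL = (2, 2)                              # blocked square (assumed)
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}   # goal and pit

def R(s):
    """Reward function: +/-1 at terminal states, small penalty elsewhere."""
    return TERMINALS.get(s, -0.04)

def T(s, a):
    """Transition model P(s' | s, a): intended direction with prob. 0.8,
    each perpendicular direction with prob. 0.1 (assumed noise model)."""
    if s in TERMINALS:
        return {s: 1.0}                    # absorbing terminal states
    deltas = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}
    perpendicular = {'up': ['left', 'right'], 'down': ['left', 'right'],
                     'left': ['up', 'down'], 'right': ['up', 'down']}
    dist = {}
    for direction, p in [(a, 0.8)] + [(d, 0.1) for d in perpendicular[a]]:
        dx, dy = deltas[direction]
        nxt = (s[0] + dx, s[1] + dy)
        # bounce back if the move leaves the grid or hits the blocked square
        if not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3) or nxt == WALL:
            nxt = s
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist
```

The functions T and R defined here are reused by the later sketches in this deck.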
Agent Designs
● Utility-based agent
  – needs a model of the environment
  – learns a utility function on states
  – selects the action that maximizes expected outcome utility
● Q-learning
  – learns an action-utility function (Q(s, a) function)
  – does not need to model the outcomes of actions
  – the function provides the expected utility of taking a given action in a given state
● Reflex agent
  – learns a policy that maps states to actions
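As a minimal illustration of the difference between the first two designs, a sketch of action selection, assuming dictionary-based U and Q estimates and reusing the T(s, a) model from the sketch above:

```python
def select_action_utility_based(s, U, T, ACTIONS):
    """Needs the transition model T to look one step ahead."""
    return max(ACTIONS,
               key=lambda a: sum(p * U[s2] for s2, p in T(s, a).items()))

def select_action_q_learning(s, Q, ACTIONS):
    """Needs no model: the Q-function already contains the lookahead."""
    return max(ACTIONS, key=lambda a: Q[(s, a)])
```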
passive reinforcement learning
Setup
[Figures: reward function, state map, stochastic movement model]
    R(s) = { +1    for goal
           { −1    for pit
           { −0.04 for other states
● We know which state we are in (= fully observable environment)
● We know which actions we can take
● But only after taking an action → new state becomes known → reward becomes known
Passive Reinforcement Learning
● Given a policy
● Task: compute the utility of the policy
● We will extend this later to active reinforcement learning (⇒ policy needs to be learned)
Sampling
[Figures: a sample trial through the grid, collecting a reward of −0.04 in each nonterminal state visited, until the +1 terminal state is reached]
● Sample of reward to go
[Figure: the grid annotated with the reward to go of each visited state: 0.72, 0.76, 0.80, 0.84, 0.88, 0.92, 0.96, 1.00]
[Figures: further sample trials]
Utility of Policy
● Definition of the utility U of the policy π for state s
    U^π(s) = E[ ∑_{t=0}^{∞} γ^t R(S_t) ]
● Start at state S_0 = s
● Reward for a state is R(s)
● Discount factor γ (we use γ = 1 in our examples)
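For example, the reward to go of the start state in the sample trial above works out as follows (with γ = 1; that the trial passes through seven nonterminal states is read off the sampled values):

```latex
% Worked example for the sample trial above (gamma = 1, assumed:
% seven nonterminal steps at -0.04 each, then the +1 terminal state)
U^{\pi}(s_0) \approx \sum_{t} \gamma^{t} R(S_t) = 7 \times (-0.04) + 1 = 0.72
```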
Direct Utility Estimation
● Learning from the samples
● Reward to go:
  – (1,1) one sample: 0.72
  – (1,2) two samples: 0.76, 0.84
  – (1,3) two samples: 0.80, 0.88
● Reward to go will converge to the utility of the state
● But very slowly — can we do better?
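A minimal sketch of direct utility estimation, assuming each completed trial is recorded as a list of (state, reward) pairs; the function and variable names are illustrative:

```python
from collections import defaultdict

def direct_utility_estimation(trials, gamma=1.0):
    """Estimate U(s) as the average observed reward to go over all trials."""
    totals = defaultdict(float)   # sum of reward-to-go samples per state
    counts = defaultdict(int)     # number of samples per state
    for trial in trials:          # trial = [(state, reward), ...] up to a terminal
        reward_to_go = 0.0
        for state, reward in reversed(trial):   # accumulate from the end
            reward_to_go = reward + gamma * reward_to_go
            totals[state] += reward_to_go
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}
```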
Bellman Equation
● Direct utility estimation ignores the dependencies between states
● These are given by the Bellman equation
    U^π(s) = R(s) + γ ∑_{s′} P(s′ | s, π(s)) U^π(s′)
    (γ = discount factor)
● Use of this known dependence can speed up learning
● Requires learning the transition probabilities P(s′ | s, π(s))
Adaptive Dynamic Programming
Need to learn:
● State rewards R(s)
  – whenever a state is visited, record its reward (deterministic)
● Outcome of action π(s) at state s according to policy π
  – collect the statistic count(s, s′): how often s′ is reached from s
  – estimate the probability distribution
      P(s′ | s, π(s)) = count(s, s′) / ∑_{s′′} count(s, s′′)
⇒ Ingredients for a policy evaluation algorithm (see the sketch below)
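A sketch of a passive ADP learner along these lines; the dictionary-based representation and the fixed number of Bellman sweeps in policy evaluation are assumptions for illustration:

```python
from collections import defaultdict

class PassiveADPAgent:
    def __init__(self, policy, gamma=1.0):
        self.policy = policy          # maps state -> action (followed externally)
        self.gamma = gamma
        self.R = {}                                            # observed rewards
        self.counts = defaultdict(lambda: defaultdict(int))    # counts[s][s']
        self.U = defaultdict(float)                            # utility estimates

    def observe(self, s, reward, s_next):
        """Record the reward of s and the observed transition s -> s_next."""
        self.R[s] = reward
        self.counts[s][s_next] += 1
        self.policy_evaluation()

    def P(self, s):
        """Estimated P(s' | s, pi(s)) from the transition counts."""
        total = sum(self.counts[s].values())
        return {s2: n / total for s2, n in self.counts[s].items()}

    def policy_evaluation(self, sweeps=20):
        """Solve the Bellman equations approximately by repeated sweeps."""
        for _ in range(sweeps):
            for s in self.R:
                if self.counts[s]:   # states with observed outgoing transitions
                    self.U[s] = self.R[s] + self.gamma * sum(
                        p * self.U[s2] for s2, p in self.P(s).items())
                else:                # terminal (or not-yet-left) states
                    self.U[s] = self.R[s]
```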
Adaptive Dynamic Programming
[Figure: adaptive dynamic programming algorithm]
Learning Curve
● Major change at the 78th trial: first time the agent terminated in the −1 state at (4,2)
Temporal Difference Learning
● Idea: do not model P(s′ | s, π(s)), directly adjust the utilities U(s) of all visited states
● Estimate of current utility: U^π(s)
● Estimate of utility after action: R(s) + γ U^π(s′)
● Adjust the utility of the current state U^π(s) if they differ
    ∆U^π(s) = α (R(s) + γ U^π(s′) − U^π(s))
    (α = learning rate)
● Learning rate may decrease when a state has been visited often
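A minimal sketch of the TD update; the decaying learning rate α = 1/N(s) is one common choice and an assumption here:

```python
from collections import defaultdict

U = defaultdict(float)    # utility estimates
N = defaultdict(int)      # visit counts per state

def td_update(s, reward, s_next, gamma=1.0):
    """One temporal-difference update after observing the transition s -> s_next."""
    N[s] += 1
    alpha = 1.0 / N[s]    # learning rate decays with the number of visits
    U[s] += alpha * (reward + gamma * U[s_next] - U[s])
```

The agent simply calls td_update once for every observed transition while following its policy.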
Learning Curve
● Noisier, converging more slowly
Comparison
● Both eventually converge to the correct values
● Adaptive dynamic programming (ADP) is faster than temporal difference learning (TD)
  – both make adjustments to make successors agree
  – but: ADP adjusts for all possible successors, TD only for the observed successor
● ADP is computationally more expensive due to its policy evaluation algorithm
active reinforcement learning
Active Reinforcement Learning
● Previously: passive agent follows a prescribed policy
● Now: active agent decides which action to take
  – following the optimal policy (as currently estimated)
  – exploration
● Goal: optimize rewards for a given time frame
Greedy Agent
1. Start with an initial policy
2. Compute utilities (using ADP)
3. Optimize the policy
4. Go to Step 2
● This very seldom converges to the globally optimal policy
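A sketch of this loop; passing in the learned model T and rewards R as functions, and the fixed number of evaluation sweeps, are simplifying assumptions:

```python
def greedy_agent(states, terminals, ACTIONS, T, R, policy, gamma=1.0, rounds=50):
    """Alternate policy evaluation (step 2) and greedy improvement (step 3).
    T and R are the model/rewards learned so far (passed in here for brevity)."""
    U = {s: R(s) for s in states}
    for _ in range(rounds):
        # Step 2: evaluate the current policy with a few Bellman sweeps
        for _ in range(20):
            for s in states:
                if s in terminals:
                    continue                     # terminal utility stays at R(s)
                U[s] = R(s) + gamma * sum(
                    p * U[s2] for s2, p in T(s, policy[s]).items())
        # Step 3: make the policy greedy w.r.t. the current utilities
        for s in states:
            if s not in terminals:
                policy[s] = max(ACTIONS, key=lambda a: sum(
                    p * U[s2] for s2, p in T(s, a).items()))
    return policy, U
```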
Learning Curve
● Greedy agent stuck in a local optimum
Bandit Problems
● Bandit: slot machine
● N-armed bandit: n levers
● Each lever has a different probability distribution over payoffs
● Spend each coin on
  – the lever with the presumed optimal payoff, or
  – exploration (a new lever)
● If the levers are independent
  – Gittins index: formula for the solution
  – uses payoff / number of times used
Greedy in the Limit of Infinite Exploration
● Explore any action in any state an unbounded number of times
● Eventually has to become greedy
  – carry out the optimal policy ⇒ maximize reward
● Simple strategy (see the sketch below)
  – with probability 1/t take a random action
  – initially (t small) focus on exploration
  – later (t large) focus on the optimal policy
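A minimal sketch of this strategy, assuming action values are kept in a Q dictionary; the names are illustrative:

```python
import random

def glie_action(s, Q, ACTIONS, t):
    """With probability 1/t pick a random action, otherwise act greedily."""
    if random.random() < 1.0 / t:
        return random.choice(ACTIONS)                # exploration
    return max(ACTIONS, key=lambda a: Q[(s, a)])     # exploitation
```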
Extension of Adaptive Dynamic Programming
● Previous definition of the utility calculation
    U(s) ← R(s) + γ max_a ∑_{s′} P(s′ | s, a) U(s′)
● New utility calculation
    U⁺(s) ← R(s) + γ max_a f( ∑_{s′} P(s′ | s, a) U⁺(s′), N(s, a) )
● One possible definition of the exploration function f(u, n)
    f(u, n) = { R⁺  if n < N_e
              { u   otherwise
  where R⁺ is an optimistic estimate: the best possible reward obtainable in any state
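A sketch of the optimistic update with this exploration function; R⁺ = 2 and N_e = 5 follow the parameter settings quoted on the next slide, and the learned model is assumed to be available as a function P(s, a) returning a distribution over successor states:

```python
def exploration_f(u, n, R_plus=2.0, N_e=5):
    """Optimistic value for actions tried fewer than N_e times."""
    return R_plus if n < N_e else u

def update_U_plus(states, terminals, ACTIONS, P, R, U_plus, N, gamma=1.0):
    """One sweep of the optimistic utility update U+ (value-iteration style).
    U_plus: dict of current optimistic estimates (unseen states count as 0.0).
    N: dict of visit counts keyed by (state, action)."""
    for s in states:
        if s in terminals:
            U_plus[s] = R(s)
            continue
        U_plus[s] = R(s) + gamma * max(
            exploration_f(sum(p * U_plus.get(s2, 0.0)
                              for s2, p in P(s, a).items()),
                          N.get((s, a), 0))
            for a in ACTIONS)
```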
Learning Curve
● Performance of the exploratory ADP agent
● Parameter settings R⁺ = 2 and N_e = 5
● Fairly quick convergence to the optimal policy