  1. Reinforcement Learning Kevin Spiteri April 21, 2015

  2. n-armed bandit

  3. n-armed bandit: true payoff probabilities 0.9, 0.5, 0.1

  4. n-armed bandit: true payoff probabilities 0.9, 0.5, 0.1; estimates initialized to 0.0, 0.0, 0.0

  5. n-armed bandit: estimates 0.0, 0.0, 0.0; attempts 0, 0, 0; payoff 0, 0, 0

  6. n-armed bandit: the first arm is pulled once and pays off: estimates 1.0, 0.0, 0.0; attempts 1, 0, 0; payoff 1, 0, 0

  7. n-armed bandit: estimates 0.5, 0.0, 1.0; attempts 2, 0, 1; payoff 1, 0, 1

  8. Exploration: estimates 0.67, 0.0, 1.0; attempts 3, 0, 1; payoff 2, 0, 1

  9. Going on … estimates 0.9, 0.5, 0.1; attempts 280, 10, 10 (total 300); payoff 252, 5, 1 (total 258, average 0.86)

  10. Changing environment: true probabilities change to 0.7, 0.8, 0.1; estimates still 0.9, 0.5, 0.1; attempts 280, 10, 10 (total 300); payoff 252, 5, 1 (total 258, average 0.86)

  11. Changing environment: true probabilities 0.7, 0.8, 0.1; estimates 0.8, 0.65, 0.1; attempts 560, 20, 20 (total 600); payoff 448, 13, 2 (total 463, average 0.77)

  12. Changing environment: true probabilities 0.7, 0.8, 0.1; estimates 0.74, 0.74, 0.1; attempts 1400, 50, 50 (total 1500); payoff 1036, 37, 5 (total 1078, average 0.72)

  13. n-armed bandit ● Optimal payoff (0.82): 0.9 x 300 + 0.8 x 1200 = 1230 ● Actual payoff (0.72): 0.9 x 280 + 0.5 x 10 + 0.1 x 10 + 0.7 x 1120 + 0.8 x 40 + 0.1 x 40 = 1078

  14. n-armed bandit ● Evaluation vs instruction. ● Discounting. ● Initial estimates. ● There is no best way or standard way.
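
  A minimal sketch of the bookkeeping behind slides 4-12, assuming an epsilon-greedy selection rule; the epsilon value and the constant step size are illustrative choices, not values from the slides:

    import random

    def run_bandit(true_probs, steps, epsilon=0.1, step_size=None):
        """Epsilon-greedy n-armed bandit. step_size=None keeps sample averages;
        a constant step_size (e.g. 0.1) tracks a changing environment better."""
        n = len(true_probs)
        estimates = [0.0] * n   # the "estimate" row on the slides
        attempts = [0] * n      # the "attempts" row
        payoffs = [0.0] * n     # the "payoff" row
        for _ in range(steps):
            if random.random() < epsilon:
                arm = random.randrange(n)                        # explore
            else:
                arm = max(range(n), key=lambda a: estimates[a])  # exploit
            reward = 1.0 if random.random() < true_probs[arm] else 0.0
            attempts[arm] += 1
            payoffs[arm] += reward
            alpha = step_size if step_size is not None else 1.0 / attempts[arm]
            estimates[arm] += alpha * (reward - estimates[arm])  # incremental mean
        return estimates, attempts, payoffs

    # e.g. run_bandit([0.9, 0.5, 0.1], steps=300)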

  15. Markov Decision Process (MDP)

  16. Markov Decision Process (MDP) ● States

  17. Markov Decision Process (MDP) ● States

  18. Markov Decision Process (MDP) ● States ● Actions (a, b, c in the diagram)

  19. Markov Decision Process (MDP) ● States ● Actions ● Model (action a: transition probabilities 0.75 and 0.25 in the diagram)

  20. Markov Decision Process (MDP) ● States ● Actions ● Model (action a: transition probabilities 0.75 and 0.25 in the diagram)

  21. Markov Decision Process (MDP) ● States ● Actions ● Model ● Reward (rewards 5, -1 and 0 on the diagram's transitions)

  22. Markov Decision Process (MDP) ● States ● Actions ● Model ● Reward ● Policy

  23. Markov Decision Process (MDP) ● States: ball on table (t), ball in hand (h), ball in basket (b), ball on floor (f)

  24. Markov Decision Process (MDP) ● States: ball on table (t), ball in hand (h), ball in basket (b), ball on floor (f)

  25. Markov Decision Process (MDP) ● States: ball on table (t), in hand (h), in basket (b), on floor (f) ● Actions: a) attempt, b) drop, c) wait

  26. Markov Decision Process (MDP) ● States: ball on table (t), in hand (h), in basket (b), on floor (f) ● Actions: a) attempt, b) drop, c) wait (diagram: attempt transitions with probabilities 0.75 and 0.25)

  27. Markov Decision Process (MDP) ● States: ball on table (t), in hand (h), in basket (b), on floor (f) ● Actions: a) attempt, b) drop, c) wait (diagram: attempt transitions with probabilities 0.75 and 0.25)

  28. Markov Decision Process (MDP) ● States: ball on table (t), in hand (h), in basket (b), on floor (f) ● Actions: a) attempt, b) drop, c) wait (diagram: attempt reaches the basket with probability 0.25 for reward 5 and the floor with probability 0.75 for reward -1; wait gives reward 0)

  29. Markov Decision Process (MDP) ● States: ball on table (t), in hand (h), in basket (b), on floor (f) ● Actions: a) attempt, b) drop, c) wait (diagram: attempt reaches the basket with probability 0.25 for reward 5 and the floor with probability 0.75 for reward -1; wait gives reward 0) Expected reward per round: 0.25 x 5 + 0.75 x (-1) = 0.5

  30. Markov Decision Process (MDP) ● States: ball on table (t), in hand (h), in basket (b), on floor (f) ● Actions: a) attempt, b) drop, c) wait (diagram: rewards 5, -1 and 0 as before, plus a -1 reward at the hand state)

  31. Markov Decision Process (MDP) ● States: ball on table (t), in hand (h), in basket (b), on floor (f) ● Actions: a) attempt, b) drop, c) wait (diagram: rewards 5, -1 and 0 as before, plus a -1 reward at the hand state)
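
  A sketch of the ball example as a transition table. Only part of the model is stated on the slides (attempt: probability 0.25 to the basket with reward 5, 0.75 to the floor with reward -1; wait: reward 0); the state the attempt is made from and the remaining transitions are assumptions for illustration:

    # model[state][action] -> list of (probability, next_state, reward)
    BALL_MDP = {
        "hand": {
            "attempt": [(0.25, "basket", 5.0), (0.75, "floor", -1.0)],  # from the slides
            "drop":    [(1.0, "floor", -1.0)],                          # assumption
            "wait":    [(1.0, "hand", 0.0)],                            # reward 0 from the slides
        },
        # transitions from "table", "basket" and "floor" are not given on the slides
    }

    def expected_reward(outcomes):
        """One-step expected reward of an action: sum of probability x reward."""
        return sum(p * r for p, _, r in outcomes)

    # expected_reward(BALL_MDP["hand"]["attempt"]) == 0.25 * 5 + 0.75 * (-1) == 0.5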

  32. Reinforcement Learning Tools ● Dynamic Programming ● Monte Carlo Methods ● Temporal Difference Learning

  33. Grid World ● Reward: normal move -1, move over obstacle -10 ● Best total reward: -15

  34. Optimal Policy

  35. Value Function (4x4 grid):
      -15  -8  -7   0
      -14  -9  -6  -1
      -13 -10  -5  -2
      -12 -11  -4  -3

  36. Initial Policy

  37. Policy Iteration (4x4 value grid):
      -21 -11 -10   0
      -22 -12 -11  -1
      -23 -13 -12  -2
      -24 -14 -13  -3

  38. Policy Iteration (4x4 value grid):
      -21 -11 -10   0
      -22 -12 -11  -1
      -23 -13 -12  -2
      -24 -14 -13  -3

  39. Policy Iteration (4x4 value grid):
      -21 -11 -10   0
      -22 -12 -11  -1
      -23 -13 -12  -2
      -24 -14 -13  -3

  40. Policy Iteration

  41. Policy Iteration (4x4 value grid):
      -21 -11 -10   0
      -22 -12 -11  -1
      -23 -13 -12  -2
      -15 -14  -4  -3

  42. Policy Iteration (4x4 value grid):
      -21 -11 -10   0
      -22 -12 -11  -1
      -23 -13 -12  -2
      -15 -14  -4  -3

  43. Policy Iteration (4x4 value grid):
      -21 -11 -10   0
      -22 -12 -11  -1
      -23 -13 -12  -2
      -15 -14  -4  -3

  44. Policy Iteration

  45. Policy Iteration (4x4 value grid):
      -15  -8  -7   0
      -14  -9  -6  -1
      -13 -10  -5  -2
      -12 -11  -4  -3
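
  A compact sketch of the policy-iteration loop behind slides 36-45, written against a generic model function rather than the slides' exact grid (the obstacle layout is not fully recoverable); it assumes, as in the slides, that every policy eventually reaches the terminal state, so no discounting is needed:

    def policy_iteration(states, actions, model, gamma=1.0, theta=1e-6):
        """model(s, a) -> list of (probability, next_state, reward); terminal
        states return an empty list. Alternates evaluation and greedy improvement."""
        policy = {s: actions[0] for s in states}
        V = {s: 0.0 for s in states}

        def backup(s, a):
            return sum(p * (r + gamma * V[s2]) for p, s2, r in model(s, a))

        while True:
            # Policy evaluation: sweep until the value of the current policy
            # settles (slides 37-39 and 41-43 show such value grids).
            while True:
                delta = 0.0
                for s in states:
                    v = backup(s, policy[s])
                    delta = max(delta, abs(v - V[s]))
                    V[s] = v
                if delta < theta:
                    break
            # Policy improvement: act greedily with respect to the current values.
            stable = True
            for s in states:
                best = max(actions, key=lambda a: backup(s, a))
                if backup(s, best) > backup(s, policy[s]):
                    policy[s] = best
                    stable = False
            if stable:
                return policy, V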

  46. Value Iteration (4x4 value grid):
       0  0  0  0
       0  0  0  0
       0  0  0  0
       0  0  0  0

  47. Value Iteration (4x4 value grid):
      -1 -1 -1  0
      -1 -1 -1 -1
      -1 -1 -1 -1
      -1 -1 -1 -1

  48. Value Iteration (4x4 value grid):
      -2 -2 -2  0
      -2 -2 -2 -1
      -2 -2 -2 -2
      -2 -2 -2 -2

  49. Value Iteration (4x4 value grid):
      -3 -3 -3  0
      -3 -3 -3 -1
      -3 -3 -3 -2
      -3 -3 -3 -3

  50. Value Iteration (4x4 value grid):
      -15  -8  -7   0
      -14  -9  -6  -1
      -13 -10  -5  -2
      -12 -11  -4  -3
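
  The same grid example as a value-iteration sketch: slides 46-50 are successive full sweeps of the Bellman optimality backup, starting from all zeros; the model interface matches the policy-iteration sketch above:

    def value_iteration(states, actions, model, gamma=1.0, theta=1e-6):
        """Repeat Bellman optimality sweeps until the values stop changing.
        model(s, a) -> list of (probability, next_state, reward); terminal
        states return an empty list and keep value 0."""
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                v = max(sum(p * (r + gamma * V[s2]) for p, s2, r in model(s, a))
                        for a in actions)
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                return V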

  51. Stochastic Model: a move goes in the intended direction with probability 0.95 and in each of two other directions with probability 0.025

  52. Value Iteration (stochastic model 0.95 / 0.025 / 0.025, 4x4 value grid):
      -19.2 -10.4  -9.3   0
      -18.1 -12.1  -8.2  -1.5
      -17.0 -13.6  -6.7  -2.9
      -15.7 -14.7  -5.1  -4.0

  53. Value Iteration (stochastic model 0.95 / 0.025 / 0.025, 4x4 value grid):
      -19.2 -10.4  -9.3   0
      -18.1 -12.1  -8.2  -1.5
      -17.0 -13.6  -6.7  -2.9
      -15.7 -14.7  -5.1  -4.0
      E.g. 13.6 = 0.950 x 13.1 + 0.025 x 27.0 + 0.025 x 16.7
           16.6 = 0.950 x 16.7 + 0.025 x 13.1 + 0.025 x 15.7
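
  Reading the grid entries as positive costs, the worked backup on this slide fits the pattern below; the per-move costs 1 and 10 come from slide 33 (13.1 = 1 + 12.1, 27.0 = 10 + 17.0, 16.7 = 1 + 15.7). This is a sketch of that reading, not the slide's own notation:

    \mathrm{cost}(s) \;=\; \min_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[\, c(s, s') + \mathrm{cost}(s') \,\bigr],
    \qquad c(s, s') = 1 \ \text{(normal move)}, \quad 10 \ \text{(over an obstacle)}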

  54. Richard Bellman

  55. Bellman Equation
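
  In standard notation, the Bellman optimality equation for state values is:

    V^{*}(s) \;=\; \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[\, R(s, a, s') + \gamma\, V^{*}(s') \,\bigr]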

  56. Reinforcement Learning Tools ● Dynamic Programming ● Monte Carlo Methods ● Temporal Difference Learning

  57. Monte Carlo Methods (stochastic model 0.95 / 0.025 / 0.025)

  58. Monte Carlo Methods (stochastic model 0.95 / 0.025 / 0.025)

  59. Monte Carlo Methods (stochastic model): values -32, -22, -10, 0, -21, -11 written in the cells visited by a sampled episode

  60. Monte Carlo Methods (stochastic model 0.95 / 0.025 / 0.025)

  61. Monte Carlo Methods (stochastic model): values -21, -11, -10, 0 written in the cells visited by a sampled episode

  62. Monte Carlo Methods (stochastic model 0.95 / 0.025 / 0.025)

  63. Monte Carlo Methods (stochastic model): values -32, -10, 0, -31, -21, -11 written in the cells visited by a sampled episode
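
  A sketch of the Monte Carlo value estimate these slides illustrate: run whole episodes, then average the return observed from each visited state. The episode format and the discount default are assumptions for illustration:

    from collections import defaultdict

    def monte_carlo_values(episodes, gamma=1.0):
        """Every-visit Monte Carlo. Each episode is a list of (state, reward)
        pairs, where reward is received on leaving that state; returns the
        average return from each state, like the numbers written in the cells."""
        returns = defaultdict(list)
        for episode in episodes:
            G = 0.0
            # Walk backwards so G accumulates the return that follows each state.
            for state, reward in reversed(episode):
                G = reward + gamma * G
                returns[state].append(G)
        return {s: sum(gs) / len(gs) for s, gs in returns.items()}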

  64. Q-Value (stochastic model): action values -15, -10, -8, -20 shown for one cell

  65. Bellman Equation (relating the action values -15, -10, -8, -20)
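
  In standard notation, the corresponding Bellman optimality equation for action values is:

    Q^{*}(s, a) \;=\; \sum_{s'} P(s' \mid s, a)\,\bigl[\, R(s, a, s') + \gamma \max_{a'} Q^{*}(s', a') \,\bigr]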

  66. Learning Rate ● We do not replace an old Q value with a new one. ● We update at a chosen learning rate (see the update rule below). ● Learning rate too small: slow to converge. ● Learning rate too large: unstable. ● Will Dabney's PhD thesis: Adaptive Step-Sizes for Reinforcement Learning.
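
  The update described here, with learning rate (step size) alpha, is the usual incremental form:

    Q_{\text{new}}(s, a) \;=\; Q_{\text{old}}(s, a) + \alpha\,\bigl[\, \text{target} - Q_{\text{old}}(s, a) \,\bigr], \qquad 0 < \alpha \le 1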

  67. Reinforcement Learning Tools ● Dynamic Programming ● Monte Carlo Methods ● Temporal Difference Learning

  68. Richard Sutton

  69. Temporal Difference Learning ● Dynamic Programming: Learn a guess from other guesses (Bootstrapping). ● Monte Carlo Methods: Learn without knowing model.

  70. Temporal Difference Learning Temporal Difference: ● Learn a guess from other guesses (Bootstrapping). ● Learn without knowing model. ● Works with longer episodes than Monte Carlo methods.

  71. Temporal Difference Learning Monte Carlo Methods: ● First run through whole episode. ● Update states at end. Temporal Difference Learning: ● Update state at each step using earlier guesses.

  72. Monte Carlo Methods (stochastic model): values -32, -10, 0, -31, -21, -11 written in the cells visited by a sampled episode

  73. Monte Carlo Methods (stochastic model): values -32, -10, 0, -31, -21, -11 written in the cells visited by a sampled episode

  74. Temporal Difference (stochastic model): values -19, -10, 0, -22, -18, -12 written in the visited cells

  75. Temporal Difference (stochastic model): one-step updated values -23, -28, -21, -11, -10 shown alongside the earlier values -19, -10, 0, -22, -18, -12

  76. Temporal Difference (stochastic model): each new value is the step cost plus the estimate of the next state: 23 = 1 + 22, 28 = 10 + 18, 21 = 10 + 11, 11 = 1 + 10, 10 = 10 + 0 (grid values -19, -10, -23, -10, 0, -22, -18, -12, -28, -21, -11)
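
  A sketch of the TD(0) backup these numbers illustrate; the slide writes costs as positive magnitudes of the negative values, and the learning rate here is an illustrative assumption:

    def td0_update(V, s, reward, s_next, alpha=0.1, gamma=1.0):
        """One TD(0) backup: move V(s) toward the one-step target
        reward + gamma * V(s_next). Slide 76's '11 = 1 + 10' arithmetic is this
        target written in positive-cost form."""
        target = reward + gamma * V[s_next]
        V[s] += alpha * (target - V[s])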

  77. Function Approximation ● Most problems have large state space. ● We can generally design an approximation for the state space. ● Choosing the correct approximation has a large influence on system performance.

  78. Mountain Car Problem

  79. Mountain Car Problem ● Car cannot make it to the top directly. ● Car can swing back and forth to gain momentum. ● We know x and ẋ. ● x and ẋ give an infinite state space. ● Random – may get to the top in 1000 steps. ● Optimal – may get to the top in 102 steps.

  80. Function Approximation ● We can partition the state space into a 200 x 200 grid. ● Coarse coding: different ways of partitioning the state space. ● We can approximate V = wᵀf. ● E.g. f = (x, ẋ, height, ẋ²)ᵀ. ● We can estimate w to solve the problem.
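
  A sketch of the linear approximation V = wᵀf with a semi-gradient TD(0) weight update, using the slide's example features; the step size is an illustrative assumption:

    import numpy as np

    def features(x, xdot, height):
        """Slide 80's example feature vector: f = (x, xdot, height, xdot^2)."""
        return np.array([x, xdot, height, xdot ** 2])

    def value(w, f):
        """Linear approximation V = w^T f."""
        return w @ f

    def td0_weight_update(w, f, reward, f_next, alpha=0.01, gamma=1.0):
        """Semi-gradient TD(0): nudge w along the feature vector by the TD error."""
        td_error = reward + gamma * (w @ f_next) - (w @ f)
        return w + alpha * td_error * f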

  81. Problems with Reinforcement Learning ● The policy sometimes gets worse during learning: Safe Reinforcement Learning (Phil Thomas) guarantees an improved policy over the current policy. ● Policies are often very specific to the training task: Learning Parameterized Skills (Bruno Castro da Silva, PhD thesis).

  82. Checkers ● Arthur Samuel (IBM) 1959

  83. TD-Gammon ● Neural networks and temporal difference. ● Current programs play better than human experts. ● Expert work in input selection.
