Introduction to Reinforcement Learning (RL)
Overview of topics
• About Reinforcement Learning
• The Reinforcement Learning Problem
• Inside an RL agent
• Temporal difference learning
Many faces of Reinforcement Learning
What is Reinforcement Learning?
• Learning from interaction
• Goal-oriented learning
• Learning about, from, and while interacting with an external environment
• Learning what to do—how to map situations to actions—so as to maximize a numerical reward signal
Branches of AI
• Machine Learning divides into Supervised Learning, Unsupervised Learning, and Reinforcement Learning
Supervised Learning
• Training info = desired (target) outputs
• The supervised learning system maps inputs to outputs and is corrected by the error = (target output – actual output)
Reinforcement Learning
• Training info = evaluations ("rewards" / "penalties")
• The RL system maps inputs to outputs ("actions")
• Objective: get as much reward as possible
Recipe for creative behavior: explore & exploit
• Creativity: finding a new approach / solution / …
  – Exploration (random / systematic / …)
  – Evaluation (utility = expected rewards)
  – Selection (ongoing behavior and learning)
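A minimal sketch of this explore/evaluate/select loop on a 3-armed bandit; the payout probabilities, the ε = 0.1 exploration rate, and the running-average evaluation are illustrative assumptions, not taken from the slides.

```python
import random

# Made-up "hidden" payout probability of each option, used only to simulate rewards.
true_payout = [0.2, 0.5, 0.8]
value = [0.0, 0.0, 0.0]   # estimated expected reward (utility) per option
count = [0, 0, 0]
epsilon = 0.1             # fraction of purely exploratory choices

for step in range(1000):
    if random.random() < epsilon:
        arm = random.randrange(3)      # exploration: try something at random
    else:
        arm = value.index(max(value))  # selection: exploit the current best estimate
    reward = 1.0 if random.random() < true_payout[arm] else 0.0
    count[arm] += 1
    # evaluation: incrementally update the running average of observed rewards
    value[arm] += (reward - value[arm]) / count[arm]

print(value)  # estimates should approach the true payout probabilities
```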
Coli bacteria and creativity
• Escherichia coli searches for food using trial and error:
  – Choose a random direction by tumbling, then start swimming straight
  – Evaluate progress
  – Continue longer or stop earlier depending on progress
http://biology.about.com/library/weekly/aa081299.htm
Zebra finch: from singing in the shower to performing artist
1. A newborn zebra finch can't sing
2. The baby bird listens to its father's song
3. The baby starts to "babble", using the father's song as a target template
4. The song develops through trial and error – "singing in the shower"
5. No exploration when singing to a female
http://www.brain.riken.jp/bsi-news/bsinews34/no34/speciale.html
Zebra finch: from singing in the shower to performing artist • http://www.youtube.com/watch?v=Md6bsvkauPg
Key Features of RL
• Learner is not told which actions to take
• Trial-and-error search
• Possibility of delayed reward (sacrifice short-term gains for greater long-term gains)
• The need to explore and exploit
• Considers the whole problem of a goal-directed agent interacting with an uncertain environment
Complete Agent
• Temporally situated
• Continual learning and planning
• Objective is to affect the environment
• Environment is stochastic and uncertain
• (Figure: the agent selects an action; the environment returns the next state and a reward)
Elements of RL
• Policy: what to do
• Reward: what is good
• Value: what is good because it predicts reward
• Model of environment: what follows what
An Extended Example: Tic-Tac-Toe
• (Figure: the game tree of board positions, alternating x's moves and o's moves)
• Assume an imperfect opponent: he/she sometimes makes mistakes
An RL Approach to Tic-Tac-Toe
1. Make a table with one entry per state: V(s) = estimated probability of winning
   – 1 for a win, 0 for a loss or a draw, .5 (a guess) for every other state
2. Now play lots of games. To pick our moves, look ahead one step from the current state to the various possible next states.
• Just pick the next state with the highest estimated prob. of winning — the largest V(s); a greedy move. But 10% of the time pick a move at random; an exploratory move.
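A minimal sketch of the move selection just described, assuming the value table is a Python dict from board position to estimated win probability and that the caller supplies the reachable next states; these representation choices are assumptions, not from the slides.

```python
import random

def choose_next_state(candidate_states, V, epsilon=0.1):
    """Pick the next board position from the reachable states, given a value
    table V mapping state -> estimated probability of winning."""
    if random.random() < epsilon:
        return random.choice(candidate_states)            # exploratory move (10% of the time)
    # greedy move: the reachable state with the largest V(s);
    # states not yet in the table default to the initial guess of 0.5
    return max(candidate_states, key=lambda s: V.get(s, 0.5))
```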
RL Learning Rule for Tic-Tac-Toe
• s – the state before our greedy move
• s′ – the state after our greedy move
• After each greedy move we "back up": increment V(s) toward V(s′):
  V(s) ← V(s) + α [ V(s′) − V(s) ]
• α is the step-size parameter, a small positive fraction, e.g., α = .1
• (Figure: a sequence of moves, with exploratory moves and backups marked)
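The backup rule above as a one-line sketch, under the same dict-based value-table assumption (states not yet in the table default to the initial estimate of 0.5):

```python
def td_backup(V, s, s_next, alpha=0.1):
    """Move V(s) a fraction alpha toward V(s'), the value of the state
    after the greedy move: V(s) <- V(s) + alpha * (V(s') - V(s))."""
    V[s] = V.get(s, 0.5) + alpha * (V.get(s_next, 0.5) - V.get(s, 0.5))
```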
How can we improve this T.T.T. player?
• Take advantage of symmetries
  – representation/generalization
• Do we need "random" moves? Why?
  – Do we always need a full 10%?
• Can we learn from "random" moves?
• …
Temporal difference learning
• Solution to the temporal credit assignment problem
• Replace the reward signal by the change in expected future reward
  – Prediction moves the rewards from the future as close to the actions as possible
  – Primary rewards such as sugar are replaced with secondary (or higher-order) rewards such as money
  – In the brain, dopamine ≈ temporal difference signal
  – Supervised learning is used to channel the information carried by predictive stimuli into learning
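In the more general setting with per-step rewards, the "change in expected future reward" is usually written as a TD error. A minimal TD(0) prediction sketch follows; the discount factor γ and the dict-based value table are assumptions rather than slide content.

```python
def td0_update(V, s, reward, s_next, alpha=0.1, gamma=0.9):
    """TD(0) prediction: the error is reward + gamma*V(s') - V(s);
    a positive error means things went better than expected
    (the dopamine-like temporal difference signal)."""
    td_error = reward + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error
```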
Reinforcement learning example: a maze
• (Figure: a maze with states Start, S2, S3, S4, S5, S7, S8, and Goal; arrows indicate the associative strength between two problem states, strength = line width)
• Start the maze … the first response leads from Start to S2
• The next state is chosen by randomly sampling from the possible next states, weighted by their associative strength
• Suppose the randomly sampled response leads to S3
• At S3, the choices lead to either S2, S4, or S7; S7 was picked (randomly)
• By chance, S3 was picked next, then the next response is S4
• S5 was chosen next (randomly), and the goal is reached
• The goal is reached: strengthen the associative connection between the goal state and the last response
• The next time S5 is reached, part of the associative strength is passed back to S4...
• Start the maze again … suppose that after a couple of moves we end up at S5 again
• S5 is now likely to lead to the Goal through the strengthened route; in reinforcement learning, strength is also passed back to the last state
• This paves the way for the next time through the maze
• After lots of restarts, the strongest connections trace the route from Start to Goal
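A rough simulation of the maze walkthrough above; the exact maze topology, the amount of strengthening, and the way strength is passed back one step per move are illustrative assumptions, not details given in the slides.

```python
import random

# Hypothetical maze topology (which states are reachable from each state);
# the real maze in the slides may differ.
maze = {
    "Start": ["S2"],
    "S2":    ["S3", "S8"],
    "S3":    ["S2", "S4", "S7"],
    "S4":    ["S3", "S5"],
    "S5":    ["S4", "Goal"],
    "S7":    ["S3", "S8"],
    "S8":    ["S2", "S7"],
}
# Associative strength of every transition (the "line width"), initially uniform.
strength = {(s, n): 1.0 for s, nexts in maze.items() for n in nexts}

def run_episode(alpha=0.5, goal_bonus=10.0):
    """One pass through the maze: sample each move in proportion to associative
    strength; strengthen the last response when the Goal is reached, and on every
    other move pass part of the downstream strength back to the transition taken."""
    state = "Start"
    while state != "Goal":
        nexts = maze[state]
        nxt = random.choices(nexts, weights=[strength[(state, n)] for n in nexts])[0]
        if nxt == "Goal":
            strength[(state, nxt)] += goal_bonus
        else:
            best_onward = max(strength[(nxt, m)] for m in maze[nxt])
            strength[(state, nxt)] += alpha * (best_onward - strength[(state, nxt)])
        state = nxt

for _ in range(200):
    run_episode()

# After lots of restarts, the strongest transitions trace the Start -> Goal path.
print(sorted(strength.items(), key=lambda kv: -kv[1])[:6])
```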
Stanford autonomous helicopter • https://www.youtube.com/watch?v=VCdxqn0fcnE
RL applications in robotics
• Robot learns to flip pancakes
• Autonomous spider learns to walk forward by reinforcement learning
• Reinforcement learning for a robotic soccer goalkeeper
Conclusion
• The Reinforcement Learning Problem
• Inside an RL agent
  – Policy
  – Reward
  – Value
  – Model
• Temporal difference learning