Reinforcement Learning
An (almost) quick (and very incomplete) introduction
Slides from David Silver, Dan Klein, Mausam, Dan Weld
Reinforcement Learning
At each time step t:
• Agent executes an action A_t
• Environment emits a reward R_t
• Agent transitions to state S_t
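A minimal sketch of this interaction loop in code, assuming the Gymnasium library and a random-action agent purely for illustration (the environment name and step budget are arbitrary choices, not from the slides):

```python
import gymnasium as gym  # assumed environment library for this sketch

env = gym.make("CartPole-v1")
state, _ = env.reset()
for t in range(200):
    action = env.action_space.sample()                            # agent executes an action A_t
    state, reward, terminated, truncated, _ = env.step(action)    # environment emits reward and next state
    if terminated or truncated:                                   # episode over: start a new one
        state, _ = env.reset()
env.close()
```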
Rat Example
• What if agent state = last 3 items in sequence?
• What if agent state = counts for lights, bells and levers?
• What if agent state = complete sequence?
Major Components of RL
An RL agent may include one or more of these components:
• Policy: agent's behaviour function
• Value function: how good is each state and/or action
• Model: agent's representation of the environment
Policy
• A policy is the agent's behaviour
• It is a map from state to action
• Deterministic policy: a = π(s)
• Stochastic policy: π(a|s) = P[A_t = a | S_t = s]
Value Function
• A value function is a prediction of future reward
• Used to evaluate the goodness/badness of states...
• ...and therefore to select between actions
Model
• A model predicts what the environment will do next
• It predicts the next state...
• ...and predicts the next (immediate) reward
Dimensions of RL
On Policy vs. Off Policy
• On Policy: Makes estimates based on a policy, and improves the policy based on those estimates.
  • Learning on the job.
  • e.g. SARSA
• Off Policy: Learn a policy while following another (or re-using experience from an old policy).
  • Looking over someone's shoulder.
  • e.g. Q-learning
Model-based vs. Model-free
• Model-based: Have/learn action models (i.e. transition probabilities).
  • Uses Dynamic Programming.
• Model-free: Skip them and directly learn what action to do when (without necessarily finding out the exact model of the action).
  • e.g. Q-learning
Markov Decision Process
• Set of states S = {s_i}
• Set of actions for each state A(s) = {a^s_i} (often independent of state)
• Transition model T(s → s' | a) = Pr(s' | a, s)
• Reward model R(s, a, s')
• Discount factor γ
MDP = ⟨S, A, T, R, γ⟩
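A minimal sketch of how the MDP tuple ⟨S, A, T, R, γ⟩ might be represented in code; the container and the toy two-state example below are illustrative assumptions, not part of the slides:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class MDP:
    states: List[str]
    actions: Dict[str, List[str]]            # A(s): actions available in each state
    T: Dict[Tuple[str, str, str], float]     # T[(s, a, s')] = Pr(s' | s, a)
    R: Dict[Tuple[str, str, str], float]     # R(s, a, s')
    gamma: float                             # discount factor

# Toy two-state MDP with made-up numbers, just to show the shape of the tuple.
toy = MDP(
    states=["s0", "s1"],
    actions={"s0": ["stay", "go"], "s1": ["stay", "go"]},
    T={("s0", "go", "s1"): 1.0, ("s0", "stay", "s0"): 1.0,
       ("s1", "go", "s0"): 1.0, ("s1", "stay", "s1"): 1.0},
    R={("s0", "go", "s1"): 1.0, ("s0", "stay", "s0"): 0.0,
       ("s1", "go", "s0"): 0.0, ("s1", "stay", "s1"): 1.0},
    gamma=0.9,
)
```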
Bellman Equation for Value Function
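For reference, the standard Bellman expectation equation for the state-value function under a policy π, written in the MDP notation above:

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} T(s' \mid s, a)\,\big[R(s, a, s') + \gamma\, v_\pi(s')\big]$$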
Bellman Equation for Action-Value Function
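For reference, the corresponding standard Bellman expectation equation for the action-value function:

$$q_\pi(s, a) = \sum_{s'} T(s' \mid s, a)\,\Big[R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\Big]$$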
Q vs V
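For reference, the two value functions are related by the standard identities

$$v_\pi(s) = \sum_{a} \pi(a \mid s)\, q_\pi(s, a), \qquad v_*(s) = \max_{a} q_*(s, a),$$

i.e. V averages Q over the policy's action choices, and the optimal V is the best Q at each state.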
Exploration vs Exploitation
• Restaurant Selection
  • Exploitation: Go to your favourite restaurant
  • Exploration: Try a new restaurant
• Online Banner Advertisements
  • Exploitation: Show the most successful advert
  • Exploration: Show a different advert
• Oil Drilling
  • Exploitation: Drill at the best known location
  • Exploration: Drill at a new location
• Game Playing
  • Exploitation: Play the move you believe is best
  • Exploration: Play an experimental move
ε-Greedy solution
• Simplest idea for ensuring continual exploration
• All m actions are tried with non-zero probability
• With probability 1 − ε choose the greedy action
• With probability ε choose an action at random
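A minimal sketch of ε-greedy action selection, assuming a tabular setting where `q_values` holds the Q-values of all actions in the current state (function and argument names are illustrative):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=None):
    """Pick an action epsilon-greedily from the Q-values of one state."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: uniform random action
    return int(np.argmax(q_values))               # exploit: current greedy action
```

Note that every one of the m actions is selected with probability at least ε/m, which gives the continual exploration the slide asks for.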
Off Policy Learning
• Evaluate target policy π(a|s) to compute v_π(s) or q_π(s,a) while following behaviour policy μ(a|s):
  {s_1, a_1, r_2, ..., s_T} ∼ μ
• Why is this important?
  • Learn from observing humans or other agents
  • Re-use experience generated from old policies π_1, π_2, ..., π_{t−1}
  • Learn about the optimal policy while following an exploratory policy
  • Learn about multiple policies while following one policy
Q-Learning
• We now consider off-policy learning of action-values Q(s,a)
• Next action is chosen using behaviour policy A_{t+1} ∼ μ(·|S_t)
• But we consider an alternative successor action A′ ∼ π(·|S_t)
• And update Q(S_t, A_t) towards the value of the alternative action
Q-Learning
• We now allow both behaviour and target policies to improve
• The target policy π is greedy w.r.t. Q(s,a)
• The behaviour policy μ is e.g. ε-greedy w.r.t. Q(s,a)
• The Q-learning target then simplifies (see below):
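For reference, with a greedy target policy the Q-learning target is the standard

$$R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a'),$$

which gives the familiar update rule

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,\big(R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)\big).$$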
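A minimal tabular Q-learning sketch built from the pieces above, assuming a Gymnasium-style environment with discrete, hashable states (the function name and hyperparameters are illustrative choices):

```python
import numpy as np
from collections import defaultdict

def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = np.random.default_rng()
    Q = defaultdict(lambda: np.zeros(env.action_space.n))

    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # Behaviour policy mu: epsilon-greedy w.r.t. the current Q
            if rng.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(Q[state]))

            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Target policy pi: greedy, so the target uses max over successor actions
            td_target = reward + (0.0 if terminated else gamma * np.max(Q[next_state]))
            Q[state][action] += alpha * (td_target - Q[state][action])
            state = next_state
    return Q
```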
Deep RL
• We seek a single agent which can solve any human-level task
• RL defines the objective
• DL gives the mechanism
• RL + DL = general intelligence (David Silver)
Function Approximators
Deep Q-Networks
• Q-learning diverges when using neural networks due to:
  • Correlations between samples
  • Non-stationary targets
Solution: Experience Replay
• Fancy biological analogy
• In reality, quite simple
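A minimal sketch of what experience replay looks like in code: a fixed-size buffer of transitions sampled uniformly at random. The class and defaults are illustrative, not the original DQN implementation:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state, done) transitions."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old transitions are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation between consecutive samples
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```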
Improving Information Extraction by Acquiring External Evidence with Reinforcement Learning
Karthik Narasimhan, Adam Yala, Regina Barzilay
CSAIL, MIT
Slides from Karthik Narasimhan
Why try to reason, when someone else can do it for you?
Doubts*
• Algo 1, line# 19: the process should end when "d" == "end_episode", not q. [Prachi] Error.
• The dimension of the match vector should equal the number of columns to be extracted, but Fig 3 has twice that number of dimensions. [Prachi] Error.
• Is RL the best approach? [Non-believers]
• Experience Replay [Anshul]. Hope it is clear now.
• Why is RL-Extract better than the meta-classifier? The explanation provided in the paper about a "long tail of noisy, irrelevant documents" is unclear. [Yash]
• The meta-classifier should also cut off at the top-20 results per search, like the RL system, to be completely fair. [Anshul]
* most mean questions
Discussions
• Experiments
  • People are happy!
• Queries
  • Cluster documents and learn queries [Yashoteja]
  • Many other query formulations [Surag (lowest-confidence entity), Barun (LSTM), Gagan (highest-confidence entity), DineshR]
  • Fixed set of queries [Akshay]
  • Simplicity. Search engines are robust.
• Reliance on news articles [Gagan]
  • Where else would you get news from?
• Domain limitations
  • Too narrow [Barun, Himanshu]. Domain-specific [Happy]. Small ontology [Akshay]
  • It is not Open IE. It is task-specific. Can be applied to any domain.
• Better meta-classifiers [Surag]
• Effect of more sophisticated RL algorithms (A3C, TRPO) [esp. if the action space grows with LSTM queries] on performance and training time.