Exploration Conscious Reinforcement Learning Revisited
Lior Shani*, Yonathan Efroni*, Shie Mannor
Technion – Israel Institute of Technology
Why?
• To learn a good policy, an RL agent must explore!
• However, it can cause hazardous behavior during training.
"Damn you, Exploration!" "I LOVE ε-GREEDY"
Shani, Efroni & Mannor, Exploration Conscious Reinforcement Learning Revisited, 12-Jun-19
Exploration Conscious Reinforcement Learning
• Objective: Find the optimal policy knowing that exploration might occur.
• For example, ε-greedy exploration (α = ε):

  π*_α ∈ argmax_{π ∈ Π} E^{(1−α)π + αν} [ Σ_{t=0}^∞ γ^t r(s_t, a_t) ]

• Solving the exploration-conscious problem = solving an MDP.
• We describe a bias-error sensitivity tradeoff in α.
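Since the exploration-conscious problem is itself an MDP, it can be solved with standard dynamic programming. A minimal sketch (not from the talk: the toy transition tensor `P`, reward matrix `R`, and the choice of a uniform exploration policy ν are all assumptions for illustration) of tabular value iteration with the α-adjusted Bellman operator:

```python
import numpy as np

# Toy 2-state, 2-action MDP; P[s, a, s'] and R[s, a] are made up for illustration.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma, alpha = 0.9, 0.1          # discount factor; probability of exploring
n_states, n_actions = R.shape

Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    # Next-state value under the mixture (1 - alpha) * greedy + alpha * uniform
    v_next = (1 - alpha) * Q.max(axis=1) + alpha * Q.mean(axis=1)
    Q = R + gamma * P @ v_next   # alpha-adjusted Bellman operator

greedy_policy = Q.argmax(axis=1)
```

Because the α-adjusted operator is still a γ-contraction, the iteration converges to the optimal exploration-conscious Q-function.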
Fixed Exploration Schemes (e.g. ε-greedy)

Choose greedy action a_greedy → Draw exploratory action a_act → Act a_act → Receive r, s′

• a_greedy ∈ argmax_a Q*_α(s, a)
• For α-greedy: a_act ← a_greedy w.p. 1 − α, otherwise a_act ∼ ν.
• Normally used information: (s, a_act, r, s′)
• Using information about the exploration process: (s, a_greedy, a_act, r, s′)
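The interaction step above can be sketched as follows (assumed names and a tabular Q; the uniform draw over actions stands in for the exploration policy ν). Returning both actions is what lets a learner use the richer tuple (s, a_greedy, a_act, r, s′):

```python
import numpy as np

def draw_action(Q, s, alpha, rng):
    """One alpha-greedy interaction step: choose greedy, then maybe explore."""
    a_greedy = int(np.argmax(Q[s]))            # a_greedy in argmax_a Q(s, a)
    if rng.random() < alpha:                   # explore w.p. alpha ...
        a_act = int(rng.integers(Q.shape[1]))  # ... uniformly over actions
    else:                                      # otherwise act greedily
        a_act = a_greedy
    return a_greedy, a_act

rng = np.random.default_rng(0)
Q = np.array([[0.2, 1.0], [0.5, 0.1]])
a_greedy, a_act = draw_action(Q, s=0, alpha=0.1, rng=rng)
```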
Two Approaches – Expected Approach

1. Update Q_α(s_t, a_t^act).
2. Expect that the agent might explore in the next state:

  Q_α(s_t, a_t^act) += η ( r_t + γ E_{a ∼ (1−α)greedy + αν} [ Q_α(s_{t+1}, a) ] − Q_α(s_t, a_t^act) )

• Calculating expectations can be hard.
• Requires sampling in the continuous case!
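In the tabular case the expectation is cheap to compute in closed form, which the following sketch uses (assumed names; a uniform exploration policy stands in for ν). The update is applied at the acted action, and the bootstrap target averages over the next-action mixture:

```python
import numpy as np

def expected_update(Q, s, a_act, r, s_next, alpha, gamma, eta):
    """One expected-approach update at (s, a_act)."""
    # Expected next-state value under (1 - alpha) * greedy + alpha * uniform
    v_next = (1 - alpha) * Q[s_next].max() + alpha * Q[s_next].mean()
    Q[s, a_act] += eta * (r + gamma * v_next - Q[s, a_act])

Q = np.zeros((2, 2))
expected_update(Q, s=0, a_act=1, r=1.0, s_next=1, alpha=0.1, gamma=0.9, eta=0.5)
```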
Two Approaches – Surrogate Approach

• Exploration is incorporated into the environment!

1. Update Q_α(s_t, a_t^greedy).
2. The reward and next state (r_t, s_{t+1}) are generated by the acted action a_t^act.

  Q_α(s_t, a_t^greedy) += η ( r_t + γ Q_α(s_{t+1}, a_{t+1}^greedy) − Q_α(s_t, a_t^greedy) )

• NO NEED TO SAMPLE!
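A sketch of the surrogate update (assumed names, tabular setting). The reward and next state come from the acted action, but the update sits at the greedy action and bootstraps from the next greedy action, so no expectation over the exploration distribution is needed:

```python
import numpy as np

def surrogate_update(Q, s, a_greedy, r, s_next, a_next_greedy, gamma, eta):
    """One surrogate-approach update at (s, a_greedy); r and s_next were
    generated by the (possibly exploratory) acted action."""
    target = r + gamma * Q[s_next, a_next_greedy]
    Q[s, a_greedy] += eta * (target - Q[s, a_greedy])

Q = np.zeros((2, 2))
surrogate_update(Q, s=0, a_greedy=0, r=1.0, s_next=1, a_next_greedy=1,
                 gamma=0.9, eta=0.5)
```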
Deep RL Experimental Results
[Figures: agent performance during training and at evaluation]
Summary
• We define Exploration Conscious RL and analyze its properties.
• Exploration Conscious RL can improve performance in both the training and evaluation regimes.
• Conclusion: Exploration-Conscious RL, and specifically the Surrogate approach, can easily improve a variety of RL algorithms.

SEE YOU AT POSTER #90