Reinforcement Learning with a Corrupted Reward Channel
Tom Everitt, Victoria Krakovna, Laurent Orseau, Marcus Hutter, Shane Legg
Australian National University and Google DeepMind
IJCAI 2017 and arXiv
Motivation
● We will need to control Human-Level+ AI
● By identifying problems with various AI paradigms, we can focus research on
  – the right paradigms
  – crucial problems within promising paradigms
The Wireheading Problem
● Future RL agent hijacks its reward signal (wireheading)
● CoastRunners agent drives in a small circle (misspecified reward function)
● RL agent shortcuts its reward sensor (sensory error)
● Cooperative Inverse RL agent misperceives the human's action (adversarial counterexample)
Formalisation
● Reinforcement Learning is traditionally modelled as a Markov Decision Process (MDP) ⟨S, A, T, R⟩
● This fails to model situations where there is a difference between
  – the true reward Ṙ
  – the observed reward R̂
● Such situations can be modelled with a Corrupt Reward MDP (CRMDP) ⟨S, A, T, Ṙ, C⟩, where the corruption function C turns the true reward into the observed reward, R̂(s) = C(s, Ṙ(s)) (a toy version is sketched below)
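As a rough illustration (not the authors' code, and with hypothetical names), the sketch below encodes a tiny CRMDP in Python; the only addition over a plain MDP is that the reward the agent observes is the true reward passed through a per-state corruption function.

```python
# Minimal CRMDP sketch (hypothetical names, not the paper's implementation).
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = int
Action = int

@dataclass
class CRMDP:
    states: List[State]                            # S
    actions: List[Action]                          # A
    transition: Dict[Tuple[State, Action], State]  # T (deterministic, for simplicity)
    true_reward: Callable[[State], float]          # true reward (hidden from the agent)
    corrupt: Callable[[State, float], float]       # corruption function C

    def observed_reward(self, s: State) -> float:
        """Reward the agent actually sees in state s."""
        return self.corrupt(s, self.true_reward(s))

# Toy example: state 2 is genuinely good; state 3 is a "wireheading" state
# whose observed reward is maximal even though its true reward is zero.
toy = CRMDP(
    states=[0, 1, 2, 3],
    actions=[0, 1],
    transition={(s, a): (s + 1) % 4 for s in range(4) for a in range(2)},
    true_reward=lambda s: 1.0 if s == 2 else 0.0,
    corrupt=lambda s, r: 1.0 if s == 3 else r,
)
```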
Simplifying assumptions
Good intentions
● A natural idea: optimise the true reward, using the observed reward as evidence
● Theorem: such an agent can still suffer near-maximal regret
● Good intentions are not enough!
Avoiding Over-Optimisation
● A quantilising agent picks a state/policy uniformly at random among those whose observed reward is above a threshold (sketched below)
● Theorem: when at most q states are corrupt, there exists a threshold for which the quantilising agent's average regret is small
● Avoiding over-optimisation helps!
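A sketch of the quantilisation idea on the toy CRMDP above (hypothetical helper names, not the paper's algorithm verbatim): rather than deterministically picking the state with the highest observed reward, the agent samples uniformly among states whose observed reward clears a threshold, so a few corrupt states only attract a small share of the probability mass.

```python
import random

def quantilise(crmdp: CRMDP, threshold: float) -> State:
    """Pick uniformly at random among states whose observed reward >= threshold."""
    candidates = [s for s in crmdp.states if crmdp.observed_reward(s) >= threshold]
    return random.choice(candidates)

# A greedy agent always picks the corrupt state 3 (observed reward 1, true reward 0).
# The quantilising agent with threshold 1.0 splits its choices between
# states 2 and 3, so its average true reward is about 0.5 instead of 0.
picks = [quantilise(toy, threshold=1.0) for _ in range(10_000)]
print(sum(toy.true_reward(s) for s in picks) / len(picks))  # ~0.5
```

The intuition behind the theorem is visible here: as long as corrupt states make up only a small fraction of the above-threshold states, they contribute only a small fraction of the quantilising agent's regret.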
Richer Information: Reward-Observation Graphs
● RL: each state "self-estimates" its own reward
● Decoupled RL: reward information about a state can also be observed from other states
  – Cooperative IRL
  – Learning values from stories
  – Learning from Human Preferences
Learning the true reward
● Combine decoupled reward observations by majority vote, or anchor them in a safe state whose reward is known to be uncorrupted (majority vote is sketched below)
● Examples: Cooperative Inverse RL, Learning from Human Preferences, Learning values from stories
● Richer information helps!
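The majority-vote idea can be sketched as follows (a toy construction with hypothetical names, not the paper's formal setup): in decoupled RL, several states report the reward of a given state, and as long as more than half of those reports are honest, an elementwise majority vote recovers the true reward.

```python
from collections import Counter
from typing import Dict

def majority_vote(reports: Dict[int, Dict[int, float]]) -> Dict[int, float]:
    """Combine decoupled reward reports by majority vote.

    `reports` maps each observing state to {observed state: reported reward}.
    For every state whose reward is reported correctly by a majority of
    observers, the vote returns that true reward.
    """
    estimates = {}
    all_states = {s for rep in reports.values() for s in rep}
    for s in all_states:
        votes = Counter(rep[s] for rep in reports.values() if s in rep)
        estimates[s] = votes.most_common(1)[0][0]
    return estimates

# Observers 1 and 2 report honestly; observer 3's channel is corrupt and
# inflates the reward of state 3. The honest majority outvotes it, so
# state 3 is estimated at 0.0 despite the corrupt report.
reports = {
    1: {0: 0.0, 1: 0.0, 2: 1.0, 3: 0.0},
    2: {0: 0.0, 1: 0.0, 2: 1.0, 3: 0.0},
    3: {0: 0.0, 1: 0.0, 2: 1.0, 3: 1.0},
}
print(majority_vote(reports))
```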
Experiments
● AIXIjs: http://aslanides.io/aixijs/demo.html
● [Plot: true reward vs. observed reward]
Key Takeaways
● Wireheading: observed reward ≠ true reward
● Good intentions are not enough
● Either:
  – avoid over-optimisation
  – give the agent rich data to learn from (CIRL, stories, human preferences)
● Experiments available online