Lecture 15: Batch RL (Emma Brunskill, CS234 Reinforcement Learning)

Lecture 15: Batch RL. Emma Brunskill, CS234 Reinforcement Learning, Winter 2019. Slides drawn from Philip Thomas with modifications.

Class Structure
• Last time: Meta Reinforcement Learning
• This time: Batch RL
• Next time: Quiz

A Scientific Experiment

What Should We Do For a New Student?

Involves Counterfactual Reasoning

Involves Generalization

Batch Reinforcement Learning

Batch RL

The Problem
• If you apply an existing method, do you have confidence that it will work?

A property of many real applications
• Deploying "bad" policies can be costly or dangerous

Deploying bad policies can be costly

Deploying bad policies can be dangerous

What property should a safe batch reinforcement learning algorithm have?
• Given past experience from current policy/policies, produce a new policy
• “Guarantee that with probability at least $1 - \delta$, will not change your policy to one that is worse than the current policy.”
• You get to choose $\delta$
• Guarantee not contingent on the tuning of any hyperparameters

Table of Contents
1 Notation
2 Create a safe batch reinforcement learning algorithm
  • Off-policy policy evaluation (OPE)
  • High-confidence off-policy policy evaluation (HCOPE)
  • Safe policy improvement (SPI)

Notation
• Policy $\pi$: $\pi(a \mid s) = P(a_t = a \mid s_t = s)$
• Trajectory: $T = (s_1, a_1, r_1, s_2, a_2, r_2, \dots, s_L, a_L, r_L)$
• Historical data: $D = \{T_1, T_2, \dots, T_n\}$
• Historical data from behavior policy, $\pi_b$
• Objective:
$$V^{\pi} = E\left[\sum_{t=1}^{L} \gamma^t R_t \,\middle|\, \pi\right]$$
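As a minimal sketch of this notation in Python: the `Step` and `Trajectory` types, the function name `discounted_return`, and the toy data are assumptions for illustration, since the slides do not fix any representation. The discount exponent starts at t = 1 to match the objective as written on the slide.

```python
from typing import List, Tuple

# One step of a trajectory: (state, action, reward).
# Integer states and actions are an assumption made for this sketch.
Step = Tuple[int, int, float]
Trajectory = List[Step]


def discounted_return(trajectory: Trajectory, gamma: float) -> float:
    """Sum of gamma^t * r_t with t starting at 1, as in the objective."""
    return sum(gamma ** t * r for t, (_, _, r) in enumerate(trajectory, start=1))


# Historical data D: n trajectories collected under the behavior policy.
D: List[Trajectory] = [
    [(0, 1, 1.0), (1, 0, 0.0), (2, 1, 2.0)],
    [(0, 0, 0.0), (1, 1, 1.0)],
]

returns = [discounted_return(T, gamma=0.5) for T in D]
```

Averaging `returns` would estimate the value of the behavior policy itself; estimating the value of a different policy from the same data is the off-policy evaluation problem treated in the rest of the lecture.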

Safe batch reinforcement learning algorithm
• Reinforcement learning algorithm, $A$
• Historical data, $D$, which is a random variable
• Policy produced by the algorithm, $A(D)$, which is a random variable
• A safe batch reinforcement learning algorithm, $A$, satisfies:
$$\Pr\left(V^{A(D)} \ge V^{\pi_b}\right) \ge 1 - \delta$$
or, in general,
$$\Pr\left(V^{A(D)} \ge V_{\min}\right) \ge 1 - \delta$$

Create a safe batch reinforcement learning algorithm
• Off-policy policy evaluation (OPE)
  • For any evaluation policy, $\pi_e$, convert historical data, $D$, into $n$ independent and unbiased estimates of $V^{\pi_e}$
• High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the $n$ independent and unbiased estimates of $V^{\pi_e}$ into a $1 - \delta$ confidence lower bound on $V^{\pi_e}$
• Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe batch reinforcement learning algorithm

Off-policy policy evaluation (OPE)

Importance Sampling

Importance Sampling
$$\mathrm{IS}(D) = \frac{1}{n} \sum_{i=1}^{n} \left(\prod_{t=1}^{L} \frac{\pi_e(a_t^i \mid s_t^i)}{\pi_b(a_t^i \mid s_t^i)}\right) \left(\sum_{t=1}^{L} \gamma^t R_t^i\right)$$
$$E[\mathrm{IS}(D)] = V^{\pi_e}$$
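The estimator above could be sketched in Python as follows. The function names, the `(state, action, reward)` tuple layout, and the policy signature `pi(action, state) -> probability` are assumptions made for this example. The sketch also assumes that the behavior policy gives nonzero probability to every logged action, so the ratio is well defined.

```python
from typing import Callable, List, Tuple

Step = Tuple[int, int, float]          # (state, action, reward); assumed layout
Policy = Callable[[int, int], float]   # pi(action, state) -> probability; assumed


def importance_weighted_returns(
    data: List[List[Step]], pi_e: Policy, pi_b: Policy, gamma: float
) -> List[float]:
    """One estimate per trajectory: (product of ratios) * (discounted return)."""
    estimates = []
    for trajectory in data:
        weight = 1.0
        ret = 0.0
        for t, (s, a, r) in enumerate(trajectory, start=1):
            weight *= pi_e(a, s) / pi_b(a, s)  # assumes pi_b(a, s) > 0
            ret += gamma ** t * r
        estimates.append(weight * ret)
    return estimates


def is_estimate(
    data: List[List[Step]], pi_e: Policy, pi_b: Policy, gamma: float
) -> float:
    """IS(D): the mean of the per-trajectory estimates."""
    estimates = importance_weighted_returns(data, pi_e, pi_b, gamma)
    return sum(estimates) / len(estimates)


# Toy problem (hypothetical): one state, two actions, one step per trajectory.
def pi_b(a: int, s: int) -> float:
    return 0.5                          # behavior policy: uniform over actions


def pi_e(a: int, s: int) -> float:
    return 1.0 if a == 1 else 0.0       # evaluation policy: always action 1


data = [[(0, 1, 1.0)], [(0, 0, 0.0)]]
value = is_estimate(data, pi_e, pi_b, gamma=1.0)
```

Keeping the per-trajectory estimates in their own function is a deliberate choice: those n values are exactly what a concentration inequality needs as input.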

High-confidence off-policy policy evaluation (HCOPE)

Hoeffding’s inequality
• Let $X_1, \dots, X_n$ be $n$ independent identically distributed random variables such that $X_i \in [0, b]$
• Then with probability at least $1 - \delta$:
$$E[X_i] \ge \frac{1}{n} \sum_{i=1}^{n} X_i - b \sqrt{\frac{\ln(1/\delta)}{2n}},$$
where $X_i = w_i \sum_{t=1}^{L} \gamma^t R_t^i$ in our case, with $w_i = \prod_{t=1}^{L} \frac{\pi_e(a_t^i \mid s_t^i)}{\pi_b(a_t^i \mid s_t^i)}$ the importance weight of trajectory $i$, so that $\frac{1}{n} \sum_{i=1}^{n} X_i = \mathrm{IS}(D)$.
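A minimal sketch of the resulting lower bound, assuming the per-trajectory importance-weighted returns have already been computed. The function name `hoeffding_lower_bound` and the toy numbers are hypothetical. The bound `b` must be a known upper limit on every X_i; here it is simply assumed to be 2.0.

```python
import math
from typing import Sequence


def hoeffding_lower_bound(x: Sequence[float], b: float, delta: float) -> float:
    """1 - delta confidence lower bound on E[X_i] for i.i.d. X_i in [0, b]."""
    n = len(x)
    sample_mean = sum(x) / n
    return sample_mean - b * math.sqrt(math.log(1.0 / delta) / (2.0 * n))


# Hypothetical importance-weighted returns, each assumed to lie in [0, 2].
weighted_returns = [2.0, 0.0, 2.0, 0.0]
bound = hoeffding_lower_bound(weighted_returns, b=2.0, delta=0.05)
```

With only four samples the penalty term dominates and the bound falls below zero, which illustrates why importance weights with a large range `b` make this bound loose unless n is large.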

Safe policy improvement (SPI)

Off-policy policy evaluation
• Importance sampling (IS):
$$\mathrm{IS}(D) = \frac{1}{n} \sum_{i=1}^{n} \left(\prod_{t=1}^{L} \frac{\pi_e(a_t^i \mid s_t^i)}{\pi_b(a_t^i \mid s_t^i)}\right) \left(\sum_{t=1}^{L} \gamma^t R_t^i\right)$$
• Per-decision importance sampling (PDIS):
$$\mathrm{PDIS}(D) = \sum_{t=1}^{L} \gamma^t \frac{1}{n} \sum_{i=1}^{n} \left(\prod_{\tau=1}^{t} \frac{\pi_e(a_\tau^i \mid s_\tau^i)}{\pi_b(a_\tau^i \mid s_\tau^i)}\right) R_t^i$$
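The difference between the two estimators is easiest to see in code: PDIS weights the reward at step t only by the ratios up to step t. The sketch below is one possible reading of the formula, with a hypothetical function name, tuple layout, and toy policies. It assumes the behavior policy assigns nonzero probability to every logged action.

```python
from typing import Callable, List, Tuple

Step = Tuple[int, int, float]          # (state, action, reward); assumed layout
Policy = Callable[[int, int], float]   # pi(action, state) -> probability; assumed


def pdis_estimate(
    data: List[List[Step]], pi_e: Policy, pi_b: Policy, gamma: float
) -> float:
    """Per-decision importance sampling estimate of the evaluation policy's value."""
    total = 0.0
    for trajectory in data:
        weight = 1.0  # running product of ratios for steps 1..t
        for t, (s, a, r) in enumerate(trajectory, start=1):
            weight *= pi_e(a, s) / pi_b(a, s)  # assumes pi_b(a, s) > 0
            total += gamma ** t * weight * r
    return total / len(data)


# Toy problem (hypothetical): one state, two actions, two steps per trajectory.
def pi_b(a: int, s: int) -> float:
    return 0.5                          # behavior policy: uniform over actions


def pi_e(a: int, s: int) -> float:
    return 1.0 if a == 1 else 0.0       # evaluation policy: always action 1


data = [
    [(0, 1, 1.0), (0, 0, 5.0)],  # second action has probability 0 under pi_e
    [(0, 1, 1.0), (0, 1, 1.0)],
]
value = pdis_estimate(data, pi_e, pi_b, gamma=1.0)
```

In the first trajectory the step-1 reward still contributes, because its weight depends only on the first action. Ordinary IS would multiply that trajectory's entire return by zero.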

Off-policy policy evaluation (revisited)
• Importance sampling (IS):
$$\mathrm{IS}(D) = \frac{1}{n} \sum_{i=1}^{n} w_i \sum_{t=1}^{L} \gamma^t R_t^i$$
where $w_i = \prod_{t=1}^{L} \frac{\pi_e(a_t^i \mid s_t^i)}{\pi_b(a_t^i \mid s_t^i)}$ is the importance weight of trajectory $i$.
• Weighted importance sampling (WIS):
$$\mathrm{WIS}(D) = \frac{1}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i \sum_{t=1}^{L} \gamma^t R_t^i$$

Off-policy policy evaluation (revisited)
• Weighted importance sampling (WIS):
$$\mathrm{WIS}(D) = \frac{1}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i \sum_{t=1}^{L} \gamma^t R_t^i$$
• Biased. When $n = 1$, $E[\mathrm{WIS}] = V^{\pi_b}$
• Strongly consistent estimator of $V^{\pi_e}$
  • i.e. $\Pr\left(\lim_{n \to \infty} \mathrm{WIS}(D) = V^{\pi_e}\right) = 1$
  • If
    • Finite horizon
    • One behavior policy, or bounded rewards
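A hedged sketch of WIS, again with a hypothetical function name, tuple layout, and toy policies. The one-trajectory example shows where the bias comes from: with n = 1 the weight cancels, so the estimate is just the return observed under the behavior policy.

```python
from typing import Callable, List, Tuple

Step = Tuple[int, int, float]          # (state, action, reward); assumed layout
Policy = Callable[[int, int], float]   # pi(action, state) -> probability; assumed


def wis_estimate(
    data: List[List[Step]], pi_e: Policy, pi_b: Policy, gamma: float
) -> float:
    """Weighted importance sampling: normalize by the sum of the weights."""
    weights, returns = [], []
    for trajectory in data:
        w, g = 1.0, 0.0
        for t, (s, a, r) in enumerate(trajectory, start=1):
            w *= pi_e(a, s) / pi_b(a, s)  # assumes pi_b(a, s) > 0
            g += gamma ** t * r
        weights.append(w)
        returns.append(g)
    total_weight = sum(weights)
    if total_weight == 0.0:
        raise ValueError("all importance weights are zero; WIS is undefined")
    return sum(w * g for w, g in zip(weights, returns)) / total_weight


def pi_b(a: int, s: int) -> float:
    return 0.5                          # behavior policy: uniform over actions


def pi_e(a: int, s: int) -> float:
    return 1.0 if a == 1 else 0.0       # evaluation policy: always action 1


# With a single trajectory the weight cancels and WIS returns the raw return.
single = wis_estimate([[(0, 1, 3.0)]], pi_e, pi_b, gamma=1.0)
```

Raising an error when every weight is zero is a design choice of this sketch. The slides do not say how that case should be handled.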

Off-policy policy evaluation (revisited)
• Weighted per-decision importance sampling
  • Also called consistent weighted per-decision importance sampling
  • A fun exercise!

Control variates
• Given: $X$
• Estimate: $\mu = E[X]$
• $\hat{\mu} = X$
• Unbiased: $E[\hat{\mu}] = E[X] = \mu$
• Variance: $\mathrm{Var}(\hat{\mu}) = \mathrm{Var}(X)$
