Lecture 12: Batch RL
Emma Brunskill
CS234 Reinforcement Learning, Winter 2018
Slides drawn from Philip Thomas, with modifications
Class Structure
• Last time: Fast Reinforcement Learning / Exploration and Exploitation
• This time: Batch RL
• Next time: Monte Carlo Tree Search
Table of Contents
1 What makes an RL algorithm safe?
2 Notation
3 Create a safe batch reinforcement learning algorithm
  • Off-policy policy evaluation (OPE)
  • High-confidence off-policy policy evaluation (HCOPE)
  • Safe policy improvement (SPI)
What does it mean for a reinforcement learning algorithm to be safe?
Changing the objective
Changing the objective
• Policy 1:
  • Reward = 0 with probability 0.999999
  • Reward = $10^9$ with probability $1 - 0.999999$ (i.e., $10^{-6}$)
  • Expected reward approximately 1000
• Policy 2:
  • Reward = 999 with probability 0.5
  • Reward = 1000 with probability 0.5
  • Expected reward 999.5
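A quick check of the arithmetic (a minimal Python sketch; the probabilities and reward values are exactly the ones above):

```python
# Expected reward of each policy from the example above.
policy_1 = [(0.999999, 0.0), (1 - 0.999999, 1e9)]   # (probability, reward)
policy_2 = [(0.5, 999.0), (0.5, 1000.0)]

expected_1 = sum(p * r for p, r in policy_1)   # ~1000.0
expected_2 = sum(p * r for p, r in policy_2)   # 999.5
print(expected_1, expected_2)
```

Policy 1 wins on expected reward by a small margin, yet it returns 0 on almost every episode; this is the motivation for changing the objective to something risk-sensitive rather than purely maximizing expected return.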
Another notion of safety
Another notion of safety (Munos et al.)
Another notion of safety
The Problem
• If you apply an existing method, do you have confidence that it will work?
Reinforcement learning success
A property of many real applications
• Deploying "bad" policies can be costly or dangerous
Deploying bad policies can be costly
Deploying bad policies can be dangerous
What property should a safe batch reinforcement learning algorithm have?
• Given past experience from the current policy/policies, produce a new policy
• "Guarantee that with probability at least $1 - \delta$, it will not change your policy to one that is worse than the current policy."
• You get to choose $\delta$
• The guarantee is not contingent on the tuning of any hyperparameters
Table of Contents
1 What makes an RL algorithm safe?
2 Notation
3 Create a safe batch reinforcement learning algorithm
  • Off-policy policy evaluation (OPE)
  • High-confidence off-policy policy evaluation (HCOPE)
  • Safe policy improvement (SPI)
Notation
• Policy $\pi$: $\pi(a \mid s) = \Pr(a_t = a \mid s_t = s)$
• History: $H = (s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_L, a_L, r_L)$
• Historical data: $D = \{H_1, H_2, \ldots, H_n\}$
• Historical data is generated by a behavior policy, $\pi_b$
• Objective: $V^\pi = \mathbb{E}\!\left[\sum_{t=1}^{L} \gamma^t R_t \,\middle|\, \pi\right]$
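A minimal sketch of how this notation might map onto data structures (the Python types and function names are illustrative assumptions, not from the slides):

```python
from typing import List, Tuple

# A history H is a list of (state, action, reward) triples of length L.
History = List[Tuple[int, int, float]]

def discounted_return(history: History, gamma: float) -> float:
    """sum_{t=1}^{L} gamma^t * r_t for one trajectory (t is 1-indexed, as on the slide)."""
    return sum(gamma ** (t + 1) * r for t, (_, _, r) in enumerate(history))

def monte_carlo_value(histories: List[History], gamma: float) -> float:
    """On-policy Monte Carlo estimate of V^pi from trajectories generated by pi itself."""
    return sum(discounted_return(h, gamma) for h in histories) / len(histories)
```

The on-policy estimate above only applies when $D$ was generated by the policy being evaluated; the rest of the lecture is about estimating $V^{\pi_e}$ when $D$ came from a different behavior policy $\pi_b$.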
Safe batch reinforcement learning algorithm
• Reinforcement learning algorithm, $A$
• Historical data, $D$, which is a random variable
• Policy produced by the algorithm, $A(D)$, which is a random variable
• A safe batch reinforcement learning algorithm $A$ satisfies:
$$\Pr\left(V^{A(D)} \geq V^{\pi_b}\right) \geq 1 - \delta$$
or, in general,
$$\Pr\left(V^{A(D)} \geq V_{\min}\right) \geq 1 - \delta$$
Table of Contents
1 What makes an RL algorithm safe?
2 Notation
3 Create a safe batch reinforcement learning algorithm
  • Off-policy policy evaluation (OPE)
  • High-confidence off-policy policy evaluation (HCOPE)
  • Safe policy improvement (SPI)
Create a safe batch reinforcement learning algorithm
• Off-policy policy evaluation (OPE)
  • For any evaluation policy $\pi_e$, convert the historical data $D$ into $n$ independent and unbiased estimates of $V^{\pi_e}$
• High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the $n$ independent and unbiased estimates of $V^{\pi_e}$ into a $1 - \delta$ confidence lower bound on $V^{\pi_e}$
• Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe batch reinforcement learning algorithm, $A$
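A hedged sketch of how these three pieces could be glued together (the estimator and bound are passed in as functions; all names here are placeholders rather than the exact algorithm from the lecture):

```python
from typing import Callable, Sequence

def safe_policy_improvement(
    D: list,                       # historical trajectories collected with pi_b
    pi_b,                          # behavior policy (returned unchanged if nothing is provably better)
    pi_candidates: Sequence,       # candidate evaluation policies to consider
    v_pi_b: float,                 # performance (or lower bound) of the behavior policy
    delta: float,                  # allowed failure probability
    ope_estimates: Callable,       # (D, pi_e, pi_b) -> list of n unbiased estimates of V^{pi_e}
    lower_bound: Callable,         # (estimates, delta) -> 1 - delta confidence lower bound
):
    """Return a candidate only if its 1 - delta lower bound beats v_pi_b; otherwise keep pi_b."""
    best, best_bound = pi_b, v_pi_b
    for pi_e in pi_candidates:
        bound = lower_bound(ope_estimates(D, pi_e, pi_b), delta)
        if bound > best_bound:
            best, best_bound = pi_e, bound
    return best
```

In the full approach the data is typically split, so that candidate policies are selected on one part and the final safety test uses held-out trajectories; this sketch glosses over that detail.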
Off-policy policy evaluation (OPE)
Importance Sampling (Reminder)
$$\text{IS}(D) = \frac{1}{n} \sum_{i=1}^{n} \left( \prod_{t=1}^{L} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} \right) \left( \sum_{t=1}^{L} \gamma^t R_t^i \right)$$
$$\mathbb{E}[\text{IS}(D)] = V^{\pi_e}$$
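A numpy sketch of this estimator under the same assumed trajectory format as above (policies are represented as functions returning the probability of an action in a state; these are assumptions for illustration):

```python
import numpy as np

def is_estimate_per_trajectory(history, pi_e, pi_b, gamma):
    """One unbiased importance-sampling estimate of V^{pi_e} from a trajectory collected under pi_b.

    history: list of (s, a, r) triples
    pi_e, pi_b: functions (a, s) -> probability of taking action a in state s
    """
    weight = np.prod([pi_e(a, s) / pi_b(a, s) for (s, a, _) in history])
    ret = sum(gamma ** (t + 1) * r for t, (_, _, r) in enumerate(history))
    return weight * ret

def is_estimator(D, pi_e, pi_b, gamma):
    """IS(D): the average of the per-trajectory estimates over the n histories in D."""
    return float(np.mean([is_estimate_per_trajectory(h, pi_e, pi_b, gamma) for h in D]))
```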
Create a safe batch reinforcement learning algorithm
• Off-policy policy evaluation (OPE)
  • For any evaluation policy $\pi_e$, convert the historical data $D$ into $n$ independent and unbiased estimates of $V^{\pi_e}$
• High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the $n$ independent and unbiased estimates of $V^{\pi_e}$ into a $1 - \delta$ confidence lower bound on $V^{\pi_e}$
• Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe batch reinforcement learning algorithm, $A$
High-confidence off-policy policy evaluation (HCOPE)
Hoeffding's inequality
• Let $X_1, \ldots, X_n$ be $n$ independent, identically distributed random variables such that $X_i \in [0, b]$
• Then with probability at least $1 - \delta$:
$$\mathbb{E}[X_i] \geq \frac{1}{n} \sum_{i=1}^{n} X_i - b \sqrt{\frac{\ln(1/\delta)}{2n}},$$
where $X_i = w_i \sum_{t=1}^{L} \gamma^t R_t^i$ in our case, with $w_i$ the importance weight of history $i$.
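A small numpy sketch of the resulting high-confidence lower bound (it assumes the estimates are known to lie in $[0, b]$):

```python
import numpy as np

def hoeffding_lower_bound(estimates, delta, b):
    """1 - delta confidence lower bound on E[X_i] for i.i.d. X_i in [0, b].

    estimates: the n per-trajectory importance-sampling estimates of V^{pi_e}
    """
    x = np.asarray(estimates, dtype=float)
    n = len(x)
    return x.mean() - b * np.sqrt(np.log(1.0 / delta) / (2.0 * n))
```

For example, `hoeffding_lower_bound(is_estimates, delta=0.05, b=b_max)` would give a 95% confidence lower bound on $V^{\pi_e}$, provided every estimate really is bounded by `b_max`.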
Safe policy improvement (SPI)
Create a safe batch reinforcement learning algorithm
• Off-policy policy evaluation (OPE)
  • For any evaluation policy $\pi_e$, convert the historical data $D$ into $n$ independent and unbiased estimates of $V^{\pi_e}$
• High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the $n$ independent and unbiased estimates of $V^{\pi_e}$ into a $1 - \delta$ confidence lower bound on $V^{\pi_e}$
• Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe batch reinforcement learning algorithm, $A$
WON'T WORK! (The importance weights can make the range $b$ of the estimates enormous, so the Hoeffding bound above becomes far too loose to be useful.)
Off-policy policy evaluation (revisited)
• Importance sampling (IS):
$$\text{IS}(D) = \frac{1}{n} \sum_{i=1}^{n} \left( \prod_{t=1}^{L} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} \right) \left( \sum_{t=1}^{L} \gamma^t R_t^i \right)$$
• Per-decision importance sampling (PDIS):
$$\text{PDIS}(D) = \frac{1}{n} \sum_{i=1}^{n} \sum_{t=1}^{L} \gamma^t \left( \prod_{\tau=1}^{t} \frac{\pi_e(a_\tau \mid s_\tau)}{\pi_b(a_\tau \mid s_\tau)} \right) R_t^i$$
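A numpy sketch of PDIS with the same assumed trajectory format as before: each reward is weighted only by the likelihood ratios of the actions taken up to and including that time step, which typically gives lower-variance estimates than full-trajectory IS:

```python
import numpy as np

def pdis_estimate_per_trajectory(history, pi_e, pi_b, gamma):
    """Per-decision importance-sampling estimate of V^{pi_e} from one trajectory collected under pi_b."""
    total, weight = 0.0, 1.0
    for t, (s, a, r) in enumerate(history):        # t = 0 corresponds to time step 1
        weight *= pi_e(a, s) / pi_b(a, s)           # ratios only up to the current step
        total += gamma ** (t + 1) * weight * r
    return total

def pdis_estimator(D, pi_e, pi_b, gamma):
    """PDIS(D): average of the per-trajectory estimates over the n histories in D."""
    return float(np.mean([pdis_estimate_per_trajectory(h, pi_e, pi_b, gamma) for h in D]))
```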