Safe Policy Improvement with Baseline Bootstrapping


  1. Safe Policy Improvement with Baseline Bootstrapping. Romain Laroche, Paul Trichelair, Rémi Tachet des Combes

  2. Problem setting: the batch setting
     • Fixed dataset, no direct interaction with the environment.
     • Access to the behavioural policy, called the baseline.
     • Objective: improve the baseline with high probability.
     • Commonly encountered in real-world applications, e.g. distributed systems and long trajectories.

  3. Contributions
     Novel batch RL algorithm: SPIBB
     • SPIBB comes with reliability guarantees in finite MDPs.
     • SPIBB is as computationally efficient as classic RL.
     Finite MDPs benchmark
     • Extensive benchmark of existing algorithms.
     • Empirical analysis on random MDPs and baselines.
     Infinite MDPs benchmark
     • Model-free SPIBB for use with function approximation.
     • First deep RL algorithm reliable in the batch setting.

  4. Robust Markov Decision Processes [Iyengar, 2005; Nilim and El Ghaoui, 2005]
     • The true environment M* = ⟨X, A, P*, R*, γ⟩ is unknown.
     • The Maximum Likelihood Estimation (MLE) MDP is built from counts: M̂ = ⟨X, A, P̂, R̂, γ⟩.
     • Robust MDP set Ξ(M̂, e): M* ∈ Ξ(M̂, e) with probability at least 1 − δ.
     • The error function e(x, a) is derived from concentration bounds.
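As a concrete illustration of the MLE MDP above, here is a minimal Python sketch (not the paper's code) that estimates P̂, R̂, and the counts N_D(x, a) from a batch of (x, a, r, x') transitions; the function name and the uniform fallback for unvisited pairs are my own assumptions.

```python
import numpy as np

def mle_mdp(dataset, n_states, n_actions):
    """Build the MLE MDP from a list of (x, a, r, x_next) transitions."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sums = np.zeros((n_states, n_actions))
    for x, a, r, x_next in dataset:
        counts[x, a, x_next] += 1.0
        reward_sums[x, a] += r

    n_d = counts.sum(axis=2)  # N_D(x, a): state-action counts
    # P_hat(x'|x, a): empirical transition frequencies; uniform where N_D(x, a) = 0
    # (an arbitrary choice for illustration).
    p_hat = np.divide(counts, n_d[..., None],
                      out=np.full_like(counts, 1.0 / n_states),
                      where=n_d[..., None] > 0)
    # R_hat(x, a): empirical mean reward; zero where N_D(x, a) = 0.
    r_hat = np.divide(reward_sums, n_d,
                      out=np.zeros_like(reward_sums),
                      where=n_d > 0)
    return p_hat, r_hat, n_d
```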

  5. Existing algorithms
     [Petrik et al., 2016]: SPI by robust baseline regret minimization
     • Robust MDPs consider the maxmin of the value over Ξ → favors over-conservative policies.
     • They also consider the maxmin of the value improvement → NP-hard problem.
     • RaMDP hacks the reward to account for uncertainty:
       R̃(x, a) ← R̂(x, a) − κ_adj / √N_D(x, a)
       → not theoretically grounded.
     [Thomas, 2015]: High-Confidence Policy Improvement
     • HCPI searches for the best regularization hyperparameter to allow safe policy improvement.
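For concreteness, the RaMDP reward adjustment above can be sketched as follows; this is a hypothetical helper, not the authors' implementation, and the max(N_D, 1) guard against unvisited pairs is my own assumption.

```python
import numpy as np

def ramdp_reward(r_hat, n_d, kappa_adj):
    # Penalise rarely observed state-action pairs: subtract kappa_adj / sqrt(N_D(x, a)).
    # The adjustment is undefined where N_D = 0; clip the count to 1 for illustration.
    return r_hat - kappa_adj / np.sqrt(np.maximum(n_d, 1))
```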

  6. Safe Policy Improvement with Baseline Bootstrapping (SPIBB)
     • Tractable approximate solution to the robust policy improvement formulation.
     • SPIBB allows a policy update only with sufficient evidence.
     • Sufficient evidence = a state-action count that exceeds some threshold hyperparameter N_∧.
     SPIBB algorithm
     • Construction of the bootstrapped set: B = {(x, a) ∈ X × A : N_D(x, a) < N_∧}.
     • Optimization over a constrained policy set:
       π⊙_spibb = argmax_{π ∈ Π_b} ρ(π, M̂),
       Π_b = {π s.t. π(a|x) = π_b(a|x) if (x, a) ∈ B}.
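A minimal sketch of the bootstrapped set construction, assuming the counts N_D(x, a) are stored in an array n_d (the function name is hypothetical):

```python
import numpy as np

def bootstrapped_set(n_d, n_wedge):
    # Boolean mask of shape (n_states, n_actions): True marks (x, a) ∈ B,
    # i.e. pairs with fewer than N_∧ observations, where the SPIBB policy
    # is constrained to copy the baseline π_b(a|x).
    return n_d < n_wedge
```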

  7. SPIBB policy iteration: policy improvement step example

     Q-value                Baseline policy       Bootstrapping     SPIBB policy update
     Q^(i)_M̂(x, a1) = 1     π_b(a1|x) = 0.1       (x, a1) ∈ B       π^(i+1)(a1|x) = 0.1
     Q^(i)_M̂(x, a2) = 2     π_b(a2|x) = 0.4       (x, a2) ∉ B       π^(i+1)(a2|x) = 0.0
     Q^(i)_M̂(x, a3) = 3     π_b(a3|x) = 0.3       (x, a3) ∉ B       π^(i+1)(a3|x) = 0.7
     Q^(i)_M̂(x, a4) = 4     π_b(a4|x) = 0.2       (x, a4) ∈ B       π^(i+1)(a4|x) = 0.2

     Bootstrapped actions (a1, a4) keep their baseline probabilities; the remaining baseline
     mass (0.4 + 0.3 = 0.7) is moved onto the best non-bootstrapped action, a3.
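The greedy step within Π_b can be sketched for a single state as below (my own illustrative code, not the authors'); it reproduces the numbers in the example table.

```python
import numpy as np

def spibb_policy_update(q_x, pi_b_x, bootstrapped_x):
    """One SPIBB policy-improvement step at a single state x."""
    # Keep the baseline probabilities on bootstrapped actions.
    pi_new = np.where(bootstrapped_x, pi_b_x, 0.0)
    # Give the total baseline mass of non-bootstrapped actions to the one
    # with the highest current Q estimate.
    free = ~bootstrapped_x
    if free.any():
        best = np.argmax(np.where(free, q_x, -np.inf))
        pi_new[best] += pi_b_x[free].sum()
    return pi_new

# Example from the slide: Q = [1, 2, 3, 4], baseline [0.1, 0.4, 0.3, 0.2],
# with a1 and a4 bootstrapped.
print(spibb_policy_update(
    np.array([1.0, 2.0, 3.0, 4.0]),
    np.array([0.1, 0.4, 0.3, 0.2]),
    np.array([True, False, False, True])))
# Yields the distribution [0.1, 0.0, 0.7, 0.2] (up to floating point):
# a3 receives the combined baseline mass of a2 and a3.
```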

  8. Theoretical analysis

     Theorem (Convergence). Policy iteration converges to a policy π⊙_spibb that is
     Π_b-optimal in the MLE MDP M̂.

     Theorem (Safe policy improvement). With high probability 1 − δ:
     ρ(π⊙_spibb, M*) − ρ(π_b, M*) ≥ ρ(π⊙_spibb, M̂) − ρ(π_b, M̂)
                                    − (4 V_max / (1 − γ)) √( (2 / N_∧) log(2|X||A|2^|X| / δ) )
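The penalty term of the bound can be evaluated numerically, e.g. to gauge how large N_∧ must be for a target error level; a small sketch under the assumption that V_max and γ are known (all names are hypothetical):

```python
import numpy as np

def spibb_penalty(v_max, gamma, n_wedge, n_states, n_actions, delta):
    # log(2 |X| |A| 2^|X| / delta), written as a sum to avoid overflow for large |X|.
    log_term = np.log(2.0 * n_states * n_actions / delta) + n_states * np.log(2.0)
    return 4.0 * v_max / (1.0 - gamma) * np.sqrt(2.0 / n_wedge * log_term)
```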

  9. Model-free formulation

     SPIBB may be formulated in a model-free manner by setting the targets:

     y_j^(i) = r_j + γ Σ_{a' : (x'_j, a') ∈ B} π_b(a'|x'_j) Q^(i)(x'_j, a')
             + γ ( Σ_{a' : (x'_j, a') ∉ B} π_b(a'|x'_j) ) max_{a' : (x'_j, a') ∉ B} Q^(i)(x'_j, a')
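A sketch of the model-free target above for a single transition, assuming q_next holds Q^(i)(x'_j, ·), pi_b_next the baseline probabilities at x'_j, and bootstrapped_next a boolean mask of (x'_j, a') ∈ B (these argument names are mine, not the paper's):

```python
import numpy as np

def spibb_target(r_j, q_next, pi_b_next, bootstrapped_next, gamma):
    # Bootstrapped actions at x'_j: follow the baseline.
    target = r_j + gamma * np.sum(pi_b_next[bootstrapped_next] * q_next[bootstrapped_next])
    # Non-bootstrapped actions: give their total baseline mass to the best one.
    free = ~bootstrapped_next
    if free.any():
        target += gamma * pi_b_next[free].sum() * q_next[free].max()
    return target
```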
