Safe Policy Improvement with Baseline Bootstrapping
Romain Laroche, Paul Trichelair, Rémi Tachet des Combes
Problem setting
Batch setting
• Fixed dataset, no direct interaction with the environment.
• Access to the behavioural policy, called the baseline.
• Objective: improve the baseline with high probability.
• Commonly encountered in real-world applications: distributed systems, long trajectories.
Contributions
Novel batch RL algorithm: SPIBB
• SPIBB comes with reliability guarantees in finite MDPs.
• SPIBB is as computationally efficient as classic RL.
Finite MDPs benchmark
• Extensive benchmark of existing algorithms.
• Empirical analysis on random MDPs and baselines.
Infinite MDPs benchmark
• Model-free SPIBB for use with function approximation.
• First deep RL algorithm reliable in the batch setting.
Robust Markov Decision Processes [Iyengar, 2005; Nilim and El Ghaoui, 2005]
• The true environment M* = ⟨X, A, P*, R*, γ⟩ is unknown.
• The Maximum Likelihood Estimation (MLE) MDP is built from the dataset counts: M̂ = ⟨X, A, P̂, R̂, γ⟩.
• Robust MDP set Ξ(M̂, e) such that M* ∈ Ξ(M̂, e) with probability at least 1 − δ.
• The error function e(x, a) is derived from concentration bounds.
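As a minimal sketch (not the authors' reference code), the MLE MDP and an error function of the kind described above could be built from dataset counts as follows; the Hoeffding-style form of the error term and all names are illustrative assumptions.

```python
import numpy as np

def mle_mdp_from_counts(counts, reward_sums, delta):
    """Build MLE transition/reward estimates and a concentration-based
    error function from a dataset of counts.

    counts:      array [X, A, X'] of observed transition counts N_D(x, a, x')
    reward_sums: array [X, A] of summed observed rewards for each (x, a)
    delta:       failure probability used in the concentration bound
    """
    n_xa = counts.sum(axis=2)                      # N_D(x, a)
    n_states, n_actions = n_xa.shape

    # MLE transition model: P_hat(x'|x, a) = N_D(x, a, x') / N_D(x, a)
    p_hat = np.where(n_xa[..., None] > 0,
                     counts / np.maximum(n_xa[..., None], 1), 0.0)
    # MLE reward model: empirical mean reward per (x, a)
    r_hat = np.where(n_xa > 0, reward_sums / np.maximum(n_xa, 1), 0.0)

    # Illustrative error function from an L1 concentration bound on P_hat;
    # the exact constants are an assumption, not taken from the slide.
    e = np.sqrt(2.0 / np.maximum(n_xa, 1)
                * np.log(2.0 * n_states * n_actions * 2.0 ** n_states / delta))
    return p_hat, r_hat, e
```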
Existing algorithms
[Petrik et al., 2016]: SPI by robust baseline regret minimization
• Robust MDPs consider the max-min of the value over Ξ, which favours over-conservative policies.
• They also consider the max-min of the value improvement, which is an NP-hard problem.
• RaMDP hacks the reward to account for uncertainty (a sketch of this adjustment follows this slide):
  R̂(x, a) ← R̂(x, a) − κ_adj / √(N_D(x, a)),
  which is not theoretically grounded.
[Thomas, 2015]: High-Confidence Policy Improvement
• HCPI searches for the best regularization hyperparameter that allows safe policy improvement.
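A minimal sketch of the RaMDP reward adjustment above, assuming tabular numpy arrays for the estimated rewards and the state-action counts; the handling of unseen pairs is an assumption, not something stated on the slide.

```python
import numpy as np

def ramdp_adjusted_rewards(r_hat, n_xa, kappa_adj):
    """Pessimistic reward adjustment in the spirit of RaMDP:
    subtract kappa_adj / sqrt(N_D(x, a)) from the estimated reward."""
    # Treat unseen pairs as if observed once so the penalty stays finite;
    # this choice is an implementation assumption.
    counts = np.maximum(n_xa, 1)
    return r_hat - kappa_adj / np.sqrt(counts)
```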
Safe Policy Improvement with Baseline Bootstrapping (SPIBB)
• Tractable approximate solution to the robust policy improvement formulation.
• SPIBB allows the policy to deviate from the baseline only where there is sufficient evidence.
• Sufficient evidence = a state-action count that exceeds the threshold hyperparameter N∧.
SPIBB algorithm
• Construction of the bootstrapped set (sketched after this slide):
  B = {(x, a) ∈ X × A : N_D(x, a) < N∧}.
• Optimization over a constrained policy set (sketched after the example on the next slide):
  π⊙_spibb = argmax_{π ∈ Π_b} ρ(π, M̂),   with Π_b = {π s.t. π(a|x) = π_b(a|x) if (x, a) ∈ B}.
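A minimal sketch, assuming a tabular numpy count matrix, of the bootstrapped-set construction; the names are illustrative.

```python
import numpy as np

def bootstrapped_set(n_xa, n_wedge):
    """Boolean mask over state-action pairs: True where (x, a) is bootstrapped,
    i.e. its dataset count N_D(x, a) is below the threshold N_wedge."""
    return n_xa < n_wedge

# Example: with N_wedge = 10, pairs seen fewer than 10 times are bootstrapped
# and must keep the baseline probabilities in the constrained set Pi_b.
counts = np.array([[3, 25],
                   [12, 0]])
print(bootstrapped_set(counts, 10))   # [[ True False]
                                      #  [False  True]]
```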
SPIBB policy iteration
Policy improvement step example:

  Q-value                Baseline policy       Bootstrapping     SPIBB policy update
  Q^(i)_M̂(x, a1) = 1     π_b(a1|x) = 0.1       (x, a1) ∈ B       π^(i+1)(a1|x) = 0.1
  Q^(i)_M̂(x, a2) = 2     π_b(a2|x) = 0.4       (x, a2) ∉ B       π^(i+1)(a2|x) = 0.0
  Q^(i)_M̂(x, a3) = 3     π_b(a3|x) = 0.3       (x, a3) ∉ B       π^(i+1)(a3|x) = 0.7
  Q^(i)_M̂(x, a4) = 4     π_b(a4|x) = 0.2       (x, a4) ∈ B       π^(i+1)(a4|x) = 0.2

The baseline probabilities are copied for the bootstrapped pairs (a1, a4); the remaining mass (0.7) goes to the best non-bootstrapped action (a3).
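A minimal sketch of the constrained policy-improvement step illustrated above, assuming tabular numpy arrays; it is not the authors' implementation, but it reproduces the numbers in the example.

```python
import numpy as np

def spibb_greedy_step(q, pi_b, bootstrapped):
    """One SPIBB policy-improvement step: copy the baseline probabilities on
    bootstrapped pairs, and give the remaining mass in each state to the
    non-bootstrapped action with the highest Q-value."""
    n_states, _ = q.shape
    pi = np.zeros_like(pi_b)
    for x in range(n_states):
        boot = bootstrapped[x]
        pi[x, boot] = pi_b[x, boot]              # keep baseline mass where uncertain
        if (~boot).any():
            free_mass = 1.0 - pi[x, boot].sum()  # mass available for improvement
            q_masked = np.where(boot, -np.inf, q[x])
            pi[x, np.argmax(q_masked)] += free_mass
        else:
            pi[x] = pi_b[x]                      # nothing known well enough: keep the baseline
    return pi

# The slide's example: a1 and a4 are bootstrapped, so they keep 0.1 and 0.2;
# the remaining 0.7 goes to a3, the best non-bootstrapped action.
q = np.array([[1.0, 2.0, 3.0, 4.0]])
pi_b = np.array([[0.1, 0.4, 0.3, 0.2]])
in_b = np.array([[True, False, False, True]])
print(spibb_greedy_step(q, pi_b, in_b))   # approximately [[0.1 0.  0.7 0.2]]
```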
Theoretical analysis
Theorem (Convergence). SPIBB policy iteration converges to a policy π⊙_spibb that is Π_b-optimal in the MLE MDP M̂.
Theorem (Safe policy improvement). With probability at least 1 − δ:
  ρ(π⊙_spibb, M*) − ρ(π_b, M*) ≥ ρ(π⊙_spibb, M̂) − ρ(π_b, M̂) − (4 V_max / (1 − γ)) √( (2 / N∧) log( 2|X||A| 2^|X| / δ ) )
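As an illustrative, hypothetical check (not from the talk), the penalty term of the bound can be evaluated numerically to see how it scales with N∧; all numeric values below are assumptions.

```python
import numpy as np

def spibb_penalty(n_states, n_actions, n_wedge, gamma, v_max, delta):
    """Penalty term of the safe-policy-improvement bound:
    (4 V_max / (1 - gamma)) * sqrt((2 / N_wedge) * log(2 |X| |A| 2^|X| / delta))."""
    log_term = np.log(2.0 * n_states * n_actions * 2.0 ** n_states / delta)
    return 4.0 * v_max / (1.0 - gamma) * np.sqrt(2.0 / n_wedge * log_term)

# The guaranteed loss shrinks in O(1 / sqrt(N_wedge)) as the threshold grows.
for n_wedge in (5, 20, 100):
    print(n_wedge, spibb_penalty(n_states=25, n_actions=4, n_wedge=n_wedge,
                                 gamma=0.95, v_max=1.0, delta=0.05))
```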
Model-free formulation
SPIBB algorithm
• SPIBB may be formulated in a model-free manner by setting the targets (a sketch follows this slide):
  y_j^(i) = r_j + γ Σ_{a' : (x'_j, a') ∈ B} π_b(a'|x'_j) Q^(i)(x'_j, a')
                + γ ( Σ_{a' : (x'_j, a') ∉ B} π_b(a'|x'_j) ) max_{a' : (x'_j, a') ∉ B} Q^(i)(x'_j, a')
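A minimal sketch of this target computation for a single transition, assuming tabular numpy inputs; names and shapes are illustrative.

```python
import numpy as np

def spibb_target(r_j, x_next, q, pi_b, bootstrapped, gamma):
    """Model-free SPIBB regression target for one transition (x_j, a_j, r_j, x'_j):
    bootstrapped actions at x'_j contribute their baseline-weighted Q-values,
    and the baseline mass of the non-bootstrapped actions is given to the
    best non-bootstrapped action."""
    boot = bootstrapped[x_next]          # mask over the actions available at x'_j
    q_next, pi_next = q[x_next], pi_b[x_next]

    bootstrapped_part = np.sum(pi_next[boot] * q_next[boot])
    free_mass = np.sum(pi_next[~boot])
    greedy_part = free_mass * np.max(q_next[~boot]) if free_mass > 0 else 0.0
    return r_j + gamma * (bootstrapped_part + greedy_part)
```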