More Efficient Off-Policy Evaluation through Regularized Targeted Learning
Aurelien F. Bibaut, Ivana Malenica, Nikos Vlassis, Mark J. van der Laan
University of California, Berkeley; Netflix, Los Gatos, CA
aurelien.bibaut@berkeley.edu
June 8, 2019
Problem statement
What is Off-Policy Evaluation (OPE)?
◮ Data: MDP trajectories collected under a behavior policy $\pi_b$.
◮ Question: What would the mean reward be under a target policy $\pi_e$?
Why OPE? When it is too costly, dangerous, or unethical to simply try out $\pi_e$.
This work: a novel estimator for OPE in reinforcement learning.
Formalization
$S_t$: state at time $t$, $A_t$: action at time $t$, $R_t$: reward at time $t$, $\pi_b$: logging/behavior policy, $\pi_e$: target policy,
$\rho_t := \prod_{\tau=1}^{t} \frac{\pi_e(A_\tau \mid S_\tau)}{\pi_b(A_\tau \mid S_\tau)}$: importance sampling ratio.
Action-value / reward-to-go function:
$Q^{\pi_e}_t(s, a) := \mathbb{E}_{\pi_e}\!\left[\sum_{\tau \ge t} R_\tau \,\middle|\, S_t = s, A_t = a\right]$.
Our estimand: the value function
$V^{\pi_e}(Q^{\pi_e}) := \mathbb{E}_{\pi_e}\!\left[Q^{\pi_e}_1(S_1, A_1) \mid S_1 = s_1\right]$ (the initial state is fixed to $s_1$).
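As an illustration only (not part of the slides): a minimal Python sketch of how the cumulative importance sampling ratio $\rho_t$ could be computed for one trajectory, assuming arrays pi_e_probs and pi_b_probs holding the per-step probabilities $\pi_e(A_t \mid S_t)$ and $\pi_b(A_t \mid S_t)$ at the logged state-action pairs (the names are illustrative).

```python
import numpy as np

def cumulative_is_ratios(pi_e_probs, pi_b_probs):
    """Cumulative importance sampling ratios rho_1, ..., rho_T for one trajectory.

    pi_e_probs, pi_b_probs: length-T arrays of pi_e(A_t | S_t) and pi_b(A_t | S_t)
    evaluated at the logged state-action pairs.
    """
    per_step = np.asarray(pi_e_probs, dtype=float) / np.asarray(pi_b_probs, dtype=float)
    return np.cumprod(per_step)  # rho_t = prod over tau <= t of pi_e / pi_b
```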
Our base estimator
Overview of longitudinal TMLE (LTMLE)
Say we have an estimator $\hat{Q} = (\hat{Q}_1, \ldots, \hat{Q}_T)$ of $Q^{\pi_e} = (Q^{\pi_e}_1, \ldots, Q^{\pi_e}_T)$ (e.g. from SARSA or dynamics estimators).
Traditional direct-method (plug-in) estimator: $\hat{V} := V^{\pi_e}(\hat{Q}_1)$.
LTMLE:
◮ Define, for $t = 1, \ldots, T$, the logistic intercept model
$\hat{Q}_t(\epsilon_t)(s, a) = \underbrace{2\Delta_t}_{\text{max}} \Big( \underbrace{\sigma}_{\text{link}}\big( \underbrace{\sigma^{-1}\big(\tfrac{\hat{Q}_t(s, a) + \Delta_t}{2\Delta_t}\big)}_{\text{logit r.t.g.}} + \epsilon_t \big) - 0.5 \Big).$
◮ Fit $\hat{\epsilon}_t$ by maximum weighted likelihood.
◮ Define $\hat{V}^{\text{LTMLE}} := V^{\pi_e}(\hat{Q}_1(\hat{\epsilon}_1))$.
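A sketch (under the notation above, not the authors' code) of the logistic intercept fluctuation: the initial estimate is rescaled into $(0, 1)$ via $(\hat{Q}_t + \Delta_t)/(2\Delta_t)$, shifted on the logit scale by $\epsilon_t$, and mapped back, so that $\epsilon_t = 0$ leaves $\hat{Q}_t$ unchanged. The function and argument names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p) - np.log1p(-p)

def fluctuated_q(q_hat, eps_t, delta_t):
    """Logistic intercept fluctuation Q_hat_t(eps_t)(s, a).

    q_hat   : initial estimate Q_hat_t(s, a), assumed to lie in (-delta_t, delta_t).
    eps_t   : scalar intercept epsilon_t.
    delta_t : bound on the reward-to-go at time t.
    """
    scaled = (q_hat + delta_t) / (2.0 * delta_t)              # normalize into (0, 1)
    return 2.0 * delta_t * (sigmoid(logit(scaled) + eps_t) - 0.5)
```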
Our base estimator
Loss and recursive fitting
Log likelihood for the logistic intercept at $t$:
$l_t(\hat{\epsilon}_{t+1})(\epsilon_t) := \rho_t \Bigg[ \underbrace{\frac{R_t + \hat{V}_{t+1}(\hat{\epsilon}_{t+1}) + \Delta_t}{2\Delta_t}}_{\text{normalized r.t.g.}} \log \underbrace{\bigg(\frac{\hat{Q}_t(\epsilon_t) + \Delta_t}{2\Delta_t}\bigg)}_{\text{normalized predicted r.t.g.}} + \bigg(1 - \frac{R_t + \hat{V}_{t+1}(\hat{\epsilon}_{t+1}) + \Delta_t}{2\Delta_t}\bigg) \log\bigg(1 - \frac{\hat{Q}_t(\epsilon_t) + \Delta_t}{2\Delta_t}\bigg) \Bigg].$
Recursive fitting: the likelihood for $\epsilon_t$ requires the fitted $\hat{\epsilon}_{t+1}$ $\Rightarrow$ proceed backwards in time.
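A sketch of one backward fitting step, assuming per-trajectory arrays of initial estimates $\hat{Q}_t(S_t, A_t)$, pseudo-outcomes $R_t + \hat{V}_{t+1}(\hat{\epsilon}_{t+1})$, and weights $\rho_t$; using scipy's scalar minimizer is an implementation choice, not something taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_epsilon_t(q_hat_t, pseudo_outcomes, rho_t, delta_t):
    """Fit epsilon_t by maximizing the rho_t-weighted Bernoulli log likelihood."""
    y = (np.asarray(pseudo_outcomes) + delta_t) / (2.0 * delta_t)            # normalized r.t.g.
    base = np.clip((np.asarray(q_hat_t) + delta_t) / (2.0 * delta_t), 1e-6, 1 - 1e-6)
    logit_base = np.log(base) - np.log1p(-base)

    def neg_log_lik(eps):
        p = 1.0 / (1.0 + np.exp(-(logit_base + eps)))                        # normalized predicted r.t.g.
        p = np.clip(p, 1e-6, 1 - 1e-6)
        return -np.sum(rho_t * (y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

    return minimize_scalar(neg_log_lik).x                                    # fitted epsilon_hat_t
```

The full LTMLE would call such a step for $t = T, T-1, \ldots, 1$, using the already fitted $\hat{\epsilon}_{t+1}$ to form the pseudo-outcomes.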
Our base estimator
Regularizations
Softening. Trajectories $i = 1, \ldots, n$ with IS ratios $\rho^{(1)}_t, \ldots, \rho^{(n)}_t$. For $0 < \alpha < 1$, replace the IS ratios by $\frac{(\rho^{(i)}_t)^\alpha}{\sum_j (\rho^{(j)}_t)^\alpha}$.
Partialing. For some $\tau$, set $\hat{\epsilon}_\tau = \cdots = \hat{\epsilon}_T = 0$.
Penalization. Add an $L_1$-penalty $\lambda |\epsilon_t|$ to each $l_t$.
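A sketch of the softening step, assuming rho_t holds the $n$ trajectories' IS ratios at a fixed time $t$; partialing and penalization would instead amount to freezing $\hat{\epsilon}_t = 0$ for $t \ge \tau$ and to adding $\lambda |\epsilon_t|$ to the loss above.

```python
import numpy as np

def soften_weights(rho_t, alpha):
    """Softened, self-normalized IS weights (rho_t^(i))^alpha / sum_j (rho_t^(j))^alpha.

    alpha in (0, 1): smaller alpha shrinks extreme ratios more aggressively.
    """
    powered = np.asarray(rho_t, dtype=float) ** alpha
    return powered / powered.sum()
```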
Our ensemble estimator
◮ Make a pool of regularized estimators $g := (g_1, \ldots, g_K)$.
◮ $\hat{\Omega}_n$: bootstrap estimate of $\mathrm{Cov}(g)$.
◮ $\hat{b}_n$: bootstrap estimate of the bias of $g$.
◮ Compute
$\hat{x} = \arg\min_{0 \le x \le 1,\; x^\top \mathbf{1} = 1} \; \frac{1}{n} x^\top \hat{\Omega}_n x + (x^\top \hat{b}_n)^2.$
◮ Return $\hat{V}^{\text{RLTMLE}} = \hat{x}^\top g$.
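A sketch of the ensemble weight selection, assuming a bootstrap covariance estimate omega_n ($K \times K$), a bias estimate b_n (length $K$), and $n$ trajectories; solving the simplex-constrained quadratic program with SLSQP is one possible implementation choice, not the authors' prescribed one.

```python
import numpy as np
from scipy.optimize import minimize

def rltmle_weights(omega_n, b_n, n):
    """Weights x_hat minimizing (1/n) x' Omega_n x + (x' b_n)^2 over the simplex."""
    K = len(b_n)

    def objective(x):
        return x @ omega_n @ x / n + (x @ b_n) ** 2

    result = minimize(
        objective,
        x0=np.full(K, 1.0 / K),                                        # start from uniform weights
        bounds=[(0.0, 1.0)] * K,                                       # 0 <= x_k <= 1
        constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1.0}],  # weights sum to one
        method="SLSQP",
    )
    return result.x

# The RLTMLE estimate is then x_hat @ g, for g the vector of candidate estimates.
```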
Empirical performance