CAB: Continuous Adaptive Blending for Policy Evaluation and Learning
Yi Su*, Lequn Wang*, Michele Santacatterina and Thorsten Joachims
Example: Netflix
Context x: user/history. Action y: the candidate movie to be placed here. Reward δ: whether the user will click it.
Goal: Off-Policy Evaluation and Learning
Evaluation: expected performance R(π) of a new policy π.
- Online: A/B testing (figure: draw context x_i, draw action y_i from each candidate policy π_1, ..., π_k, and observe its rewards directly).
- Offline: off-policy evaluation (figure: draw y_i only from the deployed logging policy π_0), using the logged data
  S = {(x_i, y_i, δ_i, π_0(y_i|x_i))}_{i=1}^n.
Learning: ERM for batch learning from bandit feedback,
  π̂ = argmax_{π ∈ Π} R̂(π).
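As a toy illustration of this pipeline (not code from the paper; the helper names log_data, erm, and r_hat are hypothetical), the sketch below collects logged bandit feedback under a logging policy π_0, recording the propensity with each sample, and then runs ERM over a policy class using some off-policy estimate R̂:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_data(contexts, pi0, reward_fn):
    """Collect S = {(x_i, y_i, delta_i, pi0(y_i|x_i))} by deploying the logging policy pi0."""
    logs = []
    for x in contexts:
        p = pi0(x)                                  # distribution over actions given context x
        y = rng.choice(len(p), p=p)                 # action sampled by the logging policy
        logs.append((x, y, reward_fn(x, y), p[y]))  # store the logging propensity pi0(y|x)
    return logs

def erm(policy_class, logs, r_hat):
    """ERM for batch learning from bandit feedback: pick the policy maximizing R_hat."""
    return max(policy_class, key=lambda pi: r_hat(pi, logs))
```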
Main Approaches and Contributions
Contribution I: Present a family of counterfactual estimators that unifies the main approaches.
Contribution II: Design a new estimator within this family that inherits their desirable properties.
Contribution I: Interpolated Counterfactual Estimator (ICE) Family
Notation: let δ̂(x, y) be the estimated reward for action y given context x, and let π̂_0 be the estimated (or known) logging policy.
Interpolated Counterfactual Estimator (ICE) family: given a triplet w = (w^β, w^γ, w^δ) of weighting functions,

  R̂_w(π) = (1/n) Σ_{i=1}^n Σ_{y∈Y} β̂_{iy} π(y|x_i) w^β_{iy}
          + (1/n) Σ_{i=1}^n γ_i π(y_i|x_i) w^γ_i
          + (1/n) Σ_{i=1}^n δ̂_i π(y_i|x_i) w^δ_i

where
- β̂_{iy} = δ̂(x_i, y) ("model the world"): high bias, small variance.
- γ_i = δ(x_i, y_i) / π̂_0(y_i|x_i) ("model the bias"): high variance, can be unbiased with known propensities.
- δ̂_i = δ̂(x_i, y_i) / π̂_0(y_i|x_i) (control variate): variance reduction, but precludes use in LTR.
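To make the weight triplet concrete, here is a minimal NumPy sketch of an ICE-family estimate under the reconstruction above (not the authors' code; the tabular representation of π, π̂_0, and δ̂ and the name ice_estimate are illustrative assumptions):

```python
import numpy as np

def ice_estimate(pi, pi0_hat, delta, delta_hat, y_logged, w_beta, w_gamma, w_delta):
    """ICE-family estimate R_hat_w(pi) on logged bandit data.

    pi, pi0_hat : (n, |Y|) target policy / estimated logging propensities per context
    delta       : (n,) observed rewards for the logged actions
    delta_hat   : (n, |Y|) estimated rewards for every action in every context
    y_logged    : (n,) indices of the logged actions
    w_beta      : (n, |Y|) weights on the "model the world" (DM) term
    w_gamma, w_delta : (n,) weights on the IPS term and the control variate
    """
    idx = np.arange(len(y_logged))

    # "Model the world": sum over all actions y of w^beta * delta_hat(x_i, y) * pi(y|x_i)
    dm_term = np.sum(w_beta * delta_hat * pi, axis=1)

    # "Model the bias": observed reward, importance-weighted by pi / pi0_hat at the logged action
    ips_term = w_gamma * delta * pi[idx, y_logged] / pi0_hat[idx, y_logged]

    # Control variate: estimated reward of the logged action, same importance weighting
    cv_term = w_delta * delta_hat[idx, y_logged] * pi[idx, y_logged] / pi0_hat[idx, y_logged]

    return np.mean(dm_term + ips_term + cv_term)
```

Under this reconstruction, w_beta = 0, w_gamma = 1, w_delta = 0 gives IPS; w_beta = 1 with the other two at 0 gives DM; and w_beta = 1, w_gamma = 1, w_delta = -1 gives DR.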
Contribution II: Continuous Adaptive Blending (CAB) Estimator
✓ Can be substantially less biased than clipped IPS and DM,
✓ while having low variance compared to IPS and DR.
✓ Subdifferentiable and capable of gradient-based learning: POEM (Swaminathan & Joachims, 2015a), BanditNet (Joachims et al., 2018).
✓ Unlike DR, can be used in off-policy Learning-to-Rank (LTR) algorithms (Joachims et al., 2017).
See our poster at Pacific Ballroom #221, Thursday (today), 6:30-9:00pm.
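A minimal sketch of a CAB-style member of the ICE family, assuming the IPS term is weighted by min(1, M·π̂_0(y|x)/π(y|x)), the DM term gets the complementary weight, and the control-variate weight is zero; the threshold M and the name cab_estimate are illustrative, and this weight choice is our reading of the construction rather than code from the paper:

```python
import numpy as np

def cab_estimate(pi, pi0_hat, delta, delta_hat, y_logged, M=10.0):
    """CAB-style blend of DM and IPS via a clipped propensity ratio (weights as assumed above).

    Assumes pi(y|x) > 0 everywhere; array shapes follow ice_estimate above.
    """
    idx = np.arange(len(y_logged))

    # Blending weight in [0, 1]: near 1 where pi(y|x)/pi0_hat(y|x) <= M (trust IPS),
    # shrinking toward 0 where the target policy overshoots the logging policy (fall back to DM).
    alpha = np.minimum(1.0, M * pi0_hat / pi)

    dm_term = np.sum((1.0 - alpha) * delta_hat * pi, axis=1)
    ips_term = alpha[idx, y_logged] * delta * pi[idx, y_logged] / pi0_hat[idx, y_logged]

    return np.mean(dm_term + ips_term)
```

Dropping the control variate is what keeps the objective subdifferentiable in the policy and usable inside off-policy LTR training, per the properties listed above.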