
CAB: Continuous Adaptive Blending for Policy Evaluation and Learning



  1. CAB: Continuous Adaptive Blending for Policy Evaluation and Learning Yi Su*, Lequn Wang*, Michele Santacatterina and Thorsten Joachims

  2. Example: Netflix. Context $x$: the user and their history. Action $y$: the movie to be placed in a given slot, chosen from a set of candidates. Reward $r$: whether the user will click it.

  3. Goal: Off-Policy Evaluation and Learning
  Evaluation: estimate the expected performance $V(\pi)$ of a new policy $\pi$.
  - Online (A/B testing): draw a separate sample $\mathcal{D}_1, \ldots, \mathcal{D}_k$ from each candidate policy $\pi_1, \ldots, \pi_k$ and estimate each $V(\pi_j)$ from its own sample.
  - Offline (off-policy evaluation): reuse a single sample drawn from the logging policy $\pi_0$ to estimate $\hat{V}(\pi_1), \ldots, \hat{V}(\pi_k)$ for all candidate policies, where
  $$\mathcal{D} = \{(x_i, y_i, r_i, \pi_0(y_i \mid x_i))\}_{i=1}^{n}$$
  [Figure: diagram contrasting A/B testing (one fresh sample per policy) with off-policy evaluation (one logged sample reused for every policy).]
  Learning: ERM for batch learning from bandit feedback:
  $$\pi^{*} = \operatorname*{argmax}_{\pi \in \Pi} \hat{V}(\pi)$$
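Not from the talk: as a minimal illustration of offline evaluation from such a log, the following Python sketch computes the vanilla IPS estimate $\hat{V}_{\text{IPS}}(\pi) = \frac{1}{n}\sum_{i} \frac{\pi(y_i \mid x_i)}{\pi_0(y_i \mid x_i)}\, r_i$; the function and argument names are hypothetical.

```python
import numpy as np

def ips_estimate(contexts, actions, rewards, logging_probs, target_policy):
    """Vanilla IPS estimate of V(pi) from a log {(x_i, y_i, r_i, pi_0(y_i|x_i))}.

    target_policy(x, y) must return the target policy's probability pi(y|x).
    The estimate is unbiased when logging_probs are the true propensities of pi_0.
    """
    pi_probs = np.array([target_policy(x, y) for x, y in zip(contexts, actions)])
    weights = pi_probs / np.asarray(logging_probs)   # importance weights pi/pi_0
    return float(np.mean(weights * np.asarray(rewards)))
```

ERM-style learning then amounts to picking the policy in $\Pi$ that maximizes this (or any other) estimate, as in the $\pi^{*}$ objective above.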

  4. Main Approaches
  - Contribution I: present a family of counterfactual estimators.
  - Contribution II: design a new estimator within this family that inherits their desirable properties.

  5. Contribution I: Interpolated Counterfactual Estimator (ICE) Family
  Notation: let $\hat{r}(x, y)$ be the estimated reward for action $y$ given context $x$, and let $\hat{\pi}_0$ be the estimated (or known) logging policy.
  Interpolated Counterfactual Estimator (ICE) family: given a triplet $\mathcal{X} = (x^{1}, x^{2}, x^{3})$ of weighting functions,
  $$\hat{V}_{\mathcal{X}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \sum_{y \in \mathcal{Y}} \pi(y \mid x_i)\, x^{1}_{iy}\, \beta_{iy} \;+\; \frac{1}{n} \sum_{i=1}^{n} \pi(y_i \mid x_i)\, x^{2}_{i}\, \gamma_i \;+\; \frac{1}{n} \sum_{i=1}^{n} \pi(y_i \mid x_i)\, x^{3}_{i}\, \delta_i$$
  where
  - $\beta_{iy} = \hat{r}(x_i, y)$ models the world: high bias, small variance;
  - $\gamma_i = r_i / \hat{\pi}_0(y_i \mid x_i)$ is the importance-sampling term: high variance, can be unbiased with known propensities;
  - $\delta_i = \hat{r}(x_i, y_i) / \hat{\pi}_0(y_i \mid x_i)$ is a control variate that models the bias of the reward model: variance reduction, but prohibits use in LTR.
  A code sketch of this family follows below.
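As a concrete illustration (ours, not the talk's), here is a minimal sketch of the ICE family, assuming callable policies and a reward model; every name in it (`ice_estimate`, `w_dm`, `w_ips`, `w_cv`) is hypothetical.

```python
def ice_estimate(pi, contexts, actions, rewards, pi0_hat, r_hat,
                 action_space, w_dm, w_ips, w_cv):
    """ICE-family estimate (sketch).

    pi(y, x): target policy probability; pi0_hat(y, x): (estimated) logging
    propensity; r_hat(x, y): estimated reward; w_*(x, y): weighting triplet.
    """
    n = len(contexts)
    v = 0.0
    for x, y_log, r in zip(contexts, actions, rewards):
        # "model the world" term: reward model averaged over all actions
        v += sum(pi(y, x) * w_dm(x, y) * r_hat(x, y) for y in action_space) / n
        # importance-sampling term: unbiased if pi0_hat is the true propensity
        v += pi(y_log, x) * w_ips(x, y_log) * r / pi0_hat(y_log, x) / n
        # control-variate term: models the bias of the reward model
        v += pi(y_log, x) * w_cv(x, y_log) * r_hat(x, y_log) / pi0_hat(y_log, x) / n
    return v
```

Constant weights recover the familiar special cases: `(w_dm, w_ips, w_cv) = (1, 0, 0)` is the direct method, `(0, 1, 0)` is IPS, and `(1, 1, -1)` is doubly robust, since the third term then cancels the reward model's contribution on the logged action.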

  6. Contribution II: Continuous Adaptive Blending (CAB) Estimator
  ✓ Can be substantially less biased than clipped IPS and DM, while having low variance compared to IPS and DR.
  ✓ Subdifferentiable and capable of gradient-based learning: POEM (Swaminathan & Joachims, 2015a), BanditNet (Joachims et al., 2018).
  ✓ Unlike DR, can be used in off-policy Learning-to-Rank (LTR) algorithms (Joachims et al., 2017).
  (A code sketch of one continuous blending choice follows below.)
  See our poster at Pacific Ballroom #221, Thursday (today), 6:30-9:00pm.
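The slide does not spell out CAB's weighting functions, so the sketch below is only an illustration of continuous blending within the ICE family: the weight `alpha = min(1, M * pi0_hat / pi)` and the clipping constant `M` are our assumptions, not a quote of the authors' definition.

```python
def cab_style_estimate(pi, contexts, actions, rewards, pi0_hat, r_hat,
                       action_space, M=10.0):
    """Continuously blended estimator (illustrative sketch, not necessarily
    the paper's exact CAB definition).

    alpha(x, y) = min(1, M * pi0_hat(y|x) / pi(y|x)) keeps the unbiased IPS
    term where importance weights stay below M, and shifts weight continuously
    onto the reward model where they would explode.
    """
    n = len(contexts)
    v = 0.0
    for x, y_log, r in zip(contexts, actions, rewards):
        for y in action_space:
            alpha = min(1.0, M * pi0_hat(y, x) / max(pi(y, x), 1e-12))
            # reward-model part, receiving exactly the weight IPS gives up
            v += pi(y, x) * (1.0 - alpha) * r_hat(x, y) / n
        a_log = min(1.0, M * pi0_hat(y_log, x) / max(pi(y_log, x), 1e-12))
        # IPS part on the logged action; alpha * pi/pi0 = min(pi/pi0, M),
        # i.e. a continuous version of clipped importance weighting
        v += a_log * pi(y_log, x) * r / pi0_hat(y_log, x) / n
    return v
```

Because the blending weight is a `min` of two functions of the policy probabilities, the estimate stays subdifferentiable in the policy, consistent with the gradient-based learning claim above.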
