Implicit Feedback and Performance Evaluation in Recommender Systems


  1. Implicit Feedback and Performance Evaluation in Recommender Systems. Shay Ben Elazar, Mike Gartrell, Noam Koenigstein, Gal Lavee

  2. Agenda • Intro: Universal Store Recommendations • Extreme Classification with Matrix Factorization • Offline Evaluation Techniques • Online Evaluation • The Gap • Bridging The Gap…

  3. Microsoft Universal Store Recommendations

  4. Windows Store

  5. Groove Music

  6. Xbox

  7. Extreme Classification with Matrix Factorization

  8. History: Netflix Prize. A sparse user-item matrix of explicit star ratings (1–5), with most entries missing.

  9. Two-class data – Extreme Classification. A sparse user-item matrix of binary labels (1 = positive interaction, 0 = negative), again with most entries missing.

  10. One-class data. Only positive interactions are observed: the matrix contains 1s where a user engaged with an item and is missing everywhere else.

  11. Problem formulation. A bipartite graph with M ≈ 10–500M user nodes on one side and N ≈ 10K–1M item nodes on the other; we care about the unobserved user-item pairs, ? = p(link).

  12. Fully Bayesian model based on Variational Bayes optimization
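
The slides do not spell out the model, so the following is only a minimal sketch of one-class matrix factorization for link prediction: plain logistic loss with uniformly sampled negatives and SGD, not the fully Bayesian / Variational Bayes treatment referenced above. The variable names (links, U, V, lr) and the toy data are illustrative.

    import numpy as np

    # Toy one-class interaction data: (user, item) pairs with an observed link.
    links = [(0, 1), (0, 2), (1, 2), (2, 0), (2, 3), (3, 3)]
    n_users, n_items, dim = 4, 5, 8

    rng = np.random.default_rng(0)
    U = 0.1 * rng.standard_normal((n_users, dim))   # user latent factors
    V = 0.1 * rng.standard_normal((n_items, dim))   # item latent factors

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    lr = 0.05
    for epoch in range(200):
        for u, i in links:
            # Pair each observed positive with one uniformly sampled "negative" item
            # (which may occasionally be a true positive; acceptable for a sketch).
            j = int(rng.integers(n_items))
            for item, label in ((i, 1.0), (j, 0.0)):
                p = sigmoid(U[u] @ V[item])
                g = p - label                        # gradient of the log loss w.r.t. the score
                U[u], V[item] = U[u] - lr * g * V[item], V[item] - lr * g * U[u]

    # Estimated link probability p(link) for a user-item pair.
    print(float(sigmoid(U[0] @ V[1])))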

  13. Offline Evaluation Techniques

  14. RMSE – Root Mean Square Error. RMSE is computed by averaging the squared error over all user-item pairs (u, i) ∈ ℛ:
      \mathrm{RMSE} = \sqrt{\frac{1}{|\mathcal{R}|} \sum_{(u,i)\in\mathcal{R}} \left(r_{ui} - \hat{r}_{ui}\right)^2}
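
For reference, a minimal NumPy sketch of the metric above; the function name is illustrative and the toy ratings are made up.

    import numpy as np

    def rmse(ratings, predictions):
        """Root mean square error over the observed user-item pairs (arrays aligned)."""
        r, p = np.asarray(ratings, float), np.asarray(predictions, float)
        return float(np.sqrt(np.mean((r - p) ** 2)))

    print(rmse([4, 5, 2, 3], [3.8, 4.6, 2.9, 3.1]))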

  15. wRMSE – Weighted Root Mean Square Error. This variant of RMSE assigns each data point a weight w_{ui} based on its importance:
      \mathrm{wRMSE} = \sqrt{\frac{1}{\sum_{(u,i)\in\mathcal{R}} w_{ui}} \sum_{(u,i)\in\mathcal{R}} w_{ui} \left(r_{ui} - \hat{r}_{ui}\right)^2}
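
The weighted variant differs only in the per-pair weights; again the names and toy numbers are illustrative.

    import numpy as np

    def weighted_rmse(ratings, predictions, weights):
        """RMSE in which each observed pair contributes with its weight w_ui."""
        r, p, w = (np.asarray(a, float) for a in (ratings, predictions, weights))
        return float(np.sqrt(np.sum(w * (r - p) ** 2) / np.sum(w)))

    print(weighted_rmse([4, 5, 2, 3], [3.8, 4.6, 2.9, 3.1], [1.0, 2.0, 0.5, 1.0]))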

  16. Precision@k / Recall@k. With k = 3, the top of the ranking induced by the algorithm contains Positive Result 3, a Negative Result and Positive Result 1, while the ground truth contains three positives (Results 1, 2 and 3). So precision@k = 2/3 (two of the top three are positives) and recall@k = 2/3 (two of the three positives are retrieved).
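
A small sketch of precision@k and recall@k for a single ranked list; the item ids mirror the slide's example and are otherwise illustrative.

    def precision_recall_at_k(ranked_items, positives, k):
        """ranked_items: item ids in the order the algorithm recommends them.
        positives: set of ground-truth positive item ids."""
        hits = sum(1 for item in ranked_items[:k] if item in positives)
        return hits / k, hits / len(positives)

    # Mirrors the slide: the top 3 contains 2 of the 3 positives -> (2/3, 2/3).
    print(precision_recall_at_k(["pos3", "neg", "pos1", "pos2"], {"pos1", "pos2", "pos3"}, k=3))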

  17. Mean Average Precision. We can plot precision as a function of recall for the ranking induced by the algorithm; Average Precision summarizes this curve by averaging the precision at the recall levels where a positive item is retrieved, and MAP averages it over users. (Figure: precision vs. recall curve for the example ranking.)
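
A sketch of average precision for one ranked list (MAP is its mean over users); names follow the same illustrative example as above.

    def average_precision(ranked_items, positives):
        """Precision@rank averaged over the ranks where a positive item appears."""
        hits, precisions = 0, []
        for rank, item in enumerate(ranked_items, start=1):
            if item in positives:
                hits += 1
                precisions.append(hits / rank)
        return sum(precisions) / len(positives) if positives else 0.0

    print(average_precision(["pos3", "neg", "pos1", "pos2"], {"pos1", "pos2", "pos3"}))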

  18. NDCG – Normalized Discounted Cumulative Gain. Each item's relevance is discounted by \delta(j) = 1/\log_2(j+1) at rank j, and the sum at k is normalized by its upper bound, the IDCG. In the example (k = 3, ground-truth relevances 5, 3 and 1), the algorithm ranks the relevance-1 item first, a negative item second and the relevance-5 item third:
      \mathrm{DCG}@k = \frac{1}{\log_2(1+1)} + 0 + \frac{5}{\log_2(3+1)} = 3.5
      \mathrm{IDCG}@k = \frac{5}{\log_2(1+1)} + \frac{3}{\log_2(2+1)} + \frac{1}{\log_2(3+1)} \approx 7.39
      \mathrm{NDCG}@k = \mathrm{DCG}@k / \mathrm{IDCG}@k = 3.5 / 7.39 \approx 0.47
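
A sketch of NDCG@k that reproduces the example numbers above; the function name and input layout are illustrative.

    import math

    def ndcg_at_k(ranked_relevances, k):
        """ranked_relevances: ground-truth relevance of each item, in the order
        the algorithm ranked them; DCG@k is normalized by the ideal DCG@k."""
        def dcg(rels):
            return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(rels[:k], start=1))
        ideal = dcg(sorted(ranked_relevances, reverse=True))
        return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

    # The slide's example: relevances 1, 0, 5, 3 in the algorithm's order, k = 3.
    print(ndcg_at_k([1, 0, 5, 3], k=3))   # about 3.5 / 7.39 = 0.47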

  19. MPR – Mean Percentile Rank. Sometimes there is only one “positive” item in the test set, so instead we record its percentile rank in the recommended list (0 at the top, 1 at the bottom) and average over users. In the example the positive item is ranked third, around the middle of the list (rank_i = 3), giving a percentile rank of 0.5; lower MPR is better.
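
A sketch of MPR under one common percentile-rank convention (0 at the top of the list); the exact normalization varies between papers, so treat the names and numbers as illustrative.

    import numpy as np

    def percentile_rank(ranked_items, positive):
        """0.0 when the positive item is ranked first, 1.0 when it is ranked last."""
        return ranked_items.index(positive) / (len(ranked_items) - 1)

    def mean_percentile_rank(per_user_rankings):
        """per_user_rankings: iterable of (ranked_items, positive_item) pairs."""
        return float(np.mean([percentile_rank(r, p) for r, p in per_user_rankings]))

    # One user whose single test positive lands in the middle of a 5-item list -> 0.5.
    print(mean_percentile_rank([(["pos3", "neg1", "pos1", "neg2", "neg3"], "pos1")]))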

  20. MPR in Xbox

  21. Spearman’s Rho Coefficient. In scenarios where we want to emphasize the full ranking, we can compare the ranking induced by the algorithm to a reference (ground-truth) ranking through the per-item rank differences d_i = r_i - \hat{r}_i:
      \rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}
      In the example the differences are 1 - 3, 2 - 4, 3 - 1 and 4 - 2, so \sum_i d_i^2 = 16 and, with n = 4, \rho = 1 - 96/60 = -0.6.
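
A sketch of Spearman's rho from two rank vectors, reproducing the example differences above; names are illustrative.

    import numpy as np

    def spearman_rho(rank_a, rank_b):
        """Spearman's rho from two rankings of the same n items (no ties)."""
        d = np.asarray(rank_a) - np.asarray(rank_b)
        n = len(d)
        return 1.0 - 6.0 * float(np.sum(d ** 2)) / (n * (n ** 2 - 1))

    # Rank differences 1-3, 2-4, 3-1, 4-2 as in the slide -> rho = -0.6.
    print(spearman_rho([1, 2, 3, 4], [3, 4, 1, 2]))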

  22. Kendall’s Tau Coefficient. Again comparing the algorithm’s ranking \hat{r} to a reference ranking r, Kendall’s tau counts item pairs ranked in the same order (concordant) versus the opposite order (discordant): a pair (i, j) is concordant when \mathrm{sign}(r_i - r_j) \cdot \mathrm{sign}(\hat{r}_i - \hat{r}_j) = 1, and
      \tau = \frac{\#\,\text{concordant} - \#\,\text{discordant}}{n(n-1)/2}
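
A matching sketch of Kendall's tau by direct pair counting (fine for small n; library implementations use faster algorithms). Names are illustrative.

    from itertools import combinations

    def kendall_tau(rank_a, rank_b):
        """Kendall's tau: (concordant - discordant pairs) / total pairs (no ties)."""
        n, score = len(rank_a), 0
        for i, j in combinations(range(n), 2):
            # +1 when the pair is ordered the same way in both rankings, -1 otherwise.
            same = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j]) > 0
            score += 1 if same else -1
        return score / (n * (n - 1) / 2)

    # Same example as above: 2 concordant and 4 discordant pairs out of 6 -> -1/3.
    print(kendall_tau([1, 2, 3, 4], [3, 4, 1, 2]))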

  23. Offline Techniques – Open Questions • How do we measure the importance/relevance of the positive items? • Long-tail items are more important, but how do we quantify that? • How many items do we care to recommend? • Should the best item be the first item? • Maybe the best item should be in the middle? • What about diversity? • What about contextual effects? • What about item fatigue?

  24. Online Experimentation

  25. Online Experiments • Randomized controlled experiments • Measure KPIs (Key Performance Indicators) directly • Can compare several variants simultaneously • The ultimate evaluation technique!

  26. Online Experiments in Xbox

  27. Direct Game Purchases

  28. Total Game Purchases

  29. Experimentation Caveats • What KPIs to measure? • How long to run the experiment? • External factors may influence the results • Cannibalization is hard to account for • Expensive to implement • Can’t compare algorithms before “lighting up”

  30. The Gap

  31. Accuracy and Diversity Interactions

  32. Characterizing The Offline / Online Evaluation Gap • Overemphasis of popular items • List recommendations (diversity, item position) • Freshness/ Fatigue • Contextual information is not fully utilized • Learning from historical data lets you predict the future. But what we really care about is changing the future!

  33. Bridging The Gap

  34. Mitigating Evaluation Techniques • Domain experts / focus groups • Internal user studies • Off-policy evaluation techniques

  35. Off-Policy Evaluation – Example. \hat{V}_h(S) is the expected reward of a new policy h, estimated from data S collected under a “logging policy” \pi:
      \hat{V}_h(S) = \frac{1}{|S|} \sum_{(x,a,r)\in S} \frac{r \cdot \mathbb{1}[h(x) = a]}{\max(\hat{\pi}(a \mid x), \tau)}
      where S denotes the set of context-action-reward tuples (x, a, r) available in the logs, \hat{\pi}(a \mid x) is the estimated probability that the logging policy chose action a in context x, and \tau is a small clipping threshold.
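
A minimal sketch of this inverse-propensity estimator, assuming the logs also store the logging policy's propensity for the chosen action; all names and toy values are illustrative.

    import numpy as np

    def ips_value(logs, policy, tau=0.05):
        """Inverse-propensity estimate of the expected reward of `policy`.
        logs: list of (context, action, reward, logged_propensity) tuples.
        policy: function mapping a context to the action the new policy would take."""
        terms = [r * (1.0 if policy(x) == a else 0.0) / max(p, tau) for x, a, r, p in logs]
        return float(np.mean(terms))

    # Toy logs: contexts are single numbers, actions are "a"/"b".
    logs = [(0.2, "a", 1.0, 0.8), (0.9, "b", 0.0, 0.6), (0.7, "b", 1.0, 0.7)]
    new_policy = lambda x: "a" if x < 0.5 else "b"
    print(ips_value(logs, new_policy))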

  36. Caveats of Off-Policy Evaluation • Need to formulate everything in terms of a policy • Needs sufficient support (the logging policy must give the new policy’s actions non-negligible probability) • Becomes very difficult when your policies are time dependent

  37. Thank you! We are looking for postdoc researchers to join us in Israel … Email: RecoRecruitmentEmail@microsoft.com
