Implicit Feedback and Performance Evaluation in Recommender Systems
Shay Ben Elazar, Mike Gartrell, Noam Koenigstein, Gal Lavee
Agenda
• Intro: Universal Store Recommendations
• Extreme Classification with Matrix Factorization
• Offline Evaluation Techniques
• Online Evaluation
• The Gap
• Bridging The Gap…
Microsoft Universal Store Recommendations
Windows Store
Groove Music
Xbox
Extreme Classification with Matrix Factorization
History: Netflix Prize
[Figure: sparse user-item matrix of explicit ratings, with values from 1 to 5]
Two-class data – Extreme Classification
[Figure: sparse user-item matrix of binary labels, 0 and 1]
One-class data
[Figure: sparse user-item matrix in which only positive entries (1) are observed]
Problem formulation
M ≈ 10 – 500M nodes, N ≈ 10K – 1M nodes
[Figure: bipartite graph between the M nodes and the N nodes, with unobserved links marked "?"]
Bipartite graph → We care about ? = p(link)
Fully Bayesian model based on Variational Bayes optimization
Offline Evaluation Techniques
RMSE - Root Mean Square Error
RMSE is computed by averaging the squared error over all user-item pairs $(u, i) \in \mathcal{R}$:
$$\mathrm{RMSE} = \sqrt{\frac{1}{|\mathcal{R}|} \sum_{(u,i) \in \mathcal{R}} SE_{ui}}$$
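The RMSE definition above can be sketched in a few lines. This is a minimal illustration, assuming the squared error of a pair is the squared difference between the observed and predicted value; the function name `rmse` and the toy dictionaries are hypothetical and not from the slides.

```python
import math


def rmse(ratings, predictions):
    """RMSE over the observed (user, item) pairs, i.e. the keys of `ratings`."""
    # Assumption: the squared error of a pair is (observed - predicted)^2.
    squared_errors = [(ratings[pair] - predictions[pair]) ** 2 for pair in ratings]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical toy data: keys are (user, item) pairs.
observed = {("u1", "i1"): 4.0, ("u1", "i2"): 5.0, ("u2", "i1"): 2.0}
predicted = {("u1", "i1"): 3.0, ("u1", "i2"): 5.0, ("u2", "i1"): 4.0}

print(rmse(observed, predicted))
```

The squared errors here are 1, 0 and 4, so the result is the square root of 5/3.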
wRMSE - Weighted Root Mean Square Error
This variant of RMSE is achieved by assigning each data point a weight, $w_{ui}$, based on its importance.
$$\mathrm{wRMSE} = \sqrt{\frac{1}{\sum_{(u,i) \in \mathcal{R}} w_{ui}} \sum_{(u,i) \in \mathcal{R}} w_{ui} \cdot SE_{ui}}$$
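A weighted variant can be sketched the same way. The code assumes the squared error is the squared difference between observed and predicted values and that weights are supplied per (user, item) pair; `weighted_rmse`, the toy data and the weights are hypothetical.

```python
import math


def weighted_rmse(ratings, predictions, weights):
    """Weighted RMSE: each squared error is scaled by the weight of its pair."""
    weighted_sum = sum(
        weights[pair] * (ratings[pair] - predictions[pair]) ** 2 for pair in ratings
    )
    total_weight = sum(weights[pair] for pair in ratings)
    return math.sqrt(weighted_sum / total_weight)


# Hypothetical toy data: the last pair is treated as twice as important.
observed = {("u1", "i1"): 4.0, ("u1", "i2"): 5.0, ("u2", "i1"): 2.0}
predicted = {("u1", "i1"): 3.0, ("u1", "i2"): 5.0, ("u2", "i1"): 4.0}
importance = {("u1", "i1"): 1.0, ("u1", "i2"): 1.0, ("u2", "i1"): 2.0}

print(weighted_rmse(observed, predicted, importance))
```

With all weights equal, the function reduces to plain RMSE.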
Precision@k / Recall@k
Example with $k = 3$:
Ranking induced by algorithm: Positive Result 3, Negative Result, Positive Result 1, Positive Result 2
Ground truth: Positive Result 1, Positive Result 2, Positive Result 3
$\mathrm{precision@}k = \frac{2}{3}$, $\mathrm{recall@}k = \frac{2}{3}$
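The slide's example can be reproduced with a short sketch. It assumes the standard definitions (hits in the top k divided by k for precision, and by the number of relevant items for recall); the function and item names are hypothetical.

```python
def precision_recall_at_k(ranked_items, relevant_items, k):
    """Return (precision@k, recall@k) for one ranked list."""
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    return hits / k, hits / len(relevant_items)


# Ranking from the slide: two of the top three results are positive.
ranking = ["positive_3", "negative", "positive_1", "positive_2"]
relevant = {"positive_1", "positive_2", "positive_3"}

precision, recall = precision_recall_at_k(ranking, relevant, k=3)
print(precision, recall)
```

Both values come out as 2/3, matching the example with a cutoff of three.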
Mean Average Precision
We can plot precision as a function of recall.
Ranking induced by algorithm: Positive Result 3, Negative Result, Positive Result 1, Positive Result 2
[Figure: "Recall v Precision" plot, with Recall on the horizontal axis (0%, 33%, 67%, 100%), Precision on the vertical axis (0% to 100%), and an annotation "Average Precision"]
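Average precision for a single ranking can be sketched as below, assuming the common definition: the mean of the precision values taken at the position of each relevant item. Mean Average Precision is then the mean over users. The function names and the one-user input are hypothetical.

```python
def average_precision(ranked_items, relevant_items):
    """Mean of precision@position over the positions of the relevant items."""
    if not relevant_items:
        return 0.0
    hits = 0
    precisions = []
    for position, item in enumerate(ranked_items, start=1):
        if item in relevant_items:
            hits += 1
            precisions.append(hits / position)
    # Relevant items that never appear in the ranking contribute zero.
    return sum(precisions) / len(relevant_items)


def mean_average_precision(cases):
    """`cases` is a list of (ranked_items, relevant_items) pairs, one per user."""
    return sum(average_precision(r, rel) for r, rel in cases) / len(cases)


ranking = ["positive_3", "negative", "positive_1", "positive_2"]
relevant = {"positive_1", "positive_2", "positive_3"}

ap = average_precision(ranking, relevant)
print(ap)
```

For this ranking the precision values at the three positives are 1, 2/3 and 3/4, so the average precision is their mean.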
NDCG – Normalized Discounted Cumulative Gain
The relevance is discounted by $\gamma_i = \frac{1}{\log_2(i+1)}$ and the sum @k is normalized by its upper bound, the IDCG.
Example with $k = 3$:
Ranking induced by algorithm: Positive Result 3 (relevance 1), Negative Result, Positive Result 1 (relevance 5), Positive Result 2 (relevance 3)
Ground truth: Positive Result 1 (relevance 5), Positive Result 2 (relevance 3), Positive Result 3 (relevance 1)
$$\mathrm{DCG@}k = \frac{1}{\log_2(1+1)} + 0 + \frac{5}{\log_2(3+1)} = 3.5$$
$$\mathrm{IDCG@}k = \frac{5}{\log_2(1+1)} + \frac{3}{\log_2(2+1)} + \frac{1}{\log_2(3+1)} = 7.39$$
$$\mathrm{NDCG@}k = \frac{3.5}{7.39} = 0.47$$
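The worked example above can be checked with a short sketch. It assumes the discount 1/log2(position + 1) and relevance values of 1, 0, 5 and 3 for the algorithm's ranking, as read from the slide; the function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k relevance values."""
    return sum(
        rel / math.log2(position + 1)
        for position, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance of each result in the order the algorithm ranked them.
algorithm_relevances = [1, 0, 5, 3]

dcg = dcg_at_k(algorithm_relevances, 3)
ndcg = ndcg_at_k(algorithm_relevances, 3)
print(dcg, ndcg)
```

The DCG is 3.5 and the NDCG rounds to 0.47, in line with the numbers on the slide.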
MPR - Mean Percentile Rank
Sometimes there is only one “positive” item in the test set…
Ranking induced by algorithm: Positive Result 3, Negative Result, Positive Result 1, Negative Result, Positive Result 2, Negative Result
Ground truth: Positive Result 1
$\mathrm{rank}_i = 3$, $\mathrm{MPR} = 0.5$
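A minimal sketch of the percentile rank follows. It assumes the percentile is the 1-based rank of the single positive item divided by the list length, which reproduces the slide's value of 0.5 for rank 3 in a six-item list; other conventions (for example zero-based ranks) exist, and the function names are hypothetical.

```python
def percentile_rank(ranked_items, positive_item):
    """1-based rank of the positive item, as a fraction of the list length."""
    rank = ranked_items.index(positive_item) + 1
    return rank / len(ranked_items)


def mean_percentile_rank(cases):
    """`cases` is a list of (ranked_items, positive_item) pairs, one per test user."""
    return sum(percentile_rank(r, p) for r, p in cases) / len(cases)


ranking = [
    "positive_3", "negative_a", "positive_1",
    "negative_b", "positive_2", "negative_c",
]

mpr = mean_percentile_rank([(ranking, "positive_1")])
print(mpr)
```

Lower values are better under this convention, since the positive item sits closer to the top of the list.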
MPR in Xbox
Spearman’s Rho Coefficient
In scenarios where we want to emphasize the full ranking, we may compare the ranking of the algorithm to a reference ranking.
Ranking induced by algorithm: Result 3, Result 4, Result 1, Result 2
Ground truth ranking: Result 1, Result 2, Result 3, Result 4
$r_1 - \hat{r}_1 = 1 - 3$
$r_2 - \hat{r}_2 = 2 - 4$
$r_3 - \hat{r}_3 = 3 - 1$
$r_4 - \hat{r}_4 = 4 - 2$
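The rank differences on this slide feed into Spearman's rho. The sketch below uses the standard formula for rankings without ties, 1 - 6 Σd² / (n(n² - 1)); the formula itself is not shown in the slide text, and the function and item names are hypothetical.

```python
def spearman_rho(reference_ranks, algorithm_ranks):
    """Spearman's rho for two rankings of the same items, assuming no ties."""
    n = len(reference_ranks)
    sum_squared_diff = sum(
        (reference_ranks[item] - algorithm_ranks[item]) ** 2
        for item in reference_ranks
    )
    return 1 - 6 * sum_squared_diff / (n * (n ** 2 - 1))


# Ranks from the slide: differences are 1-3, 2-4, 3-1 and 4-2.
reference = {"result_1": 1, "result_2": 2, "result_3": 3, "result_4": 4}
algorithm = {"result_1": 3, "result_2": 4, "result_3": 1, "result_4": 2}

rho = spearman_rho(reference, algorithm)
print(rho)
```

Each squared difference is 4, so the sum is 16 and rho is 1 - 96/60, a negative correlation of about -0.6.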
Kendall’s Tau Coefficient
In scenarios where we want to emphasize the full ranking, we may compare the ranking of the algorithm to a reference ranking.
Same order: $\operatorname{sign}(r_1 - r_2) \cdot \operatorname{sign}(\hat{r}_1 - \hat{r}_2) = 1$
[Figure: the ranking induced by the algorithm next to the ground-truth ranking, over Positive Results 1 to 3 and a Negative Result]
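The pairwise sign product above can be averaged over all item pairs to obtain Kendall's tau. This is a sketch of the tie-free form (tau-a); the rankings below are hypothetical toy data, as are the function names.

```python
from itertools import combinations


def sign(value):
    """Return -1, 0 or 1 depending on the sign of `value`."""
    return (value > 0) - (value < 0)


def kendall_tau(reference_ranks, algorithm_ranks):
    """Average sign agreement over all item pairs (tau-a, assumes no ties)."""
    pairs = list(combinations(reference_ranks, 2))
    agreement = sum(
        sign(reference_ranks[a] - reference_ranks[b])
        * sign(algorithm_ranks[a] - algorithm_ranks[b])
        for a, b in pairs
    )
    return agreement / len(pairs)


# Hypothetical rankings of four items.
reference = {"result_1": 1, "result_2": 2, "result_3": 3, "result_4": 4}
algorithm = {"result_1": 3, "result_2": 4, "result_3": 1, "result_4": 2}

tau = kendall_tau(reference, algorithm)
print(tau)
```

Two of the six pairs keep their order and four are swapped, so tau is (2 - 4) / 6.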
Offline Techniques – Open Questions
• How do we measure the importance / relevance of the positive items?
• Long tail items are more important. But how do we quantify this?
• How many items do we care to recommend?
• Should the best item be the first item?
• Maybe the best item should be in the middle?
• What about diversity?
• What about contextual effects?
• What about item fatigue?
Online Experimentation
Online Experiments
• Randomized controlled experiments
• Measure KPIs (Key Performance Indicators) directly
• Can compare several variants simultaneously
• The ultimate evaluation technique!
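One common way to compare a KPI between two variants of a randomized experiment is a two-proportion z-test on conversion rates. The sketch below is an illustration under that assumption, not the procedure used by the authors; the function name and the purchase counts are hypothetical.

```python
import math


def two_proportion_z_test(conversions_a, users_a, conversions_b, users_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conversions_a / users_a
    rate_b = conversions_b / users_b
    pooled = (conversions_a + conversions_b) / (users_a + users_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_b - rate_a) / std_error
    # Two-sided p-value under the standard normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: control (A) versus a new recommender (B).
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(z, p_value)
```

The normal approximation is only reasonable with enough users and conversions per variant.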
Online Experiments in Xbox
Game Purchase
[Figure: Direct Purchases]
Total Game Purchase
[Figure: Total Purchases]
Experimentation Caveats
• What KPIs to measure?
• How long to run the experiment?
• External factors may influence the results
• Cannibalization is hard to account for
• Expensive to implement
• Can’t compare algorithms before “lighting up”
The Gap
Accuracy and Diversity Interactions
Characterizing The Offline / Online Evaluation Gap
• Overemphasis of popular items
• List recommendations (diversity, item position)
• Freshness / Fatigue
• Contextual information is not fully utilized
• Learning from historical data lets you predict the future. But what we really care about is changing the future!
Bridging The Gap
Mitigating Evaluation Techniques
• Domain experts / focus groups
• Internal user studies
• Off-policy evaluation techniques
Off Policy Evaluation - Example
$V_h^{\pi}(S)$ - The expected reward of a policy $h$ given data $S$ from a “logging policy” $\pi$.
$$\hat{V}_h^{\pi}(S) = \frac{1}{|S|} \sum_{(x,a,r) \in S} \frac{r \cdot \mathbb{I}\left(h(x) == a\right)}{\max\left(\hat{\pi}(a \mid x), \tau\right)}$$
where $S$ denotes the set of context-action-reward tuples available in the logs
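The estimator on this slide can be sketched as follows. The code assumes the logs are (context, action, reward) tuples, that the evaluated policy and the estimated logging propensity are plain Python callables, and that the threshold acts as a lower bound on the propensity; all names and the toy log are hypothetical.

```python
def off_policy_value(logs, policy, logging_propensity, threshold):
    """Clipped inverse-propensity estimate of a policy's expected reward.

    logs: list of (context, action, reward) tuples from the logging policy.
    policy: callable mapping a context to the action it would choose.
    logging_propensity: callable giving the estimated probability that the
        logging policy chose `action` in `context`.
    threshold: lower bound on the propensity, limiting very large weights.
    """
    total = 0.0
    for context, action, reward in logs:
        # Only logged events that agree with the evaluated policy contribute.
        if policy(context) == action:
            total += reward / max(logging_propensity(context, action), threshold)
    return total / len(logs)


# Hypothetical log from a logging policy that picks "a" or "b" uniformly.
logs = [("x1", "a", 1.0), ("x1", "b", 0.0), ("x2", "a", 1.0), ("x2", "b", 1.0)]

value = off_policy_value(
    logs,
    policy=lambda context: "a",
    logging_propensity=lambda context, action: 0.5,
    threshold=0.05,
)
print(value)
```

A larger threshold reduces variance from rarely logged actions at the cost of bias in the estimate.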
Caveats of Off-policy Evaluation
• Need to formulate everything in terms of a policy
• Needs sufficient support
• Becomes very difficult when your policies are time dependent
Thank you! We are looking for postdoc researchers to join us in Israel … Email: RecoRecruitmentEmail@microsoft.com