Replicable Evaluation of Recommender Systems Alejandro Bellogín (Universidad Autónoma de Madrid, Spain) Alan Said (Recorded Future, Sweden) Tutorial at ACM RecSys 2015
Stephansdom (photo slides 2-7)
#EVALTUT 8
Outline • Background and Motivation [10 minutes] • Evaluating Recommender Systems [20 minutes] • Replicating Evaluation Results [20 minutes] • Replication by Example [20 minutes] • Conclusions and Wrap-up [10 minutes] • Questions [10 minutes] 9
Background • A recommender system aims to find and suggest items of likely interest based on the users’ preferences 11
Background • A recommender system aims to find and suggest items of likely interest based on the users’ preferences • Examples: – Netflix: TV shows and movies – Amazon: products – LinkedIn: jobs and colleagues – Last.fm: music artists and tracks – Facebook: friends 13
Background • Typically, the interactions between user and system are recorded in the form of ratings – But also: clicks (implicit feedback) • This is represented as a (sparse) user-item matrix, with users u_1 … u_n as rows, items i_1 … i_m as columns, and unknown entries (e.g., user u_j's preference for item i_k) marked with a question mark 14
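As a minimal illustration (not part of the original slides), such a sparse user-item matrix can be stored as a dictionary of observed ratings; all names below are hypothetical.

```python
# Minimal sketch: a sparse user-item rating matrix stored as a dictionary
# keyed by (user, item) pairs. Unobserved entries (the "?" cells) are absent.
ratings = {
    ("u1", "i1"): 5.0,
    ("u1", "i3"): 3.0,
    ("u2", "i2"): 4.0,
    # ("u_j", "i_k") is missing -> this is what the recommender must predict
}

def get_rating(user, item):
    """Return the observed rating, or None if the user has not rated the item."""
    return ratings.get((user, item))

print(get_rating("u1", "i1"))  # 5.0
print(get_rating("u2", "i1"))  # None -> unknown preference
```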
Motivation • Evaluation is an integral part of any experimental research area • It allows us to compare methods… 15
Motivation • Evaluation is an integral part of any experimental research area • It allows us to compare methods… • … and identify winners (in competitions) 16
Motivation A proper evaluation culture allows us to advance the field … or at least, to identify when there is a problem! 17
Motivation In RecSys, we find inconsistent evaluation results for the “same” – Dataset – Algorithm – Evaluation metric. Reported results for the same datasets (Movielens 100k, Movielens 1M) and algorithms (e.g., SVD) disagree across studies [Gorla et al, 2013; Yin et al, 2012; Cremonesi et al, 2010; Jambor & Wang, 2010] 18
Motivation In RecSys, we find inconsistent evaluation results for the “same” – Dataset – Algorithm – Evaluation metric [Figure: P@50 of SVD50, UB50, and IB recommenders under different candidate item selection strategies (TR3, TR4, TeI, TrI, AI, OPR); Bellogín et al, 2011] 19
Motivation In RecSys, we find inconsistent evaluation results for the “same” – Dataset – Algorithm – Evaluation metric. We need to understand why this happens 20
In this tutorial • We will present the basics of evaluation – Accuracy metrics: error-based, ranking-based – Also coverage, diversity, and novelty • We will focus on replication and reproducibility – Define the context – Present typical problems – Propose some guidelines 21
Replicability • Why do we need to replicate? 22
Reproducibility Why do we need to reproduce? Because replicating and reproducing are not the same 23
NOT in this tutorial • In-depth analysis of evaluation metrics – See chapter 9 of the Recommender Systems Handbook [Shani & Gunawardana, 2011] • Novel evaluation dimensions – See the tutorials on diversity and novelty at WSDM ’14 and SIGIR ’13 • User evaluation – See the tutorial at RecSys 2012 • Comparison of evaluation results in research – See the RepSys workshop at RecSys 2013 – See [Said & Bellogín 2014] 24
Outline • Background and Motivation [10 minutes] • Evaluating Recommender Systems [20 minutes] • Replicating Evaluation Results [20 minutes] • Replication by Example [20 minutes] • Conclusions and Wrap-up [10 minutes] • Questions [10 minutes] 25
Recommender Systems Evaluation Typically: as a black box [Diagram: the dataset is split into training, validation, and test sets; the recommender generates a ranking (for a user) or a prediction for a given item (and user); metrics such as precision, error, and coverage are computed] 26
Recommender Systems Evaluation The reproducible way: as separate black boxes [Same pipeline diagram, with splitting, recommendation, candidate item generation, and metric computation shown as individual boxes: dataset → train/validation/test split → recommender → ranking or prediction → precision, error, coverage, …] 27
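To make the distinction concrete, here is a minimal sketch (not from the slides, and not the API of any particular framework; all names are placeholders) of an evaluation pipeline in which the split and the metric computation are explicit, controllable steps:

```python
# Sketch of the evaluation pipeline split into separate, inspectable boxes
# (splitting, recommendation, metric computation). All names are placeholders.
import random

def split_dataset(ratings, test_fraction=0.2, seed=42):
    """Hold-out split; fixing the seed makes the split itself replicable."""
    pairs = sorted(ratings.items())               # deterministic order before shuffling
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * (1 - test_fraction))
    return dict(pairs[:cut]), dict(pairs[cut:])   # (train, test)

def evaluate(recommender, train, test, metrics):
    """Train once, then let every metric consume the same recommender output."""
    recommender.fit(train)                        # hypothetical recommender interface
    return {name: metric(recommender, test) for name, metric in metrics.items()}
```

Fixing the random seed of the split, and keeping each step in its own function, is one small move towards making the whole pipeline replicable rather than a single opaque box.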
Recommender as a black box What do you do when a recommender cannot predict a score? This has an impact on coverage [Said & Bellogín, 2014] 28
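As an illustrative sketch (hypothetical names, not from the slides), coverage can be measured as the fraction of test pairs the recommender is able to score at all:

```python
# Sketch: user-space coverage as the fraction of test user-item pairs
# for which the recommender returns a score at all.
def coverage(predict, test_pairs):
    """predict(user, item) returns a float, or None when no score can be computed."""
    scored = sum(1 for (u, i) in test_pairs if predict(u, i) is not None)
    return scored / len(test_pairs)

# Example: a toy recommender that cannot score cold items
known_items = {"i1", "i2"}
predict = lambda u, i: 4.0 if i in known_items else None
print(coverage(predict, [("u1", "i1"), ("u1", "i3"), ("u2", "i2")]))  # 0.666...
```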
Candidate item generation as a black box How do you select the candidate items to be ranked? [Figure: toy example in which a solid triangle represents the target user and boxed ratings denote the test set] [Figure: P@50 of SVD50, UB50, and IB under the candidate item selection strategies TR3, TR4, TeI, TrI, AI, OPR] 29
Candidate item generation as a black box How do you select the candidate items to be ranked? [Said & Bellogín, 2014] 30
Evaluation metric computation as a black box What do you do when a recommender cannot predict a score? – This has an impact on coverage – It can also affect error-based metrics such as MAE (Mean Absolute Error, the average of |r̂_ui − r_ui| over the test pairs) and RMSE (Root Mean Squared Error, the square root of the average squared error) 31
Evaluation metric computation as a black box What do you do when a recommender cannot predict a score? – This has an impact on coverage – It can also affect error-based metrics:

User-item pairs            Real        Rec1        Rec2        Rec3
(u1, i1)                      5           4         NaN           4
(u1, i2)                      3           2           4         NaN
(u1, i3)                      1           1         NaN           1
(u2, i1)                      3           2           4         NaN
MAE/RMSE, ignoring NaNs            0.75/0.87   2.00/2.00   0.50/0.70
MAE/RMSE, NaNs as 0                0.75/0.87   2.00/2.65   1.75/2.18
MAE/RMSE, NaNs as 3                0.75/0.87   1.50/1.58   0.25/0.50
32
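The effect of these choices can be checked with a few lines of code. The following is a minimal sketch (function and variable names are ours, not taken from any framework) that computes MAE and RMSE under different treatments of missing predictions:

```python
import math

def mae_rmse(real, predicted, nan_policy="ignore", fill_value=3.0):
    """Compute MAE and RMSE under different treatments of missing predictions.

    nan_policy: "ignore" drops pairs with no prediction,
                "fill"   replaces missing predictions with fill_value.
    """
    errors = []
    for r, p in zip(real, predicted):
        if p is None:                      # the recommender could not predict
            if nan_policy == "ignore":
                continue
            p = fill_value                 # e.g. 0, or the midpoint of the rating scale
        errors.append(abs(r - p))
    mae = sum(errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse

# Rec1 from the table above: all four predictions are available.
print(mae_rmse([5, 3, 1, 3], [4, 2, 1, 2]))   # (0.75, 0.866...) ~ 0.75/0.87
```

Whichever policy is chosen, it has to be reported explicitly; otherwise two papers computing "the same" MAE are not comparable.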
Evaluation metric computation as a black box Using internal evaluation methods in Mahout (AM), LensKit (LK), and MyMediaLite (MML) [Said & Bellogín, 2014] 33
Evaluation metric computation as a black box Variations on metrics: Error-based metrics can be normalized or averaged per user: – Normalize RMSE or MAE by the range of the ratings (divide by r_max − r_min) – Average RMSE or MAE to compensate for unbalanced distributions of items or users 34
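A short sketch of these two variations (our own helper functions, assuming a 1-5 rating scale for the normalization example):

```python
import math

def rmse(errors):
    """Plain RMSE over a list of absolute errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def normalized_rmse(errors, r_min=1.0, r_max=5.0):
    """NRMSE: divide by the rating range so scores are comparable across scales."""
    return rmse(errors) / (r_max - r_min)

def user_averaged_rmse(errors_per_user):
    """Compute RMSE per user first, then average, so heavy raters do not dominate."""
    return sum(rmse(errs) for errs in errors_per_user.values()) / len(errors_per_user)

per_user = {"u1": [1.0, 0.5], "u2": [2.0]}
print(user_averaged_rmse(per_user))   # mean of the per-user RMSE values
```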
Evaluation metric computation as a black box Variations on metrics: nDCG has at least two discounting functions (linear and exponential decay) 35
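One common reading of this (our sketch, not code from the slides) is the pair of standard nDCG formulations: the variant usually called "linear" (gain = rel_i) and the one called "exponential" (gain = 2^rel_i − 1), both with a logarithmic rank discount. The same ranking gets different scores under the two variants:

```python
import math

def dcg(relevances, variant="exp"):
    """DCG with the two common formulations; nDCG divides by the DCG of the ideal ranking.

    "lin": gain is the raw relevance, rel_i / log2(i + 1)
    "exp": gain is exponential in relevance, (2**rel_i - 1) / log2(i + 1)
    """
    score = 0.0
    for i, rel in enumerate(relevances, start=1):
        gain = rel if variant == "lin" else (2 ** rel - 1)
        score += gain / math.log2(i + 1)
    return score

def ndcg(relevances, variant="exp"):
    ideal = dcg(sorted(relevances, reverse=True), variant)
    return dcg(relevances, variant) / ideal if ideal > 0 else 0.0

ranking = [3, 2, 3, 0, 1]                       # relevance grades in ranked order
print(ndcg(ranking, "lin"), ndcg(ranking, "exp"))  # the two variants differ
```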
Evaluation metric computation as a black box Variations on metrics: Ranking-based metrics are usually computed up to a ranking position or cutoff k: P@k (Precision at k), R@k (Recall at k), MAP (Mean Average Precision) 36
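A minimal sketch of these metrics (our own helper functions; the AP normalization shown is one possible choice, and that choice itself varies across implementations):

```python
def precision_at_k(ranked_items, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / k

def recall_at_k(ranked_items, relevant, k):
    """Fraction of the relevant items that appear in the top-k."""
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / len(relevant)

def average_precision(ranked_items, relevant, k):
    """AP@k; MAP is the mean of this value over all users.
    Normalizing by min(|relevant|, k) is one convention among several."""
    score, hits = 0.0, 0
    for i, item in enumerate(ranked_items[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k) if relevant else 0.0
```

Details such as the AP denominator (min(|relevant|, k) versus |relevant|) are exactly the kind of thing that must be reported for results to be replicable.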
Evaluation metric computation as a black box If ties are present in the ranking scores, results may depend on the implementation [Bellogín et al, 2013] 37
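A tiny example (ours) of how tie-breaking alone can change a cutoff-based metric:

```python
# Two equally valid ways to order items that share the same predicted score;
# ranking metrics computed on the resulting lists can differ.
scores = {"A": 0.9, "C": 0.7, "B": 0.7, "D": 0.5}
relevant = {"B"}

stable = sorted(scores, key=lambda item: -scores[item])          # keeps insertion order on ties: A, C, B, D
alpha  = sorted(scores, key=lambda item: (-scores[item], item))  # breaks ties alphabetically:  A, B, C, D

k = 2
print(sum(1 for i in stable[:k] if i in relevant) / k)  # P@2 = 0.0
print(sum(1 for i in alpha[:k]  if i in relevant) / k)  # P@2 = 0.5
```

Both orderings are consistent with the predicted scores, yet P@2 differs, so the tie-breaking rule should be part of the reported evaluation protocol.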
Evaluation metric computation as a black box Not clear how to measure diversity/novelty in offline experiments (directly measured in online experiments): – Using a taxonomy (items about novel topics) [Weng et al, 2007] – New items over time [Lathia et al, 2010] – Based on entropy, self-information and Kullback-Leibler divergence [Bellogín et al, 2010; Zhou et al, 2010; Filippone & Sanguinetti, 2010] 38
Recommender Systems Evaluation: Summary • Usually, evaluation seen as a black box • The evaluation process involves everything: splitting, recommendation, candidate item generation, and metric computation • We should agree on standard implementations, parameters, instantiations, … – Example: trec_eval in IR 39
Outline • Background and Motivation [10 minutes] • Evaluating Recommender Systems [20 minutes] • Replicating Evaluation Results [20 minutes] • Replication by Example [20 minutes] • Conclusions and Wrap-up [10 minutes] • Questions [10 minutes] 40
Reproducible Experimental Design • We need to distinguish – Replicability – Reproducibility • Different aspects: – Algorithmic – Published results – Experimental design • Goal: have a reproducible experimental environment 41
Definition: Replicability To copy something • The results • The data • The approach Being able to evaluate in the same setting and obtain the same results 42
Definition: Reproducibility To recreate something • The (complete) set of experiments • The (complete) set of results • The (complete) experimental setup To (re)launch it in production with the same results 43
Comparing against the state-of-the-art [Flowchart] Your settings are not exactly like those in paper X, but it is a relevant paper → Replicate the results of paper X → Do the results match the original paper? Yes: Congrats, you’re done! No: Reproduce the results of paper X → Do the results agree with the original paper? They agree: Congrats! You have shown that paper X behaves differently in the new setting. They do not agree: Sorry, there is something wrong/incomplete in the experimental design 44
What about Reviewer 3? • “It would be interesting to see this done on a different dataset…” – Repeatability – The same person doing the whole pipeline over again • “How does your approach compare to [Reviewer 3 et al. 2003]?” – Reproducibility or replicability (depending on how similar the two papers are) 45
Repeat vs. replicate vs. reproduce vs. reuse 46