New approaches for evaluation: correctness and freshness
Pablo Sánchez, Rus M. Mesas, Alejandro Bellogín
Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Ingeniería Informática
V Congreso Español de Recuperación de Información (CERI 2018)
Outline
1. Recommender Systems
2. Freshness
3. Correctness
4. Experiments
5. Conclusions and future work
Recommender Systems
- Suggest new items to users based on their tastes and needs
- Measure the quality of recommendations. How?
- Several evaluation dimensions: Error, Ranking, Novelty / Diversity
- We will focus on Freshness and Correctness (from Sánchez and Bellogín (2018); Mesas and Bellogín (2017))
Different notions of quality
[Figure: coverage plots comparing three recommenders R1, R2, R3]
- Best in Relevance? R2 > R1 > R3
- Best in Novelty? R1 > R3 > R2
- Best in Freshness? R3 > R1 > R2
- Best in Coverage-Relevance tradeoff? R1 > R3 > R2?? R1 > R2 > R3??
Preliminaries
Framework proposed in Vargas and Castells (2011):

  m(R_u | θ) = C · Σ_{i_n ∈ R_u} disc(n) · p(rel | i_n, u) · nov(i_n | θ)   (1)

Where:
- R_u: items recommended to user u
- θ: contextual variable (e.g., the user profile)
- disc(n): a discount model (e.g., NDCG)
- p(rel | i_n, u): relevance component
- nov(i_n | θ): novelty model

With this framework we can derive multiple metrics; however, all of them are time-agnostic.
We propose to replace the novelty component by defining new time-aware novelty models.
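The framework of Eq. (1) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a logarithmic (NDCG-style) discount, binary relevance, and a pluggable novelty function passed in by the caller.

```python
import math

def generalized_metric(recommended, relevant, novelty, C=1.0):
    """Generalized metric of Eq. (1):
    m(R_u | theta) = C * sum_n disc(n) * p(rel | i_n, u) * nov(i_n | theta).

    recommended: ranked list of item ids for one user
    relevant:    set of item ids known to be relevant for that user
    novelty:     callable item -> novelty score nov(i_n | theta)
    """
    total = 0.0
    for n, item in enumerate(recommended, start=1):
        disc = 1.0 / math.log2(n + 1)             # NDCG-style rank discount
        p_rel = 1.0 if item in relevant else 0.0  # binary relevance component
        total += disc * p_rel * novelty(item)
    return C * total
```

With nov(·) ≡ 1 and C = 1 this collapses to a DCG-style relevance metric; swapping in a time-aware novelty model yields the freshness metrics discussed next.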
Time-Aware Novelty Metrics
- Classic metrics do not provide any information about the evolution of the items: we can recommend relevant but well-known (old) items.
- Every item in the system can be modeled with a temporal representation:

  θ_t = {θ_t(i)} = {(i, ⟨t_1(i), …, t_n(i)⟩)}   (2)

- Two different sources for the timestamps:
  - Metadata information: release date (movies or songs), creation time, etc.
  - Rating history of the items
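When the timestamps come from the rating history, the temporal representation of Eq. (2) can be built directly from the interaction log. A minimal sketch, assuming the log is a list of (user, item, timestamp) triples; the field layout and function name are illustrative, not from the original work:

```python
from collections import defaultdict

def build_time_profiles(rating_log):
    """Build theta_t(i) = <t_1(i), ..., t_n(i)> for every item
    from a rating log of (user, item, timestamp) triples."""
    profiles = defaultdict(list)
    for _user, item, ts in rating_log:
        profiles[item].append(ts)
    for times in profiles.values():
        times.sort()  # keep each item's timestamps in chronological order
    return dict(profiles)
```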
Modeling time profiles for items
- How can we aggregate the temporal representation?
- We explored four possibilities:
  - Take the first interaction (FIN)
  - Take the last interaction (LIN)
  - Take the average of the rating times (AIN)
  - Take the median of the rating times (MIN)
- Each case defines a function f(θ_t(i))
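The four aggregators map directly onto standard-library reductions. A minimal sketch: the acronyms follow the slide, while the dictionary-based dispatch is an illustrative implementation choice.

```python
from statistics import mean, median

# f(theta_t(i)) candidates over an item's sorted timestamp list.
AGGREGATORS = {
    "FIN": min,     # first interaction
    "LIN": max,     # last interaction
    "AIN": mean,    # average of the rating times
    "MIN": median,  # median of the rating times
}

def time_profile(timestamps, model="FIN"):
    """Collapse theta_t(i) = <t_1(i), ..., t_n(i)> into a single
    reference time f(theta_t(i)) under the chosen aggregation model."""
    return AGGREGATORS[model](timestamps)
```

For example, under LIN an item is fresh if its most recent rating is recent, even if it entered the catalog long ago; under FIN only newly introduced items count as fresh.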
Modeling time profiles for items: an example
[Figure: rating timelines for items i_1, i_2, i_9, i_10]
Which model better represents the freshness of the items?
- FIN? i_2 > i_10 > i_9 > i_1
- LIN? i_9 > i_1 > i_10 > i_2
- MIN? i_10 > i_2 > i_9 > i_1
- AIN? i_9 > i_10 > i_2 > i_1
Motivation
- Goal: balancing coverage and precision
- Some researchers (Herlocker et al. (2004); Gunawardana and Shani (2015)) warned that this is still an open problem in Recommender Systems evaluation
- Typical situation: recommendations with low confidence should not be presented to the user (coverage is reduced in exchange for (potentially) more relevant recommendations)
Our proposal: Correctness metrics
- Adapted from Question Answering (Peñas and Rodrigo (2011))
- Each question has several options but only one answer is correct
- If an answer is not given, it should not be considered incorrect (the algorithm decided not to recommend)
- Applied to recommenders: if two systems return the same number of relevant items but one has retrieved fewer items, it should be considered better than the other one
Our proposal: Correctness metrics
Based on users:

  User Correctness = (1/N) · [ TP(u) + TP(u) · NR(u)/N ]   (3)

  Recall User Correctness = (1/N) · [ TP(u) + (TP(u)/|T(u)|) · NR(u) ]   (4)

where:
- TP(u): number of relevant items that we are recommending to the user
- FP(u): number of non-relevant items that we are recommending to the user
- N: cutoff
- NR(u) = N − (TP(u) + FP(u))
- |T(u)|: number of relevant items in the test set of user u
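Equations (3) and (4) translate directly into code. A minimal sketch from the counts defined above; function and parameter names are illustrative:

```python
def user_correctness(tp, fp, n_cutoff):
    """User Correctness, Eq. (3): unfilled slots NR(u) = N - (TP + FP)
    are credited in proportion to the precision TP(u)/N already achieved."""
    nr = n_cutoff - (tp + fp)
    return (tp + tp * nr / n_cutoff) / n_cutoff

def recall_user_correctness(tp, fp, n_cutoff, num_test_relevant):
    """Recall User Correctness, Eq. (4): unfilled slots are credited in
    proportion to the recall TP(u) / |T(u)|."""
    nr = n_cutoff - (tp + fp)
    return (tp + (tp / num_test_relevant) * nr) / n_cutoff
```

For example, with TP = 3 at cutoff N = 10, a system that fills all ten slots (FP = 7) scores 0.3, while one that abstains after five items (FP = 2, NR = 5) scores (3 + 3·5/10)/10 = 0.45: retrieving fewer items with the same number of hits is rewarded, as intended.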
Experiments
Freshness results
- Are the recommendations obtained by different algorithms temporally novel (fresh)?
- Do the different novelty models produce similar results?