New approaches for evaluation: correctness and freshness
Pablo Sánchez, Rus M. Mesas, Alejandro Bellogín
Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Ingeniería Informática
V Congreso Español de Recuperación de Información (CERI 2018)
Outline
1. Recommender Systems
2. Freshness
3. Correctness
4. Experiments
5. Conclusions and future work
Recommender Systems
- Suggest new items to users based on their tastes and needs
- Measure the quality of recommendations. How?
- Several evaluation dimensions: Error, Ranking, Novelty / Diversity
- We will focus on Freshness and Correctness (from Sánchez and Bellogín (2018); Mesas and Bellogín (2017))
Different notions of quality
[Figure: coverage plots comparing three recommenders R1, R2, R3]
- Best in Relevance? R2 > R1 > R3
- Best in Novelty? R1 > R3 > R2
- Best in Freshness? R3 > R1 > R2
- Best in Coverage-Relevance tradeoff? R1 > R3 > R2?? R1 > R2 > R3??
Preliminaries
Framework proposed in Vargas and Castells (2011):

  m(R_u | θ) = C · Σ_{i_n ∈ R_u} disc(n) · p(rel | i_n, u) · nov(i_n | θ)   (1)

Where:
- R_u: items recommended to user u
- θ: contextual variable (e.g., the user profile)
- disc(n): a discount model (e.g., NDCG)
- p(rel | i_n, u): relevance component
- nov(i_n | θ): novelty model

With this framework we can derive multiple metrics; however, all of them are time-agnostic.
We propose to replace the novelty component by defining new time-aware novelty models.
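The framework of Eq. (1) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a logarithmic (NDCG-style) discount, binary relevance, and a pluggable novelty function passed in by the caller.

```python
import math

def generalized_metric(recommended, relevant, novelty, C=1.0):
    """Generalized metric of Eq. (1):
    m(R_u | theta) = C * sum_n disc(n) * p(rel | i_n, u) * nov(i_n | theta).

    recommended: ranked list of item ids for one user
    relevant:    set of item ids known to be relevant for that user
    novelty:     callable item -> novelty score nov(i_n | theta)
    """
    total = 0.0
    for n, item in enumerate(recommended, start=1):
        disc = 1.0 / math.log2(n + 1)             # NDCG-style rank discount
        p_rel = 1.0 if item in relevant else 0.0  # binary relevance component
        total += disc * p_rel * novelty(item)
    return C * total
```

With nov(·) ≡ 1 and C = 1 this collapses to a DCG-style relevance metric; swapping in a time-aware novelty model yields the freshness metrics discussed next.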
Time-Aware Novelty Metrics
- Classic metrics do not provide any information about the evolution of the items: we can recommend relevant but well-known (old) items.
- Every item in the system can be modeled with a temporal representation:

  θ_t = {θ_t(i)} = {(i, ⟨t_1(i), …, t_n(i)⟩)}   (2)

- Two different sources for the timestamps:
  - Metadata information: release date (movies or songs), creation time, etc.
  - Rating history of the items
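When the timestamps come from the rating history, the temporal representation of Eq. (2) can be built directly from the interaction log. A minimal sketch, assuming the log is a list of (user, item, timestamp) triples; the field layout and function name are illustrative, not from the original work:

```python
from collections import defaultdict

def build_time_profiles(rating_log):
    """Build theta_t(i) = <t_1(i), ..., t_n(i)> for every item
    from a rating log of (user, item, timestamp) triples."""
    profiles = defaultdict(list)
    for _user, item, ts in rating_log:
        profiles[item].append(ts)
    for times in profiles.values():
        times.sort()  # keep each item's timestamps in chronological order
    return dict(profiles)
```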
Modeling time profiles for items
- How can we aggregate the temporal representation?
- We explored four possibilities:
  - Take the first interaction (FIN)
  - Take the last interaction (LIN)
  - Take the average of the rating times (AIN)
  - Take the median of the rating times (MIN)
- Each case defines a function f(θ_t(i))
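The four aggregators map directly onto standard-library reductions. A minimal sketch: the acronyms follow the slide, while the dictionary-based dispatch is an illustrative implementation choice.

```python
from statistics import mean, median

# f(theta_t(i)) candidates over an item's sorted timestamp list.
AGGREGATORS = {
    "FIN": min,     # first interaction
    "LIN": max,     # last interaction
    "AIN": mean,    # average of the rating times
    "MIN": median,  # median of the rating times
}

def time_profile(timestamps, model="FIN"):
    """Collapse theta_t(i) = <t_1(i), ..., t_n(i)> into a single
    reference time f(theta_t(i)) under the chosen aggregation model."""
    return AGGREGATORS[model](timestamps)
```

For example, under LIN an item is fresh if its most recent rating is recent, even if it entered the catalog long ago; under FIN only newly introduced items count as fresh.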
Modeling time profiles for items: an example
[Figure: rating timelines for items i_1, i_2, i_9, i_10]
Which model better represents the freshness of the items?
- FIN? i_2 > i_10 > i_9 > i_1
- LIN? i_9 > i_1 > i_10 > i_2
- MIN? i_10 > i_2 > i_9 > i_1
- AIN? i_9 > i_10 > i_2 > i_1
Motivation
- Goal: balancing coverage and precision
- Some researchers (Herlocker et al. (2004); Gunawardana and Shani (2015)) warned that this is still an open problem in Recommender Systems evaluation
- Typical situation: recommendations with low confidence should not be presented to the user (coverage is reduced in exchange for (potentially) more relevant recommendations)
Our proposal: Correctness metrics
- Adapted from Question Answering (Peñas and Rodrigo (2011))
- Each question has several options but only one answer is correct
- If an answer is not given, it should not be considered incorrect (the algorithm decided not to recommend)
- Applied to recommenders: if two systems return the same number of relevant items but one has retrieved fewer items, it should be considered better than the other one
Our proposal: Correctness metrics
Based on users:

  User Correctness = (1/N) · [ TP(u) + TP(u) · NR(u)/N ]   (3)

  Recall User Correctness = (1/N) · [ TP(u) + (TP(u)/|T(u)|) · NR(u) ]   (4)

where:
- TP(u): number of relevant items that we are recommending to the user
- FP(u): number of non-relevant items that we are recommending to the user
- N: cutoff
- NR(u) = N − (TP(u) + FP(u))
- |T(u)|: number of relevant items in the test set of user u
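Equations (3) and (4) translate directly into code. A minimal sketch from the counts defined above; function and parameter names are illustrative:

```python
def user_correctness(tp, fp, n_cutoff):
    """User Correctness, Eq. (3): unfilled slots NR(u) = N - (TP + FP)
    are credited in proportion to the precision TP(u)/N already achieved."""
    nr = n_cutoff - (tp + fp)
    return (tp + tp * nr / n_cutoff) / n_cutoff

def recall_user_correctness(tp, fp, n_cutoff, num_test_relevant):
    """Recall User Correctness, Eq. (4): unfilled slots are credited in
    proportion to the recall TP(u) / |T(u)|."""
    nr = n_cutoff - (tp + fp)
    return (tp + (tp / num_test_relevant) * nr) / n_cutoff
```

For example, with TP = 3 at cutoff N = 10, a system that fills all ten slots (FP = 7) scores 0.3, while one that abstains after five items (FP = 2, NR = 5) scores (3 + 3·5/10)/10 = 0.45: retrieving fewer items with the same number of hits is rewarded, as intended.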
Experiments
Freshness results
- Are the recommendations obtained by different algorithms temporally novel (fresh)?
- Do the different novelty models produce similar results?