Get on with it! Recommender system industry challenges move towards real-world, online evaluation
Padova, March 23rd, 2016
Andreas Lommatzsch - TU Berlin, Berlin, Germany
Jonas Seiler - plista, Berlin, Germany
Daniel Kohlsdorf - XING, Hamburg, Germany
CrowdRec - www.crowdrec.eu
Andreas Lommatzsch • Andreas.Lommatzsch@tu-berlin.de • http://www.dai-lab.de
Jonas Seiler • Jonas.Seiler@plista.com • http://www.plista.com
Daniel Kohlsdorf • Daniel.Kohlsdorf@xing.com • http://www.xing.com
Moving towards real-world evaluation
Where are recommender system challenges headed?
• Direction 1: Use information beyond the user-item matrix.
• Direction 2: Online evaluation + multiple metrics.
(Flickr credit: rodneycampbell)
Why evaluate?
<Images showing “our” use cases>
Evaluation is crucial for the success of real-life systems.
How should we evaluate?
● Precision and recall
● Influence on sales
● Required hardware resources
● Technical complexity
● Business models
● User satisfaction
● Scalability
● Diversity of the presented results
Traditional Evaluation in IR
Evaluation Settings (“The Cranfield paradigm”)
• A static collection of documents
• A set of queries
• A list of relevant documents defined by experts for each query
Advantages
• Reproducible setting
• All researchers have exactly the same information
• Optimized for measuring precision
Traditional Evaluation in IR
Weaknesses of traditional IR evaluation
• High costs for creating datasets
• Datasets are not up-to-date
• Domain-specific documents
• The expert-defined ground truth does not consider individual user preferences (“context is everything”)
• Context-awareness is not considered
• Technical aspects are ignored
Industry and recsys challenges
• Challenges benefit both industry and academic research.
• We look at how industry challenges have evolved since the Netflix Prize in 2009.
Traditional Evaluation in RecSys
Evaluation Settings (“The Netflix paradigm”)
• Rating prediction on user-item matrices
• Large, sparse datasets
• Predict personalized ratings
• Cross-validation, RMSE
Advantages
• Reproducible setting
• Personalization
• Dataset is based on real user ratings
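To make the paradigm concrete, here is a minimal sketch of hold-out RMSE scoring against a trivial global-mean baseline; the rating triples are synthetic placeholders, not Netflix data.

```python
# A minimal sketch of the "Netflix paradigm": hold out a slice of the ratings
# and score a predictor with RMSE. The tiny rating triples are synthetic.
import math

# (user, item, rating) triples -- synthetic stand-ins for a ratings matrix
train = [(0, 0, 4.0), (0, 1, 2.0), (1, 0, 5.0), (1, 2, 3.0)]
test = [(0, 2, 3.0), (1, 1, 4.0)]

# Trivial baseline predictor: the global mean rating of the training fold
global_mean = sum(r for _, _, r in train) / len(train)

def predict(user: int, item: int) -> float:
    return global_mean  # a real model would personalize here

rmse = math.sqrt(sum((predict(u, i) - r) ** 2 for u, i, r in test) / len(test))
print(f"RMSE: {rmse:.3f}")
```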
Traditional Evaluation in RecSys
Weaknesses of traditional recommender evaluation
• Static data
• Only one type of data: user ratings
• User ratings are noisy
• Temporal aspects tend to be ignored
• Context-awareness is not considered
• Technical aspects are ignored
Challenges of Developing Applications
• Data streams: continuous changes
• Big data
• Combining knowledge from different sources
• Context-awareness
• Users expect personally relevant results
• Heterogeneous devices
• Technical complexity, real-time requirements
How to Set Up a Better Evaluation?
How do we address these challenges in the evaluation?
• Realistic evaluation setting
  – Heterogeneous data sources
  – Streams
  – Dynamic user feedback
• Appropriate metrics
  – Precision and user satisfaction
  – Technical complexity
  – Sales and business models
• Online and offline evaluation
Approaches for a better Evaluation
• News recommendations @ plista
• Job recommendations @ XING
The plista Recommendation Scenario
Setting
● 250 ms response time
● 350 million impressions per day
● In 10 countries
Challenges
● News change continuously
● Users do not log in explicitly
● Seasonality, context-dependent user preferences
Evaluation @ plista
Offline
• Cross-validation
  – Limited by memory and computational resources
  – Time complexity
• Caching
• Integration into Spark
• How well does it correlate with online evaluation?
Online
• A/B tests
  – Metric Optimization Engine (MOE)
    (https://github.com/Yelp/MOE)
Evaluation using MOE
Offline
• Mean and variance estimation of the parameter space with a Gaussian process
• Evaluate the parameter with the highest Expected Improvement (EI), Upper Confidence Bound, etc.
• REST API
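As a rough illustration of the acquisition step, the sketch below fits a Gaussian process to synthetic (parameter, score) observations and picks the candidate with the highest Expected Improvement. It uses scikit-learn as the GP surrogate rather than MOE's REST API, and the objective and parameter grid are made up for illustration.

```python
# A minimal sketch of Expected Improvement over a 1-D parameter space,
# assuming a scikit-learn Gaussian process as the surrogate model.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Observed (parameter, score) pairs -- synthetic placeholders
X_obs = rng.uniform(0, 1, size=(8, 1))
y_obs = np.sin(3 * X_obs[:, 0]) + 0.1 * rng.normal(size=8)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-2)
gp.fit(X_obs, y_obs)

# Candidate parameters to score
X_cand = np.linspace(0, 1, 200).reshape(-1, 1)
mu, sigma = gp.predict(X_cand, return_std=True)

# Expected Improvement over the best observation so far
best = y_obs.max()
z = (mu - best) / np.maximum(sigma, 1e-9)
ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

next_param = X_cand[np.argmax(ei), 0]
print(f"next parameter to evaluate: {next_param:.3f}")
```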
Evaluation using MOE
Online
• A/B tests are expensive
• Model non-stationarity
• Integrate out the non-stationarity to get the mean EI
The CLEF-NewsREEL challenge
Provide an API enabling researchers to test their own ideas
• A challenge in CLEF (Conferences and Labs of the Evaluation Forum)
• 2 tasks: online and offline evaluation
CLEF-NewsREEL Online Task
How does the challenge work?
• Live streams consisting of impressions, requests, and clicks; 5 publishers; approx. 6 million messages per day
• Technical requirement: 100 ms per request
• Live evaluation based on the click-through rate (CTR)
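To give a feel for what the 100 ms budget implies, here is a minimal sketch of a stream-based service that simply returns the most recently seen items per publisher. The endpoints and JSON field names (publisher_id, item_id, limit) are illustrative assumptions, not the actual ORP message schema.

```python
# A minimal sketch of a news recommender service: freshness matters, so keep
# a ring buffer of recent items per publisher and answer from memory.
from collections import defaultdict, deque

from flask import Flask, jsonify, request

app = Flask(__name__)

# Most recent item ids per publisher, newest first
recent_items = defaultdict(lambda: deque(maxlen=100))

@app.route("/impression", methods=["POST"])
def impression():
    msg = request.get_json(force=True)
    recent_items[msg["publisher_id"]].appendleft(msg["item_id"])
    return "", 204

@app.route("/recommend", methods=["POST"])
def recommend():
    msg = request.get_json(force=True)
    limit = msg.get("limit", 6)
    items = list(recent_items[msg["publisher_id"]])[:limit]
    return jsonify({"recommendations": items})

if __name__ == "__main__":
    app.run(port=8080)
```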
CLEF-NewsREEL Offline Task
Online vs. Offline Evaluation
• Technical aspects can be evaluated without user feedback
• Analyze the required resources and the response time
• Simulate the online evaluation by replaying a recorded stream
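A rough sketch of such a replay, assuming one JSON message per line in the recorded log and a local recommender listening on the endpoint from the previous sketch; the log format and URL are assumptions for illustration, not Idomaar's actual interface.

```python
# A minimal sketch of replaying a recorded stream against a local recommender
# to measure response times offline.
import json
import time
import urllib.request

def replay(log_path: str, url: str = "http://localhost:8080/recommend"):
    latencies = []
    with open(log_path) as log:
        for line in log:
            msg = json.loads(line)
            body = json.dumps(msg).encode("utf-8")
            req = urllib.request.Request(
                url, data=body, headers={"Content-Type": "application/json"})
            start = time.perf_counter()
            urllib.request.urlopen(req).read()
            latencies.append((time.perf_counter() - start) * 1000.0)
    latencies.sort()
    p95 = latencies[int(0.95 * len(latencies))]
    print(f"requests: {len(latencies)}, 95th percentile: {p95:.1f} ms")

if __name__ == "__main__":
    replay("recorded_stream.jsonl")
```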
CLEF-NewsREEL Offline Task
Challenge
• Realistic simulation of streams
• Reproducible setup of computing environments
Solution
• A framework simplifying the setup of the evaluation environment
• The Idomaar framework, developed in the CrowdRec project: http://rf.crowdrec.eu
CLEF-NewsREEL
More Information
• SIGIR Forum, Dec. 2015 (Vol. 49, No. 2): http://sigir.org/files/forum/2015D/p129.pdf
• Evaluate your algorithm online and offline in NewsREEL
• Register for the challenge (until 22 April): http://crowdrec.eu/2015/11/clef-newsreel-2016/
• Tutorials and templates are provided at orp.plista.com
XING - RecSys Challenge
https://recsys.xing.com/
Job Recommendations @ XING
XING - Evaluation based on interaction
● On XING, users can give explicit feedback on recommendations.
● The volume of explicit user feedback is far lower than that of implicit signals.
● A/B tests focus on the click-through rate.
XING - RecSys Challenge, Scoring
Space on page: top 6
● Predict 30 items for each user.
● Score: weighted combination of precision at several cutoffs
  ○ precisionAt(2)
  ○ precisionAt(4)
  ○ precisionAt(6)
  ○ precisionAt(20)
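The scoring scheme can be made concrete with a short sketch. The weights below are placeholders, since the slide only names the cutoffs, not the official challenge weights.

```python
# A minimal sketch of the precision-at-k style scoring described above.
from typing import List, Set

def precision_at(k: int, predicted: List[int], relevant: Set[int]) -> float:
    """Fraction of the top-k predicted items that the user interacted with."""
    top_k = predicted[:k]
    if not top_k:
        return 0.0
    return sum(1 for item in top_k if item in relevant) / len(top_k)

def user_score(predicted: List[int], relevant: Set[int]) -> float:
    # Weighted combination over the cutoffs named on the slide.
    cutoffs = [2, 4, 6, 20]
    weights = [1.0, 1.0, 1.0, 1.0]  # placeholder weights, not the official ones
    return sum(w * precision_at(k, predicted, relevant)
               for k, w in zip(cutoffs, weights))

# Example: 30 predictions for one user, a handful of actual interactions
predicted = list(range(30))
relevant = {0, 3, 5, 17}
print(f"score: {user_score(predicted, relevant):.3f}")
```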
XING - RecSys Challenge, User Data
• User ID
• Job title
• Educational degree
• Field of study
• Location
• Number of past jobs
• Years of experience
• Current career level
• Current discipline
• Current industry
XING - RecSys Challenge, Item Data
• Job title
• Desired career level
• Desired discipline
• Desired industry
XING - RecSys Challenge, Interaction Data
• Timestamp
• User
• Job
• Type:
  – Deletion
  – Click
  – Bookmark
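A minimal sketch of how these interaction records might be represented and aggregated into a per-job preference signal. The type weights are an illustrative assumption, not part of the challenge definition.

```python
# Interaction records as described on the slide; weights are placeholders.
from dataclasses import dataclass
from enum import Enum
from typing import List

class InteractionType(Enum):
    CLICK = "click"
    BOOKMARK = "bookmark"
    DELETION = "deletion"

@dataclass
class Interaction:
    timestamp: int  # Unix epoch seconds
    user_id: int
    job_id: int
    type: InteractionType

# Deletions read as negative feedback; clicks and bookmarks as positive.
WEIGHTS = {InteractionType.CLICK: 1.0,
           InteractionType.BOOKMARK: 2.0,
           InteractionType.DELETION: -1.0}

def preference(interactions: List[Interaction]) -> float:
    """Aggregate one user's signals for one job into a single score."""
    return sum(WEIGHTS[i.type] for i in interactions)
```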
XING - RecSys Challenge, Anonymization
XING - RecSys Challenge, Future
• Live challenge
  – Participants submit predicted future interactions
  – The solutions are recommended on the platform
  – Participants get points for actual user clicks
<Cycle diagram: Release to Challenge → Work on Predictions → Collect Clicks → Score>
Concluding ...
How to set up a better evaluation
• Consider different quality criteria (prediction quality, technical aspects, business models)
• Aggregate heterogeneous information sources
• Consider user feedback
• Use online and offline analyses to understand users and their requirements
Concluding ...
Participate in challenges based on real-life scenarios
• NewsREEL challenge: http://orp.plista.com
• RecSys 2016 challenge: http://2016.recsyschallenge.com/
=> Organize a challenge. Focus on real-life data.
Thank You
More Information
• http://www.crowdrec.eu
• http://www.clef-newsreel.org
• http://orp.plista.com
• http://2016.recsyschallenge.com
• http://www.xing.com