Recommender System Experiments with MyMediaLite
Or: Everything you always wanted to know about offline experiments* (*but were afraid to ask)
Zeno Gantner <zeno.gantner@nokia.com>
Nokia Location & Commerce, Berlin
HERE Maps by Nokia … in Berlin
● ca. 800 people
● HERE Maps platform
  – mobile apps: HERE Drive, HERE Maps, HERE Transit (public transport)
  – customers: Yahoo Maps, Bing Maps, major car companies (BMW, VW, Toyota, ...)
HERE Maps by Nokia … in Berlin: Maps Search Team
● #bbuzz regulars
● 3 of us contributed to Lucene 4.3.0 ;-)
http://2011.berlinbuzzwords.de/content/improving-search-ranking-through-ab-tests-case-study
http://2012.berlinbuzzwords.de/sessions/efficient-scoring-lucene
http://2012.berlinbuzzwords.de/sessions/introducing-cascalog-functional-data-processing-hadoop
http://2012.berlinbuzzwords.de/sessions/relevance-optimization-check-candidate-lists
https://issues.apache.org/jira/browse/LUCENE-4930
https://issues.apache.org/jira/browse/LUCENE-4571
[Photo: (c) Paul L. Dineen; license: CC BY; source: http://www.flickr.com/photos/pauldineen/4529216647/sizes/o/in/photostream/]
Data + Software/Algorithms = ?
Data + Software/Algorithms = ??? Real-world deployments
[Photos: (c) Diliff, CC BY 3.0; (c) Joon Han, CC BY-SA 3.0, source: http://en.wikipedia.org/wiki/File:Groundhog_day_tip_top_bistro.jpg]
Data mining competitions
Research
Data + Software/Algorithms = ?
RecSys Experiments with MyMediaLite
1. Interaction Data
2. Baseline Methods
3. Apples and Oranges
4. Metrics
5. Hyperparameter Tuning
6. Reproducibility
Running Example: MyMediaLite
● RecSys toolkit and evaluation framework
● written in C#/Mono
● C#, Python, Ruby, F#
● 2 Java ports
● free (RapidMiner plugin)
● simple
● choice
● documented
● tested
● regular releases (every 2-3 months) since 2010
http://mymedialite.net/
http://github.com/zenogantner/MyMediaLite
Running Example: MyMediaLite command-line tools
● rating_prediction
● item_recommendation
Find all examples here: http://github.com/zenogantner/mml-eval-examples
1. Interaction Data
● Explicit feedback (e.g. ratings): not always there.
● Implicit feedback: views, clicks, purchases. Often positive-only.
1. Interaction Data
item_recommendation --training-file=F1 --test-file=F2

User ID   Item ID   Timestamp (optional)
196       242       881250949
186       302       891717742
22        377       878887116
244       51        880606923
...       ...       ...

● IDs can be (almost) arbitrary strings.
● Separator: whitespace, tab, comma, ::
● Alternative timestamp format: yyyy-mm-dd
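For illustration, a minimal Python sketch of producing such a file from a hypothetical CSV log; the file name interactions.csv and the column names user_id, item_id, timestamp are made up, not anything MyMediaLite requires:

    import csv

    with open("interactions.csv", newline="") as src, open("train.txt", "w") as dst:
        for row in csv.DictReader(src):
            # keep IDs as strings: MyMediaLite accepts (almost) arbitrary string IDs
            dst.write(f"{row['user_id']}\t{row['item_id']}\t{row['timestamp']}\n")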
Random Splits
item_recommendation … --test-ratio=0.25
Shuffle and split. Simple, but:
● Does not take temporal trends into account.
● Does not use all data for testing.
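The idea behind --test-ratio in a few lines of plain Python; this is a sketch of the concept, not MyMediaLite's implementation:

    import random

    def random_split(data, test_ratio=0.25, seed=1):
        """Shuffle the interactions, keep the last test_ratio share as the test set."""
        data = list(data)
        random.Random(seed).shuffle(data)
        cut = int(len(data) * (1 - test_ratio))
        return data[:cut], data[cut:]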
k-fold Cross-Validation
item_recommendation … --cross-validation=4
Shuffle and split into k folds:
● Uses each data point for evaluation.
● Does not take temporal trends into account.
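A rough sketch of the idea behind --cross-validation=4 (not MyMediaLite's actual code): shuffle once, deal the data into k folds, then hold out one fold at a time:

    import random

    def k_folds(data, k=4, seed=1):
        data = list(data)
        random.Random(seed).shuffle(data)
        folds = [data[i::k] for i in range(k)]   # round-robin fold assignment
        for i in range(k):
            test = folds[i]
            train = [x for j, fold in enumerate(folds) if j != i for x in fold]
            yield train, test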
Chronological Splits
rating_prediction … --chronological-split=0.25
rating_prediction … --chronological-split=01/01/2002
Sort chronologically and split:
● Use the past to predict the “future”.
● Takes trends in the data into account:
  – time of day, day of week
  – season
  – trending products
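Both variants of the flag, sketched in Python on (user, item, timestamp) tuples; again, just an illustration of the concept:

    def chronological_split(interactions, test_ratio=0.25):
        """Sort by timestamp; train on the past, test on the most recent events."""
        ordered = sorted(interactions, key=lambda x: x[2])
        cut = int(len(ordered) * (1 - test_ratio))
        return ordered[:cut], ordered[cut:]

    def split_at(interactions, cutoff):
        """Variant with a fixed cutoff timestamp instead of a ratio."""
        train = [x for x in interactions if x[2] < cutoff]
        test = [x for x in interactions if x[2] >= cutoff]
        return train, test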
[Photo: (c) Serolillo; license: CC BY 2.5]
2. Baseline Methods
Why compare against baselines?
● Absolute numbers have no meaning – well, at least here.
● Relative numbers may also have no meaning … if you compare to the wrong things.
Good baselines:
● the strongest solution that is still simple
● the existing solution
● standard solutions – collaborative filtering: kNN, vanilla matrix factorization
2. Baseline Methods
item_recommendation … --recommender=Random
item_recommendation … --recommender=MostPopular
item_recommendation … --recommender=MostPopularByAttributes --item-attributes=ARTISTS
Item recommendation baselines:
● random
● popular items (by attribute/category)
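What the most-popular baseline boils down to, sketched in Python (not MyMediaLite's implementation): rank items by how many interactions they have and recommend the same list to everyone, minus what the user has already seen:

    from collections import Counter

    def most_popular_recommender(train, n=10):
        """train: iterable of (user, item) pairs of positive-only feedback."""
        counts = Counter(item for _user, item in train)
        ranking = [item for item, _ in counts.most_common()]
        seen = {}
        for user, item in train:
            seen.setdefault(user, set()).add(item)

        def recommend(user):
            return [i for i in ranking if i not in seen.get(user, set())][:n]

        return recommend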
[Photo: (c) Michael Collins; license: CC BY 2.0]
3. Apples and Oranges
Always check if you measure on the same splits. It happens quite often …
… e.g. this ICML 2013 paper.
3. Apples and Oranges
● On chronological splits of the Netflix dataset, matrix factorization (“SVD”) models usually do not perform below 0.9 RMSE.
● Chronological splits can be much harder than random splits!
Lessons:
● Baselines are important – they can also help us to “debug” experiments.
● Do not compare between simple splits and chronological splits.
[Photo: (c) Pastorius; license: CC BY 3.0; source: http://commons.wikimedia.org/wiki/File:Plastic_tape_m]
4. Metrics
What is the right metric?
● Know your goal.
  – It always depends on what you want to achieve.
  – What to measure?
● Criticize your metrics.
  – They may ignore important aspects of your problem.
  – They are just approximations of user behavior.
● Eyeball the results.
  – Your metrics may fail to catch WTF results.
http://thenoisychannel.com/2012/08/20/wtf-k-measuring-ineffectiveness/
4. Metrics
item_recommendation ... --measures="prec@5,NDCG"
Precision at k:
● number of “correct” items in the top k results, divided by k
● The choice of k is specific to your application.
● very simple, easy to understand and explain
More ranking measures: NDCG, MAP, ERR
4. Metrics
Precision at k – example (k=4):

recommendation   counts towards precision at 4?
bad              0
good             1
bad              0
bad              0
bad              --
good             --
bad              --

=> precision at 4 = 1/4
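A few lines of Python (not MyMediaLite code) that compute the same number as the example above; the item IDs and the relevance set are made up for illustration:

    def precision_at_k(recommended, relevant, k):
        """Fraction of the top-k recommendations that are relevant ("good")."""
        return sum(1 for item in recommended[:k] if item in relevant) / k

    recommended = ["i1", "i2", "i3", "i4", "i5", "i6", "i7"]   # ranked list
    relevant = {"i2", "i6"}                                    # the "good" items
    print(precision_at_k(recommended, relevant, 4))            # -> 0.25, i.e. 1/4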
5. Hyperparameter Tuning
item_recommendation … --recommender=WRMF --recommender-options="reg=0.01 alpha=2"
● Hyperparameters, e.g.
  – regularization to control overfitting
  – learning rate (for gradient descent methods)
  – stopping criterion
● You have to do it. Also for your baselines.
● Don't get too fancy.
  – Grid search will do it in most cases.
● More advanced: Nelder-Mead/Simplex
5. Hyperparameter Tuning
rating_prediction … --search-hp
Grid search:
● simple
● brute force
● embarrassingly parallel
“A practical guide to SVM classification”
http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
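A hand-rolled grid search sketch in Python that drives the item_recommendation tool with the flags shown in these slides; the file names and grid values are made up, and parsing the tool's output into a single number is left out because it depends on your setup:

    import itertools
    import subprocess

    def evaluate(reg, alpha):
        """Run one grid point and return the raw tool output."""
        cmd = [
            "item_recommendation",
            "--training-file=train.txt",
            "--test-file=test.txt",
            "--recommender=WRMF",
            f"--recommender-options=reg={reg} alpha={alpha}",
            "--measures=prec@5",
        ]
        return subprocess.run(cmd, capture_output=True, text=True).stdout

    # each grid point is independent, so this loop is trivial to parallelize
    for reg, alpha in itertools.product([0.001, 0.01, 0.1], [1, 2, 4, 8]):
        print(reg, alpha, evaluate(reg, alpha))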
6. Reproducible Experiments
item_recommendation … --random-seed=1
Random seed:
● “random” splitting
● training initialization
● debugging
6. Reproducible Experiments
item_recommendation … --random-seed=1
Besides the random seed:
● Put everything in version control.
  – data, software
  – scripts and configuration
● Use build tools like make for automation.
  – Knows when to re-run your data preprocessing steps.
http://bitaesthetics.com/posts/make-for-data-scientists.html
6. Reproducible Experiments
item_recommendation … --recommender=ExternalItemRecommender --recommender-options="prediction_file=FILE"
Re-use evaluation code:
● Create predictions using external software.
● Use MyMediaLite for evaluation.
6. Reproducible Experiments
item_recommendation … --recommender=ExternalItemRecommender --recommender-options="prediction_file=FILE"
Why re-use evaluation code?
● Evaluation protocols (splitting + candidate selection + metrics) are not easy to get right.
● Ensures comparability.
  – more configuration kept fixed => less risk of accidental differences
● Laziness!
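A sketch of producing such a prediction file from any external model; the tab-separated user/item/score layout below is an assumption, so check the MyMediaLite documentation for the exact format ExternalItemRecommender expects:

    def write_predictions(scored_items, path="predictions.txt"):
        """scored_items: iterable of (user_id, item_id, score) from any external model."""
        with open(path, "w") as out:
            for user, item, score in scored_items:
                # assumed layout: one "user item score" line per prediction
                out.write(f"{user}\t{item}\t{score}\n")

    write_predictions([("196", "242", 4.2), ("186", "302", 3.7)])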
[Photo: (c) Caucas; license: CC BY-NC-ND 2.0; source: http://www.flickr.com/photos/thecaucas/2597813380/sizes/o/]
Summary
1. Split your data appropriately.
2. Do not compare apples and oranges.
3. Compare against simple and strong baselines.
4. Precision at k is a metric that is easy to explain.
5. Grid search is a simple method for hyperparameter tuning.
6. Make your experiments reproducible.
7. MyMediaLite can help you with some of these things ;-). Try it out!
[Photo: (c) Michael Sauers; license: CC BY-NC-SA 2.0]
http://github.com/zenogantner/mml-eval-examples
http://mymedialite.net/
http://github.com/zenogantner/MyMediaLite