CS425: Algorithms for Web Scale Data. Most of the slides are from the Mining of Massive Datasets book and have been modified for CS425. The original slides can be accessed at: www.mmds.org
Training data: 100 million ratings, 480,000 users, 17,770 movies; 6 years of data (2000–2005)
Test data: last few ratings of each user (2.8 million)
Evaluation criterion: Root Mean Square Error (RMSE) = sqrt( (1/|S|) Σ_{(i,x)∈S} (r̂_xi − r_xi)² )
Netflix's system RMSE: 0.9514
Competition: 2,700+ teams; $1 million prize for a 10% improvement over Netflix
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
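As a concrete sketch, the RMSE criterion above can be computed with a small helper function (the sample rating values below are hypothetical, chosen only for illustration):

```python
import math

def rmse(predicted, actual):
    """RMSE = sqrt( (1/|S|) * sum over S of (r_hat - r)^2 )."""
    assert len(predicted) == len(actual)
    sse = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(sse / len(predicted))

# Hypothetical predicted vs. true ratings on a tiny test set:
print(rmse([3.5, 4.0, 2.0], [4.0, 4.0, 1.0]))
```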
[Figure: the utility matrix R, 480,000 users × 17,770 movies, sparsely populated with ratings 1–5]
[Figure: matrix R partitioned into a training data set (known ratings) and a test data set (entries marked '?'); the highlighted entry r_{4,7} is the true rating of user x on item i]
RMSE = sqrt( (1/|S|) Σ_{(i,x)∈S} (r̂_xi − r_xi)² ), where r_xi is the true rating and r̂_xi the predicted rating
The winner of the Netflix Challenge!
Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined, local view:
Global effects: overall deviations of users/movies
Factorization: addressing "regional" effects
Collaborative filtering: extract local patterns
Global: mean movie rating is 3.7 stars; The Sixth Sense is 0.5 stars above avg.; Joe rates 0.2 stars below avg.
Baseline estimation: Joe will rate The Sixth Sense 3.7 + 0.5 − 0.2 = 4 stars
Local neighborhood (CF/NN): Joe didn't like the related movie Signs
Final estimate: Joe will rate The Sixth Sense 3.8 stars
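In code, the baseline arithmetic from the slide looks like this (the −0.2 local CF correction is inferred from the 3.8-star final estimate, not stated explicitly on the slide):

```python
mu  = 3.7    # overall mean movie rating
b_i = 0.5    # The Sixth Sense is 0.5 stars above average
b_x = -0.2   # Joe rates 0.2 stars below average

baseline = mu + b_x + b_i                  # baseline estimate: 4.0 stars
cf_adjustment = -0.2                       # local CF/NN term: Joe disliked the related movie Signs
final_estimate = baseline + cf_adjustment  # 3.8 stars
print(baseline, final_estimate)
```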
Earliest and most popular collaborative filtering method. Derive unknown ratings from those of "similar" movies (item-item variant).
Define a similarity measure s_ij of items i and j. Select the k nearest neighbors and compute the rating:
r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij
s_ij … similarity of items i and j
r_xj … rating of user x on item j
N(i;x) … set of items similar to i that were rated by x
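A minimal sketch of the item-item prediction formula, assuming we are handed a per-user ratings dict and a precomputed pairwise similarity table (all names and values below are hypothetical):

```python
def predict(x, i, ratings, sim, k=2):
    """r_hat_xi = sum_{j in N(i;x)} s_ij * r_xj / sum_{j in N(i;x)} s_ij."""
    # N(i; x): the k items most similar to i that user x has rated
    neighbors = sorted(ratings[x], key=lambda j: sim[(i, j)], reverse=True)[:k]
    num = sum(sim[(i, j)] * ratings[x][j] for j in neighbors)
    den = sum(sim[(i, j)] for j in neighbors)
    return num / den

# Hypothetical data: Joe's known ratings and item-item similarities
ratings = {"joe": {"signs": 2.0, "unbreakable": 4.0}}
sim = {("sixth_sense", "signs"): 0.8, ("sixth_sense", "unbreakable"): 0.6}
print(predict("joe", "sixth_sense", ratings, sim))  # (0.8*2 + 0.6*4) / 1.4
```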
In practice we get better estimates if we model deviations:
r̂_xi = b_xi + Σ_{j∈N(i;x)} s_ij (r_xj − b_xj) / Σ_{j∈N(i;x)} s_ij
where b_xi = μ + b_x + b_i is the baseline estimate for r_xi:
μ = overall mean rating
b_x = rating deviation of user x = (avg. rating of user x) − μ
b_i = rating deviation of movie i = (avg. rating of movie i) − μ
Problems/issues:
1) Similarity measures are "arbitrary"
2) Pairwise similarities neglect interdependencies among users
3) Taking a weighted average can be restricting
Solution: instead of s_ij, use weights w_ij that we estimate directly from data
Use a weighted sum rather than a weighted average:
r̂_xi = b_xi + Σ_{j∈N(i;x)} w_ij (r_xj − b_xj)
A few notes:
N(i;x) … set of movies rated by user x that are similar to movie i
w_ij is the interpolation weight (some real number)
We allow Σ_{j∈N(i;x)} w_ij ≠ 1
w_ij models the interaction between pairs of movies (it does not depend on user x)
r̂_xi = b_xi + Σ_{j∈N(i;x)} w_ij (r_xj − b_xj)
How to set w_ij? Remember, the error metric is RMSE = sqrt( (1/|S|) Σ_{(i,x)∈S} (r̂_xi − r_xi)² ), or equivalently SSE = Σ_{(i,x)∈S} (r̂_xi − r_xi)²
Find the w_ij that minimize SSE on training data!
w_ij models relationships between item i and its neighbors j
w_ij can be learned/estimated based on x and all other users that rated i
Why is this a good idea?
Goal: make good recommendations. Quantify goodness using RMSE: lower RMSE → better recommendations.
We want to make good recommendations on items that the user has not yet seen. We can't really do this directly!
Instead: build a system that works well on known (user, item) ratings, and hope that it will also predict the unknown ratings well.
Idea: let's set the values w so that they work well on known (user, item) ratings. How to find such values w? Define an objective function and solve the optimization problem: find the w_ij that minimize SSE on training data!
J(w) = Σ_{(i,x)} ( [ b_xi + Σ_{j∈N(i;x)} w_ij (r_xj − b_xj) ] − r_xi )²
(predicted rating minus true rating, squared). Think of w as a vector of numbers.
A simple way to minimize a function f(x):
Compute the gradient ∇f
Start at some point y and evaluate ∇f(y)
Make a step in the reverse direction of the gradient: y_new = y_old − η · ∇f(y_old)
Repeat until converged
[Figure: one gradient step along the curve of f, from y to the updated point]
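The loop above, sketched for a one-dimensional function (the quadratic objective and the step size η = 0.1 are illustrative choices, not part of the slides):

```python
def gradient_descent(grad, y0, eta=0.1, eps=1e-8, max_iter=10_000):
    """Repeatedly step against the gradient: y <- y - eta * grad(y)."""
    y = y0
    for _ in range(max_iter):
        step = eta * grad(y)
        y -= step
        if abs(step) < eps:   # converged: the update barely moves y
            break
    return y

# f(y) = (y - 3)^2 has gradient 2*(y - 3); the minimum is at y = 3
print(gradient_descent(lambda y: 2.0 * (y - 3.0), y0=0.0))
```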
Example: Formulation
Assume we have a dataset with a single user x and items 0, 1, and 2. We are given all ratings, and we want to compute the weights w_01, w_02, and w_12.
Rating estimate: r̂_xi = b_xi + Σ_{j∈N(i;x)} w_ij (r_xj − b_xj)
The training dataset already has the correct r_xi values. We will use the estimation formula to compute the unknown weights w_01, w_02, and w_12.
Optimization problem: compute the w_ij values to minimize Σ_{(i,x)∈S} (r̂_xi − r_xi)²
Plug in the formulas:
minimize J(w) = ( b_x0 + w_01 (r_x1 − b_x1) + w_02 (r_x2 − b_x2) − r_x0 )²
              + ( b_x1 + w_01 (r_x0 − b_x0) + w_12 (r_x2 − b_x2) − r_x1 )²
              + ( b_x2 + w_02 (r_x0 − b_x0) + w_12 (r_x1 − b_x1) − r_x2 )²
CS 425 – Lecture 9 Mustafa Ozdal, Bilkent University
Example: Algorithm
Initialize the unknown variables:
w_new = [w_01_new; w_02_new; w_12_new] = [0; 0; 0]
Iterate:
while |w_new − w_old| > ε:
    w_old = w_new
    w_new = w_old − η · ∇J(w_old)
η is the learning rate (a parameter).
How do we compute ∇J(w_old)?
Example: Gradient-Based Update
J(w) = ( b_x0 + w_01 (r_x1 − b_x1) + w_02 (r_x2 − b_x2) − r_x0 )²
     + ( b_x1 + w_01 (r_x0 − b_x0) + w_12 (r_x2 − b_x2) − r_x1 )²
     + ( b_x2 + w_02 (r_x0 − b_x0) + w_12 (r_x1 − b_x1) − r_x2 )²
∇J(w) = [ ∂J(w)/∂w_01 ; ∂J(w)/∂w_02 ; ∂J(w)/∂w_12 ]
Update: [w_01_new; w_02_new; w_12_new] = [w_01_old; w_02_old; w_12_old] − η · ∇J(w_old)
Each partial derivative is evaluated at w_old.
Example: Computing Partial Derivatives
J(w) = ( b_x0 + w_01 (r_x1 − b_x1) + w_02 (r_x2 − b_x2) − r_x0 )²
     + ( b_x1 + w_01 (r_x0 − b_x0) + w_12 (r_x2 − b_x2) − r_x1 )²
     + ( b_x2 + w_02 (r_x0 − b_x0) + w_12 (r_x1 − b_x1) − r_x2 )²
Reminder: ∂( (ax + b)² )/∂x = 2 (ax + b) a
∂J(w)/∂w_01 = 2 ( b_x0 + w_01 (r_x1 − b_x1) + w_02 (r_x2 − b_x2) − r_x0 ) (r_x1 − b_x1)
            + 2 ( b_x1 + w_01 (r_x0 − b_x0) + w_12 (r_x2 − b_x2) − r_x1 ) (r_x0 − b_x0)
Evaluate each partial derivative at w_old to compute the gradient direction.
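Putting the whole example together: a sketch of gradient descent on the three-item objective. The ratings r and baselines b below are made up for illustration; only the structure of J(w) and its partial derivatives comes from the slides.

```python
r = [5.0, 3.0, 4.0]   # hypothetical true ratings r_x0, r_x1, r_x2
b = [4.0, 3.5, 3.8]   # hypothetical baseline estimates b_x0, b_x1, b_x2
d = [r[k] - b[k] for k in range(3)]   # deviations r_xk - b_xk

def residuals(w01, w02, w12):
    # the three error terms inside J(w)
    e0 = b[0] + w01 * d[1] + w02 * d[2] - r[0]
    e1 = b[1] + w01 * d[0] + w12 * d[2] - r[1]
    e2 = b[2] + w02 * d[0] + w12 * d[1] - r[2]
    return e0, e1, e2

def grad(w01, w02, w12):
    # apply d/dx (a*x + c)^2 = 2*(a*x + c)*a to each squared term
    e0, e1, e2 = residuals(w01, w02, w12)
    return (2 * e0 * d[1] + 2 * e1 * d[0],   # dJ/dw01
            2 * e0 * d[2] + 2 * e2 * d[0],   # dJ/dw02
            2 * e1 * d[2] + 2 * e2 * d[1])   # dJ/dw12

w = [0.0, 0.0, 0.0]   # initialize w01, w02, w12 to zero
eta = 0.1             # learning rate
for _ in range(5000):
    g = grad(*w)
    w = [wi - eta * gi for wi, gi in zip(w, g)]

e0, e1, e2 = residuals(*w)
print(w, e0 ** 2 + e1 ** 2 + e2 ** 2)   # learned weights and final SSE
```

With three equations and three unknown weights, this toy instance can drive the SSE essentially to zero; on real data there are far more ratings than weights, so the minimum SSE stays positive.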