CS425: Algorithms for Web Scale Data. Most of the slides are from the Mining of Massive Datasets book and have been modified for CS425. The original slides can be accessed at: www.mmds.org
Training data: 100 million ratings, 480,000 users, 17,770 movies; 6 years of data (2000–2005)
Test data: last few ratings of each user (2.8 million)
Evaluation criterion: Root Mean Square Error (RMSE) = sqrt( (1/|S|) Σ_{(i,x)∈S} (r̂_xi − r_xi)² )
Netflix's system RMSE: 0.9514
Competition: 2,700+ teams; $1 million prize for a 10% improvement over Netflix
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
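As a concrete sketch, the RMSE criterion above can be computed with a small helper function (the sample rating values below are hypothetical, chosen only for illustration):

```python
import math

def rmse(predicted, actual):
    """RMSE = sqrt( (1/|S|) * sum over S of (r_hat - r)^2 )."""
    assert len(predicted) == len(actual)
    sse = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(sse / len(predicted))

# Hypothetical predicted vs. true ratings on a tiny test set:
print(rmse([3.5, 4.0, 2.0], [4.0, 4.0, 1.0]))
```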
[Figure: the utility matrix R, 480,000 users × 17,770 movies, sparsely populated with ratings 1–5]
[Figure: matrix R partitioned into a training data set (known ratings) and a test data set (entries marked '?'); the highlighted entry r_{4,7} is the true rating of user x on item i]
RMSE = sqrt( (1/|S|) Σ_{(i,x)∈S} (r̂_xi − r_xi)² ), where r_xi is the true rating and r̂_xi the predicted rating
The winner of the Netflix Challenge!
Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined, local view:
Global effects: overall deviations of users/movies
Factorization: addressing "regional" effects
Collaborative filtering: extract local patterns
Global: mean movie rating is 3.7 stars; The Sixth Sense is 0.5 stars above avg.; Joe rates 0.2 stars below avg.
Baseline estimation: Joe will rate The Sixth Sense 3.7 + 0.5 − 0.2 = 4 stars
Local neighborhood (CF/NN): Joe didn't like the related movie Signs
Final estimate: Joe will rate The Sixth Sense 3.8 stars
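In code, the baseline arithmetic from the slide looks like this (the −0.2 local CF correction is inferred from the 3.8-star final estimate, not stated explicitly on the slide):

```python
mu  = 3.7    # overall mean movie rating
b_i = 0.5    # The Sixth Sense is 0.5 stars above average
b_x = -0.2   # Joe rates 0.2 stars below average

baseline = mu + b_x + b_i                  # baseline estimate: 4.0 stars
cf_adjustment = -0.2                       # local CF/NN term: Joe disliked the related movie Signs
final_estimate = baseline + cf_adjustment  # 3.8 stars
print(baseline, final_estimate)
```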
Earliest and most popular collaborative filtering method. Derive unknown ratings from those of "similar" movies (item-item variant).
Define a similarity measure s_ij of items i and j. Select the k nearest neighbors and compute the rating:
r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij
s_ij … similarity of items i and j
r_xj … rating of user x on item j
N(i;x) … set of items similar to i that were rated by x
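A minimal sketch of the item-item prediction formula, assuming we are handed a per-user ratings dict and a precomputed pairwise similarity table (all names and values below are hypothetical):

```python
def predict(x, i, ratings, sim, k=2):
    """r_hat_xi = sum_{j in N(i;x)} s_ij * r_xj / sum_{j in N(i;x)} s_ij."""
    # N(i; x): the k items most similar to i that user x has rated
    neighbors = sorted(ratings[x], key=lambda j: sim[(i, j)], reverse=True)[:k]
    num = sum(sim[(i, j)] * ratings[x][j] for j in neighbors)
    den = sum(sim[(i, j)] for j in neighbors)
    return num / den

# Hypothetical data: Joe's known ratings and item-item similarities
ratings = {"joe": {"signs": 2.0, "unbreakable": 4.0}}
sim = {("sixth_sense", "signs"): 0.8, ("sixth_sense", "unbreakable"): 0.6}
print(predict("joe", "sixth_sense", ratings, sim))  # (0.8*2 + 0.6*4) / 1.4
```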
In practice we get better estimates if we model deviations:
r̂_xi = b_xi + Σ_{j∈N(i;x)} s_ij (r_xj − b_xj) / Σ_{j∈N(i;x)} s_ij
where b_xi = μ + b_x + b_i is the baseline estimate for r_xi:
μ = overall mean rating
b_x = rating deviation of user x = (avg. rating of user x) − μ
b_i = rating deviation of movie i = (avg. rating of movie i) − μ
Problems/issues:
1) Similarity measures are "arbitrary"
2) Pairwise similarities neglect interdependencies among users
3) Taking a weighted average can be restricting
Solution: instead of s_ij, use weights w_ij that we estimate directly from data
Use a weighted sum rather than a weighted average:
r̂_xi = b_xi + Σ_{j∈N(i;x)} w_ij (r_xj − b_xj)
A few notes:
N(i;x) … set of movies rated by user x that are similar to movie i
w_ij is the interpolation weight (some real number)
We allow Σ_{j∈N(i;x)} w_ij ≠ 1
w_ij models the interaction between pairs of movies (it does not depend on user x)
r̂_xi = b_xi + Σ_{j∈N(i;x)} w_ij (r_xj − b_xj)
How to set w_ij? Remember, the error metric is RMSE = sqrt( (1/|S|) Σ_{(i,x)∈S} (r̂_xi − r_xi)² ), or equivalently SSE = Σ_{(i,x)∈S} (r̂_xi − r_xi)²
Find the w_ij that minimize SSE on training data!
w_ij models relationships between item i and its neighbors j
w_ij can be learned/estimated based on x and all other users that rated i
Why is this a good idea?
Goal: make good recommendations. Quantify goodness using RMSE: lower RMSE → better recommendations.
We want to make good recommendations on items that the user has not yet seen. We can't really do this directly!
Instead: build a system that works well on known (user, item) ratings, and hope that it will also predict the unknown ratings well.
Idea: let's set the values w so that they work well on known (user, item) ratings. How to find such values w? Define an objective function and solve the optimization problem: find the w_ij that minimize SSE on training data!
J(w) = Σ_{(i,x)} ( [ b_xi + Σ_{j∈N(i;x)} w_ij (r_xj − b_xj) ] − r_xi )²
(predicted rating minus true rating, squared). Think of w as a vector of numbers.
A simple way to minimize a function f(x):
Compute the gradient ∇f
Start at some point y and evaluate ∇f(y)
Make a step in the reverse direction of the gradient: y_new = y_old − η · ∇f(y_old)
Repeat until converged
[Figure: one gradient step along the curve of f, from y to the updated point]
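The loop above, sketched for a one-dimensional function (the quadratic objective and the step size η = 0.1 are illustrative choices, not part of the slides):

```python
def gradient_descent(grad, y0, eta=0.1, eps=1e-8, max_iter=10_000):
    """Repeatedly step against the gradient: y <- y - eta * grad(y)."""
    y = y0
    for _ in range(max_iter):
        step = eta * grad(y)
        y -= step
        if abs(step) < eps:   # converged: the update barely moves y
            break
    return y

# f(y) = (y - 3)^2 has gradient 2*(y - 3); the minimum is at y = 3
print(gradient_descent(lambda y: 2.0 * (y - 3.0), y0=0.0))
```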
Example: Formulation
Assume we have a dataset with a single user x and items 0, 1, and 2. We are given all ratings, and we want to compute the weights w_01, w_02, and w_12.
Rating estimate: r̂_xi = b_xi + Σ_{j∈N(i;x)} w_ij (r_xj − b_xj)
The training dataset already has the correct r_xi values. We will use the estimation formula to compute the unknown weights w_01, w_02, and w_12.
Optimization problem: compute the w_ij values to minimize Σ_{(i,x)∈S} (r̂_xi − r_xi)²
Plug in the formulas:
minimize J(w) = ( b_x0 + w_01 (r_x1 − b_x1) + w_02 (r_x2 − b_x2) − r_x0 )²
              + ( b_x1 + w_01 (r_x0 − b_x0) + w_12 (r_x2 − b_x2) − r_x1 )²
              + ( b_x2 + w_02 (r_x0 − b_x0) + w_12 (r_x1 − b_x1) − r_x2 )²
CS 425 – Lecture 9 Mustafa Ozdal, Bilkent University
Example: Algorithm
Initialize the unknown variables:
w_new = [w_01_new; w_02_new; w_12_new] = [0; 0; 0]
Iterate:
while |w_new − w_old| > ε:
    w_old = w_new
    w_new = w_old − η · ∇J(w_old)
η is the learning rate (a parameter).
How do we compute ∇J(w_old)?
Example: Gradient-Based Update
J(w) = ( b_x0 + w_01 (r_x1 − b_x1) + w_02 (r_x2 − b_x2) − r_x0 )²
     + ( b_x1 + w_01 (r_x0 − b_x0) + w_12 (r_x2 − b_x2) − r_x1 )²
     + ( b_x2 + w_02 (r_x0 − b_x0) + w_12 (r_x1 − b_x1) − r_x2 )²
∇J(w) = [ ∂J(w)/∂w_01 ; ∂J(w)/∂w_02 ; ∂J(w)/∂w_12 ]
Update: [w_01_new; w_02_new; w_12_new] = [w_01_old; w_02_old; w_12_old] − η · ∇J(w_old)
Each partial derivative is evaluated at w_old.
Example: Computing Partial Derivatives
J(w) = ( b_x0 + w_01 (r_x1 − b_x1) + w_02 (r_x2 − b_x2) − r_x0 )²
     + ( b_x1 + w_01 (r_x0 − b_x0) + w_12 (r_x2 − b_x2) − r_x1 )²
     + ( b_x2 + w_02 (r_x0 − b_x0) + w_12 (r_x1 − b_x1) − r_x2 )²
Reminder: ∂( (ax + b)² )/∂x = 2 (ax + b) a
∂J(w)/∂w_01 = 2 ( b_x0 + w_01 (r_x1 − b_x1) + w_02 (r_x2 − b_x2) − r_x0 ) (r_x1 − b_x1)
            + 2 ( b_x1 + w_01 (r_x0 − b_x0) + w_12 (r_x2 − b_x2) − r_x1 ) (r_x0 − b_x0)
Evaluate each partial derivative at w_old to compute the gradient direction.
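Putting the whole example together: a sketch of gradient descent on the three-item objective. The ratings r and baselines b below are made up for illustration; only the structure of J(w) and its partial derivatives comes from the slides.

```python
r = [5.0, 3.0, 4.0]   # hypothetical true ratings r_x0, r_x1, r_x2
b = [4.0, 3.5, 3.8]   # hypothetical baseline estimates b_x0, b_x1, b_x2
d = [r[k] - b[k] for k in range(3)]   # deviations r_xk - b_xk

def residuals(w01, w02, w12):
    # the three error terms inside J(w)
    e0 = b[0] + w01 * d[1] + w02 * d[2] - r[0]
    e1 = b[1] + w01 * d[0] + w12 * d[2] - r[1]
    e2 = b[2] + w02 * d[0] + w12 * d[1] - r[2]
    return e0, e1, e2

def grad(w01, w02, w12):
    # apply d/dx (a*x + c)^2 = 2*(a*x + c)*a to each squared term
    e0, e1, e2 = residuals(w01, w02, w12)
    return (2 * e0 * d[1] + 2 * e1 * d[0],   # dJ/dw01
            2 * e0 * d[2] + 2 * e2 * d[0],   # dJ/dw02
            2 * e1 * d[2] + 2 * e2 * d[1])   # dJ/dw12

w = [0.0, 0.0, 0.0]   # initialize w01, w02, w12 to zero
eta = 0.1             # learning rate
for _ in range(5000):
    g = grad(*w)
    w = [wi - eta * gi for wi, gi in zip(w, g)]

e0, e1, e2 = residuals(*w)
print(w, e0 ** 2 + e1 ** 2 + e2 ** 2)   # learned weights and final SSE
```

With three equations and three unknown weights, this toy instance can drive the SSE essentially to zero; on real data there are far more ratings than weights, so the minimum SSE stays positive.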