CS425: Algorithms for Web Scale Data



  1. CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at: www.mmds.org

  2.
  • Training data
    – 100 million ratings, 480,000 users, 17,770 movies
    – 6 years of data: 2000–2005
  • Test data
    – Last few ratings of each user (2.8 million)
  • Evaluation criterion: Root Mean Square Error (RMSE)
    RMSE = √( (1/|R|) Σ_{(i,x)∈R} (r̂_xi − r_xi)² )
  • Netflix’s system RMSE: 0.9514
  • Competition
    – 2,700+ teams
    – $1 million prize for a 10% improvement on Netflix’s system
  J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
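The evaluation criterion can be sketched in a few lines of Python; the `rmse` helper and the toy ratings below are illustrative assumptions, not Netflix data:

```python
import math

def rmse(predicted, actual):
    """Root Mean Square Error over a set of (user, item) ratings."""
    errors = [(predicted[k] - actual[k]) ** 2 for k in actual]
    return math.sqrt(sum(errors) / len(errors))

# Made-up toy ratings:
actual = {("joe", "m1"): 4, ("joe", "m2"): 2}
predicted = {("joe", "m1"): 3.5, ("joe", "m2"): 2.5}
print(rmse(predicted, actual))  # sqrt((0.25 + 0.25) / 2) = 0.5
```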

  3. [Figure: the ratings matrix R, 480,000 users × 17,770 movies, sparsely filled with ratings 1–5.]

  4. [Figure: matrix R split into a training data set and a test data set; held-out test entries such as r_4,7 are shown as “?”.]
  • r_xi = true rating of user x on item i
  • r̂_xi = predicted rating
  • RMSE = √( (1/|R|) Σ_{(i,x)∈R} (r̂_xi − r_xi)² )

  5.
  • The winner of the Netflix Challenge!
  • Multi-scale modeling of the data: combine top-level, “regional” modeling of the data with a refined, local view:
    – Global: overall deviations of users/movies
    – Factorization: addressing “regional” effects
    – Collaborative filtering: extract local patterns

  6.
  • Global:
    – Mean movie rating: 3.7 stars
    – The Sixth Sense is 0.5 stars above avg.
    – Joe rates 0.2 stars below avg.
    – Baseline estimation: Joe will rate The Sixth Sense 4 stars
  • Local neighborhood (CF/NN):
    – Joe didn’t like the related movie Signs
    – Final estimate: Joe will rate The Sixth Sense 3.8 stars

  7.
  • Earliest and most popular collaborative filtering method
  • Derive unknown ratings from those of “similar” movies (item-item variant)
  • Define a similarity measure s_ij of items i and j
  • Select the k nearest neighbors and compute the rating:
    r̂_xi = ( Σ_{j∈N(i;x)} s_ij · r_xj ) / ( Σ_{j∈N(i;x)} s_ij )
    – s_ij … similarity of items i and j
    – r_xj … rating of user x on item j
    – N(i;x) … set of items similar to i that were rated by x
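A minimal sketch of this item-item prediction in Python; the `predict_item_item` helper, the similarity values, and the rule used for N(i;x) (the k items most similar to i among those the user rated) are illustrative assumptions:

```python
def predict_item_item(x, i, ratings, sim, k=2):
    """r̂_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij."""
    # N(i;x): here, the k items most similar to i that user x has rated
    rated = [j for j in ratings.get(x, {}) if j != i and (i, j) in sim]
    neighbors = sorted(rated, key=lambda j: sim[(i, j)], reverse=True)[:k]
    num = sum(sim[(i, j)] * ratings[x][j] for j in neighbors)
    den = sum(sim[(i, j)] for j in neighbors)
    return num / den

# Made-up ratings and similarities:
ratings = {"x": {"A": 5, "B": 3, "C": 1}}
sim = {("D", "A"): 0.8, ("D", "B"): 0.4, ("D", "C"): 0.1}
print(predict_item_item("x", "D", ratings, sim, k=2))  # (0.8·5 + 0.4·3) / 1.2 ≈ 4.33
```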

  8.
  • In practice we get better estimates if we model deviations:
    r̂_xi = b_xi + ( Σ_{j∈N(i;x)} s_ij · (r_xj − b_xj) ) / ( Σ_{j∈N(i;x)} s_ij )
    b_xi = μ + b_x + b_i is the baseline estimate for r_xi, where
    – μ = overall mean rating
    – b_x = rating deviation of user x = (avg. rating of user x) − μ
    – b_i = rating deviation of movie i = (avg. rating of movie i) − μ
  • Problems/Issues:
    1) Similarity measures are “arbitrary”
    2) Pairwise similarities neglect interdependencies among users
    3) Taking a weighted average can be restricting
  • Solution: instead of s_ij use weights w_ij that we estimate directly from data
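The baseline estimate b_xi = μ + b_x + b_i can be sketched as follows, plugging in the numbers from the Joe / The Sixth Sense example; the `baseline` helper and the average-rating values are illustrative assumptions:

```python
def baseline(mu, user_avg, item_avg, x, i):
    """b_xi = μ + b_x + b_i, with b_x = avg(x) − μ and b_i = avg(i) − μ."""
    b_x = user_avg[x] - mu
    b_i = item_avg[i] - mu
    return mu + b_x + b_i

mu = 3.7                            # overall mean rating
user_avg = {"joe": 3.5}             # Joe rates 0.2 stars below avg.
item_avg = {"sixth_sense": 4.2}     # The Sixth Sense is 0.5 stars above avg.
print(baseline(mu, user_avg, item_avg, "joe", "sixth_sense"))  # 3.7 − 0.2 + 0.5 = 4.0
```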

  9.
  • Use a weighted sum rather than a weighted avg.:
    r̂_xi = b_xi + Σ_{j∈N(i;x)} w_ij · (r_xj − b_xj)
  • A few notes:
    – N(i;x) … set of movies rated by user x that are similar to movie i
    – w_ij is the interpolation weight (some real number)
    – We allow: Σ_{j∈N(i;x)} w_ij ≠ 1
    – w_ij models the interaction between pairs of movies (it does not depend on user x)
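The weighted-sum estimate can be sketched directly from the formula; the helper name and the toy baselines, ratings, and weights below are made-up for illustration:

```python
def predict_weighted_sum(b, w, r, x, i, neighbors):
    """r̂_xi = b_xi + Σ_{j∈N(i;x)} w_ij (r_xj − b_xj); Σ w_ij need not be 1."""
    return b[(x, i)] + sum(w[(i, j)] * (r[(x, j)] - b[(x, j)]) for j in neighbors)

# Made-up toy data: predicting item 0 for user "x" from neighbors 1 and 2
b = {("x", 0): 3.0, ("x", 1): 3.5, ("x", 2): 2.5}   # baseline estimates
r = {("x", 1): 4.0, ("x", 2): 2.0}                  # known ratings
w = {(0, 1): 0.3, (0, 2): 0.2}                      # interpolation weights
print(predict_weighted_sum(b, w, r, "x", 0, [1, 2]))  # 3.0 + 0.3·0.5 + 0.2·(−0.5) = 3.05
```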

  10.
  • r̂_xi = b_xi + Σ_{j∈N(i;x)} w_ij · (r_xj − b_xj)
  • How to set w_ij?
    – Remember, the error metric is RMSE = √( (1/|R|) Σ_{(i,x)∈R} (r̂_xi − r_xi)² ), or equivalently SSE: Σ_{(i,x)∈R} (r̂_xi − r_xi)²
    – Find the w_ij that minimize SSE on training data!
    – Models relationships between item i and its neighbors j
    – w_ij can be learned/estimated based on x and all other users that rated i
  • Why is this a good idea?

  11. [Figure: the sparse ratings matrix from before.]
  • Goal: Make good recommendations
    – Quantify goodness using RMSE: lower RMSE ⇒ better recommendations
    – We want to make good recommendations on items that the user has not yet seen. Can’t really do this!
  • Let’s build a system that works well on known (user, item) ratings, and hope the system will also predict the unknown ratings well

  12.
  • Idea: Let’s set the values w such that they work well on known (user, item) ratings
  • How to find such values w?
  • Idea: Define an objective function and solve the optimization problem
  • Find the w_ij that minimize SSE on training data:
    J(w) = Σ_{x,i} ( [b_xi + Σ_{j∈N(i;x)} w_ij · (r_xj − b_xj)] − r_xi )²
    (predicted rating minus true rating, squared and summed)
  • Think of w as a vector of numbers
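The objective J(w) can be sketched as a plain function of the weight vector; the `sse_objective` name and the toy data below are illustrative assumptions:

```python
def sse_objective(w, b, r, neighbors):
    """J(w) = Σ_{(x,i)} ( b_xi + Σ_{j∈N(i;x)} w_ij (r_xj − b_xj) − r_xi )²."""
    total = 0.0
    for (x, i), js in neighbors.items():
        pred = b[(x, i)] + sum(w[(i, j)] * (r[(x, j)] - b[(x, j)]) for j in js)
        total += (pred - r[(x, i)]) ** 2
    return total

# Made-up toy data: two known ratings, all baselines 3.0, all weights 0
b = {("x", 0): 3.0, ("x", 1): 3.0}
r = {("x", 0): 4.0, ("x", 1): 3.0}
neighbors = {("x", 0): [1], ("x", 1): [0]}
print(sse_objective({(0, 1): 0.0, (1, 0): 0.0}, b, r, neighbors))  # (3−4)² + (3−3)² = 1.0
```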

  13.
  • A simple way to minimize a function f(x):
    – Compute the derivative ∇f
    – Start at some point y and evaluate ∇f(y)
    – Make a step in the reverse direction of the gradient: y = y − η·∇f(y)
    – Repeat until converged
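The steps above can be sketched as a one-dimensional gradient descent; the function being minimized and the parameter values are made-up for illustration:

```python
def gradient_descent(grad, y0, eta=0.1, tol=1e-8, max_iter=10000):
    """Repeatedly step opposite the gradient: y ← y − η·∇f(y)."""
    y = y0
    for _ in range(max_iter):
        step = eta * grad(y)
        y = y - step
        if abs(step) < tol:   # converged: the steps have become tiny
            break
    return y

# Minimize f(y) = (y − 2)², whose gradient is 2(y − 2); the minimum is at y = 2
print(gradient_descent(lambda y: 2 * (y - 2), y0=0.0))  # approaches 2.0
```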

  14. Example: Formulation
  • Assume we have a dataset with a single user x and items 0, 1, and 2. We are given all ratings, and we want to compute the weights w_01, w_02, and w_12.
  • Rating estimate: r̂_xi = b_xi + Σ_{j∈N(i;x)} w_ij · (r_xj − b_xj)
  • The training dataset already has the correct r_xi values. We will use the estimation formula to compute the unknown weights w_01, w_02, and w_12.
  • Optimization problem: compute the w_ij values that minimize Σ_{(i,x)∈R} (r̂_xi − r_xi)²
  • Plug in the formulas:
    minimize J(w) = [b_x0 + w_01(r_x1 − b_x1) + w_02(r_x2 − b_x2) − r_x0]²
                  + [b_x1 + w_01(r_x0 − b_x0) + w_12(r_x2 − b_x2) − r_x1]²
                  + [b_x2 + w_02(r_x0 − b_x0) + w_12(r_x1 − b_x1) − r_x2]²
  CS 425 – Lecture 9, Mustafa Ozdal, Bilkent University

  15. Example: Algorithm
  • Initialize the unknown variables:
    w_new = [w_01_new, w_02_new, w_12_new]ᵀ = [0, 0, 0]ᵀ
  • Iterate:
    while |w_new − w_old| > ε:
        w_old = w_new
        w_new = w_old − η·∇J(w_old)
  • η is the learning rate (a parameter)
  • How to compute ∇J(w_old)?

  16. Example: Gradient-Based Update
  • J(w) = [b_x0 + w_01(r_x1 − b_x1) + w_02(r_x2 − b_x2) − r_x0]²
         + [b_x1 + w_01(r_x0 − b_x0) + w_12(r_x2 − b_x2) − r_x1]²
         + [b_x2 + w_02(r_x0 − b_x0) + w_12(r_x1 − b_x1) − r_x2]²
  • The gradient is the vector of partial derivatives:
    ∇J(w) = [∂J(w)/∂w_01, ∂J(w)/∂w_02, ∂J(w)/∂w_12]ᵀ
  • Update step:
    [w_01, w_02, w_12]ᵀ_new = [w_01, w_02, w_12]ᵀ_old − η·∇J(w_old)
  • Each partial derivative is evaluated at w_old.

  17. Example: Computing Partial Derivatives
  • J(w) = [b_x0 + w_01(r_x1 − b_x1) + w_02(r_x2 − b_x2) − r_x0]²
         + [b_x1 + w_01(r_x0 − b_x0) + w_12(r_x2 − b_x2) − r_x1]²
         + [b_x2 + w_02(r_x0 − b_x0) + w_12(r_x1 − b_x1) − r_x2]²
  • Reminder: ∂(ax + b)²/∂x = 2(ax + b)·a
  • ∂J(w)/∂w_01 = 2[b_x0 + w_01(r_x1 − b_x1) + w_02(r_x2 − b_x2) − r_x0]·(r_x1 − b_x1)
               + 2[b_x1 + w_01(r_x0 − b_x0) + w_12(r_x2 − b_x2) − r_x1]·(r_x0 − b_x0)
  • Evaluate each partial derivative at w_old to compute the gradient direction.
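Putting the whole worked example together, a minimal sketch of the gradient-descent loop for the three unknown weights; the `toy_gradient_descent` name, the learning rate, and the baseline/rating values are made-up assumptions, not from the lecture:

```python
def toy_gradient_descent(b, r, eta=0.05, eps=1e-9, max_iter=100000):
    """Learn w_01, w_02, w_12 for one user and items 0, 1, 2 by minimizing J(w)."""
    d = [r[j] - b[j] for j in range(3)]   # deviations r_xj − b_xj
    w01 = w02 = w12 = 0.0                 # initialize w_new to zeros
    for _ in range(max_iter):
        # residuals of the three squared terms in J(w)
        e0 = b[0] + w01 * d[1] + w02 * d[2] - r[0]
        e1 = b[1] + w01 * d[0] + w12 * d[2] - r[1]
        e2 = b[2] + w02 * d[0] + w12 * d[1] - r[2]
        # partial derivatives, each evaluated at the current (old) weights
        g01 = 2 * (e0 * d[1] + e1 * d[0])
        g02 = 2 * (e0 * d[2] + e2 * d[0])
        g12 = 2 * (e1 * d[2] + e2 * d[1])
        n01, n02, n12 = w01 - eta * g01, w02 - eta * g02, w12 - eta * g12
        converged = max(abs(n01 - w01), abs(n02 - w02), abs(n12 - w12)) < eps
        w01, w02, w12 = n01, n02, n12
        if converged:
            break
    return w01, w02, w12

# Made-up data: all baselines b_xi = 3.0; true ratings r_x0, r_x1, r_x2 = 4, 2, 5
w01, w02, w12 = toy_gradient_descent({0: 3.0, 1: 3.0, 2: 3.0}, {0: 4.0, 1: 2.0, 2: 5.0})
print(round(w01, 4), round(w02, 4), round(w12, 4))  # converges near 1.0 1.0 -1.0
```

With three equations and three unknowns the toy objective can be driven to zero, so the loop recovers the exact weights; on real data J(w) stays positive and the weights only minimize it.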
