CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu
Training data: 100 million ratings, 480,000 users, 17,770 movies
  6 years of data: 2000-2005
Test data: last few ratings of each user (2.8 million)
  Evaluation criterion: Root Mean Square Error (RMSE)
  Netflix's system RMSE: 0.9514
Competition: 2,700+ teams
  $1 million prize for a 10% improvement on Netflix
2/10/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets
[Figure: the rating matrix R, with 17,770 movies as rows and 480,000 users as columns, sparsely filled with ratings from 1 to 5]
[Figure: the rating matrix R (480,000 users, 17,770 movies) split into a training data set of known ratings and a test data set of held-out entries marked "?"; one example entry is labeled $r_{3,6}$]
$$SSE = \sum_{(i,x) \in R} \left(\hat{r}_{xi} - r_{xi}\right)^2$$
$r_{xi}$: true rating of user x on item i
$\hat{r}_{xi}$: predicted rating
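The relation between SSE and the RMSE criterion can be sketched in a few lines. This is a minimal illustration, not the competition's scoring code: the function names `sse` and `rmse` and the three (predicted, true) pairs are made up for the example.

```python
import math

def sse(pairs):
    """Sum of squared errors over (predicted, true) rating pairs."""
    return sum((pred - true) ** 2 for pred, true in pairs)

def rmse(pairs):
    """Root mean square error: sqrt(SSE / number of ratings)."""
    return math.sqrt(sse(pairs) / len(pairs))

# Hypothetical (predicted, true) ratings for three test entries.
example_pairs = [(4.0, 5.0), (3.0, 3.0), (2.0, 4.0)]
example_sse = sse(example_pairs)    # 1 + 0 + 4
example_rmse = rmse(example_pairs)  # sqrt(5 / 3)
```

Because RMSE is a monotone function of SSE for a fixed number of ratings, minimizing one minimizes the other.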
The winner of the Netflix Challenge
Multi-scale modeling of the data: combine a top-level, "regional" modeling of the data with a refined, local view:
  Global: overall deviations of users/movies
  Factorization: addressing "regional" effects
  Collaborative filtering: extract local patterns
[Figure: three stacked layers labeled Global effects, Factorization, Collaborative filtering]
Global:
  Mean movie rating: 3.7 stars
  The Sixth Sense is 0.5 stars above avg.
  Joe rates 0.2 stars below avg.
  Baseline estimation: Joe will rate The Sixth Sense 4 stars
Local neighborhood (CF/NN):
  Joe didn't like the related movie Signs
  Final estimate: Joe will rate The Sixth Sense 3.8 stars
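The arithmetic of this example can be written out as a short sketch. The function name `baseline` is hypothetical, and the neighborhood adjustment of -0.2 is not stated on the slide; it is inferred here from the gap between the baseline (4 stars) and the final estimate (3.8 stars).

```python
def baseline(mu, b_user, b_item):
    """Baseline estimate: overall mean plus user and movie deviations."""
    return mu + b_user + b_item

# Numbers from the example: mean 3.7, movie +0.5, Joe -0.2.
joe_baseline = baseline(mu=3.7, b_user=-0.2, b_item=0.5)

# Assumed local correction, inferred from 4.0 -> 3.8 in the example.
neighborhood_adjustment = -0.2
joe_final = joe_baseline + neighborhood_adjustment
```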
Earliest and most popular collaborative filtering method
Derive unknown ratings from those of "similar" movies (item-item variant)
Define a similarity measure $s_{ij}$ of items i and j
Select the k nearest neighbors and compute the rating
  $N(i;x)$: items most similar to i that were rated by x
$$\hat{r}_{xi} = \frac{\sum_{j \in N(i;x)} s_{ij}\, r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}$$
$s_{ij}$: similarity of items i and j
$r_{xj}$: rating of user x on item j
$N(i;x)$: set of items similar to item i that were rated by x
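A minimal sketch of the item-item prediction, assuming the similarities to the target item are already computed and positive. The function name, the dictionary layout, and the toy numbers are illustrative assumptions, not part of the lecture.

```python
def predict_item_item(similarities, user_ratings, k=2):
    """Similarity-weighted average of user x's ratings on the k items most similar to i.

    similarities: {item j: s_ij} for the target item i (assumed positive).
    user_ratings: {item j: r_xj} for the items user x has rated.
    """
    # Keep only items the user actually rated, i.e. candidates for N(i; x).
    rated = [(s, user_ratings[j]) for j, s in similarities.items() if j in user_ratings]
    neighbors = sorted(rated, reverse=True)[:k]  # k largest similarities
    denom = sum(s for s, _ in neighbors)
    if denom == 0:
        return None  # no usable neighbors
    return sum(s * r for s, r in neighbors) / denom

# Hypothetical similarities to the target item and ratings by one user.
sims = {"A": 0.9, "B": 0.5, "C": 0.1, "D": 0.8}
ratings = {"A": 4.0, "B": 2.0, "C": 2.0}  # item D is unrated, so it is skipped
pred = predict_item_item(sims, ratings, k=2)  # (0.9*4 + 0.5*2) / (0.9 + 0.5)
```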
In practice we get better estimates if we model deviations:
$$\hat{r}_{xi} = b_{xi} + \frac{\sum_{j \in N(i;x)} s_{ij}\,(r_{xj} - b_{xj})}{\sum_{j \in N(i;x)} s_{ij}}$$
Baseline estimate for $r_{xi}$: $b_{xi} = \mu + b_x + b_i$
  $\mu$ = overall mean rating
  $b_x$ = rating deviation of user x = (avg. rating of user x) - $\mu$
  $b_i$ = (avg. rating of movie i) - $\mu$
Problems/Issues:
  1) Similarity measures are "arbitrary"
  2) Pairwise similarities neglect interdependencies among users
  3) Taking a weighted average can be restricting
Solution: Instead of $s_{ij}$ use $w_{ij}$ that we estimate directly from data
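The baseline terms can be estimated directly from the averages defined above. The sketch below assumes ratings are stored in a dictionary keyed by (user, item); the function name `fit_baselines` and the toy data are hypothetical.

```python
from collections import defaultdict

def fit_baselines(ratings):
    """Return (mu, b_user, b_item) from a dict {(user, item): rating}."""
    mu = sum(ratings.values()) / len(ratings)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for (x, i), r in ratings.items():
        by_user[x].append(r)
        by_item[i].append(r)
    # Deviation = (average rating of the user or movie) - overall mean.
    b_user = {x: sum(rs) / len(rs) - mu for x, rs in by_user.items()}
    b_item = {i: sum(rs) / len(rs) - mu for i, rs in by_item.items()}
    return mu, b_user, b_item

# Toy ratings, made up for illustration.
toy = {("joe", "signs"): 2, ("joe", "alien"): 4,
       ("ann", "signs"): 4, ("ann", "alien"): 5, ("ann", "sense"): 5}
mu, b_user, b_item = fit_baselines(toy)
# Baseline for a rating Joe has not given: b_xi = mu + b_x + b_i
b_joe_sense = mu + b_user["joe"] + b_item["sense"]
```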
Use a weighted sum rather than a weighted avg.:
$$\hat{r}_{xi} = b_{xi} + \sum_{j \in N(i;x)} w_{ij}\,(r_{xj} - b_{xj})$$
A few notes:
  We sum over all movies j that are similar to i and were rated by x
  $w_{ij}$ is the interpolation weight (some real number)
    We allow: $\sum_{j \in N(i;x)} w_{ij} \neq 1$
  $w_{ij}$ models the interaction between pairs of movies (it does not depend on user x)
  $N(i;x)$: set of movies rated by user x that are similar to movie i
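A small sketch of the weighted-sum prediction, with weights that deliberately do not sum to 1. All names and numbers are hypothetical; in particular the weights here are picked by hand rather than learned.

```python
def predict_weighted_sum(b_xi, weights, ratings, baselines):
    """b_xi plus the weighted sum of the user's rating deviations on neighbor movies.

    weights:   {movie j: w_ij} for the target movie i
    ratings:   {movie j: r_xj} for movies rated by user x
    baselines: {movie j: b_xj} baseline estimates for the same user
    """
    return b_xi + sum(w * (ratings[j] - baselines[j])
                      for j, w in weights.items() if j in ratings)

# Hand-picked weights; note they sum to 0.1, not 1.
weights = {"signs": 0.2, "alien": -0.1}
ratings = {"signs": 2.0, "alien": 4.5}
baselines = {"signs": 3.0, "alien": 4.0}
r_hat = predict_weighted_sum(4.0, weights, ratings, baselines)
# 4.0 + 0.2 * (-1.0) + (-0.1) * 0.5
```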
$$\hat{r}_{xi} = b_{xi} + \sum_{j \in N(i;x)} w_{ij}\,(r_{xj} - b_{xj})$$
How to set $w_{ij}$?
  Remember, the error metric is SSE: $\sum_{(i,x) \in R} \left(\hat{r}_{xi} - r_{xi}\right)^2$
  Find $w_{ij}$ that minimize SSE on training data!
    Models relationships between item i and its neighbors j
  $w_{ij}$ can be learned/estimated based on x and all other users that rated i
Why is this a good idea?
[Figure: the rating matrix with known ratings and unknown test entries marked "?"]
Here is what we just did:
Goal: Make good recommendations
  Quantify goodness using SSE: lower SSE means better recommendations
  We want to make good recommendations on items that a user has not yet seen. We can't really do this. Why?
  Let's set values w such that they work well on known (user, item) ratings
  and hope these w's will predict the unknown ratings well
This is the first time in the class that we see optimization methods
Idea: Let's set values w such that they work well on known (user, item) ratings
How to find such values w?
Idea: Define an objective function and solve the optimization problem
Find $w_{ij}$ that minimize SSE on training data!
$$\min_{w_{ij}} \sum_{x} \left( \left[ b_{xi} + \sum_{j \in N(i;x)} w_{ij}\,(r_{xj} - b_{xj}) \right] - r_{xi} \right)^2$$
Think of w as a vector of numbers
We have the optimization problem, now what?
$$\min_{w_{ij}} \sum_{x} \left( \left[ b_{xi} + \sum_{j \in N(i;x)} w_{ij}\,(r_{xj} - b_{xj}) \right] - r_{xi} \right)^2$$
Gradient descent
  Iterate until convergence: $w \leftarrow w - \eta \nabla w$, where $\eta$ is the learning rate
  and $\nabla w$ is the gradient (derivative evaluated on data):
$$\nabla w = \left[ \frac{\partial}{\partial w_{ij}} \right] = 2 \sum_{x} \left( \left[ b_{xi} + \sum_{k \in N(i;x)} w_{ik}\,(r_{xk} - b_{xk}) \right] - r_{xi} \right) (r_{xj} - b_{xj})$$
  for $j \in \{N(i;x), \forall i, \forall x\}$; else $\frac{\partial}{\partial w_{ij}} = 0$
  Note: we fix movie i, go over all $r_{xi}$, and for every movie $j \in N(i;x)$ we compute $\frac{\partial}{\partial w_{ij}}$
Pseudocode:
  while |w_new - w_old| > ε:
    w_old = w_new
    w_new = w_old - η · ∇w_old
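The loop can be sketched in plain Python for one fixed movie i. This is an illustrative sketch under stated assumptions, not the lecture's implementation: the function name `fit_weights`, the data layout, the learning rate, the stopping threshold, and the toy data are all made up. Each user contributes a dictionary of deviations r_xj - b_xj for the neighbors j in N(i;x) that the user rated, plus a target r_xi - b_xi.

```python
def fit_weights(residuals, targets, lr=0.05, eps=1e-10, max_iter=10000):
    """Gradient descent on sum_x (sum_j w_j * d_xj - t_x)^2 for one fixed movie i.

    residuals: one dict per user, {j: r_xj - b_xj} for j in N(i; x)
    targets:   one number per user, r_xi - b_xi
    """
    items = sorted({j for row in residuals for j in row})
    w = {j: 0.0 for j in items}
    for _ in range(max_iter):
        grad = {j: 0.0 for j in items}
        for row, t in zip(residuals, targets):
            err = sum(w[k] * d for k, d in row.items()) - t  # prediction error for user x
            for j, d in row.items():
                grad[j] += 2 * err * d  # movies not rated by x contribute 0
        w_new = {j: w[j] - lr * grad[j] for j in items}
        step = max(abs(w_new[j] - w[j]) for j in items)
        w = w_new
        if step <= eps:  # stop once the update is negligible
            break
    return w

# Toy data generated from weights a = 0.5, b = -0.25 (hypothetical).
residuals = [{"a": 1.0}, {"b": 1.0}, {"a": 1.0, "b": 1.0}]
targets = [0.5, -0.25, 0.25]
learned = fit_weights(residuals, targets)
```

On this toy problem the learned weights should approach 0.5 and -0.25, since the targets were generated from exactly those values.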
So far: $\hat{r}_{xi} = b_{xi} + \sum_{j \in N(i;x)} w_{ij}\,(r_{xj} - b_{xj})$
  Weights $w_{ij}$ are derived based on their role; no use of an arbitrary similarity measure ($w_{ij} \neq s_{ij}$)
  Explicitly account for interrelationships among the neighboring movies
Next: Latent factor model
  Extract "regional" correlations
[Figure: three stacked layers labeled Global effects, Factorization, CF/NN]
RMSE by method:
  Global average: 1.1296
  User average: 1.0651
  Movie average: 1.0533
  Netflix: 0.9514
  Basic collaborative filtering: 0.94
  CF + biases + learnt weights: 0.91
  Grand Prize: 0.8563
[Figure: movies placed in a two-dimensional latent factor space. One axis runs from Serious to Funny, the other from Geared towards males to Geared towards females. Movies shown: Braveheart, The Color Purple, Amadeus, Lethal Weapon, Sense and Sensibility, Ocean's 11, The Lion King, The Princess Diaries, Independence Day, Dumb and Dumber]
SVD: $A = U \Sigma V^T$
"SVD" on Netflix data: $R \approx Q \cdot P^T$
[Figure: the items-by-users rating matrix R approximated by the product of Q (items by f factors) and $P^T$ (f factors by users)]
For now let's assume we can approximate the rating matrix R as a product of "thin" matrices $Q \cdot P^T$
  R has missing entries, but let's ignore that for now!
  Basically, we want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones
How to estimate the missing rating of user x for item i?
$$\hat{r}_{xi} = q_i \cdot p_x^T = \sum_{f} q_{if} \cdot p_{xf}$$
  $q_i$ = row i of Q
  $p_x$ = column x of $P^T$
[Figure: the rating matrix R with one missing entry marked "?", next to the factor matrices Q (items by f factors) and $P^T$ (f factors by users)]
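A minimal sketch of the latent-factor prediction and of a reconstruction error measured on known ratings only. The factor values below are made up for illustration and are not the numbers from the figure; the function names are hypothetical.

```python
def predict(q_i, p_x):
    """Estimated rating: dot product of item factors q_i and user factors p_x."""
    return sum(q * p for q, p in zip(q_i, p_x))

def sse_on_known(known, Q, P):
    """SSE over known ratings only; known maps (item, user) -> rating."""
    return sum((predict(Q[i], P[x]) - r) ** 2 for (i, x), r in known.items())

# Hypothetical factors with f = 3.
Q = {"i1": [1.0, -0.5, 2.0], "i2": [0.5, 1.0, 0.0]}  # rows of Q
P = {"x1": [2.0, 1.0, 0.5]}                          # columns of P^T
r_hat = predict(Q["i1"], P["x1"])  # 1*2 + (-0.5)*1 + 2*0.5

# Only (i2, x1) is a known rating; the missing (i1, x1) entry is ignored.
known = {("i2", "x1"): 3.0}
error = sse_on_known(known, Q, P)  # (2.0 - 3.0)^2
```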