Training data
▪ 100 million ratings, 480,000 users, 17,770 movies
▪ 6 years of data: 2000-2005
Test data
▪ Last few ratings of each user (2.8 million)
▪ Evaluation criterion: Root Mean Square Error (RMSE) = $\sqrt{\frac{1}{|R|} \sum_{(i,x) \in R} (\hat{r}_{xi} - r_{xi})^2}$
▪ Netflix’s system RMSE: 0.9514
Competition
▪ 2,700+ teams
▪ $1 million prize for 10% improvement on Netflix
4/20/2020 Tim Althoff, UW CS547: Machine Learning for Big Data, http://www.cs.washington.edu/cse547
▪ Training Data: 100 million ratings (labels known publicly)
▪ Held-Out Data: 3 million ratings (labels only known to Netflix)
  ▪ Quiz Set: 1.5m ratings; scores posted on leaderboard
  ▪ Test Set: 1.5m ratings; scores known only to Netflix and used in determining final winner
[Figure: Matrix R, 480,000 users × 17,770 movies, sparsely filled with ratings from 1 to 5]
[Figure: Matrix R, 480,000 users × 17,770 movies, split into a Training Data Set (known ratings) and a Test Data Set (entries marked “?”); $r_{3,6}$ labels a single entry of R]
RMSE = $\sqrt{\frac{1}{|R|} \sum_{(i,x) \in R} (\hat{r}_{xi} - r_{xi})^2}$
▪ $\hat{r}_{xi}$ … predicted rating
▪ $r_{xi}$ … true rating of user x on item i
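The evaluation criterion can be sketched in a few lines of Python. This is a minimal illustration, not the competition's scoring code: the function name `rmse` and the three sample ratings are assumptions made up for the example.

```python
import math


def rmse(predicted, actual):
    """Root mean square error over paired (predicted, true) ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    # Sum of squared differences between predicted and true ratings.
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    # Average over the number of ratings, then take the square root.
    return math.sqrt(squared_error / len(predicted))


# Hypothetical held-out ratings and predictions for three (user, item) pairs.
true_ratings = [4.0, 2.0, 5.0]
predictions = [3.5, 2.5, 4.0]
print(rmse(predictions, true_ratings))
```

Because the errors are squared before averaging, one large miss (the last pair here) dominates several small ones.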
The winner of the Netflix Challenge
Multi-scale modeling of the data: combine top-level, “regional” modeling of the data with a refined, local view:
▪ Global effects: overall deviations of users/movies
▪ Factorization: addressing “regional” effects
▪ Collaborative filtering: extract local patterns
Global:
▪ Mean movie rating: 3.7 stars
▪ The Sixth Sense is 0.5 stars above avg.
▪ Joe rates 0.2 stars below avg.
▪ Baseline estimation: Joe will rate The Sixth Sense 4 stars, that is, 4 = 3.7 + 0.5 - 0.2
Local neighborhood (CF/NN):
▪ Joe didn’t like related movie Signs
▪ Final estimate: Joe will rate The Sixth Sense 3.8 stars
The earliest and the most popular collaborative filtering method
Derive unknown ratings from those of “similar” movies (item-item variant)
Define similarity metric $s_{ij}$ of items i and j
Select k-nearest neighbors, compute the rating
▪ N(i; x): items most similar to i that were rated by x
$\hat{r}_{xi} = \frac{\sum_{j \in N(i;x)} s_{ij} \cdot r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}$
▪ $s_{ij}$ … similarity of items i and j
▪ $r_{xj}$ … rating of user x on item j
▪ N(i;x) … set of items similar to item i that were rated by x
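The weighted-average prediction above could be sketched as follows. This is an illustrative sketch under assumptions: the function name `predict_item_item`, the dictionary layout, and the sample similarities are hypothetical, and similarities are assumed to be positive so the denominator is meaningful.

```python
def predict_item_item(user_ratings, similarity_to_i, k=2):
    """Predict r_xi as the similarity-weighted average over N(i; x).

    user_ratings:    dict mapping item j -> rating r_xj given by user x
    similarity_to_i: dict mapping item j -> similarity s_ij to target item i
    """
    # N(i; x): the k items rated by x that are most similar to i.
    rated = [j for j in user_ratings if j in similarity_to_i]
    neighbors = sorted(rated, key=lambda j: similarity_to_i[j], reverse=True)[:k]
    denominator = sum(similarity_to_i[j] for j in neighbors)
    if denominator == 0:
        return None  # no usable neighbors, so no prediction
    numerator = sum(similarity_to_i[j] * user_ratings[j] for j in neighbors)
    return numerator / denominator


# Hypothetical data: user x rated items A, B, C; similarities are to item i.
ratings_of_x = {"A": 2.0, "B": 5.0, "C": 3.0}
sims_to_i = {"A": 0.5, "B": 0.25, "C": 0.125}
print(predict_item_item(ratings_of_x, sims_to_i, k=2))
```

With k = 2 only items A and B contribute, so the estimate is (0.5·2 + 0.25·5) / 0.75.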
In practice we get better estimates if we model deviations:
$\hat{r}_{xi} = b_{xi} + \frac{\sum_{j \in N(i;x)} s_{ij} \cdot (r_{xj} - b_{xj})}{\sum_{j \in N(i;x)} s_{ij}}$
▪ $b_{xi} = \mu + b_x + b_i$ … baseline estimate for $r_{xi}$
▪ $\mu$ = overall mean rating
▪ $b_x$ = rating deviation of user x = (avg. rating of user x) – $\mu$
▪ $b_i$ = (avg. rating of movie i) – $\mu$
Problems/Issues:
1) Similarity metrics are “arbitrary”
2) Pairwise similarities neglect interdependencies among users
3) Taking a weighted average can be restricting
Solution: Instead of $s_{ij}$ use $w_{ij}$ that we estimate directly from data
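A small sketch of the deviation-based estimate, assuming the baseline terms are already known. The helper names `baseline` and `predict_with_baseline` and the neighbor values are hypothetical; only the 3.7, 0.5 and 0.2 figures come from the Joe / The Sixth Sense example.

```python
def baseline(mu, b_user, b_item):
    """Baseline estimate b_xi = mu + b_x + b_i."""
    return mu + b_user + b_item


def predict_with_baseline(b_xi, neighbors):
    """Add the similarity-weighted average of neighbor deviations to b_xi.

    neighbors: list of (s_ij, r_xj, b_xj) tuples for items j in N(i; x)
    """
    denominator = sum(s for s, _, _ in neighbors)
    if denominator == 0:
        return b_xi  # fall back to the baseline alone
    deviation = sum(s * (r - b) for s, r, b in neighbors) / denominator
    return b_xi + deviation


# mean 3.7, movie 0.5 above average, user 0.2 below average.
b_xi = baseline(3.7, -0.2, 0.5)
# Hypothetical neighbors: the user rated both below their own baselines.
neighbors_of_i = [(0.5, 2.0, 3.0), (0.5, 4.0, 4.5)]
print(b_xi, predict_with_baseline(b_xi, neighbors_of_i))
```

The neighbor deviations here are -1.0 and -0.5, so the local correction pulls the estimate 0.75 stars below the baseline.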
Use a weighted sum rather than a weighted avg.:
$\hat{r}_{xi} = b_{xi} + \sum_{j \in N(i;x)} w_{ij} (r_{xj} - b_{xj})$
A few notes:
▪ $N(i;x)$ … set of movies rated by user x that are similar to movie i
▪ $w_{ij}$ is the interpolation weight (some real number)
▪ Note, we allow: $\sum_{j \in N(i;x)} w_{ij} \neq 1$
▪ $w_{ij}$ models interaction between pairs of movies (it does not depend on user x)
$\hat{r}_{xi} = b_{xi} + \sum_{j \in N(i;x)} w_{ij} (r_{xj} - b_{xj})$
How to set $w_{ij}$?
▪ Remember, error metric is RMSE: $\sqrt{\frac{1}{|R|} \sum_{(i,x) \in R} (\hat{r}_{xi} - r_{xi})^2}$, or equivalently SSE: $\sum_{(i,x) \in R} (\hat{r}_{xi} - r_{xi})^2$
▪ Find $w_{ij}$ that minimize SSE on training data!
▪ Models relationships between item i and its neighbors j
▪ $w_{ij}$ can be learned/estimated based on x and all other users that rated i
Why is this a good idea?
Goal: Make good recommendations
▪ Quantify goodness using RMSE: lower RMSE ⇒ better recommendations
▪ We really want to make good recommendations on items that the user has not yet seen. We can’t really do this!
▪ Let’s build a system such that it works well on known (user, item) ratings, and hope the system will also predict the unknown ratings well
Idea: Let’s set values w such that they work well on known (user, item) ratings
How to find such values w?
Idea: Define an objective function and solve the optimization problem
Find $w_{ij}$ that minimize SSE on training data!
$J(w) = \sum_{x,i \in R} \left( \left[ b_{xi} + \sum_{j \in N(i;x)} w_{ij} (r_{xj} - b_{xj}) \right] - r_{xi} \right)^2$
▪ The bracketed term is the predicted rating; $r_{xi}$ is the true rating
Think of w as a vector of numbers
A simple way to minimize a function $f(x)$:
▪ Compute the derivative $\nabla f(x)$
▪ Start at some point $y$ and evaluate $\nabla f(y)$
▪ Make a step in the reverse direction of the gradient: $y = y - \nabla f(y)$
▪ Repeat until convergence
[Figure: curve of $f$ showing the point $f(y)$ and the linear step $f(y) + \nabla f(y)$]
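The steps above can be sketched for a one-dimensional function. Two things here are assumptions rather than part of the slide: the example function $f(y) = (y - 3)^2$, and the explicit `step` factor that scales the gradient (a full-size step would oscillate on this function).

```python
def gradient_descent(grad, start, step=0.1, tol=1e-10, max_iter=10_000):
    """Repeatedly step against the gradient until the update is tiny."""
    y = start
    for _ in range(max_iter):
        new_y = y - step * grad(y)  # move in the reverse direction of the gradient
        if abs(new_y - y) < tol:    # converged: the point barely moved
            return new_y
        y = new_y
    return y


# Hypothetical example: f(y) = (y - 3)^2 has derivative 2 * (y - 3)
# and its minimum at y = 3.
def grad_f(y):
    return 2.0 * (y - 3.0)


minimum = gradient_descent(grad_f, start=0.0)
print(minimum)
```

Each iteration shrinks the distance to the minimum by a constant factor (0.8 with these settings), which is why a small tolerance is reached quickly.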
We have the optimization problem, now what?
$J(w) = \sum_{x,i \in R} \left( \left[ b_{xi} + \sum_{j \in N(i;x)} w_{ij} (r_{xj} - b_{xj}) \right] - r_{xi} \right)^2$
Gradient descent:
▪ Iterate until convergence: $w \leftarrow w - \eta \nabla_w J$, where $\eta$ is the learning rate
▪ $\nabla_w J$ is the gradient (derivative evaluated on data):
$\nabla_w J = \left[ \frac{\partial J(w)}{\partial w_{ij}} \right] = 2 \sum_{x,i \in R} \left( \left[ b_{xi} + \sum_{k \in N(i;x)} w_{ik} (r_{xk} - b_{xk}) \right] - r_{xi} \right) (r_{xj} - b_{xj})$ for $j \in \{N(i;x), \forall i, \forall x\}$, else $\frac{\partial J(w)}{\partial w_{ij}} = 0$
▪ Note: We fix movie i, go over all $r_{xi}$, and for every movie $j \in N(i;x)$ we compute $\frac{\partial J(w)}{\partial w_{ij}}$
Pseudocode:
    while |w_new - w_old| > ε:
        w_old = w_new
        w_new = w_old - η · ∇w_old
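As a rough illustration of this update rule, the sketch below fits the interpolation weights for one fixed movie i. The function name `fit_weights`, the sample layout, and the toy data (generated from weights 0.5 and -0.25) are all assumptions for the example; each sample stands for one user x who rated i.

```python
def fit_weights(samples, eta=0.01, eps=1e-9, max_iter=100_000):
    """Gradient descent on SSE for the weights w_ij of one fixed movie i.

    samples: list of (target, devs), one per user x who rated i, where
             target = r_xi - b_xi and devs maps j in N(i; x) -> r_xj - b_xj
    """
    w = {j: 0.0 for _, devs in samples for j in devs}
    if not w:
        return w
    for _ in range(max_iter):
        grad = {j: 0.0 for j in w}
        for target, devs in samples:
            # Prediction error for this user, measured relative to the baseline.
            err = sum(w[j] * d for j, d in devs.items()) - target
            for j, d in devs.items():
                grad[j] += 2.0 * err * d  # dJ/dw_ij contribution of user x
        new_w = {j: w[j] - eta * grad[j] for j in w}
        change = max(abs(new_w[j] - w[j]) for j in w)
        w = new_w
        if change < eps:  # stop once the weights barely move
            break
    return w


# Hypothetical data consistent with w_a = 0.5 and w_b = -0.25.
toy_samples = [
    (0.0, {"a": 1.0, "b": 2.0}),
    (1.25, {"a": 2.0, "b": -1.0}),
    (-0.5, {"a": -1.0}),
]
learned = fit_weights(toy_samples)
print(learned)
```

A weight only receives gradient from users whose neighborhood contains that movie, which matches the "else 0" case of the derivative.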
So far: $\hat{r}_{xi} = b_{xi} + \sum_{j \in N(i;x)} w_{ij} (r_{xj} - b_{xj})$
▪ Weights $w_{ij}$ derived based on their roles; no use of an arbitrary similarity metric ($w_{ij} \neq s_{ij}$)
▪ Explicitly account for interrelationships among the neighboring movies
Next: Latent factor model
▪ Extract “regional” correlations
[Figure: the three modeling levels: Global effects, Factorization, CF/NN]
RMSE by method:
▪ Global average: 1.1296
▪ User average: 1.0651
▪ Movie average: 1.0533
▪ Netflix: 0.9514
▪ Basic Collaborative filtering: 0.94
▪ CF+Biases+learned weights: 0.91
▪ Grand Prize: 0.8563
[Slide from BellKor team]
[Figure: movies placed in a two-dimensional factor space. One axis runs from Serious to Funny, the other from Geared towards males to Geared towards females. Movies shown: Braveheart, The Color Purple, Amadeus, Lethal Weapon, Sense and Sensibility, Ocean’s 11, The Lion King, The Princess Diaries, Independence Day, Dumb and Dumber]
SVD: $A = U \Sigma V^T$
“SVD” on Netflix data: $R \approx Q \cdot P^T$
[Figure: rating matrix R (items × users) approximated by the product of Q (items × factors) and $P^T$ (factors × users)]
For now let’s assume we can approximate the rating matrix R as a product of “thin” matrices $Q \cdot P^T$
▪ R has missing entries but let’s ignore that for now!
▪ Basically, we want the reconstruction error to be small on known ratings and we don’t care about the values on the missing ones
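To make "reconstruction error on known ratings" concrete, here is a minimal sketch that scores a given pair of thin factor matrices. The names `predict` and `sse_on_known` and the tiny 2 × 2 example are hypothetical; the sketch only evaluates given factors and does not learn them.

```python
def predict(Q, P, item, user):
    """Entry (item, user) of Q · P^T: dot product of the two factor rows."""
    return sum(q * p for q, p in zip(Q[item], P[user]))


def sse_on_known(ratings, Q, P):
    """Sum of squared errors over known ratings only; missing entries are skipped.

    ratings: dict mapping (item, user) -> known rating
    """
    return sum((r - predict(Q, P, i, x)) ** 2 for (i, x), r in ratings.items())


# Hypothetical factors with 2 items, 2 users and 2 latent factors.
Q = [[1.0, 0.5], [0.0, 2.0]]  # one row per item
P = [[2.0, 1.0], [1.0, 1.0]]  # one row per user
# Only two of the four entries of R are known.
known = {(0, 0): 3.0, (1, 1): 2.0}
print(sse_on_known(known, Q, P))
```

The unknown entries (0, 1) and (1, 0) still get predictions from Q · P^T, but they contribute nothing to the error.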