Welcome to DS504/CS586: Big Data Analytics
Recommender Systems
Prof. Yanhua Li
Time: 6:00pm – 8:50pm Thu.
Location: AK 232
Fall 2016
Example: Recommender Systems
v Customer X
  § Star War I
  § Star War II
v Customer Y
  § Does search on Star War I
  § Recommender system suggests Star War II from data collected about customer X
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Recommendations
[Diagram: Items (products, web sites, blogs, news items, …) reach users via Search and via Recommendations]
From Scarcity to Abundance
v Shelf space is a scarce commodity for traditional retailers
  § Also: TV networks, movie theaters, …
v Web enables near-zero-cost dissemination of information about products
  § From scarcity to abundance, e.g., Amazon, Target online, eBay, etc.
v More choices necessitate better filters
  § Recommendation engines
Types of Recommendations
v Editorial and hand curated
  § List of favorites
  § Lists of “essential” items
v Simple aggregates
  § Top 10, Most Popular, Recent Uploads
v Tailored to individual users
  § Amazon, Netflix, …
Formal Model
v X = set of Customers
v S = set of Items
v Utility function u : X × S → R
  § R = set of ratings
  § R is a totally ordered set
  § e.g., 0-5 stars, real number in [0,1]
Utility Matrix

         Avatar  LOTR  Matrix  Pirates
Alice      1             0.2
Bob               0.5             0.3
Carol     0.2             1
David                             0.4
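The utility matrix above can be sketched as a NumPy array, with `np.nan` marking the unknown ratings. The assignment of each rating to a movie column follows the table's layout, which is an assumption about the original slide.

```python
import numpy as np

movies = ["Avatar", "LOTR", "Matrix", "Pirates"]
users = ["Alice", "Bob", "Carol", "David"]
_ = np.nan  # unknown rating

# Utility matrix U: rows = users, cols = movies (layout assumed from the slide)
U = np.array([
    [1.0, _,   0.2, _  ],   # Alice
    [_,   0.5, _,   0.3],   # Bob
    [0.2, _,   1.0, _  ],   # Carol
    [_,   _,   _,   0.4],   # David
])

# The matrix is sparse: only a fraction of entries are known
known = ~np.isnan(U)
print(known.sum(), "known of", U.size, "entries")
```

Representing missing ratings explicitly (rather than as zeros) matters later: several similarity measures behave badly when "missing" is conflated with "rated zero".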
Key Problems
v (1) Gathering “known” ratings for the matrix
  § How to collect the data in the utility matrix
v (2) Estimating unknown ratings from the known ones
  § Mainly interested in high unknown ratings
  • We are not interested in knowing what you don’t like but what you like
v (3) Evaluating estimation methods
  § How to measure success/performance of recommendation methods
(1) Gathering Ratings
v Explicit
  § Ask people to rate items
  § Doesn’t work well in practice – people can’t be bothered
v Implicit
  § Learn ratings from user actions
  • E.g., purchase implies high rating
  § What about low ratings?
(2) Estimating Utilities
v Key problem: utility matrix U is sparse
  § Most people have not rated most items
  § Cold start:
  • New items have no ratings
  • New users have no history
v Approaches to recommender systems:
  § 1) Content-based
  § 2) Collaborative filtering
Content-based Recommender Systems
Content-based Recommendations
v Main idea: Recommend items to customer x similar to previous items rated highly by x
  § Look at x’s items vs. all items
Examples:
v Movie recommendations
  § Recommend movies with same actor(s), director, genre, …
v Websites, blogs, news
  § Recommend other sites with “similar” content
Plan of Action
[Diagram: from the items the user likes, build item profiles; match them against the user profile; recommend matching items]
Item Profiles
v For each item, create an item profile
v Profile is a set (vector) of features
  § Movies: author, title, actor, director, …
  § Text: set of “important” words in document
v How to pick important features?
  § Usual heuristic from text mining is TF-IDF (Term Frequency × Inverse Document Frequency)
  • Term … Feature
  • Document … Item
Sidenote: TF-IDF
f_ij = frequency of term (feature) i in doc j
TF_ij = f_ij / max_k f_kj
  (Note: we normalize TF by the frequency of the most frequent term, to discount for “longer” documents)
n_i = number of docs that mention term i
N = total number of docs
IDF_i = log(N / n_i)
TF-IDF score: w_ij = TF_ij × IDF_i
Doc profile = set of words with highest TF-IDF scores, together with their scores:
  w_j = (w_1j, …, w_ij, …, w_kj)
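The TF-IDF definitions above can be sketched in a few lines. This follows the slide's normalization (TF divided by the most frequent term in the document); the log base is an assumption, since the slide leaves it unspecified, and the toy documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: tf-idf weight} per doc."""
    N = len(docs)
    # n_i = number of docs that mention term i (count each doc once)
    df = Counter(term for doc in docs for term in set(doc))
    profiles = []
    for doc in docs:
        f = Counter(doc)
        max_f = max(f.values())          # frequency of the most frequent term
        profiles.append({
            t: (cnt / max_f) * math.log2(N / df[t])   # w_ij = TF_ij * IDF_i
            for t, cnt in f.items()
        })
    return profiles

docs = [["star", "wars", "star"], ["star", "trek"], ["the", "matrix"]]
w = tf_idf(docs)
# "star" appears in 2 of 3 docs -> IDF = log2(3/2); in doc 0 its TF = 2/2 = 1
print(w[0]["star"])  # ≈ 0.585
```

The document profile would then keep only the highest-weighted terms per document, as the slide describes.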
User Profiles and Prediction
v User profile possibilities:
  § Weighted average of rated item profiles
  § Variation: weight by the difference from the average rating:
    w_x = Σ_{j=1..N_x} (r_xj − r̄_x) · w_j
v Prediction heuristic:
  § Given user profile w_x and item profile w_j, estimate
    r̂_xj = cos(w_x, w_j) = (w_x · w_j) / (||w_x|| ||w_j||)
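The two steps above can be sketched as follows: build the user profile from rating-weighted item profiles (using the mean-centered variation), then score a candidate item by cosine similarity. The 3-feature item vectors are hypothetical, standing in for, e.g., genre indicators.

```python
import numpy as np

def build_profile(item_profiles, ratings):
    """item_profiles: (n_items, n_features); ratings: length-n_items array.
    Weight each item profile by (rating - user's mean rating), then average."""
    centered = ratings - ratings.mean()
    return centered @ item_profiles / len(ratings)

def predict(user_profile, item_profile):
    """Prediction heuristic: cosine of the angle between the two profiles."""
    return (user_profile @ item_profile) / (
        np.linalg.norm(user_profile) * np.linalg.norm(item_profile))

# Hypothetical item profiles with 3 content features
rated_items = np.array([[1.0, 0.0, 1.0],
                        [1.0, 1.0, 0.0]])
ratings = np.array([5.0, 1.0])          # user loved item 0, disliked item 1
w_x = build_profile(rated_items, ratings)

candidate = np.array([1.0, 0.0, 1.0])   # shares features with the loved item
print(predict(w_x, candidate))          # positive: candidate matches the taste
```

Note how mean-centering makes the disliked item push its features *negatively* into the profile, which is the point of the weighting variation.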
Pros: Content-based Approach
v +: No need for data on other users
v +: Able to recommend to users with unique tastes
v +: Able to recommend new & unpopular items
  § No item cold-start
v +: Able to provide explanations
  § Can explain recommendations by listing the content features that caused an item to be recommended
Cons: Content-based Approach
v –: Finding the appropriate features is hard
  § E.g., images, movies, music
v –: Recommendations for new users
  § How to build a user profile?
  § User cold-start problem
v –: Overspecialization
  § Never recommends items outside the user’s content profile
  § People might have multiple interests
  § Unable to exploit quality judgments of other users
Collaborative Filtering Harnessing quality judgments of other users
Collaborative Filtering
v Consider user x
v Find set N of other users whose ratings are “similar” to x’s ratings
v Estimate x’s ratings based on the ratings of users in N
Finding “Similar” Users
v Let r_x be the vector of user x’s ratings
  § Example: r_x = [*, _, _, *, ***], r_y = [*, _, **, **, _]
v Jaccard similarity measure
  § Treats r_x, r_y as sets: r_x = {1, 4, 5}, r_y = {1, 3, 4}
  § sim(x, y) = |r_x ∩ r_y| / |r_x ∪ r_y|
  § Problem: ignores the values of the ratings
v Cosine similarity measure
  § Treats r_x, r_y as points: r_x = [1, 0, 0, 1, 3], r_y = [1, 0, 2, 2, 0]
  § sim(x, y) = cos(r_x, r_y) = (r_x · r_y) / (||r_x|| ||r_y||)
  § Problem: treats missing ratings as negatives
v Pearson correlation coefficient
  § sim(x, y) = ((r_x − r̄_x) · (r_y − r̄_y)) / (||r_x − r̄_x|| ||r_y − r̄_y||)
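A sketch of the three measures on the slide's example vectors (0 encodes a missing rating). The Pearson variant below centers over co-rated items only, which is one common convention; the slide's formula leaves this choice open.

```python
import math

def jaccard(rx, ry):
    """Treat the rated positions as sets; ignores rating values."""
    sx = {i for i, r in enumerate(rx) if r}
    sy = {i for i, r in enumerate(ry) if r}
    return len(sx & sy) / len(sx | sy)

def cosine(rx, ry):
    """Treat the vectors as points; missing (0) acts like a low rating."""
    dot = sum(a * b for a, b in zip(rx, ry))
    nx = math.sqrt(sum(a * a for a in rx))
    ny = math.sqrt(sum(b * b for b in ry))
    return dot / (nx * ny)

def pearson(rx, ry):
    """Center by each user's mean over co-rated items, then cosine."""
    common = [i for i in range(len(rx)) if rx[i] and ry[i]]
    mx = sum(rx[i] for i in common) / len(common)
    my = sum(ry[i] for i in common) / len(common)
    num = sum((rx[i] - mx) * (ry[i] - my) for i in common)
    den = (math.sqrt(sum((rx[i] - mx) ** 2 for i in common))
           * math.sqrt(sum((ry[i] - my) ** 2 for i in common)))
    return num / den if den else 0.0

rx = [1, 0, 0, 1, 3]   # *  _  _  *  ***
ry = [1, 0, 2, 2, 0]   # *  _  ** **  _
print(jaccard(rx, ry))  # 2/4 = 0.5
print(cosine(rx, ry))   # ~0.30
```

Note the cosine value is low even though both users agree on item 1: the zeros drag it down, which is exactly the "missing as negative" problem the slide flags.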
Similarity Metric
v Intuitively we want: sim(A, B) > sim(A, C)
v Jaccard similarity: 1/5 < 2/4 (fails)
v Cosine similarity: 0.386 > 0.322 (works)
  § But it considers missing ratings as “negative”
  § Solution: subtract the (row) mean before computing cosine
v Notice: cosine similarity is correlation when the data is centered at 0
User-User Collaborative Filtering
§ For user u, find other similar users
§ Estimate the rating for item i based on ratings from similar users:

  pred(u, i) = Σ_{n ∈ neighbors(u)} sim(u, n) · r_ni / Σ_{n ∈ neighbors(u)} sim(u, n)

  sim(u, n) … similarity of users u and n
  r_ni … rating of user n on item i
  neighbors(u) … set of users similar to user u
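The prediction formula above is a similarity-weighted average, which fits in a few lines. The neighbor names and numbers below are hypothetical.

```python
def predict_user_user(sims, neighbor_ratings):
    """sims[n] = sim(u, n); neighbor_ratings[n] = r_ni for each neighbor n
    of user u that has rated item i. Returns pred(u, i)."""
    num = sum(sims[n] * r for n, r in neighbor_ratings.items())
    den = sum(sims[n] for n in neighbor_ratings)
    return num / den

# Hypothetical neighbors of u with their similarities and ratings of item i
sims = {"n1": 0.9, "n2": 0.5}
ratings_of_i = {"n1": 4.0, "n2": 2.0}
print(predict_user_user(sims, ratings_of_i))  # (0.9*4 + 0.5*2) / 1.4 ≈ 3.29
```

The division by the summed similarities keeps the prediction on the rating scale; the more similar neighbor (n1) pulls the estimate toward its rating of 4.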
Item-Item Collaborative Filtering
v So far: user-user collaborative filtering
v Another view: item-item
  § For item i, find other similar items
  § Estimate the rating for item i based on ratings for similar items
  § Can use the same similarity metrics and prediction functions as in the user-user model:

    r_xi = Σ_{j ∈ N(i; x)} s_ij · r_xj / Σ_{j ∈ N(i; x)} s_ij

    s_ij … similarity of items i and j
    r_xj … rating of user x on item j
    N(i; x) … set of items similar to i that were rated by x
Item-Item CF (|N|=2)

users:    1  2  3  4  5  6  7  8  9 10 11 12
movie 1:  1  .  3  .  .  5  .  .  5  .  4  .
movie 2:  .  .  5  4  .  .  4  .  .  2  1  3
movie 3:  2  4  .  1  2  .  3  .  4  3  5  .
movie 4:  .  2  4  .  5  .  .  4  .  .  2  .
movie 5:  .  .  4  3  4  2  .  .  .  .  2  5
movie 6:  1  .  3  .  3  .  .  2  .  .  4  .

. – unknown rating; numbers – ratings between 1 and 5
Item-Item CF (|N|=2)

users:    1  2  3  4  5  6  7  8  9 10 11 12
movie 1:  1  .  3  .  ?  5  .  .  5  .  4  .
movie 2:  .  .  5  4  .  .  4  .  .  2  1  3
movie 3:  2  4  .  1  2  .  3  .  4  3  5  .
movie 4:  .  2  4  .  5  .  .  4  .  .  2  .
movie 5:  .  .  4  3  4  2  .  .  .  .  2  5
movie 6:  1  .  3  .  3  .  .  2  .  .  4  .

? – estimate the rating of movie 1 by user 5
Item-Item CF (|N|=2)

users:    1  2  3  4  5  6  7  8  9 10 11 12   sim(1, m)
movie 1:  1  .  3  .  ?  5  .  .  5  .  4  .     1.00
movie 2:  .  .  5  4  .  .  4  .  .  2  1  3    -0.18
movie 3:  2  4  .  1  2  .  3  .  4  3  5  .     0.41
movie 4:  .  2  4  .  5  .  .  4  .  .  2  .    -0.10
movie 5:  .  .  4  3  4  2  .  .  .  .  2  5    -0.31
movie 6:  1  .  3  .  3  .  .  2  .  .  4  .     0.59

Here we use Pearson correlation as similarity:
1) Subtract the mean rating m_i from each movie i
   m_1 = (1+3+5+5+4)/5 = 3.6
   row 1: [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows
Neighbor selection: identify the movies most similar to movie 1 that were rated by user 5
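The worked example on this slide can be reproduced end to end: center each movie's ratings, compute cosine (Pearson) similarities to movie 1, keep the |N|=2 most similar movies rated by user 5, and take the similarity-weighted average of their raw ratings.

```python
import numpy as np

R = np.array([   # rows = movies 1..6, cols = users 1..12, 0 = unknown
    [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],
    [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],
    [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],
    [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],
    [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],
    [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],
], dtype=float)

def centered(row):
    """Subtract the movie's mean from its known ratings; leave unknowns at 0."""
    rated = row > 0
    c = np.zeros_like(row)
    c[rated] = row[rated] - row[rated].mean()
    return c

target, user = 0, 4            # movie 1, user 5 (0-indexed)
c1 = centered(R[target])
sims = np.array([
    c1 @ centered(R[m]) / (np.linalg.norm(c1) * np.linalg.norm(centered(R[m])))
    for m in range(len(R))
])

# Candidate neighbors: movies rated by user 5, excluding movie 1 itself
cands = [m for m in range(len(R)) if m != target and R[m, user] > 0]
top2 = sorted(cands, key=lambda m: sims[m], reverse=True)[:2]   # movies 6 and 3
pred = sum(sims[m] * R[m, user] for m in top2) / sum(sims[m] for m in top2)

print(np.round(sims, 2))  # sims ≈ [1.00, -0.18, 0.41, -0.10, -0.31, 0.59]
print(round(pred, 2))     # ≈ 2.59, i.e. (0.41*2 + 0.59*3) / (0.41 + 0.59)
```

Note that the *centered* ratings are used only for the similarities; the final weighted average is taken over the raw ratings, matching the prediction formula from the previous slide.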