Welcome to DS504/CS586: Big Data Analytics
Recommender Systems
Prof. Yanhua Li
Time: 6:00pm – 8:50pm Thu.
Location: AK 232
Fall 2016
Example: Recommender Systems
v Customer X
  § Star War I
  § Star War II
v Customer Y
  § Does search on Star War I
  § Recommender system suggests Star War II from data collected about customer X
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Recommendations
[Diagram: Items (products, web sites, blogs, news items, …) reach users via Search and via Recommendations]
From Scarcity to Abundance
v Shelf space is a scarce commodity for traditional retailers
  § Also: TV networks, movie theaters, …
v Web enables near-zero-cost dissemination of information about products
  § From scarcity to abundance, e.g., Amazon, Target online, eBay, etc.
v More choices necessitate better filters
  § Recommendation engines
Types of Recommendations
v Editorial and hand curated
  § List of favorites
  § Lists of “essential” items
v Simple aggregates
  § Top 10, Most Popular, Recent Uploads
v Tailored to individual users
  § Amazon, Netflix, …
Formal Model
v X = set of Customers
v S = set of Items
v Utility function u : X × S → R
  § R = set of ratings
  § R is a totally ordered set
  § e.g., 0-5 stars, real number in [0,1]
Utility Matrix

         Avatar  LOTR  Matrix  Pirates
Alice      1             0.2
Bob               0.5             0.3
Carol     0.2             1
David                             0.4
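The utility matrix above can be sketched as a NumPy array, with `np.nan` marking the unknown ratings. The assignment of each rating to a movie column follows the table's layout, which is an assumption about the original slide.

```python
import numpy as np

movies = ["Avatar", "LOTR", "Matrix", "Pirates"]
users = ["Alice", "Bob", "Carol", "David"]
_ = np.nan  # unknown rating

# Utility matrix U: rows = users, cols = movies (layout assumed from the slide)
U = np.array([
    [1.0, _,   0.2, _  ],   # Alice
    [_,   0.5, _,   0.3],   # Bob
    [0.2, _,   1.0, _  ],   # Carol
    [_,   _,   _,   0.4],   # David
])

# The matrix is sparse: only a fraction of entries are known
known = ~np.isnan(U)
print(known.sum(), "known of", U.size, "entries")
```

Representing missing ratings explicitly (rather than as zeros) matters later: several similarity measures behave badly when "missing" is conflated with "rated zero".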
Key Problems
v (1) Gathering “known” ratings for the matrix
  § How to collect the data in the utility matrix
v (2) Estimating unknown ratings from the known ones
  § Mainly interested in high unknown ratings
  • We are not interested in knowing what you don’t like but what you like
v (3) Evaluating estimation methods
  § How to measure success/performance of recommendation methods
(1) Gathering Ratings
v Explicit
  § Ask people to rate items
  § Doesn’t work well in practice – people can’t be bothered
v Implicit
  § Learn ratings from user actions
  • E.g., purchase implies high rating
  § What about low ratings?
(2) Estimating Utilities
v Key problem: utility matrix U is sparse
  § Most people have not rated most items
  § Cold start:
  • New items have no ratings
  • New users have no history
v Approaches to recommender systems:
  § 1) Content-based
  § 2) Collaborative filtering
Content-based Recommender Systems
Content-based Recommendations
v Main idea: Recommend items to customer x similar to previous items rated highly by x
  § Look at x’s items vs. all items
Examples:
v Movie recommendations
  § Recommend movies with same actor(s), director, genre, …
v Websites, blogs, news
  § Recommend other sites with “similar” content
Plan of Action
[Diagram: from the items the user likes, build item profiles; match them against the user profile; recommend matching items]
Item Profiles
v For each item, create an item profile
v Profile is a set (vector) of features
  § Movies: author, title, actor, director, …
  § Text: set of “important” words in document
v How to pick important features?
  § Usual heuristic from text mining is TF-IDF (Term Frequency × Inverse Document Frequency)
  • Term … Feature
  • Document … Item
Sidenote: TF-IDF
f_ij = frequency of term (feature) i in doc j
TF_ij = f_ij / max_k f_kj
  (Note: we normalize TF by the frequency of the most frequent term, to discount for “longer” documents)
n_i = number of docs that mention term i
N = total number of docs
IDF_i = log(N / n_i)
TF-IDF score: w_ij = TF_ij × IDF_i
Doc profile = set of words with highest TF-IDF scores, together with their scores:
  w_j = (w_1j, …, w_ij, …, w_kj)
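The TF-IDF definitions above can be sketched in a few lines. This follows the slide's normalization (TF divided by the most frequent term in the document); the log base is an assumption, since the slide leaves it unspecified, and the toy documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: tf-idf weight} per doc."""
    N = len(docs)
    # n_i = number of docs that mention term i (count each doc once)
    df = Counter(term for doc in docs for term in set(doc))
    profiles = []
    for doc in docs:
        f = Counter(doc)
        max_f = max(f.values())          # frequency of the most frequent term
        profiles.append({
            t: (cnt / max_f) * math.log2(N / df[t])   # w_ij = TF_ij * IDF_i
            for t, cnt in f.items()
        })
    return profiles

docs = [["star", "wars", "star"], ["star", "trek"], ["the", "matrix"]]
w = tf_idf(docs)
# "star" appears in 2 of 3 docs -> IDF = log2(3/2); in doc 0 its TF = 2/2 = 1
print(w[0]["star"])  # ≈ 0.585
```

The document profile would then keep only the highest-weighted terms per document, as the slide describes.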
User Profiles and Prediction
v User profile possibilities:
  § Weighted average of rated item profiles
  § Variation: weight by the difference from the average rating:
    w_x = Σ_{j=1..N_x} (r_xj − r̄_x) · w_j
v Prediction heuristic:
  § Given user profile w_x and item profile w_j, estimate
    r̂_xj = cos(w_x, w_j) = (w_x · w_j) / (||w_x|| ||w_j||)
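The two steps above can be sketched as follows: build the user profile from rating-weighted item profiles (using the mean-centered variation), then score a candidate item by cosine similarity. The 3-feature item vectors are hypothetical, standing in for, e.g., genre indicators.

```python
import numpy as np

def build_profile(item_profiles, ratings):
    """item_profiles: (n_items, n_features); ratings: length-n_items array.
    Weight each item profile by (rating - user's mean rating), then average."""
    centered = ratings - ratings.mean()
    return centered @ item_profiles / len(ratings)

def predict(user_profile, item_profile):
    """Prediction heuristic: cosine of the angle between the two profiles."""
    return (user_profile @ item_profile) / (
        np.linalg.norm(user_profile) * np.linalg.norm(item_profile))

# Hypothetical item profiles with 3 content features
rated_items = np.array([[1.0, 0.0, 1.0],
                        [1.0, 1.0, 0.0]])
ratings = np.array([5.0, 1.0])          # user loved item 0, disliked item 1
w_x = build_profile(rated_items, ratings)

candidate = np.array([1.0, 0.0, 1.0])   # shares features with the loved item
print(predict(w_x, candidate))          # positive: candidate matches the taste
```

Note how mean-centering makes the disliked item push its features *negatively* into the profile, which is the point of the weighting variation.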
Pros: Content-based Approach
v +: No need for data on other users
v +: Able to recommend to users with unique tastes
v +: Able to recommend new & unpopular items
  § No item cold-start
v +: Able to provide explanations
  § Can explain recommendations by listing the content features that caused an item to be recommended
Cons: Content-based Approach
v –: Finding the appropriate features is hard
  § E.g., images, movies, music
v –: Recommendations for new users
  § How to build a user profile?
  § User cold-start problem
v –: Overspecialization
  § Never recommends items outside the user’s content profile
  § People might have multiple interests
  § Unable to exploit quality judgments of other users
Collaborative Filtering Harnessing quality judgments of other users
Collaborative Filtering
v Consider user x
v Find set N of other users whose ratings are “similar” to x’s ratings
v Estimate x’s ratings based on the ratings of users in N
Finding “Similar” Users
v Let r_x be the vector of user x’s ratings
  § Example: r_x = [*, _, _, *, ***], r_y = [*, _, **, **, _]
v Jaccard similarity measure
  § Treats r_x, r_y as sets: r_x = {1, 4, 5}, r_y = {1, 3, 4}
  § sim(x, y) = |r_x ∩ r_y| / |r_x ∪ r_y|
  § Problem: ignores the values of the ratings
v Cosine similarity measure
  § Treats r_x, r_y as points: r_x = [1, 0, 0, 1, 3], r_y = [1, 0, 2, 2, 0]
  § sim(x, y) = cos(r_x, r_y) = (r_x · r_y) / (||r_x|| ||r_y||)
  § Problem: treats missing ratings as negatives
v Pearson correlation coefficient
  § sim(x, y) = ((r_x − r̄_x) · (r_y − r̄_y)) / (||r_x − r̄_x|| ||r_y − r̄_y||)
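A sketch of the three measures on the slide's example vectors (0 encodes a missing rating). The Pearson variant below centers over co-rated items only, which is one common convention; the slide's formula leaves this choice open.

```python
import math

def jaccard(rx, ry):
    """Treat the rated positions as sets; ignores rating values."""
    sx = {i for i, r in enumerate(rx) if r}
    sy = {i for i, r in enumerate(ry) if r}
    return len(sx & sy) / len(sx | sy)

def cosine(rx, ry):
    """Treat the vectors as points; missing (0) acts like a low rating."""
    dot = sum(a * b for a, b in zip(rx, ry))
    nx = math.sqrt(sum(a * a for a in rx))
    ny = math.sqrt(sum(b * b for b in ry))
    return dot / (nx * ny)

def pearson(rx, ry):
    """Center by each user's mean over co-rated items, then cosine."""
    common = [i for i in range(len(rx)) if rx[i] and ry[i]]
    mx = sum(rx[i] for i in common) / len(common)
    my = sum(ry[i] for i in common) / len(common)
    num = sum((rx[i] - mx) * (ry[i] - my) for i in common)
    den = (math.sqrt(sum((rx[i] - mx) ** 2 for i in common))
           * math.sqrt(sum((ry[i] - my) ** 2 for i in common)))
    return num / den if den else 0.0

rx = [1, 0, 0, 1, 3]   # *  _  _  *  ***
ry = [1, 0, 2, 2, 0]   # *  _  ** **  _
print(jaccard(rx, ry))  # 2/4 = 0.5
print(cosine(rx, ry))   # ~0.30
```

Note the cosine value is low even though both users agree on item 1: the zeros drag it down, which is exactly the "missing as negative" problem the slide flags.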
Similarity Metric
v Intuitively we want: sim(A, B) > sim(A, C)
v Jaccard similarity: 1/5 < 2/4 (fails)
v Cosine similarity: 0.386 > 0.322 (works)
  § But it considers missing ratings as “negative”
  § Solution: subtract the (row) mean before computing cosine
v Notice: cosine similarity is correlation when the data is centered at 0
User-User Collaborative Filtering
§ For user u, find other similar users
§ Estimate the rating for item i based on ratings from similar users:

  pred(u, i) = Σ_{n ∈ neighbors(u)} sim(u, n) · r_ni / Σ_{n ∈ neighbors(u)} sim(u, n)

  sim(u, n) … similarity of users u and n
  r_ni … rating of user n on item i
  neighbors(u) … set of users similar to user u
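The prediction formula above is a similarity-weighted average, which fits in a few lines. The neighbor names and numbers below are hypothetical.

```python
def predict_user_user(sims, neighbor_ratings):
    """sims[n] = sim(u, n); neighbor_ratings[n] = r_ni for each neighbor n
    of user u that has rated item i. Returns pred(u, i)."""
    num = sum(sims[n] * r for n, r in neighbor_ratings.items())
    den = sum(sims[n] for n in neighbor_ratings)
    return num / den

# Hypothetical neighbors of u with their similarities and ratings of item i
sims = {"n1": 0.9, "n2": 0.5}
ratings_of_i = {"n1": 4.0, "n2": 2.0}
print(predict_user_user(sims, ratings_of_i))  # (0.9*4 + 0.5*2) / 1.4 ≈ 3.29
```

The division by the summed similarities keeps the prediction on the rating scale; the more similar neighbor (n1) pulls the estimate toward its rating of 4.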
Item-Item Collaborative Filtering
v So far: user-user collaborative filtering
v Another view: item-item
  § For item i, find other similar items
  § Estimate the rating for item i based on ratings for similar items
  § Can use the same similarity metrics and prediction functions as in the user-user model:

    r_xi = Σ_{j ∈ N(i; x)} s_ij · r_xj / Σ_{j ∈ N(i; x)} s_ij

    s_ij … similarity of items i and j
    r_xj … rating of user x on item j
    N(i; x) … set of items similar to i that were rated by x
Item-Item CF (|N|=2)

users:    1  2  3  4  5  6  7  8  9 10 11 12
movie 1:  1  .  3  .  .  5  .  .  5  .  4  .
movie 2:  .  .  5  4  .  .  4  .  .  2  1  3
movie 3:  2  4  .  1  2  .  3  .  4  3  5  .
movie 4:  .  2  4  .  5  .  .  4  .  .  2  .
movie 5:  .  .  4  3  4  2  .  .  .  .  2  5
movie 6:  1  .  3  .  3  .  .  2  .  .  4  .

. – unknown rating; numbers – ratings between 1 and 5
Item-Item CF (|N|=2)

users:    1  2  3  4  5  6  7  8  9 10 11 12
movie 1:  1  .  3  .  ?  5  .  .  5  .  4  .
movie 2:  .  .  5  4  .  .  4  .  .  2  1  3
movie 3:  2  4  .  1  2  .  3  .  4  3  5  .
movie 4:  .  2  4  .  5  .  .  4  .  .  2  .
movie 5:  .  .  4  3  4  2  .  .  .  .  2  5
movie 6:  1  .  3  .  3  .  .  2  .  .  4  .

? – estimate the rating of movie 1 by user 5
Item-Item CF (|N|=2)

users:    1  2  3  4  5  6  7  8  9 10 11 12   sim(1, m)
movie 1:  1  .  3  .  ?  5  .  .  5  .  4  .     1.00
movie 2:  .  .  5  4  .  .  4  .  .  2  1  3    -0.18
movie 3:  2  4  .  1  2  .  3  .  4  3  5  .     0.41
movie 4:  .  2  4  .  5  .  .  4  .  .  2  .    -0.10
movie 5:  .  .  4  3  4  2  .  .  .  .  2  5    -0.31
movie 6:  1  .  3  .  3  .  .  2  .  .  4  .     0.59

Here we use Pearson correlation as similarity:
1) Subtract the mean rating m_i from each movie i
   m_1 = (1+3+5+5+4)/5 = 3.6
   row 1: [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows
Neighbor selection: identify the movies most similar to movie 1 that were rated by user 5
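The worked example on this slide can be reproduced end to end: center each movie's ratings, compute cosine (Pearson) similarities to movie 1, keep the |N|=2 most similar movies rated by user 5, and take the similarity-weighted average of their raw ratings.

```python
import numpy as np

R = np.array([   # rows = movies 1..6, cols = users 1..12, 0 = unknown
    [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],
    [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],
    [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],
    [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],
    [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],
    [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],
], dtype=float)

def centered(row):
    """Subtract the movie's mean from its known ratings; leave unknowns at 0."""
    rated = row > 0
    c = np.zeros_like(row)
    c[rated] = row[rated] - row[rated].mean()
    return c

target, user = 0, 4            # movie 1, user 5 (0-indexed)
c1 = centered(R[target])
sims = np.array([
    c1 @ centered(R[m]) / (np.linalg.norm(c1) * np.linalg.norm(centered(R[m])))
    for m in range(len(R))
])

# Candidate neighbors: movies rated by user 5, excluding movie 1 itself
cands = [m for m in range(len(R)) if m != target and R[m, user] > 0]
top2 = sorted(cands, key=lambda m: sims[m], reverse=True)[:2]   # movies 6 and 3
pred = sum(sims[m] * R[m, user] for m in top2) / sum(sims[m] for m in top2)

print(np.round(sims, 2))  # sims ≈ [1.00, -0.18, 0.41, -0.10, -0.31, 0.59]
print(round(pred, 2))     # ≈ 2.59, i.e. (0.41*2 + 0.59*3) / (0.41 + 0.59)
```

Note that the *centered* ratings are used only for the similarities; the final weighted average is taken over the raw ratings, matching the prediction formula from the previous slide.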