
CS535 Big Data 4/13/2020 Week 12-A Sangmi Lee Pallickara http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, page 1

CS535 BIG DATA

PART B. GEAR SESSIONS

SESSION 4: LARGE SCALE RECOMMENDATION SYSTEMS AND SOCIAL MEDIA

Sangmi Lee Pallickara Computer Science, Colorado State University http://www.cs.colostate.edu/~cs535

CS535 Big Data | Computer Science | Colorado State University

FAQs

  • Wednesday (4/15) is the GEAR Session IV presentation
  • Discussion will be available on 4/15, 16, and 17
  • Watch video clips on Canvas → Assignments → Echo360


Topics of Today's Class

  • Part 1: Collaborative Filtering with the case study of Item-to-Item CF
  • Part 2: Collaborative Filtering with the case study of Latent Factor CF
  • Part 3: Evaluating Recommendation Systems


GEAR Session 4. Large Scale Recommendation Systems and Social Media

Lecture 2. Large Scale Recommendation Systems

Amazon.com : Item-to-item collaborative filtering


  • Amazon.com uses recommendations as a targeted marketing tool
      • Email campaigns
      • Most of their web pages


Recommendation System

  • Finds a set of customers whose purchased and rated items overlap the user’s purchased and rated items
  • Eliminates items the user has already purchased (or rated)
  • Recommends the remaining items to the user


What if they use a Traditional CF [1/4]

  • Build a utility matrix
      • Each user is an N-dimensional vector of item ratings
      • N is the number of distinct catalog items
      • Components are positive for purchased or positively rated items
      • Components are negative for negatively rated items
  • To compensate for best-selling items
      • Multiply the vector components by the inverse frequency (the inverse of the number of customers who have purchased or rated the item)
      • This makes less well-known items more relevant


What if they use a Traditional CF [2/4]

  • Find out similar users
  • Cosine similarity between the vectors
  • E.g. user A and B
  • Cosine_Similarity(A, B) = cos(A, B) = $\frac{A \cdot B}{\|A\| \, \|B\|}$

  • Select items within the group of items purchased by the similar users
  • E.g. Rank each item according to how many similar customers purchased it
  • Highly ranked item(s) will be recommended
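
The user-to-user comparison above can be sketched in plain Scala. This is a minimal illustration, not the course's or Amazon's implementation: the object name, the three user vectors, and the five-item catalog are all hypothetical, and the vectors are dense for readability even though real utility-matrix rows are sparse.

```scala
object CosineSimilarityExample {
  // Cosine similarity between two equal-length rating vectors.
  // Returns 0.0 when either vector is all zeros, to avoid dividing by zero.
  def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  // Hypothetical utility-matrix rows over a 5-item catalog
  // (positive = purchased or liked, negative = disliked, 0 = no data).
  val userA = Array(1.0, 0.0, 1.0, 0.0, -1.0)
  val userB = Array(1.0, 0.0, 1.0, 0.0, 0.0)
  val userC = Array(0.0, 1.0, 0.0, 1.0, 0.0)

  def main(args: Array[String]): Unit = {
    println(f"cos(A, B) = ${cosineSimilarity(userA, userB)}%.3f")
    println(f"cos(A, C) = ${cosineSimilarity(userA, userC)}%.3f")
  }
}
```

Users A and B share two liked items, so their similarity is high; A and C share none, so their similarity is zero and C's purchases would not influence recommendations for A.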


What if they use a Traditional CF [3/4]

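  • For N items (in the catalog) and M users
  • Worst case: O(MN)
      • The algorithm examines M customers and up to N items for each customer
  • The average customer vector is extremely sparse, so performance is closer to O(M+N)
      • Scanning every customer is approximately O(M), because almost all customer vectors contain few items
      • A few customers have purchased or rated a significant percentage of the catalog, and each of them requires O(N) processing
  • Therefore, the final performance of the algorithm is approximately O(M+N)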


What if they use a Traditional CF [4/4]

  • Dimensionality reduction
      • Reducing M by randomly sampling customers or discarding customers with few purchases
      • Reducing N by discarding very popular or unpopular items
  • What problems do these approaches introduce?

  • Disadvantages
      • Sampling or discarding customers makes it harder to capture the similarity between users
      • Item-space partitioning restricts recommendations to a specific product or subject area
      • If the algorithm discards the most popular or unpopular items, they will never appear as recommendations


Item-to-item collaborative filtering

  • It does NOT match the user to similar customers
  • Item-to-item collaborative filtering
  • Matches each of the user’s purchased and rated items to similar items
  • Combines those similar items into a recommendation list


Determining the most-similar match

  • The algorithm builds a similar-items table
      • By finding items that customers tend to purchase together
  • How about building a product-to-product matrix by iterating through all item pairs and computing a similarity metric for each pair?
      • Many product pairs have no common customer
      • If you already bought a TV today, will you buy another TV again today?


Determining the most-similar match

  • Calculating the similarity between a single product and all related products
      • This is not similarity in the everyday sense of items resembling each other
      • It is based on items that co-occur in a client’s purchase history
      • E.g., if client A has bought a headset X and a lawn mower Y, then X and Y are considered “similar” items in this context
  • How to build a similar-items matrix

    For each item in product catalog, I1
      For each customer C who purchased I1
        For each item I2 purchased by customer C
          Record that a customer purchased I1 and I2
      For each item I2
        Compute the similarity between I1 and I2


Part 1: tracking co-occurrence items [1/3]

Co-occurrence matrix over items I0 to I6 (7 × 7), initially empty

Purchase records
  UA = { I1, I3, I4 }
  UB = { I2, I3, I4 }
  UC = { I2 }
  UD = { I0, I5, I6 }
  UE = { I1, I3 }
  UF = { I0, I3, I5 }
  UG = { I5, I6 }


Part 1: tracking co-occurrence items [2/3]

Co-occurrence matrix after recording the purchases of user UA = { I1, I3, I4 } (. = empty cell)

      I0  I1  I2  I3  I4  I5  I6
  I0   .   .   .   .   .   .   .
  I1   .   .   .   1   1   .   .
  I2   .   .   .   .   .   .   .
  I3   .   1   .   .   1   .   .
  I4   .   1   .   1   .   .   .
  I5   .   .   .   .   .   .   .
  I6   .   .   .   .   .   .   .


Part 1: tracking co-occurrence items [3/3]

After recording the purchases of all users UA to UG (. = empty cell)

      I0  I1  I2  I3  I4  I5  I6
  I0   .   .   .   1   .   2   1
  I1   .   .   .   2   1   .   .
  I2   .   .   .   1   1   .   .
  I3   1   2   1   .   2   1   .
  I4   .   1   1   2   .   .   .
  I5   2   .   .   1   .   .   2
  I6   1   .   .   .   .   2   .

Co-occurrence matrix
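
The counting step can be reproduced with a short plain-Scala sketch. It uses the seven purchase records from the slides; the object and function names are hypothetical. As a simplification, it loops over each customer's purchases rather than over catalog items, which records the same pairs as the item-first loop in the slides' pseudocode.

```scala
object CoOccurrenceExample {
  // Purchase records from the slides: user -> set of item indices (I0..I6).
  val purchases: Map[String, Set[Int]] = Map(
    "UA" -> Set(1, 3, 4),
    "UB" -> Set(2, 3, 4),
    "UC" -> Set(2),
    "UD" -> Set(0, 5, 6),
    "UE" -> Set(1, 3),
    "UF" -> Set(0, 3, 5),
    "UG" -> Set(5, 6)
  )

  // counts(i)(j) = number of customers who purchased both item i and item j.
  def coOccurrence(records: Iterable[Set[Int]], numItems: Int): Array[Array[Int]] = {
    val counts = Array.fill(numItems, numItems)(0)
    for (basket <- records; i <- basket; j <- basket if i != j) {
      counts(i)(j) += 1
    }
    counts
  }

  def main(args: Array[String]): Unit = {
    val matrix = coOccurrence(purchases.values, 7)
    matrix.zipWithIndex.foreach { case (row, i) =>
      println(s"I$i: ${row.mkString(" ")}")
    }
  }
}
```

The matrix is symmetric, and the work per customer grows with the square of that customer's purchase count, which is why customers with very large purchase histories dominate the cost.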


Part 2: Computing similarity between items

  • Using cosine measure
  • Each vector corresponds to an item
  • Item A and B (rather than customers)
  • Cosine_Similarity(A, B) = cos(A, B) = $\frac{A \cdot B}{\|A\| \, \|B\|}$
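
As a worked example, assume each item is represented as a binary vector over the seven users UA to UG from the purchase records (1 if the user bought the item). That representation is an assumption made here for illustration; the slides only state that each vector corresponds to an item. Item I1 was bought by UA and UE, and item I3 by UA, UB, UE, and UF:

```latex
\begin{aligned}
I_1 &= (1, 0, 0, 0, 1, 0, 0), \qquad I_3 = (1, 1, 0, 0, 1, 1, 0) \\
I_1 \cdot I_3 &= 2 \quad \text{(the shared customers UA and UE)} \\
\|I_1\| &= \sqrt{2}, \qquad \|I_3\| = \sqrt{4} = 2 \\
\cos(I_1, I_3) &= \frac{2}{\sqrt{2} \cdot 2} = \frac{1}{\sqrt{2}} \approx 0.707
\end{aligned}
```

Under this representation the numerator is exactly the co-occurrence count of the two items, and the denominator discounts items that many customers have bought.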


Creating a similar-item table

  • Generating the co-occurrence table is extremely compute-intensive
      • O(N²M) in the worst case
      • Where N is the number of items and M is the number of users
  • The average case is closer to O(NM)
      • Most customers have very few purchases
  • Sampling customers who purchase best-selling titles reduces runtime even more
      • With little reduction in quality
  • Offline computation


Generating the final recommendation

  • Using a similar-items table
  • The algorithm finds items similar to each of the user’s purchases and ratings and aggregates those items

  • Recommends the most popular or correlated items
  • Quick, depending only on the number of items the user purchased or rated
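
A minimal sketch of this online lookup, in plain Scala. The similar-items table, its scores, and all names are hypothetical, and ranking candidates by summed similarity is one possible aggregation chosen for illustration; the slides do not specify the exact ranking rule.

```scala
object SimilarItemsLookupExample {
  // Hypothetical precomputed similar-items table: item -> (similar item, similarity score).
  val similarItems: Map[String, Seq[(String, Double)]] = Map(
    "headset"  -> Seq("microphone" -> 0.8, "webcam" -> 0.6),
    "keyboard" -> Seq("mouse" -> 0.9, "webcam" -> 0.5, "headset" -> 0.3)
  )

  // Aggregate the neighbors of every item the user already has, drop items the
  // user owns, and return the topN candidates by summed similarity.
  def recommend(owned: Set[String], topN: Int): Seq[(String, Double)] = {
    owned.toSeq
      .flatMap(item => similarItems.getOrElse(item, Seq.empty))
      .filterNot { case (candidate, _) => owned.contains(candidate) }
      .groupBy { case (candidate, _) => candidate }
      .map { case (candidate, scored) => candidate -> scored.map(_._2).sum }
      .toSeq
      .sortBy { case (_, score) => -score }
      .take(topN)
  }
}
```

For a user who owns the headset and the keyboard, `recommend(Set("headset", "keyboard"), 2)` returns webcam first and mouse second. The cost depends only on the number of owned items and the length of their neighbor lists, not on the catalog size.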


Scalability

  • Amazon.com has around 606 million catalog items
  • Traditional collaborative filtering does little or no offline computation
  • Online computation scales with the number of customers and catalog items.


Key scalability strategy for Amazon recommendations

  • Creating the expensive similar-items table offline
  • Online component
  • Looking up similar items for the user’s purchases and ratings
  • Scales independently of the catalog size or the total number of customers
  • It is dependent only on how many titles the user has purchased or rated


Recommendation quality

  • The algorithm recommends highly correlated similar items
  • Recommendation quality is excellent
  • Algorithm performs well with limited user data


GEAR Session 4. Large Scale Recommendation Systems and Social Media

Lecture 2. Large Scale Recommendation Systems Recommendation Systems

Recommending Music and the Audioscrobbler Dataset


Dataset

  • Audioscrobbler dataset
  • 2002, Richard Jones
  • Collecting and analyzing user’s songs to generate recommendation
  • Started with support for Winamp and XMMS
  • iTunes, Winamp, Windows Media Player, Foobar, iPod, Amarok, Rhythmbox, mpd, Xbox media center, Slimserver, Jinzora, mpg321, Muine, Rhapsody, YME, Soundbridge, VLC…


Dataset

  • Confined rating system
      • “Bob rates Coldplay 3.5 stars.”
      • Users rate music far less frequently than they play music
  • Audioscrobbler dataset
      • “Bob played a Coldplay track”
      • Each individual data point carries less information
  • Implicit feedback
      • User-artist connections are implied as a side effect of other actions


Dataset

  • 141,000 unique users
  • 1.6 million unique artists
  • 24.2 million plays of artists by users are recorded
      • user_artist_data.txt
      • http://www-etud.iro.umontreal.ca/~bergstrj/audioscrobbler_data.html
  • On average, each user has played songs from about 171 artists (out of 1.6 million)
  • Extremely sparse dataset


Netflix Prize

  • The Netflix Prize challenge concerned recommender systems for movies (October 2006)
  • Netflix released a training set consisting of data from almost 500,000 customers and their ratings on 18,000 movies
      • More than 100 million ratings
  • The task was to use these data to build a model to predict ratings for a hold-out set of 3 million ratings


GEAR Session 4. Large Scale Recommendation Systems and Social Media

Lecture 2. Large Scale Recommendation Systems Recommendation Systems

Collaborative Filtering: Latent Factor Model


Collaborative filtering [1/2]

  • Collects and analyzes a large amount of information on users’ behaviors, activities, or preferences and predicts what users will like based on their similarity to other users
  • Explicit data collection
      • Rate an item
      • Search history
      • Favorite item
      • Wish list
  • Implicit data collection
      • Viewing times
      • Tracking online purchases
      • Analyzing the user’s social network


Collaborative filtering [2/2]

  • Two users may share similar tastes because they are the same age
      • This is NOT an example of collaborative filtering
  • Two users may both like the same song because they play many of the same other songs
      • This IS an example of collaborative filtering
  • Collaborative filtering learns without access to user or artist attributes


Latent-Factor model

  • Tries to explain observed interactions between large numbers of users and products through a relatively small number of unobserved, underlying reasons
  • Within the music business context,
      • Explains why millions of people buy a particular few of thousands of possible albums by describing users and albums in terms of tastes for tens of genres, tastes that are not directly observable


Simplified illustration of the latent factor approach

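[Figure: movies and users placed on two latent axes. One axis runs from “geared toward males” to “geared toward females”, the other from “serious” to “escapist”. Movies: Fast and Furious, King Arthur, Ken Burns’ The Civil War, Twilight, Still Alice, Iron Man. Users: Bob, Jennifer, Tom, Nancy. The quadrants are labeled Area 1 to Area 4.]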


How do we model this?

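  • User and product data are treated as a large matrix A
      • The entry at row i and column j exists if user i has played product j
  • A is approximated by the product of two smaller matrices, X and Y^T
      • The k columns of X and Y correspond to the latent factors

[Figure: A (users × products) ≈ X (users × k) × Y^T (k × products)]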


Creating user and artist matrices

  • Two matrices
      • Matrix X for users
          • Each value corresponds to a latent feature in the model
      • Matrix Y for products
          • Each value corresponds to a latent feature in the model
  • Rows express how much users and products associate with these latent features
  • The product of X and Y^T gives a complete estimate of the entire, dense user-product interaction matrix

[Figure: X (users’ matrix, k columns) × Y^T (products’ matrix, k rows)]


Computational challenge

  • A = XY^T generally has no exact solution
      • X and Y are not large enough (their rank k is too low) to represent A perfectly
  • Goal
      • Finding the best X and Y


Alternating Least Squares (ALS)

  • Alternating least squares algorithm to compute X and Y
      • Spark MLlib’s ALS implementation
  • Step 1
      • Y is not known
          • It is initialized to a matrix with randomly chosen row vectors
      • Then simple linear algebra gives the best X, given Y and A
          • $A_i Y (Y^T Y)^{-1} = X_i$
      • Equality cannot be achieved exactly
          • The goal becomes to minimize $| A_i Y (Y^T Y)^{-1} - X_i |$
          • That is, the sum of squared differences between the two matrices’ entries


Alternating Least Squares (ALS)

  • Step 2.
  • Repeat similar sequence as step 1 to compute Y from the X (from step 1)
  • Step 3.
  • Repeat similar sequence as step 1 to compute X from the Y (from step 2)

  • X and Y do eventually converge to good (acceptable) solutions
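
The alternation can be illustrated for the simplest case, k = 1, where X and Y are plain vectors and (Y^T Y)^{-1} is a scalar division. The sketch below is a plain-Scala toy under stated assumptions: it treats every entry of A as observed and uses no regularization or implicit-feedback weighting, so it is not how Spark MLlib's ALS works internally. All names are hypothetical.

```scala
object RankOneAlsExample {
  type Matrix = Array[Array[Double]]

  private def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (p, q) => p * q }.sum

  // Best x for fixed y: x_i = (A_i . y) / (y . y), the k = 1 case of A_i Y (Y^T Y)^{-1}.
  // Assumes y is not the zero vector.
  def solveUsers(a: Matrix, y: Array[Double]): Array[Double] =
    a.map(row => dot(row, y) / dot(y, y))

  // Best y for fixed x: the same formula applied to the columns of A.
  def solveItems(a: Matrix, x: Array[Double]): Array[Double] =
    a.transpose.map(col => dot(col, x) / dot(x, x))

  // Sum of squared differences between A and the rank-1 estimate x y^T.
  def squaredError(a: Matrix, x: Array[Double], y: Array[Double]): Double =
    (for (i <- a.indices; j <- y.indices) yield {
      val diff = a(i)(j) - x(i) * y(j)
      diff * diff
    }).sum

  // Step 1 solves for x from the initial y; each iteration then repeats steps 2 and 3.
  def als(a: Matrix, initialY: Array[Double], iterations: Int): (Array[Double], Array[Double]) = {
    var y = initialY
    var x = solveUsers(a, y)
    for (_ <- 1 to iterations) {
      y = solveItems(a, x)
      x = solveUsers(a, y)
    }
    (x, y)
  }
}
```

Each half-step is an independent least-squares problem per user (or per product), which is what makes the algorithm easy to parallelize across rows.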


Alternating Least Squares (ALS)

  • Takes advantage of the sparsity of the input data
  • Easy to apply data parallelism


GEAR Session 4. Large Scale Recommendation Systems and Social Media

Lecture 2. Large Scale Recommendation Systems Recommendation Systems

Latent Factor Model: Building with Spark MLlib


Preparing the Data

  • Files are available at /user/ds/
  • Spark MLlib’s ALS implementation
  • Requires numeric IDs for users and items
  • Nonnegative 32-bit integers
  • An ID larger than Integer.MAX_VALUE cannot be used

    val rawUserArtistData = sc.textFile("hdfs:///user/ds/user_artist_data.txt")

    // Summary statistics of the user ID column and the artist ID column
    rawUserArtistData.map(_.split(' ')(0).toDouble).stats()
    rawUserArtistData.map(_.split(' ')(1).toDouble).stats()

  • Maximum user ID: 24443548
  • Maximum artist ID: 2147483647
  • No additional transformation will be needed


Extracting names

  • artist_data.txt
  • Artist ID and name separated by a tab
  • Straightforward parsing of the file into (Int, String) tuples will fail

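    val rawArtistData = sc.textFile("hdfs:///user/ds/artist_data.txt")
    val artistByID = rawArtistData.map { line =>
      // Split at the first tab: id is the text before it, name is the rest
      val (id, name) = line.span(_ != '\t')
      (id.toInt, name.trim)  // throws NumberFormatException on malformed lines
    }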


Extracting names

  • Scala’s Option class
  • Option represents a value that might only optionally exist

    val artistByID = rawArtistData.flatMap { line =>
      val (id, name) = line.span(_ != '\t')
      if (name.isEmpty) {
        None  // no tab found: skip the line
      } else {
        try {
          Some((id.toInt, name.trim))
        } catch {
          case e: NumberFormatException => None  // non-numeric ID: skip the line
        }
      }
    }


Building Model

  • Two transformations are required
  • Alias dataset should be applied to convert all artist IDs to a canonical ID
  • The data should be converted to a Rating object
  • User-product-value data

    import org.apache.spark.mllib.recommendation._

    val bArtistAlias = sc.broadcast(artistAlias)
    val trainData = rawUserArtistData.map { line =>
      val Array(userID, artistID, count) = line.split(' ').map(_.toInt)
      // Map an alias ID to its canonical artist ID when one exists
      val finalArtistID = bArtistAlias.value.getOrElse(artistID, artistID)
      Rating(userID, finalArtistID, count)
    }.cache()


cache()

  • RDD should be temporarily stored after being computed
  • ALS is iterative
  • It will typically need to access this RDD ≥ 10 times
  • Otherwise, this RDD could be repeatedly recomputed from the original data each time


Broadcast variables

  • For the case that many tasks (from different closures) need access to the same (immutable) data structure
  • Extends normal handling of task closures
      • Caching data as raw Java objects on each executor
      • Caching data across multiple jobs and stages
  • Spark will send, and hold in memory, just one copy for each executor in the cluster
      • Saves network traffic and memory


Building the ALS model

  • Constructs model as a MatrixFactorizationModel

val model = ALS.trainImplicit(trainData, 10, 5, 0.01, 1.0)


Retrieving some feature vectors

  • Array of 10 numbers

    val model = ALS.trainImplicit(trainData, 10, 5, 0.01, 1.0)
    // Print the 10-element feature vector of the first user
    model.userFeatures.mapValues(_.mkString(", ")).first()
    ...
    (4293,-0.3233030601963864, 0.31964527593541325, 0.49025505511361034,
     0.09000932568001832, 0.4429537767744912, 0.4186675713407441,
     0.8026858843673894, -0.4841300444834003, -0.12485901532338621,
     0.19795451025931002)


Spot Checking Recommendations

  • To see if the artist recommendations for a user (ID: 2093760) make any intuitive sense

    val rawArtistsForUser = rawUserArtistData.map(_.split(' ')).
      filter { case Array(user, _, _) => user.toInt == 2093760 }

    // IDs of the artists this user has already played
    val existingProducts = rawArtistsForUser.map {
      case Array(_, artist, _) => artist.toInt
    }.collect().toSet

    artistByID.filter { case (id, name) =>
      existingProducts.contains(id)
    }.values.collect().foreach(println)
    ...
    David Gray
    Blackalicious
    Jurassic
    The Saw Doctors
    Xzibit


Spot Checking Recommendations

  • To see five recommendations for this user (ID: 2093760)

    val recommendations = model.recommendProducts(2093760, 5)
    recommendations.foreach(println)
    ...
    Rating(2093760,1300642,0.02833118412903932)
    Rating(2093760,2814,0.027832682960168387)
    Rating(2093760,1037970,0.02726611004625264)
    Rating(2093760,1001819,0.02716011293509426)
    Rating(2093760,4605,0.027118271894797333)


GEAR Session 4. Large Scale Recommendation Systems and Social Media

Lecture 2. Large Scale Recommendation Systems Recommendation Systems

Evaluating Your Recommendation System


What is a “good” recommendation?

  • “a popular artist”?
  • “artists the user has listened to”?
  • “artists the user will listen to”?


Preparing data for evaluation

  • To perform a meaningful evaluation, some of the artist play data can be set aside
  • Hidden from the ALS model building process
  • The held-out data can be used as a collection of good recommendations for each user
  • Compute the recommender’s score

[Figure: the dataset divided into a larger portion for building the model and a smaller portion for testing the model]


AUC metric

  • An AUC of 1.0 is perfect; 0.0 is the worst
  • Receiver Operating Characteristic (ROC)
      • Based on the rank used to decide final recommendations
  • Area Under the Curve (AUC) of ROC may be used as the probability that a randomly chosen good recommendation ranks above a randomly chosen bad recommendation
  • Spark’s BinaryClassificationMetrics
  • Computes AUC per user and averages the result
  • Generating mean AUC
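
The probabilistic reading of AUC can be computed directly by comparing every good/bad pair. The sketch below is plain Scala with hypothetical names; it does not use Spark's metrics classes, and counting ties as half a correct pair is a common convention assumed here rather than something the slides state.

```scala
object AucExample {
  // Fraction of (good, bad) pairs in which the good item receives the higher score.
  // Ties count as half a correct pair.
  def auc(goodScores: Seq[Double], badScores: Seq[Double]): Double = {
    require(goodScores.nonEmpty && badScores.nonEmpty, "need at least one score of each kind")
    val correct = (for (g <- goodScores; b <- badScores) yield {
      if (g > b) 1.0 else if (g == b) 0.5 else 0.0
    }).sum
    correct / (goodScores.size * badScores.size)
  }

  // Mean AUC: compute AUC separately per user, then average.
  def meanAuc(perUser: Seq[(Seq[Double], Seq[Double])]): Double =
    perUser.map { case (good, bad) => auc(good, bad) }.sum / perUser.size
}
```

For example, with good scores 0.9 and 0.3 and bad scores 0.5 and 0.1, three of the four pairs are ordered correctly, so the AUC is 0.75. A recommender that scores at random would land near 0.5.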


mAP metric

  • Mean average precision
  • Focuses on the top recommendations
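
A minimal plain-Scala sketch of mean average precision, assuming the usual definition in which each user's average precision sums precision@k over the positions that hold a relevant item and divides by that user's number of relevant items. The names and the integer item IDs are hypothetical.

```scala
object MeanAveragePrecisionExample {
  // Average precision of one ranked list: the sum of precision@k over every
  // position k that holds a relevant item, divided by the number of relevant
  // items (relevant items missing from the list contribute zero).
  def averagePrecision(ranked: Seq[Int], relevant: Set[Int]): Double = {
    if (relevant.isEmpty) 0.0
    else {
      var hits = 0
      var sum = 0.0
      ranked.zipWithIndex.foreach { case (item, idx) =>
        if (relevant.contains(item)) {
          hits += 1
          sum += hits.toDouble / (idx + 1)
        }
      }
      sum / relevant.size
    }
  }

  // mAP: the mean of the per-user average precision values.
  def meanAveragePrecision(perUser: Seq[(Seq[Int], Set[Int])]): Double =
    if (perUser.isEmpty) 0.0
    else perUser.map { case (ranked, relevant) => averagePrecision(ranked, relevant) }.sum / perUser.size
}
```

Because precision@k is divided by the position, a relevant item at rank 1 contributes far more than the same item at rank 10, which is why the metric emphasizes the top of the list.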


Computing AUC

  • 90% of the data is used for training and the remaining 10% for validation

    import org.apache.spark.broadcast.Broadcast
    import org.apache.spark.mllib.recommendation._
    import org.apache.spark.rdd._

    def areaUnderCurve(
        positiveData: RDD[Rating],
        bAllItemIDs: Broadcast[Array[Int]],
        predictFunction: (RDD[(Int, Int)] => RDD[Rating])) = {
      ...
    }

    val allData = buildRatings(rawUserArtistData, bArtistAlias)
    // Hold out 10% of the data for validation
    val Array(trainData, cvData) = allData.randomSplit(Array(0.9, 0.1))


Computing AUC

  • continued

    trainData.cache()
    cvData.cache()

    // Every distinct artist ID, broadcast for use when computing AUC
    val allItemIDs = allData.map(_.product).distinct().collect()
    val bAllItemIDs = sc.broadcast(allItemIDs)

    val model = ALS.trainImplicit(trainData, 10, 5, 0.01, 1.0)
    val auc = areaUnderCurve(cvData, bAllItemIDs, model.predict)


k-Fold Cross-validation

  • Create a k-fold partition of the dataset
  • For each of the k experiments, use k-1 folds for training
  • Use the remaining fold for testing

[Figure: the total set of examples split into folds; experiments 1 to 4 each hold out a different fold as the test examples]


True error estimate

  • k-fold cross validation is similar to random subsampling
  • The advantage of k-Fold Cross validation
  • All the examples in the dataset are eventually used for both training and testing
  • The true error is estimated as the average error rate

$E = \frac{1}{K} \sum_{i=1}^{K} E_i$
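
The partitioning and the averaging can be sketched in plain Scala without Spark. The round-robin fold assignment and all names below are assumptions made for illustration; a real split would normally shuffle the examples first.

```scala
object KFoldExample {
  // Assign example indices 0 until n to k folds round-robin.
  // Returns one (train, test) pair of index lists per experiment.
  def kFold(n: Int, k: Int): Seq[(Seq[Int], Seq[Int])] = {
    require(k >= 2 && k <= n, "need 2 <= k <= n")
    val indices = 0 until n
    (0 until k).map { fold =>
      val (test, train) = indices.partition(_ % k == fold)
      (train, test)
    }
  }

  // E = (1/K) * sum of the per-fold errors E_i.
  def meanError(foldErrors: Seq[Double]): Double = {
    require(foldErrors.nonEmpty, "need at least one fold error")
    foldErrors.sum / foldErrors.size
  }
}
```

With 10 examples and k = 5, each experiment trains on 8 examples and tests on 2, and every example appears in exactly one test fold.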


k-Fold Cross-validation with Spark

  • MLUtils.kFold()

    def predictMostListened(sc: SparkContext, train: RDD[Rating])(allData: RDD[(Int, Int)]) = {
      // Total play count per artist, broadcast to every executor
      val bListenCount = sc.broadcast(
        train.map(r => (r.product, r.rating)).reduceByKey(_ + _).collectAsMap()
      )
      // Score every (user, artist) pair by the artist's overall play count
      allData.map { case (user, product) =>
        Rating(user, product, bListenCount.value.getOrElse(product, 0.0))
      }
    }

    val auc = areaUnderCurve(cvData, bAllItemIDs, predictMostListened(sc, trainData))


Hyperparameter selection

  • MatrixFactorizationModel
  • ALS.trainImplicit()
  • rank = 10
      • The number of latent factors in the model
      • The number of columns, k
  • iterations = 5
      • The number of iterations that the factorization runs
  • lambda = 0.01
      • A standard overfitting parameter
      • Higher values guard against overfitting
      • Values that are too high will decrease the factorization’s accuracy
  • alpha = 1.0
      • Controls the relative weight of observed versus unobserved user-product interactions in the factorization
