Distributed Implementation of the Triplets View - CS535 Big Data

SLIDE 1

CS535 Big Data 4/8/2020 Week 11-B Sangmi Lee Pallickara http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, page 1

CS535 BIG DATA

PART B. GEAR SESSIONS

SESSION 4: LARGE SCALE RECOMMENDATION SYSTEMS AND SOCIAL MEDIA

Sangmi Lee Pallickara Computer Science, Colorado State University http://www.cs.colostate.edu/~cs535

CS535 Big Data | Computer Science | Colorado State University

FAQs

Today is the last day of the discussion period for Session III on Piazza
Watch video clips on Canvas → Assignments → Echo360
Term project phase 1 (Proposal)
Feedback is available in Canvas
Please arrange a meeting if needed

Topics of Today's Class

Part 1: Distributed implementation of Triplets View in GraphX
Recommendation Systems
Part 2: Introduction and Content based recommendation systems
Part 3: Collaborative Filtering (Case study of Amazon’s Item-to-Item model and Netflix’s Latent Factor Model)

GEAR Session 3. Big Graph Analysis

Lecture 3. Distributed Large Graph Analysis-II GraphX: Graph Processing in a Distributed Dataflow Framework

Distributed Implementation of the Triplets View

Efficient lookup of edges

Edges within a partition are clustered by source vertex id using a compressed sparse row (CSR) representation and hash-indexed by their target id

CSR with an example
With a sparse m x n matrix M
Using three (1 dimensional) arrays (V, COL_INDEX, ROW_INDEX)
V = [5 8 3 6] (nonzero values of M)
COL_INDEX = [0 1 2 1] (column indices)
ROW_INDEX = [0 2 3 4] (index in V where the given row starts)
Define row_start = ROW_INDEX[row] and row_end = ROW_INDEX[row+1]
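The row lookup defined by row_start and row_end can be sketched in a few lines of Scala. This is a minimal illustration that reuses the array values from the example; the object and method names are hypothetical and are not part of GraphX.

```scala
object CsrExample {
  // Array values taken from the CSR example in this section
  val v: Array[Int] = Array(5, 8, 3, 6)
  val colIndex: Array[Int] = Array(0, 1, 2, 1)
  val rowIndex: Array[Int] = Array(0, 2, 3, 4)

  /** Returns the (column, value) pairs stored for the given row. */
  def rowEntries(row: Int): Seq[(Int, Int)] = {
    val rowStart = rowIndex(row)     // index in v where this row starts
    val rowEnd = rowIndex(row + 1)   // index in v where the next row starts
    (rowStart until rowEnd).map(i => (colIndex(i), v(i)))
  }
}
```

Only the slice between row_start and row_end is scanned, which is why edges clustered by source vertex can be found without touching the rest of the partition.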

Index Reuse

GraphX inherits the immutability of Spark
All graph operators logically create new collections rather than destructively modifying existing ones
Derived vertex and edge collections can often share indices to reduce memory overhead and improve local performance
Hash index on vertices can enable fast aggregation, and the resulting aggregates share the index with the original vertices
Faster Joins
Vertex collections sharing the same index can be joined by a coordinated scan
Without requiring any index lookups
Index reuse reduces the per-iteration runtime of PageRank on the Twitter graph by 59% (GraphX paper)
Operators that do not modify the graph structure (e.g. mapV) automatically preserve indices
Operators that restrict the graph structure (e.g. subgraph) rely on bitmasks to construct restricted views
reindex operator
For operators that change the structure heavily (e.g. heavy filtering)

Implementing the Triplets View

Triplets view
Three way join between the source and destination vertex properties and the edge properties
Vertex Mirroring
Multicast Join
Partial Materialization
Incremental View Maintenance

Implementing the Triplets View: Vertex Mirroring

Join requires data movement
Vertex and edge property collections are partitioned independently
Three-way join
Shipping the vertex properties across the network to the edges
Setting the edge partitions as the join sites
Observation 1: Real-world graphs commonly have orders of magnitude more edges than vertices

Observation 2: A single vertex may have many edges in the same partition
Enabling substantial reuse of the vertex property

Implementing the Triplets View: Multicast Join

Broadcast join
All vertices are sent to each edge partition
Multicast join
Each vertex property is sent only to the edge partitions that contain adjacent edges
Join site information is stored in the routing table
Co-partitioned with the vertex collection
Routing table is associated with the edge collection
Routing table is constructed lazily upon first instantiation of the triplets view
Example
Per-city partitioning scheme on the Facebook social network graph
50.5% reduction in query time
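The difference between the broadcast join and the multicast join can be sketched with a small routing table. This is a simplified in-memory model, not GraphX code: the `Edge` class, the partition map, and the function names are assumptions made for illustration.

```scala
object MulticastJoinSketch {
  type VertexId = Long
  type PartitionId = Int
  final case class Edge(src: VertexId, dst: VertexId)

  /** Routing table: for each vertex, the edge partitions that hold an adjacent edge. */
  def routingTable(edgePartitions: Map[PartitionId, Seq[Edge]]): Map[VertexId, Set[PartitionId]] =
    edgePartitions.toSeq
      .flatMap { case (pid, edges) => edges.flatMap(e => Seq(e.src -> pid, e.dst -> pid)) }
      .groupBy(_._1)
      .map { case (vid, pairs) => vid -> pairs.map(_._2).toSet }

  /** (copies shipped by a multicast join, copies shipped by a broadcast join). */
  def shippedCopies(table: Map[VertexId, Set[PartitionId]], numPartitions: Int): (Int, Int) =
    (table.values.toSeq.map(_.size).sum, table.size * numPartitions)
}
```

With two edge partitions holding edges (1,2), (1,3) and (3,4), only vertex 3 has to be sent to both partitions, so the multicast join ships 5 vertex copies where a broadcast join would ship 8.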

Implementing the Triplets View: Partial Materialization

Local joins at the edge partitions
Mirrored vertex properties are stored in local hash maps on each edge partition
Referenced when the triplets are constructed

Implementing the Triplets View: Incremental View Maintenance

Iterative graph algorithms often modify only a subset of the vertex properties in each iteration

Incremental view maintenance
To avoid unnecessary movement of unchanged data
After each graph operation
You can track which vertex properties have changed since the triplets view was last constructed
When the triplets view is accessed next time
Only the changed vertices are re-routed to their edge-partition join sites
Local mirrored values of the unchanged vertices are reused

Query Optimizations for the mrTriplets operator

Filtered Index Scanning
mrTriplets operator logically involves a scan of the triplets view to apply the user-defined map function
As iterative graph algorithms converge, the working sets tend to shrink
Map function skips many triplets
Active set
Map function only needs to operate on triplets containing active vertices
Defined by the application specific predicate
E.g. connected component analysis
Indexed scan for the triplets view
Application expresses the current active set by restricting the graph using subgraph operator
Filter the triplets using this vertex predicate

Query Optimizations for the mrTriplets operator

Automatic Join Elimination
Some operations on the triplets view may access only one of the vertex properties or none at all
E.g. counting the degree of each vertex
GraphX uses a JVM bytecode analyzer to inspect user-defined functions at runtime
Check whether the source or destination vertex properties are referenced
If only one property is referenced and the triplets view has not already been materialized

GraphX rewrites the query plan for generating the triplets view
From three-way join to a two-way join
If none of the vertex properties are referenced
GraphX eliminates the join entirely

Additional Optimizations

Memory-based Shuffle
Spark’s default shuffle implementation materializes the temporary data to disk
GraphX modified the shuffle phase to materialize map outputs in memory and remove this temporary data using a timeout
Batching and Columnar Structure
In the join code, batch a block of vertices routed to the same target join site and convert the block from row-oriented format to column-oriented format
Apply the LZF compression algorithm on these blocks to send them
Variable Integer Encoding
While GraphX uses 64-bit vertex ids, most ids are much smaller than 2^64
GraphX uses a variable-encoding scheme
Each byte uses only its first 7 bits to encode the value; the highest-order bit indicates whether another byte follows
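A variable-length integer encoding of this kind can be sketched as follows. This is an illustrative version, not the GraphX implementation: the byte order (least-significant 7-bit group first) and the restriction to non-negative ids are assumptions of this sketch.

```scala
object VarIntSketch {
  /** Encodes a non-negative Long using 7 value bits per byte, low-order group first. */
  def encode(value: Long): Array[Byte] = {
    require(value >= 0, "this sketch handles non-negative ids only")
    val out = scala.collection.mutable.ArrayBuffer.empty[Byte]
    var rest = value
    while (rest >= 0x80L) {
      out += ((rest & 0x7FL) | 0x80L).toByte // high bit set: another byte follows
      rest >>>= 7
    }
    out += rest.toByte                        // high bit clear: last byte
    out.toArray
  }

  /** Reassembles the value from the 7-bit groups. */
  def decode(bytes: Array[Byte]): Long =
    bytes.zipWithIndex.foldLeft(0L) { case (acc, (b, i)) =>
      acc | ((b & 0x7FL) << (7 * i))
    }
}
```

Under these assumptions an id below 128 takes one byte and an id such as 300 takes two, instead of the eight bytes of a fixed 64-bit encoding.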

GEAR Session 4. Large Scale Recommendation Systems and Social Media

Lecture 1. Large Scale Recommendation Systems

Recommendation Systems: Introduction

“What percentage of the top 10,000 titles in any online media store (Netflix, iTunes, Amazon, or any other) will rent or sell at least once a month?”

The long tail phenomenon [1/2]

Distribution of numbers with a portion that has a large number of occurrences far from the “head” or central part of the distribution

The vertical axis represents popularity
The items are ordered on the horizontal axis according to their popularity
The long-tail phenomenon forces online institutions to recommend items to individual users

Erik Brynjolfsson, Yu (Jeffrey) Hu, and Duncan Simester. 2011. Goodbye Pareto Principle, Hello Long Tail: The Effect of Search Costs on the Concentration of Product Sales. Manage. Sci. 57, 8 (August 2011), 1373-1386. DOI=http://dx.doi.org/10.1287/mnsc.1110.1371

Recommendation systems

Seek to predict the “rating” or “preference” that a user would give to an item

Applications of Recommendation Systems

Product recommendations
Amazon or similar online vendors
Movie recommendations
Netflix offers its customers recommendations of movies they might like
News articles
News services have attempted to identify articles of interest to readers based on the articles that they have read in the past

Blogs, YouTube

Types of Recommendation Systems

Random prediction algorithm
Randomly chooses items from the set of available items and recommends them to the users
Accuracy of this algorithm is poor
Frequent sequence
Uses the frequent pattern to recommend other items
Cold start problem
Content based algorithms
Based on properties of items
Similarity of items is determined by measuring the similarity in their properties
Collaborative Filtering algorithms (CF)
Based on the relationship between users and items
Similarity of items is determined by the similarity of the ratings of those items by the users who have rated both items

Serendipitous recommendation systems
Assumes that the user may want to be surprised with something unexpected
From the results of existing recommendation systems, SR increases diversity and novelty

GEAR Session 4. Large Scale Recommendation Systems and Social Media

Lecture 1. Large Scale Recommendation Systems

Recommendation Systems: Content-based Recommendations

Content-Based Recommendations

Focuses on properties of items
Similarity of items is determined by measuring the similarity in their properties

Item Profiles

A record or collection of records representing important characteristics of the item
E.g. the features of a movie
The set of actors of the movie (Some viewers prefer movies with their favorite actors)
The director
The year in which the movie was made
The genre or general type of movie
Other features: manufacturer, screen size, etc.

Discovering Features of Documents

Some items have features that are not immediately apparent to the system
E.g. document collections and images
E.g. News articles
Suggesting articles on topics a user is interested in
Possible features
The n words with the highest TF.IDF scores
A fixed percentage of the words, taking those with the highest TF.IDF scores
To measure the similarity
Jaccard distance or Cosine distance

What is the TF-IDF value? Term frequency–inverse document frequency is a numerical statistic that is intended to reflect how important a word is to a document in a corpus. It combines term frequency and inverse document frequency.
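As a rough illustration of how such scores could be computed, the sketch below uses one common definition, which is an assumption here and not stated on the slides: term frequency is the word count divided by the count of the most frequent word in the document, and inverse document frequency is log2(N / n_w), where N is the number of documents and n_w the number of documents containing the word. The object and function names are hypothetical.

```scala
object TfIdfSketch {
  /** docs: each document is a sequence of already-tokenized words. */
  def tfIdf(docs: Seq[Seq[String]]): Seq[Map[String, Double]] = {
    val n = docs.size.toDouble
    // number of documents that contain each word
    val docFreq: Map[String, Int] =
      docs.flatMap(_.distinct).groupBy(identity).map { case (w, ws) => w -> ws.size }
    docs.map { doc =>
      val counts = doc.groupBy(identity).map { case (w, ws) => w -> ws.size }
      val maxCount = if (counts.isEmpty) 1 else counts.values.max
      counts.map { case (w, c) =>
        val tf = c.toDouble / maxCount
        val idf = math.log(n / docFreq(w)) / math.log(2)
        w -> tf * idf
      }
    }
  }
}
```

A word that appears in every document gets an IDF of 0 and therefore never ranks among the high-scoring words, whatever its term frequency.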

Obtaining Item Features from Tags of Images

Crowd sourcing
Inviting users to tag the items
del.icio.us: an earlier attempt to tag massive amounts of data
Yahoo: invited users to tag Web pages
Disadvantage
Are users willing to take the trouble to create the tags?
Erroneous tags can bias the system

Representing Item Profiles

Goal
Create both an item profile consisting of feature-value pairs and a user profile summarizing the preferences of the user

Example
Word vector (with 0’s and 1’s)
1 represents the occurrence of a high TF-IDF word in the document
0 represents the occurrence of a low TF-IDF word in the document

Representing Item Profiles

Suppose the only features of movies are the set of actors and the average rating
Consider two movies with five actors each
Two of the actors are in both movies
Example
One movie has an average rating of 3 and the other an average of 4

A = (0 1 1 0 1 1 0 1 3α), B = (1 1 0 1 0 1 1 0 4α)

Cosine similarity between above vectors
$\mathrm{CosSimilarity}(A, B) = \frac{A \cdot B}{\lVert A \rVert \lVert B \rVert} = \frac{2 + 12\alpha^2}{\sqrt{25 + 125\alpha^2 + 144\alpha^4}}$

If we use α = 1
We take the average ratings as they are
If we use α = 2
We double the ratings
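The effect of the scaling factor α on the cosine similarity of the two movie vectors can be checked numerically. The sketch below is only an illustration; the object and function names are hypothetical.

```scala
object ScaledCosine {
  def cosine(a: Seq[Double], b: Seq[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    dot / (normA * normB)
  }

  // Eight actor components followed by the average rating scaled by alpha
  def movieA(alpha: Double): Seq[Double] =
    Seq(0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 3.0 * alpha)
  def movieB(alpha: Double): Seq[Double] =
    Seq(1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 4.0 * alpha)
}
```

With α = 1 the similarity is 14/√294 ≈ 0.816, and with α = 2 it is 50/√2829 ≈ 0.940, so a larger α lets the rating component dominate the actor components.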

User Profiles

Using the utility matrix representing the connection between users and items
Example: “Find user’s preference for movies with a specific actor!”
Suppose items are movies, represented by Boolean profiles with components corresponding to actors
The utility matrix has a 1 if the user has seen the movie and is blank otherwise
If 20% of the movies that user U likes have Julia Roberts as one of the actors
then the user profile for U will have 0.2 in the component for Julia Roberts
Suppose user U gives an average rating of 3
There are three movies with Julia Roberts as an actor, and those movies got ratings of 3, 4, and 5
The component for Julia Roberts will have a value that is the average of (3 − 3), (4 − 3), and (5 − 3), that is, 1
If user V gives an average rating of 4
Three movies with Julia Roberts as an actor have ratings of 2, 3, and 5
The user profile for V has, in the component for Julia Roberts, the average of (2 − 4), (3 − 4), and (5 − 4), that is, −2/3
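The normalized profile component from the Julia Roberts example can be written as a one-line function. This is a minimal sketch; the names are hypothetical.

```scala
object UserProfileSketch {
  /** Average of (rating - user's average rating) over the movies that have the feature. */
  def component(ratingsWithFeature: Seq[Double], userAverage: Double): Double =
    ratingsWithFeature.map(_ - userAverage).sum / ratingsWithFeature.size
}
```

For user U this gives `component(Seq(3.0, 4.0, 5.0), 3.0)` = 1, and for user V it gives `component(Seq(2.0, 3.0, 5.0), 4.0)` = −2/3, matching the values worked out by hand.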

Recommending Items to Users Based on Content

With the previous example
The highest recommendations (lowest cosine distance) belong to the movies with lots of actors that appear in many of the movies the user likes

Classification Algorithms

Decision tree
A collection of nodes, arranged as a binary tree
The leaves render decisions
In this case, the decision would be “likes” or “doesn’t like”
Each interior node is a condition on the objects being classified

GEAR Session 4. Large Scale Recommendation Systems and Social Media

Lecture 1. Large Scale Recommendation Systems

Recommendation Systems: Collaborative Filtering

Collaborative Filtering

Identifies similar users and recommends what similar users like
Instead of using features of items to determine their similarity
Focus on the similarity of the user ratings for two items
Users are similar if their rating vectors are close according to some distance measure
Jaccard or cosine distance
Recommendation for a user U is made by looking at the users that are most similar to U
Recommending items that these users like

Measuring Similarity? -- Jaccard Similarity Coefficient

The utility matrix
Movies (columns): SW Episode VII, SW Episode VIII, SW Episode IX, Frozen I, Frozen II, Joker, Avengers: Endgame
Ratings (rows; blank cells omitted)
Reviewer A: 4, 2, 5
Reviewer B: 5, 4, 5
Reviewer C: 2, 3, 3, 5
Reviewer D: 4, 2

$\mathrm{Jaccard\ similarity}(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{1}{5} = 20\%$
$\mathrm{Jaccard\ similarity}(A, C) = \frac{|A \cap C|}{|A \cup C|} = \frac{2}{5} = 40\%$

If the utility matrix only reflects purchases of the movie, this can be useful
If utilities are more detailed ratings, the Jaccard distance loses important information
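The Jaccard computation treats each reviewer as the set of movies they rated and ignores the rating values. In the sketch below the assignment of movies to reviewers is hypothetical, chosen only so that the set sizes and overlaps reproduce the 1/5 and 2/5 results; the function name is also an assumption.

```scala
object JaccardSketch {
  def jaccard[T](a: Set[T], b: Set[T]): Double =
    if (a.isEmpty && b.isEmpty) 0.0
    else (a intersect b).size.toDouble / (a union b).size

  // Hypothetical rated-movie sets (the actual column positions are not given here)
  val reviewerA: Set[String] = Set("SW VII", "Frozen I", "Joker")
  val reviewerB: Set[String] = Set("SW VII", "SW VIII", "SW IX")
  val reviewerC: Set[String] = Set("Frozen I", "Frozen II", "Joker", "Endgame")
}
```

Because only set membership is used, a reviewer who rated a movie 1 and one who rated it 5 count as agreeing, which is the information loss noted for detailed ratings.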

Measuring Similarity? -- Cosine Similarity

The utility matrix
Movies (columns): SW Episode VII, SW Episode VIII, SW Episode IX, Frozen I, Frozen II, Joker, Avengers: Endgame
Ratings (rows; blank cells omitted)
Reviewer A: 4, 2, 5
Reviewer B: 5, 4, 5
Reviewer C: 2, 3, 3, 5
Reviewer D: 4, 2

$\mathrm{Cosine\ similarity}(A, B) = \frac{A \cdot B}{\lVert A \rVert \lVert B \rVert} = \frac{20}{\sqrt{16 + 4 + 25}\,\sqrt{25 + 16 + 25}} = \frac{20}{6.7 \times 8.1} = \frac{20}{54.27} = 0.37$

Clustering Users and Items

It is hard to detect similarity among either items or users
because we have little information about user-item pairs in the sparse utility matrix
Clustering items and/or users

Movies (columns): SW Episode VII, SW Episode VIII, SW Episode IX, Frozen I, Frozen II, Joker, Avengers: Endgame
Ratings (rows; blank cells omitted)
Reviewer A: 4, 2, 5
Reviewer B: 5, 4, 5
Reviewer C: 2, 3, 3, 5
Reviewer D: 4, 2

Clustering Users and Items

Cluster Items based on the series

Clustered items (columns): SW Episode VII/VIII/IX, Frozen I and II, Joker, Avengers: Endgame
Ratings (rows; blank cells omitted)
Reviewer A: 4, 2, 5
Reviewer B: 4.66
Reviewer C: 2, 3, 5
Reviewer D: 4, 2

Clustering Users and Items

Find the clusters to which users and items belong
Estimate entries based on the user-item relationship
If the entry is empty, find the most similar item group

Clustered items (columns): SW Episode VII/VIII/IX, Frozen I and II, Joker, Avengers: Endgame
Ratings (rows; blank cells omitted)
Reviewer A: 4, 2, 5
Reviewer B: 4.66
Reviewer C: 2, 3, 5
Reviewer D: 4, 2

GEAR Session 4. Large Scale Recommendation Systems and Social Media

Lecture 1. Large Scale Recommendation Systems

Recommendation Systems: Amazon.com Item-to-item collaborative filtering

This material is built based on

Greg Linden, Brent Smith, and Jeremy York, “Amazon.com Recommendations: Item-to-Item Collaborative Filtering,” IEEE Internet Computing, 2003

Amazon.com uses recommendations as a targeted marketing tool
Email campaigns
Most of their web pages

Item-to-item collaborative filtering

It does NOT match the user to similar customers
Item-to-item collaborative filtering
Matches each of the user’s purchased and rated items to similar items
Combines those similar items into a recommendation list

Determining the most-similar match

The algorithm builds a similar-items table
By finding items that customers tend to purchase together
How about building a product-to-product matrix by iterating through all item pairs and computing a similarity metric for each pair?
Many product pairs have no common customer
If you already bought a TV today, will you buy another TV again today?

Calculating the similarity between a single product and all related products:

For each item in product catalog, I1
    For each customer C who purchased I1
        For each item I2 purchased by customer C
            Record that a customer purchased I1 and I2
    For each item I2
        Compute the similarity between I1 and I2
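The recording step of the loop above can be sketched in plain Scala. This is an in-memory illustration, not Amazon's implementation: the `Map`-based purchase history and all names are assumptions, and the similarity computation itself is left out. The test data in the usage note are the seven purchase records UA to UG from this lecture's example.

```scala
object SimilarItemsSketch {
  type Item = String
  type Customer = String

  /** For every item I1, counts how many customers bought I1 together with each other item I2. */
  def coPurchaseCounts(purchases: Map[Customer, Set[Item]]): Map[Item, Map[Item, Int]] = {
    // Invert the purchase history: item -> customers who purchased it
    val buyers: Map[Item, Set[Customer]] =
      purchases.toSeq
        .flatMap { case (c, items) => items.toSeq.map(_ -> c) }
        .groupBy(_._1)
        .map { case (item, pairs) => item -> pairs.map(_._2).toSet }

    buyers.map { case (i1, customers) =>
      val together = for {
        c  <- customers.toSeq      // each customer C who purchased I1
        i2 <- purchases(c).toSeq   // each item I2 purchased by customer C
        if i2 != i1
      } yield i2                   // record that a customer purchased I1 and I2
      i1 -> together.groupBy(identity).map { case (i2, hits) => i2 -> hits.size }
    }
  }
}
```

Only pairs that actually share a customer are ever touched, which is why this loop avoids iterating over all item pairs. On the UA to UG records, for example, I0 and I5 are recorded together twice (users UD and UF) and I0 and I1 never.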

Creating a similar-item table

Similar-items table is extremely computing intensive
Offline computation
O(N^2 M) in the worst case
Where N is the number of items and M is the number of users
Average case is closer to O(NM)
Most customers have very few purchases
Sampling customers who purchase best-selling titles reduces runtime even more
With little reduction in quality

Computing similarity

Option 1. Using co-occurrence matrix
If two items have been purchased together by the same users many times, they are considered “similar” items
Option 2. Using cosine measure
Each vector corresponds to an item rather than a customer
M dimensions correspond to customers who have purchased that item
$\mathrm{Cosine\_Similarity}(A, B) = \cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$

Example

Items: I0, I1, I2, I3, I4, I5, I6

Purchase record for the user UA = {I1, I3, I4}
Purchase record for the user UB = {I2, I3, I4}
Purchase record for the user UC = {I2}
Purchase record for the user UD = {I0, I5, I6}
Purchase record for the user UE = {I1, I3}
Purchase record for the user UF = {I0, I3, I5}
Purchase record for the user UG = {I5, I6}

Example

Co-occurrence matrix (rows and columns I0 to I6; blank cells omitted)
I0: 1, 2, 1
I1: 2, 1
I2: 1, 1
I3: 1, 2, 1
I4: 1, 1
I5: 2, 1
I6: 1, 1

Example

Co-occurrence matrix (as above)

$\mathrm{Cosine\ similarity}(I_0, I_1) = \cos(I_0, I_1) = \frac{I_0 \cdot I_1}{\lVert I_0 \rVert \, \lVert I_1 \rVert}$
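One way to evaluate this formula is to represent each item as a 0/1 vector over customers, as described under Option 2. That representation is an assumption of this sketch (the slides may instead intend the rows of the co-occurrence matrix), and the names are hypothetical. For 0/1 vectors the dot product is the number of shared buyers and each norm is the square root of the number of buyers.

```scala
object ItemCosineSketch {
  /** Cosine between two items given the sets of customers who purchased each. */
  def cosine(buyersA: Set[String], buyersB: Set[String]): Double =
    if (buyersA.isEmpty || buyersB.isEmpty) 0.0
    else (buyersA intersect buyersB).size /
      (math.sqrt(buyersA.size.toDouble) * math.sqrt(buyersB.size.toDouble))

  // Buyers derived from the purchase records UA to UG
  val buyersI0: Set[String] = Set("UD", "UF")
  val buyersI1: Set[String] = Set("UA", "UE")
  val buyersI3: Set[String] = Set("UA", "UB", "UE", "UF")
}
```

Under this representation I0 and I1 share no customer, so their cosine similarity is 0, while I1 and I3 share two customers and score 2 / (√2 · √4) ≈ 0.707.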

Scalability

Amazon.com has around 110 million active customers (244 million total customers) and several million catalog items

Traditional collaborative filtering does little or no offline computation
Online computation scales with the number of customers and catalog items.

http://www.fool.com/investing/general/2014/05/24/how-many-customers-does-amazon-have.aspx

Key scalability strategy for Amazon recommendations

Creating the expensive similar-items table offline
Online component
Looking up similar items for the user’s purchases and ratings
Scales independently of the catalog size or the total number of customers
It is dependent only on how many titles the user has purchased or rated

Recommendation quality

The algorithm recommends highly correlated similar items
Recommendation quality is excellent
Algorithm performs well with limited user data

GEAR Session 4. Large Scale Recommendation Systems and Social Media

Lecture 1. Large Scale Recommendation Systems

Recommendation Systems: Recommending Music and the Audioscrobbler Dataset

Dataset

Audioscrobbler dataset
2002, Richard Jones
Collecting and analyzing user’s songs to generate recommendation
Started with support for Winamp and XMMS
iTunes, Winamp, Windows Media Player, Foobar, iPod, Amarok, Rhythmbox, mpd, Xbox media center, Slimserver, Jinzora, mpg321, Muine, Rhapsody, YME, Soundbridge, VLC…

Dataset

Confined rating system
“Bob rates Coldplay 3.5 stars.”
Users rate music far less frequently than they play music
Audioscrobbler dataset
“Bob played Coldplay track”
Each individual data point carries less information
Implicit feedback
User-artist connections are implied as a side effect of other actions

Dataset

141,000 unique users
1.6 million unique artists
24.2 million plays of artists by users are recorded
user_artist_data.txt
http://www-etud.iro.umontreal.ca/~bergstrj/audioscrobbler_data.html
On average, each user has played songs from about 171 artists (out of 1.6 M)
Extremely sparse dataset

Netflix Prize

The Netflix Prize challenge concerned recommender systems for movies (October 2006)
Netflix released a training set consisting of data from almost 500,000 customers and their ratings on 18,000 movies
More than 100 million ratings
The task was to use these data to build a model to predict ratings for a hold-out set of 3 million ratings

GEAR Session 4. Large Scale Recommendation Systems and Social Media

Lecture 1. Large Scale Recommendation Systems

Recommendation Systems: Collaborative Filtering: Latent Factor Model

Collaborative filtering [1/2]

Collects and analyzes a large amount of information on users’ behaviors, activities or preferences and predicts what users will like based on their similarity to other users

Explicit data collection
Rate an item
Search history
Favorite item
Wish list
Implicit data collection
Viewing times
Tracking online purchases
Analyzing the user’s social network

Collaborative filtering [2/2]

Two users may share similar tastes because they are the same age
It is NOT an example of collaborative filtering
Two users may both like the same song because they play many other same songs
It IS an example of collaborative filtering
Algorithm that learns without access to user or artist attributes

Latent-Factor model

Tries to explain observed interactions between large numbers of users and products through a relatively small number of unobserved, underlying reasons
Within the music business context,
Explains why millions of people buy a particular few of thousands of possible albums by describing users and albums in terms of tastes for perhaps tens of genres, tastes that are not directly observable

Simplified illustration of the latent factor approach

Figure: movies and users placed on two latent axes, “geared toward males” vs. “geared toward females” and “serious” vs. “escapist”, dividing the plane into Area 1 to Area 4
Movies: Fast and Furious, King Arthur, Ken Burns' The Civil War, Twilight, Still Alice, Iron Man
Users: Bob, Jennifer, Tom, Nancy

How do we model this?

User and product data in a large matrix A
The entry at row i and column j exists if user i has played product j
A ≈ X Y^T
A: users × products; X: users × k; Y: products × k
The k columns correspond to the latent factors

Creating user and artist matrices

Two matrices
Matrix X for users
Each value corresponds to a latent feature in the model
Matrix Y for products
Each value corresponds to a latent feature in the model
Rows express how much users and products associate with these latent features
Product of X and Y^T
Complete estimation of the entire, dense user-product interaction matrix
Figure: X (users’ matrix, k columns) × Y^T (products’ matrix, k rows)

Computational challenge

A = XY^T generally has no exact solution
Because X and Y are not large enough (too low rank) to represent A perfectly
Goal
Finding the best X and Y

Alternating Least Squares (ALS)

Alternating least squares algorithm to compute X and Y
Spark MLlib’s ALS implementation
Step 1
Y is not known
Initialized to a matrix with randomly chosen row vectors
Then simple linear algebra gives the best X, given Y and A
$A_i Y (Y^T Y)^{-1} = X_i$
Equality cannot be achieved exactly
The goal becomes to minimize $\lVert A_i Y (Y^T Y)^{-1} - X_i \rVert$
The sum of squared differences between the two matrices’ entries

Alternating Least Squares (ALS)

Step 2.
Repeat similar sequence as step 1 to compute Y from the X (from step 1)
Step 3.
Repeat similar sequence as step 1 to compute X from the Y (from step 2)

…

X and Y do eventually converge to good (acceptable) solutions
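The alternation between the two steps can be illustrated with a deliberately simplified case. The sketch below assumes a single latent factor (k = 1), a small dense matrix in which every entry is treated as observed, and no regularization, so it is not how Spark MLlib implements ALS. With k = 1, the update A_i Y (Y^T Y)^-1 reduces to a ratio of two dot products. All names are hypothetical.

```scala
object Rank1AlsSketch {
  type Matrix = Vector[Vector[Double]]

  private def dot(a: Vector[Double], b: Vector[Double]): Double =
    a.zip(b).map { case (p, q) => p * q }.sum

  /** With k = 1, A_i Y (Y^T Y)^-1 reduces to (A_i . y) / (y . y) for each row A_i. */
  def solveRows(a: Matrix, y: Vector[Double]): Vector[Double] =
    a.map(row => dot(row, y) / dot(y, y))

  /** Alternates between fixing y to solve for x and fixing x to solve for y. */
  def als(a: Matrix, y0: Vector[Double], iterations: Int): (Vector[Double], Vector[Double]) = {
    val aT = a.transpose
    var y = y0
    var x = solveRows(a, y)      // step 1: best x given y
    for (_ <- 1 to iterations) {
      y = solveRows(aT, x)       // step 2: best y given x
      x = solveRows(a, y)        // step 3: best x given the new y
    }
    (x, y)
  }

  /** Dense estimate X Y^T. */
  def reconstruct(x: Vector[Double], y: Vector[Double]): Matrix =
    x.map(xi => y.map(yj => xi * yj))
}
```

For a matrix that really has rank 1, such as rows (2, 4), (1, 2), (3, 6), the reconstruction matches the input after the first round of alternation; for real data the fit is only approximate and the loop is run until the change becomes small.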

Alternating Least Squares (ALS)

Takes advantage of the sparsity of the input data
Easy to apply data parallelism

GEAR Workshop I | Advanced Big Data Analytics Case Study

Recommendation Systems

Building a model with Spark MLlib

Preparing the Data

Files are available at /user/ds/
Spark MLlib’s ALS implementation
Requires numeric IDs for users and items
Nonnegative 32-bit integers
An ID larger than Integer.MAX_VALUE cannot be used

val rawUserArtistData = sc.textFile("hdfs:///user/ds/user_artist_data.txt")
rawUserArtistData.map(_.split(' ')(0).toDouble).stats()
rawUserArtistData.map(_.split(' ')(1).toDouble).stats()

Maximum user ID: 24443548
Maximum artist ID: 2147483647
No additional transformation will be needed

Extracting names

artist_data.txt
Artist ID and name separated by a tab
Straightforward parsing of the file into (Int, String) tuples will fail

val rawArtistData = sc.textFile("hdfs:///user/ds/artist_data.txt")
val artistByID = rawArtistData.map { line =>
  // split at the first tab: id before it, name (with the tab) after it
  val (id, name) = line.span(_ != '\t')
  (id.toInt, name.trim)
}

Extracting names

Scala’s Option class
Option represents a value that might only optionally exist

val artistByID = rawArtistData.flatMap { line =>
  val (id, name) = line.span(_ != '\t')
  if (name.isEmpty) {
    None                                   // no tab in the line
  } else {
    try {
      Some((id.toInt, name.trim))
    } catch {
      case e: NumberFormatException => None // id is not a number
    }
  }
}
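The same Option-based parsing can be tried without a cluster by running it on an ordinary Scala collection. The sample lines below are hypothetical and are not taken from artist_data.txt; `parseArtist` is a name introduced only for this sketch.

```scala
object ArtistParsingSketch {
  /** Parses "id<TAB>name"; returns None for lines with no name or a non-numeric id. */
  def parseArtist(line: String): Option[(Int, String)] = {
    val (id, name) = line.span(_ != '\t')
    if (name.isEmpty) {
      None
    } else {
      try {
        Some((id.toInt, name.trim))
      } catch {
        case _: NumberFormatException => None
      }
    }
  }

  // Hypothetical sample lines: two well-formed, one without a tab, one with a bad id
  val sample: Seq[String] =
    Seq("1\tFirst Artist", "broken line without a tab", "x9\tBad Id", "2\tSecond Artist")

  // flatMap keeps the contents of each Some and silently drops each None
  val parsed: Seq[(Int, String)] = sample.flatMap(line => parseArtist(line))
}
```

Because `flatMap` flattens each `Option`, malformed lines disappear from the result instead of failing the whole job, which is the behavior the RDD version relies on.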

Building a First Model

Two transformations are required
Aliases dataset should be applied to convert all artist IDs to a canonical ID
The data should be converted to a Rating object
User-product-value data

import org.apache.spark.mllib.recommendation._

// artistAlias is assumed to be a Map[Int, Int] from alias ID to canonical ID,
// built earlier from the aliases dataset
val bArtistAlias = sc.broadcast(artistAlias)
val trainData = rawUserArtistData.map { line =>
  val Array(userID, artistID, count) = line.split(' ').map(_.toInt)
  // fall back to the original ID when no alias exists
  val finalArtistID = bArtistAlias.value.getOrElse(artistID, artistID)
  Rating(userID, finalArtistID, count)
}.cache()

cache()

RDD should be temporarily stored after being computed
ALS is iterative
It will typically need to access this RDD ≥ 10 times
Otherwise, this RDD could be repeatedly recomputed from the original data each time

Broadcast variables

For the case that many tasks (from different closures) need access to the same (immutable) data structure

Extends normal handling of task closures
Caching data as raw Java objects on each executor
Caching data across multiple jobs and stages
Spark will send, and hold in memory, just one copy for each executor in the cluster
Saves network traffic and memory

Building the ALS model

Constructs model as a MatrixFactorizationModel

val model = ALS.trainImplicit(trainData, 10, 5, 0.01, 1.0)

Retrieving some feature vectors

Each feature vector is an array of 10 numbers (rank = 10)

val model = ALS.trainImplicit(trainData, 10, 5, 0.01, 1.0)
model.userFeatures.mapValues(_.mkString(", ")).first()
...
(4293,-0.3233030601963864, 0.31964527593541325, 0.49025505511361034, 0.09000932568001832, 0.4429537767744912, 0.4186675713407441, 0.8026858843673894, -0.4841300444834003, -0.12485901532338621, 0.19795451025931002)

Spot Checking Recommendations

To see whether the artist recommendations for user 2093760 make any intuitive sense, first list the artists this user has already played

val rawArtistsForUser = rawUserArtistData.map(_.split(' ')).filter {
  case Array(user, _, _) => user.toInt == 2093760
}
val existingProducts = rawArtistsForUser.map {
  case Array(_, artist, _) => artist.toInt
}.collect().toSet
artistByID.filter {
  case (id, name) => existingProducts.contains(id)
}.values.collect().foreach(println)
...
David Gray
Blackalicious
Jurassic
The Saw Doctors
Xzibit

Spot Checking Recommendations

To see five recommendations for this user (ID: 2093760)

val recommendations = model.recommendProducts(2093760, 5)
recommendations.foreach(println)
...
Rating(2093760,1300642,0.02833118412903932)
Rating(2093760,2814,0.027832682960168387)
Rating(2093760,1037970,0.02726611004625264)
Rating(2093760,1001819,0.02716011293509426)
Rating(2093760,4605,0.027118271894797333)

5. Advanced Data Analytics with Apache Spark

Recommending Music and the Audioscrobbler Dataset

Evaluating the Recommendation Model

What is a “good” recommendation?

“a popular artist”?
“artists the user has listened to”?
“artists the user will listen to”?

Preparing data for evaluation

To perform a meaningful evaluation, some of the artist play data can be set aside
Hidden from the ALS model building process
The held-out data can be used as a collection of good recommendations for each user
Compute the recommender’s score

[Figure: the play data divided into a portion for building the model and a portion for testing the model]

AUC metric

A score of 1.0 is perfect and 0.0 is the worst; a random ranking is expected to score 0.5
Receiver Operating Characteristic (ROC)
Based on the rank used to decide final recommendations
Area Under the Curve (AUC) of ROC may be used as the probability that a randomly chosen good recommendation ranks above a randomly chosen bad recommendation
Spark’s BinaryClassificationMetrics computes AUC
Here, AUC is computed per user and the results are averaged
Generating mean AUC
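
The probabilistic reading of AUC can be computed directly on a small example. The sketch below is plain Scala, not the Spark implementation and not the areaUnderCurve function used later in these slides; the function names and the scores are hypothetical. It counts, over all (good, bad) pairs, how often the good item is scored higher, then averages the per-user values.

```scala
// AUC as the fraction of (positive, negative) score pairs in which the
// positive item has the higher score; a tie counts as half a win.
def auc(positiveScores: Seq[Double], negativeScores: Seq[Double]): Double = {
  require(positiveScores.nonEmpty && negativeScores.nonEmpty, "need both classes")
  val wins = for {
    p <- positiveScores
    n <- negativeScores
  } yield if (p > n) 1.0 else if (p == n) 0.5 else 0.0
  wins.sum / (positiveScores.size.toLong * negativeScores.size)
}

// Mean AUC: the average of the per-user AUC values.
def meanAuc(perUser: Seq[(Seq[Double], Seq[Double])]): Double =
  perUser.map { case (pos, neg) => auc(pos, neg) }.sum / perUser.size

// Made-up scores for two users: (scores of held-out artists, scores of other artists).
val userA = (Seq(0.9, 0.8), Seq(0.1, 0.2)) // every good item outranks every bad one
val userB = (Seq(0.9, 0.1), Seq(0.5))      // one good item above, one below
val mean = meanAuc(Seq(userA, userB))
```

User A scores 1.0, user B scores 0.5, and the mean AUC is 0.75. The pairwise loop is quadratic, so it is only suitable for illustration on small inputs.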

MAP metric

Mean average precision
Focuses on the top recommendations
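
To make the emphasis on top-ranked items concrete, here is a hedged plain-Scala sketch of one common definition of average precision: the sum of precision@k at each rank k that holds a relevant item, divided by the number of relevant items. Other normalizations exist, so treat this choice, the function names, and the item IDs as assumptions for illustration.

```scala
// Average precision of one ranked list under the definition stated above.
def averagePrecision(ranked: Seq[Int], relevant: Set[Int]): Double =
  if (relevant.isEmpty) 0.0
  else {
    var hits = 0
    var sum = 0.0
    ranked.zipWithIndex.foreach { case (item, idx) =>
      if (relevant.contains(item)) {
        hits += 1
        sum += hits.toDouble / (idx + 1) // precision at rank idx + 1
      }
    }
    sum / relevant.size
  }

// MAP: the mean of the per-user average precision values.
def meanAveragePrecision(users: Seq[(Seq[Int], Set[Int])]): Double =
  users.map { case (ranked, rel) => averagePrecision(ranked, rel) }.sum / users.size

// Made-up item IDs: the same relevant set, ranked in two different ways.
val relevantItems = Set(10, 30)
val apSpread = averagePrecision(Seq(10, 20, 30, 40), relevantItems) // hits at ranks 1 and 3
val apTop    = averagePrecision(Seq(30, 10, 20, 40), relevantItems) // hits at ranks 1 and 2
```

Moving the second relevant item from rank 3 to rank 2 raises the average precision from 5/6 to 1.0, which is the sense in which the metric rewards correct items near the top of the list.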

Computing AUC

90% of the data is used for training and the remaining 10% for validation

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd._

def areaUnderCurve(
    positiveData: RDD[Rating],
    bAllItemIDs: Broadcast[Array[Int]],
    predictFunction: (RDD[(Int, Int)] => RDD[Rating])) = {
  ...
}

val allData = buildRatings(rawUserArtistData, bArtistAlias)
val Array(trainData, cvData) = allData.randomSplit(Array(0.9, 0.1))

Computing AUC

continued

trainData.cache()
cvData.cache()

val allItemIDs = allData.map(_.product).distinct().collect()
val bAllItemIDs = sc.broadcast(allItemIDs)

val model = ALS.trainImplicit(trainData, 10, 5, 0.01, 1.0)
val auc = areaUnderCurve(cvData, bAllItemIDs, model.predict)

k-Fold Cross-validation

Create a k-fold partition of the dataset
For each of the k experiments, use k-1 folds for training
Use the remaining fold for testing

[Figure: k-fold cross-validation with four experiments (Experiment 1 to Experiment 4); each experiment holds out a different fold of the total number of examples as test examples]

True error estimate

k-fold cross validation is similar to random subsampling
The advantage of k-Fold Cross validation
All the examples in the dataset are eventually used for both training and testing
The true error is estimated as the average error rate

$$E = \frac{1}{K} \sum_{i=1}^{K} E_i$$
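
The fold construction and the averaged error estimate can be sketched in plain Scala (no Spark). The names kFoldIndices, crossValidatedError, and errorOf are hypothetical, and folds are assigned round-robin by index for simplicity; in practice the examples would normally be shuffled first.

```scala
// Split the indices 0 until n into k folds by round-robin assignment.
// Each element is (training indices, test indices) for one experiment.
def kFoldIndices(n: Int, k: Int): Seq[(Seq[Int], Seq[Int])] = {
  require(k >= 2 && k <= n, "need 2 <= k <= n")
  (0 until k).map { fold =>
    val (test, train) = (0 until n).partition(_ % k == fold)
    (train, test)
  }
}

// Estimate the true error as the mean of the k per-fold errors E_i.
// errorOf is a hypothetical callback: it trains on the first index set
// and returns the error rate measured on the second.
def crossValidatedError(n: Int, k: Int)(errorOf: (Seq[Int], Seq[Int]) => Double): Double = {
  val errors = kFoldIndices(n, k).map { case (train, test) => errorOf(train, test) }
  errors.sum / k
}

// Usage with 8 examples and 4 folds; the "error" here is a stand-in that
// returns the held-out fraction, not a real model evaluation.
val folds = kFoldIndices(8, 4)
val estimate = crossValidatedError(8, 4) { (train, test) =>
  test.size.toDouble / (train.size + test.size)
}
```

Every index appears in exactly one test fold and in the training set of the other k-1 experiments, which is the property that distinguishes k-fold cross-validation from repeated random subsampling.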

k-Fold Cross-validation with Spark

MLUtils.kFold()

// Baseline predictor: scores each (user, product) pair with the product's
// total play count in the training data, ignoring the user
def predictMostListened(sc: SparkContext, train: RDD[Rating])(allData: RDD[(Int, Int)]) = {
  val bListenCount = sc.broadcast(
    train.map(r => (r.product, r.rating)).reduceByKey(_ + _).collectAsMap()
  )
  allData.map { case (user, product) =>
    Rating(user, product, bListenCount.value.getOrElse(product, 0.0))
  }
}

val auc = areaUnderCurve(cvData, bAllItemIDs, predictMostListened(sc, trainData))

Hyperparameter selection

MatrixFactorizationModel
ALS.trainImplicit()
rank = 10
The number of latent factors in the model
The number of columns, k, in the user-feature and product-feature matrices
iterations = 5
The number of iterations that the factorization runs
lambda = 0.01
A standard overfitting parameter
Higher values guard against overfitting
Values that are too high will decrease the factorization’s accuracy
alpha = 1.0
Controls the relative weight of observed versus unobserved user-product interactions in the factorization

Questions?
