SLIDE 1

Unsupervised Machine Learning and Data Mining

DS 5230 / DS 4420 - Fall 2018

Lecture 20

Jan-Willem van de Meent

SLIDE 2

Schedule

SLIDE 3
Schedule Adjustments

  • Wed 28 Nov: Review Lecture
  • Mon 3 Dec: Project Presentations
  • Fri 7 Dec: Project Reports Due
  • Wed 12 Dec: Final Exam
  • Fri 14 Dec: Peer Reviews Due

SLIDE 4

Project

SLIDE 5
Project Reports

  • ~10 pages (rough guideline)
  • Guidelines for contents:
    • Introduction / Motivation
    • Exploratory analysis (if applicable)
    • Data mining analysis
    • Discussion of results

SLIDE 6
Project Review

  • 2 reviews per person (randomly assigned)
  • Reviews should discuss 4 aspects of the report:
    • Clarity (is the writing clear?)
    • Technical merit (are the methods valid?)
    • Reproducibility (is it clear how results were obtained?)
    • Discussion (are the results interpretable?)

SLIDE 7

Recommender Systems

SLIDE 8

The Long Tail

(from: https://www.wired.com/2004/10/tail/)



SLIDE 14

Problem Setting

  • Task: Predict user preferences for unseen items
SLIDE 15

Content-based Filtering

[Figure: movies placed along two latent dimensions ("geared towards females" vs. "geared towards males", "serious" vs. "escapist"), e.g. The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility, with example users Gus and Dave]

Two Approaches:

  • 1. Predict rating using item features on a per-user basis
  • 2. Predict rating using user features on a per-item basis
SLIDE 16

Collaborative Filtering

[Figure: user Joe and items #1–#4]

Idea: Predict rating based on similarity to other users

SLIDE 17

Problem Setting

  • Task: Predict user preferences for unseen items
  • Content-based filtering: Model user/item features
  • Collaborative filtering: Implicit similarity of users or items
SLIDE 18

Applications of Recommender Systems

  • Movie recommendation (Netflix)
  • Related product recommendation (Amazon)
  • Web page ranking (Google)
  • Social recommendation (Facebook)
  • Priority inbox & spam filtering (Google)
  • Online dating (OK Cupid)
  • Computational Advertising (Everyone)
SLIDE 19

Challenges

  • Scalability
    • Millions of objects
    • 100s of millions of users
  • Cold start
    • Changing user base
    • Changing inventory
  • Imbalanced dataset
    • User activity / item reviews are power-law distributed
  • Ratings are not missing at random
SLIDE 20

Running Example: Netflix Data

Training data:

  score  date      movie  user
  1      5/7/02    21     1
  5      8/2/04    213    1
  4      3/6/01    345    2
  4      5/1/05    123    2
  3      7/15/02   768    2
  5      1/22/01   76     3
  4      8/3/00    45     4
  1      9/10/05   568    5
  2      3/5/03    342    5
  2      12/28/00  234    5
  5      8/11/02   76     6
  4      6/15/03   56     6

Test data:

  score  date      movie  user
  ?      1/6/05    62     1
  ?      9/13/04   96     1
  ?      8/18/05   7      2
  ?      11/22/05  3      2
  ?      6/13/02   47     3
  ?      8/12/01   15     3
  ?      9/1/00    41     4
  ?      8/27/05   28     4
  ?      4/4/05    93     5
  ?      7/16/03   74     5
  ?      2/14/04   69     6
  ?      10/3/03   83     6

  • Released as part of $1M competition by Netflix in 2006
  • Prize awarded to BellKor in 2009
SLIDE 21

Running Yardstick: RMSE

$$\text{rmse}(S) = \sqrt{|S|^{-1} \sum_{(i,u) \in S} (\hat{r}_{ui} - r_{ui})^2}$$

SLIDE 22

Running Yardstick: RMSE

$$\text{rmse}(S) = \sqrt{|S|^{-1} \sum_{(i,u) \in S} (\hat{r}_{ui} - r_{ui})^2}$$

(doesn’t tell you how to actually do recommendation)
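
As a concrete reference, here is a minimal NumPy sketch of this yardstick (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def rmse(r_hat, r):
    """Root mean squared error over a set S of (item, user) pairs.

    r_hat : predicted ratings for the pairs in S
    r     : observed ratings for the same pairs
    """
    r_hat = np.asarray(r_hat, dtype=float)
    r = np.asarray(r, dtype=float)
    return np.sqrt(np.mean((r_hat - r) ** 2))

# e.g. rmse([3.8, 2.1, 4.6], [4, 2, 5]) ≈ 0.26
```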

SLIDE 23

Content-based Filtering

SLIDE 24

Item-based Features


SLIDE 27

Per-user Regression

$$w_u = \arg\min_{w} \; \| r_u - X w \|^2$$

Learn a set of regression coefficients for each user.
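
A minimal sketch of this per-user regression, assuming a feature matrix X whose rows describe the items a given user has rated (the feature values here are made up for illustration):

```python
import numpy as np

def fit_user_weights(X, r_u):
    """Per-user least squares: w_u = argmin_w ||r_u - X w||^2.

    X   : (n_rated, n_features) feature rows of the items this user rated
    r_u : (n_rated,) the user's ratings of those items
    """
    w_u, *_ = np.linalg.lstsq(X, r_u, rcond=None)
    return w_u

# Toy example with three binary item features (e.g. genre indicators)
X = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)
r_u = np.array([4.0, 2.0, 3.0, 5.0])
w_u = fit_user_weights(X, r_u)
x_new = np.array([1.0, 0.0, 0.0])   # features of an unseen item
print(x_new @ w_u)                  # predicted rating for this user
```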

SLIDE 28

User Bias and Item Popularity

SLIDE 33

Bias

Problem: Some movies are universally loved / hated, and some users are more picky than others.

Solution: Introduce a per-movie and per-user bias.

[Figure: example ratings for Moonrise Kingdom (4 5 4 4 0.3 0.2 3 3 3)]
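
In the usual notation from the recommender-systems literature (an illustration consistent with the slide rather than a quote from it), the baseline prediction becomes a global mean plus per-user and per-movie offsets:

$$\hat{r}_{ui} = \mu + b_u + b_i$$

where μ is the overall mean rating, b_u captures how far user u's ratings tend to sit from the mean, and b_i how far item i's ratings tend to sit from the mean.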

SLIDE 34

Collaborative Filtering

SLIDE 35

Neighborhood Based Methods

[Figure: bipartite graph of user Joe and items #1–#4]

Users and items form a bipartite graph (edges are ratings)

SLIDE 36

Neighborhood Based Methods

(user, user) similarity

  • predict rating based on average from k-nearest users
  • good if item base is small
  • good if item base changes rapidly

(item, item) similarity

  • predict rating based on average from k-nearest items
  • good if the user base is small
  • good if user base changes rapidly
SLIDE 37

Parzen-Window Style CF

  • Define a similarity s_ij between items
  • Find the set ε_k(i,u) of k-nearest neighbors to i that were rated by user u
  • Predict rating using a weighted average over this set (see the sketch below)
  • How should we define s_ij?

[Figure: bipartite graph of user Joe and items #1–#4]
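
A minimal sketch of the weighted-average prediction described in the list above; the similarity function `sim` and the dict layout are illustrative assumptions:

```python
def predict_rating(i, user_ratings, sim, k=20):
    """Predict a user's rating of item i from the k most similar items they rated.

    user_ratings : dict mapping item j -> rating r_uj for items the user rated
    sim          : function sim(i, j) returning the similarity s_ij
    """
    # epsilon_k(i, u): the k nearest rated neighbours of i
    neighbours = sorted(user_ratings, key=lambda j: sim(i, j), reverse=True)[:k]
    num = sum(sim(i, j) * user_ratings[j] for j in neighbours)
    den = sum(abs(sim(i, j)) for j in neighbours)
    return num / den if den > 0 else None
```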

SLIDE 38

Pearson Correlation Coefficient

$$s_{ij} = \frac{\mathrm{Cov}[r_{ui}, r_{uj}]}{\mathrm{Std}[r_{ui}]\,\mathrm{Std}[r_{uj}]}$$

Each item is rated by a distinct set of users.

[Figure: user rating vectors for items i and j, with many missing entries]

SLIDE 39

(item,item) similarity

Empirical estimate of the Pearson correlation coefficient, where U(i, j) is the set of users who have rated both i and j:

$$\hat{\rho}_{ij} = \frac{\sum_{u \in U(i,j)} (r_{ui} - b_{ui})(r_{uj} - b_{uj})}{\sqrt{\sum_{u \in U(i,j)} (r_{ui} - b_{ui})^2 \;\sum_{u \in U(i,j)} (r_{uj} - b_{uj})^2}}$$

Regularize towards 0 for small support:

$$s_{ij} = \frac{|U(i, j)| - 1}{|U(i, j)| - 1 + \lambda}\, \hat{\rho}_{ij}$$

(Regularize towards the baseline for small neighborhoods.)
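
A sketch of this shrunk estimate, assuming the baseline-corrected ratings r_ui − b_ui are available as dicts keyed by user (the names and the default λ are illustrative):

```python
import math

def shrunk_item_similarity(resid_i, resid_j, lam=100.0):
    """Shrunk Pearson similarity s_ij between items i and j.

    resid_i, resid_j : dicts user -> (r_ui - b_ui) for each item
    lam              : shrinkage strength lambda
    """
    common = resid_i.keys() & resid_j.keys()        # U(i, j)
    n = len(common)
    if n < 2:
        return 0.0
    num = sum(resid_i[u] * resid_j[u] for u in common)
    den = math.sqrt(sum(resid_i[u] ** 2 for u in common) *
                    sum(resid_j[u] ** 2 for u in common))
    rho = num / den if den > 0 else 0.0
    return (n - 1) / (n - 1 + lam) * rho            # shrink towards 0 for small support
```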

SLIDE 40

Similarity for binary labels

Pearson correlation is not meaningful for binary labels (e.g. views, purchases, clicks).

Let m_i be the number of users acting on item i, m_ij the number of users acting on both i and j, and m the total number of users.

Jaccard similarity:

$$s_{ij} = \frac{m_{ij}}{\alpha + m_i + m_j - m_{ij}}$$

Observed / expected ratio:

$$s_{ij} = \frac{\text{observed}}{\text{expected}} \approx \frac{m_{ij}}{\alpha + m_i m_j / m}$$
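
Both measures only need co-occurrence counts; a minimal sketch following the slide's notation (the default α is an assumption):

```python
def jaccard_similarity(m_i, m_j, m_ij, alpha=1.0):
    """Jaccard-style similarity from counts of users acting on i, j, and both."""
    return m_ij / (alpha + m_i + m_j - m_ij)

def observed_expected_similarity(m_i, m_j, m_ij, m, alpha=1.0):
    """Observed co-occurrences over (roughly) the count expected under independence."""
    return m_ij / (alpha + m_i * m_j / m)
```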

SLIDE 41

Matrix Factorization Methods

SLIDE 43

Matrix Factorization

[Figure: example ratings row for Moonrise Kingdom (4 5 4 4 0.3 0.2)]

Idea: pose as (biased) matrix factorization problem

SLIDE 44

Matrix Factorization

[Figure: a ratings matrix (users × items, entries 1–5) and a rank-3 SVD approximation, factored into item factors and user factors]

SLIDE 45

Prediction

[Figure: the same rank-3 SVD approximation, now with an unobserved rating entry marked "?"]

SLIDE 46

Prediction

[Figure: the unobserved rating is predicted as 2.4 from the rank-3 SVD approximation]

SLIDE 47

SVD with missing values

[Figure: rating matrix with missing entries and its rank-3 factorization]

  • SVD isn't defined when entries are unknown
  • Pose as a regression problem
  • Regularize using the Frobenius norm
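
Written out, the regression-with-Frobenius-regularization idea is usually stated as the following objective over the observed entries (a standard formulation, not copied from the slide):

$$\min_{W,\,X} \;\sum_{(u,i)\;\text{observed}} \bigl(r_{ui} - w_u^\top x_i\bigr)^2 \;+\; \lambda \bigl(\|W\|_F^2 + \|X\|_F^2\bigr)$$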

SLIDE 48

Alternating Least Squares

[Figure: rating matrix and its rank-3 factorization]

  • SVD isn't defined when entries are unknown

(regress w_u given X)

SLIDE 49

Alternating Least Squares

[Figure: rating matrix and its rank-3 factorization]

  • SVD isn't defined when entries are unknown

(regress w_u given X)

L2 regularization gives a closed-form solution:

$$w = (X^\top X + \lambda I)^{-1} X^\top y$$

Remember ridge regression?

SLIDE 50

Alternating Least Squares

[Figure: rating matrix and its rank-3 factorization]

  • SVD isn't defined when entries are unknown

(alternate: regress x_i given W, regress w_u given X)
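
A compact NumPy sketch of these alternating ridge regressions; the rank k, λ, and iteration count are illustrative choices, not values from the lecture:

```python
import numpy as np

def als(R, M, k=3, lam=0.1, iters=20):
    """Alternating least squares for factorizing a partially observed rating matrix.

    R : (n_users, n_items) ratings (unobserved entries can hold any value)
    M : (n_users, n_items) boolean mask, True where a rating is observed
    Returns user factors W (n_users, k) and item factors X (n_items, k).
    """
    n_users, n_items = R.shape
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(n_users, k))
    X = rng.normal(scale=0.1, size=(n_items, k))
    reg = lam * np.eye(k)
    for _ in range(iters):
        for u in range(n_users):            # regress w_u given X (ridge, closed form)
            obs = M[u]
            if obs.any():
                A, y = X[obs], R[u, obs]
                W[u] = np.linalg.solve(A.T @ A + reg, A.T @ y)
        for i in range(n_items):            # regress x_i given W
            obs = M[:, i]
            if obs.any():
                A, y = W[obs], R[obs, i]
                X[i] = np.linalg.solve(A.T @ A + reg, A.T @ y)
    return W, X

# Predicted rating for user u, item i: W[u] @ X[i]
```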

SLIDE 51

Stochastic Gradient Descent

[Figure: rating matrix and its rank-3 factorization]

  • SVD isn't defined when entries are unknown
  • No need for locking
  • Multicore updates asynchronously (Recht, Re, Wright, 2012 - Hogwild)
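
For reference, the per-rating SGD step behind this slide is, in its standard form (η and λ are assumed hyperparameters; the Hogwild observation is that many threads can apply such steps asynchronously without locking):

```python
import numpy as np

def sgd_step(W, X, u, i, r_ui, eta=0.01, lam=0.1):
    """One stochastic gradient step on the squared error of a single rating r_ui."""
    w_u, x_i = W[u].copy(), X[i].copy()
    err = r_ui - w_u @ x_i                   # prediction error for this rating
    W[u] += eta * (err * x_i - lam * w_u)    # update user factors
    X[i] += eta * (err * w_u - lam * x_i)    # update item factors
```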

SLIDE 52

Sampling Bias

SLIDE 53

Ratings are not given at random

  • B. Marlin et al., "Collaborative Filtering and the Missing at Random Assumption"

[Figure: rating distributions for Yahoo! survey answers, Yahoo! music ratings, and Netflix ratings]

SLIDE 54

Ratings are not given at random

[Figure: the observed rating matrix r_ui (users × movies) alongside a binary indicator matrix c_ui marking which entries were rated; data for matrix factorization and regression]

SLIDE 55

Temporal Effects

SLIDE 56

Changes in user behavior

[Figure: rating behavior over time, with a change around 2004]

In 2004, Netflix changed its rating labels.

SLIDE 57

Temporal Effects

Are movies getting better with time?

SLIDE 58

Temporal Effects

Are movies getting better with time?

Solution: Model temporal effects in the bias terms, not in the weights.
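
One standard way to write this (an illustration in the spirit of the slide, not its exact notation) is to let the baseline terms vary with time while the factors stay fixed:

$$\hat{r}_{ui}(t) = \mu + b_u(t) + b_i(t) + w_u^\top x_i$$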

SLIDE 59

Netflix Prize

SLIDE 60

Netflix Prize

Training data

  • 100 million ratings, 480,000 users, 17,770 movies
  • 6 years of data: 2000-2005

Test data

  • Last few ratings of each user (2.8 million)
  • Evaluation criterion: Root Mean Square Error (RMSE)

Competition

  • 2,700+ teams
  • Netflix’s system RMSE: 0.9514
  • $1 million prize for a 10% improvement over Netflix's system
SLIDE 61

Improvements

[Figure: "Factor models: Error vs. #parameters", RMSE (0.875–0.91) against millions of parameters (10–100,000) for NMF, BiasSVD, SVD++, SVD v.2, SVD v.3, SVD v.4]

Add biases: do SGD, but also learn the biases μ, b_u and b_i.

SLIDE 62

Improvements

Account for the fact that ratings are not missing at random ("who rated what").

[Figure: "Factor models: Error vs. #parameters", same plot as above]


SLIDE 64

Improvements

Account for temporal effects.

[Figure: "Factor models: Error vs. #parameters", same plot as above]

Still pretty far from the 0.8563 grand prize target.

SLIDE 65

Winning Solution from BellKor

SLIDE 66

Last 30 days

June 26th submission triggers 30-day “last call”

SLIDE 67

BellKor fends off competitors by a hair


SLIDE 69

Ratings aren’t everything

[Figure: the Netflix interface in 2009 vs. in 2017]

  • Only simpler submodels (SVD, RBMs) implemented
  • Ratings eventually proved to be only weakly informative