
Scalable Machine Learning, 8. Recommender Systems. Alex Smola, Yahoo! Research and ANU. http://alex.smola.org/teaching/berkeley2012 (Stat 260, SP 12). Significant content courtesy of Yehuda Koren.


  1. Estimate unknown ratings as inner products of latent factors: approximate the rating matrix (items x users, entries 1-5, most missing) by the product of an item-factor matrix and a user-factor matrix. [Illustration: a rank-3 SVD approximation of the rating matrix.]

  2. Properties [Illustration: the same rank-3 factorization as on the previous slide.]
  • SVD is undefined for missing entries, so fit the observed entries only, via
    • stochastic gradient descent (faster), or
    • alternating optimization
  • Overfitting without regularization, particularly if there are fewer reviews than dimensions
  • Very popular on Netflix

  3. Factor models: error vs. #parameters. [Plot: RMSE (0.875-0.91) against millions of parameters (10 to 100,000) for NMF, BiasSVD, SVD++, and SVD v.2-v.4 at various factor dimensions; Netflix baseline: 0.9514, Prize: 0.8563.]

  4. Risk Minimization View
  • Objective function:
    minimize_{p,q} \sum_{(u,i) \in S} ( r_{ui} - \langle p_u, q_i \rangle )^2 + \lambda \left[ \| p \|_{Frob}^2 + \| q \|_{Frob}^2 \right]
  • Alternating least squares (good for MapReduce; a sketch follows below):
    p_u \leftarrow \left( \lambda \mathbf{1} + \sum_{i : (u,i) \in S} q_i q_i^\top \right)^{-1} \sum_{i : (u,i) \in S} q_i r_{ui}
    q_i \leftarrow \left( \lambda \mathbf{1} + \sum_{u : (u,i) \in S} p_u p_u^\top \right)^{-1} \sum_{u : (u,i) \in S} p_u r_{ui}
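To make the alternating step concrete, here is a minimal NumPy sketch of one least-squares pass, assuming a dense rating matrix R with a boolean mask of observed entries; the function name and layout are mine, not from the slides.

```python
import numpy as np

def als_user_pass(R, observed, Q, lam):
    """Solve for all user factors p_u with item factors Q held fixed.
    R: (users x items) ratings, observed: boolean mask of rated entries,
    lam: regularization weight lambda from the objective above."""
    k = Q.shape[1]
    P = np.zeros((R.shape[0], k))
    for u in range(R.shape[0]):
        rated = observed[u]                          # items i with (u, i) in S
        A = lam * np.eye(k) + Q[rated].T @ Q[rated]  # lam*1 + sum_i q_i q_i^T
        b = Q[rated].T @ R[u, rated]                 # sum_i q_i r_ui
        P[u] = np.linalg.solve(A, b)
    return P
```

The item pass is the same computation with the roles swapped, e.g. als_user_pass(R.T, observed.T, P, lam). Each row solve is independent, which is what makes the scheme MapReduce friendly.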

  5. Risk Minimization View
  • Objective function as on the previous slide
  • Stochastic gradient descent (much faster; a sketch follows below):
    p_u \leftarrow (1 - \lambda \eta_t) p_u + \eta_t q_i ( r_{ui} - \langle p_u, q_i \rangle )
    q_i \leftarrow (1 - \lambda \eta_t) q_i + \eta_t p_u ( r_{ui} - \langle p_u, q_i \rangle )
  • No need for locking
  • Multicore updates asynchronously (Recht, Re, Wright, 2011, Hogwild!)
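A minimal sketch of the corresponding SGD sweep, assuming ratings arrive as (user, item, value) triples; the names are illustrative.

```python
import numpy as np

def sgd_sweep(triples, P, Q, eta, lam):
    """One stochastic gradient descent pass over observed ratings,
    applying the two update equations from the slide in place."""
    for u, i, r in triples:
        err = r - P[u] @ Q[i]                        # r_ui - <p_u, q_i>
        p_new = (1 - lam * eta) * P[u] + eta * err * Q[i]
        Q[i] = (1 - lam * eta) * Q[i] + eta * err * P[u]
        P[u] = p_new
```

In the Hogwild! setting, several threads run this loop on shards of the triples against shared P and Q with no locks; occasional overwrites are tolerable because each update touches only one row of each matrix.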

  6. Theoretical Motivation

  7. de Finetti Theorem
  • Independent random variables:
    p(X) = \prod_{i=1}^m p(x_i)
  • Exchangeable random variables:
    p(X) = p(x_1, \dots, x_m) = p(x_{\pi(1)}, \dots, x_{\pi(m)}) for every permutation \pi
  • There exists a conditionally independent representation of exchangeable random variables:
    p(X) = \int dp(\theta) \prod_{i=1}^m p(x_i \mid \theta)
  This motivates latent variable models (a worked instance follows below).
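A classical worked instance of the theorem, assuming a Beta mixing measure (my choice of example, not from the slides):

```latex
% Beta-Bernoulli: exchangeable coin flips as a mixture of i.i.d. ones.
% Take \theta \sim \mathrm{Beta}(a,b) and x_i \mid \theta \sim \mathrm{Bernoulli}(\theta). Then
p(x_1, \dots, x_m)
  = \int_0^1 \mathrm{Beta}(\theta \mid a, b) \prod_{i=1}^m \theta^{x_i} (1 - \theta)^{1 - x_i} \, d\theta
  = \frac{B\bigl(a + \sum_i x_i,\; b + m - \sum_i x_i\bigr)}{B(a, b)}
% This depends on x only through \sum_i x_i, so it is permutation invariant:
% exchangeable, yet conditionally i.i.d. given the latent \theta.
```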

  8. Aldous-Hoover Factorization
  • Matrix-valued set of random variables. Example: the Erdos-Renyi graph model
    p(E) = \prod_{i,j} p(E_{ij})
  • Independently exchangeable on the matrix:
    p(E) = p(E_{11}, E_{12}, \dots, E_{mn}) = p(E_{\pi(1)\rho(1)}, E_{\pi(1)\rho(2)}, \dots, E_{\pi(m)\rho(n)})
  • Aldous-Hoover theorem (a sampling sketch follows below):
    p(E) = \int dp(\theta) \prod_{i=1}^m dp(u_i) \prod_{j=1}^n dp(v_j) \prod_{i,j} p(E_{ij} \mid u_i, v_j, \theta)
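A sampling sketch of the factorized form, with a logistic link chosen purely for illustration; nothing about the link is prescribed by the theorem.

```python
import numpy as np

def sample_exchangeable_matrix(m, n, k=3, seed=0):
    """Draw a binary m x n matrix from an Aldous-Hoover style model:
    independent row latents u_i and column latents v_j, then entries
    E_ij drawn independently given the pair (u_i, v_j)."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(m, k))                  # u_i ~ dp(u)
    V = rng.normal(size=(n, k))                  # v_j ~ dp(v)
    probs = 1.0 / (1.0 + np.exp(-(U @ V.T)))     # p(E_ij = 1 | u_i, v_j)
    return (rng.uniform(size=(m, n)) < probs).astype(int)
```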

  9. Aldous-Hoover Factorization
  • The rating matrix is (row, column) exchangeable
  • Draw latent variables per row and column
  • Draw matrix entries independently given the (row, column) pairs
  • Absence / presence of a rating is a signal
  • Can be extended to graphs with vertex attributes
  [Diagram: row variables v_1..v_5, column variables u_1..u_6, and a sparse set of observed entries e_{ij}]

  10. Aldous-Hoover variants
  • Jointly exchangeable matrix
    • Social network graphs
    • Draw vertex attributes first, then edges
  • Cold start problem
    • A new user appears
    • Attributes (age, location, browser)
    • Can estimate latent variables from these
  • User and item factors in the matrix factorization problem can be viewed as AH factors

  11. Improvements

  12. Factor models: error vs. #parameters. [Plot: same RMSE vs. millions-of-parameters curves as before; annotation "add biases" marks the drop from NMF to BiasSVD.]

  13. Bias
  • Objective function:
    minimize_{p,q,b,\mu} \sum_{(u,i) \in S} \left( r_{ui} - ( \mu + b_u + b_i + \langle p_u, q_i \rangle ) \right)^2 + \lambda \left[ \| p \|_{Frob}^2 + \| q \|_{Frob}^2 + \| b^{users} \|^2 + \| b^{items} \|^2 \right]
  • Stochastic gradient descent (a sketch follows below), where \rho_{ui} = r_{ui} - ( \mu + b_i + b_u + \langle p_u, q_i \rangle ):
    p_u \leftarrow (1 - \lambda \eta_t) p_u + \eta_t q_i \rho_{ui}
    q_i \leftarrow (1 - \lambda \eta_t) q_i + \eta_t p_u \rho_{ui}
    b_u \leftarrow (1 - \lambda \eta_t) b_u + \eta_t \rho_{ui}
    b_i \leftarrow (1 - \lambda \eta_t) b_i + \eta_t \rho_{ui}
    \mu \leftarrow (1 - \lambda \eta_t) \mu + \eta_t \rho_{ui}
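The corresponding single-rating update, as a sketch in the same style as the earlier sweep; the regularized update of mu follows the slide, although in practice the global mean is often left unregularized.

```python
def sgd_bias_update(u, i, r, P, Q, b_user, b_item, mu, eta, lam):
    """Apply the five biased-model updates from the slide for one rating."""
    rho = r - (mu + b_user[u] + b_item[i] + P[u] @ Q[i])   # residual rho_ui
    p_new = (1 - lam * eta) * P[u] + eta * rho * Q[i]
    Q[i] = (1 - lam * eta) * Q[i] + eta * rho * P[u]
    P[u] = p_new
    b_user[u] = (1 - lam * eta) * b_user[u] + eta * rho
    b_item[i] = (1 - lam * eta) * b_item[i] + eta * rho
    return (1 - lam * eta) * mu + eta * rho                # updated global mean
```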

  14. Factor models: error vs. #parameters. [Plot: same curves; annotation "who rated what" marks the gain of SVD++ over BiasSVD.]

  15. Ratings are not given at random. [Plots: rating distributions for Netflix ratings, Yahoo! music ratings, and Yahoo! survey answers.] B. Marlin et al., "Collaborative Filtering and the Missing at Random Assumption", UAI 2007.

  16. • Movie rating matrix r (movies x users) with the corresponding binary indicator matrix c, where c_{ui} = 1 iff rating r_{ui} is observed
  • Characterize users by which movies they rated: edge attributes (observed, rating)
  • Adding features to the recommender system regression (a sketch follows below):
    r_{ui} = \mu + b_u + b_i + \langle p_u, q_i \rangle + \langle c_u, x_i \rangle
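A sketch of the augmented prediction, assuming the "who rated what" features enter as a learned per-item vector x_i paired with the indicator row c_u; the array shapes are my reading of the formula, not from the slides.

```python
def predict_with_features(u, i, mu, b_user, b_item, P, Q, C, X):
    """r_ui = mu + b_u + b_i + <p_u, q_i> + <c_u, x_i>, where C is the
    (users x items) binary who-rated-what matrix and X[:, i] is the
    learned feature vector x_i, indexed over items like c_u."""
    return mu + b_user[u] + b_item[i] + P[u] @ Q[i] + C[u] @ X[:, i]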

  17. Alternative integration
  • Key idea: use related ratings to average
  • Salakhutdinov & Mnih, 2007:
    q_i \leftarrow q_i + \sum_u c_{ui} p_u
  • Koren et al., 2008:
    p_u \leftarrow p_u + \sum_j c_{uj} x_j
  • Overparametrize items by q and x

  18. Factor models: error vs. #parameters. [Plot: same curves; annotation "temporal effects" marks the gain of SVD v.2-v.4 over SVD++.]

  19. Something Happened in Early 2004 ... [Plot: Netflix ratings by date; a visible shift appears in early 2004, when Netflix changed its rating labels.]

  20. Are movies getting better with time?

  21. Sources of temporal change
  • Items
    • Seasonal effects (Christmas, Valentine's Day, holiday movies)
    • Public perception of movies (Oscars etc.)
  • Users
    • Changed labeling of reviews
    • Anchoring (relative to the previous movie)
    • Change of rater within a household
    • Selection bias for time of viewing

  22. Modeling temporal change
  • Time-dependent bias and time-dependent user preferences:
    r_{ui}(t) = \mu + b_u(t) + b_i(t) + \langle q_i, p_u(t) \rangle
  • Parameterize the functions b and p
  • Slow changes for items; fast, sudden changes for users
  • Good parametrization is key (a binning sketch follows below)
  Koren et al., KDD 2009 (collaborative filtering with temporal dynamics)
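One simple parametrization in the spirit of the slide: bin the item bias over time to capture slow drift, while user terms change on a finer scale. The binning scheme here is illustrative, not the one from the paper.

```python
import numpy as np

def item_bias_at(i, t, b_item_bins, t_min, t_max):
    """Piecewise-constant b_i(t): look up the time bin containing t.
    b_item_bins has shape (n_items, n_bins)."""
    n_bins = b_item_bins.shape[1]
    frac = (t - t_min) / (t_max - t_min)
    b = min(int(frac * n_bins), n_bins - 1)      # clamp t = t_max into the last bin
    return b_item_bins[i, b]
```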

  23. Bias matters
  Sources of variance in the Netflix data:
  • Biases: 33% (0.415)
  • Personalization: 10% (0.129)
  • Unexplained: 57% (0.732)
  Total variance: 0.732 (unexplained) + 0.415 (biases) + 0.129 (personalization) = 1.276

  24. Factor models: error vs. #parameters. [Plot: the same RMSE curves, bracketed by the starting and ending models of the progression: the plain factorization r_{ui} = q_i^\top p_u (Netflix baseline: 0.9514) and the full model r_{ui}(t) = \mu + b_u(t) + b_i(t) + q_i^\top ( p_u(t) + \sum_j c_{uj} x_j ) (Prize: 0.8563).]

  25. More ideas • Explain factorizations • Cold start (new users) • Different regularization for different parameter groups / different users • Sharing of statistical strength between users • Hierarchical matrix co-clustering / factorization (write a paper on that)

  26. Session Modeling

  27. Motivation

  28. User interaction
  • Explicit search query
    • Search engine
    • Genre selection on a movie site
  • Implicit search query
    • News site
    • Priority inbox
    • Comments on an article
    • Viewing a specific movie (see also ...)
  • Sponsored search (advertising)
  Space, users' time, and attention are limited.

  29. session? models?

  30. Did the user SCROLL DOWN?

  31. Bad ideas ...
  • Show items based on relevance only
    • Yes, this user likes Die Hard
    • But he likes other movies, too
  • Show items only for the majority of users
    • 'apple' vs. 'Apple'

  32. User response. [Screenshot: a content module with collapse controls; annotations mark "collapse" actions as implicit user interest. Log it!]

  33. [Screenshot: hover on link]

  34. Response is conditioned on available options
  • User searches for 'chocolate' [Screenshot: result list; annotations mark "user picks this" and "what the user really would have wanted"]
  • The user can only pick from available items
  • Preferences are often relative

  35. Models

  36. Independent click model
  • Each object has a click probability
  • Each object is viewed independently
  • Used in computational advertising (with some position correction)
  • A horribly wrong assumption; OK if the click probability is very small (true in ads)
    p(x \mid s) = \prod_{i=1}^n \frac{1}{1 + e^{-x_i s_i}}
  (a likelihood sketch follows below)
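A likelihood sketch, assuming labels x_i in {-1, +1} so that each factor is the logistic probability of the observed outcome:

```python
import numpy as np

def independent_click_loglik(x, s):
    """log p(x | s) = sum_i log sigma(x_i * s_i) for the independent model.
    x: array of +/-1 click labels, s: per-position scores."""
    x, s = np.asarray(x, float), np.asarray(s, float)
    return -np.sum(np.log1p(np.exp(-x * s)))
```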

  37. Logistic click model
  • The user picks at most one object (possibly no click)
  • Exponential family model for clicks, with an explicit no-click option:
    p(x \mid s) = \frac{e^{s_x}}{e^{s_0} + \sum_{x'} e^{s_{x'}}} = \exp( s_x - g(s) )
  • Ignores the order of objects
  • Assumes that the user looks at all options before taking action
  (a softmax sketch follows below)
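A softmax sketch with the no-click option made explicit as a score s_0 (here a parameter defaulting to 0, my convention):

```python
import numpy as np

def logistic_click_probs(s, s0=0.0):
    """Return (p_noclick, p_click_at_each_position) under the at-most-one-click
    model: p(x | s) = exp(s_x) / (exp(s0) + sum_x' exp(s_x'))."""
    z = np.concatenate(([s0], np.asarray(s, float)))
    z -= z.max()                                 # numerical stabilization
    p = np.exp(z)
    p /= p.sum()
    return p[0], p[1:]
```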

  38. Sequential click model
  • The user traverses the list (no click, no click, ..., click)
  • At each position there is some probability of clicking
  • When the user reaches the end of the list, he aborts
    p(x = j \mid s) = \frac{1}{1 + e^{-s_j}} \prod_{i=1}^{j-1} \frac{1}{1 + e^{s_i}}
  • This assumes a patient user who views all items (a sketch follows below)
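A direct transcription of the product formula, with zero-based position index j:

```python
import numpy as np

def sequential_click_prob(j, s):
    """p(x = j | s): skip positions 0..j-1, each with probability
    1 - sigma(s_i) = 1 / (1 + e^{s_i}), then click at position j
    with probability sigma(s_j)."""
    s = np.asarray(s, float)
    sigma = 1.0 / (1.0 + np.exp(-s))
    return sigma[j] * np.prod(1.0 - sigma[:j])
```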

  39. Skip click model
  • The user traverses the list (no click, no click, no click, no click, ...)
  • At each position there is some probability of clicking
  • At each position the user may abandon the process
  • This assumes the user traverses the list sequentially

  40. Context skip click model
  • The user traverses the list
  • At each position there is some probability of clicking, which depends on the previous content
  • At each position the user may abandon the process
  • The user may click more than once

  41. Context skip click model. [Diagram of the view/click transition structure.]

  42. Context skip click model
  • Viewing probability:
    p(v_i = 1 \mid v_{i-1} = 0) = 0   (the user is gone)
    p(v_i = 1 \mid v_{i-1} = 1, c_{i-1} = 0) = \frac{1}{1 + e^{-\alpha_i}}
    p(v_i = 1 \mid v_{i-1} = 1, c_{i-1} = 1) = \frac{1}{1 + e^{-\beta_i}}   (the user returns)
  • Click probability (only if viewed), conditioned on prior context, with logistic functional form:
    p(c_i = 1 \mid v_i = 1, c_{i-1}, d_i) = \frac{1}{1 + e^{-f(|c_{i-1}|, d_i, d_{i-1})}}
  (a simulation sketch follows below)
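A forward-simulation sketch of the view/click chain; click_prob stands in for the logistic f(|c_{i-1}|, d_i, d_{i-1}) term, whose concrete form comes on the next slide.

```python
import numpy as np

def simulate_session(alpha, beta, click_prob, rng):
    """Sample a click sequence: position i is clicked with click_prob(i, k),
    where k counts previous clicks; continuation to position i+1 uses the
    beta logit after a click and the alpha logit otherwise. Once the user
    abandons (v = 0), no later position is viewed."""
    clicks = []
    for i in range(len(alpha)):
        c = int(rng.random() < click_prob(i, sum(clicks)))
        clicks.append(c)
        logit = beta[i] if c else alpha[i]
        if rng.random() >= 1.0 / (1.0 + np.exp(-logit)):
            break                                # the user abandons the list
    return clicks
```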

  43. Incremental gains score
    f(|c_{i-1}|, d_i, d_{i-1}) := \rho(S, d_i \mid a, b) - \rho(S, d_{i-1} \mid a, b) + \gamma |c_{i-1}| + \delta_i
  where the gain \rho_j(d_i) - \rho_j(d_{i-1}) sums, over aspects j, terms of the form a_j ( \sum_{s \in S} [s]_j ) ( \sum_{d \in d_i} [d]_j + b_j ), plus \gamma |c_{i-1}| + \delta_i
  • Submodular gain per additional document
  • Relevance score per document
  • Coverage over different aspects
  • Position-dependent score
  • Score depends on the number of previous clicks

  44. Optimization
  • Latent variables: we do not know v, i.e. whether the user viewed a result
  • Use variational inference to integrate out v (more next week in graphical models):
    -\log p(c) \le -\log p(c) + D( q(v) \| p(v \mid c) )
               = E_{v \sim q(v)} [ -\log p(c) + \log q(v) - \log p(v \mid c) ]
               = E_{v \sim q(v)} [ -\log p(c, v) ] - H(q(v))

  45. Optimization • Compute latent viewing probability given clicks • Easy since we only have one transition from views to no views (no DP needed) • Expected log-likelihood under viewing model • Convex expected log-likelihood • Stochastic gradient descent • Parametrization uses personalization, too (user, position, viewport, browser)

  46. Feature Representation

  47. Bayesian Probabilistic Matrix Factorization

  48. Statistical Model
  • Aldous-Hoover factorization
  • Normal distributions for user and item attributes
  • Rating given by an inner product
  • Ratings:
    p(R_{ij} \mid U_i, V_j, \sigma^2) = N(R_{ij} \mid U_i^\top V_j, \sigma^2)
  • Latent factors:
    p(U \mid \sigma_U^2) = \prod_{i=1}^N N(U_i \mid 0, \sigma_U^2 I), \quad p(V \mid \sigma_V^2) = \prod_{j=1}^M N(V_j \mid 0, \sigma_V^2 I)
  [Graphical model: \sigma_U, \sigma_V above U_i, V_j; observed R_{ij}, i = 1..N, j = 1..M]
  Salakhutdinov & Mnih, ICML 2008 (Bayesian PMF); a Gibbs-step sketch follows below.
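Because the Gaussian prior and Gaussian likelihood above are conjugate, each user factor has a closed-form Gaussian conditional. Here is a sketch of that Gibbs step with fixed hyperparameters (no Wishart layer yet); the names are mine.

```python
import numpy as np

def gibbs_user_factor(r_i, V_rated, sigma2, sigma2_U, rng):
    """Sample U_i | R, V from its Gaussian conditional. r_i: the user's
    observed ratings, V_rated: item factors for those rated items."""
    k = V_rated.shape[1]
    precision = np.eye(k) / sigma2_U + V_rated.T @ V_rated / sigma2
    cov = np.linalg.inv(precision)
    mean = cov @ (V_rated.T @ r_i) / sigma2
    return rng.multivariate_normal(mean, cov)
```

Given V, all user factors are conditionally independent, so this step parallelizes across users exactly as the next slide describes.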

  49. Details
  • Priors on all factors
  • The Wishart prior is conjugate to the Gaussian, hence use it
  • Allows us to adapt the variance automatically
  • Inference (Gibbs sampler):
    • Sample user factors (parallel)
    • Sample movie factors (parallel)
    • Sample hyperparameters (parallel)
  [Graphical model: hyperpriors \alpha_U, \alpha_V on the parameters \Theta_U, \Theta_V of the factor distributions]

  50. Making it fancier (constrained BPMF). [Graphical model: per-item vectors W_k (k = 1..M) and the "who rated what" indicator I_i feed the user factor U_i alongside Y_i.]

  51. Results (Mnih & Salakhutdinov). [Plots: left, the distribution of users by number of observed ratings; right, RMSE by rating-count bucket (1-5, 6-10, ..., >641) for the movie average, PMF, and constrained PMF.] Constrained PMF helps most for infrequent users.

  52. Multiple Sources

  53. Social Network Data
  • Data: users, connections, features
  • Goal: suggest connections


  55. Social Network Data
  • Data: users, connections, features
  • Goal: suggest connections
  [Diagram: two users with features x, x' and labels y, y', joined by edge e]

  56. Social Network Data
  • Data: users, connections, features
  • Goal: model/suggest connections
    p(x, y, e) = \prod_{i \in \text{Users}} p(y_i) \, p(x_i \mid y_i) \prod_{i,j \in \text{Users}} p(e_{ij} \mid x_i, y_i, x_j, y_j)
  A direct application of the Aldous-Hoover theorem: edges are conditionally independent.

  57. Applications
