

  1. Data Mining and Matrices 04 – Matrix Completion Rainer Gemulla, Pauli Miettinen May 02, 2013

  2. Recommender systems
     Problem
     ◮ Set of users
     ◮ Set of items (movies, books, jokes, products, stories, ...)
     ◮ Feedback (ratings, purchase, click-through, tags, ...)
     ◮ Sometimes: metadata (user profiles, item properties, ...)
     Goal: Predict preferences of users for items
     Ultimate goal: Create item recommendations for each user
     Example (? = unknown rating):
                   Avatar   The Matrix   Up
        Alice        ?          4         2
        Bob          3          2         ?
        Charlie      5          ?         3

  3. Outline
     1 Collaborative Filtering
     2 Matrix Completion
     3 Algorithms
     4 Summary

  4. Collaborative filtering
     Key idea: Make use of past user behavior
     ◮ No domain knowledge required
     ◮ No expensive data collection needed
     ◮ Allows discovery of complex and unexpected patterns
     ◮ Widely adopted: Amazon, TiVo, Netflix, Microsoft
     Key techniques: neighborhood models, latent factor models
                   Avatar   The Matrix   Up
        Alice        ?          4         2
        Bob          3          2         ?
        Charlie      5          ?         3
     Leverage past behavior of other users and/or on other items.

  5. A simple baseline
     m users, n items, m × n rating matrix D
     Revealed entries Ω = { (i, j) | rating D_ij is revealed }, N = |Ω|
     Baseline predictor: b_ij = µ + b_i + b_j
     ◮ µ = (1/N) Σ_{(i,j)∈Ω} D_ij is the overall average rating
     ◮ b_i is a user bias (user's tendency to rate low/high)
     ◮ b_j is an item bias (item's tendency to be rated low/high)
     Least squares estimates: argmin_{b∗} Σ_{(i,j)∈Ω} (D_ij − µ − b_i − b_j)²
     Example (m = 3, n = 3, Ω = { (1,2), (1,3), (2,1), ... }, N = 6, µ = 3.17;
     user/item biases and baseline predictions in parentheses):
                        Avatar (1.01)   The Matrix (0.34)   Up (−1.32)
        Alice   (0.32)     ?  (4.5)          4  (3.8)         2  (2.1)
        Bob    (−1.34)     3  (2.8)          2  (2.2)         ?  (0.5)
        Charlie (0.99)     5  (5.2)          ?  (4.5)         3  (2.8)
     e.g., b_32 = 3.17 + 0.99 + 0.34 = 4.5
     Baseline does not account for personal tastes.
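
A minimal sketch (not part of the slides) of how these baseline estimates can be obtained: the toy data mirrors the example above, and since the bias decomposition is only determined up to a constant shift, the fitted numbers need not match the slide's rounded values exactly.

```python
# Minimal sketch (not from the slides): fit the baseline b_ij = mu + b_i + b_j
# by least squares on the revealed entries, using numpy.
import numpy as np

ratings = {                  # (user, item) -> rating; missing entries simply absent
    (0, 1): 4, (0, 2): 2,    # Alice
    (1, 0): 3, (1, 1): 2,    # Bob
    (2, 0): 5, (2, 2): 3,    # Charlie
}
m, n = 3, 3

mu = np.mean(list(ratings.values()))     # overall average rating (about 3.17)
A = np.zeros((len(ratings), m + n))      # one indicator column per user/item bias
y = np.zeros(len(ratings))
for k, ((i, j), d) in enumerate(ratings.items()):
    A[k, i] = 1.0                        # user-bias column
    A[k, m + j] = 1.0                    # item-bias column
    y[k] = d - mu
b, *_ = np.linalg.lstsq(A, y, rcond=None)
b_user, b_item = b[:m], b[m:]

# Baseline prediction for Charlie (user 2) and The Matrix (item 1); the bias
# split is not unique, so the numbers need not match the slide's rounding exactly.
print(mu + b_user[2] + b_item[1])
```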

  6. When does a user like an item?
     Neighborhood models (kNN): When he likes similar items
     ◮ Find the top-k most similar items the user has rated
     ◮ Combine the ratings of these items (e.g., average)
     ◮ Requires a similarity measure (e.g., Pearson correlation coefficient)
     ◮ Example: an item unrated by Bob is similar to one Bob rated 4 → predict 4
     Latent factor models (LFM): When similar users like similar items
     ◮ More holistic approach
     ◮ Users and items are placed in the same "latent factor space"
     ◮ Position of a user and an item related to preference (via dot products)
     [Figure: latent factor space spanned by "geared toward males/females" and
      "serious/escapist"; movies (Braveheart, Amadeus, The Color Purple, Lethal
      Weapon, Sense and Sensibility, Ocean's 11, The Lion King, Dumb and Dumber,
      The Princess Diaries, Independence Day) and users (Dave, Gus) are points.]
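
The neighborhood idea can be sketched in a few lines. The following item-based kNN predictor uses cosine similarity and a plain average, which are assumptions for illustration (the slides leave these choices open), on the toy rating matrix from the earlier slides.

```python
# Minimal sketch (assumed details, not from the slides): item-based kNN
# prediction with cosine similarity on the columns of a rating matrix.
import numpy as np

D = np.array([[np.nan, 4, 2],    # Alice
              [3, 2, np.nan],    # Bob
              [5, np.nan, 3]])   # Charlie

def item_similarity(a, b):
    """Cosine similarity over users who rated both items a and b."""
    mask = ~np.isnan(a) & ~np.isnan(b)
    if mask.sum() == 0:
        return 0.0
    va, vb = a[mask], b[mask]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-12))

def predict(D, user, item, k=2):
    """Average the user's ratings of the k items most similar to `item`."""
    sims = []
    for j in range(D.shape[1]):
        if j != item and not np.isnan(D[user, j]):
            sims.append((item_similarity(D[:, item], D[:, j]), D[user, j]))
    sims.sort(reverse=True)
    top = sims[:k]
    return float(np.mean([r for _, r in top])) if top else np.nan

print(predict(D, user=0, item=0))   # kNN prediction for Alice / Avatar
```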

  7. Intuition behind latent factor models (1)
     [Figure: the latent factor space of the previous slide, with one axis running
      from "geared toward males" to "geared toward females" and the other from
      "serious" to "escapist"; movies and users are placed as points in it.]
     Koren et al., 2009.

  8. Intuition behind latent factor models (2)
     Does user u like item v?
     Quality: measured via direction from origin (cos ∠(u, v))
     ◮ Same direction → attraction: cos ∠(u, v) ≈ 1
     ◮ Opposite direction → repulsion: cos ∠(u, v) ≈ −1
     ◮ Orthogonal direction → oblivious: cos ∠(u, v) ≈ 0
     Strength: measured via distance from origin (‖u‖‖v‖)
     ◮ Far from origin → strong relationship: ‖u‖‖v‖ large
     ◮ Close to origin → weak relationship: ‖u‖‖v‖ small
     Overall preference: measured via dot product (u · v)
        u · v = ‖u‖‖v‖ · (u · v)/(‖u‖‖v‖) = ‖u‖‖v‖ cos ∠(u, v)
     ◮ Same direction, far out → strong attraction: u · v large positive
     ◮ Opposite direction, far out → strong repulsion: u · v large negative
     ◮ Orthogonal direction, any distance → oblivious: u · v ≈ 0
     But how to select dimensions and where to place items and users?
     Key idea: Pick dimensions that explain the known data well.
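
A tiny numeric illustration (with made-up vectors) of this decomposition of the dot product into direction (cosine) and strength (norms):

```python
# Tiny illustration (hypothetical positions): preference as a dot product,
# decomposed into direction (cosine) and strength (product of norms).
import numpy as np

u = np.array([1.2, -0.5])        # user position in latent factor space
v = np.array([0.9, -0.8])        # item position in latent factor space

cos_uv = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))   # quality
strength = np.linalg.norm(u) * np.linalg.norm(v)           # strength
print(u @ v, strength * cos_uv)  # both print the same overall preference
```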

  9. SVD and missing values
     [Figure panels: the input data and its rank-10 truncated SVD; 10% of the
      input data and the corresponding rank-10 truncated SVD.]
     SVD treats missing entries as 0.
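
A minimal sketch of what "SVD treats missing entries as 0" means in practice: the rank-r truncated SVD is computed after unrevealed cells are filled with zeros, so the reconstruction is pulled toward 0 there. The zero-filling step and the toy data are assumptions for illustration.

```python
# Minimal sketch (assumed setup): rank-r truncated SVD applied to a rating
# matrix where unrevealed entries are filled with 0.
import numpy as np

def truncated_svd_completion(D, r):
    """Replace NaNs by 0, then keep the top-r singular triplets."""
    D0 = np.nan_to_num(D, nan=0.0)             # missing entries become 0
    U, s, Vt = np.linalg.svd(D0, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

D = np.array([[np.nan, 4, 2],
              [3, 2, np.nan],
              [5, np.nan, 3]], dtype=float)
print(truncated_svd_completion(D, r=1))        # pulled toward 0 at missing cells
```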

  10. Latent factor models and missing values
      [Figure panels: the input data and a rank-10 LFM fit; 10% of the input
       data and the corresponding rank-10 LFM fit.]
      LFMs "ignore" missing entries.

  11. Latent factor models (simple form)
      Given rank r, find an m × r matrix L and an r × n matrix R such that
      D_ij ≈ [LR]_ij for (i, j) ∈ Ω
      Least squares formulation:
         min_{L,R} Σ_{(i,j)∈Ω} (D_ij − [LR]_ij)²
      [Diagram: D ≈ LR; row L_i∗ times column R_∗j gives the entry [LR]_ij.]
      Example (r = 1; factor values and predictions in parentheses):
                          Avatar (2.24)   The Matrix (1.92)   Up (1.18)
         Alice   (1.98)      ?  (4.4)          4  (3.8)         2  (2.3)
         Bob     (1.21)      3  (2.7)          2  (2.3)         ?  (1.4)
         Charlie (2.30)      5  (5.2)          ?  (4.4)         3  (2.7)
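
A minimal sketch of fitting this simple-form model with stochastic gradient descent over the revealed entries; the step size, iteration count, and initialization are assumptions (the slides do not fix an algorithm at this point).

```python
# Minimal sketch (step size and iteration count are assumptions): fitting the
# simple-form LFM  min_{L,R} sum_{(i,j) in Omega} (D_ij - [LR]_ij)^2  by SGD.
import numpy as np

ratings = {(0, 1): 4, (0, 2): 2, (1, 0): 3, (1, 1): 2, (2, 0): 5, (2, 2): 3}
m, n, r = 3, 3, 1

rng = np.random.default_rng(0)
L = rng.normal(loc=1.0, scale=0.1, size=(m, r))
R = rng.normal(loc=1.0, scale=0.1, size=(r, n))

eta = 0.01                                  # step size
for epoch in range(2000):
    for (i, j), d in ratings.items():
        err = d - L[i] @ R[:, j]            # residual on one revealed entry
        Li = L[i].copy()
        L[i]    += eta * err * R[:, j]      # gradient step for the user row
        R[:, j] += eta * err * Li           # gradient step for the item column

print(np.round(L @ R, 1))                   # compare with the rank-1 example above
```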

  12. Example: Netflix prize data (≈ 500k users, ≈ 17k movies, ≈ 100M ratings)
      [Figure: movies plotted by their first two factor vectors (factor vector 1
       vs. factor vector 2), including Julien Donkey-Boy, The Royal Tenenbaums,
       Lost in Translation, Being John Malkovich, Kill Bill: Vol. 1, Citizen Kane,
       The Sound of Music, Armageddon, Catwoman, Coyote Ugly, Maid in Manhattan.]
      Koren et al., 2009.

  13. Latent factor models (summation form)
      Least squares formulation prone to overfitting
      More general summation form:
         L = Σ_{(i,j)∈Ω} l_ij(L_i∗, R_∗j) + R(L, R)
      ◮ L is the global loss
      ◮ L_i∗ and R_∗j are user and item parameters, resp.
      ◮ l_ij is the local loss, e.g., l_ij = (D_ij − [LR]_ij)²
      ◮ R is the regularization term, e.g., R = λ(‖L‖²_F + ‖R‖²_F)
      Loss function can be more sophisticated
      ◮ Improved predictors (e.g., include user and item bias)
      ◮ Additional feedback data (e.g., time, implicit feedback)
      ◮ Regularization terms (e.g., weighted depending on amount of feedback)
      ◮ Available metadata (e.g., demographics, genre of a movie)
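
The previous SGD sketch extends directly to the summation form; here the Frobenius-norm regularizer is included in each stochastic update. The rank, λ, step size, and iteration count are assumptions for illustration.

```python
# Minimal sketch (rank, lambda, step size and iteration count are assumptions):
# SGD for the summation form with local loss l_ij = (D_ij - [LR]_ij)^2 and
# regularizer R(L, R) = lambda * (|L|_F^2 + |R|_F^2).
import numpy as np

ratings = {(0, 1): 4, (0, 2): 2, (1, 0): 3, (1, 1): 2, (2, 0): 5, (2, 2): 3}
m, n, r, lam, eta = 3, 3, 2, 0.05, 0.01

rng = np.random.default_rng(0)
L = rng.normal(loc=1.0, scale=0.1, size=(m, r))
R = rng.normal(loc=1.0, scale=0.1, size=(r, n))

for epoch in range(3000):
    for (i, j), d in ratings.items():
        err = d - L[i] @ R[:, j]
        Li = L[i].copy()
        # gradient of the local loss plus (a per-entry share of) the regularizer
        L[i]    += eta * (err * R[:, j] - lam * L[i])
        R[:, j] += eta * (err * Li      - lam * R[:, j])

print(np.round(L @ R, 1))   # regularized low-rank reconstruction of the ratings
```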

  14. Example: Netflix prize data
      [Figure: root mean square error (RMSE) of predictions vs. millions of
       parameters, for plain factor models and variants with biases, implicit
       feedback, and temporal dynamics (v.1, v.2); RMSE ranges roughly from 0.91
       down to below 0.88 as the models get richer.]
      Koren et al., 2009.

  15. Outline
      1 Collaborative Filtering
      2 Matrix Completion
      3 Algorithms
      4 Summary

  16. The matrix completion problem
      Complete these matrices!
      [Two 5 × 5 matrices whose revealed entries are all 1, each with a different
       pattern of missing entries "?".]
      Matrix completion is impossible without additional assumptions!
      Let's assume that the underlying full matrix is "simple" (here: rank 1).
      Under this assumption, both matrices complete to the all-ones matrix:
         1 1 1 1 1
         1 1 1 1 1
         1 1 1 1 1
         1 1 1 1 1
         1 1 1 1 1
      When/how can we recover a low-rank matrix from a sample of its entries?
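
A toy illustration (with made-up numbers) of why a rank-1 assumption pins missing entries down: in a rank-1 matrix every 2 × 2 submatrix has zero determinant, so a revealed row and column determine the rest.

```python
# Toy illustration (hypothetical values): under a rank-1 assumption D_ij = u_i v_j,
# a missing entry follows from revealed entries in the same row and column:
# D[i][j] = D[i][k] * D[l][j] / D[l][k], provided all three are revealed and nonzero.
D = [[2.0, 4.0, None],                   # rank-1 matrix with one hidden entry
     [1.0, 2.0, 3.0]]
D[0][2] = D[0][0] * D[1][2] / D[1][0]    # = 2 * 3 / 1 = 6
print(D)                                 # the completed matrix [[2, 4, 6], [1, 2, 3]]
```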

  17. Rank minimization
      Definition (rank minimization problem)
      Given an n × n data matrix D and an index set Ω of revealed entries, the
      rank minimization problem is
         minimize    rank(X)
         subject to  X_ij = D_ij for (i, j) ∈ Ω,
                     X ∈ R^(n×n).
      ◮ Seeks the "simplest explanation" fitting the data
      ◮ If the solution is unique and the samples are sufficient, it recovers D (i.e., X = D)
      ◮ NP-hard
      Time complexity of existing rank minimization algorithms: double
      exponential in n (and also slow in practice).

  18. Nuclear norm minimization
      Rank: rank(D) = |{ 1 ≤ k ≤ n : σ_k(D) > 0 }| = Σ_{k=1}^n I[σ_k(D) > 0]
      Nuclear norm: ‖D‖_∗ = Σ_{k=1}^n σ_k(D)
      Definition (nuclear norm minimization)
      Given an n × n data matrix D and an index set Ω of revealed entries, the
      nuclear norm minimization problem is
         minimize    ‖X‖_∗
         subject to  X_ij = D_ij for (i, j) ∈ Ω,
                     X ∈ R^(n×n).
      ◮ A heuristic for rank minimization
      ◮ The nuclear norm is a convex function (thus a local optimum is a global optimum)
      ◮ Can be optimized (more) efficiently via semidefinite programming.
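
A minimal sketch of the convex program, assuming the cvxpy package (cp.normNuc is its nuclear-norm atom) and its default conic solver; the random low-rank test matrix and the 50% sampling rate are assumptions for illustration.

```python
# Minimal sketch (assumes the cvxpy package; default conic solver): nuclear
# norm minimization as a convex surrogate for rank minimization.
import cvxpy as cp
import numpy as np

n, r = 20, 2
rng = np.random.default_rng(0)
D = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))   # low-rank ground truth
mask = rng.random((n, n)) < 0.5                         # reveal roughly half the entries

X = cp.Variable((n, n))
constraints = [X[i, j] == D[i, j] for i, j in np.argwhere(mask)]
prob = cp.Problem(cp.Minimize(cp.normNuc(X)), constraints)
prob.solve()

# For incoherent low-rank matrices with enough revealed entries, the minimum
# nuclear norm completion typically recovers D up to numerical accuracy.
print(np.max(np.abs(X.value - D)))
```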
