Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent
Rainer Gemulla, Peter J. Haas, Yannis Sismanis, Erik Nijkamp
August 23, 2011
Outline
◮ Matrix Factorization
◮ Stochastic Gradient Descent
◮ Distributed SGD with MapReduce
◮ Experiments
◮ Summary
Collaborative Filtering
◮ Problem
  ◮ Set of users
  ◮ Set of items (movies, books, jokes, products, stories, ...)
  ◮ Feedback (ratings, purchases, click-throughs, tags, ...)
  ◮ Predict additional items a user may like
  ◮ Assumption: similar feedback =⇒ similar taste
◮ Example

              Avatar   The Matrix   Up
    Alice       ?          4         2
    Bob         3          2         ?
    Charlie     5          ?         3

◮ Netflix competition: 500k users, 20k movies, 100M movie ratings, 3M question marks
Semantic Factors (Koren et al., 2009)
[Figure: two-dimensional latent factor space. Horizontal axis: geared toward males ↔ geared toward females; vertical axis: serious ↔ escapist. Movies such as Braveheart, Amadeus, The Color Purple, Lethal Weapon, Sense and Sensibility, Ocean's 11, The Lion King, Dumb and Dumber, The Princess Diaries, and Independence Day are placed in this space, along with hypothetical users Dave and Gus.]
Latent Factor Models
◮ Discover latent factors (r = 1): one latent factor per user (left) and per movie (top); predicted ratings, shown in parentheses, are products of the corresponding factors

                     Avatar     The Matrix   Up
                     (2.24)     (1.92)       (1.18)
    Alice   (1.98)   ? (4.4)    4 (3.8)      2 (2.3)
    Bob     (1.21)   3 (2.7)    2 (2.3)      ? (1.4)
    Charlie (2.30)   5 (5.2)    ? (4.4)      3 (2.7)

◮ Minimum loss

      min_{W,H} Σ_{(i,j)∈Z} ( V_ij − [WH]_ij )²

◮ With bias

      min_{W,H,u,m} Σ_{(i,j)∈Z} ( V_ij − µ − u_i − m_j − [WH]_ij )²

◮ With bias and regularization

      min_{W,H,u,m} Σ_{(i,j)∈Z} ( V_ij − µ − u_i − m_j − [WH]_ij )² + λ( ‖W‖ + ‖H‖ + ‖u‖ + ‖m‖ )

◮ With bias, regularization, and time

      min_{W,H,u,m} Σ_{(i,j,t)∈Z} ( V_ij − µ − u_i(t) − m_j(t) − [W(t)H]_ij )² + λ( ‖W(t)‖ + ‖H‖ + ‖u(t)‖ + ‖m(t)‖ )
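The rank-1 example on this slide can be checked numerically. A minimal sketch (factor values copied from the slide; NumPy assumed, no bias or regularization terms):

```python
import numpy as np

# Rank-1 factors from the example (r = 1): one latent factor
# per user (rows of W) and per movie (columns of H).
W = np.array([[1.98], [1.21], [2.30]])   # Alice, Bob, Charlie
H = np.array([[2.24, 1.92, 1.18]])       # Avatar, The Matrix, Up

# Predicted rating matrix: every cell, including the unknown ones.
pred = W @ H

# Observed ratings; NaN marks the "?" entries to be predicted.
V = np.array([[np.nan, 4.0,    2.0],
              [3.0,    2.0,    np.nan],
              [5.0,    np.nan, 3.0]])

# Training loss: squared error summed over observed cells Z only.
Z = ~np.isnan(V)
loss = np.sum((V[Z] - pred[Z]) ** 2)
```

Rounding `pred` to one decimal reproduces the parenthesized values on the slide, including the predictions 4.4, 1.4, and 4.4 for the three question marks.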
Generalized Matrix Factorization
◮ A general machine learning problem
  ◮ Recommender systems, text indexing, face recognition, ...
◮ Training data
  ◮ V: m × n input matrix (e.g., rating matrix)
  ◮ Z: training set of indexes in V (e.g., subset of known ratings)
◮ Parameter space
  ◮ W: row factors (e.g., m × r latent customer factors)
  ◮ H: column factors (e.g., r × n latent movie factors)
◮ Model
  ◮ L_ij(W_i∗, H_∗j): loss at element (i, j)
  ◮ Includes prediction error, regularization, auxiliary information, ...
  ◮ Constraints (e.g., non-negativity)
◮ Find best model

      argmin_{W,H} Σ_{(i,j)∈Z} L_ij(W_i∗, H_∗j)

[Figure: V ≈ WH; row W_i∗ and column H_∗j together determine entry V_ij]
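The key structural property of this objective is that the total loss is a sum of local losses, each touching only one row of W and one column of H. The slide leaves L_ij abstract; the sketch below uses one illustrative choice (squared error plus an L2 penalty on the touched factors) to make the decomposition concrete:

```python
import numpy as np

def local_loss(w_i, h_j, v_ij, lam=0.05):
    """L_ij: loss at a single cell (i, j), depending only on row
    factor W_i* and column factor H_*j. The squared-error-plus-L2
    form is an illustrative assumption; L_ij is abstract on the slide."""
    err = v_ij - float(w_i @ h_j)
    return err ** 2 + lam * (float(w_i @ w_i) + float(h_j @ h_j))

def total_loss(V, Z, W, H, lam=0.05):
    """Objective: sum of local losses over the training set Z,
    where Z is a list of (i, j) index pairs of known cells."""
    return sum(local_loss(W[i], H[:, j], V[i, j], lam) for i, j in Z)
```

This per-cell decomposition is exactly what stochastic gradient descent exploits on the following slides: a single training point (i, j) yields a gradient that updates only W_i∗ and H_∗j.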
Successful Applications
◮ Movie recommendation (Netflix, competition papers)
  ◮ >12M users, >20k movies, 2.4B ratings (projected)
  ◮ 36GB data, 9.2GB model (projected)
  ◮ Latent factor model
◮ Website recommendation (Microsoft, WWW10)
  ◮ 51M users, 15M URLs, 1.2B clicks
  ◮ 17.8GB data, 161GB metadata, 49GB model
  ◮ Gaussian non-negative matrix factorization
◮ News personalization (Google, WWW07)
  ◮ Millions of users, millions of stories, ? clicks
  ◮ Probabilistic latent semantic indexing

Distributed processing is necessary!
◮ Big data
◮ Large models
◮ Expensive computations
Outline
◮ Matrix Factorization
◮ Stochastic Gradient Descent
◮ Distributed SGD with MapReduce
◮ Experiments
◮ Summary
Stochastic Gradient Descent
◮ Find minimum θ∗ of function L
◮ Pick a starting point θ₀
[Figure: contour plot of L over two parameters, marking the minimum θ∗ and the starting point θ₀]
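Applied to the matrix factorization loss, the procedure above becomes: pick a random training cell, compute the gradient of that single cell's loss, and take a small step against it. A minimal sequential sketch (plain squared-error loss; no bias terms, regularization, or step-size schedule; the factor of 2 in the gradient is absorbed into the step size eta):

```python
import numpy as np

def sgd_factorize(V, Z, r=1, steps=50000, eta=0.01, seed=0):
    """Sequential SGD for min_{W,H} sum_{(i,j) in Z} (V_ij - [WH]_ij)^2.
    Z is a list of (i, j) index pairs of observed cells."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    # Starting point theta_0: small random factors.
    W = rng.standard_normal((m, r)) * 0.1
    H = rng.standard_normal((r, n)) * 0.1
    for _ in range(steps):
        i, j = Z[rng.integers(len(Z))]          # pick a random training point
        err = V[i, j] - W[i] @ H[:, j]          # local prediction error
        w_row = W[i].copy()                     # save before updating
        W[i]    += eta * err * H[:, j]          # step against the local gradient
        H[:, j] += eta * err * w_row
    return W, H
```

Only the touched row W_i∗ and column H_∗j change per step, which is what later makes the algorithm amenable to distribution.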