
One-Pass Ranking Models for Low-Latency Product Recommendations



  1. One-Pass Ranking Models for Low-Latency Product Recommendations. Martin Saveski (@msaveski), MIT (Amazon Berlin)

  2. One-Pass Ranking Models for Low-Latency Product Recommendations. Amazon Machine Learning Team, Berlin: Antonino Freno, Rodolphe Jenatton, Cédric Archambeau

  3. Product Recommendations

  10. Product Recommendations: Constraints
      1. Large # of examples, large # of features → small memory footprint
      2. Drifting distribution → fast training time
      3. Real-time ranking (< few ms) → low prediction latency

  13. Our approach (Product Recommendations)
      Requirements: small memory footprint, fast training time, low prediction latency
      Techniques: stochastic optimization, one-pass learning, sparse models

  14. Learning Ranking Functions

  16. Learning Ranking Functions
      Three broad families of models:
      1. Pointwise (Logistic regression)
      2. Pairwise (RankSVM)
      3. Listwise (ListNet)
      Loss functions:
      • Evaluation functions (NDCG)
      • Surrogate functions

  17. Loss Function Lambda Rank (Burges et al., 2007)

  22. Loss Function: Lambda Rank (Burges et al., 2007)
      Example: Products 1 to 4 with feature vectors $x_1, \dots, x_4$ ($X$: features) and ground-truth ranks 1, 1, 2, 3 ($r$: ground-truth rank); $i$ and $j$ mark a pair of products.
      Importance of sorting $i$ and $j$ correctly: $\Delta M = M(r) - M(r_{i/j})$
      Difference in scores: $\Delta S = \max\{0, w^T x_j - w^T x_i\}$
      Loss: $L(X; w) = \sum_{r_i \le r_j} \Delta M \cdot \Delta S$
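
The loss on this slide can be traced by hand on a toy example. The sketch below is one possible reading, not the authors' implementation: it assumes $M$ is NDCG, that relevance is derived from the ground-truth rank as "worst rank minus rank", and that $r_{i/j}$ means the ideal ordering with items $i$ and $j$ swapped. The function names are hypothetical.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a list ordered from top to bottom."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances))

def delta_m(ranks, i, j):
    """M(r) - M(r_{i/j}) with M = NDCG: the drop caused by swapping i and j."""
    worst = max(ranks)
    order = sorted(range(len(ranks)), key=lambda k: ranks[k])  # ideal ordering
    rels = [worst - ranks[k] for k in order]  # assumption: relevance = worst rank - rank
    ideal = dcg(rels)
    if ideal == 0:
        return 0.0
    pi, pj = order.index(i), order.index(j)
    rels[pi], rels[pj] = rels[pj], rels[pi]  # swap the two items
    return (ideal - dcg(rels)) / ideal

def dot(w, x):
    return sum(a * b for a, b in zip(w, x))

def pairwise_loss(X, ranks, w):
    """L(X; w): sum over pairs with r_i <= r_j of Delta M * Delta S."""
    scores = [dot(w, x) for x in X]
    loss = 0.0
    for i in range(len(X)):
        for j in range(len(X)):
            if i != j and ranks[i] <= ranks[j]:
                delta_s = max(0.0, scores[j] - scores[i])  # hinge on the score gap
                loss += delta_m(ranks, i, j) * delta_s
    return loss

# Toy data: four products, one feature each, ground-truth ranks 1, 1, 2, 3.
X = [[3.0], [3.0], [2.0], [1.0]]
ranks = [1, 1, 2, 3]
print(pairwise_loss(X, ranks, [1.0]))   # weights that order the products correctly
print(pairwise_loss(X, ranks, [-1.0]))  # weights that reverse the order
```

Tied pairs ($r_i = r_j$) contribute nothing under this reading, because swapping two equally relevant items leaves NDCG unchanged.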

  26. ElasticRank: Introducing Sparsity
      Adding $\ell_1$ and $\ell_2$ penalties: $L^*(X; w) = L(X; w) + \lambda_1 \|w\|_1 + \frac{1}{2} \lambda_2 \|w\|_2^2$
      Both $\lambda_1$ and $\lambda_2$ control model complexity:
      • $\lambda_1$ trades off sparsity and performance
      • $\lambda_2$ adds strong convexity & improves convergence
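
As a small illustration of the regularized objective, the sketch below evaluates the two penalty terms for a given weight vector. The function names are hypothetical, and `data_loss` stands in for whatever value $L(X; w)$ takes; nothing here is taken from the authors' code.

```python
def elastic_penalty(w, lam1, lam2):
    """lam1 * ||w||_1 + 0.5 * lam2 * ||w||_2^2 for a list of weights."""
    l1 = sum(abs(v) for v in w)
    l2_squared = sum(v * v for v in w)
    return lam1 * l1 + 0.5 * lam2 * l2_squared

def regularized_loss(data_loss, w, lam1, lam2):
    """L*(X; w) = L(X; w) + elastic-net penalty; data_loss plays the role of L(X; w)."""
    return data_loss + elastic_penalty(w, lam1, lam2)

# Example: w = (3, -4) has ||w||_1 = 7 and ||w||_2^2 = 25.
print(elastic_penalty([3.0, -4.0], lam1=0.1, lam2=0.2))
```

Setting `lam2` to zero leaves a pure $\ell_1$ penalty, and setting `lam1` to zero leaves a pure squared $\ell_2$ penalty.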

  27. Optimization Algorithms Extensions of Stochastic Gradient Descent

  30. Optimization Algorithms: Extensions of Stochastic Gradient Descent
      FOBOS: Forward-Backward Splitting (Duchi, 2009)
      1. Gradient step
      2. Proximal step involving the regularization
      RDA: Regularized Dual Averaging (Xiao, 2010)
      • Keeps a running average of all past gradients
      • Solves a proximal step using the average
      pSGD: Pruned Stochastic Gradient Descent
      • Prunes every $k$ gradient steps
      • If $|w_i| < \theta \Rightarrow w_i = 0$
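
The three update rules can be sketched for an elastic-net penalty as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the step size `eta`, the RDA constant `gamma` with a $\gamma/\sqrt{t}$ proximal weight (the schedule used in Xiao's $\ell_1$-RDA), and all function names are assumptions. The closed forms come from minimizing a quadratic plus an $\ell_1$ term coordinate by coordinate.

```python
import math

def soft_threshold(v, tau):
    """Shrink v toward zero by tau; values inside [-tau, tau] become exactly 0."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0

def fobos_step(w, grad, eta, lam1, lam2):
    """FOBOS: gradient step, then the proximal step of the elastic-net penalty."""
    out = []
    for wi, gi in zip(w, grad):
        v = wi - eta * gi                                # 1. gradient step
        out.append(soft_threshold(v, eta * lam1) / (1.0 + eta * lam2))  # 2. proximal step
    return out

def update_average(grad_avg, grad, t):
    """Running average of gradients after seeing the t-th gradient (t starts at 1)."""
    return [a + (g - a) / t for a, g in zip(grad_avg, grad)]

def rda_step(grad_avg, t, lam1, lam2, gamma):
    """RDA: weights are recomputed from the averaged gradient alone."""
    c = lam2 + gamma / math.sqrt(t)  # assumed proximal weight gamma / sqrt(t)
    return [-soft_threshold(g, lam1) / c for g in grad_avg]

def psgd_step(w, grad, eta, step, k, theta):
    """pSGD: plain SGD step; every k steps, zero out weights with |w_i| < theta."""
    w = [wi - eta * gi for wi, gi in zip(w, grad)]
    if step % k == 0:
        w = [0.0 if abs(wi) < theta else wi for wi in w]
    return w

# Small weights are driven to exactly zero by the proximal or pruning step.
print(fobos_step([1.0, 0.05], [0.0, 0.0], eta=0.1, lam1=1.0, lam2=0.0))
print(rda_step([0.5, -2.0], t=4, lam1=1.0, lam2=0.0, gamma=2.0))
print(psgd_step([0.5, 0.01], [0.0, 0.0], eta=0.1, step=10, k=5, theta=0.05))
```

In this sketch FOBOS and RDA produce exact zeros through soft-thresholding, while pSGD only produces them at pruning steps.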

  31. Hyper-parameter Optimization
      • Turn-key inference
      • Automatic adjustment of hyper-parameters
      • Bayesian approach (Snoek, Larochelle, Adams; 2012)
      • Gaussian Process
      • Thompson Sampling

  32. LETOR Experiments: ElasticRank is comparable with state-of-the-art models
      [Figure: bar chart of NDCG@5 on OHSUMED, TD2003, and TD2004 for Logistic Regression, RankSVM, ListNet, and ElasticRank]

  33. Amazon.com Experiments: Experimental Setup
      • # examples ≈ millions
      • # features N ≈ thousands (millions of dimensions)
      • Purchase logs from a contiguous time interval
      • Split: Training 9/11, Validation 1/11, Testing 1/11

  34. Experimental Results: ElasticRank performs best
      [Figure: Recall@1 for ElasticRank (pSGD), ElasticRank (FOBOS), ElasticRank (RDA), RankSVM, and Logistic Regression]

  35. Sparsity vs Performance: RDA achieves the best trade-off
      [Figure: Recall@1 (axis from 0.26 to 0.305) against number of weights (ticks at 1, 4, 16, 64, 256, 1024) for pSGD, FOBOS, and RDA]

  36. Prediction Time
      [Figure: bar chart of prediction time in microseconds against number of weights: 4 weights, 6.2 μs; 29 weights, 8.7 μs; 1804 weights, 10.9 μs]
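
One way to see why prediction time can shrink with the number of weights is to score with a sparse representation. The sketch below assumes, hypothetically, that the model keeps only its non-zero weights in a dict keyed by feature name; the feature names and function names are made up for illustration and do not come from the talk.

```python
def sparse_score(weights, features):
    """Dot product that visits only the non-zero weights of the model."""
    return sum(value * features.get(name, 0.0) for name, value in weights.items())

def rank_products(weights, candidates):
    """Return product ids sorted from highest to lowest score."""
    return sorted(candidates,
                  key=lambda pid: sparse_score(weights, candidates[pid]),
                  reverse=True)

# Hypothetical model with two non-zero weights; "other" is ignored at scoring time.
weights = {"price_drop": 2.0, "in_stock": 1.0}
candidates = {
    "A": {"price_drop": 0.1, "in_stock": 1.0, "other": 5.0},
    "B": {"price_drop": 1.0},
}
print(rank_products(weights, candidates))
```

The work per product is proportional to the number of stored weights, so a model pruned to a handful of weights touches only a handful of features per candidate.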

  37. Contributions
      How to learn ranking functions with
      • Single pass
      • Small memory footprint
      • Sparse
      WITHOUT sacrificing performance

  38. References
      • C. J. C. Burges, R. Ragno, and Q. V. Le. Learning to rank with nonsmooth cost functions. In Advances in Neural Information Processing Systems (NIPS), 2006.
      • J. C. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research (JMLR), 2009.
      • L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research (JMLR), 2010.
      • J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems (NIPS), 2012.

