  1. ML for the industry – Part 1. MLSS 2016 – Cádiz. Nicolas Le Roux, Criteo

  2. Why such a class? • Companies are an ever-growing opportunity for ML researchers • Academics know about the publications of these companies • ...but not about the less academically-visible research

  3. A new zoology of problems • Most academic literature is about predictive performance • What about: • Optimisation of decision-making? • Increasing operational efficiency? • Predictive performance under operational constraints?

  4. The 3 stages of the academia-to-industry move 1. I will use model X, which will greatly improve the results (enthusiasm) 2. No new model is useful, this is pointless (disillusionment) 3. So many open questions, I do not know where to start (acceptance)

  5. Criteo – an example amongst many • We buy advertising spaces on websites • We display ads for our partners • We get paid if the user clicks on the ad

  6. [Bar chart: TOTAL NODES per cluster (Cluster NL, PreProd, Cluster FR), with node counts ranging from about a dozen to over 2,600.]

  7. Retargeting – an example

  8. In practice 1. A user lands on a webpage 2. The website contacts Criteo and its competitors 3. It is an auction: each competitor announces how much it bids 4. The highest bidder wins the right to display an ad

  9. Details of the auction • Real-time bidding (RTB) • Second-price auction: the winner pays the second-highest price • Optimal strategy: bid the expected gain • Expected gain = price per click (CPC) × probability of click (CTR)
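The auction logic above is simple enough to sketch in a few lines. Below is a minimal, hypothetical Python illustration (the function names and numbers are made up for the example, not Criteo's code): each bidder bids its expected gain CPC × pCTR, and the winner pays the runner-up's bid.

```python
def optimal_bid(cpc, pctr):
    """In a second-price auction the dominant strategy is to bid the
    expected gain: price per click (CPC) * probability of click (pCTR)."""
    return cpc * pctr

def run_auction(bids):
    """The highest bidder wins but pays the second-highest bid."""
    ranked = sorted(range(len(bids)), key=lambda i: bids[i], reverse=True)
    winner = ranked[0]
    price_paid = bids[ranked[1]] if len(bids) > 1 else bids[winner]
    return winner, price_paid

bids = [optimal_bid(0.50, 0.02),   # bidder 0: expected gain 0.010
        optimal_bid(0.80, 0.01),   # bidder 1: expected gain 0.008
        optimal_bid(0.30, 0.05)]   # bidder 2: expected gain 0.015
print(run_auction(bids))           # bidder 2 wins and pays ~0.01, the second price
```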

  10. What to do once we win the display? • We are now directly in contact with the website • Choose the best products • Choose the color, the font and the layout

  11. Identified ML problems • Prediction problem: click/no click • Recommendation problem: find the top products

  12. What is the input? • The list of data we can collect about the user and the context • Time since last visit, current URL, etc. • There is potentially no limit to the number of variables in X

  13. Choosing a model class • Response time is critical • There is little signal to predict clicks: we need to add features often • Solution: a logistic regression – pCTR = σ(wᵀx)
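As a concrete reading of that formula, here is a minimal sketch of the model class in plain Python (illustrative only): the predicted click-through rate is the sigmoid of a weighted sum of features.

```python
import math

def pctr(w, x):
    """Logistic regression: pCTR = sigma(w^T x) = 1 / (1 + exp(-w^T x))."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

print(pctr([-1.2, -3.4], [1.0, 0.0]))  # ~0.23, driven by the single active feature
```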

  14. A major difference • Structured data: lots of info in the data, high predictability, highly structured info • Unstructured data: poor predictability, signal dominated by noise, highly unstructured info

  15. Dealing with many modalities • Some variables can take many different values • CurrentURL • List of articles read • List of items seen

  16. Idea 1: one-hot encoding + dictionary • Associate each entry with an index i • x = [0 0 0 ... 0 1 0 ... 0 0], a P-dimensional vector (positions 0 to P−1) with a single 1 at position i

  17. Idea 1: one-hot encoding + dictionary • Associate each entry with an index i • x = [0 0 0 ... 0 1 0 ... 0 0], a P-dimensional vector with a single 1 at position i • pCTR = σ(wᵀx) = σ(w_i)
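A sketch of Idea 1, with hypothetical helper names (not Criteo's implementation): a dictionary assigns every new value an index, and because x is one-hot, the dot product wᵀx collapses to reading a single weight.

```python
import math

dictionary = {}   # value -> index; grows every time an unseen value appears
weights = []      # one weight per known value, learned elsewhere

def index_of(value):
    if value not in dictionary:
        dictionary[value] = len(dictionary)
        weights.append(0.0)           # a brand-new feature starts at 0
    return dictionary[value]

def pctr_one_hot(value):
    # x is one-hot at index i, so pCTR = sigma(w^T x) = sigma(w_i)
    i = index_of(value)
    return 1.0 / (1.0 + math.exp(-weights[i]))

print(pctr_one_hot("http://google.com"))  # 0.5 until the weight is trained
```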

  18. Building a dictionary
      i            | URL                           | w_i
      0            | http://google.com             | -1.2
      1            | http://facebook.com           | -3.4
      …            | …                             | …
      129547171991 | http://thiswebsiteisgreat.com | -0.5

  19. Building a dictionary
      i            | URL                            | w_i
      0            | http://google.com              | -1.2
      1            | http://facebook.com            | -3.4
      …            | …                              | …
      129547171991 | http://thiswebsiteisgreat.com  | -0.5
      129547171992 | http://thisoneisevenbetter.com | -0.45
      Every new URL adds a row: the dictionary, and with it the model, never stops growing.

  20. Idea 2: using a hash table • h: strings → [0, 2^l − 1] (here l = 24, so indices go up to 16777215) • h("http://google.com") = 14563
      i        | w_i
      0        | -1.7
      1        | -2.1
      …        | …
      16777215 | -1.2

  21. Idea 2: using a hash table • h: strings → [0, 2^l − 1] • h("http://google.com") = 14563
      i        | w_i
      0        | -1.7
      1        | -2.1
      …        | …
      14563    | -1.23
      …        | …
      16777215 | -1.2
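A sketch of Idea 2 (illustrative, not Criteo's implementation): the weight vector has a fixed size 2^l, and any string is mapped into it by a deterministic hash, so no dictionary needs to be stored or grown.

```python
import hashlib
import numpy as np

L = 24                                        # indices in [0, 2^24 - 1], as above
weights = np.zeros(1 << L, dtype=np.float32)  # fixed size, whatever the number of URLs

def h(value):
    """Deterministic hash of a string into [0, 2^L - 1]."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (1 << L)

i = h("http://google.com")                    # e.g. 14563 on the slide
weights[i] -= 0.01                            # gradient updates touch one cell only
```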

  22. Collisions • What if h(s_0) = h(s_1) for two different strings s_0 ≠ s_1? • We will use the same w_i for both • This is called a collision

  23. Collisions in practice • h("http://google.com") = h("http://nicolas.le-roux.name") = 14563 • pCTR("http://google.com") = pCTR("http://nicolas.le-roux.name") ≈ CTR("http://google.com"): the shared weight is learned from both sites' traffic, so it is dominated by the far more visited one

  24. Example of a hash • Current URL = http://gobernie.com/ • h("http://gobernie.com/") = 12 • x = [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0] (positions 0–15; the 1 is at position 12)

  25. Example of a hash • Current URL = http://gobernie.com/ and Advertiser = S&W • h("http://gobernie.com/") = 12, h("S&W") = 4 • x = [0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0] (1s at positions 4 and 12)

  26. Limitations of the linear model • x = [0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0] • pCTR = σ(wᵀx) = 1 / (1 + e^(−wᵀx)) ≈ e^(wᵀx) = Π_j e^(w_j x_j) when clicks are rare • Each feature contributes its own multiplicative factor, independently of the others: the model cannot capture interactions
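A quick numeric check of the approximation above, with illustrative weights (not from the talk): when clicks are rare, wᵀx is very negative, σ(wᵀx) ≈ e^(wᵀx), and the prediction factorizes over features.

```python
import math

w = [-3.0, -1.5, -2.2]   # illustrative weights; clicks are rare so w^T x << 0
x = [1, 0, 1]

z = sum(w_j * x_j for w_j, x_j in zip(w, x))
sigmoid = 1.0 / (1.0 + math.exp(-z))                                # exact pCTR
product = math.prod(math.exp(w_j * x_j) for w_j, x_j in zip(w, x))  # factorized form
print(sigmoid, product)  # ~0.00549 vs ~0.00552: each active feature multiplies
                         # pCTR by its own factor, independently of the others
```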

  27. Introducing cross-features • Current URL = http://gobernie.com/ and Advertiser = S&W • h("http://gobernie.com/" and "S&W") = 6 • x_cf = [0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0] (1s at positions 4, 6 and 12)

  28. Cross-features as a second-order method • x = [0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0] • x_cf = [0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0]

  29. Cross-features as a second-order method • x = [0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0] • x_cf = [0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0] • wᵀx_cf = Σ_j w_j x_j + …

  30. Cross-features as a second-order method • x = [0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0] • x_cf = [0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0] • wᵀx_cf = Σ_j w_j x_j + Σ_{j,k} w_jk x_j x_k

  31. Cross-features as a second-order method • wᵀx_cf = Σ_j w_j x_j + Σ_{j,k} w_jk x_j x_k • wᵀx_cf = wᵀx + xᵀMx • The values in M are the same as those in w!
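A sketch of how such a cross-feature is built (hypothetical helper; a production system would use a stable hash rather than Python's per-process salted built-in): the pair of values is hashed to its own index, giving the model a dedicated weight w_jk for the combination.

```python
def feature_vector(url, advertiser, l=4, cross=False):
    """One-hot encode url and advertiser into 2^l slots; optionally add
    the cross-feature obtained by hashing the *pair* of values."""
    x = [0] * (1 << l)
    x[hash(url) % (1 << l)] = 1                    # e.g. position 12
    x[hash(advertiser) % (1 << l)] = 1             # e.g. position 4
    if cross:
        x[hash((url, advertiser)) % (1 << l)] = 1  # e.g. position 6
    return x

x_cf = feature_vector("http://gobernie.com/", "S&W", cross=True)
```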

  32. A matrix view of cross-features • pCTR = σ(xᵀMx) • The structure of M is determined by the hashing function • [Figure: a symmetric matrix M whose entries (2.3, 1.1, 3.7, -3.0, -1.4, 5.9) repeat across cells wherever cross-features hash to the same index]

  33. Exploiting the magic "Thanks to hashing, the number of parameters in the model is independent of the number of variables. This means we should add as many variables as possible."

  34. Reasons NOT to do that • Because of collisions, adding variables may decrease performance • Every variable needs to be computed and stored

  35. The cost of adding variables • "Hey, I thought of this great variable: time since last product view. Can we add it to the model?" • Storage: #Banners/day × #Days × 4 bytes = 480 GB • RAM: #Users × #Campaigns × 4 bytes = 40 GB
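The back-of-envelope behind those figures, with assumed volumes (the banner, day, user and campaign counts below are illustrative values chosen to reproduce the slide's totals, not actual Criteo numbers):

```python
bytes_per_value = 4                   # one 4-byte float per record

banners_per_day = 2_000_000_000       # assumed
days = 60                             # assumed retention window
storage = banners_per_day * days * bytes_per_value
print(storage / 1e9)                  # 480.0 GB of logs for ONE extra variable

users = 1_000_000_000                 # assumed
campaigns_per_user = 10               # assumed
ram = users * campaigns_per_user * bytes_per_value
print(ram / 1e9)                      # 40.0 GB of RAM to serve it online
```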

  36. Feature selection • How do we keep only useful features while maintaining good performance? • A tool to increase statistical efficiency • Solution: selection of the optimal features and cross-features

  37. Using sparsity-inducing regularizers • min_w Σ_i ℓ(w, x_i, y_i)

  38. Using sparsity-inducing regularizers • min_w Σ_i ℓ(w, x_i, y_i) + λ ||w||_1 • Statistically efficient • Still requires extracting all variables

  39. Using group-sparsity regularizers • min_w Σ_i ℓ(w, x_i, y_i) + λ Σ_g ||w_g||_2 • Forces all elements in a group to be 0 simultaneously • The optimization problem remains efficient. R. Jenatton, J.-Y. Audibert and F. Bach. Structured Variable Selection with Sparsity-Inducing Norms. Journal of Machine Learning Research, 2011.
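A minimal numpy sketch of the group-sparsity mechanism (the proximal operator of the penalty λ Σ_g ||w_g||_2; an illustration of the idea, not code from the reference): each group is shrunk as a block and zeroed out entirely once its norm drops below λ, which is what removes whole features, i.e. whole groups of hashed weights, at once.

```python
import numpy as np

def group_soft_threshold(w, groups, lam):
    """Proximal step for lam * sum_g ||w_g||_2: shrink each group as a
    block; a group whose norm is below lam is set to exactly 0."""
    w = w.copy()
    for g in groups:                         # g: array of indices of one group
        norm = np.linalg.norm(w[g])
        w[g] *= max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
    return w

w = np.array([2.0, -1.5, 0.1, -0.2])
groups = [np.array([0, 1]), np.array([2, 3])]
print(group_soft_threshold(w, groups, lam=0.5))
# [ 1.6 -1.2  0.   0. ]  -- the weak group (one whole feature) is eliminated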

  40. Reducing bias • Sparsity-inducing regularization introduces bias • Two-stage process: • Select subset of variables • Re-optimize with the selected subset
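The two-stage recipe is easy to sketch with scikit-learn on synthetic data (an illustration of the idea, not the production pipeline): an L1-regularized fit picks the support, then an essentially unregularized refit on that support removes the shrinkage bias.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))
w_true = np.zeros(50)
w_true[:5] = 2.0                                         # only 5 useful features
y = (rng.random(1000) < 1 / (1 + np.exp(-X @ w_true))).astype(int)

# Stage 1: L1-regularized fit selects a sparse subset (weights are biased)
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
support = np.flatnonzero(l1.coef_[0])

# Stage 2: re-optimize without the sparsity penalty on the selected subset
refit = LogisticRegression(C=1e6).fit(X[:, support], y)  # large C ~ no penalty
```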

  41. Feature selection as kernel selection • wᵀx_cf = wᵀx + xᵀMx • Doing feature selection on M is equivalent to learning the kernel

  42. ML improves human efficiency • Adding features is a critical part of R&D work • Doing it automatically and well spares valuable people's time

  43. Factorization machines • pCTR = σ(xᵀMx) • [Figure: the matrix M of cross-feature weights with repeated entries] • Rendle, S. Factorization machines. In 2010 IEEE 10th International Conference on Data Mining (ICDM), pp. 995–1000. IEEE.

  44. Factorization machines • f(w, x) = wᵀx • f(M, x) = xᵀMx • f(U, x) = xᵀUUᵀx
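A numpy sketch of the factorization-machine score in the low-rank case (illustrative; variable names are mine): writing the pairwise term with Rendle's identity makes it computable in O(Pk) instead of O(P²).

```python
import numpy as np

def fm_score(w, U, x):
    """w^T x plus the pairwise interactions sum_{j<k} <u_j, u_k> x_j x_k,
    computed as 0.5 * (||U^T x||^2 - sum_j ||u_j||^2 x_j^2)."""
    pairwise = 0.5 * (np.sum((U.T @ x) ** 2)
                      - np.sum(np.sum(U ** 2, axis=1) * x ** 2))
    return w @ x + pairwise

P, k = 16, 4                                 # P hashed slots, rank-k factors
rng = np.random.default_rng(0)
w, U = rng.normal(size=P), rng.normal(size=(P, k))
x = np.zeros(P)
x[4] = x[12] = 1.0                           # the (advertiser, URL) pair above
pctr = 1 / (1 + np.exp(-fm_score(w, U, x)))
```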

  45. Linear model
                | gobernie.com             | drumpf4ever.com          | hillaryous.com
      S&W       | f(w_bernie + w_S&W)      | f(w_drumpf + w_S&W)      | f(w_hillary + w_S&W)
      Carebear  | f(w_bernie + w_carebear) | f(w_drumpf + w_carebear) | f(w_hillary + w_carebear)
      JP Morgan | f(w_bernie + w_JPMorgan) | f(w_drumpf + w_JPMorgan) | f(w_hillary + w_JPMorgan)
      Every cell is an additive combination f(w_site + w_advertiser): the linear model cannot give any (site, advertiser) pair its own score.
