  1. A journey towards real-life results … illustrated by using AI in Twitter’s Timelines

  2. Overview
     ● ML Workflows
     ● The Timelines Ranking case
     ● The power of the platform, opportunities
     ● Future

  3. Deep Learning Workflows
     ● Pure Research
       ○ Model Exploration
     ● Applied Research
       ○ Dataset/Feature Exploration
       ○ Model Exploration
     ● Production
       ○ Feature Addition
       ○ Data Addition
       ○ Training
       ○ Deployment
       ○ A/B test

  4. Deep Learning Workflows
     ● Pure Research
       ○ Model Exploration + Training → Very flexible modeling framework
     ● Applied Research
       ○ Dataset/Feature Exploration → Flexible data exploration framework
       ○ Model Exploration + Training → Flexible modeling framework
     ● Production
       ○ Feature Addition → Scalable data manipulation framework
       ○ Data Addition → Scalable data manipulation framework
       ○ Training → Fast, robust training engine
       ○ Deployment → Seamless and tested ML services
       ○ A/B test → Good A/B test environment

  5. Deep Learning Workflows
     [Diagram: spectrum from Pure Research to Applied Research to Production]

  6. Deep Learning Workflows PRODUCTION

  7. Data First Workflow
     ● Model architecture doesn’t matter (anymore)
     ● Large-scale data manipulation matters
     ● Fast training matters
     ● Ease of deployment matters
     ● Testing matters!!!
       ○ Training vs. online
       ○ Continuous integration

  8. Case Study Timelines Ranking (Blog Post @TwitterEng)

  9. Timelines Ranking
     ● Sparse features
     ● A few billion data samples
     ● Low latency
     ● Candidate generation → heavy model → sort → publish
     ● Before: decision trees + other sparse techniques
     ● Probability prediction
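
     The candidate-generation → heavy-model → sort → publish flow on this slide can be sketched schematically; this is an illustrative outline with made-up function names, not Twitter's actual pipeline:

```python
# Schematic sketch of the ranking flow: generate candidates, score each with the
# heavy model, sort by predicted engagement probability, publish the top K.
# All names (generate_candidates, model.predict, k) are illustrative.
def rank_timeline(user, generate_candidates, model, k=200):
    candidates = generate_candidates(user)                    # candidate generation
    scored = [(model.predict(user, tweet), tweet) for tweet in candidates]  # heavy model
    scored.sort(key=lambda pair: pair[0], reverse=True)       # sort by probability
    return [tweet for _, tweet in scored[:k]]                 # publish
```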

  10. Timelines Ranking New Modules

  11. Sparse Linear Layer
      Nj = F( ∑i Wi,j * norm(Vi) + Bj ),  F = Sigmoid/ReLU/PReLU
      [Diagram: sparse input features V1 … Vn (has_image, is_vit, engagement_ratio, days_since, obama_word), each binary {0,1} or non-negative [0,+∞), fully connected to the output neurons Nj]

  12. Sparse Linear Layer: Online Normalization
      ● Example: input feature with value == 1M ⇒ weight gradient == 1M ⇒ update == 1M * learning_rate ⇒ explosion
      ● Solution: normalize input values
        norm(Vi) == Vi / max(all |Vi| seen so far) + bi
        ○ bi is a trainable per-feature bias
        ○ Vi / max(|Vi|) belongs to [-1, 1]
        ○ the bias discriminates absence vs. presence of a feature
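
      A minimal PyTorch sketch of a sparse linear layer with online max-normalization and a trainable per-feature presence bias, as described on slides 11 and 12. This is an assumed implementation, not Twitter's production kernel; it assumes each sample arrives as a fixed-length list of (feature id, value) pairs and needs PyTorch >= 1.12 for scatter_reduce_.

```python
import torch
import torch.nn as nn

class SparseLinearWithOnlineNorm(nn.Module):
    """Computes Nj = F( sum_i Wi,j * norm(Vi) + Bj ) over sparse inputs."""

    def __init__(self, num_features, out_size):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_features, out_size) * 0.01)
        self.out_bias = nn.Parameter(torch.zeros(out_size))
        # Trainable per-feature bias bi: lets the model tell "feature absent"
        # (contributes nothing) apart from "feature present with a tiny value".
        self.feat_bias = nn.Parameter(torch.zeros(num_features))
        # Running max of |Vi|, updated online, so Vi / max stays in [-1, 1].
        self.register_buffer("running_max", torch.ones(num_features))

    def forward(self, ids, values):
        # ids:    LongTensor  [batch, nnz] -- indices of the present features
        # values: FloatTensor [batch, nnz] -- their raw values
        if self.training:
            with torch.no_grad():
                batch_max = torch.zeros_like(self.running_max)
                batch_max.scatter_reduce_(0, ids.reshape(-1),
                                          values.abs().reshape(-1), reduce="amax")
                self.running_max = torch.maximum(self.running_max, batch_max)
        # norm(Vi) = Vi / max(all |Vi| seen so far) + bi
        normed = values / self.running_max[ids] + self.feat_bias[ids]
        # Nj = F( sum_i Wi,j * norm(Vi) + Bj ); only present features contribute.
        out = torch.einsum("bn,bno->bo", normed, self.weight[ids]) + self.out_bias
        return torch.relu(out)   # F could equally be Sigmoid or PReLU
```

      Usage would look like layer = SparseLinearWithOnlineNorm(num_features=10_000, out_size=50); hidden = layer(ids, values).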

  13. Sparse Linear Layer: Speedups
      CPU -- i7 3790k -- forward pass -- ~500 features -- output size == 50

      Batch size   vs PyTorch (1 thread)   vs TensorFlow (1 thread)   vs PyTorch (4 threads)   vs TF* (4 threads)
      1            2.1x                    4.1x                       2.8x                     5.5x
      16           1.7x                    1.7x                       4.6x                     4.3x
      64           1.7x                    1.3x                       5.1x                     3.8x
      256          1.8x                    1.2x                       5.6x                     3.5x

  14. Sparse Linear Layer: Speedups
      GPU -- Tesla M40 -- CUDA 7.5 -- forward pass -- ~500 features -- output size == 50

      Batch size   vs cuSparse
      1            0.7x
      16           4.4x
      64           5.2x
      256          2x

  15. Split Nets
      [Diagram: the input features V1 … Vn (has_image, has_link, engagement_ratio, days_since, obama_word, …) are partitioned by type, e.g. tweet binary features and engagement features, and each partition feeds its own split net (SPLIT NET 1 … SPLIT NET K) with neurons 1 … N]

  16. Split Nets
      [Diagram: GLUE ALL SPLIT NETS!!! The outputs of the K split nets (N neurons each) are concatenated into a unique deep net of N*K neurons]

  17. Prevent overfitting -- Split by feature type
      ● Send “dense” features on one side:
        ○ BINARY
        ○ CONTINUOUS
        ○ (SPARSE_CONTINUOUS)
      ● “Sparse” features on the other side:
        ○ DISCRETE
        ○ STRING
        ○ SPARSE_BINARY
        ○ (SPARSE_CONTINUOUS)
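
      As a rough sketch of the split-nets idea from slides 15 to 17 (an assumed toy implementation with illustrative sizes and names, not Twitter's code): one small net per feature group, with dense features on one side and sparse-derived features on the other, glued by concatenation into a unique deep net.

```python
import torch
import torch.nn as nn

class SplitNets(nn.Module):
    def __init__(self, group_sizes, split_out=50, deep_hidden=200):
        super().__init__()
        # One split net per feature group (e.g. tweet binary features,
        # engagement features, ...), each producing split_out neurons.
        self.splits = nn.ModuleList(
            nn.Sequential(nn.Linear(size, split_out), nn.ReLU())
            for size in group_sizes
        )
        # The "glue": a unique deep net over the K * N concatenated neurons.
        self.deep = nn.Sequential(
            nn.Linear(split_out * len(group_sizes), deep_hidden), nn.ReLU(),
            nn.Linear(deep_hidden, 1)
        )

    def forward(self, groups):
        # groups: one [batch, size] tensor per feature group, in the same order
        # as group_sizes; dense and sparse groups are kept separate, which is
        # the overfitting guard described on slide 17.
        joined = torch.cat([net(x) for net, x in zip(self.splits, groups)], dim=1)
        return self.deep(joined)
```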

  18. Sampling -- Calibration
      ● Sample the training data according to a positive ratio P
      ● The model's average output probability == P, not the true rate ⇒ need calibration
      ● Use isotonic calibration
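
      The deck only names isotonic calibration as the fix; a minimal sketch using scikit-learn's IsotonicRegression fitted on held-out scores (the helper name is mine) could look like this:

```python
from sklearn.isotonic import IsotonicRegression

def fit_calibrator(raw_scores, labels):
    # Learn a monotonic map from raw model scores to calibrated probabilities on
    # a held-out set, so average predictions match the true engagement rate
    # rather than the resampled positive ratio P.
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(raw_scores, labels)
    return iso

# calibrated = fit_calibrator(val_scores, val_labels).predict(test_scores)
```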

  19. Feature Discretization
      Intuition
      ● Max normalization is good to avoid explosion, BUT
      ● per-aggregate-feature min/max ranges are much larger
      ● max-normalization then generates very small input feature values
      ● the deep net has tremendous trouble learning on such small values
      ● std/mean normalization? Better, but still not satisfying
      Solution
      ● Discretization

  20. Discretization
      ● Feature id == 10
      ● Over the entire dataset, compute equal-sized bins and assign a bin_id
      ● At inference time, for a key/value pair (id, value):
        ○ id → bin_id
        ○ value → 1
      ● Other possibilities: decision trees, ...
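
      A hedged sketch of this discretization, assuming equal-population (quantile) bins computed over the whole dataset; the helper names and the way the new sparse id is encoded are illustrative, not Twitter's scheme:

```python
import numpy as np

def fit_bins(values, num_bins=20):
    # Bin edges chosen so each bin holds roughly the same number of samples.
    quantiles = np.linspace(0.0, 1.0, num_bins + 1)[1:-1]
    return np.quantile(values, quantiles)

def discretize(feature_id, value, edges, num_bins=20):
    # At inference time, map (id, value) to (bin_id, 1): the continuous value
    # becomes a binary indicator for the bin it falls into.
    bin_id = int(np.searchsorted(edges[feature_id], value))
    return feature_id * num_bins + bin_id, 1.0

# edges = {10: fit_bins(all_values_of_feature_10)}   # feature id == 10
# discretize(10, 3.7, edges)  # -> (sparse id for feature 10 / its bin, 1.0)
```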

  21. Final simplest architecture
      1) Discretizer(s)
      2) Sparse layer with online normalization
      3) MLP
      4) Prediction
      5) Isotonic calibration
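
      Putting the earlier sketches together, the five steps could compose roughly as follows (assumed glue code only; discretization is taken to happen upstream while building the sparse (ids, values) input):

```python
import torch

def predict_probability(ids, values, sparse_layer, mlp, calibrator):
    hidden = sparse_layer(ids, values)                 # 2) sparse layer + online norm
    raw = torch.sigmoid(mlp(hidden)).detach().numpy()  # 3) MLP, 4) raw prediction
    return calibrator.predict(raw.ravel())             # 5) isotonic calibration
```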

  22. The power of the platform

  23. The power of the platform
      ● Testing
      ● Tracking
      ● Automation
      ● Robustness
      ● Standardization
      ● Speed
      ● Workflow
      ● Examples
      ● Support
      ● Easy multimodal (text + media + sparse + …)

  24. The power of the platform
      ● How to train all this?
        ○ Train the discretizer
        ○ Train the deep net
        ○ Calibrate the probabilities
        ○ Validate
        ○ ...
      ● Training loop + ML scheduler → one-liner
      ● Unique serialization format for params

  25. The power of the platform
      ● How to deploy all this?
      ● Tight Twitter infra integration + saved model → one-liner deployment
      ● Arbitrary number of instances
      ● All the goodies from Twitter services infra!
      ● Seamless

  26. The power of the platform
      ● How to test all this?
      ● Model offline validation → PredictionSet A
      ● Model online prediction → PredictionSet B
      ● PredictionSet A == PredictionSet B ??
      ● Yes → ready to ship
      ● Continuous integration
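
      The offline/online check can be expressed as a tiny continuous-integration assertion; the function name and tolerance here are illustrative:

```python
import numpy as np

def ready_to_ship(prediction_set_a, prediction_set_b, tol=1e-6):
    # PredictionSet A: offline validation scores; PredictionSet B: scores from
    # the deployed online service on the same examples. Ship only if they agree.
    a = np.asarray(prediction_set_a, dtype=np.float64)
    b = np.asarray(prediction_set_b, dtype=np.float64)
    return a.shape == b.shape and np.allclose(a, b, atol=tol)
```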

  27. Future

  28. Future of DL platforms
      In a single platform:
      ● Abstract DAG of:
        ○ Services
        ○ Storages
        ○ Datasets
        ○ ...
      ● Model dependency handling
      ● Offline/online feature mapping
      ● Coverage for all the workflows
      ● Bundling
      ● …
      Cloud?

  29. DAG of services
      [Diagram: a DAG of caches, storages, and models (Model A, B, C, D) feeding one another and, ultimately, products such as Timelines, Recommendations, ...]

  30. THANKS!
