A journey towards real-life results … illustrated by using AI in Twitter’s Timelines
Overview ● ML Workflows ● The Timelines Ranking case ● The power of the platform, opportunities ● Future
Deep Learning Workflows ● Pure Research ○ Model Exploration ● Applied Research ○ Dataset/Feature Exploration ○ Model Exploration ● Production ○ Feature Addition ○ Data Addition ○ Training ○ Deployment ○ A/B test
Deep Learning Workflows ● Pure Research ○ Model Exploration + Training → Very flexible modeling framework ● Applied Research ○ Dataset/Feature Exploration → Flexible data exploration framework ○ Model Exploration + Training → Flexible modeling framework ● Production ○ Feature Addition → Scalable data manipulation framework ○ Data Addition → Scalable data manipulation framework ○ Training → Fast, robust training engine ○ Deployment → Seamless and tested ML services ○ A/B test → Good AB test environment
Deep Learning Workflows: Pure Research → Applied Research → Production
Deep Learning Workflows PRODUCTION
Data First Workflow ● Model architecture doesn’t matter (anymore) ● Large-scale data manipulation matters ● Fast training matters ● Ease of deployment matters ● Testing matters!!! ○ Training vs. online ○ Continuous integration
Case Study Timelines Ranking (Blog Post @TwitterEng)
Timelines Ranking ● Sparse features ● A few billion data samples ● Low latency ● Candidate generation → heavy model → sort → publish ● Before: decision trees + other sparse techniques ● Probability prediction
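To make the candidates → heavy model → sort → publish flow concrete, here is a minimal Python sketch of the serving loop; `candidate_tweets`, `model.predict_proba`, and the cutoff `k` are illustrative names, not Twitter’s actual API.

```python
def rank_timeline(candidate_tweets, model, k=200):
    """Score each candidate with the heavy model's predicted engagement
    probability, sort descending, and publish the top of the list."""
    scored = [(model.predict_proba(tweet.features), tweet) for tweet in candidate_tweets]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tweet for _, tweet in scored[:k]]
```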
Timelines Ranking New Modules
Sparse Linear Layer
Nj = F( Σi Wi,j · norm(Vi) + Bj ), with F = Sigmoid / ReLU / PReLU
[Figure: a sparse input vector V1 … Vn feeding the layer; example features and domains: has_image {0,1}, is_vit {0,1}, engagement_ratio [0,+∞), days_since [0,+∞), obama_word {0,1}]
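A minimal NumPy sketch of that forward pass, assuming the sparse input is given as (indices, values) pairs for the active features only; variable names are illustrative, and the norm() from the next slide is left out here.

```python
import numpy as np

def sparse_linear_forward(indices, values, W, B,
                          activation=lambda x: 1.0 / (1.0 + np.exp(-x))):
    """Nj = F( sum_i W[i, j] * V_i + B_j ), touching only the rows of W
    selected by the active feature indices (which is where the speedup
    over a dense matmul comes from). F defaults to a sigmoid here."""
    pre_activation = values @ W[indices] + B
    return activation(pre_activation)

# Toy usage: ~500 possible features, 50 output units, 4 active features.
W = np.random.randn(500, 50) * 0.01
B = np.zeros(50)
out = sparse_linear_forward(np.array([1, 2, 5, 0]),
                            np.array([1.0, 0.3, 1.0, 2.0]), W, B)
```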
Sparse Linear Layer: Online Normalization ● Example: input feature with value == 1M ⇒ weight gradient == 1M ⇒ update == 1M * learning_rate ⇒ explosion ● Solution: normalize input values: norm(Vi) = Vi / max(all |Vi|) + bi ○ Vi / max(all |Vi|) belongs to [-1, 1] ○ bi: trainable per-feature bias, discriminates absence vs. presence of a feature
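A sketch of that normalization, assuming the running maximum of absolute values is tracked per feature during training; the class name and update details are illustrative.

```python
import numpy as np

class OnlineMaxNormalizer:
    """norm(Vi) = Vi / max(|Vi| seen so far) + b_i.
    The scaled value lies in [-1, 1], so a 1M-valued counter no longer
    produces a 1M-sized gradient; b_i is a trainable per-feature bias
    that lets the net discriminate absent vs. present features."""
    def __init__(self, num_features, eps=1e-6):
        self.max_abs = np.full(num_features, eps)   # avoid division by zero
        self.bias = np.zeros(num_features)          # updated by the optimizer

    def __call__(self, indices, values, training=True):
        if training:
            # keep the largest absolute value observed so far, per feature
            np.maximum.at(self.max_abs, indices, np.abs(values))
        return values / self.max_abs[indices] + self.bias[indices]
```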
Sparse Linear Layer: Speedups
CPU (i7 3790k), forward pass, ~500 features, output size == 50

Batch size | vs PyTorch (1 thread) | vs TensorFlow (1 thread) | vs PyTorch (4 threads) | vs TF* (4 threads)
         1 |                  2.1x |                     4.1x |                   2.8x |               5.5x
        16 |                  1.7x |                     1.7x |                   4.6x |               4.3x
        64 |                  1.7x |                     1.3x |                   5.1x |               3.8x
       256 |                  1.8x |                     1.2x |                   5.6x |               3.5x
Sparse Linear Layer: Speedups
GPU (Tesla M40, CUDA 7.5), forward pass, ~500 features, output size == 50

Batch size | vs cuSparse
         1 |        0.7x
        16 |        4.4x
        64 |        5.2x
       256 |        2.0x
Split Nets
[Figure: the sparse input V1 … Vn (e.g. has_image {0,1}, has_link {0,1}, engagement_ratio [0,+∞), days_since [0,+∞), obama_word {0,1}) is partitioned into K groups, each feeding its own split net of N hidden units, e.g. Split Net 1 for tweet binary features, …, Split Net K for engagement features.]
Split Nets
[Figure: glue all split nets: the K split-net outputs are concatenated into a unique deep net with N*K neurons.]
Prevent overfitting -- Split by feature type ● Send “dense” features on one side ○ BINARY ○ CONTINUOUS ○ (SPARSE_CONTINUOUS) ● “Sparse” features on the other side ○ DISCRETE ○ STRING ○ SPARSE_BINARY ○ (SPARSE_CONTINUOUS)
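A minimal sketch of the split-and-glue idea from the last three slides: each feature group gets its own small net, and their outputs are concatenated before the shared deep net. The grouping and net shapes below are illustrative assumptions.

```python
import numpy as np

# Hypothetical grouping of feature ids by type (dense side vs. sparse side).
FEATURE_GROUPS = {
    "dense_side":  np.array([0, 1, 2]),      # BINARY, CONTINUOUS
    "sparse_side": np.array([3, 4, 5, 6]),   # DISCRETE, STRING, SPARSE_BINARY
}

def split_net_forward(x, split_nets):
    """Run each feature group through its own split net (a callable
    returning N hidden units), then glue the K outputs into one
    N*K-dimensional vector for the shared deep net."""
    hidden = [split_nets[name](x[ids]) for name, ids in FEATURE_GROUPS.items()]
    return np.concatenate(hidden)

# Toy usage with random "split nets" of 8 hidden units each.
nets = {name: (lambda v, M=np.random.randn(8, len(ids)): np.tanh(M @ v))
        for name, ids in FEATURE_GROUPS.items()}
print(split_net_forward(np.random.randn(7), nets).shape)   # (16,)
```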
Sampling -- Calibration ● Training data is sampled to a positive ratio P ● The model’s average output probability then equals P, not the true rate ⇒ need calibration ● Use isotonic calibration
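A minimal sketch of fitting the isotonic calibrator, using scikit-learn’s IsotonicRegression on held-out raw scores and labels; the toy numbers are illustrative.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Held-out raw (uncalibrated) model scores and observed engagement labels.
raw_scores = np.array([0.10, 0.20, 0.35, 0.40, 0.80, 0.90])
labels     = np.array([0,    0,    1,    0,    1,    1   ])

calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(raw_scores, labels)

# Monotone piecewise-constant mapping from raw score to calibrated probability.
print(calibrator.predict(np.array([0.30, 0.85])))
```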
Feature Discretization Intuition ● Max normalization is good to avoid explosion, BUT ● per-aggregate-feature min/max ranges are much larger ● so max-normalization generates very small input feature values ● and the deep net has tremendous trouble learning from such small values ● Mean/std normalization? Better, but still not satisfying ● Solution: discretization
Discretization ● Example: feature id == 10 ● Over the entire dataset, compute equal-sized bins and assign each a bin_id ● At inference time, for a key/value pair (id, value): ○ id → bin_id ○ value → 1 ● Other possibilities: decision trees, ...
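A minimal sketch of that discretizer: equal-sized (quantile) bins are fitted per feature over the dataset, and at inference time an (id, value) pair becomes a binned id with value 1. The id-packing scheme below is an illustrative choice.

```python
import numpy as np

def fit_discretizer(values, num_bins):
    """Equal-sized bins over the whole dataset: the bin edges are the
    quantiles of the observed values for one feature."""
    return np.percentile(values, np.linspace(0, 100, num_bins + 1)[1:-1])

def discretize(feature_id, value, bin_edges, num_bins):
    """Map (id, value) -> (id * num_bins + bin_id, 1.0) so the sparse
    layer learns one weight per (feature, bin)."""
    bin_id = int(np.searchsorted(bin_edges, value))
    return feature_id * num_bins + bin_id, 1.0

# Toy usage for feature id == 10 (e.g. an unbounded counter).
values_for_feature_10 = np.random.exponential(scale=1e4, size=100_000)
edges = fit_discretizer(values_for_feature_10, num_bins=10)
print(discretize(10, 2500.0, edges, num_bins=10))
```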
Final simplest architecture 1) Discretizer(s) 2) Sparse Layer with online normalization 3) MLP 4) Prediction 5) Isotonic Calibration
The power of the platform
The power of the platform ● Testing ● Tracking ● Automation ● Robustness ● Standardization ● Speed ● Workflow ● Examples ● Support ● Easy Multimodal (Text + media + sparse + …)
The power of the platform ● How to train all this? ○ Train the discretizer ○ Train the deep net ○ Calibrate the probabilities ○ Validate ○ ... ● Training loop + ML scheduler → one-liner ● Unique serialization format for params
The power of the platform ● How to deploy all this? ● Tight Twitter infra integration + saved model → one-liner deployment ● Arbitrary number of instances ● All the goodies from Twitter services infra! ● Seamless
The power of the platform ● How to test all this? ● Model offline validation → PredictionSet A ● Model online prediction → PredictionSet B ● PredictionSet A == PredictionSet B ?? ● Yes → ready to ship ● Continuous integration
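A minimal sketch of the offline/online parity check that gates shipping; the tolerance and naming are illustrative.

```python
import numpy as np

def assert_prediction_parity(offline_predictions, online_predictions, tol=1e-6):
    """Continuous-integration style check: the model served online must
    reproduce the offline validation predictions on the same inputs."""
    offline = np.asarray(offline_predictions)
    online = np.asarray(online_predictions)
    assert offline.shape == online.shape, "prediction sets differ in size"
    max_diff = np.max(np.abs(offline - online))
    assert max_diff <= tol, f"offline/online predictions diverge (max diff {max_diff})"

assert_prediction_parity([0.12, 0.87, 0.45], [0.12, 0.87, 0.45])   # passes -> ready to ship
```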
Future
Future of DL platforms In a single platform: ● Abstract DAG of: ○ Services ○ Storages ○ Dataset ○ ... ● Model dependency handling ● Offline/Online feature mapping ● Coverage for all the workflows ● Bundling ● … Cloud?
DAG of services
[Figure: a directed graph of models A, B, C, D connected through caches and storages, ultimately serving Timelines, Recommendations, ...]
THANKS!