TVM for Ads Ranking @ Facebook
Hao Lu, Ansha Yu, Yinghai Lu, Andrew Tulloch
Ads Ranking at Facebook
[Diagram: ads (ad 1 … ad n) are grouped into batches and evaluated by a set of models (model 1 … model k) to produce predictions]
Ads Ranking at Facebook: Production Requirements
• Parallel execution across model evaluations
• Each model runs on a single thread
• For each model, multiple batches can execute at the same time. In this case, weights are global and shared between threads, but activations are thread-local (see the sketch after this list)
• Model weights are refreshed every few hours. Therefore, activations need to be released at the end of each inference to avoid running out of memory
• Batch size is dynamic
• C++ only
• Multiple CPU architectures: AVX-512, AVX2
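As a minimal sketch of the pattern these requirements imply, the snippet below uses TVM's Python API (production serving is C++, and the one-layer model is a hypothetical stand-in): one compiled module is reused by every thread that evaluates the model, while each thread creates and drops its own graph runtime instance, so activations are thread-local and released after every inference.

```python
import concurrent.futures
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Hypothetical stand-in model: a single FC layer (batch 16, 64 -> 8).
x = relay.var("x", shape=(16, 64), dtype="float32")
w = relay.var("w", shape=(8, 64), dtype="float32")
mod = tvm.IRModule.from_expr(relay.Function([x, w], relay.nn.dense(x, w)))
params = {"w": tvm.nd.array(np.random.rand(8, 64).astype("float32"))}

# Compile once; the same compiled lib is reused by every thread evaluating
# this model (in production, weights are global and shared between threads).
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)

def infer(batch):
    # Each thread owns its graph runtime instance, so activations are
    # thread-local; they are released when the instance is dropped at the
    # end of the inference, which keeps memory bounded across weight refreshes.
    runtime = graph_executor.GraphModule(lib["default"](tvm.cpu(0)))
    runtime.set_input("x", batch)
    runtime.run()
    return runtime.get_output(0).numpy()

batches = [np.random.rand(16, 64).astype("float32") for _ in range(4)]
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    predictions = list(pool.map(infer, batches))
```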
Model Architecture
[Diagram: DLRM-style architecture with embedding lookups (EMB) feeding MLPs; the TVM-compiled portion is marked]
MLP: multilayer perceptron (a sequence of FC layers + activation functions)
https://ai.facebook.com/blog/dlrm-an-advanced-open-source-deep-learning-recommendation-model/
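To make the EMB + MLP split concrete, here is a hedged Relay sketch of the MLP tower that TVM compiles (dense + ReLU layers over the concatenated dense features and embedding outputs); the layer sizes, names, and random weights are illustrative, not the production model.

```python
import numpy as np
import tvm
from tvm import relay

def build_mlp(input_dim, hidden_dims, batch_size=16):
    """Build a DLRM-style MLP in Relay: a stack of dense (FC) + ReLU layers.

    The input stands in for concatenated dense features and embedding
    lookup results; the embedding tables themselves stay outside TVM here.
    """
    x = relay.var("x", shape=(batch_size, input_dim), dtype="float32")
    params = {}
    weights = []
    out = x
    in_dim = input_dim
    for i, units in enumerate(hidden_dims):
        w_name = f"fc{i}_weight"
        w = relay.var(w_name, shape=(units, in_dim), dtype="float32")
        weights.append(w)
        params[w_name] = tvm.nd.array(
            np.random.uniform(-0.1, 0.1, (units, in_dim)).astype("float32"))
        out = relay.nn.relu(relay.nn.dense(out, w))
        in_dim = units
    func = relay.Function([x] + weights, out)
    return tvm.IRModule.from_expr(func), params

mod, params = build_mlp(input_dim=512, hidden_dims=[256, 128, 1])
print(mod)  # inspect the generated Relay IR
```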
Ads Ranking Models Implementation
[Diagram: dense features + embeddings from Caffe2 (batch_size x) feed multiple graph runtime instances (batch_size 1 … batch_size n), which produce the predictions]
• JIT (not AOT), because models are updated periodically
• Graph runtime does not manage memory
  • weights are shared between threads for the same model
  • activations are shared across graph runtime instances
  • activations are released after each iteration to avoid OOM
Performance
• Use MKL for FC for simplicity (see the build sketch below)
• 5-10% speedup from fusion
• Runtime overhead eats into the speedup
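A hedged Python sketch of this flow (production uses the C++ runtime, but the calls map closely), reusing build_mlp from the previous sketch: the model is JIT-compiled at load time with cblas offload for FC (which dispatches to MKL when TVM is built against it), and a graph runtime instance is created per batch size, since the graph runtime itself needs static shapes. API names follow recent TVM releases (relay.build returning a factory module, tvm.contrib.graph_executor); older releases used graph_runtime.create(graph, lib, ctx).

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# "-libs=cblas" routes nn.dense to the extern BLAS kernel, i.e. MKL when
# TVM is linked against it ("use MKL for FC for simplicity"). The -mcpu
# flag targets the AVX-512 fleet; a second build would target AVX2.
target = "llvm -mcpu=skylake-avx512 -libs=cblas"
dev = tvm.cpu(0)

# For illustration, compile and create one graph runtime per batch size
# (the graph runtime cannot handle dynamic shapes); the Relay VM section
# that follows removes this restriction.
runtimes = {}
for batch_size in (8, 16, 32):
    mod, params = build_mlp(input_dim=512, hidden_dims=[256, 128, 1],
                            batch_size=batch_size)
    with tvm.transform.PassContext(opt_level=3):  # JIT at model-load time
        lib = relay.build(mod, target=target, params=params)
    runtimes[batch_size] = graph_executor.GraphModule(lib["default"](dev))

# Score one batch of 16 candidates.
batch = np.random.rand(16, 512).astype("float32")
rt = runtimes[16]
rt.set_input("x", batch)
rt.run()
prediction = rt.get_output(0).numpy()
```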
What's Next
Relay VM
• Handles dynamic shapes (see the sketch below)
• JIT compilation
• Dynamic memory allocation
Performance
• Autotuning at scale
• FBGEMM for fp16 and int8
• Embedding lookup
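For the dynamic-shape and dynamic-memory points, a hedged sketch of the Relay VM path: the batch dimension is left symbolic with relay.Any(), the module is compiled with relay.vm.compile, and a single executable then serves any batch size, with the VM allocating activation memory per invocation. The one-layer model and exact API spellings are illustrative and vary somewhat across TVM versions.

```python
import numpy as np
import tvm
from tvm import relay
from tvm.runtime.vm import VirtualMachine

# One FC layer with a symbolic batch dimension instead of a fixed one.
batch = relay.Any()
x = relay.var("x", shape=(batch, 512), dtype="float32")
w = relay.var("w", shape=(256, 512), dtype="float32")
out = relay.nn.relu(relay.nn.dense(x, w))
mod = tvm.IRModule.from_expr(relay.Function([x, w], out))

# Compile once for any batch size; the VM allocates memory dynamically
# at run time instead of relying on a static memory plan.
exe = relay.vm.compile(mod, target="llvm")
vm = VirtualMachine(exe, tvm.cpu(0))

w_np = np.random.rand(256, 512).astype("float32")
for batch_size in (1, 7, 64):
    data = np.random.rand(batch_size, 512).astype("float32")
    result = vm.run(data, w_np)  # same executable, different batch sizes
    print(batch_size, result.shape)
```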