Machine Learning Pipelines
Marco Serafini
COMPSCI 532, Lecture 21
Training vs. Inference
• Training: data → model
  • Computationally expensive
  • No hard real-time requirements (typically)
• Inference: data + model → prediction
  • Computationally cheaper
  • Real-time requirements (sometimes sub-millisecond)
• Today we talk about inference
Lifecycle
Challenge: Different Frameworks
• Different training frameworks, each has its strengths
  • E.g., Caffe for computer vision, HTK for speech recognition
• Each uses different formats → tailored deployment
• Best tool may change over time
• Solution: model abstraction
Challenge: Prediction Latency
• Many ML models have high prediction latency
  • Some are too slow to use online, e.g., when choosing an ad
  • Combining model outputs makes it worse
• Trade-off between accuracy and latency
• Solutions
  • Adaptive batching
  • Enable mixing models with different complexity
  • Straggler mitigation when using multiple models
Challenge: Model Selection
• How to decide which models to deploy?
• Selecting the best model offline is expensive
• Best model changes over time
  • Concept drift: relationships in the data change over time
  • Feature corruption
• Combining multiple models can increase accuracy
• Solution: automatically select among multiple models
Overview
• Requests flow top to bottom and back
• We start by reviewing the Model Abstraction Layer (Project 3)
Caching
• Stores prediction results
• Avoids rerunning inference on recent predictions
• Enables correlating predictions with later feedback
• Useful when selecting one model
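A minimal sketch of such a prediction cache (the class and helper names are illustrative, not Clipper's actual API; inputs are assumed to be hashable):

```python
# Sketch of a prediction cache keyed by (model, input): repeated queries skip
# inference, and later feedback can be joined with the cached prediction.

class PredictionCache:
    def __init__(self):
        self._cache = {}  # (model_name, input) -> prediction

    def predict(self, model, x, run_inference):
        key = (model, x)
        if key not in self._cache:            # cache miss: run the model once
            self._cache[key] = run_inference(model, x)
        return self._cache[key]

    def observe_feedback(self, model, x, label):
        # Correlate feedback with the earlier prediction, if it is still cached.
        prediction = self._cache.get((model, x))
        return prediction, label
```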
Batching
• Maximize batch size given an upper bound on latency
• Advantages of batching
  • Fewer RPC requests
  • Data-parallel optimizations (e.g., using GPUs)
• Different queue/batch size per model container
• Some systems, like TensorFlow, require static batch sizes
• Adaptive batch sizing: AIMD
  • Additively increase the batch size until the latency threshold is exceeded
  • Then back off multiplicatively by 10%
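A rough sketch of the AIMD rule described above; the parameter names and default values (additive_step, the 10% backoff) are illustrative:

```python
def adapt_batch_size(batch_size, measured_latency, slo,
                     additive_step=2, backoff=0.9):
    # AIMD: grow the batch additively while latency stays under the SLO;
    # once the SLO is exceeded, back off multiplicatively (here by 10%).
    if measured_latency > slo:
        return max(1, int(batch_size * backoff))
    return batch_size + additive_step
```

Each model container would run this loop on its own queue, which is how different models end up with different batch sizes.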
Benefits of (Adaptive) Batching
• Up to 26x throughput increase
Per-Model Batch Size
• Different models have different optimal batch sizes
• Latency grows linearly with batch size, so it is easy to predict with AIMD
Delayed Batching
• When a batch is dispatched and the queue does not yet hold a full batch, wait for more requests before dispatching
• Not always beneficial
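One way to express that delay decision, as a sketch with illustrative names (queue, optimal_batch_size, max_delay):

```python
import time

def next_batch(queue, optimal_batch_size, max_delay):
    # Delayed batching: if the queue holds less than a full batch, wait up to
    # max_delay seconds for more requests before dispatching a partial batch.
    deadline = time.time() + max_delay
    while len(queue) < optimal_batch_size and time.time() < deadline:
        time.sleep(0.001)
    batch = queue[:optimal_batch_size]
    del queue[:optimal_batch_size]
    return batch
```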
Model Containers
Model Containers
• Docker containers
• API to be implemented
• State (parameters) passed during initialization
• No other state management
• Clipper replicates containers as needed
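A sketch of the kind of interface a model container implements; the class and method names here are illustrative, not the exact Clipper container API:

```python
class ModelContainer:
    """Runs inside a Docker container; the serving layer talks to it over RPC."""

    def __init__(self, model_path):
        # All state (the trained parameters) is passed in at initialization;
        # the container keeps no other state, so it can be replicated freely.
        self.model = self._load(model_path)

    def predict_batch(self, inputs):
        # Single entry point: take a batch of inputs, return one prediction each.
        return [self.model.predict(x) for x in inputs]

    def _load(self, path):
        ...  # framework-specific deserialization (TensorFlow, Caffe, scikit-learn, ...)
```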
Effect of Replication
• 10 Gbps network: GPU is the bottleneck, scales out
• 1 Gbps network: network is the bottleneck, does not scale out
Model Selection
Model Selection
• Enables running multiple models
• Advantages
  • Combine outputs from different models (if run in parallel)
  • Estimate prediction accuracy (through comparison)
  • Switch to a better model (when feedback is available)
• Disadvantage of running models in parallel: stragglers
  • They can often be ignored with minimal accuracy loss
• Context: different model selection state per user or session
Model Selection API
• S: selection policy state
• X: input
• Y: prediction/feedback
• Feedback is incorporated to update the selection state S
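The operations above can be read as an interface roughly like the following; this is a sketch following the slide's S/X/Y notation, not a concrete library API:

```python
class SelectionPolicy:
    # S: per-user/session selection state, X: query input, Y: prediction or feedback.

    def init(self):
        """Create fresh selection state S."""

    def select(self, s, x):
        """Choose which model(s) should handle query x, given state s."""

    def combine(self, s, x, predictions):
        """Merge the predictions from the selected models into one answer."""

    def observe(self, s, x, feedback):
        """Incorporate feedback Y for query x and return the updated state S."""
```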
Single-Model Selection
• Multi-armed bandit
  • Select one action, observe the outcome
  • Decide whether to explore a new action or exploit the current one
• Exp3 algorithm
  • Choose an action based on a probability distribution
  • Adjust the probability of the chosen action based on the observed loss
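A compact sketch of Exp3 over k models; the exploration rate gamma and the assumption that losses are scaled into [0, 1] are illustrative choices, not values from the slides:

```python
import math
import random

class Exp3:
    def __init__(self, num_models, gamma=0.1):
        self.k = num_models
        self.gamma = gamma
        self.weights = [1.0] * num_models

    def probabilities(self):
        total = sum(self.weights)
        # Mix the weight-proportional distribution with uniform exploration.
        return [(1 - self.gamma) * w / total + self.gamma / self.k
                for w in self.weights]

    def select(self):
        # Sample a model from the current distribution (explore vs. exploit).
        probs = self.probabilities()
        chosen = random.choices(range(self.k), weights=probs)[0]
        return chosen, probs

    def observe(self, chosen, loss, probs):
        # Importance-weighted update: a model that incurs low loss when chosen
        # gains weight, so it is selected more often in the future.
        reward = 1.0 - loss                  # assumes loss scaled into [0, 1]
        estimate = reward / probs[chosen]
        self.weights[chosen] *= math.exp(self.gamma * estimate / self.k)
```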
Multi-Model Ensembles
Ensembles and Changing Accuracy
Ensembles and Stragglers
Personalized Model Selection
• Model selection can be done per user
TensorFlow Serving
• Inference mechanism of TensorFlow
• Can run TensorFlow models
• Also uses batching (static batch sizes)
• Missing features
  • Latency objectives
  • Support for multiple models
  • Feedback