

  1. Machine Learning Pipelines Marco Serafini COMPSCI 532 Lecture 21

  2. Training vs. Inference
     • Training: data → model
       • Computationally expensive
       • No hard real-time requirements (typically)
     • Inference: data + model → prediction
       • Computationally cheaper
       • Real-time requirements (sometimes sub-millisecond)
     • Today we talk about inference

  3. Lifecycle

  4. Challenge: Different Frameworks
     • Different training frameworks, each has its strengths
       • E.g.: Caffe for computer vision, HTK for speech recognition
     • Each uses different formats → tailored deployment
     • Best tool may change over time
     • Solution: model abstraction

  5. Challenge: Prediction Latency
     • Many ML models have high prediction latency
     • Some are too slow to use online, e.g., when choosing an ad
     • Combining model outputs makes it worse
     • Trade-off between accuracy and latency
     • Solutions
       • Adaptive batching
       • Enable mixing models with different complexity
       • Straggler mitigation when using multiple models

  6. Challenge: Model Selection
     • How to decide which models to deploy?
     • Selecting the best model offline is expensive
     • Best model changes over time
       • Concept drift: relationships in data change over time
       • Feature corruption
     • Combining multiple models can increase accuracy
     • Solution: automatically select among multiple models

  7. Overview
     • Requests flow top to bottom and back
     • We start by reviewing the Model Abstraction Layer (Project 3)

  8. Caching
     • Stores prediction results
     • Avoids rerunning inference on recent predictions
     • Enables correlating predictions with feedback
     • Useful when selecting one model
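A prediction cache along these lines can be a small map keyed by (model, input) with LRU eviction. The sketch below is illustrative only; the class and method names are assumptions, not Clipper's actual implementation.

from collections import OrderedDict

class PredictionCache:
    """Illustrative LRU cache of (model, input) -> prediction (sketch)."""
    def __init__(self, max_entries=10000):
        self.max_entries = max_entries
        self.entries = OrderedDict()

    def get(self, model, x):
        key = (model, x)
        if key in self.entries:
            self.entries.move_to_end(key)   # mark as recently used
            return self.entries[key]
        return None                          # cache miss: run inference

    def put(self, model, x, prediction):
        key = (model, x)
        self.entries[key] = prediction
        self.entries.move_to_end(key)
        if len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # evict least recently used

Keeping recent results around is also what allows a later feedback signal to be matched against the prediction that was actually served.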

  9. Batching
     • Maximize batch size given an upper bound on latency
     • Advantages of batching
       • Fewer RPC requests
       • Data-parallel optimizations (e.g., using GPUs)
     • Different queue/batch size per model container
       • Some systems, like TensorFlow, require static batch sizes
     • Adaptive batch sizing: AIMD (sketched below)
       • Additively increase the batch size until the latency threshold is exceeded
       • Then scale the batch size down by 10%
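A minimal sketch of the additive-increase/multiplicative-decrease rule described on this slide. The step size, the 10% back-off factor, and the controller class are illustrative assumptions, not Clipper's exact code.

class AIMDBatchSizer:
    """Adaptively tune the batch size for one model container (sketch)."""
    def __init__(self, initial_size=1, step=1, backoff=0.9):
        self.batch_size = initial_size
        self.step = step          # additive increase
        self.backoff = backoff    # multiplicative decrease: scale down by 10%

    def update(self, observed_latency, latency_slo):
        if observed_latency <= latency_slo:
            # Still under the latency objective: probe for a larger batch.
            self.batch_size += self.step
        else:
            # Exceeded the objective: back off multiplicatively.
            self.batch_size = max(1, int(self.batch_size * self.backoff))
        return self.batch_size

The controller is run per model container, since each model has its own optimal operating point.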

  10. Benefits of (Adaptive) Batching
     • Up to 26x throughput increase

  11. Per-Model Batch Size
     • Different models have different optimal batch sizes
     • Latency grows linearly with batch size, so it is easy to predict with AIMD

  12. Delayed Batching
     • When a batch is dispatched and the next one is not yet full, wait for more requests
     • Not always beneficial
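Delayed batching trades a little extra queueing delay for fuller batches. A rough sketch, where the queue interface and the timeout value are assumptions for illustration:

import queue
import time

def collect_batch(request_queue, batch_size, max_delay_s=0.002):
    """Wait up to max_delay_s for the batch to fill before dispatching it."""
    batch = [request_queue.get()]            # block for at least one request
    deadline = time.time() + max_delay_s
    while len(batch) < batch_size:
        remaining = deadline - time.time()
        if remaining <= 0:
            break                            # dispatch a partially full batch
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch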

  13. Model Containers

  14. Model Containers
     • Docker containers
     • API to be implemented (sketched below)
     • State (parameters) passed during initialization
     • No other state management
     • Clipper replicates containers as needed
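The container API boils down to loading the model parameters once at initialization and then serving batched predict calls. The base class and method names below are illustrative assumptions, not the exact Clipper container interface.

class ModelContainer:
    """Illustrative model container: state is passed in once, then only predict()."""
    def __init__(self, model_path):
        # Parameters are provided at initialization; the container keeps no
        # other state, which is what lets Clipper replicate it freely.
        self.model = self.load(model_path)

    def load(self, model_path):
        raise NotImplementedError

    def predict(self, inputs):
        """Take a batch of inputs and return one prediction per input."""
        raise NotImplementedError

class SklearnContainer(ModelContainer):
    """Example specialization for a pickled scikit-learn model (assumption)."""
    def load(self, model_path):
        import pickle
        with open(model_path, "rb") as f:
            return pickle.load(f)

    def predict(self, inputs):
        return self.model.predict(inputs).tolist()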

  15. Effect of Replication
     • 10 Gbps network: the GPU is the bottleneck, throughput scales out with replicas
     • 1 Gbps network: the network is the bottleneck, throughput does not scale out

  16. Model Selection

  17. Model Selection
     • Enables running multiple models
     • Advantages
       • Combine outputs from different models (if run in parallel)
       • Estimate prediction accuracy (through comparison)
       • Switch to a better model (when feedback is available)
     • Disadvantage of running models in parallel: stragglers
       • They can often be ignored with minimal accuracy loss
     • Context: different model selection state per user or session

  18. Model Selection API
     • S: selection policy state
     • X: input
     • Y: prediction / feedback
     • Feedback is incorporated into the selection state
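The selection policy is a small interface over the state S, inputs X, and predictions/feedback Y. The sketch below follows that shape; the method names are close to, but not guaranteed to match, the listing in the Clipper paper.

class SelectionPolicy:
    """Sketch of a selection policy over state S, inputs X, predictions/feedback Y."""
    def init(self):
        """Return the initial selection state S."""
        raise NotImplementedError

    def select(self, s, x):
        """Choose which model(s) to query for input x, given state s."""
        raise NotImplementedError

    def combine(self, s, x, predictions):
        """Turn the (possibly partial) set of model outputs into one prediction."""
        raise NotImplementedError

    def observe(self, s, x, feedback):
        """Incorporate feedback for input x and return the updated state."""
        raise NotImplementedError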

  19. Single-Model Selection
     • Multi-armed bandit
       • Select one action, observe the outcome
       • Decide whether to explore a new action or exploit the current one
     • Exp3 algorithm (sketched below)
       • Choose an action based on a probability distribution
       • Adjust the probability distribution of the current choice based on the loss
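A minimal version of the Exp3 scheme for single-model selection, assuming losses in [0, 1]. The learning rate, the exploration mix, and the class itself are illustrative choices, not the exact variant used in Clipper.

import math
import random

class Exp3:
    """Exp3 bandit over k models (sketch): pick one, observe a loss, reweight."""
    def __init__(self, k, eta=0.1, gamma=0.1):
        self.k = k
        self.eta = eta        # learning rate
        self.gamma = gamma    # probability mass reserved for exploration
        self.weights = [1.0] * k

    def probabilities(self):
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.k
                for w in self.weights]

    def select(self):
        return random.choices(range(self.k), weights=self.probabilities())[0]

    def observe(self, chosen, loss):
        # Importance-weight the loss by the probability of having chosen this
        # model, then shrink its weight exponentially in the weighted loss.
        p = self.probabilities()[chosen]
        self.weights[chosen] *= math.exp(-self.eta * loss / p)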

  20. Multi-Model Ensembles

  21. Ensembles and Changing Accuracy

  22. Ensembles and Stragglers

  23. Personalized Model Selection
     • Model selection can be done per user

  24. TensorFlow Serving
     • Inference mechanism of TensorFlow
     • Can run TensorFlow models
     • Also uses batching (static)
     • Missing features
       • No latency objectives
       • No support for multiple models
       • No feedback incorporation
