Model-Switching: Dealing with Fluctuating Workloads in MLaaS* Systems

Jeff Zhang¹, Sameh Elnikety², Shuayb Zarar², Atul Gupta², Siddharth Garg¹
¹New York University, ²Microsoft
jeffjunzhang@nyu.edu

[*] MLaaS: Machine Learning-as-a-Service
USENIX HotCloud 2020
Deep-Learning Models are Pervasive
Computations in Deep Learning

Convolutions account for more than 90% of the computation → they dominate both run-time and energy.

• Execution time factors: network depth, and the activation/filter sizes in each layer

[Figure: a convolution layer, showing the activation tensor, the convolution filters, and the output tensor]

[Ref.] Sze, V., et al., "Efficient Processing of Deep Neural Networks: A Tutorial and Survey", Proceedings of the IEEE, 2017.
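To make these cost factors concrete, below is a minimal sketch of how the multiply-accumulate (MAC) count of a single convolution layer grows with output resolution, filter size, and channel counts; the layer shape in the example is illustrative, not taken from the slide.

```python
def conv2d_macs(h_out, w_out, k, c_in, c_out):
    """MACs for one convolution layer: each of the h_out*w_out*c_out
    output elements needs k*k*c_in multiply-accumulates."""
    return h_out * w_out * c_out * k * k * c_in

# Example: a 3x3 convolution mapping 64 -> 64 channels on a 56x56
# feature map (a typical early ResNet block shape).
print(f"{conv2d_macs(56, 56, 3, 64, 64) / 1e6:.1f} M MACs")  # ~115.6 M
```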
Machine Learning Lifecycle (Typical)

[Figure: the MLaaS pipeline, from offline model development (data collection, cleaning & visualization, feature engineering, model design, training pipelines, validation) through training on live data to the online prediction service answering end-user queries]

• Model Development
  • Prescribes model design, architecture, and data processing
• Training
  • At scale, on live data
  • Retraining on new data; manage model versioning
• Serving (Online Inference)
  • Deploys the trained model to a device, the edge, or the cloud

[Ref.] Gonzalez, J., et al., "Deploying Interactive ML Applications with Clipper".
MLaaS: Challenges and Limitations

Goal: maintain QoS under dynamic workloads.

Existing solution: static model versioning
• Tie each application to one specific model at run-time
• In the event of load spikes:
  • Prune requests (new, low-priority, near-deadline, etc.) → QoS violations, unhappy customers
  • Add "significant new capacity" (autoscaling) → not economically viable for the provider

Example: March 28, 2020
• Service success rates dropped below 99.99%
• Teams suffered a 2-hour outage in Europe
• Free offers and new subscriptions were limited

[Ref.] Gujarati, A., et al., "Swayam: Distributed Autoscaling to Meet SLAs of ML Inference Services with Resource Efficiency", ACM Middleware 2017.
Opportunity: DNN Model Diversity

[Figure: single-image inference with ResNet-x — accuracy (%) vs. model execution time (ms) for ResNet-18/34/50/101/152, each run with 1, 2, 4, 8, or 16 threads; deeper models are more accurate but slower]

For the same application, many models can be trained, with tradeoffs among accuracy, inference time, and computation cost (parallelism).

[Ref.] He, K., et al., "Deep Residual Learning for Image Recognition", IEEE CVPR 2016.
Questions in this Study

End users send requests to an ML application hosted by the MLaaS provider, which renders predictions under SLAs.

• What is the QoS?
• Which DNN model to use?
• How to allocate resources?

Make all decisions online!

Assumptions: fluctuating workloads, fixed hardware capacity
Typical SLA objectives: latency, throughput, cost, …
What Do Users Care About?

Example responses from the MLaaS provider to an application's end users:
1. "Cat", in 5 s
2. "Dog", in 0.3 s

From the users' perspective, deadline misses and incorrect predictions are equally bad:
• A user can always meet the deadline by guessing randomly

Users want quick and correct predictions!
A New Metric for MLaaS

Effective Accuracy ($b_{\mathrm{eff}}$): the fraction of correct predictions within the deadline $D$:

$$b_{\mathrm{eff}} = q_{\mu,D} \times b$$

where $q_{\mu,D}$ is the likelihood of meeting deadline $D$ at load $\mu$, and $b$ is the model's baseline accuracy.

[Figure: effective accuracy vs. load (queries per second) for ResNet-18/34/50/101/152]

No single DNN works best at all load levels.
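A minimal sketch of how effective accuracy could be estimated in practice: $q_{\mu,D}$ is computed empirically as the fraction of latency samples (collected at load $\mu$) that meet the deadline, then multiplied by the baseline accuracy. The latency distribution and accuracy value below are illustrative placeholders, not measurements from the talk.

```python
import numpy as np

def effective_accuracy(latencies_ms, deadline_ms, baseline_accuracy):
    """b_eff = q_{mu,D} * b: the deadline-meeting probability q, estimated
    from latency samples collected at load mu, times baseline accuracy b."""
    q = np.mean(np.asarray(latencies_ms) <= deadline_ms)
    return q * baseline_accuracy

# Illustrative only: a made-up latency distribution and accuracy value.
rng = np.random.default_rng(seed=0)
latencies = rng.gamma(shape=4.0, scale=50.0, size=10_000)  # ms
print(effective_accuracy(latencies, deadline_ms=750, baseline_accuracy=0.93))
```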
Characterizing DNN Parallelism

[Figure: ResNet-152 99th-percentile end-to-end query latency (ms) vs. job arrival rate (queries per second), under a fixed capacity of 16 threads split across R replicas × T threads: R:16 T:1, R:8 T:2, R:4 T:4, R:2 T:8, R:1 T:16]

As load increases, additional replicas help more than additional threads.
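The thread dimension of this characterization could be measured with a harness along these lines; this sketch only varies the intra-op thread count on an otherwise idle machine (replicas would be separate processes or containers sharing the 16-thread budget, and queuing under load is not modeled here).

```python
import time
import numpy as np
import torch
import torchvision.models as models

def p99_latency_ms(model, threads, n_queries=200):
    """99th-percentile latency of single-image inference at a given
    intra-op thread count. Replicas (the R dimension) would be separate
    processes or containers; they are not modeled in this sketch."""
    torch.set_num_threads(threads)
    model.eval()
    x = torch.randn(1, 3, 224, 224)
    latencies = []
    with torch.no_grad():
        for _ in range(n_queries):
            t0 = time.perf_counter()
            model(x)
            latencies.append((time.perf_counter() - t0) * 1e3)
    return np.percentile(latencies, 99)

# pretrained=True matches 2020-era torchvision; newer releases use weights=.
resnet152 = models.resnet152(pretrained=True)
for t in (1, 2, 4, 8, 16):  # thread counts, i.e. T in an R x T = 16 split
    print(f"T={t}: p99 = {p99_latency_ms(resnet152, threads=t):.1f} ms")
```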
Online Model Switching Framework

The model that exhibits the best effective accuracy is a function of the load, for a given SLA deadline. An offline-trained policy maps load to the best model and parallelism, e.g.:

Load (QPS)   Best Model   Parallelism
0-4          ResNet-152   <R:4 T:4>
…            …            …
> 20         ResNet-18    …

Online serving: queries enter through the front-end; a load-change detection module feeds the model-switching controller, which dynamically selects the model with the best effective accuracy for the current load (a sketch of the controller follows below).
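A minimal sketch of such a controller. Only the first and last policy rows come from the slide; the intermediate thresholds are assumptions, and `measure_qps`/`switch_to` are hypothetical hooks standing in for the front-end's load detector and the serving framework's query router.

```python
import time

# Offline-trained policy: observed load (QPS) -> model with the best
# effective accuracy. Intermediate thresholds below are assumptions.
POLICY = [
    (4.0, "resnet152"),          # 0-4 QPS: most accurate model (from slide)
    (8.0, "resnet101"),          # assumption
    (12.0, "resnet50"),          # assumption
    (20.0, "resnet34"),          # assumption
    (float("inf"), "resnet18"),  # > 20 QPS: fastest model (from slide)
]

def pick_model(observed_qps):
    """Return the policy's model choice for the current load."""
    for max_qps, model in POLICY:
        if observed_qps <= max_qps:
            return model

def control_loop(measure_qps, switch_to, period_s=1.0):
    """Each sampling period (1 s in the talk), re-check the load and
    switch the serving model only when the policy's choice changes."""
    current = None
    while True:
        target = pick_model(measure_qps())
        if target != current:
            switch_to(target)  # re-route queries to the target model
            current = target
        time.sleep(period_s)
```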
Experimental Setup

• Built on top of Clipper, an open-source containerized model-serving framework (caching and adaptive batching disabled)
• Deployed PyTorch pretrained ResNet models on ImageNet (R:4, T:4)
• Two dedicated Azure VMs:
  • Server: 32 vCPUs + 128 GB RAM
  • Client: 8 vCPUs + 32 GB RAM
• Markov-model-based load generator (sketched below)
  • Open system model
  • Poisson inter-arrivals
• Model sampling / switching period: 1 sec

[Ref.] Crankshaw, D., et al., "Clipper: A Low-Latency Online Prediction Serving System", USENIX NSDI 2017.
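A sketch of such a load generator, assuming the structure described above: an open-loop Poisson arrival process whose rate is stepped by a simple Markov chain over load levels. The level values and the `send_query` hook are illustrative, not the talk's actual configuration.

```python
import random
import time

def poisson_load(send_query, rate_qps, duration_s):
    """Open system model: queries arrive with exponential inter-arrival
    times (a Poisson process), regardless of when responses come back."""
    end = time.time() + duration_s
    while time.time() < end:
        time.sleep(random.expovariate(rate_qps))  # mean gap = 1/rate_qps
        send_query()  # fire-and-forget; do not wait for the response

def markov_rates(levels=(1, 5, 10, 15, 20, 25), steps=300):
    """Random walk over discrete load levels: each step moves the rate
    one level up, one level down, or holds, mimicking load fluctuation."""
    i = random.randrange(len(levels))
    for _ in range(steps):
        i = max(0, min(len(levels) - 1, i + random.choice((-1, 0, 1))))
        yield levels[i]

# Drive the generator: hold each sampled rate for one second of traffic.
for qps in markov_rates():
    poisson_load(send_query=lambda: None, rate_qps=qps, duration_s=1.0)
```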
Evaluation: Automatic Model-Switching

[Figure: load (queries per second) over time (sec), annotated with the model selected at each moment, from ResNet-152 at low load down to ResNet-18 at peak load]

Model-Switching can quickly adapt to load spikes.
Evaluation: Effective Accuracy

[Figure: effective accuracy vs. SLA deadline (s), for Model-Switch and each static ResNet-18/34/50/101/152 deployment]

Model-Switching achieves Pareto-optimal effective accuracy.
Evaluation: Tail Latency

[Figure: empirical CDF of end-to-end query latency (s), for Model-Switch and each static ResNet deployment; SLA deadline: 750 ms]

Model-Switching trades off deadline slack for accuracy.
Thank You and Questions

Model-Switching: Managing Fluctuating Workloads in MLaaS Systems
Jeff Zhang, jeffjunzhang@nyu.edu
Discussion and Future Work

• How to prepare a pool of models for each application?
  • Neural architecture search, multi-level quantization
• Current approach pre-deploys all (20) candidate models
  • Cold-start time (ML): tens of seconds
  • RAM overhead: currently 11.8% of the total 128 GB RAM
• Reinforcement-learning-based controller for model switching
  • Account for job queue status, system load, and current latency
  • No offline policy training required
• Integrate with existing MLaaS techniques
  • Batching, caching, autoscaling, etc.
• Exploit the availability of heterogeneous computing resources
  • CPU, GPU, TPU, FPGA