Model-Switching: Dealing with Fluctuating Workloads in MLaaS* Systems

Jeff Zhang¹, Sameh Elnikety², Shuayb Zarar², Atul Gupta², Siddharth Garg¹
¹New York University, ²Microsoft
jeffjunzhang@nyu.edu

[*] MLaaS: Machine Learning-as-a-Service
USENIX HotCloud 2020
Deep-Learning Models are Pervasive
Computations in Deep Learning

Convolutions account for more than 90% of the computation → they dominate both run-time and energy.

• Execution time factors: network depth, and the activation/filter sizes in each layer

[Figure: a convolution layer, showing the activation tensor, the convolution filters, and the output tensor]

[Ref.] Sze, V., et al., "Efficient Processing of Deep Neural Networks: A Tutorial and Survey", Proceedings of the IEEE, 2017.
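To make these cost factors concrete, below is a minimal sketch of how the multiply-accumulate (MAC) count of a single convolution layer grows with output resolution, filter size, and channel counts; the layer shape in the example is illustrative, not taken from the slide.

```python
def conv2d_macs(h_out, w_out, k, c_in, c_out):
    """MACs for one convolution layer: each of the h_out*w_out*c_out
    output elements needs k*k*c_in multiply-accumulates."""
    return h_out * w_out * c_out * k * k * c_in

# Example: a 3x3 convolution mapping 64 -> 64 channels on a 56x56
# feature map (a typical early ResNet block shape).
print(f"{conv2d_macs(56, 56, 3, 64, 64) / 1e6:.1f} M MACs")  # ~115.6 M
```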
Machine Learning Lifecycle (Typical)

[Figure: the MLaaS pipeline, from offline model development (data collection, cleaning & visualization, feature engineering, model design, training pipelines, validation) through training on live data to the online prediction service answering end-user queries]

• Model Development
  • Prescribes model design, architecture, and data processing
• Training
  • At scale, on live data
  • Retraining on new data; manage model versioning
• Serving (Online Inference)
  • Deploys the trained model to a device, the edge, or the cloud

[Ref.] Gonzalez, J., et al., "Deploying Interactive ML Applications with Clipper".
MLaaS: Challenges and Limitations

Goal: maintain QoS under dynamic workloads.

Existing solution: static model versioning
• Tie each application to one specific model at run-time
• In the event of load spikes:
  • Prune requests (new, low-priority, near-deadline, etc.) → QoS violations, unhappy customers
  • Add "significant new capacity" (autoscaling) → not economically viable for the provider

Example: March 28, 2020
• Service success rates dropped below 99.99%
• Teams suffered a 2-hour outage in Europe
• Free offers and new subscriptions were limited

[Ref.] Gujarati, A., et al., "Swayam: Distributed Autoscaling to Meet SLAs of ML Inference Services with Resource Efficiency", ACM Middleware 2017.
Opportunity: DNN Model Diversity

[Figure: single-image inference with ResNet-x — accuracy (%) vs. model execution time (ms) for ResNet-18/34/50/101/152, each run with 1, 2, 4, 8, or 16 threads; deeper models are more accurate but slower]

For the same application, many models can be trained, with tradeoffs among accuracy, inference time, and computation cost (parallelism).

[Ref.] He, K., et al., "Deep Residual Learning for Image Recognition", IEEE CVPR 2016.
Questions in this Study

End users send requests to an ML application hosted by the MLaaS provider, which renders predictions under SLAs.

• What is the QoS?
• Which DNN model to use?
• How to allocate resources?

Make all decisions online!

Assumptions: fluctuating workloads, fixed hardware capacity
Typical SLA objectives: latency, throughput, cost, …
What Do Users Care About?

Example responses from the MLaaS provider to an application's end users:
1. "Cat", in 5 s
2. "Dog", in 0.3 s

From the users' perspective, deadline misses and incorrect predictions are equally bad:
• A user can always meet the deadline by guessing randomly

Users want quick and correct predictions!
A New Metric for MLaaS

Effective Accuracy ($b_{\mathrm{eff}}$): the fraction of correct predictions within the deadline $D$:

$$b_{\mathrm{eff}} = q_{\mu,D} \times b$$

where $q_{\mu,D}$ is the likelihood of meeting deadline $D$ at load $\mu$, and $b$ is the model's baseline accuracy.

[Figure: effective accuracy vs. load (queries per second) for ResNet-18/34/50/101/152]

No single DNN works best at all load levels.
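A minimal sketch of how effective accuracy could be estimated in practice: $q_{\mu,D}$ is computed empirically as the fraction of latency samples (collected at load $\mu$) that meet the deadline, then multiplied by the baseline accuracy. The latency distribution and accuracy value below are illustrative placeholders, not measurements from the talk.

```python
import numpy as np

def effective_accuracy(latencies_ms, deadline_ms, baseline_accuracy):
    """b_eff = q_{mu,D} * b: the deadline-meeting probability q, estimated
    from latency samples collected at load mu, times baseline accuracy b."""
    q = np.mean(np.asarray(latencies_ms) <= deadline_ms)
    return q * baseline_accuracy

# Illustrative only: a made-up latency distribution and accuracy value.
rng = np.random.default_rng(seed=0)
latencies = rng.gamma(shape=4.0, scale=50.0, size=10_000)  # ms
print(effective_accuracy(latencies, deadline_ms=750, baseline_accuracy=0.93))
```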
Characterizing DNN Parallelism

[Figure: ResNet-152 99th-percentile end-to-end query latency (ms) vs. job arrival rate (queries per second), under a fixed capacity of 16 threads split across R replicas × T threads: R:16 T:1, R:8 T:2, R:4 T:4, R:2 T:8, R:1 T:16]

As load increases, additional replicas help more than additional threads.
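The thread dimension of this characterization could be measured with a harness along these lines; this sketch only varies the intra-op thread count on an otherwise idle machine (replicas would be separate processes or containers sharing the 16-thread budget, and queuing under load is not modeled here).

```python
import time
import numpy as np
import torch
import torchvision.models as models

def p99_latency_ms(model, threads, n_queries=200):
    """99th-percentile latency of single-image inference at a given
    intra-op thread count. Replicas (the R dimension) would be separate
    processes or containers; they are not modeled in this sketch."""
    torch.set_num_threads(threads)
    model.eval()
    x = torch.randn(1, 3, 224, 224)
    latencies = []
    with torch.no_grad():
        for _ in range(n_queries):
            t0 = time.perf_counter()
            model(x)
            latencies.append((time.perf_counter() - t0) * 1e3)
    return np.percentile(latencies, 99)

# pretrained=True matches 2020-era torchvision; newer releases use weights=.
resnet152 = models.resnet152(pretrained=True)
for t in (1, 2, 4, 8, 16):  # thread counts, i.e. T in an R x T = 16 split
    print(f"T={t}: p99 = {p99_latency_ms(resnet152, threads=t):.1f} ms")
```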
Online Model Switching Framework

The model that exhibits the best effective accuracy is a function of the load, for a given SLA deadline. An offline-trained policy maps load to the best model and parallelism, e.g.:

Load (QPS)   Best Model   Parallelism
0-4          ResNet-152   <R:4 T:4>
…            …            …
> 20         ResNet-18    …

Online serving: queries enter through the front-end; a load-change detection module feeds the model-switching controller, which dynamically selects the model with the best effective accuracy for the current load (a sketch of the controller follows below).
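A minimal sketch of such a controller. Only the first and last policy rows come from the slide; the intermediate thresholds are assumptions, and `measure_qps`/`switch_to` are hypothetical hooks standing in for the front-end's load detector and the serving framework's query router.

```python
import time

# Offline-trained policy: observed load (QPS) -> model with the best
# effective accuracy. Intermediate thresholds below are assumptions.
POLICY = [
    (4.0, "resnet152"),          # 0-4 QPS: most accurate model (from slide)
    (8.0, "resnet101"),          # assumption
    (12.0, "resnet50"),          # assumption
    (20.0, "resnet34"),          # assumption
    (float("inf"), "resnet18"),  # > 20 QPS: fastest model (from slide)
]

def pick_model(observed_qps):
    """Return the policy's model choice for the current load."""
    for max_qps, model in POLICY:
        if observed_qps <= max_qps:
            return model

def control_loop(measure_qps, switch_to, period_s=1.0):
    """Each sampling period (1 s in the talk), re-check the load and
    switch the serving model only when the policy's choice changes."""
    current = None
    while True:
        target = pick_model(measure_qps())
        if target != current:
            switch_to(target)  # re-route queries to the target model
            current = target
        time.sleep(period_s)
```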
Experimental Setup

• Built on top of Clipper, an open-source containerized model-serving framework (caching and adaptive batching disabled)
• Deployed PyTorch pretrained ResNet models on ImageNet (R:4, T:4)
• Two dedicated Azure VMs:
  • Server: 32 vCPUs + 128 GB RAM
  • Client: 8 vCPUs + 32 GB RAM
• Markov-model-based load generator (sketched below)
  • Open system model
  • Poisson inter-arrivals
• Model sampling / switching period: 1 sec

[Ref.] Crankshaw, D., et al., "Clipper: A Low-Latency Online Prediction Serving System", USENIX NSDI 2017.
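A sketch of such a load generator, assuming the structure described above: an open-loop Poisson arrival process whose rate is stepped by a simple Markov chain over load levels. The level values and the `send_query` hook are illustrative, not the talk's actual configuration.

```python
import random
import time

def poisson_load(send_query, rate_qps, duration_s):
    """Open system model: queries arrive with exponential inter-arrival
    times (a Poisson process), regardless of when responses come back."""
    end = time.time() + duration_s
    while time.time() < end:
        time.sleep(random.expovariate(rate_qps))  # mean gap = 1/rate_qps
        send_query()  # fire-and-forget; do not wait for the response

def markov_rates(levels=(1, 5, 10, 15, 20, 25), steps=300):
    """Random walk over discrete load levels: each step moves the rate
    one level up, one level down, or holds, mimicking load fluctuation."""
    i = random.randrange(len(levels))
    for _ in range(steps):
        i = max(0, min(len(levels) - 1, i + random.choice((-1, 0, 1))))
        yield levels[i]

# Drive the generator: hold each sampled rate for one second of traffic.
for qps in markov_rates():
    poisson_load(send_query=lambda: None, rate_qps=qps, duration_s=1.0)
```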
Evaluation: Automatic Model-Switching

[Figure: load (queries per second) over time (sec), annotated with the model selected at each moment, from ResNet-152 at low load down to ResNet-18 at peak load]

Model-Switching can quickly adapt to load spikes.
Evaluation: Effective Accuracy

[Figure: effective accuracy vs. SLA deadline (s), for Model-Switch and each static ResNet-18/34/50/101/152 deployment]

Model-Switching achieves Pareto-optimal effective accuracy.
Evaluation: Tail Latency

[Figure: empirical CDF of end-to-end query latency (s), for Model-Switch and each static ResNet deployment; SLA deadline: 750 ms]

Model-Switching trades off deadline slack for accuracy.
Thank You and Questions

Model-Switching: Managing Fluctuating Workloads in MLaaS Systems
Jeff Zhang, jeffjunzhang@nyu.edu
Discussion and Future Work

• How to prepare a pool of models for each application?
  • Neural architecture search, multi-level quantization
• Current approach pre-deploys all (20) candidate models
  • Cold-start time (ML): tens of seconds
  • RAM overhead: currently 11.8% of the total 128 GB RAM
• Reinforcement-learning-based controller for model switching
  • Account for job queue status, system load, and current latency
  • No offline policy training required
• Integrate with existing MLaaS techniques
  • Batching, caching, autoscaling, etc.
• Exploit the availability of heterogeneous computing resources
  • CPU, GPU, TPU, FPGA