Swayam: Distributed Autoscaling for Machine Learning as a Service
Sameh Elnikety, Arpan Gujarati, Kathryn S. McKinley, Yuxiong He, Björn B. Brandenburg
Machine Learning as a Service (MLaaS)
Examples: Amazon Machine Learning, Data Science & Machine Learning, Google Cloud AI.
MLaaS has two phases:
1. Training: untrained model + dataset = trained model
2. Prediction: trained model + query = answer
This work focuses on prediction: the models are already trained and available for prediction.
Swayam: distributed autoscaling of the compute resources needed for prediction serving (trained model + query = answer) inside the MLaaS infrastructure.
Prediction serving (application perspective)
The application / end user sends a query, e.g., an image, to the MLaaS provider's image classifier and receives the answer "cat".
Prediction serving (provider perspective)
The MLaaS provider hosts lots of trained models, a finite set of compute resources ("backends" for prediction), and multiple request dispatchers ("frontends").
(1) A new prediction request for the pink model arrives.
(2) A frontend receives the request.
(3) The request is dispatched to an idle backend.
(4) The backend fetches the pink model.
(5) The request outcome is predicted.
(6) The response is sent back through the frontend.
This dispatch path is sketched below.
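The following is a minimal, hypothetical sketch of that dispatch path; the names (Frontend, Backend, fetch_model, run_inference) are placeholders and not Swayam's actual implementation.

```python
# Hypothetical sketch of the dispatch path (steps 1-6); not Swayam's actual code.

def fetch_model(model_id):
    """Placeholder: download and install the trained model (takes seconds in practice)."""
    return f"model:{model_id}"

def run_inference(model, query):
    """Placeholder: run the prediction (typically ~10ms to 500ms per the slides)."""
    return f"prediction for {query!r} using {model}"

class Backend:
    def __init__(self):
        self.installed_model_id = None
        self.model = None

    def predict(self, model_id, query):
        if self.installed_model_id != model_id:       # (4) fetch the model if not installed
            self.model = fetch_model(model_id)
            self.installed_model_id = model_id
        return run_inference(self.model, query)       # (5) predict the request outcome

class Frontend:
    def __init__(self, idle_backends):
        self.idle_backends = idle_backends            # backends available for dispatch

    def handle_request(self, model_id, query):        # (2) a frontend receives the request
        backend = self.idle_backends.pop()            # (3) dispatch to an idle backend
        try:
            return backend.predict(model_id, query)   # (6) response goes back via the frontend
        finally:
            self.idle_backends.append(backend)

# (1) a new prediction request for the "pink" model arrives at a frontend
frontend = Frontend([Backend(), Backend()])
print(frontend.handle_request("pink", "image.jpg"))
```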
Prediction serving (objectives)
- For the application / end user: low latency, i.e., SLA compliance.
- For the MLaaS provider: resource efficiency.
Static partitioning of trained models
The trained models are statically partitioned among the finite backends, so there is no need to fetch and install the pink model on demand.
Problem: not all models are used at all times.
Problem: there are many more models than backends, and each model has a high memory footprint.
Static partitioning is therefore infeasible: it cannot deliver both low latency (SLA compliance) and resource efficiency.
Classical approach: autoscaling
The number of active backends for the pink model is automatically scaled up or down based on the request load for the pink model.
With ideal autoscaling:
- there are always enough backends to guarantee low latency, and
- the number of active backends over time is minimized for resource efficiency.
A rough capacity estimate is sketched below.
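As a back-of-the-envelope illustration (not from the slides), the number of backends an ideal autoscaler would keep active can be estimated from the request rate and the mean execution time; the utilization target and function name below are assumptions.

```python
import math

def ideal_backend_count(request_rate, mean_execution_time, target_utilization=0.8):
    """Rough capacity estimate: backends needed for `request_rate` requests/second
    when each request takes `mean_execution_time` seconds and each backend is kept
    at most `target_utilization` busy. Illustrative only; Swayam's actual estimator
    is SLA-aware (later slides)."""
    offered_load = request_rate * mean_execution_time   # average number of busy backends
    return max(1, math.ceil(offered_load / target_utilization))

# Example: 200 req/s at 50 ms per request with an 80% utilization target -> 13 backends
print(ideal_backend_count(200, 0.050))
```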
Autoscaling for MLaaS is challenging [1/3]
Challenge: provisioning a backend, i.e., fetching and installing the model (step 4), takes a few seconds, whereas executing a request (step 5) takes only ~10ms to 500ms.
Requirement: predictive autoscaling to hide the provisioning latency (a sketch follows).
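A minimal sketch of predictive scale-up, assuming a hypothetical short-term load forecast; the naive linear extrapolation and all names below are placeholders, not Swayam's predictor.

```python
import math

def backends_to_provision(recent_rates, provisioning_delay, mean_execution_time,
                          active_backends, target_utilization=0.8):
    """Decide how many extra backends to start *now* so they are warm once the
    provisioning delay (a few seconds) has elapsed. `recent_rates` holds recently
    observed request rates (req/s), sampled once per second. The linear
    extrapolation below is a stand-in for a real load predictor."""
    if len(recent_rates) < 2:
        return 0
    trend = recent_rates[-1] - recent_rates[-2]                   # req/s per second
    predicted_rate = max(0.0, recent_rates[-1] + trend * provisioning_delay)
    needed = math.ceil(predicted_rate * mean_execution_time / target_utilization)
    return max(0, needed - active_backends)

# Example: load rose from 180 to 200 req/s; with a 5s provisioning delay we
# plan for ~300 req/s and start 6 additional backends ahead of time.
print(backends_to_provision([180, 200], 5.0, 0.050, active_backends=13))
```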
Autoscaling for MLaaS is challenging [2/3]
The MLaaS architecture is large-scale and multi-tiered: a hardware broker, multiple frontends, and many backends (VMs, containers).
Challenge: multiple frontends, each with only partial information about the workload.
Requirement: fast, coordination-free, globally-consistent autoscaling decisions on the frontends (illustrated below).
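One way to reach consistent decisions without coordination, shown here only as a hedged sketch and not necessarily Swayam's exact protocol, is for each frontend to extrapolate the global request rate from its local share, assuming requests are spread roughly evenly across a known number of frontends.

```python
import math

def local_backend_target(local_rate, num_frontends, mean_execution_time,
                         target_utilization=0.8):
    """Run independently on every frontend. If the load balancer spreads requests
    roughly evenly, each frontend extrapolates (almost) the same global rate from
    its local observation and therefore computes (almost) the same backend target,
    without any frontend-to-frontend coordination. Illustrative sketch only."""
    estimated_global_rate = local_rate * num_frontends
    return math.ceil(estimated_global_rate * mean_execution_time / target_utilization)

# Example: each of 4 frontends sees ~50 req/s locally -> all compute the same target (13)
print(local_backend_target(50, 4, 0.050))
```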
Autoscaling for MLaaS is challenging [3/3]
Strict, model-specific SLAs on response times, for example:
- "99% of requests must complete under 500ms"
- "99.9% of requests must complete under 1s"
- "[A] 95% of requests must complete under 850ms"
- "[B] Tolerate up to a 25% increase in request rates without violating [A]"
Challenge: no closed-form solutions exist for the response-time distributions needed for SLA-aware autoscaling.
Requirement: accurate waiting-time and execution-time distributions (an empirical check is sketched below).
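Empirically, an SLA of the form "p% of requests must complete under T" can be checked against measured waiting and execution times, since a request's response time is (roughly) their sum. The sketch below is illustrative only and uses synthetic data; it is not Swayam's analytical model.

```python
import numpy as np

def meets_sla(waiting_times_ms, execution_times_ms, sla_ms=850.0, percentile=95.0):
    """Check an SLA such as '95% of requests must complete under 850ms' against
    measured per-request waiting and execution times (in milliseconds); each
    request's response time is approximated as waiting + execution."""
    response_times = np.asarray(waiting_times_ms) + np.asarray(execution_times_ms)
    return np.percentile(response_times, percentile) <= sla_ms

# Example with synthetic samples (hypothetical distributions, 10,000 requests)
rng = np.random.default_rng(0)
waiting = rng.exponential(scale=100.0, size=10_000)          # queueing delays (ms)
execution = rng.uniform(low=10.0, high=500.0, size=10_000)   # service times, 10-500ms
print(meets_sla(waiting, execution))
```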
Swayam: model-driven distributed autoscaling
Challenges recap:
- Provisioning (step 4, ~a few seconds) takes much longer than execution (step 5, ~10ms to 500ms).
- Multiple frontends have only partial information about the workload.
- No closed-form solutions exist for the response-time distributions needed for SLA-aware autoscaling.
We address these challenges by leveraging specific ML workload characteristics and by designing an analytical model for resource estimation that enables distributed and predictive autoscaling.
Outline
1. System architecture, key ideas
2. Analytical model for resource estimation
3. Evaluation results
System architecture
The backends are organized into dedicated sets, one per trained model (pink, blue, green), plus a global pool of backends; applications / end users send requests to the frontends, and a hardware broker manages the backends.
Objective: each dedicated set of backends should dynamically scale.
1. If load decreases, extra backends go back to the global pool (for resource efficiency).
2. If load increases, new backends are set up in advance (for SLA compliance).
In what follows, let's focus on the pink model; the scale-up/scale-down behavior is sketched below.
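A minimal sketch of that per-model scaling objective, assuming hypothetical names (rescale, dedicated, global_pool); how and when Swayam triggers these transitions is governed by its analytical model, covered later.

```python
def rescale(required, dedicated, global_pool):
    """Adjust a model's dedicated backend set to `required` backends: grow it from
    the global pool (scale-up, done in advance for SLA compliance) or return extra
    backends to the pool (scale-down, for resource efficiency). Illustrative sketch;
    provisioning delays and error handling are omitted."""
    while len(dedicated) < required and global_pool:
        dedicated.append(global_pool.pop())    # claim a backend and provision it in advance
    while len(dedicated) > required:
        global_pool.append(dedicated.pop())    # release an extra backend back to the pool

# Example: the pink model currently has 3 dedicated backends but needs 5
pink_backends = ["b1", "b2", "b3"]
pool = ["b4", "b5", "b6", "b7"]
rescale(5, pink_backends, pool)
print(pink_backends, pool)   # ['b1', 'b2', 'b3', 'b7', 'b6'] ['b4', 'b5']
```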
Key idea 1: assign states to each backend
- cold: in the global pool
- warm: dedicated to a trained model
  - not-in-use: has not executed a request for a while
  - in-use: maybe executing a request
A sketch of this state machine follows.
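The following sketch encodes those states; the slides only name the states, so the transition events (provisioning, idle timeouts, reclamation) are assumptions made here for illustration.

```python
from enum import Enum

class BackendState(Enum):
    COLD = "cold"                    # in the global pool
    WARM_NOT_IN_USE = "not-in-use"   # dedicated to a trained model, idle for a while
    WARM_IN_USE = "in-use"           # dedicated to a trained model, maybe executing a request

# Hypothetical transition triggers; the exact events are not specified on this slide.
TRANSITIONS = {
    (BackendState.COLD, "provision"): BackendState.WARM_IN_USE,
    (BackendState.WARM_IN_USE, "idle_timeout"): BackendState.WARM_NOT_IN_USE,
    (BackendState.WARM_NOT_IN_USE, "dispatch"): BackendState.WARM_IN_USE,
    (BackendState.WARM_NOT_IN_USE, "reclaim"): BackendState.COLD,
}

def next_state(state, event):
    """Return the follow-up state, or stay put if the event does not apply."""
    return TRANSITIONS.get((state, event), state)

print(next_state(BackendState.COLD, "provision"))   # BackendState.WARM_IN_USE
```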