FIRM: An Intelligent Fine-grained Resource Management Framework for SLO-oriented Microservices
Haoran Qiu*, Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer
DEPEND Research Group, University of Illinois at Urbana-Champaign
* Presenter
From Monolithic to Microservices
• Microservice architecture growing in popularity
  • A set of loosely-coupled, self-contained “micro” services
  • Scalability, fault isolation, flexibility, etc.
• Scale and complexity are increasing
  • Increasing in scale, e.g. 700+ (Netflix in ’17), 1000+ (Uber in ’19)
• Performance guarded by service level objectives (SLOs)
  • Violation leads to financial loss (a 100 ms latency increase converted to a $0.7 billion loss in Amazon sales, Q4 ’18)
[Figure: a monolithic binary (Main() calling Functions A to D) contrasted with a microservice deployment (Client, Gateway, Services A to D); Uber's microservices in 2019]
Performance Predictability in Microservices is Hard
• Challenge #1: Difficulty in isolating root causes of SLO violations
  • Complex inter-microservice dependencies -> cascading SLO violations
• Challenge #2: Inability to capture shared-resource contention at a lower level
  • Interference over shared resources (e.g. LLC, memory bandwidth, network devices)
• Challenge #3: Difficulty in taking the right action to mitigate SLO violations
  • High-fidelity performance models/scheduling heuristics -> significant human effort and training
  • Frequent service updates/migrations -> recurring effort for model reconstruction and re-training
[Figure: two containers on a container engine and host OS kernel sharing a multicore processor (per-core DRAM access, shared L3 cache), main system memory, I/O devices, and network devices; plots of CPU utilization (%) and 99th-percentile latency (ms) over time (s), with FIRM and without FIRM, marking undesired SLO violations]
FIRM As The Cure
• Two-level machine-learning-based SLO violation mitigation framework
  • Challenge #1 – Detection and localization of SLO violations to individual microservices
  • Challenge #2 & #3 – Estimation of resources in contention and dynamic resource reprovisioning
  • Benefits: Improved interpretability and less training time
• Designed, developed, and deployed in a 15-node Kubernetes cluster
• Outperforms Kubernetes autoscaling by up to 16x in reducing SLO violations
[Figure: Kubernetes cluster (master node with API server, controller manager, scheduler, and etcd; worker node with kubelet, proxy, Docker, pods, and containers) serving clients, with FIRM's two-level ML model: Support Vector Machine (SVM)-based state inference fed by measurements, and reinforcement learning (RL)-based SLO violation mitigation issuing actions]
Insight 1: Dynamic Behavior of Critical Paths
• Critical path defines the longest path in execution
• Detection of critical paths helps reveal the performance bottleneck
• Critical path is not static, but changes dynamically based on the performance of individual service instances
  • Different types of underlying shared-resource contention
  • Different degrees of sensitivity to the same type of interference
• It’s important to capture the changes at runtime and make runtime decisions
[Figure: CDFs of latency (ms) for Max-CP versus Min-CP in four benchmarks, with the gap at the 99th percentile annotated: Social Network (1.3x), Media Service (1.5x), Train Ticket Booking (1.6x), Hotel Reservation (2x)]
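As a rough illustration of "the longest path in execution", the sketch below finds the slowest root-to-leaf call chain in a small dependency graph and shows it moving when one service slows down. It is a simplification under stated assumptions, not FIRM's extraction algorithm: the graph, the per-service latencies, and the function name `longest_path` are hypothetical, and every call is treated as sequential.

```python
from functools import lru_cache


def longest_path(graph, latency, root):
    """Return (total_latency, path) for the slowest root-to-leaf call chain.

    graph:   dict mapping a service to the services it calls (assumed acyclic)
    latency: dict mapping a service to its own processing time in ms
    """
    @lru_cache(maxsize=None)
    def best(node):
        children = graph.get(node, [])
        if not children:
            return latency[node], (node,)
        # Follow the child whose downstream chain is slowest.
        sub_total, sub_path = max((best(c) for c in children),
                                  key=lambda result: result[0])
        return latency[node] + sub_total, (node,) + sub_path

    return best(root)


# Hypothetical dependency graph and per-service latencies (ms).
graph = {
    "Gateway": ["Service A", "Service B"],
    "Service A": ["Service C"],
    "Service B": ["Service D"],
}
latency = {"Gateway": 5, "Service A": 20, "Service B": 10,
           "Service C": 30, "Service D": 25}

total_before, path_before = longest_path(graph, latency, "Gateway")

# Contention slows Service B, so the critical path moves to its branch.
slowed = {**latency, "Service B": 40}
total_after, path_after = longest_path(graph, slowed, "Gateway")
```

With these made-up numbers the critical path runs through Service A and Service C at first, then through Service B and Service D once Service B slows down, which is the runtime shift the slide describes.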
Insight 2: Significance of Latency Variability
• Microservices with larger latency are not necessarily the root causes of SLO violations
• Processing time with higher variance makes it harder to obtain low tail latency
• Variability represents opportunities for reducing latency
[Figure: Social Network – Composing Post Request. CDFs of individual latency (ms) and total latency (ms) for the Text and Compose microservices; annotations: High Variance, High Median, Before, Better]
State Inference (1)
• Real-time observability on request execution provided by end-to-end distributed tracing
• Auto-labeled training data driven by performance anomaly injection
[Figure: microservices deployment and service dependency graph (load balancer, Nginx, PHP-FPM, microservice instances and replica sets; Gateway and Services A to D); tracing module with tracing coordinator and performance anomaly injector; span graph (execution) over Gateway and Services A to D]
State Inference (3)
• Real-time observability on request execution provided by end-to-end distributed tracing
• Auto-labeled training data driven by performance anomaly injection
• SLO violation detection and narrowing down via critical path analysis
• SVM-based critical component localization
  • Given individual latency vector $T_i$ and end-to-end latency vector $T_{CP}$
  • Relative importance defined as the Pearson correlation coefficient between $T_i$ and $T_{CP}$
  • Congestion intensity defined as the 99th-percentile value divided by the median value of $T_i$
[Figure: microservices deployment and service dependency graph (load balancer, Nginx, PHP-FPM, microservice instances and replica sets); tracing module with tracing coordinator and performance anomaly injector; extractor taking the execution history graph and telemetry data through critical path extraction, longestPath(), to critical paths, then critical instance extraction, criticalComponent(), to candidates]
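The two per-microservice features defined on this slide can be computed from latency samples as in the sketch below. This is a minimal illustration, not FIRM's code: the function names, the sample values, and the nearest-rank percentile method are assumptions, since the slide does not say how the 99th percentile is estimated.

```python
import math
from statistics import mean, median


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / (sx * sy)


def percentile(values, p):
    """Nearest-rank percentile (an assumed estimator)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def relative_importance(t_i, t_cp):
    """Correlation between one service's latency and end-to-end latency."""
    return pearson(t_i, t_cp)


def congestion_intensity(t_i):
    """99th-percentile latency divided by median latency."""
    return percentile(t_i, 99) / median(t_i)


# Hypothetical samples (ms): one slow request dominates the tail.
t_i = [10, 10, 10, 10, 50]
t_cp = [100, 100, 100, 100, 140]

features = (relative_importance(t_i, t_cp), congestion_intensity(t_i))
```

A feature pair like this, computed per microservice on the critical path, is the kind of input the SVM-based localization step would classify.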
Insight 3: No Common Mitigation Policy for All
• SLO violation mitigation policies vary with applications, user loads, and the types of resource in contention
• Designing an optimal resource provisioning strategy is intractable, just like scheduling problems
  • Modeling complexity: Tetris [SIGCOMM ’14], Jockey [EuroSys ’12]
  • Placement constraints: TetriSched [EuroSys ’16], device placement [NIPS ’17]
  • Data locality: Delay scheduling [EuroSys ’10], SWAG [SoCC ’15]
  • …
[Figure: end-to-end latency (us, log scale from 10^4 to 10^8) versus load (250 to 2250 requests/s) for Social Network and Train Ticket Booking, comparing Scale Up, Scale Out, CPU, and Memory]
Why not human-driven performance engineering?
• No “one-size-fits-all” solution for the online decision problem
  • Best algorithm depends on specific workload and system
• Human-driven performance engineering
  • Assume a simple system model
  • Produce some clever heuristics
  • Painstakingly test & tune the heuristics in practice
  • Redo the above steps
• RL-based SLO violation mitigation
  • Assume a random scheduling policy
  • Perceive states and receive rewards
  • Optimize the policy based on the rewards
  • Loop continues until convergence
• Is there a way to work around human-generated heuristics? Yes
[Figure: microservices and resources]
SLO Violation Mitigation (1)
• Observability improved through online distributed tracing
• Auto-labeled training data and RL online learning driven by the performance anomaly injector
• SLO violation detection and localization via critical path analysis
  • SVM-based critical component extraction
• SLO violation mitigation based on reinforcement learning
  • Identifies low-level resource contention
  • Estimates the right amount to reprovision
[Figure: microservices deployment and service dependency graph (load balancer, Nginx, PHP-FPM, microservice instances and replica sets); tracing module with tracing coordinator and performance anomaly injector; extractor taking the execution history graph, telemetry data, and performance counters through critical path extraction, longestPath(), and critical instance extraction, criticalComponent(), to candidates; RL-based resource estimator sending actions to the deployment module for re-allocation of the controlled resources: CPU, LLC, memory, I/O, network, replicas]
SLO Violation Mitigation (2)
• An RL-based resource estimation agent that learns to make provisioning decisions directly from experience
• Optimizes objectives end-to-end:
  • Minimize SLO violation (mitigate SLO violations fast)
  • Maximize resource utilization efficiency (avoid over-provisioning)
• States ($s_t$): performance and resource measurements (CPU utilization, memory bandwidth, LLC bandwidth, LLC capacity, disk I/O bandwidth, network bandwidth, SLO violation, arrival rate)
• Actions ($a_t$): applied to the microservices managed by FIRM
• Rewards ($r_t$): $r_t = \alpha \cdot SM_t \cdot |\mathcal{R}| + (1 - \alpha) \cdot \sum_{i}^{|\mathcal{R}|} RU_t^i / RL_t^i$
  • $SM_t = Latency_{SLO} / Latency_t$ (SLO maintenance)
  • $RU_t^i$: resource utilization of $i$ at time $t$
  • $RL_t^i$: resource limit of $i$ at time $t$
[Figure: actor-critic RL agent receiving states $s_t$ and rewards $r_t$ (SLO violation, utilization) and issuing actions $a_t$]
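The reward on this slide can be evaluated numerically as in the sketch below. It assumes the reward is read as a weighted sum of the SLO maintenance ratio (SLO latency divided by current latency), scaled by the number of managed resources, and the summed utilization-to-limit ratios of those resources. The function names, the weight `alpha = 0.5`, and all sample values are hypothetical.

```python
def slo_maintenance(slo_latency_ms, current_latency_ms):
    """SLO maintenance ratio; below 1.0 means the SLO is violated."""
    return slo_latency_ms / current_latency_ms


def reward(alpha, slo_latency_ms, current_latency_ms, utilization, limits):
    """Weighted sum of SLO maintenance and resource utilization efficiency.

    utilization, limits: dicts keyed by resource name, in matching units.
    """
    resources = list(limits)
    sm = slo_maintenance(slo_latency_ms, current_latency_ms)
    # Sum of utilization/limit over every managed resource.
    efficiency = sum(utilization[r] / limits[r] for r in resources)
    return alpha * sm * len(resources) + (1 - alpha) * efficiency


# Hypothetical measurement: latency is twice the SLO, CPU is half used,
# and memory is a quarter used.
utilization = {"cpu_cores": 2.0, "memory_gb": 1.0}
limits = {"cpu_cores": 4.0, "memory_gb": 4.0}

r_t = reward(alpha=0.5, slo_latency_ms=200.0, current_latency_ms=400.0,
             utilization=utilization, limits=limits)
```

Under this reading, lowering latency raises the first term and trimming unused limits raises the second, so one scalar captures both objectives listed on the slide.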
Multilevel ML Training
[Figure: workload generators and the anomaly injector drive FIRM’s K8S cluster; the cluster generates experience for reinforcement learning training of the RL agent, and labeling produces feature (X) and label (y) data for FIRM’s SVM model]
Multilevel ML Training
[Figure: workload generators and the anomaly injector drive FIRM’s K8S cluster; the cluster generates experience for reinforcement learning training of the RL agent, and labeling produces feature (X) and label (y) data for FIRM’s SVM model; annotations: 2x, 9x]