FIRM: An Intelligent Fine-grained Resource Management Framework for - PowerPoint PPT Presentation

FIRM: An Intelligent Fine-grained Resource Management Framework for SLO-oriented Microservices
Haoran Qiu*, Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer
DEPEND Research Group, University of Illinois at Urbana-Champaign
* Presenter

From Monolithic to Microservices
• Microservice architecture growing in popularity
  • A set of loosely-coupled, self-contained “micro” services
  • Scalability, fault isolation, flexibility, etc.
• Scale and complexity are increasing
  • Increasing in scale, e.g. 700+ (Netflix in ’17), 1000+ (Uber in ’19)
• Performance guarded by service level objectives (SLOs)
  • Violations lead to financial loss: a 100 ms increase converted to a $0.7 billion loss in Amazon sales (Q4 ’18)
[Figure: a monolithic binary (Main() calling Functions A to D) next to a microservice deployment (client, gateway, Services A to D); panel "Uber in 2019"]

Performance Predictability in Microservices is Hard
• Challenge #1: Difficulty in isolating root causes of SLO violations
  • Complex inter-microservice dependencies lead to cascading SLO violations
• Challenge #2: Inability to capture shared-resource contention at a lower level
  • Interference over shared resources (e.g. LLC, memory bandwidth, network devices)
• Challenge #3: Difficulty in taking the right action to mitigate SLO violations
  • High-fidelity performance models/scheduling heuristics -> significant human effort and training
  • Frequent service updates/migrations -> recurring effort for model reconstruction and re-training
[Figure: Containers 1 and 2 on a container engine and host OS kernel, sharing a multicore processor, shared L3 cache, main system memory, I/O devices, and network devices. Time series over 0 to 300 s of per-core DRAM access, CPU Util (%), and 99%ile latency (ms), with FIRM and without FIRM, marking undesired SLO violations]

FIRM As The Cure
• Two-level machine learning based SLO violation mitigation framework
  • Challenge #1: Detection and localization of SLO violations to individual microservices
  • Challenge #2 & #3: Estimation of resources in contention and dynamic resource reprovisioning
  • Benefits: Improved interpretability and less training time
• Designed, developed, and deployed in a 15-node Kubernetes cluster
• Outperforms Kubernetes autoscaling by up to 16x in reducing SLO violations
[Figure: Kubernetes cluster (master node with API server, controller manager, scheduler, etcd; worker node with kubelet, proxy, Docker, and pods of containers) serving clients. FIRM's two-level ML model takes measurements and issues actions: Support Vector Machine (SVM)-based state inference, then Reinforcement Learning (RL)-based SLO violation mitigation]

Insight 1: Dynamic Behavior of Critical Paths
• The critical path is the longest path in execution
• Detection of critical paths helps reveal the performance bottleneck
• The critical path is not static, but changes dynamically based on the performance of individual service instances
  • Different types of underlying shared-resource contention
  • Different degrees of sensitivity to the same type of interference
• It is important to capture the changes at runtime and make runtime decisions
[Figure: CDFs of latency (ms) for Min-CP and Max-CP in four benchmarks. 99th-percentile annotations, in panel order: 1.3x (Social Network), 1.5x (Media Service), 1.6x (Train Ticket Booking), 2x (Hotel Reservation)]
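The idea that the critical path is the longest path, and that it moves when one instance slows down, can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not from the slides: every span is a hypothetical `(self_latency_ms, children)` record, all children run in parallel, and the parent waits for the slowest child. The service names and latencies are made up.

```python
from typing import Dict, List, Tuple

# Hypothetical span record: (self latency in ms, ids of child spans).
Span = Tuple[float, List[str]]


def longest_path(spans: Dict[str, Span], root: str) -> Tuple[float, List[str]]:
    """Return (total latency, span ids) of the longest root-to-leaf path.

    Assumes children run in parallel, so only the slowest child
    contributes to the end-to-end latency.
    """
    latency, children = spans[root]
    best_total, best_path = 0.0, []
    for child in children:
        total, path = longest_path(spans, child)
        if total > best_total:
            best_total, best_path = total, path
    return latency + best_total, [root] + best_path


trace = {
    "gateway": (5.0, ["service_a", "service_b"]),
    "service_a": (20.0, ["service_c"]),
    "service_b": (30.0, ["service_d"]),
    "service_c": (40.0, []),
    "service_d": (10.0, []),
}
total, path = longest_path(trace, "gateway")

# Slow down one instance: the critical path switches to another branch.
slowed = dict(trace)
slowed["service_d"] = (50.0, [])
slowed_total, slowed_path = longest_path(slowed, "gateway")
print(total, path)
print(slowed_total, slowed_path)
```

A real trace also contains sequential and background calls, which this sketch ignores.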

Insight 2: Significance of Latency Variability
• Microservices with larger latency are not necessarily the root causes of SLO violations
• Processing time with higher variance makes it harder to obtain low tail latency
• Variability represents opportunities for reducing latency
[Figure: Social Network, composing-post request. CDFs of individual latency (ms) and total latency (ms) for the Text and Compose services, annotated "High Variance", "High Median", "Before", and "Better"]

State Inference (1)
• Real-time observability on request execution provided by end-to-end distributed tracing
• Auto-labeled training data driven by performance anomaly injection
[Figure: microservices deployment and service dependency graph (load balancer, Nginx, PHP-FPM, microservice instances, replica sets; gateway and Services A to D) connected to the tracing module, which contains the tracing coordinator and the performance anomaly injector and produces the span graph (execution) over the gateway and Services A to D]

State Inference (3)
• Real-time observability on request execution provided by end-to-end distributed tracing
• Auto-labeled training data driven by performance anomaly injection
• SLO violation detection and narrowing down via critical path analysis
• SVM-based critical component localization
  • Given individual latency vector T_i and end-to-end latency vector T_CP
  • Relative importance defined as the Pearson correlation coefficient between T_i and T_CP
  • Congestion intensity defined as the 99th percentile value divided by the median value of T_i
[Figure: microservices deployment and service dependency graph (load balancer, Nginx, PHP-FPM, microservice instances, replica sets) connected to the tracing module (tracing coordinator, performance anomaly injector), which produces the execution history graph and telemetry data. The extractor runs critical path extraction, longestPath(), to get critical paths, then critical instance extraction, criticalComponent(), to get candidates]
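The two per-instance features defined on this slide can be computed directly from latency samples. The sketch below uses only the Python standard library; the function names, the nearest-rank percentile rule, and the sample latencies are assumptions for illustration, and the SVM that would consume these features is not shown.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100] (an assumed convention)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def relative_importance(t_i: Sequence[float], t_cp: Sequence[float]) -> float:
    """Correlation between one instance's latency and end-to-end latency."""
    return pearson(t_i, t_cp)


def congestion_intensity(t_i: Sequence[float]) -> float:
    """99th-percentile latency divided by median latency of one instance."""
    return percentile(t_i, 99) / percentile(t_i, 50)


# Made-up samples: the end-to-end latency tracks this instance exactly.
t_i = [10.0, 12.0, 11.0, 50.0, 10.0]
t_cp = [100.0, 104.0, 102.0, 180.0, 100.0]
ri = relative_importance(t_i, t_cp)
ci = congestion_intensity(t_i)
print(ri, ci)
```

An instance with both a high relative importance and a high congestion intensity is a natural candidate for the critical component, since its latency both drives the end-to-end latency and has a long tail.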

Insight 3: No Common Mitigation Policy for All
• SLO violation mitigation policies vary with applications, user loads, and the types of resource in contention
• Designing an optimal resource provisioning strategy is intractable, just like scheduling problems
  • Modeling complexity: Tetris [SIGCOMM ’14], Jockey [EuroSys ’12]
  • Placement constraints: TetriSched [EuroSys ’16], device placement [NIPS ’17]
  • Data locality: Delay scheduling [EuroSys ’10], SWAG [SoCC ’15]
  • …
[Figure: end-to-end latency (us, log scale from 10^4 to 10^8) versus load (# requests/s, 250 to 2250) for Social Network and Train Ticket Booking, comparing four mitigation actions: Scale Up, Scale Out, CPU, Memory]

Why not human-driven performance engineering?
• No “one-size-fits-all” solution for the online decision problem
  • Best algorithm depends on specific workload and system
• Human-driven performance engineering
  • Assume a simple system model
  • Produce some clever heuristics
  • Painstakingly test & tune the heuristics in practice
  • Redo the above steps
• RL-based SLO violation mitigation
  • Assume a random scheduling policy
  • Perceive states and receive rewards
  • Optimize the policy based on the rewards
  • Loop continues until convergence
• Is there a way to work around human-generated heuristics? Yes
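The RL loop described on this slide (start from an uninformed policy, observe states and rewards, improve the policy, repeat) can be illustrated with a toy example. This is not FIRM's agent: it swaps in a much simpler tabular, epsilon-greedy value update on a made-up problem, where the state is a hypothetical load level, the action is a hypothetical CPU-limit level, and the reward function is invented to penalize both SLO violations and over-provisioning.

```python
import random

LOADS = [0, 1, 2]   # hypothetical load levels (the state)
LIMITS = [0, 1, 2]  # hypothetical CPU-limit levels (the action)


def toy_reward(load: int, limit: int) -> float:
    """Invented reward: -1 on an SLO violation, otherwise 1 minus a waste penalty."""
    if limit < load:
        return -1.0
    return 1.0 - 0.2 * (limit - load)


def train(steps: int = 3000, epsilon: float = 0.2, lr: float = 0.5, seed: int = 0):
    """Learn a load -> limit policy from rewards alone, with no hand-written heuristic."""
    rng = random.Random(seed)
    q = {s: {a: 0.0 for a in LIMITS} for s in LOADS}  # value estimates start uninformed
    for _ in range(steps):
        state = rng.choice(LOADS)                      # perceive the state
        if rng.random() < epsilon:
            action = rng.choice(LIMITS)                # explore
        else:
            action = max(LIMITS, key=lambda a: q[state][a])  # exploit
        reward = toy_reward(state, action)             # receive the reward
        q[state][action] += lr * (reward - q[state][action])  # improve the estimate
    return {s: max(LIMITS, key=lambda a: q[s][a]) for s in LOADS}


policy = train()
print(policy)
```

The learned policy matches the limit to the load without anyone encoding that rule, which is the point the slide makes about replacing hand-tuned heuristics.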

SLO Violation Mitigation (1)
• Observability improved through online distributed tracing
• Auto-labeled training data and RL online learning driven by the performance anomaly injector
• SLO violation detection and localization via critical path analysis
  • SVM-based critical component extraction
• SLO violation mitigation based on reinforcement learning
  • Identifies low-level resource contention
  • Estimates the right amount to reprovision
[Figure: microservices deployment and service dependency graph (load balancer, Nginx, PHP-FPM, microservice instances, replica sets) connected to the tracing module (tracing coordinator, performance anomaly injector), which produces the execution history graph, telemetry data, and performance counters. The extractor runs critical path extraction, longestPath(), and critical instance extraction, criticalComponent(), to get candidates. The RL-based resource estimator sends actions to the deployment module for re-allocation of the controlled resources: CPU, LLC, memory, I/O, network, replicas]

SLO Violation Mitigation (2)
• An RL-based resource estimation agent that learns to make provisioning decisions directly from experience
• Optimizes objectives end-to-end:
  • Minimize SLO violation (mitigate SLO violations fast)
  • Maximize resource utilization efficiency (avoid over-provisioning)
• States (s_t), from performance and resource measurements: SLO violation, arrival rate, CPU utilization, memory bandwidth, LLC bandwidth, LLC capacity, disk I/O bandwidth, network bandwidth
• Actions (a_t): applied to the microservices managed by FIRM
• Rewards (r_t), computed from SLO violation and utilization:

$$r_t = \alpha \cdot SM_t \cdot |\mathcal{R}| + (1 - \alpha) \cdot \sum_{i=1}^{|\mathcal{R}|} \frac{RU_{i,t}}{RL_{i,t}}$$

$$SM_t = \frac{Latency_{SLO}}{Latency_t}$$

  • SM_t: SLO maintenance
  • RU_{i,t}: resource utilization of i at time t
  • RL_{i,t}: resource limit of i at time t
[Figure: RL agent with actor and critic]
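As a worked example of the reward on this slide, the sketch below evaluates r_t as the weighted sum of the SLO maintenance term SM_t (SLO latency over current latency, scaled by the number of resources) and the summed utilization-to-limit ratios. The function names, the dictionary layout, and all numbers are assumptions for illustration, and no clipping of SM_t is applied because the slide does not mention one.

```python
from typing import Dict


def slo_maintenance(slo_latency_ms: float, current_latency_ms: float) -> float:
    """SM_t: ratio of the SLO latency to the currently measured latency."""
    return slo_latency_ms / current_latency_ms


def reward(
    alpha: float,
    slo_latency_ms: float,
    current_latency_ms: float,
    utilization: Dict[str, float],
    limits: Dict[str, float],
) -> float:
    """r_t = alpha * SM_t * |R| + (1 - alpha) * sum_i RU_i / RL_i."""
    sm = slo_maintenance(slo_latency_ms, current_latency_ms)
    num_resources = len(limits)
    efficiency = sum(utilization[name] / limits[name] for name in limits)
    return alpha * sm * num_resources + (1 - alpha) * efficiency


# Made-up measurements: latency is twice the SLO and both resources are under-used.
r = reward(
    alpha=0.5,
    slo_latency_ms=200.0,
    current_latency_ms=400.0,
    utilization={"cpu": 2.0, "memory_bw": 1.0},
    limits={"cpu": 4.0, "memory_bw": 4.0},
)
print(r)
```

With these inputs SM_t is 0.5 and the utilization ratios sum to 0.75, so the reward is 0.5 * 0.5 * 2 + 0.5 * 0.75. Raising alpha weights SLO maintenance more heavily than utilization efficiency.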

Multilevel ML Training
[Figure: workload generators and the anomaly injector drive FIRM's K8S cluster. Labeling turns the measurements into Feature (X) and Label (y) data for FIRM's SVM model. The cluster also generates experience for reinforcement learning training of the RL agent]

Multilevel ML Training
[Figure: the same training pipeline, annotated with 2x and 9x]
