Seer: Leveraging Big Data to Navigate the Complexity of Performance Debugging in Cloud Microservices
Yu Gan, Yanqi Zhang, Kelvin Hu, Dailun Cheng, Yuan He, Meghna Pancholi, and Christina Delimitrou
Cornell University
ASPLOS, April 15th, 2019
Executive Summary
From monoliths to microservices:
  Monoliths: all functionality in a single service
  Microservices: many single-concern, loosely-coupled services
Microservices implications:
  Modularity, specialization, faster development
  Performance unpredictability (μs-level QoS), cascading QoS violations
  A-posteriori debugging
Seer: Proactive performance debugging for interactive microservices
  Leverage DL to anticipate & diagnose root cause of QoS violations
  >90% accuracy on large-scale end-to-end microservices deployments
  Avoid unpredictable performance
  Offer insight to improve microservices design and deployment
Motivation
[Figure: the components webserver, posts, photos, ads, recommender, and databases, shown as a Monolith and as Microservices]
Advantages of microservices:
  Modular, so easier to understand
  Speed of development & deployment
  On-demand provisioning, elasticity
  Language/framework heterogeneity
Performance Debugging Challenges
[Figure panels: Netflix, Twitter, Amazon]
  Microservices complicate cluster management & performance debugging
  Dependencies cause cascading QoS violations
  Difficult to isolate root cause of performance unpredictability
Performance Debugging Challenges
[Figure panels: Amazon, Netflix, Social Network; animation frames labeled "QoS met" and "QoS violated"]
  Dependencies cause cascading QoS violations
  Empirical performance debugging too slow, bottlenecks propagate
  Long recovery times for performance
  Demo: http://www.csl.cornell.edu/~delimitrou/2019.asplos.seer.demo_motivation.mp4
Seer: Proactive Performance Debugging
[Figure: Cluster manager, Seer, TraceDB]
  Use ML to identify the culprit (root cause) of an upcoming QoS violation
  Leverage the massive amount of distributed traces collected over time
  Use targeted per-server hardware probes to determine the cause of the QoS violation
  Inform cluster manager to take proactive action & prevent QoS violation
  Need to predict 100s of msec to a few sec into the future
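The decision step on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not Seer's implementation: the function `pick_culprit`, the 0.5 threshold, and the service names are hypothetical, and the probabilities stand in for the output of a trained model.

```python
from typing import Dict, Optional


def pick_culprit(violation_prob: Dict[str, float],
                 threshold: float = 0.5) -> Optional[str]:
    """Return the microservice most likely to initiate a QoS violation.

    violation_prob maps a microservice name to its predicted probability.
    Returns None when no probability reaches `threshold` (a hypothetical
    cut-off), meaning no proactive action is needed in this interval.
    """
    if not violation_prob:
        return None
    name, prob = max(violation_prob.items(), key=lambda kv: kv[1])
    return name if prob >= threshold else None


# Hypothetical model output for one decision interval.
predicted = {"nginx": 0.05, "memcached": 0.81, "mongodb": 0.12}
culprit = pick_culprit(predicted)
if culprit is not None:
    # In the flow on the slide, this is where targeted hardware probes would
    # run on the culprit's node and the cluster manager would be informed.
    print(f"probe node hosting {culprit}, then notify cluster manager")
```

Picking only the single highest-probability service keeps the later hardware probing targeted, which matches the slide's emphasis on per-server probes rather than cluster-wide monitoring.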
Instrumentation & Tracing
Two-level tracing
Distributed RPC-level tracing
  Similar to Dapper, Zipkin
  Per-microservice latencies
  Inter- and intra-microservice queue lengths
  Tracing overhead: <0.1% in QPS, <0.2% in 99th-percentile latency
Per-node hardware monitoring
  Targeted on nodes with problematic microservices
  Perf counters & contentious microbenchmarks
[Figure: Client, LB, Front-end, Logic tiers, Back-end DBs; inset labels: Nginx, TCP RX, Epoll, TCP TX, proc]
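As an illustration of the per-microservice latencies that RPC-level tracing yields, the sketch below groups span records by service and computes a tail percentile. The `Span` record layout, the service names, and the timestamps are hypothetical and do not reflect the actual trace schema; the percentile uses the nearest-rank method.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Span(NamedTuple):
    """One RPC as seen by the tracer (hypothetical record layout)."""
    service: str
    start_us: int
    end_us: int


def latencies_by_service(spans: Iterable[Span]) -> Dict[str, List[int]]:
    """Group span durations (in microseconds) by microservice."""
    out: Dict[str, List[int]] = defaultdict(list)
    for s in spans:
        out[s.service].append(s.end_us - s.start_us)
    return dict(out)


def percentile(values: List[int], pct: float) -> int:
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical spans; real traces would hold far more samples per service.
spans = [
    Span("nginx", 0, 120),
    Span("nginx", 10, 90),
    Span("memcached", 20, 45),
    Span("nginx", 50, 400),
]
lat = latencies_by_service(spans)
p99_nginx = percentile(lat["nginx"], 99)
```

With only three samples the 99th percentile is simply the maximum; the calculation becomes meaningful once each service has hundreds of spans per interval.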
DL for Cloud Performance Debugging
Output signal: probability that a microservice will initiate a QoS violation in the near future
Why?
  Architecture-agnostic
  Adjusts to changes over time
  High accuracy, good scalability & fast inference (within window of opportunity)
DL for Cloud Performance Debugging
Input signal: container utilization, latency, queue length
Output signal: probability that a microservice will initiate a QoS violation in the near future
Dimensionality reduction
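The input and output signals on this slide can be pictured with the sketch below. It only shows one plausible data layout and is not the network itself: the helper names, the sample values, and the use of a softmax on the output are assumptions, since the slide states only that the output is a per-microservice probability.

```python
import math
from typing import Dict, List

# The three input signals named on the slide (key names are hypothetical).
SIGNALS = ("utilization", "latency_us", "queue_length")


def build_input(samples: Dict[str, Dict[str, List[float]]],
                services: List[str]) -> List[List[List[float]]]:
    """Arrange traces as a [service][signal][time step] nested list.

    Fixing the service order lets input row i line up with output i.
    """
    return [[list(samples[svc][sig]) for sig in SIGNALS] for svc in services]


def softmax(scores: List[float]) -> List[float]:
    """Turn one raw score per microservice into probabilities summing to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


services = ["nginx", "memcached"]
samples = {
    "nginx": {"utilization": [0.3, 0.4], "latency_us": [120.0, 130.0],
              "queue_length": [1.0, 2.0]},
    "memcached": {"utilization": [0.7, 0.9], "latency_us": [40.0, 95.0],
                  "queue_length": [3.0, 9.0]},
}
x = build_input(samples, services)

# Stand-in for the final-layer scores a trained network would produce.
probs = softmax([0.2, 1.7])
```

Here `x` has one row per microservice, one sub-row per signal, and one entry per time step, while `probs` has one entry per microservice in the same order.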