  1. Seer: Leveraging Big Data to Navigate the Complexity of Performance Debugging in Cloud Microservices. Yu Gan, Yanqi Zhang, Kelvin Hu, Dailun Cheng, Yuan He, Meghna Pancholi, and Christina Delimitrou, Cornell University. ASPLOS, April 15, 2019.

  2. Executive Summary
     - From monoliths to microservices:
       - Monoliths → all functionality in a single service
       - Microservices → many single-concerned, loosely-coupled services
     - Microservices implications:
       - Modularity, specialization, faster development
       - Performance unpredictability (microsecond-level QoS), cascading QoS violations
       - A-posteriori debugging
     - Seer: proactive performance debugging for interactive microservices
       - Leverages DL to anticipate and diagnose the root cause of QoS violations
       - >90% accuracy on large-scale end-to-end microservices deployments
       - Avoids unpredictable performance
       - Offers insight to improve microservices design and deployment

  3. Motivation
     [Figure: the same application (webserver, recommender, ads, posts, photos, databases) built as a single monolith vs. as many microservices]
     - Advantages of microservices:
       - Modular → easier to understand
       - Speed of development and deployment
       - On-demand provisioning, elasticity
       - Language/framework heterogeneity

  4. Performance Debugging Challenges
     [Figure: microservice dependency graphs at Amazon, Netflix, Twitter, and the authors' Social Network application; a bottleneck in one service propagates through its dependents until QoS goes from met to violated]
     - Microservices complicate cluster management and performance debugging
     - Dependencies cause cascading QoS violations
     - Difficult to isolate the root cause of performance unpredictability
     - Empirical performance debugging → too slow, bottlenecks propagate
     - Long recovery times for performance
     Demo: http://www.csl.cornell.edu/~delimitrou/2019.asplos.seer.demo_motivation.mp4

  5. Seer: Proactive Performance Debugging
     [Figure: Seer sits between the trace database (TraceDB) and the cluster manager]
     - Uses ML to identify the culprit (root cause) of an upcoming QoS violation
     - Leverages the massive amount of distributed traces collected over time
     - Uses targeted per-server hardware probes to determine the cause of the QoS violation
     - Informs the cluster manager to take proactive action and prevent the QoS violation
     - Needs to predict 100s of msec to a few sec into the future
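A minimal sketch of the detect-diagnose-mitigate cycle the slide above describes, assuming hypothetical interfaces for the trace database, the prediction model, the hardware probes, and the cluster manager (none of these names are from the talk):

    # Sketch of Seer's proactive loop; all object interfaces are hypothetical.
    import time

    HORIZON_S = 0.5  # Seer must predict 100s of msec to a few sec ahead

    def seer_loop(trace_db, model, probes, cluster_manager):
        while True:
            window = trace_db.recent_traces()      # latest RPC-level trace window
            probs = model.predict(window)          # P(QoS violation) per microservice
            for service, p in probs.items():
                if p > 0.9:                        # likely culprit of an upcoming violation
                    cause = probes.inspect_node(service)  # targeted per-server HW probes
                    cluster_manager.act(service, cause)   # proactive mitigation
            time.sleep(HORIZON_S / 5)              # stay inside the window of opportunity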

  6. Instrumentation & Tracing
     [Figure: client → front-end load balancer → logic tiers → back-end databases; per-node view of an Nginx worker with TCP RX, epoll, processing, and TCP TX stages]
     - Two-level tracing:
       - Distributed RPC-level tracing
         - Similar to Dapper, Zipkin
         - Per-microservice latencies
         - Inter- and intra-microservice queue lengths
         - Tracing overhead: <0.1% in QPS, <0.2% in 99th percentile latency
       - Per-node hardware monitoring
         - Targeted at nodes with problematic microservices
         - Perf counters and contentious microbenchmarks
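As a rough illustration, one record from the RPC-level layer might combine the signals the slide lists (per-microservice latency plus inter- and intra-microservice queue lengths); the field names below are assumptions, not the actual trace schema:

    # Hypothetical per-RPC record for the two-level tracing described above.
    from dataclasses import dataclass

    @dataclass
    class RPCSpan:
        trace_id: str        # groups all spans of one end-to-end request
        microservice: str    # e.g., "nginx" or a back-end DB shard
        start_us: int        # span start (microseconds)
        end_us: int          # span end (microseconds)
        net_queue_len: int   # inter-microservice (network) queue length
        app_queue_len: int   # intra-microservice (application) queue length

        @property
        def latency_us(self) -> int:
            # Per-microservice latency derived from the span endpoints.
            return self.end_us - self.start_us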

  7. DL for Cloud Performance Debugging
     - Input signal: per-microservice container utilization, latency, and queue length
     - Output signal: probability that a microservice will initiate a QoS violation in the near future
     - Why DL?
       - Architecture-agnostic
       - Adjusts to changes over time
       - High accuracy, good scalability, and fast inference (within the window of opportunity)
     - Dimensionality reduction
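To make the input/output shape concrete, here is a minimal PyTorch sketch: per-microservice utilization, latency, and queue length over a trace window go in, and one violation probability per microservice comes out. The layer choices (a 1x1 convolution for the dimensionality-reduction step, an LSTM over time) and all sizes are illustrative assumptions, not the network from the talk:

    import torch
    import torch.nn as nn

    class ViolationPredictor(nn.Module):
        def __init__(self, n_services: int, n_signals: int = 3, hidden: int = 64):
            super().__init__()
            # Dimensionality reduction: compress the per-service signals
            # (utilization, latency, queue length) into a smaller vector.
            self.reduce = nn.Conv1d(n_services * n_signals, hidden, kernel_size=1)
            # Capture how the compressed signals evolve across the trace window.
            self.rnn = nn.LSTM(hidden, hidden, batch_first=True)
            # One output per microservice: probability it initiates a violation.
            self.head = nn.Linear(hidden, n_services)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, time_steps, n_services * n_signals)
            z = self.reduce(x.transpose(1, 2)).transpose(1, 2)  # (batch, time, hidden)
            out, _ = self.rnn(z)
            return torch.sigmoid(self.head(out[:, -1]))         # (batch, n_services)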
