  1. An Automated, Cross-Layer Instrumentation Framework for Diagnosing Performance Problems in Distributed Applications
     By Emre Ates¹, Lily Sturmann², Mert Toslali¹, Orran Krieger¹, Richard Megginson², Ayse K. Coskun¹, and Raja Sambasivan³
     ¹Boston University; ²Red Hat, Inc.; ³Tufts University
     ACM Symposium on Cloud Computing, November 21, 2019, Santa Cruz, CA

  2. Debugging Distributed Systems
     Debugging is challenging: where is the problem? It could be in:
     ● One of many components
     ● One of several stack levels (VM vs. hypervisor, application vs. kernel)
     ● Inter-component interactions

  3. Today’s Debugging Methods
     Different problems benefit from different instrumentation points. You can’t instrument everything: too much overhead, too much data.
     [Figure: instrumentation data collected from a distributed system; each dot marks an instrumentation point.]

  4. Today’s Debugging Cycle
     Gather data from the current instrumentation → use the data to guess where to add instrumentation → able to identify the problem source? Usually no, so the cycle repeats.

  5. Our Research Question
     The same cycle, with one more exit: when the problem source can be identified (sometimes yes!), report it to developers.
     Can we create a continuously-running instrumentation framework for production distributed systems that will automatically explore instrumentation choices across stack layers for a newly observed performance problem?

  6. Key insight: Performance variation indicates where to instrument
     ● If requests that are expected to perform similarly do not, there is something unknown about their workflows, which could represent performance problems.
     ● Localizing the source of variation gives insight into where instrumentation is needed.
     [Figure: three READ requests from a storage system shown as time/hierarchy traces (client start/end, LB start/end, metadata read); their timings differ despite identical trace points.]
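To make the insight concrete, here is a minimal sketch (illustrative trace format and data, not Pythia's actual code) that ranks workflow edges by how much their duration varies across requests in the same expectation group; the edge that varies most is where added instrumentation is most likely to help:

```python
import statistics
from collections import defaultdict

# Each trace is an ordered list of (trace_point_name, timestamp) pairs
# for one request; the data below is illustrative.
def rank_edges_by_variation(traces):
    edge_durations = defaultdict(list)
    for trace in traces:
        for (a, t0), (b, t1) in zip(trace, trace[1:]):
            edge_durations[(a, b)].append(t1 - t0)
    # Edges whose duration varies most across "similar" requests are the
    # places where instrumentation is most likely to pay off.
    return sorted(edge_durations.items(),
                  key=lambda kv: statistics.pstdev(kv[1]), reverse=True)

reads = [
    [("client start", 0.0), ("LB start", 1.0), ("LB end", 2.0), ("client end", 3.0)],
    [("client start", 0.0), ("LB start", 1.1), ("LB end", 9.0), ("client end", 10.0)],
]
print(rank_edges_by_variation(reads)[0][0])   # ('LB start', 'LB end') varies most
```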

  7. Key Enabler: Workflow-centric Tracing
     ● Used to get workflows from running systems
     ● Works by propagating a common context with requests (e.g., a request ID)
     ● Trace points record important events along with the context
     ● Granularity is determined by the instrumentation in the system
     [Figure: two request workflows reconstructed from trace points (client start/end, LB start/end, metadata read) along time and hierarchy axes.]
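A minimal sketch of the mechanism, kept to a single process for illustration; real systems propagate the context across component boundaries (e.g., in RPC metadata). The `TraceContext` class and in-memory backend are illustrative stand-ins, not a real tracing library:

```python
import time
import uuid

records = []          # stand-in for a trace backend

class TraceContext:
    def __init__(self, request_id=None):
        self.request_id = request_id or str(uuid.uuid4())

    def trace_point(self, name, **key_values):
        # Each trace point records an event plus the propagated context,
        # so events from one request can later be stitched into a workflow.
        records.append({"request_id": self.request_id, "name": name,
                        "ts": time.time(), **key_values})

ctx = TraceContext()
ctx.trace_point("client start")
ctx.trace_point("LB start")
ctx.trace_point("LB end", queue_len=3)
ctx.trace_point("client end")
```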

  8.–17. Vision of Pythia

  18. Challenge 1: Grouping

  19. Which Requests are Expected to Perform Similarly?
     ● Depends on the distributed application being debugged
     ● Generally applicable rule: requests of the same type that access the same services
     ● Additional app-specific details could be incorporated
     [Figure: traces partitioned into Expectation 1 (read requests: client start/end, LB start/end, metadata read) and Expectation 2 (auth requests: client start/end, auth start/end).]
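A minimal sketch of the generally applicable grouping rule above (same request type, same set of services touched); the trace format and field names are illustrative assumptions:

```python
def expectation_key(trace):
    # Requests sharing this key are expected to perform similarly.
    request_type = trace["type"]                                   # e.g., "READ", "AUTH"
    services = frozenset(tp["service"] for tp in trace["trace_points"])
    return (request_type, services)

read1 = {"type": "READ",
         "trace_points": [{"service": "client"}, {"service": "LB"},
                          {"service": "metadata"}]}
auth1 = {"type": "AUTH",
         "trace_points": [{"service": "client"}, {"service": "auth"}]}
assert expectation_key(read1) != expectation_key(auth1)
```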

  20. Challenge 2: Localization

  21. Localizing Performance Variations
     ● Order groups, and edges within groups, by how much they vary
     ● How to quantify performance variation? Multiple candidate metrics:
       ● Variance / standard deviation, compared against an acceptable threshold
       ● Coefficient of variation (std. / mean): intuitive, but a very small mean yields a very high CoV
       ● Multimodality: the group shows multiple modes of operation
     [Figure: histograms of completed requests vs. time, illustrating an acceptable-std-dev threshold and a multimodal distribution.]
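Minimal sketches of the three candidate metrics; the multimodality check here is a deliberately crude heuristic used only for illustration, not the test Pythia actually uses:

```python
import statistics

def variance(xs):
    return statistics.pvariance(xs)

def coefficient_of_variation(xs):
    mean = statistics.mean(xs)
    return statistics.pstdev(xs) / mean if mean else float("inf")  # tiny mean -> huge CoV

def looks_multimodal(xs, gap_factor=3.0):
    # Crude heuristic: a gap between consecutive sorted values that is much
    # larger than the typical gap suggests multiple modes of operation.
    xs = sorted(xs)
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    return max(gaps) > gap_factor * (sum(gaps) / len(gaps))

latencies = [10, 11, 10, 12, 30, 31, 29]     # two clusters of request latencies
print(variance(latencies), coefficient_of_variation(latencies), looks_multimodal(latencies))
```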

  22. Challenge 3: What to enable

  23. Search Space / Search Strategies
     Search space:
     ● How to represent all of the instrumentation that Pythia can control?
     ● How to find relevant next trace points after the problem is narrowed down?
     ● Trade-offs: quick to access, compact, limits spurious instrumentation choices
     Search strategies:
     ● How to explore the search space?
     ● Quickly converge on problems, keep instrumentation overhead low, reduce time-to-solution
     ● Many possible options, so a pluggable design
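A minimal sketch of what the pluggable design might look like, with the search space and the search strategy behind separate interfaces so either can be swapped; the method names are assumptions, not Pythia's API:

```python
from abc import ABC, abstractmethod

class SearchSpace(ABC):
    @abstractmethod
    def children(self, trace_point):
        """Trace points one level 'below' an already-enabled trace point."""

class SearchStrategy(ABC):
    @abstractmethod
    def next_trace_points(self, space: SearchSpace, problem_area):
        """Choose which trace points to enable next for a localized problem."""
```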

  24. Search Space: Calling Context Trees
     ● One node for each calling context, i.e., stack trace
     ● Leverages the hierarchy of the distributed system architecture
     ● Construction: offline profiling
     ● Trade-offs: quick to access, compact, limits spurious instrumentation choices
     [Figure: an offline-collected trace (nova start, keystone start/end, glance start, neutron start/end, nova end) next to the resulting search-space tree (nova → keystone, glance → keystone, neutron → keystone).]
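A minimal sketch of building such a tree from offline-collected stack traces, one node per calling context; the profile data mirrors the figure and the construction is illustrative:

```python
from collections import defaultdict

def build_cct(stack_traces):
    tree = lambda: defaultdict(tree)      # nested dict: callee name -> subtree
    root = tree()
    for stack in stack_traces:            # e.g., ["nova", "glance", "keystone"]
        node = root
        for frame in stack:
            node = node[frame]            # one node per calling context
    return root

profile = [["nova", "keystone"],
           ["nova", "glance", "keystone"],
           ["nova", "neutron", "keystone"]]
cct = build_cct(profile)
print(list(cct["nova"]))                  # ['keystone', 'glance', 'neutron']
```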

  25.–29. Search Strategy: Hierarchical Search
     ● One of many possible strategies
     ● Searches trace-point choices top-down
     ● Very compatible with calling context trees
     [Figure: step-by-step walk of the top-down search over the offline-collected trace and the search-space tree.]
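A minimal sketch of one top-down search step: once variation is localized to a calling context, propose trace points for that context's children in the calling context tree, one level per cycle; the tree literal mirrors the figure and the function name is an assumption:

```python
def next_trace_points(cct, localized_context):
    node = cct
    for frame in localized_context:       # walk down to the localized calling context
        node = node[frame]
    # Enable only the next level below it; deeper levels wait for later cycles.
    return [localized_context + (child,) for child in node]

cct = {"nova": {"keystone": {}, "glance": {"keystone": {}}, "neutron": {"keystone": {}}}}
print(next_trace_points(cct, ("nova",)))
# [('nova', 'keystone'), ('nova', 'glance'), ('nova', 'neutron')]
```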

  30. Explaining Variation Using Key-Value Pairs in Trace Points
     ● Canonical Correlation Analysis (CCA) is used to find important key-value pairs in the traces:
       a′ = argmax_a corr(aᵀ X, Y)
       where Y = (t₁, t₂, …, tₙ) are the request durations, X = (x₁, x₂, …, xₘ) are the collected variables, and a′ ∈ ℝᵐ are the coefficients indicating the most correlated variables.
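A minimal sketch of the idea using scikit-learn's CCA on synthetic data; the variable names (queue length, cache hits) and the data are illustrative, not measurements from the paper:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
queue_len = rng.integers(0, 10, size=200)
cache_hits = rng.integers(0, 100, size=200)
duration = 5.0 * queue_len + rng.normal(0, 1, size=200)   # driven by queue length

X = np.column_stack([queue_len, cache_hits])   # collected key-value variables
Y = duration.reshape(-1, 1)                    # request durations

cca = CCA(n_components=1).fit(X, Y)
weights = np.abs(cca.x_weights_.ravel())       # coefficients of the most correlated direction
print(dict(zip(["queue_len", "cache_hits"], weights)))   # queue_len dominates
```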

  31. Vision of Pythia – Completing the Cycle

  32. Validating Pythia’s Approach
     ● Can performance variation guide instrumentation choices?
     ● Run an exploratory analysis for OpenStack:
       ● Start with the default instrumentation
       ● Localize performance variation
       ● Find the next instrumentation to enable
       ● Use CCA to find important key-value pairs

  33. Validating Pythia’s Approach – Setup
     ● OpenStack: an open-source cloud platform, written in Python
     ● OSProfiler: OpenStack’s tracing framework
       ● We implemented controllable trace points
       ● Store more variables, such as queue lengths
     ● Running on MOC (Massachusetts Open Cloud): 8 vCPUs, 32 GB memory
     ● Workload: 9 request types (VM / floating IP / volume create, list, delete); 20 workloads executed simultaneously
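A minimal, illustrative sketch of what a controllable trace point could look like: instrumentation that is always present but only records data (including extra variables such as queue lengths) when switched on. This is a generic stand-in, not OSProfiler's actual API or the authors' implementation:

```python
import functools
import time

ENABLED = set()            # trace points currently switched on
TRACES = []                # stand-in for the trace backend

def trace_point(name, extra=None):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if name not in ENABLED:            # disabled: no data, minimal overhead
                return fn(*args, **kwargs)
            start = time.time()
            try:
                return fn(*args, **kwargs)
            finally:
                info = extra() if extra else {}   # e.g., current queue length
                TRACES.append({"name": name, "duration": time.time() - start, **info})
        return wrapper
    return decorator

@trace_point("server_create.semaphore", extra=lambda: {"queue_len": 3})
def acquire_semaphore():
    time.sleep(0.01)

ENABLED.add("server_create.semaphore")    # the framework decides to enable this point
acquire_semaphore()
```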

  34. Step 1: Grouping & Localization
     ● Collect latency values for each request
     ● Grouping: same request type with the same trace points
     ● Server create requests have unusually high variance and latency
     ● Pythia would focus on this group

  35. Step 2: Enable additional instrumentation
     ● Pythia localizes the variation to a semaphore in server create
     ● After adding the queue-length variable to the traces, we see 3 distinct latency groups
     ● CCA also finds this variable important
     [Figure: latency distribution separating into groups with different queue lengths.]
     TAKEAWAY: Pythia’s approach identifies the instrumentation needed to debug this problem.
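A minimal sketch of the effect described above: once a queue-length key-value is recorded, the single high-variance latency group separates into low-variance groups. The numbers are made up for illustration, not the paper's measurements:

```python
from collections import defaultdict
import statistics

samples = [                     # (queue_len, latency) pairs from server create
    (0, 1.0), (0, 1.1), (1, 2.0), (1, 2.1), (2, 3.0), (2, 3.2),
]
by_queue_len = defaultdict(list)
for queue_len, latency in samples:
    by_queue_len[queue_len].append(latency)

print("overall stdev:", round(statistics.pstdev([l for _, l in samples]), 2))
for queue_len, latencies in sorted(by_queue_len.items()):
    print("queue_len", queue_len, "stdev:", round(statistics.pstdev(latencies), 2))
# Variation within each queue-length group is far smaller than the overall variation.
```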

  36. Open Questions
     ● What is the ideal structure of the search space? What are possible search strategies? What are the trade-offs?
     ● How can we formulate and choose an “instrumentation budget”?
     ● How granular should the performance expectations be?
     ● How can we integrate multiple stack layers into Pythia?

  37. More in the paper
     ● Pythia architecture
     ● Problem scenarios
     ● Instrumentation plane requirements
     ● Cross-layer instrumentation

  38. Concluding Remarks
     ● It is very difficult to debug distributed systems
     ● Automating instrumentation choice is a promising solution to overcome this difficulty
     ● More info in our paper (bu.edu/peaclab/publications)
     Please send feedback to ates@bu.edu or join us at the poster session.
