1 An Automated, Cross-Layer Instrumentation Framework for Diagnosing Performance Problems in Distributed Applications
Emre Ates (1), Lily Sturmann (2), Mert Toslali (1), Orran Krieger (1), Richard Megginson (2), Ayse K. Coskun (1), and Raja Sambasivan (3)
(1) Boston University; (2) Red Hat, Inc.; (3) Tufts University
ACM Symposium on Cloud Computing, November 21, 2019, Santa Cruz, CA
2 Debugging Distributed Systems
Challenging: where is the problem? It could be in:
● One of many components
● One of several stack levels
  ● VM vs. hypervisor
  ● Application vs. kernel
● Inter-component interactions
3 Today’s Debugging Methods
Instrumentation data (diagram legend: ● = instrumentation point)
Different problems benefit from different instrumentation points.
You can’t instrument everything: too much overhead, too much data.
4 Today’s Debugging Cycle
Gather data from current instrumentation → able to identify the problem source? Usually no… → use the data to guess where to add instrumentation → repeat.
5 Our Research Question
(The same cycle, but when the problem source can be identified — sometimes yes!! — results are reported to developers; otherwise the data guides where to add instrumentation.)
Can we create a continuously-running instrumentation framework for production distributed systems that will automatically explore instrumentation choices across stack layers for a newly-observed performance problem?
6 Key insight: Performance variation indicates where to instrument
If requests that are expected to perform similarly do not, there is something unknown about their workflows, which could represent performance problems.
Localizing the source of variation gives insight into where instrumentation is needed.
[Figure: three READ requests from storage plotted over time and component hierarchy; each contains client start/end, LB start/end, and a metadata read, but their latencies differ]
7 Key Enabler: Workflow-centric Tracing
● Used to get workflows from running systems
● Works by propagating common context with requests (e.g., request ID)
● Trace points record important events with context
● Granularity is determined by instrumentation in the system
[Figure: two request workflows over time and hierarchy, each with client start/end, LB start/end, and a metadata read]
A minimal sketch of this mechanism follows.
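Below is a minimal, illustrative Python sketch of context propagation and trace points. The names here (TraceContext, trace_point) are hypothetical and are not OSProfiler's actual API: a trace context carrying a request ID travels with the request, and each trace point records an event against that context.

```python
# A minimal, illustrative sketch of context propagation and trace points.
# Names here (TraceContext, trace_point) are hypothetical, not OSProfiler's API.
import time
import uuid


class TraceContext:
    """Carries the request ID that is propagated with the request."""

    def __init__(self, request_id=None):
        self.request_id = request_id or str(uuid.uuid4())
        self.records = []

    def trace_point(self, name, **variables):
        """Record an important event (e.g., 'LB start') plus key-value pairs."""
        self.records.append({
            "request_id": self.request_id,
            "name": name,
            "timestamp": time.time(),
            "variables": variables,
        })


# Usage: the same context travels with the request through every component.
ctx = TraceContext()
ctx.trace_point("client start")
ctx.trace_point("LB start", queue_length=3)
ctx.trace_point("metadata read")
ctx.trace_point("LB end")
ctx.trace_point("client end")
```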
8–17 Vision of Pythia
[Figure: the Pythia architecture diagram, built up incrementally across slides 8–17]
18 Challenge 1: Grouping
19 Which Requests are Expected to Perform Similarly
● Depends on the distributed application being debugged
● Generally applicable: requests of the same type that access the same services (see the sketch below)
● Additional app-specific details could be incorporated
[Figure: Expectation 1 — read requests, each with client start/end, LB start/end, and a metadata read; Expectation 2 — auth requests, each with client start/end and auth start/end]
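As a sketch of the generally applicable rule, one could group traces by request type and the set of services they touch. The trace representation below (dicts with "request_type" and per-record "service") is hypothetical, for illustration only.

```python
# A minimal sketch of the generally applicable grouping rule: requests of the
# same type that touch the same services form one expectation group.
from collections import defaultdict


def group_key(trace):
    # frozenset makes the set of touched services usable as a dict key.
    services = frozenset(record["service"] for record in trace["records"])
    return (trace["request_type"], services)


def group_traces(traces):
    groups = defaultdict(list)
    for trace in traces:
        groups[group_key(trace)].append(trace)
    return groups
```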
20 Challenge 2: Localization
21 Localizing Performance Variations
Order groups, and edges within groups, by performance variation. How to quantify performance variation? Multiple metrics to measure variation:
● Variance / standard deviation, compared against an acceptable threshold
● Coefficient of variation (std. / mean): intuitive, but a very small mean yields a very high CoV
● Multimodality: multiple modes of operation
[Figure: two latency histograms (# completed requests vs. time); one unimodal distribution within an acceptable std. dev. threshold, one with multiple modes]
A sketch of these metrics follows.
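The first two metrics are simple to compute; the sketch below (not Pythia's implementation) shows per-group variance and coefficient of variation, which could be used to rank groups for localization. Multimodality detection would need something extra, e.g., a dip test or kernel density estimate.

```python
# A minimal sketch of the variation metrics above (not Pythia's implementation).
import numpy as np


def variation_metrics(latencies):
    lat = np.asarray(latencies, dtype=float)
    mean = lat.mean()
    std = lat.std(ddof=1) if lat.size > 1 else 0.0
    return {
        "variance": std ** 2,
        # CoV normalizes by the mean, but a very small mean yields a very
        # high CoV -- the trade-off noted on the slide.
        "cov": std / mean if mean > 0 else float("inf"),
    }
```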
22 Challenge 3: What to enable
23 Search Space and Search Strategies
Search space: how to represent all of the instrumentation that Pythia can control? How to find relevant next trace points after the problem is narrowed down? Trade-offs: quick to access, compact, limiting spurious instrumentation choices.
Search strategies: how to explore the search space? Quickly converge on problems, keep instrumentation overhead low, reduce time-to-solution. Many possible options, so the design is pluggable.
24 Search Space: Calling Context Trees
● One node for each calling context (i.e., stack trace)
● Leverages the hierarchy of the distributed system architecture
● Construction: offline profiling
● Trade-offs: quick to access, compact, limits spurious instrumentation choices
[Figure: an offline-collected trace (nova start, keystone start/end, glance start, keystone, neutron start/end, glance end, nova end) and the resulting search space: nova → keystone, glance → keystone, neutron → keystone]
A minimal sketch of this structure follows.
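A calling context tree can be represented very compactly; the sketch below is a hypothetical structure, not Pythia's code: one node per calling context, built from an offline profiling run, with the controllable trace points at that context attached to the node.

```python
# A minimal sketch of a calling-context-tree search space (hypothetical structure).
class CCTNode:
    def __init__(self, name):
        self.name = name        # e.g., "nova", "keystone"
        self.children = {}      # child calling contexts
        self.trace_points = []  # instrumentation Pythia can enable here

    def child(self, name):
        return self.children.setdefault(name, CCTNode(name))


# Offline construction matching the example trace on the slide:
root = CCTNode("nova")
root.child("keystone")
root.child("glance").child("keystone")
root.child("neutron").child("keystone")
```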
25 Search Strategy: Hierarchical Search
One of many choices:
● Search trace point choices top-down
● Very compatible with calling context trees
[Figure: the search descending the calling context tree (nova → keystone, glance, neutron) one level at a time, built up across slides 25–29]
A sketch of the top-down step follows.
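The sketch below illustrates one top-down step over the calling context tree; the interfaces (enable, variation_localized_to) are hypothetical stand-ins for Pythia's instrumentation plane and localization step.

```python
# A minimal sketch of the top-down (hierarchical) search step: once a group's
# variation is localized to a node of the calling context tree, enable the
# trace points of that node's children and descend one level per iteration.
def hierarchical_search_step(node, enable, variation_localized_to):
    """Enable one more level of instrumentation below `node`."""
    if not variation_localized_to(node):
        return []
    enabled = []
    for child in node.children.values():
        for trace_point in child.trace_points:
            enable(trace_point)  # ask the instrumentation plane to turn it on
            enabled.append(trace_point)
    return enabled
```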
30 Explaining Variation Using Key-Value Pairs in Trace Points
Canonical Correlation Analysis (CCA) is used to find important key-value pairs in the traces:
$a' = \arg\max_{a} \operatorname{corr}(a^{T} X,\ Y)$
where $Y = (t_1, t_2, \ldots, t_n)$ are the request durations, $X = (x_1, x_2, \ldots, x_m)$ are the collected variables, and $a' \in \mathbb{R}^m$ gives the coefficients indicating the most correlated variables.
A sketch of this computation follows.
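One way to compute this is with scikit-learn's CCA; the sketch below ranks variables by the magnitude of the $a'$ coefficients. This is illustrative and not necessarily how Pythia performs the analysis.

```python
# A minimal sketch of ranking key-value pairs by correlation with request
# duration via CCA (assumes scikit-learn is available).
import numpy as np
from sklearn.cross_decomposition import CCA


def important_variables(X, durations, names):
    """X: (n_requests, n_variables) matrix of collected key-value pairs;
    durations: per-request latencies; names: one name per column of X."""
    cca = CCA(n_components=1)
    cca.fit(np.asarray(X), np.asarray(durations).reshape(-1, 1))
    weights = np.abs(cca.x_weights_.ravel())  # the a' coefficients
    order = np.argsort(weights)[::-1]
    return [(names[i], float(weights[i])) for i in order]
```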
31 Vision of Pythia – Completing the Cycle
32 Validating Pythia’s Approach
Can performance variation guide instrumentation choices? Run an exploratory analysis for OpenStack:
● Start with default instrumentation
● Localize performance variation
● Find the next instrumentation to enable
● Use CCA to find important key-value pairs
33 Validating Pythia’s Approach – Setup
● OpenStack: an open-source cloud platform, written in Python
● OSProfiler: OpenStack’s tracing framework
  ● We implemented controllable trace points
  ● Store more variables, such as queue lengths
● Running on MOC: 8 vCPUs, 32 GB memory
● Workload: 9 request types (VM / floating IP / volume create, list, and delete); 20 workloads executed simultaneously
A hedged sketch of a controllable trace point follows.
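The sketch below shows one way a runtime-controllable trace point could work; the names are hypothetical, and the actual OSProfiler changes differ in detail.

```python
# A hedged sketch of a runtime-controllable trace point that records extra
# variables such as queue lengths (hypothetical names, illustration only).
import time

ENABLED_TRACE_POINTS = set()  # toggled at runtime by the search strategy
TRACE_LOG = []


def trace_point(request_id, name, **variables):
    """Record the event only if this trace point is currently enabled."""
    if name in ENABLED_TRACE_POINTS:
        TRACE_LOG.append({
            "request_id": request_id,
            "name": name,
            "timestamp": time.time(),
            "variables": variables,  # e.g., queue_length at a semaphore
        })
```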
34 Step 1: Grouping & Localization
● Collect latency values for each request
● Grouping: same request type with the same trace points
● Server-create requests have unusually high variance and latency
● Pythia would focus on this group
35 Step 2: Enable Additional Instrumentation
● Pythia localizes the variation to a semaphore in server create
● After adding a queue-length variable to the traces, we see 3 distinct latency groups, each with a different queue length
● CCA also finds this variable important
TAKEAWAY: Pythia’s approach identifies the instrumentation needed to debug this problem
36 Open Questions
● What is the ideal structure of the search space?
● What are possible search strategies? What are the trade-offs?
● How can we formulate and choose an “instrumentation budget”?
● How granular should the performance expectations be?
● How can we integrate multiple stack layers into Pythia?
37 More in the Paper
● Pythia architecture
● Problem scenarios
● Instrumentation plane requirements
● Cross-layer instrumentation
38 Concluding Remarks
● It is very difficult to debug distributed systems
● Automating instrumentation choice is a promising solution to overcome this difficulty
● More info in our paper (bu.edu/peaclab/publications)
Please send feedback to ates@bu.edu or join us at the poster session.