1 An Automated, Cross-Layer Instrumentation Framework for Diagnosing Performance Problems in Distributed Applications
By Emre Ates¹, Lily Sturmann², Mert Toslali¹, Orran Krieger¹, Richard Megginson², Ayse K. Coskun¹, and Raja Sambasivan³
¹Boston University; ²Red Hat, Inc.; ³Tufts University
ACM Symposium on Cloud Computing, November 21, 2019, Santa Cruz, CA
2 Debugging Distributed Systems
Challenging: Where is the problem? It could be in:
● One of many components
● One of several stack levels
  ● VM vs. hypervisor
  ● Application vs. kernel
● Inter-component interactions
3 Today’s Debugging Methods
[Figure: instrumentation points in a distributed system and the instrumentation data they produce]
Different problems benefit from different instrumentation points.
You can’t instrument everything: too much overhead, too much data.
4 Today’s Debugging Cycle
Gather data from current instrumentation → Able to identify problem source? Usually no… → Use data to guess where to add instrumentation → (repeat)
5 Our Research Question
Gather data from current instrumentation → Able to identify problem source? Sometimes yes! → Report to developers; otherwise, use data to guess where to add instrumentation
Can we create a continuously-running instrumentation framework for production distributed systems that will automatically explore instrumentation choices across stack layers for a newly-observed performance problem?
6 Key insight: Performance variation indicates where to instrument
● If requests that are expected to perform similarly do not, there is something unknown about their workflows, which could represent performance problems.
● Localizing the source of variation gives insight into where instrumentation is needed.
[Figure: timelines of three READ requests from storage (Requests #1–#3), each showing client start/end, LB start/end, and metadata read trace points in the request hierarchy]
7 Key Enabler: Workflow-centric Tracing
● Used to get workflows from running systems
● Works by propagating common context with requests (e.g., a request ID), as sketched below
● Trace points record important events with that context
● Granularity is determined by the instrumentation in the system
[Figure: request timeline and hierarchy with client start/end, LB start/end, and metadata read trace points]
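To make context propagation concrete, here is a minimal Python sketch of workflow-centric tracing (illustrative only, not tied to any particular tracing framework): a request ID is carried in thread-local storage and every trace point records its event together with that ID. The `TraceContext`, `set_context`, and `tracepoint` names are invented for this sketch.

```python
import threading
import time
import uuid

# Hypothetical per-thread storage for the propagated context (request ID).
_context = threading.local()

class TraceContext:
    """Carries the common context (here, just a request ID) with a request."""
    def __init__(self, request_id=None):
        self.request_id = request_id or uuid.uuid4().hex

def set_context(ctx):
    _context.current = ctx

def get_context():
    return getattr(_context, "current", None)

def tracepoint(name, **variables):
    """Record an important event together with the propagated context.
    A real system would ship the record to a trace store instead of printing."""
    ctx = get_context()
    print({
        "request_id": ctx.request_id if ctx else None,
        "name": name,
        "timestamp": time.time(),
        "variables": variables,
    })

# Usage: each component records trace points under the same propagated context,
# so the per-request workflow (client -> LB -> metadata read) can be stitched.
set_context(TraceContext())
tracepoint("client_start")
tracepoint("LB_start")
tracepoint("metadata_read")
tracepoint("LB_end")
tracepoint("client_end")
```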
8 Vision of Pythia
18 Challenge 1: Grouping
19 Which Requests are Expected to Perform Similarly?
● Depends on the distributed application being debugged
● Generally applicable: requests of the same type that access the same services (sketched below)
● Additional app-specific details could be incorporated
[Figure: Expectation 1 groups read requests (client, LB, and metadata read trace points); Expectation 2 groups auth requests (client and auth trace points)]
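A small Python sketch of the generally applicable grouping rule above, assuming each trace is a dict with a request `type` and a list of trace points tagged with the emitting `service`; the field names are assumptions, not Pythia's actual data model.

```python
from collections import defaultdict

def expectation_key(trace):
    """Group requests of the same type that access the same services."""
    services = frozenset(tp["service"] for tp in trace["tracepoints"])
    return (trace["type"], services)

def group_traces(traces):
    groups = defaultdict(list)
    for trace in traces:
        groups[expectation_key(trace)].append(trace)
    return groups

# Example: read requests touching {client, LB, metadata} form one group,
# auth requests touching {client, auth} form another.
```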
20 Challenge 2: Localization
21 Localizing Performance Variations
● Order groups, and edges within groups, by performance variation.
● How to quantify performance variation? Multiple metrics are possible (see the sketch below):
  ● Variance / standard deviation
  ● Coefficient of variation (std. / mean): intuitive, but a very small mean gives a very high CoV
  ● Multimodality: multiple modes of operation
[Figures: histograms of # requests completed vs. time, one against an acceptable std. dev. threshold and one showing a multimodal distribution]
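A minimal sketch (using NumPy) of the coefficient-of-variation metric and of ordering groups by how much they vary; Pythia's real localization also ranks edges within groups and checks for multimodality, which are omitted here.

```python
import numpy as np

def coefficient_of_variation(latencies):
    """Std. deviation divided by the mean: intuitive, but a very small mean
    yields a very large CoV."""
    latencies = np.asarray(latencies, dtype=float)
    mean = latencies.mean()
    return latencies.std() / mean if mean > 0 else float("inf")

def rank_groups_by_variation(groups):
    """Order groups from most to least variation so localization can focus on
    the worst offenders first.  `groups` maps a group key (e.g., the
    expectation from the previous sketch) to a list of request latencies."""
    scored = {key: coefficient_of_variation(lats) for key, lats in groups.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

# Example: the "write" group varies far more and would be investigated first.
print(rank_groups_by_variation({
    "read":  [10.0, 11.0, 10.5, 10.2],
    "write": [10.0, 35.0, 11.0, 80.0],
}))
```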
22 Challenge 3: What to enable
23 Search Space and Search Strategies
Search space:
● How to represent all of the instrumentation that Pythia can control?
● How to find relevant next trace points after the problem is narrowed down?
● Many possible options; pluggable design
● Trade-offs: quick to access, compact, limit spurious instrumentation choices
Search strategies:
● How to explore the search space?
● Quickly converge on problems
● Keep instrumentation overhead low
● Reduce time-to-solution
24 Search Space: Calling Context Trees
● One node for each calling context, i.e., stack trace
● Leverages the hierarchy of the distributed system architecture
● Construction: offline profiling (sketched below)
● Trade-offs: quick to access, compact, limit spurious instrumentation choices
[Figure: an offline-collected trace (nova calling keystone, glance, and neutron; glance and neutron calling keystone) and the corresponding calling context tree search space]
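A minimal sketch of how such a search space could be built from offline-collected stack traces; `CCTNode` and `build_cct` are illustrative names, and the per-node `instrumentation_on` flag stands in for Pythia's control over trace points.

```python
class CCTNode:
    """One node per calling context (i.e., per unique stack-trace prefix)."""
    def __init__(self, name):
        self.name = name
        self.children = {}               # child frame name -> CCTNode
        self.instrumentation_on = False  # whether this trace point is enabled

    def child(self, name):
        if name not in self.children:
            self.children[name] = CCTNode(name)
        return self.children[name]

def build_cct(stack_traces):
    """Build the search space from offline profiling.  Each stack trace is a
    list of frames ordered from the root of the distributed call hierarchy."""
    root = CCTNode("root")
    for frames in stack_traces:
        node = root
        for frame in frames:
            node = node.child(frame)
    return root

# Example mirroring the OpenStack trace on this slide:
cct = build_cct([
    ["nova", "keystone"],
    ["nova", "glance", "keystone"],
    ["nova", "neutron", "keystone"],
])
```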
25 Search Strategy: Hierarchical Search
● One of many possible choices
● Searches trace point choices top-down (see the sketch below)
● Very compatible with calling context trees
[Figure: hierarchical search descending the calling context tree built from the offline-collected trace]
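One way the top-down strategy could be realized over the calling context tree from the previous sketch (illustrative, not Pythia's implementation); `has_variation` is an assumed callback that reports whether the latency covered by a node's trace points still shows unexplained variation in the latest traces.

```python
def hierarchical_search(root, has_variation):
    """Search trace-point choices top-down: when variation is localized to a
    node, enable the trace points of its children and keep descending."""
    enabled = []
    frontier = [root]
    while frontier:
        current = frontier.pop()
        if not has_variation(current):
            continue
        for child in current.children.values():
            child.instrumentation_on = True
            enabled.append(child)
            frontier.append(child)
    return enabled

# Usage with the cct built above (the predicate here is a stand-in):
newly_enabled = hierarchical_search(cct, lambda node: node.name in ("root", "nova"))
print([node.name for node in newly_enabled])
```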
30 Explaining Variation Using Key-Value Pairs in Trace Points
● Canonical Correlation Analysis (CCA)
● Used to find the important key-value pairs in the traces (see the sketch below)
a′ = argmax_a corr(aᵀX, Y), where
Y = (t₁, t₂, …, tₙ): the request durations
X = (x₁, x₂, …, xₘ): the collected variables
a′ ∈ ℝᵐ: the coefficients indicating the most correlated variables
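A sketch of this step using scikit-learn's CCA, assuming a matrix X of collected key-value variables and a vector Y of request durations; the helper name and the synthetic `queue_length` example are illustrative, and Pythia's actual analysis may differ.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def most_correlated_variables(X, y, variable_names):
    """Rank collected variables by how strongly they correlate with request
    durations.  X: (n_requests, n_variables), y: (n_requests,)."""
    cca = CCA(n_components=1)
    cca.fit(X, np.asarray(y, dtype=float).reshape(-1, 1))
    coefficients = cca.x_weights_.ravel()   # a' in the slide's notation
    order = np.argsort(-np.abs(coefficients))
    return [(variable_names[i], float(coefficients[i])) for i in order]

# Synthetic example in which a queue-length variable drives the durations:
rng = np.random.default_rng(0)
queue_length = rng.integers(0, 10, size=200)
X = np.column_stack([queue_length, rng.normal(size=200)])
y = 5.0 * queue_length + rng.normal(size=200)
print(most_correlated_variables(X, y, ["queue_length", "unrelated"]))
```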
31 Vision of Pythia – Completing the Cycle
32 Validating Pythia’s Approach
● Can performance variation guide instrumentation choices?
● Run exploratory analysis on OpenStack:
  ● Start with default instrumentation
  ● Localize performance variation
  ● Find the next instrumentation to enable
  ● Use CCA for finding important key-value pairs
33 Validating Pythia’s Approach – Setup
● OpenStack: an open-source cloud platform, written in Python
● OSProfiler: OpenStack’s tracing framework
  ● We implemented controllable trace points (a generic sketch follows below)
  ● Store more variables, such as queue lengths
● Running on the Massachusetts Open Cloud (MOC): 8 vCPUs, 32 GB memory
● Workload: 9 request types (VM / floating IP / volume create/list/delete), 20 workloads executed simultaneously
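The sketch below only illustrates the idea of a controllable trace point (a registry that the search can toggle at runtime, carrying extra variables such as queue lengths); it is not OSProfiler's API, and every name in it is invented for this illustration.

```python
import time

# Hypothetical registry of trace points that an automated search can toggle.
ENABLED_TRACEPOINTS = set()

def enable(name):
    ENABLED_TRACEPOINTS.add(name)

def disable(name):
    ENABLED_TRACEPOINTS.discard(name)

def controllable_tracepoint(name, emit, **variables):
    """Emit a trace record only if this trace point is currently enabled.
    `emit` is whatever function ships records to the trace store; `variables`
    carries extra state such as queue lengths."""
    if name in ENABLED_TRACEPOINTS:
        emit({"name": name, "timestamp": time.time(), **variables})

# Usage inside a service:
enable("semaphore_wait")
controllable_tracepoint("semaphore_wait", print, queue_length=7)
```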
34 Step 1: Grouping & Localization
● Collect latency values for each request
● Grouping: same request type with the same trace points
● Server-create requests have unusually high variance and latency
● Pythia would focus on this group
35 Step 2: Enable Additional Instrumentation
● Pythia localizes the variation to a semaphore in server create
● After adding a queue-length variable to the traces, we see 3 distinct latency groups with different queue lengths (see the sketch below); CCA also finds this variable important
● TAKEAWAY: Pythia’s approach identifies the instrumentation needed to debug this problem
[Figure: server-create latency distribution split into groups with different queue lengths]
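To illustrate how the added variable separates the latency distribution, a small sketch that splits latencies by the recorded queue length; it assumes each trace now carries `queue_length` and `latency` fields, which are assumed names rather than Pythia's.

```python
from collections import defaultdict
from statistics import mean

def split_by_queue_length(traces):
    """Group server-create latencies by the recorded queue length and report
    each sub-group's size and mean latency; clearly separated means indicate
    the variable explains the observed variation."""
    buckets = defaultdict(list)
    for trace in traces:
        buckets[trace["queue_length"]].append(trace["latency"])
    return {q: (len(lats), mean(lats)) for q, lats in sorted(buckets.items())}
```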
36 Open Questions
● What is the ideal structure of the search space? What are possible search strategies? What are the trade-offs?
● How can we formulate and choose an “instrumentation budget”?
● How granular should the performance expectations be?
● How can we integrate multiple stack layers into Pythia?
37 More in the paper
● Pythia architecture
● Problem scenarios
● Instrumentation plane requirements
● Cross-layer instrumentation
38 Concluding Remarks
● It is very difficult to debug distributed systems
● Automating instrumentation choice is a promising solution to overcome this difficulty
● More info in our paper (bu.edu/peaclab/publications)
● Please send feedback to ates@bu.edu or join us at the poster session