1 An Automated, Cross-Layer Instrumentation Framework for Diagnosing Performance Problems in Distributed Applications
Emre Ates (1), Lily Sturmann (2), Mert Toslali (1), Orran Krieger (1), Richard Megginson (2), Ayse K. Coskun (1), and Raja Sambasivan (3)
(1) Boston University; (2) Red Hat, Inc.; (3) Tufts University
ACM Symposium on Cloud Computing, November 21, 2019, Santa Cruz, CA
2 Debugging Distributed Systems
Challenging: where is the problem? It could be in:
● One of many components
● One of several stack levels
  ● VM vs. hypervisor
  ● Application vs. kernel
● Inter-component interactions
3 Today’s Debugging Methods
Instrumentation data (diagram legend: ● = instrumentation point)
Different problems benefit from different instrumentation points.
You can’t instrument everything: too much overhead, too much data.
4 Today’s Debugging Cycle
Gather data from current instrumentation → able to identify the problem source? Usually no… → use the data to guess where to add instrumentation → repeat.
5 Our Research Question
(The same cycle, but when the problem source can be identified — sometimes yes!! — results are reported to developers; otherwise the data guides where to add instrumentation.)
Can we create a continuously-running instrumentation framework for production distributed systems that will automatically explore instrumentation choices across stack layers for a newly-observed performance problem?
6 Key insight: Performance variation indicates where to instrument
If requests that are expected to perform similarly do not, there is something unknown about their workflows, which could represent performance problems.
Localizing the source of variation gives insight into where instrumentation is needed.
[Figure: three READ requests from storage plotted over time and component hierarchy; each contains client start/end, LB start/end, and a metadata read, but their latencies differ]
7 Key Enabler: Workflow-centric Tracing
● Used to get workflows from running systems
● Works by propagating common context with requests (e.g., request ID)
● Trace points record important events with context
● Granularity is determined by instrumentation in the system
[Figure: two request workflows over time and hierarchy, each with client start/end, LB start/end, and a metadata read]
A minimal sketch of this mechanism follows.
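Below is a minimal, illustrative Python sketch of context propagation and trace points. The names here (TraceContext, trace_point) are hypothetical and are not OSProfiler's actual API: a trace context carrying a request ID travels with the request, and each trace point records an event against that context.

```python
# A minimal, illustrative sketch of context propagation and trace points.
# Names here (TraceContext, trace_point) are hypothetical, not OSProfiler's API.
import time
import uuid


class TraceContext:
    """Carries the request ID that is propagated with the request."""

    def __init__(self, request_id=None):
        self.request_id = request_id or str(uuid.uuid4())
        self.records = []

    def trace_point(self, name, **variables):
        """Record an important event (e.g., 'LB start') plus key-value pairs."""
        self.records.append({
            "request_id": self.request_id,
            "name": name,
            "timestamp": time.time(),
            "variables": variables,
        })


# Usage: the same context travels with the request through every component.
ctx = TraceContext()
ctx.trace_point("client start")
ctx.trace_point("LB start", queue_length=3)
ctx.trace_point("metadata read")
ctx.trace_point("LB end")
ctx.trace_point("client end")
```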
8–17 Vision of Pythia
[Figure: the Pythia architecture diagram, built up incrementally across slides 8–17]
18 Challenge 1: Grouping
19 Which Requests are Expected to Perform Similarly
● Depends on the distributed application being debugged
● Generally applicable: requests of the same type that access the same services (see the sketch below)
● Additional app-specific details could be incorporated
[Figure: Expectation 1 — read requests, each with client start/end, LB start/end, and a metadata read; Expectation 2 — auth requests, each with client start/end and auth start/end]
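As a sketch of the generally applicable rule, one could group traces by request type and the set of services they touch. The trace representation below (dicts with "request_type" and per-record "service") is hypothetical, for illustration only.

```python
# A minimal sketch of the generally applicable grouping rule: requests of the
# same type that touch the same services form one expectation group.
from collections import defaultdict


def group_key(trace):
    # frozenset makes the set of touched services usable as a dict key.
    services = frozenset(record["service"] for record in trace["records"])
    return (trace["request_type"], services)


def group_traces(traces):
    groups = defaultdict(list)
    for trace in traces:
        groups[group_key(trace)].append(trace)
    return groups
```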
20 Challenge 2: Localization
21 Localizing Performance Variations
Order groups, and edges within groups, by performance variation. How to quantify performance variation? Multiple metrics to measure variation:
● Variance / standard deviation, compared against an acceptable threshold
● Coefficient of variation (std. / mean): intuitive, but a very small mean yields a very high CoV
● Multimodality: multiple modes of operation
[Figure: two latency histograms (# completed requests vs. time); one unimodal distribution within an acceptable std. dev. threshold, one with multiple modes]
A sketch of these metrics follows.
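The first two metrics are simple to compute; the sketch below (not Pythia's implementation) shows per-group variance and coefficient of variation, which could be used to rank groups for localization. Multimodality detection would need something extra, e.g., a dip test or kernel density estimate.

```python
# A minimal sketch of the variation metrics above (not Pythia's implementation).
import numpy as np


def variation_metrics(latencies):
    lat = np.asarray(latencies, dtype=float)
    mean = lat.mean()
    std = lat.std(ddof=1) if lat.size > 1 else 0.0
    return {
        "variance": std ** 2,
        # CoV normalizes by the mean, but a very small mean yields a very
        # high CoV -- the trade-off noted on the slide.
        "cov": std / mean if mean > 0 else float("inf"),
    }
```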
22 Challenge 3: What to enable
23 Search Space and Search Strategies
Search space: how to represent all of the instrumentation that Pythia can control? How to find relevant next trace points after the problem is narrowed down? Trade-offs: quick to access, compact, limiting spurious instrumentation choices.
Search strategies: how to explore the search space? Quickly converge on problems, keep instrumentation overhead low, reduce time-to-solution. Many possible options, so the design is pluggable.
24 Search Space: Calling Context Trees
● One node for each calling context (i.e., stack trace)
● Leverages the hierarchy of the distributed system architecture
● Construction: offline profiling
● Trade-offs: quick to access, compact, limits spurious instrumentation choices
[Figure: an offline-collected trace (nova start, keystone start/end, glance start, keystone, neutron start/end, glance end, nova end) and the resulting search space: nova → keystone, glance → keystone, neutron → keystone]
A minimal sketch of this structure follows.
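A calling context tree can be represented very compactly; the sketch below is a hypothetical structure, not Pythia's code: one node per calling context, built from an offline profiling run, with the controllable trace points at that context attached to the node.

```python
# A minimal sketch of a calling-context-tree search space (hypothetical structure).
class CCTNode:
    def __init__(self, name):
        self.name = name        # e.g., "nova", "keystone"
        self.children = {}      # child calling contexts
        self.trace_points = []  # instrumentation Pythia can enable here

    def child(self, name):
        return self.children.setdefault(name, CCTNode(name))


# Offline construction matching the example trace on the slide:
root = CCTNode("nova")
root.child("keystone")
root.child("glance").child("keystone")
root.child("neutron").child("keystone")
```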
25 Search Strategy: Hierarchical Search
One of many choices:
● Search trace point choices top-down
● Very compatible with calling context trees
[Figure: the search descending the calling context tree (nova → keystone, glance, neutron) one level at a time, built up across slides 25–29]
A sketch of the top-down step follows.
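The sketch below illustrates one top-down step over the calling context tree; the interfaces (enable, variation_localized_to) are hypothetical stand-ins for Pythia's instrumentation plane and localization step.

```python
# A minimal sketch of the top-down (hierarchical) search step: once a group's
# variation is localized to a node of the calling context tree, enable the
# trace points of that node's children and descend one level per iteration.
def hierarchical_search_step(node, enable, variation_localized_to):
    """Enable one more level of instrumentation below `node`."""
    if not variation_localized_to(node):
        return []
    enabled = []
    for child in node.children.values():
        for trace_point in child.trace_points:
            enable(trace_point)  # ask the instrumentation plane to turn it on
            enabled.append(trace_point)
    return enabled
```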
30 Explaining Variation Using Key-Value Pairs in Trace Points
Canonical Correlation Analysis (CCA) is used to find important key-value pairs in the traces:
$a' = \arg\max_{a} \operatorname{corr}(a^{T} X,\ Y)$
where $Y = (t_1, t_2, \ldots, t_n)$ are the request durations, $X = (x_1, x_2, \ldots, x_m)$ are the collected variables, and $a' \in \mathbb{R}^m$ gives the coefficients indicating the most correlated variables.
A sketch of this computation follows.
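One way to compute this is with scikit-learn's CCA; the sketch below ranks variables by the magnitude of the $a'$ coefficients. This is illustrative and not necessarily how Pythia performs the analysis.

```python
# A minimal sketch of ranking key-value pairs by correlation with request
# duration via CCA (assumes scikit-learn is available).
import numpy as np
from sklearn.cross_decomposition import CCA


def important_variables(X, durations, names):
    """X: (n_requests, n_variables) matrix of collected key-value pairs;
    durations: per-request latencies; names: one name per column of X."""
    cca = CCA(n_components=1)
    cca.fit(np.asarray(X), np.asarray(durations).reshape(-1, 1))
    weights = np.abs(cca.x_weights_.ravel())  # the a' coefficients
    order = np.argsort(weights)[::-1]
    return [(names[i], float(weights[i])) for i in order]
```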
31 Vision of Pythia – Completing the Cycle
32 Validating Pythia’s Approach
Can performance variation guide instrumentation choices? Run an exploratory analysis for OpenStack:
● Start with default instrumentation
● Localize performance variation
● Find the next instrumentation to enable
● Use CCA to find important key-value pairs
33 Validating Pythia’s Approach – Setup
● OpenStack: an open-source cloud platform, written in Python
● OSProfiler: OpenStack’s tracing framework
  ● We implemented controllable trace points
  ● Store more variables, such as queue lengths
● Running on MOC: 8 vCPUs, 32 GB memory
● Workload: 9 request types (VM / floating IP / volume create, list, and delete); 20 workloads executed simultaneously
A hedged sketch of a controllable trace point follows.
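The sketch below shows one way a runtime-controllable trace point could work; the names are hypothetical, and the actual OSProfiler changes differ in detail.

```python
# A hedged sketch of a runtime-controllable trace point that records extra
# variables such as queue lengths (hypothetical names, illustration only).
import time

ENABLED_TRACE_POINTS = set()  # toggled at runtime by the search strategy
TRACE_LOG = []


def trace_point(request_id, name, **variables):
    """Record the event only if this trace point is currently enabled."""
    if name in ENABLED_TRACE_POINTS:
        TRACE_LOG.append({
            "request_id": request_id,
            "name": name,
            "timestamp": time.time(),
            "variables": variables,  # e.g., queue_length at a semaphore
        })
```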
34 Step 1: Grouping & Localization
● Collect latency values for each request
● Grouping: same request type with the same trace points
● Server-create requests have unusually high variance and latency
● Pythia would focus on this group
35 Step 2: Enable Additional Instrumentation
● Pythia localizes the variation to a semaphore in server create
● After adding a queue-length variable to the traces, we see 3 distinct latency groups, each with a different queue length
● CCA also finds this variable important
TAKEAWAY: Pythia’s approach identifies the instrumentation needed to debug this problem
36 Open Questions
● What is the ideal structure of the search space?
● What are possible search strategies? What are the trade-offs?
● How can we formulate and choose an “instrumentation budget”?
● How granular should the performance expectations be?
● How can we integrate multiple stack layers into Pythia?
37 More in the Paper
● Pythia architecture
● Problem scenarios
● Instrumentation plane requirements
● Cross-layer instrumentation
38 Concluding Remarks
● It is very difficult to debug distributed systems
● Automating instrumentation choice is a promising solution to overcome this difficulty
● More info in our paper (bu.edu/peaclab/publications)
Please send feedback to ates@bu.edu or join us at the poster session.