Enhancing End-to-End Tracing Systems for Automated Performance Debugging in Distributed Systems
Jethro S. Sun
January 23, 2018
MassOpenCloud Research Group
Introduction
A Sad Story ...
"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." - Leslie Lamport
What developers and operators really need is a way to understand and troubleshoot a distributed system as a whole.
Performance Diagnosis in OpenStack
• OpenStack bug #1587777 was filed against Horizon.
• It took 10 months to figure out that the problem was actually in Keystone.
Question: Is there a way to make developers' and operators' lives less miserable?
Yes: end-to-end tracing.
End-to-End Tracing: What Is It and Where Are We Today?
End-to-End Tracing
Definition (End-to-End Tracing): End-to-end tracing captures the workflow of causally related activity (e.g., work done to process a request) within and among every component of a distributed system. [1]
[Figure: request workflows crossing component boundaries (client, server, app server, table store, distributed filesystem, storage nodes), annotated with per-step latencies.]
[1] So, you want to trace your distributed system? Key design insights from 4 years of practical experience. Raja Sambasivan et al.
A Typical End-to-End Tracing Infrastructure
Definition (Trace Metadata): Fields propagated with causally related events to identify their workflows. They are usually unique IDs or logical clocks, stored thread-locally or context-locally.
Definition (Trace Points): Instrumentation points in the system that identify individual pieces of work and propagate the necessary metadata.
Definition (Backend): A central collector that gathers pieces of trace data and reconstructs them into full, feature-rich traces.
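The three pieces defined above can be sketched in a few lines of Go. This is a minimal illustration, not how any particular tracing system is implemented: the names `Metadata`, `Record`, `Backend`, `TracePoint`, and `Reconstruct` are all hypothetical, and the backend simply keeps records in memory.

```go
package main

import (
	"fmt"
	"time"
)

// Metadata is the trace metadata propagated along with a request.
// All names in this sketch are hypothetical.
type Metadata struct {
	TraceID  uint64 // identifies the whole workflow
	ParentID uint64 // ID of the trace point that caused the next one
}

// Record is what a trace point reports to the backend.
type Record struct {
	TraceID   uint64
	ParentID  uint64
	ID        uint64
	Name      string
	Timestamp time.Time
}

// Backend stands in for the central collector; it only stores records in memory.
type Backend struct {
	nextID  uint64
	records []Record
}

// TracePoint records one piece of work and returns the metadata that should be
// propagated to any work caused by it.
func (b *Backend) TracePoint(md Metadata, name string) Metadata {
	b.nextID++
	b.records = append(b.records, Record{
		TraceID:   md.TraceID,
		ParentID:  md.ParentID,
		ID:        b.nextID,
		Name:      name,
		Timestamp: time.Now(),
	})
	return Metadata{TraceID: md.TraceID, ParentID: b.nextID}
}

// Reconstruct groups records by trace ID, a first step toward rebuilding full traces.
func (b *Backend) Reconstruct() map[uint64][]Record {
	traces := make(map[uint64][]Record)
	for _, r := range b.records {
		traces[r.TraceID] = append(traces[r.TraceID], r)
	}
	return traces
}

func main() {
	b := &Backend{}
	md := b.TracePoint(Metadata{TraceID: 7}, "client: send request")
	md = b.TracePoint(md, "server: handle request")
	b.TracePoint(md, "storage: read block")
	for _, r := range b.Reconstruct()[7] {
		fmt.Printf("trace=%d parent=%d id=%d %s\n", r.TraceID, r.ParentID, r.ID, r.Name)
	}
}
```

Each record carries the ID of the trace point that caused it, which is what lets a backend recover causal order rather than relying on timestamps alone.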
End-to-End Tracing Gains Popularity Gradually...
Table 1: Timeline
2002 • Pinpoint
2004 • Magpie, SDI
2005 • Causeway
2006 • Pip, Stardust
2007 • X-Trace
2010 • Google Dapper
2012 • Zipkin, HTrace
2013 • Node.js CLS
2014 • Apple Activity Tracing, Blkin
2015 • AppNeta, AppDynamics, NewRelic, OSProfiler
2017 • ..., Twitter, Prezi, SoundCloud, HDFS, HBase, Accumulo, Phoenix, Baidu, Netflix, Pivotal, Coursera, Census (Google), Canopy (Facebook), Jaeger (Uber), ...
End-to-End Tracing Systems Service Model
To distinguish tracing systems:
• On-demand (Rudimentary)
• Be always on (Smart Sampling)
• Collect trace data asynchronously
• DAG-based model to represent events
• Logical clock support
Comparing End-to-End Tracing Systems
Table 2: Comparing the features of end-to-end tracing systems: Jaeger, Zipkin, Pivot Tracing, Dapper, Canopy, OSProfiler, and Blkin.
Column groups: On-demand is a rudimentary feature; Sampling and Async. Collect. are needed to be always on; DAG-based Model and Interval Tree Clock are advanced features.

System          Can Be Applied to           On-demand  Sampling  Async. Collect.  DAG-based Model  Interval Tree Clock
Jaeger Tracing  Broadly (K8s, OpenShift)    ✗          ✓         ✓                ✗                ✗
Zipkin Tracing  Broadly                     ✗          ✓         ✓                ✗                ✗
Pivot Tracing   Hadoop/Java based systems   ✗          ✓         ✓                ✓                ✓
Dapper          N/A                         ✗          ✓         ✓                ✗                ✗
Canopy          N/A                         ✗          ✓         ✓                ✓                ✗
OSProfiler      OpenStack                   ✓          ✗         ✗                ✗                ✗
Blkin           Ceph                        ✓          ✗         ✗                ✗                ✗
Approaches for Enabling Sophisticated Tracing in OpenStack
Jaeger vs OSProfiler
Jaeger Tracing
ADVANTAGES
• Supports smart sampling
• Supports collecting trace data asynchronously
DISADVANTAGES
• Doesn't support a DAG-based model
• Doesn't use an advanced logical clock as the metadata
Jaeger vs OSProfiler
OSProfiler
ADVANTAGES
• Rudimentary on-demand tracing
• Already adopted by OpenStack, which is already instrumented with it
DISADVANTAGES
• Doesn't have sampling
• Doesn't collect trace data asynchronously
• Doesn't support a DAG-based model
• Doesn't use an advanced logical clock as the metadata
Jaeger vs OSProfiler
OSProfiler with Jaeger Tracing
ADVANTAGES
• Rudimentary on-demand tracing
• Already adopted by OpenStack, which is already instrumented with it
• Modifications we make can be directly applied to other Jaeger-instrumented systems
DISADVANTAGES
• Doesn't have sampling
• Doesn't collect trace data asynchronously
• Doesn't support a DAG-based model
• Doesn't use an advanced logical clock as the metadata
Feasibility
Key Challenges:
Trace Metadata/OSProfiler library change
• Implement CONTEXT generation using Jaeger
• Implement CONTEXT propagation using Jaeger
Trace Points/OpenStack instrumentation
• All of the instrumentation can be reused [2] ✓
Backend side
• Need to deploy a Backend/Collector for Jaeger Tracing ✓
[2] Modifying instrumentation for the purpose of our research is orthogonal.
Feasibility
Definition (Context): Context is an abstraction over the trace metadata that makes it easier to work with (injecting a trace into it, or extracting a trace from it).
Example Implementation

    // Context holds the basic metadata.
    type Context struct {
        TraceID uint64            // identifies the whole trace
        SpanID  uint64            // identifies the current span within the trace
        Sampled bool              // whether this trace is being recorded
        Baggage map[string]string // initialized on first use
    }
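To make "injecting" and "extracting" concrete, here is a minimal Go sketch that round-trips the Context above through a plain string map, the kind of carrier that could later be mapped onto HTTP headers or RPC message fields. The function names `Inject` and `Extract`, the key names (`trace-id`, `span-id`, `sampled`), and the `baggage-` prefix are assumptions made for illustration; they are not Jaeger's or OSProfiler's wire format.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Context holds the basic metadata (same fields as the example above).
type Context struct {
	TraceID uint64
	SpanID  uint64
	Sampled bool
	Baggage map[string]string // initialized on first use
}

// baggagePrefix is a hypothetical key prefix used to tell baggage items apart.
const baggagePrefix = "baggage-"

// Inject writes the context into a string-to-string carrier.
func Inject(c Context, carrier map[string]string) {
	carrier["trace-id"] = strconv.FormatUint(c.TraceID, 16)
	carrier["span-id"] = strconv.FormatUint(c.SpanID, 16)
	carrier["sampled"] = strconv.FormatBool(c.Sampled)
	for k, v := range c.Baggage {
		carrier[baggagePrefix+k] = v
	}
}

// Extract rebuilds a context from a carrier, failing if a required field is
// missing or malformed.
func Extract(carrier map[string]string) (Context, error) {
	var c Context
	var err error
	if c.TraceID, err = strconv.ParseUint(carrier["trace-id"], 16, 64); err != nil {
		return Context{}, fmt.Errorf("trace-id: %w", err)
	}
	if c.SpanID, err = strconv.ParseUint(carrier["span-id"], 16, 64); err != nil {
		return Context{}, fmt.Errorf("span-id: %w", err)
	}
	if c.Sampled, err = strconv.ParseBool(carrier["sampled"]); err != nil {
		return Context{}, fmt.Errorf("sampled: %w", err)
	}
	for k, v := range carrier {
		if strings.HasPrefix(k, baggagePrefix) {
			if c.Baggage == nil {
				c.Baggage = make(map[string]string) // initialized on first use
			}
			c.Baggage[strings.TrimPrefix(k, baggagePrefix)] = v
		}
	}
	return c, nil
}

func main() {
	carrier := make(map[string]string)
	Inject(Context{TraceID: 42, SpanID: 7, Sampled: true,
		Baggage: map[string]string{"user": "demo"}}, carrier)

	got, err := Extract(carrier)
	if err != nil {
		fmt.Println("extract failed:", err)
		return
	}
	fmt.Printf("%+v\n", got)
}
```

Keeping the carrier as a plain map means the same two functions can serve both the REST and the RPC paths; only the code that copies the map onto the wire differs.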
Feasibility: Context Generation
CONTEXT generation: All of the modifications will be made in the OSProfiler library. [3]
• Span context generation will be done by Jaeger, replacing the OSProfiler implementation.
[3] In OpenStack, developers instrument their codebase using functionality implemented in the OSProfiler library.
Feasibility: Context Propagation
CONTEXT propagation:
OpenStack instrumentation side
• REST API: Change the metadata propagation in OpenStack clients so that it propagates Jaeger metadata. We might only need to change the OSProfiler library.
• RPC API: Need to implement helper functions for metadata propagation over RPC. We might need to modify component codebases, depending on how RPC is handled in each component.
OSProfiler library side
• Need to deploy a Backend/Collector for Jaeger Tracing
Status Update
CONTEXT generation:
• A talk at the 2017 OpenStack Sydney Summit demonstrated how easy it is to record all of the OSProfiler tracing information in Jaeger as-is (i.e., context generation is still done in OSProfiler).
• Additionally, we need to generate the context using Jaeger tracing.
CONTEXT propagation:
• We will begin to look at ways to enforce metadata propagation in the OpenStack RPC API and REST API.
Jaeger Tracing Approach