
Conquering Microservices Complexity @Uber With Distributed Tracing - PowerPoint PPT Presentation


  1. Conquering Microservices Complexity @Uber With Distributed Tracing. Yuri Shkuro, Software Engineer @ Uber

  2. Agenda: Why Distributed Tracing; Trace as a Narrative; Trace vs. Trace; Traces vs. Trace; Data Lineage; Q & A

  3. Yuri Shkuro, Software Engineer, Uber Technologies. Founder & Maintainer of CNCF Jaeger (jaegertracing.io). Co-founder of OpenTracing & OpenTelemetry. Author of "Mastering Distributed Tracing", by Packt Publishing. shkuro.com

  4. Quick Poll

  5. Why Distributed Tracing

  6. Scaling With Users: Distributed Systems

  7. Scaling With Engineering Organization: Monoliths to Microservices (diagram: components A, B, C, D)

  8. Scaling With CPU Cores: Asynchronous Programming Models, Distributed Concurrency (panels: basic concurrency, async concurrency, distributed concurrency)

  9. In microservices architectures the number of failure modes increases exponentially

  10. Observability of distributed transactions is paramount!

  11. Observability vs. monitoring

  12. Observability vs. monitoring

  13. Observability: a system’s ability to answer questions. Which services did the request go through? What did every service do when processing the request? If the request was slow, where were the bottlenecks? If the request failed, where did the errors happen? How different was the execution from the normal system behavior (structural differences, performance differences)? What was on the critical path of the request? Who should be paged?

  14. Distributed tracing can answer these questions and accelerate root cause analysis

  15. Distributed Tracing in a Nutshell

  16. Trace as a narrative

  17. Trace Timeline: Classic trace view as Gantt chart

  18. Trace Timeline: Parent → Child → Grandchild

  19. Trace Timeline: Time + Mini-Map

  20. Trace Timeline: Blocking operation

  21. Trace Timeline: Sequential operations

  22. Trace Timeline: Errors

  23. Span details

  24. Span details: Database query

  25. Span details: Timed events (logs)
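The timeline and span-detail slides above can be illustrated with a small data model. This is a minimal sketch and not Jaeger's actual data model: the `Span` class, its field names, the sample trace, and the text renderer are all hypothetical, chosen only to show parent/child nesting, tags (such as a database statement), timed events, and error marking on a shared time axis.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Span:
    """One timed operation in a trace (hypothetical, simplified model)."""
    span_id: str
    parent_id: Optional[str]      # None marks the root span
    service: str
    operation: str
    start_ms: int                 # offset from the start of the trace
    duration_ms: int
    tags: Dict[str, str] = field(default_factory=dict)
    logs: List[Tuple[int, str]] = field(default_factory=list)  # (offset_ms, message)


def depth(span: Span, by_id: Dict[str, Span]) -> int:
    """Count the ancestors between this span and the root."""
    d = 0
    while span.parent_id is not None:
        span = by_id[span.parent_id]
        d += 1
    return d


def render_timeline(spans: List[Span], width: int = 40) -> List[str]:
    """Render one text row per span, ordered by start time, Gantt-chart style."""
    by_id = {s.span_id: s for s in spans}
    total = max(s.start_ms + s.duration_ms for s in spans)
    rows = []
    for s in sorted(spans, key=lambda s: s.start_ms):
        offset = s.start_ms * width // total
        length = max(1, s.duration_ms * width // total)
        marker = "!" if s.tags.get("error") == "true" else "#"  # errors stand out
        label = "  " * depth(s, by_id) + f"{s.service}:{s.operation}"
        rows.append(f"{label:<28}|{' ' * offset}{marker * length}")
    return rows


# Made-up trace: parent -> child -> grandchild, then a failing sibling call.
trace = [
    Span("a", None, "frontend", "GET /ride", 0, 100),
    Span("b", "a", "rides", "find_driver", 10, 40),
    Span("c", "b", "mysql", "SELECT", 15, 20, tags={"db.statement": "SELECT ..."}),
    Span("d", "a", "payments", "charge", 50, 40,
         tags={"error": "true"}, logs=[(70, "card declined")]),
]

for row in render_timeline(trace):
    print(row)
```

Indentation encodes the parent/child relationship, bar position and length encode start time and duration, and the failing span is drawn with a different marker.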

  26. We can also trace asynchronous workflows

  27. Tracing Talk Application: Mastering Distributed Tracing, Chapter 5

  28. Tracing Talk Application: Architecture

  29. Tracing Talk Application: Request trace

  30. Tracing Talk Application: Message sent

  31. Tracing Talk Application: Message received
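Tracing an asynchronous workflow depends on carrying the trace context along with the message, so that the "message received" span joins the same trace as the "message sent" span. The sketch below shows the idea with plain dictionaries. It is not the code from the book or from any tracing library: the header names, `SpanContext`, `inject`, and `extract` are hypothetical names used for illustration.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SpanContext:
    trace_id: str
    span_id: str
    parent_id: Optional[str] = None


# Hypothetical header names, chosen for this sketch only.
TRACE_ID_HEADER = "x-trace-id"
SPAN_ID_HEADER = "x-span-id"


def new_id() -> str:
    return uuid.uuid4().hex[:16]


def start_span(parent: Optional[SpanContext] = None) -> SpanContext:
    """Start a root span, or a child that inherits the parent's trace ID."""
    if parent is None:
        return SpanContext(trace_id=new_id(), span_id=new_id())
    return SpanContext(trace_id=parent.trace_id, span_id=new_id(),
                       parent_id=parent.span_id)


def inject(ctx: SpanContext, headers: Dict[str, str]) -> None:
    """Producer side: copy the trace context into the message headers."""
    headers[TRACE_ID_HEADER] = ctx.trace_id
    headers[SPAN_ID_HEADER] = ctx.span_id


def extract(headers: Dict[str, str]) -> Optional[SpanContext]:
    """Consumer side: rebuild the sender's context, if the headers carry one."""
    trace_id = headers.get(TRACE_ID_HEADER)
    span_id = headers.get(SPAN_ID_HEADER)
    if trace_id is None or span_id is None:
        return None
    return SpanContext(trace_id=trace_id, span_id=span_id)


# Producer: start a span and attach its context to the outgoing message.
send_span = start_span()
message = {"headers": {}, "body": "hello"}
inject(send_span, message["headers"])

# Consumer: continue the same trace as a child of the send span.
receive_span = start_span(parent=extract(message["headers"]))
print(receive_span.trace_id == send_span.trace_id)  # True
```

If the headers are missing, `extract` returns `None` and the consumer starts a new trace, which is how a broken propagation chain typically shows up: as disconnected traces.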

  32. Single Trace: Pros and cons. Pros: tells a story about a single transaction; allows deep contextual drill-down; acts as a distributed stack trace. Cons: tells a story about a single transaction (what if it’s an anomaly?); one trace can be overwhelmingly complex.

  33. Too Much Complexity: One request, 30 services, 100+ RPCs

  34. Too Much Complexity: Some traces have hundreds of thousands of spans

  35. Reducing complexity by smarter visualizations

  36. Trace graph: Time-ordered, repeated edges collapsed

  37. Trace graph: Latency heat map
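Collapsing repeated edges is what shrinks a trace with many RPCs into a readable graph. The following is a minimal sketch of that idea, assuming each call is reduced to a hypothetical `(caller, callee, start_ms)` tuple; it is not the Jaeger UI implementation, and the sample calls are invented.

```python
from typing import Dict, List, Tuple

Call = Tuple[str, str, int]  # (caller service, callee service, start_ms)


def collapse_edges(calls: List[Call]) -> List[Tuple[str, str, int]]:
    """Return (caller, callee, call_count), ordered by each edge's first call."""
    counts: Dict[Tuple[str, str], int] = {}
    for caller, callee, _start in sorted(calls, key=lambda c: c[2]):
        edge = (caller, callee)
        counts[edge] = counts.get(edge, 0) + 1  # dicts preserve insertion order
    return [(a, b, n) for (a, b), n in counts.items()]


# Invented example: three identical cache lookups become one edge with a count.
calls = [
    ("frontend", "rides", 0),
    ("rides", "redis", 5),
    ("rides", "redis", 9),
    ("rides", "redis", 14),
    ("frontend", "payments", 30),
]
graph = collapse_edges(calls)
for caller, callee, n in graph:
    print(f"{caller} -> {callee} (x{n})")
```

Five calls collapse into three edges here; the same reduction applied to a trace with thousands of spans is what makes the graph view usable.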

  38. Finding anomalies is easier when we look at differences in performance profiles

  39. Trace vs. Trace

  40. Comparing Trace Structures: Just like a Code Diff

  41. Comparing Trace Structures: Shared Structure

  42. Comparing Trace Structures: Absent in One of the Traces

  43. Comparing Trace Structures: More or Fewer Spans Within a Node

  44. Comparing Trace Structures: Substantial Divergence

  45. Deep Linking to Raw Traces & Spans. Error: "You have an outstanding balance…"
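The structural comparison can be thought of as a diff over the nodes of two traces. The sketch below assumes each span is identified by its call path from the root, which is a simplification chosen for this example rather than the algorithm used in the talk's tooling. It reports the three cases named on the slides: shared nodes, nodes absent in one trace, and nodes with more or fewer spans.

```python
from collections import Counter
from typing import Dict, List, Tuple

Path = Tuple[str, ...]  # call path from the root, e.g. ("frontend", "rides")


def diff_traces(a: List[Path], b: List[Path]) -> Dict[Path, str]:
    """Classify every node seen in trace A or trace B."""
    count_a, count_b = Counter(a), Counter(b)
    result: Dict[Path, str] = {}
    for path in sorted(set(count_a) | set(count_b)):
        if path not in count_b:
            result[path] = "only in A"
        elif path not in count_a:
            result[path] = "only in B"
        elif count_a[path] == count_b[path]:
            result[path] = "shared"
        else:
            delta = count_b[path] - count_a[path]
            result[path] = f"shared, {delta:+d} spans in B"
    return result


# Invented traces: B skips payments, calls billing, and makes one fewer redis call.
trace_a = [
    ("frontend",),
    ("frontend", "rides"),
    ("frontend", "rides", "redis"),
    ("frontend", "rides", "redis"),
    ("frontend", "payments"),
]
trace_b = [
    ("frontend",),
    ("frontend", "rides"),
    ("frontend", "rides", "redis"),
    ("frontend", "billing"),
]

for path, status in diff_traces(trace_a, trace_b).items():
    print(" > ".join(path), "=>", status)
```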

  46. Production story: Migrating services to a nearby datacenter. Request latency doubles.

  47. Investigating latency: Structural comparison not always useful

  48. Investigating latency: Very similar structure

  49. Investigating latency: Left trace, 2.74 seconds

  50. Investigating latency: Right trace, 4.2 seconds

  51. Investigating latency: Due to structural differences?

  52. Investigating latency: Or dispersed contributors?

  53. Heat-maps!

  54. Comparing trace durations: Heat-map of latencies

  55. Comparing trace durations: Similar durations (grey)

  56. Comparing trace durations: Nodes that are not shared (white)

  57. Comparing trace durations: Red heat-map for latency differences

  58. Comparing trace durations: Details on Mouse-Over

  59. Comparing trace durations: Details on Mouse-Over
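The colour scheme on these slides (grey for similar durations, white for nodes that are not shared, red for latency differences) can be expressed as a small classification function. This is a sketch under assumptions: the 10% tolerance, the relative-difference formula, and the `red:<intensity>` output format are invented for illustration, and the per-node durations other than the two trace totals from the slides are made up.

```python
from typing import Dict, Optional, Tuple


def heat_color(a_ms: Optional[float], b_ms: Optional[float],
               tolerance: float = 0.1) -> str:
    """Pick a heat-map colour for one node from its durations in two traces."""
    if a_ms is None or b_ms is None:
        return "white"                      # node exists in only one trace
    base = max(a_ms, b_ms)
    if base == 0:
        return "grey"
    relative_diff = abs(a_ms - b_ms) / base
    if relative_diff <= tolerance:          # assumed threshold for "similar"
        return "grey"
    return f"red:{relative_diff:.2f}"       # larger difference, stronger red


# Root durations come from the slides (2.74 s vs. 4.2 s); the rest are invented.
durations: Dict[str, Tuple[Optional[float], Optional[float]]] = {
    "frontend": (2740, 4200),
    "rides": (100, 105),
    "billing": (None, 50),
}
for node, (left, right) in durations.items():
    print(node, heat_color(left, right))
```

With dispersed contributors, no single node turns deep red; many nodes turn slightly red instead, which is the pattern the heat-map makes visible and the structural diff does not.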

  60. How Are These Approaches Different? Summary: Distinct comparison modes simplify the comparisons. Surface less information. Condense the structural representation. Emphasize the differences.

  61. Challenges: Individual traces can be outliers. User must find the right baseline.

  62. Traces vs. Trace

  63. What Went Wrong? Root Cause Analysis

  64. Top Level Outcome: Including Request/Response Payloads

  65. Link to the Trace: Can Always Go Back to Raw Data

  66. Trace Structure: Nodes Are Sorted Chronologically

  67. Present and Missing Nodes: Color-Coding

  68. A Node With Error Data

  69. Error Data Panel

  70. How Is This Approach Different? Summary: Much broader context (aggregate vs. one trace). One purpose: root cause analysis of reliability issues.

  71. Tackling Data Complexity

  72. Uber is a data company (OK, and a transportation company). Streams / Kafka; Data lake / HDFS; Microservices / RPCs. Data undergoes many transformations. More data is derived from other data. Debugging data quality is difficult.

  73. Data Lineage: Debugging Data Quality. Streams / Kafka; Data lake / HDFS; Microservices / RPCs

  74. Observability requires high quality instrumentation.

  75. Our Software Is Highly Composable, Often from Open Source Components (diagram: Server Framework, RPC Framework, Microservice, Threads, Queue Driver, DB Driver, DB, Queue)

  76. Tracing breaks if components don’t understand each other.

  77. Standardization Efforts: Instrumentation and Data Formats. OpenTelemetry: Effective observability requires high-quality telemetry. OpenTelemetry makes robust, portable telemetry a built-in feature of cloud-native software. Distributed Tracing Working Group: Data formats for on-the-wire trace context & correlation-context, and out-of-band trace data.
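A shared on-the-wire format is what lets independently written components understand each other's trace context. The sketch below assumes the slide refers to the W3C Trace Context `traceparent` header, whose version `00` layout is `version-traceid-parentid-flags` in lowercase hex. The `TraceParent` type and `parse_traceparent` function are hypothetical names for this example, and the parser deliberately accepts only the four-field layout.

```python
import re
from typing import NamedTuple, Optional


class TraceParent(NamedTuple):
    version: str
    trace_id: str    # 32 lowercase hex characters
    parent_id: str   # 16 lowercase hex characters
    sampled: bool    # lowest bit of the trace-flags field


_PATTERN = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a traceparent header value; return None if it is malformed."""
    match = _PATTERN.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


parsed = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(parsed)
```

A component that cannot parse the incoming header should start a new trace rather than fail the request, which is why the function returns `None` instead of raising.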

  78. In Summary Distributed tracing helps us to deal with the overwhelming complexity of microservices

  79. In Summary Creative visualizations are essential in performance analysis

  80. In Summary Distributed tracing empowers unparalleled insights into our distributed systems

  81. Thank You Find me @ shkuro.com Q&A
