
Distributed Tracing: Understand how your components work together



  1. Distributed Tracing Understand how your components work together

  2. About me José Carlos Chávez ● Software Engineer at Typeform focused on the aggregate of responses services. ● Zipkin core team and open source contributor for Observability projects.

  3. Distributed Systems

  4. Distributed systems: A collection of independent components that appears to its users as a single coherent system. Characteristics: ● Concurrency ● No global clock ● Independent failures

  5. Distributed systems [Diagram: a household plumbing system. Labels: Tank valve, Cold water storage tank, Shutoff valve, First floor branch, Water heater, Gas supplier.]

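  6. Distributed systems: Understanding failures [Diagram: a GET /media/e5k2 request enters through the API Proxy and reaches the Media API, which depends on the Images, Videos and Auth services. Databases DB1 to DB4 sit behind the services. Error labels on the diagram: 500 Internal Error, 500 Internal Error, TCP error (2003).]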

  7. Distributed systems: Understanding failures [Diagram: the same plumbing system. Labels: Tank valve, Cold water storage tank, Shutoff valve, First floor branch, Water heater, Gas supplier. Callout: "I AM HERE! First floor distributor is clogged!"]

  8. We do have that, it is called logs!

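  9. Logs & Concurrency [Diagram: the same request as in slide 6. A GET /media/e5k2 request enters through the API Proxy and reaches the Media API, which depends on the Images, Videos and Auth services, with databases DB1 to DB4. Error labels: 500 Internal Error, 500 Internal Error, TCP error (2003).]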

  10. Logs & Concurrency
      [24/Oct/2017 13:50:07 +0000] “GET /media HTTP/1.1” 200 … **0/13548”
      [24/Oct/2017 13:50:07 +0000] “GET /media HTTP/1.1” 200 … **0/23948”
      [24/Oct/2017 13:50:08 +0000] “GET /media HTTP/1.1” 200 … **0/12396”
      [24/Oct/2017 13:50:07 +0000] “GET /videos HTTP/1.1” 200 … **0/23748” ?
      [24/Oct/2017 13:50:07 +0000] “GET /images HTTP/1.1” 200 … **0/23248” ?
      [24/Oct/2017 13:50:07 +0000] “GET /auth HTTP/1.1” 200 … **0/26548”
      [24/Oct/2017 13:50:07 +0000] “POST /media HTTP/1.1” 200 … **0/13148”
      [24/Oct/2017 13:50:07 +0000] “GET /media HTTP/1.1” 200 … **0/2588”
      [24/Oct/2017 13:50:07 +0000] “GET /auth HTTP/1.1” 500 … **0/3248”
      [24/Oct/2017 13:50:07 +0000] “POST /media HTTP/1.1” 200 … **0/23548”
      [24/Oct/2017 13:50:07 +0000] “GET /images HTTP/1.1” 200 … **0/22598”
      [24/Oct/2017 13:50:07 +0000] “GET /videos HTTP/1.1” 200 … **0/13948”
      ...

  11. Distributed systems: Understanding failures [Diagram: the plumbing system again. Labels: Tank valve, Cold water storage tank, Shutoff valve, First floor branch, Water heater, Gas supplier. Callout: "I AM HERE! First floor distributor is clogged!"]

  12. Distributed Tracing to unclog your pipes

  13. Distributed tracing [Diagram: a timeline for TraceID d52d38b69b0fb15efa with one bar per service (API Proxy, Media API, Videos, Images, Auth) along a Time axis. Annotations: "[1508410442] no cache for resource, retrieving from DBc" and "500 error".]
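Once every log line carries the trace ID, the interleaved lines from slide 10 can be regrouped per request. A minimal sketch of that idea, assuming a hypothetical `trace_id=` field in each log line; the log format and the IDs below are made up for illustration and are not taken from the talk:

```python
import re
from collections import defaultdict
from typing import Dict, List

# Hypothetical log field; real log formats will differ.
TRACE_ID = re.compile(r"trace_id=(\w+)")


def group_by_trace(lines: List[str]) -> Dict[str, List[str]]:
    """Group log lines by the trace ID they carry, keeping arrival order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        match = TRACE_ID.search(line)
        if match:
            grouped[match.group(1)].append(line)
    return dict(grouped)


# Interleaved lines from two concurrent requests (illustrative data).
logs = [
    "GET /media 200 trace_id=aaa1",
    "GET /auth 500 trace_id=bbb2",
    "GET /videos 200 trace_id=aaa1",
    "GET /media 500 trace_id=bbb2",
]

for trace_id, lines in group_by_trace(logs).items():
    print(trace_id, lines)
```

With the ID in place, the failing `/auth` call and the failing `/media` call end up in the same group, which is the correlation plain access logs cannot give.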

  14. Distributed Tracing: What answers do I get? ● What services did a request pass through? ● What occurred in each service for a given request? ● Where did the error happen? ● Where are the bottlenecks? ● What is the critical path for a request? ● Who should I page?
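Several of these questions reduce to simple queries over the spans of one trace. The sketch below is illustrative only: the `Span` shape, the durations and the choice of which service fails are assumptions, and "bottleneck" is approximated as the span with the largest self time (its duration minus its direct children, assuming the children do not overlap).

```python
from typing import List, NamedTuple, Optional


class Span(NamedTuple):
    span_id: str
    parent_id: Optional[str]
    service: str
    duration_ms: int
    error: bool = False


def services_visited(spans: List[Span]) -> List[str]:
    """Which services did the request pass through (in recorded order)?"""
    seen: List[str] = []
    for span in spans:
        if span.service not in seen:
            seen.append(span.service)
    return seen


def failing_spans(spans: List[Span]) -> List[Span]:
    """Where did the error happen?"""
    return [span for span in spans if span.error]


def self_time(span: Span, spans: List[Span]) -> int:
    """Time spent in the span itself, excluding its direct children."""
    children = sum(s.duration_ms for s in spans if s.parent_id == span.span_id)
    return span.duration_ms - children


def bottleneck(spans: List[Span]) -> Span:
    """Where is the bottleneck? Here: the span with the largest self time."""
    return max(spans, key=lambda s: self_time(s, spans))


# Hypothetical trace; numbers are invented for the example.
trace = [
    Span("1", None, "API Proxy", 120),
    Span("2", "1", "Media API", 110),
    Span("3", "2", "Videos", 30),
    Span("4", "2", "Images", 60),
    Span("5", "2", "Auth", 10, error=True),
]

print(services_visited(trace))
print([s.service for s in failing_spans(trace)])
print(bottleneck(trace).service)
```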

  15. Distributed Tracing & friends Credits: Peter Bourgon

  16. Benefits of Distributed Tracing ● (Almost) immediate feedback ● System insight, clarifies non-trivial interactions ● Visibility into critical paths and dependencies ● Understand latencies ● Request scoped, not request's lifecycle scoped.

  17. Trace’s Anatomy ● A trace shows an execution path through a distributed system. ● A span in the trace represents a logical unit of work (with a start and end). ● A context includes information that should be propagated across services. ● Tags and logs (optional) add complementary information to spans. [Diagram: a trace drawn along a Time axis, with spans labeled /things, auth.Auth, mysql.Get and GET /videos.]
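The anatomy above maps naturally onto a small data structure. The following is a sketch, not Zipkin's actual model: field names, IDs, timings and the parent/child arrangement of the spans are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Span:
    """A logical unit of work with a start and an end."""
    trace_id: str                  # shared by every span in the trace
    span_id: str
    parent_id: Optional[str]       # None for the root span
    name: str
    start_ms: int
    end_ms: int
    tags: Dict[str, str] = field(default_factory=dict)           # optional
    logs: List[Tuple[int, str]] = field(default_factory=list)    # optional

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def render(spans: List[Span], parent_id: Optional[str] = None, depth: int = 0) -> List[str]:
    """Render the trace as an indented tree, children ordered by start time."""
    lines: List[str] = []
    children = sorted((s for s in spans if s.parent_id == parent_id),
                      key=lambda s: s.start_ms)
    for span in children:
        lines.append("  " * depth + f"{span.name} ({span.duration_ms} ms)")
        lines.extend(render(spans, span.span_id, depth + 1))
    return lines


# Hypothetical trace; the hierarchy is assumed, not taken from the slide.
trace = [
    Span("t1", "a", None, "GET /videos", 0, 50),
    Span("t1", "b", "a", "auth.Auth", 5, 15),
    Span("t1", "c", "a", "mysql.Get", 20, 45, tags={"db.table": "videos"}),
]

print("\n".join(render(trace)))
```

The context that travels between services is just the identifiers (`trace_id`, `span_id`, `parent_id`); tags and logs stay with the span that recorded them.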

  18. Elements of distributed tracing ● Leg 1: inbound propagation ● Leg 2: outbound propagation ● Leg 3: in-process propagation Credits: Nic Munroe

  19. Leg 1: Inbound propagation When your service processes a request or consumes a message. [Diagram: the API Proxy calls the Media API with GET /media, carrying TraceID: fAf3oXL6DS, SpanID: dZ0xHIBa1A, ...]
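A minimal sketch of inbound propagation, assuming the context arrives in B3 HTTP headers (`X-B3-TraceId`, `X-B3-SpanId`, `X-B3-ParentSpanId`). `TraceContext`, `extract` and `new_id` are hypothetical helper names, and the IDs from the slide are reused as-is even though real B3 IDs are lower-hex strings.

```python
import secrets
from typing import Dict, NamedTuple, Optional


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    parent_id: Optional[str]


def new_id() -> str:
    """Generate a random 64-bit ID as 16 lower-hex characters."""
    return secrets.token_hex(8)


def extract(headers: Dict[str, str]) -> TraceContext:
    """Read the incoming trace context, or start a new trace if none is present."""
    normalized = {key.lower(): value for key, value in headers.items()}
    trace_id = normalized.get("x-b3-traceid")
    span_id = normalized.get("x-b3-spanid")
    if trace_id and span_id:
        return TraceContext(trace_id, span_id, normalized.get("x-b3-parentspanid"))
    # No upstream context: this service is the first hop of a new trace.
    return TraceContext(new_id(), new_id(), None)


incoming = {"X-B3-TraceId": "fAf3oXL6DS", "X-B3-SpanId": "dZ0xHIBa1A"}
print(extract(incoming))
```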

  20. Leg 2: Outbound propagation When your service makes an outbound call to another service. [Diagram: the Media API calls the Video service (http/get, GET /videos), carrying TraceID: fAf3oXL6DS, ParentID: dZ0xHIBa1A, SpanID: y74fr5udj.]
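Outbound propagation is the mirror image: before calling the next service, create a child span and write its identifiers into the request headers. This sketch assumes B3 HTTP headers; `TraceContext`, `child_of` and `inject` are hypothetical helper names, and the IDs are the illustrative ones from the slide.

```python
import secrets
from typing import Dict, NamedTuple, Optional


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    parent_id: Optional[str]


def child_of(parent: TraceContext) -> TraceContext:
    """Same trace, fresh span ID, and the caller's span becomes the parent."""
    return TraceContext(parent.trace_id, secrets.token_hex(8), parent.span_id)


def inject(ctx: TraceContext, headers: Dict[str, str]) -> Dict[str, str]:
    """Write the context into outgoing request headers (B3 header names)."""
    headers["X-B3-TraceId"] = ctx.trace_id
    headers["X-B3-SpanId"] = ctx.span_id
    if ctx.parent_id is not None:
        headers["X-B3-ParentSpanId"] = ctx.parent_id
    return headers


# Context of the request the Media API is currently serving.
server_ctx = TraceContext("fAf3oXL6DS", "dZ0xHIBa1A", None)
outgoing_headers = inject(child_of(server_ctx), {"Accept": "application/json"})
print(outgoing_headers)
```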

  21. Leg 3: In-process propagation When performing an operation inside the service. [Diagram: while serving GET /images, the Images service (called by the Media API) runs mysql.Query and a redis.Get against the Cache service.]

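  22. Distributed tracing [Diagram: the timeline for TraceID d52d38b69b0fb15efa again, with one bar per service (API Proxy, Media API, Videos, Images, Auth) along a Time axis. Annotations: "[1508410442] no cache for resource, retrieving from DBc" and "500 error".]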

  23. Any overhead? For users: ● Observability tools are meant to be unintrusive ● Sampling reduces overhead ● (Don’t) trace every single operation For developers: ● Not all libraries are ready to plug in instrumentation ● Instrumentation can be delegated to common frameworks
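Sampling is usually decided once, where the trace starts, and then propagated so downstream services agree. A minimal sketch assuming a probabilistic sampler and the B3 `X-B3-Sampled` header; `ProbabilitySampler` and `sampling_decision` are hypothetical names, not a library API.

```python
import random
from typing import Dict, Optional


class ProbabilitySampler:
    """Keep roughly `rate` of all new traces (0.0 = none, 1.0 = all)."""

    def __init__(self, rate: float, rng: Optional[random.Random] = None) -> None:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0.0 and 1.0")
        self.rate = rate
        self._rng = rng or random.Random()

    def is_sampled(self) -> bool:
        # random() returns a float in [0.0, 1.0)
        return self._rng.random() < self.rate


def sampling_decision(headers: Dict[str, str], sampler: ProbabilitySampler) -> bool:
    """Honor an upstream decision if present, otherwise decide locally."""
    upstream = headers.get("X-B3-Sampled")
    if upstream is not None:
        return upstream == "1"
    return sampler.is_sampled()


sampler = ProbabilitySampler(0.1)
print(sampling_decision({"X-B3-Sampled": "1"}, sampler))  # upstream said "keep"
print(sampling_decision({}, sampler))                     # local coin flip
```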

  24. Introducing Apache Zipkin

  25. Apache Zipkin Based on BigBrotherBird (B3) and inspired by Google Dapper (2010). It was open sourced by Twitter (2012) and joined the Apache Incubator in September 2018. ● Mature tracing model that emerged from users’ needs. ● Used by large companies like Netflix, SoundCloud and Yelp, but also by smaller ones. ● Strong community: ○ @zipkinproject ○ gitter.im/openzipkin

  26. Zipkin: architecture [Diagram: an instrumented service sends spans over a Transport (http/kafka/grpc) to the Collector, which receives the spans, deserializes them and schedules them for storage. Storage stores the spans in a DB (Cassandra/MySQL/ElasticSearch). The API retrieves data from Storage and the UI visualizes it.]
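On the service side, "send spans over HTTP" amounts to POSTing a JSON array of spans to the collector. The sketch below builds such a request in the Zipkin v2 JSON format for the `/api/v2/spans` endpoint without sending it; `make_span` and `build_report_request` are hypothetical helpers, the IDs and service name are made up, and `http://localhost:9411` assumes a Zipkin server on its default port.

```python
import json
import urllib.request
from typing import Dict, List, Optional


def make_span(trace_id: str, span_id: str, name: str, service: str,
              timestamp_us: int, duration_us: int,
              parent_id: Optional[str] = None, kind: str = "SERVER",
              tags: Optional[Dict[str, str]] = None) -> dict:
    """Build one span as a dict in Zipkin v2 JSON shape (times in microseconds)."""
    span = {
        "traceId": trace_id,
        "id": span_id,
        "name": name,
        "kind": kind,
        "timestamp": timestamp_us,
        "duration": duration_us,
        "localEndpoint": {"serviceName": service},
        "tags": tags or {},
    }
    if parent_id is not None:
        span["parentId"] = parent_id
    return span


def build_report_request(spans: List[dict],
                         base_url: str = "http://localhost:9411") -> urllib.request.Request:
    """Prepare (but do not send) the HTTP request that reports the spans."""
    body = json.dumps(spans).encode("utf-8")
    return urllib.request.Request(
        base_url + "/api/v2/spans",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Illustrative span; IDs and names are invented for the example.
spans = [make_span("463ac35c9f6413ad", "a2fb4a1d1a96d312", "get /media",
                   "media-api", 1508410442000000, 25000,
                   tags={"http.path": "/media"})]
request = build_report_request(spans)
print(request.get_method(), request.full_url)
# To actually report, with a Zipkin server running: urllib.request.urlopen(request)
```

Real instrumentation libraries batch spans and report them asynchronously so the request path is not blocked; this sketch leaves that out.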

  27. Zipkin: traces

  28. Zipkin: traces

  29. Zipkin: traces

  30. Zipkin: trace overview

  31. Zipkin: tags and logs

  32. Zipkin: traces with errors

  33. Zipkin: traces for async operations

  34. Zipkin: dependency graph

  35. Zipkin: dependency graph

  36. Q&As twitter.com/jcchavezs Find more: http://bit.ly/dist-trac
