Principled workflow-centric tracing of distributed systems



  1. Principled workflow-centric tracing of distributed systems. Raja Sambasivan, Ilari Shafer, Jonathan Mace, Ben Sigelman, Rodrigo Fonseca, Greg Ganger

  2. Today’s distributed systems. E.g., Twitter. Twitter “death star”: https://twitter.com/adrianco/status/441883572618948608

  3. Today’s distributed systems are amazingly complex (e.g., Netflix, Twitter). Machine-centric tools (GDB, gprof, strace, Linux perf counters) are insufficient. Netflix “death star”: http://www.slideshare.net/adriancockcroft/fast-delivery-devops-israel

  4. Workflow-centric tracing provides the needed coherent view. [Diagram: a Get request’s workflow across the client, server, app server, table store, and distributed FS, with per-component latencies (17 µs, 25 ms, 27 ms); metadata (e.g., IDs) is propagated with the request and trace points are recorded (e.g., at functions).]
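
As a rough illustration of the mechanism on this slide (not any particular infrastructure from the talk), the sketch below shows the two ingredients of workflow-centric tracing: request metadata (an ID) propagated along every call, and trace points that record timestamped events which can later be stitched into one coherent view. All component and function names here are hypothetical.

    import time
    import uuid

    TRACE_LOG = []  # stand-in for wherever trace records are stored

    def trace_point(request_id, component, event):
        # A trace point: records which request touched which component, and when.
        TRACE_LOG.append((request_id, component, event, time.time()))

    def table_store_get(request_id, key):
        trace_point(request_id, "table_store", "get_start")
        value = f"value-for-{key}"                 # placeholder for real work
        trace_point(request_id, "table_store", "get_end")
        return value

    def app_server_get(request_id, key):
        trace_point(request_id, "app_server", "get_start")
        value = table_store_get(request_id, key)   # the ID flows with the call
        trace_point(request_id, "app_server", "get_end")
        return value

    def client_get(key):
        request_id = str(uuid.uuid4())             # metadata minted at the entry point
        trace_point(request_id, "client", "get_start")
        value = app_server_get(request_id, key)
        trace_point(request_id, "client", "get_end")
        return value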

  5. Workflow-centric tracing is useful and being adopted: Stardust [SIGM’06], Stardust✚ [NSDI’11], X-Trace [NSDI’07], X-Trace✚ [WREN’10], Retro [NSDI’15], Pivot Tracing [SOSP’15], Pip [NSDI’06], Pinpoint [NSDI’04], Mace [PLDI’07], Dapper [TR10-14], HTrace, Zipkin, UberTrace. Management tasks by category: diagnosis (identifying anomalous workflows, identifying workflows with steady-state problems, profiling); resource management (attribution, performance tuning); and dynamic monitoring, which spans multiple categories. But, no clarity for tracing developers.

  6. But, no clarity for tracing developers. Expectation: Spectroscope could be built directly on the existing Stardust. Reality: Stardust had to be redesigned as Stardust✚ before Spectroscope could use it.

  7. We provide clarity for tracing developers. [Diagram: building a tracing infrastructure involves a series of design choices (1–6), and the right choices depend on the task (Tasks A–D).] Methodology: use our experiences to distill design axes, identify the choices best suited to different tasks, and compare them to the best existing infrastructures.

  8. Key results: (1) Different design decisions are needed for diagnosis and resource management. (2) Batching causes design decisions across some axes to interact poorly. (3) Existing tracing infrastructures suited to a task make choices similar to our suggestions.

  9. Anatomy & design axes Causal relationships? How to de fj ne Management tasks a request? Conc./Sync. d needed? d n Trace construction n a Inter-request a b b - needed? f - o n - I t Trace storage How will u O trace points be added? ! ! In-band / Sample? out-of-band? What to use to App Server Table store File system reduce ovhd? Tracing infrastructure 9

  10. How original Stardust defined requests. [Diagram: a WRITE’s trace contains only WRITE START, CACHE WRITE (10 µs), INSERT BLOCK, and WRITE REPLY (2 µs), yet the response time is ~20 ms, leaving ~20 ms of latency unaccounted for.] Trace not useful for diagnosis tasks.

  11. Two valid ways to define a request’s workflow. [Diagram: two WRITE requests; the first WRITE’s cached block is later evicted and flushed to disk (EVICT BLOCK, DISK START, 20,000 µs, DISK END) while the second WRITE runs, and this ~20 ms of latent work is attached to the first WRITE’s trace.] Resource management: assign latent work to the original submitter.

  12. Two valid ways to define a request’s workflow. [Diagram: the same two WRITEs, but here the ~20 ms of latent disk work is attached to the second WRITE, on whose critical path it executes.] Diagnosis: assign latent work to the request on whose critical path it is executed.
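
The two request definitions on slides 11 and 12 amount to two attribution policies for latent work. A minimal sketch of the difference, using made-up request names rather than data from the talk:

    def attribute(latent_event, policy):
        # Latent work (e.g., a cache eviction plus its disk write) can be
        # charged in two valid ways, depending on the management task.
        if policy == "resource_management":
            # charge the request that originally created the buffered data
            return latent_event["submitter"]
        if policy == "diagnosis":
            # charge the request on whose critical path the work executed
            return latent_event["critical_path_request"]
        raise ValueError(f"unknown policy: {policy}")

    evict_and_flush = {"name": "EVICT BLOCK + ~20 ms disk write",
                       "submitter": "first WRITE",
                       "critical_path_request": "second WRITE"}

    assert attribute(evict_and_flush, "resource_management") == "first WRITE"
    assert attribute(evict_and_flush, "diagnosis") == "second WRITE"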

  13. Future research directions: reducing the difficulty of adding trace points; lowering overhead when identifying anomalous workflows; exploring new analyses.

  14. Summary: Key design choices dictate workflow-centric tracing’s utility for different tasks. We identify the choices best suited for different tasks.
