diagnosing distributed cps with timing provenance
play

Diagnosing Distributed CPS with Timing Provenance Yang Andreas - PowerPoint PPT Presentation

NSF/Intel PI Meeting July 13, 2018 Diagnosing Distributed CPS with Timing Provenance Yang Andreas Linh Thi Xuan Wu Haeberlen Phan Problem: Timing faults n Many CPS are time dependent n The right thing must happen at the right


  1. NSF/Intel PI Meeting – July 13, 2018 Diagnosing Distributed CPS with Timing Provenance Yang Andreas Linh Thi Xuan Wu Haeberlen Phan

  2. Problem: Timing faults n Many CPS are time dependent n The “right thing” must happen at the “right time”! n What if this goes wrong? n Reasons: attack, bug, misconfiguration, … n Goal: A powerful diagnostic capability n Can we find the root cause of both functional and timing issues, such as low throughput, oscillations, high latencies, …? 2 @2018 Linh T. X. Phan – Timing Provenance

  3. (Q) How was the computing response generated? V1 Computing Rsp Challenge generated at C at 95s Storage RPC Rsp received by C at 93s Storage RPC Rsp Root cause! sent by B at 93s Storage Block was ... during [ 0s , 93s ) Bottleneck! Misbehaving Storage RPCs Root cause? Storage RPC Req received by B at 81s Storage RPC Req S4 sent by C at 81s V2 S2 Computing Req Storage Type S3 Storage received by C at 80s was remote during [ 0s , ∞ ) Maintenance Victim Storage RPC Backend (B) Service (M) S1 Why is the request taking (time) so long to complete? Computing Request span id: 1, parent id: none t6 Computing t0 Computing Request Service (C) Storage RPC span id: 2, parent id: 1 End Start t1 t2 t3 t4 t5 Bottleneck! Client Server Start Server Client Send Recv Root cause? Job Send Recv n State of the art Distributed tracing: explain what was computed when, but not why n Network provenance: only reason about functional causality n n Cannot reason about timing 3 @2018 Linh T. X. Phan – Timing Provenance

  4. Approach: Timing provenance Root cause Misbehaving Storage RPCs captured! Queuing delay! … S4 S2 S3 Storage Maintenance Victim Storage RPC Backend (B) Service (M) S1 Why is the request taking so long to complete? Computing Computing Request Service (C) n A generalization of provenance that tracks both functional causality and temporal causality i.e., causes that affect the timing of the observed symptom n may involve requests that are functionally independent n n Result: Can explain both the ‘what’ and the ‘when’ 4 @2018 Linh T. X. Phan – Timing Provenance

  5. How to capture temporal causality? n Intuition: Represent ordering relationship between exec. We need to know not just what the system did, but also in what order n (queuing and scheduling semantics) n Extend critical-path analysis in a novel way for the analysis D A B C D C Enqueue Dequeue Request D can only be dequeued after Timing provenance of D C is dequeued and finished processing must include C 5 @2018 Linh T. X. Phan – Timing Provenance

  6. Insight #1: Sequencing edges n Add a sequencing edge from execution X to execution Y if X immediately precedes Y in the queue Symptom A B C D Slow Compute t=114, @C, VMRsp Seq: 7, Enq: 113, Acq: 113, Deq: 114 Response t=100, @B-C, BillingRsp t=113, @S-C, StorageRsp Seq: 1, Enq: 98, Acq: 98, Deq: 100 Seq: 1, Enq: 111, Acq: 111, Deq: 113 t=97, @C, ComputeRsp, t=98, @B, BillingRsp, Seq: 6, Enq: 89, Acq: 92, Deq: 97 Seq: 2, Enq: 93, Acq: 93, Deq: 98 t=111, @S, StorageRsp Seq: 3, Enq: 91, Acq: 108, Deq: 111 D C t=108, @S, StorageRsp t=92, @C, NetworkRsp t=93, @C-B, BillingReq Seq: 2, Enq: 86, Acq: 105, Deq: 108 Seq: 5, Enq: 87, Acq: 91, Deq: 92 Seq: 1, Enq: 91, Acq: 91, Deq: 93 t=91, @C-S, StorageReq Seq: 1, Enq: 90, Acq: 90, Deq: 91 Billing t=88, @M-S, StorageReq t=91, @C, BillingReq RPC Seq: 4, Enq: 87 , Acq: 90, Deq: 91 Seq: 2, Enq: 86, Acq: 86, Deq: 88 t=105, @S, StorageRsp B Compute Seq: 1, Enq: 86, Acq: 102, Deq: 105 Seq: 0, Enq: 84, Acq: 99, Deq: 102 A RPC t=90, @C, StorageReq t=86, @M, StorageReq Seq: 3, Enq: 87 , Acq: 89, Deq: 90 t=102, @S, StorageRsp Seq: 2, Enq: 85 , Acq: 85, Deq: 86 t=89, @C, ComputeReq t=86, @M-S, StorageReq Seq: 2, Enq: 87 , Acq: 88, Deq: 89 Seq: 1, Enq: 84, Acq: 84, Deq: 86 Network Storage RPC RPC t=85, @M, MaintainReq t=84, @C-S, StorageReq t=84, @M, StorageReq Seq: 0, Enq: 83, Acq: 83, Deq: 84 Seq: 1, Enq: 83 , Acq: 83, Deq: 84 t=88, @C, NetworkReq Seq: 1, Enq: 87 , Acq: 87, Deq: 88 t=83, @M, StorageReq t=83, @M, MaintainReq Compute Root cause! Seq: 0, Enq: 82 , Acq: 82, Deq: 83 t=87, @C, VMReq Request Maintenance t=82, @M, MaintainReq Jobs 6 @2018 Linh T. X. Phan – Timing Provenance

  7. Challenge: Usability n Not all executions are equally important n How to isolate executions that contribute substantially to the overall delay? Slow Compute t=114, @C, VMRsp Seq: 7, Enq: 113, Acq: 113, Deq: 114 Response t=100, @B-C, BillingRsp t=113, @S-C, StorageRsp Seq: 1, Enq: 98, Acq: 98, Deq: 100 Seq: 1, Enq: 111, Acq: 111, Deq: 113 t=97, @C, ComputeRsp, t=98, @B, BillingRsp, Seq: 6, Enq: 89, Acq: 92, Deq: 97 Seq: 2, Enq: 93, Acq: 93, Deq: 98 t=111, @S, StorageRsp Seq: 3, Enq: 91, Acq: 108, Deq: 111 t=108, @S, StorageRsp t=92, @C, NetworkRsp t=93, @C-B, BillingReq Seq: 2, Enq: 86, Acq: 105, Deq: 108 Seq: 5, Enq: 87, Acq: 91, Deq: 92 Seq: 1, Enq: 91, Acq: 91, Deq: 93 t=91, @C-S, StorageReq Seq: 1, Enq: 90, Acq: 90, Deq: 91 Billing t=88, @M-S, StorageReq t=91, @C, BillingReq RPC Seq: 2, Enq: 86, Acq: 86, Deq: 88 t=105, @S, StorageRsp Seq: 4, Enq: 87 , Acq: 90, Deq: 91 Compute Seq: 1, Enq: 86, Acq: 102, Deq: 105 RPC t=86, @M, StorageReq t=90, @C, StorageReq Seq: 2, Enq: 85 , Acq: 85, Deq: 86 t=102, @S, StorageRsp t=86, @M-S, StorageReq Seq: 3, Enq: 87 , Acq: 89, Deq: 90 t=89, @C, ComputeReq Seq: 0, Enq: 84, Acq: 99, Deq: 102 Seq: 1, Enq: 84, Acq: 84, Deq: 86 Seq: 2, Enq: 87 , Acq: 88, Deq: 89 Network Storage t=85, @M, MaintainReq t=84, @C-S, StorageReq t=84, @M, StorageReq RPC RPC Seq: 0, Enq: 83, Acq: 83, Deq: 84 Seq: 1, Enq: 83 , Acq: 83, Deq: 84 t=88, @C, NetworkReq Seq: 1, Enq: 87 , Acq: 87, Deq: 88 t=83, @M, StorageReq t=83, @M, MaintainReq Seq: 0, Enq: 82 , Acq: 82, Deq: 83 Compute Maintenance t=87, @C, VMReq t=82, @M, MaintainReq Request 7 Jobs @2018 Linh T. X. Phan – Timing Provenance

  8. Insight #2: Delay annotations n Annotate vertexes with the delays that they contribute +90.009s Slow Compute t=114, @C, VMRsp Seq: 7, Enq: 113, Acq: 113, Deq: 114 +90s Response +0.004s t=100, @B-C, BillingRsp t=113, @S-C, StorageRsp +0.002s Seq: 1, Enq: 98, Acq: 98, Deq: 100 Seq: 1, Enq: 111, Acq: 111, Deq: 113 t=97, @C, ComputeRsp, t=98, @B, BillingRsp, Seq: 6, Enq: 89, Acq: 92, Deq: 97 Seq: 2, Enq: 93, Acq: 93, Deq: 98 t=111, @S, StorageRsp +0.003s Seq: 3, Enq: 91, Acq: 108, Deq: 111 t=108, @S, StorageRsp t=92, @C, NetworkRsp t=93, @C-B, BillingReq Seq: 2, Enq: 86, Acq: 105, Deq: 108 Seq: 5, Enq: 87, Acq: 91, Deq: 92 Seq: 1, Enq: 91, Acq: 91, Deq: 93 t=91, @C-S, StorageReq Seq: 1, Enq: 90, Acq: 90, Deq: 91 Billing t=88, @M-S, StorageReq t=91, @C, BillingReq RPC Seq: 2, Enq: 86, Acq: 86, Deq: 88 t=105, @S, StorageRsp Seq: 4, Enq: 87 , Acq: 90, Deq: 91 Compute Seq: 1, Enq: 86, Acq: 102, Deq: 105 RPC t=86, @M, StorageReq t=90, @C, StorageReq Seq: 2, Enq: 85 , Acq: 85, Deq: 86 t=102, @S, StorageRsp t=86, @M-S, StorageReq Seq: 3, Enq: 87 , Acq: 89, Deq: 90 t=89, @C, ComputeReq Seq: 0, Enq: 84, Acq: 99, Deq: 102 Seq: 1, Enq: 84, Acq: 84, Deq: 86 Seq: 2, Enq: 87 , Acq: 88, Deq: 89 Network Storage t=85, @M, MaintainReq t=84, @C-S, StorageReq t=84, @M, StorageReq RPC RPC Seq: 0, Enq: 83, Acq: 83, Deq: 84 Seq: 1, Enq: 83 , Acq: 83, Deq: 84 t=88, @C, NetworkReq Seq: 1, Enq: 87 , Acq: 87, Deq: 88 t=83, @M, StorageReq t=83, @M, MaintainReq Seq: 0, Enq: 82 , Acq: 82, Deq: 83 Compute Maintenance t=87, @C, VMReq t=82, @M, MaintainReq Request 8 Jobs @2018 Linh T. X. Phan – Timing Provenance

  9. Insight #2: Delay annotations n Annotate vertexes with the delays that they contribute n Goal: Delay annotations should correspond to “potential speedup” +0s +90.009s +0.009s Slow Compute t=114, @C, VMRsp Seq: 7, Enq: 113, Acq: 113, Deq: 114 +90s Response +0.004s t=100, @B-C, BillingRsp t=113, @S-C, StorageRsp +0.002s Seq: 1, Enq: 98, Acq: 98, Deq: 100 Seq: 1, Enq: 111, Acq: 111, Deq: 113 t=97, @C, ComputeRsp, t=98, @B, BillingRsp, Seq: 6, Enq: 89, Acq: 92, Deq: 97 Seq: 2, Enq: 93, Acq: 93, Deq: 98 t=111, @S, StorageRsp +0.003s Seq: 3, Enq: 91, Acq: 108, Deq: 111 t=108, @S, StorageRsp t=92, @C, NetworkRsp t=93, @C-B, BillingReq Seq: 2, Enq: 86, Acq: 105, Deq: 108 Seq: 5, Enq: 87, Acq: 91, Deq: 92 Seq: 1, Enq: 91, Acq: 91, Deq: 93 t=91, @C-S, StorageReq Seq: 1, Enq: 90, Acq: 90, Deq: 91 Billing t=88, @M-S, StorageReq t=91, @C, BillingReq RPC Seq: 2, Enq: 86, Acq: 86, Deq: 88 t=105, @S, StorageRsp Seq: 4, Enq: 87 , Acq: 90, Deq: 91 Compute Seq: 1, Enq: 86, Acq: 102, Deq: 105 RPC t=86, @M, StorageReq t=90, @C, StorageReq Seq: 2, Enq: 85 , Acq: 85, Deq: 86 t=102, @S, StorageRsp t=86, @M-S, StorageReq Seq: 3, Enq: 87 , Acq: 89, Deq: 90 t=89, @C, ComputeReq Seq: 0, Enq: 84, Acq: 99, Deq: 102 Seq: 1, Enq: 84, Acq: 84, Deq: 86 Seq: 2, Enq: 87 , Acq: 88, Deq: 89 Network Storage t=85, @M, MaintainReq t=84, @C-S, StorageReq t=84, @M, StorageReq RPC RPC Seq: 0, Enq: 83, Acq: 83, Deq: 84 Seq: 1, Enq: 83 , Acq: 83, Deq: 84 t=88, @C, NetworkReq Seq: 1, Enq: 87 , Acq: 87, Deq: 88 t=83, @M, StorageReq t=83, @M, MaintainReq Seq: 0, Enq: 82 , Acq: 82, Deq: 83 Compute Maintenance t=87, @C, VMReq t=82, @M, MaintainReq Request 9 Jobs @2018 Linh T. X. Phan – Timing Provenance

Recommend


More recommend