NetPoirot: Taking The Blame Game Out of Data Center Operations Behnaz Arzani, Selim Ciraci, Boon Thau Loo, Assaf Schuster, Geoff Outhred
Datacenters can fail … 2
Failures are disruptive 3
Why is debugging hard? [Figure labels: Penn researcher, Azure VM, Service X, Azure Network] 4
In the case of a failure… Someone accepts responsibility / Each blames the other 5
A real example… Event X 6
Current tools are insufficient: T-RAT (SIGCOMM-02), NetProfiler (P2Psys-05), Sherlock (SIGCOMM-07), NetMedic (SIGCOMM-09), (NSDI-11) 7
Can we do better? (Overview) • Introducing… NetPoirot [Figure labels: Learning Agent, Fault injector] 8
The monitoring agent 9
What is the TCP event digest? 10
Why do we think this can work? 11
To distinguish failures… 12
Decision trees… • His uncertainty is X 13
Decision trees… • His uncertainty is X - Y 14
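The drop in uncertainty from X to X - Y can be illustrated with a small sketch. It assumes uncertainty is measured as Shannon entropy, so that Y is the information gain of a split; the slides do not name the measure, and the labels and the split below are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(labels, left, right):
    """Uncertainty before a split (X) minus the weighted uncertainty after it."""
    total = len(labels)
    after = (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
    return entropy(labels) - after


# Hypothetical failure labels (assumption: not data from the talk).
labels = ["network"] * 4 + ["client"] * 4

# A hypothetical split on one feature that separates the two classes perfectly.
left = ["network"] * 4
right = ["client"] * 4

x = entropy(labels)                          # uncertainty before the split (X)
y = information_gain(labels, left, right)    # reduction in uncertainty (Y)
print(f"X = {x:.2f} bits, Y = {y:.2f} bits, X - Y = {x - y:.2f} bits")
```

A tree learner of this kind would pick, at each node, the feature and threshold with the largest Y.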
Decision trees alone are not enough 15
Decision trees alone are not enough [Figure: axes Feature 1 and Feature 2] 17
Decision trees alone are not enough [Figure: axes Feature 1 and Feature 2; regions labeled "Hardest to classify" and "Easiest to classify"] 18
What we do to deal with this [Figure: axes Feature 1 and Feature 2] 19
Upper portion of an example tree… 20
• Mean of max congestion window
• 50th percentile of number of triple duplicate ACKs
• Min of the last congestion window
• 50th percentile of connection duration
• 95th percentile of the max congestion window
• Max of the number of triple duplicate ACKs
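The tree nodes above are aggregates (mean, min, max, percentiles) of per-connection TCP metrics. A minimal sketch of how such features could be computed follows. The record fields, the sample values, and the nearest-rank percentile definition are all assumptions for illustration, not the format used by the monitoring agent.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q% of samples <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def aggregate(values):
    """Summarize one metric across all connections in an observation window."""
    return {
        "mean": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
        "p50": percentile(values, 50),
        "p95": percentile(values, 95),
    }


# Hypothetical per-connection records for one observation window.
connections = [
    {"max_cwnd": 10, "triple_dup_acks": 0, "duration_s": 1.2},
    {"max_cwnd": 40, "triple_dup_acks": 2, "duration_s": 0.4},
    {"max_cwnd": 20, "triple_dup_acks": 1, "duration_s": 3.0},
    {"max_cwnd": 30, "triple_dup_acks": 5, "duration_s": 0.9},
]

# Build one flat feature vector, e.g. "mean_max_cwnd" or "p95_duration_s".
features = {"num_flows": len(connections)}
for metric in ("max_cwnd", "triple_dup_acks", "duration_s"):
    stats = aggregate([c[metric] for c in connections])
    for name, value in stats.items():
        features[f"{name}_{metric}"] = value

for name in sorted(features):
    print(name, features[name])
```

Each resulting vector would be one training or test sample for the classifier.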
What we do to deal with this [Figure: axes Feature 1 and Feature 2] 21
Upper portion of an example tree… 22
• 50th percentile of the max RTT
• Number of flows
• 50th percentile of amount of data received
• 95th percentile of the number of timeouts
Decision trees alone are not enough [Figure: axes Feature 1 and Feature 2] 23
The upper portion of an example tree… 24
• Mean time spent in zero window probing
• 95th percentile of the ratio of number of bytes posted to received
• Number of flows
• 95th percentile of connection durations
• Minimum of the number of bytes received
• Number of flows
Is it a network failure? Is it a server problem? Is it a client-side problem? 25
Other details 26
• If throughput < x: Send more data on the same connection
• If throughput < x: Open more connections
What did we learn from all this? 27
Evaluation 28
How did we get labeled data? 29
Worst-case application 30
What if we haven’t seen the failure before? 31
Performance on real applications
General label   Normal   Client   Network
Precision       97.78%   99.7%    100%
Recall          99.68%   98.25%   99.37%
YouTube, Event X 32
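For reference, per-label precision and recall such as those in the table can be computed from true and predicted labels as sketched below. The label lists are made up for illustration and do not reproduce the numbers above.

```python
def precision_recall(true_labels, predicted_labels, target):
    """Precision and recall for one class, treating it as the positive label."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical ground truth and classifier output (assumption: toy data).
truth = ["normal", "normal", "client", "network", "network", "client"]
predicted = ["normal", "client", "client", "network", "network", "client"]

for label in ("normal", "client", "network"):
    p, r = precision_recall(truth, predicted, label)
    print(f"{label}: precision={p:.2%} recall={r:.2%}")
```

Precision answers "when the classifier blames this entity, how often is it right?", while recall answers "of the failures this entity caused, how many were attributed to it?".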
Things we did not talk about 33
What’s next? 34
Conclusion 35