
Trumpet: Timely and Precise Triggers in Data Centers



  1. Trumpet: Timely and Precise Triggers in Data Centers. Masoud Moshref, Minlan Yu, Ramesh Govindan, Amin Vahdat

  2. The Problem: Long failure repair times in large networks; human-in-the-loop failure assessment and repair (Evolve or Die, SIGCOMM 2016)

  3. Humans in the Loop: Detect, Locate, Inspect, Fix

  4. Programs in the Loop: Detect, Locate, Inspect, Fix

  5. Our Focus: Detect. A framework for programmed detection of events in large datacenters

  6. Events (❖ Availability ❖ Performance ❖ Security): Link failure, Loop, Packet burst, Middlebox failure, DDoS, Traffic surge, Packet delay, Congestion, Lost packet, Burst loss, Switch failure, Blackhole, Incast, Load imbalance, Traffic hijack

  7. Our Focus: Detect. Aggregated, often sampled measures of network health

  8. Fine Timescale Events: Detecting Transient Congestion. 40 ms burst; timeouts lasting several 100 ms

  9. Fine Timescale Events: Detecting Attack Onset. Did this tenant see a sudden increase in traffic over the last few milliseconds?

  10. Inspect Every Packet: Some event definitions may require inspecting every packet. Events: Link failure, Loop, Packet burst, Middlebox failure, DDoS, Traffic surge, Packet delay, Congestion, Lost packet, Burst loss, Switch failure, Blackhole, Incast, Load imbalance, Traffic hijack

  11. Eventing Framework Requirements: ▸ Expressivity: set of possible events not known a priori ▸ Fine timescale eventing: capture transient and onset events ▸ Per-packet processing: precise event determination. Because data centers will require high availability and high utilization

  12. A Key Architectural Question: Where do we place eventing functionality? Switches, NICs, or hosts? Hosts ❖ Are programmable ❖ Have processing power for fine-timescale eventing ❖ Already inspect every packet

  13. We explore the design of a host-based eventing framework

  14. Research Questions: What eventing architecture permits programmability and visibility? How can we achieve precise eventing at fine timescales? What is the performance envelope of such an eventing framework?

  15. Research Questions: What eventing architecture permits programmability and visibility? How can we achieve precise eventing at fine timescales? What is the performance envelope of such an eventing framework? Answer to the first question: Trumpet has a logically centralized event manager that aggregates local events from per-host packet monitors

  16. Event Definition: For each packet matching Filter, group by Flow-granularity and report every Time-interval each group that satisfies Predicate. Predicates cover flow volumes, loss rate, loss pattern (bursts), delay

  17. Event Example: Is there any flow sourced by a service that sees a burst of losses in a small interval? For each packet matching Service IP Prefix, group by 5-tuple and report every 10ms any flow whose sum(is_lost & is_burst) > 10%

  18. Event Example: Is there a job in a cluster that sees abnormal traffic volumes in a small interval? For each packet matching Cluster IP Prefix and Port, group by Job IP Prefix and report every 10ms any job whose sum(volume) > 100MB
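A minimal Python sketch of the filter / group-by / interval / predicate form used in slides 16 to 18, not Trumpet's actual trigger language. The names `Packet`, `Trigger`, `evaluate`, `job_prefix` and `CLUSTER`, the example prefix `10.0.0.0/16`, and the choice of a /24 as the "job IP prefix" are all assumptions made for illustration; only a byte-volume predicate is modeled, and the port part of the filter is omitted.

```python
import ipaddress
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Hashable


@dataclass(frozen=True)
class Packet:
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: int
    size: int      # bytes
    ts_ms: float   # arrival time in milliseconds


@dataclass
class Trigger:
    matches: Callable[[Packet], bool]        # Filter
    group_key: Callable[[Packet], Hashable]  # Flow granularity
    interval_ms: float                       # Time interval
    predicate: Callable[[int], bool]         # Predicate over per-group byte volume


def evaluate(trigger, packets):
    """Return sorted (interval_index, group) pairs whose volume satisfies the predicate."""
    volume = defaultdict(int)
    for p in packets:
        if trigger.matches(p):
            epoch = int(p.ts_ms // trigger.interval_ms)
            volume[(epoch, trigger.group_key(p))] += p.size
    return sorted(k for k, v in volume.items() if trigger.predicate(v))


CLUSTER = ipaddress.ip_network("10.0.0.0/16")  # hypothetical cluster prefix


def job_prefix(p):
    # Assumption: a job is identified by the /24 of the source address.
    return str(ipaddress.ip_network(p.src_ip + "/24", strict=False))


# Slide 18 style: every 10ms, report any job whose sum(volume) > 100MB.
heavy_job = Trigger(
    matches=lambda p: ipaddress.ip_address(p.src_ip) in CLUSTER,
    group_key=job_prefix,
    interval_ms=10,
    predicate=lambda volume: volume > 100 * 10**6,
)
```

Calling `evaluate(heavy_job, packets)` on a list of `Packet` objects returns the (interval, job prefix) pairs that would be reported. This batch form only illustrates the semantics; the deck's monitor evaluates triggers incrementally as packets arrive.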

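  19. Trumpet Design (diagram): The Controller sends an Event to the Trumpet Event Manager and receives a Report. The Trumpet Event Manager sends Triggers to, and receives Trigger Reports from, the Trumpet Packet Monitor on each server (VMs, hypervisor, software switch)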

  20. Trumpet Event Manager: Event: Congestion? The Trumpet Event Manager sends Congestion Triggers. A trigger contains event attributes, detects local events

  21. Trumpet Event Manager (diagram)

  22. Trumpet Event Manager: Trumpet can be used by programs to drill down to potential root causes. Event: Large flow? The Trumpet Event Manager sends Large Flow Triggers

  23. Research Questions: What eventing architecture permits programmability and visibility? How can we achieve precise eventing at fine timescales? What is the performance envelope of such an eventing framework? Answer to the second question: The monitor optimizes packet processing to inspect every packet and evaluate predicates at fine timescales

  24. The Packet Monitor (diagram): Server with VMs, hypervisor, software switch, and the Trumpet Packet Monitor

  25. A Key Assumption: Piggyback on CPU core used by software switch ❖ Conserves server CPU resources ❖ Avoids inter-core synchronization (diagram: server with VMs, hypervisor, software switch, and the Trumpet Packet Monitor)

  26. Can a single core monitor thousands of triggers at full packet rate (14.8 Mpps) on a 10G NIC?
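The 14.8 Mpps figure follows from standard Ethernet framing arithmetic, sketched below. The 20 bytes of per-frame overhead (8 bytes of preamble and start-of-frame delimiter plus a 12-byte inter-frame gap) is standard Ethernet; the 2.3 GHz clock is the one listed on the Scalability slide. The function names are illustrative only.

```python
LINK_BPS = 10 * 10**9    # 10G NIC
FRAME_BYTES = 64         # minimum Ethernet frame
OVERHEAD_BYTES = 20      # preamble + SFD (8 bytes) and inter-frame gap (12 bytes)


def packets_per_second(link_bps, frame_bytes, overhead_bytes=OVERHEAD_BYTES):
    """Maximum frames per second on a link for a given frame size."""
    return link_bps / ((frame_bytes + overhead_bytes) * 8)


def cycles_per_packet(budget_ns, clock_ghz):
    """CPU cycles available per packet at a given per-packet time budget."""
    return budget_ns * clock_ghz


pps = packets_per_second(LINK_BPS, FRAME_BYTES)
budget_ns = 1e9 / pps
cycles = cycles_per_packet(budget_ns, 2.3)
print(f"{pps / 1e6:.2f} Mpps, {budget_ns:.1f} ns per packet, about {cycles:.0f} cycles")
```

This prints 14.88 Mpps and 67.2 ns per packet, which is roughly 155 cycles on a 2.3 GHz core; the deck rounds these to 14.8 Mpps and 70ns.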

  27. Two Obvious Tricks: Use kernel bypass ▸ Avoid kernel stack overhead. Use polling to have tighter scheduling ▸ Trigger time intervals at 10ms. Necessary, but far from sufficient…

  28. Monitor Design With 1000s of triggers: Packet → Match filters → Update statistics at flow granularity → Check predicate at time-interval. Example triggers (Filter; Flow granularity; Time interval; Predicate): Source IP = 10.1.1.0/24; 5-tuple; 10ms; Sum(loss) > 10%. Source IP = 20.2.2.0/24; Service IP prefix; 100ms; Sum(size) < 10MB

  29. Design Challenges: Packet → Match filters → Update statistics at flow granularity → Check predicate at time-interval. Which of these should be performed ❖ On-path ❖ Off-path

  30. Design Challenges: Packet → Match filters → Update statistics at flow granularity → Check predicate at time-interval. Which operations to do on-path? ❖ 70ns to forward and inspect packet

  31. Design Challenges: Packet → Match filters → Update statistics at flow granularity → Check predicate at time-interval. How to schedule off-path operations? ❖ Off-path on same core, can delay packets ❖ Bound delay to a few µs

  32. Strawman Design: On-Path: Packet → Packet history. Off-Path: Match filters → Update statistics at flow granularity → Check predicate at time-interval. Doesn’t scale to large numbers of triggers

  33. Strawman Design: On-Path: Packet → Match filters → Update statistics at flow granularity. Off-Path: Check predicate at time-interval. Still cannot reach goal ❖ Memory subsystem becomes a bottleneck

  34. Trumpet Monitor Design: On-Path: Packet → Match filters → Update statistics at 5-tuple granularity. Off-Path: Gather statistics at flow granularity → Check predicate at time-interval
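The on-path/off-path split on slide 34 can be sketched in Python as follows. This is an illustration under simplifying assumptions, not Trumpet's implementation: `MonitorSketch` and its methods are hypothetical names, filters are matched by a linear scan on the first packet of each 5-tuple, only byte volume is tracked, and all triggers share one interval.

```python
from collections import defaultdict


class MonitorSketch:
    def __init__(self, triggers):
        # triggers: list of (matches, group_key, predicate) callables,
        # each taking a 5-tuple (or, for predicate, a byte volume).
        self.triggers = triggers
        self.match_cache = {}                  # 5-tuple -> indices of matching triggers
        self.bytes_by_tuple = defaultdict(int)

    def on_path(self, five_tuple, size):
        """Per packet: match filters once per 5-tuple, then update one counter."""
        if five_tuple not in self.match_cache:
            self.match_cache[five_tuple] = [
                i for i, (matches, _, _) in enumerate(self.triggers)
                if matches(five_tuple)
            ]
        if self.match_cache[five_tuple]:
            self.bytes_by_tuple[five_tuple] += size

    def off_path(self):
        """Per interval: gather 5-tuple counters into trigger groups, check predicates."""
        gathered = defaultdict(int)
        for five_tuple, volume in self.bytes_by_tuple.items():
            for i in self.match_cache[five_tuple]:
                group = self.triggers[i][1](five_tuple)
                gathered[(i, group)] += volume
        self.bytes_by_tuple.clear()            # start a fresh interval
        return sorted(k for k, v in gathered.items()
                      if self.triggers[k[0]][2](v))
```

The point of the split is that the per-packet path touches a single 5-tuple entry regardless of how many triggers exist, while the per-trigger, per-group work is deferred to the interval boundary.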

  35. Optimizations (On-Path: Packet → Match filters → Update statistics at 5-tuple granularity): ❖ Use tuple-space search for matching ❖ Match on first packet, cache match ❖ Lay out tables to enable cache prefetch ❖ Use TLB huge pages for tables
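Tuple-space search, the first optimization on slide 35, groups filters by their mask pattern and keeps one hash table per pattern, so a lookup costs one hash probe per distinct pattern rather than one comparison per filter. The Python sketch below is a generic illustration restricted to IPv4 source and destination prefixes; the `TupleSpace` class and its methods are hypothetical names and do not reflect Trumpet's table layout.

```python
import ipaddress
from collections import defaultdict


def _mask(ip_int, prefix_len):
    """Keep only the top prefix_len bits of a 32-bit address."""
    return ip_int >> (32 - prefix_len) if prefix_len else 0


class TupleSpace:
    def __init__(self):
        # (src_prefix_len, dst_prefix_len) -> {(masked_src, masked_dst): [trigger ids]}
        self.tables = defaultdict(lambda: defaultdict(list))

    def add(self, trigger_id, src_net, dst_net="0.0.0.0/0"):
        src = ipaddress.ip_network(src_net)
        dst = ipaddress.ip_network(dst_net)
        key = (_mask(int(src.network_address), src.prefixlen),
               _mask(int(dst.network_address), dst.prefixlen))
        self.tables[(src.prefixlen, dst.prefixlen)][key].append(trigger_id)

    def lookup(self, src_ip, dst_ip):
        """One hash probe per distinct (src_len, dst_len) tuple."""
        s = int(ipaddress.ip_address(src_ip))
        d = int(ipaddress.ip_address(dst_ip))
        hits = []
        for (slen, dlen), table in self.tables.items():
            hits.extend(table.get((_mask(s, slen), _mask(d, dlen)), []))
        return sorted(hits)
```

With thousands of triggers that share a handful of prefix-length patterns, the number of probes stays at the number of patterns. Caching the result for each flow's first packet, as the slide's second bullet suggests, removes even that cost for later packets.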

  36. Optimizations (Off-Path: Gather statistics at flow granularity → Check predicate at time-interval): ❖ Lazy cleanup of statistics across intervals ❖ Lay out tables to enable cache prefetch ❖ Bounded-delay cooperative scheduling

  37. Bounded Delay Cooperative Scheduling: On-Path and Off-Path processing alternate; bound delay to a few µs
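One way to picture bounded-delay cooperative scheduling on a single core is sketched below in Python. It is an illustration only: `off_path_sweep` and `run_core` are hypothetical names, and a fixed step budget stands in for the few-µs time bound on the slide, on the assumption that each off-path step has a known, small cost.

```python
from collections import deque


def off_path_sweep(flow_table, check):
    """Generator: process one flow entry per step, yielding between entries."""
    for key, stats in list(flow_table.items()):
        check(key, stats)
        yield


def run_core(rx_queue, on_path, sweep, budget_steps=4):
    """Alternate between draining packets and a bounded slice of off-path work.

    Returns the longest run of off-path steps executed without polling
    the receive queue, which is what bounds the added packet delay.
    """
    longest_slice = 0
    done = False
    while rx_queue or not done:
        while rx_queue:                 # on-path work always goes first
            on_path(rx_queue.popleft())
        steps = 0
        while not done and steps < budget_steps:
            try:
                next(sweep)             # one small unit of off-path work
                steps += 1
            except StopIteration:
                done = True
        longest_slice = max(longest_slice, steps)
    return longest_slice
```

Because the sweep gives control back after every entry, the packet path is never starved for longer than `budget_steps` units of off-path work, at the price of stretching the sweep over several slices.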

  38. Research Questions: What eventing architecture permits programmability and visibility? How can we achieve precise eventing at fine timescales? What is the performance envelope of such an eventing framework? Answer to the third question: Trumpet can monitor thousands of triggers at full packet rate on a 10G NIC

  39. Evaluation: Trumpet is expressive ❖ Transient congestion ❖ Burst loss ❖ Attack onset. Trumpet scales to thousands of triggers. Trumpet is DoS-resilient

  40. Detecting Transient Congestion: Trumpet can detect millisecond-scale congestion events (plot labels: 40 ms, Congestion, Large Flow (Reactive))

  41. Scalability: Trumpet can process 14.8 Mpps ❉ ❖ 64 byte packets at 10G ❖ 650 byte packets at 4x10G … while evaluating 16K triggers at 10ms granularity. ❉ Xeon E5-2650, 10-core 2.3 GHz, Intel 82599 10G NIC

  42. Performance Envelope: Dimensions: triggers matched by each flow; how often each predicate is checked. Above this rate, Trumpet would miss events

  43. Performance Envelope: Number of <trigger, flow> pairs increases statistics gathering overhead. At moderate packet rates, can detect events at 1ms

  44. Performance Envelope: Above 10ms, CPU can sustain full packet rate. Need to profile and provision Trumpet deployment

  45. Conclusion: Future datacenters will need fast and precise eventing ▸ Trumpet is an expressive system for host-based eventing. Trumpet can process 16K triggers at full packet rate ▸ … without delaying packets by more than 10 µs. Future work: scale to 40G NICs ▸ … perhaps with NIC or switch support. https://github.com/USC-NSL/Trumpet

  46. A Big Discrepancy: Outage budget for five 9s availability (99.999% uptime): 24 seconds per month. Long failure durations due to time to root-cause failures

  47. Every optimization is necessary ❉ (❉ Details in the paper)
