
SpeedLight: Synchronized Network Snapshots



  1. SpeedLight: Synchronized Network Snapshots
     Nofel Yaseen, John Sonchack, Vincent Liu

  2. Network Measurements

  3. Network Measurements
     • Measurements are how we understand networks
        • Operators: configuration, management and provisioning
        • Architects: designing new protocols and topologies
        • Researchers: measurement studies and evaluation
     • Today’s measurement techniques
        • Single device, e.g., counters, sampling
        • Single path or packet, e.g., pings, INT, ECN

  4. A Case for Consistency
     (Diagram labels: X, Y, B, A)

  6. A Case for Consistency
     (Diagram labels: X, Y, B, A)
     What is the reason for this packet drop?

  7. A Case for Consistency

  9. A Case for Consistency
     (Diagram label: Congestion)

  10. A Case for Consistency
      (Diagram labels: Congestion, Poor Load Balancing)

  11. A Case for Consistency
      (Diagram labels: Congestion, Poor Load Balancing)
      • Single device: no relationship among measurements across time or devices.

  13. A Case for Consistency
      (Diagram labels: Congestion, Poor Load Balancing)
      • Single device: no relationship among measurements across time or devices.
      • Single path or packet: no relationship among measurements across paths or packets.

  14. A Case for Consistency
      Existing tools fail to capture simultaneous behavior
      (Diagram labels: Congestion, Poor Load Balancing)
      • Single device: no relationship among measurements across time or devices.
      • Single path or packet: no relationship among measurements across paths or packets.

  15. Our Goal

  16. Our Goal
      A set of data-plane measurements that capture the state of the network at approximately a single point in time.
      Truly simultaneous behavior is not possible, so the goal is:
      • Causal consistency, i.e., the set should make sense
      • Near synchrony, i.e., it should be as close as possible to an actual state (<RTT)

  17. Speedlight
      A set of data-plane measurements that capture the state of the network at approximately a single point in time.

  18. Speedlight
      A set of data-plane measurements that capture the state of the network at approximately a single point in time.
      • A P4-based system for Synchronized Network Snapshots
      • Implemented on Wedge100BF
      • Can capture network-wide state of any value accessible in the data plane
      • Amenable to partial deployment
      • <100 μs synchronization, even for large networks

  19. Outline
      • Chandy-Lamport algorithm
      • Challenges of taking Synchronized Network Snapshots
      • Protocol
      • Prototype implementation
      • Evaluation

  20. Global Network View
      (Timeline diagram: process A with Events A0 to A3, process B with Events B0 to B3)
      • Partition the network into pre- and post-snapshot
      • e is pre-snapshot ⇒ all events that caused e are pre-snapshot
         • E.g., receive and send of a message
      Figure adapted from Linh T. X. Phan

  21. Global Network View
      (Timeline diagram: process A with Events A0 to A3, process B with Events B0 to B3; an inconsistent cut is marked)
      • Partition the network into pre- and post-snapshot
      • e is pre-snapshot ⇒ all events that caused e are pre-snapshot
         • E.g., receive and send of a message
      Figure adapted from Linh T. X. Phan

  22. Global Network View
      (Timeline diagram: process A with Events A0 to A3, process B with Events B0 to B3; an inconsistent cut and a consistent cut are marked)
      • Partition the network into pre- and post-snapshot
      • e is pre-snapshot ⇒ all events that caused e are pre-snapshot
         • E.g., receive and send of a message
      Figure adapted from Linh T. X. Phan
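The cut rule on this slide can be checked mechanically. The following is a minimal Python sketch, not taken from the talk: the event names reuse the A0..A3 / B0..B3 labels from the diagram, while the single message (sent at A1, received at B2), the `cut` encoding, and the function name `is_consistent_cut` are illustrative assumptions.

```python
from typing import Dict, List, Tuple

# Each process has an ordered list of event names (hypothetical example data).
Events = Dict[str, List[str]]
# A message is a (send_event, receive_event) pair.
Message = Tuple[str, str]


def is_consistent_cut(events: Events, cut: Dict[str, int], messages: List[Message]) -> bool:
    """Return True if no message is received pre-snapshot but sent post-snapshot.

    cut[p] is the number of events of process p that fall before the snapshot.
    """
    pre = {e for p, evs in events.items() for e in evs[: cut[p]]}
    return all(send in pre for send, recv in messages if recv in pre)


events = {"A": ["A0", "A1", "A2", "A3"], "B": ["B0", "B1", "B2", "B3"]}
messages = [("A1", "B2")]  # assumed message: A sends at A1, B receives at B2

# The receive (B2) and its send (A1) are both pre-snapshot.
consistent = is_consistent_cut(events, {"A": 2, "B": 3}, messages)
# The receive (B2) is pre-snapshot, but its send (A1) is post-snapshot.
inconsistent = is_consistent_cut(events, {"A": 1, "B": 3}, messages)
```

Because each process contributes a prefix of its own events, causality within a process holds by construction; only cross-process messages need the explicit check.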

  23. Chandy-Lamport (CL) Snapshots
      (Diagram labels: A: SS# 1; B: SS# 1; C: SS# 1)
      • Messages carry the current SS#
      • On seeing a message with a new SS# for the first time
         • Node takes a local checkpoint
         • Node attaches the new SS# to all subsequent messages
      • On seeing a message with an old SS#
         • Message was in-flight. Update channel state.
      Figure adapted from Linh T. X. Phan

  25. Chandy-Lamport (CL) Snapshots
      (Diagram labels: A: SS# 1; B: SS# 1; C: SS# 2, SS# 1)
      • Messages carry the current SS#
      • On seeing a message with a new SS# for the first time
         • Node takes a local checkpoint
         • Node attaches the new SS# to all subsequent messages
      • On seeing a message with an old SS#
         • Message was in-flight. Update channel state.
      Figure adapted from Linh T. X. Phan

  27. Chandy-Lamport (CL) Snapshots
      (Diagram labels: A: SS# 1; B: SS# 2, SS# 1; C: SS# 2, SS# 1)
      • Messages carry the current SS#
      • On seeing a message with a new SS# for the first time
         • Node takes a local checkpoint
         • Node attaches the new SS# to all subsequent messages
      • On seeing a message with an old SS#
         • Message was in-flight. Update channel state.
      Figure adapted from Linh T. X. Phan

  28. Chandy-Lamport (CL) Snapshots
      (Diagram labels: A: SS# 2, SS# 1; B: SS# 2, SS# 1; C: SS# 2, SS# 1)
      • Messages carry the current SS#
      • On seeing a message with a new SS# for the first time
         • Node takes a local checkpoint
         • Node attaches the new SS# to all subsequent messages
      • On seeing a message with an old SS#
         • Message was in-flight. Update channel state.
      Figure adapted from Linh T. X. Phan
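The three rules listed on these slides can be sketched as a small Python simulation. This is a simplified illustration rather than the full Chandy-Lamport algorithm or Speedlight's data-plane logic: the `Node` class, its `counter` field (standing in for whatever local state is being snapshotted), and the message ordering in the usage example are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Node:
    name: str
    ss: int = 1                      # current snapshot ID (SS#)
    counter: int = 0                 # example local state: messages processed
    checkpoints: Dict[int, int] = field(default_factory=dict)
    channel_state: Dict[int, int] = field(default_factory=dict)

    def initiate(self, new_ss: int) -> None:
        """Start (or join) snapshot new_ss by checkpointing local state."""
        if new_ss > self.ss:
            self.checkpoints[new_ss] = self.counter
            self.ss = new_ss

    def send(self) -> int:
        """Outgoing messages carry the node's current SS#."""
        return self.ss

    def receive(self, msg_ss: int) -> None:
        if msg_ss > self.ss:
            # First message with a new SS#: checkpoint before processing it.
            self.initiate(msg_ss)
        elif msg_ss < self.ss:
            # Message was in flight when this node took its checkpoint.
            self.channel_state[self.ss] = self.channel_state.get(self.ss, 0) + 1
        self.counter += 1


a, b, c = Node("A"), Node("B"), Node("C")
m1 = a.send()      # sent with SS# 1, still in flight toward B
a.initiate(2)      # A starts snapshot 2
m2 = a.send()      # carries SS# 2
c.receive(m2)      # C sees a new SS#: checkpoints, then adopts SS# 2
m3 = c.send()      # carries SS# 2
b.receive(m3)      # B sees a new SS#: checkpoints, then adopts SS# 2
b.receive(m1)      # old SS#: counted as channel state for snapshot 2
```

After this run, B's checkpoint for snapshot 2 records a counter of 0, and the late message `m1` is attributed to channel state instead of to the checkpoint, which is what keeps the recorded global state consistent.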

  29. Challenges for Synchronized Network Snapshots

  30. Challenges for Synchronized Network Snapshots
      1. CL provides no guarantee of synchrony
         • We want something that’s close to an actual state
      2. CL assumes single-threaded nodes, FIFO channels
         • Modern networks are highly parallel – breaks consistency
      3. CL assumes general purpose CPUs
         • Switch data planes are extremely limited
         • Switch CPUs are no better than remote hosts (wrt consistency)

  31. Ensuring Synchrony
      Challenge 1: Chandy-Lamport provides no guarantee of synchrony
      (Diagram label: Observer)

  32. Ensuring Synchrony
      Challenge 1: Chandy-Lamport provides no guarantee of synchrony
      (Diagram label: Observer)
      • Router CPUs are synchronized via PTP

  33. Ensuring Synchrony
      Challenge 1: Chandy-Lamport provides no guarantee of synchrony
      (Diagram labels: Observer; "Take SS# n at time t")
      • Router CPUs are synchronized via PTP
      • User/Observer schedules a snapshot at every router

  34. Ensuring Synchrony
      Challenge 1: Chandy-Lamport provides no guarantee of synchrony
      (Diagram labels: Observer; CPU; ASIC; "Take SS# n at time t")
      • Router CPUs are synchronized via PTP
      • User/Observer schedules a snapshot at every router
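To make the scheduling idea concrete, here is a rough Python sketch of how residual clock error bounds the spread of a snapshot scheduled for local time t. The router names and offsets are made-up numbers, not measurements from the talk, and the model ignores any delay between the switch CPU and the ASIC.

```python
from typing import Dict


def snapshot_times(clock_offsets_us: Dict[str, float], scheduled_t_us: float) -> Dict[str, float]:
    """True time at which each router fires a snapshot scheduled for local time t.

    A router whose clock runs ahead by `offset` reaches local time t
    at true time t - offset.
    """
    return {r: scheduled_t_us - off for r, off in clock_offsets_us.items()}


def synchronization_us(times: Dict[str, float]) -> float:
    """Spread between the earliest and latest snapshot initiation."""
    return max(times.values()) - min(times.values())


# Hypothetical residual PTP error per router, in microseconds.
offsets = {"r1": 0.0, "r2": 1.5, "r3": -2.0}
times = snapshot_times(offsets, scheduled_t_us=1_000_000.0)
spread = synchronization_us(times)
```

Under this model the spread equals the gap between the largest and smallest clock offset, so tighter clock synchronization directly tightens the snapshot.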

  36. Ensuring Consistency
      Challenge 2: CL assumes single-threaded nodes, FIFO channels
      (Diagram labels: Observer; CPU; ASIC)
      Figure from the P4 language specification

  37. Ensuring Consistency
      Challenge 2: CL assumes single-threaded nodes, FIFO channels
      (Diagram labels: Observer; CPU; ASIC)
      • Data plane snapshot on the level of individual processing units and priority channels

  38. Ensuring Consistency
      Challenge 2: CL assumes single-threaded nodes, FIFO channels
      (Diagram labels: Observer; CPU; ASIC)
      • Data plane snapshot on the level of individual processing units and priority channels
      • Snapshot propagates even if CPU invocation is delayed

  39. Ensuring Consistency
      Challenge 2: CL assumes single-threaded nodes, FIFO channels
      (Diagram labels: Observer; CPU; ASIC; packet layout: Ethernet, IP, Snapshot, TCP/UDP, Data)
      • Data plane snapshot on the level of individual processing units and priority channels
      • Snapshot propagates even if CPU invocation is delayed

  40. Compensate for Data-plane Limitations
      Challenge 3: CL assumes general purpose CPUs

  41. Compensate for Data-plane Limitations
      Challenge 3: CL assumes general purpose CPUs
      • Programmable ASICs are limited
         • Limited programming model, registers and accesses
      • Control plane compensates, for example:
         • Detects snapshot completion
         • Notifications
         • Extract from RAM
         • Lack of traffic
         • Liveness
         • Skipped snapshots
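One way a control plane could detect snapshot completion from data-plane notifications is sketched below in Python. The slide does not describe the mechanism, so everything here is an assumption: the `SnapshotTracker` class, the unit names, and the choice to treat a unit that reports a later snapshot ID as having passed any earlier one (a guess at how a skipped snapshot might be tolerated).

```python
from typing import Dict, Iterable, Set


class SnapshotTracker:
    """Tracks the latest snapshot ID reported by each processing unit."""

    def __init__(self, units: Iterable[str]) -> None:
        self.latest: Dict[str, int] = {u: 0 for u in units}

    def notify(self, unit: str, ss: int) -> None:
        # Notifications may arrive out of order; keep the highest ID seen.
        self.latest[unit] = max(self.latest[unit], ss)

    def is_complete(self, ss: int) -> bool:
        # Assumption: a unit that reports a later ID has moved past ss.
        return all(seen >= ss for seen in self.latest.values())

    def stragglers(self, ss: int) -> Set[str]:
        """Units that have not yet reported snapshot ss or anything later."""
        return {u for u, seen in self.latest.items() if seen < ss}


# Hypothetical processing-unit names.
tracker = SnapshotTracker(["sw1:ingress0", "sw1:egress0", "sw2:ingress0"])
tracker.notify("sw1:ingress0", 2)
tracker.notify("sw1:egress0", 3)          # skipped straight to snapshot 3
pending = tracker.stragglers(2)
complete_before = tracker.is_complete(2)
tracker.notify("sw2:ingress0", 2)
complete_after = tracker.is_complete(2)
```

A straggler set like `pending` is the kind of signal a control plane could use to notice units that see no traffic and therefore never observe the new snapshot ID in a packet.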

  42. Implementation and Evaluation
      • Implemented on a Barefoot Wedge100BF-32X
         • Control plane: ~2000 lines of Python
         • Data plane: ~1000 lines of P4 (per variant)
      • Evaluation
         • How synchronized is Speedlight?
         • What is the overhead?
         • How do its results compare against current mechanisms?

  44. How Synchronized is Speedlight?

  45. How Synchronized is Speedlight?
      (Figure: CDF of synchronization for Speedlight and Polling; x-axis: Synchronization (μs), ticks from 1 to 10000; y-axis: CDF, 0 to 1)
