Faria Kalim, Le Xu, Sharanya Bathey, Richa Meherwal, Indranil Gupta - PowerPoint PPT Presentation

Henge: Intent-Driven Multi-Tenant Stream Processing. Faria Kalim, Le Xu, Sharanya Bathey, Richa Meherwal, Indranil Gupta. Distributed Protocols Research Group, Department of Computer Science, University of Illinois at Urbana-Champaign.


  1. Henge: Intent-Driven Multi-Tenant Stream Processing. Faria Kalim, Le Xu, Sharanya Bathey, Richa Meherwal, Indranil Gupta. Distributed Protocols Research Group, Department of Computer Science, University of Illinois at Urbana-Champaign. http://dprg.cs.uiuc.edu/

  2. Henge allows stream processing jobs to satisfy user-specified performance requirements while reducing costs by performing online resource reconfigurations in a multi-tenant environment.

  3. A Typical Deployment: Job 1, Job 2, Job 3, Job 4. Per-job clusters → overprovisioning.

  4. A Typical Deployment: Job 1, Job 2, Job 3, Job 4. Low-level metrics (e.g., queue sizes, CPU load) as performance indicators.

  6. A Typical Deployment: Job 1, Job 2, Job 3, Job 4. Low-level metrics (e.g., queue sizes, CPU load) as performance indicators. Manual scaling.

  7. Intent-Driven Multi-Tenancy

  8. Intent-Driven Multi-Tenancy. Efficient resource usage across multiple users ➔ Multi-tenancy ...

  9. Intent-Driven Multi-Tenancy. Efficient resource usage across multiple users ➔ Multi-tenancy. Application-aware adaptation to user requirements ➔ Intent-driven Multi-tenancy.

  10. Intent-Driven Multi-Tenancy. CPU Load, Queue Sizes … Efficient resource usage across multiple users ➔ Multi-tenancy. Application-aware adaptation to user requirements ➔ Intent-driven Multi-tenancy. Example jobs and their Service Level Objectives (SLOs): Job 1, finding ride price: latency < 5 s; Job 2, analyzing earnings over time: throughput > 10K/hr; ...

  11. Problem: How can we achieve user-facing service level objectives for stream processing jobs on multi-tenant clusters?

  12. Problem: How can we achieve user-facing service level objectives (latency, throughput) for stream processing jobs on multi-tenant clusters?

  13. Absolute Throughput SLOs are not Useful. [Figure: workload variability; rate (tuples/s) over Day 1 and Day 2.]

  14. Absolute Throughput SLOs are not Useful. [Figure: workload variability; rate (tuples/s) over Day 1 and Day 2, annotated "SLO?".]

  17. Absolute Throughput SLOs are not Useful. [Figure: workload variability; input and output rate (tuples/s) over Day 1 and Day 2, annotated "SLO?".]

  18. Absolute Throughput SLOs are not Useful. Job operations: … Filter …

  19. Absolute Throughput SLOs are not Useful. Job operations: … Filter … Juice: fraction* of the input data processed by the job per unit time.
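
A minimal sketch of why a fraction-based metric is less sensitive to workload variability than an absolute rate. It assumes juice can be approximated as processed input divided by arrived input within one time window; the function name, its parameters, and the sample numbers are hypothetical. The slide marks "fraction" with an asterisk, so the full definition likely carries qualifications that this sketch ignores.

```python
def juice(processed_input_tuples: int, arrived_input_tuples: int) -> float:
    """Approximate juice for one time window (assumed simplification).

    Returns the fraction of the input that arrived in the window
    and was processed by the job, capped at 1.0.
    """
    if arrived_input_tuples == 0:
        # Assumed convention: nothing arrived, so nothing is pending.
        return 1.0
    return min(1.0, processed_input_tuples / arrived_input_tuples)


# Hypothetical numbers: the input rate is 5x higher on day 2,
# so absolute throughput differs, but the fraction processed does not.
day1 = juice(processed_input_tuples=9_000, arrived_input_tuples=10_000)
day2 = juice(processed_input_tuples=45_000, arrived_input_tuples=50_000)
```

Both windows yield the same juice even though the absolute tuple rates differ, which is the property an SLO needs when the workload varies from day to day.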

  20. Jobs benefit even below SLO threshold: Job Utility Functions

  21. Jobs benefit even below SLO threshold: Job Utility Functions. [Figure: utility function for a single job; utility vs. latency, marking Expected Utility, Current Utility, and the SLO Threshold.]

  22. Jobs benefit even below SLO threshold: Job Utility Functions. Henge’s goal → Maximize the total utility of the cluster. [Figure: utility function for a single job; utility vs. latency, marking Expected Utility, Current Utility, and the SLO Threshold.]
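
A minimal sketch of a per-job utility function for a latency SLO and of the cluster-wide total. The slides only state that a job earns its expected utility when the SLO is met and still benefits when it is missed; the proportional decay used past the threshold is an assumption, as are all names and numbers.

```python
def latency_utility(latency_s: float, slo_threshold_s: float,
                    expected_utility: float) -> float:
    """Current utility of one job with a latency SLO (assumed shape)."""
    if latency_s <= slo_threshold_s:
        # SLO met: the job earns its full expected utility.
        return expected_utility
    # SLO missed: utility shrinks as latency grows (assumed decay).
    return expected_utility * slo_threshold_s / latency_s


def total_utility(jobs: list) -> float:
    """Cluster-wide utility: the sum of every job's current utility."""
    return sum(
        latency_utility(j["latency_s"], j["slo_s"], j["expected_utility"])
        for j in jobs
    )


# Hypothetical cluster: one job meets its 5 s SLO, one misses it.
jobs = [
    {"latency_s": 4.0, "slo_s": 5.0, "expected_utility": 10.0},
    {"latency_s": 10.0, "slo_s": 5.0, "expected_utility": 10.0},
]
cluster_utility = total_utility(jobs)
```

The job that misses its SLO still contributes partial utility, so a scheduler maximizing the sum has a gradient to follow instead of a pass/fail signal.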

  23. Background: Stream Processing Topologies (Jobs). Operators: Spout, Splitter, Count. [Figure: logical DAG for a Word Count job.]

  24. [Figure: example topologies built from Spout, Bolt, and Sink operators: Star Topology and Diamond Topology.]

  25. Background: Stream Processing Jobs. [Figure: Spout executors emit the sentence [“So it goes…”]; Splitter executors emit the words [“So”], [“it”], [“goes”]; Count executors receive them. Executors are threads.]

  26. Background: Stream Processing Jobs. Parallelism → 2. [Figure: Spout executors emit the sentence [“So it goes…”]; Splitter executors emit the words [“So”], [“it”], [“goes”]; Count executors receive them. Executors are threads.]
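
A minimal sketch of the Word Count job in plain Python, with Count run at a configurable parallelism. It does not use any stream processing framework; routing each word to a Count executor by a hash of the word is an assumption made here so that every occurrence of a word lands on the same executor, and all names are hypothetical.

```python
import zlib
from collections import Counter


def splitter(sentence: str) -> list:
    """Splitter operator: break a sentence into words."""
    return sentence.split()


def run_word_count(sentences: list, count_parallelism: int = 2) -> list:
    """Feed sentences through Splitter into `count_parallelism` Count executors."""
    # One Counter per Count executor (a thread in a real deployment).
    counters = [Counter() for _ in range(count_parallelism)]
    for sentence in sentences:
        for word in splitter(sentence):
            # Assumed routing: a stable hash picks the executor for this word.
            idx = zlib.crc32(word.encode("utf-8")) % count_parallelism
            counters[idx][word] += 1
    return counters


counters = run_word_count(["So it goes", "So it goes"], count_parallelism=2)
totals = sum(counters, Counter())  # merge per-executor counts
```

Raising `count_parallelism` spreads the same words over more executors, which is the knob the reconfiguration slides later in the deck refer to.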

  27. Background: A Physical Deployment

  28. Background: A Physical Deployment. [Figure: Spout, Splitter, and two Count executors placed on Workers.]

  29. Henge’s Cluster-Wide State Machine. States: Converged and Not Converged. Condition shown: Total System Utility < Total Expected Utility.

  30. Henge’s Cluster-Wide State Machine. States: Converged and Not Converged. Condition shown: Total System Utility < Total Expected Utility. Transition labels: Reconfiguration, Reduction, Reversion or Reconfiguration.
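
A minimal sketch of the convergence check implied by the state machine: the cluster counts as Not Converged while total system utility is below total expected utility. The enum, function name, and sample values are hypothetical; how the three actions are chosen between is not modeled here.

```python
from enum import Enum


class ClusterState(Enum):
    CONVERGED = "Converged"
    NOT_CONVERGED = "Not Converged"


def cluster_state(current_utilities: list, expected_utilities: list) -> ClusterState:
    """Classify the cluster from per-job current and expected utilities."""
    if sum(current_utilities) < sum(expected_utilities):
        return ClusterState.NOT_CONVERGED
    return ClusterState.CONVERGED


# Hypothetical cluster of two jobs, each expecting a utility of 10.
state_before = cluster_state([10.0, 5.0], [10.0, 10.0])
state_after = cluster_state([10.0, 10.0], [10.0, 10.0])
```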

  31. Reconfiguration: De-congest operator by increasing parallelism level of executors. [Figure: state machine with transitions 1) Reconfiguration and 2) Reconfiguration between Not Converged and Converged.]

  32. Reconfiguration: De-congest operator by increasing parallelism level of executors. 3) Black-list topologies that show less than Δ% improvement. [Figure: state machine with transitions 1) Reconfiguration and 2) Reconfiguration between Not Converged and Converged.]
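
A minimal sketch of the two reconfiguration ideas on these slides: raising the parallelism of congested operators, and black-listing a topology whose utility improved by less than Δ%. The step size of one executor, the handling of a zero starting utility, and every name are assumptions.

```python
def bump_parallelism(parallelism: dict, congested_operators: set,
                     step: int = 1) -> dict:
    """Return a new operator -> executor-count map with congested operators scaled up."""
    return {
        op: n + step if op in congested_operators else n
        for op, n in parallelism.items()
    }


def should_blacklist(utility_before: float, utility_after: float,
                     delta_percent: float) -> bool:
    """True when a reconfiguration improved utility by less than delta_percent."""
    if utility_before <= 0:
        # Assumed convention: from zero, any gain counts as sufficient improvement.
        return utility_after <= 0
    improvement = 100.0 * (utility_after - utility_before) / utility_before
    return improvement < delta_percent


# Hypothetical Word Count topology where the Splitter is congested.
new_parallelism = bump_parallelism(
    {"spout": 1, "splitter": 1, "count": 2}, {"splitter"}
)
```

Black-listing keeps the scheduler from spending further reconfigurations on a topology that has stopped responding to them.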

  33. Bottlenecks. [Figure: Spout, Splitter, and two Count executors on Workers.]

  34. Reconfiguration. Labels: Bottlenecks, Reconfigs. [Figure: Spout, three Splitter executors, and two Count executors on Workers.]

  35. Labels: Bottlenecks, Reconfigs., High Load. [Figure: Spout, three Splitter executors, and two Count executors on Workers.]

  36. Labels: Bottlenecks, Reconfigs., High Load, SLO-Satisfying Job. [Figure: Spout, three Splitter executors, and two Count executors on Workers.]

  37. Labels: Bottlenecks, Reconfigs., High Load, Reduction.

  38. Reduction. Labels: Bottlenecks, Reconfigs., High Load, Reduction.

  39. Reduction: Reconfigurations → drop in utility. [Figure: Reduction transition at the Not Converged state.]

  40. Reduction: Reconfigurations → drop in utility. If high CPU load on majority of machines, reduce parallelism for operators that a) are in topologies that satisfy their SLO and b) are not congested. [Figure: Reduction transition at the Not Converged state.]
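
A minimal sketch of the reduction rule stated above: when a majority of machines show high CPU load, pick operators that are not congested and belong to topologies that satisfy their SLO. The 0.8 load threshold, the dictionary layout, and all names are assumptions; the slides do not say how "high" load is defined.

```python
def reduction_candidates(machine_cpu_loads: list, jobs: list,
                         high_load: float = 0.8) -> list:
    """Return (job, operator) pairs whose parallelism may be reduced."""
    overloaded = sum(load > high_load for load in machine_cpu_loads)
    # "Majority" is read here as strictly more than half of the machines.
    if overloaded * 2 <= len(machine_cpu_loads):
        return []
    return [
        (job["name"], op["name"])
        for job in jobs
        if job["meets_slo"]           # a) topology satisfies its SLO
        for op in job["operators"]
        if not op["congested"]        # b) operator is not congested
    ]


# Hypothetical cluster: 2 of 3 machines are heavily loaded.
jobs = [
    {"name": "job1", "meets_slo": True, "operators": [
        {"name": "splitter", "congested": False},
        {"name": "count", "congested": True},
    ]},
    {"name": "job2", "meets_slo": False, "operators": [
        {"name": "filter", "congested": False},
    ]},
]
candidates = reduction_candidates([0.95, 0.9, 0.3], jobs)
```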

  41. Reversion: Reconfigurations → drop in utility and reduction is not possible. Revert to a past configuration that provided best utility. [Figure: Reversion transition between Not Converged and Converged.]
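
A minimal sketch of reversion as a lookup over past configurations. It assumes the scheduler keeps a history of (configuration, total utility) pairs; that history structure and the names used are hypothetical.

```python
def best_past_configuration(history: list):
    """Return the past configuration with the highest recorded total utility.

    `history` is a list of (configuration, total_utility) pairs.
    Returns None when no past configuration has been recorded.
    """
    if not history:
        return None
    return max(history, key=lambda entry: entry[1])[0]


# Hypothetical history of operator -> executor-count maps.
history = [
    ({"splitter": 1, "count": 2}, 12.0),
    ({"splitter": 2, "count": 2}, 18.5),
    ({"splitter": 3, "count": 2}, 15.0),
]
revert_to = best_past_configuration(history)
```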

  42. Evaluation. Real-world workloads: Yahoo!, Twitter, Web log traces. Experimental setup: 10-40 node Emulab cluster.

  43. Reducing cost and achieving high utilities

  44. Reducing cost and achieving high utilities: 93.5% utility at 40% resources.

  45. Reducing cost and achieving high utilities: 93.5% utility at 40% resources; 100% utility at 60% resources.

  46. Adapting to a Diurnal Pattern

  47. [Figure: plot over Day 1 and Day 2.]

  48. [Figure: plots over Day 1 and Day 2 showing Max. Utility and Reconfigurations.]

  49. Fewer reconfigurations are required once a job has adjusted to max load. [Figure: plots over Day 1 and Day 2 showing Max. Utility and Reconfigurations.]

  50. Can Henge do better than manual configuration?

  51. Can Henge do better than manual configuration? Henge does better in the 15th to 45th percentile, and is comparable later.

  52. Scaling Cluster Size

  53. Scaling Cluster Size: Limited resources entail more reconfigurations to reach max. utility.

  54. More Results. Henge can: handle dynamic workloads, both abrupt (e.g., spikes & natural fluctuations) and gradual (e.g., diurnal patterns); satisfy hybrid SLOs; scale with number of jobs & cluster size; gracefully handle failures.

  55. Summary • Henge allows users to specify performance intents for their jobs • Henge’s goal is to maximize cluster-wide utility • The scheduler performs fine-grained reconfigurations to allow stream processing jobs to meet user-specified intents
