1. Quasar: Resource-Efficient and QoS-Aware Cluster Management
   Christina Delimitrou and Christos Kozyrakis
   Stanford University
   http://mast.stanford.edu
   ASPLOS, March 3rd, 2014

2. Executive Summary
    Problem: low datacenter utilization
      Overprovisioned reservations by users
    Problem: high jitter in application performance
      Interference, HW heterogeneity
    Quasar: resource-efficient cluster management
      Users provide performance goals instead of resource reservations
      Online analysis of resource needs using info from past apps
      Automatic selection of the number and type of resources
      High utilization and low performance jitter

3. Datacenter Underutilization
    A cluster of a few thousand servers at Twitter, managed by Mesos
    Running mostly latency-critical, user-facing apps
    80% of servers at < 20% utilization
    Servers are 65% of TCO

4. Datacenter Underutilization
    Goal: raise utilization without introducing performance jitter
   [Figure: utilization distributions at Twitter and Google¹]
   ¹ L. A. Barroso, U. Hölzle. The Datacenter as a Computer, 2009.

5. Reserved vs. Used Resources
   [Figure: reserved vs. used resources, showing 1.5-2x and 3-5x gaps]
    Twitter: up to 5x CPU and up to 2x memory overprovisioning

6. Reserved vs. Used Resources
    ~20% of jobs under-sized, ~70% of jobs over-sized

7. Rightsizing Applications is Hard

8. Performance Depends on Scale-up
   [Plot: performance vs. cores allocated]

9. Performance Depends on Heterogeneity
   [Plot: performance vs. cores]

10. Performance Depends on Heterogeneity
    [Plots: performance vs. servers (scale-out); performance vs. cores]

11. Performance Depends on Heterogeneity
    [Plots: performance vs. servers; performance vs. cores;
     performance vs. input size (input load)]

12. Performance Depends on Heterogeneity
    [Plots: performance vs. servers; performance vs. cores;
     performance vs. input size; performance vs. cores under interference]
    ...and the tradeoffs shift when software changes, when platforms change, etc.

13. Performance Depends on Heterogeneity
    [Same plots as the previous slide; animation build-up]

14. Rethinking Cluster Management
     Users provide performance goals instead of resource reservations
     Joint allocation and assignment of resources
       The right amount depends on the quality of the available resources
     Monitor and adjust dynamically as needed
     But wait…
       The manager must know the resource/performance tradeoffs

15. Understanding Resource/Performance Tradeoffs
     Combine:
       Small signal from a short run of the new app
       Large signal from previously-run apps (big cluster data)
     Generate:
       Detailed insights for resource management (resource/performance tradeoffs)
       Performance vs. scale-up/out, heterogeneity, …
     Looks like a classification problem

16. Something familiar…
     Collaborative filtering, similar to the Netflix Challenge system
       Predict preferences of new users given preferences of other users
     Singular Value Decomposition (SVD) + PQ reconstruction (SGD); a sketch follows below
     High accuracy, low complexity, relaxed density constraints
    [Diagram: sparse utility matrix (users x movies) -> SVD -> initial
     decomposition -> SGD (PQ) -> reconstructed decomposition -> SVD ->
     final utility matrix]
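To make this step concrete, here is a minimal sketch of PQ reconstruction with SGD. It is illustrative only: the function name, hyperparameters, and random initialization (standing in for the SVD warm start the slide describes) are assumptions, not Quasar's actual implementation.

```python
import numpy as np

def factorize(ratings, k=2, lr=0.01, reg=0.1, epochs=200):
    """Reconstruct a sparse utility matrix R ~= P @ Q.T via SGD.
    `ratings` is a list of observed (row, col, value) entries."""
    n_rows = max(r for r, _, _ in ratings) + 1
    n_cols = max(c for _, c, _ in ratings) + 1
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_rows, k))  # row factors (users/apps)
    Q = rng.normal(scale=0.1, size=(n_cols, k))  # column factors
    for _ in range(epochs):
        for r, c, v in ratings:
            err = v - P[r] @ Q[c]                # error on one observed cell
            P[r] += lr * (err * Q[c] - reg * P[r])
            Q[c] += lr * (err * P[r] - reg * Q[c])
    return P @ Q.T  # dense reconstruction fills in the missing cells

# Toy run: 3 users x 4 movies, five observed ratings.
observed = [(0, 0, 1.0), (0, 2, 0.6), (1, 1, 0.9), (2, 0, 0.4), (2, 3, 0.8)]
print(factorize(observed).round(2))
```

The same factorization is reused below for each of Quasar's four classifications; only the rows and columns of the utility matrix change.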

17. Application Analysis with Classification

    |               | Rows  | Columns | Recommendation |
    | Netflix       | Users | Movies  | Movie ratings  |
    | Heterogeneity |       |         |                |
    | Interference  |       |         |                |
    | Scale-up      |       |         |                |
    | Scale-out     |       |         |                |

     4 parallel classifications
     Lower overheads and similar accuracy compared to exhaustive classification

18. Heterogeneity Classification

    |               | Rows  | Columns   | Recommendation |
    | Netflix       | Users | Movies    | Movie ratings  |
    | Heterogeneity | Apps  | Platforms | Server type    |
    | Interference  |       |           |                |
    | Scale-up      |       |           |                |
    | Scale-out     |       |           |                |

     Profiling on two randomly selected server types
     Predict performance on each server type

19. Interference Classification

    |               | Rows  | Columns                 | Recommendation           |
    | Netflix       | Users | Movies                  | Movie ratings            |
    | Heterogeneity | Apps  | Platforms               | Server type              |
    | Interference  | Apps  | Sources of interference | Interference sensitivity |
    | Scale-up      |       |                         |                          |
    | Scale-out     |       |                         |                          |

     Predict sensitivity to interference
       Interference intensity that leads to >5% performance loss
     Profiling by injecting increasing interference (sketched below)
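The interference-profiling loop lends itself to a short sketch. A hedged version, assuming hypothetical measurement hooks `run_with_interference` and a precomputed `baseline_perf` (neither name comes from the paper):

```python
QOS_LOSS_THRESHOLD = 0.05  # the slide's >5% performance-loss cutoff

def sensitivity(app, source, baseline_perf, run_with_interference):
    """Highest intensity (10-100%) of one interference source (e.g.,
    cache or memory bandwidth) the app tolerates within 5% of baseline."""
    tolerated = 0
    for intensity in range(10, 110, 10):   # inject increasing interference
        perf = run_with_interference(app, source, intensity)
        if (baseline_perf - perf) / baseline_perf > QOS_LOSS_THRESHOLD:
            break                          # threshold crossed; stop ramping
        tolerated = intensity
    return tolerated
```

The measured tolerances become the app's sparse row in the interference utility matrix; classification fills in the sources that were never injected.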

20. Scale-Up Classification

    |               | Rows  | Columns                 | Recommendation           |
    | Netflix       | Users | Movies                  | Movie ratings            |
    | Heterogeneity | Apps  | Platforms               | Server type              |
    | Interference  | Apps  | Sources of interference | Interference sensitivity |
    | Scale-up      | Apps  | Resource vectors        | Resources/node           |
    | Scale-out     |       |                         |                          |

     Predict speedup from scale-up
     Profiling with two allocations (cores & memory)

21. Scale-Out Classification

    |               | Rows  | Columns                 | Recommendation           |
    | Netflix       | Users | Movies                  | Movie ratings            |
    | Heterogeneity | Apps  | Platforms               | Server type              |
    | Interference  | Apps  | Sources of interference | Interference sensitivity |
    | Scale-up      | Apps  | Resource vectors        | Resources/node           |
    | Scale-out     | Apps  | Nodes                   | Number of nodes          |

     Predict speedup from scale-out
     Profiling with two allocations (1 and N>1 nodes)
     A combined sketch of all four classifications follows below
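With all four rows of the table in place, the four parallel classifications can be sketched end to end. The snippet reuses the `factorize` sketch from the slide-16 example; the matrix indices and values are placeholders, and the dense rows contributed by previously-run apps are omitted for brevity:

```python
from concurrent.futures import ThreadPoolExecutor

app = 0  # row index of the new application

# Sparse (row, col, value) signals from the short profiling runs on
# slides 18-21: two server types, injected interference sources, two
# scale-up allocations, and 1 vs. N nodes. Values here are made up.
signals = {
    "heterogeneity": [(app, 1, 0.7), (app, 4, 0.9)],
    "interference":  [(app, 0, 0.5), (app, 3, 0.2)],
    "scale_up":      [(app, 0, 1.0), (app, 2, 1.6)],
    "scale_out":     [(app, 0, 1.0), (app, 3, 2.8)],
}

# The four classifications are independent, so they can run in parallel.
with ThreadPoolExecutor() as pool:
    dense = dict(zip(signals, pool.map(factorize, signals.values())))
# dense["heterogeneity"][app] now estimates performance on every server type.
```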

22. Classification Validation

    Prediction error:

    |                   | Heterogeneity |     | Interference |     | Scale-up |     | Scale-out |     |
    |                   | avg           | max | avg          | max | avg      | max | avg       | max |
    | Single-node       | 4%            | 8%  | 5%           | 10% | 4%       | 9%  | -         | -   |
    | Batch distributed | 4%            | 5%  | 2%           | 6%  | 5%       | 11% | 5%        | 17% |
    | Latency-critical  | 5%            | 6%  | 7%           | 10% | 6%       | 11% | 6%        | 12% |

23. Quasar Overview
    [Diagram: an incoming application with its QoS target]

24. Quasar Overview
    [Same diagram; animation build-up]

25. Quasar Overview
    [Diagram: profiling produces sparse input signals alongside the QoS
     target, e.g. H <a..b...>, I <..cd...>, SU <e..f..>, SO <kl....>]

26. Quasar Overview
    [Diagram: SVD-based classification turns each sparse signal into a
     dense one, e.g. H <a..b...> -> H <abcdefghi>, I <..cd...> -> I <qwertyuio>,
     SU <e..f..> -> SU <esdfghjkl>, SO <kl....> -> SO <kljhgfdsa>]
    The decomposition is the standard SVD, $A = U \Sigma V^{T}$, with
    $U = (u_{ij})_{m \times n}$ the matrix of left singular vectors.

27. Quasar Overview
    [Same diagram; animation build-up of the classification step]

28. Quasar Overview
    [Diagram: the complete pipeline]
    Profiling [10-60 sec] -> sparse input signal -> Classification [20 msec]
    -> dense output signal -> Resource selection [50 msec-2 sec]

29. Greedy Resource Selection
     Goals
       Allocate the least resources needed to meet the QoS target
       Pack together non-interfering applications
     Overview (sketched below)
       Start with the most appropriate server types
       Look for servers with interference below the critical intensity
         Depends on which applications are already running on those servers
       First scale up, then scale out
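A hedged sketch of this greedy loop, using illustrative data structures (server dicts and a `dense` map of classification outputs; none of the names come from the paper, and the real heuristic is richer):

```python
def select(qos_target, servers, dense):
    """Greedily pick the fewest resources that meet the QoS target:
    best server types first, scale up within a node, then scale out."""
    # Prefer the server types the app is predicted to run best on.
    ranked = sorted(servers, key=lambda s: dense["perf_by_type"][s["type"]],
                    reverse=True)
    allocation, expected = [], 0.0
    for s in ranked:
        # Skip servers whose current interference exceeds what the app tolerates.
        if s["pressure"] > dense["tolerated_pressure"]:
            continue
        cores = min(s["free_cores"], dense["best_core_count"])  # scale up first
        if cores == 0:
            continue
        allocation.append((s["name"], cores))
        expected += dense["perf_by_type"][s["type"]] * cores / dense["best_core_count"]
        if expected >= qos_target:  # stop as soon as the target is met
            return allocation
    return allocation  # best effort: scale out across as many nodes as needed

servers = [{"name": "a1", "type": "hi-mem", "free_cores": 8, "pressure": 0.2},
           {"name": "b7", "type": "lo-end", "free_cores": 16, "pressure": 0.6}]
dense = {"perf_by_type": {"hi-mem": 1.0, "lo-end": 0.5},
         "tolerated_pressure": 0.4, "best_core_count": 8}
print(select(1.0, servers, dense))  # -> [('a1', 8)]
```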

30. Quasar Implementation
     6,000 LOC of C++ and Python
     Runs on Linux and OS X
     Supports frameworks in C/C++, Java, and Python
       ~100-600 LOC of framework-specific code per framework
     Side-effect-free profiling using Linux containers with chroot (see the sketch below)
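The side-effect-free profiling bullet can be illustrated with a small sandbox helper. This is an assumption-laden sketch: the paths, timeout, and copy-in strategy are invented, running `chroot` requires root, and the sandbox must contain any libraries the app needs.

```python
import shutil, subprocess, tempfile

def profile_in_sandbox(app_dir, binary, timeout=60):
    """Run one profiling command inside a throwaway chroot so that no
    file-system writes escape the sandbox."""
    sandbox = tempfile.mkdtemp(prefix="quasar-profile-")
    try:
        shutil.copytree(app_dir, f"{sandbox}/app", dirs_exist_ok=True)
        return subprocess.run(["chroot", sandbox, f"/app/{binary}"],
                              capture_output=True, timeout=timeout)
    finally:
        shutil.rmtree(sandbox, ignore_errors=True)  # leave no side effects
```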

31. Evaluation: Cloud Scenario
     Cluster
       200 EC2 servers, 14 different server types
     Workloads: 1,200 apps with 1-second inter-arrival times
       Analytics: Hadoop, Spark, Storm
       Latency-critical: memcached, HotCRP, Cassandra
       Single-threaded: SPEC CPU2006
       Multi-threaded: PARSEC, SPLASH-2, BioParallel, SPECjbb
       Multiprogrammed: 4-app mixes of SPEC CPU2006
     Objectives: high cluster utilization and good app QoS

32. Demo
    [Live demo: core-allocation maps, instance sizes, cluster utilization,
     performance histograms, and progress bars for Quasar vs. a
     reservation-based baseline with least-loaded (LL) placement;
     workloads include memcached, Hadoop, Storm, Cassandra, Spark,
     and single-node apps]

33. [Demo, continued; no slide text]

34. Cloud Scenario Summary
    Quasar achieves:
     88% of applications get >95% of their target performance
     ~10% overprovisioning, as opposed to up to 5x
     Up to 70% cluster utilization at steady state
     23% shorter scenario completion time

35. Conclusions
     Quasar: high utilization and high app performance
       From reservation-centric to performance-centric cluster management
       Uses info from previous apps for accurate, online app analysis
       Joint resource allocation and resource assignment
     See the paper for:
       Utilization analysis of the Twitter cluster
       Detailed validation and sensitivity analysis of the classification
       Further evaluation scenarios and features
         E.g., setting framework parameters for Hadoop

36. Questions?
    [Summary bullets repeated from the previous slide]

37. Questions? Thank you!

38. Cloud Provider: Performance

39. Cloud Provider: Performance
     Most applications violate their QoS constraints

40. Cloud Provider: Performance
     Apps reach only 83% of their performance target when just the
      assignment step (not allocation) is heterogeneity- and interference-aware
