
Rapid and Robust Impact Assessment of Software Changes in Large Internet-based Services



  1. Rapid and Robust Impact Assessment of Software Changes in Large Internet-based Services. Shenglin Zhang, Ying Liu, Dan Pei, Yu Chen, Xianping Qu, Shimin Tao, Zhi Zang. 6/22/18, CoNEXT 2015

  2. Internet-based Services • Search • Shopping • Social • Portal • Video

  3. Software Change: Software Upgrade or Configuration Change • Software upgrade: introduce new features, improve performance, fix bugs

  4. Software Change: Software Upgrade or Configuration Change • Software upgrade: introduce new features, improve performance, fix bugs • Configuration change • e.g., traffic switching for load balancing reasons

  5. Software Change: Software Upgrade or Configuration Change • Software upgrade: introduce new features, improve performance, fix bugs • Configuration change • e.g., traffic switching for load balancing reasons • Occurs frequently • 10K+ per day in Baidu

  6. Impact of Erroneous Software Upgrades. 2012.10, Google • An update to Google’s load balancing software • Poor performance for Gmail for 18 minutes

  7. Impact of Erroneous Software Upgrades. 2012.10, Google • An update to Google’s load balancing software • Poor performance for Gmail for 18 minutes. 2014.11, Microsoft Azure • A performance update to Azure Storage • Reduced capacity across services utilizing Azure Storage

  8. Impact of Erroneous Configuration Changes. 2014.1, Dropbox • Planned maintenance to upgrade the OS on some machines • Dropbox service was down for three hours

  9. Impact of Erroneous Configuration Changes. 2014.1, Dropbox • Planned maintenance to upgrade the OS on some machines • Dropbox service was down for three hours. 2014.6, Facebook • Update to the configuration of the software systems • Facebook failed for 31 minutes

  10. Impact of Erroneous Software Changes • Poor user experience

  11. Impact of Erroneous Software Changes • Poor user experience • A drop in revenue [Figure: a real-world example, the normalized number of successful orders]

  12. Manual Software Change Impact Assessment: Select a subset of KPIs that may be impacted

  13. Manual Software Change Impact Assessment: Select a subset of KPIs that may be impacted → Inspect KPI changes

  14. Manual Software Change Impact Assessment: Select a subset of KPIs that may be impacted → Inspect KPI changes → Decide whether to roll back

  15. KPI (Key Performance Indicator) in Software Change • KPIs of servers • CPU utilization • Memory utilization • NIC throughput • …

  16. KPI (Key Performance Indicator) in Software Change • KPIs of servers • CPU utilization • Memory utilization • NIC throughput • … • KPIs of modules/processes • Web page view count • Web page view delay • …

  17. KPI (Key Performance Indicator) in Software Change • KPIs of servers • CPU utilization • Memory utilization • NIC throughput • … • KPIs of modules/processes • Web page view count • Web page view delay • … • Up to hundreds of KPIs for a single software change

  18. Definition of KPI Change: Level Shift or Ramp up/down • KPI change • Indicative of performance increase/degradation • Hard to simulate in testbeds • Not reproducible

  19. Manual Software Change Impact Assessment: Select a subset of KPIs that may be impacted → Inspect KPI changes → Decide whether to roll back • Labor-intensive • Prone to error • Not scalable

  20. Design Goal: a Software Change Impact Assessment System takes the place of manual inspection of KPI changes in the flow (Select a subset of KPIs that may be impacted → Inspect KPI changes → Decide whether to roll back) • Automatic • Scalable • Robust to various software changes and KPIs

  21. Outline • Background and Motivation • Challenges • Key Ideas • Results • Conclusion

  22. Challenge 1: Short Detection Delay Requirement Against Robustness • Poor user experience • A drop in revenue [Figure: a real-world example, the number of successful orders (normalized)]

  23. Challenge 1: Short Detection Delay Requirement Against Robustness • Poor user experience • A drop in revenue [Figure: a real-world example, the number of successful orders (normalized); annotations: spike, level shift]

  24. Challenge 1: Short Detection Delay Requirement Against Robustness • Poor user experience • A drop in revenue • Requirement: detect KPI changes rapidly and accurately [Figure: a real-world example, the number of successful orders (normalized)]
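The tension between delay and robustness on slides 22 to 24 can be illustrated with a small sketch. This is not the detector used in the talk; `is_level_shift`, its `window` and `k` parameters, and the sample data are all hypothetical. It compares the median before and after a candidate change point, so a one-sample spike is ignored while a persistent level shift is flagged. The price is that `window` samples must arrive after the change before a decision can be made, which is exactly the detection delay the slide wants to keep short.

```python
from statistics import median

def is_level_shift(kpi, change_idx, window=10, k=3.0):
    """Hypothetical helper: True if the KPI level moved persistently at change_idx.

    Compares the median of `window` samples after change_idx with the median
    of `window` samples before it, in units of the baseline's median absolute
    deviation (MAD). Medians are barely moved by a single spike.
    """
    before = kpi[max(0, change_idx - window):change_idx]
    after = kpi[change_idx:change_idx + window]
    base = median(before)
    # MAD of the baseline; fall back to a tiny value if the baseline is flat.
    mad = median(abs(x - base) for x in before) or 1e-9
    return abs(median(after) - base) > k * mad

# Made-up normalized order counts: a transient spike versus a persistent drop.
baseline = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100]
with_spike = baseline + [100, 30, 100, 101, 99, 100, 100, 101, 99, 100]
with_shift = baseline + [60, 61, 59, 60, 62, 58, 60, 61, 59, 60]

print(is_level_shift(with_spike, 10))  # spike only
print(is_level_shift(with_shift, 10))  # level shift
```

Shrinking `window` lowers the delay but makes the medians easier to fool, which is the trade-off the challenge title names.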

  25. Challenge 2: Large Number of KPIs

  26. Challenge 2: Large Number of KPIs • 100+ Internet-based services • 20+ Internet-based services have 100+ million users • 10k+ modules • 500+ thousand servers

  27. Challenge 2: Large Number of KPIs • Monitored by one operations team

  28. Challenge 2: Large Number of KPIs • 10k+ software changes per day • Monitored by one operations team

  29. Challenge 2: Large Number of KPIs • 10k+ software changes per day • 100+ KPIs in a software change • Monitored by one operations team

  30. Challenge 2: Large Number of KPIs • 10k+ software changes per day • 100+ KPIs in a software change • Monitored by one operations team • Millions of KPIs should be monitored

  31. Challenge 2: Large Number of KPIs • 10k+ software changes per day • 100+ KPIs in a software change • Monitored by one operations team • Millions of KPIs should be monitored • Requirement: detect KPI changes with low computational cost
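The "millions of KPIs" figure follows from the two lower bounds on the slides. The arithmetic below uses exactly those bounds; the single-worker time budget is an added illustration, not a number from the talk, and the variable names are hypothetical.

```python
# Lower bounds quoted on the slides.
changes_per_day = 10_000      # "10k+ software changes per day"
kpis_per_change = 100         # "100+ KPIs in a software change"

# KPI series that need to be checked each day, at minimum.
kpi_series_per_day = changes_per_day * kpis_per_change

# Illustrative assumption: one worker processing series back to back all day.
seconds_per_day = 24 * 60 * 60
budget_seconds = seconds_per_day / kpi_series_per_day

print(kpi_series_per_day)  # at least one million series per day
print(budget_seconds)      # seconds available per series on that one worker
```

Under these assumptions each series gets well under a tenth of a second, which is why a detector with a high per-series cost does not scale without parallelism or cheaper computation.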

  32. Challenge 3: Diverse Types of Data • Diverse types of KPI data [Figure panels: variable, stationary, seasonal; example KPIs: page view count, NIC throughput, memory utilization]

  33. Challenge 3: Diverse Types of Data • Diverse types of KPI data • Requirement: robust to various KPIs [Figure panels: variable, stationary, seasonal; example KPIs: page view count, NIC throughput, memory utilization]

  34. Challenge 4: KPI Changes May Be Caused by Other Factors • Network breakdowns • Malicious attacks • Seasonality

  35. Challenge 4: KPI Changes May Be Caused by Other Factors • Network breakdowns • Malicious attacks • Seasonality • Requirement: eliminate KPI changes induced by other factors
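The slides up to this point do not say how changes induced by other factors are eliminated. One standard way to separate the effect of a change from a shared external effect, assumed here purely for illustration, is a difference-in-differences comparison between servers that received the change and comparable servers that did not. The function name `did_estimate` and the sample numbers are hypothetical.

```python
from statistics import mean

def did_estimate(treated_before, treated_after, control_before, control_after):
    """Difference-in-differences: the change seen on treated servers minus
    the change seen on control servers over the same period.

    A factor that hits both groups equally (for example a seasonal rise)
    cancels out; what remains is attributed to the software change.
    """
    treated_delta = mean(treated_after) - mean(treated_before)
    control_delta = mean(control_after) - mean(control_before)
    return treated_delta - control_delta

# Made-up data: both groups rise by 20 because of seasonality alone.
seasonal_only = did_estimate([100, 102], [120, 122], [100, 102], [120, 122])

# Made-up data: control rises by 20, treated servers fall by 10.
with_regression = did_estimate([100, 102], [90, 92], [100, 102], [120, 122])

print(seasonal_only, with_regression)
```

The sketch only works when a comparable control group exists, which is itself an assumption about how the change is rolled out.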

  36. Outline • Background and Motivation • Challenges • Key Ideas • Results • Conclusion

  37. Design Overview • Step 1 – Identify the impact set [Diagram: software change in module A → Step 1 → …]

  38. Design Overview • Step 1 – Identify the impact set [Diagram: software change in module A → Step 1 → KPIs in the impact set → …]

  39. Identify the Impact Set: Automatically Retrieve the Relevant KPIs

  40. Identify the Impact Set: Automatically Retrieve the Relevant KPIs • Input from operators • Modules related to module A: modules B, C, D • Servers/processes where the software change is deployed
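A minimal sketch of how the operator input on slide 40 could be turned into an impact set. The data layout, the function `impact_set`, and the server names are hypothetical; the KPI names are the examples from slides 15 to 17. The actual system's retrieval logic is not described on the slides.

```python
def impact_set(changed_module, related_modules, deployed_servers,
               module_kpis, server_kpis):
    """Return (entity, KPI name) pairs to monitor for one software change.

    related_modules: operator-supplied map from a module to its related modules.
    deployed_servers: servers where the change is deployed.
    """
    modules = [changed_module, *related_modules.get(changed_module, [])]
    pairs = [(m, kpi) for m in modules for kpi in module_kpis]
    pairs += [(s, kpi) for s in deployed_servers for kpi in server_kpis]
    return pairs

# Operator input in the shape shown on the slide: A is related to B, C, D.
related = {"A": ["B", "C", "D"]}
servers = ["server-1", "server-2"]  # hypothetical host names

kpis = impact_set(
    "A", related, servers,
    module_kpis=["page view count", "page view delay"],
    server_kpis=["CPU utilization", "memory utilization", "NIC throughput"],
)
print(len(kpis))
for entity, name in kpis[:3]:
    print(entity, name)
```

Even this toy input yields more than a dozen series for one change, which shows how quickly the per-change KPI count reaches the hundreds mentioned earlier.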

  41. Design Overview • Step 1 – Identify the impact set • Step 2 – Detect behavior changes in KPIs [Diagram: software change in module A → Step 1 → KPIs in the impact set → Step 2]

  42. Design Overview • Step 1 – Identify the impact set • Step 2 – Detect behavior changes in KPIs [Diagram: software change in module A → Step 1 → KPIs in the impact set → Step 2 → KPIs with behavior changes]

  43. Design Overview • Step 1 – Identify the impact set • Step 2 – Detect behavior changes in KPIs [Diagram: software change in module A → Step 1 → KPIs in the impact set → Step 2 → KPIs with behavior changes; challenges annotated at Step 2: short detection delay requirement against robustness, diverse types of data, large number of KPIs]

  44. Improved Singular Spectrum Transform (SST) • Advantages: accurate, short detection delay (addresses: short detection delay requirement against robustness)

  45. Improved Singular Spectrum Transform (SST) • Advantages: accurate, short detection delay • Drawbacks: accuracy degrades with noisy baseline, high computational cost (T. Idé and K. Tsuda, SDM 2007)

  46. Improved Singular Spectrum Transform (SST) • Advantages: accurate, short detection delay • Drawbacks: accuracy degrades with noisy baseline, high computational cost • Improve robustness: utilize more information in the testing space (addresses: diverse types of data)

  47. Improved Singular Spectrum Transform (SST) • Advantages: accurate, short detection delay • Drawbacks: accuracy degrades with noisy baseline, high computational cost • Improve robustness: utilize more information in the testing space (addresses: diverse types of data) • Reduce computational cost: matrix compression, implicit inner product calculation (addresses: large number of KPIs)
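For orientation, here is a minimal sketch of a textbook SST change score using NumPy. It is the plain baseline, not the improved variant from the talk: the three improvements listed on slide 47 are not implemented, and the function names, window sizes `w` and `n`, and rank `r` are illustrative assumptions. The score measures how far the dominant direction of the data after time `t` lies outside the subspace spanned by the data before `t`.

```python
import numpy as np

def trajectory_matrix(x, start, w, n):
    """w-by-n Hankel matrix whose j-th column is the window x[start+j : start+j+w]."""
    return np.column_stack([x[start + j:start + j + w] for j in range(n)])

def sst_score(x, t, w=16, n=16, r=2):
    """Baseline SST change score in [0, 1] at index t (near 1 suggests a change)."""
    x = np.asarray(x, dtype=float)
    span = w + n - 1  # number of samples one trajectory matrix covers
    if t - span < 0 or t + span > len(x):
        raise ValueError("not enough samples around t")
    past = trajectory_matrix(x, t - span, w, n)  # built from x[t-span : t]
    future = trajectory_matrix(x, t, w, n)       # built from x[t : t+span]
    # Top-r left singular vectors describe the normal behavior before t.
    u_past = np.linalg.svd(past, full_matrices=False)[0][:, :r]
    # Dominant left singular vector describes the behavior after t.
    beta = np.linalg.svd(future, full_matrices=False)[0][:, 0]
    score = 1.0 - float(np.sum((u_past.T @ beta) ** 2))
    return min(1.0, max(0.0, score))

# Synthetic KPI: a noise-free seasonal signal, with and without a level shift.
steps = np.arange(128)
clean = np.sin(2 * np.pi * steps / 8)
shifted = clean.copy()
shifted[64:] += 5.0

print(sst_score(clean, 64))    # no change at index 64
print(sst_score(shifted, 64))  # level shift at index 64
```

Each score needs two singular value decompositions and `w + n - 1` samples after `t`, which makes both drawbacks on slide 45 concrete: the cost multiplies across millions of KPI series, and noise in the baseline perturbs the past subspace that every score is measured against.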
