Rapid and Robust Impact Assessment of Software Changes in Large Internet-based Services
Shenglin Zhang, Ying Liu, Dan Pei, Yu Chen, Xianping Qu, Shimin Tao, Zhi Zang
6/22/18, CoNEXT 2015
Internet-based Services • Search • Shopping • Social • Portal • Video
Software Change: Software Upgrade or Configuration Change • Software upgrade: introduce new features, improve performance, fix bugs • Configuration change, e.g., traffic switching for load balancing reasons • Occurs frequently: 10K+ per day in Baidu
Impact of Erroneous Software Upgrades • 2012.10, Google: an update to Google’s load balancing software; poor performance for Gmail for 18 minutes • 2014.11, Microsoft Azure: a performance update to Azure Storage; reduced capacity across services utilizing Azure Storage
Impact of Erroneous Configuration Changes • 2014.1, Dropbox: planned maintenance to upgrade the OS on some machines; the Dropbox service was down for three hours • 2014.6, Facebook: an update to the configuration of the software systems; Facebook failed for 31 minutes
Impact of Erroneous Software Changes • Poor user experience • A drop in revenue [Figure: a real-world example showing the normalized number of successful orders]
Manual Software Change Impact Assessment • Select a subset of KPIs that may be impacted • Inspect KPI changes • Decide whether to roll back
KPI (Key Performance Indicator) in Software Change • KPIs of servers: CPU utilization, memory utilization, NIC throughput, … • KPIs of modules/processes: web page view count, web page view delay, … • Up to hundreds of KPIs for a single software change
Definition of KPI Change: Level Shift or Ramp Up/Down • A KPI change is indicative of a performance increase or degradation • Hard to simulate in testbeds • Not reproducible
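The two shapes named in this definition can be illustrated with synthetic series. The following is a minimal sketch only; the function names and parameters are hypothetical and do not come from the slides.

```python
def level_shift(n, change_at, before, after):
    """Series of n points that jumps from `before` to `after` at index `change_at`."""
    return [before if i < change_at else after for i in range(n)]


def ramp(n, start, end, before, after):
    """Series of n points that moves linearly from `before` (index `start`)
    to `after` (index `end`) and stays there."""
    series = []
    for i in range(n):
        if i <= start:
            series.append(before)
        elif i >= end:
            series.append(after)
        else:
            fraction = (i - start) / (end - start)
            series.append(before + (after - before) * fraction)
    return series


# A sudden jump at index 3, and a gradual climb between indices 1 and 5.
shift_example = level_shift(6, 3, 1.0, 2.0)
ramp_example = ramp(6, 1, 5, 0.0, 4.0)
print(shift_example)
print(ramp_example)
```

A level shift changes the KPI's baseline in a single step, while a ramp reaches the new baseline over several samples; in both cases the new level persists.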
Manual Software Change Impact Assessment (select KPIs that may be impacted, inspect KPI changes, decide whether to roll back) • Labor-intensive • Prone to error • Not scalable
Design Goal • Replace manual inspection of KPI changes with a software change impact assessment system (select a subset of KPIs that may be impacted → assess KPI changes → decide whether to roll back) • Automatic • Scalable • Robust to various software changes and KPIs
Outline • Background and Motivation • Challenges • Key Ideas • Results • Conclusion
Challenge 1: Short Detection Delay Requirement Against Robustness • Poor user experience • A drop in revenue [Figure: a real-world example showing the number of successful orders (normalized), with a spike and a level shift marked] • Detect KPI changes rapidly and accurately
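The tension between detection delay and robustness can be illustrated with a simple median comparison. This is a hypothetical sketch, not the detection method used in the talk: `is_level_shift`, `window`, and `threshold` are made-up names, and the data is synthetic.

```python
from statistics import median


def is_level_shift(series, t, window=5, threshold=0.3):
    """Compare the median of the `window` points before index t with the
    median of the `window` points starting at t."""
    before = series[t - window:t]
    after = series[t:t + window]
    if t - window < 0 or len(after) < window:
        raise ValueError("not enough data around t")
    return abs(median(after) - median(before)) > threshold


# Synthetic normalized KPI: a one-sample spike versus a persistent level shift.
spike = [1.0] * 10
spike[5] = 0.2
shift = [1.0] * 5 + [0.5] * 5

spike_long = is_level_shift(spike, 5, window=5)   # robust: the spike is ignored
shift_long = is_level_shift(shift, 5, window=5)   # the level shift is flagged
spike_short = is_level_shift(spike, 5, window=1)  # fast but fooled by the spike
print(spike_long, shift_long, spike_short)
```

A longer window ignores the spike but must wait for `window` samples after the change; a window of one sample reacts immediately but raises a false alarm on the spike.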
Challenge 2: Large Number of KPIs • 100+ Internet-based services • 20+ Internet-based services have 100+ million users • 10k+ modules • 500+ thousand servers • Monitored by one operations team • 10k+ software changes per day • 100+ KPIs in a software change • Millions of KPIs should be monitored • Detect KPI changes with low computational cost
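The "millions of KPIs" figure follows from multiplying the two lower bounds on the slide. The sketch below spells out that arithmetic; the per-series time budget assumes a single sequential worker, which is an illustrative assumption and not a claim about the real deployment.

```python
# Figures from the slide, taken at their stated lower bounds.
changes_per_day = 10_000   # "10k+ software changes per day"
kpis_per_change = 100      # "100+ KPIs in a software change"

kpi_series_per_day = changes_per_day * kpis_per_change

# Assumption for illustration only: one worker handles every series in turn.
seconds_per_day = 24 * 60 * 60
budget_per_series = seconds_per_day / kpi_series_per_day

print(kpi_series_per_day)           # KPI series to assess per day
print(round(budget_per_series, 4))  # seconds available per series
```

Under that assumption, each KPI series gets well under a tenth of a second, which is why the per-KPI detection cost matters.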
Challenge 3: Diverse Types of Data • Diverse types of KPI data: variable, stationary, seasonal [Figure: example KPIs, namely page view count, NIC throughput, and memory utilization] • Robust to various KPIs
Challenge 4: KPI Changes May Be Caused by Other Factors • Seasonality • Network breakdowns • Malicious attacks • Eliminate KPI changes induced by other factors
Design Overview • Step 1: Identify the impact set (software change in module A → KPIs in the impact set)
Identify the Impact Set: Automatically Retrieve the Relevant KPIs • Input from operators • Modules related to module A: modules B, C, D • Servers/processes where the software change is deployed
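One way to picture this step is a lookup that expands operator-supplied inputs into a set of (entity, KPI) pairs. The sketch below is hypothetical: the function, the data layout, and the server names are assumptions made for illustration, and it assumes the changed module's own KPIs belong to the impact set.

```python
def impact_set(changed_module, related_modules, deployed_servers,
               server_kpis, module_kpis):
    """Return (entity, kpi_name) pairs to monitor for one software change."""
    kpis = set()
    # KPIs of every server where the change is deployed.
    for server in deployed_servers:
        for kpi in server_kpis:
            kpis.add((server, kpi))
    # KPIs of the changed module and of the modules operators marked as related.
    for module in [changed_module, *related_modules.get(changed_module, [])]:
        for kpi in module_kpis:
            kpis.add((module, kpi))
    return kpis


# Operator input (hypothetical values mirroring the slide's module A example).
related = {"A": ["B", "C", "D"]}
servers = ["server-1", "server-2"]
server_kpis = ["cpu_utilization", "memory_utilization", "nic_throughput"]
module_kpis = ["page_view_count", "page_view_delay"]

result = impact_set("A", related, servers, server_kpis, module_kpis)
print(len(result))
```

With two servers and four modules, this yields 2 × 3 server KPIs plus 4 × 2 module KPIs.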
Design Overview • Step 1: Identify the impact set (software change in module A → KPIs in the impact set) • Step 2: Detect behavior changes in KPIs (KPIs in the impact set → KPIs with behavior changes) • Challenges for Step 2: short detection delay requirement against robustness, diverse types of data, large number of KPIs
Improved Singular Spectrum Transform (SST) • SST (T. Idé and K. Tsuda, SDM 2007) • Advantages: accurate, short detection delay (addresses the short detection delay requirement against robustness) • Drawbacks: accuracy degrades with a noisy baseline, high computational cost • Improve robustness (diverse types of data): utilize more information in the testing space • Reduce computational cost (large number of KPIs): matrix compression, implicit inner product calculation
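For orientation, a basic SVD-based SST change score can be sketched as follows. This is a minimal illustration of the plain transform under stated assumptions, not the improved variant described on the slide: it uses full SVDs and omits matrix compression and the implicit inner product. The function names, the window length `w`, the column count `n`, the subspace rank `r`, and the synthetic KPI are all assumptions.

```python
import numpy as np


def trajectory_matrix(x, end, w, n):
    """Hankel matrix whose n columns are the length-w windows of x ending at or before `end`."""
    return np.column_stack([x[end - w - j:end - j] for j in range(n - 1, -1, -1)])


def sst_score(x, t, w=8, n=8, r=1):
    """Basic SST change score at index t, in [0, 1]; larger means a stronger change."""
    span = w + n - 1                              # samples covered by each matrix
    past = trajectory_matrix(x, t, w, n)          # covers x[t - span : t]
    test = trajectory_matrix(x, t + span, w, n)   # covers x[t : t + span]
    u, _, _ = np.linalg.svd(past, full_matrices=False)
    q, _, _ = np.linalg.svd(test, full_matrices=False)
    beta = q[:, 0]                                # dominant direction of the test data
    overlap = np.sum((u[:, :r].T @ beta) ** 2)    # how much of beta the past subspace explains
    return float(min(1.0, max(0.0, 1.0 - overlap)))


# Synthetic KPI: a small ripple around 10, then a level shift to about 30 at index 60.
idx = np.arange(120)
kpi = 10.0 + 0.1 * np.sin(0.7 * idx)
kpi[60:] += 20.0

span = 8 + 8 - 1
scores = {t: sst_score(kpi, t) for t in range(span, len(kpi) - span + 1)}
peak_t = max(scores, key=scores.get)
print(peak_t, scores[peak_t], scores[30])
```

The score stays near zero while the past and test windows both see the same behavior, and rises when either window straddles the shift. Each score needs two SVDs per time step, which gives a sense of why computational cost becomes a concern at the scale of millions of KPIs.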