
Toward Highly Available, Intelligent Cloud and ML Systems



  1. Toward Highly Available, Intelligent Cloud and ML Systems. Chuanxiong Guo, Bytedance. NetAI 2018

  2. Outline
     • Background: System/networking meets ML
     • Deepview: ML for availability improvement of cloud systems
     • RDMA for scalable ML training acceleration
     • Summary

  3. Two Different Approaches
     • Network/systems are designed by following principles: interfaces are explicitly defined, protocols are explicitly coded, and packets can be traced and explained
     • Models in machine learning are learned from data without explicit programming; deep learning made breakthroughs in computer vision and speech
     (Figure: a client/server network stack with sockets, TCP, IP, NICs, and packets, side by side with an ML pipeline of data, labeling, dataset, training, model, and inference)

  4. Networking Meets Machine Learning
     • ML helps to improve system/network availability
     • Networking/systems help to scale and accelerate ML systems
     (Figure: ML and networking/systems feeding into each other)

  5. Software Rules the Clouds
     (Figure: software systems built on a data repo and a code repo, supported by deployment/provisioning, config management, resource management, and monitoring)

  6. Incidents, Incidents, Incidents

  7. System Availability is Plagued by Incidents
     A = Σ T_up / (Σ T_up + Σ T_down)
     • 99.999% availability: about 5 min of downtime per year
     • 99.99% availability: about 53 min of downtime per year
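
The downtime figures follow directly from the availability definition above; here is a minimal check of the arithmetic (the variable names are ours, not from the slides):

```python
# Downtime per year implied by an availability target A = T_up / (T_up + T_down).
MINUTES_PER_YEAR = 365.25 * 24 * 60

for availability in (0.99999, 0.9999):
    downtime_min = (1 - availability) * MINUTES_PER_YEAR
    print(f"{availability:.5%} available -> {downtime_min:.1f} min downtime/year")

# 99.99900% available -> 5.3 min downtime/year
# 99.99000% available -> 52.6 min downtime/year
```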

  8. Incident Handling Practice
     (Figure: the software-system diagram from slide 5, with data repo, code repo, deployment/provisioning, config management, resource management, and monitoring, plus a "lessons learned" feedback loop into the software systems)

  9. (Figure: availability fundamentals mapped onto the Dev/OPS loop: design and implement on the Dev side; deployment, provisioning, monitoring, resource management, and automation on the OPS side. Covers incident prevention, incident detection, incident localization, and incident resolution/mitigation, including gray failures, with systems such as ByteBrain, Panorama, Deepview, Netbouncer, and Pingmesh.)

  10. Deepview for Virtual Disk Failure Diagnosis -- A case where ML helps system availability

  11. VM Availability
      • IaaS is one of the largest cloud services today
      • High VM availability is a key performance metric
      • Yet, achieving 99.999% VM uptime remains a challenge
      1. What is the VM availability bottleneck?
      2. How to eliminate it?

  12. IaaS Architecture
      • Compute and storage clusters with a Clos-like network
      • Compute-storage separation
      • VMs and Virtual Hard Disks (VHDs) provisioned from different clusters
      • Hypervisor transparently redirects disk access to remote storage
      • Keeps data available during a localized power failure to a rack
      (Figure: subsystems inside a datacenter: a Clos network connecting compute clusters, with hosts running VMs on a hypervisor, and storage clusters)

  13. A New Type of Failure: VHD Failures
      • Infrastructure failures can disrupt VHD access
      • The hypervisor can retry, but not indefinitely
      • The hypervisor will crash the VM to surface the failure to the customer
      • This allows customers to take action to keep their app-level SLAs
      How much do VHD failures impact VM availability?
      (Figure: the datacenter diagram from slide 12: Clos network, hosts with VMs and a hypervisor, compute cluster, storage cluster)

  14. Availability Bottleneck
      • VHD failure localization is the bottleneck
      • 52% of unplanned VM downtime
      • Takes tens of minutes to hours to localize
      • This talk: quick and accurate failure localization
      (Chart: breakdown of unplanned VM downtime in a year: VHD failure 52%, SW failure 41%, HW failure 6%, unknown 1%)

  15. Failure Triage was Slow and Inaccurate
      • SREs from each team check their subsystem for anomalies that match the incident
        • e.g. compute host heartbeats, storage perf counters, network link discards
      • Incidents get ping-ponged among different teams due to false positives
        • Inaccurate diagnosis and delayed mitigation
      • Gray failures in network and storage are hard to catch
        • Troubled but not totally down, e.g. performance issues or software bugs
        • Fail only a subset of VHD requests
        • Can take hours to localize

  16. Deepview Approach: Global View
      • Isolate failures by examining interactions between subsystems, instead of alerting every SRE team to check whether its subsystem is at fault
      • Bipartite model (see the sketch below)
        • Compute clusters (left) vs. storage clusters (right)
        • VMs are provisioned from a compute/storage cluster pair
        • Edge weight = VHD failure rate
      (Figure: bipartite model of compute clusters C1-C4 and storage clusters S1-S3, and the equivalent grid view)
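
To make the bipartite/grid view concrete, here is a minimal sketch of how per-(compute, storage) VHD failure rates could be assembled into the grid; the record layout, sample numbers, and function name are illustrative assumptions, not the Deepview implementation:

```python
from collections import defaultdict

# Each record: (compute_cluster, storage_cluster, vhd_failures, total_vms) for one time window.
# Hypothetical sample data; in Deepview these come from VM crash events and placement info.
records = [
    ("C1", "S1", 0, 200), ("C1", "S2", 1, 150),
    ("C2", "S1", 12, 180), ("C2", "S2", 14, 210),  # the C2 row is hot -> suspect compute cluster C2
    ("C3", "S1", 0, 90),  ("C3", "S2", 0, 120),
]

def grid_view(records):
    """Edge weight of the bipartite graph = VHD failure rate per (compute, storage) pair."""
    grid = defaultdict(float)
    for c, s, failed, total in records:
        grid[(c, s)] = failed / total if total else 0.0
    return grid

for (c, s), rate in sorted(grid_view(records).items()):
    print(f"{c} -> {s}: {rate:.1%}")
```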

  17. Our Approach: Global View Example
      (Figure: two grid views over compute clusters C1-C4 and storage clusters S1-S3. Storage cluster failure example: S1 suffers a gray failure and the S1 column lights up. Compute cluster failure example: C2 fails and the C2 row lights up.)

  18. Challenges
      Remaining challenges:
      1. Need to pinpoint network failures: generalized model that includes network devices
      2. Need to handle gray failures: Lasso regression / hypothesis-testing algorithm
      3. Need to be near-real-time: streaming data pipeline
      Summary of our goal: a system that localizes VHD failures to underlying failures in the compute, storage, or network subsystems within a time budget of 15 minutes (budget set by the production team to meet availability goals)

  19. Deepview Model: Include the Network
      • Need to handle multipath and ECMP
      • Simplify the Clos network to a tree by aggregating network devices
      • Can model at the granularity of clusters or ToRs
      (Figure: Clos network between a compute cluster and a storage cluster, collapsed into an aggregated tree)

  20. Deepview Model: Estimate Component Health
      • Assume independent failures:
        Prob(path i is healthy) = Π_{j ∈ path(i)} Prob(component j is healthy)
        1 - e_i/n_i = Π_{j ∈ path(i)} p_j
        where e_i = number of VMs crashed on path i and n_i = number of VMs on path i
      • Taking logs gives a system of linear equations:
        log(1 - e_i/n_i) = Σ_{j ∈ path(i)} log p_j
        y_i = Σ_{j=1}^{N} x_{ij} β_j + ε_i
        with y_i = log(1 - e_i/n_i) (observable), β_j = log p_j (unknown), x_{ij} = 1 if component j is on path i (topology), and ε_i = measurement noise
      • β_j = 0: component j looks healthy, clear it; β_j ≪ 0: may blame it (see the sketch below)
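
The linear system on this slide can be written down directly. The sketch below builds the observation vector y and topology matrix X from per-path VM counts under the slide's assumptions (independent failures, one equation per compute/storage path); the sample paths and numbers are made up for illustration:

```python
import numpy as np

# Hypothetical paths: (components on the path, VMs on the path, VMs that crashed).
components = ["C1", "C2", "Net", "S1", "S2"]
paths = [
    (["C1", "Net", "S1"], 200, 1),
    (["C1", "Net", "S2"], 150, 0),
    (["C2", "Net", "S1"], 180, 30),   # failures concentrate on paths through C2
    (["C2", "Net", "S2"], 210, 35),
]

idx = {c: j for j, c in enumerate(components)}
X = np.zeros((len(paths), len(components)))   # x_ij = 1 if component j is on path i
y = np.zeros(len(paths))                      # y_i = log(1 - e_i / n_i)

for i, (comps, n_i, e_i) in enumerate(paths):
    for c in comps:
        X[i, idx[c]] = 1.0
    y[i] = np.log(1.0 - e_i / n_i)

# Each beta_j = log p_j; beta_j near 0 means component j looks healthy, beta_j << 0 looks faulty.
```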

  21. Deepview Algorithm: Prefer Simpler Explanation via Lasso
      y_i = Σ_{j=1}^{N} x_{ij} β_j + ε_i
      Example (compute clusters C1, C2 and storage clusters S1, S2 connected by one network):
        y_1 = β_C1 + β_net + β_S1 + ε_1
        y_2 = β_C1 + β_net + β_S2 + ε_2
        y_3 = β_C2 + β_net + β_S1 + ε_3
        y_4 = β_C2 + β_net + β_S2 + ε_4
      • Potentially #unknowns > #equations, so traditional least-squares regression would fail
      • But multiple simultaneous failures are rare; how to encode this domain knowledge mathematically?
      • Equivalent to preferring most β_j to be zero (sparsity)
      • Lasso objective: β̂ = argmin_{β ∈ R^N, β ≤ 0} ||y - Xβ||² + λ||β||₁
      • Lasso regression can get sparse solutions efficiently
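
One way to fit the non-positive Lasso is to substitute β = -γ with γ ≥ 0, so an off-the-shelf solver with a positivity option can be used. The sketch below does this with scikit-learn on the slide's 4-equation example; note that scikit-learn scales the squared-error term by 1/(2n), so its alpha is not numerically identical to the slide's λ, and this is only an illustration, not the production solver:

```python
import numpy as np
from sklearn.linear_model import Lasso

def fit_nonpositive_lasso(X, y, lam=0.01):
    """Sparse estimate of beta with beta <= 0: fit gamma >= 0 on (X, -y), return beta = -gamma."""
    model = Lasso(alpha=lam, positive=True, fit_intercept=False)
    model.fit(X, -y)      # minimizes (1/2n)||(-y) - X*gamma||^2 + lam*||gamma||_1 over gamma >= 0
    return -model.coef_

# The slide's example: paths (C1,net,S1), (C1,net,S2), (C2,net,S1), (C2,net,S2).
components = ["C1", "C2", "net", "S1", "S2"]
X = np.array([[1, 0, 1, 1, 0],
              [1, 0, 1, 0, 1],
              [0, 1, 1, 1, 0],
              [0, 1, 1, 0, 1]], dtype=float)
# Observed y_i = log(1 - e_i/n_i); paths through C2 see elevated VHD failures.
y = np.array([-0.002, -0.001, -0.180, -0.175])

for name, b in zip(components, fit_nonpositive_lasso(X, y)):
    print(f"{name}: beta = {b:+.4f}")   # C2 comes out clearly negative, the rest (near) zero
```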

  22. Deepview Algorithm: Principled Blame Decision via Hypothesis Testing
      • Need a binary decision (flag/clear) for each component
      • Ad-hoc thresholds do not work reliably
      • Can we make a principled decision? If the estimated failure probability is worse than average, it is likely a real failure
      • Automate this empirical decision criterion with a hypothesis test:
        H_0^j: β_j = β̄   vs.   H_A^j: β_j < β̄
      • Rejecting H_0^j means blame component j; otherwise, clear component j
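
The slide does not spell out the exact test statistic, so the sketch below uses a simplified leave-one-out z-score against the average of the other coefficients purely to illustrate the flag/clear decision; the threshold and sample data are assumptions:

```python
import numpy as np

def flag_components(beta, names, z_threshold=2.0):
    """Blame component j if its beta_j is significantly below the average of the other betas.

    A simplified, leave-one-out stand-in for the slide's one-sided test
    H0: beta_j = mean(beta) vs. HA: beta_j < mean(beta); not the exact Deepview statistic.
    """
    beta = np.asarray(beta, dtype=float)
    decisions = {}
    for j, name in enumerate(names):
        others = np.delete(beta, j)
        z = (beta[j] - others.mean()) / (others.std(ddof=1) + 1e-12)
        decisions[name] = "blame" if z < -z_threshold else "clear"
    return decisions

# Continuing the earlier sketch: one clearly negative coefficient gets blamed.
print(flag_components([-0.001, -0.17, -0.002, 0.0, -0.003],
                      ["C1", "C2", "Net", "S1", "S2"]))
# {'C1': 'clear', 'C2': 'blame', 'Net': 'clear', 'S1': 'clear', 'S2': 'clear'}
```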

  23. Deepview System Architecture: Near-Real-Time Data Pipeline
      (Figure: a near-real-time scheduler drives the pipeline on the Kusto engine. Real-time inputs: VHD failure events and VM info; non-real-time inputs: storage accounts, VMs per path, network topology. Stages: raw data -> ingestion -> sliding window of input -> run algorithm -> actions, i.e. alerts and visualization.)
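
As a rough illustration of the sliding-window control loop implied by the pipeline (the real system runs on a streaming engine over Kusto, not an in-process loop), with the window length, period, and function names assumed:

```python
import time
from collections import deque
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)    # assumed sliding window of VHD failure events
PERIOD_SECONDS = 15 * 60       # assumed scheduler period; the slide's budget is 15 minutes end to end

def run_pipeline(get_new_events, localize, alert):
    """get_new_events/localize/alert stand in for ingestion, the Lasso + test step, and actions."""
    window = deque()           # (timestamp, event) pairs inside the sliding window
    while True:
        now = datetime.utcnow()
        for event in get_new_events():
            window.append((now, event))
        while window and now - window[0][0] > WINDOW:
            window.popleft()   # drop events that fell out of the window
        for component in localize([e for _, e in window]):
            alert(component)
        time.sleep(PERIOD_SECONDS)
```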

  24. Some Statistics
      • Analyzed Deepview results for one month
      • Daily VHD failures: hundreds to tens of thousands
      • Detected 100 failure instances
        • 70 matched existing tickets, 30 were previously undetected
      • Reduced unclassified VHD failures to at most 500 per day
        • Single-host failures or customer mistakes (e.g. expired storage accounts)

  25. Case Study 1: Unplanned ToR Reboot
      • An unplanned ToR reboot can cause VMs to crash
      • We knew this can happen, but not where and when
      • Deepview can flag those ToRs
        • The figure shows a ToR down in one small region
        • Blamed the right ToR among 288 components
      • Associate VM downtime with ToR failures
        • Quantify the impact of the ToR as a single point of failure on VM availability
      (Figure: grid view of ToR_11 through ToR_15 against storage clusters STR_01 through STR_07 during an unplanned ToR reboot in a region)

  26. Case Study 2: Storage Cluster Gray Failure
      • Impacts only a subset of VMs
      • A storage cluster was brought online with a bug that put some VHDs into a negative cache
      • Deepview flagged the faulty storage cluster almost immediately, while manual triage took 20+ hours
      (Figure: number of VMs with VHD failures per hour during the storage cluster gray failure)
