Syslog Processing for Switch Failure Diagnosis and Prediction in Datacenter Networks
Shenglin Zhang, Weibin Meng, Jiahao Bu, Sen Yang, Dan Pei, Ying Liu, Jun (Jim) Xu, Yu Chen, Hui Dong, Xianping Qu, Lei Song
9/21/2017, IWQOS 2017
Network Devices in Data Center Networks
[Figure: datacenter network topology showing the inter-DC network, core routers, access routers, aggregation switches, ToR switches, and servers, with firewalls, IDPS, VPNs, and load balancers; layers are labeled L3 and L2]
• Switch
  • Top-of-rack (ToR) switch
  • Aggregation switch
• Router
  • Access router
  • Core router
• Middle box
  • Firewall
  • Intrusion detection and prevention system (IDPS)
  • Load balancer
  • VPN
Scale of Network Devices in Datacenter
Microsoft (C. Guo, et al., SIGCOMM’15)
• Hundreds of thousands to millions of servers
• Hundreds of thousands of switches
• Millions of cables and fibers
Baidu
• Hundreds of thousands of servers
• Tens of thousands of switches
• More than 400 switch failures per year
Switch failures are the norm rather than the exception (P. Gill, et al., SIGCOMM’11)
Switch Failures Lead to Outages
• A Cisco switch failure at the datacenter of Hosting.com
• Affected a number of services including AWS for 1.5 hours
• The datacenter network went dark after a switch failure
• Almost every executive branch agency was affected for a few hours
Switch Failure Diagnosis and Proactive Detection Frameworks
Based on analyzing syslogs:
• SyslogDigest (IMC 2010)
• Spatio-temporal Factorization (INFOCOM 2014)
• Proactive Failure Detection (CNSM 2015)
Syslog Structure
Switch ID | Message timestamp | Message type | Detailed message
Switch 1 | Jun 12 19:03:03 2014 | SIF | Interface te-1/1/59, changed state to down
Switch 2 | Jul 15 11:05:07 2015 | OSPF | Neighbour(rid:10.231.0.43, addr:10.231.39.61) on vlan23, changed state from Exchange to Loading
Switch 3 | Jan 12 21:03:01 2016 | %%SLOT | SFP te-1/1/33 is plugged in, vendor: BROCADE, serial number: AAA210383148232
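The four fields in the table above can be held in a small record type. The following is a minimal sketch, assuming the fields have already been separated; `SyslogRecord` and `make_record` are hypothetical names, and the timestamp layout is inferred from the three example rows rather than from any vendor specification.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SyslogRecord:
    switch_id: str       # e.g. "Switch 1"
    timestamp: datetime  # when the message was generated
    message_type: str    # e.g. "SIF", "OSPF", "%%SLOT"
    detail: str          # free-text detailed message


def make_record(switch_id, timestamp_text, message_type, detail):
    # Assumed layout, taken from the example rows: "Jun 12 19:03:03 2014"
    ts = datetime.strptime(timestamp_text, "%b %d %H:%M:%S %Y")
    return SyslogRecord(switch_id, ts, message_type, detail)


rec = make_record(
    "Switch 1",
    "Jun 12 19:03:03 2014",
    "SIF",
    "Interface te-1/1/59, changed state to down",
)
```

Only the `detail` field is unstructured; the other three can be compared and grouped directly, which is why the later slides group messages by type before looking at the detailed text.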
The detailed message field
Describes events occurring on switches
• Interface up/down
• Plug in/out of slot
• DDoS attack
• Operator log in/out
Important to failure diagnosis and proactive detection
Extracting events from the detailed message field
• Pre-processing for failure diagnosis
• Pre-processing for proactive failure detection
Syslog Messages Under the Type “SIF”
1. Interface ae3, changed state to down
2. Vlan-interface vlan22, changed state to down
3. Interface ae3, changed state to up
4. Vlan-interface vlan22, changed state to up
5. Interface ae1, changed state to down
6. Vlan-interface vlan20, changed state to down
7. Interface ae1, changed state to up
8. Vlan-interface vlan20, changed state to up
Syslog Messages Under the Type “SIF” Before A Failure
1. Interface *, changed state to down
2. Vlan-interface *, changed state to down
3. Interface *, changed state to up
4. Vlan-interface *, changed state to up
A template is a combination of words with high frequency
Common practice for syslog pre-processing:
• Extracting templates from syslog messages
• Matching syslog messages to templates
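The matching step can be sketched with regular expressions. This is an illustration only, assuming that each "*" stands for exactly one whitespace-free token; the helper names `template_to_regex` and `match_template` are hypothetical and not taken from the slides.

```python
import re

# The four templates listed on the slide; "*" marks a variable field.
TEMPLATES = [
    "Interface *, changed state to down",
    "Vlan-interface *, changed state to down",
    "Interface *, changed state to up",
    "Vlan-interface *, changed state to up",
]


def template_to_regex(template):
    # Escape the literal text, then let each "*" match one token
    # (assumption: a variable field never contains whitespace).
    parts = [re.escape(part) for part in template.split("*")]
    return re.compile(r"\S+".join(parts))


def match_template(message):
    # Return the first template that matches the whole message, else None.
    for template in TEMPLATES:
        if template_to_regex(template).fullmatch(message):
            return template
    return None


print(match_template("Interface ae3, changed state to down"))
print(match_template("SFP te-1/1/33 is plugged in"))
```

A message that returns `None` matches no known template, which is the situation that makes template re-extraction necessary.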
Outline
• Background and Motivation
• Challenges
• Key Ideas
• Results
• Conclusion
Challenges
Huge amount of syslog messages
• Tens of millions every day
• Long period of historical data for training (two years)
Diverse types of syslog messages
• Operator log in/out
• Interface up/down
• Plug in/out of slot
Unstructured texts
Templates should be updated periodically
New kinds of syslog messages
• Due to software or firmware upgrades
• Cannot be matched to any existing template
• New templates should be extracted
Failure diagnosis and prediction
• Based on templates
• Periodically retrained to keep up-to-date
The template extraction method should be:
• Incrementally re-trainable (rather than not incrementally re-trainable)
• Computationally efficient
Existing template extraction methods
Method | Conference | Merits | Drawbacks
Signature Tree | IMC 10 | Accurate | Not incrementally re-trainable
STE | INFOCOM 14 | None | Inaccurate and not incrementally re-trainable
LogSimilarity | CNSM 15 | Learn incrementally | Inaccurate
Our goal
An accurate, incrementally re-trainable, efficient template extraction method
Construct FT-tree
• Support: each message in which a word W appears adds 1 to the support of W
• Scan all the messages and order all of the words into a map M in descending order of support

M:
Words | Support
“changed”, “state”, “to” | 8
“Interface”, “Vlan-interface”, “up”, “down” | 4
“vlan20”, “vlan22”, “ae1”, “ae3” | 2
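The support computation can be sketched as follows, using the eight “SIF” messages from the earlier slide. The tokenizer (split on whitespace, strip commas) is an assumption, as is reading the rule as "count a word at most once per message"; for these messages both readings give the same numbers.

```python
from collections import Counter

MESSAGES = [
    "Interface ae3, changed state to down",
    "Vlan-interface vlan22, changed state to down",
    "Interface ae3, changed state to up",
    "Vlan-interface vlan22, changed state to up",
    "Interface ae1, changed state to down",
    "Vlan-interface vlan20, changed state to down",
    "Interface ae1, changed state to up",
    "Vlan-interface vlan20, changed state to up",
]


def tokenize(message):
    # Assumption: words are whitespace-separated; commas are not part of a word.
    return [word.strip(",") for word in message.split()]


def count_support(messages):
    support = Counter()
    for message in messages:
        # set(): a word contributes once per message it appears in
        support.update(set(tokenize(message)))
    return support


support = count_support(MESSAGES)
# Map M: words in descending order of support (tie order is not specified).
M = sorted(support.items(), key=lambda item: -item[1])
```

With these inputs, `support` gives 8 for “changed”, “state”, and “to”, 4 for “Interface”, “Vlan-interface”, “up”, and “down”, and 2 for each interface name, matching the table.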
Construct FT-tree
• Order words in each message in descending order of support
• Interface ae3, changed state to down
  ➢ V1 = {“changed”, “state”, “to”, “Interface”, “down”, “ae3”}
• Vlan-interface vlan22, changed state to down
  ➢ V2 = {“changed”, “state”, “to”, “Vlan-interface”, “down”, “vlan22”}
• Interface ae3, changed state to up
  ➢ V3 = {“changed”, “state”, “to”, “Interface”, “up”, “ae3”}
• Vlan-interface vlan22, changed state to up
  ➢ V4 = {“changed”, “state”, “to”, “Vlan-interface”, “up”, “vlan22”}
• …
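The per-message ordering can be sketched as a stable sort on support. The slides do not say how ties are broken; this sketch assumes that equally supported words keep their original order in the message, which happens to reproduce V1 to V4 as listed. `order_by_support` is a hypothetical helper name.

```python
from collections import Counter

MESSAGES = [
    "Interface ae3, changed state to down",
    "Vlan-interface vlan22, changed state to down",
    "Interface ae3, changed state to up",
    "Vlan-interface vlan22, changed state to up",
    "Interface ae1, changed state to down",
    "Vlan-interface vlan20, changed state to down",
    "Interface ae1, changed state to up",
    "Vlan-interface vlan20, changed state to up",
]


def tokenize(message):
    # Assumption: whitespace-separated words with commas stripped.
    return [word.strip(",") for word in message.split()]


# Support is computed over all eight messages, once per message per word.
support = Counter()
for message in MESSAGES:
    support.update(set(tokenize(message)))


def order_by_support(message):
    # sorted() is stable, so words with equal support keep message order
    # (an assumed tie-break; the slides do not specify one).
    return sorted(tokenize(message), key=lambda word: -support[word])


V1 = order_by_support(MESSAGES[0])
V2 = order_by_support(MESSAGES[1])
```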
Construct FT-tree
• Start from a root node labeled with the message type, SIF
• Insert V1 = {“changed”, “state”, “to”, “Interface”, “down”, “ae3”} as the path SIF → changed → state → to → Interface → down → ae3
• Insert V2 = {“changed”, “state”, “to”, “Vlan-interface”, “down”, “vlan22”}: it shares the prefix SIF → changed → state → to with V1 and branches at Vlan-interface → down → vlan22
• Insert V3 = {“changed”, “state”, “to”, “Interface”, “up”, “ae3”}: it shares the prefix through Interface and branches at up → ae3
• …
[Figure: FT-tree after all eight messages are inserted]
SIF
  changed
    state
      to
        Interface
          down: ae3, ae1
          up: ae3, ae1
        Vlan-interface
          down: vlan22, vlan20
          up: vlan22, vlan20
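The insertion steps can be sketched with nested dictionaries, one dictionary per tree node. This is an illustrative sketch, not the authors' implementation: the tokenizer, the tie-break among equally supported words (message order), and the names `build_ft_tree` and `paths` are all assumptions. It covers only tree construction as shown on these slides.

```python
from collections import Counter

MESSAGES = [
    "Interface ae3, changed state to down",
    "Vlan-interface vlan22, changed state to down",
    "Interface ae3, changed state to up",
    "Vlan-interface vlan22, changed state to up",
    "Interface ae1, changed state to down",
    "Vlan-interface vlan20, changed state to down",
    "Interface ae1, changed state to up",
    "Vlan-interface vlan20, changed state to up",
]


def tokenize(message):
    # Assumption: whitespace-separated words with commas stripped.
    return [word.strip(",") for word in message.split()]


def build_ft_tree(messages, root="SIF"):
    # Pass 1: support of each word, counted once per message.
    support = Counter()
    for message in messages:
        support.update(set(tokenize(message)))
    # Pass 2: insert each support-ordered word list as a root-to-leaf path.
    tree = {root: {}}
    for message in messages:
        node = tree[root]
        for word in sorted(tokenize(message), key=lambda w: -support[w]):
            node = node.setdefault(word, {})  # reuse a shared prefix if present
    return tree


def paths(node, prefix=()):
    # Yield every root-to-leaf path in the tree as a tuple of words.
    if not node:
        yield prefix
    for word, child in node.items():
        yield from paths(child, prefix + (word,))


tree = build_ft_tree(MESSAGES)
for path in paths(tree):
    print(" -> ".join(path))
```

Because the high-support words come first in every ordered list, all eight messages share the path SIF, changed, state, to, and the low-support interface names end up at the leaves.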