Predicting intermittent network device failures based on network metrics from multiple data sources
Authors: H.P.M. van Doorn, C.H.J. Kuipers (University of Amsterdam)
Supervisors: P. Boers, M. Kaat (SURFnet)
Tuesday 3 July
RP91
Introduction: Collected data over 2 years
- ~690 million device events
- ~163 billion device metrics
Introduction: Relevance
- Failures impacting connectivity
Introduction: Research question
To what extent is it possible to predict intermittent network device failures based on network metrics from multiple data sources?
Introduction: Sub-questions
- Which metrics are relevant?
- Patterns between failures?
- Correlation between data sources?
Introduction: Fault vs Failure
Source: Salfner et al., “A Survey of Online Failure Prediction Methods”.
Methodology: Identifying outages
Starting point: big outages in the past 2 years
Big: multiple customers losing connectivity
Based on:
- Ticketing system
- Network operators
Methodology: Categorizing outages
- Intermittent failure (spontaneous reboots)
- Permanent failure (line-card malfunctioning)
Methodology: Metrics at hand
Switch chassis metrics:
- CPU and memory utilization
- Temperature
- Uptime
Metrics per interface:
- Throughput
- Unicast packets
- Multicast packets
- Broadcast packets
- In/Out errors
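The uptime metric is one signal from which spontaneous reboots could be spotted: a device whose uptime drops between two consecutive samples has restarted in between. The following is a minimal sketch, assuming samples arrive as (timestamp, uptime in seconds) pairs sorted by time. The function name and the sample values are hypothetical and are not taken from the SURFnet dataset.

```python
from typing import List, Tuple


def find_reboots(samples: List[Tuple[int, int]]) -> List[int]:
    """Return the timestamps of samples whose uptime is lower than the previous sample's."""
    reboots = []
    # Walk over consecutive pairs of samples.
    for (_, prev_uptime), (ts, uptime) in zip(samples, samples[1:]):
        if uptime < prev_uptime:  # uptime went backwards: the device restarted
            reboots.append(ts)
    return reboots


# Hypothetical (timestamp, uptime) samples taken every 300 seconds.
samples = [(0, 1000), (300, 1300), (600, 120), (900, 420)]
print(find_reboots(samples))  # [600]
```

A drop in uptime only shows that a restart happened somewhere between two samples, so the precision of the detected time is bounded by the polling interval.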
Data Sources: Overview
Device data:
- Device metrics
- Device events
Methodology: Line-card failure
- Line-card malfunctioning in BOR
Findings: Line-card fault
Results: Packet CRC errors at core router [BOR]
[Chart: number of interface input errors]
Findings: Interface input errors, 11-09-2017 [TRUUS]
[Chart: number of interface input errors]
Findings: Loss of throughput
Findings: Spontaneous throughput loss (1)
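A spontaneous throughput loss like the one charted on this slide could be flagged automatically by comparing each sample with a baseline built from the samples just before it. The sketch below is one possible approach, not the method used in this project. The window size, the 50% threshold, the function name and the sample values are all assumptions chosen for illustration.

```python
from statistics import median
from typing import List


def find_throughput_drops(values: List[float], window: int = 3, ratio: float = 0.5) -> List[int]:
    """Return indices of samples below `ratio` times the median of the preceding `window` samples."""
    drops = []
    for i in range(window, len(values)):
        baseline = median(values[i - window:i])
        # Skip idle links (baseline of zero) to avoid flagging noise.
        if baseline > 0 and values[i] < ratio * baseline:
            drops.append(i)
    return drops


# Hypothetical throughput samples in Mbit/s.
throughput_mbps = [900, 950, 920, 910, 300, 880, 905]
print(find_throughput_drops(throughput_mbps))  # [4]
```

Using the median rather than the mean keeps a single dropped sample from dragging down the baseline for the samples that follow it.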
Findings: Spontaneous throughput loss (2)
- Syslog event:
2018 May 24 09:50:33 active.5410-01.Asd001A.dcn.surf.net DATAPLANE-4-FLOOD_CONTAINMENT_THRESHOLD: chassis(1): :Flood Containment Threshold Event Container LIMIT_2 on l2-ucast EXCEEDED
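To work with events like this one in bulk, the syslog line has to be split into fields. The sketch below parses the exact line shown on the slide with a regular expression. Reading `DATAPLANE-4-FLOOD_CONTAINMENT_THRESHOLD` as category, severity and event name is an assumption about the vendor's format, and the field names are hypothetical.

```python
import re
from datetime import datetime

# The syslog line from the slide, split across string literals for readability.
LINE = ("2018 May 24 09:50:33 active.5410-01.Asd001A.dcn.surf.net "
        "DATAPLANE-4-FLOOD_CONTAINMENT_THRESHOLD: chassis(1): :Flood Containment "
        "Threshold Event Container LIMIT_2 on l2-ucast EXCEEDED")

# Assumed layout: timestamp, host, CATEGORY-SEVERITY-EVENT, free-text message.
PATTERN = re.compile(
    r"^(?P<timestamp>\d{4} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<category>[A-Z]+)-(?P<severity>\d+)-(?P<event>[A-Z0-9_]+): "
    r"(?P<message>.*)$"
)


def parse_event(line):
    """Return a dict of parsed fields, or None if the line does not match the assumed layout."""
    match = PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["timestamp"] = datetime.strptime(fields["timestamp"], "%Y %b %d %H:%M:%S")
    fields["severity"] = int(fields["severity"])
    return fields


event = parse_event(LINE)
print(event["timestamp"], event["host"], event["event"])
```

Lines that do not fit the assumed layout return `None` instead of raising, so a caller can count or inspect unparsed lines separately.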
Findings: Spontaneous throughput loss (3)
- So is this a real problem?
[Chart: events over time, annotated "May holidays?"]
- Roughly 21,000 events for this switch alone
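A chart of events over time, like the one on this slide, comes down to counting event timestamps per day. The following is a minimal sketch of that aggregation. The function name is hypothetical, and only the first timestamp comes from the slides; the other two are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def events_per_day(timestamps):
    """Count how many events fall on each calendar day."""
    return Counter(ts.date() for ts in timestamps)


timestamps = [
    datetime(2018, 5, 24, 9, 50, 33),  # the event shown on the previous slide
    datetime(2018, 5, 24, 9, 51, 2),   # hypothetical
    datetime(2018, 5, 25, 14, 0, 0),   # hypothetical
]
for day, count in sorted(events_per_day(timestamps).items()):
    print(day.isoformat(), count)
# 2018-05-24 2
# 2018-05-25 1
```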
Findings: Validating our hypothesis
[Chart: transmitted PDUs and received PDUs on the opposite link, in PDU/sec over time (s)]
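The slide does not spell out how the two series were compared. One plausible check is to compute, per sample, the fraction of transmitted PDUs that never shows up at the opposite end of the link. The sketch below assumes both series are sampled at the same moments; the function name and the rates are hypothetical.

```python
def loss_fraction(tx_rates, rx_rates):
    """Per-sample fraction of transmitted PDUs not received at the opposite end."""
    if len(tx_rates) != len(rx_rates):
        raise ValueError("series must have equal length")
    return [
        # Clamp at 0.0 so small measurement skew never yields negative loss.
        0.0 if tx == 0 else max(0.0, (tx - rx) / tx)
        for tx, rx in zip(tx_rates, rx_rates)
    ]


# Hypothetical PDU/sec samples: the sender stays constant, the receiver dips.
tx = [1000, 1000, 1000, 1000]
rx = [1000, 400, 400, 1000]
print(loss_fraction(tx, rx))  # [0.0, 0.6, 0.6, 0.0]
```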
Discussion
Identified:
- 2 cases of permanent line-card faults
- Thousands of flood containment events
Challenges:
- Data inconsistencies
- Measurement errors
- No labeled dataset
Conclusion
- Dataset not (yet) suitable for automated predictions
- No data that could indicate failure beforehand
- Proved link between two datasets
- Validated hypothesis
Future Work
- Normalizing datasets
- Create labeled dataset
- Other areas:
  - Capacity Management
  - Service Level Specification
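As an illustration of what creating a labeled dataset could involve, the sketch below marks each metric sample as positive when a known failure follows within a fixed prediction horizon. This is one common labeling scheme, not something defined in this project. The function name, the horizon and the timestamps are hypothetical.

```python
def label_samples(sample_times, failure_times, horizon):
    """Label a sample 1 if a failure occurs within `horizon` seconds after it, else 0."""
    labels = []
    for t in sample_times:
        # A sample is positive only if some failure lies strictly after it
        # and no more than `horizon` seconds ahead.
        upcoming = any(0 < f - t <= horizon for f in failure_times)
        labels.append(int(upcoming))
    return labels


# Hypothetical: samples every 300 s, one failure at t=1000 s, 600 s horizon.
print(label_samples([0, 300, 600, 900, 1200], [1000], horizon=600))  # [0, 0, 1, 1, 0]
```

The failure times would have to come from a trusted source such as the ticketing system, which is why the missing labeled dataset is listed as a challenge.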
Questions?
Backup slides
Bonus: Spontaneous throughput loss
Bonus: Spontaneous throughput loss
- So is this a real problem?
Bonus: Spontaneous reboots