Predicting intermittent network device failures based on network metrics from multiple data sources

Predicting intermittent network device failures based on network metrics from multiple data sources. Supervisors: P. Boers, M. Kaat (SURFnet). Authors: H.P.M. van Doorn, C.H.J. Kuipers (University of Amsterdam). Tuesday 3 July, RP91.


  1. Predicting intermittent network device failures based on network metrics from multiple data sources - Supervisors: P. Boers, M. Kaat (SURFnet) - Authors: H.P.M. van Doorn, C.H.J. Kuipers (University of Amsterdam) - Tuesday 3 July, RP91

  2. Introduction: Collected data over 2 years - ~690 million device events - ~163 billion device metrics

  3. Introduction: Relevance - Failures impacting connectivity

  4. Introduction: Research question - To what extent is it possible to predict intermittent network device failures based on network metrics from multiple data sources?

  5. Introduction: Sub-questions - Which metrics are relevant? - Patterns between failures? - Correlation between data sources?

  6. Introduction: Fault vs Failure - Source: Salfner et al., “A Survey of Online Failure Prediction Methods”.

  7. Methodology: Identifying outages - Starting point: big outages in the past 2 years (big: multiple customers losing connectivity) - Based on: ticketing system, network operators

  8. Methodology: Categorizing outages - Intermittent failure (spontaneous reboots) - Permanent failure (line-card malfunctioning)

  9. Methodology: Metrics at hand - Switch chassis metrics: CPU and memory utilization, temperature, uptime - Metrics per interface: throughput, unicast packets, multicast packets, broadcast packets, in/out errors

  10. Data Sources: Overview - Device data: device metrics, device events

  11. [Figure-only slide]

  12. Methodology: Line-card failure - Line card on BOR malfunctioning

  13. Findings: Line-card fault

  14. Results: Packet CRC errors at core router [BOR] - [Chart: number of interface input errors]

  15. Findings: Interface input errors, 11-09-2017 [TRUUS] - [Chart: number of interface input errors]

  16. Findings: Loss of throughput

  17. Findings: Spontaneous throughput loss (1)

  18. Findings: Spontaneous throughput loss (2) - Syslog event: 2018 May 24 09:50:33 active.5410-01.Asd001A.dcn.surf.net DATAPLANE-4-FLOOD_CONTAINMENT_THRESHOLD: chassis(1): :Flood Containment Threshold Event Container LIMIT_2 on l2-ucast EXCEEDED

  19. Findings: Spontaneous throughput loss (3) - So is this a real problem? - [Chart: number of events over time, annotated "May holidays?"] - Roughly 21,000 events for this switch alone

  20. Findings: Spontaneous throughput loss (2) - Syslog event: 2018 May 24 09:50:33 active.5410-01.Asd001A.dcn.surf.net DATAPLANE-4-FLOOD_CONTAINMENT_THRESHOLD: chassis(1): :Flood Containment Threshold Event Container LIMIT_2 on l2-ucast EXCEEDED

  21. Findings: Validating our hypothesis - [Figure: transmitted and received PDUs on the opposite link, in PDU/sec, against time (s) with ticks at 0, 15 and 45]

  22. Discussion: Identified: - 2 cases of permanent line-card faults - Thousands of flood containment events. Challenges: - Data inconsistencies - Measurement errors - No labeled dataset

  23. Conclusion - Dataset not (yet) suitable for automated predictions - No data that could indicate failure beforehand - Proved link between two datasets - Validated hypothesis

  24. Future Work - Normalizing datasets - Create labeled dataset - Other areas: capacity management, service level specification

  25. Questions?

  26. Backup slides

  27. Bonus: Spontaneous throughput loss

  28. Bonus: Spontaneous throughput loss - So is this a real problem?

  29. Bonus: Spontaneous throughput loss

  30. Bonus: Spontaneous reboots
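
The flood containment syslog event quoted on the slides, and the per-switch event count mentioned alongside it, could be reproduced roughly as follows. This is only a sketch and not the authors' tooling: the field layout (timestamp, host, then a FACILITY-severity-MNEMONIC tag) is inferred from the single example line, and the names SYSLOG_RE, parse_event and events_per_day are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout, inferred from the one example line on the slides:
# "<YYYY Mon DD HH:MM:SS> <host> <FACILITY>-<severity>-<MNEMONIC>: <message>"
SYSLOG_RE = re.compile(
    r"^(?P<ts>\d{4} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<facility>[A-Z_]+)-(?P<severity>\d)-(?P<mnemonic>[A-Z0-9_]+): "
    r"(?P<message>.*)$"
)


def parse_event(line):
    """Return the fields of one syslog line as a dict, or None if it does not match."""
    match = SYSLOG_RE.match(line.strip())
    if match is None:
        return None
    event = match.groupdict()
    # %b expects English month abbreviations under the default C locale.
    event["ts"] = datetime.strptime(event["ts"], "%Y %b %d %H:%M:%S")
    event["severity"] = int(event["severity"])
    return event


def events_per_day(lines, mnemonic):
    """Count events with the given mnemonic per calendar day."""
    counts = Counter()
    for line in lines:
        event = parse_event(line)
        if event is not None and event["mnemonic"] == mnemonic:
            counts[event["ts"].date()] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "2018 May 24 09:50:33 active.5410-01.Asd001A.dcn.surf.net "
        "DATAPLANE-4-FLOOD_CONTAINMENT_THRESHOLD: chassis(1): :Flood Containment "
        "Threshold Event Container LIMIT_2 on l2-ucast EXCEEDED"
    )
    for day, count in events_per_day([sample], "FLOOD_CONTAINMENT_THRESHOLD").items():
        print(day, count)
```

Daily counts of this kind are what a chart of events over time would be built from; lines that do not match the assumed layout are skipped rather than guessed at.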
