deTector: a Topology-aware Monitoring System for Data Center Networks

Yanghua Peng, The University of Hong Kong; Ji Yang, Xi'an Jiaotong University; Chuan Wu, The University of Hong Kong; Chuanxiong Guo, Microsoft Research; Chengchen Hu, Xi'an Jiaotong University; Zongpeng Li, University of Calgary

https://www.usenix.org/conference/atc17/technical-sessions/presentation/peng

This paper is included in the Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC '17), July 12–14, 2017, Santa Clara, CA, USA. ISBN 978-1-931971-38-6.
Abstract

Troubleshooting network performance issues is a challenging task, especially in large-scale data center networks. This paper presents deTector, a network monitoring system that is able to detect and localize network failures (manifested mainly by packet losses) accurately in near real time while minimizing the monitoring overhead. deTector achieves this goal by tightly coupling detection and localization and by carefully selecting probe paths, so that packet losses can be localized solely from end-to-end observations, without the help of additional tools (e.g., tracert). In particular, we quantify the desirable properties of the matrix of probe paths, i.e., coverage and identifiability, and leverage an efficient greedy algorithm with a good approximation ratio and fast speed to select probe paths. We also propose a loss localization method based on loss patterns in a data center network. Our algorithm analysis, experimental evaluation on a Fattree testbed and supplementary large-scale simulation validate the scalability, feasibility and effectiveness of deTector.

1 Introduction

A variety of services are hosted in large-scale data centers today, e.g., search engines, social networks and file sharing. To support these services with high quality, data center networks (DCNs) are carefully designed to efficiently connect thousands of network devices together; e.g., a 64-ary Fattree [9] DCN has more than 60,000 servers and 5,000 switches. However, due to the large network scale, frequent upgrades and management complexity, failures in DCNs are the norm rather than the exception [21], such as routing misconfigurations, link flaps, etc. Among these failures, those leading to user-perceived performance issues (e.g., packet losses, latency spikes) are the first priority to be detected and eliminated promptly [27, 26, 21], in order to maintain high quality of service (QoS) for users (e.g., no more than a few minutes of downtime per month [21]) and to increase revenue for operators.

Rapid failure recovery is not possible without a good network monitoring system. A number of monitoring systems have been proposed in the past few years [36, 26, 37, 48], but several limitations still exist in them that prohibit fast failure detection and localization.

First, existing monitoring systems may fail to detect one type of failure or another. Traditional passive monitoring approaches, such as querying device counters via SNMP or retrieving information via the device CLI after users have perceived an issue, can detect clean failures such as link down or line card malfunctions. However, gray failures may also occur, i.e., faults not detected or ignored by the device, or malfunctions not properly reported by the device due to bugs [37]. Active monitoring systems (e.g., Pingmesh [26], NetNORAD [37]) can detect such failures by sending end-to-end probes, but they may fail to capture failures that cause low-rate losses, due to ECMP in data centers (§2).

Second, probe systems such as Pingmesh and NetNORAD inject probes between each pair of servers without selection, which may introduce too much bandwidth overhead. In addition, they typically treat the whole DCN as a black box, and hence require many probes to cover all parallel paths between any server pair with high probability.

Third, failures in the network can be reported by these active monitoring systems, but the exact failure locations cannot be pinpointed automatically. The network operator typically learns only a suspected source-destination server pair once packet loss happens. Then she/he needs to resort to additional tools such as tracert to verify the issue and locate the faulty spot. However, it may be difficult to replay the issue in the case of transient failures. Hence this diagnosis approach (i.e., separation of detection and localization) may take several hours or even days to pinpoint the faulty spot [21], yet ideally failures should be detected and localized far more quickly.
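To give a concrete, though greatly simplified, picture of the ideas sketched in the abstract — representing candidate probe paths as a boolean path-by-link matrix, greedily choosing a small set of paths that covers every link, and inferring faulty links purely from end-to-end loss observations — the following Python sketch works through a toy example. It is not deTector's implementation: the topology, the names, and the localization scoring heuristic are invented for illustration only, and the actual system additionally optimizes identifiability of the probe matrix and localizes losses with a different, loss-pattern-based method.

```python
# Minimal sketch (assumptions throughout): probe paths as sets of links,
# greedy set-cover path selection, and a naive suspicion score for localization.

from typing import Dict, List, Set


def greedy_cover(paths: Dict[str, Set[str]]) -> List[str]:
    """Pick probe paths until every link appears in at least one chosen path.

    Classic greedy set cover: repeatedly take the path covering the most
    not-yet-covered links (a ln(n)-approximation of the minimum cover).
    """
    all_links = set().union(*paths.values())
    uncovered, chosen = set(all_links), []
    while uncovered:
        best = max(paths, key=lambda p: len(paths[p] & uncovered))
        gain = paths[best] & uncovered
        if not gain:  # defensive: no candidate path covers the remaining links
            break
        chosen.append(best)
        uncovered -= gain
    return chosen


def localize(paths: Dict[str, Set[str]], lossy: Set[str]) -> List[str]:
    """Rank links by a naive suspicion score: +1 for each lossy path they sit on,
    -1 for each loss-free path. A stand-in for deTector's loss-pattern logic."""
    score: Dict[str, int] = {}
    for p, links in paths.items():
        for link in links:
            score[link] = score.get(link, 0) + (1 if p in lossy else -1)
    return sorted((l for l, s in score.items() if s > 0),
                  key=lambda l: -score[l])


if __name__ == "__main__":
    # Toy topology: three candidate probe paths over a handful of links
    # (hypothetical host/ToR/aggregation names, not from the paper).
    paths = {
        "h1->h2": {"h1-T1", "T1-A1", "A1-T2", "T2-h2"},
        "h1->h3": {"h1-T1", "T1-A2", "A2-T3", "T3-h3"},
        "h2->h3": {"h2-T2", "T2-A1", "A1-T3", "T3-h3"},
    }
    probes = greedy_cover(paths)                 # all three paths are needed here
    suspects = localize(paths, lossy={"h1->h3"})  # isolates links unique to h1->h3
    print("probe paths:", probes)
    print("suspect links:", suspects)
```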