Enabling Grids for E-sciencE
The network monitoring in grid context: Operations Perspective
Emir Imamagic / SRCE
EGEE’09, Barcelona, Spain
www.eu-egee.org
Overview
• Monitoring In Operations
• Service Availability Monitoring
  – Architecture
  – Network Monitoring
• Performance Monitoring
• Possible Future Work
• Conclusion
EGEE-III INFSO-RI-222667
Monitoring In Operations
• Provide means to site and grid operators to monitor their resources
• Focus on improving availability and reliability by spotting problems and issuing alarms
• Define procedures for escalation and resolution of more complex problems
Service Availability Monitoring
[Figure: schema provided by Karolis Eigelis]
The New Architecture
[Figure: schema provided by Karolis Eigelis]
The New Architecture
Which Other Systems Are Used?
• Database components
  – Aggregated Topology Provider (ATP)
  – Metric Description Database (MDDB)
• Operations services
  – GOCDB, ENOC, OIM
• Grid information services
  – BDII
What Do We Check?
• SAM probes
  – various grid services (CE, WN and SRM)
• WLCG probes (SRCE, CERN)
  – various grid services (e.g. GridFTP, LFC)
• BDII & Gstat probes
  – validation of content in information system BDII
• Nagios native probes
  – standard services (e.g. web, ftp, ssh servers)
Network Monitoring
• Collaboration with ENOC
  – integration of ENOC Downcollector features into SAM
• Added lightweight service checks
  – based on nmap
  – executed with high frequency
  – used for masking other alarms
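The idea of a lightweight check that masks other alarms can be sketched as follows. This is only an illustration under stated assumptions: the checks described above are based on nmap, whereas this sketch substitutes a plain TCP connect from the Python standard library, and the names `port_is_open` and `mask_alarms` as well as the alarm tuple layout are hypothetical.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Lightweight check: True if a TCP connection to host:port succeeds.

    Stand-in for the nmap-based check; a plain TCP connect is an assumption.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def mask_alarms(alarms, is_reachable=port_is_open):
    """Split alarms into (kept, masked).

    alarms: list of (host, port, message) tuples (hypothetical layout).
    An alarm is masked when the lightweight check for its host:port fails,
    because the detailed probe result adds nothing if the port is closed.
    """
    kept, masked = [], []
    cache = {}  # run the lightweight check once per (host, port)
    for host, port, message in alarms:
        key = (host, port)
        if key not in cache:
            cache[key] = is_reachable(host, port)
        (kept if cache[key] else masked).append((host, port, message))
    return kept, masked
```

Passing the reachability function as a parameter keeps the masking logic independent of how the lightweight check is actually performed.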
Network Monitoring
• Integrated network topology data
  – ENOC provided static list of border routers for all sites
  – Nagios supports network hierarchy
  – in case of router failure site resources flagged as unreachable
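The effect of the network hierarchy can be illustrated with a small sketch. It is a simplified model of parent-based reachability, not the actual Nagios implementation; the function name `classify`, the state strings, and the host names used below are hypothetical.

```python
def classify(host, parents, failed):
    """Return "UP", "DOWN" or "UNREACHABLE" for a host.

    parents: dict mapping a host to the list of its parent hosts
             (e.g. the site's border router); assumed to be acyclic.
    failed:  set of hosts whose own check currently fails.
    A failing host is UNREACHABLE when it has parents and none of them is UP;
    otherwise the failure is attributed to the host itself (DOWN).
    """
    if host not in failed:
        return "UP"
    host_parents = parents.get(host, [])
    if host_parents and all(
        classify(p, parents, failed) != "UP" for p in host_parents
    ):
        return "UNREACHABLE"
    return "DOWN"
```

For example, with `parents = {"ce.site-a.example.org": ["router.site-a.example.org"]}`, a failure of both hosts marks the router as `DOWN` and the CE as `UNREACHABLE`, so only the router failure needs to be raised as an alarm.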
Performance Monitoring - Grid
• Several grid systems gather performance data
  – BDII, GridFTP transfers
  – Dashboards and VO-specific systems
• Some raise alarms based on performance data
Performance Monitoring - Network
• Majority of sites are without dedicated links
  – without SLAs what should we alarm on?
• Severe degradation of network performance
  – e.g. failure of primary link
  – interpreted as service unavailability
Possible Future Work – Availability Monitoring
• Lightweight checks improvement?
• Dynamic network topology info?
• Better integration with network monitoring systems?
• End-to-end monitoring between sites?
Possible Future Work – Performance Monitoring
• Dynamic performance testing
  – to distinguish between failure and severe degradation
  – interesting for grid services (job & file transfer management)
• With dedicated links
  – monitoring network parameters
  – raising alarms in case of degradation
• Monitoring dynamic link reservation
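One way the distinction between failure and severe degradation might be expressed is sketched below. Everything here is an assumption made for illustration: the function name `classify_transfer`, the use of throughput in Mbit/s as the measured parameter, and the 10% threshold are hypothetical and would depend on the link and any SLA in place.

```python
def classify_transfer(measured_mbps, baseline_mbps, degraded_fraction=0.1):
    """Classify the result of a test transfer.

    measured_mbps:     observed throughput, or None if the transfer failed
                       outright (no connection, timeout).
    baseline_mbps:     throughput expected on a healthy link (assumed known).
    degraded_fraction: hypothetical threshold; below this share of the
                       baseline the link counts as severely degraded.
    """
    if measured_mbps is None:
        return "FAILURE"
    if measured_mbps < degraded_fraction * baseline_mbps:
        return "SEVERE_DEGRADATION"
    return "OK"
```

A test transfer that completes slowly is then reported as degradation rather than as service unavailability.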
Conclusion
• Multilevel monitoring provides the means for administrators to better monitor their services
• Integration with existing components to automate operations of monitoring instances
• Network monitoring mainly focused on end-to-end links
Links
• OAT web page
  https://twiki.cern.ch/twiki/bin/view/EGEE/OAT_EGEE_III
• OAT Multi-level monitoring architecture
  https://twiki.cern.ch/twiki/bin/view/EGEE/MultiLevelMonitoringOverview
Thank You! Questions?