The dashboard Grid monitoring framework
Benjamin Gaidioz, on behalf of the ARDA dashboard team (CERN/EGEE)
ISGC 2007 conference
introduction/outline
- goals of the project,
- the team,
- the framework,
- some monitoring applications: job monitoring, site monitoring, data management monitoring.
the project (EGEE/ARDA)
- another monitoring tool,
- a VO-specific monitoring service,
- showing Grid usage from a VO point of view (cross-Grid, cross-application, submission tool, etc.),
- merging Grid information and VO information,
- implemented in close contact with the VOs.
the team
- Julia Andreeva (lead, CMS) and Juha Herrala (former member, CMS),
- Benjamin Gaidioz and Ricardo Rocha (ATLAS),
- Pablo Saiz (ALICE),
- Gerhild Maier,
- collaborators and visitors:
  - Taipei: Fu-Ming Tsai (daily summaries), Tao-Sheng Chen (PostgreSQL and Oracle), Shih-Chun Chiu (user web interface, PHP), etc.,
  - Moscow State University,
- our contacts in all the VOs and Grids.
contact: dashboard-support@cern.ch
the framework
a Python framework for collecting and publishing monitoring information
[architecture diagram; labels: RGMA, GridPP, MonALISA, RGMA collector, GridPP collector, collector, data access object (DAO), dashboard Oracle database, dashboard web server, client, question, request, text/html, text/xml, image/png, etc.]
developer guide, Savannah project.
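The collector / DAO / web server split named on this slide can be illustrated with a minimal sketch. It is not the dashboard's actual code: the class names `JobDAO` and `Collector`, the function `render`, the table layout and the record fields are all hypothetical, and an in-memory SQLite database stands in for the Oracle database so that the example runs on its own.

```python
import html
import sqlite3
from xml.sax.saxutils import quoteattr


class JobDAO:
    """Hypothetical data access object: the only place that contains SQL."""

    def __init__(self, connection):
        self.conn = connection
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS job "
            "(job_id TEXT PRIMARY KEY, site TEXT, status TEXT)"
        )

    def save(self, job_id, site, status):
        # INSERT OR REPLACE keeps only the latest record per job id.
        self.conn.execute(
            "INSERT OR REPLACE INTO job VALUES (?, ?, ?)", (job_id, site, status)
        )

    def all_jobs(self):
        cursor = self.conn.execute(
            "SELECT job_id, site, status FROM job ORDER BY job_id"
        )
        return cursor.fetchall()


class Collector:
    """Hypothetical collector: stores records from one source through the DAO."""

    def __init__(self, dao):
        self.dao = dao

    def collect(self, records):
        for rec in records:
            self.dao.save(rec["job_id"], rec["site"], rec["status"])


def render(jobs, content_type):
    """Publish the same rows in the format the client asked for."""
    if content_type == "text/xml":
        rows = "".join(
            "<job id=%s site=%s status=%s/>"
            % (quoteattr(job_id), quoteattr(site), quoteattr(status))
            for job_id, site, status in jobs
        )
        return "<jobs>" + rows + "</jobs>"
    if content_type == "text/html":
        rows = "".join(
            "<tr>" + "".join("<td>%s</td>" % html.escape(v) for v in job) + "</tr>"
            for job in jobs
        )
        return "<table>" + rows + "</table>"
    raise ValueError("unsupported content type: %s" % content_type)
```

A collector would typically be fed by a source-specific reader (RGMA, GridPP, MonALISA); keeping the SQL inside the DAO means collectors and views do not depend on the database backend.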
a set of applications
applications
1. job monitoring,
2. site monitoring,
3. data management monitoring.
see the links in the last slide for accessing them all.
job monitoring
- real-time view of Grid jobs for a VO,
- summary views,
- various Grid information systems used (EGEE RGMA, GridPP XML files, LCG BDII),
- VO information: job instrumentation (MonALISA's ApMon), ATLAS prodsys database, PanDA monitoring, GangaAtlas monitoring, DIRAC database, etc.,
- consistent merging (Grid information + VO information),
- powerful filtering for serving different use cases (managers, site admins, users),
- examples: ATLAS activities today, ATLAS jobs in Taiwan, CMS daily views.
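The merging and filtering described on this slide can be sketched as follows. This is only an illustration under assumptions: the function names `merge_job_info` and `filter_jobs`, the record fields, and the rule that a VO value overrides a Grid value for the same field are hypothetical and not taken from the dashboard itself.

```python
def merge_job_info(grid_records, vo_records):
    """Merge per-job records from Grid and VO sources, keyed by job id.

    Assumed rule: sources are applied in order, so a non-empty VO value
    overrides a Grid value for the same field.
    """
    merged = {}
    for source in (grid_records, vo_records):
        for rec in source:
            job = merged.setdefault(rec["job_id"], {})
            for key, value in rec.items():
                if value is not None:  # never overwrite with a missing value
                    job[key] = value
    return merged


def filter_jobs(jobs, **criteria):
    """Return the merged jobs whose fields match every given criterion."""
    return [
        job for job in jobs.values()
        if all(job.get(field) == wanted for field, wanted in criteria.items())
    ]


# Hypothetical records: one job known to both sources, one to each side only.
grid = [
    {"job_id": "1", "site": "site-a", "grid_status": "Running"},
    {"job_id": "2", "site": "site-b", "grid_status": "Done"},
]
vo = [
    {"job_id": "1", "activity": "production", "user": None},
    {"job_id": "3", "activity": "analysis"},
]
jobs = merge_job_info(grid, vo)
site_view = filter_jobs(jobs, site="site-a")          # a site admin's view
activity_view = filter_jobs(jobs, activity="analysis")  # a manager's view
```

The same merged records serve each use case; only the filter criteria change.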
job monitoring summary
- installed for ALICE, ATLAS, CMS, LHCb.
- latest/next developments:
  - open HTTP API for a VO to publish job information to the dashboard (in progress),
  - user task monitoring (in progress),
  - alerts (with failure pattern recognition),
  - link with the SAM tests (site functionality tests),
  - RSS feeds.
site monitoring
- linked to job monitoring,
- identifies the reasons for job failures at sites,
- using RGMA (which reports Grid error messages),
- examples: ALICE site info.
job status messages as reported (status, reason):
  Waiting
  Ready (unavailable)
  Scheduled (Job successfully submitted to Globus)
  Ready (7 an authentication operation failed)
  Running (Job successfully submitted to Globus)
  Done (Job got an error while in the CondorG queue.)
  Submitted
  Done (Job terminated successfully)
  Done (Cannot read JobWrapper output both from Condor and from Maradona.)
  Done (/net/hisrv0001/opt.x86_64/grid/globus/etc/globus-user-env.sh not found or unreadable)
  Cleared (user retrieved output sandbox)
  Waiting (unavailable)
[diagram: job state sequences after submit at ce02.grid.acad.bg, lepton.rcac.purdue.edu, ce01.cmsaf.mit.edu and cluster.pnpi.nw.ru; states shown: Waiting, Ready, Scheduled, Running, Done, Success, Error (authentication), Error (maradona), Error (wrong installation)]
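Turning raw status reasons such as those on this slide into short failure labels per site could look like the sketch below. The patterns, the label names and the functions `classify` and `failures_by_site` are assumptions made for illustration; the slide does not say how the dashboard performs this mapping, and the site names in the usage lines are placeholders.

```python
import re
from collections import Counter, defaultdict

# Assumed mapping from fragments of the reported reason to a short label.
PATTERNS = [
    (re.compile(r"authentication operation failed"), "authentication"),
    (re.compile(r"Maradona"), "maradona"),
    (re.compile(r"not found or unreadable"), "wrong installation"),
    (re.compile(r"terminated successfully"), "success"),
]


def classify(reason):
    """Return the label of the first matching pattern, or 'other'."""
    for pattern, label in PATTERNS:
        if pattern.search(reason):
            return label
    return "other"


def failures_by_site(events):
    """Count labels per site from (site, reason) pairs."""
    summary = defaultdict(Counter)
    for site, reason in events:
        summary[site][classify(reason)] += 1
    return summary


# Placeholder sites paired with reasons of the kind shown on the slide.
events = [
    ("site-a", "Job terminated successfully"),
    ("site-b", "7 an authentication operation failed"),
    ("site-b", "7 an authentication operation failed"),
    ("site-c", "Cannot read JobWrapper output both from Condor and from Maradona."),
]
summary = failures_by_site(events)
```

A per-site counter like this makes it easy to see whether one failure reason dominates at a site.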
site monitoring summary
- installed for ALICE, ATLAS, CMS, LHCb.
- latest/next developments: merging all the information about a site (rather than per VO), in order to see whether failures are similar for all VOs (in progress).
data management
- an ATLAS-specific application,
- monitoring the ATLAS DDM tool,
- events directly reported by ATLAS software to the dashboard,
- current performance, details,
- developed in close contact with ATLAS DDM admins and developers,
- daily summary sent by mail.
data management: summary
- installed for ATLAS,
- critical component of ATLAS DDM (now official monitoring system),
- latest/next developments:
  - text summary sent by e-mail to site admins,
  - correlation with the SAM tests (site functionality tests).
conclusion
- goal: Grid monitoring from a VO point of view:
  - merging VO information and Grid information,
  - fitting the various use cases (managers, users, site admins),
- several applications already implemented using a flexible Python framework,
- future work:
  - new applications,
  - new information sources (GridICE, APEL, SAM),
  - new functionalities: alerts, assistance in error tracking.
links
- Savannah project
- dashboard main page
- CMS dashboard main page
- ATLAS dashboard main page
- LHCb dashboard main page
- ALICE dashboard main page
- site reliability
- dashboard-support@cern.ch