Monitoring HTCondor
Andrew Lahiff
STFC Rutherford Appleton Laboratory
European HTCondor Site Admins Meeting 2014
Introduction
• Two aspects of monitoring
  – General overview of the system
    • How many running/idle jobs? By user/VO? By schedd?
    • How full is the farm?
    • How many draining worker nodes?
  – More detailed views
    • What are individual jobs doing?
    • What’s happening on individual worker nodes?
    • Health of the different components of the HTCondor pool
    • ...in addition to Nagios
Introduction
• Methods
  – Command line utilities
  – Ganglia
  – Third-party applications (which run command-line tools or use the Python API)
Command line
• Three useful commands
  – condor_status
    • Overview of the pool (including jobs, machines)
    • Information about specific worker nodes
  – condor_q
    • Information about jobs in the queue
  – condor_history
    • Information about completed jobs
Overview of jobs
-bash-4.1$ condor_status -collector
Name                         Machine            RunningJobs IdleJobs HostsTotal
RAL-LCG2@condor01.gridpp.rl. condor01.gridpp.rl       10608     8355      11347
RAL-LCG2@condor02.gridpp.rl. condor02.gridpp.rl       10616     8364      11360
Overview of machines
-bash-4.1$ condor_status -total
             Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 11183    95   10441       592       0          0        0
       Total 11183    95   10441       592       0          0        0
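The "How full is the farm?" question from the introduction can be answered from these totals. The following is a minimal sketch, assuming the output has the column layout shown on this slide (other HTCondor versions may print different columns); `SAMPLE` and `parse_totals` are illustrative names, not part of HTCondor.

```python
# Sample text copied from the slide above. In practice it would be the
# captured output of `condor_status -total`.
SAMPLE = """\
             Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 11183    95   10441       592       0          0        0
       Total 11183    95   10441       592       0          0        0
"""


def parse_totals(text):
    """Return {column name: count} taken from the final 'Total' row."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header = rows[0]      # ['Total', 'Owner', 'Claimed', ...]
    total_row = rows[-1]  # ['Total', '11183', '95', ...]
    # The first field of the last row is the label "Total"; the remaining
    # fields line up one-to-one with the header columns.
    return dict(zip(header, (int(value) for value in total_row[1:])))


totals = parse_totals(SAMPLE)
occupancy = totals["Claimed"] / totals["Total"]
print("Claimed slots: {:.1%} of {}".format(occupancy, totals["Total"]))
```

With the numbers on the slide, roughly 93% of the slots are claimed.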
Jobs by schedd
-bash-4.1$ condor_status -schedd
Name                 Machine    TotalRunningJobs TotalIdleJobs TotalHeldJobs
arc-ce01.gridpp.rl.a arc-ce01.g             2388          1990            13
arc-ce02.gridpp.rl.a arc-ce02.g             2011          1995            31
arc-ce03.gridpp.rl.a arc-ce03.g             4272          1994             9
arc-ce04.gridpp.rl.a arc-ce04.g             1424          2385            12
arc-ce05.gridpp.rl.a arc-ce05.g                1             0             6
cream-ce01.gridpp.rl cream-ce01              266             0             0
cream-ce02.gridpp.rl cream-ce02              247             0             0
lcg0955.gridpp.rl.ac lcg0955.gr                0             0             0
lcgui03.gridpp.rl.ac lcgui03.gr                3             0             0
lcgui04.gridpp.rl.ac lcgui04.gr                0             0             0
lcgvm21.gridpp.rl.ac lcgvm21.gr                0             0             0
                                TotalRunningJobs TotalIdleJobs TotalHeldJobs
               Total                       10612          8364            71
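A third-party script that runs the command-line tool could post-process this table, for example to find the schedd with the most held jobs. This is only a sketch under the assumption that the per-schedd rows have the five whitespace-separated fields shown on the slide; `parse_schedds` is a hypothetical helper.

```python
# Sample rows copied from the slide above. In practice the text would be the
# captured output of `condor_status -schedd`.
SAMPLE = """\
Name                 Machine    TotalRunningJobs TotalIdleJobs TotalHeldJobs
arc-ce01.gridpp.rl.a arc-ce01.g             2388          1990            13
arc-ce02.gridpp.rl.a arc-ce02.g             2011          1995            31
arc-ce03.gridpp.rl.a arc-ce03.g             4272          1994             9
arc-ce04.gridpp.rl.a arc-ce04.g             1424          2385            12
arc-ce05.gridpp.rl.a arc-ce05.g                1             0             6
cream-ce01.gridpp.rl cream-ce01              266             0             0
cream-ce02.gridpp.rl cream-ce02              247             0             0
lcg0955.gridpp.rl.ac lcg0955.gr                0             0             0
lcgui03.gridpp.rl.ac lcgui03.gr                3             0             0
lcgui04.gridpp.rl.ac lcgui04.gr                0             0             0
lcgvm21.gridpp.rl.ac lcgvm21.gr                0             0             0
                                TotalRunningJobs TotalIdleJobs TotalHeldJobs
               Total                       10612          8364            71
"""


def parse_schedds(text):
    """Return {schedd name: (running, idle, held)} for each per-schedd row."""
    schedds = {}
    for line in text.splitlines():
        fields = line.split()
        # Per-schedd rows have exactly five fields, the last three numeric.
        # This skips both header lines and the four-field "Total" row.
        if len(fields) == 5 and all(f.isdigit() for f in fields[2:]):
            name, _machine, running, idle, held = fields
            schedds[name] = (int(running), int(idle), int(held))
    return schedds


schedds = parse_schedds(SAMPLE)
# Column-wise sums over all schedds: (running, idle, held).
totals = tuple(sum(column) for column in zip(*schedds.values()))
most_held = max(schedds, key=lambda name: schedds[name][2])
print(totals, most_held)
```

Summing the per-schedd rows is also a cheap consistency check against the "Total" line printed by the tool.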
Jobs by user, schedd
-bash-4.1$ condor_status -submitters
Name                         Machine            RunningJobs IdleJobs HeldJobs
group_ALICE.alice.alice043@g arc-ce01.gridpp.rl           0        0        0
group_ALICE.alice.alicesgm@g arc-ce01.gridpp.rl         540        0        1
group_ATLAS.atlas_pilot.tatl arc-ce01.gridpp.rl         142        0        0
group_ATLAS.prodatls.patls00 arc-ce01.gridpp.rl          82        5        0
group_CMS.cms.cmssgm@gridpp. arc-ce01.gridpp.rl           1        0        0
group_CMS.cms_pilot.ttcms022 arc-ce01.gridpp.rl         214      390        0
group_CMS.cms_pilot.ttcms043 arc-ce01.gridpp.rl          68      100        0
group_CMS.prodcms.pcms004@gr arc-ce01.gridpp.rl          78      476        4
group_CMS.prodcms.pcms054@gr arc-ce01.gridpp.rl          12      910        0
group_CMS.prodcms_multicore. arc-ce01.gridpp.rl          47      102        0
group_DTEAM_OPS.ops.ops047@g arc-ce01.gridpp.rl           0        0        0
group_LHCB.lhcb_pilot.tlhcb0 arc-ce01.gridpp.rl         992        0        2
group_NONLHC.snoplus.snoplus arc-ce01.gridpp.rl           0        0        0
…
…Jobs by user
                     RunningJobs IdleJobs HeldJobs
group_ALICE.alice.al           0        0        0
group_ALICE.alice.al        3500      368        5
group_ALICE.alice_pi           0        0        0
group_ATLAS.atlas.at           0        0        0
group_ATLAS.atlas.at           0        0        0
group_ATLAS.atlas_pi         414       12       10
group_ATLAS.atlas_pi           0        0        2
group_ATLAS.prodatls         354       36       11
group_CMS.cms.cmssgm           1        0        0
group_CMS.cms_pilot.         371     2223        0
group_CMS.cms_pilot.           0        0        1
group_CMS.cms_pilot.          68      200        0
group_CMS.prodcms.pc         188     1905       10
group_CMS.prodcms.pc         312     3410        0
group_CMS.prodcms_mu          47      102        0
…
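The per-submitter rows can be rolled up to answer "how many jobs by VO?". The sketch below assumes, based only on the names visible on these slides, that the part of the submitter name before the first dot (e.g. `group_CMS`) identifies the VO's accounting group; `jobs_by_group` is a hypothetical helper and the sample covers only the arc-ce01 rows shown above.

```python
from collections import defaultdict

# Rows copied from the `condor_status -submitters` slide (arc-ce01 only).
SAMPLE = """\
group_ALICE.alice.alice043@g arc-ce01.gridpp.rl           0        0        0
group_ALICE.alice.alicesgm@g arc-ce01.gridpp.rl         540        0        1
group_ATLAS.atlas_pilot.tatl arc-ce01.gridpp.rl         142        0        0
group_ATLAS.prodatls.patls00 arc-ce01.gridpp.rl          82        5        0
group_CMS.cms.cmssgm@gridpp. arc-ce01.gridpp.rl           1        0        0
group_CMS.cms_pilot.ttcms022 arc-ce01.gridpp.rl         214      390        0
group_CMS.cms_pilot.ttcms043 arc-ce01.gridpp.rl          68      100        0
group_CMS.prodcms.pcms004@gr arc-ce01.gridpp.rl          78      476        4
group_CMS.prodcms.pcms054@gr arc-ce01.gridpp.rl          12      910        0
group_CMS.prodcms_multicore. arc-ce01.gridpp.rl          47      102        0
group_DTEAM_OPS.ops.ops047@g arc-ce01.gridpp.rl           0        0        0
group_LHCB.lhcb_pilot.tlhcb0 arc-ce01.gridpp.rl         992        0        2
group_NONLHC.snoplus.snoplus arc-ce01.gridpp.rl           0        0        0
"""


def jobs_by_group(text):
    """Return {group: (running, idle, held)} summed over submitter rows."""
    sums = defaultdict(lambda: [0, 0, 0])
    for line in text.splitlines():
        fields = line.split()
        # Keep only rows of the form: name machine running idle held.
        if len(fields) != 5 or not all(f.isdigit() for f in fields[2:]):
            continue
        # Assumption: the accounting group is the text before the first dot.
        group = fields[0].split(".")[0]
        for index, value in enumerate(fields[2:]):
            sums[group][index] += int(value)
    return {group: tuple(values) for group, values in sums.items()}


by_group = jobs_by_group(SAMPLE)
for group in sorted(by_group):
    print(group, *by_group[group])
```

Because the tool truncates long submitter names, grouping on the leading component is more robust than trying to match full user names.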
condor_q
[root@arc-ce01 ~]# condor_q
-- Submitter: arc-ce01.gridpp.rl.ac.uk : <130.246.180.236:64454> : arc-ce01.gridpp.rl.ac.uk
 ID       OWNER    SUBMITTED   RUN_TIME   ST PRI SIZE CMD
794717.0  pcms054  12/3 12:07  0+00:00:00 I  0   0.0  (gridjob )
794718.0  pcms054  12/3 12:07  0+00:00:00 I  0   0.0  (gridjob )
794719.0  pcms054  12/3 12:07  0+00:00:00 I  0   0.0  (gridjob )
794720.0  pcms054  12/3 12:07  0+00:00:00 I  0   0.0  (gridjob )
794721.0  pcms054  12/3 12:07  0+00:00:00 I  0   0.0  (gridjob )
794722.0  pcms054  12/3 12:07  0+00:00:00 I  0   0.0  (gridjob )
794723.0  pcms054  12/3 12:07  0+00:00:00 I  0   0.0  (gridjob )
794725.0  pcms054  12/3 12:07  0+00:00:00 I  0   0.0  (gridjob )
794726.0  pcms054  12/3 12:07  0+00:00:00 I  0   0.0  (gridjob )
…
3502 jobs; 0 completed, 0 removed, 1528 idle, 1965 running, 9 held, 0 suspended
Multi-core jobs
-bash-4.1$ condor_q -global -constraint 'RequestCpus > 1'
-- Schedd: arc-ce01.gridpp.rl.ac.uk : <130.246.180.236:39356>
 ID       OWNER    SUBMITTED   RUN_TIME   ST PRI SIZE CMD
832677.0  pcms004  12/5 14:33  0+00:15:07 R  0   2.0  (gridjob )
832717.0  pcms004  12/5 14:37  0+00:12:02 R  0   0.0  (gridjob )
832718.0  pcms004  12/5 14:37  0+00:00:00 I  0   0.0  (gridjob )
832719.0  pcms004  12/5 14:37  0+00:00:00 I  0   0.0  (gridjob )
832893.0  pcms004  12/5 14:47  0+00:00:00 I  0   0.0  (gridjob )
832894.0  pcms004  12/5 14:47  0+00:00:00 I  0   0.0  (gridjob )
…
Multi-core jobs
• Custom print format
-bash-4.1$ condor_q -global -pr queue_mc.cpf
-- Schedd: arc-ce01.gridpp.rl.ac.uk : <130.246.180.236:39356>
 ID       OWNER    SUBMITTED   RUN_TIME   ST SIZE CMD       CORES
832677.0  pcms004  12/5 14:33  0+00:00:00 R  2.0  (gridjob) 8
832717.0  pcms004  12/5 14:37  0+00:00:00 R  0.0  (gridjob) 8
832718.0  pcms004  12/5 14:37  0+00:00:00 I  0.0  (gridjob) 8
832719.0  pcms004  12/5 14:37  0+00:00:00 I  0.0  (gridjob) 8
832893.0  pcms004  12/5 14:47  0+00:00:00 I  0.0  (gridjob) 8
832894.0  pcms004  12/5 14:47  0+00:00:00 I  0.0  (gridjob) 8
…
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=ExperimentalCustomPrintFormats
Jobs with specific DN
-bash-4.1$ condor_q -global -constraint 'x509userproxysubject=="/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1"'
-- Schedd: arc-ce03.gridpp.rl.ac.uk : <130.246.181.25:62763>
 ID       OWNER     SUBMITTED   RUN_TIME   ST PRI SIZE   CMD
678275.0  tatls015  12/2 17:57  2+06:07:15 R  0   2441.4 (arc_pilot )
681762.0  tatls015  12/3 03:13  1+21:12:31 R  0   2197.3 (arc_pilot )
705153.0  tatls015  12/4 07:36  0+16:49:12 R  0   2197.3 (arc_pilot )
705807.0  tatls015  12/4 08:16  0+16:09:27 R  0   2197.3 (arc_pilot )
705808.0  tatls015  12/4 08:16  0+16:09:27 R  0   2197.3 (arc_pilot )
706612.0  tatls015  12/4 09:16  0+15:09:37 R  0   2197.3 (arc_pilot )
706614.0  tatls015  12/4 09:16  0+15:09:26 R  0   2197.3 (arc_pilot )
…
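Constraints containing a DN need both shell quoting and ClassAd string quoting, which is easy to get wrong by hand. A script can avoid the shell layer by passing the arguments as a list. This is a sketch only: `classad_string` and `condor_q_for_dn` are hypothetical helpers, and the escaping rule (backslash before `"` and `\`) is an assumption that should be checked against the ClassAd documentation for unusual values.

```python
def classad_string(value):
    """Quote a Python string as a ClassAd string literal.

    Assumption: backslash-escaping backslashes and double quotes is
    sufficient for the values being quoted (true for typical DNs).
    """
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def condor_q_for_dn(dn):
    """Build the argument list for `condor_q -global -constraint ...`."""
    constraint = "x509userproxysubject == " + classad_string(dn)
    return ["condor_q", "-global", "-constraint", constraint]


# DN taken from the slide above.
DN = ("/DC=ch/DC=cern/OU=Organic Units/OU=Users"
      "/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1")
argv = condor_q_for_dn(DN)
print(argv[-1])

# On a host with HTCondor installed, the list can be passed straight to
# subprocess without any shell quoting (not executed in this sketch):
#   import subprocess
#   subprocess.run(argv, check=True)
```

Passing a list rather than a single command string means the spaces and colon in the DN never pass through the shell.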
Jobs killed
• Jobs which were removed
[root@arc-ce01 ~]# condor_history -constraint 'JobStatus == 3'
 ID       OWNER     SUBMITTED   RUN_TIME   ST COMPLETED CMD
823881.0  alicesgm  12/5 01:01  1+06:13:22 X  ???       /var/spool/arc/grid03/CVuMDmBSwGlnCIXDjqi
831849.0  tlhcb005  12/5 13:19  0+18:52:26 X  ???       /var/spool/arc/grid09/gWmLDm5x7GlnCIXDjqi
832753.0  tlhcb005  12/5 14:38  0+17:07:07 X  ???       /var/spool/arc/grid00/5wqKDm7C9GlnCIXDjqi
819636.0  alicesgm  12/4 19:27  1+12:13:56 X  ???       /var/spool/arc/grid00/mlrNDmoErGlnCIXDjqi
825511.0  alicesgm  12/5 03:03  0+18:52:10 X  ???       /var/spool/arc/grid04/XpuKDmxLyGlnCIXDjqi
823799.0  alicesgm  12/5 00:56  1+05:58:15 X  ???       /var/spool/arc/grid03/DYuMDmzMwGlnCIXDjqi
820001.0  alicesgm  12/4 19:48  1+06:43:22 X  ???       /var/spool/arc/grid08/cmzNDmpYrGlnCIXDjqi
833589.0  alicesgm  12/5 16:01  0+14:06:34 X  ???       /var/spool/arc/grid09/HKSLDmqUAHlnCIXDjqi
778644.0  tlhcb005  12/2 05:56  4+00:00:10 X  ???       /var/spool/arc/grid00/pIJNDm6cvFlnCIXDjqi
…
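The constraint above relies on the numeric `JobStatus` attribute, where 3 means "Removed". When post-processing such output, two small helpers are handy: a lookup of the status codes and a parser for the `days+HH:MM:SS` values in the RUN_TIME column. This is a minimal sketch; `JOB_STATUS`, `status_constraint` and `run_time_seconds` are illustrative names, not HTCondor APIs.

```python
# Standard HTCondor JobStatus codes.
JOB_STATUS = {
    1: "Idle",
    2: "Running",
    3: "Removed",
    4: "Completed",
    5: "Held",
    6: "Transferring Output",
    7: "Suspended",
}


def status_constraint(name):
    """Return a constraint expression such as 'JobStatus == 3' for a name."""
    codes = {label: code for code, label in JOB_STATUS.items()}
    return "JobStatus == %d" % codes[name]


def run_time_seconds(text):
    """Convert a RUN_TIME value such as '1+06:13:22' to seconds."""
    days, clock = text.split("+")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return ((int(days) * 24 + hours) * 60 + minutes) * 60 + seconds


print(status_constraint("Removed"))
# Run times taken from the first and last rows on the slide above.
print(run_time_seconds("1+06:13:22"), run_time_seconds("4+00:00:10"))
```

Converting run times to seconds makes it easy to spot patterns, such as removed jobs clustering near a wall-time limit.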