Agent Factory: automatic job submission mechanism for Ganga/diane
Presentation overview
* Background
* Motivation
* Algorithm
* Implementation
* Usage scenarios
* Lattice QCD production data
Lattice QCD; searching for QCD critical point
* Discretize space and time into a 4-dimensional grid
* Generate a sample of the most important configurations of quark and gluon fields
* Evolve the "snapshots", one Monte-Carlo step at a time
Computational model
* One chain of snapshots per beta value (shown: beta = 5.18, 5.1825, 5.1845)
* Each chain advances its own snapshot one iteration at a time (shown: iter 77 and 78 for beta 5.18, iter 351 and 352 for beta 5.1825, iter 233 and 234 for beta 5.1845)
[Diagram]
* Setup: Agent Factory, grid
* Run: Worker Agents (three shown), Master
Algorithm
* Automatically create and maintain diane Worker Agents
* Adaptable heuristic approach; independent of the underlying application
* Three simple phases:
  1. Evaluation
  2. Decision (nondeterministic!)
  3. Submission
* Relies on application exit code only; no glue (JDL) requirements
Good / bad CE: where to draw the line?
+ positive feedback
* running jobs
* successfully completed jobs (without errors)
- negative feedback
* failed jobs
* pending jobs (queueing)
* etc.
Fitness = a measure of reliability
* Job categories on a CE (example: ce01): running, completed (without errors), failed, pending, other
\[ \mathrm{fitness} = \frac{\mathrm{running} + \mathrm{completed}}{\mathrm{total}} \]
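The fitness ratio can be illustrated with a few lines of Python. This is a minimal sketch, not the Agent Factory code: the function name `fitness`, the state labels, and the choice to count every recorded job in the total are assumptions made for illustration.

```python
from collections import Counter

def fitness(job_states):
    """Return (running + completed) / total for one CE.

    job_states: iterable of state labels (hypothetical names) such as
    "running", "completed", "failed", "pending", "other".
    """
    counts = Counter(job_states)
    total = sum(counts.values())  # assumption: every recorded job counts
    if total == 0:
        return None  # no history yet; the caller decides what to do
    return (counts["running"] + counts["completed"]) / total

# Invented example: 6 running, 10 completed, 2 failed, 2 pending
ce01 = ["running"] * 6 + ["completed"] * 10 + ["failed"] * 2 + ["pending"] * 2
print(fitness(ce01))  # 0.8
```

A CE with only failed or pending jobs gets fitness 0 under this definition, and a CE with no recorded jobs has no defined fitness.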
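Nondeterministic decision process
* F = total fitness, summed over the candidates (shown: ce01, ce03, grid*)
* Draw p ∈ [0, F); the candidate whose share of F contains p is chosen
*grid: CE chosen by WMS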
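Drawing p from [0, F) and picking the candidate whose interval contains it is the standard fitness-proportionate ("roulette wheel") selection. The sketch below shows that technique in Python under stated assumptions: `choose_ce` is a hypothetical name, the fitness value given to the "grid" entry is invented, and falling back to "grid" when no candidate has positive fitness is an assumption, not something the slide states.

```python
import random

def choose_ce(fitness_by_ce, rng=random):
    """Pick a key with probability proportional to its fitness."""
    candidates = [(ce, f) for ce, f in fitness_by_ce.items() if f > 0]
    if not candidates:
        return "grid"  # assumption: let the WMS choose when nothing is known
    total = sum(f for _, f in candidates)  # F
    p = rng.random() * total               # p in [0, F)
    cumulative = 0.0
    for ce, f in candidates:
        cumulative += f
        if p < cumulative:
            return ce
    return candidates[-1][0]  # guard against floating-point rounding

rng = random.Random(0)  # fixed seed so the run is repeatable
weights = {"ce01": 0.8, "ce03": 0.3, "grid": 0.5}  # invented values
picks = [choose_ce(weights, rng) for _ in range(1000)]
print({ce: picks.count(ce) for ce in weights})
```

Because the choice is random rather than greedy, a CE with low fitness is still tried occasionally, so it can recover if it starts completing jobs again.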
Implementation
* Ganga script, part of diane distribution
* Follows Ganga directory structure:
    ../gangadir/
    ./agent_factory
    ./agent_factory/log
    ./agent_factory/agent_factory_data
    ./agent_factory/failure_log
    ./agent_factory/lockfile_17273
* File lock system to prevent concurrent access
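One common way to build a lock file that prevents concurrent access is exclusive file creation. The Python sketch below illustrates that general technique only; it is not taken from the Agent Factory script. The names `acquire_lock`, `release_lock`, and `LockHeld` are hypothetical, and the sketch uses a fixed file name with the process ID written inside, whereas the listing above shows a numeric suffix in the file name whose meaning the slide does not explain.

```python
import os
import tempfile

class LockHeld(Exception):
    """Raised when another instance already owns the lock file."""

def acquire_lock(directory, name="lockfile"):
    path = os.path.join(directory, name)
    try:
        # O_EXCL makes creation fail if the file already exists
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise LockHeld(path)
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))  # record the owner for debugging
    return path

def release_lock(path):
    os.remove(path)

workdir = tempfile.mkdtemp()  # stand-in for the agent_factory directory
lock = acquire_lock(workdir)
try:
    acquire_lock(workdir)     # a second instance would fail here
except LockHeld:
    print("another instance holds the lock")
release_lock(lock)
```

A lock of this kind stays behind if the process is killed before releasing it, so a real script would also need some way to detect and clear stale locks.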
Usage scenarios
Live process
    ganga --config=config-lostman.gear AgentFactory.py --diane-worker-number=1063 --diane-max-pending=50 --repeat-interval=300
Acrontab
    ganga --config=config-lostman.gear AgentFactory.py --diane-worker-number=1063 --diane-max-pending=50 --repeat-interval=300 --run-time=10800 >& /dev/null
    ganga --config=config-lostman.gear AgentFactory.py --kill
Limitations
* only one instance per workspace allowed
* simultaneously running Ganga may result in a crash
* finite, non-extensible proxy lifetime
Lattice QCD production data
Good computing element
Bad computing element
Run history (part 1)
Run history (part 2)
Lattice QCD: the story so far
* Production run using Gear VO
* Collaboration sites:
  * Swiss National Supercomputing Centre
  * National Institute for Subatomic Physics, Netherlands
  * CYFRONET, Poland
  * CERN
* ~2 million CPU hours in 3 months = 231 years on a single machine!
* ~4.3 TB of data transferred
Summary