Agent Factory: automatic job submission mechanism for Ganga/diane


  1. Agent Factory automatic job submission mechanism for Ganga/diane

  2. Presentation overview
     * Background
     * Motivation
     * Algorithm
     * Implementation
     * Usage scenarios
     * Lattice QCD production data

  3. Lattice QCD: searching for the QCD critical point
     * Discretize space and time into a 4-dimensional grid
     * Generate a sample of the most important configurations of quark and gluon fields
     * Evolve the “snapshots”, one Monte-Carlo step at a time

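  4. Computational model (diagram): three chains of snapshots, one per beta value, each advancing one iteration at a time
     * beta 5.18: snapshot iter 77, snapshot iter 78, ...
     * beta 5.1825: snapshot iter 351, snapshot iter 352, ...
     * beta 5.1845: snapshot iter 233, snapshot iter 234, ...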

  5. Setup (diagram): Agent Factory, grid, Run, Master, and three Worker Agents

  6. Algorithm
     * Automatically create and maintain diane Worker Agents
     * Adaptable heuristic approach; independent of the underlying application
     * Three simple phases:
       1. Evaluation
       2. Decision (nondeterministic!)
       3. Submission
     * Relies on application exit code only; no glue (JDL) requirements
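The three phases can be pictured as one pass of a loop. The following is a minimal Python sketch of that shape, not the AgentFactory.py code: `run_cycle`, `classify`, and the three callables (`evaluate`, `decide`, `submit`) are hypothetical names introduced for illustration, and treating exit code 0 as success is the usual convention, assumed here.

```python
from typing import Callable, Dict, List, Optional


def classify(exit_code: int) -> str:
    """Label a finished job from its exit code alone (assumes 0 means success)."""
    return "completed" if exit_code == 0 else "failed"


def run_cycle(
    evaluate: Callable[[], Dict[str, float]],
    decide: Callable[[Dict[str, float]], Optional[str]],
    submit: Callable[[Optional[str]], None],
    jobs_needed: int,
) -> List[Optional[str]]:
    """Run one evaluation / decision / submission pass and return the chosen targets."""
    scores = evaluate()              # phase 1: score each computing element
    targets: List[Optional[str]] = []
    for _ in range(jobs_needed):
        target = decide(scores)      # phase 2: pick a target for one job
        submit(target)               # phase 3: hand the job to the grid
        targets.append(target)
    return targets


# Usage with stand-in callables (all values are made up for the example).
submitted: List[Optional[str]] = []
targets = run_cycle(
    evaluate=lambda: {"ce01": 0.9, "ce03": 0.4},
    decide=lambda scores: max(scores, key=scores.get),  # deterministic stand-in
    submit=submitted.append,
    jobs_needed=3,
)
```

The stand-in `decide` always picks the best score so that the example is reproducible; the slide describes the real decision as nondeterministic.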

  7. Good / bad CE: where to draw the line?
     * Positive feedback (+): running jobs; successfully completed jobs (without errors)
     * Negative feedback (-): failed jobs; pending jobs (queueing); etc.

  8. Fitness = a measure of reliability
     * Job categories per CE (bar shown for ce01): running, completed (without errors), failed, pending, other
     * fitness = (running + completed) / total
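Reading the slide's fitness as (running + completed) / total, a small Python sketch of the calculation could look as follows. The function name, the status strings, and the `default` value returned for a CE with no job history are assumptions made for this example; the slide does not say how an untried CE is scored.

```python
from collections import Counter
from typing import Iterable


def fitness(statuses: Iterable[str], default: float = 1.0) -> float:
    """Fraction of a CE's jobs that are running or completed without errors.

    `default` is returned when the CE has no jobs yet (an assumption of this sketch).
    """
    counts = Counter(statuses)
    total = sum(counts.values())
    if total == 0:
        return default
    return (counts["running"] + counts["completed"]) / total


# Hypothetical job history for one CE: 2 good jobs out of 4.
ce01_fitness = fitness(["running", "completed", "failed", "pending"])
```

Failed, pending, and other jobs only enter through the total, so each of them lowers the score.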

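  9. Nondeterministic decision process
     * F = total fitness, summed over the candidates ce01, ce03, grid*
     * Draw a random p ∈ [0, F); the candidate whose share of F contains p is chosen
     * *CE chosen by WMS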
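Drawing p from [0, F) and picking the candidate whose interval contains it is fitness-proportionate ("roulette wheel") selection. The sketch below shows that technique in Python under stated assumptions: `choose_ce` is a hypothetical name, the fitness values are invented, and the `"grid"` key stands for the slide's grid* slot, where the CE is left to the WMS.

```python
import random
from typing import Dict, Optional


def choose_ce(fitness_by_ce: Dict[str, float], rng=random) -> Optional[str]:
    """Pick a candidate with probability proportional to its fitness."""
    total = sum(fitness_by_ce.values())        # F = total fitness
    if total <= 0:
        return None                            # nothing to choose from
    p = rng.random() * total                   # p in [0, F)
    cumulative = 0.0
    chosen = None
    for ce, value in fitness_by_ce.items():
        cumulative += value
        chosen = ce
        if p < cumulative:                     # p falls in this candidate's interval
            break
    return chosen


# Hypothetical scores; "grid" means "let the WMS choose the CE".
scores = {"ce01": 0.9, "ce03": 0.2, "grid": 0.5}
rng = random.Random(42)                        # seeded only to make the demo repeatable
picks = [choose_ce(scores, rng) for _ in range(1000)]
```

Because the choice is random rather than greedy, a CE with a low score still receives an occasional job, which gives it a chance to recover its fitness.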

  10. Implementation
      * Ganga script, part of diane distribution
      * Follows Ganga directory structure:
          ../gangadir/
          ./agent_factory
          ./agent_factory/log
          ./agent_factory/agent_factory_data
          ./agent_factory/failure_log
          ./agent_factory/lockfile_17273
      * File lock system to prevent concurrent access
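A file lock of this kind is commonly built on exclusive file creation. The following Python sketch illustrates that general technique only; the slide does not show how AgentFactory.py implements its lock or what the numeric suffix in the lock file name means, so `acquire_lock`, `release_lock`, and the plain `lockfile` name are assumptions for this example.

```python
import os
import tempfile


def acquire_lock(path: str) -> bool:
    """Create the lock file atomically; return False if it already exists."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))   # record the owner for debugging
    return True


def release_lock(path: str) -> None:
    """Remove the lock file; a missing file is not an error."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


# Demo in a throwaway directory (stands in for the agent_factory workspace).
with tempfile.TemporaryDirectory() as workdir:
    lock = os.path.join(workdir, "lockfile")
    first = acquire_lock(lock)    # succeeds
    second = acquire_lock(lock)   # refused while the lock is held
    release_lock(lock)
    third = acquire_lock(lock)    # succeeds again after release
```

`O_CREAT | O_EXCL` makes creation fail when the file already exists, so two processes cannot both believe they hold the lock.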

  11. Usage scenarios
      * Live process
          ganga --config=config-lostman.gear AgentFactory.py --diane-worker-number=1063 --diane-max-pending=50 --repeat-interval=300
      * Acrontab
          ganga --config=config-lostman.gear AgentFactory.py --diane-worker-number=1063 --diane-max-pending=50 --repeat-interval=300 --run-time=10800 >& /dev/null
          ganga --config=config-lostman.gear AgentFactory.py --kill
      * Limitations
          * only one instance per workspace allowed
          * simultaneously running Ganga may result in a crash
          * finite, non-extensible proxy lifetime

  12. Lattice QCD production data

  13. Good computing element

  14. Bad computing element

  15. Run history (part 1)

  16. Run history (part 2)

  17. Lattice QCD: the story so far
      * Production run using Gear VO
      * Collaboration sites:
          * Swiss National Supercomputing Centre
          * National Institute for Subatomic Physics, Netherlands
          * CYFRONET, Poland
          * CERN
      * ~2 million CPU hours in 3 months, equivalent to 231 years on a single machine!
      * ~4.3 TB of data transferred
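The "231 years" figure can be checked with a few lines of arithmetic. The sketch below assumes a 365-day year and takes "3 months" as 90 days; both are assumptions of this example, and the derived average core count is an illustration, not a number from the slides.

```python
HOURS_PER_YEAR = 24 * 365                 # 8760, assuming a 365-day year

single_machine_years = 231                # figure quoted on the slide
cpu_hours = single_machine_years * HOURS_PER_YEAR   # about 2 million CPU hours

run_hours = 90 * 24                       # assumption: 3 months taken as 90 days
average_busy_cores = cpu_hours / run_hours           # cores busy on average over the run
```

Under these assumptions the run kept roughly 940 cores busy on average.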

  18. Summary
