
What is Condor? Specialized job and resource management system



  1. Condor and the Grid
     Authors: D. Thain, T. Tannenbaum, and M. Livny
     Presenter: Ibrahim H Suslu
     CSC 7700 Data Intensive Distributed Computing, Fall 2006

     What is Condor?
     • Specialized job and resource management system (RMS) for compute-intensive jobs
       1. Users submit their jobs to Condor
       2. Condor chooses when and where to run them based upon a policy
       3. Condor monitors their progress
       4. Condor informs the user upon completion
     [Figure labels: Submit Jobs, Feedback]

     Condor Provides (like other full-featured systems)
     • A job management mechanism
     • Scheduling policy
     • Priority schema
     • Resource monitoring
     • Resource management

     Why Condor?
     • High-throughput computing
       – Provide large amounts of fault-tolerant computational power
       – Effective utilization of resources
     • Opportunistic computing
       – Use resources whenever available
     • ClassAds
       – Resource allocation language that describes resources and jobs
     • Job checkpoint and migration
       – Record a checkpoint and resume the application from it.
       – A checkpoint permits a job to migrate from one machine to another
     • Remote system calls
       – Preserve local execution environment

     The Philosophy of Flexibility
     • Let communities grow naturally
       – Relationships and obligations will develop according to user necessity
     • Plan without being picky
       – Be prepared to retry or reassign work when failures come
     • Leave the owner in control
       – Happy owners → more resources → higher throughput
     • Lend and borrow
       – Collaborate with related fields
     • Understand previous research

     Condor Kernel
     [Figure labels: User; Problem Solver (DAGMan, Master-Worker); Agent (schedd); Matchmaker (central manager); Resource (startd); Shadow; Sandbox; ClassAds; claim; job; details of the environment]
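The "record a checkpoint and resume the application from it" idea can be sketched at the application level. This is only an illustration under stated assumptions: Condor's own mechanism saves the state of a whole process, whereas the hypothetical `run_with_checkpoints` function below saves a small JSON state file by hand. The function name, the file layout, and the `fail_at` parameter (used to simulate an eviction) are all invented for this example.

```python
import json
import os
import tempfile


def run_with_checkpoints(total_steps, ckpt_path, fail_at=None):
    """Sum the integers 0..total_steps-1, saving progress after every step.

    Hypothetical helper: if ckpt_path exists, work resumes from the saved
    state instead of starting over.
    """
    state = {"next_step": 0, "acc": 0}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)  # resume from the last checkpoint

    for step in range(state["next_step"], total_steps):
        if fail_at is not None and step == fail_at:
            # Simulate the machine being reclaimed by its owner.
            raise RuntimeError("simulated eviction")
        state["acc"] += step
        state["next_step"] = step + 1
        # Write to a temporary file, then rename, so a crash in the middle
        # of a write never leaves a half-written checkpoint behind.
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, ckpt_path)
    return state["acc"]
```

Because the checkpoint is an ordinary file, a second call with the same path can run on a different machine that sees that file, which is the sense in which a checkpoint permits migration.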

  2. Typical Condor Pool
     [Figure: typical Condor pool; labels not recoverable]

     Flocking
     • Links pools of resources
     • Gateway Flocking
       – Organizational level
       – Transparent
     • Direct Flocking
       – One individual to another
     [Figure: flocking diagram; only the label "Organization" is recoverable]

     Planning and Scheduling
     • Planning
       – Acquisition of resources by users
       – Concerned with ‘what’ and ‘where’
     • Scheduling
       – Management of a resource by its owner
       – Concerned with ‘who’ and ‘when’

     Matchmaker
     • Bridge between planning and scheduling
     • Agents and resources advertise characteristics and requirements as ClassAds
     • Pairs satisfying each other’s constraints are created
     • Both parties are informed
     • Claiming: independent authorization and authentication

     Condor Architecture overview I
     [Figure: labels not recoverable]

     Condor Architecture overview II
     [Figure: labels not recoverable]
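The matchmaker steps above (advertise, pair ads whose constraints are mutually satisfied, inform both parties) can be sketched as follows. This is a toy model, not Condor's matchmaker: ads are plain Python dictionaries, `Requirements` and `Rank` are Python callables rather than ClassAd expressions, and the machine names and attribute values are hypothetical.

```python
def mutually_acceptable(job, machine):
    """True when each ad's Requirements accepts the other ad."""
    return (job["Requirements"](job, machine)
            and machine["Requirements"](machine, job))


def matchmake(jobs, machines):
    """Pair each job with its best-ranked, still unclaimed, compatible machine."""
    free = list(machines)
    pairs = []
    for job in jobs:
        candidates = [m for m in free if mutually_acceptable(job, m)]
        if not candidates:
            continue  # no compatible machine; the job stays unmatched
        best = max(candidates, key=lambda m: job["Rank"](job, m))
        free.remove(best)
        pairs.append((job["Owner"], best["Machine"]))  # both parties would be informed
    return pairs


def machine_ad(name, arch, memory, kflops, load):
    """Build a hypothetical machine ad that only accepts work when lightly loaded."""
    return {
        "Machine": name, "Arch": arch, "OpSys": "LINUX", "Disk": 600000,
        "Memory": memory, "KFlops": kflops, "Load": load,
        "Requirements": lambda my, other: my["Load"] < 3000,
    }


# Hypothetical sample ads, for illustration only.
job = {
    "Owner": "tannenba",
    "DiskUsage": 6000,
    "Requirements": lambda my, other: (other["Arch"] == "INTEL"
                                       and other["OpSys"] == "LINUX"
                                       and other["Disk"] > my["DiskUsage"]),
    "Rank": lambda my, other: other["Memory"] * 10000 + other["KFlops"],
}
machines = [
    machine_ad("slow.example", "INTEL", 256, 1000, 100),
    machine_ad("fast.example", "INTEL", 512, 2000, 100),
    machine_ad("busy.example", "INTEL", 1024, 3000, 5000),   # rejects jobs: too loaded
    machine_ad("sparc.example", "SPARC", 2048, 4000, 100),   # rejected by the job
]
```

In this sketch the match only proposes a pair; the separate claiming step, with its own authentication and authorization, is left out.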

  3. ClassAds
     • Resource allocation language
       – Attribute name-value pairs
       – No specific schema
     • Requirements
       – Constraints; for a match these should evaluate to true
     • Rank
       – Desirability of a match

     Job ClassAd
         [
         MyType = "Job"
         TargetType = "Machine"
         Requirements = ((other.Arch=="INTEL" && other.OpSys=="LINUX") && other.Disk > my.DiskUsage)
         Rank = (Memory * 10000) + KFlops
         Cmd = "/home-exe"
         Department = "CompSci"
         Owner = "tannenba"
         DiskUsage = 6000
         ]

     Machine ClassAd
         [
         MyType = "Machine"
         TargetType = "Job"
         Machine = "tnt.isi.edu"
         Requirements = (Load<3000)
         Rank = dept==self.dept
         Arch = "Intel"
         OpSys = "Linux"
         Disk = 600000
         ]

     Problem Solvers
     • High-level structure built on top of the Condor agent
     • Manage large numbers of jobs
       – Concerned with the application-specific details of ordering and task selection
     • Relies on a Condor agent in two ways
       – Uses the agent as a service for reliably executing jobs
       – Making the problem solver itself reliable
     • Two are provided with Condor
       – Master-worker (MW)
         • System for solving a problem of indeterminate size on a large and unreliable workforce
       – Directed acyclic graph manager (DAGMan)
         • Service for executing multiple jobs with dependencies in a declarative form

     Split Execution
     • Facilitates successful remote execution of jobs
     • Shadow represents the user to the system
       – Has information that specifies the job at run time
         • Executables, arguments, input files...
     • Sandbox is responsible for giving the job a safe place to play
       – Creates an environment for job execution
     • A matched Sandbox and Shadow form the universe

     Condor Universes
     • Create a specific job environment
     • Defined by a matched sandbox and shadow
     • Different Universes provide different functionality for your job:
       – Standard: Support for transparent process checkpoint and restart
       – Vanilla: Run any serial job
       – Java: Provide a complete Java environment
       – Globus: Manage your Grid jobs

     Standard Universe
     • Requires re-linking your program with a special library provided by Condor
     • Allows checkpointing and remote system calls
       – Checkpointing
         • Condor’s process checkpointing mechanism saves all the state of a process into a checkpoint file
         • Memory, CPU, I/O, job details, etc.
         • The process can then be restarted from right where it left off
       – Remote System Calls
         • Provides an I/O service over a secure RPC channel
         • Provides remote access to the user’s home storage device
       – Multi-process jobs are not allowed
       – Interprocess communication is not allowed

     Vanilla Universe
     • You can run any program
       – C/C++/Perl/Python/Fortran/Java/Lisp…
       – No checkpointing: if your job is interrupted or the machine crashes, Condor has to restart it from the beginning.
       – No remote system calls
         • Input and output files
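The idea behind "executing multiple jobs with dependencies in a declarative form" can be sketched with Python's standard `graphlib` module (Python 3.9 or later). This is not DAGMan and does not use its input format; the `run_dag` function, the job names, and the `run_job` callback are hypothetical. Dependencies are declared as a mapping from each job to the set of jobs that must finish first.

```python
from graphlib import TopologicalSorter


def run_dag(dependencies, run_job):
    """Run every job only after all of its parent jobs have finished.

    dependencies maps a job name to the set of jobs it depends on.
    run_job is a callback that executes a single job.
    Returns the order in which jobs were run.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    order = []
    while sorter.is_active():
        # get_ready() returns the jobs whose parents are all done;
        # sorting them only makes this sketch deterministic.
        for job in sorted(sorter.get_ready()):
            run_job(job)
            order.append(job)
            sorter.done(job)  # unblocks the children of this job
    return order
```

A real problem solver would hand each ready job to the agent and call `done` only when the agent reports completion, which is where retrying failed jobs would fit.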

  4. Java Universe
     • Works better for Java programs
     • Checks for a valid Java environment
     • Distinguishes Java environment exceptions from program exceptions
     • No checkpointing
     • Remote I/O

     Globus Universe
     • Advantages of using Condor-G to manage your Grid jobs
       – Full-featured queuing service
       – Credential management
       – Fault tolerance
     • Disadvantages
       – No matchmaking or dynamic scheduling of jobs
       – No job checkpoint or migration
       – No remote system calls
     “Gliding in”: combines the reach of Condor-G with the features of Condor

     Condor-G
     • Computation management agent for Grid computing
       – Merges Globus and Condor technologies
     [Figure labels: Application, problem solver…; Condor-G (job submission); Globus Toolkit (resource discovery, authentication…); Condor (job execution); Processing, storage…]

     Access to Data in Condor
     • Use shared filesystem if available
     • No shared filesystem?
       – Condor can transfer files
         • Can automatically send back changed files
         • Atomic transfer of multiple files
         • Can be encrypted over the wire
       – Remote I/O Socket
       – Standard Universe can use remote system calls

     Which Universe?
     • Standard:
       – Good for mixed Condor pools, flocked pools, and the Grid at large.
     • Vanilla:
       – Good for a Condor pool of identical machines
     • Java:
       – Good for Java applications
     • Globus:
       – Good for Globus jobs
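The "Which Universe?" guidance can be written down as a small decision helper. This is only a restatement of the slide under assumptions: the `suggest_universe` function and its parameters are hypothetical, and the order in which the questions are asked (Globus first, then Java, then pool homogeneity) is a choice made for this sketch, not something the slides specify.

```python
def suggest_universe(is_globus_job=False, is_java=False, identical_machines=False):
    """Return the universe the slide's rules of thumb point to.

    Hypothetical helper; the precedence of the checks is an assumption.
    """
    if is_globus_job:
        return "globus"    # good for Globus jobs
    if is_java:
        return "java"      # good for Java applications
    if identical_machines:
        return "vanilla"   # good for a pool of identical machines
    return "standard"      # mixed pools, flocked pools, the Grid at large
```

The standard universe also requires re-linking the program with Condor's library, so a job that cannot be re-linked would need a different choice than this helper suggests.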
