Production & Reprocessing “Unified” Overview Jean-Roch Vlimant
Big Picture
[Diagram; labels as extracted: O&C, PPD, Physics, McM, Data Mngt, ?, Unified Workflow Management (THIS TALK), HTCondor, T0, T1s, T2s]
12/06/16 CMS Alca/DB Meeting, Unified, JR Vlimant
In a Nutshell
● Automatic transfers, automatic assignment, workload overflowing (see backup slides)
● Simple sqlite DB with 3 tables
  https://github.com/CMSCompOps/WmAgentScripts/blob/master/Unified/assignSchema.py
  ➔ Actual DB file on afs
    ● 80k workflows (460 char)
    ● 124k outputs (430 char, 5 int)
    ● 37k transfers (1 int, 1 pickled string representing a vector of int of size <40)
  ➔ Lightweight schema (right?)
  ➔ Plan on adding one more table temporarily, depending on needs, for locking datasets/blocks. Should never exceed 5k entries
● Access pattern
  ➔ Currently have difficulties with concurrent updates to the table, on different records
  ➔ Commits are essentially status changes (30 char) and insertions of outputs (5 per 10s) or transfers, plus rare modifications of transfer attributes
  ➔ Reads use wide filters, ~1000 workflows at a time, every 20 to 30 min
● Load foreseen
  ➔ Might go up to 3-4x ~1000 workflows at a time
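The access pattern above can be illustrated with a minimal sketch. It uses the standard-library `sqlite3` module rather than sqlalchemy, and every table, column, and function name is hypothetical and much simplified; the real schema is the one in assignSchema.py.

```python
import pickle
import sqlite3

# Hypothetical, simplified tables; the actual schema lives in assignSchema.py.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE workflow (id INTEGER PRIMARY KEY, name TEXT, status TEXT);
CREATE TABLE output   (id INTEGER PRIMARY KEY, datasetname TEXT, workflow_id INTEGER);
CREATE TABLE transfer (id INTEGER PRIMARY KEY, phedexid INTEGER, workflow_ids BLOB);
""")

def set_status(conn, name, status):
    """Typical commit: a short status change on a single workflow record."""
    with conn:  # commits on success, rolls back on error
        conn.execute("UPDATE workflow SET status = ? WHERE name = ?", (status, name))

def workflows_like(conn, status_prefix):
    """Typical read: a wide filter on status, returning many workflows at once."""
    rows = conn.execute(
        "SELECT name FROM workflow WHERE status LIKE ? ORDER BY name",
        (status_prefix + "%",),
    )
    return [row[0] for row in rows]

def add_transfer(conn, phedexid, workflow_ids):
    """Store one integer plus a pickled vector of integers."""
    with conn:
        conn.execute(
            "INSERT INTO transfer (phedexid, workflow_ids) VALUES (?, ?)",
            (phedexid, pickle.dumps(list(workflow_ids))),
        )

def transfer_workflows(conn, phedexid):
    """Unpickle the workflow id vector of a transfer, or [] if unknown."""
    row = conn.execute(
        "SELECT workflow_ids FROM transfer WHERE phedexid = ?", (phedexid,)
    ).fetchone()
    return pickle.loads(row[0]) if row else []
```

Each write is a small single-record transaction, which is the kind of commit that a backend with row-level concurrency would handle better than a single sqlite file.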
Integration
● Using sqlalchemy
● Tested on devdb12 (June, July?)
  ➢ Maxed out the account quota very fast; could perform read/write and automatic indexing
● Tested on int9r (Dec)
  ➢ Same schema, no issues
  ➢ Looking forward to a backed-up, reliable table that supports concurrent updates
Unified State Diagram
[Diagram; state and transition labels as extracted: From assignment-approved, Cloned, considered, Input needed, No input needed, staging, Input available, staged, forget, Assignment, Rejected, trouble, away, issues, completed, Aborted, assistance, close, Closed-out, done]
https://cmst2.web.cern.ch/cmst2/unified/
ReqMgr2 State Diagram
https://cmsweb.cern.ch/wmstats/index.html
Strategies 1/3
● If any input is needed
  ➔ Distribute primary in N copies (configurable)
    ✔ Among sites matching the workflow requirements (Core, Mem, quota, …)
    ✔ Any existing block counts towards the N copies
    ✔ Distributed in chunks of 4TB
    ✔ Any existing subscription counts towards the N copies
    ✔ Locks the dataset from deletion
● When all requirements are met
  ➔ The N copies are ready at sites matching workflow requirements
  ➔ A transfer appears stuck
    ✔ Early start with ≥1
  ➔ Send back to setting a transfer if too many end points are in downtime
● Assignment in request manager (all below configurable by campaign)
  ➔ Set the lfn from parents, campaign, or default to /store/mc
  ➔ Set the mem/time watchdogs
  ➔ Tune the splitting (pre-assignment)
  ➔ Use as many compatible sites as possible
  ➔ Set xrootd flags on primary and/or secondary
  ➔ Pick a site for full copy among whitelist
    ✔ T1 first, then T2
    ✔ DDM-buffer enforced
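The "N copies in chunks of 4TB" rule can be sketched as follows. This is only an illustration under stated assumptions: the function names, the greedy chunking, and the rotation of sites between chunks are hypothetical and are not taken from the Unified code; only the 4TB chunk size and the rule that existing replicas count towards N come from the slide.

```python
CHUNK_TB = 4.0  # chunk size quoted on the slide

def chunk_blocks(blocks, chunk_tb=CHUNK_TB):
    """Greedily group blocks, in order, into chunks of at most chunk_tb.

    blocks: {block_name: size_in_TB}. An oversized block forms its own chunk.
    """
    chunks, current, size = [], [], 0.0
    for name, tb in blocks.items():
        if current and size + tb > chunk_tb:
            chunks.append(current)
            current, size = [], 0.0
        current.append(name)
        size += tb
    if current:
        chunks.append(current)
    return chunks

def plan_copies(blocks, existing, sites, n_copies):
    """Return {site: [blocks to transfer]} so each block reaches n_copies.

    existing: {block_name: set of sites already holding or subscribed to it}
    sites:    list of sites matching the workflow requirements
    Replicas already present at matching sites count towards n_copies.
    """
    if not sites:
        return {}
    plan = {site: [] for site in sites}
    for i, chunk in enumerate(chunk_blocks(blocks)):
        # Assumption: rotate the site list so successive chunks spread out.
        shift = i % len(sites)
        rotated = sites[shift:] + sites[:shift]
        for block in chunk:
            held = existing.get(block, set()) & set(sites)
            missing = max(n_copies - len(held), 0)
            for site in [s for s in rotated if s not in held][:missing]:
                plan[site].append(block)
    return plan
```

A block that already has enough replicas at matching sites produces no new transfer, which mirrors the "any existing block counts" bullets.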
Strategies 2/3
● When WorkQueue elements appear with no location
  ➔ Transfer the missing blocks to a site in the whitelist
● When workflow has too many pending jobs
  ➔ Overflow to neighboring (hard-coded) sites
● Adapt job memory requirement (job classad)
  ➔ From history of successful jobs per task: 95th percentile of the memory distribution, with >100 successful jobs
  ➔ Hook for job time requirement not used
● Truncate the processing
  ➔ If >99% after 7 days, force complete 10 per cycle
  ➔ If the requester asks for it via McM API
● Verify the output (see later if something is not right)
  ➔ Completed in terms of number of lumisections (95% pass bar, 100% for data), falling back to number of events in case of request manager corruption
  ➔ Lumisection size is “small enough” (mostly ignored now)
  ➔ DBS/Phedex file count consistency, lfn consistency
  ➔ Output dataset consistency
  ➔ Tape subscription made when applicable
  ➔ Duplicate lumi
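Two of the rules above are simple enough to sketch: the memory requirement taken from the 95th percentile of successful jobs, and the completion bar on lumisections. The function names, the nearest-rank percentile definition, and the way "request manager corruption" is detected (a missing expected lumi count) are assumptions for illustration, not the Unified implementation.

```python
def memory_requirement(successful_mem_mb, min_jobs=100, percent=95):
    """Return the 95th-percentile memory of successful jobs of a task.

    Returns None unless there are more than min_jobs successful jobs.
    Uses the nearest-rank percentile (an assumption; the slide does not
    say which percentile definition is used).
    """
    n = len(successful_mem_mb)
    if n <= min_jobs:
        return None
    ordered = sorted(successful_mem_mb)
    rank = -(-percent * n // 100)  # integer ceiling of percent * n / 100
    return ordered[rank - 1]

def passes_completion(produced_lumis, expected_lumis, is_data,
                      produced_events=0, expected_events=0):
    """Apply the pass bar: 95% of lumisections, 100% for data.

    Assumption: an expected lumi count of 0 stands for corrupted request
    manager information, in which case event counts are compared instead.
    """
    bar_percent = 100 if is_data else 95
    if expected_lumis > 0:
        return produced_lumis * 100 >= bar_percent * expected_lumis
    if expected_events > 0:
        return produced_events * 100 >= bar_percent * expected_events
    return False
```

Integer arithmetic is used for both thresholds so that a workflow sitting exactly on the bar is not misjudged by floating-point rounding.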
Strategies 3/3
● Subscribe most production blocks to DataOps
  ➔ For on-going, closed-out, and announced workflows
    ✗ Some blocks from aborted workflows might be unclaimed
● Create Harvesting
  ➔ For data requests, once the /DQMIO is in full somewhere, extract info from the original workflow and make the harvesting (no further check on statuses)
● Set workflow announced
  ➔ Triggering condition for next step in McM
● Set the status VALID
  ➔ Although datasets are usable beforehand
  ➔ Dataset might not be in full at a single site, but all blocks are out at production sites
● Send analysis datasets to AnalysisOps when applicable
  ➔ DDM scripts turn the full subscriptions over to AnalysisOps
  ➔ Production blocks are left as DataOps
● Locks are released
  ➔ When requested tape copy is completed
  ➔ When no other workflow uses that /LHE or 30 days
  ➔ If INVALID or nonexistent (aborted/rejected workflow)
  ➔ ++
When Things Go Wrong
● Stuck transfers are exposed to transfer team
  ➔ Both to disk and tape
    https://cmst2.web.cern.ch/cmst2/unified/data.html
● Workflow with high failure rate
  ➔ Shifters' alert
  ➔ Inspected and manually aborted
  ➔ Automated notification to requester via McM (if via Unified)
● Workflow and dataset to be rejected
  ➔ Detected from McM and operated
● When requester asks for an update
  ➔ https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=HIG-RunIISpring16DR80-01780
  ➔ http://dabercro.web.cern.ch/dabercro/unified/showlog/?search=HIG-RunIISpring16DR80-01780
  ➔ All needed information is pretty much available
● Workflow with some failures
  ➔ And does not pass the completion bar: see next slide
Recoveries
● Investigation of errors
  ➔ Itemized at https://cmst2.web.cern.ch/cmst2/unified/assistance.html
  ➔ Drill down to error report on https://cmsweb.cern.ch/wmstats/index.html
  ➔ Categorized by most popular issues and cast into one of:
    ACDC: fetches from request manager the missing bits from failing jobs, creates a new set of jobs and submits them
    Clone: just restart from square one, Unified picks it back up
    Recovery: evolved procedure to create an ACDC document with what is needed to remake the missing data (mostly used for data rereco)
    Extension: create new events in non-overlapping lumisections (rarely used these days)
  ➔ ACDC are partially handled by Unified automatically, following some rules
    >20% 50660: bump memory by 1G
    >20% 50664: split x2
    >20% 61104: plain recovery
    >20% 8028: plain recovery
    >20% 8021: plain recovery (if FileReadError)
    >20% 8001: split x4 (if No lhe event found in ExternalLHEProducer)
  ➔ Things usually clear out on first round
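The automatic ACDC rules above amount to a small lookup table keyed by error code. The sketch below is hypothetical: it assumes the ">20%" threshold is the fraction of a task's failed jobs showing a given error code, and that the parenthesised conditions are substrings of the error message. Neither assumption is confirmed by the slide.

```python
# (error code, action, substring the error message must contain, or None)
RULES = [
    (50660, "bump memory by 1G", None),
    (50664, "split x2", None),
    (61104, "plain recovery", None),
    (8028, "plain recovery", None),
    (8021, "plain recovery", "FileReadError"),
    (8001, "split x4", "No lhe event found in ExternalLHEProducer"),
]
THRESHOLD_PERCENT = 20

def acdc_actions(failures):
    """Return the actions triggered by the failures of one task.

    failures: list of (error_code, message) tuples, one per failed job.
    A rule fires when strictly more than 20% of the failures match it.
    """
    total = len(failures)
    actions = []
    for code, action, needle in RULES:
        hits = sum(
            1 for err, msg in failures
            if err == code and (needle is None or needle in msg)
        )
        # Integer comparison avoids float rounding at exactly 20%.
        if total and hits * 100 > THRESHOLD_PERCENT * total:
            if action not in actions:
                actions.append(action)
    return actions
```

For example, a task where 3 of 10 failures are 50660 and 3 are 8021 with a FileReadError message would get both a memory bump and a plain recovery under these assumptions.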
When Things Go Really Bad
● Causes of very large tails (not in order of importance)
  ➔ Long-lasting workflow has merge issues → recovery fails because the unmerged files are being removed at the site
    A solution is to have the site clean up according to a list that can be extracted from request manager
  ➔ Lots of workflows to be inspected by hand, workflows with low priority looked at last → recovery fails because unmerged files are gone:
    Do much less by hand (increase automation)
    Do things much faster by hand (see Dan's slides)
  ➔ Site going down → clone → another site going down, …:
    Solved by using a more reliable site at last stage
    Maybe need to prune sites by availability?
    Maybe match estimated workload to site mean uptime?
  ➔ Performance issue → clone with good splitting → other issue → clone with initial splitting, …
    Operator interference, bad issue tracking, ...
  ➔ Data reprocessing needing ~100% completion
    ✗ Large lumi prevents job creation
    ✗ Segfault = no fwjreport
    ✗ ACDC of ACDC finish with no error
    ✗ Assignment mistakes
    ✗ Bad issue tracking
    ...