Advancement of usage of TaskChain in production, J-R Vlimant


  1. Advancement of usage of TaskChain in production J-R Vlimant

  2. In A Nutshell
     ● TaskChain is the most flexible type of workflow
     ● One cmsRun per “task”
     ● A root task either reads from an input dataset or generates events
       ● wmLHE and pLHE enabled
     ● Each subsequent task feeds from one of the output modules of one of the preceding tasks
     ● Trees of tasks are possible
       ● A → B → C1 → D1 and B → C2 → D2 (C2 → D3 and so on)
     ● Job splitting is either set explicitly (#events/job, #lumis/job) or automatic using time/event (N.B. #events/lumi fully functioning)
     ● All outputs are exposed to computing up-front
     ● PROS
       ● In a multi-campaign mode of operation, reduces the number of workflows (items in request manager) from N>1 to 1
       ● No intermediate manipulation of datasets
       ● No latency in assigning the next workflows
       ● No latency and less manual operation in creating tape families
     ● CONS
       ● Full chain has to be tested at once: a change of mode of operation for gen contacts
       ● Recovery workflows can become complicated with a large number of tasks: a change of operation for ops
       ● The chain has one priority
       ● All requests need to run at the same site (no T2 → T1 relocation)
     Post-MccM Discussion, J-R Vlimant, 9/19/14
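The tree of tasks described on this slide can be pictured with a small Python sketch. The task names A, B, C1, D1, C2, D2 and D3 come from the slide; the `input_task` field and the `chains` helper are illustrative assumptions, not the WMCore request schema, and the sketch assumes exactly one root task.

```python
from collections import defaultdict

# Each task names the preceding task it feeds from (None for the root task).
# The field name "input_task" is an illustrative assumption.
TASKS = [
    {"name": "A", "input_task": None},
    {"name": "B", "input_task": "A"},
    {"name": "C1", "input_task": "B"},
    {"name": "D1", "input_task": "C1"},
    {"name": "C2", "input_task": "B"},
    {"name": "D2", "input_task": "C2"},
    {"name": "D3", "input_task": "C2"},
]


def chains(tasks):
    """Return every root-to-leaf path of the task tree, in declaration order."""
    children = defaultdict(list)
    roots = []
    for task in tasks:
        if task["input_task"] is None:
            roots.append(task["name"])
        else:
            children[task["input_task"]].append(task["name"])
    if len(roots) != 1:
        # Assumption of this sketch: a chain has exactly one root task.
        raise ValueError("expected exactly one root task")

    paths = []

    def walk(name, path):
        path = path + [name]
        if not children[name]:
            paths.append(path)  # a leaf closes one branch of the tree
        for child in children[name]:
            walk(child, path)

    walk(roots[0], [])
    return paths


for path in chains(TASKS):
    print(" → ".join(path))
```

Each printed line is one branch that would otherwise be a separate sequence of workflows; in a TaskChain they all live in a single request.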

  3. Already Tested
     ● Years of operation of release validation samples
       ● Although job splitting was always set explicitly
     ● Treating eos-based .lhe files in input https://github.com/dmwm/WMCore/issues/4871
     ● #Events per lumi https://github.com/dmwm/WMCore/issues/4872
     ● Doing wmLHE and gen-sim in a single workflow
       ● 2 requests in McM
       ● 2 tasks in the taskchain
       ● https://vlimant.web.cern.ch/vlimant/SUS-Fall13wmLHE-00011_dict_2t.json
       ● Outputs 2 datasets as if they were processed in two different workflows, without the dataset manipulation latency
     ● Doing trees of requests from SUS-Fall13wmLHE-00011
       ● https://cms-pdmv.cern.ch/mcm/chained_requests?root_request=SUS-Fall13wmLHE-0001
       ● 1 wmLHE, 1 gen-sim, 2 digi-reco, 1 mini-aod: 5 workflows compared to one taskchain
       ● https://vlimant.web.cern.ch/vlimant/SUS-Fall13wmLHE-00011_dict_at.json
       ● The last clone made by Alan succeeded, with only an AODSIM output dataset collision due to wrong assignment
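The difference between explicit job splitting and splitting derived from time/event can be illustrated with some simple arithmetic. This is only a sketch: the function names, the 8-hour target job length and the example numbers are assumptions made for illustration, and the actual splitting logic in WMCore is not described on the slides.

```python
import math


def events_per_job(time_per_event_s, target_job_s):
    """Automatic splitting: how many events fit in the target job duration."""
    if time_per_event_s <= 0:
        raise ValueError("time per event must be positive")
    return max(1, int(target_job_s // time_per_event_s))


def lumis_per_job(events_in_job, events_per_lumi):
    """Translate an events/job figure into whole lumi sections per job."""
    return max(1, events_in_job // events_per_lumi)


def number_of_jobs(total_events, events_in_job):
    """Jobs needed to cover all events, the last job possibly being shorter."""
    return math.ceil(total_events / events_in_job)


# Assumed example: 20 s/event, 8-hour jobs, 100 events per lumi, 1M events.
per_job = events_per_job(time_per_event_s=20, target_job_s=8 * 3600)
print(per_job)                                       # 1440
print(lumis_per_job(per_job, events_per_lumi=100))   # 14
print(number_of_jobs(1_000_000, per_job))            # 695
```

With explicit splitting the first figure is simply given in the request; the automatic mode replaces it with a value derived from the measured time per event.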

  4. Already Developed (1/3)
     ● Testing script for the full chain request (March 2014) https://cms-pdmv.cern.ch/mcm/public/restapi/chained_requests/
       ● get_setup/<chained request id>
       ● Set up and run each request one after the other
     ● Testing API for chained requests (March 2014) https://cms-pdmv.cern.ch/mcm/restapi/chained_requests/
       ● Test/<chained request id>
       ● Threaded run test of the chain
       ● Verification of performance & efficiency measured
       ● Requires certificates and xrootd enabled
     ● Creating the taskchain dictionary from
       ● A chained request ID (March 2014) https://cms-pdmv.cern.ch/mcm/public/restapi/chained_requests/get_dict/SUS-chain_Fall13wmLHE_flowWMLHEtoF13_flowS14P
         ● Handles only the requests that are part of the chain
         ● N.B. The link has scratch=true which unfolds the whole chain
       ● A request ID (August 2014) https://cms-pdmv.cern.ch/mcm/public/restapi/chained_requests/get_dict/SUS-Fall13wmLHE-00011?scratch=true
         ● Looks for the tree of requests from the chains the request is involved with
         ● N.B. The link has scratch=true which unfolds the whole chain
     ● Injection of taskchain (March 2014)
       ● wmcontrol is provided with the URL to the dictionary https://github.com/cms-PdmV/wmcontrol/commit/0a2352e7866a61cf41fb31afa334f4f268f8a415
       ● Everything is done within McM
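The `get_setup` and `get_dict` links quoted on this slide follow a simple pattern, which the sketch below reproduces. It only builds URL strings and does not contact McM; the helper names `get_setup_url` and `get_dict_url` are made up for illustration, while the base URL, the endpoint names and the `scratch=true` parameter are taken from the slide.

```python
from urllib.parse import quote, urlencode

# Base URL of the public chained-request REST API, as quoted on the slide.
BASE = "https://cms-pdmv.cern.ch/mcm/public/restapi/chained_requests/"


def get_setup_url(chained_request_id):
    """URL of the setup script for a full chained request."""
    return BASE + "get_setup/" + quote(chained_request_id, safe="")


def get_dict_url(prepid, scratch=False):
    """URL of the taskchain dictionary for a chained request or request ID.

    scratch=True appends the parameter that, per the slide, unfolds the
    whole chain.
    """
    url = BASE + "get_dict/" + quote(prepid, safe="")
    if scratch:
        url += "?" + urlencode({"scratch": "true"})
    return url


print(get_dict_url("SUS-Fall13wmLHE-00011", scratch=True))
```

The printed URL matches the request-ID link given on the slide.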

  5. Already Developed (2/3)
     ● Labelling of the output dataset “processingstring” (March 2014)
       ● Application of experience with relvals
       ● Greatly simplifies the assignment of TaskChains
     ● Registering statistics and status of multiple output datasets (August 2014)
       ● Required for proper toggling of done status with completed events in McM
     ● Reduction of stats DB size by making a history member of each doc (August 2014)
       ● From 23 GB to 500 MB
       ● Growth plot fully available and made simpler to make
     ● Button for chained request testing available to gen contacts (September 2014)
       ● Fixed for unintentional reset of requests
     ● Approval toggling from gen contact & convener (September 2014)
       ● Once validation is finished, status is toggled
       ● Toggling to define then approve in the regular way
     ● Injection of taskchain and batch texting (September 2014)
       ● Injection is now threaded and locked
       ● Subject & text of the pilot batch was ambiguous https://hypernews.cern.ch/HyperNews/CMS/get/dataopsrequests/5546.html
       ● and now fixed https://github.com/cms-PdmV/cmsPdmV/pull/652
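The idea of keeping the history as a member of each stats document can be sketched as follows. The field names (`history`, `time`, `completed_events`), the document ID and the rule that skips unchanged snapshots are all assumptions made for illustration; the slides do not describe the actual layout of the stats database.

```python
def record_snapshot(doc, timestamp, completed_events):
    """Append a compact snapshot to the document's own history list."""
    history = doc.setdefault("history", [])
    # Extra assumption of this sketch: snapshots that carry no new
    # information are skipped, which keeps the document small.
    if history and history[-1]["completed_events"] == completed_events:
        return doc
    history.append({"time": timestamp, "completed_events": completed_events})
    return doc


def growth_curve(doc):
    """Points for a growth plot, read straight from the embedded history."""
    return [(h["time"], h["completed_events"]) for h in doc.get("history", [])]


doc = {"_id": "example_workflow"}  # hypothetical document
for when, events in [(1, 0), (2, 500), (3, 500), (4, 1200)]:
    record_snapshot(doc, when, events)

print(growth_curve(doc))  # [(1, 0), (2, 500), (4, 1200)]
```

Because the history travels with the document, a growth plot needs a single read rather than a scan over many snapshot documents.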

  6. Already Developed (3/3)
     ● Toggling of status to done using multiple outputs (September 2014)
       ● A few typos fixed
       ● Worked out of the box, with regular request inspection https://cms-pdmv.cern.ch/mcm//requests?member_of_chain=HIG-chain_Summer12_flowS12to53-00264&page=0&shown=146297325599
     ● Protection against dataset name collision (September 2014)
       ● PR https://github.com/cms-PdmV/cmsPdmV/pull/658
       ● Required to prevent TaskChains from creating collisions with existing requests
       ● Functions with indirect injection of taskchain, i.e. when toggling submit approval
       ● Does not operate with direct injection, i.e. using /restapi/chained_requests/inject/<id>
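A dataset name collision check of the kind described on this slide can be pictured as a set lookup done before injection. This is a minimal sketch, not the code from the pull request: the function name `find_collisions` and the dataset names are hypothetical.

```python
def find_collisions(new_datasets, existing_datasets):
    """Return the output dataset names that would clash.

    A name clashes if an existing request already produces it, or if two
    tasks of the same chain would write to it.
    """
    existing = set(existing_datasets)
    seen = set()
    collisions = []
    for name in new_datasets:
        if name in existing or name in seen:
            collisions.append(name)
        seen.add(name)
    return collisions


# Hypothetical dataset names, for illustration only.
existing = ["/Sample/Campaign-label-v1/AODSIM"]
planned = [
    "/Sample/Campaign-label-v1/GEN-SIM",
    "/Sample/Campaign-label-v1/AODSIM",
]
print(find_collisions(planned, existing))
# ['/Sample/Campaign-label-v1/AODSIM']
```

An injection step would refuse to proceed while the returned list is not empty.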

  7. On-Going
     ● Pilot batch of TaskChain from McM
       ● From HIG mass scan https://cms-pdmv.cern.ch/mcm/requests?dataset_name=*FilterMuOrEle15*&member_of_campaign=Summer12
       ● Extra mass point (55) added, validated https://cms-pdmv.cern.ch/mcm/requests?prepid=HIG-Summer12-02258
       ➔ Completed after a few manual steps
       ➔ Issue with ACDC not solved yet
     ● Brainstorming on assignment (Ops), quoting chats with Alan
       ● Adapt the scripts that look for possible job locations based on input datasets, whether primary or pileup
       ● Adapt possible modifications to job splitting made by assignment scripts
       ● Allocate TaskChains to sites based on resource availability
       ➔ No feedback yet
       ➔ Proper site whitelist wasn't used in the pilot, which led to failures in digi-reco https://hypernews.cern.ch/HyperNews/CMS/get/dataopsrequests/5546/1/1/1/1/1/1/1/1/1/1/1/1/1/1.html
       ➔ Suspicion that this is also what is preventing the ACDC from starting
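Since all tasks of a chain run at the same site, one way to picture the assignment problem is as an intersection of the sites hosting each input dataset with the sites that have free resources. This is only a sketch of the brainstormed idea under that assumption, not an existing assignment script; the function name and the site names are hypothetical.

```python
def site_whitelist(input_locations, available_sites):
    """Sites that host every input dataset and currently have resources.

    input_locations maps each input (primary, pileup, ...) to the sites
    holding a copy of it.
    """
    candidates = set(available_sites)
    for sites in input_locations.values():
        candidates &= set(sites)
    return sorted(candidates)


# Hypothetical site names, for illustration only.
locations = {
    "primary": ["T1_XX_SiteA", "T2_XX_SiteB", "T2_XX_SiteC"],
    "pileup": ["T2_XX_SiteB", "T2_XX_SiteC"],
}
available = ["T1_XX_SiteA", "T2_XX_SiteC", "T2_XX_SiteD"]

print(site_whitelist(locations, available))  # ['T2_XX_SiteC']
```

An empty result would signal that the chain cannot be placed without first moving data or waiting for resources.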

  8. Suggestions For Next Steps
     ● Get feedback from the Ops brainstorming and iron out the handshaking details
     ● Do a reservation campaign in Summer11 & Summer12*
     ● Put all new requests in Summer11 and Summer12* through TaskChain
     ● Extend to new requests in Fall13* → miniAOD
     ● Extend to new requests in Fall14wmLHE → Fall14
