Useful info on SNT workflows
Nick Amin
Overview
⚫ Two main parts
  • Data/metadata retrieval
    ‣ people usually use DAS
    ‣ many of us use DIS
    ‣ metadata about SNT samples (i.e., CMS4)
  • Job submission
    ‣ people usually use CRAB
    ‣ many of us use local condor (and Metis)
⚫ Also advertising how to retrieve this information and how to optimally request CMS4 samples :)
Where do I get my data?
⚫ Many CMS services deal with data bookkeeping
⚫ DAS (Data Aggregation Service) covers many of them, but not all
  • It's old and slow
⚫ "Can I get the cmsRun configuration for the GENSIM used to make this MC sample?"
  • Surprise: you can't use DAS
  • Surprise: even using McM, that's 10-30 clicks the first time, and then 4-5 clicks once you know which icons are relevant

[Diagram: the user goes through DAS, which covers DBS (dataset/file info), PhEDEx (dataset/file location), and other minor services nobody uses. Not covered by DAS: pMp (campaign progress), McM (MC configurations), and CMS4 information (Skype with Nick, or ls /hadoop/...).]
Where do I get my data?
⚫ DIS alleviates this by querying the underlying services directly
  • Faster, so you find out where that WJets sample is before you retire
⚫ Drops some DAS features that we don't use daily
⚫ Adds some features that combine multiple sources
⚫ Adds CMS4 bookkeeping
⚫ You don't need a proxy/cert for anything; only the person running the DIS server does

[Diagram: the user goes through DIS, which covers DBS (dataset/file info), PhEDEx (dataset/file location), pMp (campaign progress), McM (MC configurations), and CMS4 information.]
A DAS query
[Screenshot of a DAS query; processing time: 8.1s]
⚫ …not to mention it times out sometimes, and there's also this kind of page:
[Screenshot of a DAS error page]
A DIS query
⚫ http://uaf-8.t2.ucsd.edu/~namin/dis/?query=%2FEGamma%2FRun2018D-22Jan2019-v2%2FMINIAOD&type=files&short=short
  • processing time: 2.2s
⚫ If DIS talks to DBS directly, and DAS talks to DBS for the same data, then how is DAS 4x slower? 🤸
⚫ dasgoclient (a CLI) was written by the DAS author to bypass DAS and query DBS directly
  • But it's not a website, and doesn't (nicely) do all the things we want
DIS knobs
⚫ query: almost always just a dataset name
⚫ type: what kind of info do you want?
⚫ short: if unchecked, display more details
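As a minimal sketch of how these knobs turn into a request URL (the dis_url helper is hypothetical, not part of DIS; the endpoint and parameter names are taken from the example URL on the previous slide):

    from urllib.parse import urlencode

    def dis_url(query, qtype="basic", short=True):
        # The three knobs from the web form: query, type, and the "short" checkbox.
        params = {"query": query, "type": qtype}
        if short:
            params["short"] = "short"  # matches the &short=short in the example URL
        return "http://uaf-8.t2.ucsd.edu/~namin/dis/?" + urlencode(params)

    # Reproduces the files query from the previous slide
    print(dis_url("/EGamma/Run2018D-22Jan2019-v2/MINIAOD", qtype="files"))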
DIS options (1)
⚫ Basic
⚫ Files: by default, shows only 10 files (uncheck the short option to see all)
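For instance, with the CLI client shown later (a sketch; the -t type string is assumed to be the lowercased option name):

    dis_client.py -t files "/EGamma/Run2018D-22Jan2019-v2/MINIAOD"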
DIS options (2)
⚫ Sites: where is my data?
  • If you put in a file, you get info about that file/block
  • If you put in a dataset, you get fractional T2 presence
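A sketch, with the same caveat about the assumed type string; the LFN on the second line is only a placeholder:

    dis_client.py -t sites "/EGamma/Run2018D-22Jan2019-v2/MINIAOD"
    dis_client.py -t sites "/store/data/Run2018D/EGamma/MINIAOD/22Jan2019-v2/<path>/<file>.root"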
DIS options (3)
⚫ Chain
  • returns McM info (fragment, driver, CMSSW version, gridpack)
    ‣ …for all steps from GENSIM to NANOAODSIM
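A sketch (the MC dataset name is only illustrative, and the chain type string is assumed from the option name):

    dis_client.py -t chain "/TTTT_TuneCP5_13TeV-amcatnlo-pythia8/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v15-v1/MINIAODSIM"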
DIS options (4)
⚫ Pick (pickevents)
  • Put in a dataset, then a comma-separated list of run:lumi:event
  • Gives you the command to run to get a single root file
⚫ Pick_cms4 (pickevents to CMS4 level)
  • Or, skip the intermediate step and just print out which CMS4 files contain the events
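A sketch (the run:lumi:event numbers are made up, and the pick/pick_cms4 type strings are assumed from the option names):

    dis_client.py -t pick "/MET/Run2018A-PromptReco-v2/MINIAOD,316239:123:152920456"
    dis_client.py -t pick_cms4 "/MET/Run2018A-PromptReco-v2/MINIAOD,316239:123:152920456"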
DIS options (5)
⚫ SNT (search CMS4 samples)
  • Two entries here, because there are two CMS4 tags
How to summarize data?
⚫ How can we summarize lots of output?
⚫ "What's the total event count of all /TTTT* samples?"
  • Any list of json-like stuff can be piped into "grep"
  • Print out some statistics with "stats"
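For the /TTTT* question above, a sketch using the SNT query type from the following slides (nevents_in is the event-count field shown there; the exact stats output format is not shown here):

    dis_client.py -t snt "/TTTT* | grep nevents_in | stats"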
How to select data?
⚫ Additionally, for SNT queries, put restrictions as a comma-separated list after the dataset pattern
⚫ Example: print out the hadoop path and dataset name for Prompt 2018 data processed with the CMS4_V10-02-04 tag
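A sketch, reusing the syntax of the CLI example on the next slide (the location field name for the hadoop path is an assumption):

    dis_client.py -t snt "/*/Run2018*Prompt*/MINIAOD,cms3tag=CMS4_V10-02-04 | grep location,dataset_name" --table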
Python CLI client
⚫ The Python command-line client has exactly the same syntax as the webpage (just give -t <type>), and you can make nice tables too
  • https://github.com/aminnj/dis/blob/master/dis_client.py
  • dis_client.py -t snt "/MET/Run2018*Prompt*/MINIAOD,cms3tag=CMS4_V10-02-04 | grep dataset_name,gtag,nevents_in" --table

    dataset_name                         gtag                      nevents_in
    /MET/Run2018A-PromptReco-v2/MINIAOD  102X_dataRun2_Prompt_v11  5980578
    /MET/Run2018A-PromptReco-v1/MINIAOD  102X_dataRun2_Prompt_v11  30172992
    /MET/Run2018B-PromptReco-v1/MINIAOD  102X_dataRun2_Prompt_v11  28012780
    /MET/Run2018C-PromptReco-v1/MINIAOD  102X_dataRun2_Prompt_v11  1986935
    /MET/Run2018B-PromptReco-v2/MINIAOD  102X_dataRun2_Prompt_v11  1739672
    /MET/Run2018A-PromptReco-v3/MINIAOD  102X_dataRun2_Prompt_v11  17175066
    /MET/Run2018D-PromptReco-v2/MINIAOD  102X_dataRun2_Prompt_v11  162272551
    /MET/Run2018C-PromptReco-v2/MINIAOD  102X_dataRun2_Prompt_v11  14698298
    /MET/Run2018C-PromptReco-v3/MINIAOD  102X_dataRun2_Prompt_v11  14586790
DIS (misc)
⚫ Other features
  • See the README of the repo
ProjectMetis
⚫ "CRAB mostly works when it works, but it mostly doesn't work"
  • CRAB has a "lightweight" (in practice, heavyweight) server between you and your jobs
⚫ Luckily we have local condor submission, and running lots of cmsRun isn't that complicated
⚫ Almost all data processing we do is dataset in → files out
  • Can be organized into "tasks" that take a "sample" (a supplier of events) and produce files with events
  • CRAB takes a dataset, PSet, CMSSW code, and some other info in a configuration file
⚫ Metis (https://github.com/aminnj/ProjectMetis) makes it more functional
  • The tarfile contains the CMSSW source to eventually ship to the condor worker nodes

    # module paths assumed from the ProjectMetis repo layout
    from metis.CMSSWTask import CMSSWTask
    from metis.Sample import DBSSample

    task = CMSSWTask(
        sample = DBSSample(dataset="/ZeroBias6/Run2017A-PromptReco-v2/MINIAOD"),
        events_per_output = 450e3,
        output_name = "merged_ntuple.root",
        tag = "CMS4_V00-00-03",
        pset = "pset_test.py",
        pset_args = "data=True prompt=True",
        cmssw_version = "CMSSW_9_2_1",
        tarfile = "/nfs-7/userdata/libCMS3/lib_CMS4_V00-00-03_workaround.tar.gz",
        is_data = True,
    )
ProjectMetis (submitting)
⚫ Process a task
  • get list of inputs
  • make list of outputs
  • submit jobs
  • resubmit failed jobs
⚫ Make a summary of jobs and put it on a dashboard
⚫ Easily extensible to a loop over datasets

    import time
    # CMSSWTask and DBSSample as on the previous slide;
    # StatsParser from metis.StatsParser (module path assumed)

    def main():
        task = CMSSWTask(
            sample = DBSSample(dataset="/ZeroBias6/Run2017A-PromptReco-v2/MINIAOD"),
            events_per_output = 450e3,
            output_name = "merged_ntuple.root",
            tag = "CMS4_V00-00-03",
            pset = "pset_test.py",
            pset_args = "data=True prompt=True",
            cmssw_version = "CMSSW_9_2_1",
            tarfile = "/nfs-7/userdata/libCMS3/lib_CMS4_V00-00-03_workaround.tar.gz",
            is_data = True,
        )
        task.process()
        # total_summary is a dict of task summaries (filled as on the chaining slide)
        StatsParser(data=total_summary, webdir="~/public_html/dump/metis_test/").do()

    if __name__ == "__main__":
        # Do stuff, sleep, do stuff, sleep, etc.
        # Since everything is backed up, totally OK to Ctrl+C and pick up later
        for i in range(100):
            main()
            time.sleep(1.*3600)
ProjectMetis (chaining)
⚫ Can chain together tasks
  • Input of one is the output of the previous task
⚫ Allows one to make a GEN → CMS4 workflow in one script
  • Make 5 tasks
  • Loop through the tasks and process them all
  • As tasks complete, the inputs for the subsequent ones become available
    ‣ parallel in a sense

    tag = "v1"
    total_summary = {}
    for _ in range(10000):
        gen = CMSSWTask(
            sample = DummySample(N=1, dataset="/WH_HtoRhoGammaPhiGamma/privateMC_102x/GENSIM"),
            events_per_output = 1000,
            total_nevents = 1000000,
            pset = "gensim_cfg.py",
            cmssw_version = "CMSSW_10_2_5",
            scram_arch = "slc6_amd64_gcc700",
            tag = tag,
            split_within_files = True,
        )
        raw = CMSSWTask(
            sample = DirectorySample(
                location = gen.get_outputdir(),
                dataset = gen.get_sample().get_datasetname().replace("GENSIM","RAWSIM"),
            ),
            open_dataset = True,
            files_per_output = 1,
            pset = "rawsim_cfg.py",
            cmssw_version = "CMSSW_10_2_5",
            scram_arch = "slc6_amd64_gcc700",
            tag = tag,
        )
        aod = CMSSWTask(
            sample = DirectorySample(
                location = raw.get_outputdir(),
                dataset = raw.get_sample().get_datasetname().replace("RAWSIM","AODSIM"),
            ),
            open_dataset = True,
            files_per_output = 5,
            pset = "aodsim_cfg.py",
            cmssw_version = "CMSSW_10_2_5",
            scram_arch = "slc6_amd64_gcc700",
            tag = tag,
        )
        miniaod = CMSSWTask(
            sample = DirectorySample(
                location = aod.get_outputdir(),
                dataset = aod.get_sample().get_datasetname().replace("AODSIM","MINIAODSIM"),
            ),
            open_dataset = True,
            flush = True,
            files_per_output = 5,
            pset = "miniaodsim_cfg.py",
            cmssw_version = "CMSSW_10_2_5",
            scram_arch = "slc6_amd64_gcc700",
            tag = tag,
        )
        cms4 = CMSSWTask(
            sample = DirectorySample(
                location = miniaod.get_outputdir(),
                dataset = miniaod.get_sample().get_datasetname().replace("MINIAODSIM","CMS4"),
            ),
            open_dataset = True,
            flush = True,
            files_per_output = 1,
            output_name = "merged_ntuple.root",
            pset = "psets_cms4/main_pset_V10-02-04.py",
            pset_args = "data=False year=2018",
            global_tag = "102X_upgrade2018_realistic_v12",
            cmssw_version = "CMSSW_10_2_5",
            scram_arch = "slc6_amd64_gcc700",
            tag = tag,
            tarfile = "/nfs-7/userdata/libCMS3/lib_CMS4_V10-02-04_1025.tar.xz",
        )
        tasks = [gen, raw, aod, miniaod, cms4]
        for task in tasks:
            task.process()
            summary = task.get_task_summary()
            total_summary[task.get_sample().get_datasetname()] = summary
        StatsParser(data=total_summary, webdir="~/public_html/dump/metis/").do()
        time.sleep(30*60)