ICHEP MC Production Post-Mortem: J-R Vlimant on behalf of everyone else



  1. “ICHEP MC Production” Post-Mortem J-R Vlimant on behalf of everyone else

  2. Disclaimer
     “Post-mortem” means after death/end, yet MC samples are still being produced as we speak. The body is still warm!
     A full computing-operations post-mortem analysis is planned for the Computing & Offline Management Meeting in Trieste, 25-27 July 2012. The full post-mortem will be done by then.
     Many of the lessons learned will be turned into action items then.
     It is easy to notice only what goes wrong.
     6/26/12

  3. Summer12 GEN-SIM
     https://twiki.cern.ch/twiki/bin/view/CMS/PdmVProductionSummer12
     ● Started on January 25
     ● 3.3B events in the campaign
     ● 2.5B events produced
     ● 500M/month achieved (had been 400M/month)
     ● Part of it was put on stand-by while completing the HPA and Upgrade samples
     ● All “non-HPA” gen-sim (800M) resumes this week
     http://vlimant.web.cern.ch/vlimant/Directory/summer12/progress/Summer12_GEN-SIM_speed.html
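
The figures on this slide allow a rough back-of-the-envelope estimate of what is left in the campaign. The sketch below is only an illustration: the function names are hypothetical, and it assumes a constant monthly rate with no stand-by periods, which the slide itself says was not the case.

```python
# Figures quoted on the slide, in millions of events.
CAMPAIGN_TOTAL_M = 3300   # 3.3B in the campaign
PRODUCED_M = 2500         # 2.5B produced so far
RATE_ACHIEVED_M = 500     # 500M/month achieved
RATE_EARLIER_M = 400      # the 400M/month figure also quoted


def remaining_events(total_m: int, produced_m: int) -> int:
    """Events still to produce, in millions."""
    return total_m - produced_m


def months_to_finish(remaining_m: int, rate_m_per_month: int) -> float:
    """Naive estimate: constant rate, no stand-by periods (an assumption)."""
    if rate_m_per_month <= 0:
        raise ValueError("rate must be positive")
    return remaining_m / rate_m_per_month


if __name__ == "__main__":
    left = remaining_events(CAMPAIGN_TOTAL_M, PRODUCED_M)
    for rate in (RATE_ACHIEVED_M, RATE_EARLIER_M):
        months = months_to_finish(left, rate)
        print(f"{left}M left at {rate}M/month: {months:.1f} months")
```

The difference between the campaign size and the produced events comes out at 800M, the same size as the “non-HPA” gen-sim figure quoted above; whether the two refer to exactly the same set of samples is not stated on the slide.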

  4. Summer12 Digi-Reco
     https://twiki.cern.ch/twiki/bin/view/CMS/PdmVProductionSummer12
     ● 5.1 digi-reco: 370M events, started March 7, delivered early April, tailed into the beginning of May
     ● 5.2 digi-reco: 1.3B events, started end of April, tailing into the end of June
       ✔ Validation samples (end of March - end of April)
       ✔ Low PU production (April 18 - May 21) (PU_S8 or E8TeV4BX50ns)
       ✔ TSG production (April 21 - May 14) (PU_S9)
       ✔ HPA production (end of March - today) (PU_S7 and PU_S6)
     http://vlimant.web.cern.ch/vlimant/Directory/summer12/progress/Summer12_START52_AODSIM_speed.html

  5. Priority Lists
     ● Selected samples for 5.1
       ✔ Defined with Physics Coordination
       ✗ Production overshot by <~1 week
       ➔ Data popularity analysis?
     ● High Priority Analysis (HPA) with 5.2
       ✔ Defined by all groups, filtered by Physics Coordination, compiled, and arranged for production
       ✔ 5 blocks + 1 block for the rest (see details in next slides)
       ✔ Everything not on that list was frozen in production (or not attended to)
       ✗ Complications arose with samples already submitted in gen-sim and acquired in the queue with lower priority, inherited from the beginning of Summer12 (early Feb)
       ✔ Few issues with Digi-Reco prioritization (since nothing had been started yet)
       ✔ Overall, the production went fine

  6. HPA (1/2)
     http://vlimant.web.cern.ch/vlimant/Directory/summer12/summary.html?search=Block1 (HPA x = Block x in the URL)
     ● HPA1: 38/40 completed
       ✗ DiPhotonJets_7TeV-madgraph useless in Summer12
       ✗ TTJets_MassiveBinDECAY available in PU_S6 as requested, missing PU_S7
       ✔ 140M to AODSIM
     ● HPA2: 42/48 completed
       ✗ 4 Higgs requests still “new”: means not defined in PREP
       ✗ EWK: DY4JetsToLL_M-50 digi-reco stalled
       ✗ EWK: DY2JetsToLL_M-50 gen-sim extension stalled
       ✔ 200M to AODSIM
     ● HPA3: 326/329 completed
       ✗ 2 requests in “new”: means not defined in PREP
       ✗ JME: QCD_Pt-15to30 digi-reco stalled
       ✔ 240M to AODSIM
     NB: “stalled” = site issues or queue overhead; probably done by now.

  7. HPA (2/2)
     http://vlimant.web.cern.ch/vlimant/Directory/summer12/summary.html?search=Block1 (HPA x = Block x in the URL)
     ● HPA4: 308/339 completed
       ✗ BPH: BdToKK, BdToPiPi, BdToPiMuNu, LambdaBToPK digi-reco stalled
       ✗ BPH: LambdaBToPMuNu gen-sim taking forever due to a very low filter efficiency
       ✗ EWK: DYToTauTau_M-20_CT10 digi-reco stalled
       ✗ SUS: QCD_HT-500To1000 completing
       ✗ SUS: TTWWJets, WZZNoGstarJets, WWWJets, TTGJets, TTWJets,
       ✗ Top: 7 systematic samples (TT/T/W scale up/down) digi-reco stalled
       ✗ Top: 3 systematic samples gen-sim stalled
       ✔ 550M to AODSIM
     ● HPA5: 82/87 completed
       ✗ SUS: QCD_HT-100To250 digi-reco stalled
       ✗ Top: 4 systematic samples (TT/W matching up/down) digi-reco stalled
       ✗ Higgs: VBF_HToZZTo2L2Nu_M-525 digi-reco stalled
       ✔ 52M to AODSIM
     NB: “stalled” = site issues or queue overhead; probably done by now.
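
The per-block counts quoted on the two HPA slides can be tallied into an overall completion figure. The snippet below is a minimal sketch using only the numbers from the slides; the data structure and function names are hypothetical and are not part of PREP or any production tool.

```python
# block -> (completed requests, total requests, AODSIM output in millions),
# copied from the HPA (1/2) and HPA (2/2) slides.
HPA_BLOCKS = {
    "HPA1": (38, 40, 140),
    "HPA2": (42, 48, 200),
    "HPA3": (326, 329, 240),
    "HPA4": (308, 339, 550),
    "HPA5": (82, 87, 52),
}


def completion_fraction(completed: int, total: int) -> float:
    """Fraction of requests completed in one block."""
    return completed / total


def totals(blocks):
    """Sum completed requests, total requests and AODSIM millions over all blocks."""
    completed = sum(c for c, _, _ in blocks.values())
    total = sum(t for _, t, _ in blocks.values())
    aodsim_m = sum(a for _, _, a in blocks.values())
    return completed, total, aodsim_m


if __name__ == "__main__":
    for name, (done, total, aodsim_m) in HPA_BLOCKS.items():
        frac = completion_fraction(done, total)
        print(f"{name}: {done}/{total} ({frac:.1%}), {aodsim_m}M to AODSIM")
    done, total, aodsim_m = totals(HPA_BLOCKS)
    print(f"All blocks: {done}/{total} ({done / total:.1%}), {aodsim_m}M to AODSIM")
```

Summed over the five blocks this gives 796 of 843 requests completed and 1182M events to AODSIM.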

  8. Issues and Action (1/3)
     ● WM commissioning was done during the production itself
       ✔ No support from the main developers, who had left to work in industry
       ✔ Solved by definition with experience gained
       ✔ Lots of experience gained by both PdmV and Comp-Ops
       ➔ Full Computing post-mortem at the end of July
       ➔ PREP2 project
     ● Lack of monitoring at several levels
       ✔ Ad-hoc monitoring pages will be turned into a consolidated third-party PREP/reqMng monitoring on a medium time scale (pre-PREP2)
       ➔ GlobalMonitor is being upgraded
     ● Operation overhead for submission & chaining
       ✔ Ad-hoc chaining from PREP evolved into an ad-hoc operation summary
       ➔ Improvement of current PREP to speed up operation
       ➔ PREP2 / integration with request manager
     ● Operation overhead for dispatching
       ✗ Daily assignment is a killer overhead
       ✗ Weekly assignment does not allow for quick turn-over
       ✗ Monthly assignment in early April severely delayed some samples
       ➔ Accumulate experience into automated procedures
       ➔ More from the July post-mortem

  9. Issues and Action (2/3)
     ● Aborting valid requests to reclaim resources
       ✗ Damaged the output dataset
       ✔ We won't do that again anytime soon
       ➔ Development of the system to allow for this feature
     ● Frenzy of wanting things faster
       ✗ Many cases of “change the priority” followed by “the next day it was acquired”
       ✔ Ask for careful pre-planning in the future
       ✔ Tied to the lack of an approximate estimated time of delivery
       ➔ Development of the system to allow more flexibility
     ● Some “missing” samples were in fact never requested with priority
       ✔ Were dealt with in priority
       ➔ Add a link to a PAS in PREP2 to tie requests to analyses
       ➔ More careful planning from the groups needs to be made, early on
     ● Some samples were submitted with incorrect physics content
       ✔ Improve preparation/documentation of special requests
       ➔ Implement a gen-validation step as part of the submission procedure
     ● Not possible to “take a peek” at large samples
       ✗ The first 10% of the samples was not reachable fast enough
       ✔ Numerous requests were staged, but the rest steals resources
       ✔ A handful of requests were extended
       ➔ Planning for two-speed submission of samples (10% high, 90% bulk) with PREP2
       ➔ Development on WM infrastructure to allow for safe extension of datasets
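
The planned two-speed submission (10% high priority, 90% bulk) amounts to splitting each request into two parts. The sketch below only illustrates that split; `split_two_speed` is a hypothetical helper, and how PREP2 would actually implement it is not described in the slides.

```python
from typing import Tuple


def split_two_speed(total_events: int, fast_percent: int = 10) -> Tuple[int, int]:
    """Split a request into (high-priority events, bulk events).

    Integer arithmetic keeps the two parts summing exactly to the total.
    The 10% default comes from the slide; everything else is an assumption.
    """
    if total_events < 0:
        raise ValueError("total_events must be non-negative")
    if not 0 <= fast_percent <= 100:
        raise ValueError("fast_percent must be between 0 and 100")
    fast = total_events * fast_percent // 100
    return fast, total_events - fast


if __name__ == "__main__":
    fast, bulk = split_two_speed(1_000_000)
    print(f"high priority: {fast}, bulk: {bulk}")
```

Rounding down the high-priority part means very small requests may end up entirely in the bulk part; a real system would need a policy for that case.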

  10. Issues and Action (3/3)
     ● Misunderstanding of the priority number and the operations related to it
       ✔ Clarified half-way
       ✔ Tied to resource downtime
       ➔ Planned to be automated “by date” in PREP2
     ● Requested statistics not matched
       ✗ Due to filter efficiency, corrupted LHE, ...
       ➔ Incorporate this as part of a gen-valid request
     ● Stuck samples
       ✗ History monitoring missing
       ✔ Weekly report from PREP
       ✔ Scanning scripts developed by the operators
       ✔ Thanks to the sharp eyes of some requesters, who made clear reports
       ➔ More from the July post-mortem
     ● Difficult to get large systematic samples
       ➔ Increase the usage of Fastsim
     ● Losing track of samples, requests, relevance
       ✔ Increase coordination between groups
       ✔ Follow up on important samples
       ✔ Propagation of operational information and news
       ➔ Monte-Carlo coordination meeting put in place
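
The “requested statistics not matched” item comes down to simple arithmetic: if only a fraction of generated events passes the generator filter, more events must be generated than are requested. The sketch below shows that relation under the assumption of a known, constant filter efficiency; the function name is hypothetical, and corrupted LHE input is not modelled.

```python
import math


def events_to_generate(requested_events: int, filter_efficiency: float) -> int:
    """Events to generate so that, on average, `requested_events` pass the filter.

    Assumes a known, constant filter efficiency in (0, 1].
    """
    if not 0.0 < filter_efficiency <= 1.0:
        raise ValueError("filter_efficiency must be in (0, 1]")
    return math.ceil(requested_events / filter_efficiency)


if __name__ == "__main__":
    for eff in (1.0, 0.25, 0.001):
        n = events_to_generate(1_000_000, eff)
        print(f"efficiency {eff}: generate {n} events for 1000000 requested")
```

A mis-estimated efficiency translates directly into a shortfall in delivered statistics, which is why checking it in a gen-validation step before submission is attractive.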

  11. Summary
     Things went fine for the bulk of production.
     Production exceeded expectations.
     Most of the prioritized samples were delivered “before last week”.
     A few bumps along the way.
     Most issues have been addressed already.
     More to come from the full Computing post-mortem analysis.
