Software and Computing Operations



  1. U.S. CMS Operations Program: Software and Computing Operations
     Christoph Paus, MIT; Stephan Lammel, Fermilab
     USCMS Operation Budget Review, September 7th, 2017

  2. U.S. CMS S&C Operations, Goals
     ▪ Enable high-quality, timely research by
       ▪ processing data
       ▪ distributing data
       ▪ running job submission infrastructure
       ▪ running various data/software/DB services
       ▪ investigating possible improvements
     ==> clearly in US/US physicist interest
         goal matches CMS, different focus on enabling it
     ==> strategy is to apply US expertise
         increase operations coverage with second 8h shift
     Stephan Lammel, 2017-Sep-07, 2018 U.S. CMS Budget Review: Computing Operations

  3. U.S. CMS S&C Operations, Strategy
     ▪ Smooth, effortless operation:
       ▪ automate where possible
       ▪ make things robust
       ▪ off-load monitoring to shifter
       ▪ effective alerting
     ▪ Look beyond today:
       ▪ what is needed next month/year
       ▪ what becomes available
       ▪ what needs to be improved/evolved
     ▪ What is in it for USCMS:
       ▪ know the data and issues
       ▪ keep US facilities at peak performance
       ▪ see computing research and development opportunities early

  4. U.S. CMS S&C Operations, Areas
     ▪ Tier-0: Operate the Tier-0 infrastructure
       • USCMS people designed and built the Tier-0
       • USCMS contributes significantly to the operation
     ▪ Data Distribution: Operate PhEDEx and Dynamo data distribution
       • USCMS designed and built PhEDEx, AAA, and Dynamo
       • USCMS operates the system with collaboration contribution
     ▪ Data Processing: Re(re)construct data and produce Monte Carlo datasets
       • USCMS designed and built the processing setup
       • USCMS operates the system with collaboration contribution
     ▪ Submission Infrastructure: Schedule and execute production and user jobs on Grid and Cloud resources of sites
       • USCMS co-developed glide-in WMS
       • USCMS designed and set up the Global Pool
       • USCMS operates the system with OSG and collaboration contribution
     ▪ Central Services: Operate various distributed data, database, and software access services
       • USCMS contributed to the development of several services
       • USCMS contributes significantly to the operation
     ▪ Site Support: Monitor health and performance of CMS grid sites
       • USCMS people developed the setup based on WLCG tools
       • USCMS contributes significantly to the daily monitoring

  5. U.S. CMS S&C Operations, Tier-0
     ▪ Tier-0 components consist of:
       ▪ interface to transfer system of StorageManager at P5
       ▪ transfer system to get data from P5 to CERN EOS/MSS
       ▪ Express and PromptCalib
       ▪ Repack data from streamer format into ROOT files
       ▪ PromptReco
       ▪ AlCaSkim
       ▪ data quality monitoring
       ▪ file merge
       ▪ cloud based infrastructure for CPU resources at CERN
     ▪ 2017 Activities:
       ▪ commission new interface to transfer system
       ▪ transfer performance and lost files in EOS
       ▪ data cached on disk reduced
     ▪ USCMS effort:
       ▪ CMS/O&C/CompOps/Tier-0 L3 head at CERN (0.5 FTE costed)
       ▪ Tier-0 operator at CERN (0.3 FTE uncosted)
       ▪ Tier-0 operator at Fermilab (1 FTE subsistence)
       ▪ Tier-0 head/operator at CERN (2 FTE cola)

  6. U.S. CMS S&C Operations, Data Distribution
     ▪ Data Distribution components consist of:
       ▪ PhEDEx transfer system
       ▪ dynamic data management DDM / Dynamo
       ▪ AAA / xrootd federated data service (redirectors, monitoring)
     ▪ 2017 Activities:
       ▪ tape-to-disk staging tests at Tier-1s
       ▪ expanded DDM use
       ▪ lost files due to storage system failures
       ▪ network transfer rates at two of the Tier-1s
       ▪ storage inconsistencies due to race conditions/exceptions
       ▪ increase DDM functionality and capabilities
     ▪ USCMS effort:
       ▪ AAA/xrootd operations at Nebraska (0.5 FTE costed)
       ▪ network performance integration at Nebraska (0.2 FTE costed)
       ▪ storage performance integration at Florida (0.5 FTE costed)
       ▪ transfer team operator at CERN/MIT (0.6 uncosted)
       ▪ DDM/Dynamo support and evolution at MIT (0.3 uncosted)
       ▪ transfer team operator at Fermilab (1 FTE subsistence)
       ▪ CMS/O&C/CompOps/TT L3 head at CERN (1 FTE cola)

  7. U.S. CMS S&C Operations, Data Processing
     ▪ Data Processing tasks consist of:
       ▪ reconstruction of cosmic and pp-collision data
       ▪ re-miniAOD campaign for spring conferences
       ▪ re-reconstruction of 2016 pp-collision data
       ▪ making pile up Monte Carlo samples for pre-mixing
       ▪ Run 2, phase 1, and 2 Monte Carlo samples
     ▪ 2017 Activities:
       ▪ EOS authentication overload with HLT and Tier-0 resources
       ▪ stage-out issues
       ▪ software availability and thus late start of campaigns
       ▪ network and storage overloads
     ▪ USCMS effort:
       ▪ Data Processing operations at Fermilab (1 FTE costed)
       ▪ CMS/O&C/CompOps/P&R L3 head (0.25 FTE uncosted)

  8. U.S. CMS S&C Ops, Submission Infrastructure
     ▪ Submission Infrastructure tasks consist of:
       ▪ operation of the glide-in WMS factories
       ▪ support and evolution of the batch system Global Pool
       ▪ interface with glide-in WMS and HTCondor developers and advise on features/priorities
     ▪ 2017 Activities and Milestones:
       ▪ multi-core pilot tuning (task priorities, retirement policies, and scheduling efficiency)
       ▪ Global Pool stability and increased scalability (500k cores)
       ▪ Singularity integration and deployment (glexec replacement)
       ▪ including I/O resources in job scheduling
     ▪ USCMS effort:
       ▪ GlideIn Factory operations at UCSD (0.2 FTE costed)
       ▪ Submission Infrastructure leadership at UCSD (0.45 FTE costed)

  9. U.S. CMS S&C Operations, Central Services
     ▪ Central Services components consist of:
       ▪ CVMFS for software and MC gridpack distribution
       ▪ DBfroNtier/squid infrastructure of distributed database cache
     ▪ 2017 Activities:
       ▪ squids switched from static config to launchpad discovery
     ▪ USCMS effort:
       ▪ CVMFS operations at Florida (0.3 FTE costed)
       ▪ DBfroNtier/squid operations at Johns Hopkins (0.17 FTE costed)
       ▪ DBfroNtier/squid support at Fermilab (0.1 FTE costed)

  10. U.S. CMS S&C Operations, Site Support
      ▪ Site Support components consist of:
        ▪ SAM and HC of WLCG
        ▪ site readiness and status metrics
        ▪ topology description (VO-feed, SITECONF)
        ▪ dashboard metric displays
      ▪ 2017 Activities:
        ▪ decouple VO-feed from BDII, multi-site support, xrootd
        ▪ finer granularity tests (SAM, HC, PhEDEx links between sites)
        ▪ new pilot startup site test
        ▪ IPv6 storage commissioning/testing
      ▪ USCMS effort:
        ▪ Site Support operator at Fermilab (1.0 FTE subsistence)
        ▪ CMS/O&C/F&S/SS L3 head (0.25 FTE uncosted)

  11. U.S. CMS S&C Operations, Coordination
      ▪ USCMS effort coordinating CMS/O&C:
        ▪ Submission Infrastructure L2 head (0.1 FTE costed)
        ▪ Computing Operations L2 head (0.15 FTE uncosted)
        ▪ Facilities and Services L2 head (0.1 FTE uncosted)
      ▪ USCMS effort coordinating USCMS Ops/O&C:
        ▪ Computing Operations L3 (0.2 FTE uncosted)
        ▪ Guest Scientist Line Management (0.05 FTE uncosted)

  12. U.S. CMS S&C Operations, FY-18 plans
      ▪ Operating, operating, operating...
        ▪ LHC data keeps coming through 2018
      ▪ Reacting to issues/addressing operational needs
        ▪ difficult to plan ahead, except
      ▪ Areas with more evolution component like
        ▪ Submission Infrastructure
          • need to stay ahead of CPU/core demand: scalability & efficiency
          • high-availability via IPv6 of Global Pool services
          • feeding HTCondor monitoring and factory logs to MonIT
          • develop/set up mechanism to suspend matching of production jobs to a site
        ▪ Data distribution
          • plan for DDM to become a more sophisticated cache manager

  13. U.S. CMS S&C Operations, Priorities
      ▪ High:
        ▪ Submission Infrastructure: danger of losing glide-in WMS investment; USCMS makes big impact
        ▪ Data Distribution/DDM: know/coordinate which data are stored at which sites (physics)
        ▪ AAA/xrootd: influence/guide future of remote data access (leadership)
        ▪ DBfroNtier/squid: cross experiment/frontier activity (leadership)
      ▪ Moderate:
        ▪ Data Processing: direct knowledge of datasets/processing information would be lost (physics)
        ▪ Site Support: watching out for USCMS sites would be lost
        ▪ Tier-0: we lose connection to data as they are recorded
        ▪ Storage/Network performance integration: not being proactive, incurring delay/slower implementation when plan is ready
        ▪ CVMFS operation: expect CMS to pick this up as service is needed for all sites
      ▪ Low:
