Computing at LHC experiments in the first year of data taking at 7 TeV
Daniele Bonacorsi [deputy CMS Computing coordinator - University of Bologna, Italy]
on behalf of ALICE, ATLAS, CMS, LHCb Computing
Growing up with Grids
LHC Computing Grid (LCG) approved by CERN Council in 2001
✦ First Grid Deployment Board (GDB) in 2002
Since then, LCG was built on services developed in the EU and US
✦ LCG has collaborated with a number of Grid projects
✦ EGEE, NorduGrid, and Open Science Grid (OSG)
It evolved into the Worldwide LCG (WLCG)
✦ Coordination and service support for the operations of the 4 LHC experiments
Computing for LHC experiments grew up together with Grids
✦ Distributed computing achieved by previous experiments
‐ LHC experiments started in this environment, in which most resources were located away from CERN
✦ A huge collaborative effort throughout the years, and massive cross-fertilizations
WLCG today for LHC experiments
11 Tier-1 centres, >140 Tier-2 centres (plus Tier-3s)
✦ ~150k CPU cores, hit 1M jobs/day
✦ >50 PB disk
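As a rough, illustrative back-of-the-envelope (not a figure from the slides), the quoted core count and job rate imply an average of a few core-hours per job, assuming single-core jobs and full occupancy:

    # Rough, illustrative arithmetic on the WLCG scale figures quoted above
    # (assumes full occupancy and single-core jobs; purely a back-of-the-envelope).

    cores = 150_000           # ~150k CPU cores across WLCG
    jobs_per_day = 1_000_000  # peak of ~1M jobs/day

    core_hours_per_day = cores * 24
    avg_core_hours_per_job = core_hours_per_day / jobs_per_day
    print(f"{core_hours_per_day / 1e6:.1f}M core-hours/day available")
    print(f"~{avg_core_hours_per_job:.1f} core-hours per job if fully occupied")  # ~3.6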
Site reliability in WLCG
Basic monitoring of WLCG services
✦ at Tier-0/1/2 levels
Site reliability is a key ingredient in the success of LHC Computing
✦ Result of a huge collaborative work
✦ Thanks to WLCG and site admins!
[Plots: site reliability from Jul'06 to Feb'11 and during 2009-2010; the 2010 data taking start at 7 TeV is marked]
Readiness of WLCG Tiers
Site Availability Monitoring
✦ Critical tests, per Tier, per experiment
Some experiments built their own readiness criteria on top of the basic ones
✦ e.g. CMS defines a "site readiness" based on a boolean 'AND' of many tests (a minimal sketch follows below)
‐ Easy to be OK on some
‐ Hard to be OK on all, and in a stable manner...
[Plots: CMS Tier-1s and Tier-2s site readiness, Sep'08 to Mar'11, reaching a plateau of ~7 ready Tier-1s and ~40 ready Tier-2s after the 2010 data taking start at 7 TeV]
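A minimal sketch of what such a boolean 'AND' readiness flag could look like; the test names and the dictionary layout are invented for illustration and are not the actual CMS Site Readiness metrics.

    # Illustrative sketch only: a hypothetical "site readiness" flag computed as a
    # boolean AND of several per-site test results, in the spirit of the CMS
    # criterion described above. Test names are invented examples.

    from typing import Dict

    def site_is_ready(test_results: Dict[str, bool]) -> bool:
        # The site counts as "ready" only if every critical test passed.
        return all(test_results.values())

    # Example: a site passing availability and transfer tests but failing test jobs
    example = {
        "sam_availability": True,   # Site Availability Monitoring tests
        "test_jobs": False,         # test analysis jobs succeed at the site
        "transfer_quality": True,   # data transfer links commissioned and healthy
    }
    print(site_is_ready(example))   # False: one failing test is enough to flag the site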
LHC Computing models
LHC Computing models are based on the MONARC model
✦ Tiered computing facilities to meet the needs of the LHC experiments
MONARC was developed more than a decade ago
✦ It served the community remarkably well, evolutions in progress
[Diagrams: ATLAS example, T0 → T1s → T2s grouped in "clouds"; CMS example, full mesh among the Tiers]
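A toy illustration of the two topologies, assuming made-up site names: in the hierarchical "cloud" layout Tier-2s only exchange data with the Tier-1 of their cloud, while in the full mesh any Tier-1/Tier-2 pair may transfer directly.

    # Toy illustration of the two transfer topologies sketched above; site names
    # are invented placeholders, not real WLCG endpoints.

    from itertools import combinations

    t1s = ["T1_A", "T1_B"]
    t2s_by_cloud = {"T1_A": ["T2_A1", "T2_A2"], "T1_B": ["T2_B1"]}

    # Hierarchical "cloud" model (ATLAS-like): T0 feeds each T1, and each T2
    # talks only to the T1 of its own cloud.
    cloud_links = [("T0", t1) for t1 in t1s]
    cloud_links += [(t1, t2) for t1, t2s in t2s_by_cloud.items() for t2 in t2s]

    # Full-mesh model (CMS-like): any Tier-1 or Tier-2 may transfer to any other.
    all_sites = t1s + [t2 for t2s in t2s_by_cloud.values() for t2 in t2s]
    mesh_links = [("T0", t1) for t1 in t1s] + list(combinations(all_sites, 2))

    print(len(cloud_links), "links in the cloud model")   # 5
    print(len(mesh_links), "links in the full mesh")      # 12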
From commissioning to data taking
✦ 2004: DC04 (ALICE, CMS, LHCb), DC2 (ATLAS)
‐ "Data Challenges": experiment-specific, independent tests (first full chain of computing models on grids)
✦ 2005: SC1 and SC2 (network transfer tests), SC3 (sustained transfer rates, DM, service reliability)
‐ "Service Challenges": since 2004, to demonstrate service aspects: DM and sustained data transfers, WM and scaling of job workloads, support processes, interoperability, security incidents ("fire drills")
✦ 2006-2007: SC4 (nominal LHC rates, disk → tape tests, all T1, some T2s), plus more experiment-specific challenges
✦ 2008: CCRC08 (phases I-II): readiness challenge, all experiments, ~full computing models
✦ 2009: STEP'09: scale challenges, all experiments + multi-VO overlap, + FULL computing models
‐ "Readiness/Scale Challenges": Data/Service Challenges exercising aspects of the overall service at the same time, if possible with VO overlap
✦ 2010-2011: pp+HI data taking
‐ Run the service(s): focus on real and continuous production use of the services over several years: simulations (since 2003), cosmics data taking, ...
LHC data taking 2010
[Plot: integrated luminosity ramp-up in 2010 (log scale, preliminary)]
Remarkable ramp-up in luminosity in 2010
✦ At the beginning, a "good" weekend could double or triple the dataset
✦ A significant failure or outage for a fill would be a big fraction of the total data
Original planning for Computing in the first 6 months foresaw higher data volumes (tens of pb⁻¹)
✦ Time in stable beams per week reached 40% only a few times
Load on computing systems lower than expected, no stress on resources
✦ The slower ramp allowed predicted activities to be performed more frequently
This will definitely not happen again in 2011: we will be resource constrained
Networks
OPN links now fully redundant
✦ Means no service interruptions
‐ See the fiber cut during STEP'09
Networks in operations
Excellent monitoring systems
CERN → T1 data transfers
CERN outbound traffic showed high performance and reliability
✦ Very well serving the needs of the LHC experiments
✦ A joint and long commissioning and testing effort to achieve this
[Plot: CERN → T1 transfer volumes for all experiments (1 PB scale), with the CCRC'08 challenge (phases I and II), the STEP'09 challenge and the ICHEP'10 conference marked]
An example: ATLAS data transfers
[Plot: ATLAS transfer throughput (GB/s per day), Jan-Dec 2010, broken down by activity: T0 export, data brokering (analysis data), user subscriptions, data consolidation (incl. calib streams), MC transfers in clouds and extra-clouds; average ~2.3 GB/s; annotations mark MC production, 2009 data, the 2010 pp and PbPb data taking at 7 TeV, and reprocessing campaigns at T1s]
Transfers on all routes (among all Tier levels)
✦ Average: ~2.3 GB/s (daily average)
✦ Peak: ~7 GB/s (daily average)
Traffic on OPN measured up to 70 Gbps
ATLAS massive reprocessing campaigns
✦ Data available on-site after a few hrs
An example: CMS data transfers
[Plot: CMS PhEDEx transfer volumes (log scale)]
CMS improved by ad-hoc challenges of increasing complexity and by computing commissioning activities
Massive commissioning, now in continuous production-mode of ops
✦ Can sustain up to >200 TB/day of production transfers on the overall topology (see the back-of-the-envelope sketch below)
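A back-of-the-envelope check (illustrative only, using decimal units) showing that the CMS sustained volume and the ATLAS daily-average rate quoted on these two slides are essentially the same number seen from two sides:

    # Back-of-the-envelope arithmetic relating the daily transfer volumes and
    # average rates quoted on these slides (decimal units assumed).

    SECONDS_PER_DAY = 24 * 3600  # 86400 s

    # CMS: >200 TB/day sustained on the overall PhEDEx topology
    cms_tb_per_day = 200
    cms_rate_gb_s = cms_tb_per_day * 1000 / SECONDS_PER_DAY
    print(f"{cms_tb_per_day} TB/day ~= {cms_rate_gb_s:.1f} GB/s sustained")  # ~2.3 GB/s

    # ATLAS: ~2.3 GB/s daily average on all routes
    atlas_rate_gb_s = 2.3
    atlas_tb_per_day = atlas_rate_gb_s * SECONDS_PER_DAY / 1000
    print(f"{atlas_rate_gb_s} GB/s ~= {atlas_tb_per_day:.0f} TB/day")        # ~200 TB/day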
More examples: ALICE and LHCb data transfers
LHCb data is successfully transferred on a regular basis T0 → T1
✦ RAW data is replicated to one of the T1 sites
[Plot: LHCb transferred volume (GB), up to ~80k]
ALICE transfers among all Tiers
[Plot: number of ALICE transfers done, ~325k]
Reprocessing
Once landed at the T1 level, LHC data gets reprocessed as needed
✦ New calibrations, improved software, new data formats (reprocessing passes only)
ATLAS: 4 reprocessing campaigns in 2010
✦ Feb'10: 2009 pp data + cosmics
✦ Apr'10: 2009/2010 data
✦ May'10: 2009/2010 data+MC
✦ Nov'10: full 2010 data+MC (from tapes)
✦ + HI reprocessing foreseen in Mar'11
CMS: ~a dozen reprocessing passes in 2010
[Plots: number of reprocessing jobs vs time for ATLAS, CMS, ALICE and LHCb; annotations include "HI reco: opportunistic usage of resources" and "Pass-2 reco"]
Reprocessing profile
In 2010, possible to reprocess even more frequently than originally planned
ATLAS reprocessed 100% of data (the chain is sketched as a simple pipeline below)
✦ RAW → ESD
✦ ESD merge
✦ ESD → dESD, AOD
✦ Grid distribution of derived data
[Plot: ATLAS fraction complete vs campaign day, normalised for each T1 (CA, DE, ES, FR, IT, ND, NL, UK, US); from day 7 onwards mostly dealing with tails]
About a dozen CMS reprocessing passes in 2010
[Plot: CMS reprocessed data, ~1.5G events]
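Purely to make the chain above concrete, a toy pipeline sketch; the step functions and the dataset dictionary are invented placeholders, not real ATLAS production software.

    # Toy sketch: the reprocessing chain listed above expressed as an ordered
    # pipeline of steps. Step names mirror the slide; everything else is invented.

    def raw_to_esd(ds):
        ds["formats"].append("ESD")        # RAW → ESD
        return ds

    def merge_esd(ds):
        ds["esd_merged"] = True            # ESD merge
        return ds

    def esd_to_derived(ds):
        ds["formats"] += ["dESD", "AOD"]   # ESD → dESD, AOD
        return ds

    def distribute(ds):
        ds["distributed_to_grid"] = True   # Grid distribution of derived data
        return ds

    PIPELINE = [raw_to_esd, merge_esd, esd_to_derived, distribute]

    def reprocess(ds):
        # Each pass runs the steps in order, as in a reprocessing campaign.
        for step in PIPELINE:
            ds = step(ds)
        return ds

    print(reprocess({"name": "2010 pp data", "formats": ["RAW"]}))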