Testing the Grid – Service & Data Challenges
LHC Data Analysis Challenges for the Experiments and 100 Computing Centres in 20 Countries
GridKa School, Karlsruhe, 15 September 2006
Michael Ernst, DESY / CMS
The Worldwide LHC Computing Grid ... as defined by LCG
Purpose
- Develop, build and maintain a distributed computing environment for the storage and analysis of data from the four LHC experiments
- Ensure the computing service
- ... and common application libraries and tools
Phase I – 2002-2005 – Development & planning
Phase II – 2006-2008 – Deployment & commissioning of the initial services
WLCG Collaboration
The Collaboration
- ~100 computing centres
- 12 large centres (Tier-0, Tier-1)
- 38 federations of smaller "Tier-2" centres
- 20 countries
Service Hierarchy
Tier-0 – the accelerator centre
- Data acquisition & initial processing
- Long-term data curation
- Distribution of data to the Tier-1 centres
Tier-1 – "online" to the data acquisition process; high availability
- Managed Mass Storage – grid-enabled data service
- Data-heavy analysis
- National, regional support
- Tier-1 centres: Canada – TRIUMF (Vancouver); France – IN2P3 (Lyon); Germany – Forschungszentrum Karlsruhe; Italy – CNAF (Bologna); Netherlands – Tier-1 (Amsterdam); Nordic countries – distributed Tier-1; Spain – PIC (Barcelona); Taiwan – Academia Sinica (Taipei); UK – CLRC (Oxford); US – Fermilab (Illinois) and Brookhaven (NY)
Tier-2 – ~100 centres in ~40 countries
- Simulation
- End-user analysis – batch and interactive
Summary of Computing Resource Requirements
All experiments – 2008 (from LCG TDR – June 2005)

                        CERN   All Tier-1s   All Tier-2s   Total
CPU (MSPECint2000s)       25            56            61     142
Disk (PetaBytes)           7            31            19      57
Tape (PetaBytes)          18            35             –      53

Shares of the 2008 totals:
- CPU:  CERN 18%, Tier-1s 39%, Tier-2s 43%
- Disk: CERN 12%, Tier-1s 55%, Tier-2s 33%
- Tape: CERN 34%, Tier-1s 66%
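The percentage shares follow directly from the table. As a minimal illustration (not part of the talk; the dictionary layout is mine), the short Python sketch below recomputes the CERN / Tier-1 / Tier-2 fractions from the 2008 totals and reproduces the quoted shares to within rounding.

    # Sketch (not from the talk): recompute the resource shares from the table.
    requirements = {
        # resource: (CERN, all Tier-1s, all Tier-2s), units as in the table
        "CPU (MSPECint2000s)": (25, 56, 61),
        "Disk (PB)": (7, 31, 19),
        "Tape (PB)": (18, 35, 0),   # no Tier-2 tape in the 2008 estimate
    }

    for resource, (cern, t1, t2) in requirements.items():
        total = cern + t1 + t2
        shares = ", ".join(
            f"{name} {100 * value / total:.0f}%"
            for name, value in (("CERN", cern), ("Tier-1s", t1), ("Tier-2s", t2))
            if value
        )
        print(f"{resource}: total {total} -> {shares}")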
Two major science grid infrastructures ....
- EGEE – Enabling Grids for E-Science
- OSG – US Open Science Grid
.. and an excellent Wide Area Network
- T2s and T1s are inter-connected by the general purpose research networks
- The Tier-1s – GridKa, IN2P3, TRIUMF, Brookhaven, ASCC, Nordic, Fermilab, RAL, CNAF, PIC, SARA – are connected by 10 Gbit links of the LCG Optical Private Network
- Any Tier-2 may access data at any Tier-1
LHC Data Grid Hierarchy
- CERN/Outside resource ratio ~1:2; Tier-0/(Σ Tier-1)/(Σ Tier-2) ~1:1:1
- Online System → Tier 0+1 (CERN Center): ~PByte/sec at the experiment, ~100-1500 MBytes/sec recorded; PBs of disk, tape robot
- Tier 0 → Tier-1 centres (FNAL, GridKa, INFN, RAL, ...): ~2.5-10 Gbps
- Tier-1 → Tier-2 centres: ~2.5-10 Gbps
- Tier-2 → Tier-3 (institutes, physics data cache): 0.1 to 10 Gbps
- Tier 4: workstations
- Tens of Petabytes by 2007-8; an Exabyte ~5-7 years later
- Emerging vision: a richly structured, global dynamic system
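For scale, a back-of-envelope sketch (not from the talk; the helper name and the assumption of a fully sustained link are mine) shows what the quoted link speeds mean in daily volume and in transfer time per petabyte.

    # Back-of-envelope sketch: daily volume and time per petabyte for a
    # link sustained at the rates quoted in the hierarchy above.
    def tb_per_day(gbps: float) -> float:
        """Terabytes per day moved by a link sustained at `gbps` gigabit/s."""
        return gbps * 1e9 / 8 * 86400 / 1e12

    for gbps in (2.5, 10.0):
        daily = tb_per_day(gbps)
        print(f"{gbps:>4} Gbps sustained ~ {daily:.0f} TB/day "
              f"~ {1000 / daily:.0f} days per PB")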
[Diagram: four Tier-1s and four Tier-2s connected by data-flow arrows]
Experiment computing models define specific data flows between Tier-1s and Tier-2s
ATLAS "average" Tier-1 Data Flow (2008)
[Diagram: real data storage, reprocessing and distribution – flows between Tier-0, the Tier-1 disk buffer, tape storage, the reprocessing CPU farm, the Tier-2s and the other Tier-1s; plus simulation & analysis data flow]
Typical per-stream figures from the diagram:
- RAW: 1.6 GB/file at 0.02 Hz – 1.7K files/day, 32 MB/s, 2.7 TB/day
- ESD: 0.5 GB/file at 0.02 Hz – 1.7K files/day, 10 MB/s, 0.8 TB/day
- AOD: 10 MB/file at 0.2 Hz – 17K files/day, 2 MB/s, 0.16 TB/day
- AODm (merged AOD): 500 MB/file at 0.036-0.04 Hz – 3.1-3.4K files/day, 18-20 MB/s, 1.44-1.6 TB/day
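Each stream in the diagram is internally consistent: the bandwidth is the file size times the file rate, and the daily figures follow from the number of seconds in a day. A small sketch (not from the talk; the `stream` helper is mine) reproduces the figures above; the slide's TB/day values are rounded slightly differently, but the rates agree.

    # Sketch: derive the per-stream numbers from file size and file rate.
    SECONDS_PER_DAY = 86400

    def stream(name: str, file_size_mb: float, rate_hz: float) -> None:
        files_per_day = rate_hz * SECONDS_PER_DAY
        mb_per_s = file_size_mb * rate_hz
        tb_per_day = mb_per_s * SECONDS_PER_DAY / 1e6
        print(f"{name:>4}: {files_per_day / 1000:.1f}K files/day, "
              f"{mb_per_s:.0f} MB/s, {tb_per_day:.2f} TB/day")

    stream("RAW", 1600, 0.02)   # 1.6 GB/file at 0.02 Hz -> 32 MB/s, ~2.7 TB/day
    stream("ESD", 500, 0.02)    # 0.5 GB/file at 0.02 Hz -> 10 MB/s, ~0.8 TB/day
    stream("AOD", 10, 0.2)      # 10 MB/file at 0.2 Hz   ->  2 MB/s, ~0.17 TB/day
    stream("AODm", 500, 0.04)   # merged AOD at 0.04 Hz  -> 20 MB/s, ~1.7 TB/day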
CMS Data Flows
Accessing the Data – CMS Basic Data Flows
- Tier-0 to Tier-1 Flow
- Tier-1 to Tier-2 Flow
Service Challenges
Purpose
- Understand what it takes to operate a real grid service – run for weeks/months at a time (not just limited to experiment Data Challenges)
- Trigger and verify Tier-1 & large Tier-2 planning and deployment – tested with realistic usage patterns
- Get the essential grid services ramped up to target levels of reliability, availability, scalability, end-to-end performance
Four progressive steps from October 2004 through September 2006
- End 2004 – SC1 – data transfer to a subset of Tier-1s
- Spring 2005 – SC2 – include mass storage, all Tier-1s, some Tier-2s
- 2nd half 2005 – SC3 – Tier-1s, >20 Tier-2s – first set of baseline services
- Jun-Sep 2006 – SC4 – pilot service
- Autumn 2006 – LHC service in continuous operation – ready for data taking in 2007
SC4 – the Pilot LHC Service from June 2006
A stable service on which experiments can make a full demonstration of the experiment offline chain
- DAQ → Tier-0 → Tier-1 and/or Tier-2 – data recording, calibration, reconstruction
- Offline analysis – Tier-1 ↔ Tier-2 data exchange – simulation, batch and end-user analysis
And sites can test their operational readiness
- Service metrics → MoU service levels
- Grid services
- Mass storage services, including magnetic tape
- Extension to most Tier-2 sites
The Service Challenge program this year must show that we can run reliable services
- Grid reliability is the product of many components – middleware, grid operations, computer centres, .... (illustrated in the sketch below)
Target for this Fall (LCG goal) – too modest? too ambitious?
- 90% site availability
- 90% user job success
Requires a major effort by everyone to monitor, measure, debug
First data will arrive next year – NOT an option to get things going later
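A worked illustration of the "product of many components" point (my own numbers, assuming independent components; not from the talk): if a user job must traverse N services, each available with probability p, the end-to-end success rate is p**N, so even 99% components leave only about 90% over a ten-component chain – exactly the scale of the Fall targets.

    # Illustration (independent components assumed, numbers are mine):
    # end-to-end reliability of a chain of N services, each with
    # availability p, is p ** N.
    def chain_reliability(p: float, n_components: int) -> float:
        return p ** n_components

    for p in (0.999, 0.99, 0.95):
        line = ", ".join(f"{n} components -> {chain_reliability(p, n):.0%}"
                         for n in (5, 10, 20))
        print(f"per-component availability {p:.1%}: {line}")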
Availability Targets ... as anticipated by LCG
- End September 2006 – end of Service Challenge 4: 8 Tier-1s and 20 Tier-2s at > 90% of MoU targets
- April 2007 – service fully commissioned: all Tier-1s and 30 Tier-2s at > 100% of MoU targets
Measuring Response Times and Availability
Site Functional Test framework: monitoring services by running regular tests
- basic services – SRM, LFC, FTS, CE, RB, Top-level BDII, Site BDII, MyProxy, VOMS, R-GMA, ....
- VO environment – tests supplied by the experiments
- results stored in a database
- displays & alarms for sites, grid operations, experiments
- high-level metrics for management
- integrated with the EGEE operations portal – the main tool for daily operations (EGEE)
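A minimal sketch of the idea described above – not the real SFT code; the site names, probe commands and table layout are hypothetical stand-ins (the real framework submits grid jobs and probes SRM, LFC, FTS, CE, ...). It shows the three ingredients: probes run regularly per site, results written to a database, and a derived availability metric.

    # Sketch of an SFT-style test runner (all names hypothetical stand-ins).
    import sqlite3
    import subprocess
    import time

    SITES = ["SITE-A", "SITE-B"]          # hypothetical site names
    PROBES = {                            # hypothetical probe commands
        "ping-ce": ["true"],              # stand-in for a CE test
        "ping-srm": ["true"],             # stand-in for an SRM test
    }

    db = sqlite3.connect("sft_results.db")
    db.execute("""CREATE TABLE IF NOT EXISTS results
                  (ts REAL, site TEXT, probe TEXT, passed INTEGER)""")

    def run_once() -> None:
        """Run every probe against every site and record pass/fail."""
        now = time.time()
        for site in SITES:
            for probe, cmd in PROBES.items():
                passed = subprocess.run(cmd).returncode == 0
                db.execute("INSERT INTO results VALUES (?, ?, ?, ?)",
                           (now, site, probe, int(passed)))
        db.commit()

    def availability(site: str) -> float:
        """Fraction of recorded probes that passed for `site`."""
        row = db.execute("SELECT AVG(passed) FROM results WHERE site = ?",
                         (site,)).fetchone()
        return row[0] or 0.0

    run_once()
    for site in SITES:
        print(f"{site}: availability {availability(site):.0%}")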
Site Functional Tests
[Plots: availability of 10 Tier-1 sites and of 5 Tier-1 sites, July 2005 – March 2006; average value of the sites shown]
- Tier-1 sites without BNL
- Basic tests only
- Only partially corrected for scheduled down time
- Not corrected for sites with less than 24-hour coverage
Medium Term Schedule ... as it was declared by LCG in April 2006
- SC4 stable services – for experiment tests, then deployment
- SRM 2 test and deployment plan being elaborated – October target
- 3D distributed database services – development, test, deployment
- Additional functionality to be agreed, developed, evaluated – ?? deployment schedule ??
LCG Service Deadlines
- Pilot services – stable service from 1 June 06
- LHC service in operation – 1 Oct 06; over the following six months ramp up to full operational capacity & performance (2006/07: cosmics)
- LHC service commissioned – 1 Apr 07 (2007: first physics; 2008: full physics run)
Overview of CMS Computing, Software & Analysis Challenge 2006 (CSA06) – Goals and Metrics