Evolution of the LHC Computing Models


  1. Evolution of the LHC Computing Models. Ian Fisk, May 22, 2014.

  2. About Me: I am a scientist with Fermilab. I have spent the last 14 years working on LHC computing problems. I helped build the first Tier-2 prototype computing center in the US, I was responsible for integration and commissioning of the CMS computing system for 2006-2010, and I was computing coordinator of CMS for LHC Run 1.

  3. Final Steps: Software and computing is the final step in a long series needed to realize the physics potential of the experiment: store and serve the data, reconstruct the physics objects, and analyze the events. As the environment has become more complex and demanding, computing and software have had to become faster and more capable.

  4. "You need this ... to get this." (image-only slide)

  5. Distributed Computing: The computing models are based roughly on the MONARC model, developed more than a decade ago. It foresaw tiered computing facilities to meet the needs of the LHC experiments, assumed poor networking, and defined a hierarchy of functionality and capability. (Figure: "Computing for an LHC Experiment Based on a Hierarchy of Computing Centers", model circa 2005, showing CERN, Tier-2 centers, and university Tier-3s connected by 622 Mbit/s links; the CPU and disk capacities shown are representative and give an approximate scale.)

  6. Distributed Computing at the Beginning: Before the LHC, most of the computing capacity was located at the experiment at the start of a program; most experiments evolved and added distributed computing later. The LHC began with a global distributed computing system (NDG, OSG, LCG).

  7. Grid Services: During the evolution the low-level services are largely the same; most of the changes come from the actions and expectations of the experiments. (Diagram: site and experiment services, with the CE as the connection to batch (Globus- and CREAM-based), the SE as the connection to storage (SRM or xrootd), plus the WMS, the BDII information system, FTS, and VOMS; higher-level and lower-level services provide consistent interfaces to facilities.)
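For a flavor of how the lower-level services are used, here is a hedged sketch of querying the information system. The BDII publishes GLUE data over anonymous LDAP, conventionally on port 2170; the host name below is a placeholder and the filter is illustrative, not a recommended query.

```python
"""Query a BDII information system for computing elements (illustrative host and filter)."""
import subprocess

BDII_HOST = "bdii.example.org"   # placeholder top-level BDII host

cmd = [
    "ldapsearch", "-x", "-LLL",          # anonymous bind, minimal output
    "-H", f"ldap://{BDII_HOST}:2170",    # conventional BDII port
    "-b", "o=grid",                      # top of the GLUE tree
    "(objectClass=GlueCE)",              # illustrative filter: list computing elements
    "GlueCEUniqueID",
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout or result.stderr)
```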

  8. Successes: When the WLCG started there was a lot of concern about the viability of the Tier-2 program, a university-based grid of often small sites. (Pie charts: 2009 and 2013 capacity shares across the Tier-0, Tier-1s, and Tier-2s, 18%/44%/33% and 20%/47%/38%; total capacity grows by a factor of 2.5.) The total system uses close to half a million processor cores continuously.

  9. Moving Forward: The strict hierarchy of connections becomes more of a mesh. Divisions in functionality, especially for chaotic activities like analysis, become more blurry, and there is more access over the wide area. (Diagram: the Tier-0/CAF handles prompt reconstruction, storage, and commissioning; the Tier-1s handle re-reconstruction, simulation archiving, and data serving; the Tier-2s handle simulation and user analysis.) Model changes have been an evolution; not all experiments have emphasized the same things, each pushing farther in particular directions.

  10. Evolution: We have had evolution all through the history of the project: slow changes and improvements. Some examples: use of Tier-2s for analysis in LHCb, full-mesh transfers in ATLAS and CMS, data federation in ALICE, and better use of the network by all the experiments. But many things are surprisingly stable: hardware architectures (x86 with ever-increasing core counts) and services, both in terms of architectures and interfaces.

  11. Looking Back: In June of 2010 we had a workshop on data access and management in Amsterdam. Areas we worried about at the time were making a less deterministic and more flexible system, providing better access to the data for analysis, and being more efficient. Some things we were not worrying about: new hardware architectures, clouds, and opportunistic computing.

  12. Progress: Networking. One of the areas of progress has been better use of wide-area networking to move data and to make efficient use of the distributed computing: there is limited dedicated network capacity and much shared use of R&E networking. LHCOPN is a dedicated resource for T0-T1 and T1-T1 traffic; LHCONE is a newer initiative for Tier-2 networking.

  13. Mesh Transfers: a change from the original transfer topology to a mesh. (Diagrams: Tier-1 and Tier-2 transfer patterns, labeled "Transfers West 150 MB/s" before and "Transfers East 300 MB/s" after.)

  14. Completing the Mesh: Tier-2 to Tier-2 transfers are now similar to Tier-1 to Tier-2 transfers in CMS. (Diagram: direct transfers among Tier-2s.)
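A minimal sketch of what the mesh means for data placement, using an invented replica table and link-quality score (this is not CMS's actual transfer system): any site holding a replica is a candidate source, rather than only the regional Tier-1 above the destination in the old hierarchy.

```python
"""Pick a transfer source in a full mesh: any replica holder qualifies (toy example)."""

# Invented replica catalogue: dataset -> sites holding a copy.
REPLICAS = {"/Dataset/A": ["T1_US_FNAL", "T2_DE_DESY", "T2_US_UCSD"]}

# Invented link-quality scores to the destination site (higher is better).
LINK_QUALITY = {"T1_US_FNAL": 0.7, "T2_DE_DESY": 0.9, "T2_US_UCSD": 0.8}

def choose_source(dataset):
    # Mesh model: rank every replica holder by measured link quality,
    # instead of always routing through the regional Tier-1.
    candidates = REPLICAS.get(dataset, [])
    return max(candidates, key=lambda site: LINK_QUALITY.get(site, 0.0), default=None)

print(choose_source("/Dataset/A"))   # -> T2_DE_DESY in this toy example
```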

  15. Overlay Batch: One of the challenges of the grid is that, despite having a consistent set of protocols, actually getting access to resources takes a lot of workflow development. Pilot jobs are centrally submitted and start on worker nodes, reporting back that they are available and building up an enormous overlay batch queue.
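A minimal sketch of the pilot idea, with an in-memory stand-in for the central task queue (the names and payloads are illustrative, not any experiment's real pilot framework): the pilot lands on a worker node, reports that it is alive, then repeatedly pulls real work until none is left or its wall-clock allocation runs out.

```python
"""Minimal pilot-job sketch; the queue and payloads are illustrative only."""
import socket
import subprocess
import time

# Stand-in for the central overlay-batch queue (normally a remote service).
CENTRAL_QUEUE = [
    ["echo", "reconstruct event block 1"],
    ["echo", "reconstruct event block 2"],
]

def report_available():
    # A real pilot would phone home to the central service; here we just log.
    print(f"pilot on {socket.gethostname()} is available")

def pull_work():
    # Pull the next payload from the (simulated) central queue.
    return CENTRAL_QUEUE.pop(0) if CENTRAL_QUEUE else None

def run_pilot(wall_seconds=3600):
    report_available()
    deadline = time.time() + wall_seconds
    while time.time() < deadline:
        payload = pull_work()
        if payload is None:
            break                              # no more work: release the slot
        subprocess.run(payload, check=False)   # execute the real job

if __name__ == "__main__":
    run_pilot(wall_seconds=60)
```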

  16. So what changes next? The LHC is currently in a two-year shutdown to improve the machine. The energy will increase to ~13 TeV and the luminosity will grow by a factor of ~2. Both CMS and ATLAS aim to collect data at about 1 kHz. Events are more complex and take longer to reconstruct. All experiments need to continue to improve efficiency.

  17. Resource Provisioning: The switch to pilot submission opens other improvements in resource provisioning. Instead of submitting pilots through CEs, we can submit pilots through local batch systems, or submit requests to cloud provisioning systems that start VMs with pilots. Currently both ATLAS and CMS provision their online trigger farms through an OpenStack cloud, and the CERN Tier-0 will also be provisioned this way. Before the start of Run 2, ~20% of the resources could be allocated with cloud interfaces.

  18. Evolving the Infrastructure: In the new resource provisioning model the pilot infrastructure communicates with the resource provisioning tools directly, requesting groups of machines for periods of time. (Diagram: resource requests go either through a CE and batch queue to worker nodes running pilots, or through a cloud provisioning interface to VMs running pilots; a sketch of the cloud path follows below.)
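A minimal sketch of the cloud path, assuming the openstacksdk Python client and illustrative cloud, image, and flavor names; the pilot bootstrap script is passed as instance user-data so each VM joins the overlay batch when it boots. This illustrates the pattern, not either experiment's actual provisioning code.

```python
"""Boot a batch of VMs that start a pilot on boot (illustrative names throughout)."""
import base64
import openstack  # openstacksdk; credentials come from clouds.yaml or the environment

PILOT_BOOTSTRAP = b"""#!/bin/bash
# user-data: fetch and launch the pilot so the VM joins the overlay batch
curl -sSL https://example.org/pilot.sh | bash
"""

def request_machines(count, image="cc7-base", flavor="m1.large"):
    conn = openstack.connect(cloud="trigger-farm")   # hypothetical cloud name
    servers = []
    for i in range(count):
        servers.append(conn.compute.create_server(
            name=f"pilot-vm-{i}",
            image_id=conn.compute.find_image(image).id,
            flavor_id=conn.compute.find_flavor(flavor).id,
            user_data=base64.b64encode(PILOT_BOOTSTRAP).decode(),
        ))
    # The pilot framework records the lease and deletes the servers
    # when the requested period of time expires.
    return servers

if __name__ == "__main__":
    request_machines(count=10)
```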

  19. Local Environment: Once you arrive on a worker node, you need something to run. Environment distribution has come a long way: the LHC experiments use CVMFS, the same centrally managed read-only environment distributed to nearly half a million processor cores. (Diagram: central repository, a hierarchy of Squid caches, and a local FUSE client on each worker node.)
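As a concrete illustration (the repository name and setup-script path are the usual CMS ones, but treat them as assumptions): a job only needs the /cvmfs mount point to be present, because the FUSE client and the Squid hierarchy fetch and cache files on demand.

```python
"""Check that the CVMFS-distributed environment is visible on this worker node."""
import os
import subprocess

CVMFS_REPO = "/cvmfs/cms.cern.ch"                              # experiment software repository
SETUP_SCRIPT = os.path.join(CVMFS_REPO, "cmsset_default.sh")   # assumed setup script

def environment_ready():
    # Listing the path triggers the FUSE client to mount and fetch on demand.
    return os.path.isdir(CVMFS_REPO) and os.path.isfile(SETUP_SCRIPT)

if __name__ == "__main__":
    if environment_ready():
        # Source the environment and report the configured architecture.
        subprocess.run(["/bin/bash", "-c",
                        f"source {SETUP_SCRIPT} && echo SCRAM_ARCH=$SCRAM_ARCH"])
    else:
        print("CVMFS repository not mounted on this node")
```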

  20. High Performance Computing: As modern worker nodes get more and more cores per box, these systems look like HPC. All LHC experiments are working on multi-process and/or multi-threaded versions of their code. We are transitioning how we schedule pilots: a single pilot comes in and takes over an entire box or group of cores, and the overlay batch then schedules the appropriate mix of work to use all the cores (a sketch follows below). Tightly coupled applications can run too.
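A minimal sketch of the whole-node idea, assuming a hypothetical job mix with per-job core counts: the pilot owns all the cores on the box and greedily packs single-core and multi-core payloads into that allocation.

```python
"""Greedy core-packing inside a whole-node pilot (illustrative job mix)."""
import os

def pack_jobs(total_cores, queue):
    """Pick jobs whose core requests fit into the node-wide allocation."""
    free, scheduled = total_cores, []
    for name, cores in queue:        # queue: list of (name, cores_needed)
        if cores <= free:
            scheduled.append((name, cores))
            free -= cores
    return scheduled, free

if __name__ == "__main__":
    cores = os.cpu_count() or 8
    # Hypothetical mix: one multi-threaded reconstruction job plus single-core tasks.
    queue = [("multithreaded-reco", 8)] + [(f"single-core-{i}", 1) for i in range(16)]
    scheduled, idle = pack_jobs(cores, queue)
    print(f"{cores} cores: running {len(scheduled)} jobs, {idle} cores idle")
```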

  21. Wide Area Access: All experiments are looking at sending data directly to the worker node, even over long distances, delivering data directly to applications over the WAN. It is not immediately obvious that this increases wide-area network transfers.
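A minimal sketch of remote reading through a data federation, assuming PyROOT is available and using a placeholder redirector, file path, and tree name: the application opens a root:// URL and reads only what it needs, so the bytes crossing the WAN can be far fewer than a full file transfer.

```python
"""Open a file over the WAN via an xrootd data federation (placeholder URL and path)."""
import ROOT  # PyROOT; requires a ROOT installation with xrootd support

# Hypothetical global redirector and dataset path, for illustration only.
URL = "root://xrootd-redirector.example.org//store/data/Run2012/example.root"

f = ROOT.TFile.Open(URL)          # the redirector locates a site holding the file
if f and not f.IsZombie():
    tree = f.Get("Events")        # assumed tree name
    if tree:
        print(f"remote file opened, {tree.GetEntries()} events visible")
    f.Close()
else:
    print("could not open remote file")
```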

  22. Data Moved: Currently we see about 400 MB/s read over the wide area, with thousands of active transfers and only a small hit in efficiency. A lot of work goes into predictive read-ahead and caching.

  23. Network Improvements: While CPU (25%/year) and disk (20%/year) have both slowed in their performance improvement at fixed cost, networking is still improving by more than 30% per year. The cost of 100 Gb/s optics is falling. For CMS we expect 30% of our Tier-2 resources at universities to be connected at 100 Gb/s within a year.
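To make the gap concrete, a quick compounding calculation with the growth rates quoted above (the five-year horizon is an arbitrary illustration, not a figure from the talk):

```python
"""Compound the quoted per-year improvements at fixed cost over five years."""
rates = {"CPU": 0.25, "disk": 0.20, "network": 0.30}
years = 5
for resource, rate in rates.items():
    factor = (1 + rate) ** years
    print(f"{resource}: x{factor:.1f} in {years} years")
# Roughly: CPU x3.1, disk x2.5, network x3.7; the network advantage compounds.
```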

  24. Changes at CERN: Evolution of the Tier-0. CERN recently deployed half of its computing in Budapest, with 2 x 100 Gb/s links connecting the two facilities; Geneva is expensive for people, power, and space. All the disks are at CERN and half the worker nodes are in Hungary. We see a 5% drop in analysis efficiency. (Plot: analysis job CPU efficiency from October 2013 to March 2014, Meyrin SLC6 virtual versus Wigner SLC6 virtual.)
