
Challenges for Grids, Markus Schulz, CERN IT GD (PowerPoint presentation)



  1. Challenges for Grids. Markus Schulz, CERN IT GD, LCG/EGEE

  2. Disclaimer
     • All views expressed are mine and are not necessarily shared by the projects or organization that I am associated with
       – Don't blame: EGEE, LCG, CERN, …
       – Critique, flames, and the like should be directed to: Markus.schulz@cern.ch
     7/31/2006 Challenges for grids

  3. Approach
     • Thinking a few years ahead
       – Based on what we know
       – Ignoring problems like
         • software quality (far from perfect)
         • lack of fabric management on sites
         • site admins' fear of losing total control
       – Focused on structural problems
         • Make production grids work at the required scale
         • Expand the systems to other domains
           – Industry, micro VOs, …
         • Move closer to the grid vision

  4. Babylonian Confusion
     • What is called Grid covers:
       – Standalone clusters
       – Clusters for scaling a single service
       – Intra-organizational clusters
         • With central administrative control
       – Community computing
         • SETI@home, BOINC
       – I. Foster (this is the definition I will use):
         • "coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations."
         • "On-demand, ubiquitous access to computing, data, and services"

  5. The Dangers of Success
     • Early success
       – Constraints from existing infrastructures
         • Users depend on them
       – Research to production transition is very hard
       – Restricts standardization
         • The curse of backwards compatibility
     • Example: EGEE, WLCG, OSG, ARC
       – > 70 VOs

  6. EGEE Grid Sites: Q1 2006
     [Charts: number of sites (axis 0 to 200) and number of CPUs (axis 0 to 30000) per month, axis labels from Apr-04 to Dec-05]
     • EGEE: > 190 sites, 40 countries
     • > 24,000 processors
     • ~ 5 PB storage
     • ~ 70 Virtual organizations

  7. EGEE Operations
     • Grid operator on duty
       – 6 teams working in weekly rotation
         • CERN, IN2P3, INFN, UK/I, Ru, Taipei
       – Crucial in improving site stability and management
       – Expanding to all ROCs in EGEE-II
     • Operations coordination
       – Weekly operations meetings
       – Regular ROC managers meetings
       – Series of EGEE Operations Workshops
         • Nov 04, May 05, Sep 05, June 06
     • Geographically distributed responsibility for operations:
       – There is no "central" operation
       – Tools are developed/hosted at different sites:
         • GOC DB (RAL), SFT (CERN), GStat (Taipei), CIC Portal (Lyon)
     • Procedures described in Operations Manual
       – Introducing new sites
       – Site downtime scheduling
       – Suspending a site
       – Escalation procedures
       – etc.

  8. Use of the infrastructure
     [Chart: number of jobs per day (axis 0 to 35000), Jan-05 to Apr-06, series: total and non-LCG]
     • Sustained and regular workloads of > 30K jobs/day
       – spread across the full infrastructure
       – doubling/tripling in the last 6 months, with no effect on operations
     • Will increase to at least 150K jobs/day in the next 18 months
     [Chart: CPU time delivered per month, Jun-05 to Apr-06, axes labelled SI2K CPU-hours/month (0 to 3,000,000) and cpu-years/month (0 to 300), series: lhcb, geant4, cms, biomed, atlas, alice]
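
To put the projection on slide 8 in perspective, here is a minimal sketch of the compound monthly growth that going from 30K to 150K jobs/day in 18 months would imply. The assumption of steady exponential growth and the function name `implied_monthly_growth` are illustrative only and do not come from the slides.

```python
import math


def implied_monthly_growth(start: float, target: float, months: int) -> float:
    """Return the constant monthly growth rate that turns start into target."""
    return (target / start) ** (1.0 / months) - 1.0


# Figures from the slide: >30K jobs/day now, at least 150K in 18 months.
rate = implied_monthly_growth(30_000, 150_000, 18)

# Time needed to double the workload at that (assumed constant) rate.
doubling_months = math.log(2) / math.log(1.0 + rate)

print(f"implied growth: {rate:.1%} per month")
print(f"doubling time: {doubling_months:.1f} months")
```

Under this assumption the workload would have to grow by roughly 9% every month, that is, double about every eight months. Since the slide says "at least 150K", this is a lower bound on the required growth.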

  9. Use of the infrastructure
     • Massive data transfers > 1.5 GB/s
     • Several applications now depend on EGEE as their primary computing resource
     • Sustainability: usage can (and does) grow without need for additional operational effort

  10. A global, federated e-Infrastructure
      [Map labels: BalticGrid, NAREGI, SEE-GRID, OSG, EUChinaGrid, EUMedGrid, EUIndiaGrid, EELA]
      EGEE infrastructure:
      • ~ 200 sites in 39 countries
      • ~ 20 000 CPUs
      • > 5 PB storage
      • > 35 000 concurrent jobs per day
      • > 80 Virtual Organisations

  11. OSG: currently ~20,000 jobs/day
      [Chart series: ATLAS, CMS, CDF, D0, GLOW, STAR]

  12. This all looks very promising…
      • But…
        – Interoperation between grids
          • Lack of standardization
          • Several larger sites have to support multiple interfaces
        – Managing diversity inside grids
          • OS versions
            – Applications are sensitive and sites have preferences
            – Sites and users move independently
          • Batch systems
            – Each requires extensive work to interface
            – Limited to smallest set of shared functionality
              » Frustrates users AND resource managers
              » Lack of standardization
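
The "smallest set of shared functionality" point on slide 12 can be illustrated with a small sketch: a uniform grid interface can only rely on the intersection of what every batch system offers, so each system's extra features stay hidden. The system names (`batch_a`, `batch_b`, `batch_c`) and their feature lists are invented for illustration and do not describe any real batch system.

```python
# Hypothetical feature sets; names and features are invented for
# illustration and do not describe any real batch system.
BATCH_FEATURES = {
    "batch_a": {"submit", "cancel", "status", "priorities", "job_arrays"},
    "batch_b": {"submit", "cancel", "status", "priorities", "checkpointing"},
    "batch_c": {"submit", "cancel", "status", "advance_reservation"},
}


def shared_functionality(features: dict[str, set[str]]) -> set[str]:
    """Features a uniform grid interface can rely on: the intersection."""
    return set.intersection(*features.values())


def hidden_functionality(features: dict[str, set[str]]) -> dict[str, set[str]]:
    """Per system, the native features the uniform interface cannot expose."""
    common = shared_functionality(features)
    return {name: feats - common for name, feats in features.items()}


common = shared_functionality(BATCH_FEATURES)
hidden = hidden_functionality(BATCH_FEATURES)

print("exposed through the grid interface:", sorted(common))
for name in sorted(hidden):
    print(f"{name} features lost behind the interface:", sorted(hidden[name]))
```

In this toy example only `submit`, `cancel` and `status` survive, which is one way to read why both users (who lose features) and resource managers (whose systems are underused) end up frustrated.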

  13. More problems…
      • Storage, DBs, …
        – Different storage management systems are established
          • HSMs, disk pools with shared file systems
        – Different security, storage models, lack of standards
      • VO management
        – Creation of a VO is straightforward
        – Getting access to resources requires:
          • Negotiation with resource providers
          • Significant effort of sites to host an additional VO
        – Accounting, dynamic prioritization, quotas problematic
          • on global level (between different VOs)
          • inter-VO
          • Constrained by national privacy laws
        – No market of resources

  14. More problems…
      • Achievable reliability limited
        – The more complex services have to interact, the higher the probability that the overall service fails
          • 'Russian Doll Performance Sink', here: file open
        – Applies to many services
      • Grid interfaces need to be native interfaces
        – STANDARDS
      [Diagram labels: GFAL, SRM, MSS, MSS. Information system interactions are left out.]
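
The reliability argument on slide 14 can be made concrete with a minimal sketch. It assumes that a file open has to pass through every layer in turn and that layers fail independently, so the overall success probability is the product of the per-layer probabilities. The layer names follow the slide's diagram; the probabilities are invented for illustration.

```python
from math import prod

# Hypothetical per-layer success probabilities for a single file open.
# Layer names follow the slide's diagram; the numbers are invented.
LAYER_SUCCESS = {
    "GFAL": 0.99,
    "SRM": 0.99,
    "MSS": 0.99,
}


def chain_success(layers: dict[str, float]) -> float:
    """Probability that every layer succeeds, assuming independent failures."""
    return prod(layers.values())


overall = chain_success(LAYER_SUCCESS)
print(f"{len(LAYER_SUCCESS)} layers: {overall:.4f} overall success")

# Stacking more layers of the same quality only lowers the total.
for depth in (1, 2, 4, 8):
    print(f"{depth} layers at 0.99 each: {0.99 ** depth:.4f}")
```

Even with each layer succeeding 99% of the time, every additional wrapper lowers the end-to-end figure, which is the motivation for native grid interfaces rather than further nesting.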

  15. State of Standardization
      • First round of tentative standards
        – Mostly based on research work
          • Missed deployment and operations related part
        – Production grids started with 'de facto standards'
        – Now: OGSA
          • Much more detailed, recycles established standards
          • But: additional layers, old services will be wrapped!
      Diagram from Globus Alliance

  16. [Diagram of service categories. Group labels: Data Services, Information Services, Context Services, Execution Mgmt Services, Infrastructure Services, Resource Mgmt Services, Security Services, Self Mgmt Services. Other recoverable labels include: Replication, Transfer, Data Integration, Access, VO Mgmt, Policy Mgmt, Event Mgmt, Monitoring, Discovery, Logging, Workflow, Workload Mgmt, Job Mgmt, Execution Planning, WSRF, WSN, WSDM, Naming, Reservation, Configuration, Deployment, Provisioning, Heterogeneity Mgmt, Optimization, Service Level Attainment, QoS Mgmt, Authentication, Authorization, Integrity, Boundary Traversal]

  17. Relevant Specifications
      [Diagram: specification stack. Column headings: Systems Management, Grid Computing, Utility Computing. Row labels: Use Cases & Applications, Core Services, Base Profile, Data Model, Data Transport]
      • Use cases and applications shown: distributed query processing, data centre, collaboration, persistent archive, ASP, multimedia, VO management
      • Specifications shown: ByteIO, OGSA-EMS, WS-DAI, Information, WSDM, Discovery, GGF-UR, Naming, WS-BaseNotification, Privacy, Trust, GFD-C.16, WSRF-RP, WSRF-RL, WSRF-RAP, WS-Security, SAML/XACML, X.509, WS-Addressing, HTTP(S)/SOAP, WSDL, CIM/JSIM
      • Grid computing, distributed computing and utility computing are different views of the same important problem domain.

  18. Is there Hope?
      • Diversity on OS level
        – Virtualization is making progress (Xen, …)
      • Experience-based standardization
        – Information systems, etc.
      • Interoperation efforts start to influence standardization
      • Core services start to work on native Grid interfaces
        – DBs, batch systems, storage
        – Still in an early state, but has huge potential
      • Solid, well-managed standards are needed
      • Otherwise a wrapper is the 'best' solution
