WP4 Fabric Management

WP4 Fabric Management, 3rd EU Review, Maite Barroso (CERN)

WP4 Fabric Management, 3rd EU Review. Maite Barroso, CERN (Maite.Barroso.Lopez@cern.ch). DataGrid is a project funded by the European Commission under contract IST-2000-25182. 3rd EU Review, 19-20/02/2004.


  1. WP4 Fabric Management, 3rd EU Review
     Maite Barroso, CERN (Maite.Barroso.Lopez@cern.ch)
     DataGrid is a project funded by the European Commission under contract IST-2000-25182
     3rd EU Review, 19-20/02/2004

  2. Outline
     • Objectives (3’) (summary of objectives for the whole project)
     • Achievements (5’) (summary of all useful products)
     • Lessons learned (3’)
     • Future & Exploitation (4’)
     • Questions (10’)

  3. WP4: main objective
     “To deliver a computing fabric comprised of all the necessary tools to manage a center providing grid services on clusters of thousands of nodes.”
     • User job management (Grid and local)
     • Automated management of large clusters

  4. WP4 objective
     “To deliver a computing fabric comprised of all the necessary tools to manage a center providing grid services on clusters of thousands of nodes.”
     • User job management (Grid and local)
     • Automated management of large clusters
     The development work is divided into 6 subtasks:
     • Configuration Management
     • Installation Management
     • Monitoring
     • Fault Tolerance
     • Resource Management
     • Gridification

  5. DataGrid Architecture
     [Layer diagram, top to bottom; WP4 covers the fabric services layer]
     • Local Computing: Local Application, Local Database
     • Grid Application Layer: Data Management, Metadata Management, Object to File Mapping, Job Management
     • Collective Services: Information & Monitoring, Replica Manager, Grid Scheduler
     • Underlying Grid Services: Computing Element Services, Storage Element Services, Replica Catalog, Authorization Authentication and Accounting, Service Index, SQL Database Services
     • Fabric services: Fabric Monitoring and Fault Tolerance, Node Installation & Management, Fabric Storage Management, Resource Management, Configuration Management

  6. WP4 architecture design and the ideas behind it
     • Information model. Configuration is distinct from monitoring
       - Configuration == desired state (what we want)
       - Monitoring == actual state (what we have)
     • Aggregation of configuration information
       - Good experience with LCFG concepts with central configuration template hierarchies
     • Node autonomy. Resolve local problems locally if possible
       - Cache node configuration profile and local monitoring buffer
     • Scheduling of intrusive actions
     • Plug-in authorization and credential mapping
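The split between desired state (configuration) and actual state (monitoring), together with aggregation through a template hierarchy, can be illustrated with a small Python sketch. This is not the quattor or LCFG implementation: the function names (`aggregate`, `flatten`, `drift`), the three-level site/cluster/node hierarchy and all values are assumptions made up for illustration.

```python
from copy import deepcopy


def _merge(base, override):
    """Recursively merge `override` into `base`; values in `override` win."""
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            _merge(base[key], value)
        else:
            base[key] = deepcopy(value)


def aggregate(templates):
    """Build one node profile from a list of templates, general to specific."""
    profile = {}
    for template in templates:
        _merge(profile, template)
    return profile


def flatten(tree, prefix=""):
    """Turn a nested dict into {"/path/to/leaf": value}."""
    flat = {}
    for key, value in tree.items():
        path = f"{prefix}/{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def drift(desired, actual):
    """Return {path: (desired, actual)} for every path where the two differ."""
    want, have = flatten(desired), flatten(actual)
    return {
        path: (want.get(path), have.get(path))
        for path in sorted(set(want) | set(have))
        if want.get(path) != have.get(path)
    }


# Hypothetical template hierarchy: site-wide, then cluster, then node.
site = {"software": {"openssh": "3.6"}, "services": {"ntp": "on"}}
cluster = {"services": {"batch": "on"}}
node = {"software": {"openssh": "3.7"}}

# Desired state: what we want (configuration).
profile = aggregate([site, cluster, node])

# Actual state: what we have (as a monitoring sensor might report it).
actual = {"software": {"openssh": "3.6"}, "services": {"ntp": "on", "batch": "off"}}

report = drift(profile, actual)
for path, (wanted, found) in report.items():
    print(f"{path}: want {wanted!r}, have {found!r}")
```

Keeping the two states in separate structures means a node can cache its profile and compare it against local measurements without contacting a central server, which is the node autonomy idea from the slide.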

  7. Automated management of large clusters
     [Diagram. Labels: GRID: Computing Element; FABRIC: RMS, Installation System, Monitoring System, Fault Tolerance, Configuration System]

  8. Automated management of large clusters
     [Diagram. Labels: Fault Tolerance System, Monitoring System, Configuration System, Installation System, Node]

  9. Automated management of large clusters
     [Diagram. Labels: WP4 Fault Tolerance framework, Node]

  10. User job management (Grid and local)
     [Diagram. Legend: Gridification component, WP4 non-gridification, Non-WP4 subsystem; External to fabric, Internal to fabric]
     • Grid: Workload Mgt System (WP1)
     • CE (Computing Element): job repository, LCAS, LCMAPS, RMS, farms
     • SE (Storage Element, WP5)
     • Plug-ins: static list, wallclock time, quota check, uid/gid, tokens, other
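The plug-in idea behind LCAS (authorization) and LCMAPS (credential mapping) can be sketched in Python as a chain of independent checks followed by a mapping from grid identity to a local account. This does not reproduce the real LCAS or LCMAPS interfaces: every name here (`JobRequest`, `static_list`, `wallclock_limit`, `quota_check`, `authorize`, `map_credentials`) is hypothetical, and the rule that all plug-ins must accept is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRequest:
    subject: str            # grid identity, e.g. a certificate subject
    wallclock_minutes: int  # requested wall-clock time
    disk_mb: int            # requested disk space


def static_list(allowed_subjects):
    """Plug-in: accept only subjects found in a fixed list."""
    return lambda request: request.subject in allowed_subjects


def wallclock_limit(max_minutes):
    """Plug-in: reject requests asking for too much wall-clock time."""
    return lambda request: request.wallclock_minutes <= max_minutes


def quota_check(quota_mb):
    """Plug-in: reject requests exceeding the subject's disk quota."""
    return lambda request: request.disk_mb <= quota_mb.get(request.subject, 0)


def authorize(request, plugins):
    """Sketch assumption: a request passes only if every plug-in accepts it."""
    return all(plugin(request) for plugin in plugins)


def map_credentials(subject, free_accounts, leases):
    """Map a grid subject to a local (uid, gid), reusing an existing lease."""
    if subject not in leases:
        if not free_accounts:
            raise RuntimeError("no free local account")
        leases[subject] = free_accounts.pop(0)
    return leases[subject]


# Hypothetical site policy and account pool.
plugins = [
    static_list({"/O=Grid/CN=alice"}),
    wallclock_limit(120),
    quota_check({"/O=Grid/CN=alice": 500}),
]
free_accounts = [(5001, 500), (5002, 500)]
leases = {}

alice = JobRequest("/O=Grid/CN=alice", wallclock_minutes=60, disk_mb=100)
mallory = JobRequest("/O=Grid/CN=mallory", wallclock_minutes=60, disk_mb=100)

alice_ok = authorize(alice, plugins)
mallory_ok = authorize(mallory, plugins)
alice_ids = map_credentials(alice.subject, free_accounts, leases)
```

Because each check is an ordinary callable, a site could add or remove a policy by editing the `plugins` list without touching the code that calls `authorize`.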

  11. Achievements
     • Long term solution for system installation and configuration; modular, robust, reliable and scalable system which addresses the needs of large computing clusters
     • Interim solution proposed to the EU DataGrid testbed as installation and configuration management toolkit while the final quattor framework was developed
     • Framework for monitoring of performance, system status and environmental changes for all resources contained in a fabric

  12. Achievements
     • RMS: Resource Management System. Its main task is to maintain control over the fabric’s farm resources and to ensure the efficient scheduling and execution of user (grid or local) jobs and their coordination with maintenance tasks
     • Fault Tolerance Framework: framework for automatic fault detection and correction
     • Gridification components: Computing Element, Local Centre Authorization Service, Local Credential Mapping Service: provide mechanisms for grid services to access the local fabric services: secure job submission and job control
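Automatic fault detection and correction can be pictured as a set of rules evaluated against locally collected metrics, each rule pairing a condition with a corrective action. The Python sketch below only illustrates that pattern and is not the WP4 Fault Tolerance framework: the `Rule` class, `run_rules`, the metric names and the stub actions (which only return a description instead of touching the system) are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class Rule:
    name: str
    condition: Callable[[Metrics], bool]  # detects a fault from local metrics
    action: Callable[[Metrics], str]      # corrects it and describes what it did


def run_rules(metrics: Metrics, rules: List[Rule]) -> List[str]:
    """Run the action of every rule whose condition holds; return a log."""
    log = []
    for rule in rules:
        if rule.condition(metrics):
            log.append(f"{rule.name}: {rule.action(metrics)}")
    return log


# Hypothetical rules; the actions are stubs standing in for real actuators.
rules = [
    Rule(
        name="tmp_full",
        condition=lambda m: m["tmp_used_pct"] > 90,
        action=lambda m: "cleaned /tmp",
    ),
    Rule(
        name="daemon_down",
        condition=lambda m: m["sshd_running"] == 0,
        action=lambda m: "restarted sshd",
    ),
]

# A hypothetical sample as a local monitoring sensor might provide it.
sample = {"tmp_used_pct": 97.0, "sshd_running": 1.0}
log = run_rules(sample, rules)
```

Evaluating the rules on the node itself, against its own metrics, matches the design goal of resolving local problems locally where possible.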

  13. Lessons learned
     • Fabric Management components are not grid components themselves but they are essential for a working grid.
     • Experience and feedback with existing tools and prototypes helped to get requirements and early feedback from users
     • There is a real need to be able to install, configure and manage the sites
       - Correctly, to avoid configuration errors that may affect not only the site but the whole grid response
       - Automatically, to reduce the workload of system administrators
       - Supporting adaptability, properly managing resource reconfigurations in a fault-tolerant way
       - In a reproducible way

  14. Future & Exploitation
     • All the WP4 partners are committed to continuing support for the WP4 middleware?? To be discussed during the workshop
     • Technical evolution (commitment from partners not needed; could be for whoever wants to work in this field in the future):
       - Gridification components: the components will evolve in the directions set by GGF for authorization and authentication (LCAS: GGF standards for expressing access policies; LCMAPS: support for more services, such as file access using GridFTP, and better OS insulation). The support and extension will be undertaken by EGEE.
       - RMS: evolution to use it for resource management in data-intensive cluster computing. Evolution towards OGSA.
       - LCFGng: no support/evolution after the end of the project.
       - Quattor: some open issues being tackled by the partners: overall installation toolkit and comprehensive end-user documentation. Future work on security enhancements (e.g. fine-grained authorization access to CDB, data encryption). Porting to Solaris 9 and to future RH versions or other Linux distributions.
       - Lemon: displays/GUIs, enhancement of the simple data model, sensors for other platforms (Windows)
       - Fault Tolerance: improvements on rule design (web spider?), user FT API

  15. Future & Exploitation
     WP4 products have been deployed not only within the EDG testbed, but also within other sites and Grid projects/environments (map of Europe with all the sites?):
     • CERN Computing Centre (~2000 nodes)
     • Universidad Autonoma de Madrid (Spain)
     • University of Liverpool (UK)
     • NIKHEF (The Netherlands)
     • LAL (Laboratoire de l'Accélérateur Linéaire, Orsay, France)
     • Zuse Institute Berlin (ZIB)
     • Virtual laboratory for E-science project (The Netherlands)
     • Fermilab’s Site Authentication and Authorization service (SAZ). This triggered the development of the authorization call-out mechanism within Globus
     • LHC Computing Grid project (LCG)
     • CrossGrid
     • GridIce project

  16. Future & Exploitation
     • An excellent example of WP4 product exploitation by a production site is CERN:
       - The CERN Computer Centre was one of the main sources of WP4 requirements
       - Very close collaboration to test and evaluate some of the WP4 products (Lemon and quattor)
       - After a successful evaluation, they adopted them and made the necessary changes to run them in the production clusters (~2000 nodes)
       - Support and future evolution will be taken over by them??

  17. Future & Exploitation
     General concepts:
     • Move from testbeds to production fabrics
     • A production fabric has
       - Inertia … as a virtue!
       - Charted QoS
       - Scalability
       - Procedures and Manageability
     • Cautious introduction
     • Retain qualities and add functionality!

  18. Service Lifecycle Focuses
     • Prototype (proliferation, elaboration)
       - Focus on functionality
       - Performance and scalability
     • Risks
       - Destabilisation
       - Workload
     • Production (simplification, automation)
       - Focus on uniformity, minimisation
       - Process and procedure
       - Availability and reliability
       - Stability and robustness

  19. Questions?
