Software and Computing Requirements: WMS and DDM
Maxim Potekhin, potekhin@bnl.gov
DUNE WMS/DDM Workshop@FNAL, 07/28/2016
About this presentation
• The value of the requirements (the document):
  – inform and guide the evolution of the DUNE Computing Model
  – allow us to think through potential problems before they become issues
  – reflect broad consensus in the Collaboration
  – serve as a common reference and a systematized list of computing items that need to be addressed
• The WMS/DDM requirements were influenced by the recommendation from DOE to use the LHC experience in scoping and designing the systems for DUNE.
• WMS/DDM are an important part of the requirements.
• The requirements do not imply a preference for any specific solution.
• They are a “living document” and will be updated as needed, including with feedback from discussions such as this workshop.
• The requirements were updated in 2015-2016 and incorporated as “Appendix B” into the DUNE Computing Model, DUNE-doc-914-v2.
  – http://docs.dunescience.org:8080/cgi-bin/ShowDocument?docid=914
M Potekhin | DUNE WMS/DDM workshop@FNAL, July 2016
The structure of the Requirements
The Content of the Requirements (DocDB 914)
The following slides contain the parts of the Requirements related to Workload Management and Data Management, abridged or paraphrased as necessary to save time.
Grid and Cloud - Issues
• We aim for the most efficient utilization of all computing resources and hardware available to the Collaboration.
• We need to hedge against the significant uncertainties inherent in estimating and planning resource allocation over the next 10 years.
• Grid sites may have a wide range of capabilities, interfaces and other configuration parameters (heterogeneity).
• We need to insulate the users from the heterogeneous nature of the Grid and instead present a homogeneous computing medium to them.
Grid and Cloud - Requirements
• A widely distributed computing infrastructure, featuring a network of federated resources (including Grid- and Cloud-based resources), shall be implemented in close cooperation with participating computing sites, institutions and agencies (cf. the Open Science Grid etc.).
• Necessary tools and procedures shall be provided for streamlined incorporation of new facilities and efficient use of opportunistic resources.
• A Grid Information System shall keep the information about the Grid sites.
Distributed Computing: WMS definition
• What is a WMS?
  – An early example, from gLite: “The Workload Management System (WMS) is a collection of components that provide the service responsible for distributing and managing tasks across computing and storage resources available on a Grid.”
  – According to the DUNE Requirements: “a system that enables automated placement of computational payload jobs submitted by its users on distributed resources, using the underlying Grid layer, and makes subsequent record keeping, accounting, elements of data management and general monitoring available to the user”.
Distributed Computing: WMS description
• Workload Management Systems insulate individual users from specific configuration details and certain failure modes of Grid sites and networks, and provide substantial automation in managing the user's computational payload on the Grid.
• Monitoring capabilities of a WMS (down to the job level) serve as a valuable debugging tool and represent an essential toolkit for the operational support teams.
• A WMS must be capable of keeping proper information about releases and documenting the software configuration used for a specific production run.
Distributed Computing: WMS Requirements
• DUNE shall implement a Workload Management System (WMS) for resource management and brokerage functionality, which will govern the distribution of most types of computational workload in DUNE (e.g. production jobs, group analysis etc.) across the variety of resources available to the Collaboration.
• The DUNE WMS shall be capable of keeping a precise record of the software configuration used for each and every production job deployed on the Grid, including the DUNE Offline Software Release information.
• The DUNE WMS shall be capable of quickly suspending participating sites due to outages, network congestion or potential security issues.
• The DUNE WMS shall have a Workflow Management layer, which will help create and manage large groups of Grid tasks supporting the scientific workflows.
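To make two of the requirements above more concrete (a per-job record of the software configuration, and quick suspension of a site), here is a minimal Python sketch. It is only an illustration under stated assumptions: the class names `JobRecord` and `Broker`, all field names, the site names and the release string are hypothetical and are not taken from the Requirements or from any existing WMS. The round-robin brokerage is a placeholder for whatever policy a real system would use.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class JobRecord:
    """Immutable record of what a production job ran with (hypothetical fields)."""
    job_id: int
    site: str
    software_release: str  # offline software release tag (placeholder value below)
    configuration: str     # name of the job configuration (placeholder value below)


@dataclass
class Broker:
    """Toy brokerage: assigns jobs to non-suspended sites in round-robin order."""
    sites: list
    suspended: dict = field(default_factory=dict)  # site -> reason for suspension
    records: list = field(default_factory=list)    # one JobRecord per submitted job
    _next_job_id: int = 0

    def suspend(self, site, reason):
        """Stop sending new jobs to a site (outage, congestion, security issue)."""
        self.suspended[site] = reason

    def resume(self, site):
        """Make a previously suspended site eligible for new jobs again."""
        self.suspended.pop(site, None)

    def active_sites(self):
        return [s for s in self.sites if s not in self.suspended]

    def submit(self, software_release, configuration):
        """Pick a site and keep a record of the software configuration used."""
        active = self.active_sites()
        if not active:
            raise RuntimeError("no active sites available")
        site = active[self._next_job_id % len(active)]
        record = JobRecord(self._next_job_id, site, software_release, configuration)
        self._next_job_id += 1
        self.records.append(record)
        return record


# Usage with made-up site names and release tag
broker = Broker(sites=["SITE_A", "SITE_B"])
broker.suspend("SITE_A", "network congestion")
record = broker.submit("v01_00_00", "reco.cfg")
```

In this sketch suspension takes effect on the next submission because the list of active sites is recomputed for every job. Handling jobs already running at a suspended site is deliberately left out.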
WMS Requirements (cont'd)
• A DUNE WMS Monitoring System shall be implemented to ensure efficient operation of the WMS, by helping ascertain the status and progress of Grid jobs, accounting for resource utilization, identifying and debugging failure modes etc.
• The DUNE WMS Monitoring System shall have interfaces conducive to integration both with a Web UI for users and operators and with automated systems which consume the WMS data.
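One way to serve both a Web UI and automated consumers is to expose the same monitoring data in a machine-readable form such as JSON. The Python sketch below shows the idea on a per-site summary of job states. The job dictionary layout, the state names and the function names `summarize` and `to_json` are assumptions made for this illustration and do not come from the Requirements.

```python
import json
from collections import Counter


def summarize(jobs):
    """Count jobs per state for each site.

    `jobs` is an iterable of dicts with (hypothetical) keys "site" and "state".
    """
    per_site = {}
    for job in jobs:
        per_site.setdefault(job["site"], Counter())[job["state"]] += 1
    return {site: dict(counts) for site, counts in per_site.items()}


def to_json(jobs):
    """Serialize the summary so a web page or a script can consume it."""
    return json.dumps(summarize(jobs), sort_keys=True)


# Usage with made-up job data
example_jobs = [
    {"site": "SITE_A", "state": "finished"},
    {"site": "SITE_A", "state": "failed"},
    {"site": "SITE_B", "state": "running"},
]
report = to_json(example_jobs)
```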
Distributed Data - Requirements
• Raw Data replication - summary of basic requirements
  – redundant replicas (number of copies TBD)
  – site requirements are to be established (e.g. capacity, network throughput etc.)
  – replicas can be striped across a few sites if necessary
• Processed Data replication - summary of basic requirements
  – Data placement is based on the research interests of the working group operating at a particular location, on resource availability and on the scheduling policies of the WMS. The number of replicas of the processed data shall not be subject to a fixed minimum.
• General:
  – The validity of the data being replicated and/or transmitted shall be asserted (e.g. checksum controls). Control of data placement, volume, status and other characteristics shall be available.
  – A highly symmetrical placement strategy for the processed data shall exist, i.e. in principle both input and output data for any job or application can reside at any site or host which is a part of the DUNE distributed data network.
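The checksum control mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed mechanism: the Requirements do not name a checksum algorithm, so SHA-256 is an arbitrary choice here, and the function names `file_checksum` and `replica_is_valid` are hypothetical.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Compute a hex digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replica_is_valid(replica_path, expected_checksum, algorithm="sha256"):
    """Compare a replica's checksum with the value recorded for the original."""
    return file_checksum(replica_path, algorithm) == expected_checksum
```

The expected checksum would be computed once at the source and stored alongside the file's other bookkeeping information, so that every replica can be verified after transfer without access to the original.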
File Catalog Requirements
• A file catalog system shall be put in place by the DUNE Software working group.
• The file catalog system shall be protected from data loss to the greatest extent possible, by utilizing redundancy, replication, and backup and restore systems.
• The file catalog system shall have interfaces which are flexible and extensible enough to cover the range of data storage and distribution technologies employed in DUNE.
• The file catalog system shall cover the totality of distributed storage used by DUNE, i.e. it will allow its clients to locate potentially multiple replicas of the data at multiple sites.
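The last requirement, locating multiple replicas of the same data at multiple sites, amounts to a mapping from a logical file name to a set of (site, physical location) pairs. The Python sketch below shows that data structure in memory only. The class name `FileCatalog`, its methods, and the file and site names are hypothetical, and a real catalog would need the persistence and redundancy that the requirements above call for.

```python
from collections import defaultdict


class FileCatalog:
    """Toy in-memory catalog: logical file name -> set of (site, physical path)."""

    def __init__(self):
        self._replicas = defaultdict(set)

    def register(self, logical_name, site, physical_path):
        """Record that a replica of `logical_name` exists at `site`."""
        self._replicas[logical_name].add((site, physical_path))

    def locate(self, logical_name):
        """Return all known replicas as a sorted list of (site, physical path)."""
        return sorted(self._replicas.get(logical_name, ()))

    def sites_holding(self, logical_name):
        """Return the sorted list of sites that hold at least one replica."""
        return sorted({site for site, _ in self._replicas.get(logical_name, ())})


# Usage with made-up names
catalog = FileCatalog()
catalog.register("raw/run0001.dat", "SITE_B", "/store/raw/run0001.dat")
catalog.register("raw/run0001.dat", "SITE_A", "/data/dune/run0001.dat")
replicas = catalog.locate("raw/run0001.dat")
```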
Metadata Requirements
• A file metadata system shall be created to support the distributed data processing capabilities of DUNE. It will cover the data managed by all participating sites, utilizing a variety of middleware and storage media.
• The DUNE metadata system must scale to the expected file, site and job/access multiplicities and rates.
• Information contained in the metadata system shall be protected from data loss to the greatest extent possible, by utilizing redundancy, replication, and backup and restore systems.
Stuff that is yet to be included
• WMS and DDM – the current version of the requirements does not contain many specifics on the interaction of the WMS and DDM.
• WMS deployment – it must be possible to run the WMS at any site with adequate network connectivity and hardware, without relying on a specific site configuration.
• Job submission – job submission via a network client running on the user's computer (e.g. a laptop), which can be located anywhere.
• Log files – important both for infrastructure diagnostics/debugging and for payload characterization and debugging (cf. the NoSQL technology used to handle logs).
• Authentication/authorization in WMS/DDM – the solutions are well known but the requirements aren't stated (i.e. compatibility with security frameworks such as X.509/VOMS).
Comments
• Certain features of the WMS often define how efficiently it can be deployed and used:
  – portability, i.e. whether the system can be deployed on a variety of platforms and without many prerequisites
  – requirements for participating Grid sites (the fewer, the better)
  – monitoring for the users and operational support
• Ideally, there is a comprehensive monitoring system, from the cloud level to the site level and down to the task and job level (complete with log files, but also making available the information on the “live” status of jobs and data transmission).