The Data Grid: An Architecture for Distributed Management of Large Scientific Data Sets
Ann Chervenak, Ian Foster, Carl Kesselman, Chuck Salisbury, Steve Tuecke
Information Sciences Institute, University of Southern California
Argonne National Laboratory

Overview
● Target Environment
● Design Principles
● Grid Services
  ◆ Storage systems
  ◆ Metadata
  ◆ Management of replicated files
● Implementation

Data Grid Environment
● Scientific applications
  ◆ Global climate change, high energy physics
● Computationally demanding
● Large data sets and archives
  ◆ Terabytes, eventually petabytes
  ◆ Raw and derived data
● Geographically dispersed users and resources
  ◆ Data replication for enhanced performance
● Broad range of capabilities and resources
  ◆ Networks, systems, storage, and applications

Building a Data Grid: Building Blocks
[Figure: layered architecture diagram. Labels recoverable from the figure:
● Data services: ingest/catalog service, data mover service, query manager, catalog manager; STACS, pftp, MCAT, Condor, GASS, SRB, others
● Globus toolkit: security, information, fault detection, resource management, communication, etc.; also security, measurement, communications, resource discovery, accounting/payment; MPI-IO, Netlogger, Akenti, Autopilot
● Resources: HPSS (archival, multi-PB), DPSS (striped, fast disk cache), computers, on-demand computer, networks (ESnet, MREN, NTON)
● Annotations: 1-10 GB/s, 1-10 PB, 10-100 TB, 1-10 TF/s, access > 100 MB/s?, GB/s net, QoS or no QoS per resource (e.g., diffserv for the network, XFS for disk, DSRT)]

Data Grid Objectives
● Integrate heterogeneous data archives into a distributed data management “grid”
● Identify services for high-performance, distributed, data-intensive computing

Design Principles
● Mechanism Neutrality
  ◆ Support heterogeneous systems
● Policy Neutrality
  ◆ User / local decision making and control
● Compatibility with Computational Grid
  ◆ Integration of storage and computation
● Uniformity of Information Infrastructure
  ◆ Data model and interface for metadata

Data Grid Services
[Figure: layered service diagram]
● High-level services: Replica Selection, Replica Management, other high-level services
● Core services: Storage System, Metadata Repository, Resource Management, other core services
● Underlying systems: DPSS, HPSS, LDAP, MCAT, LSF, DIFFSERV

Data Access Service
● Uniform access to heterogeneous systems
  ◆ Remote: e.g., DPSS, HTTP, FTP, HPSS
  ◆ Local: e.g., UNIX
● High performance data movement over WANs
  ◆ Third party transfer
● Data extraction and filtering functions
● Access to data is subject to global and local policy constraints
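
To make the idea of uniform access to heterogeneous systems concrete, here is a minimal Python sketch. It is not the Data Grid storage API: the names `StorageBackend`, `LocalUnixBackend`, `InMemoryBackend`, and `DataAccess`, and the `mem://` scheme, are assumptions made for illustration. The in-memory backend only stands in for a remote system such as DPSS or HPSS, and the `copy` shown here routes bytes through the caller, so it does not model third party transfer.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class StorageBackend(ABC):
    """Hypothetical uniform interface that every storage system implements."""

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...


class LocalUnixBackend(StorageBackend):
    """Local file system access through ordinary file I/O."""

    def read(self, path: str) -> bytes:
        with open(path, "rb") as f:
            return f.read()

    def write(self, path: str, data: bytes) -> None:
        with open(path, "wb") as f:
            f.write(data)


class InMemoryBackend(StorageBackend):
    """Stand-in for a remote store; a real one would speak its own protocol."""

    def __init__(self) -> None:
        self._blobs = {}

    def read(self, path: str) -> bytes:
        return self._blobs[path]

    def write(self, path: str, data: bytes) -> None:
        self._blobs[path] = bytes(data)


class DataAccess:
    """Dispatches on the URL scheme so callers use one API for all systems."""

    def __init__(self) -> None:
        self._backends = {}

    def register(self, scheme: str, backend: StorageBackend) -> None:
        self._backends[scheme] = backend

    def _resolve(self, url: str):
        parts = urlparse(url)
        # Host plus path identifies the object inside the chosen backend.
        return self._backends[parts.scheme], parts.netloc + parts.path

    def read(self, url: str) -> bytes:
        backend, path = self._resolve(url)
        return backend.read(path)

    def write(self, url: str, data: bytes) -> None:
        backend, path = self._resolve(url)
        backend.write(path, data)

    def copy(self, src_url: str, dst_url: str) -> None:
        # Simplification: data flows through the caller, not site to site.
        self.write(dst_url, self.read(src_url))


if __name__ == "__main__":
    access = DataAccess()
    access.register("mem", InMemoryBackend())
    access.write("mem://siteA/run1.dat", b"sample")
    access.copy("mem://siteA/run1.dat", "mem://siteB/run1.dat")
    print(access.read("mem://siteB/run1.dat"))
```

Policy checks (the last bullet above) would naturally sit in `DataAccess.read` and `DataAccess.write`, before the call is handed to a backend.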

Metadata Access Service
● Uniform treatment for all metadata
  ◆ Grid components
  ◆ Application-related metadata
  ◆ Storage system characteristics
  ◆ Relationships between data items
● Uniform access to metadata
  ◆ LDAP protocol
● Uniform storage structure
  ◆ LDAP hierarchical structure for distribution, replication, referral services
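
The hierarchical structure mentioned above can be illustrated with a small in-memory model of a directory tree. This is only a sketch: it does not talk to an LDAP server, the `Directory` class and the entry names and attributes (`sizeMB`, `kind`) are hypothetical, and it compares distinguished names as exact strings, which is simpler than real LDAP matching. It relies on one fact about LDAP naming: a distinguished name is written leaf first, so every entry below a base has that base as a suffix.

```python
class Directory:
    """Toy directory keyed by distinguished name (DN), leaf component first."""

    def __init__(self) -> None:
        self._entries = {}

    def add(self, dn: str, attrs: dict) -> None:
        self._entries[dn] = dict(attrs)

    def search(self, base: str, **filters) -> list:
        """Return DNs at or below `base` whose attributes equal all filters."""
        hits = []
        for entry_dn, attrs in self._entries.items():
            in_subtree = entry_dn == base or entry_dn.endswith("," + base)
            matches = all(attrs.get(k) == v for k, v in filters.items())
            if in_subtree and matches:
                hits.append(entry_dn)
        return sorted(hits)


if __name__ == "__main__":
    d = Directory()
    # Application metadata and storage system metadata share one tree.
    d.add("cn=run1,ou=climate,o=grid", {"kind": "dataset", "sizeMB": 10})
    d.add("cn=run2,ou=climate,o=grid", {"kind": "dataset", "sizeMB": 20})
    d.add("cn=cache1,ou=storage,o=grid", {"kind": "storage"})
    print(d.search("ou=climate,o=grid"))
    print(d.search("o=grid", kind="storage"))
```

Because a subtree is identified by its base name alone, a branch such as `ou=climate,o=grid` could be held by a different server than the rest of the tree, which is what makes distribution and referral natural in a hierarchical namespace.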

Replica Management
● Collections contain related files
● Logical files describe replicated physical files
● Services for managing replicated file instances
  ◆ Create / delete
  ◆ Schedule / manage data transfer
  ◆ Register in the replica catalog
  ◆ Metadata display
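
The relationship between collections, logical files, and physical replicas can be sketched as two mappings. The code below is an illustrative in-memory model, not the LDAP-based replica catalog itself; the class name `ReplicaCatalog`, its method names, and the example URLs are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ReplicaCatalog:
    # Collection name -> names of the logical files it contains.
    collections: Dict[str, Set[str]] = field(default_factory=dict)
    # Logical file name -> locations of its physical replicas.
    replicas: Dict[str, Set[str]] = field(default_factory=dict)

    def create_logical_file(self, collection: str, logical_name: str) -> None:
        """Add a logical file to a collection, with no replicas yet."""
        self.collections.setdefault(collection, set()).add(logical_name)
        self.replicas.setdefault(logical_name, set())

    def register_replica(self, logical_name: str, physical_url: str) -> None:
        """Record that a physical copy of the logical file exists at a URL."""
        if logical_name not in self.replicas:
            raise KeyError(f"unknown logical file: {logical_name}")
        self.replicas[logical_name].add(physical_url)

    def delete_replica(self, logical_name: str, physical_url: str) -> None:
        """Forget one physical copy; the logical file itself remains."""
        self.replicas[logical_name].discard(physical_url)

    def locate(self, logical_name: str) -> List[str]:
        """List all known physical locations of a logical file."""
        return sorted(self.replicas[logical_name])


if __name__ == "__main__":
    catalog = ReplicaCatalog()
    catalog.create_logical_file("climate-runs", "run1.dat")
    catalog.register_replica("run1.dat", "http://siteA.example/run1.dat")
    catalog.register_replica("run1.dat", "ftp://siteB.example/run1.dat")
    print(catalog.locate("run1.dat"))
```

Keeping the logical name separate from its locations is what lets a transfer service create or delete copies without applications having to change the names they use.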

Replica Selection
● User can optimize access characteristics
  ◆ Grid structure and performance
  ◆ Storage system and file characteristics
● Intelligent scheduling to determine appropriate replica, site for (re)computation, etc.
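
One simple way to picture replica selection is to rank candidate replicas by an estimated transfer time. The sketch below is an assumption made for illustration, not the selection policy of the Data Grid: the cost model (latency plus file size divided by bandwidth), the `ReplicaInfo` fields, and the example figures are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ReplicaInfo:
    url: str
    bandwidth_mb_s: float  # assumed measured throughput to this site
    latency_s: float       # assumed startup cost, e.g. staging from archive


def estimated_transfer_time(replica: ReplicaInfo, size_mb: float) -> float:
    """Assumed cost model: fixed latency plus size divided by bandwidth."""
    return replica.latency_s + size_mb / replica.bandwidth_mb_s


def select_replica(replicas: Iterable[ReplicaInfo], size_mb: float) -> ReplicaInfo:
    """Pick the replica with the lowest estimated transfer time."""
    candidates = list(replicas)
    if not candidates:
        raise ValueError("no replicas available")
    return min(candidates, key=lambda r: estimated_transfer_time(r, size_mb))


if __name__ == "__main__":
    archive = ReplicaInfo("hpss://siteA/run1.dat", bandwidth_mb_s=100.0, latency_s=2.0)
    cache = ReplicaInfo("dpss://siteB/run1.dat", bandwidth_mb_s=10.0, latency_s=0.1)
    # A small file favors the low-latency site, a large one the fast link.
    print(select_replica([archive, cache], size_mb=1.0).url)
    print(select_replica([archive, cache], size_mb=1000.0).url)
```

The same ranking function could be swapped for one that reflects a user's own priorities, in keeping with the policy neutrality principle.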

Climate Data Scenario
[Figure: data flow for a climate query]
● User question: “How do midwest flood frequencies under 2xCO2 scenario compare with historical data?”
● Resulting request: “Access datasets A, B; run A -> meso -> hydro; compare result with B”
● Components shown: query manager, file access service, resource manager, historical data archives, simulation data archive, analysis engine, caches, DPSS, HPSS
● Processing steps shown: meso, hydro, compare

Current Activity
● Ongoing collaborations
  ◆ Climate
  ◆ High Energy Physics
● Storage API for uniform access to data
  ◆ API specification document
  ◆ Prototype code for HTTP, FTP, DPSS
● Replica management
  ◆ Replica catalog based on LDAP
  ◆ API and GUI tools for catalog access
● Quality of Service implementation

Replica Management

Quality of Service
Bulk Transfer support in GARA
[Figure: bandwidth (KB/s, axis 0 to 12000) versus time (axis 0 to 250); series labels: background, foreground, competitive]

Planned Activity
● Data Access
  ◆ Integrated quality of service, security
  ◆ Performance enhancements for networking
● Performance guarantees for the Data Grid
● Automatic operation of the Data Grid
  ◆ Agent technologies used for distributed data replication, selection, and analysis
● Integrated CPU scheduling
  ◆ Server-side data reduction, affinity scheduling