The Data Grid: An Architecture for Distributed Management of Large Scientific Data Sets


  1. The Data Grid: An Architecture for Distributed Management of Large Scientific Data Sets
     Ann Chervenak, Ian Foster, Carl Kesselman, Chuck Salisbury, Steve Tuecke
     Information Sciences Institute, University of Southern California; Argonne National Laboratory

  2. Overview
     ● Target Environment
     ● Design Principles
     ● Grid Services
       ◆ Storage systems
       ◆ Metadata
       ◆ Management of replicated files
     ● Implementation

  3. Data Grid Environment
     ● Scientific applications
       ◆ Global climate change, high energy physics
     ● Computationally demanding
     ● Large data sets and archives
       ◆ Terabytes, eventually petabytes
       ◆ Raw and derived data
     ● Geographically dispersed users and resources
       ◆ Data replication for enhanced performance
     ● Broad range of capabilities and resources
       ◆ Networks, systems, storage, and applications

  4. Building a Data Grid: Building Blocks
     [Diagram: layered building blocks; the layout is not recoverable from the transcript.]
     ● Upper-layer labels: ingest, catalog, and query managers; data mover and catalog services; STACS, pftp, MCAT, Condor, GASS, SRB, others
     ● Globus toolkit: security, information, fault detection, resource management, communication, etc.
     ● Other labels: MPI-IO, Netlogger, Akenti, Autopilot, resource discovery, accounting/payment
     ● Resource labels: HPSS (archival), DPSS (fast disk cache), computers, networks (ESnet, MREN, NTON), with QoS support marked per resource (e.g., diffserv for the network, XFS for disk)
     ● Scale annotations: 1-10 GB/s, 1-10 PB, 10-100 TB, 1-10 TF/s

  5. Data Grid Objectives
     ● Integrate heterogeneous data archives into a distributed data management “grid”
     ● Identify services for high performance, distributed, data intensive computing

  6. Design Principles
     ● Mechanism Neutrality
       ◆ Support heterogeneous systems
     ● Policy Neutrality
       ◆ User / local decision making and control
     ● Compatibility with Computational Grid
       ◆ Integration of storage and computation
     ● Uniformity of Information Infrastructure
       ◆ Data model and interface for metadata

  7. Data Grid Services
     [Diagram: layered services, top to bottom.]
     ● High-level services: Replica Selection, Replica Management, other high-level services
     ● Core services: Storage System, Metadata Repository, Resource Management, other core services
     ● Systems beneath the core services: DPSS, HPSS, LDAP, MCAT, LSF, DIFFSERV

  8. Data Access Service
     ● Uniform access to heterogeneous systems
       ◆ Remote: e.g., DPSS, HTTP, FTP, HPSS
       ◆ Local: e.g., UNIX
     ● High performance data movement over WANs
       ◆ Third party transfer
     ● Data extraction and filtering functions
     ● Access to data is subject to global and local policy constraints
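
The idea of uniform access to heterogeneous storage can be sketched as one interface with a backend per system. This is a minimal illustration only, not the storage API from the slides: the class names, the `mem` scheme, and the in-memory backend (a stand-in for a remote system such as HPSS or DPSS) are all assumptions. Note that `copy` here routes data through the caller, so it is not a third-party transfer.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from urllib.parse import urlparse


class StorageBackend(ABC):
    """Hypothetical uniform interface that every storage system implements."""

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...


class LocalUnixBackend(StorageBackend):
    """Local file system access."""

    def read(self, path: str) -> bytes:
        return Path(path).read_bytes()

    def write(self, path: str, data: bytes) -> None:
        Path(path).write_bytes(data)


class InMemoryBackend(StorageBackend):
    """Stand-in for a remote system; keeps objects in a dictionary."""

    def __init__(self) -> None:
        self._objects = {}

    def read(self, path: str) -> bytes:
        return self._objects[path]

    def write(self, path: str, data: bytes) -> None:
        self._objects[path] = bytes(data)


class DataAccessService:
    """Dispatches on the URL scheme so callers never see backend differences."""

    def __init__(self) -> None:
        self._backends = {}

    def register(self, scheme: str, backend: StorageBackend) -> None:
        self._backends[scheme] = backend

    def _resolve(self, url: str):
        parsed = urlparse(url)
        if parsed.scheme not in self._backends:
            raise KeyError(f"no backend registered for scheme {parsed.scheme!r}")
        return self._backends[parsed.scheme], parsed.netloc + parsed.path

    def read(self, url: str) -> bytes:
        backend, path = self._resolve(url)
        return backend.read(path)

    def write(self, url: str, data: bytes) -> None:
        backend, path = self._resolve(url)
        backend.write(path, data)

    def copy(self, src_url: str, dst_url: str) -> None:
        # Simple relay through the caller, not a third-party transfer.
        self.write(dst_url, self.read(src_url))
```

A caller would register one backend per scheme and then use `read`, `write`, and `copy` with URLs only; policy checks could be added in `_resolve` without touching the backends.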

  9. Metadata Access Service
     ● Uniform treatment for all metadata
       ◆ Grid components
       ◆ Application-related metadata
       ◆ Storage system characteristics
       ◆ Relationships between data items
     ● Uniform access to metadata
       ◆ LDAP protocol
     ● Uniform storage structure
       ◆ LDAP hierarchical structure for distribution, replication, referral services
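
To make the hierarchical storage structure concrete, the sketch below models LDAP-style distinguished names and a subtree search in plain Python. It does not talk to an LDAP server, and the attribute names (`lf`, `lc`, `o`, `kind`) are hypothetical, not the schema used by the authors. Values containing commas are not escaped, which a real directory would require.

```python
def make_dn(*rdns):
    """Join (attribute, value) pairs, most specific first, into a DN string."""
    return ",".join(f"{attr}={value}" for attr, value in rdns)


def parent_dn(dn):
    """Return the DN one level up the hierarchy ('' for a top-level entry)."""
    _, _, tail = dn.partition(",")
    return tail


class Directory:
    """Toy in-memory directory keyed by DN; a stand-in for an LDAP server."""

    def __init__(self):
        self.entries = {}

    def add(self, dn, **attrs):
        self.entries[dn] = attrs

    def search(self, base, **filters):
        """Return DNs at or below `base` whose attributes match all filters."""
        return sorted(
            dn
            for dn, attrs in self.entries.items()
            if (dn == base or dn.endswith("," + base))
            and all(attrs.get(key) == value for key, value in filters.items())
        )
```

Because every kind of metadata lives in the same tree, one search call serves application metadata and storage characteristics alike; only the base DN and the filter change.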

  10. Replica Management
     ● Collections contain related files
     ● Logical files describe replicated physical files
     ● Services for managing replicated file instances
       ◆ Create / delete
       ◆ Schedule / manage data transfer
       ◆ Register in the replica catalog
       ◆ Metadata display
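
The relationship between collections, logical files, and physical replicas can be sketched as a small in-memory catalog. This is an illustration under assumptions: the class and method names are hypothetical, and the catalog described in the slides is LDAP-based rather than a Python dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class LogicalFile:
    """A logical name plus the physical locations that hold copies of it."""
    name: str
    replicas: set = field(default_factory=set)


class ReplicaCatalog:
    """Maps collection -> logical file -> set of physical replica URLs."""

    def __init__(self):
        self._collections = {}

    def create_collection(self, collection):
        self._collections.setdefault(collection, {})

    def register(self, collection, logical_name, url):
        """Record that a physical copy of the logical file exists at `url`."""
        files = self._collections[collection]
        files.setdefault(logical_name, LogicalFile(logical_name)).replicas.add(url)

    def unregister(self, collection, logical_name, url):
        """Forget one physical copy; the logical file entry itself remains."""
        self._collections[collection][logical_name].replicas.discard(url)

    def locate(self, collection, logical_name):
        """Return all known physical locations, sorted for stable output."""
        return sorted(self._collections[collection][logical_name].replicas)
```

Creating or deleting a physical copy would pair a data transfer with a `register` or `unregister` call, so the catalog stays consistent with what is actually stored.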

  11. Replica Selection
     ● User can optimize access characteristics
       ◆ Grid structure and performance
       ◆ Storage system and file characteristics
     ● Intelligent scheduling to determine appropriate replica, site for (re)computation, etc.
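
One simple way to pick a replica is to estimate the transfer time for each copy and take the minimum. The cost model below (startup latency plus size divided by bandwidth) and all field names are assumptions chosen for illustration; the slides do not specify a selection policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    url: str
    bandwidth_mb_s: float  # assumed sustained throughput to the requester
    latency_s: float       # assumed startup cost, e.g. a tape mount


def estimated_transfer_time(replica, size_mb):
    """Assumed cost model: fixed startup latency plus streaming time."""
    return replica.latency_s + size_mb / replica.bandwidth_mb_s


def select_replica(replicas, size_mb):
    """Return the replica with the lowest estimated transfer time."""
    if not replicas:
        raise ValueError("no replicas to choose from")
    return min(replicas, key=lambda r: estimated_transfer_time(r, size_mb))
```

With this model the best replica depends on file size: a high-latency archive with a fast link loses for small files but can win for very large ones, which is why selection benefits from both storage system and file characteristics.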

  12. Climate Data Scenario
     [Diagram: data flow for a climate query; the layout is not recoverable from the transcript.]
     ● User question: “How do midwest flood frequencies under 2xCO2 scenario compare with historical data?”
     ● Query manager plan: “Access datasets A, B; run A -> meso -> hydro; compare result with B”
     ● Components shown: query manager, resource manager, file access service, historical data archives, simulation data archive, analysis engine, caches, DPSS, HPSS
     ● Processing stages: meso, hydro, compare

  13. Current Activity
     ● Ongoing collaborations
       ◆ Climate
       ◆ High Energy Physics
     ● Storage API for uniform access to data
       ◆ API specification document
       ◆ Prototype code for HTTP, FTP, DPSS
     ● Replica management
       ◆ Replica catalog based on LDAP
       ◆ API and GUI tools for catalog access
     ● Quality of Service implementation

  14. Replica Management

  15. Quality of Service: Bulk Transfer Support in GARA
     [Plot: bandwidth (KB/s, 0 to 12000) versus time (0 to 250); series: background, foreground, competitive.]

  16. Planned Activity
     ● Data Access
       ◆ Integrated quality of service, security
       ◆ Performance enhancements for networking
     ● Performance guarantees for the Data Grid
     ● Automatic operation of the Data Grid
       ◆ Agent technologies used for distributed data replication, selection, and analysis
     ● Integrated CPU scheduling
       ◆ Server-side data reduction, affinity scheduling
