Building an Extensible File System via Policy-based Data Management
Hao Xu, Jewel H. Ward, Mike Conway, Arcot Rajasekar, Reagan W. Moore
(iRODS Consortium, http://irods.org)
File System
• Essential functions:
  - Ingest, store, access
• Modern file systems are built on top of traditional file systems:
  - Google File System, Amazon S3, Hadoop Distributed File System
  - Driven by the need of a target application
  - Customized toward the target application domain
Data Management Needs in Archive and Scientific Communities
• Discoverability
• Complex metadata
• Workflow management
• Data sharing
• Provenance
• Long-term preservation
• Technology migration
• Interoperability between infrastructures
Challenges
Can generic infrastructure meet the needs of a diverse set of data management domains?
Flexibility to Define a Wide Range of Application Domain Policies
• User community → policies
• File ingest operations:
  - Authentication
  - Authorization
  - Storage quota
  - Aggregation
  - Resource selection
  - Replication
  - File retention
  - Metadata
Infrastructure Support for Non-standard Application Domain Operations
• Standard file system operations have robust support:
  - Metadata
  - Auditing
  - Access control list
• Non-standard operations that are implemented as a library do not have direct support from the file system. Examples:
  - Preservation: OAIS (SIP, AIP, DIP packages)
  - Digital library: provenance and discovery metadata
  - Processing pipeline: format transformation
Interoperability with Other Infrastructures
• Emergent scalability mechanisms:
  - Organization change
    - List → Tree → Graph (Internet) → Search
  - Data structure change
    - Files, tables, streams
  - Property enforcement expectations
    - Reproducible data-driven research
• Separation of how files are stored, accessed, and manipulated
Policy-based Data Management
Policy = Metadata + Procedure
• Purpose: reason a collection is assembled
• Properties: attributes needed to ensure the purpose
• Policies: controls for enforcing desired properties
  - Procedural policy. Example: when an object is ingested, run a workflow
  - Assertional policy. Example: a file has three or more replicas
• Metadata: persistent state
  - State information (consistency in a distributed environment)
  - Generated through application of procedures
• Procedures: operations performed within the system
  - What to run: functions that implement the policies
  - How to verify: validation that metadata conforms to the desired purpose
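The difference between a procedural and an assertional policy can be illustrated with a minimal Python sketch. Every name here (`Catalog`, `replicate_on_ingest`, `has_min_replicas`, the resource names) is hypothetical and is not an iRODS API. The procedure runs at ingest and records its outcome as metadata; the assertion later checks that metadata against the desired property.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Hypothetical metadata store: replica locations keyed by object name."""
    replicas: dict = field(default_factory=dict)


def replicate_on_ingest(catalog, name, resources):
    """Procedural policy (illustrative): on ingest, replicate the object.

    The only effect modeled here is the metadata the procedure generates.
    """
    catalog.replicas[name] = list(resources)


def has_min_replicas(catalog, name, minimum=3):
    """Assertional policy (illustrative): the object has `minimum` or more replicas."""
    return len(catalog.replicas.get(name, [])) >= minimum


catalog = Catalog()
replicate_on_ingest(catalog, "survey.csv", ["disk-a", "disk-b", "tape-c"])
replicate_on_ingest(catalog, "notes.txt", ["disk-a"])

print(has_min_replicas(catalog, "survey.csv"))  # True
print(has_min_replicas(catalog, "notes.txt"))   # False
```

In this sketch the assertion never inspects storage directly; it reads only the persistent state that the procedure produced, which mirrors the "Policy = Metadata + Procedure" framing.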
Policy-based Data Management
[Figure: concept graph with nodes Purpose, Collection, Property, Policy, Procedure, Metadata, and Periodic Assessment Criteria. Edge labels: Defines, Controls, Updates, SubType.]
Policy-based Data Management – Collection
[Figure: the concept graph with Digital Object and Attribute nodes added. Added edge labels: Has, Isa.]
Policy-based Data Management – Collection Properties
[Figure: the concept graph with property nodes added. Isa edges: Integrity, Authenticity, Access control. HasFeature edges: Completeness, Correctness, Consensus, Consistency.]
Policy-based Data Management – Collection Policies
[Figure: the concept graph with policy nodes added through Isa edges: Replication Policy, Checksum Policy, Quota Policy, Data Type Policy.]
Policy-based Data Management – Collection Procedures
[Figure: the concept graph with procedure nodes added: Workflow, Function, Operation, GetUserACL, SetDataType, SetQuota, DataObjRepl, SysChksumDataObj. Added edge labels: Isa, Chains.]
Policy-based Data Management – Persistent State
[Figure: the concept graph with persistent state attributes added through Isa edges: DATA_ID, DATA_REPL_NUM, DATA_CHECKSUM.]
Policy-based Data Management – Policy Enforcement
[Figure: the concept graph with Policy Enforcement Point and Client Action nodes added. Added edge labels: Has, Invokes.]
Example of Policy-based Data Management
Policy-based Infrastructure: integrated Rule Oriented Data System
• Biology
  - Cognitive Science: Temporal Dynamics of Learning Center
  - Human genome: Broad Institute, Wellcome Trust Sanger Institute, NGS
  - Medicine: Sick Kids Hospital
  - Neuroscience: International Neuroinformatics Coordinating Facility
  - Plant genome: the iPlant Collaborative
  - Phylogenetics: Phylogenetics at CC IN2P3
• Computer Science
  - Network research: GENI experimental network
• Earth Sciences
  - Atmospheric science: NASA Langley Atmospheric Sciences Center
  - Climate: NOAA National Climatic Data Center
  - NASA Center for Climate Simulations
  - Ecology: CEED Caveat Emptor Ecological Data
  - Hydrology: Institute for the Environment, UNC-CH; Hydroshare
  - Oceanography: Ocean Observatories Initiative
  - Seismology: Southern California Earthquake Center
• Engineering
  - Education repository: CIBER-U
• Physics
  - Astrophysics: Auger supernova search
  - Cosmic Ray: AMS experiment on the International Space Station
  - Dark Matter Physics: Edelweiss II
  - High Energy Physics: BaBar / Stanford Linear Accelerator
  - Neutrino Physics: T2K and dChooz neutrino experiments
  - Optical Astronomy: National Optical Astronomy Observatory
  - Particle Physics: Indra multi-detector collaboration at IN2P3
  - Quantum Chromodynamics: IN2P3
  - Radio Astronomy: Cyber Square Kilometer Array, TREND, BAOradio
• Social Science: Odum, TerraPop
Policy Applications
• Pre-process policy
  - Applied before an operation is done
• Operation
  - May be policy controlled
• Post-process policy
  - Applied after the operation is done
• Are these sufficient to handle the wide diversity of data management applications?
• Does this minimize the number of required operations?
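The pre-process / operation / post-process pattern can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how iRODS implements it: `run_with_policies`, `check_quota`, `store`, and `record_audit` are hypothetical names, and the "operation" only updates a dictionary.

```python
def run_with_policies(operation, pre_policies, post_policies, context):
    """Apply pre-process policies, run the operation, then apply post-process policies."""
    for policy in pre_policies:
        policy(context)            # a pre-process policy may raise to veto the operation
    result = operation(context)
    context["result"] = result     # make the outcome visible to post-process policies
    for policy in post_policies:
        policy(context)
    return result


def check_quota(context):
    """Hypothetical pre-process policy: reject uploads that would exceed the quota."""
    if context["used"] + context["size"] > context["quota"]:
        raise PermissionError("storage quota exceeded")


def store(context):
    """Hypothetical operation: account for the stored bytes and return the size."""
    context["used"] += context["size"]
    return context["size"]


def record_audit(context):
    """Hypothetical post-process policy: append an audit record."""
    context["audit"].append(("stored", context["name"], context["result"]))


ctx = {"name": "a.dat", "size": 40, "used": 50, "quota": 100, "audit": []}
stored = run_with_policies(store, [check_quota], [record_audit], ctx)
print(stored, ctx["used"], ctx["audit"])  # 40 90 [('stored', 'a.dat', 40)]
```

Because the quota check runs before the operation, a rejected request leaves the stored state untouched; the audit policy only ever sees operations that completed.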
Policy (Workflow) in Hydrology
RHESSys workflow to develop a nested watershed parameter file (worldfile) containing a nested ecogeomorphic object framework and full, initial system state.
For each box, create a micro-service to automate the task, and chain into a workflow.
[Figure: workflow diagram. Box labels: Choose gauge or outlet (HIS); Extract drainage area (NHDPlus); Digital Elevation Model (DEM); Slope; Aspect; Streams (NHD); Roads (DOT); Land Use NLCD (EPA); Leaf Area Index, Landsat TM; Phenology, MODIS; Soil Data, USDA; Nested watershed structure; Soil and vegetation parameter files; Strata; Patch; Hillslope; Basin; Stream network; Worldfile; Flowtable; RHESSys.]
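The idea of wrapping each box as a micro-service and chaining the results can be sketched as function composition over a shared state. This is a minimal Python illustration, not the RHESSys or iRODS implementation: the `chain` helper, the three step functions, the state keys, and the gauge identifier "G-001" are all hypothetical placeholders, and the steps only record labels rather than processing real data.

```python
from functools import reduce


def chain(*steps):
    """Compose micro-services into one workflow that threads a state dict through each step."""
    def workflow(state):
        return reduce(lambda acc, step: step(acc), steps, state)
    return workflow


def choose_outlet(state):
    """Hypothetical step standing in for 'Choose gauge or outlet'."""
    return {**state, "outlet": state["gauge_id"]}


def extract_drainage_area(state):
    """Hypothetical step standing in for 'Extract drainage area'."""
    return {**state, "drainage_area": f"area-upstream-of-{state['outlet']}"}


def derive_slope_and_aspect(state):
    """Hypothetical step standing in for the slope and aspect boxes."""
    return {**state, "terrain": ["slope", "aspect"]}


build_inputs = chain(choose_outlet, extract_drainage_area, derive_slope_and_aspect)
result = build_inputs({"gauge_id": "G-001"})
print(result)
```

Each step returns a new dictionary instead of mutating its input, so steps can be reordered or reused in other chains as long as the keys they read are already present.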
Policies in Software Defined Networking
Control selection of network paths
[Figure: architecture diagram. Component labels: Rule Engine, Network Data, GraphDB, iCAT, Policies, OF Controller, and three iRODS Servers.]
Policy in Data Storage
Aggregation / Caching / Replication
Queen Mary University of London
Source: Di Lodovico et al.