UGM 2018 Masilamani Subramanyam Agenda Introduction - PowerPoint PPT Presentation

iRODS for Data Management and Archiving UGM 2018 Masilamani Subramanyam

Agenda
● Introduction
● Challenges
● Data Transfer Solution
● iRODS use in Data Transfer Solution
● iRODS Proof-of-Concept
● Q & A

Introduction
● Genentech / Roche
  ○ Biotech Company
  ○ Fortune’s “100 Best Companies to Work For” List
● Integration Services
  ○ Application Integration
  ○ Partner Integration
  ○ Data Integration
● Data Virtualization
  ○ Enterprise Information Integration

Challenges
Some of the challenges the business faces with respect to data movement are:
● Bottlenecks in hardware infrastructure and network
● Data transfer is too slow
● No automated or scheduled transfers
● No user-friendly GUI
● Custom-developed scripts for every type of data transfer job
● Data transfer jobs are executed manually
● Lack of visibility and traceability of data transfer jobs
● No metadata is managed for the transfer process

Data Transfer Solution
The Data Transfer Platform (DTP) is a system designed to support and manage high-speed transfer of scientific data. Its capabilities include:
● Optimized high-speed protocols
● API-driven interface to monitor and manage transfers
● Metadata management related to the transfer process
● Ability to automate transfers
● Post-transfer workflows
● Store, search, and manage data and transfer metadata in the data management system
● Implement the solution for the first use case: data replication

Data Transfer Solution
The Data Transfer Solution includes multiple components:
● Hardware
● Infrastructure Management
● Software
  ○ File Transfer Solution
  ○ Data Management (iRODS)
  ○ Pipeline Management
● User Interfaces
● Security

iRODS use in Data Transfer Solution
iRODS as Change Log
● The iRODS File System Scanner capability is used to scan the mount path of the file system and ingest the system metadata
● It provides the list of all new, updated, and deleted files to support the data replication capability
● iRODS, as the data management system, can be used to track file lifecycle and provenance
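The change-log idea above (new, updated, and deleted files between two scans) can be illustrated with a minimal sketch. It is not the iRODS File System Scanner itself: it only assumes that each scan yields a mapping from path to modification time, and the names `ChangeLog` and `diff_snapshots` and the sample paths are hypothetical.

```python
from typing import Dict, List, NamedTuple


class ChangeLog(NamedTuple):
    """Paths grouped by what happened to them between two scans."""
    new: List[str]
    updated: List[str]
    deleted: List[str]


def diff_snapshots(previous: Dict[str, float], current: Dict[str, float]) -> ChangeLog:
    """Compare two {path: mtime} snapshots of the same mount path."""
    # Present now but not in the earlier scan.
    new = sorted(p for p in current if p not in previous)
    # Present in the earlier scan but gone now.
    deleted = sorted(p for p in previous if p not in current)
    # Present in both scans with a different modification time.
    updated = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    return ChangeLog(new, updated, deleted)


# Hypothetical sample data: a.fastq was removed, b.fastq was modified,
# c.fastq was added.
previous = {"/mnt/data/a.fastq": 100.0, "/mnt/data/b.fastq": 200.0}
current = {"/mnt/data/b.fastq": 250.0, "/mnt/data/c.fastq": 300.0}
changes = diff_snapshots(previous, current)
```

In the solution described on the slides, the catalog plays the role of the `previous` snapshot, so the comparison does not need a second full copy of the file listing.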

Scientific Data Archive and Replication
Business requirements to support disaster recovery and high availability:
● High-performance transfer
● Storage-agnostic solution
● Scalability to support a large number of files
● Detecting changes in the file system
● Preserving Unix and Windows permissions and the file creation and modification timestamps

Replication Solution Options
[Diagram: two options for replication between the Primary Site and the Alternate Site]
● Sync Tool: replicate using a high-performance transfer protocol
● Replication: replicate using TCP

Replication Solution Options
[Diagram: Primary Site and Alternate Site, replicating using a high-performance transfer protocol, with three numbered steps]
1. Ingest to iRODS catalog
2. Query iRODS catalog (Python / API)
3. Initiate replication

Replication using Data Transfer Solution
[Diagram; recoverable labels:]
● Web UI, Python Flask API, Jenkins Pipeline, Scheduling / Queuing Service
● iRODS rule for delete detection and manifest file generation, followed by: perform deletes in destination end-point
● iRODS rule for new/updated detection and manifest file generation, followed by: perform sync for new and updated files
● Primary Site and Secondary Site, synced using the high-performance transfer protocol
● Storage Mount, iRODS Consumer Server (two shown)
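The slides do not show what the manifest file looks like. As a rough sketch only, the following assumes a manifest is a plain-text file with one path per line, relative to the storage mount; the function name `write_manifest`, the mount path, and the file names are all hypothetical.

```python
import io
from typing import Iterable, TextIO


def write_manifest(paths: Iterable[str], mount_root: str, out: TextIO) -> int:
    """Write de-duplicated, sorted paths relative to mount_root, one per line.

    Returns the number of lines written.
    """
    root = mount_root.rstrip("/") + "/"
    count = 0
    for path in sorted(set(paths)):
        if not path.startswith(root):
            raise ValueError(f"{path} is not under {mount_root}")
        # Strip the mount prefix so the same manifest applies at both sites.
        out.write(path[len(root):] + "\n")
        count += 1
    return count


# Hypothetical detection result containing one duplicate entry.
buffer = io.StringIO()
written = write_manifest(
    [
        "/mnt/primary/run2/b.bam",
        "/mnt/primary/run1/a.bam",
        "/mnt/primary/run1/a.bam",
    ],
    "/mnt/primary",
    buffer,
)
manifest_text = buffer.getvalue()
```

Keeping paths relative to the mount is one possible design choice: it lets the transfer tool resolve them against a different mount point at the secondary site.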

iRODS Architecture in Data Transfer Solution
[Diagram: an iRODS Zone spanning the Primary Site and the Secondary Site. Labels: iRODS Catalog Server (Head Node); Node 1, Node 2, Node 3, Node 4; Storage Mount; iRODS Consumer Server (two shown)]

Ingest Metadata using iRODS File System Scanner
Data Transfer Solution Server: rulebase configuration in server_config.json
● Configuration: Register, Update, Remove, Unregister
● Rulebase: NewFiles, ModifiedFiles, DeletedFiles, DeletedFiles
● Shell Script: initial_reg_sync.sh, detectadded.sh, detectmodified.sh, detectdeleted.sh
Example metadata:
● META_DATA_ATTR_NAME = filesystem::mtime, META_DATA_ATTR_VALUE = 2018-06-05 13:02:11.914472000
● META_DATA_ATTR_NAME = filesystem::deleted, META_DATA_ATTR_VALUE = Y

Ingestion using iRODS in DTP
● As part of data transfer in DTP, iRODS will be used as the data management component to track file lifecycle and provenance.
● For the data replication use case, iRODS will be used to provide the system metadata of the storage, which includes:
  ○ New files added since the last ingest of metadata
  ○ Files updated since the last ingest of metadata
  ○ Files deleted since the last ingest of metadata
● The system metadata can be queried using the iRODS CLI or the Python iRODS Client
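As an illustration of querying this system metadata from the CLI, the sketch below builds GenQuery strings for the `iquest` command using the `filesystem::deleted` and `filesystem::mtime` attributes shown in the slides. The helper names are hypothetical, and the "modified since" query assumes that comparing the zero-padded timestamp strings as text is good enough, which may not match how the actual solution detects updates. The commands are only built here, not executed.

```python
from typing import List

# Attribute names come from the slides; the helper names are hypothetical.
DELETED_ATTR = "filesystem::deleted"
MTIME_ATTR = "filesystem::mtime"


def _quote(value: str) -> str:
    """Wrap a value in single quotes for a GenQuery condition."""
    if "'" in value:
        raise ValueError("single quotes are not supported in this sketch")
    return f"'{value}'"


def deleted_files_query() -> str:
    """GenQuery for data objects flagged as deleted on the file system."""
    return (
        "SELECT COLL_NAME, DATA_NAME "
        f"WHERE META_DATA_ATTR_NAME = {_quote(DELETED_ATTR)} "
        "AND META_DATA_ATTR_VALUE = 'Y'"
    )


def modified_since_query(last_ingest: str) -> str:
    """GenQuery for data objects whose recorded mtime is after last_ingest.

    Assumption: values share one zero-padded format, so a string
    comparison orders them chronologically.
    """
    return (
        "SELECT COLL_NAME, DATA_NAME "
        f"WHERE META_DATA_ATTR_NAME = {_quote(MTIME_ATTR)} "
        f"AND META_DATA_ATTR_VALUE > {_quote(last_ingest)}"
    )


def iquest_command(query: str) -> List[str]:
    """Argument list for iquest, printing collection/name per result row."""
    return ["iquest", "%s/%s", query]


command = iquest_command(deleted_files_query())
```

On a host with the iRODS icommands installed and an authenticated session, the list could be passed to `subprocess.run(command, capture_output=True, text=True)`.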

Next Step - iRODS Automated Ingest Framework
● We are planning to implement this new framework to ingest the metadata of new and updated files
● It requires a sync wrapper and some additional changes for our use case
● The framework will simplify the ingestion of metadata and improve performance

PoC - Data Catalog using iRODS
● Simplify access with a single namespace and make data locality transparent to the user
● Provide the ability to search and access data and metadata

PoC - Data Catalog using iRODS
[Diagram; recoverable labels:]
● Use cases: Search Across Data Repositories, Automation - Data Transformation, Reports, Automation - Data Move, File Tracking, Notification, Workspace setup
● Solution (each item paired with a rule): Search frontend, Search Engine, Trigger ETL, Audit / Lineage, Send Email, Build Reports, Setup Project
● Solution: link business knowledge (1) with catalog data (2); auto tag; file event; REST API Server (1)
● Scope of the PoC: Integrated Rule Oriented Data System (iRODS) 4.2.2 (2)
● iRODS capabilities: metadata catalog and data discovery; unified access via virtualization; workflows and automation; secure collaboration

PoC - Data Catalog using iRODS

PoC - Enable Intentional Archive
Users search through the metadata at the storage, folder, and file level and set metadata (e.g. ARCHIVE to Yes) to trigger storage tiering automatically, as a self-service.
iRODS Storage Tier Framework
[Diagram: an iRODS Zone with a compound resource and Tier Group 1, made up of Tier 0 (FAST), the cache, and Tier 1 (INTERMEDIATE), AWS S3. Metadata per tier, as far as the flattened layout allows:]
Tier 0 (FAST), Cache:
● A: isilon_to_object_storage_tier_group, V: 0, U:
● A: irods::storage_tier_time, V: 60, U:
● A: irods::storage_tier_verification, V: catalog, U:
● A: irods::storage_tier_query, V: .. META_DATA_ATTR_NAME = 'ARCHIVE' AND META_DATA_ATTR_VALUE = 'Y'
Tier 1 (INTERMEDIATE), AWS S3:
● A: isilon_to_object_storage_tier_group, V: 1, U:
● A: irods::storage_tier_verification, V: catalog, U:

PoC - Enable Intentional Archive
● To enable self-service, users set the flag at the folder or file level, and iRODS then automatically applies storage tiering to the flagged files or folders
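Setting the flag can be sketched with the iRODS `imeta set` command, which attaches an attribute/value pair to a data object (`-d`) or a collection (`-C`). The attribute name ARCHIVE and value Y follow the tiering query shown in the slides; the helper name and the logical paths are hypothetical, and how a folder-level flag reaches the files inside it is not described in the slides. The command is only built here, not executed.

```python
from typing import List


def archive_flag_command(logical_path: str, is_collection: bool = False) -> List[str]:
    """Build the imeta argument list that sets ARCHIVE = Y on one object."""
    # -C targets a collection (folder), -d a data object (file).
    target = "-C" if is_collection else "-d"
    return ["imeta", "set", target, logical_path, "ARCHIVE", "Y"]


# Hypothetical logical paths in a zone named tempZone.
file_cmd = archive_flag_command("/tempZone/home/alice/run1/a.bam")
folder_cmd = archive_flag_command("/tempZone/home/alice/run1", is_collection=True)
```

Using `imeta set` rather than `imeta add` keeps a single ARCHIVE value on the object when the flag is applied more than once.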

PoC - Enable Intentional Archive
● After the metadata is set to trigger the tiered storage framework, the file is moved from Tier 1 to Tier 2 (AWS S3) automatically.
● When the file is accessed / read, it is moved automatically from Tier 2 (AWS S3) back to Tier 1.

Thanks! Questions?
