iRODS at Bristol Myers Squibb: Status and Prospects
Leveraging iRODS for scientific applications in Amazon AWS Cloud
Mohammad Shaikh | Oleg Moiseyenko
Scientific Cloud Computing
iRODS UGM, Jun 9-11, 2020
R&D, Informatics & Predictive Sciences
NOT FOR PROMOTIONAL USE
R&D: Delivering Innovative Medicines to Patients
• 40 compounds in development
• 12 new medicines for patients since 2011
• ~5,700 R&D colleagues worldwide
• R&D investment: $5.1 billion on a non-GAAP basis*, a 5 percent increase in 2018 over 2017
*This non-GAAP amount excludes significant upfront and milestone payments for business development transactions and other specified R&D items. A reconciliation of GAAP to non-GAAP measures can be found on our website at www.bms.com. The GAAP amount is $6.3B.
Data as of January 2019
It’s all about data, Big Data!
Exponential growth (tens of PBs): from GBs to PBs scale
Scientific data sets
• NGS data
• Proteomics
• Flow Cytometry
• Imaging data
• High-Throughput screening
• Mass spectrometry
Major data sources
• Raw data from labs
• Scratch space
• Results data
• External collaborations
• Databases
• Public & government agencies
R&D Data governance
• 25 years of retention
• Backups
Lab data challenges
"Data insights are only as good as the data that drives them"
• Data accessibility and sharing: silos between teams (organizational resistance); generating insights in a timely manner, visualization and sharing
• Networking, storage & computing power: efficient data exchanges, storage and processing
• Replicating results: testing, validating, retesting, …
• Data mining: lack of good metadata annotation
• Data standards & compliance: different formats, data integration and validation
Typical data flow diagram
1. Instruments write raw data into local scratch space
2. Raw data is pushed to S3 by AWS Storage Gateway or AWS DataSync, or via AWS CLI S3 commands
3. The iRODS system scans S3 buckets regularly
4. Applications request data via the iRODS metadata catalog
Diagram: Labs (scientific instruments) -> AWS Storage Gateway / AWS DataSync / AWS CLI over AWS Direct Connect (10 Gb/s) -> S3 buckets -> iRODS metadata catalog -> Applications
iRODS base architecture
• Client asks for data
• Data request goes to an iRODS server
• Server looks up information in iCAT
• iCAT tells which iRODS server has the data
• Data is retrieved from its physical location
Diagram: BMS scientific instruments and BMS scientists reach a unified namespace through the MetaLnx browser, iQuery, and API calls. Behind it sit the iRODS servers (East 1, East 2, West 1, West 2), the Metadata Catalog (iCAT), and the iRODS Rule Engine, backed by local data stores and S3 Buckets 1 to 3.
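The lookup sequence on this slide can be sketched in a few lines of Python. This is a toy model and not the iRODS API: the `Replica` and `Catalog` classes, the `fetch` function, the logical path, and the storage layout are all hypothetical names introduced for illustration. Only the server name "East 1" and the idea of an S3-backed location come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """Where one copy of a data object physically lives (hypothetical model)."""
    server: str         # iRODS server that fronts the storage
    physical_path: str  # location on that storage, e.g. a bucket and key


class Catalog:
    """Toy stand-in for iCAT: maps logical paths to physical replicas."""

    def __init__(self) -> None:
        self._replicas = {}

    def register(self, logical_path: str, replica: Replica) -> None:
        self._replicas[logical_path] = replica

    def locate(self, logical_path: str) -> Replica:
        # The catalog answers "which server has the data?"
        if logical_path not in self._replicas:
            raise FileNotFoundError(logical_path)
        return self._replicas[logical_path]


def fetch(catalog: Catalog, storage: dict, logical_path: str) -> bytes:
    """Resolve a logical path via the catalog, then read from that server."""
    replica = catalog.locate(logical_path)
    return storage[replica.server][replica.physical_path]


if __name__ == "__main__":
    catalog = Catalog()
    # Hypothetical logical path and physical location.
    catalog.register("/zone/ngs/sample.fastq",
                     Replica("East 1", "bucket-1/raw/sample.fastq"))
    storage = {"East 1": {"bucket-1/raw/sample.fastq": b"ACGT"}}
    print(fetch(catalog, storage, "/zone/ngs/sample.fastq"))
```

The point of the design is that the client only ever names the logical path; which server and which bucket hold the bytes is the catalog's concern.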
iRODS for Computational Genomics
Diagram components:
• Corporate data center: local servers (NFS), connected over AWS Direct Connect (10 Gb/s)
• AWS Region, virtual private cloud with an Internet gateway, mirrored across Availability Zone 1 and Availability Zone 2: iRODS Catalog Consumer (EC2), iRODS Catalog Provider (EC2), iRODS Ingest Worker (EC2), iRODS Redis Server (EC2), iRODS Metalnx Server (EC2), iRODS RDS Database (PostgreSQL) as Primary and Standby with data replication
• S3 object store: S3 bucket A, S3 bucket B, … S3 bucket N, S3 bucket N+1
• Connected systems: Genomics Data Hub, Enterprise Data Lake
iRODS resources on cloud specs
• Consumers: m4.2xlarge (8vCPU/32GB)
• Provider: m4.10xlarge (40vCPU/160GB)
• Workers: c4.4xlarge (16vCPU/30GB)
• Redis server: r4.8xlarge (32vCPU/244GB)
• Metalnx: m4.large (4vCPU/16GB)
• Database: db.m4.4xlarge (16vCPU/64GB)
iRODS in NGS data processing pipeline
Diagram components (Virtual Private Cloud):
• Data entry: BMS NGS Labs, AWS Direct Connect, AWS DataSync, AWS CLI, AWS Storage Gateway, S3 Raw Data Bucket
• Processing: AWS Batch, sequence alignment, NGS QC Analysis, NGS360, Project Registry API
• Outputs and consumers: S3 Result Bucket, S3 Data Bucket, gene expression database, Applications, Scientists
• Collaborations (clinical data): Vendor A, Vendor B, Vendor C, Vendor D, S3 “drop” buckets (S3 bucket A … S3 bucket D), AWS Lambda
iRODS in Discovery Imaging Platform
Diagram areas: On-premises, BMS AWS Cloud, Collaborator’s Cloud
Diagram components:
• Scientific instruments, images on local servers (NFS), scientists
• Image transformation layer, local storage for transformed images
• Transfer: AWS Direct Connect (10 Gb/s), AWS Snowball, Storage Gateway hardware appliance
• S3 object store: S3 bucket 1, S3 bucket 2, S3 bucket 3, … and S3 bucket A, S3 bucket B, … S3 bucket N, S3 bucket N+1
• iRODS Metadata Catalog, transformation, image analysis tools, image metadata database
Flow Cytometry Data Flows
Diagram stages: Exp. Design, Raw, Storage, Analysis, Tracking, Aggregation
Diagram labels: Bio Workbook, Chemistry workbook, Signals, NuGenesis, LIMS, fcs files, Shared Drives, FileCatcher, iRODS, S3, FCS Express, FlowJo, Cytobank, Spotfire, Biologics registration, Analytics & Data Lakes
Slide credit: Goce Bogdanoski
Flow Cytometry: Digital Intelligence / ML
Flow Cytometry Data Hub, numbered diagram elements (1 to 6, with 5.1 to 5.3):
• Instrument Data Generation
• Existing Source UI Integration
• File Management, Tracking & Auditing
• Data Storage
• FC Database, FileSelector App, Unsupervised Analysis Pipeline
• User Access to HPC on the Cloud
Data Standardization guidelines: Metadata Guidelines, File Nomenclature, Data Dictionaries
Transfer: User, AWS DataSync or SmartSync
File management capabilities: Automated Ingest, Storage Tiering, Indexing, Compliance, Auditing, Publishing, Provenance, Integrity
Automated Gating (AaaS): AltraBio, Cytapex Bioinformatics, Astrolabe
Analysis pipeline: Data Pre-processing & Clean-up, Dimensionality Reduction, Clustering, Predictive Modeling
Tools named: t-SNE, FlowSOM, Citrus, Spotfire, Disqover, Signals
Slide credit: Goce Bogdanoski
iRODS & Data Lake Integration
Legend: ✓ : preferred platform; ✗ : capability not existing on the platform
Capabilities compared: lab data acquisition, technical metadata, business metadata, data move to cloud, file management, analytics, external workflows
• iRODS (system of records): ✓ for all seven capabilities. Notes: source, source of truth, operational analytics, domain specific
• Data Lake (system of insights): ✓ for four capabilities, ✗ for three. Notes: enterprise repository, replicated where required, cross-functional insights, in roadmap
Roadmap to iRODS
Timeline labels: Initial assessment, Pilot SoW; iRODS Pilot; AWS infrastructure setup for iRODS; Production SoW; (1st iRODS Production); iRODS Consortium membership; iRODS Production deployment for Computational Genomics; NFS / S3 data syncs; 2nd iRODS Production environment in cloud; iRODS UGM’2020 (We’re here today!); Flow Cytometry Data Management; iRODS for Discovery Imaging Platform; iRODS for ECL Labs Project; Towards Data Farm
Timeline dates: Nov 2017, Feb 2018, Mar-Aug 2018, Sep 2018, Dec 2018, Nov’18 - Jul’19, Aug 2019, Dec 2019, Nov 2020, 2021, 2022
Towards iRODS Data Farm
East - West Coasts Data Federation
• East Coast Data Federation: East Zone 1 (Region 1, East), East Zone 2 (Region 2, East), East Zone 3 (Region 3, East)
• West Coast Data Federation: West Zone 1 (Region 4, West), West Zone 2 (Region 5, West)
• Global Search Index on top of iRODS Metadata catalog
• Connected parties: Data Lake, data analytics, data providers, scientific groups, applications
Processing Data at Scale
Using iRODS for managing petabytes of data in hundreds of millions of files on distributed storage resources spread across the country.
• Number of S3 buckets: 200+
• Number of objects in S3: 800+ million
• Size of dataset: 10+ PB
• Processing rate (regular data ingest): 5 million objects per hour
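To put these figures in proportion, the short calculation below derives two rough numbers from them. It is a back-of-the-envelope sketch and not a measurement: it takes the slide's "+" values as lower bounds, assumes 1 PB = 10^15 bytes, and assumes the regular ingest rate would also hold for a pass over every object, which the slide does not state. The function names are illustrative.

```python
OBJECTS_IN_S3 = 800_000_000        # "800+ million" objects, used as a lower bound
INGEST_RATE_PER_HOUR = 5_000_000   # "5 million objects per hour"
DATASET_BYTES = 10 * 10**15        # "10+ PB", assuming 1 PB = 10^15 bytes


def full_pass_hours(objects: int, rate_per_hour: int) -> float:
    """Hours needed to touch every object once at a constant rate."""
    return objects / rate_per_hour


def mean_object_size_mb(total_bytes: int, objects: int) -> float:
    """Average object size in megabytes (10^6 bytes)."""
    return total_bytes / objects / 10**6


if __name__ == "__main__":
    hours = full_pass_hours(OBJECTS_IN_S3, INGEST_RATE_PER_HOUR)
    print(f"Full pass: {hours:.0f} hours (about {hours / 24:.1f} days)")
    size = mean_object_size_mb(DATASET_BYTES, OBJECTS_IN_S3)
    print(f"Mean object size: {size:.1f} MB")
```

Under these assumptions a single pass over all objects takes 160 hours, a little under a week, and the average object is about 12.5 MB.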
iRODS data ingest: standard approach
Challenges
• iRODS catalog is always behind
• Negative space / deleted files
Diagram: Data Stream 1 … Data Stream N -> S3 bucket 1 … S3 bucket N -> iRODS Daily Data Ingest Jobs -> iRODS Metadata catalog -> gene expression database, NGS QC Analysis, Applications
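Both challenges can be illustrated with a minimal sketch of what a periodic scan has to do: compare the current bucket listing with the catalog's view and reconcile the difference. This is an illustration only, not the presenters' ingest job; the function names are hypothetical, and holding whole listings in Python sets is a simplification that would not scale to hundreds of millions of objects.

```python
def reconcile(bucket_keys: set, catalog_keys: set) -> tuple:
    """Compare one bucket listing with the catalog's view of that bucket."""
    to_register = bucket_keys - catalog_keys    # new in S3, not yet cataloged
    to_unregister = catalog_keys - bucket_keys  # cataloged, but deleted from S3
    return to_register, to_unregister


def run_ingest_job(bucket_keys: set, catalog_keys: set) -> set:
    """Return the catalog contents after one scheduled reconciliation pass."""
    to_register, to_unregister = reconcile(bucket_keys, catalog_keys)
    return (catalog_keys | to_register) - to_unregister


if __name__ == "__main__":
    # Hypothetical keys: two files arrived and one was deleted since the last job.
    bucket = {"run1/a.fastq", "run1/b.fastq", "run2/c.fastq"}
    catalog = {"run1/a.fastq", "old/d.fastq"}
    print(reconcile(bucket, catalog))
    print(run_ingest_job(bucket, catalog))
```

Between two scheduled runs the catalog lags behind the bucket (the `to_register` set), and entries for deleted files linger until the next pass (the `to_unregister` set), which is the "negative space" the slide refers to.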
Near real time data ingest: AWS Lambda function
Diagram: Labs (data) -> Amazon S3 bucket -> Amazon SNS -> Amazon SQS -> AWS Lambda -> iRODS Metadata catalog -> gene expression database, NGS QC Analysis, Applications
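A Lambda function at the end of this chain receives a batch of SQS messages, each wrapping an SNS envelope that in turn wraps the S3 event notification. The sketch below unpacks those layers and applies the change to a catalog. It is a hedged illustration, not the presenters' function: it assumes raw message delivery is disabled on the SNS subscription (so each SQS body is the SNS JSON envelope), and `CATALOG`, `extract_changes`, and `apply_change` are hypothetical names, with an in-memory dict standing in for the iRODS metadata catalog.

```python
import json
from urllib.parse import unquote_plus

# Hypothetical in-memory stand-in for the iRODS metadata catalog.
CATALOG = {}


def extract_changes(sqs_event: dict) -> list:
    """Unwrap SQS -> SNS -> S3 layers into a flat list of object changes."""
    changes = []
    for sqs_record in sqs_event.get("Records", []):
        envelope = json.loads(sqs_record["body"])   # SNS envelope (assumed)
        s3_event = json.loads(envelope["Message"])  # S3 event notification
        # Messages without "Records" (such as S3 test events) are skipped.
        for record in s3_event.get("Records", []):
            name = record["eventName"]
            if name.startswith("ObjectCreated"):
                action = "register"
            elif name.startswith("ObjectRemoved"):
                action = "unregister"
            else:
                continue  # ignore event types this sketch does not handle
            s3 = record["s3"]
            changes.append({
                "action": action,
                "bucket": s3["bucket"]["name"],
                # Object keys arrive URL-encoded in S3 notifications.
                "key": unquote_plus(s3["object"]["key"]),
                "size": s3["object"].get("size"),
            })
    return changes


def apply_change(catalog: dict, change: dict) -> None:
    """Mirror one S3 change into the catalog stand-in."""
    path = f"{change['bucket']}/{change['key']}"
    if change["action"] == "register":
        catalog[path] = {"size": change["size"]}
    else:
        catalog.pop(path, None)


def handler(event, context=None):
    """Lambda entry point: process one batch of SQS messages."""
    changes = extract_changes(event)
    for change in changes:
        apply_change(CATALOG, change)
    return {"processed": len(changes)}
```

Because delete events flow through the same path as create events, this design addresses both catalog lag and stale entries without waiting for a scheduled scan.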