Capacity building in the cloud for Data Intensive Cancer Genomics Bruce Press, Ci4CC Fall meeting October 1, 2018
The rate of data generation is accelerating rapidly. Doubling time of medical knowledge: 50 years in 1950, projected to be 73 days by 2020 (Densen, P. Trans Am Clin Climatol Assoc 2011).
Using this information to improve cancer patient outcomes isn’t only a technology challenge.
Technology: Scalable & Secure Environments; Data Harmonization & Organization
Social: Data Sharing & Collaboration; Data Analysis Fluency
Cloud is the most economically reasonable way to store and analyze our growing health data corpus.
Cloud provides significant benefits for health data analysis at scale.
● Immediate scaling -- no need to wait to purchase and install hardware.
● Levels the playing field -- even researchers at institutions without large compute infrastructure investments can access powerful data and compute resources.
● Extreme durability eliminates or reduces need for backup copies.
● Multi-tenancy of data means many researchers can access data without needing to physically copy it.
Old model: send data to compute. New model: send compute to data.
Compute and data storage platforms allow more researchers to quickly realize the power of cloud.
● Infrastructure configuration and security/compliance ‘out of the box’.
● Optimized data storage and analysis methods, across multiple underlying cloud infrastructures.
● Cost monitoring and management.
● Allows researchers to focus on science, not on managing computational resources.
The cloud allows multiple researchers to access the same copy of high-value public datasets.
● NCI Cloud Pilots (now Cloud Resources and the Cancer Research Data Commons) paved the way for secure access and analysis of high-value datasets in the cloud.
● Authentication and authorization mechanisms enable approved researchers to access controlled data, initially from TCGA and TARGET and now from an expanding set of data resources.
● Potential to save millions of dollars by reducing replication of data, and to speed research by avoiding download times.
https://cbiit.cancer.gov/ncip/cancer-research-data-commons
Finding, organizing and cleaning data currently accounts for 80% of work performed by data scientists. Gil Press, Forbes 2016
Connecting data sets across multiple domains will increase the power of each to drive new discoveries.
● Flexible, semantic data models and advanced search allow researchers to find data of interest in enormous repositories.
● Can’t be a one-size-fits-all solution - the properties most interesting for a particular research question tend to be unique.
● For example, pregnancy exposure is highly important for birth defect research but is not a typical variable for adult cancer research.
Portable and self-contained analysis methods promote reproducibility and speed harmonization of new data with large repositories.
● By describing analysis methods in Common Workflow Language (CWL) and packaging tools in Docker containers, the exact routine used for large harmonization efforts can be applied to novel data.
● Implementation of the GA4GH Workflow Execution Service (WES) standard allows the same analysis to be performed on multiple platforms.
● Example: TOPMed harmonization workflow run on GTEx files.
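As a rough sketch of the portability point above, the snippet below assembles the form fields that a GA4GH WES `POST /runs` request takes for a CWL workflow, then pairs the identical request with two endpoints. The field names follow the WES 1.0 run request schema; the endpoint URLs, workflow URL, and input names are hypothetical placeholders, not anything shown in the talk, and no network call is made.

```python
import json

# Hypothetical WES base URLs on two different platforms (placeholders only).
WES_ENDPOINTS = [
    "https://wes.platform-a.example.org/ga4gh/wes/v1",
    "https://wes.platform-b.example.org/ga4gh/wes/v1",
]


def build_wes_run_request(workflow_url, inputs, cwl_version="v1.0"):
    """Build the form fields for a WES `POST /runs` call for a CWL workflow.

    Values are illustrative; a real submission would also need credentials
    and possibly workflow attachments.
    """
    return {
        "workflow_url": workflow_url,
        "workflow_type": "CWL",
        "workflow_type_version": cwl_version,
        # WES expects workflow_params as a JSON-encoded string.
        "workflow_params": json.dumps(inputs, sort_keys=True),
    }


def plan_submissions(endpoints, request_fields):
    """Pair each endpoint's /runs URL with a copy of the same request fields."""
    return [
        (base.rstrip("/") + "/runs", dict(request_fields))
        for base in endpoints
    ]


if __name__ == "__main__":
    fields = build_wes_run_request(
        "https://example.org/workflows/harmonize.cwl",  # hypothetical workflow
        {"input_file": {"class": "File", "path": "sample.cram"}},  # hypothetical input
    )
    for url, payload in plan_submissions(WES_ENDPOINTS, fields):
        print(url, payload["workflow_type"], payload["workflow_type_version"])
```

Because the workflow description and its containerized tools travel with the request, only the base URL changes between platforms in this sketch.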
Sharing data and working on it together will speed discoveries.
Collaborative workspaces allow researchers with different expertise to work together in real time.
● Capture the end-to-end ‘research journey’ to facilitate reproducibility and extension of results.
● Fine-grained permissions allow different levels of data and analysis access.
● Multiple communication channels allow researchers to discuss analyses and results in situ.
Sharing of data and results will reduce re-work, enhance serendipity, and ultimately result in better outcomes for patients.
● Cloud platforms provide an efficient way to facilitate data sharing since there’s no additional cost for more researchers to access and analyze data.
● Data owners are beginning to make data broadly available without embargo while ensuring compliance with patient consents; CHOP has led this charge via the CAVATICA platform.
● New technologies facilitate sharing raw data, methods, and results in a Findable, Accessible, Interoperable and Reusable way (the FAIR data principles).
Data analysis will become a core competency for both researchers and medical professionals.
Platforms and tools must be highly usable, with as low a barrier to entry as possible, while at the same time enabling power users.
● Reproducibility of analysis journeys also provides a powerful teaching resource.
● Interactive workshops, hackathons, and training sessions are important to build expertise across individuals with diverse backgrounds.
● Programmatic access methods (APIs) allow automation and optimization by advanced users, while visual interfaces support a broad user base.
Acknowledgements The Global Seven Bridges Team Work presented was funded in whole or in part by: HHSN261201400008C, HHSN261200800001E, U2C HL138346-01, OT3 HL142478, OT3 OD02546 and U24CA224067