Reproducibility and Big (Omics) Data
Nuno Bandeira, Ph.D.
Associate Professor
Dept. Computer Science and Engineering
Skaggs School of Pharmacy and Pharmaceutical Sciences
Executive Director
NIH/NIGMS Center for Computational Mass Spectrometry
What is the Proteome?
Not just unmodified major protein isoforms
• Sequence polymorphisms
• Alternative splicing
• Post-translational modifications (PTMs)
• Endogenous peptides
  – May be non-linear: insulin
• Protein interactions: cross-linking
• Microbiome: 10x more cells, 100-360x more genes
• Disease proteomes
  – Infectious diseases: MHC peptides
  – Cancer: fusions, polymorphisms
  – Cataracts: hypermodified peptides
• Antibodies, drug discovery
Center for Computational Mass Spectrometry, http://proteomics.ucsd.edu
Lens dataset: 5th round
[Figure: modified and truncated forms of the peptide MDVTIQHPWFKRT found in the lens dataset. Labels recoverable from the figure:]
• Full peptide: MDVTIQHPWFKRT
• N-term and C-term variants: MDVTIQHPWFKR, MDVTIQHPWFK, MDVTIQHPW, DVTIQHPWFKR, DVTIQHPWFK, VTIQHPWFK, TIQHPWFKR
• N-term acetylation (Ac), alone and combined with one or more of: M oxidation (+O), W oxidation (+O), M dethiomethyl, Q deamidation (+1), W → kynurenine, K acetylation (Ac), water loss (-H2O), K carboxyethylation (?)
• Most heavily modified form shown: N-term acetylation, Q deamidation, M oxidation, W oxidation, K acetylation
• Undetermined modifications on MDVTIQHPWFK: (+38); and one captioned (+25) but annotated (+26), seen alone and together with Q deamidation
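The mass shifts in the figure labels (Ac, +O, +1, -H2O) can be turned into expected peptide masses. The sketch below is illustrative only: it assumes standard monoisotopic residue masses and the usual shifts for acetylation, oxidation, deamidation and water loss. The function and dictionary names are hypothetical and are not part of any CCMS tool.

```python
# Monoisotopic residue masses (Da) for the amino acids in MDVTIQHPWFKRT.
RESIDUE_MASS = {
    "M": 131.04049, "D": 115.02694, "V": 99.06841, "T": 101.04768,
    "I": 113.08406, "Q": 128.05858, "H": 137.05891, "P": 97.05276,
    "W": 186.07931, "F": 147.06841, "K": 128.09496, "R": 156.10111,
}
WATER = 18.010565  # added once per peptide for the free termini

# Commonly used monoisotopic shifts (Da) for modifications named on the slide.
MOD_SHIFT = {
    "acetylation": 42.010565,
    "oxidation": 15.994915,
    "deamidation": 0.984016,
    "water_loss": -18.010565,
}


def peptide_mass(sequence, mods=()):
    """Neutral monoisotopic mass of a peptide plus the listed modifications."""
    base = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    return base + sum(MOD_SHIFT[m] for m in mods)


unmodified = peptide_mass("MDVTIQHPWFK")
# N-term acetylation, M oxidation, W oxidation (one of the variants listed above).
variant = peptide_mass("MDVTIQHPWFK", ["acetylation", "oxidation", "oxidation"])
print(f"unmodified: {unmodified:.4f} Da")
print(f"Ac + 2x ox: {variant:.4f} Da (shift {variant - unmodified:+.4f})")
```

The undetermined (+38) and (+25)/(+26) offsets are deliberately left out of `MOD_SHIFT`, since the slide does not assign them to a known modification.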
More than just big data
• Big Data: thousands of datasets, hundreds of terabytes (http://massive.ucsd.edu)
• Big Algorithms: designed to build on rather than just 'tolerate' big data (http://proteomics.ucsd.edu/software)
• Big Compute: ProteoSAFe, the Proteomics Scalable, Accessible and Flexible environment; 30+ data analysis workflows scalable to thousands of cores (http://proteomics.ucsd.edu/ProteoSAFe)
• Big Community: empower and enable community-wide sharing of knowledge (http://gnps.ucsd.edu)
Dataset reanalysis: PNNL microbiome
12 TB dataset covering 112 species from diverse taxa
• Can easily import raw data for online reanalysis
• Includes microbial spectral libraries reusable for searching new data
• Search results can be compared with dataset results
  – Online results or user-uploaded results
  – Reanalysis results will be 'attachable' to submitted dataset
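As a rough illustration of what comparing reanalysis results with the submitted dataset results could look like, the sketch below computes the overlap between two sets of peptide identifications. The function name and the example identifications are hypothetical; this is not how MassIVE or ProteoSAFe implement the comparison.

```python
def compare_identifications(dataset_ids, reanalysis_ids):
    """Summarize overlap between two collections of peptide identifications."""
    dataset_ids, reanalysis_ids = set(dataset_ids), set(reanalysis_ids)
    shared = dataset_ids & reanalysis_ids
    union = dataset_ids | reanalysis_ids
    return {
        "shared": sorted(shared),
        "only_dataset": sorted(dataset_ids - reanalysis_ids),
        "only_reanalysis": sorted(reanalysis_ids - dataset_ids),
        # Jaccard index: shared / union; defined as 1.0 when both sets are empty.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Made-up identification lists, for illustration only.
submitted = ["MDVTIQHPWFK", "DVTIQHPWFK", "VTIQHPWFK"]
reanalysis = ["MDVTIQHPWFK", "VTIQHPWFK", "TIQHPWFKR"]
summary = compare_identifications(submitted, reanalysis)
print(summary)
```

A real comparison would likely also need to reconcile modification notation and score thresholds between the two result sets before intersecting them.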
ProteoSAFe reanalysis
ProteoSAFe: Compute-intensive discovery MS at the click of a button (billions of spectra searched)
http://proteomics2.ucsd.edu/ProteoSAFe
• Cohort-aware spectral networks
• 30+ workflows, >70 tools
gnps.ucsd.edu
First MassIVE Knowledge Base, open March 2014
• Co-analyze private+public data
• Share data
• Crowdsourced curated libraries
• Explore unknown molecules
The GNPS vision
Data to knowledge 101
– Crowdsourced consensus IDs
  • Curators
  • Revisions
  • Quality levels
– Automated reanalysis of all data
Investigator-centric
– "Living" datasets with new and revised knowledge
– Dataset subscriptions
– Molecular explorer: "Data like mine"
Challenges ahead
Worldwide proteomics big data
– Organizing thousands of datasets into a validated scientific resource
– 'Living' data: consensus reanalysis, commenting, adding new results
– Needs: FDR models for crowdsourced reanalysis (who's right?). Reference datasets for comparison of tools/workflows?
Most data has no conditions → no biology, validation
– Need dataset revisions: more metadata, updated IDs
– What constitutes a publishable unit? Label datasets as "gold" once the biological conclusions are confirmed by reanalysis?
Reusable knowledge bases
– Translating global data into a reusable resource (e.g., libraries)
– Crowdsourcing curation of shared community knowledge bases
– Needs: what knowledge to represent? Who reviews the curators?
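For context on the FDR point above, the sketch below shows the conventional target-decoy estimate for a single search: the FDR at a score threshold is approximated by the number of decoy matches divided by the number of target matches at or above that threshold. This is only the familiar single-analysis baseline, not the crowdsourced FDR model the slide lists as an open need. The function names and example scores are hypothetical.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate FDR as decoys/targets among PSMs scoring >= threshold.

    psms: list of (score, is_decoy) pairs, where a higher score is a better match.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def score_cutoff_for_fdr(psms, max_fdr=0.01):
    """Lowest score threshold at which the estimated FDR is within max_fdr."""
    best = None
    for threshold in sorted({score for score, _ in psms}, reverse=True):
        if fdr_at_threshold(psms, threshold) <= max_fdr:
            best = threshold
    return best


# Made-up peptide-spectrum match scores, for illustration only.
psms = [
    (9.1, False), (8.7, False), (8.2, False), (7.9, True),
    (7.5, False), (6.8, False), (6.1, True), (5.9, False),
]
cutoff = score_cutoff_for_fdr(psms, max_fdr=0.25)
print("score cutoff:", cutoff)
print("estimated FDR at cutoff:", fdr_at_threshold(psms, cutoff))
```

When many groups reanalyze the same data with different tools and thresholds, each result set carries its own estimate of this kind, which is one reason combining them raises the "who's right?" question.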