reproducibility and big omics data

Reproducibility and Big (Omics) Data

Reproducibility and Big (Omics) Data. Nuno Bandeira, Ph.D., Associate Professor, Dept. Computer Science and Engineering, Skaggs School of Pharmacy and Pharmaceutical Sciences; Executive Director, NIH/NIGMS Center for Computational Mass Spectrometry.


  1. Reproducibility and Big (Omics) Data. Nuno Bandeira, Ph.D., Associate Professor, Dept. Computer Science and Engineering, Skaggs School of Pharmacy and Pharmaceutical Sciences; Executive Director, NIH/NIGMS Center for Computational Mass Spectrometry.

  2. What is the Proteome? Not just unmodified major protein isoforms:
     • Sequence polymorphisms
     • Alternative splicing
     • Post-translational mods (PTMs)
     • Endogenous peptides
     • May be non-linear: insulin
     • Protein interactions: cross-linking
     • Microbiome: 10x more cells, 100-360x more genes
     • Disease proteomes
     • Infectious diseases: MHC peptides
     • Cancer: fusions, polymorphisms
     • Cataracts: hypermodified peptides
     • Antibodies, drug discovery
     Center for Computational Mass Spectrometry: http://proteomics.ucsd.edu

  3. Lens dataset: 5th round. [Figure: observed forms of the lens peptide MDVTIQHPWFKRT. Sequences shown: the full peptide (MDVTIQHPWFKRT) plus N-terminal and C-terminal variants (MDVTIQHPWFKR, MDVTIQHPWFK, MDVTIQHPW, DVTIQHPWFKR, DVTIQHPWFK, VTIQHPWFK, TIQHPWFKR). Modifications shown, alone and in combination: N-terminal acetylation (Ac), M oxidation (+O), W oxidation (+O), M dethiomethyl (labeled +HCS in the figure), Q deamidation (+1), W → kynurenine, K acetylation (Ac), K carboxyethylation (?), water loss (-H2O), and undetermined modifications (+38; +25, labeled +26 in the figure).]
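The variants on this slide differ from the unmodified peptide by characteristic mass shifts. As a rough illustration (not taken from the slides), the sketch below computes monoisotopic masses for MDVTIQHPWFKR with a few of the listed modifications. The residue and modification masses are standard monoisotopic values; the function name `peptide_mass` and the modification keys are hypothetical names chosen for this example, and modifications are treated as simple additive shifts without tracking which residue carries them.

```python
# Monoisotopic residue masses (Da) for the amino acids in the example peptide.
RESIDUE_MASS = {
    "M": 131.04049, "D": 115.02694, "V": 99.06841, "T": 101.04768,
    "I": 113.08406, "Q": 128.05858, "H": 137.05891, "P": 97.05276,
    "W": 186.07931, "F": 147.06841, "K": 128.09496, "R": 156.10111,
}
WATER = 18.010565  # added once per peptide for the free termini

# Mass shifts (Da) for a few modifications named on the slide.
# The keys are hypothetical labels for this sketch.
MOD_MASS = {
    "acetyl": 42.010565,       # N-terminal or K acetylation
    "oxidation": 15.994915,    # M or W oxidation (+O)
    "deamidation": 0.984016,   # Q deamidation (+1)
    "water_loss": -18.010565,  # -H2O
}


def peptide_mass(sequence, mods=()):
    """Return the neutral monoisotopic mass of a peptide plus additive mods."""
    base = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    return base + sum(MOD_MASS[m] for m in mods)


full = peptide_mass("MDVTIQHPWFKR")
# N-terminal acetylation plus M oxidation plus W oxidation.
variant = peptide_mass("MDVTIQHPWFKR", ["acetyl", "oxidation", "oxidation"])
print(f"unmodified: {full:.4f} Da, shift for variant: {variant - full:.4f} Da")
```

Two variants with different modification combinations can land on nearly the same total shift, which is one reason the slide marks some assignments as undetermined or with a question mark.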

  4. More than just big data
     • Big Data: thousands of datasets, hundreds of terabytes (http://massive.ucsd.edu)
     • Big Algorithms: designed to build on, rather than just ‘tolerate’, big data (http://proteomics.ucsd.edu/software)
     • Big Compute: ProteoSAFe (Proteomics Scalable, Accessible and Flexible environment); 30+ data analysis workflows scalable to thousands of cores (http://proteomics.ucsd.edu/ProteoSAFe)
     • Big Community: empower and enable community-wide sharing of knowledge (http://gnps.ucsd.edu)

  5. Dataset reanalysis: PNNL microbiome. 12 TB dataset covering 112 species from diverse taxa.
     • Can easily import raw data for online reanalysis
     • Includes microbial spectral libraries reusable for searching new data
     • Search results can be compared with dataset results
       – Online results or user-uploaded results
       – Reanalysis results will be ‘attachable’ to submitted dataset
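To make "spectral libraries reusable for searching new data" concrete, here is a minimal sketch of one common way a query spectrum can be compared against library spectra: a binned cosine similarity. This is a generic illustration, not the scoring used by MassIVE or ProteoSAFe; the function names, the 1.0 m/z bin width, and the toy library entries are all assumptions made for the example.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return binned


def cosine_score(peaks_a, peaks_b, bin_width=1.0):
    """Cosine similarity between two binned spectra (0 = no overlap, 1 = identical)."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_library_match(query, library):
    """Return (name, score) of the best-scoring entry in a non-empty library."""
    return max(
        ((name, cosine_score(query, peaks)) for name, peaks in library.items()),
        key=lambda item: item[1],
    )


# Toy library with made-up peak lists of (m/z, intensity) pairs.
library = {
    "entry_A": [(100.1, 50.0), (200.2, 100.0), (300.3, 25.0)],
    "entry_B": [(150.5, 80.0), (250.4, 60.0)],
}
query = [(100.2, 45.0), (200.3, 90.0), (300.1, 30.0)]
print(best_library_match(query, library))
```

Real library search adds precursor-mass filtering, intensity scaling, and a score threshold calibrated for false discoveries, none of which are shown here.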

  6. ProteoSAFe reanalysis
     • ProteoSAFe: compute-intensive discovery MS at the click of a button (billions of spectra searched), http://proteomics2.ucsd.edu/ProteoSAFe
     • Cohort-aware spectral networks
     • 30+ workflows, >70 tools

  7. gnps.ucsd.edu: first MassIVE Knowledge Base, open March 2014
     • Co-analyze private+public data
     • Share data
     • Crowdsourced curated libraries
     • Explore unknown molecules

  8. The GNPS vision
     Data to knowledge 101
       – Crowdsourced consensus IDs
         • Curators
         • Revisions
         • Quality levels
       – Automated reanalysis of all data
     Investigator-centric
       – “Living” datasets with new and revised knowledge
       – Dataset subscriptions
       – Molecular explorer: “Data like mine”

  9. Challenges ahead
     Worldwide proteomics big data
       – Organizing thousands of datasets into a validated scientific resource
       – ‘Living’ data: consensus reanalysis, commenting, adding new results
       – Needs: FDR models for crowdsourced reanalysis (who’s right?). Reference datasets for comparison of tools/workflows?
     Most data has no conditions → no biology, validation
       – Need dataset revisions: more metadata, updated IDs
       – What constitutes a publishable unit? Label datasets as “gold” once the biological conclusions are confirmed by reanalysis?
     Reusable knowledge bases
       – Translating global data into a reusable resource (e.g., libraries)
       – Crowdsourcing curation of shared community knowledge bases
       – Needs: what knowledge to represent? Who reviews the curators?
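For context on the FDR question raised above, the sketch below shows the conventional single-search target-decoy estimate, in which the number of decoy matches above a score threshold approximates the number of false target matches. This is only the familiar baseline, not an FDR model for crowdsourced reanalysis, which the slide presents as an open need. The function names and the `(score, is_decoy)` input format are assumptions made for this example.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate FDR as decoys / targets among matches scoring >= threshold.

    `psms` is a list of (score, is_decoy) pairs; higher scores are better.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        # No targets pass: treat as fully false if any decoy passes.
        return 1.0 if decoys else 0.0
    return decoys / targets


def score_cutoff(psms, max_fdr=0.01):
    """Return the most permissive observed score whose estimated FDR is <= max_fdr."""
    for threshold in sorted({score for score, _ in psms}):
        if fdr_at_threshold(psms, threshold) <= max_fdr:
            return threshold
    return None  # no threshold satisfies the requested FDR


# Toy matches: (score, is_decoy). Values are made up for illustration.
psms = [(10, False), (9, False), (8, True), (7, False), (6, True)]
print(fdr_at_threshold(psms, 7), score_cutoff(psms, max_fdr=0.4))
```

When several groups reanalyze the same dataset with different tools, each result set carries its own estimate of this kind, and combining them without inflating the overall error rate is the difficulty the slide points to.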
