Data Integration in Bioinformatics and Life Sciences


  1. Data Integration in Bioinformatics and Life Sciences
     Erhard Rahm, Toralf Kirsten, Michael Hartung
     http://dbs.uni-leipzig.de http://www.izbi.de
     EDBT Summer School, September 2007

  2. What is the Problem?
     "What protocols were used for tumors in similar locations, for patients in the same age group, with the same genetic background?"
     Source: L. Haas, ICDE 2006 keynote
     E. Rahm et al.: Data Integration in Bioinformatics and Life Sciences, EDBT Summer School 2007

  3. DILS workshop series
     • International workshop series Data Integration in the Life Sciences (DILS)
       - DILS2004: Leipzig (Interdisciplinary Center for Bioinformatics)
       - DILS2005: San Diego, USA (UCSD Supercomputing Center)
       - DILS2006: Cambridge/Hinxton, UK (EBI)
       - DILS2007: Philadelphia (UPenn)
       - DILS2008: Have you ever been in Paris?

  4. Agenda
     • Kinds of data to be integrated
     • General data integration alternatives
     • Warehouse approaches
     • Virtual and mapping-based data integration
     • Matching large life science ontologies
     • Data quality aspects
     • Conclusions and further challenges

  5. Agenda
     • Kinds of data to be integrated
       - Experimental data
       - Clinical data
       - Public web data
       - Ontologies
     • General data integration alternatives
     • Warehouse approaches
     • Virtual and mapping-based data integration
     • Matching large life science ontologies
     • Data quality aspects
     • Conclusions and further challenges

  6. Scientific data management process
     [Figure: scientific data management process. Source: Gertz/Ludaescher: SDM Tutorial, EDBT 2006]
     • Sharing/reuse of data products, community-oriented research

  7. Data integration in life sciences
     • Many heterogeneous data sources
       - Experimental data produced by chip-based techniques
         - Genome-wide measurement of gene activity under different conditions (e.g., normal vs. different disease states)
         - Experimental annotations (metadata about experiments)
       - Clinical data
       - Lots of inter-connected web data sources and ontologies
         - Sequence data, annotation data, vocabularies, …
       - Publications (knowledge in text documents)
       - Private vs. public data
     • Different kinds of analysis
       - Gene expression analysis
       - Transcription analysis
       - Functional profiling
       - Pathway analysis and reconstruction
       - Text mining, …
     [Figure: Affymetrix gene expression microarray]

  8. Expression experiment and analysis
     [Figure: workflow from sample to interpretation]
     (1) Cell selection
     (2) RNA/DNA preparation, labeling
     (3) Hybridization
     (4) Array scan
     (5) Image analysis
     (6) Data pre-processing
     (7) Expression analysis/data mining
     (8) Interpretation using annotations
     Intermediate products shown in the figure: sample, mRNA, array, array image (x, y), array spot intensities, spot intensities for experiment series, gene expression matrix, gene groups (co-regulated, ...)

  9. Experimental data
     • High volume of experimental data
       - Various existing chip types for gene expression and mutation analysis
       - Fast growing amount of numeric data values
     • Need to pre-process chip data (no standard routines)
       - Different data aggregation levels (e.g. Affy probe vs. probeset expression values)
       - Various statistical approaches, e.g. tests and resampling procedures, …
       - Visualizations, e.g. heatmap, M/A plot, …
     • Need for comprehensive, standardized experimental annotations
       - Experimental setup and procedure (hybridization process, utilized devices, …)
       - Manual specification by the experimenter
       - Often user-dependent use of abbreviations and names / synonyms
       - Recommendation: Minimum Information About a Microarray Experiment (MIAME)*
     * Brazma et al.: Minimum information about a microarray experiment (MIAME) - toward standards for microarray data. Nature Genetics, 29(4): 365-371, 2001
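The aggregation levels and the M/A plot named on this slide can be illustrated with a small sketch. It is not one of the established summarization routines: the median is used here only as an illustrative aggregate, and all probe and probeset identifiers (`p1`, `ps_A`, ...) and intensities are made up. The M/A values follow the usual definition, M as the log2 ratio and A as the mean log2 intensity of two measurements.

```python
import math
from statistics import median


def summarize_probesets(probe_values, probe_to_probeset):
    """Aggregate probe-level intensities to one value per probeset.

    The median is an illustrative choice, not a standard routine.
    """
    grouped = {}
    for probe_id, value in probe_values.items():
        grouped.setdefault(probe_to_probeset[probe_id], []).append(value)
    return {probeset: median(values) for probeset, values in grouped.items()}


def ma_values(intensity_a, intensity_b):
    """Return (M, A) for two positive intensities of the same probeset."""
    log_a, log_b = math.log2(intensity_a), math.log2(intensity_b)
    return log_a - log_b, (log_a + log_b) / 2


# Hypothetical probe-to-probeset assignment and intensities.
probe_to_probeset = {"p1": "ps_A", "p2": "ps_A", "p3": "ps_A", "p4": "ps_B"}
normal = summarize_probesets(
    {"p1": 100.0, "p2": 128.0, "p3": 150.0, "p4": 64.0}, probe_to_probeset
)
tumor = summarize_probesets(
    {"p1": 500.0, "p2": 512.0, "p3": 530.0, "p4": 64.0}, probe_to_probeset
)

m, a = ma_values(tumor["ps_A"], normal["ps_A"])
print(m, a)
```

Plotting M against A for all probesets gives the M/A plot; probesets with M near 0 show no change between the two conditions.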

  10. Clinical data: Requirements
     • Patient-oriented data
       - Personal data
       - Different types of findings, e.g. general clinical findings (blood pressure, etc.), pathological findings (tissue samples), genetic findings
       - Applied therapies (timing and dosages of drugs, …)
     • Clinical studies to evaluate and improve treatment protocols, e.g. against cancer
       - Data acquisition during complex workflows running in different hospitals
       - Special software systems for study management (eResearch Network, Oracle Clinical, ...)
     • New research direction: collect and evaluate genetic data (e.g., gene expression data) within clinical studies to investigate molecular-biological causes of diseases and impact of drugs
     • Need to integrate experimental and clinical data within distributed study management workflows
       - High privacy requirements: protect identity of individual patients
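One common way to reconcile the integration and privacy requirements above is to link records through a keyed pseudonym instead of the patient identifier. The slides do not prescribe a technique; the sketch below is an assumption that uses HMAC-SHA256 from the Python standard library, and the key, patient identifiers and record contents are hypothetical.

```python
import hashlib
import hmac


def pseudonym(patient_id: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym; without the key it cannot be recomputed."""
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key, assumed to be held only by a trusted study center.
key = b"example-key-kept-by-the-study-center"

# Clinical and chip-based records are stored under the pseudonym only.
clinical = {pseudonym("patient-001", key): {"age_group": "50-59"}}
chip_data = {pseudonym("patient-001", key): {"array": "array-17"}}

# Integration step: join both record sets on the shared pseudonym.
joined = {
    pid: {**clinical[pid], **chip_data[pid]}
    for pid in clinical.keys() & chip_data.keys()
}
print(joined)
```

A pseudonym alone does not guarantee anonymity, since attributes such as age or location can still identify a patient; it only removes the direct identifier from the integrated data.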

  11. Clinical trials: Inter-organizational workflows
     [Figure: data acquisition and analysis steps and the data each step produces]
     • Selection of patients meeting pre-defined inclusion criteria → personal (patient) data
     • Periodic doctor or hospital visits (operations, checkups) → general clinical findings
     • Tissue extraction and pathological analysis (microscopy, antibody tests) → pathological findings
     • Genome-location-specific genetic analysis: mutation profiling (banding analysis, FISH) → genetic findings
     • Genome-wide chip-based genetic analysis: mutation profiling (Matrix-CGH), expression profiling (microarray) → chip-based genetic data

  12. Publicly accessible data in web sources
     • Genome sources: Ensembl, NCBI Entrez, UCSC Genome, ...
       - Objects: genes, transcripts, proteins etc. of different species
     • Object-specific sources
       - Proteins: UniProt (Swiss-Prot, TrEMBL), Protein Data Bank, ...
       - Protein interactions: BIND, MINT, DIP, ...
       - Genes: HUGO (standardized gene symbols for human genome), MGD, ...
       - Pathways: KEGG (metabolic & regulatory pathways), GenMAPP, ...
       - ...
     • Publication sources: Medline / PubMed (>16 million entries)
     • Ontologies
       - Utilized to describe properties of biological objects
       - Controlled vocabulary of concepts to reduce terminology variations
       - Popular examples: Gene Ontology, Open Biomedical Ontologies (OBO)

  13. Sample web data with cross-references
     • Annotation data vs. mapping data
       [Figure: sample source entry with a source-specific ID (accession); annotations: names, symbols, synonyms, etc.; references to other data sources: Enzyme, GeneOntology, OMIM, UniGene, KEGG]
     • Problem: semantics of mappings (missing mapping type)
       - Gene-to-gene mappings: orthologous vs. paralogous genes
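The missing-mapping-type problem can be made concrete with a small data structure. This is a sketch under assumptions: the `Mapping` class, the `mapped_ids` helper, the source names and all accessions are hypothetical and do not correspond to any real source's schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mapping:
    source_id: str                      # accession in the referencing source
    target_source: str                  # name of the referenced source
    target_id: str                      # accession in the referenced source
    mapping_type: Optional[str] = None  # None models a missing mapping type


def mapped_ids(mappings, source_id, target_source, mapping_type=None):
    """Return target accessions; filter by type only when one is requested."""
    return [
        m.target_id
        for m in mappings
        if m.source_id == source_id
        and m.target_source == target_source
        and (mapping_type is None or m.mapping_type == mapping_type)
    ]


# Hypothetical cross-references of one gene entry.
mappings = [
    Mapping("gene_h1", "MouseGenes", "gene_m1", "orthologous"),
    Mapping("gene_h1", "HumanGenes", "gene_h2", "paralogous"),
    Mapping("gene_h1", "MouseGenes", "gene_m9"),  # untyped cross-reference
]

print(mapped_ids(mappings, "gene_h1", "MouseGenes"))
print(mapped_ids(mappings, "gene_h1", "MouseGenes", "orthologous"))
```

The untyped reference to `gene_m9` is returned by the unfiltered lookup but cannot be used once a query asks specifically for orthologs, which is the practical cost of a source that publishes cross-references without their semantics.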

  14. Highly connected data sources
     • Many, highly connected data sources and ontologies
     • Heterogeneity
       - Files and databases
       - Format and schema differences
       - Semantics
     • Incomplete data sources
     • Overlapping data sources
       - Need to fuse corresponding objects from different sources
     • Frequent changes
       - Data, schema, APIs
     • Common (global) database schema???
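Fusing corresponding objects from overlapping, incomplete sources can be sketched as follows. The sketch assumes the sources already share a matching key (here a hypothetical `symbol` field) and resolves conflicts with a simple "first non-missing value wins" rule; both are illustrative assumptions, and real fusion usually needs object matching and a more deliberate conflict policy.

```python
def fuse(sources, key):
    """Merge records that share the same key value across named sources.

    sources: dict mapping a source name to a list of record dicts.
    Missing values (None) in one source are filled from later sources.
    """
    fused = {}
    for source_name, records in sources.items():
        for record in records:
            entry = fused.setdefault(record[key], {"sources": []})
            entry["sources"].append(source_name)
            for field, value in record.items():
                if value is not None and field not in entry:
                    entry[field] = value  # first non-missing value wins
    return fused


# Hypothetical, partially overlapping records from two sources.
source_a = [{"symbol": "G1", "name": "gene one", "pathway": None}]
source_b = [
    {"symbol": "G1", "name": "Gene 1", "pathway": "pw_x"},
    {"symbol": "G2", "name": "gene two", "pathway": None},
]

fused = fuse({"A": source_a, "B": source_b}, key="symbol")
print(fused["G1"])
```

Recording the contributing sources per fused object keeps provenance available, which matters when sources change frequently and a fused value has to be traced back.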

  15. Ontologies
     • Increasing use of ontologies in bioinformatics and medicine to organize domains, annotate data and support data integration
       - Develop a shared understanding of concepts in a domain
       - Define the terms used
       - Attach these terms to real data (annotation)
       - Provide ability to query data from different sources using a common vocabulary
     • Some popular life science ontologies
       - Gene Ontology (http://www.geneontology.org)
         - Species-independent, comprehensive sub-ontologies about Molecular Functions, Biological Processes and Cellular Components
       - UMLS, the Unified Medical Language System (http://www.nlm.nih.gov/research/umls/umlsmain.html)
         - Metathesaurus comprising medical subjects and terms of Medical Subject Headings, International Classification of Diseases (ICD), …
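Querying data from different sources through a common vocabulary can be sketched with a tiny is-a hierarchy. The term names, object identifiers and helper functions below are hypothetical; they are not Gene Ontology or UMLS identifiers, and real ontologies form graphs with several relationship types rather than this simplified tree.

```python
def descendants(term, children):
    """Return the term plus all terms below it in the is-a hierarchy."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, []))
    return seen


def objects_annotated_with(term, children, annotations):
    """Find objects annotated with the term or any more specific term."""
    wanted = descendants(term, children)
    return sorted(obj for obj, terms in annotations.items() if wanted & set(terms))


# Hypothetical vocabulary: parent term -> more specific child terms.
children = {
    "process": ["metabolism", "signaling"],
    "metabolism": ["lipid_metabolism"],
}

# Objects from two hypothetical sources, annotated with the shared terms.
annotations = {
    "sourceA:obj1": ["lipid_metabolism"],
    "sourceB:obj7": ["signaling"],
    "sourceB:obj8": ["metabolism"],
}

print(objects_annotated_with("metabolism", children, annotations))
```

Because both sources annotate with the same terms, one query on `metabolism` returns objects from both, including the one annotated only with the more specific `lipid_metabolism`.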

  16. OBO: Open Biomedical Ontologies
     • An umbrella project for grouping different ontologies in the biological/medical field
     Why OBO?
       - GO only covers three specific domains
       - Other aspects could also be annotated: anatomy, …
       - No standardization of ontologies: format, syntax, …
       - What ontologies do exist in the biomedical domain?
       - Creation takes a lot of work, so reuse existing ontologies
     Requirements for ontologies in OBO:
       - Open, can be used by all without any constraints
       - Common shared syntax
       - No overlap with other ontologies in OBO
       - Share a unique identifier space
       - Include text definitions of their terms
     Currently covered aspects:
     • Anatomies
     • Cell Types
     • Sequence Attributes
     • Temporal Attributes
     • Phenotypes
     • Diseases
     • …
     http://obo.sourceforge.net/main.html
