Data Integration in Bioinformatics and Life Sciences
Erhard Rahm, Toralf Kirsten, Michael Hartung
http://dbs.uni-leipzig.de
http://www.izbi.de
EDBT – Summer School, September 2007
What is the Problem?
"What protocols were used for tumors in similar locations, for patients in the same age group, with the same genetic background?"
Source: L. Haas, ICDE2006 keynote
E. Rahm et al.: Data Integration in Bioinformatics and Life Sciences, EDBT summer school 2007
DILS workshop series
• International workshop series Data Integration in the Life Sciences (DILS)
  - DILS2004: Leipzig (Interdisciplinary Center for Bioinformatics)
  - DILS2005: San Diego, USA (UCSD Supercomputing Center)
  - DILS2006: Cambridge/Hinxton, UK (EBI)
  - DILS2007: Philadelphia (UPenn)
  - DILS2008: Have you ever been to Paris?
Agenda
• Kinds of data to be integrated
• General data integration alternatives
• Warehouse approaches
• Virtual and mapping-based data integration
• Matching large life science ontologies
• Data quality aspects
• Conclusions and further challenges
Agenda
• Kinds of data to be integrated
  - Experimental data
  - Clinical data
  - Public web data
  - Ontologies
• General data integration alternatives
• Warehouse approaches
• Virtual and mapping-based data integration
• Matching large life science ontologies
• Data quality aspects
• Conclusions and further challenges
Scientific data management process
[Figure: scientific data management process. Source: Gertz/Ludaescher: SDM Tutorial, EDBT2006]
• Sharing/reuse of data products, community-oriented research
Data integration in life sciences
• Many heterogeneous data sources
  - Experimental data produced by chip-based techniques
    - Genome-wide measurement of gene activity under different conditions (e.g., normal vs. different disease states)
    - Experimental annotations (metadata about experiments)
  - Clinical data
  - Lots of inter-connected web data sources and ontologies
    - Sequence data, annotation data, vocabularies, …
  - Publications (knowledge in text documents)
  - Private vs. public data
• Different kinds of analysis
  - Gene expression analysis
  - Transcription analysis
  - Functional profiling
  - Pathway analysis and reconstruction
  - Text mining, …
[Figure: Affymetrix gene expression microarray]
Expression experiment and analysis
[Figure: workflow of an expression experiment, steps (1) to (8)]
(1) Cell selection
(2) RNA/DNA preparation and labeling
(3) Hybridization
(4) Array scan
(5) Image analysis
(6) Data pre-processing
(7) Expression analysis/data mining
(8) Interpretation using annotations
Intermediate products shown in the figure: sample, mRNA, array, array image (x, y), array spot intensities, spot intensities for experiment series, gene expression matrix (x, y), gene groups (co-regulated, ...)
Experimental data
• High volume of experimental data
  - Various existing chip types for gene expression and mutation analysis
  - Fast growing amount of numeric data values
• Need to pre-process chip data (no standard routines)
  - Different data aggregation levels (e.g. Affy probe vs. probeset expression values)
  - Various statistical approaches, e.g. tests and resampling procedures, …
  - Visualizations, e.g. heatmap, M/A plot, …
• Need for comprehensive, standardized experimental annotations
  - Experimental setup and procedure (hybridization process, utilized devices, …)
  - Manual specification by the experimenter
  - Often user-dependent utilization of abbreviations and names / synonyms
  - Recommendation: Minimum Information About a Microarray Experiment (MIAME)*
* Brazma et al.: Minimum information about a microarray experiment (MIAME) – toward standards for microarray data. Nature Genetics, 29(4): 365-371, 2001
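The two aggregation levels and the M/A plot mentioned above can be illustrated with a small sketch. It is only an illustration under stated assumptions: the probe IDs, probeset IDs and intensities are hypothetical, and the median is used as a stand-in summary, not as the routine any particular chip vendor or package applies. M is the log2 ratio of two intensities and A their mean log2 intensity, which is the usual definition behind an M/A plot.

```python
import math
from statistics import median


def summarize_probesets(probe_values, probe_to_probeset):
    """Aggregate probe-level intensities to one value per probeset.

    The median is an illustrative choice only, not a standard routine.
    """
    grouped = {}
    for probe_id, value in probe_values.items():
        grouped.setdefault(probe_to_probeset[probe_id], []).append(value)
    return {probeset: median(values) for probeset, values in grouped.items()}


def ma_values(intensity_1, intensity_2):
    """Return (M, A) for two positive intensities of the same gene.

    M = log2 ratio, A = mean of the two log2 intensities.
    """
    log_1, log_2 = math.log2(intensity_1), math.log2(intensity_2)
    return log_1 - log_2, 0.5 * (log_1 + log_2)


# Hypothetical probe IDs, probeset IDs and intensities (illustration only).
probe_values = {"p1": 100.0, "p2": 120.0, "p3": 80.0, "p4": 400.0, "p5": 420.0}
probe_to_probeset = {
    "p1": "ps_A", "p2": "ps_A", "p3": "ps_A",
    "p4": "ps_B", "p5": "ps_B",
}

probeset_values = summarize_probesets(probe_values, probe_to_probeset)
m, a = ma_values(8.0, 2.0)
```

Keeping the probe-to-probeset assignment as explicit data, rather than hard-coding it, reflects the point of the slide: the same raw values can be analysed at either aggregation level.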
Clinical data: Requirements
• Patient-oriented data
  - Personal data
  - Different types of findings, e.g. general clinical findings (blood pressure, etc.), pathological findings (tissue samples), genetic findings
  - Applied therapies (timing and dosages of drugs, …)
• Clinical studies to evaluate and improve treatment protocols, e.g. against cancer
  - Data acquisition during complex workflows running in different hospitals
  - Special software systems for study management (eResearch Network, Oracle Clinical, ...)
• New research direction: collect and evaluate genetic data (e.g., gene expression data) within clinical studies to investigate molecular-biological causes of diseases and impact of drugs
• Need to integrate experimental and clinical data within distributed study management workflows
  - High privacy requirements: protect identity of individual patients
Clinical trials: Inter-organizational workflows
[Figure: workflow steps (data acquisition and analysis) and the data each step produces]
• Selection of patients meeting pre-defined inclusion criteria
  - Data: personal (patient) data
• Periodic doctor or hospital visits (operations, checkups)
  - Data: general clinical findings
• Tissue extraction and pathological analysis (microscopy, antibody tests)
  - Data: pathological findings
• Genome location specific genetic analysis: mutation profiling (banding analysis, FISH)
  - Data: genetic findings
• Genome-wide chip-based genetic analysis: mutation profiling (Matrix-CGH), expression profiling (microarray)
  - Data: chip-based genetic data
Publicly accessible data in web sources
• Genome sources: Ensembl, NCBI Entrez, UCSC Genome, ...
  - Objects: genes, transcripts, proteins etc. of different species
• Object-specific sources
  - Proteins: UniProt (SwissProt, TrEMBL), Protein Data Bank, ...
  - Protein interactions: BIND, MINT, DIP, ...
  - Genes: HUGO (standardized gene symbols for human genome), MGD, ...
  - Pathways: KEGG (metabolic & regulatory pathways), GenMAPP, ...
  - ...
• Publication sources: Medline / PubMed (>16 million entries)
• Ontologies
  - Utilized to describe properties of biological objects
  - Controlled vocabulary of concepts to reduce terminology variations
  - Popular examples: Gene Ontology, Open Biomedical Ontologies (OBO)
Sample web data with cross-references
• Annotation data vs. mapping data
  - Source-specific ID (accession)
  - Annotations: names, symbols, synonyms, etc.
  - References to other data sources: Enzyme, GeneOntology, OMIM, UniGene, KEGG
• Problem: semantics of mappings (missing mapping type)
  - Gene-to-gene mappings: orthologous vs. paralogous genes
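The distinction between annotation data and mapping data, and the problem of a missing mapping type, can be made concrete with a small record structure. This is a sketch under assumptions, not the schema of any real source: the class names, the source name "ExampleGeneDB", all accessions, and the mapping type labels are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class CrossReference:
    """Mapping data: a reference from one object to an object in another source."""
    target_source: str
    target_accession: str
    # None models the problem on the slide: the source does not state
    # what the mapping means (e.g. orthologous vs. paralogous).
    mapping_type: Optional[str] = None


@dataclass
class SourceObject:
    """One entry of a web data source."""
    source: str
    accession: str                                    # source-specific ID
    annotations: dict = field(default_factory=dict)   # names, symbols, synonyms
    references: list = field(default_factory=list)    # CrossReference items


def untyped_references(obj):
    """Return the cross-references whose semantics are not stated."""
    return [ref for ref in obj.references if ref.mapping_type is None]


# Hypothetical entry (all names and accessions are made up).
gene = SourceObject(
    source="ExampleGeneDB",
    accession="GENE_0001",
    annotations={"symbol": "SYM1", "synonyms": ["SYM-1"]},
    references=[
        CrossReference("GeneOntology", "TERM_0001", mapping_type="annotated_with"),
        CrossReference("ExampleGeneDB", "GENE_0002", mapping_type="orthologous"),
        CrossReference("UniGene", "CLUSTER_0001"),  # mapping type missing
    ],
)

needs_review = untyped_references(gene)
```

Making the mapping type an explicit, optional field lets an integration system at least detect which references carry no stated semantics, instead of silently treating all links as equivalent.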
Highly connected data sources
• Many, highly connected data sources and ontologies
• Heterogeneity
  - Files and databases
  - Format and schema differences
  - Semantics
• Incomplete data sources
• Overlapping data sources: need to fuse corresponding objects from different sources
• Frequent changes
  - Data, schema, APIs
• Common (global) database schema???
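Fusing corresponding objects from overlapping sources can be sketched as follows. The sketch assumes the hard part, deciding which objects correspond, is already solved and expressed as a shared key; the source names, keys and attribute values are hypothetical. It keeps the provenance of every value so that disagreements between sources stay visible rather than being overwritten.

```python
from collections import defaultdict


def fuse_objects(records):
    """Group records by shared key, keeping provenance per attribute value.

    records: iterable of (source, key, attributes) tuples.
    Returns {key: {attribute: {value: [sources that report it]}}}.
    """
    fused = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for source, key, attributes in records:
        for attribute, value in attributes.items():
            fused[key][attribute][value].append(source)
    return fused


def conflicts(fused):
    """Return the (key, attribute) pairs on which the sources disagree."""
    return sorted(
        (key, attribute)
        for key, attributes in fused.items()
        for attribute, values in attributes.items()
        if len(values) > 1
    )


# Hypothetical overlapping sources (all names and values are made up).
records = [
    ("source_a", "GENE_0001", {"symbol": "SYM1", "chromosome": "7"}),
    ("source_b", "GENE_0001", {"symbol": "SYM1", "chromosome": "17"}),
    ("source_b", "GENE_0002", {"symbol": "SYM2"}),
]

fused = fuse_objects(records)
disagreements = conflicts(fused)
```

Because incomplete sources simply contribute fewer attributes, the same structure also covers the "incomplete data sources" case without special handling.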
Ontologies
• Increasing use of ontologies in bioinformatics and medicine to organize domains, annotate data and support data integration
  - Develop a shared understanding of concepts in a domain
  - Define the terms used
  - Attach these terms to real data (annotation)
  - Provide ability to query data from different sources using a common vocabulary
• Some popular life science ontologies
  - Gene Ontology (http://www.geneontology.org)
    - Species-independent, comprehensive sub-ontologies about Molecular Functions, Biological Processes and Cellular Components
  - UMLS – Unified Medical Language System (http://www.nlm.nih.gov/research/umls/umlsmain.html)
    - Metathesaurus comprising medical subjects and terms of Medical Subject Headings, International Classification of Diseases (ICD), …
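Querying data from different sources through a common vocabulary can be sketched with a tiny term hierarchy. Everything here is an assumption for illustration: the term names are placeholders rather than real Gene Ontology identifiers, the hierarchy uses a single is_a-style relationship, and the annotated objects are made up. A query for a term also returns objects annotated with any of its more specific terms.

```python
def descendants(term, children):
    """All terms reachable from `term` via child links, including itself."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, []))
    return seen


def objects_annotated_with(term, annotations, children):
    """Objects (from any source) annotated with `term` or a more specific term."""
    wanted = descendants(term, children)
    return sorted(obj for obj, terms in annotations.items() if wanted & set(terms))


# Hypothetical mini-ontology: parent term -> more specific child terms.
children = {
    "biological_process": ["metabolism", "signaling"],
    "metabolism": ["lipid_metabolism"],
}

# Hypothetical annotations: (source, accession) -> attached terms.
annotations = {
    ("source_a", "GENE_0001"): ["lipid_metabolism"],
    ("source_b", "PROT_0007"): ["metabolism"],
    ("source_b", "PROT_0008"): ["signaling"],
}

hits = objects_annotated_with("metabolism", annotations, children)
```

The query never inspects source-specific schemas; the shared terms are the only join point, which is the integration role the slide assigns to ontologies.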
OBO – Open Biomedical Ontologies
• An umbrella project for grouping different ontologies in the biological/medical field
Why OBO?
- GO only covers three specific domains
- Other aspects could also be annotated: anatomy, …
- No standardization of ontologies: format, syntax, …
- What ontologies exist in the biomedical domain?
- Creation takes a lot of work: reuse existing ontologies
Requirements for ontologies in OBO:
- Open, can be used by all without any constraints
- Common shared syntax
- No overlap with other ontologies in OBO
- Share a unique identifier space
- Include text definitions of their terms
Currently covered aspects:
• Anatomies
• Cell Types
• Sequence Attributes
• Temporal Attributes
• Phenotypes
• Diseases
• …
http://obo.sourceforge.net/main.html