
BioMart: Data integration in four easy steps, Arek Kasprzyk



  1. BioMart: Data integration in four easy steps. Arek Kasprzyk, European Bioinformatics Institute, 22 July 2006

  2. BioMart
     • A joint project
       – European Bioinformatics Institute (EBI)
       – Cold Spring Harbor Laboratory (CSHL)
     • Funding
       – Wellcome Trust
       – European Commission
       – NIH

  3. Synopsis
     • Higher level data management system
       – Data mining type access to descriptive data
       – Query optimization
       – Data federation
       – Meta data support

  4. (Diagram labels: source data, BioMart, BioMart software, XML; steps 1 Transformation, 2 Configuration, 3 Querying)

  5. Transformation and Configuration Tools

  6. Query interfaces

  7. Programmatic access
     • APIs
       – Perl (biomart-plib)
       – Java (martj)
       – R (biomaRt)
     • Web service

  8. Data federation
     (Diagram labels: a REGISTRY and XML configurations in front of PostgreSQL, Oracle and MySQL databases)

  9. Dataset, Attribute and Filter
     (Diagram labels: Mart, Dataset, Attribute, Filter; table GENE with columns gene_id (PK), gene_stable_id, gene_display_id, chromosome, gene_start, gene_chrom_end, description)
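
     One way to read the Dataset, Attribute and Filter slide is that a dataset behaves like a table, attributes choose which columns come back, and filters restrict which rows are returned. The sketch below illustrates that reading in plain Python. It does not use any BioMart software; the rows and the `query` helper are hypothetical, and only the column names come from the slide.

```python
# Toy illustration only: these rows are invented, not real Ensembl data.
# Column names follow the GENE table shown on the slide.
GENE = [
    {"gene_id": 1, "gene_stable_id": "G0001", "gene_display_id": "geneA",
     "chromosome": "22", "gene_start": 100, "gene_chrom_end": 900,
     "description": "hypothetical gene A"},
    {"gene_id": 2, "gene_stable_id": "G0002", "gene_display_id": "geneB",
     "chromosome": "1", "gene_start": 200, "gene_chrom_end": 800,
     "description": "hypothetical gene B"},
    {"gene_id": 3, "gene_stable_id": "G0003", "gene_display_id": "geneC",
     "chromosome": "22", "gene_start": 300, "gene_chrom_end": 700,
     "description": "hypothetical gene C"},
]


def query(dataset, attributes, filters):
    """Return the requested attributes for every row that matches all filters."""
    matching = [row for row in dataset
                if all(row[name] == value for name, value in filters.items())]
    return [tuple(row[name] for name in attributes) for row in matching]


# Attributes select columns; the filter keeps only chromosome 22 rows.
result = query(GENE, ["gene_stable_id", "gene_start"], {"chromosome": "22"})
print(result)
```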

  10. Joining two datasets
      Dataset 1 and Dataset 2 are joined through links:
      • Exportable: name = uniprot_id, attributes = uniprot_ac
      • Importable: name = uniprot_id, filters = uniprot_ac
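
      The link on this slide can be thought of as follows: the exportable names attributes whose values one dataset hands out, and the importable with the same name says which filter of the other dataset receives them. The sketch below mimics that idea in memory. It is an assumption about the mechanism for illustration, not BioMart's implementation; the rows and the `link` helper are hypothetical, and only `uniprot_id` and `uniprot_ac` are taken from the slide.

```python
# Hypothetical rows; only the link names (uniprot_id, uniprot_ac) come from the slide.
DATASET_1 = [
    {"gene_stable_id": "G0001", "uniprot_ac": "P00001"},
    {"gene_stable_id": "G0002", "uniprot_ac": "P00002"},
    {"gene_stable_id": "G0003", "uniprot_ac": None},
]
DATASET_2 = [
    {"uniprot_ac": "P00001", "protein_name": "protein A"},
    {"uniprot_ac": "P00009", "protein_name": "protein Z"},
]

EXPORTABLE = {"name": "uniprot_id", "attributes": ["uniprot_ac"]}
IMPORTABLE = {"name": "uniprot_id", "filters": ["uniprot_ac"]}


def link(source, target, exportable, importable):
    """Keep target rows whose filter value was exported by the source dataset."""
    if exportable["name"] != importable["name"]:
        raise ValueError("exportable and importable names must match")
    exported_attr = exportable["attributes"][0]
    imported_filter = importable["filters"][0]
    # Values exported by the source dataset (rows without a value are skipped).
    exported = {row[exported_attr] for row in source
                if row[exported_attr] is not None}
    # The exported values act as a filter on the target dataset.
    return [row for row in target if row[imported_filter] in exported]


linked = link(DATASET_1, DATASET_2, EXPORTABLE, IMPORTABLE)
print(linked)
```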

  11. Dataset linking

  12. Third party software

  13. Ensembl

  14. GMOD

  15. biomaRt

  16. Distributed Annotation System

  17. Taverna

  18. Galaxy

  19. Examples

  20. Genomic data

  21. UniProt, MSD, ArrayExpress: proteomic, structure and expression data

  22. Model organism databases: genes, expression, phenotypes, variations, literature, ontologies, sequence

  23. Zebrafish models for human development and disease

  24. Central Server

  25. Behind closed doors ;)

  26. Target SNP selection for the study of one autoimmune disease, type 1 diabetes (T1D), and infectious diseases, malaria and dengue. Laboratory of Genetics of Infectious and Autoimmune Diseases

  27. Data conversion and integration: combined proprietary and public data
      Genetics of Infectious and Autoimmune Diseases, Pasteur Institute, INSERM U730, Paris, France.
      Proprietary data:
        Name   Fragment   Position   Alleles   Strand
        SNP1   AL139258   1659852    T/A       1
        SNP2   NT_25698   2569873    C/T       -1
        SNP3   chr13      1125698    C/G       1
      Public data: UCSC, HapMap, Diabetes-Gene Association DataBase, Ensembl, NCBI

  28. Output format: genome location; links to databases; overlaps with TFBS; Ensembl (dbSNP); location + predicted functional role (Ensembl, Vega, RefSeq, Acembly)

  29. Using the Molecular Integration Database to Answer CAPRISA’s Questions Research that contributes to understanding HIV pathogenesis and epidemiology as well as HIV/AIDS treatment and prevention

  30. How is the MID populated?
      (Diagram labels: Sequence & Humoral Immunity, Cellular Immunity, HLA Typing, Clinical Data, Sequence Related Pipeline, MID)

  31. CAPRISA

  32. What role for ‘Omics’?
      • Human study to evaluate Omics in assessing safety indicators
      • Study of skin inflammation in response to detergent
      • Skin samples taken and analyzed with multiple Omics techniques
        – Blood
        – Skin biopsy
        – Microdialysis

  33. System Data Flow
      (Diagram labels: data files, analysis files, CSV files; parsing, import, transformation, download; Import Interface; Generic Coral staging area; BioMarts; Oracle 9i database; BioMart Interface)
      • Requires an extensible file and metadata management system for omics data
      • Oracle 9i database used for staging area and BioMarts
      • Database indexes files on a separate file system

  34. Adding Annotation
      • Query Ensembl for details of genes measured or identified in experiments, e.g. GeneSpring Annotation
      • For example, we can link to Ensembl from Microarray Experiments by Gene ID
      (Diagram labels: Ensembl Mart, Microarray Mart; link on Entrez gene id)

  35. Four easy(?) steps

  36. Step 1: Transformation

  37. Step 2: Configuration

  38. Step 3: Query

  39. User interfaces

  40. Web service
      <Query virtualSchemaName="default" count="0">
        <Dataset name="hsapiens_gene_ensembl">
          <Attribute name="gene_stable_id"/>
          <Filter name="chr_name" value="22"/>
        </Dataset>
        <Dataset name="uniprot">
          <Attribute name="accession"/>
          <Filter name="pfam" value="only"/>
        </Dataset>
      </Query>
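
      A query document like the one on this slide can also be assembled programmatically rather than typed by hand. The following is a minimal sketch using only the Python standard library; the `build_query` helper and its argument layout are hypothetical, and the dataset, attribute and filter names are simply those shown on the slide. Sending the resulting XML to a BioMart web service is left out, since the endpoint depends on the installation.

```python
import xml.etree.ElementTree as ET


def build_query(datasets, virtual_schema="default", count="0"):
    """Build a BioMart-style query document.

    `datasets` is a list of (dataset_name, attribute_names, filters) tuples,
    where `filters` is a list of (filter_name, value) pairs.
    """
    query = ET.Element("Query", virtualSchemaName=virtual_schema, count=count)
    for dataset_name, attributes, filters in datasets:
        dataset = ET.SubElement(query, "Dataset", name=dataset_name)
        for attribute in attributes:
            ET.SubElement(dataset, "Attribute", name=attribute)
        for filter_name, value in filters:
            ET.SubElement(dataset, "Filter", name=filter_name, value=value)
    return ET.tostring(query, encoding="unicode")


# Same two linked datasets as on the slide.
xml_query = build_query([
    ("hsapiens_gene_ensembl", ["gene_stable_id"], [("chr_name", "22")]),
    ("uniprot", ["accession"], [("pfam", "only")]),
])
print(xml_query)
```

      Building the document through an XML library, instead of string concatenation, keeps quoting and escaping of filter values correct.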

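  41. API
      use BioMart::Initializer;
      use BioMart::Query;
      use BioMart::QueryRunner;
      # $confFile holds the path to the registry file
      my $initializer = BioMart::Initializer->new('registryFile' => $confFile);
      my $registry = $initializer->getRegistry();
      $registry->configure();
      # The slide uses $query without showing how it is created; this line is an assumption
      my $query = BioMart::Query->new('registry' => $registry, 'virtualSchemaName' => 'default');
      # First dataset: Ensembl gene IDs on chromosome 1
      $query->addAttribute('hsapiens_gene_ensembl', 'ensembl_gene_id');
      $query->addFilter('hsapiens_gene_ensembl', 'chromosome_name', ['1']);
      # Second dataset: UniProt accessions (filter as given on the slide)
      $query->addAttribute('uniprot', 'accession');
      $query->addFilter('uniprot', 'chromosome_name', ['1']);
      $query->formatter('HTML');
      # Run the query and print the formatted results
      my $runner = BioMart::QueryRunner->new();
      $runner->execute($query);
      $runner->printResults();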

  42. Step 4: Ask for a pay rise :)

  43. Summary
      • A generic data management system
      • Provides building blocks for designing your own ‘tailor-made’ data management
        – A set of easily configurable user interfaces
        – Distributed data federation
        – Query optimization
      • Easy to install and manage
        – A project for bioinformatics students
      • Open source software
        – No restrictions for academic or commercial users

  44. Credits
      • BioMart
        – Syed Haider
        – Richard Holland
        – Damian Smedley
        – Gudmundur Thorisson
      • Contributors
        – Steffen Durinck (NCI, NIH)
        – Eric Just (Northwestern University)
        – Don Gilbert (Indiana University)
        – Darin London (Duke University)
        – Will Spooner (CSHL)
        – Benoit Ballester (Universite de la Mediterranee)
        – James Smith (Ensembl)
        – Arne Stabenau (Ensembl)
        – Andreas Kahari (Ensembl)
        – Craig Melsopp (Ensembl)
        – Katerina Tzouvara (EBI)
        – Paul Donlon (Unilever)
