nebc database course 2008 biological databases online web
play

NEBC Database Course 2008 Biological Databases Online Web-Based - PowerPoint PPT Presentation

NEBC Database Course 2008 Biological Databases Online Web-Based Tools (practical) Tim Booth : tbooth@ceh.ac.uk Introduction To EnsEMBL EnsEMBL is a project to develop a software system which produces and maintains automatic annotation on


  1. NEBC Database Course 2008 Biological Databases Online Web-Based Tools (practical) Tim Booth : tbooth@ceh.ac.uk

  2. Introduction To EnsEMBL EnsEMBL is a project to develop a software system which ● produces and maintains automatic annotation on metazoan genomes. EnsEMBL is based on a large MySQL database system. ● Take a look at the website: ● http://www.ensembl.org/ Suggested exercise: Seek out the gene coding for the human p53 tumour suppressor protein, which is located on chromosome 17. Find the location of the predicted orthologue for the rat.

  3. Batch Queries via Martview Go to http://www.ensembl.org/biomart/martview ● Using the Vega 32 Homo sapiens dataset, ask to see all genes ● on the first chromosome, within the first 100 kilobases, for which there is a UniProt/Swissprot id. How many entries pass the filter? (should be 12) ● Generate an HTML report with the default settings. How many ● lines do you see? Go back and ask to add the InterPro descriptions to the report. ● Now how many lines come back? Can you account for the discrepancy, based on your knowledge ● of how queries might be constructed on the underlying database?

  4. Direct ENSEMBL Access In a terminal type: ● mysql -h ensembldb.ensembl.org -u anonymous ...you should now be logged into EnsEMBL... show databases; use rattus_norvegicus_core_47_34q; show tables; There are a lot of tables in the core database. The following ● slides show some important parts. Don't be phased by the complexity - the point is just to ● appreciate what a major database can look like. It will be discussed later why this complexity arises and how we can deal meaningfully with it.

  5. EnsEMBL Sequence Regions assembly dna 0..n 0..1 asm_seq_region_id int seq_region_id int 0..n cmp_seq_region_id int sequence mediumtext asm_start int asm_end int dnac 0..1 cmp_start int seq_region_id int 1 cmp_end int 0..1 sequence mediumblob ori int seq_region 1 n_line text 0..1 seq_region_id int name varchar 1…n 1 coord_system_id int length int coord_system 1 1 coord_system_id int name varchar 1 attrib_type version varchar 0..n attrib_type_id int “default_version”, attrib “sequence_level” seq_region_attrib code varchar rank int name varchar seq_region_id int description text attrib_type_id int 0..n value varchar

  6. EnsEMBL Genes and Transcripts gene_description gene_stable_id 0..1 0..1 gene_id int gene_id int description text stable_id varchar version int 1 1 transcript_stable_id 1 0..1 gene transcript_id int stable_id varchar translation_stable_id gene_id int 0..1 version int type varchar translation_id int analysis_id int stable_id varchar 1 seq_region_id int version int seq_region_start int 1 transcript 1 seq_region_end int transcript_id int translation 1..n seq_region_strand tinyint gene_id int translation_id int display_xref_id int 0..1 seq_region_id int transcript_id int 1 seq_region_start int seq_start int 0..n seq_region_end int start_exon_id int 1 1 seq_region_strand tinyint seq_end int exon display_xref_id int end_exon_id int exon_id int 1 0..n seq_region_id int seq_region_start int exon_stable_id exon_transcript seq_region_end int seq_region_strand tinyint exon_id int exon_id int 0..1 0..n phase tinyint stable_id varchar transcript_id int 0..n end_phase tinyint version int rank int

  7. EnsEMBL External References xref 1 xref_id int external_db external_db_id int external_db_id int 1..n dbprimary_acc varchar dbname varchar display_label varchar release varchar version varchar “KNOWN”, “PRED”, status “ORTH”,… description varchar 1 1 0..n 1 identity_xref 1 object_xref external_synonym object_xref_id int object_xref_id int xref_id int 0..1 query_identity int ensembl_id int synonym varchar target_identity int “Translation”, “Gene”, ensembl_object_type hit_start int “Transcript” xref_id int hit_end int 1..n translation_start int translation_end int cigar_line text go_xref score double object_xref_id int 0..1 evalue double “IC”,”IDA”,”IEA”,”IEP”,”I linkage_type GI”,”IMP,”IPI”,… analysis_id int

  8. Complex Database = Nasty Query select gene.seq_region_id, gene_stable_id.stable_id, transcript_stable_id.stable_id, xref.display_label from external_db inner join xref on external_db.external_db_id = xref.external_db_id inner join object_xref on xref.xref_id = object_xref.xref_id inner join translation on object_xref.ensembl_id = translation.translation_id inner join transcript on translation.transcript_id = transcript.transcript_id inner join gene on gene.gene_id = transcript.gene_id inner join seq_region on gene.seq_region_id = seq_region.seq_region_id inner join coord_system on seq_region.coord_system_id = coord_system.coord_system_id left outer join gene_stable_id on gene_stable_id.gene_id = gene.gene_id left outer join transcript_stable_id on transcript.transcript_id = transcript_stable_id.transcript_id where coord_system.name = 'chromosome' and seq_region.name = '1' and gene.seq_region_end < 150000 and external_db.db_name = 'Uniprot/SWISSPROT';

  9. Simplifying things The previous query will actually retrieve the same results as ● the MartView search, though this is hardly obvious. Making a friendly interface is one way to simplify the use of ● your database. We will look into these tomorrow, along with views, ● denormalised tables and programming APIs. For now, let's look at a simpler online database... ●

  10. A Look at The GeneSwytch Microsatellite Database The GeneSwytch database is held at CEH Oxford. ● Go to http://www.genomics.ceh.ac.uk/GeneSwytch/all.php ● Try some of the standard queries, such as viewing ● 'Haemophilus genomes'. Now run each of the following: ● describe table bac_tax_genomes – select * from bac_tax_genomes limit 20 – describe table bac_rep_genomes – select repeat_name, motif, repeat_type from bac_rep_genomes where – genome_id = 1 select genus, repeat_name, motif, repeat type from (bac_rep_genomes r – inner join bac_tax_genomes t on t.genome_id = r.genome_id) where t.genome_id in (1,31)

  11. A Few More Examples The following are also good examples of database-powered ● sites. Unfortunately, none of them provide the same level of access or technical documentation as EnsEMBL: www.hapmap.org : International human haplotyping project. ● Note link to 'HapMart' search tool. www.plasmodb.org : Similar to EnsEMBL, but specific to ● plasmodium genomes ftp://ftp.ncbi.nih.gov/pub/taxonomy/ : The contents of the NCBI ● taxonomy database can be downloaded for local use. A suitable template database to store this data can be found at http://www.biosql.org

  12. Credits and Notes Technical documentation on EnsEMBL can be located here: ● http://www.ensembl.org/Docs/wiki/html/EnsemblDocs/EnsemblCore.html The baculovirus microsatellite database was created by Milo ● Thurston, Paul Swift and Dawn Field. Thanks to Man-Suen Chan for website suggestions. ●

Recommend


More recommend