EGTDC Database Course 2004 Biological Databases Online (practical) Tim Booth : tbooth@ceh.ac.uk Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Introduction To EnsEMBL EnsEMBL is a project to develop a software system which produces and ● maintains automatic annotation on metazoan genomes. EnsEMBL is based on a large MySQL database system. ● Take a look at the website: ● http://www.ensembl.org/ Suggested exercise: Seek out the gene coding for the human p53 tumour suppressor protein, which is located on chromosome 17. Find the location of the predicted orthologue for the rat. Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Batch Queries via EnsMart Go to http://www.ensembl.org/Multi/martview ● Ask to see all genes on the first chromosome, within the first 30 ● megabases, for which there is a SwissProt id. How many entries pass the filter? (should be 13) ● Generate an HTML report with the default settings. How many lines do ● you see? Go back and ask to add the InterPro descriptions to the report. Now how ● many lines come back? Can you account for the discrepancy, based on your knowledge of how ● queries might be constructed on the underlying database? Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Direct ENSEMBL Access In a terminal type: ● mysql -h ensembldb.ensembl.org -u anonymous ...you should now be logged into EnsEMBL... show databases; use rattus_norvegicus_core_20_3b; show tables; There are a lot of tables in the core database. The following slides show ● some important parts. Don't be phased by the complexity - the point is just to appreciate what a ● major database can look like. It will be discussed later why this complexity arises and how we can deal meaningfully with it. Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
EnsEMBL Sequence Regions assembly dna 0..n 0..1 asm_seq_region_id int seq_region_id int 0..n cmp_seq_region_id int sequence mediumtext asm_start int asm_end int dnac 0..1 cmp_start int 1 seq_region_id int cmp_end int 0..1 sequence mediumblob ori int seq_region 1 n_line text 0..1 seq_region_id int name varchar 1…n 1 coord_system_id int length int coord_system 1 1 coord_system_id int name varchar 1 attrib_type version varchar 0..n attrib_type_id int “default_version”, attrib “sequence_level” seq_region_attrib code varchar rank int name varchar seq_region_id int description text attrib_type_id int 0..n value varchar Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
EnsEMBL Genes and Transcripts gene_description gene_stable_id 0..1 0..1 gene_id int gene_id int description text stable_id varchar version int 1 1 transcript_stable_id 1 0..1 gene transcript_id int stable_id varchar translation_stable_id gene_id int 0..1 version int type varchar translation_id int analysis_id int stable_id varchar 1 seq_region_id int version int seq_region_start int 1 transcript 1 seq_region_end int transcript_id int translation 1..n seq_region_strand tinyint gene_id int translation_id int display_xref_id int 0..1 seq_region_id int transcript_id int 1 seq_region_start int seq_start int 0..n seq_region_end int start_exon_id int 1 1 seq_region_strand tinyint seq_end int exon display_xref_id int end_exon_id int exon_id int 1 0..n seq_region_id int seq_region_start int exon_stable_id exon_transcript seq_region_end int seq_region_strand tinyint exon_id int exon_id int Environmental Genomics Thematic Programme 0..1 0..n Data Centre phase tinyint stable_id varchar transcript_id int 0..n http://envgen.nox.ac.uk end_phase tinyint version int rank int
EnsEMBL External References xref 1 xref_id int external_db external_db_id int external_db_id int 1..n dbprimary_acc varchar dbname varchar display_label varchar release varchar version varchar “KNOWN”, “PRED”, status “ORTH”,… description varchar 1 1 0..n 1 1 identity_xref object_xref external_synonym object_xref_id int object_xref_id int xref_id int 0..1 query_identity int ensembl_id int synonym varchar target_identity int “Translation”, “Gene”, ensembl_object_type “Transcript” hit_start int xref_id int hit_end int 1..n translation_start int translation_end int cigar_line text go_xref score double object_xref_id int 0..1 evalue double “IC”,”IDA”,”IEA”,”IEP”,”I linkage_type GI”,”IMP,”IPI”,… analysis_id int Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Complex Database -> Nasty Query select gene.seq_region_id, gene_stable_id.stable_id, transcript_stable_id.stable_id, xref.display_label from external_db inner join xref on external_db.external_db_id = xref.external_db_id inner join object_xref on xref.xref_id = object_xref.xref_id inner join translation on object_xref.ensembl_id = translation.translation_id inner join transcript on translation.transcript_id = transcript.transcript_id inner join gene on gene.gene_id = transcript.gene_id inner join seq_region on gene.seq_region_id = seq_region.seq_region_id inner join coord_system on seq_region.coord_system_id = coord_system.coord_system_id left outer join gene_stable_id on gene_stable_id.gene_id = gene.gene_id left outer join transcript_stable_id on transcript.transcript_id = transcript_stable_id.transcript_id where coord_system.name = 'chromosome' and seq_region.name = '1' and gene.seq_region_end < 30000000 Environmental Genomics Thematic Programme and external_db.db_name = 'SWISSPROT'; Data Centre http://envgen.nox.ac.uk
Simplifying things The previous query will actually retrieve the same results as the ● EnsMART search, though this is hardly obvious. Making a friendly interface is one way to simplify the use of your ● database. We will look into these tomorrow, along with views, denormalised tables ● and programming APIs. For now, let's look at a simpler online database... ● Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
A Look at The Baculovirus Microsatellite Database This database was created by the Msatminer software, and is held at CEH ● Oxford. Go to http://www.genomics.ceh.ac.uk/cgi-bin/baculo/msatviewer.cgi ● Try some of the standard queries, such as viewing 'all tris'. ● Now run each of the following: ● – show tables – select * from baculo_genomes – describe repeats – select repeat_name, motif, genome from repeats where repeat_type = 'penta' – select organism, genome, count(*) as count from repeats where repeat_type = 'penta' Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
A Few More Examples The following are also good examples of database-powered sites. ● Unfortunately, none of them provide the same level of access or technical documentation as EnsEMBL: www.hapmap.org : International human haplotyping project ● www.plasmodb.org : Similar to EnsEMBL, but specific to plasmodium ● genomes ftp://ftp.ncbi.nih.gov/pub/taxonomy/ : The contents of the NCBI ● taxonomy database can be downloaded for local use. Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Credits and Notes Technical documentation on ENSEMBL can be located here: ● http://www.ensembl.org/Docs/wiki/html/EnsemblDocs/EnsemblCore.html The baculovirus microsatellite database was created by Milo Thurston ● and Dawn Field. Thanks to Man-Suen Chan for website suggestions. ● Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Recommend
More recommend