Bioinformatics Databases
Introduction to Bioinformatics
Dortmund, 16.-20.07.2007
Lectures: Sven Rahmann
Exercises: Udo Feldkamp, Michael Wurst
Overview
  ● Databases at NCBI (via Entrez)
  ● DNA
    – GenBank, EMBL, DDBJ
    – Data Format Issues
  ● UCSC Genome Browser
  ● Protein
    – SwissProt, PIR, PDB
  ● Sequence Retrieval System at EBI
Fundamentals
  ● Accession number :=
    – unique identifier for each entry (“record”) in a DB
    – Example: PubMed ID [PMID]
    – If you know the accession number, you obtain the record without searching
    – Different databases can be linked via accession numbers
    – Data integration: Hide the details (accession numbers) behind a convenient interface
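The point that a known accession number yields the record without searching can be illustrated in code. This is a minimal sketch, assuming NCBI's E-utilities `efetch` service and the database name `nuccore`, neither of which appears on the slides; the helper name `efetch_url` is made up for illustration. The code only builds the request URL and does not access the network.

```python
from urllib.parse import urlencode

# Assumed base URL of NCBI's E-utilities efetch service (not from the slides).
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore", rettype="gb"):
    """Build a URL that requests a single record by its accession number."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return EFETCH_BASE + "?" + urlencode(params)


# K03160 is the accession number of the GenBank example record in these slides.
url = efetch_url("K03160")
print(url)
```

Passing the resulting URL to `urllib.request.urlopen` should return the record as plain text, provided the service is reachable and still accepts these parameters.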
Databases at NCBI (2007)
  http://www.ncbi.nlm.nih.gov/
Different Databases
  ● DNA
    – nucleotide sequence
    – gene
    – transcript / gene expression
    – genome
  ● Protein
    – sequence and annotation
    – structure
  ● ...
Different Databases
  ● Repositories of primary sequence data
    – Everything related to a topic goes in here
    – GenBank (NCBI Nucleotide): all nucleotide sequences
  ● Machine-curated annotation data
    – automatically generated from primary data
    – quality depends on primary data and method
  ● Manually curated annotation data
    – reviewed by experts (SwissProt – Amos Bairoch)
    – high quality, slow to grow
Integration
  ● “Meta Search Engines”
    – Entrez at NCBI (U.S.)
    – SRS at EBI (Europe)
  ● Value comes from linking databases
  ● Accession numbers provide unique identifiers
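Linking databases through accession numbers amounts to following an identifier stored in one record into another database. The sketch below is a toy illustration only: both "databases" are small in-memory dictionaries, the function `follow_link` is hypothetical, and the literature identifier is a placeholder, not a real PubMed ID. The accession number and journal reference are taken from the GenBank example record in these slides.

```python
# Toy in-memory "databases"; the literature key is a placeholder, not a real PMID.
nucleotide_db = {
    "K03160": {"description": "5S ribosomal RNA", "pubmed": "PMID-PLACEHOLDER"},
}
literature_db = {
    "PMID-PLACEHOLDER": {"journal": "Nucleic Acids Res. 11, 2871-2880 (1983)"},
}


def follow_link(accession, source_db, link_field, target_db):
    """Look up a record, then follow one cross-reference into another database."""
    record = source_db[accession]      # direct lookup by accession, no search
    linked_id = record[link_field]     # identifier of the linked record
    return target_db[linked_id]


article = follow_link("K03160", nucleotide_db, "pubmed", literature_db)
print(article["journal"])
```

An integrated interface such as Entrez or SRS hides this kind of identifier lookup from the user.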
Security
  ● Assume that everything you send over the internet can be intercepted.
  ● Don't send confidential data, patent data, etc.
  ● As of 2007, none of the public databases supports encryption.
Searching Entrez
Nucleotide Results
Core Nucleotide DB
DNA / Nucleotide DBs
  ● International Nucleotide Sequence Database Collaboration (INSDC)
    – The member databases (GenBank, EMBL, DDBJ) hold the same content
    – GenBank = NCBI Nucleotide
File Formats: GenBank

LOCUS       AAURRA        118 bp ss-rRNA            RNA       16-JUN-1986
DEFINITION  A.auricula-judae (mushroom) 5S ribosomal RNA.
ACCESSION   K03160
VERSION     K03160.1  GI:173593
KEYWORDS    5S ribosomal RNA; ribosomal RNA.
SOURCE      A.auricula-judae (mushroom) ribosomal RNA.
  ORGANISM  Auricularia auricula-judae
            Eukaryota; Fungi; Eumycota; Basidiomycotina;
            Phragmobasidiomycetes; Heterobasidiomycetidae; Auriculariales;
            Auriculariaceae.
REFERENCE   1  (bases 1 to 118)
  AUTHORS   Huysmans,E., Dams,E., Vandenberghe,A. and De Wachter,R.
  TITLE     The nucleotide sequences of the 5S rRNAs of four mushrooms and
            their use in studying the phylogenetic position of
            basidiomycetes among the eukaryotes
  JOURNAL   Nucleic Acids Res. 11, 2871-2880 (1983)
FEATURES             Location/Qualifiers
     rRNA            1..118
                     /note="5S ribosomal RNA"
BASE COUNT       27 a     34 c     34 g     23 t
ORIGIN      5' end of mature rRNA.
        1 atccacggcc ataggactct gaaagcactg catcccgtcc gatctgcaaa gttaaccaga
       61 gtaccgccca gttagtacca cggtggggga ccacgcggga atcctgggtg ctgtggtt
//
LOCUS       ABCRRAA       118 bp ss-rRNA            RNA       15-SEP-1990
...
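The line-oriented layout of a GenBank record (a keyword at the start of the line, sequence lines after ORIGIN, and "//" as the record terminator) can be read with very little code. The following is a minimal sketch, not a complete GenBank parser: the function name `parse_genbank_record` is made up for illustration, it handles only single-line DEFINITION and ACCESSION fields, and the sample is an abridged copy of the record shown above.

```python
# Abridged copy of the slide's example record (several header fields omitted).
SAMPLE = """\
LOCUS       AAURRA        118 bp ss-rRNA            RNA       16-JUN-1986
DEFINITION  A.auricula-judae (mushroom) 5S ribosomal RNA.
ACCESSION   K03160
ORIGIN      5' end of mature rRNA.
        1 atccacggcc ataggactct gaaagcactg catcccgtcc gatctgcaaa gttaaccaga
       61 gtaccgccca gttagtacca cggtggggga ccacgcggga atcctgggtg ctgtggtt
//
"""


def parse_genbank_record(text):
    """Extract accession, definition, and sequence from one GenBank record."""
    accession = None
    definition = None
    seq_parts = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # "//" terminates the record
        if in_origin:
            # Sequence line: a position number followed by blocks of bases.
            seq_parts.extend(line.split()[1:])
        elif line.startswith("ACCESSION"):
            accession = line.split()[1]
        elif line.startswith("DEFINITION"):
            definition = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):
            in_origin = True  # everything up to "//" is sequence data
    return {
        "accession": accession,
        "definition": definition,
        "sequence": "".join(seq_parts),
    }


record = parse_genbank_record(SAMPLE)
print(record["accession"], len(record["sequence"]))
```

For real work, an established parser such as the one in Biopython is likely a better choice, since actual records contain multi-line fields and feature tables that this sketch ignores.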
File Formats: FASTA

>gi|173593|gb|K03160.1|AAURRA Auricula auricula-judae 5S ribosomal RNA
ATCCACGGCCATAGGACTCTGAAAGCACTGCATCCCGTCCGATCTGCAA
AGTTAACCAGAGTACCGCCCAGTTAGTACCACGGTGGGGGACCACGCG
GGAATCCTGGGTGCTGTGGTT
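FASTA is simpler than the GenBank format: a header line starting with ">" is followed by one or more sequence lines. The sketch below shows one way to read such a file; the function name `parse_fasta` is made up for illustration, and splitting the header on "|" assumes the NCBI-style header layout used on the slide.

```python
# The FASTA example from the slide.
SAMPLE_FASTA = """\
>gi|173593|gb|K03160.1|AAURRA Auricula auricula-judae 5S ribosomal RNA
ATCCACGGCCATAGGACTCTGAAAGCACTGCATCCCGTCCGATCTGCAA
AGTTAACCAGAGTACCGCCCAGTTAGTACCACGGTGGGGGACCACGCG
GGAATCCTGGGTGCTGTGGTT
"""


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


fasta_records = parse_fasta(SAMPLE_FASTA)
header, sequence = fasta_records[0]
# In this header layout the fields are separated by "|":
# "gi", GI number, database tag, accession.version, locus name and description.
fields = header.split("|")
print(fields[3], len(sequence))
```

The sequence read here has the same length (118) as the one in the GenBank record, since both formats describe the same entry.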
Sequence Retrieval System (SRS)
  ● URL: http://srs.ebi.ac.uk/
Selecting Libraries (DBs) to Search
Standard Query Form
UCSC Genome Browser
  ● Portal to ENCODE (Encyclopedia of DNA Elements): functional annotation of the human genome
Protein: UniProt / SwissProt
  ● URL: http://expasy.org/sprot/
    – SwissProt: manually curated
    – TrEMBL: annotated automatically
Protein Structure: (WW)PDB
  ● http://www.wwpdb.org/