Overview of current biological databases



  1. Overview of current biological databases Qi Sun Computational Biology Service Unit Cornell University

  2. Platforms for Bioinformatics: Web Server, Database Server; protocols and interfaces: HTTP, SOAP, FTP, SQL

  3. Platforms for Bioinformatics: Open source (Linux, Apache, MySQL, Perl/Python/PHP) vs Microsoft (Windows, ASP.NET, SQL Server, C#)

  4. Public Database - 1 NCBI Sequence Data Model Archival database (GenBank, GenPept) vs Computer algorithm generated database (Unigene) vs Manually curated database (RefSeq)

  5. The NCBI Data Model: GenBank, a DNA-centered database

  6. Identifier: 1. LOCUS (obsolete) 2. Accession (version) 3. GI
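
The "Accession (version)" identifier can be illustrated with a minimal sketch that splits an `ACCESSION.VERSION` string into its two parts. The function name `split_accession` and the `.2` version suffix in the example are assumptions made for illustration; they do not come from the slides.

```python
from typing import Optional, Tuple


def split_accession(identifier: str) -> Tuple[str, Optional[int]]:
    """Split 'ACCESSION.VERSION' into (accession, version).

    Returns (identifier, None) when no numeric version suffix is present.
    """
    cleaned = identifier.strip()
    accession, dot, version = cleaned.partition(".")
    if dot and version.isdigit():
        return accession, int(version)
    return cleaned, None


# Hypothetical example: the ".2" suffix is made up for illustration.
print(split_accession("NM_123456.2"))
print(split_accession("NM_123456"))
```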

  7. Features

  8. GenPept: a protein-centered database

  9. FTP sites: GenBank: ftp://ftp.ncbi.nih.gov/genbank/ GenPept: ftp://ftp.ncifcrf.gov/pub/genpept/

  10. Problems with GenBank and GenPept • They do not distinguish between sequence categories. • There is a lot of redundancy. • The same gene can be deposited into the database many times under different names. • Different versions of the same gene can be submitted many times with different accession numbers. • The features of a GenBank record can be chaotic.

  11. Public Database - 1 NCBI Sequence Databases Archival database (GenBank, GenPept) vs Computer algorithm generated database (Unigene) vs Curated database (RefSeq, Locuslink ...)

  12. UniGene: a non-redundant set of gene-oriented clusters, built from GenBank mRNAs, GenBank genomic CDSs, and dbEST ESTs

  13. Unigene identifier: Hs for human, Mm for mouse, Rn for rat, Bt for cow, Dr for zebrafish, Dm for fruitfly, Aga for mosquito, Xl for frog, At for cress, Hv for barley, Os for rice, Ta for wheat, Zm for maize. Examples: Mm.213407, Hs.13303, At.138
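
The identifier scheme on this slide (species prefix, a dot, then a cluster number) can be sketched as a small parser. The prefix table is copied from the slide; the function name `parse_unigene_id` is an assumption made for illustration, and prefixes not listed on the slide are rejected.

```python
# Species prefixes as listed on the slide.
SPECIES_BY_PREFIX = {
    "Hs": "human", "Mm": "mouse", "Rn": "rat", "Bt": "cow",
    "Dr": "zebrafish", "Dm": "fruitfly", "Aga": "mosquito",
    "Xl": "frog", "At": "cress", "Hv": "barley", "Os": "rice",
    "Ta": "wheat", "Zm": "maize",
}


def parse_unigene_id(cluster_id):
    """Return (species, cluster_number) for an ID such as 'Hs.13303'."""
    prefix, dot, number = cluster_id.strip().partition(".")
    if not dot or not number.isdigit() or prefix not in SPECIES_BY_PREFIX:
        raise ValueError(f"not a recognised Unigene identifier: {cluster_id!r}")
    return SPECIES_BY_PREFIX[prefix], int(number)


for example in ("Mm.213407", "Hs.13303", "At.138"):
    print(example, parse_unigene_id(example))
```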

  14. Public Database - 1 NCBI Sequence Databases Archival database (GenBank, GenPept) vs Computer generated database (Unigene) vs Curated database (RefSeq, Gene ...)

  15. NCBI human genome annotation pipeline: RefSeq incorporates predicted transcript and protein sequences, experimentally identified mRNA sequences, and EST sequences.

  16. Refseq Accession Numbers: NT_123456 constructed genomic contigs; NM_123456 mRNAs; NP_123456 proteins; NC_123456 chromosomes; XM_123456 predicted mRNA; XP_123456 predicted protein
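
As a minimal sketch, the prefix table on this slide can be turned into a lookup that reports what kind of record an accession refers to. The function name `refseq_category` is an assumption made for illustration, and only the six prefixes shown on the slide are covered.

```python
# Prefix meanings as listed on the slide.
REFSEQ_PREFIXES = {
    "NT": "constructed genomic contig",
    "NM": "mRNA",
    "NP": "protein",
    "NC": "chromosome",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
}


def refseq_category(accession):
    """Return the record type for a RefSeq-style accession, else None."""
    prefix, underscore, _rest = accession.strip().partition("_")
    if not underscore:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix.upper())


print(refseq_category("NM_123456"))
print(refseq_category("XP_123456"))
print(refseq_category("AP33493"))
```

Accessions without the two-letter prefix and underscore, such as the GenBank example on the next slide, return `None`.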

  17. Refseq? Unigene? Genbank? Genome sequence available: Refseq (acc: NP_123456, etc.). EST sequence available: Unigene (acc: Hs.13303, etc.). Genbank (acc: AP33493, etc.)

  18. Go to the web

  19. Files that you can download from the NCBI gene database: gene_info, gene2refseq, gene2go
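
These downloads are tab-delimited text files. The sketch below groups GO IDs by gene from gene2go-style text. The column names in the sample header (`GeneID`, `GO_ID`, and the rest) are an assumption about the file layout and should be checked against the header of the file actually downloaded. The sample rows are placeholders, not real annotations.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows for illustration only; the header layout is assumed.
SAMPLE = (
    "#tax_id\tGeneID\tGO_ID\tEvidence\tQualifier\tGO_term\tPubMed\tCategory\n"
    "0\t1\tGO:0003674\tND\t-\tmolecular_function\t-\tFunction\n"
    "0\t1\tGO:0008150\tND\t-\tbiological_process\t-\tProcess\n"
    "0\t2\tGO:0005575\tND\t-\tcellular_component\t-\tComponent\n"
)


def go_ids_by_gene(handle):
    """Map each GeneID to the list of GO IDs found in a tab-delimited file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    header[0] = header[0].lstrip("#")  # header line starts with '#'
    gene_col = header.index("GeneID")
    go_col = header.index("GO_ID")
    result = defaultdict(list)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        result[row[gene_col]].append(row[go_col])
    return dict(result)


annotations = go_ids_by_gene(io.StringIO(SAMPLE))
print(annotations)
```

For a real download, the same function can be given an open file handle instead of the in-memory sample.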

  20. NCBI Search engine Entrez • boolean operators "AND", "OR", "NOT" • entrez tags • using limits • MeSH terms • Batch Entrez: search by accession list
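
To show how boolean operators and field tags combine, here is a minimal sketch that assembles an Entrez-style query string. The helper names (`term`, `all_of`, `any_of`, `exclude`) are hypothetical, and the tags `[Organism]` and `[Gene Name]` and the search terms are chosen only for illustration. The code builds text and does not contact NCBI.

```python
def term(text, field=None):
    """Format one search term, quoting phrases and appending a field tag."""
    quoted = f'"{text}"' if " " in text else text
    return f"{quoted}[{field}]" if field else quoted


def all_of(*terms):
    """Join terms with AND, wrapped in parentheses."""
    return "(" + " AND ".join(terms) + ")"


def any_of(*terms):
    """Join terms with OR, wrapped in parentheses."""
    return "(" + " OR ".join(terms) + ")"


def exclude(query, unwanted):
    """Append a NOT clause to an existing query."""
    return f"{query} NOT {unwanted}"


# Illustrative query: terms and tags are assumptions, not from the slides.
query = exclude(
    all_of(
        term("Homo sapiens", "Organism"),
        any_of(term("SMN1", "Gene Name"), term("SMN2", "Gene Name")),
    ),
    term("pseudogene"),
)
print(query)
```

Wrapping each AND or OR group in parentheses keeps the grouping explicit, so the result does not depend on how the search engine orders operators.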

  21. Other Sequence Databases: Genomic DNA: Ensembl genome annotation database (http://www.ensembl.org; HTTP, FTP, MySQL interface); Protein: Uniprot (http://www.pir.uniprot.org/)

  22. Go to the web: KEGG database

  23. Public Database - 2 GO Gene Ontology 1. Molecular Function 2. Biological Process 3. Cellular Component http://www.geneontology.org

  24. Public Database - 2

  25. Public Database - 2 GO (3673): Molecular Function (3674), Biological Process (8150), Cellular Component (5575)
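
The numbers on this slide are GO term numbers. In their usual written form they are zero-padded to seven digits after a `GO:` prefix, which the sketch below reproduces. The function name `format_go_id` is an assumption made for illustration.

```python
def format_go_id(number):
    """Format a GO term number as 'GO:' plus seven zero-padded digits."""
    if not 0 <= number <= 9_999_999:
        raise ValueError(f"GO term number out of range: {number}")
    return f"GO:{number:07d}"


# Term numbers as shown on the slide.
ROOT_TERMS = {
    "Molecular Function": 3674,
    "Biological Process": 8150,
    "Cellular Component": 5575,
}

for name, number in ROOT_TERMS.items():
    print(name, format_go_id(number))
```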

  26. GO Example 1: Biological Process

  27. GO Example 2: Molecular Function

  28. Gene Ontology Annotation: Smn (survival motor neuron), Gene ID: 39844

  29. Public Database - 4 Species Specific Databases • Arabidopsis – TAIR • Yeast – SGD • Fly – FLYBASE • Worm – WORMBASE • Mouse – MGD
