Proteomics databases and protein characterization tools




  1. Proteomics databases and protein characterization tools
     Marie-Claude.Blatter@ISB-SIB.ch
     EMBnet 2004: Proteomics using bioinformatic tools (MCB - 4/3/2004)
     Part I: Proteomics databases

  2. Proteomics databases
     1. Sequence databases: « The story of a protein sequence’s life »
     2. Swiss-Prot: a quick overview
     3. UniProt utilities: UniRef and UniParc
     4. Swiss-Prot … and the other protein databases

     Where do the protein sequences come from? How reliable are they? What do you have to watch out for?

  3. Real life of a protein sequence … (flow diagram)
     • Data not submitted to public databases, delayed or cancelled…
     • cDNAs, ESTs, genomes, … with or without annotated CDS: EMBL, GenBank, DDBJ
     • CoDing Sequences provided by submitters: TrEMBL, GenPept
     • CoDing Sequences provided by submitters and « de novo » gene prediction: RefSeq (XP_NNNNN)
     • Sequences derived from scientific publications: PRF, PIR
     • Manually annotated: Swiss-Prot
     • 3D structures
     • UniProt: Swiss-Prot + TrEMBL + (PIR)
     • NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF

     Let’s start at the very beginning…

  4. Real life of a protein sequence …
     • Data not submitted to public databases, delayed or cancelled…
     • cDNAs, ESTs, genomes, … with or without annotated CDS provided by authors: EMBL, GenBank, DDBJ
     • CDS (CoDing Sequence): portion of DNA/RNA translated into protein (from Met to STOP)

     EMBL/GenBank/DDBJ
     • The 3 main public nucleic acid sequence databases are EMBL (EBI) / GenBank (NCBI) / DDBJ (Japan): « different views of the same data set » (data exchanged within 2-3 days)
     • Contribution: EMBL 10 %; GenBank 73 %; DDBJ 17 %
     • EMBL: since 1982

  5. EMBL/GenBank/DDBJ
     • Serve as archives
     • Contain all public sequences derived from:
       – Genome projects (> 80 % of entries)
       – Sequencing centers (cDNAs, ESTs…)
       – Individual scientists (15 % of entries)
       – Patent offices (e.g. European Patent Office, EPO)
     • Currently: 30 x 10^6 sequences, ~36 x 10^9 bp
     • Sequences from > 50’000 different species

     The tremendous increase in nucleotide sequences (chart; series: Human, Mouse, Rat, Other)
     • 1980: 80 genes fully sequenced!
     • Human/Mouse/Rat: organisms with the highest redundancy!

  6. EMBL/GenBank/DDBJ
     A sort of sequence museum, where sequences are preserved for eternity as they were originally determined, interpreted and published by their authors (primary sequence repository).
     The authors have full authority over the content of the entries they submit! (Exception: TPA, since January 2003.)

     An EMBL entry: DNA (genomic) or RNA. Slide callouts mark the keyword (KW), taxonomy (OS/OC), references (RN to RL) and cross-references (DR) lines.

         ID   HSERPG     standard; genomic DNA; HUM; 3398 BP.
         XX
         AC   X02158;
         XX
         SV   X02158.1
         XX
         DT   13-JUN-1985 (Rel. 06, Created)
         DT   22-JUN-1993 (Rel. 36, Last updated, Version 2)
         XX
         DE   Human gene for erythropoietin
         XX
         KW   erythropoietin; glycoprotein hormone; hormone; signal peptide.
         XX
         OS   Homo sapiens (human)
         OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
         OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.
         XX
         RN   [1]
         RP   1-3398
         RX   MEDLINE; 85137899.
         RA   Jacobs K., Shoemaker C., Rudersdorf R., Neill S.D., Kaufman R.J.,
         RA   Mufson A., Seehra J., Jones S.S., Hewick R., Fritsch E.F., Kawakita M.,
         RA   Shimizu T., Miyake T.;
         RT   Isolation and characterization of genomic and cDNA clones of human
         RT   erythropoietin;
         RL   Nature 313:806-810(1985).
         XX
         DR   GDB; 119110; EPO.
         DR   GDB; 119615; TIMP1.
         DR   Swiss-Prot; P01588; EPO_HUMAN.
         XX
         …
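The line-code layout of the entry above (a two-letter code, then the content) lends itself to simple parsing. The following is a minimal sketch, not an official parser: the helper names `parse_embl_lines` and `cross_references` are hypothetical, the excerpt is trimmed from the entry shown on the slide, and the code assumes the content starts at column 6 of each line.

```python
from collections import defaultdict

# Trimmed excerpt of the EMBL entry shown on the slide (X02158).
ENTRY = """\
ID   HSERPG     standard; genomic DNA; HUM; 3398 BP.
XX
AC   X02158;
XX
DE   Human gene for erythropoietin
XX
KW   erythropoietin; glycoprotein hormone; hormone; signal peptide.
XX
DR   GDB; 119110; EPO.
DR   GDB; 119615; TIMP1.
DR   Swiss-Prot; P01588; EPO_HUMAN.
XX
"""


def parse_embl_lines(text):
    """Group line contents by their two-letter line code (hypothetical helper)."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, content = line[:2], line[5:].strip()
        if code == "XX" or not content:
            continue  # XX lines are empty spacer lines
        fields[code].append(content)
    return dict(fields)


def cross_references(fields):
    """Turn each DR line into a tuple of its semicolon-separated parts."""
    refs = []
    for dr in fields.get("DR", []):
        parts = [part.strip() for part in dr.rstrip(".").split(";")]
        refs.append(tuple(parts))
    return refs


fields = parse_embl_lines(ENTRY)
accession = fields["AC"][0].rstrip(";")
keywords = [k.strip() for k in fields["KW"][0].rstrip(".").split(";")]
swissprot = [ref for ref in cross_references(fields) if ref[0] == "Swiss-Prot"]
print(accession, keywords, swissprot)
```

The Swiss-Prot cross-reference recovered this way (P01588, EPO_HUMAN) is the link from the nucleotide archive entry to the curated protein entry.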

  7. An EMBL entry, continued: feature table and sequence. Slide callouts: the CDS (CoDing Sequence) is proposed by the submitters; the feature annotation is either predicted or experimentally determined; the SQ block holds the sequence.

         CC   Data kindly reviewed (24-FEB-1986) by K. Jacobs
         FH   Key             Location/Qualifiers
         FH
         FT   source          1..3398
         FT                   /db_xref=taxon:9606
         FT                   /organism=Homo sapiens
         FT   mRNA            join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)
         FT   CDS             join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)
         FT                   /db_xref=SWISS-PROT:P01588
         FT                   /product=erythropoietin
         FT                   /protein_id=CAA26095.1
         FT                   /translation=MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLLE
         FT                   AKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVLRG
         FT                   QALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTITAD
         FT                   TFRKLFRVYSNFLRGKLKLYTGEACRTGDR
         FT   mat_peptide     join(1262..1339,1596..1682,2294..2473,2608..2763)
         FT                   /product=erythropoietin
         FT   sig_peptide     join(615..627,1194..1261)
         FT   exon            397..627
         FT                   /number=1
         FT   intron          628..1193
         FT                   /number=1
         FT   exon            1194..1339
         FT                   /number=2
         FT   intron          1340..1595
         FT                   /number=2
         FT   exon            1596..1682
         FT                   /number=3
         FT   intron          1683..2293
         FT                   /number=3
         FT   exon            2294..2473
         FT                   /number=4
         FT   intron          2474..2607
         FT                   /number=4
         FT   exon            2608..3327
         FT                   /note=3' untranslated region
         FT                   /number=5
         XX
         SQ   Sequence 3398 BP; 698 A; 1034 C; 991 G; 675 T; 0 other;
              agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag        60
              tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat       120
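The `join(...)` locations in the feature table above can be checked with a little arithmetic. Here is a minimal sketch, assuming 1-based inclusive coordinates and handling only the plain `a..b` and `join(...)` forms that appear in this entry; the function names `parse_location` and `spliced_length` are hypothetical.

```python
import re


def parse_location(location):
    """Return (start, end) pairs from an 'a..b' or 'join(a..b,c..d,...)' string.

    Only the simple forms used in the entry above are handled; other
    location operators are outside the scope of this sketch.
    """
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def spliced_length(location):
    """Total number of bases covered (coordinates are 1-based and inclusive)."""
    return sum(end - start + 1 for start, end in parse_location(location))


CDS = "join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)"
SIG_PEPTIDE = "join(615..627,1194..1261)"
MAT_PEPTIDE = "join(1262..1339,1596..1682,2294..2473,2608..2763)"

cds_len = spliced_length(CDS)
sig_len = spliced_length(SIG_PEPTIDE)
mat_len = spliced_length(MAT_PEPTIDE)

# The CDS runs from Met to STOP, so the stop codon is counted in the CDS
# length but contributes no residue to the translated protein.
protein_len = cds_len // 3 - 1

print(cds_len, sig_len, mat_len, protein_len)  # 582 81 501 193
```

The signal peptide and mature peptide segments add up to the full CDS (81 + 501 = 582 bases), and 582 / 3 - 1 = 193 residues agrees with the length of the `/translation` qualifier in the entry.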

  8. A CDS feature as provided by the submitters:

         FT   CDS             complement(45959..47332)
         FT                   /db_xref="SPTREMBL:Q9UZ71"
         FT                   /note="PAB2386"
         FT                   /transl_table=11
         FT                   /product=" 4-AMINOBUTYRATE qui se dilate AMINOTRANSFERASE
         FT                   (EC 2.6.1.19)"
         FT                   /protein_id="CAB50188.1"
         FT                   /translation="MDYPRIVVNPPGPKAKELIEREKRVLSTGIGVKLFPLVPKRGFGP
         FT                   FIEDVDGNVFIDFLAGAAAASTGYSHPKLVKAVKEQVELIQHSMIGYTHSERAIRVAEK
         FT                   LVKISPIKNSKVLFGLSGSDAVDMAIKVSKFSTRRPWILAFIGAYHGQTLGATSVASFQ
         FT                   VSQKRGYSPLMPNVFWVPYPNPYRNPWGINGYEEPQELVNRVVEYLEDYVFSHVVPPDE
         FT                   VAAFFAEPIQGDAGIVVPPENFFKELKKLLDEHGILLVMDEVQTGIGRTGKWFASEWFE
         FT                   VKPDMIIFGKGVASGMGLSGVIGREDIMDITSGSALLTPAANPVISAAADATLEIIEEE
         FT                   NLLKNAIEVGSFIMKRLNELKEQFDIIGDVRGKGLMIGVEIVKENGRPDPEMTGKICWR
         FT                   AFELGLILPSYGMFGNVIRITPPLVLTKEVAEKGLEIIEKAIKDAIAGKVERKVVTWH"
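A `complement(a..b)` location, as in the feature above, means the coding sequence lies on the strand opposite to the one stored in the entry, so it has to be reverse-complemented before translation. The sketch below illustrates this under simplifying assumptions: only `a..b` and `complement(a..b)` are supported, the helper names are hypothetical, and `toy_sequence` is an invented ten-base string, not data from the entry.

```python
import re

_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def parse_simple_location(location):
    """Return (start, end, on_reverse_strand) for 'a..b' or 'complement(a..b)'."""
    match = re.fullmatch(r"complement\((\d+)\.\.(\d+)\)", location)
    if match:
        return int(match.group(1)), int(match.group(2)), True
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location)
    if match:
        return int(match.group(1)), int(match.group(2)), False
    raise ValueError(f"unsupported location: {location!r}")


def extract_feature(sequence, location):
    """Cut the feature out of the stored strand, flipping it if needed."""
    start, end, on_reverse = parse_simple_location(location)
    segment = sequence[start - 1:end]  # 1-based inclusive -> Python slice
    return reverse_complement(segment) if on_reverse else segment


# Location taken from the CDS feature above.
start, end, on_reverse = parse_simple_location("complement(45959..47332)")
cds_length = end - start + 1   # 1374 bases
codons = cds_length // 3       # 458 codons, stop codon included

# Invented toy sequence: bases 3..8 hold the reverse complement of "atgtaa".
toy_sequence = "ggttacatcc"
toy_cds = extract_feature(toy_sequence, "complement(3..8)")
print(cds_length, codons, on_reverse, toy_cds)
```

In the toy case the extracted feature reads `atgtaa`, a start codon followed by a stop codon, although the stored strand shows `ttacat` at those positions.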

  9. Proteomics databases
     1. Sequence databases: « The story of a protein sequence’s life »
     2. Swiss-Prot: a quick overview
     3. UniProt utilities: UniRef and UniParc
     4. Swiss-Prot … and the other protein databases

     Real life of a protein sequence …
     • Data not submitted to public databases, delayed or cancelled…
     • Nucleic acids: cDNAs, ESTs, genomes, … with or without annotated CDS: EMBL
     • Amino acids: CoDing Sequences provided by submitters: TrEMBL; manually annotated: Swiss-Prot

  10. Since December 15, 2003, Swiss-Prot and TrEMBL constitute the UniProt Knowledgebase (integration of the PIR data)
      -> gives access to all known* protein sequences
      * submitted to the public databases (EMBL, GenBank, DDBJ, Swiss-Prot)

  11. A Swiss-Prot entry = a protein sequence… associated with
        – manually checked
        – well-structured
        – periodically updated
        – searchable
      … biological information

      A TrEMBL entry = a protein sequence… associated with
        – computer-annotated
        – well-structured
        – periodically updated
        – searchable
      … biological information

  12. EMBL CDS -> TrEMBL -> Swiss-Prot (diagram)
      • Once an entry is in Swiss-Prot, it is no longer in TrEMBL -> minimal redundancy
      • Annotation of conflicts (EMBL vs. Swiss-Prot)

  13. EMBL CDS -> TrEMBL -> Swiss-Prot (diagram)
      How to make things clear…? Depending on the server…
      • UniProt = Swiss-Prot + TrEMBL = SPTR = SWALL
      • Swiss-Prot = UniProt/Swiss-Prot
      • TrEMBL = UniProt/TrEMBL = SPTrEMBL
      • TrEMBL = SPTrEMBL + TrEMBLnew**
      ** is going to disappear soon!

  14. Swiss-Prot
      1. Minimal redundancy;
      2. Maximal manual annotation;
      3. Integration with other databases.

      Swiss-Prot, 1. Minimal redundancy: 1 gene (1 species) -> 1 Swiss-Prot entry
      Identical sequences are merged, as are variants, fragments, alternative splicing isoforms….
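The "1 gene (1 species) -> 1 entry" rule can be pictured as grouping sequence reports under a (gene, species) key. The sketch below is purely illustrative: in Swiss-Prot this merging is done by curators, not by a script like this one, the `merge_records` helper is hypothetical, and the records and their short sequences are made up for the example.

```python
from collections import defaultdict

# Made-up records: (gene, species, sequence). These are not real database entries.
records = [
    ("EPO", "Homo sapiens", "MGVHEC"),
    ("EPO", "Homo sapiens", "MGVHEC"),   # identical sequence reported twice
    ("EPO", "Homo sapiens", "GVHEC"),    # fragment of the same protein
    ("EPO", "Mus musculus", "MGVPER"),   # same gene, different species
]


def merge_records(records):
    """Collapse reports into one entry per (gene, species) pair.

    Identical sequences collapse automatically because they are stored in a
    set; distinct forms (fragments, variants) stay attached to the one entry.
    """
    merged = defaultdict(set)
    for gene, species, sequence in records:
        merged[(gene, species)].add(sequence)
    # Longest sequence first, so the most complete form leads each entry.
    return {key: sorted(seqs, key=len, reverse=True) for key, seqs in merged.items()}


entries = merge_records(records)
for (gene, species), sequences in entries.items():
    print(gene, species, sequences)
```

Four reports end up as two entries, one per species, with the fragment kept alongside the full-length form instead of becoming an entry of its own.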
