Annotation Error in Public Databases
Alexandra Schnoes
University of California, San Francisco
October 25, 2010
New genomes (and metagenomes) are sequenced every day...
Computational Function Prediction Needed
[Figure: characterized sequences compared with total sequences]
What about the error that results from large-scale function prediction?
Our focus: commonly used protein sequence databases
• How prevalent is misannotation in common sequence databases?
• What can we learn about these annotation errors and annotation in general?
What is ‘function’? Many Possible Definitions
• Phenotype
• Enzymatic Reaction
Why use enzymes?
• Concrete definition of function
• Substrate
• Product
• Chemical conversion
• Function can be mapped to specific residues
Functionally Diverse Enzyme Superfamilies
• Multifunctional (superfamily level): conserved mechanistic step; low % sequence ID
• Monofunctional (family level): family-specific residues; % ID within families > % ID between families
What is needed for the misannotation analysis?
Gold Standard Sequence Set Requirements
• Organized hierarchy & data
• Superfamily definitions
• Family definitions
• Sequences
• Sequence alignments
• Statistical models
• Functions are experimentally characterized
• Understand functional mechanism
• Structure
• Active site
• Functionally important residues
• Large set
Gold standard set: 6 superfamilies; 5 structural folds; 37 families; 5/6 E.C. categories. Genome Biol. 2006;7(1):R8.
Gold Standard Sequence Set (sfld.rbvi.ucsf.edu)
• Sequence models (HMMs)
• Evidence codes
• Functionally important residues
• Hand-curated sequence alignments
• Hierarchically organized
Data Source: Commonly Used Sequence Databases
• NCBI: automated, large
• TrEMBL: automated, large
• Swiss-Prot: curated
• KEGG: automated, small
Analysis Question
Given: a protein sequence annotated to a specific enzyme function
Is that annotation correct?
General Process
[Figure: non-family members and family members, with the LC, TC, and NC cutoffs marked]
Variable percent misannotation
Manually curated Swiss-Prot is most accurate
Misannotation Problem is Getting Worse
[Figure: Sequences deposited by year and the fraction predicted to be misannotated (NR DB), 1993 to 2005. Axes: number of sequences; fraction predicted misannotated. Series: incorrect annotations; correct annotations.]
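The trend on this slide is a per-year ratio of predicted misannotations to deposited sequences. A minimal sketch of how such a ratio could be tabulated, assuming each record is a (deposition year, misannotation call) pair; the function name and the `records` list are hypothetical toy inputs, not data from the study:

```python
from collections import defaultdict


def misannotated_fraction_by_year(records):
    """Return {year: fraction misannotated} for (year, is_misannotated) pairs."""
    totals = defaultdict(int)
    wrong = defaultdict(int)
    for year, is_misannotated in records:
        totals[year] += 1
        if is_misannotated:
            wrong[year] += 1
    # Report years in chronological order.
    return {year: wrong[year] / totals[year] for year in sorted(totals)}


# Hypothetical toy records, for illustration only.
records = [(1995, False), (1995, False), (2005, True), (2005, False)]
fractions = misannotated_fraction_by_year(records)
```

Plotting these fractions next to the raw counts per year would give the two axes described in the figure.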
What are the characteristics of these misannotations?
Sensitivity to threshold change
[Figure: non-family members and family members, with the LC, TC, and NC cutoffs marked]
• TC: Trusted Cutoff
• NC: Noise Cutoff
• LC: Lenient Cutoff
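The sensitivity check can be sketched as recomputing the misannotated fraction under each cutoff. This is only an illustration under stated assumptions: a sequence annotated to a family is counted as misannotated when its score against that family's model falls below the cutoff, the cutoffs are ordered TC > NC > LC, and every score and cutoff value below is invented.

```python
def fraction_misannotated(scores, cutoff):
    """Fraction of annotated sequences whose family score falls below the cutoff."""
    if not scores:
        raise ValueError("scores must not be empty")
    below = sum(1 for score in scores if score < cutoff)
    return below / len(scores)


# Hypothetical model scores for sequences annotated to one family.
scores = [310.0, 295.5, 120.0, 80.0, 40.0]

# Hypothetical cutoff values; the ordering TC > NC > LC is an assumption.
cutoffs = {"TC": 250.0, "NC": 100.0, "LC": 50.0}

sensitivity = {name: fraction_misannotated(scores, value)
               for name, value in cutoffs.items()}
```

If the fraction changes little between the strictest and the most lenient cutoff, the misannotation estimate is not very sensitive to the threshold choice.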
Types of Misannotation
• NSA (9%): No Superfamily Association
• SFA (31%): Superfamily Association Only
• MFR (6%): Missing Functionally Important Residues
• BTC (54%): Below Trusted Cutoff
[Figure legend: misannotations due to overprediction; misannotations not due to overprediction]
Biggest problem: predicting function without sufficient evidence
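The four categories can be read as a decision cascade. The sketch below is one possible ordering, not the procedure used in the study: the function name, its inputs, and the order of the checks are all assumptions made for illustration.

```python
from typing import Optional


def classify_annotation(superfamily_hit: bool,
                        family_score: Optional[float],
                        has_key_residues: bool,
                        trusted_cutoff: float) -> str:
    """Assign one annotated sequence to NSA, SFA, MFR, BTC, or 'correct'."""
    if not superfamily_hit:
        return "NSA"  # no superfamily association at all
    if family_score is None:
        return "SFA"  # superfamily member, but no match to the annotated family
    if not has_key_residues:
        return "MFR"  # functionally important residues are missing
    if family_score < trusted_cutoff:
        return "BTC"  # family match scores below the trusted cutoff
    return "correct"


# Hypothetical inputs, for illustration only.
example = classify_annotation(True, 180.0, True, trusted_cutoff=250.0)
```

Here `example` is "BTC": the sequence matches the family and keeps its key residues, but the evidence is weaker than the trusted cutoff requires.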
Error Propagation
Sequence 1 (annotated: Dipeptide Epimerase)
>gi|13786715|pdb|1HZY|A Chain A, High Resolution Structure Of The Zinc-Containing Phosphotriesterase From Pseudomonas Diminuta
GDRINTVRGPITISEAGFTLTHEHICGSSAGFLRAWPEFFGSRKALAEKAVRGLRRARAAGVRTIVDVST
FDIGRDVSLLAEVSRAADVHIVAATGLWFDPPLSMRLRSVEELTQFFLREIQYGIEDTGIRAGIIKVATT
GKATPFQELVLKAAARASLATGVPVTTHTAASQRDGEQQAAIFESEGLSPSRVCIGHSDDTDDLSYLTAL
AARGYLIGLDHIPHSAIGLEDNASASALLGIRSWQTRALLIKALIDQGYMKQILVSNDWLFGFSSYVTNI
MDVMDRVNPDGMAFIPLRVIPFLREKGVPQETLAGITVTNPARFLSPTLRAS
Sequence 2 (Unknown Function)
>gi|1176259|sp|P45548|PHP_ECOLI Phosphotriesterase homology protein
MSFDPTGYTLAHEHLHIDLSGFKNNVDCRLDQYAFICQEMNDLMTRGVRNVIEMTNRYMGRNAQFMLDVM
RETGINVVACTGYYQDAFFPEHVATRSVQELAQEMVDEIEQGIDGTELKAGIIAEIGTSEGKITPLEEKV
FIAAALAHNQTGRPISTHTSFSTMGLEQLALLQAHGVDLSRVTVGHCDLKDNLDNILKMIDLGAYVQFDT
IGKNSYYPDEKRIAMLHALRDRGLLNRVMLSMDITRRSHLKANGGYGYDYLLTTFIPQLRQSGFSQADVD
VMLRENPSQFFQ
Sequence 2 is similar to sequence 1 (2 = 1), so sequence 2 is also annotated Dipeptide Epimerase: INCORRECT!
BLAST sequence similarity network
• E-value 1 × 10^-30 or lower
• Distance between nodes reflects level of sequence similarity
[Figure legend: sequence similarity; correct annotation; incorrect annotation]
Misannotations
• Cluster with each other
• Indication of error propagation
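A thresholded similarity network of this kind can be sketched with plain dictionaries. The E-value cutoff comes from the slide; everything else is assumed for illustration: the function names, the toy `hits` list (sequence pairs with made-up E-values), and the simple clustering measure, which is the fraction of neighbors of misannotated nodes that are themselves misannotated.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 1e-30  # threshold stated on the slide


def build_network(hits, cutoff=E_VALUE_CUTOFF):
    """Connect two sequences when their pairwise E-value is at or below the cutoff."""
    neighbors = defaultdict(set)
    for seq_a, seq_b, evalue in hits:
        if seq_a != seq_b and evalue <= cutoff:
            neighbors[seq_a].add(seq_b)
            neighbors[seq_b].add(seq_a)
    return neighbors


def misannotated_neighbor_fraction(neighbors, misannotated):
    """Fraction of neighbors of misannotated nodes that are also misannotated."""
    total = 0
    same = 0
    for node in misannotated:
        for other in neighbors.get(node, ()):
            total += 1
            if other in misannotated:
                same += 1
    return same / total if total else 0.0


# Hypothetical pairwise hits and misannotation calls, for illustration only.
hits = [("s1", "s2", 1e-50), ("s2", "s3", 1e-40),
        ("s3", "s4", 1e-10), ("s4", "s5", 1e-35)]
misannotated = {"s1", "s2"}

network = build_network(hits)
fraction = misannotated_neighbor_fraction(network, misannotated)
```

A fraction well above the overall share of misannotated sequences would be consistent with misannotations clustering together, which is the pattern the slide reads as a sign of error propagation.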
In Conclusion...
• Misannotation is a serious problem
• Automated databases
• Across multiple folds, functions and superfamilies
• Hard to predict misannotation a priori
• Manual curation delivers the highest quality
• Misannotation problem is getting worse
• Overprediction is a common problem
• Error propagation appears to be a common source of misannotation
Acknowledgements
• Patricia Babbitt & lab
• Shoshana Brown
• Igor Dodevski, University of Zürich
• Tanja Kortemme & lab
• Colin Smith
• Jim Wells lab
• Emily Crawford
Funding: Howard Hughes Pre-Doctoral Fellowship; NIH & NSF
PLoS Comput Biol. 2009 Dec;5(12):e1000605.
Wiki Commons & Science Magazine for some images