
Annotation & Inference: New Genomes, New Functions



  1. Annotation & Inference: New genomes, new functions. [Diagram labels: ‘Maybe’; borderline similarity; having no function; only part of protein; function; conflicting exp/lit; new genomes; experiments; no similarity; ‘wrong’ literature; no evidence; expert view; faulty annotation; wrong inference.] Michal Linial, Institute of Life Sciences, The Hebrew University of Jerusalem, May 2006

  2. Annotation & Inference: New genomes, new functions. ● Domain families by EVEREST: automatic identification of protein domains; performance and analysis w.r.t. other resources ● New annotation by inference: a method for inference, tested on a new genome ● New Function to Disserted Proteins: high-level functionality, the story of the toxin-like proteins. May 2006

  3. Motivation. Why domain families? What is wrong with protein classification? Nothing is wrong, but: ● Reducing false transitivity ● Exposing mix-and-match evolution ● Immediate relevance to structural domain families ● Suggesting evolutionary ‘robust units’. Why automatic? ● Overcoming large amounts of data ● Unbiased identification of new families (even without an identified seed)

  4. EVEREST: A domain families resource. A comparative quality tool for other resources. Automatic / de novo identification and classification of protein domains in all known sequences. Rigorous evaluation against manually curated, automated, and structure-based domain-family resources: • Scoring methods for ‘quality control’ • Exposing any (interesting) relationships within ‘the world’ of domains • Interactive web tool: www.everest.cs.huji.ac.il

  5. Method: The Modular Nature of Proteins. [Figure: domain architectures of K6A1_MOUSE, CSKP_HUMAN, DLG3_MOUSE and MPP3_HUMAN. Legend: Serine/Threonine protein kinase family active site; Protein kinase C-terminal domain; PDZ domain; SH3 domain; Guanylate kinase.]

  6. input: False Transitivity of Local Alignment. [Figure: K6A1_MOUSE, CSKP_HUMAN, DLG3_MOUSE and MPP3_HUMAN linked by BLAST values 1e-42, 9e-41, 8e-78 and 2e-47; pairwise similarities better than e-score 1e-40.] If we cluster these proteins, assuming transitivity of local alignment scores, we will cluster K6A1_MOUSE with MPP3_HUMAN
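A minimal sketch of the false-transitivity problem, assuming single-linkage clustering over whole proteins. The slide lists four BLAST values but the flattened figure does not say which pair each belongs to, so the pairing in `HITS` is hypothetical; the function name `single_linkage` is likewise illustrative.

```python
from collections import defaultdict

# Hypothetical pairing of the slide's BLAST values to protein pairs.
# The only property that matters here: K6A1_MOUSE and MPP3_HUMAN
# have no direct hit between them.
HITS = [
    ("K6A1_MOUSE", "CSKP_HUMAN", 1e-42),
    ("CSKP_HUMAN", "DLG3_MOUSE", 8e-78),
    ("CSKP_HUMAN", "MPP3_HUMAN", 9e-41),
    ("DLG3_MOUSE", "MPP3_HUMAN", 2e-47),
]


def single_linkage(hits, threshold):
    """Merge proteins joined by any chain of hits at or below the threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, evalue in hits:
        root_a, root_b = find(a), find(b)
        if evalue <= threshold and root_a != root_b:
            parent[root_a] = root_b

    clusters = defaultdict(set)
    for protein in list(parent):
        clusters[find(protein)].add(protein)
    return list(clusters.values())


clusters = single_linkage(HITS, threshold=1e-40)
directly_similar = any({a, b} == {"K6A1_MOUSE", "MPP3_HUMAN"} for a, b, _ in HITS)
print(clusters)          # one cluster holding all four proteins
print(directly_similar)  # False: the two end up together only by transitivity
```

Under this assumed pairing, K6A1_MOUSE and MPP3_HUMAN land in the same cluster even though no alignment links them directly, which is the effect the slide warns about.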

  7. Method: Working With Segments. Each BLAST alignment defines two segments. [Figure: segments K6A1_MOUSE 399-678; CSKP_HUMAN 365-920, 515-916, 12-295; DLG3_MOUSE 429-844, 378-849; MPP3_HUMAN 28-584, 118-580.]

  8. input: Clustering Segments. Two similarity measures between segments: • Sequence similarity, if they were found together by BLAST • Physical overlap, if they are on the same protein and they intersect. [Figure: segments K6A1_MOUSE 399-678; CSKP_HUMAN 365-920, 515-916, 12-295; DLG3_MOUSE 429-844, 378-849; MPP3_HUMAN 118-580, 28-584.]
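The physical-overlap measure can be sketched as follows, using the segment coordinates listed on the slide. The `Segment` type and the `physically_overlap` helper are illustrative names, and treating any shared residue as an intersection is an assumption; the slides do not state a minimum overlap.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    protein: str
    start: int  # first residue, inclusive
    end: int    # last residue, inclusive


def physically_overlap(a: Segment, b: Segment) -> bool:
    """True if both segments lie on the same protein and share a residue."""
    return a.protein == b.protein and a.start <= b.end and b.start <= a.end


# Segment coordinates as listed on the slide.
SEGMENTS = [
    Segment("K6A1_MOUSE", 399, 678),
    Segment("CSKP_HUMAN", 365, 920),
    Segment("CSKP_HUMAN", 515, 916),
    Segment("CSKP_HUMAN", 12, 295),
    Segment("DLG3_MOUSE", 429, 844),
    Segment("DLG3_MOUSE", 378, 849),
    Segment("MPP3_HUMAN", 28, 584),
    Segment("MPP3_HUMAN", 118, 580),
]

overlapping_pairs = [
    (a, b)
    for i, a in enumerate(SEGMENTS)
    for b in SEGMENTS[i + 1:]
    if physically_overlap(a, b)
]

for a, b in overlapping_pairs:
    print(f"{a.protein}: {a.start}-{a.end} overlaps {b.start}-{b.end}")
```

With these coordinates, CSKP_HUMAN 12-295 overlaps neither of the other two CSKP_HUMAN segments, so it would form a separate overlap group.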

  9. input: The Easy Case. All segments on CSKP_HUMAN defined by alignments with e-score 1e-40 or better. [Figure: CSKP_HUMAN.] We collect all BLAST values that are < 100: ~14 million values

  10. Method: EVEREST Process Scheme. EVEREST: EVolutionary Ensembles of REcurrent SegmenTs. [Figure: flow diagram with steps numbered 0 to 10. Labels: careful transitivity; pre-process; clustering; putative domains; putative families; machine learning; evaluation and tests; statistical model; majority voting; iterations; post-process.]

  11. Method: 3 Years in one slide (Elon Portugaly). ● Cluster the segments into conservative groups by overlap similarity ● Each group is a putative domain ● We apply average-linkage hierarchical clustering on the putative domains ● Creates a binary tree of clusters ● Each cluster is a putative domain family ● Machine learning & scoring w.r.t. PfamA ● Choosing good families (intrinsic properties): training set / disjoint test set ● Each family is modelled by an HMM; redefine EV families ● Iteration (3 times, from 100K to 25K) ● Joining HMMs and voting for an EV consensus family
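To illustrate the average-linkage step, here is a minimal pure-Python sketch that builds a binary tree of clusters from pairwise distances. It is not the EVEREST implementation: the item names, the toy distance values and the `average_linkage` function are assumptions made for illustration only.

```python
from itertools import combinations


def average_linkage(items, dist):
    """Build a binary tree (nested tuples) by average-linkage clustering.

    items: list of hashable leaf labels
    dist:  function (a, b) -> distance between two leaves
    """
    # Each cluster is (subtree, list of leaf members).
    clusters = [(item, [item]) for item in items]

    def avg_distance(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist(a, b) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the two clusters with the smallest average pairwise distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: avg_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Toy distances between four hypothetical putative domains.
TOY = {
    frozenset(("d1", "d2")): 0.1,
    frozenset(("d3", "d4")): 0.2,
}


def toy_dist(a, b):
    return TOY.get(frozenset((a, b)), 0.9)  # unrelated pairs are far apart


tree = average_linkage(["d1", "d2", "d3", "d4"], toy_dist)
print(tree)
```

Every internal node of the resulting tree is a candidate cluster, which corresponds to the slide's statement that each cluster is a putative domain family. This naive version is cubic or worse in the number of items; it is meant only to show the merging rule.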

  12. Method: Quality & Evaluation. Comparing with Pfam: Pfam is a manually curated domain signature DB that covers 62% of aa with 7500 signatures. Accuracy: how well a typical EVEREST domain family scores w.r.t. Pfam. The score is the size of the intersection over the size of the union (Jaccard score), ranging from 0 to 1.0. Example: an EV family of 10 instances matches a Pfam family of 10 instances, with only 9 overlapping. Score: 0.81 (9/11). [Figure: overlapping EV and Pfam sets.]
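The slide's example can be reproduced with a short sketch of the Jaccard score. As a simplifying assumption, each domain instance is represented by an identifier, and two instances count as overlapping when the identifiers are equal; the identifiers themselves are made up.

```python
def jaccard(a, b):
    """Size of the intersection over the size of the union; 0.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical instance identifiers: 10 EV instances, 10 Pfam instances, 9 shared.
ev_family = {f"dom{i}" for i in range(10)}       # dom0 .. dom9
pfam_family = {f"dom{i}" for i in range(1, 11)}  # dom1 .. dom10

score = jaccard(ev_family, pfam_family)
print(f"{len(ev_family & pfam_family)} / {len(ev_family | pfam_family)} = {score:.3f}")
```

The union has 10 + 10 - 9 = 11 members, so the score is 9/11, which the slide reports as 0.81.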

  13. Getting Better (accuracy measure). [Figure: five histograms of cluster counts (x 1,000 clusters) against score w.r.t. Pfam: All Clusters (~2 million); Chosen Clusters (~100,000); Iteration 1 HMMs (~100,000); Iteration 3 HMMs (~25,000); Final EVEREST Families (~13,570).]

  14. EVEREST: Evaluation vs Reference. ● EVEREST is evaluated against reference sets of known families (Pfam, SCOP, CATH) ● Score of an EVEREST family w.r.t. an intersecting reference family: size of intersection / size of union. Accuracy: ● Each EVEREST family is scored vs. its best-matching reference family ● Look at the score profile across EVEREST families ● Ignore EVEREST families unknown to the reference set. Coverage: ● Each reference family is scored vs. its best-matching EVEREST family ● Look at the score profile across interesting subsets of the reference set ● Non-trivial: family size >= 5 ● Hetero: non-trivial + appearing in hetero-multi-domain proteins
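A minimal sketch of the two best-match profiles described above, assuming families are plain sets of domain identifiers. The family names, the toy data and the `best_match_scores` helper are hypothetical. Scoring an unmatched reference family as 0 in the coverage profile is also an assumption; the slide only states that unmatched EVEREST families are ignored for accuracy.

```python
def jaccard(a, b):
    """Size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match_scores(families, others, skip_unmatched):
    """Score each family against its best-matching intersecting family in `others`."""
    scores = {}
    for name, members in families.items():
        hits = [jaccard(members, other) for other in others.values() if members & other]
        if hits:
            scores[name] = max(hits)
        elif not skip_unmatched:
            scores[name] = 0.0  # assumption: no intersecting match counts as zero
    return scores


# Toy families over hypothetical domain identifiers.
everest = {"EV_a": {1, 2, 3, 4}, "EV_b": {5, 6}, "EV_c": {20, 21}}
reference = {"REF_x": {1, 2, 3, 4, 5}, "REF_y": {6, 7, 8}, "REF_z": {30}}

# Accuracy: EVEREST families vs. best reference; unknown families are ignored.
accuracy = best_match_scores(everest, reference, skip_unmatched=True)
# Coverage: reference families vs. best EVEREST family.
coverage = best_match_scores(reference, everest, skip_unmatched=False)

print(accuracy)
print(coverage)
```

Restricting `reference` to families of size >= 5 before calling the helper would give the slide's "non-trivial" subset.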

  15. Evaluation w.r.t. Pfam: EVEREST & ADDA (Holm). [Figure: four histograms over score bins 0.1 to 1: EVEREST Accuracy and ADDA Accuracy (y-axis: x 1,000 clusters); EVEREST Coverage and ADDA Coverage (y-axis: # families). The EVEREST panels are annotated 13,570 and 1,800.]

  16. Hetero > 5 Evaluation vs Pfam EVEREST & ADDA

  17. Evaluation: comparison w.r.t. SCOP, a manual classification of structural domains. [Figure: four histograms over score bins 0.1 to 1: EVEREST Accuracy and ADDA Accuracy (y-axis: x 100 clusters); EVEREST Coverage and ADDA Coverage (y-axis: # families).]

  18. EVEREST – Evaluation vs SCOP (family) coverage

  19. Evaluation: comparison w.r.t. CATH / SCOP superfamily (coverage)

  20. Overall Numbers (for UniProt/SWP). 13,569 EV families were defined, providing joint HMMs. Jointly they cover 83% of the aa in the SWP DB. The average (median) size of an EVEREST domain family is 81 (41). The average (median) length of the domains is 117 (76) aa. Move to some examples (web-based querying)

  21. Examples: New Functional Annotation. EVEREST family 1017 covers PF04673 (Polyketide synthesis cyclase) and PF04486 (SchA/CurD like protein). ● PF04486 has no known function ● Two of its members are known to be in gene clusters involved in the synthesis of polyketide-based spore pigments ● Could these two families be considered one?

  22. New Family (1). ● EV02275 is unknown to Pfam ● 54 out of its 55 domains appear 90 positions N-terminal to PF03171 (2OG-Fe(II) oxygenase superfamily) ● Perhaps this is a new domain family? ● PDB 1UOG: red, EVEREST 2275; blue, PF03171

  23. New domain family (2). 48 proteins: Pesticidal crystal protein cry5Aa (Insecticidal delta-endotoxin CryVA(a)) (Crystalline entomocidal protoxin). EV covers the 48 proteins of Pfam (and SCOP / CATH) perfectly: EVEREST, SCOP 33-608. But another EV specifies the family: no overlap and no structure for this region (609-911)

  24. Two that became one: examples in Pfam CLANs. Pfam (old): Taurine catabolism dioxygenase TauD, TfdA family. Pfam (new): a composed entry, TauD

  25. Superfamily. ● EVEREST family EV04463 fully covers both PF00465 (Iron-containing alcohol dehydrogenase) and PF01761 (3-dehydroquinate synthase) ● ENZYME: PF00465 is EC1.1- ● ENZYME: PF01761 is sometimes EC4.6 and sometimes EC1.1 ● SCOP / CATH: same superfamily / homology group ● PDB 1JQA (PF00465); PDB 1DQS (PF01761)

  26. Alternative Family Definition: Elongation Factor, C-terminal half. Three ‘domain family’ definitions (SCOP, CATH, EVEREST), all supporting the same proteins: SCOP, two adjacent domains (yellow, blue); CATH, two separated domains (blue, red) with a spacer (green); EVEREST, one domain (pink)

  27. On the Web
