
A simple tool from a complex system: high-throughput, unsupervised generation of Protein Families from the Protein Homology Network



  1. A simple tool from a complex system: high-throughput, unsupervised generation of Protein Families from the Protein Homology Network. Duccio Medini, Cellular Microbiology and Bioinformatics Unit, Novartis Vaccines, Siena (I)

  2. What is the Protein Homology Network (PHN)?

  3. PHN: definitions. 251 complete genomes → 761,260 predicted proteins. Nodes → proteins. Links → BLAST alignments with E-score < ε (cut-off), representing homology relations. Connected component → group of proteins connected by a path (figure: Component A, Component B).
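The definitions on this slide can be illustrated with a minimal sketch. The protein names, the toy hit list, and the helper names `build_phn` and `connected_components` are hypothetical; the talk does not show how the network was actually built. The sketch only assumes that each BLAST hit is available as a `(protein_a, protein_b, e_value)` triple.

```python
def build_phn(proteins, hits, cutoff):
    """One node per protein, one link per hit with E-value below the cut-off."""
    adj = {p: set() for p in proteins}
    for a, b, e_value in hits:
        if a != b and e_value < cutoff:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def connected_components(adj):
    """Groups of nodes connected by a path (iterative depth-first search)."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            comp.add(node)
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        components.append(comp)
    return components

# Toy data (hypothetical): five proteins and four pairwise hits.
proteins = ["p1", "p2", "p3", "p4", "p5"]
hits = [("p1", "p2", 1e-50), ("p2", "p3", 1e-8),
        ("p4", "p5", 1e-120), ("p3", "p4", 1e-2)]
adj = build_phn(proteins, hits, cutoff=1e-5)
components = connected_components(adj)
```

With the cut-off at 1e-5 the weak p3/p4 hit is discarded, so the toy network splits into two components, in the spirit of "Component A" and "Component B" in the figure.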

  4. PHN: snapshot of a small portion (1/20). Full: 760,000 proteins and $7 \times 10^{7}$ links (at $\varepsilon = 10^{-5}$)

  5. The structure of the PHN depends on the homology cut-off ε. For $\varepsilon = 10^{-200}$ to $10^{-100}$: several relationships missed

  6. The structure of the PHN depends on the homology cut-off ε. For $\varepsilon = 10^{-80}$ to $10^{-40}$: several relationships missed + “strange” connections!

  7. The structure of the PHN depends on the homology cut-off ε. For $\varepsilon = 10^{-30}$ to $10^{-10}$: some relationships still missed + several inter-family

  8. The structure of the PHN depends on the homology cut-off ε. For $\varepsilon = 10^{-5}$: the “giant component” dominates the network

  9. PHN: the giant connected component. Plot: fraction of nodes included in the largest connected component. At $\varepsilon = 10^{-5}$, 63% of the proteins are in the giant component
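The quantity plotted on this slide, the fraction of nodes in the largest connected component as a function of ε, can be sketched as follows. The toy hits and the function name `giant_fraction` are hypothetical, and union-find is used here only as one convenient way to track components; the talk does not say how the curve was computed.

```python
from collections import Counter

def giant_fraction(proteins, hits, cutoff):
    """Fraction of proteins that fall in the largest connected component."""
    parent = {p: p for p in proteins}

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, e_value in hits:
        if e_value < cutoff:
            parent[find(a)] = find(b)  # merge the two components

    sizes = Counter(find(p) for p in proteins)
    return max(sizes.values()) / len(proteins)

# Toy chain of hits (hypothetical) with progressively weaker E-values.
proteins = ["a", "b", "c", "d", "e", "f"]
hits = [("a", "b", 1e-150), ("b", "c", 1e-60), ("c", "d", 1e-20),
        ("d", "e", 1e-7), ("e", "f", 1e-3)]
fractions = {eps: giant_fraction(proteins, hits, eps)
             for eps in (1e-100, 1e-40, 1e-10, 1e-5)}
```

Relaxing the cut-off admits more links, so the largest component can only grow. In the toy chain it goes from 2 of 6 proteins at 1e-100 to 5 of 6 at 1e-5.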

  10. PHN topology. Proximity of a node: clustering index $C_i = \frac{2E_i}{k_i(k_i - 1)}$, $C = \langle C_i \rangle$ (Albert R, Barabasi AL (2002) Reviews of Modern Physics 74: 47-97). Connected components: compactness index $\eta_i = \frac{k_i}{M - 1}$, $\eta = \langle \eta_i \rangle$
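A small sketch of the two indices may help. It assumes the usual reading of the clustering index ($k_i$ is the number of links of node $i$, $E_i$ the number of links among its neighbors) and, as an assumption not stated on the slide, takes $M$ to be the number of nodes in the connected component. The function names and the toy graph are hypothetical.

```python
def clustering_index(adj, i):
    """C_i = 2 E_i / (k_i (k_i - 1)); taken as 0 when node i has fewer than 2 links."""
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return 0.0
    # Each link among the neighbors is seen from both of its ends, hence // 2.
    e = sum(1 for u in nbrs for v in adj[u] if v in nbrs) // 2
    return 2 * e / (k * (k - 1))

def compactness(adj, component):
    """Average of eta_i = k_i / (M - 1) over the M nodes of one component."""
    m = len(component)
    if m < 2:
        return 0.0
    return sum(len(adj[i]) / (m - 1) for i in component) / m

# Toy component (hypothetical): a triangle a-b-c with a pendant node d on c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
c_a = clustering_index(adj, "a")   # both neighbors of a are linked
c_c = clustering_index(adj, "c")   # only a-b is linked among a, b, d
eta = compactness(adj, set(adj))
```

Under these assumptions a fully connected component has η = 1, and sparser, chain-like components score lower.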

  11. How do we identify Protein Families? (figure: Family “A”, Family “B”)

  12. Overlap measure: neighborhood similarity. We define the overlap $\theta_{ij}$ of two nodes $i$, $j$ as the normalized fraction of nearest neighbors that they have in common: $\theta_{ij} = \frac{n_{ij}}{\max(k_i, k_j)}$. Figure examples: $\theta_{ij} = 0$; $k_i = 10$, $k_j = 8$, $n_{ij} = 3$ gives $\theta_{ij} = 0.3$; $\theta_{jk} \approx 1$. θ is designed to identify pairs of nodes sharing a large fraction of their nearest neighbors.
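The overlap can be computed directly from neighbor sets. The sketch below reproduces the slide's worked case ($k_i = 10$, $k_j = 8$, $n_{ij} = 3$). The neighbor names are made up, and the neighbor sets are taken as given: whether a node counts among its own neighbors is not specified on the slide.

```python
def overlap(adj, i, j):
    """theta_ij = n_ij / max(k_i, k_j), with n_ij the number of shared neighbors."""
    k_max = max(len(adj[i]), len(adj[j]))
    if k_max == 0:
        return 0.0
    return len(adj[i] & adj[j]) / k_max

# Hypothetical neighbor sets matching the slide: 10 and 8 neighbors, 3 in common.
shared = {"s1", "s2", "s3"}
adj = {
    "i": shared | {f"a{n}" for n in range(7)},   # k_i = 10
    "j": shared | {f"b{n}" for n in range(5)},   # k_j = 8
}
theta_ij = overlap(adj, "i", "j")   # 3 / max(10, 8)
```

Normalizing by the larger of the two degrees keeps θ low when a highly connected node shares only a few neighbors with a poorly connected one.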

  13. The modularity measure Q: correspondence of a network partitioning to the network modular structure (Newman MEJ, Girvan M (2004) Physical Review E 69: 26113-26127). $Q = \sum_i \left( b_i - a_i^2 \right)$, where $a_i$ = fraction of edges with at least one end in the $i$-th component and $b_i$ = fraction of edges with both ends in the $i$-th component. PHN-Families: connected components for θ = 0.5.
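As a worked illustration, the sketch below evaluates Q exactly as worded on this slide: $a_i$ counts edges with at least one end in group $i$, $b_i$ edges with both ends in it. Note that Newman and Girvan define $a_i$ through the fraction of edge ends attached to the group, so values can differ slightly from their formulation. The toy graph and the function name `modularity` are hypothetical.

```python
def modularity(edges, group_of):
    """Q = sum_i (b_i - a_i**2), using the slide's wording for a_i and b_i."""
    total = len(edges)
    q = 0.0
    for g in set(group_of.values()):
        both = sum(1 for u, v in edges if group_of[u] == g and group_of[v] == g)
        touching = sum(1 for u, v in edges if group_of[u] == g or group_of[v] == g)
        q += both / total - (touching / total) ** 2
    return q

# Toy network (hypothetical): two triangles joined by the single edge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

q_split = modularity(edges, two_groups)   # 2 * (3/7 - (4/7)**2) = 10/49
q_whole = modularity(edges, one_group)    # 1 - 1**2 = 0
```

Splitting the toy network along its bridge scores higher than leaving it whole, which is the behavior a partition-quality measure should show.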

  14. Comparison to PFAM (~75% testable). Added links (98.5% confirmed), by protein classification: share a domain, fraction 98.5%, $\langle \theta_{ij} \rangle$ = 0.68; do not share a domain, fraction 1.5%, $\langle \theta_{ij} \rangle$ = 0.58. Removed links (76.4% confirmed), by protein classification: do not share a domain, fraction 8.1%, $\langle \varepsilon_{ij} \rangle = 10^{-10}$; one or two multi-domains, fraction 68.3%, $\langle \varepsilon_{ij} \rangle = 10^{-87}$; single domain, shared, fraction 23.6%, $\langle \varepsilon_{ij} \rangle = 10^{-10}$

  15. Summary: the PHN-Families Algorithm

  16. Result: PHN-Families. Figure: before partitioning / after partitioning. 28,226 PHN-Families (giant component disconnected into 14,443 PHN-Families + 26,000 isolated proteins)
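How a θ threshold can break a large component into families is sketched below on a toy network. This is an illustration under stated assumptions, not the published algorithm: it assumes that pairs with θ ≥ 0.5 are linked, that all other links are dropped, and that families are the connected components of the result. The names `overlap` and `theta_families` are hypothetical, and the all-pairs loop is only practical at toy scale.

```python
from itertools import combinations

def overlap(adj, i, j):
    """theta_ij = shared neighbors / max(k_i, k_j)."""
    k_max = max(len(adj[i]), len(adj[j]))
    return len(adj[i] & adj[j]) / k_max if k_max else 0.0

def theta_families(adj, threshold=0.5):
    """Relink pairs with theta >= threshold, then return connected components."""
    relinked = {n: set() for n in adj}
    for i, j in combinations(adj, 2):
        if overlap(adj, i, j) >= threshold:   # assumed rule: keep or add this link
            relinked[i].add(j)
            relinked[j].add(i)
    seen, families = set(), []
    for start in relinked:
        if start in seen:
            continue
        seen.add(start)
        stack, fam = [start], set()
        while stack:
            node = stack.pop()
            fam.add(node)
            for nbr in relinked[node] - seen:
                seen.add(nbr)
                stack.append(nbr)
        families.append(fam)
    return families

# Toy network (hypothetical): two 4-cliques joined by the bridge a1-b1.
adj = {n: set() for n in ["a1", "a2", "a3", "a4", "b1", "b2", "b3", "b4"]}
for group in (["a1", "a2", "a3", "a4"], ["b1", "b2", "b3", "b4"]):
    for u, v in combinations(group, 2):
        adj[u].add(v)
        adj[v].add(u)
adj["a1"].add("b1")
adj["b1"].add("a1")

families = theta_families(adj)
```

The bridge a1-b1 links two nodes with no neighbors in common (θ = 0), so it is dropped and the single toy component separates into two families.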

  17. How can we use Protein Families? 1. Enhanced annotation of new genomic sequences (highlighted) 2. Whole genome profiling and comparison 3. Identification and study of bacterial organelles

  18. How can we use Protein Families? 1. Enhanced annotation of new genomic sequences 2. Whole genome profiling and comparison (highlighted) 3. Identification and study of bacterial organelles

  19. Protein Families as discrete characters: the genomic matrix (figure axes: Microorganisms; Protein Families (functions))
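The genomic matrix can be pictured as a table with one row per microorganism and one column per protein family. The sketch below assumes a binary presence/absence encoding, which is one reading of "discrete characters"; the slide does not say whether counts were used instead. The genome and family names and the function `genomic_matrix` are hypothetical.

```python
def genomic_matrix(assignments):
    """Rows = genomes, columns = families, entry 1 if the genome has the family."""
    genomes = sorted({g for g, _ in assignments})
    families = sorted({f for _, f in assignments})
    column = {f: idx for idx, f in enumerate(families)}
    matrix = [[0] * len(families) for _ in genomes]
    row = {g: idx for idx, g in enumerate(genomes)}
    for g, f in assignments:
        matrix[row[g]][column[f]] = 1   # presence/absence, not a count
    return genomes, families, matrix

# Toy input (hypothetical): one (genome, family) pair per classified protein.
assignments = [("genomeA", "fam1"), ("genomeA", "fam2"), ("genomeA", "fam1"),
               ("genomeB", "fam2"), ("genomeB", "fam3")]
genomes, families, matrix = genomic_matrix(assignments)
```

Each row is then a family profile of one genome, and rows can be compared directly across microorganisms.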

  20. PHN-Family profiles: genomic signatures (figure labels: Bacillales, Archaea)

  21. How can we use Protein Families? 1. Enhanced annotation of new genomic sequences 2. Whole genome profiling and comparison 3. Identification and study of bacterial organelles (highlighted)

  22. From PHN-Families to bacterial organelles. A classification of proteins into families makes it possible to recognize the similarities between complex structures, even if some individual components are missing, different, or placed in an unexpected position.

  23. Can we group all the building blocks of Type IV Secretion Systems? Functional class: number of PHN-Families (proteins per family). VirB1: 2 (42, 4). VirB2: 4 (18, 9, 7, 1). VirB3: 3 (19, 13, 1). VirB4: 1 (228). VirB5: 2 (46, 7). VirB6: 2 (117, 3). VirB7: 6 (7, 7, 5, 3, 1, 1). VirB8: 2 (69, 2). VirB9: 2 (127, 2). VirB10: 1 (119). VirB11: 1 (724). VirD4: 1 (174). Covacci et al., Science (1999) 284, 1328-33. Selected 12 major structural components from 6 reference T4SS belonging to A. tumefaciens, IncN R46, B. suis, B. pertussis, and H. pylori, which provide a good sampling of the diversity of known TTSSs.

  24. Evolutionary diversification of Type IV SS (figure labels: variable set; conserved core; A, B, C, D: groups of probably co-evolved Type IV SS)

  25. PHN-Families are coherent with molecular phylogenesis (figure labels: 180 point accepted mutations; 230 point accepted mutations)

  26. Conclusions • The complex system: the Protein Homology Network is formed by interconnected clusters (families of homologous proteins). • The simple tool: we have developed a computational method to identify these groups of proteins, the PHN-Families, an unsupervised classification of quality comparable to collections curated by human experts. • The huge amount of genomic data produced can be classified before expert curation, to study whole genomes, organelles, and specific families. • Integration with Pfam and other databases will connect PHN-Fams to experimental data.

  27. Acknowledgements: Claudio Donati, Antonello Covacci, The BioInformatic Unit (NV&D), The Pfam group (The WT Sanger Institute), Alessandro Muzzi, Nicola Pacchiani, Robert Finn, Roberto Palmas, Riccardo Beltrami. Reference: Medini D, Covacci A, Donati C, Protein Homology Network Families Reveal Step-Wise Diversification of Type III and Type IV Secretion Systems, PLoS Computational Biology Vol. 2, No. 12, e173
