Algorithmica and molecular biology: The Pisan experience


  1. Algorithmica and molecular biology: The Pisan experience. Fabrizio Luccio. Glimpses into the world born from the encounter between the machines for sequencing DNA fragments and the computers that assemble those fragments.

  2. The Department group
     Maria Federico (federico@cli.di.unipi.it)
     Claudio Felicioli (pangon@gmail.com) *
     Paolo Ferragina * (ferragin@di.unipi.it)
     Roberto Grossi (grossi@di.unipi.it)
     Fabrizio Luccio * (luccio@di.unipi.it)
     Roberto Marangoni * (marangon@di.unipi.it)
     Nadia Pisanti * (pisanti@di.unipi.it)
     * The boss (at least, the one who knows everything)
     * Reference person (she did most of the work)
     * Trying to escape
     * gmail: why? (probably paid by Google)
     * ME! (parasite, but early group initiator)

  3. Glimpses into the world etc. Algorithms are the winning tool. Sorry... good algorithms are the winning tool, especially when dealing with very large data.

  4. Inefficient algorithms have the unpleasant property of resisting hardware improvement.
     A polynomial-time algorithm solves a problem on $n$ data in time $t_1 = c\,n^s$.
     An exponential-time algorithm solves a problem on $n$ data in time $t_2 = c\,s^n$, with $c$, $s$ constants.
     With a computer $k$ times faster, and the same running time, we process $N > n$ data, according to the laws:
     $t_1 = c\,n^s$, $k\,t_1 = c\,N^s$ $\Rightarrow$ $N = k^{1/s}\,n$
     $t_2 = c\,s^n$, $k\,t_2 = c\,s^N$ $\Rightarrow$ $k\,s^n = s^N$ $\Rightarrow$ $N = n + \log_s k$
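A small numeric sketch of the two laws on this slide may help. The function names and the sample values (n = 1000, s = 2, k = 100) are illustrative assumptions, not taken from the presentation.

```python
import math


def polynomial_gain(n, s, k):
    """Data size N processed in the same time on a k-times-faster machine
    when the running time is c * n**s:  N = k**(1/s) * n."""
    return k ** (1 / s) * n


def exponential_gain(n, s, k):
    """Data size N processed in the same time on a k-times-faster machine
    when the running time is c * s**n:  N = n + log_s(k)."""
    return n + math.log(k, s)


# Illustrative values only (assumed, not from the slides).
n, s, k = 1000, 2, 100
N_poly = polynomial_gain(n, s, k)   # the input size is multiplied by k**(1/s)
N_exp = exponential_gain(n, s, k)   # the input size only grows by log_s(k)
print(N_poly, N_exp)
```

With these values the polynomial algorithm handles ten times as much data, while the exponential one gains fewer than seven additional items: the speed-up multiplies the input size in the first case and only adds a constant in the second.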

  5. Publications on sequence algorithms
     Mercatanti A., Rainaldi G., Mariani L., Marangoni R., Citti L. A method for prediction of accessible sites on an mRNA sequence for target selection of hammerhead ribozymes. J. Computational Biology, 4(9) 641-653, 2002.
     Menconi G., Marangoni R. A compression-based approach for coding sequences identification in prokaryotic genomes. J. Computational Biology (to appear).
     Corsi C., Ferragina P., Marangoni R. The bioPrompt-box: an ontology-based clustering tool for searching in biological databases. BMC bioinformatics (to appear).
     Cozza A., Morandin F., Galfrè S.G., Mariotti V., Marangoni R., Pellegrini S. TAMGeS: a Three-Array Method for Genotyping of SNPs by a dual-color approach. BMC genomics (to appear).
     Felicioli C., Marangoni R. BpMatch: an efficient algorithm for segmenting sequences, calculating genomic distance and counting repeats (submitted).
     Ferragina P. String search in external memory: algorithms and data structures. Handbook of Computational Molecular Biology, CRC Press, 2005.

  6. Publications on motifs
     N. Pisanti, M. Crochemore, R. Grossi, M.-F. Sagot. A Comparative Study of Bases for Motif Inference. NATO Series on String Algorithmics, 2004.
     N. Pisanti, M. Crochemore, R. Grossi, M.-F. Sagot. Bases of Motifs for Generating Repeated Patterns with Wild Cards. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2(1) 40-50, 2005.
     C.S. Iliopoulos, J. McHugh, P. Peterlongo, N. Pisanti, W. Rytter, M. Sagot. A first approach to finding common motifs with gaps. International Journal of Foundations of Computer Science 16(6) 1145-1155, 2005.
     N. Pisanti, H. Soldano, M. Carpentier, J. Pothier. Implicit and Explicit Representation of Approximated Motifs. In: Algorithms for Bioinformatics, C. Iliopoulos et al., editors, King's College London Press, 2006.
     P. Peterlongo, N. Pisanti, F. Boyer, A. Pereira do Lago, M.-F. Sagot. Lossless filter for multiple repetitions with Hamming Distance. Journal of Discrete Algorithms, 2007 (to appear).

  7. Major collaborations on motifs
     Lyon (group of Marie-France Sagot)
     Grenoble (group of Alan Vieri)
     Paris (group of Henri Soldano)
     Marne-la-Vallée (group of Maxime Crochemore)

  8. Paralogy tree construction via transformation distance
     Pisanti N., Marangoni R., Ferragina P., Frangioni A., Savona A., Pisanelli C., Luccio F. PaTre: A Method for Paralogy Trees Construction. J. Computational Biology, 5 (10) 791-802, 2003.
     How does the genomic information increase?
     External imports: transfections, horizontal transfer
     Endogenous mechanisms: (genic or genomic) duplications: large scale, tandem, dispersed, single gene

  9. The fate of the copy
     Non-functional: pseudogene
     Functional: paralog
     Genome as a set of families of paralogs: PARALOGY TREE
     - How does the genome choose the paralog to duplicate within a family?
     - Is the duplication rate constant among the various families?
     - Are sparse duplications correlated to sparse deletions?

  10. Couple-comparison method: Transformation Distance (TD)
      Often the newest genes are the shortest ones.
      Inserting sequences implies paying a metabolic cost. Deleting sequences has no metabolic cost.
      We need an asymmetric distance: TD(S,T) = the cost of the minimum-length script able to transform S into T.
      Elementary operations: Insertion, Copy, Inverted copy

  11. TD: an example
      S = ATCGATCAGCTGCCCAATGAATCAGATAAAGTTTC
          (segments f, g, h; ruler marks at positions 1, 11, 16, 25, 35)
      T = ATCGATCAGCTTTCACTACGAATGAATCAGATTGGTAGCTTTGAAATAG
          (segments f, g, h; ruler marks at positions 1, 11, 21, 38, 48)

      Script transforming S into T      Description
      1) copy f                         copy (1, 1, 11)
      2) insertion of TTCACTACG         insert (TTCACTACG)
      3) copy g                         copy (16, 21, 12)
      4) insertion of TGGTAGC           insert (TGGTAGC)
      5) inverted copy of h             copy (25, 38, 11, 1)
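The script on this slide can be replayed with a few lines of Python. This is a minimal sketch under stated assumptions: copy (i, j, l) is read as "copy l characters starting at 1-based position i of S", the target position j is ignored because the output is built left to right, and a trailing 1 is read as plain reversal of the copied segment. These readings are inferred from the example strings, and the helper `apply_script` is hypothetical. The sketch replays the first three steps and checks the inverted copy of h on its own.

```python
S = "ATCGATCAGCTGCCCAATGAATCAGATAAAGTTTC"
T = "ATCGATCAGCTTTCACTACGAATGAATCAGATTGGTAGCTTTGAAATAG"


def apply_script(source, script):
    """Build a target string by applying insert/copy operations in order.

    Operations (hypothetical encoding of the slide's notation):
      ("insert", text)                        append literal text
      ("copy", start, length, inverted)       append source[start .. start+length-1]
                                              (1-based), reversed if inverted
    """
    out = []
    for op in script:
        if op[0] == "insert":
            out.append(op[1])
        elif op[0] == "copy":
            _, start, length, inverted = op
            segment = source[start - 1:start - 1 + length]
            out.append(segment[::-1] if inverted else segment)
        else:
            raise ValueError(f"unknown operation: {op[0]}")
    return "".join(out)


# Steps 1-3 of the slide: copy f, insert TTCACTACG, copy g.
first_steps = [
    ("copy", 1, 11, False),
    ("insert", "TTCACTACG"),
    ("copy", 16, 12, False),
]
prefix = apply_script(S, first_steps)

# Step 5 of the slide on its own: inverted copy of h (S positions 25..35).
inverted_h = apply_script(S, [("copy", 25, 11, True)])

print(T.startswith(prefix), T.endswith(inverted_h))
```

Under this reading, the first three steps reproduce the beginning of T, and the reversed segment h matches the end of T.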

  12. PaTre
      Input: TD values for each possible couple made by the genes of the family
      Building-up of the directed graph of distances
      Edmonds’ algorithm: extraction of the LSA (Lightest Spanning Arborescence), i.e. the optimal paralogy tree
      Generation of optimal and sub-optimal trees (space of quasi-optimal solutions)
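The pipeline on this slide can be illustrated on a toy family. The sketch below does not implement Edmonds' algorithm: for brevity it enumerates every spanning arborescence by brute force, which is only feasible for a handful of genes, and ranks them by cost so that both the lightest tree and the near-optimal ones are visible. The gene names and TD values are invented for illustration, and an arc (u, v) is assumed to carry TD(u, v), the cost of transforming u into v.

```python
from itertools import product


def reaches_root(node, parent, root):
    """True if following parent links from node ends at root without a cycle."""
    seen = set()
    while node != root:
        if node in seen:
            return False
        seen.add(node)
        node = parent[node]
    return True


def spanning_arborescences(nodes, weight):
    """Yield (cost, root, parent_map) for every spanning arborescence.

    Brute force: pick a root, pick one incoming arc per non-root node,
    and keep the choice only if every node is connected to the root.
    """
    for root in nodes:
        others = [v for v in nodes if v != root]
        choices = [[u for u in nodes if (u, v) in weight] for v in others]
        for parents in product(*choices):
            parent = dict(zip(others, parents))
            if all(reaches_root(v, parent, root) for v in others):
                cost = sum(weight[(parent[v], v)] for v in others)
                yield cost, root, parent


# Hypothetical TD values for a family of three genes (not real data).
genes = ["a", "b", "c"]
td = {
    ("a", "b"): 2, ("b", "a"): 5,
    ("a", "c"): 4, ("c", "a"): 6,
    ("b", "c"): 1, ("c", "b"): 3,
}

# Rank all candidate paralogy trees from lightest to heaviest.
ranked = sorted(spanning_arborescences(genes, td), key=lambda t: t[0])
best_cost, best_root, best_parent = ranked[0]
print(best_cost, best_root, best_parent)
```

Because the distance is asymmetric, the choice of root matters: the same three genes give different costs depending on which one is taken as the ancestor. For realistic family sizes the enumeration would be replaced by Edmonds' algorithm, as the slide states.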

  13. PaTre has been tested by simulation, because there are no experimental data on the history of families of genes.

  14. [Figure: the simulated paralogy tree for the Ribosomal Proteins family of M. pneumoniae, with nodes str01 to str11. Output from PaTre for the simulated family MFINFRP: Cost: 7840 - Distance: 0%.]

  15. [Figure: the paralogy tree reconstructed by ClustalW for the Ribosomal Proteins genic family of M. pneumoniae, with nodes str01 to str11, shown next to the PaTre tree for MFINFRP (Cost: 7840 - Distance: 0%).]

  16. After having tested PaTre on many examples, we could conclude that PaTre is able to correctly reconstruct the simulated history of genetic families, while ClustalW and other similarity-based methods fail.
