Google Matrix Analysis of DNA Sequences
Vivek Kandiah and Dima Shepelyansky (Quantware group, CNRS, Toulouse)
Laboratoire de Physique Théorique, IRSAMC, UMR 5152 du CNRS, Université Paul Sabatier, Toulouse
Supported by EC FET open project NADINE
21 June 2014
Introduction: motivation
Large and accurate genomic datasets are available for several species (http://www.ensembl.org/).
Interest in detecting specific or rare patterns in a given sequence.
New viewpoint: the sequence as a directed network.
Google matrix: $G_{ij} = \alpha S_{ij} + (1-\alpha)/N$ with $S_{ij} = T_{ij} / \sum_{i'} T_{i'j}$ (each column of $S$ sums to one), where $T$ describes the transitions between nearby words.
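The Google matrix formula can be sketched in Python with NumPy as follows. The function name `google_matrix`, the default `alpha = 0.85` (the conventional PageRank value, not stated on the slide) and the handling of empty columns (replaced by a uniform column) are assumptions made for illustration only.

```python
import numpy as np

def google_matrix(T, alpha=0.85):
    """Return G = alpha*S + (1-alpha)/N, where T[i, j] counts transitions j -> i."""
    T = np.asarray(T, dtype=float)
    n = T.shape[0]
    col_sums = T.sum(axis=0)
    # Assumption: a word with no outgoing transition gets a uniform column 1/N.
    S = np.full((n, n), 1.0 / n)
    nonzero = col_sums > 0
    # Column normalization: every column of S sums to one.
    S[:, nonzero] = T[:, nonzero] / col_sums[nonzero]
    return alpha * S + (1.0 - alpha) / n
```

With this convention every column of the returned matrix sums to one, so the matrix describes a Markov chain on the set of words.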
Introduction: from DNA sequence to network
Species and sequence lengths: Bos taurus (bull, $L \approx 2.9 \cdot 10^{9}$ bp); Canis familiaris (dog, $L \approx 2.5 \cdot 10^{9}$ bp); Loxodonta africana (elephant, $L \approx 3.1 \cdot 10^{9}$ bp); Homo sapiens (human, $L \approx 1.5 \cdot 10^{10}$ bp) and Danio rerio (zebrafish, $L \approx 1.4 \cdot 10^{9}$ bp).
The sequence is read as a chain of words of $m$ letters: ... TCG ATAT CTGG TAAC CTA ... with $W_{k-1}$ = ATAT, $W_k$ = CTGG, $W_{k+1}$ = TAAC, giving $\to W_{k-1} \to W_k \to W_{k+1} \to$.
$T_{ij} \to T_{ij} + 1$ whenever word $j$ points to word $i$.
Full matrix limit: $L/(mN^2) \approx 10$ to $100$ transitions per element at $m = 6$ (with $N = 4^m = 4096$ words).
For comparison, webpages have about 10 links per node on average with $N \approx 2 \cdot 10^{5}$.
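The counting rule for $T$ can be sketched as below. This is a minimal illustration, assuming non-overlapping words of length `m` read from the start of the sequence and the letter order `ACGT` for indexing. The names `word_index` and `transition_counts` are hypothetical, and skipping words that contain other symbols (such as `N`) is also an assumption.

```python
import numpy as np

LETTERS = "ACGT"  # assumed letter order used to index words

def word_index(word):
    """Map a word over ACGT to an integer in [0, 4**m)."""
    idx = 0
    for ch in word:
        idx = idx * 4 + LETTERS.index(ch)
    return idx

def transition_counts(seq, m):
    """Count transitions between consecutive non-overlapping words of length m.

    T[i, j] is incremented whenever word j is followed by word i.
    """
    seq = seq.upper()
    n = 4 ** m
    T = np.zeros((n, n))
    prev = None
    for k in range(0, len(seq) - m + 1, m):
        word = seq[k:k + m]
        if not all(ch in LETTERS for ch in word):
            prev = None  # assumption: break the chain at unknown letters
            continue
        cur = word_index(word)
        if prev is not None:
            T[cur, prev] += 1
        prev = cur
    return T
```

For example, `transition_counts("ACGTACGTAC", 2)` reads the words AC, GT, AC, GT, AC and records two transitions AC → GT and two transitions GT → AC.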
Statistics of Google matrix elements
[Figure: $\log_{10}(N_g/N^2)$ versus $\log_{10} g$, two panels.]
Integrated fraction $N_g/N^2$ of Google matrix elements with $G_{ij} > g$ as a function of $g$. Left panel: various species at word length $m = 6$: bull BT (magenta), dog CF (red), elephant LA (green), Homo sapiens HS (blue) and zebrafish DR (black). Right panel: data for the HS sequence with words of length $m = 5$ (brown), 6 (blue), 7 (red). For comparison, the black dashed and dotted curves show the same distribution for the WWW networks of the Universities of Cambridge and Oxford in 2006, respectively.
Oscillations are visible, but the decay follows a universal law $N_g \propto 1/g^{\nu-1}$ with $\nu \approx 2.5$ (range $-5.5 < \log_{10} g < -0.5$).
The distribution of outgoing links in WWW networks decays with $\tilde{\nu} \approx 2.7$.
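The integrated fraction and the exponent $\nu$ could be estimated along the following lines. This is a sketch only: the function names are hypothetical, and the straight-line fit in log-log scale is an assumed procedure, not necessarily the one used by the authors.

```python
import numpy as np

def integrated_fraction(G, thresholds):
    """Fraction N_g / N^2 of matrix elements with G_ij > g, for each g."""
    values = np.sort(np.asarray(G, dtype=float).ravel())
    # Index of the first element strictly greater than each threshold.
    idx = np.searchsorted(values, thresholds, side="right")
    return (values.size - idx) / values.size

def decay_exponent(thresholds, fractions):
    """Fit log10(fraction) = -(nu - 1) * log10(g) + const and return nu."""
    thresholds = np.asarray(thresholds, dtype=float)
    fractions = np.asarray(fractions, dtype=float)
    ok = fractions > 0  # empty bins cannot be shown in log scale
    slope, _ = np.polyfit(np.log10(thresholds[ok]), np.log10(fractions[ok]), 1)
    return 1.0 - slope
```

In practice the thresholds would be restricted to the quoted range $-5.5 < \log_{10} g < -0.5$ before fitting.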
Statistics of Google matrix elements
[Figure: $\log_{10}(N_s/N)$ versus $\log_{10} g_s$, two panels.]
Integrated fraction $N_s/N$ of the sum of ingoing matrix elements with $\sum_{j=1}^{N} G_{ij} \geq g_s$. Left and right panels show the same cases as on the previous slide, in the same colors. The dashed and dotted curves are shifted along the x-axis by one unit to the left to fit the figure scale.
Differences between species are visible, but the curves stay close to a universal decay $N_s \propto 1/g_s^{\mu-1}$ with $\mu \approx 5$.
The distribution of ingoing links in WWW networks decays with $\tilde{\mu} \approx 2.1$.
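The same kind of statistic for the ingoing sums can be sketched as follows. The name `ingoing_sum_fraction` is hypothetical, and the code assumes the convention in which row $i$ of $G$ collects the elements pointing to word $i$.

```python
import numpy as np

def ingoing_sum_fraction(G, thresholds):
    """Fraction N_s / N of words whose ingoing sum sum_j G_ij is >= g_s."""
    row_sums = np.sort(np.asarray(G, dtype=float).sum(axis=1))
    # Index of the first row sum greater than or equal to each threshold.
    idx = np.searchsorted(row_sums, thresholds, side="left")
    return (row_sums.size - idx) / row_sums.size
```

Because every column of $G$ sums to one, the ingoing sums average to exactly one, which is why the $g_s$ axis is centred around $\log_{10} g_s = 0$.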
Spectrum and PageRank
[Figure: eigenvalues in the complex plane, panels a) to e).]
Eigenvalue spectrum at $m = 6$ of a) Bos taurus, b) Canis familiaris, c) Loxodonta africana, d) Homo sapiens and e) Danio rerio.
Presence of a large gap.
HS ∼ CF, and strong differences between mammalian and non-mammalian sequences.
The spectra of $G$ and $G^*$ are identical.
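A spectrum of this kind, and the size of its gap, can be computed by dense diagonalization, which is feasible at $m = 6$ where $N = 4096$. The sketch below uses `numpy.linalg.eigvals`. The name `spectrum_and_gap` and the definition of the gap as $1 - |\lambda_2|$ are assumptions for illustration.

```python
import numpy as np

def spectrum_and_gap(G):
    """Return eigenvalues sorted by decreasing modulus and the gap 1 - |lambda_2|."""
    lam = np.linalg.eigvals(np.asarray(G, dtype=float))
    lam = lam[np.argsort(-np.abs(lam))]
    # For a column-stochastic matrix the leading eigenvalue is 1.
    gap = 1.0 - np.abs(lam[1])
    return lam, gap
```

For the 2 × 2 column-stochastic matrix `[[0.9, 0.2], [0.1, 0.8]]` the eigenvalues are 1 and 0.7, so the gap is 0.3.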
Spectrum and PageRank
[Figure: $\log_{10} P$ versus $\log_{10} K$, two panels.]
PageRank probability decay of several species at $m = 6$ (left) and of Homo sapiens at $m = 5$, $m = 6$ and $m = 7$ (right).
PageRank ∼ frequency of words.
$P(K) \sim 1/K^{\beta}$ with $\beta = 1/(\mu - 1)$.
At $m = 6$: $\beta = 0.273 \pm 0.005$ (BT), $0.340 \pm 0.005$ (CF), $0.281 \pm 0.005$ (LA), $0.308 \pm 0.005$ (HS), $0.426 \pm 0.008$ (DR) in the range $1 \leq \log_{10} K \leq 3.3$.
Small variation between mammalian species, stable with word length.
Top five PageRank entries of DNA sequences:
BT      CF      LA      HS      DR
TTTTTT  TTTTTT  AAAAAA  TTTTTT  ATATAT
AAAAAA  AAAAAA  TTTTTT  AAAAAA  TATATA
ATTTTT  AATAAA  ATTTTT  ATTTTT  AAAAAA
AAAAAT  TTTATT  AAAAAT  AAAAAT  TTTTTT
TTCTTT  AAATAA  AGAAAA  TATTTT  AATAAA
Last five PageRank entries of DNA sequences:
BT      CF      LA      HS      DR
CGCGTA  TACGCG  CGCGTA  TACGCG  CCGACG
TACGCG  CGCGTA  TACGCG  CGCGTA  CGTCGG
CGTACG  TCGCGA  ATCGCG  CGTACG  CGTCGA
CGATCG  CGTACG  TCGCGA  TCGACG  TCGACG
ATCGCG  CGATCG  CGCGAT  CGTCGA  TCGTCG
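The PageRank vector, the rank index $K$ and the exponent $\beta$ can be obtained as sketched below. Power iteration is a standard way to compute PageRank, though the slides do not say which method was used. The function names, the tolerance and the log-log least-squares fit are assumptions.

```python
import numpy as np

def pagerank(G, tol=1e-12, max_iter=10000):
    """Stationary vector of a column-stochastic matrix G by power iteration."""
    G = np.asarray(G, dtype=float)
    p = np.full(G.shape[0], 1.0 / G.shape[0])
    for _ in range(max_iter):
        q = G @ p
        q /= q.sum()  # keep the vector normalized to unit sum
        if np.abs(q - p).sum() < tol:
            return q
        p = q
    return p

def rank_index(p):
    """Rank K(i) of each word: K = 1 for the largest PageRank probability."""
    order = np.argsort(-np.asarray(p), kind="stable")
    K = np.empty(len(order), dtype=int)
    K[order] = np.arange(1, len(order) + 1)
    return K

def fit_beta(p, k_min, k_max):
    """Fit P(K) ~ 1/K**beta on k_min <= K <= k_max and return beta."""
    sorted_p = np.sort(np.asarray(p, dtype=float))[::-1]
    K = np.arange(1, sorted_p.size + 1)
    mask = (K >= k_min) & (K <= k_max)
    slope, _ = np.polyfit(np.log10(K[mask]), np.log10(sorted_p[mask]), 1)
    return -slope
```

The fit range quoted on the slide, $1 \leq \log_{10} K \leq 3.3$, would correspond to `k_min = 10` and `k_max` close to 2000.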
Spectrum and PageRank
[Figure: $\log_{10} P$ versus $\log_{10} K$, $\log_{10}(N_g/N^2)$ versus $\log_{10} g$, and eigenvalues in the complex plane.]
Random matrix model with distribution of elements corresponding to HS at $m = 6$.
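The slide does not specify how the random matrix model is built. One plausible reading, sketched here purely as an assumption, is to draw every element independently from the empirical distribution of the elements of $G$ and then renormalize the columns. The name `random_matrix_model` is hypothetical, and the code assumes strictly positive elements so that no column sum vanishes.

```python
import numpy as np

def random_matrix_model(G, seed=0):
    """Random column-stochastic matrix with elements resampled from those of G."""
    rng = np.random.default_rng(seed)
    G = np.asarray(G, dtype=float)
    # Assumption: elements are drawn independently, with replacement,
    # from the empirical distribution of the elements of G.
    R = rng.choice(G.ravel(), size=G.shape, replace=True)
    # Restore the Markov property: every column sums to one.
    return R / R.sum(axis=0)
```

Such a matrix keeps approximately the element statistics of the original one while destroying any correlation between elements.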
Statistical proximity
[Figure: four panels showing $K_{bt}$, $K_{cf}$, $K_{la}$ and $K_{dr}$ versus $K_{hs}$.]
PageRank proximity $K$-$K$ plane diagrams for different species in comparison with Homo sapiens.
$\zeta(s_1, s_2) = \sqrt{\sum_{i=1}^{N} \left( K_{s_1}(i) - K_{s_2}(i) \right)^2 / N} \; / \; \sigma_{rnd}$
$\zeta(HS, CF) = 0.206$, $\zeta(HS, LA) = 0.238$, $\zeta(HS, BT) = 0.246$,
$\zeta(LA, CF) = 0.303$, $\zeta(CF, BT) = 0.308$, $\zeta(LA, BT) = 0.324$,
$\zeta(DR, HS) = 0.375$, $\zeta(DR, CF) = 0.414$, $\zeta(DR, LA) = 0.422$, $\zeta(DR, BT) = 0.425$
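The proximity measure $\zeta$ can be sketched as below. The slide does not define $\sigma_{rnd}$. The code assumes it is the root-mean-square rank difference expected for two independent random rankings, $\sqrt{(N^2 - 1)/6} \approx N/\sqrt{6}$. The name `zeta` is hypothetical.

```python
import numpy as np

def zeta(K1, K2):
    """Normalized root-mean-square difference between two rankings of the same words."""
    K1 = np.asarray(K1, dtype=float)
    K2 = np.asarray(K2, dtype=float)
    n = K1.size
    sigma = np.sqrt(np.mean((K1 - K2) ** 2))
    # Assumption: sigma_rnd is the value for two independent random permutations,
    # E[(K1 - K2)^2] = 2 Var(K) = (n^2 - 1) / 6.
    sigma_rnd = np.sqrt((n * n - 1) / 6.0)
    return sigma / sigma_rnd
```

With this normalization identical rankings give $\zeta = 0$, unrelated rankings give $\zeta \approx 1$, and fully reversed rankings give $\zeta = \sqrt{2}$.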
Statistical proximity
[Figure: two panels showing $K_{hs2}$ versus $K_{hs1}$.]
PageRank proximity $K$-$K$ plane diagrams between two Homo sapiens individuals.
$\zeta = 0.031$
Conclusion and perspectives
Complex spectrum of the Google matrix with a large gap.
Structural differences and similarities between DNA and the WWW, seen through $G_{ij}$.
DNA sequences: $\mu \approx 5$ → slow PageRank decay with $\beta \approx 0.25$ (for the WWW, $\beta \approx 0.9$).
PageRank correlations show the statistical similarity between species from a Markov chain point of view.
Random matrix model reproducing the spectrum.
Other eigenmodes may highlight relatively long-lived relaxation modes and might be localized on a particular set of words.
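The step from $\mu$ to $\beta$ can be spelled out as a short worked calculation. It is a sketch under the assumption that the PageRank probability of a word scales like its sum of ingoing matrix elements $g_s$, so that the rank $K$ of a word equals the number $N_s$ of words with a larger ingoing sum. It is not a derivation taken from the slides.

```latex
\begin{align*}
K = N_s(g_s) &\propto g_s^{-(\mu-1)} \propto P^{-(\mu-1)} \\
\Rightarrow\quad P(K) &\propto K^{-1/(\mu-1)}, \qquad \beta = \frac{1}{\mu-1} \\
\text{DNA: } \mu \approx 5 &\;\Rightarrow\; \beta \approx 1/4 = 0.25 \\
\text{WWW: } \tilde{\mu} \approx 2.1 &\;\Rightarrow\; \beta \approx 1/1.1 \approx 0.91
\end{align*}
```

The fitted values at $m = 6$ (0.273 to 0.426) lie somewhat above 0.25, so the relation should be read as an estimate, not an exact identity.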