
Digital trees and DNA sequences (Arbres digitaux et suites d'ADN), Brigitte CHAUVIN (Versailles)



  1. Digital trees and DNA sequences (Arbres digitaux et suites d'ADN). Brigitte CHAUVIN (Versailles), in collaboration with Peggy CÉNAC (Univ. Bourgogne), Eric FEKETE, Stéphane GINOUILLAC, Nicolas POUYANNE (Versailles). INRIA, 26 May 2008

  2. Outline
     - Introduction
     - Tree representation
     - Where the randomness is
     - What is known
     - Results
     - Methods

  5. Introduction
     - A DNA sequence is an infinite word $U = u_1 u_2 \dots u_n \dots$ with $u_i \in \{A, C, G, T\}$ for all $i$.
     - What a representation should make visible:
       - repetitions of patterns
       - missing patterns
       - distribution of the different possible patterns
       - comparison of different sequences
     - Can we identify characteristics that are
       - easy to study on the representation
       - different from one species to another?
     - Goals: distance between species, stat

  6. Tree representation
     $U = u_1 u_2 \dots u_n \dots$

     Prefixes        Rev. prefixes   Suffixes
     u_1             u_1             u_1 u_2 u_3 u_4 ...
     u_1 u_2         u_2 u_1         u_2 u_3 u_4 ...
     u_1 u_2 u_3     u_3 u_2 u_1     u_3 u_4 ...
     ...             ...             ...

     Four possible trees:
     - suffix trie
     - DST (digital search tree) of reversed prefixes
     - trie of reversed prefixes
     - suffix DST
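The three families of words on this slide can be listed for a finite window of $U$. The following is a minimal sketch; the function names are hypothetical and a finite string stands in for the infinite word.

```python
def prefixes(word):
    """u_1, u_1 u_2, u_1 u_2 u_3, ..."""
    return [word[:i] for i in range(1, len(word) + 1)]


def reversed_prefixes(word):
    """u_1, u_2 u_1, u_3 u_2 u_1, ... (each prefix read backwards)."""
    return [p[::-1] for p in prefixes(word)]


def suffixes(word):
    """u_1 u_2 u_3 ..., u_2 u_3 ..., u_3 ... (truncated to the window)."""
    return [word[i:] for i in range(len(word))]


if __name__ == "__main__":
    window = "ACGT"
    print(prefixes(window))           # ['A', 'AC', 'ACG', 'ACGT']
    print(reversed_prefixes(window))  # ['A', 'CA', 'GCA', 'TGCA']
    print(suffixes(window))           # ['ACGT', 'CGT', 'GT', 'T']
```

Either the reversed prefixes or the suffixes can then be inserted, one at a time, into a trie or into a DST.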

  14. Example. Suffix trie.
      $U = 1001011001110\dots$
      The suffixes are inserted one at a time:
      S_1 = U = 1001011001110...
      S_2 = 001011001110...
      S_3 = 01011001110...
      S_4 = 1011001110...
      S_5 = 011001110...
      S_6 = 11001110...
      S_7 = 1001110...
      [Figure: the suffix trie after each successive insertion of S_1, ..., S_7, with edges labelled 0 and 1 and one leaf per suffix.]
      The shape of the tree is closely related to the repetitions of patterns.
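The link between the shape of the trie and repeated patterns can be checked on this example. The sketch below assumes the usual trie convention that each inserted word sits at a leaf whose depth is one more than the longest prefix it shares with another inserted word; the function names are hypothetical and only the 13 visible letters of $U$ are used.

```python
def common_prefix_length(a, b):
    """Number of leading letters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def suffix_trie_leaf_depths(word, n):
    """Depth of the leaf of S_1, ..., S_n in the suffix trie (n >= 2).

    `word` is a finite window of U; it must be long enough for every
    pair of suffixes to differ inside the window.
    """
    if not 2 <= n <= len(word):
        raise ValueError("need 2 <= n <= len(word)")
    suffixes = [word[i:] for i in range(n)]
    depths = []
    for i, s in enumerate(suffixes):
        shared = 0
        for j, t in enumerate(suffixes):
            if j == i:
                continue
            c = common_prefix_length(s, t)
            if c == min(len(s), len(t)):
                raise ValueError("window too short to separate two suffixes")
            shared = max(shared, c)
        # The leaf sits just below the longest repeated prefix.
        depths.append(shared + 1)
    return depths


if __name__ == "__main__":
    print(suffix_trie_leaf_depths("1001011001110", 7))  # [5, 2, 3, 3, 3, 2, 5]
```

S_1 and S_7 are the deepest leaves because they share the repeated pattern 1001; the shortest branches (depth 2) belong to S_2 and S_6.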

  16. Where is the randomness?
      It comes from the production of the letters, drawn from {0, 1}, from {A, C, G, T} or from any finite alphabet.
      For a given word $U = u_1 u_2 \dots u_n \dots$, the tree process $(T_n)_{n \ge 0}$ is nonrandom.
      Different kinds of sources:
      - Memoryless: Bernoulli or asymmetric i.i.d.
      - Markov
      - General probabilistic source:
        - choose an infinite word $U = u_1 u_2 \dots u_n \dots$ with distribution $\mu$
        - call $T$ the shift
        - add mixing assumptions (later).
      The inserted words (suffixes or reversed prefixes) are NOT independent.
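The first two kinds of sources can be sampled in a few lines. This is a sketch under stated assumptions: the function names, the dictionary layout for the letter probabilities and the transition matrix are all hypothetical, and a finite window of $U$ is produced.

```python
import random


def memoryless_word(n, probs, rng):
    """n i.i.d. letters; probs maps each letter to its probability."""
    letters = list(probs)
    weights = [probs[a] for a in letters]
    return "".join(rng.choices(letters, weights=weights, k=n))


def markov_word(n, initial, transition, rng):
    """n >= 1 letters of a Markov chain.

    initial[a] is the probability of starting with a;
    transition[a][b] is the probability that b follows a.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    letters = list(initial)
    current = rng.choices(letters, weights=[initial[a] for a in letters])[0]
    out = [current]
    for _ in range(n - 1):
        row = transition[current]
        current = rng.choices(letters, weights=[row[a] for a in letters])[0]
        out.append(current)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(2008)
    # Asymmetric i.i.d. source on the DNA alphabet (illustrative values).
    print(memoryless_word(30, {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}, rng))
```

Whatever the source, the words inserted in the tree (suffixes or reversed prefixes of the same $U$) overlap, which is why they are not independent.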

  17. What is known: DST
      For independent words
      - Bernoulli source
        - height, insertion depth, profile: cf. Mahmoud (92)
        - $H_n - \log_2 n \xrightarrow{P} 0$: Aldous-Shields (98)
        - concentration of the height: Drmota (02)
      - i.i.d. asymmetric, Markov source
        - insertion depth, height, strong convergences: Pittel (85)
      From an infinite word
      - i.i.d. or Markov source: Cénac et al. (07)

  18. What is known Suffix tries • height Devroye, Szpankowski (92) (i.i.d. source) • depth, fill-up level, height Jacquet, Szpankowski (93) (general source + mixing) • average size and total path length Fayolle (06) (iid assym., Markov) • fill-up level C´ enac, Fekete (general source + not too strong mixing) (in progress)

  19. Two families of methods

      (1) analytic combinatorics                   (2) probability
      generating functions, Mellin transform
      ↓                                            ↓
      precise asymptotics on                       a.s. convergences
      - the average of additive characteristics
      - distribution of the height

      Common to both: correlations, overlapping of words

  22. Some notations to write the results
      - The probability that the source produces a sequence of symbols starting with the pattern $m$ is
        $$p_m = \int_{I_m} f(t)\,dt.$$
      - $s = s_1 s_2 \dots s_n \dots$ denotes an infinite deterministic sequence, and $s^{(n)} = s_1 s_2 \dots s_n$.
      - Entropies:
        $$h_+ = \lim_{n \to +\infty} \frac{1}{n} \max_{s^{(n)}} \ln\Big(\frac{1}{p_{s^{(n)}}}\Big), \qquad h_- = \lim_{n \to +\infty} \frac{1}{n} \min_{s^{(n)}} \ln\Big(\frac{1}{p_{s^{(n)}}}\Big),$$
        $$h = \lim_{n \to +\infty} \frac{1}{n}\, \mathbb{E}\Big[\ln\Big(\frac{1}{p_{U^{(n)}}}\Big)\Big].$$
      - $\tilde{\ell}_n$ = length of the shortest branch of the tree; fill-up level = $\ell_n$.
        $L_n$ = length of the longest branch of the tree.
        $D_n$ = insertion depth.
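As an illustration of these three entropies, consider (as an assumption made for this example only) a memoryless source on $\{0, 1\}$ with $P(u_i = 1) = p$ and $P(u_i = 0) = q = 1 - p$, $0 < p < 1$. A word $s^{(n)}$ containing $j$ ones has probability $p_{s^{(n)}} = p^{j} q^{\,n-j}$, so the rarest word of length $n$ has probability $\min(p,q)^n$ and the most frequent one $\max(p,q)^n$. The limits then reduce to

$$h_+ = \ln\frac{1}{\min(p,q)}, \qquad h_- = \ln\frac{1}{\max(p,q)}, \qquad h = -p\ln p - q\ln q,$$

and $h_- \le h \le h_+$, with equality throughout exactly when $p = q = 1/2$, where all three equal $\ln 2$.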

  24. Results
      $\ell_n$ = fill-up level, $L_n$ = length of the longest branch of the tree, $D_n$ = insertion depth.
      Theorem (Cénac et al. (07)). For the DST for a memoryless source or a Markovian source,
      $$\frac{\ell_n}{\ln n} \xrightarrow[n \to \infty]{a.s.} \frac{1}{h_+} \quad \text{and} \quad \frac{L_n}{\ln n} \xrightarrow[n \to \infty]{a.s.} \frac{1}{h_-},$$
      $$\frac{D_n}{\ln n} \xrightarrow[n \to \infty]{P} \frac{1}{h}.$$
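These limits can be explored numerically. The sketch below makes several assumptions that are not taken from the slides: a binary memoryless source with $P(1) = p$, a DST fed with the reversed prefixes of $U$, a root that holds no word, and $\ell_n$ computed as the length of the shortest branch. All function names are hypothetical.

```python
import math
import random


def build_dst(words):
    """Insert words one by one into a DST.

    A node is identified by the prefix read from the root; the root ""
    holds no word (assumed convention). Returns the node set and the
    insertion depth of each word.
    """
    nodes = {""}
    depths = []
    for w in words:
        k = 1
        # Follow the letters of w until the path leaves the tree.
        while k <= len(w) and w[:k] in nodes:
            k += 1
        if k > len(w):
            raise ValueError("word too short to be inserted")
        nodes.add(w[:k])
        depths.append(k)
    return nodes, depths


def shortest_and_longest_branch(nodes, alphabet="01"):
    """Shortest branch: least depth of a node with a missing child."""
    longest = max(len(v) for v in nodes)
    shortest = min(len(v) for v in nodes
                   if any(v + a not in nodes for a in alphabet))
    return shortest, longest


def simulate(n, p, seed=0, cap=200):
    """Return (l_n, L_n, D_n) / ln n for one random word of length n."""
    rng = random.Random(seed)
    u = "".join("1" if rng.random() < p else "0" for _ in range(n))
    # i-th reversed prefix, truncated to its first `cap` letters.
    words = (u[max(0, i - cap):i][::-1] for i in range(1, n + 1))
    nodes, depths = build_dst(words)
    shortest, longest = shortest_and_longest_branch(nodes)
    log_n = math.log(n)
    return shortest / log_n, longest / log_n, depths[-1] / log_n


if __name__ == "__main__":
    print(simulate(20000, 0.7))
```

For $p = 0.7$ the limits in the theorem are about $1/h_+ \approx 0.83$, $1/h_- \approx 2.80$ and $1/h \approx 1.64$. Since the normalisation is $\ln n$, the ratios for a single finite $n$ can still be far from these values.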

  25. In progress
      $\ell_n$ = fill-up level, $L_n$ = length of the longest branch of the tree, $D_n$ = insertion depth.
      Theorem. For the suffix trie for a general source with mixing conditions,
      $$\frac{\ell_n}{\ln n} \xrightarrow[n \to \infty]{a.s.} \frac{1}{h_+}.$$

  28. Methods - 1 - Runs well (works for the DST and for the suffix trie)
      - $s = s_1 s_2 \dots s_n \dots$ denotes an infinite deterministic sequence, and $s^{(n)} = s_1 s_2 \dots s_n$.
      - $T_k(s) \overset{\text{def}}{=}$ size of the first tree in which $s^{(k)}$ is inserted,
        $X_n(s) \overset{\text{def}}{=}$ length of the branch corresponding to $s$ in $T_n$.
        $$\ell_n = \min_s X_n(s) \quad \text{and} \quad L_n = \max_s X_n(s).$$
      - $X_n$ and $T_k$ are in duality:
        $$\{X_n(s) \ge k\} = \{T_k(s) \le n\}.$$
        $$P(\ell_n \le k - 1) \le \sum_{s^{(k)}} P(T_k(s) > n)$$
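The last inequality is a union bound combined with the duality. Reading $T_k(s)$ as depending on $s$ only through its prefix $s^{(k)}$, as its definition suggests, the step can be written out as

$$\{\ell_n \le k - 1\} = \bigcup_{s} \{X_n(s) \le k - 1\} = \bigcup_{s^{(k)}} \{X_n(s) < k\} = \bigcup_{s^{(k)}} \{T_k(s) > n\},$$

where the last two unions run over the finitely many words $s^{(k)}$ of length $k$. Bounding the probability of a finite union by the sum of the probabilities gives

$$P(\ell_n \le k - 1) \le \sum_{s^{(k)}} P(T_k(s) > n).$$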

  29. Methods - 1 - Runs well (works for the DST)
      $$P(\ell_n \le k - 1) \le \sum_{s^{(k)}} P(T_k(s) > n)$$
      - $s = s_1 s_2 \dots s_n \dots$ denotes an infinite deterministic sequence, and $s^{(n)} = s_1 s_2 \dots s_n$.
      - $T_k(s) \overset{\text{def}}{=}$ size of the first tree in which $s^{(k)}$ is inserted,
        $$T_k(s) = \sum_{r=1}^{k} \big(T_r(s) - T_{r-1}(s)\big) = \sum_{r=1}^{k} Z_r(s),$$
        where $Z_r(s)$ = waiting time of the first occurrence of $s^{(r)}$ in $U$ after $T_{r-1}$.
      Markov hypothesis ⇒ the random variables $Z_r(s)$ are independent.

  31. A bit of probability (works for the DST)
      Markov hypothesis ⇒ the random variables $Z_r(s)$ are independent.
      $$\begin{aligned}
      T_k(s) &= \sum_{r=1}^{k} \big(T_r(s) - T_{r-1}(s)\big) = \sum_{r=1}^{k} Z_r(s) \\
             &= \sum_{r=1}^{k} \big(Z_r(s) - \mathbb{E} Z_r(s)\big) + \sum_{r=1}^{k} \mathbb{E} Z_r(s) \\
             &= \sum_{r=1}^{k} \epsilon_r(s) + \mathbb{E} T_k(s) \\
             &= M_k(s) + \mathbb{E} T_k(s),
      \end{aligned}$$
      where $\epsilon_r(s) = Z_r(s) - \mathbb{E} Z_r(s)$ and $M_k(s)$ is a martingale. Taking logarithms,
      $$\log T_k(s) = \log \mathbb{E} T_k(s) + \log\Big(1 + \frac{M_k(s)}{\mathbb{E} T_k(s)}\Big),$$
      with $\log \mathbb{E} T_k(s) \sim k\,h(s)$ and, for all $\alpha > 0$, $\dfrac{M_k(s)}{\mathbb{E} T_k(s)} = o\big(k^{1+\alpha/2}\big)$, so the second term is $o(k)$ and
      $$\frac{\log T_k(s)}{k} \xrightarrow[k \to \infty]{a.s.} h(s).$$
