Digital trees and DNA sequences
Brigitte CHAUVIN (Versailles), in collaboration with Peggy CÉNAC (Univ. Bourgogne), Eric FEKETE, Stéphane GINOUILLAC, Nicolas POUYANNE (Versailles)
INRIA, 26 May 2008
Outline
◮ Introduction
◮ Tree representation
◮ Where the randomness is
◮ What is known
◮ Results
◮ Methods
Introduction
◮ A DNA sequence is an infinite word $U = u_1 u_2 \dots u_n \dots$ with $u_i \in \{A, C, G, T\}$ for all $i$.
◮ What a representation should make visible:
  ◮ repeated patterns
  ◮ missing patterns
  ◮ the distribution of the different possible patterns
  ◮ comparisons between different sequences
◮ Can we identify characteristics that are
  ◮ easy to study on the representation
  ◮ different from one species to another?
◮ Goals: distance between species, statistics
Tree representation
$U = u_1 u_2 \dots u_n \dots$

Prefixes          Reversed prefixes    Suffixes
$u_1$             $u_1$                $u_1 u_2 u_3 u_4 \dots$
$u_1 u_2$         $u_2 u_1$            $u_2 u_3 u_4 \dots$
$u_1 u_2 u_3$     $u_3 u_2 u_1$        $u_3 u_4 \dots$
...               ...                  ...

◮ suffix trie
◮ DST (digital search tree) of reversed prefixes
◮ trie of reversed prefixes
◮ suffix DST
Example. Suffix trie.
$U = 1001011001110\dots$
$S_1 = U = 1001011001110\dots$
$S_2 = 001011001110\dots$
$S_3 = 01011001110\dots$
$S_4 = 1011001110\dots$
$S_5 = 011001110\dots$
$S_6 = 11001110\dots$
$S_7 = 1001110\dots$
[Figure: the suffix trie after the successive insertions of $S_1, \dots, S_7$, with edges labelled 0 and 1. In the final tree, $S_2$ and $S_6$ are leaves at depth 2, $S_3$, $S_5$ and $S_4$ at depth 3, and $S_1$ and $S_7$ at depth 5.]
The shape of the tree is closely related to the repetitions of patterns.
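The insertion steps of this example can be replayed with a short sketch. It assumes one convention, read off the figure: each suffix is stored at a leaf labelled by its shortest prefix (of length at least 1) that no other inserted suffix shares. The function name `suffix_trie_leaves` is hypothetical, and only the 13 letters of $U$ shown on the slide are used.

```python
def suffix_trie_leaves(word, n):
    """Map each suffix index i (1-based) to the label of its leaf in the
    trie built from the first n suffixes of `word`.

    Assumed convention: the leaf of a suffix is its shortest prefix of
    length >= 1 that is not a prefix of any other inserted suffix.
    """
    suffixes = [word[i:] for i in range(n)]
    leaves = {}
    for i, s in enumerate(suffixes):
        others = suffixes[:i] + suffixes[i + 1:]
        k = 1
        # Lengthen the prefix until no other suffix starts with it.
        while any(o.startswith(s[:k]) for o in others):
            k += 1
            if k > len(s):
                # The finite truncation of U is too short to decide.
                raise ValueError("word too short to separate the suffixes")
        leaves[i + 1] = s[:k]
    return leaves


U = "1001011001110"
leaves = suffix_trie_leaves(U, 7)
depths = {i: len(label) for i, label in leaves.items()}
print(leaves)
print(depths)
```

The long common prefix 1001 of $S_1$ and $S_7$ is what pushes both leaves down to depth 5: a repeated pattern in $U$ shows up as a long branch.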
Where is the randomness?
It comes from the production of the letters, drawn from $\{0, 1\}$, from $\{A, C, G, T\}$ or from any finite alphabet. For a given word $U = u_1 u_2 \dots u_n \dots$, the tree process $(T_n)_{n \ge 0}$ is nonrandom.
Different kinds of sources:
◮ Memoryless: Bernoulli or asymmetric i.i.d.
◮ Markov
◮ General probabilistic source
  ◮ choose an infinite word $U = u_1 u_2 \dots u_n \dots$ with distribution $\mu$
  ◮ call $T$ the shift
  ◮ add mixing assumptions (later)
The inserted words (suffixes or reversed prefixes) are NOT independent.
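As an illustration of the first two kinds of sources, here is a minimal sketch that draws a finite piece of $U$ from a memoryless source and from a first-order Markov source. The function names, the letter probabilities and the "no immediate repetition" transition matrix are all hypothetical choices made for the example, not taken from the talk.

```python
import random


def memoryless_word(probs, n, rng):
    """Draw n i.i.d. letters; probs maps each letter to its probability."""
    letters = list(probs)
    weights = [probs[a] for a in letters]
    return "".join(rng.choices(letters, weights=weights, k=n))


def markov_word(initial, transition, n, rng):
    """Draw n letters from a first-order Markov chain.

    initial:    letter -> probability of the first letter
    transition: letter -> (letter -> probability of the next letter)
    """
    if n == 0:
        return ""
    word = [rng.choices(list(initial), weights=list(initial.values()))[0]]
    while len(word) < n:
        row = transition[word[-1]]
        word.append(rng.choices(list(row), weights=list(row.values()))[0])
    return "".join(word)


rng = random.Random(0)
dna_iid = memoryless_word({"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}, 20, rng)

uniform = {a: 0.25 for a in "ACGT"}
# Hypothetical transition matrix: a letter is never immediately repeated.
no_repeat = {a: {b: (0.0 if a == b else 1 / 3) for b in "ACGT"} for a in "ACGT"}
dna_markov = markov_word(uniform, no_repeat, 20, rng)

# Successive suffixes share all but one letter, even for an i.i.d. source.
suffixes = [dna_iid[i:] for i in range(5)]
print(dna_iid, dna_markov, suffixes)
```

The last lines show why the inserted words are dependent: each suffix is the previous one with its first letter removed, whatever the source.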
What is known
DST for independent words
Bernoulli source
• height, insertion depth, profile: cf. Mahmoud (92)
• $H_n - \log_2 n \xrightarrow{P} 0$: Aldous-Shields (98)
• concentration of the height: Drmota (02)
i.i.d. asymmetric, Markov source
• insertion depth, height, strong convergences: Pittel (85)
DST from an infinite word
• i.i.d. or Markov source: Cénac et al. (07)
What is known
Suffix tries
• height: Devroye, Szpankowski (92) (i.i.d. source)
• depth, fill-up level, height: Jacquet, Szpankowski (93) (general source + mixing)
• average size and total path length: Fayolle (06) (i.i.d. asymmetric, Markov)
• fill-up level: Cénac, Fekete (general source + not too strong mixing) (in progress)
Two families of methods:
(1) analytic combinatorics: generating functions, Mellin transform
    ↓
    precise asymptotics on
    - the average of additive characteristics
    - the distribution of the height
(2) probability
    ↓
    a.s. convergences
Common to both: correlations, overlapping of words
Some notations to write the results
◮ The probability that the source produces a sequence of symbols starting with the pattern $m$ is
  $p_m = \int_{I_m} f(t)\,dt$.
◮ $s = s_1 s_2 \dots s_n \dots$ denotes an infinite deterministic sequence, and $s^{(n)} = s_1 s_2 \dots s_n$.
◮ Entropies
  $h_+ = \lim_{n \to +\infty} \frac{1}{n} \max_{s^{(n)}} \ln\left(\frac{1}{p(s^{(n)})}\right)$,
  $h_- = \lim_{n \to +\infty} \frac{1}{n} \min_{s^{(n)}} \ln\left(\frac{1}{p(s^{(n)})}\right)$,
  $h = \lim_{n \to +\infty} \frac{1}{n} \mathbb{E}\left[\ln\left(\frac{1}{p(U^{(n)})}\right)\right]$.
◮ $\tilde{\ell}_n$ = length of the shortest branch of the tree $\neq$ fill-up level $= \ell_n$
  $L_n$ = length of the longest branch of the tree
  $D_n$ = insertion depth
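For a memoryless source the three entropies have closed forms, since the probability of a word is the product of its letter probabilities: $h_+ = \ln(1/p_{\min})$, $h_- = \ln(1/p_{\max})$ and $h$ is the Shannon entropy of the letter distribution. The sketch below checks this by brute-force enumeration of all words of length $n$; the function name `finite_n_entropies` and the probabilities 0.3 and 0.7 are hypothetical choices for the example.

```python
import itertools
import math


def finite_n_entropies(probs, n):
    """For an i.i.d. source with letter law `probs`, return the three
    quantities whose limits define h_+, h_- and h, at a finite n:
    (1/n) max ln(1/p(w)), (1/n) min ln(1/p(w)), (1/n) E[ln(1/p(U^(n)))].
    """
    logs = []
    mean = 0.0
    for w in itertools.product(probs, repeat=n):
        p = math.prod(probs[a] for a in w)  # i.i.d.: product of letter probabilities
        logs.append(math.log(1 / p))
        mean += p * math.log(1 / p)
    return max(logs) / n, min(logs) / n, mean / n


probs = {"0": 0.3, "1": 0.7}
h_plus_n, h_minus_n, h_n = finite_n_entropies(probs, 8)

# Closed forms for an i.i.d. source (already exact at every finite n).
h_plus = math.log(1 / min(probs.values()))
h_minus = math.log(1 / max(probs.values()))
h = -sum(p * math.log(p) for p in probs.values())

print(h_plus_n, h_plus)
print(h_minus_n, h_minus)
print(h_n, h)
```

In this example $h_- \le h \le h_+$, with equality throughout only when all letters are equally likely.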
Results
$\ell_n$ = fill-up level, $L_n$ = length of the longest branch of the tree, $D_n$ = insertion depth.
Theorem (Cénac et al. (07))
For the DST built from a memoryless source or a Markovian source,
$\frac{\ell_n}{\ln n} \xrightarrow[n \to \infty]{a.s.} \frac{1}{h_+}$ and $\frac{L_n}{\ln n} \xrightarrow[n \to \infty]{a.s.} \frac{1}{h_-}$,
$\frac{D_n}{\ln n} \xrightarrow[n \to \infty]{P} \frac{1}{h}$.
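To make the three quantities concrete, here is a minimal simulation sketch for the DST of reversed prefixes of an i.i.d. binary word. Several things are assumptions of the sketch rather than statements from the talk: the root is the empty word, each inserted word occupies its shortest free prefix of length at least 1, the shortest branch is measured as the smallest depth of a node with a missing child, and the names `dst_depths` and `shortest_and_longest_branch` are hypothetical. The limits $1/h_+$, $1/h_-$, $1/h$ are computed from the i.i.d. closed forms; at such a small $n$ the ratios may still be far from them.

```python
import math
import random


def dst_depths(words):
    """Insert the words one by one into a DST.

    Returns the set of node labels (prefixes, root = "") and the list of
    insertion depths, one per inserted word.
    """
    nodes = {""}
    depths = []
    for w in words:
        k = 1
        while k <= len(w) and w[:k] in nodes:
            k += 1
        if k > len(w):
            raise ValueError("word too short to be inserted")
        nodes.add(w[:k])
        depths.append(k)
    return nodes, depths


def shortest_and_longest_branch(nodes, alphabet):
    """Shortest branch: smallest depth of a node with a missing child.
    Longest branch: largest depth of a node."""
    longest = max(len(v) for v in nodes)
    shortest = min(len(v) for v in nodes
                   if any(v + a not in nodes for a in alphabet))
    return shortest, longest


rng = random.Random(2008)
probs = {"0": 0.3, "1": 0.7}   # hypothetical letter probabilities
n = 2000
u = "".join(rng.choices(list(probs), weights=list(probs.values()), k=n))
words = [u[:i][::-1] for i in range(1, n + 1)]   # reversed prefixes of u

nodes, depths = dst_depths(words)
ell_n, L_n = shortest_and_longest_branch(nodes, "01")
D_n = depths[-1]

h_plus = math.log(1 / min(probs.values()))
h_minus = math.log(1 / max(probs.values()))
h = -sum(p * math.log(p) for p in probs.values())

print("shortest branch / ln n:", ell_n / math.log(n), " 1/h_+ =", 1 / h_plus)
print("longest branch / ln n: ", L_n / math.log(n), " 1/h_- =", 1 / h_minus)
print("insertion depth / ln n:", D_n / math.log(n), " 1/h   =", 1 / h)
```

A single value of $D_n$ fluctuates from one run to the next, which is consistent with the convergence of $D_n / \ln n$ being stated in probability only.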
In progress
$\ell_n$ = fill-up level, $L_n$ = length of the longest branch of the tree, $D_n$ = insertion depth.
Theorem
For the suffix trie built from a general source with mixing conditions,
$\frac{\ell_n}{\ln n} \xrightarrow[n \to \infty]{a.s.} \frac{1}{h_+}$.
Methods - 1 - Runs well (works for the DST and for the suffix trie)
◮ $s = s_1 s_2 \dots s_n \dots$ denotes an infinite deterministic sequence.
◮ $s^{(n)} = s_1 s_2 \dots s_n$
◮ $T_k(s) \overset{\text{def}}{=}$ size of the first tree in which $s^{(k)}$ is inserted,
  $X_n(s) \overset{\text{def}}{=}$ length of the branch corresponding to $s$ in the tree $T_n$,
  $\ell_n = \min_s X_n(s)$ and $L_n = \max_s X_n(s)$.
◮ $X_n$ and $T_k$ are in duality:
  $\{X_n(s) \ge k\} = \{T_k(s) \le n\}$,
  $P(\ell_n \le k - 1) \le \sum_{s^{(k)}} P(T_k(s) > n)$.
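The duality can be checked mechanically on a small deterministic example. The sketch below builds the suffix DST of the 13 letters $1001011001110$ and compares the two events for every pattern of length 4. It assumes that a tree is represented by its set of node labels, that the root is the empty word, and that each word occupies its shortest free prefix of length at least 1; the names `dst_trees`, `branch_length` and `first_size` are hypothetical.

```python
import itertools


def dst_trees(words):
    """Return the successive trees [T_0, T_1, ..., T_n] as sets of node labels."""
    nodes = {""}
    trees = [frozenset(nodes)]
    for w in words:
        k = 1
        while k <= len(w) and w[:k] in nodes:
            k += 1
        if k > len(w):
            raise ValueError("word too short to be inserted")
        nodes.add(w[:k])
        trees.append(frozenset(nodes))
    return trees


def branch_length(trees, n, s):
    """X_n(s): length of the branch of T_n along s (capped at len(s))."""
    k = 0
    while k < len(s) and s[:k + 1] in trees[n]:
        k += 1
    return k


def first_size(trees, k, s):
    """T_k(s): size of the first tree containing s^(k), infinity if none yet."""
    for n, tree in enumerate(trees):
        if s[:k] in tree:
            return n
    return float("inf")


U = "1001011001110"
trees = dst_trees([U[i:] for i in range(7)])   # suffix DST with 7 insertions
patterns = ["".join(p) for p in itertools.product("01", repeat=4)]

# Duality {X_n(s) >= k} = {T_k(s) <= n}, checked for every n, k and pattern.
duality_holds = all(
    (branch_length(trees, n, s) >= k) == (first_size(trees, k, s) <= n)
    for n in range(len(trees))
    for k in range(1, 5)
    for s in patterns
)
# Shortest branch of each tree, as the minimum of X_n(s) over the patterns.
ell = [min(branch_length(trees, n, s) for s in patterns) for n in range(len(trees))]
print(duality_holds, ell)
```

The union bound on $P(\ell_n \le k - 1)$ comes from the same observation: the shortest branch is at most $k - 1$ exactly when at least one pattern $s^{(k)}$ has not yet been inserted at time $n$.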
Methods - 1 - Runs well (works for the DST)
$P(\ell_n \le k - 1) \le \sum_{s^{(k)}} P(T_k(s) > n)$
◮ $s = s_1 s_2 \dots s_n \dots$ denotes an infinite deterministic sequence.
◮ $s^{(n)} = s_1 s_2 \dots s_n$
◮ $T_k(s) \overset{\text{def}}{=}$ size of the first tree in which $s^{(k)}$ is inserted,
  $T_k(s) = \sum_{r=1}^{k} \big(T_r(s) - T_{r-1}(s)\big) = \sum_{r=1}^{k} Z_r(s)$,
  $Z_r(s)$ = waiting time of the first occurrence of $s^{(r)}$ in $U$ after $T_{r-1}$.
Markov hypothesis ⇒ the random variables $Z_r(s)$ are independent.
A bit of probability (works for the DST)
Markov hypothesis ⇒ the random variables $Z_r(s)$ are independent.
$$
\begin{aligned}
T_k(s) &= \sum_{r=1}^{k} \big(T_r(s) - T_{r-1}(s)\big) = \sum_{r=1}^{k} Z_r(s) \\
&= \sum_{r=1}^{k} \big(Z_r(s) - \mathbb{E} Z_r(s)\big) + \sum_{r=1}^{k} \mathbb{E} Z_r(s) \\
&= \sum_{r=1}^{k} \epsilon_r(s) + \mathbb{E} T_k(s) \\
&= \text{martingale } M_k(s) + \mathbb{E} T_k(s)
\end{aligned}
$$
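$$
\begin{aligned}
T_k(s) = \sum_{r=1}^{k} Z_r(s) &= \sum_{r=1}^{k} \big(Z_r(s) - \mathbb{E} Z_r(s)\big) + \sum_{r=1}^{k} \mathbb{E} Z_r(s) \\
&= \sum_{r=1}^{k} \epsilon_r(s) + \mathbb{E} T_k(s) \\
&= \text{martingale } M_k(s) + \mathbb{E} T_k(s)
\end{aligned}
$$
$$
\log T_k(s) = \log \mathbb{E} T_k(s) + \log\left(1 + \frac{M_k(s)}{\mathbb{E} T_k(s)}\right)
$$
with $\log \mathbb{E} T_k(s) \sim k\,h(s)$ for the first term and, for the second term, $\forall \alpha > 0$, $\frac{M_k(s)}{\mathbb{E} T_k(s)} = o\big(k^{1 + \alpha/2}\big)$ a.s.
$$
\frac{\log T_k(s)}{k} \xrightarrow[k \to \infty]{a.s.} h(s)
$$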