Counting occurrences for a finite set of words: an inclusion-exclusion approach
Pierre Nicodème, CNRS - LIX, École polytechnique
Joint work with Frédérique Bassino, Julien Clément and Julien Fayolle
Problem setting
Count separately the occurrences of each word of a non-reduced set of words $U$ in a random text under a Bernoulli (non-uniform) model.
Reduced set: no word is a factor of another word.
– Reduced: $U = \{aab, ba, bb\}$
– Non-reduced: $U = \{aa, aab, bbaabb\}$
Methods
– Formal language manipulations (Régnier-Szpankowski); this fails in the non-reduced case
– Aho-Corasick (automaton) + Chomsky-Schützenberger
– Inclusion-Exclusion (Goulden-Jackson, Noonan-Zeilberger)
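The definition of a reduced set can be checked mechanically. The following is a minimal sketch; the function name `is_reduced` is an illustrative choice, not something taken from the slides.

```python
def is_reduced(words):
    """Return True if no word of the set is a factor (substring) of another word."""
    ws = list(set(words))
    return not any(u != v and u in v for u in ws for v in ws)


print(is_reduced({"aab", "ba", "bb"}))      # the reduced example
print(is_reduced({"aa", "aab", "bbaabb"}))  # the non-reduced example: aa is a factor of aab
```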
Analytic Aim
$U = \{u_1, \dots, u_r\}$: non-reduced set of words.
$O_n^{(i)}$: random variable counting the number of occurrences of the word $u_i$ in a random text of size $n$ (Bernoulli model).
We want to compute
\[ F(z, x_1, \dots, x_r) = \sum_{k_1 \ge 0, \dots, k_r \ge 0,\ n \ge 0} \Pr\bigl(O_n^{(1)} = k_1, \dots, O_n^{(r)} = k_r\bigr)\, x_1^{k_1} \cdots x_r^{k_r} z^n \]
From there
\[ \mathbf{E}\bigl(O_n^{(1)} \times \cdots \times O_n^{(r)}\bigr) = [z^n] \left. \frac{\partial}{\partial x_1} \cdots \frac{\partial}{\partial x_r} F(z, x_1, \dots, x_r) \right|_{x_1 = \cdots = x_r = 1} \]
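(Auto)-Correlation Set
Auto-correlation of $h$:
\[ C_{h,h} = \{\, w \;:\; h.w = r.h \ \text{and}\ |w| < |h| \,\} \]
Example: $h = ababa$ gives $C_{ababa,ababa} = \{\epsilon, ba, baba\}$ (obtained by sliding $ababa$ over itself).
Correlation of $h_1$ and $h_2$:
\[ C_{h_1,h_2} = \{\, w \;:\; h_1.w = r.h_2 \ \text{and}\ |w| < |h_2| \,\} \]
Example: $h_1 = baba$, $h_2 = abaaba$ $\longrightarrow$ $C_{baba,abaaba} = \{aba, baaba\}$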
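The correlation set can be computed by sliding $h_2$ over the end of $h_1$: each nonempty suffix of $h_1$ that is also a prefix of $h_2$ contributes the remaining tail of $h_2$. A minimal sketch, where the function name `correlation` is an illustrative choice:

```python
def correlation(h1, h2):
    """Correlation set C_{h1,h2}: words w with |w| < |h2| such that h1.w = r.h2."""
    result = set()
    for k in range(1, min(len(h1), len(h2)) + 1):
        # a suffix of h1 of length k equal to a prefix of h2 yields the tail w = h2[k:]
        if h1[-k:] == h2[:k]:
            result.add(h2[k:])
    return result


print(sorted(correlation("ababa", "ababa")))   # autocorrelation set of ababa
print(sorted(correlation("baba", "abaaba")))   # correlation set of baba and abaaba
```

The empty string in the output stands for $\epsilon$.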
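Generating function of a language
language = set of words; alphabet $\mathcal{A} = \{a, b\}$
$\mathcal{A}^\star = \epsilon + \mathcal{A} + \mathcal{A}^2 + \cdots + \mathcal{A}^n + \cdots$ (all the words), $\mathcal{L} \subset \mathcal{A}^\star$
\[ F_{\mathcal{L}}(a, b) = \sum_{w \in \mathcal{L}} \operatorname{commute}(w) \]
Example: $(aabaa)^\star = \epsilon + aabaa + (aabaa)^2 + (aabaa)^3 + \cdots$
\[ \mathcal{L} = (aabaa)^\star + bbb \implies F_{\mathcal{L}}(a, b) = \frac{1}{1 - a^4 b} + b^3 \]
Rules:
– if $\mathcal{X}.\mathcal{Y}$ is non-ambiguous, $F_{\mathcal{X} \cdot \mathcal{Y}}(a, b) = F_{\mathcal{X}}(a, b) \times F_{\mathcal{Y}}(a, b)$
– if $\mathcal{X}$ and $\mathcal{Y}$ are disjoint, $F_{\mathcal{X} + \mathcal{Y}}(a, b) = F_{\mathcal{X}}(a, b) + F_{\mathcal{Y}}(a, b)$
– if $\mathcal{X}^\star$ is non-ambiguous, $F_{\mathcal{X}^\star}(a, b) = \dfrac{1}{1 - F_{\mathcal{X}}(a, b)}$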
Weighted and Counting Generating Function
Generating function of the language $\mathcal{L}$:
\[ M(a, b) = \sum_{\alpha \in \mathcal{L}} \operatorname{commute}(\alpha) \]
Weighted generating function:
\[ W(z) = M(\omega_a z, \omega_b z) = \sum_{\alpha \in \mathcal{L}} p_\alpha z^{|\alpha|} = \sum_n \pi_n z^n \]
with $\omega_a = \Pr(a)$, $\omega_b = \Pr(b)$, $p_\alpha$ the probability of the word $\alpha$, and $\pi_n$ the probability that a word of size $n$ belongs to $\mathcal{L}$.
Counting generating function:
\[ F(z) = M(z, z) = \sum_{\alpha \in \mathcal{L}} z^{|\alpha|} = \sum_n f_n z^n \]
with $f_n$ the number of words of size $n$ in the language.
Example: $\mathcal{L} = \{\epsilon, aa, ab, ba, aaab\}$ ($\epsilon$ is the empty word)
\[ M(a, b) = 1 + a^2 + 2ab + a^3 b \implies F(z) = 1 + 3z^2 + z^4 \]
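For a finite language the three generating functions can be tabulated directly. The sketch below stores each one as a map from exponents to coefficients; the function names and the sample probabilities $1/3$, $2/3$ are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction


def commutative_image(language):
    """M(a, b) stored as {(degree in a, degree in b): coefficient}."""
    return Counter((w.count("a"), w.count("b")) for w in language)


def counting_gf(language):
    """F(z) = M(z, z) stored as {n: number of words of size n}."""
    return Counter(len(w) for w in language)


def weighted_gf(language, omega_a, omega_b):
    """W(z) = M(omega_a z, omega_b z) stored as {n: total probability of the words of size n}."""
    W = Counter()
    for (i, j), c in commutative_image(language).items():
        W[i + j] += c * omega_a**i * omega_b**j
    return W


L = ["", "aa", "ab", "ba", "aaab"]   # "" stands for the empty word
print(dict(commutative_image(L)))
print(dict(counting_gf(L)))
print(dict(weighted_gf(L, Fraction(1, 3), Fraction(2, 3))))
```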
Formal Languages Analysis (Régnier-Szpankowski, 1998)
"Parse" the text with respect to the occurrences:
– Right $\mathcal{R}$: set of texts obtained by reading up to the first occurrence
– Minimal $\mathcal{M}$: set of texts separating two occurrences
– Ultimate $\mathcal{U}$: set of texts following the last occurrence
– Not $\mathcal{N}$: set of texts with no occurrence
\[ \mathcal{A}^\star = \mathcal{N} + \mathcal{R}.(\mathcal{M})^\star.\mathcal{U} \implies \mathcal{L}_x = \mathcal{N} + \mathcal{R}x.(\mathcal{M}x)^\star.\mathcal{U} \]
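Equations over the languages
$\mathcal{C} = \mathcal{C}_{h,h}$, $\pi_h = \Pr(h)$ (Bernoulli model)
(I) $\mathcal{A}^\star = \mathcal{U} + \mathcal{M}\mathcal{A}^\star$
(II) $\mathcal{A}^\star h = \mathcal{R}.\mathcal{C} + \mathcal{R}.\mathcal{A}^\star.h$
(III) $\mathcal{M}^+ = \mathcal{A}^\star.h + \mathcal{C} - \epsilon$
(IV) $\mathcal{N}.\mathcal{A} = \mathcal{R} + \mathcal{N} - \epsilon$
Solving:
\[ R(z) = \frac{\pi_h z^{|h|}}{\pi_h z^{|h|} + (1 - z) C(z)}, \qquad U(z) = \frac{1}{\pi_h z^{|h|} + (1 - z) C(z)} \]
\[ N(z) = \frac{C(z)}{\pi_h z^{|h|} + (1 - z) C(z)}, \qquad M(z) = 1 + \frac{z - 1}{\pi_h z^{|h|} + (1 - z) C(z)} \]
\[ L(z, x) = \frac{1}{1 - z + \pi_h z^{|h|} \dfrac{1 - x}{x + (1 - x) C(z)}} \]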
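The closed form for $N(z)$ can be checked on small cases by expanding it as a power series and comparing with a brute-force sum over all texts. The sketch below does this with exact fractions; the function names, the test word `aba` and the letter probabilities are assumptions chosen for illustration, and $C(z)$ is taken to be the autocorrelation polynomial weighted by the Bernoulli probabilities.

```python
from fractions import Fraction
from itertools import product


def prob(word, p):
    """Bernoulli probability of a word: product of its letter probabilities."""
    out = Fraction(1)
    for c in word:
        out *= p[c]
    return out


def autocorrelation_poly(h, p, size):
    """Coefficients of the weighted autocorrelation polynomial C(z), padded to `size`."""
    C = [Fraction(0)] * size
    for k in range(1, len(h) + 1):
        if h[-k:] == h[:k]:          # overlap of length k gives the tail h[k:]
            C[len(h) - k] += prob(h[k:], p)
    return C


def no_occurrence_series(h, p, n_max):
    """First n_max + 1 coefficients of N(z) = C(z) / (pi_h z^|h| + (1 - z) C(z))."""
    size = max(n_max + 1, len(h) + 1)
    C = autocorrelation_poly(h, p, size)
    D = [Fraction(0)] * (size + 1)   # denominator pi_h z^|h| + (1 - z) C(z)
    D[len(h)] += prob(h, p)
    for i, c in enumerate(C):
        D[i] += c
        D[i + 1] -= c
    N = []
    for n in range(n_max + 1):
        # D(z) N(z) = C(z) with D[0] = 1, solved coefficient by coefficient
        N.append(C[n] - sum(D[k] * N[n - k] for k in range(1, n + 1)))
    return N


def brute_force(h, p, n):
    """Probability that a random text of size n contains no occurrence of h."""
    texts = map("".join, product(p, repeat=n))
    return sum(prob(t, p) for t in texts if h not in t)


p = {"a": Fraction(1, 3), "b": Fraction(2, 3)}
series = no_occurrence_series("aba", p, 8)
print(all(series[n] == brute_force("aba", p, n) for n in range(9)))
```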
Reduced sets (Régnier)
$\mathcal{R}_i, \mathcal{M}_{i,j}, \mathcal{U}_i \leadsto R_i(z), M_{i,j}(z), U_i(z)$, functions of $C_{h_1,h_1}(z)$, $C_{h_2,h_2}(z)$, $C_{h_1,h_2}(z)$, $C_{h_2,h_1}(z)$
\[ F(z, x_1, x_2) = N(z) + \bigl(x_1 R_1(z),\ x_2 R_2(z)\bigr) \begin{pmatrix} x_1 M_{1,1}(z) & x_2 M_{1,2}(z) \\ x_1 M_{2,1}(z) & x_2 M_{2,2}(z) \end{pmatrix}^{\star} \begin{pmatrix} U_1(z) \\ U_2(z) \end{pmatrix} \]
where $\mathbb{M}^\star = (I - \mathbb{M})^{-1}$ for a matrix $\mathbb{M}$.
This collapses in the case of non-reduced sets.
Aho-Corasick
– Input: non-reduced set of words $U$.
– Output: automaton $\mathcal{A}_U$ recognizing $\mathcal{A}^* U$.
Algorithm:
1. build $T_U$, the ordinary trie representing the set $U$
2. build $\mathcal{A}_U = (\mathcal{A}, Q, \delta, \epsilon, T)$:
   – $Q = \operatorname{Pref}(U)$
   – $T = \mathcal{A}^* U \cap \operatorname{Pref}(U)$
   – $\delta(q, x) = qx$ if $qx \in \operatorname{Pref}(U)$, and $\delta(q, x) = \operatorname{Border}(qx)$ otherwise
$\operatorname{Border}(v)$ = the longest proper suffix of $v$ which belongs to $\operatorname{Pref}(U)$ if defined, or $\epsilon$ otherwise.
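The construction can be sketched directly from these definitions. The code below follows the definition of Border literally (it tries suffixes from longest to shortest) rather than the usual linear-time failure-link computation, so it is only meant for small sets; the function name `build_automaton` is an illustrative choice.

```python
def build_automaton(words, alphabet):
    """Return (Q, delta, T) for the automaton recognizing A*U, states named by prefixes."""
    prefixes = {w[:i] for w in words for i in range(len(w) + 1)}   # Q = Pref(U)

    def border(v):
        # longest proper suffix of v that belongs to Pref(U); the empty word always qualifies
        for i in range(1, len(v) + 1):
            if v[i:] in prefixes:
                return v[i:]
        return ""

    delta = {}
    for q in prefixes:
        for x in alphabet:
            delta[q, x] = q + x if q + x in prefixes else border(q + x)

    # T = A*U ∩ Pref(U): the prefixes that end with some word of U
    terminal = {q for q in prefixes if any(q.endswith(w) for w in words)}
    return prefixes, delta, terminal


Q, delta, T = build_automaton(["aab", "aa"], "ab")
for (q, x), r in sorted(delta.items()):
    print(repr(q), x, "->", repr(r))
```

States are represented by the prefix strings themselves, with `""` standing for $\epsilon$.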
Example: $U = \{aab, aa\}$
[Figure: trie $T_U$ of $U$, with states $\epsilon$, $a$, $aa$, $aab$ and edges $\epsilon \xrightarrow{a} a \xrightarrow{a} aa \xrightarrow{b} aab$; the following slides add the missing transitions one at a time.]
– $\delta(\epsilon, b) = \operatorname{Border}(b) = \epsilon$
– $\delta(a, b) = \operatorname{Border}(a.b) = \epsilon$
– $\delta(aa, a) = \operatorname{Border}(aa.a) = aa$
– $\delta(aab, a) = \operatorname{Border}(aab.a) = a$
– $\delta(aab, b) = \operatorname{Border}(aab.b) = \epsilon$
Example: $U = \{aab, aa\}$
[Figure: completed automaton on the states $\epsilon$, $a$, $aa$, $aab$.]
Transition matrix, with rows and columns ordered $\epsilon, a, aa, aab$, and $x_1$, $x_2$ marking the occurrences of $aab$, $aa$:
\[ \mathbb{T}(a, b, x_1, x_2) = \begin{pmatrix} b & a & 0 & 0 \\ b & 0 & a x_2 & 0 \\ 0 & 0 & a x_2 & b x_1 \\ b & a & 0 & 0 \end{pmatrix} \]
\[ F(a, b, x_1, x_2) = (1, 0, 0, 0)\,\bigl(I - \mathbb{T}(a, b, x_1, x_2)\bigr)^{-1} \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix} = \frac{1 - a(x_2 - 1)}{1 - a x_2 - b + ab(x_2 - 1) - a^2 b\, x_2 (x_1 - 1)} \]
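The same automaton can be run as a dynamic program instead of inverting the matrix symbolically: propagating (state, count of $aab$, count of $aa$) over $n$ steps gives the coefficients of $z^n x_1^{k_1} x_2^{k_2}$ in $F(z, z, x_1, x_2)$. The sketch below hard-codes the transitions read off the example and checks them against a naive count; the names `DELTA`, `MARK`, `joint_counts` and `brute_force` are illustrative.

```python
from collections import Counter
from itertools import product

# Transitions of the automaton for U = {aab, aa}; "" stands for the state epsilon.
DELTA = {
    ("", "a"): "a",    ("", "b"): "",
    ("a", "a"): "aa",  ("a", "b"): "",
    ("aa", "a"): "aa", ("aa", "b"): "aab",
    ("aab", "a"): "a", ("aab", "b"): "",
}
# Marks emitted when entering a state: (occurrences of aab, occurrences of aa).
MARK = {"": (0, 0), "a": (0, 0), "aa": (0, 1), "aab": (1, 0)}


def joint_counts(n):
    """{(k1, k2): number of texts of size n with k1 occurrences of aab and k2 of aa}."""
    current = Counter({("", 0, 0): 1})          # (state, k1, k2) -> number of texts
    for _ in range(n):
        nxt = Counter()
        for (q, k1, k2), c in current.items():
            for x in "ab":
                r = DELTA[q, x]
                m1, m2 = MARK[r]
                nxt[r, k1 + m1, k2 + m2] += c
        current = nxt
    out = Counter()
    for (_state, k1, k2), c in current.items():
        out[k1, k2] += c
    return out


def count_overlapping(text, w):
    """Number of (possibly overlapping) occurrences of w in text."""
    return sum(text.startswith(w, i) for i in range(len(text) - len(w) + 1))


def brute_force(n):
    return Counter(
        (count_overlapping(t, "aab"), count_overlapping(t, "aa"))
        for t in map("".join, product("ab", repeat=n))
    )


print(all(joint_counts(n) == brute_force(n) for n in range(9)))
```

Replacing the unit weight of each letter by $\omega_a$ or $\omega_b$ in the update would give the Bernoulli probabilities instead of counts.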
Inclusion-Exclusion Principle - Analytic Version
Set of the camelus genus (camel and dromedary); the number of humps is counted by the formal variable $x$.
$\mathcal{F} = \{$camel, dromedary$\}$ [pictures], $F(x) = x^2 + x$
$\Phi = \{$"objects of $\mathcal{F}$ in which each elementary configuration (hump) is either distinguished or not"$\}$ [pictures of the six marked animals]
\[ \Phi(t) = t + 1 + t^2 + t + t + 1 = 2 + 3t + t^2 = F(1 + t) \]
Inclusion-Exclusion principle: if $\Phi(t)$ is easy to get, then $F(x) = \Phi(x - 1)$.
Application: counts for one word
Word $aaa$; $f(x)$: unknown p.g.f. of the counts of $aaa$.
Text: bbbbbaaaaaaaabbbbb
Each occurrence is either distinguished or not (flip-flop) $\Rightarrow$ $2^k$ configurations for a text with $k$ occurrences.
[Figure: the text bbbbbaaaaaaaabbbbb repeated with different subsets of occurrences distinguished.]
\[ x \leadsto 1 + x: \qquad f(x) \leadsto f(1 + x) = \phi(x), \qquad f(x) = \phi(x - 1) \]
Computing the easier $\phi(t)$ and substituting $t \leadsto x - 1$ gives the harder $f(x)$ (Inclusion-Exclusion paradigm).
One word - Clusters
Word $aaa$, $C_{aaa,aaa} = \{\epsilon, a, aa\}$
[Figure: the text bbbbbaaaaaaaabbbbb repeated with different clusters of distinguished occurrences.]
Clusters $\mathcal{C}$:
\[ \mathcal{C}_{aaa} = aaa^{\bullet}\,(\epsilon + a^{\bullet} + aa^{\bullet} + a^{\bullet}a^{\bullet} + a^{\bullet}a^{\bullet}a^{\bullet} + a^{\bullet}aa^{\bullet} + aa^{\bullet}a^{\bullet} + \dots) = aaa^{\bullet}\,\Bigl(\epsilon + \bigl((C_{aaa,aaa} - \epsilon)^{\bullet}\bigr)^{+}\Bigr) \]
Double counting (later removed by the inclusion-exclusion principle):
\[ (C_{aaa,aaa} - \epsilon)^{+}(z) = \frac{z + z^2}{1 - (z + z^2)} = z + 2z^2 + 3z^3 + 5z^4 + 8z^5 + 13z^6 + \dots \;\neq\; z + z^2 + z^3 + z^4 + z^5 + z^6 + \dots \]
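The coefficients of the double-counting series count the ways of writing $a^n$ as a nonempty sequence of pieces taken from $C_{aaa,aaa} - \epsilon = \{a, aa\}$. A minimal sketch that counts these decompositions by dynamic programming (the function name `tail_decompositions` is illustrative):

```python
def tail_decompositions(n, pieces=("a", "aa")):
    """Number of ways to write a^n (n >= 1) as a nonempty sequence of the given pieces."""
    ways = [0] * (n + 1)
    ways[0] = 1                      # empty sequence, used only as the base case
    for m in range(1, n + 1):
        ways[m] = sum(ways[m - len(p)] for p in pieces if len(p) <= m)
    return ways[n]


print([tail_decompositions(n) for n in range(1, 7)])   # [1, 2, 3, 5, 8, 13]
```

Each $a^n$ is a single word, so a plain count of the words would give coefficient 1 for every $n$; the excess is the double counting that the substitution $t \leadsto x - 1$ removes.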
Word aaa - Clusters - Generating function
$C_{aaa,aaa} = \{\epsilon, a, aa\}$, $C_{aaa,aaa}(z) = 1 + z + z^2$
\[ \mathcal{C}_{aaa} = aaa^{\bullet}\,(\epsilon + a^{\bullet} + aa^{\bullet} + a^{\bullet}a^{\bullet} + a^{\bullet}a^{\bullet}a^{\bullet} + a^{\bullet}aa^{\bullet} + aa^{\bullet}a^{\bullet} + \dots) = aaa^{\bullet}\,\Bigl(\epsilon + \bigl((C_{aaa,aaa} - \epsilon)^{\bullet}\bigr)^{+}\Bigr) \]
\[ C_{aaa}(z, x) = zzzx\,(1 + zx + zzx + zxzx + zxzxzx + zxzzx + zzxzx + \dots) \]
\[ C_{aaa}(z, x) = x z^3 \left( 1 + \frac{xz + xz^2}{1 - (xz + xz^2)} \right) = \frac{x z^3}{1 - (xz + xz^2)} \]
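Parsing of a text with respect to clusters
$C = C_{h,h}$, word $h$, clusters $\mathcal{C}$
\[ \mathcal{C} = h + h.(C - \epsilon) + h.(C - \epsilon)^2 + h.(C - \epsilon)^3 + \dots \implies C(z, x) = \frac{x\, h(z)}{1 - x\,(C(z) - 1)} \]
When reading a random text $\mathcal{T}$, at each position we either read a letter of the alphabet $\mathcal{A}$ or begin a cluster $\mathcal{C}$:
\[ \mathcal{T} = \epsilon + \mathcal{A} + \mathcal{C} + \mathcal{A}\mathcal{A} + \mathcal{A}\mathcal{C} + \mathcal{C}\mathcal{A} + \mathcal{C}\mathcal{C} + \mathcal{A}\mathcal{A}\mathcal{A} + \mathcal{A}\mathcal{A}\mathcal{C} + \mathcal{A}\mathcal{C}\mathcal{A} + \mathcal{C}\mathcal{A}\mathcal{A} + \mathcal{A}\mathcal{C}\mathcal{C} + \dots = \operatorname{Seq}(\mathcal{A} + \mathcal{C}) \]
Therefore, counting with $x$ the number of occurrences of the word $h$, and removing double counting by inclusion-exclusion, we have
\[ F(z, x) = \frac{1}{1 - \bigl(A(z) + C(z, x - 1)\bigr)} = \frac{1}{1 - A(z) - \dfrac{(x - 1)\, h(z)}{1 - (x - 1)(C(z) - 1)}} \]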
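The final formula can be tested on small cases. The sketch below expands $F(z, x)$ as a power series in $z$ with polynomial coefficients in $x$ and compares it with a brute-force count. It uses the counting version of the model, an assumption made to keep the arithmetic in integers: $A(z) = |\mathcal{A}|\, z$, $h(z) = z^{|h|}$ and $C(z)$ unweighted. All function names are illustrative.

```python
from collections import Counter
from itertools import product


def poly_mul(p, q):
    """Product of two polynomials in (z, x) stored as {(deg_z, deg_x): coeff}."""
    out = Counter()
    for (i, j), c in p.items():
        for (k, l), d in q.items():
            out[i + k, j + l] += c * d
    return out


def poly_sub(p, q):
    out = Counter(p)
    for key, d in q.items():
        out[key] -= d
    return out


def occurrence_table(h, alphabet, n_max):
    """{(n, k): number of texts of size n with k occurrences of h}, for n <= n_max."""
    # C(z) - 1: one monomial z^|w| per nonempty word w of the autocorrelation set
    tails = Counter({(len(h) - k, 0): 1 for k in range(1, len(h)) if h[-k:] == h[:k]})
    t = Counter({(0, 1): 1, (0, 0): -1})                          # t = x - 1
    P = poly_sub(Counter({(0, 0): 1}), poly_mul(t, tails))        # 1 - (x-1)(C(z)-1)
    one_minus_A = Counter({(0, 0): 1, (1, 0): -len(alphabet)})    # 1 - A(z)
    Q = poly_sub(poly_mul(one_minus_A, P), poly_mul(t, Counter({(len(h), 0): 1})))
    # F = P / Q as a series in z; Q has constant term 1, so F_n = P_n - sum_{i>=1} Q_i F_{n-i}
    F = Counter()
    for n in range(n_max + 1):
        level = Counter({k: c for (i, k), c in P.items() if i == n})
        for (i, j), q in Q.items():
            if 1 <= i <= n:
                for (m, k), f in list(F.items()):
                    if m == n - i:
                        level[j + k] -= q * f
        for k, c in level.items():
            if c:
                F[n, k] = c
    return F


def brute_force(h, alphabet, n_max):
    table = Counter()
    for n in range(n_max + 1):
        for text in map("".join, product(alphabet, repeat=n)):
            k = sum(text.startswith(h, i) for i in range(n - len(h) + 1))
            table[n, k] += 1
    return table


print(occurrence_table("aaa", "ab", 9) == brute_force("aaa", "ab", 9))
```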
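Reduced set (Goulden-Jackson, 1979, 1983)
$U = \{aba, bab, aa\}$
[Figure: the text bbbbbabababaabbbbb repeated with different clusters of distinguished occurrences.]
Clusters $\mathcal{C}_{i,j}$ begin with $w_i$ and finish with $w_j$:
\[ \mathcal{C}_{i,j} = \delta_{ij}\, w_i + \sum_{1 \le k \le 3} \mathcal{C}_{i,k}.\bigl(C_{w_k,w_j} - \delta_{kj}\,\epsilon\bigr) \]
\[ \mathcal{C} = (w_1^{\bullet}, w_2^{\bullet}, w_3^{\bullet}) \left( I - \begin{pmatrix} C_{w_1,w_1}^{\bullet} - \epsilon & C_{w_1,w_2}^{\bullet} & C_{w_1,w_3}^{\bullet} \\ C_{w_2,w_1}^{\bullet} & C_{w_2,w_2}^{\bullet} - \epsilon & C_{w_2,w_3}^{\bullet} \\ C_{w_3,w_1}^{\bullet} & C_{w_3,w_2}^{\bullet} & C_{w_3,w_3}^{\bullet} - \epsilon \end{pmatrix} \right)^{-1} \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} \]
\[ \mathcal{T} = \operatorname{Seq}(\mathcal{A} + \mathcal{C}) \implies \Phi(z, x_1, x_2, x_3) = \frac{1}{1 - A(z) - C(z, x_1, x_2, x_3)} \]
\[ F(z, x_1, x_2, x_3) = \Phi(z, x_1 - 1, x_2 - 1, x_3 - 1) = \frac{1}{1 - A(z) - C(z, x_1 - 1, x_2 - 1, x_3 - 1)} \]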
General Case: Non Reduced Set of Words
$U = \{aa, ab, baaaab\}$
Text: aaaabbbbbbbabaaaabbbb, with two clusters marked I and II.
[Figure: the distinguished occurrences of $aa$, $ab$ and $baaaab$ drawn under the text, grouped into Cluster I and Cluster II.]
Create clusters of distinguished occurrences.
– Reduced cluster: no induced factor occurrences (Cluster I). Count distinguished occurrences by $t_i \leadsto x_i - 1$ (Inclusion-Exclusion principle).
– Induced factor occurrences: the occurrence $baaaab$ of the reduced Cluster II induces 0, 1, 2, or 3 distinguished occurrences of $aa$. To recover the correct count of 8 marked configurations, count them by $(1 + t_i)^3 \leadsto x_i^3$.