  1. Counting occurrences for a finite set of words: an inclusion-exclusion approach. Pierre Nicodème, CNRS - LIX, École polytechnique. Joint work with Frédérique Bassino, Julien Clément and Julien Fayolle.

  2. Problem setting. Compute separately the number of occurrences of each word of a non-reduced set of words U in a random text under a (non-uniform) Bernoulli model. Reduced set: no word is a factor of another word. Reduced: U = { aab, ba, bb }; non-reduced: U = { aa, aab, bbaabb }. Methods:
– formal languages manipulations (Régnier-Szpankowski) (it fails in the non-reduced case);
– Aho-Corasick (automaton) + Chomsky-Schützenberger;
– inclusion-exclusion (Goulden-Jackson, Noonan-Zeilberger).

  3. Analytic aim. U = { u_1, ..., u_r } a non-reduced set of words; O_n^{(i)}: random variable counting the number of occurrences of the word u_i in a random text of size n (Bernoulli model). We want to compute
F(z, x_1, ..., x_r) = Σ_{k_1≥0, ..., k_r≥0, n≥0} Pr( O_n^{(1)} = k_1, ..., O_n^{(r)} = k_r ) x_1^{k_1} ... x_r^{k_r} z^n.
From there,
E[ O_n^{(1)} × ... × O_n^{(r)} ] = [z^n] ∂/∂x_1 ... ∂/∂x_r F(z, x_1, ..., x_r) evaluated at x_1 = ... = x_r = 1.

  4. (Auto-)correlation set. Autocorrelation: for h = ababa, sliding ababa over itself gives C_{ababa,ababa} = { ε, ba, baba }. In general,
C_{h,h} = { w : |w| < |h| and h.w = r.h for some word r },
and the correlation of h_1 with h_2 is
C_{h_1,h_2} = { w : |w| < |h_2| and h_1.w = r.h_2 for some word r }.
Example: h_1 = baba, h_2 = abaaba → C_{baba,abaaba} = { aba, baaba }.
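As a quick companion to the definition (not part of the slides), the correlation set can be computed directly: each element corresponds to a suffix of h_1 that matches a prefix of h_2. The helper name `correlation_set` is chosen here.

```python
from typing import List

def correlation_set(h1: str, h2: str) -> List[str]:
    """Correlation set C_{h1,h2}: words w with |w| < |h2| such that
    h1.w = r.h2 for some word r (an occurrence of h2 overlaps the end
    of h1, and w completes that overlap to a full h2)."""
    result = []
    for k in range(1, min(len(h1), len(h2)) + 1):
        # a length-k suffix of h1 must equal the length-k prefix of h2
        if h1.endswith(h2[:k]):
            result.append(h2[k:])
    return result

# autocorrelation of ababa, and correlation of baba with abaaba
print(sorted(correlation_set("ababa", "ababa")))  # ['', 'ba', 'baba']
print(sorted(correlation_set("baba", "abaaba")))  # ['aba', 'baaba']
```

The loop variable k is the length of the overlap; k = |h_2| (possible only when h_1 ends with h_2) yields w = ε, which is why ε always belongs to an autocorrelation set.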

  5. Generating function of a language. A language is a set of words over the alphabet A = { a, b }; A* = ε + A + A^2 + ... + A^n + ... is the set of all words, and L ⊂ A*.
F_L(a, b) = Σ_{w∈L} commute(w)
Example: (aabaa)* = ε + aabaa + (aabaa)^2 + (aabaa)^3 + ..., so
L = (aabaa)* + bbb ⇒ F_L(a, b) = 1/(1 − a^4 b) + b^3.
Rules: if X.Y is non-ambiguous, F_{X.Y}(a, b) = F_X(a, b) × F_Y(a, b); if X and Y are disjoint, F_{X+Y}(a, b) = F_X(a, b) + F_Y(a, b); if X* is non-ambiguous, F_{X*}(a, b) = 1/(1 − F_X(a, b)).

  6. Weighted and counting generating functions. Generating function of the language L: M(a, b) = Σ_{α∈L} commute(α). Weighted generating function: W(z) = M(ω_a z, ω_b z) = Σ_{α∈L} p_α z^{|α|} = Σ π_n z^n, where ω_a = Pr(a), ω_b = Pr(b), p_α is the probability of the word α, and π_n is the probability that a word of size n belongs to L. Counting generating function: F(z) = M(z, z) = Σ_{α∈L} z^{|α|} = Σ f_n z^n, where f_n is the number of words of the language of size n. Example: L = { ε, aa, ab, ba, aaab } (ε the empty word) gives
M(a, b) = 1 + a^2 + 2ab + a^3 b ⇒ F(z) = 1 + 3z^2 + z^4.
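For a finite language both generating functions can be tabulated directly from the definition; a small Python sketch (variable names chosen here, not from the slides), using the slide's example L with ω_a = ω_b = 1/2:

```python
from collections import Counter
from fractions import Fraction

words = ["", "aa", "ab", "ba", "aaab"]  # the slide's example language L

# M(a,b): one monomial a^(#a) b^(#b) per word of L
M = Counter((w.count("a"), w.count("b")) for w in words)

# counting GF F(z) = M(z,z): collect monomials by total degree (word length)
F = Counter()
for (i, j), c in M.items():
    F[i + j] += c

# weighted GF W(z) = M(w_a z, w_b z), here with w_a = w_b = 1/2
wa = wb = Fraction(1, 2)
W = Counter()
for (i, j), c in M.items():
    W[i + j] += c * wa**i * wb**j

print(dict(F))  # {0: 1, 2: 3, 4: 1}  ->  F(z) = 1 + 3 z^2 + z^4
print(W[2], W[4])  # pi_2 = 3/4, pi_4 = 1/16
```

The output confirms the slide's example: three words of length 2 (total probability 3/4) and one word of length 4 (probability 1/16).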

  7. Formal languages analysis (Régnier-Szpankowski, 1998): "parse" the text with respect to the occurrences. Right: R, set of texts obtained by reading up to the first occurrence. Minimal: M, set of texts separating two occurrences. Ultimate: U, set of texts following the last occurrence. Not: N, set of texts with no occurrence. Then
A* = N + R.(M)*.U ⇒ L(x) = N + R x.(M x)*.U.

  8. Equations over the languages. C = C_{h,h}, π_h = Pr(h) (Bernoulli model):
(I)   A* = U + M.A*
(II)  A*.h = R.C + R.A*.h
(III) M^+ = A*.h + C − ε
(IV)  N.A = R + N − ε
Solving, with D(z) = π_h z^{|h|} + (1 − z) C(z):
R(z) = π_h z^{|h|} / D(z),   U(z) = 1 / D(z),
N(z) = C(z) / D(z),          M(z) = 1 + (z − 1) / D(z),
L(z, x) = 1 / ( 1 − z + (1 − x) π_h z^{|h|} / (x + (1 − x) C(z)) ).
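These closed forms can be sanity-checked numerically. Below is a sketch (not from the slides) expanding N(z) = C(z)/D(z) for h = aaa under the uniform Bernoulli model (ω_a = ω_b = 1/2) and comparing with brute-force enumeration; `series` is a helper name chosen here.

```python
from fractions import Fraction
from itertools import product

def series(P, Q, order):
    """Coefficients of the power series P(z)/Q(z) up to z^order (Q[0] != 0)."""
    s = []
    for n in range(order + 1):
        c = P[n] if n < len(P) else Fraction(0)
        for k in range(1, min(n, len(Q) - 1) + 1):
            c -= Q[k] * s[n - k]
        s.append(c / Q[0])
    return s

half = Fraction(1, 2)
pi_h = half**3                      # Pr(aaa) under the uniform Bernoulli model
C = [Fraction(1), half, half**2]    # weighted C_{aaa,aaa}(z) = 1 + z/2 + z^2/4

# D(z) = pi_h z^{|h|} + (1 - z) C(z)
D = [Fraction(0)] * 5
for i, c in enumerate(C):
    D[i] += c
    D[i + 1] -= c
D[3] += pi_h

N = series(C, D, 8)                 # N(z) = C(z)/D(z)

# cross-check: [z^n] N(z) = probability that a random binary text avoids aaa
for n in range(9):
    good = sum("aaa" not in "".join(t) for t in product("ab", repeat=n))
    assert N[n] == Fraction(good, 2**n)
print(N[3], N[4])  # 7/8 13/16
```

For instance [z^3] N(z) = 7/8 (only aaa itself contains an occurrence) and [z^4] N(z) = 13/16 (aaaa, aaab, baaa are excluded), matching the brute-force counts.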

  9. Reduced sets (Régnier). The languages R_i, M_{i,j}, U_i have generating functions R_i(z), M_{i,j}(z), U_i(z) that are functions of C_{h_1,h_1}(z), C_{h_2,h_2}(z), C_{h_1,h_2}(z), C_{h_2,h_1}(z). For two words,
F(z, x_1, x_2) = N(z) + (x_1 R_1(z), x_2 R_2(z)) ( x_1 M_{1,1}(z)  x_2 M_{1,2}(z) )*  ( U_1(z) )
                                                 ( x_1 M_{2,1}(z)  x_2 M_{2,2}(z) )   ( U_2(z) )
where the star denotes the matrix quasi-inverse M ↦ (I − M)^{−1}. This collapses in the case of non-reduced sets.

  10. Aho-Corasick. Input: a non-reduced set of words U. Output: an automaton A_U recognizing A*.U. Algorithm:
1. build T_U, the ordinary trie representing the set U;
2. build A_U = (A, Q, δ, ε, T) with
– Q = Pref(U),
– T = A*.U ∩ Pref(U),
– δ(q, x) = q.x if q.x ∈ Pref(U), and Border(q.x) otherwise,
where Border(v) is the longest proper suffix of v which belongs to Pref(U) if defined, or ε otherwise.
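The construction above can be sketched in a few lines of Python (the function name and the returned triple are choices made here, not from the slides):

```python
def aho_corasick(words, alphabet="ab"):
    """Aho-Corasick automaton of a (possibly non-reduced) word set:
    states = Pref(U); delta(q, x) = qx if qx is a prefix, else Border(qx);
    terminal states = prefixes having some word of U as a suffix."""
    prefixes = {w[:i] for w in words for i in range(len(w) + 1)}

    def border(v):
        # longest proper suffix of v that is a prefix of some word (else eps)
        for i in range(1, len(v) + 1):
            if v[i:] in prefixes:
                return v[i:]
        return ""

    delta = {(q, x): (q + x if q + x in prefixes else border(q + x))
             for q in prefixes for x in alphabet}
    terminal = {q for q in prefixes if any(q.endswith(u) for u in words)}
    return prefixes, delta, terminal

Q, delta, T = aho_corasick(["aab", "aa"])
print(sorted(Q))            # ['', 'a', 'aa', 'aab']
print(delta[("aa", "a")])   # 'aa'  (Border(aa.a) = aa)
print(delta[("aab", "a")])  # 'a'   (Border(aab.a) = a)
print(sorted(T))            # ['aa', 'aab']
```

Running it on the slides' example U = { aab, aa } reproduces the transitions computed on slides 12 to 16. (This naive Border computation is quadratic; the classical construction uses a breadth-first failure-link traversal instead.)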

  11. Example U = { aab, aa }: trie T_U of U, with states ε, a, aa, aab (figure).

  12. Example U = { aab, aa }: δ(ε, b) = Border(b) = ε.

  13. Example U = { aab, aa }: δ(a, b) = Border(a.b) = ε.

  14. Example U = { aab, aa }: δ(aa, a) = Border(aa.a) = aa.

  15. Example U = { aab, aa }: δ(aab, a) = Border(aab.a) = a.

  16. Example U = { aab, aa }: δ(aab, b) = Border(aab.b) = ε.

  17. Example U = { aab, aa }. Transition matrix of the automaton, states ordered (ε, a, aa, aab), with x_1, x_2 marks for aab, aa:
               ( b   a   0      0     )
T(x_1, x_2) =  ( b   0   a.x_2  0     )
               ( 0   0   a.x_2  b.x_1 )
               ( b   a   0      0     )

  18. Example U = { aab, aa } (continued).
F(a, b, x_1, x_2) = (1, 0, 0, 0) (I − T(a, b, x_1, x_2))^{−1} (1, 1, 1, 1)^t
                  = (1 − a(x_2 − 1)) / (1 − a.x_2 − b + ab(x_2 − 1) − a^2 b x_2 (x_1 − 1)).
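One way to check this closed form without symbolic algebra (a sketch, not from the slides): instantiate a, b, x_1, x_2 with arbitrary rational values, solve (I − T) f = (1, 1, 1, 1)^t exactly over the rationals, and compare f_ε with the formula.

```python
from fractions import Fraction

def solve(A, rhs):
    """Solve A x = rhs by Gauss-Jordan elimination over the rationals."""
    n = len(A)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

# arbitrary rational test values for the weights and the marks
a, b, x1, x2 = Fraction(1, 5), Fraction(1, 7), Fraction(2), Fraction(3)
zero, one = Fraction(0), Fraction(1)

# transition matrix T over states (eps, a, aa, aab); x1 marks aab, x2 marks aa
T = [[b,    a,    zero,   zero],
     [b,    zero, a * x2, zero],
     [zero, zero, a * x2, b * x1],
     [b,    a,    zero,   zero]]
ImT = [[(one if i == j else zero) - T[i][j] for j in range(4)] for i in range(4)]

F_matrix = solve(ImT, [one] * 4)[0]   # (1,0,0,0)(I - T)^{-1}(1,1,1,1)^t
F_closed = (1 - a * (x2 - 1)) / (1 - a * x2 - b + a * b * (x2 - 1)
                                 - a**2 * b * x2 * (x1 - 1))
assert F_matrix == F_closed
print(F_matrix)  # 105/52
```

The exact-rational solve makes the comparison an equality test rather than a floating-point approximation.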

  19. Inclusion-exclusion principle, analytic version. F = set of the camelus genus, a camel (two humps) and a dromedary (one hump); the number of humps is counted by the formal variable x:
F(x) = x^2 + x.
Φ = { objects of F in which each elementary configuration (hump) is either distinguished or not }: the camel yields 1 + 2t + t^2 (t counting distinguished humps) and the dromedary 1 + t, so
Φ(t) = 2 + 3t + t^2 = F(1 + t).
Inclusion-exclusion principle: if Φ(t) is easy to get, then F(x) = Φ(x − 1).

  20. Application: counts for one word, here aaa. f(x): unknown p.g.f. of the counts of aaa (text example bbbbbaaaaaaaabbbbb). Each occurrence is distinguished or not (flip-flop) ⇒ 2^k configurations for a text with k occurrences. With φ(t) the generating function of these distinguished configurations, f(1 + x) = φ(x), hence f(x) = φ(x − 1): computing the easier φ(t) and substituting t → x − 1 gives the harder f(x) (inclusion-exclusion paradigm).
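The flip-flop argument can be tested by brute force (a sketch, not from the slides): enumerate binary texts of length n, build f from the occurrence counts of aaa and φ from all subsets of distinguished occurrences, and check φ(t) = f(1 + t) at a few points.

```python
from itertools import product

def occurrences(text, w):
    return sum(text[i:i + len(w)] == w for i in range(len(text) - len(w) + 1))

n, w = 6, "aaa"

# f_k = number of binary texts of length n with exactly k occurrences of w
f = [0] * (n + 1)
for t in product("ab", repeat=n):
    f[occurrences("".join(t), w)] += 1

# phi_j = number of flip-flop configurations with j distinguished occurrences:
# every subset of the occurrences of a text may be distinguished
phi = [0] * (n + 1)
for t in product("ab", repeat=n):
    k = occurrences("".join(t), w)
    for subset in product([0, 1], repeat=k):
        phi[sum(subset)] += 1

# check phi(t) = f(1 + t) at a few integer points
for t0 in (-1, 1, 2, 5):
    assert sum(c * (1 + t0)**k for k, c in enumerate(f)) == \
           sum(c * t0**j for j, c in enumerate(phi))
```

A text with k occurrences contributes 2^k configurations in total, i.e. (1 + t)^k when graded by the number of distinguished occurrences; summing over texts is exactly the substitution x → 1 + t in f.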

  21. One word, clusters. Word aaa, C_{aaa,aaa} = { ε, a, aa }; a cluster is a maximal chain of distinguished occurrences linked by overlaps (the bullet • marks a distinguished occurrence). Clusters:
C_aaa = aaa• (ε + a• + aa• + a•a• + a•a•a• + a•aa• + aa•a• + ...) = aaa• ( ε + ((C_{aaa,aaa} − ε)•)^+ )
Double counting (further removed by the inclusion-exclusion principle):
((C_{aaa,aaa} − ε)•)^+ (z) = (z + z^2) / (1 − (z + z^2)) = z + 2z^2 + 3z^3 + 5z^4 + 8z^5 + 13z^6 + ...
                           ≠ z + z^2 + z^3 + z^4 + z^5 + z^6 + ...

  22. Word aaa, clusters, generating function. C_{aaa,aaa} = { ε, a, aa }, so C_{aaa,aaa}(z) = 1 + z + z^2.
C_aaa = aaa• (ε + a• + aa• + a•a• + a•a•a• + a•aa• + aa•a• + ...) = aaa• ( ε + ((C_{aaa,aaa} − ε)•)^+ )
Marking each distinguished occurrence with x:
C_aaa(z, x) = z^3 x (1 + zx + z^2 x + zx.zx + zx.zx.zx + zx.z^2 x + z^2 x.zx + ...)
            = xz^3 ( 1 + (xz + xz^2) / (1 − (xz + xz^2)) ) = xz^3 / (1 − (xz + xz^2)).

  23. Parsing of a text with respect to clusters. C = C_{h,h}, word h, clusters C_h; writing C' for C − ε,
C_h = h + h.C'• + h.C'•C'• + h.C'•C'•C'• + ...  ⇒  C_h(z, x) = x h(z) / (1 − x (C(z) − 1)).
When reading a random text T, at each position either we read a letter of the alphabet A or we begin a cluster C_h:
T = ε + A + C_h + AA + A.C_h + C_h.A + C_h.C_h + AAA + ... = Seq(A + C_h).
Therefore, counting with x the number of occurrences of the word h and removing the double counting by inclusion-exclusion (x → x − 1):
F(z, x) = 1 / (1 − A(z) − C_h(z, x − 1)) = 1 / ( 1 − A(z) − (x − 1) h(z) / (1 − (x − 1)(C(z) − 1)) ).
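Under the plain counting model (each letter replaced by z, so A(z) = 2z, h(z) = z^3 and C(z) = 1 + z + z^2 for h = aaa), clearing the inner fraction gives F(z, x) = (1 − (x − 1)(z + z^2)) / (1 − (x + 1)z + (x − 1)z^2 + (x − 1)z^3); this rearrangement is mine, not on the slide. The denominator then yields a linear recurrence on the coefficient polynomials f_n(x), which a brute-force sketch can verify:

```python
from itertools import product

def pmul(p, q):
    r = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] += a * b
    return r

def padd(p, q):
    r = [0] * max(len(p), len(q))
    for i, c in enumerate(p):
        r[i] += c
    for i, c in enumerate(q):
        r[i] += c
    return r

def trim(p):
    while len(p) > 1 and p[-1] == 0:
        p = p[:-1]
    return p

# F(z,x) = P/Q for h = aaa, counting model; polynomials in x per power of z
P = [[1], [1, -1], [1, -1]]             # 1 - (x-1)z - (x-1)z^2
Q = [[1], [-1, -1], [-1, 1], [-1, 1]]   # 1 - (x+1)z + (x-1)z^2 + (x-1)z^3

# Q.F = P gives the recurrence  f_n = P_n - sum_k Q_k f_{n-k}  (Q_0 = 1)
f = []
for n in range(9):
    c = P[n][:] if n < len(P) else [0]
    for k in range(1, 4):
        if n - k >= 0:
            c = padd(c, pmul([-v for v in Q[k]], f[n - k]))
    f.append(trim(c))

# check: coefficient of x^j in f_n(x) = number of binary texts of length n
# with exactly j occurrences of aaa
def occ(t):
    return sum(t[i:i + 3] == "aaa" for i in range(len(t) - 2))

for n in range(9):
    brute = [0] * (n + 1)
    for t in product("ab", repeat=n):
        brute[occ("".join(t))] += 1
    assert f[n] == trim(brute)
print(f[3], f[4])  # [7, 1] [13, 2, 1]
```

For example f_3(x) = 7 + x (only aaa has one occurrence) and f_4(x) = 13 + 2x + x^2 (aaab and baaa have one, aaaa has two), as the brute-force count confirms.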

  24. Reduced set (Goulden-Jackson, 1979, 1983). U = { aba, bab, aa }, e.g. text bbbbbabababaabbbbb with clusters of overlapping distinguished occurrences. Clusters C_{i,j} begin with w_i and finish with w_j:
C_{i,j} = w_i C_{w_i,w_j} + Σ_{1≤k≤3} C_{i,k} . (C_{w_k,w_j} − δ_{kj} ε)
C = (w_1•, w_2•, w_3•) ( I − ( C_{w_i,w_j}• − δ_{ij} ε )_{1≤i,j≤3} )^{−1} (1, 1, 1)^t
T = Seq(A + C) ⇒ Φ(z, x_1, x_2, x_3) = 1 / (1 − A(z) − C(z, x_1, x_2, x_3))
F(z, x_1, x_2, x_3) = Φ(z, x_1 − 1, x_2 − 1, x_3 − 1) = 1 / (1 − A(z) − C(z, x_1 − 1, x_2 − 1, x_3 − 1)).

  25. General case: non-reduced set of words. U = { aa, ab, baaaab }, e.g. text aaaabbbbbbbabaaaabbbb with two clusters: cluster I built on occurrences of aa and ab, and cluster II built on baaaab, which contains aa as a factor three times. Create clusters of distinguished occurrences. Reduced cluster: no induced factor occurrences (cluster I); count distinguished occurrences by t_i → x_i − 1 (inclusion-exclusion principle). Induced factor occurrences: the occurrence baaaab of reduced cluster II induces 0, 1, 2, or 3 distinguished occurrences of aa; to recover the correct count of the 8 marked configurations, count them by (1 + t_i)^3 → x_i^3.
