Factor Automata of Automata and Applications
Mehryar Mohri (1,2), Pedro Moreno (2), Eugene Weinstein (1,2)
mohri@cs.nyu.edu, pedro@google.com, eugenew@cs.nyu.edu
(1) Courant Institute of Mathematical Sciences
(2) Google Inc.

Introduction
• Objective: construct a full index for a large set of strings
• We want to search efficiently for factors (subwords)
• A deterministic minimal factor automaton is a good option
  • Optimal lookup speed (linear in the size of the query)
• The set of strings might be given as an automaton
  • Smaller representation
  • Might be produced by another application
• Hence, consider factor automata of automata

Past Work
• The factor automaton of a string x has at most 2|x| − 2 states and 3|x| − 4 transitions [Crochemore ’85; Blumer et al. ’86]
• It can be constructed by a linear-time online algorithm
• Size bounds for a set of strings U have also been studied previously [Blumer et al. ’87]
  • Let ||U|| be the sum of the lengths of all the strings in U
  • The factor automaton of U has at most 2||U|| − 1 states and 3||U|| − 3 transitions
• We prove a substantially better bound here
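
The single-string case can be illustrated with the standard textbook online construction of a suffix automaton. This is a minimal sketch and not necessarily the exact algorithm of the cited papers; the class name `SuffixAutomaton` and its methods are illustrative assumptions. Treating every state as accepting gives an automaton for the factors of x (not necessarily the minimal one), and a lookup takes time linear in the query length.

```python
class SuffixAutomaton:
    """Online suffix automaton of one string (standard textbook construction)."""

    def __init__(self, text=""):
        self.next = [{}]      # next[s][ch] -> target state
        self.link = [-1]      # suffix links
        self.length = [0]     # length of the longest string reaching each state
        self.last = 0         # state reached by the whole text read so far
        for ch in text:
            self.extend(ch)

    def _new_state(self, length, nxt, link):
        self.next.append(nxt)
        self.length.append(length)
        self.link.append(link)
        return len(self.next) - 1

    def extend(self, ch):
        """Append one character (amortized constant time per character)."""
        cur = self._new_state(self.length[self.last] + 1, {}, -1)
        p = self.last
        while p != -1 and ch not in self.next[p]:
            self.next[p][ch] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][ch]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Split q: the clone keeps q's transitions but a shorter length.
                clone = self._new_state(self.length[p] + 1,
                                        dict(self.next[q]), self.link[q])
                while p != -1 and self.next[p].get(ch) == q:
                    self.next[p][ch] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def contains(self, query):
        """Factor lookup: every state is treated as accepting."""
        state = 0
        for ch in query:
            state = self.next[state].get(ch)
            if state is None:
                return False
        return True

    def num_states(self):
        return len(self.next)
```

For example, `SuffixAutomaton("acab").contains("ca")` is true while `contains("bb")` is false. Note that the well-known state bound for this suffix automaton is 2|x| − 1, slightly weaker than the factor-automaton bound quoted on the slide.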

Suffix & Factor Automata
• We start out with an automaton A recognizing the strings in U
• Let S(A) and F(A) be the deterministic minimal automata recognizing the suffixes and factors of A, respectively
• To construct S(A): make each state of A initial (by adding epsilons), determinize, minimize
• To construct F(A): make each state of S(A) final, minimize
• Consequence: |F(A)| ≤ |S(A)|
[Figure: example automaton A (states 0 to 5, labels a, b, c); the same automaton with ε-transitions added; the resulting suffix automaton S(A) (states 0 to 6)]
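
The two constructions above can be sketched as follows. This is a small illustration under stated assumptions, not the authors' implementation: the function names, the tuple representation `(num_states, start, finals, trans)`, and the example automaton (assumed to accept {acab, acba}, with state 5 as the only final state) are all hypothetical. Making every state initial is done by starting the subset construction from the full state set, which has the same effect as adding epsilon transitions from the initial state to every state. Minimization uses simple Moore-style partition refinement.

```python
def determinize(starts, finals, arcs):
    """Subset construction for an epsilon-free automaton with several start states.
    arcs: iterable of (source, symbol, target) triples."""
    step = {}
    for p, a, q in arcs:
        step.setdefault((p, a), set()).add(q)
    start = frozenset(starts)
    ids = {start: 0}
    trans, dfinals, todo = {}, set(), [start]
    while todo:
        subset = todo.pop()
        if subset & finals:
            dfinals.add(ids[subset])
        for a in {a for (p, a) in step if p in subset}:
            target = frozenset(q for p in subset for q in step.get((p, a), ()))
            if target not in ids:
                ids[target] = len(ids)
                todo.append(target)
            trans[(ids[subset], a)] = ids[target]
    return len(ids), 0, dfinals, trans


def minimize(n, start, finals, trans):
    """Moore partition refinement on a partial DFA whose states are all reachable."""
    alphabet = sorted({a for (_, a) in trans})
    block = [1 if q in finals else 0 for q in range(n)]
    while True:
        signatures, new_block = {}, []
        for q in range(n):
            key = (block[q],
                   tuple(block[trans[(q, a)]] if (q, a) in trans else -1
                         for a in alphabet))
            new_block.append(signatures.setdefault(key, len(signatures)))
        stable = len(signatures) == len(set(block))
        block = new_block
        if stable:
            break
    mtrans = {(block[q], a): block[t] for (q, a), t in trans.items()}
    return len(set(block)), block[start], {block[q] for q in finals}, mtrans


def suffix_automaton(states, finals, arcs):
    """S(A): every state of A initial, then determinize and minimize."""
    return minimize(*determinize(states, finals, arcs))


def factor_automaton(suffix_dfa):
    """F(A): every state of S(A) final, then minimize again."""
    n, start, _, trans = suffix_dfa
    return minimize(n, start, set(range(n)), trans)


def accepts(dfa, word):
    _, state, finals, trans = dfa
    for a in word:
        if (state, a) not in trans:
            return False
        state = trans[(state, a)]
    return state in finals


# Hypothetical example automaton, assumed to accept {acab, acba}.
ARCS = [(0, "a", 1), (1, "c", 2), (2, "a", 3), (3, "b", 5), (2, "b", 4), (4, "a", 5)]
S = suffix_automaton(range(6), {5}, ARCS)
F = factor_automaton(S)
```

Since F(A) is obtained by merging states of S(A), comparing `F[0]` with `S[0]` on any input illustrates the consequence |F(A)| ≤ |S(A)|.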

Size Bound: Strategy
• Goal: a bound on |F(A)| in terms of |A|
• Work on bounding |S(A)| first (consider suffixes only for now)
• Idea: each state in S(A) accepts a distinct set of suffixes, so count the number of possible sets of suffixes
• The suffix sets can be arranged in a hierarchy, which is directly related in size to A
• Motivated by similar arguments for the single-string case in [Blumer et al. ’86] and for string sets in [Blumer et al. ’87]

Suffix Sets
• Automaton A is k-suffix unique if no two strings accepted by A share the same k-length suffix; suffix-unique if k = 1
• Define end-set(x): the set of states in A reachable after reading x
  • e.g., end-set(ac) = {2, 3, 4, 5}
• x ≡ y denotes end-set(x) = end-set(y)
  • This is a right-invariant equivalence relation
• [x] is the equivalence class of x
[Figure: example automaton A (states 0 to 5, labels a, b, c)]

Notation
• N_str is the number of strings accepted by A
• If q is a state of S(A), suff(q) is the set of suffixes accepted from q
  • e.g., suff(3) = {ab, ba}
• N(q) is the set of states in A from which a non-empty string in suff(q) can be read to reach a final state
  • e.g., N(3) = {2, 1}
[Figure: suffix automaton S(A) and automaton A]

Suffix Set Inclusion
• Lemma: Let A be a suffix-unique automaton and let q and q′ be two states of S(A) such that N(q) ∩ N(q′) ≠ ∅. Then
  suff(q) ⊆ suff(q′) and N(q) ⊆ N(q′), or
  suff(q′) ⊆ suff(q) and N(q′) ⊆ N(q)
• Proof: Let the paths in S(A) to q and q′ be labeled with u and u′
• Since N(q) ∩ N(q′) ≠ ∅, A must have a state p ∈ N(q) ∩ N(q′)
• Thus, there exist paths labeled v ∈ suff(q) and v′ ∈ suff(q′) from p to a final state
[Figure: states q and q′ of S(A) reached by u and u′; state p of A with outgoing paths v and v′]

Suffix Set Inclusion (cont.)
[Figure: paths u, u′, v, v′ through state p, as in the lemma]
• Since A is suffix-unique, any string accepted by A and ending in v must also end in uv
• Thus, any path from an initial state to p must end in u
• By the same reasoning, it must also end in u′
• Hence, u is a suffix of u′, or vice versa
• Assume the former; then whenever u′z is a suffix of A, so is uz, thus suff(q′) ⊆ suff(q) and N(q′) ⊆ N(q)
• When u′ is a suffix of u, we obtain similarly the other statement of the lemma. QED.
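
The lemma can be checked by brute force on a small finite example. The sketch below is an illustration only, not part of the proof: the function names are hypothetical, the states of S(A) are represented directly by their suffix sets (the distinct residuals of the suffix language), and the example automaton, assumed to accept exactly {acab, acba}, is an assumption rather than the figure from the slides.

```python
def residual(language, u):
    """Strings z such that u + z belongs to the language."""
    return frozenset(z[len(u):] for z in language if z.startswith(u))


def check_lemma(strings, delta, finals):
    """Check the nesting property for all pairs of states of S(A).

    strings: the finite language accepted by A (assumed to match delta/finals).
    delta:   dict (state, symbol) -> state, the transitions of A.
    finals:  set of final states of A.
    """
    suffix_lang = {w[i:] for w in strings for i in range(len(w) + 1)}
    factors = {z[:i] for z in suffix_lang for i in range(len(z) + 1)}
    # Each distinct residual is suff(q) for exactly one state q of S(A).
    suff_sets = {residual(suffix_lang, u) for u in factors}
    a_states = {p for p, _ in delta} | set(delta.values())

    def run(p, z):
        for ch in z:
            p = delta.get((p, ch))
            if p is None:
                return None
        return p

    def n_set(suff):
        # States of A from which a non-empty string of suff reaches a final state.
        return frozenset(p for p in a_states
                         if any(z and run(p, z) in finals for z in suff))

    for s1 in suff_sets:
        for s2 in suff_sets:
            n1, n2 = n_set(s1), n_set(s2)
            if n1 & n2:
                nested = (s1 <= s2 and n1 <= n2) or (s2 <= s1 and n2 <= n1)
                if not nested:
                    return False
    return True


# Hypothetical suffix-unique example: A is assumed to accept {acab, acba}.
DELTA = {(0, "a"): 1, (1, "c"): 2, (2, "a"): 3, (3, "b"): 5, (2, "b"): 4, (4, "a"): 5}
LEMMA_OK = check_lemma({"acab", "acba"}, DELTA, {5})
```

The check enumerates all pairs of suffix sets, so it is only practical for very small languages; it is meant to make the statement concrete, not to replace the argument above.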

Suffix-unique Bound
• Theorem: If A is a suffix-unique deterministic and minimal automaton, then the number of states of S(A) is bounded as |S(A)|_Q ≤ 2|A|_Q − 3
• Proof (sketch):
  • Lemma: for any two states of the suffix automaton, either their suffix sets are disjoint, or one includes the other
  • We can show that each state q of S(A) corresponds to a distinct equivalence class [x]; count these to get the bound
  • The equivalence classes induce a hierarchy of suffix sets, which we will analyze

Suffix Sets: Non-branching
[Figure: hierarchy of suffix sets, with edges labeled "Includes" and sibling sets labeled "Disjoint"]
• Count non-branching and branching nodes separately
• Consider a state in S(A) with equivalence class [x], where x is the longest string in the class
• The only way to have a branching node is if there exist factors ax, bx (a ≠ b), since ≡ is a right-equivalence relation
• A node is non-branching only when x is a prefix or a suffix
  • Distinct prefixes: |A|_Q − 2; suffix only when final state: N_str
• Total number of non-branching nodes: N_nb ≤ |A|_Q − 2 + N_str