Approximate Search of Regular Expressions Using Bit-Parallel Algorithms
Kristo Tammeoja, Jaak Vilo
Teooriapäevad Rõuges, 2007
Contents
- Regular expression (RE) syntax
- Glushkov's automaton
- Existing bit-parallel algorithms
  - Exact matching
  - Approximate matching
- New feature added
  - Error-free regions
Regular expression
- Syntax
  - (, )
  - |
  - Quantifiers: *, +, ?, {m,n}, {m,}
  - Character classes (example: [a-z])
- Matching as used in this presentation
  - Regular expression: A*
  - AAAAA: match
  - BAAAC: no match
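The matching convention above (the whole string has to match, not just a substring) can be tried out with a minimal sketch. Python's `re.fullmatch` is used here only as a stand-in with the same semantics; it is not the tool the presentation describes, and `matches_whole` is a hypothetical helper name.

```python
import re

def matches_whole(pattern: str, text: str) -> bool:
    """Return True only if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches_whole("A*", "AAAAA"))  # True: every character is an A
print(matches_whole("A*", "BAAAC"))  # False: B and C are not covered by A*
```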
Regular expression, 1 error allowed
Exact pattern: R(E|G)(EX)*
Patterns with 1 error allowed: 1:R(E|G)(EX)*, 1:R<E|G>(EX)*, 1:R(E|G)<EX>*
(In the examples, a part written in <...> has to match without errors.)

Pattern          Exact string   Edit     Text       Result
1:R<E|G>(EX)*    RE             subst.   RR         no match
1:R<E|G>(EX)*    RG             del.     R          no match
1:R<E|G>(EX)*    REEX           del.     REX        match
1:R(E|G)<EX>*    RGEX           ins.     REGEX      match
1:R(E|G)<EX>*    REEXEX         ins.     REEEXEX    match
1:R(E|G)<EX>*    REEXEX         ins.     REERXEX    no match
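To make "1 error allowed" concrete for the unrestricted pattern 1:R(E|G)(EX)*, here is a brute-force sketch. It is not the bit-parallel algorithm of the talk, and it does not handle the error-free regions written with <...>. The alphabet `REGX` and the helper names are assumptions made for this illustration.

```python
import re

ALPHABET = "REGX"  # assumption: only the letters used in the examples above

def one_edit_variants(text):
    """All strings reachable from text by one insertion, deletion or substitution."""
    out = set()
    for i in range(len(text) + 1):
        for ch in ALPHABET:
            out.add(text[:i] + ch + text[i:])          # insertion at position i
    for i in range(len(text)):
        out.add(text[:i] + text[i + 1:])               # deletion of text[i]
        for ch in ALPHABET:
            if ch != text[i]:
                out.add(text[:i] + ch + text[i + 1:])  # substitution of text[i]
    return out

def matches_with_one_error(pattern, text):
    """Whole-string match of text against pattern with at most one edit."""
    if re.fullmatch(pattern, text):
        return True
    return any(re.fullmatch(pattern, v) for v in one_edit_variants(text))

for t in ["RR", "REX", "REGEX", "RXX"]:
    print(t, matches_with_one_error("R(E|G)(EX)*", t))
```

Because edit distance is symmetric, editing the text and then matching exactly gives the same answer as editing a string of the language. Under the unrestricted pattern RR is accepted, whereas the table above rejects it for 1:R<E|G>(EX)*.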
Glushkov's automaton
Example: R(E|G)(EX)*
- Character in RE = state in automaton, plus one state for the beginning of the RE
- Transitions show which characters/positions can follow each other
[Figure: automaton for R(E|G)(EX)* with a start state and the states R, E, G, E, X, built up step by step while reading the text prefixes R..., RE..., RG..., REE..., RGE..., RGEX..., RGEXE...]
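The construction can be sketched in a few lines by computing, for every position, the set of positions that may follow it. The tuple-based AST encoding (`chr`, `cat`, `alt`, `star`) and the function name `glushkov` are assumptions made for this illustration; the slides only show the resulting automaton.

```python
def glushkov(ast):
    """Return (labels, follow, final) for a tiny regex AST.

    State 0 is the extra start state; states 1..n are the character
    positions of the regular expression, numbered left to right.
    """
    labels = [None]   # labels[p] = character of position p
    follow = {}       # follow[p] = set of positions that can come after p

    def visit(node):
        kind = node[0]
        if kind == "chr":
            labels.append(node[1])
            p = len(labels) - 1
            follow[p] = set()
            return False, {p}, {p}            # nullable, first, last
        if kind == "star":
            _, first, last = visit(node[1])
            for p in last:                    # loop back to the beginning
                follow[p] |= first
            return True, first, last
        n1, f1, l1 = visit(node[1])
        n2, f2, l2 = visit(node[2])
        if kind == "alt":
            return n1 or n2, f1 | f2, l1 | l2
        for p in l1:                          # "cat": the end of the left part
            follow[p] |= f2                   # connects to the start of the right part
        first = f1 | f2 if n1 else f1
        last = l1 | l2 if n2 else l2
        return n1 and n2, first, last

    nullable, first, last = visit(ast)
    follow[0] = set(first)
    final = set(last) | ({0} if nullable else set())
    return labels, follow, final

# R(E|G)(EX)*
REGEX_AST = ("cat",
             ("cat", ("chr", "R"), ("alt", ("chr", "E"), ("chr", "G"))),
             ("star", ("cat", ("chr", "E"), ("chr", "X"))))

labels, follow, final = glushkov(REGEX_AST)
for p in sorted(follow):
    print(p, labels[p], "->", sorted(follow[p]))
print("final states:", sorted(final))
```

Every transition into position p is taken on `labels[p]`, so the labels can be stored on the states instead of on the arrows.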
Glushkov's automaton
- All transitions entering a node are labeled by the same character
  - For example, after reading character 'E' only states with label 'E' can be active
[Figure: automaton for R(E|G)(EX)* with states R, E, G, E, X]
Exact search
- Simulation of NFA = changing the active states based on the character read from the text
- We use bit-vectors (one bit for each state) to hold the active states
δ(D, a)
- D: bit-vector of active states
- a: character read
- Returns the new bit-vector
- 2^|D| · |Σ| different sets of parameters
  - |D|: number of states in the automaton
  - |Σ|: size of the alphabet
Exact search
- "After reading character 'E' only states with label 'E' can be active", so ...
- δ(D, a) = T[D] & B[a]
  - T[D]: states that can be reached from the states in D by any character
  - B[a]: states that can be reached by character a
Exact search
- δ(D, a) = T[D] & B[a]
Example: AA|AB|AC
[Figure: automaton for AA|AB|AC; bits from left to right: start state, A, A, A, B, A, C]

a      B[a]
'A'    0111010
'B'    0000100
'C'    0000001

D          T[D]
1000000    0101010
0100000    0010000
...
0101010    0010101
...

δ(0101010, 'A'):
  T[D]    0010101
& B[a]    0111010
  ---------------
          0010000
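The tables of this example can be reproduced with a short sketch. The state numbering (0 = start state, then the characters of AA|AB|AC from left to right) and the names `LABELS`, `FOLLOW`, `bit` and `show` are assumptions chosen so that the printed bit strings line up with the slide.

```python
# Glushkov automaton of AA|AB|AC, written out by hand.
LABELS = [None, "A", "A", "A", "B", "A", "C"]   # state 0 is the start state
FOLLOW = {0: {1, 3, 5}, 1: {2}, 2: set(), 3: {4}, 4: set(), 5: {6}, 6: set()}
N = len(LABELS)

def bit(state):
    """Mask of one state; state 0 is the leftmost bit, as on the slide."""
    return 1 << (N - 1 - state)

def show(mask):
    """Render a bit-vector as a string of N binary digits."""
    return format(mask, f"0{N}b")

# B[a]: all states labeled with character a.
B = {}
for s in range(1, N):
    B[LABELS[s]] = B.get(LABELS[s], 0) | bit(s)

# T[D]: all states reachable in one step from any state in D, for every D.
T = [0] * (1 << N)
for D in range(1 << N):
    for s in range(N):
        if D & bit(s):
            for t in FOLLOW[s]:
                T[D] |= bit(t)

def delta(D, a):
    """One step of the automaton: T[D] & B[a]."""
    return T[D] & B.get(a, 0)

print(show(B["A"]), show(B["B"]), show(B["C"]))  # 0111010 0000100 0000001
print(show(T[0b1000000]))                        # 0101010
print(show(delta(0b0101010, "A")))               # 0010000
```

T has 2^|D| entries no matter how large the alphabet is, and B has one entry per character, which is the saving over a table indexed by both D and a.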
Exact search
D ← 100..00                  // initial state active
F ← bit-vector of final states
For pos ∈ 1 ... n Do         // scanning text
    D ← T[D] & B[t_pos]
    If D & F ≠ 000..00 Then match
End of For
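A runnable sketch of this loop, assuming the automaton of R(E|G)(EX)* is given as hand-written label and follow tables. `build_tables`, `exact_search` and the constants are hypothetical names; the function returns the 1-based text positions at which a final state is active.

```python
def build_tables(labels, follow):
    """Build T (indexed by bit-vector) and B (indexed by character)."""
    n = len(labels)
    bit = lambda s: 1 << (n - 1 - s)      # state 0 is the leftmost bit
    B = {}
    for s in range(1, n):
        B[labels[s]] = B.get(labels[s], 0) | bit(s)
    T = [0] * (1 << n)
    for D in range(1 << n):
        for s in range(n):
            if D & bit(s):
                for t in follow[s]:
                    T[D] |= bit(t)
    return T, B, bit

def exact_search(text, labels, follow, final):
    T, B, bit = build_tables(labels, follow)
    D = bit(0)                            # initial state active
    F = 0
    for s in final:                       # bit-vector of final states
        F |= bit(s)
    ends = []
    for pos, ch in enumerate(text, start=1):   # scanning text
        D = T[D] & B.get(ch, 0)
        if D & F:
            ends.append(pos)              # match ending at this position
    return ends

# R(E|G)(EX)*: states 1..5 are R, E, G, E, X; state 0 is the start state.
LABELS = [None, "R", "E", "G", "E", "X"]
FOLLOW = {0: {1}, 1: {2, 3}, 2: {4}, 3: {4}, 4: {5}, 5: {4}}
FINAL = {2, 3, 5}

print(exact_search("REEXEX", LABELS, FOLLOW, FINAL))  # [2, 4, 6]
print(exact_search("XRE", LABELS, FOLLOW, FINAL))     # []
```

As in the pseudocode, the initial state is never re-activated, so a match always starts at the first text character. The whole text matches when its last position is among the reported ones.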
Approximate search
Errors
- Insertion
- Deletion
- Substitution
Approximate search
- When searching with k errors we make k+1 replicas of the automaton, one for each error level
- Plus we need transitions for errors
[Figure: two copies of the automaton over states R, E, G, E, X, labeled "No errors" and "Up to 1 error"; the transitions between the copies are marked "?"]
Approximate search
- R_0, R_1: current bit-vectors
- R_0', R_1': bit-vectors after processing character c
R_0' = T[R_0] & B[c]
R_1' = ?
Approximate search
R_1' = (T[R_1] & B[c]) | ...
- T[R_1] & B[c]: no errors
  - Same as in exact search
[Figure: "No errors" and "Up to 1 error" copies of the automaton over states R, E, G, E, X]
Approximate search
R_1' = (T[R_1] & B[c]) | R_0 | ...
- T[R_1] & B[c]: no errors
- R_0: del
  - Active states remain the same
[Figure: Σ-labeled transitions lead from each state of the "No errors" copy to the same state of the "Up to 1 error" copy; example text "RAEGEX"]
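Approximate search
R_1' = (T[R_1] & B[c]) | R_0 | T[R_0'] | ...
- T[R_1] & B[c]: no errors
- R_0: del
- T[R_0']: ins
  - Insert new character after the current one
  - Just one step in automaton
[Figure: in addition to the Σ-labeled transitions, ε-labeled transitions lead from each state of the "No errors" copy to the following states of the "Up to 1 error" copy; example text "RE EX"]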
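The update rule can be put together into a small sketch for the unrestricted case, here for R(E|G)(EX)* with hand-written tables. Several parts are assumptions not shown on the slides: the substitution term `T[R[i-1]]` (the slide leaves the rest of the formula as "..."), the generalization from one error level to k, the initial values of the vectors, and the acceptance test on the whole text. Error-free regions are not modeled, and `approx_match` is a hypothetical name.

```python
def approx_match(text, labels, follow, final, k=1):
    """Whole-text match with at most k errors, using k+1 bit-vectors."""
    n = len(labels)
    bit = lambda s: 1 << (n - 1 - s)          # state 0 is the leftmost bit
    B = {}
    for s in range(1, n):
        B[labels[s]] = B.get(labels[s], 0) | bit(s)
    T = [0] * (1 << n)
    for D in range(1 << n):
        for s in range(n):
            if D & bit(s):
                for t in follow[s]:
                    T[D] |= bit(t)
    F = 0
    for s in final:
        F |= bit(s)

    # ASSUMPTION: level i starts with everything reachable by up to i
    # "ins" steps from the initial state.
    R = [bit(0)]
    for i in range(1, k + 1):
        R.append(R[i - 1] | T[R[i - 1]])

    for c in text:
        mask = B.get(c, 0)
        new = [T[R[0]] & mask]                # level 0: exact search
        for i in range(1, k + 1):
            new.append(
                (T[R[i]] & mask)              # no new error
                | R[i - 1]                    # del: active states stay the same
                | T[new[i - 1]]               # ins: one extra step in the automaton
                | T[R[i - 1]]                 # ASSUMPTION: substitution term
            )
        R = new
    return bool(R[k] & F)

# R(E|G)(EX)*: states 1..5 are R, E, G, E, X; state 0 is the start state.
LABELS = [None, "R", "E", "G", "E", "X"]
FOLLOW = {0: {1}, 1: {2, 3}, 2: {4}, 3: {4}, 4: {5}, 5: {4}}
FINAL = {2, 3, 5}

for t in ["RE", "RR", "REX", "REGEX", "RXX"]:
    print(t, approx_match(t, LABELS, FOLLOW, FINAL, k=1))
```

With k=0 the inner loop is skipped and the function reduces to the exact search. Each text character costs a constant number of table lookups and bit operations per error level.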