Text Search and Closure Properties
CSCI 3130 Formal Languages and Automata Theory
Siu On CHAN, Fall 2018
Chinese University of Hong Kong
Text Search
grep program
grep -E regex file.txt
Searches for an occurrence of patterns matching a regular expression

regex       language                  meaning
cat|12      { cat, 12 }               union
[abc]       { a, b, c }               shorthand for a|b|c
[ab][12]    { a1, a2, b1, b2 }        concatenation
(ab)*       { ε, ab, abab, ... }      star
[ab]?       { ε, a, b }               zero or one
(cat)+      { cat, catcat, ... }      one or more
[ab]{2}     { aa, ab, ba, bb }        n copies
Searching with grep
cd /usr/share/dict/
grep -E 'savou?r' words
Words containing savor or savour: savor, savor's, savored, savorier, savories, savoriest, savoring, savors, savory, savory's, unsavory
grep -E '[abAB]{5}' words
Words with 5 consecutive a or b: Babbage
More grep commands
.        any symbol
[a-d]    anything in a range
^        beginning of line
$        end of line
grep -E '^a.pl.$' words
How do you look for
Words that start in go and have another go
grep -E '^go.*go' words
Words with at least ten vowels?
grep -iE '([aeiouy].*){10}' words
Words without any vowels?
grep -iE '^[^aeiouy]*$' words
[^R] means "does not contain": it matches any single symbol that is not in R
Words with exactly ten vowels?
grep -iE '^[^aeiouy]*([aeiouy][^aeiouy]*){10}$' words
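The same patterns can be tried outside grep. The sketch below uses Python's `re` module instead of grep, on a small made-up word list (the list `WORDS` and the helper names are assumptions for illustration, not part of the slides). `re.search` plays the role of grep's substring search, and the vowel counts are lowered from ten to two so the sample words can match.

```python
import re

# Hypothetical sample words; the slides search /usr/share/dict/words instead.
WORDS = ["gogo", "going", "ago", "rhythm", "nth", "apple", "queue"]

def grep(pattern, words, ignore_case=False):
    """Return the words that contain a match, like grep -E (or grep -iE)."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    return [w for w in words if regex.search(w)]

def exactly_n_vowels(n):
    """Pattern for words with exactly n vowels (n = 10 on the slide)."""
    return r"^[^aeiouy]*([aeiouy][^aeiouy]*){%d}$" % n

starts_go_has_another_go = grep(r"^go.*go", WORDS)
no_vowels = grep(r"^[^aeiouy]*$", WORDS, ignore_case=True)
at_least_two_vowels = grep(r"([aeiouy].*){2}", WORDS, ignore_case=True)
exactly_two_vowels = grep(exactly_n_vowels(2), WORDS, ignore_case=True)
```

Note that without the anchors `^` and `$`, the "exactly n vowels" pattern would match a substring and so would only test for "at least n".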
How grep (could) work
[Figure: regular expression → NFA → DFA; the DFA reads the input text file and produces the output]

differences             in class               in grep
[ab]?, a+, (cat){3}     not allowed            allowed
input handling          matches whole input    looks for substring
output                  accept/reject          finds substring

Regular expressions are also supported in modern languages (C, Java, Python, etc.)
Implementation of grep
How do you handle expressions like

[ab]?        zero or one       R? → ε | R               [ab]? → ()|[ab]
(cat)+       one or more       R+ → RR*                 (cat)+ → (cat)(cat)*
a{3}         n copies          R{n} → RR...R (n times)  a{3} → aaa
[^aeiouy]    not containing    ?
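The rewriting rules above can be checked mechanically. This is a minimal sketch, assuming patterns are plain strings and using Python's `re.fullmatch` as a stand-in for "matches whole input"; the function names are made up for illustration, and the extra parentheses are added only so the rewritten pattern can be embedded safely.

```python
import re
from itertools import product

def zero_or_one(r):
    """R? -> ε | R, written ()|R in grep syntax."""
    return "(()|(%s))" % r

def one_or_more(r):
    """R+ -> RR*"""
    return "(%s)(%s)*" % (r, r)

def n_copies(r, n):
    """R{n} -> RR...R (n times)"""
    return ("(%s)" % r) * n

def agree(p, q, alphabet, max_len):
    """Check that p and q match exactly the same strings up to max_len."""
    for k in range(max_len + 1):
        for letters in product(alphabet, repeat=k):
            s = "".join(letters)
            if bool(re.fullmatch(p, s)) != bool(re.fullmatch(q, s)):
                return False
    return True

optional_ok = agree("[ab]?", zero_or_one("[ab]"), "ab", 3)
plus_ok = agree("(cat)+", one_or_more("cat"), "cat", 6)
copies_ok = agree("a{3}", n_copies("a", 3), "ab", 5)
```

The comparison only covers strings up to a fixed length, so it is a sanity check rather than a proof of equivalence.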
Closure properties
Example
The language $L$ of strings that end in 101 is regular: $(0+1)^*101$
How about the language $\overline{L}$ of strings that do not end in 101?
Hint: a string does not end in 101 if and only if it ends in 000, 001, 010, 011, 100, 110 or 111, or has length 0, 1, or 2
So $\overline{L}$ can be described by the regular expression
$(0+1)^*(000+001+010+011+100+110+111) + \varepsilon + (0+1) + (0+1)(0+1)$
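The hint can be sanity-checked by brute force. The sketch below translates the regular expression into Python `re` syntax (`+` becomes `|`, ε becomes `()`; this translation is an assumption of the example) and compares it with a direct test on every binary string of length at most 10.

```python
import re
from itertools import product

# (0+1)*(000+001+010+011+100+110+111) + ε + (0+1) + (0+1)(0+1)
NOT_END_101 = r"[01]*(000|001|010|011|100|110|111)|()|[01]|[01][01]"

def all_strings(max_len):
    """Yield every binary string of length 0..max_len."""
    for k in range(max_len + 1):
        for bits in product("01", repeat=k):
            yield "".join(bits)

# A mismatch is a string where the regex and the direct test disagree:
# the regex should match exactly when the string does NOT end in 101.
mismatches = [w for w in all_strings(10)
              if bool(re.fullmatch(NOT_END_101, w)) == w.endswith("101")]
```

An empty `mismatches` list means the expression and the description agree on all strings tested.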
Complement
The complement $\overline{L}$ of a language $L$ contains those strings that are not in $L$
$\overline{L} = \{ w \in \Sigma^* \mid w \notin L \}$
Examples ($\Sigma = \{0, 1\}$)
$L_1$ = lang. of all strings that end in 101
$\overline{L_1}$ = lang. of all strings that do not end in 101
  = lang. of all strings that end in 000, …, 111 (but not 101) or have length 0, 1, or 2
$L_2$ = lang. of $1^*$ = $\{\varepsilon, 1, 11, 111, \dots\}$
$\overline{L_2}$ = lang. of all strings that contain at least one 0
  = lang. of the regular expression $(0+1)^*0(0+1)^*$
Example
The language $L$ of strings that contain 101 is regular: $(0+1)^*101(0+1)^*$
How about the language $\overline{L}$ of strings that do not contain 101?
You can write a regular expression, but it is a lot of work!
Closure under complement
If $L$ is a regular language, so is $\overline{L}$
To argue this, we can use any of the equivalent definitions of regular languages: regular expression, NFA, DFA
The DFA definition will be the most convenient here
We assume $L$ has a DFA, and show $\overline{L}$ also has a DFA
Arguing closure under complement
Suppose $L$ is regular, then it has a DFA $M$
$M$ accepts $L$
Now consider the DFA $M'$ with the accepting and rejecting states of $M$ reversed
$M'$ accepts exactly the strings not in $L$
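The construction can be sketched in a few lines. The dictionary encoding of a DFA and the state names below are assumptions made for this example; the DFA is for the language $1^*$ over $\{0, 1\}$.

```python
from itertools import product

# Hypothetical dict encoding of a DFA for 1* over {0, 1}.
M = {
    "states": {"ones", "dead"},
    "delta": {
        ("ones", "1"): "ones", ("ones", "0"): "dead",
        ("dead", "0"): "dead", ("dead", "1"): "dead",
    },
    "start": "ones",
    "accepting": {"ones"},
}

def accepts(dfa, w):
    """Run the DFA on w and report whether it ends in an accepting state."""
    state = dfa["start"]
    for symbol in w:
        state = dfa["delta"][(state, symbol)]
    return state in dfa["accepting"]

def complement(dfa):
    """Same DFA with accepting and rejecting states swapped."""
    flipped = dict(dfa)
    flipped["accepting"] = dfa["states"] - dfa["accepting"]
    return flipped

M_bar = complement(M)
words = ["".join(p) for k in range(7) for p in product("01", repeat=k)]
# On every input exactly one of M, M_bar accepts.
always_opposite = all(accepts(M, w) != accepts(M_bar, w) for w in words)
```

This works because a DFA ends in exactly one state on each input, so swapping the accepting set flips the answer for every string.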
Can we do the same with an NFA?
[Figure: an NFA with states q0, q1, q2 for $(0+1)^*10$; swapping its accepting and rejecting states gives an NFA for $(0+1)^*$]
Not the complement!
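A small simulation shows the failure. The transition table below is an assumption: it is the usual three-state NFA for $(0+1)^*10$, written in a made-up dictionary encoding, since the slide's diagram is not reproduced here.

```python
from itertools import product

# Assumed NFA for (0+1)*10: q0 loops on 0 and 1, guesses the final "10".
N = {
    "delta": {
        ("q0", "0"): {"q0"},
        ("q0", "1"): {"q0", "q1"},
        ("q1", "0"): {"q2"},
    },
    "start": "q0",
    "accepting": {"q2"},
}

def accepts(nfa, w):
    """Track the set of reachable states; accept if any is accepting."""
    current = {nfa["start"]}
    for symbol in w:
        current = {q for p in current
                   for q in nfa["delta"].get((p, symbol), set())}
    return bool(current & nfa["accepting"])

# Swap accepting and rejecting states, as was done for the DFA.
N_swapped = dict(N, accepting={"q0", "q1"})

words = ["".join(p) for k in range(7) for p in product("01", repeat=k)]
swapped_accepts_everything = all(accepts(N_swapped, w) for w in words)
both_accept_10 = accepts(N, "10") and accepts(N_swapped, "10")
```

The string 10 is accepted by both automata, so the swapped NFA cannot be accepting the complement: an NFA accepts if some run accepts, and the run that stays in q0 is accepting after the swap.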
Intersection
The intersection $L \cap L'$ is the set of strings that are in both $L$ and $L'$
Examples:

L              L′      L ∩ L′
(0+1)*11       1*      1*11
(0+1)*10       1*      ∅

If $L$ and $L'$ are regular, is $L \cap L'$ also regular?
Closure under intersection
If $L$ and $L'$ are regular languages, so is $L \cap L'$
To argue this, we can use any of the equivalent definitions of regular languages: regular expression, NFA, DFA
Suppose $L$ and $L'$ have DFAs, call them $M$ and $M'$
Goal: construct a DFA (or NFA) for $L \cap L'$
Example
$M$: DFA with states $r_0, r_1$ for $L$ (even number of 0s)
$M'$: DFA with states $s_0, s_1$ for $L'$ (odd number of 1s)
$L \cap L'$ = lang. of even number of 0s and odd number of 1s
[Figure: DFA for $L \cap L'$ with states $(r_0, s_0)$, $(r_0, s_1)$, $(r_1, s_0)$, $(r_1, s_1)$]
Closure under intersection
states
  $M$ and $M'$: $Q = \{r_1, \dots, r_n\}$ and $Q' = \{s_1, \dots, s_m\}$
  DFA for $L \cap L'$: $Q \times Q' = \{(r_1, s_1), (r_1, s_2), \dots, (r_2, s_1), \dots, (r_n, s_m)\}$
start states
  $M$ and $M'$: $r_i$ for $M$, $s_j$ for $M'$
  DFA for $L \cap L'$: $(r_i, s_j)$
accepting states
  $M$ and $M'$: $F$ for $M$, $F'$ for $M'$
  DFA for $L \cap L'$: $F \times F' = \{(r_i, s_j) \mid r_i \in F, s_j \in F'\}$
Whenever $M$ is in state $r_i$ and $M'$ is in state $s_j$, the DFA for $L \cap L'$ will be in state $(r_i, s_j)$
Closure under intersection
transitions
  $M$ and $M'$: $r_i \xrightarrow{a} r_j$ and $s_k \xrightarrow{a} s_\ell$
  DFA for $L \cap L'$: $(r_i, s_k) \xrightarrow{a} (r_j, s_\ell)$
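The product construction can be sketched as follows, using the example of even number of 0s and odd number of 1s. The dictionary encoding and the function names are assumptions for illustration; the state names follow the example slide.

```python
from itertools import product

# M: even number of 0s. M_prime: odd number of 1s. (Assumed encoding.)
M = {
    "states": {"r0", "r1"}, "start": "r0", "accepting": {"r0"},
    "delta": {("r0", "0"): "r1", ("r1", "0"): "r0",
              ("r0", "1"): "r0", ("r1", "1"): "r1"},
}
M_prime = {
    "states": {"s0", "s1"}, "start": "s0", "accepting": {"s1"},
    "delta": {("s0", "1"): "s1", ("s1", "1"): "s0",
              ("s0", "0"): "s0", ("s1", "0"): "s1"},
}

def intersect(m1, m2, alphabet="01"):
    """Product DFA: states Q x Q', accepting F x F', componentwise moves."""
    states = set(product(m1["states"], m2["states"]))
    delta = {((r, s), a): (m1["delta"][(r, a)], m2["delta"][(s, a)])
             for (r, s) in states for a in alphabet}
    return {
        "states": states,
        "start": (m1["start"], m2["start"]),
        "accepting": set(product(m1["accepting"], m2["accepting"])),
        "delta": delta,
    }

def accepts(dfa, w):
    """Run the DFA on w and report whether it ends in an accepting state."""
    state = dfa["start"]
    for symbol in w:
        state = dfa["delta"][(state, symbol)]
    return state in dfa["accepting"]

P = intersect(M, M_prime)
words = ["".join(p) for k in range(8) for p in product("01", repeat=k)]
product_is_correct = all(
    accepts(P, w) == (w.count("0") % 2 == 0 and w.count("1") % 2 == 1)
    for w in words
)
```

The product DFA here has 2 × 2 = 4 states, matching the four states $(r_0, s_0), \dots, (r_1, s_1)$ of the example.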
Reversal
The reversal $w^R$ of a string $w$ is $w$ written backwards
$w$ = dog, $w^R$ = god
The reversal $L^R$ of a language $L$ is the language obtained by reversing all its strings
$L$ = { dog, war, level }, $L^R$ = { god, raw, level }
Reversal of regular languages
$L$ = language of all strings that end in 01
$L$ is regular and has regex $(0+1)^*01$
How about $L^R$?
This is the language of all strings beginning in 10
It is regular and represented by $10(0+1)^*$
Closure under reversal
If $L$ is a regular language, so is $L^R$
How do we argue? regular expression, NFA, DFA
Arguing closure under reversal
Take a regular expression $E$ for $L$
We will find a regular expression $E^R$ representing $L^R$
A regular expression can be of the following types:
• special symbols ∅ and ε
• alphabet symbols like a and b
• union, concatenation, or star of simpler expressions
Inductive proof of closure under reversal

Regular expression $E$      reversal $E^R$
$\emptyset$                 $\emptyset$
$\varepsilon$               $\varepsilon$
$a$                         $a$
$E_1 + E_2$                 $E_1^R + E_2^R$
$E_1 E_2$                   $E_2^R E_1^R$
$E_1^*$                     $(E_1^R)^*$
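The table translates directly into a recursive function. In this sketch a regular expression is a nested tuple; the tuple tags (`"empty"`, `"eps"`, `"sym"`, `"union"`, `"concat"`, `"star"`) and the helper `to_python` are assumptions made for the example, with Python's `re` used only to test the result.

```python
import re
from itertools import product

def reverse(e):
    """Build E^R from E by following the table case by case."""
    kind = e[0]
    if kind in ("empty", "eps", "sym"):
        return e
    if kind == "union":
        return ("union", reverse(e[1]), reverse(e[2]))
    if kind == "concat":  # the two parts swap places
        return ("concat", reverse(e[2]), reverse(e[1]))
    if kind == "star":
        return ("star", reverse(e[1]))
    raise ValueError("unknown expression type: %r" % (kind,))

def to_python(e):
    """Translate the tuple form into Python re syntax, for testing only."""
    kind = e[0]
    if kind == "empty":
        return "(?!)"  # a pattern that never matches
    if kind == "eps":
        return "()"
    if kind == "sym":
        return re.escape(e[1])
    if kind == "union":
        return "(?:%s|%s)" % (to_python(e[1]), to_python(e[2]))
    if kind == "concat":
        return "(?:%s%s)" % (to_python(e[1]), to_python(e[2]))
    return "(?:%s)*" % to_python(e[1])  # star

# E = (0+1)*01
any_bit = ("star", ("union", ("sym", "0"), ("sym", "1")))
E = ("concat", ("concat", any_bit, ("sym", "0")), ("sym", "1"))
E_R = reverse(E)

words = ["".join(p) for k in range(7) for p in product("01", repeat=k)]
reversal_is_correct = all(
    bool(re.fullmatch(to_python(E), w))
    == bool(re.fullmatch(to_python(E_R), w[::-1]))
    for w in words
)
```

For this $E$ the result is $1(0(0+1)^*)$, which is $10(0+1)^*$, the language of strings beginning in 10.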
Duplication?
$L^{\mathrm{DUP}} = \{ ww \mid w \in L \}$
Example: $L$ = { cat, dog }, $L^{\mathrm{DUP}}$ = { catcat, dogdog }
If $L$ is regular, is $L^{\mathrm{DUP}}$ also regular?
Attempts
Let's try regular expression: $L^{\mathrm{DUP}} \stackrel{?}{=} L^2$
$L$ = { a, b }
$L^{\mathrm{DUP}}$ = { aa, bb }
$LL$ = { aa, ab, ba, bb }
Let's try NFA
[Figure: two copies of the NFA for $L$ placed between $q_0$ and $q_1$ and linked by ε-transitions]
An example
$L$ = language of $0^*1$ ($L$ is regular)
$L$ = { 1, 01, 001, 0001, ... }
$L^{\mathrm{DUP}}$ = { 11, 0101, 001001, 00010001, ... } = $\{ 0^n 1 0^n 1 \mid n \ge 0 \}$
Let's design an NFA for $L^{\mathrm{DUP}}$
An example
$L^{\mathrm{DUP}}$ = { 11, 0101, 001001, 00010001, ... } = $\{ 0^n 1 0^n 1 \mid n \ge 0 \}$
[Figure: an automaton with one branch for each of 1, 01, 001, 0001, …]
Seems to require infinitely many states!
Next lecture: will show that languages like $L^{\mathrm{DUP}}$ are not regular
Backreferences in grep
Advanced feature in grep and other "regular expression" libraries
grep -E '^(.*)\1$' words
the special expression \1 refers to the substring specified by (.*)
(.*)\1 looks for a repeated substring, e.g. mama
^(.*)\1$ accepts the language $L^{\mathrm{DUP}}$
Standard "regular expression" libraries can accept irregular languages (as defined in this course)!
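Python's `re` module supports the same `\1` backreference, so the pattern can be tried there as well. This is a minimal sketch using Python in place of grep; the function name `is_duplicated` is made up for the example, and the inputs are assumed to contain no newline characters.

```python
import re
from itertools import product

# Same pattern as the grep command: some string w followed by w again.
DUP = re.compile(r"^(.*)\1$")

def is_duplicated(s):
    """True if s = ww for some string w (s must not contain newlines)."""
    return DUP.search(s) is not None

def is_duplicated_direct(s):
    """Reference check: even length and the two halves are equal."""
    half = len(s) // 2
    return len(s) % 2 == 0 and s[:half] == s[half:]

words = ["".join(p) for k in range(9) for p in product("01", repeat=k)]
pattern_matches_definition = all(
    is_duplicated(w) == is_duplicated_direct(w) for w in words
)
```

The empty string is accepted too, since it equals ww with w = ε.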