Regular Expressions
Greg Plaxton
Theory in Programming Practice, Spring 2004
Department of Computer Science
University of Texas at Austin

What is a Regular Expression?
• A regular expression defines a (possibly infinite) set of strings over a given alphabet
• Analogous to an arithmetic expression
  – The symbols of the alphabet are analogous to the numerical constants in an arithmetic expression
  – Instead of arithmetic operators such as addition, multiplication, and exponentiation, the operators are concatenation, union, and closure
Theory in Programming Practice, Plaxton, Spring 2004

Regular Expressions: Syntax
• The symbols ∅ (empty set), ε (empty string), and any symbol of the alphabet are regular expressions
• For any regular expressions p and q, (pq) (concatenation) and (p | q) (union) are regular expressions
• For any regular expression p, p∗ (Kleene closure) is a regular expression

Regular Expressions: Semantics
• The regular expression ∅ denotes the empty set of strings
• The regular expression ε denotes the set of strings { ε }
• For any symbol a in the alphabet, the regular expression a denotes the set of strings { a }
• For any regular expressions p and q with corresponding sets of strings X and Y, the regular expression (pq) (resp., (p | q)) denotes the set of strings { xy | x ∈ X ∧ y ∈ Y } (resp., X ∪ Y)
• For any regular expression p with corresponding set of strings X, the regular expression p∗ denotes the set of strings { x_1 x_2 ··· x_k | k ≥ 0 ∧ ⟨∀ i : 1 ≤ i ≤ k : x_i ∈ X⟩ }
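
The semantic rules above can be sketched as a small evaluator. This is an illustrative sketch only: the tuple encoding of expressions and the function name `language` are assumptions made for this example, symbols are assumed to be single characters, and the result is truncated to strings of length at most `max_len` because the denoted set may be infinite.

```python
# Hypothetical tuple encoding of regular expressions (not from the slides):
#   ("empty",)      the expression ∅
#   ("eps",)        the expression ε
#   ("sym", a)      a single alphabet symbol a (one character)
#   ("cat", p, q)   concatenation (pq)
#   ("alt", p, q)   union (p | q)
#   ("star", p)     Kleene closure p*

def language(r, max_len):
    """Return the strings of length <= max_len denoted by expression r."""
    kind = r[0]
    if kind == "empty":
        return set()
    if kind == "eps":
        return {""}
    if kind == "sym":
        return {r[1]} if max_len >= 1 else set()
    if kind == "alt":
        return language(r[1], max_len) | language(r[2], max_len)
    if kind == "cat":
        xs = language(r[1], max_len)
        ys = language(r[2], max_len)
        return {x + y for x in xs for y in ys if len(x + y) <= max_len}
    if kind == "star":
        # Dropping the empty string from X does not change the closure,
        # and it guarantees that every round produces longer strings.
        xs = language(r[1], max_len) - {""}
        result = {""}          # the k = 0 case
        frontier = {""}
        while frontier:
            frontier = {w + x for w in frontier for x in xs
                        if len(w + x) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown node: {kind}")


# (a | b)* restricted to strings of length at most 2
a_or_b_star = ("star", ("alt", ("sym", "a"), ("sym", "b")))
print(sorted(language(a_or_b_star, 2)))
```

Bounding the length is a design choice of this sketch; it keeps every set finite so the closure loop is guaranteed to terminate.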

Regular Expressions: Parenthesization
• When writing a regular expression, we generally try to omit as many parentheses as possible without altering the meaning of the expression
• Where parentheses are omitted, Kleene closure has the highest binding power, then concatenation, then union
  – Parentheses may be omitted whenever this convention yields the intended parenthesization
• Note that concatenation and union are associative
  – These facts often enable us to drop parentheses, e.g., we can write abc instead of ((ab)c)
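
The binding convention can be tried out with Python's `re` module, which uses the same relative precedence for `*`, juxtaposition, and `|`. The helper name `matches` and the sample strings are assumptions chosen for illustration; `re.fullmatch` is used so that the whole string must belong to the set.

```python
import re

def matches(pattern, s):
    """True if the entire string s is in the set denoted by pattern."""
    return re.fullmatch(pattern, s) is not None

implicit = r"a|bc*d"            # parentheses omitted
explicit = r"(a)|(b((c)*)d)"    # the parenthesization the convention implies
union_first = r"(a|b)c*d"       # a different expression: union binds first
closure_wide = r"a|(bc)*d"      # a different expression: closure covers bc

samples = ["a", "bd", "bccd", "ad", "acd", "bcbcd", "d"]
for s in samples:
    print(s, matches(implicit, s), matches(explicit, s),
          matches(union_first, s), matches(closure_wide, s))
```

The first two columns agree on every sample, while the last two columns differ from them on strings such as `ad` and `bcbcd`.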

A Remark on Kleene Closure
• One can think of Kleene closure as follows: p∗ = ε | p | pp | ppp | ...
• The RHS above is not a regular expression because it has an infinite number of terms
  – It is straightforward to prove by induction that every regular expression has a finite length
• The motivation for introducing the Kleene closure operator is to make the above RHS into a regular expression
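
One way to picture the infinite union is to compute its finite truncations. The sketch below works on sets of strings rather than on expressions; the names `power` and `closure_up_to` are assumptions introduced for this example.

```python
def power(X, k):
    """Strings formed by concatenating exactly k strings drawn from X."""
    result = {""}                      # zero factors give the empty string
    for _ in range(k):
        result = {w + x for w in result for x in X}
    return result

def closure_up_to(X, n):
    """Union of the first n + 1 terms: X^0 | X^1 | ... | X^n."""
    out = set()
    for k in range(n + 1):
        out |= power(X, k)
    return out

X = {"ab"}
for n in range(4):
    print(n, sorted(closure_up_to(X, n)))
```

Each truncation is a finite set, and the set denoted by p∗ is the union of all of them. Note that the k = 0 term always contributes the empty string, so even the closure of the empty set is { ε }.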

Regular Expressions: Examples
• What is the set of strings corresponding to the regular expression a | bc∗d?
• It is often convenient to introduce identifiers to stand for certain regular expressions and then to use these identifiers as a shorthand for building up more complex regular expressions
  – PosDigit = 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  – Digit = 0 | PosDigit
  – Natural = 0 | PosDigit Digit∗
• The set of strings over the lowercase English alphabet containing all five vowels in order corresponds to the regular expression (Letter∗)a(Letter∗)e(Letter∗)i(Letter∗)o(Letter∗)u(Letter∗) where Letter = a | b | c | ... | z
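
These examples translate fairly directly into Python's `re` syntax. The sketch below builds each pattern from named pieces in the same way as the identifiers above; the Python variable names, the helper `full`, and the test words are assumptions for illustration.

```python
import re
import string

# Named building blocks, mirroring PosDigit, Digit, Natural, and Letter.
POS_DIGIT = "(" + "|".join("123456789") + ")"
DIGIT = f"(0|{POS_DIGIT})"
NATURAL = f"(0|{POS_DIGIT}{DIGIT}*)"
LETTER = "(" + "|".join(string.ascii_lowercase) + ")"

# (Letter*)a(Letter*)e(Letter*)i(Letter*)o(Letter*)u(Letter*)
VOWELS_IN_ORDER = f"{LETTER}*" + f"{LETTER}*".join("aeiou") + f"{LETTER}*"

A_OR_BCSTAR_D = "a|bc*d"

def full(pattern, s):
    """True if the entire string s is in the set denoted by pattern."""
    return re.fullmatch(pattern, s) is not None

for s in ["0", "7", "120", "007", ""]:
    print(repr(s), full(NATURAL, s))
for s in ["facetious", "abstemious", "education"]:
    print(s, full(VOWELS_IN_ORDER, s))
```

Under this reading, Natural excludes numerals with leading zeros, and a | bc∗d denotes the string a together with every string consisting of b, then zero or more copies of c, then d.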

A More Elaborate Example
• For any binary string x, let f(x) denote the nonnegative integer that x represents in binary
  – Example: If x = 00110, then f(x) = 6
• Problem: Construct a regular expression corresponding to the set of all binary strings x such that f(x) is a multiple of 3
  – We first inductively define the sets B_0, B_1, and B_2 of all binary strings x such that f(x) is congruent to 0, 1, and 2, respectively, modulo 3
  – We then deduce a regular expression for B_0

Inductive Definition of Sets B_0, B_1, and B_2
(0) The empty string belongs to B_0
(1) For any binary string x in B_0, x0 belongs to B_0 and x1 belongs to B_1
(2) For any binary string x in B_1, x0 belongs to B_2 and x1 belongs to B_0
(3) For any binary string x in B_2, x0 belongs to B_1 and x1 belongs to B_2
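
Rules (0) through (3) amount to a three-state machine that tracks f(x) modulo 3: appending a bit b turns f(x) into 2·f(x) + b, so the new residue depends only on the old residue and b. The sketch below encodes the rules as a table and checks them by brute force against a direct computation. The names `TRANSITION`, `residue_class`, and `check` are assumptions for this example, as is the convention that f of the empty string is 0 (which rule (0) implies).

```python
from itertools import product

# TRANSITION[r][b] is the index of the set containing xb when x is in B_r.
TRANSITION = {
    0: {"0": 0, "1": 1},   # rule (1)
    1: {"0": 2, "1": 0},   # rule (2)
    2: {"0": 1, "1": 2},   # rule (3)
}

def residue_class(x):
    """Index r such that the binary string x belongs to B_r, per the rules."""
    r = 0                  # rule (0): the empty string belongs to B_0
    for bit in x:
        r = TRANSITION[r][bit]
    return r

def f(x):
    """Integer represented by binary string x (assumed 0 for the empty string)."""
    return int(x, 2) if x else 0

def check(max_len):
    """Compare the rules with f(x) mod 3 on every string up to max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            x = "".join(bits)
            if residue_class(x) != f(x) % 3:
                return False
    return True

print(f("00110"), residue_class("00110"))
print(check(10))
```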

Characterization of B_2 in Terms of B_1
• By (2) and (3), any binary string in B_2 is either of the form x0 where x belongs to B_1, or is of the form x1 where x belongs to B_2
• It follows that B_2 consists of all binary strings of the form x01∗ where x belongs to B_1

Characterization of B_1 in Terms of B_0
• By (1), (3), and the preceding characterization of B_2, any binary string in B_1 is either of the form x1 where x belongs to B_0, or is of the form x01∗0 where x belongs to B_1
• It follows that B_1 consists of all binary strings of the form x1(01∗0)∗ where x belongs to B_0
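
Both characterizations (B_2 as x01∗ with x in B_1, and B_1 as x1(01∗0)∗ with x in B_0) can be sanity-checked by brute force on short strings. This is a finite check, not a proof; the helper names `in_B`, `has_form`, and `check` are assumptions for this sketch, and f of the empty string is taken to be 0.

```python
import re
from itertools import product

def f(x):
    """Integer represented by binary string x (assumed 0 for the empty string)."""
    return int(x, 2) if x else 0

def in_B(r, x):
    """True if x belongs to B_r, i.e. f(x) is congruent to r modulo 3."""
    return f(x) % 3 == r

def has_form(s, r, suffix_pattern):
    """True if s = x + t for some x in B_r and some t matching suffix_pattern."""
    return any(in_B(r, s[:i]) and re.fullmatch(suffix_pattern, s[i:]) is not None
               for i in range(len(s) + 1))

def check(max_len):
    """Test both characterizations on every binary string up to max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            s = "".join(bits)
            if in_B(2, s) != has_form(s, 1, r"01*"):
                return False
            if in_B(1, s) != has_form(s, 0, r"1(01*0)*"):
                return False
    return True

print(check(10))
```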

Deducing a Regular Expression for B_0
• By (0), (1), (2), and the preceding characterization of B_1, the set B_0 consists of the empty string, all binary strings of the form x0 where x belongs to B_0, and all binary strings of the form x1(01∗0)∗1 where x belongs to B_0
• It follows that B_0 consists of all binary strings of the form (0 | 1(01∗0)∗1)∗
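
The resulting expression can be tested against the arithmetic definition on all short binary strings. The sketch below assumes Python's `re` module and the convention that f of the empty string is 0; the names `B0_PATTERN` and `mismatches` are introduced here for illustration.

```python
import re
from itertools import product

B0_PATTERN = re.compile(r"(0|1(01*0)*1)*")

def f(x):
    """Integer represented by binary string x (assumed 0 for the empty string)."""
    return int(x, 2) if x else 0

def mismatches(max_len):
    """Strings up to max_len on which the pattern and f(x) mod 3 == 0 disagree."""
    bad = []
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            x = "".join(bits)
            in_pattern = B0_PATTERN.fullmatch(x) is not None
            if in_pattern != (f(x) % 3 == 0):
                bad.append(x)
    return bad

print(mismatches(12))
```

An empty list means the pattern agrees with divisibility by 3 on every binary string of length at most 12; this supports the derivation but does not replace it.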

Remark: Alternative View of the Preceding Example
• The binary strings in B_0 may be viewed as being generated by the grammar
    S   → B_0
    B_0 → ε | B_0 0 | B_1 1
    B_1 → B_0 1 | B_2 0
    B_2 → B_1 0 | B_2 1
• As we have seen, the above grammar generates a regular language
• Not all grammars generate regular languages
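
Because every production in this grammar has at most one nonterminal, and it appears at the left end, the strings it generates can be enumerated by a simple fixed-point computation. The sketch below is one such enumeration, bounded by length; the encoding in `GRAMMAR` and the function names are assumptions for this example, and S is omitted because it only renames B_0.

```python
from itertools import product

# Each alternative is (leading nonterminal or None, trailing terminal string).
GRAMMAR = {
    "B0": [(None, ""), ("B0", "0"), ("B1", "1")],
    "B1": [("B0", "1"), ("B2", "0")],
    "B2": [("B1", "0"), ("B2", "1")],
}

def generate(max_len):
    """Strings of length <= max_len derivable from each nonterminal."""
    lang = {A: set() for A in GRAMMAR}
    changed = True
    while changed:                       # iterate until nothing new appears
        changed = False
        for A, alternatives in GRAMMAR.items():
            for left, suffix in alternatives:
                prefixes = {""} if left is None else lang[left]
                new = {p + suffix for p in prefixes
                       if len(p + suffix) <= max_len}
                if not new <= lang[A]:
                    lang[A] |= new
                    changed = True
    return lang

def all_binary(max_len):
    """All binary strings of length <= max_len."""
    return {"".join(bits) for n in range(max_len + 1)
            for bits in product("01", repeat=n)}

def f(x):
    """Integer represented by binary string x (assumed 0 for the empty string)."""
    return int(x, 2) if x else 0

lang = generate(8)
expected = {x for x in all_binary(8) if f(x) % 3 == 0}
print(lang["B0"] == expected)
```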