
Strings, Languages, and Regular Expressions (Lecture 2)



  1. Strings, Languages, and Regular Expressions (Lecture 2)

  2. Strings

  3. Definitions for strings
 • alphabet Σ = finite set of symbols, e.g., Σ = {0,1}, Σ = {α, β, …, ω}, Σ = set of ASCII characters
 • string = finite sequence of symbols of Σ
 • length of a string w is denoted |w|, e.g., |cat| = 3
 • empty string is denoted "ε"; |ε| = 0
 • Could formalize a string as a function w: [n] → Σ where |w| = n
 Variable conventions (for this lecture):
 – a, b, c, ... elements of Σ (i.e., strings of length 1)
 – w, x, y, z, ... strings of length 0 or more
 – A, B, C, ... sets of strings
 CS 374

  4. Much ado about nothing
 • ε is a string containing no symbols. It is not a set.
 • {ε} is a set containing one string: the empty string ε. It is a set, not a string.
 • Ø is the empty set. It contains no strings.
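
The three objects above can be told apart in a few lines of Python. This is only an illustrative sketch: it assumes strings are modeled as Python `str` values and sets of strings as Python `set` values, and the variable names are hypothetical.

```python
epsilon = ""          # the empty string ε: a string, not a set
only_epsilon = {""}   # {ε}: a set containing exactly one string
empty_set = set()     # Ø: a set containing no strings

# ε has length 0, {ε} has one element, Ø has none.
sizes = (len(epsilon), len(only_epsilon), len(empty_set))
```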

  5. Concatenation & its properties
 • xy denotes the concatenation of strings x and y (sometimes written x⋅y)
 • Formally, if |x| = m and |y| = n, then xy: [m+n] → Σ such that xy(i) = x(i) if i ≤ m, and xy(i) = y(i−m) otherwise
 • Associative: (uv)w = u(vw), and we write uvw
 • Identity element ε: εw = wε = w
 • Can be used to define strings (set of all strings Σ*) inductively
 • NOT commutative: ab ≠ ba
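
A minimal Python sketch of the function view of concatenation, assuming strings are modeled as Python `str` values and positions are 1-indexed as on the slide. The helper name `concat_at` is hypothetical.

```python
def concat_at(x: str, y: str, i: int) -> str:
    """Return symbol i (1-indexed) of xy, following the slide's case split."""
    m = len(x)
    if not 1 <= i <= m + len(y):
        raise IndexError("position out of range")
    # xy(i) = x(i) if i <= m, else y(i - m); Python indices are 0-based.
    return x[i - 1] if i <= m else y[i - m - 1]

x, y = "ab", "cde"
# Rebuild xy symbol by symbol; it should agree with Python's own x + y.
rebuilt = "".join(concat_at(x, y, i) for i in range(1, len(x) + len(y) + 1))
```

Python's `+` on `str` behaves like the concatenation described above: it is associative, has `""` as identity, and is not commutative.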

  6. Substring, Prefix, Suffix, Exponents
 • v is a substring of w iff there exist strings x, y such that w = xvy.
 – If x = ε (w = vy) then v is a prefix of w.
 – If y = ε (w = xv) then v is a suffix of w.
 • If w is a string, then w^n is defined inductively by:
 – w^n = ε if n = 0
 – w^n = w w^(n−1) if n > 0
 • e.g., (blah)^4 = blahblahblahblah
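
The definitions on this slide translate almost directly into code. The following is a sketch under the assumption that strings are Python `str` values; the function names are hypothetical.

```python
def is_substring(v: str, w: str) -> bool:
    """True iff w = x v y for some strings x and y."""
    return any(w[i:i + len(v)] == v for i in range(len(w) - len(v) + 1))

def is_prefix(v: str, w: str) -> bool:
    """True iff w = v y for some string y (the case x = ε)."""
    return w[:len(v)] == v

def is_suffix(v: str, w: str) -> bool:
    """True iff w = x v for some string x (the case y = ε)."""
    return len(v) <= len(w) and w[len(w) - len(v):] == v

def power(w: str, n: int) -> str:
    """w^0 = ε and w^n = w w^(n-1) for n > 0."""
    return "" if n == 0 else w + power(w, n - 1)
```

For example, `power("blah", 4)` rebuilds the slide's example one factor at a time.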

  7. Set Concatenation
 • If X and Y are sets of strings, then XY = { xy | x ∈ X, y ∈ Y }
 • e.g., X = {fido, rover, spot}, Y = {fluffy, tabby}; then XY = {fidofluffy, fidotabby, roverfluffy, ...}, and |XY| = 6
 • A = {a, aa}, B = {ε, a}: |AB| = 3
 • A = {a, aa}, B = Ø: AB = Ø
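
The three examples above can be checked with a short Python sketch. It assumes sets of strings are modeled as Python `set` values; the name `concat_sets` is hypothetical.

```python
def concat_sets(X: set, Y: set) -> set:
    """XY = { xy | x in X, y in Y }."""
    return {x + y for x in X for y in Y}

pets = concat_sets({"fido", "rover", "spot"}, {"fluffy", "tabby"})
AB = concat_sets({"a", "aa"}, {"", "a"})      # a·a and aa·ε coincide
A_empty = concat_sets({"a", "aa"}, set())     # no y to pair with, so no xy
```

Note that |AB| can be smaller than |A|·|B| because different pairs may produce the same string.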

  8. Σ^n, Σ*, and Σ^+
 • Σ^n is the set of all strings over Σ of length exactly n. Defined inductively as:
 – Σ^0 = {ε}
 – Σ^n = Σ Σ^(n−1) if n > 0
 • Σ* is the set of all finite length strings: Σ* = ∪_{n ≥ 0} Σ^n
 • Σ^+ is the set of all nonempty finite length strings: Σ^+ = ∪_{n ≥ 1} Σ^n

  9. Σ^n, Σ*, and Σ^+
 • |Σ^n| = |Σ|^n
 • |Ø^n| = ?
 – Ø^0 = {ε}
 – Ø^n = Ø Ø^(n−1) = Ø if n > 0
 • |Ø^n| = 1 if n = 0, and |Ø^n| = 0 if n > 0
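
A small Python sketch of the inductive definition of Σ^n, assuming an alphabet is given as a Python `set` of one-character strings. The name `sigma_n` is hypothetical.

```python
def sigma_n(sigma: set, n: int) -> set:
    """Σ^0 = {ε}; Σ^n = Σ Σ^(n-1) for n > 0."""
    if n == 0:
        return {""}
    shorter = sigma_n(sigma, n - 1)
    return {a + u for a in sigma for u in shorter}

binary_3 = sigma_n({"0", "1"}, 3)   # all binary strings of length 3
empty_0 = sigma_n(set(), 0)         # Ø^0
empty_2 = sigma_n(set(), 2)         # Ø^2
```

The same function covers the empty alphabet: the base case gives {ε}, and for n > 0 there is no symbol to prepend, so the result is empty.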

  10. Σ^n, Σ*, and Σ^+
 • Σ* is the set of all finite length strings: Σ* = ∪_{n ≥ 0} Σ^n
 • x is a string iff x = ε or x = au where |u| = |x| − 1 (this can be the formal definition of a "string")
 • |Σ*| = ?
 – Infinity. More precisely, ℵ₀
 – |Σ*| = |Σ^+| = |N| = ℵ₀
 • How long is the longest string in Σ*? There is no longest string!
 • How many infinitely long strings in Σ*? None

  11. Σ^n, Σ*, and Σ^+
 • Σ^+ is the set of all nonempty finite length strings: Σ^+ = ∪_{n ≥ 1} Σ^n
 • Σ^+ = ?
 – Σ Σ*
 – Σ* Σ
 – Σ Σ* Σ
 – Σ ∪ Σ^2 Σ*
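
One way to explore the four candidates is to compare each against Σ^+ on a finite slice, namely all strings up to some length bound. The sketch below assumes the example alphabet {0, 1} and a bound of 4; both are arbitrary choices, and agreement on a finite slice is evidence, not a proof.

```python
from itertools import product

SIGMA = ("0", "1")   # assumed example alphabet
K = 4                # assumed length bound for the finite check

def upto(k):
    """All strings over SIGMA of length at most k (a finite slice of Σ*)."""
    return {"".join(p) for n in range(k + 1) for p in product(SIGMA, repeat=n)}

sigma_plus = upto(K) - {""}   # Σ^+ restricted to length <= K

# Each candidate, restricted to strings of length <= K.
candidates = {
    "ΣΣ*": {a + u for a in SIGMA for u in upto(K - 1)},
    "Σ*Σ": {u + a for u in upto(K - 1) for a in SIGMA},
    "ΣΣ*Σ": {a + u + b for a in SIGMA for u in upto(K - 2) for b in SIGMA},
    "Σ ∪ Σ²Σ*": set(SIGMA)
    | {a + b + u for a in SIGMA for b in SIGMA for u in upto(K - 2)},
}
agrees = {name: lang == sigma_plus for name, lang in candidates.items()}
```

On this slice, only ΣΣ*Σ disagrees, because every string in it has length at least 2.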

  12. Enumerating Strings
 • Canonical (standard) ordering is the lexicographical (dictionary) ordering
 • Order by length (starting with 0)
 • Order the |Σ|^n strings of length n by comparing characters left to right
 First 20 strings over Σ = {0,1}:
   #    string  length
   1    ε       0
   2    0       1
   3    1       1
   4    00      2
   5    01      2
   6    10      2
   7    11      2
   8    000     3
   9    001     3
   10   010     3
   11   011     3
   12   100     3
   13   101     3
   14   110     3
   15   111     3
   16   0000    4
   17   0001    4
   18   0010    4
   19   0011    4
   20   0100    4
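
The ordering described above (shorter strings first, then left-to-right comparison within each length) can be sketched as a Python generator. The name `enumerate_strings` is hypothetical, and the alphabet's symbol order is assumed to be Python's default sort order.

```python
from itertools import count, islice, product

def enumerate_strings(sigma):
    """Yield every string over sigma: by length, then left-to-right order."""
    symbols = sorted(sigma)
    for n in count(0):                       # lengths 0, 1, 2, ...
        for p in product(symbols, repeat=n): # product respects symbol order
            yield "".join(p)

first = list(islice(enumerate_strings("01"), 8))
```

Because the generator never terminates, `islice` is used to take a finite prefix of the enumeration.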

  13. Inductive Definitions
 • Often operations on strings are formally defined inductively
 – e.g., w^n in terms of w^(n−1)
 – Another example: w^R (w reversed), inducting on length
 • If |w| = 0, w^R = ε
 • If |w| ≥ 1, w^R = u^R a where w = au, a ∈ Σ, u ∈ Σ*
 • Well-defined: |u| < |w|
 • In short: ε^R = ε and (au)^R = u^R a
 – e.g., (cat)^R = (c⋅at)^R = (at)^R⋅c = (a⋅t)^R⋅c = (t)^R⋅a⋅c = (t⋅ε)^R⋅ac = ε^R⋅tac = tac
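
The inductive definition of reversal maps directly onto a recursive function. This is a sketch assuming strings are Python `str` values; the name `rev` is hypothetical.

```python
def rev(w: str) -> str:
    """ε^R = ε; (au)^R = u^R a."""
    if w == "":
        return ""
    a, u = w[0], w[1:]   # split w = au with a a single symbol
    return rev(u) + a    # the recursive call is on a strictly shorter string
```

The recursion terminates for the same reason the definition is well-defined: |u| < |w| at every step.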

  14. Inductive Proofs
 • Inductive proofs follow inductive definitions
 • Recall: ε^R = ε and (au)^R = u^R a
 • Theorem: (uv)^R = v^R u^R
 • Proof: By induction. But on what? |u|, |v|, |uv|, double induction on |u|, |v|? |u| (or |v|) is good enough:
 Base case: |u| = 0, i.e., u = ε. Then:
   (uv)^R = v^R
   v^R u^R = v^R ε^R = v^R ε = v^R   [definition of reversal: base case] ☑

  15. Inductive Proofs
 • Inductive proofs follow inductive definitions
 • Recall: ε^R = ε and (au)^R = u^R a
 • Theorem: (uv)^R = v^R u^R
 • Proof: By induction.
 Inductive step: Let n > 0. Assume (wv)^R = v^R w^R for all w with |w| < n. Consider any u with |u| = n. So u = aw, a ∈ Σ, w ∈ Σ*.
   (uv)^R = (awv)^R = (a(wv))^R
          = (wv)^R a      [definition of reversal: inductive case]
          = v^R w^R a     [inductive hypothesis: |w| < n]
          = v^R (aw)^R    [definition of reversal: inductive case]
          = v^R u^R
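
The theorem can also be sanity-checked by brute force on short strings. The sketch below is not a substitute for the proof; it assumes the alphabet {0, 1} and a length bound of 4, and the names `rev` and `strings_upto` are hypothetical.

```python
from itertools import product

def rev(w: str) -> str:
    """Reversal by the inductive definition: ε^R = ε, (au)^R = u^R a."""
    return "" if w == "" else rev(w[1:]) + w[0]

def strings_upto(sigma: str, k: int) -> list:
    """All strings over sigma of length at most k."""
    return ["".join(p) for n in range(k + 1) for p in product(sigma, repeat=n)]

S = strings_upto("01", 4)
# Pairs (u, v) that would violate (uv)^R = v^R u^R; expected to be empty.
counterexamples = [(u, v) for u in S for v in S if rev(u + v) != rev(v) + rev(u)]
```

The order of the factors matters: swapping them to `rev(u) + rev(v)` already fails for u = 0, v = 1.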

  16. Languages

  17. Recall
 • Computation Problem: To compute a function F that maps each input (a string) to an output bit
 • Program: A finitely described process taking a string as input, and outputting a bit (or not halting)
 • P computes F if for every x, P(x) outputs F(x) and halts
 • Too restrictive?
 – Enough to compute functions with longer outputs too: P(x,i) outputs the i-th bit of F(x)
 – Enough to model interactive computation too: P*(x,state) outputs (y,new_state)
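
To illustrate why single-bit outputs are enough, here is a sketch in which a multi-bit function is recovered purely from one-bit queries. Everything concrete in it is an assumption made for illustration: the function `F` (the input length written in binary), the fixed output width `WIDTH`, and the helper `recover`.

```python
WIDTH = 8  # assumed fixed output length, for illustration only

def F(x: str) -> str:
    """Hypothetical multi-bit function: len(x) as WIDTH binary digits."""
    return format(len(x) % 2 ** WIDTH, f"0{WIDTH}b")

def P(x: str, i: int) -> int:
    """Single-bit program: outputs the i-th bit (1-indexed) of F(x)."""
    return int(F(x)[i - 1])

def recover(x: str) -> str:
    """Rebuild F(x) using only single-bit answers from P."""
    return "".join(str(P(x, i)) for i in range(1, WIDTH + 1))
```

A real encoding would also need a way to learn the output length; fixing `WIDTH` sidesteps that here.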

  18. Language
 • A function from Σ* to {0,1} can be identified with the set of strings mapped to 1
 • A language is a subset of Σ*
 – Computational problem for a language: given a string in Σ*, decide if it belongs to the language
 • Examples of languages: Ø, Σ*, Σ, {ε}, set of strings of odd length, set of strings encoding valid C programs, set of strings encoding valid C programs that halt, …
 • There are uncountably many languages (but each language has countably many strings)
 Example function, listed in canonical order over Σ = {0,1}:
   #    f(w)  w
   1    0     ε
   2    0     0
   3    1     1
   4    0     00
   5    1     01
   6    1     10
   7    0     11
   8    0     000
   9    1     001
   10   1     010
   11   0     011
   12   1     100
   13   0     101
   14   0     110
   15   1     111
   16   1     0000
   17   0     0001
   18   0     0010
   19   1     0011
   20   0     0100
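
The identification of a 0/1-valued function with a language can be sketched in Python, using the slide's example of odd-length strings. Only a finite slice of the language can be listed, so the helper takes a length bound; the names `f` and `language_slice` are hypothetical.

```python
from itertools import product

def f(w: str) -> int:
    """Characteristic function of the language of odd-length strings."""
    return len(w) % 2

def language_slice(f, sigma: str, k: int) -> set:
    """The strings of length at most k that f maps to 1."""
    words = ("".join(p) for n in range(k + 1) for p in product(sigma, repeat=n))
    return {w for w in words if f(w) == 1}

odd_upto_3 = language_slice(f, "01", 3)
```

Deciding membership of a string w in the language is then just evaluating `f(w)`.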

  19. Operations on Languages
 • Already seen concatenation: L_1 L_2 = { xy | x ∈ L_1, y ∈ L_2 }
 • Set operations:
 – Complement: L̅ = Σ* − L = { x ∈ Σ* | x ∉ L }
 – Union: L_1 ∪ L_2
 – Intersection, difference (can be based on the above two)
 • L^n inductively defined: L^0 = {ε}, L^n = L L^(n−1)
 • L* = ∪_{n ≥ 0} L^n, and L^+ = L L*
 • {ε}* = ? Ø* = ?
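
The inductive definition of L^n and a bounded approximation of L* can be sketched as follows. Since L* is usually infinite, `star_upto` only takes the union of L^0 through L^max_n; that cutoff and all function names are assumptions of this sketch.

```python
def concat(A: set, B: set) -> set:
    """AB = { xy | x in A, y in B }."""
    return {x + y for x in A for y in B}

def lang_power(L: set, n: int) -> set:
    """L^0 = {ε}; L^n = L L^(n-1) for n > 0."""
    return {""} if n == 0 else concat(L, lang_power(L, n - 1))

def star_upto(L: set, max_n: int) -> set:
    """Union of L^0, ..., L^max_n: a finite approximation of L*."""
    result = set()
    for n in range(max_n + 1):
        result |= lang_power(L, n)
    return result

star_of_eps = star_upto({""}, 5)     # approximates {ε}*
star_of_empty = star_upto(set(), 5)  # approximates Ø*
```

Both results stay at {ε} for any bound: L^0 = {ε} always contributes ε, {ε}^n = {ε}, and Ø^n = Ø for n > 0.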

  20. Complexity of Languages
 • How computable is a language?
 • Singleton languages
 – L such that |L| = 1. Example: L = {374}
 – An algorithm can have the single string hard-coded into it
 • More generally, finite languages
 – Algorithm can have all the strings hard-coded into it
 • Many interesting languages are uncomputable
 • But many others are neither too easy nor impossible…

  21. Regular Languages

  22. Regular Languages
 • The set of regular languages over some alphabet Σ is defined inductively by:
 – Ø is a regular language
 – {ε} is a regular language
 – {a} is a regular language for each a ∈ Σ
 – If L_1, L_2 are regular, then L_1 ∪ L_2 is regular
 – If L_1, L_2 are regular, then L_1 L_2 is regular
 – If L is regular, then L* is regular

  23. Regular Languages Examples
 • L = {w} where w ∈ Σ* is any fixed string
 – e.g., L = {aba} = {a}{b}{a}, and {a} and {b} are both regular
 – Proof by induction on |w|, using concatenation for induction
 • L = any finite set of strings
 – e.g., L = set of all strings of length at most 10
 – Proof by induction on |L|, using union for induction (and the above)
 – Beware: induction is applicable only for |L| ∈ N, not |L| = ℵ₀

  24. Regular Languages Examples
 • Infinite sets, but of strings with "regular" patterns
 – Σ* (recall: L* is regular if L is)
 – Σ^+ = ΣΣ*
 – All binary integers, without leading 0's: L = {1}{0,1}* ∪ {0}
 – All binary integers which are multiples of 37 (later)
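
The language L = {1}{0,1}* ∪ {0} can be tested with Python's `re` module, where `|` plays the role of ∪ and `[01]` abbreviates the set {0,1}. This is a sketch; the names `BINARY_INT` and `is_binary_integer` are hypothetical.

```python
import re

# {1}{0,1}* ∪ {0}: either a 1 followed by any bits, or the single string 0.
BINARY_INT = re.compile(r"1[01]*|0")

def is_binary_integer(w: str) -> bool:
    """True iff the whole string w is in {1}{0,1}* ∪ {0}."""
    return BINARY_INT.fullmatch(w) is not None
```

`fullmatch` is used rather than `search` or `match` so that the entire string, not just a part of it, must belong to the language.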

  25. Regular Expressions

  26. Regular Expressions
 • A short-hand to denote a regular language as strings that match a pattern
 • Useful in
 – text search (editors, Unix/grep)
 – compilers: lexical analysis
 • Dates back to the 1950s: Stephen Kleene, who has a star named after him (the Kleene star "*")

  27. Inductive Definition
 A regular expression r over alphabet Σ is one of the following (L(r) is the language it represents):
 Atomic expressions (base cases):
   Ø                L(Ø) = Ø
   ε                L(ε) = {ε}
   a for a ∈ Σ      L(a) = {a}
 Inductively defined expressions:
   (r_1 + r_2)      L(r_1 + r_2) = L(r_1) ∪ L(r_2)    alt notation: (r_1 | r_2) or (r_1 ∪ r_2)
   (r_1 r_2)        L(r_1 r_2) = L(r_1) L(r_2)
   (r)*             L(r*) = L(r)*
 Any regular language has a regular expression and vice versa
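
The inductive definition can be mirrored by a recursive function that computes L(r), restricted to strings of length at most k so that the result stays finite. The tagged-tuple encoding, the constructor names, and the length bound are all assumptions of this sketch, not a standard representation.

```python
# Regular expressions as tagged tuples (a hypothetical encoding).
EMPTY = ("empty",)   # Ø
EPS = ("eps",)       # ε

def sym(a): return ("sym", a)            # a single symbol a in Σ
def alt(r1, r2): return ("alt", r1, r2)  # (r1 + r2)
def cat(r1, r2): return ("cat", r1, r2)  # (r1 r2)
def star(r): return ("star", r)          # (r)*

def lang(r, k):
    """Return L(r) restricted to strings of length at most k."""
    tag = r[0]
    if tag == "empty":
        return set()
    if tag == "eps":
        return {""}
    if tag == "sym":
        return {r[1]} if k >= 1 else set()
    if tag == "alt":
        return lang(r[1], k) | lang(r[2], k)
    if tag == "cat":
        return {x + y for x in lang(r[1], k) for y in lang(r[2], k)
                if len(x + y) <= k}
    if tag == "star":
        base = lang(r[1], k)
        result, frontier = {""}, {""}
        while frontier:  # keep appending pieces until nothing new fits
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= k} - result
            result |= frontier
        return result
    raise ValueError(f"unknown expression: {r!r}")

# (1(0+1)*) + 0, i.e. binary integers without leading 0's
binary_ints = alt(cat(sym("1"), star(alt(sym("0"), sym("1")))), sym("0"))
```

Each branch of `lang` corresponds to one row of the definition; the star case iterates to a fixed point, which terminates because only finitely many strings have length at most k.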

  28. Regular Expressions
 • Can omit many parentheses
 – By following precedence rules: * before concatenation before +
   e.g., r*s + t ≡ ((r*)s) + t
 – By associativity: (r+s)+t ≡ r+s+t, (rs)t ≡ rst
 • More short-hand notation
 – e.g., r^+ ≡ rr* (note: + is in superscript)
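
Python's `re` module follows the same precedence (star, then concatenation, then union), although it writes union as `|` and uses a postfix `+` for the r^+ short-hand. The sketch below compares patterns on all strings up to an assumed length bound over an assumed alphabet {a, b, c}; the helper name `same_language` is hypothetical, and agreement on a finite sample is only evidence of equivalence.

```python
import re
from itertools import product

def same_language(p: str, q: str, sigma: str = "abc", k: int = 4) -> bool:
    """Compare two `re` patterns on every string over sigma of length <= k."""
    for n in range(k + 1):
        for t in product(sigma, repeat=n):
            w = "".join(t)
            if (re.fullmatch(p, w) is None) != (re.fullmatch(q, w) is None):
                return False
    return True

# r*s + t reads as ((r*)s) + t, not as r*(s + t).
precedence_ok = same_language(r"a*b|c", r"(?:(?:a*)b)|c")
other_grouping = same_language(r"a*b|c", r"a*(?:b|c)")
plus_shorthand = same_language(r"a+", r"aa*")
```

The string `ac` separates the two groupings: it matches `a*(?:b|c)` but not `a*b|c`.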
