
CS 241: Systems Programming Lecture 23. Regular Expressions I



  1. CS 241: Systems Programming
     Lecture 23. Regular Expressions I
     Spring 2020
     Prof. Stephen Checkoway

  3. Theory of regular languages
     Mathematical theory of sets of strings
     ‣ You'll see this in CS 383
     Connection to finite state machines
     We're going to skip all of this for this course!

  4. Problem we want to solve
     Identify and/or extract text that matches a given pattern
     Examples
     ‣ Determine if a text string matches the pattern
     ‣ Find all lines of text in a file containing a given word
     ‣ Extract all phone numbers from a file
     ‣ Extract fields from structured text
     ‣ Classify types of text (e.g., compilers need to determine if some text is a number like 0x7D2, a symbol like ==, or a keyword like double)
     ‣ Find all of the tags in an HTML file
     Approach: Use a regular expression to specify the pattern

  5. grep(1)
     grep matches lines of input against a given regular expression (regex), printing each line that matches (or does not match)
     $ grep 'Computer Science' file
     ‣ prints each line of file that contains the string "Computer Science"
     More generally,
     $ grep regex file
     will print each line of file that matches the regular expression regex
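
To make the grep invocation above concrete, here is a minimal sketch. The sample lines and the variable names (sample, cs_lines) are made up for illustration; input is piped in rather than read from a file so that the snippet is self-contained.

```shell
# Hypothetical sample input; any text file would do.
sample=$(printf '%s\n' 'Computer Science at Oberlin' 'Mathematics' 'Intro to Computer Science')

# Keep only the lines that contain the string "Computer Science".
cs_lines=$(printf '%s\n' "$sample" | grep 'Computer Science')
printf '%s\n' "$cs_lines"
```

The line "Mathematics" is dropped because it does not contain the pattern anywhere.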

  6. What is a regular expression?
     Text that describes a search pattern
     Comes in a variety of "flavors"
     ‣ Basic Regular Expression (BRE)
     ‣ Extended Regular Expression (ERE)
     ‣ Perl-Compatible Regular Expressions (PCRE)
     Be careful not to confuse with file globbing, which uses similar special characters like * and ? but with slightly different meanings

  13. Baseline regex characters
      . (period)  any single character except newline
      *           0 or more of the preceding item (greedy)
      ^           start of a line
      $           end of the line
      [ ]         match one of the enclosed characters
                  ‣ [a-z] matches a range
                  ‣ [^ ] reverses the sense of match
                  ‣ put ] or - at start to be a member of the list
      Every other character just matches itself; precede any of the above with \ to treat it as a normal character that must literally match
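
The characters above can be tried out with a small sketch like the following. The word list and variable names are invented for illustration; each pattern is anchored with ^ and $ so that it must account for the whole line.

```shell
words=$(printf '%s\n' 'cat' 'cot' 'c.t' 'ct' 'cart')

# "." matches any single character: cat, cot, and c.t all match.
dot_any=$(printf '%s\n' "$words" | grep '^c.t$')

# "\." matches only a literal period: just c.t.
dot_literal=$(printf '%s\n' "$words" | grep '^c\.t$')

# [ao] matches one of the enclosed characters: cat and cot.
bracket=$(printf '%s\n' "$words" | grep '^c[ao]t$')

# [^a] matches any single character other than 'a': cot and c.t.
negated=$(printf '%s\n' "$words" | grep '^c[^a]t$')

# "*" allows zero or more of the preceding item: cat and ct.
star=$(printf '%s\n' "$words" | grep '^ca*t$')

printf '%s\n' "$dot_any" "$dot_literal" "$bracket" "$negated" "$star"
```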

  23. Examples
      a      Anything with the letter 'a'
      abc    Anything with the string 'abc'
      a.c    'a' followed by any character, then 'c'
      ^a     Line starting with 'a'
      a$     Line ending with 'a'
      ^a$    Line consisting of a single 'a'
      a.*b   'a', then zero or more of any character, then 'b' (includes 'ab')
      [abc]  One of 'a', 'b', or 'c'
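
A short sketch of the anchored examples from this table, run through grep. The input lines and variable names are assumptions chosen only to exercise each pattern.

```shell
lines=$(printf '%s\n' 'a' 'banana' 'apple' 'ab' 'a-to-b' 'xyz')

starts=$(printf '%s\n' "$lines" | grep '^a')      # lines starting with 'a'
ends=$(printf '%s\n' "$lines" | grep 'a$')        # lines ending with 'a'
only=$(printf '%s\n' "$lines" | grep '^a$')       # the line that is just 'a'
a_then_b=$(printf '%s\n' "$lines" | grep 'a.*b')  # an 'a' with a 'b' somewhere after it

printf 'starts:\n%s\nends:\n%s\nonly:\n%s\na.*b:\n%s\n' "$starts" "$ends" "$only" "$a_then_b"
```

Note that 'banana' does not match a.*b: its only 'b' comes before every 'a'.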

  24. Valid identifiers in C (things like variable or function names)
      1. start with either a letter or an underscore; and
      2. consist of letters, numbers, or underscores.
      E.g., main, foo_bar, _Okay123XY are valid identifiers;
      but 32x, foo-bar, and &blah are not
      Which regular expression describes valid C identifiers?
      A. [a-zA-Z0-9_]*
      B. [a-zA-Z0-9_][a-zA-Z0-9_]*
      C. [a-zA-Z_][a-zA-Z0-9_]*
      D. [^0-9][a-zA-Z0-9_]*
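
One way to experiment with the candidate answers is a small helper like the sketch below. The function name accepts is hypothetical; it relies on grep's -x option, which requires the pattern to match the whole line (without it, grep reports a match anywhere in the line). Option A is used as the sample candidate; swap in the others to compare.

```shell
# accepts PATTERN STRING: succeed only if the whole STRING matches PATTERN.
# (Hypothetical helper; LC_ALL=C keeps ranges like a-z to ASCII letters.)
accepts() {
    printf '%s\n' "$2" | LC_ALL=C grep -x -q -- "$1"
}

candidate='[a-zA-Z0-9_]*'   # option A from the slide

for id in main foo_bar _Okay123XY 32x foo-bar '&blah'; do
    if accepts "$candidate" "$id"; then
        echo "accepted: $id"
    else
        echo "rejected: $id"
    fi
done
```

With option A the loop reports 32x as accepted, even though it is not a valid identifier, so A is too permissive.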

  25. Basic regex (obsolete)
      \{m,n\}  match previous item at least m times, but at most n times
      \{m\}    match previous item exactly m times
      \{m,\}   match previous item at least m times
      \( \)    group and save enclosed pattern match
               ‣ \1 the first saved match
               ‣ \5 the fifth saved match
               ‣ Using such "back references" makes it not a real regular expression and should be avoided

  26. Extended regex (modern)
      {m,n}  match previous item at least m times, but at most n times
      ( )    group and save enclosed pattern match
      +      match 1 or more of the previous item (same as {1,})
      ?      match previous item 0 or 1 time (same as {0,1})
      |      match the RE either before or after
             ‣ apple|banana
      (ab|c+){2} matches 'abab', 'abc', 'abcccc', 'cab', 'cccab', 'ccccccccc'

  28. Examples
      (ab|c){2} matches 'abab', 'abc', 'cab', 'cc' (ERE)
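
The grouping, alternation, and repetition operators can be checked with grep -E, as in this sketch. The candidate strings and variable names are illustrative; -x is used so the pattern must cover the whole line.

```shell
strings=$(printf '%s\n' 'abab' 'abc' 'cab' 'cc' 'ab' 'abcc' 'abcccc')

# Exactly two repetitions of either "ab" or a single "c".
exact_two=$(printf '%s\n' "$strings" | grep -E -x '(ab|c){2}')

# Exactly two repetitions of either "ab" or a run of one or more "c".
with_plus=$(printf '%s\n' "$strings" | grep -E -x '(ab|c+){2}')

printf '(ab|c){2}:\n%s\n(ab|c+){2}:\n%s\n' "$exact_two" "$with_plus"
```

'abcc' shows the difference: it needs three items under (ab|c){2}, but only two ("ab" then "cc") under (ab|c+){2}.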

  29. POSIX character classes
      Within brackets [ ], we can use character classes corresponding to those in ctype.h by surrounding the name with [: and :]
      ‣ alnum, digit, punct, alpha, graph, space, blank, lower, upper, cntrl, print, xdigit
      ‣ E.g., [[:digit:][:blank:]]
      Shortcuts (needs "enhanced" regular expressions):
      ‣ \d is [[:digit:]]    \D is [^[:digit:]]
      ‣ \s is [[:space:]]    \S is [^[:space:]]
      ‣ \w is [[:alnum:]_]   \W is [^[:alnum:]_]
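
As a rough illustration of the bracketed class syntax, the sketch below filters a few made-up lines with grep -E. It sticks to the [: :] forms, since the \d-style shortcuts are not available in every grep.

```shell
lines=$(printf '%s\n' '2020' 'Spring 2020' 'spring' '0x7D2')

# Lines made up entirely of digits.
all_digits=$(printf '%s\n' "$lines" | grep -E -x '[[:digit:]]+')

# Lines of the form 0x followed by hexadecimal digits.
hex=$(printf '%s\n' "$lines" | grep -E -x '0x[[:xdigit:]]+')

# Lines containing no digits at all (negated class inside the brackets).
no_digit=$(printf '%s\n' "$lines" | grep -E -x '[^[:digit:]]+')

printf '%s\n' "$all_digits" "$hex" "$no_digit"
```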

  30. Which string does the ERE
      \([[:digit:]]{3}\) [[:digit:]]{3}-[[:digit:]]{4}
      match?
      A. ([1]{3}) [2]{3}-[3]{4}
      B. 123 456-7890
      C. (123) 456-7890
      D. \(123\) 456-7890

  31. grep(1)
      Name comes from the ed(1) command g/re/p
      grep -E re files   use extended regex (or use egrep)
      egrep -l re files  just list names of files with a match
      egrep -c re files  just print a count of matching lines
      egrep -n re files  prefix each matching line with its line number
      egrep -i re files  ignore case
      egrep -v re files  show non-matching lines
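
The options above can be compared side by side with a sketch like this one. The temporary file and its contents are invented for the example, and grep -E is used in place of egrep because recent GNU grep releases warn that egrep is obsolescent.

```shell
tmp=$(mktemp)
printf '%s\n' 'Apple pie' 'banana bread' 'apple tart' 'cherry' > "$tmp"

count=$(grep -E -c 'apple|banana' "$tmp")        # number of matching lines
numbered=$(grep -E -n 'apple|banana' "$tmp")     # matching lines with line numbers
nocase=$(grep -E -i -c 'apple' "$tmp")           # case-insensitive count
inverted=$(grep -E -v -i 'apple|banana' "$tmp")  # lines that do NOT match
listed=$(grep -E -l 'cherry' "$tmp")             # name of the file containing a match

printf '%s\n' "$count" "$numbered" "$nocase" "$inverted" "$listed"
rm -f "$tmp"
```

Without -i, 'Apple pie' is not counted because the pattern is lowercase.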

  32. awk(1)
      Named after the developers
      ‣ A. Aho
      ‣ P. Weinberger
      ‣ B. Kernighan
      Programming language for working on files
      Consists of a sequence of pattern-action statements of the form
      ‣ pattern { action }
      ‣ Each line of the input is compared against each pattern in order; each matching pattern has its associated action run

  33. Running AWK
      Running
      ‣ $ awk -f foo.awk files   # foo.awk contains the program
      ‣ $ awk prog files         # pattern-action separated by ;
      Understands whitespace separated fields (can change this via -F option)
      ‣ $1, $2, $3
      ‣ $0 is the whole line
      Other variables, just use their names
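
Field splitting is easiest to see on a single line of input. In this sketch the input strings and the variable name first are arbitrary; the third command uses -F to split on ':' instead of whitespace.

```shell
# Default splitting on whitespace: $2 is the second field.
second=$(echo 'alpha beta gamma' | awk '{ print $2 }')

# $0 is the whole, unsplit line.
whole=$(echo 'alpha beta gamma' | awk '{ print $0 }')

# -F changes the field separator.
third=$(echo 'a:b:c' | awk -F ':' '{ print $3 }')

# An ordinary variable is used just by naming it; statements are separated by ;
copied=$(echo 'alpha beta gamma' | awk '{ first = $1; print first }')

printf '%s\n' "$second" "$whole" "$third" "$copied"
```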

  34. Patterns
      /re/             matches the regular expression re
      BEGIN            matches before any input is used (can be used to set variables)
      END              matches after all input is used (e.g., can print things)
      expr             matches if the expression is nonzero
      p1, p2           matches all lines between the line matching p1 and the line matching p2 (including those lines)
      (empty pattern)  matches every line

  35. Simple AWK program
      Prints the lines of a file, preceded by START and followed by END
        BEGIN { print "START" }
              { print }
        END   { print "END" }
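
Here is the START/END program run on two lines of made-up input, followed by a range pattern of the p1, p2 form. The inputs and variable names are assumptions for illustration only.

```shell
# BEGIN runs before any input, the empty pattern runs on every line, END runs last.
wrapped=$(printf '%s\n' 'one' 'two' |
    awk 'BEGIN { print "START" } { print } END { print "END" }')

# Range pattern: from the first line matching /b/ through the line matching /d/.
ranged=$(printf '%s\n' 'a' 'b' 'c' 'd' 'e' | awk '/b/,/d/')

printf '%s\n---\n%s\n' "$wrapped" "$ranged"
```

The range pattern has no action, so awk falls back to printing each selected line.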

  36. Actions
      An action is a sequence of statements inside { } separated by ;
      ‣ assignment statements: var = value
      ‣ conditionals/loops: if, while, for, do-while, break, continue
      ‣ for (var in array) stmt
      ‣ print expr-list
      ‣ printf format, expr-list
      A missing action means to print the line

  37. Simple AWK program
      Prints lines longer than 72 characters
        length($0) > 72 { print }
      Missing action block means print
        length($0) > 72
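
A quick way to try the length filter is to generate one line at the limit and one just past it. In this sketch the two test lines are built with printf zero-padding, which is only a convenient trick for producing strings of a known length.

```shell
short=$(printf '%072d' 0)   # exactly 72 characters
long=$(printf '%073d' 0)    # 73 characters

# No action block, so awk prints each line for which the expression is true.
kept=$(printf '%s\n' "$short" "$long" | awk 'length($0) > 72')

echo "kept a line of ${#kept} characters"
```

Only the 73-character line survives, since 72 is not greater than 72.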

  38. Sum up a list of numbers
        BEGIN { SUM = 0 }
              { SUM += $1 }
        END   { print "Total is", SUM }
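
Run on three made-up numbers, one per line, the summing program behaves as sketched here. The comma in print joins its arguments with a single space by default.

```shell
total=$(printf '%s\n' 10 20 12 |
    awk 'BEGIN { SUM = 0 } { SUM += $1 } END { print "Total is", SUM }')

echo "$total"   # Total is 42
```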

  39. Print size and owner from ls -l
        $ ls -l | awk '{ print $5, "\t", $3 }'

  40. Given pop.txt with lines containing zip code, county, population, e.g.,
        44001, Lorain, 20769
        44011, Lorain, 21193
      what is the awk command to print out the population of Oberlin (zip code 44074)?
      A. $ awk -F ', ' '/44074/ { print $3 }'
      B. $ awk -F ', ' '$0 == 44074 { print $2 }'
      C. $ awk -F ', ' '$1 == 44074 { print $3 }'
      D. $ awk -F ', ' '44074 { print $2 }'

  41. In-class exercise
      https://regex.sketchengine.co.uk
      Do the four interactive exercises
      Grab a laptop and a partner and try to get as much of that done as you can!
