CSCI 2133 – Rapid Programming Techniques for Innovation
Week 2 – Regular Expression and Implementation
Previous Lecture
• About the course project
  • Choosing a topic, defining scope, and forming a team
  • Preparing design document (P0)
  • Planning the process
  • Interfaces, choices, tools
• About background knowledge: Java, C, Linux command line
Regular Expressions
• Covered in CSCI 2132
• We will now look at a basic implementation
• They are a good example of a general-purpose tool in software development, and one with several different implementations
• An implementation is described in the textbook [Kernighan and Pike, sec. 9.2, p. 222]
• Regular expressions are pervasive in Unix environments, less so on other platforms
• There are different flavours of regexes; we can start with grep
What Is a Regular Expression?
• A regular expression (regex) describes a set of possible input strings.
• Regular expressions descend from a fundamental concept in Computer Science called finite automata theory
• Regular expressions are endemic to Unix
  • vi, ed, sed, and emacs
  • awk, tcl, perl and Python
  • grep, egrep, fgrep
  • compilers
What is a regular expression?
/[a-zA-Z_\-]+@(([a-zA-Z_\-])+\.)+[a-zA-Z]{2,4}/
• regular expression ("regex"): describes a pattern of text
  • can test whether a string matches the expression's pattern
  • can use a regex to search/replace characters in a string
  • very powerful, but tough to read
• regular expressions occur in many places:
  • text editors (TextPad) allow regexes in search/replace
  • languages: JavaScript; Java Scanner, String split
  • Unix/Linux/Mac shell commands (grep, sed, find, etc.)
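The two uses listed above (testing a string and search/replace) can be sketched in Python with the standard re module. This is only an illustration: the function names looks_like_email and mask_emails and the sample addresses are made up here, and the pattern is the e-mail-like one from this slide, written with no spaces inside it. It is not a real e-mail validator.

```python
import re

# The e-mail-like pattern from the slide (illustrative only).
EMAIL = re.compile(r"[a-zA-Z_\-]+@(([a-zA-Z_\-])+\.)+[a-zA-Z]{2,4}")


def looks_like_email(text: str) -> bool:
    """Test: does the whole string match the pattern?"""
    return EMAIL.fullmatch(text) is not None


def mask_emails(text: str) -> str:
    """Search/replace: substitute every matching substring."""
    return EMAIL.sub("<email>", text)


print(looks_like_email("jane_doe@cs.example.org"))  # True
print(looks_like_email("not an email"))             # False
print(mask_emails("mail jane@example.com now"))     # mail <email> now
```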
Regular expression: cks
• "UNIX Tools rocks." : match
• "UNIX Tools sucks." : match
• "UNIX Tools is okay." : no match
Regular Expressions
• A regular expression can match a string in more than one place.
Regular expression: apple
• "Scrapple from the apple." : match 1 (inside "Scrapple"), match 2 ("apple")
Regular Expressions
• The . regular expression can be used to match any character.
Regular expression: o.
• "For me to poop on." : match 1, match 2 (highlighted on the slide)
Character Classes
• Character classes [] can be used to match any specific set of characters.
Regular expression: b[eor]at
• "beat a brat on a boat" : match 1 ("beat"), match 2 ("brat"), match 3 ("boat")
Negated Character Classes
• Character classes can be negated with the [^] syntax.
Regular expression: b[^eo]at
• "beat a brat on a boat" : match ("brat")
More About Character Classes
• [aeiou] will match any of the characters a, e, i, o, or u
• [kK]orn will match korn or Korn
• Ranges can also be specified in character classes
  • [1-9] is the same as [123456789]
  • [abcde] is equivalent to [a-e]
• You can also combine multiple ranges
  • [abcde123456789] is equivalent to [a-e1-9]
• Note that the - character has a special meaning in a character class only when it is used within a range; [-123] would match the characters -, 1, 2, or 3
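A quick way to check the character-class examples is to run them. The sketch below uses Python's re module as a stand-in for grep (an assumption for illustration; the bracket syntax shown here behaves the same way in both). The variable names and the string "x-1y4" are made up.

```python
import re

sentence = "beat a brat on a boat"

# [eor] accepts exactly one of e, o, r in that position.
in_set = re.findall(r"b[eor]at", sentence)
print(in_set)    # ['beat', 'brat', 'boat']

# [^eo] accepts any single character except e or o.
negated = re.findall(r"b[^eo]at", sentence)
print(negated)   # ['brat']

# A combined range is equivalent to listing every character.
combined = re.fullmatch(r"[a-e1-9]+", "abcde123456789") is not None
print(combined)  # True

# A leading - is literal because it is not part of a range.
hyphen = re.findall(r"[-123]", "x-1y4")
print(hyphen)    # ['-', '1']
```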
Named Character Classes
• Commonly used character classes can be referred to by name (alpha, lower, upper, alnum, digit, punct, cntrl)
• Syntax: [:name:], written inside a bracket expression
  • [a-zA-Z] corresponds to [[:alpha:]]
  • [a-zA-Z0-9] corresponds to [[:alnum:]]
  • [45a-z] corresponds to [45[:lower:]]
• Important for portability across languages and locales
Anchors
• Anchors are used to match at the beginning or end of a line (or both).
  • ^ means beginning of the line
  • $ means end of the line
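Regular expression: ^b[eor]at
• "beat a brat on a boat" : match ("beat", at the beginning of the line)
Regular expression: b[eor]at$
• "beat a brat on a boat" : match ("boat", at the end of the line)
• ^word$ matches a line that consists only of "word"
• ^$ matches an empty line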
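The anchor examples can be reproduced with a few lines of Python (again using the re module as a stand-in for grep, which is an assumption made for illustration; the list named lines is made-up test data).

```python
import re

sentence = "beat a brat on a boat"

# ^ pins the match to the beginning of the line, $ to the end.
at_start = re.findall(r"^b[eor]at", sentence)
at_end = re.findall(r"b[eor]at$", sentence)
print(at_start)  # ['beat']
print(at_end)    # ['boat']

# With both anchors the pattern must account for the whole line.
lines = ["word", "a word", "", "words"]
whole_line = [s for s in lines if re.search(r"^word$", s)]
empty_lines = [s for s in lines if re.search(r"^$", s)]
print(whole_line)   # ['word']
print(empty_lines)  # ['']
```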
Repetition
• The * is used to define zero or more occurrences of the single regular expression preceding it.
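Regular expression: ya*y
• "I got mail, yaaaaaaaaaay!" : match ("yaaaaaaaaaay")
Regular expression: oa*o
• "For me to poop on." : match ("oo", with zero occurrences of a)
• .* matches any sequence of characters, including the empty one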
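The constructs seen so far (literal characters, ., ^, $, and *) are enough to write a small recursive matcher. What follows is a minimal Python sketch in the spirit of a compact backtracking implementation; it is not a transcription of any textbook code. The function names match, match_here, and match_star are chosen here for illustration, and character classes are deliberately left out.

```python
def match(regexp: str, text: str) -> bool:
    """Return True if regexp matches anywhere in text."""
    if regexp.startswith("^"):
        return match_here(regexp[1:], text)
    # Try every starting position, including the empty suffix.
    for start in range(len(text) + 1):
        if match_here(regexp, text[start:]):
            return True
    return False


def match_here(regexp: str, text: str) -> bool:
    """Return True if regexp matches at the beginning of text."""
    if regexp == "":
        return True
    if len(regexp) >= 2 and regexp[1] == "*":
        return match_star(regexp[0], regexp[2:], text)
    if regexp == "$":
        return text == ""
    if text != "" and regexp[0] in (".", text[0]):
        return match_here(regexp[1:], text[1:])
    return False


def match_star(c: str, regexp: str, text: str) -> bool:
    """Match zero or more occurrences of c, then the rest of regexp."""
    i = 0
    while True:
        if match_here(regexp, text[i:]):
            return True
        if i < len(text) and (c == "." or text[i] == c):
            i += 1  # consume one more c and try again
        else:
            return False


print(match("ya*y", "I got mail, yaaaaaaaaaay!"))  # True
print(match("oa*o", "For me to poop on."))         # True
print(match("^brat", "beat a brat"))               # False
```

This version only answers yes or no. It tries the shortest repetition first, so reporting the longest match would need a different match_star.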
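Match length
• A match will be the longest string that satisfies the regular expression.
Regular expression: a.*e
• "Scrapple from the apple." : the shorter candidates are not the match (no, no); the longest candidate, from the first a to the last e, is the match (yes)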
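The longest-match behaviour is easy to observe. A small sketch, assuming Python's re module: its * is greedy, which for this particular pattern gives the same result as the longest-match rule stated above.

```python
import re

text = "Scrapple from the apple."

# .* grabs as much as it can, then backs off to the last e.
m = re.search(r"a.*e", text)
longest = m.group(0)
print(longest)  # apple from the apple
```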
Question
• Can you use what we have covered so far to write one regex that matches phone numbers such as: 123-234-3455, (123)-234-2344, (123)123-2345, ( 212) 123 - 23445?
Repetition Ranges
• Ranges can also be specified
• { } notation can specify a range of repetitions for the immediately preceding regex
  • {n} means exactly n occurrences
  • {n,} means at least n occurrences
  • {n,m} means at least n occurrences but no more than m occurrences
• Example:
  • .{0,} same as .*
  • a{2,} same as aaa*
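The equivalence a{2,} = aaa* can be checked mechanically on a handful of strings. A minimal sketch in Python (the sample strings are made up; Python's { } syntax is unescaped, like egrep's):

```python
import re

samples = ["", "a", "aa", "aaa", "aaaaaa", "ab"]

# Compare the brace form with the spelled-out form on every sample.
braces = [re.fullmatch(r"a{2,}", s) is not None for s in samples]
spelled = [re.fullmatch(r"aaa*", s) is not None for s in samples]
print(braces == spelled)  # True

# {n} and {n,m} on digit strings.
exactly_three = re.fullmatch(r"[0-9]{3}", "123") is not None
two_to_four = [re.fullmatch(r"[0-9]{2,4}", s) is not None
               for s in ["1", "12", "1234", "12345"]]
print(exactly_three)  # True
print(two_to_four)    # [False, True, True, False]
```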
Subexpressions
• If you want to group part of an expression so that * or { } applies to more than just the previous character, use ( ) notation
• Subexpressions are treated like a single character
  • a* matches 0 or more occurrences of a
  • abc* matches ab, abc, abcc, abccc, …
  • (abc)* matches the empty string, abc, abcabc, abcabcabc, …
  • (abc){2,3} matches abcabc or abcabcabc
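The difference between abc* and (abc)* shows up as soon as both are tried on the same strings. A short sketch in Python; the helper name full is made up and simply means "the whole string matches".

```python
import re


def full(pattern: str, s: str) -> bool:
    """True when the pattern matches the entire string."""
    return re.fullmatch(pattern, s) is not None


# Without parentheses, * applies only to the c.
print(full(r"abc*", "abccc"))         # True
print(full(r"abc*", "abcabc"))        # False

# With parentheses, * and { } apply to the whole group.
print(full(r"(abc)*", "abcabc"))      # True
print(full(r"(abc)*", ""))            # True
print(full(r"(abc){2,3}", "abc"))     # False
print(full(r"(abc){2,3}", "abcabc"))  # True
```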
grep
• grep comes from the ed (Unix text editor) search command "global regular expression print", or g/re/p
• This was such a useful command that it was written as a standalone utility
• There are two other variants, egrep and fgrep, which together with grep make up the grep family
• grep is the answer for those moments when you know you want the file that contains a specific phrase but you can't remember its name
Family Differences
• grep: uses regular expressions for pattern matching
• fgrep: fixed-string grep; does not use regular expressions and only matches fixed strings, but can read its search strings from a file
• egrep: extended grep; uses a more powerful set of regular expressions but, in its traditional form, does not support backreferences; generally the fastest member of the grep family
• agrep: approximate grep; not standard
Syntax
• The regular expression concepts we have seen so far are common to grep and egrep.
• grep and egrep have slightly different syntax
  • grep: BREs (basic regular expressions)
  • egrep: EREs (extended regular expressions, with additional features we will discuss)
• Major syntax differences:
  • grep: \( and \), \{ and \}
  • egrep: ( and ), { and }
Protecting Regex Metacharacters
• Since many of the special characters used in regexes also have special meaning to the shell, it's a good idea to get in the habit of single quoting your regexes
• This will protect any special characters from being operated on by the shell
• If you do it habitually, you won't have to worry about when it is necessary
Escaping Special Characters
• Even though we are single quoting our regexes so the shell won't interpret the special characters, some characters are special to grep (e.g. * and .)
• To get literal characters, we escape the character with a \ (backslash)
• Suppose we want to search for the character sequence a*b*
  • Unless we do something special, this will match zero or more 'a's followed by zero or more 'b's, which is not what we want
  • a\*b\* will fix this: now the asterisks are treated as regular characters
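The effect of escaping can be seen by running both versions of the pattern on the same text. A sketch in Python, where the backslash escape works the same way as in grep; the sample line is made up, and re.escape is Python's helper for escaping a literal string automatically.

```python
import re

line = "the literal text a*b* appears here, and so does aabb"

# Escaped: each \* stands for an actual asterisk character.
literal_hits = re.findall(r"a\*b\*", line)
print(literal_hits)  # ['a*b*']

# Unescaped: * is a quantifier, so runs of a's and b's match instead.
# (Empty matches are dropped to keep the list readable.)
pattern_hits = [m for m in re.findall(r"a*b*", line) if m]
print("aabb" in pattern_hits)  # True
print("a*b*" in pattern_hits)  # False

# re.escape adds the backslashes for us.
escaped = re.escape("a*b*")
print(escaped == r"a\*b\*")    # True
```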
Egrep: Alternation
• Regex also provides an alternation character | for matching one or another subexpression
  • (T|Fl)an will match 'Tan' or 'Flan'
  • ^(From|Subject): will match the From and Subject lines of a typical email message
    • It matches a beginning of line followed by either the characters 'From' or 'Subject', followed by a ':'
• Subexpressions are used to limit the scope of the alternation
  • At(ten|nine)tion matches "Attention" or "Atninetion", not "Atten" or "ninetion" as would happen without the parentheses: Atten|ninetion
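The scope of | is easiest to see by testing the pattern with and without parentheses. A sketch in Python, whose alternation syntax matches egrep's for these examples; the helper full and the sample header lines are made up.

```python
import re


def full(pattern: str, s: str) -> bool:
    """True when the pattern matches the entire string."""
    return re.fullmatch(pattern, s) is not None


print(full(r"(T|Fl)an", "Tan"), full(r"(T|Fl)an", "Flan"))  # True True

# Select header lines that start with From: or Subject:
headers = ["From: alice", "Subject: hi", "To: bob", "Re: From: x"]
selected = [h for h in headers if re.search(r"^(From|Subject):", h)]
print(selected)  # ['From: alice', 'Subject: hi']

# Parentheses limit the alternation to ten|nine ...
print(full(r"At(ten|nine)tion", "Attention"))  # True
print(full(r"At(ten|nine)tion", "Atten"))      # False
# ... without them the alternatives are Atten and ninetion.
print(full(r"Atten|ninetion", "Atten"))        # True
print(full(r"Atten|ninetion", "Attention"))    # False
```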
Egrep: Repetition Shorthands
• The * (star) has already been seen to specify zero or more occurrences of the immediately preceding character
• + (plus) means "one or more"
  • abc+d will match 'abcd', 'abccd', or 'abccccccd' but will not match 'abd'
  • Equivalent to {1,}
Egrep: Repetition Shorthands cont
• The ? (question mark) specifies an optional character, the single character that immediately precedes it
  • July? will match 'Jul' or 'July'
  • Equivalent to {0,1}
  • Also equivalent to (Jul|July)
• The *, ?, and + are known as quantifiers because they specify the quantity of a match
• Quantifiers can also be used with subexpressions
  • (a*c)+ will match 'c', 'ac', 'aac' or 'aacaacac' but will not match 'a' or a blank line
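The quantifier examples from these two slides can be verified directly. A sketch in Python, where + and ? behave as in egrep for these patterns; the helper name full is made up.

```python
import re


def full(pattern: str, s: str) -> bool:
    """True when the pattern matches the entire string."""
    return re.fullmatch(pattern, s) is not None


# + requires at least one c.
plus = [full(r"abc+d", s) for s in ["abcd", "abccd", "abccccccd", "abd"]]
print(plus)      # [True, True, True, False]

# ? makes only the y optional.
optional = [full(r"July?", s) for s in ["Jul", "July", "Jully"]]
print(optional)  # [True, True, False]

# A quantifier applied to a subexpression.
grouped = [full(r"(a*c)+", s) for s in ["c", "ac", "aac", "aacaacac", "a", ""]]
print(grouped)   # [True, True, True, True, False, False]
```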
Grep: Backreferences
• Sometimes it is handy to be able to refer to a match that was made earlier in a regex
• This is done using backreferences
  • \n is the backreference specifier, where n is a number
  • It matches the same text that the nth subexpression matched
• For example, to find if the first word of a line is the same as the last:
  • ^\([[:alpha:]]\{1,\}\) .* \1$
  • The \([[:alpha:]]\{1,\}\) matches 1 or more letters
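The first-word-equals-last-word example can be tried out in Python. Two assumptions are made in this sketch: Python uses unescaped ( ) and +, and its re module has no [[:alpha:]], so [A-Za-z] is used instead. The function name and the sample lines are made up.

```python
import re

# Python rendering of ^\([[:alpha:]]\{1,\}\) .* \1$
SAME_FIRST_LAST = re.compile(r"^([A-Za-z]+) .* \1$")


def first_equals_last(line: str) -> bool:
    """True when the line starts and ends with the same word."""
    return SAME_FIRST_LAST.search(line) is not None


print(first_equals_last("hello big hello"))  # True
print(first_equals_last("one two three"))    # False
# The pattern needs two spaces (one on each side of .*),
# so a two-word line does not match.
print(first_equals_last("hello hello"))      # False
```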
Practical Regex Examples
• Variable names in C
  • [a-zA-Z_][a-zA-Z_0-9]*
• Dollar amount with optional cents
  • \$[0-9]+(\.[0-9][0-9])?
• Time of day
  • (1[012]|[1-9]):[0-5][0-9] (am|pm)
• HTML heading tags <h1> <H1> <h2> …
  • <[hH][1-4]>
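Each of these patterns can be exercised against a few accepted and rejected strings. A sketch in Python, where the patterns run unchanged; the constant names, the helper full, and the test strings are made up for illustration.

```python
import re

C_IDENT = r"[a-zA-Z_][a-zA-Z_0-9]*"
DOLLARS = r"\$[0-9]+(\.[0-9][0-9])?"
TIME = r"(1[012]|[1-9]):[0-5][0-9] (am|pm)"
HEADING = r"<[hH][1-4]>"


def full(pattern: str, s: str) -> bool:
    """True when the pattern matches the entire string."""
    return re.fullmatch(pattern, s) is not None


print(full(C_IDENT, "_tmp1"), full(C_IDENT, "1abc"))     # True False
print(full(DOLLARS, "$12.50"), full(DOLLARS, "$12.5"))   # True False
print(full(TIME, "12:59 pm"), full(TIME, "13:00 pm"))    # True False
print(full(HEADING, "<H3>"), full(HEADING, "<h5>"))      # True False
```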