Perl: Regular expressions
A powerful tool for searching and transforming text.

SENG 265: Software Development Methods
University of Victoria, Department of Computer Science
Perl Regular Expression: Slide 1

Motivation

• We have seen many operations involving string comparisons
• Several Perl built-in functions also help with operations on strings
  – split & join
  – substr
  – length
• There is a lot we can do with such functions
• Example:
  – Given a string holding some timestamp, extract the different parts of the date & time

    while (my $line = <STDIN>) {
        chomp $line;
        if ($line eq "BEGIN:VSTART") {
            # ...
        }
    }

    # ...
    my ($property, $value) = split /:/, $foo;
    if ($property eq "DSTART") {
        # ... etc etc etc
    }

    @csv_fields = split /,/, $input_line;
    $output = join ":", @data;
    $first_char = substr $input, 0, 1;
    $width = length $heading;
    print $heading, "\n";
    print "-" x $width;

Motivation

• Recall:
  – iCalendar dates are used by iCal-like programs
  – The year, month, etc. portions of the date string are fixed in position
• How could we use "substr" to help us?

    my $datetime = "20051225T053000";

    $year  = substr $datetime, 0, 4;
    $month = substr $datetime, 4, 2;
    $day   = substr $datetime, 6, 2;
    $hour  = substr $datetime, 9, 2;
    $min   = substr $datetime, 11, 2;
    $sec   = substr $datetime, 13, 2;

• This code certainly obtains what we need.
  – But it can be a bit tricky to get right.
  – Adapting code to use another date/time format is not trivial…
  – … and is bugbait!

The slide stamps the following listing "Hazardous to your health". Several of its offsets and lengths do not line up with the string (for example, substr $datetime, 1, 5 yields "2003-", not "2003").

    # ISO 8601 time format
    my $datetime = "i2003-10-31T13:37:14-0500";

    $year  = substr $datetime, 1, 5;
    $month = substr $datetime, 7, 8;
    # coffee break
    # ...
    $day   = substr $datetime, 9, 2;
    $hour  = substr $datetime, 12, 2;
    $min   = substr $datetime, 14, 2;
    $sec   = substr $datetime, 16, 2;
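
To see how easily the substr offsets drift, here is a minimal sketch that recounts them by hand against the ISO 8601-style string from this slide. The variable names follow the slide; the offsets are a fresh count and are not taken from the slide itself.

```perl
use strict;
use warnings;

my $datetime = "i2003-10-31T13:37:14-0500";

# Offsets counted by hand: index 0 is the leading "i",
# and every "-", "T" and ":" separator occupies one position.
my $year  = substr $datetime, 1, 4;
my $month = substr $datetime, 6, 2;
my $day   = substr $datetime, 9, 2;
my $hour  = substr $datetime, 12, 2;
my $min   = substr $datetime, 15, 2;
my $sec   = substr $datetime, 18, 2;

print "$year-$month-$day $hour:$min:$sec\n";   # prints 2003-10-31 13:37:14
```

Every separator shifts all later offsets by one, which is exactly the kind of bookkeeping that goes wrong when the format changes.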

Motivation

• A better method is to indicate the string's pattern in a way that reflects the actual order of pattern components
  – The date begins at the start of the string.
  – The year is four digits.
  – The month follows (two digits)…
  – … and then the day.
  – The "T" character separates the date and time
  – Hour, minute and second follow, each two digits long.

    my $datetime = "20051225T053000";

    my ($year, $month, $day,
        $hour, $minute, $second)
        = $datetime =~ m{ \A       # start of string
                          (\d{4})  # year
                          (\d{2})  # month
                          (\d{2})  # day
                          T        # literal T
                          (\d{2})  # hour
                          (\d{2})  # minute
                          (\d{2})  # second
                          \z       # end of string
                        }xms;

• For the elder Perlmongers:

    if ($datetime =~
        /^(\d{4})(\d{2})(\d{2})T(\d{2})(\d{2})(\d{2})$/) {
        ($year, $month, $day, $hour, $min, $sec)
            = ($1, $2, $3, $4, $5, $6);
    }
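
One detail worth sketching: in list context, a match that fails returns an empty list, so the caller can tell a malformed timestamp from a good one. The helper name parse_ical_datetime and the second test string are assumptions made for illustration; they are not part of the slides.

```perl
use strict;
use warnings;

# parse_ical_datetime is a hypothetical helper, not part of the slides.
# Returns the six captured fields, or an empty list if the string does
# not have the YYYYMMDDTHHMMSS shape.
sub parse_ical_datetime {
    my ($datetime) = @_;
    my @fields = $datetime =~ m{ \A
                                 (\d{4}) (\d{2}) (\d{2})   # year, month, day
                                 T                         # literal T
                                 (\d{2}) (\d{2}) (\d{2})   # hour, minute, second
                                 \z
                               }xms;
    return @fields;
}

for my $candidate ("20051225T053000", "2005-12-25 05:30") {
    my @fields = parse_ical_datetime($candidate);
    if (@fields) {
        print "$candidate: @fields\n";
    }
    else {
        print "$candidate: not an iCalendar date-time\n";
    }
}
```

Checking the returned list before using the fields avoids silently working with undefined values when the input has the wrong shape.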

Motivation

• Back to our "code modification" example
  – Now we have a different date format
  – Using a regular expression, we can greatly reduce the possibility of bugs
  – String begins with an "i" …
  – followed by year…
  – followed by a dash…
  – followed by month…
  – etc…

    # ISO 8601 time format
    my $datetime = "i2003-10-31T13:37:14-0500";

    my ($year, $month, $day, $hour, $minute, $second)
        = $datetime
            =~ m{ \A       # start of string
                  i        # literal i
                  (\d{4})  # year
                  -        # literal dash
                  (\d{2})  # month
                  -        # literal dash
                  (\d{2})  # day
                  T        # literal T
                  (\d{2})  # hour
                  :        # literal colon
                  (\d{2})  # minute
                  :        # literal colon
                  (\d{2})  # second
                  .+       # ignore remainder
                  \z       # end of string
                }xms;

Topics

• Simple matching
• Metacharacters
• Anchored search
• Character classes
• Range operators in character classes
• Matching any character
• Grouping
• Extracting Matches
• Search and Replace

• Our coverage of regex syntax will be much more slowly paced than the "motivation" just shown!
  – Previous slides have been shown to give you a "flavour" of what regular expressions can achieve.
  – We will learn how to construct such expressions over the next few lectures.
• We have a range of topics
• Regular expressions can seem complex and cryptic
  – However, slow and patient work with such expressions will improve your productivity.

Perl Regular Expressions

• Perl is renowned for its excellence at text processing.
• Its handling of regular expressions is a big factor in its fame.
• Mastering even the basics will allow you to manipulate text with ease.
• Regular expressions have a strong formalism: finite-state automata (FSA).
• You have already used some and seen others.

    % ls *.c
    % ps aux | grep "s265s*" | less

• Other languages have some support for regexes, usually via some library.

    Java:
        import java.util.regex.*;
    Python:
        import re
    C#:
        using System.Text.RegularExpressions;

Simple String Matching

• Regular expressions are usually used in conjunction with an "if"
  – "if <string matches this pattern> …"
  – "... then <do something with that match>."
• The simplest such match refers to a string
• But note: this is much different than using "eq"

    my $line = <SOMEINPUT>;
    chomp $line;

    # Unbeknownst to programmer, the first line
    # of the input is the line "Hello, World"

    if ($line =~ m/World/xms) {
        print "Regexp matches!\n";
    }
    else {
        print "Oh, poop.\n";
    }

    if ($line eq "World") {
        print "line is equal to 'World'\n";
    }
    else {
        print "line sure ain't equal to 'World'\n";
    }

A word about "m/yadayada/xms"

• The text between the two slashes is the regular expression ("regex").
• Leading "m" indicates the regex is used for a match
• Trailing "xms" are three regex options
  – "x": Extended formatting (whitespace and comments in the regex are ignored)
  – "m": "^" and "$" match at line boundaries (and eliminates a cause of some subtle bugs)
  – "s": ensures everything, including newline, is matched by the "." symbol
• Why all of this verbiage instead of plain old "/yadayada/" as of old? Compare:

    /'[^\\']*(?:\\.[^\\']*)*'/

    m{ '            # an opening single quote
       [^\\']*      # any non-special chars
       (?:          # then all of..
         \\ .       # any explicitly backslashed char
         [^\\']*    # followed by any non-special chars
       )*           # repeated zero or many times
       '            # a closing single quote
     }xms

• Also note: "m{ }" or "m//"
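
The effect of each option is easiest to see on a string that contains a newline. The following is a minimal sketch; the sample text and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# /x: whitespace and comments inside the pattern are ignored,
# so a literal space has to be written explicitly (here as \s).
my $x_ok = $text =~ m{ first \s line   # two words, one whitespace char
                     }x ? 1 : 0;

# Without /m, ^ matches only at the very start of the string.
my $no_m   = $text =~ /^second/  ? 1 : 0;
# With /m, ^ also matches just after each embedded newline.
my $with_m = $text =~ /^second/m ? 1 : 0;

# Without /s, "." refuses to match a newline.
my $no_s   = $text =~ /first.*second/  ? 1 : 0;
# With /s, "." matches any character, newline included.
my $with_s = $text =~ /first.*second/s ? 1 : 0;

print "x=$x_ok no_m=$no_m with_m=$with_m no_s=$no_s with_s=$with_s\n";
# prints x=1 no_m=0 with_m=1 no_s=0 with_s=1
```

Using all three options on every regex means "^", "$" and "." behave the same way everywhere, regardless of whether the input happens to span several lines.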

Another example

• The code below searches for a pattern in some dictionary file
  – Note that a command-line argument is being used for a regex!
  – Also note "<>" syntax: This takes the first unused command-line argument, and uses it as a filename for opening!

    #!/usr/bin/perl

    use strict;

    my $regexp = shift @ARGV;

    while (my $word = <>) {
        if ($word =~ m/$regexp/xms) {
            print $word;
        }
    }

    % ./search.pl pter /usr/share/dict/linux.words
    abrupter
    Acalypterae
    acanthopteran
    Acanthopteri
    ... <snip> ...
    unchapter
    unchaptered
    underprompter
    ... <snip> ...
    Zygopteris
    zygopteron
    zygopterous
    %
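
Because the pattern comes from the user, it may be malformed. One possible refinement, sketched here under assumptions, is to compile the pattern once with qr// inside an eval so that a bad pattern produces a clear message. The in-memory word list stands in for the dictionary file and is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the dictionary file used by search.pl.
my @words = qw(abrupter chapter apple zygopteron banana);

# In search.pl this value would come from: shift @ARGV
my $pattern = 'pter';

# Compile the pattern once; eval traps the error raised by a
# malformed user-supplied pattern such as "(" .
my $regexp = eval { qr/$pattern/xms };
die "Bad pattern '$pattern': $@" if !defined $regexp;

my @hits = grep { $_ =~ $regexp } @words;
print "$_\n" for @hits;
```

With the sample list above, the three words containing "pter" are printed and the other two are skipped.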

Metacharacters

• Regexes obtain their power by describing sets of strings.
• Such descriptions involve the use of "metacharacters"

    { } [ ] ( )
    ^ $ .
    | * + ?
    / \

• Of course, some strings that we want to match will contain these characters.
  – Therefore we must "escape" them.

    "2+2=4" =~ m/2+2/xms      # doesn't match
    "2+2=4" =~ m/2\+2/xms     # does match

    "The interval is [0,1)." =~
        m/[0,1)./xms          # syntax error
    "The interval is [0,1)." =~
        m/\[0,1\)\./xms       # does match

    "/usr/bin/perl" =~
        m/\/usr\/bin\/perl/xms    # matches
    "/usr/bin/perl" =~
        m{/usr/bin/perl}xms       # better

    'C:\WINDOWS' =~ m/C:\\WINDOWS/    # matches
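
When the text to match is held in a variable, escaping by hand is not practical. Perl's built-in quotemeta function and the \Q...\E escape inside a pattern do the escaping for us. This is a minimal sketch; the variable names are chosen for illustration.

```perl
use strict;
use warnings;

my $text    = "The interval is [0,1).";
my $literal = "[0,1).";

# quotemeta puts a backslash in front of every non-word character.
my $escaped = quotemeta $literal;          # \[0\,1\)\.

# Interpolating the escaped copy matches the text literally.
my $via_quotemeta = $text =~ m/$escaped/xms ? 1 : 0;

# \Q ... \E applies the same escaping inline.
my $via_Q = $text =~ m/\Q$literal\E/xms ? 1 : 0;

print "escaped=$escaped quotemeta=$via_quotemeta Q=$via_Q\n";
```

Both forms treat "[", ")" and "." as ordinary characters, so the pattern no longer triggers the syntax error shown on the slide.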