Regular Expressions CS 2110

  1. Regular Expressions CS 2110

  2. What is a regular expression?
     - A special string for describing a pattern of characters.
     - Examples:

       Regular expression   Description
       [abc]                One of three characters (a, b, OR c)
       [a-z]                A single lowercase letter
       [a-z0-9]             A single lowercase letter OR digit (one character, not both)
       .                    Any one character (except a newline, by default)
       \.                   A period (".")
       *                    0 or more of the preceding item
       ?                    0 or 1 of the preceding item
       +                    1 or more of the preceding item
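
A minimal sketch of the table above in Python. The sample strings are made up for illustration, and `re.findall` is used only as a convenient way to list what each pattern matches.

```python
import re

text = "cab 42."  # made-up sample, not from the slides

one_of_abc = re.findall(r"[abc]", text)          # each a, b, or c
lower_or_digit = re.findall(r"[a-z0-9]", text)   # each lowercase letter or digit
literal_period = re.findall(r"\.", text)         # only the actual "."
any_char = re.findall(r".", text)                # every character in this text

# Quantifiers apply to the item just before them (here, the letter b).
zero_or_more = re.findall(r"ab*", "a ab abb")    # ['a', 'ab', 'abb']
zero_or_one = re.findall(r"ab?", "a ab abb")     # ['a', 'ab', 'ab']
one_or_more = re.findall(r"ab+", "a ab abb")     # ['ab', 'abb']

print(one_of_abc, lower_or_digit, literal_period, len(any_char))
print(zero_or_more, zero_or_one, one_or_more)
```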

  3. Regex String
     - Mark regular expressions as raw strings
       - Start the string with r"
     - Use square brackets for "any character from inside the brackets"
       - r"[bce]": matches "b", or "c", or "e" (but not "be" or "bc")
     - Use ranges or classes of characters
       - r"[A-Z]": matches any uppercase letter
       - r"[a-z]": matches any lowercase letter
       - r"[0-9]": matches any digit
     - Searching for hyphens: put the - right after the [ or right before the ]
       - r"[-a-z]": matches a hyphen OR any lowercase letter
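
A short sketch of why the r prefix matters and how bracket classes behave. The sample inputs ("Hello World", "x-ray 7") are assumptions chosen for illustration.

```python
import re

# Without the r prefix Python turns "\b" into one backspace character;
# with it, backslash and b stay as two characters for the regex engine.
plain_length = len("\b")    # 1
raw_length = len(r"\b")     # 2

single = re.fullmatch(r"[bce]", "b")     # a match: one character from the set
double = re.fullmatch(r"[bce]", "be")    # None: the class matches one character only

uppercase = re.findall(r"[A-Z]", "Hello World")      # ['H', 'W']
digits = re.findall(r"[0-9]", "CS 2110")             # ['2', '1', '1', '0']
hyphen_or_lower = re.findall(r"[-a-z]", "x-ray 7")   # ['x', '-', 'r', 'a', 'y']

print(plain_length, raw_length, single is not None, double is None)
print(uppercase, digits, hyphen_or_lower)
```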

  4. Regex String
     - r"[bce]at"
       - Matches "bat", "cat", "eat"
     - r".at"
       - Matches any one character followed by "at": every 3-letter word that ends in "at", but also strings such as "9at"
     - r"at\."
       - Matches "at."
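
The three patterns on this slide can be tried against a made-up word list (the strings below are assumptions, not course data):

```python
import re

words = "bat cat eat fat 9at"  # made-up sample

from_set = re.findall(r"[bce]at", words)          # only b, c, or e before "at"
any_first = re.findall(r".at", words)             # any one character before "at"
with_period = re.findall(r"at\.", "Meet at. noon")  # "at" followed by a real period

print(from_set)
print(any_first)
print(with_period)
```

Note that "9at" shows up in the second result: the dot is not limited to letters.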

  5. Regex in Python
     - Import statement
       - import re
     - Compiling the regex
       - regex = re.compile(regular_expression_string)
       - regex is now a regular expression tool we can use
     - Using regex
       - results = regex.search(text)
       - results = regex.findall(text)
       - results = regex.finditer(text)
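
A minimal sketch of the compile-then-use workflow, assuming a made-up sentence as `text`. It shows what each of the three calls hands back.

```python
import re

regex = re.compile(r"[bce]at")
text = "the cat sat on a bat"  # made-up sample

first = regex.search(text)                          # match object for the first hit, or None
everything = regex.findall(text)                    # list of matching strings
spans = [m.span() for m in regex.finditer(text)]    # finditer yields match objects

print(first.group())
print(everything)
print(spans)
```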

  6. Regular Expression Examples
     - Use "^" at the start of a [] for negation:
       - r"[^a-z]": match anything except lowercase letters
       - r"[^0-9]": match anything except decimal digits
     - Use ^ at the start of the expression (not inside []) to mean "the start of the string" (i.e., the match must begin at the beginning of the string)
       - i.e., if searching through a list of strings, only match strings that start with the expression
     - Use $ for the end of the string.
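
A small sketch of both meanings of ^, plus $. The list of names is an assumption made up for this example.

```python
import re

not_lowercase = re.findall(r"[^a-z]", "ab3 D")   # ['3', ' ', 'D']
not_digits = re.findall(r"[^0-9]", "a1b2")       # ['a', 'b']

names = ["Frodo Baggins", "Bilbo Baggins", "Samwise Gamgee"]  # made-up list

# ^ outside brackets anchors the pattern to the start of each string.
starts_with_b = [n for n in names if re.search(r"^B", n)]
# $ anchors the pattern to the end of each string.
ends_with_ins = [n for n in names if re.search(r"ins$", n)]

print(not_lowercase, not_digits)
print(starts_with_b)
print(ends_with_ins)
```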

  7. Predefined characters:

     Character   Meaning
     \d          Any digit; means the same as [0-9]
     \D          Anything EXCEPT digits; means the same as [^0-9]
     \s          Any whitespace character (" ", "\t", "\n", etc.), e.g. [ \t\n]
     \S          Any NON-whitespace character
     \\          Match a literal backslash
     \w          Any alphanumeric character or underscore: [a-zA-Z0-9_]
     \W          Any character that is not alphanumeric or underscore: [^a-zA-Z0-9_]

     - The bracket equivalents hold for ASCII text; on Python 3 strings, \d and \w also match non-ASCII digits and letters.
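
The shorthand classes can be checked against a made-up string that contains a tab (the sample text and the Windows-style path are assumptions for illustration):

```python
import re

text = "Room 12_b\tis open"  # made-up sample containing a tab

digits = re.findall(r"\d", text)              # ['1', '2']
non_digit_count = len(re.findall(r"\D", text))  # everything that is not a digit
whitespace = re.findall(r"\s", text)          # [' ', '\t', ' ']
non_space_runs = re.findall(r"\S+", text)     # runs of non-whitespace
words = re.findall(r"\w+", text)              # letters, digits, and underscore
non_word = re.findall(r"\W", text)            # here: only the whitespace

# r"\\" is the regex for one literal backslash.
backslashes = re.findall(r"\\", r"C:\temp\notes")

print(digits, non_digit_count, whitespace)
print(non_space_runs, words, non_word, len(backslashes))
```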

  8. Regular Expression Examples
     - r"[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"
       - Phone number written as "123-456-7890"
     - Except, that's a little redundant, right?
     - We can write the same pattern as
       - r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
     - {x} means the previous pattern must repeat exactly x times; {x,y} means between x and y times
       - "[abn]{6}" would match "banana", for example (or "nnnaaa")
       - "[abn]{3,6}" would match "ban", "nan", "abba", "banana", etc.
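
A quick check that the long and short phone patterns behave the same, and that the {x} and {x,y} examples hold. The sentence in `text` is made up.

```python
import re

long_form = r"[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"
short_form = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"

text = "Call 123-456-7890 or 555-1234."  # made-up sample

long_hits = re.findall(long_form, text)
short_hits = re.findall(short_form, text)   # same result; 555-1234 has too few digits

exactly_six = re.fullmatch(r"[abn]{6}", "banana")      # a match
three_to_six = re.fullmatch(r"[abn]{3,6}", "abba")     # a match (4 characters)
too_short = re.fullmatch(r"[abn]{3,6}", "ba")          # None (only 2 characters)

print(long_hits, short_hits)
print(exactly_six is not None, three_to_six is not None, too_short is None)
```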

  9. Regex Examples
     - Most English first names:
       - r"[A-Z][a-z]+"
     - Dates:
       - [0-9]{2}[/-][0-9]{2}[/-][0-9]{4} OR
       - [0-9]{4}[/-][0-9]{2}[/-][0-9]{2}
     - SSN
       - [0-9]{3}-[0-9]{2}-[0-9]{4}
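
A sketch that tests whole strings against these patterns. The helper `fits` and all sample values are hypothetical; note the patterns check only the shape of the text, not whether a date or number is real.

```python
import re

first_name = r"[A-Z][a-z]+"
date_year_last = r"[0-9]{2}[/-][0-9]{2}[/-][0-9]{4}"
date_year_first = r"[0-9]{4}[/-][0-9]{2}[/-][0-9]{2}"
ssn = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"

def fits(pattern, candidate):
    """Return True when the whole candidate string matches the pattern."""
    return re.fullmatch(pattern, candidate) is not None

name_ok = fits(first_name, "Homer")              # True
name_lower = fits(first_name, "homer")           # False: no leading capital
name_mixed = fits(first_name, "McDuff")          # False: second capital letter
date_a = fits(date_year_last, "05/12/1990")      # True
date_b = fits(date_year_first, "1990-05-12")     # True
date_odd = fits(date_year_last, "99/99-9999")    # True: right shape, not a real date
ssn_ok = fits(ssn, "123-45-6789")                # True

print(name_ok, name_lower, name_mixed, date_a, date_b, date_odd, ssn_ok)
```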

  10. Regex findall
      - findall returns a list of all the strings that match the regex.
      - Example: let's consider this pattern for emails:
        - r"[a-z0-9]+@[a-z]+\.[a-z]+"
      - Using that, let's find all the emails at:
        - https://engineering.virginia.edu/departments/computer-science/faculty
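
Before pointing the pattern at a live web page, it can be tried offline. The addresses below are invented placeholders, not real faculty emails.

```python
import re

email_pattern = r"[a-z0-9]+@[a-z]+\.[a-z]+"

# Made-up text standing in for a downloaded page.
text = "Contact abc1d@example.edu or xyz9q@example.org for details."

emails = re.findall(email_pattern, text)
print(emails)
```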

  11. Practice
      - Use this webpage:
        - https://storage.googleapis.com/cs1111/practice/simpsons_phone_book.txt
      - Find all the phone numbers using regular expressions! (Not text parsing)
      - Now:
        - Get the name and phone number of everyone whose first name starts with "J" and whose last name starts with "Neu"
        - USE REGULAR EXPRESSIONS

  12. Next time
      - Groups
        - Using them
        - Getting individual groups
      - The match object
      - More practice

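  13. Match Object
      - Returned by search() and finditer(). Example:
        <_sre.SRE_Match object; span=(0, 5), match='Frodo'>
        (Python 3.7 and later print this as <re.Match object; span=(0, 5), match='Frodo'>)
      - This match object can be used as follows (these are methods, so call them with parentheses)
        - match.span(): (0, 5)
        - match.start(): 0
        - match.end(): 5
        - match.group(): "Frodo"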
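
A minimal sketch that produces a match object like the one above. The sentence being searched is an assumption; only "Frodo" and its span come from the slide.

```python
import re

match = re.search(r"[A-Z][a-z]+", "Frodo lives in the Shire")  # made-up sentence

span = match.span()     # (start, end) as a tuple
start = match.start()   # index of the first matched character
end = match.end()       # index just past the last matched character
found = match.group()   # the matched text itself

print(span, start, end, found)
```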

  14. search() and finditer() functions
      - regex.search(text): search through text, find the first instance of a match to regex, and return a match object
        - Returns None if no match is found
        - Often used as a "does this pattern exist in the text" test
        - Can also be written as
          - re.search(regular_expression, text)
      - finditer returns an iterable of match objects (that is, you can loop through it)
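
A sketch of both calls, assuming a made-up string of short phone numbers. It shows the None check and a loop over finditer.

```python
import re

regex = re.compile(r"[0-9]{3}-[0-9]{4}")
text = "Moe: 555-1239, Ned: 555-8904"  # made-up sample

# search() as an existence test: a match object is truthy, None is falsy.
has_number = regex.search(text) is not None
no_number = regex.search("no numbers here")         # None

# The module-level form takes the pattern as its first argument.
first = re.search(r"[0-9]{3}-[0-9]{4}", text).group()

# finditer() yields one match object per hit, in order.
hits = []
for m in regex.finditer(text):
    hits.append((m.start(), m.group()))

print(has_number, no_number, first)
print(hits)
```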

  15. Pulling down emails of CS Faculty

      import re
      import urllib.request

      url = "https://engineering.virginia.edu/departments/computer-science/faculty"
      email_pattern = r"[a-z0-9]+@[a-z]+\.[a-z]+"

      # Download the page and decode the bytes into a string
      req = urllib.request.urlopen(url)
      html = req.read().decode("UTF-8")

      # Compile the pattern, then collect every matching string
      regex = re.compile(email_pattern)
      emails = regex.findall(html)
      print(emails)

  16. But wait…
      - We get the result: albertm@darden.virginia
      - That doesn't seem right… shouldn't emails end in com, edu, or org?
      - Let's try this pattern:
        - r"([a-z0-9]+@[a-z\.]+\.(com|edu|org))"
      - That gives us tuples like:
        - ('albertm@darden.virginia.edu', 'edu')
      - Wait… why tuples?
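
Both results can be reproduced offline with the one address quoted on this slide; the surrounding sentence is made up. The last line uses a non-capturing group, (?:...), which is an addition not covered in the slides.

```python
import re

text = "Write to albertm@darden.virginia.edu today."  # made-up sentence

# First pattern: only one dot allowed after the @, so the match stops early.
first_try = re.findall(r"[a-z0-9]+@[a-z]+\.[a-z]+", text)

# Second pattern: two sets of parentheses, so findall returns a tuple per match.
second_try = re.findall(r"([a-z0-9]+@[a-z\.]+\.(com|edu|org))", text)

# Not from the slides: (?:...) groups without capturing, so findall
# goes back to returning plain strings.
plain = re.findall(r"[a-z0-9]+@[a-z\.]+\.(?:com|edu|org)", text)

print(first_try)
print(second_try)
print(plain)
```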

  17. Groups
      - Parentheses can be used to isolate "groups" in the regular expression.
      - When a pattern contains groups, findall() returns the groups instead of the whole match; with more than one group, each result is a tuple.
      - Example: in this string, r"[a-z0-9]+@[a-z\.]+\.(com|edu|org)"
        - group(0): the overall match
        - group(1): specifically the match in parentheses (com, edu, or org)
        - .group(): returns the same as group(0)
        - .groups(): returns the matching SUB-groups (not the overall match)
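
A short sketch of the four calls listed above, run on the address from the previous slide:

```python
import re

pattern = r"[a-z0-9]+@[a-z\.]+\.(com|edu|org)"
match = re.search(pattern, "albertm@darden.virginia.edu")

whole = match.group(0)       # the overall match
ending = match.group(1)      # just the part inside the parentheses
default = match.group()      # no argument means group 0
subgroups = match.groups()   # tuple of the parenthesized groups only

print(whole, ending, default == whole, subgroups)
```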
