Regular Expressions CS 2110
What is a regular expression?
A special string for describing a pattern of characters.
Examples:
Regular expression    Description
[abc]                 One of three characters (a, b, OR c)
[a-z]                 A single lowercase letter
[a-z0-9]              A single lowercase letter OR number (not both)
.                     Any one character
\.                    A period (".")
*                     0 to many
?                     0 or 1
+                     1 or many
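A minimal sketch of the quantifiers and the period from the table above, using Python's re module. The sample strings are made up for illustration.

```python
import re

# "?" : the "u" may appear 0 or 1 times
optional_u = re.findall(r"colou?r", "color colour colouur")

# "+" : the "o" must appear 1 or more times
one_or_more = re.findall(r"lo+l", "ll lol loool")

# "*" : the "o" may appear 0 or more times
zero_or_more = re.findall(r"lo*l", "ll lol loool")

# "." matches any one character; "\." matches only a literal period
any_char = re.findall(r"a.c", "abc a.c a-c")
literal_dot = re.findall(r"a\.c", "abc a.c a-c")

print(optional_u, one_or_more, zero_or_more, any_char, literal_dot)
```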
REGEX String
Mark regular expressions as raw strings: they start with r"
Use square brackets for "any character from inside the bracket"
r"[bce]" matches "b", or "c", or "e" (but not "be" or "bc")
Use ranges or classes of characters
r"[A-Z]" matches any uppercase letter
r"[a-z]" matches any lowercase letter
r"[0-9]" matches any digit
Searching for hyphens: include - right after the [ or right before the ]
r"[-a-z]" matches any hyphen OR any lowercase letter
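The bracket rules above can be checked with a short sketch. The test strings are made up, and re.fullmatch is used here only to ask "does the whole string match?".

```python
import re

# A bracket matches exactly ONE character from the set
single_b = re.fullmatch(r"[bce]", "b")    # a match object
two_chars = re.fullmatch(r"[bce]", "be")  # None: "be" is two characters

# Ranges of characters
uppercase = re.findall(r"[A-Z]", "Hello World 42")
digits = re.findall(r"[0-9]", "Hello World 42")

# A hyphen placed right after "[" is a literal hyphen, not a range
hyphen_or_lower = re.findall(r"[-a-z]", "A-b c")

print(uppercase, digits, hyphen_or_lower)
```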
Regex String
r"[bce]at" matches "bat", "cat", "eat"
r".at" matches any one character followed by "at" (for example, 3 letter words that end in "at")
r"at\." matches "at."
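As a quick illustration of these three patterns, here is a sketch run against made-up sentences. Note that "." is not limited to letters.

```python
import re

sentence = "The cat sat on a bat mat to eat."

# Only b, c, or e may come before "at"
bce_at = re.findall(r"[bce]at", sentence)

# Any single character may come before "at"
any_at = re.findall(r".at", sentence)

# "." also accepts digits and symbols
odd_at = re.findall(r".at", "7at @at")

# "\." requires a literal period after "at"
at_period = re.findall(r"at\.", "Look at that. Meet me at.")

print(bce_at, any_at, odd_at, at_period)
```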
Regex in Python
Import statement
import re
Compiling the regex
regex = re.compile(regular_expression_string)
regex is now a regular expression tool we can use
Using regex
results = regex.search(text)
results = regex.findall(text)
results = regex.finditer(text)
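A small sketch of the compile-then-use workflow listed above. The text and the pattern are assumptions chosen for illustration.

```python
import re

text = "The cat sat on the mat."

# Compile once, then reuse the compiled regex
regex = re.compile(r"[a-z]at")

# search: a match object for the first match (or None)
first = regex.search(text)

# findall: a list of the matching strings
all_matches = regex.findall(text)

# finditer: an iterable of match objects, one per match
positions = [m.start() for m in regex.finditer(text)]

print(first.group(), all_matches, positions)
```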
Regular Expression Examples
Use "^" at the start of a [] for negation:
r"[^a-z]" matches anything except lowercase letters
r"[^0-9]" matches anything except decimal digits
Use ^ at the start of the expression (not inside []) to mean "the start of the string" (i.e., searching from the beginning of the string only)
For example, if searching through a list of strings, only match strings that start with the expression
Use $ for the end of the string.
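To make the two meanings of "^" concrete, here is a hedged sketch. The list of names is invented for the example.

```python
import re

# "^" inside brackets negates the set
non_lower = re.findall(r"[^a-z]", "ab3 D!")

names = ["Frodo", "Bilbo", "Fredegar", "Sam Frodo-fan"]

# "^" outside brackets anchors the match to the start of the string
starts_with_fr = [n for n in names if re.search(r"^Fr", n)]

# "$" anchors the match to the end of the string
ends_with_o = [n for n in names if re.search(r"o$", n)]

print(non_lower, starts_with_fr, ends_with_o)
```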
Predefined characters:
Character    Meaning
\d           Any digit; means the same as [0-9]
\D           Anything EXCEPT digits; means the same as [^0-9]
\s           Any whitespace character (" ", "\t", "\n", etc.), e.g. [ \t\n]
\S           Any NON-whitespace character
\\           Match a literal backslash
\w           Matches ANY alphanumeric character and underscore: [a-zA-Z0-9_]
\W           Matches any non-alphanumeric character: [^a-zA-Z0-9_]
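The predefined characters can be tried on a short made-up string, as in this sketch:

```python
import re

text = "id_7 costs $30"

digits = re.findall(r"\d", text)         # single digits
words = re.findall(r"\w+", text)         # runs of letters, digits, underscore
non_word = re.findall(r"\W", text)       # everything \w does not match
chunks = re.findall(r"\S+", text)        # runs of non-whitespace

# "\\" in a raw-string pattern matches one literal backslash
backslashes = re.findall(r"\\", r"C:\temp\x")

print(digits, words, non_word, chunks, len(backslashes))
```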
Regular Expression Examples
r"[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"
Phone number written as "123-456-7890"
Except, that's a little redundant, right?
We can write the same pattern as r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
{x} means the previous pattern must repeat exactly x times; {x,y} means it must repeat between x and y times
"[abn]{6}" would match "banana", for example (or "nnnaaa")
"[abn]{3,6}" would match "ban", "nan", "abba", "banana", etc.
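A sketch showing that the long and short phone patterns behave the same way, plus the {6} and {3,6} examples. The sample text is made up.

```python
import re

long_form = r"[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"
short_form = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"

text = "Call 123-456-7890 or 555-0100."
found_long = re.findall(long_form, text)
found_short = re.findall(short_form, text)

# {6}: exactly six characters drawn from a, b, n
six = [w for w in ["banana", "nnnaaa", "ban"] if re.fullmatch(r"[abn]{6}", w)]

# {3,6}: between three and six characters drawn from a, b, n
three_to_six = [w for w in ["ban", "nan", "abba", "banana", "ba"]
                if re.fullmatch(r"[abn]{3,6}", w)]

print(found_long, found_short, six, three_to_six)
```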
Regex Examples
Most English first names: r"[A-Z][a-z]+"
Dates: [0-9]{2}[/-][0-9]{2}[/-][0-9]{4} OR [0-9]{4}[/-][0-9]{2}[/-][0-9]{2}
SSN: [0-9]{3}-[0-9]{2}-[0-9]{4}
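These patterns can be exercised as follows. This is only a sketch: the sample strings are invented, and joining the two date forms with "|" is one possible way to express the "OR" above.

```python
import re

name_pattern = r"[A-Z][a-z]+"
# The two date forms joined with "|" (alternation)
date_pattern = r"[0-9]{2}[/-][0-9]{2}[/-][0-9]{4}|[0-9]{4}[/-][0-9]{2}[/-][0-9]{2}"
ssn_pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"

names = re.findall(name_pattern, "Alice met Bob.")
dates = re.findall(date_pattern, "Filed 03/14/2019, renewed 2020-01-31.")
ssns = re.findall(ssn_pattern, "SSN: 123-45-6789")

print(names, dates, ssns)
```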
Regex findall
findall returns a list of all the strings that match the regex.
Example: let's consider this pattern for emails:
r"[a-z0-9]+@[a-z]+\.[a-z]+"
Using that, let's find all the emails at:
https://engineering.virginia.edu/departments/computer-science/faculty
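Before fetching a real page, the email pattern can be tried on a local string. The addresses below are made up for illustration.

```python
import re

email_pattern = r"[a-z0-9]+@[a-z]+\.[a-z]+"
text = "Contact ann3@example.com or bob@example.org today."

# findall returns every non-overlapping match as a plain string
emails = re.findall(email_pattern, text)
print(emails)
```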
Practice
Use this webpage: https://storage.googleapis.com/cs1111/practice/simpsons_phone_book.txt
Find all the phone numbers using regular expressions! (Not text parsing)
Now: get the name and phone number of everyone whose first name starts with "J" and whose last name starts with "Neu"
USE REGULAR EXPRESSIONS
Next time
Groups
Using them
Getting individual groups
The match object
More practice
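Match Object
Returned by search() and finditer().
Example: <_sre.SRE_Match object; span=(0, 5), match='Frodo'>
(Newer Python versions print this as <re.Match object; span=(0, 5), match='Frodo'>.)
This match object can be used as follows:
match.span()   returns (0, 5)
match.start()  returns 0
match.end()    returns 5
match.group()  returns "Frodo"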
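A minimal sketch that produces a match object like the one above. The pattern and sentence are assumptions chosen so the match is "Frodo" at positions 0 to 5.

```python
import re

match = re.search(r"[A-Z][a-z]+", "Frodo lives in the Shire")

span = match.span()    # (start, end) as a tuple
start = match.start()  # index of the first matched character
end = match.end()      # index just past the last matched character
word = match.group()   # the matched text itself

print(span, start, end, word)
```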
search() and finditer() functions
regex.search(text): search through text, find the first instance of a match to regex, and return a match object
Returns None if no match is found
Often used as a "does this pattern exist in the text" test
Can also be written as re.search(regular_expression, text)
finditer() returns an iterable of match objects (that is, you can loop through it)
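The behaviours described above might look like this in practice. The text and patterns are made up for the sketch.

```python
import re

text = "Frodo and Sam left the Shire."
regex = re.compile(r"S[a-z]+")

# search as an existence test: None means "not found"
has_gandalf = re.search(r"Gandalf", text) is not None

# search returns only the FIRST match
first = regex.search(text)

# finditer lets us loop over every match object
found = [(m.group(), m.start()) for m in regex.finditer(text)]

print(has_gandalf, first.group(), found)
```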
Pulling down emails of CS Faculty
import re
import urllib.request

url = "https://engineering.virginia.edu/departments/computer-science/faculty"
email_pattern = r"[a-z0-9]+@[a-z]+\.[a-z]+"

# Download the page and decode the bytes into a string
req = urllib.request.urlopen(url)
html = req.read().decode("UTF-8")

# Compile the pattern and collect every match
regex = re.compile(email_pattern)
emails = regex.findall(html)
print(emails)
But wait…
We get the result: albertm@darden.virginia
That doesn't seem right… shouldn't emails end in com, edu, or org?
Let's try this pattern: r"([a-z0-9]+@[a-z\.]+\.(com|edu|org))"
That gives us tuples like: ('albertm@darden.virginia.edu', 'edu')
Wait… why tuples?
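The tuple behaviour can be reproduced on a single string. When a pattern contains parentheses, findall returns the groups instead of the whole match, one tuple entry per group. The last pattern uses a non-capturing group, (?:...), which is not covered in these slides and is shown only as one possible way to get plain strings back.

```python
import re

text = "Write to albertm@darden.virginia.edu for details."

# No parentheses: findall returns whole matches, but this pattern stops early
no_groups = re.findall(r"[a-z0-9]+@[a-z]+\.[a-z]+", text)

# Two sets of parentheses: findall returns a tuple of the two groups
two_groups = re.findall(r"([a-z0-9]+@[a-z\.]+\.(com|edu|org))", text)

# (?:...) groups without capturing, so findall returns strings again
non_capturing = re.findall(r"[a-z0-9]+@[a-z\.]+\.(?:com|edu|org)", text)

print(no_groups, two_groups, non_capturing)
```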
Groups
Parentheses can be used to isolate "groups" in the regular expression.
Example: in this string
r"[a-z0-9]+@[a-z\.]+\.(com|edu|org)"
group(0) is the overall match
group(1) is specifically the match in parentheses (com, edu, or org)
.group() returns the same as group(0)
.groups() returns the matching SUB-groups (not the overall match)
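A short sketch of these group accessors, applied to the address from the earlier slide:

```python
import re

pattern = r"[a-z0-9]+@[a-z\.]+\.(com|edu|org)"
match = re.search(pattern, "albertm@darden.virginia.edu")

whole = match.group(0)        # the overall match
suffix = match.group(1)       # only the part inside the parentheses
default = match.group()       # same as group(0)
subgroups = match.groups()    # tuple of the sub-groups only

print(whole, suffix, default, subgroups)
```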