Regular Expressions CS 2110

What is a regular expression?
• A special string for describing a pattern of characters.
• Examples:

Regular expression   Description
[abc]                One of three characters (a, b, OR c)
[a-z]                A single lowercase letter
[a-z0-9]             A single lowercase letter OR digit (not both)
.                    Any one character (except a newline, by default)
\.                   A period (“.”)
*                    0 to many of the preceding item
?                    0 or 1 of the preceding item
+                    1 or many of the preceding item
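The table above can be tried out directly in Python. This is a minimal sketch; the sample strings `sample` and `repeats` are made up for illustration and are not from the slides.

```python
import re

# Hypothetical sample strings, chosen only to exercise the table's patterns.
sample = "cab 42."
repeats = "a ab abb"

one_of_abc = re.findall(r"[abc]", sample)         # each a, b, or c
lower_or_digit = re.findall(r"[a-z0-9]", sample)  # one letter or one digit at a time
any_char = re.findall(r".", "a b")                # . also matches the space
literal_period = re.findall(r"\.", sample)        # \. matches only a real period

zero_or_more = re.findall(r"ab*", repeats)        # "a" followed by 0 to many "b"
zero_or_one = re.findall(r"ab?", repeats)         # "a" followed by 0 or 1 "b"
one_or_more = re.findall(r"ab+", repeats)         # "a" followed by 1 or many "b"

print(one_of_abc, lower_or_digit, any_char, literal_period)
print(zero_or_more, zero_or_one, one_or_more)
```

Note that `*`, `?`, and `+` always apply to the item immediately before them, here the single character `b`.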

Regex String
• Mark regular expressions as raw strings
  • Starts with r"
• Use square brackets for “any one character from inside the brackets”
  • r"[bce]" – matches “b”, or “c”, or “e” (but it is a single character, so it cannot match all of “be” or “bc” at once)
• Use ranges or classes of characters
  • r"[A-Z]" – matches any uppercase letter
  • r"[a-z]" – matches any lowercase letter
  • r"[0-9]" – matches any digit
• Searching for hyphens: include - right after the [ or right before the ]
  • r"[-a-z]" – matches a hyphen OR any lowercase letter
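A short sketch of why raw strings and bracket classes behave as described. The test strings such as "Wi-Fi 6" are invented for this example.

```python
import re

# A raw string keeps the backslash as a real character.
raw_len = len(r"\n")    # backslash + n: two characters
plain_len = len("\n")   # a single newline character

# A bracket class matches exactly one character.
single = re.fullmatch(r"[bce]", "b")    # a match object
double = re.fullmatch(r"[bce]", "be")   # None: "be" is two characters

# Ranges, run against a hypothetical string.
uppercase = re.findall(r"[A-Z]", "CS 2110 at UVA")
digits = re.findall(r"[0-9]", "CS 2110 at UVA")

# A hyphen placed first (or last) inside the brackets is a literal hyphen.
hyphen_first = re.findall(r"[-a-z]", "Wi-Fi 6")
hyphen_last = re.findall(r"[a-z-]", "Wi-Fi 6")

print(raw_len, plain_len, single is not None, double is None)
print(uppercase, digits, hyphen_first, hyphen_last)
```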

Regex String
• r"[bce]at"
  • Matches “bat”, “cat”, “eat”
• r".at"
  • Matches any one character followed by “at”: every 3-letter word that ends in “at”, but also strings like “ at” or “#at”
• r"at\."
  • Matches “at.”
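To see the difference between these three patterns, here is a small sketch run against a made-up sentence (the sentence is an assumption, not from the slides).

```python
import re

# Hypothetical sentence for illustration.
text = "The cat sat on a mat. Look at that bat."

bce_at = re.findall(r"[bce]at", text)   # only b, c, or e before "at"
dot_at = re.findall(r".at", text)       # ANY character before "at", even a space
at_period = re.findall(r"at\.", text)   # "at" followed by a literal period

print(bce_at)
print(dot_at)
print(at_period)
```

The second result includes " at" (with a leading space) and "hat" from inside "that", which shows that `.` is not limited to letters or to whole words.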

Regex in Python
• Import statement
  • import re
• Compiling the regex
  • regex = re.compile(regular_expression_string)
  • regex is now a regular expression tool (a compiled pattern object) we can use
• Using regex
  • results = regex.search(text)
  • results = regex.findall(text)
  • results = regex.finditer(text)
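The three calls above return different kinds of results. A minimal sketch, assuming a made-up `text` string:

```python
import re

# Hypothetical text for illustration.
text = "bat, cat, and eat"

regex = re.compile(r"[bce]at")

first = regex.search(text)          # a match object for the first match (or None)
all_matches = regex.findall(text)   # a list of matching strings
positions = [(m.start(), m.group()) for m in regex.finditer(text)]  # loop over match objects

print(first.group())
print(all_matches)
print(positions)
```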

Regular Expression Examples
• Use “^” at the start of a [] for negation:
  • r"[^a-z]" – matches any one character except a lowercase letter
  • r"[^0-9]" – matches any one character except a decimal digit
• Use ^ at the start of the expression (not inside []) to mean “the start of the string” (i.e., the match must begin at the beginning of the string)
  • e.g., if searching through a list of strings, only match strings that start with the expression
• Use $ for the end of the string.
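The two meanings of `^` are easy to mix up, so here is a small sketch of both, plus `$`. The word list is hypothetical.

```python
import re

# ^ inside brackets negates the class.
not_lower = re.findall(r"[^a-z]", "ab-C3")
not_digit = re.findall(r"[^0-9]", "a1b2")

# ^ outside brackets anchors to the start of the string; $ anchors to the end.
words = ["cat", "concat", "catalog"]  # hypothetical list of strings
starts_with_cat = [w for w in words if re.search(r"^cat", w)]
ends_with_cat = [w for w in words if re.search(r"cat$", w)]

print(not_lower, not_digit)
print(starts_with_cat, ends_with_cat)
```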

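Predefined characters:

Character   Meaning
\d          Any digit – means the same as [0-9]
\D          Anything EXCEPT digits – means the same as [^0-9]
\s          Any whitespace character (“ ”, “\t”, “\n”, etc.) – [ \t\n\r\f\v]
\S          Any NON-whitespace character
\\          Match a literal backslash
\w          Matches any alphanumeric character or underscore – [a-zA-Z0-9_]
\W          Matches any character that is not alphanumeric or underscore – [^a-zA-Z0-9_]

The bracket equivalents are exact for ASCII text. On Python 3 strings, these classes also match non-ASCII digits, letters, and whitespace unless the re.ASCII flag is used.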
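A quick sketch of the predefined classes in action. The string `s` is invented for this example; the last two lines illustrate the effect of the `re.ASCII` flag on `\w`.

```python
import re

# Hypothetical string containing a space, a tab, an underscore, and punctuation.
s = "Room 42\tis open_now!"

digits = re.findall(r"\d", s)
whitespace = re.findall(r"\s", s)
words = re.findall(r"\w+", s)       # runs of letters, digits, and underscores
non_word = re.findall(r"\W", s)
backslashes = re.findall(r"\\", r"C:\temp")  # \\ matches one literal backslash

# By default \w also matches non-ASCII letters; re.ASCII limits it to [a-zA-Z0-9_].
unicode_word = re.findall(r"\w", "\u00e9")
ascii_word = re.findall(r"\w", "\u00e9", re.ASCII)

print(digits, whitespace, words, non_word, backslashes)
print(unicode_word, ascii_word)
```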

Regular Expression Examples
• r"[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"
  • Phone number written as “123-456-7890”
• Except, that’s a little redundant, right?
• We can write the same pattern as
  • r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
• {x} means the previous pattern must repeat exactly x times
  • "[abn]{6}" would match “banana”, for example (or “nnnaaa”)
• {x,y} means the previous pattern must repeat at least x and at most y times
  • "[abn]{3,6}" would match “ban”, “nan”, “abba”, “banana”, etc.
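The following sketch checks that the long and short phone patterns agree, and tries the `{x}` and `{x,y}` examples from the slide. The sentence in `text` is a made-up test input.

```python
import re

long_form = r"[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"
short_form = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"

# Hypothetical text: one full phone number and one that is too short to match.
text = "Call 123-456-7890 or 555-0199."
long_result = re.findall(long_form, text)
short_result = re.findall(short_form, text)

# fullmatch requires the whole string to fit the pattern.
six_banana = re.fullmatch(r"[abn]{6}", "banana")
six_nnnaaa = re.fullmatch(r"[abn]{6}", "nnnaaa")
six_ban = re.fullmatch(r"[abn]{6}", "ban")          # too short: None
range_abba = re.fullmatch(r"[abn]{3,6}", "abba")
range_ba = re.fullmatch(r"[abn]{3,6}", "ba")        # too short: None

print(long_result, short_result)
```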

Regex Examples
• Most English first names:
  • r"[A-Z][a-z]+"
• Dates:
  • [0-9]{2}[/-][0-9]{2}[/-][0-9]{4} OR
  • [0-9]{4}[/-][0-9]{2}[/-][0-9]{2}
• SSN
  • [0-9]{3}-[0-9]{2}-[0-9]{4}
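A minimal sketch applying these patterns to invented inputs. All names, dates, and numbers below are hypothetical, and joining the two date patterns with `|` is one possible way to express the "OR", not necessarily the approach used in class.

```python
import re

name_pattern = r"[A-Z][a-z]+"
date_dmy = r"[0-9]{2}[/-][0-9]{2}[/-][0-9]{4}"
date_ymd = r"[0-9]{4}[/-][0-9]{2}[/-][0-9]{2}"
ssn_pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"

names = re.findall(name_pattern, "Homer and Marge met Ned")
# "Most" names: a name with an inner capital is split into two matches.
split_name = re.findall(name_pattern, "McDonald")

# Either date layout, combined with | (alternation).
dates = re.findall(date_dmy + "|" + date_ymd, "Due 05/12/2024 or 2024-05-12.")

ssns = re.findall(ssn_pattern, "SSN 123-45-6789")

print(names, split_name)
print(dates, ssns)
```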

Regex findall
• findall returns a list of all the strings that match the regex.
• For example, let’s consider this pattern for emails:
  • r"[a-z0-9]+@[a-z]+\.[a-z]+"
• Using that, let’s find all the emails at:
  • https://engineering.virginia.edu/departments/computer-science/faculty

Practice
• Use this webpage:
  • https://storage.googleapis.com/cs1111/practice/simpsons_phone_book.txt
• Find all the phone numbers using regular expressions! (Not text parsing)
• Now:
  • Get the name and phone number of everyone whose first name starts with “J” and whose last name starts with “Neu”
  • USE REGULAR EXPRESSIONS

Next time
• Groups
  • Using them
  • Getting individual groups
• The match object
• More practice

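Match Object
• Returned by search() and finditer(). Example:
  • <_sre.SRE_Match object; span=(0, 5), match='Frodo'> (Python 3.7 and later print this as <re.Match object; span=(0, 5), match='Frodo'>)
• This match object can be used as follows (these are methods, so call them with parentheses)
  • match.span() – (0, 5)
  • match.start() – 0
  • match.end() – 5
  • match.group() – “Frodo”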
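A small sketch that produces a match object like the one on the slide. The input string "Frodo lives" and the pattern are assumptions chosen to reproduce the span (0, 5).

```python
import re

# Hypothetical input chosen so the first match is "Frodo" at positions 0 to 5.
match = re.search(r"[A-Z][a-z]+", "Frodo lives")

span = match.span()     # (start, end) as a tuple
start = match.start()   # index of the first matched character
end = match.end()       # index just past the last matched character
matched_text = match.group()

print(match)
print(span, start, end, matched_text)
```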

search() and finditer() functions
• regex.search(text) – searches through text, finds the first match for regex, and returns a match object
  • Returns None if no match is found
  • Often used as a “does this pattern exist in the text” test
  • Can also be written as
    • re.search(regular_expression, text)
• regex.finditer(text) returns an iterator of match objects (that is, you can loop through it)
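The behaviours listed above can be sketched as follows. The phone numbers and strings are invented for illustration.

```python
import re

pattern = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
regex = re.compile(pattern)

# Hypothetical text containing two phone numbers.
text = "Ring 123-456-7890 or 987-654-3210"

# search() as an existence test: None means "not found".
has_phone = regex.search(text) is not None
no_phone = regex.search("no digits here") is None

# The compiled form and the module-level form find the same first match.
compiled_first = regex.search(text).group()
module_first = re.search(pattern, text).group()

# finditer() yields one match object per match, in order.
spans = [m.span() for m in regex.finditer(text)]

print(has_phone, no_phone, compiled_first, spans)
```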

Pulling down emails of CS Faculty

    import re
    import urllib.request

    url = "https://engineering.virginia.edu/departments/computer-science/faculty"
    email_pattern = r"[a-z0-9]+@[a-z]+\.[a-z]+"

    # Download the page and decode the bytes into a string
    req = urllib.request.urlopen(url)
    html = req.read().decode("UTF-8")

    # Compile the pattern and collect every match
    regex = re.compile(email_pattern)
    emails = regex.findall(html)
    print(emails)

But wait…
• We get the result: albertm@darden.virginia
• That doesn’t seem right… shouldn’t emails end in com, edu, or org?
• Let’s try this pattern:
  • r"([a-z0-9]+@[a-z\.]+\.(com|edu|org))"
• That gives us tuples like:
  • ('albertm@darden.virginia.edu', 'edu')
• Wait…why tuples?
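Both results can be reproduced without downloading the page. This sketch runs the two patterns on a single address taken from the slide; pulling the first element out of each tuple is just one way to recover the full addresses.

```python
import re

text = "albertm@darden.virginia.edu"

# First pattern: [a-z]+ stops at the second dot, so the address is cut short.
truncated = re.findall(r"[a-z0-9]+@[a-z]+\.[a-z]+", text)

# Second pattern: two sets of parentheses, so findall returns a tuple per match.
tuples = re.findall(r"([a-z0-9]+@[a-z\.]+\.(com|edu|org))", text)

# The first item of each tuple is the whole address.
full_addresses = [whole for whole, ending in tuples]

print(truncated)
print(tuples)
print(full_addresses)
```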

Groups
• Parentheses can be used to isolate “groups” in the regular expression.
• When a pattern contains more than one group, findall returns a tuple of the groups for each match, which is why the previous pattern produced tuples.
• Example:
  • In this string: r"[a-z0-9]+@[a-z\.]+\.(com|edu|org)"
  • group(0) – the overall match
  • group(1) – specifically the match in parentheses (com, edu, or org)
  • .group() – returns the same as group(0)
  • .groups() – returns the matching SUB-groups (not the overall match)
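A minimal sketch of the group methods, using the pattern from this slide on a made-up sentence that contains the address from the earlier example.

```python
import re

pattern = r"[a-z0-9]+@[a-z\.]+\.(com|edu|org)"

# Hypothetical sentence wrapped around the address.
match = re.search(pattern, "mail albertm@darden.virginia.edu now")

overall = match.group(0)     # the whole match
ending = match.group(1)      # only what the parentheses matched
default = match.group()      # same as group(0)
subgroups = match.groups()   # tuple of the sub-groups only

print(overall, ending, default, subgroups)
```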
