Regular Expressions for Linguists: A Life Skill
Michael Yoshitaka Erlewine
mitcho@mitcho.com
Hackl Lab Turkshop, March 2013
Regular Expressions

What are regular expressions?
• Regular expressions (aka regexes or regexps) are a way of telling your computer what to do.
• Specifically, look at (a lot of) text and find things or make changes.
• These are both things which we might want to do as linguists.
Why are we learning it now?
• Constructing input for Turk (can) involve(s) manipulating a lot of text. In particular, you might want to test different systematic variants of your sentences.
• Cutting and pasting is highly prone to errors.
• Regular expressions are a quick way to do this.
Note
You will not master regular expressions today. Regular expressions are learned through practice.
Tools
Today we will use regexes as a glorified find/replace tool in a text editor. Free editors with good regex support include:
• TextWrangler for Mac
• Notepad++ for Windows
  • Download the NppToolBucket plugin. Move the dll file to C:\Program Files\Notepad++\plugins. This adds a better find/replace window to Notepad++.
• Komodo Edit for Mac/Windows/Linux
Hopefully you’ve already installed one of those.
Get these now
☞ Download the Regular Expressions cheat sheet: http://web.mit.edu/hackl/www/lab/turkshop/slides/regex-cheatsheet.pdf
☞ Download some sample files to play with here: http://web.mit.edu/hackl/www/lab/turkshop/examples/week2-regex.zip
What do you do with a regex?
• Find
• Find all/count
• Find and replace
Make sure you know how to do these things in your editor.
1 Open the file lookingglass.txt in your editor and count the occurrences of cat. Try replacing all the cats with dogs. (Don’t save; undo.) What problem does this have?
Make sure you turn on “regular expressions” (TextWrangler: “grep”) and “wrap around” in your search window. Notice that you can specify case-sensitivity.
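If you ever want to do the same three things outside a text editor, here is a rough sketch in Python using its standard re module. The sample sentence is made up for illustration (it is not from lookingglass.txt), and the variable names are only suggestions.

```python
import re

# Made-up sample text; NOT taken from lookingglass.txt.
text = "The cat sat. A caterpillar saw the cat and ran to the Cathedral."

# Find: locate the first match and note where it starts.
first = re.search(r"cat", text)
first_position = first.start()

# Find all / count: collect every non-overlapping match.
count = len(re.findall(r"cat", text))

# Find and replace: substitute every match.
replaced = re.sub(r"cat", "dog", text)

# Case-insensitive counting, like unticking "case sensitive" in an editor.
count_ignoring_case = len(re.findall(r"cat", text, flags=re.IGNORECASE))

print(first_position, count, count_ignoring_case)
print(replaced)
```

Look closely at what happened to the caterpillar in the replaced text.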
Basic matching
Look at the “basic matching” section of the cheat sheet.
1 Does the text have any tabs?
2 Search for Alice\n. What did you find?
3 Find five-letter words with \s\w\w\w\w\w\s. What problem(s) does this have?
4 This file annoyingly has line breaks in the middle of sentences. Find \n and replace them all with one space each. (Don’t save; undo.) Did this do what we wanted?
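The same basic patterns can be tried on a tiny string, where it is easier to see exactly what each match covers. This is only a sketch in Python with a made-up two-line sample, not the workshop file; Python's re module reads \t, \n, \s and \w the same way as the patterns above.

```python
import re

# Made-up sample with two line breaks; NOT taken from lookingglass.txt.
text = "Alice\nfound three small doors. Alice\nsmiled."

# \t matches a tab character.
has_tab = re.search(r"\t", text) is not None

# "Alice" immediately followed by a line break.
alice_then_newline = re.findall(r"Alice\n", text)

# Whitespace, five word characters, whitespace.
five = re.findall(r"\s\w\w\w\w\w\s", text)

# Replace every line break with one space.
joined = re.sub(r"\n", " ", text)

print(has_tab)
print(alice_then_newline)
print(five)
print(joined)
```

The sample contains six five-letter words, but the pattern returns only three matches, and each match includes the whitespace on both sides.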
Character classes
Look at the “character classes” section of the cheat sheet.
1 Find sequences of three vowels in a row with [aeiou][aeiou][aeiou].
2 Find sequences of three capital letters in a row with [A-Z][A-Z][A-Z]. Make sure to turn on case-sensitivity!
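Here is a small Python sketch of both character-class searches on a made-up sentence (not from the workshop file). Python's re is case-sensitive unless told otherwise, so the last line shows roughly what happens when case-sensitivity is left off.

```python
import re

# Made-up sample sentence; NOT taken from lookingglass.txt.
text = "The QUEEN said the beautiful queue was EMPTY."

# Three lowercase vowels in a row.
three_vowels = re.findall(r"[aeiou][aeiou][aeiou]", text)

# Three capital letters in a row (case-sensitive by default in Python).
three_capitals = re.findall(r"[A-Z][A-Z][A-Z]", text)

# With case ignored, [A-Z] matches lowercase letters as well.
three_any_case = re.findall(r"[A-Z][A-Z][A-Z]", text, flags=re.IGNORECASE)

print(three_vowels)
print(three_capitals)
print(len(three_any_case))
```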
Boundaries
Look at the “boundaries” section of the cheat sheet.
1 Search for \bcat\b. What did you find?
2 Search for \bcat\w. What did you find?
3 Find some words that start with “q”.
4 How often is “Alice” at the beginning of a line? At the end of a line?
5 Does ^through pick out the same things as \nthrough?
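A quick way to get a feel for boundaries is to run them on a few short lines. The sketch below uses Python and made-up text (not the workshop file). One assumption to flag: in Python, ^ only matches at the start of every line when the re.MULTILINE flag is set, whereas text editors typically behave that way by default.

```python
import re

# Made-up three-line sample; NOT taken from lookingglass.txt.
text = (
    "through the door\n"
    "the cat saw a caterpillar quietly\n"
    "through the cathedral window"
)

# "cat" as a whole word only.
whole_word_cat = re.findall(r"\bcat\b", text)

# "cat" at the start of a longer word, plus one more word character.
cat_prefix = re.findall(r"\bcat\w", text)

# Words that start with "q".
q_words = re.findall(r"\bq\w*", text)

# ^ anchors to the start of a line; \n is an actual character in the match.
at_line_start = re.findall(r"^through", text, flags=re.MULTILINE)
after_newline = re.findall(r"\nthrough", text)

print(whole_word_cat, cat_prefix, q_words)
print(len(at_line_start), len(after_newline))
```

Compare the last two counts, and compare what each match actually contains.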
Disjunctions
Disjunctions are pretty straightforward.
1 Search for (cat|dog)s. What did you find?
Moving on...
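To see why the parentheses matter, here is a short Python sketch on a made-up phrase. One Python-specific wrinkle, flagged as such: re.findall returns only the contents of the group when a pattern has one, so the sketch uses re.finditer to get the whole match, which is what an editor would highlight.

```python
import re

# Made-up sample phrase.
text = "cats and dogs chased the cat and the dog"

# (cat|dog)s : either "cat" or "dog", followed by "s".
grouped = [m.group(0) for m in re.finditer(r"(cat|dog)s", text)]

# Without parentheses, | splits the whole pattern: "cat" OR "dogs".
ungrouped = [m.group(0) for m in re.finditer(r"cat|dogs", text)]

# Python quirk: findall returns the captured group, not the whole match.
groups_only = re.findall(r"(cat|dog)s", text)

print(grouped)
print(ungrouped)
print(groups_only)
```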
“Quantifiers”
Look at the “quantifiers” section of the cheat sheet.
1 Find all words that end in -ing. Make sure you select the entire -ing-suffixed word.
2 Find all words that end in -ing or the plural -ings.
3 How many exactly-ten-letter words are there?
4 How many matches of \b[\w’]{10}\b are there? What is this?
5 What’s the longest word in the text?
6 What’s the longest word in the text which can be typed using only the keys on the top row of your keyboard?
7 Find the line that both the “Red Queen” and “Red King” are in.
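Below is one possible way to write a few of these quantifier patterns, sketched in Python on a made-up sentence (not the workshop file). The patterns are suggestions rather than the only answers, and the top-row class assumes a QWERTY keyboard.

```python
import re

# Made-up sample sentence; NOT taken from lookingglass.txt.
text = ("The king was singing and running; the queen kept "
        "her writings in a typewriter.")

# + means "one or more": whole words ending in -ing.
ing_words = re.findall(r"\b\w+ing\b", text)

# ? means "optional": -ing or -ings.
ings_words = re.findall(r"\b\w+ings?\b", text)

# {10} means "exactly ten".
ten_letter = re.findall(r"\b\w{10}\b", text)

# Longest word: collect all words, then compare lengths in Python.
longest = max(re.findall(r"\w+", text), key=len)

# Words typed with the top letter row only (QWERTY layout assumed).
top_row = re.findall(r"\b[qwertyuiop]+\b", text, flags=re.IGNORECASE)

print(ing_words)
print(ings_words)
print(ten_letter, longest, top_row)
```

Notice that the -ing pattern cannot tell a suffix from a word that just happens to end in those letters.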
Special characters
Special characters are pretty straightforward too. If your editor ever dies and tells you your regular expression is bad, most likely you forgot to escape something.
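A short Python sketch of what forgetting to escape can look like, on a made-up string. Sometimes the pattern quietly matches too much, and sometimes it is rejected outright. The re.escape helper is a Python convenience, not something your text editor necessarily offers.

```python
import re

# Made-up sample string.
text = "Is it 3.5 or 315? (Ask.)"

# An unescaped . matches any character, so this also finds "315".
unescaped = re.findall(r"3.5", text)

# An escaped \. matches only a literal period.
escaped = re.findall(r"3\.5", text)

# An unescaped ( opens a group; without a closing ) the pattern is invalid.
try:
    re.compile(r"(Ask")
    pattern_is_bad = False
except re.error:
    pattern_is_bad = True

# re.escape adds the backslashes needed to match a string literally.
literal = re.escape("(Ask.)")
found_literal = re.findall(literal, text)

print(unescaped, escaped, pattern_is_bad)
print(literal, found_literal)
```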
Backreferences
Look at the “backreferences” section of the cheat sheet.
1 Search for \b(\w+)␣\1\b. What did you find?
2 Search for \b(\w)\w*␣\1\w*\b. What does this find?
3 Replace all the line breaks inside paragraphs, but keep the paragraph breaks intact.
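Here is a sketch of the first two backreference patterns in Python, run on a made-up sentence. It assumes the ␣ symbol on the slide stands for a single ordinary space. In Python, \1 also works in the replacement string, which is how the last step removes a repeated word.

```python
import re

# Made-up sample sentence with deliberate repetitions.
text = "She said that that was the the best tea time."

# (\w+) captures a word; \1 must then match the same text again.
doubled = [m.group(0) for m in re.finditer(r"\b(\w+) \1\b", text)]

# (\w) captures only the first letter; \1 requires the next word
# to start with that same letter.
same_initial = [m.group(0) for m in re.finditer(r"\b(\w)\w* \1\w*\b", text)]

# In a replacement, \1 inserts whatever the group captured.
deduplicated = re.sub(r"\b(\w+) \1\b", r"\1", text)

print(doubled)
print(same_initial)
print(deduplicated)
```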
Realistic (important for next week!) exercises
Open file blocking.txt.
1 Take each sentence and replace “made V” with the appropriate “Ved”. (Don’t save; undo.)
2 Take each line and turn it into two lines: one as is and the other with “make V” replaced with the appropriate “Ved”.
Open file quantifiers.txt. Each sentence contains two quantifiers in curly braces.
1 Take each sentence and split it into two sentences, using the options in the curly braces one at a time. Add “a” and “b” to the sentence numbers at the same time.