CS 126 Lecture T1: Pattern Matching
Outline • Introduction • Pattern matching in Unix • Regular expressions in Unix • Regular expressions as formal languages • Finite State Automata • Conclusions CS126 14-1 Randy Wang
Introduction to Theoretical Computer Science
• Two fundamental questions:
  - Power: What are the things a computer can and cannot do?
  - Speed: How quickly can a computer solve different classes of problems?
• Approach:
  - We don’t talk about specific physical machines or specific problems; instead
  - We reduce computers to general, minimalist, abstract mathematical entities
  - We talk about general classes of problems
• Today: the simplest machine (an FSA) and the class of problems it can solve
Why Learn Theory?
• In theory...
  - Deeper understanding of what a computer or computing is
  - Pure science: some of the most challenging “holy grails” (why climb a mountain? because it’s there!)
  - Philosophical implications
• In practice... (some examples)
  - Sequential circuits: theory of finite state automata
  - Compilers: theory of context-free grammars
  - Cryptography: complexity theory
Unix Tools
• Remember what we said about the success of Unix?
  - A large number of very simple small tools
  - Unix provides “glue” that allows you to connect them together to perform useful tasks effortlessly
• Some of the most important tools have to do with pattern matching:
  - grep
  - awk
  - sed
  - more
  - emacs
  - perl
Demos
• Words and partial words
• Which files contain the pattern
• Interaction with other commands
Any file name that ends with “.sl”: this is “wildcard” (“glob-style”) file name matching, a Unix shell feature not to be confused with grep syntax.
A dot matches any single character. This is part of grep syntax, not to be confused with the dots in file names.
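The difference between the two notations can be sketched in a few lines of Python. This is only an illustration: the file names are made up, `fnmatch` stands in for shell globbing, and `re` stands in for grep.

```python
import fnmatch
import re

# Hypothetical file names, chosen only to show the difference.
names = ["notes.sl", "notes_sl", "a.slx", "xsl"]

# Shell-style glob: '*' matches any run of characters, '.' is a literal dot.
glob_hits = [n for n in names if fnmatch.fnmatchcase(n, "*.sl")]

# grep-style regular expression: '.' matches any single character,
# so ".sl" means "any one character followed by sl", anywhere in the name.
regex_hits = [n for n in names if re.search(r".sl", n)]

print(glob_hits)   # only the name that really ends in ".sl"
print(regex_hits)  # every name, because each contains <any char> + "sl"
```

The same three characters mean different things depending on which tool interprets them.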
Some operators are available only in egrep (or grep -E).
More Demos
• Regular expressions
• egrep or grep -E features
• Escape characters
• Command line options
Examples
[Figure: example patterns, one of them marked as a wrong example; sample input string: taactgatacatacatacatacgctaat]
[Annotation: a Unix command displaying disk usage]
How do you say it if you want a “real” dot? Use an “escape character” in front...
“Escape” Character
Annotations on the example pattern:
- escape characters
- a bunch of spaces
- a bunch of letters or a bunch of numbers, but not both
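A small sketch of the escape idea, using Python's `re` module rather than grep itself. The patterns and test strings are hypothetical, and the last pattern is one possible reading of "letters or numbers, but not both".

```python
import re

any_char = re.compile(r"3.14")    # unescaped '.': matches any single character
real_dot = re.compile(r"3\.14")   # escaped '\.': matches only a literal dot

# One way to say "a bunch of letters or a bunch of digits, but not both".
letters_or_digits = re.compile(r"[A-Za-z]+|[0-9]+")

def matches(pattern, text):
    """True if the whole text matches the compiled pattern."""
    return pattern.fullmatch(text) is not None

print(matches(any_char, "3x14"), matches(real_dot, "3x14"))
print(matches(letters_or_digits, "abc"), matches(letters_or_digits, "abc123"))
```

The backslash removes the special meaning of the character that follows it, so the pattern asks for that exact character.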
Testament to Flexibility and Power of Unix Philosophy
• Simple general tools + glue (scripting and the shell)
• The advantages are being magnified in the age of the web
Outline
• Introduction
• Pattern matching in Unix
• Regular expressions in Unix
• Regular expressions as formal languages
  - Regular expression generator
• Finite State Automata
• Conclusions
Unix vs. Theory
• Unix regular expressions are useful
• But they are more complex than the theoretical minimum
• Are they any more powerful? No.
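One way to see why the extra operators add convenience but not power: each can be rewritten using only concatenation, `|`, `*`, and parentheses. The sketch below checks a few such rewrites with Python's `re` module on all short strings over a small alphabet. The pattern pairs are illustrative choices, and a bounded check like this is evidence, not a proof.

```python
import re
from itertools import product

# (extended pattern, rewrite using only concatenation, '|', '*', parentheses)
# An empty alternative such as "(a|)" stands for "a or the empty string".
PAIRS = [
    (r"a+b",   r"aa*b"),
    (r"a?b",   r"(a|)b"),
    (r"[ab]c", r"(a|b)c"),
    (r"a{2}b", r"aab"),
]

def all_strings(alphabet, max_len):
    """Yield every string over the alphabet with length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

def same_language(p, q, alphabet="abc", max_len=5):
    """True if p and q accept exactly the same strings up to max_len."""
    return all(
        (re.fullmatch(p, s) is None) == (re.fullmatch(q, s) is None)
        for s in all_strings(alphabet, max_len)
    )

print(all(same_language(p, q) for p, q in PAIRS))
```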
Formal Languages
• Formal definitions
  - An alphabet: a finite set of symbols
  - A string: a finite sequence of symbols from the alphabet
  - A language: a (potentially infinite) set of strings over an alphabet
• Intriguing topic: finite representation of a language
  - How?
    + language generators (a set of rules for producing strings)
    + language recognizers
  - We will study different classes of languages, their generators, and their recognizers, each more powerful than the previous ones
  - There are even strange languages that fail all these finite representational methods!
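To make the generator/recognizer distinction concrete, here is a minimal sketch for one hypothetical language: all strings over the alphabet {0, 1} that end in 1. The language and the function names are chosen only for illustration. Both functions are finite descriptions of an infinite set.

```python
from itertools import count, islice, product

ALPHABET = "01"  # hypothetical alphabet chosen for illustration

def recognizer(s):
    """Recognizer: decide whether the string s belongs to the language."""
    return all(ch in ALPHABET for ch in s) and s.endswith("1")

def generator():
    """Generator: produce every string of the language, shortest first."""
    for n in count(0):
        for prefix in product(ALPHABET, repeat=n):
            yield "".join(prefix) + "1"

# The language is infinite, so we only look at a finite prefix of it.
first_seven = list(islice(generator(), 7))
print(first_seven)
```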
Why Study Formal Languages
(Bare Minimum) Regular Expression: Generator Rules
Regular Languages
Outline
• Introduction
• Pattern matching in Unix
• Regular expressions in Unix
• Regular expressions as formal languages
• Finite State Automata
  - Regular expression recognizer and beyond
• Conclusions
Finite State Automata: Regular Language Recognizers
[Figure: an input tape holding 0 0 1 1 0 1 0 0, a read head, and a finite set of states numbered 0 to 7]
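The tape, read head, and states picture can be sketched as a table-driven loop. The simulator below is a generic illustration, and the two-state machine fed to it (even number of 1s) is a hypothetical example, not the machine drawn on the slide.

```python
def run_fsa(transitions, start, accepting, tape):
    """Scan the tape left to right and report whether the FSA accepts it."""
    state = start
    for symbol in tape:  # the read head advances one cell per step
        state = transitions[(state, symbol)]
    return state in accepting

# Hypothetical machine: accepts bit strings containing an even number of 1s.
EVEN_ONES = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}

# Same tape contents as in the figure; it contains three 1s.
accepted = run_fsa(EVEN_ONES, "even", {"even"}, "00110100")
print(accepted)
```

The machine keeps no memory other than its current state, which is the whole point of the model.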
FSA Example Demo
FSA Example
Annotations on the state diagram:
- beginning state
- input, state
- read a 1, and the string still has a chance
- read a 0, and the string is accepted if we stop now
- dead state
- Can kill any number of these “ears”, and the string will still be accepted! Important implication later.
Second FSA Example
An Application
Third FSA Example: Add Outputs
Bounce Filter Demo
State Meaning
Fourth FSA Example
• How does it work?
  - Every time we scan one more digit y: x = (x<<1) + y
  - Equivalent to: x = x*2 + y
  - Three states: x%3==0, x%3==1, x%3==2
  - Six transitions:
    (0*2+0)%3==0, (0*2+1)%3==1
    (1*2+0)%3==2, (1*2+1)%3==0
    (2*2+0)%3==1, (2*2+1)%3==2
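The arithmetic above can be checked with a short sketch. It assumes the input is a binary number read most significant bit first and that the state is the remainder of the number so far modulo 3. The function and table names are illustrative.

```python
# The six transitions, generated from the rule new_state = (state*2 + y) % 3.
TRANSITIONS = {(s, y): (s * 2 + y) % 3 for s in range(3) for y in (0, 1)}

def remainder_mod3(bits):
    """Run the three-state machine over a binary string, return the final state."""
    state = 0
    for ch in bits:
        state = TRANSITIONS[(state, int(ch))]
    return state

# The machine never stores x itself, only x % 3, yet the answer agrees
# with ordinary arithmetic on the full number.
print(remainder_mod3("110"), 6 % 3)
print(all(remainder_mod3(format(n, "b")) == n % 3 for n in range(200)))
```

Three states suffice because the next remainder depends only on the current remainder and the digit just read.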
Looking Ahead...
• Regular expressions describe very simple languages, and FSAs are very simple machines
• What kinds of languages cannot be expressed by regular expressions? What tasks can’t be performed by FSAs?
• Basic idea: because the machine has only a finite number of states N, it can’t remember more than N things
• So any language that requires remembering an unbounded number of things is not regular
• This is something that we will do a couple more times:
  - Define a machine, and understand its behavior
  - Find things it can’t do
  - Define a more powerful machine
  - Repeat until we run out of either machines or problems
  - (Hmm... which will we run out of first?)
A Warm-up Result
[Figure: an FSA fragment with labels a, s, x, b]
• Remember we said we could cut any ear when showing the first example of an FSA?
• More formally, if a(s)*b is accepted, then ab is accepted
repeat visits to the same state
What Have We Learned Today
• How to write Unix-style regular expressions
• How to use their associated Unix tools to perform useful and interesting tasks
• “Formal” regular expressions
• FSAs, how to trace their execution
• Constructing simple FSAs to solve problems
• Understanding the limits of REs and FSAs: being able to spot what problems they cannot solve (you’ll get better at this after a few more lectures...)