
CS 126 Lecture T1: Pattern Matching



  1. CS 126 Lecture T1: Pattern Matching

  2. Outline • Introduction • Pattern matching in Unix • Regular expressions in Unix • Regular expressions as formal languages • Finite State Automata • Conclusions CS126 14-1 Randy Wang

  3. Introduction to Theoretical Computer Science
     • Two fundamental questions:
       - Power? What are the things a computer can and cannot do?
       - Speed? How quickly can a computer solve different classes of problems?
     • Approach:
       - We don’t talk about specific physical machines or specific problems; instead
       - We reduce computers to general, minimalist, abstract mathematical entities
       - We talk about general classes of problems
     • Today: the simplest machine (an FSA) and the class of problems it can solve

  4. Why Learn Theory?
     • In theory...
       - Deeper understanding of what a computer or computing is
       - Pure science: some of the most challenging “holy grails” (why climb a mountain? because it’s there!)
       - Philosophical implications
     • In practice... (some examples)
       - Sequential circuits: theory of finite state automata
       - Compilers: theory of context-free grammars
       - Cryptography: complexity theory

  5. Outline • Introduction • Pattern matching in Unix • Regular expressions in Unix • Regular expressions as formal languages • Finite State Automata • Conclusions

  6. Unix Tools
     • Remember what we said about the success of Unix?
       - A large number of very simple small tools
       - Unix provides “glue” that allows you to connect them together to perform useful tasks effortlessly
     • Some of the most important tools have to do with pattern matching:
       - grep
       - awk
       - sed
       - more
       - emacs
       - perl

  7. Demos • Words and partial words • Which files have the pattern • Interaction with other commands

  8. Any file names that end with “.sl”: “wildcard” file name matching (“glob style”) is a Unix shell feature, not to be confused with grep syntax

  9. A dot matches any single character; this is part of grep syntax, not to be confused with the dots in file names
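
The difference between shell wildcards and the grep dot can be sketched in Python, whose standard `fnmatch` module follows shell-style wildcard rules and whose `re` module follows regular-expression rules. This is only an illustration, not the demo from the lecture; the file names are invented.

```python
import fnmatch
import re

# Hypothetical file names, invented for this illustration.
names = ["notes.sl", "notesxsl", "a.sl.bak", "sl"]

# Shell-style ("glob") matching: "*" matches any run of characters,
# and "." is an ordinary literal character.
glob_hits = [n for n in names if fnmatch.fnmatchcase(n, "*.sl")]

# Regular-expression matching: "." matches any single character,
# so ".sl$" also matches a name ending in "xsl".
regex_hits = [n for n in names if re.search(r".sl$", n)]

print(glob_hits)
print(regex_hits)
```

The glob pattern selects only `notes.sl`, while the regular expression also selects `notesxsl`, because its dot is a wildcard rather than a literal dot.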

  10. Outline • Introduction • Pattern matching in Unix • Regular expressions in Unix • Regular expressions as formal languages • Finite State Automata • Conclusions

  11. (Slide annotations) egrep or grep -E only

  12. More Demos • Regular expressions • egrep or grep -E features • Escape characters • Command line options

  13. Examples (slide annotation: “wrong example”; sample string: taactgatacatacatacatacgctaat)

  14. Unix command displaying disk usage. How do you say it if you want a “real” dot? Use an “escape character” in front...

  15. “Escape” Character (slide annotations: escape characters; a bunch of spaces; a bunch of letters or a bunch of numbers, but not both)
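
The escape character and the “letters or numbers, but not both” idea can be sketched with Python’s `re` module. These particular patterns are written the same way for `grep -E`. The sample text and the variable names are assumptions made for this illustration, not taken from the slides.

```python
import re

# Sample text invented for this illustration.
line = "du -s . reports 3.14 and 3x14"

# Unescaped dot: matches any single character between "3" and "14".
loose = re.findall(r"3.14", line)

# Escaped dot: matches only a literal "." between "3" and "14".
strict = re.findall(r"3\.14", line)

# A bunch of spaces (one or more), and nothing else.
spaces = re.compile(r"^ +$")

# A bunch of letters OR a bunch of digits, but not a mixture of both.
token = re.compile(r"^([a-zA-Z]+|[0-9]+)$")

print(loose, strict)
print(bool(token.match("abc")), bool(token.match("123")), bool(token.match("abc123")))
```

Without the backslash, the pattern also picks up `3x14`; with it, only `3.14` matches.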

  16. Testament to Flexibility and Power of Unix Philosophy • Simple general tools + glue (scripting and shell) • The advantages are being magnified in the age of the web

  17. Outline • Introduction • Pattern matching in Unix • Regular expressions in Unix • Regular expressions as formal languages - Regular expression generator • Finite State Automata • Conclusions

  18. Unix vs. Theory
      • Unix regular expressions are useful
      • But they are more complex than the theoretical minimum
      • But are they any more powerful? No.
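
One way to see why the extra Unix operators add convenience but not power is to rewrite each of them using only concatenation, `|`, and `*`. The sketch below uses Python’s `re` module; the chosen pairs and the helper names `all_strings` and `same_language` are assumptions for illustration. It compares the two forms on every string up to a fixed length, which is a bounded check, not a proof.

```python
import itertools
import re

# Each pair: (Unix-style pattern, equivalent pattern using only
# concatenation, "|" and "*"). The pairs are illustrative choices.
pairs = [
    (r"a+", r"aa*"),
    (r"[ab]", r"(a|b)"),
    (r"ab?", r"(a|ab)"),
    (r"a{2,3}", r"(aa|aaa)"),
]

def all_strings(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=n):
            yield "".join(chars)

def same_language(p, q, alphabet="ab", max_len=6):
    """Check that p and q accept exactly the same strings up to max_len."""
    return all(
        bool(re.fullmatch(p, s)) == bool(re.fullmatch(q, s))
        for s in all_strings(alphabet, max_len)
    )

results = [same_language(p, q) for p, q in pairs]
print(results)
```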

  19. Formal Languages
      • Formal definitions
        - An alphabet: a finite set of symbols
        - A string: a finite sequence of symbols from the alphabet
        - A language: a (potentially infinite) set of strings over an alphabet
      • Intriguing topic: finite representation of a language
        - How?
          + language generators (a set of rules for producing strings)
          + language recognizers
        - We will study different classes of languages, their generators, and their recognizers, each more powerful than the previous ones
        - There are even strange languages that fail all these finite representational methods!
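
The generator and recognizer views of one language can be put side by side in a minimal sketch. The language chosen here (binary strings that end in 1) and the function names are assumptions for illustration; the slides do not name a specific language at this point.

```python
import itertools

ALPHABET = ("0", "1")  # an alphabet: a finite set of symbols

def generate(max_len):
    """Generator: produce every string of the language up to max_len.

    Rule used here: take any string over the alphabet and append "1".
    """
    for n in range(max_len):
        for chars in itertools.product(ALPHABET, repeat=n):
            yield "".join(chars) + "1"

def recognize(s):
    """Recognizer: decide whether the string s belongs to the language."""
    return all(c in ALPHABET for c in s) and s.endswith("1")

sample = list(generate(3))
print(sample)
print(all(recognize(s) for s in sample))
```

The language itself is infinite, but both the rule in `generate` and the test in `recognize` are finite descriptions of it.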

  20. Why Study Formal Languages

  21. (Bare Minimum) Regular Expression: Generator Rules

  22. Regular Languages

  23. Outline • Introduction • Pattern matching in Unix • Regular expressions in Unix • Regular expressions as formal languages • Finite State Automata - Regular expression recognizer and beyond • Conclusions

  24. Finite State Automata: Regular Language Recognizers [Figure: a read head scanning an input tape (0 0 1 1 0 1 0 0), attached to a dial of finite states numbered 0 to 7]

  25. FSA Example Demo

  26. FSA Example (figure annotations)
      • Beginning state
      • Input
      • State
      • Read a 1, and the string still has a chance
      • Read a 0, and the string is accepted if we stop now
      • Dead state
      • Can kill any number of these “ears”, and the string will still be accepted! Important implication later.

  27. Second FSA Example

  28. An Application

  29. Third FSA Example: Add Outputs

  30. Bounce Filter Demo

  31. State Meaning

  32. Fourth FSA Example
      • How does it work?
        - Every time we scan one more digit y: x = (x<<1) + y
        - Equivalent to: x = x*2 + y
        - Three states: x%3==0, x%3==1, x%3==2
        - Six transitions:
          (0*2+0)%3==0, (0*2+1)%3==1
          (1*2+0)%3==2, (1*2+1)%3==0
          (2*2+0)%3==1, (2*2+1)%3==2
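
The three states and six transitions listed on this slide can be written directly as a table-driven FSA. The sketch below assumes that the machine is meant to accept binary numbers divisible by 3, so that state 0 is the accepting state; the slide text itself does not say which state accepts. The names `TRANSITIONS` and `divisible_by_three` are chosen for this illustration.

```python
# TRANSITIONS[state][bit] -> next state, where each state is the
# value of x % 3 for the bits read so far (most significant bit first).
TRANSITIONS = {
    0: {"0": 0, "1": 1},  # (0*2+0)%3==0, (0*2+1)%3==1
    1: {"0": 2, "1": 0},  # (1*2+0)%3==2, (1*2+1)%3==0
    2: {"0": 1, "1": 2},  # (2*2+0)%3==1, (2*2+1)%3==2
}

def divisible_by_three(bits):
    """Run the FSA over a binary string; accept if it ends in state 0."""
    state = 0  # before reading anything, x == 0
    for bit in bits:
        state = TRANSITIONS[state][bit]
    return state == 0  # assumption: state 0 is the accepting state

print(divisible_by_three("110"))  # binary 6
print(divisible_by_three("111"))  # binary 7
```

The machine never stores x itself, only x % 3, which is why three states are enough no matter how long the input is.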

  33. Outline • Introduction • Pattern matching in Unix • Regular expressions in Unix • Regular expressions as formal languages • Finite State Automata • Conclusions

  34. Looking Ahead...
      • Regular expressions are very simple languages, and FSAs are very simple machines
      • What kind of languages cannot be expressed by regular expressions? What tasks can’t be performed by FSAs?
      • Basic idea: because the machine only has a finite number of states N, it can’t remember more than N things
      • So any language that requires remembering an infinite number of things is not regular
      • This is something that we will do a couple more times:
        - Define a machine, and understand its behavior
        - Find things it can’t do
        - Define a more powerful machine
        - Repeat until we either run out of machines or problems
        - (Hmm... which will we run out of first?)

  36. A Warm-up Result [Figure labels: a, s, x, b]
      • Remember we said we could cut any ear when showing the first example of FSA?
      • More formally, if a(s)*b is accepted (s being a loop that leaves state x and returns to it), then ab is accepted

  37. repeat visits to the same state
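
The effect of a repeat visit to the same state can be checked on a small machine. The sketch below reuses the three-state x%3 machine from the fourth FSA example, with state 0 assumed to be both the start and the accepting state; the function names `trace`, `accepts`, and `cut_first_loop` are invented for this illustration. It records the states visited, finds the first state that is visited twice, and cuts out the input consumed between the two visits.

```python
# Three-state machine tracking x % 3 (state 0 assumed to be accepting).
TRANSITIONS = {
    0: {"0": 0, "1": 1},
    1: {"0": 2, "1": 0},
    2: {"0": 1, "1": 2},
}
START, ACCEPTING = 0, {0}

def trace(bits):
    """Return the states visited; trace(bits)[i] is the state after i symbols."""
    states = [START]
    for bit in bits:
        states.append(TRANSITIONS[states[-1]][bit])
    return states

def accepts(bits):
    return trace(bits)[-1] in ACCEPTING

def cut_first_loop(bits):
    """Remove the input read between the first two visits to a repeated state."""
    seen = {}  # state -> number of symbols read when first visited
    for i, state in enumerate(trace(bits)):
        if state in seen:
            return bits[:seen[state]] + bits[i:]
        seen[state] = i
    return bits  # no state repeats, so there is nothing to cut

word = "1001"                 # binary 9, visits states 0, 1, 2, 1, 0
shorter = cut_first_loop(word)
print(shorter, accepts(word), accepts(shorter))
```

Because the machine is in the same state before and after the removed piece, the rest of the run is unchanged, so the shortened string is accepted exactly when the original is.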

  38. What Have We Learned Today
      • How to write Unix-style regular expressions
      • How to use their associated Unix tools to perform useful and interesting tasks
      • “Formal” regular expressions
      • FSAs, how to trace their execution
      • Constructing simple FSAs to solve problems
      • Understanding the limits of REs and FSAs: being able to spot what problems they cannot solve (you’ll get better at this after a few more lectures...)
