Motivation: How would you find a substring inside a string?

CS 61A/CS 98-52
Mehrdad Niknami
University of California, Berkeley

Motivation

How would you find a substring inside a string?
Something like this? (Is this good?)

    def find(string, pattern):
        n = len(string)
        m = len(pattern)
        # Try every offset at which the pattern could start.
        for i in range(n - m + 1):
            is_match = True
            # Compare the pattern letter by letter against string[i:i + m].
            for j in range(m):
                if pattern[j] != string[i + j]:
                    is_match = False
                    break
            if is_match:
                return i

What if you were looking for a pattern? Like an email address?

Background

Text processing has been at the heart of computer science since the 1950s:
- Regular languages: 1950s (Kleene)
- Context-free languages (CFLs): 1950s (Chomsky)
- Regular expressions (regexes) & automata: 1960s (Thompson)
- LR parsing (left-to-right, rightmost-derivation): 1960s (Knuth)
- Context-free parsers: 1960s (Earley)
- String searching (Knuth-Morris-Pratt, Boyer-Moore, etc.): 1970s
- Periods & critical factorizations: 1970s (Cesari-Vincent)
- [...]
- Critical factorizations in linear complexity: 2016 (Kosolobov)
Research is still ongoing ...apparently more in Europe?

Most of you will probably graduate without learning string processing.
Instead, you'll learn how to process images and Big Data™.
Which makes me sad. :(
You should know how to solve solved problems!
Learn & use 100%-accurate algorithms before 85%-accurate ones!
O(mn)-time str.find(substring) is bad! You can do much better:
- Good algorithms finish in O(m + n) time & space (e.g. Z algorithm)
- The best/coolest finish in O(m + n) time but O(1) space!!!
So, today, I'll teach a bit about string processing. :)
You can learn more in CS 164, CS 176, etc. (Have fun!)

Formal Languages

In formal language theory:
- Alphabet: any set (usually a character set, like English or ASCII)
  → Often denoted by Σ
- Letter: an element in the given alphabet, e.g. "x"
- String (or word): finite sequence of letters, e.g. "hi"
- Language: a set of strings, e.g. {"a", "aa", "aaa", ...}
  → Often denoted by L
We might omit the quotes/braces, so we'll use the following denotations:
- ε: empty string (i.e., "")
- ∅: empty language (i.e., empty set {})

Formal Grammars

Languages can be infinite, so we can't always list all the strings in them.
We therefore use grammars to describe languages.
For example, this grammar describes L = {"", "hi", "hihi", ...}:

    S → T
    T → ε
    T → T "h" "i"

We call S a nonterminal symbol and "h" a terminal symbol (i.e., letter).
Each line is a production rule, producing a sentential form on the right.
To make life easier, we'll denote nonterminals by uppercase and terminals by lowercase, omitting quotes and spaces when convenient.
We then merge and simplify rules via the pipe (OR) symbol:

    S → S hi | ε

Regular Languages

The following are regular languages over the alphabet Σ:
- ∅
- {ε}
- {σ} ∀ σ ∈ Σ
- The union A ∪ B of any regular languages A and B over Σ
- The concatenation AB of any regular languages A and B over Σ
- The repetition (Kleene star) A* of any regular language A over Σ:
  A* = {ε} ∪ A ∪ AA ∪ AAA ∪ ...
Notice that all finite languages are regular, but not all infinite languages.
Regular languages do not allow arbitrary "nesting" (e.g. parens).

Regular Grammars

A regular grammar is a grammar in which every production has at most one nonterminal symbol on its right-hand side, and those nonterminals all appear at the left end or all appear at the right end.
In other words, this is a regular grammar:

    S → A b c
    A → S a | ε

This is not a regular grammar (but it is linear and context-free):

    S → A b c
    A → a S | ε

and neither is this (it is context-sensitive):

    S → S s | ε
    S s → S t

A language is regular iff it can be described by a regular grammar.
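
To make the last claim a little more concrete, here is a minimal sketch that expands the left-regular grammar S → A b c, A → S a | ε for a few rounds and checks each derived string against a regex. The helper name derive and the regex bc(abc)* are assumptions made for this illustration; they do not appear in the slides.

```python
import re

def derive(rounds):
    """Return the strings derivable from S using A -> S a at most `rounds` times.

    Grammar (left-regular):  S -> A b c,   A -> S a | ε
    """
    a_strings = {""}            # A -> ε
    s_strings = set()
    for _ in range(rounds + 1):
        # S -> A b c: append "bc" to everything A can produce so far.
        s_strings = {a + "bc" for a in a_strings}
        # A -> S a: append "a" to everything S can produce so far.
        a_strings = {""} | {s + "a" for s in s_strings}
    return s_strings

# Assumed equivalent regex: "bc" followed by zero or more copies of "abc".
pattern = re.compile(r"bc(abc)*")
words = derive(3)
assert all(pattern.fullmatch(w) for w in words)
print(sorted(words, key=len))
```

Every string the grammar produces within the bound is matched by the regex, which is consistent with (though of course not a proof of) the two describing the same language.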

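Returning to the Background claim that substring search can finish in O(m + n) time and space, the following is a rough sketch of the Z algorithm mentioned there. The names z_array and find_linear, the "\0" separator, and the choice to return -1 when nothing is found are assumptions for this example.

```python
def z_array(s):
    """z[i] = length of the longest common prefix of s and s[i:]."""
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n
    left = right = 0   # s[left:right] is the rightmost window known to match a prefix
    for i in range(1, n):
        if i < right:
            # Reuse what was already learned inside the window.
            z[i] = min(right - i, z[i - left])
        # Extend the match by direct comparison.
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z

def find_linear(string, pattern):
    """Return the first index of pattern in string, or -1 if it is absent."""
    m = len(pattern)
    if m == 0:
        return 0
    # Search the combined text pattern + separator + string.
    z = z_array(pattern + "\0" + string)
    for i in range(m + 1, len(z)):
        if z[i] >= m:          # the m letters starting here equal the pattern
            return i - m - 1
    return -1

print(find_linear("hello cs61a@berkeley.edu", "berkeley"))
```

Each comparison in the while loop either fails once per position or pushes right forward, which is why the total work stays linear in m + n.
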
Regular Expressions

A regular expression is an easier way to describe a regular language.
It's essentially a pattern for describing a regular language.
For example, in [abcw-z]*(1+2|3)?4\?, we have:
- [abcw-z] (a character set) means "either a, b, c, w, x, y, or z".
- Asterisk (a.k.a. "Kleene star", a quantifier) means "zero or more"
- Plus (another quantifier) means "one or more"
- Question mark (another quantifier) means "at most one"
- Backslash ("escape") before a special character means that character
- Pipe (the OR symbol |) means "either", and parentheses group
So this matches zero or more of a, b, c, w, x, y, z, followed by either nothing or by 3 or by 1's followed by 2, followed by 4 and a question mark.

Regular expressions (regexes) are equivalent to regular grammars [1], e.g.

    [abcw-z]*(1+2|3)?4\?

is equivalent to

    S → Z 4 ?
    Z → Y 2 | X 3 | X
    Y → Y 1 | X 1
    X → X a | X b | X c | X w | X x | X y | X z | ε

Here X derives [abcw-z]*, Y derives [abcw-z]*1+, and Z derives [abcw-z]*(1+2|3)?.
Here, the regex is more compact. Sometimes, the grammar is smaller.

[1] If you've seen backreferences: those are not technically valid in regexes.

Python has a regex engine to find text matching a regex:

    >>> import re
    >>> m = re.match('.* ([a-z0-9._-]+)@([a-z0-9._-]+)', 'hello cs61a@berkeley.edu cs98-52')
    >>> m
    <re.Match object; span=(0, 24), match='hello cs61a@berkeley.edu'>
    >>> m.groups()
    ('cs61a', 'berkeley.edu')

Notice that these could all be handled by re.match:
- Substring search (str.find)
- Subsequence search (re.match(".*b.*b", "abbc"))
The grep tool (from ed's g/re/p = global/regex/print) does this for files.

Million-dollar question: How do you find text matching a regex?
Two steps:
1. Parse the regex (pattern) to "understand" its structure
2. Use the regex to parse the actual text (corpus)
It turns out that:
1. Step 1 is theoretically harder, but practically easier.
   (This can be done similarly to how you parsed Scheme.)
2. Step 2 is theoretically easier, but practically harder.
   This is because we need parsing the corpus to be fast.

How do you solve each step?
Both steps are often done using "recursive-descent", similarly to how your Scheme parser parsed its input.
Basically: try every possibility recursively. "Backtrack" on failure to try something else.
Problem: Recursive-descent can take exponential time!
Example (where "a{3}" is shorthand for "aaa"):

    >>> re.match("(a?){25}a{25}", "a" * 25)

Can we hope to parse corpora in time linear in their lengths?
Yes, using finite automata.

Finite Automata

A finite automaton (FA) consists of the following (example below) [2]:
- An input alphabet Σ ({0, 1} here)
- A finite set of states S ({s0, s1, s2} here)
- An initial state s0 ∈ S (s0 here)
- A set of accepting (or final) states F ⊆ S ({s2} here)
- A transition function δ : S × Σ → 2^S (the arrows here)

[Figure: state diagram of the example automaton, with states s0, s1, s2 and arrows labeled 0 and 1]

[2] Note that an FA is not quite the same thing as a finite-state machine (FSM).

Finite automata are language recognizers: you feed a string as an input, and if the automaton accepts the input string, the string is in its language.
⟹ Finite automata recognize regular languages, and nothing else! [3]
Therefore, we can:
1. Convert regex pattern to FA
2. Feed corpus to FA in linear time!
3. ...
4. Profit!
But how can we do this?

[3] Pumping lemma: A long-enough input must contain a repeatable substring. (Why?)

Notice the transition function δ outputs a subset of states. In particular:
- In a deterministic finite automaton (DFA), the transition function always outputs a set with exactly one state (a singleton).
  i.e., in a DFA, the next state is determined by the input & current state.
  (i.e., every state has exactly 1 arrow leaving it for each possible input.)
- In a nondeterministic finite automaton (NFA), the above is not true.
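
One way to picture "feed corpus to FA in linear time" is to track the set of states an NFA could be in after each input letter. The sketch below does that for a hypothetical NFA over {0, 1} that accepts strings ending in "01". The transition table, the state names, and the function name accepts are assumptions for illustration; this is not the automaton drawn on the slide.

```python
# Hypothetical NFA over {0, 1} accepting strings that end in "01".
# It is NOT the slide's example; these transitions are assumed for illustration.
DELTA = {
    ("s0", "0"): {"s0", "s1"},   # nondeterministic: stay put, or guess "01" starts here
    ("s0", "1"): {"s0"},
    ("s1", "1"): {"s2"},
}
START = "s0"
ACCEPTING = {"s2"}

def accepts(text):
    """Simulate the NFA by tracking every state it could currently be in."""
    current = {START}
    for symbol in text:
        # Follow all arrows labeled `symbol` out of the current states;
        # a missing table entry means the empty set of next states.
        current = set().union(
            *(DELTA.get((state, symbol), set()) for state in current)
        )
    return bool(current & ACCEPTING)

print(accepts("1101"), accepts("10"))
```

The work per input letter depends only on the number of states, not on the length of the text, so the whole scan is linear in the corpus and never backtracks.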
