9/25/17
CSCI 2320
Lexical Analysis
Ref: Ch 3 + Handout (Nishimura)
MOHAMMAD T. IRFAN

Plan
Chomsky Hierarchy
Lexical Analysis

Chomsky Hierarchy
(bottom of hierarchy: faster computation)
◦ Regular grammar
◦ Context-free grammar (CFG/BNF)
◦ Context-sensitive grammar
◦ Unrestricted grammar
(top of hierarchy: more expressive power)

Chomsky Hierarchy
A, B ∈ N; ω ∈ T*; α, β ∈ (T ∪ N)*

Grammar                          Production form
Regular grammar                  A → ωB | ω
Context-free grammar (CFG/BNF)   A → β
Context-sensitive grammar        α → β, where |α| <= |β|
Unrestricted grammar             α → β

Regular grammar: pros and cons
A, B ∈ N; ω ∈ T*
A → ωB | ω
Pros
◦ Can do the first layer of abstraction in PL syntax
◦ Integer → 0 Integer | 1 Integer | ... | 9 Integer | 0 | 1 | ... | 9
◦ Note: the following is not a regular grammar (why?)
◦ Integer → Integer Digit
◦ Digit → 0 | 1 | ... | 9
Cons
◦ Cannot check balanced parentheses, braces, etc.
◦ Cannot represent {a^n b^n | n >= 1}

CFG/BNF/EBNF: pros and cons
A ∈ N; β ∈ (T ∪ N)*
A → β
Pros
◦ Can do all layers of abstraction in PL syntax
◦ Assignment → Identifier = Expression;
Cons
◦ Can't do lots of semantic-type things
◦ Variable declared before use?
◦ Operand and operator compatible?
◦ Can't represent languages like {ww | w ∈ T^+}
◦ Can do equality checking (a^n b^n), but can't detect repetition
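
To make the contrast concrete, here is a minimal Python sketch. It assumes the Integer grammar above is written as the regular expression [0-9]+, and it recognizes {a^n b^n | n >= 1} by explicit counting, which is exactly what a regular grammar cannot express. The function names are illustrative, not from the slides.

```python
import re

# The right-linear Integer grammar generates one or more digits,
# which is the regular expression [0-9]+.
INTEGER = re.compile(r"[0-9]+")

def is_integer(s):
    """True if s is derivable from the Integer grammar."""
    return INTEGER.fullmatch(s) is not None

def is_anbn(s):
    """Recognize {a^n b^n | n >= 1} by counting (not expressible as a regular grammar)."""
    n = len(s) // 2
    return n >= 1 and s == "a" * n + "b" * n
```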

Context-sensitive: pros and cons
α, β ∈ (T ∪ N)*
α → β, where |α| <= |β|
Pros
◦ Can represent languages like {a^n b^n c^n | n >= 1}
Cons
◦ Deciding whether a given sentence ω can be derived from a given context-sensitive grammar is decidable, but PSPACE-complete, so no efficient general algorithm is known
◦ No practical general-purpose parsing!
◦ Impractical to write a compiler for a context-sensitive grammar!

Unrestricted: pros and cons
α, β ∈ (T ∪ N)*
α → β
Pros
◦ Equivalent to Turing machine
◦ That is, can compute any computable function
Cons
◦ Can we do parsing?

Plan
Chomsky Hierarchy
Lexical Analysis

Lexical Analysis
Input: Lexemes (typed ASCII characters)
Output: Tokens (sequence of characters having a collective meaning)
Discard: whitespace, comments

int count = 10;

Lexemes   int       count        =          10           ;
Tokens    keyword   identifier   operator   intLiteral   separator
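
The lexeme-to-token mapping above can be sketched in Python with the standard re module. This is a minimal illustration, not the course's implementation: the keyword list and the token patterns below are assumptions chosen to cover the one example statement.

```python
import re

# Assumed token patterns, tried in this order (keywords before identifiers).
TOKEN_SPEC = [
    ("keyword",    r"\b(?:int|float|bool|char)\b"),
    ("identifier", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("intLiteral", r"[0-9]+"),
    ("operator",   r"="),
    ("separator",  r";"),
    ("skip",       r"[ \t\n]+"),   # whitespace is discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Return a list of (token category, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at position {pos}")
        if m.lastgroup != "skip":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("int count = 10;"))
```

Listing the keyword pattern before the identifier pattern matters: otherwise `int` would be classified as an identifier.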

Why do lexical analysis separately?
Simpler, faster grammar for parsing
◦ Next: how?
75% of time spent in lexical analysis

Def. Regular Expressions
RegExpr   Meaning
x         a character x
\x        an escaped character, e.g., \n
{Z}       a reference to a reg expr Z
M | N     M or N, where M and N are reg expr
M N       M followed by N
M*        zero or more occurrences of M
M+        one or more occurrences of M
M?        zero or one occurrence of M

Def. Regular Expressions
RegExpr   Meaning
[aeiou]   the set of vowels
[0-9]     the set of digits
.         any single character
Special symbols: ^ means not (e.g., [^aeiouAEIOU] is a non-vowel)

CLite regular definition
Category     Definition
AnyChar      [ -~]       From space (ASCII 32) to tilde (ASCII 126)
Letter       [a-zA-Z]
Digit        [0-9]
Whitespace   [ \t]       Space and tab
Eol          \n
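
The operators in these tables can be tried out directly with Python's re module. The sketch below is only an illustration: the helper name matches and the sample strings are made up, and Python's dialect differs slightly from the notation above (for instance, `.` does not match a newline unless re.DOTALL is passed, and `{Z}` references are not supported).

```python
import re

def matches(pattern, text):
    """True if the whole of text is matched by pattern."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text, expected result)
EXAMPLES = [
    (r"a|b", "b", True),               # M | N
    (r"ab*", "a", True),               # zero occurrences of b
    (r"ab+", "a", False),              # + needs at least one b
    (r"ab?", "abb", False),            # ? allows at most one b
    (r"[aeiou]", "e", True),           # set of vowels
    (r"[^aeiouAEIOU]", "x", True),     # ^ inside brackets means "not"
    (r"[0-9]+", "2017", True),         # one or more digits
    (r"\n", "\n", True),               # escaped character
    (r".", "\n", False),               # Python's . excludes newline by default
    (r"[ -~]", "~", True),             # AnyChar: space through tilde
    (r"[ -~]", "\t", False),           # tab lies outside that range
]

for pattern, text, expected in EXAMPLES:
    assert matches(pattern, text) == expected

print(ord(" "), ord("~"))  # code points of the AnyChar range endpoints
```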

Category     Definition
Keyword      bool | char | else | false | float | if | int | main | true | while
Identifier   {Letter}({Letter} | {Digit})*
IntegerLit   {Digit}+
FloatLit     {Digit}+\.{Digit}+
CharLit      '{AnyChar}'

Category     Definition
Operator     = | || | && | == | != | < | <= | > | + | - | * | / | ! | [ | ]
Separator    : | . | { | } | ( | )
Comment      // ({AnyChar} | {Whitespace})* {Eol}
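
One way to carry these definitions into Python is to build each pattern from named pieces, with f-string substitution playing the role of the {Z} references. This is a sketch under that assumption; the classify helper and the list order are illustrative choices, and only the literal and name categories are covered.

```python
import re

# Building blocks, standing in for {Letter}, {Digit}, {AnyChar}.
LETTER = r"[a-zA-Z]"
DIGIT = r"[0-9]"
ANYCHAR = r"[ -~]"

# Keyword is listed before Identifier, since every keyword
# also matches the Identifier pattern.
CATEGORIES = [
    ("Keyword", r"bool|char|else|false|float|if|int|main|true|while"),
    ("Identifier", rf"{LETTER}(?:{LETTER}|{DIGIT})*"),
    ("IntegerLit", rf"{DIGIT}+"),
    ("FloatLit", rf"{DIGIT}+\.{DIGIT}+"),
    ("CharLit", rf"'{ANYCHAR}'"),
]

def classify(lexeme):
    """Return the first category whose pattern matches the whole lexeme, else None."""
    for name, pattern in CATEGORIES:
        if re.fullmatch(pattern, lexeme):
            return name
    return None

for lexeme in ["while", "count", "10", "3.14", "'a'", "1x"]:
    print(lexeme, classify(lexeme))
```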

Implementation Using Python
Python's re package
https://docs.python.org/3/library/re.html

import re        # regex
re.split(...)    # Use regex argument to split a string into parts

Common string matching regex:
Symbol   Definition
\d       [0-9]
\D       [^0-9]
\w       [a-zA-Z0-9_]
\W       [^a-zA-Z0-9_]
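
A short sketch of re.split and the shorthand classes in use; the sample strings are made up for illustration. Note that for str patterns Python's \d and \w also match non-ASCII digits and letters unless the re.ASCII flag is passed, so the bracket definitions in the table describe the ASCII-only behavior.

```python
import re

line = "int count = 10;"

# Split on runs of whitespace.
parts = re.split(r"\s+", line)
print(parts)    # ['int', 'count', '=', '10;']

# \d+ finds maximal runs of digits.
digits = re.findall(r"\d+", "x1 = 25 + y3")
print(digits)   # ['1', '25', '3']

# \w+ finds maximal runs of letters, digits, and underscores.
words = re.findall(r"\w+", "int count_2 = 10;")
print(words)    # ['int', 'count_2', '10']

# With re.ASCII, \d is restricted to [0-9].
ascii_digit = re.compile(r"\d", re.ASCII)
```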

Describe the language:
1. 0(0|1)+0
2. ((ε|0)1*)*
3. 0*10*10*10*
4. (00|11)*

Write regular expression for:
1. All strings of lowercase letters, where letters appear in ascending order.
2. All strings of letters containing vowels in order.
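
One way to explore exercises like these is to list the short strings a regular expression accepts and look for the pattern. The helper below is a hypothetical aid, not part of the course material; it assumes a binary alphabet by default and treats the + in the first expression as "one or more".

```python
import re
from itertools import product

def language_sample(pattern, alphabet="01", max_len=4):
    """List every string over alphabet, up to max_len long, fully matched by pattern."""
    compiled = re.compile(pattern)
    sample = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if compiled.fullmatch(s):
                sample.append(s)
    return sample

print(language_sample(r"(00|11)*"))
print(language_sample(r"0*10*10*10*", max_len=3))
```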

Exam 1
Coming Thursday, Sept 28
Start of class (30 min)
Up to today's class

Finite State Automata (FSA)
BEHIND THE SCENE OF REGULAR EXPRESSIONS

Finite State Automata (FSA)
Σ: Input alphabet + unique end symbol ($)
Set of states
◦ Represented by nodes
◦ Unique start state
◦ One or more final states
State transition function
◦ Labelled (using alphabet) arcs in graph

Deterministic F.A. (DFA)
There is at most one outgoing arc from any state for any particular input symbol
◦ Easy to parse: does x belong to L_G?
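
Because a DFA has at most one arc per state and symbol, checking membership is a single pass over the input. The sketch below stores the transition function as a Python dict; the state names q0 and q1 and the example machine (binary strings ending in 1) are assumptions for illustration, and the end symbol $ is replaced by simply reaching the end of the string.

```python
def dfa_accepts(transitions, start, finals, text):
    """Run a DFA given as {(state, symbol): next_state}; a missing arc rejects."""
    state = start
    for symbol in text:
        state = transitions.get((state, symbol))
        if state is None:
            return False
    return state in finals

# Assumed example: q1 means "the last symbol read was 1".
ODD_BINARY = {
    ("q0", "0"): "q0",
    ("q0", "1"): "q1",
    ("q1", "0"): "q0",
    ("q1", "1"): "q1",
}

print(dfa_accepts(ODD_BINARY, "q0", {"q1"}, "101"))
print(dfa_accepts(ODD_BINARY, "q0", {"q1"}, "110"))
```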

Non-deterministic F.A. (NFA)
Allows multiple outgoing arcs from a state for the same input symbol
Allows transitions on empty string (ε)
◦ Easy to express a language
◦ But difficult to parse

Known algorithms
1. DFA → regular expression
2. Regular expression → NFA
3. NFA → DFA
Steps 2 and 3: language designer → implementation (parsing)
DFA → Regex → NFA → DFA
All 3 are equivalent!

Example
Odd binary number
Regex → NFA → DFA → Regex
(0|1)*1 → ? → ? → ?
Regex → NFA idea (more details on next slide):
◦ For |, symbols will be on the same arc
◦ For concatenation, create new state
◦ For *, use self-loop
NFA → DFA idea:
◦ Start with the NFA start state and tabulate all possible sets of NFA states that you can reach on 0 and 1 transitions.
◦ Each set of NFA states is a DFA state.
DFA → Regex: state elimination algorithm (more details soon)
◦ Nishimura handout: + means |

Regex → NFA
Scott, Programming Languages (2000)
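
The NFA → DFA idea can be sketched as a worklist over sets of NFA states. This is a minimal version that assumes an NFA without ε-transitions; the state names A and B and the two-state NFA for (0|1)*1 are assumptions for illustration (A loops on 0 and 1, and A goes to the final state B on 1).

```python
def nfa_to_dfa(nfa, start, finals, alphabet):
    """Subset construction for an epsilon-free NFA given as {(state, symbol): set_of_states}."""
    start_set = frozenset([start])
    dfa = {}
    seen = {start_set}
    worklist = [start_set]
    while worklist:
        current = worklist.pop()
        for symbol in alphabet:
            # Union of everything reachable from any NFA state in the current set.
            target = frozenset(t for s in current for t in nfa.get((s, symbol), ()))
            if not target:
                continue
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                worklist.append(target)
    # A DFA state is final if it contains at least one final NFA state.
    dfa_finals = {s for s in seen if s & finals}
    return dfa, start_set, dfa_finals

NFA = {("A", "0"): {"A"}, ("A", "1"): {"A", "B"}}
dfa, dfa_start, dfa_finals = nfa_to_dfa(NFA, "A", {"B"}, "01")
for (state, symbol), target in sorted(dfa.items(), key=lambda kv: (sorted(kv[0][0]), kv[0][1])):
    print(sorted(state), symbol, "->", sorted(target))
```

For this NFA the construction yields two DFA states, {A} and {A, B}, with {A, B} as the only final state.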

NFA/DFA → Regex
State elimination algorithm
How to preserve all paths after deleting a node?
For each node to be deleted:
◦ Match each incoming arc with every outgoing arc

Class Participation 3
Do the following for binary numbers with an even number of 0s:
Regular expression → NFA → DFA → Regular expression.
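
The pairing of incoming and outgoing arcs can be sketched in Python as follows. This is one possible rendering of state elimination, not the handout's presentation: it assumes fresh states named START and FINAL joined by ε-arcs (empty regexes), labels arcs with Python regex strings, and is run on the odd-binary-number DFA rather than the class exercise. The output is correct but heavily parenthesized.

```python
import re

def eliminate_states(transitions, start, finals):
    """Return a Python regex equivalent to the automaton {(state, symbol): next_state}."""
    arcs = {}

    def add(p, r, regex):
        # Parallel arcs are merged with | (the handout writes this as +).
        arcs[(p, r)] = f"{arcs[(p, r)]}|{regex}" if (p, r) in arcs else regex

    for (p, symbol), r in transitions.items():
        add(p, r, re.escape(symbol))
    add("START", start, "")            # epsilon arc into the old start state
    for f in finals:
        add(f, "FINAL", "")            # epsilon arcs out of the old final states

    states = {p for p, _ in transitions} | set(transitions.values())
    for q in states:
        loop = f"(?:{arcs[(q, q)]})*" if (q, q) in arcs else ""
        incoming = [(p, rx) for (p, r), rx in arcs.items() if r == q and p != q]
        outgoing = [(r, rx) for (p, r), rx in arcs.items() if p == q and r != q]
        # Match each incoming arc with every outgoing arc.
        for p, rin in incoming:
            for r, rout in outgoing:
                add(p, r, f"(?:{rin}){loop}(?:{rout})")
        for key in [k for k in arcs if q in k]:
            del arcs[key]
    return arcs.get(("START", "FINAL"))

ODD_BINARY = {("q0", "0"): "q0", ("q0", "1"): "q1", ("q1", "0"): "q0", ("q1", "1"): "q1"}
regex = eliminate_states(ODD_BINARY, "q0", {"q1"})
print(regex)
```

The exact string printed depends on the order in which states are eliminated, but every order gives a regex for the same language.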