CSE443 Compilers Dr. Carl Alphonce alphonce@buffalo.edu 343 Davis Hall
Announcements HW-01 posted PR-01 posted Team formation: what is current status?
Phases of a compiler: lexical structure (Figure 1.6, page 5 of text)
Bird's eye view
language: a set of strings, e.g. { for, while, x, factorial, … }
grammar: rules for generating a language, G = (N, ∑, P, S)
regular expression (regex): a form of grammar
finite automaton: a machine for a language
C program generated by FLEX
languages & grammars Formally, a grammar is defined by 4 items: 1. N, a set of non-terminals 2. ∑ , a set of terminals 3. P, a set of productions 4. S, a start symbol G = (N, ∑ , P, S)
languages & grammars
N, a set of non-terminals
∑, a set of terminals (alphabet), with N ∩ ∑ = {}
P, a set of productions, each of one of the (right linear) forms X -> a, X -> aY, X -> 𝜁, where X ∈ N, Y ∈ N, a ∈ ∑, and 𝜁 denotes the empty string
S, a start symbol, S ∈ N
Lexical Analysis Lexical structure described by regular grammar Deterministic finite state machine performs analysis
LANGUAGE operations base cases { 𝜁 } is a regular language ∀ a ∈ ∑ , { a } is a regular language Recall, 𝜁 is the empty string
LANGUAGE operations
If L and M are regular, so are:
L ∪ M = { s | s ∈ L or s ∈ M } (union)
LM = { st | s ∈ L and t ∈ M } (concatenation)
L* = ∪_{i=0}^{∞} L^i (Kleene closure)
L^i is L concatenated with itself i times: L^0 = { 𝜁 } by definition, L^1 = L, L^2 = LL, L^3 = LLL, etc.
L* is the union of all these sets!
Example of L*
Suppose L is {a, bb}
L^0 = { 𝜁 }, by definition
L^1 = L = {a, bb}
L^2 = LL = {aa, abb, bba, bbbb}
L^3 = LLL = {aaa, aabb, abba, abbbb, bbaa, bbabb, bbbba, bbbbbb}
L^4 = …and so on…
L* = ∪_{i=0}^{∞} L^i = { 𝜁, a, bb, aa, abb, bba, bbbb, aaa, aabb, abba, abbbb, bbaa, bbabb, bbbba, bbbbbb, … }
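The set operations in this example can be checked mechanically. The following is a minimal sketch, assuming Python sets of strings stand in for languages and the empty Python string stands in for 𝜁; the helper names concat, power, and kleene_upto are made up for illustration. Since L* is infinite, only a finite prefix of the union is computed.

```python
from itertools import product

def concat(L, M):
    """LM = { st | s in L and t in M }."""
    return {s + t for s, t in product(L, M)}

def power(L, i):
    """L^i: L concatenated with itself i times, with L^0 = { empty string }."""
    result = {""}              # "" plays the role of the empty string
    for _ in range(i):
        result = concat(result, L)
    return result

def kleene_upto(L, n):
    """Finite approximation of L*: the union of L^0 through L^n."""
    out = set()
    for i in range(n + 1):
        out |= power(L, i)
    return out

L = {"a", "bb"}
print(sorted(power(L, 2), key=lambda s: (len(s), s)))  # ['aa', 'abb', 'bba', 'bbbb']
print(len(power(L, 3)))                                # 8
```

Sets discard duplicates automatically, which is why L^3 has exactly 8 members here.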
REGular EXpression (regex): inductive definition
Given an alphabet ∑:
𝜁 is a regex, with 𝓜(𝜁) = { 𝜁 }
For each a ∈ ∑, a is a regex, with 𝓜(a) = {a}
Regular expressions (regex) Inductive definition Assume r and s are regexes. r|s is a regex denoting 𝓜 (r) ∪ 𝓜 (s) rs is a regex denoting 𝓜 (r) 𝓜 (s) r * is a regex denoting ( 𝓜 (r)) * (r) is a regex denoting 𝓜 (r) Precedence: Kleene closure > concatenation > union Associativity: all left-associative (minimize use of parentheses: (r|s)|t = r|s|t )
Algebraic laws Assume r, s and t are regexes. Commutativity r|s = s|r Associativity r|(s|t) = (r|s)|t and r(st) = (rs)t Distributivity r(s|t) = rs|rt and (s|t)r = sr|tr Identity 𝜁r = r𝜁 = r Idempotency r** = r*
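These laws can be spot-checked on small cases. The sketch below assumes Python's re module as a stand-in regex engine and compares the sets of strings matched in full, over the alphabet {a, b}, up to a length bound. The sample regexes r, s, t and the helpers language, same, and g are illustrative choices. A bounded check like this is evidence, not a proof.

```python
import re
from itertools import product

def language(pattern, alphabet="ab", max_len=4):
    """All strings over `alphabet` of length <= max_len matched in full."""
    compiled = re.compile(pattern)
    words = ("".join(p) for n in range(max_len + 1)
             for p in product(alphabet, repeat=n))
    return {w for w in words if compiled.fullmatch(w)}

def same(p, q):
    """True if p and q denote the same strings within the length bound."""
    return language(p) == language(q)

def g(x):
    """Wrap a sub-regex in a non-capturing group so precedence is explicit."""
    return f"(?:{x})"

r, s, t = "a", "bb", "ab*"   # arbitrary sample regexes (assumption)

checks = {
    "commutativity":  same(g(r) + "|" + g(s), g(s) + "|" + g(r)),
    "associativity":  same(g(r) + "|" + g(g(s) + "|" + g(t)),
                           g(g(r) + "|" + g(s)) + "|" + g(t)),
    "distributivity": same(g(r) + g(g(s) + "|" + g(t)),
                           g(r) + g(s) + "|" + g(r) + g(t)),
    "identity":       same(g("") + g(r), g(r)),      # empty group acts as the empty string
    "idempotency":    same(g(g(r) + "*") + "*", g(r) + "*"),
}
print(checks)
```

Note that concatenation is not commutative: same("ab", "ba") is False.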
We can describe a regular language using a regular expression
The language denoted by a regular expression can be recognized using a finite state machine. Machines: NFA (non-deterministic finite automaton), DFA (deterministic finite automaton)
Process of building lexical analyzer
1) spell out the language
2) formulate a regular expression
3) build an NFA
4) transform NFA to DFA
5) transform DFA to a minimal DFA
The minimal DFA is our lexical analyzer: character stream -> lexical analyzer -> token stream
Pipeline: language -> regex -> NFA -> DFA -> minimal DFA
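To make the character stream to token stream idea concrete, here is a minimal sketch of a scanner. It assumes Python's re module in place of the hand-built minimal DFA (in the course setting FLEX generates the equivalent C code), and the token class names KEYWORD, ID, NUM, and SKIP are hypothetical.

```python
import re

# Hypothetical token classes; order matters because earlier alternatives win.
TOKEN_SPEC = [
    ("KEYWORD", r"(?:for|while)\b"),
    ("ID",      r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUM",     r"[0-9]+"),
    ("SKIP",    r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokens(chars):
    """Turn a character stream (a string) into a stream of (class, lexeme) pairs."""
    pos = 0
    while pos < len(chars):
        m = MASTER.match(chars, pos)
        if m is None:
            raise ValueError(f"unexpected character {chars[pos]!r} at position {pos}")
        if m.lastgroup != "SKIP":        # whitespace produces no token
            yield (m.lastgroup, m.group())
        pos = m.end()

print(list(tokens("while x factorial 42")))
# [('KEYWORD', 'while'), ('ID', 'x'), ('ID', 'factorial'), ('NUM', '42')]
```

The \b after the keyword alternatives keeps an identifier such as forx from being split into a keyword followed by an identifier.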
Focus for today: regex -> NFA
Nondeterministic Finite Automata (NFA)
A finite set of states S
An alphabet ∑, 𝜁 ∉ ∑
𝛆 ⊆ S × (∑ ∪ { 𝜁 }) × 𝒬(S) (transition function)
s_0 ∈ S (a single start state)
F ⊆ S (a set of final or accepting states)
Deterministic Finite Automata (DFA)
A finite set of states S
An alphabet ∑, 𝜁 ∉ ∑
𝛆 ⊆ S × ∑ × S (transition function)
s_0 ∈ S (a single start state)
F ⊆ S (a set of final or accepting states)
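A DFA defined this way can be simulated with a table lookup per input character. The following is a minimal sketch, assuming the transition function is stored as a Python dict keyed by (state, symbol). The particular machine, which accepts strings over {a, b} ending in abb, is an illustrative example rather than one taken from the slides.

```python
def dfa_accepts(delta, start, finals, text):
    """Run a DFA: follow exactly one transition per input character."""
    state = start
    for ch in text:
        if (state, ch) not in delta:
            return False          # no transition defined: reject
        state = delta[(state, ch)]
    return state in finals

# Illustrative DFA over {a, b} for strings ending in "abb" (assumption).
DELTA = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}
START, FINALS = 0, {3}

print(dfa_accepts(DELTA, START, FINALS, "aabb"))  # True
print(dfa_accepts(DELTA, START, FINALS, "abba"))  # False
```

Because each (state, symbol) pair has at most one successor, the simulation needs no backtracking and no sets of states, which is what makes a DFA attractive as the final lexical analyzer.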
A state is drawn as a circle with its state number written inside.
The initial state has an arrow from nowhere pointing in. State 0 is often the initial state.
A final state is drawn with a double circle.
Arrows are labeled with 𝜁, or with a for each a ∈ ∑.
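Regex -> NFA
[Figures: one NFA fragment per regex form, each with start state 0 and final state 1. N(s) and N(t) are the NFAs for s and t.]
𝜁: a single arrow labeled 𝜁 from 0 to 1
a, for each a ∈ ∑: a single arrow labeled a from 0 to 1
s|t: N(s) and N(t) side by side, connected to 0 and 1 by 𝜁-arrows
st: N(s) followed by N(t), connected with 𝜁-arrows
s*: N(s) between 0 and 1, connected with 𝜁-arrows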
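The construction can be sketched as a set of functions, one per regex form, each returning an NFA fragment. This is a minimal sketch under stated assumptions: a fragment is a (start, final, edges) triple, None labels a 𝜁-transition, and concatenation links the two fragments with a single 𝜁-arrow, which may not match the state count of the slide figures. All function names are made up for illustration.

```python
from itertools import count

_ids = count()      # source of fresh state numbers
EPS = None          # label used for empty-string transitions in this sketch

def symbol(a):
    """Fragment for a single symbol a (or for the empty string when a is EPS)."""
    s, f = next(_ids), next(_ids)
    return s, f, [(s, a, f)]

def union(n1, n2):
    """Fragment for s|t: new start/final states wired around both fragments."""
    (s1, f1, e1), (s2, f2, e2) = n1, n2
    s, f = next(_ids), next(_ids)
    return s, f, e1 + e2 + [(s, EPS, s1), (s, EPS, s2), (f1, EPS, f), (f2, EPS, f)]

def concat(n1, n2):
    """Fragment for st: the final state of s leads to the start state of t."""
    (s1, f1, e1), (s2, f2, e2) = n1, n2
    return s1, f2, e1 + e2 + [(f1, EPS, s2)]

def star(n):
    """Fragment for s*: allow skipping N(s) entirely or looping back through it."""
    s1, f1, e1 = n
    s, f = next(_ids), next(_ids)
    return s, f, e1 + [(s, EPS, s1), (f1, EPS, f), (f1, EPS, s1), (s, EPS, f)]

def accepts(nfa, text):
    """Simulate the NFA by tracking the set of states it could be in."""
    start, final, edges = nfa

    def closure(states):
        seen, stack = set(states), list(states)
        while stack:
            q = stack.pop()
            for p, label, r in edges:
                if p == q and label is EPS and r not in seen:
                    seen.add(r)
                    stack.append(r)
        return seen

    current = closure({start})
    for ch in text:
        current = closure({r for p, label, r in edges if p in current and label == ch})
    return final in current

# (a|b)*abb assembled bottom-up
nfa = concat(concat(concat(star(union(symbol("a"), symbol("b"))),
                           symbol("a")), symbol("b")), symbol("b"))
print(accepts(nfa, "babb"))  # True
print(accepts(nfa, "ab"))    # False
```

Each function adds at most two new states, so the size of the resulting NFA grows linearly with the size of the regex.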
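Simple example: static
[Figure: NFA for static, a chain of states 0 to 6 with arrows labeled s, t, a, t, i, c]
Simple example: static and struct
[Figure: NFA combining the chain for static (states 0 to 6) with a chain for struct (states 7 to 13, arrows labeled s, t, r, u, c, t), joined by 𝜁-arrows]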
Process of building lexical analyzer: the minimal DFA is our lexical analyzer (character stream -> lexical analyzer -> token stream). Pipeline: language -> regex -> NFA -> DFA -> minimal DFA
Focus above: build a non-deterministic recognizer (regex -> NFA)
Next step: make recognizer deterministic (NFA -> DFA)
(a|b)*abb
First we construct an NFA from this regular expression.
[Figure sequence: the NFA is assembled step by step, starting with the arrows labeled a and b, then the 𝜁-arrows for the union a|b and for the Kleene closure (a|b)*, then the arrows for the trailing a, b, b. The finished NFA has states numbered 0 to 10.]
Operations
𝜁-closure(t) is the set of states reachable from state t using only 𝜁-transitions.
𝜁-closure(T) is the set of states reachable from any state t ∈ T using only 𝜁-transitions.
move(T,a) is the set of states reachable from any state t ∈ T following a transition on symbol a ∈ ∑.
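Both operations are short graph traversals. The sketch below assumes a transition table for the (a|b)*abb NFA with states 0 to 10. The exact arrows are an assumption based on the usual textbook numbering for this example, since the slide figure itself is not reproduced here. The empty Python string stands in for 𝜁.

```python
EPS = ""   # stands in for the empty-string label

# Assumed transitions for the (a|b)*abb NFA (textbook-style numbering).
NFA = {
    (0, EPS): {1, 7},
    (1, EPS): {2, 4},
    (2, "a"): {3},
    (3, EPS): {6},
    (4, "b"): {5},
    (5, EPS): {6},
    (6, EPS): {1, 7},
    (7, "a"): {8},
    (8, "b"): {9},
    (9, "b"): {10},
}

def eps_closure(T):
    """States reachable from any state in T using only empty-string transitions."""
    closure, stack = set(T), list(T)   # every state reaches itself in zero steps
    while stack:
        t = stack.pop()
        for u in NFA.get((t, EPS), ()):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return frozenset(closure)

def move(T, a):
    """States reachable from any state in T by one transition on symbol a."""
    return frozenset(u for t in T for u in NFA.get((t, a), ()))

A = eps_closure({0})
print(sorted(A))                             # [0, 1, 2, 4, 7]
print(sorted(move(A, "a")))                  # [3, 8]
print(sorted(eps_closure(move(A, "a"))))     # [1, 2, 3, 4, 6, 7, 8]
```

Returning frozenset values is a deliberate choice: sets of NFA states become DFA states in the next step, so they need to be hashable.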
NFA -> DFA algorithm (set of states construction - page 153 of text)
INPUT: An NFA N = (S, ∑, 𝛆, s_0, F)
OUTPUT: A DFA D = (S', ∑, 𝛆', s_0', F') such that ℒ(D) = ℒ(N)
ALGORITHM:
  Compute s_0' = 𝜁-closure(s_0), an unmarked set of states
  Set S' = { s_0' }
  while there is an unmarked T ∈ S'
    mark T
    for each symbol a ∈ ∑
      let U = 𝜁-closure(move(T,a))
      if U ∉ S', add unmarked U to S'
      add transition: 𝛆'(T,a) = U
  F' is the set of all members of S' that contain at least one state in F.
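The algorithm above can be sketched almost line for line. The sketch assumes the (a|b)*abb NFA with the usual textbook state numbering 0 to 10 (an assumption, as the figure is not reproduced here), uses the empty Python string for 𝜁, and represents the unmarked sets as a worklist. The names subset_construction and dfa_accepts are illustrative.

```python
EPS = ""   # stands in for the empty-string label
NFA = {    # assumed transitions for the (a|b)*abb NFA
    (0, EPS): {1, 7}, (1, EPS): {2, 4}, (2, "a"): {3}, (3, EPS): {6},
    (4, "b"): {5}, (5, EPS): {6}, (6, EPS): {1, 7},
    (7, "a"): {8}, (8, "b"): {9}, (9, "b"): {10},
}
START, FINALS, SIGMA = 0, {10}, "ab"

def eps_closure(T):
    closure, stack = set(T), list(T)
    while stack:
        t = stack.pop()
        for u in NFA.get((t, EPS), ()):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return frozenset(closure)

def move(T, a):
    return frozenset(u for t in T for u in NFA.get((t, a), ()))

def subset_construction():
    start = eps_closure({START})
    states, unmarked, delta = {start}, [start], {}
    while unmarked:
        T = unmarked.pop()                 # taking T off the worklist marks it
        for a in SIGMA:
            U = eps_closure(move(T, a))
            if U not in states:            # a new DFA state, initially unmarked
                states.add(U)
                unmarked.append(U)
            delta[(T, a)] = U
    finals = {T for T in states if T & FINALS}   # contains some NFA final state
    return states, delta, start, finals

def dfa_accepts(text):
    states, delta, state, finals = subset_construction()
    for ch in text:
        state = delta[(state, ch)]
    return state in finals

states, delta, start, finals = subset_construction()
print(len(states))          # 5
print(dfa_accepts("babb"))  # True
```

For this NFA every move result is non-empty. For other NFAs the empty set can show up as a DFA state, where it acts as a dead state that loops to itself on every symbol.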