Finite State Automata for Regular Expression From Regular Expression to Generated Lexer Final Remarks Compiling Techniques Lecture 4: Automatic Lexer Generation (EaC § 2.4) Christophe Dubach 27 September 2016 Christophe Dubach Compiling Techniques
Table of contents
1 Finite State Automata for Regular Expression
    Finite State Automata
    Non-determinism
2 From Regular Expression to Generated Lexer
    Regular Expression to NFA
    From NFA to DFA
3 Final Remarks

Automatic Lexer Generation
[Figure: compiler pipeline. Source code → Scanner → char → Tokeniser → token → Parser → AST → Semantic Analyser → AST → IR Generator → IR. The Scanner and Tokeniser together form the Lexer; the figure also shows an Errors output.]
Starting from a collection of regular expressions (RE) we automatically generate a Lexer.
We use finite state automata (FSA) for the construction.

Definition: finite state automaton
A finite state automaton is defined by:
  S, a finite set of states
  Σ, an alphabet, or character set used by the recogniser
  δ(s, c), a transition function (takes a state and a character and returns a new state)
  s0, the initial or start state
  S_F, a set of final states (a stream of characters is accepted iff the automaton ends up in a final state)

Finite State Automata for Regular Expression
Example: register names
  register ::= 'r' ('0' | '1' | ... | '9') ('0' | '1' | ... | '9')*
The RE (Regular Expression) corresponds to a recogniser (or finite state automaton):
  s0 --'r'--> s1 --('0' | '1' | ... | '9')--> s2
  s2 --('0' | '1' | ... | '9')--> s2   (s2 is the final state)

Finite State Automata (FSA) operation:
  Start in state s0 and take transitions on each input character
  The FSA accepts a word x iff x leaves it in a final state (s2)
Examples:
  r17 takes it through s0, s1, s2 and accepts
  r takes it through s0, s1 and fails
  a starts in s0 and leads straight to failure

Table encoding and skeleton code
To be useful a recogniser must be turned into code.
Skeleton recogniser:
  c = next character
  state = s0
  while (c ≠ EOF)
    state = δ(state, c)
    c = next character
  if (state final)
    return success
  else
    return error
Table encoding RE:
  δ     'r'     '0' | '1' | ... | '9'    others
  s0    s1      error                    error
  s1    error   s2                       error
  s2    error   s2                       error

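The skeleton recogniser and its table can be sketched in Python roughly as follows. This is a minimal illustration, not the lecture's own code: the names `DELTA`, `char_class` and `recognise` are made up for this sketch, the explicit `error` sink state is an assumption added so that every table lookup succeeds, and the end of the string plays the role of EOF.

```python
# Table-driven recogniser for: register ::= 'r' digit digit*
# State names follow the slides; ERROR is an extra sink state (assumption of this sketch).
S0, S1, S2, ERROR = "s0", "s1", "s2", "error"
FINAL_STATES = {S2}


def char_class(c):
    """Map a character to a column of the transition table."""
    if c == "r":
        return "r"
    if "0" <= c <= "9":
        return "digit"
    return "other"


# DELTA[state][column] -> next state (same content as the table encoding)
DELTA = {
    S0:    {"r": S1,    "digit": ERROR, "other": ERROR},
    S1:    {"r": ERROR, "digit": S2,    "other": ERROR},
    S2:    {"r": ERROR, "digit": S2,    "other": ERROR},
    ERROR: {"r": ERROR, "digit": ERROR, "other": ERROR},
}


def recognise(word):
    """Return True iff the automaton is in a final state after reading word."""
    state = S0
    for c in word:  # the end of the string stands in for EOF
        state = DELTA[state][char_class(c)]
    return state in FINAL_STATES
```

With this table, `recognise("r17")` returns `True`, while `recognise("r")` and `recognise("a")` return `False`, matching the three examples given for the register automaton.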
Deterministic Finite Automaton
Each RE corresponds to a Deterministic Finite Automaton (DFA). However, it might be hard to construct directly.
What about an RE such as (a | b)* abb?
  s0 --ε--> s1 --a--> s2 --b--> s3 --b--> s4
  s1 --(a | b)--> s1
This is a little different:
  s0 has a transition on ε, which can be followed without consuming an input character
  s1 has two transitions on a
This is a Non-deterministic Finite Automaton (NFA).

Non-deterministic vs deterministic finite automata
Deterministic finite state automata (DFA):
  All edges leaving the same node have distinct labels
  There is no ε transition
Non-deterministic finite state automata (NFA):
  Can have multiple edges with the same label leaving from the same node
  Can have ε transition
  This means we might have to backtrack

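To make the backtracking concrete, here is a small Python sketch that runs the NFA for (a | b)* abb by trying each possible transition in turn and undoing the choice when it leads to a dead end. The dictionary encoding and the function name `accepts` are assumptions of this sketch. It relies on this particular NFA having no cycle of ε-transitions; a general version would need a guard against looping on ε.

```python
# NFA for (a|b)*abb. Keys are (state, label); the label "" stands for epsilon.
NFA = {
    ("s0", ""):  ["s1"],
    ("s1", "a"): ["s1", "s2"],  # two transitions on 'a': the non-deterministic choice
    ("s1", "b"): ["s1"],
    ("s2", "b"): ["s3"],
    ("s3", "b"): ["s4"],
}
FINAL = {"s4"}


def accepts(word, state="s0", pos=0):
    """Depth-first search over the NFA's choices, backtracking on dead ends."""
    if pos == len(word) and state in FINAL:
        return True
    # Epsilon transitions do not consume input, so pos is unchanged.
    for nxt in NFA.get((state, ""), []):
        if accepts(word, nxt, pos):
            return True
    # Try every transition labelled with the next input character.
    if pos < len(word):
        for nxt in NFA.get((state, word[pos]), []):
            if accepts(word, nxt, pos + 1):
                return True
    return False  # every choice failed here: the caller tries its next option
```

On the input `abb` the search first follows the loop on s1 for the leading `a`, fails at the end of the input, and only then backtracks to take the edge from s1 to s2. This repeated work is what the DFA construction avoids.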
Automatic Lexer Generation
It is possible to systematically generate a lexer for any regular expression.
This can be done in three steps:
  1 regular expression (RE) → non-deterministic finite automata (NFA)
  2 NFA → deterministic finite automata (DFA)
  3 DFA → generated lexer

1st step: RE → NFA (Ken Thompson, CACM, 1968)
[Figure: NFA templates for each RE construct, with sub-automata joined by ε-transitions: a single character "x", concatenation M N, closure M*, option [M], alternation M | N, and positive closure M+.]

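A Thompson-style construction can be sketched in Python as below. This is an illustration under stated assumptions, not the lecture's implementation: regular expressions are given as nested tuples such as `("cat", M, N)`, the names `NFA`, `build`, `closure` and `matches` are invented here, and the exact ε-wiring (in particular for the option case) is one common choice that may differ from the lecture's figure. The `matches` helper tracks the set of current states only so the result can be checked.

```python
from collections import defaultdict
from itertools import count

EPS = None  # label used for epsilon transitions in this sketch


class NFA:
    def __init__(self):
        self.edges = defaultdict(list)  # state -> [(label, target), ...]
        self._ids = count()

    def new_state(self):
        return next(self._ids)

    def add(self, src, label, dst):
        self.edges[src].append((label, dst))


def build(nfa, re):
    """Add the fragment for re to nfa and return its (start, accept) states."""
    op = re[0]
    if op == "char":                      # single character x
        s, f = nfa.new_state(), nfa.new_state()
        nfa.add(s, re[1], f)
        return s, f
    if op == "cat":                       # M N
        s1, f1 = build(nfa, re[1])
        s2, f2 = build(nfa, re[2])
        nfa.add(f1, EPS, s2)
        return s1, f2
    if op == "alt":                       # M | N
        s = nfa.new_state()
        frags = [build(nfa, sub) for sub in re[1:]]
        f = nfa.new_state()
        for si, fi in frags:
            nfa.add(s, EPS, si)
            nfa.add(fi, EPS, f)
        return s, f
    if op in ("star", "plus", "opt"):     # M*, M+, [M]
        s = nfa.new_state()
        si, fi = build(nfa, re[1])
        f = nfa.new_state()
        nfa.add(s, EPS, si)
        nfa.add(fi, EPS, f)
        if op in ("star", "plus"):
            nfa.add(fi, EPS, si)          # loop back to repeat M
        if op in ("star", "opt"):
            nfa.add(s, EPS, f)            # bypass: M may be skipped
        return s, f
    raise ValueError(f"unknown operator: {op!r}")


def closure(nfa, states):
    """All states reachable from states through epsilon transitions only."""
    seen, stack = set(states), list(states)
    while stack:
        for label, t in nfa.edges[stack.pop()]:
            if label is EPS and t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def matches(re, word):
    """Build the NFA for re and run it on word by tracking the set of current states."""
    nfa = NFA()
    start, accept = build(nfa, re)
    current = closure(nfa, {start})
    for c in word:
        moved = {t for s in current for label, t in nfa.edges[s] if label == c}
        current = closure(nfa, moved)
    return accept in current


# a(b|c)* written as a nested tuple
A_BC_STAR = ("cat", ("char", "a"), ("star", ("alt", ("char", "b"), ("char", "c"))))
```

For `A_BC_STAR` this sketch allocates ten states, numbered 0 to 9, with one `a` edge, one `b` edge, one `c` edge and nine ε edges.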
Example: a (b | c)*
  s0 --a--> s1 --ε--> s2 --ε--> s3
  s3 --ε--> s4 --b--> s5 --ε--> s8
  s3 --ε--> s6 --c--> s7 --ε--> s8
  s8 --ε--> s3
  s8 --ε--> s9
  s2 --ε--> s9   (s9 is the final state)
A human would do:
  s0 --a--> s1
  s1 --(b | c)--> s1

Step 2: NFA → DFA
Executing a non-deterministic finite automaton requires backtracking, which is inefficient. To overcome this, we need to construct a DFA from the NFA.
The main idea:
  We build a DFA which has one state for each set of states the NFA could end up in.
  A set of states is final in the DFA if it contains the final state from the NFA.
  Since the number of states in the NFA is finite (n), the number of possible sets of states is also finite (maximum 2^n).

Assume the states of the NFA are labelled s_i and the states of the DFA we are building are labelled q_i.
We have two key functions:
  reachable(s_i, α) returns the set of states reachable from s_i by consuming character α
  ε-closure(s_i) returns the set of states reachable from s_i by ε (i.e., without consuming a character)

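The two functions can be sketched in Python on the NFA for a (b | c)* from the lecture's example. The dictionary encoding and the names `CHAR_EDGES`, `EPSILON_EDGES`, `reachable` and `epsilon_closure` are assumptions of this sketch. Both functions take a set of NFA states rather than a single state, since the subset construction applies them to DFA states, which are sets of NFA states.

```python
# NFA for a(b|c)*, with state numbers as in the lecture's example (s0 .. s9).
CHAR_EDGES = {(0, "a"): {1}, (4, "b"): {5}, (6, "c"): {7}}
EPSILON_EDGES = {1: {2}, 2: {3, 9}, 3: {4, 6}, 5: {8}, 7: {8}, 8: {3, 9}}


def reachable(states, alpha):
    """NFA states reachable from any state in states by consuming alpha."""
    result = set()
    for s in states:
        result |= CHAR_EDGES.get((s, alpha), set())
    return result


def epsilon_closure(states):
    """The given states plus everything reachable from them by epsilon transitions."""
    closure = set(states)
    worklist = list(states)
    while worklist:
        s = worklist.pop()
        for t in EPSILON_EDGES.get(s, set()):
            if t not in closure:
                closure.add(t)
                worklist.append(t)
    return closure
```

For instance, `epsilon_closure(reachable({0}, "a"))` gives `{1, 2, 3, 4, 6, 9}`, which is the set of NFA states listed for q1 in the NFA → DFA table.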
The Subset Construction algorithm (Fixed point iteration)
  q0 = ε-closure(s0); Q = {q0}; add q0 to WorkList
  while (WorkList not empty)
    remove q from WorkList
    for each α ∈ Σ
      subset = ε-closure(reachable(q, α))
      δ(q, α) = subset
      if (subset ∉ Q) then
        add subset to Q and to WorkList
The algorithm (in English)
  Start from start state s0 of the NFA, compute its ε-closure
  Build subset from all states reachable from q0 for character α
  Add this subset to the transition table/function δ
  If the subset has not been seen before, add it to the worklist
  Iterate until no new subsets are created

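The worklist algorithm can be sketched in Python as follows, applied to the NFA for a (b | c)* from the lecture's example. The function name, argument layout and dictionary encoding are assumptions of this sketch. One deliberate deviation from the pseudocode: an empty subset is treated as "no transition" (an error) and is not added to Q, which corresponds to the "none" entries in the NFA → DFA table.

```python
from collections import deque


def subset_construction(start, finals, alphabet, char_edges, eps_edges):
    """Return (Q, delta, dfa_finals); DFA state q_i is the index i into the list Q."""

    def closure(states):
        result, stack = set(states), list(states)
        while stack:
            for t in eps_edges.get(stack.pop(), ()):
                if t not in result:
                    result.add(t)
                    stack.append(t)
        return frozenset(result)

    def reachable(states, alpha):
        return {t for s in states for t in char_edges.get((s, alpha), ())}

    q0 = closure({start})
    Q = [q0]                      # discovery order gives the numbering q0, q1, ...
    delta = {}                    # (DFA state index, character) -> DFA state index
    worklist = deque([q0])
    while worklist:
        q = worklist.popleft()
        for alpha in alphabet:
            subset = closure(reachable(q, alpha))
            if not subset:
                continue          # "none": no transition on alpha, i.e. error
            if subset not in Q:   # test before adding, so Q has no duplicates
                Q.append(subset)
                worklist.append(subset)
            delta[(Q.index(q), alpha)] = Q.index(subset)
    # A DFA state is final if it contains a final state of the NFA.
    dfa_finals = {i for i, q in enumerate(Q) if q & set(finals)}
    return Q, delta, dfa_finals


# NFA for a(b|c)* with state numbers s0 .. s9; s9 is the final state.
CHAR_EDGES = {(0, "a"): [1], (4, "b"): [5], (6, "c"): [7]}
EPS_EDGES = {1: [2], 2: [3, 9], 3: [4, 6], 5: [8], 7: [8], 8: [3, 9]}

Q, delta, dfa_finals = subset_construction(0, {9}, "abc", CHAR_EDGES, EPS_EDGES)
```

On this input the sketch discovers four DFA states, and `delta` contains the seven non-error transitions of the resulting DFA. Using a list for Q keeps the code short at the cost of linear-time membership tests; a dictionary from subset to index would be the usual choice for larger automata.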
Informal proof of termination
  Q contains no duplicates (test before adding)
  similarly, we will never add the same subset to the worklist twice
  bounded number of states; maximum 2^n subsets, where n is the number of states in the NFA
  ⇒ the loop halts
End result
  Q contains all the reachable sets of NFA states
  It tries each symbol in each q_i
  It builds every possible NFA configuration
  ⇒ Q and δ form the DFA

NFA → DFA for a (b | c)*
(NFA for a (b | c)* as in the earlier example, states s0 to s9)
                                          ε-closure(reachable(q, α))
  DFA state   NFA states                  a       b       c
  q0          s0                          q1      none    none
  q1          s1, s2, s3, s4, s6, s9      none    q2      q3
  q2          s5, s8, s9, s3, s4, s6      none    q2      q3
  q3          s7, s8, s9, s3, s4, s6      none    q2      q3

Resulting DFA for a (b | c)*
Graph:
  q0 --a--> q1
  q1 --b--> q2    q1 --c--> q3
  q2 --b--> q2    q2 --c--> q3
  q3 --b--> q2    q3 --c--> q3
Table encoding:
        a       b       c
  q0    q1      error   error
  q1    error   q2      q3
  q2    error   q2      q3
  q3    error   q2      q3
Smaller than the NFA
All transitions are deterministic (no need to backtrack!)
Could be even smaller (see EaC § 2.4.4 Hopcroft's Algorithm for minimal DFA)
Can generate the lexer using skeleton recogniser seen earlier
