Compiler Construction
Lecture 3: Lexical Analysis II (Extended Matching Problem)
Thomas Noll
Lehrstuhl für Informatik 2 (Software Modeling and Verification)
noll@cs.rwth-aachen.de
http://moves.rwth-aachen.de/teaching/ss-14/cc14/
Summer Semester 2014
Outline
  1. Recap: Lexical Analysis
  2. Complexity Analysis of Simple Matching
  3. The Extended Matching Problem
  4. First-Longest-Match Analysis
  5. Implementation of FLM Analysis
Lexical Analysis

Definition
The goal of lexical analysis is to decompose a source program into a sequence of lexemes and to transform these into a sequence of symbols.

The corresponding program is called a scanner (or lexer):

[Figure: Source program → Scanner ⇄ Parser, with the labels "(token, [attribute])" and "get next token", plus a symbol table]

Example:
  ... ␣x1␣:=y2+␣1␣;␣ ...
  ⇓
  ... (id, p1)(gets, )(id, p2)(plus, )(int, 1)(sem, ) ...
The DFA Method I

Known from Formal Systems, Automata and Processes:

Algorithm (DFA method)
Input: regular expression α ∈ RE_Ω, input string w ∈ Ω*
Procedure:
  1. using Kleene's Theorem, construct A_α ∈ NFA_Ω such that L(A_α) = ⟦α⟧
  2. apply the powerset construction (cf. Definition 2.12) to obtain A'_α = ⟨Q', Ω, δ', q'_0, F'⟩ ∈ DFA_Ω with L(A'_α) = L(A_α) = ⟦α⟧
  3. solve the matching problem by deciding whether δ'*(q'_0, w) ∈ F'
Output: "yes" or "no"
The DFA Method II

The powerset construction involves the following concept:

Definition (ε-closure)
Let A = ⟨Q, Ω, δ, q_0, F⟩ ∈ NFA_Ω. The ε-closure ε(T) ⊆ Q of a subset T ⊆ Q is the least set satisfying
  - T ⊆ ε(T) and
  - if q ∈ ε(T), then δ(q, ε) ⊆ ε(T)

Definition (Powerset construction)
Let A = ⟨Q, Ω, δ, q_0, F⟩ ∈ NFA_Ω. The powerset automaton A' = ⟨Q', Ω, δ', q'_0, F'⟩ ∈ DFA_Ω is defined by
  - Q' := 2^Q
  - ∀ T ⊆ Q, a ∈ Ω: δ'(T, a) := ε(⋃_{q ∈ T} δ(q, a))
  - q'_0 := ε({q_0})
  - F' := {T ⊆ Q | T ∩ F ≠ ∅}
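The two definitions above can be sketched in Python as follows. This is only an illustration: the dictionary encoding of δ, the empty string as the ε label, and the names `eps_closure` and `powerset_step` are assumptions made here, not part of the lecture.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

EPS = ""  # assumed encoding of the ε label (not from the slides)

# NFA transition relation: (state, symbol) -> set of successor states
Delta = Dict[Tuple[int, str], Set[int]]


def eps_closure(delta: Delta, states: Iterable[int]) -> FrozenSet[int]:
    """Least superset of `states` that is closed under ε-transitions."""
    closure = set(states)
    worklist = list(closure)
    while worklist:
        q = worklist.pop()
        for r in delta.get((q, EPS), set()):
            if r not in closure:
                closure.add(r)
                worklist.append(r)
    return frozenset(closure)


def powerset_step(delta: Delta, T: FrozenSet[int], a: str) -> FrozenSet[int]:
    """One transition of the powerset automaton: ε(⋃_{q ∈ T} δ(q, a))."""
    successors: Set[int] = set()
    for q in T:
        successors |= delta.get((q, a), set())
    return eps_closure(delta, successors)


# Hypothetical NFA for a⁺: 0 -a-> 1, 1 -ε-> 0, final state 1
delta: Delta = {(0, "a"): {1}, (1, EPS): {0}}
q0_prime = eps_closure(delta, {0})           # initial state of the powerset automaton
after_a = powerset_step(delta, q0_prime, "a")
```

Only the subsets actually reached from `q0_prime` are ever computed here, whereas the definition takes all of 2^Q as the state set.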
Complexity of DFA Method

  1. In the construction phase:
     - Kleene method: time and space O(|α|) (where |α| := length of α)
     - Powerset construction: time and space O(2^|A_α|) = O(2^|α|) (where |A_α| := number of states of A_α)
  2. At runtime:
     - Word problem: time O(|w|) (where |w| := length of w), space O(1) (but O(2^|α|) for storing the DFA)

⇒ nice runtime behavior, but memory requirements are very high (and the construction phase takes exponential time)
The NFA Method

Idea: reduce memory requirements by applying the powerset construction at runtime, i.e., only "to the run of w through A_α"

Algorithm 3.1 (NFA method)
Input: automaton A_α = ⟨Q, Ω, δ, q_0, F⟩ ∈ NFA_Ω, input string w ∈ Ω*
Variables: T ⊆ Q, a ∈ Ω
Procedure:
  T := ε({q_0});
  while w ≠ ε do
    a := head(w);
    T := ε(⋃_{q ∈ T} δ(q, a));
    w := tail(w)
  od
Output: if T ∩ F ≠ ∅ then "yes" else "no"
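A minimal Python sketch of Algorithm 3.1, under the assumption that δ is stored as a dictionary from (state, symbol) pairs to state sets and that the empty string stands for ε. The function names and the sample automaton are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

EPS = ""  # assumed encoding of the ε label (not from the slides)
Delta = Dict[Tuple[int, str], Set[int]]


def eps_closure(delta: Delta, states: Iterable[int]) -> FrozenSet[int]:
    """Least superset of `states` that is closed under ε-transitions."""
    closure = set(states)
    worklist = list(closure)
    while worklist:
        q = worklist.pop()
        for r in delta.get((q, EPS), set()):
            if r not in closure:
                closure.add(r)
                worklist.append(r)
    return frozenset(closure)


def nfa_accepts(delta: Delta, q0: int, finals: Set[int], w: str) -> bool:
    """NFA method: track the set T of currently reachable states."""
    T = eps_closure(delta, {q0})              # T := ε({q_0})
    for a in w:                               # a := head(w); w := tail(w)
        successors: Set[int] = set()
        for q in T:                           # |T| states are inspected per symbol
            successors |= delta.get((q, a), set())
        T = eps_closure(delta, successors)    # T := ε(⋃_{q ∈ T} δ(q, a))
    return bool(T & finals)                   # "yes" iff T ∩ F ≠ ∅


# Hypothetical NFA for (a|b)*ab with an extra ε-step from the start state 3
delta: Delta = {
    (3, EPS): {0},
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}
print(nfa_accepts(delta, 3, {2}, "aab"))  # True
print(nfa_accepts(delta, 3, {2}, "aba"))  # False
```

Only the current set T is kept in memory, which is the space saving the slide refers to.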
Complexity Analysis

For the NFA method at runtime:
  - Space: O(|α|) (for storing the NFA and T)
  - Time: O(|α| · |w|) (in the loop's body, |T| states need to be considered)

Comparison:
  Method    Space        Time (for "w ∈ ⟦α⟧?")
  DFA       O(2^|α|)     O(|w|)
  NFA       O(|α|)       O(|α| · |w|)

⇒ trades exponential space for an increase in time

In practice:
  - The exponential blowup of the DFA method usually does not occur in "real" applications (⇒ used in [f]lex)
  - Improvement of the NFA method: caching of transitions δ'(T, a) ⇒ combination of both methods
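The caching idea can be sketched as follows: each transition δ'(T, a) is computed on demand and stored, so that a repeated (T, a) pair costs a single dictionary lookup. The class `CachingMatcher` and its layout are assumptions for illustration and do not describe how any particular tool implements this.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

EPS = ""  # assumed encoding of the ε label (not from the slides)
Delta = Dict[Tuple[int, str], Set[int]]


class CachingMatcher:
    """NFA method with memoised powerset transitions δ'(T, a)."""

    def __init__(self, delta: Delta, q0: int, finals: Iterable[int]) -> None:
        self.delta = delta
        self.finals = frozenset(finals)
        self.start = self._closure({q0})
        # Cache of already computed powerset transitions
        self.cache: Dict[Tuple[FrozenSet[int], str], FrozenSet[int]] = {}

    def _closure(self, states: Iterable[int]) -> FrozenSet[int]:
        closure = set(states)
        worklist = list(closure)
        while worklist:
            q = worklist.pop()
            for r in self.delta.get((q, EPS), set()):
                if r not in closure:
                    closure.add(r)
                    worklist.append(r)
        return frozenset(closure)

    def _step(self, T: FrozenSet[int], a: str) -> FrozenSet[int]:
        key = (T, a)
        if key not in self.cache:             # compute δ'(T, a) only once
            successors: Set[int] = set()
            for q in T:
                successors |= self.delta.get((q, a), set())
            self.cache[key] = self._closure(successors)
        return self.cache[key]

    def accepts(self, w: str) -> bool:
        T = self.start
        for a in w:
            T = self._step(T, a)
        return bool(T & self.finals)


# Hypothetical NFA for (a|b)*ab
matcher = CachingMatcher({(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}, 0, {2})
print(matcher.accepts("aab"))  # True
```

The cache grows only with the subsets that actual inputs reach, so the worst-case exponential table is built only if the inputs really exercise it.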
The Extended Matching Problem I

Definition 3.2
Let n ≥ 1 and α_1, ..., α_n ∈ RE_Ω with ε ∉ ⟦α_i⟧ ≠ ∅ for every i ∈ [n] (= {1, ..., n}). Let Σ := {T_1, ..., T_n} be an alphabet of corresponding tokens and w ∈ Ω⁺. If w_1, ..., w_k ∈ Ω⁺ such that
  - w = w_1 ... w_k and
  - for every j ∈ [k] there exists i_j ∈ [n] such that w_j ∈ ⟦α_{i_j}⟧,
then
  - (w_1, ..., w_k) is called a decomposition and
  - (T_{i_1}, ..., T_{i_k}) is called an analysis
of w w.r.t. α_1, ..., α_n.

Problem 3.3 (Extended matching problem)
Given α_1, ..., α_n ∈ RE_Ω and w ∈ Ω⁺, decide whether there exists a decomposition of w w.r.t. α_1, ..., α_n and determine a corresponding analysis.
The Extended Matching Problem II

Observation: neither the decomposition nor the analysis is uniquely determined

Example 3.4
  1. α = a⁺, w = aa
     ⇒ two decompositions (aa) and (a, a) with respective (unique) analyses (T_1) and (T_1, T_1)
  2. α_1 = a | b, α_2 = a | c, w = a
     ⇒ unique decomposition (a) but two analyses (T_1) and (T_2)

Goal: make both unique ⇒ deterministic scanning
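Example 3.4 can be reproduced with a small brute-force enumeration. The sketch below assumes that the regular expressions are written in the syntax of Python's `re` module, which stands in for RE_Ω here; the function name `decompositions` is hypothetical, and tokens are represented by their 1-based indices.

```python
import re
from typing import Iterator, List, Tuple

Pair = Tuple[Tuple[str, ...], Tuple[int, ...]]


def decompositions(patterns: List[str], w: str) -> Iterator[Pair]:
    """Yield every (decomposition, analysis) pair of w w.r.t. the patterns."""
    if w == "":
        yield (), ()
        return
    for end in range(1, len(w) + 1):          # candidate first lexeme w[:end]
        prefix = w[:end]
        for i, pattern in enumerate(patterns, start=1):
            if re.fullmatch(pattern, prefix):
                for rest, tokens in decompositions(patterns, w[end:]):
                    yield (prefix,) + rest, (i,) + tokens


# Example 3.4 (1): two decompositions of aa
print(sorted(decompositions(["a+"], "aa")))
# [(('a', 'a'), (1, 1)), (('aa',), (1,))]

# Example 3.4 (2): one decomposition of a, but two analyses
print(list(decompositions(["a|b", "a|c"], "a")))
# [(('a',), (1,)), (('a',), (2,))]
```

The enumeration takes exponential time in general and is meant only to make the ambiguity visible, not as a scanning method.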
Ensuring Uniqueness

Two principles:
  1. Principle of the longest match ("maximal munch tokenization")
     - for uniqueness of decomposition
     - make lexemes as long as possible
     - motivated by applications: e.g., every (non-empty) prefix of an identifier is also an identifier
  2. Principle of the first match
     - for uniqueness of analysis
     - choose the first matching regular expression (in the given order)
     - therefore: arrange keywords before identifiers (if keywords are protected)
Principle of the Longest Match

Definition 3.5 (Longest-match decomposition)
A decomposition (w_1, ..., w_k) of w ∈ Ω⁺ w.r.t. α_1, ..., α_n ∈ RE_Ω is called a longest-match decomposition (LM decomposition) if, for every i ∈ [k], x ∈ Ω⁺, and y ∈ Ω*,
  w = w_1 ... w_i x y ⇒ there is no j ∈ [n] such that w_i x ∈ ⟦α_j⟧

Corollary 3.6
Given w and α_1, ..., α_n,
  - at most one LM decomposition of w exists (clear by definition) and
  - it is possible that w has a decomposition but no LM decomposition (see following example).

Example 3.7
w = aab, α_1 = a⁺, α_2 = ab
⇒ (a, ab) is a decomposition but no LM decomposition exists
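Since every lexeme of an LM decomposition must be the longest matching prefix of the remaining input, the decomposition can be computed greedily from left to right. The sketch below does this by brute force with Python's `re` module as an assumed stand-in for RE_Ω; the function name `lm_decomposition` is hypothetical.

```python
import re
from typing import List, Optional


def lm_decomposition(patterns: List[str], w: str) -> Optional[List[str]]:
    """Return the LM decomposition of w, or None if it does not exist."""
    lexemes: List[str] = []
    pos = 0
    while pos < len(w):
        longest = None
        # Try all non-empty prefixes of the remaining input, longest first
        for end in range(len(w), pos, -1):
            if any(re.fullmatch(p, w[pos:end]) for p in patterns):
                longest = end
                break
        if longest is None:                   # no pattern matches any prefix
            return None
        lexemes.append(w[pos:longest])
        pos = longest
    return lexemes


# Example 3.7: the longest match aa leaves b, which no expression matches
print(lm_decomposition(["a+", "ab"], "aab"))  # None
print(lm_decomposition(["a+", "b"], "aab"))   # ['aa', 'b']
```

The first call returns None even though (a, ab) is a decomposition, which is exactly the situation described in Corollary 3.6.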
Principle of the First Match

Problem: a (unique) LM decomposition can have several associated analyses (since ⟦α_i⟧ ∩ ⟦α_j⟧ ≠ ∅ with i ≠ j is possible; cf. keyword/identifier problem)

Definition 3.8 (First-longest-match analysis)
Let (w_1, ..., w_k) be the LM decomposition of w ∈ Ω⁺ w.r.t. α_1, ..., α_n ∈ RE_Ω. Its first-longest-match analysis (FLM analysis) (T_{i_1}, ..., T_{i_k}) is determined by
  i_j := min {l ∈ [n] | w_j ∈ ⟦α_l⟧}
for every j ∈ [k].

Corollary 3.9
Given w and α_1, ..., α_n, there is at most one FLM analysis of w. It exists iff the LM decomposition of w exists.
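A direct, unoptimised reading of Definition 3.8 in Python: take the longest matching lexeme at each position, then label it with the first expression that matches it. The rule list, the token names `IF`, `ID` and `WS`, and the use of Python's `re` syntax are assumptions chosen to illustrate the keyword/identifier problem.

```python
import re
from typing import List, Optional, Tuple

Rule = Tuple[str, str]  # (token name, regular expression), in priority order


def flm_analysis(rules: List[Rule], w: str) -> Optional[List[Tuple[str, str]]]:
    """Return the FLM analysis of w as (token, lexeme) pairs, or None."""
    result: List[Tuple[str, str]] = []
    pos = 0
    while pos < len(w):
        best_end = None
        for end in range(len(w), pos, -1):    # longest match: longest prefix first
            if any(re.fullmatch(p, w[pos:end]) for _, p in rules):
                best_end = end
                break
        if best_end is None:
            return None                       # no LM decomposition exists
        lexeme = w[pos:best_end]
        # First match: the minimal rule index whose expression matches the lexeme
        token = next(name for name, p in rules if re.fullmatch(p, lexeme))
        result.append((token, lexeme))
        pos = best_end
    return result


# Hypothetical rules: the keyword is listed before the identifier expression
rules = [("IF", "if"), ("ID", "[a-z]+"), ("WS", " +")]
print(flm_analysis(rules, "if iffy"))
# [('IF', 'if'), ('WS', ' '), ('ID', 'iffy')]
```

The lexeme `if` matches both `IF` and `ID` and is resolved by rule order, while `iffy` is an identifier because the longest match is taken before the first-match rule applies.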
Implementation of FLM Analysis

Algorithm 3.10 (FLM analysis – overview)
Input: expressions α_1, ..., α_n ∈ RE_Ω, tokens {T_1, ..., T_n}, input word w ∈ Ω⁺
Procedure:
  1. for every i ∈ [n], construct A_i ∈ DFA_Ω such that L(A_i) = ⟦α_i⟧ (see DFA method; Algorithm 2.9)
  2. construct the product automaton A ∈ DFA_Ω such that L(A) = ⋃_{i=1}^n ⟦α_i⟧
  3. partition the set of final states of A to follow the first-match principle
  4. extend the resulting DFA to a backtracking DFA which implements the longest-match principle
  5. let the backtracking DFA run on w
Output: FLM analysis of w (if it exists)
(2) The Product Automaton

Definition 3.11 (Product automaton)
Let A_i = ⟨Q_i, Ω, δ_i, q_0^(i), F_i⟩ ∈ DFA_Ω for every i ∈ [n]. The product automaton A = ⟨Q, Ω, δ, q_0, F⟩ ∈ DFA_Ω is defined by
  - Q := Q_1 × ... × Q_n
  - q_0 := (q_0^(1), ..., q_0^(n))
  - δ((q^(1), ..., q^(n)), a) := (δ_1(q^(1), a), ..., δ_n(q^(n), a))
  - (q^(1), ..., q^(n)) ∈ F iff there exists i ∈ [n] such that q^(i) ∈ F_i

Lemma 3.12
The above construction yields L(A) = ⋃_{i=1}^n L(A_i) (= ⋃_{i=1}^n ⟦α_i⟧).

Remark: similar construction for intersection (F := F_1 × ... × F_n)
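Definition 3.11 translates almost literally into code. The sketch below assumes DFAs with total transition functions over a shared alphabet; the `DFA` tuple type, the function names, and the two sample automata for a⁺ and ab are hypothetical.

```python
from itertools import product
from typing import Any, Dict, FrozenSet, List, NamedTuple, Tuple


class DFA(NamedTuple):
    states: FrozenSet[Any]
    alphabet: FrozenSet[str]
    delta: Dict[Tuple[Any, str], Any]  # total: defined for every (state, symbol)
    start: Any
    finals: FrozenSet[Any]


def product_dfa(dfas: List[DFA]) -> DFA:
    """Product automaton accepting the union of the component languages."""
    alphabet = dfas[0].alphabet               # all components share Ω
    states = frozenset(product(*(d.states for d in dfas)))
    delta = {
        (qs, a): tuple(d.delta[(q, a)] for d, q in zip(dfas, qs))
        for qs in states
        for a in alphabet
    }
    start = tuple(d.start for d in dfas)
    # Final iff at least one component is in one of its own final states
    finals = frozenset(
        qs for qs in states if any(q in d.finals for d, q in zip(dfas, qs))
    )
    return DFA(states, alphabet, delta, start, finals)


def accepts(dfa: DFA, w: str) -> bool:
    q = dfa.start
    for a in w:
        q = dfa.delta[(q, a)]
    return q in dfa.finals


# Hypothetical DFAs over Ω = {a, b}; the highest-numbered state is a sink
a_plus = DFA(frozenset({0, 1, 2}), frozenset("ab"),
             {(0, "a"): 1, (0, "b"): 2, (1, "a"): 1, (1, "b"): 2,
              (2, "a"): 2, (2, "b"): 2}, 0, frozenset({1}))
ab = DFA(frozenset({0, 1, 2, 3}), frozenset("ab"),
         {(0, "a"): 1, (0, "b"): 3, (1, "a"): 3, (1, "b"): 2,
          (2, "a"): 3, (2, "b"): 3, (3, "a"): 3, (3, "b"): 3}, 0, frozenset({2}))

union = product_dfa([a_plus, ab])
print(accepts(union, "aa"), accepts(union, "ab"), accepts(union, "aab"))
# True True False
```

Replacing `any` by `all` in the computation of `finals` gives the intersection variant mentioned in the remark.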