Compiler Construction
Lecture 4: Lexical Analysis III (Practical Aspects)

Thomas Noll
Lehrstuhl für Informatik 2 (Software Modeling and Verification)
noll@cs.rwth-aachen.de
http://moves.rwth-aachen.de/teaching/ss-14/cc14/
Summer Semester 2014

Outline

1. Recap: First-Longest-Match Analysis
2. Time Complexity of First-Longest-Match Analysis
3. First-Longest-Match Analysis with NFA
4. Longest Match in Practice
5. Regular Definitions
6. Generating Scanners Using [f]lex
7. Preprocessing

The Extended Matching Problem

Definition
Let $n \geq 1$ and $\alpha_1, \ldots, \alpha_n \in RE_\Omega$ with $\varepsilon \notin [\![\alpha_i]\!] \neq \emptyset$ for every $i \in [n]$ ($= \{1, \ldots, n\}$). Let $\Sigma := \{T_1, \ldots, T_n\}$ be an alphabet of corresponding tokens and $w \in \Omega^+$. If $w_1, \ldots, w_k \in \Omega^+$ such that $w = w_1 \ldots w_k$ and for every $j \in [k]$ there exists $i_j \in [n]$ such that $w_j \in [\![\alpha_{i_j}]\!]$, then $(w_1, \ldots, w_k)$ is called a decomposition and $(T_{i_1}, \ldots, T_{i_k})$ is called an analysis of $w$ w.r.t. $\alpha_1, \ldots, \alpha_n$.

Problem (Extended matching problem)
Given $\alpha_1, \ldots, \alpha_n \in RE_\Omega$ and $w \in \Omega^+$, decide whether there exists a decomposition of $w$ w.r.t. $\alpha_1, \ldots, \alpha_n$ and determine a corresponding analysis.

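The definition admits several decompositions and analyses for the same word. The following is a minimal sketch of a brute-force enumeration, assuming Python's `re` module as a stand-in for the regular expressions $\alpha_i$; the function name `analyses` and the two rules are hypothetical and chosen only for illustration. It takes exponential time in the worst case and is not meant as a scanner.

```python
import re

def analyses(rules, w):
    """Yield every analysis of w as a list of (token, lexeme) pairs."""
    if not w:
        yield []                      # the empty rest has exactly one (empty) analysis
        return
    for end in range(1, len(w) + 1):  # candidate first lexeme w[:end] (non-empty)
        for token, pattern in rules:
            if re.fullmatch(pattern, w[:end]):
                for rest in analyses(rules, w[end:]):
                    yield [(token, w[:end])] + rest

# Hypothetical rules: a keyword and lower-case identifiers.
RULES = [("if", r"if"), ("id", r"[a-z]+")]

for analysis in analyses(RULES, "if"):
    print(analysis)
```

For the word `if` this yields three analyses over two decompositions: (id, i)(id, f), (if, if) and (id, if). Neither the decomposition nor the analysis is unique without further rules.
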
Ensuring Uniqueness

Two principles:

1. Principle of the longest match ("maximal munch tokenization")
   - for uniqueness of decomposition
   - make lexemes as long as possible
   - motivated by applications: e.g., every (non-empty) prefix of an identifier is also an identifier
2. Principle of the first match
   - for uniqueness of analysis
   - choose first matching regular expression (in the given order)
   - therefore: arrange keywords before identifiers (if keywords are protected)

Implementation of FLM Analysis

Algorithm (FLM analysis: overview)

Input: expressions $\alpha_1, \ldots, \alpha_n \in RE_\Omega$, tokens $\{T_1, \ldots, T_n\}$, input word $w \in \Omega^+$

Procedure:
1. for every $i \in [n]$, construct $A_i \in DFA_\Omega$ such that $L(A_i) = [\![\alpha_i]\!]$ (see DFA method; Algorithm 2.9)
2. construct the product automaton $A \in DFA_\Omega$ such that $L(A) = \bigcup_{i=1}^{n} [\![\alpha_i]\!]$
3. partition the set of final states of $A$ to follow the first-match principle
4. extend the resulting DFA to a backtracking DFA which implements the longest-match principle
5. let the backtracking DFA run on $w$

Output: FLM analysis of $w$ (if it exists)

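(4) The Backtracking DFA

Definition (Backtracking DFA)

The set of configurations of $B$ is given by
\[
  (\{N\} \uplus \Sigma) \times \Omega^* \cdot Q \cdot \Omega^* \times \Sigma^* \cdot \{\varepsilon, \mathrm{lexerr}\}
\]
The initial configuration for an input word $w \in \Omega^+$ is $(N, q_0 w, \varepsilon)$.
Here $F = \biguplus_{i \in [n]} F^{(i)}$ denotes the final states, partitioned by token according to the first-match principle, and $P$ the set of productive states (those from which a final state is reachable).
The transitions of $B$ are defined as follows (where $q' := \delta(q, a)$):

normal mode: look for initial match
\[
  (N, qaw, W) \vdash
  \begin{cases}
    (T_i, q'w, W) & \text{if } q' \in F^{(i)} \\
    (N, q'w, W) & \text{if } q' \in P \setminus F \\
    \text{output: } W \cdot \mathrm{lexerr} & \text{if } q' \notin P
  \end{cases}
\]

backtrack mode: look for longest match
\[
  (T, vqaw, W) \vdash
  \begin{cases}
    (T_i, q'w, W) & \text{if } q' \in F^{(i)} \\
    (T, vaq'w, W) & \text{if } q' \in P \setminus F \\
    (N, q_0 vaw, WT) & \text{if } q' \notin P
  \end{cases}
\]

end of input
\[
  \begin{array}{lcll}
    (T, q, W) & \vdash & \text{output: } WT & \text{if } q \in F \\
    (N, q, W) & \vdash & \text{output: } W \cdot \mathrm{lexerr} & \text{if } q \in P \setminus F \\
    (T, vaq, W) & \vdash & (N, q_0 va, WT) & \text{if } q \in P \setminus F
  \end{array}
\]
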
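The transition rules can be read as a loop that remembers the last match and re-reads the lookahead $v$ after emitting a token. The following is a minimal sketch under stated assumptions, not the construction from the lecture: the DFA is given as a hand-built table in which only productive states occur (a missing entry stands for $q' \notin P$), and the names `run_backtracking_dfa`, `DELTA` and `TOKEN_OF` are hypothetical.

```python
def run_backtracking_dfa(delta, q0, token_of, w):
    """Simulate the backtracking DFA on w; return the token list W
    (ending in "lexerr" if no analysis exists)."""
    out = []      # W: tokens emitted so far
    start = 0     # position where the current lexeme begins
    while start < len(w):
        q, pos = q0, start
        token, match_end = None, start   # token is None <=> normal mode N
        # Follow transitions while the next state is productive (present in delta).
        while pos < len(w) and (q, w[pos]) in delta:
            q = delta[(q, w[pos])]
            pos += 1
            if q in token_of:            # q in F^(i): record the match, lookahead v is reset
                token, match_end = token_of[q], pos
        if token is None:                # still in mode N: no lexeme starts here
            return out + ["lexerr"]
        out.append(token)                # W := W T
        start = match_end                # backtrack: w[match_end:pos] is read again
    return out

# Hand-built product DFA for alpha_1 = a (token T1) and alpha_2 = a*b (token T2).
DELTA = {
    ("q0", "a"): "A", ("q0", "b"): "C",
    ("A", "a"): "B",  ("A", "b"): "C",
    ("B", "a"): "B",  ("B", "b"): "C",
}
TOKEN_OF = {"A": "T1", "C": "T2"}  # F^(1) = {A}, F^(2) = {C}; B is productive, not final

print(run_backtracking_dfa(DELTA, "q0", TOKEN_OF, "aab"))
print(run_backtracking_dfa(DELTA, "q0", TOKEN_OF, "aaa"))
```

On `aab` the longest match wins and a single T2 is emitted. On `aaa` the automaton reads to the end of the input each time, falls back to the match after the first `a`, and emits T1 three times.
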
Time Complexity of FLM Analysis

Lemma 4.1
The worst-case time complexity of FLM analysis using the backtracking DFA on input $w \in \Omega^+$ is $O(|w|^2)$.

Proof.
- lower bound: $\alpha_1 = a$, $\alpha_2 = a^*b$, $w = a^m$ requires $\Omega(m^2)$ transitions
- upper bound:
  - each run from mode $N$ to $T \in \Sigma$ consumes at least one input symbol (and possibly reads all input symbols), involving at most $\sum_{i=1}^{|w|} i = \frac{|w|(|w|+1)}{2}$ transitions
  - if no $\Sigma$-mode is reached, lexerr is reported after $\leq |w|$ transitions

Remark: possible improvement by tabular method (similar to Knuth-Morris-Pratt Algorithm for pattern matching in strings)

Literature: Th. Reps: "Maximal-Munch" Tokenization in Linear Time, ACM TOPLAS 20(2), 1998, 259–273

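The lower-bound example can be counted explicitly. As a sketch, counting only symbol-reading transitions: on a remaining input $a^r$ the automaton matches $T_1$ after the first symbol, but every further $a$ is still productive for $a^*b$, so it keeps reading in backtrack mode until the end of the input, emits $T_1$, and restarts on $a^{r-1}$. Summing over all $m$ runs gives

```latex
\[
  \sum_{r=1}^{m} r \;=\; \frac{m(m+1)}{2} \;\in\; \Theta(m^2)
\]
```

which matches the upper bound of the lemma, so the quadratic bound is tight for this pair of expressions.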
A Backtracking NFA

A similar construction is possible for the NFA method:

1. $A_i = \langle Q_i, \Omega, \delta_i, q_0^{(i)}, F_i \rangle \in NFA_\Omega$ ($i \in [n]$) by NFA method
2. "Product" automaton: $Q := \{q_0\} \uplus \biguplus_{i=1}^{n} Q_i$
   [Figure: new initial state $q_0$ with one $\varepsilon$-transition into each of $A_1, \ldots, A_n$]
3. Partitioning of final states:
   - $M \subseteq Q$ is called a $T_i$-matching if $M \cap F_i \neq \emptyset$ and for all $j \in [i-1]$: $M \cap F_j = \emptyset$
     yields set of $T_i$-matchings $F^{(i)} \subseteq 2^Q$
   - $M \subseteq Q$ is called productive if there exists a productive $q \in M$
     yields productive state sets $P \subseteq 2^Q$
4. Backtracking automaton: similar to DFA case

Longest Match in Practice

- In general: lookahead of arbitrary length required
  - that is, $|v|$ unbounded in configurations $(T, vqw, W)$
  - see Lemma 4.1: $\alpha_1 = a$, $\alpha_2 = a^*b$, $w = a \ldots a$
- "Modern" programming languages (Pascal, Java, ...): lookahead of one or two characters sufficient
  - separation of keywords, identifiers, etc. by spaces
  - Pascal: two-character lookahead required to distinguish 1.5 (real number) from 1..5 (integer range)
- However: principle of longest match not always applicable!

Inadequacy of Longest Match I

Example 4.2 (Longest match in FORTRAN)

1. Relational expressions
   - valid lexemes: .EQ. (relational operator), EQ (identifier), 12 (integer), 12. , .12 (reals)
   - input string: 12␣.EQ.␣12, read as 12.EQ.12 (ignoring blanks!)
   - intended analysis: (int, 12)(relop, eq)(int, 12)
   - LM yields: (real, 12.0)(id, EQ)(real, 0.12) ⇒ wrong interpretation
2. DO loops
   - (correct) input string: DO␣5␣I␣=␣1,␣20, read as DO5I=1,20
     - intended analysis: (do, )(label, 5)(id, I)(gets, )(int, 1)(comma, )(int, 20)
     - LM analysis (wrong): (id, DO5I)(gets, )(int, 1)(comma, )(int, 20)
   - (erroneous) input string: DO␣5␣I␣=␣1.␣20, read as DO5I=1.20
     - LM analysis (correct): (id, DO5I)(gets, )(real, 1.2)

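The FORTRAN examples can be reproduced with a small longest-match scanner. This is a sketch under stated assumptions: the rule set is a hypothetical fragment written for Python's `re` module, not a FORTRAN lexical grammar, and it relies on the fact that for these simple patterns Python's greedy matching returns the longest matching prefix. The name `longest_match_scan` is made up for the example.

```python
import re

# Hypothetical rule fragment, in first-match order.
RULES = [
    ("do",    re.compile(r"DO")),
    ("relop", re.compile(r"\.EQ\.")),
    ("real",  re.compile(r"[0-9]+\.[0-9]*|\.[0-9]+")),
    ("int",   re.compile(r"[0-9]+")),
    ("id",    re.compile(r"[A-Z][A-Z0-9]*")),
    ("gets",  re.compile(r"=")),
    ("comma", re.compile(r",")),
]

def longest_match_scan(text):
    """Return the first-longest-match analysis as (token, lexeme) pairs."""
    text = text.replace(" ", "")          # FORTRAN: blanks are not significant
    tokens, pos = [], 0
    while pos < len(text):
        candidates = []
        for name, pattern in RULES:
            m = pattern.match(text, pos)  # match must start exactly at pos
            if m:
                candidates.append((name, m.group()))
        if not candidates:
            raise ValueError(f"lexical error at position {pos}")
        # max() keeps the first of several equally long candidates (first match).
        name, lexeme = max(candidates, key=lambda c: len(c[1]))
        tokens.append((name, lexeme))
        pos += len(lexeme)
    return tokens

print(longest_match_scan("12 .EQ. 12"))
print(longest_match_scan("DO 5 I = 1, 20"))
```

The first call produces real, id, real instead of int, relop, int, and the second produces the identifier DO5I even though a `do` rule precedes `id`, because the longer lexeme always wins.
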
Inadequacy of Longest Match II

Example 4.3 (Longest match in C)

- valid lexemes:
  - x (identifier)
  - = (assignment)
  - =- (subtractive assignment; K&R/ANSI-C: -=)
  - 1, -1 (integers)
- input string: x=-1
- intended analysis: (id, x)(gets, )(int, −1)
- LM yields: (id, x)(dec, )(int, 1) ⇒ wrong interpretation

Possible solutions:
- Hand-written (non-FLM) scanners
- FLM with lookahead (later)

Regular Definitions I

Goal: modularizing the representation of regular sets by introducing additional identifiers

Definition 4.4 (Regular definition)
Let $\{R_1, \ldots, R_n\}$ be a set of symbols disjoint from $\Omega$. A regular definition (over $\Omega$) is a sequence of equations
\[
  R_1 = \alpha_1, \quad \ldots, \quad R_n = \alpha_n
\]
such that, for every $i \in [n]$, $\alpha_i \in RE_{\Omega \uplus \{R_1, \ldots, R_{i-1}\}}$.

Remark: since recursion is not involved, every $R_i$ can (iteratively) be substituted by a regular expression $\alpha \in RE_\Omega$ (otherwise $\Longrightarrow$ context-free languages)

Regular Definitions II

Example 4.5 (Symbol classes in Pascal)

Identifiers:
  Letter  = A | ... | Z | a | ... | z
  Digit   = 0 | ... | 9
  Id      = Letter (Letter | Digit)*

Numerals (unsigned):
  Digits  = Digit+
  Empty   = ∅*
  OptFrac = . Digits | Empty
  OptExp  = E (+ | - | Empty) Digits | Empty
  Num     = Digits OptFrac OptExp

Rel. operators:
  RelOp   = < | <= | = | <> | > | >=

Keywords:
  If      = if
  Then    = then
  Else    = else

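The iterative substitution from the remark on regular definitions can be sketched in a few lines. The assumptions are labeled here and in the comments: names are referenced as `{Name}` (a notation chosen for this example), the bodies are written in Python `re` syntax, and the `| Empty` alternatives of the Pascal example are expressed with the `?` operator. The function name `expand` is hypothetical.

```python
import re

# Part of Example 4.5, in definition order; {Name} refers to an earlier definition.
DEFS = [
    ("Letter",  r"[A-Za-z]"),
    ("Digit",   r"[0-9]"),
    ("Id",      r"{Letter}({Letter}|{Digit})*"),
    ("Digits",  r"{Digit}+"),
    ("OptFrac", r"(\.{Digits})?"),       # ". Digits | Empty"
    ("OptExp",  r"(E[+-]?{Digits})?"),   # "E (+ | - | Empty) Digits | Empty"
    ("Num",     r"{Digits}{OptFrac}{OptExp}"),
]

def expand(defs):
    """Replace every {Name} by the already expanded body of that name."""
    expanded = {}
    for name, body in defs:
        def substitute(match):
            # Only earlier names are known, so recursion raises KeyError.
            return "(?:" + expanded[match.group(1)] + ")"
        expanded[name] = re.sub(r"\{(\w+)\}", substitute, body)
    return expanded

patterns = expand(DEFS)
print(patterns["Id"])
print(bool(re.fullmatch(patterns["Num"], "12.5E-3")))
```

Because each body may only mention names defined before it, one pass over the sequence is enough; a forward or recursive reference is rejected with a `KeyError` instead of being expanded.
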
The [f]lex Tool

Usage of [f]lex ("[fast] lexical analyzer generator"):

  spec.l                 --[f]lex-->  lex.yy.c        --cc-->  a.out
  ([f]lex specification)              (scanner in C)           (executable)

  Program  --a.out-->  Symbol sequence

A [f]lex specification is of the form

  Definitions (optional)
  %%
  Rules
  %%
  Auxiliary procedures (optional)
