
Implementation of Lexical Analysis - PowerPoint PPT Presentation



  1. Implementation of Lexical Analysis

  2. Outline
  • Specifying lexical structure using regular expressions
  • Finite automata
    – Deterministic Finite Automata (DFAs)
    – Non-deterministic Finite Automata (NFAs)
  • Implementation of regular expressions: RegExp ⇒ NFA ⇒ DFA ⇒ Tables

  3. Notation
  • For convenience, we use a variation of regular expression notation (allowing user-defined abbreviations)
  • Union: A + B ≡ A | B
  • Option: A? ≡ A + ε
  • Range: [a-z] ≡ ‘a’+’b’+…+’z’
  • Excluded range: [^a-z] ≡ complement of [a-z]
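These abbreviations map directly onto the syntax of common regex engines. A quick check in Python's `re` module (an illustration, not part of the slides):

```python
import re

# Union: A + B in the slides' notation is A|B in standard syntax
assert re.fullmatch(r"a|b", "b") is not None
# Option: A? stands for A + epsilon (A or the empty string)
assert re.fullmatch(r"a?", "") is not None
# Range: [a-z] abbreviates 'a'+'b'+...+'z'
assert re.fullmatch(r"[a-z]", "q") is not None
# Excluded range: [^a-z] is the complement of [a-z] over single characters
assert re.fullmatch(r"[^a-z]", "q") is None
```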

  4. Regular Expressions in Lexical Specification
  • Last lecture: a specification for the predicate s ∈ L(R)
  • But a yes/no answer is not enough!
  • Instead: partition the input into tokens
  • We will adapt regular expressions to this goal

  5. Regular Expressions ⇒ Lexical Spec. (1)
  1. Select a set of tokens
    • Integer, Keyword, Identifier, OpenPar, ...
  2. Write a regular expression (pattern) for the lexemes of each token
    • Integer = digit+
    • Keyword = ‘if’ + ‘else’ + …
    • Identifier = letter (letter + digit)*
    • OpenPar = ‘(’
    • …
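These token patterns can be written down concretely. A sketch in Python's `re` syntax (the exact keyword set is an assumption; the slide elides it):

```python
import re

# Illustrative patterns for the tokens on the slide (keyword list assumed):
INTEGER    = r"[0-9]+"                 # digit+
KEYWORD    = r"if|else|while"          # 'if' + 'else' + ...
IDENTIFIER = r"[A-Za-z][A-Za-z0-9]*"   # letter (letter + digit)*
OPENPAR    = r"\("                     # '('

assert re.fullmatch(INTEGER, "1234")
assert re.fullmatch(IDENTIFIER, "x27")
assert re.fullmatch(KEYWORD, "else")
assert re.fullmatch(OPENPAR, "(")
```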

  6. Regular Expressions ⇒ Lexical Spec. (2)
  3. Construct R, matching all lexemes for all tokens
     R = Keyword + Identifier + Integer + … = R1 + R2 + R3 + …
  Facts: If s ∈ L(R) then s is a lexeme
    – Furthermore s ∈ L(Ri) for some “i”
    – This “i” determines the token that is reported

  7. Regular Expressions ⇒ Lexical Spec. (3)
  4. Let the input be x1…xn (x1 … xn are characters)
    • For 1 ≤ i ≤ n, check: x1…xi ∈ L(R)?
  5. It must be that x1…xi ∈ L(Rj) for some j (if there is a choice, pick the smallest such j)
  6. Remove x1…xi from the input and go to step 4
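Steps 4–6 can be sketched as a matching loop. A minimal, naive Python version, assuming a fixed ordered rule list (names illustrative); it applies "smallest j" but not yet the maximal-munch disambiguation discussed on the next slides:

```python
import re

# Ordered token rules; order matters for step 5's "smallest such j".
RULES = [("Keyword", r"if|else"),
         ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
         ("Integer", r"[0-9]+")]

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        # Step 4: find a prefix of the remaining input in L(R).
        for name, pat in RULES:             # step 5: smallest j wins
            m = re.match(pat, text[pos:])
            if m and m.group(0):
                tokens.append((name, m.group(0)))
                pos += len(m.group(0))      # step 6: remove the lexeme, repeat
                break
        else:
            raise ValueError(f"no rule matches at position {pos}")
    return tokens

assert tokenize("if42") == [("Keyword", "if"), ("Integer", "42")]
```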

  8. How to Handle Spaces and Comments?
  1. We could create a token Whitespace
     Whitespace = (‘ ’ + ‘\n’ + ‘\t’)+
    – We could also add comments in there
    – An input “ \t\n 5555 ” is transformed into Whitespace Integer Whitespace
  2. Lexer skips spaces (preferred)
    • Modify step 5 from before as follows:
      It must be that xk … xi ∈ L(Rj) for some j such that x1 … xk-1 ∈ L(Whitespace)
    • Parser is not bothered with spaces

  9. Ambiguities (1)
  • There are ambiguities in the algorithm
  • How much input is used? What if
    • x1…xi ∈ L(R) and also
    • x1…xK ∈ L(R)
  – Rule: pick the longest possible substring
  – The “maximal munch” rule

  10. Ambiguities (2)
  • Which token is used? What if
    • x1…xi ∈ L(Rj) and also
    • x1…xi ∈ L(Rk)
  – Rule: use the rule listed first (j if j < k)
  • Example:
    – R1 = Keyword and R2 = Identifier
    – “if” matches both
    – Treat “if” as a keyword, not an identifier
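The two disambiguation rules combine as: longest match first, then lowest rule number on ties. A simplified sketch (not the table-driven method real generators use):

```python
import re

RULES = [("Keyword", r"if|else"), ("Identifier", r"[A-Za-z][A-Za-z0-9]*")]

def match_one(text, pos):
    """Return (token, lexeme) at pos: maximal munch, ties broken by rule order."""
    best_len, best = 0, None
    for name, pat in RULES:                      # earlier rules tried first
        m = re.match(pat, text[pos:])
        if m and len(m.group(0)) > best_len:     # only a strictly longer match
            best_len, best = len(m.group(0)), (name, m.group(0))
            # ties keep the earlier rule (j if j < k)
    return best

assert match_one("if", 0) == ("Keyword", "if")         # tie -> first rule wins
assert match_one("iffy", 0) == ("Identifier", "iffy")  # longest match wins
```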

  11. Error Handling
  • What if no rule matches a prefix of the input?
  • Problem: can’t just get stuck …
  • Solution:
    – Write a rule matching all “bad” strings
    – Put it last
  • Lexer tools allow the writing of:
     R = R1 + ... + Rn + Error
    – Token Error matches if nothing else matches
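One common way to realize the catch-all rule is a last pattern that matches any single character, so the lexer can always make progress. A hedged sketch (rule set illustrative):

```python
import re

# Error placed last: it matches any single character, so the lexer never
# gets stuck; it just reports an Error token and moves on.
RULES = [("Integer", r"[0-9]+"),
         ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
         ("Error", r".")]

def tokenize_with_error(text):
    tokens, pos = [], 0
    while pos < len(text):
        for name, pat in RULES:
            m = re.match(pat, text[pos:])
            if m:
                tokens.append((name, m.group(0)))
                pos += len(m.group(0))
                break
    return tokens

assert tokenize_with_error("x$1") == [
    ("Identifier", "x"), ("Error", "$"), ("Integer", "1")]
```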

  12. Summary
  • Regular expressions provide a concise notation for string patterns
  • Their use in lexical analysis requires small extensions
    – To resolve ambiguities
    – To handle errors
  • Good algorithms are known (next)
    – They require only a single pass over the input
    – Few operations per character (table lookup)

  13. Regular Languages & Finite Automata
  Basic formal language theory result: regular expressions and finite automata both define the class of regular languages.
  Thus, we are going to use:
  • Regular expressions for specification
  • Finite automata for implementation (automatic generation of lexical analyzers)

  14. Finite Automata
  A finite automaton is a recognizer for the strings of a regular language.
  A finite automaton consists of:
  – A finite input alphabet Σ
  – A set of states S
  – A start state n
  – A set of accepting states F ⊆ S
  – A set of transitions: state →input state

  15. Finite Automata
  • Transition: s1 →a s2
  • Read as: in state s1, on input “a”, go to state s2
  • At end of input (or if no transition is possible):
    – If in an accepting state ⇒ accept
    – Otherwise ⇒ reject
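This accept/reject rule can be coded directly. A minimal DFA runner in Python (state names and dict encoding are my own choices); the machine shown is the one from slide 18, accepting any number of 1's followed by a single 0:

```python
# DFA as a transition dict: (state, symbol) -> state.
DELTA = {("start", "1"): "start", ("start", "0"): "done"}
ACCEPTING = {"done"}

def run_dfa(s, state="start"):
    for ch in s:
        if (state, ch) not in DELTA:   # no transition possible -> reject
            return False
        state = DELTA[(state, ch)]
    return state in ACCEPTING          # end of input: accept iff accepting state

assert run_dfa("1110")
assert not run_dfa("11")      # ends in a non-accepting state
assert not run_dfa("100")     # gets stuck: no transition from "done" on 0
```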

  16. Finite Automata State Graphs
  [diagram: graphical notation for a state, the start state, an accepting state, and a transition labeled “a”]

  17. A Simple Example
  • A finite automaton that accepts only “1”
  [diagram: start state with a transition on 1 to an accepting state]

  18. Another Simple Example
  • A finite automaton accepting any number of 1’s followed by a single 0
  • Alphabet: {0, 1}
  [diagram: start state with a self-loop on 1 and a transition on 0 to an accepting state]

  19. And Another Example
  • Alphabet: {0, 1}
  • What language does this recognize?
  [state diagram omitted]

  20. And Another Example
  • Alphabet still {0, 1}
  [diagram: a state with two different transitions on input 1]
  • The operation of the automaton is not completely defined by the input
    – On input “11” the automaton could be in either state

  21. Epsilon Moves
  • Another kind of transition: ε-moves
  [diagram: A →ε B]
  • The machine can move from state A to state B without reading input

  22. Deterministic and Non-Deterministic Automata
  • Deterministic Finite Automata (DFA)
    – One transition per input per state
    – No ε-moves
  • Non-deterministic Finite Automata (NFA)
    – Can have multiple transitions for one input in a given state
    – Can have ε-moves
  • Finite automata have finite memory
    – Enough only to encode the current state

  23. Execution of Finite Automata
  • A DFA can take only one path through the state graph
    – Completely determined by the input
  • NFAs can choose
    – Whether to make ε-moves
    – Which of multiple transitions to take for a single input

  24. Acceptance of NFAs
  • An NFA can get into multiple states
  [NFA diagram with transitions on 1 and 0 omitted]
  • Input: 1 0 1
  • Rule: an NFA accepts an input if it can get into a final state
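"Can get into a final state" is implemented by tracking the set of all possible states at once. A sketch for an ε-free NFA; the specific machine here (accepting strings over {0,1} ending in "01") is my own illustration, not the slide's diagram:

```python
# NFA transitions: (state, symbol) -> set of possible next states.
NFA = {("a", "0"): {"a", "b"}, ("a", "1"): {"a"}, ("b", "1"): {"c"}}
START, FINAL = {"a"}, {"c"}

def nfa_accepts(s):
    states = set(START)
    for ch in s:
        # Follow every possible transition from every current state.
        states = {t for q in states for t in NFA.get((q, ch), set())}
    return bool(states & FINAL)   # accept if SOME path reaches a final state

assert nfa_accepts("101")    # one of the possible paths ends in state c
assert not nfa_accepts("10")
```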

  25. NFA vs. DFA (1)
  • NFAs and DFAs recognize the same set of languages (regular languages)
  • DFAs are easier to implement
    – There are no choices to consider

  26. NFA vs. DFA (2)
  • For a given language the NFA can be simpler than the DFA
  [diagrams: an NFA and the corresponding larger DFA for the same language; omitted]
  • The DFA can be exponentially larger than the NFA

  27. Regular Expressions to Finite Automata
  • High-level sketch:
    Lexical Specification → Regular expressions → NFA → DFA → Table-driven implementation of DFA

  28. Regular Expressions to NFA (1)
  • For each kind of regular expression, define an NFA
    – Notation: NFA for regular expression M
  • For ε: [diagram — a single ε-transition]
  • For input a: [diagram — a single transition on a]

  29. Regular Expressions to NFA (2)
  • For AB: [diagram — the NFA for A joined by an ε-transition to the NFA for B]
  • For A + B: [diagram — ε-transitions branching to the NFAs for A and B, then joining]

  30. Regular Expressions to NFA (3)
  • For A*: [diagram — ε-transitions around the NFA for A, allowing it to be skipped or repeated]
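These per-operator rules amount to Thompson's construction: each regex node becomes an NFA fragment with one start and one accepting state, glued together with ε-moves. A compact sketch over a tiny regex AST (the representation, with `None` as the ε label, is my own choice):

```python
import itertools

_ids = itertools.count()

def new_state():
    return next(_ids)

def build(node):
    """node: ('char', a) | ('cat', l, r) | ('alt', l, r) | ('star', x).
    Returns (start, accept, edges) where edges is a list of (src, label, dst)."""
    if node[0] == "char":
        s, f = new_state(), new_state()
        return s, f, [(s, node[1], f)]
    if node[0] == "cat":            # AB: epsilon from A's accept to B's start
        s1, f1, e1 = build(node[1]); s2, f2, e2 = build(node[2])
        return s1, f2, e1 + e2 + [(f1, None, s2)]
    if node[0] == "alt":            # A + B: branch on epsilon, then join
        s, f = new_state(), new_state()
        s1, f1, e1 = build(node[1]); s2, f2, e2 = build(node[2])
        return s, f, e1 + e2 + [(s, None, s1), (s, None, s2),
                                (f1, None, f), (f2, None, f)]
    if node[0] == "star":           # A*: skip A entirely, or loop back around
        s, f = new_state(), new_state()
        s1, f1, e1 = build(node[1])
        return s, f, e1 + [(s, None, s1), (s, None, f),
                           (f1, None, s1), (f1, None, f)]

# (1+0)*1, the example on slide 31:
start, accept, edges = build(
    ("cat", ("star", ("alt", ("char", "1"), ("char", "0"))), ("char", "1")))
```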

  31. Example of Regular Expression → NFA Conversion
  • Consider the regular expression (1+0)*1
  • [NFA diagram with states A–J and ε-moves; omitted]

  32. NFA to DFA. The Trick
  • Simulate the NFA
  • Each state of the DFA = a non-empty subset of states of the NFA
  • Start state = the set of NFA states reachable through ε-moves from the NFA start state
  • Add a transition S →a S’ to the DFA iff
    – S’ is the set of NFA states reachable from any state in S after seeing the input a, considering ε-moves as well
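The trick fits in a few lines of code. A sketch of subset construction with ε-closure; the small NFA (1*0 with one ε-move, states p, q, r) is my own example to exercise the closure:

```python
# NFA edges: (state, label) -> set of states; label None means epsilon.
EDGES = {("p", None): {"q"}, ("q", "1"): {"q"}, ("q", "0"): {"r"}}
START, FINAL, ALPHABET = "p", {"r"}, ("0", "1")

def eps_closure(states):
    """All states reachable from `states` via epsilon-moves alone."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for t in EDGES.get((q, None), set()):
            if t not in seen:
                seen.add(t); stack.append(t)
    return frozenset(seen)

def subset_construction():
    start = eps_closure({START})
    dfa, seen, todo = {}, {start}, [start]
    while todo:
        S = todo.pop()
        for a in ALPHABET:
            moved = {t for q in S for t in EDGES.get((q, a), set())}
            if not moved:
                continue          # the empty subset never becomes a DFA state
            S2 = eps_closure(moved)
            dfa[(S, a)] = S2
            if S2 not in seen:
                seen.add(S2); todo.append(S2)
    return start, dfa

start, dfa = subset_construction()
assert start == frozenset({"p", "q"})
assert dfa[(start, "0")] == frozenset({"r"})
```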

  33. NFA to DFA. Remark
  • An NFA may be in many states at any time
  • How many different states?
  • If there are N states, the NFA must be in some subset of those N states
  • How many non-empty subsets are there?
    – 2^N − 1 = finitely many

  34. NFA to DFA Example
  [NFA diagram for (1+0)*1 with states A–J; omitted]
  [resulting DFA: start state ABCDHI, plus states FGABCDHI and EJGABCDHI, with transitions on 0 and 1]

  35. Implementation
  • A DFA can be implemented by a 2D table T
    – One dimension is “states”
    – The other dimension is “input symbols”
    – For every transition Si →a Sk define T[i, a] = k
  • DFA “execution”
    – If in state Si on input a, read T[i, a] = k and skip to state Sk
    – Very efficient

  36. Table Implementation of a DFA
  [DFA diagram with states S, T, U; omitted]
        0   1
    S   T   U
    T   T   U
    U   T   U
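Table-driven execution can be sketched directly from this slide's table. I'm assuming U is the accepting state, which makes this DFA recognize (1+0)*1, consistent with the earlier example:

```python
# Transition table from the slide: rows are states, columns are inputs 0 and 1.
TABLE = {"S": {"0": "T", "1": "U"},
         "T": {"0": "T", "1": "U"},
         "U": {"0": "T", "1": "U"}}
ACCEPTING = {"U"}   # assumption: U accepts, giving the language (1+0)*1

def run_table_dfa(s):
    state = "S"
    for ch in s:
        state = TABLE[state][ch]   # one table lookup per input character
    return state in ACCEPTING

assert run_table_dfa("1")
assert run_table_dfa("0101")
assert not run_table_dfa("10")
```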

  37. Implementation (Cont.)
  • NFA → DFA conversion is at the heart of tools such as lex, ML-Lex, or flex
  • But DFAs can be huge
  • In practice, lex/ML-Lex/flex-like tools trade off speed for space in their choice of NFA and DFA representations

  38. Theory vs. Practice
  Two differences:
  • DFAs recognize lexemes. A lexer must return a type of acceptance (token type) rather than simply an accept/reject indication.
  • DFAs consume the complete string and accept or reject it. A lexer must find the end of the lexeme in the input stream, then find the next one, and so on.
