Lexical analysis CS440/540

Lexical Analysis • Process: converting the input string (source program) into substrings (tokens) • Input: source program • Output: a sequence of tokens • The program that does this is also called a lexer, tokenizer, or scanner

Token and Lexeme • Token: a syntactic category • Lexeme: an instance of a token
Token        Sample lexemes
keyword      if, else, for, while, …
whitespace   ‘ ’, ‘\t’, ‘\n’, …
comparison   <, >, ==, !=, …
identifier   total, score, name, …
number       1, 3.14159, 0, …
literal      “Super nice cool compiler”, “ComS”, …

Basic design 1. Define a finite set of tokens. • Keyword, whitespace, identifier, … 2. Describe which strings belong to each token • Keyword: “if” or “else” or “for” or … • whitespace: non-empty sequence of blanks, newlines, and tabs • identifier: strings of letters or digits, starting with a letter

Analysis example • Source: if (i == j) z = 0; else z = 1; • Input string: \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1; • Identifier: ? • Keyword: ? • Comparison: ? • Number: ? • Whitespace: ?

Analysis example • Source: if (i == j) z = 0; else z = 1; • Input string: \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1; • Identifier: i, j, z • Keyword: if, else • Comparison: == • Number: 0, 1 • Whitespace: ‘ ’, ‘\t’, ‘\n’
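The classification above can be reproduced with a short sketch built on Python's `re` module. The token names, the `tokenize` helper, and the extra `punct` class (for parentheses, `=` and `;`, which the slide does not classify) are assumptions made for illustration.

```python
import re

# Token classes from the slide, tried in this order (keyword before identifier).
# "punct" is an assumption added so that the whole input can be consumed.
TOKEN_SPEC = [
    ("keyword",    r"(?:if|else)\b"),
    ("identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("number",     r"[0-9]+"),
    ("comparison", r"==|!=|<|>"),
    ("whitespace", r"[ \t\n]+"),
    ("punct",      r"[();=]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (token, lexeme) pairs; raise on an unknown character."""
    pos, out = 0, []
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        out.append((m.lastgroup, m.group()))
        pos = m.end()
    return out

source = "\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;"
tokens = tokenize(source)
identifiers = sorted({lex for tok, lex in tokens if tok == "identifier"})
keywords = sorted({lex for tok, lex in tokens if tok == "keyword"})
numbers = sorted({lex for tok, lex in tokens if tok == "number"})
```

Concatenating the lexemes in order gives back the original input string, which is a quick check that no character was dropped.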

What would you do? • Foo<Bar<Bazz>> • These are nested templates in C++. • However, do you see any conflict?

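What would you do? • Foo<Bar<Bazz>> • These are nested templates in C++. • However, do you see any conflict? • The >> that closes Foo<Bar<Bazz>> is the same character sequence as the operator in cin >> var, so a lexer that always takes the longest match reads it as one >> token rather than two > tokens.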
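A minimal sketch of the conflict, assuming a lexer that always prefers the longest operator. The `lex` helper and its tiny token set are hypothetical and cover only the characters needed for these three inputs.

```python
import re

# Longer operators are listed first so that ">>" wins over ">",
# which mirrors the behaviour of a longest-match lexer.
TOKEN = re.compile(r"\s*(>>|<<|[<>]|[A-Za-z_][A-Za-z0-9_]*)")

def lex(text):
    """Split text into lexemes, skipping whitespace."""
    text = text.rstrip()
    pos, out = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"cannot tokenize at position {pos}")
        out.append(m.group(1))
        pos = m.end()
    return out

nested = lex("Foo<Bar<Bazz>>")    # the closing brackets come out as one ">>"
spaced = lex("Foo<Bar<Bazz> >")   # a space forces two ">" tokens
stream = lex("cin >> var")        # here ">>" really is one operator
```

The lexer alone cannot tell the two uses of `>>` apart; resolving them needs either the space shown in `spaced` or help from the parser.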

Alphabet, String, and Language • Alphabet (Σ): any finite set of symbols. • String over an alphabet: a finite sequence of symbols drawn from that alphabet. • Language (L): any countable set of strings over some fixed alphabet. • Formally, let S be a set of characters. A language over S is a set of strings of characters drawn from S.
Alphabet             Language
English characters   English sentences
ASCII                C programs

Operations on Languages • Single character: ‘c’ = {"c"} • Epsilon: ε = {""} • Union: A + B = {s | s ∈ A or s ∈ B} • Concatenation: AB = {ab | a ∈ A and b ∈ B} • Iteration: A* = ∪_{i≥0} A^i, where A^i = A … A (i times)

Example • L = {A, B, …, Z, a, b, …, z}, D = {0, 1, …, 9} • L + D: the set of letters and digits; each of its strings is either one letter or one digit: A, g, 1, … • LD: the set of strings of length two, each consisting of one letter followed by one digit: c4, j8, y6, … • L^4: the set of all 4-letter strings • D^4: the set of all 4-digit strings: 1234, 7416, 2592, …
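The language operations can be tried out directly on finite sets of strings. The following is a minimal sketch using Python sets; the function names are chosen here for illustration, and `star_up_to` only approximates iteration, since the full set A* is infinite whenever A contains a non-empty string.

```python
from itertools import product
import string

def union(a, b):
    """A + B: strings that are in A or in B."""
    return a | b

def concat(a, b):
    """AB: every string of A followed by every string of B."""
    return {x + y for x, y in product(a, b)}

def power(a, i):
    """A^i: A concatenated with itself i times; A^0 is {""}."""
    result = {""}
    for _ in range(i):
        result = concat(result, a)
    return result

def star_up_to(a, n):
    """Finite approximation of A*: the union of A^0 through A^n."""
    result = set()
    for i in range(n + 1):
        result |= power(a, i)
    return result

L = set(string.ascii_letters)   # {A, ..., Z, a, ..., z}
D = set(string.digits)          # {0, ..., 9}

letters_or_digits = union(L, D)    # 52 + 10 one-character strings
letter_then_digit = concat(L, D)   # 52 * 10 two-character strings
four_digits = power(D, 4)          # 10^4 four-digit strings
```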

Regular Expressions • A regular expression describes a language by combining language operations (single character, epsilon, union, concatenation, iteration) over some alphabet.

Example • Keyword • “if” or “else” or “for” or … • keyword = ?

Example • Keyword • “if” or “else” or “for” or … • keyword = ‘if’ + ‘else’ + ‘for’ + …

Examples • Integer • non-empty string of digits • digit = ‘0’ + ‘1’ + … + ‘9’ • integer = ?

Examples • Integer • non-empty string of digits • digit = ‘0’ + ‘1’ + … + ‘9’ • integer = digit digit* • Definition • A*: zero or more occurrences of A • A⁺ = AA*: one or more occurrences of A • A?: zero or one occurrence of A • So integer = digit⁺

Examples • Identifier • Strings of letters or digits, starting with a letter • letter = ‘A’ + … + ‘Z’ + ‘a’ + … + ‘z’ • digit = ‘0’ + ‘1’ + … + ‘9’ • identifier = ?

Examples • Identifier • Strings of letters or digits, starting with a letter • letter = ‘A’ + … + ‘Z’ + ‘a’ + … + ‘z’ • digit = ‘0’ + ‘1’ + … + ‘9’ • identifier = letter (letter + digit)*
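The integer and identifier definitions translate almost symbol for symbol into Python's `re` syntax, where union is written `|` instead of `+`. This is a sketch; the helper names `is_integer` and `is_identifier` are illustrative.

```python
import re

DIGIT = "[0-9]"        # digit = '0' + '1' + ... + '9'
LETTER = "[A-Za-z]"    # letter = 'A' + ... + 'Z' + 'a' + ... + 'z'

# integer = digit digit*  (equivalently digit+)
INTEGER = re.compile(f"{DIGIT}{DIGIT}*")

# identifier = letter (letter + digit)*, with union written as "|"
IDENTIFIER = re.compile(f"{LETTER}(?:{LETTER}|{DIGIT})*")

def is_integer(s):
    """True if the whole string s is in L(integer)."""
    return INTEGER.fullmatch(s) is not None

def is_identifier(s):
    """True if the whole string s is in L(identifier)."""
    return IDENTIFIER.fullmatch(s) is not None
```

`fullmatch` is used rather than `match` because membership in the language concerns the whole string, not just a prefix.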

More Examples • Phone number • (515)-294-8813 • Σ = ? • area = ? • exchange = ? • phone = ? • phone number = ?

More Examples • Phone number • (515)-294-8813 • Σ = digits ∪ {-, (, )} • area = digit^3 • exchange = digit^3 • phone = digit^4 • phone number = ‘(’ area ‘)-’ exchange ‘-’ phone
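As a sketch, the phone number expression can be written with Python's `re` module, where digit^3 becomes `[0-9]{3}`. The helper name `is_phone_number` is an illustrative choice.

```python
import re

AREA = r"[0-9]{3}"       # area = digit^3
EXCHANGE = r"[0-9]{3}"   # exchange = digit^3
PHONE = r"[0-9]{4}"      # phone = digit^4

# phone number = '(' area ')-' exchange '-' phone
PHONE_NUMBER = re.compile(rf"\({AREA}\)-{EXCHANGE}-{PHONE}")

def is_phone_number(s):
    """True if the whole string s matches the slide's phone number format."""
    return PHONE_NUMBER.fullmatch(s) is not None
```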

More Examples • email address • weile@iastate.edu • Σ = ? • name = ? • address = ?

More Examples • email address • weile@iastate.edu • Σ = letters ∪ {., @} • name = letter⁺ • address = name ‘@’ name ‘.’ name
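The address expression can be sketched the same way. The helper name `is_address` is illustrative, and the pattern follows the slide's simplified definition only; it is not a validator for real email addresses.

```python
import re

NAME = r"[A-Za-z]+"   # name = letter+

# address = name '@' name '.' name
ADDRESS = re.compile(rf"{NAME}@{NAME}\.{NAME}")

def is_address(s):
    """True if the whole string s matches the slide's simplified address format."""
    return ADDRESS.fullmatch(s) is not None
```

Because the alphabet contains only letters, `.` and `@`, this definition rejects addresses with digits or with more than one dot after the `@`.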

An algorithm for lexical analysis • Transition diagram • A flowchart with states and edges; each edge is labelled with characters; a subset of the states is marked as “final states.” • Transitions from state to state proceed along edges according to the next input character. • Every string that ends up at a final state is accepted. • If the machine gets “stuck” (there is no transition for the next character), it is an error. • Transition diagrams can be easily translated to programs using if or case statements

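Implementation
    state0: c = getchar();                      /* first character must be a letter */
            if (isalpha(c)) { token += c; goto state1; }
            error();
    state1: c = getchar();                      /* letters and digits extend the token */
            if (isalpha(c) || isdigit(c)) { token += c; goto state1; }
            if (isdelimiter(c)) goto state2;    /* a delimiter ends the identifier */
            error();
    state2: return(token);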
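The same diagram can be written as a loop over an explicit state variable instead of goto labels. This is a sketch under two assumptions that are not in the slide: the delimiter set is a made-up example, and end of input is treated as a delimiter.

```python
DELIMITERS = " \t\n;()=<>!"   # hypothetical delimiter set for this sketch

def scan_identifier(text, pos=0):
    """Run the identifier diagram from text[pos]; return (lexeme, next_pos).

    States: 0 = start, 1 = inside an identifier, 2 = accept.
    """
    state, token = 0, ""
    while True:
        c = text[pos] if pos < len(text) else None   # None marks end of input
        if state == 0:
            if c is not None and c.isascii() and c.isalpha():
                token += c
                pos += 1
                state = 1
            else:
                raise ValueError(f"identifier must start with a letter at {pos}")
        elif state == 1:
            if c is not None and c.isascii() and c.isalnum():
                token += c
                pos += 1
            elif c is None or c in DELIMITERS:
                state = 2          # the delimiter itself is not consumed
            else:
                raise ValueError(f"unexpected character {c!r} at {pos}")
        else:
            return token, pos
```

Returning the next position along with the lexeme lets a caller continue scanning from where the identifier ended.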

Finite automata • Deterministic Finite Automata (DFAs) • Nondeterministic Finite Automata (NFAs)

Notation • Given a string s and a regexp R, is s ∈ L(R)? • There is variation in regular expression notation • Union: A + B ≡ A | B • Option: A + ε ≡ A? • Range: ‘a’ + ‘b’ + … + ‘z’ ≡ [a-z] • Excluded range: complement of [a-z] ≡ [^a-z]
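Python's `re` module uses the right-hand notation in each of these equivalences, so the membership question "is s ∈ L(R)?" can be tried out directly. The `matches` helper is an illustrative name; it wraps `re.fullmatch` so that the whole string must match.

```python
import re

def matches(pattern, s):
    """Answer the membership question: is s in L(pattern)?"""
    return re.fullmatch(pattern, s) is not None

UNION = "a|b"           # 'a' + 'b'
OPTION = "ab?"          # 'a' ('b' + ε)
RANGE = "[a-z]"         # 'a' + 'b' + ... + 'z'
EXCLUDED = "[^a-z]"     # any single character outside [a-z]
```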

Lexical Spec → Regular Expressions (1) 1. Write a rexp for the lexemes of each token • Number = digit⁺ • Keyword = ‘if’ + ‘else’ + … • Identifier = letter (letter + digit)* • OpenPar = ‘(’ • ClosePar = ‘)’ 2. Construct R, matching all lexemes for all tokens • R = Keyword + Identifier + Number + … = R_1 + R_2 + R_3 + …

Lexical Spec → Regular Expressions (2) 3. Let the input be x_1 … x_n • For 1 ≤ i ≤ n, check whether x_1 … x_i ∈ L(R) 4. On success, we know that x_1 … x_i ∈ L(R_j) for some j 5. Remove x_1 … x_i from the input and go to (3) • Example input: \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

Ambiguities • What if x_1 … x_i ∈ L(R) and also x_1 … x_j ∈ L(R), with i ≠ j? • Possible rule: pick the longest possible string in L(R) • What if x_1 … x_i ∈ L(R_j) and also x_1 … x_i ∈ L(R_k), with j ≠ k? • Possible rule: use the rule listed first
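Steps 3 to 5 and both disambiguation rules can be sketched as follows. The rule list is an illustrative assumption (the `Assign` and `Semicolon` tokens do not appear in the slides), and each rule is tried separately so that the longest match can be chosen explicitly rather than relying on the regex engine.

```python
import re

# Rules in priority order: (token name, regular expression R_j).
RULES = [
    ("Keyword",    re.compile(r"if|else")),
    ("Identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("Number",     re.compile(r"[0-9]+")),
    ("Comparison", re.compile(r"==|!=|<|>")),
    ("Assign",     re.compile(r"=")),
    ("OpenPar",    re.compile(r"\(")),
    ("ClosePar",   re.compile(r"\)")),
    ("Semicolon",  re.compile(r";")),
    ("Whitespace", re.compile(r"[ \t\n]+")),
]

def next_token(text, pos):
    """Return (name, lexeme) for the longest match at pos; earlier rules win ties."""
    best_name, best_lexeme = None, ""
    for name, rx in RULES:
        m = rx.match(text, pos)
        # Strict ">" keeps the earlier rule when two rules match the same length.
        if m and len(m.group()) > len(best_lexeme):
            best_name, best_lexeme = name, m.group()
    if best_name is None:
        raise ValueError(f"no rule matches at position {pos}")
    return best_name, best_lexeme

def tokenize(text):
    """Repeat: find the next token, remove it from the input, continue."""
    pos, out = 0, []
    while pos < len(text):
        name, lexeme = next_token(text, pos)
        out.append((name, lexeme))
        pos += len(lexeme)
    return out
```

With these rules, `if` is a Keyword because Keyword is listed before Identifier, while `iffy` is an Identifier because the longer match wins.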

Finite Automata • A finite automaton consists of • An input alphabet Σ • A set of states S • A start state n ∈ S • A set of accepting states F ⊆ S • A set of transitions state --input--> state

Finite Automata • Transition: s_1 --a--> s_2 • Is read: in state s_1 on input “a” go to state s_2 • If end of input and in an accepting state → accept • Otherwise → reject

Finite Automata State Graphs

Simple examples • A finite automaton that accepts only “1” • A finite automaton accepting any number of 1’s followed by a single 0
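Since the state graphs are not reproduced here, the two automata can be sketched as transition tables instead. The state names `start` and `done` and the helper functions are illustrative choices; a missing table entry means the automaton is stuck and rejects.

```python
def run_dfa(transitions, start, accepting, text):
    """Return True if the DFA is in an accepting state after reading all of text."""
    state = start
    for symbol in text:
        key = (state, symbol)
        if key not in transitions:
            return False           # stuck: no transition for this input
        state = transitions[key]
    return state in accepting

# Accepts only "1".
ONLY_ONE = {("start", "1"): "done"}

# Accepts any number of 1s followed by a single 0.
ONES_THEN_ZERO = {("start", "1"): "start", ("start", "0"): "done"}

def accepts_only_one(text):
    return run_dfa(ONLY_ONE, "start", {"done"}, text)

def accepts_ones_then_zero(text):
    return run_dfa(ONES_THEN_ZERO, "start", {"done"}, text)
```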

And Another Example • Alphabet {0,1} • What language does this recognize?

And Another Example • Alphabet {0,1} • What language does this recognize? • (1*0(0⁺1?|1))⁺

Epsilon Moves • Machine can move from state A to state B without reading input

Deterministic and Nondeterministic Automata • Deterministic Finite Automata (DFA) • One transition per input per state • No ε-moves • Nondeterministic Finite Automata (NFA) • Can have multiple transitions for one input in a given state • Can have ε-moves

Execution of Finite Automata • A DFA can take only one path through the state graph • Completely determined by input • NFAs can choose • Whether to make ε-moves • Which of multiple transitions for a single input to take

Acceptance of NFAs • An NFA can be in multiple states at once • Rule: an NFA accepts if some sequence of choices leads to a final state at the end of the input • Input: 100
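One way to apply the acceptance rule is to track the set of all states the NFA could be in after each input symbol. The sketch below does that; the example NFA is hypothetical (it is not the machine drawn on the slide) and accepts strings over {0,1} that end in "00", which includes the input 100.

```python
EPSILON = None   # label used for ε-moves in this sketch

def epsilon_closure(states, transitions):
    """All states reachable from `states` using only ε-moves."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in transitions.get((s, EPSILON), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def nfa_accepts(transitions, start, accepting, text):
    """Track every state the NFA could be in; accept if any of them is final."""
    current = epsilon_closure({start}, transitions)
    for symbol in text:
        moved = set()
        for s in current:
            moved |= set(transitions.get((s, symbol), ()))
        current = epsilon_closure(moved, transitions)
    return bool(current & accepting)

# Hypothetical NFA: strings over {0,1} ending in "00". State A has two
# transitions on input 0, which is what makes it nondeterministic.
ENDS_IN_00 = {
    ("A", "0"): {"A", "B"},
    ("A", "1"): {"A"},
    ("B", "0"): {"C"},
}

# A second hypothetical NFA with an ε-move from S to T.
WITH_EPSILON = {
    ("S", EPSILON): {"T"},
    ("T", "a"): {"U"},
}
```

On input 100 the tracked sets for `ENDS_IN_00` are {A}, then {A, B}, then {A, B, C}; the last set contains the final state C, so the input is accepted.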

NFA vs. DFA • NFAs and DFAs recognize the same set of languages (the regular languages) • DFAs are faster to execute, since there are no choices to consider • A DFA can be exponentially larger than an equivalent NFA • For a given language, an NFA can be simpler than a DFA • (1*0(0|1)0*1?)⁺

Regular Expressions to NFA (1) • For each kind of rexp, define an NFA • Notation: NFA for rexp M • For ε • For input a

Regular Expressions to NFA (2) • For AB • For A | B

Regular Expressions to NFA (3) • For A*

Example: RegExp → NFA conversion • Consider the regular expression • (1|0)*1 • The NFA is
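The figures for these constructions are not reproduced in the text, so the sketch below uses one common variant (usually called Thompson's construction), in which each operator adds ε-moves around the NFAs of its operands. The exact wiring, the function names, and the tuple representation (start, final, transitions) are assumptions of this sketch and may differ from the slide's diagrams.

```python
import itertools

EPS = None                     # label for ε-moves
_ids = itertools.count()       # source of fresh state numbers

def _add(trans, src, label, dst):
    trans.setdefault((src, label), set()).add(dst)

def _merge(*tables):
    """Copy several transition tables into one new table."""
    return {k: set(v) for t in tables for k, v in t.items()}

def symbol(a):
    """NFA for input a: start --a--> final."""
    s, f = next(_ids), next(_ids)
    trans = {}
    _add(trans, s, a, f)
    return s, f, trans

def concat(m, n):
    """NFA for AB: the final state of A is linked to the start of B by ε."""
    (s1, f1, t1), (s2, f2, t2) = m, n
    trans = _merge(t1, t2)
    _add(trans, f1, EPS, s2)
    return s1, f2, trans

def union(m, n):
    """NFA for A | B: a new start branches to A and B; both rejoin at a new final."""
    (s1, f1, t1), (s2, f2, t2) = m, n
    s, f = next(_ids), next(_ids)
    trans = _merge(t1, t2)
    for target in (s1, s2):
        _add(trans, s, EPS, target)
    for source in (f1, f2):
        _add(trans, source, EPS, f)
    return s, f, trans

def star(m):
    """NFA for A*: skip A entirely, or run it and optionally repeat."""
    s1, f1, t1 = m
    s, f = next(_ids), next(_ids)
    trans = _merge(t1)
    _add(trans, s, EPS, s1)    # enter A
    _add(trans, s, EPS, f)     # zero repetitions
    _add(trans, f1, EPS, s1)   # repeat A
    _add(trans, f1, EPS, f)    # leave A
    return s, f, trans

def closure(states, trans):
    """All states reachable from `states` through ε-moves only."""
    seen, stack = set(states), list(states)
    while stack:
        for t in trans.get((stack.pop(), EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def accepts(nfa, text):
    """Simulate the NFA by tracking the set of possible states."""
    start, final, trans = nfa
    current = closure({start}, trans)
    for ch in text:
        step = set()
        for s in current:
            step |= trans.get((s, ch), set())
        current = closure(step, trans)
    return final in current

# (1|0)*1
NFA = concat(star(union(symbol("1"), symbol("0"))), symbol("1"))
```

The resulting machine accepts exactly the binary strings that end in 1, which is the language of (1|0)*1.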
