Introduction to Lexical Analysis



  1. Introduction to Lexical Analysis

     Outline
     • Informal sketch of lexical analysis
       – Identifies tokens in input string
     • Issues in lexical analysis
       – Lookahead
       – Ambiguities
     • Specifying lexers
       – Regular expressions
       – Examples of regular expressions

     Lexical Analysis
     • What do we want to do? Example:
         if (i == j)
         then
             z = 0;
             else
                 z = 1;
     • The input is just a string of characters:
         if (i == j)\nthen\n\tz = 0;\n\telse\n\t\tz = 1;
     • Goal: Partition input string into substrings
       – Where the substrings are tokens

     What’s a Token?
     • A syntactic category
       – In English: noun, verb, adjective, …
       – In a programming language: Identifier, Integer, Keyword, Whitespace, …
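The partitioning goal can be sketched in a few lines of Python. This is only an illustration, not the method the slides go on to develop: the pattern `LEXEME` and the function `partition` are hypothetical names, and the pattern is one plausible guess at how to cut the example string into substrings.

```python
import re

# The example input from the slide, with real newline and tab characters.
SOURCE = "if (i == j)\nthen\n\tz = 0;\n\telse\n\t\tz = 1;"

# Hypothetical pattern: a word, a run of digits, "==", a run of whitespace,
# or any other single character. Alternatives are tried left to right.
LEXEME = re.compile(r"[A-Za-z][A-Za-z0-9]*|[0-9]+|==|\s+|.")

def partition(text):
    """Split text into consecutive substrings that together cover all of it."""
    return LEXEME.findall(text)

pieces = partition(SOURCE)
print(pieces)
```

Joining the pieces back together reproduces the input exactly, which is what "partition" means here: nothing is dropped and nothing overlaps.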

  2. Tokens
     • Tokens correspond to sets of strings
       – These sets depend on the programming language
     • Identifier: strings of letters or digits, starting with a letter
     • Integer: a non-empty string of digits
     • Keyword: “else” or “if” or “begin” or …
     • Whitespace: a non-empty sequence of blanks, newlines, and tabs

     What are Tokens used for?
     • Classify program substrings according to role
     • Output of lexical analysis is a stream of tokens . . .
     • . . . which is input to the parser
     • Parser relies on token distinctions
       – An identifier is treated differently than a keyword

     Designing a Lexical Analyzer: Step 1
     • Define a finite set of tokens
       – Tokens describe all items of interest
       – Choice of tokens depends on language, design of parser
     • Recall
         if (i == j)\nthen\n\tz = 0;\n\telse\n\t\tz = 1;
     • Useful tokens for this expression:
         Integer, Keyword, Relation, Identifier, Whitespace, (, ), =, ;

     Designing a Lexical Analyzer: Step 2
     • Describe which strings belong to each token
     • Recall:
       – Identifier: strings of letters or digits, starting with a letter
       – Integer: a non-empty string of digits
       – Keyword: “else” or “if” or “begin” or …
       – Whitespace: a non-empty sequence of blanks, newlines, and tabs
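Step 2 (describing which strings belong to each token) can be illustrated with a small classifier. This is a sketch under assumptions: the names `KEYWORDS`, `TOKEN_CLASSES`, and `classify` are invented for illustration, and "then" is added to the keyword set because the running example uses it.

```python
import re

# Keywords named on the slide, plus "then" from the running example.
KEYWORDS = {"else", "if", "begin", "then"}

# Each token class is a set of strings, described here by a regular expression.
TOKEN_CLASSES = [
    ("Integer", re.compile(r"[0-9]+")),                   # non-empty string of digits
    ("Identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),  # starts with a letter
    ("Whitespace", re.compile(r"[ \n\t]+")),              # blanks, newlines, tabs
]

def classify(lexeme):
    """Return the token class of a whole lexeme, or None if it fits no class."""
    # Keywords also look like identifiers, so they must be checked first.
    if lexeme in KEYWORDS:
        return "Keyword"
    for name, pattern in TOKEN_CLASSES:
        if pattern.fullmatch(lexeme):
            return name
    return None

print(classify("if"), classify("i"), classify("42"))
```

Checking keywords before identifiers reflects the point that the parser treats an identifier differently than a keyword, even though both are strings of letters.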

  3. Lexical Analyzer: Implementation
     An implementation must do two things:
     1. Recognize substrings corresponding to tokens
     2. Return the value or lexeme of the token
        – The lexeme is the substring

     Example
     • Recall:
         if (i == j)\nthen\n\tz = 0;\n\telse\n\t\tz = 1;
     • Token-lexeme groupings:
       – Identifier: i, j, z
       – Keyword: if, then, else
       – Relation: ==
       – Integer: 0, 1
       – (, ), =, ; single character of the same name

     Why do Lexical Analysis?
     • Dramatically simplify parsing
       – The lexer usually discards “uninteresting” tokens that don’t contribute to parsing
         • E.g. Whitespace, Comments
       – Converts data early
     • Separate out logic to read source files
       – Potentially an issue on multiple platforms
       – Can optimize reading code independently of parser

     True Crimes of Lexical Analysis
     • Is it as easy as it sounds?
     • Not quite!
     • Look at some programming language history . . .
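The two implementation duties (recognize the substring, return its token and lexeme) can be sketched as a hand-written scanner loop. Everything here is an assumption made for illustration: `TOKEN_SPEC` and `tokenize` are hypothetical names, and the rule "first matching pattern wins" is a simplification rather than the resolution strategy of any particular lexer generator.

```python
import re

SOURCE = "if (i == j)\nthen\n\tz = 0;\n\telse\n\t\tz = 1;"

# Ordered list of (token name, pattern). Order matters: Keyword comes before
# Identifier, and Relation ("==") comes before the single "=".
TOKEN_SPEC = [(name, re.compile(rx)) for name, rx in [
    ("Whitespace", r"[ \n\t]+"),
    ("Keyword", r"(?:if|then|else)\b"),
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("Integer", r"[0-9]+"),
    ("Relation", r"=="),
    ("(", r"\("),
    (")", r"\)"),
    ("=", r"="),
    (";", r";"),
]]

def tokenize(text):
    """Return (token, lexeme) pairs, discarding whitespace."""
    pos, tokens = 0, []
    while pos < len(text):
        for name, pattern in TOKEN_SPEC:
            match = pattern.match(text, pos)
            if match:
                if name != "Whitespace":  # "uninteresting" for the parser
                    tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
    return tokens

for token, lexeme in tokenize(SOURCE):
    print(token, repr(lexeme))
```

The resulting pairs reproduce the token-lexeme groupings listed for the example: `i`, `j`, `z` as Identifier, `if`, `then`, `else` as Keyword, `==` as Relation, and `0`, `1` as Integer.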

  4. Lexical Analysis in FORTRAN
     • FORTRAN rule: Whitespace is insignificant
     • E.g., VAR1 is the same as VA R1
     • Footnote: FORTRAN whitespace rule was motivated by inaccuracy of punch card operators

     A terrible design! Example
     • Consider
       – DO 5 I = 1,25
       – DO 5 I = 1.25
     • The first is DO 5 I = 1 , 25
     • The second is DO5I = 1.25
     • Reading left-to-right, cannot tell if DO5I is a variable or DO stmt. until after “,” is reached

     Lexical Analysis in FORTRAN. Lookahead.
     Two important points:
     1. The goal is to partition the string. This is implemented by reading left-to-right, recognizing one token at a time
     2. “Lookahead” may be required to decide where one token ends and the next token begins
        – Even our simple example has lookahead issues
            i vs. if
            = vs. ==

     Another Great Moment in Scanning
     • PL/1: Keywords can be used as identifiers:
         IF THEN THEN THEN = ELSE; ELSE ELSE = IF
     • It can be difficult to determine how to label lexemes
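The FORTRAN lookahead problem can be made concrete with a deliberately simplified sketch. It is not a real FORTRAN scanner: `classify_fortran` is a hypothetical helper that handles only the two statement shapes from the slide, and it decides by checking whether a comma follows the `=`.

```python
import re

def classify_fortran(stmt):
    """Classify 'DO <label> <var> = <a>,<b>' versus '<var> = <expr>'.

    Simplification: blanks are removed first (they are insignificant),
    and only the presence of a comma after '=' separates the two cases.
    """
    compact = stmt.replace(" ", "").upper()
    loop = re.fullmatch(r"DO(\d+)([A-Z][A-Z0-9]*)=([^,]+),(.+)", compact)
    if loop:
        label, var, start, stop = loop.groups()
        return ("DO", label, var, start, stop)
    assign = re.fullmatch(r"([A-Z][A-Z0-9]*)=(.+)", compact)
    if assign:
        return ("ASSIGN",) + assign.groups()
    return None

print(classify_fortran("DO 5 I = 1,25"))
print(classify_fortran("DO 5 I = 1.25"))
```

Both inputs share the prefix `DO5I=1`, so no decision is possible until the scanner has looked past it to the `,` or the `.`. That unbounded peek ahead is the lookahead the slide describes.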

  5. More Modern True Crimes in Scanning
     • Nested template declarations in C++
         vector<vector<int>> myVector
         vector < vector < int >> myVector
         (vector < (vector < (int >> myVector)))

     Review
     • The goal of lexical analysis is to
       – Partition the input string into lexemes (the smallest program units that are individually meaningful)
       – Identify the token of each lexeme
     • Left-to-right scan ⇒ lookahead sometimes required

     Next
     • We still need
       – A way to describe the lexemes of each token
       – A way to resolve ambiguities
         • Is if two variables i and f ?
         • Is == two equal signs = = ?

     Regular Languages
     • There are several formalisms for specifying tokens
     • Regular languages are the most popular
       – Simple and useful theory
       – Easy to understand
       – Efficient implementations
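The nested-template problem is easy to reproduce with a toy scanner that always prefers the longer operator. The pattern `TOKEN` and the function `lex_greedy` are hypothetical names for this sketch, which covers only identifiers and angle brackets, not C++ in general.

```python
import re

# ">>" is listed before ">" so the longer operator wins when both could match.
TOKEN = re.compile(r">>|[A-Za-z_][A-Za-z0-9_]*|[<>]")

def lex_greedy(text):
    """Return lexemes, skipping anything the pattern does not match (blanks)."""
    return TOKEN.findall(text)

print(lex_greedy("vector<vector<int>> myVector"))
print(lex_greedy("vector<vector<int> > myVector"))
```

Without the space, the two closing brackets come out as a single `>>` lexeme, which a parser would read as a shift operator. With the space, they are two separate `>` lexemes.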

  6. Languages
     Def. Let Σ be a set of characters. A language Λ over Σ is a set of strings of characters drawn from Σ
     (Σ is called the alphabet of Λ)

     Examples of Languages
     • Alphabet = English characters
     • Language = English sentences
     • Not every string on English characters is an English sentence

     • Alphabet = ASCII characters
     • Language = C programs
     • Note: ASCII character set is different from English character set

     Notation
     • Languages are sets of strings
     • Need some notation for specifying which sets of strings we want our language to contain
     • The standard notation for regular languages is regular expressions

     Atomic Regular Expressions
     • Single character: $\texttt{'c'} = \{ \texttt{"c"} \}$
     • Epsilon: $\varepsilon = \{ \texttt{""} \}$

  7. Compound Regular Expressions
     • Union: $A + B = \{\, s \mid s \in A \text{ or } s \in B \,\}$
     • Concatenation: $AB = \{\, ab \mid a \in A \text{ and } b \in B \,\}$
     • Iteration: $A^* = \bigcup_{i \ge 0} A^i$, where $A^i = A \cdots A$ ($i$ times)

     Regular Expressions
     • Def. The regular expressions over Σ are the smallest set of expressions including
       – $\varepsilon$
       – $\texttt{'c'}$ where $c \in \Sigma$
       – $A + B$ where $A$, $B$ are rexp over $\Sigma$
       – $AB$ where $A$, $B$ are rexp over $\Sigma$
       – $A^*$ where $A$ is a rexp over $\Sigma$

     Syntax vs. Semantics
     • To be careful, we should distinguish syntax and semantics (meaning) of regular expressions
         $L(\varepsilon) = \{ \texttt{""} \}$
         $L(\texttt{'c'}) = \{ \texttt{"c"} \}$
         $L(A + B) = L(A) \cup L(B)$
         $L(AB) = \{\, ab \mid a \in L(A) \text{ and } b \in L(B) \,\}$
         $L(A^*) = \bigcup_{i \ge 0} L(A^i)$

     Example: Keyword
     Keyword: “else” or “if” or “begin” or …
         $\texttt{'else'} + \texttt{'if'} + \texttt{'begin'} + \cdots$
     Note: 'else' abbreviates 'e''l''s''e'
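The meaning function L can be tried out directly by representing each language as a Python set of strings. This is a sketch with invented names (`lit`, `union`, `concat`, `star`, `word`). Because $A^*$ is usually infinite, `star` takes a hypothetical `max_iter` bound and returns only the union of $A^i$ for $0 \le i \le$ `max_iter`.

```python
EPSILON = {""}  # L(epsilon) = {""}

def lit(c):
    """L('c') = {"c"} for a single character c."""
    return {c}

def union(a, b):
    """L(A + B) = L(A) union L(B)."""
    return a | b

def concat(a, b):
    """L(AB) = { ab | a in L(A) and b in L(B) }."""
    return {x + y for x in a for y in b}

def star(a, max_iter):
    """Finite approximation of L(A*): union of A^i for 0 <= i <= max_iter."""
    result, power = {""}, {""}
    for _ in range(max_iter):
        power = concat(power, a)  # A^(i+1) = A^i A
        result |= power
    return result

def word(s):
    """'else' abbreviates 'e''l''s''e': concatenate the character literals."""
    lang = EPSILON
    for c in s:
        lang = concat(lang, lit(c))
    return lang

keyword = union(union(word("else"), word("if")), word("begin"))
print(sorted(keyword))
print(sorted(star(union(lit("a"), lit("b")), 2)))
```

The split between the functions (syntax: how an expression is built) and the sets they return (semantics: which strings it denotes) mirrors the distinction drawn on the Syntax vs. Semantics slide.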

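  8. Example: Integers
     Integer: a non-empty string of digits
         $\text{digit} = \texttt{'0'} + \texttt{'1'} + \texttt{'2'} + \texttt{'3'} + \texttt{'4'} + \texttt{'5'} + \texttt{'6'} + \texttt{'7'} + \texttt{'8'} + \texttt{'9'}$
         $\text{integer} = \text{digit}\ \text{digit}^*$
     Abbreviation: $A^+ = AA^*$

     Example: Identifier
     Identifier: strings of letters or digits, starting with a letter
         $\text{letter} = \texttt{'A'} + \cdots + \texttt{'Z'} + \texttt{'a'} + \cdots + \texttt{'z'}$
         $\text{identifier} = \text{letter}\ (\text{letter} + \text{digit})^*$
     Is $(\text{letter}^* + \text{digit}^*)$ the same?

     Example: Whitespace
     Whitespace: a non-empty sequence of blanks, newlines, and tabs
         $(\texttt{' '} + \texttt{'\textbackslash n'} + \texttt{'\textbackslash t'})^+$

     Example 1: Phone Numbers
     • Regular expressions are all around you!
     • Consider +46(0)18-471-1056
         $\Sigma = \text{digits} \cup \{ +, -, (, ) \}$
         $\text{country} = \text{digit}\ \text{digit}$
         $\text{city} = \text{digit}\ \text{digit}$
         $\text{univ} = \text{digit}\ \text{digit}\ \text{digit}$
         $\text{extension} = \text{digit}\ \text{digit}\ \text{digit}\ \text{digit}$
         $\text{phone\_num} = \texttt{'+'}\ \text{country}\ \texttt{'('}\ \texttt{'0'}\ \texttt{')'}\ \text{city}\ \texttt{'-'}\ \text{univ}\ \texttt{'-'}\ \text{extension}$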
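These definitions translate almost directly into Python's `re` syntax, with one notational caveat: the slides write union as `+`, whereas `re` writes union as `|` and uses a postfix `+` for the abbreviation "one or more". The constant names below are assumptions chosen to mirror the slide. The last two patterns test the slide's question on the strings that may follow the leading letter.

```python
import re

DIGIT = "[0-9]"      # '0' + '1' + ... + '9'
LETTER = "[A-Za-z]"  # 'A' + ... + 'Z' + 'a' + ... + 'z'

INTEGER = DIGIT + DIGIT + "*"                       # digit digit*
IDENTIFIER = LETTER + "(?:" + LETTER + "|" + DIGIT + ")*"
WHITESPACE = r"(?: |\n|\t)+"

# Phone number pieces, built by concatenation exactly as on the slide.
COUNTRY = DIGIT * 2
CITY = DIGIT * 2
UNIV = DIGIT * 3
EXTENSION = DIGIT * 4
PHONE_NUM = r"\+" + COUNTRY + r"\(0\)" + CITY + "-" + UNIV + "-" + EXTENSION

# The slide's question: is (letter* + digit*) the same as (letter + digit)*?
MIXED = "(?:" + LETTER + "|" + DIGIT + ")*"          # (letter + digit)*
SEPARATE = "(?:" + LETTER + "*|" + DIGIT + "*)"      # letter* + digit*

print(bool(re.fullmatch(PHONE_NUM, "+46(0)18-471-1056")))
print(bool(re.fullmatch(MIXED, "a1")), bool(re.fullmatch(SEPARATE, "a1")))
```

The two forms are not the same: `(letter + digit)*` accepts any mixture such as `a1`, while `letter* + digit*` accepts only strings made entirely of letters or entirely of digits.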

  9. Example 2: Email Addresses
     • Consider kostis@it.uu.se
         $\Sigma = \text{letters} \cup \{ ., @ \}$
         $\text{name} = \text{letter}^+$
         $\text{address} = \text{name}\ \texttt{'@'}\ \text{name}\ \texttt{'.'}\ \text{name}\ \texttt{'.'}\ \text{name}$

     Summary
     • Regular expressions describe many useful languages
     • Regular languages are a language specification
       – We still need an implementation
     • Next time: Given a string $s$ and a regular expression $R$, is $s \in L(R)$?
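For experimentation, the membership question can be answered with an off-the-shelf matcher. The sketch below uses Python's `re.fullmatch` as a stand-in, which is an assumption of convenience and not the implementation technique the slides defer to next time. `NAME`, `ADDRESS`, and `in_language` are hypothetical names.

```python
import re

NAME = "[A-Za-z]+"  # name = letter+
ADDRESS = NAME + "@" + NAME + r"\." + NAME + r"\." + NAME

def in_language(s, regex):
    """Return True if the whole string s belongs to L(regex)."""
    return re.fullmatch(regex, s) is not None

print(in_language("kostis@it.uu.se", ADDRESS))
print(in_language("kostis@uu.se", ADDRESS))
```

The second address is rejected because this particular expression requires exactly three dot-separated names after the `@`. The specification describes a language precisely, but that language may be narrower than the informal idea of "an email address".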
