Introduction to Lexical Analysis

Outline
• Informal sketch of lexical analysis
  – Identifies tokens in input string
• Issues in lexical analysis
  – Lookahead
  – Ambiguities
• Specifying lexers
  – Regular expressions
  – Examples of regular expressions

Lexical Analysis
• What do we want to do? Example:
    if (i == j)
    then
        z = 0;
        else
            z = 1;
• The input is just a string of characters:
    if (i == j)\nthen\n\tz = 0;\n\telse\n\t\tz = 1;
• Goal: Partition input string into substrings
  – Where the substrings are tokens

What’s a Token?
• A syntactic category
  – In English: noun, verb, adjective, …
  – In a programming language: Identifier, Integer, Keyword, Whitespace, …

Tokens
• Tokens correspond to sets of strings
  – these sets depend on the programming language
• Identifier: strings of letters or digits, starting with a letter
• Integer: a non-empty string of digits
• Keyword: “else” or “if” or “begin” or …
• Whitespace: a non-empty sequence of blanks, newlines, and tabs

What are Tokens used for?
• Classify program substrings according to role
• Output of lexical analysis is a stream of tokens . . .
• . . . which is input to the parser
• Parser relies on token distinctions
  – An identifier is treated differently than a keyword

Designing a Lexical Analyzer: Step 1
• Define a finite set of tokens
  – Tokens describe all items of interest
  – Choice of tokens depends on language, design of parser
• Recall
    if (i == j)\nthen\n\tz = 0;\n\telse\n\t\tz = 1;
• Useful tokens for this expression:
    Integer, Keyword, Relation, Identifier, Whitespace, (, ), =, ;

Designing a Lexical Analyzer: Step 2
• Describe which strings belong to each token
• Recall:
  – Identifier: strings of letters or digits, starting with a letter
  – Integer: a non-empty string of digits
  – Keyword: “else” or “if” or “begin” or …
  – Whitespace: a non-empty sequence of blanks, newlines, and tabs
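
Step 2 can be sketched directly in Python: each token class becomes a membership test on strings. This is only an illustrative sketch. The names `classify`, `KEYWORDS`, `LETTERS`, `DIGITS` and `BLANKS` are assumptions of the example, the keyword set lists only the keywords that appear in these slides (the full set depends on the language), and "letter" is assumed to mean an ASCII letter.

```python
import string

# Sample keyword set taken from the slides; a real language defines its own.
KEYWORDS = {"if", "then", "else", "begin"}
LETTERS = set(string.ascii_letters)   # assumption: ASCII letters only
DIGITS = set(string.digits)
BLANKS = {" ", "\n", "\t"}


def classify(s: str) -> str:
    """Return the token class that the string s belongs to, or "Unknown"."""
    # Keywords are checked first: "if" also fits the Identifier description.
    if s in KEYWORDS:
        return "Keyword"
    # Identifier: letters or digits, starting with a letter.
    if s and s[0] in LETTERS and all(c in LETTERS | DIGITS for c in s):
        return "Identifier"
    # Integer: a non-empty string of digits.
    if s and all(c in DIGITS for c in s):
        return "Integer"
    # Whitespace: a non-empty sequence of blanks, newlines, and tabs.
    if s and all(c in BLANKS for c in s):
        return "Whitespace"
    return "Unknown"
```

Checking `Keyword` before `Identifier` is a design choice of this sketch: without it, every keyword would also be classified as an identifier, which is exactly the kind of ambiguity the later slides discuss.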

Lexical Analyzer: Implementation
An implementation must do two things:
1. Recognize substrings corresponding to tokens
2. Return the value or lexeme of the token
   – The lexeme is the substring

Example
• Recall:
    if (i == j)\nthen\n\tz = 0;\n\telse\n\t\tz = 1;
• Token-lexeme groupings:
  – Identifier: i, j, z
  – Keyword: if, then, else
  – Relation: ==
  – Integer: 0, 1
  – (, ), =, ; single character of the same name

Why do Lexical Analysis?
• Dramatically simplify parsing
  – The lexer usually discards “uninteresting” tokens that don’t contribute to parsing
    • E.g. Whitespace, Comments
  – Converts data early
• Separate out logic to read source files
  – Potentially an issue on multiple platforms
  – Can optimize reading code independently of parser

True Crimes of Lexical Analysis
• Is it as easy as it sounds?
• Not quite!
• Look at some programming language history . . .
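
Before turning to the history, the two implementation duties (recognize the substring, return its lexeme) can be sketched for the running example. This is a minimal sketch using Python's `re` module, not the implementation technique of this course. `TOKEN_SPEC`, `tokenize` and `SOURCE` are names invented for the example, and the rule order (keywords before identifiers, `==` before `=`) is an assumption used here to resolve ambiguities.

```python
import re

# (token name, pattern) pairs, tried in order at the current position.
TOKEN_SPEC = [
    ("Whitespace", r"[ \n\t]+"),
    ("Keyword",    r"(?:if|then|else)\b"),   # \b: do not match the "if" in "ifx"
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("Integer",    r"[0-9]+"),
    ("Relation",   r"=="),                   # must come before "="
    ("(",          r"\("),
    (")",          r"\)"),
    ("=",          r"="),
    (";",          r";"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in TOKEN_SPEC]

SOURCE = "if (i == j)\nthen\n\tz = 0;\n\telse\n\t\tz = 1;"


def tokenize(text: str) -> list[tuple[str, str]]:
    """Partition text into (token, lexeme) pairs, discarding whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in COMPILED:
            match = pattern.match(text, pos)
            if match:
                if name != "Whitespace":      # "uninteresting" token
                    tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            raise ValueError(f"no token matches at position {pos}")
    return tokens


if __name__ == "__main__":
    for token, lexeme in tokenize(SOURCE):
        print(token, repr(lexeme))
```

The parser would consume the returned list as its token stream. Whitespace is recognized but dropped, matching the "discards uninteresting tokens" point above.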

Lexical Analysis in FORTRAN
• FORTRAN rule: Whitespace is insignificant
• E.g., VAR1 is the same as VA R1
• Footnote: FORTRAN whitespace rule was motivated by inaccuracy of punch card operators

A terrible design! Example
• Consider
  – DO 5 I = 1,25
  – DO 5 I = 1.25
• The first is DO 5 I = 1 , 25 (a DO statement)
• The second is DO5I = 1.25 (an assignment to the variable DO5I)
• Reading left-to-right, cannot tell if DO5I is a variable or DO stmt. until after “,” is reached

Lexical Analysis in FORTRAN. Lookahead.
Two important points:
1. The goal is to partition the string. This is implemented by reading left-to-right, recognizing one token at a time
2. “Lookahead” may be required to decide where one token ends and the next token begins
   – Even our simple example has lookahead issues
       i vs. if
       = vs. ==

Another Great Moment in Scanning
• PL/1: Keywords can be used as identifiers:
    IF THEN THEN THEN = ELSE; ELSE ELSE = IF
  – It can be difficult to determine how to label lexemes
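
The lookahead issues in the simple example (i vs. if, = vs. ==) can be sketched with a hand-written scanner step. This is an illustrative sketch under stated assumptions: `next_token`, `KEYWORDS`, `LETTERS` and `ALNUM` are hypothetical names, only `=`, `==`, identifiers and keywords are handled, and keywords are treated as reserved, which is precisely what PL/1 does not do.

```python
import string

KEYWORDS = {"if", "then", "else"}          # assumed reserved in this sketch
LETTERS = string.ascii_letters
ALNUM = string.ascii_letters + string.digits


def next_token(text: str, pos: int) -> tuple[str, str, int]:
    """Return (token, lexeme, new_pos) for the token starting at text[pos]."""
    ch = text[pos]
    if ch == "=":
        # One character of lookahead decides between "=" and "==".
        if pos + 1 < len(text) and text[pos + 1] == "=":
            return ("Relation", "==", pos + 2)
        return ("=", "=", pos + 1)
    if ch in LETTERS:
        # Keep reading while the next character can extend the lexeme;
        # only then decide between Keyword and Identifier ("i" vs. "if").
        end = pos + 1
        while end < len(text) and text[end] in ALNUM:
            end += 1
        lexeme = text[pos:end]
        token = "Keyword" if lexeme in KEYWORDS else "Identifier"
        return (token, lexeme, end)
    raise ValueError(f"unexpected character {ch!r} at position {pos}")
```

In both branches the scanner must look at a character it may not consume. For example, `next_token("i == j", 0)` yields an Identifier only after seeing that the character following `i` is a blank.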

More Modern True Crimes in Scanning
• Nested template declarations in C++
    vector<vector<int>> myVector
    vector < vector < int >> myVector
    (vector < (vector < (int >> myVector)))

Review
• The goal of lexical analysis is to
  – Partition the input string into lexemes (the smallest program units that are individually meaningful)
  – Identify the token of each lexeme
• Left-to-right scan ⇒ lookahead sometimes required

Next
• We still need
  – A way to describe the lexemes of each token
  – A way to resolve ambiguities
    • Is if two variables i and f ?
    • Is == two equal signs = = ?

Regular Languages
• There are several formalisms for specifying tokens
• Regular languages are the most popular
  – Simple and useful theory
  – Easy to understand
  – Efficient implementations

Languages
Def. Let Σ be a set of characters. A language Λ over Σ is a set of strings of characters drawn from Σ
(Σ is called the alphabet of Λ)

Examples of Languages
• Alphabet = English characters
• Language = English sentences
• Not every string on English characters is an English sentence

• Alphabet = ASCII
• Language = C programs
• Note: ASCII character set is different from English character set

Notation
• Languages are sets of strings
• Need some notation for specifying which sets of strings we want our language to contain
• The standard notation for regular languages is regular expressions

Atomic Regular Expressions
• Single character: $\texttt{'c'} = \{\texttt{"c"}\}$
• Epsilon: $\varepsilon = \{\texttt{""}\}$

Compound Regular Expressions
• Union: $A + B = \{\, s \mid s \in A \text{ or } s \in B \,\}$
• Concatenation: $AB = \{\, ab \mid a \in A \text{ and } b \in B \,\}$
• Iteration: $A^{*} = \bigcup_{i \ge 0} A^{i}$, where $A^{i} = A \ldots A$ ($i$ times)

Regular Expressions
• Def. The regular expressions over Σ are the smallest set of expressions including
  – $\varepsilon$
  – $\texttt{'c'}$ where $c \in \Sigma$
  – $A + B$ where $A, B$ are rexp over $\Sigma$
  – $AB$ where $A, B$ are rexp over $\Sigma$
  – $A^{*}$ where $A$ is a rexp over $\Sigma$

Syntax vs. Semantics
• To be careful, we should distinguish syntax and semantics (meaning) of regular expressions
    $L(\varepsilon) = \{\texttt{""}\}$
    $L(\texttt{'c'}) = \{\texttt{"c"}\}$
    $L(A + B) = L(A) \cup L(B)$
    $L(AB) = \{\, ab \mid a \in L(A) \text{ and } b \in L(B) \,\}$
    $L(A^{*}) = \bigcup_{i \ge 0} L(A^{i})$

Example: Keyword
Keyword: “else” or “if” or “begin” or …
    $\texttt{'else'} + \texttt{'if'} + \texttt{'begin'} + \cdots$
Note: 'else' abbreviates 'e''l''s''e'
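
The meaning function L can be made concrete by representing each language as a Python set of strings. This is a sketch for intuition only. The helper names (`union`, `concat`, `power`, `star_up_to`, `lit`) are invented here, and since A* is an infinite set whenever A contains a non-empty string, `star_up_to` computes only the finite part up to a chosen bound n.

```python
def union(a: set[str], b: set[str]) -> set[str]:
    """L(A + B) = L(A) ∪ L(B)."""
    return a | b


def concat(a: set[str], b: set[str]) -> set[str]:
    """L(AB) = { ab | a in L(A) and b in L(B) }."""
    return {x + y for x in a for y in b}


def power(a: set[str], i: int) -> set[str]:
    """L(A^i): A concatenated with itself i times; A^0 = {""}."""
    result = {""}
    for _ in range(i):
        result = concat(result, a)
    return result


def star_up_to(a: set[str], n: int) -> set[str]:
    """Finite approximation of L(A*): the union of L(A^i) for 0 <= i <= n."""
    result: set[str] = set()
    for i in range(n + 1):
        result |= power(a, i)
    return result


def lit(word: str) -> set[str]:
    """'else' abbreviates 'e''l''s''e': concatenate single-character languages."""
    lang = {""}
    for c in word:
        lang = concat(lang, {c})
    return lang


# The part of the Keyword example that the slide spells out.
keyword = union(union(lit("else"), lit("if")), lit("begin"))
```

For instance, `star_up_to({"a"}, 3)` is the set containing the empty string, "a", "aa" and "aaa", and `keyword` is the three-element set of the listed keywords.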
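
Example: Integers
Integer: a non-empty string of digits
    $\mathit{digit} = \texttt{'0'} + \texttt{'1'} + \texttt{'2'} + \texttt{'3'} + \texttt{'4'} + \texttt{'5'} + \texttt{'6'} + \texttt{'7'} + \texttt{'8'} + \texttt{'9'}$
    $\mathit{integer} = \mathit{digit}\ \mathit{digit}^{*}$
Abbreviation: $A^{+} = AA^{*}$

Example: Identifier
Identifier: strings of letters or digits, starting with a letter
    $\mathit{letter} = \texttt{'A'} + \cdots + \texttt{'Z'} + \texttt{'a'} + \cdots + \texttt{'z'}$
    $\mathit{identifier} = \mathit{letter}\ (\mathit{letter} + \mathit{digit})^{*}$
Is $(\mathit{letter}^{*} + \mathit{digit}^{*})$ the same?

Example: Whitespace
Whitespace: a non-empty sequence of blanks, newlines, and tabs
    $(\texttt{' '} + \texttt{'\textbackslash n'} + \texttt{'\textbackslash t'})^{+}$

Example 1: Phone Numbers
• Regular expressions are all around you!
• Consider +46(0)18-471-1056
    $\Sigma = \mathit{digits} \cup \{\, +, -, (, ) \,\}$
    $\mathit{country} = \mathit{digit}\ \mathit{digit}$
    $\mathit{city} = \mathit{digit}\ \mathit{digit}$
    $\mathit{univ} = \mathit{digit}\ \mathit{digit}\ \mathit{digit}$
    $\mathit{extension} = \mathit{digit}\ \mathit{digit}\ \mathit{digit}\ \mathit{digit}$
    $\mathit{phone\_num} = \texttt{'+'}\ \mathit{country}\ \texttt{'('}\ \texttt{'0'}\ \texttt{')'}\ \mathit{city}\ \texttt{'-'}\ \mathit{univ}\ \texttt{'-'}\ \mathit{extension}$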
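
The examples above translate almost symbol for symbol into Python's `re` syntax, which gives a quick way to experiment with them. Two caveats: in `re`, union is written `|` while `+` means "one or more" (the slides' A+ abbreviation), and the constant names below are assumptions of this sketch. The pattern `LETTERS_OR_DIGITS` encodes letter* + digit* so that the question on the Identifier slide can be checked.

```python
import re

DIGIT = r"[0-9]"
LETTER = r"[A-Za-z]"

INTEGER = re.compile(rf"{DIGIT}{DIGIT}*")                      # digit digit*
IDENTIFIER = re.compile(rf"{LETTER}(?:{LETTER}|{DIGIT})*")     # letter (letter + digit)*
LETTERS_OR_DIGITS = re.compile(rf"(?:{LETTER}*|{DIGIT}*)")     # letter* + digit*
WHITESPACE = re.compile(r"(?: |\n|\t)+")                       # (' ' + '\n' + '\t')+

# Phone number parts, built by repetition exactly as on the slide.
COUNTRY = DIGIT * 2
CITY = DIGIT * 2
UNIV = DIGIT * 3
EXTENSION = DIGIT * 4
PHONE_NUM = re.compile(rf"\+{COUNTRY}\(0\){CITY}-{UNIV}-{EXTENSION}")

if __name__ == "__main__":
    print(bool(PHONE_NUM.fullmatch("+46(0)18-471-1056")))
    # "a1" is an identifier, but it is neither all letters nor all digits.
    print(bool(IDENTIFIER.fullmatch("a1")), bool(LETTERS_OR_DIGITS.fullmatch("a1")))
    # The empty string and "123" are in letter* + digit* but are not identifiers.
    print(bool(LETTERS_OR_DIGITS.fullmatch("123")), bool(IDENTIFIER.fullmatch("123")))
```

So the two expressions are not the same: letter* + digit* rejects mixed strings such as a1 and accepts the empty string and pure digit strings, neither of which is an identifier.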

Example 2: Email Addresses
• Consider kostis@it.uu.se
    $\Sigma = \mathit{letters} \cup \{\, ., @ \,\}$
    $\mathit{name} = \mathit{letter}^{+}$
    $\mathit{address} = \mathit{name}\ \texttt{'@'}\ \mathit{name}\ \texttt{'.'}\ \mathit{name}\ \texttt{'.'}\ \mathit{name}$

Summary
• Regular expressions describe many useful languages
• Regular languages are a language specification
  – We still need an implementation
• Next time: Given a string s and a regular expression R, is $s \in L(R)$?
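
Until an implementation is developed, the membership question can be explored with an existing regular expression engine. The sketch below borrows Python's `re` module as a stand-in. It says nothing about how the test is actually implemented, and `in_language`, `NAME` and `ADDRESS` are names made up for the example.

```python
import re

NAME = r"[A-Za-z]+"                                       # name = letter+
ADDRESS = re.compile(rf"{NAME}@{NAME}\.{NAME}\.{NAME}")   # name '@' name '.' name '.' name


def in_language(s: str, r: re.Pattern[str]) -> bool:
    """Answer "is s in L(R)?" by requiring the whole string to match."""
    return r.fullmatch(s) is not None


if __name__ == "__main__":
    print(in_language("kostis@it.uu.se", ADDRESS))
    print(in_language("kostis@it.uu", ADDRESS))
```

Note that `fullmatch` is used rather than `search` or `match`: membership in L(R) concerns the entire string s, not some substring or prefix of it.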