CS502: Compiler Design
Lexical Analysis (Cont.)
Manas Thakur, Fall 2020

Recognizing strings in regular languages
● Done using finite automata.
● A finite automaton is a kind of state machine:
  – A state summarizes what the machine remembers about the string read so far.
  – An arrow from state A to state B labelled with a character c denotes a transition (state change) on reading c.
● Formally, a finite automaton M consists of 5 components:
  – Q: set of states
  – Σ: set of input symbols (alphabet)
  – q: initial state
  – F: set of final (accept) states
  – δ: transition function
● M accepts a string w if reading w from the initial state takes M to an accept state.

Regular expression to finite automata
● Exercise: Write a regular expression that matches your first name and nothing else!
● Finite automaton that accepts “manas”:
    q0 --m--> q1 --a--> q2 --n--> q3 --a--> q4 --s--> q5
  – Q: {q0, q1, q2, q3, q4, q5}
  – Σ: {a..z}
  – q: q0
  – F: {q5}
  – δ: {<q0,m,q1>, <q1,a,q2>, <q2,n,q3>, <q3,a,q4>, <q4,s,q5>}
● “Invisible” arrows (transitions that are not drawn) lead to an error state.

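The five components above map directly onto plain data. The following is a minimal Python sketch, not code from the course: the names `DELTA`, `ERROR`, and `accepts` are hypothetical, and the error state is modelled by treating any missing entry in the transition table as an "invisible" arrow.

```python
# The five components of the automaton for "manas", as plain Python values.
Q = {"q0", "q1", "q2", "q3", "q4", "q5"}
SIGMA = set("abcdefghijklmnopqrstuvwxyz")
START = "q0"
F = {"q5"}
DELTA = {
    ("q0", "m"): "q1",
    ("q1", "a"): "q2",
    ("q2", "n"): "q3",
    ("q3", "a"): "q4",
    ("q4", "s"): "q5",
}
ERROR = "qerr"  # hypothetical name for the implicit error state


def accepts(w: str) -> bool:
    """Return True if reading w from START ends in an accept state."""
    state = START
    for c in w:
        # Any transition missing from DELTA is an "invisible" arrow to ERROR;
        # ERROR has no outgoing entries, so the machine stays there.
        state = DELTA.get((state, c), ERROR)
    return state in F
```

With this sketch, `accepts("manas")` holds, while prefixes such as `"mana"` and extensions such as `"manass"` are rejected.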
Finite automaton to recognize identifiers
Regular expression: letter(letter|digit)*
Transitions in the diagram:
  – q0 on letter → q1
  – q1 on letter, digit → q1
  – q1 on other → q2 (accept)
  – q0 on digit, other → q3 (error)

Tables for the recognizer
● Two tables control the recognizer: [figure: the two tables]
● To change languages, we can just change tables.

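The slide's own tables are not reproduced in this text, so the sketch below assumes the usual pair for a table-driven recognizer: a character classifier and a next-state table, here filled in for the identifier automaton. All names (`char_class`, `NEXT_STATE`, `recognize`) are illustrative, and treating end of input as an "other" character is an assumption made so that whole strings can be checked.

```python
def char_class(c: str) -> str:
    """Table 1 (assumed): map each input character to a character class."""
    if c.isascii() and c.isalpha():
        return "letter"
    if c.isascii() and c.isdigit():
        return "digit"
    return "other"


# Table 2 (assumed): next state for each (state, character class) pair.
NEXT_STATE = {
    "q0": {"letter": "q1", "digit": "q3", "other": "q3"},
    "q1": {"letter": "q1", "digit": "q1", "other": "q2"},
    "q2": {"letter": "q3", "digit": "q3", "other": "q3"},
    "q3": {"letter": "q3", "digit": "q3", "other": "q3"},
}
ACCEPT = {"q2"}


def recognize(word: str, classify=char_class, table=NEXT_STATE, accept=ACCEPT) -> bool:
    """Run the table-driven automaton over word; the driver never changes."""
    state = "q0"
    for c in word:
        state = table[state][classify(c)]
    # Assumption: end of input behaves like one trailing "other" character.
    state = table[state]["other"]
    return state in accept
```

The driver loop contains no knowledge of identifiers; passing a different `classify`, `table`, and `accept` is what "just change tables" amounts to in this sketch.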
Code for the recognizer
● Given an automaton, can we write a recognizer for a token?

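One possible answer, as a hedged sketch rather than the code shown in class: the identifier automaton can be coded directly, with the "other" transition acting as a one-character lookahead that ends the token without being consumed. The function name `scan_identifier` and its return convention are assumptions for illustration.

```python
def scan_identifier(text: str, pos: int = 0):
    """Return (lexeme, next_pos) for an identifier starting at pos, else None."""
    state = "q0"
    i = pos
    while True:
        # "" stands for end of input and is classified as "other".
        c = text[i] if i < len(text) else ""
        if c.isascii() and c.isalpha():
            kind = "letter"
        elif c.isascii() and c.isdigit():
            kind = "digit"
        else:
            kind = "other"

        if state == "q0":
            if kind != "letter":
                return None  # q3: not an identifier
            state = "q1"
        elif kind == "other":
            # q2: accept; the lookahead character at i is left unread.
            return text[pos:i], i
        i += 1
```

For example, scanning `"count1 = 0"` from position 0 yields the lexeme `count1` and leaves the scanner at the following space, so the next token can be scanned from there.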
Classwork
● Draw an FA that recognizes strings over the alphabet {a, b} with exactly three bs.
  – a*ba*ba*ba*
● Strings of length > 1 starting and ending with a.
  – a(a+b)*a
● Strings with a as the third-last letter.
  – (a+b)*a(a+b)(a+b)
● Do you see non-determinism in the last two FAs?
  – We have actually constructed non-deterministic FAs for them (the first one being a deterministic FA)!
● Not for CS502:
  – Conversion of NFA to DFA, minimization of DFA, ...

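To make the non-determinism concrete, here is a small sketch that runs an NFA for (a+b)*a(a+b)(a+b) by tracking the set of states it could be in. The state numbering and the names `NFA` and `nfa_accepts` are assumptions; this is a direct simulation, not the NFA-to-DFA conversion mentioned above.

```python
# State 0 loops on a and b; on a it may also "guess" that this a is the
# third-last letter and move to state 1. State 3 is the only accept state.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "a"): {2},
    (1, "b"): {2},
    (2, "a"): {3},
    (2, "b"): {3},
}


def nfa_accepts(w: str) -> bool:
    """Accept if at least one sequence of choices ends in state 3."""
    current = {0}
    for c in w:
        # Follow every possible transition from every state we might be in.
        current = set().union(*(NFA.get((s, c), set()) for s in current))
    return 3 in current
```

The entry `(0, "a"): {0, 1}` is the non-deterministic step: on reading a in state 0 there are two possible next states.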
Automatic construction of lexers
● JavaCC: Popular lexical analyzer (and also parser) generator
  – Pipeline: file.jj → javacc → lexer in Java → javac → Java bytecode → java (input stream in, tokens out)
● Takes regular expressions as input.
● Constructs equivalent finite automata.
● Emits code for the scanner.
● Lex/Flex: other popular lexer generators, which emit scanners in C.

JavaCC regexes in action
● BREAKING NEWS:
  – You can start doing A1 today (spec on Moodle before eod).

Errors in lexical analysis
● It is difficult for a lexer to identify errors.
  – Limited resources: e.g., no context information.
● fi (a = f(x))
  – Is fi a misspelling of if, or a function identifier?
● As fi is a valid lexeme for the token identifier, the lexer must return the token <id, fi>.
● A later phase (parser or semantic analyzer) may be able to catch the error.
● But some errors can be caught by a lexer:
  – int %x;
  – if (a < b);$
● What should a lexer do on detecting an error?

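The two situations can be seen in a toy lexer. This is a sketch under assumptions: the token set (identifiers, numbers, a few punctuation characters, keywords `if` and `int`) and the names `TOKEN_RE`, `LexError`, and `tokenize` are invented for illustration and are not the lexer used in the course.

```python
import re

# Assumed toy token set: identifiers, numbers, and a few punctuation marks.
TOKEN_RE = re.compile(
    r"(?P<id>[A-Za-z][A-Za-z0-9]*)|(?P<num>[0-9]+)|(?P<punct>[()=<;])"
)
KEYWORDS = {"if", "int"}


class LexError(Exception):
    """Raised when no token can start at the current input position."""


def tokenize(src: str):
    tokens, pos = [], 0
    while pos < len(src):
        if src[pos].isspace():
            pos += 1
            continue
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise LexError(f"unexpected character {src[pos]!r} at offset {pos}")
        kind, lexeme = m.lastgroup, m.group()
        if kind == "id" and lexeme in KEYWORDS:
            kind = lexeme  # keywords get their own token kind
        tokens.append((kind, lexeme))
        pos = m.end()
    return tokens
```

Here `tokenize("fi (a = f(x))")` succeeds and starts with the token `("id", "fi")`, since nothing at the lexical level distinguishes a misspelt keyword from an identifier, whereas `int %x;` and `if (a < b);$` raise `LexError` at the stray `%` and `$`.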
Error handling in lexical analysis
● Panic and exit(1).
● Try to recover from the error and proceed.
  – Why?
● We are a compiler, not an interpreter: we want to report as many errors as possible in a single run.

Error recovery in lexical analysis
● Delete one character from the input.
● Insert a missing character into the remaining input.
  – Which one?
● Replace a character by another character.
● Transpose two adjacent characters.
● Theoretical problem: Find the smallest number of transformations (add, delete, replace) needed to convert a source program to one that consists only of valid lexemes.
  – Too expensive in practice.
● In practice, most lexical errors involve a single character.

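The four single-character transformations can be enumerated mechanically. The sketch below is illustrative only: `single_edit_repairs`, `is_identifier`, and the tiny insertion alphabet are assumptions, and a real lexer would more likely pick one cheap strategy (such as deleting the offending character) than enumerate candidates.

```python
import re

IDENT = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_identifier(s: str) -> bool:
    return IDENT.fullmatch(s) is not None


def single_edit_repairs(lexeme: str, is_valid, alphabet: str):
    """List every valid lexeme reachable from lexeme by one transformation."""
    n = len(lexeme)
    candidates = set()
    for i in range(n):
        candidates.add(lexeme[:i] + lexeme[i + 1:])          # delete
        for c in alphabet:
            candidates.add(lexeme[:i] + c + lexeme[i + 1:])  # replace
    for i in range(n + 1):
        for c in alphabet:
            candidates.add(lexeme[:i] + c + lexeme[i:])      # insert
    for i in range(n - 1):
        candidates.add(                                       # transpose
            lexeme[:i] + lexeme[i + 1] + lexeme[i] + lexeme[i + 2:]
        )
    candidates.discard(lexeme)
    return sorted(s for s in candidates if is_valid(s))
```

For the bad lexeme `%x` with insertion alphabet `"ab"`, the valid repairs are `ax`, `bx`, and `x`: even a single-character error admits several fixes, which is the "Which one?" problem.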
Limits of Regular Languages
● Not all languages are regular.
● Try constructing an FA for the following languages:
  – L = {0^n 1^n}
  – L = {w c w^R | w ∈ Σ*}, where w^R is the reverse of w
  – Note: neither of these is a regular expression!
● FAs cannot count properly: a finite set of states cannot remember an unbounded n.
● However, this is a little subtle. One can construct FAs for:
  – Alternating 0s and 1s
    ● (ε | 1)(01)*(ε | 0)
  – Sets of pairs of 0s and 1s
    ● (01 | 10)+

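The contrast can be tried out in code. In this sketch the two regular languages are written as Python `re` patterns (ε becomes an optional `?`), while 0^n 1^n is checked by explicitly counting, which is exactly what a finite automaton cannot do for unbounded n. The names are illustrative, and accepting the empty string as the case n = 0 is an assumption.

```python
import re

ALTERNATING = re.compile(r"1?(01)*0?")  # (ε | 1)(01)*(ε | 0)
PAIRS = re.compile(r"(01|10)+")         # (01 | 10)+


def is_0n1n(s: str) -> bool:
    """Check membership in {0^n 1^n} by counting the leading 0s."""
    zeros = len(s) - len(s.lstrip("0"))  # unbounded counter: needs memory
    ones = s[zeros:]
    return set(ones) <= {"1"} and len(ones) == zeros
```

`ALTERNATING.fullmatch("1010")` and `PAIRS.fullmatch("0110")` both succeed without any counting, whereas `is_0n1n` has to compare two lengths that can grow without bound.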
What next?
● Learn a language that could recognize L = {0^n 1^n} and L = {w c w^R | w ∈ Σ*}!
● Why do we care?
  – Do you find any similarity between the above and recognizing:
    ● Matching parentheses/blocks?