
CS502: Compiler Design, Lexical Analysis (Cont.), Manas Thakur, Fall 2020



  1. CS502: Compiler Design Lexical Analysis (Cont.) Manas Thakur Fall 2020

  2. Recognizing strings in regular languages
     ● Done using finite automata.
     ● A finite automaton is a kind of state machine:
       – A state remembers what matters about the string read so far.
       – An arrow from state A to state B over a character c denotes a transition (state change).
     ● Formally, a finite automaton M consists of 5 components:
       – Q : Set of states
       – Σ : Set of input symbols (alphabet)
       – q : Initial state
       – F : Set of final (accept) states
       – δ : Transition function
     ● M accepts a string w if reading w, starting from the initial state, takes M to an accept state.
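
The acceptance condition can be written more precisely with an extended transition function. The notation below (the hat on δ, and the assumption that δ is deterministic) is standard textbook notation rather than something taken from the slide:

```latex
\begin{align*}
M &= (Q, \Sigma, \delta, q, F), \qquad \delta : Q \times \Sigma \to Q \\
\hat{\delta}(p, \varepsilon) &= p \\
\hat{\delta}(p, wc) &= \delta\bigl(\hat{\delta}(p, w),\, c\bigr)
  \qquad (w \in \Sigma^{*},\ c \in \Sigma) \\
M \text{ accepts } w &\iff \hat{\delta}(q, w) \in F
\end{align*}
```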

  3. Regular expression to finite automata
     ● Exercise: Write a regular expression that accepts your first name and nothing else!
     ● Finite automaton that accepts “manas”:
       q0 --m--> q1 --a--> q2 --n--> q3 --a--> q4 --s--> q5
       – Q: {q0, q1, q2, q3, q4, q5}
       – Σ: {a..z}
       – q: q0
       – F: {q5}
       – δ: {<q0,m,q1>, <q1,a,q2>, <q2,n,q3>, <q3,a,q4>, <q4,s,q5>}
     ● “Invisible” arrows (the transitions not drawn) lead to an error state.
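
As a minimal sketch, the five components of the “manas” automaton can be written down directly as data and run over an input string. The Python names (`DELTA`, `ERROR`, `accepts`) are illustrative and not from the slides; the explicit `"error"` state stands in for the invisible arrows.

```python
# Components of the automaton from the slide (Python names are illustrative).
STATES = {"q0", "q1", "q2", "q3", "q4", "q5"}
ALPHABET = set("abcdefghijklmnopqrstuvwxyz")
START = "q0"
ACCEPT = {"q5"}
DELTA = {
    ("q0", "m"): "q1",
    ("q1", "a"): "q2",
    ("q2", "n"): "q3",
    ("q3", "a"): "q4",
    ("q4", "s"): "q5",
}
ERROR = "error"  # assumed name for the target of every "invisible" arrow


def accepts(word: str) -> bool:
    """Return True if reading `word` from START ends in an accept state."""
    state = START
    for ch in word:
        # A missing table entry is an undrawn arrow to the error state.
        # The error state has no entries itself, so it is never left.
        state = DELTA.get((state, ch), ERROR)
    return state in ACCEPT
```

For example, `accepts("manas")` holds, while `accepts("mana")` and `accepts("manass")` do not.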

  4. Finite automaton to recognize identifiers
     ● Regular expression: letter(letter|digit)*
     ● Transitions in the diagram:
       – q0 --letter--> q1
       – q1 --letter, digit--> q1
       – q1 --other--> q2
       – q0 --digit, other--> q3

  5. Tables for the recognizer
     ● Two tables control the recognizer: [tables not included in this transcript]
     ● To change languages, we can just change tables.

  6. Code for the recognizer
     ● Given an automaton, can we write a recognizer for a token?
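
The tables and code from these two slides did not survive in the transcript, so what follows is only a sketch of one common way to do it. It assumes the two tables are a character-class table and a next-state table, that q2 is the accept state, and that q3 is the error state of the identifier automaton. The names `char_class`, `NEXT_STATE` and `scan_identifier` are hypothetical.

```python
import string


def char_class(ch: str) -> str:
    """Table 1 (assumed): map a character to its class."""
    if ch == "":
        return "other"  # end of input behaves like a non-identifier character
    if ch in string.ascii_letters:
        return "letter"
    if ch in string.digits:
        return "digit"
    return "other"


# Table 2 (assumed): next state for each (state, character class).
NEXT_STATE = {
    "q0": {"letter": "q1", "digit": "q3", "other": "q3"},
    "q1": {"letter": "q1", "digit": "q1", "other": "q2"},
}
ACCEPT, ERROR = "q2", "q3"


def scan_identifier(text: str, start: int = 0):
    """Return (lexeme, end_position) for an identifier at `start`, else None."""
    state = "q0"
    pos = start
    while state not in (ACCEPT, ERROR):
        ch = text[pos] if pos < len(text) else ""
        state = NEXT_STATE[state][char_class(ch)]
        if state == "q1":
            pos += 1  # consume the letter or digit; "other" is left unread
    if state == ACCEPT:
        return text[start:pos], pos
    return None
```

The driver loop never mentions letters or digits. Recognizing a different token class only requires different tables, which is the point of the previous slide.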

  7. Classwork (here + denotes alternation)
     ● Draw an FA that recognizes strings over the alphabet {a, b} with exactly three b's.
       – a*ba*ba*ba*
     ● Strings of length > 1 starting and ending with a.
       – a(a+b)*a
     ● Strings with a as the third-last letter.
       – (a+b)*a(a+b)(a+b)
     ● Do you see non-determinism in the last two FAs?
       – We have actually constructed non-deterministic FAs (the first one being a deterministic FA)!
     ● Not for CS502:
       – Conversion of NFA to DFA, minimization of DFA, ...
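
To see the non-determinism concretely, here is a sketch of the NFA for (a+b)*a(a+b)(a+b), simulated by tracking every state the machine could be in. The state names `s0`..`s3` and the function `nfa_accepts` are illustrative, not from the slides.

```python
# NFA for (a+b)*a(a+b)(a+b); each entry maps to a SET of possible next states.
NFA_DELTA = {
    ("s0", "a"): {"s0", "s1"},  # the guess: this 'a' may be the third-last letter
    ("s0", "b"): {"s0"},
    ("s1", "a"): {"s2"},
    ("s1", "b"): {"s2"},
    ("s2", "a"): {"s3"},
    ("s2", "b"): {"s3"},
}
NFA_START = {"s0"}
NFA_ACCEPT = {"s3"}


def nfa_accepts(word: str) -> bool:
    """Accept if at least one sequence of choices ends in an accept state."""
    current = set(NFA_START)
    for ch in word:
        nxt = set()
        for state in current:
            # Missing entries mean "no move", so that branch simply dies.
            nxt |= NFA_DELTA.get((state, ch), set())
        current = nxt
    return bool(current & NFA_ACCEPT)
```

On reading an a in `s0`, the machine has two possible moves, which is exactly what a deterministic FA forbids.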

  8. Automatic construction of lexers
     ● JavaCC: Popular lexical analyzer (and also parser) generator.
       – file.jj --javacc--> Lexer in Java --javac--> Java bytecode
       – Input stream --java--> Tokens
     ● Takes regular expressions as input.
     ● Constructs equivalent finite automata.
     ● Emits code for the scanner.
     ● Lex/Flex: another popular lexer generator, written in C.

  9. JavaCC regexes in action
     ● BREAKING NEWS:
       – You can start doing A1 today (spec on Moodle before eod).

  10. Errors in lexical analysis
     ● It is difficult for a lexer to identify errors.
       – Limited resources: e.g., no context information.
     ● fi (a = f(x))
       – Is fi a misspelling of if, or a function identifier?
     ● As fi is a valid lexeme for the token identifier, the lexer must return the token <id, fi>.
     ● A later phase (parser or semantic analyzer) may be able to catch the error.
     ● But some errors can be caught by a lexer:
       – int %x;
       – if (a < b);$
     ● What should a lexer do on detecting an error?

  11. Error handling in lexical analysis
     ● Panic and exit(1).
     ● Try to recover from the error and proceed. Why?
     ● We are a compiler, not an interpreter! A compiler should report as many errors as it can in a single run.

  12. Error recovery in lexical analysis
     ● Delete one character from the input.
     ● Insert a missing character into the remaining input.
       – Which one?
     ● Replace a character by another character.
     ● Transpose two adjacent characters.
     ● Theoretical problem: Find the smallest number of transformations (add, delete, replace) needed to convert a source program to one that consists only of valid lexemes.
       – Too expensive in practice.
     ● In practice, most lexical errors involve a single character.
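
The first strategy (delete one character and carry on) can be sketched as a small regex-driven lexer. The token classes and patterns in `TOKEN_SPEC` are assumptions chosen only to cover the examples on the earlier slide. This is not JavaCC and not the course's lexer.

```python
import re

# Assumed token patterns, tried in order; earlier entries win on a tie.
TOKEN_SPEC = [
    ("kw", r"(?:if|int)\b"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("num", r"[0-9]+"),
    ("op", r"[=<();]"),
    ("ws", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source: str):
    """Return (tokens, errors); an error is (position, offending character)."""
    tokens, errors = [], []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            # Recovery: record the bad character, delete it, keep scanning.
            errors.append((pos, source[pos]))
            pos += 1
            continue
        if m.lastgroup != "ws":  # whitespace separates tokens but is dropped
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens, errors
```

With these patterns, `tokenize("fi (a = f(x))")` reports no errors and begins with `("id", "fi")`. `tokenize("int %x;")` reports the single error `(4, "%")` and still returns the tokens for `int`, `x` and `;`.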

  13. Limits of Regular Languages
     ● Not all languages are regular.
     ● Try constructing an FA for the following languages:
       – L = {0ⁿ1ⁿ}
       – L = {wcwʳ | w ∈ Σ*}
       Note: neither of these is a regular expression!
     ● FAs cannot count properly!
     ● However, this is a little subtle. One can construct FAs for:
       – Alternating 0s and 1s
         ● (ε | 1)(01)*(ε | 0)
       – Sets of pairs of 0s and 1s
         ● (01 | 10)+
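
The two regular examples can be checked quickly with Python's `re` module. The translation is an assumption of this sketch: (ε | x) is written as `x?`, and the helper names are illustrative.

```python
import re

# (ε | 1)(01)*(ε | 0): strings in which 0s and 1s strictly alternate.
ALTERNATING = re.compile(r"1?(?:01)*0?")

# (01 | 10)+: one or more pairs, each pair being 01 or 10.
PAIRS = re.compile(r"(?:01|10)+")


def is_alternating(s: str) -> bool:
    return ALTERNATING.fullmatch(s) is not None


def is_pairs(s: str) -> bool:
    return PAIRS.fullmatch(s) is not None
```

Neither pattern needs to count. Each only has to remember a bounded amount about the last character or two, which a finite number of states can do.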

  14. What next?
     ● Learn a language that could recognize L = {0ⁿ1ⁿ} and L = {wcwʳ | w ∈ Σ*}!
     ● Why do we care?
       – Do you find any similarity between the above and recognizing:
         ● Matching parentheses/blocks?
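
The similarity shows up when both checks are written as ordinary code: each needs a counter that can grow without bound, which a finite automaton lacks. This sketch is plain Python, not the formalism the course introduces next. It assumes n ≥ 0, so the empty string is accepted, and the function names are illustrative.

```python
def is_0n1n(s: str) -> bool:
    """Recognize 0^n 1^n by counting the 0s, then counting the 1s."""
    i = 0
    zeros = 0
    while i < len(s) and s[i] == "0":
        zeros += 1
        i += 1
    ones = 0
    while i < len(s) and s[i] == "1":
        ones += 1
        i += 1
    # The whole input must be consumed and the two counts must agree.
    return i == len(s) and zeros == ones


def is_balanced(s: str) -> bool:
    """Recognize balanced parentheses with a nesting-depth counter."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' with no matching '(' before it
                return False
    return depth == 0
```

Replacing 0 by ( and 1 by ) turns every string of the first language into a balanced string, which is the connection the slide hints at.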
