TDT4205 Recitation 3 Lexical analysis ● Last week: – Make and makefiles – Text filters inside and out – Some C, idiomatically ● Today: problem set 2 – We've raced through the preliminaries, time for compiler stuff (yay!) – Analysis by hand, and by generated analyzer – (This lecture is given both Monday and Thursday, to keep everyone on board even with the off-beat timing)
Today's tedious practical matter ● The exercises are part of your evaluation – I'm not the one holding the ultimate responsibility that your evaluation is fair – Thus, I can't decide on any kind of differential treatment – In plain English, I cannot extend your deadlines – No, not even for world tours, moon landings or funerals – Where it says “contact the instructor”, that's Dr. Elster – (Generally, after Feb. 15th the deadlines harden)
Worthwhile questions in plenary (This one is from rec. 1, but I gave a somewhat foofy answer at the time...) ● Does it make a difference whether main ends in “return int” or “exit(int)”? – As it turns out, no. – The reason I hesitated was that one can register function calls to happen at program exit (with function pointers and the atexit function). – This mechanism is required to behave the same in both cases, so it's really a clear case. (Live and learn...) ● For myself, I'll keep writing exit for “stop the program” and return for “stop the function” (unless there turns out to be a good reason why it's silly).
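● A tiny experiment illustrates it (the handler name here is made up for the example): whatever main ends in, the handler registered with atexit runs.

#include <stdio.h>
#include <stdlib.h>

static void goodbye ( void )             /* runs at program exit          */
{
    printf ( "atexit handler ran\n" );
}

int main ( void )
{
    atexit ( goodbye );
    return 0;                            /* swap for exit(0); same output */
}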
Where we're at ● Things needed to – Submit homework (pdfs and tarballs) – Build programs (make, cc) – Build scanners (Lex/flex) – Build parsers (Yacc/bison) – Build symbol tables (hashtables/libghthash) – Assemble machine code (as) ● ...but first, a bit of handwaving
The science of computer languages: even experts reach for magical metaphors – Battle with a ferocious dragon – “The spirit which lives in the computer”
My humble perspective on the subject ● Compiler construction lies at an intersection between – Hardware architecture (very practical things) – Software architecture (very complicated things) – Complexity theory (very theoretical things) – Theories of language (very non-computery things) ● What's cool about it is that handling the resulting calamity in the middle is a success story of computer science ● Even so, the dragon's sweater reads “complexity of compiler design”, and the knight's sword is a parser generator ● Moral: bring tools to the job – Dragons find hero programmers crunchy, and good with ketchup
General terminology: bits and bobs of languages ● Different language models are suitable depending on what you want to look at: – Lexical models say what goes into a word, and where words are separated from each other – Syntactical models tell which roles a given word can play in a statement it is part of – Semantics speak of what it means when a given word appears playing a given role ● There's a whole heap of other stuff which isn't usually applied to programming languages (morphology, pragmatics, …) ● What we're after today is lumping characters into words.
Lexical analysis, the ad-hoc way ● Say that we want to recognize an arbitrary fraction; should be easy, <ctype.h> is full of functions to classify characters... – read character – while ( isdigit ( character ) ) { read another } – if ( character != '/' ) { die horribly } – while ( isdigit ( character ) ) { keep reading } ● First loop takes an integer numerator ● Second loop takes an integer denominator ● Condition in the middle requires that they're separated by what we expect. ● This works if you only have a few different words in your care.
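● For the record, a minimal C version of that recipe could read as follows (the function name and the choice to simply exit on bad input are mine for illustration, not part of the exercise code):

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

static void read_fraction ( void )
{
    int c = getchar ();

    if ( ! isdigit(c) )                 /* need at least one digit       */
        exit ( EXIT_FAILURE );
    while ( isdigit(c) )                /* first loop: the numerator     */
        c = getchar ();

    if ( c != '/' )                     /* the separator we expect       */
        exit ( EXIT_FAILURE );

    c = getchar ();
    if ( ! isdigit(c) )                 /* need at least one digit       */
        exit ( EXIT_FAILURE );
    while ( isdigit(c) )                /* second loop: the denominator  */
        c = getchar ();
}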
The automaton way I ● DFAs are good too, they chew a character at a time ● Looking at the state diagram, each state has a finite number of transitions... ● ...so we can code them up in a finite amount of time. ● Here goes: – if ( state=1 and c='a', or state=1 and c='b', or... ) { state = 14; /* lowercase letters go to 14 */ } – else if ( state=1 and c='0', or state=1 and c='1', or... ) { state = 42; /* digits in state 1...*/ } – else if (… else if... ● (I'm beginning to think this wasn't such a fantastic idea after all)
The automaton way II ● DFAs can be tabulated. – First, punch in the table – Next, set a start state – Loop while state isn't known to accept or reject: ● next state = table[this state][character] ● (A recipe like this is in the Dragon book, p. 151) ● Wonder of wonders, one algorithm will work for any DFA, just change the table! ● This is pretty much token_get in Task 2, it's not that hard.
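● As a sketch, the whole traversal fits in a few lines of C (the table layout, the accept marking and the state numbering are placeholders here; vm.c has its own):

/* One loop works for any DFA: to change the language, change the table. */
int run_dfa ( int table[][256], const int accepting[], const char *input )
{
    int state = 0;                              /* the start state        */

    for ( ; *input != '\0'; input++ )
    {
        state = table[state][(unsigned char) *input];
        if ( state == -1 )                      /* unnoted transition     */
            return 0;                           /* reject                 */
    }
    return accepting[state];                    /* nonzero iff accepting  */
}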
Surrounding logic of vm.c ● Basically, it's like last week's text filter description, but with tokens: T = token_get(); while ( T != wrong ) { do_something ( T ); T = token_get(); } ● 'token_get' is a little more involved than 'readchar()', but it's still just an integer to branch on ● The 'do_something' is already in place, you won't have to write that
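● Fleshed out a touch, the driver could look like the sketch below (the prototypes and the “negative means wrong” convention are assumptions made for the illustration; the real interface is the one in vm.c):

#include <stdio.h>

extern int  token_get ( FILE *stream );   /* the scanner we're writing   */
extern void do_something ( int token );   /* already in place in vm.c    */

void run ( FILE *stream )
{
    int T = token_get ( stream );
    while ( T >= 0 )                      /* loop until a 'wrong' token  */
    {
        do_something ( T );
        T = token_get ( stream );
    }
}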
Inside token_get: where did I leave my chars? ● DFAs have horrible short-term memory, they barely know where they are. ● When the time comes to accept: – What are we accepting? (Answer is the token) – Why did we accept this? (Answer is the lexeme) ● At the accept state in the PS2 diagram, neither is known ● To fix this, impart a sense of history to your code: – The 2nd-to-last state determines the token, so it can be set then, to be recalled on reaching the accept state – There's a buffer 'lexeme' to plop each char into as you go along, to tell “127” from “74” even though they both match to integer tokens
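● One way to give the code that sense of history, as a rough sketch (the buffer size, the variable names and the way the token is decided are illustrative; the real table, states and token values are the ones in vm.c and parser.h):

#include <stdio.h>

#define LEXEME_MAX 64                     /* illustrative size            */

extern int table[][256];                  /* assumed: the DFA table       */
static char lexeme[LEXEME_MAX];           /* the "why": chars matched     */

int token_get ( FILE *stream )
{
    int state = 0, length = 0, token = -1;
    int c;

    while ( (c = fgetc(stream)) != EOF && table[state][c] != -1 )
    {
        state = table[state][c];          /* take the transition          */
        lexeme[length++] = (char) c;      /* remember why we got here     */
        /* Remember the token the new state implies, e.g.                 */
        /* if ( state == DIGIT_STATE ) token = NUMBER;                    */
    }
    if ( c != EOF )
        ungetc ( c, stream );             /* give back the refused char   */
    lexeme[length] = '\0';
    return token;                         /* the "what", set along the way */
}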
A few more notes on vm.c ● The table is all set up, table[state]['a'] gives the transition from state on 'a' – Initially, all lead to state -1, which works for 'reject' – We'll assume transitions not noted lead there – Table is (much) bigger than it has to be, for the convenience of indexing with character codes – There's a macro T(state,c) which expands to table[state][c], this is just to save on the typing. ● The language def. isn't splendidly clean (mixes in whitespace for good measure), but the intention is (hopefully) clear ● The 'lexeme' buffer is fixed-length, and can be easily overrun with long integers. We could fix it, but it's kind of beside the point at the moment, let's assume input is friendly. ● (The stack is finite too, so it won't do long programs.)
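● To make the setup concrete, a hedged sketch of how such a table and the T macro could be arranged (the sizes and the example transitions are made up; the handed-out vm.c already has the real table filled in):

#define N_STATES 64                          /* illustrative bound        */
#define T(state,c) table[(state)][(c)]       /* just saves on the typing  */

static int table[N_STATES][256];

static void table_init ( void )
{
    int s, c;

    for ( s = 0; s < N_STATES; s++ )         /* everything rejects...     */
        for ( c = 0; c < 256; c++ )
            table[s][c] = -1;

    for ( c = '0'; c <= '9'; c++ )           /* ...except what we note:   */
    {
        T(0,c) = 1;                          /* e.g. digits: state 0 -> 1 */
        T(1,c) = 1;                          /* and further digits stay   */
    }
}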
Testing ● There are two files included, one for checking just the tokenizer, and one small program ● “./vm -t” will drop execution, this is used to test with an included list of lexemes ● (In a few cases, the input is deliberately mangled with 'sed', to see if errors come out. Sed is just a text filter which can apply reg.exp. substitutions. It's a handy tool.) ● Just starting “./vm -t” without any pipeline will take stdin from the keyboard. (On most terminals, Ctrl+D signals end-of-file.)
The bridge to Lex ● What we just saw is exactly what Lex does: – Take some regular expressions – Write out the mother lode of a table – Implement the traversal ● The names are a little different: – 'token_get(stdin)' is called yylex() – The lexeme buffer is called yytext ● Major win: the tabulation is automated; less tedious, far less prone to mistakes
Lex specifications: declarations ● The declarations section contains initializer C code, some directives, and optionally, some named regular expressions – “TABSTRING [\t]+” will define a symbolic name TABSTRING for the expression (which here matches a sequence of at least one tab character) – References to these names can go into other expressions in the rules section: {TABSTRING}123 will match a string of at least one tab, followed by '123' – Not necessary, but a boon for readability when expressions grow complicated ● Anything enclosed between '%{' and '%}' in this section will go verbatim in at the start of the generated C code ● There's a nasty macro in there, which gets more attention in a minute
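● Put together, a small declarations section in that spirit might look like this (the include and the definition are just illustrations, not the handed-out scanner.l; the '%%' line is where the rules section begins):

%{
/* Verbatim C, copied to the top of the generated scanner */
#include "parser.h"
%}

TABSTRING [\t]+

%%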
Lex specifications: rules ● The rules section is just regular expressions annotated with basic blocks (semantic actions): – a(a|b)* { return SOMETHING; } will see the yylex() function return SOMETHING whenever what appears on stdin matches an 'a' followed by zero or more ('a' or 'b')-s – Any code can go into the semantic action, it's just a block of C. If it's empty, the reg.exp. will strip text from the input. – A set of token values to return is already included in “parser.h”, so you don't have to invent token values
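● In the same vein, a few illustrative rules (the token names NUMBER and IDENTIFIER stand in for whatever parser.h actually defines):

[0-9]+            { return NUMBER;      /* an integer literal          */ }
[a-z][a-z0-9]*    { return IDENTIFIER;  /* an identifier               */ }
[ \t\n]+          { /* empty action: whitespace just disappears        */ }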
Gritty details ● The one rule already implemented in scanner.l is “. { RETURN(yytext[0]); }”, which matches any one character and returns its ASCII code as a token value. ● Keep this rule (as the last one), it will save us from defining long symbolic names for single-char tokens like '+' and '}' (...even though this overlaps the lexeme with the token value...) ● The RETURN() macro is a hack, but a useful one: – #ifdef-d on DUMP_TOKENS, it not only returns a token value, but also prints it along with its lexeme. Thus, we can define DUMP_TOKENS and test the scanner without plopping a greater frame around it. – When we're done, dropping DUMP_TOKENS will give us a well-behaved scanner which just returns values.
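● For the curious, the macro could plausibly be built along these lines (a guess at its shape, not the exact text of the handed-out file):

#ifdef DUMP_TOKENS
#define RETURN(t) do {                                              \
    fprintf ( stderr, "token %d, lexeme '%s'\n", (t), yytext );     \
    return (t);                                                     \
} while (0)
#else
#define RETURN(t) return (t)
#endif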