Where we're at: Syntax analysis of VSL
● Things needed to
  – Submit homework (pdfs and tarballs)
  – Build programs (make, cc)
  – Build scanners (Lex/flex)
  – Build parsers (Yacc/bison)
  – Build symbol tables (hashtables/libghthash)
  – Assemble machine code (as)
● In PS3, we will see our Very Simple Language take shape, in terms of its
  syntactic structure (i.e., “which words go where”).
● Things are getting a bit more complicated, but it's where the fun begins.
ps3_skeleton: The sequence of things
● main calls yyparse() (which is generated by yacc from src/parser.y)
● yyparse() creates a tree of node_t structs (TODOs in src/parser.y, src/tree.c),
  and assigns the root node to the globally declared 'root' if parsing succeeds
● If compiled with the macro DUMP_TREES=1 or DUMP_TREES=3, then
  main prints a text representation of the tree at 'root' on stderr
● main calls simplify_tree (TODO in src/tree.c) to prune away a few features
  from the syntax tree which are only convenient in syntax analysis (and
  would cause extra headaches when code generation time comes)
● If compiled with the macro DUMP_TREES=2 or DUMP_TREES=3, then
  main prints a text representation of the (now simplified) tree on stderr
● main takes down the tree with destroy_subtree (TODO in src/tree.c)
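● The sequence above can be summarized in a short sketch. This is a minimal
  reconstruction, not the actual skeleton: yyparse, root, simplify_tree and
  destroy_subtree are named on this slide, while print_subtree and the exact
  signatures are assumptions.

    #include <stdio.h>
    #include <stdlib.h>
    #include "tree.h"

    extern int yyparse ( void );    /* Generated from src/parser.y */
    extern node_t *root;            /* Set by the parser on success */

    int main ( int argc, char **argv )
    {
        if ( yyparse() != 0 )
            exit ( EXIT_FAILURE );
    #if DUMP_TREES == 1 || DUMP_TREES == 3
        print_subtree ( stderr, root );     /* Hypothetical tree printer: raw tree */
    #endif
        simplify_tree ( &root );            /* Prune syntax-only constructs */
    #if DUMP_TREES == 2 || DUMP_TREES == 3
        print_subtree ( stderr, root );     /* Now-simplified tree */
    #endif
        destroy_subtree ( root );           /* Reclaim all tree memory */
        exit ( EXIT_SUCCESS );
    }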
Yacc specifications in general
● Much like Lex specifications, Yacc specs contain definitions, rules and
  function implementation sections, separated by %%
● I will walk you through the definitions
● The rules are not reg. exps. any more, they are grammar productions in a
  format like

    left_nonterm : nonterm token nonterm nonterm { /* action 1 */ }
                 | nonterm token token { /* action 2, one NT and 2 tokens */ }
                 |       { /* action 3 – epsilon productions don't need
                             a right hand side */ }
                 | token { /* action 4 – just a token */ }
                 ;    ← semicolon ends a string of productions with the same l.h.s.

● With the “return yytext[0];” final rule of the scanner, single characters act as
  their own tokens, so they can appear as literals (e.g. '+' or '}') in productions.
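● To make the three-section layout concrete, here is a minimal, self-contained
  spec in that format (the grammar itself is illustrative, not part of VSL):

    %{
    #include <stdio.h>
    int yylex ( void );
    void yyerror ( const char *msg );
    %}
    %token NUMBER
    %left '+' '-'
    %%
    expression : expression '+' expression { /* action */ }
               | expression '-' expression { /* action */ }
               | NUMBER                    { /* action */ }
               ;
    %%
    void yyerror ( const char *msg ) { fprintf ( stderr, "%s\n", msg ); }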
The declarations section
● The 'extern' declarations of yytext and yylineno just mean that we rely on the linker
  to find them in the object code for the scanner.
● 'node_t * root;' is a global declaration for a struct pointer which will be assigned as
  the last thing the parser does – it's where main will get a hold of the parser's work
● Prototypes for yylex and yyerror are just to point at implementations elsewhere in
  the framework. yyerror is Yacc's callback for syntax errors; the implementation at
  the bottom just stops the program dead in its tracks (which will do for us).
● The %token directives name all the tokens we want in the header file shared with
  the scanner. These are just magic integers.
● %left sets the associativity of operators (there's a %right as well); breaking the
  declarations across multiple lines orders the operators by precedence.
● There's an operator UMINUS with no associativity and high precedence – this is a
  placeholder; its purpose is explained on the next slide
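● Put together, the declarations section might look roughly like this (a sketch:
  the token names and precedence levels are illustrative, not the skeleton's
  exact list):

    %{
    #include "tree.h"
    extern char *yytext;   /* Resolved by the linker from the scanner's object code */
    extern int yylineno;
    node_t *root;          /* Assigned by the final production, read by main */
    int yylex ( void );
    void yyerror ( const char *msg );
    %}
    %token NUMBER IDENTIFIER STRING
    %left '+' '-'          /* Declared first: lowest precedence */
    %left '*' '/'          /* Declared later: binds tighter */
    %nonassoc UMINUS       /* Placeholder for unary minus (see next slide) */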
Regarding UMINUS
● '-' can act a bit funny as an operator; when it's part of a binary
  expression, it has precedence like '+', but when it's unary (like in
  “-123”) it binds tighter than everything.
● To let Yacc work with this, we need to pretend that there's another
  token (it need not be returned from anywhere to have a precedence)
● Operator precedence can be set by associating a grammar rule with
  a token's precedence even if the token itself does not appear in the
  production: the rules

    expression : expression '-' expression  { /* action goes here */ }
               | '-' expression %prec UMINUS { /* action goes here */ }

  will handle the first alternative according to the 'default' precedence
  of binary minus, and associate the precedence of the ethereal UMINUS
  with the second one.
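● For completeness, this is how the %nonassoc declaration and the %prec
  annotation fit together in one small spec (illustrative, not the VSL grammar):

    %token NUMBER
    %left '+' '-'
    %left '*' '/'
    %nonassoc UMINUS      /* Never returned by the scanner... */
    %%
    expression : expression '-' expression   { /* binary minus, '+' level */ }
               | '-' expression %prec UMINUS { /* ...only lends its precedence here */ }
               | NUMBER                      { }
               ;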
First steps: tree node structure
● include/tree.h defines a structured type node_t, which will hold our
  tree nodes, and has elements
  – nodetype_t type (remember what kind of node this is)
  – void *data (to retain copies of lexical information where it's needed –
    integers, string literals and suchlike)
  – void *entry (nevermind this for now, but we'll need it when the time
    comes for a bit of semantics)
  – uint32_t n_children (number of nodes this one links to below –
    unsigned, can't be less than 0)
  – node_t **children (list of pointers to the nodes below)
● ...the following figure illustrates how these structs are supposed to
  link together
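● The linked-structs figure aside, the type itself can be sketched in C from
  the element list above (field order and the enum values shown are
  assumptions; the authoritative definition is in include/tree.h):

    #include <stdint.h>

    typedef enum { program_n, expression_n, integer_n /* , ... */ } nodetype_t;

    typedef struct node {
        nodetype_t type;          /* What kind of node this is */
        void *data;               /* Copied lexical info: integers, string literals */
        void *entry;              /* Symbol table entry, for later semantics */
        uint32_t n_children;      /* How many nodes link below (unsigned) */
        struct node **children;   /* List of pointers to the nodes below */
    } node_t;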
Shift/reduce parsing a la Yacc
● The parser generated by yacc effectively traces out this tree for us, left-to-
  right, bottom to top, pushing tokens onto an internal stack, and running a
  production's action every time it can reduce the right hand side of a
  production into the nonterminal on the left.
● At the bottom, with two productions
  – integer: NUMBER { /* This code is called when the scanner returns a
    number */ }
  – expression: integer { /* This code is called next, since the right-hand
    side of the rule only requires that we've had an integer */ }
  – What happens next depends on what has been recently seen; if what's on
    the parser's internal stack was just missing an expression to complete the
    right hand side of a production, another rule will fire – otherwise, the
    scanner gets to fetch the next token, in the hope that something will
    match soon.
● What we need in order to construct our tree is to build it inductively inside
  the productions' semantic action blocks (plain old C) – a hand trace of the
  shift/reduce process is sketched below.
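● To see the mechanics, here is a hand trace (not actual yacc output) of how
  the stack evolves on the input “2 + 3”, assuming the two rules above plus
  expression: expression '+' expression:

    stack                       remaining    action
    (empty)                     2 + 3        shift NUMBER
    NUMBER                      + 3          reduce integer : NUMBER
    integer                     + 3          reduce expression : integer
    expression                  + 3          shift '+'
    expression '+'              3            shift NUMBER
    expression '+' NUMBER                    reduce integer, then expression
    expression '+' expression                reduce expression : expression '+' expression
    expression                               accept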
Where baby tokens come from
● The skeleton for the parser really depends on a correct scanner.
● Since some late submissions for that exercise must be admitted for a
  while still, I regrettably can't hand out the Lex spec. quite yet.
● Instead, the skeleton code includes the C code for one generated by lex.
  (It's technically possible to reconstruct the reg. exps. from the state table
  therein, but I reckon it is more work than figuring them out from scratch...)
● For the nonce, it scans VSL sources – the recipe will be included in future
  skeletons.
Referring to tokens and nonterminals in rules
● As an example, consider the production rule

    integer : NUMBER { /* Code */ }

● To construct a node_t from this, we need
  – The lexeme which was scanned for NUMBER (the parser knows about
    yytext, and can read it directly)
  – A dynamically allocated node_t to fill in the data
  – A way to assign it to the 'integer' nonterminal which is passed on up
● $1 means “first token or nonterminal on the right hand side”
● $2 is the second one, and so on...
● $$ is the left hand side
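● As an illustration of the numbering (this rule is just an example, not a
  VSL production):

    expression : expression '+' expression { /* build $$ from the parts */ }
    /*   $$          $1      $2     $3                                   */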
Productions and type information
● It's good that we can refer to the parts of a production, but yacc needs to know
  that $1 in this case is an integer (the NUMBER token value), and that we want $$
  to be a (node_t *).
● The %token directives in the declarations section say which names we want to
  refer to (more or less arbitrary) integer token values
● The definition of YYSTYPE at the very top says that we want all nonterminals to
  be of type (node_t *)
  – It's possible to type them all explicitly, but we will have better use for a
    perfectly regular tree, so everything is a node structure
● Thus, $$ = (node_t *) malloc ( sizeof(node_t) ); in the integer rule above will pass
  upwards a pointer, and the code in the rule can fill it in with “0 children”, NULL
  pointers where need be, and the integer value read from the lexeme (which can
  be found by “strtol(yytext, NULL, 10);”)
● $$ is “some pointer to a node_t struct”, so “$$->type = integer_n;” etc. are
  perfectly valid statements.
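● Putting those pieces together, the integer action might read like this (a
  sketch; storing the value through a malloc'd long is one plausible choice
  for the void *data field, not necessarily the skeleton's):

    integer : NUMBER {
                $$ = (node_t *) malloc ( sizeof(node_t) );
                $$->type = integer_n;
                $$->entry = NULL;
                $$->n_children = 0;
                $$->children = NULL;
                $$->data = malloc ( sizeof(long) );    /* Copy of lexical info */
                *((long *)$$->data) = strtol ( yytext, NULL, 10 );
              }
            ;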
Aside: State of the union
● The type of nonterminals is where yacc likes unions.
● For internal reasons, all nonterminals in the generated code are declared
  with YYSTYPE
● With the %union directive, you can define YYSTYPE to be a union of any
  number of types, e.g.

    %union {
        double dval;
        int ival;
    }

  will permit tokens and nonterminals to be typed, as in %token<dval> DOUBLE;
  yacc then substitutes the right union member for each $-reference, so a
  production which needs to add an int and a double can write something like
  “$$ = (double) $1 + 3.14;” (without abandoning all kinds of type checking)
● We won't be needing it right here, but thinking in terms of type-generic
  logic is a healthy mental exercise for any programmer.
● The utility of yacc goes far beyond building compilers, so now you've heard
  of this in case you ever need it.
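● A minimal spec using it might look like this (illustrative; the <…> tags tell
  yacc which union member each symbol uses):

    %union {
        double dval;
        int    ival;
    }
    %token<ival> INT
    %token<dval> DOUBLE
    %type<dval>  sum
    %%
    sum : INT '+' DOUBLE { $$ = (double) $1 + $3; }
        ;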
Building blocks
● Small as the language is, there are 48 productions in VSL.
● If you need 8 lines of code per production, that is already 384 lines of
  parser with a lot of similar malloc-ing, not even counting whitespace and
  comments.
● This is not hard, but it's more typing than is pleasant, and it's horrible to
  change if you need to.
● Therefore, turning node genesis and destruction into one-liners is a
  positive boon
● That's what the auxiliary routines in src/tree.c are for.
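● One common shape for such a helper is a variadic constructor like the
  sketch below (hypothetical: the actual routines in src/tree.c are left as
  TODOs, and their names and signatures may differ):

    #include <stdarg.h>
    #include <stdlib.h>

    /* Allocate a node, fill in its fields, and attach n_children
     * subtrees passed as trailing arguments. */
    node_t * node_init ( nodetype_t type, void *data, uint32_t n_children, ... )
    {
        node_t *result = (node_t *) malloc ( sizeof(node_t) );
        result->type = type;
        result->data = data;
        result->entry = NULL;
        result->n_children = n_children;
        result->children = (node_t **) malloc ( n_children * sizeof(node_t *) );
        va_list children;
        va_start ( children, n_children );
        for ( uint32_t i = 0; i < n_children; i++ )
            result->children[i] = va_arg ( children, node_t * );
        va_end ( children );
        return result;
    }

● With something like this in place, an action such as
  “$$ = node_init ( expression_n, NULL, 2, $1, $3 );” replaces eight lines of
  malloc-and-assign boilerplate.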