automatic scanner generation in minijava symbol class
play

Automatic scanner generation in MiniJava Symbol class We use the - PDF document

Automatic scanner generation in MiniJava Symbol class We use the jflex tool to automatically create a scanner from a Lexemes are represented as instances of class Symbol specification file, Scanner/minijava.jflex class Symbol { (We use the CUP


  1. Automatic scanner generation in MiniJava Symbol class We use the jflex tool to automatically create a scanner from a Lexemes are represented as instances of class Symbol specification file, Scanner/minijava.jflex class Symbol { (We use the CUP tool to automatically create a parser from a int sym; // which token class? specification file, Parser/minijava.cup , which also Object value; // any extra data for this lexeme generates all the code for the token classes used in the ... scanner, via the Symbol class.) } The MiniJava Makefile automatically rebuilds the scanner A different integer constant is defined for each token class, in the (or parser) whenever its specification file changes sym helper class class sym { static int CLASS = 1; static int IDENTIFIER = 2; static int COMMA = 3; ... } Can use this in printing code for Symbol s • see symbolToString in minijava.jflex Craig Chambers 40 CSE 401 Craig Chambers 41 CSE 401 Token declarations jflex token specifications Declare new token classes in Parser/minijava.cup , Helper definitions for character classes and regular expressions using terminal declarations letter = [a-zA-Z] • include Java type if Symbol stores extra data eol = [\r\n] Examples: (Simple) token definitions are of the form: regexp { Java stmt } /* reserved words: */ terminal CLASS, PUBLIC, STATIC, EXTENDS; regexp can be (at least): ... • a string literal in double-quotes, e.g. "class" , "<=" /* operators: */ • a reference to a named helper, in braces, e.g. {letter} terminal PLUS, MINUS, STAR, SLASH, EXCLAIM; • a character list or range, in square brackets, e.g. [a-zA-Z] ... • a negated character list or range, e.g. [^\r\n] /* delimiters: */ • . (which matches any single character) terminal OPEN_PAREN, CLOSE_PAREN; • regexp regexp , regexp | regexp , regexp * , regexp + , regexp ? , ( regexp ) terminal EQUALS, SEMICOLON, COMMA, PERIOD; ... Java stmt (the accept action) is typically: /* tokens with values: */ • return symbol(sym. CLASS ); for a simple token terminal String IDENTIFIER; • return symbol(sym. CLASS , yytext()); for a token terminal Integer INT_LITERAL; with extra data based on the lexeme string yytext() • empty for whitespace Craig Chambers 42 CSE 401 Craig Chambers 43 CSE 401

  2. Syntactic Analysis / Parsing Context-free grammars (CFG’s) Purpose: stream of tokens ⇒ abstract syntax tree (AST) Syntax specified using CFG’s • RE’s not powerful enough AST: • can’t handle nested, recursive structure • general grammars (GG’s) too powerful • captures hierarchical structure of input program • not decidable ⇒ parser might run forever! • primary representation of program for rest of compiler CFG’s: convenient compromise • capture important structural & nesting characteristics Plan: • some properties checked later during semantic analysis • study how grammars can specify syntax • study algorithms for constructing ASTs from token streams • study MiniJava implementation Common notation for CFG’s: Extended Backus-Naur Form (EBNF) Craig Chambers 44 CSE 401 Craig Chambers 45 CSE 401 CFG terminology EBNF description of initial MiniJava syntax Program ::= MainClassDecl {ClassDecl} MainClassDecl::= class ID { Terminals : alphabet of language defined by CFG public static void main ( String [ ] ID ) { {Stmt} } } ClassDecl ::= class ID [ extends ID] { Nonterminals : symbols defined in terms of terminals and {ClassVarDecl} {MethodDecl} } nonterminals ClassVarDecl ::= Type ID ; MethodDecl ::= public Type ID Production : rule for how a nonterminal (l.h.s.) is defined in ( [Formal { , Formal}] ) { {Stmt} return Expr ; } terms of a finite, possibly empty sequence of terminals & Formal ::= Type ID nonterminals Type ::= int | boolean | ID • recursive productions allowed! Stmt ::= Type ID ; | { {Stmt} } Can have multiple productions for same nonterminal | if ( Expr ) Stmt else Stmt | while ( Expr ) Stmt • alternatives | System.out.println ( Expr ) ; | ID = Expr ; Start symbol : root symbol defining language Expr ::= Expr Op Expr | ! Expr | Expr . ID ( [Expr { , Expr}] ) | ID | this Program ::= Stmt | Integer | true | false Stmt ::= if ( Expr ) Stmt else Stmt | ( Expr ) Stmt ::= while ( Expr ) Stmt Op ::= + | - | * | / | < | <= | >= | > | == | != | && Craig Chambers 46 CSE 401 Craig Chambers 47 CSE 401

  3. Transition diagrams Derivations and parse trees “Railroad diagrams” Derivation: sequence of expansion steps, beginning with start symbol, • another, more graphical notation for CFG’s leading to a string of terminals • look like FSA’s, where arcs can be labelled with nonterminals as well as terminals Parsing: inverse of derivation • given target string of terminals (a.k.a. tokens), want to recover nonterminals representing structure Can represent derivation as a parse tree • concrete syntax tree Craig Chambers 48 CSE 401 Craig Chambers 49 CSE 401 Example grammar Ambiguity E ::= E Op E | - E | ( E ) | id Some grammars are ambiguous : Op ::= + | - | * | / • multiple distinct parse trees with same final string Structure of parse tree captures much of meaning of program; ambiguity ⇒ multiple possible meanings for same program Craig Chambers 50 CSE 401 Craig Chambers 51 CSE 401

  4. Famous ambiguities: “dangling else” Resolving the ambiguity Stmt ::= ... | Option 1: add a meta-rule e.g. “ else associates with closest previous if ” if ( Expr ) Stmt | if ( Expr ) Stmt else Stmt • works, keeps original grammar intact • ad hoc and informal “ if ( e 1 ) if ( e 2 ) s 1 else s 2 ” Craig Chambers 52 CSE 401 Craig Chambers 53 CSE 401 Option 2: rewrite the grammar to resolve ambiguity explicitly Option 3: redesign the language to remove the ambiguity Stmt ::= ... | Stmt ::= MatchedStmt | UnmatchedStmt if Expr then Stmt end | MatchedStmt ::= ... | if Expr then Stmt else Stmt end if ( Expr ) MatchedStmt else MatchedStmt • formal, clear, elegant UnmatchedStmt ::= if ( Expr ) Stmt | if ( Expr ) MatchedStmt • allows sequence of Stmt s in then and else branches, no { , } needed else UnmatchedStmt • extra end required for every if • formal, no additional rules beyond syntax • sometimes obscures original grammar Craig Chambers 54 CSE 401 Craig Chambers 55 CSE 401

  5. Another famous ambiguity: expressions Resolving the ambiguity E ::= E Op E | - E | ( E ) | id Option 1: add some meta-rules, e.g. precedence and associativity rules Op ::= + | - | * | / “ a + b * c ” Example: E ::= E Op E | - E | E ++ | ( E ) | id Op ::= + | - | * | / | % | ** | == | < | && | || operator precedence associativity postfix ++ highest left prefix - right ** (expon.) right * , / , % left + , - left == , < none && left || lowest left Craig Chambers 56 CSE 401 Craig Chambers 57 CSE 401 Option 2: modify the grammar to explicitly resolve the ambiguity Example, redone Strategy: E ::= E0 • create a nonterminal for each precedence level E0 ::= E0 || E1 | E1 left associative • expr is lowest precedence nonterminal, E1 ::= E1 && E2 | E2 left associative each nonterminal can be rewritten with higher E2 ::= E3 ( == | < ) E3 non associative precedence operator, E3 ::= E3 ( + | - ) E4 | E4 left associative highest precedence operator includes atomic exprs E4 ::= E4 ( * | / | % ) E5 | E5 left associative • at each precedence level, use: E5 ::= E6 ** E5 | E6 right associative • left recursion for left-associative operators E6 ::= - E6 | E7 right associative • right recursion for right-associative operators E7 ::= E7 ++ | E8 left associative • no recursion for non-associative operators E8 ::= id | ( E ) Craig Chambers 58 CSE 401 Craig Chambers 59 CSE 401

  6. Designing a grammar Parsing algorithms Concerns: Given grammar, want to parse input programs • accuracy • check legality • unambiguity • produce AST representing structure • formality • be efficient • readability, clarity • ability to be parsed by particular parsing algorithm Kinds of parsing algorithms: • top-down parser ⇒ LL(k) grammar • top-down • bottom-up parser ⇒ LR(k) grammar • bottom-up • ability to be implemented using a particular strategy • by hand • by automatic tools Craig Chambers 60 CSE 401 Craig Chambers 61 CSE 401 Top-down parsing Predictive parsing Build parse tree for input program from the top (start symbol) Predictive parser: down to leaves (terminals) top-down parser that can select correct rhs looking at at most k input tokens (the lookahead ) Basic issue: Efficient: • when "expanding" a nonterminal with some r.h.s., how to pick which r.h.s.? • no backtracking needed • linear time to parse E.g. Stmt ::= Call | Assign | If | While Implementation of predictive parsers: Call ::= Id ( Expr { , Expr } ) • recursive-descent parser Assign ::= Id = Expr ; • each nonterminal parsed by a procedure If ::= if Test then Stmts end | • call other procedures to parse sub-nonterminals, recursively if Test then Stmts else Stmts end • typically written by hand While ::= while Test do Stmts end • table-driven parser • PDA: like table-driven FSA, plus stack to do recursive FSA calls Solution: look at input tokens to help decide • typically generated by a tool from a grammar specification Craig Chambers 62 CSE 401 Craig Chambers 63 CSE 401

Recommend


More recommend