


  1. Parsing: continued
David Notkin, Autumn 2008 (CSE401 Au08)

Parsing Algorithms
• Earley's algorithm (1970)
  – works for all CFGs
  – O(N^3) worst case performance
  – O(N^2) for unambiguous grammars
  – Based on dynamic programming, used primarily for computational linguistics
• Different parsing algorithms generally place various restrictions on the grammar of the language to be parsed
  – Top-down
  – Bottom-up
  – Recursive descent
  – LL
  – LR
  – LALR
  – SLR
  – CYK
  – GLR
  – Simple precedence parser
  – Bounded context
  – …
• ACM digital library returned 5600+ articles matching "parsing algorithm"
• Google Scholar almost 34,000

Top Down Parsing
• Build parse tree from the top (start symbol) down to leaves (terminals)
• Basic issue: when expanding a nonterminal, which right hand side should be selected?
• Solution: look at input tokens to decide

    Stmts  ::= Call | Assign | If | While
    Call   ::= Id ( Expr { , Expr} )
    Assign ::= Id := Expr ;
    If     ::= if Test then Stmts end
             | if Test then Stmts else Stmts end
    While  ::= while Test do Stmts end

Predictive Parser
• Predictive parser: top-down parser that uses at most the next k tokens to select production (the lookahead)
• Efficient: no backtracking needed, linear time to parse
• Implementations (analogous to lexing)
  – recursive-descent parser
    • each nonterminal parsed by a procedure
    • call other procedures to parse sub-nonterminals, recursively
    • typically written by hand
  – table-driven parser
    • push-down automata: essentially a table-driven FSA, plus stack to do recursive calls
    • typically generated by a tool from a grammar specification
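A minimal sketch of "look at input tokens to decide" for the Stmts grammar above. The token and production names (`Token`, `Production`, `predict_stmt`) are hypothetical, introduced only for illustration. Note that Call and Assign both start with Id, so this sketch assumes two tokens of lookahead to tell them apart.

```c
/* Hypothetical token kinds for the Stmts grammar (not from the slides). */
typedef enum { T_ID, T_LPAREN, T_BECOMES, T_IF, T_WHILE, T_OTHER } Token;

/* Which right hand side of Stmts was selected. */
typedef enum { P_CALL, P_ASSIGN, P_IF, P_WHILE, P_ERROR } Production;

/* Select the right hand side of Stmts from at most two lookahead tokens. */
Production predict_stmt(Token t1, Token t2) {
    switch (t1) {
    case T_IF:
        return P_IF;                 /* only If starts with "if" */
    case T_WHILE:
        return P_WHILE;              /* only While starts with "while" */
    case T_ID:
        /* Call and Assign share the prefix Id; the second token decides. */
        if (t2 == T_LPAREN)  return P_CALL;
        if (t2 == T_BECOMES) return P_ASSIGN;
        return P_ERROR;
    default:
        return P_ERROR;
    }
}
```

For example, `predict_stmt(T_ID, T_BECOMES)` selects the Assign production without any backtracking.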

  2. LL(k) Grammars
(L: left-to-right scan; L: find leftmost derivation; k: k tokens of lookahead)
• Can construct predictive parser automatically and easily if grammar is LL(k)
  – Left-to-right scan of input, finds leftmost derivation
  – k tokens of look ahead needed
• Some restrictions including
  – no ambiguity
  – no common prefixes of length ≥ k:
        If ::= if Test then Stmts end
             | if Test then Stmts else Stmts end
  – no left recursion (e.g., E ::= E Op E | ...)
• Restrictions guarantee that, given k input tokens, can always select correct right hand side to expand nonterminal.

What if there are common prefixes?
• Left factor common prefixes to eliminate them
  – create new nonterminal for different suffixes
  – delay choice until after common prefix
• Before
        If ::= if Test then Stmts end
             | if Test then Stmts else Stmts end
• After
        If     ::= if Test then Stmts IfCont
        IfCont ::= end | else Stmts end

Left recursion? Rewrite…
• Before
        E ::= E + T | T
        T ::= T * F | F
        F ::= id | ...
• After
        E    ::= T ECon
        ECon ::= + T ECon | ε
        T    ::= F TCon
        TCon ::= * F TCon | ε
        F    ::= id | ...
• May not be as clear; can sugar it
        E ::= T { + T }
        T ::= F { * F }
        F ::= id | ( E ) | …
• Greater distance from concrete syntax to abstract syntax

Table-driven predictive parser
• Automatically compute PREDICT table from grammar
• PREDICT(nonterminal,input-token) => right hand side
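A minimal recursive-descent sketch of the sugared grammar E ::= T { + T }, T ::= F { * F }, F ::= id | ( E ), showing how each `{ ... }` repetition becomes a loop once left recursion is gone. Assumptions made for illustration only: an id is a single lowercase letter, the input has no whitespace, and the function names (`recognize`, `parse_E`, `parse_T`, `parse_F`) are hypothetical.

```c
/* Next unread character of the input. */
static const char *p;

static int parse_E(void);

/* F ::= id | ( E )   (id assumed to be one lowercase letter) */
static int parse_F(void) {
    if (*p >= 'a' && *p <= 'z') { p++; return 1; }
    if (*p == '(') {
        p++;
        if (!parse_E()) return 0;
        if (*p != ')') return 0;
        p++;
        return 1;
    }
    return 0;
}

/* T ::= F { * F }: the repetition replaces T ::= T * F | F */
static int parse_T(void) {
    if (!parse_F()) return 0;
    while (*p == '*') {
        p++;
        if (!parse_F()) return 0;
    }
    return 1;
}

/* E ::= T { + T }: the repetition replaces E ::= E + T | T */
static int parse_E(void) {
    if (!parse_T()) return 0;
    while (*p == '+') {
        p++;
        if (!parse_T()) return 0;
    }
    return 1;
}

/* Returns 1 if the whole string is derivable from E, 0 otherwise. */
int recognize(const char *s) {
    p = s;
    return parse_E() && *p == '\0';
}
```

This only recognizes input; building a tree from the loops is where the "greater distance from concrete syntax to abstract syntax" shows up, since the loop has to re-create the left-associative nesting itself.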

  3. Compute PREDICT table
• Compute FIRST set for each right hand side
  – All tokens that can appear first in a derivation from that right hand side
• In case right hand side can be empty
  – Compute FOLLOW set for each non-terminal
    • All tokens that can appear immediately after that non-terminal in a derivation
• Compute FIRST and FOLLOW sets mutually recursively
• PREDICT then depends on the FIRST set (and on the FOLLOW set where the right hand side can be empty)

Example for you to do: if you want

PREDICT and LL(1)
• If PREDICT table has at most one entry per cell
  – Then the grammar is LL(1)
  – There is always exactly one right choice
    • So it's fast to parse and easy to implement
• If multiple entries in any cell
  – Ex: common prefixes, left recursion, ambiguity
  – Can rewrite grammar (sometimes)
  – Can patch table manually, if you "know" what to do
  – Or can use more powerful parsing technique

Top down implementation
• For years the 401 compiler used a top-down predictive parser, implemented by a method for each nonterminal
  – We have shifted to a bottom-up, automatically generated parser
  – But if you're going to build a simple one, this is usually best
• Examples from http://en.wikibooks.org/wiki/Compiler_construction
  – Helper functions:

    int accept(Symbol s) {
        if (sym == s) {
            getsym();
            return 1;
        }
        return 0;
    }

    int expect(Symbol s) {
        if (accept(s))
            return 1;
        error("expect: unexpected symbol");
        return 0;
    }
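A sketch of computing FIRST, FOLLOW, and PREDICT for the expression grammar after left-recursion removal (E ::= T ECon, and so on). It uses an iterative fixed point rather than literal mutual recursion, which is a common way to realize the same definitions. Assumptions for illustration: symbols are single characters (`X` stands for ECon, `Y` for TCon, `i` for id, `$` for end of input), F is taken to be `id | ( E )`, and all names are hypothetical.

```c
#include <string.h>

#define NPROD 8
static const char *NT   = "EXTYF";    /* nonterminals: X = ECon, Y = TCon */
static const char *TERM = "+*i()$";   /* terminals: i = id, $ = end of input */
static const char  LHS[NPROD] = { 'E', 'X', 'X', 'T', 'Y', 'Y', 'F', 'F' };
static const char *RHS[NPROD] = { "TX", "+TX", "", "FY", "*FY", "", "i", "(E)" };

static int nullable[5];               /* can the nonterminal derive empty? */
static unsigned first[5], follow[5];  /* bit sets over TERM */

static int nt_index(char c) {
    const char *q = c ? strchr(NT, c) : 0;
    return q ? (int)(q - NT) : -1;
}
static unsigned term_bit(char c) { return 1u << (strchr(TERM, c) - TERM); }

/* FIRST of a symbol string; *empty is set to 1 if the string can be empty. */
static unsigned first_of(const char *s, int *empty) {
    unsigned f = 0;
    for (; *s; s++) {
        int n = nt_index(*s);
        if (n < 0) { *empty = 0; return f | term_bit(*s); }
        f |= first[n];
        if (!nullable[n]) { *empty = 0; return f; }
    }
    *empty = 1;
    return f;
}

/* Grow FIRST, FOLLOW, and nullable until nothing changes. */
void compute_sets(void) {
    int changed = 1;
    follow[0] = term_bit('$');        /* end marker follows the start symbol */
    while (changed) {
        changed = 0;
        for (int p = 0; p < NPROD; p++) {
            int a = nt_index(LHS[p]), empty;
            unsigned f = first_of(RHS[p], &empty);
            if ((first[a] | f) != first[a]) { first[a] |= f; changed = 1; }
            if (empty && !nullable[a]) { nullable[a] = 1; changed = 1; }
            for (const char *s = RHS[p]; *s; s++) {
                int b = nt_index(*s);
                if (b < 0) continue;
                unsigned g = first_of(s + 1, &empty);
                if (empty) g |= follow[a];   /* rest can vanish: inherit FOLLOW */
                if ((follow[b] | g) != follow[b]) { follow[b] |= g; changed = 1; }
            }
        }
    }
}

/* PREDICT cell: production index, -1 if empty, -2 if more than one entry. */
int predict(char nonterminal, char token) {
    int found = -1;
    for (int p = 0; p < NPROD; p++) {
        int empty;
        unsigned f;
        if (LHS[p] != nonterminal) continue;
        f = first_of(RHS[p], &empty);
        if (empty) f |= follow[nt_index(nonterminal)];
        if (f & term_bit(token)) found = (found == -1) ? p : -2;
    }
    return found;
}
```

After `compute_sets()`, a grammar is LL(1) in the sense above when `predict` never returns -2 for any nonterminal and token; for this grammar, `predict('X', ')')` picks the empty ECon production because `)` is in FOLLOW(ECon).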

  4. Example method

    void statement(void) {
        if (accept(ident)) {
            expect(becomes);
            expression();
        …
        } else if (accept(ifsym)) {
            condition();
            expect(thensym);
            statement();
        } else if (accept(whilesym)) {
            condition();
            expect(dosym);
            statement();
        }
    }

Example method

    void factor(void) {
        if (accept(ident)) {
            ;
        } else if (accept(number)) {
            ;
        } else if (accept(lparen)) {
            expression();
            expect(rparen);
        } else {
            error("factor: syntax error");
            getsym();
        }
    }

Bottom up parsing
• Construct parse tree for input from leaves up
  – reducing a string of tokens to single start symbol by inverting productions
• Bottom-up parsing is more general than top-down parsing and just as efficient
  – generally preferred in practice

    Sentential form        Production used to reduce it
    int * int + int        T ::= int
    int * T + int          T ::= int * T
    T + int                T ::= int
    T + T                  E ::= T
    T + E                  E ::= T + E
    E

  Read the productions found by bottom-up parse bottom to top; this is a rightmost derivation!

"Shift-reduce" strategy
• read ("shift") tokens until the right hand side of "correct" production has been seen
• reduce handle to nonterminal, then continue
• done when all input read and reduced to start nonterminal

[Figure: input string xyzabcdef with a position marker (^), alongside the production A ::= bc . D]
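A small sketch of the shift-reduce strategy on the grammar from the bottom-up example (T ::= int | int * T, E ::= T | T + E). The reduce decisions here are hand-coded from one token of lookahead and are not a generated LR table. Assumptions for illustration: tokens are single characters (`i` for int), the function name `shift_reduce` and the numbering of productions in the trace are hypothetical, the input is shorter than 64 tokens, and the caller supplies a trace buffer of at least 4 * strlen(in) + 1 bytes.

```c
/* Productions, numbered for the trace:
     1: T ::= int      2: T ::= int * T
     3: E ::= T        4: E ::= T + E
   Returns 1 if the input reduces to E; trace receives the production
   numbers in the order the reductions were made. */
int shift_reduce(const char *in, char *trace) {
    char st[64];          /* parse stack of grammar symbols */
    int n = 0, t = 0;

    for (;;) {
        char la = *in;    /* one token of lookahead ('\0' at end of input) */

        if (n >= 3 && st[n-3] == 'T' && st[n-2] == '+' && st[n-1] == 'E') {
            n -= 3; st[n++] = 'E'; trace[t++] = '4';    /* E ::= T + E */
        } else if (n >= 3 && st[n-3] == 'i' && st[n-2] == '*' && st[n-1] == 'T') {
            n -= 3; st[n++] = 'T'; trace[t++] = '2';    /* T ::= int * T */
        } else if (n >= 1 && st[n-1] == 'i' && la != '*') {
            st[n-1] = 'T'; trace[t++] = '1';            /* T ::= int */
        } else if (n >= 1 && st[n-1] == 'T' && la != '+') {
            st[n-1] = 'E'; trace[t++] = '3';            /* E ::= T */
        } else if (la != '\0' && n < 63) {
            st[n++] = la; in++;                         /* shift */
        } else {
            break;                                      /* nothing applies */
        }
    }
    trace[t] = '\0';
    /* Done when all input is read and reduced to the start nonterminal. */
    return n == 1 && st[0] == 'E' && *in == '\0';
}
```

On the input `i*i+i` the trace is `12134`, that is T ::= int, T ::= int * T, T ::= int, E ::= T, E ::= T + E, the same sequence of reductions as in the bottom-up example; read in reverse it is a rightmost derivation.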

  5. LR(k)
• LR(k) parsing
  – Left-to-right scan of input, rightmost derivation
  – k tokens of look ahead
• Strictly more general than LL(k)
  – Gets to look at whole right hand side of production before deciding what to do, not just first k tokens
  – Can handle left recursion and common prefixes
  – As efficient as any top-down parsing
• Complex to implement
  – Generally need automatic tools to construct parser from grammar

LR Parsing Tables
• Construct parsing tables implementing a FSA with a stack
  – rows: states of parser
  – columns: token(s) of lookahead
  – entries: action of parser
    • shift, goto state X
    • reduce production "X ::= RHS"
    • accept
    • error
• Algorithm to construct FSA similar to algorithm to build DFA from NFA
  – each state represents set of possible places in parsing
• LR(k) algorithm may build huge tables

Questions?

Ada language/compiler color
• US DoD wanted (roughly) a single, high-level programming language
• They wrote requirements for this language and received 14 bids (1977)
• Four semi-finalists (1978): green (Cii), red (Intermetrics), blue (SofTech), yellow (SRI)
• Two finalists: green and red
  – requirements finalized as Steelman document
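To make the rows/columns/entries description of LR parsing tables concrete, here is a sketch of a table-driven parser loop. The grammar is a hypothetical toy one chosen to keep the table small, not one from the slides: production 1 is S ::= ( S ) and production 2 is S ::= x. The table was worked out by hand (SLR(1), six states) and the encoding is an assumption of this sketch: a positive entry k means "shift, goto state k", a negative entry -k means "reduce production k", `ACC` means accept, and 0 means error.

```c
enum { ACC = 100 };

/* Column index for a lookahead token; '\0' plays the role of end of input. */
static int col(char c) {
    switch (c) {
    case '(':  return 0;
    case ')':  return 1;
    case 'x':  return 2;
    case '\0': return 3;
    default:   return -1;
    }
}

/* rows: parser states, columns: lookahead token, entries: parser action */
static const int ACTION[6][4] = {
    /*            (    )    x   end */
    /* 0 */ {     2,   0,   3,   0 },
    /* 1 */ {     0,   0,   0, ACC },
    /* 2 */ {     2,   0,   3,   0 },
    /* 3 */ {     0,  -2,   0,  -2 },   /* S ::= x .     */
    /* 4 */ {     0,   5,   0,   0 },
    /* 5 */ {     0,  -1,   0,  -1 },   /* S ::= ( S ) . */
};

/* State to enter after reducing to S, indexed by the uncovered state. */
static const int GOTO_S[6] = { 1, -1, 4, -1, -1, -1 };

/* Right hand side length per production (index 0 unused). */
static const int RHS_LEN[3] = { 0, 3, 1 };

/* Returns 1 if the input is derivable from S, 0 otherwise. */
int lr_parse(const char *in) {
    int stack[128], top = 0;
    stack[0] = 0;                          /* start state */
    for (;;) {
        int c = col(*in), a, g;
        if (c < 0) return 0;               /* unknown token */
        a = ACTION[stack[top]][c];
        if (a == ACC) return 1;
        if (a > 0) {                       /* shift, goto state a */
            if (top + 1 >= 128) return 0;  /* stack full: give up */
            stack[++top] = a;
            in++;
        } else if (a < 0) {                /* reduce production -a */
            top -= RHS_LEN[-a];            /* pop one state per RHS symbol */
            g = GOTO_S[stack[top]];
            if (g < 0) return 0;
            stack[++top] = g;
        } else {
            return 0;                      /* error entry */
        }
    }
}
```

The loop itself never changes from grammar to grammar; only the tables do, which is why such parsers are usually generated by a tool from the grammar.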

  6. York Ada compiler (c. 1986)
"Facts and Figures About the York Ada Compiler" (Wand et al.)
• Written in C
• About 80 KLOC for compiler
  – Front-end about 57 KLOC, code gen about 20 KLOC, VAX-specific code gen about 3 KLOC
• 7 KLOC for run-time
• "It is difficult to make an accurate estimate of the time taken to write the compiler because the compiler writers had other demands on their time (completing PhDs, teaching, etc.). Fourteen individuals have been involved at various times during the project and have contributed approximately 20 man years to the design and construction of the software. The money spent directly to support the construction of the compiler was [approximately $340k], however this included neither the salaries of four members of the project nor the cost of computer time (we used approximately 30% of a VAX-11/780 over a five year period)."

General syntax: examples from Steelman
• 2A. Character Set. The full set of character graphics that may be used in source programs shall be given in the language definition. Every source program shall also have a representation that uses only the following 55 character subset of the ASCII graphics: …
• 2B. Grammar. The language should have a simple, uniform, and easily parsed grammar and lexical structure. The language shall have free form syntax and should use familiar notations where such use does not conflict with other goals.
• 2D. Other Syntactic Issues. Multiple occurrences of a language defined symbol appearing in the same context shall not have essentially different meanings. …
• 2E. Mnemonic identifiers. Mnemonically significant identifiers shall be allowed. There shall be a break character for use within identifiers. The language and its translators shall not permit identifiers or reserved words to be abbreviated. …
• 2G. Numeric Literals. There shall be built-in decimal literals. There shall be no implicit truncation or rounding of integer and fixed point literals
