Bottom-Up Parsing (A First Step): Cocke–Younger–Kasami (CYK) algorithm


  1. Bottom-Up Parsing (A First Step): Cocke–Younger–Kasami (CYK) algorithm and Chomsky Normal Form

  2. Last time: we showed how to use Java CUP for getting ASTs, but we never saw HOW the parser works.

  3. This time: dip our toe into parsing
     – Approaches to parsing
     – CFG transformations
       • Useless non-terminals
       • Chomsky Normal Form: a form of grammar that is easier to deal with
     – CYK: a powerful, heavyweight approach to parsing

  4. Approaches to Parsing
     Top Down / “Goal driven”
     – Begin with the start nonterminal
     – Grow the parse tree downward to match the string
     Bottom Up / “Data Driven”
     – Start at the terminals
     – Generate ever larger subtrees; the goal is to obtain a single tree whose root is the start nonterminal
     (Figure: a parse tree over the nodes Expr, plus, Term, and id.)

  5. CYK: A General Approach to Parsing (Cocke–Younger–Kasami algorithm)
     – Operates in time O(n^3), where n is the length of the input
     – Works bottom-up
     – Requires the grammar to be in Chomsky Normal Form
     – This turns out not to be a limitation: any context-free grammar can be converted into one in Chomsky Normal Form

  6. Chomsky Normal Form
     All rules must be one of two forms:
       X ⟶ t      (t is a terminal)
       X ⟶ A B    (A and B are nonterminals)
     The only nonterminal allowed to derive epsilon is the start symbol S.

  7. What CNF buys CYK
     • The fact that non-terminals come in pairs allows you to think of a subtree as a subspan of the input
     • The fact that non-terminals are not nullable (except for the start symbol) means that each subspan has at least one character
     Example input: s = s1 s2 s3 s4

  8. CYK: Dynamic Programming
     X ⟶ t: forms the leaves of the parse tree
     X ⟶ A B: forms the binary interior nodes of the parse tree
     (Figure: three candidate parse trees over s1 s2 s3 s4, each rooted at span 1,4 and built from smaller spans such as 2,4, 1,3, 1,2, and 3,4, down to the single-character spans 1,1, 2,2, 3,3, and 4,4.)

  9. Running CYK
     Track every viable subtree from leaf to root. Here are all the subspans (written start,end) for a string of 6 terminals:

       1,6                                 (full string)
       1,5  2,6
       1,4  2,5  3,6
       1,3  2,4  3,5  4,6
       1,2  2,3  3,4  4,5  5,6
       1,1  2,2  3,3  4,4  5,5  6,6        (single characters)

     Each column shares a starting position; each diagonal shares an ending position.
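
The order in which these subspans are visited can be sketched in a few lines of Python. This is only an illustration: the function name `subspans` is hypothetical, and spans are 1-indexed and inclusive to match the slide.

```python
def subspans(n):
    """Yield every (start, end) span of a length-n string, shortest spans first."""
    for length in range(1, n + 1):
        for start in range(1, n - length + 2):
            yield (start, start + length - 1)


# Group the spans by length and print the longest row first,
# which reproduces the triangular layout shown on the slide.
rows = {}
for start, end in subspans(6):
    rows.setdefault(end - start + 1, []).append(f"{start},{end}")
for length in sorted(rows, reverse=True):
    print("  ".join(rows[length]))
```

Visiting shorter spans first guarantees that, when a span is processed, every smaller span it could be split into has already been handled.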

  10. CYK Example
      Grammar:
        F ⟶ I W
        F ⟶ I Y
        W ⟶ L X
        X ⟶ N R
        Y ⟶ L R
        N ⟶ id
        N ⟶ I Z
        Z ⟶ C N
        I ⟶ id
        L ⟶ (
        R ⟶ )
        C ⟶ ,
      Input: id ( id , id )
      Table (a dot marks a cell with no entry):

        1,6: F
        1,5: .   2,6: W
        1,4: .   2,5: .   3,6: X
        1,3: .   2,4: .   3,5: N   4,6: .
        1,2: .   2,3: .   3,4: .   4,5: Z   5,6: X
        I,N      L        I,N      C        I,N      R
        id       (        id       ,        id       )

      To fill a cell, in general, go up a column and down a diagonal.

  11. CYK Example (continued)
      Same grammar and input as slide 10. Labelled cells: 1,6: F; 2,6: W; 3,6: X; 3,5: N; 4,5: Z
      Bottom row: I,N  L  I,N  C  I,N  R

  12. CYK Example (continued)
      Same labelled cells. Bottom row: I,N  L  I,N  C  N  R (position 5 now reads N)

  13-16. CYK Example (continued)
      Same labelled cells. Bottom row: I,N  L  I  C  N  R (position 3 now reads I)
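
As a rough illustration of the table-filling procedure in the example slides, here is a minimal CYK recognizer in Python. It is a sketch under stated assumptions, not the course's implementation: the dictionary layout, the function name `cyk`, and the 1-indexed `(start, end)` table keys are all choices made here. The grammar is the one from the example slides.

```python
from itertools import product

# Grammar from the example slides (already in Chomsky Normal Form).
TERMINAL_RULES = {            # X -> t, keyed by the terminal t
    "id": {"I", "N"},
    "(": {"L"},
    ")": {"R"},
    ",": {"C"},
}
BINARY_RULES = {              # X -> A B, keyed by the pair (A, B)
    ("I", "W"): {"F"},
    ("I", "Y"): {"F"},
    ("L", "X"): {"W"},
    ("N", "R"): {"X"},
    ("L", "R"): {"Y"},
    ("I", "Z"): {"N"},
    ("C", "N"): {"Z"},
}


def cyk(tokens, start="F"):
    """Return (accepted, table); table[i, j] holds the nonterminals deriving span i..j."""
    n = len(tokens)
    if n == 0:
        return False, {}      # this grammar has no epsilon rule
    table = {}
    # Leaves: apply the X -> t rules to each single token.
    for i, tok in enumerate(tokens, start=1):
        table[i, i] = set(TERMINAL_RULES.get(tok, ()))
    # Interior cells: shorter spans first, trying every split point.
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            cell = set()
            for k in range(i, j):         # left part i..k, right part k+1..j
                for a, b in product(table[i, k], table[k + 1, j]):
                    cell |= BINARY_RULES.get((a, b), set())
            table[i, j] = cell
    return start in table[1, n], table


accepted, table = cyk(["id", "(", "id", ",", "id", ")"])
print(accepted)                 # True
print(sorted(table[3, 5]))      # ['N']
print(sorted(table[5, 6]))      # ['X']
```

The three nested loops over span length, start position, and split point are where the O(n^3) running time comes from. Note that the full table also records X for span 5,6 (from X ⟶ N R), an entry that is not part of the final parse tree.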

  17. Cleaning up our grammars
      We want to avoid unnecessary work
      – Remove useless rules

  18. Eliminating Useless Nonterminals
      1. If a nonterminal cannot derive a sequence of terminal symbols, then it is useless
      2. If a nonterminal cannot be derived from the start symbol, then it is useless

  19. Eliminate Useless Nonterminals
      Case 1: if a nonterminal cannot derive a sequence of terminal symbols, then it is useless.
      Mark all terminal symbols
      Repeat
        If all symbols on the righthand side of a production are marked, mark the lefthand side
      Until no more non-terminals can be marked
      Any nonterminal left unmarked is useless.

  20. Example:
      S ⟶ X | Y
      X ⟶ ( )
      Y ⟶ ( Y Y )
      (Y can never be marked, so Y is useless.)
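
The marking loop for this first kind of uselessness can be sketched as follows. The representation is an assumption made for illustration: each production is a `(lhs, rhs)` pair with `rhs` a tuple of symbols, and the function name `productive_nonterminals` is hypothetical.

```python
def productive_nonterminals(rules, terminals):
    """Return the nonterminals that can derive some sequence of terminals."""
    marked = set(terminals)               # step 1: mark all terminal symbols
    changed = True
    while changed:                        # repeat until nothing new is marked
        changed = False
        for lhs, rhs in rules:
            # If every RHS symbol is marked, mark the LHS.
            # An empty RHS (epsilon) counts as "all marked".
            if lhs not in marked and all(sym in marked for sym in rhs):
                marked.add(lhs)
                changed = True
    return marked - set(terminals)


rules = [
    ("S", ("X",)),
    ("S", ("Y",)),
    ("X", ("(", ")")),
    ("Y", ("(", "Y", "Y", ")")),
]
print(sorted(productive_nonterminals(rules, {"(", ")"})))   # ['S', 'X']
```

Y is absent from the result because its only production mentions Y itself, so its righthand side is never fully marked.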

  21. Eliminate Useless Nonterminals
      Case 2: if a nonterminal cannot be derived from the start symbol, then it is useless.
      Mark the start symbol
      Repeat
        If the lefthand side of a production is marked, mark all righthand non-terminals
      Until no more non-terminals can be marked
      Any nonterminal left unmarked is useless.

  22. Example:
      S ⟶ A B
      A ⟶ + | - | ε
      B ⟶ digit | B digit
      C ⟶ . B
      (C can never be marked, so C is useless.)
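
The second marking loop, which finds nonterminals reachable from the start symbol, can be sketched in the same style. As before, the `(lhs, rhs)` tuple representation and the function name `reachable_nonterminals` are assumptions made for illustration; an epsilon production is written with an empty tuple.

```python
def reachable_nonterminals(rules, start):
    """Return the nonterminals that can be derived from the start symbol."""
    nonterminals = {lhs for lhs, _ in rules}
    marked = {start}                      # step 1: mark the start symbol
    changed = True
    while changed:                        # repeat until nothing new is marked
        changed = False
        for lhs, rhs in rules:
            if lhs in marked:
                # Mark every nonterminal on the RHS of a marked LHS.
                for sym in rhs:
                    if sym in nonterminals and sym not in marked:
                        marked.add(sym)
                        changed = True
    return marked


rules = [
    ("S", ("A", "B")),
    ("A", ("+",)), ("A", ("-",)), ("A", ()),
    ("B", ("digit",)), ("B", ("B", "digit")),
    ("C", (".", "B")),
]
print(sorted(reachable_nonterminals(rules, "S")))   # ['A', 'B', 'S']
```

C never appears on the righthand side of a reachable production, so it stays unmarked and its rule can be dropped.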

  23. Chomsky Normal Form: 4 Steps
      – Eliminate epsilon rules
      – Eliminate unit rules
      – Fix productions with terminals on RHS
      – Fix productions with > 2 nonterminals on RHS

  24. Eliminate (Most) Epsilon Productions
      If a nonterminal A immediately derives epsilon
      – Make copies of all rules with A on the RHS, deleting every combination of occurrences of A in those copies
      – Remove the rule A ⟶ ε

  25. Example 1
      Before:
        F ⟶ id ( A )
        A ⟶ ε
        A ⟶ N
        N ⟶ id
        N ⟶ id , N
      After:
        F ⟶ id ( A )
        F ⟶ id ( )
        A ⟶ N
        N ⟶ id
        N ⟶ id , N

  26. Example 2
      Before:
        X ⟶ A x A y A
        A ⟶ ε
        A ⟶ z
      After:
        X ⟶ A x A y A | A x A y | A x y A | x A y A | A x y | x A y | x y A | x y
        A ⟶ z
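
The "delete all combinations" step from the two examples can be sketched with `itertools.product`. This is a simplified illustration for a single nonterminal that directly derives epsilon. The `(lhs, rhs)` tuple representation and the name `eliminate_epsilon` are assumptions. If deleting occurrences empties some other rule's righthand side, that rule becomes a new epsilon rule and the step would need to be repeated for its lefthand side.

```python
from itertools import product


def eliminate_epsilon(rules, nullable):
    """Drop `nullable -> epsilon`, copying rules with each combination of `nullable` deleted."""
    result = []
    for lhs, rhs in rules:
        # Each occurrence of `nullable` is either kept (True) or deleted (False);
        # every other symbol is always kept.
        options = [(True, False) if sym == nullable else (True,) for sym in rhs]
        for keep in product(*options):
            new_rhs = tuple(sym for sym, k in zip(rhs, keep) if k)
            if lhs == nullable and new_rhs == ():
                continue                  # never keep or recreate nullable -> epsilon
            if (lhs, new_rhs) not in result:
                result.append((lhs, new_rhs))
    return result


example2 = [
    ("X", ("A", "x", "A", "y", "A")),
    ("A", ()),
    ("A", ("z",)),
]
for lhs, rhs in eliminate_epsilon(example2, "A"):
    print(lhs, "->", " ".join(rhs))
```

On Example 2 this produces the eight X rules listed on the slide plus A ⟶ z: three occurrences of A give 2 × 2 × 2 = 8 keep/delete combinations.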

  27. Eliminate Unit Productions
      Productions of the form A ⟶ B are called unit productions.
      Place B anywhere A could have appeared and remove the unit production.

  28. Example 1
      Before:
        F ⟶ id ( A )
        F ⟶ id ( )
        A ⟶ N
        N ⟶ id
        N ⟶ id , N
      After:
        F ⟶ id ( N )
        F ⟶ id ( )
        N ⟶ id
        N ⟶ id , N
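
One way to read "place B anywhere A could have appeared" is as a substitution over righthand sides. The sketch below removes a single unit production that way. It assumes A is not the start symbol and A differs from B; the name `inline_unit_rule` and the `(lhs, rhs)` tuple representation are hypothetical. If the substitution creates further unit productions, the step would have to be repeated.

```python
from itertools import product


def inline_unit_rule(rules, a, b):
    """Remove the unit rule a -> b by letting b appear wherever a could appear."""
    remaining = [rule for rule in rules if rule != (a, (b,))]
    # If `a` has no other productions, every occurrence must become `b`;
    # otherwise each occurrence may either become `b` or stay `a`.
    a_has_other_rules = any(lhs == a for lhs, _ in remaining)
    result = []
    for lhs, rhs in remaining:
        options = [
            ((b, a) if a_has_other_rules else (b,)) if sym == a else (sym,)
            for sym in rhs
        ]
        for new_rhs in product(*options):
            if (lhs, new_rhs) not in result:
                result.append((lhs, new_rhs))
    return result


rules = [
    ("F", ("id", "(", "A", ")")),
    ("F", ("id", "(", ")")),
    ("A", ("N",)),
    ("N", ("id",)),
    ("N", ("id", ",", "N")),
]
for lhs, rhs in inline_unit_rule(rules, "A", "N"):
    print(lhs, "->", " ".join(rhs))
```

On the example grammar, A has no productions other than A ⟶ N, so every A is simply replaced by N, which gives the four rules shown on the slide.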

  29. Fix RHS Terminals
      For productions with terminals and something else on the RHS
      – For each terminal t add the rule X ⟶ t, where X is a new non-terminal
      – Replace t with X in the original rules

  30. Example
      Before:
        F ⟶ id ( N )
        F ⟶ id ( )
        N ⟶ id
        N ⟶ id , N
      After:
        F ⟶ I L N R
        F ⟶ I L R
        N ⟶ id
        N ⟶ I C N
        I ⟶ id
        L ⟶ (
        R ⟶ )
        C ⟶ ,
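
The terminal-replacement step can be sketched as below. The assumptions: productions are `(lhs, rhs)` tuples, a symbol counts as a terminal if it never appears as a lefthand side, and the caller supplies fresh nonterminal names through a `names` dictionary. Both `lift_terminals` and `names` are hypothetical, introduced only for this illustration.

```python
def lift_terminals(rules, names):
    """Replace terminals in RHSs of length >= 2 with new nonterminals from `names`."""
    nonterminals = {lhs for lhs, _ in rules}
    result, used = [], []
    for lhs, rhs in rules:
        if len(rhs) >= 2:                 # a terminal together with something else
            new_rhs = []
            for sym in rhs:
                if sym in nonterminals:
                    new_rhs.append(sym)
                else:
                    new_rhs.append(names[sym])   # KeyError if no name was supplied
                    if sym not in used:
                        used.append(sym)
            rhs = tuple(new_rhs)
        result.append((lhs, rhs))
    # Add one rule X -> t for each terminal that was replaced.
    result += [(names[t], (t,)) for t in used]
    return result


rules = [
    ("F", ("id", "(", "N", ")")),
    ("F", ("id", "(", ")")),
    ("N", ("id",)),
    ("N", ("id", ",", "N")),
]
names = {"id": "I", "(": "L", ")": "R", ",": "C"}
for lhs, rhs in lift_terminals(rules, names):
    print(lhs, "->", " ".join(rhs))
```

The rule N ⟶ id is left alone because a lone terminal on the righthand side is already in one of the two permitted forms.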

  31. Fix RHS Nonterminals
      For productions with > 2 nonterminals on the RHS
      – Replace all but the first nonterminal with a new nonterminal
      – Add a rule from the new nonterminal to the replaced nonterminal sequence
      – Repeat

  32. Example
      Start:
        F ⟶ I L N R
      After one step:
        F ⟶ I W
        W ⟶ L N R
      After two steps:
        F ⟶ I W
        W ⟶ L X
        X ⟶ N R
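
The repeated splitting can be sketched with a small work list. It assumes terminals have already been replaced, so every long righthand side contains only nonterminals, and that the caller passes in enough unused names. The function name `binarize` and the `fresh_names` parameter are hypothetical.

```python
def binarize(rules, fresh_names):
    """Split every rule with more than two RHS symbols into a chain of binary rules."""
    fresh = iter(fresh_names)             # supply of unused nonterminal names
    result = []
    work = list(rules)
    while work:
        lhs, rhs = work.pop(0)
        if len(rhs) > 2:
            new = next(fresh)
            # Keep the first symbol and replace the rest with the new nonterminal.
            result.append((lhs, (rhs[0], new)))
            # The new nonterminal derives the replaced sequence; it may need
            # splitting again, so it goes back on the work list.
            work.insert(0, (new, rhs[1:]))
        else:
            result.append((lhs, rhs))
    return result


rules = [
    ("F", ("I", "L", "N", "R")),
    ("F", ("I", "L", "R")),
]
for lhs, rhs in binarize(rules, "WXY"):
    print(lhs, "->", " ".join(rhs))
```

With the names W, X, and Y supplied in that order, the output is F ⟶ I W, W ⟶ L X, X ⟶ N R, F ⟶ I Y, Y ⟶ L R, which matches the binary rules used in the CYK example grammar.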

  33. Parsing is Tough
      CYK parses an arbitrary CFG, but
      – O(n^3) time
      – Too slow!
      For special classes of grammars
      – O(n) time
      – Examples of such classes: LL(1) and LALR(1)

  34. Classes of Grammars
      LL(1)
      – Scans input from Left-to-right (first L)
      – Builds a Leftmost derivation (second L)
      – Can peek (1) token ahead of the token being parsed
      – Top-down “predictive parsers”
      LALR(1)
      – Uses special lookahead procedure (LA)
      – Scans input from Left-to-right (second L)
      – Rightmost derivation (R)
      – Can also peek (1) token ahead
      LALR(1) is strictly more powerful, but the algorithm is harder to understand (Java CUP generates an LALR(1) parser)

  35. Summary
      We covered
      • How to parse with the CYK algorithm (dynamic programming)
      • How to put a grammar into Chomsky Normal Form
