Some Experiments on Indicators of Parsing Complexity for Lexicalized Grammars

  1. Some Experiments on Indicators of Parsing Complexity for Lexicalized Grammars
     Anoop Sarkar, Fei Xia and Aravind Joshi
     Dept. of Computer and Information Sciences, University of Pennsylvania
     {anoop,fxia,joshi}@linc.cis.upenn.edu

  2. Lexicalized Tree Adjoining Grammars
     [Figure: LTAG elementary trees anchored by "Ms.", "Haag", "plays" and "Elianti"]
     These trees can be combined to parse the sentence "Ms. Haag plays Elianti."
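
As a rough illustration of how lexicalized elementary trees combine (a simplified sketch, not the actual trees or parser from the paper), the snippet below builds toy elementary trees for the example sentence and combines them by substitution; adjunction is omitted and all tree shapes, labels and class names are assumptions.

```python
# A minimal sketch of LTAG-style elementary trees combined by substitution.
# Tree shapes, labels and the Node class are simplified assumptions.

class Node:
    def __init__(self, label, children=None, subst=False, anchor=None):
        self.label = label              # syntactic category, e.g. "NP", "VP"
        self.children = children or []  # daughter nodes
        self.subst = subst              # True for an open substitution site
        self.anchor = anchor            # lexical anchor, e.g. "plays"

def substitute(tree, site_label, arg_tree):
    """Substitute `arg_tree` at the first open substitution node labelled
    `site_label`; returns (new_tree, done_flag)."""
    if tree.subst and tree.label == site_label:
        return arg_tree, True
    new_children, done = [], False
    for child in tree.children:
        if not done:
            child, done = substitute(child, site_label, arg_tree)
        new_children.append(child)
    return Node(tree.label, new_children, tree.subst, tree.anchor), done

def yield_of(tree):
    """Read off the lexical yield of a derived tree."""
    if not tree.children:
        return [tree.anchor] if tree.anchor else []
    return [w for c in tree.children for w in yield_of(c)]

# Elementary trees (simplified), one per lexical anchor.
ms_haag = Node("NP", [Node("NNP", anchor="Ms."), Node("NNP", anchor="Haag")])
elianti = Node("NP", [Node("NNP", anchor="Elianti")])
plays   = Node("S", [
    Node("NP", subst=True),                       # subject substitution site
    Node("VP", [Node("VBZ", anchor="plays"),
                Node("NP", subst=True)]),         # object substitution site
])

derived, _ = substitute(plays, "NP", ms_haag)     # substitute the subject
derived, _ = substitute(derived, "NP", elianti)   # substitute the object
print(" ".join(yield_of(derived)))                # -> Ms. Haag plays Elianti
```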

  3. Important Properties of LTAG wrt Parsing
     - Predicate-argument structure is represented in each elementary tree.
     - Adjunction instead of feature unification.
     - No recursive feature structures; feature structures are bounded.

  4. Important Properties of LTAG wrt Parsing
     - Transformational relations for the same predicate-argument structure are precomputed.
     - Each predicate selects a family of elementary trees.
     - Different sources of issues for parsing efficiency.

  5. Parsing Efficiency
     - Parsing accuracy: evaluations done in previous work.
     - Parsing efficiency: observed time complexity for producing all parses.
     - The usual notion: compare different parsing algorithms wrt time, space, number of edges, ...
     - This paper: explore parsing efficiency from a viewpoint that is independent of a particular parsing algorithm.

  6. Parsing Efficiency
     - Not an alternative to comparison of parsing algorithms.
     - An exploration of parsing efficiency from the perspective of a fully lexicalized grammar.
     - Sources of parsing complexity that are part of the input to the parsing algorithm.

  7. Parsing Efficiency
     - We explore two issues: syntactic lexical ambiguity and clausal complexity.
     - The contention: for LTAGs these issues are relevant across all parsing algorithms.

  8. Experiment: The Parser
     - Implementation of a head-corner chart-based parser.
     - It is bi-directional (van Noord style).
     - Produces a derivation forest as output.
     - Written in ANSI C; β-version available at ftp://ftp.cis.upenn.edu/xtag/pub/lem
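
The slides say only that the parser produces a derivation forest; as an illustrative sketch of the general idea (an assumption, not the LEM parser's actual data structures), a packed forest can be viewed as an AND/OR structure, which also lets the derivation counts reported on a later slide be computed without unpacking the forest.

```python
# A minimal sketch of a packed derivation forest: each item maps to a list of
# alternative analyses, and each alternative is a tuple of child items.
# The item names and toy forest below are assumptions for illustration only.
from functools import lru_cache

forest = {
    "S[0,4]":   [("NP[0,2]", "VP[2,4]")],
    "NP[0,2]":  [(), ("NNP[0,1]", "NNP[1,2]")],   # two packed analyses
    "VP[2,4]":  [("VBZ[2,3]", "NP[3,4]")],
    "NNP[0,1]": [()],
    "NNP[1,2]": [()],
    "VBZ[2,3]": [()],
    "NP[3,4]":  [()],
}

@lru_cache(maxsize=None)
def count_derivations(item):
    """Number of distinct derivations rooted at `item`, computed bottom-up
    without enumerating them."""
    total = 0
    for alternative in forest[item]:
        n = 1
        for child in alternative:
            n *= count_derivations(child)
        total += n
    return total

print(count_derivations("S[0,4]"))   # -> 2 in this toy forest
```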

  9. Experiment: Input Grammar
     - Treebank Grammar: extracted from Sections 02–21 of the WSJ Penn Treebank.
     - 6789 tree templates, 123039 lexicalized trees.
     - Number of word types in the lexicon is 44215.
     - Average number of trees per word is 2.78.
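
The statistics on this slide can be computed directly from an extracted grammar; below is a hypothetical sketch, assuming a simple list of (word, tree template) entries rather than the actual Treebank grammar files.

```python
# Hypothetical sketch: compute lexicon statistics for an extracted LTAG.
# The (word, tree_template) entry format is an assumption for illustration.
from collections import defaultdict

def lexicon_stats(pairs):
    """`pairs` is an iterable of (word, tree_template) lexicon entries."""
    trees_per_word = defaultdict(set)
    templates = set()
    for word, template in pairs:
        trees_per_word[word].add(template)
        templates.add(template)
    n_words = len(trees_per_word)
    n_lexicalized = sum(len(t) for t in trees_per_word.values())
    return {
        "tree templates": len(templates),
        "word types": n_words,
        "lexicalized trees": n_lexicalized,
        "avg trees per word": n_lexicalized / n_words,
    }

# Toy usage with made-up entries:
print(lexicon_stats([("plays", "t_nx0Vnx1"), ("plays", "t_NXN"), ("Haag", "t_NXN")]))
```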

  10. [Figure: number of trees selected by the words, grouped by word frequency (x-axis: word frequency; y-axis: number of trees selected)]

  11. Treebank Grammar and XTAG English Grammar
     - Compared the Treebank grammar with the XTAG grammar, which has 1004 tree templates, 53 tree families and 1.8 million lexicalized trees.
     - 82.1% of template tokens in the Treebank grammar match a corresponding template in the XTAG grammar.
     - 14.0% are covered by the XTAG grammar, but the templates look different because of different linguistic analyses.

  12. Treebank Grammar and XTAG English Grammar
     - 1.1% of template tokens in the Treebank grammar are due to annotation errors.
     - The remaining 2.8% are not currently covered by the XTAG grammar.
     - In total, 96.1% of the structures in the Treebank grammar match up with structures in the XTAG grammar.

  13. Experiment: Test Corpus
     - Input was a set of 2250 sentences.
     - Each sentence was 21 words or less.
     - Average sentence length was 12.3 words.
     - Number of tokens = 27715.
     - Output: shared forest of parses.

  14. [Figure: number of derivations per sentence (x-axis: sentence length; y-axis: log(number of derivations))]

  15. [Figure: parsing times per sentence (x-axis: sentence length; y-axis: log(time) in seconds)]
      Coefficient of determination R² = 0.65.
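
The R² reported here is an ordinary coefficient of determination for log parse time regressed on sentence length. A sketch of that computation is given below; the data points are made up, since the measured times are not reproduced in the slides.

```python
# Sketch of the coefficient of determination (R^2) relating log parse time to
# sentence length. The data points below are fabricated for illustration.
import numpy as np

def r_squared(x, y):
    """Fit y = a*x + b by least squares and return R^2."""
    a, b = np.polyfit(x, y, 1)
    y_hat = a * np.array(x) + b
    ss_res = np.sum((np.array(y) - y_hat) ** 2)
    ss_tot = np.sum((np.array(y) - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

sentence_lengths = [4, 6, 8, 10, 12, 14, 16, 18, 20]               # words
log_parse_times  = [0.1, 0.4, 1.2, 1.0, 2.5, 2.1, 3.9, 3.2, 5.0]   # log seconds (fake)
print(round(r_squared(sentence_lengths, log_parse_times), 2))
```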

  16. [Figure: median parsing times per sentence (x-axis: sentence length; y-axis: median time in seconds)]

  17. Hypothesis
     - There is a large variability in parse times.
     - The typical increase in time depending on sentence length is not observed.
     - Can a sentence predict its own parsing time?
     - Hypothesis: check the number of lexicalized trees that are selected by each sentence (sketched below).
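
Below is a toy sketch of the proposed predictor, the total number of lexicalized trees selected by the words of a sentence. The lexicon, tree names and unknown-word handling are illustrative assumptions.

```python
# Hypothetical sketch of the predictor in the hypothesis: the total number of
# lexicalized trees selected by the words of a sentence.

def trees_selected(sentence, lexicon, unknown_word_trees=1):
    """Sum, over the words of the sentence, the number of elementary trees
    each word selects in the lexicon (a fixed count for unseen words)."""
    total = 0
    for word in sentence.split():
        total += len(lexicon[word]) if word in lexicon else unknown_word_trees
    return total

toy_lexicon = {
    "Ms.":     {"t_NXN"},
    "Haag":    {"t_NXN"},
    "plays":   {"t_nx0Vnx1", "t_NXN"},   # verb and noun readings
    "Elianti": {"t_NXN"},
}
print(trees_selected("Ms. Haag plays Elianti", toy_lexicon))   # -> 5
```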

  18. [Figure: the impact of syntactic lexical ambiguity on parsing times (x-axis: total number of trees selected by a sentence; y-axis: log(time taken) in seconds)]
      R² = 0.82 (previous R² against sentence length: 0.65).

  19. Hypothesis
     - To test the hypothesis further we did the following tests:
       - Check the time taken when an oracle gives us the single correct tree for each word.
       - Check the time taken after parsing based on the output of an n-best SuperTagger (see the sketch below).
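
Purely as an illustration of the second test (the actual SuperTagger interface is not shown in the slides), restricting each word's tree selection to its n highest-scoring supertags before parsing could be sketched as follows; the scoring model and data structures are assumptions.

```python
# Hypothetical sketch of n-best SuperTag filtering before parsing: each word's
# selected trees are restricted to the n highest-scoring supertags.

def nbest_filter(sentence, lexicon, scores, n=60):
    """For each word, keep only the n supertags with the highest score.
    `scores[(word, tag)]` is an assumed per-word supertag score."""
    filtered = {}
    for word in sentence.split():
        tags = lexicon.get(word, set())
        ranked = sorted(tags, key=lambda t: scores.get((word, t), 0.0), reverse=True)
        filtered[word] = set(ranked[:n])
    return filtered

toy_lexicon = {"plays": {"t_nx0Vnx1", "t_NXN"}, "Haag": {"t_NXN"}}
toy_scores  = {("plays", "t_nx0Vnx1"): 0.9, ("plays", "t_NXN"): 0.1}
print(nbest_filter("Haag plays", toy_lexicon, toy_scores, n=1))
```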

  20. [Figure: parse times when the parser gets the correct tree for each word in the sentence (x-axis: sentence length; y-axis: log(time taken) in seconds)]
      Total time = 31.2 secs vs. 548K secs (orig).

  21. [Figure: time taken by the parser after n-best SuperTagging, n = 60 (x-axis: sentence length; y-axis: log(time) in seconds)]
      Total time = 21K secs vs. 548K secs (orig).

  22. Clausal Complexity
     - The complexity of syntactic and semantic processing is related to the number of predicate-argument structures being computed for a given sentence.
     - This notion of complexity can be measured using the number of clauses in the sentence (a counting sketch follows).
     - Does the number of clauses grow proportionally with sentence length?
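
Clause counts like these can be read off Penn Treebank bracketings by counting clausal nonterminals. The rough sketch below treats S, SBAR, SINV and SQ labels as clauses, which is an assumption about the exact criterion used in the paper.

```python
# Rough sketch: count clauses in a Penn Treebank bracketing by counting
# clausal labels. The label set below is an assumed criterion.
import re

CLAUSE_LABELS = {"S", "SBAR", "SINV", "SQ"}

def count_clauses(bracketing):
    """Count clausal nonterminals in a Treebank-style bracketed string,
    ignoring functional tags such as -TPC or -SBJ."""
    labels = re.findall(r"\(([^\s()]+)", bracketing)
    return sum(1 for label in labels if label.split("-")[0] in CLAUSE_LABELS)

tree = "(S (NP (NNP Ms.) (NNP Haag)) (VP (VBZ plays) (NP (NNP Elianti))) (. .))"
print(count_clauses(tree))   # -> 1
```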

  23. [Figure: average number of clauses plotted against sentence length]
      99.1% of sentences in the Penn Treebank contain 6 or fewer clauses.

  24. [Figure: standard deviation of clause number plotted against sentence length]
      The deviation increases for sentences longer than 50 words.

  25. [Figure: variation in parse time against sentence length while identifying the number of clauses (axes: sentence length, number of clauses, log(time taken) in seconds)]

  26. [Figure: variation in parse time against number of trees selected while identifying the number of clauses (axes: number of trees selected, number of clauses, log(time taken) in seconds)]
      The parser spends most of its time attaching modifiers.

  27. Conclusions
     - We explored two issues that affect parsing efficiency for LTAGs: syntactic lexical ambiguity and clausal complexity.
       - Parsing of LTAGs is determined by the number of trees selected by a sentence.
       - The number of clauses does not grow proportionally with sentence length.
     - Current work: incorporate these factors to improve parsing efficiency for LTAGs.
