Some Experiments on Indicators of Parsing Complexity for Lexicalized Grammars
Anoop Sarkar, Fei Xia and Aravind Joshi
Dept. of Computer and Information Sciences
University of Pennsylvania
{anoop,fxia,joshi}@linc.cis.upenn.edu
Lexicalized Tree Adjoining Grammars
[Figure: elementary trees anchored by the words Ms., Haag, plays and Elianti]
These trees can be combined to parse the sentence: Ms. Haag plays Elianti.
Important Properties of LTAG wrt Parsing
• Predicate-argument structure is represented in each elementary tree.
• Adjunction instead of feature unification.
• No recursive feature structures. FSs are bounded.
Important Properties of LTAG wrt Parsing
• Transformational relations for the same predicate-argument structure are precomputed.
• Each predicate selects a family of elementary trees.
• Different sources of issues for parsing efficiency.
Parsing Efficiency
• Parsing accuracy: Evaluations done in previous work.
• Parsing efficiency: observed time complexity for producing all parses.
• The usual notion: compare different parsing algorithms wrt time, space, number of edges, ...
• This paper: explore parsing efficiency from a viewpoint that is independent of a particular parsing algorithm.
Parsing Efficiency
• Not an alternative to comparison of parsing algorithms.
• An exploration of parsing efficiency from the perspective of a fully lexicalized grammar.
• Sources of parsing complexity that are part of the input to the parsing algorithm.
Parsing Efficiency
• We explore two issues: syntactic lexical ambiguity and clausal complexity.
• The contention: for LTAGs these issues are relevant across all parsing algorithms.
Experiment: The Parser
• Implementation of head-corner chart-based parser.
• It is bi-directional – van Noord style.
• Produces a derivation forest as output.
• Written in ANSI C; a version is available at ftp://ftp.cis.upenn.edu/xtag/pub/lem
Experiment: Input Grammar
• Treebank Grammar extracted from Sections 02–21 of the WSJ Penn Treebank
• 6789 tree templates, 123039 lexicalized trees
• number of word types in the lexicon is 44215
• average number of trees per word is 2.78
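As a quick arithmetic check, the average number of trees per word follows from dividing the number of lexicalized trees by the number of word types. The variable names below are illustrative and do not come from the original slides.

```python
# Figures quoted on the slide for the Treebank grammar.
lexicalized_trees = 123039
word_types = 44215

# Average number of lexicalized trees per word type.
avg_trees_per_word = lexicalized_trees / word_types
print(f"{avg_trees_per_word:.2f}")  # prints 2.78
```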
[Figure: Number of trees selected by the words grouped by word frequency. x-axis: word frequency; y-axis: number of trees selected.]
Treebank Grammar and XTAG English Grammar
• Compared TG with the XTAG Grammar which has 1004 tree templates, 53 tree families and 1.8 million lexicalized trees.
• 82.1% of template tokens in the Treebank grammar match a corresponding template in the XTAG grammar
• 14.0% are covered by the XTAG grammar but the templates look different because of different linguistic analyses
Treebank Grammar and XTAG English Grammar
• 1.1% of template tokens in the Treebank grammar are due to annotation errors
• The remaining 2.8% are not currently covered by the XTAG grammar
• A total of 96.1% of the structures in the Treebank grammar match up with structures in the XTAG grammar.
Experiment: Test Corpus
• input was a set of 2250 sentences
• each sentence was 21 words or less
• avg. sentence length was 12.3
• number of tokens = 27715
• output: shared forest of parses
[Figure: Number of derivations per sentence. x-axis: sentence length; y-axis: log(No. of derivations).]
[Figure: Parsing times per sentence. x-axis: sentence length; y-axis: log(time) in seconds. Coefficient of determination R² = 0.65.]
[Figure: Median parsing times per sentence. x-axis: sentence length; y-axis: median time (seconds).]
Hypothesis
• There is a large variability in parse times.
• The typical increase in time depending on sentence length is not observed.
• Can a sentence predict its own parsing time?
• Hypothesis: check the number of lexicalized trees that are selected by each sentence.
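The quantity proposed in the hypothesis can be sketched as a simple count over a lexicon. This is a minimal illustration, assuming a lexicon that maps each word to the elementary trees it anchors; the toy lexicon, the tree names and the function `trees_selected` are hypothetical and are not taken from the Treebank grammar or the parser.

```python
from typing import Dict, List

# Hypothetical toy lexicon: word -> names of the elementary trees it anchors.
# The tree names are placeholders, not templates from the Treebank grammar.
TOY_LEXICON: Dict[str, List[str]] = {
    "Ms.": ["NP_modifier"],
    "Haag": ["NP_initial", "NP_modifier"],
    "plays": ["S_transitive", "S_intransitive", "NP_noun"],
    "Elianti": ["NP_initial"],
}


def trees_selected(sentence: List[str], lexicon: Dict[str, List[str]]) -> int:
    """Total number of lexicalized trees selected by the words of a sentence."""
    # Assumption: unknown words select no trees in this sketch.
    return sum(len(lexicon.get(word, [])) for word in sentence)


count = trees_selected(["Ms.", "Haag", "plays", "Elianti"], TOY_LEXICON)
print(count)  # 1 + 2 + 3 + 1 = 7
```

Two sentences of the same length can select very different numbers of trees under such a count, which is one way to read the variability in parse times noted above.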
[Figure: The impact of syntactic lexical ambiguity on parsing times. x-axis: total number of trees selected by a sentence; y-axis: log(time taken) in seconds. R² = 0.82 (previous = 0.65).]
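A coefficient of determination such as the ones quoted here can be computed from paired observations as sketched below. The slides do not say how the fit was done, so the ordinary least-squares line is an assumption, and the function name `r_squared` and the sample data are invented for illustration only.

```python
from typing import Sequence


def r_squared(xs: Sequence[float], ys: Sequence[float]) -> float:
    """R^2 of the ordinary least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope and intercept of the least-squares line.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Made-up data: trees selected per sentence and log(parse time).
trees = [50, 120, 300, 450, 800]
log_time = [1.0, 2.1, 4.2, 5.0, 8.9]
print(r_squared(trees, log_time))
```

Running the same function with sentence length as the predictor and then with the number of selected trees gives the kind of comparison reported on the slide (0.65 versus 0.82).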
Hypothesis
• To test the hypothesis further we did the following tests:
  – Check time taken when an oracle gives us the single correct tree for each word.
  – Check time taken after parsing based on the output of an n-best SuperTagger.
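The effect of n-best filtering on the parser's input can be sketched as follows. This is not the SuperTagger itself: the scores, tree names and the function `n_best_trees` are hypothetical, and the sketch only shows how keeping the n highest-scoring trees per word shrinks the set of trees handed to the parser.

```python
from typing import Dict, List, Tuple

# Hypothetical scored candidates: word -> list of (tree name, score) pairs.
Scored = Dict[str, List[Tuple[str, float]]]


def n_best_trees(candidates: Scored, n: int) -> Dict[str, List[str]]:
    """Keep only the n highest-scoring trees for each word."""
    pruned: Dict[str, List[str]] = {}
    for word, scored in candidates.items():
        # Sort by score, highest first, then cut the list at n entries.
        ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
        pruned[word] = [tree for tree, _ in ranked[:n]]
    return pruned


candidates: Scored = {
    "plays": [("S_transitive", 0.6), ("NP_noun", 0.1), ("S_intransitive", 0.3)],
    "Elianti": [("NP_initial", 0.9)],
}
pruned = n_best_trees(candidates, 2)
print(pruned["plays"])  # ['S_transitive', 'S_intransitive']
```

With n = 1 this reduces to one tree per word, which matches the oracle condition only when the top-scoring tree happens to be the correct one.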
[Figure: Parse times when the parser gets the correct tree for each word in the sentence. x-axis: sentence length; y-axis: log(time taken in secs).]
Total time = 31.2 secs vs. 548K secs (orig)
[Figure: Time taken by the parser after n-best SuperTagging (n = 60). x-axis: sentence length; y-axis: log(time in secs).]
Total time = 21K secs vs. 548K secs (orig)
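The totals reported for the two tests can be turned into rough speedup factors. The sketch assumes that "548K" and "21K" mean 548,000 and 21,000 seconds; since those figures are rounded on the slides, the ratios are only approximate, and the variable names are illustrative.

```python
# Total parse times from the slides; "K" is read as thousands (an assumption).
original_secs = 548_000   # all selected trees given to the parser
oracle_secs = 31.2        # single correct tree per word
supertag_secs = 21_000    # n-best SuperTagger output, n = 60

oracle_speedup = original_secs / oracle_secs
supertag_speedup = original_secs / supertag_secs
print(f"{oracle_speedup:.0f}x, {supertag_speedup:.1f}x")  # prints 17564x, 26.1x
```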
Clausal Complexity
• The complexity of syntactic and semantic processing is related to the number of predicate-argument structures being computed for a given sentence.
• This notion of complexity can be measured using the number of clauses in the sentence.
• Does the number of clauses grow proportionally with sentence length?
[Figure: Average number of clauses plotted against sentence length. x-axis: sentence length; y-axis: average number of clauses in the sentences.]
99.1% of sentences in the Penn Treebank contain 6 or fewer clauses.
[Figure: Standard deviation of clause number plotted against sentence length. x-axis: sentence length; y-axis: standard deviation of clause number.]
Increase in deviation for sentences longer than 50 words.
[Figure: Variation in parse time against sentence length while identifying the number of clauses. Axes: sentence length, number of clauses, log(time taken in secs).]
[Figure: Variation in parse time against number of trees. Axes: number of trees selected, number of clauses, log(time taken in secs).]
The parser spends most of its time attaching modifiers.
Conclusions
• We explored two issues that affect parsing efficiency for LTAGs: syntactic lexical ambiguity and clausal complexity.
  – Parsing time for LTAGs is determined largely by the number of trees selected by a sentence.
  – Number of clauses does not grow proportionally with sentence length.
• Current work: incorporate these factors to improve parsing efficiency for LTAGs.