Parsing with Lexicalized TAG (1)
Extracting and comparing LTAG (2)
Presentation by Philip John Gorinski
Seminar “Recent Advances in Parsing Technology”, Saarland University, Winter Term 2011/12
(1) Yves Schabes, Aravind K. Joshi, 1990
(2) Fei Xia, Chung-hye Han, Martha Palmer, and Aravind Joshi, 2001
Overview
● Introduction
  ● Lexicalized TAG, Advantages of parsing with LTAG
● Parsing LTAGs
  ● bottom-up
  ● top-down
  ● bottom-up + dynamic top-down
● Extracting and Comparing LTAG
  ● Data
  ● Extraction
  ● Language comparison using LTAGs
● Conclusion
Introduction: Lexicalized TAG
● like regular Tree Adjoining Grammar
  ● initial trees (α-trees) / auxiliary trees (β-trees)
  ● substitution (↓) / adjunction (*) of trees
● additional properties
  ● lexical “anchor” for each tree, i.e., all trees are associated with the lexicon
  ● here also: separation of lexicon and tree families
4 / 36 Parsing with lexicalized TAG
Introduction: Lexicalized TAG
Substitution:
[Figure: the initial tree NP → D N for “the girl” is substituted at the NP1↓ node of the tree anchored by “saw” (S → NP0 VP, VP → V NP1↓), whose subject NP0 is “the boy”; the result is the tree for “the boy saw the girl”.]
Introduction: Lexicalized TAG
Adjunction:
[Figure: the auxiliary tree N → A N* anchored by “pretty” is adjoined at the N node of “girl” in the tree for “the boy saw the girl”, giving “the boy saw the pretty girl”.]
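The two operations on the previous slides can be sketched in a few lines of Python. This is a minimal illustration, not the representation used in the cited papers: the `Node` class and the functions `substitute`, `adjoin`, `find_foot` and `leaves` are hypothetical names introduced here, and the trees are the toy trees from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None   # lexical item on a leaf, if any
    subst: bool = False          # substitution site (marked with a down arrow)
    foot: bool = False           # foot node of an auxiliary tree (marked with *)


def leaves(node):
    """Return the words of the tree from left to right."""
    if not node.children:
        return [node.word] if node.word else []
    words = []
    for child in node.children:
        words.extend(leaves(child))
    return words


def substitute(tree, initial):
    """Fill the first (depth-first) substitution site whose label matches."""
    for i, child in enumerate(tree.children):
        if child.subst and child.label == initial.label:
            tree.children[i] = initial
            return True
        if substitute(child, initial):
            return True
    return False


def find_foot(node):
    """Return the foot node of an auxiliary tree, or None."""
    if node.foot:
        return node
    for child in node.children:
        found = find_foot(child)
        if found is not None:
            return found
    return None


def adjoin(target, aux_root):
    """Adjoin an auxiliary tree at `target`, modifying `target` in place."""
    foot = find_foot(aux_root)
    assert foot is not None and foot.label == target.label == aux_root.label
    # The material below the target moves under the foot node ...
    foot.children, foot.word, foot.foot = target.children, target.word, False
    # ... and the target takes over the content of the auxiliary root.
    target.children, target.word = aux_root.children, aux_root.word


# Toy trees from the slides (assumed shapes).
girl_n = Node("N", word="girl")
np_boy = Node("NP", [Node("D", word="the"), Node("N", word="boy")])
np_girl = Node("NP", [Node("D", word="the"), girl_n])
saw = Node("S", [Node("NP", subst=True),
                 Node("VP", [Node("V", word="saw"), Node("NP", subst=True)])])
pretty = Node("N", [Node("A", word="pretty"), Node("N", foot=True)])

substitute(saw, np_boy)    # fills the subject site
substitute(saw, np_girl)   # fills the object site
sentence_before = leaves(saw)
adjoin(girl_n, pretty)     # inserts "pretty" above "girl"
sentence_after = leaves(saw)
```

Here `sentence_before` holds the words of “the boy saw the girl” and `sentence_after` those of “the boy saw the pretty girl”. Adjunction is done in place so that no parent pointers are needed; a real parser would more likely record derivations than rewrite trees.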
Introduction: Lexicalized TAG
● Tree families
  ● essentially LTAG trees, but with the anchor abstracted away
  ● e.g., family of verbs taking one object (np0Vnp1)
[Figure: two trees of this family with the anchor slot V◊: the declarative tree S → NP0↓ VP, VP → V◊ NP1↓, and a wh-variant with a fronted NPi↓ (+wh) and a co-indexed trace εi.]
● Lexicon: associates verbs with tree families
Introduction: Advantages
● TAG provides extended domain of locality
  ● capture non-local features in a localized fashion
  ● 'production-like'
  ● LTAG preserves this feature
● LTAG provides linking to lexical information
  ● very useful for actual parsing
  ● limited search space, prevention of recursion [...]
Parsing LTAGs
● General two-step strategy for lexicalized grammars
  1. select elementary structures for lexical input items
  2. parse the sentence with respect to the resulting set of structures
● first step 'filters' the grammar
  ● may drastically reduce search space
  ➔ LTAGs are finitely ambiguous!
  ● may guide top-down parser by using bottom-up information, e.g., item's position in input string
● second step suitable for any parsing algorithm
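The first step of this strategy can be pictured as a lexicon lookup that also records where each anchor occurs. The sketch below is only illustrative: the toy lexicon, the tree names (`alpha_D`, `alpha_nx0Vnx1`, ...) and the function `select_trees` are assumptions made for this example, not part of the cited work.

```python
from collections import namedtuple

# One selected elementary tree: its name, its anchor word, and the
# 1-based position of that anchor in the input string.
SelectedTree = namedtuple("SelectedTree", "tree anchor position")

# Hypothetical toy lexicon: word -> names of the elementary trees it anchors.
LEXICON = {
    "the": ["alpha_D"],
    "boy": ["alpha_NP"],
    "girl": ["alpha_NP"],
    "saw": ["alpha_nx0Vnx1", "alpha_NP"],  # verb reading and noun reading
    "pretty": ["beta_A"],
}


def select_trees(tokens, lexicon=LEXICON):
    """Step 1: keep only trees anchored by a word that occurs in the input."""
    selected = []
    for position, word in enumerate(tokens, start=1):
        for tree in lexicon.get(word, []):
            selected.append(SelectedTree(tree, word, position))
    return selected


selected = select_trees("the boy saw the girl".split())
```

For this input the tree anchored by “pretty” is never selected, and “the” contributes two instances that differ only in their anchor position. Any parser can then run on `selected` in the second step.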
Parsing LTAGs: bottom-up
● CKY-type parser for TAG (Vijay-Shanker and Joshi, 1985)
  ● data driven
● bottom-up information of first stage has no effect on algorithm itself
● grammar filtering reduces number of nodes in the recognition matrix
Parsing LTAGs: top-down
● like push-down automata for CFG parsing (Lang, 1990)
● indices for subtrees spanning the input
  ● CFG: 2 indices; (L)TAG: 4 indices for positions left/right of anchor in auxiliary trees
[Figure: auxiliary tree with root X and foot node X*, annotated with the four input positions i, j, k, l.]
Parsing LTAGs: top-down
● problem for top-down: left-recursion
  ● A → A B
  ● infinite search space
  ● quite frequent phenomenon in TAG
● solved by grammar filtering for LTAG
  ● parser considers only elementary trees selected by first stage
  ● can be distinguished by typology and position in input string
  ➔ each tree only used once
  ● finite search space even for top-down parser!
Parsing: bottom-up + dynamic top-down
● Earley-type TAG parser (Schabes and Joshi, 1988)
  ● scan / predict / complete
● use bottom-up prediction to guide top-down parsing
● straightforward parsing for LTAGs
  ● lexicalization simplifies certain steps of the algorithm
Parsing: bottom-up + dynamic top-down
1. first pass selects subset of grammar
   ➔ limits search space
2. each tree is anchored
   ➔ the same state set cannot both predict a tree for substitution and complete it
   ➔ the same state set cannot both predict an auxiliary tree for left adjunction and complete it on the right
3. information about the anchor position can be used to filter top-down predictions / completions for adjunction and substitution
Parsing: bottom-up + dynamic top-down
the₁ men₂ who₃ hate₄ women₅ that₆ smoke₇ cigarettes₈ are₉ intolerant₁₀
● with normal TAG, “men” could be predicted for substitution in the “hate/smoke” structure
  ● would lead to backtracking in later analysis
● lexicalization prevents prediction!
  ● anchor position does not match the string
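The effect of the anchor position on prediction can be illustrated with a deliberately simplified filter. It assumes a strictly left-to-right parser and only checks that a predicted tree's anchor lies in the part of the input that has not been scanned yet; the actual conditions in the Earley-type parser are more refined. The tree name `alpha_N` and the function `predictable` are hypothetical.

```python
def predictable(selected, scanned):
    """Keep trees whose anchor lies to the right of the scanned prefix.

    `selected` holds (tree name, anchor word, anchor position) triples with
    1-based positions; `scanned` is the number of tokens consumed so far.
    """
    return [(tree, word, pos) for (tree, word, pos) in selected if pos > scanned]


# Noun trees selected for the example sentence on the slide.
nouns = [
    ("alpha_N", "men", 2),
    ("alpha_N", "women", 5),
    ("alpha_N", "cigarettes", 8),
]

# After scanning "the men who hate" (4 tokens), predict the object of "hate".
candidates = predictable(nouns, scanned=4)
```

With this filter the tree anchored by “men” is not among the candidates for the object of “hate”, since its anchor at position 2 has already been consumed, so the prediction that would later force backtracking is never made.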
Motivation
● Automatic extraction of grammars has motivations in both theoretical linguistics and NLP engineering
● Theoretical motivation
  ● quantitative testing of Universal Grammar
  ● explore similarities and differences of languages
● Engineering motivation
  ● links between structures of different grammars
  ● valuable for parsing, lexicon development, machine translation ...
18 / 36 Extracting and comparing LTAG
Data
● 3 languages for comparison
  ● English, Chinese, Korean
  ● Germanic, Sino-Tibetan, Altaic
● Different word order
  ● SVO (En, Ch) vs. SOV (Ko)
  ● permutable argument NPs (Ko)
● Subject/Object deletion
  ● freely (Ch, Ko) vs. none (En)
● Inflectional morphology
  ● rich (Ko) vs. little (En) vs. none (Ch)
Data
● English Penn Treebank II (Marcus et al., 1993)
  ● 1,174K words, ~23.85 words/sentence, 94 tags
● Chinese Penn Treebank (Xia et al., 2000)
  ● 100K words, ~23.81 words/sentence, 92 tags
● Korean Penn Treebank (Han et al., 2001)
  ● 54K words, ~10.71 words/sentence, 61 tags
● All provide phrase structure annotation
● All use a similar annotation scheme
Data
[Figure: example of an English Penn Treebank sentence]
Extraction
● Tool: LexTract
● recognizes 3 types of initial/auxiliary LTAG trees
  ● Spine: predicate-argument relations
  ● Mod: modification rules
  ● Conj: coordination relations
● each extracted tree should fall into exactly one category
Extraction
● Spine-trees
  ● X^0: anchor, head of X^m
  ● tree is formed by
    ● a spine X^m → X^(m-1) → ... → X^0
    ● the arguments of X^0
Extraction
● Mod-trees
  ● W^q: root with two children
  ● W^q*: adjunction node with same label as W^q
  ● X^m: modifier of W^q*, spine-tree with
Extraction
● Conj-trees
  ● root with 3 children
    ● Conjunct: adjunction node X^m*
    ● Conjunction
    ● Conjunct: spine tree X^m → ... → X^0
Extraction
“(at) underwriters still draft policies using fountain pens and blotting paper”
[Figure: the elementary trees extracted for this sentence, grouped into spine-trees, mod-trees, and a conj-tree.]
Extraction: Results

          template types   etree types   word types   context-free rules
English            6,926       131,397       49,206                1,524
Chinese            1,140        21,125       10,772                  515
Korean               632        13,941       10,035                  152

● Templates: etrees with lexical items removed
● CFG extracted by reading rules off the templates
● small subsets of frequent templates cover majority of tokens
  ● English: Top 100 (500, 1000, 1500) = 87.1% (96.6%, 98.4%, 99.0%)
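The coverage figures on this slide are the share of tokens whose template is among the k most frequent ones. A small sketch of that computation follows; the template names and counts are invented toy data, not treebank figures, and `coverage` is a hypothetical helper.

```python
from collections import Counter


def coverage(template_counts, k):
    """Fraction of tokens covered by the k most frequent templates."""
    total = sum(template_counts.values())
    top_k = sum(count for _, count in template_counts.most_common(k))
    return top_k / total


# Invented toy counts: template name -> number of tokens using it.
toy_counts = Counter({"t1": 50, "t2": 30, "t3": 15, "t4": 5})

top2 = coverage(toy_counts, 2)   # (50 + 30) / 100
```

In the toy data the two most frequent templates already cover 80% of the tokens, which mirrors in miniature the skewed distribution reported for English.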
Language Comparison
● Make LTAGs comparable
  ● create new shared tagset
  ● merge original tags into new tags
  ● replace original treebank tags
  ● re-run LexTract
● Compare LTAGs for English, Chinese, Korean
  ● templates
  ● context-free rules
  ● sub-templates
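Merging tags and replacing them in the treebank amounts to relabelling every tree with a many-to-one mapping before extraction is re-run. The sketch below assumes trees are stored as nested tuples; the shared tags `N`, `V`, `D` and the mapping itself are made up for illustration and are not the tagset used in the study.

```python
# Hypothetical mapping from English Penn Treebank POS tags to shared tags.
TAG_MAP = {"NN": "N", "NNS": "N", "VBD": "V", "VBZ": "V", "DT": "D"}


def relabel(tree, tag_map):
    """Return a copy of the tree with every mapped label replaced.

    A tree is a tuple (label, child, ...); a bare string is a word.
    """
    if isinstance(tree, str):
        return tree
    label, *children = tree
    return (tag_map.get(label, label), *[relabel(c, tag_map) for c in children])


original = ("S",
            ("NP", ("DT", "the"), ("NN", "boy")),
            ("VP", ("VBD", "saw"),
                   ("NP", ("DT", "the"), ("NN", "girl"))))

shared = relabel(original, TAG_MAP)
```

Labels without an entry in the mapping, such as `S`, `NP` and `VP` here, are left unchanged, so the same function can be applied with a separate mapping per treebank.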
Language Comparison
● new tagsets reduce templates by ~50%
● few shared, high-frequency templates account for large portion of observed data across languages