Why NLP Needs Theoretical Syntax (It in Fact Already Uses It)
Owen Rambow
Center for Computational Learning Systems, Columbia University, New York City
rambow@ccls.columbia.edu
Key Issue: Representation
• Aravind Joshi to statisticians (adapted): “You know how to count, but we tell you what to count”
• Linguistic representations are not naturally occurring!
• They are devised by linguists
• Example: the English Penn Treebank (schematic bracketing below)
  – Beatrice Santorini (thesis: historical syntax of Yiddish)
  – Lots of linguistic theory went into the PTB
  – The PTB annotation manual is a comprehensive descriptive grammar of English
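To make this concrete, here is a schematic PTB II-style bracketing (constructed for illustration, not quoted from the corpus). Choices like the function tag -SBJ and the co-indexed empty category *T* are theoretical commitments encoded by the annotators, not raw data:

```
(S (NP-SBJ (NNP Kim))
   (VP (VBZ knows)
       (SBAR (WHNP-1 (WP what))
             (S (NP-SBJ (NNP Pat))
                (VP (VBD bought)
                    (NP (-NONE- *T*-1)))))))
```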
What Sort of Representations for Syntax?
• Syntax: links between text and meaning
• Text consists of words -> lexical models
  – Lexicalized formalisms
  – Note: bi- and monolexical versions of CFG
• Need to link to meaning (for example, PropBank)
  – Extended domain of locality to locate predicate-argument structure (see the sketch below)
  – Note: importance of dash tags (function tags) etc. in PTB II
• Tree Adjoining Grammar! (but CCG is also cool, and LFG has its own appeal)
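To make “extended domain of locality” concrete, here is a minimal sketch (the Node class and all names are hypothetical, not from any TAG toolkit) of a lexicalized elementary tree for a transitive verb: the verb and both of its argument slots live in a single tree object, so predicate-argument structure can be read off locally.

```python
# Minimal sketch of a lexicalized TAG elementary tree (illustration only).
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                      # category: "S", "NP", "VP", "V"
    children: list = field(default_factory=list)
    subst: bool = False             # substitution site = argument slot
    anchor: str = ""                # lexical anchor (the word)

def transitive_tree(verb: str) -> Node:
    """Alpha tree for a transitive verb: S(NP0! VP(V<verb> NP1!))."""
    subj = Node("NP", subst=True)               # NP0: subject slot
    obj = Node("NP", subst=True)                # NP1: object slot
    return Node("S", [subj, Node("VP", [Node("V", anchor=verb), obj])])

def argument_slots(tree: Node) -> list:
    """Collect all substitution sites. They sit in this one elementary
    tree: that is the extended domain of locality at work."""
    found = [tree] if tree.subst else []
    for child in tree.children:
        found += argument_slots(child)
    return found

tree = transitive_tree("devour")
print(len(argument_slots(tree)), "argument slots co-located with 'devour'")
# -> 2 argument slots co-located with 'devour'
```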
Why isn’t everyone using TAG?
• The PTB is not annotated with a TAG
• Need to do linguistic interpretation on the PTB to extract a TAG (Chen 2001, Xia 2001)
• This is not surprising: all linguistic representations need to be interpreted (Rambow 2010)
  – Extraction of a (P)CFG is simple and requires little interpretation
  – Extraction of a bilexical (P)CFG is not: it requires head percolation (sketched below), which is interpretation
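As an illustration of what that interpretation involves, here is a minimal sketch of head percolation in the style of Magerman/Collins head rules. The rule table is a tiny fragment I am assuming for the example, not the actual tables used for PTB extraction.

```python
# Sketch of head percolation: the "interpretation" needed to turn
# PTB constituency trees into bilexical dependencies.

HEAD_RULES = {
    # parent label: (search direction, preferred child labels in order)
    "S":  ("left",  ["VP", "S", "SBAR"]),
    "VP": ("left",  ["VBD", "VBZ", "VB", "VP"]),
    "NP": ("right", ["NN", "NNS", "NNP", "NP"]),
}

def head_child(label, children):
    """Pick the head child of a constituent via percolation rules."""
    direction, prefs = HEAD_RULES.get(label, ("left", []))
    kids = children if direction == "left" else list(reversed(children))
    for pref in prefs:
        for kid in kids:
            if kid[0] == pref:
                return kid
    return kids[0]  # fallback: first child in search direction

def lexical_head(tree):
    """Percolate head words up. tree = (label, [children]) or (tag, word)."""
    label, rest = tree
    if isinstance(rest, str):          # preterminal: (POS tag, word)
        return rest
    return lexical_head(head_child(label, rest))

# (S (NP (NNP John)) (VP (VBD saw) (NP (NNP Mary))))
tree = ("S", [("NP", [("NNP", "John")]),
              ("VP", [("VBD", "saw"), ("NP", [("NNP", "Mary")])])])
print(lexical_head(tree))  # -> "saw": the clause is headed by the verb
```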
Why isn’t everyone using TAG Parsers?
• Unclear how well they are performing
  – Phrase-structure (Parseval) evaluation is irrelevant here
• MICA parser (Bangalore et al. 2009):
  – High 80s on a linguistically motivated predicate-argument dependency structure (illustrated below)
  – MALT does slightly better on the same representation
  – But MICA output comes fully interpreted; MALT output does not
• Once we have a good syntactic pred-arg structure, tasks like semantic role labeling (PropBank) are easier
  – 95% on arguments given gold pred-arg structure (Chen and Rambow 2002)
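To see why the representation matters, consider a hypothetical contrast (not actual MICA or MALT output) between a surface dependency and a predicate-argument dependency for the raising sentence “John seems to sleep”: surface syntax attaches “John” to “seems”, but the pred-arg structure makes “John” an argument of “sleep”, which is what tasks like semantic role labeling consume.

```python
# Hypothetical dependency triples (head, dependent, relation) for
# "John seems to sleep"; labels are illustrative, not from any treebank.
surface_deps = [("seems", "John", "subj"), ("seems", "sleep", "xcomp")]
predarg_deps = [("seems", "sleep", "comp"), ("sleep", "John", "arg0")]

# Evaluating on predarg_deps rewards recovering who-did-what-to-whom.
for name, deps in [("surface", surface_deps), ("pred-arg", predarg_deps)]:
    for head, dep, rel in deps:
        print(f"{name}: {dep} --{rel}--> {head}")
```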
What Have We Learned About TAG Parsing?
• A large TAG grammar is not easy to manage computationally (MICA: 5,000 trees, 1,200 used in parsing)
• Small TAG grammars lose too much information
• Need to investigate:
  – Dynamic creation of TAG grammars: trees created in response to need (note: LTAG-spinal, Shen 2006)
  – “Bushes”: underspecified trees
  – Metagrammars (Kinyon 2003) (see the sketch after this list)
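A rough sketch of the metagrammar idea (hypothetical, not the actual Kinyon 2003 or MICA machinery): factor the grammar into a few orthogonal dimensions and generate the tree family combinatorially, so that thousands of elementary trees are maintained through a handful of statements.

```python
# Metagrammar sketch: generate a tree family from factored dimensions.
from itertools import product

SUBCAT  = ["intransitive", "transitive", "ditransitive"]
VOICE   = ["active", "passive"]
EXTRACT = ["none", "wh-subject", "wh-object", "relative"]

def compatible(subcat, voice, extraction):
    """Rule out incoherent combinations (illustrative constraints)."""
    if voice == "passive" and subcat == "intransitive":
        return False
    if extraction == "wh-object" and subcat == "intransitive":
        return False
    return True

family = [combo for combo in product(SUBCAT, VOICE, EXTRACT)
          if compatible(*combo)]
print(len(family), "tree descriptions from",
      len(SUBCAT) + len(VOICE) + len(EXTRACT), "factored statements")
# Changing one dimension updates the whole family consistently,
# instead of hand-editing thousands of individual trees.
```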
What about All Those Other Languages?
• Can’t build treebanks for 3,000 languages
• Need to understand cross-linguistic variation and use that understanding in computational models
  – Cross-linguistic variation: theoretical syntax
  – Models: NLP
  – Link: metagrammars for TAG
Summary
• Treebanks already encode insights from theoretical syntax
• They require interpretation for non-trivial models
• Applications beyond Parseval-style parsing require richer representations (and richer evaluations)
• But English is probably not the right language in which to argue for the need for richer syntactic knowledge
• The real coming bottleneck: NLP for 3,000 languages