Automated Acquisition of Linguistic Knowledge for Robust

Outline: Introduction; Acquisition of MWEs: Theoretical Background & Motivation; Detection of MWE candidates; Evaluation of the Identification of MWEs; Extension of the English Resource Grammar with MWEs; Enhancing Robustness of the German Grammar


  1. So how does all this fit together? In what follows, (1) we will “learn” (including validate and evaluate) MWEs for the DELPH-IN English Resource Grammar (ERG; [Flickinger, 2000]) and thereby contribute to boosting its coverage; (2) we will tackle the robustness problem of the DELPH-IN German Grammar (GG; [Crysmann, 2003]) when it is employed in real-life applications, by automatically enhancing it with the linguistic knowledge it lacks. Valia Kordoni, Automated Acquisition of Linguistic Knowledge

  2. The DELPH-IN Collaboration (Deep Linguistic Processing with HPSG Initiative). Grammars: English: LinGO ERG (23K lexical entries); German: GG (35K lexical entries); Japanese: JaCY (48K lexical entries); others: Norwegian, Modern Greek, Korean, Chinese, ... Processing software: LKB (grammar engineering platform); PET (efficient parser); [incr tsdb()] (profiling platform); HoG (infrastructure for building hybrid NLP applications based on RMRS semantic representations). Applications: Machine Translation, IE, Email Autoresponse, ... All of them are available online: http://wiki.delph-in.net/moin

  3. Road Map: (1) Introduction; (2) Acquisition of MWEs: Theoretical Background & Motivation; (3) Detection of MWE candidates; (4) Evaluation of the Identification of MWEs (Resources; Comparing Corpora; Comparing Statistical Measures); (5) Extension of the English Resource Grammar with MWEs (Setup; Grammar Performance); (6) Enhancing Robustness of the German Grammar

  4. MWEs: Theoretical Linguistic Background. Multiword Expressions: Definition. A multiword expression (MWE) is (i) decomposable into multiple simplex words and (ii) lexically, syntactically, semantically, pragmatically and/or statistically idiosyncratic.

  5. MWEs: Theoretical Linguistic Background. Some examples: San Francisco, ad hoc, by and large, Where Eagles Dare, kick the bucket, part of speech, in step, the Oakland Raiders, trip the light fantastic, telephone box, call (someone) up, take a walk, do a number on (someone), take (unfair) advantage of, pull strings, kindle excitement, fresh air, ...

  6. MWEs: Theoretical Linguistic Background. MWE or not MWE? “... there is no unified phenomenon to describe but rather a complex of features that interact in various, often untidy, ways and represent a broad continuum between non-compositional (or idiomatic) and compositional groups of words.” [Moon, 1998]

  7. MWEs: Theoretical Linguistic Background. Lexicosyntactic Idiomaticity: by and large (???) = by(P) and(conj) large(Adj); wine and dine (V[trans]) = wine(V[intrans]) and(conj) dine(V[intrans]); ad hoc (Adj) = ad(?) hoc(?)

  8. MWEs: Theoretical Linguistic Background. Semantic Idiomaticity: kick the bucket = die’; spill the beans = reveal’(secret’); kindle excitement = kindle’(excitement’)

  9. MWEs: Theoretical Linguistic Background. Pragmatic Idiomaticity. Situatedness: the expression is associated with a fixed pragmatic point; situated MWEs: good morning, all aboard; non-situated MWEs: first off, to and fro. The “Wheel of Fortune” factor: how to represent the jumble of phrases stored in the mental lexicon? The “Monty Python” factor: mish-mash of evocative language fragments.

  10. MWEs: Theoretical Linguistic Background. Statistical Idiomaticity.

      |            | unblemished | spotless | flawless | immaculate | impeccable |
      |------------|-------------|----------|----------|------------|------------|
      | eye        | +           | −        | −        | −          | −          |
      | gentleman  | ?           | +        | −        | −          | −          |
      | home       | ?           | +        | +        | ?          | −          |
      | lawn       | ?           | +        | −        | −          | −          |
      | memory     | +           | ?        | −        | −          | −          |
      | quality    | +           | −        | −        | −          | −          |
      | record     | +           | +        | +        | +          | +          |
      | reputation | +           | +        | +        | −          | −          |
      | taste      | +           | −        | −        | −          | −          |

      Table: Adapted from [Cruse, 1986]

  11. MWEs: Theoretical Linguistic Background. MWE Markedness (marked along each dimension?).

      | MWE             | Lex | Syn | Sem | Prag | Stat |
      |-----------------|-----|-----|-----|------|------|
      | ad hominem      | ✔   | ✔   | ?   | ?    | ?    |
      | at first        | ✗   | ✔   | ✗   | ✗    | ✗    |
      | first aid       | ✗   | ✗   | ✔   | ✗    | ?    |
      | salt and pepper | ✗   | ✗   | ✗   | ✗    | ✔    |
      | good morning    | ✗   | ✗   | ✗   | ✔    | ✔    |
      | cat’s cradle    | ✔   | ✔   | ✔   | ✗    | ?    |

  12. MWEs: Theoretical Linguistic Background. Other Indicators of MWE-hood ([Fillmore et al., 1988], [Liberman and Sproat, 1992], [Nunberg et al., 1994]). Institutionalisation/conventionalisation: bread and butter. Non-identifiability: meaning cannot be predicted from surface form; idiom of decoding (non-identifiable): kick the bucket, fly off the handle; idiom of encoding (identifiable): wide awake, plain truth.

  13. MWEs: Theoretical Linguistic Background. Other Indicators of MWE-hood ([Fillmore et al., 1988], [Liberman and Sproat, 1992], [Nunberg et al., 1994]). Figuration: the expression encodes some metaphor, metonymy, hyperbole, etc.; figurative expressions: bull market, beat around the bush; non-figurative expressions: first off, to and fro.

  14. MWEs: Theoretical Linguistic Background. Other Indicators of MWE-hood ([Fillmore et al., 1988], [Liberman and Sproat, 1992], [Nunberg et al., 1994]). Single-word paraphrasability: the expression has a single-word paraphrase; paraphrasable MWEs: leave out = omit; non-paraphrasable MWEs: look up; paraphrasable non-MWEs: take off clothes = undress.

  15. MWEs: Theoretical Linguistic Background. Other Indicators of MWE-hood ([Fillmore et al., 1988], [Liberman and Sproat, 1992], [Nunberg et al., 1994]). Proverbiality: the expression is used “to describe – and implicitly, to explain – a recurrent situation of particular social interest... in virtue of its resemblance or relation to a scenario involving homely, concrete things and relations” [Nunberg et al., 1994]. Informality: the expression is associated with more informal or colloquial registers. Affect: the expression encodes a certain evaluation or affective stance toward the thing it denotes.

  16. MWEs: Theoretical Linguistic Background. Other Indicators of MWE-hood ([Fillmore et al., 1988], [Liberman and Sproat, 1992], [Nunberg et al., 1994]). Prosody: the expression has a distinctive stress pattern which diverges from the norm; prosodically-marked MWE: soft spot; prosodically-unmarked MWE: first aid, red herring; prosodically-marked non-MWE: dental operation.

  17. MWEs: Theoretical Linguistic Background. MWEs and the Notion of Compositionality: Definition. Compositionality is the degree to which the features of the parts of an MWE combine to predict the features of the whole.

  18. MWEs: Theoretical Linguistic Background. MWEs and the Notion of Compositionality. Generally considered in the context of semantic compositionality, but we can equally talk about lexical compositionality, syntactic compositionality and pragmatic compositionality.

  19. MWEs: Theoretical Linguistic Background. Example: Syntactic Compositionality. Definition: the degree to which the syntactic features of the parts of an MWE combine to predict the syntax of the whole. Fixed expressions: by and large, San Francisco. Verb particles: eat up vs. chicken out. Syntactic compositionality has a binary effect: non-compositional MWEs are lexicalised.

  20. MWEs: Theoretical Linguistic Background. Question: given that compositionality extends over all aspects of markedness that affect MWEs, is it the be-all and end-all of MWEs? Almost, but there are subtleties due to statistical markedness and decomposability.

  22. MWEs: Theoretical Linguistic Background. Statistical Markedness (Revisited). Statistical markedness is (often) a reflection of lack of statistical non-compositionality, rather than a lack of compositionality:

      (1) $p(\text{impeccable } N) \times p(\text{Adj eye}) \approx p(\text{impeccable eye})$

      BUT

      (2) $p(\text{unblemished } N) \times p(\text{Adj eye}) \gg p(\text{unblemished eye})$

      (3) $p(\text{spotless } N) \times p(\text{Adj eye}) \gg p(\text{spotless eye})$

      (4) $p(\text{flawless } N) \times p(\text{Adj eye}) \gg p(\text{flawless eye})$

  23. MWEs: Theoretical Linguistic Background. Decomposability: Definition. Decomposability is the degree to which the features of an MWE can be ascribed to those of its parts.

  24. MWEs: Theoretical Linguistic Background. Decomposability and Semantic Idiomaticity: kick the bucket = die’; spill the beans = reveal’(secret’); kindle excitement = kindle’(excitement’)

  25. MWEs: Theoretical Linguistic Background. Decomposability: Three Classes of MWEs. Non-decomposable MWEs: kick the bucket, shoot the breeze, hot dog. Idiosyncratically decomposable MWEs: spill the beans, let the cat out of the bag, radar footprint. Simple decomposable MWEs: kindle excitement, traffic light.

  26. MWEs: Theoretical Linguistic Background. Decomposability: Three Classes of MWEs. There is a cline of “markedness” for idiosyncratically decomposable MWEs: chicken out vs. home office vs. radar footprint.

  27. MWEs: Theoretical Linguistic Background. Decomposability and Syntactic Flexibility. Consider: “* the bucket was kicked by Kim”; “Strings were pulled to get Sandy the job.”; “The FBI kept closer tabs on Kim than they kept on Sandy.”; “... the considerable advantage that was taken of the situation”. The syntactic flexibility of an idiom can generally be explained in terms of its decomposability.

  28. MWEs: Theoretical Linguistic Background. So what was the answer to our question? Yes and no: simple compositionality is adequate for describing many instances of lexical, syntactic, semantic and pragmatic markedness, BUT the notion of compositionality is significantly different for statistically marked MWEs, AND decomposability diffuses the markedness boundary.

  29. MWEs: Theoretical Linguistic Background. And why is it we care about compositionality? For all the reasons we care about MWEs: lexicography/dictionary making; idiomaticity (coherent semantics); overgeneration; undergeneration; relevance in applications, including MT, IR, QA, ...

  30. MWEs in NLP: Motivation. It is difficult to provide a unified account for the detection of these distinct but related phenomena. We will show how we build on compositionality to also deal with MWEs in NLP. Challenge for Grammar Engineering and “Deep” Linguistic Processing: lexical coverage is the major barrier to broad-coverage “deep” linguistic processing. MWEs constitute a significant part of the problem; this should not be surprising, since they are equivalent in number to single words in a speaker’s lexicon [Jackendoff, 1997].

  32. Lexical coverage as a major barrier: an example from English. BNC Coverage Test (ERG jan-06 [Flickinger, 2000]): 1.8M sentences (21.2M words) from the BNC written component, with only ASCII characters and no more than 20 words each.

      | Result                   | # Sentences | Percentage |
      |--------------------------|-------------|------------|
      | Parsed                   | 644,940     | 35.80%     |
      | Lex. Missing             | 969,452     | 53.82%     |
      | Full Lex. Span, No Parse | 186,883     | 10.38%     |

  35. Error Mining [van Noord, 2004b]. Parsability:

      $$R(w_i \ldots w_j) = \frac{C(w_i \ldots w_j, \mathrm{OK})}{C(w_i \ldots w_j)}$$

      where $C(w_i \ldots w_j)$ counts the sentences containing the word sequence and $C(w_i \ldots w_j, \mathrm{OK})$ counts those among them that parsed successfully. If the parsability of a particular word sequence is very low, it indicates that something is wrong. Parsabilities can be calculated efficiently for large corpora with suffix arrays and perfect hashing [Lucchesi and Kowaltowski, 1993].
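
The parsability ratio can be illustrated with a small Python sketch. It is not the suffix-array implementation cited on the slide: it uses plain dictionaries, and it assumes that each n-gram is counted once per sentence in which it occurs. The function names (`ngrams`, `parsability`, `low_parsability`) and the input format are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous n-gram of a token list as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def parsability(sentences, max_n=3):
    """Map each n-gram to R = C(ngram, OK) / C(ngram).

    `sentences` is an iterable of (tokens, parsed_ok) pairs.
    Assumption: an n-gram is counted once per sentence containing it.
    """
    total, ok = Counter(), Counter()
    for tokens, parsed_ok in sentences:
        seen = set()
        for n in range(1, max_n + 1):
            seen.update(ngrams(tokens, n))
        for gram in seen:
            total[gram] += 1
            if parsed_ok:
                ok[gram] += 1
    return {gram: ok[gram] / total[gram] for gram in total}


def low_parsability(scores, threshold=0.1):
    """Return the n-grams whose parsability is below the threshold."""
    return sorted(gram for gram, r in scores.items() if r < threshold)
```

With a toy corpus in which every sentence containing "by and large" fails to parse, `low_parsability(parsability(corpus))` would list that trigram as a candidate for closer inspection.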

  36. Error Mining Experiment. The experiment was run on the BNC: the parsed sentences and the unparsed sentences (with full lex. span). N-grams with R < 0.1 were extracted; 3+ grams were taken for further investigation.

      Table: Distribution of N-grams with Low Parsability

      | N-gram   | Num.  | %      |
      |----------|-------|--------|
      | uni-gram | 798   | 20.84% |
      | bi-gram  | 2,011 | 52.52% |
      | tri-gram | 937   | 24.47% |

      [Figure: chart of the n-gram distribution, with categories unigram, bigram, trigram and other]

  37. Example of Low Parsability N-grams.

      | N-gram             | R     | Count |
      |--------------------|-------|-------|
      | the burden of      | 0.000 | 49    |
      | by and large       | 0.000 | 37    |
      | face of it         | 0.000 | 34    |
      | frame of mind      | 0.000 | 23    |
      | points of view     | 0.000 | 20    |
      | hair and a         | 0.000 | 17    |
      | the to infinitive  | 0.000 | 15    |
      | of alcohol and     | 0.000 | 8     |
      | a great many       | 0.083 | 44    |
      | glance up at       | 0.083 | 33    |
      | for and against    | 0.086 | 21    |
      | from of government | 0.142 | 6     |

  40. Summary. Error mining-based MWE detection? Need for validation of detected MWEs.

  43. Identification of MWEs. The aim: given a list of sequences of words, to distinguish MWEs (e.g., in the red) from random sequences of words (e.g., of alcohol and).

  44. Identification of MWEs. Why so many statistical tests in the literature? Complications in evaluation: it is hard to say which is the “best” test, and there are conflicting results from different researchers. Different corpora have different distributional idiosyncrasies. Different tests have different statistical idiosyncrasies.

  45. Identification of MWEs. Thus, there are two important questions: How reliable is the corpus used? How precise is a statistical measure in distinguishing the phenomena studied?

  46. Resources. 1,039 trigrams from the error mining system [van Noord, 2004b]. 4 corpora: BNC_f: fragment of the BNC used in the error-mining experiments; BNC: complete BNC (from the site http://pie.usna.edu/); Google: Web using Google; Yahoo: Web using Yahoo.

      | Corpus | Frequency of 1,039 trigrams |
      |--------|-----------------------------|
      | BNC_f  | 66,101                      |
      | BNC    | 322,325                     |
      | Google | 224,479,065                 |
      | Yahoo  | 6,081,786,313               |

  47. Comparing corpora. Hypothesis: the relative ordering in frequency for different n-grams is preserved across corpora in the same domain. If not, different conclusions may be drawn from different corpora.

  48. Comparing corpora – first test. [Figure: relative frequency vs. rank for the trigrams, log-log axes (relative frequency from 10^-5 to 10^-1, rank from 1 to 1000), one curve each for BNC_f, BNC, Google and Yahoo] The overall ranking distribution is very similar for these corpora, showing the expected Zipf-like behaviour.

  49. Comparing corpora – second test. Measuring Kendall’s τ scores between corpora: a significant correlation was found, with p < 0.000001. But what is the degree of correlation among them? To estimate the correlation, we use the probability Q that any 2 trigrams chosen from two corpora have the same relative ordering in frequency.
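
The probability Q can be estimated by brute force over all pairs of trigrams, as in the Python sketch below. The function name and the dictionary-based input are hypothetical, and the treatment of ties is an assumption: pairs tied in either corpus are skipped, in which case the concordance-based coefficient equals 2Q − 1 (this coincides with Kendall's τ when there are no ties).

```python
from itertools import combinations


def same_order_probability(freq_a, freq_b):
    """Return (Q, tau) for two frequency tables keyed by n-gram.

    Q is the fraction of item pairs ordered the same way in both tables.
    Assumption: pairs tied in either table are skipped, so tau = 2Q - 1.
    """
    items = sorted(set(freq_a) & set(freq_b))
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        diff_a = freq_a[x] - freq_a[y]
        diff_b = freq_b[x] - freq_b[y]
        if diff_a == 0 or diff_b == 0:
            continue  # tie in at least one corpus: not comparable
        if (diff_a > 0) == (diff_b > 0):
            concordant += 1
        else:
            discordant += 1
    compared = concordant + discordant
    if compared == 0:
        raise ValueError("no comparable pairs")
    q = concordant / compared
    return q, 2 * q - 1
```

The quadratic loop is fine for about a thousand trigrams; for larger lists a library routine such as `scipy.stats.kendalltau` would be the more practical choice.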

  50. Comparing corpora – second test.

      |        | BNC  | Google | Yahoo |
      |--------|------|--------|-------|
      | BNC_f  | 0.81 | 0.73   | 0.78  |
      | BNC    |      | 0.73   | 0.77  |
      | Google |      |        | 0.86  |

      The corpora are correlated, and can probably be used interchangeably for the statistical properties of the trigrams. A higher correlation was observed between Yahoo and Google.

  51. Comparing statistical measures. Using a single corpus: BNC_f. Comparing Mutual Information (MI), χ² and Permutation Entropy (PE) for MWE identification. MI and χ² are typical measures of association that compare the joint probability of occurrence of a certain group of events, $p(abc)$, with a prediction derived from the null hypothesis of statistical independence between these events: $p_\emptyset(abc) = p(a) \cdot p(b) \cdot p(c)$

  52. MI and χ².
  $$\chi^2 = \sum_{a,b,c} \frac{\left[\, n(abc) - n_\emptyset(abc) \,\right]^2}{n_\emptyset(abc)}$$
  $$\mathrm{MI} = \sum_{a,b,c} \frac{n(abc)}{N} \, \log_2\!\left[ \frac{n(abc)}{n_\emptyset(abc)} \right]$$
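
The two sums above can be sketched in Python as follows. This is only an illustration under stated assumptions: each of a, b, c is taken to range over a word of the target trigram and its complement (a 2×2×2 contingency table), N is the total number of trigram tokens, and the expected count is taken to be N · p(a) · p(b) · p(c). The function name `association_scores` and the toy corpus are hypothetical.

```python
import math
from itertools import product


def association_scores(trigrams, target):
    """Return (chi2, mi) for `target`, given a list of trigram tuples."""
    total = len(trigrams)
    # Observed counts per cell; a cell records, for each position,
    # whether the trigram matches the target word (True) or not (False).
    observed = {cell: 0 for cell in product((True, False), repeat=3)}
    for tri in trigrams:
        observed[tuple(tri[i] == target[i] for i in range(3))] += 1
    # Marginal counts for each position.
    marginal = [{True: 0, False: 0} for _ in range(3)]
    for cell, n in observed.items():
        for i, flag in enumerate(cell):
            marginal[i][flag] += n
    chi2 = 0.0
    mi = 0.0
    for cell, n in observed.items():
        # Expected count under independence of the three positions.
        expected = total * math.prod(marginal[i][cell[i]] / total for i in range(3))
        if expected == 0:
            continue  # a cell that cannot occur contributes nothing
        chi2 += (n - expected) ** 2 / expected
        if n > 0:  # 0 * log(0) is treated as 0
            mi += (n / total) * math.log2(n / expected)
    return chi2, mi


# Toy corpus: half the trigrams are the target, half share no word with it.
corpus = [("by", "and", "large")] * 4 + [("a", "b", "c")] * 4
print(association_scores(corpus, ("by", "and", "large")))
```

In the toy corpus the three positions are perfectly associated, so both scores are far from the value 0 that independence would give.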

  53. Permutation Entropy (PE). Permutation entropy is a measure of order association:
  $$\mathrm{PE} = -\sum_{(i,j,k)} p(w_i w_j w_k) \, \ln\left[\, p(w_i w_j w_k) \,\right]$$
  $$p(w_1 w_2 w_3) = \frac{n(w_1 w_2 w_3)}{\sum_{(i,j,k)} n(w_i w_j w_k)}$$
  where the sum runs over all the permutations of the trigram (e.g. "by and large", "large by and", "and large by", "and by large", "large and by" and "by large and").

  54. Permutation Entropy (PE). PE for MWE detection. Hypothesis: MWEs are more rigid with respect to permutations, and therefore have smaller PEs. The more independent the words are, the closer the PE is to its maximal value (ln 6 for trigrams). PE does not rely on single-word counts, which are less accurate in Web-based corpora.
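
A minimal Python sketch of the PE computation defined above, assuming the counts of all orderings of a trigram are available in a dictionary. The function name `permutation_entropy` and the counts are hypothetical; real counts would come from a corpus or from Web hit counts.

```python
import math
from itertools import permutations


def permutation_entropy(trigram, counts):
    """PE over all orderings of `trigram`; `counts` maps word tuples to frequencies."""
    orderings = set(permutations(trigram))  # 6 orderings when the words differ
    freqs = [counts.get(order, 0) for order in orderings]
    total = sum(freqs)
    if total == 0:
        return float("nan")  # no ordering was observed at all
    entropy = 0.0
    for n in freqs:
        if n > 0:  # 0 * ln(0) is treated as 0
            p = n / total
            entropy -= p * math.log(p)
    return entropy


rigid = {("by", "and", "large"): 1000}  # only one ordering ever occurs
free = {order: 10 for order in permutations(("x", "y", "z"))}  # all equally likely
print(permutation_entropy(("by", "and", "large"), rigid))
print(permutation_entropy(("x", "y", "z"), free), math.log(6))
```

The rigid trigram gets a PE of 0, while the trigram whose orderings are all equally frequent reaches the maximum ln 6, in line with the hypothesis on the slide.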

  55. Are they equivalent? Kendall's τ is used to assess the correlation between the rankings produced by these measures, and its significance. Q is the probability of finding the same ordering in two rankings.
        MI × χ²   χ² × PE   MI × PE
  Q     0.71      0.55      0.45
  The correlations found are statistically significant, but the measures order the trigrams differently: there is about a 70% chance of getting the same order from MI and χ², and both are very different from PE.

  56. Are they useful for MWE detection? To check, we compare the distributions of the measures for MWEs and non-MWEs. Gold standard: a set of 382 MWE candidates annotated by a native speaker (90 MWEs, 292 non-MWEs). MI or PE seem to differentiate between MWEs and non-MWEs.

  57. Are they useful? Normalised histograms for MWEs and non-MWEs. The ideal scenario: non-overlapping distributions for MWEs and non-MWEs, so that a simple threshold operation would be enough to distinguish between them.

  59. Are they useful? Normalised histograms for MWEs and non-MWEs. [Figure: three histogram panels, χ² (BNC_f), MI (BNC_f) and PE (Yahoo), each plotting probability for MWEs and non-MWEs against the logarithm of the measure.] As some types of MWEs may have stronger constraints on word order, more visible effects will probably be seen if the measures are applied to individual types of MWEs [Evert and Krenn, 2005].

  60. Summary. So far we have detected n-grams which are candidate MWEs, and we have validated them using statistical measures on corpora. For grammar engineering, though, we still need a way of acquiring new lexical entries for MWEs and of evaluating their influence on grammar performance.

  62. English Resource Grammar [Flickinger, 2000]. A large-scale, broad-coverage precision HPSG grammar. Lexicon coverage is a major problem, and MWEs make up a large portion of the missing lexical entries.

  63. Lexical hierarchy and atomic lexical types. The lexical information is encoded in atomic lexical types. A lexicon is an n:n mapping between lexemes and atomic lexical types.

  66. Maximum Entropy Model-based Lexical Type Predictor. A statistical classifier that predicts a lexical type for each occurrence of an unknown word or a missing lexical entry. Input: features from the context. Output: atomic lexical types.
  $$p(t, c) = \frac{\exp\left( \sum_i \theta_i f_i(t, c) \right)}{\sum_{t' \in T} \exp\left( \sum_i \theta_i f_i(t', c) \right)}$$
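
The formula above can be sketched in Python as a normalised exponential over the candidate types. This is an illustration only: it assumes binary features, so that each θ_i f_i(t, c) term is a weight looked up by (type, feature) for the features active in the context. The function name `type_distribution`, the type names, the feature string and the weight are all hypothetical and do not come from the ERG or from the predictor described in the talk.

```python
import math


def type_distribution(types, active_features, weights):
    """Distribution over lexical types for one context, as in the formula above."""
    # Score of a type: sum of the weights of the features active in the context.
    scores = {
        t: sum(weights.get((t, f), 0.0) for f in active_features)
        for t in types
    }
    shift = max(scores.values())  # subtracting the maximum avoids overflow in exp
    exps = {t: math.exp(s - shift) for t, s in scores.items()}
    norm = sum(exps.values())
    return {t: e / norm for t, e in exps.items()}


# Hypothetical two-type inventory with a single informative context feature.
weights = {("noun_type", "prev_word=the"): math.log(3)}
probs = type_distribution(["noun_type", "verb_type"], ["prev_word=the"], weights)
print(probs)
```

With a weight of ln 3 on the noun type and no weight on the verb type, the two probabilities come out at about 0.75 and 0.25.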

  67. "Words-with-spaces" vs. compositional approaches. Words-with-spaces approach [Zhang et al., 2006]: assign lexical types to the entire MWE. Grammar coverage improves significantly, but generality is lost for productive MWEs. Compositional approach: assign new lexical entries to the head word, so that the MWE is treated as compositional. The hope is that grammar coverage improves without a drop in accuracy.

  69. Experiment. Rank all the MWE candidates according to the three statistical measures (MI, χ², PE) and select the top 30 MWEs with the highest average ranking. Extract from BNC_f the sub-corpus of sentences that contain at least one of these MWEs, for evaluation (674 sentences). Use heuristics to extract head words (20 head words). Run lexical acquisition for the head words on the sub-corpus (21 new entries).
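
The first step of the experiment, selecting candidates by average rank, could look like the Python sketch below. Several details are assumptions made for this example: higher MI and χ² are ranked better, lower PE is ranked better (following the hypothesis that MWEs have smaller PEs), and ties in the average are broken alphabetically. The names `rank_candidates` and `top_by_average_rank` and all scores are hypothetical.

```python
def rank_candidates(scores, higher_is_better):
    """Map each candidate to its rank (1 = best) under one measure."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {cand: position + 1 for position, cand in enumerate(ordered)}


def top_by_average_rank(mi, chi2, pe, k):
    """Return the k candidates with the best average rank over the three measures."""
    rank_mi = rank_candidates(mi, higher_is_better=True)
    rank_chi2 = rank_candidates(chi2, higher_is_better=True)
    rank_pe = rank_candidates(pe, higher_is_better=False)  # assumption: small PE is better
    average = {c: (rank_mi[c] + rank_chi2[c] + rank_pe[c]) / 3 for c in mi}
    return sorted(average, key=lambda c: (average[c], c))[:k]


# Hypothetical scores for three candidates.
mi = {"a": 3.0, "b": 2.0, "c": 1.0}
chi2 = {"a": 30.0, "b": 10.0, "c": 20.0}
pe = {"a": 0.1, "b": 1.5, "c": 0.9}
print(top_by_average_rank(mi, chi2, pe, k=2))
```

In the experiment described on the slide, k would be 30.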

  70. Grammar Coverage.
               items   parsed   avg. analyses   coverage
  ERG          674     48       335.08          7.1%
  ERG + MWE    674     153      285.01          22.7%
  The coverage improvement is largely compatible with the results of the "words-with-spaces" approach reported in [Zhang et al., 2006] (about 15%), with a great reduction in the number of lexical entries added.

  71. Grammar Accuracy. The 153 parsed sentences were analysed by hand. 124 of them (81.0%) receive at least one correct or acceptable analysis, which is comparable to the accuracy reported by [Baldwin et al., 2004]. The parse selection model finds the best analysis in the top 5 for 66% of the cases, and in the top 10 for 75%.
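
As a quick arithmetic check, the coverage and accuracy percentages reported above follow directly from the sentence counts. The helper name `percentage` is just for this illustration.

```python
def percentage(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


coverage_erg = percentage(48, 674)    # ERG alone
coverage_mwe = percentage(153, 674)   # ERG + MWE
accuracy = percentage(124, 153)       # parsed sentences with an acceptable analysis
gain = round(coverage_mwe - coverage_erg, 1)

print(coverage_erg, coverage_mwe, gain, accuracy)
```

The coverage gain of 15.6 percentage points is what the comparison with the roughly 15% reported for the "words-with-spaces" approach refers to.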

  72. Summary. MWE candidates have been detected with error mining. Different corpora have been compared for the purpose of MWE validation. Different statistical measures have been compared for identifying MWEs. Grammar performance has been evaluated for automated MWE acquisition using a compositional approach.

  73. Outlook. Hand-crafted precision grammars usually face coverage and robustness challenges when applied to unseen data, where unknown words, unknown MWEs, unknown constructions, etc. are frequent. [Baldwin et al., 2004] reported parsing coverage of 18% on unseen BNC data parsed with the ERG, with the majority of parsing failures related to missing lexical entries. The Lexical Type Prediction model I have presented above is used to handle unknown words (simplex and MWE) on the fly. With this model, the ERG achieves around 84% parsing coverage on unseen WSJ data.

  74. Outlook: other "deep" parsing systems.
  Formalism   System   F-score   Reference
  LFG         XLE      79.6%     [Kaplan et al., 2004]
  CCG         C&C      81.86%    [Clark and Curran, 2007]
  HPSG        Enju     82.64%    [Sagae et al., 2008]
  These systems are evaluated on 700 sentences selected from WSJ data (PARC 700), using Grammatical Relations (GR).

  76. Background. German has rich morphology, and this also affects the design of the lexicon of the German Grammar (GG; [Crysmann, 2003]): a large amount of linguistic information is encoded in the form of constraints in the feature structures of the various types.

  77. Importance of Linguistic Constraints. Assumption: we have to use the linguistic information contained in these constraints in order to develop linguistically oriented and well-motivated "Deep" Lexical Acquisition (DLA) methods.

  78. Expanded Atomic Lexical Types. How do we capture the linguistic information from the feature structures of the GG lexical types? We expand the type definitions of 38 selected atomic types with the relevant linguistic information contained in their feature values. Which information is relevant? Not every feature should be considered: the target type inventory would become too sparse and, moreover, not every feature is useful for the DLA process.

  79. Expanded Atomic Lexical Types (cont.). Solution: perform an extensive linguistic analysis of the features to be considered for DLA, i.e. linguistically motivated DLA.

  80. Relevant Linguistic Features. Table: relevant features used for type expanding (feature, values, meaning).
  SUBJOPT (subject options)
    +   in some cases the article for the noun can be omitted
    -   the noun always goes with an article
    +   raising verb
    -   non-raising verb
  KEYAGR (key agreement): case-number-gender information for nouns
    c-s-n   underspecified-singular-neutral
    c-p-g   underspecified-plural-gender
    ...     ...
  (O)COMPAGR ((oblique) complement agreement)
    a-n-g, d-n-g, etc.   case-number-gender information for (oblique) verb complements; case-number-gender of the modified noun (for adjectives)
  (O)COMPTOPT ((oblique) complement options): verbs can take a different number of complements
    +   the respective (oblique) complement is present
    -   the respective (oblique) complement is absent
  KEYFORM: the auxiliary verb used for the formation of perfect tense
    haben   the auxiliary verb is 'haben'
    sein    the auxiliary verb is 'sein'

  81. Expanded Lexical Type Example.
  Before expanding:
    abenteuer-n := count-noun-le &
      [ [ --SUBJOPT -,
          KEYAGR c-n-n,
          KEYREL "_abenteuer_n_rel",
          KEYSORT situation,
          MCLASS nclass-2_-u_-e ] ].
  After expanding:
    abenteuer-n := count-noun-le_-_c-n-n
  (the values of the SUBJOPT and KEYAGR attributes are attached to the original type definition)
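
The expansion shown above amounts to appending the values of the relevant features to the name of the original type. A minimal Python sketch, assuming an underscore separator and a fixed feature order, both inferred only from the single example on the slide; the function name `expand_type_name` and the dictionary representation of the entry are hypothetical.

```python
def expand_type_name(base_type, entry_features, relevant_features):
    """Attach the values of the relevant features to the base type name."""
    values = [
        entry_features[name]
        for name in relevant_features
        if name in entry_features  # features missing from the entry are skipped
    ]
    return "_".join([base_type] + values)


# Feature values copied from the 'abenteuer-n' entry on the slide.
abenteuer = {
    "SUBJOPT": "-",
    "KEYAGR": "c-n-n",
    "KEYREL": "_abenteuer_n_rel",
    "KEYSORT": "situation",
    "MCLASS": "nclass-2_-u_-e",
}
print(expand_type_name("count-noun-le", abenteuer, ["SUBJOPT", "KEYAGR"]))
# prints: count-noun-le_-_c-n-n
```

Only SUBJOPT and KEYAGR are passed as relevant here, which mirrors the point that not every feature of an entry should enter the expanded type.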

  82. Expanded Lexicon. Table: expanded atomic lexical types.
                              Original lexicon   Expanded lexicon
  Number of lexical types     386                485
  Atomic lexical types        38                 137
    nouns                     9                  72
    verbs                     19                 53
    adjectives                3                  5
    adverbs                   7                  7
  Such a target type inventory ensures that the learning process will deliver fine-grained linguistic results while avoiding sparse data problems.

  83. Grammar Performance in Practical Applications. Table: coverage results.
  Corpus         Coverage   Accuracy
  FR             8.89%      85%
  FR + DLA       21.08%     83%
  deWaC          7.46%      –
  deWaC + DLA    16.95%     –
  The coverage for FR improves by more than 12 percentage points. Given that deWaC is an open and unbalanced corpus, the increase of roughly 10 percentage points in coverage (from 7.46% to 16.95%) is also a significant improvement.

  84. Goal achieved, assumption proven. With our linguistically oriented DLA methods, we have managed to increase parsing coverage and, at the same time, to preserve the high accuracy of the grammar.

  85. Summary. We have tackled, from a more linguistically oriented point of view, the robustness problem which arises when lexicalised grammars are employed as part of bigger processing architectures in real-life applications. We have shown that missing lexical entries are the main cause of parsing failures, and have thus illustrated the importance of increasing the lexical coverage of lexicalised grammars. We have also illustrated the importance of morphology in the lexical prediction process for languages like German.

  86. Summary. With our linguistically motivated DLA methods, the parsing coverage of lexicalised grammars improves significantly while the linguistic quality of the grammars remains intact. Since our DLA methods are considered to be formalism- and language-independent, it will be interesting, in future research, to apply them to other systems and languages.

  87. Appendix: For Further Reading I.
  Baldwin, T., Bender, E. M., Flickinger, D., Kim, A., and Oepen, S. (2004). Road-testing the English Resource Grammar over the British National Corpus. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal.
  Clark, S. and Curran, J. (2007). Formalism-Independent Parser Evaluation with CCG and DepBank. In Proceedings of ACL2007.
  Copestake, A. and Flickinger, D. (2000). An open-source grammar development environment and broad-coverage English grammar using HPSG. In Proceedings of the Second conference on Language Resources and Evaluation (LREC 2000), Athens, Greece.
  Cruse, A. (1986). Lexical Semantics. Cambridge University Press, Cambridge, UK.
  Crysmann, B. (2003). On the efficient implementation of German verb placement in HPSG. In Proceedings of RANLP 2003, pages 112–116, Borovets, Bulgaria.
