
Paraphrase Pattern Acquisition by Diversifiable Bootstrapping



  1. Carnegie Mellon
     Paraphrase Pattern Acquisition by Diversifiable Bootstrapping
     Hideki Shima
     Language Technologies Institute, School of Computer Science, Carnegie Mellon University, USA
     Thesis Committee:
     • Teruko Mitamura, CMU (chair)
     • Eric Nyberg, CMU
     • Eduard Hovy, CMU
     • Patrick Pantel, Microsoft Research
     Thesis Defense, Aug 20th, 2014

     Need for capturing meaning equivalence in QA
     Q. What did John Lennon die of? → John Lennon died of what
     Templates of natural language expressions can bridge different surface forms with close meaning:
     • X died of Y
     • X was murdered with Y
     John Lennon was murdered with gunshots in 1980 …
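The template idea on the slide above can be sketched in a few lines. This is only an illustration, not the thesis implementation: the helper names `template_to_regex` and `find_answers` are hypothetical, and the sketch assumes the Y slot is a single word and that X is already known from the question.

```python
import re

def template_to_regex(template, x):
    """Compile a template such as "X died of Y" with the X slot filled in.

    Assumption for this sketch: the Y slot matches exactly one word.
    """
    parts = []
    for token in template.split():
        if token == "X":
            parts.append(re.escape(x))       # known entity from the question
        elif token == "Y":
            parts.append(r"(?P<Y>\w+)")      # answer slot
        else:
            parts.append(re.escape(token))   # literal template word
    return re.compile(r"\s+".join(parts))

def find_answers(x, templates, sentence):
    """Return (template, Y) pairs for every template that matches the sentence."""
    hits = []
    for template in templates:
        match = template_to_regex(template, x).search(sentence)
        if match:
            hits.append((template, match.group("Y")))
    return hits

templates = ["X died of Y", "X was murdered with Y"]
sentence = "John Lennon was murdered with gunshots in 1980"
answers = find_answers("John Lennon", templates, sentence)
print(answers)
```

The question-side template "X died of Y" does not match the sentence, but its paraphrase "X was murdered with Y" does, which is the bridging effect the slide describes.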

  2. Need for capturing meaning equivalence in QA
     Q. What did John Lennon die of? → John Lennon died of what
     • X died of Y
     • killed X with Y
     • X had died from Y
     • X suffered a fatal Y
     • X was murdered with Y
     • X fell victim to Y
     • X 's death by Y
     • pumping Y into X , ending his life

     • John Lennon was murdered with gunshots
     • John Lennon 's death by gunshots
     • John Lennon suffered a fatal gunshot wound
     • John Lennon fell victim to assassin's bullets
     • Chapman killed him with four gunshot wounds
     • … pumping four bullets into him , ending his life
     …

     Paraphrasing is a common need in various applications
     • Automatic Evaluation
       – In Machine Translation [Kauchak & Barzilay, 2006][Padó et al., 2009]
       – In Text Summarization [Zhou et al., 2006]
       – In Question Answering [Ibrahim et al., 2003][Dalmas, 2007]
     • Text Summarization [Lloret et al., 2008][Tatar et al., 2009]
     • Information Retrieval [Parapar et al., 2005][Riezler et al., 2007]
     • Information Extraction [Romano et al., 2006]
     • Question Answering [Harabagiu & Hickl, 2006][Dogdan et al., 2008]
     • Collocation Error Correction [Dahlmeier and Ng, 2011]

  3. Classification of Paraphrase Research
     (1) Paraphrase Recognition
     • <kill, murder> → {Y, N} (word/phrase-level)
     • <S1, S2> → {Y, N} (sentence-level)
     • <D1, D2> → {Y, N} (document-level)
     Usage / Application: Question Answering, Text Summarization, Automatic Grading, Plagiarism Detection
     (2) Paraphrase Generation
     • die → <decease, pass away, kick the bucket>
     • He had a lot of admiration for his job → He had plenty of admiration for his job
     Usage / Application: Query Expansion, Reference Expansion in Automatic Evaluation
     (3) Paraphrase Extraction
     • → {word, phrase, sentence}-level paraphrases
     • with/without variables
     • with/without structure
     • Examples: <writer, author>, <a lot of X, plenty of X>, <X buy Y, Y sell X>, <X wrote Y, X is the writer of Y>, <S1, S2>
     • [Figure labels for the structured example: SUBJ, FROM, SUBJ, TO]
     Usage / Application: Resource for (1) and (2)
     • → Paraphrase dictionary
     • → Sentence-aligned paraphrase corpus

     Why not use existing lexical resources (e.g. WordNet)?
     Can we rewrite patterns with existing knowledge for more lexical variety?
     e.g., WordNet [Miller, 1995], FrameNet [Baker et al., 1998], Nomlex [Macleod et al., 1998], VerbNet [Kipper et al., 2006]
     Limitations:
     • Lack of coverage (e.g. phrasal expressions)
     • Lack of context (prepositions etc.)

  4. Why extract paraphrases? Because language expressions are diverse
     Type: Example paraphrases of “die”
     • Idioms: bite the dust, go west, give up the ghost, go to a better place, pay the ultimate price, buy the farm
     • Non-idiom phrase: suffer a fatal something; fall victim to something; pumping a bullet into the heart, ending one’s life
     • Religious euphemism: be carried away by angels, answer God’s calling, go to heaven, reach nirvana
     • Euphemism by profession: (author) write one’s final chapter, (dancer) dance one’s last dance, (gambler) cashed in their chips
     • Slang in military: go Tango Uniform, go T.U., turn one’s toes up, be KIA (killed in action), be KIFA (killed in flight accident), be DOW (died of wounds)
     • Slang in physician: be at room temperature, be bloodless, feel no pain, lose vital signs, wear a toe tag
     • Slang in gangsters: merc, merk, murk, snuff, smoke, bang, get a backdoor parole

     (Quasi-)Paraphrase
     [Figure: phenomena covered by (quasi-)paraphrase]
     • entailment / inference
     • metaphor
     • syntactic variation
     • euphemism
     • absolute synonym
     • slang / jargon
     • neologism
     • near-synonym
     • expression with high semantic similarity
     • expression with high semantic relatedness

  5. Outline
     • Introduction
     • Paraphrase Extraction
       – Vanilla Espresso (Baseline)
       – Espresso Extension (Baseline2)
       – Diversifiable Bootstrapping
       – Distributional Type Filtering
     • Paraphrase Evaluation Metric: DIMPLE
     • Experiment
       – Design
       – Evaluation Results
     • Conclusion

     Paraphrase extraction source corpora
     – Bilingual parallel corpus [Callison-Burch, 2008][Kok and Brockett, 2010]
     – Multiple translations [Barzilay & McKeown, 2001][Pang et al., 2003]
     – Aligned news contents [Dolan et al., 2004][Dolan and Brockett, 2005][Quirk et al., 2004]
     – Aligned definitions [Hashimoto et al., 2002]
     – Huge monolingual corpora
       • 150GB [Bhagat & Ravichandran, 2008]
       • 4.5TB parsed corpus [Metzler & Hovy, 2011]

  6. Thesis Contribution (1 of 4)
     • Problem
       – Corpus restriction: previous work has special corpus requirements, e.g. a parallel corpus or a terabyte-scale corpus.
         • Not suitable for domain-specific paraphrase acquisition
         • Costly to build
     • Hypothesis & Proposed Solution
       – It is possible to extract paraphrase templates from an unstructured monolingual corpus given seed instances
         → Bootstrap Paraphrase Learning

     Bootstrap Instance/Pattern Learning
     ESPRESSO [Pantel & Pennacchiotti, 2006]
     [Diagram: INPUT → BOOTSTRAP LEARNING ALGORITHM → OUTPUT]
     • INPUT: seed instances; monolingual plain corpus
     • OUTPUT: more instances; patterns

  7. Bootstrap Instance/Pattern Learning
     [Diagram: INPUT (seed instances, monolingual plain corpus) → BOOTSTRAP LEARNING ALGORITHM → OUTPUT (more instances, patterns)]
     Seed instances:
     X (killer)            Y (victim)
     John Wilkes Booth     Abraham Lincoln
     Mark David Chapman    John Lennon
     Nathuram Godse        Mahatma Gandhi
     Yigal Amir            Yitzhak Rabin
     John Bellingham       Spencer Perceval
     Mohammed Bouyeri      Theo van Gogh
     Dan White             Mayor George Moscone
     Sirhan Sirhan         Robert F. Kennedy
     El Sayyid Nosair      Meir Kahane
     Mijailo Mijailovic    Anna Lindh

     Bootstrap Instance/Pattern Learning
     [Diagram: INPUT (seed instances, monolingual plain corpus) → Bootstrapping → OUTPUT (more instances, patterns)]
     Output patterns:
     • X , the assassin of Y
     • assassination of Y by X
     • X assassinated Y
     • the assassination of Y by X
     • of X , the assassin of Y
     • X assassinated Y in
     …
     Unlike many other bootstrapping works, the goal is to acquire patterns, not instances.

  8. Bootstrap Instance/Pattern Learning
     [Diagram: INPUT (seed instances, monolingual plain corpus) → BOOTSTRAP LEARNING ALGORITHM → OUTPUT (more instances, patterns)]

     Bootstrap Instance/Pattern Learning
     [Diagram: bootstrapping loop]
     • 1st iteration: Seed Instances → Sentences → Extracted Patterns → Ranked Patterns → Sentences → Extracted Instances → Ranked Instances
     • 2nd iteration: Ranked Instances → . . .
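The alternation in the loop diagram above can be written as a small skeleton. This is a minimal sketch under assumptions, not Espresso or the thesis system: `bootstrap`, `extract_patterns`, `extract_instances`, and `top_k` are hypothetical names, and the two callables are assumed to return candidates already ranked best-first.

```python
def bootstrap(seed_instances, extract_patterns, extract_instances,
              iterations=2, top_k=10):
    """Alternate between pattern and instance extraction.

    extract_patterns(instances) and extract_instances(patterns) are
    caller-supplied functions (hypothetical here) that search the corpus
    and return ranked lists, best candidates first.
    """
    instances = list(seed_instances)
    patterns = []
    for _ in range(iterations):
        # Instances -> sentences -> extracted patterns -> ranked patterns.
        for pattern in extract_patterns(instances)[:top_k]:
            if pattern not in patterns:
                patterns.append(pattern)
        # Patterns -> sentences -> extracted instances -> ranked instances.
        for instance in extract_instances(patterns)[:top_k]:
            if instance not in instances:
                instances.append(instance)
    return patterns, instances
```

Since the stated goal is to acquire patterns rather than instances, a caller would mainly consume the first element of the returned pair.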

  9. Bootstrap Instance/Pattern Learning: Search sentences by instances
     [Diagram: bootstrapping loop, step Seed Instances → Sentences]
     • Edwin Booth was brother of John Wilkes Booth , the assassin of Abraham Lincoln .
     • John Wilkes Booth , the assassin of Abraham Lincoln , was inspired by Brutus.
     • In 1969 Berman was part of the defense team of Sirhan Sirhan , the assassin of Robert F. Kennedy .
     …

     Bootstrap Instance/Pattern Learning: Search sentences by instances
     [Diagram: bootstrapping loop, step Seed Instances → Sentences]
     • Edwin Booth was brother of X , the assassin of Y .
     • X , the assassin of Y , was inspired by Brutus.
     • In 1969 Berman was part of the defense team of X , the assassin of Y .
     …
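The two slides above (retrieve sentences that mention a seed pair, then replace the pair with X and Y) can be sketched as follows. The function name `generalize` is hypothetical, and plain substring matching is an assumption made for brevity; a real system would query an indexed corpus and match entity mentions more carefully.

```python
def generalize(sentences, instances):
    """Keep sentences that mention both members of a seed pair and
    replace the mentions with the slot names X and Y."""
    generalized = []
    for sentence in sentences:
        for x, y in instances:
            if x in sentence and y in sentence:
                generalized.append(sentence.replace(x, "X").replace(y, "Y"))
    return generalized

instances = [
    ("John Wilkes Booth", "Abraham Lincoln"),
    ("Sirhan Sirhan", "Robert F. Kennedy"),
]
sentences = [
    "Edwin Booth was brother of John Wilkes Booth , the assassin of Abraham Lincoln .",
    "John Wilkes Booth , the assassin of Abraham Lincoln , was inspired by Brutus .",
    "In 1969 Berman was part of the defense team of Sirhan Sirhan , the assassin of Robert F. Kennedy .",
    "Abraham Lincoln was the 16th president .",
]
slotted = generalize(sentences, instances)
for s in slotted:
    print(s)
```

The last sentence mentions only one member of a pair, so it is not retrieved.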

  10. Bootstrap Instance/Pattern Learning: Extract patterns from sentences
     [Diagram: bootstrapping loop, step Sentences → Extracted Patterns]
     • … brother of X , the assassin of Y .
     • X , the assassin of Y , was
     • … team of X , the assassin of Y .
     Extracted pattern: longest common substring among retrieved sentences

     Bootstrap Instance/Pattern Learning: Score and rank patterns
     [Diagram: bootstrapping loop, step Extracted Patterns → Ranked Patterns]
     Rank by reliability of pattern: r(p).
     r(p) is based on an association measure with each instance in the corpus.
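Both steps above can be illustrated with a short sketch. The first function finds the longest common token sequence shared by all slotted sentences. The second scores a pattern by averaging its association with each instance; the slide does not give the formula, so this assumes a PMI-based average normalized by the maximum PMI, along the lines of the Espresso paper. All function names and the precomputed `pmi_table` are hypothetical, and the sketch assumes the maximum PMI is positive.

```python
def contains(tokens, ngram):
    """True if ngram occurs as a contiguous run inside tokens."""
    n = len(ngram)
    return any(tokens[i:i + n] == ngram for i in range(len(tokens) - n + 1))

def longest_common_token_substring(sentences):
    """Longest contiguous token sequence shared by all sentences."""
    token_lists = [s.split() for s in sentences]
    base = min(token_lists, key=len)
    # Try candidate n-grams from the shortest sentence, longest first.
    for n in range(len(base), 0, -1):
        for start in range(len(base) - n + 1):
            ngram = base[start:start + n]
            if all(contains(tokens, ngram) for tokens in token_lists):
                return " ".join(ngram)
    return ""

def pattern_reliability(pattern, instance_reliability, pmi_table):
    """Assumed form: mean over instances of pmi(i, p) / max_pmi * r(i).

    pmi_table maps (instance, pattern) to a precomputed PMI value;
    missing pairs count as 0.
    """
    max_pmi = max(pmi_table.values())
    total = sum(pmi_table.get((inst, pattern), 0.0) / max_pmi * rel
                for inst, rel in instance_reliability.items())
    return total / len(instance_reliability)

slotted = [
    "Edwin Booth was brother of X , the assassin of Y .",
    "X , the assassin of Y , was inspired by Brutus .",
    "In 1969 Berman was part of the defense team of X , the assassin of Y .",
]
pattern = longest_common_token_substring(slotted)
print(pattern)
```

Ranking then amounts to sorting the extracted patterns by `pattern_reliability` in descending order.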
