Retrofitting Contextualized Word Embeddings with Paraphrases
Weijia Shi¹*, Muhao Chen¹*, Pei Zhou², Kai-Wei Chang¹
¹University of California, Los Angeles  ²University of Southern California
Contextualized Word Embeddings
Representations that capture differences in lexical semantics across different linguistic contexts.
Such representations have become the backbone of many state-of-the-art NLU systems for
• Sentence classification, textual inference, QA, EDL, NMT, SRL, …
Contextualized Word Embeddings
Aggregating context information into a word vector with a pre-trained deep neural language model.
Key benefits:
• More refined semantic representations of lexemes
• Automatically capturing polysemy
Example: the two occurrences of "Apple" below are mapped to different points in the embedding space.
• Apples have been grown for thousands of years in Asia and Europe.
• With that market capacity, Apple is worth over 1% of the world's GDP.
The Paraphrased Context Problem
Pre-trained language models are not aware of the semantic relatedness of contexts: the same word can be represented more differently across paraphrased contexts than opposite words are in unrelated contexts.
Contexts and L2 distances by ELMo:
• Paraphrases: "How can I make bigger my arms?" / "How do I make my arms bigger?" (6.42)
• Paraphrases: "Some people believe earth is flat, why?" / "Why do people still believe in flat earth?" (7.59)
• Opposite words in unrelated contexts, "small" vs. "large": "It is a very small window." / "I have a large suitcase." (5.44)
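The kind of distance shown in this table can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' measurement script: it assumes the ElmoEmbedder interface from AllenNLP 0.x, and the choice to average the three ELMo layers is an assumption (the slide does not say which layer or pooling was used).

```python
# Sketch: L2 distance between contextualized embeddings of a shared word
# in two paraphrased sentences, assuming AllenNLP 0.x's ElmoEmbedder.
import numpy as np
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()  # downloads the default pre-trained ELMo weights

def word_vector(tokens, index):
    """Contextualized vector of tokens[index]: average of the three ELMo layers (assumption)."""
    layers = elmo.embed_sentence(tokens)   # shape: (3, len(tokens), 1024)
    return layers.mean(axis=0)[index]

s1 = "How can I make bigger my arms ?".split()
s2 = "How do I make my arms bigger ?".split()

# L2 distance between the two contextualized embeddings of the shared word "make"
dist = np.linalg.norm(word_vector(s1, s1.index("make")) - word_vector(s2, s2.index("make")))
print(f"d(make | paraphrases) = {dist:.2f}")
```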
The Paraphrased Context Problem
Consider ELMo distances of the same words (excluding stop words) in paraphrased sentence pairs from MRPC:
[Chart: ELMo encoding of shared words in MRPC paraphrases. 41.5% of shared words have a distance greater than d(good, bad); 28.3% greater than d(big, small).]
Contextualization:
• can be oversensitive to paraphrasing,
• and can further impair sentence representations.
Outline
• Background
• Paraphrase-aware retrofitting
• Evaluation
• Future Work
Paraphrase-aware Retrofitting (PAR)
Method
• An orthogonal transformation M that retrofits the input space
• Minimizing the variance of word representations across paraphrased contexts
• Without compromising the varying representations in unrelated contexts
Orthogonal constraint: preserving the relative distances of the raw embeddings before contextualization.
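The point of the orthogonal constraint can be checked numerically: an orthogonal transformation leaves the pairwise L2 distances of the raw input embeddings unchanged, so only the contextualization on top of them is affected. A small illustrative sketch in PyTorch (not tied to the PAR code):

```python
# Orthogonal M preserves pairwise distances of raw (pre-contextualization) embeddings.
import torch

dim = 8
M, _ = torch.linalg.qr(torch.randn(dim, dim))   # a random orthogonal matrix
x, y = torch.randn(dim), torch.randn(dim)

print(torch.dist(x, y))          # distance between raw embeddings
print(torch.dist(M @ x, M @ y))  # identical (up to float error) after applying M
```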
Paraphrase-aware Retrofitting (PAR)
Learning objective
Input:
• Paraphrase 1: What is prison life like?
• Paraphrase 2: How is life in prison?
• Negative sample: I have life insurance.
Loss function (hinge loss with an orthogonal constraint on M):

$$\mathcal{L} = \sum_{(S_1, S_2) \in P} \; \sum_{w \in S_1 \cap S_2} \Big[\, d_{S_1, S_2}(\mathbf{M}\mathbf{w}) + \gamma - d_{S_1, S_2'}(\mathbf{M}\mathbf{w}) \,\Big]_+ \; + \; \lambda\, \mathcal{L}_O(\mathbf{M})$$

where P is the set of paraphrased sentence pairs, d_{S_1,S_2}(Mw) is the distance between the contextualized embeddings of the retrofitted word vector Mw in S_1 and S_2, S_2' is a negative (non-paraphrase) sample containing w, γ is a margin, and L_O(M) penalizes deviation of M from orthogonality.
Intuition: the shared words in paraphrases should be embedded closer than those in non-paraphrases.
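A compact PyTorch-style sketch of this objective follows. It is a reconstruction under assumptions rather than the authors' released implementation: `contextualize` stands in for a frozen contextualizer (e.g., the ELMo layers) applied to retrofitted input embeddings, and the batch format and hyperparameter names (`gamma`, `lam`) are illustrative.

```python
import torch

def par_loss(M, contextualize, batch, gamma=1.0, lam=1.0):
    """batch: iterable of (S1, S2, Sneg, i1, i2, i_neg), where S* are [len, dim] tensors of
    raw (pre-contextualization) word embeddings, and i1/i2/i_neg index one word shared by
    the paraphrases S1, S2 and also appearing in the negative sentence Sneg."""
    hinge_total = torch.zeros((), device=M.device)
    for S1, S2, Sneg, i1, i2, i_neg in batch:
        # Retrofit the input space with M, then contextualize with the frozen LM.
        c1 = contextualize(S1 @ M.T)[i1]
        c2 = contextualize(S2 @ M.T)[i2]
        cn = contextualize(Sneg @ M.T)[i_neg]
        d_pos = torch.norm(c1 - c2)   # shared word across paraphrases: pull together
        d_neg = torch.norm(c1 - cn)   # same word in a non-paraphrase context: keep apart
        hinge_total = hinge_total + torch.relu(d_pos + gamma - d_neg)
    # Orthogonality penalty: keep M close to an orthogonal transformation.
    eye = torch.eye(M.size(0), device=M.device)
    return hinge_total + lam * torch.norm(M.T @ M - eye)
```

Only M would be optimized here; the pre-trained language model stays frozen, matching the retrofitting framing on the previous slide.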
Experiment Settings
Paraphrase pair datasets
• The positive training cases of MRPC (2,753 pairs)
• Sampled Quora (20,000 pairs) and PAN (5,000 pairs)
Tasks
• Sentence classification: MPQA, MR, CR, SST-2
• Textual inference: MRPC, SICK-E
• Sentence relatedness scoring: SICK-R, STS-15, STS-16, STS-Benchmark
• Adversarial SQuAD
* The first three categories of tasks follow the settings in SentEval [Conneau et al., 2018].
Text Classification/Inference/Relatedness Tasks
PAR improves the performance of ELMo by
• 2.59-4.21% in accuracy on sentence classification tasks
• 2.60-3.30% in accuracy on textual inference tasks
• 3-5% in Pearson correlation on text similarity tasks
[Chart: comparison of ELMo without and with PAR on three SentEval tasks: SST-2 (acc), STS-Benchmark (ρ), SICK-E (acc).]
PAR improves ELMo on sentence representation tasks.
Adversarial SQuAD
Bi-Directional Attention Flow (BiDAF) [Seo et al., 2017] on two challenge settings:
• AddOneSent: add one human-paraphrased sentence
• AddvSent: add one adversarial sentence that is semantically similar to the question
F1 scores (%) w/ and w/o PAR:
• AddOneSent: ELMo-BiDAF 53.7 → ELMo-PAR-BiDAF 57.9
• AddvSent: ELMo-BiDAF 41.7 → ELMo-PAR-BiDAF 47.1
PAR improves the robustness of a downstream QA model against adversarial examples.
Word Representations
Average distances of shared words in MRPC test-set sentence pairs, before and after applying PAR:
[Chart: ELMo (all layers) vs. ELMo-PAR, for paraphrase and non-paraphrase pairs.]
PAR minimizes the differences of a word's representations in paraphrased contexts and preserves the differences in non-paraphrased contexts.
Future Work
• Applying PAR to other contextualized embedding models
• Modifying contextualized word embeddings with linguistic knowledge
  - Context-simplicity-aware embeddings
  - Incorporating lexical definitions in the word contextualization process
Thank You