A Decomposable Attention Model for Natural Language Inference



  1. A Decomposable Attention Model for Natural Language Inference
     Ankur Parikh, Oscar Täckström, Dipanjan Das, Jakob Uszkoreit
     Presented by: Xikun Zhang, University of Illinois, Urbana-Champaign

  2. Natural Language Inference
     - A key part of our understanding of natural language is the ability to understand sentence semantics.
     - Semantic Entailment or, more popularly, the task of Natural Language Inference (NLI) is a core Natural Language Understanding (NLU) task. While it poses as a classification task, it is uniquely well-positioned to serve as a benchmark task for research on NLU. It attempts to judge whether one sentence can be inferred from another.
     - More specifically, it tries to identify the relationship between the meanings of a pair of sentences, called the premise and the hypothesis. The relationship can be one of the following:
       - Entailment: the hypothesis is a sentence with a similar meaning as the premise
       - Contradiction: the hypothesis is a sentence with a contradictory meaning
       - Neutral: the hypothesis is a sentence with mostly the same lexical items as the premise but a different meaning

  3. Natural Language Inference (Cont'd)
     Determine entailment/contradiction/neutral relationships between a premise and a hypothesis.
     Premise: Bob is in his room, but because of the thunder and lightning outside, he cannot sleep.
     Hypothesis 1: Bob is awake. (entailment)
     Hypothesis 2: It is sunny outside. (contradiction)
     Hypothesis 3: Bob has a big house. (neutral)

  4. Recent Work (Sentence Encoding) [figure: input words]

  5. Recent Work (Sentence Encoding) [figure: word vector representations]

  6. Recent Work (Sentence Encoding) [figure: representation layer]

  7. Recent Work (Sentence Encoding) [figure: similarity layer]

  8. Recent Work (Sentence Encoding) [figure: output]

  9. Recent Work (Sentence Encoding) Many papers use this family of neural architectures: Hu et al. (2014), Bowman et al. (2015), He et al. (2015)

  10. Recent Work (Seq2Seq) [figure: encoder recurrent neural network reading "How are you <EOS>"] Sequence-to-sequence model for machine translation (Sutskever et al. 2014, Cho et al. 2014)

  11. Recent Work (Seq2Seq) [figure: decoder recurrent neural network generating "I am fine <EOS>" from the encoded "How are you <EOS>"] Sequence-to-sequence model for machine translation (Sutskever et al. 2014, Cho et al. 2014)

  12. Recent Work [figure: decoder recurrent neural network attending over the encoder states] Sequence-to-sequence model with attention (Bahdanau et al. 2014)

  13. [figure: decoder recurrent neural network with attention, as on the previous slide] Applications of attention: machine translation (Bahdanau et al. 2014), reading comprehension (Hermann et al. 2015), sentence similarity/entailment (Rocktäschel et al. 2015, Wang and Jiang 2015, Cheng et al. 2016)

  14. Motivation for this Work
     - Alignment plays a key role in many NLP tasks:
       - Machine Translation [Koehn, 2009]
       - Sentence Similarity [Haghighi et al., 2005; Koehn, 2009; Das and Smith, 2009; Chang et al., 2010; Fader et al., 2013]
       - Natural Language Inference [Marsi and Krahmer, 2005; MacCartney et al., 2006; Hickl and Bensley, 2007; MacCartney et al., 2008]
       - Semantic Parsing [Andreas et al., 2013]
     - Attention is the neural counterpart to alignment [Bahdanau et al. 2014]

  15. Motivation for this Work
     How well can we do with just alignment/attention, without building complex sentence representations?
     Premise: Bob is in his room, but because of the thunder and lightning outside, he cannot sleep.
     Hypothesis 1: Bob is awake.
     Premise: Bob is in his room, but because of the thunder and lightning outside, he cannot sleep.
     Hypothesis 2: It is sunny outside.

  16. Decomposable Attention [figure: the three steps on the sentence pair "alice plays a flute solo" / "someone playing music outside in the park": 1. Attend uses F( , ) to softly align sub-phrases (e.g. "flute solo" with "music", "alice" with "someone"); 2. Compare feeds each aligned pair to G( , ); 3. Aggregate sums the comparison vectors and applies H( )]

  17. Step 1: Attend
     Unnormalized attention weights: $e_{ij} := F'(\bar{a}_i, \bar{b}_j)$. In practice, $F'$ is decomposed as $F'(\bar{a}_i, \bar{b}_j) := F(\bar{a}_i)^\top F(\bar{b}_j)$, where $F$ is a feed-forward network.
     Sub-phrase in sentence 2 softly aligned to $\bar{a}_i$: $\beta_i = \sum_{j=1}^{\ell_b} \frac{\exp(e_{ij})}{\sum_{k=1}^{\ell_b} \exp(e_{ik})} \, \bar{b}_j$
     Sub-phrase in sentence 1 softly aligned to $\bar{b}_j$: $\alpha_j = \sum_{i=1}^{\ell_a} \frac{\exp(e_{ij})}{\sum_{k=1}^{\ell_a} \exp(e_{kj})} \, \bar{a}_i$
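A minimal NumPy sketch of this step; the row-wise network F, the shapes, and the variable names are illustrative assumptions, not the authors' code:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(a, b, F):
    """a: (len_a, d) premise vectors; b: (len_b, d) hypothesis vectors.
    F: feed-forward net applied row-wise, any (n, d) -> (n, d_f) map."""
    e = F(a) @ F(b).T                  # e_ij, shape (len_a, len_b)
    beta = softmax(e, axis=1) @ b      # beta_i: subphrase of b aligned to a_i
    alpha = softmax(e, axis=0).T @ a   # alpha_j: subphrase of a aligned to b_j
    return beta, alpha
```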

  18. Step 2: Compare
     Separately compare the aligned sub-phrases: $\mathbf{v}_{1,i} := G([\bar{a}_i; \beta_i])$ for each word in sentence 1, and $\mathbf{v}_{2,j} := G([\bar{b}_j; \alpha_j])$ for each word in sentence 2, where $G$ is a feed-forward network and $[\cdot;\cdot]$ denotes concatenation.
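Continuing the sketch with the same assumed shapes (G is again an assumed row-wise feed-forward map):

```python
def compare(a, b, beta, alpha, G):
    """Compare each word vector with its softly aligned subphrase.
    G: feed-forward net applied row-wise to the concatenated pairs."""
    v1 = G(np.concatenate([a, beta], axis=1))    # v_{1,i}, shape (len_a, d_g)
    v2 = G(np.concatenate([b, alpha], axis=1))   # v_{2,j}, shape (len_b, d_g)
    return v1, v2
```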

  19. Step 3: Aggregate
     Combine the results and classify: $\mathbf{v}_1 = \sum_{i=1}^{\ell_a} \mathbf{v}_{1,i}$, $\mathbf{v}_2 = \sum_{j=1}^{\ell_b} \mathbf{v}_{2,j}$, and $\hat{y} = H([\mathbf{v}_1; \mathbf{v}_2])$. In practice, $H$ is a feed-forward neural network followed by a linear layer that scores the three classes.
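And the aggregate step, completing the sketch (H is an assumed feed-forward classifier head):

```python
def aggregate(v1, v2, H):
    """Sum the comparison vectors over each sentence and classify.
    H: feed-forward net + linear layer mapping (2*d_g,) -> 3 class scores."""
    v1_sum = v1.sum(axis=0)                        # aggregate sentence 1
    v2_sum = v2.sum(axis=0)                        # aggregate sentence 2
    scores = H(np.concatenate([v1_sum, v2_sum]))   # entail/contradict/neutral
    return int(scores.argmax())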

  20. Decomposable Attention (recap) [figure: the same three steps, 1. Attend, 2. Compare, 3. Aggregate, on the "alice plays a flute solo" example]
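To make the three sketches above concrete, here is a hypothetical end-to-end run with random embeddings and untrained stand-in networks for F, G, and H (all illustrative assumptions, not trained components):

```python
rng = np.random.default_rng(0)
a = rng.normal(size=(6, 50))     # premise: 6 words, 50-dim embeddings
b = rng.normal(size=(4, 50))     # hypothesis: 4 words

W_g = rng.normal(size=(100, 50))         # stand-in weights for G
W_h = rng.normal(size=(100, 3))          # stand-in weights for H
F = lambda x: np.maximum(x, 0.0)         # toy "network": just a ReLU
G = lambda x: np.maximum(x @ W_g, 0.0)
H = lambda x: x @ W_h

beta, alpha = attend(a, b, F)
v1, v2 = compare(a, b, beta, alpha, G)
label = aggregate(v1, v2, H)   # index into (entailment, contradiction, neutral)
```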

  21. Beyond Unordered Words
     - Intra-Attention: construct a "context" using an extra attention layer
     - Uses weak word-order information via a distance bias: the distance-sensitive bias terms $d_{i-j} \in \mathbb{R}$ provide the model with a minimal amount of sequence information, while remaining parallelizable. These terms are bucketed such that all distances greater than 10 words share the same bias.
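A sketch of this intra-attention under the same assumptions; F_intra is an assumed row-wise network, bias an assumed learned vector of 11 bucketed terms, and absolute distance is an assumption here:

```python
def intra_attend(a, F_intra, bias):
    """Self-align a sentence with a distance-sensitive bias.
    bias: array of 11 terms; distances > 10 all share bias[10]."""
    n = a.shape[0]
    f = F_intra(a) @ F_intra(a).T                        # f_ij, shape (n, n)
    dist = np.abs(np.arange(n)[:, None] - np.arange(n))  # |i - j|, assumed abs
    a_prime = softmax(f + bias[np.minimum(dist, 10)], axis=1) @ a
    return np.concatenate([a, a_prime], axis=1)          # [a_i; a'_i]
```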

  22. Empirical Results
     Dataset: Stanford Natural Language Inference Corpus (SNLI, Bowman et al. 2015)
     http://nlp.stanford.edu/projects/snli/
     549,367 sentence pairs for training; 9,842 pairs for development; 9,824 pairs for testing

  23. Empirical Results

     Test Acc.  Method                                           #Params
     78         Lexicalized Classifiers (Bowman et al. 2015)     n/a
     81         LSTM RNN Encoders (Bowman et al. 2016)           3M
     81         Pretrained GRU Encoders (Vendrov et al. 2015)    15M
     82         Tree-Based CNN Encoders (Mou et al. 2015)        3.5M
     83         SPINN-PI Encoders (Bowman et al. 2016)           3.7M
     84         LSTM with Attention (Rocktäschel et al. 2016)    252K
     86         mLSTM (Wang and Jiang 2016)                      1.9M
     86         LSTMN w/ Attention Fusion (Cheng et al. 2016)    3.4M
     86         This Work                                        382K
     87         This Work with Self Attention                    582K

  24. Empirical Results [figure: bar chart of per-class accuracy (Neutral, Entailment, Contradiction) for the compared models]

  25. Error Analysis - Wins

     Sentence 1: Two kids are standing in the ocean hugging each other.
     Sentence 2: Two kids enjoy their day at the beach.
     DA (vanilla): N | DA (intra att.): N | SPINN-PI: E | mLSTM: E | Gold: N

     Sentence 1: A dancer in costumer performs on stage while a man watches.
     Sentence 2: the man is captivated
     DA (vanilla): N | DA (intra att.): N | SPINN-PI: E | mLSTM: E | Gold: N

     Sentence 1: They are sitting on the edge of a fountain
     Sentence 2: The fountain is splashing the persons seated
     DA (vanilla): N | DA (intra att.): N | SPINN-PI: C | mLSTM: C | Gold: N

  26. Error Analysis - Losses

     Sentence 1: Two dogs play with tennis ball in field.
     Sentence 2: Dogs are watching a tennis match.
     DA (vanilla): N | DA (intra att.): C | SPINN-PI: C | mLSTM: C | Gold: C

     Sentence 1: Two kids begin to make a snowman on a sunny winter day.
     Sentence 2: Two penguins making a snowman.
     DA (vanilla): N | DA (intra att.): C | SPINN-PI: C | mLSTM: C | Gold: C

     Sentence 1: The horses pull the carriage, holding people and a dog, through the rain.
     Sentence 2: Horses ride in a carriage pulled by a dog.
     DA (vanilla): E | DA (intra att.): E | SPINN-PI: C | mLSTM: C | Gold: C

  27. Headroom

     Sentence 1: A woman closes her eyes as she plays her cello.
     Sentence 2: The woman has her eyes open
     DA (vanilla): E | DA (intra att.): E | SPINN-PI: E | mLSTM: E | Gold: C

     Sentence 1: Two women having drinks and smoking cigarettes at the bar.
     Sentence 2: Three women are at a bar.
     DA (vanilla): E | DA (intra att.): E | SPINN-PI: E | mLSTM: E | Gold: C

     Sentence 1: A band playing with fans watching.
     Sentence 2: A band watches the fans play
     DA (vanilla): E | DA (intra att.): E | SPINN-PI: E | mLSTM: E | Gold: C

  28. Conclusion
     - We presented a simple attention-based approach to text similarity that is trivially parallelizable.
     - Our results suggest that, at least for the SNLI task, pairwise comparisons are relatively more important than global sentence-level representations.

  29. Thank You
