context2vec: Learning Generic Context Embedding with Bidirectional LSTM
Oren Melamud, Jacob Goldberger, Ido Dagan. CoNLL, 2016.
What context is
They robbed the _bank_ last night.
• Target: bank
• Sentential context: They robbed the ___ last night.
What context representations are used for
• Sentence completion: IBM ___ this company for 100 million dollars.
• Word sense disambiguation: They robbed the _bank_ last night.
• Named entity recognition: I can't find _April_.
• More: supersense tagging, coreference resolution, ...
What we want from context representations
• Information on the target slot/word
• Contextual information ≠ sum of context words
  • IBM ___ this company for 100 million dollars.
  • IBM bought this company for ___ million dollars.
  → Similar context words, different contextual information
• Context representation ≠ sentence representation
  • IBM ___ this company for 100 million dollars.
  • I ___ this necklace for my wife's birthday.
  → Different context words, similar contextual information
Our work
• Our goal
  • Sentential context representations
  • More value than sum of words
  • Unsupervised, generic learning setting
• Our model
  • context2vec = word2vec - CBOW + biLSTM
• We show
  • context2vec >> average of word embeddings
  • context2vec ∼ state-of-the-art (more complex models)
• Toolkit available for your NLP application
Background
Popular recent context representations
[Figure comparing approaches: limited scope, loses word order, variable-size]
Supervised biLSTM with pre-trained word embeddings (e.g. NER; Lample et al., 2016)
• Word order captured with biLSTM
• Task-specific training; supervision is limited in size
• Pre-trained word embeddings carry valuable information from large corpora
• Can we bring even more information?
Model
Baseline architecture: word2vec with CBOW
• Context word embeddings within a window are averaged into c_avg; target word embeddings t
• Example: John had [submitted] a paper → target word: submitted
• Objective function (negative sampling):
  S = ∑_{(t,c) ∈ PAIRS} [ log σ(c_avg · t) + ∑_{t′ ∈ NEGS(t,c)} log σ(−c_avg · t′) ]
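The CBOW objective above can be sketched numerically. A minimal illustration, not the paper's implementation: it assumes embeddings are plain NumPy vectors and negative targets are already sampled.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_pair_score(context_vecs, target_vec, negative_vecs):
    """Negative-sampling score of one (target, context) pair:
    log sigma(c_avg . t) + sum over t' in NEGS of log sigma(-c_avg . t')."""
    c_avg = np.mean(context_vecs, axis=0)         # average the context window
    score = np.log(sigmoid(c_avg @ target_vec))   # positive term
    for t_neg in negative_vecs:                   # sampled negative targets
        score += np.log(sigmoid(-c_avg @ t_neg))
    return score
```

Maximizing S over the corpus pushes c_avg toward the true target embedding and away from the sampled negatives.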
context2vec = word2vec - CBOW + biLSTM
• word2vec CBOW: averaged context word embeddings within a window enter the objective function together with target word embeddings
• context2vec: a bidirectional LSTM reads the words around the target; its output is passed through an MLP to produce sentential context embeddings, trained against target word embeddings with the same objective
• Example in both: John had [submitted] a paper → target word: submitted
Learning architecture: context2vec
• Bidirectional LSTM over the sentence, MLP on top → sentential context embedding c_c2v
• Example: John had [submitted] a paper → target word: submitted
• Same objective as CBOW, with c_c2v in place of c_avg:
  S = ∑_{(t,c) ∈ PAIRS} [ log σ(c_c2v · t) + ∑_{t′ ∈ NEGS(t,c)} log σ(−c_c2v · t′) ]
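A rough sketch of the context encoder, assuming toy NumPy LSTMs with random weights (the names `lstm_final`, `make_params`, and `context2vec_rep` are illustrative, not from the released toolkit): a left-to-right LSTM reads the words left of the target slot, a right-to-left LSTM reads the words to its right, and an MLP maps the concatenated final states to the sentential context embedding.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_final(xs, W, U, b):
    """Minimal LSTM; returns the final hidden state over sequence xs."""
    d = b.shape[0] // 4
    h, c = np.zeros(d), np.zeros(d)
    for x in xs:
        z = W @ x + U @ h + b
        i, f, o, g = np.split(z, 4)          # input, forget, output, candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

def make_params(d_word, d_hid, d_mlp, d_out, seed=0):
    """Random toy parameters, for illustration only."""
    rng = np.random.default_rng(seed)
    lstm = lambda: (0.1 * rng.standard_normal((4 * d_hid, d_word)),
                    0.1 * rng.standard_normal((4 * d_hid, d_hid)),
                    np.zeros(4 * d_hid))
    mlp = (0.1 * rng.standard_normal((d_mlp, 2 * d_hid)), np.zeros(d_mlp),
           0.1 * rng.standard_normal((d_out, d_mlp)), np.zeros(d_out))
    return {"l2r": lstm(), "r2l": lstm(), "mlp": mlp}

def context2vec_rep(left_ctx, right_ctx, p):
    """Sentential context embedding for one target slot: biLSTM states
    around the slot, concatenated and passed through a small MLP."""
    h_l = lstm_final(left_ctx, *p["l2r"])           # left-to-right pass
    h_r = lstm_final(right_ctx[::-1], *p["r2l"])    # right-to-left pass
    W1, b1, W2, b2 = p["mlp"]
    hidden = np.maximum(0.0, W1 @ np.concatenate([h_l, h_r]) + b1)
    return W2 @ hidden + b2
```

The output lives in the same space as the target word embeddings, so the CBOW-style objective applies unchanged.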
The context2vec embedding space
• Target words and sentential contexts live in one joint space
• Target words: acquired, bought, company, technology
• Sentential contexts: IBM [ ] this company; I [ ] this necklace for my wife's birthday; IBM bought this [ ]
• Useful similarities: target-to-context (t2c), context-to-context (c2c), target-to-target (t2t)
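Because targets and contexts share one space, all three similarity types reduce to the same cosine lookup. A toy sketch with hand-made 2-d vectors (the vectors and vocabulary below are invented for illustration):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest(query_vec, space):
    """Rank entries of a {name: vector} space by cosine similarity to
    the query; works the same for t2c, c2c, and t2t queries."""
    return sorted(space, key=lambda name: -cosine(query_vec, space[name]))

# Toy vectors: 'acquired' and 'bought' point the same way as the
# buying context; 'company' points elsewhere.
targets = {"acquired": np.array([1.0, 0.1]),
           "bought":   np.array([0.9, 0.2]),
           "company":  np.array([0.1, 1.0])}
buy_context = np.array([1.0, 0.0])  # stands in for "IBM [ ] this company"
```

Here `nearest(buy_context, targets)` ranks acquired and bought above company, mirroring the t2c neighborhoods sketched on the slide.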
Evaluation & Results
Evaluation goals
• Standalone evaluation of context2vec
• Using simple cosine similarity measures
Tasks: Sentence completion
• Example: I have seen it on him, and could ___ to it. {write, migrate, climb, swear, contribute}
• Benchmark: Microsoft sentence completion challenge (Zweig and Burges, 2011)
• Implementation: shortest target-context cosine distance
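The completion rule can be sketched as: embed the sentential context, then take the candidate whose target embedding has the highest cosine similarity to it (the candidate vectors below are toy values, not real context2vec output):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def complete(context_vec, candidates):
    """Choose the candidate target word with the highest target-context
    cosine similarity, i.e. the shortest cosine distance."""
    return max(candidates, key=lambda w: cosine(context_vec, candidates[w]))
```

Ranking all candidates by the same score, rather than taking the argmax, gives the lexical substitution setup on the next slide.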
Tasks: Lexical substitution
• Example: Charlie is a bright boy. {skilled, luminous, vivid, hopeful, smart}
• Benchmarks:
  • Lexical sample (McCarthy and Navigli, 2007)
  • All-words (Kremer et al., 2014)
• Implementation: rank by target-context cosine distance
Tasks: Supervised word sense disambiguation
• TRAIN: They add (s2) a touch of humor. / The minister added (s4): the process remains fragile.
• TEST: This adds a wider perspective.
• Benchmark: Senseval-3 English lexical sample (Mihalcea et al., 2004)
• Implementation: shortest context-context cosine distance (kNN)
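The kNN rule, as a small sketch: embed the test context, find the k labeled training contexts with the smallest cosine distance, and take a majority vote over their sense labels. A simplified stand-in that assumes contexts are already embedded as vectors:

```python
import numpy as np
from collections import Counter

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def knn_sense(test_ctx, labeled_train, k=3):
    """labeled_train: list of (context_vector, sense_label) pairs.
    Predict by majority vote among the k nearest training contexts."""
    neighbors = sorted(labeled_train,
                       key=lambda ex: -cosine(test_ctx, ex[0]))[:k]
    return Counter(sense for _, sense in neighbors).most_common(1)[0][0]
```

Only the context embeddings do the work here; no task-specific model is trained beyond storing the labeled examples.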