Representation learning

MLP: What are we learning in W_0 when we backpropagate?

[Figure: MLP over a one-hot input vector x, with first-layer weights W_0]
h_0 = f(W_0 x + b_0)
h_1 = f(W_1 h_0 + b_1)
y = softmax(W_2 h_1 + b_2)
Representation learning

Back-propagation allows neural networks to learn representations for words while training: word embeddings!

[Figure: the same MLP over a one-hot input x]
h_0 = f(W_0 x + b_0)
h_1 = f(W_1 h_0 + b_1)
y = softmax(W_2 h_1 + b_2)
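To make this concrete, here is a minimal numpy sketch (not from the slides; the sizes, names and the choice of tanh for f are illustrative): with a one-hot input x, the product W_0 x simply selects one column of W_0, so the columns of W_0 that back-propagation updates are exactly the learned word vectors.

```python
# Minimal sketch: with a one-hot x, W0 @ x picks one column of W0,
# so the columns of W0 act as word embeddings learned by backprop.
import numpy as np

V, D = 10_000, 300            # hypothetical vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W0 = rng.normal(size=(D, V))  # first-layer weights, updated by back-propagation
b0 = np.zeros(D)

word_id = 42                  # index of some word in the vocabulary
x = np.zeros(V)
x[word_id] = 1.0              # one-hot encoding of that word

h0 = np.tanh(W0 @ x + b0)     # h_0 = f(W_0 x + b_0)

# W0 @ x is exactly column `word_id` of W0: that column is the word's embedding
assert np.allclose(W0 @ x, W0[:, word_id])
```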
Representation learning

● Back-propagation allows neural networks to learn representations for words while training!
  – Word embeddings: a continuous vector space instead of one-hot vectors
● Are these word embeddings useful?
● Which task would be the best to learn embeddings that can be used in other tasks?
● Can we transfer this representation from one task to the other?
● Can we have all languages in one embedding space?
Representation learning and word embeddings
Word embeddings

● Let's represent words as vectors: similar words should have vectors which are close to each other.
[Figure: 2D vector space with judge, driver, policeman, cop clustered together, and Tallinn, Cambridge, Paris, London clustered together]

WHY?
● If an AI has seen these two sequences:
    I live in Cambridge
    I live in Paris
● … then which one should be more plausible?
    I live in Tallinn
    I live in policeman
● Answer: "I live in Tallinn" is the more plausible one, since Tallinn sits close to Cambridge and Paris in the vector space.
Word embeddings

Distributional vector spaces:
● Option 1: use a co-occurrence PPMI matrix
  – 1A: large sparse matrix
  – 1B: factorize it and use a low-rank dense matrix (SVD)
● Option 2: learn a low-rank dense matrix directly
  – 2A: MLP on a particular classification task
  – 2B: find a general task
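A minimal sketch of Option 1 on a toy corpus (the corpus, the window size of 1 and all variable names are illustrative assumptions): build the co-occurrence matrix (1A), convert it to PPMI, and factorize it with a truncated SVD to obtain dense low-rank vectors (1B).

```python
import numpy as np

corpus = [["i", "live", "in", "paris"], ["i", "live", "in", "cambridge"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# 1A: (large, here tiny) sparse word-word co-occurrence matrix, symmetric window of 1
C = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                C[idx[w], idx[sent[j]]] += 1

# Positive pointwise mutual information (PPMI)
total = C.sum()
p_ij = C / total
p_i = C.sum(axis=1, keepdims=True) / total
p_j = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(p_ij / (p_i * p_j))
ppmi = np.where(np.isfinite(pmi), np.maximum(pmi, 0.0), 0.0)

# 1B: truncated SVD -> low-rank dense embeddings, one vector per word
U, S, _ = np.linalg.svd(ppmi)
k = 2                                   # keep the top-k dimensions
embeddings = U[:, :k] * S[:k]
print(dict(zip(vocab, embeddings.round(2))))
```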
Word embeddings

General task with large quantities of data: guess the missing word (language models)

● CBOW: given the context, guess the middle word
    … people who keep pet dogs or cats exhibit better mental and physical health …
● SKIP-GRAM: given the middle word, guess the context
    … people who keep pet dogs or cats exhibit better mental and physical health …

Proposed by Mikolov et al. (2013): Word2vec
CBOW (Source: Mikolov et al. 2013)

Like an MLP with one hidden layer, but the output ranges over the whole vocabulary: a LARGE vocabulary means a LARGE number of classes!
    … people who keep pet dogs or cats exhibit better mental and physical health …
The cross-entropy loss / softmax over the vocabulary is expensive!

Negative sampling:
J_NEG^w(t) = log σ(h · o_i) − Σ_{k=1}^{K} E_{w_n ∼ P_noise} [ log σ(h · o_n) ]
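A rough numpy sketch of that objective for a single CBOW example, following the formula above as reconstructed here (shapes, indices and the uniform noise distribution are simplifying assumptions; word2vec actually samples noise from a smoothed unigram distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, K = 10_000, 300, 5                  # vocab size, embedding dim, noise samples

W_in = rng.normal(0, 0.1, size=(V, D))    # input (context) embeddings
W_out = rng.normal(0, 0.1, size=(V, D))   # output embeddings o_w

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

context_ids = np.array([11, 23, 57, 89])  # indices of the context words
target_id = 42                            # index of the middle word w(t)
noise_ids = rng.integers(0, V, size=K)    # w_n ~ P_noise (uniform here, for simplicity)

h = W_in[context_ids].mean(axis=0)        # CBOW hidden vector: average of context embeddings

# J_NEG = log sigma(h . o_i) - sum_k log sigma(h . o_n): only K+1 dot products
# instead of a softmax over all V classes.
j_neg = np.log(sigmoid(h @ W_out[target_id])) \
        - np.sum(np.log(sigmoid(h @ W_out[noise_ids].T)))
print(j_neg)
```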
Word embeddings

Pre-trained word embeddings can leverage texts with BILLIONS of words!!

Pre-trained word embeddings are useful for:
● Word similarity
● Word analogy
● Other tasks like PoS tagging, NERC, sentiment analysis, etc.
● Initializing the embedding layer in deep learning models
Word embeddings

Word similarity
[Table: nearest neighbours of selected words in embedding space. Source: Collobert et al. 2011]
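A small sketch of how such nearest-neighbour lists can be computed with cosine similarity, assuming a pre-trained embedding matrix `embeddings` (one row per word) and a `vocab` dict from words to row indices (both hypothetical names):

```python
import numpy as np

def nearest_neighbours(word, vocab, embeddings, k=5):
    """Return the k words whose vectors are closest to `word` by cosine similarity."""
    norms = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = norms @ norms[vocab[word]]        # cosine similarity to every word
    best = np.argsort(-sims)                 # most similar first
    inv = {i: w for w, i in vocab.items()}
    return [inv[i] for i in best if i != vocab[word]][:k]
```

With good embeddings, querying a word like "policeman" would be expected to return neighbours such as "cop", in line with the intuition above.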
Word embeddings

Word analogy: a is to b as c is to ?
    man is to king as woman is to ?
Word embeddings

Word analogy: a is to b as c is to ?
    man is to king as woman is to queen

a − b ≈ c − d
d ≈ c − a + b
d = argmax_{d ∈ V} cos(d, c − a + b)
king − man + woman ≈ queen
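A sketch of that argmax, reusing the assumed `vocab`/`embeddings` setup from the similarity sketch; as is standard practice, the three query words themselves are excluded from the candidates:

```python
import numpy as np

def analogy(a, b, c, vocab, embeddings):
    """a is to b as c is to ?  e.g. analogy('man', 'king', 'woman', ...) -> ideally 'queen'."""
    norms = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query = norms[vocab[c]] - norms[vocab[a]] + norms[vocab[b]]   # c - a + b
    sims = norms @ query                                          # cosine up to scaling
    inv = {i: w for w, i in vocab.items()}
    exclude = {vocab[a], vocab[b], vocab[c]}                      # do not return the query words
    best = max((i for i in range(len(sims)) if i not in exclude), key=lambda i: sims[i])
    return inv[best]
```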
Word embeddings

● How to use embeddings in a given task (e.g. MLP sentiment analysis):
  – Learn them from scratch (random initialization)
  – Initialize using pre-trained embeddings from some other task (e.g. word2vec)
● Other embeddings:
  – GloVe (Pennington et al. 2014)
  – fastText (Mikolov et al. 2017)
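A short PyTorch sketch of these two options for the embedding layer of such a model (the random `pretrained` matrix stands in for vectors actually loaded from word2vec/GloVe/fastText files):

```python
import numpy as np
import torch
import torch.nn as nn

V, D = 10_000, 300
pretrained = np.random.rand(V, D).astype("float32")  # placeholder for real pre-trained vectors

# Option 1: learn embeddings from scratch (random initialization)
emb_scratch = nn.Embedding(V, D)

# Option 2: initialize with pre-trained embeddings and keep fine-tuning them
emb_pretrained = nn.Embedding.from_pretrained(
    torch.from_numpy(pretrained), freeze=False
)
```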
Recap

● Deep learning: learn representations for words
● Are they useful for anything?
● Which task would be the best to learn embeddings that can be used in other tasks?
● Can we transfer this representation from one task to the other?
● Can we have all languages in one embedding space?
Superhuman abilities: cross-lingual word embeddings
http://aclweb.org/anthology/P18-1073
Contents

● Introduction to NLP
  – Deep Learning ~ Learning Representations
● Text as bag of words
  – Text Classification
  – Representation learning and word embeddings
  – Superhuman: cross-lingual word embeddings
● Text as a sequence: RNN
  – Sentence encoders
  – Machine Translation
  – Superhuman: unsupervised MT
From words to sequences

● Representation for words: one vector for each word (word embeddings)
● Representation for sequences of words: one vector for each sequence (?!)
  – Is it possible to represent a sentence in one vector at all?
  – Let's go back to the MLP
From words to sequences

MLP: what is h_0 with respect to the words in the input?
● Add the vectors of the words in the context (the 1's in x), plus the bias, and apply the non-linearity:
    h_0 = f( Σ_i w_i + b_0 )
● h_0 is a sentence representation
Sentence encoder

A function:
● input: a sequence of word embeddings w_i ∈ R^D
● output: a sentence representation s ∈ R^D'
[Figure: word embeddings w_1, w_2, w_3 → hidden layers (sentence encoder) → sentence representation s]
Sentence encoder

Baseline: continuous bag of words (pre-trained embeddings)
    s = Σ_i w_i
    h_0 = s
    h_1 = f(W_1 h_0 + b_1)
[Figure: word embeddings w_1, w_2, w_3 → Σ → sentence representation s]
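A minimal numpy sketch of this baseline encoder (the toy vocabulary, dimensions and the choice of tanh for f are illustrative):

```python
import numpy as np

def cbow_encode(sentence, vocab, embeddings, W1, b1):
    """h_0 = s = sum_i w_i ; h_1 = f(W_1 h_0 + b_1), with f = tanh here."""
    s = np.sum([embeddings[vocab[w]] for w in sentence], axis=0)  # sentence representation
    return np.tanh(W1 @ s + b1)

# Illustrative usage with random pre-trained embeddings and random layer parameters
rng = np.random.default_rng(0)
vocab = {"i": 0, "live": 1, "in": 2, "paris": 3}
embeddings = rng.normal(size=(len(vocab), 50))
W1, b1 = rng.normal(size=(64, 50)), np.zeros(64)
h1 = cbow_encode(["i", "live", "in", "paris"], vocab, embeddings, W1, b1)
```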