Warnings
    I am not an ML expert, rather an ML user – please excuse any errors and inaccuracies.
    Focus of the talk: input representation ("encoding") – a key problem in NLP with interesting properties.
    Leaving out: generating output ("decoding") – that is also interesting:
        sequence generation
        sequence elements are discrete, with a large domain (softmax over ~10^6)
        sequence length not known a priori
        decisions at the encoder/decoder boundary (if any)
Problem 1: Words
    Massively multi-valued discrete data (words) → continuous low-dimensional vectors (word embeddings)
Simplification
    For now, forget sentences: 1 word → some output, e.g.
        whether the word is positive/neutral/negative,
        the definition of the word,
        a hypernym (dog → animal), …
    Situation:
        we have labelled training data for some words (~10^3)
        we want to generalize (ideally) to all words (~10^6)
The problem with words
    How many words are there? Too many!
        counting words runs into many problems and cannot really be done
        roughly ~10^6, but potentially infinite – new words get created every day
        a long-standing problem of NLP
    Natural representation: 1-hot vector of dimension ~10^6 (all zeros, a single 1 at position i)
        ML with ~10^6 binary features on input
        a pair of words: ~10^12 features
        no generalization, the meaning of words is not captured (see the sketch below):
        dog~puppy, dog~~cat, dog~~~platypus, dog~~~~whiskey
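The lack of generalization can be made concrete with a small sketch (toy vocabulary, not part of the talk): under a 1-hot encoding, every pair of distinct words is equally dissimilar, so nothing links dog to puppy.

```python
# Illustrative sketch: 1-hot vectors treat all distinct words as equally different,
# so no similarity between dog and puppy is captured.
import numpy as np

vocab = ["dog", "puppy", "cat", "platypus", "whiskey"]   # toy vocabulary, real N ~ 10^6
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Any two distinct words have cosine similarity 0 -- dog is as far from puppy as from whiskey.
print(cosine(one_hot("dog"), one_hot("puppy")))    # 0.0
print(cosine(one_hot("dog"), one_hot("whiskey")))  # 0.0
```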
Split the words
    Split into characters: M O C K
        not that many (~10^2)
        but characters do not capture meaning – a word starts with "m-": is it positive or negative?
    Split into subwords/morphemes: mis class if ied (a toy segmentation sketch follows below)
        a word starting with "mis-" is probably negative: misclassify, mistake, misconception…
        helps, and is used in practice
        a potentially infinite set of words is covered by a finite set of subwords
        but meaning-capturing subwords are still too many (~10^5)
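As a rough illustration of the subword idea (the vocabulary and the `segment` helper below are made up; real systems learn subword inventories, e.g. with BPE), a greedy longest-match segmentation might look like this:

```python
# Toy sketch of subword splitting: greedy longest-match against a hand-made
# subword vocabulary (real systems learn the vocabulary from data).
SUBWORDS = {"mis", "class", "if", "ied", "take", "conception"}

def segment(word):
    pieces, start = [], 0
    while start < len(word):
        # take the longest known subword that matches at the current position
        for end in range(len(word), start, -1):
            if word[start:end] in SUBWORDS:
                pieces.append(word[start:end])
                start = end
                break
        else:  # no known subword matches: fall back to a single character
            pieces.append(word[start])
            start += 1
    return pieces

print(segment("misclassified"))   # ['mis', 'class', 'if', 'ied']
print(segment("misconception"))   # ['mis', 'conception']
```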
Distributional hypothesis
    smelt (assume you don't know this word)
        "I had a smelt for lunch." → a noun, some meal/food
        "My father caught a smelt." → an animal or an illness
        "Smelts are disappearing from oceans." → a plant or a fish
        (indeed, a smelt is a fish; Czech: koruška)
    Harris (1954): "Words that occur in the same contexts tend to have similar meanings."
Distributional hypothesis
    Harris (1954): "Words that occur in the same contexts tend to have similar meanings."
    Cooccurrence matrix: number of sentences containing both WORD and CONTEXT (N×N, N ~ 10^6)

        WORD \ CONTEXT   lunch   caught   oceans   doctor   green
        smelt               10       10       10        1       1
        salmon             100      100      100        1       1
        flu                  1      100        1      100      10
        seaweed             10        1      100        1     100

    Built from cheap, plentiful data (webs, news, books…): ~10^9 sentences (a toy counting sketch follows below)
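A toy counting sketch of such a matrix (the example sentences are invented; real data would be ~10^9 sentences and the matrix would be stored sparsely):

```python
# Sketch: build sentence-level cooccurrence counts M_C[word][context],
# i.e. in how many sentences the word and the context word appear together.
from collections import defaultdict

sentences = [
    "i had a smelt for lunch",
    "my father caught a smelt",
    "smelts are disappearing from oceans",
    "i had a salmon for lunch",
]

cooc = defaultdict(lambda: defaultdict(int))
for sentence in sentences:
    words = set(sentence.split())
    for word in words:
        for context in words:
            if word != context:
                cooc[word][context] += 1

print(cooc["smelt"]["lunch"])   # 1
print(cooc["salmon"]["lunch"])  # 1
```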
From cooccurrence to PMI: association measures
    Cooccurrence matrix: M_C[i, j] = count(word_i & context_j)
    Conditional probability matrix: M_P[i, j] = P(word_i | context_j) = M_C[i, j] / count(context_j)
    Conditional log-probability matrix: M_LogP[i, j] = log P(word_i | context_j) = log M_P[i, j]
    Pointwise mutual information matrix: M_PMI[i, j] = log [ P(word_i | context_j) / P(word_i) ]
        in general, PMI(A, B) = log [ P(A & B) / (P(A) P(B)) ]
    (a small numeric sketch follows below)
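Assuming the toy counts from the table above, the measures can be computed in a few lines (a sketch; here P(word_i) is estimated from the same count matrix rather than from a separate unigram count):

```python
# Sketch: the association measures above, computed from a small dense count matrix
# (rows = words, columns = contexts; values are the toy counts from the slide).
import numpy as np

M_C = np.array([[ 10,  10,  10,   1,   1],    # smelt
                [100, 100, 100,   1,   1],    # salmon
                [  1, 100,   1, 100,  10],    # flu
                [ 10,   1, 100,   1, 100]])   # seaweed

context_counts = M_C.sum(axis=0)              # count(context_j)
word_probs = M_C.sum(axis=1) / M_C.sum()      # P(word_i), estimated from the same counts

M_P = M_C / context_counts                    # P(word_i | context_j)
M_LogP = np.log(M_P)
M_PMI = np.log(M_P / word_probs[:, None])     # log [ P(word_i | context_j) / P(word_i) ]

print(np.round(M_PMI, 2))
```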
From cooccurrence to PMI
    The word representation is still impractically huge: M_PMI[i] ∈ R^N, N ~ 10^6
    But better than 1-hot: meaningful continuous vectors (e.g. cosine similarity)
    We just need to compress it!
        explicitly: matrix factorization – a post-hoc analysis, not used in practice
        implicitly: word2vec – widely used
Matrix factorization (Levy & Goldberg, 2014)
    Take M_LogP or M_PMI
    Shift the matrix to make it positive (subtract the minimum)
    Truncated Singular Value Decomposition: M ≈ U D V^T
        M ∈ R^{N×N} → U ∈ R^{N×d}, D ∈ R^{d×d}, V ∈ R^{N×d}   (N ~ 10^6, d ~ 10^2)
    Word embedding matrix: W = U D ∈ R^{N×d}
    Embedding: vec(word_i) = W[i] ∈ R^d
        a continuous low-dimensional vector
        meaningful (cosine similarity, algebraic operations)
    (a minimal numeric sketch follows below)
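A minimal numeric sketch of this factorization step, on a tiny random stand-in for M_PMI (at realistic N one would use a sparse, truncated SVD rather than the full `np.linalg.svd`):

```python
# Sketch of the factorization step: shift the PMI matrix so all entries are
# non-negative, run an SVD, keep the top d dimensions, and use W = U D as embeddings.
import numpy as np

M = np.log(np.random.rand(50, 50) + 1e-6)   # tiny stand-in for M_PMI; real N ~ 10^6
M = M - M.min()                             # shift so all entries are non-negative

d = 10                                      # embedding size, real d ~ 10^2
U, S, Vt = np.linalg.svd(M)                 # full SVD; use a truncated sparse SVD at real scale
W = U[:, :d] * S[:d]                        # W = U D  (N x d word embedding matrix)

vec_word_0 = W[0]                           # embedding of word 0, a vector in R^d
print(W.shape, vec_word_0.shape)            # (50, 10) (10,)
```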
Word embeddings magic
    Word similarity (cosine): vec(dog) ~ vec(puppy), vec(cat) ~ vec(kitten)
    Word meaning algebra: some relations are parallel across words (a query sketch follows below)
        vec(puppy) - vec(dog) ~ vec(kitten) - vec(cat)
        => vec(puppy) - vec(dog) + vec(cat) ~ vec(kitten)
        vodka – Russia + Mexico, teacher – school + hospital…
    [figure: cat→kitten and dog→puppy shown as parallel arrows in the embedding space]
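A sketch of these queries, assuming an `embeddings` lookup table of word vectors (here filled with random stand-in vectors, so the returned neighbour is arbitrary; with trained embeddings the analogy tends to return "kitten"):

```python
# Sketch of the "magic": nearest neighbours by cosine similarity and the
# analogy query vec(puppy) - vec(dog) + vec(cat) ~ vec(kitten).
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def most_similar(query_vec, embeddings, exclude=()):
    candidates = ((w, cosine(query_vec, v)) for w, v in embeddings.items() if w not in exclude)
    return max(candidates, key=lambda pair: pair[1])

# Stand-in vectors; real ones come from word2vec or the SVD factorization above.
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=100) for w in ["dog", "puppy", "cat", "kitten", "whiskey"]}

query = embeddings["puppy"] - embeddings["dog"] + embeddings["cat"]
print(most_similar(query, embeddings, exclude={"puppy", "dog", "cat"}))
```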
word2vec (Mikolov et al., 2013)
    CBOW: predict word w_i from its context, e.g. "I had _____ for lunch"
    Sentence: … w_{i-2} w_{i-1} w_i w_{i+1} w_{i+2} …
    [architecture figure: context words as 1-hot vectors → shared projection matrix (N×d) → context vectors summed → another matrix (d×N), a "linear hidden layer" → (hierarchical) softmax → distribution over the output word; trained with SGD]
word2vec (Mikolov et al., 2013)
    Skip-gram: predict the context from a word w_i, e.g. "____ _____ smelt _____ _____"
    Sentence: … w_{i-2} w_{i-1} w_i w_{i+1} w_{i+2} …
    [architecture figure: input word as a 1-hot vector → projection matrix W (N×d) → another, shared matrix V (d×N) → σ → output context vectors (distributions)]
    (a minimal training sketch follows below)
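For completeness, a minimal training sketch using the gensim library (assuming gensim 4.x; the three-sentence corpus is invented and far too small to produce meaningful vectors – real training uses ~10^9 words):

```python
# Minimal word2vec (skip-gram with negative sampling) sketch with gensim 4.x.
from gensim.models import Word2Vec

corpus = [
    "i had a smelt for lunch".split(),
    "my father caught a smelt".split(),
    "smelts are disappearing from oceans".split(),
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,   # d ~ 10^2
    window=2,          # context size
    sg=1,              # skip-gram (sg=0 would be CBOW)
    negative=5,        # negative sampling
    min_count=1,
)

print(model.wv["smelt"].shape)          # (100,) -- the embedding W[i]
print(model.wv.most_similar("smelt"))   # nearest neighbours by cosine similarity
```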
word2vec ~ implicit factorization
    Word embedding matrix W ∈ R^{N×d}; embedding(word_i) = W[i] ∈ R^d
    Levy & Goldberg (2014): word2vec SGNS implicitly factorizes M_PMI
        M_PMI[i, j] = log [ P(word_i | context_j) / P(word_i) ]
        SGNS: M_PMI ≈ W V, with M_PMI ∈ R^{N×N} → W ∈ R^{N×d}, V ∈ R^{d×N}
Problem 2: Sentences
    Variable-length input sequences with long-distance relations between elements (sentences) → fixed-sized neural units (attention mechanisms)
Processing sentences
    Convolutional neural networks
    Recurrent neural networks
    Attention mechanism
    Self-attentive networks
Convolutional neural networks
    Input: sequence of word embeddings
    Filters (size 3–5), normalization, max-pooling
    Training deep CNNs is hard → residual connections: the layer input is combined with the layer output, skipping the non-linearity
    Problem: capturing long-range dependencies – the receptive field of each filter is limited
        "My computer works, but I have to buy a new mouse."
    Good for word n-gram spotting: sentiment analysis, named entity detection…
    (a minimal encoder sketch follows below)
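A minimal sketch of such a CNN sentence encoder in PyTorch (hyperparameters and names are illustrative, not from the talk): word embeddings, parallel convolutions with filter sizes 3–5, and max-pooling over time to obtain a fixed-size sentence vector.

```python
# Sketch: word embeddings -> 1D convolutions (filter sizes 3-5) -> max-pooling over time.
import torch
import torch.nn as nn

class CNNSentenceEncoder(nn.Module):
    def __init__(self, vocab_size=10_000, emb_dim=100, num_filters=64, filter_sizes=(3, 4, 5)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # one Conv1d per filter size; each slides over the word positions
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, kernel_size=k, padding=k // 2) for k in filter_sizes]
        )

    def forward(self, word_ids):                  # word_ids: (batch, sentence_length)
        x = self.embed(word_ids).transpose(1, 2)  # (batch, emb_dim, length)
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)           # fixed-size sentence vector

encoder = CNNSentenceEncoder()
sentence = torch.randint(0, 10_000, (1, 12))      # a batch with one 12-word sentence
print(encoder(sentence).shape)                    # torch.Size([1, 192])
```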
Recurrent neural networks
    Input: sequence of word embeddings
    Output: the final state of the RNN (a sketch follows below)
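A corresponding sketch of an RNN encoder (here a GRU in PyTorch, chosen for brevity; the talk does not prescribe a particular RNN cell): the final hidden state serves as the fixed-size representation of the sentence.

```python
# Sketch: run a GRU over a sequence of word embeddings and keep its final state.
import torch
import torch.nn as nn

gru = nn.GRU(input_size=100, hidden_size=128, batch_first=True)
embeddings = torch.randn(1, 12, 100)   # one sentence of 12 word embeddings
outputs, final_state = gru(embeddings) # outputs: all states; final_state: last state
sentence_vector = final_state[0]       # (1, 128): fixed-size encoding of the sentence
print(sentence_vector.shape)
```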