Deep Neural Networks in Natural Language Processing
Rudolf Rosa (rosa@ufal.mff.cuni.cz)
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
Hora Informaticae, ÚI AV ČR, Praha, 14 Jan 2019


Warnings
- I am not an ML expert, rather an ML user; please excuse any errors and inaccuracies.
- Focus of the talk: input representation ("encoding"), a key problem in NLP with interesting properties.
- Leaving out: generating output ("decoding"), which is also interesting:
  - sequence generation,
  - sequence elements are discrete with a large domain (softmax over ~10^6 items),
  - sequence length not known a priori,
  - decisions at the encoder/decoder boundary (if any).

Problem 1: Words
From massively multi-valued discrete data (words) to continuous low-dimensional vectors (word embeddings).

Simplification
- For now, forget sentences: 1 word → some output.
  - Possible outputs: is the word positive/neutral/negative, a definition of the word, a hypernym (dog → animal), …
- Situation:
  - We have labelled training data for some words (~10^3).
  - We want to generalize (ideally) to all words (~10^6).

The problem with words
- How many words are there? Too many!
  - Counting words runs into many problems and cannot really be done; roughly ~10^6, but potentially infinite, since new words get created every day.
  - A long-standing problem of NLP.
- Natural representation: a 1-hot vector of dimension ~10^6 (all zeros except a single 1 at position i); see the toy sketch below.
  - ML with ~10^6 binary features on input; a pair of words gives ~10^12.
  - No generalization, meaning of words not captured: dog~puppy, dog~~cat, dog~~~platypus, dog~~~~whiskey.
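For illustration, a minimal numpy sketch (toy sizes and a hypothetical word index, not from the talk) contrasting the 1-hot representation with a dense embedding lookup; the lookup is just a row selection, i.e. multiplying the 1-hot vector by an embedding matrix:

```python
import numpy as np

V = 10_000   # vocabulary size (would be ~10^6 in practice; kept small here)
d = 100      # embedding dimension (~10^2)
i = 42       # hypothetical index of some word in the vocabulary

# 1-hot representation: a huge, sparse vector that carries no meaning.
one_hot = np.zeros(V)
one_hot[i] = 1.0

# Dense embedding: a small continuous vector looked up from a table
# (random here; in practice learned, e.g. by SVD or word2vec, see later slides).
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))            # embedding matrix, N x d
embedding = E[i]                       # row lookup, equivalent to one_hot @ E
print(one_hot.shape, embedding.shape)  # (10000,) (100,)
```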

Split the words
- Split into characters: M O C K
  - Not that many (~10^2), but they do not capture meaning: a word starts with "m-", is it positive or negative?
- Split into subwords/morphemes: mis class if ied
  - A word starts with "mis-": it is probably negative (misclassify, mistake, misconception…).
  - Helps, used in practice; a potentially infinite word set is covered by a finite set of subwords.
  - But meaning-capturing subwords are still too many (~10^5).
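As an illustration of the idea only (not the talk's method, and not a real subword learner such as BPE): a greedy longest-match segmenter against a small, hypothetical subword vocabulary. Real systems learn the vocabulary from data; the point here is just that unseen words can be covered by known pieces.

```python
# Toy subword segmentation by greedy longest-match against a tiny,
# hypothetical subword vocabulary.
SUBWORDS = {"mis", "class", "if", "ied", "take", "conception", "s"}

def segment(word: str) -> list[str]:
    pieces, start = [], 0
    while start < len(word):
        # take the longest known subword starting at `start`
        for end in range(len(word), start, -1):
            if word[start:end] in SUBWORDS:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # unknown character: fall back to a single-character piece
            pieces.append(word[start])
            start += 1
    return pieces

print(segment("misclassified"))   # ['mis', 'class', 'if', 'ied']
print(segment("misconceptions"))  # ['mis', 'conception', 's']
```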

Distributional hypothesis
- smelt (assume you don't know this word; Czech: koruška)
  - I had a smelt for lunch. → noun, meal/food
  - My father caught a smelt. → animal/illness
  - Smelts are disappearing from oceans. → plant/fish
- Harris (1954): "Words that occur in the same contexts tend to have similar meanings."

Distributional hypothesis
- Harris (1954): "Words that occur in the same contexts tend to have similar meanings."
- Cooccurrence matrix (N x N, N ~ 10^6): number of sentences containing both WORD and CONTEXT.

  WORD \ CONTEXT | lunch | caught | oceans | doctor | green
  smelt          |    10 |     10 |     10 |      1 |     1
  salmon         |   100 |    100 |    100 |      1 |     1
  flu            |     1 |    100 |      1 |    100 |    10
  seaweed        |    10 |      1 |    100 |      1 |   100

- Cheap, plentiful data (webs, news, books…): ~10^9.

From cooccurrence to PMI (association measures)
- Cooccurrence matrix: M_C[i, j] = count(word_i & context_j)
- Conditional probability matrix: M_P[i, j] = P(word_i | context_j) = M_C[i, j] / count(context_j)
- Conditional log-probability matrix: M_logP[i, j] = log P(word_i | context_j) = log M_P[i, j]
- Pointwise mutual information matrix: M_PMI[i, j] = log [ P(word_i | context_j) / P(word_i) ]
  - PMI(A, B) = log [ P(A & B) / (P(A) P(B)) ]
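A small numpy sketch (mine, not from the talk) that applies these definitions to the toy counts from the cooccurrence table above:

```python
import numpy as np

# Toy cooccurrence counts from the earlier slide: rows = words, columns = contexts.
words    = ["smelt", "salmon", "flu", "seaweed"]
contexts = ["lunch", "caught", "oceans", "doctor", "green"]
M_C = np.array([[ 10,  10,  10,   1,   1],
                [100, 100, 100,   1,   1],
                [  1, 100,   1, 100,  10],
                [ 10,   1, 100,   1, 100]], dtype=float)

total     = M_C.sum()
P_word    = M_C.sum(axis=1, keepdims=True) / total   # P(word_i)
P_context = M_C.sum(axis=0, keepdims=True) / total   # P(context_j)
P_joint   = M_C / total                               # P(word_i & context_j)

# M_P[i, j] = P(word_i | context_j) = M_C[i, j] / count(context_j)
M_P = M_C / M_C.sum(axis=0, keepdims=True)

# M_PMI[i, j] = log [ P(word_i & context_j) / (P(word_i) P(context_j)) ]
M_PMI = np.log(P_joint / (P_word * P_context))

# Equivalent formulation via the conditional probability from the slide.
M_PMI_alt = np.log(M_P / P_word)
assert np.allclose(M_PMI, M_PMI_alt)

print(np.round(M_PMI, 2))
```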

From cooccurrence to PMI
- Word representation still impractically huge: M_PMI[i] ∈ R^N, N ~ 10^6.
- But better than 1-hot: meaningful continuous vectors (e.g. cosine similarity).
- Just need to compress it!
  - Explicitly: matrix factorization (post-hoc analysis, not used in practice).
  - Implicitly: word2vec (widely used).

Matrix factorization (Levy & Goldberg, 2014)
- Take M_logP or M_PMI (N ~ 10^6).
- Shift the matrix to make it positive (subtract the minimum).
- Truncated Singular Value Decomposition (d ~ 10^2):
  M ≈ U D V^T, with M ∈ R^(N×N), U ∈ R^(N×d), D ∈ R^(d×d), V ∈ R^(N×d).
- Word embedding matrix: W = U D ∈ R^(N×d); embedding vec(word_i) = W[i] ∈ R^d.
  - A continuous low-dimensional vector.
  - Meaningful (cosine similarity, algebraic operations); numpy sketch below.
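A minimal numpy sketch of this recipe under toy assumptions (a random stand-in for the shifted PMI matrix, N = 200, d = 10):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the PMI matrix; tiny toy size instead of N ~ 10^6.
N, d = 200, 10
M = rng.normal(size=(N, N))
M = M - M.min()              # shift the matrix to make all entries positive

# Truncated SVD: keep only the d largest singular values.
U, S, Vt = np.linalg.svd(M, full_matrices=False)
U_d, S_d = U[:, :d], S[:d]

# Word embedding matrix W = U D  (N x d); one row per word.
W = U_d * S_d                # same as U_d @ np.diag(S_d)
vec_i = W[0]                 # embedding of word 0, a d-dimensional vector
print(W.shape, vec_i.shape)  # (200, 10) (10,)
```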

Word embeddings magic
- Word similarity (cosine): vec(dog) ~ vec(puppy), vec(cat) ~ vec(kitten).
- Word meaning algebra: some relations are parallel across words.
  - vec(puppy) - vec(dog) ~ vec(kitten) - vec(cat)
  - => vec(puppy) - vec(dog) + vec(cat) ~ vec(kitten)
  - vodka – Russia + Mexico, teacher – school + hospital…
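A tiny numpy sketch of the analogy computation, using hypothetical 2-dimensional embeddings arranged so that the dog:puppy and cat:kitten offsets are parallel:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(vec, a, b, c, topn=1):
    """Return words d such that a - b + c ~ d (e.g. puppy - dog + cat ~ kitten)."""
    target = vec[a] - vec[b] + vec[c]
    candidates = [(w, cosine(v, target)) for w, v in vec.items() if w not in {a, b, c}]
    return sorted(candidates, key=lambda wc: wc[1], reverse=True)[:topn]

# Hypothetical embeddings; real ones come from SVD or word2vec.
vec = {
    "dog":    np.array([1.0, 0.0]),
    "puppy":  np.array([1.0, 1.0]),
    "cat":    np.array([3.0, 0.0]),
    "kitten": np.array([3.0, 1.0]),
}
print(analogy(vec, "puppy", "dog", "cat"))  # [('kitten', ...)]
```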

word2vec (Mikolov et al., 2013)
- CBOW: predict word w_i from its context.
  - E.g.: "I had _____ for lunch"
  - Sentence: … w_{i-2} w_{i-1} w_i w_{i+1} w_{i+2} …
  - Architecture (from the slide's diagram): context words as 1-hot vectors → shared projection matrix (N×d) → context vectors summed → another matrix, a "linear hidden layer" (d×N) → (hierarchical) softmax → distribution over the output word w_i.
  - Trained with SGD.
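A toy numpy forward pass matching the CBOW description above (toy sizes and hypothetical context indices; the real model uses hierarchical softmax and is trained with SGD rather than just running this plain softmax):

```python
import numpy as np

N, d = 1000, 50
rng = np.random.default_rng(0)
W_in  = rng.normal(scale=0.1, size=(N, d))   # shared projection matrix (N x d)
W_out = rng.normal(scale=0.1, size=(d, N))   # "linear hidden layer" (d x N)

context_ids = [3, 17, 42, 7]        # w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2} (hypothetical)
h = W_in[context_ids].sum(axis=0)   # sum of projected context vectors, shape (d,)
scores = h @ W_out                  # one score per vocabulary word

probs = np.exp(scores - scores.max())
probs /= probs.sum()                # softmax: distribution over the output word w_i
print(probs.shape, probs.argmax())
```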

word2vec (Mikolov et al., 2013)
- Skip-gram with negative sampling (SGNS): predict the context from a word w_i.
  - E.g.: "____ _____ smelt _____ _____"
  - Sentence: … w_{i-2} w_{i-1} w_i w_{i+1} w_{i+2} …
  - Architecture (from the slide's diagram): input word as a 1-hot vector → projection matrix W (N×d) → another, shared matrix V (d×N) with σ → output context vectors (distributions over w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}).

word2vec ~ implicit factorization
- Word embedding matrix W ∈ R^(N×d); embedding(word_i) = W[i] ∈ R^d.
- Levy & Goldberg (2014): word2vec SGNS implicitly factorizes M_PMI.
  - M_PMI[i, j] = log [ P(word_i | context_j) / P(word_i) ]
  - SGNS: M_PMI ≈ W V, with M_PMI ∈ R^(N×N), W ∈ R^(N×d), V ∈ R^(d×N).
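A from-scratch numpy sketch of a single SGNS update (toy sizes, hypothetical word and context indices), only to make the roles of the two matrices W and V explicit; this is a simplification, not the original word2vec implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

N, d, lr, k = 1000, 50, 0.05, 5         # vocab size, dimension, learning rate, negatives
W = rng.normal(scale=0.1, size=(N, d))  # word (input) embeddings
V = rng.normal(scale=0.1, size=(N, d))  # context (output) embeddings, stored row-wise
                                        # (i.e. the transpose of the slide's d x N matrix)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(word, context):
    """One SGD step on a single (word, context) pair plus k negative samples."""
    negatives = rng.integers(0, N, size=k)
    w = W[word].copy()                  # use the pre-update word vector for all gradients
    for c, label in [(context, 1.0)] + [(int(n), 0.0) for n in negatives]:
        v = V[c]
        g = sigmoid(w @ v) - label      # gradient of the logistic loss w.r.t. the score
        W[word] -= lr * g * v
        V[c]    -= lr * g * w

sgns_step(word=3, context=7)            # hypothetical indices
print(W.shape, V.shape)                 # (1000, 50) (1000, 50)
```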

Problem 2: Sentences
From variable-length input sequences with long-distance relations between elements (sentences) to fixed-sized neural units (attention mechanisms).

Processing sentences
- Convolutional neural networks
- Recurrent neural networks
- Attention mechanism
- Self-attentive networks

Convolutional neural networks
- Input: sequence of word embeddings.
- Filters (size 3-5), normalization, max-pooling (see the sketch below).
- Training deep CNNs is hard → residual connections.
  - Layer input is averaged with its output, skipping the non-linearity.
- Problem: capturing long-range dependencies.
  - The receptive field of each filter is limited.
  - "My computer works, but I have to buy a new mouse."
- Good for word n-gram spotting.
  - Sentiment analysis, named entity detection…
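A minimal sketch of a convolutional sentence classifier in PyTorch (my choice of framework, not the talk's), showing embeddings, a size-3 filter bank, and max-pooling over time:

```python
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    def __init__(self, vocab_size=10_000, emb_dim=100, n_filters=64,
                 kernel_size=3, n_classes=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size, padding=1)
        self.out = nn.Linear(n_filters, n_classes)

    def forward(self, token_ids):            # (batch, seq_len)
        x = self.emb(token_ids)              # (batch, seq_len, emb_dim)
        x = x.transpose(1, 2)                # Conv1d expects (batch, channels, seq_len)
        x = torch.relu(self.conv(x))         # (batch, n_filters, seq_len)
        x = x.max(dim=2).values              # max-pooling over time -> (batch, n_filters)
        return self.out(x)                   # class scores, e.g. positive/neutral/negative

model = CNNEncoder()
scores = model(torch.randint(0, 10_000, (2, 12)))  # 2 toy sentences of 12 tokens
print(scores.shape)                                 # torch.Size([2, 3])
```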

Recurrent neural networks
- Input: sequence of word embeddings.
- Output: final state of the RNN.
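A minimal PyTorch sketch (again my choice of framework) of encoding a sentence with a recurrent network and reading off its final state; a GRU is used here as one common RNN variant:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10_000, 100)
rnn = nn.GRU(input_size=100, hidden_size=128, batch_first=True)

token_ids = torch.randint(0, 10_000, (2, 12))  # 2 toy sentences, 12 tokens each
states, h_n = rnn(emb(token_ids))              # states: (2, 12, 128), h_n: (1, 2, 128)
sentence_vec = h_n[-1]                         # final state, one 128-d vector per sentence
print(sentence_vec.shape)                      # torch.Size([2, 128])
```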
