

  1. Natural Language Processing with Deep Learning – Word Embeddings. Navid Rekab-Saz (navid.rekabsaz@jku.at), Institute of Computational Perception

  2. Agenda • Introduction • Count-based word representation • Prediction-based word embedding

  3. Agenda • Introduction • Count-based word representation • Prediction-based word embedding

  4. Distributional Representation
     § An entity is represented with a vector of 𝑒 dimensions
     § Distributed representations
       - Each dimension (unit) is a feature of the entity
       - Units in a layer are not mutually exclusive
       - Two units can be "active" at the same time
     [Figure: an 𝑒-dimensional vector 𝒚 = (𝑦1, 𝑦2, 𝑦3, …, 𝑦𝑒)]

  5. Word Embedding Model
     § When vector representations are dense, they are often called embeddings, e.g. word embeddings
     [Figure: words 𝑤1, 𝑤2, …, 𝑤𝑂 mapped by a word embedding model to 𝑒-dimensional vectors]

  6. Word embeddings projected to a two-dimensional space

  7. Word Embeddings – Nearest neighbors
     frog:   frogs, toad, litoria, leptodactylidae, rana
     book:   books, foreword, author, published, preface
     asthma: bronchitis, allergy, allergies, arthritis, diabetes
     https://nlp.stanford.edu/projects/glove/
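
Nearest-neighbour lists like these can be produced by cosine similarity over an embedding matrix. A minimal sketch, assuming a toy vocabulary and random stand-in vectors rather than trained embeddings:

```python
import numpy as np

def nearest_neighbors(word, vocab, E, k=3):
    """Return the k words whose embeddings have the highest cosine similarity to `word`."""
    E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)   # unit-length rows
    sims = E_norm @ E_norm[vocab.index(word)]               # cosine with every word
    order = np.argsort(-sims)
    return [vocab[i] for i in order if vocab[i] != word][:k]

# Toy example: random vectors stand in for trained embeddings (e.g. GloVe).
vocab = ["frog", "frogs", "toad", "book", "books", "asthma"]
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), 50))        # one 50-dimensional vector per word
print(nearest_neighbors("frog", vocab, E))   # arbitrary here; with trained embeddings: frogs, toad, ...
```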

  8. Word Embeddings – Linear substructures
     § Analogy task:
       - man to woman is like king to ? (queen)
     𝒚_woman − 𝒚_man + 𝒚_king = 𝒚*
     𝒚* ≈ 𝒚_queen
     https://nlp.stanford.edu/projects/glove/
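
The analogy can be computed by vector arithmetic followed by a nearest-neighbour lookup. A minimal sketch, assuming toy hand-made 2-dimensional vectors rather than real GloVe embeddings:

```python
import numpy as np

def analogy(a, b, c, vocab, E):
    """Solve "a is to b as c is to ?" via y* = E[b] - E[a] + E[c], then nearest word by cosine."""
    y = E[vocab.index(b)] - E[vocab.index(a)] + E[vocab.index(c)]
    E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E_norm @ (y / np.linalg.norm(y))
    for i in np.argsort(-sims):              # best match, skipping the three input words
        if vocab[i] not in (a, b, c):
            return vocab[i]

# Toy 2-d vectors chosen so that the man->woman offset roughly matches king->queen.
vocab = ["man", "woman", "king", "queen"]
E = np.array([[1.0, 1.0], [1.0, 3.0], [5.0, 1.2], [5.0, 3.2]])
print(analogy("man", "woman", "king", vocab, E))   # -> queen
```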

  9. Agenda • Introduction • Count-based word representation • Prediction-based word embedding

  10. Intuition for Computational Semantics
     "You shall know a word by the company it keeps!"
     J. R. Firth, A synopsis of linguistic theory 1930–1955 (1957)

  11. [Figure: context words observed around Tesgüino, e.g. drink, alcoholic, beverage, out of corn, fermented, bottle, Mexico] Nida [1975]

  12. [Figure: context words observed around Ale, e.g. fermentation, bottle, grain, medieval, brew, pale, drink, alcoholic]

  13. Tesgüino ←→ Ale
     Algorithmic intuition: two words are related when they have common context words

  14. Word-Document Matrix – recap
     § 𝔼 is a set of documents (plays of Shakespeare): 𝔼 = [𝑒1, 𝑒2, …, 𝑒𝑁]
     § 𝕎 is the set of words (the vocabulary) in the dictionary: 𝕎 = [𝑤1, 𝑤2, …, 𝑤𝑂]
     § Words as rows and documents as columns; values: term count tc_{𝑤,𝑒}
     § Matrix size 𝑂×𝑁

                As You Like It (𝑒1)   Twelfth Night (𝑒2)   Julius Caesar (𝑒3)   Henry V (𝑒4)
     battle              1                    1                    8                15
     soldier             2                    2                   12                36
     fool               37                   58                    1                 5
     clown               6                  117                    0                 0
     ...                ...                  ...                  ...               ...

  15. Cosine
     § Cosine is the normalized dot product of two vectors
       - Its result is between −1 and +1
     cos(𝒚, 𝒛) = (𝒚 · 𝒛) / (‖𝒚‖ ‖𝒛‖) = Σ_i 𝑦_i 𝑧_i / (√(Σ_i 𝑦_i²) · √(Σ_i 𝑧_i²))
     § Example: 𝒚 = (1, 1, 0), 𝒛 = (4, 5, 6)
     cos(𝒚, 𝒛) = (1·4 + 1·5 + 0·6) / (√(1² + 1² + 0²) · √(4² + 5² + 6²)) = 9 / ~12.4 ≈ 0.73
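
A minimal sketch of this cosine computation in Python, reproducing the worked example (the function name is only illustrative):

```python
import numpy as np

def cosine(y, z):
    """Cosine similarity: normalized dot product, always in [-1, +1]."""
    return np.dot(y, z) / (np.linalg.norm(y) * np.linalg.norm(z))

y = np.array([1.0, 1.0, 0.0])
z = np.array([4.0, 5.0, 6.0])
print(cosine(y, z))   # 9 / (sqrt(2) * sqrt(77)) ≈ 0.73
```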

  16. Word-Document Matrix
                As You Like It (𝑒1)   Twelfth Night (𝑒2)   Julius Caesar (𝑒3)   Henry V (𝑒4)
     battle              1                    1                    8                15
     soldier             2                    2                   12                36
     fool               37                   58                    1                 5
     clown               6                  117                    0                 0
     ...                ...                  ...                  ...               ...
     § Similarity between two words: similarity(soldier, clown) = cos(𝒚_soldier, 𝒚_clown)
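
As a hedged illustration, the same cosine measure applied to rows of the word-document matrix above, with the counts copied from the slide:

```python
import numpy as np

# Rows of the word-document matrix (columns: the four Shakespeare plays).
counts = {
    "battle":  np.array([ 1,   1,  8, 15]),
    "soldier": np.array([ 2,   2, 12, 36]),
    "fool":    np.array([37,  58,  1,  5]),
    "clown":   np.array([ 6, 117,  0,  0]),
}

def cosine(y, z):
    return np.dot(y, z) / (np.linalg.norm(y) * np.linalg.norm(z))

print(cosine(counts["soldier"], counts["clown"]))  # low: soldier and clown occur in different plays
print(cosine(counts["fool"], counts["clown"]))     # high: fool and clown occur in the same plays
```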

  17. Context
     § Context can be
       - Document
       - Paragraph, tweet
       - Window of (2–10) context words on each side of the word
     § Word-Context matrix
       - Every context word is a unit (dimension): ℂ = [𝑑1, 𝑑2, …, 𝑑𝑀]
       - Matrix size: 𝑂×𝑀
       - Usually ℂ = 𝕎 and therefore 𝑀 = 𝑂

  18. Word-Context Matrix
     § Window context of 7 words:
       … sugar, a sliced lemon, a tablespoonful of apricot preserve or jam, a pinch each of …
       … their enjoyment. Cautiously she sampled her first pineapple and another fruit whose taste she likened …
       … well suited to programming on the digital computer. In finding the optimal R-stage policy from …
       … for the purpose of gathering data and information necessary for the study authorized in the …

                        aardvark (𝑑1)  computer (𝑑2)  data (𝑑3)  pinch (𝑑4)  result (𝑑5)  sugar (𝑑6)
     𝑤1 apricot               0              0            0           1            0           1
     𝑤2 pineapple             0              0            0           1            0           1
     𝑤3 digital               0              2            1           0            1           0
     𝑤4 information           0              1            6           0            4           0
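
A minimal sketch of how such a word-context count matrix can be collected from a tokenized corpus; the toy sentence and the window size below are placeholders, not the corpus behind the table above:

```python
from collections import Counter

def word_context_counts(tokens, window=4):
    """Count how often each context word appears within +/- window positions of each word."""
    counts = Counter()
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(w, tokens[j])] += 1
    return counts

tokens = "we make tesguino out of corn and everybody likes tesguino".split()
counts = word_context_counts(tokens, window=3)
print(counts[("tesguino", "corn")])   # -> 1: corn falls inside the window of the first tesguino
```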

  19. Word-to-Word Relations
     (word-context matrix as on the previous slide)
     § First-order co-occurrence relation
       - Each cell of the word-context matrix
       - Words that appear in the proximity of each other
       - Like drink to beer, and drink to wine
     § Second-order similarity relation
       - Cosine similarity between the representation vectors
       - Words that appear in similar contexts
       - Like beer to wine, tesgüino to ale, and frog to toad

  20. Pointwise Mutual Information
     § Problem with raw counting methods
       - Biased towards highly frequent words ("and", "the") although they don't carry much information
     § Pointwise Mutual Information (PMI)
       - Rooted in information theory
       - A better measure for the first-order relation in the word-context matrix, reflecting the informativeness of co-occurrences
       - Joint probability of two events (random variables) divided by the product of their marginal probabilities
     PMI(𝑌, 𝑍) = log [ 𝑞(𝑌, 𝑍) / (𝑞(𝑌) 𝑞(𝑍)) ]

  21. Pointwise Mutual Information
     PMI(𝑤, 𝑑) = log [ 𝑞(𝑤, 𝑑) / (𝑞(𝑤) 𝑞(𝑑)) ]
     𝑞(𝑤, 𝑑) = #(𝑤, 𝑑) / 𝑇
     𝑞(𝑤) = Σ_{j=1..𝑀} #(𝑤, 𝑑_j) / 𝑇
     𝑞(𝑑) = Σ_{i=1..𝑂} #(𝑤_i, 𝑑) / 𝑇
     𝑇 = Σ_{i=1..𝑂} Σ_{j=1..𝑀} #(𝑤_i, 𝑑_j)
     § Positive Pointwise Mutual Information (PPMI)
     PPMI(𝑤, 𝑑) = max(PMI(𝑤, 𝑑), 0)

  22. Pointwise Mutual Information
     (word-context matrix as on slide 18)
     𝑞(𝑤 = information, 𝑑 = data) = 6/19 ≈ .32
     𝑞(𝑤 = information) = 11/19 ≈ .58
     𝑞(𝑑 = data) = 7/19 ≈ .37
     PPMI(𝑤 = information, 𝑑 = data) = max(0, log(.32 / (.58 · .37))) ≈ .39 (natural logarithm)
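
A small sketch that computes the PPMI matrix for the counts above; it assumes the natural logarithm, which reproduces the ≈ .39 value for (information, data):

```python
import numpy as np

# Word-context counts from slide 18: rows apricot, pineapple, digital, information;
# columns aardvark, computer, data, pinch, result, sugar.
C = np.array([[0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 2, 1, 0, 1, 0],
              [0, 1, 6, 0, 4, 0]], dtype=float)

T = C.sum()                              # total number of co-occurrences (19)
q_wd = C / T                             # joint probabilities q(w, d)
q_w = C.sum(axis=1, keepdims=True) / T   # marginals q(w)
q_d = C.sum(axis=0, keepdims=True) / T   # marginals q(d)

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(q_wd / (q_w * q_d))     # natural log, as the slide's numbers suggest
ppmi = np.where(np.isfinite(pmi), np.maximum(pmi, 0), 0.0)   # clip negatives and undefined cells

print(round(ppmi[3, 2], 2))   # PPMI(information, data) ≈ 0.39
```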

  23. Singular Value Decomposition – Recap
     § An 𝑂×𝑀 matrix 𝒀 can be factorized into three matrices: 𝒀 = 𝑽 𝜯 𝑾ᵀ
     § 𝑽 (left singular vectors) is an 𝑂×𝑀 unitary matrix
     § 𝜯 is an 𝑀×𝑀 diagonal matrix; its diagonal entries
       - are the singular values,
       - show the importance of the corresponding 𝑀 dimensions in 𝒀,
       - are all non-negative and sorted from large to small
     § 𝑾ᵀ (right singular vectors) is an 𝑀×𝑀 unitary matrix
     * The definition of SVD is simplified. Refer to https://en.wikipedia.org/wiki/Singular_value_decomposition for the exact definition
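
A hedged sketch of this factorization with numpy; the toy matrix merely stands in for 𝒀:

```python
import numpy as np

# A toy 4x3 matrix standing in for Y (e.g. a small PPMI word-context matrix).
Y = np.array([[2., 0., 1.],
              [0., 3., 1.],
              [1., 1., 0.],
              [0., 2., 2.]])

V, tau, Wt = np.linalg.svd(Y, full_matrices=False)   # Y = V @ diag(tau) @ Wt
print(tau)                                           # singular values, sorted from large to small
print(np.allclose(Y, V @ np.diag(tau) @ Wt))         # True: the factorization reconstructs Y
```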

  24. Singular Value Decomposition – Recap
     [Figure: 𝒀 (original matrix, 𝑂×𝑀) = 𝑽 (left singular vectors, 𝑂×𝑀) · 𝜯 (singular values, 𝑀×𝑀) · 𝑾ᵀ (right singular vectors, 𝑀×𝑀)]

  25. Applying SVD to Word-Context Matrix
     § Step 1: create a sparse PPMI matrix of size 𝑂×𝑀 and apply SVD
     [Figure: 𝒀 (sparse word-context matrix, 𝑂×𝑀) = 𝑽 (word vectors, 𝑂×𝑀) · 𝜯 (singular values, 𝑀×𝑀) · 𝑾ᵀ (context vectors, 𝑀×𝑀)]

  26. Applying SVD to Term-Context Matrix
     § Step 2: keep only the top 𝑒 singular values in 𝜯 and set the rest to zero
     § Truncate the 𝑽 and 𝑾ᵀ matrices accordingly, yielding 𝑽̃ and 𝑾̃ᵀ

  27. Applying SVD to Term-Context Matrix
     [Figure: truncated SVD – 𝑽̃ (truncated word vectors, 𝑂×𝑒) · 𝜯̃ (top 𝑒 singular values) · 𝑾̃ᵀ (truncated context vectors)]
     § The 𝑽̃ matrix contains the dense, low-dimensional word vectors
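
A minimal sketch of steps 1–2 combined: truncated SVD over a toy PPMI matrix, keeping only the top 𝑒 dimensions as dense word vectors (the values and 𝑒 = 2 are placeholders):

```python
import numpy as np

def svd_word_vectors(ppmi, e=2):
    """Truncated SVD: keep the top-e dimensions; rows of the truncated V are the word vectors."""
    V, tau, Wt = np.linalg.svd(ppmi, full_matrices=False)
    return V[:, :e]          # some implementations additionally scale the columns by tau[:e]

# Toy PPMI word-context matrix (O x M); in practice this comes from step 1.
ppmi = np.array([[0., 0., 0., 1.1, 0., 1.1],
                 [0., 0., 0., 1.1, 0., 1.1],
                 [0., 1.4, 0.2, 0., 0.9, 0.],
                 [0., 0.4, 0.8, 0., 1.0, 0.]])
W = svd_word_vectors(ppmi, e=2)
print(W.shape)    # (4, 2): one dense 2-dimensional vector per word
```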

  28. Agenda • Introduction • Count-based word representation • Prediction-based word embedding

  29. Word Embedding with Neural Networks
     Recipe for creating (dense) word embeddings with neural networks
     § Design a neural network architecture!
     § Loop over the training data (𝑤, 𝑑) for some epochs
       - Pass the word 𝑤 as input and execute the forward pass
       - Calculate the probability of observing the context word 𝑑 at the output: 𝑞(𝑑|𝑤)
       - Optimize the network to maximize this likelihood
     Details come next!

  30. Training Data
     § Window of size 2
     [Figure: (word, context) training pairs extracted from a sentence with a window of size 2]
     http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
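
A minimal sketch of how such (word, context) training pairs can be generated; the example sentence is only illustrative:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (word, context) training pairs: each word paired with its neighbours."""
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((w, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps over the lazy dog".split()
print(skipgram_pairs(tokens, window=2)[:4])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```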

  31. Neural embeddings – architecture
     § Train sample: (Tesgüino, drink)
     § Input layer: one-hot encoding of the word (1×𝑂) → encoder embedding 𝑽 (𝑂×𝑒) → hidden layer (1×𝑒, linear activation) → decoder embedding 𝑭 (𝑒×𝑂) → output layer (softmax, 1×𝑂)
     § Forward pass computes 𝑞(drink|Tesgüino); backpropagation updates 𝑽 and 𝑭
     https://web.stanford.edu/~jurafsky/slp3/
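
A minimal numpy sketch of this forward pass and update, reusing the slide's names 𝑽 (encoder embedding) and 𝑭 (decoder embedding); the vocabulary, dimension 𝑒, and learning rate are made-up placeholders, and the efficiency tricks of real word2vec training are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["tesguino", "ale", "drink", "corn", "bottle"]
O, e = len(vocab), 3                       # vocabulary size and embedding dimension
V = rng.normal(scale=0.1, size=(O, e))     # encoder embeddings (one row per word)
F = rng.normal(scale=0.1, size=(e, O))     # decoder embeddings

def softmax(x):
    x = x - x.max()
    return np.exp(x) / np.exp(x).sum()

def train_step(w, d, lr=0.1):
    """One step on the pair (w, d): raise q(d | w) by a gradient step on the log-likelihood."""
    global V, F
    i, j = vocab.index(w), vocab.index(d)
    h = V[i]                               # hidden layer: the embedding of w (linear activation)
    q = softmax(h @ F)                     # output layer: probability of every context word
    err = q.copy(); err[j] -= 1.0          # gradient of -log q(d | w) w.r.t. the output scores
    grad_h = F @ err                       # gradient w.r.t. the hidden layer
    F -= lr * np.outer(h, err)             # update decoder embeddings
    V[i] -= lr * grad_h                    # update the encoder embedding of w
    return q[j]

for _ in range(50):
    p = train_step("tesguino", "drink")
print(round(p, 3))                         # q(drink | tesguino) grows towards 1 on this toy pair
```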

  32.–36. [Figure, built up over several slides: the embedding vectors of Tesgüino and Ale and the decoding vector of drink in the embedding space]

  37. - Train sample: (Tesgüino, drink)
     - Update vectors to maximize 𝑞(drink|Tesgüino)
     [Figure: embedding vectors of Tesgüino and Ale, decoding vector of drink]
