Natural Language Processing with Deep Learning Word Embeddings Navid Rekab-Saz navid.rekabsaz@jku.at Institute of Computational Perception
Agenda • Introduction • Count-based word representation • Prediction-based word embedding
Distributional Representation
§ An entity is represented with a vector of 𝑒 dimensions
§ Distributed representations:
  - Each dimension (unit) is a feature of the entity
  - Units in a layer are not mutually exclusive
  - Two units can be "active" at the same time
[Figure: a vector 𝒚 with dimensions 𝑦_1, 𝑦_2, …, 𝑦_𝑒]
Word Embedding Model
[Figure: a word embedding model maps each word 𝑤_1, 𝑤_2, …, 𝑤_𝑂 to a vector of 𝑒 dimensions]
§ When vector representations are dense, they are often called embeddings, e.g. word embeddings
Word embeddings projected to a two-dimensional space
Word Embeddings – Nearest neighbors

  frog             book       asthma
  ----             ----       ------
  frogs            books      bronchitis
  toad             foreword   allergy
  litoria          author     allergies
  leptodactylidae  published  arthritis
  rana             preface    diabetes

https://nlp.stanford.edu/projects/glove/
Word Embeddings – Linear substructures
§ Analogy task:
  - man is to woman as king is to ? (queen)

  𝒚_woman − 𝒚_man + 𝒚_king = 𝒚*
  𝒚* ≈ 𝒚_queen

https://nlp.stanford.edu/projects/glove/
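As a toy illustration of this vector arithmetic, here is a minimal sketch; the 4-dimensional vectors are made up for illustration, not trained embeddings. It finds the word whose vector is closest to 𝒚_woman − 𝒚_man + 𝒚_king by cosine similarity.

```python
import numpy as np

# Hypothetical toy embeddings (real ones would come from a trained model such as GloVe).
emb = {
    "man":   np.array([0.2, 0.9, 0.1, 0.4]),
    "woman": np.array([0.2, 0.9, 0.8, 0.4]),
    "king":  np.array([0.9, 0.1, 0.1, 0.6]),
    "queen": np.array([0.9, 0.1, 0.8, 0.6]),
    "apple": np.array([0.1, 0.2, 0.3, 0.9]),
}

def cosine(y, z):
    return y @ z / (np.linalg.norm(y) * np.linalg.norm(z))

# y* = y_woman - y_man + y_king; search for the nearest word, excluding the query words
target = emb["woman"] - emb["man"] + emb["king"]
candidates = {w: cosine(target, v) for w, v in emb.items() if w not in {"man", "woman", "king"}}
print(max(candidates, key=candidates.get))  # -> "queen" with these toy vectors
```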
Agenda • Introduction • Count-based word representation • Prediction-based word embedding
Intuition for Computational Semantics
“You shall know a word by the company it keeps!”
— J. R. Firth, A synopsis of linguistic theory 1930–1955 (1957)
[Figure: context words observed around “Tesgüino”: drink, alcoholic, beverage, out of corn, fermented, bottle, Mexico — example from Nida (1975)]
[Figure: context words observed around “Ale”: drink, alcoholic, fermentation, bottle, grain, medieval, brew, pale, bar]
Tesgüino ←→ Ale
Algorithmic intuition: two words are related when they share common context words
Word-Document Matrix – recap
§ 𝔼 is a set of documents (plays of Shakespeare): 𝔼 = [𝑒_1, 𝑒_2, …, 𝑒_𝑁]
§ 𝕎 is the set of words (the vocabulary) in the dictionary: 𝕎 = [𝑤_1, 𝑤_2, …, 𝑤_𝑂]
§ Words as rows, documents as columns; values are term counts tc_{𝑤,𝑒}
§ Matrix size: 𝑂×𝑁

            𝑒1               𝑒2             𝑒3             𝑒4
            As You Like It   Twelfth Night  Julius Caesar  Henry V
  battle          1                1              8           15
  soldier         2                2             12           36
  fool           37               58              1            5
  clown           6              117              0            0
  ...           ...              ...            ...          ...
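A minimal sketch of how such a word-document count matrix could be assembled; the documents here are short made-up strings standing in for the plays, so the numbers do not match the table.

```python
from collections import Counter

# Toy corpus: each "document" is just a short string (stand-ins for the plays).
docs = {
    "e1": "battle soldier fool clown clown fool",
    "e2": "fool clown fool soldier",
    "e3": "battle battle soldier soldier",
    "e4": "battle soldier soldier soldier",
}

# Term counts per document
counts = {name: Counter(text.split()) for name, text in docs.items()}

# Word-document matrix: rows = words, columns = documents, values = term counts tc_{w,e}
vocab = sorted({w for c in counts.values() for w in c})
matrix = {w: [counts[name][w] for name in docs] for w in vocab}
for w, row in matrix.items():
    print(f"{w:<10}", row)
```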
Cosine
§ Cosine is the normalized dot product of two vectors
  - Its result is between -1 and +1

  cos(𝒚, 𝒛) = (𝒚 · 𝒛) / (‖𝒚‖ ‖𝒛‖) = Σ_i 𝑦_i 𝑧_i / ( √(Σ_i 𝑦_i²) · √(Σ_i 𝑧_i²) )

§ Example: 𝒚 = [1, 1, 0], 𝒛 = [4, 5, 6]

  cos(𝒚, 𝒛) = (1·4 + 1·5 + 0·6) / ( √(1² + 1² + 0²) · √(4² + 5² + 6²) ) = 9 / ~12.4 ≈ 0.73
Word-Document Matrix

            𝑒1               𝑒2             𝑒3             𝑒4
            As You Like It   Twelfth Night  Julius Caesar  Henry V
  battle          1                1              8           15
  soldier         2                2             12           36
  fool           37               58              1            5
  clown           6              117              0            0
  ...           ...              ...            ...          ...

§ Similarity between two words: similarity(soldier, clown) = cos(𝒚_soldier, 𝒚_clown)
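A small sketch of the cosine computation, assuming numpy; it checks the worked example from the previous slide and compares rows of the word-document table above.

```python
import numpy as np

def cosine(y, z):
    # Normalized dot product: between -1 and +1 (here >= 0, since counts are non-negative)
    return y @ z / (np.linalg.norm(y) * np.linalg.norm(z))

# Worked example from the previous slide
print(cosine(np.array([1, 1, 0]), np.array([4, 5, 6])))   # ~0.73

# Rows of the word-document matrix (As You Like It, Twelfth Night, Julius Caesar, Henry V)
soldier = np.array([2, 2, 12, 36])
clown   = np.array([6, 117, 0, 0])
fool    = np.array([37, 58, 1, 5])
print(cosine(soldier, clown))   # low: soldier and clown appear in different plays
print(cosine(fool, clown))      # high: fool and clown occur in the same plays
```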
Context
§ Context can be
  - a document
  - a paragraph, a tweet
  - a window of (2-10) context words on each side of the word
§ Word-Context matrix
  - Every word as a unit (dimension): ℂ = [𝑑_1, 𝑑_2, …, 𝑑_𝑀]
  - Matrix size: 𝑂×𝑀
  - Usually ℂ = 𝕎 and therefore 𝑀 = 𝑂
Word-Context Matrix
§ Window context of 7 words:

  sugar, a sliced lemon, a tablespoonful of [apricot] preserve or jam, a pinch each of,
  their enjoyment. Cautiously she sampled her first [pineapple] and another fruit whose taste she likened
  well suited to programming on the [digital] computer. In finding the optimal R-stage policy from
  for the purpose of gathering data and [information] necessary for the study authorized in the

                  𝑑1        𝑑2        𝑑3    𝑑4     𝑑5      𝑑6
                  aardvark  computer  data  pinch  result  sugar
  𝑤1 apricot          0         0       0     1      0       1
  𝑤2 pineapple        0         0       0     1      0       1
  𝑤3 digital          0         2       1     0      1       0
  𝑤4 information      0         1       6     0      4       0
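A sketch of how window-based co-occurrence counts could be collected; the sentences and window size here are toy choices, while the table above comes from real text.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each context word appears within `window` words of a target word."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, target in enumerate(tokens):
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[target][tokens[j]] += 1
    return counts

sentences = [
    "a tablespoonful of apricot preserve or jam a pinch each of sugar",
    "she sampled her first pineapple with a pinch of sugar",
    "programming on the digital computer gives a result",
]
counts = cooccurrence_counts(sentences, window=2)
print(dict(counts["apricot"]))
print(dict(counts["pinch"]))
```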
Word-to-Word Relations

                  𝑑1        𝑑2        𝑑3    𝑑4     𝑑5      𝑑6
                  aardvark  computer  data  pinch  result  sugar
  𝑤1 apricot          0         0       0     1      0       1
  𝑤2 pineapple        0         0       0     1      0       1
  𝑤3 digital          0         2       1     0      1       0
  𝑤4 information      0         1       6     0      4       0

§ First-order co-occurrence relation
  - Each cell of the word-context matrix
  - Words that appear in the proximity of each other
  - e.g. drink and beer, or drink and wine
§ Second-order similarity relation
  - Cosine similarity between the representation vectors
  - Words that appear in similar contexts
  - e.g. beer and wine, Tesgüino and Ale, frog and toad
Pointwise Mutual Information
§ Problem with raw counting methods
  - Biased towards highly frequent words (“and”, “the”), although these don't carry much information
§ Pointwise Mutual Information (PMI)
  - Rooted in information theory
  - A better measure of the first-order relation in the word-context matrix, as it reflects the informativeness of co-occurrences
  - Joint probability of two events (random variables) divided by the product of their marginal probabilities:

  PMI(𝑌, 𝑍) = log [ 𝑞(𝑌, 𝑍) / (𝑞(𝑌) · 𝑞(𝑍)) ]
Pointwise Mutual Information

  PMI(𝑤, 𝑑) = log [ 𝑞(𝑤, 𝑑) / (𝑞(𝑤) · 𝑞(𝑑)) ]

  𝑞(𝑤, 𝑑) = #(𝑤, 𝑑) / 𝑇
  𝑞(𝑤) = Σ_{j=1..𝑀} #(𝑤, 𝑑_j) / 𝑇
  𝑞(𝑑) = Σ_{i=1..𝑂} #(𝑤_i, 𝑑) / 𝑇
  𝑇 = Σ_{i=1..𝑂} Σ_{j=1..𝑀} #(𝑤_i, 𝑑_j)

§ Positive Pointwise Mutual Information (PPMI)

  PPMI(𝑤, 𝑑) = max(PMI(𝑤, 𝑑), 0)
Pointwise Mutual Information

                  𝑑1        𝑑2        𝑑3    𝑑4     𝑑5      𝑑6
                  aardvark  computer  data  pinch  result  sugar
  𝑤1 apricot          0         0       0     1      0       1
  𝑤2 pineapple        0         0       0     1      0       1
  𝑤3 digital          0         2       1     0      1       0
  𝑤4 information      0         1       6     0      4       0

  𝑞(𝑤 = information, 𝑑 = data) = 6/19 = .32
  𝑞(𝑤 = information) = 11/19 = .58
  𝑞(𝑑 = data) = 7/19 = .37

  PPMI(𝑤 = information, 𝑑 = data) = max(0, log(.32 / (.58 · .37))) = .39
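A sketch reproducing this PPMI calculation from the count table with numpy (natural log, which matches the .39 result; cells with zero counts are simply clipped to 0).

```python
import numpy as np

# Word-context counts from the table: rows = apricot, pineapple, digital, information
# columns = aardvark, computer, data, pinch, result, sugar
counts = np.array([
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 2, 1, 0, 1, 0],
    [0, 1, 6, 0, 4, 0],
], dtype=float)

T = counts.sum()                              # total number of co-occurrences (19)
q_wd = counts / T                             # joint probabilities q(w, d)
q_w = counts.sum(axis=1, keepdims=True) / T   # marginals q(w)
q_d = counts.sum(axis=0, keepdims=True) / T   # marginals q(d)

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(q_wd / (q_w * q_d))          # -inf / nan where counts are zero
ppmi = np.where(np.isfinite(pmi), np.maximum(pmi, 0), 0.0)

# PPMI(information, data): row 3, column 2 -> ~0.39
print(round(ppmi[3, 2], 2))
```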
Singular Value Decomposition – Recap
§ An 𝑂×𝑀 matrix 𝒀 can be factorized into three matrices: 𝒀 = 𝑽 𝜯 𝑾^⊤
§ 𝑽 (left singular vectors) is an 𝑂×𝑀 unitary matrix
§ 𝜯 is an 𝑀×𝑀 diagonal matrix; its diagonal entries
  - are the singular values,
  - show the importance of the corresponding 𝑀 dimensions in 𝒀,
  - are all non-negative and sorted from large to small
§ 𝑾^⊤ (right singular vectors) is an 𝑀×𝑀 unitary matrix
* The definition of SVD is simplified. Refer to https://en.wikipedia.org/wiki/Singular_value_decomposition for the exact definition
Singular Value Decomposition – Recap
[Figure: 𝒀 (original matrix, 𝑂×𝑀) = 𝑽 (left singular vectors, 𝑂×𝑀) · 𝜯 (singular values, 𝑀×𝑀) · 𝑾^⊤ (right singular vectors, 𝑀×𝑀)]
Applying SVD to Word-Context Matrix
§ Step 1: create a sparse PPMI matrix of size 𝑂×𝑀 and apply SVD
[Figure: the (sparse) word-context matrix 𝒀 (words × contexts, 𝑂×𝑀) = 𝑽 (word vectors, 𝑂×𝑀) · 𝜯 (singular values, 𝑀×𝑀) · 𝑾^⊤ (context vectors, 𝑀×𝑀)]
Applying SVD to Term-Context Matrix
§ Step 2: keep only the top 𝑒 singular values in 𝜯 and set the rest to zero
§ Truncate the 𝑽 and 𝑾^⊤ matrices accordingly, yielding 𝑽̃ and 𝑾̃^⊤
Applying SVD to Term-Context Matrix
[Figure: truncation keeps the first 𝑒 columns of 𝑽 (truncated word vectors 𝑽̃, 𝑂×𝑒), the top 𝑒×𝑒 block of 𝜯 (truncated singular values 𝜯̃), and the first 𝑒 rows of 𝑾^⊤ (truncated context vectors 𝑾̃^⊤)]
§ The 𝑽̃ matrix contains the dense, low-dimensional word vectors
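A sketch of the two SVD steps with numpy; a small random non-negative matrix stands in for the PPMI word-context matrix, and the variable names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
O, M, e = 8, 6, 2                      # vocabulary size, number of contexts, embedding size
Y = rng.random((O, M))                 # stand-in for the O x M (P)PMI word-context matrix

# Step 1: SVD, Y = V @ diag(T) @ Wt
V, T, Wt = np.linalg.svd(Y, full_matrices=False)   # V: O x M, T: M singular values, Wt: M x M

# Step 2: keep only the top-e singular values and truncate the factors accordingly
V_trunc = V[:, :e]            # O x e  -> dense low-dimensional word vectors
T_trunc = T[:e]               # e
Wt_trunc = Wt[:e, :]          # e x M  -> truncated context vectors

word_vectors = V_trunc        # each row is the e-dimensional vector of one word
print(word_vectors.shape)     # (8, 2)
```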
Agenda • Introduction • Count-based word representation • Prediction-based word embedding
Word Embedding with Neural Networks
Recipe for creating (dense) word embeddings with neural networks:
§ Design a neural network architecture!
§ Loop over the training data (𝑤, 𝑑) for some epochs
  - Pass the word 𝑤 as input and run the forward pass
  - Calculate the probability of observing the context word 𝑑 at the output: 𝑞(𝑑|𝑤)
  - Optimize the network to maximize this likelihood
Details come next!
Training Data
[Figure: generating (word, context) training pairs from a sentence with a window of size 2]
Source: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
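A sketch of how such (word, context) training pairs could be generated with a window of size 2, in the spirit of the figure; the example sentence is a stock one.

```python
def training_pairs(tokens, window=2):
    """Yield (center word, context word) pairs within the given window."""
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

sentence = "the quick brown fox jumps over the lazy dog".split()
for pair in training_pairs(sentence, window=2):
    print(pair)
# ('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ...
```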
Neural embeddings – architecture
Training sample: (Tesgüino, drink)
[Figure: input layer (one-hot encoding, 1×𝑂) → encoder embedding 𝑽 (𝑂×𝑒, linear activation) → hidden vector (1×𝑒) → decoder embedding 𝑭 (𝑒×𝑂) → output layer with softmax (1×𝑂), giving 𝑞(drink|Tesgüino); the forward pass computes this probability, and backpropagation updates the embeddings]
Source: https://web.stanford.edu/~jurafsky/slp3/
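A minimal numpy sketch of this architecture under the naming above (𝑽 as encoder embedding, 𝑭 as decoder embedding); the vocabulary, dimensions, and initial values are made up. Since a one-hot input just selects one row of 𝑽, the sketch indexes that row directly.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["tesgüino", "ale", "drink", "corn", "book"]
w2i = {w: i for i, w in enumerate(vocab)}
O, e = len(vocab), 3                     # vocabulary size and embedding size

V = rng.normal(scale=0.1, size=(O, e))   # encoder embedding matrix (O x e)
F = rng.normal(scale=0.1, size=(e, O))   # decoder embedding matrix (e x O)

def softmax(x):
    ex = np.exp(x - x.max())
    return ex / ex.sum()

def forward(word):
    h = V[w2i[word]]          # the one-hot input selects this row of V (1 x e)
    scores = h @ F            # linear activation: a score for every output word (1 x O)
    return softmax(scores)    # q( . | word)

q = forward("tesgüino")
print(q[w2i["drink"]])        # q(drink | tesgüino), roughly 1/O before training
```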
[Figure sequence: the embedding vectors of Ale and Tesgüino and the decoding vector of drink, shown in the embedding space over successive training steps]
§ Training sample: (Tesgüino, drink)
§ Update the vectors to maximize 𝑞(drink|Tesgüino)
[Figure: the update pulls the embedding vector of Tesgüino and the decoding vector of drink toward each other]
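A sketch of this update as one gradient step on the cross-entropy loss −log 𝑞(drink|Tesgüino), repeating the setup of the previous sketch so it runs on its own; the learning rate and number of steps are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["tesgüino", "ale", "drink", "corn", "book"]
w2i = {w: i for i, w in enumerate(vocab)}
O, e, lr = len(vocab), 3, 0.5

V = rng.normal(scale=0.1, size=(O, e))   # encoder embeddings (one row per word)
F = rng.normal(scale=0.1, size=(e, O))   # decoder embeddings (one column per word)

def softmax(x):
    ex = np.exp(x - x.max())
    return ex / ex.sum()

def train_step(word, context):
    """One gradient step that increases q(context | word), i.e. decreases -log q."""
    w, c = w2i[word], w2i[context]
    h = V[w].copy()                 # embedding of the input word
    q = softmax(h @ F)              # predicted distribution over context words
    grad_scores = q.copy()          # gradient of the cross-entropy loss w.r.t. the scores
    grad_scores[c] -= 1.0           # ... equals q - one_hot(context)
    V[w] -= lr * (F @ grad_scores)          # move the embedding vector of `word`
    F -= lr * np.outer(h, grad_scores)      # move the decoding vectors
    return q[c]

for _ in range(50):
    p = train_step("tesgüino", "drink")
print(p)   # q(drink | tesgüino) approaches 1 as the two vectors are pulled together
```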