IN4080 – 2020 Fall, Natural Language Processing. Jan Tore Lønning
Neural networks, Language models, word2vec. Lecture 6, 21 Sept
Today: Neural networks, Language models, Word embeddings, Word2vec
Artificial neural networks
Inspired by the brain (neurons, synapses), but they do not pretend to be a model of the brain. The simplest model is the feed-forward network, also called the Multi-layer Perceptron (MLP).
Linear regression as a network
Each feature $x_j$ of the input vector is an input node, plus an additional bias node $x_0 = 1$ for the intercept. There is a weight $w_j$ at each edge. Multiply the input values with the respective weights and sum them: $\hat{y} = \sum_{j=0}^{m} w_j x_j = \mathbf{w} \cdot \mathbf{x}$.
[Figure: network with a bias node 1, input nodes $x_1, x_2, x_3$, weights $w_0, \ldots, w_3$ on the edges, a summation node $\Sigma$ producing the prediction $\hat{y}$, and the target value $y$.]
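A minimal NumPy sketch of this prediction step, assuming the bias node is handled by prepending a constant 1 to the feature vector; the weights and inputs below are illustrative values, not from the slides.

```python
import numpy as np

def predict(w, x):
    """w: weights of length m+1 (w[0] is the intercept); x: feature vector of length m."""
    x_with_bias = np.concatenate(([1.0], x))   # bias node x_0 = 1
    return np.dot(w, x_with_bias)              # y_hat = sum_j w_j * x_j = w . x

w = np.array([0.5, 2.0, -1.0, 0.3])
x = np.array([1.2, 0.7, 3.0])
print(predict(w, x))                           # 0.5 + 2.4 - 0.7 + 0.9 = 3.1
```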
Gradient descent (for linear regression)
We start with an initial set of weights, consider training examples, and adjust the weights to reduce the loss. How? Gradient descent. Gradient means partial derivatives.
Linear regression: higher dimensions
Linear regression with more than two variables works similarly: we try to fit the best (hyper-)plane, $\hat{y} = f(x_0, x_1, \ldots, x_m) = \sum_{j=0}^{m} w_j x_j = \mathbf{w} \cdot \mathbf{x}$. We can use the same mean square loss: $\frac{1}{n} \sum_{k=1}^{n} (y_k - \hat{y}_k)^2$.
Partial derivatives
For a function of more than one variable, e.g. $f(x, y)$, the partial derivative, e.g. $\frac{\partial f}{\partial x}$, is the derivative one gets by keeping the other variables constant. E.g. if $f(x, y) = ax + by + c$, then $\frac{\partial f}{\partial x} = a$ and $\frac{\partial f}{\partial y} = b$.
https://www.wikihow.com/Image:OyXsh.png
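A small numeric illustration of this idea: estimating the partial derivatives of $f(x, y) = ax + by + c$ with finite differences. The constants and the finite-difference helper are illustrative choices, not part of the slides.

```python
# Nudge one variable while holding the other fixed; f changes at rate a (or b).
a, b, c = 2.0, -3.0, 1.0
f = lambda x, y: a * x + b * y + c

x0, y0, h = 1.5, 4.0, 1e-6
df_dx = (f(x0 + h, y0) - f(x0, y0)) / h   # finite-difference estimate of df/dx
df_dy = (f(x0, y0 + h) - f(x0, y0)) / h   # finite-difference estimate of df/dy
print(df_dx, df_dy)                       # approximately a = 2.0 and b = -3.0
```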
Gradient descent
We move in the opposite direction of where the gradient is pointing. Intuitively: take small steps in the directions parallel to the (feature) axes, where the length of each step is proportional to the steepness in that direction.
Properties of the derivatives
1. If $f(x) = ax + b$ then $f'(x) = a$; we also write $\frac{df}{dx} = a$, and if $y = f(x)$, we can write $\frac{dy}{dx} = a$.
2. If $f(x) = x^n$ for an integer $n \ne 0$ then $f'(x) = n x^{(n-1)}$.
3. If $f(x) = g(z)$ where $z = h(x)$, then $f'(x) = g'(z)\, h'(x)$; with $y = f(x) = g(z)$, this can be written $\frac{dy}{dx} = \frac{dy}{dz} \frac{dz}{dx}$.
In particular, if $f(x) = (ax + b)^2$ then $f'(x) = 2(ax + b)\, a$.
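A quick numeric check of the last example, comparing a finite-difference slope with the chain-rule result $2(ax + b)a$; the constants are arbitrary.

```python
a, b = 3.0, -1.0
f = lambda x: (a * x + b) ** 2            # f(x) = (a*x + b)^2

x0, h = 0.7, 1e-6
numeric = (f(x0 + h) - f(x0)) / h         # slope estimated directly
analytic = 2 * (a * x0 + b) * a           # slope from the chain rule
print(numeric, analytic)                  # both approximately 6.6
```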
Gradient descent (for linear regression)
Loss: mean squared error, $L(\hat{\mathbf{y}}, \mathbf{y}) = \frac{1}{n} \sum_{k=1}^{n} (\hat{y}_k - y_k)^2$, where $\hat{y}_k = \sum_{j=0}^{m} w_j x_{k,j} = \mathbf{w} \cdot \mathbf{x}_k$.
We will update the $w_j$-s, so consider the partial derivatives w.r.t. the $w_j$-s: $\frac{\partial}{\partial w_j} L(\hat{\mathbf{y}}, \mathbf{y}) = \frac{1}{n} \sum_{k=1}^{n} 2(\hat{y}_k - y_k)\, x_{k,j}$.
Update $w_j$: $w_j = w_j - \eta\, \frac{\partial}{\partial w_j} L(\hat{\mathbf{y}}, \mathbf{y})$.
Here $n$ is the number of observations ($1 \le k \le n$) and $m$ is the number of features for each observation ($0 \le j \le m$).
Inspecting the update
$w_j = w_j - \eta\, \frac{1}{n} \sum_{k=1}^{n} 2(\hat{y}_k - y_k)\, x_{k,j}$
The factor $(\hat{y}_k - y_k)$ is the error term (delta term) of this prediction, from the loss function; multiplied by the input value $x_{k,j}$ it gives the contribution to the error from this weight. $\eta$ is the learning rate.
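A compact NumPy sketch of batch gradient descent for linear regression using exactly this update; the synthetic data, learning rate, and epoch count are arbitrary choices for illustration.

```python
import numpy as np

def fit_linear_regression(X, y, eta=0.01, epochs=1000):
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the bias column x_0 = 1
    n, m = X.shape
    w = np.zeros(m)                                 # initial weights
    for _ in range(epochs):
        y_hat = X @ w                               # predictions for all n observations
        grad = (2.0 / n) * X.T @ (y_hat - y)        # partial derivatives of the MSE
        w -= eta * grad                             # the update from the slide
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)
print(fit_linear_regression(X, y))                  # roughly [1.0, 3.0, -2.0]
```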
Logistic regression as a network
$z = \sum_{j=0}^{m} w_j x_j = \mathbf{w} \cdot \mathbf{x}$, and $\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$.
Loss (cross-entropy): $L_{CE} = -\sum_{k=1}^{n} \log\big(\hat{y}_k^{\,y_k} (1 - \hat{y}_k)^{1 - y_k}\big)$.
To simplify, consider only one observation and use the chain rule:
$\frac{\partial}{\partial w_j} L_{CE} = \frac{\partial L_{CE}}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial z} \times \frac{\partial z}{\partial w_j}$
where $\frac{\partial L_{CE}}{\partial \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}$, $\frac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y})$, and $\frac{\partial z}{\partial w_j} = x_j$, so that
$\frac{\partial}{\partial w_j} L_{CE} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}\, \hat{y}(1 - \hat{y})\, x_j = (\hat{y} - y)\, x_j$
Logistic regression as a network
$\frac{\partial}{\partial w_j} L_{CE} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}\, \hat{y}(1 - \hat{y})\, x_j = (\hat{y} - y)\, x_j$
The first factor comes from the loss, the second from the activation function, and the third, $x_j$, from the node at the start of the edge. The product of the first two is the delta term at the end of the edge; multiplied by $x_j$ it gives the contribution to the error from this weight.
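A sketch of one stochastic gradient-descent step for logistic regression using the gradient $(\hat{y} - y)x_j$ derived above; the weights, input, and learning rate are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(w, x, y, eta=0.1):
    """One SGD step; x[0] is the bias node (= 1), y is the 0/1 target."""
    y_hat = sigmoid(np.dot(w, x))     # forward pass
    grad = (y_hat - y) * x            # (y_hat - y) * x_j for every weight
    return w - eta * grad             # move against the gradient

w = np.zeros(4)
x = np.array([1.0, 0.5, -1.2, 2.0])   # bias node first
print(sgd_step(w, x, y=1))
```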
Feed forward network
An input layer; an output layer (the predictions); one or more hidden layers; and connections from one layer to the next (from left to right).
The hidden nodes
Each hidden node is like a small logistic regression: first a sum of weighted inputs, $z = \sum_{j=0}^{m} w_j x_j = \mathbf{w} \cdot \mathbf{x}$, then the result is run through an activation function, e.g. $\sigma(z) = \frac{1}{1 + e^{-\mathbf{w} \cdot \mathbf{x}}}$.
It is the non-linearity of the activation function which makes it possible for the MLP to predict non-linear decision boundaries.
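A sketch of one hidden layer: every hidden node computes a weighted sum of the same inputs and passes it through the activation function. The weight matrix and input vector below are made-up values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.5, -1.2, 2.0])       # input vector, bias node first
W = np.array([[0.1, -0.3, 0.8,  0.0],     # one row of weights per hidden node
              [0.5,  0.2, 0.1, -0.4]])
z = W @ x                                 # weighted sum of inputs at each hidden node
h = sigmoid(z)                            # non-linear activation
print(h)                                  # the values the hidden nodes pass on
```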
The output layer
Alternatives: Regression: one node, no activation function. Binary classifier: one node, logistic activation function. Multinomial classifier: several nodes, softmax. (Plus more alternatives.) The choice of loss function depends on the task.
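For the multinomial case, a minimal softmax sketch: it turns the output-layer scores into a probability distribution over the classes. The scores are arbitrary.

```python
import numpy as np

def softmax(scores):
    exps = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return exps / exps.sum()

scores = np.array([2.0, 0.5, -1.0])          # one score per output node
print(softmax(scores))                       # class probabilities, summing to 1
```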
Learning in multi-layer networks
Consider two consecutive layers: layer M, with $1 \le j \le m$ nodes and a bias node, and layer N, with $1 \le k \le n$ nodes. Let $w_{j,k}$ be the weight at the edge going from $M_j$ to $N_k$. Consider processing one observation, and let $x_j$ be the value going out of node $M_j$. If M is a hidden layer: $x_j = \sigma(z_j)$, where $z_j = \sum(\ldots)$.
[Figure: two layers, $M_0, \ldots, M_3$ and $N_1, \ldots, N_4$, fully connected by weighted edges.]
Learning in multi-layer networks
If N is the output layer, calculate the error terms $\delta^N_k$ at each node $N_k$ as before, from the loss and the activation function. If M is a hidden layer, calculate the error term at each of its nodes by combining a weighted sum of the error terms at layer N with the derivative of the activation function:
$\delta^M_j = \frac{d x_j}{d z_j} \sum_{k=1}^{n} w_{j,k}\, \delta^N_k$, where $x_j = \sigma(z_j)$ and $z_j = \sum(\ldots)$.
Learning in multi-layer networks
By repeating the process, we get error terms at all nodes in all the hidden layers. The update of the weights between the layers can then be done as before: $w_{j,k} = w_{j,k} - \eta\, x_j\, \delta^N_k$, where $x_j$ is the value going out of node $M_j$.
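A minimal backpropagation sketch for the two layers M and N described above, assuming sigmoid activations at M and that the error terms at N have already been computed; all shapes and values are illustrative.

```python
import numpy as np

eta = 0.1
x_M = np.array([0.2, 0.7, 0.4])                      # values going out of the M nodes
W = np.array([[ 0.1, -0.2, 0.3,  0.0],               # W[j, k]: weight from M_j to N_k
              [ 0.4,  0.1, 0.0, -0.3],
              [-0.5,  0.2, 0.1,  0.2]])
delta_N = np.array([0.05, -0.10, 0.02, 0.08])        # error terms at layer N

# Error term at each M_j: the sigmoid's derivative, x_j * (1 - x_j),
# times the weighted sum of the error terms at layer N.
delta_M = x_M * (1 - x_M) * (W @ delta_N)

# Weight update between the layers: w_jk <- w_jk - eta * x_j * delta_k
W = W - eta * np.outer(x_M, delta_N)
print(delta_M)
```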
Alternative activation functions
There are alternative activation functions: $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ and $\mathrm{ReLU}(x) = \max(x, 0)$. ReLU is the preferred choice for hidden layers in deep networks.
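The two functions written out directly in NumPy, with a small test input.

```python
import numpy as np

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))   # same as np.tanh

def relu(x):
    return np.maximum(x, 0)

x = np.array([-2.0, 0.0, 2.0])
print(tanh(x))
print(relu(x))
```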
Today: Neural networks, Language models, Word embeddings, Word2vec
Language models
Probabilistic Language Models
Goal: ascribe probabilities to word sequences. Motivation:
Translation: P(she is a tall woman) > P(she is a high woman); P(she has a high position) > P(she has a tall position)
Spelling correction: P(She met the prefect.) > P(She met the perfect.)
Speech recognition: P(I saw a van) > P(eyes awe of an)
Probabilistic Language Models
Goal: ascribe probabilities to word sequences, $P(w_1, w_2, w_3, \ldots, w_n)$. Related: the probability of the next word, $P(w_n \mid w_1, w_2, w_3, \ldots, w_{n-1})$. A model which does either is called a Language Model (LM). Comment: the term is somewhat misleading (it probably originates from speech recognition).
Chain rule
The two definitions are related by the chain rule for probability:
$P(w_1, w_2, w_3, \ldots, w_n) = P(w_1) \times P(w_2 \mid w_1) \times P(w_3 \mid w_1, w_2) \times \cdots \times P(w_n \mid w_1, w_2, \ldots, w_{n-1}) = \prod_{j=1}^{n} P(w_j \mid w_1, w_2, \ldots, w_{j-1}) = \prod_{j=1}^{n} P(w_j \mid w_1^{j-1})$
P("its water is so transparent") = P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so)
But this does not work for long sequences (which we may not even have seen before).
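A toy illustration of the chain-rule product for this sentence; the conditional probabilities are invented for the example, not estimated from a corpus.

```python
cond_probs = [
    0.02,   # P(its)
    0.10,   # P(water | its)
    0.30,   # P(is | its water)
    0.05,   # P(so | its water is)
    0.01,   # P(transparent | its water is so)
]

p = 1.0
for q in cond_probs:
    p *= q            # chain rule: multiply the conditional probabilities
print(p)              # about 3e-07: products over long histories get tiny fast
```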