  1. IN4080 – 2020 FALL: NATURAL LANGUAGE PROCESSING. Jan Tore Lønning

  2. Neural networks, language models, word2vec. Lecture 6, 21 Sept

  3. Today
      Neural networks
      Language models
      Word embeddings
      Word2vec

  4. Artificial neural networks
      Inspired by the brain: neurons, synapses
      Does not pretend to be a model of the brain
      The simplest model is the feed-forward network, also called the multi-layer perceptron (MLP)

  5. Linear regression as a network
      Each feature, $x_i$, of the input is an input node
      An additional bias node $x_0 = 1$ for the intercept
      A weight $w_i$ at each edge
      Multiply the input values with the respective weights: $w_i x_i$
      Sum them: $\hat{y} = \sum_{i=0}^{m} w_i x_i = \mathbf{w} \cdot \mathbf{x}$ (see the code sketch below)
      [Figure: input nodes $x_1, x_2, x_3$ plus a bias node 1, weights $w_0, \dots, w_3$, a summation node $\Sigma$, the output $\hat{y}$, and the target value $y$]
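
A minimal NumPy sketch of this computation (my own illustration, not from the slides; the feature values and weights are made up):

```python
import numpy as np

# Linear regression as a single unit: sum the weighted inputs.
# x_0 = 1 plays the role of the bias node; w_0 is the intercept.
x = np.array([1.0, 0.5, -1.2, 3.0])   # bias input plus features x_1..x_3 (made-up values)
w = np.array([0.1, 0.4, -0.3, 0.2])   # weights w_0..w_3 (made-up values)

y_hat = w @ x                          # \hat{y} = sum_i w_i x_i = w . x
print(y_hat)                           # the predicted value
```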

  6. Gradient descent (for linear regression)
      We start with an initial set of weights
      Consider training examples
      Adjust the weights to reduce the loss
      How? Gradient descent
      The gradient is the vector of partial derivatives

  7. Linear regression: higher dimensions
      Linear regression of more than two variables works similarly
      We try to fit the best (hyper-)plane: $\hat{y} = f(x_0, x_1, \dots, x_m) = \sum_{i=0}^{m} w_i x_i = \mathbf{w} \cdot \mathbf{x}$
      We can use the same mean square: $\frac{1}{n} \sum_{j=1}^{n} (y_j - \hat{y}_j)^2$ (see the sketch below)
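
A small sketch of the mean squared error over a whole data set, assuming a design matrix X whose first column is all ones for the bias (the function name mse is my own):

```python
import numpy as np

# Mean squared error over n observations with m features each.
# X has shape (n, m+1); its first column is all ones (the bias inputs).
def mse(w, X, y):
    y_hat = X @ w                      # one prediction per observation
    return np.mean((y - y_hat) ** 2)   # 1/n * sum_j (y_j - y_hat_j)^2
```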

  8. Partial derivatives
      A function of more than one variable, e.g. $f(x, y)$
      The partial derivative, e.g. $\frac{\partial f}{\partial x}$, is the derivative one gets by keeping the other variables constant
      E.g. if $f(x, y) = ax + by + c$, then $\frac{\partial f}{\partial x} = a$ and $\frac{\partial f}{\partial y} = b$ (checked numerically below)
      [Image: https://www.wikihow.com/Image:OyXsh.png]
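
A quick numerical check of this example (my own sketch; the constants a, b, c and the evaluation point are arbitrary), approximating each partial derivative with a finite difference:

```python
# f(x, y) = a*x + b*y + c, so df/dx = a and df/dy = b.
a, b, c = 2.0, -3.0, 1.0
f = lambda x, y: a * x + b * y + c

h = 1e-6
x0, y0 = 0.7, 1.5
df_dx = (f(x0 + h, y0) - f(x0, y0)) / h   # y is held constant, so this is ~a
df_dy = (f(x0, y0 + h) - f(x0, y0)) / h   # x is held constant, so this is ~b
print(df_dx, df_dy)                        # approximately 2.0 and -3.0
```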

  9. Gradient descent
      We move in the opposite direction of where the gradient is pointing
      Intuitively:
      Take small steps in the directions parallel to the (feature) axes
      The length of each step is proportional to the steepness in that direction

  10. Properties of the derivatives
      1. If $f(x) = ax + b$ then $f'(x) = a$
          we also write $\frac{df}{dx} = a$
          and if $y = f(x)$, we can write $\frac{dy}{dx} = a$
      2. If $f(x) = x^n$ for an integer $n \neq 0$ then $f'(x) = nx^{n-1}$
      3. If $f(x) = g(y)$ and $y = h(x)$ then $f'(x) = g'(y)\,h'(x)$
          if $z = f(x) = g(y)$, this can be written $\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}$
      In particular, if $f(x) = (ax + b)^2$ then $f'(x) = 2(ax + b)\,a$ (checked numerically below)
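
A small numerical check of the last rule (my own sketch with arbitrary constants), comparing the analytic derivative 2(ax + b)a with a finite-difference approximation:

```python
# Chain rule check: f(x) = (a*x + b)**2 has derivative f'(x) = 2*(a*x + b)*a.
a, b = 3.0, -1.0
f = lambda x: (a * x + b) ** 2

x0, h = 0.4, 1e-6
analytic = 2 * (a * x0 + b) * a            # from the rule above
numeric = (f(x0 + h) - f(x0)) / h          # finite-difference approximation
print(analytic, numeric)                   # the two values should agree closely
```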

  11. Gradient descent (for linear regression)
      Loss: mean squared error: $L(\hat{\mathbf{y}}, \mathbf{y}) = \frac{1}{n} \sum_{j=1}^{n} (y_j - \hat{y}_j)^2$, with $\hat{y}_j = \sum_{i=0}^{m} w_i x_{j,i} = \mathbf{w} \cdot \mathbf{x}_j$
      We will update the $w_i$-s
      Consider the partial derivatives w.r.t. the $w_i$-s: $\frac{\partial}{\partial w_i} L(\hat{\mathbf{y}}, \mathbf{y}) = \frac{1}{n} \sum_{j=1}^{n} 2 (\hat{y}_j - y_j)\, x_{j,i}$
      Update $w_i$: $w_i = w_i - \eta \frac{\partial}{\partial w_i} L(\hat{\mathbf{y}}, \mathbf{y})$ (see the code sketch below)
      Here $n$ is the number of observations ($1 \leq j \leq n$) and $m$ is the number of features for each observation ($0 \leq i \leq m$)
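
A sketch of batch gradient descent with this update rule (my own illustration; the function name and default values are assumptions), again using a design matrix X whose first column is all ones:

```python
import numpy as np

# Batch gradient descent for linear regression with learning rate eta.
def gradient_descent(X, y, eta=0.01, epochs=100):
    n, m_plus_1 = X.shape
    w = np.zeros(m_plus_1)                      # start from some initial weights
    for _ in range(epochs):
        y_hat = X @ w
        grad = (2.0 / n) * X.T @ (y_hat - y)    # dL/dw_i = 1/n * sum_j 2 (y_hat_j - y_j) x_{j,i}
        w = w - eta * grad                      # w_i <- w_i - eta * dL/dw_i
    return w
```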

  12. Inspecting the update
      $w_i = w_i - \eta \frac{1}{n} \sum_{j=1}^{n} 2 (\hat{y}_j - y_j)\, x_{j,i}$
      $(\hat{y}_j - y_j)$: the error term (delta term) of this prediction, from the loss function
      $x_{j,i}$: the contribution to the error from this weight
      $\eta$ is the learning rate
      [Figure: the same network as before, with bias node, input nodes $x_1, x_2, x_3$, weights $w_0, \dots, w_3$, summation node $\Sigma$, output $\hat{y}$, and target value $y$]

  13. Logistic regression as a network
      $z = \sum_{i=0}^{m} w_i x_i = \mathbf{w} \cdot \mathbf{x}$
      $\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$
      Loss: $L_{CE} = - \sum_{j=1}^{n} \log \left( \hat{y}_j^{\,y_j} (1 - \hat{y}_j)^{1 - y_j} \right)$
      To simplify, consider only one observation, $(\mathbf{x}, y)$
      $\frac{\partial}{\partial w_i} L_{CE} = \frac{\partial L_{CE}}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial z} \times \frac{\partial z}{\partial w_i}$
      $\frac{\partial L_{CE}}{\partial \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}$
      $\frac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y})$
      $\frac{\partial z}{\partial w_i} = x_i$
      $\frac{\partial}{\partial w_i} L_{CE} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}\, \hat{y}(1 - \hat{y})\, x_i = (\hat{y} - y)\, x_i$ (see the sketch below)
      [Figure: bias node, input nodes $x_1, x_2, x_3$, weights $w_0, \dots, w_3$, summation node $\Sigma$, output $\hat{y}$, and target value $y$]
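
A sketch of the resulting gradient for a single observation (my own code, with assumed function names), where the whole chain collapses to (ŷ − y)·x_i:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gradient of the cross-entropy loss for one observation (x, y):
# dL_CE/dw_i = (y_hat - y) * x_i, with y_hat = sigma(w . x).
def logistic_gradient(w, x, y):
    y_hat = sigmoid(w @ x)     # prediction in (0, 1)
    return (y_hat - y) * x     # one partial derivative per weight
```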

  14. Logistic regression as a network
      $\frac{\partial}{\partial w_i} L_{CE} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \times \hat{y}(1 - \hat{y}) \times x_i = (\hat{y} - y)\, x_i$
      The first factor comes from the loss, the second from the activation function
      $(\hat{y} - y)$: the delta term at the end of this weight
      $x_i$: the contribution to the error from this weight
      [Figure: the same network, with bias node, inputs $x_1, x_2, x_3$, weights $w_0, \dots, w_3$, output $\hat{y}$, and target value $y$]

  15. Feed forward network
      An input layer
      An output layer: the predictions
      One or more hidden layers
      Connections from one layer to the next (from left to right)

  16. The hidden nodes
      Each hidden node is like a small logistic regression:
      First the sum of weighted inputs: $z = \sum_{i=0}^{m} w_i x_i = \mathbf{w} \cdot \mathbf{x}$
      Then the result is run through an activation function, e.g. $\sigma$: $y = \sigma(z) = \frac{1}{1 + e^{-\mathbf{w} \cdot \mathbf{x}}}$ (a layer of such nodes is sketched below)
      It is the non-linearity of the activation function which makes it possible for the MLP to predict non-linear decision boundaries
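
A sketch of a whole hidden layer as a stack of such nodes (my own illustration; W, b and the function name are assumptions), computing the weighted sums and applying the logistic activation elementwise:

```python
import numpy as np

# One hidden layer: W has one row of weights per hidden node,
# b collects the bias weights w_0 of each node.
def hidden_layer(x, W, b):
    z = W @ x + b                     # weighted sums, one per hidden node
    return 1.0 / (1.0 + np.exp(-z))   # logistic activation, applied elementwise
```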

  17. The output layer
      Alternatives:
      Regression: one node, no activation function
      Binary classifier: one node, logistic activation function
      Multinomial classifier: several nodes, softmax (sketched below)
      + more alternatives
      Choice of loss function depends on task
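
For the multinomial case, a common softmax sketch (my own, not from the slides):

```python
import numpy as np

# Softmax output layer: turns the output scores z into class probabilities.
def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()          # probabilities over the classes, summing to 1
```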

  18. Learning in multi-layer networks
      Consider two consecutive layers:
      Layer M, with $1 \leq i \leq m$ nodes, and a bias node $M_0$
      Layer N, with $1 \leq j \leq n$ nodes
      Let $w_{i,j}$ be the weight at the edge going from $M_i$ to $N_j$
      Consider processing one observation:
      Let $x_i$ be the value going out of node $M_i$
      If M is a hidden layer: $x_i = \sigma(z_i)$, where $z_i = \sum(\dots)$
      [Figure: layer M with nodes $M_0, \dots, M_3$ connected to layer N with nodes $N_1, \dots, N_4$]

  19. Learning in multi-layer networks
      If N is the output layer, calculate the error terms $\delta_j^N$ as before, from the loss and the activation function at each node $N_j$
      If M is a hidden layer, calculate the error term at each of its nodes by combining:
      a weighted sum of the error terms at layer N
      the derivative of the activation function
      $\delta_i^M = \left( \sum_{j=1}^{n} w_{i,j}\, \delta_j^N \right) \frac{dx_i}{dz_i}$, where $x_i = \sigma(z_i)$ and $z_i = \sum(\dots)$ (see the sketch below)
      [Figure: layers M and N, with the weights $w_{1,1}, w_{1,2}, w_{1,3}, w_{1,4}$ out of node $M_1$ highlighted]
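
A sketch of this backpropagation step, assuming logistic activations so that dx_i/dz_i = x_i(1 − x_i) (the function and variable names are my own):

```python
import numpy as np

# Error terms for a hidden layer M, given the error terms of the next layer N.
# W[i, j] is the weight from M_i to N_j; delta_N[j] is the error term at N_j;
# x_M[i] is the activation going out of M_i.
def hidden_deltas(W, delta_N, x_M):
    return (W @ delta_N) * x_M * (1.0 - x_M)   # weighted sum times sigma'(z_i)
```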

  20. Learning in multi-layer networks
      By repeating the process, we get error terms at all nodes in all the hidden layers
      The update of the weights between the layers can be done as before:
      $w_{i,j} = w_{i,j} - \eta\, x_i\, \delta_j^N$ (see the sketch below)
      where $x_i$ is the value going out of node $M_i$ and $\eta$ is the learning rate
      [Figure: layers M and N with the weights $w_{1,1}, \dots, w_{1,4}$]
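
A matching sketch of the weight update between layers M and N (my own; the learning-rate value is arbitrary):

```python
import numpy as np

# Update all weights between layers M and N at once:
# w_{i,j} <- w_{i,j} - eta * x_i * delta_j^N.
def update_weights(W, x_M, delta_N, eta=0.1):
    return W - eta * np.outer(x_M, delta_N)
```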

  21. Alternative activation functions
      There are alternative activation functions (both sketched in code below):
      $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
      $\mathrm{ReLU}(x) = \max(x, 0)$
      ReLU is the preferred method in hidden layers in deep networks
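
Both functions in code (a straightforward sketch; in practice np.tanh would be used directly for the first):

```python
import numpy as np

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    return np.maximum(x, 0)   # ReLU(x) = max(x, 0), applied elementwise
```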

  22. Today
      Neural networks
      Language models
      Word embeddings
      Word2vec

  23. Language models

  24. Probabilistic Language Models
      Goal: Ascribe probabilities to word sequences
      Motivation:
      Translation: P(she is a tall woman) > P(she is a high woman); P(she has a high position) > P(she has a tall position)
      Spelling correction: P(She met the prefect.) > P(She met the perfect.)
      Speech recognition: P(I saw a van) > P(eyes awe of an)

  25. Probabilistic Language Models
      Goal: Ascribe probabilities to word sequences: $P(w_1, w_2, w_3, \dots, w_n)$
      Related: the probability of the next word: $P(w_n \mid w_1, w_2, w_3, \dots, w_{n-1})$
      A model which does either is called a Language Model, LM
      Comment: the term is somewhat misleading (it probably originates from speech recognition)

  26. Chain rule
      The two definitions are related by the chain rule for probability:
      $P(w_1, w_2, w_3, \dots, w_n) = P(w_1) \times P(w_2 \mid w_1) \times P(w_3 \mid w_1, w_2) \times \dots \times P(w_n \mid w_1, w_2, \dots, w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_1, w_2, \dots, w_{i-1}) = \prod_{i=1}^{n} P(w_i \mid w_1^{i-1})$
      P("its water is so transparent") = P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so) (see the sketch below)
      But this does not work for long sequences (we may not even have seen them before)
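
A toy sketch of this decomposition; the conditional probabilities are made up for illustration, not estimated from any corpus:

```python
import math

# Chain-rule decomposition of P("its water is so transparent").
cond_probs = [
    0.02,   # P(its)               -- made-up value
    0.10,   # P(water | its)
    0.30,   # P(is | its water)
    0.05,   # P(so | its water is)
    0.01,   # P(transparent | its water is so)
]

p = math.prod(cond_probs)   # the probability of the whole sequence
print(p)                    # 3e-07 with these made-up numbers
```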
