

  1. Natural Language Processing with Deep Learning Neural Networks – a Walkthrough Navid Rekab-Saz navid.rekabsaz@jku.at Institute of Computational Perception

  2. Agenda • Introduction • Non-linearities • Forward pass & backpropagation • Softmax & loss function • Optimization & regularization

  3. Agenda • Introduction • Non-linearities • Forward pass & backpropagation • Softmax & loss function • Optimization & regularization

  4. Notation § 𝑏 → scalar § 𝒄 → vector - the 𝑗-th element of 𝒄 is the scalar 𝑐_𝑗 § 𝑫 → matrix - the 𝑗-th vector of 𝑫 is 𝒅_𝑗 - the 𝑘-th element of the 𝑗-th vector of 𝑫 is the scalar 𝑑_(𝑗,𝑘) § Tensor: generalization of scalar, vector, and matrix to any arbitrary number of dimensions

  5. Linear Algebra

  6. Linear Algebra – Transpose § 𝒃 is in 1 × d dimensions → 𝒃^T is in d × 1 dimensions § 𝑩 is in e × d dimensions → 𝑩^T is in d × e dimensions - E.g. [[1, 2, 3], [4, 5, 6]]^T = [[1, 4], [2, 5], [3, 6]]
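
The transpose example can be checked with a few lines of NumPy (a minimal sketch; the slides themselves contain no code):

```python
import numpy as np

B = np.array([[1, 2, 3],
              [4, 5, 6]])   # shape (2, 3)

print(B.T)                  # shape (3, 2): [[1 4], [2 5], [3 6]]
print(B.T.shape)            # (3, 2)
```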

  7. Linear Algebra – Dot product § 𝒃 · 𝒄^T = 𝑑 — dimensions: (1 × d) · (d × 1) = 1, a scalar § 𝒃 · 𝑪 — dimensions: (1 × d) · (d × e) = 1 × e § 𝑩 · 𝑪 = 𝑫 — dimensions: (l × m) · (m × n) = l × n § Linear transformation: the dot product of a vector with a matrix
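
These dimension rules map directly onto NumPy's `@` operator. A small sketch with made-up values (the numbers here are only illustrative, not the ones on the slide):

```python
import numpy as np

b = np.array([1.0, 2.0, 3.0])        # row vector, d = 3
c = np.array([2.0, 0.0, -1.0])       # row vector, d = 3 (illustrative values)
C = np.arange(6.0).reshape(3, 2)     # d x e = 3 x 2 matrix
B = np.ones((4, 3))                  # l x m = 4 x 3 matrix

print(b @ c)          # (1 x 3)·(3 x 1) -> scalar
print(b @ C)          # (1 x 3)·(3 x 2) -> shape (2,), a linear transformation of b
print((B @ C).shape)  # (4 x 3)·(3 x 2) -> (4, 2)
```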

  8. Probability § Conditional probability 𝑞(𝑧|𝑦) § Probability distribution - For a discrete random variable 𝒜 with 𝐿 states: • 0 ≤ 𝑞(𝒜_𝑗) ≤ 1 • ∑_{𝑗=1}^{𝐿} 𝑞(𝒜_𝑗) = 1 - E.g. with 𝐿 = 4 states: [0.2, 0.3, 0.45, 0.05]

  9. Probability § Expected value over samples 𝑦 drawn from a set 𝑌: 𝔼_{𝑦∼𝑌}[𝑔] = (1/|𝑌|) ∑_{𝑦∈𝑌} 𝑔(𝑦) - Note: this is an imprecise definition, but it suffices for our use in this lecture
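
In this finite-sample form the expected value is simply the mean of 𝑔 over the set. A one-line NumPy illustration (the sample values are arbitrary):

```python
import numpy as np

Y = np.array([0.5, 1.0, 2.0, 4.0])   # a finite set of samples (arbitrary values)
g = np.log                           # any function g

print(g(Y).mean())                   # (1/|Y|) * sum over y in Y of g(y)
```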

  10. Artificial Neural Networks § Neural networks are non-linear functions and universal approximators § They are composed of several simple (non-)linear operations § Neural networks can readily be defined as probabilistic models which estimate 𝑞(𝑧|𝒚; 𝑿) - Given the input vector 𝒚 and the set of parameters 𝑿, estimate the probability of the output class 𝑧

  11. A Feedforward network (figure) - The input vector 𝒚 is passed through the parameter matrices 𝑿^(2) and 𝑿^(3) (sizes 4×2 and 3×4) to produce the output probability distribution 𝑞(𝑧|𝒚; 𝑿)

  12. Learning with Neural Networks § Design the network's architecture § Consider proper regularization methods § Initialize the parameters § Loop until some exit criterion is met - Sample a minibatch from the training data 𝒠 - Loop over the data points in the minibatch • Forward pass: given the input 𝒚, predict the output 𝑞(𝑧|𝒚; 𝑿) - Calculate the loss function - Calculate the gradient of each parameter with respect to the loss function using the backpropagation algorithm - Update the parameters using their gradients
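
The loop on this slide corresponds closely to a standard PyTorch training loop. The sketch below is an assumption-laden illustration rather than the lecture's own code: the architecture, dummy data, learning rate, and number of epochs are all placeholders.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy training data standing in for the dataset (100 points, 4 features, 3 classes)
dataset = TensorDataset(torch.randn(100, 4), torch.randint(0, 3, (100,)))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))    # network architecture
loss_fn = nn.CrossEntropyLoss()                                       # softmax + NLL (see later slides)
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)  # weight decay as regularization

for epoch in range(10):                  # exit criterion: a fixed number of epochs
    for x_batch, z_batch in loader:      # sample a minibatch from the training data
        logits = model(x_batch)          # forward pass
        loss = loss_fn(logits, z_batch)  # loss over the minibatch
        opt.zero_grad()
        loss.backward()                  # backpropagation: gradient of every parameter
        opt.step()                       # update parameters using their gradients
```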

  13. Agenda • Introduction • Non-linearities • Forward pass & backpropagation • Softmax & loss function • Optimization & regularization

  14. Neural Computation (figure)

  15. An Artificial Neuron (figure)

  16. Linear § 𝑔(𝑦) = 𝑦

  17. Sigmoid § 𝑔(𝑦) = 𝜎(𝑦) = 1 / (1 + 𝑒^(−𝑦)) § Squashes the input between 0 and 1 § The output can be interpreted as a probability value

  18. Hyperbolic Tangent (Tanh) § 𝑔(𝑦) = tanh(𝑦) = (𝑒^(2𝑦) − 1) / (𝑒^(2𝑦) + 1) § Squashes the input between -1 and 1 (plot: sigmoid vs. tanh curves)

  19. Rectified Linear Unit (ReLU) § 𝑔(𝑦) = max(0, 𝑦) § Well suited to deep architectures, as it helps prevent vanishing gradients

  20. Examples § Linear transformation 𝒚𝑿 of the input vector 𝒚 with the parameter matrix 𝑿 § Non-linear transformations ReLU(𝒚𝑿), 𝜎(𝒚𝑿), and tanh(𝒚𝑿), applied element-wise (the slide works through one numeric example for each; a sketch with assumed values follows below)
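
The spirit of this example can be reproduced in a few lines of NumPy. The input vector and parameter matrix below are assumptions chosen for illustration only; they are not the slide's exact numbers.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def relu(a):
    return np.maximum(0.0, a)

y = np.array([1.0, 3.0, -1.0])          # input vector (illustrative values)
X = np.array([[0.5, -0.5, 2.0,  0.0],   # parameter matrix (illustrative values)
              [0.0,  0.0, 0.0,  4.0],
              [1.0,  2.0, 0.0, -3.0]])

a = y @ X                # linear transformation y·X, shape (4,)
print(a)                 # [-0.5 -2.5  2.  15. ]
print(relu(a))           # negative entries clipped to 0
print(sigmoid(a))        # squashed into (0, 1)
print(np.tanh(a))        # squashed into (-1, 1)
```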

  21. Agenda • Introduction • Non-linearities • Forward pass & backpropagation • Softmax & loss function • Optimization & regularization

  22. Forward pass § Consider this calculation: 𝑧(𝑦; 𝒙) = 2𝑥_2^2 + 2𝑦𝑥_1 + 𝑥_0, where 𝑦 is the input and 𝒙 is the set of parameters, initialized as 𝑥_0 = 1, 𝑥_1 = 3, 𝑥_2 = 2 § Let's break it into intermediary variables: 𝑏 = 2𝑦𝑥_1, 𝑐 = 𝑏 + 𝑥_0, 𝑑 = 𝑥_2^2, 𝑧 = 𝑐 + 2𝑑

  23. Computational Graph § The computation 𝑧 = 𝑐 + 2𝑑 is arranged as a graph with intermediate nodes 𝑏 = 2𝑦𝑥_1, 𝑐 = 𝑏 + 𝑥_0, 𝑑 = 𝑥_2^2 and leaf nodes 𝑥_0, 𝑦, 𝑥_1, 𝑥_2 (with 𝑥_0 = 1, 𝑥_1 = 3, 𝑥_2 = 2)

  24. Computational Graph – local derivatives § Each edge of the graph is annotated with its local derivative: ∂𝑧/∂𝑐 = 1, ∂𝑧/∂𝑑 = 2, ∂𝑐/∂𝑏 = 1, ∂𝑐/∂𝑥_0 = 1, ∂𝑏/∂𝑦 = 2𝑥_1, ∂𝑏/∂𝑥_1 = 2𝑦, ∂𝑑/∂𝑥_2 = 2𝑥_2

  25. Forward pass § With input 𝑦 = 1 and parameters 𝑥_0 = 1, 𝑥_1 = 3, 𝑥_2 = 2: 𝑏 = 2𝑦𝑥_1 = 6, 𝑐 = 𝑏 + 𝑥_0 = 7, 𝑑 = 𝑥_2^2 = 4, 𝑧 = 𝑐 + 2𝑑 = 15

  26. Backward pass § Starting from 𝑧 = 15, the gradient is propagated backwards through the graph: the gradient at each node is the incoming gradient multiplied by the local derivative of its edge, evaluated at the forward-pass values (𝑏 = 6, 𝑐 = 7, 𝑑 = 4) - E.g. ∂𝑧/∂𝑦 = 1 · 1 · 2𝑥_1 = 6

  27. Gradient & Chain rule § We need the gradient of 𝑧 with respect to 𝒙 for optimization: ∇_𝒙 𝑧 = [∂𝑧/∂𝑥_0, ∂𝑧/∂𝑥_1, ∂𝑧/∂𝑥_2] § We calculate it using the chain rule and the local derivatives: ∂𝑧/∂𝑥_0 = (∂𝑧/∂𝑐)(∂𝑐/∂𝑥_0), ∂𝑧/∂𝑥_1 = (∂𝑧/∂𝑐)(∂𝑐/∂𝑏)(∂𝑏/∂𝑥_1), ∂𝑧/∂𝑥_2 = (∂𝑧/∂𝑑)(∂𝑑/∂𝑥_2)

  28. Backpropagation § ∂𝑧/∂𝑥_0 = (∂𝑧/∂𝑐)(∂𝑐/∂𝑥_0) = 1 ∗ 1 = 1 § ∂𝑧/∂𝑥_1 = (∂𝑧/∂𝑐)(∂𝑐/∂𝑏)(∂𝑏/∂𝑥_1) = 1 ∗ 1 ∗ 2 = 2 § ∂𝑧/∂𝑥_2 = (∂𝑧/∂𝑑)(∂𝑑/∂𝑥_2) = 2 ∗ 4 = 8
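
The worked example on slides 22–28 can be checked with PyTorch's autograd, which builds the same computational graph and applies the same chain rule automatically (a sketch, not part of the original slides):

```python
import torch

# Parameters and input, initialized as on slide 22
x0 = torch.tensor(1.0, requires_grad=True)
x1 = torch.tensor(3.0, requires_grad=True)
x2 = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(1.0)

# Forward pass through the intermediary variables
b = 2 * y * x1        # b = 6
c = b + x0            # c = 7
d = x2 ** 2           # d = 4
z = c + 2 * d         # z = 15

z.backward()          # backward pass (backpropagation)
print(z.item())                   # 15.0
print(x0.grad, x1.grad, x2.grad)  # tensor(1.) tensor(2.) tensor(8.)
```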

  29. Agenda • Introduction • Non-linearities • Forward pass & backpropagation • Softmax & loss function • Optimization & regularization

  30. Softmax § Given the output vector 𝒜 of a neural network model with 𝐿 output classes, softmax turns the vector into a probability distribution: softmax(𝒜)_𝑖 = 𝑒^(𝒜_𝑖) / ∑_{𝑗=1}^{𝐿} 𝑒^(𝒜_𝑗) - The denominator is the normalization term

  31. Softmax – numeric example § 𝐿 = 4 classes, 𝒜 = [1, 2, 5, 6] § softmax(𝒜) ≈ [0.004, 0.013, 0.264, 0.717]
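
A direct NumPy check of this example (subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())   # subtract the max to avoid overflow; result is unchanged
    return e / e.sum()

a = np.array([1.0, 2.0, 5.0, 6.0])
print(softmax(a))             # ≈ [0.0048 0.0131 0.2641 0.7179], the slide's values up to rounding
```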

  32. Softmax characteristics § The exponential function in softmax makes the highest value become clearly separated from the others § Softmax identifies the “max”, but in a “soft” way! § Softmax creates competition between the predicted output values, so that in the extreme case the “winner takes all” - Winner-takes-all: one output is 1 and the rest are 0 - This resembles the competition between nearby neurons in the cortex

  33. Negative Log Likelihood (NLL) Loss § The NLL loss function is commonly used to optimize neural networks for classification tasks: ℒ = −𝔼_{𝒚,𝑧∼𝒠} [log 𝑞(𝑧|𝒚; 𝑿)] - 𝒠: the set of (training) data - 𝒚: input vector - 𝑧: correct output class § NLL is a form of cross-entropy loss

  34. NLL + Softmax § The choice of output function (such as softmax) is closely tied to the choice of loss function; the two should fit each other! § Softmax and NLL are a good pair § To see why, let's calculate the final NLL loss when softmax is used at the output layer (next slide)

  35. NLL + Softmax § Loss function for one data point: ℒ(𝑔(𝒚; 𝑿), 𝑧) - 𝒜: the output vector of the network before applying softmax - 𝑧: the index of the correct class § ℒ(𝑔(𝒚; 𝑿), 𝑧) = −log 𝑞(𝑧|𝒚; 𝑿) = −log (𝑒^(𝒜_𝑧) / ∑_{𝑗=1}^{𝐿} 𝑒^(𝒜_𝑗)) = −𝒜_𝑧 + log ∑_{𝑗=1}^{𝐿} 𝑒^(𝒜_𝑗)

  36. NLL + Softmax – example 2 § 𝒜 = [1, 2, 0.5, 6] § If the correct class is the first one, 𝑧 = 0: ℒ = −1 + log(𝑒^1 + 𝑒^2 + 𝑒^0.5 + 𝑒^6) = −1 + 6.03 = 5.03 § If the correct class is the third one, 𝑧 = 2: ℒ = −0.5 + 6.03 = 5.53 § If the correct class is the fourth one, 𝑧 = 3: ℒ = −6 + 6.03 = 0.03

  37. NLL + Softmax – example 1 § 𝒜 = [1, 2, 5, 6] § If the correct class is the first one, 𝑧 = 0: ℒ = −1 + log(𝑒^1 + 𝑒^2 + 𝑒^5 + 𝑒^6) = −1 + 6.33 = 5.33 § If the correct class is the third one, 𝑧 = 2: ℒ = −5 + 6.33 = 1.33 § If the correct class is the fourth one, 𝑧 = 3: ℒ = −6 + 6.33 = 0.33
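
Both numeric examples follow from the formula ℒ = −𝒜_𝑧 + log ∑_𝑗 𝑒^(𝒜_𝑗) on slide 35. A short NumPy check:

```python
import numpy as np

def nll_softmax(a, z):
    """NLL loss for output vector a when the correct class index is z."""
    return -a[z] + np.log(np.exp(a).sum())

a_36 = np.array([1.0, 2.0, 0.5, 6.0])   # vector from slide 36
a_37 = np.array([1.0, 2.0, 5.0, 6.0])   # vector from slide 37

for z in (0, 2, 3):
    print(f"slide 36, z={z}: {nll_softmax(a_36, z):.2f}")   # 5.03, 5.53, 0.03
for z in (0, 2, 3):
    print(f"slide 37, z={z}: {nll_softmax(a_37, z):.2f}")   # 5.33, 1.33, 0.33
```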
