
Natural Language Processing Anoop Sarkar - PowerPoint PPT Presentation



  1. SFU NatLangLab Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class Simon Fraser University October 17, 2019

  2. Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class Simon Fraser University Part 1: Feedforward neural networks

  3. Outline: Log-linear models versus Neural networks · Feedforward neural networks · Stochastic Gradient Descent · Motivating example: XOR · Computation Graphs

  4. Log-linear model
     ◮ Let there be m features, f_k(x, y) for k = 1, ..., m
     ◮ Define a parameter vector v ∈ R^m
     ◮ A log-linear model for classification into labels y ∈ Y:
       Pr(y | x; v) = exp(v · f(x, y)) / Σ_{y′ ∈ Y} exp(v · f(x, y′))
     Advantages: the feature representation f(x, y) can represent any aspect of the input that is useful for classification.
     Disadvantages: the feature representation f(x, y) has to be designed by hand, which is time-consuming and error-prone.
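
A minimal numpy sketch of this log-linear classifier (not code from the course): `features(x, y)` and `labels` are placeholder names standing in for the hand-engineered feature map f(x, y) and the label set Y.

```python
import numpy as np

def log_linear_probs(x, labels, features, v):
    """Pr(y | x; v) for a log-linear model.

    features(x, y): hand-engineered feature map returning a vector in R^m (placeholder name).
    v: parameter vector in R^m.
    """
    scores = np.array([v @ features(x, y) for y in labels])  # v . f(x, y) for each label
    scores -= scores.max()                                   # stabilise exp before normalising
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()                     # one probability per label y
```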

  5. Log-linear model (figure from [1]). Disadvantage: the number of combined features can explode.

  6. Neural Networks
     Advantages
     ◮ Neural networks replace hand-engineered features with representation learning
     ◮ Empirical results across many different domains show that learned representations give significant improvements in accuracy
     ◮ Neural networks allow end-to-end training for complex NLP tasks and do not have the limitations of multiple chained pipeline models
     Disadvantages
     For many tasks, linear models are much faster to train than neural network models

  7. Alternative Form of Log-linear Model
     Log-linear model:
       Pr(y | x; v) = exp(v · f(x, y)) / Σ_{y′ ∈ Y} exp(v · f(x, y′))
     Alternative form using per-label functions:
       Pr(y | x; v) = exp(v(y) · f(x) + γ_y) / Σ_{y′ ∈ Y} exp(v(y′) · f(x) + γ_{y′})
     ◮ Feature vector f(x) maps input x to R^d
     ◮ Parameters v(y) ∈ R^d and γ_y ∈ R for each y ∈ Y
     ◮ We assume v(y) · f(x) is a dot product; using matrix multiplication it would be v(y) f(x)^T
     ◮ Let v = {(v(y), γ_y) : y ∈ Y}

  8. Outline: Log-linear models versus Neural networks · Feedforward neural networks · Stochastic Gradient Descent · Motivating example: XOR · Computation Graphs

  9. Representation Learning: Feedforward Neural Network
     Replace hand-engineered features f with learned features φ:
       Pr(y | x; θ, v) = exp(v(y) · φ(x; θ) + γ_y) / Σ_{y′ ∈ Y} exp(v(y′) · φ(x; θ) + γ_{y′})
     ◮ Replace f(x) with φ(x; θ) ∈ R^d, where θ are new parameters
     ◮ Parameters θ are learned from training data
     ◮ Using θ, the model φ maps input x to R^d: a learned representation of x
     ◮ x ∈ R^d is a pre-trained vector of size d
     ◮ We will use feedforward neural networks to define φ(x; θ)
     ◮ φ(x; θ) will be a non-linear mapping to R^d
     ◮ φ replaces f, which was a linear model

  10. A Single Neuron (aka Perceptron)
      A single neuron maps input x ∈ R^d to output h:
        h = g(w · x + b)
      ◮ The weight vector w ∈ R^d and the bias b ∈ R are the parameters of the model, learned from training data
      ◮ Transfer function (also called activation function) g: R → R
      ◮ It is important that g is a non-linear transfer function
      ◮ A linear g(z) = α · z + β for constants α, β gives a linear perceptron
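
A minimal numpy sketch of the single neuron above; the function name and example numbers are made up for illustration, and tanh is used as one possible non-linear g.

```python
import numpy as np

def neuron(x, w, b, g=np.tanh):
    """A single neuron: h = g(w . x + b) with a non-linear transfer function g."""
    return g(w @ x + b)

# Illustrative example with made-up numbers
x = np.array([1.0, -2.0, 0.5])
w = np.array([0.3, 0.1, -0.4])
h = neuron(x, w, b=0.2)
```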

  11. Activation Functions and their Gradients (from [2], Fig. 4.3)

  12. The sigmoid Transfer Function: σ
      Sigmoid transfer function:
        g(z) = 1 / (1 + exp(−z))
      Derivative of sigmoid:
        dg(z)/dz = g(z)(1 − g(z))
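
A small numpy sketch of the sigmoid and its derivative (function names are mine, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z))"""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """d sigma / dz = sigma(z) * (1 - sigma(z))"""
    s = sigmoid(z)
    return s * (1.0 - s)
```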

  13. The tanh Transfer Function
      tanh transfer function:
        g(z) = (exp(2z) − 1) / (exp(2z) + 1)
      Derivative of tanh:
        dg(z)/dz = 1 − g(z)^2
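
A sketch of tanh and its derivative using the formula on this slide (in practice np.tanh is the numerically stable built-in):

```python
import numpy as np

def tanh(z):
    """tanh(z) = (exp(2z) - 1) / (exp(2z) + 1)"""
    e = np.exp(2.0 * z)
    return (e - 1.0) / (e + 1.0)

def tanh_grad(z):
    """d tanh / dz = 1 - tanh(z)^2"""
    return 1.0 - np.tanh(z) ** 2
```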

  14. Alternatives to tanh
      hardtanh:
        g(z) = 1 if z > 1; −1 if z < −1; z otherwise
        dg(z)/dz = 1 if −1 ≤ z ≤ 1; 0 otherwise
      softsign:
        g(z) = z / (1 + |z|)
        dg(z)/dz = 1 / (1 + z)^2 if z ≥ 0; 1 / (1 − z)^2 if z < 0
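
Numpy sketches of hardtanh and softsign with their derivatives, assuming the piecewise definitions above (for softsign the two branches collapse to 1 / (1 + |z|)^2):

```python
import numpy as np

def hardtanh(z):
    """Clip z to [-1, 1]."""
    return np.clip(z, -1.0, 1.0)

def hardtanh_grad(z):
    """1 inside [-1, 1], 0 outside."""
    return np.where((z >= -1.0) & (z <= 1.0), 1.0, 0.0)

def softsign(z):
    """z / (1 + |z|)"""
    return z / (1.0 + np.abs(z))

def softsign_grad(z):
    """1 / (1 + |z|)^2"""
    return 1.0 / (1.0 + np.abs(z)) ** 2
```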

  15. The ReLU Transfer Function
      Rectified Linear Unit (ReLU):
        g(z) = z if z ≥ 0; 0 if z < 0, or equivalently g(z) = max{0, z}
      Derivative of ReLU:
        dg(z)/dz = 1 if z > 0; 0 if z < 0
        non-differentiable (undefined) at z = 0; in practice, choose a value for z = 0
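
A numpy sketch of ReLU and its (sub)derivative; the z = 0 case is arbitrarily set to 0 here, since the slide only says to choose some value:

```python
import numpy as np

def relu(z):
    """max(0, z)"""
    return np.maximum(0.0, z)

def relu_grad(z):
    """1 for z > 0, 0 for z < 0; the z = 0 case is set to 0 by convention here."""
    return np.where(z > 0.0, 1.0, 0.0)
```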

  16. Desperately Seeking Transfer Functions (from [3]): enumeration of non-linear functions

  17. Desperately Seeking Transfer Functions (from [3]): enumeration of non-linear functions

  18. The Swish Transfer Function [3]
      Enumeration of activation functions: Swish was the end result of comparing all the auto-generated activation functions for accuracy on standard datasets.
      Swish uses the sigmoid σ:
        g(z) = z · σ(βz)
      ◮ If β = 0 then g(z) = z/2 (a linear function, so avoid this)
      ◮ As β → ∞, g(z) approaches ReLU
      Derivative of Swish:
        dg(z)/dz = β g(z) + σ(βz)(1 − β g(z))
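
A numpy sketch of Swish and the derivative formula above; the default `beta=1.0` is chosen purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swish(z, beta=1.0):
    """g(z) = z * sigma(beta * z)"""
    return z * sigmoid(beta * z)

def swish_grad(z, beta=1.0):
    """dg/dz = beta * g(z) + sigma(beta * z) * (1 - beta * g(z))"""
    g = swish(z, beta)
    return beta * g + sigmoid(beta * z) * (1.0 - beta * g)
```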

  19. The Swish Transfer Function [3]: Swish transfer function with different values of β; first derivative of the Swish transfer function

  20. Derivatives w.r.t. Parameters
      Given h = g(w · x + b):
      ◮ Derivatives w.r.t. w_1, ..., w_j, ..., w_d: dh/dw_j
      ◮ Derivative w.r.t. b: dh/db

  21. Chain Rule of Differentiation
      Introduce an intermediate variable z ∈ R:
        z = w · x + b
        h = g(z)
      Then by the chain rule, differentiating w.r.t. w_j:
        dh/dw_j = (dh/dz)(dz/dw_j) = (dg(z)/dz) × x_j
      And similarly for b:
        dh/db = (dh/dz)(dz/db) = (dg(z)/dz) × 1
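
A sketch of these chain-rule gradients for a single neuron, checked against a finite-difference approximation with tanh; the function name and numbers are illustrative, not from the course.

```python
import numpy as np

def neuron_grads(x, w, b, g_grad):
    """Chain-rule gradients of h = g(w . x + b): dh/dw_j = g'(z) * x_j and dh/db = g'(z)."""
    z = w @ x + b
    return g_grad(z) * x, g_grad(z)

# Finite-difference check using g = tanh (illustrative numbers)
x = np.array([0.5, -1.0])
w = np.array([0.2, 0.7])
b = 0.1
dh_dw, dh_db = neuron_grads(x, w, b, lambda z: 1 - np.tanh(z) ** 2)
eps = 1e-6
approx_db = (np.tanh(w @ x + b + eps) - np.tanh(w @ x + b - eps)) / (2 * eps)
assert np.isclose(dh_db, approx_db)
```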

  22. Single Layer Feedforward Model
      A single layer feedforward model consists of:
      ◮ An integer d specifying the input dimension. Each input to the network is x ∈ R^d
      ◮ An integer m specifying the number of hidden units
      ◮ A parameter matrix W ∈ R^{m×d}. The vector W_k ∈ R^d for 1 ≤ k ≤ m is the k-th row of W
      ◮ A vector b ∈ R^m of bias parameters
      ◮ A transfer function g: R → R, e.g. g(z) = ReLU(z) or g(z) = tanh(z)

  23. Single Layer Feedforward Model (continued)
      For k = 1, ..., m:
      ◮ The input to the k-th neuron is: z_k = W_k · x + b_k
      ◮ The output from the k-th neuron is: h_k = g(z_k)
      ◮ Define the vector φ(x; θ) ∈ R^m by: φ_k(x; θ) = h_k
      ◮ θ = (W, b) where W ∈ R^{m×d} and b ∈ R^m
      ◮ The size of θ is m × (d + 1) parameters
      Some intuition: the neural network employs m hidden units, each with its own parameters W_k and b_k, and these neurons are used to construct a hidden representation h ∈ R^m

  24. Matrix Form
      We can replace the operations z_k = W_k · x + b_k for k = 1, ..., m with:
        z = Wx + b
      where the dimensions are as follows (a vector of size m equals a matrix of size m × 1):
        z (m × 1) = W (m × d) · x (d × 1) + b (m × 1)

  25. Single Layer Feedforward Model (matrix form)
      A single layer feedforward model consists of:
      ◮ An integer d specifying the input dimension. Each input to the network is x ∈ R^d
      ◮ An integer m specifying the number of hidden units
      ◮ A parameter matrix W ∈ R^{m×d}
      ◮ A vector b ∈ R^m of bias parameters
      ◮ A transfer function g: R^m → R^m applied elementwise, e.g.
        g(z) = [..., ReLU(z_i), ...] or g(z) = [..., tanh(z_i), ...] or g(z) = [..., σ(z_i), ...] for i = 1, ..., m

  26. Single Layer Feedforward Model (matrix form, continued)
      ◮ Vector of inputs to the hidden layer z ∈ R^m: z = Wx + b
      ◮ Vector of outputs from the hidden layer h ∈ R^m: h = g(z)
      ◮ Define φ(x; θ) = h where θ = (W, b)
      ◮ Define softmax_y(r) = exp(r_y) / Σ_{y′} exp(r_{y′}) for r_y = v(y) · h + γ_y
      ◮ Let V = [..., v(y), ...] for y ∈ Y, with v(y) ∈ R^m, so V ∈ R^{|Y|×m}
      ◮ Let Γ = [..., γ_y, ...] for y ∈ Y, so Γ ∈ R^{|Y|}
      Putting it all together:
        softmax(V · φ(x; θ) + Γ)
      is a vector of size |Y| that sums to 1, with one value in R for each y ∈ Y
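
A sketch of the full forward pass in numpy, following the shapes on these slides (W is m×d, b has length m, V is |Y|×m, Γ has length |Y|); the max-subtraction for numerical stability is an implementation detail not shown on the slide.

```python
import numpy as np

def forward(x, W, b, V, Gamma, g=np.tanh):
    """Single-layer feedforward model: returns Pr(y | x) over the labels."""
    z = W @ x + b                      # hidden pre-activations, shape (m,)
    h = g(z)                           # phi(x; theta), shape (m,)
    r = V @ h + Gamma                  # label scores r_y = v(y) . h + gamma_y, shape (|Y|,)
    r -= r.max()                       # stabilise the softmax
    p = np.exp(r) / np.exp(r).sum()    # vector of size |Y| summing to 1
    return p
```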

  27. Feedforward neural network

  28. n-gram Feedforward neural network from [5]

  29. Outline: Log-linear models versus Neural networks · Feedforward neural networks · Stochastic Gradient Descent · Motivating example: XOR · Computation Graphs

  30. Simple Stochastic Gradient Descent
      Inputs:
      ◮ Training examples (x_i, y_i) for i = 1, ..., n
      ◮ A feedforward representation φ(x; θ)
      ◮ An integer T specifying the number of updates
      ◮ A sequence of learning rates η_1, ..., η_T where η_t ∈ [0, 1]
      ◮ One should experiment with learning rates: 0.001, 0.01, 0.1, 1
      ◮ Bottou (2012) suggests a learning rate η_t = η_1 / (1 + η_1 × λ × t), where λ is a hyperparameter that can be tuned experimentally
      Initialization: set v = (v(y), γ_y) for all y, and θ, to random values

  31. Gradient Descent
      Algorithm:
      ◮ For t = 1, ..., T:
        ◮ Select an integer i uniformly at random from {1, ..., n}
        ◮ Define L(θ, v) = − log Pr(y_i | x_i; θ, v)
        ◮ For each parameter θ_j, v_k(y), and γ_y (for each label y):
          θ_j = θ_j − η_t × dL(θ, v)/dθ_j
          v_k(y) = v_k(y) − η_t × dL(θ, v)/dv_k(y)
          γ_y = γ_y − η_t × dL(θ, v)/dγ_y
      ◮ Output: parameters θ and v = (v(y), γ_y) for all y
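
A sketch of this SGD loop in numpy; `grad_fn` is a hypothetical function assumed to return the gradient of L = −log Pr(y_i | x_i; θ, v) for each parameter array (its implementation is not shown), and the learning-rate schedule follows the Bottou (2012) form from slide 30.

```python
import numpy as np

def sgd(examples, params, grad_fn, eta1=0.1, lam=1e-4, T=10000, seed=0):
    """Simple stochastic gradient descent over (x_i, y_i) pairs.

    params: dict of numpy arrays (e.g. {"W": ..., "b": ..., "V": ..., "Gamma": ...}).
    grad_fn(x, y, params): assumed to return a dict of gradients with the same keys.
    """
    rng = np.random.default_rng(seed)
    for t in range(1, T + 1):
        eta_t = eta1 / (1.0 + eta1 * lam * t)          # Bottou (2012) learning-rate schedule
        x, y = examples[rng.integers(len(examples))]   # sample i uniformly at random
        grads = grad_fn(x, y, params)
        for name in params:
            params[name] -= eta_t * grads[name]        # theta_j <- theta_j - eta_t * dL/dtheta_j
    return params
```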

  32. Outline: Log-linear models versus Neural networks · Feedforward neural networks · Stochastic Gradient Descent · Motivating example: XOR · Computation Graphs
