Traitement automatique des langues : Fondements et applications Cours 10 : Neural networks (1) Tim Van de Cruys & Philippe Muller 2016—2017
Introduction Machine learning for NLP • Standard approach: linear model trained over high-dimensional but very sparse feature vectors • Recently: non-linear neural networks over dense input vectors
Neural Network Architectures Feed-forward neural networks • Best known, standard neural network approach • Fully connected layers • Can be used as drop-in replacement for typical NLP classifiers
Feature representation Dense vs. one hot • One hot: each feature is its own dimension • Dimensionality of the vector equals the number of features • Features are completely independent of one another • Dense: each feature is a d-dimensional vector • Dimensionality is d • Similar features have similar vectors
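To make the contrast concrete, here is a minimal sketch (with a toy vocabulary and made-up dimensions, not anything from the course) of one-hot versus dense vectors, and of how only the dense ones can express similarity:

```python
# Sketch contrasting one-hot and dense feature vectors (illustrative values only).
import numpy as np

vocab = ["they", "jump", "high"]          # hypothetical 3-word vocabulary
d = 4                                     # hypothetical embedding size

# One-hot: dimensionality equals vocabulary size, all words equidistant.
one_hot = np.eye(len(vocab))              # row i is the one-hot vector of word i

# Dense: each word gets a d-dimensional vector (random here; normally learned).
rng = np.random.default_rng(0)
dense = rng.normal(size=(len(vocab), d))

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(one_hot[0], one_hot[1]))     # always 0.0 for distinct one-hot vectors
print(cosine(dense[0], dense[1]))         # can reflect similarity once trained
```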
Feature representation Feature combinations • Traditional NLP: specify interactions of features • E.g. features like 'word is jump, tag is V and previous word is they' • Non-linear network: only specify core features • Non-linearity of network takes care of finding indicative feature combinations
Feature representation Why dense? • Discrete approach often works surprisingly well for NLP tasks • n-gram language models • POS-tagging, parsing • sentiment analysis • Still, a very poor representation of word meaning • No notion of similarity • Limited inference
Feature representation Why dense? Example one-hot vectors for two different words (no similarity between them): [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 ] and [ 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
Feed-forward Architecture Multi-layer perceptron with 2 hidden layers
NN_MLP2(x) = y
h_1 = g(x W^1 + b^1)
h_2 = g(h_1 W^2 + b^2)
y = h_2 W^3
• x: input vector of size d_in = 3 • y: output vector of size d_out = 2 • h_1, h_2: hidden vectors of size d_hidden = 4
• W^1, W^2, W^3: weight matrices of size [3 × 4], [4 × 4], [4 × 2] • b^1, b^2: 'bias' vectors of size d_hidden = 4 • g(·): non-linear activation function (applied elementwise)
• W^1, W^2, W^3, b^1, b^2 = parameters of the network (θ) • Use of multiple hidden layers: deep learning
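A minimal numpy sketch of the forward pass above, using the slide's dimensions (d_in = 3, d_hidden = 4, d_out = 2); the random weights stand in for parameters that would normally be learned:

```python
# Sketch of the 2-hidden-layer MLP forward pass with the slide's dimensions.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 3, 4, 2

W1 = rng.normal(size=(d_in, d_hidden));     b1 = np.zeros(d_hidden)
W2 = rng.normal(size=(d_hidden, d_hidden)); b2 = np.zeros(d_hidden)
W3 = rng.normal(size=(d_hidden, d_out))

def g(z):                        # non-linear activation, here tanh
    return np.tanh(z)

def nn_mlp2(x):
    h1 = g(x @ W1 + b1)
    h2 = g(h1 @ W2 + b2)
    return h2 @ W3               # y, an unnormalized score vector of size d_out

x = rng.normal(size=d_in)
print(nn_mlp2(x))
```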
Feed-forward Non-linear activation functions Sigmoid (logistic) function: σ(x) = 1 / (1 + e^{−x}) [plot of σ(x) for −6 ≤ x ≤ 6, ranging from 0 to 1]
Feed-forward Non-linear activation functions Hyperbolic tangent (tanh) function: tanh(x) = (e^{2x} − 1) / (e^{2x} + 1)
Feed-forward Non-linear activation functions Rectified linear unit (ReLU) ReLU ( x ) = max ( 0 , x )
Feed-forward Output transformation function Softmax function: for x = x_1, ..., x_k, softmax(x_i) = e^{x_i} / Σ_{j=1}^{k} e^{x_j}
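The activation and output functions of the last few slides are short enough to write down directly; a small numpy sketch (the max-shift inside softmax is a standard numerical-stability trick, not something from the slides):

```python
# Sketch of the activation and output functions above (numpy, elementwise).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return (np.exp(2 * x) - 1) / (np.exp(2 * x) + 1)   # equals np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - np.max(x))    # shift by the max for numerical stability
    return e / e.sum()

scores = np.array([1.0, 2.0, 3.0])
print(softmax(scores))           # sums to 1, largest score gets highest probability
```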
Feed-forward Input vector • Each input word/feature is mapped to its embedding by lookup in an embedding matrix • The embeddings are then concatenated or summed to form the input vector
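A small sketch of both options, with a made-up embedding matrix E and made-up word indices:

```python
# Sketch of building the network input from word embeddings.
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                         # vocabulary size, embedding size
E = rng.normal(size=(V, d))          # embedding matrix (a parameter of the network)

window = [2, 7, 5]                   # indices of the words in the input window

x_concat = np.concatenate([E[i] for i in window])   # size 3*d, keeps word order
x_sum = sum(E[i] for i in window)                   # size d, order-insensitive
print(x_concat.shape, x_sum.shape)
```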
Feed-forward Loss functions • L(ŷ, y): the loss of predicting ŷ when the true output is y • Set the parameters θ so as to minimize the loss across the training examples • Compute the gradient of the loss with respect to the parameters and take steps in the direction that decreases the loss
Feed-forward Loss functions • Hinge loss (binary and multi-class) • Score the correct class above the incorrect class(es) with a margin of at least 1 • Categorical cross-entropy loss (negative log-likelihood) • Measures the difference between the true class distribution y and the predicted class distribution ŷ • Use with softmax output • Ranking loss • In an unsupervised setting: rank attested examples above unattested, corrupted ones with a margin of at least 1
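A sketch of the cross-entropy and multi-class hinge losses for a single example (toy scores, numpy only); the ranking loss works like the hinge but compares an attested and a corrupted example:

```python
# Sketch of two of the losses above for a single example.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cross_entropy(scores, true_class):
    # negative log-likelihood of the true class under the softmax distribution
    return -np.log(softmax(scores)[true_class])

def multiclass_hinge(scores, true_class):
    # correct class should beat the best wrong class by a margin of at least 1
    wrong = np.delete(scores, true_class)
    return max(0.0, 1.0 - (scores[true_class] - wrong.max()))

scores = np.array([2.0, 0.5, -1.0])
print(cross_entropy(scores, 0), multiclass_hinge(scores, 0))
```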
Training Stochastic gradient descent • Goal: minimize total loss Σ_{i=1}^{n} L(f(x_i; θ), y_i) • Estimating gradient over entire training set before taking step is computationally heavy • Compute gradient for small batch of samples from training set → estimate of gradient: stochastic • Learning rate λ: size of step in right direction • Improvements: momentum, adaptive learning rate
Training Stochastic gradient descent • Size of mini-batch: balance between a better gradient estimate and faster convergence • Gradients over the different parameters (weight matrices, bias terms, embeddings, ...) are efficiently calculated using the backpropagation algorithm • No need to carry out the derivations yourself: automatic tools exist for gradient computation using computational graphs
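A sketch of mini-batch SGD on a deliberately simple model (a linear regressor with squared loss), so the gradient can be written by hand; in a real network the gradient would come from backpropagation, and the data, sizes and learning rate here are made up:

```python
# Sketch of mini-batch stochastic gradient descent on a toy regression problem.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # toy training set
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)   # noisy targets

w = np.zeros(5)                                # parameters theta
lr, batch_size = 0.1, 32                       # learning rate, mini-batch size

for epoch in range(10):
    order = rng.permutation(len(X))            # shuffle the training set each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        pred = X[idx] @ w
        grad = 2 * X[idx].T @ (pred - y[idx]) / len(idx)   # gradient on the batch
        w -= lr * grad                         # step against the gradient

print(np.abs(w - true_w).max())                # close to 0 after training
```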
Training Initialization • Parameters of network are initialized randomly • Magnitude of random samples has effect on training success • effective initialization schemes exist
Training Misc • Shuffling: shuffle training set with each epoch • Learning rate: balance between proper convergence and fast convergence • Minibatch: balance speed/proper estimate; efficient using GPU
Training Regularization • Neural networks have many parameters: risk of overfitting • Solution: regularization • L2: extend loss function with a squared penalty on the parameters, i.e. (λ/2) ||θ||² • Dropout: randomly dropping (setting to zero) half of the neurons in the network for each training sample
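A sketch of the two regularizers (numpy only, made-up sizes); the rescaling in dropout is the common "inverted dropout" variant so that nothing changes at test time, a detail not on the slide:

```python
# Sketch of L2 regularization and dropout.
import numpy as np

rng = np.random.default_rng(0)

def l2_penalty(params, lam=1e-3):
    # (lambda / 2) * ||theta||^2, summed over all parameter arrays
    return 0.5 * lam * sum(np.sum(p ** 2) for p in params)

def dropout(h, p_drop=0.5, train=True):
    if not train:
        return h                       # no dropout at test time
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)   # rescale so the expected activation is unchanged

h = rng.normal(size=8)
print(l2_penalty([h]), dropout(h))
```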
Word embeddings • Each word i is represented by a small, dense vector v i ∈ R d • d is typically in the range 50–1000 • Matrix of size V (vocabulary size) × d (embedding size) • words are ‘embedded’ in a real-valued, low-dimensional space • Similar words have similar embeddings
Word embeddings Example embedding matrix (one row per word):
         d1      d2      d3   ...
apple  –2.34   –1.01    0.33
pear   –2.28   –1.20    0.11
car    –0.20    1.02    2.44
...
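Using the illustrative values from the table above, a quick check that similar words end up with similar vectors (cosine similarity):

```python
# Sketch: cosine similarity between the example embeddings above.
import numpy as np

emb = {
    "apple": np.array([-2.34, -1.01, 0.33]),
    "pear":  np.array([-2.28, -1.20, 0.11]),
    "car":   np.array([-0.20,  1.02, 2.44]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(emb["apple"], emb["pear"]))   # high (close to 1)
print(cosine(emb["apple"], emb["car"]))    # low (close to 0)
```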
Neural word embeddings • Word embeddings have been around for quite some time • The term ‘embedding’ was coined within the neural network community, along with new methods to learn them • Idea: Let’s allocate a number of parameters for each word and allow the neural network to automatically learn what the useful values should be • Prediction-based: learn to predict the next word
Embeddings through language modeling • Predict the next word in a sequence, based on the previous word • One non-linear hidden layer, one softmax layer for classification • Choose parameters that optimize probability of correct word
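A sketch of this setup with made-up sizes: one embedding lookup, one non-linear (tanh) hidden layer, and a softmax over the vocabulary; training would adjust E, W1 and W2 to lower the negative log-likelihood of the correct next word:

```python
# Sketch of a simple neural language model: predict the next word from the previous one.
import numpy as np

rng = np.random.default_rng(0)
V, d, h = 100, 16, 32                      # vocab size, embedding size, hidden size

E = rng.normal(size=(V, d))                # word embeddings (learned in practice)
W1, b1 = rng.normal(size=(d, h)), np.zeros(h)
W2, b2 = rng.normal(size=(h, V)), np.zeros(V)

def next_word_probs(prev_word):
    hid = np.tanh(E[prev_word] @ W1 + b1)  # non-linear hidden layer
    scores = hid @ W2 + b2
    e = np.exp(scores - scores.max())
    return e / e.sum()                     # softmax over the vocabulary

p = next_word_probs(prev_word=7)
loss = -np.log(p[42])                      # negative log-likelihood of the true next word
print(loss)
```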
Embeddings through error detection • Take a correct sentence and create a corrupted counterpart • Train the network to assign a higher score to the correct version of each sentence
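A sketch of the ranking idea with a placeholder linear scorer over a 5-word window (the actual models use a hidden layer; sizes and word indices here are made up):

```python
# Sketch of a margin ranking loss between a correct window and a corrupted one.
import numpy as np

rng = np.random.default_rng(0)
V, d = 100, 16
E = rng.normal(size=(V, d))                     # word embeddings
w = rng.normal(size=5 * d)                      # scoring weights for a 5-word window

def score(window):
    return np.concatenate([E[i] for i in window]) @ w

correct = [3, 14, 15, 9, 2]
corrupted = correct.copy()
corrupted[2] = rng.integers(V)                  # replace the centre word at random

# correct version should score at least 1 higher than the corrupted one
ranking_loss = max(0.0, 1.0 - score(correct) + score(corrupted))
print(ranking_loss)
```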
Word2vec • Neural network approaches work well, but their large number of parameters makes them computationally heavy • Popular, light-weight approach with fewer parameters: word2vec • No hidden layer, only a softmax classifier • Two different models • Continuous bag of words (CBOW): predict the current word based on the surrounding context words • Skip-gram: predict the surrounding context words based on the current word
CBOW • Current word w t is predicted from context words • Prediction is made from the sum of context embeddings
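A sketch of the CBOW prediction step (made-up sizes, random parameters standing in for learned ones): the context embeddings are summed and scored directly against every vocabulary word, with no hidden layer:

```python
# Sketch of CBOW: predict the current word from the sum of its context embeddings.
import numpy as np

rng = np.random.default_rng(0)
V, d = 100, 16
E_in = rng.normal(size=(V, d))            # input (context) embeddings
E_out = rng.normal(size=(V, d))           # output (target) embeddings

context = [3, 14, 9, 2]                   # indices of the surrounding words
hidden = E_in[context].sum(axis=0)        # sum of context embeddings

scores = E_out @ hidden                   # one score per vocabulary word
e = np.exp(scores - scores.max())
probs = e / e.sum()                       # softmax distribution over the current word
print(probs.argmax())
```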
Skip-gram • Each context word is predicted from current word • Parameters for each softmax classifier are shared
Negative sampling • Computing the full softmax classifier is still rather expensive • Only compute scores for the correct context and a small number of wrong (sampled) contexts • Maximize the score of the correct context and minimize the scores of the wrong ones
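A sketch of the negative-sampling loss for a single (word, context) pair with k randomly sampled wrong contexts (made-up sizes; negatives are drawn uniformly here for simplicity, whereas word2vec samples them from a smoothed unigram distribution):

```python
# Sketch of the negative-sampling objective for one (word, context) pair.
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 100, 16, 5                      # vocab size, embedding size, negatives per pair
W = rng.normal(size=(V, d)) * 0.1         # word embeddings
C = rng.normal(size=(V, d)) * 0.1         # context embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

word, context = 7, 42                     # an observed (word, context) pair
negatives = rng.integers(V, size=k)       # randomly sampled "wrong" contexts

# push the observed context's score up, the sampled contexts' scores down
loss = -np.log(sigmoid(W[word] @ C[context]))
loss -= np.log(sigmoid(-W[word] @ C[negatives].T)).sum()
print(loss)
```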