Traitement automatique des langues : Fondements et applications Cours 10 : Neural networks (1) Tim Van de Cruys & Philippe Muller 2016—2017
Introduction Machine learning for NLP • Standard approach: linear model trained over high-dimensional but very sparse feature vectors • Recently: non-linear neural networks over dense input vectors
Neural Network Architectures Feed-forward neural networks • Best known, standard neural network approach • Fully connected layers • Can be used as drop-in replacement for typical NLP classifiers
Feature representation Dense vs. one hot • One hot: each feature is its own dimension • Dimensionality of the vector equals the number of features • Features are completely independent of one another • Dense: each feature is a d-dimensional vector • Dimensionality is d • Similar features have similar vectors
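To make the contrast concrete, here is a minimal sketch (with a toy vocabulary and made-up dimensions, not anything from the course) of one-hot versus dense vectors, and of how only the dense ones can express similarity:

```python
# Sketch contrasting one-hot and dense feature vectors (illustrative values only).
import numpy as np

vocab = ["they", "jump", "high"]          # hypothetical 3-word vocabulary
d = 4                                     # hypothetical embedding size

# One-hot: dimensionality equals vocabulary size, all words equidistant.
one_hot = np.eye(len(vocab))              # row i is the one-hot vector of word i

# Dense: each word gets a d-dimensional vector (random here; normally learned).
rng = np.random.default_rng(0)
dense = rng.normal(size=(len(vocab), d))

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(one_hot[0], one_hot[1]))     # always 0.0 for distinct one-hot vectors
print(cosine(dense[0], dense[1]))         # can reflect similarity once trained
```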
Feature representation Feature combinations • Traditional NLP: specify interactions of features • E.g. features like 'word is jump, tag is V and previous word is they' • Non-linear network: only specify core features • Non-linearity of network takes care of finding indicative feature combinations
Feature representation Why dense? • Discrete approach often works surprisingly well for NLP tasks • n-gram language models • POS-tagging, parsing • sentiment analysis • Still, a very poor representation of word meaning • No notion of similarity • Limited inference
Feature representation Why dense? Example one-hot vectors for two different words (no similarity between them): [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 ] and [ 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
Feed-forward Architecture Multi-layer perceptron with 2 hidden layers
NN_MLP2(x) = y
h_1 = g(x W^1 + b^1)
h_2 = g(h_1 W^2 + b^2)
y = h_2 W^3
• x: input vector of size d_in = 3 • y: output vector of size d_out = 2 • h_1, h_2: hidden vectors of size d_hidden = 4
• W^1, W^2, W^3: weight matrices of size [3 × 4], [4 × 4], [4 × 2] • b^1, b^2: 'bias' vectors of size d_hidden = 4 • g(·): non-linear activation function (applied elementwise)
• W^1, W^2, W^3, b^1, b^2 = parameters of the network (θ) • Use of multiple hidden layers: deep learning
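A minimal numpy sketch of the forward pass above, using the slide's dimensions (d_in = 3, d_hidden = 4, d_out = 2); the random weights stand in for parameters that would normally be learned:

```python
# Sketch of the 2-hidden-layer MLP forward pass with the slide's dimensions.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 3, 4, 2

W1 = rng.normal(size=(d_in, d_hidden));     b1 = np.zeros(d_hidden)
W2 = rng.normal(size=(d_hidden, d_hidden)); b2 = np.zeros(d_hidden)
W3 = rng.normal(size=(d_hidden, d_out))

def g(z):                        # non-linear activation, here tanh
    return np.tanh(z)

def nn_mlp2(x):
    h1 = g(x @ W1 + b1)
    h2 = g(h1 @ W2 + b2)
    return h2 @ W3               # y, an unnormalized score vector of size d_out

x = rng.normal(size=d_in)
print(nn_mlp2(x))
```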
Feed-forward Non-linear activation functions Sigmoid (logistic) function: σ(x) = 1 / (1 + e^{−x}) [plot of σ(x) for −6 ≤ x ≤ 6, ranging from 0 to 1]
Feed-forward Non-linear activation functions Hyperbolic tangent (tanh) function: tanh(x) = (e^{2x} − 1) / (e^{2x} + 1)
Feed-forward Non-linear activation functions Rectified linear unit (ReLU) ReLU ( x ) = max ( 0 , x )
Feed-forward Output transformation function Softmax function: for x = x_1, ..., x_k, softmax(x_i) = e^{x_i} / Σ_{j=1}^{k} e^{x_j}
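The activation and output functions of the last few slides are short enough to write down directly; a small numpy sketch (the max-shift inside softmax is a standard numerical-stability trick, not something from the slides):

```python
# Sketch of the activation and output functions above (numpy, elementwise).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return (np.exp(2 * x) - 1) / (np.exp(2 * x) + 1)   # equals np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - np.max(x))    # shift by the max for numerical stability
    return e / e.sum()

scores = np.array([1.0, 2.0, 3.0])
print(softmax(scores))           # sums to 1, largest score gets highest probability
```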
Feed-forward Input vector • Each input word/feature is mapped to its embedding by lookup in an embedding matrix • The embeddings are then concatenated or summed to form the input vector
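A small sketch of both options, with a made-up embedding matrix E and made-up word indices:

```python
# Sketch of building the network input from word embeddings.
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                         # vocabulary size, embedding size
E = rng.normal(size=(V, d))          # embedding matrix (a parameter of the network)

window = [2, 7, 5]                   # indices of the words in the input window

x_concat = np.concatenate([E[i] for i in window])   # size 3*d, keeps word order
x_sum = sum(E[i] for i in window)                   # size d, order-insensitive
print(x_concat.shape, x_sum.shape)
```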
Feed-forward Loss functions • L(ŷ, y): the loss of predicting ŷ when the true output is y • Set the parameters θ so as to minimize the loss across the training examples • Compute the gradient of the loss with respect to the parameters and take steps in the direction that decreases the loss
Feed-forward Loss functions • Hinge loss (binary and multi-class) • Score the correct class above the incorrect class(es) with a margin of at least 1 • Categorical cross-entropy loss (negative log-likelihood) • Measures the difference between the true class distribution y and the predicted class distribution ŷ • Use with softmax output • Ranking loss • In an unsupervised setting: rank attested examples above unattested, corrupted ones with a margin of at least 1
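A sketch of the cross-entropy and multi-class hinge losses for a single example (toy scores, numpy only); the ranking loss works like the hinge but compares an attested and a corrupted example:

```python
# Sketch of two of the losses above for a single example.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cross_entropy(scores, true_class):
    # negative log-likelihood of the true class under the softmax distribution
    return -np.log(softmax(scores)[true_class])

def multiclass_hinge(scores, true_class):
    # correct class should beat the best wrong class by a margin of at least 1
    wrong = np.delete(scores, true_class)
    return max(0.0, 1.0 - (scores[true_class] - wrong.max()))

scores = np.array([2.0, 0.5, -1.0])
print(cross_entropy(scores, 0), multiclass_hinge(scores, 0))
```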
Training Stochastic gradient descent • Goal: minimize total loss Σ_{i=1}^{n} L(f(x_i; θ), y_i) • Estimating gradient over entire training set before taking step is computationally heavy • Compute gradient for small batch of samples from training set → estimate of gradient: stochastic • Learning rate λ: size of step in right direction • Improvements: momentum, adaptive learning rate
Training Stochastic gradient descent • Size of mini-batch: balance between a better gradient estimate and faster convergence • Gradients over the different parameters (weight matrices, bias terms, embeddings, ...) are efficiently calculated using the backpropagation algorithm • No need to carry out the derivations yourself: automatic tools exist for gradient computation using computational graphs
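A sketch of mini-batch SGD on a deliberately simple model (a linear regressor with squared loss), so the gradient can be written by hand; in a real network the gradient would come from backpropagation, and the data, sizes and learning rate here are made up:

```python
# Sketch of mini-batch stochastic gradient descent on a toy regression problem.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # toy training set
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)   # noisy targets

w = np.zeros(5)                                # parameters theta
lr, batch_size = 0.1, 32                       # learning rate, mini-batch size

for epoch in range(10):
    order = rng.permutation(len(X))            # shuffle the training set each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        pred = X[idx] @ w
        grad = 2 * X[idx].T @ (pred - y[idx]) / len(idx)   # gradient on the batch
        w -= lr * grad                         # step against the gradient

print(np.abs(w - true_w).max())                # close to 0 after training
```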
Training Initialization • Parameters of network are initialized randomly • Magnitude of random samples has effect on training success • effective initialization schemes exist
Training Misc • Shuffling: shuffle training set with each epoch • Learning rate: balance between proper convergence and fast convergence • Minibatch: balance speed/proper estimate; efficient using GPU
Training Regularization • Neural networks have many parameters: risk of overfitting • Solution: regularization • L2: extend loss function with a squared penalty on the parameters, i.e. (λ/2) ||θ||² • Dropout: randomly dropping (setting to zero) half of the neurons in the network for each training sample
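A sketch of the two regularizers (numpy only, made-up sizes); the rescaling in dropout is the common "inverted dropout" variant so that nothing changes at test time, a detail not on the slide:

```python
# Sketch of L2 regularization and dropout.
import numpy as np

rng = np.random.default_rng(0)

def l2_penalty(params, lam=1e-3):
    # (lambda / 2) * ||theta||^2, summed over all parameter arrays
    return 0.5 * lam * sum(np.sum(p ** 2) for p in params)

def dropout(h, p_drop=0.5, train=True):
    if not train:
        return h                       # no dropout at test time
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)   # rescale so the expected activation is unchanged

h = rng.normal(size=8)
print(l2_penalty([h]), dropout(h))
```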
Word embeddings • Each word i is represented by a small, dense vector v i ∈ R d • d is typically in the range 50–1000 • Matrix of size V (vocabulary size) × d (embedding size) • words are ‘embedded’ in a real-valued, low-dimensional space • Similar words have similar embeddings
Word embeddings Example embedding matrix (one row per word):
         d1      d2      d3   ...
apple  –2.34   –1.01    0.33
pear   –2.28   –1.20    0.11
car    –0.20    1.02    2.44
...
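Using the illustrative values from the table above, a quick check that similar words end up with similar vectors (cosine similarity):

```python
# Sketch: cosine similarity between the example embeddings above.
import numpy as np

emb = {
    "apple": np.array([-2.34, -1.01, 0.33]),
    "pear":  np.array([-2.28, -1.20, 0.11]),
    "car":   np.array([-0.20,  1.02, 2.44]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(emb["apple"], emb["pear"]))   # high (close to 1)
print(cosine(emb["apple"], emb["car"]))    # low (close to 0)
```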
Neural word embeddings • Word embeddings have been around for quite some time • The term ‘embedding’ was coined within the neural network community, along with new methods to learn them • Idea: Let’s allocate a number of parameters for each word and allow the neural network to automatically learn what the useful values should be • Prediction-based: learn to predict the next word
Embeddings through language modeling • Predict the next word in a sequence, based on the previous word • One non-linear hidden layer, one softmax layer for classification • Choose parameters that optimize probability of correct word
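A sketch of this setup with made-up sizes: one embedding lookup, one non-linear (tanh) hidden layer, and a softmax over the vocabulary; training would adjust E, W1 and W2 to lower the negative log-likelihood of the correct next word:

```python
# Sketch of a simple neural language model: predict the next word from the previous one.
import numpy as np

rng = np.random.default_rng(0)
V, d, h = 100, 16, 32                      # vocab size, embedding size, hidden size

E = rng.normal(size=(V, d))                # word embeddings (learned in practice)
W1, b1 = rng.normal(size=(d, h)), np.zeros(h)
W2, b2 = rng.normal(size=(h, V)), np.zeros(V)

def next_word_probs(prev_word):
    hid = np.tanh(E[prev_word] @ W1 + b1)  # non-linear hidden layer
    scores = hid @ W2 + b2
    e = np.exp(scores - scores.max())
    return e / e.sum()                     # softmax over the vocabulary

p = next_word_probs(prev_word=7)
loss = -np.log(p[42])                      # negative log-likelihood of the true next word
print(loss)
```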
Embeddings through error detection • Take a correct sentence and create a corrupted counterpart • Train the network to assign a higher score to the correct version of each sentence
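A sketch of the ranking idea with a placeholder linear scorer over a 5-word window (the actual models use a hidden layer; sizes and word indices here are made up):

```python
# Sketch of a margin ranking loss between a correct window and a corrupted one.
import numpy as np

rng = np.random.default_rng(0)
V, d = 100, 16
E = rng.normal(size=(V, d))                     # word embeddings
w = rng.normal(size=5 * d)                      # scoring weights for a 5-word window

def score(window):
    return np.concatenate([E[i] for i in window]) @ w

correct = [3, 14, 15, 9, 2]
corrupted = correct.copy()
corrupted[2] = rng.integers(V)                  # replace the centre word at random

# correct version should score at least 1 higher than the corrupted one
ranking_loss = max(0.0, 1.0 - score(correct) + score(corrupted))
print(ranking_loss)
```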
Word2vec • Neural network approaches work well, but their large number of parameters makes them computationally heavy • Popular, light-weight approach with fewer parameters: word2vec • No hidden layer, only a softmax classifier • Two different models • Continuous bag of words (CBOW): predict the current word based on the surrounding context words • Skip-gram: predict the surrounding context words based on the current word
CBOW • Current word w t is predicted from context words • Prediction is made from the sum of context embeddings
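A sketch of the CBOW prediction step (made-up sizes, random parameters standing in for learned ones): the context embeddings are summed and scored directly against every vocabulary word, with no hidden layer:

```python
# Sketch of CBOW: predict the current word from the sum of its context embeddings.
import numpy as np

rng = np.random.default_rng(0)
V, d = 100, 16
E_in = rng.normal(size=(V, d))            # input (context) embeddings
E_out = rng.normal(size=(V, d))           # output (target) embeddings

context = [3, 14, 9, 2]                   # indices of the surrounding words
hidden = E_in[context].sum(axis=0)        # sum of context embeddings

scores = E_out @ hidden                   # one score per vocabulary word
e = np.exp(scores - scores.max())
probs = e / e.sum()                       # softmax distribution over the current word
print(probs.argmax())
```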
Skip-gram • Each context word is predicted from current word • Parameters for each softmax classifier are shared
Negative sampling • Computing the full softmax classifier is still rather expensive • Only compute scores for the correct context and a small number of wrong (sampled) contexts • Maximize the score of the correct context and minimize the scores of the wrong ones
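A sketch of the negative-sampling loss for a single (word, context) pair with k randomly sampled wrong contexts (made-up sizes; negatives are drawn uniformly here for simplicity, whereas word2vec samples them from a smoothed unigram distribution):

```python
# Sketch of the negative-sampling objective for one (word, context) pair.
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 100, 16, 5                      # vocab size, embedding size, negatives per pair
W = rng.normal(size=(V, d)) * 0.1         # word embeddings
C = rng.normal(size=(V, d)) * 0.1         # context embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

word, context = 7, 42                     # an observed (word, context) pair
negatives = rng.integers(V, size=k)       # randomly sampled "wrong" contexts

# push the observed context's score up, the sampled contexts' scores down
loss = -np.log(sigmoid(W[word] @ C[context]))
loss -= np.log(sigmoid(-W[word] @ C[negatives].T)).sum()
print(loss)
```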