
Modern Neural-Networks approaches to NLP - J.-C. Chappelier



  1. Modern Neural-Networks approaches to NLP
     J.-C. Chappelier, Laboratoire d'Intelligence Artificielle, Faculté I&C, EPFL

  2. Objectives of this lecture
     CAVEAT/REMINDER (from the "Introduction to NLP" course, J.-C. Chappelier & M. Rajman): so, is this course a Machine Learning course?
     ◮ NLP makes use of Machine Learning (as would Image Processing, for instance)
     ◮ but good results require:
       ◮ good preprocessing
       ◮ good data (to learn from), relevant annotations
       ◮ good understanding of the pros/cons, features, outputs, results, ...
     ☞ The goal of this course is to provide you with the core concepts and baseline techniques to achieve the above-mentioned requirements.
     The goal of this lecture is to give a broad overview of modern Neural Network approaches to NLP. It is worth deepening with a full Deep Learning course, e.g.:
     ◮ F. Fleuret (Master): Deep learning (EE-559)
     ◮ J. Henderson (EDOC): Deep Learning For Natural Language Processing (EE-608)

  3. Contents
     ➀ Introduction
       ◮ What is it all about? What does it change?
       ◮ Why now?
       ◮ Is it worth it?
     ➁ How does it work?
       ◮ words (word2vec (CBOW, Skip-gram), GloVe, fastText)
       ◮ documents (RNN, CNN, LSTM, GRU)
     ➂ Conclusion
       ◮ Advantages and drawbacks
       ◮ Future

  4. What is it all about?
     The modern approach to NLP heavily emphasizes "Neural Networks" and "Deep Learning".
     Two key ideas (which are, in fact, quite independent):
     ◮ make use of a more abstract/algebraic representation of words: use "word embeddings":
       ◮ go from sparse (& high-dimensional) to dense (& less high-dimensional) representations of documents (see the sketch below)
     ◮ make use of ("deep") neural networks (= trainable non-linear functions)
     Other characteristics:
     ◮ supervised tasks
     ◮ better results (at least on usual benchmarks)
     ◮ less? preprocessing/"feature selection"
     ◮ CPU and data consuming
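To make the sparse-to-dense point concrete, here is a minimal sketch (not from the lecture; the tiny vocabulary, the embedding dimension and the random embedding matrix are made-up assumptions):

```python
import numpy as np

# Assumed toy sizes: a real vocabulary has 10^5-10^6 entries, embeddings ~100-300 dims.
vocab = ["cat", "dog", "car"]
V, d = len(vocab), 4

def one_hot(word):
    # Sparse representation: a vector of size V with a single 1 (grows with the vocabulary).
    v = np.zeros(V)
    v[vocab.index(word)] = 1.0
    return v

# Dense representation: one row of an embedding matrix E.
# Here E is random; in practice it is learned (word2vec, GloVe, fastText, ...).
E = np.random.default_rng(0).normal(size=(V, d))

def embed(word):
    return E[vocab.index(word)]        # a short dense vector, independent of V

print(one_hot("cat"))                  # [1. 0. 0.]
print(embed("cat"))                    # 4 dense real numbers
```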

  5. How does it work?
     ◮ Key idea #1: Learning Word Representations
       Typical NLP: Corpus –> some algorithm –> word/token/n-gram vectors
       Key idea in recent approaches: can we do it task-independently? so as to reduce whatever NL P(rocessing) to some algebraic vector manipulation: no longer start "core (NL)P" from words, but from vectors (learned once and for all) that capture general syntactic and semantic information
     ◮ Key idea #2: use Neural Networks (NN) to do the "from vectors to output" job
       A NN is simply an R^n → R^m non-linear function with (many) parameters
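As an illustration of "reducing NLP to algebraic vector manipulation": relatedness between words becomes a cosine similarity between their vectors. A hedged sketch; the 4-dimensional vectors below are hypothetical placeholders, not actual word2vec/GloVe output:

```python
import numpy as np

# Hypothetical word vectors; real embeddings have 100+ dimensions and are learned from a corpus.
vec = {
    "cat": np.array([0.9, 0.1, 0.3, 0.0]),
    "dog": np.array([0.8, 0.2, 0.4, 0.1]),
    "car": np.array([0.0, 0.9, 0.1, 0.8]),
}

def cosine(u, v):
    # Cosine similarity: close to 1 for similar directions, near 0 for unrelated ones.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vec["cat"], vec["dog"]))  # high: related words end up with close vectors
print(cosine(vec["cat"], vec["car"]))  # lower: unrelated words
```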

  6. Neural Networks (NN): a 4-slide primer
     ◮ NN are non-linear, non-parametric (= many-parameter model) functions
     ◮ The ones we are talking about here are for supervised learning
       ☞ make use of a loss function to evaluate how their output fits the desired output
       usual loss: corpus (negative) log-likelihood ∝ P(output | input)
     ◮ non-linearity: localised on each "neuron" (1-D non-linear function),
       sigmoid-like (e.g. the logistic function 1/(1 + e^(-x))) or ReLU (weird name for a very simple function: max(0, x))
       [two plots: sigmoid(x) and max(0, x) over the range -10..10]
     ◮ the non-linearity is applied to a linear combination of the input: dot product of the input (vector) and the parameters ("weight" vector); see the sketch below
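A minimal sketch of a single neuron as described above (a non-linearity applied to a dot product); the input and weight values are arbitrary illustrations:

```python
import numpy as np

def sigmoid(x):
    # Logistic non-linearity: squashes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # ReLU: max(0, x), i.e. identity for positive inputs, zero otherwise.
    return np.maximum(0.0, x)

def neuron(x, w, b, activation=sigmoid):
    # One "neuron": non-linearity applied to a linear combination of the input.
    return activation(x @ w + b)

x = np.array([0.5, -1.2, 3.0])               # input vector
w = np.array([0.4, 0.1, -0.3])               # weight vector (the neuron's parameters)
print(neuron(x, w, b=0.2))                   # sigmoid neuron
print(neuron(x, w, b=0.2, activation=relu))  # ReLU neuron
```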

  7. Softmax output function
     Another famous non-linearity is the "softmax" function.
     softmax = generalization from 1-D to n-D of the logistic function (see e.g. "Logistic Regression", 2 weeks ago)
     Purpose: turns whatever list of values into a probability distribution:
     (x_1, ..., x_m) ↦ (s_1, ..., s_m)   where   s_i = e^{x_i} / ∑_{j=1}^{m} e^{x_j}
     Examples:
     x = (7, 12, -4, 8, 4) → s ≈ (0.0066, 0.9752, 1e-7, 0.0179, 0.0003)
     x = (0.33, 0.5, 0.1, 0.07) → s ≈ (0.266, 0.316, 0.211, 0.206)
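A short implementation of the softmax formula above; subtracting the maximum is a standard numerical-stability trick (not mentioned on the slide) that leaves the result unchanged:

```python
import numpy as np

def softmax(x):
    # exp(x - max(x)) avoids overflow for large inputs; the shift cancels in the ratio.
    z = np.exp(x - np.max(x))
    return z / z.sum()

print(softmax(np.array([7.0, 12.0, -4.0, 8.0, 4.0])))   # reproduces the first example (up to rounding)
print(softmax(np.array([0.33, 0.5, 0.1, 0.07])))        # reproduces the second example
print(softmax(np.array([0.33, 0.5, 0.1, 0.07])).sum())  # 1.0: a probability distribution
```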

  8. Multi-Layer Perceptrons (MLP), a.k.a. Feed-Forward NN (FFNN)
     MLP (Rumelhart, 1986): neurons are organized in (a few) layers, from input to output.
     Parameters: the "weights" of the network = the input weights of each neuron.
     MLP are universal approximators: input x_1, ..., x_n (an n-dimensional real vector), output ≃ f(x_1, ..., x_n) ∈ R^m, to whatever precision decided a priori.
     In a probabilistic framework: very often used to approximate the posterior probability P(y_1, ..., y_m | x_1, ..., x_n).
     Convergence to a local minimum of the loss function (often the mean quadratic error).
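A hedged sketch of an MLP forward pass with one hidden layer, a sigmoid non-linearity and a softmax output; the layer sizes and the (untrained) random weights are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, h, m = 5, 8, 3                      # input, hidden and output sizes (arbitrary)

# Parameters = the weight matrices and biases of the layers (random here, i.e. untrained).
W1, b1 = rng.normal(size=(n, h)), np.zeros(h)
W2, b2 = rng.normal(size=(h, m)), np.zeros(m)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def mlp(x):
    # Each layer applies a non-linearity to a linear combination of its input.
    hidden = sigmoid(x @ W1 + b1)
    return softmax(hidden @ W2 + b2)   # can be read as P(y | x) over m classes

x = rng.normal(size=n)
print(mlp(x))                          # m probabilities summing to 1
```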

  9. NN learning procedure
     General learning procedure (compare e.g. with Baum-Welch):
     ➀ Initialize the parameters
     ➁ Then loop over the training data (supervised):
       1. Compute (using the NN) the output from the given input
       2. Compute the loss by comparing the output to the reference
       3. Update the parameters ("backpropagation"): update proportional to the gradient of the loss function
       4. Stop when some criterion is fulfilled (e.g. the loss function is small, the validation-set error increases, the number of steps is reached)
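A minimal sketch of this training loop on a toy problem. To keep it self-contained, the "network" is a single linear layer with a squared-error loss, so the gradient can be written by hand; real frameworks obtain it by backpropagation. All data and hyper-parameters are made up:

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy supervised data: reference outputs generated from a known rule plus noise.
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.01 * rng.normal(size=100)

w = np.zeros(2)                        # step 1: initialize the parameters
lr = 0.1                               # learning rate (gradient step size)

for step in range(200):                # step 2: loop over the training data
    out = X @ w                        # 2.1 compute the output from the input
    loss = np.mean((out - y) ** 2)     # 2.2 compare the output to the reference
    grad = 2 * X.T @ (out - y) / len(y)
    w -= lr * grad                     # 2.3 update proportional to the gradient
    if loss < 1e-3:                    # 2.4 stop when a criterion is fulfilled
        break

print(step, loss, w)                   # w ends up close to [2, -1]
```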

  10. about Deep Learning (more later)
      ◮ not all Neural Network (NN) models are deep learners
      ◮ there is NO need of deep learning for good "word" embeddings
      ◮ models: convolutional NN (CNN) or recurrent NN (RNN, incl. LSTM)
      ◮ they still suffer from the same old problems: overfitting and computational power
      A quote from Prof. Michael Jordan (IEEE Spectrum, 2014): "deep learning is largely a rebranding of neural networks, which go back to the 1980s. They actually go back to the 1960s; it seems like every 20 years there is a new wave that involves them. In the current wave, the main success story is the convolutional neural network, but that idea was already present in the previous wave."
      Why such a revival now? ☞ many more data (user-data pillage), more computational power (GPUs)

  11. What is Deep Learning after all?
      A composition of many functions (neural-net layers) taking advantage of
      ◮ the chain rule (a.k.a. "back-propagation")
      ◮ stochastic gradient descent
      ◮ parameter sharing/localization of computation (a.k.a. "convolutions"); see the sketch below
      ◮ parallel operations on GPUs
      This does not differ much from the networks of the 90s: several tricks and algorithmic improvements backed up by
      1. large data sets (user-data pillage)
      2. large computational resources (GPUs popularized)
      3. enthusiasm from academia and industry (hype)
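A hedged illustration of the "parameter sharing / convolution" point: the same small filter is slid along a sequence, so the number of parameters does not grow with the input length. The sequence and filter values are made up:

```python
import numpy as np

# A sequence of 8 scalar features; in NLP this would be a sequence of word vectors.
seq = np.array([0.1, 0.4, -0.2, 0.9, 0.3, -0.5, 0.7, 0.0])

# One shared filter of width 3: only 3 parameters, reused at every position.
filt = np.array([0.5, 1.0, 0.5])

def conv1d(x, w):
    # Slide the same weights over the input: parameter sharing + local computation.
    width = len(w)
    return np.array([x[i:i + width] @ w for i in range(len(x) - width + 1)])

print(conv1d(seq, filt))               # 6 outputs, all produced by the same 3 weights
```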

  12. Corpus-based linguistics: the evolution
      ◮ before corpora (< 1970): hand-written rules
      ◮ first wave (≃ 1980-2015): probabilistic models (HMM, SCFG, CRF, ...)
      ◮ neural nets and "word" embeddings (1986, 1990, 1997, 2003, 2011, 2013+):
        ◮ MLP: David Rumelhart, 1986
        ◮ RNN: Jeffrey Elman, 1990
        ◮ LSTM: Hochreiter and Schmidhuber, 1997
        ◮ early NN word embeddings: Yoshua Bengio et al., 2003; Collobert & Weston (et al.), 2008 & 2011
        ◮ word2vec (2013), GloVe (2014)
        ◮ ...
      ◮ transfer learning (2018–): ULMFiT (2018), ELMo (2018), BERT (2018), OpenAI GPT-2 (2019)
        These use even more than "word" embeddings: pre-trained early layers feed the later layers of some NN, followed by a (shallow?) task-specific architecture that is trained in a supervised way (see the sketch below).
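A hedged sketch of the transfer-learning pattern described above: a frozen, pre-trained encoder feeds a small task-specific head that is trained in a supervised way. `pretrained_encode` below is a hypothetical placeholder, not the API of ULMFiT/ELMo/BERT:

```python
import numpy as np

d, n_classes = 16, 2                   # encoder output size and number of task labels (assumed)

def pretrained_encode(sentence):
    # Hypothetical stand-in for frozen pre-trained early layers (ELMo/BERT-style):
    # here it just maps the text deterministically to a fixed vector.
    seed = abs(hash(sentence)) % (2 ** 32)
    return np.random.default_rng(seed).normal(size=d)

W = np.zeros((d, n_classes))           # only this shallow task-specific head is trained

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def train_step(sentence, label, lr=0.1):
    global W
    h = pretrained_encode(sentence)    # frozen representation (no gradient flows into it)
    p = softmax(h @ W)                 # task head prediction
    W -= lr * np.outer(h, p - np.eye(n_classes)[label])   # cross-entropy gradient step

for _ in range(50):
    train_step("great movie", 1)       # toy labelled examples (made up)
    train_step("terrible movie", 0)

print(softmax(pretrained_encode("great movie") @ W))       # should now favour class 1
```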
