deep architectures for natural language processing
Sergey I. Nikolenko¹,²
DataFest⁴ Moscow, February 11, 2017
¹ Laboratory for Internet Studies, NRU Higher School of Economics, St. Petersburg
² Steklov Institute of Mathematics at St. Petersburg
⁴ Not really a footnote mark, just the way DataFests prefer to be numbered
Random facts:
• February 11, the birthday of Thomas Alva Edison, was proclaimed National Inventors' Day by Ronald Reagan in 1983
• Ten years later, in 1993, Pope John Paul II proclaimed February 11 the World Day of the Sick, “a special time of... offering one's suffering”
plan
• The deep learning revolution has not left natural language processing alone.
• DL in NLP started with standard architectures (RNNs, CNNs) but has since branched out into new directions.
• You have already heard about distributed word representations; now let us take a ((very-)very) brief look at the most promising directions in modern deep-learning-based NLP.
• We will concentrate on NLP problems that have given rise to new models and architectures.
nlp problems
• NLP is a very diverse field. Types of NLP problems:
  • well-defined syntactic problems with semantic complications:
    • part-of-speech tagging;
    • morphological segmentation;
    • stemming and lemmatization;
    • sentence boundary disambiguation and word segmentation;
    • named entity recognition;
    • word sense disambiguation;
    • syntactic parsing;
    • coreference resolution;
  • well-defined semantic problems:
    • language modeling;
    • sentiment analysis;
    • relationship/fact extraction;
    • question answering;
  • text generation problems, usually not so very well defined:
    • text generation per se;
    • automatic summarization;
    • machine translation;
    • dialog and conversational models...
basic nn architectures
• Basic neural network architectures that have been adapted for deep learning over the last decade:
  • feedforward NNs are the basic building block;
  • autoencoders map a (possibly distorted) input to itself, usually for feature engineering;
  • convolutional NNs apply NNs with shared weights to certain windows in the previous layer (or input), collecting first local and then more and more global features;
  • recurrent NNs have a hidden state and propagate it further, used for sequence learning;
  • in particular, LSTM (long short-term memory) and GRU (gated recurrent unit) units fight vanishing gradients and are often used for NLP since they are good for longer dependencies.
word embeddings
• Distributional hypothesis in linguistics: words with similar meanings occur in similar contexts.
• Distributed word representations (word2vec, GloVe, and variations) map words to a Euclidean space (usually of dimension several hundred).
• Some sample nearest neighbors (from a Russian word2vec model):
  • любовь 'love': жизнь 'life', нелюбовь 'dislike', приязнь 'affection', боль 'pain', страсть 'passion';
  • синоним 'synonym': антоним 'antonym', эвфемизм 'euphemism', анаграмма 'anagram', омоним 'homonym', оксюморон 'oxymoron';
  • программист 'programmer (masc.)': компьютерщик 'computer guy', программер 'coder', электронщик 'electronics guy', автомеханик 'car mechanic', криптограф 'cryptographer';
  • программистка 'programmer (fem.)': стажерка 'intern (fem.)', инопланетянка 'alien (fem.)', американочка 'American girl', предпринимательница 'entrepreneur (fem.)', студенточка 'student (fem., dimin.)'.
• How do we use them?
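Nearest neighbors like the ones above can be queried directly from pretrained vectors. A minimal sketch with gensim, assuming a pretrained word2vec file is available (the file name and format here are placeholders, not from the slides):

```python
from gensim.models import KeyedVectors

# Load pretrained vectors (path/format are placeholders for whatever model you have).
wv = KeyedVectors.load_word2vec_format("ruscorpora_vectors.bin", binary=True)

# Top-5 nearest neighbors by cosine similarity in the embedding space.
print(wv.most_similar("программист", topn=5))

# The well-known linear regularities can be probed with vector arithmetic:
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```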
how to use word vectors: recurrent architectures
• Recurrent architectures on top of word vectors; this is straight from basic Keras tutorials:
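For instance, a minimal sketch in today's tf.keras; the vocabulary size, dimensions, and binary-sentiment output below are illustrative assumptions, not from the slides:

```python
from tensorflow.keras import layers, models

# Word indices -> embedding lookup -> LSTM -> binary classification (e.g., sentiment).
model = models.Sequential([
    layers.Embedding(input_dim=20000, output_dim=128),  # lookup table of word vectors
    layers.LSTM(64),                                    # last hidden state summarizes the sequence
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```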
how to use word vectors: recurrent architectures
• Often bidirectional, providing both left and right context for each word:
how to use word vectors: recurrent architectures
• And you can make them deep (but not too deep):
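A sketch of the bidirectional, two-layer variant under the same illustrative assumptions as above:

```python
from tensorflow.keras import layers, models

# Two stacked bidirectional LSTM layers over word embeddings.
deep_model = models.Sequential([
    layers.Embedding(input_dim=20000, output_dim=128),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),  # pass per-step outputs up
    layers.Bidirectional(layers.LSTM(64)),                         # second layer summarizes
    layers.Dense(1, activation="sigmoid"),
])
```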
attention in recurrent networks
• Recent important development: attention, a small (sub)network that learns which parts of the input to focus on.
• (Yang et al., 2016): Hierarchical Attention Networks; word-level, then sentence-level attention for classification (e.g., sentiment).
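A minimal sketch of such a word-level attention layer over RNN outputs, roughly in the spirit of Yang et al. (additive scoring with a learned context vector); the layer name and sizes are made up:

```python
import tensorflow as tf
from tensorflow.keras import layers

class AttentionPooling(layers.Layer):
    """Additive attention pooling: score each timestep, softmax, weighted sum."""
    def __init__(self, units=64, **kwargs):
        super().__init__(**kwargs)
        self.score = layers.Dense(units, activation="tanh")   # u_t = tanh(W h_t + b)
        self.context = layers.Dense(1, use_bias=False)        # score_t = u_t · u_w

    def call(self, hidden_states):
        # hidden_states: (batch, time, features), e.g. outputs of a bidirectional LSTM
        scores = self.context(self.score(hidden_states))       # (batch, time, 1)
        weights = tf.nn.softmax(scores, axis=1)                 # attention weights over time
        return tf.reduce_sum(weights * hidden_states, axis=1)   # (batch, features)
```

Such a layer would be plugged in between an `LSTM(..., return_sequences=True)` and the final classifier instead of simply taking the last hidden state.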
up and down from word embeddings
• Word embeddings are the first step of most DL models in NLP.
• But we can go both up and down from word embeddings.
• First, a sentence is not necessarily the sum of its words.
• How do we combine word vectors into “text chunk” vectors?
• The simplest idea is to use the sum and/or mean of word embeddings to represent a sentence/paragraph:
  • a baseline in (Le and Mikolov 2014);
  • a reasonable method for short phrases in (Mikolov et al. 2013);
  • shown to be effective for document summarization in (Kågebäck et al. 2014).
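A sketch of this baseline, assuming word vectors are available as a dict-like mapping (the function name and dimension are arbitrary):

```python
import numpy as np

def sentence_vector(tokens, word_vectors, dim=300):
    """Mean of word embeddings as a crude sentence representation.

    word_vectors: dict-like mapping token -> np.ndarray of shape (dim,)
    (e.g., a loaded word2vec/GloVe model); OOV tokens are simply skipped.
    """
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```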
sentence embeddings
• Distributed Memory Model of Paragraph Vectors (PV-DM) (Le and Mikolov 2014):
  • a sentence/paragraph vector is an additional vector trained for each paragraph;
  • it acts as a “memory” that provides longer context;
  • there is also a dual version, PV-DBOW.
• A number of convolutional architectures (Ma et al., 2015; Kalchbrenner et al., 2014).
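Paragraph vectors are implemented, e.g., in gensim's Doc2Vec; a rough sketch, where the corpus and hyperparameters are placeholders:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [["deep", "learning", "for", "nlp"], ["word", "vectors", "are", "useful"]]
docs = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(corpus)]

# dm=1 gives PV-DM; dm=0 gives the dual PV-DBOW model.
model = Doc2Vec(docs, vector_size=100, window=5, min_count=1, dm=1, epochs=20)

# Infer a vector for a new, unseen paragraph.
vec = model.infer_vector(["recurrent", "networks", "for", "text"])
```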
sentence embeddings
• Recursive neural networks (Socher et al., 2012):
  • a neural network composes a chunk of text with another part in a tree;
  • it works its way up from word vectors to the root of a parse tree.
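A bare-bones sketch of the composition step, assuming a binarized parse tree given as nested pairs and a single shared weight matrix (a deliberate simplification of the actual Socher et al. models):

```python
import numpy as np

def compose(left, right, W, b):
    # p = tanh(W [left; right] + b): the same composition is applied at every tree node.
    return np.tanh(W @ np.concatenate([left, right]) + b)

def tree_vector(node, word_vectors, W, b):
    """Recursively compute a vector for a (binary) parse-tree node.

    node is either a token string (leaf) or a (left, right) pair of subtrees;
    W has shape (d, 2d), b has shape (d,), word vectors have shape (d,).
    """
    if isinstance(node, str):
        return word_vectors[node]
    left, right = node
    return compose(tree_vector(left, word_vectors, W, b),
                   tree_vector(right, word_vectors, W, b), W, b)
```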
sentence embeddings
• Recursive neural networks (Socher et al., 2012):
  • by training this in a supervised way, one can get a very effective approach to sentiment analysis (Socher et al. 2013).
sentence embeddings
• Recursive neural networks (Socher et al., 2012):
  • further improvements (Irsoy, Cardie, 2014): decouple leaves and internal nodes and make the networks deep to get hierarchical representations;
  • but all of this depends on getting parse trees (more on that later).
word vector extensions
• Other modifications of word embeddings add external information.
• E.g., the RC-NET model (Xu et al. 2014) extends skip-grams with relations (semantic and syntactic) and categorical knowledge (sets of synonyms, domain knowledge, etc.):
  x_Hinton − x_Wimbledon ≈ s_born_at ≈ x_Euler − x_Basel
• Another important problem with both word vectors and char-level models: homonyms; the model usually just chooses one meaning.
• We have to add latent variables for the different meanings and infer them from context: Bayesian inference with stochastic variational inference (Bartunov et al., 2015).
character-level models
• Word embeddings have important shortcomings:
  • vectors are independent, but words are not; consider, in particular, morphology-rich languages like Russian;
  • the same applies to out-of-vocabulary words: a word embedding cannot be extended to new words;
  • word embedding models may grow large; it’s just a lookup table, but the whole vocabulary has to be stored in memory with fast access.
• E.g., “polydistributional” gets 48 results on Google, so you have probably never seen it, and there’s very little training data for it.
• Do you have any idea what it means? Me neither.
character-level models
• Hence, character-level representations:
  • began by decomposing a word into morphemes (Luong et al. 2013; Botha and Blunsom 2014; Soricut and Och 2015);
  • C2W (Ling et al. 2015) is based on bidirectional LSTMs over the characters of a word:
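A rough sketch in the spirit of C2W: read a word as a sequence of character indices and produce a word vector with a bidirectional LSTM (all sizes here are made-up placeholders):

```python
from tensorflow.keras import layers, models

# Characters of one word -> char embeddings -> BiLSTM -> fixed-size "word vector".
char_to_word = models.Sequential([
    layers.Embedding(input_dim=70, output_dim=32),   # ~70 symbols in the character alphabet
    layers.Bidirectional(layers.LSTM(64)),           # final states summarize the spelling
    layers.Dense(128),                               # project to the word-embedding size
])
```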
character-level models
• The approach of the Deep Structured Semantic Model (DSSM) (Huang et al., 2013; Gao et al., 2014a; 2014b):
  • sub-word embeddings: represent a word as a bag of letter trigrams;
  • the vocabulary shrinks to the set of possible trigrams, at most |A|³ for alphabet A (tens of thousands instead of millions), and collisions are very rare;
  • the representation is robust to misspellings (very important for user-generated texts).
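A tiny sketch of this word-hashing step (the boundary-marker character is an assumption):

```python
def letter_trigrams(word):
    """Bag of letter trigrams with boundary markers, as in DSSM-style word hashing."""
    padded = "#" + word + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

# letter_trigrams("good") -> ['#go', 'goo', 'ood', 'od#']
```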
character-level models
• ConvNet (Zhang et al. 2015): text understanding from scratch, starting from the level of individual characters, based on CNNs.
• Character-level models and their extensions appear to be very important, especially for morphology-rich languages like Russian.
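A compressed sketch of a character-level CNN classifier in this spirit (the model in the paper is considerably deeper; the alphabet size, filter sizes, and number of classes below are illustrative):

```python
from tensorflow.keras import layers, models

# Character indices -> char embeddings -> stacked 1D convolutions -> class probabilities.
char_cnn = models.Sequential([
    layers.Embedding(input_dim=70, output_dim=16),
    layers.Conv1D(256, 7, activation="relu"),   # local character n-gram features
    layers.MaxPooling1D(3),
    layers.Conv1D(256, 3, activation="relu"),   # more global features
    layers.GlobalMaxPooling1D(),
    layers.Dense(4, activation="softmax"),      # e.g., 4 topic classes
])
```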
modern char-based language model: Kim et al., 2015