deep architectures for natural language processing
Sergey I. Nikolenko¹,²
DataFest⁴ Moscow, February 11, 2017
¹ Laboratory for Internet Studies, NRU Higher School of Economics, St. Petersburg
² Steklov Institute of Mathematics at St. Petersburg
⁴ Not really a footnote mark, just the way DataFests prefer to be numbered
Random facts:
• February 11, the birthday of Thomas Alva Edison, was proclaimed National Inventors' Day by Ronald Reagan in 1983
• Ten years later, in 1993, Pope John Paul II proclaimed February 11 the World Day of the Sick, “a special time of... offering one's suffering”
plan
• The deep learning revolution has not left natural language processing alone.
• DL in NLP started with standard architectures (RNNs, CNNs) but has since branched out into new directions.
• You have already heard about distributed word representations; now let us take a ((very-)very) brief look at the most promising directions in modern deep-learning-based NLP.
• We will concentrate on NLP problems that have given rise to new models and architectures.
nlp problems
• NLP is a very diverse field. Types of NLP problems:
  • well-defined syntactic problems with semantic complications:
    • part-of-speech tagging;
    • morphological segmentation;
    • stemming and lemmatization;
    • sentence boundary disambiguation and word segmentation;
    • named entity recognition;
    • word sense disambiguation;
    • syntactic parsing;
    • coreference resolution;
  • well-defined semantic problems:
    • language modeling;
    • sentiment analysis;
    • relationship/fact extraction;
    • question answering;
  • text generation problems, usually not so very well defined:
    • text generation per se;
    • automatic summarization;
    • machine translation;
    • dialog and conversational models...
basic nn architectures
• Basic neural network architectures that have been adapted for deep learning over the last decade:
  • feedforward NNs are the basic building block;
  • autoencoders map a (possibly distorted) input to itself, usually for feature engineering;
  • convolutional NNs apply NNs with shared weights to certain windows in the previous layer (or input), collecting first local and then more and more global features;
  • recurrent NNs have a hidden state and propagate it further, used for sequence learning;
  • in particular, LSTM (long short-term memory) and GRU (gated recurrent unit) units fight vanishing gradients and are often used for NLP since they are good for longer dependencies.
word embeddings
• Distributional hypothesis in linguistics: words with similar meanings occur in similar contexts.
• Distributed word representations (word2vec, GloVe, and variations) map words to a Euclidean space (usually of dimension several hundred).
• Some sample nearest neighbors (from a Russian word2vec model):
  • любовь 'love': жизнь 'life', нелюбовь 'dislike', приязнь 'affection', боль 'pain', страсть 'passion';
  • синоним 'synonym': антоним 'antonym', эвфемизм 'euphemism', анаграмма 'anagram', омоним 'homonym', оксюморон 'oxymoron';
  • программист 'programmer (masc.)': компьютерщик 'computer guy', программер 'coder', электронщик 'electronics guy', автомеханик 'car mechanic', криптограф 'cryptographer';
  • программистка 'programmer (fem.)': стажерка 'intern (fem.)', инопланетянка 'alien (fem.)', американочка 'American girl', предпринимательница 'entrepreneur (fem.)', студенточка 'student (fem., dimin.)'.
• How do we use them?
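Nearest neighbors like the ones above can be queried directly from pretrained vectors. A minimal sketch with gensim, assuming a pretrained word2vec file is available (the file name and format here are placeholders, not from the slides):

```python
from gensim.models import KeyedVectors

# Load pretrained vectors (path/format are placeholders for whatever model you have).
wv = KeyedVectors.load_word2vec_format("ruscorpora_vectors.bin", binary=True)

# Top-5 nearest neighbors by cosine similarity in the embedding space.
print(wv.most_similar("программист", topn=5))

# The well-known linear regularities can be probed with vector arithmetic:
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```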
how to use word vectors: recurrent architectures
• Recurrent architectures on top of word vectors; this is straight from basic Keras tutorials:
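For instance, a minimal sketch in today's tf.keras; the vocabulary size, dimensions, and binary-sentiment output below are illustrative assumptions, not from the slides:

```python
from tensorflow.keras import layers, models

# Word indices -> embedding lookup -> LSTM -> binary classification (e.g., sentiment).
model = models.Sequential([
    layers.Embedding(input_dim=20000, output_dim=128),  # lookup table of word vectors
    layers.LSTM(64),                                    # last hidden state summarizes the sequence
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```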
how to use word vectors: recurrent architectures
• Often bidirectional, providing both left and right context for each word:
how to use word vectors: recurrent architectures
• And you can make them deep (but not too deep):
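A sketch of the bidirectional, two-layer variant under the same illustrative assumptions as above:

```python
from tensorflow.keras import layers, models

# Two stacked bidirectional LSTM layers over word embeddings.
deep_model = models.Sequential([
    layers.Embedding(input_dim=20000, output_dim=128),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),  # pass per-step outputs up
    layers.Bidirectional(layers.LSTM(64)),                         # second layer summarizes
    layers.Dense(1, activation="sigmoid"),
])
```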
attention in recurrent networks
• Recent important development: attention, a small (sub)network that learns which parts of the input to focus on.
• (Yang et al., 2016): Hierarchical Attention Networks; word-level, then sentence-level attention for classification (e.g., sentiment).
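A minimal sketch of such a word-level attention layer over RNN outputs, roughly in the spirit of Yang et al. (additive scoring with a learned context vector); the layer name and sizes are made up:

```python
import tensorflow as tf
from tensorflow.keras import layers

class AttentionPooling(layers.Layer):
    """Additive attention pooling: score each timestep, softmax, weighted sum."""
    def __init__(self, units=64, **kwargs):
        super().__init__(**kwargs)
        self.score = layers.Dense(units, activation="tanh")   # u_t = tanh(W h_t + b)
        self.context = layers.Dense(1, use_bias=False)        # score_t = u_t · u_w

    def call(self, hidden_states):
        # hidden_states: (batch, time, features), e.g. outputs of a bidirectional LSTM
        scores = self.context(self.score(hidden_states))       # (batch, time, 1)
        weights = tf.nn.softmax(scores, axis=1)                 # attention weights over time
        return tf.reduce_sum(weights * hidden_states, axis=1)   # (batch, features)
```

Such a layer would be plugged in between an `LSTM(..., return_sequences=True)` and the final classifier instead of simply taking the last hidden state.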
up and down from word embeddings
• Word embeddings are the first step of most DL models in NLP.
• But we can go both up and down from word embeddings.
• First, a sentence is not necessarily the sum of its words.
• How do we combine word vectors into “text chunk” vectors?
• The simplest idea is to use the sum and/or mean of word embeddings to represent a sentence/paragraph:
  • a baseline in (Le and Mikolov 2014);
  • a reasonable method for short phrases in (Mikolov et al. 2013);
  • shown to be effective for document summarization in (Kågebäck et al. 2014).
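A sketch of this baseline, assuming word vectors are available as a dict-like mapping (the function name and dimension are arbitrary):

```python
import numpy as np

def sentence_vector(tokens, word_vectors, dim=300):
    """Mean of word embeddings as a crude sentence representation.

    word_vectors: dict-like mapping token -> np.ndarray of shape (dim,)
    (e.g., a loaded word2vec/GloVe model); OOV tokens are simply skipped.
    """
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```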
sentence embeddings
• Distributed Memory Model of Paragraph Vectors (PV-DM) (Le and Mikolov 2014):
  • a sentence/paragraph vector is an additional vector trained for each paragraph;
  • it acts as a “memory” that provides longer context;
  • there is also a dual version, PV-DBOW.
• A number of convolutional architectures (Ma et al., 2015; Kalchbrenner et al., 2014).
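Paragraph vectors are implemented, e.g., in gensim's Doc2Vec; a rough sketch, where the corpus and hyperparameters are placeholders:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [["deep", "learning", "for", "nlp"], ["word", "vectors", "are", "useful"]]
docs = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(corpus)]

# dm=1 gives PV-DM; dm=0 gives the dual PV-DBOW model.
model = Doc2Vec(docs, vector_size=100, window=5, min_count=1, dm=1, epochs=20)

# Infer a vector for a new, unseen paragraph.
vec = model.infer_vector(["recurrent", "networks", "for", "text"])
```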
sentence embeddings
• Recursive neural networks (Socher et al., 2012):
  • a neural network composes a chunk of text with another part in a tree;
  • it works its way up from word vectors to the root of a parse tree.
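A bare-bones sketch of the composition step, assuming a binarized parse tree given as nested pairs and a single shared weight matrix (a deliberate simplification of the actual Socher et al. models):

```python
import numpy as np

def compose(left, right, W, b):
    # p = tanh(W [left; right] + b): the same composition is applied at every tree node.
    return np.tanh(W @ np.concatenate([left, right]) + b)

def tree_vector(node, word_vectors, W, b):
    """Recursively compute a vector for a (binary) parse-tree node.

    node is either a token string (leaf) or a (left, right) pair of subtrees;
    W has shape (d, 2d), b has shape (d,), word vectors have shape (d,).
    """
    if isinstance(node, str):
        return word_vectors[node]
    left, right = node
    return compose(tree_vector(left, word_vectors, W, b),
                   tree_vector(right, word_vectors, W, b), W, b)
```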
sentence embeddings
• Recursive neural networks (Socher et al., 2012):
  • by training this in a supervised way, one can get a very effective approach to sentiment analysis (Socher et al. 2013).
sentence embeddings
• Recursive neural networks (Socher et al., 2012):
  • further improvements (Irsoy, Cardie, 2014): decouple leaves and internal nodes and make the networks deep to get hierarchical representations;
  • but all of this depends on getting parse trees (more on that later).
word vector extensions
• Other modifications of word embeddings add external information.
• E.g., the RC-NET model (Xu et al. 2014) extends skip-grams with relations (semantic and syntactic) and categorical knowledge (sets of synonyms, domain knowledge, etc.):
  x_Hinton − x_Wimbledon ≈ s_born_at ≈ x_Euler − x_Basel
• Another important problem with both word vectors and char-level models: homonyms; the model usually just chooses one meaning.
• We have to add latent variables for the different meanings and infer them from context: Bayesian inference with stochastic variational inference (Bartunov et al., 2015).
character-level models
• Word embeddings have important shortcomings:
  • vectors are independent, but words are not; consider, in particular, morphology-rich languages like Russian;
  • the same applies to out-of-vocabulary words: a word embedding cannot be extended to new words;
  • word embedding models may grow large; it’s just a lookup table, but the whole vocabulary has to be stored in memory with fast access.
• E.g., “polydistributional” gets 48 results on Google, so you have probably never seen it, and there’s very little training data for it.
• Do you have any idea what it means? Me neither.
character-level models
• Hence, character-level representations:
  • began by decomposing a word into morphemes (Luong et al. 2013; Botha and Blunsom 2014; Soricut and Och 2015);
  • C2W (Ling et al. 2015) is based on bidirectional LSTMs over the characters of a word:
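A rough sketch in the spirit of C2W: read a word as a sequence of character indices and produce a word vector with a bidirectional LSTM (all sizes here are made-up placeholders):

```python
from tensorflow.keras import layers, models

# Characters of one word -> char embeddings -> BiLSTM -> fixed-size "word vector".
char_to_word = models.Sequential([
    layers.Embedding(input_dim=70, output_dim=32),   # ~70 symbols in the character alphabet
    layers.Bidirectional(layers.LSTM(64)),           # final states summarize the spelling
    layers.Dense(128),                               # project to the word-embedding size
])
```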
character-level models
• The approach of the Deep Structured Semantic Model (DSSM) (Huang et al., 2013; Gao et al., 2014a; 2014b):
  • sub-word embeddings: represent a word as a bag of letter trigrams;
  • the vocabulary shrinks to the set of possible trigrams, at most |A|³ for alphabet A (tens of thousands instead of millions), and collisions are very rare;
  • the representation is robust to misspellings (very important for user-generated texts).
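A tiny sketch of this word-hashing step (the boundary-marker character is an assumption):

```python
def letter_trigrams(word):
    """Bag of letter trigrams with boundary markers, as in DSSM-style word hashing."""
    padded = "#" + word + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

# letter_trigrams("good") -> ['#go', 'goo', 'ood', 'od#']
```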
character-level models
• ConvNet (Zhang et al. 2015): text understanding from scratch, starting from the level of individual characters, based on CNNs.
• Character-level models and their extensions appear to be very important, especially for morphology-rich languages like Russian.
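A compressed sketch of a character-level CNN classifier in this spirit (the model in the paper is considerably deeper; the alphabet size, filter sizes, and number of classes below are illustrative):

```python
from tensorflow.keras import layers, models

# Character indices -> char embeddings -> stacked 1D convolutions -> class probabilities.
char_cnn = models.Sequential([
    layers.Embedding(input_dim=70, output_dim=16),
    layers.Conv1D(256, 7, activation="relu"),   # local character n-gram features
    layers.MaxPooling1D(3),
    layers.Conv1D(256, 3, activation="relu"),   # more global features
    layers.GlobalMaxPooling1D(),
    layers.Dense(4, activation="softmax"),      # e.g., 4 topic classes
])
```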
modern char-based language model: Kim et al., 2015