Word Embeddings
Natural Language Processing VU (706.230) - Andi Rexha
02/04/2020
Agenda

Traditional NLP:
● Text preprocessing
● Bag-of-words model
● External resources
● Sequential classification
● Other tasks (MT, LM)

Word Embeddings - 1:
● Topic Modeling
● Neural Embeddings
● Word2Vec
● GloVe
● fastText

Word Embeddings - 2:
● ELMo
● ULMFit
● BERT
● RoBERTa, DistilBERT
● Multilinguality
Traditional NLP
Preprocessing
How to preprocess text?
● How do we (humans) split the text to analyse it?
  ○ "Divide et impera" approach:
    ■ Word split
    ■ Sentence split
    ■ Paragraphs, etc.
  ○ Is there any other information that we can collect?
Preprocessing (2)
Other preprocessing steps:
● Morphological:
  ○ Stemming/Lemmatization
● Grammatical:
  ○ Part of Speech Tagging (PoS)
  ○ Chunking/Constituency Parsing
  ○ Dependency Parsing
Preprocessing (3)
Morphological:
● Stemming:
  ○ The process of reducing inflected words to a common root:
    ■ producing => produc; produced => produc
    ■ are => are
● Lemmatization:
  ○ The process of mapping words to their common lemma (dictionary form):
    ■ am, is, are => be
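A minimal sketch of the stemming vs. lemmatization distinction, assuming NLTK is installed and the WordNet data has been downloaded (the library choice is mine, not from the slides):

```python
# Stemming cuts words down to a crude root; lemmatization maps them to a
# dictionary form. Requires: pip install nltk, plus nltk.download("wordnet").
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["producing", "produced", "are"]
print([stemmer.stem(w) for w in words])                   # e.g. ['produc', 'produc', 'are']
print([lemmatizer.lemmatize(w, pos="v") for w in words])  # e.g. ['produce', 'produce', 'be']
```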
Preprocessing (4)
Grammatical:
● Part of Speech Tagging (PoS):
  ○ Assigns a grammatical tag to each word
● Sentence: "There are different examples that we might use!"
● Preprocessing:
  ○ Lemmatization and PoS tagging of the sentence (shown as a figure)
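A small illustrative sketch of PoS tagging the example sentence with NLTK, assuming the tokenizer and tagger resources have been downloaded:

```python
# Requires: nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")
import nltk

tokens = nltk.word_tokenize("There are different examples that we might use!")
print(nltk.pos_tag(tokens))
# e.g. [('There', 'EX'), ('are', 'VBP'), ('different', 'JJ'), ('examples', 'NNS'), ...]
```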
Preprocessing (5)
● Parsing:
  ○ Shallow Parsing (Chunking):
    ■ Adds a tree structure on top of the PoS tags
    ■ First identifies the constituents and then their relations
  ○ Deep Parsing (Dependency Parsing):
    ■ Parses the sentence into its grammatical structure
    ■ "Head" - "Dependent" form
    ■ It is a directed acyclic graph (mostly implemented as a tree)
Preprocessing (6)
● Sentence: "There are different examples that we might use!"
● Constituency Parsing: (parse tree shown as a figure)
● Dependency Parsing: (parse shown as a figure)
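A rough sketch of obtaining the dependency parse of the example sentence with spaCy; the model name 'en_core_web_sm' is an assumption and has to be installed separately:

```python
# Each token is linked to its syntactic head with a dependency label,
# which together form the head-dependent structure described above.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("There are different examples that we might use!")
for token in doc:
    print(f"{token.text:10} --{token.dep_}--> {token.head.text}")
```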
Bag-of-words Model
● Use the preprocessed text in Machine Learning tasks
  ○ How to encode the features?
● A major paradigm in NLP and IR - Bag-of-Words (BoW):
  ○ The text is considered to be the set of its words
  ○ Grammatical dependencies are ignored
  ○ Feature encoding:
    ■ Dictionary based (nominal features)
    ■ One-hot encoded / frequency encoded
Bag-of-words Model (2)
Example (shown as figures on the slide):
● Sentences
● Features
● Representation of features for Machine Learning
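As a hedged illustration (not the slide's own example), a bag-of-words representation can be built with scikit-learn's CountVectorizer: the dictionary is learned from the corpus and each sentence becomes a frequency vector over it:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "There are different examples that we might use",
    "We might use different words",
]
vectorizer = CountVectorizer(binary=False)  # binary=True would give one-hot features
X = vectorizer.fit_transform(sentences)     # sparse matrix: sentences x dictionary
print(vectorizer.get_feature_names_out())   # the learned dictionary
print(X.toarray())                          # frequency-encoded feature vectors
```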
Feature encoding
● PoS tagging:
  ○ Word + PoS tag as part of the dictionary:
    ■ Example: John-PN
● Chunking:
  ○ Use noun phrases:
    ■ Example: the bank account
● Dependency Parsing:
  ○ Word + dependency path as part of the dictionary:
    ■ Example: use-nsubj-acl:relcl
External Resources
● The feature space is sparse:
  ○ We miss linguistic resources:
    ■ Synonyms, antonyms, hyponyms, hypernyms, ...
    ■ Enrich the features of our examples with their synonyms
    ■ Set negative weights for antonyms
● External resources to mitigate sparsity:
  ○ WordNet: a lexical database for English, which groups words into synsets
  ○ Wiktionary: a free multilingual dictionary enriched with relations between words
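A minimal sketch of querying WordNet through NLTK for synonyms and antonyms, assuming the WordNet corpus has been downloaded; the word "good" is just an illustrative choice:

```python
from nltk.corpus import wordnet as wn

synonyms, antonyms = set(), set()
for synset in wn.synsets("good"):
    for lemma in synset.lemmas():
        synonyms.add(lemma.name())          # words in the same synset
        for ant in lemma.antonyms():
            antonyms.add(ant.name())        # could receive a negative feature weight
print(sorted(synonyms)[:10])
print(sorted(antonyms))
```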
Sequential Classification
● We need to classify a sequence of tokens:
  ○ Information Extraction:
    ■ Example: extract the names of companies from documents (open domain)
● How to model it?
  ○ Classify each token as part or not part of the information:
    ■ The classification of the current token depends on the classification of the previous one
    ■ Sequential classifier
    ■ Still not enough; we need to encode the output
    ■ We need to know where every "annotation" starts and ends
Sequential Classification (2)
● Why do we need a schema?
  ○ Example: I work for TU Graz Austria!
● BILOU: Beginning, Inside, Last, Outside, Unit
● BIO (most used): Beginning, Inside, Outside
● BILOU has been shown to perform better on some datasets [1]
● Example: "The Know Center GmbH is a spinoff of TUGraz."
  ○ BILOU: The-O; Know-B; Center-I; GmbH-L; is-O; a-O; spinoff-O; of-O; TUGraz-U
  ○ BIO: The-O; Know-B; Center-I; GmbH-I; is-O; a-O; spinoff-O; of-O; TUGraz-B
● Sequential classifiers: Hidden Markov Model, CRF, etc.

[1] Named Entity Recognition: https://www.aclweb.org/anthology/W09-1119.pdf
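A small illustrative helper (hypothetical, not from the slides) that encodes a single annotated span in both schemes, reproducing the tagging of the example sentence:

```python
def encode_span(tokens, start, end, scheme="BILOU"):
    """Tag tokens[start:end] as one entity span, everything else as Outside."""
    tags = ["O"] * len(tokens)
    length = end - start
    if length == 1:
        tags[start] = "U" if scheme == "BILOU" else "B"   # single-token (Unit) entity
    elif length > 1:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
        if scheme == "BILOU":
            tags[end - 1] = "L"                           # mark the Last token explicitly
    return list(zip(tokens, tags))

tokens = "The Know Center GmbH is a spinoff of TUGraz .".split()
print(encode_span(tokens, 1, 4, scheme="BILOU"))  # Know-B, Center-I, GmbH-L
print(encode_span(tokens, 1, 4, scheme="BIO"))    # Know-B, Center-I, GmbH-I
```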
Sentiment Analysis
● Assign a sentiment to a piece of text:
  ○ Binary (like/dislike)
  ○ Rating based (e.g. 1-5)
● Assign the sentiment to a target phrase:
  ○ Usually involving features around the target
● External resources:
  ○ SentiWordNet: http://sentiwordnet.isti.cnr.it/
Language model
● Generating the next token of a sequence
● Usually based on counts of co-occurring words within a window:
  ○ Statistics are collected and the next word is predicted based on this information
  ○ Mainly, it models the probability of the sequence:
    ■ P(w_1, ..., w_T) = Π_{t=1..T} P(w_t | w_1, ..., w_{t-1})
● In traditional approaches, solved with an n-gram approximation:
  ○ P(w_t | w_1, ..., w_{t-1}) ≈ P(w_t | w_{t-n+1}, ..., w_{t-1})
  ○ Usually solved by combining different sizes of n-grams and weighting them
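A minimal bigram language-model sketch (illustrative toy corpus, no smoothing) showing how co-occurrence counts turn into next-word probabilities:

```python
from collections import Counter, defaultdict

corpus = [
    "the cat is walking in the bedroom".split(),
    "a dog was running in a room".split(),
]

# collect bigram co-occurrence statistics
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    for prev, nxt in zip(sentence, sentence[1:]):
        bigram_counts[prev][nxt] += 1

def next_word_prob(prev, nxt):
    # maximum-likelihood estimate of P(nxt | prev)
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0

print(next_word_prob("the", "cat"))  # 0.5 in this tiny corpus
```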
Machine Translation
● Translate text from one language to another
● Different approaches:
  ○ Rule based:
    ■ Usually by using a dictionary
  ○ Statistical (involving a bilingual aligned corpus):
    ■ IBM models (1-6) for alignment and training
  ○ Hybrid:
    ■ A combination of the two previous techniques
Traditional NLP - End
Dense Word Representation
From Sparse to Dense
● Topic Modeling
● Since LSA (Latent Semantic Analysis):
  ○ These methods use low-rank approximations to decompose large matrices that capture statistical information about a corpus
● Other methods came later:
  ○ pLSA (Probabilistic Latent Semantic Analysis):
    ■ Uses a probabilistic model instead of SVD (Singular Value Decomposition)
  ○ LDA (Latent Dirichlet Allocation):
    ■ A Bayesian version of pLSA
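A rough LSA sketch with scikit-learn on a toy corpus (my own illustration): a term-document count matrix is decomposed with truncated SVD into a low-rank, dense representation:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply",
    "investors sold their stocks",
]
X = CountVectorizer().fit_transform(docs)         # documents x terms (sparse)
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(X)                # dense 2-dimensional document vectors
print(doc_vectors.shape)                          # (4, 2)
```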
Neural embeddings
● Language models suffer from the "curse of dimensionality":
  ○ The word sequence that we want to predict is likely to be different from the ones we have seen during training
  ○ Seeing "The cat is walking in the bedroom" should help us generate: "A dog was running in the room":
    ■ Similar semantics and grammatical roles
● A Neural Probabilistic Language Model:
  ○ Bengio et al. implemented in 2003 the idea of Mnih and Hinton (1989):
    ■ Learned a language model and embeddings for the words
Neural embeddings (2)
● Bengio's architecture:
  ○ Approximates a function with a window approach
  ○ Models the approximation with a neural network
  ○ Input layer in one-hot-encoding form
  ○ Two hidden layers (the first is essentially a randomly initialized lookup table)
  ○ A tanh intermediate layer
(architecture diagram shown as a figure)
Neural embeddings (3)
● A final softmax layer:
  ○ Outputs the next word in the sequence
● Learned representations for a vocabulary of ~18K words, with almost 1M words in the corpus
● IMPORTANT linguistic theory:
  ○ Words that tend to occur in similar linguistic contexts tend to resemble each other in meaning
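A rough sketch (my own simplification, not Bengio's original implementation) of such an architecture in PyTorch: a shared embedding lookup for the context words, a tanh hidden layer, and a softmax output over the vocabulary (applied inside the loss):

```python
import torch
import torch.nn as nn

class NeuralLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=60, hidden_dim=100, context_size=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)        # randomly initialized lookup table
        self.hidden = nn.Linear(context_size * emb_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):                           # (batch, context_size)
        e = self.embed(context_ids).flatten(start_dim=1)      # concatenate context embeddings
        h = torch.tanh(self.hidden(e))                        # tanh intermediate layer
        return self.out(h)                                    # logits; softmax in the loss

model = NeuralLM(vocab_size=18000)
logits = model(torch.randint(0, 18000, (4, 3)))               # dummy batch of 4 contexts
print(logits.shape)                                           # torch.Size([4, 18000])
```

Training with nn.CrossEntropyLoss on (context, next word) pairs would yield both a language model and, in the embedding matrix, a vector for every word.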
Word2vec
● A two-layer neural network model that computes dense vector representations of words
● Two different architectures:
  ○ Continuous Bag-of-Words Model (CBOW) (faster):
    ■ Predict the middle word from a window of context words
  ○ Skip-gram Model (better with small amounts of data):
    ■ Predict the context of a middle word, given that word
● Models the probability of words co-occurring with the current (candidate) word
● The learned embedding is the output of the hidden layer
Word2vec (2)
● Skip-gram and CBOW architectures (shown as figures)
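A minimal sketch of training both architectures with gensim (parameter names follow gensim 4.x, where sg=1 selects skip-gram and sg=0 CBOW; the toy corpus is only for illustration):

```python
from gensim.models import Word2Vec

sentences = [
    "the cat is walking in the bedroom".split(),
    "a dog was running in a room".split(),
]
model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the embeddings
    window=2,         # context window size
    sg=1,             # 1 = skip-gram, 0 = CBOW
    min_count=1,
    epochs=50,
)
print(model.wv["cat"].shape)                # (50,)
print(model.wv.most_similar("cat", topn=3))
```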
Word2vec (3)
● The output is a softmax function
● Three new techniques:
  1. Subsampling of frequent words:
    ■ Each word w_i in the training set is discarded with probability P(w_i) = 1 - sqrt(t / f(w_i))
    ■ f(w_i) is the frequency of word w_i and t (around 10^-5) is a threshold
    ■ Words that occur less often are more likely to be kept
    ■ Accelerates learning and even significantly improves the accuracy of the learned vectors of the rare words
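A small sketch of that subsampling rule, with made-up counts: the discard probability is 1 - sqrt(t / f(w)), so very frequent words are dropped most of the time while rare words are kept:

```python
import math

def discard_prob(word_count, total_count, t=1e-5):
    f = word_count / total_count                 # relative frequency f(w)
    return max(0.0, 1.0 - math.sqrt(t / f))

print(discard_prob(10_000, 1_000_000))  # frequent word (1% of corpus): ~0.97
print(discard_prob(10, 1_000_000))      # rare word (0.001%): 0.0, always kept
```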
Word2vec (4)
  2. Hierarchical Softmax:
    ○ A tree approximation of the softmax, using a sigmoid at every step
    ○ Intuition: at every step decide whether to go right or left
    ○ O(log(n)) instead of O(n)
  3. Negative sampling:
    ○ An alternative to Hierarchical Softmax (works better):
    ○ Boosts infrequent terms, squeezes the probability of frequent terms
    ○ Only the weights of a selected number of terms are updated, sampled with probability:
      ■ P(w_i) = f(w_i)^(3/4) / Σ_j f(w_j)^(3/4)
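A small sketch of the negative-sampling noise distribution with made-up counts: raising unigram frequencies to the 3/4 power and renormalizing boosts infrequent words relative to their raw frequency:

```python
import numpy as np

counts = {"the": 50_000, "cat": 120, "serendipity": 3}
words = list(counts)
freqs = np.array([counts[w] for w in words], dtype=float)

noise = freqs ** 0.75          # unigram distribution raised to the 3/4 power
noise /= noise.sum()           # renormalize to a probability distribution

for w, p_raw, p_noise in zip(words, freqs / freqs.sum(), noise):
    print(f"{w:12} raw={p_raw:.5f}  negative-sampling={p_noise:.5f}")

print(np.random.choice(words, size=5, p=noise))  # draw 5 negative samples
```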
Word2vec (5)
● A serendipity effect of word2vec is the linearity (analogy) between embeddings
● The famous example: (King - Man) + Woman ≈ Queen
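A sketch of checking the analogy with gensim's pretrained vectors; the model name "word2vec-google-news-300" is available through gensim's downloader but is large, so this is illustrative rather than something to run casually:

```python
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")   # downloads pretrained embeddings
# vector arithmetic: king - man + woman ≈ queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# e.g. [('queen', 0.71...)]
```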