CMSC - 676 Classifications using word embedding techniques Presented By - Prachi Bhalerao (prachib1@umbc.edu)
Introduction
• Need
- Finding similarity between words and extracting as much semantic/contextual information as possible is vital.
- Applied across the complete NLP spectrum.
• What is Word / Sentence Embedding?
A technique for representing words (or sentences) as vectors of real numbers, which helps in comparing the semantics of different words and in representing data efficiently (dimensionality reduction).
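As a minimal sketch of how vectors of real numbers allow semantics to be compared, the snippet below computes cosine similarity between toy embeddings. The three-dimensional vectors are made-up numbers chosen for illustration, not trained embeddings, and the words are hypothetical examples.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings (illustrative values only).
embeddings = {
    "king": [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "banana": [0.1, 0.0, 0.9],
}

sim_royal = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_fruit = cosine_similarity(embeddings["king"], embeddings["banana"])
print(sim_royal > sim_fruit)
```

With real embeddings the vectors would typically have a few hundred dimensions, but the comparison works the same way.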
Types of word embeddings
1. Frequency-based word embedding
- Count vector
- TF-IDF vector
2. Prediction-based word embedding
- Word2Vec
- fastText
- GloVe, etc.
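The two frequency-based representations can be sketched in a few lines of plain Python. This assumes the textbook definition tf-idf = (term count / document length) * log(N / document frequency) without smoothing; libraries often use slightly different variants. The three documents are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def count_vector(tokens):
    """Raw term counts, one entry per vocabulary word."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def idf(word):
    """Inverse document frequency: log(N / number of docs containing word)."""
    df = sum(1 for doc in tokenized if word in doc)
    return math.log(len(tokenized) / df)

def tfidf_vector(tokens):
    """Term frequency (count / length) scaled by idf, per vocabulary word."""
    counts = Counter(tokens)
    return [(counts[w] / len(tokens)) * idf(w) for w in vocab]

count_vectors = [count_vector(doc) for doc in tokenized]
tfidf_vectors = [tfidf_vector(doc) for doc in tokenized]
```

In the first document "the" has the highest raw count, but "cat" receives a higher TF-IDF weight because it occurs in fewer documents.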
Word embedding techniques
Word2Vec
• CBOW model: predicts the target word from the given context.
• Skip-gram model: predicts the context from the given word.
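The difference between the two Word2Vec models is easiest to see in the training examples each one is built from. The sketch below only generates (input, output) pairs from a sentence; it does not train a network. The function names and the window size are assumptions made for illustration.

```python
def context_window(tokens, i, window):
    """Words within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """CBOW: input is the context, output is the target word."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

def skipgram_examples(tokens, window=2):
    """Skip-gram: input is the word, output is one context word per pair."""
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_window(tokens, i, window)]

sentence = "the quick brown fox jumps".split()
cbow = cbow_examples(sentence, window=1)
skipgram = skipgram_examples(sentence, window=1)
print(cbow[1])        # (['the', 'brown'], 'quick')
print(skipgram[1:3])  # [('quick', 'the'), ('quick', 'brown')]
```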
Word embedding techniques FastText
Example 1 : Sub-topic detection
• A technique based on sentence embeddings is proposed for detecting sub-topics.
• Latent Dirichlet Allocation (LDA) is used to get the topics.
• Topic Word Embedding (TWE) is used to train on the Weibo data set under a topic.
• The cosine between the word embeddings and the topic embeddings is taken as the weight value; the word embeddings of the target words are weighted by it and added.
• This extends the topic information into the word embeddings and enhances their semantics.
• The p-means method is used to merge the word embeddings of a blog post into a sentence embedding, which serves as the characteristic value (feature vector) of that post.
• Finally, sub-topic clusters are obtained through k-means.
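The weighting and pooling steps can be sketched as follows. This is one possible reading of the slide, not the paper's exact procedure: each word vector is scaled by its cosine with the topic vector, and the p-means step is assumed to concatenate the power means for p = 1 (arithmetic mean), p = +inf (maximum) and p = -inf (minimum). All vectors are hypothetical two-dimensional values.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def topic_weighted(word_vecs, topic_vec):
    """Scale each word vector by its cosine with the topic vector."""
    return [[cosine(w, topic_vec) * x for x in w] for w in word_vecs]

def power_mean_embedding(vectors):
    """Concatenate per-dimension power means for p = 1, +inf, -inf."""
    dims = list(zip(*vectors))
    mean = [sum(d) / len(d) for d in dims]  # p = 1
    maximum = [max(d) for d in dims]        # p = +inf
    minimum = [min(d) for d in dims]        # p = -inf
    return mean + maximum + minimum

# Hypothetical word vectors of one blog post and one topic vector.
word_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
topic_vec = [1.0, 0.0]

weighted = topic_weighted(word_vecs, topic_vec)
post_embedding = power_mean_embedding(weighted)
```

The resulting `post_embedding` vectors, one per post, would then be the inputs to k-means clustering.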
Example 2 : Named Entity Recognition
• Word vectors obtained from the word2vec and fastText embedding approaches are applied to the task of named entity recognition (NER).
• Given a tokenized text, the task is that of predicting which words belong to which predefined category.
• Word2vec model was trained.
• Then classification was performed using a greedy implementation of the Linear Support Vector algorithm.
• Addressing cluster granularity
• Unlabelled corpus size
Results
1. Performance of the NER model improved with growth of the size of the unlabelled data set, but only up to a limit (around 300 000 types, in this paper), at which it even started to drop.
2. Combining multiple cluster granularities led to our best improvement. It didn’t improve performance for smaller data sets.
Measures given are achieved from adding clusters at granularity 1000, built from word2vec models trained on the various data sets, to the NER classifier.
One possible explanation for the stagnating performance of the larger data set is that other training settings need to be employed for optimal training (e.g. higher vector dimensionality or more training iterations).
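A minimal sketch of how cluster membership at one or several granularities might be turned into classifier features. The cluster tables, the words in them and the feature names are hypothetical; in the described setup the cluster IDs would come from clustering word2vec vectors, and the feature dictionaries would be passed to a linear classifier.

```python
# Hypothetical word -> cluster ID tables, keyed by granularity
# (the number of clusters the word vectors were grouped into).
cluster_tables = {
    100: {"oslo": 17, "bergen": 17, "said": 42},
    1000: {"oslo": 310, "bergen": 311, "said": 905},
}

def token_features(token, granularities):
    """Sparse feature dict: the word itself plus one cluster feature
    per requested granularity (skipped if the word is unclustered)."""
    word = token.lower()
    features = {"word=" + word: 1}
    for g in granularities:
        cluster_id = cluster_tables[g].get(word)
        if cluster_id is not None:
            features[f"cluster{g}={cluster_id}"] = 1
    return features

single = token_features("Oslo", [1000])           # one granularity
combined = token_features("Bergen", [100, 1000])  # several combined
```

At granularity 100 the two city names share a cluster, so a classifier can generalize from one to the other even if only one of them appears in the labelled training data.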
Challenges
1. Homographs: different words sharing the same spelling
One vector is learned per spelling, so it is effectively an average of the contexts of all the words with that spelling. When put into practice, this can significantly impact the performance of ML systems, posing a potential problem for conversational agents and text classifiers, e.g. Apple, like.
2. Inflection: alterations of a word to express different grammatical categories
This affects inflected forms of verbs (past tense or participle, for example). Some word inflections appear less frequently than others in certain contexts -> fewer examples of those ‘less common’ words in context for the algorithm to learn from -> ‘less similar’ vectors.
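The homograph problem can be illustrated with a deliberately simplified model: assume each sense of a word has its own direction in vector space and the single learned vector is the plain average of the two. Real training does not compute a literal average, and the two-dimensional sense vectors below are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Hypothetical sense directions for the homograph "apple".
fruit_sense = [1.0, 0.0]
company_sense = [0.0, 1.0]

# Simplifying assumption: both senses are equally frequent, so the
# single vector for "apple" is their average.
apple = [(f + c) / 2 for f, c in zip(fruit_sense, company_sense)]

sim_to_fruit = cosine(apple, fruit_sense)
sim_to_company = cosine(apple, company_sense)
```

The averaged vector sits between the two senses and matches neither of them closely, which is why downstream classifiers can be misled.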
References
• Yu Xie et al., ‘A Method based on Sentence Embeddings for the Sub-Topics Detection’, 2019 J. Phys.: Conf. Ser. 1168 052004; https://iopscience.iop.org/article/10.1088/1742-6596/1168/5/052004/pdf
• Scharolta Katharina Siencnik, ‘Adapting word2vec to Named Entity Recognition’, Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015); https://www.ep.liu.se/ecp/109/030/ecp15109030.pdf
• Nils Reimers, Benjamin Schiller, Tilman Beck, Johannes Daxenberger, Christian Stab, Iryna Gurevych, ‘Classification and Clustering of Arguments with Contextualized Word Embeddings’, ACL 2019; https://arxiv.org/abs/1906.09821
• Indrajit Dhillon, Rahul Kumar, ‘Enhanced word clustering for hierarchical text classification’, ACM; https://dl.acm.org/doi/abs/10.1145/775047.775076