CMSC 676 — Classifications using word embedding techniques


  1. CMSC - 676 Classifications using word embedding techniques Presented By - Prachi Bhalerao (prachib1@umbc.edu)

  2. Introduction
     • Need: finding the similarity between words and extracting as much semantic/contextual information as possible. Applied across the complete NLP spectrum.
     • What is word / sentence embedding? A technique for representing words as vectors of real numbers, which makes it possible to compare the semantics of different words and to represent data efficiently (dimensionality reduction).
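Comparing the semantics of two embedded words typically means computing the cosine similarity of their vectors. A minimal sketch, using toy 3-dimensional vectors (illustrative values, not from a trained model):

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings; real models use hundreds of dimensions.
king = [0.8, 0.6, 0.1]
queen = [0.7, 0.7, 0.2]
car = [0.1, 0.2, 0.9]

sim_related = cosine_similarity(king, queen)    # close to 1: similar semantics
sim_unrelated = cosine_similarity(king, car)    # much smaller: dissimilar
```

Values near 1 indicate semantically similar words; values near 0 indicate unrelated ones.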

  3. Types of word embeddings
     1. Frequency-based word embeddings: Count vector, TF-IDF vector
     2. Prediction-based word embeddings: Word2Vec, fastText, GloVe, etc.
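The two frequency-based representations above can be sketched in a few lines of plain Python. A count vector stores raw term counts over a fixed vocabulary; TF-IDF reweights each count by how rare the term is across the corpus (toy corpus and the standard log formulation, for illustration only):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]

# Fixed vocabulary over the whole corpus.
vocab = sorted({w for d in docs for w in d})

def count_vector(doc):
    """Raw term counts over the vocabulary."""
    c = Counter(doc)
    return [c[w] for w in vocab]

def tf_idf_vector(doc):
    """Term frequency scaled by inverse document frequency."""
    c = Counter(doc)
    n = len(docs)
    vec = []
    for w in vocab:
        tf = c[w] / len(doc)
        df = sum(1 for d in docs if w in d)
        idf = math.log(n / df) if df else 0.0
        vec.append(tf * idf)
    return vec
```

Note how a corpus-wide word like "the" gets a lower TF-IDF weight than a document-specific word like "cat", even though it occurs more often.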

  4. Word embedding techniques: Word2Vec
     • CBOW model: predicts the target word from a given context
     • Skip-gram model: predicts the context from a given word
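The difference between the two architectures is easiest to see in the training pairs they consume. A sketch of pair generation over a sliding window (the window size 2 is an illustrative choice):

```python
def training_pairs(tokens, window=2):
    """CBOW pairs: (context words -> target word).
    Skip-gram pairs: (word -> each individual context word)."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        cbow.append((context, target))       # CBOW predicts the center word
        for c in context:
            skipgram.append((target, c))     # skip-gram predicts each neighbor
    return cbow, skipgram

cbow, sg = training_pairs("the quick brown fox jumps".split(), window=2)
```

For the center word "brown", CBOW sees one example with the full context, while skip-gram sees four separate (word, neighbor) examples.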

  5. Word embedding techniques: fastText
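fastText's key idea (not spelled out on the slide) is representing each word as a bag of character n-grams, with boundary markers, so that rare and out-of-vocabulary words still get vectors from shared subword units. A sketch of the n-gram extraction step, using fastText's default range of 3 to 6 characters:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word with fastText-style boundary
    markers '<' and '>'. fastText additionally keeps the whole
    marked word as one extra unit (omitted here for brevity)."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams
```

A word's embedding is then the sum of the vectors of its n-grams, so "where" and "whereas" share many subword units and end up with related vectors.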

  6. Example 1: Sub-topic detection
     • A technique based on sentence embeddings for detecting sub-topics is proposed.
     • Latent Dirichlet Allocation (LDA) is used to obtain the topics.
     • Topical Word Embeddings (TWE) are trained on the Weibo data set under each topic.
     • Taking the cosine similarity between the word embeddings and the topic embedding as the weight, the word embeddings of the target words are weighted and summed. This extends the topic information into the word embeddings and enhances their semantics.
     • The p-means method merges each blog post into a sentence embedding, which serves as the feature vector of the post.
     • Finally, sub-topic clusters are obtained through k-means.
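The cosine-weighted summation step above can be sketched directly: each word vector is scaled by its similarity to the topic vector before being added, so on-topic words dominate the result. Toy 2-dimensional vectors stand in for trained TWE embeddings:

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def topic_weighted_embedding(word_vecs, topic_vec):
    """Weight each word vector by its cosine similarity to the
    topic vector, then sum -- the weighting step described above."""
    dim = len(topic_vec)
    out = [0.0] * dim
    for wv in word_vecs:
        w = cosine(wv, topic_vec)
        for k in range(dim):
            out[k] += w * wv[k]
    return out
```

A word orthogonal to the topic gets weight 0 and contributes nothing; a word aligned with the topic gets weight 1 and contributes fully.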

  7. Example 2: Named Entity Recognition
     • Word vectors obtained from the word2vec and fastText embedding approaches are applied to the task of named entity recognition (NER).
     • Given a tokenized text, the task is to predict which words belong to which predefined category.
     • A word2vec model was trained; classification was then performed using a greedy implementation of the linear support vector algorithm.
     • Aspects addressed: cluster granularity and unlabelled corpus size. The reported measures come from adding clusters at granularity 1000, built from word2vec models trained on the various data sets, to the NER classifier.
     Results:
     1. Performance of the NER model improved with the growth of the unlabelled data set, but only up to a limit (around 300,000 types in this paper), after which it even started to drop. One possible explanation for the stagnating performance on the larger data set is that other training settings need to be employed for optimal training (e.g. higher vector dimensionality or more training iterations).
     2. Combining multiple cluster granularities led to the best improvement; it did not improve performance for smaller data sets.
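"Adding clusters to the NER classifier" means turning each word's embedding-cluster ID into a discrete feature alongside surface features. A hypothetical sketch (the cluster table and feature names are invented for illustration; in the paper the IDs come from clustering word2vec vectors at granularity 1000):

```python
# Hypothetical cluster assignments; real ones come from k-means
# over word2vec vectors (granularity 1000 in the paper).
word_to_cluster = {"paris": 17, "london": 17, "obama": 203, "merkel": 203}

def token_features(tokens, i):
    """Feature dict for token i: surface form, capitalization, and
    the embedding-cluster ID added as an extra classifier feature."""
    word = tokens[i].lower()
    return {
        "word": word,
        "is_title": tokens[i].istitle(),
        "cluster": word_to_cluster.get(word, -1),  # -1 = out of vocabulary
    }
```

Because "paris" and "london" share a cluster ID, a linear classifier can generalize from labeled examples of one city name to the other.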

  8. Challenges
     1. Homographs: different words sharing the same spelling, e.g. "Apple", "like". A single static embedding averages the contexts of all words with the same spelling. In practice this can significantly hurt the performance of ML systems, posing a potential problem for conversational agents and text classifiers.
     2. Inflection: alterations of a word to express different grammatical categories, e.g. inflected forms of verbs (past tense or participle). Some word inflections appear less frequently than others in certain contexts -> fewer examples of those less common words in context for the algorithm to learn from -> "less similar" vectors.
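The homograph problem can be made concrete with toy sense vectors (illustrative values, not trained): averaging the contexts of two senses yields one vector that is only moderately similar to either sense.

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Toy sense vectors: "apple" the fruit vs. "Apple" the company.
apple_fruit = [0.9, 0.1]
apple_company = [0.1, 0.9]

# A single static embedding effectively averages both senses' contexts.
apple = [(a + b) / 2 for a, b in zip(apple_fruit, apple_company)]
```

The averaged vector sits equidistant from both senses and is a perfect match for neither, which is exactly why static embeddings struggle on polysemous words (and why contextualized embeddings, as in the Reimers et al. reference, help).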

  9. References
     • Yu Xie et al., 'A Method based on Sentence Embeddings for the Sub-Topics Detection', 2019 J. Phys.: Conf. Ser. 1168 052004; https://iopscience.iop.org/article/10.1088/1742-6596/1168/5/052004/pdf
     • Scharolta Katharina Siencnik, 'Adapting word2vec to Named Entity Recognition', Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015); https://www.ep.liu.se/ecp/109/030/ecp15109030.pdf
     • Nils Reimers, Benjamin Schiller, Tilman Beck, Johannes Daxenberger, Christian Stab, Iryna Gurevych, 'Classification and Clustering of Arguments with Contextualized Word Embeddings', ACL 2019; https://arxiv.org/abs/1906.09821
     • Inderjit Dhillon, Rahul Kumar, 'Enhanced word clustering for hierarchical text classification', ACM; https://dl.acm.org/doi/abs/10.1145/775047.775076
