Language Modeling Martin Saveski, Igor Trajkovski Information - PowerPoint PPT Presentation

Automatic Construction of WordNets by Using Machine Translation and Language Modeling
Martin Saveski, Igor Trajkovski
Information Society Language Technologies, Ljubljana 2010

Outline
• WordNet
• Motivation and Problem Statement
• Methodology
• Results
• Evaluation
• Conclusion and Future Work

WordNet
• Lexical database of the English language
• Groups words into sets of cognitive synonyms called synsets
• Each synset contains a gloss and links to other synsets
  – Links define the place of the synset in the conceptual space
• Source of motivation for researchers from various fields

WordNet Example
• Hypernym: {motor vehicle, automotive vehicle}
• Synset: {car, auto, automobile, motorcar}
• Gloss: a motor vehicle with four wheels; usually propelled by an internal combustion engine
• Hyponym: {cab, taxi, hack, taxicab}

Motivation
• Plethora of WordNet applications
  – Text classification, clustering, query expansion, etc.
• There is no publicly available WordNet for the Macedonian language
  – Macedonian was not included in the EuroWordNet and BalkaNet projects
• Manual construction is an expensive and labor-intensive process
  – Need to automate the process

Problem Statement
• Assumptions:
  – The conceptual space modeled by the PWN does not depend on the language in which it is expressed
  – The majority of the concepts exist in both languages, English and Macedonian, but have different notations
• Diagram: English synset → find translations which lexicalize the same concept → Macedonian synset
• Given a synset in English, our goal is to find a set of words which lexicalize the same concept in Macedonian

Resources and Tools
• Resources:
  – Princeton implementation of WordNet (PWN): backbone for the construction
  – English-Macedonian Machine Readable Dictionary (MRD), developed in-house: 182,000 entries
• Tools:
  – Google Translation System (Google Translate)
  – Google Search Engine

Methodology
1. Finding candidate words
2. Translating the synset gloss
3. Assigning scores to the candidate words
4. Selection of the candidate words

Finding Candidate Words
• Each word W_i of the PWN synset is looked up in the MRD, giving its set of translations T(W_i):
  – T(W_1) = CW_11, CW_12, … CW_1s
  – T(W_2) = CW_21, CW_22, … CW_2k
  – T(W_3) = CW_31, CW_32, … CW_3j
  – …
  – T(W_n) = CW_n1, CW_n2, … CW_nm
• Together, the translations form the set of candidate words CW_1, CW_2, CW_3, …, CW_y
• T(W_1) contains translations of all senses of the word W_1
• Essentially, we have a Word Sense Disambiguation (WSD) problem

Finding Candidate Words (cont.)
• Diagram: PWN synset (W_1, W_2, …, W_n) → MRD → candidate words (CW_1, CW_2, …, CW_m)
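The candidate-finding step above can be sketched as a dictionary lookup followed by a union. This is only an illustration, not the authors' implementation: the function name, the in-memory dictionary format, and the toy entries (including which English word maps to which Macedonian translation) are assumptions made for the example.

```python
from typing import Dict, List


def candidate_words(synset_words: List[str], mrd: Dict[str, List[str]]) -> List[str]:
    """Collect the translations T(W) of every word W in the synset, without duplicates."""
    candidates: List[str] = []
    for word in synset_words:
        # Words that are missing from the dictionary contribute no candidates.
        for translation in mrd.get(word, []):
            if translation not in candidates:  # keep first-seen order
                candidates.append(translation)
    return candidates


# Hypothetical toy dictionary; a real MRD lists translations for all senses of a word.
toy_mrd = {
    "name": ["име", "назив", "презиме", "наслов", "углед", "крсти", "навреда"],
    "epithet": ["епитет", "навреда"],
}

print(candidate_words(["name", "epithet"], toy_mrd))
```

Because the dictionary returns translations for every sense of each word, the resulting list mixes relevant and irrelevant candidates, which is why the later scoring and selection steps are needed.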

Translating the synset gloss
• Statistical approach to WSD:
  – Using the word sense definitions and a large text corpus, we can determine the sense in which the word is used
• Word sense definition = synset gloss
• The gloss translation can be used to measure the correlation between the synset and the candidate words
• We use Google Translate (EN-MK) to translate the glosses of the PWN synsets

Translating the synset gloss (cont.)
• Diagram: PWN synset gloss → translation → translated gloss (T-Gloss)
• Diagram: PWN synset (W_1, W_2, …, W_n) → MRD → candidate words (CW_1, CW_2, …, CW_m)

Assigning scores to the candidate words
• To apply the statistical WSD technique we lack a large, domain-independent text corpus
• Google Similarity Distance (GSD)
  – Calculates the semantic similarity between words/phrases based on the Google result counts
• We calculate the GSD between each candidate word and the gloss translation
• The GSD score is assigned to each candidate word

Assigning scores to the candidate words (cont.)
• Diagram: PWN synset gloss → translation → translated gloss (T-Gloss)
• Diagram: PWN synset (W_1, W_2, …, W_n) → MRD → candidate words (CW_1, CW_2, …, CW_m)
• Similarity scores: GSD(CW_1, T-Gloss), GSD(CW_2, T-Gloss), …, GSD(CW_m, T-Gloss)

Selection of the candidate words
• Selection by using two thresholds:
  1. Score(CW) > T1
     – Ensures that the candidate word has a minimum correlation with the gloss translation
  2. Score(CW) > (T2 × MaxScore)
     – Discriminates between the words which capture the meaning of the synset and those that do not

Selection of the candidate words (cont.)
• Diagram: PWN synset gloss → translation → translated gloss (T-Gloss)
• Diagram: PWN synset (W_1, W_2, …, W_n) → MRD → candidate words (CW_1, CW_2, …, CW_m)
• Similarity scores: GSD(CW_1, T-Gloss), GSD(CW_2, T-Gloss), …, GSD(CW_m, T-Gloss)
• Selection → resulting synset (CW_1, CW_2, …, CW_k)

Example
• PWN synset: {name, epithet}
• Synset gloss: a defamatory or abusive word or phrase
• Gloss translation (MK-GLOSS): со клевети или навредлив збор или фраза
• Thresholds: T1 = 0.2, T2 = 0.62

Candidate word   English explanation             GSD score
навреда          offence, insult                 0.78
епитет           epithet, in a positive sense    0.49
углед            reputation                      0.41
крсти            to name somebody                0.40
назив            name, title                     0.37
презиме          last name                       0.35
наслов           title                           0.35
глас             voice                           0.34
име              first name                      0.33

• Resulting MWN synset: {навреда, епитет}
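The two-threshold selection rule can be checked against the scores in this example. The sketch below assumes that higher scores mean stronger correlation, as the example suggests; the function name and dictionary layout are illustrative choices, not taken from the slides.

```python
from typing import Dict, List


def select_candidates(scores: Dict[str, float], t1: float, t2: float) -> List[str]:
    """Keep words whose score exceeds both T1 and T2 times the maximum score."""
    if not scores:
        return []
    max_score = max(scores.values())
    return [word for word, score in scores.items()
            if score > t1 and score > t2 * max_score]


# Scores copied from the example slide.
example_scores = {
    "навреда": 0.78, "епитет": 0.49, "углед": 0.41, "крсти": 0.40, "назив": 0.37,
    "презиме": 0.35, "наслов": 0.35, "глас": 0.34, "име": 0.33,
}

print(select_candidates(example_scores, t1=0.2, t2=0.62))  # ['навреда', 'епитет']
```

With T2 = 0.62 and a maximum score of 0.78, the relative cut-off is about 0.484, so only навреда (0.78) and епитет (0.49) pass, which matches the resulting synset on the slide. All nine candidates clear the absolute threshold T1 = 0.2, so in this example the relative threshold does the discriminating.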

Results of the MWN construction
Size of the MWN:

           Nouns    Verbs   Adjectives   Adverbs
Synsets    22838    7256    3125         57
Words      12480    2786    2203         84

NB: All words included in the MWN are lemmas

Evaluation of the MWN
• There is no manually constructed Macedonian WordNet (lack of a gold standard)
• Manual evaluation:
  – Labor intensive and expensive
• Alternative method:
  – Evaluation by use of the MWN in practical applications
  – MWN applications were our motivation and ultimate goal

MWN for Text Classification
• Easy to measure and compare the performance of the classification algorithms
• We extended the synset similarity measures to the word-to-word and text-to-text levels
  – Leacock and Chodorow (LCH) (node-based)
  – Wu and Palmer (WUP) (arc-based)
• Baseline:
  – Cosine similarity (classical approach)

MWN for Text Classification (cont.)
• Classification algorithm:
  – K Nearest Neighbors (KNN)
  – Allows the similarity measures to be compared unambiguously
• Corpus: A1 TV News Archive (2005-2008)

A1 Corpus, size and categories:

Category     Articles   Tokens
Balkan       1,264      159,956
Economy      1,053      160,579
Macedonia    3,323      585,368
Sci/Tech     920        17,775
World        1,845      222,560
Sport        1,232      142,958
TOTAL        9,637      1,289,196
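KNN makes the comparison straightforward because the similarity measure is the only component that changes between runs. The sketch below is a minimal illustration under several assumptions: the value of k, the majority-vote rule, the whitespace tokenization, and the toy training texts are not given in the slides. It uses the cosine baseline, and any other text similarity function with the same signature could be swapped in.

```python
from collections import Counter
from math import sqrt
from typing import Callable, List, Tuple


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over raw term counts (whitespace tokenization)."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def knn_classify(text: str,
                 training: List[Tuple[str, str]],
                 similarity: Callable[[str, str], float],
                 k: int = 3) -> str:
    """Label a text by majority vote among its k most similar training texts."""
    ranked = sorted(training, key=lambda item: similarity(text, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training set of (text, category) pairs.
toy_training = [
    ("goal match team", "Sport"),
    ("team wins match", "Sport"),
    ("bank market prices", "Economy"),
    ("market growth bank", "Economy"),
]

print(knn_classify("match team goal", toy_training, cosine_similarity))  # Sport
```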

MWN for Text Classification – Results
• Figure: Text Classification Results (F-Measure, 10-fold cross-validation). Bar chart of the F-measure per category (Balkan, Economy, Macedonia, Sci/Tech, World, Sport, Weighted Average) for WUP similarity, cosine similarity, and LCH similarity
• Values labeled in the chart: 80.4%, 73.7%, 59.8%

Future Work
• Investigation of the semantic relatedness between the candidate words
  – Word clustering prior to assigning to a synset
  – Assigning a group of candidate words to the synset
• Experiments with using the MWN for other applications
  – Text clustering
  – Word sense disambiguation

Q&A
Thank you for your attention. Questions?

Google Similarity Distance
• Words/phrases acquire meaning from the way they are used in society and from their relative semantics to other words/phrases
• Symbols used in the formula:
  – f(x), f(y), f(x, y): result counts for x, y, and the pair (x, y)
  – N: normalization factor
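For reference, the Normalized Google Distance of Cilibrasi and Vitányi uses exactly these symbols: NGD(x, y) = (max{log f(x), log f(y)} − log f(x, y)) / (log N − min{log f(x), log f(y)}). The sketch below assumes this is the measure meant here. NGD is a distance (0 for terms that always co-occur), while the scores in the example slide are higher for better candidates, so the authors presumably convert or rescale it in a way the slides do not show. The counts are passed in as plain numbers; how they are obtained from the search engine is left out.

```python
from math import log


def normalized_google_distance(fx: float, fy: float, fxy: float, n: float) -> float:
    """NGD from result counts f(x), f(y), f(x, y) and normalization factor N."""
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # Assumption: terms that never (co-)occur are treated as infinitely distant.
        return float("inf")
    lx, ly, lxy = log(fx), log(fy), log(fxy)
    return (max(lx, ly) - lxy) / (log(n) - min(lx, ly))


# Hypothetical counts: two terms with 1,000 results each, 10 shared, N = 1,000,000.
print(normalized_google_distance(1000, 1000, 10, 1_000_000))
```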

Synset similarity metrics
• Leacock and Chodorow (LCH)

$$\mathrm{sim}_{LCH}(s_1, s_2) = -\log \frac{\mathrm{len}(s_1, s_2)}{2 \cdot D}$$

  – len: number of nodes from s1 to s2
  – D: maximum depth of the hierarchy
• Measures in number of nodes

Synset similarity metrics (cont.)
• Wu and Palmer (WUP)

$$\mathrm{sim}_{WUP}(s_1, s_2) = \frac{2 \cdot \mathrm{depth}(LCS)}{\mathrm{depth}(s_1) + \mathrm{depth}(s_2)}$$

  – LCS: most specific synset that is an ancestor of both synsets
• Measures in number of links
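Both measures can be tried out on a small hierarchy. The sketch below is a toy illustration: the hypernym tree is hypothetical (loosely following the WordNet example earlier in the slides), it is a strict tree whereas WordNet allows multiple hypernyms, and depth is counted in nodes with the root at depth 1, which is an assumption.

```python
from math import log
from typing import Dict, List, Optional

# Hypothetical toy hypernym tree: child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "vehicle": "entity",
    "motor vehicle": "vehicle",
    "car": "motor vehicle",
    "cab": "car",
    "truck": "motor vehicle",
}


def path_to_root(synset: str) -> List[str]:
    """The synset followed by all of its ancestors, ending at the root."""
    path = [synset]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def depth(synset: str) -> int:
    return len(path_to_root(synset))  # root has depth 1


def lcs(s1: str, s2: str) -> str:
    """Most specific common ancestor of the two synsets."""
    ancestors = set(path_to_root(s1))
    return next(node for node in path_to_root(s2) if node in ancestors)


def path_len_nodes(s1: str, s2: str) -> int:
    """Number of nodes on the shortest path from s1 to s2, endpoints included."""
    return depth(s1) + depth(s2) - 2 * depth(lcs(s1, s2)) + 1


MAX_DEPTH = max(depth(s) for s in PARENT)


def sim_lch(s1: str, s2: str) -> float:
    return -log(path_len_nodes(s1, s2) / (2 * MAX_DEPTH))


def sim_wup(s1: str, s2: str) -> float:
    return 2 * depth(lcs(s1, s2)) / (depth(s1) + depth(s2))


print(lcs("cab", "truck"), sim_lch("cab", "truck"), sim_wup("cab", "truck"))
```

In this tree the path from cab to truck has four nodes (cab, car, motor vehicle, truck) and the maximum depth is 5, so LCH is −log(4/10); the common ancestor motor vehicle has depth 3, so WUP is 6/9.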

Semantic Word Similarity
• The similarity of W1 and W2 is defined as the maximum similarity (minimum distance) between:
  – The set of synsets containing W1
  – The set of synsets containing W2
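The word-level definition can be sketched as a maximum over all pairs of synsets. The synset identifiers, the lookup table, and the toy similarity values below are hypothetical, and returning 0.0 for a word with no synsets is an assumption rather than something the slides specify.

```python
from typing import Callable, Dict, List


def word_similarity(w1: str, w2: str,
                    synsets_of: Dict[str, List[str]],
                    synset_sim: Callable[[str, str], float]) -> float:
    """Maximum synset similarity over all pairs (synset of w1, synset of w2)."""
    scores = [synset_sim(s1, s2)
              for s1 in synsets_of.get(w1, [])
              for s2 in synsets_of.get(w2, [])]
    return max(scores) if scores else 0.0  # assumption: unknown words score 0.0


# Hypothetical synset inventory and pairwise synset similarities.
TOY_SYNSETS = {"cab": ["cab.taxi", "cab.compartment"], "car": ["car.automobile"]}
TOY_SIM = {("cab.taxi", "car.automobile"): 0.9,
           ("cab.compartment", "car.automobile"): 0.4}


def toy_synset_sim(s1: str, s2: str) -> float:
    return TOY_SIM.get((s1, s2), TOY_SIM.get((s2, s1), 0.0))


print(word_similarity("cab", "car", TOY_SYNSETS, toy_synset_sim))  # 0.9
```

Taking the maximum means an ambiguous word is compared through its best-matching sense, so the unrelated sense of "cab" in the toy data does not lower the score.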

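Semantic Text Similarity
• The similarity between texts T1 and T2 is:

$$\mathrm{sim}(T_1, T_2) = \frac{1}{2}\left(\frac{\sum_{w \in T_1} \mathrm{maxSim}(w, T_2) \cdot \mathrm{idf}(w)}{\sum_{w \in T_1} \mathrm{idf}(w)} + \frac{\sum_{w \in T_2} \mathrm{maxSim}(w, T_1) \cdot \mathrm{idf}(w)}{\sum_{w \in T_2} \mathrm{idf}(w)}\right)$$

  – idf: inverse document frequency (measures word specificity)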
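The text-level formula averages two directed scores, one from each text to the other. The sketch below takes maxSim(w, T) to be the highest word similarity between w and any word of T. It treats texts as token lists and defaults a missing idf value to 1.0; both choices, along with the exact-match word similarity used in the demo, are assumptions made for the example.

```python
from typing import Callable, Dict, List


def text_similarity(t1: List[str], t2: List[str],
                    word_sim: Callable[[str, str], float],
                    idf: Dict[str, float]) -> float:
    """Average of the two idf-weighted directed similarities between t1 and t2."""

    def directed(source: List[str], target: List[str]) -> float:
        if not source or not target:
            return 0.0
        total_idf = sum(idf.get(w, 1.0) for w in source)
        if total_idf == 0:
            return 0.0
        # maxSim(w, target): best match for w among the words of the target text.
        weighted = sum(max(word_sim(w, v) for v in target) * idf.get(w, 1.0)
                       for w in source)
        return weighted / total_idf

    return 0.5 * (directed(t1, t2) + directed(t2, t1))


def exact_match(a: str, b: str) -> float:
    """Stand-in word similarity for the demo: 1.0 for identical words, else 0.0."""
    return 1.0 if a == b else 0.0


print(text_similarity(["car", "fast"], ["car", "slow"], exact_match, {"car": 3.0}))  # 0.75
```

In the demo the shared word "car" carries an idf of 3.0 against 1.0 for the unmatched words, so each directed score is 3/4; with uniform idf weights the same pair would score 0.5.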
