Cross-Lingual Word Sense Disambiguation using WordNets and Context Mapping
Priyank Jaini, Ankit Agrawal
{pjaini,ankitag}@iitk.ac.in
Department of Mathematics and Statistics, IIT Kanpur
Advisor: Prof. Amitabha Mukerjee
Date: March 21, 2013
What is Word Sense Disambiguation (WSD)?
- Assigning the correct sense (meaning/context) to a word in a sentence when the word can have multiple meanings
- Examples:
  -> I was standing on the bank of river Ganga.
  -> Mr. Bank owns a bank.
Importance and Motivation
- Applications: machine translation, lexicography, semantic interpretation, information retrieval, etc.
- Hindi lacks resources
- Can be used to create or enrich sense-tagged data
Our Approach
● Parallel Corpus and Alignment
● English WSD
● Synset Mapping
● Transfer to Hindi
Methodology and Algorithms Used
1) Parallel corpus for Hindi-English (Emille)
2) Alignment of text using the Church and Gale Algorithm
3) English WSD on the English text
4) Synset mapping using [10]
5) Transfer senses to Hindi text
Figure taken from [1]
The English Word Sense Disambiguation (Step 3)
- We shall use "WordNet::SenseRelate::AllWords"
- It uses the Lesk Algorithm for disambiguation
- The English WordNet serves as the sense inventory for English WSD
- After this step, we have a sense-tagged English text
Synset Mapping (Step 4)
- Takes an English synset as input and outputs the best-matching Hindi synset
- Uses the fact that, in WordNet, the first word in a synset best represents the sense of the synset
- The hypernymy relation is the basis for finding the best match
- Over the hypernymy hierarchies, a weighted formula given in [10] determines the best synset
Synset Mapping
- Candidate synsets: find the Hindi translations of the first word in the input synset, then collect the Hindi synsets that contain one or more of these translations
- The hypernymy hierarchies of these candidate synsets are found; they are called candidate hierarchies
- The hypernymy hierarchy of the input English synset is also obtained
- For each synset in the English hypernymy hierarchy, the Hindi translations of all the words occurring in it are found
- These Hindi words are looked up in the candidate hierarchies. If a match is found, the weight of that candidate synset is increased. Initially, all weights are zero
- The total weight of each candidate hierarchy is computed, and the candidate with the highest weight is mapped to the English synset
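The mapping steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weighted formula from [10] is not reproduced on the slides, so every match simply adds 1 to the candidate's weight. The function name `map_synset`, the synset ids `H1` to `H5`, and the romanized Hindi words are hypothetical toy data, not entries from the English or Hindi WordNet.

```python
# Toy bilingual dictionary: English word -> set of (romanized) Hindi translations.
# All data below is hypothetical and exists only to exercise the procedure.
TRANSLATE = {
    "bank": {"kinara", "baink"},
    "slope": {"dhalan"},
    "incline": {"dhalan"},
    "landform": {"bhu_akriti"},
}

# Toy Hindi wordnet: synset id -> member words, and synset id -> hypernym id.
HI_SYNSETS = {
    "H1": ["kinara", "tat"],
    "H2": ["baink", "adhikosh"],
    "H3": ["dhalan"],
    "H4": ["sanstha"],
    "H5": ["bhu_akriti"],
}
HI_HYPERNYM = {"H1": "H3", "H3": "H5", "H5": None, "H2": "H4", "H4": None}

# English hypernymy hierarchy: the input synset first, then its ancestors.
EN_HIERARCHY = [["bank", "riverbank"], ["slope", "incline"], ["landform"]]


def hindi_hierarchy(synset_id):
    """Return the chain from a Hindi synset up to its root."""
    chain = []
    while synset_id is not None:
        chain.append(synset_id)
        synset_id = HI_HYPERNYM.get(synset_id)
    return chain


def map_synset(en_hierarchy):
    """Map an English synset (given with its hypernyms) to a Hindi synset."""
    # Candidates: Hindi synsets containing a translation of the first word.
    head_translations = TRANSLATE.get(en_hierarchy[0][0], set())
    candidates = [sid for sid, words in HI_SYNSETS.items()
                  if head_translations & set(words)]

    weights = {sid: 0 for sid in candidates}  # all weights start at zero
    for en_synset in en_hierarchy:
        # Hindi translations of every word in this English synset.
        hindi_words = set()
        for word in en_synset:
            hindi_words |= TRANSLATE.get(word, set())
        # Look the translations up in each candidate hierarchy.
        for cand in candidates:
            for sid in hindi_hierarchy(cand):
                if hindi_words & set(HI_SYNSETS[sid]):
                    weights[cand] += 1  # assumption: unit weight per match

    best = max(weights, key=weights.get) if weights else None
    return best, weights


best, weights = map_synset(EN_HIERARCHY)
print(best, weights)
```

In this toy setup both Hindi candidates match on the first word, and the "riverbank" candidate wins only because its hypernyms also match the translated English hypernyms, which is the intuition behind using the hierarchy.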
Expected Results
- Since a parallel aligned corpus is used, we expect to achieve better accuracy
- The approach should give better results in scenarios where:
  - An English word is polysemous and its Hindi equivalent is also polysemous
  - An English word is monosemous and its Hindi equivalent is polysemous
Limitations
- Valid only for nouns
- No morphological handling
References
1) Debasri Chakrabarti, Dipak Kumar Narayan, Prabhakar Pandey, Pushpak Bhattacharyya. Experiences in building the Indo WordNet: A WordNet for Hindi.
2) Bahareh Sarrafzadeh, Nikolay Yakovets, Nick Cercone, Aijun An. Cross Lingual Word Sense Disambiguation for Languages with Scarce Resources.
3) Els Lefever and Veronique Hoste. SemEval-2010 Task 3: Cross-Lingual Word Sense Disambiguation.
4) Michael Lesk. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In SIGDOC'86: Proceedings of the 5th annual international conference on Systems documentation, pages 24-26, New York, NY, USA, 1986. ACM.
5) Satanjeev Banerjee and Ted Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In IJCAI'03, pages 805-810, 2003.
6) Els Lefever and Veronique Hoste. Examining the validity of Cross-Lingual Word Sense Disambiguation.
7) http://wordnet.princeton.edu/
8) Roberto Navigli. Word Sense Disambiguation: A Survey.
9) Roberto Navigli and Simone Paolo Ponzetto. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network.
10) Ramanand, Akshay Ukey, Brahm Kiran Singh, and Pushpak Bhattacharyya. Mapping and structural analysis of multi-lingual wordnets.
Thank You!!
Church and Gale Algorithm
- A method for aligning sentences based on a statistical model of character lengths
- Uses the fact that longer sentences in one language tend to be translated into longer sentences in the other language, and shorter sentences into shorter ones
- The algorithm is a two-step process: 1) paragraph alignment, then 2) sentence alignment
- Based on a probabilistic model
- It is language independent, though it would have to be tested on Hindi-English
Ref: William A. Gale and Kenneth W. Church. A Program for Aligning Sentences in Bilingual Corpora.
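A minimal sketch of the length-based sentence alignment step is given below, as a dynamic program over sentence lengths in characters. It is not the reference implementation: the prior probabilities and the variance `S2 = 6.8` are the values commonly quoted from the Gale and Church paper and should be treated as assumptions to be re-estimated for Hindi-English. Normalising by the mean of the two lengths is a simplification chosen here to avoid division by zero, and the names `match_cost` and `align` are illustrative.

```python
import math

# Assumed priors for each bead type (source sentences, target sentences).
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}
C = 1.0    # assumed expected number of target characters per source character
S2 = 6.8   # assumed variance of that ratio


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def match_cost(len1, len2, prior):
    """Negative log-probability of aligning len1 source chars with len2 target chars."""
    mean = (len1 + len2 / C) / 2.0
    delta = (len2 - len1 * C) / math.sqrt(mean * S2)
    p_delta = max(2.0 * (1.0 - norm_cdf(abs(delta))), 1e-12)  # two-tailed
    return -math.log(prior) - math.log(p_delta)


def align(src_lens, tgt_lens):
    """Return the cheapest list of beads (source indices, target indices)."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c = cost[i][j] + match_cost(sum(src_lens[i:ni]),
                                            sum(tgt_lens[j:nj]), prior)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the beads.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads


# Hypothetical character lengths: the second source sentence was split in two.
print(align([20, 45, 30], [22, 20, 24, 31]))
```

On these made-up lengths the cheapest path pairs the 45-character source sentence with the two target sentences of 20 and 24 characters, since any other path needs a badly mismatched pair or a costly insertion.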
English WSD: WordNet::SenseRelate::AllWords
- Each target word is centered in a balanced window whose size is chosen by the user
- Each possible sense of the target word is scored for similarity against the senses of the surrounding words in the window, in a pairwise fashion
- The sense with the highest score, after summing the pairwise scores, is taken as the sense of the word
- For similarity, it uses the 10 measures of relatedness provided by WordNet::Similarity [http://wn-similarity.sourceforge.net]
Lesk Algorithm
- Assigns a sense to a word by comparing the glosses of the surrounding words with the glosses of the various senses of the target word
- The sense whose gloss has the largest number of overlaps is assigned
- The Extended Lesk Algorithm also uses the WordNet hierarchy to improve accuracy
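The window-based scoring and the gloss-overlap idea can be combined in a small Python sketch. It does not call WordNet::SenseRelate::AllWords or WordNet::Similarity; it only illustrates one plausible reading of the scheme, with plain gloss word overlap standing in for the relatedness measure. The sense labels, glosses, stopword list, and the function names `overlap` and `disambiguate` are all hypothetical.

```python
# Hypothetical sense inventory: word -> {sense label: gloss}.
SENSES = {
    "bank": {
        "bank#1": "sloping land beside a body of water such as a river",
        "bank#2": "a financial institution that accepts deposits of money",
    },
    "river": {
        "river#1": "a large natural stream of water flowing over land",
    },
    "money": {
        "money#1": "a medium of exchange accepted by a financial institution",
    },
}
STOPWORDS = {"a", "an", "the", "of", "or", "by", "as", "such", "that"}


def overlap(gloss_a, gloss_b):
    """Number of distinct non-stopword tokens shared by two glosses."""
    tokens_a = set(gloss_a.lower().split()) - STOPWORDS
    tokens_b = set(gloss_b.lower().split()) - STOPWORDS
    return len(tokens_a & tokens_b)


def disambiguate(words, index, window=2):
    """Pick a sense for words[index] using `window` words on each side."""
    target = words[index]
    lo, hi = max(0, index - window), min(len(words), index + window + 1)
    # Context words inside the window that have an entry in the inventory.
    context = [w for k, w in enumerate(words[lo:hi], start=lo)
               if k != index and w in SENSES]
    scores = {}
    for sense, gloss in SENSES[target].items():
        # For each context word keep its best-matching sense, then sum.
        scores[sense] = sum(
            max(overlap(gloss, g) for g in SENSES[w].values())
            for w in context
        )
    best = max(scores, key=scores.get)
    return best, scores


print(disambiguate("standing on the bank of river ganga".split(), 3, window=2))
print(disambiguate("he deposited money in the bank".split(), 5, window=3))
```

With these toy glosses the first sentence selects the riverbank sense and the second the financial sense. A larger window brings in more context words at the cost of more pairwise comparisons.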