Cross-Lingual Word Sense Disambiguation using WordNets and Context Mapping
Priyank Jaini, Ankit Agrawal
{pjaini,ankitag}@iitk.ac.in
Department of Mathematics and Statistics, IIT Kanpur
Advisor: Prof. Amitabha Mukerjee
Date: March 21, 2013
What is Word Sense Disambiguation (WSD)?
- Assigning the correct sense (meaning in context) to a word in a sentence when the word can have multiple meanings
- Example:
  - "I was standing on the bank of the river Ganga."
  - "Mr. Bank owns a bank."
Importance and Motivation
- WSD is needed for machine translation, lexicography, semantic interpretation, information retrieval, etc.
- Hindi lacks resources
- The approach can be used to create or enrich sense-tagged data
Our Approach
- Parallel Corpus and Alignment
- English WSD
- Synset Mapping
- Transfer to Hindi
Methodology and Algorithms Used
1) Parallel corpus for Hindi-English (EMILLE)
2) Alignment of the text using the Church and Gale algorithm
3) English WSD on the English text
4) Synset mapping using [10]
5) Transfer of senses to the Hindi text
Figure taken from [1]
English Word Sense Disambiguation (Step 3)
- We shall use “WordNet::SenseRelate::AllWords”
- It uses the Lesk algorithm for disambiguation
- After this step, we have a sense-tagged English text
- The English WordNet is used as the sense inventory for English WSD
Synset Mapping (Step 4)
- Takes an English synset as input and outputs the best-matching Hindi synset
- Uses the fact that in WordNet, the first word in a synset best represents the sense of the synset
- The hypernymy relation is the basis for finding the best match
- Over the hypernymy hierarchies, a weighted formula given in [10] is used to determine the best synset
Synset Mapping
- Candidate synsets: obtained by finding the Hindi translations of the first word in the input synset, then finding the Hindi synsets that contain one or more of these translations
- The hypernymy hierarchies of these candidate synsets are found; they are called candidate hierarchies
- The hypernymy hierarchy of the input English synset is also obtained
- For each synset in the English hypernymy hierarchy, the Hindi translations of all the words occurring in it are found
- These Hindi words are looked up in the candidate hierarchies. If a match is found, the weight of that candidate synset is increased. Initially, all weights are zero
- The total weight of each candidate hierarchy is computed, and the candidate with the highest weight is mapped to the English synset
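The matching loop above can be sketched in a few lines. This is a minimal illustration, not the method of [10]: the weighted formula is not given in these slides, so the sketch simply adds 1 per matching Hindi word. The names `map_synset`, `EN_HIERARCHY`, `HI_HIERARCHIES` and `EN_TO_HI`, and the romanised toy entries, are hypothetical stand-ins for the WordNets and the bilingual dictionary.

```python
# Toy data (hypothetical): a hierarchy is a list of synsets, starting with
# the synset itself and followed by its hypernyms. A synset is a word tuple.
EN_HIERARCHY = [("bank", "riverbank"), ("slope", "incline"), ("land",)]

HI_HIERARCHIES = {
    "hi_shore": [("kinara", "tat"), ("dhalaan",), ("bhoomi",)],
    "hi_finance": [("baink",), ("sanstha",)],
    "hi_river": [("nadi",), ("jal",)],
}

# Bilingual dictionary: English word -> set of Hindi translations.
EN_TO_HI = {
    "bank": {"kinara", "baink"},
    "riverbank": {"tat"},
    "slope": {"dhalaan"},
    "incline": {"dhalaan"},
    "land": {"bhoomi", "zameen"},
}


def map_synset(en_hierarchy, hi_hierarchies, en_to_hi):
    """Return (best Hindi synset id, weights of all candidates)."""
    # Candidates: Hindi synsets containing a translation of the first
    # word of the input English synset.
    head_translations = en_to_hi.get(en_hierarchy[0][0], set())
    candidates = [
        sid for sid, hierarchy in hi_hierarchies.items()
        if head_translations & set(hierarchy[0])
    ]
    weights = {sid: 0 for sid in candidates}  # all weights start at zero

    for en_synset in en_hierarchy:
        # Hindi translations of every word in this English synset.
        hindi_words = set()
        for word in en_synset:
            hindi_words |= en_to_hi.get(word, set())
        for sid in candidates:
            hierarchy_words = {w for syn in hi_hierarchies[sid] for w in syn}
            # Assumption: uniform weight of 1 per matching word.
            weights[sid] += len(hindi_words & hierarchy_words)

    best = max(weights, key=weights.get) if weights else None
    return best, weights


best, weights = map_synset(EN_HIERARCHY, HI_HIERARCHIES, EN_TO_HI)
print(best, weights)
```

On this toy data, "hi_river" is never considered because it contains no translation of "bank", and the shore sense wins because its hypernyms also match the translated English hypernyms.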
We are expecting:
- Better accuracy, since a parallel aligned corpus is used
- Better results for scenarios where:
  - An English word is polysemous and its Hindi equivalent is also polysemous
  - An English word is monosemous and its Hindi equivalent is polysemous
Limitations
- Is valid only for nouns
- Not trained for morphological handling
References
1) Debasri Chakrabarti, Dipak Kumar Narayan, Prabhakar Pandey, Pushpak Bhattacharyya. Experiences in building the Indo WordNet - A WordNet for Hindi.
2) Bahareh Sarrafzadeh, Nikolay Yakovets, Nick Cercone, Aijun An. Cross Lingual Word Sense Disambiguation for Languages with Scarce Resources.
3) Els Lefever and Veronique Hoste. SemEval-2010 Task 3: Cross-Lingual Word Sense Disambiguation.
4) Michael Lesk. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In SIGDOC '86: Proceedings of the 5th Annual International Conference on Systems Documentation, pages 24-26, New York, NY, USA, 1986. ACM.
5) Satanjeev Banerjee and Ted Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In IJCAI '03, pages 805-810, 2003.
6) Els Lefever and Veronique Hoste. Examining the validity of Cross-Lingual Word Sense Disambiguation.
7) http://wordnet.princeton.edu/
8) Roberto Navigli. Word Sense Disambiguation - A Survey.
9) Roberto Navigli and Simone Paolo Ponzetto. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network.
10) Ramanand, Akshay Ukey, Brahm Kiran Singh, and Pushpak Bhattacharyya. Mapping and structural analysis of multi-lingual wordnets.
Thank You!!
Church and Gale Algorithm
- A method for aligning sentences based on a statistical model of character lengths
- Uses the fact that longer sentences in one language tend to be translated into longer sentences in the other language, and shorter sentences into shorter ones
- The algorithm is a two-step process: 1) paragraph alignment, then 2) sentence alignment
- Based on a probabilistic model
- It is language independent in principle, though it would have to be tested on Hindi-English
Ref: William A. Gale and Kenneth W. Church. A Program for Aligning Sentences in Bilingual Corpora.
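A minimal sketch of the sentence-alignment step (paragraph alignment is skipped) may make the length-based model concrete. It uses dynamic programming over sentence lengths in characters. The constants `C`, `S2` and `PRIORS` are the values we recall from the referenced paper, estimated on European language pairs; treat them as assumptions that would need re-estimating for Hindi-English. The function names `match_cost` and `align` are hypothetical.

```python
import math

# Assumed constants (reported for European language pairs in the reference).
C = 1.0    # expected target characters per source character
S2 = 6.8   # variance of that ratio
PRIORS = {  # prior probability of each alignment type (source, target)
    (1, 1): 0.89,
    (1, 0): 0.0099, (0, 1): 0.0099,
    (2, 1): 0.089, (1, 2): 0.089,
    (2, 2): 0.011,
}


def match_cost(len_src, len_tgt, prior):
    """Negative log probability of aligning two spans of these lengths."""
    mean = (len_src + len_tgt / C) / 2.0
    delta = 0.0 if mean == 0 else (len_tgt - len_src * C) / math.sqrt(mean * S2)
    # Two-sided tail probability of a standard normal at |delta|.
    p_delta = math.erfc(abs(delta) / math.sqrt(2.0))
    return -math.log(max(p_delta, 1e-300) * prior)


def align(src_lens, tgt_lens):
    """Return a list of (source indices, target indices) alignment pairs."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = match_cost(sum(src_lens[i:ni]), sum(tgt_lens[j:nj]), prior)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the best path.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return pairs[::-1]


# Lengths in characters of the sentences in one paragraph pair.
print(align([20, 35, 50], [22, 33, 52]))
print(align([20, 30, 50], [52, 51]))
```

In the second call, the first two source sentences are merged and aligned to one target sentence, because a 2-1 pair of similar total length is cheaper than any alternative involving a deletion.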
English WSD: WordNet::SenseRelate::AllWords
- Each target word is centered in a balanced window whose size is set by the user
- Each possible sense of the target word is scored for similarity against the senses of the surrounding words in the window, in a pairwise fashion
- The sense with the highest score after summing the pairwise scores is taken as the sense of the word
- For similarity, it uses the 10 measures of relatedness provided by WordNet::Similarity [http://wn-similarity.sourceforge.net]
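The window-based scoring described above can be sketched as follows. This is not the Perl module's code: the function `disambiguate_all`, the sense inventory `SENSES` and the relatedness table `REL` are hypothetical, and the sketch sums over all sense pairs as the bullets state, which may differ in detail from how the module aggregates scores.

```python
# Hypothetical sense inventory and pairwise relatedness scores.
SENSES = {
    "bank": ["bank#river", "bank#money"],
    "river": ["river#1"],
    "loan": ["loan#1"],
}
REL = {
    frozenset(("bank#river", "river#1")): 0.9,
    frozenset(("bank#money", "river#1")): 0.1,
    frozenset(("bank#river", "loan#1")): 0.1,
    frozenset(("bank#money", "loan#1")): 0.8,
}


def relatedness(sense_a, sense_b):
    """Stand-in for a WordNet::Similarity measure (symmetric lookup)."""
    return REL.get(frozenset((sense_a, sense_b)), 0.0)


def disambiguate_all(words, window=3):
    """Tag each word with its best sense, or None if it has no senses."""
    half = window // 2  # words taken on each side of the target
    tagged = []
    for i, target in enumerate(words):
        candidates = SENSES.get(target, [])
        if not candidates:
            tagged.append(None)
            continue
        neighbours = [
            w for j, w in enumerate(words) if j != i and abs(j - i) <= half
        ]

        def score(sense):
            # Sum of pairwise relatedness to every sense of every neighbour.
            return sum(
                relatedness(sense, other)
                for n in neighbours
                for other in SENSES.get(n, [])
            )

        tagged.append(max(candidates, key=score))
    return tagged


print(disambiguate_all(["river", "bank"]))
print(disambiguate_all(["bank", "loan"]))
```

Swapping `relatedness` for a real measure changes the scores but not the control flow, which is the point of the design: the disambiguator is independent of the measure used.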
Lesk Algorithm
- Assigns a sense to a word by comparing the glosses of the surrounding words with the glosses of the various senses of the target word
- The sense whose gloss has the largest number of overlaps is assigned
- The Extended Lesk algorithm also uses the glosses of related synsets in the WordNet hierarchy to improve accuracy
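A small sketch of the basic gloss-overlap idea, applied to the "bank" example from the opening slide. The glosses in `GLOSSES` are hand-written for illustration and are not WordNet's; the stopword list and the names `tokens` and `lesk` are likewise hypothetical. Overlap is counted as the number of shared content words, which is the simplest possible scoring.

```python
import re

# Hypothetical mini-dictionary: word -> {sense id: gloss}.
GLOSSES = {
    "bank": {
        "bank#1": "sloping land beside a river or lake",
        "bank#2": "financial institution that accepts deposits of money",
    },
    "river": {"river#1": "large natural stream of water flowing to a sea or lake"},
    "deposit": {"deposit#1": "money placed in a financial institution"},
}

STOPWORDS = {"a", "an", "the", "of", "or", "to", "in", "on", "at", "that"}


def tokens(text):
    """Lower-cased content words of a text, as a set."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def lesk(target, sentence):
    """Return (best sense id, overlap count per sense) for the target word."""
    # Pool the gloss words of every sense of every surrounding word.
    context_gloss_words = set()
    for word in tokens(sentence):
        if word != target:
            for gloss in GLOSSES.get(word, {}).values():
                context_gloss_words |= tokens(gloss)
    # Score each sense of the target by its gloss overlap with that pool.
    scores = {
        sense: len(tokens(gloss) & context_gloss_words)
        for sense, gloss in GLOSSES[target].items()
    }
    return max(scores, key=scores.get), scores


print(lesk("bank", "I was standing on the bank of river Ganga."))
print(lesk("bank", "He made a deposit at the bank."))
```

In the first sentence the only overlap is "lake", shared by the glosses of "river" and the riverside sense of "bank". Such thin evidence is typical of short glosses, which is the motivation for the extended variant.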