Cross-Lingual Word Sense Disambiguation using WordNets and Context Mapping



slide-1
SLIDE 1

Cross-Lingual Word Sense Disambiguation using WordNets and Context Mapping

Priyank Jaini Ankit Agrawal

{pjaini,ankitag}@iitk.ac.in Department of Mathematics and Statistics IIT Kanpur Advisor: Prof. Amitabha Mukerjee Date: March 21,2013

slide-2
SLIDE 2

What is Word Sense Disambiguation (WSD)?

  • Assigning the correct sense (meaning/context) to a word in a sentence when the word can have multiple meanings
  • Example:
      • I was standing on the bank of the river Ganga.
      • Mr. Bank owns a bank.

Importance and Motivation

  • Machine translation, lexicography, semantic interpretation, information retrieval, etc.
  • Hindi lacks resources for WSD, such as sense-tagged corpora
  • Can be used to create or enrich sense-tagged data
slide-3
SLIDE 3

Our Approach

  • Parallel Corpus and Alignment
  • English WSD
  • Synset Mapping
  • Transfer to Hindi
slide-4
SLIDE 4

Methodology and Algorithms Used

1) Parallel corpus for Hindi-English (EMILLE)
2) Alignment of the text using the Church and Gale algorithm
3) English WSD on the English text
4) Synset mapping using [10]
5) Transfer of senses to the Hindi text

Figure taken from [1]
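
A minimal sketch of how the outputs of steps 3 and 4 might feed step 5, assuming sentence alignment has already paired each English sentence with its Hindi counterpart. The synset identifiers, the transliterated Hindi tokens, and the function `transfer_senses` are hypothetical. The slides do not say how a sense is projected onto an individual Hindi word, so membership in the mapped Hindi synset is used here as one plausible rule.

```python
from typing import Dict, List, Set

# Hypothetical toy data; none of these identifiers come from the slides.
# Output of step 3: English word -> chosen English synset.
ENGLISH_TAGS: Dict[str, str] = {"bank": "en:bank.riverside"}
# Output of step 4: English synset -> best matching Hindi synset.
SYNSET_MAP: Dict[str, str] = {"en:bank.riverside": "hi:kinara.1"}
# Hindi WordNet stand-in: Hindi synset -> member words (transliterated).
HINDI_SYNSETS: Dict[str, Set[str]] = {"hi:kinara.1": {"kinara", "tat"}}


def transfer_senses(
    english_tags: Dict[str, str],
    hindi_tokens: List[str],
    synset_map: Dict[str, str],
    hindi_synsets: Dict[str, Set[str]],
) -> Dict[str, str]:
    """Step 5 (assumed rule): tag a Hindi token with a mapped Hindi synset
    when the token is a member of that synset."""
    tagged: Dict[str, str] = {}
    for en_synset in english_tags.values():
        hi_synset = synset_map.get(en_synset)
        if hi_synset is None:
            continue  # no mapping found for this English sense
        members = hindi_synsets.get(hi_synset, set())
        for token in hindi_tokens:
            if token in members:
                tagged[token] = hi_synset
    return tagged


# Aligned Hindi sentence for "I was standing on the bank of river Ganga."
hindi_sentence = ["main", "ganga", "ke", "tat", "par", "khada", "tha"]
result = transfer_senses(ENGLISH_TAGS, hindi_sentence, SYNSET_MAP, HINDI_SYNSETS)
print(result)
```

Only tokens that belong to the mapped synset receive a tag, so the rest of the Hindi sentence stays untagged.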

slide-5
SLIDE 5

The English Word Sense Disambiguation (Step-3)

  • We shall use “WordNet::SenseRelate::AllWords”
  • It uses the Lesk algorithm for disambiguation
  • After this step, we would have a sense-tagged English text.
  • The English WordNet would be used as the sense inventory for English WSD
slide-6
SLIDE 6

Synset Mapping (Step 4)

  • Takes an English synset as input and produces the best-matching Hindi synset as output
  • Uses the fact that in WordNet, the first word in a synset best represents the sense of the synset
  • The hypernymy relation is the basis for finding the best match
  • Over the hypernymy hierarchies, a weighted formula given in [10] is used to determine the best synset.

slide-7
SLIDE 7

Synset Mapping

  • Candidate synsets: obtained by finding the Hindi translations of the first word in the input synset and then finding the Hindi synsets that contain one or more of these translations
  • The hypernymy hierarchies of these candidate synsets are found. They are called candidate hierarchies.
  • The hypernymy hierarchy of the input English synset is also obtained.
  • For each synset in the English hypernymy hierarchy, the Hindi translations of all the words occurring in it are found.
  • These Hindi words are searched for in the candidate hierarchies. Each match increases the weight of that candidate synset. Initially, all weights are zero.
  • The total weight of each candidate hierarchy is computed, and the candidate with the highest weight is mapped to the English synset.
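
The mapping steps above can be sketched as follows. This is an illustration under stated assumptions, not the formula from [10]: every match simply adds 1 to a candidate's weight, whereas [10] uses a weighted formula that the slides do not reproduce. The dictionaries, synset identifiers, and the function `map_synset` are hypothetical toy stand-ins for the English WordNet, the Hindi WordNet, and a bilingual dictionary.

```python
from typing import Dict, List, Optional, Set, Tuple

# Toy bilingual dictionary: English word -> Hindi translations (transliterated).
EN_HI: Dict[str, Set[str]] = {
    "bank": {"kinara", "baink"},
    "slope": {"dhalan"},
}

# Hypernymy chains, starting at the synset itself and moving up.
# Each synset is a list of words; the first word represents the synset.
EN_HIERARCHY: Dict[str, List[List[str]]] = {
    "en_bank_1": [["bank"], ["slope"], ["geological_formation"]],
}
HI_HIERARCHY: Dict[str, List[List[str]]] = {
    "hi_1": [["kinara", "tat"], ["dhalan"]],
    "hi_2": [["baink"], ["vittiya_sanstha"], ["sanstha"]],
}


def map_synset(en_id: str) -> Tuple[Optional[str], Dict[str, int]]:
    en_chain = EN_HIERARCHY[en_id]

    # Candidates: Hindi synsets containing a translation of the first word.
    first_word = en_chain[0][0]
    translations = EN_HI.get(first_word, set())
    candidates = [
        hi_id for hi_id, chain in HI_HIERARCHY.items()
        if translations & set(chain[0])
    ]

    # Hindi translations of every word in the English hypernymy hierarchy.
    hindi_words: Set[str] = set()
    for synset in en_chain:
        for word in synset:
            hindi_words |= EN_HI.get(word, set())

    # Weights start at zero; each match in a candidate hierarchy adds 1
    # (assumed unit weight, standing in for the formula of [10]).
    weights = {hi_id: 0 for hi_id in candidates}
    for hi_id in candidates:
        for synset in HI_HIERARCHY[hi_id]:
            weights[hi_id] += len(hindi_words & set(synset))

    best = max(weights, key=weights.get) if weights else None
    return best, weights


best, weights = map_synset("en_bank_1")
print(best, weights)
```

In the toy data both Hindi synsets are candidates for "bank", but only the riverside candidate also matches the translation of the hypernym "slope", which is what separates the two.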

slide-8
SLIDE 8

We are expecting:

  • Since a parallel aligned corpus is used, we should achieve better accuracy
  • Better results are expected in scenarios where:
      • An English word is polysemous and its Hindi equivalent is also polysemous
      • An English word is monosemous and its Hindi equivalent is polysemous

Limitations

  • Valid only for nouns
  • Does not handle morphological variation
slide-9
SLIDE 9

References

1) Debasri Chakrabarti, Dipak Kumar Narayan, Prabhakar Pandey, Pushpak Bhattacharyya. Experiences in building the Indo WordNet - A WordNet for Hindi.
2) Bahareh Sarrafzadeh, Nikolay Yakovets, Nick Cercone, Aijun An. Cross Lingual Word Sense Disambiguation for Languages with Scarce Resources.
3) Els Lefever and Veronique Hoste. SemEval-2010 Task 3: Cross-Lingual Word Sense Disambiguation.
4) Michael Lesk. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In SIGDOC’86: Proceedings of the 5th annual international conference on Systems documentation, pages 24-26, New York, NY, USA, 1986. ACM.
5) Satanjeev Banerjee and Ted Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In IJCAI’03, pages 805-810, 2003.
6) Els Lefever and Veronique Hoste. Examining the validity of Cross-Lingual Word Sense Disambiguation.
7) http://wordnet.princeton.edu/
8) Roberto Navigli. Word Sense Disambiguation: A Survey.
9) Roberto Navigli and Simone Paolo Ponzetto. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network.
10) Ramanand, Akshay Ukey, Brahm Kiran Singh, and Pushpak Bhattacharyya. Mapping and structural analysis of multi-lingual wordnets.

slide-10
SLIDE 10

Thank You!!

slide-11
SLIDE 11

Church and Gale Algorithm

  • A method for aligning sentences based on a statistical model of character lengths
  • Uses the fact that longer sentences in one language tend to be translated into longer sentences in the other language, and shorter sentences into shorter ones.
  • The algorithm is a two-step process: 1) paragraph alignment, then 2) sentence alignment
  • Based on a probabilistic model
  • It is language independent, though it would still have to be tested on Hindi-English.

Ref: A Program for Aligning Sentences in Bilingual Corpora, William A. Gale and Kenneth W. Church
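
A simplified sketch of length-based sentence alignment in the spirit of the method above, using dynamic programming over character lengths. Several things here are assumptions rather than content from the slides: the parameter values (commonly quoted for this model), the restricted set of match types, the function names `match_cost` and `align`, and the choice to scale the variance by the mean of the two lengths so that unmatched sentences do not cause a division by zero. Only the sentence-level stage is shown; paragraph alignment is omitted.

```python
import math
from typing import List, Tuple

# Assumed parameters, not taken from the slides.
C = 1.0    # expected target characters per source character
S2 = 6.8   # variance of that ratio
# Prior probability of each match type (source sentences, target sentences).
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089}


def norm_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def match_cost(len_src: int, len_tgt: int, prior: float) -> float:
    """Negative log probability of pairing text of these character lengths."""
    mean = max((len_src + len_tgt / C) / 2.0, 1.0)
    delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    p_delta = 2.0 * (1.0 - norm_cdf(abs(delta)))
    return -math.log(max(p_delta, 1e-12) * prior)


def align(src: List[str], tgt: List[str]) -> List[Tuple[List[int], List[int]]]:
    """Return aligned groups of (source indices, target indices)."""
    n, m = len(src), len(tgt)
    sl = [len(s) for s in src]
    tl = [len(t) for t in tgt]
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[(0, 0)] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # Forward pass: extend every reachable cell by each match type.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c = cost[i][j] + match_cost(sum(sl[i:ni]), sum(tl[j:nj]), prior)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Backtrack from the end to recover the cheapest sequence of groups.
    groups = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        groups.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    groups.reverse()
    return groups


one_to_one = align(["a" * 10, "b" * 20], ["x" * 11, "y" * 19])
two_to_one = align(["a" * 10, "b" * 12, "c" * 30], ["x" * 22, "y" * 30])
print(one_to_one)
print(two_to_one)
```

Because only character lengths are compared, the sketch needs no lexical resources, which is why the approach is described as language independent.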

slide-12
SLIDE 12

English WSD: WordNet::SenseRelate::AllWords

  • Each target word is centered in a balanced window whose size is chosen by the user.
  • The possible senses of the word are measured for similarity against the senses of the surrounding words in the window, in a pairwise fashion.
  • The sense with the highest score after summing the pairwise scores is taken as the sense of the word.
  • For computing similarity, it uses the 10 measures of relatedness provided by WordNet::Similarity [http://wn-similarity.sourceforge.net]
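
The window-based scoring described above can be sketched in a few lines. This is not the Perl module's code: the sense inventory, the relatedness table, and the functions `toy_relatedness` and `disambiguate` are hypothetical, and the relatedness measure is a lookup table standing in for whichever WordNet::Similarity measure is selected.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical sense inventory: word -> candidate sense labels.
SENSES: Dict[str, List[str]] = {
    "bank": ["bank#river", "bank#finance"],
    "river": ["river#stream"],
    "water": ["water#liquid"],
    "loan": ["loan#debt"],
}

# Hypothetical symmetric relatedness scores between pairs of senses.
RELATEDNESS = {
    frozenset({"bank#river", "river#stream"}): 0.9,
    frozenset({"bank#river", "water#liquid"}): 0.7,
    frozenset({"bank#finance", "loan#debt"}): 0.9,
    frozenset({"bank#finance", "river#stream"}): 0.1,
}


def toy_relatedness(a: str, b: str) -> float:
    return RELATEDNESS.get(frozenset({a, b}), 0.0)


def disambiguate(
    tokens: List[str],
    index: int,
    window: int,
    relatedness: Callable[[str, str], float],
) -> Optional[str]:
    """Pick the sense of tokens[index] with the highest summed pairwise
    relatedness to the senses of the words in a balanced window."""
    half = window // 2
    lo, hi = max(0, index - half), min(len(tokens), index + half + 1)
    context = [tokens[k] for k in range(lo, hi)
               if k != index and tokens[k] in SENSES]

    best_sense, best_score = None, float("-inf")
    for sense in SENSES.get(tokens[index], []):
        score = sum(relatedness(sense, other)
                    for word in context for other in SENSES[word])
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


print(disambiguate(["river", "bank", "water"], 1, 3, toy_relatedness))
print(disambiguate(["loan", "bank"], 1, 3, toy_relatedness))
```

Passing the relatedness function as a parameter mirrors the idea that the measure is a user choice, independent of the windowing and summation logic.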

Lesk Algorithm

  • Assigns a sense to a word by comparing the glosses of the surrounding words with the glosses of the various senses of the target word.
  • The sense whose gloss has the largest number of overlaps is assigned.
  • The Extended Lesk algorithm also uses the glosses of synsets related through the WordNet hierarchy to improve accuracy.
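
A minimal sketch of the basic gloss-overlap idea, assuming a tiny hand-written gloss dictionary in place of WordNet. The glosses, sense labels, stopword list, and the function `lesk` are all hypothetical, and overlap is counted as shared content words rather than shared phrases. The extended variant is not shown.

```python
from typing import Dict, List, Optional, Set

# Hypothetical glosses: word -> {sense label: gloss text}.
GLOSSES: Dict[str, Dict[str, str]] = {
    "bank": {
        "bank#river": "sloping land beside a body of water such as a river",
        "bank#finance": "a financial institution that accepts deposits and lends money",
    },
    "river": {"river#stream": "a large natural stream of water flowing to the sea"},
    "deposit": {"deposit#money": "money placed in a financial institution account"},
}
STOPWORDS = {"a", "of", "the", "to", "in", "and", "that", "such", "as"}


def content_words(text: str) -> Set[str]:
    return {w for w in text.lower().split() if w not in STOPWORDS}


def lesk(target: str, context_words: List[str]) -> Optional[str]:
    """Choose the sense of `target` whose gloss shares the most content
    words with the glosses of the surrounding words."""
    context_bag: Set[str] = set()
    for word in context_words:
        for gloss in GLOSSES.get(word, {}).values():
            context_bag |= content_words(gloss)

    best_sense, best_overlap = None, -1
    for sense, gloss in GLOSSES.get(target, {}).items():
        overlap = len(content_words(gloss) & context_bag)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(lesk("bank", ["river"]))
print(lesk("bank", ["deposit"]))
```

With "river" in the context, the shared word "water" selects the riverside sense; with "deposit", the shared words "financial", "institution", and "money" select the financial sense.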