Combining Probabilistic and Translation-Based Models for Information Retrieval based on Word Sense Annotations
Elisabeth Wolf, Delphine Bernhard, Iryna Gurevych
Ubiquitous Knowledge Processing (UKP) Lab, Prof. Dr. Iryna Gurevych
Fachbereich Informatik, Technische Universität Darmstadt
Motivation: monolingual task
1. Increase precision of WSD
2. Apply translation-based model + combination with probabilistic model
[Diagram: INDEXING: heuristic-based combinations of both annotations (UBC, NUS -> Comb). RETRIEVAL: reranking of retrieved documents.]
02.10.09 | Computer Science Department | Ubiquitous Knowledge Processing Lab
1. Increase precision of WSD
• Four different index types:
  • UBCBest
  • NUSBest
  • CombBest
UBC annotation:
  <SYNSET SCORE="0.32" CODE="00735486-n"/>
  <SYNSET SCORE="0.21" CODE="03857483-n"/>
  <SYNSET SCORE="0.47" CODE="01252343-n"/>
NUS annotation:
  <SYNSET SCORE="0.82" CODE="00735486-n"/>
  <SYNSET SCORE="0.18" CODE="03857483-n"/>
CombBest (scores of both annotations summed per synset):
  00735486-n: 0.82 + 0.32 = 1.14
  03857483-n: 0.18 + 0.21 = 0.39
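The selection of the best sense per index type can be sketched as follows. This is a minimal illustration, not the authors' implementation: the scores are copied from the slide, assuming the left-hand column (0.32, 0.21, 0.47) is the UBC annotation and the right-hand column (0.82, 0.18) is the NUS annotation. The function names and the dict layout are hypothetical, and restricting CombBest to synsets proposed by both systems is an assumption based on the two sums shown on the slide.

```python
# Scores copied from the slide; which column belongs to which system is an assumption.
ubc = {"00735486-n": 0.32, "03857483-n": 0.21, "01252343-n": 0.47}
nus = {"00735486-n": 0.82, "03857483-n": 0.18}


def best_sense(scores):
    """Return the synset code with the highest score (UBCBest / NUSBest)."""
    return max(scores, key=scores.get)


def comb_best(a, b):
    """Sum the scores of synsets annotated by both systems.

    Returns (best_code, summed_scores). Using only the shared synsets is an
    assumption; the slide shows sums for the two shared synsets only.
    """
    shared = a.keys() & b.keys()
    if not shared:
        return None, {}
    summed = {code: a[code] + b[code] for code in shared}
    return best_sense(summed), summed


ubc_best = best_sense(ubc)              # highest-scoring UBC synset
nus_best = best_sense(nus)              # highest-scoring NUS synset
comb_code, comb_scores = comb_best(ubc, nus)
```

With these numbers the three index types do not all agree: the UBC annotation alone prefers 01252343-n, while the NUS annotation and the summed scores both prefer 00735486-n.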
1. Increase precision of WSD
• Four different index types:
  • UBCBest
  • NUSBest
  • CombBest
  • CombBest+
UBC annotation:
  <SYNSET SCORE="0.32" CODE="0111222-n"/>
  <SYNSET SCORE="0.21" CODE="0333444-n"/>
  <SYNSET SCORE="0.47" CODE="01252343-n"/>
NUS annotation:
  <SYNSET SCORE="0.82" CODE="00735486-n"/>
  <SYNSET SCORE="0.18" CODE="03857483-n"/>
• Terrier, version 2.1
• Multi-field indices: token, lemma, sense (UBCBest, NUSBest, CombBest, CombBest+)
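Reading such annotations before indexing could look roughly like the sketch below. It parses SYNSET elements with Python's standard library and picks the highest-scoring code, which is the value a "Best" sense field would hold. The wrapper element `<TERM>` and the function names are assumptions for illustration; the slide shows only the SYNSET elements, and the values are taken from what appears to be its UBC column.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper: <TERM> is an assumption added so the fragment
# has a single root element and can be parsed.
UBC_XML = """
<TERM>
  <SYNSET SCORE="0.32" CODE="0111222-n"/>
  <SYNSET SCORE="0.21" CODE="0333444-n"/>
  <SYNSET SCORE="0.47" CODE="01252343-n"/>
</TERM>
"""


def read_synsets(xml_text):
    """Map each synset code to its score; stray spaces in CODE are stripped."""
    root = ET.fromstring(xml_text)
    return {
        s.get("CODE").strip(): float(s.get("SCORE"))
        for s in root.iter("SYNSET")
    }


def best_synset(xml_text):
    """Return the code with the highest score, or None if there is no synset."""
    scores = read_synsets(xml_text)
    return max(scores, key=scores.get) if scores else None


ubc_scores = read_synsets(UBC_XML)
ubc_best = best_synset(UBC_XML)
```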
2. Combination of Retrieval Models
Probabilistic model + query expansion:
• Divergence From Randomness BM25 model (DFR_BM25)
• Kullback-Leibler model (KL): 10 terms out of the 3 top-ranked docs
Translation-based model:
• Monolingual translation-based model (TM)
• Motivation:
  • address the lexical gap problem
  • learn translation probabilities between terms
  • trained on parallel dataset: dictionary and encyclopedic definitions
  • "the translation probability reflects the association between query term and document term"
• Usage:
  • trained model recently successfully applied by Bernhard & Gurevych (2009) for answer finding
  • trained on token
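How translation probabilities can bridge the lexical gap is illustrated by the sketch below. It uses a generic translation language model, where a query term w is scored as P(w|D) = (1 - lam) * sum_t P(w|t) * P(t|D) + lam * P(w|C). This formulation, the smoothing weight `lam`, the toy probabilities, and all names are assumptions for illustration; the slide does not give the exact model used.

```python
import math
from collections import Counter


def tm_log_score(query, doc, trans_prob, coll_prob, lam=0.5):
    """Log-likelihood of the query under a translation language model.

    trans_prob maps (query_term, doc_term) to P(query_term | doc_term).
    coll_prob maps a term to its collection probability, used for smoothing.
    """
    counts = Counter(doc)
    doc_len = len(doc)
    score = 0.0
    for w in query:
        if doc_len:
            # Sum over document terms t of P(w|t) * P_ml(t|D).
            p_trans = sum(
                trans_prob.get((w, t), 0.0) * c / doc_len
                for t, c in counts.items()
            )
        else:
            p_trans = 0.0
        # Small floor for unseen terms so the logarithm stays defined.
        p = (1 - lam) * p_trans + lam * coll_prob.get(w, 1e-9)
        score += math.log(p)
    return score


# Toy, hand-picked values (not trained probabilities).
trans_prob = {("car", "automobile"): 0.4, ("car", "car"): 0.6}
coll_prob = {"car": 0.01}
query = ["car"]
doc_a = ["automobile", "engine"]   # no literal match, but a related term
doc_b = ["banana", "fruit"]        # unrelated

score_a = tm_log_score(query, doc_a, trans_prob, coll_prob)
score_b = tm_log_score(query, doc_b, trans_prob, coll_prob)
```

In this toy setup `doc_a` scores higher than `doc_b` even though neither contains the query term, because "automobile" carries translation probability mass for "car".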
2. Combination of Retrieval Models
• Hypothesis: probabilistic and translation-based models retrieve different sets of relevant documents
[Diagram: TM produces a ranked list from the token index; DFR_BM25 + KL produces ranked lists from the token, lemma, and sense indices; the ranked lists are merged into one combined ranking.]
A) Normalization: $r_{norm}(i) = (r_{orig}(i) - r_{min}) / (r_{max} - r_{min})$
B) CombSUM by Fox & Shaw (1994): $r_{comb}(i) = \sum r_{norm}(i)$, summed over the individual runs
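Steps A and B can be sketched in a few lines. This is a minimal illustration with made-up scores and hypothetical function names; mapping a run whose scores are all equal to 0.0 is an assumption, since the slide does not cover that case.

```python
def min_max_normalize(scores):
    """Step A: r_norm(i) = (r_orig(i) - r_min) / (r_max - r_min)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Assumption: a constant run contributes 0.0 for every document.
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_sum(runs):
    """Step B: sum the normalized scores of each document over all runs."""
    combined = {}
    for run in runs:
        for doc, s in min_max_normalize(run).items():
            combined[doc] = combined.get(doc, 0.0) + s
    # Highest combined score first.
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


# Made-up retrieval scores for two runs on different scales.
tm_run = {"d1": 2.0, "d2": 1.0, "d3": 0.0}
bm25_run = {"d2": 10.0, "d3": 5.0, "d4": 0.0}

ranking = comb_sum([tm_run, bm25_run])
```

Here d2 ends up first because both runs retrieve it with a fairly high score, while d1 is retrieved by only one run. Documents missing from a run simply contribute nothing for that run.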
Extrinsic evaluation: sense index types
• Retrieval based on indexed senses (DFR_BM25 + KL):

Index type   MAP (training)   MAP (test)
UBCBest      0.2514           0.2636
NUSBest      0.2930           0.3473
CombBest     0.2921           0.3313
CombBest+    0.3011           0.3551

• CombBest+ outperforms CombBest
• Focus on "combined" indices
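For readers unfamiliar with the metric, mean average precision (MAP) can be computed as in the sketch below. It uses the standard definition with made-up rankings and relevance judgments; the function names and data are hypothetical and unrelated to the figures reported above.

```python
def average_precision(ranking, relevant):
    """Average of the precision values at the ranks of the relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved count as zero precision.
    return total / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of the average precision over all judged queries."""
    return sum(average_precision(runs[q], qrels[q]) for q in qrels) / len(qrels)


# Made-up rankings and relevance judgments for two queries.
runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d6"}}

ap_q1 = average_precision(runs["q1"], qrels["q1"])   # (1/1 + 2/3) / 2
ap_q2 = average_precision(runs["q2"], qrels["q2"])   # (1/2) / 1
map_score = mean_average_precision(runs, qrels)
```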