From WSD → IR to WSD ← IR
Julio Gonzalo
UNED
WSD → IR @ UNED: initial motivation
• 1997: EuroWordNet: let’s use it!
• 1998: (manual) indexing with synsets +29%
• 1999: Sanderson pseudo-senses vs. WordNet synsets (EMNLP)
• 1999: WSD versus first sense heuristic (SIGLEX)
• 2000: ITEM conceptual search engine
WSD strategy Conceptual versus textual indexing
ITEM search engine
• Scalable to several languages
• Conceptual query expansion
• Translations via hyperonym relations (e.g. governor’s race)
but
• Granularity
• Indexing units versus translation units
  – Words are not good for translation
    » (té cargado / strong tea)
  – Phrases are not good for indexing
    » “word+sense+disambiguation” / “sense tagging”
→ Is Word Sense Disambiguation an issue for the semantic web?
Website Term Browser
[Figure: user actions QUERY, EXPLORE DOCUMENT, EXPLORE PHRASE, RECONSULT WITH PHRASE]
Website Term Browser: WTB Evaluation
• 1523 sessions with interaction
• average 5.11 actions per session
• explore phrase used in 65.13% of sessions

                                   All queries   1 word queries   >1 word queries
First action after QUERY
  DOC                              40.70%        45.49%           37.30%
  PHRASE                           51.14%        45.65%           55.05%
  RECONSULT                        8.141%        8.846%           7.640%
Last action before ending session with explore DOC
  QUERY                            48.74%        53.38%           45.15%
  PHRASE                           42.95%        40.85%           44.57%
  RECONSULT                        8.306%        5.764%           10.27%
Is WSD easier than MT/CLIR?
abortion → aborto
issue → tema, número, asunto, edición, emisión
Corpus evidence for "abortion issue":
• tema del aborto
• asunto del aborto
• asuntos como el aborto
• asuntos del aborto
• temas como el aborto
• asunto aborto
Alignment without parallel corpora: abortion issue → tema del aborto
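The "abortion issue" example above can be read as a simple selection rule: among target-language phrases whose lemmas translate every source lemma, keep the most frequent one. The following is a minimal sketch under that assumption, not the system's actual implementation. The function name `align`, the same-size check, the frequencies, and the phrase "edición del periódico" are all invented for illustration; only the dictionary entries and the "aborto" phrases come from the slide.

```python
# Toy bilingual dictionary: the entries come from the slide's example.
DICTIONARY = {
    "abortion": {"aborto"},
    "issue": {"tema", "número", "asunto", "edición", "emisión"},
}

# Target-language phrases as (surface form, content lemmas, corpus frequency).
# ASSUMPTION: all frequencies and the last phrase are invented for illustration.
TARGET_PHRASES = [
    ("tema del aborto", {"tema", "aborto"}, 40),
    ("asunto del aborto", {"asunto", "aborto"}, 25),
    ("asunto aborto", {"asunto", "aborto"}, 3),
    ("edición del periódico", {"edición", "periódico"}, 90),
]


def align(source_lemmas, dictionary, target_phrases):
    """Return the most frequent target phrase that covers every source lemma.

    A target phrase covers a source lemma when it contains at least one of
    that lemma's dictionary translations. Returns None if nothing qualifies.
    """
    best_text, best_count = None, 0
    for text, lemmas, count in target_phrases:
        # ASSUMPTION: only phrases with the same number of lemmas are compared.
        if len(lemmas) != len(source_lemmas):
            continue
        covered = all(dictionary.get(src, set()) & lemmas for src in source_lemmas)
        if covered and count > best_count:
            best_text, best_count = text, count
    return best_text


print(align(["abortion", "issue"], DICTIONARY, TARGET_PHRASES))  # tema del aborto
```

Note that no parallel corpus is involved: the only resources are a bilingual dictionary and phrase counts from a target-language corpus. The frequent but unrelated phrase is rejected because it contains no translation of "abortion".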
Results on CLEF comparable corpus

Spanish
Size   # Phrases    # Aligned
2      6,577,763    2,004,760
3      7,623,168      252,795

English
Size   # Phrases    # Aligned
2      3,830,663    1,456,140
3      3,058,698      198,956
Results on CLEF corpus

2 lemmas
               Algorithm       Random Selection
               EN     ES       EN     ES
+ frequent     .83    .80      .02    .02
- frequent     .66    .54      .02    .02

3 lemmas
               Algorithm       Random Selection
               EN     ES       EN     ES
+ frequent     .94    .80      .004   .005
- frequent     .81    .62      .004   .004
Noun Phrase translation
1) Select aligned sub-phrase with most frequent translation
2) Discard overlapping sub-phrases
3) Iterate.

Example: advances in treatment of a wide variety of diseases
Sub-phrases: advances in treatment · treatment of a wide · wide variety · variety of diseases

Build-up across slides:
• variety of diseases → tipo de enfermedades
• wide variety → (amplio)
• advances in treatment → avances en el tratamiento
• treatment of a wide → (amplio)
Result: avances en el tratamiento amplio tipo de enfermedades
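The three steps listed for noun phrase translation (select the aligned sub-phrase with the most frequent translation, discard overlapping sub-phrases, iterate) can be sketched as a greedy loop over token spans. This is an illustrative reading of the slides, not the authors' code. The frequencies, the translations given to the two discarded sub-phrases, and the word-level fallback that yields "amplio" for the uncovered word "wide" are assumptions made so the example runs end to end.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    start: int         # index of the first token covered (inclusive)
    end: int           # index just past the last token covered (exclusive)
    translation: str   # most frequent aligned translation of the sub-phrase
    frequency: int     # ASSUMPTION: invented corpus frequency


def overlaps(a, b):
    """True when two token spans share at least one token."""
    return a.start < b.end and b.start < a.end


def translate_noun_phrase(tokens, candidates, word_fallback):
    remaining = list(candidates)
    chosen = []
    while remaining:
        # 1) select the aligned sub-phrase with the most frequent translation
        best = max(remaining, key=lambda c: c.frequency)
        chosen.append(best)
        # 2) discard overlapping sub-phrases (this also removes `best` itself)
        remaining = [c for c in remaining if not overlaps(c, best)]
        # 3) iterate until no candidate is left
    covered = {i for c in chosen for i in range(c.start, c.end)}
    pieces = [(c.start, c.translation) for c in chosen]
    # ASSUMPTION: uncovered words are translated one by one when the fallback
    # dictionary knows them; unknown function words are dropped.
    pieces += [
        (i, word_fallback[tok])
        for i, tok in enumerate(tokens)
        if i not in covered and tok in word_fallback
    ]
    return " ".join(text for _, text in sorted(pieces))


tokens = "advances in treatment of a wide variety of diseases".split()
candidates = [
    Candidate(0, 3, "avances en el tratamiento", 30),
    Candidate(2, 6, "tratamiento de un amplio", 5),   # hypothetical translation
    Candidate(5, 7, "amplia variedad", 10),           # hypothetical translation
    Candidate(6, 9, "tipo de enfermedades", 50),
]
result = translate_noun_phrase(tokens, candidates, {"wide": "amplio"})
print(result)  # avances en el tratamiento amplio tipo de enfermedades
```

With these toy numbers the loop first keeps "variety of diseases" and drops "wide variety", then keeps "advances in treatment" and drops "treatment of a wide", which reproduces the order of the build-up on the slides.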
Is this document relevant? Source: Oard 2000
UNED @ iCLEF’2001
Systran
UNED @ iCLEF’2001
Noun phrases
UNED @ iCLEF’2001 Results

System        Precision      Recall
Systran MT    0.48           0.22
UNED NPs      0.47 (-2%)     0.34 (+52%)

cf. U. Maryland experiment: word-by-word translation substantially worse than Systran.
UNED @ iCLEF’2002 CLIR Query formulation

Reference system
• Assisted word-by-word translation.

UNED system
• Assisted formulation by phrases.
• Automatic translation using alignment.
UNED @ iCLEF’2002
UNED query formulation
UNED relevance feedback
UNED @ iCLEF’2002 Results

System       F (α = 0.8)
Reference    .23
UNED         .37 (+65%)

Statistical significance: p < 0.05 (linear mixed-effects model + ANOVA).
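The slide reports F with α = 0.8 without spelling out the measure. Assuming it is van Rijsbergen's weighted harmonic mean of precision P and recall R, F = 1 / (α/P + (1 − α)/R), a value of α = 0.8 weights precision more heavily than recall. The sketch below only illustrates that formula; the function name `f_alpha` and the input values are hypothetical and are not results from the evaluation.

```python
def f_alpha(precision, recall, alpha=0.8):
    """Weighted harmonic mean of precision and recall.

    alpha = 0.5 gives the balanced F1 measure; alpha = 0.8 favours precision.
    Returns 0.0 when either input is zero, where the formula is undefined.
    """
    if precision <= 0 or recall <= 0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


# Hypothetical inputs chosen for easy arithmetic, not taken from the slides.
print(f_alpha(0.5, 0.25))             # weighted towards precision
print(f_alpha(0.5, 0.25, alpha=0.5))  # balanced F1 for comparison
```

With the same precision and recall, the α = 0.8 score sits closer to the precision value than the balanced score does, which is the practical effect of choosing this weighting.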
UNED @ iCLEF’2002 Initial query formulation

System       Average time
Reference    286 s.
UNED         44 s.
UNED @ iCLEF’2002 Initial query formulation

System       P @ 20
Reference    .19
UNED         .29
And what about WSD?
• Supervised systems have little to be supervised with...
  – Research on unsupervised systems (Senseval 2)
  – Solve the acquisition bottleneck of supervised systems: obtain training instances automatically.
• Better understanding of the problem: sense inventories, test suites, polysemy.
WSD is harder than the applications
• IR → WSD: automatic assignment of web directories to word senses (Computational Linguistics, to appear)
• MT → WSD: use aligned phrases for partial disambiguation (no need for parallel corpora!) (work in progress)
• WSD: go to the basics: study sense inventories and polysemy distinctions for clustering (SIGLEX 00, 02)