ICT-Crossn: The System of Cross-lingual Information Retrieval of ICT in NTCIR-7
Weihua Luo, Tian Xia, Ji Guo, Qun Liu
Multilingual Interaction Technology Laboratory (MITEL), Institute of Computing Technology, Chinese Academy of Sciences
Presented at NTCIR-7, 2008-12-18
Outline
- Background
- The system architecture
- Query translation
- Document retrieval
- Experiments on the dry-run set
- Results on the formal run
- Conclusion
12/18/2008 IR4QA_NTCIR7_MITEL_Luo_Xia_Guo_Liu
Background
- Tasks in NTCIR-7
  - Advanced Cross-lingual Information Access (ACLIA)
  - Information Retrieval for Question Answering (IR4QA): a novel task in the ACLIA task cluster in NTCIR-7
- Motivation
  - CLIR has made great progress
  - Find out which IR techniques would help CCLQA
- Subtasks
  - CS-CS
  - EN-CS √
  - CT-CT √
  - EN-CT
  - JA-JA
  - EN-JA
Difference between CLIR and IR4QA
- Query

CLIR:
<TOPIC>
  <NUM>001</NUM>
  <SLANG>CH</SLANG>
  <TLANG>EN</TLANG>
  <TITLE>Time Warner, American Online (AOL), Merger, Impact</TITLE>
  <DESC>Find reports about the impact of AOL/Time Warner merger.</DESC>
  <NARR>
    <BACK>Time Warner and American Online (AOL) announced a merger on January 10th, 2000. The market value was estimated at $US350 billion making it the biggest merger in the US.</BACK>
    <REL>Comments on AOL/Time Warner merger's effects on Internet and entertainment media businesses are relevant. Descriptions of the development of the AOL/Time Warner merger are partially relevant. Information about the total amount and the transformation of ownership structure are irrelevant.</REL>
  </NARR>
  <CONC>Time Warner, American Online, AOL, Gerald Levin, merger, M&A, Merger and Acquisition, media, entertainment business</CONC>
</TOPIC>

IR4QA:
<TOPIC ID="ACLIA1-CS-T41">
  <QUESTION LANG="EN">
    <![CDATA[ List the hazards of global warming.]]>
  </QUESTION>
  <QUESTION LANG="CS">
    <![CDATA[ 列举全球气候变暖的危害。 ]]>
  </QUESTION>
  <NARRATIVE LANG="EN">
    <![CDATA[ Users need to know the harm of global warming to human beings and the environment.]]>
  </NARRATIVE>
  <NARRATIVE LANG="CS">
    <![CDATA[ 用户需要知道全球气候变暖对人类和环境有什么危害。 ]]>
  </NARRATIVE>
</TOPIC>
Difference between CLIR and IR4QA
- Query
  - CLIR
    - Monolingual description of the information need (IN)
    - Descriptions of the IN at different granularities: TITLE, DESC, NARR
    - Accurate keywords
    - Detailed background and relevance judgement
  - IR4QA
    - Bilingual description of the IN
    - A single granularity of description: QUESTION ≈ NARRATIVE
    - No keywords available
    - No detailed background or relevance judgement
    - Question type provided with topics
Difference between CLIR and IR4QA
- Metrics of evaluation
  - CLIR
    - Traditional IR metrics: AP
  - IR4QA
    - IR evaluation: AP (primary), Q, nDCG
    - End-to-end evaluation with a QA system: F-score
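To make two of the IR metrics named above concrete, here is a minimal sketch of AP and nDCG in their common textbook forms. The function names, the log2 discount, and the toy gain values are assumptions for illustration; the official NTCIR-7 evaluation may use different gain values or a different nDCG variant, and the Q measure is not covered here.

```python
import math

def average_precision(ranked_ids, relevant_ids):
    """AP: average of precision@k over the ranks k that hold a relevant document.
    Assumes each document id appears at most once in ranked_ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount (one common variant)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(ranked_ids, gain_by_doc, cutoff=None):
    """nDCG: DCG of the run divided by the DCG of the ideal ordering.
    gain_by_doc maps doc id -> graded gain (hypothetical values, e.g. 2 = relevant,
    1 = partially relevant); unjudged documents count as gain 0."""
    if cutoff is None:
        cutoff = len(ranked_ids)
    gains = [gain_by_doc.get(d, 0) for d in ranked_ids[:cutoff]]
    ideal = sorted(gain_by_doc.values(), reverse=True)[:cutoff]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

For a run `["d1", "d2", "d3", "d4"]` with relevant set `{"d1", "d3"}`, AP is the mean of precision at ranks 1 and 3, that is (1/1 + 2/3) / 2.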
Our Implementation
- Basic idea
  - A traditional CLIR framework still works
  - IR4QA is still an IR task
- Major issues of concern (modification should be made!)
  - Shorter query
    - Improper translation of a few words may lead to retrieval of more irrelevant documents
  - Different evaluation metrics (reference set)
    - The judgement of document relevance is vague for a new task
    - Which IR model is more suitable?
Traditional CLIR Flowchart
- EN Query -> Query Translation (dictionary-based, corpus-based, MT module based, ...) -> CS Query -> Formal Representation
- CS Document -> Formal Representation
- Formal Representations -> Relevance Estimation (VSM model, probabilistic model, LM model, ...) -> Result Reranking -> document list
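The flowchart can be sketched end to end with the simplest option at each stage: dictionary-based query translation and a vector space model with TF-IDF weights and cosine similarity. This is only an illustration of the traditional pipeline, not the ICT-Crossn implementation; `TOY_DICT`, the smoothed IDF formula, and the assumption that documents arrive pre-segmented into tokens are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy EN -> CS dictionary, for illustration only.
TOY_DICT = {"global": ["全球"], "warming": ["变暖"], "hazards": ["危害"]}

def translate_query(en_tokens, dictionary):
    """Dictionary-based translation: keep every listed translation, drop unknown words."""
    cs_tokens = []
    for tok in en_tokens:
        cs_tokens.extend(dictionary.get(tok.lower(), []))
    return cs_tokens

def tfidf_vector(tokens, doc_freq, n_docs):
    """Formal representation: sparse TF-IDF vector with a smoothed IDF (an assumed variant)."""
    tf = Counter(tokens)
    return {t: c * (math.log((n_docs + 1) / (doc_freq.get(t, 0) + 1)) + 1.0)
            for t, c in tf.items()}

def cosine(u, v):
    """Relevance estimation: cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def retrieve(en_query_tokens, docs, dictionary):
    """docs maps doc id -> list of pre-segmented CS tokens. Returns doc ids, best first."""
    doc_freq = Counter(t for toks in docs.values() for t in set(toks))
    n_docs = len(docs)
    q_vec = tfidf_vector(translate_query(en_query_tokens, dictionary), doc_freq, n_docs)
    scored = [(cosine(q_vec, tfidf_vector(toks, doc_freq, n_docs)), doc_id)
              for doc_id, toks in docs.items()]
    # Sort by descending score, breaking ties by doc id for a stable order.
    return [doc_id for _, doc_id in sorted(scored, key=lambda x: (-x[0], x[1]))]
```

Because untranslatable query words are simply dropped in this sketch, a short query that loses one key term can lose most of its signal, which is the concern the slides raise for IR4QA.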
Our Improvement for IR4QA
- EN Query -> Query Translation (use of intermediate data of SMT) -> CS Query -> Formal Representation
- CS Document -> Formal Representation
- Formal Representations -> Relevance Estimation -> Result Reranking (borrows the idea of system combination from SMT) -> document list
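The slide does not say how the system-combination idea is applied to reranking. As one generic illustration of combining several retrieval runs, the sketch below uses CombSUM score fusion with min-max normalization. This is a swapped-in standard technique, not necessarily the method used in ICT-Crossn, and the function names and weights are assumptions.

```python
from collections import defaultdict

def min_max_normalize(run):
    """run maps doc id -> score. Rescale scores to [0, 1]; a constant run maps to 1.0."""
    if not run:
        return {}
    lo, hi = min(run.values()), max(run.values())
    if hi == lo:
        return {d: 1.0 for d in run}
    return {d: (s - lo) / (hi - lo) for d, s in run.items()}

def comb_sum(runs, weights=None):
    """CombSUM: sum the normalized (optionally weighted) scores each run gives a document.
    Documents missing from a run contribute 0 for that run."""
    if weights is None:
        weights = [1.0] * len(runs)
    fused = defaultdict(float)
    for run, w in zip(runs, weights):
        for doc_id, score in min_max_normalize(run).items():
            fused[doc_id] += w * score
    # Best first; ties broken by doc id for a stable order.
    return sorted(fused, key=lambda d: (-fused[d], d))
```

Documents that several runs agree on rise in the fused list, which is the intuition shared with system combination in SMT.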
ICT-Crossn
Query Translation
- Basic ideas
  - Phrase translation
    - Full-text translation is unfit for short questions
    - Phrases resolve some ambiguities
    - Phrase-based SMT creates huge phrase tables
  - OOV word translation
    - Many OOV words carry key information in questions
    - Any dictionary has limited coverage
    - Corresponding translations may be found with search engines
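A minimal sketch of the phrase translation idea, assuming a greedy longest-match lookup and an in-memory phrase table of the form `{source token tuple: [(target phrase, probability), ...]}`. Both the matching strategy and the table layout are assumptions for illustration; the slides do not specify how phrases are selected. Words with no match are returned separately so that an OOV step (for example, a search-engine lookup) could handle them.

```python
def translate_by_phrases(tokens, phrase_table, max_len=4):
    """Greedy longest-match phrase translation.
    Returns (translations, oov_words); each match takes its most probable target phrase."""
    translations, oov = [], []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in phrase_table:
                best_target, _ = max(phrase_table[key], key=lambda p: p[1])
                translations.append(best_target)
                i += n
                break
        else:
            # No phrase of any length matched: treat the word as OOV.
            oov.append(tokens[i])
            i += 1
    return translations, oov
```

With a toy table containing `("global", "warming")` and `("hazards",)`, the two-word phrase is matched as a unit, so "warming" is never translated out of context.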
Phrase Translation
- Phrase table
  - Trained on 5M EN-CS sentence pairs
  - Word alignment with GIZA++
  - Phrase extraction and probability estimation with a tool in Mencius (our phrase-based SMT decoder) (F.J. Och, Hermann Ney, 2008)
  - Size up to 17GB
  - The phrase table is filtered by the test set
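Filtering a phrase table by the test set usually means keeping only entries whose source phrase actually occurs in the test queries. The sketch below assumes a Moses-style text format (`source ||| target ||| scores`) and simple word-character tokenization; the format used by Mencius is not given in the slides, so both are assumptions.

```python
import re

def source_ngrams(queries, max_len=7):
    """Collect every lowercased word n-gram (up to max_len) that occurs in the queries."""
    grams = set()
    for q in queries:
        toks = re.findall(r"\w+", q.lower())
        for i in range(len(toks)):
            for n in range(1, min(max_len, len(toks) - i) + 1):
                grams.add(" ".join(toks[i:i + n]))
    return grams

def filter_phrase_table(lines, queries, max_len=7):
    """Yield only the phrase-table lines whose source side appears in some query.
    Works on any iterable of lines, so a large file can be streamed."""
    wanted = source_ngrams(queries, max_len)
    for line in lines:
        src = " ".join(line.split("|||", 1)[0].lower().split())
        if src in wanted:
            yield line
```

Because the function is a generator over lines, a table of many gigabytes can be filtered in a single pass without loading it into memory, for example by passing an open file object as `lines`.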