DCU at FIRE 2013: Cross Language !ndian News Story Search Piyush Arora, Jennifer Foster, Gareth J. F. Jones CNGL Centre for Global Intelligent Content School of Computing, Dublin City University, Ireland
Outline — Introduction — Our Approach — Experimental Details — Results — Conclusion and Future Work
Introduction
CL!NSS FIRE'13 task: news story linking between English and Indian language documents.
Outline — Introduction — Our Approach — Experimental Details — Results — Conclusion and Future Work
Our Approach
Our approach has two main steps:
• Step-1: Follow a traditional cross-language information retrieval (CLIR) approach:
  o Index documents using the Lucene search engine
  o Translate the input query from the source to the target language using machine translation (MT)
  o Rank documents for retrieval using Lucene
• Step-2: Combine multiple runs using data fusion methods
Contd…
Novel features of our approach:
• Query modification using several features:
  o Summarize query documents to form focused queries prior to translation
  o Identify named entities (NEs) as candidates for transliteration
  o Combine the MT translation with NE transliterations to capture alternative translations (see the sketch below)
• Add a weighting to reflect the publication-date relationship between query and target documents
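As a rough illustration of the last query-modification point, here is a minimal sketch of merging the MT output with transliterated NEs into one target-language query. The class/method names and the simple string-append strategy are assumptions; the slides do not specify how the variants are merged.

```java
import java.util.List;

public class QueryBuilder {
    // Append transliterated NEs to the MT translation so that both the
    // translated and the transliterated forms of each entity are searchable.
    public static String combine(String mtTranslation, List<String> transliteratedNEs) {
        StringBuilder query = new StringBuilder(mtTranslation);
        for (String ne : transliteratedNEs) {
            query.append(' ').append(ne);
        }
        return query.toString();
    }
}
```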
Outline — Introduction — Our Approach — Experimental Details — Results — Conclusion and Future Work
Experimental Details
Pre-Processing and Indexing
• Index documents using Lucene
• Use Lucene's built-in HindiAnalyzer
• Stopword list obtained by concatenating the following:
  1. FIRE Hindi stopword list
  2. Lucene's internal stopword list
  3. Stopword list created by selecting all words with document frequency (DF) > 5000
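A minimal indexing sketch under these settings, assuming Lucene 5+ APIs; the index path, field names, and input format are hypothetical.

```java
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.hi.HindiAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;

public class HindiIndexer {
    // docs maps a document id to its (Hindi) body text;
    // mergedStopwords holds the FIRE list plus the high-DF (> 5000) terms.
    public static void buildIndex(Map<String, String> docs, List<String> mergedStopwords) throws Exception {
        CharArraySet stopwords = new CharArraySet(mergedStopwords, true);
        // Add Lucene's internal Hindi stopword list on top of the merged lists.
        stopwords.addAll(HindiAnalyzer.getDefaultStopSet());
        IndexWriterConfig config = new IndexWriterConfig(new HindiAnalyzer(stopwords));
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("hindi-index")), config)) {
            for (Map.Entry<String, String> d : docs.entrySet()) {
                Document doc = new Document();
                doc.add(new StringField("id", d.getKey(), Field.Store.YES));
                doc.add(new TextField("body", d.getValue(), Field.Store.NO));
                writer.addDocument(doc);
            }
        }
    }
}
```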
Contd…
Cross Language Search
• Input queries translated separately using:
  o Bing
  o Google

Baseline Results

System       NDCG@1  NDCG@5  NDCG@10  NDCG@20
Palkovskii   0.32    0.33    0.34     0.36
Bing         0.54    0.52    0.53     0.55
Google       0.56    0.55    0.56     0.58
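For the retrieval side, a minimal sketch of querying the Hindi index with an MT-translated query, again assuming Lucene 5+ APIs and the hypothetical "body" field and index path from the indexing sketch.

```java
import org.apache.lucene.analysis.hi.HindiAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class CrossLanguageSearch {
    // Retrieve the top-k documents for a query already translated into Hindi.
    public static ScoreDoc[] search(String translatedQuery, int k) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("hindi-index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            QueryParser parser = new QueryParser("body", new HindiAnalyzer());
            // Escape Lucene's query syntax so raw MT output parses safely.
            return searcher.search(parser.parse(QueryParser.escape(translatedQuery)), k).scoreDocs;
        }
    }
}
```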
Main Features Used for Query Modification
• Summarizer: extracts sentences weighted by factors indicating their importance to the document
• Varying summary lengths:
  o Half of the query document
  o One third of the query document
  o Top 3 ranked sentences from the query document
• Use alternative translation services: Bing, Google
Summarizer Features
Main features used by the summarizer:
— skimming: position of a sentence in a paragraph
— namedEntity: number of named entities in each sentence
— TSISF: similar to the TF-IDF function, but works at sentence level
— titleTerm: overlap between a sentence and the terms in the document title
— clusterKeyword: relatedness between words in a sentence
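A sketch of how such an extractive summarizer can combine these features. The slides do not give the feature weights or normalization, so equal weights over [0, 1]-normalized scores are assumed here (requires Java 16+ for records).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class Summarizer {
    /** Feature scores for one sentence, each assumed normalized to [0, 1]. */
    public record SentenceFeatures(String sentence, int index,
            double skimming, double namedEntity, double tsisf,
            double titleTerm, double clusterKeyword) {
        double score() {
            // Equal weights are an assumption; the slides do not specify them.
            return skimming + namedEntity + tsisf + titleTerm + clusterKeyword;
        }
    }

    /** Extract the top-k sentences by combined score, restored to document order. */
    public static List<String> summarize(List<SentenceFeatures> sentences, int k) {
        List<SentenceFeatures> ranked = new ArrayList<>(sentences);
        ranked.sort(Comparator.comparingDouble(SentenceFeatures::score).reversed());
        List<SentenceFeatures> top = ranked.subList(0, Math.min(k, ranked.size()));
        top.sort(Comparator.comparingInt(SentenceFeatures::index));
        List<String> summary = new ArrayList<>();
        for (SentenceFeatures s : top) summary.add(s.sentence());
        return summary;
    }
}
```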
Contd…
Transliteration

English Word   Translated Word   Transliterated Word
Games          खेल                गेम्स
Commonwealth   राष्ट्रमंडल           कॉमनवेल्थ

Using Date
• Add a constant of 0.04 to the score of retrieved documents whose publication date falls within a 10-day window of the query document.
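The date factor expressed as a score adjustment, a direct sketch of the rule above (the class and method names are hypothetical; the 0.04 constant and 10-day window are from the slides).

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class DateBoost {
    // Add a constant boost when the retrieved document's publication date
    // falls within a 10-day window of the query document's date.
    public static double boost(double luceneScore, LocalDate queryDate, LocalDate docDate) {
        long gapInDays = Math.abs(ChronoUnit.DAYS.between(queryDate, docDate));
        return gapInDays <= 10 ? luceneScore + 0.04 : luceneScore;
    }
}
```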
Feature Selection
— Using Google translation:
  • 1/3 summary
  • 3-sentence summary
  • 3-sentence summary + all NEs transliterated
  • Complete input query + all NEs transliterated
— Using Bing translation:
  • 1/3 summary
  • 3-sentence summary
  • Complete input query + all NEs transliterated
Results Using Google Translation

System                              NDCG@1  NDCG@5  NDCG@10  NDCG@20
1/3 summary                         0.5408  0.5814  0.5872   0.5907
1/3 summary + NE transliterated     0.5408  0.5757  0.5828   0.5957
3-sentence summary                  0.5918  0.5815  0.5855   0.5897
Complete query + NE transliterated  0.5714  0.5620  0.5743   0.5910
Results Using Bing Translation

System                              NDCG@1  NDCG@5  NDCG@10  NDCG@20
3-sentence summary                  0.5612  0.5560  0.5623   0.5734
1/3 summary                         0.5510  0.5550  0.5639   0.5721
Complete query + NE transliterated  0.5102  0.5315  0.5463   0.5574
Data Fusion
Top 3 feature/system combinations selected:
• Run-1: Google translation with the 1/3 summary of the input query.
• Run-2: Google translation, combining the 1/3 summary with and without NE transliteration, the 3-sentence summary, and the whole query, plus the date factor.
• Run-3: All features combined, i.e. queries translated using both Google and Bing, using the complete query as well as the 1/3 summary and the 3-sentence summary, with and without NE transliteration.
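The slides do not name the exact data fusion method used to combine runs; a CombSUM-style sketch that sums min-max-normalized scores across runs is one plausible reading.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Fusion {
    // Merge several runs (docId -> retrieval score) by summing min-max
    // normalized scores per document; rank by descending fused score.
    public static Map<String, Double> combSum(List<Map<String, Double>> runs) {
        Map<String, Double> fused = new HashMap<>();
        for (Map<String, Double> run : runs) {
            double min = run.values().stream().min(Double::compare).orElse(0.0);
            double max = run.values().stream().max(Double::compare).orElse(1.0);
            double range = Math.max(max - min, 1e-9); // guard against constant runs
            for (Map.Entry<String, Double> e : run.entrySet()) {
                double normalized = (e.getValue() - min) / range;
                fused.merge(e.getKey(), normalized, Double::sum);
            }
        }
        return fused;
    }
}
```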
Results on Training Set

System  NDCG@1  NDCG@5  NDCG@10  NDCG@20
Run-1   0.5408  0.5814  0.5872   0.5907
Run-2   0.6224  0.5835  0.5943   0.6022
Run-3   0.6224  0.5733  0.5833   0.5956
Outline — Introduction — Our Approach — Experimental Details — Results — Conclusion and Future Work
Results on Test Set
Evaluation: runs were submitted blind; the submission combinations were selected using the features that performed best on the training set.

System  NDCG@1  NDCG@5   NDCG@10  NDCG@20
Run-1   0.74    0.66587  0.6759   0.6849
Run-2   0.74    0.6701   0.7047   0.7042
Run-3   0.74    0.6809   0.7268   0.7249
Outline — Introduction — Our Approach — Experimental Details — Results — Conclusion and Future Work
Conclusion & Future Work Future Work: Handling abbreviations such as “MNK”, “YSR”, political • party names, movie names, etc. • Handling spelling variants. • Normalizing text, handling language variations. • Minimizing translation and transliteration error. • Explore alternative scoring functions such as BM25. • Weighting different features rather than linearly scoring them.
Thank You
Questions?
This research is supported by Science Foundation Ireland (SFI) as part of the CNGL Centre for Global Intelligent Content at DCU (Grant No. 12/CE/I2267).