PAN@FIRE 2013: Overview of the Cross-Language !ndian News Story Search (CL!NSS) Track
Parth Gupta (1), Paul Clough (2), Paolo Rosso (1), Mark Stevenson (2), and Rafael E. Banchs (3)
(1) Technical University of Valencia (UPV), Spain
(2) University of Sheffield, UK
(3) Institute for Infocomm Research (I²R), Singapore
http://www.dsic.upv.es/grupos/nle/clinss.html
December 4, 2013
Parth Gupta (UPV, Spain) Overview of CL!NSS Track December 4, 2013 1 / 23
Outline
1 Motivation
2 Task Description
3 Corpus
4 Evaluation
5 Participation Overview
6 References
Motivation
Cross-language NLP and IR rely heavily on parallel and comparable data
Parallel data is precious but scarce
Most of the available data is quasi-comparable - not topically aligned
Technologies that extract parallel or comparable fragments from quasi-comparable data would be very useful in such scenarios

Current Scene
Not all languages have parallel data - and where it exists, it is often too small to rely on
Comparable corpora (e.g. Wikipedia) are not reliable in many languages
In fact, many languages do not have enough data
Two Questions:
1 What can be considered a constant source of text across languages?
2 ... that can contain parallel or comparable fragments?

Answer
Wikipedia articles - often, people create pages by translating English pages!
News stories - journalistic text re-use!

Which languages to work on?
Resource Poor Languages
Background - Web and Languages
Language Web Representation (a)

Rank  Language    Percentage
1     English     54.9%
2     Russian     6.1%
3     German      5.3%
4     Spanish     4.8%
5     Chinese     4.4%
6     French      4.3%
7     Japanese    4.2%
8     Arabic      3.0%
9     Portuguese  2.3%
10    Polish      1.8%
...
36    Latvian     0.1%
37    Estonian    0.1%

(a) Wikipedia page: "Languages used on the Internet"
Background - Web and Languages
Language Population (a)

Rank  Language    Speakers (millions)  % of world
1     Mandarin    955                  14.1
2     Spanish     407                  5.85
3     English     359                  5.52
4     Hindi       311                  4.46
5     Arabic      293                  4.23
6     Portuguese  216                  3.08
7     Bengali     206                  3.05
8     Russian     154                  2.42
9     Japanese    126                  1.92
10    Punjabi     102                  1.44

(a) The estimates used for this list are those of Nationalencyklopedin and are based on estimates published in 2010 - Wikipedia.
Motivation (contd.)
How do such algorithms perform? [Platt et al., 2010]
Wikipedias and News data

Wikipedia  Size
English    4,392,107
Spanish    1,061,460
German     1,658,515
...
Hindi      109,046
Tamil      57,828

Year  NT (1) Size  TOI (2) Size
2011  117,411      243,773
2012  128,610      254,036

(1) Navbharat Times: Hindi Daily
(2) Times of India: English Daily
Task Description
Observation
News stories covering the same event published in different languages may be rich sources of parallel and comparable text. Some fragments in these stories are parallel, for example, personal quotes and translated versions of the same content.

Definitions [Barker and Gaizauskas, 2012]
Focal Event: the main event or events which provide a focus for the news story
◮ e.g. Romney vs. Obama in Ohio: With superior ground operations, the president widens his lead
Background Event: an event that plays a supporting role in the text, providing context for the focal events
◮ e.g. Probably the last encounter between the two
News Event: a group of related events, broader than and including the focal event, which may be reported over time in different news text installments
◮ e.g. Presidential election polls
Task Description
Statement
For each t ∈ T, find s ∈ S covering the same focal event and news event

Diagram: link each story t in T to s in S which share the same news event or focal event, for each L
Source Collection: S = L1 ∪ L2 ∪ ... ∪ Ln
Target Collection: T = English Articles
Flow Diagram
Task: Story Detection
Year 2012/13
Pair(A, B) -> Same News Event | Different News Event
Same News Event -> Same News Event, Same Focal Event | Same News Event, Different Focal Event
Table: Example English-Hindi text pairs describing the same news event but different focal events

Article   File                       Title (English gloss for Hindi sources)                      Relevance Level
Target    english-document-00006.txt There's lot more to talk than my 50th Test ton: Tendulkar
Source1   hindi-document-24799.txt   There are many things except my 50th century: Tendulkar      2 (same focal event)
Source2   hindi-document-08018.txt   Sachin makes fifty in century                                1 (same news event)
Corpus Statistics
Table: CL!NSS 2012 corpus statistics. The statistics are shown for the source partition D_hi (Hindi) and the target collection D_en. The column headers stand for: |D| number of documents in the corpus (partition), |D_tokens| total number of tokens, |D_voc| total size of vocabulary (unique terms). k = thousand, M = million.

Partition  |D|    |D_tokens|  |D_voc|
D_en       25     9.3k        2.5k
D_hi       50691  15.6M       143k

Metadata
◮ Title of the news story
◮ Date of publication
◮ Content of the story
Evaluation Framework
Relevance
The relevance level of a source news story for a given test query is one of 2, 1, 0, where
◮ 2 = "same news event + same focal event"
◮ 1 = "same news event + different focal event"
◮ 0 = "different news event"

Measures
NDCG@k, k = 1, 5, 10
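As a rough illustration of the measure above, here is a minimal Python sketch of NDCG@k over the graded levels 2/1/0. It assumes a linear gain and a log2 rank discount; the slides do not say which DCG variant the track's official scorer uses, so treat the formula choice, the function names (dcg_at_k, ndcg_at_k) and the toy query as assumptions.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top-k graded relevance values."""
    # Assumed variant: linear gain, log2 discount (rank 1 -> log2(2) = 1).
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains, judged_gains, k):
    """NDCG@k: DCG of the submitted ranking divided by the ideal DCG."""
    # The ideal ranking sorts all judged relevance levels in descending order.
    ideal = sorted(judged_gains, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_gains, k) / idcg if idcg > 0 else 0.0


# Hypothetical query: the run ranks a level-1 story above the level-2 story.
run = [1, 2, 0]      # relevance levels in the order the system returned them
qrels = [2, 1, 0]    # all judged relevance levels for this query
scores = {k: ndcg_at_k(run, qrels, k) for k in (1, 5, 10)}
```

With this toy input, NDCG@1 is 0.5, because the top-ranked story has level 1 while the best available story has level 2. A per-run score would be the mean of these values over all test queries.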
Evaluation: Relevance Judgment Tool
Relevance Overview
Timeline
6 May, 2013    Release of training corpus
4 Sept, 2013   Release of test corpus
27 Oct, 2013   Submission of runs
10 Nov, 2013   Release of qrels (result notification)
15 Nov, 2013   Working notes due
05 Dec, 2013   CL!NSS @ FIRE in New Delhi!
Participation Overview
Submission details
Teams were asked to submit results as a ranked list for each language pair.
Each team could submit up to 3 runs to try different approaches or configurations.

Participation
Teams          2012  2013
Registered     10    16
Participated   3     8
Runs           8     23
Working notes  2     6
Results
Lessons Learnt
Manually determining the focal/news events is sometimes quite difficult.
The scores achieved this year are quite high: NDCG@1 of 0.78 vs. last year's best of 0.32
Incorporating meta-information explicitly in similarity estimation helps
Carefully selecting query terms from the target documents also helps to improve performance
Although the approaches treat the problem as ranking, more sophisticated modeling of stories would likely help in determining same focal events
CL!NSS Programme
Time                 Details           Speaker/s
4th December, 12:00  Overview Talk     Parth Gupta
5th December, 15:30  Participant Talk  Amogh Param
5th December, 15:45  Participant Talk  Piyush Arora
5th December, 16:00  Participant Talk  Aarti Kumar
5th December, 16:15  Participant Talk  Sujoy Das
Thank You!
(on behalf of CL!NSS Team)
http://www.dsic.upv.es/grupos/nle/clinss.html
References
Barker, E. and Gaizauskas, R. J. (2012). Assessing the comparability of news texts. In LREC.
Platt, J. C., Toutanova, K., and Yih, W.-t. (2010). Translingual document representations from discriminative projections. In EMNLP, pages 251–261.