Overview of NTCIR-14
Makoto P. Kato (University of Tsukuba)
Yiqun Liu (Tsinghua University)
Introduction to NTCIR
• Evaluation Forum
  – An opportunity for researchers to get together to solve challenging research problems based on cooperation between task organizers and participants:
    • Task organizers design tasks, prepare test collections, and evaluate participants' systems
    • Participants develop systems to achieve better performance in the tasks
[Figure: task organizers provide tasks and test collections to participants; participants submit system output; task organizers return evaluation results]
Benefits of Evaluation Forums
• Task organizers
  – Can obtain many findings on a certain problem
  – Can share the workload of building a large-scale test collection
  – Can draw attention to a certain research direction
• Participants
  – Can focus on solving problems
  – Can tackle well-recognized problems with new resources, or novel problems at an early stage
  – Can demonstrate the performance of their systems in a fair comparison
• Both sides should benefit
Task Selection Procedure
• PC co-chairs asked PC members to review task proposals from task organizers
• PC co-chairs made decisions based on the reviews
[Figure: task organizers submit task proposals to the program committee; PC members write reviews; PC co-chairs make the decision]
Please consider submitting task proposals to NTCIR!
NTCIR-14 Program Committee (PC)
• Ben Carterette (University of Delaware, USA)
• Hsin-Hsi Chen (National Taiwan University, Taiwan)
• Tat-Seng Chua (National University of Singapore, Singapore)
• Nicola Ferro (University of Padova, Italy)
• Kalervo Järvelin (University of Tampere, Finland)
• Gareth J. F. Jones (Dublin City University, Ireland)
• Mandar Mitra (Indian Statistical Institute, India)
• Douglas W. Oard (University of Maryland, USA)
• Maarten de Rijke (University of Amsterdam, the Netherlands)
• Tetsuya Sakai (Waseda University, Japan)
• Mark Sanderson (RMIT University, Australia)
• Ian Soboroff (NIST, USA)
• Emine Yilmaz (University College London, United Kingdom)
Review Process
• 7+2 NTCIR-14 task proposals, each of which was reviewed by 4 or more PC members
• 6+1 tasks were accepted, of which 5 are core tasks and 1+1 are pilot tasks
NTCIR-14 General Schedule (note that each task could have its own schedule)
• Mar 20, 2018: NTCIR-14 Kickoff
• May 15, 2018: Task Registration Due
• Jun 2018: Dataset Release
• Jun–Jul 2018: Dry Run
• Aug–Oct 2018: Formal Run
• Feb 1, 2019: Evaluation Result Release
• Feb 1, 2019: Task Overview Paper Release (draft)
• Mar 15, 2019: Participant Paper Submission Due
• May 1, 2019: Camera-Ready Participant Paper Due
• Jun 2019: NTCIR-14 Conference & EVIA 2019 at NII, Tokyo
Focuses of NTCIR-14
1. Heterogeneous information access
2. Dialogue generation and analysis
3. Meta research on information access communities
[Figure: NTCIR-14 tasks grouped by theme]
• Search (heterogeneous data): web pages (WWW), questions (OpenLiveQ), lifelog data (Lifelog)
• Summarize (dialog data): QALab
• Generate (dialogues): STC
• Understand (numeric info. in dialog data): FinNum
• Reproduce (the best practices): CENTRE
NTCIR-14 Tasks
• Core Tasks
  – Lifelog-3 (Lifelog Search Task)
  – OpenLiveQ-2 (Open Live Test for Question Retrieval)
  – QALab-PoliInfo (Question Answering Lab for Political Information)
  – STC-3 (Short Text Conversation)
  – WWW-2 (We Want Web)
• Pilot Tasks
  – CENTRE (CLEF/NTCIR/TREC REproducibility)
  – FinNum (Fine-Grained Numeral Understanding in Financial Tweet)
Number of Active Participants
• QA Lab for Entrance Exam (QALab) (11, 12, 13) → QA Lab for Political Information (QALab-PoliInfo) (14): 13
• Personal Lifelog Organisation & Retrieval (Lifelog) (12, 13, 14): 6
• Short Text Conversation (STC) (12, 13, 14): 13
• Open Live Test for Question Retrieval (OpenLiveQ) (13, 14): 4
• We Want Web (WWW) (13, 14): 4
• Fine-Grained Numeral Understanding in Financial Tweet (FinNum) (14): 6
• CLEF/NTCIR/TREC REproducibility (CENTRE) (14): 1
• Total: 47
Active participants: research groups that submitted final results for evaluation
Jargon: Test Collection
• A general test collection consists of inputs and their expected outputs
• An IR test collection consists of:
  – Topics (the input)
  – A document collection (indexed by the search system)
  – Relevance judgements (the expected output, e.g., highly relevant, …, irrelevant)
• The search system takes the topics as input, and its output is evaluated against the relevance judgements
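For illustration, a minimal sketch of how a test collection is used in evaluation; the topics, judgements, and ranked run below are toy placeholders, not NTCIR data:

```python
# Minimal sketch of evaluating a search system against an IR test collection.
# All identifiers (topic "T1", documents d1-d4) are hypothetical toy data.

# Relevance judgements (qrels): topic -> {doc_id: graded relevance}
qrels = {
    "T1": {"d1": 2, "d2": 1, "d3": 0},  # 2 = highly relevant, 1 = relevant, 0 = irrelevant
}

# A ranked list returned by the search system for each topic
run = {"T1": ["d2", "d4", "d1"]}

def precision_at_k(topic, ranked_docs, k=3):
    """Fraction of the top-k retrieved documents judged relevant (grade > 0)."""
    top_k = ranked_docs[:k]
    relevant = sum(1 for d in top_k if qrels[topic].get(d, 0) > 0)
    return relevant / k

for topic, ranked_docs in run.items():
    print(topic, "P@3 =", precision_at_k(topic, ranked_docs))
```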
Jargon: Training, Development, and Test Sets
[Figure: the data is split into training, development, and test sets; the system is trained on the first two and evaluated on the test set]
• Training set: can be used to tune parameters in the system
• Dev. set: can be used to tune hyper-parameters in the system
• Test set: cannot be used to tune the system; it is used only for evaluating the output
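For illustration, a minimal sketch of this split with a hypothetical toy dataset and a placeholder model: the hyper-parameter is chosen on the dev. set, and the test set is touched only once for the final evaluation:

```python
# Minimal sketch of a train/dev/test split; the data and "model" are toy placeholders.
import random

random.seed(0)
data = [(x, x * 2 + random.random()) for x in range(100)]  # toy (input, label) pairs
random.shuffle(data)

train, dev, test = data[:70], data[70:85], data[85:]

def train_and_evaluate(train_set, eval_set, hyper_param):
    """Placeholder: fit a single slope on train_set, scaled by a hyper-parameter,
    and return the mean absolute error on eval_set."""
    slope = sum(y for _, y in train_set) / sum(x or 1 for x, _ in train_set) * hyper_param
    return sum(abs(y - slope * x) for x, y in eval_set) / len(eval_set)

# Hyper-parameters are tuned on the dev. set ...
best = min([0.5, 1.0, 1.5], key=lambda h: train_and_evaluate(train, dev, h))
# ... and the test set is used only once, for the final evaluation.
print("test error:", train_and_evaluate(train, test, best))
```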
Jargon: Run / Dry Run / Formal Run
• Run: a result of a single execution of a developed system (e.g., "This team submitted a run")
• Dry run: a preliminary trial for improving the task design and familiarizing participants with the task
• Formal run: an actual trial whose submissions and results are officially recorded
Jargon: Evaluation Metric
• A measure of system performance
• General evaluation metrics, where C is the set of correct items (judged by an assessor) and O is the system output:
  – Precision: P = |C ∩ O| / |O|
  – Recall: R = |C ∩ O| / |C|
  – F1-measure: F1 = 2PR / (P + R)
• IR evaluation metrics: MAP, nDCG, ERR, Q-measure
• Summarization evaluation metrics: ROUGE
Please Google or Bing for details; they will be used in the overview presentations.
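For illustration, a minimal sketch computing the general metrics above on toy sets (C and O are made up for this example):

```python
# Minimal sketch of set-based precision, recall, and F1; C and O are toy sets.
correct = {"a", "b", "c", "d"}        # C: items judged correct by the assessor
output = {"b", "c", "e"}              # O: items returned by the system

hits = len(correct & output)          # |C ∩ O|
precision = hits / len(output)        # P = |C ∩ O| / |O|
recall = hits / len(correct)          # R = |C ∩ O| / |C|
f1 = 2 * precision * recall / (precision + recall) if hits else 0.0

print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")  # P=0.67 R=0.50 F1=0.57
```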
ENJOY THE CONFERENCE!
• Keynote (TODAY)
• Task Overviews (TODAY)
• Invited Talks (TODAY)
• Task Sessions (DAY-3 and DAY-4)
• Poster Sessions (DAY-3 and DAY-4)
• Banquet (DAY-3)
• Panel (DAY-4)
• Break-out Sessions (DAY-4)