CLEF-HIPE-2020: Named Entity Recognition and Linking on Historical Newspapers
CLEF-HIPE-2020 Overview - M. Ehrmann, M. Romanello, A. Flückiger, S. Clematide

CLEF-HIPE-2020 in a nutshell
- HIPE: Identifying Historical People, Places and other Entities
- 1st NE processing shared task on historical documents
- Tasks:
  - NE recognition and classification
  - NE linking
- Participating teams: 13

Why HIPE?
New data: emergence of large-scale archives of digitized contents
New needs: content retrieval by humanities scholars
Challenge: NLP on historical texts is hard
- Spelling variations
- Noisy OCR
- Multilingualism
- Data sparsity
- Limited resources or KB coverage
→ Objectives
1. strengthen the robustness of approaches;
2. enable performance comparison;
3. foster efficient semantic indexing of digitized cultural heritage collections.

Background
impresso project: mining 200 years of historical newspapers
Project: https://impresso-project.ch
Interface: https://impresso-project.ch/app/

Semantic indexation of historical newspapers
Search NEs (among others) over 47M articles

Semantic indexation of historical newspapers
Visualize facsimile, OCR and entity mentions

Semantic indexation of historical newspapers
Overview of named entities

Semantic indexation of historical newspapers

Tasks
1. NERC: recognition and classification of entity mentions, with two subtasks (NERC Coarse and NERC Fine)
- subtask 1: coarse types + entity components
- subtask 2: fine-grained types + metonymy + nested entities

Tasks
2. Entity Linking: towards Wikidata QID or NIL
- end-to-end EL: without mention boundaries
- EL-only: with mention boundaries
[Figure: participation bundles]
Participation guidelines: 10.5281/zenodo.3677171

Corpus selection
- Digitized newspaper archives (CH, LU, US)
- Diachronic: from 1738 to 2019
- Multilingual: fr, de, en
- Sampling and manual triage:
  - journalistic content
  - no feuilleton, crosswords, weather reports, etc.
  - exclusion of extreme OCR noise
- No provision of different OCR → real-life setting

Corpus annotation
- Trilingual annotators, trained on a mini-ref
- INCEpTION platform
- NERC annotation difficulties:
  - NE mention boundaries
  - consideration of multiple languages
  - what is to be annotated or not
  - definition at time x
  - metonymy
Examples from the corpus:
- M. Curtoys d'Anduaga, doyen du corps diplojt elfsue espagnol, et ministre plenipotentiaire pendant 50 ans
- Zürichputsch, Baslerpropaganda
- Commission imperiale, Die französische Regierung
- Is Savoie or Moldavia a region or a country?
Annotation guidelines: 10.5281/zenodo.3604227

Corpus annotation
- Trilingual annotators, trained on a mini-ref
- INCEpTION platform
- NERC annotation difficulties:
  - NE mention boundaries
  - consideration of multiple languages
  - what is to be annotated or not
  - definition at time x
  - metonymy
- EL annotation difficulties:
  - Requires historical knowledge + Sherlock Holmes skills
  - Historical statuses of entities unequally represented in KB
Example: Germany, Q183
- 962-1813: Holy Roman Empire, Q12548
- 1806-1813: Confederation of the Rhine, Q154741
- 1815-1866: German Confederation, Q151624
- 1867-1870: North German Confederation, Q150981
- 1871-1918: German Empire, Q43287
- 1918-1933: Weimar Republic, Q41304
- 1933-1945: Nazi Germany, Q7318
- 1949-1990: West Germany, Q713750
- 1949-1990: East Germany, Q16957
Annotation guidelines: 10.5281/zenodo.3604227

Corpus characteristics
- newspaper articles: 563
- tokens: 444,596
- (linked) mentions: 18,962
- metonymy: 1,252
- components: 6,219
- noisy mentions (test set): 10%
- NIL: 25.72%
- # mentions: 10,923 (Fr), 6,584 (De), 1,455 (En)

Corpus release
- train/dev/test (70/15/15)
- no train set for English
- no sentence segmentation
- no sophisticated tokenization
- document metadata
License: CC BY-NC 4.0
Data: https://github.com/impresso/CLEF-HIPE-2020/tree/master/data
DOI: 10.5281/zenodo.3706857

Auxiliary resources
In-domain Fr, De, and En embeddings:
- fastText word embeddings (with and w/o subwords)
- flair character embeddings (now integrated into the flair framework)
License: CC BY-SA 4.0
Download: https://files.ifi.uzh.ch/cl/siclemat/impresso/clef-hipe-2020/
DOI: 10.5281/zenodo.3706808

Evaluation
- Entities (not tokens) as the unit of reference
- Macro & Micro Precision, Recall and F1 measure
- Evaluation scenarios:
  - Strict: NERC with exact mention boundaries; EL with consideration of the top link only (overlapping mention boundaries)
  - Fuzzy: NERC with overlapping boundaries; EL with historical mapping and cut-offs @3 and @5 (overlapping mention boundaries)
HIPE Scorer: https://github.com/impresso/CLEF-HIPE-2020-scorer
HIPE Eval Toolkit: https://github.com/impresso/CLEF-HIPE-2020-eval
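The strict and fuzzy NERC scenarios can be illustrated with a minimal Python sketch. It is not the HIPE scorer: the `Mention` type, the greedy one-to-one matching, the requirement that the entity type match in both scenarios, and the choice to compute the macro score as an average of per-document F1 are all assumptions made for illustration.

```python
from typing import NamedTuple


class Mention(NamedTuple):
    start: int  # index of the first token
    end: int    # index of the last token (inclusive)
    type: str   # entity type, e.g. "pers" or "loc"


def overlaps(a: Mention, b: Mention) -> bool:
    """True if the two token spans share at least one token."""
    return a.start <= b.end and b.start <= a.end


def count_matches(gold, pred, fuzzy=False):
    """Return (tp, fp, fn); each gold mention is matched at most once.

    Assumption: the type must agree in both scenarios; only the
    boundary criterion differs (exact vs. overlapping).
    """
    unmatched = list(gold)
    tp = 0
    for p in pred:
        for g in unmatched:
            same_span = (p.start, p.end) == (g.start, g.end)
            if p.type == g.type and (same_span or (fuzzy and overlaps(p, g))):
                unmatched.remove(g)  # safe: we leave the loop right away
                tp += 1
                break
    return tp, len(pred) - tp, len(unmatched)


def prf(tp, fp, fn):
    """Precision, recall and F1 from entity-level counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def micro_macro_f1(docs, fuzzy=False):
    """docs: list of (gold, pred) mention lists, one pair per document."""
    counts = [count_matches(g, p, fuzzy) for g, p in docs]
    micro = prf(*map(sum, zip(*counts)))[2]              # pool counts, then score
    macro = sum(prf(*c)[2] for c in counts) / len(counts)  # score, then average
    return micro, macro


gold = [Mention(0, 1, "pers"), Mention(5, 5, "loc")]
pred = [Mention(0, 2, "pers"), Mention(5, 5, "loc")]
print(count_matches(gold, pred))              # (1, 1, 1): boundary error is penalized
print(count_matches(gold, pred, fuzzy=True))  # (2, 0, 0): overlap is accepted
```

The greedy matching is order-dependent when several predictions overlap the same gold mention; a real scorer would need a more careful alignment.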

Participation
- 40 registrations
- 13 participating teams
- 11 Working Notes
- 75 runs: 42% French, 31% German, 26% English
- 6 teams work on all languages
- All participated in NERC-Coarse, 3 in NERC-Fine, 5 in EL-only and end-to-end EL

Participating systems' main features
- 11 teams applied neural approaches for NERC;
- Most of them worked with contextualized embeddings, esp. BERT;
- Experimentation with various input embeddings (char, subword, word, historical or contemporary, type-level or contextualized);
- Some attempted to improve the newspaper line-based input format with proper sentence segmentation and tokenization.

Results overview (NERC)
F1 scores               French          German          English
                        Strict  Fuzzy   Strict  Fuzzy   Strict  Fuzzy
NERC-Coarse literal
  Baseline              .646    .769    .476    .585    .405    .562
  Median                .677    .808    .636    .766    .463    .645
  Best system           .840    .921    .797    .878    .632    .806
NERC-Coarse metonymic
  Best system           .783    .783    .634    .694    -       -
NERC-Fine
  Best system           .784    .856    .668    .771    -       -
NE components
  Best system           .657    .751    .642    .707    -       -
- Neural systems with strong embedding resources prevail;
- Performance correlates with the amount of train/dev data;
- BERT-based systems > Bi-LSTM;
- Great diversity in performance, but results are better than expected (6 teams > .8);
- NERC Fine with 12 classes is more difficult;
- NE components show reasonable performance.

Results overview (EL)
F1 scores               French          German          English
                        Strict  Fuzzy   Strict  Fuzzy   Strict  Fuzzy
End-to-end Entity Linking (literal)
  Baseline              .257    .270    .180    .195    .239    .239
  Best system           .598    .617    .534    .557    .531    .531
End-to-end Entity Linking (metonymic)
  Best system           .297    .462    .396    .469    -       -
Entity linking only (with mentions provided)
  Baseline              .498    .512    .418    .437    .506    .506
  Best system           .639    .659    .582    .602    .658    .658
- EL performance is lower, and as diverse;
- NERC error propagation in the end-to-end setting, but EL-only is not a lot better;
- Performance increases with cut-offs @3 and @5.
Overall, what helps:
- BERT;
- actively tackling the problems of OCR noise, word hyphenation and sentence segmentation;
- in-domain resources.
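The relaxed EL scenarios (cut-offs @3 and @5, historical mapping) can be pictured with a small Python sketch. The function name `link_is_correct`, the `related` dictionary and its layout are hypothetical; how the official scorer builds and applies the historical mapping is not described in these slides. The two QIDs are the ones listed on the annotation slide (Germany, Q183; German Empire, Q43287).

```python
NIL = "NIL"  # label used when no knowledge-base entry applies


def link_is_correct(gold, ranked, k=1, related=None):
    """True if one of the top-k candidate links is accepted for `gold`.

    gold    -- gold Wikidata QID, or NIL
    ranked  -- system candidates, best first
    k       -- cut-off (1 = top link only; 3 or 5 = relaxed)
    related -- hypothetical mapping from a QID to historically
               related QIDs that are also accepted
    """
    accepted = {gold}
    if related:
        accepted |= related.get(gold, set())
    return any(candidate in accepted for candidate in ranked[:k])


# Assumed mapping for illustration: a mention annotated as the
# German Empire may also be linked to present-day Germany.
related = {"Q43287": {"Q183"}}
candidates = ["Q183", "Q43287"]

print(link_is_correct("Q43287", candidates))                   # False: top link differs
print(link_is_correct("Q43287", candidates, k=3))              # True: gold is within top 3
print(link_is_correct("Q43287", candidates, related=related))  # True: related entity accepted
```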

Time-based observations
Analysis of F1 score as a function of time.
Hypothesis: the older, the more difficult.
Observation: no strong correlation between article publication date and performance.

Impact of OCR noise
Evaluation on various noise levels
- noise: length-normalized Levenshtein distance between surface form and manual transcription;
- noisy and non-noisy mentions show remarkable differences on both NERC and EL;
- greatest performance variation at medium noise level.
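The noise measure can be sketched in Python as follows. The slides do not say which length is used for normalization; dividing by the length of the longer string is an assumption here, and `noise_level` is a hypothetical name.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def noise_level(ocr_surface: str, transcription: str) -> float:
    """Levenshtein distance normalized to [0, 1].

    Assumption: normalize by the length of the longer string.
    """
    longest = max(len(ocr_surface), len(transcription))
    if longest == 0:
        return 0.0
    return levenshtein(ocr_surface, transcription) / longest


print(noise_level("Zürich", "Zürich"))   # 0.0: OCR output equals the transcription
print(noise_level("Ziirich", "Zürich"))  # 2 edits over 7 characters
```

Mentions can then be binned by this value (for example zero, low, medium, high) to compare scores across noise levels; the bin boundaries would be a further choice not given in the slides.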
