Recycling Named Entity Taggers: Unsupervised Domain and Language Adaptation for Named Entity Recognition Based on Parallel Corpora
Master thesis of Chrysoula Zerva
EPFL supervisor: Dr Martin Rajman
SONY supervisor: Dr Wilhelm Haag
Outline
● Named Entity Recognition task
○ Definition, Process, Evaluation
● Importance
○ of NER
○ of Language (& domain) adaptation
● Core System
○ Architecture, Early results, Problems
● Evaluation
○ Final Results and Error analysis
● Conclusions
Named Entity Recognition
Named Entity: Definition
Named entities: atomic elements (in a text) that consist of one or more consecutive words and belong to predefined categories (labels).
Common labels: ORGANISATION, PERSON, LOCATION
The word sequence has to refer to a particular instance of the label. For example:
"The president failed to explain the new military policy" → no NE
"The president Barack Obama failed to explain the new military policy" → "Barack Obama" is a PERSON NE
Named Entity Recognition: LABELS
Name expressions:
PERSON: People, including fictional ("Mr Thomson explained...")
NORP: Nationalities or religious or political groups ("The Swiss law prohibits...")
FACILITY: Buildings, airports, highways, bridges ("Our reporter at the White House...")
ORGANIZATION: Companies, agencies, institutions ("EPFL is located near...")
GPE: Countries, cities, states, administrative areas ("Lausanne has a population of...")
LOCATION: Non-GPE locations, mountains, rivers ("The situation in the Balkans is...")
PRODUCT: Vehicles, weapons, foods (not services) ("He is driving an SUV car...")
EVENT: Named hurricanes, battles, sports events ("After the Second World War the...")
WORK OF ART: Titles of books, songs ("Lord of the Rings" is a three...)
LAW: Named documents made into laws ("In the European Constitution...")
LANGUAGE: Any named language ("English is an international...")
Named Entity Recognition: LABELS
Time and date expressions:
DATE: Absolute or relative dates or periods ("Last year the results...")
TIME: Times smaller than a day ("Tomorrow at noon...")
PERCENT: Percentage, including "%" ("An estimated 5% of the people...")
MONEY: Monetary values, including unit ("A monthly salary of 5000$...")
QUANTITY: Measurements, as of weight or distance ("It weighs 3 pounds.")
ORDINAL: "first", "second", etc. ("The first time that I...")
CARDINAL: Numerals that do not fall under another type ("At least three people...")
Named Entity Recognition: LABELS
Choosing criterion: sufficient training resources.
[Figure: label distribution in the Ontonotes (pre-annotated) and Europarl (non-annotated) corpora vs. F-score performance per label on the Europarl test set and on Ontonotes, for GPE, DATE, NORP, PERCENT, LOC, QUANTITY, EVENT, PRODUCT, LANGUAGE, ORG, PERSON, CARDINAL, MONEY, ORDINAL, TIME, WORK_OF_ART, FAC, LAW]
Named Entity Recognition
Step 1: Named Entity Identification
Classify every token under the BIOES scheme:
B: beginning of NE
I: inside NE
O: outside NE
E: end of NE
S: single(-token) NE
Step 2: Named Entity Classification
Classify the tokens that are part of a NE under a given set of predefined labels:
PERCENT, ORDINAL, CARDINAL, ORGANISATION, DATE, PERSON, LOCATION, GPE, NORP
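To make the two-step scheme concrete, the sketch below (Python; function and variable names are illustrative, not the thesis code) shows how gold entity spans would be encoded into BIOES identification tags:

```python
# Minimal BIOES encoding sketch: entities are assumed to be given as
# (start, end, label) tuples over token indices, with inclusive ends.

def bioes_encode(tokens, entities):
    """Assign a BIOES tag (with its class label) to every token."""
    tags = ["O"] * len(tokens)                  # O: outside any NE
    for start, end, label in entities:
        if start == end:
            tags[start] = f"S-{label}"          # S: single-token NE
        else:
            tags[start] = f"B-{label}"          # B: beginning of NE
            for i in range(start + 1, end):
                tags[i] = f"I-{label}"          # I: inside NE
            tags[end] = f"E-{label}"            # E: end of NE
    return tags

tokens = ["The", "European", "Parliament", "voted", "today"]
print(bioes_encode(tokens, [(0, 2, "ORG"), (4, 4, "DATE")]))
# ['B-ORG', 'I-ORG', 'E-ORG', 'O', 'S-DATE']
```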
Feature Extraction
Always performed, both before training and before parsing.
Preprocessing:
● tokenization
● part-of-speech tagging
● use of gazetteers (lexicons containing NEs)
Feature categories:
● Character-based (n-grams, capitalised, all-capitalised, special character, numeric)
● Lexical (included in a gazetteer/lexicon, word form, left word form, right word form)
● Grammatical (genitive, POS tag, left POS tag, right POS tag)
● Other (position in sentence, context (sequence of words))
++ Combined features: pairwise combinations of the above
Feature extraction: Example
[Table: binary feature matrix for the sentence "We are dealing with a horrific situation in Kosovo." Each token is mapped to binary features (capitalised, numeric, genitive, n-grams, right word, and similarly for the rest of the features), plus an identification tag and a classification label. Every token is tagged O / O, except "Kosovo", tagged I / GPE.]
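As a rough illustration of how one row of such a feature matrix could be produced, the sketch below (Python; the feature set is a small subset and the gazetteer a toy assumption, not the actual system resources) extracts a few features from each category for a single token:

```python
# Hedged sketch of per-token feature extraction; GAZETTEER is a toy stand-in
# for the NE lexicons mentioned above.

GAZETTEER = {"Kosovo", "Stockholm", "Lausanne"}

def extract_features(tokens, pos_tags, i):
    """Return a few character-based, lexical, grammatical and positional
    features for token i, given its sentence context."""
    w = tokens[i]
    return {
        # character-based
        "capitalised": w[:1].isupper(),
        "all_caps":    w.isupper(),
        "numeric":     any(c.isdigit() for c in w),
        "prefix_3":    w[:3],                                  # crude n-gram
        # lexical
        "in_gazetteer": w in GAZETTEER,
        "word_form":    w.lower(),
        "left_word":    tokens[i - 1].lower() if i > 0 else "<BOS>",
        "right_word":   tokens[i + 1].lower() if i + 1 < len(tokens) else "<EOS>",
        # grammatical
        "pos":      pos_tags[i],
        "left_pos": pos_tags[i - 1] if i > 0 else "<BOS>",
        # other
        "position": i,
    }

tokens = ["We", "are", "dealing", "with", "Kosovo", "."]
pos    = ["PRP", "VBP", "VBG", "IN", "NNP", "."]
print(extract_features(tokens, pos, 4)["in_gazetteer"])  # True
```

A real system would then binarise these features (and their pairwise combinations) into 0/1 vectors like those in the table above.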
Evaluation: Metrics and Methods
Evaluation: Exact and Partial Matches

Token       Original  Output1  Output2  Output3  Output4  Output5  Output6  Output7
The         B-L1      B-L1     O        O        O        O        B-L2     O
European    I-L1      I-L1     B-L1     O        I-L1     I-L2     I-L2     O
Parliament  E-L1      E-L1     E-L1     I-L1     I-L1     I-L1     E-L2     O

Evaluation metrics: Precision, Recall, F-score
Evaluation: Exact and Partial Matches
Exact matches: correctly identified NE: assuming a NE (word sequence) that is labelled as L1, all tokens in the NE are attributed labelling identical to the original.
Partial matches: correctly identified NE: assuming a NE (word sequence) that is labelled as L1, at least one token in the NE is also labelled as L1.
Evaluation: Example

Token       Original   Attributed
I           O          O
will        O          O
leave       O          O
for         O          O
Stockholm   I-GPE      I-GPE      (exact + partial)
on          O          O
Monday      B-DATE     I-DATE
,           I-DATE     O
6           I-DATE     B-DATE     (partial)
March       E-DATE     E-DATE
,           O          O
in          O          O
order       O          O
to          O          O
talk        O          O
to          O          O
Minister    O          O
Ringholm    I-PERSON   O
,           O          O
to          O          O
members     O          O
of          O          O
the         B-ORG      O
Swedish     I-ORG      I-NORP
parliament  E-ORG      O
.           O          O
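The two scoring modes can be sketched as follows (Python; assuming the tag sequences have already been decoded into (start, end, label) spans, with illustrative function names):

```python
# Minimal sketch of exact- vs partial-match scoring over entity spans.

def overlaps(a, b):
    """Partial match: same label and at least one shared token."""
    return a[2] == b[2] and a[0] <= b[1] and b[0] <= a[1]

def prf(gold, predicted, exact=True):
    match = (lambda g, p: g == p) if exact else overlaps
    tp_p = sum(any(match(g, p) for g in gold) for p in predicted)
    tp_g = sum(any(match(g, p) for p in predicted) for g in gold)
    precision = tp_p / len(predicted) if predicted else 0.0
    recall    = tp_g / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

gold = [(4, 4, "GPE"), (6, 9, "DATE")]   # Stockholm; "Monday , 6 March"
pred = [(4, 4, "GPE"), (8, 9, "DATE")]   # exact GPE; partial DATE
print(prf(gold, pred, exact=True))       # (0.5, 0.5, 0.5)
print(prf(gold, pred, exact=False))      # (1.0, 1.0, 1.0)
```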
Importance of NER ...and of Recycling it...
Why is efficient NER important? Applications of NER in NLP
Generally, NER is an important first step in extracting meaningful information from text.
● Provides keywords for indexing documents
○ news recommenders: document clustering, user profiles
○ document classification/retrieval
○ search engines
○ automated keyword extraction
● Entities (especially proper names) point to objects about which we need to define relations, roles, events
○ question answering: refers to "grounding" named entities to a model, defining their scope and role
○ semantic parsing
○ coreference resolution
Why Recycle?
Need for multilingual NLP applications → multilingual NE recognition.
Sufficient resources and tools exist for English, BUT for other languages resources are fewer and expensive:
● manual annotation requires time and manpower
● acquiring a new corpus for every adaptation need is not a very flexible method
Why Recycle?
Adaptation to other domains is also important:
● new domains require NER (biology, medicine, scientific texts)
● even top scorers in evaluation campaigns fail to perform well on different test sets (drop of 10%-30%) [1],[2]
What to Recycle?
Available: one NE tagger trained for
● English
● news articles
(Ontonotes corpus: English news broadcasts, CoNLL 2012 labels)
F-score performance: 74%-79% (exact matches)
Used for:
● news recommender (main application)
● conference management tool
● coreference resolution
● sentiment analysis
Recycling Scheme: Core System Architecture
[Diagram: an existing source-language NE tagger is transferred through the system to obtain a target-language NE tagger. The source corpus (SC) and target corpus (TC) are the two sides of a parallel corpus: the European Parliament Proceedings (Europarl), English-French and English-Greek.]
Phase 1: train the source-language NE tagger on a manually annotated source-language corpus and parse the source-language side of the parallel corpus with it.
Phase 2: transfer the NEs to the target-language side of the parallel corpus.
Phase 3: train the target-language NE tagger on the automatically annotated target-language corpus.
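The Phase 2 transfer can be sketched as annotation projection over word alignments (Python; the alignment format and function names are assumptions for illustration, not the thesis implementation; the alignments themselves would come from a statistical word aligner):

```python
# Hedged sketch of projecting source-side NE labels onto the target sentence.

def project_entities(src_labels, alignment, tgt_len):
    """src_labels: per-token NE labels on the source side, e.g. ['O', 'GPE'].
    alignment:  list of (src_index, tgt_index) word-alignment pairs.
    tgt_len:    number of tokens in the target sentence."""
    tgt_labels = ["O"] * tgt_len
    for s, t in alignment:
        if src_labels[s] != "O":
            tgt_labels[t] = src_labels[s]   # copy the label across the link
    return tgt_labels

# "Stockholm" (source token 3) is aligned to target token 2
src_labels = ["O", "O", "O", "GPE", "O"]
print(project_entities(src_labels, [(0, 0), (1, 1), (3, 2)], tgt_len=4))
# ['O', 'O', 'GPE', 'O']
```

The projected labels on the target side then serve as (noisy) training data for the target-language NE tagger in Phase 3.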
Early Results
                   Exact Match                 Partial Match
                   Precision  Recall  F-score  Precision  Recall  F-score
English Europarl   69.06      67.3    68.17    87.5       73.3    80.01
French Europarl    63.23      53.41   57.91    74.88      74.05   74.46
Greek Europarl     50.77      45.18   47.81    68.34      75.76   71.86
English Ontonotes  80.24      78.81   79.52    83.2       96.16   89.21

Need also to adapt to other domains...