

  1. Recycling Named Entity Taggers: Unsupervised Domain and Language Adaptation for Named Entity Recognition based on Parallel Corpora. Master thesis of Chrysoula Zerva. EPFL supervisor: Dr Martin Rajman. SONY supervisor: Dr Wilhelm Haag

  2. Outline ● Named Entity Recognition task ○ Definition, Process, Evaluation ● Importance ○ of NER ○ of Language (& domain) adaptation ● Core System ○ Architecture, Early results, Problems ● Evaluation ○ Final Results and Error analysis ● Conclusions

  3. Named Entity Recognition

  4. Named Entity: Definition Named entities: Atomic elements (in a text) that consist of one or more consecutive words and belong to predefined categories (labels).

  5. Named Entity: Definition Named entities: Atomic elements (in a text) that consist of one or more consecutive words and belong to predefined categories (labels). Common labels: ORGANISATION, PERSON, LOCATION

  6. Named Entity: Definition Named entities: Atomic elements (in a text) that consist of one or more consecutive words and belong to predefined categories (labels). Common labels: ORGANISATION, PERSON, LOCATION. The word sequence has to refer to a particular instance of the label's category. For example: "The president failed to explain the new military policy" → NO NE. "The president Barack Obama failed to explain the new military policy" → "Barack Obama" is a PERSON NE.

  7. Named Entity Recognition: LABELS Name expressions:
● PERSON: People, including fictional ("Mr Thomson explained...")
● NORP: Nationalities or religious or political groups ("The Swiss law prohibits...")
● FACILITY: Buildings, airports, highways, bridges ("Our reporter at the White House...")
● ORGANIZATION: Companies, agencies, institutions ("EPFL is located near...")
● GPE: Countries, cities, states, administrative areas ("Lausanne has a population of...")
● LOCATION: Non-GPE locations, mountains, rivers ("The situation in the Balkans is...")
● PRODUCT: Vehicles, weapons, foods, not services ("He is driving an SUV car...")
● EVENT: Named hurricanes, battles, sports events ("After the Second World War the...")
● WORK OF ART: Titles of books, songs ("Lord of the Rings" is a three...)
● LAW: Named documents made into laws ("In the European Constitution...")
● LANGUAGE: Any named language ("English is an international...")

  8. Named Entity Recognition: LABELS Time and Date expressions:
● DATE: Absolute or relative dates or periods ("Last year the results...")
● TIME: Times smaller than a day ("Tomorrow at noon...")
● PERCENT: Percentage, including "%" ("An estimated 5% of the people...")
● MONEY: Monetary values, including unit ("A monthly salary of 5000$")
● QUANTITY: Measurements, as of weight or distance ("It weighs 3 pounds.")
● ORDINAL: "first", "second", etc ("The first time that I...")
● CARDINAL: Numerals that do not fall under another type ("At least three people")

  9. Named Entity Recognition: LABELS Choosing criterion: sufficient training resources. [Bar chart: label distribution vs F-score performance (0.00-1.00) on the Ontonotes (pre-annotated) and Europarl (non-annotated) test sets, for the labels GPE, DATE, NORP, PERCENT, LOC, QUANTITY, EVENT, PRODUCT, LANGUAGE, ORG, PERSON, CARDINAL, MONEY, ORDINAL, TIME, WORK_OF_ART, FAC, LAW]

  10. Named Entity Recognition Step 1: Named Entity Identification Step 2: Named Entity Classification

  11. Named Entity Recognition Step 1: Named Entity Identification: classify every token under the BIOES scheme:
● B: beginning of NE
● I: inside NE
● O: outside NE
● E: end of NE
● S: single NE
Step 2: Named Entity Classification: classify the tokens that are part of a NE under a given set of predefined labels: PERCENT, ORDINAL, CARDINAL, ORGANISATION, DATE, PERSON, LOCATION, GPE, NORP
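The BIOES labelling of Step 1 can be sketched as follows (a minimal illustration, not the thesis code; the function and example names are assumed):

```python
# Sketch: given NE spans with labels, emit one BIOES tag per token.

def bioes_tags(tokens, spans):
    """spans: list of (start, end_inclusive, label) over token indices."""
    tags = ["O"] * len(tokens)            # outside NE by default
    for start, end, label in spans:
        if start == end:
            tags[start] = "S-" + label    # single-token NE
        else:
            tags[start] = "B-" + label    # beginning of NE
            for i in range(start + 1, end):
                tags[i] = "I-" + label    # inside NE
            tags[end] = "E-" + label      # end of NE
    return tags

tokens = ["The", "European", "Parliament", "met", "in", "Lausanne"]
spans = [(0, 2, "ORG"), (5, 5, "GPE")]
print(bioes_tags(tokens, spans))
# ['B-ORG', 'I-ORG', 'E-ORG', 'O', 'O', 'S-GPE']
```

In practice the classification labels of Step 2 are combined with the BIOES prefixes, as in the evaluation examples on the later slides.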

  12. Feature Extraction

  13. Feature Extraction: always performed before training AND before parsing. Preprocessing: ● tokenization ● Part-Of-Speech tagging ● use of gazetteers, lexicons containing NEs. Feature categories: ● Character-based (N-grams, Capitalised, All-Capitalised, Special Character, Numeric) ● Lexical (included in a gazetteer/lexicon, wordForm, left wordForm, right wordForm) ● Grammatical (Genitive, POS tag, left POS tag, right POS tag) ● Other (position in sentence, context (sequence of words)) ++ Combined features: pair combinations of the above
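The feature categories above can be sketched per token roughly as follows (an illustrative simplification assuming pre-tokenised input and a gazetteer set; the real system also uses POS tags, N-grams, and combined features):

```python
# Sketch of per-token feature extraction for the categories listed above.
# Helper name and the genitive heuristic are illustrative assumptions.

def token_features(tokens, i, gazetteer):
    w = tokens[i]
    left = tokens[i - 1] if i > 0 else "<s>"
    right = tokens[i + 1] if i < len(tokens) - 1 else "</s>"
    return {
        # character-based features
        "capitalised": w[0].isupper(),
        "all_caps": w.isupper(),
        "numeric": w.isdigit(),
        "has_special": any(not c.isalnum() for c in w),
        # lexical features
        "in_gazetteer": w in gazetteer,
        "word_form": w.lower(),
        "left_word": left.lower(),
        "right_word": right.lower(),
        # grammatical feature (simple suffix check for genitive)
        "genitive": w.endswith("'s"),
        # other
        "position": i,
    }

gaz = {"Kosovo", "Lausanne"}
tokens = "We are dealing with a horrific situation in Kosovo .".split()
feats = token_features(tokens, 8, gaz)          # token "Kosovo"
print(feats["capitalised"], feats["in_gazetteer"], feats["left_word"])
# True True in
```

Each token's feature dictionary is then binarised into the 0/1 vectors shown in the example on the next slide.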

  14. Feature extraction: Example Sentence: "We are dealing with a horrific situation in Kosovo." Each token is mapped to a binary feature vector (Capitalised, Numeric, Genitive, N-grams, Right word, and similarly for the rest of the features) plus an IDENTIFICATION tag and a CLASSIFICATION label. [Binary feature matrix omitted] Only "Kosovo" is part of a NE (identification: I, classification: GPE); every other token, including the final ".", is labelled O for both steps.

  15. Evaluation: Metrics and Methods

  16. Evaluation: Exact and Partial Matches

Tokens     | Original | Output1 | Output2 | Output3 | Output4 | Output5 | Output6 | Output7
The        | B-L1     | B-L1    | O       | O       | O       | O       | B-L2    | O
European   | I-L1     | I-L1    | B-L1    | O       | I-L1    | I-L2    | I-L2    | O
Parliament | E-L1     | E-L1    | E-L1    | I-L1    | I-L1    | I-L1    | E-L2    | O

Evaluation metrics: Precision, Recall, F-score

  17. Evaluation: Exact and Partial Matches Exact matches: correctly identified NE: assuming a NE (word sequence) that is labelled as L1, all tokens in the NE are attributed labelling identical to the original. Partial matches: correctly identified NE: assuming a NE (word sequence) that is labelled as L1, at least one token in the NE is also labelled as L1.

  18. Evaluation: Example

Tokens     | Original  | Attributed
I          | O         | O
will       | O         | O
leave      | O         | O
for        | O         | O
Stockholm  | I-GPE     | I-GPE      (exact + partial)
on         | O         | O
Monday     | B-DATE    | I-DATE
,          | I-DATE    | O
6          | I-DATE    | B-DATE     (partial)
March      | E-DATE    | E-DATE
,          | O         | O
in         | O         | O
order      | O         | O

Tokens     | Original  | Attributed
to         | O         | O
talk       | O         | O
to         | O         | O
Minister   | O         | O
Ringholm   | I-PERSON  | O
,          | O         | O
to         | O         | O
members    | O         | O
of         | O         | O
the        | B-ORG     | O
Swedish    | I-ORG     | I-NORP
parliament | E-ORG     | O
.          | O         | O
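The exact and partial criteria from the previous slide can be scored over BIOES tag sequences roughly like this (a sketch; helper names and the span-grouping logic are illustrative, not the thesis implementation):

```python
# Sketch: count exact and partial matches between gold and predicted tags.

def ne_spans(tags):
    """Group a BIOES tag sequence into NEs: list of (indices, label)."""
    spans, current, label = [], [], None
    for i, t in enumerate(tags):
        if t == "O":
            current, label = [], None
            continue
        prefix, lab = t.split("-", 1)
        if prefix in ("B", "S"):
            current, label = [i], lab      # a NE starts here
        else:
            current.append(i)              # I or E continues the NE
        if prefix in ("E", "S"):
            spans.append((tuple(current), label))
            current, label = [], None
    return spans

def match_counts(gold_tags, pred_tags):
    exact = partial = 0
    for indices, label in ne_spans(gold_tags):
        # exact: every token carries labelling identical to the original
        if all(pred_tags[i] == gold_tags[i] for i in indices):
            exact += 1
        # partial: at least one token also labelled as L1
        # (so an exact match counts as a partial match too)
        if any(pred_tags[i].endswith("-" + label) for i in indices):
            partial += 1
    return exact, partial

gold = ["O", "O", "S-GPE", "O", "B-DATE", "I-DATE", "E-DATE"]
pred = ["O", "O", "S-GPE", "O", "I-DATE", "B-DATE", "E-DATE"]
print(match_counts(gold, pred))
# (1, 2): the GPE matches exactly (and partially); the DATE only partially
```

Precision, recall, and F-score are then computed from these counts against the totals of predicted and gold NEs.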

  19. Importance of NER ...and of Recycling it...

  20. Why is efficient NER important? Applications of NER in NLP Generally NER is an important first step in extracting meaningful information from text ● Provide keywords for indexing documents ○ news recommenders : document clustering, user profiles ○ document classification/retrieval ○ search engines ○ Automated keyword extraction ● Entities (especially proper names) point to objects about which we need to define relations, roles, events ○ question answering: refers to “grounding” named entities to a model, defining their scope and role ○ semantic parsing ○ coreference resolution

  21. Why Recycle? Need for multilingual NLP applications → multilingual NE recognition. Sufficient resources and tools exist for English, BUT for other languages resources are fewer and expensive: manual annotation requires time and manpower, and acquiring a new corpus for every adaptation need is not a very flexible method.

  22. Why Recycle? Adaptation to other domains is also important: New domains require NER (biology, medicine, scientific texts) Even top scorers in evaluation campaigns fail to perform well on different test sets ( drop of 10%-30% ) [1],[2]

  23. What to Recycle? Available: one NE tagger trained for ● English ● news articles. Training data: Ontonotes corpus (English news broadcasts), CoNLL 2012 labels. F-score performance: 74%-79% (exact matches)

  24. What to Recycle? Available : One NE tagger trained for ● English ● News Articles Used for: ● news recommender (main application) ● conference management tool ● coreference resolution ● sentiment analysis

  25. Recycling Scheme: Core System Architecture

  26. Recycling Scheme: Core System Architecture [Diagram: source-language NE tagger (SC) → system transfer → target-language NE tagger (TC)]

  27. Recycling Scheme: Core System Architecture [Diagram: Source Corpus (SC) with its existing NE tagger and Target Corpus (TC) with its NE tagger; parallel corpus: European Parliament Proceedings (EuroParl), English-French and English-Greek]

  28. Recycling Scheme: Core System Architecture Phase 1: train the source-language NE tagger on a manually annotated corpus and parse the source-language side of the parallel corpus. Phase 2: transfer the NEs to the target-language side of the parallel corpus. Phase 3: train the target-language NE tagger on the transferred annotations.
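The label transfer of Phase 2 can be sketched as projecting tags across a word-aligned sentence pair (a deliberately simplified 1-to-1 alignment; in the real system alignments come from a statistical word aligner and NE boundaries must be re-derived on the target side):

```python
# Sketch of Phase 2: project NE labels from source to target tokens.
# `alignment` maps source token index -> target token index (illustrative).

def transfer_labels(src_tags, alignment, tgt_len):
    tgt_tags = ["O"] * tgt_len
    for s, t in alignment.items():
        if src_tags[s] != "O":
            tgt_tags[t] = src_tags[s]   # copy the source label across
    return tgt_tags

src = ["O", "S-GPE", "O"]               # e.g. "in Kosovo ."
alignment = {0: 1, 1: 2, 2: 3}          # hypothetical word alignment
print(transfer_labels(src, alignment, 4))
# ['O', 'O', 'S-GPE', 'O']
```

The target corpus annotated this way then serves as (noisy) training data for the Phase 3 tagger.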

  29. Early Results

  30. Early Results:

                  | Exact Match                  | Partial Match
Corpus            | Precision | Recall | F-score | Precision | Recall | F-score
English Europarl  | 69.06     | 67.30  | 68.17   | 87.50     | 73.30  | 80.01
French EuroParl   | 63.23     | 53.41  | 57.91   | 74.88     | 74.05  | 74.46
Greek EuroParl    | 50.77     | 45.18  | 47.81   | 68.34     | 75.76  | 71.86
English Ontonotes | 80.24     | 78.81  | 79.52   | 83.20     | 96.16  | 89.21

Need also to adapt to other domains...
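The F-scores above are the harmonic mean of precision and recall; the exact-match rows of the table can be reproduced from the reported P and R:

```python
# F-score = harmonic mean of precision and recall, checked against
# the exact-match columns of the results table above.

def f_score(p, r):
    return 2 * p * r / (p + r)

for name, p, r in [("English Europarl", 69.06, 67.30),
                   ("French EuroParl", 63.23, 53.41),
                   ("Greek EuroParl", 50.77, 45.18),
                   ("English Ontonotes", 80.24, 78.81)]:
    print(f"{name}: {f_score(p, r):.2f}")
# English Europarl: 68.17
# French EuroParl: 57.91
# Greek EuroParl: 47.81
# English Ontonotes: 79.52
```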
