
Named Entity Recognition & Sequence Labeling
CSCI 699: ML for Knowledge Extraction & Reasoning
Instructor: Xiang Ren, USC Computer Science
Recap, Information Extraction: information extraction (IE) systems find and understand …


  1. Workflow of Token-wise Classifiers Training 1. Collect a set of representative training documents 2. Label each token for its entity class or other (O) 3. Design feature extractors appropriate to the text and classes 4. Train a classifier to predict the labels of each token in the annotated training sentences Testing 1. Receive a set of testing documents 2. Run trained classifier to label each token 3. Appropriately output the recognized entities 40

  3. Encoding labels for sequence • IO labeling schema: • The O • European ORG (part of ORG entity) • Commission ORG (part of ORG entity) • said O • on O • Thursday O • it O • disagreed O • with O • German MISC (part of MISC entity) 42

  4. Encoding labels for sequence • BIOES labeling schema: • The O • European B-ORG (begin of entity) • Commission E-ORG (end of entity) • said O • on O • Thursday O • it O • disagreed O • with O • German S-MISC (singleton entity) 43

  5. Encoding labels for sequence • BIO / BIOES labeling schemas, side by side: • The O / O • European B-ORG / B-ORG • Commission I-ORG / E-ORG • said O / O • on O / O • Thursday O / O • it O / O • disagreed O / O • with O / O • German B-MISC / S-MISC 44
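A minimal sketch of BIOES encoding in Python (the function name and the (start, end, type) span format, end-exclusive, are illustrative, not from the slides):

```python
def bioes_encode(tokens, spans):
    """Turn entity spans [(start, end, type), ...] (end exclusive)
    into one BIOES tag per token; everything else stays O."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        if end - start == 1:
            tags[start] = "S-" + etype            # singleton entity
        else:
            tags[start] = "B-" + etype            # begin of entity
            for i in range(start + 1, end - 1):
                tags[i] = "I-" + etype            # inside
            tags[end - 1] = "E-" + etype          # end of entity
    return tags

tokens = "The European Commission said on Thursday it disagreed with German".split()
print(list(zip(tokens, bioes_encode(tokens, [(1, 3, "ORG"), (9, 10, "MISC")]))))
# [('The', 'O'), ('European', 'B-ORG'), ('Commission', 'E-ORG'), ..., ('German', 'S-MISC')]
```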

  6. NER as a Classification Problem • NED: identify named entity boundaries using BIO tags • B: beginning of an entity • I: continues the entity • O: word outside any entity • NEC: classify entities into a predefined set of categories • Person names • Organizations (companies, governmental organizations, etc.) • Locations (cities, countries, etc.) • Miscellaneous (movie titles, sport events, etc.) 45

  7. Features for NER • Words • Current word (essentially like a learned dictionary) • Previous/next word (context) • Other kinds of inferred linguistic classification • Part-of-speech tags • Label context • Previous (and perhaps next) label 46

  8. Features: Word Substrings [Figure: counts of words containing the substrings "oxa", ":", and "field" across the classes drug, company, movie, place, and person; e.g., "Cotrimoxazole" (drug, contains "oxa"), "Alien Fury: Countdown to Invasion" (movie, contains ":"), "Wethersfield" (place, ends in "field")] 47

  9. Features for NER • Word Shapes • Map words to a simplified representation that encodes attributes such as length, capitalization, numerals, Greek letters, internal punctuation, etc. • Varicella-zoster → Xx-xxx • mRNA → xXXX • CPA1 → XXXd 48
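A sketch of one possible shape mapping; the slide's variant also collapses repeated shape characters (hence "Xx-xxx" for "Varicella-zoster"), which this minimal version skips:

```python
def word_shape(word):
    """Map a word to a simplified shape: capitals -> 'X', lowercase -> 'x',
    digits -> 'd'; other characters (e.g., internal punctuation) kept as-is."""
    out = []
    for ch in word:
        if ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("d")
        else:
            out.append(ch)
    return "".join(out)

for w in ["Varicella-zoster", "mRNA", "CPA1"]:
    print(w, "->", word_shape(w))
# Varicella-zoster -> Xxxxxxxxx-xxxxxx, mRNA -> xXXX, CPA1 -> XXXd
```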

  10. Classical Learning Models for NER • KNN • Decision Tree • Naïve Bayes • SVM • Boosting • … 49

  11. K Nearest Neighbor • Learning is just storing the representations of the training examples. • For a testing instance x_p: • compute the similarity between x_p and all training examples • take a vote among the k nearest neighbours of x_p • assign x_p the majority category (for k = 1, the category of the most similar example in T) 50

  12. Distance measures in KNN • The nearest neighbour method relies on a similarity (or distance) metric. • Given two objects x and y, both with n values, the Euclidean distance is $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$ 51
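A minimal sketch of both pieces, the Euclidean metric and the k-NN vote; the function names and the (vector, label) training format are illustrative, with toy vectors in the spirit of the walk-through example on the next slide:

```python
import math
from collections import Counter

def euclidean(x, y):
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2) over the n values of x and y."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_classify(x_p, train, k=3):
    """Assign x_p the majority category among its k nearest training examples.
    `train` is a list of (feature_vector, label) pairs."""
    nearest = sorted(train, key=lambda ex: euclidean(x_p, ex[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1, 1, 1, 1), "person"), ((0, 1, 0, 0), "org"), ((1, 0, 1, 1), "person")]
print(knn_classify((1, 1, 1, 0), train))   # -> person
```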

  13. A walk-through example
                    isPersonName  isCapitalized  isLiving  teachesCS544
      Jerry Hobbs        1              1            1          1
      USC                0              1            0          0
      eduard hovy        1              0            1          1
      Kevin Knight       1              1            1          1
      52

  14. 3-NN • 1-NN: choose the category of the closest neighbour (can be erroneous due to noise) • 3-NN: choose the category of the majority of the neighbours 53

  15. KNN: Pros & Cons • Pros: + robust + simple + training is very fast (just storing examples) • Cons: − depends on the similarity measure & on k − easily fooled by irrelevant attributes − computationally expensive at test time 54

  16. Decision Tree
      Training data (label: X is PersonName?):
                    isPersonName  isCapitalized  isLiving   label
      profession         0              0            0       NO
      Jerry Hobbs        1              1            1       YES
      USC                0              1            0       NO
      Jordan             1              1            0       NO
      Learned tree: isCapitalized? (0: NO; 1: isPersonName? (0: NO; 1: isLiving? (1: YES; 0: NO)))
      • Each internal node tests an attribute • Each branch corresponds to an attribute value • Each leaf node assigns a classification 55
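A quick way to reproduce such a tree, assuming scikit-learn is available; the learned splits may differ from the hand-drawn tree on the slide:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data from the slide: [isPersonName, isCapitalized, isLiving] -> label
X = [[0, 0, 0],   # profession   -> NO
     [1, 1, 1],   # Jerry Hobbs  -> YES
     [0, 1, 0],   # USC          -> NO
     [1, 1, 0]]   # Jordan       -> NO
y = ["NO", "YES", "NO", "NO"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(tree, feature_names=["isPersonName", "isCapitalized", "isLiving"]))
```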

  17. Decision Tree: Pros & Cons • Pros: + generate understandable rules + provide a clear indication of which features are most important for classification • Cons: − error prone in multi-class classification with a small number of training examples − expensive to train, due to pruning 56

  18. AdaBoost with Decision Trees (Carreras et al., 2002) • Learning algorithm: AdaBoost (Schapire & Singer, 1999) • Binary classification • Binary features • Weak rules (h_t): decision trees of fixed depth 57
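Carreras et al. boost fixed-depth decision trees; the hand-rolled sketch below uses the simplest weak rule instead, a depth-1 stump over binary features, with the standard AdaBoost update (labels in {−1, +1}; all names are illustrative):

```python
import numpy as np

def adaboost_stumps(X, y, T=10):
    """Minimal AdaBoost sketch: binary features X (n x d, values 0/1),
    labels y in {-1, +1}. Each weak rule h_t is a depth-1 stump:
    predict s if X[:, j] == 1 else -s."""
    n, d = X.shape
    w = np.ones(n) / n                              # example weights
    ensemble = []                                   # (alpha_t, feature j, sign s)
    for _ in range(T):
        best = None
        for j in range(d):                          # pick the stump with the
            for s in (1, -1):                       # lowest weighted error
                pred = np.where(X[:, j] == 1, s, -s)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, s, pred)
        err, j, s, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)       # weight of this weak rule
        w *= np.exp(-alpha * y * pred)              # up-weight the mistakes
        w /= w.sum()
        ensemble.append((alpha, j, s))
    return ensemble

def adaboost_predict(ensemble, x):
    score = sum(a * (s if x[j] == 1 else -s) for a, j, s in ensemble)
    return 1 if score >= 0 else -1
```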

  19. Feature generation • Example sentence: Adam Smith works for IBM in London . 58

  20. Feature generation • Example: Adam Smith works for IBM in London . • Contextual features: • current word W_0 (e.g., "Adam" for the first token) 59

  21. Feature generation • Contextual features: • current word W_0 • words around W_0 in a [−3, …, +3] window ("null" beyond the sentence boundaries) • e.g., for "Adam": Adam, null, null, null, Smith, works, for • for "Smith": Smith, Adam, null, null, works, for, IBM • for ".": fp, London, in, IBM, null, null, null 60

  22. Feature generation • Orthographic features, appended to each vector: initial-caps, all-caps • e.g., for "Adam": Adam, null, null, null, Smith, works, for, 1, 0 • for "Smith": Smith, Adam, null, null, works, for, IBM, 1, 0 • for ".": fp, London, in, IBM, null, null, null, 0, 0 61
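A sketch of the contextual + orthographic feature extraction just described (a hypothetical helper, returning a feature dict rather than the slides' flat vector; out-of-sentence positions map to "null" as on the slides):

```python
def window_features(tokens, i, window=3):
    """Contextual + orthographic features for token i: the current word,
    the words in a [-3, ..., +3] window ('null' beyond the sentence
    boundaries), and the two orthographic flags from the slides."""
    feats = {"w0": tokens[i]}
    for off in range(-window, window + 1):
        if off != 0:
            j = i + off
            feats[f"w{off:+d}"] = tokens[j] if 0 <= j < len(tokens) else "null"
    feats["initial-caps"] = tokens[i][:1].isupper()
    feats["all-caps"] = tokens[i].isupper()
    return feats

tokens = "Adam Smith works for IBM in London .".split()
print(window_features(tokens, 0))
# {'w0': 'Adam', 'w-3': 'null', ..., 'w+1': 'Smith', ..., 'initial-caps': True, 'all-caps': False}
```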

  23. More features 62

  25. Results for Entity Boundary Detection
      Carreras et al., 2002 (BIO, dev.):  Precision 92.45 | Recall 90.88 | F-score 91.66
      CoNLL-2002 Spanish Evaluation Data:
      Data set      #tokens   #NEs
      Train         264,715   18,794
      Development    52,923    4,351
      Test           51,533    3,558
      64

  26. Results for Entity Classification
      Spanish Dev.                          Spanish Test
               Precision  Recall  F-score            Precision  Recall  F-score
      LOC        79.04    80.00    79.52    LOC        85.76    79.43    82.47
      MISC       55.48    54.61    55.04    MISC       60.19    57.35    58.73
      ORG        79.57    76.06    77.77    ORG        81.21    82.43    81.81
      PER        87.19    86.91    87.05    PER        84.71    93.47    88.87
      overall    79.15    77.80    78.47    overall    81.38    81.40    81.39
      65

  27. Sequence Labeling for NER

  28. Traditional Named Entity Recognition (NER) Systems • Heavy reliance on corpus-specific human labeling • Training sequence models is slow • Example annotation: "The best BBQ I've tasted in Phoenix" → BBQ: Food, Phoenix: Location (all other tokens O) • NER systems: Stanford NER, Illinois Name Tagger, IBM Alchemy APIs, … • [Figure: a manual annotation interface feeding sequence model training] • e.g., (McCallum & Li, 2003), (Finkel et al., 2005), (Ratinov & Roth, 2009), … 67

  29. Encoding labels for sequence • BIOES labeling schema: • The O • European B-ORG (begin of entity) • Commission E-ORG (end of entity) • said O • on O • Thursday O • it O • disagreed O • with O • German S-MISC (singleton entity) 68

  30. Sequence Labeling as Classification • Classify each token independently, but use as input features information about the surrounding tokens (sliding window). • Example: sliding the window over "John saw the saw and decided to take it to the table." one token at a time, the classifier outputs the POS tags NNP VBD DT NN CC VBD TO VB PRP IN DT NN. [Slides 30–41 animate this, one classification per slide.] Slide from Ray Mooney

  42. Sequence Labeling vs. Classification • Sequence Models are statistical models of whole token sequences that effectively label sub-sequences 81

  43. Workflow of Sequence Modeling Training 1. Collect a set of representative training documents 2. Label each token for its entity class or other (O) 3. Design feature extractors appropriate to the text and classes 4. Train a sequence classifier to predict the labels from the data Testing 1. Receive a set of testing documents 2. Run sequence model inference to label each token 3. Appropriately output the recognized entities 82

  44. Hidden Markov Models (HMMs) • Generative: find parameters to maximize P(X, Y) • Assumes features are independent • When labeling X_i, future observations are taken into account (forward-backward) Slides by Jenny Finkel

  45. What is an HMM? • Graphical Model Representation: Variables by time • Circles indicate states • Arrows indicate probabilistic dependencies between states 84

  46. What is an HMM? • Green circles are hidden states • Dependent only on the previous state: Markov process • “The past is independent of the future given the present.” 85

  47. What is an HMM? • Purple nodes are observed states • Dependent only on their corresponding hidden state 86

  48. HMM Formalism • [Diagram: a chain of hidden states S emitting observations K] • {S, K, Π, A, B} • S: {s_1 … s_N} are the values for the hidden states • K: {k_1 … k_M} are the values for the observations 87

  49. HMM Formalism • [Diagram: hidden-state chain with transition probabilities A and emission probabilities B] • {S, K, Π, A, B} • Π = {π_i} are the initial state probabilities • A = {a_ij} are the state transition probabilities • B = {b_ik} are the observation state probabilities 88

  50. Inference for an HMM • Compute the probability of a given observation sequence • Given an observation sequence, compute the most likely hidden state sequence • Given an observation sequence and set of possible models, which model most closely fits the data? 89

  51. Sequence Probability • [Trellis diagram over observations o_1 … o_T] • Given an observation sequence O = (o_1, …, o_T) and a model μ = (A, B, Π), compute the probability of the observation sequence: P(O | μ) 90

  52. Sequence Probability • [Trellis diagram: hidden states x_1 … x_T over observations o_1 … o_T]
      $P(O \mid X, \mu) = b_{x_1 o_1} b_{x_2 o_2} \cdots b_{x_T o_T}$
      $P(X \mid \mu) = \pi_{x_1} a_{x_1 x_2} a_{x_2 x_3} \cdots a_{x_{T-1} x_T}$
      $P(O, X \mid \mu) = P(O \mid X, \mu)\, P(X \mid \mu)$
      $P(O \mid \mu) = \sum_X P(O \mid X, \mu)\, P(X \mid \mu)$
      91
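The last line sums over exponentially many state sequences X, but the forward algorithm computes it in O(T·N²). A minimal NumPy sketch using the slide's notation (π, a_ij, b_ik; the example numbers are made up):

```python
import numpy as np

def forward(pi, A, B, obs):
    """P(O | mu) for mu = (A, B, pi): pi[i] initial probs, A[i, j] = a_ij
    transition probs, B[i, k] = b_ik observation probs, obs = [o_1, ..., o_T]
    as indices. Sums P(O, X | mu) over all hidden paths X."""
    alpha = pi * B[:, obs[0]]              # alpha_1(i) = pi_i * b_{i o_1}
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # alpha_{t+1}(j) = (sum_i alpha_t(i) a_ij) b_{j o}
    return alpha.sum()

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
print(forward(pi, A, B, [0, 1, 1]))        # P(o_1=0, o_2=1, o_3=1 | mu)
```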

  53. MaxEnt Markov Models (MEMMs) • Discriminative • Find parameters to maximize P(Y|X) • No longer assume that features are independent • Do not take future observations into account (no forward-backward) Slides by Jenny Finkel

  54. MEMM inference in systems • For a MEMM, the classifier makes a single decision at a time, conditioned on evidence from observations and previous decisions • A larger space of sequences is usually explored via search • Example, decision point "22.6" in "The Dow fell 22.6 %":
      Local context:   -3    -2    -1     0      +1
      word:            The   Dow   fell   22.6   %
      tag:             DT    NNP   VBD    ???    ???
      Features: W_0 = 22.6, W_{+1} = %, W_{−1} = fell, T_{−1} = VBD, T_{−1}-T_{−2} = NNP-VBD, hasDigit? = true
      (Ratnaparkhi 1996; Toutanova et al. 2003, etc.)

  55. Conditional Random Fields (CRFs) • Discriminative • Doesn't assume that features are independent • When labeling Y_i, future observations are taken into account → the best of both worlds! Slides by Jenny Finkel

  56. CRFs [Lafferty, McCallum, and Pereira 2001] • A whole-sequence conditional model rather than a chaining of local models:
      $P(c \mid d, \lambda) = \frac{\exp \sum_i \lambda_i f_i(c, d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c', d)}$
      • The space of c's is now the space of sequences • But if the features f_i remain local, the conditional sequence likelihood can be calculated exactly using dynamic programming • Training is slower, but CRFs avoid causal-competition biases
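Because the features are local, the denominator (the sum over all tag sequences c′) can be computed by a forward pass. A minimal sketch, assuming NumPy/SciPy and pre-computed score matrices; `emit` and `trans` are hypothetical names standing in for the weighted feature sums a real CRF computes from λ and f_i:

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_prob(emit, trans, tags):
    """log P(c | d, lambda) for a linear-chain CRF.
    emit[t, y] = sum_i lambda_i f_i for tag y at position t (given d);
    trans[y, y'] = score of the tag bigram y -> y'."""
    T, _ = emit.shape
    score = emit[0, tags[0]] + sum(              # numerator: one tag sequence c
        trans[tags[t - 1], tags[t]] + emit[t, tags[t]] for t in range(1, T))
    log_alpha = emit[0]                          # denominator: all sequences c',
    for t in range(1, T):                        # summed exactly by a forward pass
        log_alpha = logsumexp(log_alpha[:, None] + trans, axis=0) + emit[t]
    return score - logsumexp(log_alpha)
```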

  57. Model Trade-offs
      Model   Speed        Discriminative vs. Generative   Normalization
      HMM     very fast    generative                      local
      MEMM    mid-range    discriminative                  local
      CRF     kinda slow   discriminative                  global
      Slides by Jenny Finkel

  58. Greedy Inference • Greedy inference: we just start at the left, and use our classifier at each position to assign a label • The classifier can depend on previous labeling decisions as well as observed data • Advantages: • fast, no extra memory requirements • very easy to implement • with rich features including observations to the right, it may perform quite well • Disadvantage: • greedy: we make commitments we cannot recover from Slides by Chris Manning
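In code, greedy inference is just a left-to-right loop; a sketch with a hypothetical local-classifier interface:

```python
def greedy_decode(tokens, classify):
    """Left-to-right greedy inference. `classify(tokens, i, prev_label)` is any
    local classifier (hypothetical interface) that may look at the observed
    data and the previous labeling decision."""
    labels, prev = [], None
    for i in range(len(tokens)):
        prev = classify(tokens, i, prev)   # commit; no way to undo this later
        labels.append(prev)
    return labels
```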

  59. Viterbi Inference • Viterbi inference: dynamic programming or memoization • Requires a small window of state influence (e.g., past two states are relevant) • Advantage: • exact: the global best sequence is returned • Disadvantage: • harder to implement long-distance state-state interactions (but beam inference tends not to allow long-distance resurrection of sequences anyway) Slides by Chris Manning
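A NumPy sketch of first-order Viterbi over pre-computed local scores (`emit` and `trans` are illustrative names, matching the CRF sketch above):

```python
import numpy as np

def viterbi(emit, trans):
    """Exact best-sequence inference by dynamic programming (first-order:
    only the previous state matters). emit[t, y]: local score of tag y at
    position t; trans[y, y']: score of the transition y -> y'."""
    T, Y = emit.shape
    delta = emit[0].copy()                  # best score of any path ending in y
    back = np.zeros((T, Y), dtype=int)      # backpointers
    for t in range(1, T):
        cand = delta[:, None] + trans       # cand[y, y'] = best-so-far via y
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + emit[t]
    path = [int(delta.argmax())]            # recover the global best sequence
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```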

  60. Beam Inference • Beam inference: at each position keep the top k complete sequences • Extend each sequence in each local way • The extensions compete for the k slots at the next position • Advantages: • fast; beam sizes of 3–5 are almost as good as exact inference in many cases • easy to implement (no dynamic programming required) • Disadvantage: • inexact: the globally best sequence can fall off the beam Slides by Chris Manning
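A sketch of beam inference with the same scoring interface; the only change from the Viterbi sketch is keeping k whole sequences instead of one best score per state:

```python
def beam_decode(emit, trans, k=3):
    """Beam inference over the same scores as the Viterbi sketch, but with
    plain lists: keep the top-k complete sequences at each position."""
    T, Y = len(emit), len(emit[0])
    beam = sorted(((emit[0][y], [y]) for y in range(Y)), reverse=True)[:k]
    for t in range(1, T):
        extensions = [(score + trans[seq[-1]][y] + emit[t][y], seq + [y])
                      for score, seq in beam for y in range(Y)]
        beam = sorted(extensions, reverse=True)[:k]   # compete for the k slots
    return beam[0][1]                                 # can be globally suboptimal
```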

  61. Stanford CRF System: A Case Study https://nlp.stanford.edu/software/CRF-NER.shtml
