
Named Entity Recognition & Sequence Labeling
CSCI 699: ML for Knowledge Extraction & Reasoning
Instructor: Xiang Ren, USC Computer Science
Recap, Information Extraction: information extraction (IE) systems find and understand …


  1. Workflow of Token-wise Classifiers Training 1. Collect a set of representative training documents 2. Label each token for its entity class or other (O) 3. Design feature extractors appropriate to the text and classes 4. Train a classifier to predict the labels of each token in the annotated training sentences Testing 1. Receive a set of testing documents 2. Run trained classifier to label each token 3. Appropriately output the recognized entities 40

  3. Encoding labels for sequence • IO labeling schema: • The O • European ORG (part of ORG entity) • Commission ORG (part of ORG entity) • said O • on O • Thursday O • it O • disagreed O • with O • German MISC (part of MISC entity) 42

  4. Encoding labels for sequence • BIOES labeling schema: • The O • European B-ORG (begin of entity) • Commission E-ORG (end of entity) • said O • on O • Thursday O • it O • disagreed O • with O • German S-MISC (singleton entity) 43

  5. Encoding labels for sequence • BIO / BIOES labeling schemas, side by side: • The O / O • European B-ORG / B-ORG • Commission I-ORG / E-ORG • said O / O • on O / O • Thursday O / O • it O / O • disagreed O / O • with O / O • German B-MISC / S-MISC 44
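A minimal sketch of BIOES encoding in Python (the function name and the (start, end, type) span format, end-exclusive, are illustrative, not from the slides):

```python
def bioes_encode(tokens, spans):
    """Turn entity spans [(start, end, type), ...] (end exclusive)
    into one BIOES tag per token; everything else stays O."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        if end - start == 1:
            tags[start] = "S-" + etype            # singleton entity
        else:
            tags[start] = "B-" + etype            # begin of entity
            for i in range(start + 1, end - 1):
                tags[i] = "I-" + etype            # inside
            tags[end - 1] = "E-" + etype          # end of entity
    return tags

tokens = "The European Commission said on Thursday it disagreed with German".split()
print(list(zip(tokens, bioes_encode(tokens, [(1, 3, "ORG"), (9, 10, "MISC")]))))
# [('The', 'O'), ('European', 'B-ORG'), ('Commission', 'E-ORG'), ..., ('German', 'S-MISC')]
```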

  6. NER as a Classification Problem • NED: identify named entity boundaries using BIO tags • B: beginning of an entity • I: continues the entity • O: word outside any entity • NEC: classify entities into a predefined set of categories • Person names • Organizations (companies, governmental organizations, etc.) • Locations (cities, countries, etc.) • Miscellaneous (movie titles, sport events, etc.) 45

  7. Features for NER • Words • Current word (essentially like a learned dictionary) • Previous/next word (context) • Other kinds of inferred linguistic classification • Part-of-speech tags • Label context • Previous (and perhaps next) label 46

  8. Features: Word Substrings [Figure: counts of words containing the substrings "oxa", ":", and "field" across the classes drug, company, movie, place, and person; e.g., "Cotrimoxazole" (drug, contains "oxa"), "Alien Fury: Countdown to Invasion" (movie, contains ":"), "Wethersfield" (place, ends in "field")] 47

  9. Features for NER • Word Shapes • Map words to a simplified representation that encodes attributes such as length, capitalization, numerals, Greek letters, internal punctuation, etc. • Varicella-zoster → Xx-xxx • mRNA → xXXX • CPA1 → XXXd 48
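A sketch of one possible shape mapping; the slide's variant also collapses repeated shape characters (hence "Xx-xxx" for "Varicella-zoster"), which this minimal version skips:

```python
def word_shape(word):
    """Map a word to a simplified shape: capitals -> 'X', lowercase -> 'x',
    digits -> 'd'; other characters (e.g., internal punctuation) kept as-is."""
    out = []
    for ch in word:
        if ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("d")
        else:
            out.append(ch)
    return "".join(out)

for w in ["Varicella-zoster", "mRNA", "CPA1"]:
    print(w, "->", word_shape(w))
# Varicella-zoster -> Xxxxxxxxx-xxxxxx, mRNA -> xXXX, CPA1 -> XXXd
```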

  10. Classical Learning Models for NER • KNN • Decision Tree • Naïve Bayes • SVM • Boosting • … 49

  11. K Nearest Neighbor • Learning is just storing the representations of the training examples. • For a testing instance x_p: • compute the similarity between x_p and all training examples • take a vote among the k nearest neighbours of x_p • assign x_p the majority category (for k = 1, the category of the most similar example in T) 50

  12. Distance measures in KNN • The nearest neighbour method relies on a similarity (or distance) metric. • Given two objects x and y, both with n values, the Euclidean distance is $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$ 51
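A minimal sketch of both pieces, the Euclidean metric and the k-NN vote; the function names and the (vector, label) training format are illustrative, with toy vectors in the spirit of the walk-through example on the next slide:

```python
import math
from collections import Counter

def euclidean(x, y):
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2) over the n values of x and y."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_classify(x_p, train, k=3):
    """Assign x_p the majority category among its k nearest training examples.
    `train` is a list of (feature_vector, label) pairs."""
    nearest = sorted(train, key=lambda ex: euclidean(x_p, ex[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1, 1, 1, 1), "person"), ((0, 1, 0, 0), "org"), ((1, 0, 1, 1), "person")]
print(knn_classify((1, 1, 1, 0), train))   # -> person
```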

  13. A walk-through example
                    isPersonName  isCapitalized  isLiving  teachesCS544
      Jerry Hobbs        1              1            1          1
      USC                0              1            0          0
      eduard hovy        1              0            1          1
      Kevin Knight       1              1            1          1
      52

  14. 3-NN • 1-NN: choose the category of the closest neighbour (can be erroneous due to noise) • 3-NN: choose the category of the majority of the neighbours 53

  15. KNN: Pros & Cons • Pros: + robust + simple + training is very fast (just storing examples) • Cons: − depends on the similarity measure & on k − easily fooled by irrelevant attributes − computationally expensive at test time 54

  16. Decision Tree
      Training data (label: X is PersonName?):
                    isPersonName  isCapitalized  isLiving   label
      profession         0              0            0       NO
      Jerry Hobbs        1              1            1       YES
      USC                0              1            0       NO
      Jordan             1              1            0       NO
      Learned tree: isCapitalized? (0: NO; 1: isPersonName? (0: NO; 1: isLiving? (1: YES; 0: NO)))
      • Each internal node tests an attribute • Each branch corresponds to an attribute value • Each leaf node assigns a classification 55
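A quick way to reproduce such a tree, assuming scikit-learn is available; the learned splits may differ from the hand-drawn tree on the slide:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data from the slide: [isPersonName, isCapitalized, isLiving] -> label
X = [[0, 0, 0],   # profession   -> NO
     [1, 1, 1],   # Jerry Hobbs  -> YES
     [0, 1, 0],   # USC          -> NO
     [1, 1, 0]]   # Jordan       -> NO
y = ["NO", "YES", "NO", "NO"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(tree, feature_names=["isPersonName", "isCapitalized", "isLiving"]))
```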

  17. Decision Tree: Pros & Cons • Pros: + generate understandable rules + provide a clear indication of which features are most important for classification • Cons: − error prone in multi-class classification with a small number of training examples − expensive to train, due to pruning 56

  18. AdaBoost with Decision Trees (Carreras et al., 2002) • Learning algorithm: AdaBoost (Schapire & Singer, 1999) • Binary classification • Binary features • Weak rules (h_t): decision trees of fixed depth 57
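Carreras et al. boost fixed-depth decision trees; the hand-rolled sketch below uses the simplest weak rule instead, a depth-1 stump over binary features, with the standard AdaBoost update (labels in {−1, +1}; all names are illustrative):

```python
import numpy as np

def adaboost_stumps(X, y, T=10):
    """Minimal AdaBoost sketch: binary features X (n x d, values 0/1),
    labels y in {-1, +1}. Each weak rule h_t is a depth-1 stump:
    predict s if X[:, j] == 1 else -s."""
    n, d = X.shape
    w = np.ones(n) / n                              # example weights
    ensemble = []                                   # (alpha_t, feature j, sign s)
    for _ in range(T):
        best = None
        for j in range(d):                          # pick the stump with the
            for s in (1, -1):                       # lowest weighted error
                pred = np.where(X[:, j] == 1, s, -s)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, s, pred)
        err, j, s, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)       # weight of this weak rule
        w *= np.exp(-alpha * y * pred)              # up-weight the mistakes
        w /= w.sum()
        ensemble.append((alpha, j, s))
    return ensemble

def adaboost_predict(ensemble, x):
    score = sum(a * (s if x[j] == 1 else -s) for a, j, s in ensemble)
    return 1 if score >= 0 else -1
```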

  19. Feature generation • Example sentence: Adam Smith works for IBM in London . 58

  20. Feature generation • Example: Adam Smith works for IBM in London . • Contextual features: • current word W_0 (e.g., "Adam" for the first token) 59

  21. Feature generation • Contextual features: • current word W_0 • words around W_0 in a [−3, …, +3] window ("null" beyond the sentence boundaries) • e.g., for "Adam": Adam, null, null, null, Smith, works, for • for "Smith": Smith, Adam, null, null, works, for, IBM • for ".": fp, London, in, IBM, null, null, null 60

  22. Feature generation • Orthographic features, appended to each vector: initial-caps, all-caps • e.g., for "Adam": Adam, null, null, null, Smith, works, for, 1, 0 • for "Smith": Smith, Adam, null, null, works, for, IBM, 1, 0 • for ".": fp, London, in, IBM, null, null, null, 0, 0 61
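A sketch of the contextual + orthographic feature extraction just described (a hypothetical helper, returning a feature dict rather than the slides' flat vector; out-of-sentence positions map to "null" as on the slides):

```python
def window_features(tokens, i, window=3):
    """Contextual + orthographic features for token i: the current word,
    the words in a [-3, ..., +3] window ('null' beyond the sentence
    boundaries), and the two orthographic flags from the slides."""
    feats = {"w0": tokens[i]}
    for off in range(-window, window + 1):
        if off != 0:
            j = i + off
            feats[f"w{off:+d}"] = tokens[j] if 0 <= j < len(tokens) else "null"
    feats["initial-caps"] = tokens[i][:1].isupper()
    feats["all-caps"] = tokens[i].isupper()
    return feats

tokens = "Adam Smith works for IBM in London .".split()
print(window_features(tokens, 0))
# {'w0': 'Adam', 'w-3': 'null', ..., 'w+1': 'Smith', ..., 'initial-caps': True, 'all-caps': False}
```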

  23. More features 62

  25. Results for Entity Boundary Detection
      Carreras et al., 2002 (BIO, dev.):  Precision 92.45 | Recall 90.88 | F-score 91.66
      CoNLL-2002 Spanish Evaluation Data:
      Data set      #tokens   #NEs
      Train         264,715   18,794
      Development    52,923    4,351
      Test           51,533    3,558
      64

  26. Results for Entity Classification
      Spanish Dev.                          Spanish Test
               Precision  Recall  F-score            Precision  Recall  F-score
      LOC        79.04    80.00    79.52    LOC        85.76    79.43    82.47
      MISC       55.48    54.61    55.04    MISC       60.19    57.35    58.73
      ORG        79.57    76.06    77.77    ORG        81.21    82.43    81.81
      PER        87.19    86.91    87.05    PER        84.71    93.47    88.87
      overall    79.15    77.80    78.47    overall    81.38    81.40    81.39
      65

  27. Sequence Labeling for NER

  28. Traditional Named Entity Recognition (NER) Systems • Heavy reliance on corpus-specific human labeling • Training sequence models is slow • Example annotation: "The best BBQ I've tasted in Phoenix" → BBQ: Food, Phoenix: Location (all other tokens O) • NER systems: Stanford NER, Illinois Name Tagger, IBM Alchemy APIs, … • [Figure: a manual annotation interface feeding sequence model training] • e.g., (McCallum & Li, 2003), (Finkel et al., 2005), (Ratinov & Roth, 2009), … 67

  29. Encoding labels for sequence • BIOES labeling schema: • The O • European B-ORG (begin of entity) • Commission E-ORG (end of entity) • said O • on O • Thursday O • it O • disagreed O • with O • German S-MISC (singleton entity) 68

  30. Sequence Labeling as Classification • Classify each token independently, but use as input features information about the surrounding tokens (sliding window). • Example: sliding the window over "John saw the saw and decided to take it to the table." one token at a time, the classifier outputs the POS tags NNP VBD DT NN CC VBD TO VB PRP IN DT NN. [Slides 30–41 animate this, one classification per slide.] Slide from Ray Mooney

  42. Sequence Labeling vs. Classification • Sequence Models are statistical models of whole token sequences that effectively label sub-sequences 81

  43. Workflow of Sequence Modeling Training 1. Collect a set of representative training documents 2. Label each token for its entity class or other (O) 3. Design feature extractors appropriate to the text and classes 4. Train a sequence classifier to predict the labels from the data Testing 1. Receive a set of testing documents 2. Run sequence model inference to label each token 3. Appropriately output the recognized entities 82

  44. Hidden Markov Models (HMMs) • Generative: find parameters to maximize P(X, Y) • Assumes features are independent • When labeling X_i, future observations are taken into account (forward-backward) Slides by Jenny Finkel

  45. What is an HMM? • Graphical Model Representation: Variables by time • Circles indicate states • Arrows indicate probabilistic dependencies between states 84

  46. What is an HMM? • Green circles are hidden states • Dependent only on the previous state: Markov process • “The past is independent of the future given the present.” 85

  47. What is an HMM? • Purple nodes are observed states • Dependent only on their corresponding hidden state 86

  48. HMM Formalism • [Diagram: a chain of hidden states S emitting observations K] • {S, K, Π, A, B} • S: {s_1 … s_N} are the values for the hidden states • K: {k_1 … k_M} are the values for the observations 87

  49. HMM Formalism • [Diagram: hidden-state chain with transition probabilities A and emission probabilities B] • {S, K, Π, A, B} • Π = {π_i} are the initial state probabilities • A = {a_ij} are the state transition probabilities • B = {b_ik} are the observation state probabilities 88

  50. Inference for an HMM • Compute the probability of a given observation sequence • Given an observation sequence, compute the most likely hidden state sequence • Given an observation sequence and set of possible models, which model most closely fits the data? 89

  51. Sequence Probability • [Trellis diagram over observations o_1 … o_T] • Given an observation sequence O = (o_1, …, o_T) and a model μ = (A, B, Π), compute the probability of the observation sequence: P(O | μ) 90

  52. Sequence Probability • [Trellis diagram: hidden states x_1 … x_T over observations o_1 … o_T]
      $P(O \mid X, \mu) = b_{x_1 o_1} b_{x_2 o_2} \cdots b_{x_T o_T}$
      $P(X \mid \mu) = \pi_{x_1} a_{x_1 x_2} a_{x_2 x_3} \cdots a_{x_{T-1} x_T}$
      $P(O, X \mid \mu) = P(O \mid X, \mu)\, P(X \mid \mu)$
      $P(O \mid \mu) = \sum_X P(O \mid X, \mu)\, P(X \mid \mu)$
      91
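The last line sums over exponentially many state sequences X, but the forward algorithm computes it in O(T·N²). A minimal NumPy sketch using the slide's notation (π, a_ij, b_ik; the example numbers are made up):

```python
import numpy as np

def forward(pi, A, B, obs):
    """P(O | mu) for mu = (A, B, pi): pi[i] initial probs, A[i, j] = a_ij
    transition probs, B[i, k] = b_ik observation probs, obs = [o_1, ..., o_T]
    as indices. Sums P(O, X | mu) over all hidden paths X."""
    alpha = pi * B[:, obs[0]]              # alpha_1(i) = pi_i * b_{i o_1}
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # alpha_{t+1}(j) = (sum_i alpha_t(i) a_ij) b_{j o}
    return alpha.sum()

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
print(forward(pi, A, B, [0, 1, 1]))        # P(o_1=0, o_2=1, o_3=1 | mu)
```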

  53. MaxEnt Markov Models (MEMMs) • Discriminative • Find parameters to maximize P(Y|X) • No longer assume that features are independent • Do not take future observations into account (no forward-backward) Slides by Jenny Finkel

  54. MEMM inference in systems • For a MEMM, the classifier makes a single decision at a time, conditioned on evidence from observations and previous decisions • A larger space of sequences is usually explored via search • Example, decision point "22.6" in "The Dow fell 22.6 %":
      Local context:   -3    -2    -1     0      +1
      word:            The   Dow   fell   22.6   %
      tag:             DT    NNP   VBD    ???    ???
      Features: W_0 = 22.6, W_{+1} = %, W_{−1} = fell, T_{−1} = VBD, T_{−1}-T_{−2} = NNP-VBD, hasDigit? = true
      (Ratnaparkhi 1996; Toutanova et al. 2003, etc.)

  55. Conditional Random Fields (CRFs) • Discriminative • Doesn't assume that features are independent • When labeling Y_i, future observations are taken into account → the best of both worlds! Slides by Jenny Finkel

  56. CRFs [Lafferty, McCallum, and Pereira 2001] • A whole-sequence conditional model rather than a chaining of local models:
      $P(c \mid d, \lambda) = \frac{\exp \sum_i \lambda_i f_i(c, d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c', d)}$
      • The space of c's is now the space of sequences • But if the features f_i remain local, the conditional sequence likelihood can be calculated exactly using dynamic programming • Training is slower, but CRFs avoid causal-competition biases
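Because the features are local, the denominator (the sum over all tag sequences c′) can be computed by a forward pass. A minimal sketch, assuming NumPy/SciPy and pre-computed score matrices; `emit` and `trans` are hypothetical names standing in for the weighted feature sums a real CRF computes from λ and f_i:

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_prob(emit, trans, tags):
    """log P(c | d, lambda) for a linear-chain CRF.
    emit[t, y] = sum_i lambda_i f_i for tag y at position t (given d);
    trans[y, y'] = score of the tag bigram y -> y'."""
    T, _ = emit.shape
    score = emit[0, tags[0]] + sum(              # numerator: one tag sequence c
        trans[tags[t - 1], tags[t]] + emit[t, tags[t]] for t in range(1, T))
    log_alpha = emit[0]                          # denominator: all sequences c',
    for t in range(1, T):                        # summed exactly by a forward pass
        log_alpha = logsumexp(log_alpha[:, None] + trans, axis=0) + emit[t]
    return score - logsumexp(log_alpha)
```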

  57. Model Trade-offs
      Model   Speed        Discriminative vs. Generative   Normalization
      HMM     very fast    generative                      local
      MEMM    mid-range    discriminative                  local
      CRF     kinda slow   discriminative                  global
      Slides by Jenny Finkel

  58. Greedy Inference • Greedy inference: we just start at the left, and use our classifier at each position to assign a label • The classifier can depend on previous labeling decisions as well as observed data • Advantages: • fast, no extra memory requirements • very easy to implement • with rich features including observations to the right, it may perform quite well • Disadvantage: • greedy: we make commitments we cannot recover from Slides by Chris Manning
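In code, greedy inference is just a left-to-right loop; a sketch with a hypothetical local-classifier interface:

```python
def greedy_decode(tokens, classify):
    """Left-to-right greedy inference. `classify(tokens, i, prev_label)` is any
    local classifier (hypothetical interface) that may look at the observed
    data and the previous labeling decision."""
    labels, prev = [], None
    for i in range(len(tokens)):
        prev = classify(tokens, i, prev)   # commit; no way to undo this later
        labels.append(prev)
    return labels
```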

  59. Viterbi Inference • Viterbi inference: dynamic programming or memoization • Requires a small window of state influence (e.g., past two states are relevant) • Advantage: • exact: the global best sequence is returned • Disadvantage: • harder to implement long-distance state-state interactions (but beam inference tends not to allow long-distance resurrection of sequences anyway) Slides by Chris Manning
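A NumPy sketch of first-order Viterbi over pre-computed local scores (`emit` and `trans` are illustrative names, matching the CRF sketch above):

```python
import numpy as np

def viterbi(emit, trans):
    """Exact best-sequence inference by dynamic programming (first-order:
    only the previous state matters). emit[t, y]: local score of tag y at
    position t; trans[y, y']: score of the transition y -> y'."""
    T, Y = emit.shape
    delta = emit[0].copy()                  # best score of any path ending in y
    back = np.zeros((T, Y), dtype=int)      # backpointers
    for t in range(1, T):
        cand = delta[:, None] + trans       # cand[y, y'] = best-so-far via y
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + emit[t]
    path = [int(delta.argmax())]            # recover the global best sequence
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```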

  60. Beam Inference • Beam inference: at each position keep the top k complete sequences • Extend each sequence in each local way • The extensions compete for the k slots at the next position • Advantages: • fast; beam sizes of 3–5 are almost as good as exact inference in many cases • easy to implement (no dynamic programming required) • Disadvantage: • inexact: the globally best sequence can fall off the beam Slides by Chris Manning
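A sketch of beam inference with the same scoring interface; the only change from the Viterbi sketch is keeping k whole sequences instead of one best score per state:

```python
def beam_decode(emit, trans, k=3):
    """Beam inference over the same scores as the Viterbi sketch, but with
    plain lists: keep the top-k complete sequences at each position."""
    T, Y = len(emit), len(emit[0])
    beam = sorted(((emit[0][y], [y]) for y in range(Y)), reverse=True)[:k]
    for t in range(1, T):
        extensions = [(score + trans[seq[-1]][y] + emit[t][y], seq + [y])
                      for score, seq in beam for y in range(Y)]
        beam = sorted(extensions, reverse=True)[:k]   # compete for the k slots
    return beam[0][1]                                 # can be globally suboptimal
```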

  61. Stanford CRF System: A Case Study https://nlp.stanford.edu/software/CRF-NER.shtml
