Name Categories
• MUC started with 3 name categories: person, organization, location
• QA and some IE required much finer categories
  – Led to sets with 100–200 name categories
  – Hierarchical categories
Excerpt from a Detailed Name Ontology (Sekine 2008)
• Organization
• Location
• Facility
• Product
  – Product_Other, Material, Clothing, Money, Drug, Weapon, Stock, Award, Decoration, Offense, Service, Class, Character, ID_Number
  – Vehicle: Vehicle_Other, Car, Train, Aircraft, Spaceship, Ship
  – Food: Food_Other, Dish
  – Art: Art_Other, Picture, Broadcast_Program, Movie, Show, Music, Book
  – Printing: Printing_Other, Newspaper, Magazine
  – Doctrine_Method: Doctrine_Method_Other, Culture, Religion, Academic, Style, Movement, Theory, Plan
  – Rule: Rule_Other, Treaty, Law
  – Title: Title_Other, Position_Vocation
  – Language: Language_Other, National_Language
  – Unit: Unit_Other, Currency
  – …
Systematic Name Polysemy
• Some names have multiple senses
  – Spain
    • Spain is south of France [geographic region]
    • Spain signed a treaty with France [the government]
    • Spain drinks lots of wine [the people]
  – McDonalds
    • McDonalds sold 3 billion Happy Meals [the organization]
    • I'll meet you in front of McDonalds [the location]
• Designate a primary sense for each systematically polysemous name type
• ACE introduced "GPE" (geo-political entity) for regions with governments, in recognition of this most common polysemy
Approaches to NER
• Hand-coded rules
• Supervised models
• Semi-supervised models
• Active learning
Hand-Coded Rules for NER
For people:
• title (capitalized-token)+
  – where title = "Mr." | "Mrs." | "Ms." | …
• capitalized-token initial capitalized-token
• common-first-name capitalized-token
  – American first names available from the census
• capitalized-token capitalized-token , 1-or-2-digit-number ,
For organizations:
• (capitalized-token)+ corporate-suffix
  – where corporate-suffix = "Co." | "Ltd." | …
For locations:
• capitalized-token , country
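A minimal sketch of how such rules might be coded as regular expressions in Python; the title list, first-name list, and test sentence are illustrative assumptions, not the actual rule set.

```python
import re

# Illustrative stand-ins for the rule vocabulary (not the full census lists).
TITLES = r"(?:Mr\.|Mrs\.|Ms\.|Dr\.)"
FIRST_NAMES = r"(?:John|Mary|Fred|Maria)"
CAP = r"[A-Z][a-z]+"

PERSON_PATTERNS = [
    re.compile(rf"{TITLES}(?: {CAP})+"),        # title (capitalized-token)+
    re.compile(rf"{CAP} [A-Z]\. {CAP}"),        # capitalized-token initial capitalized-token
    re.compile(rf"{FIRST_NAMES} {CAP}"),        # common-first-name capitalized-token
    re.compile(rf"{CAP} {CAP}, \d{{1,2}},"),    # "Jane Doe, 34," (age apposition)
]

def find_persons(text):
    """Return all character spans matched by any hand-coded person rule."""
    spans = []
    for pattern in PERSON_PATTERNS:
        spans.extend(m.span() for m in pattern.finditer(text))
    return spans

print(find_persons("Mr. Fred Smith met Jane Doe, 34, in Madrid."))
```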
Burden of hand-crafted rules
• Writing a few rules is easy
• Writing lots of rules … capturing all the indicative contexts … is hard
  – ____ died
  – ____ was founded
• At some point additional rules may hurt performance
  – Need an annotated "development test" corpus to check progress
• Once we have an annotated corpus, can we use it to automatically train an NER … a supervised model?
BIO Tags
• How can we formulate NER as a standard ML problem?
• Use BIO tags to convert NER into a sequence tagging problem, which assigns a tag to each token:
  – For each NE category c_i, introduce tags B-c_i [beginning of name] and I-c_i [interior of name]
  – Add in category O [other]
  – For example, with categories per, org, and loc, we would have 7 tags: B-per, I-per, B-org, I-org, B-loc, I-loc, and O
  – Require that I-c_i be preceded by B-c_i or I-c_i

  Fred    lives   in    New     York
  B-per   O       O     B-loc   I-loc
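A small illustration of the BIO conversion, assuming names are given as token-index spans (a representation chosen here for illustration):

```python
# Convert annotated name spans to per-token BIO tags.
def spans_to_bio(tokens, spans):
    """spans: list of (start_token, end_token_exclusive, category)."""
    tags = ["O"] * len(tokens)
    for start, end, cat in spans:
        tags[start] = f"B-{cat}"
        for i in range(start + 1, end):
            tags[i] = f"I-{cat}"
    return tags

tokens = ["Fred", "lives", "in", "New", "York"]
spans = [(0, 1, "per"), (3, 5, "loc")]
print(list(zip(tokens, spans_to_bio(tokens, spans))))
# [('Fred', 'B-per'), ('lives', 'O'), ('in', 'O'), ('New', 'B-loc'), ('York', 'I-loc')]
```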
Using a Sequence Model
• Construct a network with one state for each tag
  – 2n+1 states for n categories, plus a start state
• Train model parameters using an annotated corpus
  – HMM or MEMM model
• Apply the trained model to new text
  – Find the most likely path through the network (Viterbi)
  – Assign tags to tokens corresponding to the states in the path
  – Convert BIO tags to names
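A compact sketch of Viterbi decoding over the tag network; the transition and emission scores are placeholders that a trained HMM or MEMM would supply:

```python
import math

def viterbi(tokens, tags, log_trans, log_emit, start="START"):
    """Return the most likely tag sequence.
    log_trans[prev][tag] and log_emit(tag, word) are assumed log-probabilities."""
    # Column for the first token: transition from START plus emission.
    best = [{t: (log_trans[start].get(t, -math.inf) + log_emit(t, tokens[0]), [t])
             for t in tags}]
    for word in tokens[1:]:
        column = {}
        for t in tags:
            # Best predecessor for tag t at this position.
            score, path = max(
                (prev_score + log_trans[p].get(t, -math.inf) + log_emit(t, word),
                 prev_path + [t])
                for p, (prev_score, prev_path) in best[-1].items())
            column[t] = (score, path)
        best.append(column)
    return max(best[-1].values())[1]   # path of the highest-scoring final state
```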
A Minimal State Diagram for NER
[State diagram: START, O, B-PER, I-PER, B-ORG, I-ORG]
Only two name classes; assumes two names are separated by at least one "O" token.
Using a MEMM for NER
• Simplest MEMM …
  – P(s_i | s_(i−1), w_i)
  – Have prior state, current word, (current word & prior state) as features
• Getting some context
  – Add prior word (w_(i−1)) as a feature
  – Add next word (w_(i+1)) as a feature
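A sketch of the MEMM feature set described above, assuming left-to-right decoding so that the previous tag is available:

```python
# Features for token i: prior tag, current word, their conjunction, plus context words.
def memm_features(tokens, i, prev_tag):
    word = tokens[i]
    feats = {
        f"prev_tag={prev_tag}": 1,
        f"word={word}": 1,
        f"word+prev_tag={word}|{prev_tag}": 1,
    }
    if i > 0:
        feats[f"prev_word={tokens[i-1]}"] = 1
    if i + 1 < len(tokens):
        feats[f"next_word={tokens[i+1]}"] = 1
    return feats

print(memm_features(["Fred", "lives", "in", "New", "York"], 3, "O"))
```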
Adding States for Context
• If we are using an HMM, we can get context through pre-person and post-person states
• Changing
    B-PER  I-PER
  to
    pre-PER  B-PER  I-PER  post-PER
Adding States for Name Structure
• Changing
    B-PER  I-PER
  to
    B-PER  M-PER  E-PER
  improves performance by capturing more details of name structure
• Different languages have different name structure, best recognized by language-specific states
Putting them together
[State diagram combining context and structure states: pre-PER, B-PER, M-PER / I-PER, E-PER, post-PER]
More Local Features
• Lexical features
  – Whether the current word (prior word, following word) has a specific value
• Dictionary features
  – Whether the current word is in a particular dictionary
  – Full name dictionaries
    • For major organizations, countries, and cities
  – Name component dictionaries
    • Common first names
• Word clusters
  – Whether the current word belongs to a corpus-derived word cluster
• Shape features
  – Capitalized, all caps, numeric, 2-digit numeric, …
• Part-of-speech features
• Hand-coded NER rules as features
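A sketch of a few of these local features; the dictionaries and cluster IDs are tiny illustrative stand-ins for real resources:

```python
# Illustrative resources (assumptions, not real gazetteers or cluster files).
CITY_DICT = {"madrid", "new york", "tarragona"}
FIRST_NAMES = {"fred", "mary", "alfonso"}
word_clusters = {"fred": "0110", "madrid": "1011"}   # hypothetical cluster IDs

def shape(word):
    """Shape feature: capitalization / digit pattern of the token."""
    if word.isdigit():
        return "2-digit-numeric" if len(word) == 2 else "numeric"
    if word.isupper():
        return "all-caps"
    if word[0].isupper():
        return "capitalized"
    return "lower"

def local_features(tokens, i):
    word = tokens[i]
    low = word.lower()
    feats = {f"word={low}": 1, f"shape={shape(word)}": 1}
    if low in CITY_DICT:
        feats["in_city_dict"] = 1
    if low in FIRST_NAMES:
        feats["in_first_name_dict"] = 1
    if low in word_clusters:
        feats[f"cluster={word_clusters[low]}"] = 1
    return feats

print(local_features(["Fred", "lives", "in", "Madrid"], 0))
```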
Long-range features [1]
• Most names represent the same name type (person / org / location) wherever they appear
  – Particularly within a single document
  – But in most cases across documents as well
• Some contexts will provide a clear indication of the name type, while others will be ambiguous
  – We would like to use the unambiguous contexts to resolve the ambiguity across the document or the corpus
• Ex:
  – On vacation, Fred visited Gilbert Park.
  – Mr. Park was an old friend from college.
Long-range features [2]
• We can capture this information with a two-pass strategy …
  – On the first pass, build a table ("name cache") which records each name and the type it is assigned
    • Possibly record only confident assignments
  – On the second pass, incorporate a feature reflecting the dominant name type from the first pass
• This can be done across an individual document or a large corpus [Borthwick 1999]
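A sketch of the two-pass name cache, assuming the first pass emits (name, type, confidence) triples:

```python
from collections import Counter, defaultdict

def build_name_cache(first_pass_output, min_conf=0.9):
    """First pass: record the dominant type assigned to each name."""
    votes = defaultdict(Counter)
    for name, name_type, conf in first_pass_output:
        if conf >= min_conf:          # possibly record only confident assignments
            votes[name][name_type] += 1
    return {name: counts.most_common(1)[0][0] for name, counts in votes.items()}

def cache_feature(name, cache):
    """Second pass: feature reflecting the dominant first-pass type."""
    return {f"cache_type={cache.get(name, 'UNKNOWN')}": 1}

cache = build_name_cache([("Gilbert Park", "loc", 0.95), ("Park", "per", 0.92)])
print(cache_feature("Gilbert Park", cache))
```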
Semi-supervised NER
• Annotating a large corpus to train a high-performance NER is fairly expensive
• We can use the same idea (of name consistency across documents) to train an NER using
  – A smaller annotated corpus
  – A large unannotated corpus
Co-training for NER
• We can split the features for NER into two sets:
  – Spelling features (the entire name + tokens in the name)
  – Context features (left and right contexts + syntactic context)
• Start with a seed
  – E.g., some common unambiguous full names
• Iteratively grow the seed, alternately applying the spelling and context models and adding the most confidently labeled instances to the seed
Co-training for NER
[Bootstrapping cycle: seed → build spelling model → apply spelling model to the candidate set → add most confident examples to seed → build context model → apply context model to the candidate set → add most confident examples to seed → repeat]
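A sketch of this bootstrapping loop; the two model trainers, the candidate list, and the predict_proba interface are placeholders for real components:

```python
def cotrain(candidates, seed_labels, train_spelling_model, train_context_model,
            rounds=10, per_round=50):
    """candidates: list of (spelling_features, context_features) per name mention.
    seed_labels: dict mapping candidate index -> label."""
    labels = dict(seed_labels)
    trainers = [train_spelling_model, train_context_model]
    for r in range(rounds):
        view = r % 2                              # alternate views: 0 = spelling, 1 = context
        model = trainers[view](
            [(candidates[i][view], lab) for i, lab in labels.items()])
        # Score the still-unlabeled candidates with the current view's model.
        scored = [(model.predict_proba(c[view]), i)
                  for i, c in enumerate(candidates) if i not in labels]
        # Keep the most confidently labeled instances and add them to the seed.
        scored.sort(key=lambda pair: max(pair[0].values()), reverse=True)
        for probs, i in scored[:per_round]:
            labels[i] = max(probs, key=probs.get)
    return labels
```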
Name co-training: results
• 3 classes: person, organization, location (and "other")
• Data: 1M sentences of news
• Seed:
  – New York, California, U.S. → location
  – contains(Mr.) → person
  – Microsoft, IBM → organization
  – contains(Incorporated) → organization
• Took names appearing with an appositive modifier or as the complement of a preposition (88K name instances)
• Accuracy: 83%
• Clean accuracy (ignoring names not in one of the 3 categories): 91%
• (Collins and Singer 1999)
Semi-supervised NER: when to stop
• Semi-supervised NER labels a few more examples at every iteration
  – It stops when it runs out of examples to label
• This is fine if
  – Names are easily identified (e.g., by capitalization in English)
  – Most names fall into one of the categories being trained (e.g., people, organizations, and locations for news stories)

Semi-supervised NER: semantic drift
• Semi-supervised NER doesn't work so well if
  – The set of names is hard to identify
    • Monocase languages
    • Extended name sets including lower-case terms
  – The categories being trained cover only a small portion of the set of names
• The result is semantic drift and semantic spread
  – The name categories gradually grow to include related terms

Fighting Semantic Drift
• We can fight drift by training a larger, more inclusive set of categories
  – Including "negative" categories
    • Categories we don't really care about but include to compete with the original categories
  – These negative categories can be built
    • By hand (Yangarber et al. 2003)
    • Or automatically (McIntosh 2010)
Active Learning
• For supervised learning, we typically annotate text data sequentially
• Not necessarily the most efficient approach
• Most natural language phenomena have a Zipfian distribution … a few very common constructs and lots of infrequent constructs
• After you have annotated "Spain" 50 times as a location, the NER model is little improved by annotating it one more time
• We want to select the most informative examples and present them to the annotator
  – The data which, if labeled, is most likely to reduce NER error
How to select informative examples?
• Uncertainty-based sampling
  – For a binary classifier
    • For MaxEnt, probability near 50%
    • For SVM, data near the separating hyperplane
  – For an n-ary classifier, data with a small margin
• Committee-based sampling
  – Data on which committee members disagree
  – (co-testing … use two classifiers based on independent views)
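A sketch of margin-based uncertainty sampling, assuming a classifier that returns a probability distribution over labels:

```python
def margin(probs):
    """Difference between the two most probable labels; a small margin = uncertain."""
    top_two = sorted(probs.values(), reverse=True)[:2]
    return top_two[0] - top_two[1] if len(top_two) == 2 else 1.0

def select_most_informative(examples, classifier, n=10):
    """Return the n unlabeled examples with the smallest margin."""
    scored = [(margin(classifier.predict_proba(x)), x) for x in examples]
    scored.sort(key=lambda pair: pair[0])
    return [x for _, x in scored[:n]]
```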
Representativeness
• It's more helpful to annotate examples involving common features
• Weighting these features correctly will have a larger impact on error rate
• So we rank examples by the frequency of their features in the entire corpus
Batching and Diversity
• Each iteration of active learning involves running the classifier on a (large) unlabeled corpus
  – This can be quite slow
  – Meanwhile the annotator is waiting for something to annotate
• So we run active learning in batches
  – Select the best n examples to annotate each time
  – But all items in a batch are selected using the same criteria and the same system state, and so are likely to be similar
• To avoid example overlap, we impose a diversity requirement within a batch: limit the maximum similarity of examples within a batch
  – Compute similarity based on example feature vectors
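A sketch of diversity-constrained batch selection using cosine similarity over feature dicts (a representation assumed here for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse feature vectors represented as dicts."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def diverse_batch(ranked_examples, batch_size, max_sim=0.8):
    """ranked_examples: feature dicts already sorted by informativeness.
    Skip any candidate too similar to one already chosen for the batch."""
    batch = []
    for vec in ranked_examples:
        if all(cosine(vec, chosen) <= max_sim for chosen in batch):
            batch.append(vec)
        if len(batch) == batch_size:
            break
    return batch
```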
Simulated Active Learning
• True active learning experiments are
  – Hard to reproduce
  – Very time consuming
• So most experiments involve simulated active learning:
  – "Unlabeled" data has really been labeled, but the labels have been hidden
  – When data is selected, the labels are revealed
  – Disadvantage: the "unlabeled" data can't be so big
• This leads us to ignore lots of issues of true active learning:
  – An annotation unit of one sentence or even one token may not be efficient for manual annotation
  – So reported speed-ups may be optimistic (typical reports reduce by half the amount of data needed to achieve a given NER accuracy)
Evaluating NER
• Systems are evaluated using an annotated test corpus
  – Ideally dual annotated and adjudicated
• Name tags in system output are classified as correct, spurious, or missing:

                Cervantes   wrote   Don Quixote   in   Tarragona.
  System:       person              person
  Reference:    person                                 location
                correct             spurious           missing
Metrics
• Systems are measured in terms of:

  recall = correct / (correct + missing)

  precision = correct / (correct + spurious)

  F = 2 × recall × precision / (recall + precision)
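A sketch of the scoring computation, representing each name as a (start, end, type) tuple:

```python
def score_names(system, reference):
    """Count correct / spurious / missing names and return precision, recall, F."""
    system, reference = set(system), set(reference)
    correct = len(system & reference)
    spurious = len(system - reference)
    missing = len(reference - system)
    recall = correct / (correct + missing) if correct + missing else 0.0
    precision = correct / (correct + spurious) if correct + spurious else 0.0
    f = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return precision, recall, f

system = [(0, 1, "person"), (2, 4, "person")]
reference = [(0, 1, "person"), (5, 6, "location")]
print(score_names(system, reference))   # (0.5, 0.5, 0.5)
```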
Typical Performance
• News corpora
  – Training and test from same source
• 3 categories: person, organization, location
• Based on CoNLL 2002 and 2003 multi-lingual, multi-site evaluations
  – English F = 89
  – Spanish F = 81
  – Dutch F = 77
  – German F = 72
Limitations
• Cited performance is for well-matched training and test
  – Same domain
  – Same source
  – Same epoch
• Performance deteriorates rapidly if less matched
  – NER trained on Reuters (F=91), tested on Wall Street Journal (F=64) [Ciaramita and Altun 2003]
  – Work on NER adaptation is vital
• Adding rarer classes to NER is difficult
  – Supervised learning inefficient
  – Semi-supervised learning is subject to semantic drift
Course Outline
• Machine learning preliminaries
• Name extraction
• Entity extraction
• Relation extraction
• Event extraction
• Other domains

Names, mentions, and entities
• Information extraction gathers information about discrete entities such as people, organizations, vehicles, books, cats, etc.
• Texts contain mentions of these entities; these mentions may take the form of
  – Names ("Sarkozy")
  – Noun phrases headed by nouns ("the president")
  – Pronouns ("he")

Reference and co-reference
• Data base entries filled with nouns or pronouns are not very useful …
  – At a minimum, entries should be names
• But even names may be ambiguous
  – So we may want to create a data base of entities with unique IDs
  – And express relations and events in terms of these IDs

In-document coreference
• The first step is in-document coreference – linking all mentions in a document which refer to the same entity
• If one of these mentions is a name, this allows us to use the name in the extracted relations
• Coreference has been extensively studied independently of IE
  – Typically by constructing statistical models of the likelihood that a pair of mentions are coreferential
  – We will not review these models here

Cross-document [co]reference
• Cross-document coreference links together the entities mentioned by individual documents
  – Generally limited to entities which are named in both documents
• Entity linking links an entity named in one document to an entity in a data base

Cross-document [co]reference
• Studied mainly in an IE setting
  – ACE 2008
  – KBP 2009–2010–2011
  – WePS
• Involves modeling
  – Possible spelling / name variation
    • William Jefferson Clinton ↔ Bill Clinton
    • Osama bin Laden ↔ Usama bin Laden
  – Probable coreference based on
    • Shared / conflicting attributes
    • Co-occurring terms / names

Course Outline
• Machine learning preliminaries
• Name extraction
• Entity extraction
• Relation extraction
• Event extraction
• Other domains

Relation
• A relation is a predication about a pair of entities:
  – Rodrigo works for UNED.
  – Alfonso lives in Tarragona.
  – Otto's father is Ferdinand.
• Typically they represent information which is permanent or of extended duration.

History of relations
• Relations were introduced in MUC-7 (1997)
  – 3 relations
• Extensively studied in ACE (2000–2007)
  – Lots of training data
• Effectively included in KBP
ACE Relations
• Several revisions of the relation definitions
  – With the goal of having a set of relations which can be more consistently annotated
• 5–7 major types, 19–24 subtypes
• Both entities must be mentioned in the same sentence
  – Do not get a parent–child relation from
    • Ferdinand and Isabella were married in 1481. A son was born in 1485.
  – Or an employee relation for
    • Bank Santander replaced several executives. Alfonso was named an executive vice president.
• Base for extensive research
  – On supervised and semi-supervised methods
2004 ACE Relation Types
• Physical: Located, Near, Part-whole
• Personal-social: Business, Family, Other
• Employment / Membership / Subsidiary: Employ-executive, Employ-staff, Employ-undetermined, Member-of-group, Partner, Subsidiary, Other
• Agent-artifact: User-or-owner, Inventor-or-manufacturer, Other
• Person-org affiliation: Ethnic, Ideology, Other
• GPE affiliation: Citizen-or-resident, Based-in, Other
• Discourse: –

KBP Slots
• Many KBP slots represent relations between entities:
  – Member_of
  – Employee_of
  – Country_of_birth
  – Countries_of_residence
  – Schools_attended
  – Spouse
  – Parents
  – Children
  – …
• Entities do not need to appear in the same sentence
• More limited training data
  – Encouraged semi-supervised methods

Characteristics
• Relations appear in a wide range of forms:
  – Embedded constructs (one argument contains the other)
    • Within a single noun group: John's wife
    • Linked by a preposition: the president of Apple
  – Formulaic constructs
    • Tarragona, Spain
    • Walter Cronkite, CBS News, New York
  – Longer-range ("predicate-linked") constructs
    • With a predicate disjoint from the arguments
      – Fred lived in New York
      – Fred and Mary got married

Hand-crafted patterns
• Most instances of relations can be identified by the types of the entities and the words between the entities
  – But not all: Fred and Mary got married.
• So we can start by listing word sequences:
  – person lives in location
  – person lived in location
  – person resides in location
  – person owns a house in location
  – …

Generalizing patterns
• We can get better coverage through syntactic generalization:
  – Specifying base forms
    • person <v base=reside> in location
  – Specifying chunks
    • person <vgroup base=reside> in location
  – Specifying optional elements
    • person <vgroup base=reside> [<pp>] in location
Dependency paths
• Generalization can also be achieved by using paths in labeled dependency trees:

    person ← subject⁻¹ ← reside → in → location

  [Dependency tree for "Fred has resided in Madrid for three years": resided with dependents Fred (subject), has, in → Madrid, for → years → three]
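A sketch of extracting a labeled dependency path between two tokens; the hand-built parse (token, head index, relation) stands in for real parser output, and the labels are illustrative:

```python
parse = [("Fred", 2, "subject"), ("has", 2, "aux"), ("resided", -1, "root"),
         ("in", 2, "prep"), ("Madrid", 3, "pobj")]

def path_to_root(i):
    """Indices from token i up to the root of the dependency tree."""
    path = [i]
    while parse[i][1] != -1:
        i = parse[i][1]
        path.append(i)
    return path

def dependency_path(a, b):
    """Path from token a up to the lowest common ancestor and down to token b."""
    up, down = path_to_root(a), path_to_root(b)
    common = next(i for i in up if i in down)
    left = [f"{parse[i][2]}^-1" for i in up[:up.index(common)]]       # inverted edges going up
    right = [parse[i][2] for i in reversed(down[:down.index(common)])]
    return left + [parse[common][0]] + right

print(dependency_path(0, 4))   # ['subject^-1', 'resided', 'prep', 'pobj']
```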
Pattern Redundancy
• Using a combination of sequential patterns and dependency patterns may provide extra robustness
• Dependency patterns can handle more syntactic variation but are more subject to analysis errors:
  "Carlos resided with his three cats in Madrid."
  [Dependency tree: resided with dependents Carlos, with → cats (his, three), in → Madrid]
Supervised learning
• Collect training data
  – Annotate a corpus with entities and relations
  – For every pair of entities in a sentence
    • If linked by a relation, treat as a positive training instance
    • If not linked, treat as a negative training instance
• Train model
  – For n relation types, either
    • Binary (identification) model + n-way classifier model, or
    • Unified (n+1)-way classifier
• On test data
  – Apply the entity classifier
  – Apply the relation classifier to every pair of entities in the same sentence
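A sketch of generating positive and negative relation training instances from one annotated sentence:

```python
from itertools import combinations

def relation_instances(entities, relations):
    """entities: list of entity ids in one sentence.
    relations: dict mapping (e1, e2) -> relation type.
    Every pair not linked by a relation becomes a 'no_relation' negative."""
    instances = []
    for e1, e2 in combinations(entities, 2):
        label = relations.get((e1, e2)) or relations.get((e2, e1)) or "no_relation"
        instances.append(((e1, e2), label))
    return instances

print(relation_instances(["Rodrigo", "UNED", "Tarragona"],
                         {("Rodrigo", "UNED"): "works_for"}))
```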
Supervised relation learner: features
• Heads of entities
• Types of entities
• Distance between entities
• Containment relations
• Word sequence between entities
• Individual words between entities
• Dependency path
• Individual words on dependency path

Kernel Methods
• Goal is to find training examples similar to the test case
  – Similarity of word sequence or tree structure
  – Determining similarity through features is awkward
  – Better to define a similarity measure directly: a kernel function
• Kernels can be used directly by
  – SVMs
  – Memory-based learners (k-nearest-neighbor)
• Kernels defined over
  – Sequences
  – Parse or dependency trees

Tree Kernels
• Tree kernels differ in
  – Type of tree
    • Partial parse
    • Parse
    • Dependency
  – Tree spans compared
    • Shortest-path-enclosed tree
    • Conditionally larger context
  – Flexibility of match
Shortest-path-enclosed Tree
[Parse tree diagram: the shortest-path-enclosed tree is the portion of the parse tree spanning the two arguments A1 and A2]
• For predicate-linked relations, must extend the shortest-path-enclosed tree to include the predicate
Composite Kernels
• Can combine different levels of representation
• A composite kernel can combine sequence and tree kernels

Semi-supervised methods
• Preparing training data is more costly than for names
  – Must annotate entities and relations
• So there is a strong motivation to minimize training data through semi-supervised methods
• As for names, we will adopt a co-training approach:
  – Feature set 1: the two entities
  – Feature set 2: the contexts between the entities
• We will limit the bootstrapping
  – To a specific pair of entity types
  – And to instances where both entities are named

Semi-supervised learning
• Seed:
  – [Moby Dick, Herman Melville]
• Contexts for seed:
  – … wrote …
  – … is the author of …
• Other pairs appearing in these contexts
  – [Animal Farm, George Orwell]
  – [Don Quixote, Miguel de Cervantes]
• Additional contexts …
Co-training for relations
[Bootstrapping cycle: seed → find occurrences of seed tuples → tag entities → generate extraction patterns → generate new seed tuples → back to seed]
Ranking contexts
• If relation R is functional, and [X, Y] is a seed, then [X, Y′], Y′ ≠ Y, is a negative example
• Confidence of pattern P:

    Conf(P) = P.positive / (P.positive + P.negative)

  where
    P.positive = number of positive matches to pattern P
    P.negative = number of negative matches to pattern P
Ranking pairs
• Once a confidence has been assigned to each pattern, we can assign a confidence to each new pair based on the patterns in which it appears
  – Confidence of best pattern
  – Combination assuming patterns are independent:

    Conf(X, Y) = 1 − ∏ (1 − Conf(P)),  product over P ∈ contexts_of(X, Y)
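A sketch of the two confidence computations, with hypothetical match counts standing in for corpus statistics:

```python
def pattern_conf(positive, negative):
    """Conf(P) = positive matches / all matches."""
    return positive / (positive + negative) if positive + negative else 0.0

def pair_conf(pattern_confidences):
    """Noisy-or combination assuming the patterns are independent."""
    conf = 1.0
    for c in pattern_confidences:
        conf *= (1.0 - c)
    return 1.0 - conf

# Hypothetical counts: "... wrote ..." matched 9 positive and 1 negative seed pairs.
p_wrote = pattern_conf(9, 1)       # 0.9
p_author = pattern_conf(4, 1)      # 0.8
print(pair_conf([p_wrote, p_author]))   # pair seen with both patterns -> 0.98
```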
Semantic drift
• Ranking / filtering quite effective for functional relations (book → author, company → headquarters)
  – But expansion may occur into other relations generally implied by the seed ("semantic drift")
    • Ex: from governor → state governed to person → state born in
• Precision poor without the functional property

Distant supervision
• Sometimes a large data base is available involving the type of relation to be extracted
  – A number of such public data bases are now available, such as FreeBase and Yago
• Text instances corresponding to some of the data base instances can be found in a large corpus or from the Web
• Together these can be used to train a relation classifier
Distant supervision: approach
• Given:
  – Data base for relation R
  – Corpus containing information about relation R
• Collect <X, Y> pairs from data base relation R
• Collect sentences in the corpus containing both X and Y
  – These are positive training examples
• Collect sentences in the corpus containing X and some Y′ with the same entity type as Y such that <X, Y′> is not in the data base
  – These are negative training examples
• Use these examples to train a classifier which operates on pairs of entities
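A sketch of distant-supervision example generation; the knowledge base and the entity-tagged sentences are toy stand-ins for a real data base and corpus:

```python
kb = {("Herman Melville", "Moby Dick")}          # data base for relation R: author_of

def label_examples(tagged_sentences):
    """tagged_sentences: (sentence, X, Y) triples where X and Y have the entity
    types of the relation arguments."""
    positives, negatives = [], []
    for sentence, x, y in tagged_sentences:
        if (x, y) in kb:
            positives.append((sentence, x, y))   # pair found in the data base
        else:
            negatives.append((sentence, x, y))   # same-typed pair not in the data base
    return positives, negatives

pos, neg = label_examples([
    ("Herman Melville wrote Moby Dick.", "Herman Melville", "Moby Dick"),
    ("Herman Melville read Don Quixote.", "Herman Melville", "Don Quixote"),
])
print(len(pos), len(neg))   # 1 1
```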
Distant supervision: limitations
• The training data produced through distant supervision may be quite noisy:
  – If a pair <X, Y> is involved in multiple relations R<X, Y> and R′<X, Y>, and the data base represents relation R, the text instance may represent relation R′, yielding a false positive training instance
    • If many such <X, Y> pairs are involved, the classifier may learn the wrong relation
  – If a relation is incomplete in the data base … for example, if resides_in<X, Y> contains only a few of the locations where a person has resided … then we will generate many false negatives, possibly leading the classifier to learn no relation at all
Evaluation
• Matching relation has matching relation type and arguments
  – Count correct, missing, and spurious relations
  – Report precision, recall, and F measure
• Variations
  – Perfect mentions vs. system mentions
    • Performance much worse with system mentions – an error in either mention makes the relation incorrect
  – Relation type vs. relation subtype
  – Name pairs vs. all mentions
    • Bootstrapped systems trained on name–name patterns
• Best ACE systems on perfect mentions: F = 75

Course Outline
• Machine learning preliminaries
• Name extraction
• Entity extraction
• Relation extraction
• Event extraction
• Other domains