Name Categories
• MUC started with 3 name categories: person, organization, location
• QA and some IE required much finer categories
  – Led to sets with 100-200 name categories
  – Hierarchical categories
Excerpt from a Detailed Name Ontology (Sekine 2008)
• Organization
• Location
• Facility
• Product
  – Product_Other, Material, Clothing, Money, Drug, Weapon, Stock, Award, Decoration, Offense, Service, Class, Character, ID_Number
  – Vehicle: Vehicle_Other, Car, Train, Aircraft, Spaceship, Ship
  – Food: Food_Other, Dish
  – Art: Art_Other, Picture, Broadcast_Program, Movie, Show, Music, Book
  – Printing: Printing_Other, Newspaper, Magazine
  – Doctrine_Method: Doctrine_Method_Other, Culture, Religion, Academic, Style, Movement, Theory, Plan
  – Rule: Rule_Other, Treaty, Law
  – Title: Title_Other, Position_Vocation
  – Language: Language_Other, National_Language
  – Unit: Unit_Other, Currency
  …
Systematic Name Polysemy
• Some names have multiple senses
  – Spain
    • Spain is south of France [geographic region]
    • Spain signed a treaty with France [the government]
    • Spain drinks lots of wine [the people]
  – McDonalds
    • McDonalds sold 3 billion Happy Meals [the organization]
    • I'll meet you in front of McDonalds [the location]
• Designate a primary sense for each systematically polysemous name type
• ACE introduced "GPE" = geo-political entity for regions with governments, in recognition of this most common polysemy
Approaches to NER
• Hand-coded rules
• Supervised models
• Semi-supervised models
• Active learning
Hand-Coded Rules for NER
For people:
• title (capitalized-token)+
  – where title = "Mr." | "Mrs." | "Ms." | …
• capitalized-token initial capitalized-token
• common-first-name capitalized-token
  – American first names available from census
• capitalized-token capitalized-token , 1-or-2-digit-number ,
For organizations:
• (capitalized-token)+ corporate-suffix
  – where corporate-suffix = "Co." | "Ltd." | …
For locations:
• capitalized-token , country
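As an illustration, a couple of these rules could be written as regular expressions. This is only a sketch: the title list, corporate suffixes, and example sentence are illustrative choices, not the full rule set.

```python
import re

# Illustrative fragments, not the complete rule set from the slide.
TITLES = r"(?:Mr\.|Mrs\.|Ms\.|Dr\.)"
CAP = r"[A-Z][a-z]+"                      # a single capitalized token
CORP_SUFFIX = r"(?:Co\.|Ltd\.|Inc\.|Corp\.)"

PERSON_RULE = re.compile(rf"{TITLES}(?: {CAP})+")      # title (capitalized-token)+
ORG_RULE = re.compile(rf"(?:{CAP} )+{CORP_SUFFIX}")    # (capitalized-token)+ corporate-suffix

text = "Mr. John Smith met executives of Acme Widget Co. yesterday."
print([m.group() for m in PERSON_RULE.finditer(text)])  # ['Mr. John Smith']
print([m.group() for m in ORG_RULE.finditer(text)])     # ['Acme Widget Co.']
```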
Burden of hand-crafted rules
• Writing a few rules is easy
• Writing lots of rules … capturing all the indicative contexts … is hard
  – ____ died
  – ____ was founded
• At some point additional rules may hurt performance
  – Need an annotated 'development test' corpus to check progress
• Once we have an annotated corpus, can we use it to automatically train an NER … a supervised model?
BIO Tags
• How can we formulate NER as a standard ML problem?
• Use BIO tags to convert NER into a sequence tagging problem, which assigns a tag to each token:
  – For each NE category c_i, introduce tags B-c_i [beginning of name] and I-c_i [interior of name]
  – Add in category O [other]
  – For example, with categories per, org, and loc, we would have 7 tags: B-per, I-per, B-org, I-org, B-loc, I-loc, and O
  – Require that I-c_i be preceded by B-c_i or I-c_i
• Example:
  Fred   lives  in  New    York
  B-per  O      O   B-loc  I-loc
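A minimal sketch of the final conversion step, turning a BIO-tagged token sequence back into labeled names (the function and its assumption of well-formed tag sequences are illustrative):

```python
def bio_to_names(tokens, tags):
    """Convert a BIO-tagged token sequence into (category, name) pairs.

    A minimal sketch: assumes tags are well-formed, i.e. every I-x
    follows a B-x or I-x of the same category.
    """
    names, current, category = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                names.append((category, " ".join(current)))
            category, current = tag[2:], [token]
        elif tag.startswith("I-"):
            current.append(token)
        else:                      # an "O" tag ends any open name
            if current:
                names.append((category, " ".join(current)))
            current, category = [], None
    if current:
        names.append((category, " ".join(current)))
    return names

print(bio_to_names(["Fred", "lives", "in", "New", "York"],
                   ["B-per", "O", "O", "B-loc", "I-loc"]))
# [('per', 'Fred'), ('loc', 'New York')]
```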
Using a Sequence Model
• Construct network with one state for each tag
  – 2n+1 states for n categories, plus start state
• Train model parameters using annotated corpus
  – HMM or MEMM model
• Apply trained model to new text
  – Find most likely path through network (Viterbi)
  – Assign tags to tokens corresponding to states in path
  – Convert BIO tags to names
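A compact sketch of the Viterbi step. It assumes transition and emission log-probabilities have already been estimated from the annotated corpus and are supplied as dictionaries; the flat penalty for unseen events stands in for proper smoothing.

```python
def viterbi(words, tags, log_trans, log_emit, start="START"):
    """Find the most likely tag path through the state network.

    A minimal sketch: log_trans[(prev_tag, tag)] and log_emit[(tag, word)]
    are assumed to be log-probabilities estimated from an annotated corpus;
    unseen events get a large penalty instead of proper smoothing.
    """
    NEG = -1e9
    # best[i][t] = score of the best path ending in tag t at position i
    best = [{t: log_trans.get((start, t), NEG) +
                log_emit.get((t, words[0]), NEG) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            scores = {p: best[i - 1][p] + log_trans.get((p, t), NEG) +
                         log_emit.get((t, words[i]), NEG) for p in tags}
            prev = max(scores, key=scores.get)
            best[i][t], back[i][t] = scores[prev], prev
    # trace the best final state back to the start
    tag = max(best[-1], key=best[-1].get)
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = back[i][tag]
        path.append(tag)
    return list(reversed(path))
```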
A Minimal State Diagram for NER
[State diagram: START, O, B-PER, I-PER, B-ORG, I-ORG]
• Only two name classes; assumes two names are separated by at least one 'O' token.
Using a MEMM for NER
• Simplest MEMM …
  – P(s_i | s_{i-1}, w_i)
  – Have prior state, current word, (current word & prior state) as features
• Getting some context
  – Add prior word (w_{i-1}) as feature
  – Add next word (w_{i+1}) as feature
Adding States for Context
• If we are using an HMM, we can get context through pre-person and post-person states
• [Diagram: changing B-PER → I-PER to pre-PER → B-PER → I-PER → post-PER]
Adding States for Name Structure
• Changing B-PER → I-PER to B-PER → M-PER → E-PER improves performance by capturing more details of name structure
• Different languages have different name structure -- best recognized by language-specific states
Putting them together
• [State diagram combining the context states (pre-PER, post-PER) with the name-structure states (B-PER, M-PER, E-PER, I-PER)]
More Local Features
• Lexical features
  – Whether the current word (prior word, following word) has a specific value
• Dictionary features
  – Whether the current word is in a particular dictionary
  – Full name dictionaries
    • For major organizations, countries, and cities
  – Name component dictionaries
    • Common first names
• Word clusters
  – Whether the current word belongs to a corpus-derived word cluster
• Shape features
  – Capitalized, all caps, numeric, 2-digit numeric, …
• Part-of-speech features
• Hand-coded NER rules as features
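A sketch of what such a local feature extractor might look like. The feature names and the first-name dictionary are illustrative; a real system would add POS tags, word clusters, and rule outputs as well.

```python
def local_features(tokens, i, first_names=frozenset()):
    """Local features for token i: lexical, shape, and dictionary features.

    A sketch; feature names and the `first_names` dictionary are illustrative.
    """
    word = tokens[i]
    feats = {
        f"word={word.lower()}": 1,
        f"prev_word={tokens[i-1].lower() if i > 0 else '<s>'}": 1,
        f"next_word={tokens[i+1].lower() if i + 1 < len(tokens) else '</s>'}": 1,
        # shape features
        "is_capitalized": int(word[:1].isupper()),
        "all_caps": int(word.isupper()),
        "is_numeric": int(word.isdigit()),
        "two_digit_number": int(word.isdigit() and len(word) == 2),
        # dictionary feature
        "in_first_name_dict": int(word in first_names),
    }
    return feats

print(local_features(["Mr.", "Fred", "Smith"], 1, first_names={"Fred"}))
```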
Long-range features [1]
• Most names represent the same name type (person / org / location) wherever they appear
  – Particularly within a single document
  – But in most cases across documents as well
• Some contexts will provide a clear indication of the name type, while others will be ambiguous
  – We would like to use the unambiguous contexts to resolve the ambiguity across the document or the corpus
• Ex:
  – On vacation, Fred visited Gilbert Park.
  – Mr. Park was an old friend from college.
Long-range features [2]
• We can capture this information with a two-pass strategy …
  – On the first pass, build a table ("name cache") which records each name and the type it is assigned
    • Possibly record only confident assignments
  – On the second pass, incorporate a feature reflecting the dominant name type from the first pass
• This can be done across an individual document or a large corpus [Borthwick 1999]
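A sketch of the two-pass name cache, under the assumption that the first pass produces (name, type) pairs per document; the confidence threshold is illustrative.

```python
from collections import Counter, defaultdict

def build_name_cache(tagged_docs, min_count=2):
    """First pass: record the type most often assigned to each name.

    A sketch: `tagged_docs` is a list of documents, each a list of
    (name_string, type) pairs from a first tagging pass; `min_count`
    is an illustrative confidence threshold.
    """
    counts = defaultdict(Counter)
    for doc in tagged_docs:
        for name, name_type in doc:
            counts[name][name_type] += 1
    cache = {}
    for name, type_counts in counts.items():
        best_type, n = type_counts.most_common(1)[0]
        if n >= min_count:
            cache[name] = best_type
    return cache

def cache_feature(name, cache):
    """Second pass: a feature reflecting the dominant type from pass one."""
    return {f"cache_type={cache.get(name, 'NONE')}": 1}

cache = build_name_cache([[("Gilbert Park", "loc"), ("Park", "per"), ("Park", "per")]])
print(cache_feature("Park", cache))   # {'cache_type=per': 1}
```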
Semi-supervised NER
• Annotating a large corpus to train a high-performance NER is fairly expensive
• We can use the same idea (of name consistency across documents) to train an NER using
  – A smaller annotated corpus
  – A large unannotated corpus
Co-training for NER
• We can split the features for NER into two sets:
  – Spelling features (the entire name + tokens in the name)
  – Context features (left and right contexts + syntactic context)
• Start with a seed
  – E.g., some common unambiguous full names
• Iteratively grow the seed, alternately applying the spelling and context models and adding the most confidently labeled instances to the seed
Co-training for NER
[Bootstrapping cycle: seed → build spelling model → apply spelling model to labeled set → add most confident examples → build context model → apply context model to labeled set → add most confident examples → repeat]
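A minimal sketch of this bootstrapping loop; the train/predict interfaces are placeholders standing in for the actual spelling and context models, not a particular implementation.

```python
def cotrain_ner(seed_labels, unlabeled, train_spelling, train_context,
                n_iterations=5, n_add=10):
    """Co-training for NER over two feature views (a sketch).

    `seed_labels` maps examples to name types; `unlabeled` is a set of
    unlabeled name occurrences. `train_spelling` and `train_context` are
    assumed to return a classifier with a `predict(example)` method
    yielding (label, confidence) -- placeholders for the two view-specific
    models.
    """
    labeled = dict(seed_labels)
    pool = set(unlabeled)
    for _ in range(n_iterations):
        for train in (train_spelling, train_context):
            model = train(labeled)
            # score remaining examples and keep the most confident ones
            scored = [(model.predict(x), x) for x in pool]
            scored.sort(key=lambda s: s[0][1], reverse=True)
            for (label, _conf), x in scored[:n_add]:
                labeled[x] = label
                pool.discard(x)
    return labeled
```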
Name co-training: results (Collins and Singer 1999)
• 3 classes: person, organization, location (and 'other')
• Data: 1M sentences of news
• Seed:
  – New York, California, U.S. → location
  – contains(Mr.) → person
  – Microsoft, IBM → organization
  – contains(Incorporated) → organization
• Took names appearing with appositive modifier or as complement of preposition (88K name instances)
• Accuracy: 83%
• Clean accuracy (ignoring names not in one of the 3 categories): 91%
Semi-supervised NER: when to stop
• Semi-supervised NER labels a few more examples at every iteration
  – It stops when it runs out of examples to label
• This is fine if
  – Names are easily identified (e.g., by capitalization in English)
  – Most names fall into one of the categories being trained (e.g., people, organizations, and locations for news stories)
Semi-supervised NER: semantic drift
• Semi-supervised NER doesn't work so well if
  – The set of names is hard to identify
    • Monocase languages
    • Extended name sets including lower-case terms
  – The categories being trained cover only a small portion of the set of names
• The result is semantic drift and semantic spread
  – The name categories gradually grow to include related terms
Fighting Semantic Drift
• We can fight drift by training a larger, more inclusive set of categories
  – Including 'negative' categories
    • Categories we don't really care about but include to compete with the original categories
  – These negative categories can be built
    • By hand (Yangarber et al. 2003)
    • Or automatically (McIntosh 2010)
Active Learning
• For supervised learning, we typically annotate text data sequentially
  – Not necessarily the most efficient approach
• Most natural language phenomena have a Zipfian distribution … a few very common constructs and lots of infrequent constructs
• After you have annotated "Spain" 50 times as a location, the NER model is little improved by annotating it one more time
• We want to select the most informative examples and present them to the annotator
  – The data which, if labeled, is most likely to reduce NER error
How to select informative examples?
• Uncertainty-based sampling
  – For a binary classifier
    • For MaxEnt, probability near 50%
    • For SVM, data near the separating hyperplane
  – For an n-ary classifier, data with a small margin
• Committee-based sampling
  – Data on which committee members disagree
  – (co-testing … use two classifiers based on independent views)
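A sketch of margin-based uncertainty sampling for an n-ary classifier, assuming a matrix of class probabilities from any probabilistic model:

```python
import numpy as np

def smallest_margin(probabilities):
    """Margin-based uncertainty for an n-ary classifier: the difference
    between the two highest class probabilities (small margin = uncertain).
    Assumes `probabilities` is an (n_examples, n_classes) array."""
    sorted_probs = np.sort(probabilities, axis=1)
    return sorted_probs[:, -1] - sorted_probs[:, -2]

def select_uncertain(probabilities, k):
    """Return indices of the k examples with the smallest margin."""
    return np.argsort(smallest_margin(probabilities))[:k]

probs = np.array([[0.48, 0.47, 0.05],    # very uncertain
                  [0.90, 0.05, 0.05]])   # confident
print(select_uncertain(probs, 1))        # [0]
```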
Representativeness
• It's more helpful to annotate examples involving common features
  – Weighting these features correctly will have a larger impact on error rate
• So we rank examples by frequency of features in the entire corpus
Batching and Diversity
• Each iteration of active learning involves running the classifier on a (large) unlabeled corpus
  – This can be quite slow
  – Meanwhile the annotator is waiting for something to annotate
• So we run active learning in batches
  – Select the best n examples to annotate each time
  – But all items in a batch are selected using the same criteria and same system state, and so are likely to be similar
• To avoid example overlap, we impose a diversity requirement within a batch: limit the maximum similarity of examples within a batch
  – Compute similarity based on example feature vectors
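A sketch of diversity-constrained batch selection; the cosine-similarity cap is an illustrative parameter, not a recommended value.

```python
import numpy as np

def diverse_batch(candidate_vectors, uncertainty, batch_size, max_sim=0.8):
    """Select a batch of uncertain examples while limiting pairwise similarity.

    A sketch: `candidate_vectors` is an (n, d) array of feature vectors,
    `uncertainty` an (n,) array (higher = more informative); `max_sim`
    is an illustrative cosine-similarity cap.
    """
    norms = np.linalg.norm(candidate_vectors, axis=1, keepdims=True)
    unit = candidate_vectors / np.clip(norms, 1e-9, None)
    order = np.argsort(-uncertainty)          # most uncertain first
    batch = []
    for i in order:
        # reject candidates too similar to anything already in the batch
        if all(float(unit[i] @ unit[j]) < max_sim for j in batch):
            batch.append(int(i))
        if len(batch) == batch_size:
            break
    return batch
```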
Simulated Active Learning
• True active learning experiments are
  – Hard to reproduce
  – Very time consuming
• So most experiments involve simulated active learning:
  – "Unlabeled" data has really been labeled, but the labels have been hidden
  – When data is selected, the labels are revealed
  – Disadvantage: the "unlabeled" data can't be too big
• This leads us to ignore lots of issues of true active learning:
  – An annotation unit of one sentence or even one token may not be efficient for manual annotation
  – So reported speed-ups may be optimistic (typical reports reduce by half the amount of data needed to achieve a given NER accuracy)
Evaluating NER
• Systems are evaluated using an annotated test corpus
  – Ideally dual annotated and adjudicated
• Name tags in system output are classified as correct, spurious, or missing:
  Cervantes wrote Don Quixote in Tarragona.
    System:    [Cervantes] person → correct    [Tarragona] person → spurious
    Reference: [Cervantes] person              [Tarragona] location → missing
Metrics
• Systems are measured in terms of:
  recall = correct / (correct + missing)
  precision = correct / (correct + spurious)
  F = 2 × recall × precision / (recall + precision)
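These metrics are straightforward to compute from the three counts; a small helper (the example counts are illustrative):

```python
def ner_scores(correct, spurious, missing):
    """Compute recall, precision, and F measure from tag counts,
    following the definitions on the slide."""
    recall = correct / (correct + missing)
    precision = correct / (correct + spurious)
    f = 2 * recall * precision / (recall + precision)
    return recall, precision, f

print(ner_scores(correct=80, spurious=10, missing=20))  # (0.8, 0.888..., 0.842...)
```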
Typical Performance
• News corpora
  – Training and test from same source
• 3 categories: person, organization, location
• Based on CoNLL 2002 and 2003 multi-lingual, multi-site evaluations
  – English F = 89
  – Spanish F = 81
  – Dutch F = 77
  – German F = 72
Limitations
• Cited performance is for well matched training and test
  – Same domain
  – Same source
  – Same epoch
• Performance deteriorates rapidly if less matched
  – NER trained on Reuters (F=91), tested on Wall Street Journal (F=64) [Ciaramita and Altun 2003]
  – Work on NER adaptation is vital
• Adding rarer classes to NER is difficult
  – Supervised learning is inefficient
  – Semi-supervised learning is subject to semantic drift
Course Outline
• Machine learning preliminaries
• Name extraction
• Entity extraction
• Relation extraction
• Event extraction
• Other domains
Names, mentions, and entities
• Information extraction gathers information about discrete entities such as people, organizations, vehicles, books, cats, etc.
• Texts contain mentions of these entities; these mentions may take the form of
  – Names ("Sarkozy")
  – Noun phrases headed by nouns ("the president")
  – Pronouns ("he")
Reference and co-reference
• Data base entries filled with nouns or pronouns are not very useful …
  – At a minimum, entries should be names
• But even names may be ambiguous
  – So we may want to create a data base of entities with unique IDs
  – And express relations and events in terms of these IDs
In-document coreference
• The first step is in-document coreference – linking all mentions in a document which refer to the same entity
• If one of these mentions is a name, this allows us to use the name in the extracted relations
• Coreference has been extensively studied independently of IE
  – Typically by constructing statistical models of the likelihood that a pair of mentions are coreferential
  – We will not review these models here
Cross-document [co]reference
• Cross-document coreference links together the entities mentioned by individual documents
  – Generally limited to entities which are named in both documents
• Entity linking links an entity named in one document to an entity in a data base
Cross-document [co]reference
• Studied mainly in an IE setting
  – ACE 2008
  – KBP 2009-2010-2011
  – WePS
• Involves modeling
  – Possible spelling / name variation
    • William Jefferson Clinton / Bill Clinton
    • Osama bin Laden / Usama bin Laden
  – Probable coreference based on
    • Shared / conflicting attributes
    • Co-occurring terms / names
Course Outline
• Machine learning preliminaries
• Name extraction
• Entity extraction
• Relation extraction
• Event extraction
• Other domains
Relation
• A relation is a predication about a pair of entities:
  – Rodrigo works for UNED.
  – Alfonso lives in Tarragona.
  – Otto's father is Ferdinand.
• Typically they represent information which is permanent or of extended duration.
History of relations
• Relations were introduced in MUC-7 (1997)
  – 3 relations
• Extensively studied in ACE (2000 – 2007)
  – lots of training data
• Effectively included in KBP
ACE Relations
• Several revisions of relation definitions
  – With the goal of having a set of relations which can be more consistently annotated
• 5-7 major types, 19-24 subtypes
• Both entities must be mentioned in the same sentence
  – Do not get a parent-child relation from
    • Ferdinand and Isabella were married in 1481. A son was born in 1485.
  – Or an employee relation for
    • Bank Santander replaced several executives. Alfonso was named an executive vice president.
• Base for extensive research
  – On supervised and semi-supervised methods
2004 ACE Relation Types (relation type: subtypes)
• Physical: Located, Near, Part-whole
• Personal-social: Business, Family, Other
• Employment / Membership / Subsidiary: Employ-executive, Employ-staff, Employ-undetermined, Member-of-group, Partner, Subsidiary, Other
• Agent-artifact: User-or-owner, Inventor-or-manufacturer, Other
• Person-org affiliation: Ethnic, Ideology, Other
• GPE affiliation: Citizen-or-resident, Based-in, Other
• Discourse: –
KBP Slots
• Many KBP slots represent relations between entities:
  – Member_of
  – Employee_of
  – Country_of_birth
  – Countries_of_residence
  – Schools_attended
  – Spouse
  – Parents
  – Children
  – …
• Entities do not need to appear in the same sentence
• More limited training data
  – Encouraged semi-supervised methods
Characteristics
• Relations appear in a wide range of forms:
  – Embedded constructs (one argument contains the other)
    • within a single noun group
      – John's wife
    • linked by a preposition
      – the president of Apple
  – Formulaic constructs
    • Tarragona, Spain
    • Walter Cronkite, CBS News, New York
  – Longer-range ('predicate-linked') constructs
    • With a predicate disjoint from the arguments
      – Fred lived in New York
      – Fred and Mary got married
Hand-crafted patterns
• Most instances of relations can be identified by the types of the entities and the words between the entities
  – But not all: Fred and Mary got married.
• So we can start by listing word sequences:
  – person lives in location
  – person lived in location
  – person resides in location
  – person owns a house in location
  – …
Generalizing patterns
• We can get better coverage through syntactic generalization:
  – Specifying base forms
    • person <v base=reside> in location
  – Specifying chunks
    • person <vgroup base=reside> in location
  – Specifying optional elements
    • person <vgroup base=reside> [<pp>] in location
Dependency paths
• Generalization can also be achieved by using paths in labeled dependency trees:
  person – subject⁻¹ – reside – in – location
• [Dependency tree for "Fred has resided in Madrid for three years": reside → subject → Fred; reside → in → Madrid; reside → for → years → three; reside → has]
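One way to obtain such paths in practice is to walk a parser's dependency tree to the lowest common ancestor of the two arguments. The sketch below uses spaCy; the model name and example sentence are assumptions, and the path format is illustrative.

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes this model is installed

def dependency_path(doc, a, b):
    """Return the labeled dependency path between tokens a and b
    by climbing to their lowest common ancestor. A minimal sketch;
    a real pattern matcher would also lemmatize and mark inverted labels."""
    def chain(tok):
        path = [tok]
        while tok.head is not tok:
            tok = tok.head
            path.append(tok)
        return path
    up_a, up_b = chain(a), chain(b)
    common = next(t for t in up_a if t in up_b)
    left = up_a[:up_a.index(common)]
    right = up_b[:up_b.index(common)]
    parts = ([f"{t.text}-{t.dep_}" for t in left] + [common.text] +
             [f"{t.dep_}-{t.text}" for t in reversed(right)])
    return " -> ".join(parts)

doc = nlp("Fred has resided in Madrid for three years.")
fred = [t for t in doc if t.text == "Fred"][0]
madrid = [t for t in doc if t.text == "Madrid"][0]
print(dependency_path(doc, fred, madrid))
# e.g. Fred-nsubj -> resided -> prep-in -> pobj-Madrid
```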
Pattern Redundancy
• Using a combination of sequential patterns and dependency patterns may provide extra robustness
• Dependency patterns can handle more syntactic variation but are more subject to analysis errors:
  "Carlos resided with his three cats in Madrid."
  [Dependency tree: resided → Carlos (subject); resided → with → cats → his, three; resided → in → Madrid]
Supervised learning
• Collect training data
  – Annotate corpus with entities and relations
  – For every pair of entities in a sentence
    • If linked by a relation, treat as a positive training instance
    • If not linked, treat as a negative training instance
• Train model
  – For n relation types, either
    • Binary (identification) model + n-way classifier model, or
    • Unified (n+1)-way classifier
• On test data
  – Apply entity classifier
  – Apply relation classifier to every pair of entities in the same sentence
Supervised relation learner: features
• Heads of entities
• Types of entities
• Distance between entities
• Containment relations
• Word sequence between entities
• Individual words between entities
• Dependency path
• Individual words on dependency path
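A sketch of how these features might be assembled for one candidate pair; the mention representation (dicts with token offsets, head, and type) is an assumption made for illustration.

```python
def relation_features(tokens, e1, e2, dep_path=None):
    """Build a feature dict for a candidate entity pair in one sentence.

    A sketch: e1 and e2 are assumed to be dicts with 'start', 'end'
    (token offsets), 'head', and 'type'; dep_path is an optional
    precomputed dependency path string.
    """
    start, end = e1["end"], e2["start"]
    between = tokens[start:end]
    feats = {
        f"head1={e1['head'].lower()}": 1,
        f"head2={e2['head'].lower()}": 1,
        f"types={e1['type']}_{e2['type']}": 1,
        f"distance={end - start}": 1,
        f"word_seq={'_'.join(w.lower() for w in between)}": 1,
    }
    for w in between:                      # individual words between entities
        feats[f"between_word={w.lower()}"] = 1
    if dep_path is not None:
        feats[f"dep_path={dep_path}"] = 1
    return feats

tokens = "Rodrigo works for UNED .".split()
e1 = {"start": 0, "end": 1, "head": "Rodrigo", "type": "PER"}
e2 = {"start": 3, "end": 4, "head": "UNED", "type": "ORG"}
print(relation_features(tokens, e1, e2))
```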
Kernel Methods
• Goal is to find training examples similar to the test case
  – Similarity of word sequence or tree structure
  – Determining similarity through features is awkward
  – Better to define a similarity measure directly: a kernel function
• Kernels can be used directly by
  – SVMs
  – Memory-based learners (k-nearest-neighbor)
• Kernels defined over
  – Sequences
  – Parse or dependency trees
Tree Kernels
• Tree kernels differ in
  – Type of tree
    • Partial parse
    • Parse
    • Dependency
  – Tree spans compared
    • Shortest-path-enclosed tree
    • Conditionally larger context
  – Flexibility of match
Shortest-path-enclosed Tree
[Figure: parse tree with arguments A1 and A2; the shortest-path-enclosed tree covers only the nodes between A1 and A2]
• For predicate-linked relations, must extend the shortest-path-enclosed tree to include the predicate
Composite Kernels
• Can combine different levels of representation
• A composite kernel can combine sequence and tree kernels
Semi-supervised methods
• Preparing training data is more costly than for names
  – Must annotate entities and relations
• So there is a strong motivation to minimize training data through semi-supervised methods
• As for names, we will adopt a co-training approach:
  – Feature set 1: the two entities
  – Feature set 2: the contexts between the entities
• We will limit the bootstrapping
  – to a specific pair of entity types
  – and to instances where both entities are named
Semi-supervised learning
• Seed:
  – [Moby Dick, Herman Melville]
• Contexts for seed:
  – … wrote …
  – … is the author of …
• Other pairs appearing in these contexts:
  – [Animal Farm, George Orwell]
  – [Don Quixote, Miguel de Cervantes]
• Additional contexts …
Co-training for relations
[Bootstrapping cycle: seed → find occurrences of seed tuples → tag entities → generate extraction patterns → generate new seed tuples → repeat]
Ranking contexts
• If relation R is functional, and [X, Y] is a seed, then [X, Y'], Y' ≠ Y, is a negative example
• Confidence of pattern P:
  Conf(P) = P.positive / (P.positive + P.negative)
  where P.positive = number of positive matches to pattern P
        P.negative = number of negative matches to pattern P
Ranking pairs
• Once a confidence has been assigned to each pattern, we can assign a confidence to each new pair based on the patterns in which it appears
  – Confidence of best pattern
  – Combination assuming patterns are independent:
    Conf(X, Y) = 1 − ∏_{P ∈ contexts_of(X, Y)} (1 − Conf(P))
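The two confidence formulas translate directly into code; the example counts below are illustrative.

```python
def pattern_confidence(positive, negative):
    """Conf(P) = P.positive / (P.positive + P.negative)."""
    return positive / (positive + negative)

def pair_confidence(pattern_confidences):
    """Conf(X, Y) = 1 - prod(1 - Conf(P)) over the patterns in whose
    contexts the pair (X, Y) appears, assuming pattern independence."""
    remaining_doubt = 1.0
    for c in pattern_confidences:
        remaining_doubt *= (1.0 - c)
    return 1.0 - remaining_doubt

# e.g. a pair seen with two patterns of confidence 0.9 and 0.5
print(pair_confidence([pattern_confidence(9, 1), pattern_confidence(5, 5)]))  # 0.95
```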
Semantic drift
• Ranking / filtering is quite effective for functional relations (book author, company headquarters)
  – But expansion may occur into other relations generally implied by the seed ('semantic drift')
  – Ex: from governor → state governed to person → state born in
• Precision is poor without the functional property
Distant supervision
• Sometimes a large data base is available involving the type of relation to be extracted
  – A number of such public data bases are now available, such as FreeBase and Yago
• Text instances corresponding to some of the data base instances can be found in a large corpus or from the Web
• Together these can be used to train a relation classifier
Distant supervision: approach
• Given:
  – Data base for relation R
  – Corpus containing information about relation R
• Collect <X, Y> pairs from data base relation R
• Collect sentences in corpus containing both X and Y
  – These are positive training examples
• Collect sentences in corpus containing X and some Y' with the same entity type as Y such that <X, Y'> is not in the data base
  – These are negative training examples
• Use examples to train a classifier which operates on pairs of entities
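A sketch of the training-data construction step; the data structures standing in for the knowledge base and the corpus are assumptions made for illustration.

```python
def distant_supervision_examples(kb_pairs, sentences, has_arg2_type):
    """Generate labeled training examples for a relation classifier.

    A sketch: `kb_pairs` is a set of (X, Y) pairs from the data base
    relation R; `sentences` maps each sentence to the entities mentioned
    in it; `has_arg2_type(y)` says whether y has the same entity type as
    the relation's second argument. All names here are assumptions.
    """
    examples = []
    kb_first_args = {x for x, _ in kb_pairs}
    for sentence, entities in sentences.items():
        for x in entities:
            if x not in kb_first_args:
                continue
            for y in entities:
                if y == x or not has_arg2_type(y):
                    continue
                # positive if <X, Y> is in the data base, negative otherwise
                label = 1 if (x, y) in kb_pairs else 0
                examples.append((sentence, x, y, label))
    return examples
```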
Distant supervision: limitations
• The training data produced through distant supervision may be quite noisy:
  – If a pair <X, Y> is involved in multiple relations, R<X, Y> and R'<X, Y>, and the data base represents relation R, the text instance may represent relation R', yielding a false positive training instance
    • If many <X, Y> pairs are involved, the classifier may learn the wrong relation
  – If a relation is incomplete in the data base … for example, if resides_in<X, Y> contains only a few of the locations where a person has resided … then we will generate many false negatives, possibly leading the classifier to learn no relation at all
Evaluation
• A matching relation has matching relation type and arguments
  – Count correct, missing, and spurious relations
  – Report precision, recall, and F measure
• Variations
  – Perfect mentions vs. system mentions
    • Performance much worse with system mentions – an error in either mention makes the relation incorrect
  – Relation type vs. relation subtype
  – Name pairs vs. all mentions
    • Bootstrapped systems trained on name-name patterns
• Best ACE systems on perfect mentions: F = 75
Course Outline
• Machine learning preliminaries
• Name extraction
• Entity extraction
• Relation extraction
• Event extraction
• Other domains