motivation bootstrapping semantic lexicons

Motivation Bootstrapping Semantic Lexicons A semantic lexicon - PowerPoint PPT Presentation



  1. Motivation
     • A semantic lexicon contains semantic category assignments for words. For example: blogger HUMAN, sedan VEHICLE, AK-47 WEAPON.
     • General purpose resources, such as WordNet, are often insufficient for specific domains. Ex: ANIMAL: gshep, doxy, lab, labx, m/n, mix, patient; HUMAN: o.
     • Automatic methods can be used to enhance existing resources or create domain-specific lexicons.

     Bootstrapping Semantic Lexicons
     [Diagram labels: example words (dog, cat, lion, lizard, snake); unannotated texts; co-occurrence statistics; prospective category words; N best words (Ex: terrier, poodle, tiger, frog, iguana).]

     Lexico-Syntactic Patterns
     • Lexico-syntactic contexts often reveal the semantic class of a word.
     • AutoSlog [Riloff 1993] is a pattern generator that was originally developed for event extraction tasks.
     • Each pattern co-occurs with a NP in one of 3 syntactic positions: subject, direct object, PP object.

     Example Location Patterns
     <subject> was inhabited        the locality was inhabited…
     patrolling <direct object>     …patrolling Zacamil neighborhood
     lives in <PP object>           …lives in Argentina

     Pattern Templates
     Template                              Example
     <subject> passive-vp                  <target> was bombed
     <subject> active-vp                   <perpetrator> bombed
     <subject> active-vp dobj              <perpetrator> threw dynamite
     <subject> active-vp infinitive        <perpetrator> tried to kill
     <subject> passive-vp infinitive       <perpetrator> was hired to kill
     <subject> auxiliary dobj              <victim> was fatality
     active-vp <dobj>                      bombed <target>
     infinitive <dobj>                     to kill <victim>
     active-vp infinitive <dobj>           tried to kill <victim>
     passive-vp infinitive <dobj>          was hired to kill <victim>
     subject auxiliary <dobj>              fatality was <victim>
     passive-vp prep <np>                  was killed by <perpetrator>
     active-vp prep <np>                   exploded in <target>
     infinitive prep <np>                  to kill with <weapon>
     noun prep <np>                        assassination of <victim>

  2. BASILISK = Bootstrapping Approach to SemantIc Lexicon Induction using Semantic Knowledge

     Key Ideas behind Basilisk
     • Collective evidence over extraction patterns.
     • Learning multiple categories simultaneously.

     Basilisk Bootstrapping Algorithm
     [Diagram labels: Pattern Contexts; best patterns; Pattern Pool; co-occurring words; Candidate Word Pool; 5 best words; Lexicon.]

     Scoring Patterns
     Every extraction pattern is scored and the best patterns are put into a Pattern Pool. The scoring function is:

     $$\mathrm{RlogF}(\mathrm{pattern}_i) = \frac{F_i}{N_i} \cdot \log_2(F_i)$$

     where:
     $F_i$ is the number of category members extracted by pattern $i$
     $N_i$ is the total number of nouns extracted by pattern $i$

     The Pattern Pool
     • Initially, we used a Pattern Pool of size 20, but the pool became stagnant over time. Solution: begin with a pattern pool of size 20, but increase the pool size by 1 after each iteration to infuse the pool with new candidates.
     • All head nouns that co-occur with patterns in the Pattern Pool are put into the Candidate Word Pool.
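The RlogF score and the growing Pattern Pool described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the `extractions` dictionary (pattern to set of extracted head nouns), the handling of patterns that extract no category members, and the exclusion of words already in the lexicon from the candidate pool are all hypothetical choices.

```python
import math


def rlogf(f_i: int, n_i: int) -> float:
    """RlogF(pattern_i) = (F_i / N_i) * log2(F_i)."""
    if f_i == 0 or n_i == 0:
        # Assumption: a pattern that extracts no category members ranks last.
        return float("-inf")
    return (f_i / n_i) * math.log2(f_i)


def build_pattern_pool(extractions, lexicon, iteration, base_size=20):
    """Return the best-scoring patterns for one bootstrapping iteration.

    extractions: dict mapping a pattern string to the set of head nouns it extracts.
    lexicon: set of words currently known to belong to the category.
    The pool starts at base_size and grows by 1 after each iteration.
    """
    pool_size = base_size + iteration
    scored = []
    for pattern, nouns in extractions.items():
        f_i = len(nouns & lexicon)  # category members extracted by the pattern
        n_i = len(nouns)            # all nouns extracted by the pattern
        scored.append((rlogf(f_i, n_i), pattern))
    # Highest score first; ties broken alphabetically (an assumption).
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [pattern for _, pattern in scored[:pool_size]]


def candidate_word_pool(pool, extractions, lexicon):
    """Collect head nouns that co-occur with patterns in the pool.

    Assumption: words already in the lexicon are not candidates again.
    """
    candidates = set()
    for pattern in pool:
        candidates |= extractions[pattern]
    return candidates - lexicon
```

For example, a pattern that extracts 8 nouns of which 4 are known category members scores (4 / 8) * log2(4) = 1.0.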

  3. Scoring Words based on Collective Evidence
       1. Given a word, collect all of its pattern contexts.
       2. Compute the average # of distinct class members per pattern. (Actually, average over logarithms.)
     INTUITION: a word receives a high score if it occurs in contexts that also consistently co-occur with known semantic class members.

     Selecting Words for the Lexicon
     score(word_i) = the average number of category members that co-occur with the pattern contexts containing the candidate word.

     $$\mathrm{score}(\mathrm{word}_i) = \frac{\sum_{j=1}^{N_i} F_j}{N_i}$$

     $$\mathrm{AvgLog}(\mathrm{word}_i) = \frac{\sum_{j=1}^{N_i} \log_2(F_j + 1)}{N_i}$$

     where:
     $F_j$ is the # of distinct category members that co-occur with pattern $j$
     $N_i$ is the total number of patterns that co-occur with word $i$

     Experimental Design
     • Used the MUC-4 corpus: 1700 texts related to terrorism.
     • Experiments on 6 semantic categories: building, event, human, location, time, weapon.
     • 10 seed words for each category.
     • 1000 words automatically generated for each category.
     • Basilisk was compared with our previous algorithm (meta-bootstrapping).

     Baseline Results
     Head Nouns (8460 words)
     Category    Head nouns    Share
     building    188           2.2%
     event       501           5.9%
     human       1856          21.9%
     location    1018          12.0%
     time        112           1.3%
     weapon      147           1.7%
     (other)     4638          54.8%
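The AvgLog word score can be sketched as follows. This is a minimal illustration, assuming the same hypothetical `extractions` mapping (pattern to set of extracted head nouns) used for pattern scoring; the function names, the zero score for words with no pattern contexts, and the alphabetical tie-break are assumptions rather than details from the slides. The default of 5 selected words follows the "5 best words" label in the algorithm diagram.

```python
import math


def avg_log(word, extractions, lexicon):
    """AvgLog(word_i) = sum_j log2(F_j + 1) / N_i over the word's pattern contexts.

    extractions: dict mapping a pattern string to the set of head nouns it extracts.
    lexicon: set of known category members.
    """
    # Step 1: collect every pattern context that co-occurs with the word.
    contexts = [nouns for nouns in extractions.values() if word in nouns]
    if not contexts:
        return 0.0  # assumption: a word with no contexts gets the lowest score
    # Step 2: average log2(F_j + 1), where F_j counts distinct category members.
    total = sum(math.log2(len(nouns & lexicon) + 1) for nouns in contexts)
    return total / len(contexts)


def select_best(candidates, extractions, lexicon, n=5):
    """Return the n highest-scoring candidate words (ties broken alphabetically)."""
    ranked = sorted(
        candidates,
        key=lambda w: (-avg_log(w, extractions, lexicon), w),
    )
    return ranked[:n]
```

Adding 1 inside the logarithm keeps a pattern with no category members at a contribution of 0 instead of an undefined log2(0).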

  4. Seed Words
     We used the 10 most frequent words for each category.
     Building: embassy, office, headquarters, church, offices, house, home, residence, hospital, airport
     Event: attack, actions, war, meeting, elections, murder, attacks, action, struggle, agreement
     Human: people, guerrillas, members, troops, Cristiani, rebels, president, terrorists, soldiers, leaders
     Location: country, El Salvador, Salvador, United States, area, Colombia, city, countries, department, Nicaragua
     Time: time, years, days, November, hours, night, morning, week, year, day
     Weapon: weapons, bomb, bombs, explosives, arms, missiles, dynamite, rifles, materiel, bullets

     [Figure: six panels (Human, Location, Event, Building, Weapon, Time), each plotting Correct Lexicon Entries against Total Lexicon Entries for BA-1 and MB-1.]

     Learning Multiple Categories Simultaneously
     • We hypothesized that confusion errors can be reduced by learning multiple semantic categories simultaneously.
     • “One Sense per Domain” assumption.
     • Knowledge about competing categories can constrain and steer the bootstrapping process.

     Semantic Learning Case Study
     • Seed Words: 10 common disease names
     • Of the top 200 words hypothesized to be diseases: 89 were already in the UMLS metathesaurus (32,000 names of diseases and organisms), but 111 were not! Including:
       adenomatosis, h5n1, flu, tularaemia, h7n3, kawasaki, tularamia, ev71, mad-cow-disease, diarrhoea, yf, smut, diphtheriae, jyf, pertussis, enterovirus-71, nvcjd, pleuro-pneumonia, fibropapillomas, pepmv, polioencephalomyelitis, gastroeneteritis, wsmv, poliovirus

  5. Bootstrapping a Single Category
     [Diagram]

     Bootstrapping Multiple Categories
     [Diagram]

     Simple Conflict Resolution
     • A word cannot be assigned to category X if it has already been assigned to category Y.
     • If a word is hypothesized for both category X and category Y at the same time, choose the category that receives the highest score.

     [Figure: six panels (Human, Location, Building, Event, Weapon, Time), each plotting Correct Lexicon Entries against Total Lexicon Entries for BA-M and BA-1.]
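The two conflict-resolution rules can be sketched as a single pass over the words proposed in one iteration. This is a hypothetical illustration: the `proposals` and `lexicons` structures and the function name are assumptions, and ties between equal scores are resolved in favor of the category seen first, which the slides do not specify.

```python
def resolve_conflicts(proposals, lexicons):
    """Apply simple conflict resolution and add the surviving words.

    proposals: dict mapping a category to a list of (word, score) pairs
               hypothesized in the current iteration.
    lexicons:  dict mapping a category to its set of words (updated in place).
    """
    # Rule 1: a word already assigned to some category cannot be reassigned.
    assigned = {word for words in lexicons.values() for word in words}

    # Rule 2: when several categories propose the same word at the same time,
    # keep only the category with the highest score.
    best = {}
    for category, scored_words in proposals.items():
        for word, score in scored_words:
            if word in assigned:
                continue
            if word not in best or score > best[word][1]:
                best[word] = (category, score)

    for word, (category, _) in best.items():
        lexicons[category].add(word)
    return lexicons
```

In this sketch a word such as "plaza" proposed for building with score 1.2 and for location with score 2.0 would be added only to the location lexicon.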

  6. [Figure: six panels (Human, Location, Event, Building, Weapon, Time), each plotting Correct Lexicon Entries against Total Lexicon Entries for MB-M and MB-1.]

     A Smarter Scoring Function
     A more proactive approach: incorporate knowledge about other categories directly into the scoring function.

     New scoring function:

     $$\mathrm{diff}(w_i, c_a) = \mathrm{AvgLog}(w_i, c_a) - \max_{b \neq a} \mathrm{AvgLog}(w_i, c_b)$$

     [Figure: six panels (Human, Location, Event, Building, Weapon, Time), each plotting Correct Lexicon Entries against Total Lexicon Entries for BA-M+, BA-M, and MB-M.]

     Subjective Noun Bootstrapping [Riloff, Wiebe, and Wilson, 2003]
     [Diagram labels: nouns (hope, grief, joy, concern, worries); Best Patterns (expressed <np>, voiced <np>, show of <np>); Best Nouns (happiness, relief, condolences, goodwill); Lexicon.]
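The diff score can be computed directly from per-category AvgLog values. The sketch below is an assumption-laden illustration: the nested `avg_log_scores` dictionary and the function name are hypothetical, and a word with no recorded score for a category is treated as having AvgLog 0.

```python
def diff_score(word, category, avg_log_scores):
    """diff(w_i, c_a) = AvgLog(w_i, c_a) - max over b != a of AvgLog(w_i, c_b).

    avg_log_scores: dict mapping a category to a dict of word -> AvgLog value.
    Assumption: a missing entry counts as an AvgLog of 0.0.
    """
    own = avg_log_scores[category].get(word, 0.0)
    rivals = [
        scores.get(word, 0.0)
        for other, scores in avg_log_scores.items()
        if other != category
    ]
    if not rivals:
        return own  # only one category is being learned
    return own - max(rivals)
```

A positive value means the word has stronger evidence for the target category than for any competing category; a negative value means some other category claims it more strongly.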
