Information Extraction and the Future of Web Search

A Probabilistic Model of Redundancy in Information Extraction
Doug Downey, Oren Etzioni, Stephen Soderland
University of Washington, Department of Computer Science and Engineering
http://www.cs.washington.edu/research/knowitall

Motivation for Web IE
• What universities have active biotech research and in what departments?
• What percentage of the reviews of the Thinkpad T-40 are positive?
The answer is not on any single Web page!

Review: Unsupervised Web IE
Goal: Extract information on any subject automatically.

Review: Extraction Patterns
Generic extraction patterns (Hearst ’92):
(“C such as NP1, NP2, and NP3”) => IS-A(each(head(NP)), C), …
• “…Cities such as Boston, Los Angeles, and Seattle…”
• “Detailed information for several countries such as maps, …”
  ProperNoun(head(NP))
• “I listen to pretty much all music but prefer country such as Garth Brooks”

Binary Extraction Patterns
R(I1, I2) → I1, R of I2
Instantiated Pattern: Ceo(Person, Company) → <person>, CEO of <company>
“…Jeff Bezos, CEO of Amazon…”
“..Matt Damon, star of The Bourne Supremacy..”
“Erik Jonsson, CEO of Texas Instruments, mayor of Dallas from 1964-1971, and…”
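The generic “C such as NP1, NP2, and NP3” pattern can be sketched with a regular expression. This is only a rough illustration, not the KnowItAll extractor: the function name `such_as_extractions` is hypothetical, there is no noun-phrase chunking, and a run of capitalized words stands in for the ProperNoun(head(NP)) test.

```python
import re


def such_as_extractions(sentence: str, class_name: str) -> list[str]:
    """Return candidate instances of class_name found after '<class_name> such as'."""
    match = re.search(
        rf"\b{re.escape(class_name)} such as (.+)", sentence, flags=re.IGNORECASE
    )
    if not match:
        return []
    # Split the list "NP1, NP2, and NP3" on commas and on the word "and".
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", match.group(1))
    extractions = []
    for part in parts:
        part = part.strip(" .…")
        # Assumption: a leading run of capitalized words approximates a proper noun.
        proper = re.match(r"[A-Z][\w.-]*(?:\s+[A-Z][\w.-]*)*", part)
        if proper:
            extractions.append(proper.group(0))
    return extractions


print(such_as_extractions("Cities such as Boston, Los Angeles, and Seattle are busy.", "cities"))
print(such_as_extractions("Detailed information for several countries such as maps, …", "countries"))
print(such_as_extractions("I prefer country such as Garth Brooks", "country"))
```

The capitalization filter rejects “maps” but still accepts “Garth Brooks” as a country, which is the kind of mistake a generic pattern makes.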
Review: Unsupervised Web IE
Goal: Extract information on any subject automatically.
→ Generic extraction patterns
Generic patterns can make mistakes.
→ Redundancy.

Redundancy in Information Extraction
In large corpora, the same fact is often asserted multiple times:
“…and the rolling hills surrounding Sun Belt cities such as Atlanta”
“Atlanta is a city with a large number of museums, theatres…”
“…has offices in several major metropolitan cities including Atlanta”
Given a term x and a set of sentences about a class C, what is the probability that x ∈ C?

Outline
1. Modeling redundancy – the problem
2. URNS model
3. Parameter estimation for URNS
4. Experimental results
5. Summary

Redundancy – Two Intuitions
1) Repetition
2) Multiple extraction mechanisms

Phrase                        Hits
“Atlanta and other cities”    980
“Canada and other cities”     286
“cities such as Atlanta”      5860
“cities such as Canada”       7

Goal: A formal model of these intuitions.

1. Modeling Redundancy – The Problem
Consider a single extraction pattern: “C such as x”
Given a term x and a set of sentences about a class C, what is the probability that x ∈ C?
If an extraction x appears k times in a set of n sentences containing this pattern, what is the probability that x ∈ C?
Modeling with k
Country(x) extractions, n = 10:
“…countries such as Saudi Arabia…”
“…countries such as the United States…”
“…countries such as Saudi Arabia…”
“…countries such as Japan…”
“…countries such as Africa…”
“…countries such as Japan…”
“…countries such as the United Kingdom…”
“…countries such as Iraq…”
“…countries such as Afghanistan…”
“…countries such as Australia…”

Extraction        k    P_noisy-or
Saudi Arabia      2    0.99
Japan             2    0.99
United States     1    0.9
Africa            1    0.9
United Kingdom    1    0.9
Iraq              1    0.9
Afghanistan       1    0.9
Australia         1    0.9

Noisy-Or Model:
$P_{\text{noisy-or}}(x \in C \mid x \text{ appears } k \text{ times}) = 1 - (1 - p)^{k}$
p is the probability that a single sentence is true (here p = 0.9).
Important: noisy-or ignores
– Distribution of C
– Sample size (n)

Needed in Model: Sample Size
Country(x) extractions, n = 10: the table under “Modeling with k”.
Country(x) extractions, n ~50,000:

Extraction            k       P_noisy-or
Japan                 1723    0.9999…
Norway                295     0.9999…
Israil                1       0.9
OilWatch Africa       1       0.9
Religion Paraguay     1       0.9
Chicken Mole          1       0.9
Republics of Kenya    1       0.9
Atlantic Ocean        1       0.9
New Zeland            1       0.9

As sample size increases, noisy-or becomes inaccurate.

Needed in Model: Distribution of C
$P_{\text{freq}}(x \in C \mid x \text{ appears } k \text{ times}) = 1 - (1 - p)^{1000\,k/n}$
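The noisy-or values in these tables follow directly from 1 − (1 − p)^k with p = 0.9. A minimal sketch (the function name `noisy_or` is hypothetical) shows why the model breaks down as the sample grows: the result depends only on k, so a string extracted once gets 0.9 whether n is 10 or about 50,000.

```python
def noisy_or(k: int, p: float = 0.9) -> float:
    """P(x in C | x appears k times) under noisy-or: 1 - (1 - p)^k."""
    return 1.0 - (1.0 - p) ** k


# k = 1 and k = 2 match the n = 10 table; k = 1723 is Japan at n ~50,000.
for k in (1, 2, 1723):
    print(k, round(noisy_or(k), 4))
```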
Needed in Model: Distribution of C
$P_{\text{freq}}(x \in C \mid x \text{ appears } k \text{ times}) = 1 - (1 - p)^{1000\,k/n}$

Country(x) extractions, n ~50,000:

Extraction            k       P_freq
Japan                 1723    0.9999…
Norway                295     0.9999…
Israil                1       0.05
OilWatch Africa       1       0.05
Religion Paraguay     1       0.05
Chicken Mole          1       0.05
Republics of Kenya    1       0.05
Atlantic Ocean        1       0.05
New Zeland            1       0.05

City(x) extractions, n ~50,000:

Extraction        k      P_freq
Toronto           274    0.9999…
Belgrade          81     0.98
Lacombe           1      0.05
Kent County       1      0.05
Nikki             1      0.05
Ragaz             1      0.05
Villegas          1      0.05
Cres              1      0.05
Northeastwards    1      0.05

Probability that x ∈ C depends on the distribution of C.
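The P_freq column can be reproduced with a short sketch, assuming the frequency-scaled form 1 − (1 − p)^(1000·k/n) with p = 0.9. The function name `p_freq` and the `scale` argument are hypothetical names for the constant 1000 in the formula.

```python
def p_freq(k: int, n: int, p: float = 0.9, scale: float = 1000.0) -> float:
    """Frequency-scaled noisy-or: 1 - (1 - p)^(scale * k / n)."""
    return 1.0 - (1.0 - p) ** (scale * k / n)


# Rows from the n ~50,000 tables: singletons, Belgrade (k = 81), Toronto (k = 274).
for k in (1, 81, 274):
    print(k, round(p_freq(k, 50_000), 4))
```

Because the exponent depends only on k/n, the same score applies to Country and City singletons alike, even though the two classes have very different sizes.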
2. The URNS Model – Single Urn
Urn for City(x), with one label per ball:
Tokyo, Sydney, U.K., Cairo, U.K., Tokyo, Utah, Atlanta, Yakima, Atlanta
Each extraction is a draw from the urn, e.g. “…cities such as Tokyo…”

Single Urn – Formal Definition
C – set of unique target labels
E – set of unique error labels
num(b) – number of balls labeled by b ∈ C ∪ E
num(B) – distribution giving the number of balls for each label b ∈ B

Single Urn Example
Urn for City(x): Tokyo, Sydney, U.K., Cairo, U.K., Tokyo, Utah, Atlanta, Yakima, Atlanta
num(“Atlanta”) = 2
num(C) = {2, 2, 1, 1, 1}
num(E) = {2, 1}
num(C) and num(E) are estimated from data.
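The single-urn example can be written down as a small data structure. This is a sketch only: the split of labels into C and E is inferred from num(C) = {2, 2, 1, 1, 1} and num(E) = {2, 1} (U.K. and Utah are not cities), and the helper name `num_dist` is hypothetical.

```python
from collections import Counter

# The ten balls in the City(x) urn from the example.
balls = ["Tokyo", "Sydney", "U.K.", "Cairo", "U.K.",
         "Tokyo", "Utah", "Atlanta", "Yakima", "Atlanta"]

# Assumption: target labels are the actual cities, error labels the rest.
C = {"Tokyo", "Sydney", "Cairo", "Atlanta", "Yakima"}
E = {"U.K.", "Utah"}

num = Counter(balls)  # num[b] is the number of balls labeled b


def num_dist(labels: set[str]) -> list[int]:
    """Ball counts for a set of labels, largest first (the multiset num(B))."""
    return sorted((num[b] for b in labels), reverse=True)


print(num["Atlanta"], num_dist(C), num_dist(E))
```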
Single Urn: Computing Probabilities
If an extraction x appears k times in a set of n sentences containing a pattern, what is the probability that x ∈ C?
Restated in terms of the urn: given that an extraction x appears k times in n draws from the urn (with replacement), what is the probability that x ∈ C?

Uniform Special Case
Consider the case where num(c_i) = R_C and num(e_j) = R_E for all c_i ∈ C, e_j ∈ E.
Then, using a Poisson approximation: odds increase exponentially with k, but decrease exponentially with n.

The URNS Model – Multiple Urns
Correlation across extraction mechanisms is higher for elements of C than for elements of E.

3. Parameter Estimation for URNS
Simplifying Assumptions:
– Assume that num(C) and num(E) are Zipf distributed.
  • Frequency of the i-th most repeated label in C ∝ $i^{-z_C}$
– Then num(C) and num(E) are characterized by five parameters: |C|, |E|, z_C, z_E, p
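One way to compute the single-urn probability is Bayes' rule over labels. The sketch below assumes that every label is a priori equally likely to be the observed x and that its k occurrences in n draws are binomial with rate num(b)/s, where s is the total number of balls. Under those assumptions P(x ∈ C) = Σ_{r ∈ num(C)} (r/s)^k (1 − r/s)^(n−k) divided by the same sum over num(C ∪ E). In the uniform case, with p_C = R_C/s and p_E = R_E/s, the Poisson approximation gives odds ≈ (|C|/|E|) (p_C/p_E)^k e^(−n(p_C − p_E)). All function names are hypothetical.

```python
import math


def _log_term(r: int, s: int, k: int, n: int) -> float:
    """log of (r/s)^k * (1 - r/s)^(n-k); assumes 0 < r < s."""
    q = r / s
    return k * math.log(q) + (n - k) * math.log1p(-q)


def _logsumexp(values: list[float]) -> float:
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def _sigmoid(t: float) -> float:
    """Numerically stable 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def urn_probability(k: int, n: int, num_C: list[int], num_E: list[int]) -> float:
    """P(x in C | x appears k times in n draws), assuming a uniform prior over labels."""
    s = sum(num_C) + sum(num_E)
    log_c = _logsumexp([_log_term(r, s, k, n) for r in num_C])
    log_e = _logsumexp([_log_term(r, s, k, n) for r in num_E])
    return _sigmoid(log_c - log_e)


def uniform_poisson_probability(k: int, n: int, size_C: int, size_E: int,
                                R_C: int, R_E: int) -> float:
    """Uniform special case with the Poisson approximation (1 - p)^(n-k) ~ exp(-n p)."""
    s = size_C * R_C + size_E * R_E
    p_C, p_E = R_C / s, R_E / s
    log_odds = math.log(size_C / size_E) + k * math.log(p_C / p_E) - n * (p_C - p_E)
    return _sigmoid(log_odds)


# The ten-ball City(x) urn: num(C) = {2, 2, 1, 1, 1}, num(E) = {2, 1}.
print(urn_probability(2, 10, [2, 2, 1, 1, 1], [2, 1]))
# Uniform urn: exact value versus the Poisson approximation.
print(urn_probability(3, 1000, [10] * 100, [1] * 1000))
print(uniform_poisson_probability(3, 1000, 100, 1000, 10, 1))
```

The sums are taken in log space so that large k or n (for example k = 1723, n ~50,000) does not underflow.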
Parameter Estimation
Supervised Learning
– Differential Evolution (maximizing conditional likelihood)
Unsupervised Learning
– Growing interest in IE without hand-tagged training data (e.g. DIPRE; Snowball; KnowItAll; Riloff and Jones 1999; Lin, Yangarber, and Grishman 2003)
– How to estimate num(C) and num(E)?

Unsupervised Parameter Estimation
Unsupervised Learning
– EM, with additional assumptions:
  • |E| = 1,000,000
  • z_E = 1
  • p is given (p = 0.9 for KnowItAll patterns)

EM Process
EM for Unsupervised IE:
– E-Step: Assign probabilities to extracted facts using URNS.
– M-Step:
  1. Estimate z_C by linear regression on log-log scale.
  2. Set |C| equal to expected number of true labels extracted, plus unseen true labels (using Good-Turing estimation).

4. Experimental Results
Previous Approach: PMI (in KnowItAll, inspired by Turney, 2001)
$\mathrm{PMI}(\text{“<City> hotels”}, \text{“Tacoma”}) = \dfrac{\mathrm{Hits}(\text{“Tacoma hotels”})}{\mathrm{Hits}(\text{“Tacoma”})}$
– Expensive: several hit-count queries per extraction
– Using URNS improves efficiency by ~8x
– ‘Bootstrapped’ training data not representative
– Probabilities are polarized (Naïve Bayes)

Unsupervised Likelihood Performance
[Figure: deviation from ideal log likelihood (axis 0 to 5) for urns, noisy-or, and pmi on the City, Film, Country, and MayorOf classes.]
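The first M-step sub-step, estimating z_C by linear regression on a log-log scale, can be sketched as an ordinary least-squares fit of log(count) against log(rank). The slides do not say which counts enter the regression (raw extraction counts or counts weighted by the E-step probabilities), so this hypothetical `fit_zipf_exponent` simply fits whatever positive counts it is given.

```python
import math


def fit_zipf_exponent(counts: list[float]) -> float:
    """Estimate z for count_i proportional to i^(-z) by least squares on a log-log scale.

    Assumes at least two strictly positive counts.
    """
    ranked = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(count) for count in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # the Zipf exponent is the negated slope


# Synthetic check: counts generated from an exact power law with z = 1.5.
synthetic = [1000.0 * rank ** -1.5 for rank in range(1, 201)]
print(fit_zipf_exponent(synthetic))
```

On real extraction counts the many tied singletons flatten the tail of the log-log plot, so the fitted exponent is only a rough estimate.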