Probase Haixun Wang Microsoft Research Asia
Short Text • Document Title • Search • Caption • Ad keywords • Question • Anchor text
The big question • How does the mind get so much out of so little? • Our minds build rich models of the world and make strong generalizations from input data that is sparse, noisy, and ambiguous – in many ways far too limited to support the inferences we make. • How do we do it?
Science 331, 1279 (2011); MIT CMU Berkeley Stanford
If the mind goes beyond the data given, another source of information must make up the difference.
h1: all and only horses h2: all horses except Clydesdales h3: all animals
Posterior ∝ likelihood × prior: $P(h \mid d) \propto P(d \mid h)\, P(h)$
• h1: all and only horses
• h2: all horses except Clydesdales
• h3: all animals
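The ranking of the three hypotheses can be sketched numerically. This is only an illustration: the priors, the extension sizes, and the "size principle" likelihood (each consistent example has probability 1/|h|) are assumptions made for the example and are not given on the slide.

```python
def posterior(hypotheses, n_examples):
    """Return normalized P(h | d) for n_examples horse examples.

    hypotheses maps a name to (prior, extension_size). The likelihood
    P(d | h) = (1 / extension_size) ** n_examples is an assumed
    "size principle"; the slide only states posterior ∝ likelihood × prior.
    """
    scores = {
        name: prior * (1.0 / size) ** n_examples
        for name, (prior, size) in hypotheses.items()
    }
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}


# Hypothetical priors and extension sizes, chosen only for illustration.
hypotheses = {
    "h1: all and only horses": (0.45, 100),
    "h2: all horses except Clydesdales": (0.10, 95),
    "h3: all animals": (0.45, 10000),
}

post = posterior(hypotheses, n_examples=3)
best = max(post, key=post.get)
print(best)
```

Under these assumed numbers the broad hypothesis h3 is penalized by its large extension and the unnatural hypothesis h2 by its low prior, so h1 comes out on top.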
Which is “kiki” and which is “bouba”?
[Figure: sound vs. shape (zigzaggedness)]
Another example Pablo Picasso 25 Oct 1881 Spanish
Probase: a semantic network for text understanding Concepts Entities isA isPropertyOf Co-occurrence
isA Extraction
• Hearst patterns
  – Examples: "domestic animals such as cats and dogs …"; "animals other than cats such as dogs …"
  – NP such as NP, NP, ..., and|or NP
  – such NP as NP,* or|and NP
  – NP, NP*, or other NP
  – NP, NP*, and other NP
  – NP, including NP,* or|and NP
  – NP, especially NP,* or|and NP
• "… is a …" pattern
  – Examples: "China is a developing country."; "Life is a box of chocolate."
  – NP is a/an/the NP
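A minimal sketch of the "NP such as NP, NP, ..., and|or NP" pattern as a regular expression. It is not the Probase extractor: it assumes the super-concept is the one or two words immediately before "such as" and that each listed item is a single word, whereas a real system would use noun-phrase chunking. The names `SUCH_AS`, `SPLIT`, and `extract_isa` are hypothetical.

```python
import re

# Super-concept: one or two words before "such as" (a crude stand-in for an NP).
# Items: single words separated by commas, with an optional final "and"/"or" item.
SUCH_AS = re.compile(
    r"\b(?P<concept>\w+(?: \w+)?) such as "
    r"(?P<items>\w+(?:, (?!and\b|or\b)\w+)*(?:,? (?:and|or) \w+)?)"
)

# Separators between items: ", ", ", and ", ", or ", " and ", " or ".
SPLIT = re.compile(r"\s*,\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+")


def extract_isa(sentence):
    """Return (sub-concept, super-concept) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        concept = match.group("concept")
        for item in SPLIT.split(match.group("items")):
            if item:
                pairs.append((item, concept))
    return pairs


print(extract_isa("domestic animals such as cats and dogs"))
print(extract_isa("animals such as cats, dogs, and birds"))
```

Because the sketch has no chunker, a sentence such as "I like animals such as cats" would wrongly yield "like animals" as the super-concept, which is one reason purely syntactic patterns need the semantic checks discussed on the following slides.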
"… animals other than cats such as dogs …" Candidate extractions: dogs isA animals; dogs isA cats
"… household pets other than animals such as reptiles, aquarium fish …" Candidate extractions: reptiles isA household pets; reptiles isA animals
Iterative Information Extraction Syntactic Knowledge patterns
Semantic Drifts A drifting point 10%-20% Precision Improvement
Probase Concepts (2.7 million+) • Examples: Basic watercolor techniques; countries; Celebrity wedding dress designers • Probase isA error rate: <1% @1 and <10% for random pairs
A traditional taxonomy
“python” in Probase “python”
# of descendants (WordNet)
Transitivity does not always hold furniture plastic material chair film
# of descendants (early version of Probase)
Probase Scores • Typicality • Vagueness • Representativeness foundation for inferencing • Ambiguity • Similarity
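Typicality
• $P(e \mid c) = \dfrac{n(c, e) + \alpha}{\sum_{e_i \in c} n(c, e_i) + \alpha N}$
• $P(c \mid e) = \dfrac{n(c, e) + \alpha}{\sum_{e \in c_i} n(c_i, e) + \alpha N}$
• "robin" is a more typical bird than a "penguin": $p(\text{robin} \mid \text{bird}) > p(\text{penguin} \mid \text{bird})$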
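The smoothed typicality P(e|c) = (n(c, e) + α) / (Σ n(c, e_i) + αN) can be sketched as below. The co-occurrence counts are made up for illustration, and N is assumed here to be the number of distinct entities in the count table; the slide does not define it.

```python
def p_entity_given_concept(counts, concept, entity, alpha=1.0):
    """Smoothed typicality P(e | c) from isA pair counts n(c, e).

    counts maps (concept, entity) to the number of times the isA pair was seen.
    N is taken to be the number of distinct entities (an assumption).
    """
    n_entities = len({e for (_, e) in counts})
    concept_total = sum(n for (c, _), n in counts.items() if c == concept)
    n_ce = counts.get((concept, entity), 0)
    return (n_ce + alpha) / (concept_total + alpha * n_entities)


# Hypothetical counts, not taken from Probase.
counts = {
    ("bird", "robin"): 80,
    ("bird", "penguin"): 20,
    ("animal", "penguin"): 50,
}

p_robin = p_entity_given_concept(counts, "bird", "robin")
p_penguin = p_entity_given_concept(counts, "bird", "penguin")
print(p_robin > p_penguin)
```

P(c|e) follows the same shape with the roles of concept and entity swapped in the denominator.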
Representativeness (basic level of categorization)
• $\max_c \; p(c \mid e) \cdot p(e \mid c)$
• Microsoft: company (high typicality p(c|e)) … software company … largest OS vendor (high typicality p(e|c))
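A small sketch of picking the basic-level concept as the one maximizing p(c|e) · p(e|c). The counts are invented for the Microsoft example and no smoothing is applied; both are simplifying assumptions.

```python
def representativeness(counts, entity):
    """Score each concept c for an entity e by p(c | e) * p(e | c).

    counts maps concept -> {entity: n(c, e)}. Unsmoothed maximum-likelihood
    estimates are used here for brevity.
    """
    entity_total = sum(ents.get(entity, 0) for ents in counts.values())
    scores = {}
    for concept, ents in counts.items():
        n_ce = ents.get(entity, 0)
        if n_ce == 0:
            continue
        p_c_given_e = n_ce / entity_total
        p_e_given_c = n_ce / sum(ents.values())
        scores[concept] = p_c_given_e * p_e_given_c
    return scores


# Hypothetical counts: "company" is too general, "largest OS vendor" too specific.
counts = {
    "company": {"microsoft": 1000, "other companies": 99000},
    "software company": {"microsoft": 500, "other software companies": 1500},
    "largest OS vendor": {"microsoft": 5},
}

scores = representativeness(counts, "microsoft")
basic_level = max(scores, key=scores.get)
print(basic_level)
```

With these numbers "company" has high p(c|e) but very low p(e|c), "largest OS vendor" has the reverse, and the product favors the intermediate concept.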
Vagueness
• Vague concepts: key players, factors, items, things, reasons, …
• $V(C) = \dfrac{\left|\{\, e_i \mid P(C \mid e_i) \ge c,\ \forall e_i \in C \,\}\right|}{N(C)}$
• (Do people whom you regard highly regard you highly?)
Ambiguity
• Probase defines 3 levels of ambiguity
  – Level 0 (1 sense): apple juice
  – Level 1 (2 or more related senses): Google
  – Level 2 (2 or more senses): python
• Concepts form clusters, clusters form senses (through isA relation)
• Example concepts: region, creature, crop, food, animal, country, city, state, fruit, vegetable, meat, predator
Similarity
• microsoft, ibm: 0.933
• google, apple: 0.378 ??
• $\mathrm{sim}(t_1, t_2) = \max_{x,y} \mathrm{cosine}\bigl(c_x(t_1), c_y(t_2)\bigr)$
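The sense-aware similarity, the maximum cosine over all pairs of sense clusters of the two terms, might be sketched as follows. The concept vectors are hypothetical and stand in for the concept distributions of each sense cluster.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity(senses_1, senses_2):
    """sim(t1, t2) = max over sense pairs (x, y) of cosine(c_x(t1), c_y(t2))."""
    return max(cosine(a, b) for a in senses_1 for b in senses_2)


# Hypothetical sense clusters: one concept vector per sense of each term.
apple_senses = [
    {"company": 0.7, "brand": 0.3},
    {"fruit": 0.8, "food": 0.2},
]
google_senses = [
    {"company": 0.6, "search engine": 0.4},
]

sim = similarity(apple_senses, google_senses)
print(round(sim, 3))
```

Taking the maximum over sense pairs keeps the unrelated fruit sense of "apple" from dragging down its similarity to "google", which a single averaged concept vector per term would not do.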
Applications • Query Understanding – Head/Modifier/Constraint detection • … • SRL (semantic role labeling) with FrameNet – e.g. Tom (agent) broke the window (patient).
Example: FrameNet
Frame: Apply_heat (FE1 FE2 FE3 FE4)

Concept              P(c|FE)    Instance       P(w|FE)
heat source          0.19       Stove          0.00019
Large metal          0.04       Radiator*      0.00015
Kitchen appliance    0.02       Oven           0.00015
                                Grill*         0.00014
                                Heater*        0.00013
                                Fireplace*     0.00013
                                Lamp*          0.00013
                                Hair dryer*    0.00012
                                Candle*        0.00012
Example: Head and Modifier Detection • toy kid • cover iphone (accessory, smart phone) • seattle hotel jobs
When concepts are too specific
• Example: mobile windows operating system / head; large and influential software vendor / modifier
• No generalization power
• $\text{million}^2$ patterns
When concepts are too general
• (Device/Head, Company/Modifier): head "modem", modifier "comcast"; head "wireless router", modifier "comcast"; …
• Conflict: (Device/Modifier, Company/Head): head "netflix", modifier "touchpad"; head "skype", modifier "windows phone"; …
Knowledge Bases: WordNet, Wikipedia, Freebase, Probase

Cat
• WordNet: Feline; Felid; Adult male; Man; Gossip; Gossiper; Gossipmonger; Rumormonger; Rumourmonger; Newsmonger; Woman; Adult female; Stimulant; Stimulant drug; Excitant; Tracked vehicle; ...
• Wikipedia: Domesticated animals; Cats; Felines; Invasive animal species; Cosmopolitan species; Sequenced genomes; Animals described in 1758; ...
• Freebase: TV episode; Creative work; Musical recording; Organism classification; Dated location; Musical release; Book; Musical album; Film character; Publication; Character species; Top level domain; Animal; Domesticated animal; ...
• Probase: Animal; Pet; Species; Mammal; Small animal; Thing; Mammalian species; Small pet; Animal species; Carnivore; Domesticated animal; Companion animal; Exotic pet; Vertebrate; ...

IBM
• WordNet: N/A
• Wikipedia: Companies listed on the New York Stock Exchange; IBM; Cloud computing providers; Companies based in Westchester County, New York; Multinational companies; Software companies of the United States; Top 100 US Federal Contractors; ...
• Freebase: Business operation; Issuer; Literature subject; Venture investor; Competitor; Software developer; Architectural structure owner; Website owner; Programming language designer; Computer manufacturer/brand; Customer; Operating system developer; Processor manufacturer; ...
• Probase: Company; Vendor; Client; Corporation; Organization; Manufacturer; Industry leader; Firm; Brand; Partner; Large company; Fortune 500 company; Technology company; Supplier; Software vendor; Global company; ...

Language
• WordNet: Communication; Auditory communication; Word; Higher cognitive process; Faculty; Mental faculty; Module; Text; Textual matter; ...
• Wikipedia: Languages; Linguistics; Human communication; Human skills; Wikipedia articles with ASCII art
• Freebase: Employer; Written work; Musical recording; Musical artist; Musical album; Literature subject; Query; Periodical; Type profile; Journal; Quotation subject; Type/domain equivalent topic; Broadcast genre; Periodical subject; Video game content descriptor; ...
• Probase: Instance of: Cognitive function; Knowledge; Cultural factor; Cultural barrier; Cognitive process; Cognitive ability; Cultural difference; Ability; Characteristic. Attribute of: Film; Area; Book; Publication; Magazine; Country; Work; Program; Media; City; ...
What can Probase do? enable understanding and make up for the lack of depth
Knowledge bases • Covers every topic? • Knows about everything in a topic? • Contains rich connections? • Breadth and density enable understanding
Concept Learning India China Brazil emerging market country
taste body smell wine
Understanding Web Tables • Example attributes: website, president, city, motto, state, type, director • [Figure: # of concepts vs. # of attributes]
population china country
collector of fine china earthenware
Bayesian
• For a mixture of instances and properties: Noisy-Or model
  $P(c \mid t_l) = 1 - \bigl(1 - P(c \mid t_l, z_l = 1)\bigr)\bigl(1 - P(c \mid t_l, z_l = 0)\bigr)$
  where $z_l = 1$ indicates $t_l$ is an entity, $z_l = 0$ indicates $t_l$ is a property
• Bayesian rule gives:
  $P(c \mid T) \propto P(c) \prod_{l}^{L} P(t_l \mid c) \propto \dfrac{\prod_{l}^{L} P(c \mid t_l)}{P(c)^{L-1}}$
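A minimal sketch of the two steps on this slide: the noisy-or combination of the entity and property readings of a term, and the naive Bayes combination P(c|T) ∝ Π P(c|t_l) / P(c)^(L−1) over the L terms. All probabilities below are invented for illustration, and concepts missing for a term are treated as probability 0, which a practical system would likely smooth.

```python
def noisy_or(p_as_entity, p_as_property):
    """P(c | t) when t may be an entity (z = 1) or a property (z = 0) of c."""
    return 1.0 - (1.0 - p_as_entity) * (1.0 - p_as_property)


def conceptualize(term_probs, prior):
    """Normalized P(c | T) ∝ prod_l P(c | t_l) / P(c) ** (L - 1).

    term_probs: one dict {concept: P(c | t_l)} per term.
    prior: {concept: P(c)}.
    """
    n_terms = len(term_probs)
    scores = {}
    for concept, p_c in prior.items():
        product = 1.0
        for probs in term_probs:
            product *= probs.get(concept, 0.0)  # unsmoothed: missing means 0
        scores[concept] = product / p_c ** (n_terms - 1)
    total = sum(scores.values())
    if total == 0.0:
        return scores
    return {concept: score / total for concept, score in scores.items()}


# Hypothetical values for the short text "india china".
prior = {"country": 0.2, "emerging market": 0.02, "earthenware": 0.05}
term_probs = [
    {"country": 0.6, "emerging market": 0.3},                       # india
    {"country": 0.5, "emerging market": 0.2, "earthenware": 0.3},   # china
]

post = conceptualize(term_probs, prior)
best = max(post, key=post.get)
print(best)
```

Dividing by P(c)^(L−1) discounts concepts that are likely for almost any term, so the more specific concept shared by both terms can outrank a generic one.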
iPad apple company device
[Figure: co-occurrence network around "ipad" and "apple". Panels: cooccur1, cooccur2; no filtering vs. filtering. Shared co-occurring terms include iphone, ipod, google, mac, ipod touch, app, apps, microsoft, adobe, acer, cell phones, android, tablet, 3g, iphone os, steve jobs. Concept clusters shown: fruit, food; company, product; device, tablet.]
Modeling Co-occurrence Probase + LDA model Wikipedia Concept Topic Word
Recommend
More recommend