
Knowledge Vault: a web-scale approach to probabilistic knowledge fusion



  1. Knowledge Vault: a web-scale approach to probabilistic knowledge fusion
     Luna Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, Wei Zhang
     Google (Machine Intelligence group)
     KV @ KDD 2014

  2. Outline of the talk: (1) Knowledge Graph, (2) Knowledge Vault, (3) Fact mining from the web, (4) Fact mining from graphs, (5) Knowledge Fusion

  3. A Knowledge Graph is a multi-graph where nodes = entities and edges = relations.
     [Figure: example graph with entities NY Knicks, LA Lakers, Kobe Bryant, Pau Gasol and relations opponent, teamInLeague, playFor, playInLeague, teammate.]
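
The multi-graph view on slide 3 can be illustrated with a small sketch that stores edges as (subject, relation, object) triples. The entity and relation names come from the slide, but which entities each edge connects is an assumption made for illustration, and `build_graph` and `objects` are hypothetical helper names.

```python
from collections import defaultdict

# Illustrative triples; the endpoints of each edge are assumed, not read off the slide.
TRIPLES = [
    ("Kobe Bryant", "playFor", "LA Lakers"),
    ("Pau Gasol", "playFor", "LA Lakers"),
    ("Kobe Bryant", "teammate", "Pau Gasol"),
    ("LA Lakers", "opponent", "NY Knicks"),
]

def build_graph(triples):
    """Index triples as subject -> relation -> set of objects (a multi-graph)."""
    graph = defaultdict(lambda: defaultdict(set))
    for s, r, o in triples:
        graph[s][r].add(o)
    return graph

def objects(graph, subject, relation):
    """Return the objects linked to `subject` by `relation` (empty set if none)."""
    return graph.get(subject, {}).get(relation, set())

graph = build_graph(TRIPLES)
print(objects(graph, "Kobe Bryant", "playFor"))
```

Because two nodes may be joined by several differently labelled edges, the index is keyed by relation as well as by subject.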

  4. Example Knowledge Graphs: Google’s KG, Facebook’s Entity Graph, Microsoft’s Satori, Walmart’s Kosmix

  5. Freebase (FB) is created by fusing structured data sources and human contributions.
     [Figure labels: MusicBrainz, Wikipedia, TVDB, Geo; companies, products, people, movies, places; FB.]
     KV Talk at KDD, New York, August 25, 2014

  6. The long tail of knowledge
     • FB is very large (40M nodes, 637M edges).
     • But it is still very incomplete:
       • We are missing many edges (facts). This is the focus of this talk.
           Relation         % unknown in Freebase
           Profession       68%
           Place of birth   71%
           Nationality      75%
           Education        91%
           Spouse           92%
           Parents          94%
       • We are also missing many nodes (entities).
       • We are also missing many edge types (schema).

  7. Outline of the talk: (1) Knowledge Graph, (2) Knowledge Vault, (3) Fact mining from the web, (4) Fact mining from graphs, (5) Knowledge Fusion

  8. From Knowledge Graph to Knowledge Vault
     • There are many groups at Google working on enlarging KG while maintaining high precision.
     • KV is an exploratory research project to investigate other points along the precision-recall curve.
     • KV automatically extracts facts from public web sources.
     • KV embraces the inherent uncertainty associated with this process (every fact has associated confidence and provenance info).

  9. Previous projects on automatically building KBs (e.g., NELL, YAGO) predict facts based on text.
     [Figure: the candidate triple <Kobe Bryant, playFor, LA Lakers>? is scored as Pr(<s, r, o> = 1 | D), where D is text such as:]
     • “Kobe Bryant, the franchise player of the Lakers”
     • “Kobe once again saved his team”
     • “Kobe Bryant man of the match for Los Angeles”

  10. KV: Predict new facts based on text AND existing edges in FB.
      [Figure: the candidate triple <Kobe Bryant, playFor, LA Lakers>? is scored as Pr(<s, r, o> = 1 | D), where D includes the existing graph (entities NY Knicks, LA Lakers, Kobe Bryant, Pau Gasol; relations opponent, teamInLeague, playFor, playInLeague, teammate) and text such as:]
      • “Kobe Bryant, the franchise player of the Lakers”
      • “Kobe once again saved his team”
      • “Kobe Bryant man of the match for Los Angeles”

  11. [Figure: pipeline components: Web, Extractors, Priors, Fusion.]

  12. KV is 50x bigger than comparable KBs
      • Total # facts in KV > 2.5B
        • 302M with Prob > 0.9
        • 381M with Prob > 0.7
      • Open IE (e.g., Mausam et al., 2012): 5B assertions (Mausam, Michael Schmitz, personal communication, October 2013)

  13. Uses for KV's uncertain triples
      • Probably false: triples removed from KG
      • Possibly true: triples used as weak signals
      • Possibly false: triples used for error analysis
      • Probably true: triples uploaded to KG

  14. Outline of the talk: (1) Knowledge Graph, (2) Knowledge Vault, (3) Fact mining from the web, (4) Fact mining from graphs, (5) Knowledge Fusion

  15. Fact extraction from the web
      [Figure: webmaster annotations, tables, NL text, and page structure feed Extractors, then Fusion.]

  16. Fact extraction from text (TXT)
      • First identify named entities (entity linkage).
      • Then classify the verb phrase as one of 2000 relations.
      Example: “Patrick Newport, who has been working at IHS Global Insight, noted...” with entity tags PER and ORG, entity IDs /m/201 and /m/101, and relation /people/person/employment.
      The result is a probabilistic triple: Pr(<subject, reln, object> = 1 | text).
      Classifier trained using distant supervision.*
      Details: see e.g. the tutorial by Ralph Grishman (NYU): “Information Extraction: Capabilities and Challenges”, 2012
      * Mintz et al, RANLP 2009
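
The distant-supervision idea on slide 16 can be sketched as follows: a sentence whose linked entity pair already appears in the knowledge base under some relation is treated as a positive training example for that relation. This is a minimal illustration, not the KV pipeline; the entity IDs, the sentences, and the helper `distant_labels` are all hypothetical.

```python
# Hypothetical mini knowledge base of known (subject, relation, object) triples.
KB = {("/m/kobe", "playFor", "/m/lakers")}

# Sentences after entity linkage: (text, subject id, object id). All ids are made up.
SENTENCES = [
    ("Kobe Bryant, the franchise player of the Lakers", "/m/kobe", "/m/lakers"),
    ("Kobe Bryant visited New York", "/m/kobe", "/m/nyc"),
]

def distant_labels(sentences, kb, relation):
    """Label a sentence 1 if its entity pair holds `relation` in the KB, else 0."""
    return [(text, int((s, relation, o) in kb)) for text, s, o in sentences]

labeled = distant_labels(SENTENCES, KB, "playFor")
for text, label in labeled:
    print(label, text)
```

Labels obtained this way are noisy, since a sentence can mention both entities without expressing the relation, which is one reason the output is treated as a probability rather than a hard decision.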

  17. Fact extraction from DOM trees*
      • First identify named entities on the page.
      • Then classify the XPath connecting each entity pair as one of 2000 relations.
      * Cafarella et al, CACM’11

  18. Fact extraction from tables (TBL)*
      [Figure: squares are CVT nodes.]
      * Cafarella et al, VLDB’08

  19. Fact extraction from schema.org annotation (ANO)
      <script type="application/ld+json">
      {"@context" : "http://www.schema.org", "@type" : "Event", "startDate" : "2014-07-26", ...}
      </script>
      • About 20% of webpages have machine-readable annotations of commercial events, products, etc.
      • Automatically map to KG schema.
      • We still need to do entity linking.
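
Annotations like the one on slide 19 can be read with standard-library tools. The sketch below is one possible way to collect JSON-LD blocks from a page, assuming a complete, made-up `Event` object (the slide's own snippet is elided); `JsonLdParser` and `PAGE` are hypothetical names, and mapping to the KG schema and entity linking are not shown.

```python
import json
from html.parser import HTMLParser

class JsonLdParser(HTMLParser):
    """Collect the JSON payload of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._buffer = None  # list of text chunks while inside a JSON-LD script
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.items.append(json.loads("".join(self._buffer)))
            self._buffer = None

# Made-up page with a single annotation.
PAGE = """<html><head><script type="application/ld+json">
{"@context": "http://www.schema.org", "@type": "Event", "startDate": "2014-07-26"}
</script></head></html>"""

parser = JsonLdParser()
parser.feed(PAGE)
event = parser.items[0]
print(event["@type"], event["startDate"])
```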

  20. Combine outputs from all extractors
      • Train a binary classifier on f(t) = [score-txt(t), #txt(t), … ] using distant supervision.
      • Platt scaling to get calibrated probabilities.
      [Figure: webmaster annotations, tables, NL text, and page structure feed Extractors, then Fusion.]
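
Platt scaling, mentioned on slide 20, fits a sigmoid that maps raw classifier scores to probabilities. The sketch below is a simplified version, assuming plain gradient descent on the log loss and no target smoothing; the scores and labels are made up, and `fit_platt` is a hypothetical helper rather than the procedure used for KV.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the mean log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of the log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Made-up raw classifier scores and distant-supervision labels (1 = triple is in the KB).
scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [0, 0, 1, 0, 1, 1]
a, b = fit_platt(scores, labels)
calibrated = [sigmoid(a * s + b) for s in scores]
print([round(p, 2) for p in calibrated])
```

The fitted map is monotone in the score, so calibration changes the probabilities attached to triples but not their ranking.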

  21. ROC for each extraction system [figure]

  22. Confidence of true facts rises given more evidence [figure]

  23. Outline of the talk: (1) Knowledge Graph, (2) Knowledge Vault, (3) Fact mining from the web, (4) Fact mining from graphs, (5) Knowledge Fusion

  24. Mining facts from graphs
      [Figure: pipeline components: Web, Extractors, Priors, Fusion.]

  25. Link prediction using tensor factorization
      • Many methods have been used to fill in missing values in binary matrices, e.g., tensor factorization associates a low-dimensional vector with every row and column.
      [Figure: the example knowledge graph (entities NY Knicks, LA Lakers, Kobe Bryant, Pau Gasol; relations opponent, teamInLeague, playFor, playInLeague, teammate) and a score of the form < , , > over embedding vectors.]
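
One common reading of the three-way product on slide 25 is a trilinear score over the subject, relation, and object vectors, squashed to a probability. The sketch below assumes that form; the exact factorization used in the talk is not recoverable from the slide text, and the 3-dimensional embeddings are made up.

```python
import math

def trilinear_score(e_s, w_r, e_o):
    """Three-way inner product: sum over k of e_s[k] * w_r[k] * e_o[k]."""
    return sum(a * b * c for a, b, c in zip(e_s, w_r, e_o))

def edge_probability(e_s, w_r, e_o):
    """Squash the score to (0, 1) with a logistic function."""
    return 1.0 / (1.0 + math.exp(-trilinear_score(e_s, w_r, e_o)))

# Made-up 3-dimensional embeddings, for illustration only.
kobe = [1.0, 0.5, -1.0]
play_for = [2.0, 0.0, 1.0]
lakers = [1.0, 1.0, -1.0]

p = edge_probability(kobe, play_for, lakers)
print(round(p, 4))
```

In a trained model the vectors would be learned so that observed edges score high and unobserved ones score low.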

  26. (Deep) neural network for link prediction
      - Represent each entity and relation by its own low-dimensional (100D) embedding vector.
      - Stack together, feed into neural net.
      - Train model to maximize log-likelihood of observed positive and negative triples.
      - Outperforms neural tensor model (Socher et al).
      [Figure: network with 2 hidden layers over the example graph (entities NY Knicks, LA Lakers, Kobe Bryant, Pau Gasol, NBA; relations opponent, teamInLeague, playFor, playInLeague, teammate).]
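
The forward pass described on slide 26 can be sketched in a few lines: look up the three embeddings, stack them, and pass them through hidden layers to a sigmoid output. The sketch assumes tanh hidden units, tiny dimensions, and untrained random weights; all names are hypothetical and training is not shown.

```python
import math
import random

random.seed(0)
DIM, HIDDEN = 4, 3  # the slide uses 100-D embeddings; tiny sizes keep the sketch readable

def rand_vec(n):
    return [random.uniform(-0.5, 0.5) for _ in range(n)]

def rand_matrix(rows, cols):
    return [rand_vec(cols) for _ in range(rows)]

def dense(weights, bias, x, activation):
    """One fully connected layer: activation(W x + b)."""
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Untrained, randomly initialised parameters (hypothetical names throughout).
entity_emb = {"Kobe Bryant": rand_vec(DIM), "LA Lakers": rand_vec(DIM)}
relation_emb = {"playFor": rand_vec(DIM)}
W1, b1 = rand_matrix(HIDDEN, 3 * DIM), [0.0] * HIDDEN
W2, b2 = rand_matrix(HIDDEN, HIDDEN), [0.0] * HIDDEN
w_out, b_out = rand_matrix(1, HIDDEN), [0.0]

def predict(s, r, o):
    """Stack the three embeddings and run them through two hidden layers."""
    x = entity_emb[s] + relation_emb[r] + entity_emb[o]  # list concatenation
    h1 = dense(W1, b1, x, math.tanh)
    h2 = dense(W2, b2, h1, math.tanh)
    return dense(w_out, b_out, h2, sigmoid)[0]

p = predict("Kobe Bryant", "playFor", "LA Lakers")
print(p)
```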

  27. Path Ranking Algorithm [Lao et al., EMNLP11]
      Query: CityLocatedInCountry(Pittsburgh) = ?
      [Figure: graph with nodes such as U.S., Japan, Pennsylvania, Pittsburgh, Philadelphia, Harrisburg, Atlanta, Dallas, Tokyo, PPG, Delta and edge types CityInState, CityLocatedInCountry, AtLocation.]
      Logistic regression over typed-path features:
        Feature = Typed Path                                  Feature Value   Weight
        CityInState, CityInState^-1, CityLocatedInCountry     0.8             0.32
        AtLocation^-1, AtLocation, CityLocatedInCountry       0.6             0.20
        …                                                     …               …
      Result: CityLocatedInCountry(Pittsburgh) = U.S., p = 0.58
      Figure courtesy of Tom Mitchell and Partha Talukdar
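
The feature values on slide 27 can be understood as random-walk probabilities: start at the query node, follow the typed path, and measure how much probability mass lands on the candidate answer. The sketch below assumes a uniform walk over a small made-up graph and a single-feature logistic regression; it does not reproduce the slide's numbers, and `neighbors`, `path_feature`, and `pra_score` are hypothetical names.

```python
import math
from collections import defaultdict

# Hypothetical toy graph; edges are (subject, relation, object).
EDGES = [
    ("Pittsburgh", "CityInState", "Pennsylvania"),
    ("Philadelphia", "CityInState", "Pennsylvania"),
    ("Harrisburg", "CityInState", "Pennsylvania"),
    ("Philadelphia", "CityLocatedInCountry", "U.S."),
    ("Harrisburg", "CityLocatedInCountry", "U.S."),
]

def neighbors(node, relation):
    """Follow `relation` forward, or backward if it ends in '^-1'."""
    if relation.endswith("^-1"):
        base = relation[:-3]
        return [s for s, r, o in EDGES if r == base and o == node]
    return [o for s, r, o in EDGES if r == relation and s == node]

def path_feature(source, path, target):
    """Probability that a uniform random walk along `path` from `source` ends at `target`."""
    dist = {source: 1.0}
    for relation in path:
        nxt = defaultdict(float)
        for node, prob in dist.items():
            outs = neighbors(node, relation)
            for out in outs:
                nxt[out] += prob / len(outs)  # mass at dead ends is dropped
        dist = nxt
    return dist.get(target, 0.0)

def pra_score(features, weights, bias=0.0):
    """Logistic regression over path features."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

path = ["CityInState", "CityInState^-1", "CityLocatedInCountry"]
value = path_feature("Pittsburgh", path, "U.S.")
print(value, pra_score([value], [0.32]))
```

Here two of the three cities that share Pittsburgh's state have a known country, so two thirds of the walk's mass reaches U.S.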

  28. Example of paths / rules learned by PRA
      CityLocatedInCountry(city, country): 7 of the 2985 learned paths
        8.04   cityliesonriver, cityliesonriver^-1, citylocatedincountry
        5.42   hasofficeincity^-1, hasofficeincity, citylocatedincountry
        4.98   cityalsoknownas, cityalsoknownas, citylocatedincountry
        2.85   citycapitalofcountry, citylocatedincountry^-1, citylocatedincountry
        2.29   agentactsinlocation^-1, agentactsinlocation, citylocatedincountry
        1.22   statehascapital^-1, statelocatedincountry
        0.66   citycapitalofcountry
      Figure courtesy of Tom Mitchell and Partha Talukdar

  29. PRA is similar in performance to the neural network [figure]

  30. Outline of the talk: (1) Knowledge Graph, (2) Knowledge Vault, (3) Fact mining from the web, (4) Fact mining from graphs, (5) Knowledge Fusion

  31. [Figure: pipeline components: Web, Extractors, Priors, Fusion.]

  32. Fusing web extractions with graph priors

  33. Example: (Barry Richter, studiedAt, UW-Madison)
      • “In the fall of 1989, Richter accepted a scholarship to the University of Wisconsin, where he played for four years and earned numerous individual accolades ...”
      • “The Polar Caps' cause has been helped by the impact of knowledgeable coaches such as Andringa, Byce and former UW teammates Chris Tancill and Barry Richter.”
      → Web extraction confidence: 0.14
      • <Barry Richter, born in, Madison>
      • <Barry Richter, lived in, Madison>
      → Final belief (fused with prior): 0.61

  34. Summary and future work
      • KV has 2.5B triples automatically extracted from the web.
      • Combining web mining and graph mining can improve precision.
      • Work in progress:
        • Discovering new entities
          • Clustering open IE extractions, CIKM 2014
          • Robust wrapper induction for long-tail verticals (work in progress)
        • Discovering new relations
          • Clustering open IE extractions, CIKM 2014
          • “Biperpedia”, VLDB 2014
        • Assessing trustworthiness of web sites: VLDB 2014
        • Common sense fact mining, e.g. “apples” (work in progress)

  35. EXTRA SLIDES

  36. Application 1: Knowledge Panels. Augmenting the presentation with relevant facts.

  37. Application 2: Related Entities
