KNOWLEDGE GRAPH IDENTIFICATION
  1. KNOWLEDGE GRAPH IDENTIFICATION
     Jay Pujara¹, Hui Miao¹, Lise Getoor¹, William Cohen²
     ¹University of Maryland, College Park, US
     ²Carnegie Mellon University
     International Semantic Web Conference, 10/25/2013

  2. Overview
     Problem: Build a knowledge graph from millions of noisy extractions.
     Approach: Knowledge Graph Identification reasons jointly over all facts in the knowledge graph.
     Method: Use probabilistic soft logic to easily specify models and efficiently optimize them.
     Results: State-of-the-art performance on real-world datasets, producing knowledge graphs with millions of facts.

  3. CHALLENGES IN KNOWLEDGE GRAPH CONSTRUCTION

  4. Motivating Problem: New Opportunities
     Internet → Extraction → Knowledge Graph (KG)
     Internet: massive source of publicly available information
     Extraction: cutting-edge IE methods
     Knowledge Graph: structured representation of entities, their labels, and the relationships between them

  5. Motivating Problem: Real Challenges
     Internet → Extraction → Knowledge Graph
     Extraction is difficult, and the result is noisy: it contains many errors and inconsistencies.

  6. NELL: The Never-Ending Language Learner
     • Large-scale IE project (Carlson et al., 2010)
     • Lifelong learning: aims to “read the web”
     • Ontology of known labels and relations
     • Knowledge base contains millions of facts

  7. Examples of NELL errors

  8. Entity co-reference errors
     Kyrgyzstan has many variants:
     • Kyrgystan
     • Kyrgistan
     • Kyrghyzstan
     • Kyrgzstan
     • Kyrgyz Republic

  9. Missing and spurious labels Kyrgyzstan is labeled a bird and a country

  10. Missing and spurious relations Kyrgyzstan’s location is ambiguous – Kazakhstan, Russia and US are included in possible locations

  11. Violations of ontological knowledge
     • Equivalence of co-referent entities (sameAs): SameEntity(Kyrgyzstan, Kyrgyz Republic)
     • Mutual exclusion (disjointWith) of labels: MUT(bird, country)
     • Selectional preferences (domain/range) of relations: RNG(countryLocation, continent)
     Enforcing these constraints requires jointly considering multiple extractions.
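As a concrete illustration of why these checks need joint reasoning, here is a minimal Python sketch that flags mutual-exclusion and range violations. All facts and data structures are hypothetical examples drawn from the slides, not NELL's actual data model.

```python
# Minimal sketch: detecting ontological violations across multiple extractions.
# Facts and predicate names are illustrative examples, not NELL's actual data model.

labels = {("Kyrgyzstan", "bird"), ("Kyrgyzstan", "country")}
relations = {("Kyrgyzstan", "Kazakhstan", "countryLocation")}
mutually_exclusive = {("bird", "country")}      # MUT(bird, country)
range_of = {"countryLocation": "continent"}     # RNG(countryLocation, continent)

def violations(labels, relations):
    """Each violation involves two or more facts, so no single
    extraction can be judged in isolation."""
    found = []
    # Mutual exclusion: an entity cannot carry two disjoint labels.
    for (l1, l2) in mutually_exclusive:
        for (entity, label) in labels:
            if label == l1 and (entity, l2) in labels:
                found.append(("MUT", entity, l1, l2))
    # Range: the object of a relation must carry the relation's range label.
    for (subj, obj, rel) in relations:
        required = range_of.get(rel)
        if required is not None and (obj, required) not in labels:
            found.append(("RNG", rel, obj, required))
    return found
```

Neither violation is detectable from any single extraction alone, which is the motivation for reasoning jointly over all facts.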

  12. KNOWLEDGE GRAPH IDENTIFICATION

  13. Motivating Problem (revised)
     Internet → Large-scale IE → (noisy) Extraction Graph → Joint Reasoning → Knowledge Graph

  14. Knowledge Graph Identification
     Problem: Knowledge Graph ≠ Extraction Graph
     Solution: Knowledge Graph Identification (KGI)
     • Performs graph identification:
       • entity resolution
       • collective classification
       • link prediction
     • Enforces ontological constraints
     • Incorporates multiple uncertain sources

  15. Illustration of KGI: Extractions
     Uncertain Extractions:
     .5: Lbl(Kyrgyzstan, bird)
     .7: Lbl(Kyrgyzstan, country)
     .9: Lbl(Kyrgyz Republic, country)
     .8: Rel(Kyrgyz Republic, Bishkek, hasCapital)

  16. Illustration of KGI: Extraction Graph
     [Figure: the four uncertain extractions above rendered as a graph over the nodes Kyrgyzstan, Kyrgyz Republic, Bishkek, country, and bird, connected by Lbl and Rel(hasCapital) edges]

  17. Illustration of KGI: Ontology + ER
     [Figure: the extraction graph above, annotated with ontology and entity-resolution edges]
     Ontology: Dom(hasCapital, country), Mut(country, bird)
     Entity Resolution: SameEnt(Kyrgyz Republic, Kyrgyzstan)

  18. Illustration of KGI
     [Figure: the annotated extraction graph, and the result after Knowledge Graph Identification: Kyrgyzstan and Kyrgyz Republic merged into one entity labeled country, with Rel(hasCapital) to Bishkek; the spurious bird label is gone]

  19. MODELING KNOWLEDGE GRAPH IDENTIFICATION

  20. Viewing KGI as a probabilistic graphical model
     [Figure: graphical model over the random variables Lbl(Kyrgyzstan, bird), Lbl(Kyrgyzstan, country), Lbl(Kyrgyz Republic, bird), Lbl(Kyrgyz Republic, country), Rel(hasCapital, Kyrgyzstan, Bishkek), and Rel(hasCapital, Kyrgyz Republic, Bishkek)]

  21. Background: Probabilistic Soft Logic (PSL)
     • Templating language for hinge-loss MRFs; very scalable
     • Model specified as a collection of logical formulas, e.g.
       SameEnt(E1, E2) ∧ Lbl(E1, L) ⇒ Lbl(E2, L)
     • Uses a soft-logic formulation:
       • Truth values of atoms relaxed to the [0,1] interval
       • Truth values of formulas derived from the Łukasiewicz t-norm
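The Łukasiewicz relaxation can be sketched in a few lines of Python; the atom truth values at the end are arbitrary illustrative numbers, not values from the paper.

```python
# Łukasiewicz soft-logic operators used by PSL to relax Boolean formulas to [0,1].
def soft_and(x, y):
    """Łukasiewicz t-norm for conjunction."""
    return max(0.0, x + y - 1.0)

def soft_or(x, y):
    """Łukasiewicz t-conorm for disjunction."""
    return min(1.0, x + y)

def soft_implies(body, head):
    """body ⇒ head, i.e. ¬body ∨ head under the relaxation."""
    return min(1.0, 1.0 - body + head)

# Truth value of SameEnt(E1, E2) ∧ Lbl(E1, L) ⇒ Lbl(E2, L)
# for illustrative atom truth values 0.9, 0.7, 0.5:
rule_truth = soft_implies(soft_and(0.9, 0.7), 0.5)  # ≈ 0.9
```

Note that these operators agree with Boolean logic at the endpoints 0 and 1; the relaxation only matters for intermediate truth values.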

  22. Background: PSL Rules to Distributions
     • Rules are grounded by substituting literals into formulas:
       w_EL : SameEnt(Kyrgyzstan, Kyrgyz Republic) ∧ Lbl(Kyrgyzstan, country) ⇒ Lbl(Kyrgyz Republic, country)
     • Each ground rule has a weighted distance to satisfaction derived from the formula’s truth value:
       P(G | E) = (1/Z) exp( −Σ_{r∈R} w_r φ_r(G) )
     • The PSL program can be interpreted as a joint probability distribution over all variables in the knowledge graph, conditioned on the extractions.
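A sketch of the distance to satisfaction and the resulting unnormalized density, assuming linear hinge potentials (PSL also supports squared ones) and a simple rule form with a single body and head truth value:

```python
import math

def distance_to_satisfaction(body, head):
    """Distance to satisfaction of a ground rule body ⇒ head under
    Łukasiewicz semantics: zero when the rule holds, max(0, body - head) otherwise."""
    return max(0.0, body - head)

def unnormalized_density(ground_rules):
    """exp(-Σ_r w_r · φ_r(G)) over (weight, body_truth, head_truth) triples.
    Dividing by the partition function Z would give P(G | E)."""
    return math.exp(-sum(w * distance_to_satisfaction(b, h)
                         for w, b, h in ground_rules))

# One ground rule with weight 1.0 whose body (0.6) is truer than its head (0.5):
density = unnormalized_density([(1.0, 0.6, 0.5)])
```

Because the density only decreases when rules are violated, the most probable knowledge graph is the one minimizing the total weighted distance to satisfaction.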

  23. Background: Finding the best knowledge graph
     • MPE inference solves max_G P(G | E) to find the best KG
     • In PSL, inference is solved by convex optimization
     • Efficient: running time scales with O(|R|)
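To make the optimization concrete, here is a toy instance with a single unknown truth value, solved by brute-force grid search. Real PSL solves the same kind of objective as a convex program; the two rules and their weights below are invented for illustration.

```python
# Toy MPE inference: find the truth value x of Lbl(E, country) that minimizes
# the total weighted distance to satisfaction (equivalently, maximizes P(G | E)).
# Rules and weights are invented for illustration.
def objective(x):
    # Rule 1 (weight 1.0): CandLbl(E, country) with value 0.9 ⇒ Lbl(E, country)
    #   distance = max(0, 0.9 - x)
    # Rule 2 (weight 2.0): Lbl(E, bird) with value 0.5 ⇒ ¬Lbl(E, country)
    #   distance = max(0, 0.5 - (1 - x))
    return 1.0 * max(0.0, 0.9 - x) + 2.0 * max(0.0, 0.5 - (1.0 - x))

# Exhaustive search over a 0.01 grid stands in for convex optimization here.
best_x = min((i / 100 for i in range(101)), key=objective)
```

The candidate extraction pulls x up toward 0.9 while the heavier mutual-exclusion rule pushes it down, and the optimum lands at the point where the second hinge becomes active.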

  24. PSL Rules for the KGI Model

  25. PSL Rules: Uncertain Extractions
     w_CR-T : CandRel_T(E1, E2, R) ⇒ Rel(E1, E2, R)
     w_CL-T : CandLbl_T(E, L) ⇒ Lbl(E, L)
     CandRel_T and CandLbl_T are predicates representing uncertain relation and label extractions from extractor T; Rel and Lbl are relations and labels in the knowledge graph; each source T has its own weight.

  26. PSL Rules: Entity Resolution
     • The ER predicate captures confidence that entities are co-referent
     • Rules require co-referent entities to have the same labels and relations
     • Creates an equivalence class of co-referent entities

  27. PSL Rules: Ontology
     Inverse:
       w_O : Inv(R, S) ∧ Rel(E1, E2, R) ⇒ Rel(E2, E1, S)
     Selectional Preference:
       w_O : Dom(R, L) ∧ Rel(E1, E2, R) ⇒ Lbl(E1, L)
       w_O : Rng(R, L) ∧ Rel(E1, E2, R) ⇒ Lbl(E2, L)
     Subsumption:
       w_O : Sub(L, P) ∧ Lbl(E, L) ⇒ Lbl(E, P)
       w_O : RSub(R, S) ∧ Rel(E1, E2, R) ⇒ Rel(E1, E2, S)
     Mutual Exclusion:
       w_O : Mut(L1, L2) ∧ Lbl(E, L1) ⇒ ¬Lbl(E, L2)
       w_O : RMut(R, S) ∧ Rel(E1, E2, R) ⇒ ¬Rel(E1, E2, S)
     Adapted from Jiang et al., ICDM 2012
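A sketch of how one such rule template is grounded against the running example facts. The tuple representation is invented for this sketch; PSL's actual grounding machinery works differently.

```python
# Grounding the selectional-preference rule
#   w_O : Dom(R, L) ∧ Rel(E1, E2, R) ⇒ Lbl(E1, L)
# against example facts. Data structures are illustrative, not PSL's internals.
dom_facts = {("hasCapital", "country")}
rel_facts = [("Kyrgyz Republic", "Bishkek", "hasCapital")]

def ground_dom_rule(dom_facts, rel_facts):
    """Yield (body_atoms, head_atom) for every substitution matching the body."""
    for (r, l) in dom_facts:
        for (e1, e2, r2) in rel_facts:
            if r2 == r:
                yield ((("Dom", r, l), ("Rel", e1, e2, r)),
                       ("Lbl", e1, l))

groundings = list(ground_dom_rule(dom_facts, rel_facts))
# Each grounding pushes Lbl(Kyrgyz Republic, country) toward truth.
```

This is how the hasCapital extraction ends up supporting the country label in the earlier illustration.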

  28. EVALUATION

  29. Two Evaluation Datasets
                                  LinkedBrainz                      NELL
     Description                  Community-supplied data about     Real-world IE system extracting
                                  musical artists, labels, and      general facts from the WWW
                                  creative works
     Noise                        Realistic synthetic noise         Imperfect extractors and
                                                                    ambiguous web pages
     Candidate Facts              810K                              1.3M
     Unique Labels and Relations  27                                456
     Ontological Constraints      49                                67.9K

  30. LinkedBrainz dataset for KGI
     Mapping to the FRBR/FOAF ontology:
     DOM  → rdfs:domain
     RNG  → rdfs:range
     INV  → owl:inverseOf
     SUB  → rdfs:subClassOf
     RSUB → rdfs:subPropertyOf
     MUT  → owl:disjointWith
     [Figure: ontology fragment over mo:Release, mo:Label, mo:Record, mo:MusicalArtist, mo:Track, mo:Signal, mo:SoloMusicArtist, and mo:MusicGroup, linked by mo:label, mo:record, mo:track, mo:published_as, foaf:maker, and foaf:made]

  31. Adding noise to LinkedBrainz
     Add realistic noise to LinkedBrainz data:
     Error Type     Erroneous Data
     Co-reference   User misspells artist
     Label          User swaps artist and album fields
     Relation       User omits or adds spurious albums for artist
     Reliability    Gaussian noise on truth value of information
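The reliability noise, for example, might be injected like this. The standard deviation, the clipping to [0, 1], and the fact representation are all assumptions of this sketch; the slides do not specify them.

```python
import random

def add_reliability_noise(fact_scores, sigma=0.1, seed=42):
    """Perturb each fact's truth value with Gaussian noise, clipped back to [0, 1].
    sigma, the seed, and the clipping behavior are assumptions for this sketch."""
    rng = random.Random(seed)
    return {fact: min(1.0, max(0.0, score + rng.gauss(0.0, sigma)))
            for fact, score in fact_scores.items()}

noisy = add_reliability_noise({"Lbl(a, country)": 0.9, "Lbl(a, bird)": 0.1})
```

Clipping keeps the perturbed scores valid soft truth values so they can be fed directly to the PSL candidate rules.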

  32. LinkedBrainz experiments
     Comparisons:
     Baseline      Use noisy truth values as fact scores
     PSL-EROnly    Only apply rules for Entity Resolution
     PSL-OntOnly   Only apply rules for Ontological reasoning
     PSL-KGI       Apply the full Knowledge Graph Identification model

                  AUC     Precision  Recall  F1 at .5  Max F1
     Baseline     0.672   0.946      0.477   0.634     0.788
     PSL-EROnly   0.797   0.953      0.558   0.703     0.831
     PSL-OntOnly  0.753   0.964      0.605   0.743     0.832
     PSL-KGI      0.901   0.970      0.714   0.823     0.919

  33. NELL Evaluation: two settings
     Target Set (Jiang, ICDM12): restrict inference to a subset of the KG
     • Closed-world model
     • Uses a target set: a subset of the KG derived from the 2-hop neighborhood
     • Excludes trivially satisfied variables
     Complete: infer the full knowledge graph
     • Open-world model
     • Considers all possible entities, relations, and labels
     • Inference assigns a truth value to each variable

  34. NELL experiments: Target Set
     Task: compute truth values of a target set derived from the evaluation data
     Comparisons:
     Baseline   Average confidences of extractors for each fact in the NELL candidates
     NELL       Evaluate NELL’s promotions (on the full knowledge graph)
     MLN        Method of (Jiang, ICDM12): estimates marginal probabilities with MC-SAT
     PSL-KGI    Apply the full Knowledge Graph Identification model
     Running Time: inference completes in 10 seconds, producing values for 25K facts

                      AUC    F1
     Baseline         .873   .828
     NELL             .765   .673
     MLN (Jiang, 12)  .899   .836
     PSL-KGI          .904   .853
