

LARGE-SCALE KNOWLEDGE GRAPH IDENTIFICATION USING PSL. Jay Pujara¹, Hui Miao¹, Lise Getoor¹, William Cohen². ¹University of Maryland, College Park, US; ²Carnegie Mellon University. AAAI Symposium on Semantics for Big Data, 11/16/2013


  1. LARGE-SCALE KNOWLEDGE GRAPH IDENTIFICATION USING PSL. Jay Pujara¹, Hui Miao¹, Lise Getoor¹, William Cohen². ¹University of Maryland, College Park, US; ²Carnegie Mellon University. AAAI Symposium on Semantics for Big Data, 11/16/2013

  2. Overview
     • Problem: Build a Knowledge Graph from millions of noisy extractions
     • Approach: Knowledge Graph Identification reasons jointly over all facts in the knowledge graph
     • Method: Use probabilistic soft logic to easily specify models and efficiently optimize them
     • Results: State-of-the-art performance on real-world datasets, producing knowledge graphs with millions of facts

  3. CHALLENGES IN KNOWLEDGE GRAPH CONSTRUCTION

  4. Motivating Problem: New Opportunities
     • Internet: Massive source of publicly available information
     • Extraction: Cutting-edge IE methods
     • Knowledge Graph (KG): Structured representation of entities, their labels and the relationships between them

  5. Motivating Problem: Real Challenges
     • Internet
     • Extraction: Difficult!
     • Knowledge Graph: Noisy! Contains many errors and inconsistencies

  6. NELL: The Never-Ending Language Learner
     • Large-scale IE project (Carlson et al., 2010)
     • Lifelong learning: aims to “read the web”
     • Ontology of known labels and relations
     • Knowledge base contains millions of facts

  7. Examples of NELL errors

  8. Entity co-reference errors
     Kyrgyzstan has many variants:
     • Kyrgystan
     • Kyrgistan
     • Kyrghyzstan
     • Kyrgzstan
     • Kyrgyz Republic

  9. Missing and spurious labels: Kyrgyzstan is labeled a bird and a country

  10. Missing and spurious relations: Kyrgyzstan’s location is ambiguous; Kazakhstan, Russia and US are included in possible locations

  11. Violations of ontological knowledge
      • Equivalence of co-referent entities (sameAs)
        • SameEntity(Kyrgyzstan, Kyrgyz Republic)
      • Mutual exclusion (disjointWith) of labels
        • MUT(bird, country)
      • Selectional preferences (domain/range) of relations
        • RNG(countryLocation, continent)
      Enforcing these constraints requires jointly considering multiple extractions

  12. KNOWLEDGE GRAPH IDENTIFICATION

  13. Motivating Problem (revised)
      [Figure: pipeline with labels Internet, Large-scale IE, (noisy) Extraction Graph, Joint Reasoning, Knowledge Graph]

  14. Knowledge Graph Identification
      Problem: [Figure: diagram with labels Extraction Graph, Knowledge Graph Identification, Knowledge Graph]
      Solution: Knowledge Graph Identification (KGI)
      • Performs graph identification:
        • entity resolution
        • collective classification
        • link prediction
      • Enforces ontological constraints
      • Incorporates multiple uncertain sources

  15. Illustration of KGI: Extractions
      Uncertain Extractions:
      • .5: Lbl(Kyrgyzstan, bird)
      • .7: Lbl(Kyrgyzstan, country)
      • .9: Lbl(Kyrgyz Republic, country)
      • .8: Rel(Kyrgyz Republic, Bishkek, hasCapital)

  16. Illustration of KGI: Extraction Graph
      Uncertain Extractions:
      • .5: Lbl(Kyrgyzstan, bird)
      • .7: Lbl(Kyrgyzstan, country)
      • .9: Lbl(Kyrgyz Republic, country)
      • .8: Rel(Kyrgyz Republic, Bishkek, hasCapital)
      [Figure: Extraction Graph with nodes Kyrgyzstan, Kyrgyz Republic, Bishkek, country, bird and edges Lbl, Rel(hasCapital)]

  17. Illustration of KGI: Ontology + ER
      Uncertain Extractions:
      • .5: Lbl(Kyrgyzstan, bird)
      • .7: Lbl(Kyrgyzstan, country)
      • .9: Lbl(Kyrgyz Republic, country)
      • .8: Rel(Kyrgyz Republic, Bishkek, hasCapital)
      Ontology:
      • Dom(hasCapital, country)
      • Mut(country, bird)
      Entity Resolution:
      • SameEnt(Kyrgyz Republic, Kyrgyzstan)
      [Figure: (Annotated) Extraction Graph with nodes Kyrgyzstan, Kyrgyz Republic, Bishkek, country, bird and edges Lbl, Rel(hasCapital), SameEnt, Dom]

  18. Illustration of KGI
      Uncertain Extractions:
      • .5: Lbl(Kyrgyzstan, bird)
      • .7: Lbl(Kyrgyzstan, country)
      • .9: Lbl(Kyrgyz Republic, country)
      • .8: Rel(Kyrgyz Republic, Bishkek, hasCapital)
      Ontology:
      • Dom(hasCapital, country)
      • Mut(country, bird)
      Entity Resolution:
      • SameEnt(Kyrgyz Republic, Kyrgyzstan)
      [Figure: (Annotated) Extraction Graph with nodes Kyrgyzstan, Kyrgyz Republic, Bishkek, country, bird and edges Lbl, Rel(hasCapital), SameEnt, Dom]
      [Figure: After Knowledge Graph Identification, with nodes Kyrgyzstan, Kyrgyz Republic, Bishkek, country and edges Lbl, Rel(hasCapital)]

  19. MODELING KNOWLEDGE GRAPH IDENTIFICATION

  20. Viewing KGI as a probabilistic graphical model
      [Figure: graphical model over the variables]
      • Lbl(Kyrgyzstan, bird)
      • Lbl(Kyrgyzstan, country)
      • Lbl(Kyrgyz Republic, bird)
      • Lbl(Kyrgyz Republic, country)
      • Rel(hasCapital, Kyrgyzstan, Bishkek)
      • Rel(hasCapital, Kyrgyz Republic, Bishkek)

  21. Background: Probabilistic Soft Logic (PSL)
      • Templating language for hinge-loss MRFs, very scalable!
      • Model specified as a collection of logical formulas
        $\mathrm{SameEnt}(E_1, E_2) \,\tilde{\wedge}\, \mathrm{Lbl}(E_1, L) \Rightarrow \mathrm{Lbl}(E_2, L)$
      • Uses soft-logic formulation
        • Truth values of atoms relaxed to [0,1] interval
        • Truth values of formulas derived from Łukasiewicz t-norm
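
To make the soft-logic formulation concrete, here is a minimal Python sketch of the standard Łukasiewicz operators applied to the label-propagation rule on this slide. The function names and the truth values 0.9, 0.7 and 0.4 are illustrative assumptions, not taken from the slides or from the PSL library.

```python
def l_and(a: float, b: float) -> float:
    """Lukasiewicz t-norm (soft conjunction)."""
    return max(0.0, a + b - 1.0)


def l_or(a: float, b: float) -> float:
    """Lukasiewicz t-conorm (soft disjunction)."""
    return min(1.0, a + b)


def l_not(a: float) -> float:
    """Soft negation."""
    return 1.0 - a


def l_implies(body: float, head: float) -> float:
    """Soft implication, equal to l_or(l_not(body), head)."""
    return min(1.0, 1.0 - body + head)


# Hypothetical truth values for one grounding of
# SameEnt(E1, E2) AND Lbl(E1, L) => Lbl(E2, L).
same_ent = 0.9
lbl_e1 = 0.7
lbl_e2 = 0.4

body = l_and(same_ent, lbl_e1)       # roughly 0.6
rule_truth = l_implies(body, lbl_e2)  # roughly 0.8
print(f"body = {body:.2f}, rule truth = {rule_truth:.2f}")
```

With Boolean inputs (0 or 1) these operators reduce to ordinary conjunction, disjunction, negation and implication, which is why the same rule syntax can be read both ways.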

  22. Background: PSL Rules to Distributions
      • Rules are grounded by substituting literals into formulas
        $w_{EL} : \mathrm{SameEnt}(\text{Kyrgyzstan}, \text{Kyrgyz Republic}) \,\tilde{\wedge}\, \mathrm{Lbl}(\text{Kyrgyzstan}, \text{country}) \Rightarrow \mathrm{Lbl}(\text{Kyrgyz Republic}, \text{country})$
      • Each ground rule has a weighted distance to satisfaction derived from the formula’s truth value
        $P(G \mid E) = \frac{1}{Z} \exp\left[ -\sum_{r \in R} w_r \, \phi_r(G) \right]$
      • The PSL program can be interpreted as a joint probability distribution over all variables in the knowledge graph, conditioned on the extractions
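
As a rough illustration of how a ground rule contributes to the distribution, the sketch below computes a distance to satisfaction as one minus the Łukasiewicz truth value of the implication, which simplifies to max(0, body − head), and plugs it into the unnormalized density exp(−Σ w_r φ_r). The weight 2.0 and all truth values are hypothetical; the normalizer Z is not computed.

```python
import math


def l_and(a: float, b: float) -> float:
    """Lukasiewicz soft conjunction."""
    return max(0.0, a + b - 1.0)


def distance_to_satisfaction(body: float, head: float) -> float:
    """1 - truth(body => head) under Lukasiewicz logic."""
    return max(0.0, body - head)


def unnormalized_density(ground_rules) -> float:
    """exp(-sum_r w_r * phi_r) for (weight, body, head) triples; Z is omitted."""
    energy = sum(w * distance_to_satisfaction(b, h) for w, b, h in ground_rules)
    return math.exp(-energy)


# One grounding of SameEnt(K, KR) AND Lbl(K, country) => Lbl(KR, country).
w_el = 2.0             # hypothetical rule weight
same_ent = 0.9         # hypothetical truth value of SameEnt
lbl_k_country = 0.7    # hypothetical truth value of Lbl(Kyrgyzstan, country)
lbl_kr_country = 0.4   # hypothetical truth value of Lbl(Kyrgyz Republic, country)

body = l_and(same_ent, lbl_k_country)
phi = distance_to_satisfaction(body, lbl_kr_country)
density = unnormalized_density([(w_el, body, lbl_kr_country)])
print(f"phi = {phi:.3f}, unnormalized density = {density:.3f}")
```

Raising Lbl(Kyrgyz Republic, country) to 0.6 or above would drive the distance to zero, so this grounding would no longer penalize the assignment.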

  23. Background: Finding the best knowledge graph
      • MPE inference solves $\max_G P(G)$ to find the best KG
      • In PSL, inference solved by convex optimization
      • Efficient: running time scales with $O(|R|)$
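
Because Z is constant, maximizing the density is the same as minimizing the weighted sum of hinge-shaped distances to satisfaction, which is a convex problem over [0,1] truth values. The sketch below minimizes a toy objective with projected subgradient descent; this is a stand-in chosen for brevity and is not the optimizer PSL itself uses. The four ground rules, their weights (1, 1, 2, 5), the choice to hold Lbl(Kyrgyz Republic, country) fixed at 0.9, and SameEnt = Mut = 1 are all assumptions made for illustration.

```python
import math

# Each ground rule is a hinge term: weight * max(0, const + sum(coef * y[var])).
# Free variables: y["bird"] = Lbl(Kyrgyzstan, bird), y["country"] = Lbl(Kyrgyzstan, country).
HINGES = [
    (1.0, 0.5, {"bird": -1.0}),                  # CandLbl(K, bird)=0.5 => Lbl(K, bird)
    (1.0, 0.7, {"country": -1.0}),               # CandLbl(K, country)=0.7 => Lbl(K, country)
    (2.0, 0.9, {"country": -1.0}),               # SameEnt AND Lbl(KR, country)=0.9 => Lbl(K, country)
    (5.0, -1.0, {"bird": 1.0, "country": 1.0}),  # Mut AND Lbl(K, bird) => NOT Lbl(K, country)
]


def hinge_value(const, coefs, y):
    """Linear expression inside the hinge for the current assignment."""
    return const + sum(a * y[v] for v, a in coefs.items())


def objective(y):
    """Weighted sum of distances to satisfaction (the energy to minimize)."""
    return sum(w * max(0.0, hinge_value(c, coefs, y)) for w, c, coefs in HINGES)


def mpe(iterations=20000, step0=0.05):
    """Projected subgradient descent on [0,1]^n, returning the best iterate seen."""
    y = {"bird": 0.5, "country": 0.5}
    best, best_val = dict(y), objective(y)
    for t in range(1, iterations + 1):
        grad = {v: 0.0 for v in y}
        for w, c, coefs in HINGES:
            if hinge_value(c, coefs, y) > 0.0:  # only violated rules contribute
                for v, a in coefs.items():
                    grad[v] += w * a
        step = step0 / math.sqrt(t)             # diminishing step size
        for v in y:
            y[v] = min(1.0, max(0.0, y[v] - step * grad[v]))  # project onto [0,1]
        val = objective(y)
        if val < best_val:
            best, best_val = dict(y), val
    return best, best_val


solution, value = mpe()
print(solution, value)
```

In this toy setting the minimizer sits near Lbl(Kyrgyzstan, country) = 0.9 and Lbl(Kyrgyzstan, bird) = 0.1: the evidence from the co-referent entity pulls the country label up, and mutual exclusion pushes the bird label down.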

  24. PSL Rules for the KGI Model

  25. PSL Rules: Uncertain Extractions
      $w_{CR\text{-}T} : \mathrm{CandRel}_T(E_1, E_2, R) \Rightarrow \mathrm{Rel}(E_1, E_2, R)$
      $w_{CL\text{-}T} : \mathrm{CandLbl}_T(E, L) \Rightarrow \mathrm{Lbl}(E, L)$
      • $w_{CR\text{-}T}$: Weight for source T (relations)
      • $\mathrm{CandRel}_T$: Predicate representing uncertain relation extraction from extractor T
      • $\mathrm{Rel}$: Relation in Knowledge Graph
      • $w_{CL\text{-}T}$: Weight for source T (labels)
      • $\mathrm{CandLbl}_T$: Predicate representing uncertain label extraction from extractor T
      • $\mathrm{Lbl}$: Label in Knowledge Graph

  26. PSL Rules: Entity Resolution
      ER predicate captures confidence that entities are co-referent
      • Rules require co-referent entities to have the same labels and relations
      • Creates an equivalence class of co-referent entities

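  27. PSL Rules: Ontology
      Inverse:
        $w_O : \mathrm{Inv}(R, S) \,\tilde{\wedge}\, \mathrm{Rel}(E_1, E_2, R) \Rightarrow \mathrm{Rel}(E_2, E_1, S)$
      Selectional Preference:
        $w_O : \mathrm{Dom}(R, L) \,\tilde{\wedge}\, \mathrm{Rel}(E_1, E_2, R) \Rightarrow \mathrm{Lbl}(E_1, L)$
        $w_O : \mathrm{Rng}(R, L) \,\tilde{\wedge}\, \mathrm{Rel}(E_1, E_2, R) \Rightarrow \mathrm{Lbl}(E_2, L)$
      Subsumption:
        $w_O : \mathrm{Sub}(L, P) \,\tilde{\wedge}\, \mathrm{Lbl}(E, L) \Rightarrow \mathrm{Lbl}(E, P)$
        $w_O : \mathrm{RSub}(R, S) \,\tilde{\wedge}\, \mathrm{Rel}(E_1, E_2, R) \Rightarrow \mathrm{Rel}(E_1, E_2, S)$
      Mutual Exclusion:
        $w_O : \mathrm{Mut}(L_1, L_2) \,\tilde{\wedge}\, \mathrm{Lbl}(E, L_1) \Rightarrow \tilde{\neg}\, \mathrm{Lbl}(E, L_2)$
        $w_O : \mathrm{RMut}(R, S) \,\tilde{\wedge}\, \mathrm{Rel}(E_1, E_2, R) \Rightarrow \tilde{\neg}\, \mathrm{Rel}(E_1, E_2, S)$
      Adapted from Jiang et al., ICDM 2012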

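  28. Probability Distribution over KGs
      $P(G \mid E) = \frac{1}{Z} \exp\left[ -\sum_{r \in R} w_r \, \phi_r(G) \right]$
      $\mathrm{CandLbl}_T(\text{kyrgyzstan}, \text{bird}) \Rightarrow \mathrm{Lbl}(\text{kyrgyzstan}, \text{bird})$
      $\mathrm{Mut}(\text{bird}, \text{country}) \,\tilde{\wedge}\, \mathrm{Lbl}(\text{kyrgyzstan}, \text{bird}) \Rightarrow \tilde{\neg}\, \mathrm{Lbl}(\text{kyrgyzstan}, \text{country})$
      $\mathrm{SameEnt}(\text{kyrgyz republic}, \text{kyrgyzstan}) \,\tilde{\wedge}\, \mathrm{Lbl}(\text{kyrgyz republic}, \text{country}) \Rightarrow \mathrm{Lbl}(\text{kyrgyzstan}, \text{country})$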

  29. EVALUATION

  30. Two Evaluation Datasets

      | | LinkedBrainz | NELL |
      |---|---|---|
      | Description | Community-supplied data about musical artists, labels, and creative works | Real-world IE system extracting general facts from the WWW |
      | Noise | Realistic synthetic noise | Imperfect extractors and ambiguous web pages |
      | Candidate Facts | 810K | 1.3M |
      | Unique Labels and Relations | 27 | 456 |
      | Ontological Constraints | 49 | 67.9K |

  31. LinkedBrainz dataset for KGI
      Mapping to FRBR/FOAF ontology:

      | Constraint | Ontology property |
      |---|---|
      | DOM | rdfs:domain |
      | RNG | rdfs:range |
      | INV | owl:inverseOf |
      | SUB | rdfs:subClassOf |
      | RSUB | rdfs:subPropertyOf |
      | MUT | owl:disjointWith |

      [Figure: ontology graph with classes mo:Release, mo:Label, mo:Record, mo:MusicalArtist, mo:SoloMusicArtist, mo:MusicGroup, mo:Track, mo:Signal and edges mo:label, mo:record, mo:track, mo:published_as, foaf:maker, foaf:made, inverseOf, subClassOf]

  32. Adding noise to LinkedBrainz
      Add realistic noise to LinkedBrainz data:

      | Error Type | Erroneous Data |
      |---|---|
      | Co-reference | User misspells artist |
      | Label | User swaps artist and album fields |
      | Relation | User omits or adds spurious albums for artist |
      | Reliability | Gaussian noise on truth value of information |

  33. LinkedBrainz experiments
      Comparisons:
      • Baseline: Use noisy truth values as fact scores
      • PSL-EROnly: Only apply rules for Entity Resolution
      • PSL-OntOnly: Only apply rules for Ontological reasoning
      • PSL-KGI: Apply Knowledge Graph Identification model

      | | AUC | Precision | Recall | F1 at .5 | Max F1 |
      |---|---|---|---|---|---|
      | Baseline | 0.672 | 0.946 | 0.477 | 0.634 | 0.788 |
      | PSL-EROnly | 0.797 | 0.953 | 0.558 | 0.703 | 0.831 |
      | PSL-OntOnly | 0.753 | 0.964 | 0.605 | 0.743 | 0.832 |
      | PSL-KGI | 0.901 | 0.970 | 0.714 | 0.823 | 0.919 |
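
As a quick sanity check on how the columns relate, the sketch below recomputes F1 as the harmonic mean of the reported precision and recall. It assumes, which the slide does not state explicitly, that the Precision and Recall columns are measured at the same .5 threshold as the "F1 at .5" column; the helper name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# (precision, recall, reported F1 at .5) as given on the slide.
REPORTED = {
    "Baseline": (0.946, 0.477, 0.634),
    "PSL-EROnly": (0.953, 0.558, 0.703),
    "PSL-OntOnly": (0.964, 0.605, 0.743),
    "PSL-KGI": (0.970, 0.714, 0.823),
}

for name, (p, r, reported) in REPORTED.items():
    print(f"{name:12s} computed F1 = {f1_score(p, r):.4f}, reported = {reported:.3f}")
```

The recomputed values agree with the reported ones to within about 0.001, which is consistent with the inputs having been rounded to three decimals.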

  34. NELL Evaluation: two settings
      Target Set: restrict to a subset of KG (Jiang, ICDM12)
      • Closed-world model
      • Uses a target set: subset of KG
      • Derived from 2-hop neighborhood
      • Excludes trivially satisfied variables
      Complete: Infer full knowledge graph
      • Open-world model
      • All possible entities, relations, labels
      • Inference assigns truth value to each variable
