LARGE-SCALE KNOWLEDGE GRAPH IDENTIFICATION USING PSL. Jay Pujara (1), Hui Miao (1), Lise Getoor (1), William Cohen (2). (1) University of Maryland, College Park, US; (2) Carnegie Mellon University. AAAI Symposium on Semantics for Big Data, 11/16/2013
Overview • Problem: Build a knowledge graph from millions of noisy extractions • Approach: Knowledge Graph Identification reasons jointly over all facts in the knowledge graph • Method: Use probabilistic soft logic to easily specify models and efficiently optimize them • Results: State-of-the-art performance on real-world datasets, producing knowledge graphs with millions of facts
CHALLENGES IN KNOWLEDGE GRAPH CONSTRUCTION
Motivating Problem: New Opportunities • Internet: Massive source of publicly available information • Extraction: Cutting-edge IE methods • Knowledge Graph (KG): Structured representation of entities, their labels and the relationships between them
Motivating Problem: Real Challenges • Internet, Extraction, Knowledge Graph • Difficult! • Noisy! Contains many errors and inconsistencies
NELL: The Never-Ending Language Learner • Large-scale IE project (Carlson et al., 2010) • Lifelong learning: aims to “read the web” • Ontology of known labels and relations • Knowledge base contains millions of facts
Examples of NELL errors
Entity co-reference errors Kyrgyzstan has many variants: • Kyrgystan • Kyrgistan • Kyrghyzstan • Kyrgzstan • Kyrgyz Republic
Missing and spurious labels Kyrgyzstan is labeled a bird and a country
Missing and spurious relations Kyrgyzstan’s location is ambiguous – Kazakhstan, Russia and US are included in possible locations
Violations of ontological knowledge • Equivalence of co-referent entities (sameAs) • SameEntity(Kyrgyzstan, Kyrgyz Republic) • Mutual exclusion (disjointWith) of labels • MUT(bird, country) • Selectional preferences (domain/range) of relations • RNG(countryLocation, continent) • Enforcing these constraints requires jointly considering multiple extractions
KNOWLEDGE GRAPH IDENTIFICATION
Motivating Problem (revised) [Figure: pipeline from the Internet through large-scale IE to a (noisy) extraction graph, with joint reasoning relating the extraction graph to the knowledge graph]
Knowledge Graph Identification • Problem: Extraction Graph ≠ Knowledge Graph • Solution: Knowledge Graph Identification (KGI) • Performs graph identification: • entity resolution • collective classification • link prediction • Enforces ontological constraints • Incorporates multiple uncertain sources
Illustration of KGI: Extractions Uncertain Extractions: .5: Lbl(Kyrgyzstan, bird) .7: Lbl(Kyrgyzstan, country) .9: Lbl(Kyrgyz Republic, country) .8: Rel(Kyrgyz Republic, Bishkek, hasCapital)
Illustration of KGI: Extraction Graph Uncertain Extractions: .5: Lbl(Kyrgyzstan, bird) .7: Lbl(Kyrgyzstan, country) .9: Lbl(Kyrgyz Republic, country) .8: Rel(Kyrgyz Republic, Bishkek, hasCapital) [Figure: extraction graph over the nodes Kyrgyzstan, Kyrgyz Republic, Bishkek, country and bird, with Lbl and Rel(hasCapital) edges]
Illustration of KGI: Ontology + ER Uncertain Extractions: .5: Lbl(Kyrgyzstan, bird) .7: Lbl(Kyrgyzstan, country) .9: Lbl(Kyrgyz Republic, country) .8: Rel(Kyrgyz Republic, Bishkek, hasCapital) Ontology: Dom(hasCapital, country) Mut(country, bird) Entity Resolution: SameEnt(Kyrgyz Republic, Kyrgyzstan) [Figure: (annotated) extraction graph with added SameEnt and Dom edges]
Illustration of KGI Uncertain Extractions: .5: Lbl(Kyrgyzstan, bird) .7: Lbl(Kyrgyzstan, country) .9: Lbl(Kyrgyz Republic, country) .8: Rel(Kyrgyz Republic, Bishkek, hasCapital) Ontology: Dom(hasCapital, country) Mut(country, bird) Entity Resolution: SameEnt(Kyrgyz Republic, Kyrgyzstan) [Figure: (annotated) extraction graph with SameEnt and Dom edges] After Knowledge Graph Identification: [Figure: resulting graph over the nodes Kyrgyzstan, Kyrgyz Republic, country and Bishkek, with Lbl and Rel(hasCapital) edges]
MODELING KNOWLEDGE GRAPH IDENTIFICATION
Viewing KGI as a probabilistic graphical model [Figure: graphical model over the variables Lbl(Kyrgyzstan, bird), Lbl(Kyrgyzstan, country), Lbl(Kyrgyz Republic, bird), Lbl(Kyrgyz Republic, country), Rel(hasCapital, Kyrgyzstan, Bishkek) and Rel(hasCapital, Kyrgyz Republic, Bishkek)]
Background: Probabilistic Soft Logic (PSL) • Templating language for hinge-loss MRFs, very scalable! • Model specified as a collection of logical formulas: $\mathrm{SameEnt}(E_1, E_2) \,\tilde{\wedge}\, \mathrm{Lbl}(E_1, L) \Rightarrow \mathrm{Lbl}(E_2, L)$ • Uses soft-logic formulation • Truth values of atoms relaxed to [0,1] interval • Truth values of formulas derived from Łukasiewicz t-norm
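The soft-logic relaxation on this slide can be made concrete with a minimal sketch of the standard Łukasiewicz operators. The function names and the truth values plugged in are illustrative assumptions, not taken from the slides or from the PSL codebase.

```python
def l_and(a: float, b: float) -> float:
    """Lukasiewicz t-norm: soft conjunction of two truth values in [0, 1]."""
    return max(0.0, a + b - 1.0)


def l_or(a: float, b: float) -> float:
    """Lukasiewicz t-conorm: soft disjunction."""
    return min(1.0, a + b)


def l_not(a: float) -> float:
    """Soft negation."""
    return 1.0 - a


def l_implies(body: float, head: float) -> float:
    """Truth value of body => head, read as not(body) or head."""
    return min(1.0, 1.0 - body + head)


# Illustrative (assumed) truth values for one grounding of
# SameEnt(E1, E2) AND Lbl(E1, L) => Lbl(E2, L).
same_ent = 0.8
lbl_e1 = 0.9
lbl_e2 = 0.4

body = l_and(same_ent, lbl_e1)
rule_truth = l_implies(body, lbl_e2)
print(f"body={body:.2f} rule truth={rule_truth:.2f}")
```

With Boolean inputs (0 or 1) these operators reduce to ordinary logical AND, OR, NOT and implication, which is why the same formulas can be read either as hard rules or as soft ones.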
Background: PSL Rules to Distributions • Rules are grounded by substituting literals into formulas: $w_{EL} : \mathrm{SameEnt}(\text{Kyrgyzstan}, \text{Kyrgyz Republic}) \,\tilde{\wedge}\, \mathrm{Lbl}(\text{Kyrgyzstan}, \text{country}) \Rightarrow \mathrm{Lbl}(\text{Kyrgyz Republic}, \text{country})$ • Each ground rule has a weighted distance to satisfaction derived from the formula's truth value: $P(G \mid E) = \frac{1}{Z} \exp\left[ -\sum_{r \in R} w_r \, \varphi_r(G) \right]$ • The PSL program can be interpreted as a joint probability distribution over all variables in the knowledge graph, conditioned on the extractions
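As a rough illustration of how a ground rule contributes to this distribution, the sketch below computes the distance to satisfaction of an implication under the Łukasiewicz relaxation (one minus its truth value, i.e. max(0, body - head)) and the unnormalized density exp(-sum of weighted distances). The weight 2.0 and all truth values are made-up assumptions, and the normalizer Z is deliberately left out.

```python
import math


def distance_to_satisfaction(body: float, head: float) -> float:
    """1 - truth(body => head) under Lukasiewicz logic, i.e. max(0, body - head)."""
    return max(0.0, body - head)


def unnormalized_density(ground_rules) -> float:
    """exp(-sum_r w_r * phi_r) for an iterable of (weight, body, head) triples."""
    energy = sum(w * distance_to_satisfaction(b, h) for w, b, h in ground_rules)
    return math.exp(-energy)


# Assumed truth values: SameEnt(Kyrgyzstan, Kyrgyz Republic) = 0.8 and
# Lbl(Kyrgyzstan, country) = 0.7, combined with the Lukasiewicz t-norm.
body = max(0.0, 0.8 + 0.7 - 1.0)

# Two candidate knowledge graphs that differ only in Lbl(Kyrgyz Republic, country).
consistent_kg = [(2.0, body, 0.9)]    # head above body: rule satisfied
inconsistent_kg = [(2.0, body, 0.1)]  # head below body: rule penalized

print(unnormalized_density(consistent_kg), unnormalized_density(inconsistent_kg))
```

The knowledge graph that violates the rule receives a strictly lower score, and the gap grows with the rule weight.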
Background: Finding the best knowledge graph • MPE inference solves $\max_G P(G \mid E)$ to find the best KG • In PSL, inference is solved by convex optimization • Efficient: running time scales with O(|R|)
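To give a feel for MPE inference as convex optimization, here is a toy sketch with two variables, Lbl(Kyrgyzstan, bird) and Lbl(Kyrgyzstan, country). It makes several assumptions that are not in the slides: squared hinge potentials (chosen so the objective is smooth), unit weights on the extraction rules, a weight of 10 on the mutual-exclusion rule, and plain projected gradient descent standing in for whatever optimizer PSL actually uses.

```python
def energy_gradient(bird: float, country: float, w_mut: float = 10.0):
    """Gradient of the toy energy
       max(0, 0.5 - bird)^2 + max(0, 0.7 - country)^2
       + w_mut * max(0, bird + country - 1)^2
    The first two terms come from the extraction confidences .5 and .7,
    the last from Mut(bird, country) with the ontology atom fixed to 1."""
    overlap = max(0.0, bird + country - 1.0)
    g_bird = -2.0 * max(0.0, 0.5 - bird) + 2.0 * w_mut * overlap
    g_country = -2.0 * max(0.0, 0.7 - country) + 2.0 * w_mut * overlap
    return g_bird, g_country


def clip01(x: float) -> float:
    """Project a truth value back onto the [0, 1] interval."""
    return min(1.0, max(0.0, x))


def mpe(steps: int = 2000, lr: float = 0.02, w_mut: float = 10.0):
    """Projected gradient descent, started from the raw extraction confidences."""
    bird, country = 0.5, 0.7
    for _ in range(steps):
        g_bird, g_country = energy_gradient(bird, country, w_mut)
        bird = clip01(bird - lr * g_bird)
        country = clip01(country - lr * g_country)
    return bird, country


bird, country = mpe()
print(f"Lbl(Kyrgyzstan, bird)={bird:.3f}  Lbl(Kyrgyzstan, country)={country:.3f}")
```

In this toy setting the mutual-exclusion rule pulls both truth values down until their sum is close to 1. In the full model, entity-resolution and ontology rules would add further evidence for one label over the other.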
PSL Rules for the KGI Model
PSL Rules: Uncertain Extractions
$w_{CR\text{-}T} : \mathrm{CandRel}_T(E_1, E_2, R) \Rightarrow \mathrm{Rel}(E_1, E_2, R)$
$w_{CL\text{-}T} : \mathrm{CandLbl}_T(E, L) \Rightarrow \mathrm{Lbl}(E, L)$
• $\mathrm{CandRel}_T$: predicate representing uncertain relation extraction from extractor T • $w_{CR\text{-}T}$: weight for source T (relations) • $\mathrm{Rel}$: relation in Knowledge Graph • $\mathrm{CandLbl}_T$: predicate representing uncertain label extraction from extractor T • $w_{CL\text{-}T}$: weight for source T (labels) • $\mathrm{Lbl}$: label in Knowledge Graph
PSL Rules: Entity Resolution • ER predicate captures confidence that entities are co-referent • Rules require co-referent entities to have the same labels and relations • Creates an equivalence class of co-referent entities
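PSL Rules: Ontology
Inverse:
$w_O : \mathrm{Inv}(R, S) \,\tilde{\wedge}\, \mathrm{Rel}(E_1, E_2, R) \Rightarrow \mathrm{Rel}(E_2, E_1, S)$
Selectional Preference:
$w_O : \mathrm{Dom}(R, L) \,\tilde{\wedge}\, \mathrm{Rel}(E_1, E_2, R) \Rightarrow \mathrm{Lbl}(E_1, L)$
$w_O : \mathrm{Rng}(R, L) \,\tilde{\wedge}\, \mathrm{Rel}(E_1, E_2, R) \Rightarrow \mathrm{Lbl}(E_2, L)$
Subsumption:
$w_O : \mathrm{Sub}(L, P) \,\tilde{\wedge}\, \mathrm{Lbl}(E, L) \Rightarrow \mathrm{Lbl}(E, P)$
$w_O : \mathrm{RSub}(R, S) \,\tilde{\wedge}\, \mathrm{Rel}(E_1, E_2, R) \Rightarrow \mathrm{Rel}(E_1, E_2, S)$
Mutual Exclusion:
$w_O : \mathrm{Mut}(L_1, L_2) \,\tilde{\wedge}\, \mathrm{Lbl}(E, L_1) \Rightarrow \tilde{\neg}\, \mathrm{Lbl}(E, L_2)$
$w_O : \mathrm{RMut}(R, S) \,\tilde{\wedge}\, \mathrm{Rel}(E_1, E_2, R) \Rightarrow \tilde{\neg}\, \mathrm{Rel}(E_1, E_2, S)$
Adapted from Jiang et al., ICDM 2012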
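The ontology rules above are templates; each one expands into ground rules over concrete atoms. The sketch below grounds the Dom and Mut templates against the running Kyrgyzstan example. The tuple encoding, the "NotLbl" tag and both function names are assumptions made for illustration, and the grounding is restricted to atoms that actually occur rather than all possible substitutions.

```python
# Ontology facts and candidate atoms from the running example.
dom = [("hasCapital", "country")]            # Dom(R, L)
mut = [("country", "bird")]                  # Mut(L1, L2)
rel_atoms = [("Kyrgyz Republic", "Bishkek", "hasCapital")]  # Rel(E1, E2, R)
lbl_atoms = [
    ("Kyrgyzstan", "bird"),
    ("Kyrgyzstan", "country"),
    ("Kyrgyz Republic", "country"),
]                                            # Lbl(E, L)


def ground_domain(dom, rel_atoms):
    """Dom(R, L) AND Rel(E1, E2, R) => Lbl(E1, L), one rule per matching relation atom."""
    return [
        (("Rel", e1, e2, r), ("Lbl", e1, label))
        for (r, label) in dom
        for (e1, e2, r_atom) in rel_atoms
        if r_atom == r
    ]


def ground_mutex(mut, lbl_atoms):
    """Mut(L1, L2) AND Lbl(E, L1) => NOT Lbl(E, L2), one rule per entity carrying L1."""
    return [
        (("Lbl", e, l1), ("NotLbl", e, l2))
        for (l1, l2) in mut
        for (e, label) in lbl_atoms
        if label == l1
    ]


dom_rules = ground_domain(dom, rel_atoms)
mut_rules = ground_mutex(mut, lbl_atoms)
for body, head in dom_rules + mut_rules:
    print(body, "=>", head)
```

Each (body, head) pair would then receive the template weight $w_O$ and a distance to satisfaction, which is how a handful of templates can yield a very large number of ground rules.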
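Probability Distribution over KGs
$P(G \mid E) = \frac{1}{Z} \exp\left[ -\sum_{r \in R} w_r \, \varphi_r(G) \right]$
$\mathrm{CandLbl}_T(\text{kyrgyzstan}, \text{bird}) \Rightarrow \mathrm{Lbl}(\text{kyrgyzstan}, \text{bird})$
$\mathrm{Mut}(\text{bird}, \text{country}) \,\tilde{\wedge}\, \mathrm{Lbl}(\text{kyrgyzstan}, \text{bird}) \Rightarrow \tilde{\neg}\, \mathrm{Lbl}(\text{kyrgyzstan}, \text{country})$
$\mathrm{SameEnt}(\text{kyrgyz republic}, \text{kyrgyzstan}) \,\tilde{\wedge}\, \mathrm{Lbl}(\text{kyrgyz republic}, \text{country}) \Rightarrow \mathrm{Lbl}(\text{kyrgyzstan}, \text{country})$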
EVALUATION
Two Evaluation Datasets
Description: LinkedBrainz: Community-supplied data about musical artists, labels, and creative works; NELL: Real-world IE system extracting general facts from the WWW
Noise: LinkedBrainz: Realistic synthetic noise; NELL: Imperfect extractors and ambiguous web pages
Candidate Facts: LinkedBrainz: 810K; NELL: 1.3M
Unique Labels and Relations: LinkedBrainz: 27; NELL: 456
Ontological Constraints: LinkedBrainz: 49; NELL: 67.9K
LinkedBrainz dataset for KGI
Mapping to FRBR/FOAF ontology: DOM = rdfs:domain; RNG = rdfs:range; INV = owl:inverseOf; SUB = rdfs:subClassOf; RSUB = rdfs:subPropertyOf; MUT = owl:disjointWith
[Figure: ontology diagram over the classes mo:Release, mo:Label, mo:Record, mo:MusicalArtist, mo:SoloMusicArtist, mo:MusicGroup, mo:Track and mo:Signal, connected by mo:label, mo:record, mo:track, mo:published_as, foaf:maker, foaf:made, inverseOf and subClassOf edges]
Adding noise to LinkedBrainz
Add realistic noise to LinkedBrainz data:
Error Type      Erroneous Data
Co-reference    User misspells artist
Label           User swaps artist and album fields
Relation        User omits or adds spurious albums for artist
Reliability     Gaussian noise on truth value of information
LinkedBrainz experiments
Comparisons:
Baseline       Use noisy truth values as fact scores
PSL-EROnly     Only apply rules for Entity Resolution
PSL-OntOnly    Only apply rules for Ontological reasoning
PSL-KGI        Apply Knowledge Graph Identification model
Results:
               AUC     Precision   Recall   F1 at .5   Max F1
Baseline       0.672   0.946       0.477    0.634      0.788
PSL-EROnly     0.797   0.953       0.558    0.703      0.831
PSL-OntOnly    0.753   0.964       0.605    0.743      0.832
PSL-KGI        0.901   0.970       0.714    0.823      0.919
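As a quick sanity check on how the columns relate, the sketch below recomputes F1 from the reported precision and recall using the standard harmonic-mean formula. This assumes the precision and recall columns are measured at the same .5 threshold as the "F1 at .5" column; small differences are expected because the inputs are already rounded to three decimals.

```python
# (precision, recall, reported F1 at .5) as listed in the results table.
reported = {
    "Baseline": (0.946, 0.477, 0.634),
    "PSL-EROnly": (0.953, 0.558, 0.703),
    "PSL-OntOnly": (0.964, 0.605, 0.743),
    "PSL-KGI": (0.970, 0.714, 0.823),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


# Absolute gap between the recomputed and the reported F1 for each method.
gaps = {name: abs(f1(p, r) - f) for name, (p, r, f) in reported.items()}
for name, gap in gaps.items():
    print(f"{name}: recomputed F1 differs from reported by {gap:.4f}")
```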
NELL Evaluation: two settings
Target Set: restrict to a subset of KG (Jiang, ICDM12) • Closed-world model • Uses a target set: subset of KG • Derived from 2-hop neighborhood • Excludes trivially satisfied variables
Complete: Infer full knowledge graph • Open-world model • All possible entities, relations, labels • Inference assigns truth value to each variable