Deep Prolog: End-to-end Differentiable Proving in Knowledge Bases
Tim Rocktäschel, University College London, Computer Science
2nd Conference on Artificial Intelligence and Theorem Proving, 26th of March 2017
Overview

Machine Learning / Deep Learning (Artificial Neural Network: Inputs → Trainable Function → Outputs):
• Behavior learned automatically
• Strong generalization
• Needs a lot of training data
• Behavior not interpretable

First-order Logic ("Every father of a parent is a grandfather."):
  grandfatherOf(X, Y) :– fatherOf(X, Z), parentOf(Z, Y).
• Behavior defined manually
• No generalization
• Needs no training data
• Behavior interpretable
Outline
1 Reasoning with Symbols
  Knowledge Bases
  Prolog: Backward Chaining
2 Reasoning with Neural Representations
  Symbolic vs. Neural Representations
  Neural Link Prediction
  Computation Graphs
3 Deep Prolog: Neural Backward Chaining
4 Optimizations
  Batch Proving
  Gradient Approximation
  Regularization by Neural Link Predictor
5 Experiments
6 Summary
Notation
Constant: homer, bart, lisa, etc. (lowercase)
Variable: X, Y, etc. (uppercase, universally quantified)
Term: a constant or a variable
Predicate: fatherOf, parentOf, etc.; a function from terms to a Boolean
Atom: a predicate applied to terms, e.g., parentOf(X, bart)
Literal: a negated or non-negated atom, e.g., not parentOf(bart, lisa)
Rule: head :– body. where the head is a literal and the body is a (possibly empty) list of literals representing a conjunction
Fact: a ground rule (no free variables) with an empty body, e.g., parentOf(homer, bart).
Example Knowledge Base
1. fatherOf(abe, homer).
2. parentOf(homer, lisa).
3. parentOf(homer, bart).
4. grandpaOf(abe, lisa).
5. grandfatherOf(abe, maggie).
6. grandfatherOf(X1, Y1) :– fatherOf(X1, Z1), parentOf(Z1, Y1).
7. grandparentOf(X2, Y2) :– grandfatherOf(X2, Y2).
Backward Chaining

def or(KB, goal, Ψ):
    for rule head :– body in KB do
        Ψ' ← unify(head, goal, Ψ)
        if Ψ' ≠ failure then
            for Ψ'' in and(KB, body, Ψ') do
                yield Ψ''

def and(KB, subgoals, Ψ):
    if subgoals is empty then
        yield Ψ
    else
        subgoal ← substitute(head(subgoals), Ψ)
        for Ψ' in or(KB, subgoal, Ψ) do
            for Ψ'' in and(KB, tail(subgoals), Ψ') do
                yield Ψ''
Unification

def unify(A, B, Ψ):
    if Ψ = failure then return failure
    else if A is a variable then return unifyvar(A, B, Ψ)
    else if B is a variable then return unifyvar(B, A, Ψ)
    else if A = [a1, ..., aN] and B = [b1, ..., bN] are atoms then
        Ψ' ← unify([a2, ..., aN], [b2, ..., bN], Ψ)
        return unify(a1, b1, Ψ')
    else if A = B then return Ψ
    else return failure
Example

Query: grandfatherOf(abe, bart)?

Example Knowledge Base:
1. fatherOf(abe, homer).
2. parentOf(homer, bart).
3. grandfatherOf(X, Y) :– fatherOf(X, Z), parentOf(Z, Y).

Proof:
or_0: the query fails to unify with facts 1 and 2, but unifies with the head of rule 3, giving {X/abe, Y/bart}
and_0, subgoal 3.1: fatherOf(abe, Z)? — or_1: unifies with fact 1 (entries 2 and 3 fail), giving {X/abe, Y/bart, Z/homer}
and_0, subgoal 3.2: parentOf(homer, bart)? — or_1: unifies with fact 2 (entries 1 and 3 fail), giving {X/abe, Y/bart, Z/homer} — success
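To make the pseudocode above concrete, here is a minimal, runnable Python sketch of unify, or and and (renamed or_ and and_, since or and and are Python keywords), applied to a subset of the example knowledge base. The tuple encoding of atoms, the dict-based substitutions, and the omission of variable renaming (standardization apart) are simplifications made for this sketch, not part of the slides.

    # Minimal Python sketch of symbolic backward chaining (Prolog-style),
    # following the or/and/unify pseudocode above. Atoms are tuples such as
    # ("fatherOf", "abe", "homer"); variables are uppercase strings.

    def is_var(t):
        return isinstance(t, str) and t[0].isupper()

    def unify(a, b, subst):
        if subst is None:                      # failure propagates
            return None
        if isinstance(a, tuple) and isinstance(b, tuple):
            if len(a) != len(b):
                return None
            for x, y in zip(a, b):             # unify element-wise
                subst = unify(x, y, subst)
            return subst
        if is_var(a):
            return unify_var(a, b, subst)
        if is_var(b):
            return unify_var(b, a, subst)
        return subst if a == b else None       # two symbols: equal or failure

    def unify_var(var, term, subst):
        if var in subst:
            return unify(subst[var], term, subst)
        if is_var(term) and term in subst:
            return unify(var, subst[term], subst)
        new = dict(subst)
        new[var] = term
        return new

    def substitute(atom, subst):
        return tuple(subst.get(t, t) if is_var(t) else t for t in atom)

    def or_(kb, goal, subst):
        for head, body in kb:
            s = unify(head, goal, subst)
            if s is not None:
                yield from and_(kb, body, s)

    def and_(kb, subgoals, subst):
        if not subgoals:
            yield subst
        else:
            subgoal = substitute(subgoals[0], subst)
            for s1 in or_(kb, subgoal, subst):
                yield from and_(kb, subgoals[1:], s1)

    # Knowledge base: rules are (head, body) pairs; facts have an empty body.
    KB = [
        (("fatherOf", "abe", "homer"), []),
        (("parentOf", "homer", "lisa"), []),
        (("parentOf", "homer", "bart"), []),
        (("grandfatherOf", "X", "Y"),
         [("fatherOf", "X", "Z"), ("parentOf", "Z", "Y")]),
    ]

    # Query: grandfatherOf(abe, Q)?
    for proof in or_(KB, ("grandfatherOf", "abe", "Q"), {}):
        print({"Q": proof["Q"]})

Running this prints {'Q': 'lisa'} and {'Q': 'bart'}, matching the substitutions shown on the Symbolic Representations slide below.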
Symbolic Representations
• Symbols (constants and predicates) do not share any information: grandpaOf ≠ grandfatherOf
• No notion of similarity: apple ∼ orange, professorAt ∼ lecturerAt
• No generalization beyond what can be symbolically inferred: isFruit(apple), apple ∼ orange ⇒ isFruit(orange)?
But...
• Leads to powerful inference mechanisms and proofs for predictions:
  fatherOf(abe, homer). parentOf(homer, lisa). parentOf(homer, bart).
  grandfatherOf(X, Y) :– fatherOf(X, Z), parentOf(Z, Y).
  grandfatherOf(abe, Q)?  {Q/lisa}, {Q/bart}
• Fairly easy to debug and trivial to incorporate domain knowledge: just change or add rules
• Hard to work with language, vision and other modalities:
  ‘‘is a film based on the novel of the same name by’’(X, Y)
Neural Representations
• Lower-dimensional fixed-length vector representations of symbols (predicates and constants): v_apple, v_orange, v_fatherOf, ... ∈ R^k
• Can capture similarity and even semantic hierarchy of symbols: v_grandpaOf = v_grandfatherOf, v_apple ∼ v_orange, v_apple < v_fruit
• Can be trained from raw task data (e.g. facts)
• Can be compositional: v_‘‘is the father of’’ = RNN_θ(v_is, v_the, v_father, v_of)
But...
• Need large amounts of training data
• No direct way of incorporating prior knowledge:
  v_grandfatherOf(X, Y) :– v_fatherOf(X, Z), v_parentOf(Z, Y).
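As a toy illustration (not from the slides) of the first two points, the NumPy snippet below uses cosine similarity between randomly constructed embeddings as a stand-in for learned similarity: a near-synonymous predicate gets a nearby vector, an unrelated one does not.

    # Toy illustration of similarity in vector space (embeddings are random
    # here; in practice they would be learned from data).
    import numpy as np

    rng = np.random.default_rng(0)
    k = 5  # embedding dimension

    v_grandfatherOf = rng.normal(size=k)
    v_grandpaOf = v_grandfatherOf + 0.05 * rng.normal(size=k)  # nearly synonymous
    v_parentOf = rng.normal(size=k)                            # unrelated predicate

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cos(v_grandpaOf, v_grandfatherOf))  # close to 1
    print(cos(v_grandpaOf, v_parentOf))       # close to 0 in expectation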
Related Work
• Fuzzy Logic (Zadeh, 1965)
• Probabilistic Logic Programming, e.g., IBAL (Pfeffer, 2001), BLOG (Milch et al., 2005), Markov Logic Networks (Richardson and Domingos, 2006), ProbLog (De Raedt et al., 2007), ...
• Inductive Logic Programming, e.g., Plotkin (1970), Shapiro (1991), Muggleton (1991), De Raedt (1999), ...
• Statistical Predicate Invention (Kok and Domingos, 2007)
• Neural-symbolic Connectionism
  – Propositional rules: EBL-ANN (Shavlik and Towell, 1989), KBANN (Towell and Shavlik, 1994), C-IL2P (Garcez and Zaverucha, 1999)
  – First-order inference (no training of symbol representations): Unification Neural Networks (Hölldobler, 1990; Komendantskaya, 2011), SHRUTI (Shastri, 1992), Neural Prolog (Ding, 1995), CILP++ (França et al., 2014), Lifted Relational Neural Networks (Šourek et al., 2015)
Neural Link Prediction
• Real-world knowledge bases (like Freebase) are incomplete! The placeOfBirth attribute is missing for 71% of people!
• Commonsense knowledge is often not stated explicitly
• Weak logical relationships can be used for inferring facts
[Figure: knowledge graph with edges spouseOf(melinda, bill), chairmanOf(bill, microsoft), headquarteredIn(microsoft, seattle), and the query edge livesIn(melinda, seattle)?]
• Predict livesIn(melinda, seattle) using a local scoring function f(v_livesIn, v_melinda, v_seattle)
(Example from Das et al., 2016)
State-of-the-art Neural Link Prediction
Scoring function f(v_livesIn, v_melinda, v_seattle)

DistMult (Yang et al., 2014): v_s, v_i, v_j ∈ R^k
  f(v_s, v_i, v_j) = v_s⊤ (v_i ⊙ v_j) = Σ_k v_sk · v_ik · v_jk

ComplEx (Trouillon et al., 2016): v_s, v_i, v_j ∈ C^k
  f(v_s, v_i, v_j) = real(v_s)⊤ (real(v_i) ⊙ real(v_j))
                   + real(v_s)⊤ (imag(v_i) ⊙ imag(v_j))
                   + imag(v_s)⊤ (real(v_i) ⊙ imag(v_j))
                   − imag(v_s)⊤ (imag(v_i) ⊙ real(v_j))

Training Loss (binary cross-entropy over training triples T):
  L = Σ_{(r_s(e_i, e_j), y) ∈ T} [ −y log(σ(f(v_s, v_i, v_j))) − (1 − y) log(1 − σ(f(v_s, v_i, v_j))) ]

Gradient-based optimization for learning v_s, v_i, v_j from data
How do we calculate the gradients ∇_{v_s} L, ∇_{v_i} L, ∇_{v_j} L?
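A minimal NumPy sketch (not from the slides) of the two scoring functions and the per-triple cross-entropy loss defined above; the embedding sizes and random values are illustrative only.

    # DistMult and ComplEx scoring plus the per-triple loss from the slide.
    import numpy as np

    def distmult(v_s, v_i, v_j):
        # f = v_s^T (v_i ⊙ v_j) = Σ_k v_sk * v_ik * v_jk
        return np.sum(v_s * v_i * v_j)

    def complex_score(v_s, v_i, v_j):
        # v_s, v_i, v_j are complex vectors; score = Re(<v_s, v_i, conj(v_j)>)
        return (np.real(v_s) @ (np.real(v_i) * np.real(v_j))
                + np.real(v_s) @ (np.imag(v_i) * np.imag(v_j))
                + np.imag(v_s) @ (np.real(v_i) * np.imag(v_j))
                - np.imag(v_s) @ (np.imag(v_i) * np.real(v_j)))

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def loss(score, y):
        # negative log-likelihood of label y ∈ {0, 1} under σ(score)
        p = sigmoid(score)
        return -y * np.log(p) - (1 - y) * np.log(1 - p)

    # Toy usage with random embeddings (k = 4)
    rng = np.random.default_rng(0)
    k = 4
    v_s, v_i, v_j = rng.normal(size=(3, k))
    print(distmult(v_s, v_i, v_j), loss(distmult(v_s, v_i, v_j), y=1))

    c_s, c_i, c_j = rng.normal(size=(3, k)) + 1j * rng.normal(size=(3, k))
    print(complex_score(c_s, c_i, c_j), loss(complex_score(c_s, c_i, c_j), y=1))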
Computation Graphs
Example: z = f(x, y) = σ(x⊤y)
• Nodes represent variables (inputs or parameters)
• Directed edges into a node correspond to a differentiable operation
[Figure: computation graph with inputs x and y feeding a dot node u1, which feeds a sigm node producing z]
Backpropagation
Chain Rule of Calculus: given z = f(b) with b = g(a),
  ∇_a z = (∂b/∂a)⊤ ∇_b z
Backpropagation is efficient recursive application of the Chain Rule
Gradient of z = σ(x⊤y) w.r.t. x, with u1 = x⊤y:
  ∇_x z = (∂u1/∂x)⊤ ∂z/∂u1 = σ(u1)(1 − σ(u1)) y
[Figure: the computation graph from the previous slide, annotated with gradients ∇z and ∂z/∂u1 flowing backward from z]
Given upstream supervision on z, we can learn x and y!
Deep Learning = “large” differentiable computation graphs
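A small NumPy sketch (not from the slides) of this computation graph: the forward pass, the manual gradient ∇_x z = σ(u1)(1 − σ(u1)) y from the slide, and a finite-difference check of that gradient.

    # Forward pass, manual backward pass, and a numerical gradient check
    # for z = σ(xᵀy).
    import numpy as np

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    def forward(x, y):
        u1 = x @ y          # dot node
        z = sigmoid(u1)     # sigm node
        return z, u1

    def grad_x(x, y):
        _, u1 = forward(x, y)
        return sigmoid(u1) * (1.0 - sigmoid(u1)) * y   # chain rule, as on the slide

    rng = np.random.default_rng(0)
    x, y = rng.normal(size=4), rng.normal(size=4)

    # Finite-difference check: perturb each coordinate of x by ±eps
    eps = 1e-6
    numeric = np.array([
        (forward(x + eps * e, y)[0] - forward(x - eps * e, y)[0]) / (2 * eps)
        for e in np.eye(4)
    ])
    print(np.allclose(grad_x(x, y), numeric, atol=1e-6))   # True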