Efficient Weight Learning for Markov Logic Networks
Speaker: Manuel Noll
Advisor: Maximilian Dylla
What is this talk about?
Weight learning, of course. A weight represents the relative strength or importance of a rule (or constraint).
We want to find the best-fitting weights w for our world x. Stated differently, we have some observations x and want to maximize the probability our model assigns to them as a function of the parameter w:
$$\arg\max_w P_w(X = x) = \arg\max_w \frac{1}{Z} \exp\Big(\sum_i w_i\, n_i(x)\Big)$$
Viewed as a function of w, this is a likelihood function.
Maximizing it is a convex optimization problem (the negative log-likelihood is convex in w), so a gradient method solves it.
Getting a better understanding of
$$P_w(X = x) = \frac{1}{Z} \exp\Big(\sum_i w_i\, n_i(x)\Big)$$
$n_i(x)$ is the number of times the i-th formula is satisfied by the state of the world x.

Example

Able to fly    Eats fish    Bird
Nightingale    Penguin      Nightingale
Sparrow        Bear         Sparrow
                            Penguin

Number  Weight  Formula
1       1       ∀x: Bird(x) ⇒ AbleToFly(x)
2       5       ∀x: ¬AbleToFly(x) ⇒ EatsFish(x)

Constant symbols: C = (Nightingale, Sparrow, Penguin, Bear)
We make a closed-world assumption.
Vector translation (Bird, AbleToFly, EatsFish, each over C): x = (1,1,1,0, 1,1,0,0, 0,0,1,1)
Getting a better understanding of $P_w(X = x)$ (continued)

Number  Weight  Formula                            Satisfied by
1       1       ∀x: Bird(x) ⇒ AbleToFly(x)         Nightingale, Sparrow
2       5       ∀x: ¬AbleToFly(x) ⇒ EatsFish(x)    Bear, Penguin
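To make the counts $n_i(x)$ concrete, here is a minimal sketch that grounds the two example formulas over the four constants. The helper names (`holds`, `true_groundings`, `world_vector`) are made up for this illustration. It assumes the usual reading of ⇒ as material implication, under which a grounding with a false antecedent also counts as true; the "Satisfied by" entries on the slide appear to list only the constants whose antecedent is true.

```python
# Constants and facts from the slide example (closed-world assumption).
CONSTANTS = ["Nightingale", "Sparrow", "Penguin", "Bear"]
WORLD = {
    "Bird": {"Nightingale", "Sparrow", "Penguin"},
    "AbleToFly": {"Nightingale", "Sparrow"},
    "EatsFish": {"Penguin", "Bear"},
}

def holds(world, predicate, constant):
    """Closed world: an atom is true only if it is listed."""
    return constant in world[predicate]

def formula_1(world, c):
    # Bird(c) => AbleToFly(c), read as material implication
    return (not holds(world, "Bird", c)) or holds(world, "AbleToFly", c)

def formula_2(world, c):
    # not AbleToFly(c) => EatsFish(c), read as material implication
    return holds(world, "AbleToFly", c) or holds(world, "EatsFish", c)

def true_groundings(formula, world):
    """Constants for which the grounded formula is true; n_i(x) is the length."""
    return [c for c in CONSTANTS if formula(world, c)]

def world_vector(world):
    """0/1 vector over (Bird, AbleToFly, EatsFish), each over CONSTANTS."""
    return tuple(int(holds(world, p, c))
                 for p in ("Bird", "AbleToFly", "EatsFish")
                 for c in CONSTANTS)

print(true_groundings(formula_1, WORLD))
print(true_groundings(formula_2, WORLD))
print(world_vector(WORLD))
```

Under this reading, Bear also satisfies formula 1 (it is not a bird), and every constant satisfies formula 2.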
$$P_w(X = x) = \frac{1}{Z} \exp\Big(\sum_i w_i\, n_i(x)\Big)$$
can be seen as the probability that the given world is an interpretation of our model.
When x is fixed, the expression depends only on w and is thus a likelihood function of w.
Given a prior distribution p(w), we can compute the posterior distribution of w:
$$P(w \mid X = x) = \frac{P_w(X = x)\, p(w)}{\int P_{w'}(X = x)\, p(w')\, dw'}$$
The denominator does not depend on w, so maximizing the posterior means maximizing $P_w(X = x)\, p(w)$. This is the so-called maximum a posteriori (MAP) estimation of w.
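As a worked instance of the MAP objective, assume a zero-mean Gaussian prior with the same variance $\sigma^2$ on every weight. The symbol $\sigma$ and the choice of a shared variance are assumptions made here for illustration. Taking logarithms turns the prior into a quadratic penalty on the weights:

```latex
\begin{align*}
p(w) &\propto \exp\Big(-\sum_i \frac{w_i^2}{2\sigma^2}\Big) \\
\log P(w \mid X = x) &= \sum_i w_i\, n_i(x) \;-\; \log Z(w) \;-\; \sum_i \frac{w_i^2}{2\sigma^2} \;+\; \text{const} \\
\frac{\partial}{\partial w_i} \log P(w \mid X = x) &= n_i(x) \;-\; E_w[n_i(x)] \;-\; \frac{w_i}{\sigma^2}
\end{align*}
```

The extra term $-w_i/\sigma^2$ pulls each weight toward zero, which keeps the optimum finite even when a formula is satisfied by every grounding in the data.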
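Maximization of the Log-Likelihood function
We want to fit our model to our world x, therefore we have to optimize the weights:
$$\arg\max_w P_w(X = x) = \arg\max_w \frac{1}{Z} \exp\Big(\sum_i w_i\, n_i(x)\Big), \qquad Z = \sum_{x'} \exp\Big(\sum_i w_i\, n_i(x')\Big)$$
It is easier to maximize the logarithm:
$$\arg\max_w \log P_w(X = x) = \arg\max_w \Big[\sum_i w_i\, n_i(x) - \log \sum_{x'} \exp\Big(\sum_i w_i\, n_i(x')\Big)\Big]$$
$$\Rightarrow\ \frac{\partial}{\partial w_i} \log P_w(X = x) = n_i(x) - \sum_{x'} P_w(X = x')\, n_i(x') = n_i(x) - E_w[n_i(x)]$$
The sum runs over all possible worlds x', so the expectation is intractable to compute exactly.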
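The gradient formula can be checked by brute force on a tiny domain. The sketch below is illustrative only: it shrinks the example to two constants (an assumption, so that all $2^6$ worlds can be enumerated), reuses the two bird formulas with ⇒ read as material implication, and computes the log-likelihood and the gradient $n_i(x) - E_w[n_i(x)]$ exactly. The function names are hypothetical.

```python
import itertools
import math

CONSTANTS = ["Sparrow", "Penguin"]  # reduced domain, assumption for tractability
PREDICATES = ["Bird", "AbleToFly", "EatsFish"]
ATOMS = [(p, c) for p in PREDICATES for c in CONSTANTS]

def counts(world):
    """n(x): true groundings of the two example formulas in a world."""
    n1 = sum((not world[("Bird", c)]) or world[("AbleToFly", c)] for c in CONSTANTS)
    n2 = sum(world[("AbleToFly", c)] or world[("EatsFish", c)] for c in CONSTANTS)
    return (n1, n2)

def all_worlds():
    """Enumerate every truth assignment to the ground atoms."""
    for bits in itertools.product([False, True], repeat=len(ATOMS)):
        yield dict(zip(ATOMS, bits))

def log_likelihood_and_gradient(w, x):
    feats = [counts(xp) for xp in all_worlds()]
    scores = [sum(wi * ni for wi, ni in zip(w, n)) for n in feats]
    m = max(scores)  # log-sum-exp trick for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    n_obs = counts(x)
    ll = sum(wi * ni for wi, ni in zip(w, n_obs)) - log_z
    expected = [sum(math.exp(s - log_z) * n[i] for s, n in zip(scores, feats))
                for i in range(len(w))]
    grad = [no - e for no, e in zip(n_obs, expected)]  # n_i(x) - E_w[n_i]
    return ll, grad

# Observed world: both are birds, Sparrow flies, Penguin eats fish.
OBSERVED = {atom: False for atom in ATOMS}
OBSERVED.update({("Bird", "Sparrow"): True, ("Bird", "Penguin"): True,
                 ("AbleToFly", "Sparrow"): True, ("EatsFish", "Penguin"): True})

ll, grad = log_likelihood_and_gradient([1.0, 5.0], OBSERVED)
print(ll, grad)
```

The enumeration visits every world, which is exactly the cost that becomes prohibitive on realistic domains.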
A conditional likelihood approach
In many applications, we know a priori which predicates will be evidence and which ones will be queried.
The goal is to correctly predict the latter given the former.
Partition the ground atoms in the domain into a set of evidence atoms X and a set of query atoms Y.
The conditional likelihood of Y given X is
$$P_w(Y = y \mid X = x) = \frac{1}{Z_x} \exp\Big(\sum_{i \in F_Y} w_i\, n_i(x, y)\Big)$$
$F_Y$: the set of all MLN clauses with at least one grounding involving a query atom.
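A convex problem
$$\arg\max_w P_w(Y = y \mid X = x) = \arg\max_w \frac{1}{Z_x} \exp\Big(\sum_{i \in F_Y} w_i\, n_i(x, y)\Big)$$
$$\Rightarrow\ \frac{\partial}{\partial w_i} \log P_w(Y = y \mid X = x) = n_i(x, y) - \sum_{y'} P_w(Y = y' \mid X = x)\, n_i(x, y') = n_i(x, y) - E_w[n_i(x, y)]$$
The negative log-likelihood is a convex function of w (equivalently, the log-likelihood is concave).
Since $w \in [0, \infty]$, which is a convex set, we have a convex optimization problem.
For a convex optimization problem it holds that if a local minimum exists, then it is a global minimum. Here this means that every local maximum of the log-likelihood is a global maximum.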
Gradient descent / ascent
Let f be a differentiable function. Then the iteration
$$x_n = x_{n-1} + \eta\, \nabla f(x_{n-1})$$
with step size $\eta > 0$ is the so-called gradient ascent method (gradient descent subtracts the gradient step instead).
In our case
$$w_{i,t} = w_{i,t-1} + \eta\, \frac{\partial}{\partial w_i} \log P_w(Y = y \mid X = x)\Big|_{w_{t-1}} = w_{i,t-1} + \eta\, \big(n_i(x, y) - E_w[n_i(x, y)]\big)$$
But how can we compute $E_w[n_i(x, y)]$, which is still intractable?
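The update rule can be tried end to end on a toy model where the expectation is still computable. This is a minimal sketch, not a practical learner: it assumes a two-constant version of the bird example, uses the joint likelihood instead of the conditional one for brevity, computes $E_w[n_i]$ by enumerating all 64 worlds, and picks a step size `eta=0.1` and `steps=200` arbitrarily.

```python
import itertools
import math

CONSTANTS = ["Sparrow", "Penguin"]  # reduced domain, assumption for tractability
PREDICATES = ["Bird", "AbleToFly", "EatsFish"]
ATOMS = [(p, c) for p in PREDICATES for c in CONSTANTS]

def counts(world):
    n1 = sum((not world[("Bird", c)]) or world[("AbleToFly", c)] for c in CONSTANTS)
    n2 = sum(world[("AbleToFly", c)] or world[("EatsFish", c)] for c in CONSTANTS)
    return (n1, n2)

# Clause counts of every possible world, computed once.
ALL_COUNTS = [counts(dict(zip(ATOMS, bits)))
              for bits in itertools.product([False, True], repeat=len(ATOMS))]

def log_z(w):
    scores = [sum(wi * ni for wi, ni in zip(w, n)) for n in ALL_COUNTS]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def log_likelihood(w, n_obs):
    return sum(wi * ni for wi, ni in zip(w, n_obs)) - log_z(w)

def expected_counts(w):
    lz = log_z(w)
    probs = [math.exp(sum(wi * ni for wi, ni in zip(w, n)) - lz) for n in ALL_COUNTS]
    return [sum(p * n[i] for p, n in zip(probs, ALL_COUNTS)) for i in range(len(w))]

def gradient_ascent(n_obs, eta=0.1, steps=200):
    """w_t = w_{t-1} + eta * (n(x) - E_w[n]), starting from w = 0."""
    w = [0.0] * len(n_obs)
    for _ in range(steps):
        expected = expected_counts(w)
        w = [wi + eta * (no - e) for wi, no, e in zip(w, n_obs, expected)]
    return w

# Observed counts for: both birds, Sparrow flies, Penguin eats fish.
N_OBS = (1, 2)
w = gradient_ascent(N_OBS)
print(w, log_likelihood(w, N_OBS))
```

In this toy data the second formula is satisfied by every grounding, so its weight keeps growing with more steps; a prior on the weights would bound it.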
MAP with MaxWalkSat
$E_w[n_i(x, y)]$ can be approximated by the counts $n_i(x, y^*_w(x))$ in the MAP state $y^*_w(x)$.
This will be a good approximation if most of the probability mass of $P_w(Y = y \mid X = x)$ is concentrated around $y^*_w(x)$.
The form
$$\frac{1}{Z_x} \exp\Big(\sum_{i \in F_Y} w_i\, n_i(x, y)\Big)$$
suggests the solution: $y^*_w(x)$ is the state that maximizes the sum of the weights of the satisfied ground clauses.
This weighted satisfiability problem is solved (approximately, by stochastic local search) by MaxWalkSat.
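A rough sketch of the MaxWalkSat idea follows: repeatedly pick an unsatisfied clause and flip one of its variables, either at random or greedily. The clause encoding (lists of `(variable, is_positive)` literals), the parameter names and their defaults are assumptions for this illustration, and non-negative clause weights are assumed; it is not the implementation used in any particular MLN system.

```python
import random

def satisfied(clause, assign):
    """A clause is a list of (variable index, is_positive) literals."""
    return any(assign[v] == positive for v, positive in clause)

def total_weight(clauses, weights, assign):
    """Sum of the weights of the satisfied clauses."""
    return sum(w for c, w in zip(clauses, weights) if satisfied(c, assign))

def max_walk_sat(clauses, weights, n_vars, max_flips=1000, p_random=0.5, seed=0):
    rng = random.Random(seed)
    assign = [rng.random() < 0.5 for _ in range(n_vars)]
    best, best_score = list(assign), total_weight(clauses, weights, assign)
    for _ in range(max_flips):
        unsat = [c for c in clauses if not satisfied(c, assign)]
        if not unsat:
            break  # everything satisfied, cannot improve further
        clause = rng.choice(unsat)
        if rng.random() < p_random:
            var = rng.choice(clause)[0]  # random-walk move
        else:
            def score_after_flip(v):
                assign[v] = not assign[v]
                s = total_weight(clauses, weights, assign)
                assign[v] = not assign[v]
                return s
            var = max((v for v, _ in clause), key=score_after_flip)  # greedy move
        assign[var] = not assign[var]
        score = total_weight(clauses, weights, assign)
        if score > best_score:
            best, best_score = list(assign), score
    return best, best_score

# Toy instance: (x0) weight 2, (not x0 or x1) weight 1, (not x1) weight 1.5
clauses = [[(0, True)], [(0, False), (1, True)], [(1, False)]]
weights = [2.0, 1.0, 1.5]
print(max_walk_sat(clauses, weights, n_vars=2))
```

The search keeps the best assignment seen so far, because individual flips may temporarily lower the satisfied weight.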
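Voted Perceptron Algorithm
Initialize $w_{i,0} \sim N(0, 1)$
For t = 1 to T
    Approximate $y^*_w(x)$ by MaxWalkSat
    Count $n_i(x, y^*_w(x))$
    $w_{i,t} = w_{i,t-1} + \eta\, \big(n_i(x, y) - E^*_w[n_i(x, y)]\big)$, where $E^*_w[n_i(x, y)] = n_i(x, y^*_w(x))$ is the approximated expectation
Return $w = \frac{1}{T} \sum_{t=1}^{T} w_t$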
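The loop above might be sketched as follows on a toy conditional problem. Everything specific here is an assumption for illustration: the two-constant domain with `Bird` as evidence, the exact MAP search by enumeration standing in for MaxWalkSat, and the values of `eta`, `T` and `seed`.

```python
import itertools
import random

CONSTANTS = ["Sparrow", "Penguin"]
BIRD = {"Sparrow": True, "Penguin": True}  # evidence atoms X (toy assumption)
QUERY_ATOMS = [(p, c) for p in ("AbleToFly", "EatsFish") for c in CONSTANTS]

def counts(y):
    """n(x, y) for the two bird formulas; y maps query atoms to truth values."""
    n1 = sum((not BIRD[c]) or y[("AbleToFly", c)] for c in CONSTANTS)
    n2 = sum(y[("AbleToFly", c)] or y[("EatsFish", c)] for c in CONSTANTS)
    return (n1, n2)

def map_counts(w):
    """Counts in the MAP state. Exact enumeration stands in for MaxWalkSat."""
    states = (dict(zip(QUERY_ATOMS, bits))
              for bits in itertools.product([False, True], repeat=len(QUERY_ATOMS)))
    best = max(states, key=lambda y: sum(wi * ni for wi, ni in zip(w, counts(y))))
    return counts(best)

def voted_perceptron(n_obs, eta=0.1, T=50, seed=0):
    rng = random.Random(seed)
    w = [rng.gauss(0, 1) for _ in n_obs]  # w_i ~ N(0, 1)
    total = [0.0] * len(w)
    for _ in range(T):
        n_map = map_counts(w)  # approximates E_w[n_i(x, y)]
        w = [wi + eta * (no - nm) for wi, no, nm in zip(w, n_obs, n_map)]
        total = [s + wi for s, wi in zip(total, w)]
    return [s / T for s in total]  # average of w_1, ..., w_T

# Observed query state: Sparrow flies, Penguin eats fish.
observed = {a: False for a in QUERY_ATOMS}
observed[("AbleToFly", "Sparrow")] = True
observed[("EatsFish", "Penguin")] = True
print(voted_perceptron(counts(observed)))
```

Returning the average of the iterates rather than the last one damps the oscillation caused by the MAP state jumping between iterations.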
Markov Chain Monte Carlo (MCMC)
$E_w[n_i(x, y)]$ can be estimated via the Markov chain Monte Carlo method.
The Markov chain has to generate values $y_1, \dots, y_t$ whose asymptotic distribution follows $P_w(Y = y \mid X = x)$.
Taking t samples from the Markov chain and averaging the counts $n_i(x, y_j)$ gives a good approximation of $E_w[n_i(x, y)]$.
Markov Chain Monte Carlo (MCMC)
Estimate $E_w[n_i(x, y)]$ by MCMC as follows:
Let $W_0 = X \cup Y$
For i = 1 to T
    $W_i = W_{i-1}$
    For each y in Y
        Compute $P(y \mid Y \setminus \{y\})$
        Sample $y_t$ from $P(y \mid Y \setminus \{y\})$
        Set $y = y_t$ in $W_i$
Samples are taken from $W_0, \dots, W_T$
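The sampling scheme above is Gibbs sampling, and a small sketch can compare its estimate with the exact expectation. The toy domain (two constants, `Bird` as evidence), the burn-in length, the number of sweeps and the function names are all assumptions for this illustration. Each query atom is resampled conditioned on all other atoms, evidence included.

```python
import itertools
import math
import random

CONSTANTS = ["Sparrow", "Penguin"]
BIRD = {"Sparrow": True, "Penguin": True}  # evidence atoms X (toy assumption)
QUERY_ATOMS = [(p, c) for p in ("AbleToFly", "EatsFish") for c in CONSTANTS]

def counts(y):
    n1 = sum((not BIRD[c]) or y[("AbleToFly", c)] for c in CONSTANTS)
    n2 = sum(y[("AbleToFly", c)] or y[("EatsFish", c)] for c in CONSTANTS)
    return (n1, n2)

def score(w, y):
    return sum(wi * ni for wi, ni in zip(w, counts(y)))

def gibbs_expected_counts(w, sweeps=2000, burn_in=200, seed=0):
    rng = random.Random(seed)
    y = {a: rng.random() < 0.5 for a in QUERY_ATOMS}  # W_0
    sums = [0.0] * len(w)
    for sweep in range(burn_in + sweeps):
        for atom in QUERY_ATOMS:
            y[atom] = True
            s_true = score(w, y)
            y[atom] = False
            s_false = score(w, y)
            # P(atom = True | all other atoms)
            p_true = 1.0 / (1.0 + math.exp(s_false - s_true))
            y[atom] = rng.random() < p_true
        if sweep >= burn_in:
            for i, n in enumerate(counts(y)):
                sums[i] += n
    return [s / sweeps for s in sums]

def exact_expected_counts(w):
    """Reference value by enumerating all 16 query states."""
    states = [dict(zip(QUERY_ATOMS, bits))
              for bits in itertools.product([False, True], repeat=len(QUERY_ATOMS))]
    z = sum(math.exp(score(w, y)) for y in states)
    return [sum(math.exp(score(w, y)) / z * counts(y)[i] for y in states)
            for i in range(len(w))]

print(gibbs_expected_counts([1.0, 0.5]), exact_expected_counts([1.0, 0.5]))
```

Discarding the first sweeps as burn-in is a common practice and a choice made here; the slide takes samples from all of $W_0, \dots, W_T$.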
Contrastive Divergence
Initialize $w_{i,0} \sim N(0, 1)$
For t = 1 to T
    Approximate $E_w[n_i(x, y)]$ by MCMC, giving $E^*_w[n_i(x, y)]$
    $w_{i,t} = w_{i,t-1} + \eta\, \big(n_i(x, y) - E^*_w[n_i(x, y)]\big)$
Return $w = \frac{1}{T} \sum_{t=1}^{T} w_t$
Per-Weight Learning Rates
Gradient descent is highly vulnerable to ill-conditioning, that is, a condition number far from one. The condition number is the ratio of the largest to the smallest absolute eigenvalue of the Hessian.
In MLNs the Hessian is the negative covariance matrix of the clause counts:
$$H = \begin{pmatrix} E_w[n_1]E_w[n_1] - E_w[n_1 n_1] & \cdots & E_w[n_1]E_w[n_k] - E_w[n_1 n_k] \\ \vdots & \ddots & \vdots \\ E_w[n_k]E_w[n_1] - E_w[n_k n_1] & \cdots & E_w[n_k]E_w[n_k] - E_w[n_k n_k] \end{pmatrix}$$
⇒ Introduce a different learning rate for each weight: $\eta_i = \eta / n_i$
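One simple way to realize per-weight rates is to divide a global rate by each clause's number of true groundings, $\eta_i = \eta / n_i$, so that clauses with many groundings take proportionally smaller steps. The sketch below assumes that form; the guard against $n_i = 0$ and the function names are additions made for this illustration.

```python
def per_weight_rates(eta, n_obs):
    """eta_i = eta / n_i; max(n, 1) avoids division by zero (an assumption)."""
    return [eta / max(n, 1) for n in n_obs]

def update(w, n_obs, expected, eta):
    """One gradient step with an individual learning rate per weight."""
    rates = per_weight_rates(eta, n_obs)
    return [wi + r * (no - e) for wi, r, no, e in zip(w, rates, n_obs, expected)]

# A clause with 100 true groundings moves 25 times more cautiously
# per unit of gradient than a clause with 4.
print(per_weight_rates(1.0, [4, 100]))
print(update([0.0, 0.0], [4, 100], [3.0, 90.0], eta=1.0))
```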
Diagonal Newton
An iteration of the form
$$x_t = x_{t-1} - H^{-1}\, \nabla f(x_{t-1})$$
for a twice differentiable function f with Hessian H is the so-called Newton method.
For many weights, calculating the inverse of H becomes infeasible.
Use the inverse of the diagonalized Hessian instead:
$$\begin{pmatrix} \dfrac{1}{E_w[n_1]E_w[n_1] - E_w[n_1 n_1]} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \dfrac{1}{E_w[n_k]E_w[n_k] - E_w[n_k n_k]} \end{pmatrix}$$
This is the so-called diagonal Newton method.
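Diagonal Newton
$$w_{i,t} = w_{i,t-1} - \frac{n_i(x, y) - E_w[n_i(x, y)]}{E_w[n_i(x, y)]\,E_w[n_i(x, y)] - E_w[n_i(x, y)\, n_i(x, y)]}$$
Step size for a given search direction d:
$$\alpha = \frac{-d^T\, \nabla \log P_w(Y = y \mid X = x)}{d^T H d - \lambda\, d^T d}$$
Because H is the negative covariance matrix of the clause counts, $d^T H d \le 0$, so a larger $\lambda \ge 0$ yields a smaller step.
$\lambda$ limits the step size to a region in which the quadratic approximation is good.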
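A single diagonal Newton step can be sketched from sampled clause counts. The sketch assumes the sign convention that H is the negative covariance matrix of the counts (so the log-likelihood is maximized), takes the search direction $d_i = g_i / \mathrm{Var}(n_i)$ with $g_i = n_i - E_w[n_i]$, and regularizes the step size with $\lambda$ as $\alpha = -d^T g / (d^T H d - \lambda\, d^T d)$. The function name, the variance floor and the sample format are assumptions for this illustration.

```python
def diagonal_newton_step(w, n_obs, samples, lam=1.0):
    """One step from sampled count vectors (e.g. produced by MCMC)."""
    k, m = len(w), len(samples)
    mean = [sum(s[i] for s in samples) / m for i in range(k)]
    cov = [[sum(s[i] * s[j] for s in samples) / m - mean[i] * mean[j]
            for j in range(k)] for i in range(k)]
    g = [n_obs[i] - mean[i] for i in range(k)]  # gradient of the log-likelihood
    # Direction from the diagonal of H = -cov; the floor avoids division by zero.
    d = [g[i] / max(cov[i][i], 1e-8) for i in range(k)]
    dg = sum(di * gi for di, gi in zip(d, g))
    dHd = -sum(d[i] * cov[i][j] * d[j] for i in range(k) for j in range(k))
    dd = sum(di * di for di in d)
    denom = dHd - lam * dd
    alpha = -dg / denom if denom != 0 else 0.0
    return [wi + alpha * di for wi, di in zip(w, d)], alpha

# Four sampled count vectors with mean (1, 1), unit variances, zero covariance.
samples = [(0, 0), (2, 0), (0, 2), (2, 2)]
print(diagonal_newton_step([0.0, 0.0], (2, 1), samples, lam=1.0))
print(diagonal_newton_step([0.0, 0.0], (2, 1), samples, lam=0.0))
```

With `lam=0` the step is the full Newton step along d; increasing `lam` shortens it.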
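Diagonal Newton: choosing λ
Approximated actual change: $\text{actual} = d_{t-1}^T\, \nabla P_t$
Predicted change: $\text{pred} = d_{t-1}^T\, \nabla P_{t-1} + \nabla P_{t-1}^T\, H_{t-1}\, \nabla P_{t-1} / 2$
If actual / pred > 0.75, then $\lambda_{t+1} = \lambda_t / 2$
If actual / pred < 0.25, then $\lambda_{t+1} = 4 \lambda_t$
If $\text{actual} = d_{t-1}^T\, \nabla P_t$ is negative, the step is rejected and redone after adjusting λ.
Here $\nabla P_t$ denotes the gradient of the log-likelihood at step t.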
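The adjustment rule for λ is small enough to write down directly. This sketch takes the actual and predicted changes as given numbers; the function names and the guard for a zero predicted change are assumptions for this illustration.

```python
def adjust_lambda(lam, actual, pred):
    """Shrink lambda when the quadratic model predicts well, grow it otherwise."""
    if pred == 0:
        return lam  # no prediction to compare against (assumed guard)
    ratio = actual / pred
    if ratio > 0.75:
        return lam / 2
    if ratio < 0.25:
        return lam * 4
    return lam

def step_accepted(actual):
    """A negative approximated actual change means the step is rejected."""
    return actual >= 0

print(adjust_lambda(1.0, 0.9, 1.0))  # good agreement: lambda is halved
print(adjust_lambda(1.0, 0.1, 1.0))  # poor agreement: lambda is quadrupled
```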
Scaled Conjugate Gradient
Instead of taking a constant step size, do a so-called line search:
$$\arg\min_a h(a) = f(x_k + a\, d), \quad a \in \mathbb{R}^+$$
This can be minimized "loosely" (an approximate minimum suffices).
On ill-conditioned problems this is inefficient, since line searches along successive directions tend to partly undo each other's effect: the gradient zigzags.
Impose at each step that the gradient along previous directions remains zero.
Use the Hessian instead of a line search.
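The two ideas (conjugate directions, and a Hessian-based step in place of a line search) can be illustrated on a quadratic $f(x) = \frac{1}{2} x^T A x - b^T x$, whose Hessian is A. Note that this is plain linear conjugate gradient on a toy quadratic, shown only as a sketch of the principle; it is not the full scaled conjugate gradient method, which also regularizes the step. The matrix, the vector and the helper names are made up for the example.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(A, b, x0, tol=1e-12):
    """Minimize 0.5 x^T A x - b^T x for a symmetric positive definite A."""
    x = list(x0)
    r = [bi - ai for bi, ai in zip(b, matvec(A, x))]  # negative gradient
    d = list(r)
    rs = dot(r, r)
    for _ in range(len(b)):
        if rs < tol:
            break
        Ad = matvec(A, d)
        alpha = rs / dot(d, Ad)  # exact step from the Hessian, no line search
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * adi for ri, adi in zip(r, Ad)]
        rs_new = dot(r, r)
        # New direction is conjugate to the previous ones, so earlier
        # progress is not undone (no zigzagging).
        d = [ri + (rs_new / rs) * di for ri, di in zip(r, d)]
        rs = rs_new
    return x

print(conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], [0.0, 0.0]))
```

On a k-dimensional quadratic this reaches the minimum in at most k steps in exact arithmetic, whereas steepest descent with line searches can zigzag for many iterations.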
Experiments: Datasets

Cora
    1295 citations of 132 different papers
    Task: determine which citations refer to the same paper, given the words in their author, title and venue fields

WebKB
    4165 labeled web pages and 10,935 web links, each page marked with a subset of the categories "student, faculty, professor, department, research project, course"
    Task: predict categories from web pages' words and links
MLN Rules for WebKB
Has(word, page) ⇒ Class(page, class)
¬Has(word, page) ⇒ Class(page, class)
Class(page1, class1) ∧ LinksTo(page1, page2) ⇒ Class(page2, class2)
A separate weight was learned for each of these rules for each (word, class) and (class, class) pair.
The model contained 10,981 weights.
The number of ground clauses exceeded 300,000.
Experiments: Methodology
Cora: five-way cross-validation
WebKB: four-way cross-validation
Each algorithm was trained for 4 h with different learning rates on each training set, with a Gaussian prior, based on preliminary experiments.
After tuning, the algorithms were rerun for 10 h on the whole data set (including the held-out validation set).
Results
That's it folks!