Efficient Weight Learning for Markov Logic Networks

Speaker: Manuel Noll
Advisor: Maximilian Dylla

What is this talk about? Weight learning, of course. A weight represents the relative strength or importance of a rule (or constraint). We want to find the best-fitting weights w for our world x. Stated differently, we have some observations x and want to maximize the likelihood of our model as a function of the parameter w:
$\arg\max_w P_w(X = x)$, where $P_w(X = x) = \frac{1}{Z} \exp\left(\sum_i w_i n_i(x)\right)$
This is a convex optimization problem, so a gradient method can solve it.

Getting a better understanding of $P_w(X = x) = \frac{1}{Z} \exp\left(\sum_i w_i n_i(x)\right)$
$n_i(x)$ is the number of times the i-th formula is satisfied by the state of the world x.
Example:
Bird: Nightingale, Sparrow, Penguin
Able to fly: Nightingale, Sparrow
Eats fish: Penguin, Bear
Formula 1 (weight 1): $\forall x: Bird(x) \Rightarrow AbleToFly(x)$
Formula 2 (weight 5): $\forall x: \neg AbleToFly(x) \Rightarrow EatsFish(x)$
Constant symbols: C = (Nightingale, Sparrow, Penguin, Bear)
We make a closed-world assumption.
Vector translation (atoms ordered Bird, AbleToFly, EatsFish, each over C): x = (1,1,1,0, 1,1,0,0, 0,0,1,1)

Getting a better understanding of $P_w(X = x)$: which constants satisfy each formula
Formula 1 (weight 1), $\forall x: Bird(x) \Rightarrow AbleToFly(x)$: satisfied by Nightingale, Sparrow
Formula 2 (weight 5), $\forall x: \neg AbleToFly(x) \Rightarrow EatsFish(x)$: satisfied by Bear, Penguin
The constants listed are those for which the premise is true and the conclusion holds. Under material implication, a grounding whose premise is false also counts as satisfied.
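The counts $n_i(x)$ for the bird example can be checked with a few lines of Python. This is a minimal sketch: the names `WORLD`, `holds`, `formula_1`, `formula_2`, `count_true_groundings` and `world_vector` are illustrative and not part of the talk. It counts every grounding that is true under material implication, so vacuously satisfied groundings are included and the totals are larger than the two constants listed per formula on the slide.

```python
# Constants and the closed-world truth assignment from the slide's example.
CONSTANTS = ["Nightingale", "Sparrow", "Penguin", "Bear"]
WORLD = {
    "Bird": {"Nightingale", "Sparrow", "Penguin"},
    "AbleToFly": {"Nightingale", "Sparrow"},
    "EatsFish": {"Penguin", "Bear"},
}

def holds(world, predicate, constant):
    """Closed world: an atom is true only if it is listed explicitly."""
    return constant in world[predicate]

def formula_1(world, c):
    # Bird(c) => AbleToFly(c), written as (not Bird(c)) or AbleToFly(c)
    return (not holds(world, "Bird", c)) or holds(world, "AbleToFly", c)

def formula_2(world, c):
    # not AbleToFly(c) => EatsFish(c), written as AbleToFly(c) or EatsFish(c)
    return holds(world, "AbleToFly", c) or holds(world, "EatsFish", c)

def count_true_groundings(formula, world, constants=CONSTANTS):
    """n_i(x): number of groundings of the formula that are true in the world."""
    return sum(1 for c in constants if formula(world, c))

def world_vector(world, constants=CONSTANTS):
    """Flatten the world into 0/1 entries, ordered Bird, AbleToFly, EatsFish."""
    return [int(holds(world, p, c))
            for p in ("Bird", "AbleToFly", "EatsFish")
            for c in constants]

n1 = count_true_groundings(formula_1, WORLD)  # Penguin is the only violation
n2 = count_true_groundings(formula_2, WORLD)  # no constant violates formula 2
```

Here `world_vector(WORLD)` reproduces the vector translation of the example.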

$P_w(X = x) = \frac{1}{Z} \exp\left(\sum_i w_i n_i(x)\right)$ can be seen as the probability that the given world is an interpretation of our model. When x is fixed, the expression depends only on w and is thus a likelihood function of w. Given a prior distribution p(w), we can compute the posterior distribution of w:
$P(w \mid X = x) = \frac{P_w(X = x)\, p(w)}{\int P_{w'}(X = x)\, p(w')\, dw'}$
Maximizing this posterior over w is the so-called maximum a posteriori (MAP) estimation of w.

Maximization of the log-likelihood function
We want to fit our model to our world x, so we have to optimize the weights:
$\arg\max_w P_w(X = x)$, with $P_w(X = x) = \frac{1}{Z} \exp\left(\sum_i w_i n_i(x)\right)$ and $Z = \sum_{x'} \exp\left(\sum_i w_i n_i(x')\right)$
It is easier to maximize the logarithm:
$\log P_w(X = x) = \sum_i w_i n_i(x) - \log \sum_{x'} \exp\left(\sum_i w_i n_i(x')\right)$
$\Rightarrow \frac{\partial}{\partial w_i} \log P_w(X = x) = n_i(x) - \sum_{x'} P_w(X = x')\, n_i(x') = n_i(x) - E_w[n_i(x)]$
The expectation is taken over all possible worlds x' under the current weights.
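For a very small model the gradient $n_i(x) - E_w[n_i(x)]$ can be computed exactly by enumerating every world. The sketch below assumes a hypothetical two-atom model (atoms A and B, formula 1 is A ⇒ B, formula 2 is A) that is not taken from the talk. Enumeration is exponential in the number of atoms, which is why it does not scale to real MLNs.

```python
import itertools
import math

def clause_counts(world):
    """Hypothetical toy model over two atoms (A, B):
    formula 1 is A => B, formula 2 is A."""
    a, b = world
    return [int((not a) or b), int(a)]

def log_likelihood_gradient(weights, observed, n_atoms=2):
    """Return [n_i(x) - E_w[n_i]] by enumerating every possible world x'."""
    worlds = list(itertools.product([0, 1], repeat=n_atoms))
    counts = [clause_counts(x) for x in worlds]
    scores = [sum(w * n for w, n in zip(weights, c)) for c in counts]
    shift = max(scores)  # subtract the max score for numerical stability
    unnormalized = [math.exp(s - shift) for s in scores]
    z = sum(unnormalized)
    expected = [
        sum(u * c[i] for u, c in zip(unnormalized, counts)) / z
        for i in range(len(weights))
    ]
    return [n - e for n, e in zip(clause_counts(observed), expected)]

# With all weights zero every world is equally likely, so
# E[n_1] = 3/4 and E[n_2] = 1/2; the observed world (A=1, B=1) has counts (1, 1).
gradient = log_likelihood_gradient([0.0, 0.0], observed=(1, 1))
```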

A conditional likelihood approach
In many applications, we know a priori which predicates will be evidence and which ones will be queried. The goal is to correctly predict the latter given the former.
Partition the ground atoms in the domain into a set of evidence atoms X and a set of query atoms Y. The conditional likelihood of Y given X is
$P_w(Y = y \mid X = x) = \frac{1}{Z_x} \exp\left(\sum_{i \in F_Y} w_i n_i(x, y)\right)$
$F_Y$: the set of all MLN clauses with at least one grounding involving a query atom

A convex problem
$\arg\max_w P_w(Y = y \mid X = x)$, with $P_w(Y = y \mid X = x) = \frac{1}{Z_x} \exp\left(\sum_{i \in F_Y} w_i n_i(x, y)\right)$
$\Rightarrow \frac{\partial}{\partial w_i} \log P_w(Y = y \mid X = x) = n_i(x, y) - \sum_{y'} P_w(Y = y' \mid X = x)\, n_i(x, y') = n_i(x, y) - E_w[n_i(x, y)]$
The negative log-likelihood is a convex function (the log-likelihood itself is concave). Since $w \in [0, \infty)$, which is a convex set, we have a convex optimization problem. For a convex optimization problem, any local minimum that exists is also a global minimum.

Gradient descent / ascent
Let f be a differentiable function. The iteration
$x_n = x_{n-1} + \eta \nabla f(x_{n-1})$
with step size $\eta > 0$ is the so-called gradient ascent method (gradient descent uses $-\eta$ instead).
In our case
$w_{i,t} = w_{i,t-1} + \eta \left.\frac{\partial}{\partial w_i} \log P_w(Y = y \mid X = x)\right|_{w_{t-1}} = w_{i,t-1} + \eta \left(n_i(x, y) - E_w[n_i(x, y)]\right)$
But how can we compute $E_w[n_i(x, y)]$, which is still intractable?
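The iteration itself is only a few lines. The sketch below is generic: `gradient_ascent` and the toy objective $f(x) = -(x_0 - 3)^2$ are illustrative choices, not the MLN likelihood. For weight learning, the `gradient` argument would return $n_i(x, y) - E_w[n_i(x, y)]$ for each weight.

```python
def gradient_ascent(gradient, start, eta=0.1, steps=200):
    """Repeat x <- x + eta * grad f(x) for a fixed number of steps."""
    x = list(start)
    for _ in range(steps):
        g = gradient(x)
        x = [xi + eta * gi for xi, gi in zip(x, g)]
    return x

def toy_gradient(x):
    # Gradient of the concave toy objective f(x) = -(x0 - 3)^2.
    return [-2.0 * (x[0] - 3.0)]

# The maximizer of the toy objective is x0 = 3.
result = gradient_ascent(toy_gradient, [0.0])
```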

MAP with MaxWalkSat
$E_w[n_i(x, y)]$ can be approximated by the counts $n_i(x, y^*_w(x))$ in the MAP state $y^*_w(x)$.
This will be a good approximation if most of the probability mass of $P_w(Y = y \mid X = x)$ is concentrated around $y^*_w(x)$.
The form $\frac{1}{Z_x} \exp\left(\sum_{i \in F_Y} w_i n_i(x, y)\right)$ suggests the solution: $y^*_w(x)$ is the state that maximizes the sum of the weights of the satisfied ground clauses.
This weighted satisfiability problem can be solved approximately by MaxWalkSat.

Voted Perceptron Algorithm
Initialize $w_{i,0} \sim N(0, 1)$
For t = 1 to T
    Approximate $y^*_w(x)$ by MaxWalkSat
    Count $n_i(x, y^*_w(x))$
    $w_{i,t} = w_{i,t-1} + \eta \left(n_i(x, y) - n_i(x, y^*_w(x))\right)$
Return $w = \frac{1}{T} \sum_{t=1}^{T} w_t$
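The loop above can be sketched on a toy problem. Two things here are assumptions and not part of the talk: the two-atom model (query atoms A and B, formula 1 is A ⇒ B, formula 2 is A, no evidence atoms), and the use of exhaustive search in place of MaxWalkSat to find the MAP state. Exhaustive search only works because the toy model has four states.

```python
import itertools
import random

def clause_counts(y):
    """Hypothetical toy model: formula 1 is A => B, formula 2 is A."""
    a, b = y
    return [int((not a) or b), int(a)]

def map_state(weights, n_atoms):
    """Exhaustive search for the MAP state (a stand-in for MaxWalkSat)."""
    return max(
        itertools.product([0, 1], repeat=n_atoms),
        key=lambda y: sum(w * n for w, n in zip(weights, clause_counts(y))),
    )

def voted_perceptron(observed, steps=50, eta=0.1, init=None, seed=0):
    """Perceptron updates with MAP counts; returns the averaged weights."""
    rng = random.Random(seed)
    n_obs = clause_counts(observed)
    if init is not None:
        w = list(init)
    else:
        w = [rng.gauss(0.0, 1.0) for _ in n_obs]  # w_i ~ N(0, 1)
    total = [0.0] * len(w)
    for _ in range(steps):
        n_map = clause_counts(map_state(w, len(observed)))
        w = [wi + eta * (no - nm) for wi, no, nm in zip(w, n_obs, n_map)]
        total = [t + wi for t, wi in zip(total, w)]
    return [t / steps for t in total]

averaged = voted_perceptron(observed=(1, 1))
```

If the MAP state already has the same counts as the observed state, the update is zero and the weights stay where they are.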

Markov Chain Monte Carlo (MCMC)
$E_w[n_i(x, y)]$ can be estimated via the Markov Chain Monte Carlo method.
The Markov chain has to generate values $y_1, \ldots, y_t$ whose asymptotic distribution follows the distribution $P_w(Y = y \mid X = x)$.
Taking t samples from the Markov chain and averaging the clause counts gives us a good approximation of $E_w[n_i(x, y)]$.

Markov Chain Monte Carlo (MCMC)
Estimate $E_w[n_i(x, y)]$ by MCMC as follows:
Let $W^0 = X \cup Y$
For i = 1 to T
    $W^i = W^{i-1}$
    For each y in Y
        Compute $P(y \mid Y \setminus \{y\})$
        Sample $y^t$ from $P(y \mid Y \setminus \{y\})$
        Set $y = y^t$ in $W^i$
Samples are taken from $W^0, \ldots, W^T$
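The sampling loop can be sketched as a Gibbs sampler on a toy problem. The two-atom model (query atoms A and B, formula 1 is A ⇒ B, formula 2 is A, no evidence atoms) and all function names are assumptions made for illustration. Each atom is resampled from its conditional distribution given the rest of the state, and the clause counts are averaged over the sweeps. The sketch uses no burn-in period.

```python
import math
import random

def clause_counts(y):
    """Hypothetical toy model: formula 1 is A => B, formula 2 is A."""
    a, b = y
    return [int((not a) or b), int(a)]

def score(weights, y):
    """Unnormalized log-probability sum_i w_i n_i(y)."""
    return sum(w * n for w, n in zip(weights, clause_counts(y)))

def gibbs_expected_counts(weights, start, sweeps=5000, seed=0):
    """Estimate E_w[n_i] by averaging clause counts over Gibbs sweeps."""
    rng = random.Random(seed)
    state = list(start)
    totals = [0.0] * len(weights)
    for _ in range(sweeps):
        for j in range(len(state)):
            # Conditional probability that atom j is true given the rest.
            state[j] = 1
            s_true = score(weights, state)
            state[j] = 0
            s_false = score(weights, state)
            p_true = 1.0 / (1.0 + math.exp(s_false - s_true))
            state[j] = 1 if rng.random() < p_true else 0
        for i, n in enumerate(clause_counts(state)):
            totals[i] += n
    return [t / sweeps for t in totals]

# With zero weights all four states are equally likely,
# so the exact expectations are E[n_1] = 0.75 and E[n_2] = 0.5.
estimate = gibbs_expected_counts([0.0, 0.0], start=(1, 1))
```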

Contrastive Divergence
Initialize $w_{i,0} \sim N(0, 1)$
For t = 1 to T
    Approximate $E_w[n_i(x, y)]$ by MCMC
    $w_{i,t} = w_{i,t-1} + \eta \left(n_i(x, y) - E_w[n_i(x, y)]\right)$
Return $w = \frac{1}{T} \sum_{t=1}^{T} w_t$
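A compact sketch of this loop follows. It makes several assumptions that are not stated on the slide: the toy two-atom model (A ⇒ B and A), a chain that is restarted at the observed state for every update, `k` Gibbs sweeps per update, and a single sample as the estimate of $E_w[n_i(x, y)]$.

```python
import math
import random

def clause_counts(y):
    """Hypothetical toy model: formula 1 is A => B, formula 2 is A."""
    a, b = y
    return [int((not a) or b), int(a)]

def score(weights, y):
    return sum(w * n for w, n in zip(weights, clause_counts(y)))

def gibbs_sweep(weights, state, rng):
    """Resample every atom once from its conditional distribution."""
    state = list(state)
    for j in range(len(state)):
        state[j] = 1
        s_true = score(weights, state)
        state[j] = 0
        s_false = score(weights, state)
        p_true = 1.0 / (1.0 + math.exp(s_false - s_true))
        state[j] = 1 if rng.random() < p_true else 0
    return state

def contrastive_divergence(observed, steps=200, eta=0.05, k=1,
                           init=None, seed=0):
    """Gradient updates with a k-sweep MCMC estimate of the expected counts."""
    rng = random.Random(seed)
    n_obs = clause_counts(observed)
    if init is not None:
        w = list(init)
    else:
        w = [rng.gauss(0.0, 1.0) for _ in n_obs]  # w_i ~ N(0, 1)
    total = [0.0] * len(w)
    for _ in range(steps):
        state = list(observed)  # assumption: restart the chain at the data
        for _ in range(k):
            state = gibbs_sweep(w, state, rng)
        n_sample = clause_counts(state)  # one-sample estimate of E_w[n_i]
        w = [wi + eta * (no - ns) for wi, no, ns in zip(w, n_obs, n_sample)]
        total = [t + wi for t, wi in zip(total, w)]
    return [t / steps for t in total]

learned = contrastive_divergence(observed=(1, 1), init=[0.0, 0.0])
```

In this toy setting both observed counts are 1 and every sampled count is 0 or 1, so no update can decrease a weight.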

Per-Weight Learning Rates
Gradient descent is highly vulnerable to ill-conditioning, that is, a large deviation from one of the ratio of the largest and smallest absolute eigenvalues of the Hessian.
In MLNs the Hessian is
$H = \begin{pmatrix} E_w[n_1] E_w[n_1] - E_w[n_1 n_1] & \cdots & E_w[n_1] E_w[n_k] - E_w[n_1 n_k] \\ \vdots & \ddots & \vdots \\ E_w[n_k] E_w[n_1] - E_w[n_k n_1] & \cdots & E_w[n_k] E_w[n_k] - E_w[n_k n_k] \end{pmatrix}$
the negative covariance matrix of the clause counts.
$\Rightarrow$ introduce a different learning rate for each weight: $\eta_i = \frac{\eta}{n_i}$

Diagonal Newton
An iteration of the form
$x_t = x_{t-1} - H^{-1} \nabla f(x_{t-1})$
for a differentiable function f with Hessian H is the so-called Newton method.
For many weights, calculating the inverse of H becomes infeasible.
Use the inverse of the diagonalized Hessian instead:
$\begin{pmatrix} \frac{1}{E_w[n_1] E_w[n_1] - E_w[n_1 n_1]} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \frac{1}{E_w[n_k] E_w[n_k] - E_w[n_k n_k]} \end{pmatrix}$
This is the so-called diagonal Newton method.

Diagonal Newton
$w_{i,t} = w_{i,t-1} - \frac{n_i(x, y) - E_w[n_i(x, y)]}{E_w[n_i(x, y)]\, E_w[n_i(x, y)] - E_w[n_i(x, y)\, n_i(x, y)]}$
Step size for a given search direction d:
$\alpha = \frac{d^T \nabla P_w(Y = y \mid X = x)}{d^T H d + \lambda d^T d}$
Here $\nabla P_w$ denotes the gradient of the conditional log-likelihood.
$\lambda$ limits the step size to a region in which the approximation is good.
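The per-weight update can be sketched from a set of sampled clause counts. The function name `diagonal_newton_step` and the list-of-samples input are assumptions; the expectations are replaced by sample means, and the denominator is the diagonal Hessian entry $E_w[n_i]^2 - E_w[n_i^2]$, which is the negative variance of the i-th count. The step size $\alpha$ is left out of this sketch, and a weight whose count never varies in the samples is left unchanged to avoid dividing by zero.

```python
def diagonal_newton_step(weights, observed_counts, sampled_counts):
    """One diagonal Newton update per weight, using sample estimates.

    sampled_counts is a list of count vectors, one per sampled state.
    """
    m = len(sampled_counts)
    updated = []
    for i, w in enumerate(weights):
        mean = sum(s[i] for s in sampled_counts) / m
        mean_sq = sum(s[i] ** 2 for s in sampled_counts) / m
        hessian_ii = mean * mean - mean_sq  # negative variance of n_i
        if hessian_ii == 0.0:
            updated.append(w)  # no variance observed: skip this weight
        else:
            updated.append(w - (observed_counts[i] - mean) / hessian_ii)
    return updated

# Samples with counts 0 and 2: mean 1, mean of squares 2, Hessian entry -1.
# The observed count 2 exceeds the mean, so the weight increases by 1.
new_weights = diagonal_newton_step([0.0], [2], [[0], [2]])
```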

Diagonal Newton: choosing $\lambda$
Approximated actual change: $\Delta_{actual} = d_{t-1}^T \nabla P_t$
Predicted change: $\Delta_{pred} = d_{t-1}^T \left(\nabla P_{t-1} + H_{t-1} \nabla P_{t-1} / 2\right)$
If $\frac{\Delta_{actual}}{\Delta_{pred}} > 0.75$ then $\lambda_{t+1} = \lambda_t / 2$
If $\frac{\Delta_{actual}}{\Delta_{pred}} < 0.25$ then $\lambda_{t+1} = 4 \lambda_t$
If $\Delta_{actual} = d_{t-1}^T \nabla P_t$ is negative, the step is rejected and redone after adjusting $\lambda$.
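The adjustment rule translates directly into a small helper. The name `adjust_lambda` and its return convention (new value plus an accepted flag) are assumptions for illustration. The sketch also assumes that a predicted change of zero leaves $\lambda$ unchanged, which the slide does not specify.

```python
def adjust_lambda(lam, delta_actual, delta_pred):
    """Return (new_lambda, accepted) following the thresholds 0.75 and 0.25."""
    if delta_pred != 0.0:
        ratio = delta_actual / delta_pred
        if ratio > 0.75:
            lam = lam / 2.0   # approximation is good: allow larger steps
        elif ratio < 0.25:
            lam = 4.0 * lam   # approximation is poor: restrict the step
    accepted = delta_actual >= 0.0  # a negative actual change rejects the step
    return lam, accepted
```

For example, `adjust_lambda(1.0, 0.9, 1.0)` halves $\lambda$ and accepts the step, while `adjust_lambda(1.0, -0.1, 1.0)` quadruples $\lambda$ and rejects it.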

Scaled Conjugate Gradient
Instead of taking a constant step size, do a so-called line search:
$\arg\min_a h(a) = f(x_k + a d), \quad a \in \mathbb{R}_+$
which can be "loosely" minimized.
This is inefficient on ill-conditioned problems, since line searches along successive directions tend to partly undo each other's effect (the gradient zigzags).
Impose at each step that the gradient along previous directions remains zero.
Use the Hessian instead of a line search.

Experiments: Datasets
Cora: 1295 citations of 132 different papers. Task: determine which citations refer to the same paper, given the words in their author, title and venue fields.
WebKB: 4165 labeled web pages and 10,935 web links, each page marked with a subset of the categories "student, faculty, professor, department, research project, course". Task: predict categories from the web pages' words and links.

MLN Rules for WebKB
$Has(word, page) \Rightarrow Class(page, class)$
$\neg Has(word, page) \Rightarrow Class(page, class)$
$Class(page1, class1) \wedge LinksTo(page1, page2) \Rightarrow Class(page2, class1)$
A separate weight was learned for each of these rules for each (word, class) and (class, class) pair.
The model contained 10,981 weights.
The number of ground clauses exceeded 300,000.

Experiments: Methodology
Cora: five-way cross-validation
WebKB: four-way cross-validation
Each algorithm was trained for 4h with different learning rates on each training set with a Gaussian prior, based on preliminary experiments.
After tuning, the algorithms were rerun for 10h on the whole data set (including the held-out validation set).

Results

That's it folks!
