1. Inference and Representation
David Sontag, New York University
Lecture 3, Sept. 15, 2014

2. How to acquire a model?
Possible things to do:
- Use expert knowledge to determine the graph and the potentials.
- Use learning to determine the potentials, i.e., parameter learning.
- Use learning to determine the graph, i.e., structure learning.
Manual design is difficult to do and can take a long time for an expert. We usually have access to a set of examples from the distribution we wish to model, e.g., a set of images segmented by a labeler.

3. More rigorous definition
Let's assume that the domain is governed by some underlying distribution p*, which is induced by some network model M* = (G*, θ*).
We are given a dataset D of M samples from p*. The standard assumption is that the data instances are independent and identically distributed (IID).
We are also given a family of models M, and our task is to learn some model M̂ ∈ M (i.e., in this family) that defines a distribution p_M̂.
We can learn model parameters for a fixed structure, or both the structure and model parameters.

4. Goal of learning
The goal of learning is to return a model M̂ that precisely captures the distribution p* from which our data was sampled.
This is in general not achievable because of:
- computational reasons
- limited data, which only provides a rough approximation of the true underlying distribution
We need to select M̂ to construct the "best" approximation to M*. What is "best"?

5. What is "best"?
This depends on what we want to do:
1. Density estimation: we are interested in the full distribution (so later we can compute whatever conditional probabilities we want)
2. Specific prediction tasks: we are using the distribution to make a prediction
3. Structure or knowledge discovery: we are interested in the model itself (often of interest in data science)

6. 1) Learning as density estimation
We want to learn the full distribution so that later we can answer any probabilistic inference query.
In this setting we can view the learning problem as density estimation: we want to construct M̂ so that it is as "close" as possible to p*.
How do we evaluate "closeness"? KL-divergence (in particular, the M-projection) is one possibility:
D(p* ‖ p_θ) = E_{x∼p*}[ log( p*(x) / p_θ(x) ) ]

7. Expected log-likelihood
We can simplify this somewhat:
D(p* ‖ p_θ) = E_{x∼p*}[ log( p*(x) / p_θ(x) ) ] = −H(p*) − E_{x∼p*}[ log p_θ(x) ]
The first term does not depend on θ. Then, finding the minimal M-projection is equivalent to maximizing the expected log-likelihood E_{x∼p*}[ log p_θ(x) ].
This asks that p_θ assign high probability to instances sampled from p*, so as to reflect the true distribution. Because of the log, samples x where p_θ(x) ≈ 0 weigh heavily in the objective.
Although we can now compare models, since we are not computing H(p*) we don't know how close we are to the optimum.
Problem: in general we do not know p*.
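As a quick sanity check of this decomposition, here is a minimal sketch in Python (the two distributions are made-up illustrative values):

```python
import numpy as np

# Hypothetical discrete distributions over 3 outcomes (illustrative values).
p_star = np.array([0.5, 0.3, 0.2])   # true distribution p*
p_theta = np.array([0.4, 0.4, 0.2])  # model distribution p_theta

# Direct KL divergence: E_{x~p*}[log(p*(x) / p_theta(x))]
kl = np.sum(p_star * np.log(p_star / p_theta))

# Decomposition: -H(p*) - E_{x~p*}[log p_theta(x)]
neg_entropy = np.sum(p_star * np.log(p_star))   # -H(p*)
expected_ll = np.sum(p_star * np.log(p_theta))  # E_{x~p*}[log p_theta(x)]

assert np.isclose(kl, neg_entropy - expected_ll)  # the two forms agree
```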

8. Maximum likelihood
Approximate the expected log-likelihood E_{x∼p*}[ log p_θ(x) ] with the empirical log-likelihood:
E_D[ log p_θ(x) ] = (1/|D|) Σ_{x∈D} log p_θ(x)
Maximum likelihood learning is then:
max_θ (1/|D|) Σ_{x∈D} log p_θ(x)
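A minimal sketch of maximum likelihood in the simplest possible setting, a single categorical variable (the dataset below is made up); here the empirical log-likelihood is maximized by the empirical frequencies:

```python
import numpy as np

# Hypothetical dataset of samples from a discrete variable with 3 states.
data = np.array([0, 1, 0, 2, 0, 1, 0, 0, 2, 1])

# For a categorical model, (1/|D|) * sum_x log p_theta(x) is maximized
# by setting p_theta to the empirical frequencies.
counts = np.bincount(data, minlength=3)
theta_ml = counts / counts.sum()   # ML estimate: [0.5, 0.3, 0.2]

# Empirical log-likelihood under the ML parameters
empirical_ll = np.mean(np.log(theta_ml[data]))
print(theta_ml, empirical_ll)
```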

9. 2) Likelihood, Loss and Risk
We now generalize this by introducing the concept of a loss function.
A loss function loss(x, M) measures the loss that a model M makes on a particular instance x.
Assuming instances are sampled from some distribution p*, our goal is to find the model that minimizes the expected loss, or risk: E_{x∼p*}[ loss(x, M) ]
What is the loss function which corresponds to density estimation? Log-loss:
loss(x, M̂) = −log p_θ(x) = log( 1 / p_θ(x) )
p* is unknown, but we can approximate the expectation using the empirical average, i.e., the empirical risk:
E_D[ loss(x, M̂) ] = (1/|D|) Σ_{x∈D} loss(x, M̂)
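A small sketch of risk versus empirical risk under log-loss (made-up distributions of the same form as above): as |D| grows, the empirical average converges to the expectation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true p* and model p_theta over 3 states.
p_star = np.array([0.5, 0.3, 0.2])
p_theta = np.array([0.4, 0.4, 0.2])

# True risk under log-loss: E_{x~p*}[-log p_theta(x)]
risk = -np.sum(p_star * np.log(p_theta))

# Empirical risk on datasets D of increasing size
for m in [10, 1000, 100000]:
    data = rng.choice(3, size=m, p=p_star)
    empirical_risk = -np.mean(np.log(p_theta[data]))
    print(m, empirical_risk, risk)
```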

10. Example: conditional log-likelihood
Suppose we want to predict a set of variables Y given some others X, e.g., for segmentation or stereo vision.
We concentrate on predicting p(Y | X), and use a conditional loss function:
loss(x, y, M̂) = −log p_θ(y | x)
Since the loss function only depends on p_θ(y | x), it suffices to estimate the conditional distribution, not the joint.
This is the objective function we use to train conditional random fields (CRFs), which we discussed in Lecture 2.
(Figure: stereo vision example — input: two images; output: disparity)
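As an illustrative sketch, logistic regression can be viewed as about the simplest conditional model (a single binary output y given features x); the data below is synthetic and this setup is an assumption for illustration, not the CRF from Lecture 2:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                            # features x
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)   # labels y

def conditional_log_loss(w, X, y):
    """Empirical conditional log-loss: -(1/|D|) sum log p_theta(y|x)."""
    p = 1.0 / (1.0 + np.exp(-X @ w))   # p_theta(y=1 | x), logistic model
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(conditional_log_loss(np.zeros(3), X, y))   # equals log 2 at w = 0
```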

11. How to avoid overfitting?
Hard constraints, e.g. by selecting a less expressive hypothesis class:
- Bayesian networks with at most d parents
- pairwise MRFs (instead of arbitrary higher-order potentials)
Soft preference for simpler models: Occam's Razor. Augment the learning objective function with regularization:
objective(x, M) = loss(x, M) + R(M)
(often equivalent to MAP estimation, where we put a prior over parameters θ and maximize log p(θ | x) = log p(x; θ) + log p(θ) − constant)
We can evaluate generalization performance using cross-validation.
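A minimal sketch of regularization as MAP estimation, assuming the simplest case of fitting the mean μ of a unit-variance Gaussian with an L2 penalty (all values made up):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, size=20)
lam = 0.1

def objective(mu):
    # Empirical log-loss (negative average log-likelihood, up to a constant)
    log_loss = 0.5 * np.mean((data - mu) ** 2)
    # Regularizer R(mu); corresponds to a zero-mean Gaussian prior over mu
    return log_loss + lam * mu ** 2

# Setting the derivative to zero gives the MAP estimate in closed form:
mu_map = np.mean(data) / (1 + 2 * lam)   # shrunk toward 0 relative to the MLE
print(np.mean(data), mu_map)
```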

12. Summary of how to think about learning
1. Figure out what you care about, e.g. expected loss E_{x∼p*}[ loss(x, M̂) ]
2. Figure out how best to estimate this from what you have, e.g. regularized empirical loss E_D[ loss(x, M̂) ] + R(M̂). When used with log-loss, the regularization term can be interpreted as a prior distribution over models, p(M̂) ∝ exp(−R(M̂)) (called maximum a posteriori (MAP) estimation).
3. Figure out how to optimize over this objective function, e.g. the minimization min_M̂ E_D[ loss(x, M̂) ] + R(M̂)

13. ML estimation in Bayesian networks
Suppose that we know the Bayesian network structure G.
Let θ_{x_i | x_pa(i)} be the parameter giving the value of the CPD p(x_i | x_pa(i); θ).
Maximum likelihood estimation corresponds to solving
max_θ Σ_{n=1}^N log p(x^n; θ) = max_θ ℓ(θ; D)
subject to the non-negativity and normalization constraints. This is equal to
max_θ Σ_{n=1}^N log p(x^n; θ) = max_θ Σ_{n=1}^N Σ_{i=1}^{|V|} log p(x_i^n | x_{pa(i)}^n; θ)
= Σ_{i=1}^{|V|} max_θ Σ_{n=1}^N log p(x_i^n | x_{pa(i)}^n; θ)
The optimization problem decomposes into an independent optimization problem for each CPD!

14. ML estimation in Bayesian networks
ℓ(θ; D) = log p(D; θ) = Σ_{n=1}^N Σ_{i=1}^{|V|} log p(x_i^n | x_{pa(i)}^n; θ)
= Σ_{i=1}^{|V|} Σ_{x_pa(i)} Σ_{x_i} |{ x̂ ∈ D : (x̂_i, x̂_{pa(i)}) = (x_i, x_{pa(i)}) }| · log p(x_i | x_{pa(i)}; θ)
= Σ_{i=1}^{|V|} Σ_{x_pa(i)} Σ_{x_i} N_{x_i, x_{pa(i)}} · log θ_{x_i | x_{pa(i)}}
where N_{x_i, x_{pa(i)}} is the number of times that the (partial) assignment (x_i, x_{pa(i)}) is observed in the training data.
We have the closed-form ML solution:
θ^{ML}_{x_i | x_{pa(i)}} = N_{x_i, x_{pa(i)}} / Σ_{x̂_i} N_{x̂_i, x_{pa(i)}}
We were able to estimate each CPD independently because the objective decomposes by variable and parent assignment.
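A minimal sketch of this closed-form solution for a single CPD with one binary parent (the tiny dataset is made up):

```python
from collections import Counter

# Variable "child" has one parent "parent", both binary.
# Each row of the data is a (parent value, child value) pair.
data = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0), (1, 1), (0, 0)]

# N_{x_i, x_pa(i)}: counts of each (parent, child) partial assignment
counts = Counter(data)

# theta^ML_{x_i | x_pa(i)} = N_{x_i, x_pa(i)} / sum_{x_i'} N_{x_i', x_pa(i)}
for parent in (0, 1):
    total = sum(counts[(parent, child)] for child in (0, 1))
    for child in (0, 1):
        theta = counts[(parent, child)] / total
        print(f"p(child={child} | parent={parent}) = {theta:.2f}")
```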

15. ML estimation in Markov networks
How do we learn the parameters of an Ising model?
p(x_1, ..., x_n) = (1/Z) exp( Σ_{i<j} w_{i,j} x_i x_j − Σ_i u_i x_i )
(Figure: grid of binary spins, x_i = +1 or x_i = −1)
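A minimal sketch of evaluating this model by brute force for a made-up 3-variable chain (feasible only because 2^n is tiny; computing Z requires summing over every configuration):

```python
import itertools
import numpy as np

n = 3
w = {(0, 1): 0.5, (1, 2): -0.3}   # pairwise weights w_{i,j}, i < j
u = np.array([0.1, 0.0, -0.2])    # unary terms u_i

def score(x):
    """Unnormalized log-probability of configuration x in {-1,+1}^n."""
    pairwise = sum(w_ij * x[i] * x[j] for (i, j), w_ij in w.items())
    return pairwise - u @ np.array(x)

# Partition function Z: sum over all 2^n configurations
configs = list(itertools.product([-1, +1], repeat=n))
Z = sum(np.exp(score(x)) for x in configs)

def prob(x):
    return np.exp(score(x)) / Z

print(Z, prob((+1, +1, +1)))
```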

16. Bad news for Markov networks
The global normalization constant Z(θ) kills decomposability:
θ^{ML} = argmax_θ Σ_{x∈D} log p(x; θ)
= argmax_θ Σ_{x∈D} ( Σ_c log φ_c(x_c; θ) − log Z(θ) )
= argmax_θ [ ( Σ_{x∈D} Σ_c log φ_c(x_c; θ) ) − |D| log Z(θ) ]
The log-partition function prevents us from decomposing the objective into a sum of terms, one per potential.
Solving for the parameters becomes much more complicated.
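To make the difficulty concrete, here is a sketch (with made-up data and a tiny chain-structured model) of ML estimation by generic numerical optimization; note that every evaluation of the objective must recompute log Z(θ) over all configurations:

```python
import itertools
import numpy as np
from scipy.optimize import minimize

# Tiny pairwise Markov network: 3 binary +/-1 variables on a chain 0-1-2.
configs = np.array(list(itertools.product([-1, 1], repeat=3)))
rng = np.random.default_rng(0)
data = configs[rng.integers(0, len(configs), size=50)]  # fake training set

def neg_log_likelihood(theta):
    # theta = (w01, w12): the two edge weights of the chain
    def score(X):
        return theta[0] * X[:, 0] * X[:, 1] + theta[1] * X[:, 1] * X[:, 2]
    log_Z = np.log(np.sum(np.exp(score(configs))))   # exact, O(2^n) work
    return -(np.mean(score(data)) - log_Z)

theta_ml = minimize(neg_log_likelihood, x0=np.zeros(2)).x
print(theta_ml)
```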
