3 MIXTURE DENSITY ESTIMATION

In this chapter we consider mixture densities, the main building block for the dimension reduction techniques described in the following chapters. In the first section we introduce mixture densities and the expectation-maximization (EM) algorithm to estimate their parameters from data. The EM algorithm finds, from an initial parameter estimate, a sequence of parameter estimates that yield increasingly higher data log-likelihood. The algorithm is guaranteed to converge to a local maximum of the data log-likelihood as a function of the parameters. However, this local maximum may yield significantly lower log-likelihood than the globally optimal parameter estimate. The first contribution we present is a technique that is empirically found to avoid many of the poor local maxima obtained when using random initial parameter estimates. Our technique finds an initial parameter estimate by starting with a one-component mixture and adding components to the mixture one by one. In Section 3.2 we apply this technique to mixtures of Gaussian densities and in Section 3.3 to k-means clustering. Each iteration of the EM algorithm requires a number of computations that scales linearly with the product of the number of data points and the number of mixture components, which limits its applicability in large-scale applications with many data points and mixture components. In Section 3.4 we present a technique to speed up the estimation of mixture models from large quantities of data, in which the amount of computation can be traded against the accuracy of the algorithm. For any chosen accuracy, however, each step of the algorithm is guaranteed to increase a lower bound on the data log-likelihood.

3.1 The EM algorithm and Gaussian mixture densities

In this section we describe the expectation-maximization (EM) algorithm for estimating the parameters of mixture densities. Parameter estimation algorithms are sometimes also referred to as ‘learning algorithms’, since the machinery that implements the algorithm, in a sense, ‘learns’ about the data by estimating the parameters. Mixture models,
each a weighted sum of finitely many elements of some parametric class of component densities, form an expressive class of models for density estimation. Due to the development of automated procedures to estimate mixture models from data, applications in a wide range of fields have emerged in recent decades. Examples are density estimation, clustering, and estimating class-conditional densities in supervised learning settings. Using the EM algorithm it is relatively straightforward to apply density estimation techniques in cases where some data is missing. The missing data could be the class labels of some objects in partially supervised classification problems, or the value of some features that describe the objects for which we try to find a density estimate.

3.1.1 Mixture densities

A mixture density (McLachlan and Peel, 2000) is defined as a weighted sum of, say, $k$ component densities. The component densities are restricted to a particular parametric class of densities that is assumed to be appropriate for the data at hand or attractive for computational reasons. Let us denote by $p(x; \theta_s)$ the $s$-th component density, where $\theta_s$ are the component parameters. We use $\pi_s$ to denote the weighting factor of the $s$-th component in the mixture. The weights must satisfy two constraints: (i) non-negativity, $\pi_s \geq 0$, and (ii) partition of unity, $\sum_{s=1}^{k} \pi_s = 1$. The weights $\pi_s$ are also known as ‘mixing proportions’ or ‘mixing weights’ and can be thought of as the probability $p(s)$ that a data sample will be drawn from mixture component $s$. A $k$-component mixture density is then defined as:

$$p(x) \equiv \sum_{s=1}^{k} \pi_s \, p(x; \theta_s). \qquad (3.1)$$

For a mixture we collectively denote all parameters with $\theta = \{\theta_1, \ldots, \theta_k, \pi_1, \ldots, \pi_k\}$. Throughout this thesis we assume that all data are independent and identically distributed (i.i.d.), and hence that the likelihood of a set of data vectors is just the product of the individual likelihoods. One can think of a mixture density as modelling a process in which first a ‘source’ $s$ is selected according to the multinomial distribution $\{\pi_1, \ldots, \pi_k\}$ and then a sample is drawn from the corresponding component density $p(x; \theta_s)$. Thus, the probability of selecting source $s$ and datum $x$ is $\pi_s \, p(x; \theta_s)$. The marginal probability of selecting datum $x$ is then given by (3.1). We can think of the source that generated a data vector $x$ as ‘missing information’: we only observe $x$ and do not know the generating source. The expectation-maximization algorithm, presented in the next section, can be understood in terms of iteratively estimating this missing information.
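To make this generative view concrete, the sketch below draws samples from a small Gaussian mixture by first selecting a source $s$ according to the mixing weights and then sampling from the corresponding component. The number of components, means, covariances, and weights are illustrative assumptions, not values taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-dimensional Gaussian mixture with k = 3 components (assumed values).
pi = np.array([0.5, 0.3, 0.2])                                       # mixing weights, sum to 1
means = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])              # component means
covs = np.stack([np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 0.2])])   # component covariances

def sample_mixture(n):
    """Ancestral sampling: draw a source s with probability pi_s, then x from p(x; theta_s)."""
    sources = rng.choice(len(pi), size=n, p=pi)
    samples = np.array([rng.multivariate_normal(means[s], covs[s]) for s in sources])
    return samples, sources

X, sources = sample_mixture(1000)   # 'sources' plays the role of the missing information
```

In a density-estimation setting only `X` would be observed; the array `sources` is exactly the kind of missing information that the EM algorithm of the next section estimates through posterior probabilities.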
An important derived quantity is the ‘posterior probability’ of a mixture component given a data vector. One can think of this as a distribution over which mixture component generated a particular data vector, i.e. “Which component density was this data vector drawn from?” or “To which cluster does this data vector belong?”. The posterior distribution on the mixture components is defined using Bayes’ rule:

$$p(s \mid x) \equiv \frac{\pi_s \, p(x; \theta_s)}{p(x)} = \frac{\pi_s \, p(x; \theta_s)}{\sum_{s'} \pi_{s'} \, p(x; \theta_{s'})}. \qquad (3.2)$$

The expectation-maximization algorithm to estimate the parameters of a mixture model from data makes essential use of these posterior probabilities.

Mixture modelling is also known as semi-parametric density estimation, and it can be placed between two extremes: parametric and non-parametric density estimation. Parametric density estimation assumes the data is drawn from a density in a parametric class, say the class of Gaussian densities. The estimation problem then reduces to finding the parameters of the Gaussian that fits the data best. The assumption underlying parametric density estimation is often unrealistic but allows for very efficient parameter estimation. At the other extreme, non-parametric methods do not assume a particular form of the density from which the data is drawn. Non-parametric estimates typically take the form of a mixture density with a mixture component for every data point in the data set; the components are often referred to as ‘kernels’. A well-known non-parametric density estimator is the Parzen estimator (Parzen, 1962), which uses Gaussian components with mean equal to the corresponding data point and a small isotropic covariance. Non-parametric estimates can implement a large class of densities. The price we have to pay is that to evaluate the estimator at a new point we have to evaluate all the kernels, which is computationally demanding if the estimate is based on a large data set. Mixture modelling strikes a balance between these extremes: a large class of densities can be implemented, and we can evaluate the density efficiently, since only relatively few density functions have to be evaluated.

3.1.2 Parameter estimation with the EM algorithm

The first step when using a mixture model is to determine its architecture: a proper class of component densities and the number of component densities in the mixture. We will discuss these issues in Section 3.1.3. After these design choices have been made, we estimate the free parameters in the mixture model such that the model ‘fits’ our data as well as possible. The expectation-maximization algorithm is the most popular method to estimate the parameters of a mixture model from a given data set.

We define “fits the data as well as possible” as “assigns maximum likelihood to the data”. Hence, fitting the model to given data becomes searching for the maximum-likelihood parameters for our data in the set of probabilistic models defined by the chosen architecture. Since the logarithm is a monotone increasing function, the maximum-likelihood criterion is equivalent to a maximum log-likelihood criterion, and these criteria are often interchanged. Due to the i.i.d. assumption, the log-likelihood of a data set is simply the sum of the log-likelihoods of the individual data vectors.
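As a concrete illustration of these quantities, the following sketch evaluates the mixture density (3.1), the posterior probabilities (3.2), and the data log-likelihood for a Gaussian mixture; the function name and parameter layout are assumptions made for this example.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood_and_posteriors(X, pi, means, covs):
    """Return the data log-likelihood and the posteriors p(s | x_n) of equation (3.2).

    X: (n, d) data matrix; pi: (k,) mixing weights;
    means: (k, d) component means; covs: (k, d, d) component covariances.
    """
    n, k = X.shape[0], len(pi)
    joint = np.empty((n, k))
    for s in range(k):
        # joint[:, s] = pi_s * p(x_n; theta_s), the numerator of (3.2)
        joint[:, s] = pi[s] * multivariate_normal.pdf(X, mean=means[s], cov=covs[s])
    p_x = joint.sum(axis=1)             # mixture density (3.1) at each data vector
    posteriors = joint / p_x[:, None]   # Bayes' rule: each row sums to one
    log_lik = np.log(p_x).sum()         # i.i.d.: sum of the individual log-likelihoods
    return log_lik, posteriors
```

With the sampled data from the earlier sketch, `log_likelihood_and_posteriors(X, pi, means, covs)` yields, for each data vector, a distribution over the components that could have generated it; these posteriors are the quantities computed in the E-step of the EM algorithm described next. In practice one would work with component log-densities and a log-sum-exp to avoid numerical underflow; the direct form is kept here to mirror equations (3.1) and (3.2).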