  1. DATA MINING LECTURE 8: The EM Algorithm, Clustering Validation, Sequence segmentation

  2. CLUSTERING

  3. What is a Clustering?
  • In general, a grouping of objects such that the objects in a group (cluster) are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
  • Intra-cluster distances are minimized; inter-cluster distances are maximized.

  4. Clustering Algorithms
  • K-means and its variants
  • Hierarchical clustering
  • DBSCAN

  5. MIXTURE MODELS AND THE EM ALGORITHM

  6. Model-based clustering
  • In order to understand our data, we will assume that there is a generative process (a model) that creates/describes the data, and we will try to find the model that best fits the data.
  • Models of different complexity can be defined, but we will assume that our model is a distribution from which data points are sampled.
  • Example: the data is the height of all people in Greece.
  • In most cases, a single distribution is not good enough to describe all data points: different parts of the data follow different distributions.
  • Example: the data is the height of all people in Greece and China.
  • We need a mixture model.
  • Different distributions correspond to different clusters in the data.

  7. Gaussian Distribution
  • Example: the data is the height of all people in Greece.
  • Experience has shown that this data follows a Gaussian (Normal) distribution.
  • Reminder, the Normal density: $P(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
  • $\mu$ = mean, $\sigma$ = standard deviation

  8. Gaussian Model
  • What is a model?
  • A Gaussian distribution is fully defined by the mean $\mu$ and the standard deviation $\sigma$.
  • We define our model as the pair of parameters $\theta = (\mu, \sigma)$.
  • This is a general principle: a model is defined as a vector of parameters $\theta$.
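
As a concrete illustration (not from the slides), a minimal Python sketch of the Gaussian density, parameterized by the model $\theta = (\mu, \sigma)$; the example numbers are made up:

```python
import math

def gaussian_pdf(x, theta):
    """Density of N(mu, sigma) at x, where theta = (mu, sigma)."""
    mu, sigma = theta
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2 * math.pi) * sigma)

# Example: density of a 175 cm height under an assumed model N(177, 8)
print(gaussian_pdf(175.0, (177.0, 8.0)))
```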

  9. Fitting the model
  • We want to find the normal distribution that best fits our data.
  • Find the best values for $\mu$ and $\sigma$.
  • But what does best fit mean?

  10. Maximum Likelihood Estimation (MLE)
  • Suppose that we have a vector $X = (x_1, \dots, x_n)$ of values and we want to fit a Gaussian $N(\mu, \sigma)$ model to the data.
  • Probability of observing point $x_i$: $P(x_i) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$
  • Probability of observing all points (assuming independence): $P(X) = \prod_{i=1}^{n} P(x_i) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$
  • We want to find the parameters $\theta = (\mu, \sigma)$ that maximize the probability $P(X|\theta)$.

  11. Maximum Likelihood Estimation (MLE)
  • The probability $P(X|\theta)$ as a function of $\theta$ is called the Likelihood function: $L(\theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$
  • It is usually easier to work with the Log-Likelihood function: $LL(\theta) = -\sum_{i=1}^{n} \frac{(x_i-\mu)^2}{2\sigma^2} - \frac{1}{2} n \log 2\pi - n \log \sigma$
  • Maximum Likelihood Estimation: find the parameters $\mu, \sigma$ that maximize $LL(\theta)$:
  $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i = \mu_X$ (the sample mean)
  $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 = \sigma_X^2$ (the sample variance)
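
A short sketch of these closed-form MLE estimates on synthetic data (the "true" parameters 177 and 8 are assumptions chosen only for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic heights drawn from N(mu=177, sigma=8) -- illustrative values only
X = rng.normal(loc=177.0, scale=8.0, size=10_000)

# Closed-form MLE for a single Gaussian
mu_hat = X.mean()                        # sample mean
sigma2_hat = ((X - mu_hat) ** 2).mean()  # sample variance (1/n, not 1/(n-1))

print(mu_hat, np.sqrt(sigma2_hat))       # should be close to 177 and 8
```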

  12. MLE
  • Note: these are also the most likely parameters given the data: $P(\theta|X) = \frac{P(X|\theta)\,P(\theta)}{P(X)}$
  • If we have no prior information about $\theta$ or $X$, then maximizing $P(X|\theta)$ is the same as maximizing $P(\theta|X)$.

  13. Mixture of Gaussians
  • Suppose that you have the heights of people from Greece and China, and the distribution looks like the figure below (a dramatization). [Figure: bimodal height distribution — omitted]

  14. Mixture of Gaussians
  • In this case the data is the result of a mixture of two Gaussians: one for Greek people, and one for Chinese people.
  • Identifying, for each value, which Gaussian is most likely to have generated it will give us a clustering.

  15. Mixture model
  • A value $x_i$ is generated according to the following process:
  • First select the nationality: with probability $\pi_G$ select Greece, with probability $\pi_C$ select China ($\pi_G + \pi_C = 1$).
  • We can also think of this as a hidden variable $Z$ that takes two values: Greece and China.
  • Given the nationality, generate the point from the corresponding Gaussian: $P(x_i|\theta_G) \sim N(\mu_G, \sigma_G)$ if Greece; $P(x_i|\theta_C) \sim N(\mu_C, \sigma_C)$ if China.
  • $\theta_G$: parameters of the Greek distribution; $\theta_C$: parameters of the Chinese distribution.
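
A minimal sketch of this generative process; all weights and parameters below are illustrative assumptions, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameters: Theta = (pi_G, pi_C, mu_G, sigma_G, mu_C, sigma_C)
pi_G, pi_C = 0.5, 0.5
mu_G, sigma_G = 177.0, 8.0   # "Greece" component (made-up numbers)
mu_C, sigma_C = 170.0, 7.0   # "China" component (made-up numbers)

def sample_point():
    # Hidden variable Z: first choose the nationality ...
    if rng.random() < pi_G:
        return rng.normal(mu_G, sigma_G)  # ... then sample from that Gaussian
    return rng.normal(mu_C, sigma_C)

X = np.array([sample_point() for _ in range(5000)])
```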

  16. Mixture Model
  • Our model has the following parameters: $\Theta = (\pi_G, \pi_C, \mu_G, \sigma_G, \mu_C, \sigma_C)$
  • $\pi_G, \pi_C$: mixture probabilities; $\theta_G$: parameters of the Greek distribution; $\theta_C$: parameters of the Chinese distribution.

  17. Mixture Model
  • Our model has the following parameters: $\Theta = (\pi_G, \pi_C, \mu_G, \sigma_G, \mu_C, \sigma_C)$ (mixture probabilities and distribution parameters).
  • For value $x_i$, we have: $P(x_i|\Theta) = \pi_G P(x_i|\theta_G) + \pi_C P(x_i|\theta_C)$
  • For all values $X = (x_1, \dots, x_n)$: $P(X|\Theta) = \prod_{i=1}^{n} P(x_i|\Theta)$
  • We want to estimate the parameters that maximize the Likelihood of the data.
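
A small sketch computing this likelihood under a given $\Theta$, in log form since a product of many small probabilities underflows (a self-contained illustration, not from the slides):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def mixture_log_likelihood(X, pi_G, pi_C, theta_G, theta_C):
    # P(x_i | Theta) = pi_G * P(x_i | theta_G) + pi_C * P(x_i | theta_C)
    p = pi_G * gaussian_pdf(X, *theta_G) + pi_C * gaussian_pdf(X, *theta_C)
    # log P(X | Theta) = sum_i log P(x_i | Theta)
    return np.log(p).sum()
```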

  19. Mixture Models
  • Once we have the parameters $\Theta = (\pi_G, \pi_C, \mu_G, \mu_C, \sigma_G, \sigma_C)$, we can estimate the membership probabilities $P(G|x_i)$ and $P(C|x_i)$ for each point $x_i$.
  • This is the probability that point $x_i$ belongs to the Greek or the Chinese population (cluster):
  $P(G|x_i) = \frac{P(x_i|G)\,P(G)}{P(x_i|G)\,P(G) + P(x_i|C)\,P(C)} = \frac{P(x_i|\theta_G)\,\pi_G}{P(x_i|\theta_G)\,\pi_G + P(x_i|\theta_C)\,\pi_C}$
  • $P(x_i|G)$ is given by the Gaussian distribution $N(\mu_G, \sigma_G)$ for the Greek population.
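
A sketch of this Bayes-rule posterior (the usage numbers reuse the illustrative parameters from the sampling sketch above):

```python
import math

def gaussian_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2 * math.pi) * sigma)

def membership_probs(x, pi_G, pi_C, theta_G, theta_C):
    # Numerators of Bayes rule: component likelihood times mixture weight
    g = pi_G * gaussian_pdf(x, *theta_G)
    c = pi_C * gaussian_pdf(x, *theta_C)
    total = g + c  # = P(x | Theta), the mixture density at x
    return g / total, c / total  # (P(G | x), P(C | x))

print(membership_probs(174.0, 0.5, 0.5, (177.0, 8.0), (170.0, 7.0)))
```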

  20. EM (Expectation Maximization) Algorithm
  • Initialize the values of the parameters in $\Theta$ to some random values.
  • Repeat until convergence:
  • E-Step: given the parameters $\Theta$, estimate the membership probabilities $P(G|x_i)$ and $P(C|x_i)$.
  • M-Step: compute the parameter values that (in expectation) maximize the data likelihood:
  $\pi_G = \frac{1}{n} \sum_{i=1}^{n} P(G|x_i)$ and $\pi_C = \frac{1}{n} \sum_{i=1}^{n} P(C|x_i)$ (the fraction of the population in each cluster)
  $\mu_G = \sum_{i=1}^{n} \frac{P(G|x_i)}{n\,\pi_G} x_i$ and $\mu_C = \sum_{i=1}^{n} \frac{P(C|x_i)}{n\,\pi_C} x_i$
  $\sigma_G^2 = \sum_{i=1}^{n} \frac{P(G|x_i)}{n\,\pi_G} (x_i - \mu_G)^2$ and $\sigma_C^2 = \sum_{i=1}^{n} \frac{P(C|x_i)}{n\,\pi_C} (x_i - \mu_C)^2$ (the MLE estimates we would get if the memberships were fixed)
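
Putting the two steps together, a compact self-contained sketch of EM for a two-component 1-D Gaussian mixture; the initialization and stopping rule below are simple choices of mine, not prescribed by the slides:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def em_two_gaussians(X, n_iter=200, tol=1e-8):
    # Simple initialization: equal weights, means spread around the data mean
    pi = np.array([0.5, 0.5])
    mu = np.array([X.mean() - X.std(), X.mean() + X.std()])
    sigma = np.array([X.std(), X.std()])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: membership probabilities r[i, k] = P(cluster k | x_i)
        p = pi * gaussian_pdf(X[:, None], mu, sigma)  # shape (n, 2)
        r = p / p.sum(axis=1, keepdims=True)
        # M-step: weighted MLE estimates
        nk = r.sum(axis=0)                 # effective cluster sizes n * pi_k
        pi = nk / len(X)
        mu = (r * X[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (X[:, None] - mu) ** 2).sum(axis=0) / nk)
        # Stop when the log-likelihood no longer improves
        ll = np.log(p.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, sigma

# Example usage on the synthetic mixture sample X from the earlier sketch:
# pi, mu, sigma = em_two_gaussians(X)
```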

  21. Relationship to K-means
  • E-Step: assignment of points to clusters. K-means: hard assignment; EM: soft assignment.
  • M-Step: computation of centroids.
  • K-means assumes a common, fixed variance (spherical clusters); EM can change the variance for different clusters or different dimensions (ellipsoid clusters).
  • If the variance is fixed, then both minimize the same error function.

  22. CLUSTERING EVALUATION

  23. Clustering Evaluation
  • How do we evaluate the “goodness” of the resulting clusters?
  • But “clustering lies in the eye of the beholder”!
  • Then why do we want to evaluate them?
  • To avoid finding patterns in noise
  • To compare clusterings, or clustering algorithms
  • To compare against a “ground truth”

  24. Clusters found in Random Data
  [Figure: four scatter plots of the same random points in the unit square — the original random points, and the “clusters” found by DBSCAN, K-means, and Complete Link.]

  25. Different Aspects of Cluster Validation
  1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
  2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels (see the sketch after this list).
  3. Evaluating how well the results of a cluster analysis fit the data without reference to external information — use only the data.
  4. Comparing the results of two different sets of cluster analyses to determine which is better; determining the ‘correct’ number of clusters.
  5. For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
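
For aspect 2, one common external measure (my choice of example; the slides do not name one here) is the adjusted Rand index. A minimal sketch with scikit-learn on made-up toy labels:

```python
from sklearn.metrics import adjusted_rand_score

# Toy example: ground-truth class labels vs. cluster ids found by an algorithm.
# ARI is 1.0 for a perfect match (up to renaming of cluster ids) and
# close to 0.0 for a random assignment.
true_labels = [0, 0, 0, 1, 1, 1, 2, 2]
cluster_ids = [1, 1, 1, 0, 0, 2, 2, 2]
print(adjusted_rand_score(true_labels, cluster_ids))
```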
