

  1. Data Warehousing and Machine Learning
     Density-based clustering
     Thomas D. Nielsen
     Aalborg University, Department of Computer Science
     Spring 2008
     Density-based Clustering DWML Spring 2008 1 / 29

  2. Density-Based Clustering: DBSCAN
     Idea: identify contiguous regions of high density.
     Density-based Clustering DWML Spring 2008 2 / 29

  9. Density-Based Clustering: Step 1, classification of points
     1. Choose parameters ε, k
     2. Label as core points: points with at least k other points within distance ε
     3. Label as border points: points within distance ε of a core point
     4. Label as isolated points: all remaining points
     Density-based Clustering DWML Spring 2008 3 / 29
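The point classification above can be sketched directly. A minimal NumPy version (the function name and the brute-force distance matrix are illustrative choices, not from the slides):

```python
import numpy as np

def classify_points(X, eps, k):
    """DBSCAN step 1 sketch: label each point 'core', 'border' or 'isolated'.

    X   : (n, d) array of points
    eps : neighbourhood radius (the slides' epsilon)
    k   : minimum number of OTHER points within eps for a core point
    """
    n = len(X)
    # pairwise Euclidean distances (O(n^2), fine for a sketch)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # count neighbours within eps, excluding the point itself
    neighbours = (dist <= eps).sum(axis=1) - 1
    core = neighbours >= k
    labels = np.full(n, 'isolated', dtype=object)
    labels[core] = 'core'
    # border: not core, but within eps of some core point
    near_core = (dist[:, core] <= eps).any(axis=1)
    labels[~core & near_core] = 'border'
    return labels
```

For example, three tightly packed points come out as core, a point just outside the dense region but within ε of a core point as border, and a far-away point as isolated.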

  12. Density-Based Clustering: Step 2, define connectivity
      1. Two core points are directly connected if they are within distance ε of each other.
      2. Each border point is directly connected to one randomly chosen core point within distance ε.
      3. Each connected component of the directly-connected relation (with at least one core point) is a cluster.
      Density-based Clustering DWML Spring 2008 4 / 29
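Putting both steps together gives a complete, if naive, DBSCAN sketch. The breadth-first search below is one way to extract the connected components of the directly-connected relation; the function name and the choice of "first core point within ε" for border points are illustrative (the slides say "randomly chosen"):

```python
import numpy as np

def dbscan_clusters(X, eps, k):
    """DBSCAN sketch: cluster id per point, -1 for isolated points."""
    n = len(X)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    core = (dist <= eps).sum(axis=1) - 1 >= k   # step 1, core points only
    labels = np.full(n, -1)
    cid = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # breadth-first search over directly connected core points
        stack, labels[i] = [i], cid
        while stack:
            j = stack.pop()
            for m in np.where((dist[j] <= eps) & core & (labels == -1))[0]:
                labels[m] = cid
                stack.append(m)
        cid += 1
    # each border point joins one core point within eps (here: the first)
    for i in np.where(~core)[0]:
        near = np.where(core & (dist[i] <= eps))[0]
        if len(near):
            labels[i] = labels[near[0]]
    return labels
```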

  13. Density-Based Clustering: Setting k and ε
      For fixed k there exist heuristic methods for choosing ε by considering the distribution, over the data, of the distance to the k-th nearest neighbor.
      Pros and Cons
      + Can detect clusters of highly irregular shape
      + Robust with respect to outliers
      - Difficulties with clusters of varying density
      - Parameters k, ε must be suitably chosen
      Density-based Clustering DWML Spring 2008 5 / 29
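The k-th-nearest-neighbor heuristic mentioned above can be made concrete: compute, for every point, its distance to the k-th nearest other point, and sort these values; a knee in the resulting curve is a common choice for ε. A small sketch (assuming NumPy; reading off the knee is left to the eye or a separate heuristic):

```python
import numpy as np

def k_distance_profile(X, k):
    """Sorted distance to the k-th nearest OTHER point, for every point.
    Points left of the knee of this profile lie in dense regions."""
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    dist.sort(axis=1)            # column 0 is each point's distance to itself
    return np.sort(dist[:, k])   # k-th nearest neighbour, excluding the point
```

On data with one dense group and an outlier, the profile is flat for the dense points and jumps sharply at the outlier.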

  14. EM Clustering: Probabilistic Model for Clustering
      Assumption:
      • Data a_1, ..., a_N is generated by a mixture of k probability distributions P_1, ..., P_k, i.e.
        P(a) = ∑_{i=1}^{k} λ_i P_i(a)   (with ∑_{i=1}^{k} λ_i = 1)
      • Cluster label of an instance = (index of the) distribution from which the instance was drawn
      • The P_i are not (exactly) known
      Density-based Clustering DWML Spring 2008 6 / 29

  15. EM Clustering: Clustering principle
      Try to find the most likely explanation of the data, i.e.
      • determine (parameters of) P_1, ..., P_k and λ_1, ..., λ_k, such that
      • the likelihood function
        P(a_1, ..., a_N | P_1, ..., P_k, λ_1, ..., λ_k) = ∏_{j=1}^{N} P(a_j)
        is maximized.
      • Instance a is assigned to cluster j = argmax_{i=1,...,k} λ_i P_i(a).
      Density-based Clustering DWML Spring 2008 7 / 29
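Maximizing this likelihood is what the EM algorithm does: alternate between computing the posterior responsibility of each component for each instance (E-step) and re-estimating parameters from the weighted data (M-step). A minimal one-dimensional Gaussian sketch; the quantile initialisation and the fixed iteration count are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def em_gmm_1d(x, k=2, iters=100):
    """Minimal EM sketch for a 1-D Gaussian mixture with k components."""
    mu = np.quantile(x, (np.arange(k) + 1) / (k + 1))  # spread-out initial means
    sigma = np.full(k, x.std())
    lam = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: posterior responsibility r[j, i] of component i for x[j]
        dens = lam * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) / (
            np.sqrt(2 * np.pi) * sigma)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted means, deviations and mixture weights
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        lam = nk / len(x)
    return mu, sigma, lam
```

After fitting, the slide's assignment rule argmax_i λ_i P_i(a) is a row-wise argmax over the component densities.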

  16. EM Clustering: Standard normal distribution
      [Figure: density curve of the standard normal distribution, plotted for x from −5 to 5]
      A standard normal distribution (normal distribution with mean µ = 0 and standard deviation σ = 1):
        P(x | µ, σ) = (1 / (√(2π) σ)) exp(−(x − µ)² / (2σ²))
      Density-based Clustering DWML Spring 2008 8 / 29
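The density formula translates directly into code; a small sanity check that it peaks at 1/√(2π) ≈ 0.3989 and integrates to one (assuming NumPy; the function name is an illustrative choice):

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density with mean mu and standard deviation sigma,
    exactly the formula from the slide."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
```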

  17. EM Clustering: Bivariate normal distribution
      [Figure: surface and contour plots of a bivariate normal density over x and y]
      µ = (2, 2)ᵀ,   Σ = (1, 0.5; 0.5, 0.5)
      A bivariate normal distribution with mean µ and covariance matrix Σ:
        P(x | µ, Σ) = (1 / ((2π)^{N/2} |Σ|^{1/2})) exp(−(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ))
      Density-based Clustering DWML Spring 2008 9 / 29

  18. EM Clustering: Mixture of Gaussians
      Mixture of three Gaussian distributions with weights λ = 0.2, 0.3, 0.5:
        P_i(x | µ_i, Σ_i) = (1 / ((2π)^{N/2} |Σ_i|^{1/2})) exp(−(1/2) (x − µ_i)ᵀ Σ_i⁻¹ (x − µ_i))
      Density-based Clustering DWML Spring 2008 10 / 29
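A mixture density is simply the λ-weighted sum of the component densities. A one-dimensional sketch using the slide's weights 0.2, 0.3, 0.5 (the component means and deviations below are made up for illustration; the slide shows the multivariate case):

```python
import numpy as np

def mixture_pdf(x, lams, mus, sigmas):
    """Weighted sum of univariate Gaussian component densities:
    P(x) = sum_i lams[i] * N(x | mus[i], sigmas[i])."""
    lams, mus, sigmas = map(np.asarray, (lams, mus, sigmas))
    comp = np.exp(-(np.asarray(x)[..., None] - mus) ** 2 / (2 * sigmas ** 2)) / (
        np.sqrt(2 * np.pi) * sigmas)
    return (lams * comp).sum(axis=-1)
```

Because the weights sum to one and each component integrates to one, the mixture itself is again a probability density.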

  21. EM Clustering: Mixture Model → Data
      Equi-potential lines and centers of mixture components
      Sample from mixture
      Data we see
      Density-based Clustering DWML Spring 2008 11 / 29

  26. EM Clustering: Data → Clustering
      Fit a mixture of three Gaussians to the data
      Assign instances to their most probable mixture components
      Density-based Clustering DWML Spring 2008 12 / 29

  27. EM Clustering: Gaussian Mixture Models
      Each mixture component is a Gaussian distribution. A Gaussian distribution is determined by
      • a mean vector ("center")
      • a covariance matrix
      Usually all components are assumed to have the same covariance matrix. To fit the mixture to data one then needs to find the weights and mean vectors of the mixture components.
      If the covariance matrix is a diagonal matrix with constant entries on the diagonal, then fitting the Gaussian mixture model is equivalent to minimizing the sum of squared errors (or within-cluster point scatter), i.e. the k-means algorithm effectively fits such a Gaussian mixture model.
      Density-based Clustering DWML Spring 2008 13 / 29
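The k-means connection claimed above is easy to see in code: with equal weights and a shared spherical covariance σ²I, the posterior of a component depends on x only through the squared distance to that component's mean, so the most probable component is simply the nearest center. A sketch of that observation (not the slides' own derivation; names are illustrative):

```python
import numpy as np

def most_probable_component(x, mus, sigma=1.0):
    """With equal weights and shared covariance sigma^2 * I, the component
    posterior is proportional to exp(-||x - mu_i||^2 / (2 sigma^2)),
    so its argmax coincides with the nearest mean (the k-means rule)."""
    mus = np.asarray(mus, dtype=float)
    d2 = ((np.asarray(x) - mus) ** 2).sum(axis=1)
    post = np.exp(-d2 / (2 * sigma ** 2))  # equal weights and constants cancel
    post = post / post.sum()
    # monotone transform: argmax of the posterior = argmin of the distance
    assert np.argmax(post) == np.argmin(d2)
    return int(np.argmax(post))
```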

  28. EM Clustering: Naive Bayes Mixture Model
      (For discrete attributes:) each mixture component is a distribution in which the attributes are independent:
      [Figure: naive Bayes structure, class variable C with children A_1, ..., A_7]
      Model determined by parameters:
      • λ_1, ..., λ_k (prior probabilities of the class variable)
      • P(A_j = a | C = c)   (a ∈ States(A_j), c ∈ States(C))
      Fitting the model: finding parameters that maximize the probability of the observed instances.
      Density-based Clustering DWML Spring 2008 14 / 29
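Evaluating one instance under this model multiplies the class prior λ_c with one conditional probability per attribute, thanks to the independence assumption. A small sketch; the parameter tables used in the example are hypothetical, purely for illustration:

```python
def nb_joint(instance, lam, cond):
    """Unnormalised P(C=c, A_1=a_1, ..., A_m=a_m) for every component c:
    lam[c] times the product of per-attribute conditionals cond[c][j][a].
    (lam and cond are hypothetical parameter tables for illustration.)
    The argmax over c gives the most probable cluster of the instance."""
    probs = {}
    for c in range(len(lam)):
        p = lam[c]
        for j, a in enumerate(instance):
            p *= cond[c][j][a]  # independence: attributes factorise given C
        probs[c] = p
    return probs
```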

  29. EM Clustering: Clustering as fitting Incomplete Data
      Clustering data as incompletely labeled data:

        SL   SW   PL   PW   Cluster
        5.1  3.5  1.4  0.2  ?
        4.9  3.0  1.4  0.2  ?
        6.3  2.9  6.0  2.1  ?
        6.3  2.5  4.9  1.5  ?
        ...  ...  ...  ...  ...

        SubAllCap  TrustSend  InvRet  ...  B'zambia'  Cluster
        y          n          n       ...  n          ?
        n          n          n       ...  n          ?
        n          y          n       ...  n          ?
        n          n          n       ...  n          ?
        ...        ...        ...     ...  ...        ...

      Density-based Clustering DWML Spring 2008 15 / 29
