distance based methods drawbacks
play

Distance-based Methods: Drawbacks Hard to find clusters with - PowerPoint PPT Presentation

Distance-based Methods: Drawbacks Hard to find clusters with irregular shapes Hard to specify the number of clusters Heuristic: a cluster must be dense Jian Pei: CMPT 459/741 Clustering (3) 1 How to Find Irregular Clusters? Divide


  1. Distance-based Methods: Drawbacks • Hard to find clusters with irregular shapes • Hard to specify the number of clusters • Heuristic: a cluster must be dense Jian Pei: CMPT 459/741 Clustering (3) 1

  2. How to Find Irregular Clusters? • Divide the whole space into many small areas – The density of an area can be estimated – Areas may or may not be exclusive – A dense area is likely in a cluster • Start from a dense area, traverse connected dense areas and discover clusters in irregular shape Jian Pei: CMPT 459/741 Clustering (3) 2

  3. Directly Density Reachable p MinPts = 3 q Eps = 1 cm • Parameters – Eps: Maximum radius of the neighborhood – MinPts: Minimum number of points in an Eps- neighborhood of that point – NEps(p): {q | dist(p,q) ≤ Eps} • Core object p: |NEps(p)| ≥ MinPts – A core object is in a dense area • Point q directly density-reachable from p iff q ∈ NEps(p) and p is a core object Jian Pei: CMPT 459/741 Clustering (3) 3

  4. Density-Based Clustering • Density-reachable – Directly density reachable p 1 à p 2 , p 2 à p 3 , … , p n-1 à p n – p n density-reachable from p 1 • Density-connected – If points p, q are density-reachable from o then p and q are density-connected p q p p 1 o q Jian Pei: CMPT 459/741 Clustering (3) 4

  5. DBSCAN • A cluster: a maximal set of density- connected points – Discover clusters of arbitrary shape in spatial databases with noise Outlier Border Eps = 1cm Core MinPts = 5 Jian Pei: CMPT 459/741 Clustering (3) 5

  6. DBSCAN: the Algorithm • Arbitrary select a point p • Retrieve all points density-reachable from p wrt Eps and MinPts • If p is a core point, a cluster is formed • If p is a border point, no points are density- reachable from p and DBSCAN visits the next point of the database • Continue the process until all of the points have been processed Jian Pei: CMPT 459/741 Clustering (3) 6

  7. Challenges for DBSCAN • Different clusters may have very different densities • Clusters may be in hierarchies Jian Pei: CMPT 459/741 Clustering (3) 7

  8. OPTICS: A Cluster-ordering Method • Idea: ordering points to identify the clustering structure • “Group” points by density connectivity – Hierarchies of clusters • Visualize clusters and the hierarchy Jian Pei: CMPT 459/741 Clustering (3) 8

  9. Ordering Points • Points strongly density-connected should be close to one another • Clusters density-connected should be close to one another and form a “ cluster ” of clusters Jian Pei: CMPT 459/741 Clustering (3) 9

  10. OPTICS: An Example Reachability-distance undefined ε ε ε ‘ Cluster-order of the objects Jian Pei: CMPT 459/741 Clustering (3) 10

  11. DENCLUE: Using Density Functions • DENsity-based CLUstEring • Major features – Solid mathematical foundation – Good for data sets with large amounts of noise – Allow a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets – Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45) – But need a large number of parameters Jian Pei: CMPT 459/741 Clustering (3) 11

  12. DENCLUE: Techniques • Use grid cells – Only keep grid cells actually containing data points – Manage cells in a tree-based access structure • Influence function: describe the impact of a data point on its neighborhood • Overall density of the data space is the sum of the influence function of all data points • Clustering by identifying density attractors – Density attractor: local maximal of the overall density function Jian Pei: CMPT 459/741 Clustering (3) 12

  13. Density Attractor Jian Pei: CMPT 459/741 Clustering (3) 13

  14. Center-defined and Arbitrary Clusters Jian Pei: CMPT 459/741 Clustering (3) 14

  15. A Shrinking-based Approach • Difficulties of Multi-dimensional Clustering – Noise (outliers) – Clusters of various densities – Not well-defined shapes • A novel preprocessing concept “Shrinking” • A shrinking-based clustering approach Jian Pei: CMPT 459/741 Clustering (3) 15

  16. Intuition & Purpose • For data points in a data set, what if we could make them move towards the centroid of the natural subgroup they belong to? • Natural sparse subgroups become denser, thus easier to be detected – Noises are further isolated Jian Pei: CMPT 459/741 Clustering (3) 16

  17. Inspiration • Newton’s Universal Law of Gravitation – Any two objects exert a gravitational force of attraction on each other – The direction of the force is along the line joining the objects – The magnitude of the force is directly proportional to the product of the gravitational masses of the objects, and inversely proportional to the square of the distance between them m m Fg G 1 2 – G: universal gravitational constant = 2 r • G = 6.67 x 10 -11 N m 2 /kg 2 Jian Pei: CMPT 459/741 Clustering (3) 17

  18. The Concept of Shrinking • A data preprocessing technique – Aim to optimize the inner structure of real data sets • Each data point is “attracted” by other data points and moves to the direction in which way the attraction is the strongest • Can be applied in different fields Jian Pei: CMPT 459/741 Clustering (3) 18

  19. Apply shrinking into clustering field • Shrink the natural sparse clusters to make them much denser to facilitate further cluster-detecting process. Multi- attribute hyperspac e Jian Pei: CMPT 459/741 Clustering (3) 19

  20. Data Shrinking • Each data point moves along the direction of the density gradient and the data set shrinks towards the inside of the clusters • Points are “ attracted ” by their neighbors and move to create denser clusters • It proceeds iteratively ; repeated until the data are stabilized or the number of iterations exceeds a threshold Jian Pei: CMPT 459/741 Clustering (3) 20

  21. Approximation & Simplification • Problem: Computing mutual attraction of each data points pair is too time consuming O(n 2 ) – Solution: No Newton's constant G, m 1 and m 2 are set to unit • Only aggregate the gravitation surrounding each data point • Use grids to simplify the computation Jian Pei: CMPT 459/741 Clustering (3) 21

  22. Termination condition • Average movement of all points in the current iteration is less than a threshold • The number of iterations exceeds a threshold Jian Pei: CMPT 459/741 Clustering (3) 22

  23. Optics on Pendigits Data Before data shrinking After data shrinking Jian Pei: CMPT 459/741 Clustering (3) 23

  24. Biclustering • Clustering both objects and attributes simultaneously • Four requirements – Only a small set of objects in a cluster (bicluster) – A bicluster only involves a small number of attributes – An object may participate in multiple biclusters or no biclusters – An attribute may be involved in multiple biclusters, or no biclusters Jian Pei: Big Data Analytics -- Clustering 24

  25. Application Examples • Recommender systems sample/condition – Objects: users w w w 11 12 1m – Attributes: items gene w w w 21 22 2m w w w – Values: user ratings 31 3m 32 • Microarray data – Objects: genes w w w n2 n1 nm – Attributes: samples – Values: expression levels Jian Pei: Big Data Analytics -- Clustering 25

  26. Biclusters with Constant Values · · · b 6 · · · b 12 · · · b 36 · · · b 99 · · · · · · 60 · · · 60 · · · 60 · · · 60 · · · a 1 · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · 60 · · · 60 · · · 60 · · · 60 · · · a 33 · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · 60 · · · 60 · · · 60 · · · 60 · · · a 86 · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · 10 10 10 10 10 20 20 20 20 20 50 50 50 50 50 0 0 0 0 0 On rows Jian Pei: Big Data Analytics -- Clustering 26

  27. Biclusters with Coherent Values • Also known as pattern-based clusters Jian Pei: Big Data Analytics -- Clustering 27

  28. Biclusters with Coherent Evolutions • Only up- or down-regulated changes over rows or columns 10 50 30 70 20 20 100 50 1000 30 50 100 90 120 80 0 80 20 100 10 Coherent evolutions on rows Jian Pei: Big Data Analytics -- Clustering 28

  29. Differences from Subspace Clustering • Subspace clustering uses global distance/ similarity measure • Pattern-based clustering looks at patterns • A subspace cluster according to a globally defined similarity measure may not follow the same pattern Jian Pei: Big Data Analytics -- Clustering 29

  30. Objects Follow the Same Pattern? pScore Object blue Obejct green D 1 D 2 The less the pScore, the more consistent the objects Jian Pei: Big Data Analytics -- Clustering 30

Recommend


More recommend