CURE: An Efficient Clustering Algorithm for Large Databases

Sudipto Guha (Stanford University, sudipto@cs.stanford.edu)
Rajeev Rastogi (Bell Laboratories, rastogi@bell-labs.com)
Kyuseok Shim (Bell Laboratories, shim@bell-labs.com)
Motivation

Clustering is a useful technique for:
1. Discovering the data distribution.
2. Discovering interesting patterns.
Problem Definition

Given:
1. n data points
2. a d-dimensional metric space

Find k partitions such that data within a partition are more similar than data across partitions.
Traditional Clustering Algorithms

Existing algorithms [JD88]:
1. Partitional
2. Hierarchical
Partitional Clustering

Find k partitions optimizing some criterion. Example: the square-error criterion

    \min \sum_{i=1}^{k} \sum_{p \in C_i} \| p - m_i \|^2

where m_i is the mean of cluster C_i.
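As a minimal sketch, the square-error criterion above can be computed for a given partition as follows (the function name and the list-of-tuples representation are illustrative assumptions, not part of the original slides):

```python
# Square-error of a partition: the sum of squared distances from each
# point to the mean of its own cluster. `clusters` is a list of
# clusters, each a non-empty list of equal-length coordinate tuples.
def square_error(clusters):
    total = 0.0
    for cluster in clusters:
        dim = len(cluster[0])
        mean = [sum(p[j] for p in cluster) / len(cluster) for j in range(dim)]
        for p in cluster:
            total += sum((p[j] - mean[j]) ** 2 for j in range(dim))
    return total
```

Partitional methods search for the k-partition minimizing this quantity.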
Drawbacks of Partitional Clustering

Similar results arise with other criteria: the gain in the criterion from splitting a large cluster offsets the penalty for merging small clusters, so large clusters tend to get split.
Hierarchical Clustering

1. Nested partitions
2. Tree structure

Most widely used: agglomerative hierarchical clustering.
Agglomerative Hierarchical Clustering

1. Initially, each point is a distinct cluster.
2. Repeatedly merge the closest pair of clusters.

Closest:

    d_{mean}(C_i, C_j) = \| m_i - m_j \|
    d_{min}(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} \| p - p' \|

Likewise d_{ave}(C_i, C_j) and d_{max}(C_i, C_j).
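The four inter-cluster distances can be sketched directly from their definitions (helper names are illustrative; clusters are lists of coordinate tuples):

```python
import math

def dist(p, q):
    # Euclidean distance between two points given as coordinate tuples.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def d_mean(ci, cj):
    # Distance between the cluster means (centroid approach).
    mi = tuple(sum(x) / len(ci) for x in zip(*ci))
    mj = tuple(sum(x) / len(cj) for x in zip(*cj))
    return dist(mi, mj)

def d_min(ci, cj):
    # Smallest pairwise distance (minimum spanning tree approach).
    return min(dist(p, q) for p in ci for q in cj)

def d_max(ci, cj):
    # Largest pairwise distance.
    return max(dist(p, q) for p in ci for q in cj)

def d_ave(ci, cj):
    # Average pairwise distance.
    return sum(dist(p, q) for p in ci for q in cj) / (len(ci) * len(cj))
```

Each merge step picks the pair of clusters minimizing whichever of these distances the algorithm uses.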
Drawbacks of Agglomerative Hierarchical Clustering

[Figure: clusterings produced by d_{ave} and d_{min}]

Clustering with d_{mean}(C_i, C_j) and d_{min}(C_i, C_j):

    d_{mean}(C_i, C_j) -> centroid approach
    d_{min}(C_i, C_j) -> minimum spanning tree approach
Summary of Problems with Traditional Methods

• Partitional algorithms split large clusters.
• The centroid approach splits large clusters and non-hyperspherical shapes; the centers of sub-clusters can be far apart.
• The minimum spanning tree approach is sensitive to outliers and to slight changes in position, and exhibits a chaining effect on strings of outliers.
Labeling Problem

Centroid approach: even with correct centers, do we label correctly? Not unless we use every point in the data set to build the hierarchy.
Related Work - I

• CLARANS: partitional algorithm. Uses k medoids and randomized iterative improvement by exchanging medoids. Requires multiple I/O scans and can converge to a local optimum. Splits large clusters.
• DBSCAN: density-based algorithm. Uses density in a small neighborhood (given as a parameter). Finds and eliminates boundary points. Uses a spanning tree on the neighborhood graph. High I/O cost, problems with outliers, sensitive to the density parameter.
Related Work - II

• BIRCH: hierarchical algorithm. Preclusters data using a CF-tree, inserting points into the tree while maintaining as many leaves as fit in main memory. Uses standard hierarchical clustering to cluster the preclusters. Works for convex, isotropic clusters of uniform size. Dependent on the order of insertions.
Our Contribution

• A new hierarchical clustering algorithm that is a middle ground between centroid-based and spanning-tree-based algorithms.
• A solution to the labeling problem.
• Use of random sampling.
• Use of partitioning.
Hierarchical Clustering Algorithm

Centroid-based algorithms use 1 point to represent a cluster
  => too little information ... hyperspherical clusters.

Spanning-tree-based algorithms use all points to cluster
  => too much information ... easily misled.

Instead, use a small number of representatives for each cluster.
Representatives

A representative set of points:
• small in number: c
• distributed over the cluster
• each point in the cluster is close to one representative

Distance between clusters = smallest distance between representative sets.
Finding Scattered Representatives

• Distributed around the center of the cluster (symmetry).
• Well spread out over the cluster.

Use the farthest-point heuristic to scatter the points over the cluster.
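A minimal sketch of the farthest-point heuristic (function names are assumptions): pick the point farthest from the cluster mean first, then repeatedly add the point whose distance to the already-chosen set is largest.

```python
import math

def dist(p, q):
    # Euclidean distance between coordinate tuples.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def scattered_representatives(points, c):
    # Farthest-point heuristic: start from the point farthest from the
    # cluster mean, then repeatedly add the point that maximizes the
    # distance to the representatives chosen so far.
    mean = tuple(sum(x) / len(points) for x in zip(*points))
    reps = [max(points, key=lambda p: dist(p, mean))]
    while len(reps) < min(c, len(points)):
        candidates = [p for p in points if p not in reps]
        reps.append(max(candidates,
                        key=lambda p: min(dist(p, r) for r in reps)))
    return reps
```

This greedily spreads the c representatives over the whole cluster rather than bunching them near its center.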
Example

[Figure]
Shrinking the Representatives

Why do we need to alter the representative set? The scattered points may lie too close to the boundary of the cluster.

Shrink them uniformly by a factor α around the mean (center) of the cluster.
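Shrinking is a one-liner: each representative moves a fraction α of the way toward the cluster mean (a sketch; the function name is an assumption):

```python
def shrink(reps, mean, alpha):
    # Move every representative a fraction `alpha` toward the cluster
    # mean: alpha = 0 leaves it in place, alpha = 1 collapses it onto
    # the mean. Points and mean are coordinate tuples.
    return [tuple(r[j] + alpha * (mean[j] - r[j]) for j in range(len(r)))
            for r in reps]
```

Moderate α dampens the effect of boundary points and outliers while preserving the cluster's shape.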
Clustering Algorithm

Initially, every point is a separate cluster. Merge the closest clusters until the cardinality is at least c. If the cardinality > c, compute scattered representatives; use only the 2c representative points.

To expedite finding the closest clusters, maintain a k-d tree on the representative points.
Analysis of Running Time

[Figure: clusters X, A, B, Y]

Every cluster having A or B as its closest cluster may need to be updated: O(log n) time per update, O(n) updates.

Total time over the algorithm: O(n^2 \log n).
Random Sampling

Too much versus too little. If each cluster has a certain number of points, with high probability we will sample in proportion from the cluster:

    f·n points in a cluster => about f·s points in a sample of size s

The sample size needed to represent all sufficiently large clusters is independent of n.
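The proportionality claim is easy to check empirically. A small simulation (the cluster sizes and the 500-point sample are arbitrary choices for illustration, not from the slides):

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

# Two "clusters" occupying fractions 0.7 and 0.3 of a 10,000-point
# data set, encoded simply as labels.
population = ['A'] * 7000 + ['B'] * 3000

# A uniform random sample of size s = 500.
sample = random.sample(population, 500)

# The sample fraction of cluster A should be close to its true
# fraction 0.7, and this closeness depends on s, not on n.
frac_a = sample.count('A') / len(sample)
```

The same concentration argument is what lets CURE fix the sample size independently of the data set size.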
Partitioning

The sample may be large due to the desired accuracy; we may want a sample size larger than main memory.

Partition the sample into p partitions:
• partially cluster each partition,
• collect all partitions and complete the clustering.

Time reduces to p \cdot O\!\left(\frac{s^2}{p^2} \log \frac{s}{p}\right) = O\!\left(\frac{s^2}{p} \log \frac{s}{p}\right).

Why not a large p? Consider the second step above; there is also a loss in quality.
Labeling Data on Disk

Choose some constant number of representatives from each cluster. For every new point seen:
• find the nearest representative point,
• assign that representative's cluster label to the new point.

This ameliorates the labeling problem if sufficiently many representatives are chosen.
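Labeling then reduces to a nearest-representative lookup per disk point (a sketch with illustrative names; a real implementation would use a spatial index rather than a linear scan):

```python
import math

def dist(p, q):
    # Euclidean distance between coordinate tuples.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def label(point, reps_by_cluster):
    # reps_by_cluster maps a cluster label to that cluster's (shrunk)
    # representative points; return the label of the nearest one.
    best_label, best_dist = None, float('inf')
    for lbl, reps in reps_by_cluster.items():
        for r in reps:
            d = dist(point, r)
            if d < best_dist:
                best_label, best_dist = lbl, d
    return best_label
```

Because each cluster is represented by several scattered points rather than a single centroid, points near an elongated cluster's far ends still get labeled correctly.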
Outlier Handling

Outliers cannot have many points close to them, and random sampling preserves this property, so clusters around an outlier grow more slowly. After partial clustering has been done, throw away the slowly growing (small-cardinality) clusters.

This process is applied in two phases:
• after partial clustering,
• towards the end.
The Complete Algorithm

1. Draw a random sample.
2. Partition the sample.
3. Partially cluster the partitions.
4. Eliminate outliers.
5. Cluster the partial clusters.
6. Label the data on disk.
Sensitivity Analysis

We want to test the effects of varying:
• the shrink factor,
• the number of representatives,
• the number of sample points.

Data set: 100,000 points.
Sensitivity Analysis: Shrink Factor

[Figure]
Sensitivity Analysis: Number of Representatives

[Figure: clusterings for c = 2, 5, 10]

2500 sample points and shrink factor 0.3.
Sensitivity Analysis: Random Sample Size

[Figure: clusterings for 2000, 2500, and 3000 sample points]

10 representatives and shrink factor 0.3.
Comparison with BIRCH

[Figure: execution time (sec.) vs. number of points, 100,000-500,000, for BIRCH and CURE with p = 1, 2, 5]

The data file has 100,000 points. Using a Sun Ultra-2/200 machine with 512 MB RAM (labeling excluded).
Scale-up Experiments

[Figure: execution time (sec.) vs. number of sample points, 500-5000, for p = 1, 2, 5]

Using a Sun Ultra-2/200 machine with 512 MB RAM (labeling excluded).