Sathyanarayan Anand & Debasree Banerjee Swarm Intelligence 2005-06 09.02.2006 SI0506 - Data Clustering Using Flocking 1
What is Data Clustering?
• Given a set of elements and a similarity measure between pairs of elements, the task is to group the elements into clusters so that similar elements end up in the same cluster.
• Data element = point in some high-dimensional space.
• Applications: geographic information systems, pattern recognition, medical imaging, marketing analysis, weather forecasting, etc.
Related Work
• Hierarchical algorithms: break large clusters into smaller ones until the desired granularity is reached.
  – Chameleon: model-based splitting of clusters.
• Partitioning algorithms: move data between partitions to optimize some quality measure.
  – K-means clustering
  – Fuzzy c-means clustering
• Density-based algorithms.
  – DBSCAN
• Swarm-based algorithms.
  – Lumer-Faieta: ant-colony-based clustering.
Flocking Rules (as given by Craig Reynolds)
Separation: steer to avoid crowding local flock mates. No two agents end up on the same data point.
Alignment: steer towards the average heading of local flock mates.
Cohesion: steer to move toward the average position of local flock mates.
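A minimal sketch of the three rules for a single agent in 2D, assuming each agent is a dict with 'pos' and 'vel' tuples. The function name steer and the crowding radius min_dist are hypothetical and not taken from the slides; the weighting used to combine the three vectors is left open.

```python
import math

def steer(agent, neighbours, min_dist=1.0):
    """Return (separation, alignment, cohesion) steering vectors.

    min_dist is an assumed crowding radius, not a value from the slides.
    """
    if not neighbours:
        return (0.0, 0.0), (0.0, 0.0), (0.0, 0.0)
    n = len(neighbours)
    ax, ay = agent["pos"]

    # Separation: push away from flock mates closer than min_dist.
    sx = sy = 0.0
    for nb in neighbours:
        dx, dy = ax - nb["pos"][0], ay - nb["pos"][1]
        if math.hypot(dx, dy) < min_dist:
            sx += dx
            sy += dy

    # Alignment: mean neighbour velocity minus our own velocity.
    mvx = sum(nb["vel"][0] for nb in neighbours) / n
    mvy = sum(nb["vel"][1] for nb in neighbours) / n
    alignment = (mvx - agent["vel"][0], mvy - agent["vel"][1])

    # Cohesion: vector from our position to the mean neighbour position.
    mpx = sum(nb["pos"][0] for nb in neighbours) / n
    mpy = sum(nb["pos"][1] for nb in neighbours) / n
    cohesion = (mpx - ax, mpy - ay)

    return (sx, sy), alignment, cohesion
```

A new velocity would typically be the old velocity plus a weighted sum of the three vectors; the weights are a tuning choice.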
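The Algorithm – Similarity Measures
• Used to determine whether two data points, a and b, belong to the same cluster.
  – Euclidean distance: $d(a, b) = \sqrt{\sum_i (a_i - b_i)^2}$
  – Vector dot product: $a \cdot b = \sum_i a_i b_i$
  – Penalized difference: $|a - b| \cdot p = \sum_i p_i \, |a_i - b_i|$, where p is a vector that denotes the importance of each attribute.
  – Pearson’s coefficient: $r(a, b) = \dfrac{\sum_i (a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_i (a_i - \bar{a})^2} \, \sqrt{\sum_i (b_i - \bar{b})^2}}$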
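The four measures can be sketched in plain Python using their standard textbook definitions. The function names are illustrative, and data points are assumed to be equal-length sequences of numbers.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot(a, b):
    """Vector dot product."""
    return sum(x * y for x, y in zip(a, b))

def penalized_difference(a, b, p):
    """abs(a - b) . p, where p weights the importance of each attribute."""
    return sum(abs(x - y) * w for x, y, w in zip(a, b, p))

def pearson(a, b):
    """Pearson correlation coefficient.

    Undefined (division by zero) if either vector is constant.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)
```

Note that Euclidean distance and penalized difference shrink as points become more similar, while the dot product and Pearson's coefficient grow, so a threshold test has to point in the right direction for the chosen measure.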
The Algorithm – Procedure
• Initialize flock randomly on the dataset.
• Repeat
  – Each agent performs local density-based clustering.
    • If the density of points around a given point exceeds a given threshold, then every point in the cluster takes the label of the point with the minimum label.
• Merge clusters belonging to different agents.
• Flock migrates to a new location, controlled by a defined flock speed.
• Flock memory: a location is not revisited until all other locations have been visited.
• Local clustering leads to the emergence of a global cluster pattern.
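One possible reading of the procedure, as a hedged sketch rather than the authors' implementation. Assumptions: density is the number of points within radius of the agent's current point; merging is done with a union-find structure in which the minimum label wins; flock migration is simplified to jumping to not-yet-visited points (which also plays the role of flock memory) instead of speed-controlled flocking. The names flock_cluster, radius, min_pts and n_agents are hypothetical.

```python
import math
import random

def flock_cluster(points, radius, min_pts, n_agents=3, seed=0):
    """Label points by local density-based clustering at visited locations."""
    rng = random.Random(seed)
    parent = list(range(len(points)))   # every point starts with its own label

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)   # the minimum label wins

    # Flock memory: each location is visited once before any revisit.
    unvisited = list(range(len(points)))
    rng.shuffle(unvisited)
    while unvisited:
        # One flock step: each agent lands on a distinct unvisited point.
        for _ in range(min(n_agents, len(unvisited))):
            i = unvisited.pop()
            nbrs = [j for j, q in enumerate(points)
                    if math.dist(points[i], q) <= radius]
            if len(nbrs) >= min_pts:        # dense enough: merge labels
                for j in nbrs:
                    union(i, j)
    return [find(i) for i in range(len(points))]
```

Because overlapping neighbourhoods share points, clusters found by different agents merge through the shared labels, which is how the local steps add up to a global clustering in this sketch.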
The Algorithm – Proof of Convergence
• Markov process with state = centroid of flock.
  – Centroid = data point that minimizes cumulative distance to all other points.
  – Next state (centroid) depends only on current state.
• Irreducibility = any point can be reached from any other point.
• Ergodicity = the time taken to revisit a state is finite and the process is aperiodic. Ensured through flock memory.
• In the limit of infinite time, an irreducible and ergodic Markov process converges to a stationary distribution.
  – Clustering becomes independent of initial state.
  – Similar proof used in spectral clustering techniques.
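The independence from the initial state can be illustrated on a toy chain. The 3-state transition matrix P below is made up for illustration and is not derived from the flocking algorithm; it is irreducible and aperiodic, so repeated application drives any starting distribution to the same stationary distribution.

```python
def step(dist, P):
    """One Markov step: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def stationary(dist, P, iters=200):
    """Approximate the stationary distribution by repeated stepping."""
    for _ in range(iters):
        dist = step(dist, P)
    return dist

# Hypothetical transition matrix (rows sum to 1, self-loops make it aperiodic).
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]

from_first = stationary([1.0, 0.0, 0.0], P)
from_last = stationary([0.0, 0.0, 1.0], P)
```

Both from_first and from_last approach (0.25, 0.5, 0.25), which satisfies pi = pi P for this matrix.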
The Algorithm – Limitations
• Density-based clustering is highly sensitive to the radius and density threshold parameters.
• The computational cost of creating an efficient data structure is exponential. It can be reduced using certain techniques.
Results – Synthetic Dataset
Results – Zoo Dataset
Results – Chameleon Dataset 1
Results – Chameleon Dataset 2
Results – Performance Comparison
Dataset coverage w.r.t. the number of agents in the flock.
References
1. Zaiane O.R., Lee C.H. Clustering Spatial Data in the Presence of Obstacles: A Density-Based Approach. In Proceedings of the International Database Engineering and Applications Symposium, IEEE, 17-19 July 2002, pp. 214-223.
2. Lumer E., Faieta B. Diversity and Adaptation in Populations of Clustering Ants. In Proceedings of the 3rd International Conference on Simulation of Adaptive Behavior: From Animals to Animats 3, pp. 501-508, 1994.
3. Ester M., Kriegel H.P., Sander J., Xu X. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In 2nd International Conference on Knowledge Discovery and Data Mining (KDD’96), Portland, Oregon. AAAI Press, 1996.
4. Karypis G., Han S., Kumar V. CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling. IEEE Computer: Special Issue on Data Analysis and Mining, 32(8), pp. 68-75, 1999.
5. Bradley P.S., Fayyad U., Reina C. Scaling Clustering Algorithms for Large Databases. In 4th International Conference on Knowledge Discovery and Data Mining (KDD’98), New York City. AAAI Press, 1998.
6. Höppner F. Speeding up Fuzzy c-Means: Using a Hierarchical Data Organization to Control the Precision of Membership Calculation. Fuzzy Sets and Systems, 128(3), pp. 365-378, 2002.
7. Newman D.J., Hettich S., Blake C.L., Merz C.J. UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science, 1998.