On approximate geometric k-clustering (Jiri Matousek, DCG 2000)


  1. On approximate geometric k-clustering (Jiri Matousek, DCG 2000)

  2. Problem
  • Given an n-point set and k > 1, find a partition of minimum cost
  • "Geometric" cost function
  • Approximately optimal

  3. Results
  • 2-clustering: an O(n log n) approximation algorithm for fixed ε > 0
  • k-clustering
  • Can be improved with a known lower bound on the cluster size

  4. Really Easy Problem
  • Given X and cluster centers c1, c2, …, ck, find the optimal clustering of X
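This "really easy" step can be sketched in a few lines of Python. The slide leaves the geometric cost function abstract; the sketch below assumes the common sum-of-squared-distances cost, and the helper names are mine:

```python
import math

def assign_to_centers(points, centers):
    """Partition points by nearest center (the Voronoi partition)."""
    clusters = [[] for _ in centers]
    for p in points:
        # index of the closest center
        i = min(range(len(centers)),
                key=lambda j: math.dist(p, centers[j]))
        clusters[i].append(p)
    return clusters

def cost(points, centers):
    """Sum of squared distances to the nearest center."""
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)

X = [(0.0, 0.0), (1.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
C = [(0.5, 0.0), (9.5, 0.0)]
clusters = assign_to_centers(X, C)
# each center captures its two nearby points; cost = 4 * 0.25 = 1.0
```

With fixed centers this nearest-center assignment is optimal for any cost that sums per-point distances, which is why the problem is "really easy" once the centers are known.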

  5. General Approach
  • Snap X to a grid
  • Cover the space with points (potential cluster centers), polynomial in n
  • Test subsets of size k to find the best centers; only test sufficiently different subsets
  • Aiming for a "near linear" algorithm

  6. Polynomial Grid
  Thm: Let d and k0 be fixed. Suppose there is an algorithm A that, for a given ε > 0, k ≤ k0, and an n-point multiset X' ⊂ R^d with points lying on an integer grid of size O(n^3/ε), finds a (1+ε)-approximately optimal k-clustering of X'. Then a (1+ε)-approximately optimal k0-clustering for an arbitrary n-point set X ⊂ R^d can be computed with O(n log n) preprocessing and with at most C calls to algorithm A, with various at most n-point sets X', with k ≤ k0, and with αε instead of ε, where α > 0 and C are constants.
  Pf: Grid size δ = αεΔ/5n^2, Δ = diam(X)/n. Let X be the original set and X' the snapped set. If Π is a clustering of X and Π' is the corresponding clustering of X':
  – max change in distance to a cluster center: |diam(X)^2 − (diam(X) + 2δ)^2| ≤ 5δ·diam(X)
  – if cost(Π') > (1/20)Δ^2, then since Π' is (1+αε)-approximate for X', the corresponding Π is (1+ε)-approximate for X
  – otherwise, apply the algorithm recursively to groups of nearby clusters
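The snapping step of the proof can be sketched as follows. This is a sketch under the slide's formula δ = αεΔ/5n^2 with Δ = diam(X)/n; the function name and the default α = 1 are my choices:

```python
import math

def snap_to_grid(X, eps, alpha=1.0):
    """Round each point of X to the nearest node of a grid with cell size
    delta = alpha * eps * Delta / (5 n^2), where Delta = diam(X) / n."""
    n = len(X)
    diam = max(math.dist(p, q) for p in X for q in X)
    delta = alpha * eps * (diam / n) / (5 * n * n)
    snapped = [tuple(round(c / delta) * delta for c in p) for p in X]
    return snapped, delta

X = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
Xp, delta = snap_to_grid(X, eps=0.1)
# every coordinate moves by at most delta / 2, so each point moves
# by at most (sqrt(d)/2) * delta
```

Because the cell size is O(εΔ/n^2), the total cost perturbation introduced by snapping stays within the ε-budget of the theorem.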

  7. Approximate Centroid Sets
  • Dfn: C is an ε-approximate centroid set for X if it intersects the ε-tolerance ball of every subset of X (of size at least s)
  – the ε-tolerance ball for S is centered at c(S), with radius …
  • Thm: Let X ⊂ R^d be a finite point set, let k ≥ 2, and let C be an ε-approximate centroid set for X with cluster size ≥ s. Then there are c1, c2, …, ck ∈ C s.t. for all k-clusterings of X with all clusters of size at least s …
  • Pf: By algebra, using the definition of centroid sets

  8. Construction

  9. Construction
  Subdivide as long as Q contains at least s/2^(d+1) points

  10. Construction
  – the closest side length σ larger than diam(B)
  – B intersects at most 2^d cubes
  – some cube has at least s/2^(d+1) points
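The subdivision rule on slides 9-10 can be sketched as a quadtree-style recursion. This is a minimal sketch, not the paper's construction: a cube is represented as (corner, side), split into 2^d children while it still holds at least s/2^(d+1) points, and the min_side safeguard is my addition to guarantee termination on coincident points:

```python
from itertools import product

def subdivide(corner, side, points, s, d, min_side=1e-6):
    """Recursively split a cube while it contains >= s / 2**(d+1) points;
    return the leaf cubes of the subdivision as (corner, side) pairs."""
    inside = [p for p in points
              if all(corner[i] <= p[i] < corner[i] + side for i in range(d))]
    if len(inside) < s / 2 ** (d + 1):
        return []           # too few points: this cube is discarded
    if side <= min_side:
        return [(corner, side)]
    half = side / 2
    children = []
    for offs in product((0, half), repeat=d):
        sub = tuple(corner[i] + offs[i] for i in range(d))
        children.extend(subdivide(sub, half, inside, s, d, min_side))
    # if no child is heavy enough, this cube itself is a leaf
    return children or [(corner, side)]

leaves = subdivide((0.0, 0.0), 4.0,
                   [(0.5, 0.5), (0.6, 0.5), (3.5, 3.5)], s=16, d=2)
```

In the example only the small cube around the two clustered points survives: the lone point at (3.5, 3.5) never meets the s/2^(d+1) = 2 threshold.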

  11. Centroid Sets
  • Thm: |C| is bounded as … and the construction can be performed in time …
  • Pf: There are at most … different side lengths, and each cube contains at least … points, so there are …

  12. Well-separated Pairs
  • Dfn: (x, y) and (x', y') are ε-near if |x − x'| ≤ εr and |y − y'| ≤ εr, where r = |x − y|
  [figure: x and y at distance r, with x' and y' within εr of x and y]
  • A set P of pairs is ε-separated if no two pairs in P are ε-near
  • P is ε-complete for X if for every pair in X there is an ε-near pair in P
  • k-tuples (c1, c2, …, ck) and (c1', c2', …, ck') are ε-near (complete, separated) if all pairs (ci, cj) and (ci', cj') are ε-near
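The ε-near test can be written directly from the definition. This assumes, as the slide's figure suggests, that r is the distance between x and y and that both primed points must lie within εr of their partners:

```python
import math

def eps_near(pair, pair2, eps):
    """(x, y) and (x2, y2) are eps-near if x2 and y2 lie within
    eps * r of x and y respectively, where r = |x - y|."""
    (x, y), (x2, y2) = pair, pair2
    r = math.dist(x, y)
    return math.dist(x, x2) <= eps * r and math.dist(y, y2) <= eps * r

p = ((0.0, 0.0), (10.0, 0.0))   # r = 10
q = ((0.5, 0.0), (10.5, 0.0))   # both endpoints moved by 0.5
# near at eps = 0.1 (0.5 <= 1.0), not near at eps = 0.01 (0.5 > 0.1)
```

Note the tolerance scales with r: pairs that are far apart may be perturbed more, which is what makes a small ε-complete set possible.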

  13. Approximate k-clustering
  • Thm: Let (c1, c2, …, ck) and (c1', c2', …, ck') be two k-tuples in R^d that are ε-near, ε ≤ 1/9. Let Π = Π_Vor(c1, c2, …, ck) and Π' = Π_Vor(c1', c2', …, ck'). Then cost(Π') ≤ (1 + 6ε)·cost(Π)
  • Pf: [figure: centers c1, c2 with perturbed centers c1', c2', Voronoi cells S1 and S1', and a point x at distance δ from c1 with perturbation bounded by εδ]

  14. Approximate k-clustering
  • Instead of looking at all k-tuples in C, we only need to look at an ε-complete set
  – still too many for a near-linear algorithm
  • Look at ε-well spread tuples instead, i.e. no subset is 1/ε-isolated
  – Y ⊂ X is 1/ε-isolated if there is x ∈ X \ Y at distance at least (1/ε)·diam(Y) from Y
  [figure: a subset Y of diameter d and a point x at distance d/ε from it]
  – If X is ε-well spread, then diam(X) ≤ (2/ε)^(k−1)·δ
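The isolation test reads off the slide's definition almost verbatim. A sketch, with the function name mine and the "exists a point of X \ Y far from all of Y" reading taken from the figure:

```python
import math

def is_isolated(Y, X, eps):
    """Y is (1/eps)-isolated in X if some point of X outside Y lies at
    distance at least (1/eps) * diam(Y) from every point of Y."""
    diam_Y = max((math.dist(p, q) for p in Y for q in Y), default=0.0)
    rest = [x for x in X if x not in Y]
    return any(min(math.dist(x, y) for y in Y) >= diam_Y / eps
               for x in rest)

Y = [(0.0, 0.0), (1.0, 0.0)]          # diam(Y) = 1
X = Y + [(5.0, 0.0)]                  # outside point at distance 4
# isolated at eps = 0.5 (threshold 2), not at eps = 0.1 (threshold 10)
```

A tuple is ε-well spread exactly when no subset passes this test, which yields the diameter bound diam(X) ≤ (2/ε)^(k−1)·δ quoted on the slide.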

  15. Building Cluster Centers
  • Let C ⊂ R^d be an m-point set. Then we can compute a set C' of k-tuples s.t.
  – for any ε-well spread k-tuple in C, there is a tuple in C' that is ε-near it
  – |C'| = O(m·ε^(−k²d))
  – each k-tuple in C' is ε/2-well spread
  – at least one point in each k-tuple in C' belongs to C
  – no more than O(1) k-tuples of C' lie near any given k-tuple in R^d
  – the minimum and maximum distances of points in each k-tuple in C' are bounded by constant multiples of the minimum and maximum distances in C
  • The running time is O(m log m + m·ε^(−k²d))

  16. Building Cluster Centers
  • Generate an ε/2-complete set of pairs P ⊂ C × C
  – there are O(m·ε^(−d)) pairs; running time O(m log m + m·ε^(−d))
  – each pair will be the basis for an ε/2-well spread k-tuple
  [figure: a pair x, y at distance δ]

  17. Algorithm
  • If k = 1, return X
  • For k* = 2, 3, …, k, generate the set C* of k*-tuples for C
  • For each (c1, c2, …, ck*) ∈ C*, let (X1, X2, …, Xk*) = Π_Vor(c1, c2, …, ck*)
  • Let Ci be the points lying in the εδ-neighborhood of ci
  – δ is the smallest pairwise distance among c1, …, ck*
  • For i = 1, …, k*, call the algorithm recursively on Xi and Ci
  – vary the number of clusters from 1 to k − k* + 1
  – find the combination k1 + k2 + … + kk* = k with the smallest cost
  • For each k*-tuple in C* with all Ci non-empty, output the one with smallest cost
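The recursion above can be sketched as runnable Python. This is a heavily simplified sketch, not the paper's algorithm: it brute-forces k*-tuples from the whole candidate set C (rather than the well-spread tuples of slide 15), recurses on the full C (rather than the εδ-neighborhoods Ci), and assumes the squared-distance-to-centroid cost. All names are mine:

```python
import math
from itertools import combinations

def centroid(pts):
    d = len(pts[0])
    return tuple(sum(p[i] for p in pts) / len(pts) for i in range(d))

def cluster_cost(pts):
    c = centroid(pts)
    return sum(math.dist(p, c) ** 2 for p in pts)

def voronoi_partition(X, centers):
    parts = [[] for _ in centers]
    for p in X:
        i = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        parts[i].append(p)
    return parts

def compositions(k, m):
    """All ways to write k as an ordered sum of m positive integers."""
    if m == 1:
        yield (k,)
        return
    for first in range(1, k - m + 2):
        for rest in compositions(k - first, m - 1):
            yield (first,) + rest

def cluster(X, C, k):
    """Try k*-tuples of candidate centers, split X by nearest center,
    and recurse on the parts, distributing the remaining cluster count."""
    if k == 1 or len(X) <= 1:
        return [X], cluster_cost(X) if X else 0.0
    best = None
    for kstar in range(2, k + 1):
        for centers in combinations(C, kstar):
            parts = voronoi_partition(X, centers)
            if any(not part for part in parts):
                continue  # require all parts non-empty
            for split in compositions(k, kstar):
                clusters, total = [], 0.0
                for part, ki in zip(parts, split):
                    sub, c = cluster(part, C, ki)
                    clusters += sub
                    total += c
                if best is None or total < best[1]:
                    best = (clusters, total)
    return best

X = [(0.0, 0.0), (1.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
C = [(0.5, 0.0), (9.5, 0.0)]
clusters, total = cluster(X, C, 2)
```

The point of the sketch is the control flow: outer choice of k*, Voronoi split, and the search over compositions k1 + … + kk* = k, exactly as the bullets describe.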

  18. Algorithm
  [figure: example with k = 10 and k* = 4; centers c1, …, c4 at pairwise distance δ; cluster X3 with candidate set C3 in the εδ-neighborhood of c3; ki ranging over 1, 2, …, 7]

  19. Correctness
  • For any k-tuple (c1, …, ck), the algorithm generates a tuple that is ε-near

  20. Running Time
  • All range queries are done by approximate range searching in time O(log n)
  • Approximate Voronoi partitioning can be done in time O(log n + ε^(−2(d−1)))
  • Running time: …
