k-means++ seeding

We have seen that the k-means algorithm can output arbitrarily poor solutions if it is started with a bad set of initial centroids.
k-means++ is a simple probabilistic algorithm for computing initial centroids.
These centroids are, provably, already a reasonably good solution to the k-means problem.
In practice, combining k-means++ seeding with a few rounds of the k-means algorithm usually leads to very good solutions to the k-means problem.
k-means++ seeding

Notation
$D$ denotes the squared Euclidean distance, and $P \subset \mathbb{R}^d$ with $|P| < \infty$.
For $x \in \mathbb{R}^d$ and $C \subset \mathbb{R}^d$ with $|C| < \infty$: $D(x, C) := \min_{c \in C} D(x, c)$.
For $A \subseteq P$: $D(A, C) := \sum_{a \in A} D(a, C)$.
A set of centroids $C$ with $|C| = k$ has a corresponding set of clusters $\mathcal{C} = \{C_1, \dots, C_k\}$; both are simply called a clustering.
For $A \subseteq P$, denote by $D_{\mathrm{opt}}(A) := D(A, C_{\mathrm{opt}})$, where $C_{\mathrm{opt}}$ is an optimal k-clustering, the contribution of $A$ to the cost of an optimal clustering.
Write $\mathrm{cost}_k(P)$ instead of $\mathrm{cost}^D_k(P)$.
If $A$ is a cluster of $C_{\mathrm{opt}}$ (that is, $A \in \mathcal{C}_{\mathrm{opt}}$), then $D_{\mathrm{opt}}(A) = \mathrm{cost}_1(A)$.
k-means++ seeding - distribution

k-means++ distribution
For any set $C \subset \mathbb{R}^d$ with $|C| < \infty$, denote by $p_C(\cdot)$ the distribution on $P$ defined by
$$\forall p \in P: \quad p_C(p) := \frac{D(p, C)}{D(P, C)}.$$
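To make the definition concrete, here is a minimal Python sketch that evaluates $p_C$ on a tiny point set. The function names (`sq_dist`, `dist_to_set`, `kmeanspp_distribution`) and the representation of points as tuples are assumptions made for illustration; they do not come from the slides.

```python
def sq_dist(x, c):
    """Squared Euclidean distance D(x, c) between two points given as tuples."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def dist_to_set(x, C):
    """D(x, C): squared distance from x to its nearest point in C."""
    return min(sq_dist(x, c) for c in C)


def kmeanspp_distribution(P, C):
    """Return the list [p_C(p) for p in P], i.e. D(p, C) / D(P, C)."""
    weights = [dist_to_set(p, C) for p in P]
    total = sum(weights)  # this is D(P, C)
    if total == 0:
        # The distribution is undefined when every point of P lies in C.
        raise ValueError("D(P, C) = 0: p_C is not defined")
    return [w / total for w in weights]


P = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
C = [(0.0, 0.0)]
probs = kmeanspp_distribution(P, C)
print(probs)  # [0.0, 0.1, 0.9]
```

Points that are already centers get probability 0, and far-away points are strongly favored because the weights are squared distances.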
k-means++ seeding - algorithm

k-Means++($P$, $k$)
    choose $c \in P$ uniformly at random, $C := \{c\}$;
    repeat
        choose $c \in P$ according to distribution $p_C(\cdot)$;
        $C := C \cup \{c\}$;
    until $|C| = k$;
    run k-Means on $P$ with initial centers $C$;
    return $C$;
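The seeding loop of the algorithm can be sketched in plain Python as follows. This covers only the choice of the $k$ initial centers, not the subsequent k-Means rounds. The name `kmeanspp_seed`, the optional `rng` parameter, and the uniform fallback when all remaining distances are zero are assumptions added for illustration, not part of the slides.

```python
import random


def sq_dist(x, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def kmeanspp_seed(P, k, rng=None):
    """Pick k initial centers from the list P by sampling from p_C."""
    rng = rng or random.Random()
    if not 1 <= k <= len(P):
        raise ValueError("k must satisfy 1 <= k <= |P|")
    # First center: uniform over P.
    centers = [rng.choice(P)]
    # d[i] holds D(P[i], centers), updated after every new center.
    d = [sq_dist(p, centers[0]) for p in P]
    while len(centers) < k:
        if sum(d) == 0:
            # Assumption: P has fewer distinct points than centers requested,
            # so p_C is undefined; fall back to a uniform choice.
            c = rng.choice(P)
        else:
            # Sample proportionally to D(p, C), i.e. according to p_C.
            c = rng.choices(P, weights=d, k=1)[0]
        centers.append(c)
        d = [min(di, sq_dist(p, c)) for di, p in zip(d, P)]
    return centers


points = [(0.0, 0.0), (0.0, 0.0), (10.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
print(kmeanspp_seed(points, 3, random.Random(0)))
```

Keeping the running minimum `d` avoids recomputing distances to all earlier centers, so each round costs one pass over $P$.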
k-means++ seeding - main theorem

Theorem 4.1
For any finite set of points $P \subset \mathbb{R}^d$ and any $k \in \mathbb{N}$, algorithm k-Means++ computes a k-clustering $C$ of $P$ such that
$$E[D(P, C)] \le 8 \cdot (2 + \ln k) \cdot \mathrm{opt}_k(P).$$
k-means++ seeding - main lemmas

Lemma 4.2
Let $A \subseteq P$ be a cluster of $C_{\mathrm{opt}}$. If $a$ is chosen uniformly at random from $P$, then
$$E[D(A, \{a\}) \mid a \in A] = 2 \cdot D_{\mathrm{opt}}(A).$$
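The identity in Lemma 4.2 can be checked numerically by enumerating all choices of $a \in A$. The sketch below uses the fact from the notation slide that $D_{\mathrm{opt}}(A) = \mathrm{cost}_1(A)$ for a cluster of the optimal clustering, and takes $\mathrm{cost}_1(A)$ to be the cost of $A$ with respect to its centroid. The four-point set and the helper names are illustrative assumptions.

```python
def sq_dist(x, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def centroid(A):
    """Coordinate-wise mean of the points in A."""
    n = len(A)
    return tuple(sum(coords) / n for coords in zip(*A))


def cost_1(A):
    """Optimal 1-means cost of A: squared distances to the centroid."""
    mu = centroid(A)
    return sum(sq_dist(x, mu) for x in A)


def expected_cost_uniform_center(A):
    """E[D(A, {a})] for a uniform over A, computed by exact enumeration."""
    return sum(sq_dist(x, a) for a in A for x in A) / len(A)


A = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
lhs = expected_cost_uniform_center(A)
rhs = 2 * cost_1(A)
print(lhs, rhs)  # 16.0 16.0
```

Conditioning on $a \in A$ makes $a$ uniform over $A$, which is why the enumeration averages over the points of $A$ only.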
Lemma 4.3
Let $A \subseteq P$ be a cluster of $C_{\mathrm{opt}}$ and let $C$, $|C| < k$, be arbitrary. If $a$ is chosen according to $p_C(\cdot)$, then
$$E[D(A, C \cup \{a\}) \mid a \in A] \le 8 \cdot D_{\mathrm{opt}}(A).$$
Lemma 4.4
Let $0 < u < k$ and $0 \le t \le u$. Let $P_u$ be the union of $u$ different clusters of $C_{\mathrm{opt}}$ and set $P^c := P \setminus P_u$. Finally, let $B \subseteq P^c$ and set $C_0 := B$ and $C_j := C_{j-1} \cup \{a_j\}$, $j = 1, \dots, t$, where $a_j$ is chosen according to $p_{C_{j-1}}$. Then
$$E[D(P, C_t)] \le \bigl(D(P^c, B) + 8 \cdot D_{\mathrm{opt}}(P_u)\bigr) \cdot (1 + H_t) + \frac{u - t}{u} \cdot D(P_u, B),$$
where $H_t = \sum_{i=1}^{t} \frac{1}{i}$.
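The slides state the lemmas without the step that connects them to Theorem 4.1. One standard way to combine them, sketched here for $k \ge 2$ and for the clustering obtained by the seeding step, is to let $a$ be the first center, let $A$ be the cluster of $C_{\mathrm{opt}}$ containing $a$, and apply Lemma 4.4 with $B = \{a\}$, $P^c = A$, $P_u = P \setminus A$ and $u = t = k - 1$. The last line uses the bound $H_{k-1} \le 1 + \ln k$.

```latex
\begin{align*}
E\bigl[D(P, C) \mid a\bigr]
  &\le \bigl(D(A, \{a\}) + 8 \cdot D_{\mathrm{opt}}(P \setminus A)\bigr)
       \cdot (1 + H_{k-1})
  && \text{(Lemma 4.4, } u = t = k - 1\text{)} \\
E\bigl[D(P, C) \mid a \in A\bigr]
  &\le \bigl(2 \cdot D_{\mathrm{opt}}(A) + 8 \cdot D_{\mathrm{opt}}(P)
       - 8 \cdot D_{\mathrm{opt}}(A)\bigr) \cdot (1 + H_{k-1})
  && \text{(Lemma 4.2)} \\
  &\le 8 \cdot D_{\mathrm{opt}}(P) \cdot (1 + H_{k-1}) \\
  &\le 8 \cdot (2 + \ln k) \cdot \mathrm{opt}_k(P).
\end{align*}
```

Since the bound holds for every optimal cluster $A$ that the first center can fall into, it also holds for the unconditional expectation.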