K-Medoids for K-Means Seeding
James Newling & François Fleuret
Machine Learning Group, Idiap Research Institute & EPFL
December 5th, 2017
École Polytechnique Fédérale de Lausanne
The standard K-means pipeline

First: Seeding. Second: Lloyd's (a.k.a. K-means) algorithm.

[Figure: simulated data, K = 12², N = 25K. Uniform seeding followed by LLOYD gives E = 0.105; K-means++ seeding followed by LLOYD gives E = 0.072.]
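The K-means++ seeding step above picks each new center with probability proportional to its squared distance from the nearest center chosen so far. A minimal sketch (function and parameter names are illustrative, not from the talk's code):

```python
import numpy as np

def kmeanspp_seed(X, K, rng=None):
    """K-means++ seeding: each new center is a data point sampled with
    probability proportional to its squared distance to the nearest
    already-chosen center (D^2 weighting)."""
    rng = np.random.default_rng(rng)
    N = X.shape[0]
    centers = [X[rng.integers(N)]]                      # first center: uniform
    d2 = np.sum((X - centers[0]) ** 2, axis=1)          # squared dist to nearest center
    for _ in range(K - 1):
        idx = rng.choice(N, p=d2 / d2.sum())            # D^2-weighted sample
        centers.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centers)
```

The D² weighting is what pushes seeds apart, which is why it beats uniform seeding in the comparison above.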
The standard K-means pipeline (+CLARANS)

[Figure: same simulated data, K = 12², N = 25K. Inserting CLARANS between seeding and LLOYD: uniform → CLARANS → LLOYD gives E = 0.032, and K-means++ → CLARANS → LLOYD also gives E = 0.032, versus E = 0.105 and E = 0.072 without CLARANS.]
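The final stage of both pipelines is Lloyd's algorithm, which alternates assignment and mean-update steps. A minimal sketch, with illustrative names:

```python
import numpy as np

def lloyd(X, centers, n_iter=50):
    """Lloyd's algorithm: repeat (1) assign each sample to its nearest
    center, (2) move each center to the mean of its assigned samples."""
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        # assignment step: nearest center for every sample
        D = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = D.argmin(axis=1)
        # update step: each non-empty cluster's center moves to its mean
        for k in range(len(centers)):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return centers, labels
```

Because Lloyd's only makes local moves, the final energy E depends heavily on the seeding it starts from, which is the point of the comparison above.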
CLARANS of Ng and Han (1994)

1: while not converged do
2:   randomly choose 1 center and 1 non-center
3:   if swapping them decreases E then
4:     implement the swap
5:   end if
6: end while

Avoids local minima of LLOYD by:
• long-range swaps
• updating centers and samples simultaneously.

We present algorithmic improvements, where
• computing the new E of a proposed swap is O(N/K)
• implementing a swap is O(N).
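The swap loop above can be sketched naively as follows. This is a baseline illustration only: each proposal here re-evaluates E from scratch in O(NK), whereas the talk's contribution is evaluating a swap in O(N/K) on average and applying it in O(N). All names are illustrative.

```python
import numpy as np

def energy(X, medoid_idx):
    # E: mean distance from each sample to its nearest medoid
    D = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return D.min(axis=1).mean()

def clarans(X, K, n_trials=200, rng=None):
    """Naive CLARANS: propose random (center, non-center) swaps and
    accept a swap whenever it decreases E."""
    rng = np.random.default_rng(rng)
    N = X.shape[0]
    medoids = rng.choice(N, size=K, replace=False)
    E = energy(X, medoids)
    for _ in range(n_trials):
        i = rng.integers(K)                 # slot of a current medoid
        j = rng.integers(N)                 # a random sample
        if j in medoids:                    # must be a non-center
            continue
        proposal = medoids.copy()
        proposal[i] = j
        E_new = energy(X, proposal)         # O(NK) here; O(N/K) in the paper
        if E_new < E:                       # accept only improving swaps
            medoids, E = proposal, E_new
    return medoids, E
```

Because a swap can move a center anywhere in the dataset, CLARANS escapes the purely local basins that trap Lloyd's updates.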
Results

• RNA dataset, d = 8, N = 16 × 10⁴, K = 400
• 50 runs without CLARANS (red), 24 runs with (blue).

[Plot: E (1.0–1.8) vs. time [s] (0.0–3.5), comparing K-means++ → LLOYD (red) against K-means++ → CLARANS → LLOYD (blue).]

• On 16 datasets, the geometric mean improvement is 3%.

CLARANS with the Levenshtein metric for sequence data, l₀, l₁, ..., l∞ for sparse/dense vectors, and many other metrics, is available on github.
The end james.newling@idiap.ch