K-means++: The Advantages of Careful Seeding
Sergei Vassilvitskii, David Arthur (Stanford University)
Clustering: Given $n$ points in $\mathbb{R}^d$, split them into $k$ similar groups.

This talk: k-means clustering. Find $k$ centers $C$ that minimize the potential
$$\phi = \sum_{x \in X} \min_{c \in C} \|x - c\|^2 .$$
Why Means?
Objective: find $k$ centers $C$ that minimize $\sum_{x \in X} \min_{c \in C} \|x - c\|^2$.
For one cluster: find the $y$ that minimizes $\sum_{x \in X} \|x - y\|^2$.
Easy! $y = \frac{1}{|X|} \sum_{x \in X} x$, the mean of the cluster.
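As a quick numerical check of this fact, here is a small NumPy sketch (the helper name `kmeans_cost` and the random data are illustrative, not part of the talk):

```python
import numpy as np

def kmeans_cost(X, centers):
    """k-means potential: sum over points of squared distance to the nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (n_points, n_centers)
    return d2.min(axis=1).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))              # a single "cluster" of 200 points in R^3

mean = X.mean(axis=0)
best = kmeans_cost(X, mean[None, :])       # cost with the centroid as the only center

# No other candidate center does better than the mean.
for _ in range(1000):
    y = rng.normal(size=(1, 3))
    assert kmeans_cost(X, y) >= best
```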
Lloyd’s Method: k-means
• Initialize with $k$ centers chosen at random.
• Assign each point to its nearest center.
• Recompute each center as the optimum for its cluster (the mean).
• Repeat the assign and recompute steps until the clustering does not change.
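A minimal sketch of this loop in NumPy (same style as the `kmeans_cost` snippet above; `lloyd` is an illustrative name, and empty clusters are simply left where they are):

```python
import numpy as np

def lloyd(X, centers, max_iter=100):
    """Lloyd's method: given initial centers, alternate assignment and mean steps
    until the clustering stops changing (or max_iter is hit)."""
    centers = np.array(centers, dtype=float)
    labels = None
    for _ in range(max_iter):
        # Step 1: assign each point to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                              # clustering did not change: local optimum
        labels = new_labels
        # Step 2: recompute each center as the mean of its assigned points.
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, labels
```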
Analysis
How good is this algorithm? It finds a local optimum, which can be arbitrarily worse than the optimal solution.
Approximating k-means
• Mount et al.: $(9+\epsilon)$-approximation in time $O(n^3 / \epsilon^d)$
• Har-Peled et al.: $(1+\epsilon)$-approximation in time $O(n + k^{k+2} \epsilon^{-2dk} \log^k(n/\epsilon))$
• Kumar et al.: $(1+\epsilon)$-approximation in time $2^{(k/\epsilon)^{O(1)}} n d$
Lloyd’s method:
• Worst-case running time: $2^{\Omega(\sqrt{n})}$
• Smoothed complexity: $n^{O(k)}$
Lloyd’s method in practice: on the Digit Recognition dataset (UCI), with $n = 60{,}000$ and $d = 600$, it converges to a local optimum in 60 iterations.
Challenge
Develop an approximation algorithm for k-means clustering that is competitive with the k-means method in speed and solution quality.
Easiest line of attack: focus on the initial center positions. Classical k-means: pick $k$ input points at random as the initial centers.
k-means on Gaussians [figure slides]
Easy Fix Select centers using a furthest point algorithm (2-approximation to k-Center clustering).
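A sketch of this seeding rule (greedy k-center), in the same NumPy style; `furthest_point_init` is an illustrative name:

```python
import numpy as np

def furthest_point_init(X, k, seed=0):
    """Pick one center at random, then repeatedly add the point furthest
    from all centers chosen so far (the greedy 2-approximation for k-center)."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        C = np.array(centers)
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        centers.append(X[d2.argmax()])         # deterministic: the maximally distant point
    return np.array(centers)
```

Because every choice after the first is deterministic and maximally distant, a single far-away outlier is guaranteed to be picked as a center, which is exactly the failure mode shown next.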
Sensitive to Outliers
k-means++
Interpolate between the two methods. Let $D(x)$ be the distance between $x$ and the nearest cluster center chosen so far. Sample the next center proportionally to $(D(x))^{\alpha} = D^{\alpha}(x)$:
• $\alpha = 0$: original Lloyd’s seeding (uniform at random)
• $\alpha = \infty$: furthest-point seeding
• $\alpha = 2$: k-means++, which weights each $x$ by its contribution to the overall error
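A sketch of this family of seeding rules as one function of $\alpha$ (`d_alpha_init` is an illustrative name; $\alpha = 2$ is the k-means++ seeding, $\alpha = 0$ recovers uniform random seeding, and a very large $\alpha$ behaves like the furthest-point rule):

```python
import numpy as np

def d_alpha_init(X, k, alpha=2.0, seed=0):
    """Seed k centers, choosing each new center x with probability proportional to D(x)**alpha,
    where D(x) is the distance from x to the nearest center chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]        # first center: uniform at random
    for _ in range(k - 1):
        C = np.array(centers)
        d = np.sqrt(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1))
        p = d ** alpha
        p /= p.sum()
        centers.append(X[rng.choice(len(X), p=p)])
    return np.array(centers)

# k-means++ seeding followed by Lloyd's iterations (lloyd() from the earlier sketch):
# centers, labels = lloyd(X, d_alpha_init(X, k=25, alpha=2.0))
```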
k-Means++
k-means++
Theorem: k-means++ is $\Theta(\log k)$-approximate in expectation.
Ostrovsky et al. [06]: a similar method is $O(1)$-approximate under some assumptions on the data distribution.
Proof - 1st cluster
Fix an optimal clustering $C^*$. Pick the first center uniformly at random, and bound the total error of the optimal cluster it belongs to.
Proof - 1st cluster
Let $A$ be that cluster. Each point $a_0 \in A$ is equally likely to be the chosen center. Expected error:
$$\mathbb{E}[\phi(A)] = \frac{1}{|A|} \sum_{a_0 \in A} \sum_{a \in A} \|a - a_0\|^2 = 2 \sum_{a \in A} \|a - \bar{A}\|^2 = 2\,\phi^*(A),$$
where $\bar{A}$ is the mean of $A$ and $\phi^*(A)$ is the optimal cost of the cluster.
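The jump from the double sum to $2\,\phi^*(A)$ uses the standard bias-variance identity; spelled out in the same notation:

```latex
% For any fixed a_0, decompose around the cluster mean \bar{A}:
\sum_{a \in A} \|a - a_0\|^2
  = \sum_{a \in A} \|a - \bar{A}\|^2 + |A|\,\|\bar{A} - a_0\|^2 .
% Average over the uniformly random choice of a_0 \in A:
\mathbb{E}[\phi(A)]
  = \frac{1}{|A|} \sum_{a_0 \in A} \Bigl( \sum_{a \in A} \|a - \bar{A}\|^2
      + |A|\,\|\bar{A} - a_0\|^2 \Bigr)
  = 2 \sum_{a \in A} \|a - \bar{A}\|^2
  = 2\,\phi^*(A).
```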
Proof - Other Clusters
Now suppose the next center comes from a cluster of the optimal solution that has not been hit yet. Bound the total error of that cluster.
Other Clusters
Let $B$ be this cluster, and $b_0$ the point selected as its center. Then:
$$\mathbb{E}[\phi(B)] = \sum_{b_0 \in B} \frac{D^2(b_0)}{\sum_{b \in B} D^2(b)} \cdot \sum_{b \in B} \min\bigl(D(b), \|b - b_0\|\bigr)^2 .$$
Key step (triangle inequality): $D(b_0) \le D(b) + \|b - b_0\|$.
Cont.
For any $b$ (square the key step and use $(x+y)^2 \le 2x^2 + 2y^2$):
$$D^2(b_0) \le 2 D^2(b) + 2 \|b - b_0\|^2 .$$
Averaging over all $b \in B$:
$$D^2(b_0) \le \frac{2}{|B|} \sum_{b \in B} D^2(b) + \frac{2}{|B|} \sum_{b \in B} \|b - b_0\|^2 .$$
The first term is the same for every $b_0$; the second is the cost of $b_0$ under uniform sampling.
Cont.
Recall:
$$\mathbb{E}[\phi(B)] = \sum_{b_0 \in B} \frac{D^2(b_0)}{\sum_{b \in B} D^2(b)} \cdot \sum_{b \in B} \min\bigl(D(b), \|b - b_0\|\bigr)^2 \le \frac{4}{|B|} \sum_{b_0 \in B} \sum_{b \in B} \|b - b_0\|^2 = 8\,\phi^*(B).$$
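The step to the $4/|B|$ bound can be filled in as follows (a sketch of the standard argument: substitute the averaged bound on $D^2(b_0)$, and in each resulting term bound the $\min$ by whichever of its two arguments cancels conveniently):

```latex
\mathbb{E}[\phi(B)]
 \le \sum_{b_0 \in B}
     \frac{\frac{2}{|B|}\sum_{b' \in B} D^2(b') + \frac{2}{|B|}\sum_{b' \in B}\|b'-b_0\|^2}
          {\sum_{b \in B} D^2(b)}
     \sum_{b \in B} \min\bigl(D(b), \|b-b_0\|\bigr)^2 .
% First part: the D^2 ratio cancels; bound min(...)^2 by \|b-b_0\|^2.
% Second part: bound min(...)^2 by D^2(b), so the sum over b cancels the denominator.
\mathbb{E}[\phi(B)]
 \le \frac{2}{|B|}\sum_{b_0 \in B}\sum_{b \in B}\|b-b_0\|^2
   + \frac{2}{|B|}\sum_{b_0 \in B}\sum_{b' \in B}\|b'-b_0\|^2
 = \frac{4}{|B|}\sum_{b_0 \in B}\sum_{b \in B}\|b-b_0\|^2
 = 8\,\phi^*(B),
% using the first-cluster lemma:
% \frac{1}{|B|}\sum_{b_0 \in B}\sum_{b \in B} \|b-b_0\|^2 = 2\,\phi^*(B).
```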
Wrap Up
If clusters are well separated and we always pick a center from a new optimal cluster, the algorithm is 8-competitive.
Intuition: if no point of some optimal cluster is ever picked as a center, that cluster probably does not contribute much to the overall error.
Formally, an inductive argument shows the method is $\Theta(\log k)$-competitive.
Experiments
Tested on several datasets:
• Synthetic: 10k points, 3 dimensions
• Cloud Cover (UCI Repository): 10k points, 54 dimensions
• Color Quantization: 16k points, 16 dimensions
• Intrusion Detection (KDD Cup): 500k points, 35 dimensions
Typical Run
[Figure: error vs. stage (iteration) for KM++ v. KM v. KM-Hybrid; curves labeled LLOYD, HYBRID, and KM++.]
Experiments
Total error:

Dataset        k-means       km-Hybrid     k-means++
Synthetic      0.016         0.015         0.014
Cloud Cover    6.06 × 10^5   6.02 × 10^5   5.95 × 10^5
Color          741           712           670
Intrusion      −             32.9 × 10^3   3.4 × 10^3

Time: k-means++ is about 1% slower due to the initialization step.
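For reproducing a comparison of this kind today, scikit-learn ships both seeding strategies; a minimal sketch on synthetic, illustrative data (not one of the datasets above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 10k points, 25 clusters in R^3.
X, _ = make_blobs(n_samples=10_000, centers=25, n_features=3, random_state=0)

for init in ("random", "k-means++"):
    km = KMeans(n_clusters=25, init=init, n_init=1, random_state=0).fit(X)
    print(f"{init:>10}: final k-means cost (inertia) = {km.inertia_:.1f}")
```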
Final Message Friends don’t let friends use k-means.
Thank You Any Questions?