CS535 Big Data 03/02/2020 Week 7-A Sangmi Lee Pallickara
CS535 BIG DATA
PART B. GEAR SESSIONS
SESSION 2: MACHINE LEARNING FOR BIG DATA
Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535
Spring 2020

FAQs
• Quiz #3
  • Scores will be available by 3/6
• Programming Assignment #2
  • March 10
• Piazza discussion board
• Critical Review

Topics of Today's Class
• GEAR Session 2. Machine Learning for Big Data
  • Lecture 1.
    • Clustering Algorithms

GEAR Session 2. Machine Learning for Big Data
Lecture 1. Distributed Clustering
Models

Clustering: Core concept
• Set of N-dimensional vectors
  • Can be in the order of millions
• Group (or cluster) them based on their proximity (or similarity) to each other in an N-dimensional space
• Vectors or objects in a cluster (or group) are more similar to each other than to those in any other group

Clustering: Applications
• Anomaly detection
• Fraud detection
• Recommendation systems
• Medical imaging
• Market research
• Human genetic clustering

GEAR Session 2. Machine Learning for Big Data
Lecture 1. Distributed Clustering
Introduction

This material is built based on
• Arthur, D. and Vassilvitskii, S., 2007. k-means++: The advantages of careful seeding. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, pp. 1027–1035.
• Bahmani, B., Moseley, B., Vattani, A., Kumar, R. and Vassilvitskii, S., 2012. Scalable k-means++. arXiv preprint arXiv:1203.6402.
• Apache Spark MLlib: Clustering
  • https://spark.apache.org/docs/latest/ml-clustering.html

K-Means Clustering
• A set of unlabeled points
• Assumes that they form k clusters
• Find a set of cluster centers that minimizes the sum of squared distances from each point to its nearest center
• Finding a global optimum is NP-hard in general; for fixed k and d, an exact solution takes $O(n^{dk+1})$ time
• Many approximate algorithms are available
D. Aloise, A. Deshpande, P. Hansen, and P. Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245–248, 2009.

[Figure: Concept: k-Means Clustering (1/4). Scatter plot of points with two centers marked x.]

[Figure: Concept: k-Means Clustering (2/4). Scatter plot of points with two centers marked x.]

[Figure: Concept: k-Means Clustering (3/4). Scatter plot of points with two centers marked x.]

[Figure: Concept: k-Means Clustering (4/4). Scatter plot of points with two centers marked x.]

k-Means algorithm: Lloyd's Algorithm (1/2)
• Input
  • k (number of clusters)
  • Training set $\{x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(m)}\}$
  • $x^{(i)} \in \mathbb{R}^n$ (drop the $x_0 = 1$ convention)

k-Means algorithm: Lloyd's Algorithm (2/2)
• Randomly initialize K cluster centroids $\mu_1, \mu_2, \ldots, \mu_K \in \mathbb{R}^n$
repeat {
    for i = 1 to m
        $c^{(i)}$ := index (from 1 to K) of the cluster centroid closest to $x^{(i)}$
    for k = 1 to K
        $\mu_k$ := average (mean) of the points assigned to cluster k
}

Cost function
• The objective is to find:
  $\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$
• Where $\mu_i$ is the mean of the points in $S_i$
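
The cost function above can be evaluated directly once every point has a cluster index. The following is a minimal sketch in plain Python; the function name `kmeans_cost` and the toy data are illustrative assumptions, not part of the lecture material.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def kmeans_cost(points, centroids, assignment):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    return sum(dist(x, centroids[c]) ** 2 for x, c in zip(points, assignment))


# Hypothetical toy data: two tight groups of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]  # the mean of each group
assignment = [0, 0, 1, 1]              # c(i): centroid index of point i

print(kmeans_cost(points, centroids, assignment))  # 4.0
```

Every point lies at distance 1 from its centroid, so the cost is 4. Moving a centroid away from its group mean can only increase this value, which is why the update step uses the mean.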

k-Means algorithm: Lloyd's Algorithm: Step-By-Step Instruction
1. Initialization Step
  • Select k random centers
  • Using a random uniform distribution
2. Assignment Step
  • Assign each observation to the cluster with the nearest center
  • Euclidean distance
3. Update Step
  • Calculate the new mean of the observations assigned to each cluster
  • Update centroids
4. Termination Step
  • Stop when the centroids do not change between two consecutive steps.

k-Means for non-separated clusters
[Figure: two scatter plots, "Separated clusters" and "Non-Separated clusters".]
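
The four steps of Lloyd's algorithm listed above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the initial centers are drawn uniformly from the input points, an empty cluster keeps its previous centroid, and the names `lloyd_kmeans`, `max_iter`, and `seed` are hypothetical.

```python
import random
from math import dist


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def lloyd_kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # 1. Initialization: pick k distinct input points uniformly at random (assumption).
    centroids = rng.sample(points, k)
    assignment = []
    for _ in range(max_iter):
        # 2. Assignment: each point goes to its nearest centroid (Euclidean distance).
        assignment = [min(range(k), key=lambda j: dist(x, centroids[j])) for x in points]
        # 3. Update: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [x for x, c in zip(points, assignment) if c == j]
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(mean(members) if members else centroids[j])
        # 4. Termination: stop when no centroid changed in this step.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignment


points = [(0.0, 0.0), (1.0, 0.0), (100.0, 0.0), (101.0, 0.0)]
centroids, assignment = lloyd_kmeans(points, k=2)
print(sorted(centroids))  # [(0.5, 0.0), (100.5, 0.0)]
```

On this well-separated toy data any choice of two starting points converges to the same two centroids. On non-separated data the result generally depends on the initialization.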

How to choose the number of clusters
• Value k in the algorithm
[Figure: scatter plot of unlabeled points.]

Choosing the value K (1/2)
Elbow Method
[Figure: two plots of cost function J against K (no. of clusters); one curve has a point labeled "Elbow".]
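
To make the elbow method concrete, the sketch below runs a basic k-means for several values of K and records the final cost J for each. It is a hedged illustration: the helper names `kmeans_cost_for_k` and `elbow_curve`, the number of restarts, and the toy data are assumptions, and the elbow itself is still read off the resulting curve by eye.

```python
import random
from math import dist


def kmeans_cost_for_k(points, k, seed=0, max_iter=100):
    """Run a basic Lloyd's k-means and return the final cost J."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k random input points
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda j: dist(x, centroids[j]))
            clusters[nearest].append(x)
        # New centroid = mean of the members; an empty cluster keeps its old centroid.
        updated = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(dist(x, c) for c in centroids) ** 2 for x in points)


def elbow_curve(points, k_values, restarts=5):
    """Lowest cost J found over a few random restarts, for each K."""
    return {
        k: min(kmeans_cost_for_k(points, k, seed=s) for s in range(restarts))
        for k in k_values
    }


# Hypothetical toy data with three obvious groups.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0), (20.0, 0.0), (21.0, 0.0)]
for k, cost in elbow_curve(points, range(1, 6)).items():
    print(k, round(cost, 2))
```

For data like this, J drops sharply until K reaches the number of natural groups and changes little afterwards; that bend is the "elbow".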

Choosing the value K (2/2)
[Figure: two scatter plots with Waist and Sleeve Length axes; one is grouped into Small, Medium, and Large, the other into Extra Small, Small, Medium, Large, and Extra Large.]

Distance Measures
• Euclidean Distance
• Manhattan Distance
• Cosine Distance
• Hamming Distance
• Jaccard Dissimilarity
• Edit Distance
• Smith-Waterman Similarity
• Image Distance
• Etc.
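
Several of the distance measures listed above are short enough to write out. The sketch below uses plain Python and common textbook definitions; the function names are illustrative, and "Edit Distance" is assumed here to mean the Levenshtein distance.

```python
from math import sqrt


def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)  # assumes neither vector is all zeros


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def jaccard_dissimilarity(a, b):
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)  # assumes at least one set is non-empty


def edit_distance(s, t):
    """Levenshtein distance via dynamic programming, one table row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution or match
        prev = curr
    return prev[-1]


print(euclidean((0, 0), (3, 4)))         # 5.0
print(manhattan((0, 0), (3, 4)))         # 7
print(hamming("karolin", "kathrin"))     # 3
print(edit_distance("kitten", "sitting"))  # 3
```

The choice of measure changes which points count as "close": Euclidean and Manhattan distance suit numeric vectors, cosine distance ignores vector length, and Hamming, Jaccard, and edit distance apply to sequences and sets.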

GEAR Session 2. Machine Learning for Big Data
Lecture 1. Distributed Clustering
Scalable k-means

k-Means algorithm: Lloyd's Algorithm: Strengths
• Embarrassingly parallel
• Converges to a local minimum
• O(nkdi) runtime (n points, k clusters, d dimensions, i iterations)
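
The "embarrassingly parallel" property comes from the assignment step: each point is compared with the k centroids independently of every other point. One common way to exploit this, sketched below in plain Python, is to have each data partition compute per-cluster sums and counts, then merge those small results to obtain the new centroids. The partitions here are ordinary lists standing in for workers, and the function names `partial_stats` and `merge_and_update` are hypothetical.

```python
from math import dist


def partial_stats(partition, centroids):
    """Map step: per-partition coordinate sums and point counts for each centroid."""
    k, d = len(centroids), len(centroids[0])
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for x in partition:
        j = min(range(k), key=lambda c: dist(x, centroids[c]))  # nearest centroid
        counts[j] += 1
        for axis in range(d):
            sums[j][axis] += x[axis]
    return sums, counts


def merge_and_update(stats, centroids):
    """Reduce step: add up the partial results and recompute the centroids."""
    k, d = len(centroids), len(centroids[0])
    new_centroids = []
    for j in range(k):
        count = sum(counts[j] for _, counts in stats)
        if count == 0:
            new_centroids.append(centroids[j])  # assumption: keep an empty cluster's centroid
            continue
        total = [sum(sums[j][axis] for sums, _ in stats) for axis in range(d)]
        new_centroids.append(tuple(t / count for t in total))
    return new_centroids


# Two partitions of a toy data set; each could be processed on a separate worker.
partitions = [[(0.0, 0.0), (10.0, 0.0)], [(1.0, 0.0), (11.0, 0.0)]]
centroids = [(0.0, 0.0), (10.0, 0.0)]
stats = [partial_stats(p, centroids) for p in partitions]
print(merge_and_update(stats, centroids))  # [(0.5, 0.0), (10.5, 0.0)]
```

Each iteration touches every point once and compares it with k centroids in d dimensions, which gives the O(nkd) cost per iteration. Only the k sums and counts per partition need to be communicated.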

k-Means algorithm: Lloyd's Algorithm: Weaknesses
• O(nkdi) runtime
  • Worst case? The number of iterations i can grow exponentially with n
• Large number of local minima
  • Many local minima are poor
• k is unknown

The K-Means++ Algorithm
• Avoiding cold-start improves results
  • Reducing the number of total iterations
• Initialize cluster centers sequentially
  • Only the first center is selected uniformly at random
  • Each further center is selected probabilistically to be far from existing centers
• Result of this is an O(log k) approximation to the global optimum, in expectation
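
The sequential seeding described above can be sketched as follows. In the k-means++ paper by Arthur and Vassilvitskii, each further center is sampled with probability proportional to the squared distance from a point to its nearest already-chosen center. This plain-Python version is an illustration only; the function name `kmeans_pp_init` and the `seed` parameter are assumptions, and it expects at least k distinct input points.

```python
import random
from math import dist


def kmeans_pp_init(points, k, seed=0):
    """Choose k initial centers with D^2 weighting (k-means++ style seeding)."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # D(x)^2: squared distance from each point to its nearest chosen center.
        weights = [min(dist(x, c) for c in centers) ** 2 for x in points]
        # Sample the next center with probability proportional to D(x)^2.
        # Points that are already centers have weight 0 and are never re-selected.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


# Hypothetical toy data: three groups of two points.
points = [(0.0, 0.0), (0.0, 1.0), (50.0, 0.0), (50.0, 1.0), (100.0, 0.0), (100.0, 1.0)]
print(kmeans_pp_init(points, k=3))
```

The returned centers would then replace the random initialization step of Lloyd's algorithm. Far-away points are strongly favored, so the seeds tend to land in different groups, although this is not guaranteed for any single run.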