
CS535 Big Data | 03/02/2020 | Week 7-A | Sangmi Lee Pallickara

CS535 BIG DATA
PART B. GEAR SESSIONS
SESSION 2: MACHINE LEARNING FOR BIG DATA
Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535
Spring 2020

FAQs
• Quiz #3
  • Scores will be available by 3/6
• Programming Assignment #2
  • March 10
• Piazza discussion board
• Critical Review

Topics of Today's Class
• GEAR Session 2. Machine Learning for Big Data
  • Lecture 1.
    • Clustering Algorithms

GEAR Session 2. Machine Learning for Big Data
Lecture 1. Distributed Clustering
Models

Clustering: Core concept
• Set of N-dimensional vectors
  • Can be in the order of millions
• Group (or cluster) them based on their proximity (or similarity) to each other in an N-dimensional space
• Vectors or objects in a cluster (or group) are more similar to each other than to those in any other group

Clustering: Applications
• Anomaly detection
• Fraud detection
• Recommendation systems
• Medical imaging
• Market research
• Human genetic clustering

GEAR Session 2. Machine Learning for Big Data
Lecture 1. Distributed Clustering
Introduction

This material is based on:
• Arthur, D. and Vassilvitskii, S., 2007. k-means++: The advantages of careful seeding. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA.
• Bahmani, B., Moseley, B., Vattani, A., Kumar, R. and Vassilvitskii, S., 2012. Scalable k-means++. arXiv preprint arXiv:1203.6402.
• Apache Spark MLlib: Clustering
  • https://spark.apache.org/docs/latest/ml-clustering.html

K-Means Clustering
• A set of unlabeled points
• Assumes that they form k clusters
• Find a set of cluster centers that minimizes the distance from each point to its nearest center
• Finding a global optimum is NP-hard in general; for fixed k and d, an exact solution takes $O(n^{dk+1})$ time
• Many approximate algorithms are available
D. Aloise, A. Deshpande, P. Hansen, and P. Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245–248, 2009.

Concept: k-Means Clustering (1/4)
[Figure: scatter plot of data points with two cluster centers marked "x"]

Concept: k-Means Clustering (2/4)
[Figure: scatter plot of data points with two cluster centers marked "x"]

Concept: k-Means Clustering (3/4)
[Figure: scatter plot of data points with two cluster centers marked "x"]

Concept: k-Means Clustering (4/4)
[Figure: scatter plot of data points with two cluster centers marked "x"]

k-Means algorithm - Lloyd's Algorithm (1/2)
• Input
  • k (number of clusters)
  • Training set $\{x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(m)}\}$
  • $x^{(i)} \in \mathbb{R}^n$ (drop the $x_0 = 1$ convention)

k-Means algorithm - Lloyd's Algorithm (2/2)
• Randomly initialize K cluster centroids $\mu_1, \mu_2, \ldots, \mu_K \in \mathbb{R}^n$

repeat {
    for i = 1 to m
        c(i) := index (from 1 to K) of the cluster centroid closest to x(i)
    for k = 1 to K
        μ_k := average (mean) of the points assigned to cluster k
}

Cost function
• The objective is to find:
  $$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$
• Where $\mu_i$ is the mean of the points in $S_i$
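The assignment and update loop, together with the cost function, can be sketched in plain Python as follows. This is a minimal illustration rather than the course's reference implementation: the function names, the `seed` and `max_iters` parameters, the choice to initialize from k randomly sampled data points, and the rule of keeping the old centroid when a cluster is empty are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def assign(points, centroids):
    """Assignment step: c(i) := index of the centroid closest to x(i)."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(x, centroids[j]))
        for x in points
    ]


def lloyd_kmeans(points, k, max_iters=100, seed=0):
    """Run Lloyd's algorithm; return (centroids, assignments, cost)."""
    rng = random.Random(seed)
    # Assumption: initialize with k distinct data points chosen uniformly.
    centroids = rng.sample(points, k)
    assignments = assign(points, centroids)
    for _ in range(max_iters):
        # Update step: mu_k := mean of the points assigned to cluster k.
        for j in range(k):
            members = [x for x, c in zip(points, assignments) if c == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = mean(members)
        new_assignments = assign(points, centroids)
        if new_assignments == assignments:
            break  # no point changed cluster, so the centroids are stable
        assignments = new_assignments
    # Cost: sum over clusters of squared distances to the cluster mean.
    cost = sum(squared_distance(x, centroids[c]) for x, c in zip(points, assignments))
    return centroids, assignments, cost
```

Calling `lloyd_kmeans(points, 2)` on a list of coordinate tuples returns the final centroids, the cluster index of each point, and the value of the cost function for that clustering.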

k-Means algorithm - Lloyd's Algorithm: Step-By-Step Instruction
1. Initialization Step
  • Select k random centers
  • Using a uniform random distribution
2. Assignment Step
  • Assign each observation to the cluster with the nearest center
  • Euclidean distance
3. Update Step
  • Calculate the new mean of the observations assigned to each cluster
  • Update centroids
4. Termination Step
  • Stop when the centroids do not change for two consecutive steps.

k-Means for non-separated clusters
[Figure: two scatter plots, "Separated clusters" and "Non-Separated clusters"]

How to choose the number of clusters
• Value k in the algorithm
[Figure: scatter plot of unlabeled data points]

Choosing the value K (1/2)
Elbow Method
[Figure: two plots of cost function J against K (no. of clusters), with the "Elbow" labeled]
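The elbow method amounts to running k-means for a range of K values and looking at how the cost function J falls. The sketch below does this for one-dimensional data only, to stay short; the function names and the deterministic, evenly spread initialization are assumptions made for this example and are not taken from the slides.

```python
def kmeans_cost_1d(values, k, max_iters=100):
    """Run Lloyd's algorithm on 1-D data and return the final cost J."""
    data = sorted(values)
    n = len(data)
    # Assumption: deterministic initial centroids spread evenly over the sorted data.
    centroids = [data[(2 * j + 1) * n // (2 * k)] for j in range(k)]

    def assign():
        # Index of the nearest centroid for every value.
        return [min(range(k), key=lambda j: (x - centroids[j]) ** 2) for x in data]

    assignments = assign()
    for _ in range(max_iters):
        for j in range(k):
            members = [x for x, c in zip(data, assignments) if c == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = sum(members) / len(members)
        new_assignments = assign()
        if new_assignments == assignments:
            break
        assignments = new_assignments
    return sum((x - centroids[c]) ** 2 for x, c in zip(data, assignments))


def elbow_curve(values, max_k):
    """Return [(K, J)] for K = 1..max_k, the data behind an elbow plot."""
    return [(k, kmeans_cost_1d(values, k)) for k in range(1, max_k + 1)]


if __name__ == "__main__":
    # Three tight groups around 1, 5 and 9: J should drop sharply up to K = 3.
    sample = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
    for k, cost in elbow_curve(sample, 5):
        print(f"K={k}  J={cost:.3f}")
```

The K at which further increases stop producing a large drop in J is the candidate "elbow"; this sketch leaves that judgment to the reader instead of automating it.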

Choosing the value K (2/2)
[Figure: two scatter plots of Waist against Sleeve Length; one grouped into Small, Medium, Large, the other into Extra Small, Small, Medium, Large, Extra Large]

Distance Measures
• Euclidean Distance
• Manhattan Distance
• Cosine Distance
• Hamming Distance
• Jaccard Dissimilarity
• Edit Distance
• Smith-Waterman Similarity
• Image Distance
• Etc.
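Several of the measures listed above have short, standard definitions, sketched here in plain Python. The function names are chosen for this example, the edit distance shown is the Levenshtein variant (the slide does not say which variant is meant), and Smith-Waterman and image distances are left out because they need more context than the slide gives.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))


def jaccard_dissimilarity(a, b):
    """1 - |A intersect B| / |A union B| for two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return 1.0 - len(a & b) / len(a | b)


def edit_distance(s, t):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            substitution = prev[j - 1] + (0 if cs == ct else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]
```

For example, `euclidean((0, 0), (3, 4))` is 5.0 while `manhattan((0, 0), (3, 4))` is 7, so the choice of measure changes which points count as "close" during the assignment step.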

GEAR Session 2. Machine Learning for Big Data
Lecture 1. Distributed Clustering
Scalable k-means

k-Means algorithm - Lloyd's Algorithm: Strengths
• Embarrassingly parallel
• Converges to a local minimum
• O(nkdi) runtime (n points, k clusters, d dimensions, i iterations)
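The "embarrassingly parallel" property comes from the fact that each data partition can compute per-cluster sums and counts on its own, and those partial results can be added together in any order. The sketch below shows one iteration in that map-and-reduce shape, run sequentially in plain Python; the function names are hypothetical, and in a real deployment each call to `partial_stats` would run on a separate worker (for example, one Spark partition).

```python
from functools import reduce


def partial_stats(partition, centroids):
    """Map side: per-cluster coordinate sums and counts for one partition."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for x in partition:
        # Nearest centroid by squared Euclidean distance.
        j = min(
            range(len(centroids)),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += x[d]
    return sums, counts


def merge_stats(left, right):
    """Reduce side: add two partial results element-wise."""
    sums = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(left[0], right[0])]
    counts = [p + q for p, q in zip(left[1], right[1])]
    return sums, counts


def distributed_step(partitions, centroids):
    """One Lloyd iteration over a non-empty list of partitions."""
    sums, counts = reduce(merge_stats, (partial_stats(p, centroids) for p in partitions))
    # A cluster that received no points keeps its previous centroid.
    return [
        tuple(s / n for s in row) if n else tuple(old)
        for row, n, old in zip(sums, counts, centroids)
    ]
```

Only the k sums and counts travel between workers, not the points themselves, which is why the per-iteration communication cost does not grow with n.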

k-Means algorithm - Lloyd's Algorithm: Weaknesses
• O(nkdi) runtime
  • Worst case? The number of iterations i can be exponential in n
• Large number of local minima
  • Many local minima are poor
• k is unknown

The K-Means++ Algorithm
• Avoiding a cold start improves results
  • Reduces the total number of iterations
• Initialize cluster centers sequentially
  • Only the first center is selected uniformly at random
  • Each further center is selected probabilistically to be far from existing centers
• The result is an O(log k) approximation to the global optimum, in expectation
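The sequential seeding described above can be sketched as follows. In the k-means++ paper by Arthur and Vassilvitskii, each further center is drawn with probability proportional to its squared distance from the nearest center already chosen; the function name, the `seed` parameter, and the uniform fallback when every point coincides with a chosen center are assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centers from points using k-means++ seeding."""
    rng = random.Random(seed)
    # Only the first center is chosen uniformly at random.
    centers = [rng.choice(points)]
    # nearest[i] = squared distance from points[i] to its closest chosen center.
    nearest = [squared_distance(x, centers[0]) for x in points]
    while len(centers) < k:
        if sum(nearest) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            index = rng.randrange(len(points))
        else:
            # Far-away points get proportionally more weight; already chosen
            # points have weight 0 and are never drawn again.
            index = rng.choices(range(len(points)), weights=nearest, k=1)[0]
        centers.append(points[index])
        nearest = [
            min(old, squared_distance(x, points[index]))
            for old, x in zip(nearest, points)
        ]
    return centers
```

The returned centers can replace the uniform random initialization of Lloyd's algorithm; the iterations that follow are unchanged.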
