Mining Streams
Computational Tools for Data Science 02807, E 2018
Paul Fischer
Institut for Matematik og Computer Science, Danmarks Tekniske Universitet
Autumn 2018
© 2018 P. Fischer
Clustering
Today's schedule
◮ What is clustering
◮ Hierarchical clustering
◮ The k-means algorithm
◮ The DBSCAN algorithm (not in the book)
◮ Evaluating clusterings
What is Clustering
Clustering is the task of grouping objects from a large set in such a way that objects in the same group are more "similar" to each other than to those in other groups. The groups are called clusters. The measure of similarity has to be specified according to the problem under consideration.
Examples
◮ People with similar interests in social media.
◮ People with similar taste in movies at a streaming provider.
◮ Detecting similarities in medical tests.
◮ Detection of groups in statistical data.
Examples
How many clusters do you see?
General assumptions
We assume that the data to be considered is numerical. Each data point is a d-dimensional vector x = (x_0, ..., x_{d-1}). The input to clustering is a multi-set S = ⟨x_0, ..., x_{n-1}⟩ of n data points.
For a multi-set S = ⟨x_0, ..., x_{n-1}⟩ the centroid (center of gravity) cent(S) is defined by
    cent(S) = (1/n) Σ_{i=0}^{n-1} x_i
where the sum is componentwise. That is, for x = (x_0, x_1, ..., x_{d-1}) and y = (y_0, y_1, ..., y_{d-1}):
    x + y = (x_0 + y_0, x_1 + y_1, ..., x_{d-1} + y_{d-1})
A distance measure dist(·,·) is defined on the data points in R^d, where dist(x, y) ≥ 0, dist(x, y) = dist(y, x), and dist(x, y) ≤ dist(x, z) + dist(z, y), i.e., dist is a metric.
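A minimal sketch of these definitions in Python, assuming NumPy and the Euclidean metric (the function names centroid and euclidean are ours, not part of the course code):

```python
import numpy as np

def centroid(S):
    """Componentwise mean of a multi-set of d-dimensional points (one point per row)."""
    return np.asarray(S, dtype=float).mean(axis=0)

def euclidean(x, y):
    """Euclidean distance, one possible choice for dist(x, y)."""
    return np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))

points = [(1.0, 2.0), (3.0, 4.0), (5.0, 0.0)]
print(centroid(points))           # [3. 2.]
print(euclidean((0, 0), (3, 4)))  # 5.0
```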
Hierarchical Clustering
Outline of hierarchical clustering
The algorithm repeatedly joins clusters that are close to each other. Let c_i be the centroid of cluster C_i.
◮ Initialisation: Each data point is a cluster by itself, i.e., C_i = {x_i} and c_i = x_i.
◮ Merging: Find clusters C_i and C_j where dist(c_i, c_j) is minimal (breaking ties, e.g., randomly). Merge C_i and C_j into a new cluster C_k, where the indexing is done by new numbers or by re-using existing ones (k = i). Remove C_i and C_j. Note that merging is multi-set union, denoted ⊎.
◮ Stop the process when some criterion is satisfied, e.g., when a certain number of clusters is reached.
Hierarchical Clustering
Pseudo code
    for i = 0, ..., n-1 do
        C_i ← {x_i}; c_i ← x_i;
    end
    goon ← true;
    while goon do
        find i ≠ j such that dist(c_i, c_j) is minimal;
        C_k ← C_i ⊎ C_j; c_k ← cent(C_k);
        Remove C_i and C_j as clusters and c_i and c_j as centers;
        Update goon;
    end
Note that in general c_k ≠ (c_i + c_j)/2; the summands have to be weighted by the sizes of the clusters:
    c_k = (|C_i| c_i + |C_j| c_j) / (|C_i| + |C_j|)
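A minimal sketch of this pseudo code in Python, assuming NumPy, Euclidean distance, and centroid linkage. The quadratic pair search is kept deliberately simple; a real implementation would use a priority queue or a library routine such as scipy.cluster.hierarchy.

```python
import numpy as np

def hierarchical_clustering(points, target_k):
    """Naive centroid-linkage agglomerative clustering.
    Merges the two clusters with closest centroids until target_k clusters remain.
    Returns a list of clusters, each a list of point indices."""
    points = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(points))]           # each point starts as its own cluster
    centroids = [points[i].copy() for i in range(len(points))]

    while len(clusters) > target_k:
        # find the pair of clusters with minimal centroid distance
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(centroids[a] - centroids[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # weighted centroid of the merged cluster: (|C_a| c_a + |C_b| c_b) / (|C_a| + |C_b|)
        na, nb = len(clusters[a]), len(clusters[b])
        centroids[a] = (na * centroids[a] + nb * centroids[b]) / (na + nb)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
        del centroids[b]
    return clusters

data = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(hierarchical_clustering(data, target_k=3))
```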
Hierarchical Clustering
Stop criteria
◮ A number of clusters has been specified beforehand. When only this number is left, the algorithm terminates.
◮ The density of the cluster resulting from a merger is bad. The density is the average distance between points in a cluster (a small sketch follows below). This can also be used to reject mergers in the course of the algorithm.
◮ See more in the book.
Without further features which "guide" the algorithm, hierarchical clustering might perform badly on larger data sets.
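One possible reading of the density criterion, sketched in Python under the assumption that "density" means the average pairwise Euclidean distance within a cluster (the threshold used to accept or reject a merger is up to the user):

```python
import numpy as np

def average_pairwise_distance(cluster_points):
    """Average distance over all pairs of points in a cluster; smaller means tighter."""
    P = np.asarray(cluster_points, dtype=float)
    n = len(P)
    if n < 2:
        return 0.0
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += np.linalg.norm(P[i] - P[j])
            pairs += 1
    return total / pairs

print(average_pairwise_distance([(0, 0), (0, 1), (1, 0)]))
```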
Hierarchical Clustering
Phylogenetic Trees
Hierarchical clustering is useful to generate phylogenetic trees (on small data sets).
(Figure: a point set with items A-E and the corresponding dendrogram with leaves A, B, C, D, E.)
k-means Algorithm
The k-means algorithm
The k-means algorithm requires the user to provide the number k of clusters and delivers a partition of S into k clusters, C_0, ..., C_{k-1}.
Idea:
0 Randomly select k points c_0, ..., c_{k-1} from S. These are the centers of the clusters.
1 For each x_i ∈ S, assign x_i to the cluster whose center is closest.
2 Re-compute the centers c_j to be the centroids of the C_j.
◮ Iterate steps 1 and 2 until no (or only very small) changes occur.
k-means Algorithm
The k-means algorithm
Input: A multi-set S = ⟨x_0, ..., x_{n-1}⟩ and a positive integer k
    Randomly select k distinct points c_i from S;
    goon ← true;
    while goon do
        for j = 0, ..., k-1 do
            C_j ← ∅;
        end
        for i = 0, ..., n-1 do
            ℓ ← arg min { dist(x_i, c_j) | j = 0, ..., k-1 };
            C_ℓ ← C_ℓ ⊎ {x_i};
        end
        for j = 0, ..., k-1 do
            c_j ← cent(C_j);
        end
        Update goon;
    end
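A minimal sketch of this pseudo code in Python (Lloyd's algorithm), assuming NumPy and Euclidean distance. The function name kmeans and its defaults are ours, not course-provided code; the "update goon" step is realised by stopping when the centers no longer move.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means. Returns (centers, labels) with labels[i] = cluster index of x_i."""
    X = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # k distinct start points from S

    for _ in range(max_iter):
        # step 1: assign every point to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # shape (n, k)
        labels = dists.argmin(axis=1)
        # step 2: recompute each center as the centroid of its cluster
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:              # keep the old center if a cluster becomes empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers): # stop when nothing changes
            break
        centers = new_centers
    return centers, labels

data = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)]
centers, labels = kmeans(data, k=2)
print(centers, labels)
```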
DBSCAN
DBSCAN, Idea
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
◮ One defines the concept of (density-)reachability for the data points.
◮ The algorithm uses two parameters: ε > 0, the neighborhood radius, and m ∈ N+, the minimum required neighbourhood size.
◮ The algorithm classifies points as core (centrally in a cluster), rim (at the edge of a cluster), and noise (not belonging to any cluster).
◮ The number of clusters is not fixed beforehand; it is implicitly controlled by ε and m.
◮ A point x is core if there are at least m points (incl. x) within distance ε, i.e., |{z | dist(x, z) ≤ ε}| ≥ m (sketched below).
◮ A point z is directly reachable from x if dist(x, z) ≤ ε and x is core.
◮ A point z is reachable from x if there are points x_1, x_2, ..., x_k such that x = x_1, z = x_k, x_{i+1} is directly reachable from x_i, and x_1, x_2, ..., x_{k-1} are core. If z is not core, it is rim.
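A small sketch of the ε-neighbourhood and the core-point test in Python, assuming NumPy and Euclidean distance (the helper names neighbourhood and is_core are ours):

```python
import numpy as np

def neighbourhood(points, x_idx, eps):
    """Indices of all points z with dist(points[x_idx], z) <= eps (including x itself)."""
    P = np.asarray(points, dtype=float)
    d = np.linalg.norm(P - P[x_idx], axis=1)
    return np.where(d <= eps)[0]

def is_core(points, x_idx, eps, m):
    """x is core if its eps-neighbourhood (including x) contains at least m points."""
    return len(neighbourhood(points, x_idx, eps)) >= m

data = [(0, 0), (0.5, 0), (0, 0.5), (0.4, 0.4), (10, 10)]
print(is_core(data, 0, eps=1.0, m=4))   # True: four points lie within distance 1 of (0, 0)
print(is_core(data, 4, eps=1.0, m=4))   # False: (10, 10) is isolated
```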
DBSCAN
DBSCAN, Idea
(Figure, left: a point x and its ε-neighbourhood; point x is core for m = 4.)
(Figure, right: for m = 4, core points in red, rim points in yellow, noise points in blue.)
DBSCAN
DBSCAN Pseudo Code
Algorithm 1: DBSCAN(S, ε, m)
    Mark all x_i ∈ S as unvisited;
    for i = 0, ..., n-1 do
        if x_i is unvisited then
            N ← neigh(x_i, ε);
            if |N| < m then
                Mark x_i as noise;
            else
                C ← ∅;
                Mark x_i as core;
                expand(x_i, N, C, ε, m);
            end
        end
    end

Algorithm 2: neigh(x, ε)
    return all points z with dist(x, z) ≤ ε

Algorithm 3: expand(x, N, C, ε, m)
    C ← C ⊎ {x};
    for z ∈ N do
        if z is not visited then
            Mark z as visited;
            N' ← neigh(z, ε);
            if |N'| ≥ m then
                N ← N ⊎ N';
                Mark z as core;
            else
                Mark z as rim;
            end
        end
        if z is not in any cluster then
            C ← C ⊎ {z};
        end
    end
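One way to turn this pseudo code into runnable Python, as a sketch assuming NumPy and Euclidean distance. The names dbscan and neigh and the label convention (-1 for noise, None for undecided) are ours; a production version would rather use sklearn.cluster.DBSCAN.

```python
import numpy as np

def neigh(X, i, eps):
    """Indices of all points within distance eps of point i (including i itself)."""
    return list(np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0])

def dbscan(points, eps, m):
    X = np.asarray(points, dtype=float)
    n = len(X)
    labels = [None] * n           # None = unassigned, -1 = noise, >= 0 = cluster id
    visited = [False] * n
    next_cluster = 0

    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        N = neigh(X, i, eps)
        if len(N) < m:
            labels[i] = -1        # noise (may later be turned into a rim point)
            continue
        # i is a core point: start a new cluster and expand it
        c = next_cluster
        next_cluster += 1
        labels[i] = c
        queue = list(N)
        while queue:
            z = queue.pop()
            if not visited[z]:
                visited[z] = True
                Nz = neigh(X, z, eps)
                if len(Nz) >= m:              # z is core: its neighbours are reachable too
                    queue.extend(Nz)
            if labels[z] is None or labels[z] == -1:
                labels[z] = c                 # core or rim point of cluster c
    return labels

data = [(0, 0), (0.3, 0), (0, 0.3), (0.3, 0.3), (5, 5), (5.3, 5), (5, 5.3), (20, 20)]
print(dbscan(data, eps=1.0, m=3))   # two clusters plus one noise point
```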
Evaluating Clusterings
Evaluating the result
One way is the Davies-Bouldin index
    DB = (1/k) Σ_{i=0}^{k-1} max_{j ≠ i} ( (σ_i + σ_j) / dist(c_i, c_j) )
where c_i = cent(C_i) and σ_i = (1/|C_i|) Σ_{x ∈ C_i} dist(x, c_i) is the average distance of the points in cluster C_i from its center.
This index is low if the distances within the clusters (the σ_i) are low and the distances between the clusters (the dist(c_i, c_j)) are large.
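A small sketch computing the index in Python, assuming NumPy and Euclidean distance (scikit-learn also provides sklearn.metrics.davies_bouldin_score):

```python
import numpy as np

def davies_bouldin(points, labels):
    """Davies-Bouldin index for a clustering given as integer labels."""
    X = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    ks = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ks])
    # sigma_i: average distance of the points of cluster i to its centroid
    sigma = np.array([
        np.linalg.norm(X[labels == c] - centroids[idx], axis=1).mean()
        for idx, c in enumerate(ks)
    ])
    k = len(ks)
    db = 0.0
    for i in range(k):
        db += max(
            (sigma[i] + sigma[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(k) if j != i
        )
    return db / k

data = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)]
labels = [0, 0, 0, 1, 1, 1]
print(davies_bouldin(data, labels))   # small value: tight, well-separated clusters
```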
Final remarks
Some algorithms depend on user-supplied parameters. For DBSCAN you can find some guidelines at https://en.wikipedia.org/wiki/DBSCAN.
Most clustering algorithms require finding "close-by" points (nearest neighbours). Computing the distance from one point to every other one is time consuming, O(n) per query. Computing the distances for all pairs of points beforehand and storing them requires O(n^2) space, which is already infeasible for medium-sized n. There are sophisticated data structures (e.g., Voronoi diagrams) for the nearest-neighbour problem; however, they suffer from the "curse of dimensionality" (see the sketch below).
How does one represent clusters? 1) As sets of points, 2) using an integer array C where C[i] is the number of the cluster in which x_i is located, 3) using some other data structure.
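As an illustration of such a data structure, a small sketch of neighbourhood queries via a k-d tree in Python, assuming SciPy is available (this is our example, not course code; in high dimensions these indexes degrade, as noted above):

```python
import numpy as np
from scipy.spatial import cKDTree

# Build a spatial index once, then answer neighbourhood queries without
# scanning all n points for every query (works well for low-dimensional data).
rng = np.random.default_rng(0)
points = rng.random((10_000, 2))
tree = cKDTree(points)

x = np.array([0.5, 0.5])
dist, idx = tree.query(x, k=5)             # the 5 nearest neighbours of x
ball = tree.query_ball_point(x, r=0.02)    # all points within radius 0.02 (DBSCAN-style query)
print(idx, len(ball))
```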