Cluster Analysis
• Grouping the data items into a number of sets such that the members of each set have “more in common” with each other than with the members of any other set
  – “More in common” can be defined in many ways, but some form of distance metric based on the characteristics of each data item is usual
  – Data items belonging to a cluster will be nearer to each other, in terms of this distance measure, than to data items in any other cluster
• Clustering algorithms can be divided into two types
  – Hierarchical
  – Non-hierarchical

Hierarchical Clustering
• Hierarchical clustering produces a family of alternative clusterings
• If we have n data items then we start with n clusters – this is our first clustering
• We merge the two clusters which are “closest” according to some metric to form n−1 clusters – this is our second clustering
• We continue to merge the closest pairs of clusters – producing successive clusterings – until we have just one cluster which contains all of the data items
• This process can be visualised in a dendrogram
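The merging process above can be sketched in a few lines of Python. This is an illustrative sketch, not a prescribed implementation: the function names (`euclidean`, `single_link`, `hierarchical_clusterings`) and the choice of Euclidean distance with minimum-distance merging are my own assumptions.

```python
from itertools import combinations

def euclidean(a, b):
    # distance between two data items (tuples of characteristics)
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_link(c1, c2):
    # distance between clusters: minimum over all cross-cluster item pairs
    return min(euclidean(a, b) for a in c1 for b in c2)

def hierarchical_clusterings(items, cluster_dist=single_link):
    """Yield each successive clustering, from n singletons down to 1 cluster."""
    clusters = [[item] for item in items]      # start: n one-item clusters
    yield [list(c) for c in clusters]
    while len(clusters) > 1:
        # find and merge the closest pair of clusters
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        yield [list(c) for c in clusters]

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
family = list(hierarchical_clusterings(points))
# family[0] has 4 clusters; family[-1] is a single cluster of all items
```

Each element of `family` is one of the alternative clusterings; the sequence of merges is exactly what a dendrogram draws.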
Dendrogram Example
[Figure: dendrogram showing the successive merges]

Distance Metrics for Clusters
• Clearly the distance/difference metrics we have considered so far cannot be applied directly to clusters
  – Clusters will, in general, contain more than one data item, so there will be more than one value for each characteristic within a cluster
• Common metrics for clusters include
  – Setting the distance between two clusters to be the minimum distance between any pair of data items, where one data item is in one cluster and the other is in the other cluster (single link)
  – Setting the distance between two clusters to be the maximum distance between any such pair of data items (complete link)
  – Setting the distance between two clusters to be the average of the distances between all such pairs of data items (average link)
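The three cluster metrics can be written directly from their definitions. A minimal sketch, assuming Euclidean distance between items; the function names are illustrative:

```python
def dist(a, b):
    # Euclidean distance between two data items
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def min_linkage(c1, c2):
    # minimum distance over all cross-cluster pairs (single link)
    return min(dist(a, b) for a in c1 for b in c2)

def max_linkage(c1, c2):
    # maximum distance over all cross-cluster pairs (complete link)
    return max(dist(a, b) for a in c1 for b in c2)

def avg_linkage(c1, c2):
    # average distance over all cross-cluster pairs (average link)
    pairs = [dist(a, b) for a in c1 for b in c2]
    return sum(pairs) / len(pairs)

c1 = [(0.0,), (1.0,)]
c2 = [(3.0,), (5.0,)]
# cross-pair distances are 3, 5, 2 and 4,
# so the three metrics give 2, 5 and 3.5 respectively
```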
Non-Hierarchical Clustering
• Non-hierarchical methods are many and varied, but they all produce just one clustering of the data items
  – The number of clusters to be formed is supplied as an input to the process
• Each cluster is characterised by a centroid
  – The centroid of a cluster is usually defined to be the set of average values of the characteristics of the data items in the cluster
• Initially our clusters contain no data items, so we assign default centroid values to each cluster, carefully chosen to ensure a spread across the range of possibilities

Non-Hierarchical Clustering Method
• First, each data item is assigned to a cluster based on the distance between that item and the centroids of the clusters
• After this assignment the clusters will contain actual data items and we can calculate their real centroids
• Next we re-evaluate each data item and transfer it from its current cluster to the cluster whose centroid is closest to it
  – We note that this changes the centroids of both the cluster from which the data item is removed and the cluster to which it is added
• We repeat this re-evaluation of each data item iteratively until no further transfers are required
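The iterative method above can be sketched as follows. This is a minimal illustration in the style of k-means; the caller supplies the initial ("default") centroids spread across the range, and the names `centroid` and `cluster` are my own:

```python
def dist(a, b):
    # Euclidean distance between an item and a centroid
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def centroid(cluster_items):
    # component-wise average of the items' characteristics
    n = len(cluster_items)
    return tuple(sum(xs) / n for xs in zip(*cluster_items))

def cluster(items, centroids):
    while True:
        # assign each item to the cluster with the nearest centroid
        groups = [[] for _ in centroids]
        for item in items:
            k = min(range(len(centroids)),
                    key=lambda i: dist(item, centroids[i]))
            groups[k].append(item)
        # recompute the real centroids from the assigned items
        new_centroids = [centroid(g) if g else centroids[i]
                         for i, g in enumerate(groups)]
        # stop when no transfer changed any centroid
        if new_centroids == centroids:
            return groups
        centroids = new_centroids

points = [(0.0,), (1.0,), (9.0,), (10.0,)]
groups = cluster(points, centroids=[(2.0,), (8.0,)])
# the two low-valued items end up together, as do the two high-valued ones
```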
Nearest Neighbour Methods
• Nearest neighbour methods can be used for both clustering and classification
• We form a training set of data items which are intended to be typical of a certain class/cluster of such items
• We next compute a response value for each data item, based on some function of its characteristics
• For each class/cluster we determine the average response value
• Data items can then be assigned to classes/clusters according to the response value that each generates
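The response-value scheme above can be sketched like this. The response function (here simply the sum of an item's characteristics) and the class labels are illustrative assumptions, not part of the original text:

```python
def response(item):
    # some function of the item's characteristics (an assumed choice)
    return sum(item)

def train(labelled_items):
    """labelled_items: list of (item, class_label) pairs.
    Returns the average response value for each class."""
    totals, counts = {}, {}
    for item, label in labelled_items:
        totals[label] = totals.get(label, 0.0) + response(item)
        counts[label] = counts.get(label, 0) + 1
    return {label: totals[label] / counts[label] for label in totals}

def classify(item, class_responses):
    # assign the item to the class whose average response is nearest
    r = response(item)
    return min(class_responses,
               key=lambda label: abs(class_responses[label] - r))

training = [((1.0, 1.0), "small"), ((2.0, 1.0), "small"),
            ((8.0, 9.0), "large"), ((9.0, 9.0), "large")]
averages = train(training)          # {"small": 2.5, "large": 17.5}
label = classify((7.0, 8.0), averages)   # "large"
```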