Binary Attributes: Computing GINI Index

• Splits into two partitions
• Effect of weighting partitions:
  – Larger and purer partitions are sought.

Parent: C1 = 7, C2 = 5
Gini(Parent) = 1 − (7/12)² − (5/12)² = 0.486

Split on B? (Yes → Node N1, No → Node N2):

        N1   N2
  C1     5    2
  C2     1    4

Gini(N1) = 1 − (5/6)² − (1/6)² = 0.278
Gini(N2) = 1 − (2/6)² − (4/6)² = 0.444

Weighted Gini of (N1, N2) = 6/12 × 0.278 + 6/12 × 0.444 = 0.361
Gain = 0.486 − 0.361 = 0.125
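A minimal sketch of this computation in plain Python (the function names are illustrative, not from any library):

```python
def gini(counts):
    """Gini index of a node given its per-class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split, given per-class counts of each child node."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini(c) for c in children)

parent = [7, 5]                  # C1 = 7, C2 = 5
children = [[5, 1], [2, 4]]      # N1 and N2 from the slide
print(round(gini(parent), 3))                           # 0.486
print(round(gini_split(children), 3))                   # 0.361
print(round(gini(parent) - gini_split(children), 3))    # gain = 0.125
```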
Continuous Attributes: Computing Gini Index

• Use binary decisions based on one value (e.g., Annual Income > 80K?)
• Several choices for the splitting value
  – Number of possible splitting values = number of distinct values
• Each splitting value v has a count matrix associated with it
  – Class counts in each of the partitions, A < v and A ≥ v
• Simple method to choose the best v
  – For each v, scan the database to gather the count matrix and compute its Gini index
  – Computationally inefficient! Repetition of work.

Training data:

  ID  Home Owner  Marital Status  Annual Income  Defaulted
  1   Yes         Single          125K           No
  2   No          Married         100K           No
  3   No          Single          70K            No
  4   Yes         Married         120K           No
  5   No          Divorced        95K            Yes
  6   No          Married         60K            No
  7   Yes         Divorced        220K           No
  8   No          Single          85K            Yes
  9   No          Married         75K            No
  10  No          Single          90K            Yes

Count matrix for the split Annual Income ≤ 80 vs. > 80:

        ≤ 80   > 80
  Yes     0      3
  No      3      4
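A sketch of the "simple method" above, using the ten records from the table: for each candidate value v, build the count matrix for the two partitions and compute the weighted Gini (the repeated full scans are exactly the inefficiency the slide points out):

```python
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]   # in thousands
defaulted = ['No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes']

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def class_counts(labels):
    return [labels.count('Yes'), labels.count('No')]

best_v, best_w = None, float('inf')
for v in sorted(set(incomes)):               # one candidate per distinct value
    left = [d for a, d in zip(incomes, defaulted) if a <= v]
    right = [d for a, d in zip(incomes, defaulted) if a > v]
    # weighted Gini of the two partitions A <= v and A > v
    w = (len(left) * gini(class_counts(left))
         + len(right) * gini(class_counts(right))) / len(incomes)
    if w < best_w:
        best_v, best_w = v, w

print(best_v, round(best_w, 3))   # 95 0.3: best split at Annual Income <= 95K
```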
Decision Tree Based Classification

Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Robust to noise (especially when methods to avoid overfitting are employed)
– Can easily handle redundant or irrelevant attributes (unless the attributes are interacting)

Disadvantages:
– The space of possible decision trees is exponentially large; greedy approaches are often unable to find the best tree
– Does not take into account interactions between attributes
– Each decision boundary involves only a single attribute
Handling Interactions

+ : 1000 instances, o : 1000 instances
Entropy(X) = 0.99, Entropy(Y) = 0.99

[Figure: scatter plot of the two classes in the X–Y plane]
Handling Interactions

+ : 1000 instances, o : 1000 instances
Entropy(X) = 0.99, Entropy(Y) = 0.99, Entropy(Z) = 0.98
Adding Z as a noisy attribute generated from a uniform distribution: attribute Z will be chosen for splitting!

[Figures: scatter plots of Y vs. X, Z vs. X, and Z vs. Y]
Limitations of Single Attribute-Based Decision Boundaries

Both the positive (+) and negative (o) classes are generated from skewed Gaussians, with centers at (8,8) and (12,12) respectively.
Model Overfitting
Classification Errors

• Training errors (apparent errors)
  – Errors committed on the training set
• Test errors
  – Errors committed on the test set
• Generalization errors
  – Expected error of a model over a random selection of records from the same distribution
Example Data Set

Two-class problem:
+ : 5200 instances
  • 5000 instances generated from a Gaussian centered at (10,10)
  • 200 noisy instances added
o : 5200 instances
  • Generated from a uniform distribution

10% of the data used for training and 90% of the data used for testing.
Increasing number of nodes in Decision Trees
Decision Tree with 4 nodes

[Figure: the decision tree and its decision boundaries on the training data]
Decision Tree with 50 nodes

[Figure: the decision tree and its decision boundaries on the training data]
Which tree is better?

The decision tree with 4 nodes, or the decision tree with 50 nodes?
Model Overfitting

• Underfitting: when the model is too simple, both training and test errors are large
• Overfitting: when the model is too complex, the training error is small but the test error is large
Model Overfitting: Using twice the number of data instances

• If the training data is under-representative, test errors increase and training errors decrease as the number of nodes increases
• Increasing the size of the training data reduces the difference between training and test errors at a given number of nodes
Reasons for Model Overfitting

• Presence of noise
• Lack of representative samples
• Multiple comparison procedure
Effect of Multiple Comparison Procedure

• Consider the task of predicting whether the stock market will rise or fall in the next 10 trading days

  Day 1: Up, Day 2: Down, Day 3: Down, Day 4: Up, Day 5: Down,
  Day 6: Down, Day 7: Up, Day 8: Up, Day 9: Up, Day 10: Down

• Random guessing: P(correct) = 0.5
• Make 10 random guesses in a row:

  $P(\#\text{correct} \ge 8) = \frac{\binom{10}{8} + \binom{10}{9} + \binom{10}{10}}{2^{10}} = 0.0547$
Effect of Multiple Comparison Procedure

• Approach:
  – Get 50 analysts
  – Each analyst makes 10 random guesses
  – Choose the analyst that makes the largest number of correct predictions
• Probability that at least one analyst makes at least 8 correct predictions:

  $P(\#\text{correct} \ge 8) = 1 - (1 - 0.0547)^{50} = 0.9399$
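Both probabilities follow from the binomial distribution; a quick check with the Python standard library:

```python
from math import comb

# P(at least 8 correct out of 10 fair guesses)
p_one = sum(comb(10, k) for k in (8, 9, 10)) / 2 ** 10
print(p_one)                      # 0.0546875

# P(at least one of 50 analysts gets >= 8 correct)
print(1 - (1 - p_one) ** 50)      # ~0.9399
```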
Effect of Multiple Comparison Procedure

• Many algorithms employ the following greedy strategy:
  – Initial model: M
  – Alternative model: M′ = M ∪ γ, where γ is a component to be added to the model (e.g., a test condition of a decision tree)
  – Keep M′ if the improvement Δ(M, M′) > α
• Often, γ is chosen from a set of alternative components, Γ = {γ₁, γ₂, …, γₖ}
• If many alternatives are available, one may inadvertently add irrelevant components to the model, resulting in model overfitting
Effect of Multiple Comparison: Example

Use an additional 100 noisy variables generated from a uniform distribution, along with X and Y, as attributes. Use 30% of the data for training and 70% of the data for testing.

[Figures: decision boundaries using only X and Y as attributes vs. using X, Y, and the 100 noisy variables]
Notes on Overfitting

• Overfitting results in decision trees that are more complex than necessary
• Training error does not provide a good estimate of how well the tree will perform on previously unseen records
• Need ways to incorporate model complexity into model development
Evaluating Performance of a Classifier

• Model selection
  – Performed during model building
  – Purpose is to ensure that the model is not overly complex (to avoid overfitting)
• Model evaluation
  – Performed after the model has been constructed
  – Purpose is to estimate the performance of the classifier on previously unseen data (e.g., a test set)
Methods for Classifier Evaluation

• Holdout
  – Reserve k% for training and (100−k)% for testing
• Random subsampling
  – Repeated holdout
• Cross validation
  – Partition data into k disjoint subsets
  – k-fold: train on k−1 partitions, test on the remaining one
  – Leave-one-out: k = n
• Bootstrap
  – Sampling with replacement
  – .632 bootstrap: $acc_{boot} = \frac{1}{b} \sum_{i=1}^{b} \left(0.632 \cdot acc_i + 0.368 \cdot acc_s\right)$, where acc_i is the accuracy on the i-th bootstrap sample and acc_s is the accuracy on the full training set
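A sketch of k-fold cross-validation as outlined above, assuming a model object with a fit/predict interface (the names model_factory, fit, and predict are illustrative, not from any specific library):

```python
import random

def k_fold_accuracy(model_factory, X, y, k=10, seed=0):
    """Average accuracy over k folds; leave-one-out is the case k = len(X)."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]           # k disjoint partitions
    accs = []
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        model = model_factory()                     # fresh model per fold
        model.fit([X[j] for j in train], [y[j] for j in train])
        preds = model.predict([X[j] for j in test])
        accs.append(sum(p == y[j] for p, j in zip(preds, test)) / len(test))
    return sum(accs) / k
```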
Application on Biomedical Data
Application: SNP Association Study

• Given: a patient data set containing genetic variations (SNPs) and the associated phenotype (disease).
• Objective: find a combination of genetic characteristics that best defines the phenotype under study.

              SNP 1  SNP 2  …  SNP M  Disease
  Patient 1     1      1    …    1       1
  Patient 2     0      1    …    1       1
  Patient 3     1      0    …    0       0
  …             …      …    …    …       …
  Patient N     1      1    …    1       1

Genetic variation in patients (SNPs) as a binary matrix, with survival/disease (yes/no) as the class label.
SNP (Single Nucleotide Polymorphism)

• Definition of SNP (Wikipedia):
  – A SNP is a single base change in a DNA sequence that occurs in a significant proportion (more than 1 percent) of a large population

  Individual 1   A G C G T G A T C G A G G C T A
  Individual 2   A G C G T G A T C G A G G C T A
  Individual 3   A G C G T G A G C G A G G C T A
  Individual 4   A G C G T G A T C G A G G C T A
  Individual 5   A G C G T G A T C G A G G C T A
                               ↑ SNP

• Each SNP has 3 values: (GG / GT / TT), i.e., (mm / Mm / MM)
• How many SNPs in the human genome? About 10,000,000
Why are SNPs interesting?

• In human beings, 99.9 percent of bases are the same.
• The remaining 0.1 percent makes a person unique.
  – Different attributes / characteristics / traits:
    • how a person looks,
    • diseases a person develops.
• These variations can be:
  – Harmless (change in phenotype)
  – Harmful (diabetes, cancer, heart disease, Huntington's disease, and hemophilia)
  – Latent (variations found in coding and regulatory regions that are not harmful on their own; the change in each gene only becomes apparent under certain conditions, e.g., susceptibility to lung cancer)
Issues in SNP Association Study

• In disease association studies, the number of SNPs varies from a small number (targeted study) to a million (GWA studies)
• The number of samples is usually small
• Data sets may have noise or missing values
• Phenotype definition is not trivial (e.g., the definition of survival)
• Environmental exposure, food habits, etc., add more variability even among individuals defined under the same phenotype
• Genetic heterogeneity among individuals for the same phenotype
Existing Analysis Methods

• Univariate analysis: each single SNP is tested against the phenotype for correlation and ranked.
  – Feasible, but doesn't capture the existing true combinations.
• Multivariate analysis: groups of SNPs of size two or more are tested for possible association with the phenotype.
  – Infeasible, but captures any true combinations.
• These two approaches are used to identify biomarkers.
• Some approaches employ classification methods like SVMs to classify cases and controls.
Discovering SNP Biomarkers

• Given a SNP data set of myeloma patients, find a combination of SNPs that best predicts survival.
  – 3404 SNPs selected from various regions of the chromosome
  – 70 cases (patients who survived shorter than 1 year)
  – 73 controls (patients who survived longer than 3 years)

Complexity of the problem:
• Large number of SNPs (over a million in GWA studies) and small sample size
• Complex interactions among genes may be responsible for the phenotype
• Genetic heterogeneity among individuals sharing the same phenotype (due to environmental exposure, food habits, etc.) adds more variability
• Complex phenotype definition (e.g., survival)
Discovering SNP Biomarkers: Odds Ratio

Measures whether two groups have the same odds of an event.

                       Biomarker (SNPs)
                   Has Marker   Lacks Marker
  CLASS  Case          a             b
         Control       c             d

  odds ratio = (a/b) / (c/d) = ad / bc

• OR = 1: odds of the event are equal in both groups
• OR > 1: odds of the event are higher in cases
• OR < 1: odds of the event are higher in controls
• The odds ratio is invariant to row and column scaling
P-value

• Statistical terminology for a probability value
• The probability of getting an odds ratio as extreme as the observed one by random chance
• Computed using the chi-square statistic or Fisher's exact test
  – The chi-square statistic is not valid if the number of entries in a cell of the contingency table is small
  – Fisher's exact test: a statistical test to determine whether there are nonrandom associations between two categorical variables
  – p-value = 1 − hygecdf(a − 1, a+b+c+d, a+c, a+b) when testing whether the value is higher than expected by random chance
• P-values are often expressed as the negative log of the p-value, e.g., −log10(0.005) = 2.3
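A sketch of both statistics, assuming SciPy is available. SciPy's hypergeom.sf(a−1, …) is the complement of the slide's MATLAB-style hygecdf, i.e., P(X ≥ a):

```python
from scipy.stats import hypergeom

def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)

def one_sided_p(a, b, c, d):
    M = a + b + c + d                     # all subjects
    # P(X >= a) under the hypergeometric null: the same quantity as
    # 1 - hygecdf(a - 1, a+b+c+d, a+c, a+b) on the slide
    return hypergeom.sf(a - 1, M, a + c, a + b)

a, b, c, d = 40, 30, 19, 54               # the example worked below
print(odds_ratio(a, b, c, d), one_sided_p(a, b, c, d))
```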
Discovering SNP Biomarkers: Example Tables

Three kinds of biomarkers found in the myeloma data (worked out on the following slides):
• Highest p-value, moderate odds ratio
• Highest odds ratio, moderate p-value
• Moderate odds ratio, moderate p-value
Example: Highest p-value, moderate odds ratio

                       Biomarker (SNPs)
                   Has Marker   Lacks Marker
  CLASS  Case        (a) 40       (b) 30
         Control     (c) 19       (d) 54

  Odds ratio = (a·d)/(b·c) = (40 × 54) / (30 × 19) ≈ 3.8
  P-value = 1 − hygecdf(a − 1, a+b+c+d, a+c, a+b) = 1 − hygecdf(39, 143, 59, 70)
  −log10(p-value) = 3.85
Example: Highest odds ratio, moderate p-value

                       Biomarker (SNPs)
                   Has Marker   Lacks Marker
  CLASS  Case        (a) 7        (b) 63
         Control     (c) 1        (d) 72

  Odds ratio = (a·d)/(b·c) = (7 × 72) / (63 × 1) = 8
  P-value = 1 − hygecdf(a − 1, a+b+c+d, a+c, a+b) = 1 − hygecdf(6, 143, 8, 70)
  −log10(p-value) = 1.56
Example: the same table scaled ×10

                       Biomarker (SNPs)
                   Has Marker   Lacks Marker
  CLASS  Case        (a) 70       (b) 630
         Control     (c) 10       (d) 720

  Odds ratio = (a·d)/(b·c) = (70 × 720) / (630 × 10) = 8
  P-value = 1 − hygecdf(a − 1, a+b+c+d, a+c, a+b) = 1 − hygecdf(69, 1430, 80, 700)
  −log10(p-value) = 6.56
Example: the same table scaled ×20

                       Biomarker (SNPs)
                   Has Marker   Lacks Marker
  CLASS  Case        (a) 140      (b) 1260
         Control     (c) 20       (d) 1440

  Odds ratio = (a·d)/(b·c) = (140 × 1440) / (1260 × 20) = 8
  P-value = 1 − hygecdf(a − 1, a+b+c+d, a+c, a+b) = 1 − hygecdf(139, 2860, 160, 1400)
  −log10(p-value) = 11.9
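The three tables above make the scaling point explicit: multiplying every cell by a constant leaves the odds ratio at 8 while the p-value shrinks. A short sketch that reproduces this, assuming SciPy:

```python
from math import log10
from scipy.stats import hypergeom

for s in (1, 10, 20):
    a, b, c, d = 7 * s, 63 * s, 1 * s, 72 * s
    p = hypergeom.sf(a - 1, a + b + c + d, a + c, a + b)   # P(X >= a)
    print(f"x{s}: OR = {(a * d) / (b * c):.0f}, -log10(p) = {-log10(p):.2f}")
# expected, per the slides: roughly 1.56, 6.56, and 11.9
```

Significance scales with sample size even when the effect size is fixed, which is why odds ratio and p-value must be read together.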
Issues with Traditional Methods

• Each SNP is tested and ranked individually
  – Top-ranked SNP: −log10(p-value) = 3.8; odds ratio = 3.7
• Individual SNP associations with the true phenotype are not distinguishable from a random permutation of the phenotype (Van Ness et al., 2009)

"However, most reported associations are not robust: of the 166 putative associations which have been studied three or more times, only 6 have been consistently replicated."
Evaluating the Utility of Univariate Rankings for Myeloma Data

[Figure: feature selection followed by leave-one-out cross-validation with SVM; a biased evaluation]
Evaluating the Utility of Univariate Rankings for Myeloma Data

[Figures: feature selection before leave-one-out cross-validation with SVM (biased evaluation) vs. feature selection within the cross-validation loop (clean evaluation)]
Random Permutation Test

• 10,000 random permutations of the real phenotype are generated.
• For each one, leave-one-out cross-validation using SVM.
• Accuracies larger than 65% are highly significant (p-value < 10⁻⁴).
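A sketch of this procedure, assuming scikit-learn and NumPy; here X (patients × SNPs) and y (phenotype labels) are placeholders for the myeloma data, which is not included:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

def permutation_null_accuracies(X, y, n_perm=10_000, seed=0):
    """Null distribution: LOOCV accuracy on randomly permuted phenotypes."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_perm):          # note: 10,000 LOOCV runs is expensive
        y_perm = rng.permutation(y)  # break any real SNP-phenotype link
        scores = cross_val_score(SVC(kernel='linear'), X, y_perm,
                                 cv=LeaveOneOut())
        accs.append(scores.mean())
    return np.array(accs)

# The p-value of an observed accuracy is the fraction of permutations that
# reach it; if none of the 10,000 do, then p < 1e-4.
```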
Nearest Neighbor Classifier
Nearest Neighbor Classifiers

• Basic idea:
  – If it walks like a duck and quacks like a duck, then it's probably a duck

[Figure: compute the distance from the test record to the training records, then choose the k "nearest" records]
Nearest-Neighbor Classifiers

Requires three things:
– The set of stored records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve

To classify an unknown record:
– Compute the distance to the other training records
– Identify the k nearest neighbors
– Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
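A minimal sketch of such a classifier (plain Python, Euclidean distance, majority vote; the names are illustrative):

```python
from collections import Counter
from math import dist            # Euclidean distance, Python 3.8+

def knn_predict(train_X, train_y, query, k=3):
    # sort stored records by distance from the query
    neighbors = sorted(zip(train_X, train_y), key=lambda r: dist(r[0], query))
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]        # majority class among k nearest

train_X = [(1, 1), (2, 2), (8, 8), (9, 9)]
train_y = ['a', 'a', 'b', 'b']
print(knn_predict(train_X, train_y, (2, 1), k=3))   # 'a'
```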
Nearest Neighbor Classification…

• Choosing the value of k:
  – If k is too small, the classifier is sensitive to noise points
  – If k is too large, the neighborhood may include points from other classes
Clustering
Clustering

• Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups
  – Intra-cluster distances are minimized
  – Inter-cluster distances are maximized
Applications of Clustering

• Gene expression clustering
• Clustering of patients based on phenotypic and genotypic factors for efficient disease diagnosis
• Market segmentation
• Document clustering
• Finding groups of driver behaviors based upon patterns of automobile motions (normal, drunken, sleepy, rush-hour driving, etc.)

(Courtesy: Michael Eisen)
Notion of a Cluster Can Be Ambiguous

How many clusters?

[Figure: the same set of points interpreted as two, four, or six clusters]
Similarity and Dissimilarity Measures

• Similarity measure
  – Numerical measure of how alike two data objects are
  – Higher when objects are more alike
  – Often falls in the range [0,1]
• Dissimilarity measure
  – Numerical measure of how different two data objects are
  – Lower when objects are more alike
  – Minimum dissimilarity is often 0
  – Upper limit varies
• Proximity refers to either a similarity or a dissimilarity
Euclidean Distance and Correlation

• Euclidean distance:

  $\mathrm{dist}(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$

  where n is the number of dimensions (attributes) and x_k and y_k are, respectively, the k-th attributes (components) of data objects x and y.

• Correlation:

  $\mathrm{corr}(x, y) = \frac{\mathrm{cov}(x, y)}{\mathrm{std}(x)\,\mathrm{std}(y)} = \frac{\sum_{k=1}^{n} (x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{n} (x_k - \bar{x})^2}\ \sqrt{\sum_{k=1}^{n} (y_k - \bar{y})^2}}$
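Both measures translate directly into code (plain Python, no external dependencies):

```python
from math import sqrt

def euclidean(x, y):
    return sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

def correlation(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xk - mx) * (yk - my) for xk, yk in zip(x, y))
    sx = sqrt(sum((xk - mx) ** 2 for xk in x))
    sy = sqrt(sum((yk - my) ** 2 for yk in y))
    return cov / (sx * sy)

print(euclidean([0, 0], [3, 4]))          # 5.0
print(correlation([1, 2, 3], [2, 4, 6]))  # 1.0 (perfectly correlated)
```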
Types of Clusterings

• A clustering is a set of clusters
• Important distinction between hierarchical and partitional sets of clusters
• Partitional clustering
  – A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical clustering
  – A set of nested clusters organized as a hierarchical tree
Other Distinctions Between Sets of Clusters

• Exclusive versus non-exclusive
  – In non-exclusive clusterings, points may belong to multiple clusters
  – Can represent multiple classes or 'border' points
• Fuzzy versus non-fuzzy
  – In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1
  – Weights must sum to 1
  – Probabilistic clustering has similar characteristics
• Partial versus complete
  – In some cases, we only want to cluster some of the data
• Heterogeneous versus homogeneous
  – Clusters of widely different sizes, shapes, and densities
Clustering Algorithms

• K-means and its variants
• Hierarchical clustering
• Other types of clustering
K-means Clustering

• Partitional clustering approach
• The number of clusters, K, must be specified
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• The basic algorithm is very simple
Example of K-means Clustering

[Figure: cluster assignments and centroid positions over iterations 1 through 6]
K-means Clustering: Details

• The centroid is (typically) the mean of the points in the cluster
• Initial centroids are often chosen randomly
  – The clusters produced vary from one run to another
• 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.
• Complexity is O(n × K × I × d)
  – n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
Evaluating K-means Clusters

• The most common measure is the Sum of Squared Error (SSE)
  – For each point, the error is the distance to the nearest cluster centroid
  – To get the SSE, we square these errors and sum them:

  $SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$

  – x is a data point in cluster C_i and m_i is the representative point (centroid) for cluster C_i
  – Given two sets of clusters, we prefer the one with the smallest error
  – One easy way to reduce the SSE is to increase K, the number of clusters
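A sketch of the basic K-means loop together with the SSE measure (plain Python, points as coordinate tuples, random initial centroids):

```python
import random
from math import dist

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                     # assign point to closest centroid
            i = min(range(k), key=lambda j: dist(p, centroids[j]))
            clusters[i].append(p)
        # recompute each centroid as the mean of its points
        new = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl
               else centroids[i] for i, cl in enumerate(clusters)]
        if new == centroids:                 # converged: assignments stable
            break
        centroids = new
    sse = sum(dist(p, centroids[i]) ** 2
              for i, cl in enumerate(clusters) for p in cl)
    return centroids, clusters, sse

pts = [(1, 1), (1.5, 2), (8, 8), (9, 9.5)]
centroids, clusters, sse = kmeans(pts, k=2)
print(centroids, sse)
```

Because the initial centroids are random, different seeds can converge to clusterings with different SSE, which is exactly the optimal vs. sub-optimal contrast on the next slide.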
Two Different K-means Clusterings

[Figure: the original points, an optimal clustering, and a sub-optimal clustering]
Limitations of K-means

• K-means has problems when clusters have differing
  – sizes
  – densities
  – non-globular shapes
• K-means has problems when the data contains outliers
Limitations of K-means: Differing Sizes

[Figure: original points vs. K-means with 3 clusters]

Limitations of K-means: Differing Density

[Figure: original points vs. K-means with 3 clusters]

Limitations of K-means: Non-globular Shapes

[Figure: original points vs. K-means with 2 clusters]
Hierarchical Clustering

• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
  – A tree-like diagram that records the sequences of merges or splits

[Figure: a nested clustering of six points and the corresponding dendrogram]
Strengths of Hierarchical Clustering

• Do not have to assume any particular number of clusters
  – Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level
• The clusters may correspond to meaningful taxonomies
  – Examples in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)
Hierarchical Clustering

• Two main types of hierarchical clustering
  – Agglomerative:
    • Start with the points as individual clusters
    • At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
  – Divisive:
    • Start with one, all-inclusive cluster
    • At each step, split a cluster until each cluster contains a point (or there are k clusters)
• Traditional hierarchical algorithms use a similarity or distance matrix
  – Merge or split one cluster at a time
Agglomerative Clustering Algorithm

• The more popular hierarchical clustering technique
• The basic algorithm is straightforward:
  1. Compute the proximity matrix
  2. Let each data point be a cluster
  3. Repeat
  4.   Merge the two closest clusters
  5.   Update the proximity matrix
  6. Until only a single cluster remains
• The key operation is the computation of the proximity of two clusters
  – Different approaches to defining the distance between clusters distinguish the different algorithms
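In practice one rarely writes this loop by hand; a sketch using SciPy's agglomerative implementation (assuming scipy and numpy are installed):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 10]])
Z = linkage(X, method='single')     # MIN (single link) inter-cluster distance
# cut the dendrogram to obtain 3 flat clusters
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)                       # e.g., [1 1 2 2 3]
```

Swapping method='single' for 'complete' or 'average' gives the MAX and group-average definitions of inter-cluster distance discussed below.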
Starting Situation

• Start with clusters of individual points and a proximity matrix

[Figure: individual points p1, p2, …, p12 and their proximity matrix]

Intermediate Situation

• After some merging steps, we have some clusters (C1–C5)

[Figure: five clusters and the corresponding proximity matrix]

Intermediate Situation

• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

[Figure: clusters C1–C5, with C2 and C5 about to be merged]

After Merging

• The question is: "How do we update the proximity matrix?"

[Figure: the proximity matrix with unknown entries (?) for the merged cluster C2 ∪ C5]
How to Define Inter-Cluster Distance

• MIN (single link)
• MAX (complete link)
• Group average
• Distance between centroids
• Other methods driven by an objective function
  – Ward's method uses squared error
Other Types of Clustering Algorithms

• Hundreds of clustering algorithms exist. Some of them:
  – K-means
  – Hierarchical
  – Statistically based clustering algorithms
    • Mixture-model-based clustering
  – Fuzzy clustering
  – Self-organizing maps (SOM)
  – Density-based (DBSCAN)
• The proper choice of algorithm depends on the type of clusters to be found, the type of data, and the objective
Cluster Validity

• For supervised classification, we have a variety of measures to evaluate how good our model is
  – Accuracy, precision, recall
• For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters
• But "clusters are in the eye of the beholder"!
• Then why do we want to evaluate them?
  – To avoid finding patterns in noise
  – To compare clustering algorithms
  – To compare two sets of clusters
  – To compare two clusters
Clusters Found in Random Data

[Figure: uniformly random points, and the "clusters" reported on them by DBSCAN, K-means, and complete link]