
Data Mining Techniques CS 6220 - Section 3 - Fall 2016 - Lecture 8



  1. Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 8 Jan-Willem van de Meent (credit: Yijun Zhao, Carla Brodley, Eamonn Keogh)

  2. Classification Wrap-up

  3. Classifier Comparison (figure comparing classifiers on example data sets: Nearest Neighbors, Linear SVM, RBF SVM, Random Forest, AdaBoost, Naive Bayes, QDA)

  4. Confusion Matrix
     Truth:             email    spam
     Predicted email:   57.3%    4.0%
     Predicted spam:     5.3%   33.4%

  5. Confusion Matrix
     Truth:             email               spam
     Predicted email:   57.3% (True Pos)    4.0% (False Pos)
     Predicted spam:     5.3% (False Neg)  33.4% (True Neg)
     True Positive (TP): hit (show e-mail)
     True Negative (TN): correct rejection
     False Positive (FP): false alarm, type I error
     False Negative (FN): miss, type II error
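A minimal sketch (not from the slides) of how a confusion matrix like the one above can be tallied with scikit-learn; the labels and predictions below are made up for illustration.

```python
# Hypothetical illustration: confusion matrix for an email/spam classifier.
from sklearn.metrics import confusion_matrix

y_true = ["email", "email", "spam", "spam", "email", "spam"]   # ground truth
y_pred = ["email", "spam",  "spam", "email", "email", "spam"]  # classifier output

# Rows correspond to the true class, columns to the predicted class.
cm = confusion_matrix(y_true, y_pred, labels=["email", "spam"])
print(cm)
```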

  6. Decision Theory With losses λij for predicting class i when the truth is class j (class 1 = email, class 2 = spam), the conditional risks of the two actions are
     R(α1 | x) = λ11 p(Y=1 | x) + λ12 p(Y=2 | x)
     R(α2 | x) = λ21 p(Y=1 | x) + λ22 p(Y=2 | x)
     We predict email (action α1) when R(α2 | x) > R(α1 | x), i.e.
     λ21 p(Y=1 | x) + λ22 p(Y=2 | x) > λ11 p(Y=1 | x) + λ12 p(Y=2 | x)
     (λ21 − λ11) p(Y=1 | x) > (λ12 − λ22) p(Y=2 | x)
     p(Y=1 | x) / p(Y=2 | x) > (λ12 − λ22) / (λ21 − λ11)
     where we have assumed λ(FN) > λ(TP), i.e. λ21 > λ11.
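A minimal numeric sketch of the loss-weighted decision rule derived above; the λ values below are made up to illustrate asymmetric losses, they are not from the lecture.

```python
# Hypothetical sketch of the loss-based decision rule.
lam_11, lam_12 = 0.0, 1.0   # loss of predicting email when truth is email / spam
lam_21, lam_22 = 5.0, 0.0   # loss of predicting spam  when truth is email / spam

def decide(p_email, p_spam):
    """Return the action with the smaller expected loss (conditional risk)."""
    risk_email = lam_11 * p_email + lam_12 * p_spam  # R(alpha_1 | x)
    risk_spam  = lam_21 * p_email + lam_22 * p_spam  # R(alpha_2 | x)
    return "email" if risk_email < risk_spam else "spam"

# High spam posterior, but hiding a real email is costly, so we still show it.
print(decide(p_email=0.3, p_spam=0.7))  # -> "email"
```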

  7. Precision and Recall
     PPV = TP / (TP + FP)
     TPR = TP / (TP + FN)

  8. Precision and Recall
     Precision or Positive Predictive Value (PPV): PPV = TP / (TP + FP)
     Recall or Sensitivity, True Positive Rate (TPR): TPR = TP / (TP + FN)
     F1 score, the harmonic mean of precision and recall: F1 = 2 TP / (2 TP + FP + FN)
     Specificity (SPC) or True Negative Rate (TNR): SPC = TN / (FP + TN)
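A small sketch that plugs the counts from the confusion-matrix slide into the formulas above (the percentages are treated as counts for illustration).

```python
# Metrics from the formulas above, using the email/spam confusion matrix entries.
TP, FP = 57.3, 4.0   # predicted email: truth email / truth spam
FN, TN = 5.3, 33.4   # predicted spam:  truth email / truth spam

precision   = TP / (TP + FP)               # PPV
recall      = TP / (TP + FN)               # TPR, sensitivity
f1          = 2 * TP / (2 * TP + FP + FN)  # harmonic mean of precision and recall
specificity = TN / (FP + TN)               # TNR

print(f"PPV={precision:.3f} TPR={recall:.3f} F1={f1:.3f} SPC={specificity:.3f}")
```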

  9. Precision-Recall Curve (figure: precision plotted against recall, traced out by varying the detection threshold in the decision rule p(Y=1 | x) / p(Y=2 | x) > (λ12 − λ22) / (λ21 − λ11))

  10. ROC Curve (figure: recall plotted against 1 − precision, again traced out by varying the detection threshold)

  11. ROC Curve (figure: true positive rate plotted against false positive rate, traced out by varying the detection threshold)

  12. ROC Curve (figure: true positive rate vs. false positive rate)

  13. ROC Curve (figure: true positive rate vs. false positive rate, showing the macro-averaged ROC curve)

  14. ROC Curve (figure: true positive rate vs. false positive rate, showing the micro-averaged ROC curve)
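A minimal sketch of how ROC curves like the ones on the preceding slides are traced by sweeping the detection threshold, using scikit-learn; the scores and labels below are synthetic.

```python
# Sketch: ROC curve from synthetic scores (illustration only, not the lecture's data).
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.concatenate([np.ones(100), np.zeros(100)])      # 1 = positive class
scores = np.concatenate([rng.normal(1.0, 1.0, 100),         # positives score higher
                         rng.normal(0.0, 1.0, 100)])

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one (FPR, TPR) point per threshold
print("AUC =", roc_auc_score(y_true, scores))
```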

  15. Clustering (a.k.a. unsupervised classification) with slides from Eamonn Keogh (UC Riverside)

  16. Clustering • Unsupervised learning (no labels for training) • Group data into classes that • Maximize intra-cluster similarity • Minimize inter-cluster similarity

  17. Two Types of Clustering
     Partitional: construct partitions and evaluate them using “some criterion”
     Hierarchical: create a hierarchical decomposition using “some criterion”

  18. What is a natural grouping? Choice of clustering criterion can be task-dependent (figure: Simpsons characters grouped either as Simpson's Family vs. School Employees, or as Females vs. Males)

  19. What is Similarity? Can be hard to define, but we know it when we see it.

  20. Defining Distance Measures Need: some function D(x1, x2) that represents the degree of dissimilarity (figure: the distance between “Peter” and “Piotr” could be reported as 3, 0.2, or 342.7 depending on the measure)

  21. Example: Distance Measures
     Euclidean distance: ( Σ_{i=1}^{k} (x_i − y_i)^2 )^{1/2}
     Manhattan distance: Σ_{i=1}^{k} |x_i − y_i|
     Minkowski distance: ( Σ_{i=1}^{k} |x_i − y_i|^q )^{1/q}
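A minimal NumPy sketch of the three distance measures above; note that Minkowski with q = 1 recovers Manhattan distance and q = 2 recovers Euclidean distance.

```python
# The three distance measures from the slide, for vectors x, y of length k.
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def minkowski(x, y, q):
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(euclidean(x, y), manhattan(x, y), minkowski(x, y, 2))  # q=2 matches Euclidean
```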

  22. Example: Kernels Polynomial, Radial Basis Function (RBF) / Squared Exponential (SE), Automatic Relevance Determination (ARD)
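A small sketch of the kernels named above, using their standard textbook forms; the parameter values (degree, offset, length scales) are illustrative assumptions, not taken from the slide.

```python
# Sketch of the kernels named on the slide; parameter values are illustrative.
import numpy as np

def polynomial_kernel(x, y, c=1.0, d=3):
    # Polynomial kernel: (x . y + c)^d
    return (x @ y + c) ** d

def rbf_kernel(x, y, lengthscale=1.0):
    # Squared exponential / RBF: exp(-||x - y||^2 / (2 * lengthscale^2))
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * lengthscale ** 2))

def ard_kernel(x, y, lengthscales):
    # Automatic Relevance Determination: one length scale per input dimension.
    return np.exp(-0.5 * np.sum(((x - y) / lengthscales) ** 2))

x, y = np.array([1.0, 2.0]), np.array([0.5, 1.0])
print(polynomial_kernel(x, y), rbf_kernel(x, y), ard_kernel(x, y, np.array([1.0, 2.0])))
```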

  23. Inner Product vs Distance Measure
     Inner product:
     • ⟨A, B⟩ = ⟨B, A⟩ (symmetry)
     • ⟨αA, B⟩ = α⟨A, B⟩ (linearity)
     • ⟨A, A⟩ ≥ 0, and ⟨A, A⟩ = 0 iff A = 0 (positive-definiteness)
     Distance measure:
     • D(A, B) = D(B, A) (symmetry)
     • D(A, A) = 0 (constancy of self-similarity)
     • D(A, B) = 0 iff A = B (positivity / separation)
     • D(A, B) ≤ D(A, C) + D(B, C) (triangle inequality)
     An inner product ⟨A, B⟩ induces a distance measure D(A, B) = ⟨A − B, A − B⟩^{1/2}
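A quick numeric check of the claim above: with the ordinary dot product as the inner product, the induced distance ⟨A − B, A − B⟩^{1/2} is exactly the Euclidean distance.

```python
# Check: the distance induced by the ordinary dot product is Euclidean distance.
import numpy as np

def induced_distance(a, b):
    d = a - b
    return np.sqrt(d @ d)      # <A - B, A - B>^(1/2)

a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(induced_distance(a, b), np.linalg.norm(a - b))   # identical values
```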

  24. Inner Product vs Distance Measure (same properties as the previous slide) Is the reverse also true? Why?

  25. Hierarchical Clustering

  26. Dendrogram (a.k.a. a similarity tree) The similarity of A and B, D(A, B), is represented as the height of the lowest shared internal node. Example tree (Newick notation): (Bovine: 0.69395, (Spider Monkey: 0.390, (Gibbon: 0.36079, (Orang: 0.33636, (Gorilla: 0.17147, (Chimp: 0.19268, Human: 0.11927): 0.08386): 0.06124): 0.15057): 0.54939);

  27. Dendrogram (a.k.a. a similarity tree) Natural when measuring genetic similarity as distance to a common ancestor. Same example tree: (Bovine: 0.69395, (Spider Monkey: 0.390, (Gibbon: 0.36079, (Orang: 0.33636, (Gorilla: 0.17147, (Chimp: 0.19268, Human: 0.11927): 0.08386): 0.06124): 0.15057): 0.54939);

  28. Example: Iris data Iris Setosa Iris versicolor Iris virginica https://en.wikipedia.org/wiki/Iris_flower_data_set

  29. Hierarchical Clustering (Euclidean Distance) https://en.wikipedia.org/wiki/Iris_flower_data_set
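A sketch of how a dendrogram like the one on this slide can be produced for the Iris data with SciPy; the choice of average linkage is an assumption, the slide only specifies Euclidean distance.

```python
# Hierarchical clustering of the Iris data with Euclidean distance.
from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = load_iris().data                               # 150 flowers x 4 measurements
Z = linkage(X, method="average", metric="euclidean")  # linkage choice is assumed
dendrogram(Z, no_labels=True)
plt.show()
```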

  30. Edit Distance
     Distance between Patty and Selma: change dress color (1 point), change earring shape (1 point), change hair part (1 point) → D(Patty, Selma) = 3
     Distance between Marge and Selma: change dress color (1 point), add earrings (1 point), decrease height (1 point), take up smoking (1 point), lose weight (1 point) → D(Marge, Selma) = 5
     Can be defined for any set of discrete features

  31. Edit Distance for Strings How similar are “Peter” and “Piotr”?
     • Transform string Q into string C using only Substitution, Insertion and Deletion.
     • Assume each of these operators has a cost associated with it (here: substitution 1 unit, insertion 1 unit, deletion 1 unit).
     • The similarity between two strings can be defined as the cost of the cheapest transformation from Q to C.
     Example: D(Peter, Piotr) = 3: Peter → Piter (substitute i for e) → Pioter (insert o) → Piotr (delete e)
     (figure: related names such as Pedro, Piero, Pyotr, Petros, Pietro, Pierre)
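A standard dynamic-programming sketch of the string edit distance described above, with unit costs for each operation; it reproduces D(Peter, Piotr) = 3.

```python
# Levenshtein edit distance with unit costs for substitution, insertion, deletion.
def edit_distance(q, c):
    m, n = len(q), len(c)
    # dp[i][j] = cheapest cost of transforming q[:i] into c[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                               # i deletions
    for j in range(n + 1):
        dp[0][j] = j                               # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if q[i - 1] == c[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub,  # substitution (or match)
                           dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1)        # insertion
    return dp[m][n]

print(edit_distance("Peter", "Piotr"))  # 3, as on the slide
```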

  32. Hierarchical Clustering (Edit Distance) Names clustered by edit distance: Pedro (Portuguese), Petros (Greek), Peter (English), Piotr (Polish), Peadar (Irish), Pierre (French), Peder (Danish), Peka (Hawaiian), Pietro (Italian), Piero (Italian Alternative), Petr (Czech), Pyotr (Russian); Cristovao (Portuguese), Christoph (German), Christophe (French), Cristobal (Spanish), Cristoforo (Italian), Kristoffer (Scandinavian), Krystof (Czech), Christopher (English); Miguel (Portuguese), Michalis (Greek), Michael (English), Mick (Irish) (the slide shows the resulting dendrogram)

  33. Meaningful Patterns Edit distance yields clustering according to geography Slide from Eamonn Keogh Pedro ( Portuguese/Spanish ) Petros ( Greek ), Peter ( English ), Piotr ( Polish ), Peadar (Irish), Pierre ( French ), Peder ( Danish ), Peka (Hawaiian), Pietro ( Italian ), Piero ( Italian Alternative ), Petr (Czech), Pyotr ( Russian )

  34. Spurious Patterns In general, clusterings will only be as meaningful as your distance metric. (Figure: a hierarchical clustering of national flags, including South Georgia & South Sandwich Islands, Serbia & Montenegro (Yugoslavia), St. Helena & Dependencies, U.K., Australia, Anguilla, France, Niger, India, Ireland and Brazil; one highlighted pairing is spurious, there is no connection between the two.)

  35. Spurious Patterns In general, clusterings will only be as meaningful as your distance metric. (Same flag clustering as the previous slide: one group contains former UK colonies, the other has no relation; the highlighted pairing is spurious, there is no connection between the two.)

  36. “Correct” Number of Clusters

  37. “Correct” Number of Clusters Determine the number of clusters by looking at the distances at which clusters merge

  38. Detecting Outliers The single isolated branch is suggestive of a data point that is very different from all others (Outlier)

  39. Bottom-up vs Top-down
     The number of dendrograms with n leaves = (2n − 3)! / [2^(n − 2) (n − 2)!]
     Leaves: 2 → 1 dendrogram, 3 → 3, 4 → 15, 5 → 105, ..., 10 → 34,459,425
     Since we cannot test all possible trees, we will have to do a heuristic search over possible trees. We could do this:
     Bottom-Up (agglomerative): starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
     Top-Down (divisive): starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.
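A quick check of the dendrogram-count formula above, reproducing the table values from the slide (e.g. 34,459,425 dendrograms for 10 leaves).

```python
# Number of dendrograms with n leaves = (2n - 3)! / (2^(n - 2) * (n - 2)!)
from math import factorial

def num_dendrograms(n):
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in [2, 3, 4, 5, 10]:
    print(n, num_dendrograms(n))   # 1, 3, 15, 105, 34459425
```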

  40. Distance Matrix We begin with a distance matrix which contains the distances between every pair of objects in our database. Upper triangle of the example matrix:
     0  8  8  7  7
        0  2  4  4
           0  3  3
              0  1
                 0
     (the slide reads off two entries, D( , ) = 8 and D( , ) = 1, for pictured objects)

  41. Bottom-up (Agglomerative Clustering) Consider all possible merges… choose the best… and repeat.
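A minimal sketch of this bottom-up procedure over a precomputed distance matrix, using the 5x5 matrix from the previous slide (filled in symmetrically); single linkage is an assumed choice for scoring merges, the slide does not specify one.

```python
# Naive agglomerative clustering over the example distance matrix,
# repeatedly merging the closest pair of clusters (single linkage assumed).
import numpy as np

D = np.array([[0, 8, 8, 7, 7],
              [8, 0, 2, 4, 4],
              [8, 2, 0, 3, 3],
              [7, 4, 3, 0, 1],
              [7, 4, 3, 1, 0]], dtype=float)

clusters = [[i] for i in range(len(D))]          # start: each item in its own cluster
while len(clusters) > 1:
    # Consider all possible merges and choose the pair with the smallest
    # single-linkage distance (minimum pairwise distance between members).
    a, b = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
               key=lambda ij: min(D[p, q] for p in clusters[ij[0]] for q in clusters[ij[1]]))
    print("merge", clusters[a], clusters[b])
    clusters[a] = clusters[a] + clusters[b]      # fuse the best pair
    del clusters[b]
```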
