CISC 4631 Data Mining
Lecture 03:
• Introduction to classification
• Linear classifier
These slides are based on the slides by
• Tan, Steinbach and Kumar (textbook authors)
• Eamonn Keogh (UC Riverside)
Classification: Definition
• Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
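The split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `train_test_split` and `accuracy`, the 30% test fraction, the fixed seed, and the toy records are all hypothetical and do not come from the slides.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and return (training set, test set)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(model, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(1 for attributes, label in test_set if model(attributes) == label)
    return correct / len(test_set)

# Hypothetical toy records: (attributes, class)
records = [((i,), "Yes" if i % 2 else "No") for i in range(10)]
train, test = train_test_split(records)

# A stand-in "model"; in practice it would be built from `train` only.
def parity_model(attributes):
    return "Yes" if attributes[0] % 2 else "No"

print(len(train), len(test), accuracy(parity_model, test))
```

The model is evaluated only on records it was not built from, which is the point of holding out a test set.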
Illustrating Classification Task
Training Set (Induction: the learning algorithm learns a model)
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes
Test Set (Deduction: the model is applied)
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Memory based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
• We will start with a simple linear classifier
The Classification Problem (informal definition)
Given a collection of annotated data (in this case, five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.
Katydid or Grasshopper?
For any domain of interest, we can measure features
• Color {Green, Brown, Gray, Other}
• Has Wings?
• Abdomen Length
• Thorax Length
• Antennae Length
• Mandible Size
• Spiracle Diameter
• Leg Length
We can store features in a database.
My_Collection
Insect ID  Abdomen Length  Antennae Length  Insect Class
1          2.7             5.5              Grasshopper
2          8.0             9.1              Katydid
3          0.9             4.7              Grasshopper
4          1.1             3.1              Grasshopper
5          5.4             8.5              Katydid
6          2.9             1.9              Grasshopper
7          6.1             6.6              Katydid
8          0.5             1.0              Grasshopper
9          8.3             6.6              Katydid
10         8.1             4.7              Katydid
The classification problem can now be expressed as:
• Given a training database (My_Collection), predict the class label of a previously unseen instance
previously unseen instance =
11         5.1             7.0              ???????
[Figure: scatter plot of Grasshoppers and Katydids; x-axis Abdomen Length, y-axis Antenna Length, both from 1 to 10]
[Figure: scatter plot of Grasshoppers and Katydids; x-axis Abdomen Length, y-axis Antenna Length, both from 1 to 10]
Each of these data objects is called…
• exemplars
• (training) examples
• instances
• tuples
We will return to the previous slide in two minutes. In the meantime, we are going to play a quick game.
Problem 1
Examples of class A (left bar, right bar): (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B (left bar, right bar): (5, 2.5), (5, 2), (8, 3), (4.5, 3)
Problem 1
Examples of class A (left bar, right bar): (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B (left bar, right bar): (5, 2.5), (5, 2), (8, 3), (4.5, 3)
What class is this object? (8, 1.5)
What about this one, A or B? (4.5, 7)
Problem 2
Examples of class A (left bar, right bar): (4, 4), (5, 5), (6, 6), (3, 3)
Examples of class B (left bar, right bar): (5, 2.5), (2, 5), (5, 3), (2.5, 3)
Oh! This one's hard! (8, 1.5)
Problem 3
Examples of class A (left bar, right bar): (4, 4), (1, 5), (6, 3), (3, 7)
Examples of class B (left bar, right bar): (5, 6), (7, 5), (4, 8), (7, 7)
This one is really hard! What is this, A or B? (6, 6)
Why did we spend so much time with this game?
Because we wanted to show that almost all classification problems have a geometric interpretation. Check out the next 3 slides…
Problem 1
[Figure: scatter plot of the examples; x-axis Right Bar, y-axis Left Bar, both from 1 to 10]
Examples of class A (left bar, right bar): (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B (left bar, right bar): (5, 2.5), (5, 2), (8, 3), (4.5, 3)
Here is the rule again. If the left bar is smaller than the right bar, it is an A, otherwise it is a B.
Problem 2
[Figure: scatter plot of the examples; x-axis Right Bar, y-axis Left Bar, both from 1 to 10]
Examples of class A (left bar, right bar): (4, 4), (5, 5), (6, 6), (3, 3)
Examples of class B (left bar, right bar): (5, 2.5), (2, 5), (5, 3), (2.5, 3)
Let me look it up… here it is… the rule is, if the two bars are equal sizes, it is an A. Otherwise it is a B.
Problem 3
[Figure: scatter plot of the examples; x-axis Right Bar, y-axis Left Bar, both from 10 to 100]
Examples of class A (left bar, right bar): (4, 4), (1, 5), (6, 3), (3, 7)
Examples of class B (left bar, right bar): (5, 6), (7, 5), (4, 8), (7, 7)
The rule again: if the square of the sum of the two bars is less than or equal to 100, it is an A. Otherwise it is a B.
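The three rules from the game can be written as small Python functions. This is a minimal sketch: the function names are hypothetical, and each object is assumed to be given as (left bar, right bar), which is how the numbers appear to be ordered on the slides.

```python
def rule_problem_1(left, right):
    """Problem 1: A if the left bar is smaller than the right bar, else B."""
    return "A" if left < right else "B"

def rule_problem_2(left, right):
    """Problem 2: A if the two bars are equal sizes, else B."""
    return "A" if left == right else "B"

def rule_problem_3(left, right):
    """Problem 3: A if the square of the sum of the bars is at most 100, else B."""
    return "A" if (left + right) ** 2 <= 100 else "B"

# The unlabeled objects the slides ask about, as (left bar, right bar).
answers = {
    "Problem 1, (8, 1.5)": rule_problem_1(8, 1.5),
    "Problem 1, (4.5, 7)": rule_problem_1(4.5, 7),
    "Problem 2, (8, 1.5)": rule_problem_2(8, 1.5),
    "Problem 3, (6, 6)": rule_problem_3(6, 6),
}
for question, label in answers.items():
    print(question, "->", label)
```

Each rule compares a simple function of the two bar heights with a threshold, which is why each problem can be drawn as regions in the Left Bar / Right Bar plane.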
[Figure: scatter plot of Grasshoppers and Katydids; x-axis Abdomen Length, y-axis Antenna Length, both from 1 to 10]
previously unseen instance = 11  5.1  7.0  ???????
• We can “project” the previously unseen instance into the same space as the database.
• We have now abstracted away the details of our particular problem. It will be much easier to talk about points in space.
[Figure: scatter plot of Katydids and Grasshoppers with the unseen instance added; x-axis Abdomen Length, y-axis Antenna Length, both from 1 to 10]
Simple Linear Classifier
R.A. Fisher (1890-1962)
[Figure: scatter plot of Katydids and Grasshoppers with a separating line; both axes from 1 to 10]
If the previously unseen instance is above the line
then class is Katydid
else class is Grasshopper
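The "above the line" rule can be sketched in Python using the My_Collection values. The slides do not say how the line is chosen, so this sketch assumes one simple option: the perpendicular bisector of the segment joining the two class centroids. That choice is an assumption for illustration, not Fisher's method, and the names `fit_line` and `classify` are hypothetical.

```python
# (abdomen length, antennae length, class) rows from My_Collection
my_collection = [
    (2.7, 5.5, "Grasshopper"), (8.0, 9.1, "Katydid"),
    (0.9, 4.7, "Grasshopper"), (1.1, 3.1, "Grasshopper"),
    (5.4, 8.5, "Katydid"),     (2.9, 1.9, "Grasshopper"),
    (6.1, 6.6, "Katydid"),     (0.5, 1.0, "Grasshopper"),
    (8.3, 6.6, "Katydid"),     (8.1, 4.7, "Katydid"),
]

def centroid(points):
    """Mean point of a list of (x, y) pairs."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def fit_line(rows):
    """Return (w, b) so that w[0]*x + w[1]*y + b = 0 bisects the two centroids."""
    mk = centroid([(x, y) for x, y, c in rows if c == "Katydid"])
    mg = centroid([(x, y) for x, y, c in rows if c == "Grasshopper"])
    w = (mk[0] - mg[0], mk[1] - mg[1])              # points toward the Katydids
    mid = ((mk[0] + mg[0]) / 2, (mk[1] + mg[1]) / 2)
    b = -(w[0] * mid[0] + w[1] * mid[1])            # line passes through mid
    return w, b

def classify(w, b, abdomen, antennae):
    """Katydid on the positive side of the line, Grasshopper otherwise."""
    return "Katydid" if w[0] * abdomen + w[1] * antennae + b > 0 else "Grasshopper"

w, b = fit_line(my_collection)
print(classify(w, b, 5.1, 7.0))  # the previously unseen instance (ID 11)
```

For this data the second weight is positive, so the positive side of the line is the side above it, matching the wording on the slide.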
Classification Accuracy
                                       Predicted class
                                       Class = Katydid (1)   Class = Grasshopper (0)
Actual class  Class = Katydid (1)      f_{11}                f_{10}
              Class = Grasshopper (0)  f_{01}                f_{00}
\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}
\text{Error rate} = \frac{\text{Number of wrong predictions}}{\text{Total number of predictions}} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}}
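The two formulas translate directly into code. In this sketch the function names and the example counts are hypothetical; only the formulas themselves come from the slide.

```python
def accuracy_from_counts(f11, f10, f01, f00):
    """(f11 + f00) / (f11 + f10 + f01 + f00): fraction of correct predictions."""
    total = f11 + f10 + f01 + f00
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (f11 + f00) / total

def error_rate_from_counts(f11, f10, f01, f00):
    """(f10 + f01) / (f11 + f10 + f01 + f00): fraction of wrong predictions."""
    total = f11 + f10 + f01 + f00
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (f10 + f01) / total

# Hypothetical counts: 4 Katydids and 4 Grasshoppers predicted correctly,
# 1 Katydid predicted as Grasshopper, 1 Grasshopper predicted as Katydid.
acc = accuracy_from_counts(f11=4, f10=1, f01=1, f00=4)
err = error_rate_from_counts(f11=4, f10=1, f01=1, f00=4)
print(acc, err)
```

Accuracy and error rate always sum to 1, since every prediction is either correct or wrong.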
Confusion Matrix
• In a binary decision problem, a classifier labels examples as either positive or negative.
• Classifiers produce a confusion (contingency) matrix, which shows four entries: TP (true positive), TN (true negative), FP (false positive), FN (false negative)
Confusion Matrix
                          Actual positive (+)   Actual negative (-)
Predicted positive (Y)    TP                    FP
Predicted negative (N)    FN                    TN
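The four entries can be counted from paired lists of actual and predicted labels. This is a minimal sketch; the function name `confusion_counts` and the example labels are hypothetical, and Katydid is arbitrarily treated as the positive class.

```python
from collections import Counter

def confusion_counts(actual, predicted, positive):
    """Count TP, FP, FN and TN for a binary problem with the given positive label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return {key: counts[key] for key in ("TP", "FP", "FN", "TN")}

# Hypothetical labels for five insects
actual = ["Katydid", "Katydid", "Grasshopper", "Grasshopper", "Katydid"]
predicted = ["Katydid", "Grasshopper", "Grasshopper", "Katydid", "Katydid"]
matrix = confusion_counts(actual, predicted, positive="Katydid")
print(matrix)
```

Accuracy is then (TP + TN) divided by the sum of all four entries.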
The simple linear classifier is defined for higher dimensional spaces…
… we can visualize it as being a hyperplane in n-dimensional space
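One common way to write the same rule in n dimensions is shown below. The symbols are assumptions introduced here for illustration and do not appear on the slides: x is the feature vector of the unseen instance, w is a weight vector normal to the hyperplane, and b is an offset.

```latex
\[
\text{class}(\mathbf{x}) =
\begin{cases}
\text{Katydid} & \text{if } \mathbf{w}\cdot\mathbf{x} + b > 0,\\
\text{Grasshopper} & \text{otherwise,}
\end{cases}
\qquad
\mathbf{w}\cdot\mathbf{x} + b = \sum_{i=1}^{n} w_i x_i + b .
\]
```

The decision boundary is the set of points where w·x + b = 0. With n = 2 this is a line in the plane, and "above the line" corresponds to the positive side when the second weight is positive.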
It is interesting to think about what would happen in this example if we did not have the 3rd dimension…
We can no longer get perfect accuracy with the simple linear classifier… We could try to solve this problem by using a simple quadratic classifier or a simple cubic classifier. However, as we will later see, this is probably a bad idea…
Which of the “Problems” can be solved by the Simple Linear Classifier?
1) Perfect
2) Useless
3) Pretty Good
Problems that can be solved by a linear classifier are called linearly separable.
[Figure: scatter plots of Problems 1, 2 and 3; axes from 1 to 10 for two of the plots and from 10 to 100 for the third]
A Famous Problem
R. A. Fisher’s Iris Dataset.
• 3 classes
• 50 of each class
The task is to classify Iris plants into one of 3 varieties using the Petal Length and Petal Width.
[Figure: Iris Setosa, Iris Versicolor, Iris Virginica]