CISC 4631 Data Mining Lecture 04: • Decision Trees These slides are based on the slides by • Tan, Steinbach and Kumar (textbook authors) • Eamonn Keogh (UC Riverside) • Raymond Mooney (UT Austin) 1
Classification: Definition • Given a collection of records (training set) – Each record contains a set of attributes; one of the attributes is the class. • Find a model for the class attribute as a function of the values of the other attributes. • Goal: previously unseen records should be assigned a class as accurately as possible. – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it. 2
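A minimal sketch of this train/test workflow, assuming scikit-learn is available; the iris dataset and the 70/30 split are illustrative choices, not part of the slides.

```python
# Illustrative only: build a model on a training set, then measure its
# accuracy on a held-out test set of previously unseen records.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                     # any labeled data set works here

# Divide the labeled records into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # build the model (induction)
y_pred = model.predict(X_test)                            # apply it to unseen records
print("Test accuracy:", accuracy_score(y_test, y_pred))
```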
Illustrating Classification Task [Figure: a training set of records (Tid, Attrib1, Attrib2, Attrib3, Class) is fed to a learning algorithm, which learns a model (induction); the model is then applied to a test set whose class labels are unknown (deduction).] 3
Classification Techniques • Decision Tree based Methods • Rule-based Methods • Memory-based Reasoning • Neural Networks • Naïve Bayes and Bayesian Belief Networks • Support Vector Machines 4
Example of a Decision Tree [Figure: training data with attributes Refund, Marital Status (MarSt), Taxable Income (TaxInc) and class label Cheat, next to the induced model. Splitting attributes: the root tests Refund (Yes → leaf NO; No → test MarSt); MarSt = Married → leaf NO; MarSt = Single or Divorced → test TaxInc (< 80K → leaf NO, > 80K → leaf YES).] Model: Decision Tree. 5
Another Example of Decision Tree [Figure: the same training data, but a different tree: the root tests MarSt (Married → leaf NO; Single or Divorced → test Refund); Refund = Yes → leaf NO; Refund = No → test TaxInc (< 80K → NO, > 80K → YES).] There could be more than one tree that fits the same data! 6
Decision Tree Classification Task [Figure: the same induction/deduction workflow, now with the model shown as a decision tree: a tree induction algorithm learns a decision tree from the training set, and that tree is then applied to the test set.] 7
Apply Model to Test Data [Figure sequence, slides 8–13: the test record (Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?) is pushed through the tree step by step. Start from the root of the tree: Refund = No, so follow the No branch to MarSt; Marital Status = Married, so follow the Married branch to the leaf NO; assign Cheat to "No".]
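A small sketch, not taken from the slides, that encodes the example tree as plain Python and traverses it for this test record; the function name and dictionary keys are illustrative.

```python
def classify(record):
    # Walk the Refund -> MarSt -> TaxInc tree from the root down to a leaf.
    if record["Refund"] == "Yes":
        return "No"                                   # leaf NO: does not cheat
    if record["Marital Status"] == "Married":
        return "No"                                   # leaf NO
    # Single or Divorced: test taxable income (threshold 80K from the slide)
    return "No" if record["Taxable Income"] < 80_000 else "Yes"

test_record = {"Refund": "No", "Marital Status": "Married", "Taxable Income": 80_000}
print(classify(test_record))                          # -> "No", i.e. assign Cheat = No
```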
Decision Tree Terminology 14
Decision Tree Classification Task [Figure: repeat of the induction/deduction workflow from slide 7, shown again before discussing how the tree itself is induced.] 15
Decision Tree Induction • Many algorithms: – Hunt’s Algorithm (one of the earliest) – CART – ID3, C4.5 – SLIQ, SPRINT • John Ross Quinlan is a computer science researcher in data mining and decision theory. He has contributed extensively to the development of decision tree algorithms, including inventing the canonical ID3 and C4.5 algorithms. 16
Decision Tree Classifier [Figure: Ross Quinlan; a scatter plot of insects with Abdomen Length on the x-axis and Antenna Length on the y-axis (both 1–10), and the corresponding tree: Abdomen Length > 7.1? yes → Katydid; no → Antenna Length > 6.0? yes → Katydid; no → Grasshopper.] 17
Decision trees predate computers. [Figure: a field key for insects drawn as a decision tree, using tests such as "Antennae shorter than body?", "3 Tarsi?", and "Foretiba has ears?" to reach the leaves Grasshopper, Cricket, Katydids, and Camel Cricket.] 18
Definition A decision tree is a classifier in the form of a tree structure – Decision node: specifies a test on a single attribute – Leaf node: indicates the value of the target attribute – Arc/edge: one outcome of the split on an attribute – Path: a conjunction of attribute tests that leads to the final decision Decision trees classify instances or examples by starting at the root of the tree and moving through it until a leaf node is reached. 19
Decision Tree Classification • Decision tree generation consists of two phases – Tree construction • At start, all the training examples are at the root • Partition examples recursively based on selected attributes – Tree pruning • Identify and remove branches that reflect noise or outliers • Use of decision tree: Classifying an unknown sample – Test the attribute values of the sample against the decision tree 20
Decision Tree Representation • Each internal node tests an attribute • Each branch corresponds to an attribute value • Each leaf node assigns a classification [Figure: the "play tennis" tree: outlook = sunny → test humidity (high → no, normal → yes); outlook = overcast → yes; outlook = rain → test wind (strong → no, weak → yes).] 21
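A minimal sketch of this representation as nested Python dictionaries, assuming the tree drawn on the slide; internal nodes map an attribute to its value branches, and leaves are plain class labels.

```python
# The "play tennis" tree from the slide, written as nested dictionaries:
# each internal node tests one attribute, each key under it is a branch
# for one attribute value, and each leaf is a class label.
tree = {
    "outlook": {
        "sunny":    {"humidity": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rain":     {"wind": {"strong": "no", "weak": "yes"}},
    }
}

def predict(node, example):
    # Follow attribute-value branches until a leaf (a plain string) is reached.
    while isinstance(node, dict):
        attribute = next(iter(node))                 # the attribute tested at this node
        node = node[attribute][example[attribute]]   # take the branch for its value
    return node

print(predict(tree, {"outlook": "sunny", "humidity": "high", "wind": "weak"}))  # -> "no"
```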
How do we construct the decision tree? • Basic algorithm (a greedy algorithm) – Tree is constructed in a top-down recursive divide-and-conquer manner – At start, all the training examples are at the root – Attributes are categorical (if continuous-valued, they can be discretized in advance) – Examples are partitioned recursively based on selected attributes. – Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) • Conditions for stopping partitioning – All samples for a given node belong to the same class – There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf – There are no samples left 22
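A compact sketch of the greedy, top-down, divide-and-conquer procedure just described, for categorical attributes; choose_best_attribute stands in for whatever heuristic is used (e.g. information gain) and is assumed rather than defined here.

```python
from collections import Counter

def build_tree(examples, attributes, choose_best_attribute, default="?"):
    """examples: list of (attribute_dict, class_label) pairs; attributes: list of names."""
    if not examples:                                    # no samples left
        return default
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:                           # all samples belong to the same class
        return labels[0]
    if not attributes:                                  # no attributes left: majority voting
        return Counter(labels).most_common(1)[0][0]
    best = choose_best_attribute(examples, attributes)  # heuristic / statistical measure
    tree = {best: {}}
    for value in {attrs[best] for attrs, _ in examples}:
        subset = [(a, c) for a, c in examples if a[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, remaining, choose_best_attribute, default)
    return tree
```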
Top-Down Decision Tree Induction • Main loop: 1. A ← the “best” decision attribute for the next node 2. Assign A as the decision attribute for the node 3. For each value of A, create a new descendant of the node 4. Sort the training examples to the leaf nodes 5. If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes 23
Tree Induction • Greedy strategy – Split the records based on an attribute test that optimizes a certain criterion. • Issues – Determine how to split the records • How to specify the attribute test condition? • How to determine the best split? – Determine when to stop splitting 24
How To Split Records • Random split – The tree can grow huge – Such trees are hard to understand – Larger trees are typically less accurate than smaller trees • Principled criterion – Selection of an attribute to test at each node: choosing the most useful attribute for classifying examples – How? – Information gain • Measures how well a given attribute separates the training examples according to their target classification • This measure is used to select among the candidate attributes at each step while growing the tree 25
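A hedged sketch of information gain with entropy, in the same (attribute_dict, class_label) representation assumed above; the formulas are the standard ones, not code from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    # Entropy of a class distribution: -sum of p_i * log2(p_i).
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute):
    # Entropy before the split minus the weighted entropy of the children.
    labels = [label for _, label in examples]
    before = entropy(labels)
    after = 0.0
    for value in {attrs[attribute] for attrs, _ in examples}:
        child = [label for attrs, label in examples if attrs[attribute] == value]
        after += (len(child) / len(examples)) * entropy(child)
    return before - after
```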
Tree Induction • Greedy strategy: – Split the records based on an attribute test that optimizes a certain criterion. – Hunt’s algorithm: recursively partition training records into successively purer subsets. How to measure purity/impurity? • Entropy and information gain (covered in the lecture slides) • Gini index (covered in the textbook) • Classification error 26
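A small sketch, not from the slides, that computes all three impurity measures from the class labels at a node, using the standard textbook formulas.

```python
import math
from collections import Counter

def impurities(labels):
    # Class proportions p_i at this node.
    total = len(labels)
    p = [count / total for count in Counter(labels).values()]
    entropy = -sum(pi * math.log2(pi) for pi in p if pi > 0)
    gini = 1.0 - sum(pi * pi for pi in p)
    classification_error = 1.0 - max(p)
    return entropy, gini, classification_error

# Example: a node holding 7 "No" records and 3 "Yes" records.
print(impurities(["No"] * 7 + ["Yes"] * 3))   # ≈ (0.881, 0.420, 0.300)
```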