

  1. CS145: INTRODUCTION TO DATA MINING
     4: Vector Data: Decision Tree
     Instructor: Yizhou Sun (yzsun@cs.ucla.edu)
     October 10, 2017

  2. Methods to Learn
     • Classification: Logistic Regression; Decision Tree; KNN; SVM; NN (vector data); Naïve Bayes for Text (text data)
     • Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models (vector data); PLSA (text data)
     • Prediction: Linear Regression; GLM* (vector data)
     • Frequent Pattern Mining: Apriori; FP growth (set data); GSP; PrefixSpan (sequence data)
     • Similarity Search: DTW (sequence data)

  3. Vector Data: Trees
     • Tree-based Prediction and Classification
     • Classification Trees
     • Regression Trees*
     • Random Forest
     • Summary

  4. Tree-based Models
     • Use trees to partition the data into different regions and make predictions
     • [Figure: decision tree. Root node: age? with branches <=30, 31..40, >40. Internal nodes: student? (no/yes) and credit rating? (excellent/fair). Leaf nodes: yes/no predictions]

  5. Easy to Interpret
     • A path from root to a leaf node corresponds to a rule
     • E.g., if age<=30 and student=no then target value=no
     • [Figure: decision tree with root age? (<=30, 31..40, >40), internal nodes student? and credit rating?, and yes/no leaves]

  6. Vector Data: Trees
     • Tree-based Prediction and Classification
     • Classification Trees
     • Regression Trees*
     • Random Forest
     • Summary

  7. Decision Tree Induction: An Example
     • Training data set: Buys_xbox
     • The data set follows an example of Quinlan's ID3 (Playing Tennis)
     • Resulting tree: [Figure: root age?; the <=30 branch leads to student? (no: no, yes: yes), the 31..40 branch is a "yes" leaf, the >40 branch leads to credit rating? (excellent: no, fair: yes)]
     age      income   student  credit_rating  buys_xbox
     <=30     high     no       fair           no
     <=30     high     no       excellent      no
     31…40    high     no       fair           yes
     >40      medium   no       fair           yes
     >40      low      yes      fair           yes
     >40      low      yes      excellent      no
     31…40    low      yes      excellent      yes
     <=30     medium   no       fair           no
     <=30     low      yes      fair           yes
     >40      medium   yes      fair           yes
     <=30     medium   yes      excellent      yes
     31…40    medium   no       excellent      yes
     31…40    high     yes      fair           yes
     >40      medium   no       excellent      no

  8. How to choose attributes?
     • [Figure: two candidate splits of the 14 training tuples. Ages: <=30 (2 yes, 3 no), 31…40 (4 yes, 0 no), >40 (3 yes, 2 no). VS. Credit_Rating: Excellent (3 yes, 3 no), Fair (6 yes, 2 no)]
     • Q: Which attribute is better for the classification task?

  9. Brief Review of Entropy
     • Entropy (Information Theory): a measure of uncertainty (impurity) associated with a random variable
     • Calculation: for a discrete random variable Y taking m distinct values {y_1, ..., y_m},
       H(Y) = - Σ_{i=1}^{m} p_i log(p_i), where p_i = P(Y = y_i)
     • Interpretation:
       • Higher entropy => higher uncertainty
       • Lower entropy => lower uncertainty
     • [Figure: entropy of a binary variable (m = 2) as a function of the class probability]
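To make the definition concrete, here is a minimal Python sketch of the entropy calculation; the function name `entropy` and the choice of base-2 logarithms (bits) are illustrative, not prescribed by the slide.

```python
import math

def entropy(probs):
    """Entropy H = -sum(p * log2(p)) of a discrete distribution given as probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Binary variable (m = 2): entropy is highest for a 50/50 split and drops
# toward 0 as the outcome becomes more certain.
print(entropy([0.5, 0.5]))   # 1.0    (maximum uncertainty)
print(entropy([0.9, 0.1]))   # ~0.469
print(entropy([1.0, 0.0]))   # 0.0    (no uncertainty)
```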

  10. Conditional Entropy
     • How much uncertainty about Y remains if we know an attribute X?
     • H(Y|X) = Σ_x P(X = x) H(Y|X = x)
     • Weighted average of entropy at each branch!
     • [Figure: the 14 tuples split by Ages into <=30 (2 yes, 3 no), 31…40 (4 yes, 0 no), >40 (3 yes, 2 no)]
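A small sketch (not from the slides) computing the conditional entropy as the weighted average of branch entropies, using the yes/no counts of the age split shown in the figure; the helper names are illustrative.

```python
import math

def entropy_from_counts(counts):
    """Entropy of a node given class counts, e.g. (2, 3) for 2 yes / 3 no."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def conditional_entropy(branches):
    """H(Y|X): weighted average of branch entropies, weights = branch sizes."""
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * entropy_from_counts(b) for b in branches)

# Age split from the figure: <=30 -> (2 yes, 3 no), 31..40 -> (4, 0), >40 -> (3, 2)
print(conditional_entropy([(2, 3), (4, 0), (3, 2)]))  # ~0.694
```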

  11. Attribute Selection Measure: Information Gain (ID3/C4.5)
     • Select the attribute with the highest information gain
     • Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|
     • Expected information (entropy) needed to classify a tuple in D:
       Info(D) = - Σ_{i=1}^{m} p_i log_2(p_i)
     • Information needed (after using A to split D into v partitions) to classify D (conditional entropy):
       Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)
     • Information gained by branching on attribute A:
       Gain(A) = Info(D) - Info_A(D)
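The formulas above translate directly into code. This is a minimal sketch assuming the data is given as a list of class labels plus a parallel list of attribute values; the helper names `info` and `info_gain` and the list-based layout are illustrative, not from the slides.

```python
import math
from collections import Counter

def info(labels):
    """Info(D) = -sum p_i log2 p_i over the class distribution of the labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(labels, attr_values):
    """Gain(A) = Info(D) - Info_A(D) for one attribute column parallel to labels."""
    n = len(labels)
    partitions = {}
    for a, y in zip(attr_values, labels):
        partitions.setdefault(a, []).append(y)
    info_a = sum(len(part) / n * info(part) for part in partitions.values())
    return info(labels) - info_a
```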

  12. Attribute Selection: Information Gain
     • Class P: buys_xbox = "yes"; Class N: buys_xbox = "no" (training data: see slide 7)
     • Info(D) = I(9, 5) = - (9/14) log_2(9/14) - (5/14) log_2(5/14) = 0.940
     • Class distribution for each age branch:
       age      p_i   n_i   I(p_i, n_i)
       <=30     2     3     0.971
       31…40    4     0     0
       >40      3     2     0.971
     • Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
       ((5/14) I(2,3) means "age <=30" has 5 out of 14 samples, with 2 yes'es and 3 no's)
     • Hence Gain(age) = Info(D) - Info_age(D) = 0.246
     • Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048
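Purely as a sanity check (not part of the lecture), the gains above can be recomputed from the slide-7 table; the snippet re-implements the entropy helper inline so it runs on its own.

```python
import math
from collections import Counter

# Training table from slide 7: (age, income, student, credit_rating, buys_xbox)
data = [
    ("<=30",   "high",   "no",  "fair",      "no"),
    ("<=30",   "high",   "no",  "excellent", "no"),
    ("31..40", "high",   "no",  "fair",      "yes"),
    (">40",    "medium", "no",  "fair",      "yes"),
    (">40",    "low",    "yes", "fair",      "yes"),
    (">40",    "low",    "yes", "excellent", "no"),
    ("31..40", "low",    "yes", "excellent", "yes"),
    ("<=30",   "medium", "no",  "fair",      "no"),
    ("<=30",   "low",    "yes", "fair",      "yes"),
    (">40",    "medium", "yes", "fair",      "yes"),
    ("<=30",   "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no",  "excellent", "yes"),
    ("31..40", "high",   "yes", "fair",      "yes"),
    (">40",    "medium", "no",  "excellent", "no"),
]
attrs = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}
labels = [row[-1] for row in data]

def info(ys):
    n = len(ys)
    return -sum(c / n * math.log2(c / n) for c in Counter(ys).values())

def gain(attr):
    col = attrs[attr]
    parts = {}
    for row in data:
        parts.setdefault(row[col], []).append(row[-1])
    info_a = sum(len(p) / len(data) * info(p) for p in parts.values())
    return info(labels) - info_a

for a in attrs:
    print(a, round(gain(a), 3))
# Matches slide 12 up to rounding: age ~0.247 (0.246 on the slide), income ~0.029,
# student ~0.152 (0.151 on the slide), credit_rating ~0.048
```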

  13. Attribute Selection for a Branch
     • For the branch age <= 30, D_{age<=30} is the set of the five age<=30 tuples from slide 7 (2 yes, 3 no). Which attribute next?
     • Info(D_{age<=30}) = - (2/5) log_2(2/5) - (3/5) log_2(3/5) = 0.971
     • Gain_{age<=30}(income) = Info(D_{age<=30}) - Info_income(D_{age<=30}) = 0.571
     • Gain_{age<=30}(student) = 0.971
     • Gain_{age<=30}(credit_rating) = 0.02
     • [Figure: partial tree with root age?; student? is chosen for the <=30 branch, the 31..40 branch is a "yes" leaf, and the >40 branch is still to be split]
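The branch-level numbers can be checked the same way; this small script (not part of the slides) restricts the computation to the five age<=30 tuples.

```python
import math
from collections import Counter

def info(ys):
    n = len(ys)
    return -sum(c / n * math.log2(c / n) for c in Counter(ys).values())

# The five age<=30 tuples from slide 7: (income, student, credit_rating, buys_xbox)
subset = [
    ("high",   "no",  "fair",      "no"),
    ("high",   "no",  "excellent", "no"),
    ("medium", "no",  "fair",      "no"),
    ("low",    "yes", "fair",      "yes"),
    ("medium", "yes", "excellent", "yes"),
]
labels = [r[-1] for r in subset]

def gain(col):
    parts = {}
    for r in subset:
        parts.setdefault(r[col], []).append(r[-1])
    return info(labels) - sum(len(p) / len(subset) * info(p) for p in parts.values())

print(round(gain(0), 3), round(gain(1), 3), round(gain(2), 3))
# income ~0.571, student ~0.971, credit_rating ~0.02 -> split this branch on student
```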

  14. Algorithm for Decision Tree Induction
     • Basic algorithm (a greedy algorithm)
       • Tree is constructed in a top-down recursive divide-and-conquer manner
       • At start, all the training examples are at the root
       • Attributes are categorical (if continuous-valued, they are discretized in advance)
       • Examples are partitioned recursively based on selected attributes
       • Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
     • Conditions for stopping partitioning
       • All samples for a given node belong to the same class
       • There are no remaining attributes for further partitioning: majority voting is employed for classifying the leaf
       • There are no samples left: use majority voting in the parent partition
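To make the algorithm concrete, here is a minimal Python sketch of this greedy, top-down induction for categorical attributes. The function names (`build_tree`, `classify`), the dict-based node representation, and the majority-vote fallback for attribute values unseen in a branch are illustrative choices, not taken from the slides.

```python
import math
from collections import Counter

def info(ys):
    n = len(ys)
    return -sum(c / n * math.log2(c / n) for c in Counter(ys).values())

def best_attribute(rows, labels, attributes):
    """Pick the attribute with the highest information gain (rows are dicts)."""
    def gain(a):
        parts = {}
        for r, y in zip(rows, labels):
            parts.setdefault(r[a], []).append(y)
        return info(labels) - sum(len(p) / len(labels) * info(p) for p in parts.values())
    return max(attributes, key=gain)

def build_tree(rows, labels, attributes):
    # Stop: all samples in this node belong to the same class
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no remaining attributes -> majority voting
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attributes)
    node = {"attribute": a, "branches": {},
            "majority": Counter(labels).most_common(1)[0][0]}
    parts = {}
    for r, y in zip(rows, labels):
        parts.setdefault(r[a], []).append((r, y))
    remaining = [b for b in attributes if b != a]
    for value, group in parts.items():
        node["branches"][value] = build_tree([r for r, _ in group],
                                             [y for _, y in group], remaining)
    return node

def classify(tree, x):
    # Walk down the tree; fall back to the parent's majority class for unseen values
    node = tree
    while isinstance(node, dict):
        node = node["branches"].get(x[node["attribute"]], node["majority"])
    return node

# Usage (hypothetical, rows as dicts keyed by the slide-7 column names):
# tree = build_tree(rows, labels, ["age", "income", "student", "credit_rating"])
# classify(tree, {"age": "<=30", "income": "low", "student": "yes", "credit_rating": "fair"})
```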

  15. Computing Information Gain for Continuous-Valued Attributes
     • Let attribute A be a continuous-valued attribute
     • Must determine the best split point for A
       • Sort the values of A in increasing order
       • Typically, the midpoint between each pair of adjacent values is considered as a possible split point
         • (a_i + a_{i+1}) / 2 is the midpoint between the values of a_i and a_{i+1}
       • The point with the minimum expected information requirement for A is selected as the split point for A
     • Split:
       • D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point
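A minimal sketch of the midpoint-based split search, assuming a list of numeric attribute values with a parallel list of class labels; `best_split_point` is an illustrative name, not from the slides.

```python
import math
from collections import Counter

def info(ys):
    n = len(ys)
    return -sum(c / n * math.log2(c / n) for c in Counter(ys).values())

def best_split_point(values, labels):
    """Try the midpoint between each pair of adjacent sorted values and return
    the split point with the minimum expected information requirement Info_A(D)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (float("inf"), None)
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # identical adjacent values give no new midpoint
        split = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [y for v, y in pairs if v <= split]     # D1: A <= split-point
        right = [y for v, y in pairs if v > split]     # D2: A > split-point
        expected = len(left) / n * info(left) + len(right) / n * info(right)
        best = min(best, (expected, split))
    return best[1]

# Toy example: the chosen midpoint separates the two "no" tuples with the lowest ages
print(best_split_point([25, 28, 35, 38, 45, 50],
                       ["no", "no", "yes", "yes", "yes", "no"]))  # -> 31.5
```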

  16. Gain Ratio for Attribute Selection (C4.5)
     • Information gain measure is biased towards attributes with a large number of values
     • C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization to information gain):
       SplitInfo_A(D) = - Σ_{j=1}^{v} (|D_j| / |D|) × log_2(|D_j| / |D|)
     • GainRatio(A) = Gain(A) / SplitInfo_A(D)
     • Ex. gain_ratio(income) = 0.029 / 1.557 = 0.019
     • The attribute with the maximum gain ratio is selected as the splitting attribute
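A sketch of the gain-ratio computation under the same assumptions as the earlier information-gain sketch (class labels plus a parallel attribute column); the guard against a zero SplitInfo is an added convenience, not from the slides.

```python
import math
from collections import Counter

def info(ys):
    n = len(ys)
    return -sum(c / n * math.log2(c / n) for c in Counter(ys).values())

def gain_ratio(labels, attr_values):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    n = len(labels)
    parts = {}
    for a, y in zip(attr_values, labels):
        parts.setdefault(a, []).append(y)
    gain = info(labels) - sum(len(p) / n * info(p) for p in parts.values())
    split_info = -sum(len(p) / n * math.log2(len(p) / n) for p in parts.values())
    return gain / split_info if split_info > 0 else 0.0

# For income on the slide-7 table: SplitInfo = -(4/14)log2(4/14) - (6/14)log2(6/14)
# - (4/14)log2(4/14) ~ 1.557, so gain_ratio(income) ~ 0.029 / 1.557 ~ 0.019
```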

  17. Gini Index (CART, IBM IntelligentMiner)
     • If a data set D contains examples from n classes, the gini index gini(D) is defined as
       gini(D) = 1 - Σ_{j=1}^{n} p_j^2
       where p_j is the relative frequency of class j in D
     • If a data set D is split on A into two subsets D_1 and D_2, the gini index gini_A(D) is defined as
       gini_A(D) = (|D_1| / |D|) gini(D_1) + (|D_2| / |D|) gini(D_2)
     • Reduction in impurity:
       Δgini(A) = gini(D) - gini_A(D)
     • The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)
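A minimal sketch of these Gini formulas for a binary (CART-style) split; the helper names are illustrative.

```python
from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum_j p_j^2 over the class distribution."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left_labels, right_labels):
    """gini_A(D) for a binary split of D into D1 and D2."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n * gini(left_labels)
            + len(right_labels) / n * gini(right_labels))

def gini_reduction(labels, left_labels, right_labels):
    """Reduction in impurity: gini(D) - gini_A(D)."""
    return gini(labels) - gini_split(left_labels, right_labels)
```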

  18. Computation of Gini Index
     • Ex. D has 9 tuples with buys_xbox = "yes" and 5 with "no":
       gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
     • Suppose the attribute income partitions D into 10 tuples in D_1: {low, medium} and 4 in D_2: {high}:
       gini_{income ∈ {low,medium}}(D) = (10/14) Gini(D_1) + (4/14) Gini(D_2) = 0.443
     • Gini_{income ∈ {low,high}} is 0.458 and Gini_{income ∈ {medium,high}} is 0.450. Thus, split on {low, medium} (and {high}) since it has the lowest Gini index.
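As a quick arithmetic check (not part of the slides), the numbers above follow from the class counts in the slide-7 table; the per-subset yes/no counts below are derived from that table.

```python
def gini_counts(yes, no):
    # Gini from class counts: 1 - p_yes^2 - p_no^2
    n = yes + no
    return 1.0 - (yes / n) ** 2 - (no / n) ** 2

gini_D = gini_counts(9, 5)                                                   # ~0.459
# income in {low, medium}: 10 tuples (7 yes, 3 no) vs {high}: 4 tuples (2 yes, 2 no)
gini_low_medium = 10 / 14 * gini_counts(7, 3) + 4 / 14 * gini_counts(2, 2)   # ~0.443
# income in {medium, high}: 10 tuples (6 yes, 4 no) vs {low}: 4 tuples (3 yes, 1 no)
gini_medium_high = 10 / 14 * gini_counts(6, 4) + 4 / 14 * gini_counts(3, 1)  # ~0.450
# income in {low, high}: 8 tuples (5 yes, 3 no) vs {medium}: 6 tuples (4 yes, 2 no)
gini_low_high = 8 / 14 * gini_counts(5, 3) + 6 / 14 * gini_counts(4, 2)      # ~0.458
print(round(gini_D, 3), round(gini_low_medium, 3),
      round(gini_medium_high, 3), round(gini_low_high, 3))
```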

  19. Comparing Attribute Selection Measures
     • The three measures, in general, return good results, but:
       • Information gain: biased towards multivalued attributes
       • Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others (why?)
       • Gini index: biased towards multivalued attributes
