  1. Decision Tree
     CE-717: Machine Learning, Sharif University of Technology
     M. Soleymani, Fall 2019

  2. Decision tree
     - One of the most intuitive classifiers: easy to understand and construct.
     - However, it also works very (very) well.
     - Categorical features are preferred; if feature values are continuous, they are discretized first.
     - Application: database mining.

  3. Example
     - Attributes:
       A: age > 40
       C: chest pain
       S: smoking
       P: physical test
     - Label: heart disease (+), no heart disease (-)
     [Figure: an example decision tree over these attributes, with internal nodes testing C, P, S, and A and leaves labeled + or -]

  4. Decision tree: structure
     - Leaves (terminal nodes) represent the target variable: each leaf carries a class label.
     - Each internal node denotes a test on an attribute.
     - Edges to children correspond to the possible values of that attribute.
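
     To make this structure concrete, here is a minimal Python sketch of a tree node (the class name and fields are illustrative, not taken from the slides): a leaf stores a class label, while an internal node stores the attribute it tests and one child per attribute value.

        from dataclasses import dataclass, field
        from typing import Any, Dict, Optional

        @dataclass
        class Node:
            """A decision-tree node: a leaf with a class label, or an internal node testing one attribute."""
            attribute: Optional[str] = None                             # attribute tested at an internal node
            children: Dict[Any, "Node"] = field(default_factory=dict)   # one child per attribute value
            label: Optional[Any] = None                                 # class label if this node is a leaf

            def is_leaf(self) -> bool:
                return self.attribute is None

            def predict(self, x: Dict[str, Any]) -> Any:
                """Sort an example down the tree until a leaf is reached."""
                node = self
                while not node.is_leaf():
                    node = node.children[x[node.attribute]]
                return node.label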


  6. Decision tree: learning
     - Decision tree learning: constructing a decision tree from training samples.
     - Decision trees used in data mining are usually classification trees.
     - There are many specific decision-tree learning algorithms, such as ID3 and C4.5.
     - They approximate functions over (usually) discrete domains; the learned function is represented by a decision tree.

  7. Decision tree learning
     - Learning an optimal decision tree is NP-complete.
     - Instead, we use a greedy search based on a heuristic; we cannot guarantee returning the globally optimal decision tree.
     - The most common strategy for DT learning is a greedy top-down approach: at each step, choose the variable that best splits the set of items.
     - The tree is constructed by recursively splitting the samples into subsets based on an attribute-value test.

  8. How to construct a basic decision tree?
     - We prefer decisions leading to a simple, compact tree with few nodes.
     - Which attribute at the root?
       - Measure: how well an attribute splits the set into homogeneous subsets (subsets whose members share the same target value), i.e., the homogeneity of the target variable within the subsets.
     - How to form a descendant?
       - A descendant is created for each possible value of the chosen attribute A.
       - The training examples are sorted to the descendant nodes.

  9. Constructing a decision tree
     Function FindTree(S, A)            // S: samples, A: attributes
       if empty(A) or all labels of the samples in S are the same
         status = leaf
         class  = most common class in the labels of S
       else
         status = internal
         a = bestAttribute(S, A)
         LeftNode  = FindTree(S(a=1), A \ {a})   // recursive calls create the left and right subtrees
         RightNode = FindTree(S(a=0), A \ {a})   // S(a=1): the set of samples in S for which a = 1
       end
     end
     Top-down, greedy, no backtracking.

  10. Constructing a decision tree
     Function FindTree(S, A)            // S: samples, A: attributes
       if empty(A) or all labels of the samples in S are the same
         status = leaf
         class  = most common class in the labels of S
       else
         status = internal
         a = bestAttribute(S, A)
         LeftNode  = FindTree(S(a=1), A \ {a})
         RightNode = FindTree(S(a=0), A \ {a})
       end
     end
     - The tree is constructed by recursively splitting the samples into subsets based on an attribute-value test.
     - The recursion is completed when all members of the subset at a node have the same label, or when splitting no longer adds value to the predictions.
     - Top-down, greedy, no backtracking.

  11. ID3
     ID3(Examples, Target_Attribute, Attributes)
       - Create a root node Root for the tree.
       - If all examples are positive, return the single-node tree Root with label = +.
       - If all examples are negative, return the single-node tree Root with label = -.
       - If the set of predicting attributes is empty, return Root with label = most common value of the target attribute in the examples.
       - else
         - A = the attribute that best classifies the examples.
         - Testing attribute for Root = A.
         - For each possible value v_i of A:
           - Add a new tree branch below Root, corresponding to the test A = v_i.
           - Let Examples(v_i) be the subset of examples that have value v_i for A.
           - If Examples(v_i) is empty, then below this new branch add a leaf node with label = most common target value in the examples.
           - else below this new branch add the subtree ID3(Examples(v_i), Target_Attribute, Attributes - {A}).
       - Return Root.
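
     The recursion above translates almost line for line into Python. The sketch below reuses the Node class sketched after slide 4 and leaves the attribute-selection criterion pluggable (e.g., the information gain defined on the following slides); all names are illustrative, and branches for attribute values not seen in the current subset are simply omitted rather than filled with a majority-label leaf.

        from collections import Counter
        from typing import Any, Callable, Dict, List

        Example = Dict[str, Any]   # attribute name -> value, including the target attribute

        def id3(examples: List[Example], target: str, attributes: List[str],
                best_attribute: Callable[[List[Example], str, List[str]], str]) -> Node:
            labels = [e[target] for e in examples]

            # All examples share one label: return a single-node (leaf) tree.
            if len(set(labels)) == 1:
                return Node(label=labels[0])

            # No predicting attributes left: return a leaf with the most common label.
            if not attributes:
                return Node(label=Counter(labels).most_common(1)[0][0])

            a = best_attribute(examples, target, attributes)   # attribute that best classifies the examples
            root = Node(attribute=a)
            for v in {e[a] for e in examples}:                 # one branch per observed value of a
                subset = [e for e in examples if e[a] == v]
                root.children[v] = id3(subset, target,
                                       [x for x in attributes if x != a], best_attribute)
            return root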

  12. Which attribute is the best?

  13. Which attribute is the best?
     - There is a variety of heuristics for picking a good test:
       - Information gain: originated with ID3 (Quinlan, 1979).
       - Gini impurity
       - ...
     - These metrics are applied to each candidate subset, and the resulting values are combined (e.g., averaged) to provide a measure of the quality of the split (see the sketch below).
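
     As an illustration of such a heuristic, here is a small sketch of Gini impurity together with the size-weighted combination of per-subset scores into a single split quality (function names are illustrative):

        from collections import Counter
        from typing import Any, List

        def gini(labels: List[Any]) -> float:
            """Gini impurity: 1 - sum_k p_k^2 over the class frequencies p_k."""
            n = len(labels)
            return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

        def split_impurity(subsets: List[List[Any]]) -> float:
            """Combine per-subset impurities, weighting each subset by its size."""
            n = sum(len(s) for s in subsets)
            return sum(len(s) / n * gini(s) for s in subsets)

        print(split_impurity([['+', '+'], ['-', '-']]))   # 0.0: perfectly homogeneous subsets
        print(split_impurity([['+', '-'], ['+', '-']]))   # 0.5: the split tells us nothing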

  14. Entropy
     H(X) = -\sum_{x_i \in X} P(x_i) \log P(x_i)
     - Entropy measures the uncertainty in a specific distribution.
     - Information theory view: H(X) is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code).
       - The most efficient code assigns -\log P(X = i) bits to encode X = i.
       - Hence the expected number of bits to code one random X is H(X).
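
     A small code sketch of this quantity, estimated from a list of observed values and using a base-2 logarithm so the result is in bits (the function name is illustrative):

        import math
        from collections import Counter
        from typing import Any, List

        def entropy(labels: List[Any]) -> float:
            """H(X) = sum_i P(x_i) * log2(1 / P(x_i)), estimated from empirical frequencies."""
            n = len(labels)
            return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

        print(entropy(['+', '-']))        # 1.0: a uniform Boolean variable needs one bit
        print(entropy(['+', '+', '+']))   # 0.0: a pure (certain) variable needs no bits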

  15. Entropy for a Boolean variable
     [Figure: H(X) plotted against P(X = 1); entropy as a measure of impurity]
     H(X) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1
     H(X) = -1 \log_2 1 - 0 \log_2 0 = 0

  16. Information Gain (IG)
     Gain(S, A) \equiv H_S(Y) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H_{S_v}(Y)
     - A: the attribute used to split the samples
     - Y: the target variable
     - S: the samples (S_v: the subset of samples with A = v)
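
     On top of the entropy sketch above, the information gain of an attribute takes only a few lines (examples are dictionaries mapping attribute names to values, as in the id3 sketch; names are illustrative):

        from typing import Any, Dict, List

        def information_gain(examples: List[Dict[str, Any]], target: str, attr: str) -> float:
            """Gain(S, A) = H_S(Y) - sum_v |S_v|/|S| * H_{S_v}(Y)."""
            n = len(examples)
            remainder = 0.0
            for v in {e[attr] for e in examples}:
                subset_labels = [e[target] for e in examples if e[attr] == v]
                remainder += len(subset_labels) / n * entropy(subset_labels)
            return entropy([e[target] for e in examples]) - remainder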

  17. Information Gain: Example

  18. Mutual information
     The expected reduction in the entropy of Y caused by knowing X:
     I(X, Y) = H(Y) - H(Y | X) = -\sum_i \sum_j P(X = i, Y = j) \log \frac{P(X = i) P(Y = j)}{P(X = i, Y = j)}
     - Mutual information in decision trees:
       - H(Y): the entropy of Y (i.e., of the labels) before splitting the samples
       - H(Y | X): the entropy of Y after splitting the samples based on attribute X; it is the expected label entropy over the different splits (where the splits are formed based on the value of attribute X)

  19. Conditional entropy
     H(Y | X) = -\sum_i \sum_j P(X = i, Y = j) \log P(Y = j | X = i)
              = \sum_i P(X = i) \left[ -\sum_j P(Y = j | X = i) \log P(Y = j | X = i) \right]
     - P(X = i): the probability of following the i-th value of X
     - The bracketed term: the entropy of Y for the samples with X = i
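
     A sketch of the same quantity in code, taking one list of labels per value of X; it also gives the mutual information of the previous slide via I(X, Y) = H(Y) - H(Y|X). It reuses the entropy function sketched earlier, and the names are illustrative.

        from typing import Any, Dict, List

        def conditional_entropy(groups: Dict[Any, List[Any]]) -> float:
            """H(Y|X) = sum_i P(X = i) * H(Y | X = i), where groups[v] holds the labels of the samples with X = v."""
            n = sum(len(labels) for labels in groups.values())
            return sum(len(labels) / n * entropy(labels) for labels in groups.values())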

  20. Conditional entropy: example
     H(Y | Humidity) = \frac{7}{14} H(Y | Humidity = High) + \frac{7}{14} H(Y | Humidity = Normal)
     H(Y | Wind)     = \frac{8}{14} H(Y | Wind = Weak) + \frac{6}{14} H(Y | Wind = Strong)
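
     Assuming this is the standard PlayTennis data from Mitchell's textbook (14 examples, 9 positive and 5 negative; Humidity = High gives 3+/4-, Normal gives 6+/1-; Wind = Weak gives 6+/2-, Strong gives 3+/3-), the numbers can be checked with the conditional_entropy sketch above. The counts are an assumption, not taken from the slide itself.

        # Assumed PlayTennis label counts per split value (not from the slide).
        humidity = {'High': ['+'] * 3 + ['-'] * 4, 'Normal': ['+'] * 6 + ['-'] * 1}
        wind     = {'Weak': ['+'] * 6 + ['-'] * 2, 'Strong': ['+'] * 3 + ['-'] * 3}

        print(round(conditional_entropy(humidity), 3))   # 0.788
        print(round(conditional_entropy(wind), 3))       # 0.892
        # With H(Y) ~ 0.940 before splitting, Gain(S, Humidity) ~ 0.15 and Gain(S, Wind) ~ 0.048,
        # so Humidity is the more informative of these two attributes.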

  21. How to find the best attribute?
     - We use information gain as the criterion for a good split: the attribute that maximizes the information gain is selected.
     - When a set S of samples has been sorted to a node, choose the k-th attribute for the test at this node, where
       k = \arg\max_{i \in \text{remaining attrs}} Gain(S, X_i)
         = \arg\max_{i \in \text{remaining attrs}} [ H_S(Y) - H_S(Y | X_i) ]
         = \arg\min_{i \in \text{remaining attrs}} H_S(Y | X_i)
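
     This selection rule is a one-liner on top of the information_gain sketch above (equivalently, one could minimize the conditional entropy). The function below is illustrative and has the signature expected by the earlier id3 sketch's best_attribute argument:

        from typing import Any, Dict, List

        def best_attribute_by_gain(examples: List[Dict[str, Any]], target: str,
                                   attributes: List[str]) -> str:
            """argmax over the remaining attributes of Gain(S, X_i)."""
            return max(attributes, key=lambda a: information_gain(examples, target, a))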

  22. Information Gain: Example

  23. ID3 algorithm: properties
     - The algorithm either reaches homogeneous nodes or runs out of attributes.
     - It is guaranteed to find a tree consistent with any conflict-free training set:
       - ID3's hypothesis space of all decision trees contains all discrete-valued functions.
       - Conflict-free training set: identical feature vectors are always assigned the same class.
     - But it does not necessarily find the simplest such tree (the one with the minimum number of nodes):
       - it is a greedy algorithm making locally optimal decisions at each node (no backtracking).

  24. Decision tree learning: function approximation problem
     Problem setting:
     - Set of possible instances X
     - Unknown target function f: X → Y (Y is discrete-valued)
     - Set of function hypotheses H = { h | h: X → Y }
       - h is a decision tree, which sorts each instance x to a leaf that assigns a label y
     Input:
     - Training examples {(x_i, y_i)} of the unknown target function f
     Output:
     - A hypothesis h ∈ H that best approximates the target function f

  25. Decision tree hypothesis space
     - Suppose the attributes are Boolean.
     - A decision tree represents a disjunction of conjunctions.
     - Which trees represent the following functions?
       - y = x_1 and x_3
       - y = x_1 or x_4
       - y = (x_1 and x_3) or (x_2 and not x_4)
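
     As a quick sanity check of this expressiveness claim, the sketch below fits scikit-learn's DecisionTreeClassifier on the full truth table of one illustrative disjunction of conjunctions and verifies that the learned tree represents it exactly (the particular function and the use of scikit-learn are assumptions for illustration, not taken from the slides):

        from itertools import product
        from sklearn.tree import DecisionTreeClassifier

        # Truth table of an illustrative disjunction of conjunctions over four Boolean attributes.
        X = list(product([0, 1], repeat=4))
        y = [int((x1 and x3) or (x2 and not x4)) for x1, x2, x3, x4 in X]

        tree = DecisionTreeClassifier(criterion="entropy")   # information-gain-style splits
        tree.fit(X, y)
        print((tree.predict(X) == y).all())                  # True: the tree fits the function exactly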

  26. Decision tree as a rule base
     - A decision tree is equivalent to a set of rules: disjunctions of conjunctions over the attribute values.
     - Each path from the root to a leaf corresponds to a conjunction of attribute tests.
     - All of the leaves with y = j are combined to form the rule for y = j.
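
     A small sketch of this correspondence, walking the Node structure from the earlier sketches and collecting one conjunction of attribute tests per root-to-leaf path (names are illustrative):

        from typing import Any, List, Tuple

        def extract_rules(node: Node, path: Tuple[str, ...] = ()) -> List[Tuple[Tuple[str, ...], Any]]:
            """Return (conjunction of attribute tests, leaf label) for every root-to-leaf path."""
            if node.is_leaf():
                return [(path, node.label)]
            rules = []
            for value, child in node.children.items():
                rules.extend(extract_rules(child, path + (f"{node.attribute} = {value}",)))
            return rules

        # The rule for class j is the disjunction of the conjunctions whose leaf label is j.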

  27. How does a decision tree partition the instance space?
     - A decision tree partitions the instance space into axis-parallel regions, each labeled with a class value.
     [Figure from Duda & Hart's book]

  28. ID3 as a search in the space of trees
     - ID3: a heuristic search through the space of decision trees.
     - It performs a simple-to-complex hill-climbing search (beginning with the empty tree).
     - It prefers simpler hypotheses because it uses information gain to select attribute tests:
       - IG gives a bias toward trees of minimal size.
     - ID3 implements a search (preference) bias instead of a restriction bias.
