CS440/ECE448: Intro to Artificial Intelligence
Lecture 23: Decision Trees
Prof. Julia Hockenmaier
juliahmr@illinois.edu
http://cs.illinois.edu/fa11/cs440

Decision tree learning

Training data D = {(x_1, y_1), ..., (x_N, y_N)}
– each x_i = (x_i^1, ..., x_i^d) is a d-dimensional feature vector
– each y_i is the target label (class) of the i-th data point

[Example tree: the root asks "drink?" (coffee or tea); each branch then asks "milk?" (yes or no); the leaves give the labels "sugar" or "no sugar".]

Training algorithm:
– Initial tree = the root, corresponding to all items in D.
– A node is a leaf if all its data items have the same label y.
– At each non-leaf node: find the feature with the highest information gain, create a new child for each value of that feature, and distribute the items accordingly.
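Below is a minimal sketch of the recursive tree-growing procedure just described, assuming nominal attributes. It is illustrative, not the lecture's code: `data` is assumed to be a list of (feature_dict, label) pairs, and the splitting criterion is passed in as a function (an information-gain version is sketched after the next section).

```python
# Sketch of decision tree growing; names are illustrative, not from the lecture.
from collections import Counter

def majority_label(data):
    # Most common label among (features, label) pairs.
    return Counter(label for _, label in data).most_common(1)[0][0]

def grow_tree(data, attributes, choose_attribute):
    labels = {label for _, label in data}
    if len(labels) == 1:               # all items share one label -> leaf
        return labels.pop()
    if not attributes:                 # nothing left to split on -> majority leaf
        return majority_label(data)
    best = choose_attribute(data, attributes)      # e.g. highest information gain
    children = {}
    for value in {x[best] for x, _ in data}:       # one child per attribute value
        subset = [(x, y) for x, y in data if x[best] == value]
        remaining = [a for a in attributes if a != best]
        children[value] = grow_tree(subset, remaining, choose_attribute)
    return (best, children)            # internal node: (split attribute, branches)
```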
Information gain

How much information do we gain by splitting node S on attribute A with values V(A)?

Information required before the split: H(S_parent)
Information required after the split: Σ_{i ∈ V(A)} P(S_child_i) · H(S_child_i), where P(S_child_i) = |S_child_i| / |S_parent|

Gain(S_parent, A) = H(S_parent) − Σ_{i ∈ V(A)} (|S_child_i| / |S_parent|) · H(S_child_i)

Dealing with numerical attributes

Many attributes are not boolean (0/1) or nominal (classes):
– the number of times a word appears in a text
– the RGB values of a pixel
– height, weight, ...

Splitting on integer or real-valued attributes: find a split point θ and test A_i < θ or A_i ≥ θ.

[Figures: the complete training data over the example space, and the much smaller sample that forms our training data, shown as grids of + and − labels.]
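The following sketch computes entropy and information gain for nominal attributes, matching the Gain(S_parent, A) formula above. It assumes the same (feature_dict, label) data representation as the earlier sketch; `best_attribute` can be passed to `grow_tree` as its `choose_attribute` argument. Names are illustrative.

```python
# Sketch of entropy-based attribute selection; names are illustrative.
import math
from collections import Counter

def entropy(data):
    counts = Counter(label for _, label in data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(data, attribute):
    total = len(data)
    remainder = 0.0
    for value in {x[attribute] for x, _ in data}:
        subset = [(x, y) for x, y in data if x[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset)   # weighted child entropy
    return entropy(data) - remainder

def best_attribute(data, attributes):
    return max(attributes, key=lambda a: information_gain(data, a))
```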
Generalization

We need to label unseen examples accurately. But:
– The training data is only a very small sample of the example space: we won't have seen all possible combinations of attribute values.
– The training data may be noisy: some items may have incorrect attributes or labels.

When does learning stop?

The tree grows until every leaf node contains items with only one label.

The effect of noise

If the training data are noisy, noise may introduce incorrect splits.
[Figure: a small data set of + and − items split on attribute A2. If one false value of A2 should actually have been true, we would not split on A2 at all; if one + label should actually have been −, we would not have to split any further.]
The effect of incomplete data

If the training data are incomplete, we may miss important generalizations.
[Figure: the full example space vs. the training data, split on A2 vs. A4. On the full example space we should have split on A4, not A2, but the incomplete training data does not reveal this.]

Overfitting

The decision tree might overfit the particularities of the training data.
[Figure: accuracy as a function of tree size. Accuracy on the training data keeps increasing as the tree grows, while accuracy on test data eventually drops.]

Reducing overfitting in decision trees

– Limit the depth of the tree: no deeper than N (say 3, or 12, or 86; how to choose?).
– Require a minimum number of examples to select a split: need at least M (is 10 enough? 20?). We want significance: statistical hypothesis testing can help.
– Best: learn an overfit tree and then prune it, using validation (held-out) data.

Pruning a decision tree

1. Train a decision tree on the training data (keep a part of the training data as unseen validation data).
2. Prune from the leaves. Simplest method: replace (prune) each non-leaf node whose children are all leaves with its majority label. Keep this change if the accuracy on the validation set does not degrade (a sketch follows below).
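Here is one simple reading of that pruning step, using the (attribute, children)/label tree representation and the `majority_label` helper from the earlier sketches. It evaluates each candidate prune on the validation examples that reach the node in question; this is an assumption about how to apply the "does not degrade" test locally, not the lecture's exact procedure.

```python
# Sketch of bottom-up reduced-error pruning; names are illustrative.
def classify(tree, x, default=None):
    while isinstance(tree, tuple):                  # descend until we hit a leaf
        attribute, children = tree
        tree = children.get(x[attribute], default)
        if tree is None:
            return default
    return tree

def accuracy(tree, data):
    return sum(classify(tree, x) == y for x, y in data) / len(data)

def prune(tree, train_subset, val_subset):
    if not isinstance(tree, tuple):                 # already a leaf
        return tree
    attribute, children = tree
    # First prune each subtree, passing along the items that reach it.
    for value, child in children.items():
        tr = [(x, y) for x, y in train_subset if x[attribute] == value]
        va = [(x, y) for x, y in val_subset if x[attribute] == value]
        children[value] = prune(child, tr, va)
    # If all children are now leaves, try replacing this node by its majority label.
    if all(not isinstance(c, tuple) for c in children.values()) and val_subset:
        leaf = majority_label(train_subset)
        if accuracy(leaf, val_subset) >= accuracy(tree, val_subset):
            return leaf                             # keep the prune: no degradation
    return tree
```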
Dealing with overfitting

Overfitting is a very common problem in machine learning. Many machine learning algorithms have parameters that can be tuned to improve performance (because they reduce overfitting). We use a held-out data set to set these parameters.

Bias-variance tradeoff

Bias: what kind of hypotheses do we allow? We want hypotheses rich enough to capture the target function f(x).
Variance: how much does our learned hypothesis change if we resample the training data?
Rich hypotheses (e.g. large decision trees) need more data, which we may not have.

Reducing variance: bagging

– Create a new training set by sampling (with replacement) N items from the original data set.
– Repeat this K times to get K training sets (K is an odd number, e.g. 3, 5, ...).
– Train one classifier on each of the K training sets.
– Testing: take the majority vote of these K classifiers.

A sketch of this procedure follows below.
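A minimal sketch of bagging as just described. It assumes `train` is any function that maps a training set to a classifier, and that classifiers are callables mapping an example to a label; names are illustrative.

```python
# Sketch of bagging (bootstrap aggregating); names are illustrative.
import random
from collections import Counter

def bagging_train(data, train, K=5, seed=0):
    rng = random.Random(seed)
    classifiers = []
    for _ in range(K):
        # Bootstrap sample: N items drawn with replacement from the original data.
        sample = [rng.choice(data) for _ in range(len(data))]
        classifiers.append(train(sample))
    return classifiers

def bagging_predict(classifiers, x):
    votes = [clf(x) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]   # majority vote (odd K avoids ties)
```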
Regression

Polynomial curve fitting

Given some data {(x, y), ...}, with x, y ∈ R, find a function f such that f(x) = y.
[Figure: a scatter of data points to be fit by a curve.]

f(x) = w_0 + w_1·x + w_2·x^2 + ... + w_m·x^m = Σ_{i=0}^{m} w_i x^i

Task: find weights w_0, ..., w_m that best fit the data. This requires a loss (error) function.

Squared loss

We want to find a weight vector w which minimizes the loss (error) on the training data {(x_1, y_1), ..., (x_N, y_N)}:

L(w) = Σ_{i=1}^{N} L_2(f_w(x_i), y_i) = Σ_{i=1}^{N} (y_i − f_w(x_i))^2

Accounting for model complexity

We would like to find the simplest polynomial that fits our data, so we need to penalize the degree of the polynomial. We can add a regularization term to the loss which penalizes overly complex functions.
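The sketch below fits the degree-m polynomial above by minimizing the squared loss, with an optional L2 penalty on the weights as one simple form of the regularization just mentioned (the lecture does not fix a particular penalty). It uses numpy, and names are illustrative.

```python
# Sketch of least-squares polynomial fitting with optional L2 regularization.
import numpy as np

def fit_polynomial(xs, ys, m, lam=0.0):
    X = np.vander(np.asarray(xs, dtype=float), m + 1, increasing=True)  # 1, x, ..., x^m
    y = np.asarray(ys, dtype=float)
    # Minimize sum_i (y_i - f_w(x_i))^2 + lam * ||w||^2 in closed form.
    A = X.T @ X + lam * np.eye(m + 1)
    w = np.linalg.solve(A, X.T @ y)
    return w                                   # w[i] is the coefficient of x^i

def predict_poly(w, x):
    return sum(w_i * x**i for i, w_i in enumerate(w))
```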
Linear regression

Given some data {(x, y), ...}, with x, y ∈ R, find a function f(x) = w_1·x + w_0 such that f(x) = y.
[Figure: a scatter of data points fit by a straight line.]

We need to minimize the loss on the training data: w = argmin_w Loss(f_w).
We set the partial derivatives of Loss(f_w) with respect to w_1 and w_0 to zero.
This has a closed-form solution (see book).
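For concreteness, here is the closed-form solution for simple (one-dimensional) linear regression under squared loss, obtained by setting the partial derivatives with respect to w_1 and w_0 to zero; names are illustrative.

```python
# Sketch of the closed-form least-squares fit for f(x) = w1*x + w0.
def linear_regression(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # w1 = sum_i (x_i - mean_x)(y_i - mean_y) / sum_i (x_i - mean_x)^2
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w1 = cov_xy / var_x
    w0 = mean_y - w1 * mean_x
    return w0, w1

# Example: points on the line y = 2x + 1 recover (w0, w1) = (1.0, 2.0).
# linear_regression([0, 1, 2, 3], [1, 3, 5, 7])
```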