Decision Trees 2-26-16
Reading Quiz
Decision trees are an algorithm for which machine learning task?
a) clustering
b) dimensionality reduction
c) classification
d) regression
Reading Quiz
Which error metric is most appropriate for evaluating a {0,1} classification task?
a) worst-case error
b) sum of squares error
c) entropy
d) precision and recall
Terminology
Learning a model:
● input = training examples
  ○ feature = dimension
● output = model = hypothesis
Using a learned model:
● input = test example
● output = class = label = target
Decision trees setting
Supervised learning: classification
Input: can be continuous or discrete
Output: must be discrete; can be {0,1} or multi-class
What type of model are we building?
● Should I play tennis?
● Should I read a Reddit post?
● Who plays tennis when it’s raining but not when it’s humid?
How do we build such a model?
Modeling questions:
● How many decision nodes should there be?
● At each node, what feature should we split on?
● For each such feature, how should we split it?
  ○ This is trivial if the feature is boolean.
Bad idea: generate all possible trees and test how well they work.
Better idea: build the tree incrementally.
Building the tree incrementally
Within a region, pick the best:
● feature to split on
● value at which to split it
Sort the training data into the sub-regions.
Recursively build decision trees for the sub-regions.
[Figure: example regions in a scatter plot of elevation vs. price ($/ft²)]
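The incremental procedure above can be sketched in a few dozen lines. This is a minimal illustration, not the lecture's reference implementation: the function names (`best_split`, `build_tree`, `predict`), the midpoint thresholds, and the `max_depth` default are all my own choices. It greedily picks the (feature, threshold) pair that minimizes the weighted entropy of the two sub-regions, then recurses.

```python
import math
from collections import Counter

def entropy_of(labels):
    """Entropy of a list of class labels, using the 0*log(0) = 0 convention."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(points, labels):
    """Try every feature and every threshold between adjacent observed values;
    return the (feature, threshold) pair minimizing weighted entropy."""
    best, best_score = None, float('inf')
    for f in range(len(points[0])):
        values = sorted(set(p[f] for p in points))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # any threshold between lo and hi gives the same split
            left = [y for p, y in zip(points, labels) if p[f] <= t]
            right = [y for p, y in zip(points, labels) if p[f] > t]
            score = (len(left) * entropy_of(left)
                     + len(right) * entropy_of(right)) / len(labels)
            if score < best_score:
                best_score, best = score, (f, t)
    return best

def build_tree(points, labels, depth=0, max_depth=3):
    """Recursively build a tree; leaves are majority-class labels."""
    split = best_split(points, labels)
    if len(set(labels)) == 1 or depth >= max_depth or split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(p, y) for p, y in zip(points, labels) if p[f] <= t]
    right = [(p, y) for p, y in zip(points, labels) if p[f] > t]
    return (f, t,
            build_tree([p for p, _ in left], [y for _, y in left], depth + 1, max_depth),
            build_tree([p for p, _ in right], [y for _, y in right], depth + 1, max_depth))

def predict(tree, point):
    """Walk internal nodes (feature, threshold, left, right) down to a leaf label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if point[f] <= t else right
    return tree
```

On four 1-D points with labels 0,0,1,1, for instance, the sketch finds the single split at 1.5 and returns a one-node tree.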
Picking the best split
Try all features. Try all possible splits of that feature.
If feature F is ______, there are ________ possible splits to consider.
● binary ... one
● discrete and ordered ... |F| - 1
● discrete and unordered ... 2^(|F|-1) - 1 (two options for where to put each value)
● continuous ... |training set| - 1 (any two thresholds between the same pair of adjacent points give the same split)
|F| denotes the number of possible values a discrete feature can take on.
Discussion question: Can we do better?
Try all features. Try all possible splits of that feature.
If feature F is ______, there are ________ possible splits to consider.
● binary ... one
● discrete and ordered ... |F| - 1
● discrete and unordered ... 2^(|F|-1) - 1 (two options for where to put each value)
● continuous ... |training set| - 1 (any two thresholds between the same pair of adjacent points give the same split)
Ordered or continuous cases: binary search. Unordered case: local search.
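The split counts in the slide above can be written as one small function. This is a sketch with hypothetical names (`num_candidate_splits`, `n_values` for |F|, `n_train` for |training set|), just to make the four cases concrete:

```python
def num_candidate_splits(kind, n_values=None, n_train=None):
    """Number of binary splits to consider for one feature.
    kind: 'binary' | 'ordered' | 'unordered' | 'continuous'
    n_values = |F| (distinct values of a discrete feature)
    n_train  = |training set|"""
    if kind == 'binary':
        return 1
    if kind == 'ordered':
        # a threshold between each pair of adjacent values
        return n_values - 1
    if kind == 'unordered':
        # each value goes left or right: 2^|F| assignments, halved by
        # left/right symmetry, minus the trivial everything-on-one-side split
        return 2 ** (n_values - 1) - 1
    if kind == 'continuous':
        # any threshold between the same two adjacent points is the same split
        return n_train - 1
    raise ValueError(f"unknown feature kind: {kind}")
```

For example, an unordered feature with 4 values has 2³ − 1 = 7 candidate splits, which is why the unordered case is the expensive one.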
How do we pick the best split?
Key idea: minimize entropy.

-∑_{e ∈ E} ∑_{Y ∈ T} [val(e,Y) · log pval(e,Y) + (1 - val(e,Y)) · log(1 - pval(e,Y))]

e: training example
Y: target feature
val(e,Y): true value
pval(e,Y): predicted value
(Note the leading minus sign: the bracketed log-likelihood terms are non-positive, so minimizing entropy means negating them.)
Entropy: alternative explanation
● S is a collection of positive and negative examples
● Pos: proportion of positive examples in S
● Neg: proportion of negative examples in S

Entropy(S) = -Pos · log2(Pos) - Neg · log2(Neg)

● Entropy is 0 when all members of S belong to the same class, for example when Pos = 1 and Neg = 0
● Entropy is 1 when S contains an equal number of positive and negative examples, when Pos = 1/2 and Neg = 1/2
When do we stop splitting?
Bad idea: stop when every training point is classified correctly.
Why is this a bad idea? (The tree overfits: it memorizes noise in the training data instead of generalizing.)
Better idea: stop at a maximum depth, or at a minimum number of points in a region.
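The better stopping rules reduce to a simple guard checked before each split. A sketch with hypothetical names and default thresholds (`max_depth`, `min_points` are illustrative values, not ones the lecture specifies):

```python
def should_stop(labels, depth, max_depth=5, min_points=10):
    """Stop splitting a region when the depth limit is reached,
    the region holds too few points, or it is already pure."""
    return (depth >= max_depth
            or len(labels) < min_points
            or len(set(labels)) <= 1)
```

The purity check is still worth keeping alongside the depth and size limits, since a pure region gains nothing from further splits.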
Exercise: build a decision tree.
Party HI CS BR FF EA RS TB AC MX IM SC ES SS CR DF SA
R     n  y  n  y  y  y  n  n  n  y  ?  y  y  y  n  y
R     n  y  n  y  y  y  n  n  n  n  n  y  y  y  n  ?
D     ?  y  y  ?  y  y  n  n  n  n  y  n  y  y  n  n
D     n  y  y  n  ?  y  n  n  n  n  y  n  y  n  n  y
D     y  y  y  n  y  y  n  n  n  n  y  ?  y  y  y  y
D     n  y  y  n  y  y  n  n  n  n  n  n  y  y  y  y
D     n  y  n  y  y  y  n  n  n  n  n  n  ?  y  y  y
R     n  y  n  y  y  y  n  n  n  n  n  n  y  y  ?  y
R     n  y  n  y  y  y  n  n  n  n  n  y  y  y  n  y
D     y  y  y  n  n  n  y  y  y  n  n  n  n  n  ?  ?
R     n  y  n  y  y  n  n  n  n  n  ?  ?  y  y  n  n
R     n  y  n  y  y  y  n  n  n  n  y  ?  y  y  ?  ?
D     n  y  y  n  n  n  y  y  y  n  n  n  y  n  ?  ?
D     y  y  y  n  n  y  y  y  ?  y  y  ?  n  n  y  ?
R     n  y  n  y  y  y  n  n  n  n  n  y  ?  ?  n  ?
R     n  y  n  y  y  y  n  n  n  y  n  y  y  ?  n  ?
D     y  n  y  n  n  y  n  y  ?  y  y  y  ?  n  n  y
D     y  ?  y  n  n  n  y  y  y  n  n  n  y  n  y  y
R     n  y  n  y  y  y  n  n  n  n  n  ?  y  y  n  n
D     y  y  y  n  n  n  y  y  y  n  y  n  n  n  y  y