Statistics and learning: Big Data
Learning Decision Trees and an Introduction to Boosting
Sébastien Gadat, Toulouse School of Economics, February 2017
Keywords:
◮ Decision trees
◮ Divide and conquer
◮ Impurity measure, Gini index, information gain
◮ Pruning and overfitting
◮ CART and C4.5

Contents of this class:
◮ The general idea of learning decision trees
◮ Regression trees
◮ Classification trees
◮ Boosting and trees
◮ Random forests and trees
Introductory example

      Alt  Bar  F/S  Hun  Pat   Pri  Rai  Res  Typ      Dur  Wai
x1    Y    N    N    Y    0.38  $$$  N    Y    French     8  Y
x2    Y    N    N    Y    0.83  $    N    N    Thai      41  N
x3    N    Y    N    N    0.12  $    N    N    Burger     4  Y
x4    Y    N    Y    Y    0.75  $    Y    N    Thai      12  Y
x5    Y    N    Y    N    0.91  $$$  N    Y    French    75  N
x6    N    Y    N    Y    0.34  $$   Y    Y    Italian    8  Y
x7    N    Y    N    N    0.09  $    Y    N    Burger     7  N
x8    N    N    N    Y    0.15  $$   Y    Y    Thai      10  Y
x9    N    Y    Y    N    0.84  $    Y    N    Burger    80  N
x10   Y    Y    Y    Y    0.78  $$$  N    Y    Italian   25  N
x11   N    N    N    N    0.05  $    N    N    Thai       3  N
x12   Y    Y    Y    Y    0.89  $    N    N    Burger    38  Y

Please describe this dataset without any calculation.
Why is Pat a better indicator than Typ?
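For readers who want to reproduce the examples in R (the language used later on the "R point of view" slide), the table above can be entered as a data frame. The code below is only a sketch: the column F/S is renamed FS to keep a syntactic R name, and categorical columns are stored as factors.

```r
# The restaurant dataset from the slide above, as an R data frame.
restaurants <- data.frame(
  Alt = c("Y","Y","N","Y","Y","N","N","N","N","Y","N","Y"),
  Bar = c("N","N","Y","N","N","Y","Y","N","Y","Y","N","Y"),
  FS  = c("N","N","N","Y","Y","N","N","N","Y","Y","N","Y"),   # "F/S" renamed FS
  Hun = c("Y","Y","N","Y","N","Y","N","Y","N","Y","N","Y"),
  Pat = c(0.38, 0.83, 0.12, 0.75, 0.91, 0.34, 0.09, 0.15, 0.84, 0.78, 0.05, 0.89),
  Pri = c("$$$","$","$","$","$$$","$$","$","$$","$","$$$","$","$"),
  Rai = c("N","N","N","Y","N","Y","Y","Y","Y","N","N","N"),
  Res = c("Y","N","N","N","Y","Y","N","Y","N","Y","N","N"),
  Typ = c("French","Thai","Burger","Thai","French","Italian",
          "Burger","Thai","Burger","Italian","Thai","Burger"),
  Dur = c(8, 41, 4, 12, 75, 8, 7, 10, 80, 25, 3, 38),
  Wai = c("Y","N","Y","Y","N","Y","N","Y","N","N","N","Y"),
  row.names = paste0("x", 1:12),
  stringsAsFactors = TRUE        # categorical columns as factors, for rpart later
)
```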
Deciding to wait... or not

[Figure: a decision tree built step by step on the restaurant data.
Root node: all twelve examples (wait: 1, 3, 4, 6, 8, 12; do not wait: 2, 5, 7, 9, 10, 11).
First split on Pat, with branches [0;0.1], [0.1;0.5] and [0.5;1]: the [0;0.1] branch (examples 7, 11) becomes a "No" leaf and the [0.1;0.5] branch (examples 1, 3, 6, 8) becomes a "Yes" leaf.
The remaining [0.5;1] branch (wait: 4, 12; do not wait: 2, 5, 9, 10) is split again on Dur: the "<40" branch becomes a "Yes" leaf and the ">40" branch a "No" leaf.]
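As a sanity check, the tree sketched in the figure can be written down directly as a nested if/else. This is only a transcription of the figure, not code from the slides; the thresholds are read off the branch labels.

```r
# Direct transcription of the tree in the figure: split on Pat, then on Dur.
wait_rule <- function(Pat, Dur) {
  if (Pat <= 0.1) {
    "No"                       # leaf for Pat in [0; 0.1]
  } else if (Pat <= 0.5) {
    "Yes"                      # leaf for Pat in [0.1; 0.5]
  } else if (Dur < 40) {
    "Yes"                      # Pat in [0.5; 1], short expected duration
  } else {
    "No"                       # Pat in [0.5; 1], long expected duration
  }
}

wait_rule(Pat = 0.38, Dur = 8)   # "Yes", matching example x1
```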
The general idea of learning decision trees

Decision trees. Ingredients:
◮ Nodes. Each node contains a test on the features which partitions the data.
◮ Edges. The outcome of a node's test leads to one of its child edges.
◮ Leaves. A terminal node, or leaf, holds a decision value for the output variable.

We will look at binary trees (⇒ binary tests) and single-variable tests:
◮ Binary attribute: node = attribute
◮ Continuous attribute: node = (attribute, threshold) (see the small sketch after this slide)

How does one build a good decision tree? For a regression problem? For a classification problem?
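For concreteness, a single-variable test on a continuous attribute can be represented as a pair (attribute, threshold). The record layout below is my own illustration, not a structure from the slides or from any package.

```r
# A continuous-attribute node is just a pair (attribute, threshold); the test
# sends an example left if x[[attr]] <= threshold and right otherwise.
node <- list(attr = "Pat", threshold = 0.5)
goes_left <- function(x, node) x[[node$attr]] <= node$threshold

goes_left(restaurants[1, ], node)   # TRUE: x1 has Pat = 0.38 <= 0.5
```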
A little more formally

A tree with M leaves describes a covering set of M hypercubes $R_m$ in $\mathcal{X}$. Each $R_m$ holds a decision value $\hat{y}_m$:
$$\hat{f}(x) = \sum_{m=1}^{M} \hat{y}_m \, \mathbb{I}_{R_m}(x)$$
Notation:
$$N_m = |\{x_i \in R_m\}| = \sum_{i=1}^{q} \mathbb{I}_{R_m}(x_i)$$
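To make the notation concrete, here is a small sketch (not from the slides) of a piecewise-constant predictor with one-dimensional regions cut at 0.1 and 0.5; the region boundaries and leaf values are made up for illustration.

```r
# Piecewise-constant predictor f_hat(x) = sum_m y_hat_m * 1{x in R_m},
# with 1-D regions obtained by cutting the line at the given break points.
predict_piecewise <- function(x, breaks, y_hat) {
  m <- findInterval(x, breaks) + 1   # index of the region R_m containing x
  y_hat[m]
}

predict_piecewise(0.38, breaks = c(0.1, 0.5), y_hat = c(10, 20, 30))   # returns 20
```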
The general idea: divide and conquer

Example set T, attributes x_1, ..., x_p.

FormTree(T)
1. Find the best split (j, s) over T              // Which criterion?
2. If (j, s) = ∅:
   ◮ node = FormLeaf(T)                           // Which value for the leaf?
3. Else:
   ◮ node = (j, s)
   ◮ split T according to (j, s) into (T_1, T_2)
   ◮ append FormTree(T_1) to node                 // Recursive call
   ◮ append FormTree(T_2) to node
4. Return node

Remark: this is a greedy algorithm, performing local search. A runnable sketch of the recursion is given below.
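Below is a minimal R sketch of this divide-and-conquer recursion. The helper find_best_split(T) is a hypothetical placeholder (the splitting criterion is only discussed in the next slides); it is assumed to return NULL when no useful split exists and otherwise a list with the chosen attribute j and threshold s, proposing only splits with two non-empty parts.

```r
# A minimal sketch of FormTree; find_best_split() and the leaf value are
# deliberate placeholders (the splitting criterion comes next).
form_tree <- function(T, find_best_split, min_size = 5) {
  split <- if (nrow(T) >= min_size) find_best_split(T) else NULL
  if (is.null(split)) {
    # FormLeaf: a constant value, e.g. the mean response in the regression case
    return(list(leaf = TRUE, value = mean(T$y)))
  }
  left  <- T[T[[split$j]] <= split$s, , drop = FALSE]
  right <- T[T[[split$j]] >  split$s, , drop = FALSE]
  list(leaf = FALSE, j = split$j, s = split$s,
       left  = form_tree(left,  find_best_split, min_size),   # recursive call
       right = form_tree(right, find_best_split, min_size))
}
```

Passing the split criterion as a function keeps the recursion identical for regression and classification trees; only FormLeaf and find_best_split change.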
The R point of view

Two packages for tree-based methods: tree and rpart. A brief usage sketch with rpart is shown below.
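As an illustration (not taken from the slides), here is how rpart could be used to grow a classification tree on the restaurant data frame built earlier; the rpart.control settings are only indicative and are loosened because the dataset is tiny.

```r
library(rpart)

# Grow a classification tree predicting Wai from all other attributes.
fit <- rpart(Wai ~ ., data = restaurants, method = "class",
             control = rpart.control(minsplit = 2, cp = 0.01))

print(fit)                                   # text display of the splits
predict(fit, restaurants, type = "class")    # fitted decisions on the training data
```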
Regression trees

Regression trees – criterion

We want to fit a tree to the data $\{(x_i, y_i)\}_{i=1..q}$ with $y_i \in \mathbb{R}$.

Criterion? Sum of squares:
$$\sum_{i=1}^{q} \left( y_i - \hat{f}(x_i) \right)^2$$

Inside region $R_m$, best $\hat{y}_m$?
$$\hat{y}_m = \frac{1}{N_m} \sum_{x_i \in R_m} y_i = \overline{Y}_{R_m}$$

Node impurity measure:
$$Q_m = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat{y}_m)^2$$
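As a quick illustration (my own sketch, not from the slides), the leaf value and impurity of one region can be computed directly on the restaurant data, taking Dur as the continuous response and the region R_m = {Pat > 0.5}.

```r
# Leaf value and impurity Q_m for the region R_m = {Pat > 0.5},
# with Dur playing the role of the continuous response y.
in_Rm <- restaurants$Pat > 0.5
y     <- restaurants$Dur[in_Rm]

N_m     <- length(y)               # number of examples falling in R_m
y_hat_m <- mean(y)                 # best constant prediction in R_m
Q_m     <- mean((y - y_hat_m)^2)   # squared-error impurity of R_m

c(N_m = N_m, y_hat_m = y_hat_m, Q_m = Q_m)
```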
Regression trees – criterion

Best partition: hard to find. But locally, best split? Solve
$$\operatorname*{argmin}_{j,s} \; C(j,s)$$
where
$$C(j,s) = \min_{\hat{y}_1} \sum_{x_i \in R_1(j,s)} (y_i - \hat{y}_1)^2 + \min_{\hat{y}_2} \sum_{x_i \in R_2(j,s)} (y_i - \hat{y}_2)^2$$
$$= \sum_{x_i \in R_1(j,s)} \left( y_i - \overline{Y}_{R_1(j,s)} \right)^2 + \sum_{x_i \in R_2(j,s)} \left( y_i - \overline{Y}_{R_2(j,s)} \right)^2$$
$$= N_1 Q_1 + N_2 Q_2$$
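A brute-force search for the best split over continuous attributes can be written in a few lines. The sketch below is only an illustration of the idea (candidate thresholds taken midway between consecutive sorted values), not the actual CART implementation.

```r
# Exhaustive search for the best split (j, s): for every continuous attribute j
# and every candidate threshold s, compute C(j, s) = N1*Q1 + N2*Q2.
best_split <- function(data, response, attributes) {
  sse <- function(y) sum((y - mean(y))^2)           # equals N_m * Q_m for one region
  best <- list(C = Inf, j = NULL, s = NULL)
  for (j in attributes) {
    v <- sort(unique(data[[j]]))
    if (length(v) < 2) next
    thresholds <- (head(v, -1) + tail(v, -1)) / 2   # midpoints between distinct values
    for (s in thresholds) {
      left  <- data[[response]][data[[j]] <= s]
      right <- data[[response]][data[[j]] >  s]
      C <- sse(left) + sse(right)
      if (C < best$C) best <- list(C = C, j = j, s = s)
    }
  }
  best
}

# Example on the restaurant data, with Dur as the response:
best_split(restaurants, response = "Dur", attributes = c("Pat"))
```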
Overgrowing the tree?

◮ Too small: rough average.
◮ Too large: overfitting.

(The table on this slide repeats the restaurant data from the introductory example, without the Wai column.)
Overgrowing the tree? Stopping criterion?

◮ Stop if min_{j,s} C(j,s) > κ? Not good, because a good split might be hidden in deeper nodes.
◮ Stop if N_m < n? Good to avoid overspecialization.
◮ Prune the tree after growing it: cost-complexity pruning.

Cost-complexity criterion:
$$C_\alpha = \sum_{m=1}^{M} N_m Q_m + \alpha M$$

Once a tree is grown, prune it to minimize $C_\alpha$.
◮ Each α corresponds to a unique cost-complexity optimal tree.
◮ Pruning method: weakest-link pruning, left to your curiosity.
◮ Best α? Through cross-validation. An rpart-based sketch is given below.
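With rpart, cost-complexity pruning is driven by the complexity parameter cp (rpart's reparametrisation of α), and the cross-validated error for each cp is reported in the cp table. The workflow below is a sketch on the toy data; the rule of picking the cp with the smallest cross-validated error is my own choice, not something prescribed by the slides.

```r
library(rpart)

# Grow a deliberately large regression tree (Dur as the response), then prune it.
big <- rpart(Dur ~ . - Wai, data = restaurants, method = "anova",
             control = rpart.control(minsplit = 2, cp = 0))

printcp(big)    # cross-validated error for each value of cp

# Pick the cp with the smallest cross-validated error and prune the tree.
best_cp <- big$cptable[which.min(big$cptable[, "xerror"]), "CP"]
pruned  <- prune(big, cp = best_cp)
print(pruned)
```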
Regression trees in a nutshell

◮ Constant values on the leaves.
◮ Growing phase: greedy splits that minimize the squared-error impurity measure.
◮ Pruning phase: weakest-link pruning that minimizes the cost-complexity criterion.

Further reading on regression trees:
◮ MARS: Multivariate Adaptive Regression Splines. Linear functions on the leaves.
◮ PRIM: Patient Rule Induction Method. Focuses on extrema rather than averages.