Decision Trees: Some exercises

0. Decision Trees: Some exercises

1. Exemplifying how to compute information gains and how to work with decision stumps. CMU, 2013 fall, W. Cohen, E. Xing, Sample questions, pr. 4

2. Timmy wants to know how to do well on the ML exam. He collects


  1. 17. d.

H_{0/NotHeavy} = (3/8)·H[1+,2−] + (5/8)·H[2+,3−]
= (3/8)·[(1/3)·log₂3 + (2/3)·log₂(3/2)] + (5/8)·[(2/5)·log₂(5/2) + (3/5)·log₂(5/3)]
= (3/8)·(log₂3 − 2/3) + (5/8)·(log₂5 − 2/5 − (3/5)·log₂3)
= (3/8)·log₂3 − 1/4 + (5/8)·log₂5 − 1/4 − (3/8)·log₂3
= (5/8)·log₂5 − 1/2 ≈ 0.9512

⇒ IG_{0/NotHeavy} = H_{Edible} − H_{0/NotHeavy} = 0.9544 − 0.9512 = 0.0032, and

IG_{0/NotHeavy} = IG_{0/Smelly} = IG_{0/Spotted} = 0.0032 < IG_{0/Smooth} = 0.0488

  2. 18. Important Remark

Instead of actually computing these information gains, in order to determine the "best" attribute it would have been enough to compare the average conditional entropies H_{0/Smooth} and H_{0/NotHeavy}:

IG_{0/Smooth} > IG_{0/NotHeavy} ⇔ H_{0/Smooth} < H_{0/NotHeavy}
⇔ 3/2 − (3/8)·log₂3 < (5/8)·log₂5 − 1/2
⇔ 12 − 3·log₂3 < 5·log₂5 − 4
⇔ 16 < 5·log₂5 + 3·log₂3
⇔ 16 < 11.6096 + 4.7548 (true)

Alternatively, taking into account the formulas from problem UAIC, 2017 fall, S. Ciobanu, L. Ciortuz, we can proceed with even less computation (not only here, but whenever the number of instances is not large):

H_{0/Smooth} < H_{0/NotHeavy} ⇔ 4⁸ < 3³·5⁵ ⇔ 2¹⁶ < 3³·5⁵ ⇔ 64·2¹⁰ < 27·25·125, i.e., 65536 < 84375 (true; note that 2¹⁰ = 1024 > 1000 = 8·125).

  3. 19. Node 1: Smooth = 0, [2+,2−]. Candidate decision stumps:
• NotHeavy: 0 → [0+,1−], 1 → [2+,1−]
• Smelly: 0 → [2+,0−], 1 → [0+,2−]
• Spotted: 0 → [1+,1−], 1 → [1+,1−]
Smelly produces a pure split, so it is selected at this node.

  4. 20. Node 2: Smooth = 1, [1+,3−]. Candidate decision stumps:
• NotHeavy: 0 → [1+,1−], 1 → [0+,2−]
• Smelly: 0 → [0+,3−], 1 → [1+,0−]
• Spotted: 0 → [1+,2−], 1 → [0+,1−]
Smelly again produces a pure split, so it is selected (Node 3 in the figure).

  5. 21. The resulting ID3 tree: the root [3+,5−] tests Smooth; both the Smooth = 0 branch [2+,2−] and the Smooth = 1 branch [1+,3−] test Smelly. Leaves: Smooth = 0: Smelly = 0 → [2+,0−]: 1, Smelly = 1 → [0+,2−]: 0; Smooth = 1: Smelly = 0 → [0+,3−]: 0, Smelly = 1 → [1+,0−]: 1.

Equivalently: IF (Smooth = 0 AND Smelly = 0) OR (Smooth = 1 AND Smelly = 1) THEN Edible; ELSE ¬Edible.

Classification of test instances:
U: Smooth = 1, Smelly = 1 ⇒ Edible = 1
V: Smooth = 1, Smelly = 1 ⇒ Edible = 1
W: Smooth = 0, Smelly = 1 ⇒ Edible = 0

  6. 22. Exemplifying the greedy character of the ID3 algorithm CMU, 2003 fall, T. Mitchell, A. Moore, midterm, pr. 9.a

  7. 23. Consider the binary input attributes A, B, C, the output attribute Y, and the following training examples:

A B C Y
1 1 0 0
1 0 1 1
0 1 1 1
0 0 1 0

a. Determine the decision tree computed by the ID3 algorithm. Is this decision tree consistent with the training data?

  8. 24. Answer. Node 0 (the root), [2+,2−]. Candidate decision stumps:
• A: 0 → [1+,1−], 1 → [1+,1−]
• B: 0 → [1+,1−], 1 → [1+,1−]
• C: 0 → [0+,1−], 1 → [2+,1−]
One sees immediately that the first two "decision stumps" have IG = 0, while the third has IG > 0. Therefore, at node 0 (the root) we place the attribute C; its C = 1 child becomes Node 1.
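The information gains above can be cross-checked with a short script (a minimal sketch; `entropy` and `info_gain` are helper names introduced here, not part of the original exercise):

```python
from math import log2

def entropy(pos, neg):
    """Entropy (in bits) of a [pos+, neg-] class distribution."""
    n = pos + neg
    return -sum(c / n * log2(c / n) for c in (pos, neg) if c > 0)

# Training set from the statement: rows are (A, B, C, Y).
data = [(1, 1, 0, 0), (1, 0, 1, 1), (0, 1, 1, 1), (0, 0, 1, 0)]

def info_gain(col):
    """Information gain of splitting on the attribute in column col (0=A, 1=B, 2=C)."""
    n = len(data)
    pos = sum(r[-1] for r in data)
    h_cond = 0.0
    for v in (0, 1):
        subset = [r for r in data if r[col] == v]
        if subset:
            p = sum(r[-1] for r in subset)
            h_cond += len(subset) / n * entropy(p, len(subset) - p)
    return entropy(pos, n - pos) - h_cond

gains = [info_gain(c) for c in range(3)]   # gains for A, B, C
```

Running this confirms IG(A) = IG(B) = 0 while IG(C) > 0, so C is placed at the root.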

  9. 25. Node 1: we must classify the instances with C = 1, so the choice is between attributes A and B. The candidate stumps on [2+,1−]:
• A: 0 → [1+,1−], 1 → [1+,0−]
• B: 0 → [1+,1−], 1 → [1+,0−]
The two average conditional entropies are equal: H_{1/A} = H_{1/B} = (2/3)·H[1+,1−] + (1/3)·H[1+,0−]. Therefore we may choose either of the two attributes. For definiteness, we choose A; its A = 0 child becomes Node 2.

  10. 26. Node 2: at this node the only attribute still available is B, so we place it here. The complete ID3 tree is shown in the nearby figure: C = 0 → 0; C = 1 → A; A = 1 → 1; A = 0 → B; B = 0 → [0+,1−]: 0, B = 1 → [1+,0−]: 1.

By construction, the ID3 algorithm is consistent with the training data whenever the data themselves are consistent (i.e., non-contradictory). In our case, one immediately verifies that the training data are consistent.

  11. 27. b. Is there a decision tree of smaller depth (than that of the ID3 tree) consistent with the data above? If so, what (logical) concept does this tree represent?

Answer: From the data one observes that the output attribute Y is in fact the logical function A xor B. Representing this function as a decision tree gives: root A [2+,2−]; A = 0 → B (B = 0 → [0+,1−]: 0, B = 1 → [1+,0−]: 1); A = 1 → B (B = 0 → [1+,0−]: 1, B = 1 → [0+,1−]: 0).

This tree has one level fewer than the tree built by the ID3 algorithm. Therefore, the ID3 tree is not optimal with respect to the number of levels.

  12. 28. This is a consequence of the "greedy" character of the ID3 algorithm, due to the fact that at each iteration we choose the "best" attribute with respect to the information-gain criterion. It is well known that greedy algorithms do not guarantee reaching the global optimum.

  13. 29. Exemplifying the application of the ID3 algorithm in the presence of both categorical and continuous attributes. CMU, 2012 fall, Eric Xing, Aarti Singh, HW1, pr. 1.1

  14. 30. As of September 2012, 800 extrasolar planets have been identified in our galaxy. Super-secret surveying spaceships sent to all these planets have established whether they are habitable for humans or not, but sending a spaceship to each planet is expensive. In this problem, you will come up with decision trees to predict if a planet is habitable based only on features observable using telescopes.

a. In the nearby table you are given the data from all 800 planets surveyed so far. The features observed by telescope are Size ("Big" or "Small") and Orbit ("Near" or "Far"). Each row indicates the values of the features and habitability, and how many times that set of values was observed. So, for example, there were 20 "Big" planets "Near" their star that were habitable.

Size  Orbit  Habitable  Count
Big   Near   Yes        20
Big   Far    Yes        170
Small Near   Yes        139
Small Far    Yes        45
Big   Near   No         130
Big   Far    No         30
Small Near   No         11
Small Far    No         255

Derive and draw the decision tree learned by ID3 on this data (use the maximum information gain criterion for splits, don't do any pruning). Make sure to clearly mark at each node what attribute you are splitting on, and which value corresponds to which branch. By each leaf node of the tree, write in the number of habitable and inhabitable planets in the training data that belong to that node.

  15. 31. Answer: Level 1. The root holds [374+,426−], with entropy H(374/800) = 0.9969.

• Size: B → [190+,160−] (H(19/35) = 0.9946), S → [184+,266−] (H(92/225) = 0.9759)
H(Habitable | Size) = (35/80)·H(19/35) + (45/80)·H(92/225) = (35/80)·0.9946 + (45/80)·0.9759 = 0.9841
IG(Habitable; Size) = 0.9969 − 0.9841 = 0.0128

• Orbit: N → [159+,141−] (H(47/100) = 0.9974), F → [215+,285−] (H(43/100) = 0.9858)
H(Habitable | Orbit) = (3/8)·H(47/100) + (5/8)·H(43/100) = (3/8)·0.9974 + (5/8)·0.9858 = 0.9901
IG(Habitable; Orbit) = 0.9969 − 0.9901 = 0.0067

Since IG(Habitable; Size) > IG(Habitable; Orbit), Size is placed at the root.
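The two gains can be recomputed directly from the table's counts (a minimal sketch; the dictionary layout is my own encoding of the table, not from the original solution):

```python
from math import log2

def H(p):
    """Entropy (bits) of a Bernoulli variable with parameter p."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

# (Size, Orbit) -> (habitable count, inhabitable count), taken from the table
counts = {("Big", "Near"): (20, 130), ("Big", "Far"): (170, 30),
          ("Small", "Near"): (139, 11), ("Small", "Far"): (45, 255)}

def ig(feature_index):
    """IG of splitting the 800 planets on feature 0 (Size) or 1 (Orbit)."""
    n = sum(a + b for a, b in counts.values())        # 800
    pos = sum(a for a, _ in counts.values())          # 374
    h_cond = 0.0
    for v in {k[feature_index] for k in counts}:
        p = sum(a for k, (a, b) in counts.items() if k[feature_index] == v)
        m = sum(a + b for k, (a, b) in counts.items() if k[feature_index] == v)
        h_cond += m / n * H(p / m)
    return H(pos / n) - h_cond

ig_size, ig_orbit = ig(0), ig(1)
```

With full precision the gains come out near 0.0129 and 0.0068; the slides' 0.0128 and 0.0067 result from rounding the entropies first, and Size wins either way.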

  16. 32. The final decision tree: the root [374+,426−] splits on Size; the B branch [190+,160−] and the S branch [184+,266−] each split on Orbit. Leaves: Big/Near → [20+,130−]: −; Big/Far → [170+,30−]: +; Small/Near → [139+,11−]: +; Small/Far → [45+,255−]: −.

  17. 33. b. For just 9 of the planets, a third feature, Temperature (in Kelvin degrees), has been measured, as shown in the nearby table.

Size  Orbit  Temperature  Habitable
Big   Far    205          No
Big   Near   205          No
Big   Near   260          Yes
Big   Near   380          Yes
Small Far    205          No
Small Far    260          Yes
Small Near   260          Yes
Small Near   380          No
Small Near   380          No

Redo all the steps from part a on this data using all three features. For the Temperature feature, in each iteration you must maximize over all possible binary thresholding splits (such as T ≤ 250 vs. T > 250, for example).

According to your decision tree, would a planet with the features (Big, Near, 280) be predicted to be habitable or not habitable?

Hint: You might need to use the following values of the entropy function for a Bernoulli variable of parameter p: H(1/3) = 0.9182, H(2/5) = 0.9709, H(92/225) = 0.9759, H(43/100) = 0.9858, H(16/35) = 0.9946, H(47/100) = 0.9974.

  18. 34. Answer. Binary threshold splits for the continuous attribute Temperature: the sorted distinct values are 205, 260 and 380, so the candidate thresholds are their midpoints, 232.5 and 320.

  19. 35. Answer: Level 1. The root holds [4+,5−], with entropy H(4/9) = 0.9911. Candidate stumps:

• Size: B → [2+,2−] (H = 1), S → [2+,3−] (H(2/5))
H(Habitable | Size) = (4/9)·1 + (5/9)·0.9709 = 0.9838
IG(Habitable; Size) = 0.9911 − 0.9838 = 0.0072

• Orbit: F → [1+,2−] (H(1/3)), N → [3+,3−] (H = 1)

• T ≤ 232.5: Yes → [0+,3−] (H = 0), No → [4+,2−] (H(1/3))
H(Habitable | T ≤ 232.5) = (3/9)·0 + (6/9)·0.9182 = 0.6121
IG(Habitable; T ≤ 232.5) = 0.9911 − 0.6121 = 0.3788

• T ≤ 320: Yes → [3+,3−] (H = 1), No → [1+,2−] (H(1/3))

The largest gain is obtained by T ≤ 232.5, so it is placed at the root.
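The candidate thresholds and the winning split can be verified programmatically (a sketch; the names are mine, and habitability is encoded as 1 = Yes, 0 = No):

```python
from math import log2

def H(p):
    return 0.0 if p in (0, 1) else -p * log2(p) - (1 - p) * log2(1 - p)

# (Temperature, habitable?) for the 9 planets in the table
temps = [(205, 0), (205, 0), (260, 1), (380, 1), (205, 0),
         (260, 1), (260, 1), (380, 0), (380, 0)]

def ig_threshold(thr):
    """IG of the binary split T <= thr vs. T > thr."""
    n = len(temps)
    left = [y for t, y in temps if t <= thr]
    right = [y for t, y in temps if t > thr]
    h_cond = (len(left) / n) * H(sum(left) / len(left)) \
           + (len(right) / n) * H(sum(right) / len(right))
    return H(sum(y for _, y in temps) / n) - h_cond

values = sorted({t for t, _ in temps})                        # [205, 260, 380]
thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]  # [232.5, 320.0]
best = max(thresholds, key=ig_threshold)
```

As in the slide, the best threshold is 232.5, with an information gain near 0.3788.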

  20. 36. Answer: Level 2. The node [4+,2−] (the T > 232.5 branch) has entropy H(1/3). Candidate stumps:
• Size: B → [2+,0−] (H = 0), S → [2+,2−] (H = 1)
• Orbit: F → [1+,0−] (H = 0), N → [3+,2−] (H(2/5))
• T ≤ 320: Yes → [3+,0−] (H = 0), No → [1+,2−] (H(1/3))
T ≤ 320 yields the smallest average conditional entropy, so it is selected.

Note: The plain lines indicate that both the specific conditional entropies and their coefficients (weights) in the average conditional entropies satisfy the indicated relationship. (For example, H(2/5) > H(1/3) and 5/6 > 3/6.) The dotted lines indicate that only the specific conditional entropies satisfy the indicated relationship. (For example, H(1/2) = 1 > H(2/5) but 4/6 < 5/6.)

  21. 37. The final decision tree: the root [4+,5−] tests T ≤ 232.5: Yes → [0+,3−]: −; No → [4+,2−] tests T ≤ 320: Yes → [3+,0−]: +; No → [1+,2−] tests Size: B → [1+,0−]: +, S → [0+,2−]: −.

c. According to your decision tree, would a planet with the features (Big, Near, 280) be predicted to be habitable or not habitable?

Answer: habitable (since 280 > 232.5 and 280 ≤ 320).

  22. 38. Exemplifying the application of the ID3 algorithm on continuous attributes, and in the presence of noise. Decision surfaces; decision boundaries. The computation of the LOOCV error. CMU, 2002 fall, Andrew Moore, midterm, pr. 3

  23. 39. Suppose we are learning a classifier with binary output values Y = 0 and Y = 1. There is one real-valued input X. The training data is given in the nearby table.

X    Y
1    0
2    0
3    0
4    0
6    1
7    1
8    1
8.5  0
9    1
10   1

Assume that we learn a decision tree on this data. Assume that when the decision tree splits on the real-valued attribute X, it puts the split threshold halfway between the attributes that surround the split. For example, using information gain as the splitting criterion, the decision tree would initially choose to split at X = 5, which is halfway between the X = 4 and X = 6 datapoints.

Let Algorithm DT2 be the method of learning a decision tree with only two leaf nodes (i.e., only one split). Let Algorithm DT⋆ be the method of learning a decision tree fully, with no pruning.

a. What will be the training set error for DT2 and, respectively, DT⋆ on our data?
b. What will be the leave-one-out cross-validation (LOOCV) error for DT2 and, respectively, DT⋆ on our data?

  24. 40.
• training data: X = 1, 2, 3, 4 → Y = 0; X = 6, 7, 8 → Y = 1; X = 8.5 → Y = 0; X = 9, 10 → Y = 1
• discretization / decision thresholds: 5, 8.25, 8.75
• ID3 tree: root [5−,5+] tests X < 5: Yes → [4−,0+]: 0; No → [1−,5+] tests X < 8.25: Yes → [0−,3+]: 1; No → [1−,2+] tests X < 8.75: Yes → [1−,0+]: 0; No → [0−,2+]: 1
• compact representation of the ID3 tree: predict 0, 1, 0, 1 on the successive intervals delimited by 5, 8.25, 8.75
• decision "surfaces": −, +, −, + on those intervals

  25. 41. Level 0: candidate stumps on [5−,5+]:
• X < 5: Yes → [4−,0+], No → [1−,5+]
• X < 8.25: Yes → [4−,3+], No → [1−,2+]
• X < 8.75: Yes → [5−,3+], No → [0−,2+]
The IG computations select X < 5 for the root.

Level 1: candidate stumps on [1−,5+]:
• X < 8.25: Yes → [0−,3+], No → [1−,2+]; IG = 0.191
• X < 8.75: Yes → [1−,3+], No → [0−,2+]; IG = 0.109
So X < 8.25 is selected.

Decision "surfaces": −, +, −, + on the intervals delimited by 5, 8.25, 8.75.

  26. 42. ID3, LOOCV: decision surfaces obtained after leaving out each instance:
• X = 1, 2, 3, 7, 10: thresholds 5, 8.25, 8.75 (unchanged); labels −, +, −, +
• X = 4: thresholds 4.5, 8.25, 8.75; labels −, +, −, +
• X = 6: thresholds 5.5, 8.25, 8.75; labels −, +, −, +
• X = 8: thresholds 5, 7.75, 8.75; labels −, +, −, +
• X = 8.5: single threshold 5; labels −, +
• X = 9: thresholds 5, 8.25, 9.25; labels −, +, −, +
LOOCV error: 3/10 (the held-out instances X = 8, 8.5 and 9 are misclassified).

  27. 43. DT2: root [5−,5+] tests X < 5: Yes → [4−,0+]: 0; No → [1−,5+]: 1. Decision "surfaces": − for X < 5, + for X > 5. (The only training error is X = 8.5, so the DT2 training error is 1/10; DT⋆ fits the consistent data exactly, so its training error is 0.)

  28. 44. DT2, LOOCV: IG computations

Case 1: one of X = 1, 2, 3, 4 is left out ([4−,5+]; the first threshold becomes 4.5 when X = 4 is removed):
• X < 5 (or 4.5): Yes → [3−,0+], No → [1−,5+]
• X < 8.25: Yes → [3−,3+], No → [1−,2+]
• X < 8.75: Yes → [4−,3+], No → [0−,2+]
The first stump has the smallest average conditional entropy, so it is selected.

Case 2: one of X = 6, 7, 8 is left out ([5−,4+]; the thresholds become 5.5 when X = 6 is removed, and 7.75 replaces 8.25 when X = 8 is removed):
• X < 5 (or 5.5): Yes → [4−,0+], No → [1−,4+]
• X < 8.25 (or 7.75): Yes → [4−,2+], No → [1−,2+]
• X < 8.75: Yes → [5−,2+], No → [0−,2+]
Again the first stump is selected.

  29. 45. DT2, LOOCV: IG computations (cont'd)

Case 3: X = 8.5 is left out ([4−,5+]): X < 5: Yes → [4−,0+], No → [0−,5+] (a perfect split).

Case 4 (mislabeled "Case 2" in the original slide): X = 9 or X = 10 is left out ([5−,4+]; the last threshold becomes 9.25 when X = 9 is removed):
• X < 5: Yes → [4−,0+], No → [1−,4+]
• X < 8.25: Yes → [4−,3+], No → [1−,1+]
• X < 8.75 (or 9.25): Yes → [5−,3+], No → [0−,1+]
The first stump is selected.

LOOCV error for DT2: 1/10.
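The whole DT2 LOOCV computation can be simulated end to end (a minimal sketch; `fit_stump` is a hypothetical helper that learns the one-split tree by maximizing IG, with thresholds midway between consecutive values, as in the problem statement):

```python
from math import log2

def H(p):
    return 0.0 if p in (0, 1) else -p * log2(p) - (1 - p) * log2(1 - p)

DATA = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1),
        (7, 1), (8, 1), (8.5, 0), (9, 1), (10, 1)]

def fit_stump(data):
    """Best single split 'X < thr' by IG; returns (thr, (left_label, right_label))."""
    n = len(data)
    h_root = H(sum(y for _, y in data) / n)
    xs = sorted({x for x, _ in data})
    best = None
    for thr in [(a + b) / 2 for a, b in zip(xs, xs[1:])]:
        left = [y for x, y in data if x < thr]
        right = [y for x, y in data if x >= thr]
        ig = h_root - (len(left) / n) * H(sum(left) / len(left)) \
                    - (len(right) / n) * H(sum(right) / len(right))
        if best is None or ig > best[0]:
            # majority label in each leaf
            labels = (round(sum(left) / len(left)), round(sum(right) / len(right)))
            best = (ig, thr, labels)
    return best[1], best[2]

# leave-one-out cross-validation for DT2 (one split, two leaves)
errors = 0
for i, (x, y) in enumerate(DATA):
    thr, (l_lab, r_lab) = fit_stump(DATA[:i] + DATA[i + 1:])
    pred = l_lab if x < thr else r_lab
    errors += (pred != y)
```

On the full dataset the stump splits at 5.0, and the loop reproduces the slide's DT2 LOOCV error of 1/10 (only X = 8.5 is misclassified when held out).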

  30. 46. Applying ID3 on a dataset with two continuous attributes: decision zones Liviu Ciortuz, 2017

  31. 47. Consider the training dataset in the nearby figure: 4 positive and 5 negative instances plotted in the region [0, 5] × [0, 4] of the (X1, X2) plane. X1 and X2 are considered continuous attributes.

Apply the ID3 algorithm on this dataset. Draw the resulting decision tree. Make a graphical representation of the decision areas and decision boundaries determined by ID3.

  32. 48. Solution. Level 1: candidate stumps on [4+,5−] (entropy H(4/9)):
• X1 < 5/2: Y → [2+,0−] (H = 0), N → [2+,5−] (H(2/7)); H[Y|·] = (7/9)·H(2/7); IG = 0.319
• X1 < 9/2: Y → [2+,4−], N → [2+,1−]
• X2 < 3/2: Y → [1+,1−] (H = 1), N → [3+,4−] (H(3/7))
• X2 < 5/2: Y → [3+,2−] (H(2/5)), N → [1+,3−] (H(1/4)); H[Y|·] = (5/9)·H(2/5) + (4/9)·H(1/4); IG = 0.091
• X2 < 7/2: Y → [4+,2−] (H(1/3)), N → [0+,3−] (H = 0); H[Y|·] = (2/3)·H(1/3); IG = 0.378

The largest information gain is obtained by X2 < 7/2, so it is placed at the root.

  33. 49. Level 2: candidate stumps on [4+,2−] (the X2 < 7/2, Yes branch):
• X1 < 5/2: Y → [2+,0−] (H = 0), N → [2+,2−] (H = 1); H[Y|·] = 2/3; IG = 0.251
• X1 < 4: Y → [2+,2−] (H = 1), N → [2+,0−] (H = 0); IG = 0.251
• X2 < 3/2: Y → [1+,1−] (H = 1), N → [3+,1−] (H(1/4)); H[Y|·] = 1/3 + (2/3)·H(1/4); IG = 0.04
• X2 < 5/2: Y → [3+,2−] (H(2/5)), N → [1+,0−] (H = 0); H[Y|·] = (5/6)·H(2/5); IG = 0.109

Notes:
1. Split thresholds for continuous attributes must be recomputed at each new iteration, because they may change. (For instance, here above, 4 replaces 4.5 as a threshold for X1.)
2. In the current stage, i.e., for the current node in the ID3 tree, you may choose (as test) either X1 < 5/2 or X1 < 4.
3. Here above we have an example of a reverse relationship between weighted and, respectively, un-weighted specific entropies: H[2+,2−] > H[3+,2−] but (4/6)·H[2+,2−] < (5/6)·H[3+,2−].

  34. 50. The final decision tree: the root [4+,5−] tests X2 < 7/2: N → [0+,3−]: −; Y → [4+,2−] tests X1 < 5/2: Y → [2+,0−]: +; N → [2+,2−] tests X1 < 4: Y → [0+,2−]: −, N → [2+,0−]: +.

Decision areas (for X2 < 7/2): + for X1 < 5/2, − for 5/2 ≤ X1 < 4, + for X1 ≥ 4; and − for X2 ≥ 7/2.

  35. 51. Other criteria than IG for the best attribute selection in ID3: Gini impurity / index and Misclassification impurity CMU, 2003 fall, T. Mitchell, A. Moore, HW1, pr. 4

  36. 52. Entropy is a natural measure to quantify the impurity of a data set. The Decision Tree learning algorithm uses entropy as a splitting criterion by calculating the information gain to decide the next attribute to partition the current node. However, there are other impurity measures that could be used as the splitting criteria too. Let's investigate two of them. Assume the current node n has k classes c_1, c_2, ..., c_k.

• Gini impurity: i(n) = 1 − Σ_{i=1}^k P²(c_i).
• Misclassification impurity: i(n) = 1 − max_{i=1,...,k} P(c_i).

a. Assume node n has two classes, c_1 and c_2. Please draw a figure in which the three impurity measures (Entropy, Gini and Misclassification) are represented as functions of P(c_1).

  37. 53. Answer (the three curves on [0, 1], as functions of p = P(c_1)):

Entropy(p) = −p·log₂ p − (1 − p)·log₂(1 − p)
Gini(p) = 1 − p² − (1 − p)² = 2p(1 − p)
MisClassif(p) = 1 − (1 − p) = p for p ∈ [0, 1/2), and 1 − p for p ∈ [1/2, 1]

All three vanish at p = 0 and p = 1 and peak at p = 1/2, where they take the values 1, 1/2 and 1/2, respectively.
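The three curves can be written as one-line helpers (a minimal sketch; the function names are mine, not from the exercise):

```python
from math import log2

def entropy(p):
    """Entropy of a Bernoulli(p) variable, in bits."""
    return 0.0 if p in (0, 1) else -p * log2(p) - (1 - p) * log2(1 - p)

def gini(p):
    """Gini impurity: 1 - p**2 - (1-p)**2, which simplifies to 2p(1-p)."""
    return 2 * p * (1 - p)

def misclassif(p):
    """Misclassification impurity: p on [0, 1/2), 1-p on [1/2, 1]."""
    return min(p, 1 - p)
```

Plotting these on [0, 1] (e.g. with matplotlib) reproduces the figure: all three peak at p = 1/2.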

  38. 54. b. Now we can define new splitting criteria based on the Gini and Misclassification impurities, called Drop-of-Impurity in some of the literature: the difference between the impurity of the current node and the weighted sum of the impurities of its children. For binary splits, the Drop-of-Impurity is defined as

Δi(n) = i(n) − P(n_l)·i(n_l) − P(n_r)·i(n_r),

where n_l and n_r are the left and, respectively, the right child of node n after splitting. Please calculate the Drop-of-Impurity (using both Gini and Misclassification impurity) for the following example data set, in which C is the class variable to be predicted.

A: a1 a1 a1 a2 a2 a2
C: c1 c1 c2 c2 c2 c2

  39. 55. Answer. The split on A partitions the root [2+,4−] into a1 → [2+,1−] (node 1) and a2 → [0+,3−] (node 2).

Gini: p = 2/6 = 1/3 at the root, so
i(0) = 2·(1/3)·(1 − 1/3) = 2·(1/3)·(2/3) = 4/9
i(1) = 2·(2/3)·(1 − 2/3) = 4/9
i(2) = 0
⇒ Δi(0) = 4/9 − (3/6)·(4/9) − (3/6)·0 = 4/9 − 2/9 = 2/9.

Misclassification: p = 1/3 < 1/2, so
i(0) = 1/3, i(1) = 1/3, i(2) = 0
⇒ Δi(0) = 1/3 − (1/2)·(1/3) = 1/3 − 1/6 = 1/6.
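These two drops of impurity can be checked with count-based helpers (a sketch; the function names are mine):

```python
def gini(counts):
    """Gini impurity of a node given per-class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def misclassif(counts):
    """Misclassification impurity of a node given per-class counts."""
    n = sum(counts)
    return 1.0 - max(counts) / n

def drop(parent, children, impurity):
    """Drop-of-Impurity: i(n) minus the weighted impurities of the children."""
    n = sum(parent)
    return impurity(parent) - sum(sum(ch) / n * impurity(ch) for ch in children)

# Root [2 c1, 4 c2] split by A into a1 -> [2, 1] and a2 -> [0, 3]
d_gini = drop([2, 4], [[2, 1], [0, 3]], gini)
d_mis = drop([2, 4], [[2, 1], [0, 3]], misclassif)
```

This reproduces Δi(0) = 2/9 for Gini and 1/6 for Misclassification.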

  40. 56. c. We choose the attribute that maximizes the Drop-of-Impurity to split a node. Please create a data set and show that on this data set, the Misclassification-based Δi(n) couldn't determine which attribute should be used for splitting (e.g., Δi(n) = 0 for all the attributes), but the Information Gain and the Gini-based Δi(n) can.

Answer
A: a1 a1 a1 a2 a2 a2 a2
C: c1 c2 c2 c2 c2 c2 c1

Entropy: Δi(0) = H[5+,2−] − (3/7)·H[2+,1−] − (4/7)·H[3+,1−] = 0.006 ≠ 0;

Gini: Δi(0) = 2·(2/7)·(5/7) − (3/7)·2·(1/3)·(2/3) − (4/7)·2·(1/4)·(3/4) = 20/49 − 4/21 − 3/14 = 2·(10/49 − 17/84) = 1/294 ≠ 0;

Misclassification: Δi(0) = 2/7 − [(3/7)·(1/3) + (4/7)·(1/4)] = 2/7 − 2/7 = 0.

  41. 57. Note: A [quite bad] property. Suppose node n holds [C_1+, C_2−] and its children, after splitting on A (a1 / a2), hold [C_1^l+, C_2^l−] and [C_1^r+, C_2^r−]. If C_1 < C_2, C_1^l < C_2^l and C_1^r < C_2^r (with C_1 = C_1^l + C_1^r and C_2 = C_2^l + C_2^r), then the Drop-of-Impurity based on Misclassification is 0.

Proof:
Δi(n) = C_1/(C_1+C_2) − [(C_1^l+C_2^l)/(C_1+C_2)] · C_1^l/(C_1^l+C_2^l) − [(C_1^r+C_2^r)/(C_1+C_2)] · C_1^r/(C_1^r+C_2^r)
= C_1/(C_1+C_2) − (C_1^l + C_1^r)/(C_1+C_2)
= C_1/(C_1+C_2) − C_1/(C_1+C_2) = 0.

  42. 58. Exemplifying pre- and post-pruning of decision trees using a threshold for the Information Gain CMU, 2006 spring, Carlos Guestrin, midterm, pr. 4 [ adapted by Liviu Ciortuz ]

  43. 59. Starting from a training set of five instances over the binary attributes V, W, X, with binary output Y (overall counts [3+,2−]), the ID3 algorithm builds the decision tree shown nearby: the root tests X (X = 1 → leaf 1); the X = 0 branch tests V; each child of V tests W, with leaves W = 0 → 0 and W = 1 → 1 under V = 0, and W = 0 → 1 and W = 1 → 0 under V = 1.

a. One idea for pruning such a decision tree would be to start at the root, and prune splits for which the information gain (or some other criterion) is less than some small ε. This is called top-down pruning. What is the decision tree returned for ε = 0.0001? What is the training set error for this tree?

  44. 60. Answer

We will first augment the given decision tree with information regarding the data partitions (i.e., the number of positive and, respectively, negative instances) which were assigned to each test node during the application of the ID3 algorithm: root X [3+;2−]; X = 1 → [1+;0−]; X = 0 → V [2+;2−]; V's children [1+;1−] and [1+;1−]; the W leaves [0+;1−], [1+;0−], [1+;0−], [0+;1−].

The information gain yielded by the attribute X in the root node is:
H[3+;2−] − (1/5)·0 − (4/5)·1 = 0.971 − 0.8 = 0.171 > ε.
Therefore, this node will not be eliminated from the tree.

The information gain for the attribute V (in the left-hand side child of the root node) is:
H[2+;2−] − (1/2)·1 − (1/2)·1 = 1 − 1 = 0 < ε.
So the whole left subtree will be cut off and replaced by a decision node, as shown nearby: X = 0 → 0, X = 1 → 1. The training error produced by this tree is 2/5.

  45. 61. b. Another option would be to start at the leaves, and prune subtrees for which the information gain (or some other criterion) of a split is less than some small ε. In this method, no ancestor of children with high information gain will get pruned. This is called bottom-up pruning. What is the tree returned for ε = 0.0001? What is the training set error for this tree?

Answer: The information gain of V is IG(Y; V) = 0. A step later, the information gain of W (for either one of the descendant nodes of V) is IG(Y; W) = 1. So bottom-up pruning won't delete any nodes and the tree [given in the problem statement] remains unchanged. The training error is 0.

  46. 62. c. Discuss when you would want to choose bottom-up pruning over top-down pruning and vice versa.

Answer: Top-down pruning is computationally cheaper. When building the tree we can determine when to stop (no need for real pruning). But, as we saw, top-down pruning prunes too much. On the other hand, bottom-up pruning is more expensive, since we have to first build a full tree (which can be exponentially large) and then apply pruning. The second problem with bottom-up pruning is that superfluous attributes may fool it (see CMU, 2009 fall, Carlos Guestrin, HW1, pr. 2.4). The third problem with it is that in the lower levels of the tree the number of examples in the subtree gets smaller, so information gain might be an inappropriate criterion for pruning; one would usually use a statistical test instead.

  47. 63. Exemplifying χ 2 -Based Pruning of Decision Trees CMU, 2010 fall, Ziv Bar-Joseph, HW2, pr. 2.1

  48. 64. In class, we learned a decision tree pruning algorithm that iteratively visited subtrees and used a validation dataset to decide whether to remove the subtree. However, sometimes it is desirable to prune the tree after training on all of the available data. One such approach is based on statistical hypothesis testing. After learning the tree, we visit each internal node and test whether the attribute split at that node is actually uncorrelated with the class labels. We hypothesize that the attribute is independent and then use Pearson's chi-square test to generate a test statistic that may provide evidence that we should reject this "null" hypothesis. If we fail to reject the hypothesis, we prune the subtree at that node.

  49. 65. a. At each internal node we can create a contingency table for the training examples that pass through that node on their paths to the leaves. The table will have the c class labels associated with the columns and the r values of the split attribute associated with the rows. Each entry O_{i,j} in the table is the number of times we observe a training sample with that attribute value and label, where i is the row index that corresponds to an attribute value and j is the column index that corresponds to a class label.

In order to calculate the chi-square test statistic, we need a similar table of expected counts. The expected count is the number of observations we would expect if the class and attribute are independent. Derive a formula for each expected count E_{i,j} in the table.

Hint: What is the probability that a training example that passes through the node has a particular label? Using this probability and the independence assumption, what can you say about how many examples with a specific attribute value are expected to also have the class label?

  50. 66. b. Given these two tables for the split, you can now calculate the chi-square test statistic

χ² = Σ_{i=1}^r Σ_{j=1}^c (O_{i,j} − E_{i,j})² / E_{i,j}

with (r − 1)(c − 1) degrees of freedom. You can plug the test statistic and degrees of freedom into a software package^a or an online calculator^b to calculate a p-value. Typically, if p < 0.05 we reject the null hypothesis that the attribute and class are independent and say the split is statistically significant.

The decision tree given on the next slide was built from the data in the nearby table. For each of the 3 internal nodes in the decision tree, show the p-value for the split and state whether it is statistically significant. How many internal nodes will the tree have if we prune splits with p ≥ 0.05?

a. Use 1-chi2cdf(x,df) in MATLAB or CHIDIST(x,df) in Excel.
b. https://en.m.wikipedia.org/wiki/Chi-square_distribution

  51. 67. Input: a dataset of 12 training instances over the binary attributes X1, X2, X3, X4 and a binary Class label (root counts [4−,8+]; the relevant counts appear in the contingency tables below), together with the decision tree ID3 builds from it: the root tests X4; X4 = 1 → [0−,6+]: leaf 1; X4 = 0 → [4−,2+], test X1; X1 = 0 → [3−,0+]: leaf 0; X1 = 1 → [1−,2+], test X2; X2 = 0 → [0−,2+]: leaf 1; X2 = 1 → [1−,0+]: leaf 0.

  52. 68. Idea: while traversing the ID3 tree (usually in a bottom-up manner), remove the nodes for which there is not enough ("significant") statistical evidence of a dependence between the values of the input attribute tested in that node and the values of the output attribute (the labels), as supported by the set of instances assigned to that node.

  53. 69. Contingency tables

O_{X4} (N = 12):       Class = 0   Class = 1
X4 = 0                 4           2
X4 = 1                 0           6
⇒ P(X4 = 0) = 6/12 = 1/2, P(X4 = 1) = 1/2; P(Class = 0) = 4/12 = 1/3, P(Class = 1) = 2/3

O_{X1 | X4=0} (N = 6): Class = 0   Class = 1
X1 = 0                 3           0
X1 = 1                 1           2
⇒ P(X1 = 0 | X4 = 0) = 3/6 = 1/2, P(X1 = 1 | X4 = 0) = 1/2; P(Class = 0 | X4 = 0) = 4/6 = 2/3, P(Class = 1 | X4 = 0) = 1/3

O_{X2 | X4=0, X1=1} (N = 3): Class = 0   Class = 1
X2 = 0                       0           2
X2 = 1                       1           0
⇒ P(X2 = 0 | X4 = 0, X1 = 1) = 2/3, P(X2 = 1 | X4 = 0, X1 = 1) = 1/3; P(Class = 0 | X4 = 0, X1 = 1) = 1/3, P(Class = 1 | X4 = 0, X1 = 1) = 2/3

  54. 70. The reasoning that leads to the computation of the expected number of observations:

P(A = i) = (Σ_{k=1}^c O_{i,k}) / N and P(C = j) = (Σ_{k=1}^r O_{k,j}) / N

P(A = i, C = j) =(indep.) P(A = i) · P(C = j) = (Σ_{k=1}^c O_{i,k}) · (Σ_{k=1}^r O_{k,j}) / N²

E[A = i, C = j] = N · P(A = i, C = j) = (row-i total) · (column-j total) / N
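The formula E_{i,j} = (row total)·(column total)/N translates directly into code (a minimal sketch; the function name is mine):

```python
def expected_counts(O):
    """Expected table under independence: E[i][j] = row_i_total * col_j_total / N."""
    n = sum(map(sum, O))
    col_totals = [sum(row[j] for row in O) for j in range(len(O[0]))]
    return [[sum(row) * c / n for c in col_totals] for row in O]

# Observed table at the root (split on X4), from the contingency tables above
E_X4 = expected_counts([[4, 2], [0, 6]])
```

For the root split this gives the expected table [[2, 4], [2, 4]], matching the next slide.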

  55. 71. Expected number of observations

E_{X4}:               Class = 0   Class = 1
X4 = 0                2           4
X4 = 1                2           4

E_{X1 | X4=0}:        Class = 0   Class = 1
X1 = 0                2           1
X1 = 1                2           1

E_{X2 | X4=0, X1=1}:  Class = 0   Class = 1
X2 = 0                2/3         4/3
X2 = 1                1/3         2/3

For example, E_{X4}(X4 = 0, Class = 0): N = 12, P(X4 = 0) = 1/2 and P(Class = 0) = 1/3, so
N · P(X4 = 0, Class = 0) = N · P(X4 = 0) · P(Class = 0) = 12 · (1/2) · (1/3) = 2.

  56. 72. χ² statistics

χ² = Σ_{i=1}^r Σ_{j=1}^c (O_{i,j} − E_{i,j})² / E_{i,j}

χ²_{X4} = (4−2)²/2 + (0−2)²/2 + (2−4)²/4 + (6−4)²/4 = 2 + 2 + 1 + 1 = 6
χ²_{X1 | X4=0} = (3−2)²/2 + (1−2)²/2 + (0−1)²/1 + (2−1)²/1 = 1/2 + 1/2 + 1 + 1 = 3
χ²_{X2 | X4=0, X1=1} = (0−2/3)²/(2/3) + (2−4/3)²/(4/3) + (1−1/3)²/(1/3) + (0−2/3)²/(2/3) = 2/3 + 1/3 + 4/3 + 2/3 = 3

p-values (df = 1): 0.0143, 0.0833 and, respectively, 0.0833. Only the first of these p-values is smaller than 0.05; therefore the root node (X4) cannot be pruned, while the splits on X1 and X2 are not statistically significant and will be pruned.
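For these 2×2 tables the statistic and its p-value are easy to verify; with one degree of freedom the chi-square survival function reduces to erfc(√(x/2)) (a sketch; for general df one would instead use a library routine such as scipy.stats.chi2.sf):

```python
from math import erfc, sqrt

def chi2_stat(O, E):
    """Pearson's chi-square statistic for an observed/expected table pair."""
    return sum((o - e) ** 2 / e for ro, re in zip(O, E) for o, e in zip(ro, re))

def p_value_df1(x):
    """Survival function of the chi-square distribution with 1 degree of freedom."""
    return erfc(sqrt(x / 2))

# Root split on X4: observed vs. expected tables from the previous slides
x = chi2_stat([[4, 2], [0, 6]], [[2, 4], [2, 4]])
p = p_value_df1(x)
```

This reproduces χ² = 6 with p ≈ 0.0143, and p_value_df1(3) ≈ 0.0833 for the two other splits.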

  57. 73. [Figure: the p-value as a function of Pearson's cumulative test statistic χ², plotted for k = 1, 2, 3, 4, 6, 9 degrees of freedom.]

  58. 74. Output (pruned tree) for the 95% confidence level: the single test node X4, with leaves X4 = 0 → 0 and X4 = 1 → 1.

  59. 75. The AdaBoost algorithm: why was it designed the way it was designed, and the convergence of the training error , in certain conditions CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.1-5 CMU, 2009 fall, Carlos Guestrin, HW2, pr. 3.1 CMU, 2009 fall, Eric Xing, HW3, pr. 4.2.2

  60. 76. Consider m training examples S = {(x_1, y_1), ..., (x_m, y_m)}, where x ∈ X and y ∈ {−1, 1}. Suppose we have a weak learning algorithm A which produces a hypothesis h: X → {−1, 1} given any distribution D of examples. AdaBoost is an iterative algorithm which works as follows:

• Begin with a uniform distribution D_1(i) = 1/m, i = 1, ..., m.
• At each iteration t = 1, ..., T:
  − run the weak learning algorithm A on the distribution D_t and produce the hypothesis h_t;
    Note (1): Since A is a weak learning algorithm, the produced hypothesis h_t at round t is only slightly better than random guessing, say, by a margin γ_t: ε_t = err_{D_t}(h_t) = Pr_{x∼D_t}[y ≠ h_t(x)] = 1/2 − γ_t.
    Note (2): If at a certain iteration t ≤ T the weak classifier A cannot produce a hypothesis better than random guessing (i.e., γ_t = 0) or it produces a hypothesis for which ε_t = 0, then the AdaBoost algorithm should be stopped.
  − update the distribution: D_{t+1}(i) = (1/Z_t) · D_t(i) · e^{−α_t y_i h_t(x_i)} for i = 1, ..., m,  (2)
    where α_t =(def) (1/2)·ln((1 − ε_t)/ε_t), and Z_t is the normalizer.
• In the end, deliver H_T = sign(Σ_{t=1}^T α_t h_t) as the learned hypothesis, which will act as a weighted majority vote.

  61. 77. We will prove that the training error err_S(H_T) of AdaBoost decreases at a very fast rate, and in certain cases it converges to 0.

Important Remark: The above formulation of the AdaBoost algorithm states no restriction on the hypothesis h_t delivered by the weak classifier A at iteration t, except that ε_t < 1/2. However, in another formulation of the AdaBoost algorithm (in a more general setup; see for instance MIT, 2006 fall, Tommi Jaakkola, HW4, problem 3), it is requested / recommended that the hypothesis h_t be chosen by (approximately) optimizing the weighted training error criterion over a whole class of hypotheses, like, for instance, decision trees of depth 1 (decision stumps). In this problem we will not be concerned with such a request, but we will comply with it, for instance, in problem CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.6, when showing how AdaBoost works in practice.

  62. 78. a. Prove the following relationships:

i. Z_t = e^{−α_t}·(1 − ε_t) + e^{α_t}·ε_t (a consequence of (2))
ii. Z_t = 2·√(ε_t(1 − ε_t)) (a consequence of i. and the value stated for α_t in the AdaBoost pseudo-code)
iii. 0 < Z_t < 1 (a consequence derivable from ii.)
iv. D_{t+1}(i) = D_t(i)/(2ε_t) for i ∈ M =(def) {i | y_i ≠ h_t(x_i)}, i.e., the mistake set, and
   D_{t+1}(i) = D_t(i)/(2(1 − ε_t)) for i ∈ C =(def) {i | y_i = h_t(x_i)}, i.e., the correct set
   (consequences derivable from (2) and ii.)
v. ε_i > ε_j ⇒ α_i < α_j (a consequence of the value stated for α_t in the AdaBoost pseudo-code)
vi. err_{D_{t+1}}(h_t) = (1/Z_t)·e^{α_t}·ε_t, where err_{D_{t+1}}(h_t) =(def) Pr_{D_{t+1}}({x_i | h_t(x_i) ≠ y_i})
vii. err_{D_{t+1}}(h_t) = 1/2 (a consequence derivable from ii. and vi.)
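Relationships i., ii., iv. and vii. can be checked numerically on a toy round of boosting (a sketch; the weak hypothesis and its mistake set below are made up for illustration):

```python
import math

m = 10
mistakes = [True] * 3 + [False] * 7   # toy weak hypothesis: wrong on 3 of 10 points
D = [1 / m] * m                       # current distribution D_t (uniform here)

eps = sum(d for d, w in zip(D, mistakes) if w)        # weighted error = 0.3
alpha = 0.5 * math.log((1 - eps) / eps)
Z = sum(d * math.exp(alpha if w else -alpha) for d, w in zip(D, mistakes))
D_next = [d * math.exp(alpha if w else -alpha) / Z for d, w in zip(D, mistakes)]

check_i = math.isclose(Z, math.exp(-alpha) * (1 - eps) + math.exp(alpha) * eps)
check_ii = math.isclose(Z, 2 * math.sqrt(eps * (1 - eps)))
# relationship vii: under D_{t+1}, h_t has reweighted error exactly 1/2
err_next = sum(d for d, w in zip(D_next, mistakes) if w)
```

The reweighted error coming out at exactly 1/2 is the point of the update: the next round's weak learner cannot reuse h_t.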

  63. 79. Solution

a/i. Since Z_t is the normalization factor for the distribution D_{t+1}, we can write:
Z_t = Σ_{i=1}^m D_t(i)·e^{−α_t y_i h_t(x_i)} = Σ_{i∈C} D_t(i)·e^{−α_t} + Σ_{i∈M} D_t(i)·e^{α_t} = (1 − ε_t)·e^{−α_t} + ε_t·e^{α_t}.  (3)

a/ii. Since α_t =(def) (1/2)·ln((1 − ε_t)/ε_t), it follows that
e^{α_t} = e^{(1/2)·ln((1−ε_t)/ε_t)} = e^{ln √((1−ε_t)/ε_t)} = √((1 − ε_t)/ε_t)  (4)
and
e^{−α_t} = 1/e^{α_t} = √(ε_t/(1 − ε_t)).  (5)
So,
Z_t = (1 − ε_t)·√(ε_t/(1 − ε_t)) + ε_t·√((1 − ε_t)/ε_t) = 2·√(ε_t(1 − ε_t)).  (6)
Note that (1 − ε_t)/ε_t > 1 because ε_t ∈ (0, 1/2); therefore α_t > 0.

  64. 80. a/iii. The second order function ε_t(1 − ε_t) reaches its maximum value for ε_t = 1/2, and the maximum is 1/4. Since ε_t ∈ (0, 1/2), it follows from (6) that Z_t > 0 and Z_t < 2·√(1/4) = 1.

a/iv. Based on (2), we can write: D_{t+1}(i) = (1/Z_t)·D_t(i)·e^{α_t} for i ∈ M, and (1/Z_t)·D_t(i)·e^{−α_t} for i ∈ C. Therefore,
i ∈ M ⇒ D_{t+1}(i) = (1/Z_t)·D_t(i)·e^{α_t} =((4),(6)) (1/(2·√(ε_t(1−ε_t))))·D_t(i)·√((1−ε_t)/ε_t) = D_t(i)/(2ε_t)
i ∈ C ⇒ D_{t+1}(i) = (1/Z_t)·D_t(i)·e^{−α_t} =((5),(6)) (1/(2·√(ε_t(1−ε_t))))·D_t(i)·√(ε_t/(1−ε_t)) = D_t(i)/(2(1−ε_t)).

  65. 81. a/v. Starting from the definition α_t = ln(√((1 − ε_t)/ε_t)), we can write:
α_i < α_j ⇔ ln(√((1−ε_i)/ε_i)) < ln(√((1−ε_j)/ε_j))
Further on, since both ln and √ are strictly increasing functions, it follows that
α_i < α_j ⇔ (1−ε_i)/ε_i < (1−ε_j)/ε_j ⇔(ε_i, ε_j > 0) ε_j·(1−ε_i) < ε_i·(1−ε_j) ⇔ ε_j − ε_iε_j < ε_i − ε_iε_j ⇔ ε_i > ε_j

a/vi. It is easy to see that
err_{D_{t+1}}(h_t) = Σ_{i=1}^m D_{t+1}(i)·1{y_i ≠ h_t(x_i)} = (1/Z_t)·Σ_{i∈M} D_t(i)·e^{α_t} = (1/Z_t)·ε_t·e^{α_t}  (7)

a/vii. By substituting (6) and (4) into (7), we get:
err_{D_{t+1}}(h_t) = (1/Z_t)·ε_t·e^{α_t} = (1/(2·√(ε_t(1−ε_t))))·ε_t·√((1−ε_t)/ε_t) = 1/2.

  66. 82. b. Show that $D_{T+1}(i) = \left(m\cdot\prod_{t=1}^{T} Z_t\right)^{-1} e^{-y_i f(x_i)}$, where $f(x) = \sum_{t=1}^{T}\alpha_t h_t(x)$. c. Show that $\operatorname{err}_S(H_T) \le \prod_{t=1}^{T} Z_t$, where $\operatorname{err}_S(H_T) \overset{\text{not.}}{=} \frac{1}{m}\sum_{i=1}^{m}\mathbf{1}_{\{H_T(x_i)\neq y_i\}}$ is the training error produced by AdaBoost. d. Obviously, we would like to minimize the test set error produced by AdaBoost, but it is hard to do so directly. We thus settle for greedily optimizing the upper bound on the training error found at part c. Observe that $Z_1, \dots, Z_{t-1}$ are determined by the first $t-1$ iterations, and we cannot change them at iteration $t$. A greedy step we can take to minimize the training set error bound on round $t$ is to minimize $Z_t$. Prove that the value of $\alpha_t$ that minimizes $Z_t$ (among all possible values for $\alpha_t$) is indeed $\alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}$ (see the previous slide). e. Show that $\prod_{t=1}^{T} Z_t \le e^{-2\sum_{t=1}^{T}\gamma_t^2}$. f. From parts c and e, we know the training error decreases at an exponential rate with respect to $T$. Assume that there is a number $\gamma > 0$ such that $\gamma \le \gamma_t$ for $t = 1, \dots, T$. (This $\gamma$ is called a guarantee of empirical $\gamma$-weak learnability.) How many rounds are needed to achieve a training error $\varepsilon > 0$? Please express in big-$O$ notation, $T = O(\cdot)$.

  67. 83. Solution b. We will expand $D_t(i)$ recursively: $D_{T+1}(i) = \frac{1}{Z_T} D_T(i)\, e^{-\alpha_T y_i h_T(x_i)} = D_{T-1}(i)\,\frac{1}{Z_{T-1}}\, e^{-\alpha_{T-1} y_i h_{T-1}(x_i)}\,\frac{1}{Z_T}\, e^{-\alpha_T y_i h_T(x_i)} = \dots = D_1(i)\,\frac{1}{\prod_{t=1}^{T} Z_t}\, e^{-\sum_{t=1}^{T}\alpha_t y_i h_t(x_i)} = \frac{1}{m\cdot\prod_{t=1}^{T} Z_t}\, e^{-y_i f(x_i)}.$

  68. 84. c. We will make use of the fact that the exponential loss function upper bounds the 0-1 loss function, i.e. $\mathbf{1}_{\{x<0\}} \le e^{-x}$: [Figure: the 0-1 loss $\mathbf{1}_{\{x<0\}}$ lying below the exponential loss $e^{-x}$.] $\operatorname{err}_S(H_T) = \frac{1}{m}\sum_{i=1}^{m}\mathbf{1}_{\{y_i f(x_i) < 0\}} \le \frac{1}{m}\sum_{i=1}^{m} e^{-y_i f(x_i)} \overset{b.}{=} \frac{1}{m}\sum_{i=1}^{m} D_{T+1}(i)\cdot m\cdot\prod_{t=1}^{T} Z_t = \Big(\underbrace{\textstyle\sum_{i=1}^{m} D_{T+1}(i)}_{1}\Big)\cdot\prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} Z_t.$

  69. 85. d. We will start from the equation $Z_t = \varepsilon_t\cdot e^{\alpha_t} + (1-\varepsilon_t)\cdot e^{-\alpha_t}$, which was proven at part a. Note that $\varepsilon_t$ (the error produced by $h_t$, the hypothesis produced by the weak classifier $A$ at the current step) is a constant with respect to $\alpha_t$. Then we proceed as usual, setting the partial derivative w.r.t. $\alpha_t$ to zero: $\frac{\partial}{\partial\alpha_t}\left(\varepsilon_t\cdot e^{\alpha_t} + (1-\varepsilon_t)\cdot e^{-\alpha_t}\right) = 0 \Leftrightarrow \varepsilon_t\cdot e^{\alpha_t} - (1-\varepsilon_t)\cdot e^{-\alpha_t} = 0 \Leftrightarrow \varepsilon_t\cdot(e^{\alpha_t})^2 = 1-\varepsilon_t \Leftrightarrow e^{2\alpha_t} = \frac{1-\varepsilon_t}{\varepsilon_t} \Leftrightarrow 2\alpha_t = \ln\frac{1-\varepsilon_t}{\varepsilon_t} \Leftrightarrow \alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}.$ Note that $\frac{1-\varepsilon_t}{\varepsilon_t} > 1$ (and therefore $\alpha_t > 0$) because $\varepsilon_t \in (0, 1/2)$. It can also be immediately shown that $\alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}$ is indeed the value of $\alpha_t$ at which the expression $\varepsilon_t\cdot e^{\alpha_t} + (1-\varepsilon_t)\cdot e^{-\alpha_t}$, and therefore $Z_t$ too, reaches its minimum: $\varepsilon_t\cdot e^{\alpha_t} - (1-\varepsilon_t)\cdot e^{-\alpha_t} > 0 \Leftrightarrow e^{2\alpha_t} > \frac{1-\varepsilon_t}{\varepsilon_t} \Leftrightarrow \alpha_t > \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t},$ so the derivative changes sign from negative to positive at this point.
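A quick numerical cross-check of part d (a sketch, not part of the original solution): grid-search the minimizer of $Z(\alpha)$ for an arbitrary $\varepsilon_t = 0.3$ and compare it with the closed form.

```python
import math

eps = 0.3   # arbitrary weighted error in (0, 1/2), chosen for illustration
# grid-search the minimizer of Z(alpha) = eps*e^alpha + (1-eps)*e^{-alpha}
alphas = [i / 10000.0 for i in range(1, 30000)]
Zs = [eps * math.exp(a) + (1 - eps) * math.exp(-a) for a in alphas]
a_best = alphas[Zs.index(min(Zs))]
a_closed = 0.5 * math.log((1 - eps) / eps)   # the closed-form minimizer
assert abs(a_best - a_closed) < 1e-3
```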

  70. 86. [Figure: plots of three $Z(\beta)$ functions, for $\varepsilon_t = 1/10$, $1/4$, and $2/5$.] $Z(\beta) = \varepsilon_t\cdot\beta + (1-\varepsilon_t)\cdot\frac{1}{\beta},$ where $\beta \overset{\text{not.}}{=} e^{\alpha}$ ($\alpha$ being free(!) here) and $\varepsilon_t$ is fixed. It implies that $\beta_{\min} = \sqrt{\frac{1-\varepsilon_t}{\varepsilon_t}}, \quad Z(\beta_{\min}) = \dots = 2\sqrt{\varepsilon_t(1-\varepsilon_t)}, \quad \alpha_{\min} = \ln\beta_{\min} = \ln\sqrt{\frac{1-\varepsilon_t}{\varepsilon_t}}.$

  71. 87. e. Making use of relationship (6) proven at part a, and using the fact that $1 - x \le e^{-x}$ for all $x \in \mathbb{R}$ [Figure: the line $1-x$ lying below the curve $e^{-x}$], we can write: $\prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} 2\sqrt{\varepsilon_t(1-\varepsilon_t)} = \prod_{t=1}^{T} 2\sqrt{\left(\frac{1}{2}-\gamma_t\right)\left(1-\left(\frac{1}{2}-\gamma_t\right)\right)} = \prod_{t=1}^{T}\sqrt{1-4\gamma_t^2} \le \prod_{t=1}^{T}\sqrt{e^{-4\gamma_t^2}} = \prod_{t=1}^{T} e^{-2\gamma_t^2} = e^{-2\sum_{t=1}^{T}\gamma_t^2}.$
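The inequality of part e can be probed numerically for a hypothetical sequence of edges $\gamma_t = 1/2 - \varepsilon_t$ (the values below are chosen arbitrarily for illustration):

```python
import math

gammas = [0.05, 0.1, 0.2, 0.15]   # hypothetical edges gamma_t = 1/2 - eps_t
prod_Z = 1.0
for g in gammas:
    eps = 0.5 - g
    prod_Z *= 2 * math.sqrt(eps * (1 - eps))   # Z_t from (6)
bound = math.exp(-2 * sum(g * g for g in gammas))
assert prod_Z <= bound   # part e: prod Z_t <= exp(-2 * sum gamma_t^2)
```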

  72. 88. f. [Figure: $\gamma \le \gamma_t$ for every $t$, with $\varepsilon_t = \frac{1}{2} - \gamma_t \in (0, 1/2)$.] From the results obtained at parts c and e, we get: $\operatorname{err}_S(H_T) \le e^{-2\sum_{t=1}^{T}\gamma_t^2} \le \left(e^{-2\gamma^2}\right)^T = e^{-2T\gamma^2},$ since $\gamma \le \gamma_t$ for all $t$. Therefore, $\operatorname{err}_S(H_T) < \varepsilon$ if $e^{-2T\gamma^2} < \varepsilon \Leftrightarrow -2T\gamma^2 < \ln\varepsilon \Leftrightarrow 2T\gamma^2 > \ln\frac{1}{\varepsilon} \Leftrightarrow T > \frac{1}{2\gamma^2}\ln\frac{1}{\varepsilon}.$ Hence we need $T = O\left(\frac{1}{\gamma^2}\ln\frac{1}{\varepsilon}\right)$. Note: It follows that $\operatorname{err}_S(H_T) \to 0$ as $T \to \infty$.
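The bound $T > \frac{1}{2\gamma^2}\ln\frac{1}{\varepsilon}$ translates directly into code; the helper below (an illustrative sketch, the function name is our own) returns the smallest integer number of rounds for which $e^{-2T\gamma^2} < \varepsilon$:

```python
import math

def rounds_needed(gamma, eps):
    """Smallest integer T with exp(-2*T*gamma**2) < eps,
    i.e. T > ln(1/eps) / (2*gamma**2)."""
    return math.floor(math.log(1.0 / eps) / (2 * gamma ** 2)) + 1

T = rounds_needed(0.1, 0.01)
assert math.exp(-2 * T * 0.1 ** 2) < 0.01          # T rounds suffice
assert math.exp(-2 * (T - 1) * 0.1 ** 2) >= 0.01   # T - 1 rounds do not
```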

  73. 89. Exemplifying the application of the AdaBoost algorithm CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.6

  74. 90. Consider the training dataset in the nearby figure. Run $T = 3$ iterations of AdaBoost with decision stumps (axis-aligned separators) as the base learners. Illustrate the learned weak hypotheses $h_t$ in this figure and fill in the table given below. (For the pseudo-code of the AdaBoost algorithm, see CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.1-5. Please read the Important Remark that follows that pseudo-code!) [Figure: nine training instances $x_1, \dots, x_9$ plotted in the $(X_1, X_2)$ plane, with both axes ranging from 0 to 5.]

  t | $\varepsilon_t$ | $\alpha_t$ | $D_t(1)$ | $D_t(2)$ | $D_t(3)$ | $D_t(4)$ | $D_t(5)$ | $D_t(6)$ | $D_t(7)$ | $D_t(8)$ | $D_t(9)$ | $\operatorname{err}_S(H)$
  1 | | | | | | | | | | | |
  2 | | | | | | | | | | | |
  3 | | | | | | | | | | | |

  Note: The goal of this exercise is to help you understand how AdaBoost works in practice. It is advisable that, after understanding this exercise, you implement a program / function that calculates the weighted training error produced by a given decision stump, w.r.t. a certain probabilistic distribution ($D$) defined on the training dataset. Later on you can extend this program to a full-fledged implementation of AdaBoost.

  75. 91. Solution Unlike the graphical representation that we used until now for decision stumps (as trees of depth 1), here we will work with the following analytical representation: for a continuous attribute $X$ taking values $x \in \mathbb{R}$ and for any threshold $s \in \mathbb{R}$, we can define two decision stumps: $\operatorname{sign}(x-s) = \begin{cases} 1 & \text{if } x \ge s\\ -1 & \text{if } x < s \end{cases}$ and $\operatorname{sign}(s-x) = \begin{cases} -1 & \text{if } x \ge s\\ 1 & \text{if } x < s. \end{cases}$ For convenience, in the sequel we will denote the first decision stump by $X \ge s$ and the second by $X < s$. According to the Important Remark that follows the AdaBoost pseudo-code [see CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.1-5], at each iteration ($t$) the weak algorithm $A$ selects the/a decision stump which, among all decision stumps, has the minimum weighted training error w.r.t. the current distribution ($D_t$) on the training data.
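The analytical representation above maps directly to code. The sketch below (function names are our own, not from the homework) implements the two stump families $X \ge s$ and $X < s$ and the weighted training error $\operatorname{err}_D(h)$, along the lines suggested by the Note on the previous slide:

```python
def stump_ge(s, j):
    """Decision stump 'X_j >= s', i.e. sign(x[j] - s): +1 if x[j] >= s, else -1."""
    return lambda x: 1 if x[j] >= s else -1

def stump_lt(s, j):
    """Decision stump 'X_j < s', i.e. sign(s - x[j]): +1 if x[j] < s, else -1."""
    return lambda x: 1 if x[j] < s else -1

def weighted_error(h, X, y, D):
    """err_D(h) = sum of D(i) over the examples that h misclassifies."""
    return sum(d for x, label, d in zip(X, y, D) if h(x) != label)

# tiny usage example on a made-up 1-D dataset
X = [(1.0,), (2.0,), (3.0,)]
y = [1, -1, -1]
D = [1 / 3, 1 / 3, 1 / 3]
assert weighted_error(stump_lt(1.5, 0), X, y, D) == 0          # perfect stump
assert abs(weighted_error(stump_ge(1.5, 0), X, y, D) - 1.0) < 1e-12
```

Note that the two stumps for the same threshold have complementary errors, which is exactly the equality $\operatorname{err}_{D}(X \ge s) = 1 - \operatorname{err}_{D}(X < s)$ used on the following slides.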

  76. 92. Notes When applying the ID3 algorithm, for each continuous attribute $X$, we used a threshold for each pair of examples $(x_i, y_i)$, $(x_{i+1}, y_{i+1})$ with $y_i y_{i+1} < 0$ such that $x_i < x_{i+1}$, but no $x_j \in \operatorname{Val}(X)$ for which $x_i < x_j < x_{i+1}$. We will proceed similarly when applying AdaBoost with decision stumps and continuous attributes. [In the case of the ID3 algorithm, there is a theoretical result stating that there is no need to consider other thresholds for a continuous attribute $X$ apart from those situated between pairs of successive values ($x_i < x_{i+1}$) having opposite labels ($y_i \neq y_{i+1}$), because the Information Gain (IG) for the other thresholds ($x_i < x_{i+1}$, with $y_i = y_{i+1}$) is provably less than the maximal IG for $X$. LC: A similar result can be proven which allows us to simplify the application of the weak classifier ($A$) in the framework of the AdaBoost algorithm.] Moreover, we will also consider a threshold from outside the interval of values taken by the attribute $X$ in the training dataset. [The decision stumps corresponding to this "outside" threshold can be associated with the decision trees of depth 0 that we met in other problems.]

  77. 93. Iteration t = 1: Therefore, at this stage (i.e., the first iteration of AdaBoost) the thresholds for the two continuous variables ($X_1$ and $X_2$) corresponding to the two coordinates of the training instances ($x_1, \dots, x_9$) are • $\frac{1}{2}$, $\frac{5}{2}$, and $\frac{9}{2}$ for $X_1$, and • $\frac{1}{2}$, $\frac{3}{2}$, $\frac{5}{2}$ and $\frac{7}{2}$ for $X_2$. One can easily see that we can get rid of the "outside" threshold $\frac{1}{2}$ for $X_2$, because the decision stumps corresponding to this threshold act in the same way as the decision stumps associated with the "outside" threshold $\frac{1}{2}$ for $X_1$. The decision stumps corresponding to this iteration, together with their associated weighted training errors, are shown on the next slide. When filling those tables, we have used the equalities $\operatorname{err}_{D_t}(X_1 \ge s) = 1 - \operatorname{err}_{D_t}(X_1 < s)$ and, similarly, $\operatorname{err}_{D_t}(X_2 \ge s) = 1 - \operatorname{err}_{D_t}(X_2 < s)$, for any threshold $s$ and every iteration $t = 1, 2, \dots$. These equalities are easy to prove.

  78. 94.

  $s$ | $\frac{1}{2}$ | $\frac{5}{2}$ | $\frac{9}{2}$
  $\operatorname{err}_{D_1}(X_1 < s)$ | $\frac{4}{9}$ | $\frac{2}{9}$ | $\frac{4}{9} + \frac{2}{9} = \frac{2}{3}$
  $\operatorname{err}_{D_1}(X_1 \ge s)$ | $\frac{5}{9}$ | $\frac{7}{9}$ | $\frac{1}{3}$

  $s$ | $\frac{1}{2}$ | $\frac{3}{2}$ | $\frac{5}{2}$ | $\frac{7}{2}$
  $\operatorname{err}_{D_1}(X_2 < s)$ | $\frac{4}{9}$ | $\frac{1}{9} + \frac{3}{9} = \frac{4}{9}$ | $\frac{2}{9} + \frac{1}{9} = \frac{1}{3}$ | $\frac{2}{9}$
  $\operatorname{err}_{D_1}(X_2 \ge s)$ | $\frac{5}{9}$ | $\frac{5}{9}$ | $\frac{2}{3}$ | $\frac{7}{9}$

  It can be seen that the minimal weighted training error ($\varepsilon_1 = 2/9$) is obtained for the decision stumps $X_1 < 5/2$ and $X_2 < 7/2$. Therefore we can choose $h_1 = \operatorname{sign}\left(\frac{7}{2} - X_2\right)$ as the best hypothesis at iteration $t = 1$; the corresponding separator is the line $X_2 = \frac{7}{2}$. The hypothesis $h_1$ wrongly classifies the instances $x_4$ and $x_5$. Then $\gamma_1 = \frac{1}{2} - \frac{2}{9} = \frac{5}{18}$ and $\alpha_1 = \frac{1}{2}\ln\frac{1-\varepsilon_1}{\varepsilon_1} = \frac{1}{2}\ln\frac{1-\frac{2}{9}}{\frac{2}{9}} = \ln\sqrt{\frac{7}{2}} \approx 0.626.$

  79. 95. Now the algorithm must get a new distribution ($D_2$) by altering the old one ($D_1$) so that the next iteration concentrates more on the misclassified instances. $D_2(i) = \frac{1}{Z_1} D_1(i)\, e^{-\alpha_1 y_i h_1(x_i)} = \begin{cases} \frac{1}{Z_1}\cdot\frac{1}{9}\cdot\sqrt{\frac{2}{7}} & \text{for } i \in \{1, 2, 3, 6, 7, 8, 9\};\\[1mm] \frac{1}{Z_1}\cdot\frac{1}{9}\cdot\sqrt{\frac{7}{2}} & \text{for } i \in \{4, 5\}. \end{cases}$ Remember that $Z_1$ is a normalization factor for $D_2$. So, $Z_1 = \frac{7}{9}\sqrt{\frac{2}{7}} + \frac{2}{9}\sqrt{\frac{7}{2}} = \frac{2\sqrt{14}}{9} \approx 0.8315.$ Therefore, $D_2(i) = \begin{cases} \frac{9}{2\sqrt{14}}\cdot\frac{1}{9}\cdot\sqrt{\frac{2}{7}} = \frac{1}{14} & \text{for } i \notin \{4, 5\};\\[1mm] \frac{9}{2\sqrt{14}}\cdot\frac{1}{9}\cdot\sqrt{\frac{7}{2}} = \frac{1}{4} & \text{for } i \in \{4, 5\}. \end{cases}$ [Figure: the reweighted dataset, with the separator $h_1$ drawn as the line $X_2 = \frac{7}{2}$; the misclassified instances $x_4$ and $x_5$ now carry weight $\frac{1}{4}$ each, while all other instances carry weight $\frac{1}{14}$.]
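The values $Z_1 = \frac{2\sqrt{14}}{9}$, $\frac{1}{14}$ and $\frac{1}{4}$ can be re-derived numerically from relationships ii and iv of the earlier problem ($D_{t+1}(i) = \frac{D_t(i)}{2(1-\varepsilon_t)}$ on the correct set, $\frac{D_t(i)}{2\varepsilon_t}$ on the mistake set); a small check, assuming only $\varepsilon_1 = 2/9$ and $D_1(i) = 1/9$:

```python
import math

eps1 = 2.0 / 9.0                                  # weighted error of h_1
Z1 = 2 * math.sqrt(eps1 * (1 - eps1))             # relationship ii.
assert abs(Z1 - 2 * math.sqrt(14) / 9) < 1e-12    # Z_1 = 2*sqrt(14)/9 ~ 0.8315
w_correct = (1.0 / 9.0) / (2 * (1 - eps1))        # relationship iv., correct set
w_wrong = (1.0 / 9.0) / (2 * eps1)                # relationship iv., mistake set
assert abs(w_correct - 1.0 / 14.0) < 1e-12        # -> 1/14
assert abs(w_wrong - 0.25) < 1e-12                # -> 1/4
assert abs(7 * w_correct + 2 * w_wrong - 1.0) < 1e-12   # D_2 sums to 1
```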

  80. 96. Note If, instead of $\operatorname{sign}\left(\frac{7}{2} - X_2\right)$, we had taken as hypothesis $h_1$ the decision stump $\operatorname{sign}\left(\frac{5}{2} - X_1\right)$, the subsequent calculus would have been slightly different (although both decision stumps have the same, minimal, weighted training error, $\frac{2}{9}$): $x_8$ and $x_9$ would have been allocated the weights $\frac{1}{4}$, while $x_4$ and $x_5$ would have been allocated the weights $\frac{1}{14}$. (Therefore, the output of AdaBoost may not be uniquely determined!)

  81. 97. Iteration t = 2:

  $s$ | $\frac{1}{2}$ | $\frac{5}{2}$ | $\frac{9}{2}$
  $\operatorname{err}_{D_2}(X_1 < s)$ | $\frac{4}{14}$ | $\frac{2}{14}$ | $\frac{2}{4} + \frac{2}{14} + \frac{2}{14} = \frac{11}{14}$
  $\operatorname{err}_{D_2}(X_1 \ge s)$ | $\frac{10}{14}$ | $\frac{12}{14}$ | $\frac{3}{14}$

  $s$ | $\frac{1}{2}$ | $\frac{3}{2}$ | $\frac{5}{2}$ | $\frac{7}{2}$
  $\operatorname{err}_{D_2}(X_2 < s)$ | $\frac{4}{14}$ | $\frac{1}{4} + \frac{3}{14} = \frac{13}{28}$ | $\frac{2}{4} + \frac{1}{14} = \frac{8}{14}$ | $\frac{2}{4} = \frac{1}{2}$
  $\operatorname{err}_{D_2}(X_2 \ge s)$ | $\frac{10}{14}$ | $\frac{15}{28}$ | $\frac{6}{14}$ | $\frac{1}{2}$

  Note: According to the theoretical result presented at part a of CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.1-5, computing the weighted error rate of the decision stump [corresponding to the test] $X_2 < 7/2$ is now superfluous, because this decision stump was chosen as the optimal hypothesis at the previous iteration, so its weighted error w.r.t. $D_2$ is necessarily $1/2$. (Nevertheless, we have placed it in the table, for the sake of a thorough presentation.)

  82. 98. Now the best hypothesis is $h_2 = \operatorname{sign}\left(\frac{5}{2} - X_1\right)$; the corresponding separator is the line $X_1 = \frac{5}{2}$. $\varepsilon_2 = P_{D_2}(\{x_8, x_9\}) = \frac{2}{14} = \frac{1}{7} \approx 0.143 \Rightarrow \gamma_2 = \frac{1}{2} - \frac{1}{7} = \frac{5}{14}$ and $\alpha_2 = \frac{1}{2}\ln\frac{1-\varepsilon_2}{\varepsilon_2} = \frac{1}{2}\ln\frac{1-\frac{1}{7}}{\frac{1}{7}} = \ln\sqrt{6} \approx 0.896.$ $D_3(i) = \frac{1}{Z_2}\cdot D_2(i)\cdot e^{-\alpha_2 y_i h_2(x_i)} = \begin{cases} \frac{1}{Z_2}\cdot D_2(i)\cdot\frac{1}{\sqrt{6}} & \text{if } h_2(x_i) = y_i;\\[1mm] \frac{1}{Z_2}\cdot D_2(i)\cdot\sqrt{6} & \text{otherwise} \end{cases} = \begin{cases} \frac{1}{Z_2}\cdot\frac{1}{14}\cdot\frac{1}{\sqrt{6}} & \text{for } i \in \{1, 2, 3, 6, 7\};\\[1mm] \frac{1}{Z_2}\cdot\frac{1}{4}\cdot\frac{1}{\sqrt{6}} & \text{for } i \in \{4, 5\};\\[1mm] \frac{1}{Z_2}\cdot\frac{1}{14}\cdot\sqrt{6} & \text{for } i \in \{8, 9\}. \end{cases}$

  83. 99. $Z_2 = 5\cdot\frac{1}{14}\cdot\frac{1}{\sqrt{6}} + 2\cdot\frac{1}{4}\cdot\frac{1}{\sqrt{6}} + 2\cdot\frac{1}{14}\cdot\sqrt{6} = \frac{5 + 7 + 12}{14\sqrt{6}} = \frac{24}{14\sqrt{6}} = \frac{2\sqrt{6}}{7} \approx 0.70.$ Therefore, $D_3(i) = \begin{cases} \frac{7}{2\sqrt{6}}\cdot\frac{1}{14}\cdot\frac{1}{\sqrt{6}} = \frac{1}{24} & \text{for } i \in \{1, 2, 3, 6, 7\};\\[1mm] \frac{7}{2\sqrt{6}}\cdot\frac{1}{4}\cdot\frac{1}{\sqrt{6}} = \frac{7}{48} & \text{for } i \in \{4, 5\};\\[1mm] \frac{7}{2\sqrt{6}}\cdot\frac{1}{14}\cdot\sqrt{6} = \frac{1}{4} & \text{for } i \in \{8, 9\}. \end{cases}$ [Figure: the dataset reweighted after iteration 2, with the separators $h_1$ ($X_2 = \frac{7}{2}$) and $h_2$ ($X_1 = \frac{5}{2}$); the instances $x_1, x_2, x_3, x_6, x_7$ carry weight $\frac{1}{24}$, $x_4, x_5$ carry weight $\frac{7}{48}$, and $x_8, x_9$ carry weight $\frac{1}{4}$.]
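As the Note on slide 74 suggests, the whole computation can be automated. The sketch below runs the two reweighting steps on a reconstruction of the dataset; the exact coordinates are read off the figure and are our assumption (labels: $x_1, x_2, x_8, x_9$ positive, the rest negative), and it recovers $\varepsilon_1 = 2/9$, $\varepsilon_2 = 1/7$ and the $D_3$ weights computed above:

```python
import math

# Coordinates reconstructed from the figure (an assumption, for illustration):
X = [(1, 2), (2, 3), (3, 4), (3, 2), (3, 1), (4, 4), (5, 4), (5, 2), (5, 1)]
y = [+1, +1, -1, -1, -1, -1, -1, +1, +1]
m = len(X)

def boost_step(D, h):
    """One AdaBoost round: weighted error, alpha, and the reweighted distribution."""
    eps = sum(d for x, lab, d in zip(X, y, D) if h(x) != lab)
    alpha = 0.5 * math.log((1 - eps) / eps)
    D_new = [d * math.exp(-alpha * lab * h(x)) for x, lab, d in zip(X, y, D)]
    Z = sum(D_new)
    return eps, alpha, [w / Z for w in D_new]

h1 = lambda x: 1 if x[1] < 3.5 else -1      # sign(7/2 - X2)
h2 = lambda x: 1 if x[0] < 2.5 else -1      # sign(5/2 - X1)

D1 = [1.0 / m] * m
eps1, alpha1, D2 = boost_step(D1, h1)
eps2, alpha2, D3 = boost_step(D2, h2)

assert abs(eps1 - 2 / 9) < 1e-12 and abs(eps2 - 1 / 7) < 1e-12
expected_D3 = [1/24, 1/24, 1/24, 7/48, 7/48, 1/24, 1/24, 1/4, 1/4]
assert all(abs(a - b) < 1e-12 for a, b in zip(D3, expected_D3))
```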
