Tree Based Methods (Ensemble Schemes)
Machine Learning, Spring 2018
Feb 26, 2018
Kasthuri Kannan (kasthuri.kannan@nyumc.org)
Overview
• Decision Trees
  • Overview
  • Splitting nodes
  • Limitations
• Bagging/Bootstrap Aggregating and Boosting
  • How bagging reduces variance
  • Boosting
• Random Forests
  • Overview
  • Why RF works
  • Cancer genomics application
Classification
• Given a collection of records (the training set)
• Each record contains a set of attributes; one of the attributes is the class/label
• Find a model for the class attribute as a function of the values of the other attributes
• Goal: previously unseen records should be assigned a class as accurately as possible
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
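To make this train/test workflow concrete, here is a minimal sketch in Python. It assumes scikit-learn is available; the attribute values and class labels are toy data for illustration only.

```python
# Minimal sketch of the train/test workflow described above.
# Assumes scikit-learn is installed; the attribute values and class labels are toy data.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = [[1, 125], [0, 100], [0, 70], [1, 120], [0, 95], [0, 60],
     [1, 220], [0, 85], [0, 75], [0, 90]]          # records described by two attributes
y = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]   # class labels

# Hold out part of the data so the model is validated on previously unseen records.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # build the model on the training set
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```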
Classification Illustration
[Figure: a training set with attributes (Tid, Attrib1, Attrib2, Attrib3) and a known Class label is fed to a learning algorithm, which induces a model; the model is then applied (deduction) to a test set whose Class labels are unknown.]
Courtesy: www.cs.kent.edu/~jin/DM07/
Classification Examples
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
• Several algorithms: decision trees, support vector machines, rule-based methods, etc.
Decision Tree (Example)
[Figure: training data with attributes (Tid, Refund, Marital Status, Taxable Income) and class Cheat, alongside the induced model. Splitting attributes: Refund (Yes → NO; No → MarSt), MarSt (Married → NO; Single, Divorced → TaxInc), TaxInc (< 80K → NO; > 80K → YES).]
Decision Tree (Another Example)
[Figure: a different tree fit to the same training data, splitting first on MarSt (Married → NO; Single, Divorced → Refund), then Refund (Yes → NO; No → TaxInc), then TaxInc (< 80K → NO; > 80K → YES).]
There could be more than one tree that fits the same data!
Applying the Model
Once the decision tree is built, the model can be used to classify unseen (test) data.
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Following the tree (Refund = No → MarSt = Married → NO), assign Cheat = "No".
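The same decision rules can be written out directly as nested conditions. The sketch below is one possible encoding of the tree from the previous slides; treating an income of exactly 80K as the "< 80K" branch is an assumption made here to match the slide's answer.

```python
# One possible encoding of the tree from the previous slides as nested rules,
# applied to the test record (Refund = No, Marital Status = Married, Taxable Income = 80K).
def classify(refund, marital_status, taxable_income):
    if refund == "Yes":
        return "No"                                    # Refund = Yes -> don't cheat
    if marital_status == "Married":
        return "No"                                    # Married -> don't cheat
    return "Yes" if taxable_income > 80_000 else "No"  # Single/Divorced -> threshold on income

print(classify("No", "Married", 80_000))               # -> "No", matching the slide
```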
Tumor Classification (Example)
[Figure: tumor (T) and normal (N) samples plotted by expression of (gene1, gene2), with a query point at (1.5, 0.8). A decision tree with axis-aligned splits such as gene1 < 1, gene2 < 1, and gene2 < 0.5 partitions the plane into regions labeled T or N.]
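A rough sketch of this example with a library decision tree follows. The gene-expression values are synthetic, and the thresholds the fitted tree learns need not exactly match the figure; scikit-learn and NumPy are assumed to be available.

```python
# Rough sketch of the two-gene example: a shallow decision tree learns axis-aligned
# splits on gene1 and gene2. The expression values are synthetic; the learned
# thresholds need not match the figure exactly.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
tumor = rng.normal(loc=[1.5, 1.5], scale=0.2, size=(20, 2))    # tumor samples (high gene1, gene2)
normal = rng.normal(loc=[0.5, 0.5], scale=0.2, size=(20, 2))   # normal samples (low gene1, gene2)
X = np.vstack([tumor, normal])
y = np.array(["T"] * 20 + ["N"] * 20)

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["gene1", "gene2"]))     # the learned axis-aligned splits
print(tree.predict([[1.5, 0.8]]))                              # the query point from the slide
```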
Decision Tree Algorithms
• Many algorithms:
  • Hunt's Algorithm (one of the earliest)
  • CART
  • ID3, C4.5
  • SLIQ, SPRINT

Hunt's Algorithm (General Idea)
Let D_t be the set of training records that reach a node t.
General procedure:
• If D_t contains records that all belong to the same class y_t, then t is a leaf node labeled as y_t.
• If D_t is an empty set, then t is a leaf node labeled by the default class, y_d.
• If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset (see the sketch below).
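A compact recursive sketch of this general procedure is given below. The choose_split helper is a placeholder for whatever attribute-test selection is used (e.g. the best Gini or entropy split); it is not part of the original algorithm statement.

```python
# Recursive sketch of Hunt's algorithm as stated above. `choose_split` is a placeholder:
# it is assumed to return a function mapping a record to a branch key, or None if no
# useful attribute test exists (in which case the node becomes a majority-class leaf).
from collections import Counter

def hunt(records, labels, default_class, choose_split):
    if not records:                                  # D_t is empty -> leaf with default class y_d
        return {"leaf": default_class}
    if len(set(labels)) == 1:                        # all records share one class y_t -> leaf
        return {"leaf": labels[0]}
    majority = Counter(labels).most_common(1)[0][0]
    split = choose_split(records, labels)            # attribute test chosen by some criterion
    if split is None:
        return {"leaf": majority}
    branches = {}
    for rec, lab in zip(records, labels):            # partition D_t into smaller subsets
        recs, labs = branches.setdefault(split(rec), ([], []))
        recs.append(rec)
        labs.append(lab)
    if len(branches) < 2:                            # split failed to separate the records
        return {"leaf": majority}
    children = {key: hunt(recs, labs, majority, choose_split)   # recurse on each subset
                for key, (recs, labs) in branches.items()}
    return {"split": split, "children": children}
```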
Hunt's Algorithm (Illustration)
[Figure: the algorithm applied to the Refund / Marital Status / Taxable Income training data. The tree grows from a single "Don't Cheat" leaf, to a split on Refund (Yes → Don't Cheat; No → Don't Cheat), then a split on Marital Status under Refund = No (Married → Don't Cheat; Single, Divorced → ?), and finally a split on Taxable Income under Single/Divorced (< 80K → Don't Cheat; >= 80K → Cheat).]
Tree Induction
• Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
• Main questions:
  • How to split the records?
    • Binary or multi-way split?
    • How to determine the best split?
  • When to stop splitting?
Test Condition (Splitting Based on Nominal/Ordinal Attributes)
• Multi-way split: use as many partitions as there are distinct values, e.g. CarType → {Family}, {Sports}, {Luxury}.
• Binary split: divide the values into two subsets and find the optimal partitioning, e.g. CarType → {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}.
A sketch of both kinds of test condition follows.
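The enumeration below is illustrative only; the CarType values are taken from the slide.

```python
# Sketch of test conditions for the nominal attribute CarType from the slide:
# a multi-way split uses one branch per distinct value; a binary split groups the
# values into two subsets, and every grouping is a candidate to be scored.
from itertools import combinations

values = ["Family", "Sports", "Luxury"]

multi_way = [{v} for v in values]                    # one partition per distinct value

binary = []
for r in range(1, len(values)):
    for left in combinations(values, r):
        right = tuple(v for v in values if v not in left)
        if (right, left) not in binary:              # skip mirror images of earlier groupings
            binary.append((left, right))

print(multi_way)                                     # [{'Family'}, {'Sports'}, {'Luxury'}]
print(binary)                                        # e.g. (('Family',), ('Sports', 'Luxury')), ...
```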
Test Condition (Splitting Based on Continuous Attributes)
• Binary split: a threshold test such as Taxable Income > 80K? (Yes/No).
• Multi-way split: discretize into ranges, e.g. Taxable Income in < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K.
A sketch of how a binary threshold can be chosen follows.
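For a continuous attribute, the usual approach is to sort the observed values and score candidate thresholds (midpoints between consecutive distinct values). The rough sketch below uses the Gini impurity defined on the following slides and the Taxable Income / Cheat values from the earlier training-data example.

```python
# Rough sketch of choosing a binary split "income <= threshold?" for a continuous
# attribute: sort the values and score the midpoint between each pair of consecutive
# distinct values by the weighted Gini impurity of the two resulting subsets.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    v = [p[0] for p in pairs]
    y = [p[1] for p in pairs]
    best_t, best_score = None, float("inf")
    for i in range(1, len(v)):
        if v[i] == v[i - 1]:
            continue
        threshold = (v[i] + v[i - 1]) / 2            # candidate cut point between two values
        left, right = y[:i], y[i:]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = threshold, score
    return best_t, best_score

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_threshold(incomes, cheat))                # -> best cut point and its weighted Gini
```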
Binary vs. multi-way split – which is the best?
http://www.cse.msu.edu/~cse802/DecisionTrees.pdf
Determining the Best Split
Before splitting: 10 records of class C0 and 10 records of class C1.
[Figure: three candidate splits – Own Car? (Yes: C0 6, C1 4; No: C0 4, C1 6), Car Type? (Family: C0 1, C1 3; Sports: C0 8, C1 0; Luxury: C0 1, C1 7), and Student ID? (c1 … c20, each child containing a single record).]
Which attribute is the best for splitting?
Determining the Best Split
• Greedy approach: nodes with a homogeneous class distribution are preferred.
• Need a measure of node impurity:
  • C0: 5, C1: 5 – non-homogeneous, high degree of impurity
  • C0: 9, C1: 1 – homogeneous, low degree of impurity
Measures of Node Impurity
• Gini index: $\mathrm{GINI}(t) = 1 - \sum_j [p(j \mid t)]^2$
• Entropy: $\mathrm{Entropy}(t) = -\sum_j p(j \mid t) \log p(j \mid t)$
• Misclassification error: $\mathrm{Error}(t) = 1 - \max_i P(i \mid t)$
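These three measures translate directly into code; a minimal sketch operating on the list of class labels at a node:

```python
# Direct translations of the three impurity measures above, computed from the list of
# class labels of the records at a node t. Illustrative only, not optimized.
import math
from collections import Counter

def class_probs(labels):
    n = len(labels)
    return [count / n for count in Counter(labels).values()]   # p(j|t) for each class j

def gini(labels):
    return 1.0 - sum(p ** 2 for p in class_probs(labels))

def entropy(labels):
    return -sum(p * math.log2(p) for p in class_probs(labels) if p > 0)

def misclassification_error(labels):
    return 1.0 - max(class_probs(labels))

node = ["C0"] * 9 + ["C1"]            # the low-impurity node from the previous slide
print(gini(node), entropy(node), misclassification_error(node))   # 0.18, 0.469, 0.1
```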
Computing Measures of Node Impurity
[Figure: before splitting, the parent node (counts C0: N00, C1: N01) has impurity M0. Candidate split A? produces children N1 and N2 with impurities M1 and M2, combined (weighted) into M12; candidate split B? produces children N3 and N4 with impurities M3 and M4, combined into M34.]
Gain of split A = M0 – M12, versus gain of split B = M0 – M34; choose the split with the larger gain.
Computing Measures of Node Impurity (Gini Index)
• Gini index for a given node t: $\mathrm{GINI}(t) = 1 - \sum_j [p(j \mid t)]^2$
  (NOTE: p(j|t) is the relative frequency of class j at node t)
• Maximum ($1 - 1/n_c$, where $n_c$ is the number of classes) when records are equally distributed among all classes, implying the least interesting information
• Minimum (0.0) when all records belong to one class, implying the most interesting information
Example (Gini Index of a Node)
$\mathrm{GINI}(t) = 1 - \sum_j [p(j \mid t)]^2$
Split A? produces nodes N1 and N2.
Node N1: C0 = 0, C1 = 6. P(C0) = 0/6 = 0, P(C1) = 6/6 = 1.
Gini = 1 – P(C0)² – P(C1)² = 1 – 0 – 1 = 0
Node N2: C0 = 1, C1 = 5. P(C0) = 1/6, P(C1) = 5/6.
Gini = 1 – (1/6)² – (5/6)² = 0.278
Splitting Based on Gini Index
• Used in CART, SLIQ, SPRINT
• When a node p is split into k partitions (children), the quality of the split is the weighted impurity of the children; for a binary split,
  $\mathrm{GINI}_{split} = \frac{n_L}{n}\,\mathrm{GINI}(L) + \frac{n_R}{n}\,\mathrm{GINI}(R)$
  where n_L = number of records at the left child node, n_R = number of records at the right child node, and n = n_L + n_R.
• Split on the attribute that minimizes GINI_split (equivalently, maximizes the reduction in impurity relative to the parent).
Splitting Based on Gini Index (Example)
• Splits into two partitions
• Effect of weighting partitions: larger and purer partitions are sought.
Parent: C1 = 6, C2 = 6, Gini = 0.500.
Split B? gives node N1 (C1 = 5, C2 = 2) and node N2 (C1 = 1, C2 = 4).
Gini(N1) = 1 – (5/7)² – (2/7)² = 0.408
Gini(N2) = 1 – (1/5)² – (4/5)² = 0.320
Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
Gain of split B = 0.500 – 0.371 = 0.129
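The weighted-children computation can be checked numerically; a small sketch using the counts above:

```python
# Numeric check of the split-quality computation above: parent (6, 6),
# children N1 = (5, 2) and N2 = (1, 4).
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

parent, n1, n2 = [6, 6], [5, 2], [1, 4]
weighted = (sum(n1) * gini(n1) + sum(n2) * gini(n2)) / sum(parent)

print(round(gini(parent), 3), round(gini(n1), 3), round(gini(n2), 3))  # 0.5 0.408 0.32
print("GINI_split =", round(weighted, 3))                              # 0.371
print("gain =", round(gini(parent) - weighted, 3))                     # 0.129
```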
Measures of Node Impurity (Entropy)
• Entropy at a given node t: $\mathrm{Entropy}(t) = -\sum_j p(j \mid t) \log p(j \mid t)$
  (NOTE: p(j|t) is the relative frequency of class j at node t)
• Measures the homogeneity of a node:
  • Maximum ($\log n_c$) when records are equally distributed among all classes, implying the least information
  • Minimum (0.0) when all records belong to one class, implying the most information
• Entropy-based computations are similar to the Gini index computations
Example (Entropy)
Node with C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1.
Entropy = – 0 log 0 – 1 log 1 = – 0 – 0 = 0
Node with C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6.
Entropy = – (1/6) log₂(1/6) – (5/6) log₂(5/6) = 0.65
Node with C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6.
Entropy = – (2/6) log₂(2/6) – (4/6) log₂(4/6) = 0.92
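A quick numeric check of these entropy values:

```python
# Quick numeric check of the entropy examples above (class counts per node).
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(round(entropy([0, 6]), 2))   # 0.0
print(round(entropy([1, 5]), 2))   # 0.65
print(round(entropy([2, 4]), 2))   # 0.92
```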