  1. Decision Trees TJ Machine Learning Club

  2. Classification vs. Regression
     ○ Classification
        ■ Classifying photos of fruits
        ■ Determining whether a tumor is benign or malignant
     ○ Regression
        ■ Predicting COVID-19 cases given demographic data
        ■ Predicting house prices given house features
     Source: https://medium.com/datasoc/whats-the-problem-1ff8b338094b

  3. Features vs. Labels
     Features (like x): characteristics of the input
        - In the picture, the features are whether or not the patient smokes (smoke), consumes alcohol (alco), and performs physical activity (active)
     Label (like y): the prediction or classification of the input
        - Whether or not the patient has cardiovascular disease (cardio)

  4. Training and Testing Datasets
     - Training data has both features and labels
     - Testing data only has the features
     - Need to predict cardio
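
A concrete (made-up) illustration of this setup, using the column names from the slide’s picture:

    # Training rows: the features AND the cardio label are both known.
    train = [
        {"smoke": 1, "alco": 0, "active": 1, "cardio": 0},
        {"smoke": 1, "alco": 1, "active": 0, "cardio": 1},
    ]

    # Test row: only the features are given; the model must predict cardio.
    test = [{"smoke": 0, "alco": 0, "active": 1}]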

  5. What is a Decision Tree?
     ● A decision tree is just a series of questions
     ● The key in creating a decision tree is asking the right questions
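
To make “a series of questions” concrete, here is a hypothetical hand-written tree in Python for the athlete example used later in this deck; the thresholds, feature names, and returned labels are assumptions for illustration only.

    def classify_athlete(age, height_inches):
        # Each internal node of the tree is just a yes/no question about one feature.
        if height_inches > 76:        # "Height > 6'4''?"
            return "Basketball Player"
        if age > 27:                  # "Age > 27?"
            return "Basketball Player"
        return "Tennis Player"

    print(classify_athlete(age=25, height_inches=72))   # -> Tennis Player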

  6. Gini Impurity
     - A measure of how “messy” some collection of data is:
       I(i) = 1 - Σ_k p(k|i)^2, summing k over the c classes
     - i = some data
     - k = class index
     - c = total number of classes
     - p(k|i) = probability of randomly selecting an item of class k from the data
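
A minimal Python sketch of this formula (the helper name gini_impurity is mine, not from the slides):

    from collections import Counter

    def gini_impurity(labels):
        # Gini impurity of a collection of class labels: 1 - sum over classes of p(k)^2.
        n = len(labels)
        if n == 0:
            return 0.0
        return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())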

  7. Ex. Gini Impurity
     Let’s calculate the Gini Impurity for these groups of data, where the two possible classes are blue or red.
     [Figure: three example groups of blue and red items]

  8. Ex. Gini Impurity
     The three groups have Gini Impurities of 0.444, 0.5, and 0.

  9. Ex. Gini Impurity
     0.5 is the maximum possible impurity (an even mix of the two classes); 0 is the minimum possible impurity (a pure group).
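
Using the gini_impurity sketch from above we can reproduce these numbers; the exact group compositions are assumed, since any group with a 1-to-2 class ratio gives 0.444, an even split gives 0.5, and a pure group gives 0.

    print(gini_impurity(["blue", "blue", "red"]))         # 0.444... (1-to-2 mix)
    print(gini_impurity(["blue", "red", "blue", "red"]))  # 0.5 (even mix: maximum for two classes)
    print(gini_impurity(["red", "red", "red"]))           # 0.0 (pure group: minimum)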

  10. Information Gain
      IG(D_p, f) = I(D_p) - (N_left / N_p) · I(D_left) - (N_right / N_p) · I(D_right)
      - D_p, D_left, D_right are the parent node, left node, and right node datasets respectively
      - I is a measure of impurity (like Gini Impurity)
      - N_p, N_left, and N_right are the number of items in the parent, left, and right nodes respectively
      - f is the question you are asking to create the split
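
The same calculation as a Python sketch, reusing the gini_impurity helper from above (information_gain is again my own name for it):

    def information_gain(parent, left, right):
        # Impurity of the parent minus the size-weighted impurities of the two children.
        n = len(parent)
        return (gini_impurity(parent)
                - len(left) / n * gini_impurity(left)
                - len(right) / n * gini_impurity(right))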

  11. Let’s figure out which question is the better one to ask to split the athletes according to sport (T = Tennis Player, B = Basketball Player).
      [Figure: the same six athletes split two ways, by “Age > 27?” and by “Height > 6’4’’?”]

  12. Splitting on “Age > 27?” divides the parent node {T, B, T, B, B, T} into the groups {T, B, B} and {B, T, T}.

  13. Gini Impurities for this split: parent node = 1/2, each child node = 4/9.

  14. Weighting each child impurity by its share of the items: 1/2 - (3/6)(4/9) - (3/6)(4/9) = 1/18 ≈ 0.0556.

  15. Splitting on “Height > 6’4’’?” divides the parent node {T, B, T, B, B, T} into the groups {T, T} and {B, B, B, T}.

  16. Gini Impurities for this split: parent node = 1/2, the {T, T} child = 0, the {B, B, B, T} child = 3/8.

  17. Weighting each child impurity by its share of the items: 1/2 - (2/6)(0) - (4/6)(3/8) = 1/2 - 1/4 = 0.25.

  18. Information Gain for “Age > 27?” is 0.055556, while Information Gain for “Height > 6’4’’?” is 0.25. Since its Information Gain is higher, “Height > 6’4’’?” is the better question to ask to classify our athletes.
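
Plugging the two candidate splits into the information_gain sketch from above reproduces these numbers (the athlete labels are taken from the slides, with “T” = tennis and “B” = basketball):

    parent = ["T", "B", "T", "B", "B", "T"]

    # "Age > 27?": each side ends up with a 1-to-2 mix of sports.
    print(information_gain(parent, ["T", "B", "B"], ["B", "T", "T"]))     # ~0.0556

    # "Height > 6'4''?": one pure side, one 3-to-1 side.
    print(information_gain(parent, ["T", "T"], ["B", "B", "B", "T"]))     # 0.25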

  19. How to Come Up with Values for the Questions?
      - The most straightforward way: try out different values from the items in your training dataset
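
One way to sketch this in Python, reusing the information_gain helper from above; the feature values and labels below are made up for illustration:

    heights = [77, 79, 73, 75, 72, 74]          # assumed heights in inches
    labels  = ["B", "B", "T", "B", "T", "T"]    # assumed sports for the same athletes

    best_gain, best_threshold = 0.0, None
    for threshold in sorted(set(heights)):
        # Ask "height > threshold?" and split the labels accordingly.
        yes = [y for x, y in zip(heights, labels) if x > threshold]
        no  = [y for x, y in zip(heights, labels) if x <= threshold]
        if not yes or not no:
            continue
        gain = information_gain(labels, yes, no)
        if gain > best_gain:
            best_gain, best_threshold = gain, threshold

    print(best_threshold, best_gain)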

  20. Overfitting
      Techniques to prevent overfitting in decision trees:
      ● Continue recursively generating nodes only if the information gain is larger than some threshold (e.g. 0.1)
      ● After creating the tree, prune all nodes that are at a depth greater than some threshold
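
For comparison, scikit-learn’s DecisionTreeClassifier exposes knobs in this spirit: min_impurity_decrease stops splitting when the impurity decrease from a split is too small, and max_depth caps the depth while the tree is grown (scikit-learn’s post-hoc pruning uses cost-complexity pruning via ccp_alpha rather than a depth cutoff). A minimal sketch with made-up data:

    from sklearn.tree import DecisionTreeClassifier

    # Toy training data, invented for illustration: [age, height_inches] -> sport
    X = [[25, 72], [31, 79], [28, 77], [24, 73], [29, 75], [26, 74]]
    y = ["T", "B", "B", "T", "B", "T"]

    clf = DecisionTreeClassifier(
        criterion="gini",            # split quality measured with Gini Impurity
        min_impurity_decrease=0.1,   # only split if the weighted impurity decrease exceeds this
        max_depth=3,                 # cap the depth of the tree
    )
    clf.fit(X, y)
    print(clf.predict([[27, 76]]))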
