Lecture 4: Rule-based classification and regression


  1. Lecture 4: Rule-based classification and regression Felix Held, Mathematical Sciences MSA220/MVE440 Statistical Learning for Big Data 1st April 2019

  2. Amendment: Bias-Variance Tradeoff
     [Figure: error vs. model complexity; squared bias decreases and variance increases with complexity, underfitting on the left and overfitting on the right.]
     Bias-Variance Decomposition
     $$\underbrace{\mathbb{E}_{p(\mathcal{T},\mathbf{x},y)}\!\left[(y - \hat{f}(\mathbf{x}))^2\right]}_{\text{Total expected prediction error}}
     = \underbrace{\mathbb{E}_{p(\mathbf{x})}\!\left[\left(f(\mathbf{x}) - \mathbb{E}_{p(\mathcal{T})}[\hat{f}(\mathbf{x})]\right)^2\right]}_{\text{Bias}^2\text{ averaged over }\mathbf{x}}
     + \underbrace{\mathbb{E}_{p(\mathbf{x})}\!\left[\operatorname{Var}_{p(\mathcal{T})}[\hat{f}(\mathbf{x})]\right]}_{\text{Variance of }\hat{f}\text{ averaged over }\mathbf{x}}
     + \underbrace{\sigma^2}_{\text{Irreducible error}}$$

  3. Observations
     β–Ά Irreducible error cannot be changed
     β–Ά Bias and variance of $\hat{f}$ are sample-size dependent
     β–Ά For a consistent estimator $\hat{f}$: $\mathbb{E}_{p(\mathcal{T})}[\hat{f}(\mathbf{x})] \to f(\mathbf{x})$ for increasing sample size
     β–Ά In many cases: $\operatorname{Var}_{p(\mathcal{T})}(\hat{f}(\mathbf{x})) \to 0$ for increasing sample size
     β–Ά Caution: Theoretical guarantees are often dependent on the number of variables $p$ staying fixed and increasing $n$. Might not be fulfilled in reality.
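
The convergence statements above can be checked empirically. Below is a minimal sketch (not part of the lecture slides); the estimator, the data-generating distribution N(2, 1) and the replication settings are made-up placeholders.

```python
import numpy as np

# Not from the slides: bias and variance of a consistent estimator (here
# simply the sample mean) shrink as the sample size n grows.
rng = np.random.default_rng(0)
true_value = 2.0

for n in [10, 100, 1000, 10000]:
    # 2000 replicate training sets of size n, one estimate per replicate
    estimates = np.array([rng.normal(true_value, 1.0, n).mean() for _ in range(2000)])
    bias = estimates.mean() - true_value
    variance = estimates.var()
    print(f"n = {n:5d}   bias = {bias:+.4f}   variance = {variance:.6f}")
```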

  4. Amendment: Leave-One-Out Cross-validation (LOOCV)
     Cross-validation with $K = n$ is called leave-one-out cross-validation.
     β–Ά Popular because explicit formulas (or approximations) exist for many special cases (e.g. regularized regression)
     β–Ά Uses the most data for training possible
     β–Ά More variable than $K$-fold CV for $K < n$ since only one data point is used for testing and the training sets are very similar
     β–Ά In practice: Try out different values for $K$. Be cautious if results vary drastically with $K$. Maybe the underlying model assumptions are not appropriate.
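
As an illustration (not from the slides), LOOCV corresponds to $K = n$ and is available directly in scikit-learn; the Ridge model and the diabetes dataset below are arbitrary placeholders.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_diabetes(return_X_y=True)
model = Ridge(alpha=1.0)

# 10-fold CV: averages over larger, more distinct test sets
kfold_mse = -cross_val_score(
    model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=1),
    scoring="neg_mean_squared_error").mean()

# LOOCV: K = n, each test set is a single observation
loo_mse = -cross_val_score(
    model, X, y, cv=LeaveOneOut(),
    scoring="neg_mean_squared_error").mean()

print(f"10-fold CV MSE: {kfold_mse:.1f}   LOOCV MSE: {loo_mse:.1f}")
```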

  5. Classification and Partitions

  6. Classification and Partitions
     A classification algorithm constructs a partition of feature space and assigns a class to each part.
     β–Ά kNN creates local neighbourhoods in feature space and assigns a class in each
     β–Ά Logistic regression divides feature space implicitly by modelling $p(c\,|\,\mathbf{x})$ and determines decision boundaries through Bayes’ rule
     β–Ά Discriminant analysis creates an explicit model of the feature space conditional on the class. It models $p(\mathbf{x}, c)$ by assuming that $p(\mathbf{x}\,|\,c)$ is a normal distribution and either estimates $p(c)$ from data or through prior knowledge.

  7. New point-of-view: Rectangular Partitioning
     Idea: Create an explicit partition by dividing feature space into rectangular regions and assign a constant conditional class probability (classification) or constant conditional mean (regression) to each region.
     Given regions $R_m$ for $m = 1, \ldots, M$, a classification rule for classes $k \in \{1, \ldots, K\}$ is
     $$\hat{c}(\mathbf{x}) = \underset{1 \le k \le K}{\arg\max} \sum_{m=1}^{M} \mathbb{1}(\mathbf{x} \in R_m) \, \frac{1}{|R_m|} \sum_{\mathbf{x}_i \in R_m} \mathbb{1}(c_i = k)$$
     and a regression function is given by
     $$\hat{f}(\mathbf{x}) = \sum_{m=1}^{M} \mathbb{1}(\mathbf{x} \in R_m) \, \frac{1}{|R_m|} \sum_{\mathbf{x}_i \in R_m} y_i$$
     (Derivations are similar to kNN with regions instead of neighbourhoods.)
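
A small sketch (not from the lecture) of these prediction rules for a hand-specified rectangular partition of a 2D feature space; the regions, the toy data and the helper names (`region_index`, `fit_region_constants`) are invented for illustration.

```python
import numpy as np

# Each region R_m is an axis-aligned rectangle: (x1_lo, x1_hi, x2_lo, x2_hi)
regions = [(-np.inf, 3.5, -np.inf, 2.2),
           (3.5, np.inf, -np.inf, 2.2),
           (-np.inf, np.inf, 2.2, np.inf)]

def region_index(x, regions):
    """Return m such that x falls into region R_m."""
    for m, (a, b, lo, hi) in enumerate(regions):
        if a <= x[0] < b and lo <= x[1] < hi:
            return m
    raise ValueError("x is not covered by the partition")

def fit_region_constants(X, c, regions):
    """Constant class proportions per region (means would be used for regression)."""
    members = [[] for _ in regions]
    for xi, ci in zip(X, c):
        members[region_index(xi, regions)].append(ci)
    return [np.bincount(labels, minlength=2) / len(labels) for labels in members]

X = np.array([[1.0, 1.0], [4.0, 1.5], [2.0, 3.0], [5.0, 0.5], [1.5, 4.0]])
c = np.array([0, 1, 1, 1, 1])          # class labels of the training data
probs = fit_region_constants(X, c, regions)

x_new = np.array([4.2, 1.0])
m = region_index(x_new, regions)
print("predicted class:", np.argmax(probs[m]))   # arg max over class proportions
```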

  8. Classification and Regression Trees (CART)
     β–Ά Complexity of partitioning: Arbitrary Partition > Rectangular Partition > Partition from a sequence of binary splits
     β–Ά Classification and Regression Trees create a sequence of binary axis-parallel splits in order to reduce variability of values/classes in each region
     [Figure: two-class scatter plot in $(x_1, x_2)$ together with the fitted tree; first split at $x_2 \ge 2.2$ (yes/no), then at $x_1 \ge 3.5$. The three leaf nodes contain 60%, 20% and 20% of the data with (nearly) pure class proportions.]
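
For comparison, here is a hedged scikit-learn sketch (the figure on this slide looks like it was produced with R's rpart; this is not the original code). The simulated data only roughly imitate the layout of the figure.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(2)
X = rng.uniform(0, 6, size=(200, 2))
# class 1 roughly where x2 < 2.2 and x1 < 3.5, class 0 elsewhere
y = ((X[:, 1] < 2.2) & (X[:, 0] < 3.5)).astype(int)

# a small CART-style tree with two levels of axis-parallel splits
tree = DecisionTreeClassifier(max_depth=2, criterion="gini").fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"]))
```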

  9. CART: Tree building/growing
     1. Start with all data in a root node
     2. Binary splitting
        2.1 Consider each feature $x_{\cdot j}$ for $j = 1, \ldots, p$. Choose a threshold $t_j$ (for continuous features) or a partition of the feature categories (for categorical features) that results in the greatest improvement in node purity when splitting into $\{c_i : x_{ij} \le t_j\}$ and $\{c_i : x_{ij} > t_j\}$
        2.2 Choose the feature $j$ that led to the best splitting of the data and create a new child node for each subset
     3. Repeat Step 2 on all child nodes until the tree reaches a stopping criterion
     All nodes without descendants are called leaf nodes. The sequence of splits preceding them defines the regions $R_m$.
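
A minimal sketch (not from the slides) of the split search in Step 2 for continuous features, using the decrease in Gini impurity as the purity improvement. The function names and data are invented for illustration; categorical features and the recursion over child nodes are omitted.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a vector of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Scan all features and thresholds, return the split with the largest impurity decrease."""
    n, n_features = X.shape
    parent = gini(y)
    best = (None, None, 0.0)                  # (feature j, threshold t, gain)
    for j in range(n_features):
        for t in np.unique(X[:, j])[:-1]:     # candidate thresholds
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            # impurity of the children, weighted by their sizes
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            if parent - child > best[2]:
                best = (j, t, parent - child)
    return best

rng = np.random.default_rng(1)
X = rng.uniform(0, 6, size=(100, 2))
y = (X[:, 1] >= 2.2).astype(int)
print(best_split(X, y))   # expected: feature 1 with a threshold just below 2.2
```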

  10. Measures of node purity
     β–Ά Three common measures to determine impurity in a region $R_m$ are (for classification trees), with $\hat{\pi}_{km} = \frac{1}{|R_m|} \sum_{\mathbf{x}_i \in R_m} \mathbb{1}(c_i = k)$:
       Misclassification error: $1 - \max_k \hat{\pi}_{km}$
       Gini impurity: $\sum_{k=1}^{K} \hat{\pi}_{km}(1 - \hat{\pi}_{km})$
       Entropy/deviance: $-\sum_{k=1}^{K} \hat{\pi}_{km} \log \hat{\pi}_{km}$
     β–Ά All criteria are zero when only one class is present and maximal when all classes are equally common.
     β–Ά For regression trees the decrease in mean squared error after a split can be used as an impurity measure.
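
These three measures can be written as short functions of the vector of class proportions in a region; a sketch (not from the slides), with placeholder proportions:

```python
import numpy as np

def misclassification(pi):
    return 1.0 - np.max(pi)

def gini(pi):
    return np.sum(pi * (1.0 - pi))

def entropy(pi):
    pi = pi[pi > 0]                      # convention: 0 * log 0 = 0
    return -np.sum(pi * np.log(pi))

pi_hat = np.array([0.7, 0.2, 0.1])       # class proportions in one region R_m
for measure in (misclassification, gini, entropy):
    print(measure.__name__, round(measure(pi_hat), 3))
```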

  11. Node impurity in two class case
     Example for a two-class problem ($k = 0$ or $1$). $\hat{\pi}_{0m}$ is the empirical frequency of class 0 in a region $R_m$.
     [Figure: entropy, Gini impurity and misclassification error plotted against $\hat{\pi}_{0m} \in [0, 1]$; all are zero at the endpoints and maximal at 0.5.]
     Only Gini impurity and entropy are used in practice (averaging problems for misclassification error).
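
The curves in the figure can be reproduced with a few lines (assumed code, not the lecture's original):

```python
import numpy as np
import matplotlib.pyplot as plt

pi0 = np.linspace(1e-6, 1 - 1e-6, 400)                 # class-0 proportion in a region
misclass = 1 - np.maximum(pi0, 1 - pi0)
gini = 2 * pi0 * (1 - pi0)
entropy = -(pi0 * np.log(pi0) + (1 - pi0) * np.log(1 - pi0))

plt.plot(pi0, entropy, label="Entropy")
plt.plot(pi0, gini, label="Gini")
plt.plot(pi0, misclass, label="Misclassification")
plt.xlabel(r"$\hat{\pi}_{0m}$")
plt.ylabel("Impurity")
plt.legend()
plt.show()
```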

  12. Stopping criteria
     β–Ά Minimum size of leaf nodes (e.g. 5 samples per leaf node)
     β–Ά Minimum decrease in impurity (e.g. cutoff at 1%)
     β–Ά Maximum tree depth, i.e. number of splits (e.g. maximum 30 splits from root node)
     β–Ά Maximum number of leaf nodes
     Running CART until one of these criteria is fulfilled generates a max tree.
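
For reference (my own mapping, not from the slides), these stopping criteria correspond roughly to scikit-learn's DecisionTreeClassifier parameters:

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    min_samples_leaf=5,          # minimum size of leaf nodes
    min_impurity_decrease=0.01,  # minimum decrease in impurity
    max_depth=30,                # maximum tree depth
    max_leaf_nodes=None,         # maximum number of leaf nodes (None = unlimited)
)
```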

  13. Summary of CART
     β–Ά Pro: Outcome is easily interpretable
     β–Ά Pro: Can easily handle missing data
     β–Ά Neutral: Only suitable for axis-parallel decision boundaries
     β–Ά Con: Features with more potential splits have a higher chance of being picked
     β–Ά Con: Prone to overfitting/unstable (only the best feature is used for splitting and which is best might change with small changes of the data)

  14. CART and overfitting
     How can overfitting be avoided?
     β–Ά Tuning of stopping criteria: These can easily lead to early stopping since a weak split might lead to a strong split later
     β–Ά Pruning: Build a max tree first. Then reduce its size by collapsing internal nodes. This can be more effective since weak splits are allowed during tree building. (β€œThe silly certainty of hindsight”)
     β–Ά Ensemble methods: Examples are bagging, boosting, stacking, …

  15. A note on pruning
     β–Ά A common strategy is cost-complexity pruning.
     β–Ά For a given $\alpha > 0$ and a tree $T$ its cost-complexity is defined as
       $$C_\alpha(T) = \underbrace{\sum_{R_m \in T} \frac{1}{|R_m|} \sum_{\mathbf{x}_i \in R_m} \mathbb{1}\big(c_i \ne \hat{c}(\mathbf{x}_i)\big)}_{\text{Cost}} + \underbrace{\alpha |T|}_{\text{Complexity}}$$
       where $(c_i, \mathbf{x}_i)$ is the training data, $\hat{c}$ the CART classification rule and $|T|$ is the number of leaf nodes/regions defined by the tree.
     β–Ά It can be shown that successive subtrees $T_1, T_2, \ldots$ of the max tree $T_{\max}$ can be found such that each tree $T_k$ minimizes $C_{\alpha_k}(T_k)$ for a decreasing sequence $\alpha_1 \ge \alpha_2 \ge \cdots$
     β–Ά The tree with the lowest cost-complexity is chosen
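
A sketch (not from the slides) of cost-complexity pruning with scikit-learn. The sequence of effective alphas corresponds to the nested subtrees described above; here the final alpha is picked by cross-validated accuracy, a common variant of the selection rule on the slide. Dataset and settings are placeholders.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Grow a max tree, then compute the pruning path (effective alphas)
max_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
path = max_tree.cost_complexity_pruning_path(X, y)

# Evaluate one pruned tree per alpha and pick the best by CV accuracy
scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                          X, y, cv=5).mean()
          for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
print(f"chosen alpha: {best_alpha:.5f}")
```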

  16. Re-cap of the bootstrap and variance reduction

  17. The Bootstrap – A short recapitulation (I)
     Given a sample $x_i$, $i = 1, \ldots, n$ from an underlying population, estimate a statistic $\theta$ by $\hat{\theta} = \hat{\theta}(x_1, \ldots, x_n)$.
     What is the uncertainty of $\hat{\theta}$?
     Solution: Find confidence intervals (CIs) quantifying the variability of $\hat{\theta}$.
     Computation:
     β–Ά Through theoretical results (e.g. linear models) if distributional assumptions are fulfilled
     β–Ά Linearisation for more complex models (e.g. nonlinear or generalized linear models)
     β–Ά Nonparametric approaches using the data (e.g. bootstrap)
     All of these approaches require fairly large sample sizes.

  18. The Bootstrap – A short recapitulation (II)
     Nonparametric bootstrap
     Given a sample $x_1, \ldots, x_n$, bootstrapping performs for $b = 1, \ldots, B$:
     1. Sample $\tilde{x}_1, \ldots, \tilde{x}_n$ with replacement from the original sample
     2. Calculate $\hat{\theta}_b = \hat{\theta}(\tilde{x}_1, \ldots, \tilde{x}_n)$
     β–Ά $B$ should be large (in the 1000–10000s)
     β–Ά The distribution of $\hat{\theta}_b$ approximates the sampling distribution of $\hat{\theta}$
     β–Ά The bootstrap makes exactly one strong assumption¹: the data is discrete and values not seen in the data are impossible.
     ¹ Check out this blog post!
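
A minimal sketch (not from the slides) of this recipe, finished off with a percentile confidence interval; the observed data and the statistic (the median here) are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=150)   # "observed" sample (placeholder)

def theta_hat(sample):
    return np.median(sample)                   # statistic of interest

B = 5000                                       # number of bootstrap resamples
boot = np.array([theta_hat(rng.choice(x, size=x.size, replace=True))
                 for _ in range(B)])

lo, hi = np.percentile(boot, [2.5, 97.5])      # 95% percentile bootstrap CI
print(f"theta_hat = {theta_hat(x):.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```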

  19. CI for statistics of an exponential random variable
     Data ($n = 200$) simulated from $x \sim \operatorname{Exp}(1/3)$, i.e. $\mathbb{E}_{p(x)}[x] = 3$
     β–Ά Orange histogram shows the original sample
     β–Ά Black outlined histogram shows a bootstrapped sample
     β–Ά Blue line is the true density
     β–Ά Vertical lines are the mean of $x$ (dashed) and the 99% quantile (dotted) [red = empirical, blue = theoretical]
     [Figure: overlaid histograms (Frequency vs. $x$) with density and quantile lines as described above.]
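
The setting on this slide can be re-simulated as follows (assumed code, not the original): bootstrap the mean and the 99% quantile of n = 200 draws from Exp(1/3) and compare them with the theoretical values 3 and -3 log(0.01), about 13.8. The percentile CI is my choice of interval, not necessarily the one used in the lecture.

```python
import numpy as np

rng = np.random.default_rng(3)
n, rate = 200, 1 / 3
x = rng.exponential(scale=1 / rate, size=n)

theoretical = {"mean": 1 / rate, "99% quantile": -np.log(0.01) / rate}
stats = {"mean": np.mean, "99% quantile": lambda s: np.quantile(s, 0.99)}

B = 5000
for name, stat in stats.items():
    boot = np.array([stat(rng.choice(x, size=n, replace=True)) for _ in range(B)])
    lo, hi = np.percentile(boot, [2.5, 97.5])
    print(f"{name:>12}: empirical {stat(x):5.2f}, theoretical {theoretical[name]:5.2f}, "
          f"95% bootstrap CI ({lo:5.2f}, {hi:5.2f})")
```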
