Lecture 4: Rule-based classification and regression
Felix Held, Mathematical Sciences
MSA220/MVE440 Statistical Learning for Big Data
1st April 2019
Amendment: Bias-Variance Tradeoff

[Figure: expected prediction error against model complexity, decomposed into a decreasing squared-bias curve and an increasing variance curve; low complexity corresponds to underfitting, high complexity to overfitting.]

Bias-Variance Decomposition

$$\mathbb{E}_{p(\mathcal{T},\mathbf{x},y)}\left[(y - \hat{f}(\mathbf{x}))^2\right] = \sigma^2 + \mathbb{E}_{p(\mathbf{x})}\left[\left(f(\mathbf{x}) - \mathbb{E}_{p(\mathcal{T})}[\hat{f}(\mathbf{x})]\right)^2\right] + \mathbb{E}_{p(\mathbf{x})}\left[\mathrm{Var}_{p(\mathcal{T})}\left(\hat{f}(\mathbf{x})\right)\right]$$

Total expected prediction error = Irreducible error + Bias² averaged over $\mathbf{x}$ + Variance of $\hat{f}$ averaged over $\mathbf{x}$
Observations

▶ The irreducible error cannot be changed
▶ Bias and variance of $\hat{f}$ are sample-size dependent
▶ For a consistent estimator $\hat{f}$: $\mathbb{E}_{p(\mathcal{T})}[\hat{f}(\mathbf{x})] \to f(\mathbf{x})$ for increasing sample size $n$
▶ In many cases: $\mathrm{Var}_{p(\mathcal{T})}(\hat{f}(\mathbf{x})) \to 0$ for increasing sample size $n$
▶ Caution: Theoretical guarantees are often dependent on the number of variables $p$ staying fixed while the sample size $n$ increases. This might not be fulfilled in reality.
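These statements can be checked empirically. The sketch below (not from the lecture) repeatedly draws training sets of size $n$, evaluates a simple $k$-nearest-neighbour estimate at a fixed test point $x_0$, and estimates squared bias and variance across training sets; the true function, the noise level and all parameter values are assumptions for illustration.

```python
# Sketch (not from the lecture): empirical bias^2 and variance of a kNN-style
# estimate f_hat at a single test point x0, for two sample sizes n.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x)      # assumed "true" regression function
sigma, x0 = 0.3, 0.5             # assumed noise sd and test point

def fit_and_predict(n):
    """Draw one training set of size n and return a kNN prediction at x0."""
    k = max(1, int(np.sqrt(n)))  # consistency needs k -> inf while k/n -> 0
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    nearest = np.argsort(np.abs(x - x0))[:k]
    return y[nearest].mean()

for n in (50, 1000):
    preds = np.array([fit_and_predict(n) for _ in range(2000)])
    bias2 = (preds.mean() - f(x0)) ** 2   # (E[f_hat(x0)] - f(x0))^2
    var = preds.var()                     # Var(f_hat(x0)) over training sets
    print(f"n={n:5d}  bias^2={bias2:.4f}  variance={var:.4f}")
```

With $k$ growing like $\sqrt{n}$, both the squared bias and the variance shrink as $n$ increases, while the irreducible error $\sigma^2$ is unaffected.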
Amendment: Leave-One-Out Cross-validation (LOOCV)

Cross-validation with $K = n$ is called leave-one-out cross-validation.

▶ Popular because explicit formulas (or approximations) exist for many special cases (e.g. regularized regression)
▶ Uses the most data possible for training
▶ More variable than $K$-fold CV for $K < n$, since only one data point is used for testing and the training sets are very similar
▶ In practice: Try out different values for $K$. Be cautious if results vary drastically with $K$. Maybe the underlying model assumptions are not appropriate.
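As an illustration (not from the lecture), the sketch below compares $K$-fold CV and LOOCV error estimates for a ridge regression on simulated data, assuming scikit-learn is available; the data set and model are arbitrary choices.

```python
# Sketch (assumes scikit-learn): K-fold CV vs. LOOCV for a ridge regression.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = make_regression(n_samples=60, n_features=10, noise=5.0, random_state=1)
model = Ridge(alpha=1.0)

for name, cv in [("5-fold", KFold(5, shuffle=True, random_state=1)),
                 ("10-fold", KFold(10, shuffle=True, random_state=1)),
                 ("LOOCV", LeaveOneOut())]:
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring="neg_mean_squared_error")
    print(f"{name:8s} mean MSE = {-scores.mean():.2f}")
```

LOOCV refits the model $n$ times, which is why the closed-form shortcuts mentioned above matter for the special cases where they exist.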
Classification and Partitions
Classification and Partitions

A classification algorithm constructs a partition of feature space and assigns a class to each part.

▶ kNN creates local neighbourhoods in feature space and assigns a class in each
▶ Logistic regression divides feature space implicitly by modelling $p(c|\mathbf{x})$ and determines decision boundaries through Bayes' rule
▶ Discriminant analysis creates an explicit model of the feature space conditional on the class. It models $p(\mathbf{x}, c)$ by assuming that $p(\mathbf{x}|c)$ is a normal distribution and either estimates $p(c)$ from data or through prior knowledge.
New point-of-view: Rectangular Partitioning

Idea: Create an explicit partition by dividing feature space into rectangular regions and assign a constant conditional class probability (classification) or a constant conditional mean (regression) to each region.

Given regions $R_m$ for $m = 1, \dots, M$, a classification rule for classes $l \in \{1, \dots, L\}$ is

$$\hat{c}(\mathbf{x}) = \arg\max_{1 \le l \le L} \sum_{m=1}^{M} 1(\mathbf{x} \in R_m) \sum_{\mathbf{x}_i \in R_m} 1(c_i = l)$$

and a regression function is given by

$$\hat{f}(\mathbf{x}) = \sum_{m=1}^{M} 1(\mathbf{x} \in R_m) \frac{1}{|R_m|} \sum_{\mathbf{x}_i \in R_m} y_i$$

(Derivations are similar to kNN, with regions instead of neighbourhoods.)
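A small sketch (not from the lecture) of these piecewise-constant rules for hand-specified rectangular regions in 2D; the regions, data and labels are made-up toy values.

```python
# Sketch (not from the lecture): piecewise-constant prediction over rectangles.
import numpy as np

# Each region R_m is (x1_low, x1_high, x2_low, x2_high); toy values
regions = [(0, 3.5, 0, 2.2), (3.5, 6, 0, 2.2), (0, 6, 2.2, 4)]

X = np.array([[1.0, 1.0], [4.0, 1.0], [5.0, 3.0], [2.0, 3.5]])
y_class = np.array([0, 1, 0, 0])         # class labels c_i
y_reg = np.array([1.2, 3.4, 0.7, 0.9])   # responses y_i

def region_of(x):
    for m, (a, b, c, d) in enumerate(regions):
        if a <= x[0] < b and c <= x[1] < d:
            return m
    raise ValueError("x outside all regions")

member = np.array([region_of(x) for x in X])

def predict_class(x):
    m = region_of(x)
    return np.bincount(y_class[member == m]).argmax()  # majority class in R_m

def predict_reg(x):
    m = region_of(x)
    return y_reg[member == m].mean()                   # mean response in R_m

print(predict_class([4.5, 0.5]), predict_reg([4.5, 0.5]))
```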
Classification and Regression Trees (CART)

▶ Complexity of partitioning: Arbitrary Partition > Rectangular Partition > Partition from a sequence of binary splits
▶ Classification and Regression Trees create a sequence of binary axis-parallel splits in order to reduce variability of values/classes in each region

[Figure: two-class data in the $(x_1, x_2)$ plane partitioned by the axis-parallel splits $x_2 \ge 2.2$ and $x_1 \ge 3.5$, shown next to the corresponding binary tree with class proportions in each leaf node.]
CART: Tree building/growing

1. Start with all data in a root node
2. Binary splitting
   2.1 Consider each feature $x_j$ for $j = 1, \dots, p$. Choose a threshold $t$ (for continuous features) or a partition of the feature categories (for categorical features) such that the split into $\{\mathbf{x}_i : x_{ij} \le t\}$ and $\{\mathbf{x}_i : x_{ij} > t\}$ gives the greatest improvement in node purity
   2.2 Choose the feature $j$ that led to the best splitting of the data and create a new child node for each subset
3. Repeat Step 2 on all child nodes until the tree reaches a stopping criterion

All nodes without descendants are called leaf nodes. The sequence of splits preceding them defines the regions $R_m$.
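For reference (not part of the lecture), this greedy growing procedure is what standard tree implementations carry out. A minimal sketch assuming scikit-learn, with the iris data as an arbitrary stand-in:

```python
# Sketch (assumes scikit-learn): growing a classification tree by recursive
# binary splitting, as in the algorithm above.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(iris.data, iris.target)

# Every internal node is one binary, axis-parallel split; the leaves are the R_m
print(export_text(tree, feature_names=iris.feature_names))
```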
Measures of node purity

▶ Three common measures to determine impurity in a region $R_m$ are (for classification trees)

  Misclassification error: $1 - \max_l \hat{\pi}_{lm}$, where $\hat{\pi}_{lm} = \frac{1}{|R_m|} \sum_{\mathbf{x}_i \in R_m} 1(c_i = l)$

  Gini impurity: $\sum_{l=1}^{L} \hat{\pi}_{lm} (1 - \hat{\pi}_{lm})$

  Entropy/deviance: $-\sum_{l=1}^{L} \hat{\pi}_{lm} \log \hat{\pi}_{lm}$

▶ All criteria are zero when only one class is present and maximal when all classes are equally common.
▶ For regression trees the decrease in mean squared error after a split can be used as an impurity measure.
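A small sketch (not from the lecture) of the three measures as plain functions of the vector of empirical class frequencies in one region; the example frequencies are made up.

```python
# Sketch (not from the lecture): node impurity measures from class frequencies.
import numpy as np

def misclassification(p):
    return 1.0 - np.max(p)

def gini(p):
    return np.sum(p * (1.0 - p))

def entropy(p):
    p = p[p > 0]                  # treat 0 * log 0 as 0
    return -np.sum(p * np.log(p))

p = np.array([0.7, 0.2, 0.1])     # made-up class frequencies pi_hat in R_m
print(misclassification(p), gini(p), entropy(p))
```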
Node impurity in the two-class case

Example for a two-class problem ($c = 0$ or $1$). $\hat{\pi}_{0m}$ is the empirical frequency of class 0 in a region $R_m$.

[Figure: misclassification error, Gini impurity and entropy as functions of $\hat{\pi}_{0m} \in [0, 1]$; all three are zero at 0 and 1 and maximal at 0.5.]

Only Gini impurity and entropy are used in practice (averaging problems for misclassification error).
Stopping criteria

▶ Minimum size of leaf nodes (e.g. 5 samples per leaf node)
▶ Minimum decrease in impurity (e.g. cutoff at 1%)
▶ Maximum tree depth, i.e. number of splits from the root node (e.g. maximum 30 splits)
▶ Maximum number of leaf nodes

Running CART until one of these criteria is fulfilled generates a max tree.
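In a library implementation these criteria typically appear as tree hyperparameters. A sketch assuming scikit-learn, with values that roughly mirror the examples above:

```python
# Sketch (assumes scikit-learn): stopping criteria as tree hyperparameters.
from sklearn.tree import DecisionTreeClassifier

max_tree = DecisionTreeClassifier(
    min_samples_leaf=5,          # minimum size of leaf nodes
    min_impurity_decrease=0.01,  # minimum decrease in impurity (1% cutoff)
    max_depth=30,                # maximum tree depth (splits from the root)
    max_leaf_nodes=None,         # maximum number of leaf nodes (unrestricted here)
)
# max_tree.fit(X, y) would then grow the tree until one of these criteria stops it
```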
Summary of CART

▶ Pro: Outcome is easily interpretable
▶ Pro: Can easily handle missing data
▶ Neutral: Only suitable for axis-parallel decision boundaries
▶ Con: Features with more potential splits have a higher chance of being picked
▶ Con: Prone to overfitting/unstable (only the best feature is used for splitting, and which feature is best might change with small changes in the data)
CART and overfitting

How can overfitting be avoided?

▶ Tuning of stopping criteria: These can easily lead to early stopping, since a weak split might lead to a strong split later
▶ Pruning: Build a max tree first. Then reduce its size by collapsing internal nodes. This can be more effective since weak splits are allowed during tree building. ("The silly certainty of hindsight")
▶ Ensemble methods: Examples are bagging, boosting, stacking, …
A note on pruning

▶ A common strategy is cost-complexity pruning.
▶ For a given $\alpha > 0$ and a tree $T$, its cost-complexity is defined as

$$D_\alpha(T) = \underbrace{\sum_{R_m \in T} \frac{1}{|R_m|} \sum_{\mathbf{x}_i \in R_m} 1(c_i \ne \hat{c}(\mathbf{x}_i))}_{\text{Cost}} + \underbrace{\alpha |T|}_{\text{Complexity}}$$

where $(c_i, \mathbf{x}_i)$ is the training data, $\hat{c}$ the CART classification rule and $|T|$ is the number of leaf nodes/regions defined by the tree.
▶ It can be shown that successive subtrees $T_k$ of the max tree $T_{\max}$ can be found such that each tree $T_k$ minimizes $D_{\alpha_k}(T_k)$, where $\alpha_1 \ge \dots \ge \alpha_K$
▶ The tree with the lowest cost-complexity is chosen
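scikit-learn's minimal cost-complexity pruning follows the same idea, although its cost term is weighted somewhat differently from $D_\alpha$ above. A sketch with the iris data as a stand-in and an arbitrarily chosen penalty:

```python
# Sketch (assumes scikit-learn): cost-complexity pruning of a max tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
max_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Effective penalty values at which the nested subtrees of the max tree change
path = max_tree.cost_complexity_pruning_path(X, y)
print(path.ccp_alphas)

# Refit with one chosen penalty: a larger value gives a smaller (more pruned) tree
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)
print(pruned.get_n_leaves())
```

In practice the penalty would be chosen by cross-validation over the candidate values in `path.ccp_alphas`.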
Re-cap of the bootstrap and variance reduction
The Bootstrap – A short recapitulation (I)

Given a sample $y_i$, $i = 1, \dots, n$ from an underlying population, estimate a statistic $\theta$ by $\hat{\theta} = \hat{\theta}(y_1, \dots, y_n)$.

What is the uncertainty of $\hat{\theta}$?

Solution: Find confidence intervals (CIs) quantifying the variability of $\hat{\theta}$.

Computation:
▶ Through theoretical results (e.g. linear models) if distributional assumptions are fulfilled
▶ Linearisation for more complex models (e.g. nonlinear or generalized linear models)
▶ Nonparametric approaches using the data (e.g. the bootstrap)

All of these approaches require fairly large sample sizes.
The Bootstrap – A short recapitulation (II)

Nonparametric bootstrap

Given a sample $y_1, \dots, y_n$, bootstrapping performs for $c = 1, \dots, C$:

1. Sample $\tilde{y}_1, \dots, \tilde{y}_n$ with replacement from the original sample
2. Calculate $\hat{\theta}_c = \hat{\theta}(\tilde{y}_1, \dots, \tilde{y}_n)$

▶ $C$ should be large (in the 1000–10000s)
▶ The distribution of the $\hat{\theta}_c$ approximates the sampling distribution of $\hat{\theta}$
▶ The bootstrap makes exactly one strong assumption¹: The data is discrete and values not seen in the data are impossible.

¹ Check out this blog post!
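A minimal sketch (not from the lecture) of the nonparametric bootstrap for the mean of an exponential sample, mirroring the example on the next slide; the number of resamples, the statistic and the percentile interval are illustrative choices.

```python
# Sketch (not from the lecture): nonparametric bootstrap with a percentile CI.
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=3.0, size=200)   # y ~ Exp(1/3), so E[y] = 3

C = 5000                                   # number of bootstrap samples
theta_hat = np.empty(C)
for c in range(C):
    resample = rng.choice(y, size=y.size, replace=True)  # step 1
    theta_hat[c] = resample.mean()                       # step 2

# Percentile CI from the bootstrap distribution of the mean
lo, hi = np.quantile(theta_hat, [0.025, 0.975])
print(f"mean = {y.mean():.2f}, 95% bootstrap CI = ({lo:.2f}, {hi:.2f})")
```

The histogram of `theta_hat` approximates the sampling distribution of the mean, and the percentile interval is one simple way to turn it into a CI.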
CI for statistics of an exponential random variable

Data ($n = 200$) simulated from $y \sim \mathrm{Exp}(1/3)$, i.e. $\mathbb{E}_{p(y)}[y] = 3$

[Figure: histograms of the original and a bootstrapped sample with the true density and quantile markers overlaid.]

▶ Orange histogram shows the original sample
▶ Blue line is the true density
▶ Black outlined histogram shows a bootstrapped sample
▶ Vertical lines are the mean of $y$ (dashed) and the 99% quantile (dotted) [red = empirical, blue = theoretical]