Advanced Analytics in Business [D0S07a]
Big Data Platforms & Technologies [D0S06a]
Ensemble Modeling: Bagging and Boosting
Opening the Black Box (Part 1)
Overview
- The basic motivation
- Combination rules and voting systems
- Bagging
- Boosting
- Opening the black box (part 1)
The Basic Motivation
Ensemble modeling: the basic motivation
- Are two models better than one? Intuitively, this does make sense: you might have two models that are each good at predicting a certain (different) subsegment of your data set
- So this seems like a good idea to increase performance
- We'll also see that we can make models more robust to overfitting and more robust to noise
- "Wisdom of (simulated) crowds"
- A combination of models is called an "ensemble"
https://towardsdatascience.com/the-unexpected-lesson-within-a-jelly-bean-jar-1b6de9c40cca
Can we have it all?
- Overfitting: the model is too specific; it works great on the training data but not on a new data set
- E.g. a very deep decision tree
Can we have it all?
- We have seen early stopping and pruning
- Using a strong validation setup, too
- But in the end: an accuracy level we might not be happy with
Can we have it all?
- Also consider: what if we could combine multiple linear classifiers?
Combination Rules and Voting Systems
Combination rules
- Let's say we've created two models
- How to combine them?

True label | Model 1 (threshold: 0.54) | Model 2 (threshold: 0.50) | Ensemble?
Yes        | 0.80 (yes)                | 0.70 (yes)                | ?
Yes        | 0.78 (yes)                | 0.50 (yes)                | ?
Yes        | 0.54 (yes)                | 0.50 (yes)                | ?
No         | 0.57 (yes)                | 0.30 (no)                 | ?
No         | 0.30 (no)                 | 0.70 (yes)                | ?
No         | 0.22 (no)                 | 0.40 (no)                 | ?
Combination rules
- Algebraic combination: determine a new, optimal cutoff!

True label | Model 1 (cutoff 0.54) | Model 2 (cutoff 0.50) | Min (cutoff 0.50) | Max (cutoff 0.78) | Mean (cutoff 0.52)
Yes        | 0.80 (yes)            | 0.70 (yes)            | 0.70 (yes)        | 0.80 (yes)        | 0.75 (yes)
Yes        | 0.78 (yes)            | 0.50 (yes)            | 0.50 (yes)        | 0.78 (yes)        | 0.64 (yes)
Yes        | 0.54 (yes)            | 0.50 (yes)            | 0.50 (yes)        | 0.54 (no)         | 0.52 (yes)
No         | 0.57 (yes)            | 0.30 (no)             | 0.30 (no)         | 0.57 (no)         | 0.44 (no)
No         | 0.30 (no)             | 0.70 (yes)            | 0.30 (no)         | 0.70 (no)         | 0.50 (no)
No         | 0.22 (no)             | 0.40 (no)             | 0.22 (no)         | 0.40 (no)         | 0.31 (no)

- As always, the mean is pretty stable, especially when combining well-calibrated classifiers
- Breaks down with uncalibrated classifiers when adding many models together
- Determining the new cutoff is a learning step in itself (meta-learning); see the sketch below
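To make the algebraic rules concrete, here is a minimal NumPy sketch; the two "models" are just the hypothetical probability columns from the toy table above, and the cutoff search is the small meta-learning step mentioned above.

```python
import numpy as np

# Predicted probabilities from two hypothetical models plus the true labels,
# copied from the toy table above (1 = yes, 0 = no).
p1 = np.array([0.80, 0.78, 0.54, 0.57, 0.30, 0.22])
p2 = np.array([0.70, 0.50, 0.50, 0.30, 0.70, 0.40])
y  = np.array([1, 1, 1, 0, 0, 0])

combos = {
    "min":  np.minimum(p1, p2),
    "max":  np.maximum(p1, p2),
    "mean": (p1 + p2) / 2,
}

for name, scores in combos.items():
    # Pick the cutoff that maximizes accuracy on this (training) data:
    # a learning step in itself.
    candidates = np.unique(scores)
    best = max(candidates, key=lambda c: np.mean((scores >= c) == y))
    acc = np.mean((scores >= best) == y)
    print(f"{name}: best cutoff {best:.2f}, accuracy {acc:.2f}")
```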
Voting
- Useful when combining models: majority voting
- Less sensitive to underlying probability distributions; no need for calibration or for determining a new optimal cutoff
- Example with six models: four vote "yes", two vote "no" → "yes" wins (4 to 2)
- What about weighted voting? Give model 4 five votes and the others one each → "no" wins (5+1 to 4)
- We could also go for a linear combination of the probabilities
- Though again: how to determine the weights? A learning step in itself (see the sketch below)
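A minimal sketch of both voting schemes; the hard votes are hypothetical, chosen so the counts match the example above.

```python
import numpy as np

# Hard votes from six hypothetical models (1 = "yes", 0 = "no"):
# four models vote yes, models 4 and 6 vote no.
votes = np.array([1, 1, 1, 0, 1, 0])

# Plain majority voting: "yes" wins 4 to 2.
print("majority:", "yes" if votes.sum() > len(votes) / 2 else "no")

# Weighted voting: model 4 gets 5 votes, the others 1 each.
weights = np.array([1, 1, 1, 5, 1, 1])
yes_weight = weights[votes == 1].sum()   # 4
no_weight = weights[votes == 0].sum()    # 5 + 1 = 6
print("weighted:", "yes" if yes_weight > no_weight else "no")  # "no" wins 6 to 4
```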
Mixture of experts
- Jordan and Jacobs' mixture of experts (Jacobs, 1991) generates several "experts" (classifiers) whose outputs are combined through a linear rule
- The weights of this combination are determined by a "gating network", typically trained using the expectation-maximization (EM) algorithm
- But: loss of interpretability, additional production strain!
Stacking
- Wolpert's (Wolpert, 1992) stacked generalization (or stacking):
  - An ensemble of Tier 1 classifiers is first trained on a subset of the training data
  - The outputs of these classifiers are then used to train a Tier 2 classifier (meta-classifier)
- The underlying idea is to learn whether the training data have been properly learned
- For example, if a particular classifier incorrectly learned a certain region of the feature space, the Tier 2 classifier may be able to learn this behavior
- But: loss of interpretability, additional production strain!
http://blog.kaggle.com/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/
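A possible sketch using scikit-learn's StackingClassifier; the choice of base learners, meta-classifier and synthetic data is arbitrary and purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Tier 1: a few diverse base classifiers; Tier 2: a logistic regression
# meta-classifier trained on their out-of-fold outputs.
stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=5)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # cross-validated predictions feed the meta-classifier
)
stack.fit(X_train, y_train)
print("test accuracy:", stack.score(X_test, y_test))
```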
Smoothing
- Blend the two models' predictions: λ × (prediction of model 1) + (1 − λ) × (prediction of model 2)
http://www.overkillanalytics.net/more-is-always-better-the-power-of-simple-ensembles/
Bagging
Bagging
- How about techniques with a "built-in" ensemble system?
- Bagging (bootstrap aggregating) is one of the earliest, most intuitive and perhaps simplest ensemble-based algorithms, with surprisingly good performance (Breiman, 1996)
- The main idea is to add diversity to the classifiers:
  - Obtained by using bootstrapped replicas of the training data: different training data subsets are randomly drawn – with replacement – from the entire training data set
  - Each training data subset is used to train a different classifier of the same type
  - The individual classifiers are then combined by taking a simple majority vote of their decisions (see the sketch below)
- Since the training data sets may overlap substantially, additional measures can be used to increase diversity, such as:
  - Using a subset of the training data for training each classifier
  - Using a subset of features
  - Using unstable classifiers
  - Other ideas
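A minimal hand-rolled sketch of the core loop, assuming unpruned decision trees as the (unstable) base learner and synthetic data; scikit-learn also provides a ready-made BaggingClassifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_models = 25
models = []

for _ in range(n_models):
    # Bootstrap: sample n training instances *with replacement*.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier()  # an unstable, unpruned base learner
    tree.fit(X_train[idx], y_train[idx])
    models.append(tree)

# Majority vote over the individual trees' predictions.
all_preds = np.array([m.predict(X_test) for m in models])
ensemble_pred = (all_preds.mean(axis=0) >= 0.5).astype(int)
print("bagged accuracy:", np.mean(ensemble_pred == y_test))
```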
Bagging
Out-of-bag (OOB) validation
- When using bagging, one can already estimate the generalization capabilities of the ensemble model using the training data: out-of-bag (OOB) validation
- When validating an instance i, only consider those models which did not have i in their bootstrap sample
- A good initial validation check, though an independent test set is still required!
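In scikit-learn this estimate comes almost for free via oob_score=True; a minimal sketch on synthetic data (exact parameter names can vary slightly between versions).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# oob_score=True scores each training instance using only the estimators
# that did not see it in their bootstrap sample.
bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    oob_score=True,
    random_state=0,
)
bag.fit(X, y)
print("OOB accuracy estimate:", bag.oob_score_)
```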
Random forests: the quintessential bagging technique
- Random forests: a bagging-based ensemble learning method for classification and regression
- Construct a multitude of decision trees at training time and output the class that is the majority vote of the classes (classification) or the mean prediction (regression) of the individual trees
- Applies bagging, so one part of the randomness comes from bootstrapping each decision tree, i.e. each decision tree sees a random bootstrap of the training data
- However, random forests use an additional piece of randomness: to select the candidate attributes to split on, every split in every tree only considers a random subset of features (sampled at every split!)
- Random decision forests correct for decision trees' habit of overfitting to their training set
  - No more pruning needed
- The algorithm was developed by Leo Breiman and Adele Cutler
- Great performance in most cases!
Random forests: the quintessential bagging technique
- How many trees? No risk of overfitting from adding more trees, so use plenty
- Depth of the trees? No pruning necessary, strictly speaking
  - But one can still decide to apply some pruning or early stopping mechanisms (many implementations will do so)
- Size of the bootstrap?
  - Can be 100% (this doesn't mean selecting all instances, as we're drawing with replacement!)
  - Lower values are possible given enough data points
  - Key is to build enough trees
- M: size of the subset of features? 1, 2, all (i.e. "default bagging")?
  - Heuristic: max(⌊N/3⌋, 1) for regression, ⌊√N⌋ for classification (with N the number of features)
  - Alternative: find M through cross-validation (see the sketch below)!
- Thinking points: how to assign a probability? How to set the thresholds of the base classifiers (do we need to)?
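A possible way to tune M (max_features in scikit-learn) with cross-validation, on synthetic data; the candidate values below are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# "sqrt" corresponds to the floor(sqrt(N)) heuristic for classification;
# the grid also tries a few explicit subset sizes and None (= all features).
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    param_grid={"max_features": ["sqrt", 2, 5, None]},
    cv=5,
)
grid.fit(X, y)
print("best max_features:", grid.best_params_)
print("CV accuracy:", grid.best_score_)
```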
Random forests: the quintessential bagging technique
- Random forests are easy to use and don't require much configuration or preprocessing
- Because you are building many trees, they will include lots of interaction effects "for free"
- Good at avoiding overfitting (by design)
- However… how to explain 100 trees vs. 1…
- Many fun extensions, e.g. Extremely Randomized Trees (Extra-Trees): also consider a random subset of the possible splitting points, instead of only a random subset of features!
  - See also Maximizing Tree Diversity by Building Complete-Random Decision Trees (Liu et al., 2005)
  - There is even such a thing as completely randomized trees (and we'll see an application of those soon)
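The Extra-Trees variant is available in scikit-learn as ExtraTreesClassifier; a minimal sketch on synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

# Extra-Trees: on top of random feature subsets, the split thresholds
# themselves are drawn at random (and no bootstrapping is used by default).
extra = ExtraTreesClassifier(n_estimators=200, random_state=0)
print("CV accuracy:", cross_val_score(extra, X, y, cv=5).mean())
```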
Boosting
Boosting
- Similar to bagging, boosting also creates an ensemble of classifiers which are then combined by majority voting
- However, not using bootstrapping this time around
- Instead, classifiers are added sequentially, where each new classifier aims at correcting the mistakes made by the ensemble thus far
- In short: steering the learning towards fixing the mistakes it made in a previous step
- The main idea is cooperation between classifiers, rather than adding diversity
Boosting
AdaBoost: a (not so quintessential any more) boosting technique
- Iterative approach: AdaBoost trains an ensemble of weak learners over a number of rounds T
- At first, every instance has the same weight (D_1 = 1/N), so AdaBoost trains a normal classifier
- Next, samples that were misclassified by the ensemble so far are given a heavier weight
- The learner is also given a weight (α_t) depending on its accuracy and is incorporated into the ensemble
- AdaBoost then constructs a new learner, now incorporating the weights so far
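A minimal scikit-learn sketch, assuming decision stumps as the weak learner and synthetic data (parameter names may differ slightly across versions).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Decision stumps (depth-1 trees) are the classic weak learner for AdaBoost;
# n_estimators plays the role of the number of rounds T.
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=50, random_state=0)
ada.fit(X_train, y_train)
print("test accuracy:", ada.score(X_test, y_test))
print("first learner weights (alpha_t):", ada.estimator_weights_[:5])
```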
AdaBoost
- Friedman et al. showed that AdaBoost can be interpreted as an additive logistic regression model
- Assuming logistic regression as the base, weak learner, AdaBoost optimizes an exponential loss function
- A nice mathematical result, which shows that AdaBoost is closely linked to a particular loss function
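For reference, in standard notation (not taken verbatim from the slides): with labels y_i in {−1, +1} and ensemble F(x) = Σ_t α_t h_t(x), AdaBoost can be viewed as a stagewise minimizer of the exponential loss.

```latex
% Exponential loss minimized stagewise by AdaBoost,
% with y_i \in \{-1, +1\} and F(x) = \sum_{t=1}^{T} \alpha_t h_t(x)
L(F) = \sum_{i=1}^{N} \exp\big(-y_i \, F(x_i)\big)
```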