Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Supervised Learning
Overview: Regression – Logistic regression – K-NN – Decision and regression trees
The analytics process
Recall
Supervised learning:
- You have a labelled data set at your disposal; correlate features to the target
- Common case: predict the future based on patterns observed now (predictive)
- Classification (categorical) versus regression (continuous)
Unsupervised learning:
- Describe patterns in data: clustering, association rules, sequence rules
- No labelling required
- Common case: descriptive, explanatory
For supervised learning, our data set will contain a label
Recall
Regression: continuous label
Classification: categorical label
Most classification use cases use a binary categorical variable:
- Churn prediction: churn yes/no
- Credit scoring: default yes/no
- Fraud detection: suspicious yes/no
- Response modeling: customer buys yes/no
- Predictive maintenance: needs check yes/no
For classification:
- Binary classification (positive/negative outcome)
- Multiclass classification (more than two possible outcomes)
- Ordinal classification (target is ordinal)
- Multilabel classification (multiple outcomes are possible)
For regression:
- Absolute values
- Delta values
- Quantile regression
Single versus multi-output models are possible as well
(Definitions in literature and documentation can differ a bit)
Defining your target
- Recommender system: a form of multi-class? Multi-label?
- Survival analysis: instead of yes/no, predict the "time until yes occurs"
Oftentimes, different approaches are possible:
- Regression, quantile regression, mean residuals regression?
- Or: predicting the absolute value or the change?
- Or: convert manually to a number of bins and perform classification?
- Or: reduce the groups to two outcomes?
- Or: sequential binary classification ("classifier chaining")?
- Or: perform segmentation first and build a model per segment?
Regression
Regression https://xkcd.com/605/
Linear regression
Not much new here…
y = β₀ + β₁x₁ + β₂x₂ + … + ε, with ε ∼ N(0, σ²)
Price = 100000 + 100000 × number of bedrooms
- β₀: mean response when x = 0 (y-intercept)
- β₁: change in mean response when x₁ increases by one unit
How to determine the parameters β? Minimize the sum of squared errors (SSE):
argmin_β Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
With standard error σ̂ = √(SSE / n)
OLS: "Ordinary Least Squares". Why SSE though?
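As a quick illustration (not part of the original slides; the data below is made up), scikit-learn's LinearRegression estimates exactly these coefficients by minimizing the SSE:

```python
# Minimal OLS sketch (illustrative only; the bedroom/price data is made up).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])                        # feature: number of bedrooms
y = np.array([210_000, 290_000, 410_000, 480_000, 620_000])    # target: price

model = LinearRegression()   # fits beta_0 (intercept) and beta_1 by minimizing the SSE
model.fit(X, y)

print("intercept (beta_0):", model.intercept_)
print("slope (beta_1):", model.coef_[0])

# Sum of squared errors on the training data
sse = np.sum((y - model.predict(X)) ** 2)
print("SSE:", sse)
```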
Logistic Regression
Logistic regression
Classification is solved as well?

Customer | Age | Income | Gender | … | Response
John     | 30  | 1200   | M      |   | No → 0
Sophie   | 24  | 2200   | F      |   | No → 0
Sarah    | 53  | 1400   | F      |   | Yes → 1
David    | 48  | 1900   | M      |   | No → 0
Seppe    | 35  | 800    | M      |   | Yes → 1

ŷ = β₀ + β₁·age + β₂·income + β₃·gender
But no guarantee that the output is 0 or 1
Okay fine, a probability then, but no guarantee that the outcome is between 0 and 1 either
Target and errors also not normally distributed (assumption of OLS violated)
Logistic regression
We use a bounding function to limit the outcome between 0 and 1 (logistic, sigmoid):
f(z) = 1 / (1 + e⁻ᶻ)
Same basic formula, but now with the goal of binary classification
Two possible outcomes: either 0 or 1, no or yes – a categorical, binary label, not continuous
Logistic regression is thus a technique for classification rather than regression
Though the predictions are still continuous: between [0, 1]
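A tiny sketch (added for illustration) showing that the sigmoid squashes any real-valued score into the open interval (0, 1):

```python
# The logistic (sigmoid) bounding function maps any real z into (0, 1).
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(np.array([-10, -1, 0, 1, 10])))   # all values between 0 and 1, exactly 0.5 at z = 0
```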
Logistic regression
Linear regression with a transformation such that the output is always between 0 and 1, and can thus be interpreted as a probability (e.g. response or churn probability):
P(response = yes | age, income, gender)
  = 1 − P(response = no | age, income, gender)
  = 1 / (1 + e^−(β₀ + β₁·age + β₂·income + β₃·gender))
Or ("logit" – natural logarithm of the odds):
ln( P(response = yes | age, income, gender) / P(response = no | age, income, gender) ) = β₀ + β₁·age + β₂·income + β₃·gender
Logistic regression
Our first predictive model: a formula

Customer | Age | Income | Gender | … | Response
John     | 30  | 1200   | M      |   | No → 0
Sophie   | 24  | 2200   | F      |   | No → 0
Sarah    | 53  | 1400   | F      |   | Yes → 1
David    | 48  | 1900   | M      |   | No → 0
Seppe    | 35  | 800    | M      |   | Yes → 1

↓

1 / (1 + e^−(0.10 + 0.22·age + 0.05·income − 0.80·gender))

↓

Customer | Age | Income | Gender | … | Response Score
Will     | 44  | 1500   | M      |   | 0.76
Emma     | 28  | 1000   | F      |   | 0.44

Not very spectacular, but note:
- Easy to understand
- Easy to construct
- Easy to implement
In some settings, the end result will be a logistic model "extracted" from more complex approaches
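A minimal sketch of fitting and scoring such a model with scikit-learn; the data loosely mirrors the table above (gender encoded as M = 1, F = 0), and the resulting scores will not match the illustrative 0.76/0.44 values:

```python
# Minimal logistic regression sketch (illustrative data, loosely following the table above).
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.DataFrame({
    "age":      [30, 24, 53, 48, 35],
    "income":   [1200, 2200, 1400, 1900, 800],
    "gender":   [1, 0, 0, 1, 1],       # simple encoding: M = 1, F = 0
    "response": [0, 0, 1, 0, 1],
})

clf = LogisticRegression(max_iter=1000)
clf.fit(train[["age", "income", "gender"]], train["response"])

# Score new customers: predict_proba returns [P(response = no), P(response = yes)]
new = pd.DataFrame({"age": [44, 28], "income": [1500, 1000], "gender": [1, 0]})
print(clf.predict_proba(new)[:, 1])    # response scores between 0 and 1
```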
Logistic regression
1 / (1 + e^−(β₀ + β₁·age + β₂·income + β₃·gender))
If Xᵢ increases by 1:
- logit|Xᵢ+1 = logit|Xᵢ + βᵢ
- odds|Xᵢ+1 = odds|Xᵢ · e^βᵢ
e^βᵢ: "odds ratio": multiplicative increase in the odds when Xᵢ increases by 1 (other variables constant)
- βᵢ > 0 → e^βᵢ > 1 → odds/probability increase with Xᵢ
- βᵢ < 0 → e^βᵢ < 1 → odds/probability decrease with Xᵢ
Doubling amount: amount of change in Xᵢ required to double the odds of the primary outcome
- Doubling amount for Xᵢ = log(2) / βᵢ
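A small worked example (using the made-up age coefficient 0.22 from the earlier formula) of how the odds ratio and doubling amount follow from a coefficient:

```python
# Odds ratio and doubling amount for a single coefficient (illustrative value).
import numpy as np

beta_age = 0.22                          # made-up coefficient for age (see the earlier example formula)

odds_ratio = np.exp(beta_age)            # multiplicative change in the odds when age increases by 1
doubling_amount = np.log(2) / beta_age   # increase in age needed to double the odds

print(f"odds ratio: {odds_ratio:.3f}")                   # ~1.246
print(f"doubling amount: {doubling_amount:.2f} years")   # ~3.15
```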
Logistic regression
- Easy to interpret and understand
- Statistical rigor, a "well-calibrated" classifier
- Linear decision boundary, though interaction effects can be added to the model (and explicitly so, which allows for investigation)
- Sensitive to outliers
- Categorical variables need to be converted (e.g. using dummy encoding as the most common approach, though recall the ways to reduce a large number of dummies; a minimal encoding sketch follows below)
- Somewhat sensitive to the curse of dimensionality…
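A minimal dummy-encoding sketch with made-up data, as referenced in the list above:

```python
# Dummy encoding sketch for a categorical feature (made-up data).
import pandas as pd

df = pd.DataFrame({"gender": ["M", "F", "F", "M"], "age": [30, 24, 53, 48]})

# drop_first=True avoids the redundant (perfectly collinear) dummy column
encoded = pd.get_dummies(df, columns=["gender"], drop_first=True)
print(encoded)   # columns: age, gender_M (0/1)
```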
Regularization
Stepwise approaches
Statisticians love "parsimonious" models:
- If a "smaller" model works just as well as a "larger" one, prefer the smaller one
- Also: the "curse of dimensionality"
- Makes sense: most statistical techniques don't like dumping in your whole feature set all at once
Selection-based approaches (build up the final model step by step):
- Forward selection
- Backward selection
- Hybrid (stepwise) selection
See MASS::stepAIC, leaps::regsubsets, caret, or simply step in R
Not implemented by default in Python (neither scikit-learn nor statsmodels)… What's going on? (A rough manual sketch follows below.)
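For completeness, a manual forward-selection loop based on AIC could look roughly like this sketch (illustrative only, with an assumed DataFrame-in/list-out interface; the next slides explain why you should be wary of this procedure):

```python
# Rough sketch of forward selection by AIC using statsmodels (illustrative only;
# see the following slides for why stepwise selection is problematic).
import numpy as np
import statsmodels.api as sm

def forward_select_aic(X, y):
    """X: DataFrame of numeric candidate features, y: numeric target (assumed interface)."""
    selected, remaining = [], list(X.columns)
    best_aic = sm.OLS(y, np.ones(len(y))).fit().aic   # start from the intercept-only model
    improved = True
    while improved and remaining:
        improved = False
        # AIC of every model that adds one more candidate feature
        scores = [(sm.OLS(y, sm.add_constant(X[selected + [col]])).fit().aic, col)
                  for col in remaining]
        aic, col = min(scores)
        if aic < best_aic:            # keep the single feature that lowers the AIC the most
            best_aic, improved = aic, True
            selected.append(col)
            remaining.remove(col)
    return selected
```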
Stepwise approaches
Trying to get the best, smallest model given some information about a large number of variables is reasonable, and many sources cover stepwise selection methods. However, this is not really a legitimate situation. Frank Harrell (1996):

"It yields R-squared values that are badly biased to be high. The F and chi-squared tests quoted next to each variable on the printout do not have the claimed distribution. The method yields confidence intervals for effects and predicted values that are falsely narrow (Altman and Andersen, 1989). It yields p-values that do not have the proper meaning, and the proper correction for them is a difficult problem. It gives biased regression coefficients that need shrinkage (the coefficients for remaining variables are too large) (Tibshirani, 1996). It has severe problems in the presence of collinearity. It is based on methods (e.g., F tests for nested models) that were intended to be used to test prespecified hypotheses. Increasing the sample size does not help very much (Derksen and Keselman, 1992). It uses a lot of paper."

(https://www.stata.com/support/faqs/statistics/stepwise-regression-problems/)
Stepwise approaches
Some of these issues have been / can be fixed (e.g. using proper tests), but…

"Developing and confirming a model based on the same dataset is called data dredging. Although there is some underlying relationship amongst the variables, and stronger relationships are expected to yield stronger scores, these are random variables and the realized values contain error. Thus, when you select variables based on having better realized values, they may be such because of their underlying true value, error, or both. True, using the AIC is better than using p-values, because it penalizes the model for complexity, but the AIC is itself a random variable (if you run a study several times and fit the same model, the AIC will bounce around just like everything else)."

This actually already reveals something we'll visit again when talking about evaluation!
Take-away: use a proper train-test setup!
Regularization
SSE(Model 1) = (1 − 1)² + (2 − 2)² + (3 − 3)² + (8 − 4)² = 16
SSE(Model 2) = (1 − (−1))² + (2 − 2)² + (3 − 5)² + (8 − 8)² = 8
Regularization
Key insight: introduce a penalty on the size of the weights
- Constrained parameters, instead of fewer parameters!
- Makes the model less sensitive to outliers, improves generalization
Lasso and ridge regression:
- Standard: y = β₀ + β₁x₁ + … + β_p x_p + ε → argmin_β Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
- Lasso regression (L1 regularization): argmin_β Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² + λ Σⱼ₌₁ᵖ |βⱼ|
- Ridge regression (L2 regularization): argmin_β Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² + λ Σⱼ₌₁ᵖ βⱼ²
No penalization on the intercept!
Obviously: standardization/normalization required! (A minimal sketch follows below.)
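A minimal lasso/ridge sketch with scikit-learn on synthetic data; alpha plays the role of λ above, and the scaler handles the required standardization:

```python
# Minimal lasso/ridge sketch with standardization (illustrative; alpha corresponds to lambda above).
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

lasso = make_pipeline(StandardScaler(), Lasso(alpha=1.0))   # L1: some coefficients shrink to exactly 0
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))   # L2: coefficients shrink, rarely exactly 0

lasso.fit(X, y)
ridge.fit(X, y)

print("non-zero lasso coefficients:", (lasso[-1].coef_ != 0).sum())
print("intercept (not penalized):", lasso[-1].intercept_)
```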
Lasso and ridge regression https://newonlinecourses.science.psu.edu/stat508/lesson/5/5.1