Advanced Analytics in Business [D0S07a]
Big Data Platforms & Technologies [D0S06a]
Preprocessing
Overview
- Selection
- Cleaning
- Transformation
- Feature selection
- Feature extraction
- Sampling
The analytics process
The data set

Instance | Age | Account activity | Owns credit card | Churn
1  | 24 | low    | yes | no
2  | 21 | medium | no  | yes
3  | 42 | high   | yes | no
4  | 34 | low    | no  | yes
5  | 19 | medium | yes | yes
6  | 44 | medium | no  | no
7  | 31 | high   | yes | no
8  | 29 | low    | no  | no
9  | 41 | high   | no  | no
10 | 29 | medium | yes | no
11 | 52 | medium | no  | yes
12 | 55 | high   | no  | no
13 | 52 | medium | yes | no
14 | 38 | low    | no  | yes
The data set
A tabular data set ("structured data"):
- Has instances (examples, rows, observations, customers, ...) and attributes (features, fields, variables, predictors)
- Features can be numeric (continuous) or categorical (discrete, nominal, ordinal, factor, binary)
- A target (label, class, dependent variable) can be present; it can also be numeric, categorical, ...
Constructing a data set takes work
- Merging different data sources
- Levels of aggregation, e.g. household versus individual customer
- Linking instance identifiers
- Definition of the target variable
Selection
- As data mining can only uncover patterns actually present in the data, the target data set must be large and complete enough to contain these patterns, but concise enough to be mined within an acceptable time limit
- Data marts and data warehouses can help
- When you want to predict a target variable y, make absolutely sure you are not cheating by including explanatory variables that are too "perfectly" correlated with the target
  - "Too good to be true": do you know this explanatory variable before the target outcome is known, or only after? Your finished model will not be able to look into the future!
- Set aside a hold-out test set
Exploration
- Visual analytics as a means for initial exploration (EDA: exploratory data analysis)
  - Boxplots, scatter plots, histograms, correlation plots
- Basic statistics
- There is some debate on this topic
Cleaning
- Consistency
  - Detecting errors and duplications: K.U.L., KU Leuven, K.U.Leuven, KULeuven, ...
  - Transforming different representations to a common format, e.g. Male, Man, MAN → M; True, YES, Ok → 1
- Removing "future variables", or variables that got modified according to the target at hand
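A minimal sketch of such a consistency mapping with pandas; the gender column and the mapping dictionary are purely illustrative assumptions (in practice the mapping is built from an inventory of the distinct values actually observed):

```python
import pandas as pd

# Illustrative mapping table for inconsistent representations.
gender_map = {"Male": "M", "Man": "M", "MAN": "M", "Female": "F", "Woman": "F"}

def clean_gender(s: pd.Series) -> pd.Series:
    cleaned = s.str.strip().map(gender_map)
    return cleaned.fillna(s)  # keep the original value if no rule matches (and review later)

# df["gender"] = clean_gender(df["gender"])
```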
Cleaning: missing values
- Many techniques cannot deal with missing values
- Not applicable (credit card limit = NA if no credit card owned) versus unknown or not disclosed (age = NA)
- Detection is easy
- Treatment:
  - Delete: the complete row or column
  - Replace (impute): by mean, median, or mode, or by a separate model (the missing value becomes the target), which is often not worth it in practice
  - Keep: include a separate missing-value indicator feature
Cleaning: missing values
- Common approach: delete rows/columns with too many missing values, impute the others using the median (numeric) or mode (categorical), and add a separate column indicating whether the original value was missing
- Sometimes the median/mode imputation is computed within the same class label
- More advanced imputation: nearest-neighbor based
Cleaning: missing values
- Always keep the production setting in mind: new, unseen instances can contain missing values as well
  - Don't impute with a new median; use the same "rules" as with the training data
  - The same holds when working with validation data
- What if we've never observed a missing value for this feature before?
  - Use the original training data to construct an imputation
  - Consider rebuilding your model (monitoring!)
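A minimal sketch of this fit-on-training, apply-everywhere pattern with pandas; the column names (age, account_activity) and the DataFrames train_df/valid_df are illustrative assumptions:

```python
import pandas as pd

def fit_imputation(train: pd.DataFrame) -> dict:
    """Learn the imputation rules from the training data only."""
    return {
        "age": train["age"].median(),                              # numeric: median
        "account_activity": train["account_activity"].mode()[0],   # categorical: mode
    }

def apply_imputation(df: pd.DataFrame, rules: dict) -> pd.DataFrame:
    """Apply the *same* rules to training, validation, and production data."""
    df = df.copy()
    for col, value in rules.items():
        df[col + "_missing"] = df[col].isna().astype(int)   # "keep": indicator feature
        df[col] = df[col].fillna(value)
    return df

rules = fit_imputation(train_df)
train_imp = apply_imputation(train_df, rules)
valid_imp = apply_imputation(valid_df, rules)   # never recompute the median here
```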
Cleaning: outliers
- Extreme observations (age = 241 or income = 3 million)
- Can be valid or invalid
- Some techniques are sensitive to outliers
- Detection: histograms, z-scores, box plots
- Sometimes the detection of outliers is the whole "analysis" task: anomaly detection (fraud analytics)
- Treatment:
  - As missing: for invalid outliers, consider them as missing values
  - Treat: for valid outliers, but depends on the technique you will use: keep as-is, truncate (cap), bin, group
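A minimal sketch of z-score-based detection and truncation (capping), again fitting the bounds on the training data; the income column and the threshold of 3 standard deviations are illustrative choices:

```python
import pandas as pd

def fit_caps(train_col: pd.Series, z_thresh: float = 3.0):
    """Derive capping bounds (mean ± z_thresh * std) from the training data."""
    mu, sigma = train_col.mean(), train_col.std()
    return mu - z_thresh * sigma, mu + z_thresh * sigma

lower, upper = fit_caps(train_df["income"])
is_outlier = ~train_df["income"].between(lower, upper)              # detection
train_df["income_capped"] = train_df["income"].clip(lower, upper)   # valid outliers: truncate
# Invalid outliers would instead be set to NA and handled like missing values.
```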
Cleaning: duplicates
- Duplicate rows can be valid or invalid; treat accordingly
Transformations: standardization and normalization
- Standardization: constrain a feature to ~N(0,1): x_new = (x − μ) / σ
- Good for Gaussian-distributed data and some techniques: SVMs, regression models, k-nearest neighbors
- Useless for some techniques: decision trees
Transformations: standardization and normalization
- Normalization, also called feature scaling: rescale to [0,1] or [-1,1]: x_new = (x − x_min) / (x_max − x_min)
- In credit risk models, this is often applied to the resulting scores so that they fall in [0, 1000]
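A minimal sketch using scikit-learn scalers, fitted on the training data and reused unchanged on validation/production data; X_train and X_valid are assumed to be numeric feature matrices:

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

std = StandardScaler().fit(X_train)        # learns mean and std per feature
X_train_std = std.transform(X_train)       # (x - mean) / std
X_valid_std = std.transform(X_valid)       # same parameters, never refit

mm = MinMaxScaler(feature_range=(0, 1)).fit(X_train)   # learns min and max per feature
X_train_mm = mm.transform(X_train)         # (x - min) / (max - min)
```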
Transformations: categorization
- Continuous to nominal
- Binning: group ranges into categories
  - Can be useful to treat outliers
  - Often driven by domain knowledge
  - Equal width versus equal frequency (see the sketch after this slide)
- Grouping: grouping multiple nominal levels together, in case you have many levels (...)
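A minimal sketch of equal-width versus equal-frequency binning with pandas; the age column and the choice of four bins are illustrative:

```python
import pandas as pd

age = train_df["age"]
equal_width, edges = pd.cut(age, bins=4, retbins=True)   # 4 intervals of equal width
equal_freq = pd.qcut(age, q=4)                           # 4 bins with roughly equal counts

# Reuse the training bin edges for new data (same rule in validation/production).
valid_binned = pd.cut(valid_df["age"], bins=edges)
```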
Transformations: dummyfication and other encodings
- Nominal to continuous
- Dummy variables: artificial variables that represent an attribute with multiple categories
  - Mainly used in regression (which cannot deal with categoricals directly)
  - E.g. account activity = high, medium, low is converted to account_activity_high = 0/1, account_activity_medium = 0/1, account_activity_low = 0/1, ... and then one of them is dropped (see the sketch after this slide)
- Binary encoding, e.g. binarization:
  - account_activity_high = 1 → 001
  - account_activity_medium = 2 → 010
  - account_activity_low = 3 → 100
  - More compact than dummy variables
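A minimal sketch of dummyfication with pandas; drop_first drops one dummy column, which is what a regression model needs to avoid perfect multicollinearity, and the reindex step keeps the column layout identical for new data:

```python
import pandas as pd

dummies = pd.get_dummies(train_df["account_activity"],
                         prefix="account_activity",
                         drop_first=True)                 # "... and then drop one"
train_enc = pd.concat([train_df.drop(columns="account_activity"), dummies], axis=1)

# Validation/production data: align to the training columns (unseen levels become all zeros).
valid_dummies = pd.get_dummies(valid_df["account_activity"],
                               prefix="account_activity", drop_first=True)
valid_dummies = valid_dummies.reindex(columns=dummies.columns, fill_value=0)
```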
Transformations: high-level categoricals
Other ideas:
- Replace with a counting variable, for high-level categoricals that are difficult to group using domain knowledge
  - E.g. postal code = 2000, 3000, 9401, 3001, ...
  - Solution 1: group using domain knowledge (working area, communities)
  - Solution 2: create a new variable postal_code_count, e.g. postal_code_count = 23 if the original postal code appears 23 times in the training data (again: keep the same rule for validation/production!)
  - You lose the detailed information, but the goal is that the model can pick up on frequencies (see the sketch after this slide)
- Odds based grouping
- Weight of evidence encoding
- Probabilistic transformations and other "Kaggle" tricks such as Leave One Out Mean (Owen Zhang)
  - Risky, not always robust
  - Including noise and "jitter"
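A minimal sketch of the counting-variable idea (frequency encoding); the counts are learned on the training data only and reused afterwards, with unseen postal codes mapped to 0 (an illustrative choice):

```python
import pandas as pd

counts = train_df["postal_code"].value_counts()                       # learned on training data
train_df["postal_code_count"] = train_df["postal_code"].map(counts)
valid_df["postal_code_count"] = valid_df["postal_code"].map(counts).fillna(0)  # unseen codes
```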
Odds based grouping
- The "pivot table" approach: create a pivot table of the attribute versus the target and compute the odds
- Group variable values having similar odds
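A minimal sketch of the pivot-table approach, assuming a binary target churn coded 0/1 and an illustrative categorical column purpose:

```python
import pandas as pd

pivot = pd.crosstab(train_df["purpose"], train_df["churn"])   # rows: levels, columns: 0 and 1
pivot["odds"] = pivot[1] / pivot[0]                           # odds of class 1 versus class 0
print(pivot.sort_values("odds"))
# Levels with similar odds are then grouped together (manually or with a simple rule).
```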
Weights of evidence encoding
Weight of evidence variables can be defined as follows:

  WoE_cat = ln(p_{c1,cat} / p_{c2,cat})

where
- p_{c1,cat} = (number of instances with class 1 in the category) / (total number of instances with class 1)
- p_{c2,cat} = (number of instances with class 2 in the category) / (total number of instances with class 2)

If p_{c1,cat} > p_{c2,cat} then WoE > 0; if p_{c1,cat} < p_{c2,cat} then WoE < 0
WoE has a monotonic relationship with the target variable
Weights of evidence encoding
- A higher weight of evidence indicates less risk (monotonic relation to the target)
- Information Value (IV) is defined as: IV = Σ_cat (p_{G,cat} − p_{B,cat}) × WoE_cat
- Can be used to screen variables (important variables have IV > 0.1)
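A minimal sketch of computing WoE and IV for one categorical feature, with the target coded 1 for class 1 ("goods") and 0 for class 2 ("bads"); the small eps guards against empty categories and the column names are illustrative:

```python
import numpy as np
import pandas as pd

def woe_iv(feature: pd.Series, target: pd.Series, eps: float = 1e-6):
    table = pd.crosstab(feature, target)
    p1 = table[1] / table[1].sum()        # p_{c1,cat}: share of class-1 instances per category
    p2 = table[0] / table[0].sum()        # p_{c2,cat}: share of class-2 instances per category
    woe = np.log((p1 + eps) / (p2 + eps))
    iv = ((p1 - p2) * woe).sum()          # IV = sum over categories of (p_G - p_B) * WoE
    return woe, iv

woe, iv = woe_iv(train_df["purpose"], train_df["churn"])
train_df["purpose_woe"] = train_df["purpose"].map(woe)   # reuse the same mapping for new data
```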
Weights of evidence encoding
See also the R package smbinning: https://cran.r-project.org/web/packages/smbinning/index.html
Some other creative approaches
- For geospatial data: group nearby communities together, Gaussian regression, ...
- First build a decision tree on only the one categorical variable
- 1-dimensional k-means clustering on a continuous variable to suggest groups (see the sketch below)
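A minimal sketch of the last idea, 1-dimensional k-means on a continuous variable to suggest groups; four clusters is an arbitrary illustrative choice:

```python
from sklearn.cluster import KMeans

km = KMeans(n_clusters=4, n_init=10, random_state=0)
train_df["age_group"] = km.fit_predict(train_df[["age"]])   # fit on training data only
valid_df["age_group"] = km.predict(valid_df[["age"]])       # same cluster centers
```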
Leave one out mean (Owen Zhang)
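The slide itself only names the trick; below is a minimal sketch of what leave-one-out target-mean encoding typically looks like (column names, the noise level, and the treatment of new data are illustrative assumptions):

```python
import numpy as np
import pandas as pd

def loo_mean_encode(cat: pd.Series, y: pd.Series, noise: float = 0.01) -> pd.Series:
    """For each training row: mean of the target over all *other* rows of its category."""
    sums = y.groupby(cat).transform("sum")
    counts = y.groupby(cat).transform("count")
    loo = (sums - y) / (counts - 1).clip(lower=1)
    return loo * (1 + noise * np.random.randn(len(y)))   # the "noise and jitter" mentioned earlier

train_df["postal_code_loo"] = loo_mean_encode(train_df["postal_code"], train_df["churn"])
# New/validation data simply gets the plain per-category target mean from the training set.
```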
Hashing trick
- The "hashing trick": for categoricals with many levels, where it is expected that new levels will occur in new instances
- Often used for text mining
- Alternative for bagging (see the sketch below)
- But: perhaps there are smarter approaches?
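A minimal sketch of the hashing trick using only the standard library; the merchant column and the bucket count of 32 are illustrative assumptions, and scikit-learn's FeatureHasher offers a similar mechanism for larger feature spaces:

```python
import hashlib
import pandas as pd

def hash_bucket(value, n_buckets: int = 32) -> int:
    """Map any (possibly unseen) level deterministically to one of n_buckets."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

train_df["merchant_bucket"] = train_df["merchant"].map(hash_bucket)
valid_df["merchant_bucket"] = valid_df["merchant"].map(hash_bucket)   # new levels hash fine too
```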
Embeddings
- Other "embedding" approaches are possible as well in the textual domain ("word2vec and friends")
- Can even be used for high-level and sparse categoricals
- We'll come back to this later on, when we talk about text mining in more detail
- https://arxiv.org/pdf/1604.06737.pdf
Transformations: mathematical approaches
- Logarithms, square roots, etc.; mostly done to enforce some linearity
- Box-Cox, Yeo-Johnson, and other "power transformations": applied to create a monotonic transformation of the data using power functions
  - Useful to stabilize variance, make the data more normal-distribution-like, and improve the validity of measures of association such as the Pearson correlation
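A minimal sketch of these transformations; np.log1p handles zero values, and scikit-learn's PowerTransformer covers Yeo-Johnson and Box-Cox (the latter requires strictly positive data). The income column is illustrative:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

train_df["income_log"] = np.log1p(train_df["income"])          # log(1 + x)

pt = PowerTransformer(method="yeo-johnson")                    # or method="box-cox"
train_df["income_yj"] = pt.fit_transform(train_df[["income"]]).ravel()
valid_df["income_yj"] = pt.transform(valid_df[["income"]]).ravel()   # same fitted parameters
```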
Transformations: interaction variables
- When no linear effect is present for x1 ~ y or for x2 ~ y individually, but there is one for f(x1, x2) ~ y
- In most cases: f(x1, x2) = x1 × x2
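A minimal sketch, assuming two illustrative numeric columns:

```python
# The product captures a joint effect that neither variable shows on its own.
train_df["age_x_income"] = train_df["age"] * train_df["income"]
```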
Transformations: deltas, trends, windows
- Evolutions over time or differences between features are crucial in many settings
- Solution 1: keep track of an instance through time and add its observations as separate rows ("panel data analysis")
- Solution 2: take one point in time as the "base" and add relative features (see the sketch below)
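A minimal sketch of derived delta and window features with pandas; the panel layout (customer_id, month, balance) is an illustrative assumption:

```python
import pandas as pd

df = df.sort_values(["customer_id", "month"])
df["balance_delta"] = df.groupby("customer_id")["balance"].diff()     # change versus previous period
df["balance_avg_3m"] = (df.groupby("customer_id")["balance"]
                          .rolling(window=3).mean()
                          .reset_index(level=0, drop=True))           # 3-period moving average
```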