

  1. Advanced Analytics in Business [D0S07a] / Big Data Platforms & Technologies [D0S06a]: Preprocessing

  2. Overview
      - Selection
      - Exploration
      - Cleaning
      - Transformations
      - Feature engineering
      - Feature selection
      - Conclusion

  3. Today

  4. Where we want to get to

     Instance | Age | Account activity | Owns credit card | Churn
     ---------|-----|------------------|------------------|------
     1        | 24  | low              | yes              | no
     2        | 21  | medium           | no               | yes
     3        | 42  | high             | yes              | no
     4        | 34  | low              | no               | yes
     5        | 19  | medium           | yes              | yes
     6        | 44  | medium           | no               | no
     7        | 31  | high             | yes              | no
     8        | 29  | low              | no               | no
     9        | 41  | high             | no               | no
     10       | 29  | medium           | yes              | no
     11       | 52  | medium           | no               | yes
     12       | 55  | high             | no               | no
     13       | 52  | medium           | yes              | no
     14       | 38  | low              | no               | yes

  5. The data set
      - Recall, a tabular data set (“structured data”):
        - Has instances (examples, rows, observations, customers, …)
        - And features (attributes, fields, variables, predictors, covariates, explanatory variables, regressors, independent variables)
      - These features can be:
        - Numeric (continuous)
        - Categorical (discrete, factor): either nominal (binary as a special case) or ordinal
      - A target (label, class, dependent variable, response variable) can also be present
        - Numeric, categorical, …

  6. Constructing a data set takes work
      - Merging different data sources
      - Levels of aggregation, e.g. household versus individual customer
      - Linking instance identifiers
      - Definition of the target variable
      - Cleaning, preprocessing, featurization
      Let’s go through the different steps…

  7. Data Selection

  8. Data types
      - Master data
        - Relates to core entities the company is working with
        - E.g. customers, products, employees, suppliers, vendors
        - Typically stored in operational databases and data warehouses (historical view)
      - Transactional data
        - Timing, quantity and items
        - E.g. POS data, credit card transactions, money transfers, web visits, etc.
        - Will typically require a featurization step (see later)
      - External data
        - Social media data (e.g. Facebook, Twitter), e.g. for sentiment analysis
        - Macro-economic data (e.g. GDP, inflation)
        - Weather data
        - Competitor data
        - Search data (e.g. Google Trends)
        - Web scraped data
      - Open data: external data that anyone can access, use and share
        - Government data (e.g. Eurostat, OECD)
        - Scientific data
      - Though keep the production context in mind!

  9. Example: Google Trends https://medium.com/dataminingapps-articles/forecasting-with-google-trends-114ab741bda4

  10. Example: Google Street View https://arxiv.org/ftp/arxiv/papers/1904/1904.05270.pdf

  11. Data types
      - Structured, unstructured, semi-structured data
        - Better: tabular, imagery, time series, …
      - Small vs. big data
      - Metadata
        - Data that describes other data (“data about data”)
        - Data definitions, e.g. stored in the DBMS catalog
        - Oftentimes lacking, but can help a great deal in understanding, and in feature extraction as well

  12. Selection
      - As data mining can only uncover patterns actually present in the data, the target data set must be large/complete enough to contain these patterns
        - But concise enough to be mined within an acceptable time limit
        - Data marts and data warehouses can help
      - When you want to predict a target variable y, make absolutely sure you’re not cheating by including explanatory variables which are too “perfectly” correlated with the target
        - “Too good to be true”
        - Do you know this explanatory variable after the target outcome, or before? At the time when your model will be used, or after?
        - Your finished model will not be able to look into the future!
      - Set aside a hold-out test set as soon as possible (see the sketch below)
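     A minimal sketch of setting aside a hold-out test set early. The file name churn.csv and the Churn column are assumptions, matching the toy table in slide 4:

     ```python
     import pandas as pd
     from sklearn.model_selection import train_test_split

     # Hypothetical file containing the churn table from slide 4
     df = pd.read_csv("churn.csv")
     X = df.drop(columns=["Churn"])
     y = df["Churn"]

     # Hold out 20% as a test set right away; stratify so the churn ratio
     # stays comparable between training and test data
     X_train, X_test, y_train, y_test = train_test_split(
         X, y, test_size=0.2, stratify=y, random_state=42)
     ```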

  13. Data Exploration

  14. Exploration
      - Visual analytics as a means for initial exploration (EDA: exploratory data analysis; see the sketch below)
        - Boxplots
        - Scatter plots
        - Histograms
        - Correlation plots
      - Basic statistics
      - There is some debate on this topic re: bias
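     A quick EDA pass with pandas and matplotlib, as a sketch only; the file name and the Age column are assumptions (numeric_only in corr requires a recent pandas):

     ```python
     import pandas as pd
     import matplotlib.pyplot as plt

     df = pd.read_csv("churn.csv")               # hypothetical data set

     print(df.describe(include="all"))           # basic statistics per feature
     df.hist(figsize=(10, 6))                    # histograms of the numeric features
     df.boxplot(column="Age")                    # boxplot, e.g. to spot extreme ages
     pd.plotting.scatter_matrix(df.select_dtypes("number"), figsize=(8, 8))  # pairwise scatter plots
     print(df.corr(numeric_only=True))           # correlation matrix (numeric features only)
     plt.show()
     ```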

  15. Data Cleaning

  16. Cleaning
      - Consistency
      - Detecting errors and duplications
        - K.U.L., KU Leuven, K.U.Leuven, KULeuven…
      - Transforming different representations to a common format (see the sketch below)
        - Male, Man, MAN → M
        - True, YES, Ok → 1
      - Removing “future variables”
        - Or variables that got modified according to the target at hand
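     A small sketch of mapping inconsistent representations to a common format; the gender column and the mapping are illustrative assumptions:

     ```python
     import pandas as pd

     df = pd.DataFrame({"gender": ["Male", "Man", "MAN", " male ", "F", "Female"]})

     # Map the different spellings of the same category onto one common format
     gender_map = {"male": "M", "man": "M", "f": "F", "female": "F"}
     df["gender"] = df["gender"].str.strip().str.lower().map(gender_map)

     print(df["gender"].unique())   # only 'M' and 'F' remain
     ```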

  17. Missing values
      - Many techniques cannot deal with missing values
      - Not applicable (credit card limit = NA if no credit card owned) versus unknown or not disclosed (age = NA)
      - Missing at random vs. missing not at random
      - Detection is easy
      - Main treatment options:
        - Delete: complete row/column
        - Replace (impute): by mean, median (more robust), or mode
          - Or by a separate model (the missing value then becomes the target to predict); often not worth it in practice
        - Keep: include a separate missing value indicator feature

  18. Missing values: “Detection is easy?”

  19. Missing values: missingno (Python) or VIM (R)
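     A minimal sketch of the missingno package mentioned on the slide (the data file is a hypothetical name):

     ```python
     import pandas as pd
     import missingno as msno

     df = pd.read_csv("churn.csv")   # hypothetical data set

     msno.matrix(df)     # nullity matrix: where do missing values occur per row?
     msno.bar(df)        # number of non-missing values per feature
     msno.heatmap(df)    # nullity correlation: do features tend to be missing together?
     ```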

  20. Missing values

  21. Missing values
      - Common approach: delete rows with too many missing values, impute features using median and mode, and add a separate column indicating whether the original value was missing (see the sketch below)
      - Sometimes the median/mode imputation is done within the same class label
      - More advanced imputation: nearest-neighbor based
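     A sketch of the common approach described above, assuming a hypothetical churn.csv training file and a "more than half missing" threshold for row deletion:

     ```python
     import pandas as pd

     df = pd.read_csv("churn.csv")   # hypothetical training data

     # 1. Delete rows with too many missing values (here: more than half of the features)
     df = df[df.isna().mean(axis=1) <= 0.5].copy()

     # 2. Per feature: add a missing-value indicator, then impute median (numeric) or mode (categorical)
     for col in list(df.columns):
         if df[col].isna().any():
             df[col + "_missing"] = df[col].isna().astype(int)
             if pd.api.types.is_numeric_dtype(df[col]):
                 df[col] = df[col].fillna(df[col].median())
             else:
                 df[col] = df[col].fillna(df[col].mode()[0])
     ```

     For the nearest-neighbor variant, scikit-learn's KNNImputer can be used on the numeric features instead of the median.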

  22. Intermezzo
      - Always keep the production setting in mind: new unseen instances can contain missing values as well
        - Don’t impute with a new median! Use the same “rules” as with the training data
        - Also when working with validation data!
      - What if we’ve never observed a missing value for this feature before?
        - Use the original training data to construct an imputation
        - Consider rebuilding your model (monitoring!)
      - Everything you do over multiple instances becomes part of the model! (see the sketch below)
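     A minimal sketch of reusing the training-time imputation rules on validation or production data; the file names and the median/mode choice per column are assumptions:

     ```python
     import pandas as pd

     df_train = pd.read_csv("churn_train.csv")   # hypothetical training data
     df_valid = pd.read_csv("churn_valid.csv")   # hypothetical validation / production data

     # The imputation values learned on the training data are part of the model:
     # compute them once and store them
     fill_values = {
         col: (df_train[col].median() if pd.api.types.is_numeric_dtype(df_train[col])
               else df_train[col].mode()[0])
         for col in df_train.columns
     }

     # Reuse exactly the same values on new data; never recompute a new median here
     df_valid = df_valid.fillna(fill_values)
     ```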

  23. Outliers
      - Extreme observations (age = 241 or income = 3 million)
      - Can be valid or invalid
      - Some techniques are sensitive to outliers
      - Detection: histograms, z-scores, box plots (see the sketch below)
      - Sometimes the detection of outliers is the whole “analysis” task
        - Anomaly detection (fraud analytics)
      - Treatment:
        - As missing: for invalid outliers, consider them as missing values
        - Treat: for valid outliers, but depends on the technique you will use: keep as-is, truncate (cap), categorize
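     A sketch of z-score based detection and capping; the Income column and the 3-sigma / percentile thresholds are illustrative assumptions:

     ```python
     import numpy as np
     import pandas as pd

     df = pd.read_csv("churn.csv")   # hypothetical data with an 'Income' column

     # Detection via z-scores: flag values more than 3 standard deviations from the mean
     z = (df["Income"] - df["Income"].mean()) / df["Income"].std()
     suspects = df[np.abs(z) > 3]

     # Treatment of valid outliers: truncate (cap) at the 1st and 99th percentile
     lower, upper = df["Income"].quantile([0.01, 0.99])
     df["Income"] = df["Income"].clip(lower=lower, upper=upper)
     ```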

  24. Outliers

  25. Outliers

  26. Outliers

  27. Duplicate rows
      - Duplicate rows can be valid or invalid
      - Treat accordingly (see the sketch below)
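     A small sketch with pandas (hypothetical file name):

     ```python
     import pandas as pd

     df = pd.read_csv("churn.csv")   # hypothetical data set

     # Inspect duplicates first: repeated rows can be valid (e.g. identical transactions)
     print(df[df.duplicated(keep=False)])

     # Only drop them when they are invalid copies
     df = df.drop_duplicates()
     ```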

  28. Transformations: standardization
      - Standardization: constrain a feature to ∼ N(0, 1)

        x_new = (x − μ) / σ

      - Good/necessary for Gaussian-distributed data and some techniques: SVMs, regression models, k-nearest neighbors, everything working with Euclidean distance or similarity
      - Useless for other techniques: decision trees
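     A sketch using scikit-learn's StandardScaler; the toy arrays are made up for illustration:

     ```python
     import numpy as np
     from sklearn.preprocessing import StandardScaler

     X_train = np.array([[24.0, 1200.0], [42.0, 300.0], [31.0, 2500.0]])   # toy numeric features
     X_new = np.array([[29.0, 800.0]])                                     # new unseen instance

     scaler = StandardScaler()                      # implements x_new = (x - mu) / sigma
     X_train_std = scaler.fit_transform(X_train)    # mu and sigma are estimated on training data only
     X_new_std = scaler.transform(X_new)            # the same mu and sigma are reused here
     ```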

  29. Transformations: normalization
      - Normalization: also called “feature scaling”
      - Rescale to [0, 1] or [-1, 1]

        x_new = (x − x_min) / (x_max − x_min)

      - In credit risk models, this is oftentimes applied to the resulting scores so that they fall in [0, 1000]
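     The same idea with MinMaxScaler, again on a made-up toy feature:

     ```python
     import numpy as np
     from sklearn.preprocessing import MinMaxScaler

     X_train = np.array([[19.0], [38.0], [55.0]])   # toy feature (e.g. age)

     scaler = MinMaxScaler(feature_range=(0, 1))    # x_new = (x - x_min) / (x_max - x_min)
     X_scaled = scaler.fit_transform(X_train)       # x_min and x_max come from the training data
     ```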

  30. Transformations: categorization
      - Also called “coarse classification”, “classing”, “binning”, “grouping”
      - Continuous to nominal
        - Binning: group ranges into categories
        - Can be useful to treat outliers
        - Oftentimes driven by domain knowledge
        - Equal width/interval binning versus equal frequency binning (histogram equalization); see the sketch below
      - Nominal to reduced nominal
        - Grouping: grouping multiple nominal levels together
        - In case you have many levels (…)
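     A binning sketch with pandas, using the ages from the slide 4 table; the cut points and labels for the domain-driven variant are assumptions:

     ```python
     import pandas as pd

     ages = pd.Series([19, 21, 24, 29, 29, 31, 34, 38, 41, 42, 44, 52, 52, 55])

     equal_width = pd.cut(ages, bins=3)    # equal width/interval binning: bins span equal ranges
     equal_freq = pd.qcut(ages, q=3)       # equal frequency binning: roughly equal counts per bin

     # Domain-knowledge driven binning with explicit cut points and labels
     age_groups = pd.cut(ages, bins=[0, 30, 45, 120], labels=["young", "middle", "senior"])
     ```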

  31. Transformations: categorization
      - Treat outliers
      - Make the final model more interpretable
      - Reduce the curse of dimensionality following from a high number of levels
      - Introduce non-linear effects into linear models

  32. Transformations: dummyfication and other encodings
      - Nominal to continuous
      - Dummy variables: artificial variables to represent an attribute with multiple categories (“one-hot encoding”); see the sketch below
        - Mainly used in regression (cannot deal with categoricals directly)
        - E.g. account activity = high, medium, low; convert to:
          account_activity_high = 0, 1
          account_activity_medium = 0, 1
          account_activity_low = 0, 1
          … and then drop one
      - Binary encoding
        - E.g. binarization:
          account_activity_high = 1 → 001
          account_activity_medium = 2 → 010
          account_activity_low = 3 → 100
        - More compact than dummy variables
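     A one-hot encoding sketch with pandas, mirroring the account activity example above:

     ```python
     import pandas as pd

     df = pd.DataFrame({"account_activity": ["high", "medium", "low", "medium", "high"]})

     # One-hot encoding / dummy variables; drop_first removes one redundant dummy
     dummies = pd.get_dummies(df["account_activity"], prefix="account_activity", drop_first=True)
     df = pd.concat([df, dummies], axis=1)

     print(df.columns.tolist())
     # ['account_activity', 'account_activity_low', 'account_activity_medium']
     ```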

  33. Transformations: high-level categoricals
      - What now if we have a categorical variable with too many levels (or, alternatively, too many dummy variables)?
      - Domain-knowledge driven grouping (e.g. NACE codes)
      - Frequency-based grouping (see the sketch below)
        - E.g. postal code = 2000, 3000, 9401, 3001, …
        - Solution 1: group using domain knowledge (working area, communities)
        - Solution 2: create a new variable postal_code_count
          - E.g. postal_code_count = 23 if the original postal code appears 23 times in the training data
          - (Again: keep the same rule for validation/production!)
          - You lose detailed information, but the goal is that the model can pick up on frequencies
      - Odds based grouping
      - Weight of evidence encoding
      - Probabilistic transformations and other “Kaggle” tricks such as Leave One Out Mean (Owen Zhang)
      - Decision tree based
      - Embeddings
      - Not: integer encoding if your variable is not ordinal!
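     A frequency-based grouping sketch; the postal codes and the fallback of 0 for unseen levels are illustrative assumptions:

     ```python
     import pandas as pd

     train = pd.DataFrame({"postal_code": [2000, 3000, 2000, 9401, 3001, 2000]})
     valid = pd.DataFrame({"postal_code": [3000, 8500]})   # 8500 never seen during training

     # Frequency-based grouping: replace each level by how often it occurs in the training data
     counts = train["postal_code"].value_counts()
     train["postal_code_count"] = train["postal_code"].map(counts)

     # Keep the same rule for validation/production: map with the training counts,
     # and fall back to 0 for levels that were never observed
     valid["postal_code_count"] = valid["postal_code"].map(counts).fillna(0)
     ```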

  34. Transformations: odds based grouping
      - Drawback of equal-interval or equal-frequency binning: they do not take the outcome into account
      - The “pivot table” approach (see the sketch below):
        - Create a pivot table of the attribute versus the target and compute the odds
        - Group variable values having similar odds
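     A sketch of the pivot table approach on made-up data (the toy values and column names are assumptions):

     ```python
     import pandas as pd

     df = pd.DataFrame({
         "account_activity": ["low", "low", "low", "medium", "medium", "medium", "high", "high"],
         "churn":            [1,     1,     0,     1,        0,        0,        0,      0],
     })

     # Pivot the attribute against the target and compute the odds of churning per level
     pivot = df.groupby("account_activity")["churn"].agg(churners="sum", total="count")
     pivot["odds"] = pivot["churners"] / (pivot["total"] - pivot["churners"])
     print(pivot.sort_values("odds"))

     # Levels with similar odds can then be grouped into the same coarse category
     ```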
