CSE217 INTRODUCTION TO DATA SCIENCE LECTURE 10: DATA ENGINEERING Spring 2019 Marion Neumann
RECAP: FEATURE ENGINEERING
5 Good Reasons for Feature Engineering:
1) get better-represented features (scaling, standardization; see the sketch below) → improve model training and prediction quality
2) get more expressive features → improve model training and prediction quality (→ ACTIVITY 1)
3) get more representative features → remove noise
4) get fewer features → more efficient computation
5) represent features in 2d or 3d → visualization
Also: bring the data into vector representation x = (x_1, …, x_d)^T → not all data necessarily comes in vector form. What data does not come in vector form?
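A minimal NumPy sketch of reason 1 (scaling and standardization); the toy feature matrix and its values are made up for illustration:

```python
import numpy as np

# toy feature matrix: 4 samples, 2 features on very different scales (illustrative)
X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 1000.0],
              [4.0, 4000.0]])

# min-max scaling: squash each feature into [0, 1]
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# standardization: zero mean, unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled)
print(X_std)
```

After scaling, both features contribute on comparable scales, which helps distance-based and gradient-based methods train faster and predict better.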
COMPLEX DATA
• data that does not come as numerical vectors → requires feature extraction
• examples:
  • images
  • text, e.g. a review: "Same great flavor and friendly service as in the S 18th street location. This location is not as small but it's hard to talk to friends. Thankfully there is great outdoor seating to escape the noise." → word-count features such as location, great, friends, small, … (see the bag-of-words sketch below)
  • street networks
  • chemical compounds (e.g. predict whether an unseen compound is mutagenic or non-mutagenic)
  • point clouds
  • social networks
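A minimal bag-of-words sketch for the review text above, assuming a recent scikit-learn (CountVectorizer); the words printed at the end mirror the features highlighted on the slide:

```python
from sklearn.feature_extraction.text import CountVectorizer

# the review from the text example above
review = ("Same great flavor and friendly service as in the S 18th street location. "
          "This location is not as small but it's hard to talk to friends. "
          "Thankfully there is great outdoor seating to escape the noise.")

# bag-of-words: each document becomes a vector of word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform([review])          # 1 x vocabulary-size count matrix

vocab = list(vectorizer.get_feature_names_out())
counts = X.toarray()[0]
for word in ["location", "great", "friends", "small"]:
    print(word, counts[vocab.index(word)])
```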
FEATURE EXTRACTION
• feature extraction is challenging
  • which features to compute and use?
  • too many choices and combinations
• NNs can be used to learn features
  • sometimes those features are even interpretable
  • works well for images and text
FEATURE EXTRACTION VS LEARNING
• feature learning: train a NN on a very large corpus, e.g. a topic classifier (politics, sports, culture)
• use the pre-trained NN on new (small) data to extract a feature vector per document, e.g. z_1 = 0.23, z_2 = 3.42, z_3 = 0.89, …
• train a separate, simple ML model (kNN, random forest, …) on these features, e.g. to predict positive vs. negative sentiment (see the sketch below)
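A minimal sketch of this pipeline, assuming scikit-learn for the downstream model; extract_features is a hypothetical placeholder for a pre-trained NN's penultimate-layer activations, and the texts, labels, and random feature values are made up so the sketch runs on its own:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def extract_features(texts):
    # Placeholder for a pre-trained NN (e.g. a topic classifier trained on a
    # very large corpus): in practice, return its penultimate-layer activations.
    # Random values here only keep the sketch self-contained and runnable.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 3))    # feature vectors z = [z_1, z_2, z_3]

texts = ["great outdoor seating", "hard to talk to friends",
         "friendly service", "too noisy and small"]
labels = ["positive", "negative", "positive", "negative"]

Z = extract_features(texts)                    # learned features for the small data set

# train a separate, simple model (here kNN) on the learned features
clf = KNeighborsClassifier(n_neighbors=1).fit(Z, labels)
print(clf.predict(extract_features(["great flavor"])))
```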
DATA ENGINEERING
• instead of removing or creating features, create or remove data points!
• Why could that be useful?
  • remove outliers
  • create more training examples → data augmentation
CAUSES OF OUTLIERS
• data entry errors (human errors)
• measurement errors (instrument errors)
• experimental errors
  • data extraction errors
  • experiment planning/execution errors
• intentional
  • dummy outliers made to test detection methods
• data processing errors
  • data manipulation
  • unintended mutations of the data set
• sampling errors
  • extracting or mixing data from wrong or various sources
• natural → not an error, novelties in the data
OUTLIER DETECTION: Z-SCORE ANALYSIS
• standardize your data: z_i = (x_i − μ) / σ, where μ = (1/n) Σ_i x_i is the sample mean and σ = sqrt((1/n) Σ_i (x_i − μ)²) is the sample standard deviation
• empirical rule: for normally distributed data, roughly 68% / 95% / 99.7% of samples fall within 1, 2, and 3 standard deviations of the mean → samples with a large |z_i| (e.g. beyond 2 or 3) are flagged as outliers (see the sketch below)
• parametric approach → makes the parametric assumption that the features are normally distributed
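A minimal NumPy sketch of z-score outlier detection; the data values and the threshold of 2 are illustrative choices, not from the slide:

```python
import numpy as np

x = np.array([10., 12., 11., 9., 10., 11., 95.])   # illustrative 1-d sample

mu = x.mean()                  # sample mean
sigma = x.std()                # sample standard deviation
z = (x - mu) / sigma           # standardized data (z-scores)

threshold = 2.0                # e.g. flag |z| > 2; ~95% of normal data lies within
outliers = x[np.abs(z) > threshold]
print(outliers)                # -> [95.]
```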
DATA AUGMENTATION
• easy for images → perform image transformations (see the sketch below)
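A minimal NumPy sketch of image augmentation; the specific transformations (flip, rotation, additive noise) and the random stand-in image are illustrative choices:

```python
import numpy as np

def augment(image, rng):
    """Return simple augmented copies of an image given as an H x W x C array."""
    return [
        np.fliplr(image),                                         # horizontal flip
        np.rot90(image, k=1, axes=(0, 1)),                        # 90-degree rotation
        np.clip(image + rng.normal(0, 0.02, image.shape), 0, 1),  # small additive noise
    ]

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))        # stand-in for a real training image in [0, 1]
augmented = augment(img, rng)
print(len(augmented), augmented[0].shape)
```

Each augmented copy keeps the original label, so a small labelled set yields several times as many training examples.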
DATA AUGMENTATION
• Why? → NNs need a ton of training data
• example: an image classified as "panda" (57.7% confidence), perturbed to x + ε · sign(∇_x J(θ, x, y)) with ε = 0.007, is classified as "gibbon" (99.3% confidence); the perturbation itself is classified as "nematode" (8.2% confidence)
• NN performance is very sensitive to adversarially created noise (see the sketch below)
• collecting (and labelling) data is very time consuming/expensive
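A minimal NumPy sketch of the fast-gradient-sign construction behind the panda example; only the update x + ε · sign(∇_x J(θ, x, y)) comes from the slide — the toy logistic-regression model, its random weights, and the dimensionality are assumptions made so the code is self-contained:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# For a logistic-regression "model" with cross-entropy loss J and prediction
# p = sigmoid(w.x + b), the gradient w.r.t. the input is grad_x J = (p - y) * w,
# so no autodiff library is needed for this sketch.
rng = np.random.default_rng(0)
d = 1000
w = rng.normal(size=d) * 0.05     # stand-in for trained weights
b = 0.0
x = rng.normal(size=d)            # a clean input
y = 1.0                           # its true label

p = sigmoid(w @ x + b)
grad_x = (p - y) * w                    # gradient of the loss w.r.t. the input
eps = 0.007                             # step size from the slide's example
x_adv = x + eps * np.sign(grad_x)       # adversarial input: tiny per-feature change

print("clean prediction:      ", sigmoid(w @ x + b))
print("adversarial prediction:", sigmoid(w @ x_adv + b))
```

On a trained deep network and a real high-dimensional image, the same imperceptible per-pixel change can flip the predicted class entirely, as the panda-to-gibbon example shows.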
SUMMARY & READING
• More expressive features (even simple transformations) can greatly improve training time and model quality.
• Feature learning is useful to deal with non-vectorial input data.
• Data engineering can improve supervised models by removing outliers or by data augmentation.
• Neural networks are tricky to train and very sensitive to noise.
• Reading:
  • [DSFS] Ch18: Neural Networks (p. 213–218)
  • http://playground.tensorflow.org