CSE217 INTRODUCTION TO DATA SCIENCE LECTURE 10: DATA ENGINEERING Spring 2019 Marion Neumann
RECAP: FEATURE ENGINEERING
5 Good Reasons for Feature Engineering:
1) get better-represented features (scaling, standardization; see the sketch below) → improve model training and prediction quality
2) get more expressive features → improve model training and prediction quality (→ ACTIVITY 1)
3) get more representative features → remove noise
4) get fewer features → more efficient computation
5) represent features in 2d or 3d → visualization
Also: bring the data into vector representation x = (x_1, …, x_d)^T → not all data necessarily comes in vector form. What data does not come in vector form?
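A minimal NumPy sketch of reason 1 (scaling and standardization); the toy feature matrix and its values are made up for illustration:

```python
import numpy as np

# toy feature matrix: 4 samples, 2 features on very different scales (illustrative)
X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 1000.0],
              [4.0, 4000.0]])

# min-max scaling: squash each feature into [0, 1]
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# standardization: zero mean, unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled)
print(X_std)
```

After scaling, both features contribute on comparable scales, which helps distance-based and gradient-based methods train faster and predict better.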
COMPLEX DATA
• data that does not come as numerical vectors → requires feature extraction
• examples:
  • images
  • text, e.g. a review: "Same great flavor and friendly service as in the S 18th street location. This location is not as small but it's hard to talk to friends. Thankfully there is great outdoor seating to escape the noise." → word-count features such as location, great, friends, small, … (see the bag-of-words sketch below)
  • street networks
  • chemical compounds (e.g. predict whether an unseen compound is mutagenic or non-mutagenic)
  • point clouds
  • social networks
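A minimal bag-of-words sketch for the review text above, assuming a recent scikit-learn (CountVectorizer); the words printed at the end mirror the features highlighted on the slide:

```python
from sklearn.feature_extraction.text import CountVectorizer

# the review from the text example above
review = ("Same great flavor and friendly service as in the S 18th street location. "
          "This location is not as small but it's hard to talk to friends. "
          "Thankfully there is great outdoor seating to escape the noise.")

# bag-of-words: each document becomes a vector of word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform([review])          # 1 x vocabulary-size count matrix

vocab = list(vectorizer.get_feature_names_out())
counts = X.toarray()[0]
for word in ["location", "great", "friends", "small"]:
    print(word, counts[vocab.index(word)])
```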
FEATURE EXTRACTION
• feature extraction is challenging
  • which features to compute and use?
  • too many choices and combinations
• NNs can be used to learn features
  • sometimes those features are even interpretable
  • works well for images and text
FEATURE EXTRACTION VS LEARNING
• feature learning: train a NN on a very large corpus, e.g. a topic classifier (politics, sports, culture)
• use the pre-trained NN on new (small) data to extract a feature vector per document, e.g. z_1 = 0.23, z_2 = 3.42, z_3 = 0.89, …
• train a separate, simple ML model (kNN, random forest, …) on these features, e.g. to predict positive vs. negative sentiment (see the sketch below)
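A minimal sketch of this pipeline, assuming scikit-learn for the downstream model; extract_features is a hypothetical placeholder for a pre-trained NN's penultimate-layer activations, and the texts, labels, and random feature values are made up so the sketch runs on its own:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def extract_features(texts):
    # Placeholder for a pre-trained NN (e.g. a topic classifier trained on a
    # very large corpus): in practice, return its penultimate-layer activations.
    # Random values here only keep the sketch self-contained and runnable.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 3))    # feature vectors z = [z_1, z_2, z_3]

texts = ["great outdoor seating", "hard to talk to friends",
         "friendly service", "too noisy and small"]
labels = ["positive", "negative", "positive", "negative"]

Z = extract_features(texts)                    # learned features for the small data set

# train a separate, simple model (here kNN) on the learned features
clf = KNeighborsClassifier(n_neighbors=1).fit(Z, labels)
print(clf.predict(extract_features(["great flavor"])))
```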
DATA ENGINEERING
• instead of removing or creating features, create or remove data points!
• Why could that be useful?
  • remove outliers
  • create more training examples → data augmentation
CAUSES OF OUTLIERS
• data entry errors (human errors)
• measurement errors (instrument errors)
• experimental errors
  • data extraction errors
  • experiment planning/execution errors
• intentional
  • dummy outliers made to test detection methods
• data processing errors
  • data manipulation
  • unintended mutations of the data set
• sampling errors
  • extracting or mixing data from wrong or various sources
• natural → not an error, novelties in the data
OUTLIER DETECTION: Z-SCORE ANALYSIS
• standardize your data: z_i = (x_i − μ) / σ, where μ = (1/n) Σ_i x_i is the sample mean and σ = sqrt((1/n) Σ_i (x_i − μ)²) is the sample standard deviation
• empirical rule: for normally distributed data, roughly 68% / 95% / 99.7% of samples fall within 1, 2, and 3 standard deviations of the mean → samples with a large |z_i| (e.g. beyond 2 or 3) are flagged as outliers (see the sketch below)
• parametric approach → makes the parametric assumption that the features are normally distributed
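A minimal NumPy sketch of z-score outlier detection; the data values and the threshold of 2 are illustrative choices, not from the slide:

```python
import numpy as np

x = np.array([10., 12., 11., 9., 10., 11., 95.])   # illustrative 1-d sample

mu = x.mean()                  # sample mean
sigma = x.std()                # sample standard deviation
z = (x - mu) / sigma           # standardized data (z-scores)

threshold = 2.0                # e.g. flag |z| > 2; ~95% of normal data lies within
outliers = x[np.abs(z) > threshold]
print(outliers)                # -> [95.]
```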
DATA AUGMENTATION
• easy for images → perform image transformations (see the sketch below)
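A minimal NumPy sketch of image augmentation; the specific transformations (flip, rotation, additive noise) and the random stand-in image are illustrative choices:

```python
import numpy as np

def augment(image, rng):
    """Return simple augmented copies of an image given as an H x W x C array."""
    return [
        np.fliplr(image),                                         # horizontal flip
        np.rot90(image, k=1, axes=(0, 1)),                        # 90-degree rotation
        np.clip(image + rng.normal(0, 0.02, image.shape), 0, 1),  # small additive noise
    ]

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))        # stand-in for a real training image in [0, 1]
augmented = augment(img, rng)
print(len(augmented), augmented[0].shape)
```

Each augmented copy keeps the original label, so a small labelled set yields several times as many training examples.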
DATA AUGMENTATION
• Why? → NNs need a ton of training data
• example: an image classified as "panda" (57.7% confidence), perturbed to x + ε · sign(∇_x J(θ, x, y)) with ε = 0.007, is classified as "gibbon" (99.3% confidence); the perturbation itself is classified as "nematode" (8.2% confidence)
• NN performance is very sensitive to adversarially created noise (see the sketch below)
• collecting (and labelling) data is very time consuming/expensive
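A minimal NumPy sketch of the fast-gradient-sign construction behind the panda example; only the update x + ε · sign(∇_x J(θ, x, y)) comes from the slide — the toy logistic-regression model, its random weights, and the dimensionality are assumptions made so the code is self-contained:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# For a logistic-regression "model" with cross-entropy loss J and prediction
# p = sigmoid(w.x + b), the gradient w.r.t. the input is grad_x J = (p - y) * w,
# so no autodiff library is needed for this sketch.
rng = np.random.default_rng(0)
d = 1000
w = rng.normal(size=d) * 0.05     # stand-in for trained weights
b = 0.0
x = rng.normal(size=d)            # a clean input
y = 1.0                           # its true label

p = sigmoid(w @ x + b)
grad_x = (p - y) * w                    # gradient of the loss w.r.t. the input
eps = 0.007                             # step size from the slide's example
x_adv = x + eps * np.sign(grad_x)       # adversarial input: tiny per-feature change

print("clean prediction:      ", sigmoid(w @ x + b))
print("adversarial prediction:", sigmoid(w @ x_adv + b))
```

On a trained deep network and a real high-dimensional image, the same imperceptible per-pixel change can flip the predicted class entirely, as the panda-to-gibbon example shows.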
SUMMARY & READING
• More expressive features (even simple transformations) can greatly improve training time and model quality.
• Feature learning is useful to deal with non-vectorial input data.
• Data engineering can improve supervised models by removing outliers or by data augmentation.
• Neural networks are tricky to train and very sensitive to noise.
• Reading:
  • [DSFS] Ch18: Neural Networks (p. 213–218)
  • http://playground.tensorflow.org