Why generate features?
Feature Engineering for Machine Learning in Python
Robert O'Callaghan, Director of Data Science, Ordergroove
Feature Engineering
Different types of data
Continuous: either integers (or whole numbers) or floats (decimals)
Categorical: one of a limited set of values, e.g. gender, country of birth
Ordinal: ranked values, often with no detail of distance between them
Boolean: True/False values
Datetime: dates and times
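As a rough illustration of how these five types can be represented in pandas, here is a minimal sketch. The column names and values are hypothetical and are not taken from the course dataset.

```python
import pandas as pd

# Hypothetical toy data: one column per data type listed above.
df = pd.DataFrame({
    "salary": [52000.0, 61000.5, 48000.0],                       # continuous
    "country": ["USA", "UK", "India"],                           # categorical
    "education": ["Bachelor", "Master", "Bachelor"],             # ordinal
    "is_hobbyist": [True, False, True],                          # boolean
    "survey_date": ["2018-02-28", "2018-06-28", "2018-06-06"],   # datetime, read as text
})

# Unordered categories: a limited set of values with no ranking.
df["country"] = df["country"].astype("category")

# Ordered categories: ranked values, with no notion of distance between ranks.
df["education"] = pd.Categorical(
    df["education"], categories=["Bachelor", "Master"], ordered=True
)

# Parse the text column into real datetime values.
df["survey_date"] = pd.to_datetime(df["survey_date"])

print(df.dtypes)
```

Text columns are read as generic strings by default, so categorical, ordinal and datetime columns usually need an explicit conversion like the ones above.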
Course structure
Chapter 1: Feature creation and extraction
Chapter 2: Engineering messy data
Chapter 3: Feature normalization
Chapter 4: Working with text features
Pandas

    import pandas as pd

    df = pd.read_csv(path_to_csv_file)
    print(df.head())
Dataset

                SurveyDate  \
    0  2018-02-28 20:20:00
    1  2018-06-28 13:26:00
    2  2018-06-06 03:37:00
    3  2018-05-09 01:06:00
    4  2018-04-12 22:41:00

                                FormalEducation
    0  Bachelor's degree (BA. BS. B.Eng.. etc.)
    1  Bachelor's degree (BA. BS. B.Eng.. etc.)
    2  Bachelor's degree (BA. BS. B.Eng.. etc.)
    3         Some college/university study ...
    4  Bachelor's degree (BA. BS. B.Eng.. etc.)
Column names

    print(df.columns)

    Index(['SurveyDate', 'FormalEducation', 'ConvertedSalary', 'Hobby',
           'Country', 'StackOverflowJobsRecommend', 'VersionControl', 'Age',
           'Years Experience', 'Gender', 'RawSalary'],
          dtype='object')
Column types

    print(df.dtypes)

    SurveyDate           object
    FormalEducation      object
    ConvertedSalary     float64
    ...
    Years Experience      int64
    Gender               object
    RawSalary            object
    dtype: object
Selecting specific data types

    only_ints = df.select_dtypes(include=['int'])
    print(only_ints.columns)

    Index(['Age', 'Years Experience'], dtype='object')
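The same selection can be tried on a small stand-alone frame. This is a sketch on hypothetical values; only the column names come from the slides. It names the integer width explicitly, because the bare 'int' alias can resolve to a different width depending on the platform, and it also shows the 'number' selector for picking up every numeric column.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the survey dataset.
df = pd.DataFrame({
    "Age": [21, 38, 45],
    "Years Experience": [1, 15, 20],
    "ConvertedSalary": [40000.0, 70841.0, 55000.0],
    "Gender": ["Male", "Female", "Male"],
})

# Pin the integer columns to int64 so the selection behaves the same everywhere.
df = df.astype({"Age": "int64", "Years Experience": "int64"})

# Integer columns only.
only_ints = df.select_dtypes(include=["int64"])
print(only_ints.columns)

# Every numeric column, integers and floats alike.
numeric = df.select_dtypes(include=["number"])
print(numeric.columns)
```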
Let's get going!
Dealing with Categorical Variables
Encoding categorical features
Encoding categorical features
One-hot encoding
Dummy encoding
One-hot encoding

    pd.get_dummies(df, columns=['Country'], prefix='C')

       C_France  C_India  C_UK  C_USA
    0         0        1     0      0
    1         0        0     0      1
    2         0        0     1      0
    3         0        0     1      0
    4         1        0     0      0
Dummy encoding

    pd.get_dummies(df, columns=['Country'], drop_first=True, prefix='C')

       C_India  C_UK  C_USA
    0        1     0      0
    1        0     0      1
    2        0     1      0
    3        0     1      0
    4        0     0      0
One-hot vs. dummies
One-hot encoding: Explainable features
Dummy encoding: Necessary information without duplication
Original column:

    Index  Sex
    0      Male
    1      Female
    2      Male

One-hot encoded:

    Index  Male  Female
    0      1     0
    1      0     1
    2      1     0

Dummy encoded:

    Index  Male
    0      1
    1      0
    2      1
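The Sex example can be reproduced with a short sketch. The prefix 'S' and the dtype=int argument are choices made here for readable 0/1 output; they are not from the slides. Note that pandas orders the generated columns alphabetically, so with drop_first=True it is the Female column that gets dropped and Male that remains, matching the single-column table above.

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["Male", "Female", "Male"]})

# One-hot: one column per category (S_Female, S_Male).
one_hot = pd.get_dummies(df, columns=["Sex"], prefix="S", dtype=int)
print(one_hot)

# Dummy: the first category (alphabetically, Female) is dropped.
# A row of all zeros then stands for the dropped category.
dummy = pd.get_dummies(df, columns=["Sex"], drop_first=True, prefix="S", dtype=int)
print(dummy)
```

The dropped column carries no extra information: it can always be rebuilt as one minus the sum of the remaining columns, which is the duplication that dummy encoding avoids.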
Limiting your columns

    counts = df['Country'].value_counts()
    print(counts)

    'USA'       8
    'UK'        6
    'India'     2
    'France'    1
    Name: Country, dtype: object
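Limiting your columns

    # Flag rows whose country occurs fewer than 5 times
    # (counts comes from df['Country'].value_counts())
    mask = df['Country'].isin(counts[counts < 5].index)

    # Relabel the rare countries in one step; .loc avoids chained assignment
    df.loc[mask, 'Country'] = 'Other'
    print(df['Country'].value_counts())

    'USA'      8
    'UK'       6
    'Other'    3
    Name: Country, dtype: object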
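The relabelling step can also be wrapped in a small reusable helper. This is a sketch only: group_rare and its parameters are hypothetical names, and the sample series simply recreates the counts shown on the slide.

```python
import pandas as pd


def group_rare(series: pd.Series, min_count: int, other: str = "Other") -> pd.Series:
    """Return a copy of `series` with categories seen fewer than `min_count` times relabelled."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; replace the rest with the `other` label.
    return series.where(~series.isin(rare), other)


# Recreate the country counts from the slide: USA 8, UK 6, India 2, France 1.
countries = pd.Series(
    ["USA"] * 8 + ["UK"] * 6 + ["India"] * 2 + ["France"], name="Country"
)

grouped = group_rare(countries, min_count=5)
print(grouped.value_counts())
```

Returning a new series leaves the original column untouched, which makes it easy to compare different thresholds before committing to one.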
Now you deal with categorical variables
Numeric variables
Types of numeric features
Age
Price
Counts
Geospatial data
Does size matter?
Binarizing numeric variables

    df['Binary_Violation'] = 0
    df.loc[df['Number_of_Violations'] > 0, 'Binary_Violation'] = 1
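A self-contained sketch of the same binarization on hypothetical violation counts. It uses a boolean comparison cast to integers, which gives the same result as the two-step assignment above.

```python
import pandas as pd

# Hypothetical violation counts; only the column names come from the slide.
df = pd.DataFrame({"Number_of_Violations": [0, 1, 0, 3, 2]})

# True where there is at least one violation, then cast True/False to 1/0.
df["Binary_Violation"] = (df["Number_of_Violations"] > 0).astype(int)

print(df)
```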
Binning numeric variables

    import numpy as np

    df['Binned_Group'] = pd.cut(
        df['Number_of_Violations'],
        bins=[-np.inf, 0, 2, np.inf],
        labels=[1, 2, 3]
    )
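To see which values fall into which bin, here is a minimal runnable sketch on hypothetical counts. By default pd.cut treats each bin as open on the left and closed on the right, so these edges produce the groups "0 or fewer", "1 to 2", and "more than 2".

```python
import numpy as np
import pandas as pd

# Hypothetical violation counts; only the column names come from the slide.
df = pd.DataFrame({"Number_of_Violations": [0, 1, 2, 3, 5]})

# Bins are (-inf, 0], (0, 2], (2, inf) and receive the labels 1, 2, 3.
df["Binned_Group"] = pd.cut(
    df["Number_of_Violations"],
    bins=[-np.inf, 0, 2, np.inf],
    labels=[1, 2, 3],
)

print(df)
```

The resulting column is an ordered categorical, so the labels keep their ranking even though they look like plain integers.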
Let's start practicing!