Data distributions: Feature Engineering for Machine Learning in Python


  1. Data distributions
     Feature Engineering for Machine Learning in Python
     Robert O'Callaghan, Director of Data Science, Ordergroove

  2. Distribution assumptions

  3. Observing your data

         import matplotlib.pyplot as plt

         # Plot a histogram of every numeric column
         df.hist()
         plt.show()

  4. Delving deeper with box plots

  5. Box plots in pandas

         df[['column_1']].boxplot()
         plt.show()

  6. Pairing distributions

         import seaborn as sns

         sns.pairplot(df)

  7. Further details on your distributions

         df.describe()
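
As a rough illustration of reading the output of `df.describe()`, the sketch below builds a small hypothetical DataFrame (the `age` and `salary` columns and their values are invented for this example) and compares each column's mean with its median, which `describe()` reports as the `50%` row. A mean far above the median hints at a right-skewed column.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame({
    "age": [22, 25, 31, 38, 45, 52],
    "salary": [30_000, 32_000, 35_000, 40_000, 45_000, 400_000],
})

# count, mean, std, min, quartiles and max for each numeric column
summary = df.describe()
print(summary)

# A mean well above the median (the 50% row) suggests a long right tail
mean_minus_median = summary.loc["mean"] - summary.loc["50%"]
print(mean_minus_median)
```

Here the gap is small for `age` and large for `salary`, because a single large salary pulls the mean upward while leaving the median almost unchanged.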

  8. Let's practice!

  9. Scaling and transformations
     Robert O'Callaghan, Data Scientist

  10. Scaling data

  11. Min-Max scaling

  12. Min-Max scaling

  13. Min-Max scaling in Python

         from sklearn.preprocessing import MinMaxScaler

         scaler = MinMaxScaler()
         scaler.fit(df[['Age']])
         df['normalized_age'] = scaler.transform(df[['Age']])
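
Min-Max scaling with default settings maps each value to the range [0, 1] via (x - min) / (max - min). The sketch below checks `MinMaxScaler` against that formula on a few hypothetical `Age` values (the data is invented for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ages, invented for illustration only
df = pd.DataFrame({"Age": [18, 25, 40, 62]})

scaler = MinMaxScaler()
scaled = scaler.fit_transform(df[["Age"]]).ravel()

# The same result computed by hand: (x - min) / (max - min)
age = df["Age"]
manual = ((age - age.min()) / (age.max() - age.min())).to_numpy()

print(scaled)
print(np.allclose(scaled, manual))
```

The smallest age maps to 0 and the largest to 1, so a single extreme value compresses every other point toward one end of the range.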

  14. Standardization

  15. Standardization in Python

         from sklearn.preprocessing import StandardScaler

         scaler = StandardScaler()
         scaler.fit(df[['Age']])
         df['standardized_col'] = scaler.transform(df[['Age']])
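
Standardization subtracts the mean and divides by the standard deviation, giving a column with mean 0 and unit variance. One detail worth knowing when checking results by hand: `StandardScaler` uses the population standard deviation (`ddof=0`), while pandas' `.std()` defaults to the sample version (`ddof=1`). The sketch below, on hypothetical `Age` values, reproduces the scaler's output manually.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical ages, invented for illustration only
df = pd.DataFrame({"Age": [18, 25, 40, 62]})

scaler = StandardScaler()
standardized = scaler.fit_transform(df[["Age"]]).ravel()

# Manual equivalent; ddof=0 matches the scaler's population standard deviation
age = df["Age"]
manual = ((age - age.mean()) / age.std(ddof=0)).to_numpy()

print(standardized)
print(np.allclose(standardized, manual))
```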

  16. Log transformation

  17. Log transformation in Python

         from sklearn.preprocessing import PowerTransformer

         # PowerTransformer defaults to the Yeo-Johnson method,
         # a log-like power transform for skewed data
         log = PowerTransformer()
         log.fit(df[['ConvertedSalary']])
         df['log_ConvertedSalary'] = log.transform(df[['ConvertedSalary']])
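
With default settings, `PowerTransformer` applies the Yeo-Johnson method: it estimates a power parameter from the data and then standardizes the result, so it behaves like a log transform on right-skewed data rather than being a literal logarithm. The sketch below compares it with `np.log1p` on hypothetical salary values (in thousands, invented for illustration) by looking at how much each reduces skewness.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Hypothetical right-skewed salaries (in thousands), invented for illustration
df = pd.DataFrame(
    {"ConvertedSalary": [20, 30, 35, 40, 50, 60, 80, 120, 250, 1000]}
)

# Defaults: method="yeo-johnson", standardize=True
pt = PowerTransformer()
df["pt_salary"] = pt.fit_transform(df[["ConvertedSalary"]]).ravel()

# A literal log transform for comparison; log1p computes log(1 + x)
df["log_salary"] = np.log1p(df["ConvertedSalary"])

# Skewness of the raw column versus both transformed columns
skews = df[["ConvertedSalary", "pt_salary", "log_salary"]].skew()
print(skews)
print("Estimated Yeo-Johnson lambda:", pt.lambdas_[0])
```

Both transformed columns are less skewed than the raw one. The fitted transformer stores its estimated parameter in `lambdas_`, which is what gets reused when transforming new data.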

  18. Final Slide

  19. Removing outliers
     Robert O'Callaghan, Director of Data Science, Ordergroove

  20. What are outliers?

  21. Quantile based detection

  22. Quantiles in Python

         q_cutoff = df['col_name'].quantile(0.95)
         mask = df['col_name'] < q_cutoff
         trimmed_df = df[mask]
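
A small worked example of the quantile cut-off, using hypothetical data (the values 1 to 99 plus one extreme value of 1000, invented for illustration). It also shows a property of this approach: the top 5% of rows are always removed, whether or not they are genuinely unusual.

```python
import pandas as pd

# Hypothetical data: 1..99 plus one extreme value
df = pd.DataFrame({"col_name": list(range(1, 100)) + [1000]})

# Value below which 95% of the observations fall
q_cutoff = df["col_name"].quantile(0.95)

# Keep only the rows strictly below the cut-off
mask = df["col_name"] < q_cutoff
trimmed_df = df[mask]

print(q_cutoff)
print(len(df), len(trimmed_df))
```

The extreme value 1000 is dropped, but so are the ordinary values 96 to 99, which is the trade-off of a fixed-percentage cut.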

  23. Standard deviation based detection

  24. Standard deviation detection in Python

         mean = df['col_name'].mean()
         std = df['col_name'].std()
         cut_off = std * 3

         lower, upper = mean - cut_off, mean + cut_off
         new_df = df[(df['col_name'] < upper) & (df['col_name'] > lower)]
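
The three-standard-deviation rule can be tried on a small hypothetical column (twenty values near 50 plus one value of 500, invented for illustration). Unlike the quantile approach, it removes rows only when they lie far from the mean.

```python
import pandas as pd

# Hypothetical data: twenty values near 50 and one extreme value
df = pd.DataFrame({"col_name": [48, 49, 50, 51, 52] * 4 + [500]})

mean = df["col_name"].mean()
std = df["col_name"].std()
cut_off = std * 3

# Rows outside mean +/- 3 standard deviations are treated as outliers
lower, upper = mean - cut_off, mean + cut_off
new_df = df[(df["col_name"] < upper) & (df["col_name"] > lower)]

print(lower, upper)
print(len(df), len(new_df))
```

Note that the outlier itself inflates both the mean and the standard deviation, so on very small samples an extreme point may still fall inside the bounds.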

  25. Let's practice!

  26. Scaling and transforming new data
     Robert O'Callaghan, Director of Data Science, Ordergroove

  27. Reuse training scalers

         import pandas as pd
         from sklearn.preprocessing import StandardScaler

         scaler = StandardScaler()
         scaler.fit(train[['col']])
         train['scaled_col'] = scaler.transform(train[['col']])

         # FIT SOME MODEL
         # ....

         test = pd.read_csv('test_csv')
         # Transform the test data with the scaler fitted on the training data
         test['scaled_col'] = scaler.transform(test[['col']])
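
A self-contained sketch of the same pattern, using small in-memory frames in place of the CSV file (the `train` and `test` values are invented for illustration). The scaler is fitted once on the training data, and the learned mean and standard deviation are then applied unchanged to the test data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical train and test sets, invented for illustration only
train = pd.DataFrame({"col": [10.0, 20.0, 30.0, 40.0]})
test = pd.DataFrame({"col": [20.0, 50.0]})

# Learn the mean and standard deviation from the training data only
scaler = StandardScaler()
scaler.fit(train[["col"]])

# Apply the same fitted parameters to both sets
train["scaled_col"] = scaler.transform(train[["col"]]).ravel()
test["scaled_col"] = scaler.transform(test[["col"]]).ravel()

print(scaler.mean_[0], scaler.scale_[0])
print(test)
```

The scaled test column will generally not have mean 0 or unit variance. That is expected, since its values are expressed relative to the training distribution.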

  28. Training transformations for reuse

         train_mean = train['col'].mean()
         train_std = train['col'].std()
         cut_off = train_std * 3
         train_lower = train_mean - cut_off
         train_upper = train_mean + cut_off

         # Subset train data
         train = train[(train['col'] < train_upper) & (train['col'] > train_lower)]

         test = pd.read_csv('test_csv')
         # Subset test data using the bounds computed from the training data
         test = test[(test['col'] < train_upper) & (test['col'] > train_lower)]
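
One way to keep the training-derived bounds reusable is to separate computing them from applying them. The sketch below does this with two helper functions, `fit_outlier_bounds` and `apply_outlier_bounds`. Both names are hypothetical and introduced only for this example, as is the data.

```python
import pandas as pd


def fit_outlier_bounds(series, n_std=3):
    """Return (lower, upper) cut-offs computed from the given series."""
    mean, std = series.mean(), series.std()
    return mean - n_std * std, mean + n_std * std


def apply_outlier_bounds(df, column, bounds):
    """Keep rows whose value in `column` lies strictly inside the bounds."""
    lower, upper = bounds
    return df[(df[column] > lower) & (df[column] < upper)]


# Hypothetical train and test sets, invented for illustration only
train = pd.DataFrame({"col": [48, 49, 50, 51, 52] * 4 + [500]})
test = pd.DataFrame({"col": [47, 50, 53, 900]})

# Bounds are learned from the training data only ...
bounds = fit_outlier_bounds(train["col"])

# ... and then applied to both the training and the test data
train_trimmed = apply_outlier_bounds(train, "col", bounds)
test_trimmed = apply_outlier_bounds(test, "col", bounds)

print(bounds)
print(len(train_trimmed), len(test_trimmed))
```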

  29. Why only use training data?
     Data leakage: using data that you won't have access to when assessing the performance of your model.

  30. Avoid data leakage!
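
To make the leakage concrete, the sketch below fits one scaler on the training data alone and another on the training and test data combined (all values are hypothetical, invented for illustration). The second scaler's parameters depend on the test set, so any model trained on features scaled with it has indirectly seen the test data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical train and test sets, invented for illustration only
train = pd.DataFrame({"col": [10.0, 20.0, 30.0, 40.0]})
test = pd.DataFrame({"col": [100.0, 200.0]})

# Correct: parameters come from the training data only
clean = StandardScaler().fit(train[["col"]])

# Leaky: the test rows influence the fitted mean and scale
combined = pd.concat([train, test], ignore_index=True)
leaky = StandardScaler().fit(combined[["col"]])

print("train-only mean:", clean.mean_[0])
print("train+test mean:", leaky.mean_[0])
```

The two fitted means differ, which shows that the combined fit carries information from rows that would not be available when the model is evaluated or deployed.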
