Why do missing values exist?



  1. Why do missing values exist? Feature Engineering for Machine Learning in Python. Robert O'Callaghan, Director of Data Science, Ordergroove

  2. How gaps in data occur: data not being collected properly; collection and management errors; data intentionally being omitted; gaps could be created due to transformations of the data.
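As an illustration of the last point on slide 2, a transformation such as a join can introduce gaps that were not in either input. This is a minimal sketch; the `users` and `salaries` tables and their values are hypothetical and do not come from the slides.

```python
import pandas as pd

# Hypothetical toy tables, used only for illustration.
users = pd.DataFrame({"user_id": [1, 2, 3], "country": ["IE", "US", "DE"]})
salaries = pd.DataFrame({"user_id": [1, 3], "salary": [50000.0, 62000.0]})

# A left join keeps every user; users without a salary record get NaN.
merged = users.merge(salaries, on="user_id", how="left")
print(merged)

# Count the gaps that the join created.
n_missing = int(merged["salary"].isnull().sum())
print(n_missing)
```

Neither input table contains a missing value here, yet the joined table does.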

  3. Why we care? Some models cannot work with missing data (nulls/NaNs); missing data may be a sign of a wider data issue; missing data can be a useful feature.

  4. Missing value discovery

         print(df.info())

         <class 'pandas.core.frame.DataFrame'>
         RangeIndex: 999 entries, 0 to 998
         Data columns (total 12 columns):
         SurveyDate                    999 non-null object
         ...
         StackOverflowJobsRecommend    487 non-null float64
         VersionControl                999 non-null object
         Gender                        693 non-null object
         RawSalary                     665 non-null object
         dtypes: float64(2), int64(2), object(8)
         memory usage: 93.7+ KB

  5. Finding missing values

         print(df.isnull())

            StackOverflowJobsRecommend  VersionControl  ...  \
         0                        True           False  ...
         1                       False           False  ...
         2                       False           False  ...
         3                        True           False  ...
         4                       False           False  ...

            Gender  RawSalary
         0   False       True
         1   False      False
         2    True       True
         3   False      False
         4   False      False

  6. Finding missing values

         print(df['StackOverflowJobsRecommend'].isnull().sum())

         512
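The same idea extends from one column to the whole table. The sketch below counts missing values per column and expresses them as a share of rows. The column names follow the slides, but the five-row DataFrame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data; only the column names are taken from the slides.
df = pd.DataFrame({
    "StackOverflowJobsRecommend": [np.nan, 7.0, 8.0, np.nan, 8.0],
    "VersionControl": ["Git", "Git", "Git", "Subversion", "Git"],
    "Gender": ["Male", "Male", None, "Male", "Male"],
})

# isnull() gives a Boolean table; summing counts the True values per column.
missing_counts = df.isnull().sum()

# The mean of a Boolean column is the fraction of rows that are missing.
missing_share = df.isnull().mean()

print(missing_counts)
print(missing_share)
```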

  7. Finding non-missing values

         print(df.notnull())

            StackOverflowJobsRecommend  VersionControl  ...  \
         0                       False            True  ...
         1                        True            True  ...
         2                        True            True  ...
         3                       False            True  ...
         4                        True            True  ...

            Gender  RawSalary
         0    True      False
         1    True       True
         2   False      False
         3    True       True
         4    True       True

  8. Go ahead and find missing values!

  9. Dealing with missing values (I)

  10. Listwise deletion

             SurveyDate  ConvertedSalary Hobby  ...  \
          0  2/28/18 20:20              NaN   Yes  ...
          1  6/28/18 13:26          70841.0   Yes  ...
          2    6/6/18 3:37              NaN    No  ...
          3    5/9/18 1:06          21426.0   Yes  ...
          4  4/12/18 22:41          41671.0   Yes  ...

  11. Listwise deletion in Python

          # Drop all rows with at least one missing value
          df.dropna(how='any')

  12. Listwise deletion in Python

          # Drop rows with missing values in a specific column
          df.dropna(subset=['VersionControl'])

  13. Issues with deletion: it deletes valid data points; it relies on randomness (it assumes the values are missing at random); it reduces information.
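One way to see how much valid data deletion throws away is to compare row counts before and after. This is a sketch on a toy DataFrame: the `ConvertedSalary` and `Hobby` values mirror slide 10, while the `Gender` values are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ConvertedSalary": [np.nan, 70841.0, np.nan, 21426.0, 41671.0],
    "Hobby": ["Yes", "Yes", "No", "Yes", "Yes"],
    "Gender": ["Male", None, "Male", "Male", "Male"],  # made-up values
})

# Listwise deletion: a row is dropped if any column is missing.
all_complete = df.dropna(how="any")

# Targeted deletion: only rows missing the salary are dropped.
salary_known = df.dropna(subset=["ConvertedSalary"])

print(len(df), len(all_complete), len(salary_known))
```

Row 1 has a valid salary and hobby but is still lost under `how='any'` because its `Gender` is missing. Note that `dropna` returns a new DataFrame and leaves `df` unchanged.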

  14. Replacing with strings

          # Replace missing values in a specific column
          # with a given string
          df['VersionControl'].fillna(
              value='None Given', inplace=True
          )
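An equivalent form assigns the filled column back instead of using `inplace=True`. Some newer pandas versions warn about in-place modification of a selected column, so the assignment form may be the safer habit. This is a sketch on a toy column whose values are made up.

```python
import pandas as pd

# Toy column; the values are illustrative only.
df = pd.DataFrame({"VersionControl": ["Git", None, "Subversion", None]})

# fillna returns a new Series, which is assigned back to the same column.
df["VersionControl"] = df["VersionControl"].fillna("None Given")

print(df["VersionControl"].tolist())
```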

  15. Recording missing values

          # Record where the values are not missing
          df['SalaryGiven'] = df['ConvertedSalary'].notnull()

          # Drop a specific column
          df.drop(columns=['ConvertedSalary'])
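A runnable version of the pattern on slide 15 might look like the sketch below. The salary values mirror slide 10. Because `drop` returns a new DataFrame rather than changing `df`, the result is assigned back here; that assignment is an addition, not something the slide shows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"ConvertedSalary": [np.nan, 70841.0, np.nan, 21426.0, 41671.0]}
)

# Keep the information "was a salary given?" as a Boolean feature.
df["SalaryGiven"] = df["ConvertedSalary"].notnull()

# drop() does not modify df in place, so keep the returned DataFrame.
df = df.drop(columns=["ConvertedSalary"])

print(df)
```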

  16. Practice time

  17. Fill continuous missing values

  18. Deleting missing values: you can't delete rows with missing values in the test set.
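Since every test row has to be kept, gaps in the test set need to be filled rather than dropped. A common approach, assumed here rather than stated on the slide, is to compute the fill value on the training data and reuse it for the test data. The `train` and `test` frames below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical train/test split with made-up salaries.
train = pd.DataFrame({"ConvertedSalary": [40000.0, np.nan, 60000.0, 80000.0]})
test = pd.DataFrame({"ConvertedSalary": [np.nan, 55000.0]})

# Learn the fill value from the training data only.
fill_value = train["ConvertedSalary"].median()  # 60000.0

# Apply the same value to both sets; no test rows are removed.
train["ConvertedSalary"] = train["ConvertedSalary"].fillna(fill_value)
test["ConvertedSalary"] = test["ConvertedSalary"].fillna(fill_value)

print(test)
```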

  19. What else can you do? Categorical columns: replace missing values with the most commonly occurring value, or with a string that flags missing values, such as 'None'. Numeric columns: replace missing values with a suitable value.
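Both options for categorical columns can be sketched with `mode()` and `fillna()`. The `Gender` values below are made up for illustration.

```python
import pandas as pd

# Toy categorical column with two missing entries.
df = pd.DataFrame({"Gender": ["Male", "Female", None, "Male", None]})

# Option 1: fill with the most common value (mode ignores missing values).
most_common = df["Gender"].mode()[0]
filled_mode = df["Gender"].fillna(most_common)

# Option 2: fill with a string that flags the value as missing.
filled_flag = df["Gender"].fillna("None")

print(filled_mode.tolist())
print(filled_flag.tolist())
```

The second option keeps missingness visible as its own category, which fits the earlier point that missing data can be a useful feature.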

  20. Measures of central tendency: mean and median.

  21. Calculating the measures of central tendency

          print(df['ConvertedSalary'].mean())
          print(df['ConvertedSalary'].median())

          92565.16992481203
          55562.0
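A mean well above the median, as in the output on slide 21, is what a few very large values would produce. The sketch below shows that effect on made-up numbers; it makes no claim about the actual survey data.

```python
import pandas as pd

# Made-up salaries with a single very large value.
salaries = pd.Series([30000.0, 45000.0, 55000.0, 60000.0, 1000000.0])

mean_salary = salaries.mean()      # 238000.0, pulled up by the outlier
median_salary = salaries.median()  # 55000.0, unaffected by the outlier

print(mean_salary, median_salary)
```

When the two measures differ this much, the median may be the more representative fill value.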

  22. Fill the missing values

          df['ConvertedSalary'] = df['ConvertedSalary'].fillna(
              df['ConvertedSalary'].mean()
          )

          df['ConvertedSalary'] = df['ConvertedSalary']\
              .astype('int64')

  23. Rounding values

          df['ConvertedSalary'] = df['ConvertedSalary'].fillna(
              round(df['ConvertedSalary'].mean())
          )
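The difference between slides 22 and 23 is that casting to `int64` truncates the fractional part, while `round()` picks the nearest integer first. A small sketch on made-up values shows the two results side by side.

```python
import numpy as np
import pandas as pd

# Made-up salaries whose mean is 50000.666...
salaries = pd.Series([50000.0, 50001.0, 50001.0, np.nan])
mean_salary = salaries.mean()

# Fill with the raw mean, then cast: the fraction is cut off.
truncated = salaries.fillna(mean_salary).astype("int64")

# Round the mean before filling: the nearest whole number is used.
rounded = salaries.fillna(round(mean_salary)).astype("int64")

print(truncated.tolist())
print(rounded.tolist())
```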

  24. Let's practice!

  25. Dealing with other data issues

  26. Bad characters

          print(df['RawSalary'].dtype)

          dtype('O')

  27. Bad characters

          print(df['RawSalary'].head())

          0          NaN
          1    70,841.00
          2          NaN
          3    21,426.00
          4    41,671.00
          Name: RawSalary, dtype: object

  28. Dealing with bad characters

          df['RawSalary'] = df['RawSalary'].str.replace(',', '')
          df['RawSalary'] = df['RawSalary'].astype('float')

  29. Finding other stray characters

          coerced_vals = pd.to_numeric(df['RawSalary'], errors='coerce')

  30. Finding other stray characters

          print(df.loc[coerced_vals.isna(), 'RawSalary'].head())

          0          NaN
          2          NaN
          4    $51408.00
          Name: RawSalary, dtype: object
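The output on slide 30 mixes values that were missing all along with values that failed to parse. A sketch that isolates only the unparseable entries is shown below. The data mirrors the slide's output, and the extra `notna()` filter is an addition rather than part of the original slides.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"RawSalary": [np.nan, "70841.00", np.nan, "21426.00", "$51408.00"]}
)

# Values that cannot be parsed as numbers become NaN instead of raising.
coerced_vals = pd.to_numeric(df["RawSalary"], errors="coerce")

# Keep rows that failed to parse but were not missing to begin with.
stray = df.loc[coerced_vals.isna() & df["RawSalary"].notna(), "RawSalary"]

print(stray)
```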

  31. Chaining methods

          df['column_name'] = df['column_name'].method1()
          df['column_name'] = df['column_name'].method2()
          df['column_name'] = df['column_name'].method3()

      Same as:

          df['column_name'] = df['column_name']\
              .method1().method2().method3()
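Applied to the salary column, a chain might look like the sketch below. The values are made up, combining the comma and dollar-sign problems from the earlier slides. Passing `regex=False` is a choice made here so that `$` is treated as a literal character rather than a regular-expression anchor.

```python
import numpy as np
import pandas as pd

# Made-up raw values containing commas and a dollar sign.
df = pd.DataFrame(
    {"RawSalary": [np.nan, "70,841.00", np.nan, "21,426.00", "$51,408.00"]}
)

# Parentheses allow the chain to span lines without backslashes.
df["RawSalary"] = (
    df["RawSalary"]
    .str.replace(",", "", regex=False)  # remove thousands separators
    .str.replace("$", "", regex=False)  # remove the currency symbol
    .astype("float")                    # convert the cleaned strings
)

print(df["RawSalary"])
```

Missing entries pass through the string methods untouched and remain `NaN` after the cast.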

  32. Go ahead and fix bad characters!
