

  1. Dropping Data - FEATURE ENGINEERING WITH PYSPARK - John Hogue, Lead Data Scientist, General Mills

  2. Where can data go bad?

    - Recorded wrong
    - Unique events
    - Formatted incorrectly
    - Duplications
    - Missing
    - Not relevant

  3. Dropping Columns

    Multiple fields are not needed for our analysis:
    - 'NO': auto-generated record number
    - 'UNITNUMBER': irrelevant data
    - 'CLASS': all constant

    df.select(['NO', 'UNITNUMBER', 'CLASS']).show()

    +----+----------+-----+
    |  NO|UNITNUMBER|CLASS|
    +----+----------+-----+
    |   1|      null|   SF|
    | 156|        A8|   SF|
    | 157|       207|   SF|
    | 158|      701G|   SF|
    | 159|        36|   SF|

  4. Dropping Columns

    drop(*cols)
    *cols - a column name to drop, or a list of column names to drop.
    Returns a new DataFrame that drops the specified column(s).

    # List of columns to drop
    cols_to_drop = ['NO', 'UNITNUMBER', 'CLASS']
    # Drop the columns
    df = df.drop(*cols_to_drop)
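    A quick sanity check after the drop; a minimal sketch, assuming df is the same housing DataFrame used above:

    # Confirm that none of the dropped columns remain in the schema
    assert not set(cols_to_drop) & set(df.columns)
    print(df.columns)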

  5. Text Filtering

    where(condition)
    condition - a Column of types.BooleanType or a string of SQL expression.
    Filters the DataFrame where the condition is true.

    like(other)
    other - a SQL LIKE pattern.
    Returns a boolean Column.

    ~ is the NOT condition.

    df = df.where(~df['POTENTIALSHORTSALE'].like('Not Disclosed'))

  6. Outlier Filtering

    Filter data to within three standard deviations (3σ) of the mean (μ).

  7. Value Filtering Example

    # Calculate values used for filtering
    std_val = df.agg({'SALESCLOSEPRICE': 'stddev'}).collect()[0][0]
    mean_val = df.agg({'SALESCLOSEPRICE': 'mean'}).collect()[0][0]

    # Create three standard deviation (μ ± 3σ) upper and lower bounds for the data
    hi_bound = mean_val + (3 * std_val)
    low_bound = mean_val - (3 * std_val)

    # Use where() to filter the DataFrame between the bounds
    df = df.where((df['LISTPRICE'] < hi_bound) & (df['LISTPRICE'] > low_bound))

  8. Dropping NAs or NULLs

    DataFrame.dropna()
    how: 'any' or 'all'. If 'any', drop a record if it contains any nulls.
         If 'all', drop a record only if all its values are null.
    thresh: int, default None. If specified, drop records that have fewer than
         thresh non-null values. This overwrites the how parameter.
    subset: optional list of column names to consider.

  9. Dropping NAs or NULLs

    # Drop any records with NULL values
    df = df.dropna()

    # Drop records only if both LISTPRICE and SALESCLOSEPRICE are NULL
    df = df.dropna(how='all', subset=['LISTPRICE', 'SALESCLOSEPRICE'])

    # Drop records with fewer than two non-null values
    df = df.dropna(thresh=2)

  10. Dropping Duplicates

    What is a duplicate? Two or more records that contain all the same information.
    After dropping columns or joining datasets, check for duplicates.

    dropDuplicates()
    - Can be run across the entire DataFrame or a list of columns
    - In PySpark there is no order for which record is removed (see the sketch below)

    # Entire DataFrame
    df.dropDuplicates()
    # Check only a column list
    df.dropDuplicates(['streetaddress'])
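    If it matters which duplicate survives, one common workaround is to rank the rows yourself before dropping. A minimal sketch, assuming a LISTDATE column exists so the most recent listing per street address is kept; the column names here are illustrative:

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    # Rank records within each streetaddress group, newest LISTDATE first (LISTDATE is assumed)
    w = Window.partitionBy('streetaddress').orderBy(F.col('LISTDATE').desc())

    # Keep only the top-ranked record per group, then drop the helper column
    df = (df.withColumn('rn', F.row_number().over(w))
            .where(F.col('rn') == 1)
            .drop('rn'))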

  11. Let's practice!

  12. Adjusting Data - FEATURE ENGINEERING WITH PYSPARK - John Hogue, Lead Data Scientist, General Mills

  13. Why Transform Data?

  14. What is MinMax Scaling?
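    For reference, MinMax scaling rescales each value x into the [0, 1] range using the column minimum and maximum, which is exactly what the code on the next slide computes:

    x_scaled = (x - x_min) / (x_max - x_min)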

  15. MinMax Scaling

    # Define min and max values and collect them
    max_days = df.agg({'DAYSONMARKET': 'max'}).collect()[0][0]
    min_days = df.agg({'DAYSONMARKET': 'min'}).collect()[0][0]

    # Create a new column based off the scaled data
    df = df.withColumn("scaled_days",
                       (df['DAYSONMARKET'] - min_days) / (max_days - min_days))

    df[['scaled_days']].show(5)
    +--------------------+
    |         scaled_days|
    +--------------------+
    |0.044444444444444446|
    |0.017777777777777778|
    | 0.12444444444444444|
    | 0.08444444444444445|
    | 0.09333333333333334|
    +--------------------+
    only showing top 5 rows
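    Spark ML also ships a built-in scaler that does the same rescaling on a vector column. A minimal sketch using pyspark.ml; the vector and output column names are illustrative:

    from pyspark.ml.feature import MinMaxScaler, VectorAssembler

    # MinMaxScaler operates on a vector column, so assemble the feature first
    assembler = VectorAssembler(inputCols=['DAYSONMARKET'], outputCol='days_vec')
    scaler = MinMaxScaler(inputCol='days_vec', outputCol='days_scaled_vec')

    df_vec = assembler.transform(df)
    df_scaled = scaler.fit(df_vec).transform(df_vec)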

  16. What is Standardization?

    Transform data to a standard normal distribution:
    z = (x - μ) / σ
    Mean, μ, of 0
    Standard deviation, σ, of 1

  17. Standardization

    mean_days = df.agg({'DAYSONMARKET': 'mean'}).collect()[0][0]
    stddev_days = df.agg({'DAYSONMARKET': 'stddev'}).collect()[0][0]

    # Create a new column with the scaled data
    df = df.withColumn("ztrans_days",
                       (df['DAYSONMARKET'] - mean_days) / stddev_days)

    df.agg({'ztrans_days': 'mean'}).collect()
    [Row(avg(ztrans_days)=-3.6568525985103407e-16)]

    df.agg({'ztrans_days': 'stddev'}).collect()
    [Row(stddev(ztrans_days)=1.0000000000000009)]

  18. What is Log Scaling?

    [Figure: unscaled distribution vs. log-scaled distribution]

  19. Log Scaling

    # Import the log function
    from pyspark.sql.functions import log

    # Calculate the log of SALESCLOSEPRICE
    df = df.withColumn('log_SalesClosePrice', log(df['SALESCLOSEPRICE']))
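    One quick way to see the effect is to compare the skewness of the raw and log-scaled columns; values near 0 indicate a roughly symmetric distribution. A minimal sketch using pyspark.sql.functions.skewness:

    from pyspark.sql.functions import skewness

    # Right-skewed prices should show a large positive skew before scaling
    df.agg(skewness('SALESCLOSEPRICE'), skewness('log_SalesClosePrice')).show()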

  20. Let's practice!

  21. Working with Missing Data - FEATURE ENGINEERING WITH PYSPARK - John Hogue, Lead Data Scientist, General Mills

  22. How does data go missing in the digital age?

    - Data collection: broken sensors
    - Data storage rules: 2017-01-01 vs. January 1st, 2017
    - Joining disparate data: monthly to weekly
    - Intentionally missing: privacy concerns

  23. Types of Missing

    - Missing completely at random: the missing data is just a completely random subset
    - Missing at random: missing conditionally at random based on another observation
    - Missing not at random: data is missing because of how it is collected

  24. Assessing Missing Values

    isNull()
    True if the current expression is null.

    df.where(df['ROOF'].isNull()).count()
    765

    When to drop rows with missing data?
    - Missing values are rare
    - Missing completely at random
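    To get the same picture for every column at once, the null counts can be computed in a single aggregation. A minimal sketch, assuming df is the same housing DataFrame:

    from pyspark.sql import functions as F

    # Count NULLs per column in one pass over the data
    null_counts = df.select([
        F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns
    ])
    null_counts.show()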

  25. Plotting Missing Values

    # Import library
    import seaborn as sns

    # Subset the DataFrame
    sub_df = df.select(['ROOMAREA1'])

    # Sample the DataFrame
    sample_df = sub_df.sample(False, .5, 4)

    # Convert to a Pandas DataFrame
    pandas_df = sample_df.toPandas()

    # Plot it
    sns.heatmap(data=pandas_df.isnull())

  26. Missing Values Heatmap

    [Figure: seaborn heatmap of missing values]

  27. Imputation of Missing Values

    The process of replacing missing values.
    - Rule based: value based on business logic
    - Statistics based: using the mean, median, etc.
    - Model based: use a model to predict the value

  28. Imputation of Missing Values

    fillna(value, subset=None)
    value - the value to replace missings with
    subset - the list of column names in which to replace missings

    # Replacing missing values with zero
    df.fillna(0, subset=['DAYSONMARKET'])

    # Replacing with the mean value for that column
    col_mean = df.agg({'DAYSONMARKET': 'mean'}).collect()[0][0]
    df.fillna(col_mean, subset=['DAYSONMARKET'])
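    pyspark.ml also provides an Imputer transformer that fills several columns with their mean or median in one step. A minimal sketch; the output column name is illustrative:

    from pyspark.ml.feature import Imputer

    # Fill DAYSONMARKET nulls with the column median instead of the mean
    imputer = Imputer(inputCols=['DAYSONMARKET'],
                      outputCols=['DAYSONMARKET_imputed'],
                      strategy='median')
    df = imputer.fit(df).transform(df)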

  29. Let's practice!

  30. Getting More Data - FEATURE ENGINEERING WITH PYSPARK - John Hogue, Lead Data Scientist, General Mills

  31. Thoughts on External Data Sets

    PROS                              CONS
    Add important predictors          May 'bog' the analysis down
    Supplement / replace values       Easy to induce data leakage
    Cheap or easy to obtain           Must become a subject matter expert on the data set

  32. About Joins

    Orienting our data directions:
    - Left: our starting data set
    - Right: new data set to incorporate

  33. PySpark DataFrame Joins

    DataFrame.join(
        other,      # Other DataFrame to merge
        on=None,    # The keys to join on
        how=None)   # Type of join to perform (default is 'inner')

  34. PySpark Join Example

    # Inspect DataFrame head
    hdf.show(2)
    +----------+--------------------+
    |        dt|                  nm|
    +----------+--------------------+
    |2012-01-02|        New Year Day|
    |2012-01-16|Martin Luther Kin...|
    +----------+--------------------+
    only showing top 2 rows

    # Specify join condition
    cond = [df['OFFMARKETDATE'] == hdf['dt']]

    # Join hdf onto df
    df = df.join(hdf, on=cond, how='left')

    # How many sales occurred on bank holidays?
    df.where(~df['nm'].isNull()).count()
    0

  35. SparkSQL Join

    Apply SQL to your DataFrame.

    # Register the DataFrames as temp tables
    df.createOrReplaceTempView("df")
    hdf.createOrReplaceTempView("hdf")

    # Write a SQL statement
    sql_df = spark.sql("""
        SELECT *
        FROM df
        LEFT JOIN hdf
        ON df.OFFMARKETDATE = hdf.dt
        """)
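    With the temp views registered, the bank-holiday count from the previous slide can also be expressed directly in SQL; a minimal sketch:

    # Same question in SQL: how many sales closed on a bank holiday?
    spark.sql("""
        SELECT COUNT(*)
        FROM df
        INNER JOIN hdf
        ON df.OFFMARKETDATE = hdf.dt
        """).show()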

  36. Let's Join Some Data!
