

  1. Dropping Data - FEATURE ENGINEERING WITH PYSPARK - John Hogue, Lead Data Scientist, General Mills

  2. Where can data go bad?

    - Recorded wrong
    - Unique events
    - Formatted incorrectly
    - Duplications
    - Missing
    - Not relevant

  3. Dropping Columns

    Multiple fields are not needed for our analysis:
    - 'NO': auto-generated record number
    - 'UNITNUMBER': irrelevant data
    - 'CLASS': all constant

    df.select(['NO', 'UNITNUMBER', 'CLASS']).show()

    +----+----------+-----+
    |  NO|UNITNUMBER|CLASS|
    +----+----------+-----+
    |   1|      null|   SF|
    | 156|        A8|   SF|
    | 157|       207|   SF|
    | 158|      701G|   SF|
    | 159|        36|   SF|

  4. Dropping Columns

    drop(*cols)
    *cols - a column name to drop, or a list of column names to drop.
    Returns a new DataFrame that drops the specified column(s).

    # List of columns to drop
    cols_to_drop = ['NO', 'UNITNUMBER', 'CLASS']
    # Drop the columns
    df = df.drop(*cols_to_drop)
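    A quick sanity check after the drop; a minimal sketch, assuming df is the same housing DataFrame used above:

    # Confirm that none of the dropped columns remain in the schema
    assert not set(cols_to_drop) & set(df.columns)
    print(df.columns)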

  5. Text Filtering

    where(condition)
    condition - a Column of types.BooleanType or a string of SQL expression.
    Filters the DataFrame where the condition is true.

    like(other)
    other - a SQL LIKE pattern.
    Returns a boolean Column.

    ~ is the NOT condition.

    df = df.where(~df['POTENTIALSHORTSALE'].like('Not Disclosed'))

  6. Outlier Filtering

    Filter data to within three standard deviations (3σ) of the mean (μ).

  7. Value Filtering Example

    # Calculate values used for filtering
    std_val = df.agg({'SALESCLOSEPRICE': 'stddev'}).collect()[0][0]
    mean_val = df.agg({'SALESCLOSEPRICE': 'mean'}).collect()[0][0]

    # Create three standard deviation (μ ± 3σ) upper and lower bounds for the data
    hi_bound = mean_val + (3 * std_val)
    low_bound = mean_val - (3 * std_val)

    # Use where() to filter the DataFrame between the bounds
    df = df.where((df['LISTPRICE'] < hi_bound) & (df['LISTPRICE'] > low_bound))

  8. Dropping NAs or NULLs

    DataFrame.dropna()
    how: 'any' or 'all'. If 'any', drop a record if it contains any nulls.
         If 'all', drop a record only if all its values are null.
    thresh: int, default None. If specified, drop records that have fewer than
         thresh non-null values. This overwrites the how parameter.
    subset: optional list of column names to consider.

  9. Dropping NAs or NULLs

    # Drop any records with NULL values
    df = df.dropna()

    # Drop records only if both LISTPRICE and SALESCLOSEPRICE are NULL
    df = df.dropna(how='all', subset=['LISTPRICE', 'SALESCLOSEPRICE'])

    # Drop records with fewer than two non-null values
    df = df.dropna(thresh=2)

  10. Dropping Duplicates

    What is a duplicate? Two or more records that contain all the same information.
    After dropping columns or joining datasets, check for duplicates.

    dropDuplicates()
    - Can be run across the entire DataFrame or a list of columns
    - In PySpark there is no order for which record is removed (see the sketch below)

    # Entire DataFrame
    df.dropDuplicates()
    # Check only a column list
    df.dropDuplicates(['streetaddress'])
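    If it matters which duplicate survives, one common workaround is to rank the rows yourself before dropping. A minimal sketch, assuming a LISTDATE column exists so the most recent listing per street address is kept; the column names here are illustrative:

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    # Rank records within each streetaddress group, newest LISTDATE first (LISTDATE is assumed)
    w = Window.partitionBy('streetaddress').orderBy(F.col('LISTDATE').desc())

    # Keep only the top-ranked record per group, then drop the helper column
    df = (df.withColumn('rn', F.row_number().over(w))
            .where(F.col('rn') == 1)
            .drop('rn'))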

  11. Let's practice!

  12. Adjusting Data - FEATURE ENGINEERING WITH PYSPARK - John Hogue, Lead Data Scientist, General Mills

  13. Why Transform Data?

  14. What is MinMax Scaling?
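    For reference, MinMax scaling rescales each value x into the [0, 1] range using the column minimum and maximum, which is exactly what the code on the next slide computes:

    x_scaled = (x - x_min) / (x_max - x_min)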

  15. MinMax Scaling

    # Define min and max values and collect them
    max_days = df.agg({'DAYSONMARKET': 'max'}).collect()[0][0]
    min_days = df.agg({'DAYSONMARKET': 'min'}).collect()[0][0]

    # Create a new column based off the scaled data
    df = df.withColumn("scaled_days",
                       (df['DAYSONMARKET'] - min_days) / (max_days - min_days))

    df[['scaled_days']].show(5)
    +--------------------+
    |         scaled_days|
    +--------------------+
    |0.044444444444444446|
    |0.017777777777777778|
    | 0.12444444444444444|
    | 0.08444444444444445|
    | 0.09333333333333334|
    +--------------------+
    only showing top 5 rows
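    Spark ML also ships a built-in scaler that does the same rescaling on a vector column. A minimal sketch using pyspark.ml; the vector and output column names are illustrative:

    from pyspark.ml.feature import MinMaxScaler, VectorAssembler

    # MinMaxScaler operates on a vector column, so assemble the feature first
    assembler = VectorAssembler(inputCols=['DAYSONMARKET'], outputCol='days_vec')
    scaler = MinMaxScaler(inputCol='days_vec', outputCol='days_scaled_vec')

    df_vec = assembler.transform(df)
    df_scaled = scaler.fit(df_vec).transform(df_vec)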

  16. What is Standardization?

    Transform data to a standard normal distribution:
    z = (x - μ) / σ
    Mean, μ, of 0
    Standard deviation, σ, of 1

  17. Standardization

    mean_days = df.agg({'DAYSONMARKET': 'mean'}).collect()[0][0]
    stddev_days = df.agg({'DAYSONMARKET': 'stddev'}).collect()[0][0]

    # Create a new column with the scaled data
    df = df.withColumn("ztrans_days",
                       (df['DAYSONMARKET'] - mean_days) / stddev_days)

    df.agg({'ztrans_days': 'mean'}).collect()
    [Row(avg(ztrans_days)=-3.6568525985103407e-16)]

    df.agg({'ztrans_days': 'stddev'}).collect()
    [Row(stddev(ztrans_days)=1.0000000000000009)]

  18. What is Log Scaling?

    [Figure: unscaled distribution vs. log-scaled distribution]

  19. Log Scaling

    # Import the log function
    from pyspark.sql.functions import log

    # Calculate the log of SALESCLOSEPRICE
    df = df.withColumn('log_SalesClosePrice', log(df['SALESCLOSEPRICE']))
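    One quick way to see the effect is to compare the skewness of the raw and log-scaled columns; values near 0 indicate a roughly symmetric distribution. A minimal sketch using pyspark.sql.functions.skewness:

    from pyspark.sql.functions import skewness

    # Right-skewed prices should show a large positive skew before scaling
    df.agg(skewness('SALESCLOSEPRICE'), skewness('log_SalesClosePrice')).show()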

  20. Let's practice!

  21. Working with Missing Data - FEATURE ENGINEERING WITH PYSPARK - John Hogue, Lead Data Scientist, General Mills

  22. How does data go missing in the digital age?

    - Data collection: broken sensors
    - Data storage rules: 2017-01-01 vs. January 1st, 2017
    - Joining disparate data: monthly to weekly
    - Intentionally missing: privacy concerns

  23. Types of Missing

    - Missing completely at random: the missing data is just a completely random subset
    - Missing at random: missing conditionally at random based on another observation
    - Missing not at random: data is missing because of how it is collected

  24. Assessing Missing Values

    isNull()
    True if the current expression is null.

    df.where(df['ROOF'].isNull()).count()
    765

    When to drop rows with missing data?
    - Missing values are rare
    - Missing completely at random
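    To get the same picture for every column at once, the null counts can be computed in a single aggregation. A minimal sketch, assuming df is the same housing DataFrame:

    from pyspark.sql import functions as F

    # Count NULLs per column in one pass over the data
    null_counts = df.select([
        F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns
    ])
    null_counts.show()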

  25. Plotting Missing Values

    # Import library
    import seaborn as sns

    # Subset the DataFrame
    sub_df = df.select(['ROOMAREA1'])

    # Sample the DataFrame
    sample_df = sub_df.sample(False, .5, 4)

    # Convert to a Pandas DataFrame
    pandas_df = sample_df.toPandas()

    # Plot it
    sns.heatmap(data=pandas_df.isnull())

  26. Missing Values Heatmap

    [Figure: seaborn heatmap of missing values]

  27. Imputation of Missing Values

    The process of replacing missing values.
    - Rule based: value based on business logic
    - Statistics based: using the mean, median, etc.
    - Model based: use a model to predict the value

  28. Imputation of Missing Values

    fillna(value, subset=None)
    value - the value to replace missings with
    subset - the list of column names in which to replace missings

    # Replacing missing values with zero
    df.fillna(0, subset=['DAYSONMARKET'])

    # Replacing with the mean value for that column
    col_mean = df.agg({'DAYSONMARKET': 'mean'}).collect()[0][0]
    df.fillna(col_mean, subset=['DAYSONMARKET'])
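    pyspark.ml also provides an Imputer transformer that fills several columns with their mean or median in one step. A minimal sketch; the output column name is illustrative:

    from pyspark.ml.feature import Imputer

    # Fill DAYSONMARKET nulls with the column median instead of the mean
    imputer = Imputer(inputCols=['DAYSONMARKET'],
                      outputCols=['DAYSONMARKET_imputed'],
                      strategy='median')
    df = imputer.fit(df).transform(df)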

  29. Let's practice!

  30. Getting More Data - FEATURE ENGINEERING WITH PYSPARK - John Hogue, Lead Data Scientist, General Mills

  31. Thoughts on External Data Sets

    PROS                              CONS
    Add important predictors          May 'bog' the analysis down
    Supplement / replace values       Easy to induce data leakage
    Cheap or easy to obtain           Must become a subject matter expert on the data set

  32. About Joins

    Orienting our data directions:
    - Left: our starting data set
    - Right: new data set to incorporate

  33. PySpark DataFrame Joins

    DataFrame.join(
        other,      # Other DataFrame to merge
        on=None,    # The keys to join on
        how=None)   # Type of join to perform (default is 'inner')

  34. PySpark Join Example

    # Inspect DataFrame head
    hdf.show(2)
    +----------+--------------------+
    |        dt|                  nm|
    +----------+--------------------+
    |2012-01-02|        New Year Day|
    |2012-01-16|Martin Luther Kin...|
    +----------+--------------------+
    only showing top 2 rows

    # Specify join condition
    cond = [df['OFFMARKETDATE'] == hdf['dt']]

    # Join hdf onto df
    df = df.join(hdf, on=cond, how='left')

    # How many sales occurred on bank holidays?
    df.where(~df['nm'].isNull()).count()
    0

  35. SparkSQL Join

    Apply SQL to your DataFrame.

    # Register the DataFrames as temp tables
    df.createOrReplaceTempView("df")
    hdf.createOrReplaceTempView("hdf")

    # Write a SQL statement
    sql_df = spark.sql("""
        SELECT *
        FROM df
        LEFT JOIN hdf
        ON df.OFFMARKETDATE = hdf.dt
        """)
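    With the temp views registered, the bank-holiday count from the previous slide can also be expressed directly in SQL; a minimal sketch:

    # Same question in SQL: how many sales closed on a bank holiday?
    spark.sql("""
        SELECT COUNT(*)
        FROM df
        INNER JOIN hdf
        ON df.OFFMARKETDATE = hdf.dt
        """).show()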

  36. Let's Join Some Data!
