Where to Begin

FEATURE ENGINEERING WITH PYSPARK

John Hogue, Lead Data Scientist, General Mills
Diving Straight to Analysis

Here be Monsters

Become your own expert:
- Define the goals of the analysis
- Research your data
- Be curious, ask questions
The Data Science Process
Spark Changes Fast and Frequently

- Latest documentation: https://spark.apache.org/docs/latest/
- Specific version (2.3.1): https://spark.apache.org/docs/2.3.1/

Check your versions!

```python
# Return the Spark version
spark.version

# Return the Python version
import sys
sys.version_info
```
Data Formats: Parquet

Data is supplied as Parquet.

- Stored column-wise
  - Fast to query column subsets
- Structured, defined schema
  - Fields and data types defined
  - Great for messy text data
- Industry adopted
  - Good skill to have!
Getting the Data to Spark

PySpark read methods support many file types!

```python
# JSON
spark.read.json('example.json')

# CSV or delimited files
spark.read.csv('example.csv')

# Parquet
spark.read.parquet('example.parq')

# Read a Parquet file into a PySpark DataFrame
df = spark.read.parquet('example.parq')
```
Let's Practice!
Defining A Problem
What's Your Problem?

Predict the selling price of a house.

- Given: the listed price and features
  - X, the independent "known" variables
- Predict: how much to buy the house for
  - Y, the dependent "unknown" variable: SALESCLOSEPRICE
Context & Limitations of Our Real Estate Data

- Homes sold in the St. Paul, MN area
  - Includes several suburbs
- Real estate types
  - Residential - Single
  - Residential - Multi-Family
- Full year of data
  - Captures the impact of seasonality
What Types of Attributes Are Available?

- Dates: Date Listed, Year Built
- Price: List Price, Sales Closing Price
- Location: City, School District, Address
- Amenities: Pool, Fireplace, Garage Size, # Bedrooms & Bathrooms, Living Area
- Construction Materials: Siding, Roofing
Validating Your Data Load

```python
# DataFrame.count() for the row count
df.count()
# 5000

# DataFrame.columns for a list of columns
df.columns
# ['No.', 'MLSID', 'StreetNumberNumeric', ... ]

# Length of DataFrame.columns for the number of columns
len(df.columns)
# 74
```
Checking Datatypes

DataFrame.dtypes creates a list of (column name, data type) tuples.

```python
df.dtypes
# [('No.', 'integer'), ('MLSID', 'string'), ... ]
```
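Since `df.dtypes` is a plain Python list of tuples, ordinary list comprehensions can split columns by type, which is handy when numeric and string features need different treatment later. A sketch using an illustrative `dtypes` list (the column names are assumptions, not the full 74-column schema):

```python
# Illustrative output of df.dtypes (column names are assumptions)
dtypes = [('No.', 'int'), ('MLSID', 'string'),
          ('SQFTABOVEGROUND', 'int'), ('LISTPRICE', 'int')]

# Separate numeric columns from string columns
numeric_cols = [name for name, dtype in dtypes
                if dtype in ('int', 'bigint', 'double')]
string_cols = [name for name, dtype in dtypes if dtype == 'string']

print(numeric_cols)  # ['No.', 'SQFTABOVEGROUND', 'LISTPRICE']
print(string_cols)   # ['MLSID']
```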
Let's Practice!
Visually Inspecting Data
Getting Descriptive with DataFrame.describe()

```python
df.describe(['LISTPRICE']).show()
```

```
+-------+------------------+
|summary|         LISTPRICE|
+-------+------------------+
|  count|              5000|
|   mean|        263419.365|
| stddev|143944.10818036905|
|    min|            100000|
|    max|             99999|
+-------+------------------+
```
Many descriptive functions are already available:

- Mean: pyspark.sql.functions.mean(col)
- Skewness: pyspark.sql.functions.skewness(col)
- Minimum: pyspark.sql.functions.min(col)
- Covariance: cov(col1, col2)
- Correlation: corr(col1, col2)
Example with mean()

mean(col): aggregate function, returns the average (mean) of the values in a group.

```python
df.agg({'SALESCLOSEPRICE': 'mean'}).collect()
# [Row(avg(SALESCLOSEPRICE)=262804.4668)]
```
Example with cov()

cov(col1, col2)

Parameters:
- col1: first column
- col2: second column

```python
df.cov('SALESCLOSEPRICE', 'YEARBUILT')
# 1281910.3840634783
```
seaborn: statistical data visualization
Notes on Plotting

Plotting PySpark DataFrames with standard libraries like Seaborn requires converting them to Pandas.

WARNING: Sample PySpark DataFrames before converting to Pandas!

sample(withReplacement, fraction, seed=None)
- withReplacement: allow repeats in the sample
- fraction: percentage of records to keep
- seed: random seed for reproducibility

```python
# Sample 50% of the PySpark DataFrame and count the rows
df.sample(False, 0.5, 42).count()
# 2504
```
Prepping for Plotting a Distribution

Seaborn distplot(): seaborn.distplot(a)
- a: Series, 1d-array, or list of observed data

```python
# Import your favorite visualization library
import seaborn as sns

# Sample the DataFrame
sample_df = df.select(['SALESCLOSEPRICE']).sample(False, 0.5, 42)

# Convert the sample to a Pandas DataFrame
pandas_df = sample_df.toPandas()

# Plot it
sns.distplot(pandas_df)
```
Distribution plot of sales closing price
Relationship Plotting

Seaborn lmplot(): seaborn.lmplot(x, y, data)
- x, y: strings; input variables, these should be column names in data
- data: Pandas DataFrame

```python
# Import your favorite visualization library
import seaborn as sns

# Select columns
s_df = df.select(['SALESCLOSEPRICE', 'SQFTABOVEGROUND'])

# Sample the DataFrame
s_df = s_df.sample(False, 0.5, 42)

# Convert to a Pandas DataFrame
pandas_df = s_df.toPandas()

# Plot it
sns.lmplot(x='SQFTABOVEGROUND', y='SALESCLOSEPRICE', data=pandas_df)
```
Linear model plot between SQFT above ground and sales price
Let's practice!