Where to Begin

FEATURE ENGINEERING WITH PYSPARK

John Hogue, Lead Data Scientist, General Mills
Diving Straight to Analysis

Here be Monsters

Become your own expert:
- Define the goals of the analysis
- Research your data
- Be curious, ask questions
The Data Science Process
Spark Changes Fast and Frequently

- Latest documentation: https://spark.apache.org/docs/latest/
- Specific version (2.3.1): https://spark.apache.org/docs/2.3.1/

Check your versions!

```python
# Return the Spark version
spark.version

# Return the Python version
import sys
sys.version_info
```
Data Formats: Parquet

Data is supplied as Parquet.

- Stored column-wise
  - Fast to query column subsets
- Structured, defined schema
  - Fields and data types defined
  - Great for messy text data
- Industry adopted
  - Good skill to have!
Getting the Data to Spark

PySpark read methods support many file types!

```python
# JSON
spark.read.json('example.json')

# CSV or delimited files
spark.read.csv('example.csv')

# Parquet
spark.read.parquet('example.parq')

# Read a Parquet file into a PySpark DataFrame
df = spark.read.parquet('example.parq')
```
Let's Practice!
Defining A Problem
What's Your Problem?

Predict the selling price of a house.

- Given: the listed price and features
  - X, the independent "known" variables
- Predict: how much to buy the house for
  - Y, the dependent "unknown" variable: SALESCLOSEPRICE
Context & Limitations of Our Real Estate Data

- Homes sold in the St. Paul, MN area
  - Includes several suburbs
- Real estate types
  - Residential - Single
  - Residential - Multi-Family
- Full year of data
  - Captures the impact of seasonality
What Types of Attributes Are Available?

- Dates: Date Listed, Year Built
- Price: List Price, Sales Closing Price
- Location: City, School District, Address
- Amenities: Pool, Fireplace, Garage Size, # Bedrooms & Bathrooms, Living Area
- Construction Materials: Siding, Roofing
Validating Your Data Load

```python
# DataFrame.count() for the row count
df.count()
# 5000

# DataFrame.columns for a list of columns
df.columns
# ['No.', 'MLSID', 'StreetNumberNumeric', ... ]

# Length of DataFrame.columns for the number of columns
len(df.columns)
# 74
```
Checking Datatypes

DataFrame.dtypes creates a list of (column name, data type) tuples.

```python
df.dtypes
# [('No.', 'integer'), ('MLSID', 'string'), ... ]
```
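Since `df.dtypes` is a plain Python list of tuples, ordinary list comprehensions can split columns by type, which is handy when numeric and string features need different treatment later. A sketch using an illustrative `dtypes` list (the column names are assumptions, not the full 74-column schema):

```python
# Illustrative output of df.dtypes (column names are assumptions)
dtypes = [('No.', 'int'), ('MLSID', 'string'),
          ('SQFTABOVEGROUND', 'int'), ('LISTPRICE', 'int')]

# Separate numeric columns from string columns
numeric_cols = [name for name, dtype in dtypes
                if dtype in ('int', 'bigint', 'double')]
string_cols = [name for name, dtype in dtypes if dtype == 'string']

print(numeric_cols)  # ['No.', 'SQFTABOVEGROUND', 'LISTPRICE']
print(string_cols)   # ['MLSID']
```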
Let's Practice!
Visually Inspecting Data
Getting Descriptive with DataFrame.describe()

```python
df.describe(['LISTPRICE']).show()
```

```
+-------+------------------+
|summary|         LISTPRICE|
+-------+------------------+
|  count|              5000|
|   mean|        263419.365|
| stddev|143944.10818036905|
|    min|            100000|
|    max|             99999|
+-------+------------------+
```
Many descriptive functions are already available:

- Mean: pyspark.sql.functions.mean(col)
- Skewness: pyspark.sql.functions.skewness(col)
- Minimum: pyspark.sql.functions.min(col)
- Covariance: cov(col1, col2)
- Correlation: corr(col1, col2)
Example with mean()

mean(col): aggregate function, returns the average (mean) of the values in a group.

```python
df.agg({'SALESCLOSEPRICE': 'mean'}).collect()
# [Row(avg(SALESCLOSEPRICE)=262804.4668)]
```
Example with cov()

cov(col1, col2)

Parameters:
- col1: first column
- col2: second column

```python
df.cov('SALESCLOSEPRICE', 'YEARBUILT')
# 1281910.3840634783
```
seaborn: statistical data visualization
Notes on Plotting

Plotting PySpark DataFrames with standard libraries like Seaborn requires converting them to Pandas.

WARNING: Sample PySpark DataFrames before converting to Pandas!

sample(withReplacement, fraction, seed=None)
- withReplacement: allow repeats in the sample
- fraction: percentage of records to keep
- seed: random seed for reproducibility

```python
# Sample 50% of the PySpark DataFrame and count the rows
df.sample(False, 0.5, 42).count()
# 2504
```
Prepping for Plotting a Distribution

Seaborn distplot(): seaborn.distplot(a)
- a: Series, 1d-array, or list of observed data

```python
# Import your favorite visualization library
import seaborn as sns

# Sample the DataFrame
sample_df = df.select(['SALESCLOSEPRICE']).sample(False, 0.5, 42)

# Convert the sample to a Pandas DataFrame
pandas_df = sample_df.toPandas()

# Plot it
sns.distplot(pandas_df)
```
Distribution plot of sales closing price
Relationship Plotting

Seaborn lmplot(): seaborn.lmplot(x, y, data)
- x, y: strings; input variables, these should be column names in data
- data: Pandas DataFrame

```python
# Import your favorite visualization library
import seaborn as sns

# Select columns
s_df = df.select(['SALESCLOSEPRICE', 'SQFTABOVEGROUND'])

# Sample the DataFrame
s_df = s_df.sample(False, 0.5, 42)

# Convert to a Pandas DataFrame
pandas_df = s_df.toPandas()

# Plot it
sns.lmplot(x='SQFTABOVEGROUND', y='SALESCLOSEPRICE', data=pandas_df)
```
Linear model plot between SQFT above ground and sales price
Let's practice!