
Feature Generation (FEATURE ENGINEERING WITH PYSPARK) - PowerPoint PPT Presentation



  1. Feature Generation
     FEATURE ENGINEERING WITH PYSPARK
     John Hogue, Lead Data Scientist

  2. Why generate new features?
     Multiplying, Summing, Differencing, Dividing

  3. Why generate new features?

  4. Combining Two Features: Multiplication

     # Creating a new feature, area, by multiplying
     df = df.withColumn('TSQFT', (df['WIDTH'] * df['LENGTH']))

  5. Other Ways to Combine Two Features

     from pyspark.sql.functions import datediff

     # Sum two columns
     df = df.withColumn('TSQFT', (df['SQFTBELOWGROUND'] + df['SQFTABOVEGROUND']))

     # Divide two columns
     df = df.withColumn('PRICEPERTSQFT', (df['LISTPRICE'] / df['TSQFT']))

     # Difference two columns
     df = df.withColumn('DAYSONMARKET', datediff('OFFMARKETDATE', 'LISTDATE'))

  6. What's the limit?
     Automation of Features: FeatureTools & TSFresh
     Explosion of Features: Higher Order & Beyond?
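
A minimal sketch of how such higher-order features can be generated programmatically in PySpark; the column names here are illustrative, not from the deck:

     from itertools import combinations
     from pyspark.sql.functions import col

     # Hypothetical numeric columns to combine pairwise
     numeric_cols = ['TSQFT', 'BATHSTOTAL', 'FIREPLACES']

     # One product (interaction) feature per pair; note how quickly this grows
     for a, b in combinations(numeric_cols, 2):
         df = df.withColumn(f'{a}_x_{b}', col(a) * col(b))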

  7. Go forth and combine!
     FEATURE ENGINEERING WITH PYSPARK

  8. Time Features
     FEATURE ENGINEERING WITH PYSPARK
     John Hogue, Lead Data Scientist, General Mills

  9. The Cyclical Nature of Things
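
The chart from this slide is not reproduced here. A common way to encode cyclical time features so that, say, December and January land next to each other is a sine/cosine transform; a minimal sketch, assuming the LIST_MONTH column created on slide 13:

     from math import pi
     from pyspark.sql.functions import sin, cos, col

     # Map months 1-12 onto a circle so adjacent months stay adjacent
     df = df.withColumn('MONTH_SIN', sin(2 * pi * (col('LIST_MONTH') - 1) / 12))
     df = df.withColumn('MONTH_COS', cos(2 * pi * (col('LIST_MONTH') - 1) / 12))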

  10. Choosing the Right Level

  11. Choosing the Right Level
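
The charts from these two slides are not reproduced here; the point is picking the time grain at which a feature carries signal. A minimal sketch of rolling a weekly series up to a monthly average, using the m_df mortgage-rate DataFrame that appears on slide 16:

     from pyspark.sql.functions import year, month, avg

     # Aggregate weekly mortgage rates up to a monthly level
     monthly = (m_df
                .groupBy(year('DATE').alias('YEAR'), month('DATE').alias('MONTH'))
                .agg(avg('MORTGAGE').alias('AVG_MORTGAGE')))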

  12. Treating Date Fields as Dates...

     from pyspark.sql.functions import to_date

     # Cast the data type to Date
     df = df.withColumn('LISTDATE', to_date('LISTDATE'))

     # Inspect the field
     df[['LISTDATE']].show(2)
     +----------+
     |  LISTDATE|
     +----------+
     |2017-07-14|
     |2017-10-08|
     +----------+
     only showing top 2 rows

  13. Time Components

     from pyspark.sql.functions import year, month

     # Create a new column of year number
     df = df.withColumn('LIST_YEAR', year('LISTDATE'))

     # Create a new column of month number
     df = df.withColumn('LIST_MONTH', month('LISTDATE'))

     from pyspark.sql.functions import dayofmonth, weekofyear

     # Create a new column of the day number within the month
     df = df.withColumn('LIST_DAYOFMONTH', dayofmonth('LISTDATE'))

     # Create a new column of the week number within the year
     df = df.withColumn('LIST_WEEKOFYEAR', weekofyear('LISTDATE'))

  14. Basic Time-Based Metrics

     from pyspark.sql.functions import datediff

     # Calculate the difference between two date fields
     df = df.withColumn('DAYSONMARKET', datediff('OFFMARKETDATE', 'LISTDATE'))

  15. Lagging Features

     window(): returns a frame of records based off a group of records
     lag(col, offset=1): returns the value that is offset rows before the current row

  16. Lagging Features, the PySpark Way

     from pyspark.sql.functions import lag
     from pyspark.sql.window import Window

     # Create Window
     w = Window().orderBy(m_df['DATE'])

     # Create lagged column
     m_df = m_df.withColumn('MORTGAGE-1wk', lag('MORTGAGE', offset=1).over(w))

     # Inspect results
     m_df.show(3)
     +----------+--------+------------+
     |      DATE|MORTGAGE|MORTGAGE-1wk|
     +----------+--------+------------+
     |2013-10-10|    4.23|        null|
     |2013-10-17|    4.28|        4.23|
     |2013-10-24|    4.13|        4.28|
     +----------+--------+------------+
     only showing top 3 rows
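
The same window machinery supports rolling metrics too; a minimal sketch of a 4-week moving average (the window bounds and output column name are assumptions, not from the slide):

     from pyspark.sql.functions import avg
     from pyspark.sql.window import Window

     # Average over the current row and the three rows before it
     w_roll = Window.orderBy(m_df['DATE']).rowsBetween(-3, 0)
     m_df = m_df.withColumn('MORTGAGE_4wk_avg', avg('MORTGAGE').over(w_roll))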

  17. It's TIME to practice!
     FEATURE ENGINEERING WITH PYSPARK

  18. Extracting Features
     FEATURE ENGINEERING WITH PYSPARK
     John Hogue, Lead Data Scientist, General Mills

  19. Extracting Age with Text Match

     ROOF becomes Roof>8yrs:
     Asphalt Shingles, Pitched, Age 8 Years or Less  ->  0
     Asphalt Shingles, Age Over 8 Years              ->  1
     Asphalt Shingles, Age 8 Years or Less           ->  0

  20. Extracting Age with Text Match

     from pyspark.sql.functions import when

     # Create boolean filters
     find_under_8 = df['ROOF'].like('%Age 8 Years or Less%')
     find_over_8 = df['ROOF'].like('%Age Over 8 Years%')

     # Apply filters using when() and otherwise()
     df = df.withColumn('old_roof', (when(find_over_8, 1)
                                     .when(find_under_8, 0)
                                     .otherwise(None)))

     # Inspect results
     df[['ROOF', 'old_roof']].show(3, truncate=100)
     +----------------------------------------------+--------+
     |                                          ROOF|old_roof|
     +----------------------------------------------+--------+
     |                                          null|    null|
     |Asphalt Shingles, Pitched, Age 8 Years or Less|       0|
     |            Asphalt Shingles, Age Over 8 Years|       1|
     +----------------------------------------------+--------+
     only showing top 3 rows

  21. Splitting Columns

     ROOF becomes Roof_Material:
     Asphalt Shingles, Pitched, Age 8 Years or Less  ->  Asphalt Shingles
     Null                                            ->  Null
     Asphalt Shingles, Age Over 8 Years              ->  Asphalt Shingles
     Metal, Age 8 Years or Less                      ->  Metal
     Tile, Age 8 Years or Less                       ->  Tile
     Asphalt Shingles                                ->  Asphalt Shingles

  22. Splitting Columns

     from pyspark.sql.functions import split

     # Split the column on commas into a list
     split_col = split(df['ROOF'], ',')

     # Put the first value of the list into a new column
     df = df.withColumn('Roof_Material', split_col.getItem(0))

     # Inspect results
     df[['ROOF', 'Roof_Material']].show(5, truncate=100)
     +----------------------------------------------+----------------+
     |                                          ROOF|   Roof_Material|
     +----------------------------------------------+----------------+
     |                                          null|            null|
     |Asphalt Shingles, Pitched, Age 8 Years or Less|Asphalt Shingles|
     |                                          null|            null|
     |Asphalt Shingles, Pitched, Age 8 Years or Less|Asphalt Shingles|
     |            Asphalt Shingles, Age Over 8 Years|Asphalt Shingles|
     +----------------------------------------------+----------------+
     only showing top 5 rows

  23. Explode!

     Starting Record:
     NO | roof_list
     2  | [Asphalt Shingles, Pitched, Age 8 Years or Less]

     Exploded Record:
     NO | ex_roof_list
     2  | Asphalt Shingles
     2  | Pitched
     2  | Age 8 Years or Less

  24. Pivot!

     Exploded Record:
     NO | ex_roof_list
     2  | Asphalt Shingles
     2  | Pitched
     2  | Age 8 Years or Less

     Pivoted Record:
     NO | Age 8 Years or Less | Age Over 8 Years | Asphalt Shingles | Flat | Metal | Other | Pitched | ...
     2  | 1                   | 0                | 1                | 0    | 0     | 0     | 1       | ...

  25. Explode & Pivot!

     from pyspark.sql.functions import split, explode, lit, coalesce, first

     # Split the column on commas into a list
     df = df.withColumn('roof_list', split(df['ROOF'], ', '))

     # Explode list into new records for each value
     ex_df = df.withColumn('ex_roof_list', explode(df['roof_list']))

     # Create a dummy column of constant value
     ex_df = ex_df.withColumn('constant_val', lit(1))

     # Pivot the values into boolean columns
     piv_df = ex_df.groupBy('NO').pivot('ex_roof_list')\
         .agg(coalesce(first('constant_val')))
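
The pivoted dummies end up in a separate DataFrame keyed by NO; a minimal sketch of joining them back and zero-filling the nulls that pivot leaves for absent values (the left join and the 0 fill are assumptions, not shown on the slide):

     # Join the pivoted dummy columns back onto the original data
     df = df.join(piv_df, on='NO', how='left')

     # pivot() leaves null where a value was absent; treat those as 0
     dummy_cols = [c for c in piv_df.columns if c != 'NO']
     df = df.fillna(0, subset=dummy_cols)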

  26. Let's wrangle some features!
     FEATURE ENGINEERING WITH PYSPARK

  27. Binarizing, Bucketing & Encoding
     FEATURE ENGINEERING WITH PYSPARK
     John Hogue, Lead Data Scientist

  28. Binarizing

     FIREPLACES becomes Has_Fireplace:
     1  ->  1
     3  ->  1
     1  ->  1
     2  ->  1
     0  ->  0

  29. Binarizing

     from pyspark.ml.feature import Binarizer

     # Cast the data type to double
     df = df.withColumn('FIREPLACES', df['FIREPLACES'].cast('double'))

     # Create binarizing transformer
     binarizer = Binarizer(threshold=0.0, inputCol='FIREPLACES', outputCol='FireplaceT')

     # Apply the transformer
     df = binarizer.transform(df)

     # Inspect the results
     df[['FIREPLACES', 'FireplaceT']].show(3)
     +----------+----------+
     |FIREPLACES|FireplaceT|
     +----------+----------+
     |       0.0|       0.0|
     |       1.0|       1.0|
     |       2.0|       1.0|
     +----------+----------+
     only showing top 3 rows

  30. Bucketing

     from pyspark.ml.feature import Bucketizer

     # Define how to split data
     splits = [0, 1, 2, 3, 4, float('Inf')]

     # Create bucketing transformer
     buck = Bucketizer(splits=splits, inputCol='BATHSTOTAL', outputCol='baths')

     # Apply transformer
     df = buck.transform(df)

     # Inspect results
     df[['BATHSTOTAL', 'baths']].show(4)
     +----------+-----+
     |BATHSTOTAL|baths|
     +----------+-----+
     |         2|  2.0|
     |         3|  3.0|
     |         1|  1.0|
     |         5|  4.0|
     +----------+-----+
     only showing top 4 rows

  31. One Hot Encoding

     CITY becomes one column per value:
     CITY              | LELM | MAPW | OAKD | STP | WB
     LELM - Lake Elmo  |  1   |  0   |  0   |  0  |  0
     MAPW - Maplewood  |  0   |  1   |  0   |  0  |  0
     OAKD - Oakdale    |  0   |  0   |  1   |  0  |  0
     STP - Saint Paul  |  0   |  0   |  0   |  1  |  0
     WB - Woodbury     |  0   |  0   |  0   |  0  |  1

  32. One Hot Encoding the PySpark Way

     from pyspark.ml.feature import OneHotEncoder, StringIndexer

     # Create indexer transformer
     stringIndexer = StringIndexer(inputCol='CITY', outputCol='City_Index')

     # Fit transformer
     model = stringIndexer.fit(df)

     # Apply transformer
     indexed = model.transform(df)
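
The slide stops at indexing even though OneHotEncoder is imported; a minimal sketch of the remaining encoding step (the output column name is assumed; in Spark 3.x OneHotEncoder is an estimator that needs fit(), while in Spark 2.x it was a plain transformer applied directly):

     # Encode the numeric index into a sparse one-hot vector (Spark 3.x API)
     encoder = OneHotEncoder(inputCol='City_Index', outputCol='City_Vec')
     encoded = encoder.fit(indexed).transform(indexed)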
