Introduction to Data Pipelines
CLEANING DATA WITH PYSPARK
Mike Metzger, Data Engineering Consultant
What is a data pipeline?
A set of steps to process data from source(s) to final output
Can consist of any number of steps or components
Can span many systems
We will focus on data pipelines within Spark
What does a data pipeline look like?
Input(s): CSV, JSON, web services, databases
Transformations: .withColumn(), .filter(), .drop()
Output(s): CSV, Parquet, database
Validation
Analysis
Pipeline details
Not formally defined in Spark
Typically all the normal Spark code required for the task:

schema = StructType([
    StructField('name', StringType(), False),
    StructField('age', StringType(), False)
])
df = spark.read.format('csv').schema(schema).load('datafile')
df = df.withColumn('id', monotonically_increasing_id())
...
df.write.parquet('outdata.parquet')
df.write.json('outdata.json')
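The snippet above assumes a running SparkSession and the usual imports. A minimal setup that would make it runnable might look like the following sketch (the application name is an illustrative assumption, not part of the original slide):

# Setup assumed by the pipeline snippet above (app name is illustrative)
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import monotonically_increasing_id

spark = SparkSession.builder.appName('cleaning_data_pipeline').getOrCreate()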
Let's practice!
Data handling techniques
What are we trying to parse?
Incorrect data:
    Empty rows
    Commented lines
    Headers
    Nested structures
    Multiple delimiters
    Non-regular data (differing numbers of columns per row)
Focused on CSV data

Example raw rows:
    width, height, image
    # This is a comment
    200 300 affenpinscher;0
    600 450 Collie;307 Collie;101
    600 449 Japanese_spaniel;23
Stanford ImageNet annotations
Identifies dog breeds in images
Provides a list of all identified dogs in an image
Other metadata (base folder, image size, etc.)

Example rows:
    02111277  n02111277_3206  500  375  Newfoundland,110,73,416,298
    02108422  n02108422_4375  500  375  bull_mastiff,101,90,214,356  bull_mastiff,282,74,416,370
Removing blank lines, headers, and comments
Spark's CSV parser:
Automatically removes blank lines
Can remove comments using an optional argument
df1 = spark.read.csv('datafile.csv.gz', comment='#')
Handles header fields
Defined via argument
Ignored if a schema is defined
df1 = spark.read.csv('datafile.csv.gz', header='True')
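A small sketch combining these options; the file name and schema below are assumptions for illustration. When a schema is supplied, header=True still skips the first row, but the column names come from the schema rather than from the file:

# Skip comment lines and the header row; names/types come from the schema
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True)
])

df1 = spark.read.csv('datafile.csv.gz',   # file name assumed
                     comment='#',         # drop lines starting with '#'
                     header=True,         # skip the header row
                     schema=schema)       # column names taken from the schema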
Automatic column creation
Spark will automatically create columns in a DataFrame based on the sep argument
df1 = spark.read.csv('datafile.csv.gz', sep=',')
Defaults to using ,
Can still successfully parse if sep is not in the string
df1 = spark.read.csv('datafile.csv.gz', sep='*')
Stores the data in a column defaulting to _c0
Allows you to properly handle nested separators (see the sketch below)
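Once the whole row lands in the single _c0 column, you can split it yourself and leave any nested delimiters intact. A hedged sketch, assuming the real delimiter is a tab as in the ImageNet rows above (the file name and column names are illustrative assumptions):

from pyspark.sql.functions import split, col

# sep='*' does not occur in the data, so each full row stays in _c0
df1 = spark.read.csv('annotations.csv.gz', sep='*')   # file name assumed

# Split on the real delimiter ourselves; nested ',' and ';' separators
# inside individual fields are left untouched
df1 = df1.withColumn('fields', split(col('_c0'), '\t'))

# Pull out the fixed leading fields; the full split row is kept for
# later per-dog processing
df1 = df1.select(
    col('fields').getItem(0).alias('folder'),
    col('fields').getItem(1).alias('filename'),
    col('fields').getItem(2).alias('width'),
    col('fields').getItem(3).alias('height'),
    col('fields')
)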
Let's practice!
Data validation
Definition
Validation is verifying that a dataset complies with the expected format:
Number of rows / columns
Data types
Complex validation rules
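A minimal sketch of the first two checks; the expected row count, column names, and file name are placeholder assumptions:

# Basic format checks (expected values are illustrative)
expected_rows = 10000
expected_columns = {'name', 'age', 'company'}

df = spark.read.parquet('parsed_data.parquet')

assert df.count() == expected_rows, 'Unexpected number of rows'
assert set(df.columns) == expected_columns, 'Unexpected column set'
print(df.dtypes)   # compare the data types against what you expect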
Validating via joins
Compares data against known values
Easy to find data in a given set
Comparatively fast

parsed_df = spark.read.parquet('parsed_data.parquet')
company_df = spark.read.parquet('companies.parquet')
verified_df = parsed_df.join(company_df, parsed_df.company == company_df.company)

This automatically removes any rows with a company not present in company_df!
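One way to see how many rows failed validation is to compare counts before and after the join; broadcasting the smaller DataFrame is a common way to speed this up. A sketch building on the snippet above (not part of the original slide):

from pyspark.sql.functions import broadcast

# Broadcast the (small) list of valid companies to every executor,
# then count how many rows the validating join dropped
verified_df = parsed_df.join(broadcast(company_df),
                             parsed_df.company == company_df.company)

removed_rows = parsed_df.count() - verified_df.count()
print('Rows removed by validation: %d' % removed_rows)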
Complex rule validation
Using Spark components to validate logic:
Calculations
Verifying against an external source
Likely uses a UDF to modify / verify the DataFrame (see the sketch below)
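A hedged sketch of one such rule; the column names and the rule itself (a reported total must equal the sum of its line items) are illustrative assumptions:

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# Example rule (assumed): a row is valid only if its reported total
# matches the sum of its line items
def total_matches(line_items, reported_total):
    if line_items is None or reported_total is None:
        return False
    return abs(sum(line_items) - reported_total) < 0.01

udfTotalMatches = udf(total_matches, BooleanType())

df = df.withColumn('is_valid', udfTotalMatches(df.line_items, df.reported_total))
df = df.filter(df.is_valid)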
Let's practice!
Final analysis and delivery
Analysis calculations (UDF)
Calculations using a UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def getAvgSale(saleslist):
    totalsales = 0
    count = 0
    for sale in saleslist:
        # Each entry carries two sale amounts (positions 2 and 3)
        totalsales += sale[2] + sale[3]
        count += 2
    return totalsales / count

udfGetAvgSale = udf(getAvgSale, DoubleType())
df = df.withColumn('avg_sale', udfGetAvgSale(df.sales_list))
Analysis calculations (inline)
Inline calculations:

df = spark.read.csv('datafile')
df = df.withColumn('avg', df.total_sales / df.sales_count)
df = df.withColumn('sq_ft', df.width * df.length)
df = df.withColumn('total_avg_size', udfComputeTotal(df.entries) / df.numEntries)
Let's practice!
Congratulations and next steps
Next steps
Review the Spark documentation
Try working with data on actual clusters
Work with various datasets
Thank you!