
Components of a data platform
Building Data Engineering Pipelines in Python
Oliver Willekens, Data Engineer at Data Minded

Course contents
- ingest data using Singer
- apply common data cleaning operations
- gain insights by combining data with PySpark
- test your code automatically
- deploy Spark transformation pipelines
=> intro to data engineering pipelines

Data is valuable

Democratizing data increases insights

Genesis of the data

Operational data is stored in the landing zone

Cleaned data prevents rework

The business layer provides most insights

Pipelines move data from one zone to another

Let’s reason!

Introduction to data ingestion with Singer

Singer’s core concepts
Aim: “The open-source standard for writing scripts that move data”
Singer is a specification
- data exchange format: JSON
- extract and load with taps and targets => language independent
- communicate over streams:
    - schema (metadata)
    - state (process metadata)
    - record (data)

Describing the data through its schema

    columns = ("id", "name", "age", "has_children")
    users = {(1, "Adrian", 32, False),
             (2, "Ruanne", 28, False),
             (3, "Hillary", 29, True)}
    json_schema = {
        "properties": {"age": {"maximum": 130,
                               "minimum": 1,
                               "type": "integer"},
                       "has_children": {"type": "boolean"},
                       "id": {"type": "integer"},
                       "name": {"type": "string"}},
        "$id": "http://yourdomain.com/schemas/my_user_schema.json",
        "$schema": "http://json-schema.org/draft-07/schema#"}
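To make the role of the schema concrete, the rows can be checked against it by hand. This is a minimal sketch using only the standard library: the `conforms` helper and `TYPE_MAP` are hypothetical names introduced here, and the check covers only the keywords that appear on the slide (`type`, `minimum`, `maximum`). It is not a full JSON Schema validator; a dedicated validation library would be the usual choice in practice.

```python
columns = ("id", "name", "age", "has_children")
users = {(1, "Adrian", 32, False), (2, "Ruanne", 28, False), (3, "Hillary", 29, True)}
json_schema = {
    "properties": {
        "age": {"maximum": 130, "minimum": 1, "type": "integer"},
        "has_children": {"type": "boolean"},
        "id": {"type": "integer"},
        "name": {"type": "string"},
    }
}

# Hypothetical mapping from the JSON Schema type names used above to Python types.
TYPE_MAP = {"integer": int, "boolean": bool, "string": str}


def conforms(record, schema):
    """Check a record against the type/minimum/maximum rules of the schema."""
    for field, rules in schema["properties"].items():
        if field not in record:
            continue  # no "required" keyword is used in this schema
        value = record[field]
        expected = TYPE_MAP[rules["type"]]
        # bool is a subclass of int in Python, so reject it for "integer".
        if expected is int and isinstance(value, bool):
            return False
        if not isinstance(value, expected):
            return False
        if "minimum" in rules and value < rules["minimum"]:
            return False
        if "maximum" in rules and value > rules["maximum"]:
            return False
    return True


records = [dict(zip(columns, user)) for user in users]
print(all(conforms(record, json_schema) for record in records))
```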

Describing the data through its schema

    import singer
    singer.write_schema(schema=json_schema,
                        stream_name='DC_employees',
                        key_properties=["id"])

Output:

    {"type": "SCHEMA", "stream": "DC_employees", "schema": {"properties": {"age": {"maximum": 130, "minimum": 1, "type": "integer"}, "has_children": {"type": "boolean"}, "id": {"type": "integer"}, "name": {"type": "string"}}, "$id": "http://yourdomain.com/schemas/my_user_schema.json", "$schema": "http://json-schema.org/draft-07/schema#"}, "key_properties": ["id"]}
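Because the SCHEMA message is plain JSON, its shape can be reproduced without the singer package. The sketch below uses only the standard library; `make_schema_message` is a hypothetical helper name, and the schema is a trimmed-down version of the one on the slide. It illustrates the message layout and is not a replacement for `singer.write_schema`.

```python
import json


def make_schema_message(stream_name, schema, key_properties):
    """Build a dict shaped like the SCHEMA message shown on the slide."""
    return {
        "type": "SCHEMA",
        "stream": stream_name,
        "schema": schema,
        "key_properties": key_properties,
    }


# Trimmed-down schema, for illustration only.
json_schema = {
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
    }
}

message = make_schema_message("DC_employees", json_schema, ["id"])
line = json.dumps(message)  # a tap emits one JSON message per line
print(line)
```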

Serializing JSON

    import json
    json.dumps(json_schema["properties"]["age"])
    # '{"maximum": 130, "minimum": 1, "type": "integer"}'

    with open("foo.json", mode="w") as fh:
        # writes the json-serialized object to the open file handle
        json.dump(obj=json_schema, fp=fh)
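The counterparts of `json.dumps` and `json.dump` are `json.loads` and `json.load`. A short round-trip sketch follows; the temporary directory is an assumption made here so that the example does not write into the working directory.

```python
import json
import os
import tempfile

age_schema = {"maximum": 130, "minimum": 1, "type": "integer"}

# String round trip: dumps serializes to str, loads parses it back.
text = json.dumps(age_schema)
restored = json.loads(text)

# File round trip: dump writes to a file handle, load reads from one.
path = os.path.join(tempfile.mkdtemp(), "foo.json")
with open(path, mode="w") as fh:
    json.dump(obj=age_schema, fp=fh)
with open(path) as fh:
    from_file = json.load(fh)

print(restored == age_schema, from_file == age_schema)
```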

Let’s practice!

Running an ingestion pipeline with Singer

Streaming record messages

    columns = ("id", "name", "age", "has_children")
    users = {(1, "Adrian", 32, False),
             (2, "Ruanne", 28, False),
             (3, "Hillary", 29, True)}
    # users is a set, so pop() removes an arbitrary element;
    # the output below shows one possible result
    singer.write_record(stream_name="DC_employees",
                        record=dict(zip(columns, users.pop())))

Output:

    {"type": "RECORD", "stream": "DC_employees", "record": {"id": 1, "name": "Adrian", "age": 32, "has_children": false}}

The same message can be built by hand:

    fixed_dict = {"type": "RECORD", "stream": "DC_employees"}
    record_msg = {**fixed_dict, "record": dict(zip(columns, users.pop()))}
    print(json.dumps(record_msg))
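The hand-built variant extends naturally to all rows. The sketch below uses only the standard library and sorts the set before iterating; the sorting is an assumption added here to make the output order repeatable, since a set has no defined order.

```python
import json

columns = ("id", "name", "age", "has_children")
users = {(1, "Adrian", 32, False), (2, "Ruanne", 28, False), (3, "Hillary", 29, True)}

fixed_dict = {"type": "RECORD", "stream": "DC_employees"}

lines = []
for user in sorted(users):  # tuples sort by their first element, the id
    record_msg = {**fixed_dict, "record": dict(zip(columns, user))}
    lines.append(json.dumps(record_msg))

# One RECORD message per line, as a tap would write them to stdout.
print("\n".join(lines))
```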

Chaining taps and targets

    # Module: my_tap.py
    import singer
    singer.write_schema(stream_name="foo", schema=…)
    singer.write_records(stream_name="foo", records=…)

Ingestion pipeline: pipe the tap’s output into a Singer target, using the | symbol (Linux & macOS)

    python my_tap.py | target-csv
    python my_tap.py | target-csv --config userconfig.cfg
    my-packaged-tap | target-csv --config userconfig.cfg
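On the receiving end of the pipe, a target reads one JSON message per line from standard input and acts on its type. The sketch below shows that idea with the standard library only. The `consume` function, the sample messages, and the `max-id` state key are hypothetical; this is not how `target-csv` or any published target is implemented.

```python
import io
import json


def consume(stream):
    """Read Singer-style messages line by line and group them by type."""
    schemas, records, state = {}, [], None
    for line in stream:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        msg = json.loads(line)
        if msg["type"] == "SCHEMA":
            schemas[msg["stream"]] = msg["schema"]
        elif msg["type"] == "RECORD":
            records.append((msg["stream"], msg["record"]))
        elif msg["type"] == "STATE":
            state = msg["value"]  # keep only the latest state
    return schemas, records, state


# Stand-in for the tap's output; a real target would pass sys.stdin instead.
tap_output = io.StringIO(
    '{"type": "SCHEMA", "stream": "foo", "schema": {"properties": '
    '{"id": {"type": "integer"}}}, "key_properties": ["id"]}\n'
    '{"type": "RECORD", "stream": "foo", "record": {"id": 1}}\n'
    '{"type": "STATE", "value": {"max-id": 1}}\n'
)

schemas, records, state = consume(tap_output)
print(len(schemas), len(records), state)
```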

Modular ingestion pipelines

    my-packaged-tap | target-csv
    my-packaged-tap | target-google-sheets
    my-packaged-tap | target-postgresql --config conf.json
    tap-custom-google-scraper | target-postgresql --config headlines.json

Keeping track with state messages

    id  name     last_updated_on
    1   Adrian   2019-06-14T14:00:04.000+02:00
    2   Ruanne   2019-06-16T18:33:21.000+02:00
    3   Hillary  2019-06-14T10:05:12.000+02:00

    singer.write_state(value={"max-last-updated-on": some_variable})

Run this tap-mydelta on 2019-06-14 at 12:00:00.000+02:00 (2nd row wasn’t yet present then):

    {"type": "STATE", "value": {"max-last-updated-on": "2019-06-14T10:05:12.000+02:00"}}
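A later run can use the stored state as a bookmark and pick up only the rows that changed since. The sketch below assumes the table as shown above and the state from the earlier run; `new_rows_since` is a hypothetical helper, and the message is built with the standard library rather than with `singer.write_state`.

```python
import json
from datetime import datetime

rows = [
    (1, "Adrian", "2019-06-14T14:00:04.000+02:00"),
    (2, "Ruanne", "2019-06-16T18:33:21.000+02:00"),
    (3, "Hillary", "2019-06-14T10:05:12.000+02:00"),
]


def new_rows_since(rows, bookmark):
    """Return rows updated strictly after the bookmark timestamp."""
    cutoff = datetime.fromisoformat(bookmark)
    return [row for row in rows if datetime.fromisoformat(row[2]) > cutoff]


# State emitted by the earlier run shown on the slide.
previous_state = {"max-last-updated-on": "2019-06-14T10:05:12.000+02:00"}
delta = new_rows_since(rows, previous_state["max-last-updated-on"])

# The new bookmark is the most recent timestamp among the rows just processed.
newest = max(delta, key=lambda row: datetime.fromisoformat(row[2]))
state_msg = {"type": "STATE", "value": {"max-last-updated-on": newest[2]}}
print(json.dumps(state_msg))
```

Comparing parsed datetimes rather than raw strings keeps the comparison correct if the rows ever carry different UTC offsets.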

Let’s practice!
