Building Reusable and Trustworthy pipelines 1 — Airflow Summit 2020, @nehiljain
Outline
1. Context
2. Design Requirements
3. Proposed Solution
4. Example Code

Context

Hello!
Data engineer @ SnapTravel
▸ SnapTravel: M-commerce startup
▸ Data team: 8, Data sources: 86
▸ Data infrastructure, data engineering, analytics engineering
▸ Stack: [logos not recoverable]

Purpose
▸ Share BI pipelines with lessons learnt
▸ Community feedback

How is my company doing?
▸ gross_revenue
▸ contribution_margin
▸ number_of_active_users
▸ retention_rate
▸ conversion_rate

How's my Airflow repo?
▸ number_prs_merged
▸ number_prs_closed_without_merge
▸ number_prs_opened
▸ number_of_commits

Let us consider
▸ The pipeline failed in production
▸ Shift focus to issues and comments
▸ GitLab released a new version of its API
▸ I want to analyze other Apache projects too
▸ GitHub produced similar insights and their numbers didn't match mine

Been there, done that?

Classify the problems
▸ Toil
▸ Cannot scale Data Analytics
▸ Data Discovery
▸ Data Trust
▸ Throw over the boundary
▸ Ambiguous ownership

What can we do to solve this?

"...build tools, infrastructure, frameworks and services" (Maxime Beauchemin)

Design Requirements

Single Source of Truth
▸ Standardization
▸ Data Lineage
▸ Empower non-technical folks

Easy to consume
▸ Airflow + Other OSS
▸ Ideally pip install awesome-elt-tool
▸ Low barrier to entry for data analytics
▸ Operational creep

Promote data integrity
▸ Test the raw data supply
▸ Automated analytics testing

Meta Data Engineering

Proposed Solution

Conceptually

ETL vs ELT
▸ Load once and transform
▸ Reduced complexity
▸ Reduce cost
▸ Speed of delivery

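A minimal sketch of the "load once and transform" idea, using Python's built-in sqlite3 as a stand-in warehouse. The table names (raw_pull_requests, pr_metrics) and the sample rows are hypothetical, chosen only to echo the repo metrics listed earlier; treating "closed" as "closed without merge" is likewise an assumption.

```python
import sqlite3

# Hypothetical raw records, as an extractor might deliver them.
RAW_PULL_REQUESTS = [
    (1, "merged"),
    (2, "closed"),
    (3, "open"),
    (4, "merged"),
]


def load_raw(conn, rows):
    """Load step: copy source rows into the warehouse unchanged."""
    conn.execute("CREATE TABLE raw_pull_requests (id INTEGER, state TEXT)")
    conn.executemany("INSERT INTO raw_pull_requests VALUES (?, ?)", rows)


def transform(conn):
    """Transform step: derive metrics with SQL inside the warehouse."""
    conn.execute(
        """
        CREATE TABLE pr_metrics AS
        SELECT
            SUM(state = 'merged') AS number_prs_merged,
            SUM(state = 'closed') AS number_prs_closed_without_merge,
            COUNT(*) AS number_prs_opened
        FROM raw_pull_requests
        """
    )
    return conn.execute("SELECT * FROM pr_metrics").fetchone()


conn = sqlite3.connect(":memory:")
load_raw(conn, RAW_PULL_REQUESTS)
metrics = transform(conn)
print(metrics)
```

Because the raw table is kept as loaded, a new metric only needs a new SQL transform, not a second extraction.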
Validate your source data

▸ expect_column_to_exist
▸ expect_table_row_count_to_be_between
▸ expect_table_row_count_to_equal
▸ expect_multicolumn_values_to_be_unique
▸ expect_column_values_to_not_be_null
▸ expect_column_values_to_be_null
▸ expect_column_fancy_statistic_to_be

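To make the checks above concrete, here is a plain-Python sketch of three of them. These functions are hypothetical stand-ins that borrow the expectation names; they are not the validation library's API, and the sample rows are made up.

```python
# Plain-Python stand-ins for three of the checks listed above.
# They only mimic the idea; they are NOT a real library's API.


def expect_column_to_exist(rows, column):
    """Pass when every row carries the column."""
    return all(column in row for row in rows)


def expect_table_row_count_to_be_between(rows, min_value, max_value):
    """Pass when the row count falls inside the inclusive range."""
    return min_value <= len(rows) <= max_value


def expect_column_values_to_not_be_null(rows, column):
    """Pass when no row has a missing or None value in the column."""
    return all(row.get(column) is not None for row in rows)


def validate(rows):
    """Run every check and return {check_name: passed}."""
    return {
        "sha exists": expect_column_to_exist(rows, "sha"),
        "row count 1..1000": expect_table_row_count_to_be_between(rows, 1, 1000),
        "author not null": expect_column_values_to_not_be_null(rows, "author"),
    }


# Hypothetical raw commits; the second row has a missing author.
raw_commits = [
    {"sha": "a1", "author": "alice"},
    {"sha": "b2", "author": None},
]

results = validate(raw_commits)
failed = [name for name, passed in results.items() if not passed]
print(failed)
```

A pipeline could stop, or send a notification, whenever `failed` is non-empty, before any transform runs on bad input.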
Why?
▸ Profiling
▸ Data Docs <-> Tests
▸ Send notifications automatically

Extract - Load

Singer - What?

tap-github --config tap_config.json | target-postgres --config target_config.json >> state.json

Singer - Why?
▸ Standardized communication
▸ Incremental out of the box
▸ Documentation
▸ See your data in under 10 mins

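A toy sketch of what "standardized communication" and "incremental" mean in practice: a tap writes SCHEMA, RECORD and STATE messages as JSON lines, and a target reads them. The stream name, the fields, and the bookmark layout inside STATE are assumptions for illustration; real taps define their own state shape.

```python
import io
import json


def emit(message, out):
    """Write one Singer-style message as a single JSON line."""
    out.write(json.dumps(message) + "\n")


def run_tap(commits, state, out):
    """Toy tap: emit SCHEMA, then RECORDs newer than the bookmark, then STATE."""
    bookmark = state.get("commits", "")
    emit({
        "type": "SCHEMA",
        "stream": "commits",
        "schema": {
            "type": "object",
            "properties": {"sha": {"type": "string"}, "date": {"type": "string"}},
        },
        "key_properties": ["sha"],
    }, out)
    for commit in sorted(commits, key=lambda c: c["date"]):
        if commit["date"] > bookmark:  # ISO dates compare correctly as strings
            emit({"type": "RECORD", "stream": "commits", "record": commit}, out)
            bookmark = commit["date"]
    emit({"type": "STATE", "value": {"commits": bookmark}}, out)


def run_target(lines):
    """Toy target: collect records and return them with the last STATE value."""
    records, state = [], {}
    for line in lines:
        message = json.loads(line)
        if message["type"] == "RECORD":
            records.append(message["record"])
        elif message["type"] == "STATE":
            state = message["value"]
    return records, state


# Hypothetical source data.
COMMITS = [
    {"sha": "a1", "date": "2020-07-01"},
    {"sha": "b2", "date": "2020-07-02"},
]

# First run: no saved state, so everything is extracted.
pipe = io.StringIO()
run_tap(COMMITS, {}, pipe)
first_records, first_state = run_target(pipe.getvalue().splitlines())

# Second run: the saved state means nothing is extracted again.
pipe = io.StringIO()
run_tap(COMMITS, first_state, pipe)
second_records, second_state = run_target(pipe.getvalue().splitlines())

print(len(first_records), len(second_records), second_state)
```

The in-memory buffer plays the role of the shell pipe between tap and target; the returned state is what a later run would be given as its starting point.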
It's a long list

Transform

DBT - What?

DBT - Why?
▸ Modular code
▸ Testing is 1st Class
▸ Data documentation is 1st Class

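One way to picture "modular code": each model is a small SQL file that names its upstream models with ref(), and the run order falls out of those references. The sketch below is a toy illustration of that idea only, not how DBT is implemented; the model names and SQL are hypothetical, and it needs Python 3.9+ for graphlib.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical models: name -> SQL that uses {{ ref('...') }} for upstream models.
MODELS = {
    "stg_commits": "SELECT sha, author FROM raw.commits",
    "stg_pull_requests": "SELECT id, state FROM raw.pull_requests",
    "repo_metrics": (
        "SELECT "
        "(SELECT COUNT(*) FROM {{ ref('stg_commits') }}) AS number_of_commits, "
        "(SELECT COUNT(*) FROM {{ ref('stg_pull_requests') }}) AS number_prs_opened"
    ),
}

REF_PATTERN = re.compile(r"\{\{\s*ref\('([^']+)'\)\s*\}\}")


def dependencies(sql):
    """Return the set of model names referenced via ref() in the SQL."""
    return set(REF_PATTERN.findall(sql))


def run_order(models):
    """Order models so that every model comes after the models it references."""
    graph = {name: dependencies(sql) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


order = run_order(MODELS)
print(order)
```

Adding a metric then means adding one more small model that refs existing ones, rather than editing a single large query.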
Great adoption

All together

Meltano
▸ Open source, from GitLab
▸ Self-hosted

pip3 install meltano
meltano init airflow-analytics-project
meltano add extractor tap-github
meltano add loader target-postgres
meltano add transformer dbt
meltano add transform tap-github
# add env variables
meltano elt tap-github target-postgres --transform=run --job_id=github-to-postgres
meltano add orchestrator airflow

Let's look at the code

A templated approach

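A rough sketch of what templating could look like: one small spec per source, and the `meltano elt` command generated from it. The spec layout and the helper names are hypothetical; only the command shape and its `--transform` and `--job_id` flags come from the Meltano slide.

```python
import shlex

# Hypothetical pipeline specs, one entry per source to analyze.
PIPELINES = [
    {"extractor": "tap-github", "loader": "target-postgres"},
    {"extractor": "tap-gitlab", "loader": "target-postgres"},
]


def short_name(plugin):
    """Strip the 'tap-' or 'target-' prefix: 'tap-github' -> 'github'."""
    return plugin.split("-", 1)[1]


def job_id(spec):
    """Derive a stable job id such as 'github-to-postgres'."""
    return f"{short_name(spec['extractor'])}-to-{short_name(spec['loader'])}"


def elt_command(spec):
    """Build the shell command for one extract-load-transform run."""
    args = [
        "meltano", "elt", spec["extractor"], spec["loader"],
        "--transform=run", f"--job_id={job_id(spec)}",
    ]
    return " ".join(shlex.quote(arg) for arg in args)


commands = [elt_command(spec) for spec in PIPELINES]
for command in commands:
    print(command)
```

In an Airflow DAG file, each generated string could be passed as the `bash_command` of a BashOperator, so adding a source becomes a one-line change to the spec list.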
Sit back & relax

Some challenges out there
▸ Visualisation/BI layer
▸ Analytics code coverage
▸ Singer community

Key Takeaways
▸ Standardized tooling
▸ ELT >> ETL
▸ Great Expectations (GE) + Singer + DBT, orchestrated by Airflow

Thanks

Q & A

Resources
▸ Meltano Project
▸ Advanced Data Engineering Patterns with Apache Airflow, by Maxime Beauchemin
▸ The Rise of the Data Engineer
▸ The Future of Data Engineering
▸ The Downfall of the Data Engineer
▸ Singer | Open Source ETL
▸ Why we are building an open-source platform for ELT pipelines - Meltano
▸ dbt Docs