Building Reusable and Trustworthy Pipelines


  1. Building Reusable and Trustworthy Pipelines (Airflow Summit 2020, @nehiljain)

  2. Outline: 1. Context 2. Design Requirements 3. Proposed Solution 4. Example Code

  3. Context

  4. Hello! Data engineer @ SnapTravel ▸ SnapTravel ▸ M-commerce startup ▸ Data team: 8, Data Sources: 86 ▸ Data infrastructure, Data engineering, Analytics engineering

  5. Purpose ▸ Share lessons learnt with BI pipelines ▸ Community feedback

  6. How is my company doing? ▸ gross_revenue ▸ contribution_margin ▸ number_of_active_users ▸ retention_rate ▸ conversion_rate

  7. How's my Airflow repo? ▸ number_prs_merged ▸ number_prs_closed_without_merge ▸ number_prs_opened ▸ number_of_commits

  9. Let us consider ▸ The pipeline failed in production ▸ Focus shifts to issues and comments ▸ GitLab released a new version of its API ▸ I want to analyze other Apache projects too ▸ GitHub produced similar insights and their numbers didn't match mine

  10. Been there, done that?

  11. Classify the problems ▸ Toil ▸ Cannot scale Data Analytics ▸ Data Discovery ▸ Data Trust ▸ Throw over the boundary ▸ Ambiguous ownership

  12. What can we do to solve this?

  13. "...build tools, infrastructure, frameworks and services" (Maxime Beauchemin)

  14. Design Requirements

  16. Single Source of Truth ▸ Standardization ▸ Data Lineage ▸ Empower non-technical folks

  17. Easy to consume ▸ Airflow + Other OSS ▸ Ideally pip install awesome-elt-tool ▸ Low barrier to entry for data analytics ▸ Operational creep

  18. Promote data integrity ▸ Test the raw data supply ▸ Automated analytics testing

  19. Meta Data Engineering

  21. Proposed Solution

  22. Conceptually

  23. ETL vs ELT ▸ Load once and transform ▸ Reduced complexity ▸ Reduce cost ▸ Speed of delivery
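
"Load once and transform" can be sketched with nothing more than the standard library. The snippet below is a minimal illustration, not the speaker's code: an in-memory SQLite database stands in for the warehouse, the sample rows are made up, and the table and view names are assumptions. The metric names are borrowed from the repo-health slide.

```python
import sqlite3

# Hypothetical raw records, standing in for what an extractor would deliver.
RAW_PULL_REQUESTS = [
    (1, "merged"),
    (2, "closed"),
    (3, "open"),
    (4, "merged"),
]

# In-memory SQLite plays the role of the warehouse in this sketch.
conn = sqlite3.connect(":memory:")

# Load: land the raw data once, untouched.
conn.execute("CREATE TABLE raw_pull_requests (id INTEGER, state TEXT)")
conn.executemany("INSERT INTO raw_pull_requests VALUES (?, ?)", RAW_PULL_REQUESTS)

# Transform: derive metrics inside the warehouse with SQL. The view can be
# redefined and re-run at any time without extracting the data again.
conn.execute(
    """
    CREATE VIEW pr_metrics AS
    SELECT
        SUM(state = 'merged') AS number_prs_merged,
        SUM(state = 'closed') AS number_prs_closed_without_merge,
        COUNT(*)              AS number_prs_opened
    FROM raw_pull_requests
    """
)

metrics = conn.execute("SELECT * FROM pr_metrics").fetchone()
print(metrics)
```

Because the raw table is kept as loaded, changing a metric definition only means changing the SQL, which is where the reduced complexity and faster delivery on the slide come from.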

  24. Validate your source data

  25. ▸ expect_column_to_exist ▸ expect_table_row_count_to_be_between ▸ expect_table_row_count_to_equal ▸ expect_multicolumn_values_to_be_unique ▸ expect_column_values_to_not_be_null ▸ expect_column_values_to_be_null ▸ expect_column_fancy_statistic_to_be
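
To make the idea behind these expectations concrete, here is a hand-rolled sketch in plain Python. It deliberately does not call the Great Expectations library: the three functions only borrow expectation names from the slide, and their signatures, the `validate` helper, and the sample rows are all assumptions made for illustration.

```python
from typing import Any

Row = dict[str, Any]


def expect_column_to_exist(rows: list[Row], column: str) -> bool:
    # Every row must carry the column, even if its value is None.
    return all(column in row for row in rows)


def expect_column_values_to_not_be_null(rows: list[Row], column: str) -> bool:
    # No row may have a missing or None value in the column.
    return all(row.get(column) is not None for row in rows)


def expect_table_row_count_to_be_between(
    rows: list[Row], min_value: int, max_value: int
) -> bool:
    # The number of rows must fall inside the inclusive range.
    return min_value <= len(rows) <= max_value


def validate(rows: list[Row]) -> dict[str, bool]:
    # Hypothetical suite: run each check and report pass/fail by name.
    return {
        "id exists": expect_column_to_exist(rows, "id"),
        "author not null": expect_column_values_to_not_be_null(rows, "author"),
        "row count in [1, 10]": expect_table_row_count_to_be_between(rows, 1, 10),
    }


# Made-up source rows; the second one has a null author on purpose.
sample_rows = [
    {"id": 1, "author": "alice"},
    {"id": 2, "author": None},
]

results = validate(sample_rows)
print(results)
```

Running such checks on the raw supply, before any transformation, is what lets a failure be traced to the source rather than to downstream SQL.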

  26. Why? ▸ Profiling ▸ Data Docs <-> Tests ▸ Send notifications automatically

  27. Extract - Load

  28. Singer - What?

  29. tap-github --config tap_config.json | target-postgres --config target_config.json >> state.json

  30. Singer - Why? ▸ Standardized communication ▸ Incremental out of the box ▸ Documentation ▸ See your data in under 10 mins
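
"Incremental out of the box" boils down to a saved bookmark. The following is a rough sketch of that idea only, not of how any particular tap implements it: `extract_incremental`, the `bookmark` key, and the sample rows are hypothetical, and ISO-formatted date strings are assumed so that plain string comparison orders them correctly.

```python
# Made-up source rows with ISO dates, which sort correctly as strings.
SOURCE_ROWS = [
    {"id": 1, "updated_at": "2020-07-01"},
    {"id": 2, "updated_at": "2020-07-05"},
    {"id": 3, "updated_at": "2020-07-09"},
]


def extract_incremental(rows: list[dict], state: dict) -> tuple[list[dict], dict]:
    """Return only rows newer than the bookmark, plus the updated state."""
    bookmark = state.get("bookmark", "")
    new_rows = [row for row in rows if row["updated_at"] > bookmark]
    if new_rows:
        # Advance the bookmark to the newest row seen in this run.
        bookmark = max(row["updated_at"] for row in new_rows)
    return new_rows, {"bookmark": bookmark}


# First run: no saved state, so everything is extracted.
first_batch, state = extract_incremental(SOURCE_ROWS, {})

# Second run: the saved state filters out rows that were already loaded.
second_batch, state = extract_incremental(SOURCE_ROWS, state)

print(len(first_batch), len(second_batch), state)
```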

  32. It's a long list

  33. Transform

  34. DBT - What?

  39. DBT - Why? ▸ Modular code ▸ Testing is first-class ▸ Data documentation is first-class
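
One way to picture testing as a first-class concern: a data test is a query that looks for offending rows and passes when it finds none. The sketch below mimics that idea with SQLite from Python. It does not use dbt itself; the helper functions, table, and rows are assumptions, with only the `not_null` and `unique` names taken from dbt's built-in tests.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pull_requests (id INTEGER, author TEXT)")
# Made-up rows with a duplicate id and a null author on purpose.
conn.executemany(
    "INSERT INTO pull_requests VALUES (?, ?)",
    [(1, "alice"), (2, "bob"), (2, None)],
)


def not_null_failures(conn: sqlite3.Connection, table: str, column: str) -> list:
    # Identifiers are interpolated, so pass only trusted table/column names.
    query = f"SELECT * FROM {table} WHERE {column} IS NULL"
    return conn.execute(query).fetchall()


def unique_failures(conn: sqlite3.Connection, table: str, column: str) -> list:
    # Returns each duplicated value together with how often it occurs.
    query = (
        f"SELECT {column}, COUNT(*) FROM {table} "
        f"WHERE {column} IS NOT NULL "
        f"GROUP BY {column} HAVING COUNT(*) > 1"
    )
    return conn.execute(query).fetchall()


null_authors = not_null_failures(conn, "pull_requests", "author")
duplicate_ids = unique_failures(conn, "pull_requests", "id")

# A test passes only when its query returns no rows.
print("not_null passed:", not null_authors)
print("unique passed:", not duplicate_ids)
```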

  40. Great adoption

  41. All together

  42. Meltano ▸ Open Source, GitLab ▸ Self Hosted
      pip3 install meltano
      meltano init airflow-analytics-project
      meltano add extractor tap-github
      meltano add loader target-postgres
      meltano add transformer dbt
      meltano add transform tap-github
      # add env variables
      meltano elt tap-github target-postgres --transform=run --job_id=github-to-postgres
      meltano add orchestrator airflow

  43. Let's look at the code

  45. A templated approach
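
The following is not the speaker's code. It is one hypothetical reading of what a templated approach could look like: each pipeline is described by a small spec, and the `meltano elt` invocation is generated from it. `PipelineSpec`, `PIPELINES`, and the job-id naming rule are assumptions; the command shape and flags follow the Meltano slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineSpec:
    """Hypothetical description of one extract-load-transform pipeline."""

    extractor: str  # a Singer tap, e.g. "tap-github"
    loader: str     # a Singer target, e.g. "target-postgres"

    @property
    def job_id(self) -> str:
        # Assumed naming rule: tap-github + target-postgres -> github-to-postgres
        source = self.extractor.removeprefix("tap-")
        destination = self.loader.removeprefix("target-")
        return f"{source}-to-{destination}"

    def elt_command(self) -> str:
        # Same shape as the `meltano elt` command on the Meltano slide.
        return (
            f"meltano elt {self.extractor} {self.loader} "
            f"--transform=run --job_id={self.job_id}"
        )


# Adding a new source becomes a one-line change to this list.
PIPELINES = [
    PipelineSpec("tap-github", "target-postgres"),
    PipelineSpec("tap-gitlab", "target-postgres"),
]

commands = [spec.elt_command() for spec in PIPELINES]
for command in commands:
    print(command)
```

In an Airflow setup, each generated command could be handed to a task that runs shell commands, so every pipeline shares one tested template instead of its own hand-written DAG.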

  50. Sit back & Relax

  51. Some challenges out there ▸ Visualisation/BI layer ▸ Analytics code coverage ▸ Singer community

  52. Key Takeaways ▸ Standardized tooling ▸ ELT >> ETL ▸ Great Expectations (GE) + Singer + DBT orchestrated by Airflow

  53. Thanks

  54. Q & A

  55. Resources ▸ Meltano Project ▸ Advanced Data Engineering Patterns with Apache Airflow by Maxime Beauchemin ▸ The Rise of the Data Engineer ▸ The Future of Data Engineering ▸ Downfall of the data engineer

  56. Resources ▸ Singer | Open Source ETL ▸ Why we are building an open-source platform for ELT pipelines - Meltano ▸ Dbt Docs
