Data Orchestration with Apache Airflow

  1. Data Orchestration with Apache Airflow

  2. Data driven: “empower the organization to seek more understanding, through data analytics, about their business processes”

  3. Apache Airflow: “Airflow is a platform to programmatically author, schedule and monitor workflows.”

  4. [Architecture diagram. Labels: Batch Data Sources, Stream Data Sources, Cloud Pub/Sub, Cloud Dataflow, Cloud Friendly Data Storage (Data Lake), BigQuery (Warehouse), External Data Sink]

  5. Cloud Composer

  12. Important ETL principles
      ● Processes are idempotent and deterministic
      ● Reusable, parameterisable components
      ● Data partitioning (usually by date)
      ● Dealing with alerts and SLAs
      ● Globally consistent paths to data
      ● Data at rest is immutable
      “Data at rest to data at rest”
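The first and third principles can be sketched together: a load that replaces a whole date partition gives the same result however many times it runs. This is only a minimal illustration using the standard-library `sqlite3` module in place of a real warehouse; the table `staging_customer`, its columns, and the helpers `load_partition` and `count_rows` are hypothetical names, not part of the talk.

```python
import sqlite3


def load_partition(conn, ds, rows):
    """Replace the partition for date `ds` so that reruns give the same result."""
    with conn:  # delete and insert are committed together, or rolled back together
        conn.execute(
            "DELETE FROM staging_customer WHERE partition_date = ?", (ds,)
        )
        conn.executemany(
            "INSERT INTO staging_customer (partition_date, customer_id, name) "
            "VALUES (?, ?, ?)",
            [(ds, customer_id, name) for customer_id, name in rows],
        )


def count_rows(conn, ds):
    """Count the rows stored for one date partition."""
    cursor = conn.execute(
        "SELECT COUNT(*) FROM staging_customer WHERE partition_date = ?", (ds,)
    )
    return cursor.fetchone()[0]


# Hypothetical staging table, partitioned by a date column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staging_customer "
    "(partition_date TEXT, customer_id INTEGER, name TEXT)"
)

rows = [(1, "Ada"), (2, "Grace")]
load_partition(conn, "2018-01-01", rows)
load_partition(conn, "2018-01-01", rows)  # rerun: the partition is replaced, not appended to
row_count = count_rows(conn, "2018-01-01")
```

Because the delete and the insert share one transaction, a failed run leaves the previous partition untouched, and a retried run does not duplicate rows.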

  13. extract_customer = PostgresToPostgresOperator(
          src_postgres_conn_id='postgres_oltp',
          dest_postgress_conn_id='postgres_dwh',
          sql='select_customer.sql',
          pg_table='staging.customer',
          parameters={"window_start_date": "{{ ds }}",
                      "window_end_date": "{{ tomorrow_ds }}"},
          pool='postgres_dwh')
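In Airflow, `{{ ds }}` renders as the run's execution date in `YYYY-MM-DD` form and `{{ tomorrow_ds }}` as the following day, so each run of the task above receives a one-day window. The sketch below reproduces that rendering with the standard library only. The slide does not show the contents of `select_customer.sql`, so the half-open filter and the `updated_at` field are assumptions, and `window_parameters` and `select_window` are hypothetical helpers.

```python
from datetime import date, timedelta


def window_parameters(execution_date):
    """Build the values that {{ ds }} and {{ tomorrow_ds }} would render to."""
    ds = execution_date.isoformat()
    tomorrow_ds = (execution_date + timedelta(days=1)).isoformat()
    return {"window_start_date": ds, "window_end_date": tomorrow_ds}


def select_window(rows, params):
    """Keep rows inside the half-open window [start, end).

    Assumption: timestamps are ISO 8601 strings, so plain string
    comparison orders them chronologically.
    """
    start = params["window_start_date"]
    end = params["window_end_date"]
    return [row for row in rows if start <= row["updated_at"] < end]


params = window_parameters(date(2018, 1, 1))

rows = [
    {"id": 1, "updated_at": "2018-01-01T00:00:00"},
    {"id": 2, "updated_at": "2018-01-01T23:59:59"},
    {"id": 3, "updated_at": "2018-01-02T00:00:00"},  # belongs to the next run
]
selected_ids = [row["id"] for row in select_window(rows, params)]
```

With a half-open window, consecutive daily runs neither overlap nor leave gaps, which keeps each run deterministic for its own date.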

  17. References
      ● https://airflow.incubator.apache.org/
      ● https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Home
      ● https://gtoonstra.github.io/etl-with-airflow/
      ● https://github.com/gtoonstra/airflow-deployments
      ● https://medium.freecodecamp.org/the-rise-of-the-data-engineer-91be18f1e603
      ● https://medium.com/@maximebeauchemin/the-downfall-of-the-data-engineer-5bfb701e5d6b

  18. Thank you! Gerard Toonstra - Data Engineer / Architect - BigData Republic gerard.toonstra@bigdatarepublic.nl https://www.bigdatarepublic.nl/
