Data Orchestration with Apache Airflow
Data driven: “empower the organization to seek more understanding, through data analytics, about their business processes”
Apache Airflow: “Airflow is a platform to programmatically author, schedule and monitor workflows.”
[Architecture diagram. Labels: Batch Data Sources, Stream Data Sources, Cloud-Friendly Data Storage (Data Lake, Warehouse), BigQuery, Cloud Dataflow, Cloud Pub/Sub, External Data Sink]
Cloud Composer
Important ETL principles
● Processes are idempotent and deterministic
● Reusable, parameterisable components
● Data partitioning (usually by date)
● Dealing with alerts and SLAs
● Globally consistent paths to data
● Data at rest is immutable
“Data at rest to data at rest”
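A minimal sketch of the first, third and fifth principles, using only the Python standard library. It is not taken from the talk: the table name `staging_customer`, the `warehouse/<dataset>/ds=<date>` path layout and both helper functions are assumptions made for illustration. The load replaces one date partition inside a single transaction, so rerunning a task for the same date leaves the same rows behind.

```python
import sqlite3


def partition_path(dataset: str, ds: str) -> str:
    """Build the single agreed-upon location for a dataset's daily partition.

    The layout is hypothetical; the point is that every task derives the
    path from the same function instead of hard-coding it.
    """
    return f"warehouse/{dataset}/ds={ds}"


def load_partition(conn, ds, rows):
    """Replace the partition for `ds` so that reruns converge on the same state."""
    with conn:  # commits on success, rolls back if any statement fails
        conn.execute("DELETE FROM staging_customer WHERE ds = ?", (ds,))
        conn.executemany(
            "INSERT INTO staging_customer (ds, id, name) VALUES (?, ?, ?)",
            [(ds, customer_id, name) for customer_id, name in rows],
        )


def make_staging():
    """Create an in-memory stand-in for the staging table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging_customer (ds TEXT, id INTEGER, name TEXT)")
    return conn
```

Calling `load_partition` twice with the same `ds` and rows does not duplicate data, which is what makes retries and backfills safe.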
extract_customer = PostgresToPostgresOperator(
    src_postgres_conn_id='postgres_oltp',
    dest_postgress_conn_id='postgres_dwh',
    sql='select_customer.sql',
    pg_table='staging.customer',
    parameters={"window_start_date": "{{ ds }}",
                "window_end_date": "{{ tomorrow_ds }}"},
    pool='postgres_dwh')
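The two templated parameters describe a one-day window: `{{ ds }}` renders to the run's date as `YYYY-MM-DD` and `{{ tomorrow_ds }}` to the day after. The sketch below imitates that rendering without Airflow and applies it to a half-open date filter. The query text is a hypothetical stand-in, since the contents of `select_customer.sql` are not shown in the slides, and SQLite's `:name` placeholders are used here in place of a Postgres driver.

```python
import sqlite3
from datetime import date, timedelta


def render_window(ds: str) -> dict:
    """Mimic what {{ ds }} and {{ tomorrow_ds }} would render to for one run."""
    start = date.fromisoformat(ds)
    return {
        "window_start_date": ds,
        "window_end_date": (start + timedelta(days=1)).isoformat(),
    }


# Hypothetical stand-in for select_customer.sql (assumed, not from the talk).
# The window is half-open: start inclusive, end exclusive.
SELECT_CUSTOMER = """
    SELECT id, name, updated_at
    FROM customer
    WHERE updated_at >= :window_start_date
      AND updated_at <  :window_end_date
    ORDER BY id
"""


def extract(conn, ds):
    """Return only the rows that fall inside the window for run date `ds`."""
    return conn.execute(SELECT_CUSTOMER, render_window(ds)).fetchall()


def demo_source():
    """Create an in-memory stand-in for the OLTP source table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER, name TEXT, updated_at TEXT)")
    conn.executemany(
        "INSERT INTO customer VALUES (?, ?, ?)",
        [
            (1, "alice", "2018-01-01 09:00:00"),
            (2, "bob", "2018-01-01 23:59:59"),
            (3, "carol", "2018-01-02 00:00:00"),
        ],
    )
    return conn
```

Because the end of the window is exclusive, consecutive daily runs neither overlap nor leave gaps, so each source row lands in exactly one run.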
Thank you! Gerard Toonstra - Data Engineer / Architect - BigData Republic gerard.toonstra@bigdatarepublic.nl https://www.bigdatarepublic.nl/