Democratized data workflows at scale
Emil Todorov, Mihail Petkov
Our agenda for today ● Why Airflow? ● Architecture ● Security ● Execution environment in Kubernetes
FT is a data-driven organization
Time for a change
Why Airflow?
Scalable ● Extensible ● Dynamic ● Elegant
Architecture
Architecture (diagram): the Scheduler, Web Server and PostgreSQL metadata database, with Worker Pods executing the tasks
Business users and tech users (diagram)
Airflow will be used by multiple teams
Airflow requirements (diagram): Team 1, Team 2, ..., Team N
Teams will share Airflow resources
Airflow shared components (diagram): Team 1, Team 2, ..., Team N share a single instance, each with its own DAGs and Connections
Teams will share Kubernetes resources
Kubernetes shared components (diagram): the worker Pods of Team 1, Team 2, ..., Team N run in the same cluster
How to evolve this architecture?
Airflow instance per team
Components of one Airflow instance (diagram)
Instance-per-team problems ● Adding a new team is hard ● Maintaining an environment per team is difficult ● Releasing new features is slow ● Resources are not fully utilised ● Total cost increases
Another way?
Multitenancy
Multiple independent instances in a shared environment
Multi-tenant components
How to make AWS multi-tenant?
IAM security (diagram): a dedicated IAM user per team (Team 1, Team 2, ..., Team N)
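One way to wire that up, as a hedged sketch: each team's IAM user credentials live in a team-specific Airflow connection, so a task can only reach the AWS resources that team's IAM policy allows. This assumes the Amazon provider package and a "{team}_aws" connection-naming convention of our own invention, not something prescribed in the talk.

```python
# Sketch only: a per-team Airflow connection carries that team's IAM user
# credentials. The "{team}_aws" naming convention is an assumption.
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

def upload_report(team: str, local_path: str, bucket: str, key: str) -> None:
    hook = S3Hook(aws_conn_id=f"{team}_aws")   # e.g. "team1_aws"
    hook.load_file(filename=local_path, key=key, bucket_name=bucket, replace=True)
```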
How to enhance Kubernetes?
Kubernetes layout (diagram): a system namespace hosts the Airflow scheduler and web server; each team (Team 1, Team 2, ..., Team N) gets its own namespace with a dedicated Service Account and Resource Quota, and that team's worker Pods run inside it
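A rough sketch of provisioning those per-team pieces with the official kubernetes Python client; all names and quota values are illustrative, not the ones used at FT:

```python
# Sketch: per-team namespace, Service Account and Resource Quota as in the
# diagram above. Names and quota values are illustrative only.
from kubernetes import client, config

def provision_team(team: str) -> None:
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    core = client.CoreV1Api()
    ns = f"{team}-namespace"

    # Dedicated namespace for the team's worker Pods
    core.create_namespace(
        client.V1Namespace(metadata=client.V1ObjectMeta(name=ns)))

    # Service Account the team's worker Pods run as
    core.create_namespaced_service_account(
        ns, client.V1ServiceAccount(metadata=client.V1ObjectMeta(name=f"{team}-worker")))

    # Resource Quota capping what the team can consume
    core.create_namespaced_resource_quota(
        ns,
        client.V1ResourceQuota(
            metadata=client.V1ObjectMeta(name=f"{team}-quota"),
            spec=client.V1ResourceQuotaSpec(hard={"cpu": "8", "memory": "32Gi", "pods": "20"})))
```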
How to improve PostgreSQL?
CHANGES
How to extend Airflow?
Redesign Airflow source code
Redesign Airflow source code ● Module per team ● Connections per team ● Extend hooks, operators and sensors ● Use airflow_local_settings.py
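The "Use airflow_local_settings.py" bullet maps naturally onto Airflow's documented pod_mutation_hook, which is called just before a Kubernetes worker pod is launched. A hedged sketch of how it could enforce the per-team split, assuming Airflow 2.x (where the hook receives a V1Pod) and a "team" label convention of our own invention:

```python
# airflow_local_settings.py (sketch). Airflow calls pod_mutation_hook() right
# before launching a worker pod; here it routes the pod into its team's
# namespace and service account. The "team" label is an assumed convention.
def pod_mutation_hook(pod):
    team = (pod.metadata.labels or {}).get("team", "default")
    pod.metadata.namespace = f"{team}-namespace"
    pod.spec.service_account_name = f"{team}-worker"
```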
Redesign repository structure (diagram): one Airflow system code repository plus a separate DAG repository per team (Team 1, Team 2, ..., Team N)
Execution environment in Kubernetes
ETL (diagram): Extract from DATA SOURCE 1 and DATA SOURCE 2, Transform with aggregations, Load into the DATA DESTINATION
Extract (diagram): the Extract step of the pipeline, reading from DATA SOURCE 1 and DATA SOURCE 2
Load (diagram): the Load step of the pipeline, writing the aggregated data to the DATA DESTINATION
Transform?
Example workflow (diagram): Task 1, Task 2, Task 3 and Task 4
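The dependency structure in the diagram is not preserved here, so this minimal DAG sketch assumes a common fan-in shape: Task 1 and Task 2 feed Task 3, which feeds Task 4 (Airflow 2.4+ syntax):

```python
# Sketch of a four-task workflow; the fan-in dependency shape is an assumption.
from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG("example_workflow", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    task_1 = EmptyOperator(task_id="task_1")
    task_2 = EmptyOperator(task_id="task_2")
    task_3 = EmptyOperator(task_id="task_3")
    task_4 = EmptyOperator(task_id="task_4")

    [task_1, task_2] >> task_3 >> task_4
```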
Our goals: language-agnostic jobs and cross-task data access
KubernetesPodOperator
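KubernetesPodOperator runs a task as an arbitrary container image, which is what makes the job language-agnostic. A minimal sketch, assuming the apache-airflow-providers-cncf-kubernetes package (the import path varies slightly between provider versions); the image, namespace and service account names are placeholders:

```python
# Minimal sketch: the task is whatever the container image does, in any language.
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

transform = KubernetesPodOperator(
    task_id="transform",
    name="transform",
    namespace="team1-namespace",            # the team's namespace from the multi-tenant setup
    service_account_name="team1-worker",
    image="registry.example.com/team1/transform:latest",
    arguments=["--date", "{{ ds }}"],       # Jinja-templated execution date
    get_logs=True,
)
```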
Our goals, revisited: language-agnostic jobs and cross-task data access
Unique storage pattern ● Unique team name from the multi-tenant setup ● Unique DAG id ● Unique task id per DAG ● Unique execution date per DAG run → /{team}/{dag_id}/{task_id}/{execution_date}
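A sketch of rendering that prefix with Airflow's built-in Jinja macros; the team name would come from the multi-tenant setup and is hardcoded here purely for illustration:

```python
# /{team}/{dag_id}/{task_id}/{execution_date}, rendered per task instance.
TEAM = "team1"  # illustration only; in practice derived from the tenant

storage_prefix = "/" + TEAM + "/{{ dag.dag_id }}/{{ task.task_id }}/{{ ts_nodash }}"

# Passed to any templated operator field (e.g. KubernetesPodOperator arguments),
# it renders to something like: /team1/example_workflow/transform/20240101T000000
```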
The power of extensibility
ExecutionEnvironmentOperator (diagram): it extends KubernetesPodOperator, wrapping the operator's EXECUTE step with PRE EXECUTE and POST EXECUTE phases
Configurable cross-task data dependencies
Example input configuration
Example output configuration
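The configurations shown on these two slides are not preserved in the text, so the following is a purely hypothetical illustration of what such input and output configurations could look like, reusing the unique storage pattern:

```python
# Hypothetical illustration only; the real configurations from the talk are not preserved.
input_config = {
    "inputs": [
        {"task_id": "task_1", "path": "/team1/example_workflow/task_1/{{ ts_nodash }}"},
        {"task_id": "task_2", "path": "/team1/example_workflow/task_2/{{ ts_nodash }}"},
    ]
}

output_config = {
    "output": {"path": "/team1/example_workflow/task_3/{{ ts_nodash }}", "format": "parquet"}
}
```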
Pre-execute ● Bootstrap the environment ● Enrich the configuration ● Export the configuration to the execution environment pod
Post-execute ● Handle the execution result ● Clear everything that was bootstrapped ● Deal with the output
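Airflow's BaseOperator already calls pre_execute() before and post_execute() after execute(), so one way to realise this wrapping is a subclass like the sketch below; this is not the FT implementation, and the configuration handling is hypothetical:

```python
# Sketch of the PRE EXECUTE / EXECUTE / POST EXECUTE wrapping around
# KubernetesPodOperator. The configuration handling is hypothetical.
import json
from kubernetes.client import models as k8s
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

class ExecutionEnvironmentOperator(KubernetesPodOperator):
    def __init__(self, *, data_config: dict, **kwargs):
        super().__init__(**kwargs)
        self.data_config = data_config

    def pre_execute(self, context):
        # Bootstrap the environment, enrich the configuration and export it
        # to the execution-environment pod as an environment variable.
        enriched = {**self.data_config, "run_id": context["run_id"]}
        self.env_vars = [k8s.V1EnvVar(name="EXECUTION_CONFIG", value=json.dumps(enriched))]
        return super().pre_execute(context)

    def post_execute(self, context, result=None):
        # Handle the result, clean up whatever was bootstrapped and deal with
        # the output (e.g. make it discoverable for downstream tasks).
        super().post_execute(context, result)
```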
POC with AWS S3 as intermediate storage (diagram): Task 1, Task 2, Task 3 and Task 4 exchange data through S3
Is this efficient? ● Multiple downloads and uploads ● Processing limited to a single worker ● Data always loaded fully into memory
How to evolve the execution environment? ● Remove unnecessary data transfers ● Parallelize the processing ● Provide hot data access
Shared file system
Kubernetes persistent volume (diagram): Task 1, Task 2, Task 3 and Task 4 share a mounted persistent volume
Kubernetes persistent volume with EFS (diagram): Task 1, Task 2, Task 3 and Task 4 share the same EFS-backed volume
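A sketch of mounting such a shared, EFS-backed PersistentVolumeClaim into a task pod, so tasks exchange data through the file system instead of S3 round-trips; the cncf-kubernetes provider is assumed and the claim name is a placeholder:

```python
# Sketch: every task pod mounts the same EFS-backed claim at /data.
from kubernetes.client import models as k8s
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

shared_volume = k8s.V1Volume(
    name="shared-data",
    persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(claim_name="team1-efs-pvc"),
)
shared_mount = k8s.V1VolumeMount(name="shared-data", mount_path="/data")

task_3 = KubernetesPodOperator(
    task_id="task_3",
    name="task_3",
    namespace="team1-namespace",
    image="registry.example.com/team1/transform:latest",
    volumes=[shared_volume],
    volume_mounts=[shared_mount],
)
```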
So far so good ● Remove unnecessary data transfers ● Parallelize the processing ● Provide hot data access
One worker?
Benefits of Spark ● Runs well on Kubernetes ● Supports many distributed storages ● Enables faster data processing ● Supports multiple languages ● Easy to use
SparkExecutionEnvironmentOperator (diagram): the pre-execute phase sets up the Spark-based environment, execute runs the Spark-based image, and post-execute clears the Spark resources
Spark execution environment (diagram): a Spark driver coordinating the Spark workers
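A hedged sketch of what running inside that environment could look like with PySpark's native Kubernetes support, where the driver schedules its workers (executors) as pods in the team's namespace; the API server URL, image and paths are placeholders:

```python
# Sketch: PySpark session using Spark's native Kubernetes scheduler.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.default.svc:443")
    .appName("team1-transform")
    .config("spark.kubernetes.namespace", "team1-namespace")
    .config("spark.kubernetes.container.image", "registry.example.com/team1/spark:latest")
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "team1-worker")
    .config("spark.executor.instances", "4")
    .getOrCreate()
)

# Read an upstream task's output from the shared /data mount and write this task's output.
df = spark.read.parquet("/data/team1/example_workflow/task_1/20240101T000000")
df.groupBy("category").count().write.parquet("/data/team1/example_workflow/task_3/20240101T000000")
```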
Our current state ● Remove unnecessary data transfers ● Parallelize the processing ● Provide hot data access
Hot & cold data (diagram): Task 1, Task 2, Task 3 and Task 4 work with both hot and cold data
Alluxio (diagram): Alluxio serves the hot data to Task 1, Task 2, Task 3 and Task 4, while the cold data stays in the underlying store
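Alluxio exposes an HDFS-compatible URI, so with the Alluxio client on the Spark classpath a job can read hot data through alluxio:// while cold data keeps living in the underlying store such as S3. A sketch reusing the Spark session above; the host name, port and paths are placeholders:

```python
# Sketch: hot data via Alluxio's memory tier, cold data straight from S3.
# Assumes the Alluxio client jar is on the Spark classpath.
hot_df = spark.read.parquet("alluxio://alluxio-master:19998/team1/example_workflow/task_1/20240101T000000")
cold_df = spark.read.parquet("s3a://example-data-lake/team1/example_workflow/archive/2023/")
```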
Thank you! #apacheairflow