Democratized data workflows at scale
Emil Todorov, Mihail Petkov


  1. Democratized data workflows at scale (Emil Todorov, Mihail Petkov)

  2. Our agenda for today ● Why Airflow? ● Architecture ● Security ● Execution environment in Kubernetes

  3. FT is a data-driven organization

  4. Time for a change

  5. Why Airflow?

  6. Scalable ● Extensible ● Dynamic ● Elegant

  7. Architecture

  8. Architecture [diagram: the Airflow Scheduler, Web Server, and PostgreSQL pods running alongside multiple Worker pods in Kubernetes]

  9. [diagram: Business / User / Tech]

  10. Airflow will be used by multiple teams

  11. Airflow requirements [diagram: Team 1, Team 2, … Team N each bring their own requirements]

  12. Teams will share Airflow resources

  13. Airflow shared components [diagram: Team 1, Team 2, … Team N share one Airflow instance, each with its own DAGs and Connections]

  14. Teams will share Kubernetes resources

  15. Kubernetes shared components [diagram: worker pods for Team 1, Team 2, … Team N run in one shared Kubernetes cluster]

  16. How to evolve this architecture?

  17. Airflow instance per team

  18. One instance components

  19. Instance per team problems ● Adding a new team is hard ● Maintaining an environment per team is difficult ● Releasing new features is slow ● Resources are not fully utilised ● Total cost increases

  20. Another way?

  21. Multitenancy

  22. Multiple independent instances in a shared environment

  23. Multi-tenant components

  24. How to make AWS multi-tenant?

  25. IAM Security [diagram: a dedicated IAM user per team: Team 1 IAM user, Team 2 IAM user, … Team N IAM user]

  27. How to enhance Kubernetes?

  28. [diagram: a system namespace runs the Airflow scheduler and web server; Team 1, Team 2, … Team N each get their own namespace with a Service Account, a Resource Quota, and their worker pods]
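
The deck shows this layout only as a diagram. As a hedged sketch, the per-team namespace, service account, and resource quota could be provisioned with the official kubernetes Python client; every name and limit below is an illustrative assumption, not FT's actual configuration:

    from kubernetes import client, config

    config.load_kube_config()  # use load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()

    team = "team-1"  # assumed team identifier

    # A dedicated namespace isolates the team's worker pods.
    v1.create_namespace(
        client.V1Namespace(metadata=client.V1ObjectMeta(name=team))
    )

    # The service account the team's workers run under.
    v1.create_namespaced_service_account(
        namespace=team,
        body=client.V1ServiceAccount(
            metadata=client.V1ObjectMeta(name="{}-worker".format(team))
        ),
    )

    # A resource quota caps what the team can consume in the shared cluster.
    v1.create_namespaced_resource_quota(
        namespace=team,
        body=client.V1ResourceQuota(
            metadata=client.V1ObjectMeta(name="{}-quota".format(team)),
            spec=client.V1ResourceQuotaSpec(
                hard={"requests.cpu": "4", "requests.memory": "8Gi", "pods": "10"}
            ),
        ),
    )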

  29. How to improve PostgreSQL?

  30. CHANGES

  31. How to extend Airflow?

  32. Redesign Airflow source code

  33. Redesign Airflow source code ● Module per team

  34. Redesign Airflow source code ● Module per team ● Connections per team

  35. Redesign Airflow source code ● Module per team ● Connections per team ● Extend hooks, operators and sensors

  36. Redesign Airflow source code ● Module per team ● Connections per team ● Extend hooks, operators and sensors ● Use airflow_local_settings.py
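
The deck does not show the settings file itself. airflow_local_settings.py is Airflow's standard local-settings module, and one hook it can define, pod_mutation_hook, is applied to every worker pod before it is launched. A minimal sketch, assuming a per-team pod label and the "<team>-worker" service-account convention from the multi-tenancy slides:

    # airflow_local_settings.py -- picked up automatically when it is on
    # the PYTHONPATH. Minimal sketch; the "team" label and the
    # "<team>-worker" service account are assumed conventions.

    def pod_mutation_hook(pod):
        # Airflow calls this for every worker pod before submitting it
        # to Kubernetes (a V1Pod object in recent 1.10 releases).
        team = (pod.metadata.labels or {}).get("team")
        if team:
            # Route the pod into the team's namespace and run it under
            # the team's service account.
            pod.metadata.namespace = team
            pod.spec.service_account_name = "{}-worker".format(team)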

  37. Redesign repository structure [diagram: the Airflow system code lives in the Airflow repository; each team keeps its DAGs in its own repository: Team 1 DAG repository, Team 2 DAG repository, … Team N DAG repository]

  38. Execution environment in Kubernetes

  39. ETL [diagram: Extract pulls from DATA SOURCE 1 and DATA SOURCE 2, Transform runs the AGGREGATIONS, Load writes to the DATA DESTINATION]

  40. Extract [same ETL diagram, focusing on the Extract step]

  41. Load [same ETL diagram, focusing on the Load step]

  42. Transform?

  43. Example workflow [diagram: a DAG of four tasks: Task 1, Task 2, Task 3, Task 4]
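
A minimal sketch of such a workflow as Airflow DAG code; the dependency shape (Task 1 and Task 2 fanning in to Task 3, which feeds Task 4) is an assumption, since the diagram did not survive extraction:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    with DAG(
        "example_workflow",
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",
    ) as dag:
        task_1 = DummyOperator(task_id="task_1")
        task_2 = DummyOperator(task_id="task_2")
        task_3 = DummyOperator(task_id="task_3")
        task_4 = DummyOperator(task_id="task_4")

        # Assumed shape: two independent tasks feed a third, then a fourth.
        [task_1, task_2] >> task_3 >> task_4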

  44. Our goals ● Language-agnostic jobs ● Cross-task data access

  45. KubernetesPodOperator
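
A minimal, hedged KubernetesPodOperator example: the operator runs any container image, which is what makes the jobs language-agnostic. The image name, namespace, and command are illustrative assumptions:

    from airflow.contrib.operators.kubernetes_pod_operator import (
        KubernetesPodOperator,
    )

    # Any language works, as long as it is packaged in a container image.
    transform = KubernetesPodOperator(
        task_id="transform",
        name="transform",
        namespace="team-1",
        image="registry.example.com/team-1/transform:latest",
        cmds=["python", "transform.py"],
        get_logs=True,
    )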

  46. Our goals ● Language-agnostic jobs ● Cross-task data access

  47. Unique storage pattern ● Unique team name from the multi-tenant setup ● Unique DAG id ● Unique task id per DAG ● Unique execution date per DAG run → /{team}/{dag_id}/{task_id}/{execution_date}
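
The pattern is straightforward to express in code; a short sketch of composing the unique storage location for one task run:

    # Combining team, DAG id, task id, and execution date guarantees
    # that no two task runs ever share a path.
    def storage_path(team, dag_id, task_id, execution_date):
        return "/{}/{}/{}/{}".format(
            team, dag_id, task_id, execution_date.isoformat()
        )

    # e.g. storage_path("team-1", "example_workflow", "task_3",
    #                   datetime(2019, 1, 1))
    # -> "/team-1/example_workflow/task_3/2019-01-01T00:00:00"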

  48. The power of extensibility

  49. ExecutionEnvironmentOperator [diagram: a plain KubernetesPodOperator only has an EXECUTE step; ExecutionEnvironmentOperator wraps it with PRE EXECUTE and POST EXECUTE steps]
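
A hedged sketch of the idea: subclass KubernetesPodOperator and use Airflow's built-in pre_execute/post_execute hooks, which run around execute(). The method bodies are placeholders, not FT's implementation:

    from airflow.contrib.operators.kubernetes_pod_operator import (
        KubernetesPodOperator,
    )

    class ExecutionEnvironmentOperator(KubernetesPodOperator):

        def pre_execute(self, context):
            # Bootstrap the environment, enrich the configuration, and
            # export it to the execution environment pod.
            ...

        def post_execute(self, context, result=None):
            # Handle the execution, clear all bootstraps, and deal
            # with the output.
            ...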

  50. Configurable cross-task data dependencies

  51. Example input configuration

  52. Example output configuration
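
The actual input and output configurations appear only as images in the deck. Purely as a hypothetical illustration of the idea, a task could declare the upstream outputs it reads and the location it writes, expressed in the unique storage pattern:

    # Hypothetical illustration only -- the real configuration is shown
    # as an image in the deck. A task declares which upstream task
    # outputs it consumes and where its own output goes.
    input_config = {
        "inputs": [
            {"task_id": "task_1", "path": "/{team}/{dag_id}/task_1/{execution_date}"},
            {"task_id": "task_2", "path": "/{team}/{dag_id}/task_2/{execution_date}"},
        ],
    }
    output_config = {
        "output": {"path": "/{team}/{dag_id}/task_3/{execution_date}"},
    }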

  53. Pre-execute ● Bootstrap the environment ● Enrich the configuration ● Export the configuration to the execution environment pod [diagram: the PRE EXECUTE step runs before the pod operator's EXECUTE]

  54. Post-execute ● Handle the execution ● Clear all bootstraps ● Deal with the output [diagram: the POST EXECUTE step runs after the pod operator's EXECUTE]

  55. POC with AWS S3 as intermediate storage [diagram: Task 1, Task 2, Task 3, and Task 4 exchange data through S3]

  56. Is this efficient? ● Multiple downloads and uploads ● Processing limited to a single worker ● The data is always loaded in memory

  57. How to evolve the execution environment? ● Remove unnecessary data transfers ● Parallelize the processing ● Provide hot data access

  58. Shared file system

  59. Kubernetes persistent volume [diagram: Task 1, Task 2, Task 3, and Task 4 mount the same persistent volume]

  60. Kubernetes persistent volume with EFS [diagram: the shared persistent volume is backed by AWS EFS]
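
A hedged sketch of how a task pod could mount the shared EFS-backed volume, using Airflow 1.10's contrib Kubernetes classes; the claim name, mount path, and image are assumptions:

    from airflow.contrib.kubernetes.volume import Volume
    from airflow.contrib.kubernetes.volume_mount import VolumeMount
    from airflow.contrib.operators.kubernetes_pod_operator import (
        KubernetesPodOperator,
    )

    # Shared volume backed by an EFS persistent volume claim (assumed name).
    shared_volume = Volume(
        name="shared-data",
        configs={"persistentVolumeClaim": {"claimName": "team-1-efs"}},
    )
    shared_mount = VolumeMount(
        "shared-data", mount_path="/data", sub_path=None, read_only=False
    )

    # Every task mounts the same path, so intermediate data never has to
    # round-trip through S3.
    task_3 = KubernetesPodOperator(
        task_id="task_3",
        name="task-3",
        namespace="team-1",
        image="registry.example.com/team-1/transform:latest",
        volumes=[shared_volume],
        volume_mounts=[shared_mount],
    )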

  61. So far so good ● Remove unnecessary data transfers ● Parallelize the processing ● Provide hot data access

  62. One worker?

  63. Benefits of Spark ● Runs perfectly in Kubernetes ● Supports many distributed storages ● Allows faster data processing ● Supports multiple languages ● Easy to use

  64. SparkExecutionEnvironmentOperator [diagram: PRE EXECUTE sets up the Spark-based environment, EXECUTE runs a Spark-based image, POST EXECUTE clears the Spark resources]
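
A hedged sketch, building on the ExecutionEnvironmentOperator sketch above: pre-execute provisions the Spark resources, execute runs a Spark-based image, and post-execute clears them. The method bodies are placeholders, not the actual implementation:

    class SparkExecutionEnvironmentOperator(ExecutionEnvironmentOperator):

        def pre_execute(self, context):
            super(SparkExecutionEnvironmentOperator, self).pre_execute(context)
            # Set up the Spark-based environment, e.g. the driver and
            # worker pods the task will use.
            ...

        def post_execute(self, context, result=None):
            # Clear the Spark resources before the usual cleanup.
            ...
            super(SparkExecutionEnvironmentOperator, self).post_execute(
                context, result
            )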

  65. Spark execution environment [diagram: a Spark driver pod coordinating Spark worker pods]

  66. Our current state ● Remove unnecessary data transfers ● Parallelize the processing ● Provide hot data access

  67. Hot & cold data [diagram: Task 1, Task 2, Task 3, and Task 4 work with both HOT DATA and COLD DATA]

  68. Alluxio [diagram: Alluxio sits between the tasks and the cold data store, serving the hot data]

  69. Thank you! #apacheairflow
