Building (Better) Data Pipelines using Apache Airflow


  1. Building (Better) Data Pipelines using Apache Airflow. Sid Anand (@r39132), QCon.AI 2018

  2. About Me: Work[ed|s] @ …, Co-Chair for …, Maintainer of …, Spare time: …

  3. Apache Airflow: What is it?

  4. Apache Airflow: What is it? In a nutshell: Airflow is a platform to programmatically author, schedule, and monitor workflows (a.k.a. DAGs, or Directed Acyclic Graphs).

  5. Apache Airflow: UI Walk-Through

  6. Apache Airflow: UI Walk-Through

  7. Airflow - Authoring DAGs: Visualizing a DAG

  8. Airflow - Authoring DAGs: Author DAGs in Python! No need to bundle many XML files!
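
To make the "DAGs in Python" point concrete, here is a minimal sketch of what a DAG definition looks like. This example is not from the talk: the DAG id, task names, and schedule are illustrative, and the import paths are the Airflow 1.x-era ones (they differ slightly in Airflow 2.x).

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    # Defaults applied to every task in the DAG
    default_args = {
        "owner": "data-eng",
        "retries": 1,
        "retry_delay": timedelta(minutes=5),
    }

    # The DAG itself: a named graph of tasks with a schedule
    dag = DAG(
        dag_id="example_pipeline",        # hypothetical name
        default_args=default_args,
        start_date=datetime(2018, 1, 1),
        schedule_interval="@daily",       # cron expressions also work
    )

    extract = BashOperator(task_id="extract", bash_command="echo extracting", dag=dag)
    transform = BashOperator(task_id="transform", bash_command="echo transforming", dag=dag)
    load = BashOperator(task_id="load", bash_command="echo loading", dag=dag)

    # Declare the dependency graph: extract -> transform -> load
    extract >> transform >> load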

  9. Airflow - Authoring DAGs: The Tree View offers a view of DAG Runs over time!

  10. Airflow - Performance Insights: Gantt charts reveal the slowest tasks for a run!

  11. Airflow - Performance Insights: …and we can easily see performance trends over time.

  12. Apache Airflow: Why use it?

  13. Apache Airflow: Why use it? When would you use a Workflow Scheduler like Airflow?
      • ETL Pipelines
      • Machine Learning Pipelines
      • Predictive Data Pipelines: Fraud Detection, Scoring/Ranking, Classification, Recommender Systems, etc.
      • General Job Scheduling (e.g. Cron): DB back-ups, scheduled code/config deployment

  14. Apache Airflow: Why use it? What should a Workflow Scheduler do well?
      • Schedule a graph of dependencies, where a Workflow = a DAG of Tasks
      • Handle task failures
      • Report / alert on failures
      • Monitor performance of tasks over time
      • Enforce SLAs, e.g. alert if time or correctness SLAs are not met (see the sketch below)
      • Easily scale for growing load
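
As a rough illustration of how several of these responsibilities map onto Airflow task settings, here is a minimal sketch (Airflow 1.x-era API; the email address, retry counts, and SLA value are hypothetical):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    default_args = {
        "owner": "data-eng",
        # Handle task failures: retry a few times before giving up
        "retries": 3,
        "retry_delay": timedelta(minutes=10),
        # Report / alert on failures
        "email": ["oncall@example.com"],
        "email_on_failure": True,
        "email_on_retry": False,
        # Enforce time SLAs: Airflow records SLA misses and can alert on them
        "sla": timedelta(hours=2),
    }

    with DAG(
        dag_id="sla_and_alerting_example",
        default_args=default_args,
        start_date=datetime(2018, 1, 1),
        schedule_interval="@hourly",
    ) as dag:
        score = BashOperator(task_id="score_messages", bash_command="echo scoring")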

  15. Apache Airflow: Why use it? What does Apache Airflow add?
      • Configuration-as-code
      • Usability: a stunning UI / UX
      • Centralized configuration
      • Resource pooling (sketched below, together with centralized configuration)
      • Extensibility
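
Two of these can be sketched briefly: Airflow Variables hold shared configuration in the metadata DB, and assigning a task to a named pool caps how many such tasks run at once. The variable name, pool name, and DAG below are hypothetical, and the pool itself would be created via the Airflow UI or CLI.

    from datetime import datetime

    from airflow import DAG
    from airflow.models import Variable
    from airflow.operators.bash_operator import BashOperator

    # Centralized configuration: a value managed centrally in the metadata DB
    target_bucket = Variable.get("scored_messages_bucket", default_var="s3://example-bucket")

    with DAG(
        dag_id="pooling_example",
        start_date=datetime(2018, 1, 1),
        schedule_interval="@daily",
    ) as dag:
        copy_results = BashOperator(
            task_id="copy_results",
            bash_command="echo copying to {}".format(target_bucket),
            pool="s3_io",   # resource pooling: only N "s3_io" slots run concurrently
        )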

  16. Use-Case: Message Scoring (Batch Pipeline Architecture)

  17. Use-Case: Message Scoring. Enterprises A, B, and C upload to S3 every 15 minutes.

  18. Use-Case: Message Scoring. Airflow kicks off a Spark message-scoring job every hour.
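
The previous two slides (15-minute S3 uploads, hourly Spark scoring) might translate into Airflow roughly as follows. This is a hedged sketch, not the talk's code: bucket names, paths, and the connection id are hypothetical, spark-submit is invoked through a plain BashOperator, and the S3 sensor import path shown is the Airflow 1.10-era one.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.sensors.s3_key_sensor import S3KeySensor

    with DAG(
        dag_id="message_scoring",
        start_date=datetime(2018, 1, 1),
        schedule_interval="@hourly",   # Airflow kicks off scoring every hour
        catchup=False,
    ) as dag:
        # Wait until the enterprises' 15-minute uploads have landed in S3
        wait_for_uploads = S3KeySensor(
            task_id="wait_for_enterprise_uploads",
            bucket_name="incoming-messages",     # hypothetical bucket
            bucket_key="enterprise-*/*.avro",
            wildcard_match=True,
            aws_conn_id="aws_default",
            poke_interval=60,                    # re-check every minute
            timeout=60 * 60,                     # give up after an hour
        )

        # Submit the Spark message-scoring job
        score_messages = BashOperator(
            task_id="score_messages",
            bash_command=(
                "spark-submit --master yarn --deploy-mode cluster "
                "s3://example-code/score_messages.py "
                "--input s3://incoming-messages/ --output s3://scored-messages/"
            ),
        )

        wait_for_uploads >> score_messages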

  19. Use-Case: Message Scoring. The Spark job writes scored messages and stats to another S3 bucket.

  20. Use-Case: Message Scoring. This triggers SNS messages / SQS events.

  21. Use-Case: Message Scoring. An Autoscale Group (ASG) of Importers spins up when it detects SQS messages.

  22. Use-Case: Message Scoring. The Importers rapidly ingest scored messages and aggregate statistics into the DB.
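
The Importers sit outside Airflow, but their consume-from-SQS loop can be sketched. This is a minimal, hypothetical illustration using boto3; the queue URL, message format, and save_to_db helper are all made up.

    import json

    import boto3

    sqs = boto3.client("sqs", region_name="us-east-1")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/scored-messages"  # hypothetical

    def save_to_db(record):
        # hypothetical helper: upsert scored messages / aggregated stats into the DB
        pass

    def run_importer():
        while True:
            # Long-poll SQS for notifications about newly scored S3 objects
            resp = sqs.receive_message(
                QueueUrl=QUEUE_URL,
                MaxNumberOfMessages=10,
                WaitTimeSeconds=20,
            )
            for msg in resp.get("Messages", []):
                record = json.loads(msg["Body"])
                save_to_db(record)
                # Delete only after a successful import so failures are retried
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])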

  23. Use-Case: Message Scoring. Users receive alerts of untrusted emails and can review them in the web app.

  24. Use-Case: Message Scoring. Airflow manages the entire process.

  25. Airflow DAG

  26. Apache Airflow: Incubating

  27. Apache Airflow: Incubating. Timeline:
      • Airflow was created @ Airbnb in 2015 by Maxime Beauchemin
      • Max launched it @ Hadoop Summit in Summer 2015
      • On 3/31/2016, Airflow entered the Apache Incubator
      Today:
      • 2400+ forks
      • 7600+ GitHub stars
      • 430+ contributors
      • 150+ companies officially using it!
      • 14 Committers/Maintainers, and we're growing here

  28. Thank You!

  29. Apache Airflow: Behind the Scenes

  30. Apache Airflow: Behind the Scenes. Airflow is a platform to programmatically author, schedule, and monitor workflows (a.k.a. DAGs). It ships with:
      • a DAG Scheduler
      • a Web application (UI)
      • a powerful CLI
      • Celery Workers!

  31. Apache Airflow: Behind the Scenes (Webserver, Scheduler, Meta DB, Celery / RabbitMQ, Workers):
      1. A user schedules / manages DAGs using the Airflow UI.
      2. Airflow's webserver stores scheduling metadata in the metadata DB.
      3. The scheduler picks up new schedules and distributes work over Celery / RabbitMQ.
      4. Airflow workers pick up Airflow tasks over Celery.

  34. Apache Airflow: Behind the Scenes (Webserver, Scheduler, Meta DB, Celery / RabbitMQ, Workers):
      1. A user schedules / manages DAGs using the Airflow UI.
      2. Airflow's webserver stores scheduling metadata in the metadata DB.
      3. The scheduler picks up new schedules and distributes work over Celery / RabbitMQ.
      4. Airflow workers pick up Airflow tasks from RabbitMQ.

  35. Thank You!
