Building (Better) Data Pipelines using Apache Airflow
Sid Anand (@r39132), QCon.AI 2018
About Me
(Slide content was logos: work history, conference co-chair role, project maintainer role, spare-time interests.)
Apache Airflow: What is it?
Apache Airflow: What is it? In a nutshell: Airflow is a platform to programmatically author, schedule, and monitor workflows (a.k.a. DAGs, or Directed Acyclic Graphs).
Apache Airflow: UI Walk-Through
Apache Airflow: UI Walk-through (screenshot)
Airflow - Authoring DAGs: Visualizing a DAG
Airflow - Authoring DAGs: Author DAGs in Python! No need to bundle many XML files!
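As a minimal sketch of what "DAGs in Python" looks like (this requires Airflow installed; it uses the Airflow 1.x API that was current at the time of this talk, and the DAG id, owner, task names, and bash commands are illustrative assumptions, not from the slides):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-eng",                  # hypothetical owner
    "start_date": datetime(2018, 1, 1),
    "retries": 2,                         # re-run a failed task up to twice
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="example_pipeline",            # hypothetical DAG id
    default_args=default_args,
    schedule_interval="@hourly",          # cron-style scheduling
)

extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
score = BashOperator(task_id="score", bash_command="echo score", dag=dag)
load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

# Dependencies are plain Python: extract -> score -> load
extract >> score >> load
```

Because the DAG file is ordinary Python, dependencies, retries, and schedules live in version-controlled code rather than XML bundles.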
Airflow - Authoring DAGs: The Tree View offers a view of DAG Runs over time!
Airflow - Performance Insights: Gantt charts reveal the slowest tasks for a run!
Airflow - Performance Insights: …and we can easily see performance trends over time.
Apache Airflow: Why use it?
Apache Airflow: Why use it? When would you use a Workflow Scheduler like Airflow?
• ETL Pipelines
• Machine Learning Pipelines
• Predictive Data Pipelines: Fraud Detection, Scoring/Ranking, Classification, Recommender Systems, etc.
• General Job Scheduling (e.g. Cron): DB back-ups, scheduled code/config deployment
Apache Airflow: Why use it? What should a Workflow Scheduler do well?
• Schedule a graph of dependencies, where Workflow = a DAG of Tasks
• Handle task failures
• Report / alert on failures
• Monitor performance of tasks over time
• Enforce SLAs, e.g. alerting if time or correctness SLAs are not met
• Easily scale for growing load
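The first two duties above can be sketched in plain Python, with no Airflow dependency: run a DAG of tasks in dependency order, and retry failures. This is a toy illustration, not Airflow's actual scheduler, and all names in it are made up:

```python
from collections import deque

def topo_order(deps):
    """deps: {task: set of upstream tasks} -> tasks in runnable order."""
    deps = {t: set(u) for t, u in deps.items()}
    ready = deque(t for t, u in deps.items() if not u)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for other, upstream in deps.items():
            if t in upstream:
                upstream.remove(t)
                if not upstream:
                    ready.append(other)
    if len(order) != len(deps):
        raise ValueError("cycle detected: not a DAG")
    return order

def run(deps, task_fns, retries=2):
    """Execute tasks in dependency order, retrying each up to `retries` times."""
    results = {}
    for t in topo_order(deps):
        for attempt in range(retries + 1):
            try:
                results[t] = task_fns[t]()
                break
            except Exception:
                if attempt == retries:
                    raise  # a real scheduler would alert / report here
    return results

deps = {"extract": set(), "score": {"extract"}, "load": {"score"}}
fns = {t: (lambda t=t: t + " done") for t in deps}
print(run(deps, fns)["load"])  # prints: load done
```

A real scheduler adds persistence, alerting, SLA checks, and distributed workers on top of exactly this core loop.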
Apache Airflow: Why use it? What does Apache Airflow add?
• Configuration-as-code
• Usability: a stunning UI / UX
• Centralized configuration
• Resource Pooling
• Extensibility
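On the extensibility point: an Airflow operator is just a Python class, so custom task types are subclasses. A hedged sketch using Airflow 1.x import paths (requires Airflow installed; the operator name and behavior are illustrative, not from the talk):

```python
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

class GreetOperator(BaseOperator):        # hypothetical custom operator
    @apply_defaults
    def __init__(self, name, *args, **kwargs):
        super(GreetOperator, self).__init__(*args, **kwargs)
        self.name = name

    def execute(self, context):
        # Called by a worker when the task instance actually runs.
        self.log.info("Hello, %s", self.name)
        return self.name
```

Resource pooling is similarly a task-level knob: operators accept a `pool` argument so that, for example, only N tasks at a time hit a shared database.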
Use-Case: Message Scoring (Batch Pipeline Architecture)
Use-Case: Message Scoring. Enterprises A, B, and C upload to S3 every 15 minutes.
Use-Case: Message Scoring. Airflow kicks off a Spark message-scoring job every hour.
Use-Case: Message Scoring. The Spark job writes scored messages and stats to another S3 bucket.
Use-Case: Message Scoring. These S3 writes trigger SNS messages / SQS events.
Use-Case: Message Scoring. An Autoscale Group (ASG) of Importers spins up when it detects SQS messages.
Use-Case: Message Scoring. The Importers rapidly ingest scored messages and aggregate statistics into the DB.
Use-Case: Message Scoring. Users receive alerts of untrusted emails and can review them in the web app.
Use-Case: Message Scoring. Airflow manages the entire process.
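The Airflow-managed portion of this pipeline could be sketched roughly as below (requires Airflow installed; uses Airflow 1.x-era operators; the bucket key, application path, and task ids are illustrative assumptions, and the slides do not show the actual DAG code):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.sensors import S3KeySensor
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

dag = DAG(
    dag_id="message_scoring",             # hypothetical DAG id
    start_date=datetime(2018, 1, 1),
    schedule_interval="@hourly",          # "Airflow kicks off ... every hour"
)

# Wait for the enterprises' 15-minute uploads to land in S3.
wait_for_uploads = S3KeySensor(
    task_id="wait_for_uploads",
    bucket_key="s3://inbound-bucket/enterprise-*",   # hypothetical bucket/key
    wildcard_match=True,
    dag=dag,
)

# Score the messages; the Spark job itself writes results and stats to the
# second S3 bucket, which fires the SNS/SQS -> ASG importer flow downstream.
score_messages = SparkSubmitOperator(
    task_id="score_messages",
    application="/jobs/score_messages.py",           # hypothetical path
    dag=dag,
)

wait_for_uploads >> score_messages
```

Everything after the second S3 write (SNS, SQS, the importer ASG, the DB) is event-driven AWS plumbing rather than Airflow tasks, which is why the DAG itself stays small.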
Airflow DAG (screenshot)
Apache Airflow: Incubating
Apache Airflow: Incubating
Timeline:
• Airflow was created @ Airbnb in 2015 by Maxime Beauchemin
• Max launched it @ Hadoop Summit in Summer 2015
• On 3/31/2016, Airflow entered the Apache Incubator
Today:
• 2400+ forks
• 7600+ GitHub stars
• 430+ contributors
• 150+ companies officially using it!
• 14 Committers / Maintainers (and growing!)
Thank You!
Apache Airflow: Behind the Scenes
Apache Airflow: Behind the Scenes. Airflow is a platform to programmatically author, schedule, and monitor workflows (a.k.a. DAGs). It ships with a:
• DAG Scheduler
• Web application (UI)
• Powerful CLI
• Celery Workers!
Apache Airflow: Behind the Scenes. Components: Webserver, Scheduler, Meta DB, Celery / RabbitMQ, and Workers.
1. A user schedules / manages DAGs using the Airflow UI.
2. Airflow's webserver stores scheduling metadata in the metadata DB.
3. The scheduler picks up new schedules and distributes work over Celery / RabbitMQ.
4. Airflow workers pick up Airflow tasks over Celery.
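The components above are wired together through Airflow's central config file. A hedged airflow.cfg fragment (hostnames and credentials are placeholders, and exact key names vary slightly across Airflow versions):

```ini
[core]
executor = CeleryExecutor
# Metadata DB shared by the webserver, scheduler, and workers
sql_alchemy_conn = postgresql+psycopg2://airflow:***@metadb/airflow

[celery]
# RabbitMQ as the Celery broker; workers pull task messages from here
broker_url = amqp://guest:guest@rabbitmq:5672//
# Task results are written back to the metadata DB
result_backend = db+postgresql://airflow:***@metadb/airflow
```

Centralized configuration (one of the "what Airflow adds" points earlier) means every worker and the scheduler read this same file rather than each carrying its own settings.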
Thank You!