Building (Better) Data Pipelines using Apache Airflow
Sid Anand (@r39132), QCon.AI 2018
About Me: Work [ed | s] @ Maintainer of Spare time Co-Chair for
Apache Airflow: What is it?
Airflow is a platform to programmatically author, schedule, and monitor workflows (a.k.a. DAGs, or Directed Acyclic Graphs).
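To make "a workflow is a DAG" concrete, here is a minimal sketch using only the Python standard library (`graphlib`, Python 3.9+). It is not Airflow code and not how Airflow is implemented; the task names and the `run_order` helper are hypothetical and exist only for illustration. It shows that a DAG of tasks always has at least one valid execution order, and that a cycle is rejected.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "extract": set(),
    "score": {"extract"},
    "stats": {"extract"},
    "load": {"score", "stats"},
}


def run_order(dag):
    """Return one valid execution order for the DAG.

    Raises graphlib.CycleError if the graph contains a cycle,
    i.e. if it is not actually a DAG.
    """
    return list(TopologicalSorter(dag).static_order())


order = run_order(workflow)
print(order)  # "extract" comes first and "load" comes last
```

The relative order of `score` and `stats` is not fixed, because neither depends on the other; a scheduler is free to run them in parallel.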
Apache Airflow: UI Walk-Through
Airflow - Authoring DAGs
• Visualizing a DAG
• Author DAGs in Python! No need to bundle many XML files!
• The Tree View offers a view of DAG Runs over time!
Airflow - Performance Insights
• Gantt charts reveal the slowest tasks for a run!
• …And we can easily see performance trends over time
Apache Airflow: Why use it?
When would you use a Workflow Scheduler like Airflow?
• ETL Pipelines
• Machine Learning Pipelines
• Predictive Data Pipelines
  • Fraud Detection, Scoring/Ranking, Classification, Recommender System, etc…
• General Job Scheduling (e.g. Cron)
  • DB Back-ups, Scheduled code/config deployment
What should a Workflow Scheduler do well?
• Schedule a graph of dependencies
  • where Workflow = A DAG of Tasks
• Handle task failures
• Report / Alert on failures
• Monitor performance of tasks over time
• Enforce SLAs
  • E.g. Alerting if time or correctness SLAs are not met
• Easily scale for growing load
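The failure-handling and SLA points in the list above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Airflow's implementation: `run_with_retries`, its parameters, and the `flaky` task are hypothetical names. The sketch retries a failing task, alerts when the final attempt fails, and alerts when the run exceeds a time SLA.

```python
import time


def run_with_retries(task, retries=2, sla_seconds=1.0, alert=print):
    """Run task(), retrying on exceptions (hypothetical helper).

    Alerts and re-raises if the last attempt fails; alerts (but still
    returns the result) if the total run time exceeds sla_seconds.
    """
    start = time.monotonic()
    attempts = retries + 1
    for attempt in range(1, attempts + 1):
        try:
            result = task()
            break
        except Exception as exc:
            if attempt == attempts:
                alert(f"task failed after {attempts} attempts: {exc}")
                raise
    elapsed = time.monotonic() - start
    if elapsed > sla_seconds:
        alert(f"SLA missed: took {elapsed:.2f}s (limit {sla_seconds}s)")
    return result


# Usage: a task that fails once with a transient error, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("transient error")
    return "ok"


alerts = []
outcome = run_with_retries(flaky, retries=2, alert=alerts.append)
print(outcome, calls["n"], alerts)
```

Passing `alert` as a callable keeps the sketch testable; in a real deployment it would be whatever notification channel the team uses.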
What Does Apache Airflow Add?
• Configuration-as-code
• Usability - Stunning UI / UX
• Centralized configuration
• Resource Pooling
• Extensibility
Use-Case: Message Scoring (Batch Pipeline Architecture)
Diagram components: enterprise A, enterprise B, enterprise C, S3 (two buckets), SNS, SQS, Importers ASG, DB
1. S3 uploads every 15 minutes
2. Airflow kicks off a Spark message scoring job every hour
3. Spark job writes scored messages and stats to another S3 bucket
4. This triggers SNS/SQS message events
5. An Autoscale Group (ASG) of Importers spins up when it detects SQS messages
6. The importers rapidly ingest scored messages and aggregate statistics into the DB
7. Users receive alerts of untrusted emails & can review them in the web app
Airflow manages the entire process
Airflow DAG
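The slides give two cadences for this use case: uploads every 15 minutes and a scoring job every hour. They do not say how a run selects its input, so the sketch below assumes that each hourly run covers the four 15-minute upload windows that fall inside its hour. The `upload_windows` helper and the example date are hypothetical.

```python
from datetime import datetime, timedelta

UPLOAD_INTERVAL = timedelta(minutes=15)  # from the slides: S3 uploads every 15 minutes
SCORING_INTERVAL = timedelta(hours=1)    # from the slides: scoring job every hour


def upload_windows(run_start):
    """Return the start times of the upload windows inside one hourly run.

    Assumption (not stated in the slides): a run starting at run_start
    processes every upload window that begins within the following hour.
    """
    count = SCORING_INTERVAL // UPLOAD_INTERVAL  # 4 windows per run
    return [run_start + i * UPLOAD_INTERVAL for i in range(count)]


# Hypothetical run start time, chosen only for illustration.
windows = upload_windows(datetime(2018, 4, 10, 9, 0))
for window in windows:
    print(window.isoformat())
```

Deriving the windows from the run's start time, rather than from the wall clock, means a re-run of the same hour would pick up the same input.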
Apache Airflow: Incubating
Timeline
• Airflow was created @ Airbnb in 2015 by Maxime Beauchemin
• Max launched it @ Hadoop Summit in Summer 2015
• On 3/31/2016, Airflow joined the Apache Incubator
Today
• 2400+ Forks
• 7600+ GitHub Stars
• 430+ Contributors
• 150+ companies officially using it!
• 14 Committers/Maintainers (we're growing here)
Thank You!
Apache Airflow: Behind the Scenes
Airflow is a platform to programmatically author, schedule, and monitor workflows (a.k.a. DAGs). It ships with a
• DAG Scheduler
• Web application (UI)
• Powerful CLI
• Celery Workers!
Apache Airflow: Behind the Scenes
Diagram components: Webserver, Scheduler, three Workers, Meta DB, Celery / RabbitMQ
1. A user schedules / manages DAGs using the Airflow UI!
2. Airflow's webserver stores scheduling metadata in the metadata DB
3. The scheduler picks up new schedules and distributes work over Celery / RabbitMQ
4. Airflow workers pick up Airflow tasks over Celery, from RabbitMQ
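Steps 3 and 4 describe a producer/consumer pattern: the scheduler puts work on a queue and workers pull from it. The sketch below imitates that shape with the standard library only. It is an analogy under stated assumptions, not Airflow or Celery code: `queue.Queue` stands in for the Celery / RabbitMQ broker, a list stands in for task state kept in the metadata DB, and the `worker` and `schedule` functions and the task names are hypothetical.

```python
import queue
import threading


def worker(name, broker, completed, lock):
    """Pull task ids off the broker until a None sentinel arrives."""
    while True:
        task_id = broker.get()
        if task_id is None:
            return
        # A real worker would execute the task here; the sketch only
        # records which worker picked it up.
        with lock:
            completed.append((task_id, name))


def schedule(task_ids, num_workers=3):
    """Distribute task ids to workers through an in-memory queue."""
    broker = queue.Queue()   # stand-in for Celery / RabbitMQ
    completed = []           # stand-in for state in the metadata DB
    lock = threading.Lock()
    workers = [
        threading.Thread(target=worker, args=(f"worker-{i}", broker, completed, lock))
        for i in range(num_workers)
    ]
    for thread in workers:
        thread.start()
    for task_id in task_ids:
        broker.put(task_id)
    for _ in workers:
        broker.put(None)     # one shutdown sentinel per worker
    for thread in workers:
        thread.join()
    return completed


finished = schedule(["score_messages", "write_stats", "notify_importers"])
print(sorted(task_id for task_id, _ in finished))
```

Which worker handles which task varies from run to run; the scheduler does not assign tasks to specific workers, it only enqueues them.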
