What is Airflow?

Airflow is a platform to programmatically author, schedule and monitor your workflows and pipelines.

What are the benefits of using Airflow?

Programmatically author workflow

In Airflow, you define your workflows programmatically in Python scripts, which lets you take advantage of all the convenience and syntactic sugar that Python provides.

This is a huge improvement if you have experience with Oozie or other GUI-based (or even GUI-less) scheduling tools. A code-based solution is generally easier to modify, deploy and maintain than a GUI-based one.
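As a rough illustration of what a code-based definition makes possible, the sketch below builds the same extract/load pair of tasks for several tables with an ordinary loop. It is plain Python rather than Airflow's API; the table names and the `build_workflow` helper are hypothetical and exist only for this example.

```python
def build_workflow(tables):
    """Return a mapping of task name -> list of upstream task names."""
    workflow = {}
    for table in tables:
        extract = f"extract_{table}"
        load = f"load_{table}"
        workflow[extract] = []        # nothing has to run before an extract
        workflow[load] = [extract]    # each load waits for its own extract
    # One final task that waits for every load to finish.
    workflow["report"] = [f"load_{table}" for table in tables]
    return workflow


# Hypothetical table names; adding a table is a one-word change.
workflow = build_workflow(["users", "orders"])
for task, upstream in workflow.items():
    print(task, "<-", upstream)
```

Supporting a third table means adding one string to the list, whereas a GUI-defined workflow would need each new task drawn and wired by hand.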

Schedule workflow

If you are familiar with cron and crontab in Linux, you will find it easy to schedule your workflows using crontab syntax.
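For readers less familiar with crontab syntax, here is a minimal sketch of how a five-field expression (minute, hour, day of month, month, day of week) maps to points in time. It is a simplified matcher written for illustration, not Airflow's or cron's implementation: it only understands `*`, `*/n`, single numbers and comma-separated lists, it treats `*/n` as "divisible by n" (which matches real cron only for the minute and hour fields), and it requires all five fields to match. The function names are hypothetical.

```python
from datetime import datetime


def field_matches(field, value):
    """Check one cron field: '*', '*/n', a number, or a comma-separated list."""
    for part in field.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            if value % int(part[2:]) == 0:
                return True
        elif int(part) == value:
            return True
    return False


def cron_matches(expression, moment):
    """Return True if `moment` matches a five-field cron expression."""
    minute, hour, day, month, weekday = expression.split()
    # cron counts Sunday as 0, while datetime.weekday() counts Monday as 0.
    cron_weekday = (moment.weekday() + 1) % 7
    return (
        field_matches(minute, moment.minute)
        and field_matches(hour, moment.hour)
        and field_matches(day, moment.day)
        and field_matches(month, moment.month)
        and field_matches(weekday, cron_weekday)
    )


# "0 6 * * 1" reads as: at 06:00 every Monday.
print(cron_matches("0 6 * * 1", datetime(2024, 1, 1, 6, 0)))
# "*/15 * * * *" reads as: every 15 minutes.
print(cron_matches("*/15 * * * *", datetime(2024, 1, 1, 6, 30)))
```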

Monitor workflow

Airflow provides a web dashboard that presents the status of all current and historical workflows and helps you monitor and tune workflow performance, and more.

Beyond the features above, you also get some extra bonuses, e.g.

  • distributing workflow tasks across worker nodes
  • performance analysis diagrams

Terminology

DAG

DAG is short for Directed Acyclic Graph. You can think of a DAG as a collection of the tasks you want to run, organised in a way that reflects their relationships and dependencies.
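To make "directed" and "acyclic" concrete, the sketch below models a small workflow as a dependency mapping and uses the standard library's `graphlib` module (Python 3.9+) to derive a valid run order and to reject a graph that contains a cycle. This is a conceptual model, not Airflow's DAG class; the task names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task; its value is the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}

# A topological order runs every task after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)

# Making "extract" depend on "notify" closes a loop,
# so the graph is no longer a valid DAG.
cyclic = dict(dependencies, extract={"notify"})
try:
    list(TopologicalSorter(cyclic).static_order())
    is_valid_dag = True
except CycleError:
    is_valid_dag = False
print(is_valid_dag)
```

The acyclic requirement is what guarantees that such an order exists: with a cycle, no task in the loop could ever be the first to run.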

Operator

An Operator is a class that acts as a template for carrying out some work. It is an abstraction of a specific type of work, e.g. the PythonOperator executes a Python callable.

Task

A task is a parameterized instance of an Operator.
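The relationship between an Operator and a task can be sketched with two toy classes. These are not Airflow's classes: `Operator` and `CallableOperator` are hypothetical stand-ins, and their parameter names are chosen only for this illustration. The point is that one operator class, instantiated with different parameters, yields different tasks.

```python
class Operator:
    """Toy base class: a template for one kind of work."""

    def __init__(self, task_id):
        self.task_id = task_id

    def execute(self):
        raise NotImplementedError


class CallableOperator(Operator):
    """Toy operator that runs a Python callable with fixed arguments."""

    def __init__(self, task_id, python_callable, op_args=()):
        super().__init__(task_id)
        self.python_callable = python_callable
        self.op_args = op_args

    def execute(self):
        return self.python_callable(*self.op_args)


def greet(name):
    return f"Hello, {name}!"


# Two tasks: the same operator class, instantiated with different parameters.
greet_alice = CallableOperator("greet_alice", greet, op_args=("Alice",))
greet_bob = CallableOperator("greet_bob", greet, op_args=("Bob",))

print(greet_alice.execute())
print(greet_bob.execute())
```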

Scheduler

A scheduler monitors all DAGs and tasks, and triggers the task instances when their dependencies have been met.
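The core idea of triggering a task only once its dependencies are met can be sketched as a small loop. This is a deliberately simplified model, not Airflow's scheduler: it ignores schedules, retries, failures and parallel execution, and the `run_workflow` function and task names are hypothetical.

```python
def run_workflow(upstream, actions):
    """Run every task once, only after all of its upstream tasks succeeded."""
    state = {task: "pending" for task in upstream}
    executed = []
    while any(status == "pending" for status in state.values()):
        # A task is ready when it is pending and all its upstream tasks succeeded.
        ready = [
            task
            for task in upstream
            if state[task] == "pending"
            and all(state[dep] == "success" for dep in upstream[task])
        ]
        if not ready:
            raise RuntimeError("dependency cycle detected")
        for task in ready:
            actions[task]()
            state[task] = "success"
            executed.append(task)
    return executed


# Task name -> list of upstream task names.
upstream = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
}
# Each action just prints its own name; real tasks would do actual work.
actions = {task: (lambda name=task: print("running", name)) for task in upstream}
executed = run_workflow(upstream, actions)
```

Each pass of the loop corresponds to the scheduler re-examining the task states and picking up whatever has become runnable since the last pass.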

Reference