What is Airflow?
Airflow is a platform to programmatically author, schedule and monitor your workflows and pipelines.
What are the benefits of using Airflow?
Programmatically author workflows
In Airflow, you define your workflows programmatically with Python
scripts, which puts you in a very good position to leverage all the convenience and power that Python provides.
This is a huge improvement if you have experience with Oozie
or other GUI-based (or even GUI-less) scheduling tools. A code-based solution is always easier to modify, deploy, and maintain than GUI-driven configuration.
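To make the "workflow as code" idea concrete, here is a toy sketch in plain Python. The task names and the dependency mapping are invented for illustration and are not Airflow's API; the point is that adding a task or a dependency is a one-line, code-reviewable change.

```python
# Toy sketch of a workflow defined as code (names are illustrative,
# not Airflow's API). Each task is an ordinary Python function.

def extract():
    """Pull some rows from a source."""
    return [1, 2, 3]

def transform(rows):
    """Apply a transformation to the extracted rows."""
    return [r * 10 for r in rows]

def load(rows):
    """Deliver the transformed rows somewhere."""
    return f"loaded {len(rows)} rows"

# Dependencies declared as data: task -> upstream tasks it waits on.
dependencies = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
}

print(load(transform(extract())))  # loaded 3 rows
```

Because the whole pipeline is ordinary code, it can live in version control, be diffed in code review, and be deployed like any other Python project.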
Schedule workflows
If you are familiar with the concepts of cron
and crontab
in Linux, you will find it quite easy to use crontab
syntax to schedule your workflows.
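As a refresher on that syntax, the five crontab fields are minute, hour, day of month, month, and day of week. The small matcher below is a simplified sketch (it handles only `*`, plain numbers, lists, and `*/n` steps, and it reuses Python's Monday-based `weekday()` rather than cron's Sunday-based numbering), not Airflow's scheduler logic:

```python
from datetime import datetime

def cron_field_matches(field, value):
    """Match one crontab field ('*', '5', '1,15', '*/10') against a value."""
    for part in field.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            if value % int(part[2:]) == 0:
                return True
        elif int(part) == value:
            return True
    return False

def cron_matches(expr, when):
    """Check a 5-field crontab expression: minute hour day month weekday.
    Simplification: uses Python's weekday() (0 = Monday), while real
    cron numbers weekdays starting from Sunday."""
    minute, hour, day, month, weekday = expr.split()
    return (cron_field_matches(minute, when.minute)
            and cron_field_matches(hour, when.hour)
            and cron_field_matches(day, when.day)
            and cron_field_matches(month, when.month)
            and cron_field_matches(weekday, when.weekday()))

# "0 6 * * *" means: every day at 06:00
print(cron_matches("0 6 * * *", datetime(2024, 1, 5, 6, 0)))  # True
print(cron_matches("0 6 * * *", datetime(2024, 1, 5, 7, 0)))  # False
```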
Monitor workflows
Airflow provides a nice web dashboard that presents the status of all current and historic workflows, lets you monitor and tune workflow performance, and more.
Beyond the features above, you also get some extra bonuses, e.g.
- distribute workflow tasks across worker nodes
- performance analysis diagrams
Terminology
DAG
DAG is short for Directed Acyclic Graph. You can think of a DAG as a collection of tasks that you want to run, organised in a way that reflects their relationships and dependencies.
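The "acyclic" part matters: if task A waited on B, B on C, and C back on A, no task could ever start. A quick depth-first search makes the idea concrete; this is a toy sketch over a plain dependency mapping, not Airflow's internal validation:

```python
def has_cycle(deps):
    """Detect a cycle in a task -> upstream-tasks mapping via DFS."""
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False        # already fully explored, no cycle here
        if node in visiting:
            return True         # back edge: we looped around to ourselves
        visiting.add(node)
        if any(visit(up) for up in deps.get(node, [])):
            return True
        visiting.remove(node)
        done.add(node)
        return False

    return any(visit(t) for t in deps)

acyclic = {"a": [], "b": ["a"], "c": ["b"]}
cyclic = {"a": ["c"], "b": ["a"], "c": ["b"]}
print(has_cycle(acyclic))  # False
print(has_cycle(cyclic))   # True
```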
Operator
An Operator is a class that acts as a template for carrying out some work. It’s an abstraction and generalization of a specific type of work, e.g. the PythonOperator
executes a Python callable.
Task
A task is a parameterized instance of an Operator object.
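A toy analogy (these are not Airflow's classes) may help: the Operator is the reusable template, and each parameterized instance of it is a distinct task.

```python
# Toy analogy, not Airflow's API: one Operator template, many tasks.

class PrintOperator:
    """Template for the generic work of 'emit a message'."""

    def __init__(self, task_id, message):
        self.task_id = task_id    # each task gets its own id...
        self.message = message    # ...and its own parameters

    def execute(self):
        return f"[{self.task_id}] {self.message}"

# Two tasks: same Operator class, different parameters.
hello = PrintOperator(task_id="say_hello", message="hello")
bye = PrintOperator(task_id="say_bye", message="goodbye")
print(hello.execute())  # [say_hello] hello
print(bye.execute())    # [say_bye] goodbye
```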
Scheduler
A scheduler monitors all DAGs and tasks, and triggers the task instances when their dependencies have been met.
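The triggering idea can be sketched in a few lines: repeatedly find every task whose upstream tasks are done, and run it. This is a toy sketch of the concept, not Airflow's actual scheduler:

```python
# Toy sketch of dependency-driven triggering (not Airflow's scheduler):
# keep running every task whose upstream dependencies have all finished.

def schedule(deps, run_task):
    """Run all tasks in deps (task -> upstream tasks), each once ready."""
    done = set()
    while len(done) < len(deps):
        ready = [t for t, ups in deps.items()
                 if t not in done and all(u in done for u in ups)]
        if not ready:
            raise RuntimeError("unmet dependencies (is there a cycle?)")
        for t in sorted(ready):   # deterministic order for the sketch
            run_task(t)
            done.add(t)

log = []
schedule({"extract": [], "transform": ["extract"], "load": ["transform"]},
         log.append)
print(log)  # ['extract', 'transform', 'load']
```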