What is Airflow?
Airflow is a platform to programmatically author, schedule, and monitor your workflows and pipelines.
What are the benefits of using Airflow?
Programmatically author workflows
In Airflow, you define your workflows programmatically with
Python scripts, which puts you in a very good position: you can leverage all the convenience and power that Python provides.
This is a huge improvement if you have experience with
Oozie or other GUI-based (or even GUI-less) scheduling tools. A code-based solution is always easier to modify, deploy, and maintain than GUI-driven work.
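As a sketch of what code-based authoring looks like, here is a minimal DAG file (assuming an Airflow 2.x installation; the DAG id, task ids, and commands are made up for illustration):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A DAG file is plain Python, so it can be versioned, reviewed,
# and tested like any other code.
with DAG(
    dag_id="example_etl",            # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
):
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")

    extract >> load  # load runs only after extract succeeds
```

Because the whole definition is code, changing the pipeline is an ordinary code change: edit the file, review it, and deploy it like any other Python module.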
If you are familiar with
crontab on Linux, you will find it quite easy to use
crontab syntax to schedule your workflows.
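For reference, a crontab expression has five fields (minute, hour, day-of-month, month, day-of-week), and Airflow accepts such strings as a DAG's schedule; a few common patterns:

```
# field order:  minute  hour  day-of-month  month  day-of-week
"0 6 * * *"      # every day at 06:00
"*/15 * * * *"   # every 15 minutes
"0 0 1 * *"      # midnight on the first of every month
"0 6 * * 1-5"    # 06:00 Monday through Friday
```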
Airflow provides a nice web dashboard that presents the status of all current and historic workflows, lets you monitor and tune workflow performance, and more.
You also get some extra bonuses beyond the aforementioned features, e.g.
- distribute workflow tasks across worker nodes
- performance analysis diagram
DAG is short for Directed Acyclic Graph. You can think of a DAG as a collection of tasks that you want to run, organized in a way that reflects their relationships and dependencies.
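To make the idea concrete, here is a tiny stand-alone sketch (plain Python, no Airflow required; the task names are made up) showing how a DAG's dependency edges determine a valid execution order:

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
# This mirrors how a DAG constrains execution order.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # each task appears only after its dependencies
```

The "acyclic" part matters: if two tasks depended on each other, no valid order would exist, and `TopologicalSorter` (like Airflow) would reject the graph.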
An Operator is a class that acts as a template for carrying out some work. It is an abstraction and generalization of a specific type of work; e.g. the
PythonOperator executes a Python callable.
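As an illustrative sketch (assuming Airflow 2.x; the function name, DAG id, and task id are hypothetical), a PythonOperator wraps an ordinary Python function as a task:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def greet():
    # The callable the task will execute when triggered.
    print("hello from Airflow")

with DAG(
    dag_id="python_operator_demo",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,   # trigger manually
    catchup=False,
):
    greet_task = PythonOperator(task_id="greet", python_callable=greet)
```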
A task is a parameterized instance of an Operator object.
A scheduler monitors all DAGs and tasks, and triggers the task instances when their dependencies have been met.
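That "dependencies have been met" rule can be sketched in plain Python (no Airflow required; task names are illustrative) as a toy scheduling loop, here using the stdlib `graphlib` readiness API in place of a real scheduler:

```python
from graphlib import TopologicalSorter

# Toy model of a scheduler: repeatedly trigger whichever tasks
# have all of their dependencies satisfied.
deps = {"transform": {"extract"}, "load": {"transform"}}

ts = TopologicalSorter(deps)
ts.prepare()
ran = []
while ts.is_active():
    for task in ts.get_ready():   # dependencies met -> ready to run
        ran.append(task)          # a real scheduler would trigger it here
        ts.done(task)             # mark finished, unblocking successors

print(ran)
```

A real scheduler does the same thing continuously across every DAG, on each DAG's schedule, and hands the ready task instances to workers instead of running them inline.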