Airflow has quietly become the default way data teams schedule and orchestrate pipelines. It's the thing that runs your nightly ETL, kicks off your model training, and pages someone when the 3 a.m. load fails. And like most defaults, it's widely used and narrowly understood — people write DAGs without a clear picture of what the scheduler is doing, why a task didn't start when they expected, or what an executor even is. Those gaps are exactly where the production pain comes from, so let's open it up.
Airflow has a handful of components that cooperate through one shared database. Once you see how the DAG, the scheduler, the executor, and the metadata database relate, the system's behavior — including its quirks — becomes predictable.
The DAG: pipelines as code
Airflow's founding idea is "configuration as code." A pipeline is a DAG — a directed acyclic graph — defined in a Python file. Nodes are tasks (instances of operators, which encapsulate a unit of work — run a SQL query, call an API, launch a container), and edges are dependencies. Because it's Python, the DAG can be generated programmatically, parameterized, and version-controlled like any other code.
```python
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG(
    dag_id="daily_etl",
    schedule_interval="@daily",
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="extract.sh")
    transform = BashOperator(task_id="transform", bash_command="transform.sh")
    load = BashOperator(task_id="load", bash_command="load.sh")

    extract >> transform >> load  # dependency chain
```
A critical distinction that trips up newcomers: this file defines the workflow; it doesn't run it. The DAG file is parsed repeatedly by Airflow to learn the structure, and the actual execution is driven by the scheduler creating runs over time. Two derived concepts matter: a DagRun is one execution of the whole DAG for a particular logical date, and a TaskInstance is one task within one DagRun — the thing that actually has a state (queued, running, success, failed) and gets retried.
The number-one DAG-authoring mistake: putting heavy code at the top level of the DAG file. The scheduler parses every DAG file on a regular loop — frequently — to detect changes. If your file makes a database call or a long computation at import time (outside an operator), that cost is paid on every parse, and it drags the whole scheduler down. Keep the top level cheap; do real work inside operators, which run only when the task runs.
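To make the parse cost concrete, here is a toy model in plain Python. It is not Airflow's parser: `parse`, `expensive_lookup`, and the two "DAG file" strings are hypothetical stand-ins, and the only assumption is that each parse re-executes the file's top level.

```python
# Toy model of DAG-file parsing: every parse re-executes the file's top level,
# so anything at module scope runs each time. Not Airflow's real parser.
calls = {"expensive": 0}

def expensive_lookup():
    """Stand-in for a database query or slow computation (hypothetical)."""
    calls["expensive"] += 1
    return ["a", "b", "c"]

BAD_DAG_FILE = """
tables = expensive_lookup()          # runs at import time, on every parse
def task_callable():
    return len(tables)
"""

GOOD_DAG_FILE = """
def task_callable():
    tables = expensive_lookup()      # runs only when the task executes
    return len(tables)
"""

def parse(source, times):
    """Re-execute a 'DAG file' as a parse loop would; return the last namespace."""
    namespace = {}
    for _ in range(times):
        namespace = {"expensive_lookup": expensive_lookup}
        exec(source, namespace)
    return namespace

parse(BAD_DAG_FILE, times=100)
bad_calls = calls["expensive"]                 # one expensive call per parse

calls["expensive"] = 0
good_ns = parse(GOOD_DAG_FILE, times=100)
good_calls_after_parse = calls["expensive"]    # parsing alone costs nothing

good_ns["task_callable"]()                     # the task itself runs once
good_calls_after_run = calls["expensive"]
```

In this sketch the top-level version pays for the lookup on every one of the 100 parses, while the version that defers the lookup into the callable pays only when the task runs.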
The components and how they talk
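Airflow is not one process. It's several, and they share state entirely through the metadata database. Apart from the executor dispatching work to workers, the components don't message each other directly; they read and write rows in the database, which is the key to understanding the whole system.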
| Component | Responsibility |
|---|---|
| Scheduler | Parses DAGs, creates DagRuns on schedule, and decides which task instances are ready to run (dependencies met) — then hands them to the executor |
| Executor | Determines how/where ready tasks actually run (in-process, on Celery workers, as Kubernetes pods) |
| Workers | The processes that execute task code (with distributed executors) |
| Metadata database | The single source of truth — all DAG, DagRun, and TaskInstance state lives here |
| Webserver | The UI — reads state from the metadata DB to show DAGs, runs, logs |
```mermaid
graph TD
    DAGS["DAG files (.py)"]
    SCHED["Scheduler<br/>parse DAGs → create DagRuns →<br/>find ready TaskInstances"]
    DB[("Metadata database<br/>(single source of truth:<br/>all task/run state)")]
    EXEC["Executor<br/>(Local / Celery / Kubernetes)"]
    WORK["Workers<br/>run task code"]
    WEB["Webserver (UI)"]
    DAGS --> SCHED
    SCHED <--> DB
    SCHED --> EXEC
    EXEC --> WORK
    WORK <--> DB
    WEB <--> DB
```
Airflow's components share state only through the metadata database. The scheduler decides what's ready and records it; the executor and workers run tasks and write state back; the webserver just reads it. This DB-as-bus design is why the metadata database's health is the health of your whole Airflow, and why every component sees one consistent view.
The scheduler loop: what actually decides a task runs
The scheduler runs a continuous loop, and walking it explains most "why didn't my task start?" questions. On each pass it: parses DAG files (picking up changes), creates any DagRuns now due per each DAG's schedule_interval, examines the task instances of active DagRuns to find those whose upstream dependencies are satisfied, marks them queued, and hands them to the executor. As tasks finish, workers write their success/failure back to the metadata DB, which unblocks downstream tasks on the next loop.
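The loop described above can be sketched in a few lines of plain Python. This is a deliberately simplified model, not Airflow's scheduler: `UPSTREAM`, `DagRun`, `scheduler_pass`, and `worker_finish` are hypothetical names, and a dict stands in for the metadata database.

```python
from dataclasses import dataclass, field

# Dependencies for the daily_etl example: task -> set of upstream tasks.
UPSTREAM = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}

@dataclass
class DagRun:
    logical_date: str
    # task_id -> state; this dict stands in for rows in the metadata DB
    state: dict = field(default_factory=lambda: {t: "none" for t in UPSTREAM})

def scheduler_pass(run):
    """One loop iteration: queue every task whose upstreams have all succeeded."""
    queued = []
    for task, upstreams in UPSTREAM.items():
        ready = all(run.state[u] == "success" for u in upstreams)
        if run.state[task] == "none" and ready:
            run.state[task] = "queued"
            queued.append(task)
    return queued

def worker_finish(run, task):
    """A worker writes the task's result back to the shared state."""
    run.state[task] = "success"

run = DagRun(logical_date="2021-01-01")
order = []
passes = 0
while any(s != "success" for s in run.state.values()):
    passes += 1
    for task in scheduler_pass(run):
        order.append(task)
        worker_finish(run, task)
```

Note that a downstream task is never queued in the same pass that queues its upstream: it has to wait for the result to be written back and for the next pass to notice it. For this three-task chain that means three passes, which is the loop latency the text describes.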
The consequence to internalize: scheduling is not instantaneous or real-time. A task becomes eligible only after (a) its DagRun's interval has elapsed and (b) the scheduler's loop gets around to it and sees its dependencies met. On a busy instance with thousands of tasks, that loop latency is real. Airflow is a batch scheduler, not an event-driven millisecond dispatcher — expecting sub-second reaction is expecting the wrong thing from it.
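Condition (a) is the one that surprises people most, so here is the arithmetic as a small sketch. `earliest_start` is a hypothetical helper, not an Airflow function; it assumes a fixed interval such as `@daily`.

```python
from datetime import datetime, timedelta

def earliest_start(logical_date, interval=timedelta(days=1)):
    """A run covers [logical_date, logical_date + interval) and can only be
    created once that whole interval has elapsed."""
    return logical_date + interval

run_for = datetime(2021, 1, 1)          # logical date of the DagRun
starts_at = earliest_start(run_for)     # midnight at the END of Jan 1
```

So the `@daily` run whose logical date is 2021-01-01 becomes eligible at 2021-01-02 00:00 at the earliest, and the scheduler's loop latency is added on top of that.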
Executors: the choice that defines your deployment
The scheduler decides what runs; the executor decides how and where. This single configuration choice shapes your scalability and operational model more than anything else:
| Executor | How tasks run | Fits |
|---|---|---|
| Sequential | One task at a time, in-process (SQLite) | Demos and local debugging only |
| Local | Parallel subprocesses on the scheduler machine | Small single-node deployments |
| Celery | Tasks distributed to a pool of long-running Celery workers via a message broker (Redis/RabbitMQ) | Horizontal scale with a stable worker fleet |
| Kubernetes | Each task runs in its own dynamically launched pod | Elastic, isolated, per-task resourcing; no idle worker fleet |
The Celery vs Kubernetes decision is the common production fork. Celery keeps a warm worker pool: low per-task latency, but you pay for idle workers and tasks share the workers' environment. Kubernetes spins up a clean pod per task: strong isolation and elastic scaling with no idle worker cost, at the price of pod-startup latency on every task. (There's also a hybrid, CeleryKubernetesExecutor, for mixing both.) Neither is "better"; they fit different workloads.
What Airflow 2.0 changed
Airflow 2.0 (late 2020) was a substantial overhaul, and by now in 2021 the 2.x line is what you should be running. The headline changes directly address the historical pain points:
- A highly-available scheduler. The single biggest one. Pre-2.0, the scheduler was a single point of failure and a throughput bottleneck. Airflow 2.0 lets you run multiple active schedulers concurrently (coordinating through row-level locks in the metadata DB) — both for high availability and for far higher scheduling throughput.
- The TaskFlow API. A cleaner way to write DAGs with the `@task` decorator, where plain Python functions become tasks and passing return values between them handles data dependencies (via XCom) automatically. That means much less boilerplate than the classic operator-and-XCom style.
- A full, stable REST API for programmatic control and integration, and a faster, refreshed UI.
Design tasks to be idempotent and retry-safe. Airflow runs each task instance for a specific logical date and will retry failures and let you re-run history (backfill). A task that isn't idempotent — that double-counts or duplicates rows when run twice for the same date — will eventually corrupt data, because reruns are normal operation, not an exception. Make "run this task again for date X" always produce the same result.
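One common way to get that property is to overwrite the date's partition instead of appending to it. The sketch below uses an in-memory SQLite table so it runs anywhere; the `daily_sales` table, its columns, and `load_idempotent` are hypothetical, and a real warehouse may offer a better primitive (partition overwrite, `MERGE`, and so on).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (ds TEXT, amount INTEGER)")

def load_idempotent(conn, ds, amounts):
    """Replace the partition for logical date `ds` inside one transaction."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("DELETE FROM daily_sales WHERE ds = ?", (ds,))
        conn.executemany(
            "INSERT INTO daily_sales (ds, amount) VALUES (?, ?)",
            [(ds, amount) for amount in amounts],
        )

amounts = [10, 20, 30]
load_idempotent(conn, "2021-01-01", amounts)
load_idempotent(conn, "2021-01-01", amounts)  # a retry or backfill of the same date

count, total = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM daily_sales WHERE ds = ?",
    ("2021-01-01",),
).fetchone()
```

Running the load twice leaves three rows, not six. A plain `INSERT` without the preceding `DELETE` would have doubled the data on the second run, which is exactly the corruption the paragraph above warns about.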
What to carry away
Airflow is a set of processes coordinating through one metadata database: the scheduler parses DAGs, creates DagRuns on their interval, and marks task instances ready when dependencies clear; the executor (Local, Celery, or Kubernetes) decides how those tasks actually run; the webserver just reads the shared state. It's a batch scheduler, so scheduling is loop-driven, not real-time, and the metadata DB's health is the system's health.
Author with the grain: keep DAG files cheap to parse, do real work inside operators, make tasks idempotent so retries and backfills are safe, and pick the executor that matches your scale and isolation needs. Do that and Airflow is a dependable backbone; fight it and you'll spend your nights wondering why a task is stuck in queued.