# dbt Internals and Best Practices: What Happens When You Run dbt run

Everyone uses dbt. Fewer people understand what it actually does. It's not a query engine, not an orchestrator, and not a data warehouse — it's a SQL compiler with a dependency resolver and a test runner built in. Understanding the compilation pipeline is the difference between a dbt project that runs reliably at scale and one that collapses into a spaghetti of `{{ ref() }}` calls nobody can debug.

This article covers dbt's internal compilation pipeline (what happens between `dbt run` and the first SQL statement hitting your warehouse), materializations and how they differ on Snowflake vs Databricks vs Redshift vs BigQuery, incremental strategies in depth, common project structure mistakes, and how dbt performs on modern lightweight engines like DuckDB and MotherDuck for development and testing.

## The Compilation Pipeline: Jinja → SQL → Execution

When you run `dbt run`, dbt does five things before executing a single SQL statement against your warehouse:

```mermaid
graph LR
    A["1. Parse project\n(dbt_project.yml +\nall .sql .yml files)"]
    B["2. Resolve\ndependencies\n(build DAG from ref()/source())"]
    C["3. Jinja render\n(compile templates\nto raw SQL)"]
    D["4. Adapter\ntranslation\n(dialect-specific SQL)"]
    E["5. Execute\n(warehouse API calls\nin topological order)"]

    A --> B --> C --> D --> E
          
```

dbt's five-phase execution pipeline. Steps 1–4 happen on your local machine or dbt Cloud runner — no warehouse queries. Only step 5 hits the warehouse. This means compilation errors (syntax, missing refs, undefined variables) are caught before any compute is consumed.

### Phase 1: Parse

dbt reads every `.sql`, `.yml`, and `.py` file in your project, parses model configs, source definitions, macros, and tests. This phase builds the internal graph — a Python dict mapping node IDs to their metadata. At this point dbt knows every model exists but hasn't compiled any SQL yet.

### Phase 2: DAG Resolution

The `{{ ref('model_name') }}` function is a dependency declaration, not a SQL expression. dbt resolves all `ref()` and `source()` calls to their fully qualified table names (e.g., `analytics.prod.orders_daily`) and builds a directed acyclic graph. If model B refs model A, B depends on A — and dbt will execute A before B. Circular references fail here with a clear error.

### Phase 3: Jinja Compilation

dbt uses Jinja2 to render model SQL. Every model file is a Jinja template; `{{ ref('stg_orders') }}` resolves to the fully-qualified table name, `{{ config() }}` sets model properties, `{{ is_incremental() }}` evaluates to True/False based on run context. The output is plain SQL with no Jinja syntax — this is what `dbt compile` shows you and what actually runs against the warehouse.

```sql
-- models/marts/fct_orders.sql (Jinja template)
{{ config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge'
) }}

SELECT
    order_id,
    customer_id,
    order_total,
    created_at
FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
    WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}
```

```sql
-- Compiled output (what dbt sends to Snowflake)
MERGE INTO analytics.prod.fct_orders AS DBT_INTERNAL_DEST
USING (
    SELECT order_id, customer_id, order_total, created_at
    FROM analytics.prod.stg_orders
    WHERE created_at > (SELECT MAX(created_at) FROM analytics.prod.fct_orders)
) AS DBT_INTERNAL_SOURCE
ON DBT_INTERNAL_DEST.order_id = DBT_INTERNAL_SOURCE.order_id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...
```

### Phase 4: Adapter Translation

The adapter layer translates dbt's dialect-independent compiled SQL into the specific SQL dialect of your warehouse. This is where the differences between Snowflake, Redshift, BigQuery, and Databricks become relevant. The same `merge` incremental strategy compiles to different SQL on each platform — Snowflake uses `MERGE INTO`, BigQuery uses `MERGE` with slightly different syntax, Databricks uses `MERGE INTO` on Delta tables, Redshift uses a delete-then-insert pattern (no native MERGE until recently).

## Materializations: What They Actually Do

### Table

Drops and recreates the table on every run: `DROP TABLE IF EXISTS; CREATE TABLE AS SELECT ...`. Simple, reliable, expensive for large tables. Use for small dimension tables, reference data, and marts with relatively fast rebuild times.

### View

Creates or replaces a view: `CREATE OR REPLACE VIEW AS SELECT ...`. No storage cost. Query performance depends entirely on the upstream tables. Use for staging models and lightweight transformations where freshness matters more than query speed.

### Incremental

The most complex and most misunderstood materialization. On the first run (when the table doesn't exist), behaves like a full table build. On subsequent runs, only processes new/changed rows. The exact SQL depends on the `incremental_strategy`:

| Strategy | Mechanics | Best for | Supported on |
| --- | --- | --- | --- |
| `append` | INSERT new rows only. No deduplication. | Immutable event logs | All adapters |
| `merge` | MERGE ON unique_key: update matches, insert new. Idempotent. | Fact tables with updates | Snowflake, BigQuery, Databricks (Delta), Postgres |
| `delete+insert` | DELETE matching rows, INSERT all new rows. Slower than merge but more compatible. | Replacing partitions | All adapters (Redshift prefers this) |
| `insert_overwrite` | Replace entire partition(s) atomically. Partition-level idempotency. | Large time-partitioned tables | BigQuery, Databricks, Spark |
| `microbatch` | dbt 1.9+: Processes each time batch separately with retries. Event-time-based. | Reliable large-scale incremental with late data | BigQuery, Snowflake, Databricks (1.9+) |

### Ephemeral

Not materialized at all — compiled as a CTE and inlined into the downstream model's SQL. No storage, no warehouse object, no way to query it directly. Use for intermediate transformations you'd otherwise write as subqueries. The downside: complex ephemeral chains make debugging hard because you can't `SELECT * FROM that_intermediate_step` to check what's in it.

## Platform Differences That Actually Matter

### Snowflake

The reference dbt platform. Every incremental strategy works reliably. Dynamic Tables (Snowflake-native streaming materialization) can replace some incremental models for near-real-time use cases. `CLUSTER BY` is the Snowflake equivalent of partitioning — specify it in dbt config for large fact tables. Snowflake's ZERO_COPY_CLONE capability (cloning a table without copying data) makes `dbt test --store-failures` very cheap — failed test results get cloned into a test schema without data movement.

### Databricks (Delta Lake)

Databricks' Delta Lake supports all dbt incremental strategies plus liquid clustering (Delta's partitioning evolution, replacing static `PARTITIONED BY`). The `insert_overwrite` strategy replaces Spark partitions atomically and is often faster than MERGE on large tables. Python models (dbt-databricks >= 1.3) let you write Spark DataFrame code as a dbt model — useful for ML feature engineering that's easier in PySpark than SQL. One gotcha: Unity Catalog on Databricks changes table naming from `database.schema.table` to `catalog.schema.table` — updating your `dbt_project.yml` database config is required when migrating to Unity Catalog.

### Redshift

Redshift lacks a native MERGE statement until Redshift Serverless (and even there it's limited). dbt's default incremental strategy on Redshift is `delete+insert`: it deletes rows matching the `unique_key`, then inserts all new rows. This creates vacuum bloat — rows deleted from column-store pages leave ghost rows that consume space and slow queries until `VACUUM` runs. Production Redshift dbt projects need a scheduled `VACUUM ANALYZE` on incremental tables, or they'll degrade significantly over time. Redshift's SORT KEY is the performance lever dbt exposes via the `sort` config — critical for range-based query patterns on large fact tables.

### BigQuery

BigQuery partitioned tables + dbt's `insert_overwrite` strategy is the standard production pattern: each run replaces the affected date partition(s) atomically. Partition replacement is idempotent and handles late-arriving data correctly. BigQuery's `require_partition_filter` table option (prevents full-table scans — every query must include a partition filter) can be set via dbt config and is strongly recommended for any table over a few hundred GB.

### Azure Synapse / SQL Server

The dbt-synapse adapter is maintained by Microsoft and covers most dbt features, but Synapse Dedicated SQL Pool lacks some Delta Lake features and has different MERGE semantics. External tables from ADLS Gen2 are handled via source configs but have query performance limitations. For Azure-native dbt work in 2024, Fabric Lakehouse (via the dbt-fabric adapter) is increasingly the preferred target over Synapse Dedicated SQL Pool, which Microsoft is steering users away from.

## dbt on Modern Open/Lightweight Engines

### DuckDB

dbt-duckdb made dbt development genuinely fast. DuckDB is an in-process OLAP engine (runs inside your Python process or CLI, reads Parquet/CSV/JSON directly from disk or S3) that executes SQL at warehouse speeds on a laptop. With dbt-duckdb, you can run your entire dbt project locally against a sample of production data in seconds — no warehouse needed, no costs, no cloud credentials. This changes the development loop: instead of writing a model, pushing to a branch, waiting for a CI job, and paying for warehouse compute to run a test, you write a model, run it locally in 3 seconds, and iterate.

```yaml
# profiles.yml — DuckDB for local dev
my_project:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: "./dev.duckdb"       # local file, or :memory: for ephemeral
      # Can read Parquet directly from S3 without loading into the DB
      settings:
        s3_region: us-east-1

    prod:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      ...
```

### MotherDuck

MotherDuck is serverless DuckDB in the cloud. Same SQL dialect and dbt-duckdb adapter, but your data lives in a managed cloud database rather than a local file. The hybrid execution model is particularly clever: you can join a local DuckDB table with a MotherDuck cloud table in a single query, running the compute nearest the data. For dbt projects, this means a development environment that's cloud-hosted (no local file management) but still orders of magnitude cheaper than Snowflake or BigQuery for development and testing workloads.

## Project Structure: The Patterns That Scale

The standard dbt project structure follows the Fivetran/dbt Labs "jaffle shop" pattern, which has become the community default for good reason:

```bash
models/
├── staging/          # 1-1 with source tables; rename, recast, light cleaning
│   ├── salesforce/
│   │   ├── _salesforce__sources.yml
│   │   ├── _salesforce__models.yml
│   │   ├── stg_salesforce__contacts.sql
│   │   └── stg_salesforce__opportunities.sql
│   └── stripe/
│       ├── stg_stripe__charges.sql
│       └── stg_stripe__customers.sql
├── intermediate/     # reusable building blocks; not exposed directly to BI
│   ├── int_customer_order_rollup.sql
│   └── int_revenue_by_month.sql
└── marts/            # business-logic models; exposed to BI and downstream
    ├── core/
    │   ├── dim_customer.sql
    │   └── fct_orders.sql
    └── finance/
        └── fct_mrr.sql
```

The **staging layer** is a 1:1 mapping from source system to a clean, typed, consistently named table. Rename `acct_no` to `account_number` here. Cast types here. Do nothing else here. The discipline of keeping staging models thin is what makes the intermediate and marts layers predictable.

## The 10 Most Common dbt Project Mistakes

1. **Putting business logic in staging models.** Staging is for cleaning, not business rules. The moment you write a `CASE WHEN` with business logic in a staging model, it becomes invisible to anyone looking at the marts layer.

1. **Not using contracts on mart models.** dbt contracts (enforced column definitions + data types) prevent silent schema changes from breaking downstream BI reports. Every model exposed to a BI tool or external consumer should have a contract.

1. **Incremental models without `+full_refresh` protection.** A misbehaving incremental model that needs a full refresh will silently contain stale data until someone manually runs `--full-refresh`. Add a freshness assertion to your tests.

1. **Over-relying on ephemeral models for debugging.** Ephemeral models are invisible at query time. Large chains of ephemeral models produce enormous CTEs that are impossible to profile or debug in the warehouse.

1. **Missing `on_schema_change` config.** When an incremental model's source adds a new column, the incremental table won't include it until a full refresh — unless you set `on_schema_change: 'sync_all_columns'`.

1. **Using `{{ ref() }}` in macros.** Macros don't participate in the dependency graph. Using `ref()` inside a macro doesn't create a DAG dependency — the referenced model might not exist yet when the macro runs.

1. **No separation of dev and prod targets.** Running dbt in dev should write to a dev schema, not prod. Misconfigured profiles that point dev runs at prod tables have caused real data loss.

1. **Testing only for not-null and unique.** These are the minimum. Accepted-values, relationship, and custom expression tests are what catch real data quality issues.

1. **Circular source definitions.** Defining a staging model as a source of another staging model (instead of using `ref()`) breaks lineage tracking and creates hidden dependencies.

1. **Giant models.** A model with 500 lines of SQL doing 15 joins is not a model — it's a monolith. Break it into intermediate models. Future debuggers will thank you.

**The one dbt feature most projects underuse:** `dbt docs generate` and `dbt docs serve`. A documented, auto-generated data catalog with lineage graphs, column descriptions, and test coverage is a significant data governance artifact — and it's free if you write YAML descriptions. The organizations that document their dbt models in YAML have fundamentally different data literacy than those that don't.

## dbt Semantic Layer (MetricFlow)

dbt 1.6+ introduced the Semantic Layer (powered by MetricFlow), which lets you define metrics — not just SQL transformations — in dbt. A metric defined in dbt can be queried by BI tools (Tableau, Looker, Hex, Evidence) via a consistent interface, ensuring the number means the same thing everywhere it appears.

```yaml
# models/metrics/revenue_metrics.yml
semantic_models:
  - name: orders
    model: ref('fct_orders')
    entities:
      - name: order
        type: primary
        expr: order_id
      - name: customer
        type: foreign
        expr: customer_id
    dimensions:
      - name: order_date
        type: time
        type_params:
          time_granularity: day
    measures:
      - name: order_total
        agg: sum
        expr: order_total

metrics:
  - name: total_revenue
    type: simple
    label: Total Revenue
    type_params:
      measure: order_total
  - name: revenue_30d
    type: cumulative
    label: Revenue (30-day rolling)
    type_params:
      measure: order_total
      window: 30 days
```

The Semantic Layer is dbt's answer to the "metrics are defined differently in every BI tool" problem. It's still maturing, but for teams running multiple BI tools consuming the same dbt project, it's the right architectural direction — define the metric once, query it consistently everywhere.
