# Streaming Databases: Materialize, RisingWave, and Incremental View Maintenance

Picture the dashboard query everyone has written: a join across a few tables, a couple of aggregations, a filter, and it needs to be current. The usual answer is to schedule it — run it every minute, or every five, and accept that the number on screen is as stale as the last run and as expensive as a full recompute each time. Crank the schedule tighter and the cost climbs, because you're recomputing the entire result from scratch to discover that three rows changed. That waste — recomputing everything to reflect a tiny delta — is the exact thing a streaming database is built to eliminate.

A **streaming database** is a system where you define a query as a **materialized view** and the database keeps that view continuously up to date as new data arrives, by computing only the *change* to the result rather than re-running the query. The technique is **incremental view maintenance (IVM)**, and it's the heart of **Materialize** and **RisingWave**. I'll explain how that differs from both a normal database and a stream processor, the dataflow model that makes it work, and the trade-offs — because there are real ones.

## The core idea: maintain the result, don't recompute it

Start with what a materialized view is in a normal warehouse: a stored, precomputed query result that's fast to read but goes stale, and gets refreshed by periodically re-running the whole query. A streaming database flips the refresh model. When a base row changes, it pushes that single change through the query's logic and updates only the affected part of the stored result. The view is never "refreshed" — it's continuously *maintained*, always reflecting the latest input within a tiny delay.

The difference in cost is the point. If a thousand-row table gets one new row, recomputing a `GROUP BY` touches all thousand rows; incrementally maintaining it touches one group. As input scales up and change rate stays modest, IVM's "work proportional to the change" beats "work proportional to the whole dataset" by orders of magnitude. That's why these systems can hold a complex join-and-aggregate view fresh to within a second at a cost that a tight refresh schedule could never match.

```mermaid
graph LR
    SRC["Source change(one INSERT / UPDATE / DELETE)"]
    OP1["Join operator(keeps state)"]
    OP2["Aggregate operator(keeps running totals)"]
    VIEW["Materialized view(always current)"]
    Q["Reads: instant,just serve the stored result"]
    SRC -->|"delta flows through"| OP1
    OP1 -->|"delta"| OP2
    OP2 -->|"updates only affected rows"| VIEW
    VIEW --> Q
          
```

Incremental view maintenance. A single change flows through stateful operators that update only the rows of the result it affects, instead of the query re-scanning the whole input. Reads just serve the already-maintained view, so they're instant. The engine does work proportional to the *change*, not the dataset size — the opposite of scheduled recompute.

## How it's different from a stream processor

This is where people get confused, because "process data as it streams" also describes [Flink](apache-flink-internals) and Kafka Streams. The distinction is real and worth being precise about. A **stream processor** is a programming framework: you write a job that consumes streams and emits results, and you own the operational lifecycle and usually a separate store to query the output. A **streaming database** presents as a database: you issue `CREATE MATERIALIZED VIEW … AS SELECT …`, and then you `SELECT` from that view over a normal SQL connection (Materialize even speaks the Postgres wire protocol) and get a strongly consistent, current answer. It maintains the view *and* serves queries against it.

Put simply: a stream processor is something you *program*; a streaming database is something you *query*. The streaming database also tends to make a stronger consistency promise — the view reflects a consistent point across all its inputs, so a join never shows you a half-updated state — whereas a hand-built streaming pipeline leaves correctness like that to you. There's overlap, and you can build similar things either way, but the developer experience and the guarantees differ.

|  | Scheduled query (warehouse) | Stream processor (Flink) | Streaming database |
| --- | --- | --- | --- |
| How you express it | SQL, re-run on a schedule | Program a dataflow job | SQL materialized view |
| Freshness | As stale as last run | Continuous | Continuous |
| Cost model | Full recompute each run | Incremental | Incremental (IVM) |
| How you read results | Query the table | Query a separate sink store | Query the view directly (SQL) |
| Consistency of complex joins | Strong (at run time) | You engineer it | Strong, maintained |

## The engine: dataflow and operator state

Under the hood, a streaming database compiles your SQL view into a **dataflow graph** — operators (filter, join, aggregate) wired together, with data changes flowing along the edges as a stream of "this row was added / removed." Materialize is built on **differential dataflow** (and Timely Dataflow), a model designed precisely for efficient incremental and iterative computation, where every datum carries a logical timestamp and a *diff* (+1 for an insert, −1 for a delete) and operators combine these diffs to keep their outputs correct. RisingWave implements the same incremental philosophy with its own dataflow engine and a focus on cloud-native, state-on-object-storage operation.

The thing to understand operationally is that these operators are **stateful**. To incrementally maintain a join, the engine must remember both sides so a new row on one side can be matched against the other without rescanning a source. To maintain an aggregate, it keeps running totals per group. That state is what makes incremental updates cheap — and it's also the resource you're really paying for and must size, since the memory (or storage) footprint scales with the state your views need to hold, not with throughput alone. This is the same lesson that [state in Flink](apache-flink-internals) teaches: in streaming, your state is the system.

```sql
-- Define the result once; the database keeps it current forever
CREATE MATERIALIZED VIEW revenue_by_region AS
SELECT r.region,
       count(*)            AS orders,
       sum(o.amount)       AS revenue
FROM orders o
JOIN regions r ON r.id = o.region_id
WHERE o.status = 'completed'
GROUP BY r.region;

-- Read it like any table — always reflects the latest orders, instantly
SELECT * FROM revenue_by_region ORDER BY revenue DESC;
```

The natural pairing is change data capture. Point [Debezium](debezium-cdc) at your operational Postgres or MySQL, stream the row-level changes into a streaming database, and define your dashboards and metrics as materialized views over them. Now the heavy, always-fresh analytical queries run continuously off to the side — never touching the production database — and your app reads current results from a simple `SELECT`. That CDC-into-IVM pattern is the sweet spot these systems were built for.

## The trade-offs — and they're real

Incremental maintenance is not magic, and the honest limits decide whether it fits:

- **State cost.** Maintaining joins and aggregates means holding state, and for big joins that state can be large and expensive. The win is conditional on your change rate being modest relative to your data size — high-churn-everything workloads erode the advantage.

- **Not all SQL is cheap to maintain incrementally.** Some operations (certain window functions, complex correlated subqueries, ordering-heavy queries) are hard or costly to keep incremental. What's trivial to recompute can be awkward to maintain.

- **It's not a general-purpose database.** A streaming database is built to maintain a defined set of views, not to serve arbitrary ad-hoc OLAP or act as your OLTP system of record. It's a focused tool.

- **Operational maturity.** As of 2024 this is a young, fast-moving category. The concepts are proven (differential dataflow has serious research behind it), but the ecosystem, tooling, and battle-testing are thinner than for a warehouse or Kafka.

**Don't reach for a streaming database when a scheduled query would do.** If your dashboard genuinely tolerates five-minute staleness, a cron'd query on your existing warehouse is simpler, cheaper to operate, and one less system to run. The streaming database earns its keep when freshness is a real requirement *and* the query is expensive to recompute — when "just run it more often" has become the cost problem you're trying to escape. Adopt it to solve that specific pain, not because real-time sounds better than batch.

## What to carry away

A streaming database keeps a SQL **materialized view continuously current by computing only the change to the result** — incremental view maintenance — instead of re-running the query on a schedule. That makes work proportional to the data's rate of change rather than its size, which is the whole economic argument over tight refresh schedules. Unlike a [stream processor](apache-flink-internals) you program and pair with a separate store, a streaming database is something you *query*: define the view, then `SELECT` from it with strong, maintained consistency. Materialize (on differential dataflow) and RisingWave are the leading implementations.

The cost is the state those incremental operators must hold, and the limits are real — not all SQL maintains cheaply, the category is young, and it's a focused tool, not a warehouse or an OLTP store. Use it where freshness is required and recompute is expensive, ideally fed by [CDC](debezium-cdc) off your operational database. Where five-minute staleness is fine, a scheduled query still wins. Knowing which situation you're in is the whole decision.
