# ClickHouse at Scale: Insert Performance, Async Inserts, and Real-Time Streaming from Kafka

🟡 This is Part 3 of a 3-part series: ClickHouse Deep Dive

1. [Internals: How a Columnar OLAP Engine Actually Works](clickhouse-architecture-internals)

1. [Schema Design, ORDER BY, and Query Optimization](clickhouse-schema-query-optimization)

1. Insert Performance & Real-Time Streaming (you are here)

ClickHouse will read a billion rows in a second and then fall over because your application inserted one row at a time. That asymmetry catches almost everyone, because the instinct from transactional databases — insert as the events arrive, one statement per event — is exactly the wrong instinct here. The read path and the write path want opposite things, and production ClickHouse lives or dies on getting the write path right.

This closes the series. [Part 1](clickhouse-architecture-internals) explained *why* small inserts hurt: every insert becomes an immutable part, and the background merge process has to clean up after you. [Part 2](clickhouse-schema-query-optimization) got the table designed. Now we make ingestion fast and reliable — batching, async inserts, buffering, the Kafka streaming pattern, deduplication, and lifecycle management.

## Why tiny inserts are the cardinal sin

Recall the model from Part 1: one `INSERT` produces one part — a directory of sorted, compressed column files plus its index. A part carries fixed overhead regardless of whether it holds 1 row or 1 million. Insert a million rows one at a time and you've created a million tiny parts, each with its own files and metadata, and the merge process now has a hopeless backlog. ClickHouse protects itself: when active parts in a partition exceed a threshold (`parts_to_throw_insert`, default 300), inserts start failing with the infamous `Too many parts (N). Merges are processing significantly slower than inserts.`

The fix is simply to **insert in large batches**. The guidance is to aim for a substantial number of rows per insert — at least in the low thousands, and tens of thousands to hundreds of thousands is comfortable. Each batch becomes one reasonably sized part, merges keep up, and throughput climbs dramatically.

**The one-line rule:** batch by rows *and* by time, whichever comes first — e.g. "flush every 100k rows or every second." Buffer on the client side, or let one of the mechanisms below do the batching for you. Never wire a per-event `INSERT` straight to ClickHouse.

## Async inserts: when you can't batch on the client

Sometimes you genuinely have many independent clients each sending small inserts — lots of app servers, IoT devices, serverless functions — and coordinating client-side batching is impractical. For exactly this, ClickHouse offers **asynchronous inserts**. With `async_insert = 1`, the server collects incoming inserts into an in-memory buffer and flushes them as a single part when the buffer hits a size or time threshold. The batching moves from the client to the server.

```sql
-- Per-query, or set on the user/profile
SET async_insert = 1;
SET wait_for_async_insert = 1;          -- 1 = wait for the flush (safer ack)
SET async_insert_max_data_size = 10000000;   -- flush at ~10 MB
SET async_insert_busy_timeout_ms = 1000;     -- ...or after 1 second
```

The key trade-off is `wait_for_async_insert`. Set to `1`, the client's insert returns only after the server has flushed the buffer to a part — slower acks, but you know the data is durable when the call returns. Set to `0`, inserts return immediately (highest throughput) but an ack no longer guarantees the row is persisted; a crash before flush loses the buffer. Choose deliberately based on how much you'd mind losing the last second of un-flushed data.

### The Buffer engine: an older, blunter instrument

Before async inserts, the classic tool was the `Buffer` table engine — an in-memory staging table that accumulates rows and periodically flushes them to a target MergeTree table. It still works and occasionally fits, but async inserts have largely superseded it for new designs: a `Buffer` table's contents live only in RAM (lost on restart), complicate reads, and add a moving part to operate. Prefer async inserts unless you have a specific reason.

## Real-time streaming: the Kafka engine + materialized view pattern

For continuous ingestion from a stream, ClickHouse has a built-in **Kafka table engine**, and it composes with materialized views into the canonical real-time analytics pipeline. The shape is three objects, and it's worth understanding precisely because the roles are easy to confuse.

1. A **Kafka engine table** — this is *not* storage. It's a consumer: a streaming source you read from. Each row is read once and the consumer's offset advances; you can't query it twice for the same data.

1. A **MergeTree target table** — the real, durable, queryable table, designed per [Part 2](clickhouse-schema-query-optimization).

1. A **materialized view** — the pump. It's triggered as the Kafka table consumes messages, transforms each batch, and inserts into the MergeTree target. ClickHouse handles the batching, so you get large parts naturally.

```mermaid
graph LR
    K["Kafka topic(events)"]
    KE["Kafka engine table(consumer — read once)"]
    MV["Materialized view(transform + batch)"]
    MT["MergeTree target table(durable, queryable)"]
    AGG["AggregatingMergeTreerollup (optional)"]
    K --> KE --> MV --> MT
    MT -. second MV .-> AGG
          
```

The standard ClickHouse streaming pipeline. The Kafka engine table is a consumer, not a store; the materialized view pumps consumed messages into a real MergeTree table in healthy batches; an optional second materialized view maintains a pre-aggregated rollup for instant dashboards. ClickHouse pulls from Kafka itself — no external connector or stream processor required for the common case.

```sql
-- 1) The consumer (not storage)
CREATE TABLE events_queue
(
    raw String
)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'broker1:9092,broker2:9092',
    kafka_topic_list  = 'events',
    kafka_group_name  = 'clickhouse-events',
    kafka_format      = 'JSONAsString',
    kafka_num_consumers = 2;

-- 2) The durable target (designed per Part 2)
CREATE TABLE events
(
    tenant_id  UInt32,
    event_date Date,
    user_id    UInt64,
    event_type LowCardinality(String),
    amount     Decimal(18, 2),
    ts         DateTime CODEC(DoubleDelta, ZSTD(1))
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (tenant_id, event_date, user_id);

-- 3) The pump: parse JSON and insert into the target as messages arrive
CREATE MATERIALIZED VIEW events_mv TO events AS
SELECT
    JSONExtractUInt(raw, 'tenant_id')        AS tenant_id,
    toDate(JSONExtractUInt(raw, 'ts'))       AS event_date,
    JSONExtractUInt(raw, 'user_id')          AS user_id,
    JSONExtractString(raw, 'event_type')     AS event_type,
    JSONExtractFloat(raw, 'amount')          AS amount,
    toDateTime(JSONExtractUInt(raw, 'ts'))   AS ts
FROM events_queue;
```

This is genuinely powerful: ClickHouse pulls from Kafka directly, with no separate connector service or stream processor for the common case. For the mechanics of what's on the other side of that topic — partitions, consumer groups, offsets, and why a consumer group gets stuck — see [Kafka Internals](kafka-internals), and for why Kafka sustains the throughput that makes this pattern viable, [Why Kafka Is So Fast](kafka-performance-scalability).

**Operating the Kafka engine has sharp edges.** By default it commits offsets after handing a batch to the materialized view, so a crash mid-write can drop or, on retry, duplicate a batch — it's effectively at-least-once, which makes the deduplication below essential. Watch consumer lag, size `kafka_num_consumers` against the topic's partition count (more consumers than partitions just idle), and keep the materialized view's transform cheap — heavy logic there throttles the whole pipeline.

## Deduplication: living with at-least-once

Streaming pipelines redeliver. Retries, rebalances, and replays all mean the same event can arrive twice, so a real-time table needs a dedup strategy. ClickHouse gives you two complementary mechanisms.

**Insert-level deduplication.** For replicated tables, ClickHouse hashes each inserted block and ignores a re-inserted identical block (controlled by `insert_deduplication`). This transparently absorbs the common "retry sent the same batch again" case — same rows, same order, deduplicated for free.

**Row-level deduplication via `ReplacingMergeTree`.** For logical duplicates (same event, different batch), use the engine from [Part 2](clickhouse-schema-query-optimization): a `ReplacingMergeTree` keyed on the event's identity keeps only the latest version per key. The crucial caveat carries over — dedup happens at merge time, so query with `FINAL` or aggregate with `argMax` if you need duplicate-free results before merges have caught up.

```sql
CREATE TABLE events_dedup
(
    event_id   UUID,
    tenant_id  UInt32,
    event_date Date,
    payload    String,
    version    UInt64           -- e.g. ingest timestamp; latest wins
)
ENGINE = ReplacingMergeTree(version)
PARTITION BY toYYYYMM(event_date)
ORDER BY (tenant_id, event_id);

-- Duplicate-free read before merges consolidate:
SELECT * FROM events_dedup FINAL WHERE tenant_id = 42;
```

## TTL: lifecycle without cron jobs

Real-time tables grow without limit unless you manage retention, and ClickHouse builds that into the schema with `TTL` — no external scheduler. Rows or whole parts expire automatically at merge time, and TTL can do more than delete: it can roll data off to cheaper storage tiers or down-aggregate it as it ages.

```sql
ALTER TABLE events MODIFY TTL
    event_date + INTERVAL 90 DAY DELETE,                       -- drop after 90 days
    event_date + INTERVAL 30 DAY TO VOLUME 'cold';             -- tier to cheap storage at 30
```

Because partitioning is by month (Part 2), expiry mostly drops whole parts cleanly rather than rewriting them — another reason the coarse-partition rule pays off.

## The whole series in five sentences

**ClickHouse is fast because it reads columns, skips granules via the `ORDER BY` index, and crunches the survivors in vectorized SIMD loops** (Part 1). **Performance is mostly a schema decision** — the sort key, narrow types and `LowCardinality`, the right MergeTree engine, dictionaries over joins, and precomputation with projections and materialized views (Part 2). **And it stays healthy only if you respect the write path**: big batches, async inserts or the Kafka-engine-plus-materialized-view pattern for streaming, deduplication because delivery is at-least-once, and TTL for lifecycle (Part 3). **The read path and the write path want opposite things, and production work is balancing them.** **Get the table right for how it's queried and how data arrives, and ClickHouse delivers the sub-second analytics that made you reach for it.**

🟡 The ClickHouse Deep Dive series

1. [Internals: How a Columnar OLAP Engine Actually Works](clickhouse-architecture-internals)

1. [Schema Design, ORDER BY, and Query Optimization](clickhouse-schema-query-optimization)

1. Insert Performance & Real-Time Streaming (this article)
