Avro, Protobuf & the Schema Registry: Serialization and Schema Evolution

Most streaming pipelines start with JSON because it's easy, and most of them eventually break for the same reason. A producer team adds a field, or renames one, or changes a number to a string — a one-line change, shipped without a thought — and somewhere downstream a consumer that's been running fine for months starts throwing, or worse, silently misreads the data. There was never a contract; there was just a shared hope that the shape wouldn't change. The fix is to make the message format an enforced contract, and that's exactly what Avro, Protobuf, and a schema registry provide. It's the least glamorous part of a streaming platform and one of the most important.

This piece is about the serialization and schema layer that sits under a system like Kafka. I'll cover why JSON fails at scale, what Avro and Protobuf give you instead, and how the Schema Registry turns "we hope the schema is compatible" into a guarantee the platform checks for you.

Why JSON breaks down at scale

JSON is wonderful for human-facing APIs and terrible as a high-volume pipeline format, for three concrete reasons:

It carries no schema. Every message is self-describing field names and untyped values; nothing declares what a valid message is. A consumer guesses, and guesses break.
It's verbose. Field names are repeated in full in every single message. At millions of messages a second, shipping "customer_id" a billion times is a lot of wasted bytes, storage, and bandwidth.
It has no evolution safety. Nothing stops a producer from making a change that silently breaks consumers, because there's no notion of compatibility to check against.

What you want instead is a format where messages are compact, typed, and governed by a schema that can change under rules. That's the niche Avro and Protobuf fill.

Avro and Protobuf: compact, typed messages

Both Apache Avro and Protocol Buffers (Protobuf) serialize data to compact binary using a schema, so the field names don't travel in every message — just the values, positionally. They differ in flavor:

	Avro	Protobuf
Schema	JSON-defined; schema needed to read (often travels with the data or via a registry)	IDL (`.proto`) compiled to code in many languages
Evolution anchor	Field names + defaults; reader/writer schema resolution	Field numbers (tags) — names can change, numbers can't
Sweet spot	Data pipelines, big-data ecosystem (deep Kafka/Hadoop roots)	Service-to-service RPC (gRPC), polyglot codegen

The key idea both rely on is the separation of the writer's schema (what produced the bytes) from the reader's schema (what's consuming them). Because the data is typed and positional, a reader using a slightly different but compatible schema can still decode it — fill in a defaulted new field, ignore a field it doesn't know. That's what makes safe evolution possible at all; JSON has no equivalent because it has no schema to resolve against.

// An Avro schema: types are explicit, and a default makes a field safe to add
{
  "type": "record",
  "name": "Order",
  "fields": [
    { "name": "order_id",   "type": "string" },
    { "name": "amount",     "type": "double" },
    { "name": "currency",   "type": "string", "default": "USD" }
  ]
}

The Schema Registry: where the contract lives

Compact typed messages solve size and typing, but there's a practical problem: if the schema doesn't travel in every message (it shouldn't — that defeats the compactness), how does a consumer know which schema decodes these bytes? The answer is a Schema Registry: a service that stores schemas and hands out IDs. The flow is clean:

graph TD
    P["Producer"]
    SR["Schema Registry
stores schemas, assigns IDs,
checks compatibility"]
    K["Kafka topic
(message = schema ID + binary payload)"]
    C["Consumer"]
    P -->|"1. register schema → get ID
(rejected if incompatible)"| SR
    P -->|"2. publish: ID + compact bytes"| K
    K -->|"3. read ID + bytes"| C
    C -->|"4. fetch schema by ID (cached)"| SR

The Schema Registry flow. A producer registers its schema and embeds only a small schema ID in each message alongside the binary payload. A consumer reads the ID and fetches the matching schema (cached after the first lookup) to decode. The registry also gates registration — a schema that violates the topic's compatibility rule is rejected before it ever produces a bad message.

The message on the wire is just a schema ID plus the compact binary — no field names, no full schema. Consumers look up the schema by ID once and cache it. But the registry's most valuable job isn't storage; it's the gatekeeping: when a producer tries to register a new version of a schema, the registry checks it against a configured compatibility rule and refuses the change if it would break consumers. The contract is enforced at registration time, not discovered at 3 a.m. in a consumer's stack trace.

Compatibility: the rules that let teams move independently

The compatibility mode is the heart of the whole thing, because it encodes who can upgrade when. The three you'll actually use:

Backward compatible — a new schema can read data written with the old schema. This is the common default: it lets you upgrade consumers first. Safe changes: adding a field with a default, removing a field.
Forward compatible — an old schema can read data written with the new schema. This lets you upgrade producers first: old consumers keep working against new messages by ignoring fields they don't know.
Full compatible — both directions hold; the safest and most restrictive, allowing only changes that are backward and forward compatible (essentially: add/remove fields that have defaults).

The reason this matters operationally: in a real system you cannot redeploy every producer and consumer atomically. Compatibility rules are what let a dozen teams evolve their schemas on their own timelines without coordinating a flag day — the registry guarantees that whatever order things deploy in, nothing reads garbage. That guarantee is the entire point of the layer.

The single most useful habit: always give new fields a default value. A new field with a default is the canonical safe change — old readers can skip it, new readers fill it in — which means it satisfies backward and forward compatibility at once. Most schema-evolution pain comes from adding a required field with no default (now old data can't be read by the new schema) or reusing/renumbering a field. Add-with-default, and you can evolve almost indefinitely without a break.

Never reuse a Protobuf field number, and never repurpose a field's meaning. In Protobuf the field number is the identity that's serialized — change a field's type or reuse a retired number for something new, and old bytes decode into the wrong field with no error, just corrupt data. (Avro has the analogous trap with renaming fields that lack aliases.) Treat field numbers and names as append-only: deprecate, never reuse. The format will not protect you from semantic changes it can't see — that discipline is on you, and the registry's compatibility check is your safety net, not a substitute for it.

What to carry away

The serialization layer is the contract layer of a streaming platform. JSON fails at scale because it carries no schema, repeats field names in every message, and offers no evolution safety. Avro and Protobuf fix the first two with compact, typed, schema-defined binary — and enable the third by separating the writer's schema from the reader's. The Schema Registry stores schemas behind small IDs (so messages stay tiny) and, crucially, enforces compatibility at registration — backward (upgrade consumers first), forward (producers first), or full — so independent teams can evolve schemas without a coordinated flag day.

Adopt it the moment a pipeline has more than one team or more than one consumer: pick a compatibility mode, add new fields with defaults, and never reuse field identities. It's unglamorous plumbing, but it's the difference between a pipeline that evolves smoothly for years and one that breaks every time someone touches a producer. It pairs directly with Kafka and with log-based change data capture, where stable, evolvable schemas are what keep downstream consumers honest.