MongoDB Internals: The Document Model, WiredTiger, Replica Sets, and Sharding

MongoDB got popular for a reason people are sometimes embarrassed to admit: it let you store an object the way your application already thought about it, without a schema migration or a join. A user with their addresses and orders nested inside is one document, fetched in one read. That ergonomic win is real — but it also means a lot of teams adopt MongoDB without ever learning what's underneath, and then get surprised when a cluster falls over, a failover loses writes, or queries on a billion-document collection grind to a halt. The fixes are never mysterious once you understand four things: the document model, the storage engine, replication, and sharding.

MongoDB is a document database: it stores records as flexible, nested documents (BSON) rather than rows in fixed tables. On top of that sit the operational mechanisms that make it a distributed system — WiredTiger for storage, replica sets for high availability, and sharding for horizontal scale, where one decision (the shard key) dominates everything. I'll take them in order.

The document model: BSON and the embedding question

A MongoDB document is a JSON-like object stored in a binary format called BSON (binary JSON, which adds types like dates and 64-bit integers and makes traversal fast). Documents live in collections, which don't enforce a schema — two documents in the same collection can have different fields. This is schema flexibility, and it's the model's headline feature and its sharpest edge.

The central modeling decision is embed vs reference. You can nest related data inside a document (embed an order's line items) or store a reference to another document (like a foreign key) and look it up separately. Embedding gives you the one-read win and atomicity within a document; referencing avoids duplication and unbounded growth. The guiding rule: data that's read together should live together — but watch the document size limit (16 MB) and unbounded arrays, because an order document that embeds an ever-growing list of events will eventually hit the wall.

"Schemaless" is a promise the database keeps and your data doesn't. The flexibility that makes prototyping fast becomes a liability at scale: with no enforced schema, a collection quietly accumulates five shapes of the same document, optional fields that are sometimes missing and sometimes null, and types that drift (a price that's a string in old records and a number in new ones). The database won't stop any of it. Treat the schema as something you own in the application and validation layer — flexibility is a tool, not an excuse to skip data modeling. (MongoDB does offer optional schema validation; use it.)

WiredTiger: the storage engine

Since version 3.2, the default storage engine is WiredTiger, and it's a conventional, well-engineered B-tree-based engine (see B-trees vs LSM-trees for the family). Two of its properties matter most in practice. First, document-level concurrency: multiple writers can modify different documents in the same collection simultaneously, which is a massive improvement over the old engine's collection-level locking and the reason write throughput jumped. Second, compression: WiredTiger compresses both data and indexes on disk by default, often shrinking storage substantially.

It also keeps a working set in an in-memory cache and uses a write-ahead journal for durability — so, as with any B-tree engine, RAM for the working set and the cache-hit rate are what govern read latency. If MongoDB feels slow, "does the working set fit in memory?" is the first question, long before you blame the query.

Replica sets: high availability and the oplog

A production MongoDB deployment isn't a single server — it's a replica set: a primary plus secondaries holding copies of the same data. All writes go to the primary, which records every change in a special capped collection called the oplog (operations log). Secondaries continuously tail the primary's oplog and replay those operations to stay in sync. The oplog is the heart of replication — and, not coincidentally, what change-data-capture tools read to stream MongoDB changes downstream.

The payoff is automatic failover. The members heartbeat each other; if the primary disappears, the remaining members hold an election and a secondary is voted in as the new primary, typically within seconds, with no manual intervention. This is why you run an odd number of voting members (commonly three) — you need a majority to elect a leader and to avoid split-brain.

This is where write concern earns its keep, and where data gets silently lost if you ignore it. Write concern sets how many members must acknowledge a write before it's "done." The default majority-style acknowledgement means a write survives the loss of the primary; a weaker w:1 (primary only) is faster but a write acknowledged by a primary that then crashes before a secondary copied it can be rolled back and lost. If your data matters, require majority acknowledgement and understand that durability is a setting, not a guarantee you get for free.

Sharding: horizontal scale, and the key that decides it

When data or throughput outgrows one replica set, MongoDB scales horizontally by sharding: partitioning a collection across multiple shards (each shard is itself a replica set). Three roles make this work: mongos, a router that the application talks to and which directs queries to the right shard(s); config servers, which store the metadata mapping data ranges to shards; and the shards themselves. MongoDB splits the data into chunks by the shard key and a balancer moves chunks between shards to keep them even.

graph TD
    APP["Application"]
    MONGOS["mongos router"]
    CFG["Config servers
(chunk → shard metadata)"] S1["Shard A (replica set)
primary + secondaries"] S2["Shard B (replica set)
primary + secondaries"] S3["Shard C (replica set)
primary + secondaries"] APP --> MONGOS MONGOS --> CFG MONGOS --> S1 MONGOS --> S2 MONGOS --> S3

A sharded MongoDB cluster. The app talks to mongos, which consults the config servers to learn which shard holds which range of the shard key, then routes the query. Each shard is a full replica set, so the cluster combines horizontal scale (sharding) with high availability (replication). Pick the shard key well and a query hits one shard; pick it badly and it hits all of them.

Everything about how well a sharded cluster performs comes down to the shard key — the field(s) MongoDB partitions on. A good shard key has high cardinality, spreads writes evenly, and matches your common query filters so queries are targeted (routed to one shard). A bad one is catastrophic in two classic ways:

  • A monotonically increasing key (a timestamp, an auto-incrementing id) sends every new write to the same "newest" chunk on one shard — a hotspot — so you've added servers but all writes still pile onto one.
  • A key absent from your queries forces scatter-gather: mongos has to broadcast the query to every shard and merge results, so the cluster gets slower as you add shards, not faster.

The shard key is close to irreversible, so it's the decision to agonize over. Changing it after the fact historically meant dumping and reloading the collection — a brutal operation on a large production cluster. Teams that picked a convenient-but-wrong key (often a timestamp) end up with a lopsided cluster they can't easily fix. Model the access patterns first, choose a key that distributes writes and targets your reads, and don't shard at all until a single replica set genuinely can't keep up — premature sharding adds operational weight you may never need.

What to carry away

MongoDB's appeal is the document model — store data the shape your app uses it, embed what's read together — but "schemaless" means the schema discipline moves to you, not away. WiredTiger gives it document-level concurrency and on-disk compression, with the working-set-in-RAM rule governing read speed. Replica sets deliver high availability through the oplog and automatic elections — with durability gated by the write concern you choose. And sharding delivers horizontal scale, but the shard key decides whether that scale is real or an expensive illusion (hotspots and scatter-gather punish a bad choice, and it's painful to change).

Adopt MongoDB for what it's genuinely good at — flexible, nested, read-together data and easy horizontal growth — and respect the two decisions that make or break it: how you model documents, and how you shard. For the storage-engine family underneath WiredTiger, see B-trees vs LSM-trees; for the opposite (masterless, LSM, AP) point in the NoSQL design space, Cassandra.