Elasticsearch Internals: The Inverted Index, Lucene Segments, and Sharding

Elasticsearch is one of those tools everyone runs and few understand. It powers search boxes, log analytics, observability dashboards, and security platforms — and most teams operate it by trial and error, scaling shards by superstition and tuning settings copied from a blog post. That works until it doesn't: an index goes red, a cluster grinds under too many shards, or searches that were instant get slow. Every one of those problems is easier to reason about once you know what's underneath, which is mostly Apache Lucene wrapped in a distributed system. Let me unpack both layers.

Two questions organize everything: why is it fast at search (the inverted index and Lucene's segment model), and how does it scale and stay available (shards, replicas, and the cluster). I'll take them in that order.

The inverted index: why search is fast

Start with the data structure the whole thing is built on. A normal database, asked "which documents contain the word payment?", would in the worst case scan every row. An inverted index flips the problem: instead of mapping documents to their words, it maps each word (term) to the list of documents that contain it. Ask for "payment" and you get the posting list directly — no scan.

Documents:
  doc1: "payment failed"
  doc2: "payment received"
  doc3: "login failed"

Inverted index (term → postings):
  payment  → [doc1, doc2]
  failed   → [doc1, doc3]
  received → [doc2]
  login    → [doc3]

Query "payment AND failed"  →  intersect [doc1,doc2] ∩ [doc1,doc3]  =  [doc1]

A boolean query becomes set operations over posting lists — intersections and unions — which are extremely fast and well-optimized in Lucene. Before any of this, text runs through an analyzer at index time: it's tokenized into terms, lowercased, and often stemmed (so "running" and "runs" collapse to "run") and stripped of stop-words. The same analysis is applied to the query, which is why a search for "Running" matches a document containing "runs." Get the analyzer wrong and search quality collapses — it's the most common root cause of "why doesn't my search find this?"

Lucene segments: immutability and near-real-time

Here's the design choice that explains most of Elasticsearch's operational behavior. An inverted index is expensive to update in place, so Lucene doesn't. Instead, it writes immutable segments: a segment is a small, self-contained inverted index, and once written it is never modified. New documents go into new segments. A shard is just a collection of these segments plus a little coordinating state.

Immutability buys a lot — segments can be cached aggressively, read without locking, and reasoned about simply — but it forces a particular lifecycle that every Elasticsearch operator should know cold:

Operation	What happens	Why it matters
Index	New docs buffer in memory and append to the translog (write-ahead log)	The translog makes a doc durable before it's searchable
Refresh	The in-memory buffer becomes a new searchable segment (default every 1s)	This is the "near" in near-real-time — docs aren't searchable until a refresh
Flush	Segments are fsync'd to disk and the translog is cleared	Durable persistence; recovery replays the translog
Merge	Background process merges many small segments into fewer large ones	Too many segments slows search; merging also purges deleted docs

"Near-real-time" has a precise meaning: a document you index is not searchable until the next refresh (≈1 second by default). That one-second window is the price of the segment model. If you're bulk-loading and don't need instant searchability, raising the refresh interval (or disabling it during the load) dramatically increases indexing throughput, because you're creating far fewer tiny segments for the merge process to clean up later.

Deletes and updates are lazy

Because segments are immutable, you can't actually delete a document from one. Instead Elasticsearch marks it deleted in a per-segment bitmap and filters it out of results; the bytes are only reclaimed when that segment is later merged. An "update" is just a delete-plus-reindex into a new segment. This is why heavy update/delete workloads accumulate deleted docs and lean on merging to stay efficient — the same immutability-plus-background-compaction pattern that shows up in LSM-tree storage engines.

Shards and replicas: how it scales and survives

Now the distributed layer. An Elasticsearch index is split into shards, and each shard is a complete Lucene index (segments and all). Sharding is how an index grows beyond one machine and how search parallelizes — each shard searches independently. Every shard also has zero or more replica copies on other nodes:

Primary shard — handles indexing (writes) first, then forwards to its replicas.
Replica shard — a copy that provides redundancy (if a node dies, a replica is promoted to primary) and extra search throughput (reads can be served by replicas).

graph TD
    subgraph IDX["Index: 3 primary shards, 1 replica each"]
        direction LR
        subgraph N1["Node 1"]
            P0["P0 (primary)"]
            R1["R1 (replica)"]
        end
        subgraph N2["Node 2"]
            P1["P1 (primary)"]
            R2["R2 (replica)"]
        end
        subgraph N3["Node 3"]
            P2["P2 (primary)"]
            R0["R0 (replica)"]
        end
    end

Shards and replicas spread across a 3-node cluster. Each primary's replica lives on a different node, so any single node can fail without data loss — its primaries' replicas are promoted elsewhere. Primaries balance indexing load; replicas add both resilience and read capacity. A document's shard is chosen by hashing its routing key (the doc ID by default), which is why the primary shard count is fixed at index creation.

The decision you can't easily undo: the number of primary shards is fixed when the index is created, because a document's home shard is hash(routing) % number_of_primary_shards — change the divisor and every document would route differently. Replicas you can change anytime; primaries you effectively cannot (you reindex). The most common Elasticsearch mistake is over-sharding: thousands of tiny shards each carry fixed overhead (memory, file handles, cluster-state bookkeeping) and bury the cluster. Aim for a sensible shard size (tens of GB), not a large shard count.

How a search actually runs: query then fetch

A search that spans shards runs in two phases, and knowing them explains a lot of latency behavior. In the query phase, the coordinating node forwards the query to one copy (primary or replica) of every shard; each shard searches its segments locally and returns just the IDs and scores of its top matches. The coordinator merges these into a globally ranked list. In the fetch phase, it then asks only the shards holding the actual top-N documents to return their full content.

Two consequences fall out. The cluster only fetches full documents for the final result set, not for every match — efficient. But deep pagination is expensive: to return results 10,000–10,010, every shard must produce its top 10,010, the coordinator must merge them all, and that cost grows with the page offset. (It's the reason "search after" / scroll exist for deep traversal instead of huge from offsets.)

The cluster: one master, shared state

Tying it together, the nodes form a cluster with an elected master node responsible for cluster state — which indices exist, where every shard is allocated, and the mapping for each index. The master doesn't sit in the data path of searches or indexing; it manages metadata and shard allocation, rebalancing shards when nodes join or leave. Master election and split-brain prevention are exactly the kind of consensus concern any distributed system has to solve, and Elasticsearch's coordination layer (hardened considerably in the 7.x line) exists to keep cluster state consistent under node churn.

What to carry away

Search is fast because of the inverted index (term → documents, turning queries into set operations over posting lists) and Lucene's immutable segments, whose refresh/flush/merge lifecycle gives you near-real-time search and explains the 1-second visibility delay, lazy deletes, and the importance of merging. It scales and survives because an index is split into shards (each a full Lucene index, parallelizing search and fixed in count) with replicas for redundancy and read throughput, coordinated by a master that owns cluster state.

Hold those, and the operational mysteries resolve. Documents not showing up immediately? Refresh interval. Cluster sluggish and heavy on memory? Too many shards. Search quality off? The analyzer. Deep pages slow? Query-then-fetch. Elasticsearch stops being a black box the moment you see the Lucene inside it.