# HBase Internals: Regions, the LSM Write Path, and Row-Key Design

HDFS gives you a brilliant place to store enormous files and a terrible place to do anything random. It's built for streaming big sequential reads and appends — ask it to fetch or update one row out of billions, fast, and it shrugs. HBase exists to fill exactly that gap: random, real-time read/write access to billions of rows, layered on top of the same HDFS that can't do it alone. It's the open-source take on Google's Bigtable, and once you see how it reconciles "random access" with "an append-only filesystem," the rest of its behavior — including the ways it bites you — follows.

I'll build it up in the order that makes the design click: the **architecture** (who's in charge of what), **regions** (how a table is split and found), the **LSM write and read paths** (how random access happens on an append-only store), and finally **row-key design**, which is the one decision that decides whether your cluster flies or falls over. For the HDFS foundation underneath all of this, see [Hadoop & HDFS Internals](hadoop-hdfs-internals).

## What HBase is (and isn't)

HBase is a distributed, sorted, sparse, multi-dimensional map — a wide-column store. A table is a set of rows sorted by **row key**; each row has one or more **column families**; each family holds arbitrary columns; each cell is versioned by timestamp. "Sparse" matters: a row only stores the columns it actually has, so a table can have millions of possible columns and each row populate a handful, with no cost for the empties.

What it is *not*: a relational database. There are no joins, no SQL out of the box (Phoenix bolts SQL on top), and exactly one index — the row key. You either look up by row key, or you scan a contiguous range of row keys. That single constraint drives every design decision you'll make on HBase, which is why we end on row-key design.

## The architecture: who runs what

HBase has a clear division of labor across four players, and confusing their roles is the source of a lot of operational mistakes:

| Component | Role |
| --- | --- |
| **RegionServer** | Serves reads and writes for the regions assigned to it. This is where the data work happens. |
| **HMaster** | Coordinates: assigns regions to RegionServers, handles splits/balancing and schema changes. Not in the data path. |
| **ZooKeeper** | The cluster's source of truth for liveness and coordination — tracks which servers are alive and where to find the meta table. |
| **HDFS** | The actual storage. Every HFile and write-ahead log lives as replicated blocks on HDFS. |

```mermaid
graph TD
    C["Client"]
    ZK["ZooKeeper(liveness + meta location)"]
    HM["HMaster(region assignment, splits, schema)"]
    subgraph RS["RegionServers"]
        RS1["RegionServer 1regions A–F"]
        RS2["RegionServer 2regions G–M"]
    end
    HDFS["HDFS(HFiles + WAL as replicated blocks)"]
    C -->|"1. where is my region?"| ZK
    C -->|"3. read/write row"| RS1
    HM -.->|"assign / balance"| RS1
    HM -.-> RS2
    HM --- ZK
    RS1 --> HDFS
    RS2 --> HDFS
          
```

HBase's division of labor. Clients talk to ZooKeeper to locate a region, then go *directly* to the RegionServer that owns it — the HMaster is never in the read/write path. RegionServers persist everything to HDFS underneath. Because exactly one RegionServer owns a given region at a time, every row has a single authoritative server, which is what gives HBase strong per-row consistency.

## Regions: how a table is split and found

A table starts as one **region** — a contiguous range of row keys — and as it grows past a size threshold, it **auto-splits** into two regions at a midpoint key. Regions are distributed across RegionServers, so a big table's key space is spread over the cluster. This is HBase's unit of distribution and load balancing: more regions, more parallelism, spread across more servers.

How does a client find the region for a given row key? Through a special catalog table, `hbase:meta`, which maps row-key ranges to the RegionServers that host them. ZooKeeper knows where `hbase:meta` lives; the client reads meta to locate the target region, then talks straight to that RegionServer — and caches the mapping so it doesn't repeat the lookup every time.

```text
Row key space of table "events", split into regions:

  [ region 1 ]  keys  "" .. "g"        → RegionServer 1
  [ region 2 ]  keys  "g" .. "m"       → RegionServer 1
  [ region 3 ]  keys  "m" .. "t"       → RegionServer 2
  [ region 4 ]  keys  "t" .. ""        → RegionServer 2

hbase:meta maps each key range to its hosting RegionServer.
Lookups and scans are routed by where the row key falls.
```

## The LSM write path: random writes on an append-only store

Here's the central trick. HDFS files are write-once, append-only — you can't update a record in place. So how does HBase support fast, random updates? With a **log-structured merge-tree**: never modify, only append, and merge later. The same family of design as Cassandra's storage engine (covered in [Cassandra Internals](cassandra-internals)), here built directly on HDFS. A write flows like this:

1. Append the edit to the **Write-Ahead Log (WAL)** on HDFS — sequential, durable. If the RegionServer crashes, the WAL is replayed to recover anything not yet persisted.

1. Put the edit into the **MemStore**, an in-memory sorted buffer (per column family). The write is now acknowledged — nothing random touched disk.

1. When a MemStore fills, it **flushes** to a new immutable **HFile** on HDFS, sorted by row key.

```mermaid
graph TD
    W["Write (Put)"]
    WAL["WAL on HDFS(sequential append — durability)"]
    MS["MemStore(in-memory, sorted per family)"]
    HF["HFile on HDFS(immutable, sorted)"]
    CMP["Compactionminor: merge a few HFilesmajor: merge all + drop deletes/old versions"]
    W --> WAL
    W --> MS
    MS -->|"flush when full"| HF
    HF --> CMP
    CMP --> HF
          
```

The LSM write path. Every write is a sequential WAL append plus an in-memory MemStore update — no random disk I/O — which is why HBase ingests writes fast. Immutable HFiles accumulate, so background **compaction** merges them: minor compactions combine a few recent files; a major compaction merges everything in a region and physically drops deleted cells and superseded versions. Deletes, like in any LSM, are tombstones until a major compaction reclaims them.

## The read path: merging memory and many files

Because a row's latest data might be in the MemStore and its older data spread across several HFiles, a read has to reconcile all of them. A get or scan checks the **MemStore** (newest, in memory), the **BlockCache** (an LRU cache of recently read HFile blocks), and the relevant **HFiles**, then merges by row key and timestamp so you see the current view.

Reading many HFiles per lookup sounds expensive, and it can be, so HBase prunes aggressively. Each HFile carries a **bloom filter** (a fast probabilistic "could this file contain this row/column?" check that skips files which definitely don't) and a block index for seeking within the files that might. That's why a healthy region with not-too-many HFiles serves point lookups quickly — and why too many small HFiles (a region that hasn't compacted) makes reads crawl. Compaction isn't housekeeping you can ignore; it's read performance.

## Row-key design: the decision that makes or breaks the cluster

Everything above converges here. Because the row key is the only index and rows are stored sorted by it, the row key determines both how you can query and how load spreads across the cluster. Get it wrong and no amount of hardware saves you.

**The hotspotting trap.** Rows are sorted by key and split into ranges, so monotonically increasing keys — timestamps, sequential IDs, `auto_increment` — send every new write to the *same* region on the *same* RegionServer. That one server melts while the rest of the cluster sits idle. This is the single most common way HBase deployments fail under load: a "natural" sequential key creates a write hotspot. It's the same lesson Cassandra teaches about partition keys, sharpened by HBase's range-sorted layout.

The fixes all aim to **spread writes across regions while keeping useful scan locality**:

- **Salting** — prefix the key with a small hash bucket (e.g. `(hash(id) % 16) + "_" + id`), spreading writes across 16 regions. The cost: a full scan must hit all buckets.

- **Hashing / field promotion** — lead the key with a high-cardinality, well-distributed field instead of a sequential one, so adjacent writes land in different regions.

- **Pre-splitting** — create the table with regions already split across the key space, so you don't serialize through one region while it slowly auto-splits under load.

- **Key reversal** — reversing a sequential ID (or a domain like `moc.elpmaxe`) turns a monotonic prefix into a well-distributed one.

The tension is always the same: a key designed purely to spread writes can scatter the rows you want to scan together, and a key designed for scan locality can hotspot writes. Designing for your actual access pattern — what you look up, what ranges you scan, how writes arrive — is the whole job.

## Where HBase fits — and where it doesn't

HBase is a **CP system** in CAP terms: because exactly one RegionServer owns a region, reads and writes to a row are strongly consistent, and during the brief window when a failed RegionServer's regions are being reassigned, those regions are unavailable rather than serving stale data. That's a deliberate, different choice from [Cassandra](cassandra-internals)'s masterless, tunable, availability-first model — the two are often compared, and the contrast is exactly consistency-vs-availability plus master-based-vs-masterless.

| Reach for HBase when | Reach elsewhere when |
| --- | --- |
| Huge tables with random real-time read/write by key | You need joins, ad-hoc SQL, or secondary indexes (use an RDBMS / add Phoenix) |
| Strong per-row consistency matters | You need always-on writes through node failure (Cassandra's AP model) |
| You already run Hadoop/HDFS and want random access on it | Low-latency single-digit-ms at small scale (the HDFS + JVM + compaction tax isn't worth it) |
| Sparse, wide, versioned data (time series, messages, CDC) | Analytical scans over columns (use a columnar store) |

The honest caveats: HBase carries real operational weight. It depends on HDFS and ZooKeeper both being healthy, JVM garbage-collection pauses on RegionServers can spike latency, and a neglected compaction strategy produces either too many HFiles (slow reads) or compaction storms (I/O spikes). It rewards teams that already run the Hadoop stack and punishes those hoping for a low-effort key-value store.

## What to carry away

HBase turns HDFS — great at sequential, useless at random — into a random-access store with an **LSM engine**: writes append to a WAL and an in-memory MemStore, flush to immutable HFiles, and compaction merges them, so random updates ride on an append-only filesystem. A table is split into **regions** spread across RegionServers and located through `hbase:meta`, with one server authoritative per region — the source of its strong consistency. And because the **row key is the only index and the unit of distribution**, row-key design decides both your query patterns and whether writes spread or hotspot.

Design the key for how data arrives and how you'll read it, keep compaction healthy, and respect that this is a CP store on a heavy stack — do that, and HBase delivers the one thing HDFS never could: fast random access at billions of rows.
