# Hadoop & HDFS Internals: HDFS, YARN, and the MapReduce Model

If you've spent any time in data engineering this decade, you've spent it in Hadoop's shadow. It's the system that made "big data" a job title — the idea that you could take a rack of ordinary servers, treat their disks as one enormous filesystem, and run computation next to the data instead of dragging the data to the computation. A decade on, that idea is everywhere, even in the cloud-native tools busily trying to replace Hadoop. So it's worth understanding what's actually under the hood, because the concepts outlive the implementation.

Hadoop is really three things stacked together: **HDFS** (a distributed filesystem), **YARN** (a cluster resource manager), and **MapReduce** (a computation model that runs on YARN). I'll take them in that order, because that's the order the data flows — store it, schedule work on it, compute. Then I'll be honest about where the cracks are showing in 2018.

## HDFS: one filesystem across many disks

HDFS solves a specific problem: store files too big for any one machine, on commodity hardware that will fail, without losing data. It does this by splitting every file into large **blocks** — 128 MB by default — and scattering those blocks across the cluster's **DataNodes**, with each block **replicated** (3 copies by default) on different machines.
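To make the block arithmetic concrete, here is a minimal sketch that uses the defaults quoted above (128 MB blocks, 3 replicas). The function name `block_layout` is invented for illustration and is not part of any Hadoop API.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB default block size (from the text)
REPLICATION = 3                 # default replication factor (from the text)


def block_layout(file_size_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (block count, size of the last block, raw bytes stored cluster-wide)."""
    if file_size_bytes == 0:
        return 0, 0, 0
    # Integer ceiling division: a partial final block still counts as a block.
    num_blocks = (file_size_bytes + block_size - 1) // block_size
    # The final block only occupies the bytes it actually holds.
    last_block = file_size_bytes - (num_blocks - 1) * block_size
    # Every byte is stored once per replica.
    raw_bytes = file_size_bytes * replication
    return num_blocks, last_block, raw_bytes


MB = 1024 * 1024
print(block_layout(1024 * MB))  # (8, 134217728, 3221225472)
print(block_layout(200 * MB))   # 2 blocks; the second holds only 72 MB
```

A 1 GB file becomes 8 blocks and 3 GB of raw disk usage; the replication factor multiplies storage, not block count.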

The architecture is deliberately asymmetric. A single **NameNode** holds all the metadata — the directory tree, and for every file, which blocks compose it and which DataNodes hold each block. The NameNode keeps this in memory for speed. The DataNodes hold the actual block data and do nothing clever; they store blocks, serve them, and report in. The NameNode never touches file data — it only ever tells clients where to go.

**The NameNode is the original sin and the original bottleneck.** Because all metadata lives in one node's memory, the number of files, directories, and blocks the cluster can hold is capped by NameNode RAM, and historically the NameNode was a single point of failure. This is why HDFS is happiest with a modest number of *large* files and miserable with millions of *small* ones: every file and every block costs NameNode memory regardless of how many bytes it holds. The "small files problem" has ruined more Hadoop clusters than hardware ever did. (HA NameNodes with a standby and a quorum journal address the availability half; the memory ceiling remains.)
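A back-of-the-envelope sketch shows why file count, not byte count, is what hurts. The figure of roughly 150 bytes of heap per namespace object is an assumption here (a commonly quoted rule of thumb, not a measured value), and `namenode_heap_estimate` is a hypothetical helper.

```python
BYTES_PER_OBJECT = 150  # ASSUMPTION: rough rule of thumb, not a measured value


def namenode_heap_estimate(num_files, blocks_per_file, bytes_per_object=BYTES_PER_OBJECT):
    """Estimate NameNode heap: one object per file plus one object per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object


# The same 1 TiB of data stored two ways (sizes are illustrative).
big = namenode_heap_estimate(num_files=8_192, blocks_per_file=1)        # 128 MiB files
small = namenode_heap_estimate(num_files=1_048_576, blocks_per_file=1)  # 1 MiB files

print(f"large files: {big / 2**20:.1f} MiB of heap")
print(f"small files: {small / 2**20:.1f} MiB of heap")
print(f"ratio: {small // big}x")  # 128x
```

Under these assumptions the same terabyte costs 128 times more NameNode memory when it arrives as 1 MiB files. The absolute numbers matter less than the scaling: heap grows with object count, and small files inflate object count.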

### The write path

Watching a write happen is the best way to understand HDFS's design. When a client writes a file, it doesn't send data through the NameNode. It asks the NameNode *where* to put each block, and the NameNode replies with a list of DataNodes. The client then streams the block to the first DataNode, which **pipelines** it to the second, which pipelines it to the third. The replication happens as a chain, not a star.

```mermaid
graph TD
    C["Client"]
    NN["NameNode<br/>(metadata only:<br/>which blocks, which nodes)"]
    D1["DataNode 1<br/>block replica 1"]
    D2["DataNode 2<br/>block replica 2"]
    D3["DataNode 3<br/>block replica 3"]
    C -->|"1. where do I put this block?"| NN
    NN -->|"2. these 3 DataNodes"| C
    C -->|"3. stream block"| D1
    D1 -->|"4. pipeline replica"| D2
    D2 -->|"5. pipeline replica"| D3
    D3 -.->|"6. ack chain back"| C
```

The HDFS write pipeline. The NameNode hands out block placement but never carries data; the client streams to the first DataNode, which forwards down a replication pipeline. Block placement is rack-aware — typically two replicas in one rack and one in another — trading a little write cost for survival of a whole-rack failure.
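The placement rule and the chained forwarding can be sketched in a few lines. This is a toy model, not HDFS's actual placement code: it assumes the writer runs on a DataNode, that a second rack with at least two nodes exists, and it ignores load, disk space, and fallback behaviour. The names `choose_pipeline` and `write_block` and all node and rack names are hypothetical.

```python
import random


def choose_pipeline(topology, writer_node, rng=random):
    """Pick 3 DataNodes: the writer's node, then two different nodes on one remote rack.

    topology maps rack name -> list of node names.
    """
    local_rack = next(rack for rack, nodes in topology.items() if writer_node in nodes)
    remote_racks = [r for r in topology if r != local_rack and len(topology[r]) >= 2]
    if not remote_racks:
        raise ValueError("need a second rack with at least two nodes")
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer_node, second, third]


def write_block(block_id, data, pipeline, storage):
    """Store data on each node in chain order; return acks in the order they travel back."""
    for node in pipeline:  # client -> DN1 -> DN2 -> DN3
        storage.setdefault(node, {})[block_id] = data
    return list(reversed(pipeline))  # acks flow DN3 -> DN2 -> DN1 -> client


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"], "rack3": ["dn5", "dn6"]}
storage = {}
pipeline = choose_pipeline(topology, writer_node="dn1")
acks = write_block("blk_0001", b"payload", pipeline, storage)
print(pipeline, acks)
```

The result always spans exactly two racks, so losing either rack leaves at least one replica alive, while only one hop in the pipeline has to cross the rack boundary.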

Reads are the mirror image: the client asks the NameNode for the block locations of a file, then reads each block directly from the nearest DataNode that holds it. The NameNode is consulted for metadata and then gets out of the way. DataNodes send periodic **heartbeats** and **block reports** to the NameNode; if a DataNode goes silent, the NameNode marks it dead, sees that the blocks it held are now under-replicated, and schedules new copies of those blocks on other nodes. Durability is self-healing.
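The self-healing loop amounts to a bookkeeping check, sketched below. The timeout, timestamps, and node names are illustrative values rather than HDFS defaults, and `find_under_replicated` is a hypothetical function; a real NameNode would go on to pick targets and schedule the copies.

```python
def find_under_replicated(block_locations, last_heartbeat, now, timeout, target=3):
    """Return {block_id: missing replica count}, ignoring nodes that missed heartbeats."""
    live = {node for node, seen in last_heartbeat.items() if now - seen <= timeout}
    missing = {}
    for block_id, nodes in block_locations.items():
        live_replicas = [n for n in nodes if n in live]
        if len(live_replicas) < target:
            missing[block_id] = target - len(live_replicas)
    return missing


block_locations = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
}
# dn4 last reported at t=40; everyone else at t=100 (illustrative timestamps, in seconds).
last_heartbeat = {"dn1": 100, "dn2": 100, "dn3": 100, "dn4": 40}

print(find_under_replicated(block_locations, last_heartbeat, now=105, timeout=30))
# {'blk_2': 1}
```

Only `blk_2` had a replica on the silent node, so only `blk_2` needs a new copy; the rest of the cluster is untouched.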

## YARN: who gets to run, and where

Storing data is half the system. The other half is running computation on it without everyone trampling each other. That's **YARN** (Yet Another Resource Negotiator), introduced in Hadoop 2 to separate resource management from the MapReduce computation model — which is what later let Spark, Tez, and others run on the same cluster.

YARN has a central **ResourceManager** that owns the cluster's capacity, and a **NodeManager** on every machine that owns that machine's slice (CPU and memory, allocated as **containers**). The clever part is per-application delegation: when you submit a job, YARN starts an **ApplicationMaster** for it in a container, and *that* negotiates with the ResourceManager for more containers and orchestrates the job's tasks. The ResourceManager doesn't micromanage your job; it just hands out containers and lets each application's master drive.

| Component | Scope | Responsibility |
| --- | --- | --- |
| **ResourceManager** | Cluster | Tracks total capacity; schedules containers across applications |
| **NodeManager** | One machine | Launches/monitors containers; reports node health |
| **ApplicationMaster** | One application | Requests containers; orchestrates that job's tasks; handles task failure |
| **Container** | A slice of one node | A bounded chunk of CPU/memory where a task actually runs |
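The division of labour in the table can be modelled as a toy. Everything here is an assumption for illustration: the class names are invented, the first-fit loop is a stand-in for YARN's real pluggable schedulers, and the node sizes are made up. The point is only the shape of the interaction: the application asks, the ResourceManager grants what fits.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """One NodeManager's remaining capacity."""
    name: str
    free_mb: int
    free_vcores: int


class ToyResourceManager:
    """Hands out containers first-fit; a stand-in for a real scheduler."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, mem_mb, vcores, count):
        """Grant up to `count` containers; return the node name hosting each one."""
        granted = []
        for node in self.nodes:
            while (len(granted) < count
                   and node.free_mb >= mem_mb
                   and node.free_vcores >= vcores):
                node.free_mb -= mem_mb
                node.free_vcores -= vcores
                granted.append(node.name)
        return granted


rm = ToyResourceManager([Node("n1", 4096, 4), Node("n2", 4096, 4)])

# Step 1: the ResourceManager starts the ApplicationMaster in its own container.
am_container = rm.allocate(mem_mb=1024, vcores=1, count=1)
# Step 2: the ApplicationMaster asks for five task containers; only three fit.
task_containers = rm.allocate(mem_mb=2048, vcores=1, count=5)

print(am_container)     # ['n1']
print(task_containers)  # ['n1', 'n2', 'n2']
```

The ApplicationMaster gets fewer containers than it asked for and has to run its remaining tasks as earlier ones finish. That waiting and retrying is the application's problem, not the ResourceManager's.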

## MapReduce: the model, and the shuffle that defines it

MapReduce is the computation pattern Hadoop was born around. You express a job as two functions. **Map** runs over input splits in parallel, emitting intermediate key-value pairs. **Reduce** runs over those pairs *grouped by key*, producing the output. The canonical word-count: map emits `(word, 1)` for every word; reduce sums the 1s per word.
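Here is word count as a single-process sketch. It uses plain Python rather than the Hadoop API, and the in-memory grouping dictionary stands in for everything the framework does between the two phases; `map_fn`, `reduce_fn`, and `run_word_count` are illustrative names.

```python
from collections import defaultdict


def map_fn(line):
    """Emit (word, 1) for every word in one line of input."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Sum the 1s emitted for a single word."""
    return word, sum(counts)


def run_word_count(lines):
    # Group intermediate values by key; on a cluster the framework does this.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            grouped[key].append(value)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


print(run_word_count(["the quick fox", "the lazy dog", "The end"]))
# {'the': 3, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1, 'end': 1}
```

Neither function knows anything about other lines or other words, which is what lets the framework run thousands of copies in parallel.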

The genius — and the pain — is in the middle, the **shuffle**. Between map and reduce, every intermediate pair has to travel to the reducer responsible for its key, which means map outputs are partitioned by key, sorted, written to local disk, transferred across the network, and merged on the reduce side. The map and reduce functions are the part you write; the shuffle is the part that decides whether your job finishes in minutes or hours.
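The shuffle's steps (partition, sort, transfer, merge) can be mimicked in memory. This sketch leaves out the disk spills and network fetches that make the real thing expensive, and the CRC32-based partitioner is an assumed stand-in for whatever partitioner a job actually uses; all function names are hypothetical.

```python
import heapq
import zlib
from itertools import groupby
from operator import itemgetter


def partition(key, num_reducers):
    """Map a key to a reducer. CRC32 is used because str hash() varies per process."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """map_outputs: one list of (key, value) per map task.

    Returns one list of (key, [values]) per reducer, sorted by key.
    """
    # Map side: each task splits its output by reducer and sorts each partition.
    spills = []
    for pairs in map_outputs:
        buckets = [[] for _ in range(num_reducers)]
        for key, value in pairs:
            buckets[partition(key, num_reducers)].append((key, value))
        spills.append([sorted(bucket, key=itemgetter(0)) for bucket in buckets])

    # Reduce side: each reducer fetches its partition from every map task,
    # merges the pre-sorted runs, and groups values by key.
    reducer_inputs = []
    for r in range(num_reducers):
        runs = [spill[r] for spill in spills]
        merged = heapq.merge(*runs, key=itemgetter(0))
        grouped = [(key, [value for _, value in group])
                   for key, group in groupby(merged, key=itemgetter(0))]
        reducer_inputs.append(grouped)
    return reducer_inputs


map_outputs = [[("b", 1), ("a", 1)], [("a", 1), ("c", 1)]]
for reducer_id, pairs in enumerate(shuffle(map_outputs, num_reducers=2)):
    print(reducer_id, pairs)
```

Every map task talks to every reducer, so the number of transfers grows as maps times reducers. That all-to-all pattern, plus the sort on both sides, is where the time goes.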

```mermaid
graph LR
    subgraph MAP["Map phase (data-local where possible)"]
        M1["Map task<br/>split 1"]
        M2["Map task<br/>split 2"]
        M3["Map task<br/>split 3"]
    end
    SH(["SHUFFLE<br/>partition by key → sort →<br/>write to disk → network transfer → merge"])
    subgraph RED["Reduce phase"]
        R1["Reduce task<br/>keys A–M"]
        R2["Reduce task<br/>keys N–Z"]
    end
    M1 --> SH
    M2 --> SH
    M3 --> SH
    SH --> R1
    SH --> R2
```

The MapReduce shuffle: the expensive heart of every job. Map tasks are scheduled *data-local*: the job asks YARN for containers on nodes that hold a replica of each input split, falling back to the same rack or a remote node when none is free, so computation moves to data, not the reverse. But the shuffle materializes intermediate results to disk and moves them across the network, which is why MapReduce is throughput-oriented and high-latency, and why in-memory engines that avoid round-tripping to disk between stages have started winning.

That last point is the whole story of where the industry is heading. MapReduce writes intermediate data to disk between every map and reduce, and chained jobs write their results back to HDFS only to read them again. For a single batch pass that's fine. For the iterative workloads that dominate analytics and machine learning — where you pass over the same data dozens of times — paying a full disk round trip each iteration is brutal, and it's precisely the gap that in-memory frameworks have moved into.

## What Hadoop 3.0 changed

Hadoop 3.0, which landed at the end of 2017, is worth knowing because it sharpens the trade-offs:

- **Erasure coding in HDFS.** The big one. Instead of storing 3 full replicas (200% storage overhead), HDFS can now use erasure coding, the same parity idea as RAID 5 and 6, to get comparable durability at roughly 50% overhead. The cost is more CPU and network on reconstruction, so it's aimed at colder data where the storage savings dominate.

- **YARN improvements.** Better support for long-running services and finer resource handling, as the cluster increasingly hosts more than batch MapReduce.

- **Multiple standby NameNodes** and other availability hardening.

Erasure coding is the most telling change: it's an admission that storage efficiency matters now in a way it didn't when disks were the cheap part and the priority was just keeping data alive.
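The overhead numbers fall out of simple arithmetic, and the parity idea itself fits in a few lines. Two assumptions to flag: the 50% figure is modelled here as a Reed-Solomon layout with 6 data units and 3 parity units, and the recovery demo uses single XOR parity (the RAID 5 trick), which is far simpler than Reed-Solomon and survives only one loss. All function names are illustrative.

```python
def replication_overhead(replicas):
    """Extra storage as a fraction of the data: 3 replicas -> 2.0 (200%)."""
    return replicas - 1


def erasure_overhead(data_units, parity_units):
    """Extra storage as a fraction of the data: 6 data + 3 parity -> 0.5 (50%)."""
    return parity_units / data_units


def xor_parity(chunks):
    """XOR equal-length byte chunks together. NOT Reed-Solomon; single parity only."""
    parity = bytes(len(chunks[0]))  # all zeros
    for chunk in chunks:
        parity = bytes(a ^ b for a, b in zip(parity, chunk))
    return parity


print(replication_overhead(3))  # 2
print(erasure_overhead(6, 3))   # 0.5

# Lose one chunk, then rebuild it from the survivors plus the parity chunk.
chunks = [b"abcd", b"efgh", b"ijkl"]
parity = xor_parity(chunks)
rebuilt = xor_parity([chunks[0], chunks[2], parity])
print(rebuilt == chunks[1])     # True
```

Note what the rebuild required: reading every surviving chunk and computing over all of them. With replication, recovery is a straight copy from one node, which is the CPU and network trade-off in miniature.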

## The honest read in 2018

Hadoop's core ideas — distributed storage with replication, moving computation to data, a model where you reason about partitioning and shuffle — are permanent contributions. But the monolithic Hadoop stack is under real pressure, and it's worth saying why plainly:

- **Storage and compute are coupled.** To store more, you add DataNodes; to compute more, you add the same nodes. The cloud's model — cheap object storage (S3 and friends) decoupled from elastic compute — breaks that coupling, and once you can store data independently of the cluster, much of HDFS's reason for existing on-prem weakens.

- **MapReduce is the slow option.** In-memory engines avoid the disk round trips between stages and run the same logic far faster, especially for iterative and interactive work. Most new pipelines aren't written as MapReduce anymore.

- **Operating it is heavy.** A production Hadoop cluster is a lot of moving parts to secure, tune, and keep alive — and the NameNode memory ceiling and small-files problem are permanent operational taxes.

None of that erases the model. When you tune a shuffle in a modern engine, reason about partition counts, or think about data locality and replication, you're using Hadoop's vocabulary. The implementation may be on its way out; the mental model it taught a generation of data engineers is not. That's exactly why it's still worth opening the box.
