The Transformer Explained: Attention, BERT, and the NLP Inflection Point

Something happened in NLP this year that doesn't happen often: the ground moved. For most of the decade, the state of the art in language tasks was some flavor of recurrent network — LSTMs, GRUs, sequence-to-sequence with attention bolted on. Progress was real but incremental. Then, over about eighteen months, a new architecture and a new training recipe combined to reset the leaderboards across nearly every benchmark at once. As I write this in late 2018, with Google's BERT having just posted results that felt frankly implausible, it's worth stepping back to explain what changed and why it matters — because I don't think this is a fad.

Two ideas are doing the work. The first is the Transformer, an architecture that throws out recurrence entirely in favor of attention. The second is transfer learning for language — pretraining a big model on oceans of unlabeled text, then fine-tuning it on your specific task. Each is significant; together they're an inflection point.

The problem with recurrence

To see why the Transformer caught on, you have to feel the pain it removed. Recurrent networks process a sentence one token at a time, carrying a hidden state forward: to compute the representation at word 50, you must first have computed words 1 through 49. This has two costs. First, it's inherently sequential across the positions of a sequence, so it can't fully exploit the parallelism that makes modern GPUs fast. Second, it struggles with long-range dependencies: information from early in a long sentence has to survive being passed through dozens of intermediate states to influence the end, and in practice it degrades.
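To make the sequential dependency concrete, here is a minimal sketch of a vanilla recurrent forward pass in NumPy. The weight names (`W_xh`, `W_hh`, `b_h`), the toy sizes, and the tanh nonlinearity are illustrative assumptions rather than any particular published model. The only point is that the loop body at step t needs the hidden state produced at step t-1.

```python
import numpy as np

def rnn_forward(x, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence, one token at a time.

    x: (seq_len, d_in) token embeddings. Returns (seq_len, d_hidden) states.
    """
    seq_len = x.shape[0]
    d_hidden = W_hh.shape[0]
    h = np.zeros(d_hidden)                 # initial hidden state
    states = np.empty((seq_len, d_hidden))
    for t in range(seq_len):               # this loop cannot run in parallel over t
        h = np.tanh(x[t] @ W_xh + h @ W_hh + b_h)  # step t reads step t-1's state
        states[t] = h
    return states

rng = np.random.default_rng(0)
seq_len, d_in, d_hidden = 50, 8, 16        # hypothetical toy sizes
x = rng.normal(size=(seq_len, d_in))
W_xh = rng.normal(scale=0.1, size=(d_in, d_hidden))
W_hh = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
b_h = np.zeros(d_hidden)

states = rnn_forward(x, W_xh, W_hh, b_h)
print(states.shape)  # (50, 16)
```

Whatever the first token contributes to the last state has to pass through every intermediate `h` on the way, which is the long-range problem described above.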

Attention had already been added to recurrent seq2seq models as a patch — letting a decoder "look back" at all encoder positions rather than relying on a single fixed summary vector. It helped a lot. The 2017 paper "Attention Is All You Need" asked the radical question: if attention is what's helping, what if we remove the recurrence and keep only attention?

Self-attention: the core idea

Self-attention lets every word in a sentence look directly at every other word, in one step, and decide how much each one matters to its own representation. No passing state down a chain — every position attends to every position simultaneously.

Mechanically, each token's embedding is projected into three vectors: a query, a key, and a value. To compute a token's new representation, you take its query and score it against the key of every token (a dot product, scaled down by the square root of the key dimension so the softmax doesn't saturate), normalize those scores with a softmax into weights, and take the weighted sum of all the value vectors. A token "pays attention" to the tokens whose keys best match its query. For the word "it" in "the animal didn't cross the street because it was tired," self-attention can learn to put weight on "animal," resolving the reference directly, regardless of distance.
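The query, key, and value steps can be sketched in a few lines of NumPy. This is a single-head toy with random, untrained weights. The names `W_q`, `W_k`, `W_v` and the sizes are assumptions for illustration. The division by the square root of the key dimension follows the "scaled dot-product" form used in the 2017 paper.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model). Returns (output, weights).
    """
    q = x @ W_q                          # queries: what each token is looking for
    k = x @ W_k                          # keys: what each token offers
    v = x @ W_v                          # values: what gets blended in
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # (seq_len, seq_len) query-key match
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights          # weighted sum of value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4          # hypothetical toy sizes
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(x, W_q, W_k, W_v)
print(out.shape, weights.shape)  # (5, 4) (5, 5)
```

Row i of `weights` says how much token i draws from every other token. In a trained model, the row for "it" would ideally put most of its mass on "animal"; with random weights the pattern is meaningless.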

The mental model: self-attention is content-based lookup. Each word broadcasts a query ("what am I looking for?"), every word advertises a key ("what do I offer?"), and the match decides whose value you blend in. Because it's all matrix multiplication over the whole sequence at once, it's massively parallel — exactly what recurrence was not.

Two refinements make it work in practice. Multi-head attention runs several of these attention operations in parallel with different learned projections, so the model can attend to different kinds of relationships at once — syntax in one head, coreference in another. And because attention has no inherent notion of order (it sees a set, not a sequence), the Transformer adds positional encodings to the input embeddings so the model knows where each token sits.
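Both refinements fit in a short NumPy sketch. It makes some assumptions: the weights are random rather than learned, and the positional encoding is the fixed sinusoidal variant from the original Transformer paper (other models, BERT included, learn their position embeddings instead). The demo at the end checks the "set, not a sequence" claim by reversing the token order.

```python
import numpy as np

def softmax(scores):
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal positional encodings (assumes d_model is even)."""
    pos = np.arange(seq_len)[:, None]               # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]           # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    """Multi-head self-attention. x: (seq_len, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):  # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ W_q), split(x @ W_k), split(x @ W_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # one score matrix per head
    heads = softmax(scores) @ v                          # (n_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                  # mix the heads back together

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 16, 4                     # hypothetical toy sizes
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(4))
perm = np.arange(seq_len)[::-1]                          # reverse the token order

# Without positions, reordering the input just reorders the output.
plain = multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads)
shuffled = multi_head_attention(x[perm], W_q, W_k, W_v, W_o, n_heads)
print(np.allclose(plain[perm], shuffled))  # True

# With positions added, the same reordering changes the result.
pe = sinusoidal_positions(seq_len, d_model)
with_pos = multi_head_attention(x + pe, W_q, W_k, W_v, W_o, n_heads)
shuffled_pos = multi_head_attention(x[perm] + pe, W_q, W_k, W_v, W_o, n_heads)
print(np.allclose(with_pos[perm], shuffled_pos))
```

Each head works in its own `d_head`-sized slice of the projections, so the heads are free to learn different attention patterns at roughly the cost of one full-width head.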

graph TD
    IN["Token embeddings + positional encoding"]
    subgraph BLOCK["Transformer block (stacked N times)"]
        MHA["Multi-head self-attention<br/>(every token attends to every token)"]
        AN1["Add & LayerNorm"]
        FF["Feed-forward network<br/>(per position)"]
        AN2["Add & LayerNorm"]
        MHA --> AN1 --> FF --> AN2
    end
    OUT["Contextual representations"]
    IN --> MHA
    AN2 --> OUT

A Transformer block: multi-head self-attention followed by a position-wise feed-forward network, each wrapped in a residual connection and layer normalization. Stack these blocks and you get the encoder. Because there's no recurrence, an entire sequence flows through in parallel — the architectural choice that makes training on huge corpora practical.
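The block in the diagram can be written out as a rough NumPy sketch. To keep it short it simplifies in several ways: attention is single-head, layer normalization has no learned gain or bias, the feed-forward nonlinearity is ReLU, and all weights are random. The parameter names and sizes are assumptions for illustration.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Normalize each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def attention(x, p):
    """Single-head self-attention (a simplification of the multi-head layer)."""
    q, k, v = x @ p["W_q"], x @ p["W_k"], x @ p["W_v"]
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return (weights @ v) @ p["W_o"]

def feed_forward(x, p):
    """Position-wise FFN: the same two-layer MLP applied to every token."""
    hidden = np.maximum(0.0, x @ p["W_1"] + p["b_1"])   # ReLU
    return hidden @ p["W_2"] + p["b_2"]

def transformer_block(x, p):
    x = layer_norm(x + attention(x, p))       # Add & LayerNorm around attention
    x = layer_norm(x + feed_forward(x, p))    # Add & LayerNorm around the FFN
    return x

def init_params(rng, d_model, d_ff):
    def w(*shape):
        return rng.normal(scale=0.1, size=shape)
    return {
        "W_q": w(d_model, d_model), "W_k": w(d_model, d_model),
        "W_v": w(d_model, d_model), "W_o": w(d_model, d_model),
        "W_1": w(d_model, d_ff), "b_1": np.zeros(d_ff),
        "W_2": w(d_ff, d_model), "b_2": np.zeros(d_model),
    }

rng = np.random.default_rng(0)
seq_len, d_model, d_ff, n_layers = 5, 16, 64, 3   # hypothetical toy sizes
layers = [init_params(rng, d_model, d_ff) for _ in range(n_layers)]

x = rng.normal(size=(seq_len, d_model))
h = x
for p in layers:                                  # stack N blocks
    h = transformer_block(h, p)
print(h.shape)  # (5, 16)
```

The output has the same shape as the input, which is what lets the blocks stack. Nothing in `transformer_block` loops over positions.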

The original Transformer was an encoder-decoder for machine translation: an encoder stack builds rich representations of the source sentence, and a decoder stack generates the target, attending both to its own prior outputs and to the encoder. But the more consequential development of 2018 is what happens when you take just half of it and train it differently.

The bigger shift: pretrain, then fine-tune

The second idea is, to me, the one with the longer tail. Traditionally you trained an NLP model from scratch on your labeled task data — and labeled data is scarce and expensive. The new recipe flips this: first pretrain a large model on a vast amount of unlabeled text using a self-supervised objective (predict a missing or next word — the text is its own label), then fine-tune that pretrained model on your small labeled dataset. The model arrives at your task already knowing the language.
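"The text is its own label" is easy to show. The sketch below turns one sentence into next-word training pairs with no human annotation. The whitespace tokenization and the helper name `next_word_pairs` are illustrative assumptions; real models use subword vocabularies.

```python
def next_word_pairs(text):
    """Build (context, target) pairs for left-to-right language modeling."""
    tokens = text.split()   # naive whitespace tokenization, for illustration only
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = next_word_pairs("the animal didn't cross the street")
for context, target in pairs:
    print(" ".join(context), "->", target)
```

A six-word sentence gives five training examples for free, and a corpus of billions of words gives billions.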

2018 produced a rapid succession of these:

Model	Pretraining idea	Architecture
ELMo	Deep contextual word vectors from a bidirectional language model	BiLSTM (still recurrent)
GPT	Left-to-right language modeling, then fine-tune	Transformer decoder
BERT	Masked language modeling + next-sentence prediction (deeply bidirectional)	Transformer encoder

BERT is the one that just rearranged the field. Its key move is the masked language model objective: randomly hide a fraction of the input tokens (15% in the BERT paper) and train the model to predict them using context from both directions at once. A left-to-right model only sees the left context when predicting a word; BERT sees both sides, producing genuinely bidirectional representations. Pretrained on a huge corpus and then fine-tuned with a small task-specific head, the same BERT model set new records across a broad sweep of language-understanding benchmarks (question answering, inference, classification), often by large margins.
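The masking step can be sketched in a few lines. This is a simplified version, not BERT's exact recipe: it always substitutes a `[MASK]` token, whereas the paper sometimes substitutes a random token or leaves the chosen token unchanged. The helper name `mask_tokens`, the seed handling, and the whitespace tokenization are assumptions for illustration.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels[i] holds the original token where position i was masked,
    and None where the model is not asked to predict anything.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)     # hide the token from the model
            labels.append(tok)      # and keep it as the prediction target
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

tokens = "the animal didn't cross the street because it was tired".split()
masked, labels = mask_tokens(tokens, mask_prob=0.15, seed=1)
print(masked)
print(labels)
```

The model sees `masked` in full, with context on both sides of each gap, and is scored only at the positions where `labels` is not `None`.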

graph LR
    CORPUS["Massive unlabeled text"]
    PRE["PRETRAIN<br/>(self-supervised:<br/>predict masked / next tokens)"]
    BASE["General language model<br/>(knows grammar, facts, structure)"]
    subgraph FT["Fine-tune on small labeled data"]
        T1["Sentiment"]
        T2["Question answering"]
        T3["NER / classification"]
    end
    CORPUS --> PRE --> BASE
    BASE --> T1
    BASE --> T2
    BASE --> T3

The pretrain-then-fine-tune recipe. One expensive pretraining run on unlabeled text yields a reusable foundation; cheap fine-tuning adapts it to many downstream tasks. This decouples "learning the language" from "learning your task" — and it's why a single architecture suddenly tops benchmarks that used to each demand bespoke models.
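The "one base, many cheap heads" shape of the recipe can be shown with a toy. Everything here is a stand-in: `base_encode` is a fixed random projection, not a pretrained language model, and the two tasks are synthetic. The sketch also freezes the base and trains only a logistic-regression head, whereas BERT-style fine-tuning usually updates the base weights too.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_model = 200, 10, 16                      # hypothetical toy sizes

# Stand-in for a pretrained base: a fixed random projection, NOT a real language model.
W_base = rng.normal(scale=0.3, size=(d_in, d_model))

def base_encode(x):
    """Frozen 'pretrained' encoder shared by every downstream task."""
    return np.tanh(x @ W_base)

def cross_entropy(features, labels, w, b):
    probs = 1.0 / (1.0 + np.exp(-(features @ w + b)))
    return -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

def train_head(features, labels, lr=0.1, steps=300):
    """Fit a small logistic-regression head; the base weights are never touched."""
    w, b = np.zeros(features.shape[1]), 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(features @ w + b)))
        grad = probs - labels                       # gradient of the loss w.r.t. the logits
        w -= lr * features.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

x = rng.normal(size=(n, d_in))                      # toy inputs
features = base_encode(x)                           # computed once, reused by both tasks

# Two hypothetical downstream tasks, each with its own small labeled set.
labels_a = (x[:, 0] > 0).astype(float)
labels_b = (x[:, 1] + x[:, 2] > 0).astype(float)

for name, labels in [("task A", labels_a), ("task B", labels_b)]:
    w, b = train_head(features, labels)
    before = cross_entropy(features, labels, np.zeros(d_model), 0.0)
    after = cross_entropy(features, labels, w, b)
    print(name, "loss before:", round(before, 3), "after:", round(after, 3))
```

Each task trains only `d_model + 1` parameters on top of features it shares with the other task. The expensive part, producing `W_base`, would happen once during pretraining.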

Why this is an inflection point, not a trend

Step back and the pattern is bigger than any one model. The Transformer removed the architectural bottleneck that kept language models small and slow to train — recurrence — and made it practical to train very large models on very large corpora in parallel. The pretrain-fine-tune recipe then turned that training investment into a reusable asset: pretrain once, adapt cheaply many times. Put those together and you have a clear, scalable direction of travel — bigger models, more pretraining data, broad transfer — rather than a single clever architecture.

A few honest caveats from where I sit in late 2018. These models are expensive to pretrain — the compute is out of reach for most individual teams, which is exactly why pretrained weights being released matters so much. They're large to serve. And "understanding" is the wrong word for what they do; they're extraordinary at capturing statistical structure in language, which turns out to carry a startling amount of usable signal, but it's pattern-matching at scale, not comprehension.

Still, I'd bet on the direction. The combination of attention-based architectures and transfer learning has, in roughly a year, moved NLP further than the previous several years did. If you build anything that touches language, the practical implication is already clear: stop training from scratch. Start from a pretrained Transformer and fine-tune. The era of bespoke per-task NLP models is ending, and 2018 is the year the ending began.